Is 2 pass on the source file necessary for assembler and linker?_问答_开发者

I heard many times that the assembler and linker need to traverse its input开发者_开发百科 file at least 2 times, is this really necessary? Why cannot it been done in one pass?

The assembler translates a symbolic assembler language into a binary representation.

In the input language (assembler), labels are symbolic too.

In the binary output language they are typically a distance in bytes, relative to the current position or some other fixed point (e.g. jump so many bytes ahead or back).

The first pass just determines the offset from the start of the code or some other fixed point of all assembler instructions to fixate the position of the labels.

This allows to calculate the correct jump distances from branch instructions in the second pass.

One pass assembler would be possible, but you would only be able to jump to labels you already had declared ("bacK") not forward.

One example when this is necessary is when two functions call each other.

int sub_a(int v);
int sub_b(int v);

int sub_a(int v) {
    int u = v;
    if ( 0 < u ) {
        u = sub_b( v - 1 );
    }
    return u - 1;
}

int sub_b(int v) {
    int u = v;
    if ( 0 < u ) {
        u = sub_a( v - 1 );
    }
    return u - 1;
}

It is then necessary to do a two-pass scan. As any ordering of the functions will have a dependency on a function that hasn’t been scanned.

it may even take more than two.

here:
  ...
  jmp outside
  ...
  jmp there
  ...
  jmp here
  ...
there:

In particular for instruction sets that have some form of a near jump and some form of a far jump. The assembler doesnt always want to waste a far jump on every branch/jmp. Take the code above for example when it gets to the jmp here line it knows how many instructions are between the here label and the jump to here instruction. it can make a pretty good estimate if it is going to need to encode that as a near or far jump. Normally the far version of a jump is a case where it takes more bytes to implement causing all the instructions and labels that follow to shift.

When it encounters the jmp there instruction it does not know long or far and has to come back later on a separate pass (through the data). When it encounters the label there it could go back and look to see if up to this point there has been a reference to it, and patch up that reference. that is another pass through the data, pass 2. or you just make one complete pass through the source code, then start to go back and forth through the data more times.

Lets say the jump outside does not resolve a label. Now depending on the instruction set the assembler has to respond. Some instruction sets, lets say the msp430 where a far jump simply means an absolute address in memory, all of memory space, no segments or nothing like that. Well you could simply assume a far jump and leave the address for the linker to fill in later. some instruction sets like ARM you have to allocate some memory, within near reach of the instruction. often hiding things behind unconditional branches (this can be a bad thing and fail). Basically you need to allocate a place where the whole address to the external item can be referenced, encode the instruction to load from that near memory location and let the linker fill in the address later.

Back to here and there. What if on the first pass you assumed that all of the unknown jumps were near and on the first pass computed addresses based on that. And if on that pass here was exactly 128 bytes from the jmp here instruction for an instruction set that has a reach of only 128 bytes. So you assume jmp here is also near, and to make this painful what if when there was found jump there to there was 127 bytes which was your maximum near jump forward. But outside is not found! it has to be far, so you need to burn some more bytes, now the here to jmp here is too far it needs to be more bytes, now the jmp there is too far and it needs to be more bytes. How many passes through the data did it take to figure those three things out? More than two. One pass to start. the second pass marks outside as far, the assumption has jmp there as near on the second pass, when it gets to jmp here it discovers that has to be a far jump causing the there address to change. The third pass it discovers that jmp there needs to be far and that affects everything after that instruction. For this simple code that is it everything is resolved.

think about a bubble sort. you keep looping through the data doing swaps until you have a flag that says, I made no changes on that last pass, indicating everything is resolved, we are done. You have to play the same game with an assembler. For instruction sets like ARM you need to do things like try to find places to tuck away addresses and constants/immediates that dont encode into a single instruction. That is if the assembler wants to do that work for you. You could easily declare an error and say the destination is too far for the instruction chosen. Arm assemblers allow you to be lazy and do things like:

  ldr r0,=0x1234567
  ...
  ldr r1,=lab7
  ...
lab7:

The assembler looks at that = and knows it has to determine, can I encode that constant/immediate in the instruction (changing your ldr to a mov for you) or do I need to find a place wedged in your code to place the word, and then encode the instruction with a near address offset.

Even without dealing with near and far, simply resolving addresses, the outside, there, here example above takes two passes. first pass reads everything, jump here happens to know where here is on the first pass. but you have to make a second pass through the program (not necessarily from the disk, can keep the info in memory) there might be a jump to here that preceeds the here: label. the second pass will find the jump outside and know there is no outside label in the program marking it on the second pass as unresolved or external depending on the rules of the assembler. The second pas resolves the jump there as being a known label, and the second pass doesnt mess with the jump here because it resolved it on the first pass. This is your classic two pass problem/solution.

The linker has the same problem, it has to pass through all the sources, think of each object as a complicated line in source code. it finds all the labels, both ones found in the objects and ones not resolved in the object. If it finds the I need an "outside" label in the second file out of 10 files, it has to pass through all 10 files, then go back through the data either on file or in memory to resolve all the forward referenced labels. It wont know on the first occurrence of jmp outside that there was no outside label, on the second pass through is when it finds jmp outside, looks through the list it keeps of found labels (that could be considered a third pass) finds no outside label and declares an error.