I understand that in a typical ELF binary, functions get called through the Procedure Linkage Table (PLT). The PLT entry for a function usually contains a jump to a Global Offset Table (GOT) entry. This entry will first reference some code to load the actual function address into the GOT, and contain the actual function address after the first call (lazy binding).
To be precise, before lazy binding the GOT entry points back into the PLT, to the instructions following the jump into the GOT. These instructions will usually jump to the head of the PLT, from where some binding routine gets called which will then update the GOT entry.
Now I'm wondering why there are two indirections (calling into the PLT and then jumping to an address from the GOT), instead of just skipping the PLT and calling the address from the GOT directly. It looks like this could save a jump and the entire PLT. You would of course still need some code that calls the binding routine, but that code could live outside the PLT.
Is there anything I am missing? What is/was the purpose of an extra PLT?
Update:
As suggested in the comments, I created some (pseudo-) code ASCII art to further explain what I'm referring to:
This is the situation, as far as I understand it, in the current PLT scheme before lazy binding: (Some indirections between the PLT and printf are represented by "...".)
Program PLT printf
+---------------+ +------------------+ +-----+
| ... | | push [0x603008] |<---+ +-->| ... |
| call j_printf |--+ | jmp [0x603010] |----+--...--+ +-----+
| ... | | | ... | |
+---------------+ +-->| jmp [printf@GOT] |-+ |
| push 0xf |<+ |
| jmp 0x400da0 |----+
| ... |
+------------------+
… and after lazy binding:
Program PLT printf
+---------------+ +------------------+ +-----+
| ... | | push [0x603008] | +-->| ... |
| call j_printf |--+ | jmp [0x603010] | | +-----+
| ... | | | ... | |
+---------------+ +-->| jmp [printf@GOT] |--+
| push 0xf |
| jmp 0x400da0 |
| ... |
+------------------+
In my imaginary alternative scheme without a PLT, the situation before lazy binding would look like this: (I kept the code in the "Lazy Binding Table" similar to the one from the PLT. It could also look different; I don't care.)
Program Lazy Binding Table printf
+-------------------+ +------------------+ +-----+
| ... | | push [0x603008] |<-+ +-->| ... |
| call [printf@GOT] |--+ | jmp [0x603010] |--+--...--+ +-----+
| ... | | | ... | |
+-------------------+ +-->| push 0xf | |
| jmp 0x400da0 |--+
| ... |
+------------------+
Now after the lazy binding, one wouldn't use the table anymore:
Program Lazy Binding Table printf
+-------------------+ +------------------+ +-----+
| ... | | push [0x603008] | +-->| ... |
| call [printf@GOT] |--+ | jmp [0x603010] | | +-----+
| ... | | | ... | |
+-------------------+ | | push 0xf | |
| | jmp 0x400da0 | |
| | ... | |
| +------------------+ |
+------------------------+
The problem is that replacing call printf@PLT with call [printf@GOTPLT] requires the compiler to know that the function printf exists in a shared library and not in a static library (or in just a plain object file). The linker can change call printf into call printf@PLT, jmp printf into jmp printf@PLT, or even mov eax, printf into mov eax, printf@PLT, because all it's doing is changing a relocation based on the symbol printf into a relocation based on the symbol printf@PLT. The linker can't change call printf into call [printf@GOTPLT] because it doesn't know from the relocation whether it's a CALL or JMP instruction or something else entirely. Without knowing whether it's a CALL instruction, it doesn't know whether it should change the opcode from a direct CALL to an indirect CALL.
However, even if there were a special relocation type indicating that the instruction was a CALL, you'd still have the problem that a direct call instruction is 5 bytes long while an indirect call instruction is 6 bytes long. The compiler would have to emit code like nop; call printf@CALL to give the linker room to insert the additional byte, and it would have to do this for all calls to any global function. It would probably end up a net performance loss because of all the extra, unnecessary NOP instructions.
Another problem is that on 32-bit x86 targets the PLT entries are relocated at runtime. The indirect jmp [xxx@GOTPLT] instructions in the PLT don't use relative addressing like the direct CALL and JMP instructions do, and since the address of xxx@GOTPLT depends on where the image was loaded in memory, the instruction needs to be fixed up to use the correct address. Having all these indirect JMP instructions grouped together in one .plt section means that a much smaller number of virtual memory pages need to be modified. Each 4K page that's modified can no longer be shared with other processes; if the instructions that need to be modified were scattered all over the image, a much larger part of the image would have to be unshared.
Note that this latter issue is only a problem with shared libraries and position-independent executables on 32-bit x86 targets. Traditional executables aren't relocated, so there's no need to fix up the @GOTPLT references, while on 64-bit x86 targets RIP-relative addressing is used to access the @GOTPLT entries.
Because of that last point, newer versions of GCC (6.1 or later) support the -fno-plt flag. On 64-bit x86 targets this option causes the compiler to generate call printf@GOTPCREL[rip] instructions instead of call printf instructions. However, it appears to do this for any call to a function that isn't defined in the same compilation unit, that is, any function it can't prove isn't defined in a shared library. That means indirect calls would also be used for functions defined in other object files or static libraries. On 32-bit x86 targets the -fno-plt option is ignored unless compiling position-independent code (-fpic or -fpie), where it results in call printf@GOT[ebx] instructions being emitted. In addition to generating unnecessary indirect calls, this also has the disadvantage of requiring a register to be allocated for the GOT pointer, though most functions would need it allocated anyway.
Finally, Windows is able to do what you suggest by declaring symbols in header files with the "dllimport" attribute, indicating that they exist in DLLs. This way the compiler knows whether to generate a direct or an indirect call instruction when calling the function. The disadvantage is that the symbol has to exist in a DLL, so if this attribute is used you can't decide after compilation to link with a static library instead.
Also read Drepper's How To Write Shared Libraries paper; it explains this quite well and in detail (for Linux).
Now I'm wondering why there are two indirections
(calling into the PLT and then jumping to an address from the GOT),
First of all, there are two control transfers but just one indirection (the call to the PLT stub is direct).
instead of just sparing the PLT and calling the address from the GOT directly.
In case you do not need lazy binding, you can use -fno-plt, which bypasses the PLT.
But if you wanted to keep lazy binding, you'd need some stub code to check whether the symbol has been resolved and branch accordingly. To keep branch prediction effective, this stub code has to be duplicated for every called symbol, and voilà, you've re-invented the PLT.
I'm trying to implement in C an integer Java virtual machine, taking as an example the IJVM described by Andrew Tanenbaum in "Structured Computer Organization" ->
IJVM .
Until now I was able to execute some instructions such as GOTO, IFEQ, IADD, etc. by adjusting the program counter and/or pushing/popping values or constants (from the pool area) onto the stack; I've created the stack as a global array. My intention is to call some methods and pass parameters to them (on the stack), but I have a blocking point regarding how to implement the local variable frame for each function. I know that the method area first contains two shorts (2-byte numbers), the first one signifying the number of arguments the method expects and the second one being the local variable area size. The local variable area size helps me a lot, but:
-> the local variable should be located also in the array used for stack (operands)?
-> the main function doesn't have a local variable area size, how to prevent overlapping with values pushed into stack?
-> what is the best approach to implement the local variable frame?
the local variable should be located also in the array used for stack
(operands)?
From your link:
Local Variable Frame: The beginning of the frame will have the parameters (a.k.a. arguments) of the method/procedure, with the local variables following.
Operand Stack: Will be right above the local variable stack frame.
Therefore, these two stacks are within a global Java stack, one following the other.
Nevertheless, the local variables should not be located within the part of the array used for the operand stack.
the main function doesn't have a local variable area size, how to
prevent overlapping with values pushed into stack?
The stack grows in only one direction and then shrinks accordingly as bytecodes are processed. It can overflow or underflow, but overlapping is hard, i.e. only produced by an incorrect implementation of bytecodes (IJVM assembly), exception handling, or method returns. When a method is called, a new stack frame is put onto the stack, i.e. return info (I don't know how return info is handled in IJVM, but there must be a way to remove a stack frame from the stack; usually it is a pointer to the beginning of the previous stack frame) and space for the arguments and local variables. From then on the stack grows/shrinks above the local variables according to the executed bytecodes. The same applies to the main method.
There is an implementation at https://github.com/HongyuHe/IJVM. It contains the following stack visualization:
| ...... |
|operation stack above | [...]<- sp
| old_lv |
| old_pc |
| var[2] |
| var[1] |
| var[0] |
| ...... |
| arg[2] |
| arg[1] |
| arg[0] |
| link_ptr | [...]<- lv
| ...... | [2]
| ...... | [1]
| MAIN_FRAM | [0]
what is the best approach to implement the local variable frame?
Keep a pointer into the Java stack to where the local variables are located, of stack-element type. (Be aware that depending on your particular VM spec some primitive types might use more than one stack element.) The bytecodes refer to local variables via an index, so the local variable area can simply be used as an array.
I'm doing a practical example of the Buddy Memory Allocation Method and I stumbled upon a step that I'm confused by. The following is an example of the memory and its allocated sections.
--------------------------------
| | |
| a1 | a2 |
| | |
--------------------------------
What happens if now I have free(a3);? Since a3 is not even in any of the blocks, do we just ignore it?
It depends on how the algorithm is implemented. If the implementation doesn't check whether this is a valid pointer, it can corrupt the allocator's internal state, leading to subtle and hard-to-find bugs.
I've been reading about the stack and memory address locations in a few tutorials and I'm wondering why their references to low and high memory locations differ.
This is confusing.
E.g.
Low Memory Address located at the top while High Memory Address at the bottom
Low Memory Address located at the bottom while High Memory Address at the top
When I try it with a simple C program, it seems like the low memory address is located at the top: bfbf7498 > bfbf749c > bfbf74a0 > bfbf74a4 > bfbf74a8 > bfbf74ac
user@linux:~$ cat > stack.c
#include <stdio.h>
int main()
{
int A = 3;
int B = 5;
int C = 8;
int D = 10;
int E = 11;
int F;
F = B + D;
printf("+-----------+-----+-----+-----+\n");
printf("| Address | Var | Dec | Hex |\n");
printf("|-----------+-----+-----+-----|\n");
printf("| %x | F | %d | %X |\n",&F,F,F);
printf("| %x | E | %d | %X |\n",&E,E,E);
printf("| %x | D | %d | %X |\n",&D,D,D);
printf("| %x | C | 0%d | %X |\n",&C,C,C);
printf("| %x | B | 0%d | %X |\n",&B,B,B);
printf("| %x | A | 0%d | %X |\n",&A,A,A);
printf("+-----------+-----+-----+-----+\n");
}
user@linux:~$
user@linux:~$ gcc -g stack.c -o stack ; ./stack
+-----------+-----+-----+-----+
| Address | Var | Dec | Hex |
|-----------+-----+-----+-----|
| bfbf7498 | F | 15 | F |
| bfbf749c | E | 11 | B |
| bfbf74a0 | D | 10 | A |
| bfbf74a4 | C | 08 | 8 |
| bfbf74a8 | B | 05 | 5 |
| bfbf74ac | A | 03 | 3 |
+-----------+-----+-----+-----+
user@linux:~$
It isn't exactly clear what your question is. What Arjun was answering is why stack memory grows down (decreasing memory addresses) instead of up (increasing memory addresses), and the simple answer to this is that it is arbitrary. It really doesn't matter, but an architecture has to choose one or the other: there are typically CPU instructions that manipulate the stack, and they expect a particular implementation.
The other possible question you may be asking relates to visual references from multiple sources. In your example above, one diagram shows low addresses at the top and another shows low addresses at the bottom. They are both showing the stack growing down, from larger addresses to smaller addresses. Again, this is arbitrary; the authors needed to choose one or the other, and they are communicating their choice to you. If you want to compare them side by side, you may want to flip one so they have similar orientations.
By the way, your example code is showing that the stack does indeed start from high addresses and grow down (the memory address of 'A' is allocated first and has a higher memory address than the others).
Why does the stack address grow towards decreasing memory addresses?
This thread has a pretty good answer to your question. It also has a pretty good visual.
https://unix.stackexchange.com/questions/4929/what-are-high-memory-and-low-memory-on-linux
This is also a pretty good explanation (but related specifically to Unix/Linux).
Essentially, though, it is totally dependent on the platform.
Explanation of why some stack memories grow differently:
There are a number of different methods used, depending on the OS (linux realtime vs. normal) and the language runtime system underneath:
1) dynamic, by page fault
Typically a few real pages are preallocated at higher addresses and the initial sp is set to point there. The stack grows downward, the heap grows upward. If a page fault happens somewhat below the stack bottom, the missing intermediate pages are allocated and mapped, effectively growing the stack from the top towards the bottom automatically. There is typically a maximum up to which such automatic allocation is performed, which may or may not be specifiable in the environment (ulimit) or the exe header, or dynamically adjustable by the program via a system call (setrlimit). This adjustability in particular varies heavily between different OSes. There is also typically a limit on "how far away" from the stack bottom a page fault is considered to be OK, for automatic growth to happen. Notice that not all systems' stacks grow downward: under HP-UX it used to grow upward, so I am not sure what Linux on PA-RISC does (can someone comment on this?).
2) fixed size
Other OSes (and especially embedded and mobile environments) either have fixed sizes by definition, or have them specified in the exe header, or specified when a program/thread is created. Especially in embedded real-time controllers, this is often a configuration parameter, and individual control tasks get fixed stacks (to avoid runaway threads taking the memory of higher-priority control tasks). Of course, in this case too the memory might be allocated only virtually until really needed.
3) pagewise, spaghetti and similar
Such mechanisms tend to be forgotten, but are still in use in some runtime systems (I know of Lisp/Scheme and Smalltalk systems). These allocate and grow the stack dynamically as required; however, not as a single contiguous segment, but as a linked chain of multi-page chunks. This requires different function entry/exit code to be generated by the compiler(s) in order to handle segment boundaries, so such schemes are typically implemented by the language support system and not the OS itself (this used to be different in earlier times - sigh). The reason is that when you have many (say thousands of) threads in an interactive environment, preallocating say 1 MB each would simply fill your virtual address space, and you could not support a system where the stack needs of an individual thread are unknown beforehand (which is typically the case in a dynamic environment, where the user might enter eval code into a separate workspace). So dynamic allocation as in scheme 1 above is not possible, because there would be other threads with their own stacks in the way. Instead, the stack is made up of smaller segments (say 8-64 KB) which are allocated and deallocated from a pool and linked into a chain of stack segments. Such a scheme may also be required for high-performance support of things like continuations, coroutines, etc.
Modern Unixes/Linuxes and (I guess, but am not 100% certain) Windows use scheme 1 for the main thread of your exe, and scheme 2 for additional (p)threads, which need a fixed stack size given by the thread creator initially. Most embedded systems and controllers use fixed (but configurable) preallocation (even physically preallocated in many cases).
Sorry if this answer is a bit dense; I'm not sure how to simplify it and still give a valid explanation. As for why the example you gave in C has the low memory address located at the top, the simplest way of explaining it is that the platform C runs on was built like that.
When loading an executable, segments like the code, data, bss and so on need to be placed in memory. I am just wondering if someone could tell me where, on a standard x86 for example, the libc library is placed. Is that at the top or bottom of memory? My guess is at the bottom, close to the application code, i.e. it would look something like this:
--------- 0x1000
Stack
|
V
^
|
Heap
----------
Data + BSS
----------
App Code
----------
libc
---------- 0x0000
Thanks a lot,
Ross
It depends on the whims of the loader.
In particular, on any modern system that uses ASLR, you can't predict where a particular library will end up.