When a function is called, execution is shifted to a point indicated by the function pointer. At the start of execution, the executable code has to be loaded from disk.
How is the correct function pointer called? The executable code is not mapped into virtual memory at the same location every time, right? So how does the runtime make sure that a call to a function always calls the correct function even if the location of the executable code is different for each execution?
Consider the following code:
void func(void); //Func defined in another dynamic library
int main()
{
func();
//How is the pointer to func known if the file containing func is loaded from disk at run time?
}
The way that function pointers are resolved is really quite simple. When the compiler chain spits out an executable binary, all internal addresses are relative to a "base address." In some executable formats, this base address is specified, in others it is implied.
Basically, the compiler says that it assumes execution will start at address A. The runtime decides that it should actually start at B. The runtime then subtracts A and adds B to all non-relative addresses in the binary before executing it.
This process also applies to things like DLLs. Dynamic libraries store a list of addresses relative to the base pointer that point to each exported function. Names are often also associated with the list, so that you can reference a function by name. When the library is loaded, the address translation is applied to everything, including the address table. At that point, a caller just has to look up the address in the table that was translated, and then they'll have the absolute address of a given function.
In older operating systems, long long ago (and, in some cases, even today), well before things like address space layout randomization, memory pages, and multitasking operating systems, programs would just be copied to the specified base address in memory, where they would then be executed.
In modern operating systems, one of a few things can happen, depending on the capabilities or requirements of the platform and application. Most operating systems handle native binaries as I described in the second paragraph; however, some applications (such as running 16-bit x86 on later architectures) can involve more complex strategies. One such strategy involves giving the code a static virtual address space. This has various limitations, such as the need for an emulation/compatibility layer if you want it to interact with external code (like a windowed console or the network stack).
As the need for 16-bit support declines though, that sort of scheme is used less and less. Giving all programs their own unique address space (rather than letting them overlap) promotes the use of shared libraries, services, and other shared goodies.
In general, function calls are resolved statically. When you compile a file, an object file (.o or .obj) is created first. The addresses known at that point are those of local functions (defined in this file); the unknown ones belong to "extern" functions.
Then linking is performed. Linking completes the address mapping for every "extern" function. If any names are missing, a linker error occurs.
How is the correct function pointer called?
A function pointer holds a function's address, and a function name used in an expression converts to that same address. Both are plain values, not lvalues, so &func and func yield the same pointer.
Loading PE (or ELF) files is the process of loading the executable into memory. There is too much to explain here. Basically, just for clarification, consider that every function has its own address in the process address space.
You can print the address of func and see whether it is the same on every execution, like this:
printf("%p\n", (void *)func);
For me it's the same address every time (virtual memory wise).
In a tracing tool I'm developing, I use the libC backtrace function to get the list of return pointers in the stack (array of void*) from some points in the execution. These lists of pointers are then stored in a dictionary and associated with a unique integer.
Problem: from one execution of the program to another, the code of the program may be located somewhere else in memory, so the array of pointers from a particular call to backtrace will change from one execution to another, although the program has executed the same things.
Partial solution: get the address of a reference function (e.g., main) and store the difference between the address of this function and the addresses in the backtrace, instead of the raw addresses.
New problem: if the program uses dynamically loaded libraries, the location in memory of the code of these libraries may differ from one execution to another, so the differences of address with respect to a reference function will change.
Any suggestion to solve this? (I thought of using the backtrace_symbols function to get the names of the functions instead of return addresses, but this function will return names only if they are available, for instance if the program has been compiled with -g).
On Linux, I would suggest also looking at /proc/self/maps (or /proc/pid/maps if you're looking at another process) to get the load map of all the dynamic libs, as well as the main executable. You can then map the void *'s in the backtrace to which object they are part of, and offsets from the start of the object.
Something similar may be available on other OSes (many UNIX variants have a /proc filesystem that may contain similar information)
For example:
If I have a function named void Does_Stuff(int arg) and call it in the main function, is void Does_Stuff loaded into memory ONLY when it is first called? Or is it loaded into memory during program initialization?
And after calling Does_Stuff in main, can I manually unload it from memory?
For reference the operating system I am running is Windows 7 and I am compiling with MinGW.
In simple terms (with the usual depends-on-various-platform-things caveat), the code for your normal, global C function is "loaded into memory" at the time the program is loaded. You cannot request that it be "unloaded".
That said, as Hans mentions in a comment, the OS at a lower level is in charge of what bits of stuff are important enough to be present in physical RAM, and may choose to "page out" memory that isn't being used frequently. This isn't per-function, and has no knowledge of the structure of your code. So in that sense the function's code may happen at various times exist in actual RAM or not. But this is a level below the application's execution, where a C function is always "present and available".
DLLs called by your code could conceivably come and go as you call them. But your main program *.exe should go all-in at start time.
Though the exact details depend on the compiler, linker, platform and implementation, typically all the functions in your program are loaded into memory by the executable loader of the OS and reside there until the program terminates. This memory is also typically static (though certain programs can and do rewrite parts of themselves), so it's read-only.
Now every time you call a function and pass it an argument, that argument is added to memory (in principle a different region of memory from where the functions are, typically the stack), and removed again when the function call returns (this is a simplified version).
On some platforms (for instance, DOS) your whole program resides in memory while it runs. On other platforms, it might be swapped out of memory while not running (for instance, ancient UNIX versions). On most platforms your program is split into pages of usually 4 kilobytes. When you access a page that is not yet loaded, the operating system produces the required page for you transparently (i.e. you don't notice that at all). If the operating system runs out of memory it may swap out single pages. You cannot control this at all from inside your program.
If you want to be able to control what is in memory and what is not, you might want to read about memory mapping and the mmap system call.
Here is the thing, I have several functions,
void foo() {}
void bar() {}
And I want to pass these functions around just like ordinary objects' pointers,
int main()
{
void (*fptr1)() = foo;
void (*fptr2)() = fptr1;
void (*fptr3)() = bar;
if (fptr1 == foo)
printf("foo function\n");
if (fptr2 == foo)
printf("foo function\n");
if (fptr3 == foo)
printf("foo function\n");
}
Can I use these function pointers this way? I wrote a program to test it, and it seems OK. Furthermore, I think that, unlike ordinary objects, which may lie on the stack or the heap, functions reside in the text segment (right?), so when I refer to foo, does it give me the physical address at which function foo lies in the text segment?
FOLLOW UP
If I indeed work with DLL, consider this: first, a function ptr fptr is assigned a function,
ReturnType (*fptr)(ArgType) = beautiful_func;
Two scenarios here,
1) if beautiful_func is not in DLL, then it is safe to use this fptr.
2) if it is in a DLL, then I think it would be unsafe to use fptr later, because by then it may refer to a totally different function from the one fptr was assigned, right?
You can check if two function pointers are equal just by simply == them, as they are just normal pointers. That's obvious.
However, when you say "compare", check what you really have in mind:
are you interested in detecting that you are given a different "thing"
or are you interested in detecting that you are given different function?
Comparing pointers (not only function pointers; it applies to all of them) is a bit risky: you are not checking the contents (logical identity), just the location ("physical" identity). Most of the time the two coincide, but beware: sometimes you stumble upon copies.
It is obvious that if you create an array with numbers 1,2,3,4 and then allocate another array and copy the contents there, then you get two different pointers, right? But the array may be SAME for you, depending on what you need it for.
With function pointers the problem is just the same, and even more so: you don't actually know what the compiler/linker has done with your code. It might have optimized a few things, it might have merged some non-exported functions together if it noticed they were identical, it might have copied or inlined others.
This can happen especially when working with bigger separate "subprojects". Imagine you wrote a sorting function, then included it in subproject A and subproject B, compiled/built everything, then linked and ran. Will you end up with one sort function or two? It is a hard question until you actually check and tailor the linkage options properly.
This is a bit more complex than with arrays. With arrays, you get a different pointer if the array is different. Here, the same function may have many different addresses. It might be especially noticeable when working with templates in C++, but that again depends on how well the linker did its job. Oh, great example: DLLs. Three DLLs based on similar code are almost guaranteed to have three copies of everything they were statically linked to.
And when talking about DLLs... You know that they can load/unload additional code into your memory, right? This means that when you load a DLL, a function appears at some address XYZ. Then you unload it, and it goes away. But what happens when you now load a different DLL? Of course the OS is allowed to reuse the space, and it is allowed to map a newly loaded DLL into the same area as the previous one. Most of the time you will not notice it, as the newly loaded DLL will be mapped into a different region, but it may happen.
This means that while you can compare the pointers, the only answer you get is: are the pointers same or not?
if they are not the same, then you SIMPLY DON'T KNOW; a different function pointer does not mean that the function is different. It may be so, it will be so in 99% of cases, but it does not have to be different
if they are the same:
if you are NOT loading/unloading various dynamic libraries a large number of times, you may assume that nothing changes and be sure you got the same function/object/array as before
if you are working with unloadable dynamic modules, you'd better not assume that at all, unless you are absolutely sure that none of the pointers comes from a DLL that will be unloaded in the future. Note that some libraries use dynamic libraries for "plugin-like" functionality. Be careful with pointers from them, and watch for plugin load/unload notifications. Your function may change when a dynamic library is unloaded.
EDIT TO FOLLOWUP:
As long as you (or some library you use) never unload the DLL, your pointer-to-function-that-targets-a-DLL is safe to use.
Once the DLL is loaded, the only evil thing that can change the meaning of address that this DLL has taken is unloading the dynamic module.
If you are sure that:
(1) either your pointer-to-function does not target a function from dynamic module (points to statically-linked code only)
(2) or it targets a dynamic module, but that dynamic module is never unloaded (ok: until program quits or crashes)
(3) or it targets a dynamic module, and you precisely know which one, and that dynamic module is sometimes unloaded during the runtime, but your code gets some 'prior notification' about that fact
then your pointer to function is safe to store and use and compare, provided that you add some safety measures:
for (1), no safety measures are needed: the functions will not get replaced
for (2), no safety measures are needed: the functions will not get replaced until the program quits
for (3), safety measures are needed: you must listen to those notifications, and once you get notified about the DLL being unloaded, you must immediately forget all pointers that target that DLL. You are still safe to remember any others. You are also safe to re-remember the pointer when the DLL gets loaded again.
If you suspect that your pointer-to-function targets a function from a dynamic module that will be unloaded at some point before the program quits, and:
you don't actually know which DLL is pointed by that pointer
or that DLL will be unloaded at any point without any notice
then your pointer to function is unsafe to use at all. And by at all I mean AT ALL, in any way. Don't store it, as it may evaporate at any moment.
does it give me the physical address at which function foo lies in the text segment?
Unless you are working on a primitive or otherwise special OS, no!
The addresses are not physical; they are virtual addresses!
Basically, operating systems employ a mechanism (virtual memory) that allows programs to be even larger than physical memory, and the OS handles the mapping behind the scenes.
Sorry if I confused you. Your understanding is correct (it is perfectly alright to use function pointers in the way you are using them), but the addresses are not physical addresses (which are the numbers by which your main memory is actually addressed).
Yes, the C standard allows you to compare function pointers with the operators == and !=,
e.g. from C11 6.5.9:
Two pointers compare equal if and only if both are null pointers, both
are pointers to the same object (including a pointer to an object and
a subobject at its beginning) or function,
Exactly where a function resides depends on your platform; it might be in a text segment, or it might be somewhere else. When running on operating systems with virtual memory, the address is normally a virtual address, not a physical memory address.
Since a pointer stores a memory address, which comes down to a number, yes, you can compare them that way. As for the other question, a text segment can be defined as "one of the sections of a program in an object file or in memory, which contains executable instructions", which means that the pointer should contain an address somewhere in the text segment.
Yes, you can use them that way. And your understanding is correct.
If I run a program, just like
#include <stdio.h>
int main(int argc, char *argv[], char *env[]) {
printf("My references are at %p, %p, %p\n", (void *)&argc, (void *)&argv, (void *)&env);
}
We can see that those regions are actually in the stack.
But what else is there? If we ran a loop through all the values in Linux 3.5.3 (for example, until segfault) we can see some weird numbers, and kind of two regions, separated by a bunch of zeros, maybe to try to prevent overwriting the environment variables accidentally.
Anyway, in the first region there must be a lot of numbers, such as all the frames for each function call.
How could we distinguish the end of each frame, where the parameters are, where the canary is (if the compiler added one), the return address, the CPU status and such?
Without some knowledge of the overlay, you only see bits, or numbers. While some of the regions are subject to machine specifics, a large number of the details are pretty standard.
If you didn't move too far outside of a nested routine, you are probably looking at the call stack portion of memory. With some generally considered "unsafe" C, you can write up fun functions that access function variables a few "calls" above, even if those variables were not "passed" to the function as written in the source code.
The call stack is a good place to start, as 3rd party libraries must be callable by programs that aren't even written yet. As such, it is fairly standardized.
Stepping outside your process's memory boundaries will give you the dreaded segmentation violation, because memory protection detects the attempt to access memory the process is not authorized to use. malloc does a little more than "just" return a pointer: when it needs more memory, it asks the operating system to make additional pages accessible to the process, and the memory-protection hardware then checks every access against what the process has been granted.
If you keep following this path, sooner or later, you'll get an interest in either the kernel or the object format. It's much easier to investigate one way of how things are done with Linux, where the source code is available. Having the source code allows you to not reverse-engineer the data structures by looking at their binaries. When starting out, the hard part will be learning how to find the right headers. Later it will be learning how to poke around and possibly change stuff that under non-tinkering conditions you probably shouldn't be changing.
PS. You might consider this memory "the stack" but after a while, you'll see that really it's just a large slab of accessible memory, with one portion of it being considered the stack...
The contents of the stack are basically:
Whatever the OS passes to the program.
Call frames (also called stack frames, activation areas, ...)
What does the OS pass to the program? A typical *nix will pass the environment, arguments to the program, possibly some auxiliary information, and pointers to them to be passed to main().
In Linux, you'll see:
a NULL
the filename for the program.
environment strings
argument strings (including argv[0])
padding full of zeros
the auxv array, used to pass information from the kernel to the program
pointers to environment strings, ended by a NULL pointer
pointers to argument strings, ended by a NULL pointer
argc
Then, below that are stack frames, which contain:
arguments
the return address
possibly the old value of the frame pointer
possibly a canary
local variables
some padding, for alignment purposes
How do you know which is which in each stack frame? The compiler knows, so it just treats each location in the stack frame appropriately. Debuggers can use annotations for each function in the form of debug info, if available. Otherwise, if there is a frame pointer, you can identify things relative to it: local variables are below the frame pointer, arguments are above the frame pointer. Otherwise, you must use heuristics: things that look like code addresses are probably code addresses, but sometimes this results in incorrect and annoying stack traces.
The content of the stack will vary depending on the architecture ABI, the compiler, and probably various compiler settings and options.
A good place to start is the published ABI for your target architecture, then check that your particular compiler conforms to that standard. Ultimately you could analyse the assembler output of the compiler or observe the instruction level operation in your debugger.
Remember also that a compiler need not initialise the stack, and will certainly not "clear it down" when it has finished with it, so when it is allocated to a process or thread it might contain any value. Even at power-on, SDRAM for example will not contain any specific or predictable value. If the physical RAM address has been used by another process since power-on, or even by an earlier called function in the same process, the content will be whatever that code left in it. So just looking at the raw stack does not tell you much.
Commonly a generic stack frame may contain the address that control will jump to when the function returns, the values of all the parameters passed, and the values of all auto local variables in the function. However, the ARM ABI for example passes the first four arguments to a function in registers R0 to R3, and holds the return address in the LR register (so a leaf function need not store it on the stack at all), so it is not as simple in all cases as the "typical" implementation I have suggested.
The details are very dependent on your environment. The operating system generally defines an ABI, but that's in fact only enforced for syscalls.
Each language (and each compiler even if they compile the same language) in fact may do some things differently.
However there is some sort of system-wide convention, at least in the sense of interfacing with dynamically loaded libraries.
Yet, details vary a lot.
A very simple "primer" could be http://kernelnewbies.org/ABI
A very detailed and complete specification you could look at to get an idea of the level of complexity and details that are involved in defining an ABI is "System V Application Binary Interface AMD64 Architecture Processor Supplement" http://www.x86-64.org/documentation/abi.pdf
I'm studying Windows and DLLs, and I have a question about them. :)
I made a simple program that loads my own DLL. This DLL has just two simple functions, plus and minus.
This is the question: if I load some DLL (for example, text.dll), does this DLL always have the same base address, or does it change when I restart the program? And can I keep hold of the DLL's base address?
When I test it, it always has the same base address, but I think that if I rely on this, I need to handle the case where the DLL's base address differs.
The operating system will load your DLL in whatever base address it pleases. You can specify a "preferred" base address, but if that does not happen to be available, (for whatever reason, which may well be completely out of your control,) your DLL will be relocated by the operating system to whatever address the operating system sees fit.
If I load some DLL (for example, text.dll), does this DLL always have the same base address?
No. It is a preferred base address. If something is already loaded at that address, the loader will rebase the DLL and fix up all of the addresses.
Other things, like Address Space Layout Randomization could cause it to be different every time the process starts.
That's a common problem with DLLs that we encountered when trying to decode stacktraces issued by GNAT runtime (Ada).
When presented with a list of addresses (traceback) when our executables crash, we are able to perform addr2line on the given addresses and rebuild the call tree without issues.
On DLLs, this isn't the case (that's why I highly doubt that this issue is ASLR-related; otherwise the executables would show the same random shift). vcsjones's answer explains the "why".
Now, to work around this issue, you can write the runtime address of a given symbol (for example, the main program) to disk. When analysing a crash, just take the difference between the address of the symbol in the map file and the address written to disk. Apply this difference to your addresses, and you'll be able to compute the theoretical addresses, and thus the call stack.