Getting the address of a function - c

I'm playing around with ASM, byte code and executable memory.
I'm using echo -e "<asm instruction here>" | as -o /dev/null -64 -al -msyntax=att to "compile" my code. Anyways, I noticed that movq 2, %rax will result in 0000 488B0425 02000000, but both call printk and call printf will result in 0000 E8000000 00, which I believe is because of the fact that as doesn't know anything about printk or printf.
My question is how can I get the address of those functions so I can use them in my byte code?

The assembler does not know the values of external symbols. It writes a file with multiple sets of data in it. One set of data is the bytes to put into a program’s text section. Another set of data is information about external symbols used and what adjustments to make to the bytes in the text section when those external symbols are resolved. The file produced by the assembler does not contain any information about the values of symbols that come from external libraries.
The linker resolves external symbols. When producing a statically linked executable, it will fully resolve the external symbols. When producing dynamically linked executable, resolution of external symbols is completed during program execution.
After you statically link an executable, you can use a tool such as nm to see the values of symbols in it (if they have not been stripped). Then you could use those values in assembly. However, hardcoding those values into assembly will fail if the program is ever linked in a different way, that puts the routines at different locations.
If you dynamically link an executable, you would have to inspect the executing program to determine where its external symbols were placed. Again, you could in theory use those values in assembly, but this would be even more prone to failure than doing so with static linking, especially on systems that deliberately change the addresses used in executables to thwart malicious code.
Hard-coding symbol values into assembly is not the normal way of writing assembly language or other code and should not be done except for learning or special-purpose debugging or system inspection.

Use the proper tool to inspect your object files. Assuming any Unix, use
nm -AP file.o
to see the addresses of all symbols (functions and variables). Since printf is likely in a library, run nm on that library for the same effect.

Related

In linking, what is the difference between run-time addresses and the addresses in the .text (machine code) section of a relocatable object file?

I am learning about linking right now (self-taught) and I am having some trouble understanding some concepts.
After the preprocessing, compilation, and assembly of a source code file, you got a relocatable object file with an ELF format (WLOG). In this ____.o file, there is a .text section that contains the machine code of the individual source code.
Does this machine code correspond to the run-time addresses of the code that is in the input file? Like if the machine code where to run (assuming no unresolved external references) would the runtime profile of the code match the machine code here?
If this is true, is it safe to say that symbol references in this code are pointing to the runtime address of their corresponding symbols?
I need to know this so that I can better understand the linking process which happens directly after this process.
Does this machine code correspond to the run-time addresses of the code that is in the input file?
No.
It can't, because the code in a single .o file doesn't know what other object files will be linked into the main executable. Imagine foo.o saying "I want to be at address 0x123000", and bar.o saying "I want to be at address 0x123004". They clearly can't be both at the same address.
The "final" runtime addresses are determined by the linker, which collects all the different .o files, resolves references between them, and lays out the final executable in memory. (Even this isn't a complete story, as shared libraries and position-independent executables complicate the answer more.)

How to compile a library for a fixed address in microblaze

I want to build a library which is relocatable (ie. nothing other than local variables. I also want to force the location of the library to be at a fixed location in memory. I think this has to be done in the makefile, but I am confused as to what I have to do to force the library to be loaded at a fixed location. This is using mb-gcc.
The reason I need this is I want to write a loader where I dont want to clobber over the code that is actually doing the copy of the other program. So I want the program that is doing the copying to be located somewhere else at a location that is not being used (ie. ddr).
If I have all the functions that do the compiled into a library, what special makefile arguments do I need to force this to be loaded at location 0x80000000 for example.
Any help would be greatly appreciated. Thanks in advance.
You write a linker script, and tell the compiler/linker to use it by using the -T script.ld option (to gcc and/or ld, depending on how you build your firmware files).
In your library C source files, you can use the __attribute__((section ("name"))) syntax to put your functions and variables into a specific section. The linker script can then decide where to put each section -- often at a fixed address for these kinds of devices. (You'll often see macro declarations like #define FIRMWARE __attribute__((section(".text.firmware"))) or similar, to make the code easier to read and understand.)
If you create a separate firmware file just for your library, then you don't need to add the attributes to your code, just write the linker script to put the .text (executable code), .rodata (read-only constants), and .bss (uninitialized variables) sections at suitable addresses.
A web search for microblaze "linker script" finds some useful examples, and even more guides. Some of them must be suitable for your tools.

Discrepancy between binary and map file

I noticed something peculiar while inspecting a map file produced with the -Map=mapfile option of the GNU linker. It listed a couple of symbols as belonging to the .text section, whereas the binary's symbol table lists them as part of the .rodata section. I suspect that this is some sort of cheeky optimization, as the compiler probably determined that those symbols are only ever read, but what surprises me is that the map file doesn't reflect that. My understanding is the linking is pretty much the final stage of the compilation process and all optimization happens before it. Is that correct? Why were these symbols optimized afterwards?
As you've probably inferred, the toolchain is GCC. The source code is written in C.

Is Dynamic Linker part of Kernel or GCC Library on Linux Systems?

Is Dynamic Linker (aka Program Interpreter, Link Loader) part of Kernel or GCC Library ?
UPDATE (28-08-16):
I have found that the default path for dynamic linker that every binary (i.e linked against a shared library) uses /lib64/ld-linux-x86-64.so.2 is a link to the shared library /lib/x86_64-linux-gnu/ld-2.23.so which is the actual dynamic linker.
And It is part of libc6 (2.23-0ubuntu3) package viz. GNU C Library: Shared libraries in ubuntu for AMD64 architectures.
My actual question was
what would happen to all the applications that are dynamically linked (all,now a days), if this helper program (ld-2.23.so) doesn't exist ?
And answer to that is " no application would run, even the shell program ". I've tried it on virutal machine.
In an ELF executable, this is referred to as the "ELF interpreter". On linux (e.g.) this is /lib64/ld-linux-x86-64.so.2
This is not part of the kernel and [generally] with glibc et. al.
When the kernel executes an ELF executable, it must map the executable into userspace memory. It then looks inside for a special sub-section known as INTERP [which contains a string that is the full path].
The kernel then maps the interpreter into userspace memory and transfers control to it. Then, the interpreter does the necessary linking/loading and starts the program.
Because ELF stands for "extensible linker format", this allows many different sub-sections with the ELF file.
Rather than burdening the kernel with having to know about all the myriad of extensions, the ELF interpreter that is paired with the file knows.
Although usually only one format is used on a given system, there can be several different variants of ELF files on a system, each with its own ELF interpreter.
This would allow [say] a BSD ELF file to be run on a linux system [with other adjustments/support] because the ELF file would point to the BSD ELF interpreter rather than the linux one.
UPDATE:
every process(vlc player, chrome) had the shared library ld.so as part of their address space.
Yes. I assume you're looking at /proc/<pid>/maps. These are mappings (e.g. like using mmap) to the files. That is somewhat different than "loading", which can imply [symbol] linking.
So primarily loader after loading the executable(code & data) onto memory , It loads& maps dynamic linker (.so) to its address space
The best way to understand this is to rephrase what you just said:
So primarily the kernel after mapping the executable(code & data) onto memory, the kernel maps dynamic linker (.so) to the program address space
That is essentially correct. The kernel also maps other things, such as the bss segment and the stack. It then "pushes" argc, argv, and envp [the space for environment variables] onto the stack.
Then, having determined the start address of ld.so [by reading a special section of the file], it sets that as the resume address and starts the thread.
Up until now, it has been the kernel doing things. The kernel does little to no symbol linking.
Now, ld.so takes over ...
which further Loads shared Libraries , map & resolve references to libraries. It then calls entry function (_start)
Because the original executable (e.g. vlc) has been mapped into memory, ld.so can examine it for the list of shared libraries that it needs. It maps these into memory, but does not necessarily link the symbols right away.
Mapping is easy and quick--just an mmap call.
The start address of the executable [not to be confused with the start address of ld.so], is taken from a special section of the ELF executable. Although, the symbol associated with this start address has been traditionally called _start, it could actually be named anything (e.g. __my_start) as it is what is in the section data that determines the start address and not address of the symbol _start
Linking symbol references to symbol definitions is a time consuming process. So, this is deferred until the symbol is actually used. That is, if a program has references to printf, the linker doesn't actually try to link in printf until the first time the program actually calls printf
This is sometimes called "link-on-demand" or "on-demand-linking". See my answer here: Which segments are affected by a copy-on-write? for a more detailed explanation of that and what actually happens when an executable is mapped into userspace.
If you're interested, you could do ldd /usr/bin/vlc to get a list of the shared libraries it uses. If you looked at the output of readelf -a /usr/bin/vlc, you'll see these same shared libraries. Also, you'd get the full path of the ELF interpreter and could do readelf -a <full_path_to_interpreter> and note some of the differences. You could repeat the process for any .so files that vlc wanted.
Combining all that with /proc/<pid>maps et. al. might help with your understanding.

how to get minimum executable opcodes for c program?

to get opcodes author here does following:
[bodo#bakawali testbed8]$ as testshell2.s -o testshell2.o
[bodo#bakawali testbed8]$ ld testshell2.o -o testshell2
[bodo#bakawali testbed8]$ objdump -d testshell2
and then he gets three sections (or mentions only these 3):
<_start>
< starter>
< ender>
I have tried to get hex opcodes the same way but cannot ld correctly. Of course I can produce .o and prog file for example with:
gcc main.o -o prog -g
however when
objdump --prefix-addresses --show-raw-insn -Srl prog
to see complete code with annotations and symbols, I have many additional sections there, for example:
.init
.plt
.text (yes, I know, main is here) [many parts here: _start(), call_gmon_start(), __do_global_dtors_aux(), frame_dummy(), main(), __libc_csu_init(), __libc_csu_fini(), __do_global_ctors_aux()]
.fini
I assume these are additions introduced by gcc linking to runtime libraries. I think i don't need these all sections to call opcode from c code (author uses only those 3 sections) however my problem is I don't know which exactly I might discard and which are necessary. I want to use it like this:
#include <unistd.h>
char code[] = "\x31\xed\x49\x89\x...x00\x00";
int main(int argc, char **argv)
{
/*creating a function pointer*/
int (*func)();
func = (int (*)()) code;
(int)(*func)();
return 0;
}
so I have created this :
#include <unistd.h>
/*
*
*/
int main() {
char *shell[2];
shell[0] = "/bin/sh";
shell[1] = NULL;
execve(shell[0], shell, NULL);
return 0;
}
and I did disassembly as I described. I tried to use opcode from .text main(), this gave me segmentation fault, then .text main() + additionally .text _start(), with same result.
So, what to choose from above sections, or how to generate only as minimized "prog" as with three sections?
char code[] = "\x31\xed\x49\x89\x...x00\x00";
This will not work.
Reason: The code definitely contains adresses. Mainly the address of the function execve() and the address of the string constant "/bin/sh".
The executable using the "code[]" approach will not contain a string constant "/bin/sh" at all and the address of the function execve() will be different (if the function will be linked into the executable at all).
Therefore the "call" instruction to the "execve()" function will jump to anywhere in the executable using the "code[]" approach.
Some theory about executables - just for your information:
There are two possibilities for executables:
Statically linked: These executables contain all necessary code. Therefore they do not access dynamic libraries like "libc.so"
Dynamically linked: These executables do not contain code that is frequently used. Such code is stored in files common to all executables: The dynamic libraries (e.g. "libc.so")
When the same C code is used then statically linked executables are much bigger than dynamically linked executables because all C functions (e.g. "printf", "execve", ...) must be bundled into the executable.
When not using any of these library functions the statically linked executables are simpler and therefore easier to understand.
Statically linked executable behaviour
A statically linked executable is loaded into the memory by the operating system (when it is started using execve()). The executable contains an entry point address. This address is stored in the file header of the executable. You can see it using "objdump -h ...".
The operating system performs a jump to that address so the program execution starts at this address. The address is typically the function "_start" however this can be changed using command line options when linking using "ld".
The code at "_start" will prepare the executable (e.g. initialize variables, calculate the values for "argc" and "argv", ...) and call the "main()" function. When "main()" returns the "_start" function will pass the value returned by "main()" to the "_exit()" function.
Dynamically linked executable behaviour
Such executables contain two additional sections. The first section contains the file name of the dynamic linker (maybe. "/lib/ld-linux.so.1"). The operating system will then load the executable and the dynamic linker and jump to the entry point of the dynamic linker (and not to that of the executable).
The dynamic linker will read the second additional section: It contains information about dynamic libraries (e.g. "libc.so") required by the executable. It will load all these libraries and initialize a lot of variables. Then it calls the initialization function ("_init()") of all libraries and of the executable.
Note that both the operating system and the dynamic linker ignore the function and section names! The address of the entry point is taken from the file header and the addresses of the "_init()" functions is taken from the additional section - the functions may be named differently!
When all this is done the dynamic linker will jump to the entry point ("_start") of the executable.
About the "GOT", "PLT", ... sections:
These sections contain information about the addresses where the dynamic libraries have been loaded by the linker. The "PLT" section contains wrapper code that will contain jumps to the dynamic libraries. This means: The section "PLT" will contain a function "printf()" that will actually do nothing but jump to the "printf()" function in "libc.so". This is done because directly calling a function in a dynamic library from C code would make linking much more difficult so C code will not call functions in a dynamic library directly. Another advantage of this implementation is that "lazy linking" is possible.
Some words about Windows
Windows only knows dynamically linked executables. Windows XP even refused to load an executable not requiring DLLs. The "dynamic linker" is integrated into the operating system and not a separate file. There is also an equivalent of the "PLT" section. However many compilers support "directly" calling DLL code from C code without calling the code in the PLT section first (theoretically this would also be possible under Linux). Lazy linking is not supported.
You should read this article: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html.
It explains all you need to create really tiny program in great detail.

Resources