Problem
I wish to inject an object file into an existing binary. As a concrete example, consider a source Hello.c:
#include <stdlib.h>
int main(void)
{
return EXIT_SUCCESS;
}
It can be compiled to an executable named Hello through gcc -std=gnu99 -Wall Hello.c -o Hello. Furthermore, now consider Embed.c:
void func1(void)
{
}
An object file Embed.o can be created from this through gcc -c Embed.c. My question is how to generically insert Embed.o into Hello in such a way that the necessary relocations are performed, and the appropriate ELF internal tables (e.g. symbol table, PLT, etc.) are patched properly?
Assumptions
It can be assumed that the object file to be embedded has its dependencies statically linked already. Any dynamic dependencies, such as the C runtime can be assumed to be present also in the target executable.
Current Attempts/Ideas
Use libbfd to copy sections from the object file into the binary. The progress I have made with this is that I can create a new object with the sections from the original binary and the sections from the object file. The problem is that since the object file is relocatable, its sections can not be copied properly to the output without performing the relocations first.
Convert the binary back to an object file and relink with ld. So far I tried using objcopy to perform the conversion objcopy --input elf64-x86-64 --output elf64-x86-64 Hello Hello.o. Evidently this does not work as I intend since ld -o Hello2 Embed.o Hello.o will then result in ld: error: Hello.o: unsupported ELF file type 2. I guess this should be expected though since Hello is not an object file.
Find an existing tool which performs this sort of insertion?
Rationale (Optional Read)
I am making a static executable editor, where the vision is to allow the instrumentation of arbitrary user-defined routines into an existing binary. This will work in two steps:
The injection of an object file (containing the user-defined routines) into the binary. This is a mandatory step and can not be worked around by alternatives such as injection of a shared object instead.
Performing static analysis on the new binary and using this to statically detour routines from the original code to the newly added code.
I have, for the most part, already completed the work necessary for step 2, but I am having trouble with the injection of the object file. The problem is definitely solvable given that other tools use the same method of object injection (e.g. EEL).
If it were me, I'd compile Embed.c into a shared object, libembed.so, like so:
gcc -Wall -shared -fPIC -o libembed.so Embed.c
That should create a shared object of position-independent code from Embed.c. With that, you can force your target binary to load this shared object by setting the environment variable LD_PRELOAD when running it (see more information here):
LD_PRELOAD=/path/to/libembed.so Hello
The "trick" here will be to figure out how to do your instrumentation, especially considering it's a static executable. There, I can't help you, but this is one way to have code present in a process' memory space. You'll probably want to do some sort of initialization in a constructor, which you can do with an attribute (if you're using gcc, at least):
void __attribute__ ((constructor)) my_init()
{
// put code here!
}
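To make the constructor idea a bit more concrete, here is a minimal sketch, assuming GCC or Clang. The names embed_initialized, embed_init and embed_is_initialized are made up for illustration; the constructor only records that it ran, where a real one would install the instrumentation.

```c
#include <stdio.h>

/* Hypothetical flag: records whether the constructor has run. */
static int embed_initialized = 0;

/* GCC/Clang extension: a function marked as a constructor runs when the
 * object containing it is loaded, before main() of the host is entered. */
__attribute__((constructor))
static void embed_init(void)
{
    embed_initialized = 1;
    fprintf(stderr, "libembed: constructor ran\n");
}

/* Hypothetical accessor so the host program can check the flag. */
int embed_is_initialized(void)
{
    return embed_initialized;
}
```

Calling embed_is_initialized() from main() should already return 1, since the constructor runs before main() is entered.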
Assuming the source code for the first executable is available and it is compiled with a linker script that allocates space for later object files, there is a relatively simple solution. Since I am currently working on an ARM project, the examples below are compiled with the GNU ARM cross-compiler.
Primary source code file, hello.c
#include <stdio.h>
int main ()
{
return 0;
}
is built with a simple linker script allocating space for an object to be embedded later:
SECTIONS
{
    .text :
    {
        KEEP (*(embed)) ;
        *(.text .text*) ;
    }
}
Build it like this and list the symbols:
arm-none-eabi-gcc -nostartfiles -Ttest.ld -o hello hello.c
readelf -s hello
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 SECTION LOCAL DEFAULT 1
2: 00000000 0 SECTION LOCAL DEFAULT 2
3: 00000000 0 SECTION LOCAL DEFAULT 3
4: 00000000 0 FILE LOCAL DEFAULT ABS hello.c
5: 00000000 0 NOTYPE LOCAL DEFAULT 1 $a
6: 00000000 0 FILE LOCAL DEFAULT ABS
7: 00000000 28 FUNC GLOBAL DEFAULT 1 main
Now let's compile the object to be embedded, whose source is in embed.c:
void func1()
{
/* Something useful here */
}
Recompile with the same linker script, this time inserting the new symbols:
arm-none-eabi-gcc -c embed.c
arm-none-eabi-gcc -nostartfiles -Ttest.ld -o new_hello hello embed.o
See the results:
readelf -s new_hello
Num: Value Size Type Bind Vis Ndx Name
0: 00000000 0 NOTYPE LOCAL DEFAULT UND
1: 00000000 0 SECTION LOCAL DEFAULT 1
2: 00000000 0 SECTION LOCAL DEFAULT 2
3: 00000000 0 SECTION LOCAL DEFAULT 3
4: 00000000 0 FILE LOCAL DEFAULT ABS hello.c
5: 00000000 0 NOTYPE LOCAL DEFAULT 1 $a
6: 00000000 0 FILE LOCAL DEFAULT ABS
7: 00000000 0 FILE LOCAL DEFAULT ABS embed.c
8: 0000001c 0 NOTYPE LOCAL DEFAULT 1 $a
9: 00000000 0 FILE LOCAL DEFAULT ABS
10: 0000001c 20 FUNC GLOBAL DEFAULT 1 func1
11: 00000000 28 FUNC GLOBAL DEFAULT 1 main
The problem is that .o files are not fully linked yet, and most references are still symbolic. Binaries (shared libraries and executables) are one step closer to fully linked code.
Linking to a shared library doesn't mean you must load it via the dynamic loader. The suggestion is rather that writing your own loader for an executable or shared library might be simpler than writing one for a .o file.
Another possibility is to drive the linking step yourself: call the linker and have it link the code for a fixed load address. You might also look at how bootloaders are built, since that involves a basic linking step that does exactly this (fixing a piece of code to a known load address).
If you don't link to a fixed address and want to relocate at run time, you will have to write a basic linker that takes the object file and relocates it to the destination address by applying the appropriate fixups.
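As a rough illustration of what "applying a fixup" means, here is a sketch in C for the two x86-64 relocation types that show up most often in simple objects. The struct fixup record and the function name apply_fixup are made up for this example (a real tool would read Elf64_Rela entries and look up symbol addresses itself), and the code assumes a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Relocation type numbers from the x86-64 psABI. */
enum { RELOC_X86_64_64 = 1, RELOC_X86_64_PC32 = 2 };

/* Hypothetical, simplified relocation record (modelled on Elf64_Rela). */
struct fixup {
    uint64_t offset;   /* where in the section to patch */
    uint32_t type;     /* RELOC_X86_64_* */
    int64_t  addend;   /* constant A from the relocation entry */
};

/*
 * Patch one relocation in `section`, which will be loaded at `load_addr`.
 * `sym_addr` is the final address S of the referenced symbol.
 * Returns 0 on success, -1 on an unsupported type, an out-of-range
 * offset, or a 32-bit overflow. Assumes a little-endian host.
 */
int apply_fixup(uint8_t *section, size_t size, uint64_t load_addr,
                const struct fixup *f, uint64_t sym_addr)
{
    uint64_t place = load_addr + f->offset;              /* P */
    uint64_t s_plus_a = sym_addr + (uint64_t)f->addend;  /* S + A */

    switch (f->type) {
    case RELOC_X86_64_64: {          /* store S + A as 64 bits */
        if (f->offset > size || size - f->offset < 8)
            return -1;
        memcpy(section + f->offset, &s_plus_a, 8);
        return 0;
    }
    case RELOC_X86_64_PC32: {        /* store S + A - P as 32 bits */
        if (f->offset > size || size - f->offset < 4)
            return -1;
        int64_t rel = (int64_t)(s_plus_a - place);
        if (rel < INT32_MIN || rel > INT32_MAX)
            return -1;               /* target too far away */
        int32_t rel32 = (int32_t)rel;
        memcpy(section + f->offset, &rel32, 4);
        return 0;
    }
    default:
        return -1;
    }
}
```

A real injector would loop over every entry of each .rela section and would need more relocation types (PLT and GOT related ones in particular), so treat this only as the core arithmetic.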
I assume you already have it, seeing as it is your master's thesis, but this book: http://www.iecc.com/linker/ is the standard introduction to this.
You must make room for the relocatable code in the executable by extending the executable's text segment, just like a virus infection does. Then, after writing the relocatable code into that space, update the symbol table by adding symbols for everything in the relocatable object, and apply the necessary relocation computations. I've written code that does this pretty well with 32-bit ELF files.
You cannot do this in any practical way. The intended solution is to make that object into a shared lib and then call dlopen on it.
Have you looked at the DyninstAPI? It appears support was recently added for linking a .o into a static executable.
From the release site:
Binary rewriter support for statically linked binaries on x86 and x86_64 platforms
Related
Why are some relocation entries in an ELF file symbol name + addend while others are section + addend? I am looking to clear up some confusion and gain a deeper understanding of ELFs. Below is my investigation.
I have a very simple C file, test.c:
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
static void func1(void)
{
fprintf(stdout, "Inside func1\n");
}
// ... a couple other simple *static* functions
int main (void)
{
func1();
// ... call some other functions
exit(EXIT_SUCCESS);
}
I then compile this into an object file with:
clang -O0 -Wall -g -c test.c -o test.o
If I look at the relocations with readelf -r test.o, I see that the entries referring to my static functions look as follows (this one is picked from the .rela.debug_info section):
Offset Info Type Symbol's Value Symbol's Name + Addend
...
000000000000006f 0000000400000001 R_X86_64_64 0000000000000000 .text + b0
...
Why are these functions referred to as section + addend rather than symbol name + addend? I see entries for the functions in the .symtab using readelf -s test.o:
Num: Value Size Type Bind Vis Ndx Name
...
2: 00000000000000b0 31 FUNC LOCAL DEFAULT 2 func1
...
Additionally, when I disassemble the object file (via objdump -d), I see that the functions are there and weren't optimized into main or anything.
If I don't make the functions static and then look at the relocations, I see the same as before when the type is R_X86_64_64, but I also see entries that use the symbol name plus an addend with type R_X86_64_PC32. So for example in .rela.text:
Offset Info Type Symbol's Value Symbol's Name + Addend
...
00000000000000fe 0000001200000002 R_X86_64_PC32 0000000000000000 func1 + 1c
...
Please let me know if more examples/readelf output would be helpful. Thank you for taking the time to read this.
Why are these functions referred to as section + addend rather than symbol name + addend?
The function names for static functions are not guaranteed to be present at link time. You could remove them with e.g. objcopy --strip-unneeded or objcopy --strip-symbol, and the result will still link.
I see entries for the functions in the .symtab using readelf -s test.o
I believe the only reason they are kept is to help debugging, and they are not used by the linker at all. But I have not verified this by looking at linker source, and so did not answer this related question.
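To relate this to the readelf output in the question, here is a small C sketch that splits the Info column the same way the ELF64_R_SYM and ELF64_R_TYPE macros from elf.h do, and computes the value stored for a section-relative R_X86_64_64 entry. The function names are made up for illustration.

```c
#include <stdint.h>

/* Split the 64-bit r_info field of an Elf64_Rela entry.
 * These mirror the ELF64_R_SYM / ELF64_R_TYPE macros from <elf.h>. */
uint32_t rela_sym_index(uint64_t info)
{
    return (uint32_t)(info >> 32);
}

uint32_t rela_type(uint64_t info)
{
    return (uint32_t)(info & 0xffffffffu);
}

/* For a relocation against a section symbol, the value stored for
 * R_X86_64_64 is S + A: the section's final address plus the addend,
 * which here acts as the offset of the target inside the section. */
uint64_t section_relative_target(uint64_t section_addr, int64_t addend)
{
    return section_addr + (uint64_t)addend;
}
```

For the entry with Info 0000000400000001, this gives symbol index 4 (presumably the .text section symbol, since func1 itself is symbol 2) and type 1, R_X86_64_64. The addend b0 equals func1's Value in the symbol table, so the relocation reaches func1 without needing its name.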
Eli Bendersky also mentions this in a blog post. From the section titled "Extra credit: Why was the call relocation needed?":
In short, however, when ml_util_func is global, it may be overridden in the executable or another shared library, so when linking our shared library, the linker can't just assume the offset is known and hard-code it [12]. It makes all references to global symbols relocatable in order to allow the dynamic loader to decide how to resolve them. This is why declaring the function static makes a difference - since it's no longer global or exported, the linker can hard-code its offset in the code.
The full post should be read to get complete context, but I thought I would share it here as it presents better examples than in my question and reinforces the solution that Employed Russian gave.
I am reading xv6 lectures.
I have a file named initcode.S that is to be linked in the kernel.
Two symbols are then declared like this:
extern char _binary_initcode_start[], _binary_initcode_size[];
inside a function.
The lecture says :
as part of the kernel build process, the linker embeds that binary that defines two special symbols, _binary_initcode_start and _binary_initcode_size, indicating the location and size of the binary.
I understand that binutils is getting the address and the size of this assembled code.
I wonder about the naming scheme: is it the default? My searches didn't settle that clearly.
_binary -> it is originally an assembly code
_initcode -> the name of my file
_start -> the parameter I am interested in.
It would imply that any assembled code would get such variables too.
I have no proof of that, though.
The question is:
Is _binary_myAsmFileHere_myParameterHere the default naming scheme binutils uses when exporting the address, size and so on of an assembled file?
Could someone tell me whether my assumption is right and, better still, what the actual rule is?
Thanks
Strangely enough, it doesn't seem to be documented in the ld manual. However, man objcopy does say this:
You can access this binary data inside a program by referencing the
special symbols that are created by the conversion process. These
symbols are called _binary_objfile_start, _binary_objfile_end and
_binary_objfile_size. e.g. you can transform a picture file into an object file and then access it in your code using these symbols.
Apparently the same logic is used by ld when embedding binary files.
Notice that the Makefile for xv6 contains this line for linking the kernel:
$(LD) $(LDFLAGS) -T kernel.ld -o kernel entry.o $(OBJS) -b binary initcode entryother
As you can see, it uses -b binary to embed the files initcode and entryother, so the above symbols will be defined during this process.
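As a sketch of the naming rule, the C function below builds the symbol name from the file name as it appears on the command line. It assumes that characters which are not alphanumeric are replaced by an underscore, which matches what I have seen objcopy and ld produce but is not spelled out in the manual text quoted above. The function name binary_symbol_name is made up.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the name of a _binary_* symbol for an embedded file.
 * `path` is the file name as given to ld/objcopy; `suffix` is
 * "start", "end" or "size". Characters that are not alphanumeric
 * are replaced by '_' (assumed rule, see the text above).
 * Returns 0 on success, -1 if `out` is too small.
 */
int binary_symbol_name(const char *path, const char *suffix,
                       char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "_binary_%s_%s", path, suffix);
    if (n < 0 || (size_t)n >= out_size)
        return -1;

    /* Mangle only the part that came from the file name. */
    size_t start = strlen("_binary_");
    size_t len = strlen(path);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)out[start + i];
        if (!isalnum(c))
            out[start + i] = '_';
    }
    return 0;
}
```

With the xv6 link line, "initcode" and "start" give _binary_initcode_start, which is the name the kernel source declares as extern.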
When a .global variable is defined in an assembly file, for a C file to be able to reference that variable, the C file has to prepend a '_' to the variable name. This is so the linker can 'link' the name in the C file with the name in the assembly file.
The test is on 32-bit Linux, x86.
Suppose in my assembly program final.s, I have to load some library symbols, say, stdin@@GLIBC_2.0, and I want to load these symbols at a fixed address.
So following instructions in this question, I did this:
echo ""stdin@@GLIBC_2.0" = 0x080a7390;" > symbolfile
echo ""stdin@GLIBC_2.0 (4)" = 0x080a7390;" >> symbolfile
gcc -Wl,--just-symbols=symbolfile final.s -g
And when I checked the output of symbol table, I got this:
readelf -s a.out | grep stdin
53: 080a7390 4 OBJECT GLOBAL DEFAULT ABS stdin@@GLIBC_2.0
17166: 080a7390 0 NOTYPE GLOBAL DEFAULT ABS stdin@GLIBC_2.0 (4)
Comparing with an ordinary ELF binary that requires the stdin symbol:
readelf -s hello.out | grep stdin
17199: 0838b8c4 4 OBJECT GLOBAL DEFAULT 25 stdin@@GLIBC_2.0
52: 0838b8c4 4 OBJECT GLOBAL DEFAULT 25 stdin@GLIBC_2.0 (4)
One obvious difference is the Ndx column: the section index of my fixed-position symbols is ABS. Please check the references here.
When executing the a.out, it throws a segmentation fault error.
So my question is, how to set the section number of the symbol fixed position?
I want to load these symbols in a fixed address.
You are importing these symbols from GLIBC. Unless you are doing a fully static link, you get no say in what address these symbols end up at.
So my question is, how to set the section number of the symbol
That question makes no sense: section number itself is meaningless and 25 may refer to .bss in one executable, but to .text in another.
Your section 25 just happens to be .bss on this particular system and for this particular build. Try building a fully-static binary, and you are likely to see section 24 instead.
Anyway, a normal executable gets stdin copied from libc.so.6. You will do well to read this description of the process, and pay special attention to "Extra credit #2: Referencing shared library data from the executable" section.
But it may be easier to understand the fully-static case first.
I am currently compiling a purchased data stack written in C. I use the vendor's own tool to compile it, which runs gcc in the background, and I can pass flags and parameters to gcc as I see fit. I want to know which file main() comes from, that is, which file is the project's starting point. Is there any way to tell gcc to generate a list of files, or similar, given that I don't know which file main() is taken from? Thank you.
You can disassemble the final executable to find the starting point. Since you have not provided any additional information, I'm using sample code to demonstrate the process.
#include <stdio.h>
int main() {
printf("hello world\n");
return 0;
}
Now the object file main.o has the following symbols:
[root@s1 sf]# gcc -c main.c
[root@s1 sf]# nm main.o
0000000000000000 T main
U puts
You can see that main does not have its final address yet (it sits at offset 0), because the address is assigned at the linking stage. Now, after linking:
$gcc main.o
$nm a.out
U __libc_start_main@@GLIBC_2.2.5
0000000000600874 A _edata
0000000000600888 A _end
00000000004005b8 T _fini
0000000000400390 T _init
00000000004003e0 T _start
000000000040040c t call_gmon_start
0000000000600878 b completed.6347
0000000000600870 W data_start
0000000000600880 b dtor_idx.6349
00000000004004a0 t frame_dummy
00000000004004c4 T main
You can see that main has an address now. It is still not where execution begins, though, because main is called by the C runtime, which is linked dynamically. ldd shows which library provides the undefined symbol __libc_start_main@@GLIBC_2.2.5:
[root@s1 sf]# ldd a.out
linux-vdso.so.1 => (0x00007fff61de1000) /* the linux system call interface */
libc.so.6 => /lib64/libc.so.6 (0x0000003c96000000) /* libc runtime, this will invoke your main */
/lib64/ld-linux-x86-64.so.2 (0x0000003c95c00000) /* dynamic loader */
Now you can verify this by viewing the disassembly :
00000000004003e0 <_start>:
..........
4003fd: 48 c7 c7 c4 04 40 00 mov rdi,0x4004c4 /* address of start of main */
400404: e8 bf ff ff ff call 4003c8 <__libc_start_main@plt> /* this will set up the environment for main, like pushing argc and argv to stack */
...........
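To show the shape of that hand-off, here is a heavily simplified model in C. It is not glibc's code: model_libc_start_main and demo_main are made-up names, and the real __libc_start_main takes more arguments and does much more setup. The point is only that main reaches the runtime as an ordinary function pointer.

```c
#include <stddef.h>

/* Signature of main as the C runtime sees it (with the envp extension). */
typedef int (*main_fn)(int argc, char **argv, char **envp);

/*
 * Hypothetical model of what the runtime does with the address that
 * _start loads into rdi: call main through the pointer and pass its
 * return value on to exit(). Here the value is returned instead so
 * the flow can be observed.
 */
int model_libc_start_main(main_fn entry, int argc, char **argv, char **envp)
{
    if (entry == NULL)
        return -1;
    return entry(argc, argv, envp);
}

/* Stand-in for the program's real main(), for demonstration only. */
int demo_main(int argc, char **argv, char **envp)
{
    (void)argv;
    (void)envp;
    return argc;   /* echo the argument count back as the exit status */
}
```

This is why the constant moved into rdi just before the call is the thing to look for in a stripped binary: it is the address of main.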
If you don't have the source, you can search the executable for references to __libc_start_main, main or _start to see how your executable is initialized and how main gets started.
All of this applies when linking is done with the default linker script. Many big projects use their own linker script. If your project has a custom linker script, finding the start point depends on that script. There are also projects that do not use glibc's runtime. In that case it is still possible to find the start point by inspecting the object files, library archives, etc.
If your binary is stripped of symbols, you have to rely on your assembly skills to find where it starts.
I've assumed that you don't have the source, that is, that the stack is distributed as libraries and header files only (a common practice among commercial software vendors).
But if you have the source, it is trivial: just grep your way through it. Some answers already pointed that out.
Where main() is called from is implementation-dependent. With GCC, it will most likely be called from a stub object file in /usr/lib named crt0.o or crt1.o. (This file contains the OS-dependent entry symbol that is invoked automatically when your app is loaded into memory: _start on Linux, start on Mac OS X.)
You can use objdump -t to list symbols from object files. So assuming you are on Linux, and also assuming that the object files are still around somewhere, you can do this:
find -name '*.o' -print0 \
| xargs -0 objdump -t \
| awk '/\.o:/{f=$1} /\.text\.main/{print f, $6}'
This will print a list of object files and the references to main they contain. Usually there should be a simple map from object files to source files. If there are multiple object files containing that symbol, then it depends on which one of those actually got linked into the binary you're looking at, as there can be no more than one main per executable binary (except perhaps for some really exotic black magic).
After the application is linked and debugging symbols are stripped, there usually is no indication from which source file a specific function came. The exception to this are files which include the function names as string literals, e.g. using the __FILE__ macro. Before stripping debugging symbols, you might use the debugger to obtain that information. If debugging symbols are included, that is.
I had to modify some open source code to use in a C project. Instead of building a library from the modified code, I'd like to just compile and build an executable from my own source combined with the modified open source code. The goal is to have a stand-alone package that can be distributed. I can get this to work just fine using the GNU build tools and have successfully built my executable.
Now I'd like to pare down the amount of code I am building and linking. Is there an easy way to determine which of the open source files I actually need to compile? There are, say, 40 .c files in the open source package. I'm guessing my code only uses (or causes to be used) 20-ish of those files. Currently I'm compiling all of them and throwing everything at the linker. There has to be a smart (and easy?) way to determine which ones I actually need, right?
I'm happy to provide further details if it's helpful. Thanks in advance.
When faced with this, I've either simply taken the final link command, stripped out all of the objects and then added them back in until it works, or processed the output of the nm command.
Worked example:
Looking at the output of nm:
$ nm *.o
a.o:
00000000 T a
U aa
b.o:
00000000 T b
t.o:
U a
U b
00000000 T main
ua.o:
00000000 T ua
ub.o:
00000000 T ub
So I create the following awk script:
# find-unused.awk
BEGIN {req["main"]="crt"}
/\.o\:$/{
    gsub(/\:/,"");
    modulename=$0;
}
$1=="U"{
    req[$2] = modulename;
}
/[0-9,a-f].* T/{
    def[$3] = modulename;
}
END{
    print "modules referenced:"
    for (i in req)
    {
        if (def[i] != "")
            print " "def[i];
    }
    print "functions not found"
    for (i in req)
    {
        if (def[i] == "")
            print " "i;
    }
}
and then call it like this:
$ nm *.o|awk -f find-unused.awk
it tells me:
modules referenced:
t.o
a.o
b.o
functions not found
aa
Which is right - because the ua & ub functions in the above example aren't used.
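The awk script lists every module that defines a symbol some other module wants, whether or not that other module is itself needed. A stricter variant follows references outward from main only. Here is a sketch of that idea in C, assuming the nm output has already been parsed into a small table; struct module, MAX_SYMS and mark_needed are made-up names for this example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 8

/* Hypothetical in-memory form of one "file.o:" block of nm output. */
struct module {
    const char *name;
    const char *defs[MAX_SYMS];    /* symbols of type T, NULL-terminated */
    const char *undefs[MAX_SYMS];  /* symbols of type U, NULL-terminated */
    int needed;                    /* set by mark_needed() */
};

/* Return the index of the module defining `sym`, or -1 if none does. */
static int find_definer(const struct module *mods, size_t n, const char *sym)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < MAX_SYMS && mods[i].defs[j]; j++)
            if (strcmp(mods[i].defs[j], sym) == 0)
                return (int)i;
    return -1;
}

/*
 * Mark every module reachable from the one defining `root` (e.g. "main")
 * by following undefined-symbol references. Returns the number of
 * modules marked as needed.
 */
size_t mark_needed(struct module *mods, size_t n, const char *root)
{
    size_t count = 0;
    int changed;
    int r = find_definer(mods, n, root);

    for (size_t i = 0; i < n; i++)
        mods[i].needed = 0;
    if (r < 0)
        return 0;
    mods[r].needed = 1;
    count = 1;

    do {   /* repeat until no new module gets marked */
        changed = 0;
        for (size_t i = 0; i < n; i++) {
            if (!mods[i].needed)
                continue;
            for (size_t j = 0; j < MAX_SYMS && mods[i].undefs[j]; j++) {
                int d = find_definer(mods, n, mods[i].undefs[j]);
                if (d >= 0 && !mods[d].needed) {
                    mods[d].needed = 1;
                    count++;
                    changed = 1;
                }
            }
        }
    } while (changed);
    return count;
}
```

Fed with the five modules from the nm listing above and the root "main", this marks t.o, a.o and b.o as needed and leaves ua.o and ub.o unmarked.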
See if you can get your dead-code stripper to tell you what functions/symbols it eliminated during the link. Then you will know what source code you can safely remove. The GNU linker's -Map option may be useful on that front. You could, for instance, link once without dead-code stripping, then link again with dead-code stripping and compare the output map files.
If there are only 40 source files maximum, is this optimization really worth your time?