I am currently compiling a purchased data stack written in C. I build it with the vendor's own tool, which uses gcc in the background, and I can pass flags and parameters to gcc as I see fit. I want to know which file the main() used comes from, that is, which file in the project is the starting point. Is there any way to tell gcc to generate a list of files, or something similar, given that I don't know which file main() is being taken from? Thank you.
You can disassemble the final executable to find the starting point, although you haven't provided enough additional info for more specific help. I'm using a sample program to demonstrate the process.
#include <stdio.h>

int main() {
    printf("hello world\n");
    return 0;
}
Now the object file main.o contains the following:
[root@s1 sf]# gcc -c main.c
[root@s1 sf]# nm main.o
0000000000000000 T main
                 U puts
You can see that main sits at offset 0 and does not have a final address yet; that will change at the linking stage. Now, after linking:
$gcc main.o
$nm a.out
                 U __libc_start_main@@GLIBC_2.2.5
0000000000600874 A _edata
0000000000600888 A _end
00000000004005b8 T _fini
0000000000400390 T _init
00000000004003e0 T _start
000000000040040c t call_gmon_start
0000000000600878 b completed.6347
0000000000600870 W data_start
0000000000600880 b dtor_idx.6349
00000000004004a0 t frame_dummy
00000000004004c4 T main
You can see that main has an address now, but it's still not final, because main will be called dynamically by the C runtime. You can see who provides the U __libc_start_main@@GLIBC_2.2.5 part:
[root@s1 sf]# ldd a.out
linux-vdso.so.1 => (0x00007fff61de1000) /* the linux system call interface */
libc.so.6 => /lib64/libc.so.6 (0x0000003c96000000) /* libc runime , this will invoke your main*/
/lib64/ld-linux-x86-64.so.2 (0x0000003c95c00000) /* dynamic loader */
Now you can verify this by viewing the disassembly :
00000000004003e0 <_start>:
..........
4003fd: 48 c7 c7 c4 04 40 00 mov rdi,0x4004c4 /* address of start of main */
400404: e8 bf ff ff ff call 4003c8 <__libc_start_main@plt> /* this will set up the environment for main, like pushing argc and argv to the stack */
...........
If you don't have the source with you, then you can search the executable for references to __libc_start_main, main, or _start to see how your executable is initialized and how it reaches main.
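One concrete way to do that search (a rough sketch; a.out stands in for your actual executable) is to read the ELF entry point and then look at what _start hands over to __libc_start_main:
$ readelf -h a.out | grep 'Entry point'      # address of _start
$ objdump -d a.out | grep -A10 '<_start>:'   # on x86-64, the address loaded into rdi here is main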
All of this applies when linking is done with the default linker script. Many big projects use their own linker script; if your project has a custom one, finding the start point will differ depending on the script used. There are also projects that do not use glibc's runtime at all. In that case it's still possible to find the start point by digging through the object files, library archives, etc.
If your binary is stripped of symbols, then you have to rely on your assembly-reading skills to find where it starts.
I've assumed that you don't have the source, i.e. the stack is distributed with only some libraries and header definitions (a common practice of commercial software vendors).
But if you do have the source with you, then it's trivial: just grep your way through it. Some answers have already pointed that out.
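If the source is there, something along these lines (a sketch; adjust the path and pattern to your tree) usually pinpoints the file that defines main:
$ grep -rn --include='*.c' -E '\bmain[[:space:]]*\(' /path/to/src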
Where main() is called from is implementation-dependent. Using GCC, it will most likely be a stub object file in /usr/lib called crt0.o or crt1.o. (This file contains the OS-dependent entry symbol which is automatically invoked by the kernel when your app is loaded into memory; it is called _start on Linux and start on Mac OS X.)
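If you want to look at that stub yourself, gcc can tell you which crt file it would link in (a sketch; the exact path depends on your toolchain):
$ gcc -print-file-name=crt1.o                          # where the startup stub lives
$ nm $(gcc -print-file-name=crt1.o) | grep -w _start   # the entry symbol it defines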
You can use objdump -t to list symbols from object files. So assuming you are on Linux, and also assuming that the object files are still around somewhere, you can do this:
find . -name '*.o' -print0 \
    | xargs -0 objdump -t \
    | awk '/\.o:/{f=$1} /\.text/ && $NF=="main" {print f, $NF}'
This will print a list of object files together with the main symbols they define. Usually there is a simple mapping from object files to source files. If multiple object files contain that symbol, it depends on which one of them actually got linked into the binary you're looking at, as there can be no more than one main per executable binary (except perhaps for some really exotic black magic).
After the application is linked and the debugging symbols are stripped, there is usually no indication of which source file a specific function came from. The exception is code that embeds file names as string literals, e.g. via the __FILE__ macro. Before stripping the debugging symbols, you can use the debugger to obtain that information, provided debugging symbols were included in the first place.
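For instance, while the debug info is still there, gdb can map main straight back to its source file; the output names the file and line where main is defined (a sketch; a.out stands in for your actual binary):
$ gdb -batch -ex 'info line main' ./a.out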
I have two assembly files, code1.s and code2.s, and I want to build a relocatable shared library (using the -fPIC switch) from these two.
I want code2.s to call a function named myfun1, which is defined in code1.s.
When I use call myfun1@PLT in code2.s it finds the function and works like a charm, but it goes through the PLT to call a function that lives in the same shared library. I want to do this without going through the PLT. When I remove @PLT I get the relocation R_X86_64_PC32 against symbol error for myfun1.
How can I do this without using the PLT? Is there any way at all? I think it should be feasible, since the shared library as a whole has to be relocatable but not necessarily each of its object files, so why should a call to a function inside the same library go through the PLT?
Here are my compile commands:
For codeX.s:
gcc -c codeX.s -fPIC -DPIC -o codeX.o
or
gcc -c codeX.s -o codeX.o
and for the shared library named libcodes.so:
gcc -shared -fPIC -DPIC -o libcodes.so code1.o code2.o
In case you're curious why I'm doing this: I have many object files and each of them wants to call myfun1; here I've just simplified things to ask the technical part. I even tried to put myfun1 in all the codeX.s files, but then I get an error that myfun1 is defined multiple times. I don't care that much about space, so I wouldn't mind putting myfun1 in every file.
From within one source file you can just use two labels (see Building .so with recursive function in it), one declared .globl and the other not. But that's not sufficient across source files within the shared library.
Still useful in combination with the below answer for functions that are also exported: one .hidden and one not, so you can efficiently call within the library.
Use .globl and .hidden to create a symbol that can be seen outside the current object file, but not outside the shared library. Thus it's not subject to symbol-interposition, and calls from other files in the same shared library can call it directly, not through the PLT or GOT.
Tested and working example:
## foo.S
.globl myfunc
.hidden myfunc
myfunc:
#.globl myfunc_external # optional, a non-hidden symbol at the same addr
#myfunc_external:
ret
## bar.S
.globl bar
bar:
call myfunc
ret
Build with gcc -shared foo.S bar.S -o foo.so, and objdump -drwC -Mintel foo.so:
Disassembly of section .text:
000000000000024d <myfunc>:
24d: c3 ret
000000000000024e <bar>:
24e: e8 fa ff ff ff call 24d <myfunc> # a direct near call
253: c3 ret
(I actually built with -nostdlib as well to keep the disassembly output clean for example purposes by omitting the other functions like __do_global_dtors_aux and register_tm_clones, and the .init section.)
I think glibc uses strong_alias or weak_alias for this (see what does the weak_alias function do and where is it defined), so calls from within the shared library can use the normal name. See also Where are syscalls located in glibc source, e.g. __chdir and chdir.
e.g. glibc's printf.c defines __printf and makes printf a strong alias for it.
io/chdir.c defines __chdir and makes chdir a weak alias for it.
One of the x86-64 memchr asm implementations also uses a strong_alias macro (at the bottom of the file).
The relevant GAS directives are:
.weak names
.weakref foo, foo_internal
There's no "strong alias" GAS directive; it's probably equivalent to simply foo = foo_internal, or the equivalent .set foo, foo_internal (a quick check of this is sketched below).
(TODO: complete example and more details of what strong/weak do exactly. I don't currently know, so edits welcome if I don't get around to reading the docs myself. I know this stuff exists and solves this problem, but I don't know exactly how.)
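A quick way to check that alias equivalence (a sketch with made-up file and symbol names, not taken from glibc):
$ cat > alias_test.s <<'EOF'
.text
.globl foo_internal
foo_internal:
        ret
.globl foo
.set foo, foo_internal          # plain ("strong") alias at the same address
.weak bar
.set bar, foo_internal          # weak alias at the same address
EOF
$ gcc -c alias_test.s && nm alias_test.o
0000000000000000 W bar
0000000000000000 T foo
0000000000000000 T foo_internal
Both aliases resolve to the same address; only the binding (T vs W) differs.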
Well, I was not able to find any way to do this, but as I edited into my question, I no longer care about putting myfun1 in all the object files.
The problem I had was that the linker reported that myfun1 was defined in multiple places, and that was only because I had a .globl directive for myfun1; once I removed that line it was fixed.
Thanks to Ross Ridge for pushing me to try that again.
My question is fairly specific to OS X on x86-64, but a universal solution that works on other POSIX OSes would be appreciated even more.
I have a list of symbol names from some shared library (called the original library in the following), and I want my shared library to re-export these symbols: if someone resolves a symbol against my library, I either provide my own version of that symbol or (if my library doesn't have it) forward to the original library's symbol.
I don't know the types of the symbols, I only know whether they are functions (type T in nm output) or other symbols (type S in nm output).
For functions, I already have a solution: for every function I want to re-export, I generate an assembly stub that dynamically resolves the symbol (using dlsym()) and then jumps into the resolved function with the very same environment (registers rdi, rsi, rdx, rcx, r8, r9, the stack pointer, ...). I'm basically generating universal proxy functions. With some macro trickery these can be generated fairly easily without writing code for each and every symbol.
For non-function symbols the problem seems to be harder, because I cannot generate such a universal proxy: the resolving party never calls a function.
Using a constructor function, static void init(void) __attribute__((constructor));, I can execute code whenever someone loads my library; that would be a good point to resolve and re-export all non-function symbols, if that's possible.
In other words, I'd like the symbol table of my library to point to the respective symbols of another shared library. Doing the rewriting at compile or run time is okay (run time preferred). Put yet another way, the behaviour of DYLD_INSERT_LIBRARIES (LD_PRELOAD) is exactly what I need, but I don't want to insert a new library, I want to replace one (in the file system). EDIT: The reason I don't want to / can't use DYLD_INSERT_LIBRARIES or any other environment variable of the DYLD_* family is that they are ignored for code-signed, restricted, ... binaries.
I'm aware of the -reexport-l, -reexport_library and -reexported_symbols_list linker flags, but I could not get them to work, especially when my library is a "replacement" for frameworks that are part of umbrella frameworks (example: /System/Library/Frameworks/CoreServices.framework/Frameworks/SearchKit.framework/SearchKit), because ld forbids linking directly against parts of umbrella frameworks.
EDIT: Because I explained it somewhat ambiguously: I can't change the way the actual program is linked. The goal is to produce a shared library that is a replacement for the original library. (Apparently this is called a filter library.)
I figured it out (OS X specific): clang -o replacement-lib.dylib ... -Xlinker -reexport_library PATH_TO_ORIGINAL_LIB does the trick. PATH_TO_ORIGINAL_LIB could for example be /System/Library/Frameworks/CoreServices.framework/Frameworks/SearchKit.framework/Versions/Current/SearchKit.
If PATH_TO_ORIGINAL_LIB is a library that is part of an umbrella framework (as in the example above), then replace PATH_TO_ORIGINAL_LIB with the path of some other lib (I created an empty.dylib for that) and as a second step run
install_name_tool -change /usr/local/lib/empty.dylib PATH_TO_ORIGINAL_LIB replacement-lib.dylib
To see if the actual reexporting worked use:
otool -l replacement-lib.dylib | grep -A2 LC_REEXPORT_DYLIB
The output should look like
cmd LC_REEXPORT_DYLIB
cmdsize XX
name empty.dylib (offset YY)
After running install_name_tool it should look like
cmd LC_REEXPORT_DYLIB
cmdsize XX
name /System/Library/Frameworks/CoreServices.framework/Frameworks/SearchKit.framework/Versions/Current/SearchKit (offset YY)
You could link against both libraries and use the link order to make sure you link against the right symbols. This works on both OS X and Linux:
cc -o executable -lmylib -loriglib
Here origlib is the original library and mylib contains the symbols that are supposed to override symbols in origlib. The executable will then be linked against your symbols from mylib first, and all remaining unresolved symbols will be resolved from origlib.
This works in the same way when linking against OS X frameworks. Just link against your library that replaces symbols first and against the framework after.
cc -o executable -lmylib -framework SomeFramework
Edit: If you just want to replace symbols at runtime then you can use LD_PRELOAD in the same way:
cc -o executable -framework SomeFramework
LD_PRELOAD=libmylib.dylib ./executable
I'm trying to link a simple C program on an ARM Debian machine (a Raspberry Pi), and when linking the object file the linker returns the error in the subject.
My program is as simple as:
simple.c:
int main(){
    int a = 2;
    int b = 3;
    int c = a+b;
}
I compile it with
$>gcc -o simple.obj simple.c
and then link it with
$>ld -o simple.elf simple.obj
ld: simple.obj: access beyond end of merged section (33872)
I can't understand why...
If I try to read the ELF file with objdump -d, it fails to disassemble the .text section (it only prints the address, the value, .word, and the value again preceded by 0x), but the binary data is the same as what I get from disassembling simple.obj.
The only difference is the load start address (and the subsequent addresses) of the binary data: the ELF file starts at 0x8280, the object file starts at 0x82a0.
What does all this mean?
EDIT:
This is the dump for the obj file: http://pastebin.com/YZ94kRk4
and this is the dump for the elf file: http://pastebin.com/3C3sWqrC
I tried compiling with the -c option, which makes gcc stop after the assembly stage (before, it was already doing the linking), but now I have a different problem: it says that there is no _start section in my object file...
The new dumps are:
simple.obj: http://pastebin.com/t0TqmgPa
simple.elf: http://pastebin.com/qD35cnqw
You are misunderstanding the effect of the commands you ran. If you run:
$ gcc -o simple.obj simple.c
it already creates the program you want to run; it's already linked. You don't need to link it again, and especially not by running ld directly unless you know what you are doing. Even though its extension is .obj, that doesn't matter; it's just the name of the file, and its content is already a complete Linux program. So if you run:
$ ./simple.obj
it will execute your code.
You usually don't call ld directly; instead you use gcc as a front-end for both compiling and linking. This is because gcc also takes care of linking in important pieces that you are not linking yourself, such as the startup code, and that's the reason why your second attempt resulted in "no _start section" or something like that.
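A sketch of the usual two-step flow, still letting gcc drive the link so that the startup code and libc get pulled in:
$ gcc -c simple.c -o simple.o     # compile and assemble only, produces a relocatable object
$ gcc simple.o -o simple.elf      # link through gcc: adds crt*.o startup code and libc
$ ./simple.elf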
Could you print the output of the objdump -d command?
Btw, notice that 33872 == 0x8450.
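A quick way to check that conversion:
$ printf '0x%x\n' 33872
0x8450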
I am not familiar with the Raspberry Pi's memory map, so if you're following a tutorial about this or have some other resource to help me help you out, that would be great :)
I am building a project that produces multiple shared libraries and executables. All the source files used to build these binaries live in a single /src directory, so it is not obvious which source files were used to build each binary (there is a many-to-many relation).
My goal is to write a script that would parse a set of C files for each binary and make sure that only the right functions are called from them.
One option seems to be to extract this information from the Makefile, but that does not work well with generated files and headers (because of the dependence on includes).
Another option could be to simply browse call graphs, but this gets complicated because a lot of functions are called through function pointers.
Any other ideas?
You can first compile your project with debug information (gcc -g) and then use objdump to find out which source files were included.
objdump -W <some_compiled_binary>
The DWARF format should contain the information you are looking for:
<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
< c> DW_AT_producer : (indirect string, offset: 0x5f): GNU C 4.4.3
<10> DW_AT_language : 1 (ANSI C)
<11> DW_AT_name : (indirect string, offset: 0x28): test_3.c
<15> DW_AT_comp_dir : (indirect string, offset: 0x36): /home/auselen/trials
<19> DW_AT_low_pc : 0x82f0
<1d> DW_AT_high_pc : 0x8408
<21> DW_AT_stmt_list : 0x0
In this example, the object file was compiled from test_3.c, which was located in the .../trials directory. Then of course you need to write a small script around this to collect the related source file names.
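A small pipeline in that direction (a sketch; it leans on the textual layout of objdump's DWARF dump, so treat it only as a starting point):
$ objdump --dwarf=info <some_compiled_binary> \
      | awk '/DW_TAG_compile_unit/ {cu=1} cu && /DW_AT_name/ {print $NF; cu=0}' \
      | sort -u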
First you need to separate the debug symbols from the binary you just compiled. Check this question on how to do so:
How to generate gcc debug symbol outside the build target?
Then you can try to parse this file on your own. I know how to do this for Visual Studio, but since you are using GCC I won't be able to help you further.
Here is an idea; it needs refining based on your specific build. Make a build and log it using script (for example script log.txt make clean all). The last step (or one of the last) should be the linking of the object files (tip: look for cc -o <your_binary_name>). That line should link all the .o files, which should have corresponding .c files in your tree. Then grep those .c files for all the header files they include.
If there are duplicate names among the .c files in your tree, then we'll need to look at the full paths in the linker line or work from the Makefile.
What Mahmood suggests below should work too: if you have an image with symbols, strings <debug_image> | grep <full_path_of_src_directory> should give you a list of C files.
You can use the unix nm tool, which lists the symbols defined in (and referenced by) an object. So you need to:
1. Run nm on your binary and grab all undefined symbols.
2. Run ldd on your binary to grab the list of all its dynamic dependencies (the .so files your binary is linked to).
3. Run nm on each .so file you found in step 2.
That will give you the full list of dynamic symbols that your binary uses (a small sketch of steps 2 and 3 follows after the example below).
Example:
nm -C --dynamic /bin/ls
....skipping.....
00000000006186d0 A _edata
0000000000618c70 A _end
U _exit
0000000000410e34 T _fini
0000000000401d88 T _init
U _obstack_begin
U _obstack_newchunk
U _setjmp
U abort
U acl_extended_file
U bindtextdomain
U calloc
U clock_gettime
U closedir
U dcgettext
U dirfd
All the symbols marked with a capital "U" (undefined) are symbols the ls command uses from its dynamic dependencies.
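Steps 2 and 3 can be scripted in the same spirit (a sketch, reusing /bin/ls from the example; the library path in the second command will differ on your system):
$ ldd /bin/ls | awk '/=> \// {print $3}'                   # step 2: the .so dependencies
$ nm -D --defined-only /lib64/libc.so.6 | grep -w dirfd    # step 3: which dependency defines a given symbol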
If your goal is to analyze the C source files, you can do that by customizing the GCC compiler. You could use MELT for that purpose (MELT is a high-level domain-specific language for extending GCC), adding your own analysis passes coded in MELT inside GCC, but you would first need to learn about GCC's middle-end internal representations (Gimple, Tree, ...).
Customizing GCC takes several days of work, mostly because the GCC internals are quite complex in the details.
Feel free to ask me more about MELT.
I had to modify some open source code to use in a C project. Instead of building a library from the modified code, I'd like to just compile and build an executable from my own source combined with the modified open source code. The goal is to have a stand-alone package that can be distributed. I can get this to work just fine using the GNU build tools and have successfully built my executable.
Now I'd like to pare down the amount of code I am building and linking. Is there an easy way to determine which of the open source files I actually need to compile? There are, say, 40 .c files in the open source package. I'm guessing my code only uses (or causes to be used) 20-ish of those files. Currently I'm compiling all of them and throwing everything at the linker. There has to be a smart (and easy?) way to determine which ones I actually need, right?
I'm happy to provide further details if it's helpful. Thanks in advance.
When faced with this, I've either simply taken the final link command, stripped out all of the object files, and then added them back in until it links, or I've processed the output of the nm command.
Worked example:
Looking at the output of nm:
$ nm *.o
a.o:
00000000 T a
U aa
b.o:
00000000 T b
t.o:
U a
U b
00000000 T main
ua.o:
00000000 T ua
ub.o:
00000000 T ub
So I create the following awk script
# find-unused.awk
BEGIN {req["main"]="crt"}
/\.o\:$/{
gsub(/\:/,"");
modulename=$0;
}
$1=="U"{
req[$2] = modulename;
}
/[0-9,a-f].* T/{
def[$3] = modulename;
}
END{
print "modules referenced:"
for (i in req)
{
if (def[i] != "")
print " "def[i];
}
print "functions not found"
for (i in req)
{
if (def[i] == "")
print " "i;
}
}
and then call it like this:
$ nm *.o|awk -f find-unused.awk
it tells me:
modules referenced:
t.o
a.o
b.o
functions not found
aa
which is right, because the ua and ub functions in the above example aren't used.
See if you can get your dead-code stripper to tell you which functions/symbols it eliminated during the link; then you will know which source code you can safely remove. The GNU linker's -Map option may be useful on that front. You could, for instance, link once without dead-code stripping, then link again with dead-code stripping and compare the output map files.
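With GNU ld that can look like this (a sketch; app and the list of sources are placeholders):
$ gcc -ffunction-sections -fdata-sections -c *.c
$ gcc *.o -o app -Wl,--gc-sections -Wl,--print-gc-sections -Wl,-Map=app.map
--print-gc-sections reports every section (and, with -ffunction-sections, every function) the linker dropped, and app.map shows what actually made it into the binary.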
If there are only 40 source files maximum, is this optimization really worth your time?