How to determine which object files are actually necessary for linking? - c

I had to modify some open source code to use in a C project. Instead of building a library from the modified code, I'd like to just compile and build an executable from my own source combined with the modified open source code. The goal is to have a stand-alone package that can be distributed. I can get this to work just fine using the GNU build tools and have successfully built my executable.
Now I'd like to pare down the amount of code I am building and linking. Is there an easy way to determine which of the open source files I actually need to compile? There are, say, 40 .c files in the open source package. I'm guessing my code only uses (or causes to be used) 20-ish of those files. Currently I'm compiling all of them and throwing everything at the linker. There has to be a smart (and easy?) way to determine which ones I actually need, right?
I'm happy to provide further details if it's helpful. Thanks in advance.

When faced with this I've either simply taken the final link command stripped out all of the objects and then added back in until it works, or processed the output of the nm command.
Worked example:
Looking at the output of nm:
$ nm *.o
a.o:
00000000 T a
U aa
b.o:
00000000 T b
t.o:
U a
U b
00000000 T main
ua.o:
00000000 T ua
ub.o:
00000000 T ub
So I create the following awk script
# find-unused.awk
BEGIN {req["main"]="crt"}
/\.o\:$/{
gsub(/\:/,"");
modulename=$0;
}
$1=="U"{
req[$2] = modulename;
}
/[0-9,a-f].* T/{
def[$3] = modulename;
}
END{
print "modules referenced:"
for (i in req)
{
if (def[i] != "")
print " "def[i];
}
print "functions not found"
for (i in req)
{
if (def[i] == "")
print " "i;
}
}
and then call it like this;
$ nm *.o|awk -f find-unused.awk
it tells me:
modules referenced:
t.o
a.o
b.o
functions not found
aa
Which is right - because the ua & ub functions in the above example aren't used.

See if you can get your dead-code stripper to tell you what functions/symbols it eliminated during the link. Then you will know what source code you can safely remove. The GNU linker's -map option may be useful on that front. You could, for instance, link once without dead-code stripping, then link again with dead-code stripping and compare the output map files.
If there are only 40 source files maximum, is this optimization really worth your time?

Related

C undefined symbol at runtime

I have inherited a codebase that I am attempting to build and run. I have been able to do so before, but now on a different system I am getting some runtime errors, even though compilation goes fine. I am quite new to C and would appreciate any suggestions to move forward.
Specifically, I have a directory, common, of common utilities, which gets built into a shared library, and when the main program tries to use symbols from the shared library, I get errors like this:
some_file.so: undefined symbol: CUSTOM_LOG_LEVEL
The symbol is declared within a header file, custom_log.h, like so:
extern int CUSTOM_LOG_LEVEL;
The symbol is defined within the custom_log.c file, like so:
int CUSTOM_LOG_LEVEL = DEFAULT_CUSTOM_LOG_LEVEL;
These logger files are contained in a directory of other common utilities, which are all brought together in a file common.h. Here is how the logger is included in the common library:
#include "custom_log.h"
After I have run configure, make, and make install, the relevant contents of the lib directory look like this:
libsomething_common.so -> libsomething_common.so.0.0.0
libsomething_common.so.0 -> libsomething_common.so.0.0.0
libsomething_common.so.0.0.0
So there are a few symlinked .so files, and the one they end up pointing to is libsomething_common.so.0.0.0. Analyzing the file with the objdump command yields the following results:
$ objdump -t libsomething_common.so.0.0.0 | grep CUSTOM_LOG_LEVEL
0000000000209a20 l O .data 0000000000000004 CUSTOM_LOG_LEVEL
$ objdump -T libsomething_common.so.0.0.0 | grep CUSTOM_LOG_LEVEL
So we can see that the symbol is not dynamic, and I believe it should be. But how to make that happen?
Again, I am very new to this build process and appreciate any suggestions you might have.

ld: access beyond end of merged section

i'm trying to link a simple c program on an arm debian machine (a raspberry pi) and when linking the ogject file the linker returns me the error in the subject.
my program is as simple as
simple.c:
int main(){
int a = 2;
int b = 3;
int c = a+b;
}
i compile it with
$>gcc -o simple.obj simple.c
and then link it with
$>ld -o simple.elf simple.obj
ld: simple.obj: access beyond end of merged section (33872)
i can't understand why...
if i try to read the elf file with objdump -d it doesn't manage to decompile the .text section (it only prints address, value, .word and again value preceded by 0x) but the binary data is the same as the one i get from the decompiled simple.obj.
the only difference is in the loading start (and consequent) addresses of the binary data: the elf file starts at 0x8280, the object file starts at 0x82a0.
what does all this mean?
EDIT:
this is the dump for the obj file: http://pastebin.com/YZ94kRk4
and this is the dump for the elf file: http://pastebin.com/3C3sWqrC
i tried compiling with -c option that makes gcc stop after assembly time (it already did the linking part) but now i have a different problem: it says that there is no _start section in my object file...
the new dumps are:
simple.obj: http://pastebin.com/t0TqmgPa
simple.elf: http://pastebin.com/qD35cnqw
You are misunderstanding the effect of the commands you ran. If you run:
$ gcc -o simple.obj simple.c
it already creates the program you want to run, it's already linked. You don't need to link it again, especially by running ld directly unless you know what you are doing. Even if its extension is obj, it doesn't matter, it's just the name of the file, but the content of the file is already a complete Linux program. So if you run:
$ ./simple.obj
it will execute your code.
You usually don't call ld directly, but instead you use gcc as a front-end to compile and link. This is because gcc takes care of linking also important libraries that you are not linking such as the startup code, and that's the reason why your second attempt resulted in "no _start section" or something like that.
Could you print the output of the objdump -d command?
Btw, notice that 33872 == 0x8450.
I am not familiar with raspberry PI's memory map, so if you'r following any tutorials about this or have some other resource to help me help you out - it would be great :)

How to find out *.c and *.h files that were used to build a binary?

I am building a project that builds multiple shared libraries and executable files. All the source files that are used to build these binaries are in a single /src directory. So it is not obvious to figure out which source files were used to build each of the binaries (there is many-to-many relation).
My goal is to write a script that would parse a set of C files for each binary and make sure that only the right functions are called from them.
One option seems to be to try to extract this information from Makefile. But this does not work well with generated files and headers (due to dependence on Includes).
Another option could be to simply browse call graphs, but this would get complicated, because a lot of functions are called by using function pointers.
Any other ideas?
You can first compile your project with debug information (gcc -g) and use objdump to get which source files were included.
objdump -W <some_compiled_binary>
Dwarf format should contain the information you are looking for.
<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
< c> DW_AT_producer : (indirect string, offset: 0x5f): GNU C 4.4.3
<10> DW_AT_language : 1 (ANSI C)
<11> DW_AT_name : (indirect string, offset: 0x28): test_3.c
<15> DW_AT_comp_dir : (indirect string, offset: 0x36): /home/auselen/trials
<19> DW_AT_low_pc : 0x82f0
<1d> DW_AT_high_pc : 0x8408
<21> DW_AT_stmt_list : 0x0
In this example, I've compiled object file from test_3, and it was located in .../trials directory. Then of course you need to write some script around this to collect related source file names.
First you need to separate the debug symbols from the binary you just compiled. check this question on how to do so:
How to generate gcc debug symbol outside the build target?
Then you can try to parse this file on your own. I know how to do so for Visual Studio but as you are using GCC I won't be able to help you further.
Here is an idea, need to refine based on your specific build. Make a build, log it using script (for example script log.txt make clean all). The last (or one of the last) step should be the linking of object files. (Tip: look for cc -o <your_binary_name>). That line should link all .o files which should have corresponding .c files in your tree. Then grep those .c files for all the included header files.
If you have duplicate names in your .c files in your tree, then we'll need to look at the full path in the linker line or work from the Makefile.
What Mahmood suggests below should work too. If you have an image with symbols, strings <debug_image> | grep <full_path_of_src_directory> should give you a list of C files.
You can use unix nm tool. It shows all symbols that are defined in the object. So you need to:
Run nm on your binary and grab all undefined symbols
Run ldd on your binary to grab list of all its dynamic dependencies (.so files your binary is linked to)
Run nm on each .so file youf found in step 2.
That will give you the full list of dynamic symbols that your binary use.
Example:
nm -C --dynamic /bin/ls
....skipping.....
00000000006186d0 A _edata
0000000000618c70 A _end
U _exit
0000000000410e34 T _fini
0000000000401d88 T _init
U _obstack_begin
U _obstack_newchunk
U _setjmp
U abort
U acl_extended_file
U bindtextdomain
U calloc
U clock_gettime
U closedir
U dcgettext
U dirfd
All those symbols with capital "U" are used by ls command.
If your goal is to analyze C source files, you can do that by customizing the GCC compiler. You could use MELT for that purpose (MELT is a high-level domain specific language to extend GCC) -adding your own analyzing passes coded in MELT inside GCC-, but you should first learn about GCC middle-end internal representations (Gimple, Tree, ...).
Customizing GCC takes several days of work (mostly because GCC internals are quite complex in the details).
Feel free to ask me more about MELT.

Find start point, int main()

I am currently compiling a bought data stack in C. I use their own tool to compile it, using in the background gcc. I can pass flags and parameters to gcc as I see fit. I want to know, from which file is the main() used. That is, in the project, which file is the starting point. Is there any way to tell gcc to generate a list of files, or similar, given that I dont know from which file is main() being taken? Thank you.
You can disassemble the final executable to find the starting point. Although you have not provided any additional info to help you more. I'm using a sample code to demonstrate the process.
#include <stdio.h>
int main() {
printf("hello world\n");
return 0;
}
Now the object main.o has the following this
[root#s1 sf]# gcc -c main.c
[root#s1 sf]# nm main.o
0000000000000000 T main
U puts
You can see main is not initialized. Because it will changed in linking stage. Now after linking :
$gcc main.o
$nm a.out
U __libc_start_main##GLIBC_2.2.5
0000000000600874 A _edata
0000000000600888 A _end
00000000004005b8 T _fini
0000000000400390 T _init
00000000004003e0 T _start
000000000040040c t call_gmon_start
0000000000600878 b completed.6347
0000000000600870 W data_start
0000000000600880 b dtor_idx.6349
00000000004004a0 t frame_dummy
00000000004004c4 T main
You see that main has a address now. But its still not final. Because this main will called by C runtime dynamically. you can see who will do the part of U __libc_start_main##GLIBC_2.2.5:
[root#s1 sf]# ldd a.out
linux-vdso.so.1 => (0x00007fff61de1000) /* the linux system call interface */
libc.so.6 => /lib64/libc.so.6 (0x0000003c96000000) /* libc runime , this will invoke your main*/
/lib64/ld-linux-x86-64.so.2 (0x0000003c95c00000) /* dynamic loader */
Now you can verify this by viewing the disassembly :
00000000004003e0 <_start>:
..........
4003fd: 48 c7 c7 c4 04 40 00 mov rdi,0x4004c4 /* address of start of main */
400404: e8 bf ff ff ff call 4003c8 <__libc_start_main#plt> /* this will set up the environment for main, like pushing argc and argv to stack */
...........
If you don't have the source with you, then you can search in the executable for references to libc_start_main or main or start to see how your executable is initialized and starts the main.
Now all of these is done when linking is done with default linker script. Many big project will use its own linker script. If your project has custom linker script, then finding the start point will be different depending on the linker script used. There are projects which does not uses glibc's runtime. In that case, its still possible to find the start point by hacking the object files, library archives etc.
If your binary is stripped from symbols, then you have to actually rely on your assembler skill to find where it starts.
I've assumed that you don't have the source, that is the stack is distributed with some libraries and some header definitions only.(A common practice of commercial software vendors).
But if you have source with you, then its just too trivial. just grep your way through it. Some answers already pointed that out.
From where main() is called is implementation-dependent -- using GCC, it will most likely be a stub object file in /usr/lib called crt0.o or crt1.o from which it is called. (this file contains the OS-dependent symbol which is automatically invoked by the kernel when your app is loaded into memory. On Linux and Mac OS X, this is called start).
You can use objdump -t to list symbols from object files. So assuming you are on Linux, and also assuming that the object files are still around somewhere, you can do this:
find -name '*.o' -print0 \
| xargs -0 objdump -t \
| awk '/\.o:/{f=$1} /\.text\.main/{print f, $6}'
This will print a list of object files and the references to main they contain. Usually there should be a simple map from object files to source files. If there are multiple object files containing that symbol, then it depends on which one of those actually got linked into the binary you're looking at, as there can be no more than one main per executable binary (except perhaps for some really exotic black magic).
After the application is linked and debugging symbols are stripped, there usually is no indication from which source file a specific function came. The exception to this are files which include the function names as string literals, e.g. using the __FILE__ macro. Before stripping debugging symbols, you might use the debugger to obtain that information. If debugging symbols are included, that is.

how do I always include symbols from a static library?

Suppose I have a static library libx.a. How to I make some symbols (not all) from this library to be always present in any binary I link with my library? Reason is that I need these symbols to be available via dlopen+dlsym. I'm aware of --whole-archive linker switch, but it forces all object files from library archive to linked into resulting binary, and that is not what I want...
Observations so far (CentOS 5.4, 32bit) (upd: this paragraph is wrong; I could not reproduce this behaviour)
ld main.o libx.a
will happily strip all non-referenced symbols, while
ld main.o -L. -lx
will link whole library in. I guess this depends on version of binutils used, however, and newer linkers will be able to cherry-pick individual objects from a static library.
Another question is how can I achieve the same effect under Windows?
Thanks in advance. Any hints will be greatly appreciated.
Imagine you have a project which consists of the following three C files in the same folder;
// ---- jam.h
int jam_badger(int);
// ---- jam.c
#include "jam.h"
int jam_badger(int a)
{
return a + 1;
}
// ---- main.c
#include "jam.h"
int main()
{
return jam_badger(2);
}
And you build it with a boost-build bjam file like this;
lib jam : jam.c <link>static ;
lib jam_badger : jam ;
exe demo : jam_badger main.c ;
You will get an error like this.
undefined reference to `jam_badger'
(I have used bjam here because the file is easier to read, but you could use anything you want)
Removing the 'static' produces a working binary, as does adding static to the other library, or just using the one library (rather than the silly wrapping on inside the other)
The reason this happens is because ld is clever enough to only select the parts of the archive which are actually used, which in this case is none of them.
The solution is to surround the static archives with -Wl,--whole-archive and -Wl,--no-whole-archive, like so;
g++ -o "libjam_candle_badger.so" -Wl,--whole-archive libjam_badger.a Wl,--no-whole-archive
Not quite sure how to get boost-build to do this for you, but you get the idea.
First things first: ld main.o libx.a does not build a valid executable. In general, you should never use ld to link anything directly; always use proper compiler driver (gcc in this case) instead.
Also, "ld main.o libx.a" and "ld main.o -L. -lx" should be exactly equivalent. I am very doubtful you actually got different results from these two commands.
Now to answer your question: if you want foo, bar and baz to be exported from your a.out, do this:
gcc -Wl,-u,foo,-u,bar,-u,baz main.o -L. -lx -rdynamic
Update:
your statement: "symbols I want to include are used by library internally only" doesn't make much sense: if the symbols are internal to the library, why do you want to export them? And if something else uses them (via dlsym), then they are not internal to the library -- they are part of the library public API.
You should clarify your question and explain what you really are trying to achieve. Providing sample code will not hurt either.
I would start with splitting off those symbols you always need into a seperate library, retaining only the optional ones in libx.a.
Take an address of the symbol you need to include.
If gcc's optimiser anyway eliminates it, do something with this address - should be enough.

Resources