Does the linker refer to the main code - c

Let's assume I have three source files: main.c, a.c and b.c. main.c contains the main function and calls some (not all) of the functions defined in a.c. None of the functions defined in b.c are called by main.c. A makefile compiles all three source files and then links them to produce an executable, in my case an Intel HEX file. My question is: does the linker know which file the main function resides in, and does it use that to determine which parts of the object files to link together? If the linker produces the executable based only on the recipe of the rule that makes the target, then no matter how many functions our application code calls, the size of the executable will be the same, because the recipe says to link all the object files. For example, we compile the three source files and get three object files: main.o, a.o and b.o (the bigger the object files are, the bigger the executable is). I know you would say that if I don't want anything from b.c then I should not include it in the build. But that means that every time I want to change the application (include/exclude modules) I need to change the makefile too. Another thing: how does the linker know which part of an object file to take? Does it understand the C language? I hope you understand my question; excuse my bad English.

1) Does the linker know in which file the main function resides and knowing that to determine what part of the object files to link together?
Your toolchain (compiler/linker) may have options to enable this kind of optimization, i.e. removing unused functions at link time, but I strongly doubt it works for global functions (it could be possible for static functions).
2) And another thing is how the linker knows what part of the object file to take, does it understand the C language?
The linker may detect that a function or variable is not used by the application (once again, check the available options), but that is not really the purpose of this tool. However, if you compile some functions as library functions (see your toolchain's options), you can generate a "library" file and then link this library with the other object files. The functions of the library will then be included by the linker ONLY if they are used.
What I suggest: use conditional compilation (#ifdef ...) to include or exclude parts of the code from compilation and linking.

If you want only those functions in the executable that are eventually called from main, use a library of object files.
Basically, the smallest unit the linker will extract from a library is the object file. Any undefined symbols in that object file then have to be resolved as well, and so on until all symbols are resolved.
In other words, if none of the symbols in an object file are needed, it won't end up in the result. If at least one symbol is needed, it will get linked in its entirety.
No, the linker does not understand C. Note that a lot of language compilers create object files (C++, FORTRAN, ..., and assemblers). A linker resolves symbols, which are names attached to values.
John Levine has written a book, "Linkers and Loaders", available on the 'net, which will give you an in-depth understanding of linkers, symbols, and object files.

Related

List "never linked against" source file in C project

I would like to know if someone is aware of a trick to retrieve the list of files that had been (or ideally will be) used by linker to produce an executable.
Some kind of solution must exist: a static source analyzer, or a hack such as compiling with some unusual flags and analyzing the produced executable with another tool, or forcing the linker to output this information.
The goal is to provide a tool that strip useless source files from a list of source files.
The end goal is to ease the build process by allowing the user to give a list of usable source files. My tool would then compile only the ones actually used by the linker instead of everything.
This would allow some unit tests to still be runnable even if others are broken and can't compile, without asking the user to manually list every test's dependencies in the CMake files.
I am targeting Linux for now, but in the future I will be interested in doing the same trick on other OSes. So I would like a cross-platform solution, even though I doubt I will get one :)
Thanks for your help
Edit, because I see that this is confusing. What I mean by
allowing the user to give a list of usable source files
is that in CMake, for example, if you use add_executable(name, sources), then sources is taken as the set of sources to compile and link.
I want to wrap add_executable so that sources is viewed as a set of source files that may be used if necessary.
I'm afraid the idea of detecting never linked source files is not a fruitful one.
To build a program, CMake will not compile a source file if it is not going to link the resulting object
file into the program. I can understand how you might think that this happens, but it doesn't happen.
CMake already does what you would like it to do and the same is true of every other build automation system going back to
their invention in the 1970s. The fundamental purpose of all
such systems is to ensure that the building of a program
compiles a source file name.(c|cc|f|m|...) if and only if
the object file name.o is going to be linked into the program
and is out of date or does not exist. You can always defeat this purpose by
egregiously bad coding of the project's build spec (CMakeLists.txt, Makefile, SConstruct, etc.),
but with CMake you would need to be really trying to do it, and
trying quite expertly.
If you do not want name.c to be compiled and the object file name.o
linked into a target program, then you do not tell the build system
that name.o or name.c is a prerequisite of the program. Don't tell
it what you know is not true. It is elementary competence not to specify redundant prerequisites of
a build system target.
The linker will link all its input object files into an output
program without question. It does not ask whether or not they are "needed"
by the program because it cannot answer that question. Neither the
linker nor any possible static analysis tool can know what program
you intend to produce when you input some object files for linkage.
It can only be assumed that you intend to produce the program that
results from the linkage of those object files, assuming the
linkage is successful.
If those object files cannot be linked into a program at all, the linker will tell you
that, and why. Otherwise, if you have linked object files that you didn't
intend to link, you can only discover that for yourself, by noticing
the mistake in the build log, or failing that by testing the program and/or inspecting its contents and comparing
your observations with your expectations.
Given your choice of object files for linkage, you can instruct the linker
to detect any code sections or data sections it extracts from those object files in
which no symbols are defined that can be referenced by the program, and to
throw away all such unreferenced input sections instead of linking them
into the program. This is called linktime "garbage collection". You tell the
linker to do it by passing the option -Wl,-gc-sections in the
gcc linkage command. See this question
to learn how to maximise the collectible garbage. This is what you
can do to remove redundant object code from the linkage.
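As a minimal sketch of what section garbage collection works on, consider the translation unit below. The file name gc_demo.c, the two function names and the main.o mentioned in the comments are hypothetical, and the build commands assume GCC with GNU ld. The point is only that each function has to live in its own input section before the linker can discard one and keep the other.

```c
/* gc_demo.c (hypothetical file name)
 *
 * Build sketch, assuming GCC and GNU ld, and a separate main.o that
 * calls used_helper() but never unused_helper():
 *
 *   gcc -ffunction-sections -fdata-sections -c gc_demo.c
 *   gcc -Wl,--gc-sections -Wl,--print-gc-sections -o demo main.o gc_demo.o
 *
 * -ffunction-sections places each function in its own section
 * (.text.used_helper, .text.unused_helper), so the linker can drop the
 * unreferenced one. Without it, both functions share one .text section
 * and are kept or discarded together.
 */

/* Referenced by the (assumed) main.o, so its section is kept. */
int used_helper(int x)
{
    return x * 2;
}

/* Referenced by nothing, so its section is a candidate for collection. */
int unused_helper(int x)
{
    return x + 1000;
}
```

--print-gc-sections asks GNU ld to list the sections it removed, which is a convenient way to check whether anything was collected at all.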
But you can only collect any garbage from a program in this way if the program
is dynamically opaque, i.e. not linked with the option -rdynamic:
then the global symbols defined in the program's static image are not visible
to the OS loader and cannot be referenced from outside its static image by dynamic
libraries in the same process. In this case the linker can determine by static
analysis that a symbol whose definition is not referenced in the program's static
image cannot be referenced at all, since it cannot be referenced dynamically,
and if all symbols defined in an input section are statically unreferenced then
it can garbage-collect the section.
If the program has been linked -rdynamic then -Wl,-gc-sections will
collect no garbage, and this is quite right, because if the program is
not dynamically opaque then it is impossible for static analysis to determine that anything
defined in its linkage cannot be referenced.
It's noteworthy that although -rdynamic is not a default linkage
option for GCC, it is a default linkage option for CMake projects using
the GCC toolchain. So to use linktime garbage collection in CMake projects
you would always have to override the -rdynamic default. And obviously it would only be
valid to do this if you have determined that it is alright for the program to
be dynamically opaque.

Given a fully linked a.out, which defines symbol foo, how can I tell which object file that definition came from, without relinking the a.out?

I'm trying to parse an ELF file and create a list of symbols defined in each object file. I am able to find everything I need except for the link between symbols and object files.
I couldn't find anything like that in the ELF specifications.
In this particular file I'm parsing I have some embedded DWARF debug info I could use, but ideally I would like to find a link between symbols and objects that is standard as I want to apply this for many non GCC compilers.
I'm trying to parse an ELF file and create a list of symbols defined in each object file.
Are you processing individual ELF object files, or a fully linked executable or shared library? Since the only way for your question to make sense is the latter, let's assume that your actual question is:
Given a fully linked a.out, which defines symbol foo, how can I tell which object file that definition came from, without relinking the a.out?.
In general, you can't.
First, not every symbol defined in a.out may even come from an object file: some may be defined via a linker script or a --defsym command line argument.
Second, weak symbols could be defined in multiple object files, and the linker is free to choose any one of them.
Last, there is absolutely no record of object file -> symbol association in the a.out. In fact, you can't even extract the list of .o files that were linked in (without redoing the link and asking the linker to print them).
You may be able to reestablish this association by looking at the debug info, which will tell you what translation unit the symbol came from, and then guess that probably foo.c was compiled into foo.o, but this again may fail as foo.c may have been compiled into bar.o and baz.o (with different -DFOO defines).
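To illustrate the point about weak symbols, here is a small sketch using the weak attribute, which is a GCC/Clang extension (assumed here to target ELF). The function names log_level and effective_log_level are made up for the example.

```c
/* Weak default definition (GCC/Clang extension, ELF targets assumed).
 * If another object file in the same link provides an ordinary (strong)
 * definition of log_level(), the linker uses that one instead and this
 * body is ignored. Nothing in the final a.out records which object file
 * supplied the definition that was chosen. */
__attribute__((weak)) int log_level(void)
{
    return 0;   /* default used when no strong definition is linked in */
}

/* Callers see only the symbol name, not where its definition came from. */
int effective_log_level(void)
{
    return log_level();
}
```

Linked on its own, this file uses the weak default. Adding a second object file with a plain int log_level(void) definition changes the result without any change to this file.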

C questions about static and shared libraries

I have some .c and .h files with the main function encapsulated in the MAIN_FUNC.c. I need to pass them to a guy who is going to integrate the MAIN_FUNC() with his files.
However since my algorithm is confidential I can't just send the .c and .h files and so I've been looking into static and shared libraries. However I still have some doubts.
1: In every tutorial that I've seen the .hs are needed as well. Is there any way that I can send the guy just one single library file that he can #include in his code?
2: Even if I have to pass the .hs files, do I really need to pass all of them? How can I give him only the libMAIN_FUNC.a and the MAIN_FUNC.h?
3: With the .a or .so libraries, is there any way of reverse engineering the files so that one can see the .c and .h code?
1: No, you must provide him with at least one .h file.
2: No, you need to pass only those headers that define the interface between your library and its user. I suggest you read about the pimpl idiom.
3: Theoretically yes, your .a and .so files can be reverse-engineered, but it is very nontrivial.
My understanding is that you have a .c and .h file that someone else will be implementing, but you want to keep your code confidential.
If your only concern is handing out source code, then there is always the option of partial compilation. If you have gone through the trouble of making sure your code works without issue, you can partially compile your program into a .o file.
I don't know the details of your code, but if this other person you've mentioned will just be implementing your functions like a library, then the .o is all he would need.
Sample makefile for your end:
all: MAIN_FUNC.o
MAIN_FUNC.o: MAIN_FUNC.c MAIN_FUNC.h
	gcc -c MAIN_FUNC.c
Sample makefile for the other guy's end:
all: main
main: main.c main.h MAIN_FUNC.o
	gcc -o main main.c MAIN_FUNC.o
A lot of companies do this sort of thing in order to protect their property. When one company sells software to another, they oftentimes sell these .o files. You would only need to provide the knowledge of what the function does (i.e. "This function takes an input from the console and returns the number of words written as an integer")--something basic that would allow the implementation of your work without revealing your source code.
First things first, on reverse engineering. Given infinite time and resources, your code can always be reverse engineered. Having said that, your objective is to make it impractical for others to reverse engineer your code.
Now to answer your question:
Generating an executable binary from C code happens in two major steps: compiling and linking.
After compiling, your files.c become object files (machine code). They are not executable yet.
If you have two files: file1.c and file2.c, you will get file1.o and file2.o for example.
Now, the code in file1.c may call a function which is defined in file2.c. At the compilation stage, all that file1.c needs to know is the function prototype.
When the linker is invoked to generate the executable binary, it makes sure that the function called from file1.o exists somewhere, such as in file2.o.
How this affects you:
The header file should not be proprietary (but perhaps it is for legal reasons). The header file is mainly used to tell other .c files what functions and return values to expect (declaration, not implementation).
Now perhaps you have some proprietary function prototypes for whatever reason which you don't want to expose to the world. Say you want the world to start your code by calling the function
start_magic();
Then, what you do is:
Provide a header file: magic.h to be included in the main.c
header file will have the function: void start_magic();
You then put your proprietary code in algo.c and algo.h
algo.c will have start_magic() implementation
algo.c will include proprietary algo.h
Now what you can do is compile (no linking) your algo.c file, and strip the debugging symbols to make it hard to reverse engineer. How this is done depends on the compiler you are using.
Now you can provide the object file and the header file to somebody who wants to call the function start_magic().
The implementer of main has to link the program using the object file you provided.
Example
Assume you have algo.c with your algorithms. Let us say algo.c has the function:
float sqrt(float x){
    return taylor_approx(x);
}
Suppose that sqrt function will be shared with supplier. However, sqrt function calls on proprietary function taylor_approx(x) to calculate the square root.
You can create an algo.h file to be sent to the users, which contains:
extern float sqrt(float x);
Then you can send your compiled object file, stripped of debugging symbols (for example, algo.o), to the users and ask them to include algo.h in their main.c.
Note that this is one way to do it.
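A minimal sketch of the public/private split described above, under some assumptions: the names magic_sqrt and approx_root are hypothetical (chosen to avoid clashing with the standard library's sqrt), and Newton's iteration merely stands in for whatever the proprietary algorithm is. Marking the helper static gives it internal linkage, so its name is not exported from the object file at all.

```c
/* algo.c (sketch): one public entry point, one private helper. */

/* Private helper. 'static' gives it internal linkage, so other object
 * files cannot link against it. Newton's method is only a stand-in for
 * the confidential algorithm. */
static float approx_root(float x)
{
    float guess = x > 1.0f ? x : 1.0f;
    for (int i = 0; i < 30; i++)
        guess = 0.5f * (guess + x / guess);
    return guess;
}

/* Public function. The header shipped to users would contain only:
 *     float magic_sqrt(float x);
 */
float magic_sqrt(float x)
{
    if (x <= 0.0f)
        return 0.0f;    /* sketch: no error reporting for negative input */
    return approx_root(x);
}
```

The user's main.c includes the header and calls magic_sqrt(); the helper's name never has to appear in anything you ship.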
1) You can compile your C file into a library (*.a or whatever), provided it is written properly, and distribute it along with the .h file. You have to provide the .h files because they are the interface to your library, which is just a binary blob otherwise.
2) You need to pass the headers declaring the public interface your library exports, i.e. the functions and symbols you want the user of the library to have access to.
3) Yes, there is always a way to reverse engineer just about anything. The only question is the gain/effort ratio.

Does program need additional symbols from .so shared library except those declared in header file?

In C programming, I thought that an object file can be successfully linked with a .so file as long as the .so file offers all the symbols that are declared in the header file.
Suppose I have foo.c, bar.h and two libraries, libbar.so.1 and libbar.so.2. The implementations of libbar.so.1 and libbar.so.2 are totally different, but I think that's OK as long as they both offer the functions declared in bar.h.
I linked foo.o with libbar.so.1 and produced an executable, foo.bin. This executable works when libbar.so.1 is in LD_LIBRARY_PATH (of course, a symbolic link named libbar.so is made). However, when I change the symbolic link to point to libbar.so.2, foo.bin cannot run and complains:
undefined symbol: _ZSt4cerr
libbar.so.1 is a library built from C++, while libbar.so.2 is built from C. I don't understand why foo.bin needs those C++-related symbols, which are only meaningful inside libbar.so.1 itself, since foo.bin is built from the pure C code in foo.c.
_ZSt4cerr is obviously a mangled C++ name. You may need to check whether you are using the right compiler (gcc vs. g++; I know it sounds stupid, but I have run into such confusion myself ;) ), and whether there are any macros in the bar.h file that could have referenced cerr.
You must demangle c++ name before searching. For gcc there is a c++filt utility:
$ c++filt
_ZSt4cerr
std::cerr
It is just standard error file stream.
You probably just forgot to link the whole program with the C++ standard library.
The statement in the question is just wrong. You say, 'the header file.' There is no such thing as 'the (one and only) header file.' If you mean 'the header file declaring a certain C++ class', well, that class might inherit from other classes. Or it might use exceptions. Or RTTI. In which case, by default, the .so containing the code that goes with it will contain 'hanging undefined symbols'. By default, the expectation is that the 'main' program is in C++, and it links to the C++ runtime.
It is possible to create a self-contained .so, but you have to do extra work to create it. You might need to use -Bsymbolic, or specify some -l libraries when linking it, or both. This area is not well documented and generally requires some archaeology.

How do linkers decide what parts of libraries to include?

Assume library A has a() and b(). If I link my program B with A and call a(), does b() get included in the binary? Does the compiler see if any function in the program call b() (perhaps a() calls b() or another lib calls b())? If so, how does the compiler get this information? If not, isn't this a big waste of final compile size if I'm linking to a big library but only using a minor feature?
Take a look at link-time optimization. This is necessarily vendor dependent. It will also depend how you build your binaries. MS compilers (2005 onwards at least) provide something called Function Level Linking -- which is another way of stripping symbols you don't need. This post explains how the same can be achieved with GCC (this is old, GCC must've moved on but the content is relevant to your question).
Also take a look at the LLVM implementation (and the examples section).
I suggest you also take a look at Linkers and Loaders by John Levine -- an excellent read.
It depends.
If the library is a shared object or DLL, then everything in the library is loaded, but at run time. The cost in extra memory is (hopefully) offset by sharing the library (really, the code pages) between all the processes in memory that use that library. This is a big win for something like libc.so, less so for myreallyobscurelibrary.so. But you probably aren't asking about shared objects, really.
Static libraries are simply a collection of individual object files, each the result of a separate compilation (or assembly), and possibly not even written in the same source language. Each object file has a number of exported symbols, and almost always a number of imported symbols.
The linker's job is to create a finished executable that has no remaining undefined imported symbols. (I'm lying, of course, if dynamic linking is allowed, but bear with me.) To do that, it starts with the modules named explicitly on the link command line (and possibly implicitly in its configuration) and assumes that any module named explicitly must be part of the finished executable. It then attempts to find definitions for all of the undefined symbols.
Usually, the named object modules expect to get symbols from some library such as libc.a.
In your example, you have a single module that calls the function a(), which will result in the linker looking for a module that exports a().
You say that the library named A (on unix, probably libA.a) offers a() and b(), but you don't specify how. You implied that a() and b() do not call each other, which I will assume.
If libA.a was built from a.o and b.o where each defines the corresponding single function, then the linker will include a.o and ignore b.o.
However, if libA.a included ab.o that defined both a() and b() then it will include ab.o in the link, satisfying the need for a(), and including the unused function b().
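The a.o/b.o versus ab.o cases above can be modeled with a toy resolver. This is only a sketch of the member-selection idea, not how any real linker is implemented: struct member, select_members and the fixed-size arrays are all invented for the illustration, and it assumes at most 32 members per archive.

```c
#include <string.h>

#define MAX_SYMS 4       /* symbol slots per member in this toy model */
#define MAX_PENDING 32   /* upper bound on symbols waiting to be resolved */

/* One object file inside a static library; unused slots are NULL. */
struct member {
    const char *name;
    const char *defines[MAX_SYMS];  /* exported symbols */
    const char *needs[MAX_SYMS];    /* imported (undefined) symbols */
};

static int member_defines(const struct member *m, const char *sym)
{
    for (int i = 0; i < MAX_SYMS && m->defines[i]; i++)
        if (strcmp(m->defines[i], sym) == 0)
            return 1;
    return 0;
}

/* roots: NULL-terminated list of symbols the program itself leaves
 * undefined. Returns a bitmask in which bit i is set if members[i] is
 * pulled into the link. A member is always taken whole. */
unsigned select_members(const struct member *members, int count,
                        const char *roots[])
{
    const char *pending[MAX_PENDING];
    int npending = 0;
    unsigned included = 0;

    for (int i = 0; roots[i] && npending < MAX_PENDING; i++)
        pending[npending++] = roots[i];

    /* The list grows as extracted members bring their own imports. */
    for (int p = 0; p < npending; p++) {
        for (int m = 0; m < count; m++) {
            if (!member_defines(&members[m], pending[p]))
                continue;
            if (!(included & (1u << m))) {
                included |= 1u << m;
                for (int k = 0; k < MAX_SYMS && members[m].needs[k]
                                && npending < MAX_PENDING; k++)
                    pending[npending++] = members[m].needs[k];
            }
            break;  /* the first member defining the symbol wins */
        }
    }
    return included;
}
```

With members a.o (defining a) and b.o (defining b) and a single root symbol a, only a.o is selected. With one member ab.o defining both, that member is selected as a whole, so b() comes along even though nothing references it.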
As others have mentioned, there are linkers that are capable of splitting individual functions out of modules, and including only those that are actually used. In many cases, that is a safe thing to do. But it is usually safest to assume that your linker does not do that unless you have specific documentation.
Something else to be aware of is that most linkers make as few passes as they can through the files and libraries that are named on the command line, and build up their symbol table as they go. As a practical matter, this means that it is good practice to always specify libraries after all of the object modules on the link command line.
It depends on the linker.
eg. Microsoft Visual C++ has an option "Enable function level linking" so you can enable it manually.
(I assume they have a reason for not just enabling it all the time...maybe linking is slower or something)
Usually (static) libraries are composed of objects created from source files. What linkers usually do is include the object if a function provided by that object is referenced. If your source file contains only one function, then only that function will be brought in by the linker. There are more sophisticated linkers out there, but most C-based linkers still work as outlined. There are tools available that split C source files containing multiple functions into artificially smaller source files to make static linking more fine-grained.
If you are using shared libraries, then you don't affect your compiled size by using more or fewer of them. However, your runtime size will include them.
This lecture at Academic Earth gives a pretty good overview, linking is talked about near the later half of the talk, IIRC.
Without any optimization, yes, it'll be included. The linker, however, might be able to optimize out by statically analyzing the code and trying to remove unreachable code.
It depends on the linker, but in general only functions that are actually called get included in the final executable. The linker works by looking up the function name in the library and then using the code associated with the name.
There are very few books on linkers, which is strange when you think how important they are. The text for a good one can be found here.
It depends on the options passed to the linker, but typically the linker will leave out the object files in a library that are not referenced anywhere.
$ cat foo.c
int main(){}
$ gcc -static foo.c
$ size
text data bss dec hex filename
452659 1928 6880 461467 70a9b a.out
# force linking of libz.a even though it isn't used
$ gcc -static foo.c -Wl,-whole-archive -lz -Wl,-no-whole-archive
$ size
text data bss dec hex filename
517951 2180 6844 526975 80a7f a.out
It depends on the linker and how the library was built. Usually libraries are a combination of object files (import libraries are a major exception to this). Older linkers would pull things into the output file image at a granularity of the object files that were put into the library. So if function a() and function b() were both in the same object file, they would both be in the output file - even if only one of the 2 functions were actually referenced.
This is a reason why you'll often see library-oriented projects with a policy of a single C function per source file. That way each function is packaged in its own object file and linkers have no problem pulling in only what is referenced.
Note however that newer linkers (certainly newer Microsoft linkers) have the ability to pull in only parts of object files that are referenced, so there's less of a need today to enforce a one-function-per-source-file policy - though there are reasonable arguments that that should be done anyway for maintainability.

Resources