What is responsible for ensuring all symbols are known/defined? - c

Is it the C preprocessor, compiler, or linkage editor?

To tell you the truth, it is programmer.

The answer you are looking for is... the compiler it depends. Sometimes it's the compiler, sometimes it's the linker, and sometimes it doesn't happen until the program is loaded.
The preprocessor:
handles directives for source file inclusion (#include), macro definitions (#define), and conditional inclusion (#if).
...
The language of preprocessor directives is agnostic to the grammar of C, so the C preprocessor can also be used independently to process other kinds of text files.
The linker:
takes one or more objects generated by a compiler and combines them into a single executable program.
...
Computer programs typically comprise several parts or modules; all
these parts/modules need not be contained within a single object file,
and in such case refer to each other by means of symbols. Typically,
an object file can contain three kinds of symbols:
defined symbols, which allow it to be called by other modules,
undefined symbols, which call the other modules where these symbols are defined, and
local symbols, used internally within the object file to facilitate relocation.
When a program comprises multiple object files, the linker combines
these files into a unified executable program, resolving the
symbols as it goes along.
In environments which allow dynamic linking, it is possible that
executable code still contains undefined symbols, plus a list of objects or libraries that will provide definitions for these.

The programmer must make sure everything is defined somewhere. The programmer is RESPONSIBLE for doing so.
Various tools will complain along the way if they notice anything missing:
The compiler will notice certain things missing, and will error out if it can realize that something's not there.
The linker will error out if it can't fix up a reference that's not in a library somewhere.
At run time there is a loader that pulls the relevant shared libraries into the process's memory space. The loader is the last thing that gets a crack at fixing up symbols before the program gets to run any code, and it will throw errors if it can't find a shared library/dll, or if the interface for the library that was used at link-time doesn't match up correctly with the available library.
None of these tools is RESPONSIBLE for making sure everything is defined. They are just the things that will notice if things are NOT defined, and will be the ones throwing the error message.

For symbols with internal linkage or no linkage: the compiler.
For symbols with external linkage: the linker, either the "traditional" one, or the runtime linker.
Note that the dynamic/runtime linker may choose to do its job lazily, resolving symbols only when they are used (e.g: when a function is called for the first time).

Related

Except linking object files of current code and other included header files, what other task linker do?

#include <stdio.h>
#include<a.h> // other file which have function "fun()" declaration
int main()
{
int a =100;
printf("print fun = %d", fun());
return 0;
}
Except linking object files of current code and other included header files, what other task linker do?
see below dummy code:
This answer applies to linked programs (e.g., C, C++) , as opposed to interpreted programs (e.g., shell scripts, most byte-code languages, interpreters). The shades of grey for byte-code languages are ignored here.
Linkers start by collecting the object modules (called .o files herein) and libraries the command line points it to, building a list of all provided and referenced global names therein. By this time the actual source code has been left behind. Ignoring debugging information, described later, effectively only the names of global variables and functions remain in .o files along with associated values.
The object code tracks what parts of .o files are variables and what parts are executable code along with the names and locations of external entries. Some languages (C++) track argument signatures though others (C) do not.
Keeping it simple, after including .o files on the command line in the build, the linker's major job is to look through those .o files to extract references to all externals and then hunt through the libraries to satisfy those externals. The whole lot is bundled up into your executable or, in the case of dynamic libraries loaded during execution, appropriate links to loadable modules are put in the executable.
The linking process assigns everything linkers put into executables a memory addresses to reside at. The linker puts all this into the executable along with additional metadata, such as file headings and optional debugging information, in a way loaders can nicely extract when the program runs. This is usually divided into program segments for executable code, data segments when the program's variables that have initial values associated with them live, and places for variables that have no specific values live that get the default value of binary zeros. These segments provide the run time layout of the executable.
A few basic things, like the names and locations of functions, are often kept in all builds, but this can be so minimal to be worthless in debugging most failures, which leads developers into using -g during development in most POSIX/Linus type compilers.
Debugging information for .o modules compiled under -g (or whatever) will be discarded if the linker is not told to keep it in the build. This information includes all of the external variable names, functions, their addresses, and more; all that stuff you can see with debuggers is included here, and adds considerable to the size of executables. This often includes associating locations of functions, and code therein, to the original source files.
Linkers also identify where the execution is to start at. A small but critical thing and is not the "main()" type functions that beginners tend to believe is the start of their code's run but some place deep with the language library that is automagically included into the build in perhaps obscure ways. This startup code assures all the stuff needed for successful execution happens, like getting ready to handle malloc(), use inherited open files, setting up environment variables, setting up the stack and heap, and the like. Once all this is done main()'s code is used to kick off the user's code. While the linker has nothing to do with this initialization code, nor the final completion code that runs when the program exits or the main returns, the linker does point assure this start up code is present, which also includes properly exit when user code is done.
Once built the linker is done, with the big exclusion of possibility handling dynamic modules that get loaded when called. How such are handled are heavily dependent on the OS and the nature of the link. Sometimes a linker can be involved and sometimes it is "just" loading a module file.
In my view the "loader" that reads in executables into the proper memory, zeros out some chunks of memory, and other start up handling, as a key partner with the linker. While a key partner it is definitely a separate program.
Basically, the linker links object files (the result of compiling your various source code files (not headers), the standard library, the math library, the ncurses library, the big integer library, ...) and adds management data to make a recognizable executable (ELF format, EXE format, PE format, ...).

List "never linked against" source file in C project

I would like to know if someone is aware of a trick to retrieve the list of files that had been (or ideally will be) used by linker to produce an executable.
Some kind of solution must exist. A a static source analyzer, or a hack, such as compiling with some weird flags, and analyzing produced executable with another tool, or force the linker to output this information.
The goal is to provide a tool that strip useless source files from a list of source files.
The end goal is to ease the build process, by allowing him to give a list of usable source files. Then my tool would only compile the ones actually used by linker instead of everything.
This would allow for some unit_test to still be runnable even if some others are broken and can't compile, while not asking the user to manually list every test dependencies manually in the cmake.
I am targetting linux for now, but will be intersted in the futur to do the same trick on others OS. So I would like a cross-platform solution, eventhought I doubt I will have it :)
Thanks for your help
Edit because I see that it is confusing, what I mean by
allowing him to give a list of usable source file
is that, in cmake, for exemple. If you use add_executable(name, sources), then sources is considered as the sources to compile and link on.
I want to wrap add_executable, so sources is viewed as a set of usable if necessary sources files.
I'm afraid the idea of detecting never linked source files is not a fruitful one.
To build a program, CMake will not compile a source file if it not going to link the resulting object
file into the program. I can understand how you might think that this happens, but it doesn't happen.
CMake already does what you would like it to do and the same is true of every other build automation system going back to
their invention in the 1970s. The fundamental purpose of all
such systems is to ensure that the building of a program
compiles a source file name.(c|cc|f|m|...) if and only if
the object file name.o is going to be linked into the program
and is out of date or does not exist. You can always defeat this purpose by
egregiously bad coding of the project's build spec (CMakeLists.txt, Makefile, SConstruct, etc.),
but with CMake you would need to be really trying to do it, and
trying quite expertly.
If you do not want name.c to be compiled and the object file name.o
linked into a target program, then you do not tell the build system
that name.o or name.c is a prerequisite of the program. Don't tell
it what you know is not true. It is elementary competence not to specify redundant prerequisites of
a build system target.
The linker will link all its input object files into an output
program without question. It does not ask whether or not they are "needed"
by the program because it cannot answer that question. Neither the
linker nor any possible static analysis tool can know what program
you intend to produce when you input some object files for linkage.
It can only be assumed that you intend to produce the program that
results from the linkage of those object files, assuming the
linkage is successful.
If those object files cannot be linked into a program at all, the linker will tell you
that, and why. Otherwise, if you have linked object files that you didn't
intend to link, you can only discover that for yourself, by noticing
the mistake in the build log, or failing that by testing the program and/or inspecting its contents and comparing
your observations with your expectations.
Given your choice of object files for linkage, you can instruct the linker
to detect any code sections or data sections it extracts those object files in
which no symbols are defined that can be referenced by the program, and to
throw away all such unreferenced input sections instead of linking them
into the program. This is called linktime "garbage collection". You tell the
linker to do it by passing the option -Wl,-gc-sections in the
gcc linkage command. See this question
to learn how to maximise the collectible garbage. This is what you
can do to remove redundant object code from the linkage.
But you can only collect any garbage from a program in this way if the program
is dynamically opaque, i.e not linked with the option -rdynamic
: then the global symbols defined in the program's static image are not visible
to the OS loader and cannot be referenced from outside its static image by dynamic
libraries in the same process. In this case the linker can determine by static
analysis that a symbol whose definition is not referenced in the program's static
image cannot be referenced at all, since it cannot be referenced dynamically,
and if all symbols defined in an input section are statically unreferenced then
it can garbage-collect the section.
If the program has been linked -rdynamic then -Wl,-gc-sections will
collect no garbage, and this is quite right, because if the program is
not dynamically opaque then it is impossible for static analysis to determine that anything
defined in its linkage cannot be referenced.
It's noteworthy that although -rdynamic is not a default linkage
option for GCC, it is a default linkage option for CMake projects using
the GCC toolchain. So to use linktime garbage collection in CMake projects
you would always have to override the -rdynamic default. And obviously it would only be
valid to do this if you have determined that it is alright for the program to
be dynamically opaque.

Determining the space used by library functions in C

I have a bunch of code I need to analyse that I don't know how to do. I have a pile of code that here and there are using math functions from a header file I have included called math.h that came with my IDE. I am being asked to see how much space is used to include this. Specifically is the compiler including all of the library functions or just the ones we use. There is no object file being created. So I think the library code is being compiled into the individual files. Any ideas of a slick way to figure this out? I can't just comment out the includes because then the code wont complie and I won't know a size and if I comment out all the lines that use math functions it is not really representative.
You can use the objdump command to see the individual symbols inside your object files and the space they require.
Note that unless you're doing static compilation, library methods aren't generally copied into your produced binary, but only referenced (and brought in via the dynamic linker when your program is loaded).
As math.h is part of the standard C library, a copy of that library is basically guaranteed to always be in memory, so the additional memory and space requirements on dynamic linking are minimal. (During static linking, all symbols which aren't directly required by your program are discarded, and math functions don't tend to be very big, so usage should be fairly minimal there too).
The code in the header file is being complied into the object file of the .c you are using if your header has the definitions of the functions and just being referenced to if they are simply the declarations. The linker will then find a definition for each symbol and place it in your executable if you are using dynamic linking the OS will pull in the definition at run time.

How to catch unintentional function interpositioning?

Reading through my book Expert C Programming, I came across the chapter on function interpositioning and how it can lead to some serious hard to find bugs if done unintentionally.
The example given in the book is the following:
my_source.c
mktemp() { ... }
main() {
mktemp();
getwd();
}
libc
mktemp(){ ... }
getwd(){ ...; mktemp(); ... }
According to the book, what happens in main() is that mktemp() (a standard C library function) is interposed by the implementation in my_source.c. Although having main() call my implementation of mktemp() is intended behavior, having getwd() (another C library function) also call my implementation of mktemp() is not.
Apparently, this example was a real life bug that existed in SunOS 4.0.3's version of lpr. The book goes on to explain the fix was to add the keyword static to the definition of mktemp() in my_source.c; although changing the name altogether should have fixed this problem as well.
This chapter leaves me with some unresolved questions that I hope you guys could answer:
Does GCC have a way to warn about function interposition? We certainly don't ever intend on this happening and I'd like to know about it if it does.
Should our software group adopt the practice of putting the keyword static in front of all functions that we don't want to be exposed?
Can interposition happen with functions introduced by static libraries?
Thanks for the help.
EDIT
I should note that my question is not just aimed at interposing over standard C library functions, but also functions contained in other libraries, perhaps 3rd party, perhaps ones created in-house. Essentially, I want to catch any instance of interpositioning regardless of where the interposed function resides.
This is really a linker issue.
When you compile a bunch of C source files the compiler will create an object file for each one. Each .o file will contain a list of the public functions in this module, plus a list of functions that are called by code in the module, but are not actually defined there i.e. functions that this module is expecting some library to provide.
When you link a bunch of .o files together to make an executable the linker must resolve all of these missing references. This is the point where interposing can happen. If there are unresolved references to a function called "mktemp" and several libraries provide a public function with that name, which version should it use? There's no easy answer to this and yes odd things can happen if the wrong one is chosen
So yes, it's a good idea in C to "static" everything unless you really do need to use it from other source files. In fact in many other languages this is the default behavior and you have to mark things "public" if you want them accessible from outside.
It sounds like what you want is for the tools to detect that there are name conflicts in functions - ie., you don't want your externally accessible function names form accidentally having the same name and therefore 'override' or hide functions with the same name in a library.
There was a recent SO question related to this problem: Linking Libraries with Duplicate Class Names using GCC
Using the --whole-archive option on all the libraries you link against may help (but as I mentioned in the answer over there, I really don't know how well this works or how easy it is to convince builds to apply the option to all libraries)
Purely formally, the interpositioning you describe is a straightforward violation of C language definition rules (ODR rule, in C++ parlance). Any decent compiler must either detect these situations, or provide options for detecting them. It is simply illegal to define more than one function with the same name in C language, regardless of where these functions are defined (Standard library, other user library etc.)
I understand that many platforms provide means to customize the [standard] library behavior by defining some standard functions as weak symbols. While this is indeed a useful feature, I believe the compilers must still provide the user with means to enforce the standard diagnostics (on per-function or per-library basis preferably).
So, again, you should not worry about interpositioning if you have no weak symbols in your libraries. If you do (or if you suspect that you do), you have to consult your compiler documentation to find out if it offers you with means to inspect the weak symbol resolution.
In GCC, for example, you can disable the weak symbol functionality by using -fno-weak, but this basically kills everything related to weak symbols, which is not always desirable.
If the function does not need to be accessed outside of the C file it lives in then yes, I would recommend making the function static.
One thing you can do to help catch this is to use an editor that has configurable syntax highlighting. I personally use SciTE, and I have configured it to display all standard library function names in red. That way, it's easy to spot if I am re-using a name I shouldn't be using (nothing is enforced by the compiler, though).
It's relatively easy to write a script that runs nm -o on all your .o files and your libraries and checks to see if an external name is defined both in your program and in a library. Just one of the many sane sensible services that the Unix linker doesn't provide because it's stuck in 1974, looking at one file at a time. (Try putting libraries in the wrong order and see if you get a useful error message!)
The Interposistioning occurs when the linker is trying to link separate modules.
It cannot occur within a module. If there are duplicate symbols in a module the linker will report this as an error.
For *nix linkers, unintended Interposistioning is a problem and it is difficult for the linker to guard against it.
For the purposes of this answer consider the two linking stages:
The linker links translation units into modulles (basically
applications or libraries).
The linker links any remaining unfound symbols by searching in modules.
Consider the scenario described in 'Expert C programming' and in SiegeX's question.
The linker fist tries to build the application module.
It sess that the symbol mktemp() is an external and tries to find a funcion definiton for the symbol. The linker finds
the definition for the function in the object code of the application module and marks the symbol as found.
At this stage the symbol mktemp() is completely resolved. It is not considered in any way tentative so as to allow
for the possibility that the anothere module might define the symbol.
In many ways this makes sense, since the linker should first try and resolve external symbols within the module it is
currently linking. It is only unfound symbols that it searches for when linking in other modules.
Furthermore, since the symbol has been marked as resolved, the linker will use the applications mktemp() in any
other cases where is needs to resolve this symbol.
Thus the applications version of mktemp() will be used by the library.
A simple way to guard agains the problem is to try and make all external sysmbols in your application or library unique.
For modules that are only going to shared on a limited basis, this can fairly easily be done by making sure all
extenal symbols in your module are unique by appending a unique identifier.
For modules that are widely shared making up unique names is a problem.

How do linkers decide what parts of libraries to include?

Assume library A has a() and b(). If I link my program B with A and call a(), does b() get included in the binary? Does the compiler see if any function in the program call b() (perhaps a() calls b() or another lib calls b())? If so, how does the compiler get this information? If not, isn't this a big waste of final compile size if I'm linking to a big library but only using a minor feature?
Take a look at link-time optimization. This is necessarily vendor dependent. It will also depend how you build your binaries. MS compilers (2005 onwards at least) provide something called Function Level Linking -- which is another way of stripping symbols you don't need. This post explains how the same can be achieved with GCC (this is old, GCC must've moved on but the content is relevant to your question).
Also take a look at the LLVM implementation (and the examples section).
I suggest you also take a look at Linkers and Loaders by John Levine -- an excellent read.
It depends.
If the library is a shared object or DLL, then everything in the library is loaded, but at run time. The cost in extra memory is (hopefully) offset by sharing the library (really, the code pages) between all the processes in memory that use that library. This is a big win for something like libc.so, less so for myreallyobscurelibrary.so. But you probably aren't asking about shared objects, really.
Static libraries are a simply a collection of individual object files, each the result of a separate compilation (or assembly), and possibly not even written in the same source language. Each object file has a number of exported symbols, and almost always a number of imported symbols.
The linker's job is to create a finished executable that has no remaining undefined imported symbols. (I'm lying, of course, if dynamic linking is allowed, but bear with me.) To do that, it starts with the modules named explicitly on the link command line (and possibly implicitly in its configuration) and assumes that any module named explicitly must be part of the finished executable. It then attempts to find definitions for all of the undefined symbols.
Usually, the named object modules expect to get symbols from some library such as libc.a.
In your example, you have a single module that calls the function a(), which will result in the linker looking for module that exports a().
You say that the library named A (on unix, probably libA.a) offers a() and b(), but you don't specify how. You implied that a() and b() do not call each other, which I will assume.
If libA.a was built from a.o and b.o where each defines the corresponding single function, then the linker will include a.o and ignore b.o.
However, if libA.a included ab.o that defined both a() and b() then it will include ab.o in the link, satisfying the need for a(), and including the unused function b().
As others have mentioned, there are linkers that are capable of splitting individual functions out of modules, and including only those that are actually used. In many cases, that is a safe thing to do. But it is usually safest to assume that your linker does not do that unless you have specific documentation.
Something else to be aware of is that most linkers make as few passes as they can through the files and libraries that are named on the command line, and build up their symbol table as they go. As a practical matter, this means that it is good practice to always specify libraries after all of the object modules on the link command line.
It depends on the linker.
eg. Microsoft Visual C++ has an option "Enable function level linking" so you can enable it manually.
(I assume they have a reason for not just enabling it all the time...maybe linking is slower or something)
Usually (static) libraries are composed of objects created from source files. What linkers usually do is include the object if a function that is provided by that object is referenced. if your source file only contains one function than only that function will be brought in by the linker. There are more sophisticated linkers out there but most C based linkers still work like outlined. There are tools available that split C source that contain multiple functions into artificially smaller source files to make static linking more fine granular.
If you are using shared libraries then you don't impact you compiled size by using more or less of them. However your runtime size will include them.
This lecture at Academic Earth gives a pretty good overview, linking is talked about near the later half of the talk, IIRC.
Without any optimization, yes, it'll be included. The linker, however, might be able to optimize out by statically analyzing the code and trying to remove unreachable code.
It depends on the linker, but in general only functions that are actually called get included in the final executable. The linker works by looking up the function name in the library and then using the code associated with the name.
There are very few books on linkers, which is strange when you think how important they are. The text for a good one can be found here.
It depends on the options passed to the linker, but typically the linker will leave out the object files in a library that are not referenced anywhere.
$ cat foo.c
int main(){}
$ gcc -static foo.c
$ size
text data bss dec hex filename
452659 1928 6880 461467 70a9b a.out
# force linking of libz.a even though it isn't used
$ gcc -static foo.c -Wl,-whole-archive -lz -Wl,-no-whole-archive
$ size
text data bss dec hex filename
517951 2180 6844 526975 80a7f a.out
It depends on the linker and how the library was built. Usually libraries are a combination of object files (import libraries are a major exception to this). Older linkers would pull things into the output file image at a granularity of the object files that were put into the library. So if function a() and function b() were both in the same object file, they would both be in the output file - even if only one of the 2 functions were actually referenced.
This is a reason why you'll often see library-oriented projects with a policy of a single C function per source file. That way each function is packaged in its own object file and linkers have no problem pulling in only what is referenced.
Note however that newer linkers (certainly newer Microsoft linkers) have the ability to pull in only parts of object files that are referenced, so there's less of a need today to enforce a one-function-per-source-file policy - though there are reasonable arguments that that should be done anyway for maintainability.

Resources