Making unresolved linking dependencies get reported at runtime instead of at link/program-load time, for the purposes of unit testing in C

I have a home-grown unit testing framework for C programs on Linux using GCC. For each file in the project, say foobar.c, a matching file foobar-test.c may exist. If it does, both files are compiled and statically linked together into a small executable foobar-test, which is then run. foobar-test.c is expected to contain main(), which calls all the unit test cases defined in that file.
Let's say I want to add a new test file barbaz-test.c to exercise sort() inside an existing production file barbaz.c:
// barbaz.c
#include "barbaz.h"
#include "log.h" // declares log() as a linking dependency coming from elsewhere
int func1() { ... res = log(); ...}
int func2() {... res = log(); ...}
int sort() {...}
Besides sort() there are several other functions in the same file which call into log() defined elsewhere in the project.
The functionality of sort() does not depend on log(), so testing it will never reach log(). Neither func1() nor func2() requires testing, and neither will be reachable from the new test case I am about to write.
However, the barbaz-test executable cannot be linked successfully until I provide stub implementations for every dependency of barbaz.c. A typical stub looks like this:
// in barbaz-test.c
#include <assert.h>
#include <stdbool.h>
#include "barbaz.h"
#include "log.h"

int log() {
    assert(false && "stub must not be reached");
    return 0;
}
// Actual test case for sort() starts here
...
If barbaz.c is large (which is often the case for legacy code written with no regard for testability), it will have many linking dependencies, and I cannot start writing a test case for sort() until I have provided stubs for all of them. It also creates an ongoing maintenance burden: updating the stubs' prototypes whenever the production counterparts change, remembering to delete stubs that are no longer needed, and so on.
What I am looking for is an option to have late, runtime binding performed for missing symbols, similar to how dynamic languages do it, but for C. If an unresolved symbol is reached during test execution, that should lead to a failure. A proper diagnostic explaining the reason would be ideal, but a simple NULL pointer dereference would be good enough.
My current solution is to automate the initial generation of the stubs' source code. It works by analyzing linker error messages and then looking up declarations for the missing symbols in the headers. It is done in an ad-hoc manner, e.g. it involves "parsing" C code with regular expressions.
Needless to say, it is very fragile: it depends on the specific format of the linker's error messages and on uniformly formatted function declarations that the regexps can recognize. Nor does it address the future maintenance burden such stubs create.
Another approach is to collect stubs for the most "popular" linking dependencies into a common object file which is always linked into the test executables. That leaves a shorter list of "unique" dependencies to take care of for each new file. This approach breaks down as soon as a slightly specialized version of a common stub is needed for a particular test: linking then fails with a "multiple definition" error.
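For illustration, a minimal sketch of how that collision arises (the file names are hypothetical; the stub body follows the pattern shown above):
// common-stubs.c, linked into every test executable
#include <assert.h>
int log() {
    assert(0 && "common stub must not be reached");
    return 0;
}
// barbaz-test.c needs a slightly specialized fake instead of the common stub
int log() {
    return 42; // canned value for this particular test
}
// linking both objects now fails: multiple definition of `log'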

I may have stumbled on a solution myself, inspired by this discussion: Why can't ld ignore an unused unresolved symbol?
The linker can certainly determine that certain linking dependencies are unreachable, but by default it cannot remove them, because the compiler puts all function symbols of a translation unit into the same ELF section. The linker will not modify the inside of a section, but it may drop whole sections.
A solution is to add -fdata-sections and -ffunction-sections to the compiler flags, and --gc-sections to the linker flags.
The former options make the compiler emit one section per function (and per data object); the latter lets the linker discard sections that are never referenced, and with them the unresolved references they contain.
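A minimal sketch of the resulting build commands, following the example above (the flags are the ones just named; everything else, including the file names, is only an illustration):
# one section per function/data object in both translation units
$ gcc -c -ffunction-sections -fdata-sections barbaz.c barbaz-test.c
# drop unreferenced sections; the sections holding func1()/func2() go away,
# and with them the unresolved references to log()
$ gcc -Wl,--gc-sections barbaz.o barbaz-test.o -o barbaz-test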
I do not think these flags can be safely added to a project without first benchmarking their effects, since they affect the size and speed of the production code.
man gcc says:
Only use these options when there are significant benefits from doing so. When you specify these options, the assembler and linker create larger object and executable files and are also slower. These options affect code generation. They prevent optimizations by the compiler and assembler using relative locations inside a translation unit since the locations are unknown until link time.
And it goes without saying that the solution only applies to the GCC/GNU Binutils toolchain.

Related

Gathering test symbols into an array statically in C/C++

Short version of question
Is it possible to gather specific symbols in C into a single list/array in the executable, statically at compile time, without relying on CRT initialization? (I frequently support embedded targets and have limited support for dynamic memory.)
EDIT: I'm 100% ok with this happening at link time and also ok with not having symbols cross library boundaries.
EDIT 2: I'm also OK with compiler-specific answers if it's gcc or clang, but would prefer cross-platform if possible.
Longer version with more background
This has been a pain in my side for a while.
Right now I have a number of built-in self tests that I like to run in order.
I enforce the same calling convention on all functions and am manually gathering all the tests into an array statically.
// ThisLibrary_testlist.h
#define DECLARE_TEST(TESTNAME) void TESTNAME##_test(void * test_args)
DECLARE_TEST(test1);
DECLARE_TEST(test2);
DECLARE_TEST(test3);

// ThisLibrary_some_module.c
#include "ThisLibrary_testlist.h"
DECLARE_TEST(test1)
{
    // ... do good stuff here
}

// ThisLibrary_testarray.c
#include "ThisLibrary_testlist.h"
typedef void (*testfunc_t) (void*);
#define LIST_TEST(TESTNAME) TESTNAME##_test
testfunc_t tests[] =
{
    &LIST_TEST(test1),
    &LIST_TEST(test2)
};
// now it's an array... you know what to do.
So far this has kept me alive but it's getting kind of ridiculous that I have to basically modify the code in 3 separate locations if I want to update a test.
Not to mention the absolute #ifdef nightmare that comes with conditionally compiled tests.
Is there a better way?
With a bit of scripting magic you could do the following: after compiling your source files (but before linking), search the object files for symbols that match your test-name pattern. See man nm for how to obtain symbol names from object files (on Unix, that is - no idea about Windows, sorry). Based on the list of symbols found, auto-generate the file ThisLibrary_testarray.c, putting in all the extern declarations followed by the function pointer table. After generating this file, compile it and finally link everything together.
This way you only have to add new test functions to the source files. There is no need to maintain the header file ThisLibrary_testlist.h, but you have to make sure the test functions have external linkage and follow the naming pattern - and be sure no other symbol uses that pattern :-)
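For illustration, the auto-generated file might look something like this (test names taken from the question; the tests_count helper is just an extra convenience, not something the answer prescribes):
// ThisLibrary_testarray.c -- auto-generated from nm output, do not edit
typedef void (*testfunc_t)(void *);

// one extern declaration per *_test symbol found in the object files
extern void test1_test(void *test_args);
extern void test2_test(void *test_args);
extern void test3_test(void *test_args);

testfunc_t tests[] = {
    test1_test,
    test2_test,
    test3_test,
};
const unsigned tests_count = sizeof tests / sizeof tests[0];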

Test embedded code by replacing static symbols at compile time

Background
I'm building a C application for an embedded Cortex M4 TI-RTOS SYS/BIOS target, however this question should apply to all embedded targets where a single binary is loaded onto some microprocessor.
What I want
I'd like to do some in situ regression tests on the target where I just replace a single function with some test function instead. E.g. a GetAdcMeasurement() function would return predefined values from a read-only array instead of doing the actual measurement and returning that value.
This could of course be done with a mess of #ifndefs, but I'd rather keep the production code as untouched as possible.
My attempt
I figure one way to achieve this would be to have duplicate symbol definitions at the linker stage, and then have the linker prioritise the definitions from the test suite (given some #define).
I've looked into using LD_PRELOAD, but that doesn't really seem to apply here (since I'm using only static objects).
Details
I'm using TI Code Composer, with TI-RTOS & SYS/BIOS on the Sitara AM57xx platform, compiling for the M4 remote processor (denoted IPU1).
Here's the path to the compiler and linker
/opt/ti/ccsv7/tools/compiler/ti-cgt-arm_16.9.6.LTS/bin/armcl
One solution could be to have multiple .c files for each module: one with the production code and one with the test code, and to compile and link against one of the two. The globals and function signatures in both .c files must be the same (at least: there may be more symbols, but not fewer).
Another solution, building on the previous one, is to have two libraries, one with the production code and one with the test code, and link against one of them. You could even link with both libraries, with the test version first, as linkers often resolve symbols in the order they are encountered.
And, as you said, you could work with a bunch of #ifdefs, which would have the advantage of keeping just one .c file, but makes the code less readable.
I would not go for #ifdefs at the function level, i.e. defining just one function of a .c file for test and keeping the others as is; however, if necessary, it could be a way. And if necessary, you could have one (or two) .c files per function, but that would negate the module concept.
I think the first approach would be the cleanest.
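A minimal sketch of the first approach, reusing the function from the question (the file names and the placeholder typedef are made up for the sake of the example):
// adc.h -- shared interface, identical for both variants
typedef int adcdata_t;              // placeholder type for this sketch
adcdata_t GetAdcMeasurement(void);

// adc.c -- production implementation, compiled into the release image
adcdata_t GetAdcMeasurement(void)
{
    /* ... perform the actual measurement ... */
    return 0;
}

// adc_test.c -- test implementation, compiled into the test image instead
static const adcdata_t fake_samples[] = { 100, 200, 300 };
adcdata_t GetAdcMeasurement(void)
{
    static unsigned i;
    return fake_samples[i++ % 3];   // predefined values, no hardware access
}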
One additional approach (apart from Paul Ogilvie's) would be to have your mocking header also create a define which will replace the original function symbol at the pre-processing stage.
I.e. if your mocking header looks like this:
// mock.h
#ifdef MOCKING_ENABLED
adcdata_t GetAdcMeasurement_mocked(void);
stuff_t GetSomeStuff_mocked(void);
#define GetAdcMeasurement GetAdcMeasurement_mocked
#define GetSomeStuff GetSomeStuff_mocked
#endif
Then whenever you include the file, the preprocessor will replace the calls before it even hits the compiler:
#include "mock.h"
void SomeOtherFunc(void)
{
// preprocessor will change this symbol into 'GetAdcMeasurement_mocked'
adcdata_t data = GetAdcMeasurement();
}
The approach might confuse the unsuspecting reader of your code, because they won't necessarily realize that you are calling a different function altogether. Nevertheless, I find this approach to have the least impact on the production code (apart from adding the include, obviously).
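To complete the picture, the mocked definition would live in a file that is only compiled into the test image, roughly like this (the file name and the canned values are made up; adcdata_t is assumed to be a numeric type):
// mock.c -- compiled only when MOCKING_ENABLED is defined
#include "mock.h"

adcdata_t GetAdcMeasurement_mocked(void)
{
    // predefined values from a read-only array instead of a real measurement
    static const adcdata_t canned[] = { 100, 200, 300 };
    static unsigned i;
    return canned[i++ % 3];
}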
(This is a quick summary of the discussion in the comments; thanks for the answers.)
A function can be redefined if it has the weak attribute, see
https://en.wikipedia.org/wiki/Weak_symbol
On GCC that would be the weak attribute, e.g.
int __attribute__((weak)) power2(int x);
and on the armcl (as in my question) that would be the pragma directive
#pragma weak power2
int power2(int x);
Making some of the production functions weak in this way allows a test framework to replace individual functions.
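A minimal GCC-flavoured sketch, reusing power2() from above (the canned return value is made up); the strong definition in the test build silently replaces the weak one at link time:
// production.c -- default implementation, declared weak
int __attribute__((weak)) power2(int x)
{
    return x * x;
}

// test_override.c -- linked only into the test image; this strong
// definition wins over the weak one without touching production.c
int power2(int x)
{
    (void)x;
    return 42;   // canned value for the regression test
}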

Generating link-time error for deprecated functions

Is there a way with gcc and GNU binutils to mark some functions such that they will generate an error at link-time if used? My situation is that I have some library functions which I am not removing for the sake of compatibility with existing binaries, but I want to ensure that no newly-compiled binary tries to make use of the functions. I can't just use compile-time gcc attributes because the offending code is ignoring my headers and detecting the presence of the functions with a configure script and prototyping them itself. My goal is to generate a link-time error for the bad configure scripts so that they stop detecting the existence of the functions.
Edit: An idea.. would using assembly to specify the wrong .type for the entry points be compatible with the dynamic linker but generate link errors when trying to link new programs?
FreeBSD 9.x does something very close to what you want with the ttyslot() function. This function is meaningless with utmpx. The trick is that there are only non-default versions of this symbol. Therefore, ld will not find it, but rtld will find the versioned definition when an old binary is run. I don't know what happens if an old binary has an unversioned reference, but it is probably sensible if there is only one definition.
For example,
__asm__(".symver hidden_badfunc, badfunc#MYLIB_1.0");
Normally, there would also be a default version, like
__asm__(".symver new_badfunc, badfunc##MYLIB_1.1");
or via a Solaris-compatible version script, but the trick is not to add one.
Typically, the asm directive is wrapped into a macro.
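Such a macro can be as simple as pasting the directive string together; a sketch (the macro name is made up, and MYLIB_1.0 must exist as a version node when the library is linked):
/* hide the asm boilerplate behind a macro */
#define SYMVER(str) __asm__(".symver " str)

int hidden_badfunc(int x) { return x + 1; }
SYMVER("hidden_badfunc, badfunc@MYLIB_1.0");   /* non-default version only */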
The trick depends on the GNU extensions to define symbol versions with the .symver assembler directive, so it will probably only work on Linux and FreeBSD. The Solaris-compatible version scripts can only express one definition per symbol.
More information: .symver directive in info gas, Ulrich Drepper's "How to write shared libraries", the commit that deprecated ttyslot() at http://gitorious.org/freebsd/freebsd/commit/3f59ed0d571ac62355fc2bde3edbfe9a4e722845
One idea could be to generate a stub library that has these symbols, but with unexpected properties:
- perhaps create objects that have the names of the functions, so the linker in the configuration phase might complain that the symbols are not compatible
- create functions that have a dependency "dont_use_symbol_XXX" that is never resolved
- or fake a .a file with a global index that would list your functions, but where the .o members in the archive have the wrong format
The best way to generate a link-time error for deprecated functions that you do not want people to use is to make sure the deprecated functions are not present in the libraries - which makes them one stage beyond 'deprecated'.
Maybe you can provide an auxiliary library with the deprecated functions in it; the reprobates who won't pay attention can link with the auxiliary library, but people in the mainstream won't use it and therefore won't use the functions. However, that still takes things beyond the 'deprecated' stage.
Getting a link-time warning is tricky. Clearly, GCC does that for some functions (mktemp() et al), and Apple has GCC warn if you run a program that uses gets(). I don't know what they do to make that happen.
In the light of the comments, I think you need to head the problem off at compile time, rather than waiting until link time or run time.
The GCC attributes include (from the GCC 4.4.1 manual):
error ("message")
If this attribute is used on a function declaration and a call to such a function is
not eliminated through dead code elimination or other optimizations, an error
which will include message will be diagnosed. This is useful for compile time
checking, especially together with __builtin_constant_p and inline functions
where checking the inline function arguments is not possible through extern
char [(condition) ? 1 : -1]; tricks. While it is possible to leave the function
undefined and thus invoke a link failure, when using this attribute the problem
will be diagnosed earlier and with exact location of the call even in presence of
inline functions or when not emitting debugging information.
warning ("message")
If this attribute is used on a function declaration and a call to such a function is
not eliminated through dead code elimination or other optimizations, a warning
which will include message will be diagnosed. This is useful for compile time
checking, especially together with __builtin_constant_p and inline functions.
While it is possible to define the function with a message in .gnu.warning*
section, when using this attribute the problem will be diagnosed earlier and
with exact location of the call even in presence of inline functions or when not
emitting debugging information.
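A minimal sketch of the error attribute in use (the function name and the message are placeholders):
/* any call to this function that survives optimization is a hard compile error */
__attribute__((error("badfunc() is deprecated, do not use it in new code")))
int badfunc(void);

int caller(void)
{
    return badfunc();   /* compiling this call fails with the message above */
}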
If the configuration programs ignore the errors, they're simply broken. This means that new code could not be compiled using the functions, but the existing code can continue to use the deprecated functions in the libraries (up until it needs to be recompiled).

Why is there a 'static' keyword before ALL functions (except for main())?

I was reading some source code files in C and C++ (mainly C)...
I know the meaning of the 'static' keyword: static functions are functions that are only visible to other functions in the same file. In another context I read that it's good to use static functions in cases where we don't want them to be used outside the file they are written in...
I was reading one source code file, as I mentioned before, and I saw that ALL the functions (except main) were static... Since there are no other additional files linked with the main source .c file (not even headers), logically, why should I put static before all the functions? From WHAT should they be protected when there's only one source file?!
EDIT: IMHO I think those keywords are put there just to make the code look bigger and heavier..
If a function is extern (default), the compiler must ensure that it is always callable through its externally visible symbol.
If a function is static, then that gives the compiler more flexibility. For example, the optimizer may decide to inline a function; with static, the compiler does not need to generate an additional out-of-line copy. Also, the symbol table will be smaller, possibly speeding up the linking process too.
Also, it's just a good habit to get into.
It is hard to guess in isolation, but my assumption would be that it was written by someone who assumes that more files might be added at some point (or this file included in another project), so gives the least necessary access for the code to function. Essentially limiting the public API to the minimum.
But there are other files linked with your main module.
In fact, there are hundreds or even thousands, in the libraries. Most of these won't be selected for a small program, but all the symbols are scanned by the linker. A collision between a symbol in main and an exported symbol from a library won't by itself cause any harm, but think of the trouble accidentally naming something strcpy() could cause.
Also, it probably doesn't hurt to get used to the best-practice styles.
As a coding rule that I follow, any function (other than main()) that is visible outside its source file needs a declaration, and that declaration should be in a header. I avoid writing 'extern' declarations for functions in my source files if at all possible, and it almost always is possible.
If a function is only used inside a single source file, it should be static. That makes it much easier to modify; you know that the only place you need to look to see how it is used is the source file you have in front of you now (unless you make a habit of including '.c' files in other '.c' files - which is also a bad habit that should be broken now).
I use GCC to help me enforce the coding rule:
gcc -m64 -Wall -Wextra -std=c99 -Wmissing-prototypes -Wstrict-prototypes
That's a fairly typical collection of flags; I sometimes use -std=c89 instead of -std=c99; I sometimes use -m32 instead of -m64; I don't always use -Wextra (but my code is moving in that direction). I always use -Wmissing-prototypes and -Wstrict-prototypes to ensure that each external function is declared before it is defined or used (and each static function is either declared or defined before it is used). I occasionally use -Werror (so if the compile emits a warning, the compilation fails). I could use it more than I do since my code does compile without warnings - or gets fixed so that it does.
So, you could easily have been looking at my code. In my code, the only functions that are exposed - even in single source file programs - are the functions that are declared in a header, which means that they are part of the external interface to the module that the source file represents.
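As a small illustration of that rule (the module and function names are invented):
/* widget.h -- the module's public interface */
int widget_count(void);

/* widget.c */
#include "widget.h"

/* internal helper: static, so invisible outside this translation unit */
static int scan_widgets(void)
{
    return 3;
}

/* the only exposed function, and it is declared in widget.h */
int widget_count(void)
{
    return scan_widgets();
}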
It may be that the author is taking precautions. For example, if someone else is using this file as a source by including it into his main file.
Because there are not other additional files linked with the main source code .c file (not even headers), logically why should I put static before all functions? From WHAT should they be protected when there's only 1 source file?!
You honestly don't need the static keyword in this case.
EDIT: IMHO I think those keywords are put just to make the code look bigger and heavier..
However, if you really want to read more about static keyword you can start with a book.
Some more info on the keyword static at #2216239 -- may help!

How do linkers decide what parts of libraries to include?

Assume library A has a() and b(). If I link my program B with A and call a(), does b() get included in the binary? Does the compiler see if any function in the program call b() (perhaps a() calls b() or another lib calls b())? If so, how does the compiler get this information? If not, isn't this a big waste of final compile size if I'm linking to a big library but only using a minor feature?
Take a look at link-time optimization. This is necessarily vendor dependent. It will also depend how you build your binaries. MS compilers (2005 onwards at least) provide something called Function Level Linking -- which is another way of stripping symbols you don't need. This post explains how the same can be achieved with GCC (this is old, GCC must've moved on but the content is relevant to your question).
Also take a look at the LLVM implementation (and the examples section).
I suggest you also take a look at Linkers and Loaders by John Levine -- an excellent read.
It depends.
If the library is a shared object or DLL, then everything in the library is loaded, but at run time. The cost in extra memory is (hopefully) offset by sharing the library (really, the code pages) between all the processes in memory that use that library. This is a big win for something like libc.so, less so for myreallyobscurelibrary.so. But you probably aren't asking about shared objects, really.
Static libraries are simply a collection of individual object files, each the result of a separate compilation (or assembly), and possibly not even written in the same source language. Each object file has a number of exported symbols, and almost always a number of imported symbols.
The linker's job is to create a finished executable that has no remaining undefined imported symbols. (I'm lying, of course, if dynamic linking is allowed, but bear with me.) To do that, it starts with the modules named explicitly on the link command line (and possibly implicitly in its configuration) and assumes that any module named explicitly must be part of the finished executable. It then attempts to find definitions for all of the undefined symbols.
Usually, the named object modules expect to get symbols from some library such as libc.a.
In your example, you have a single module that calls the function a(), which will result in the linker looking for a module that exports a().
You say that the library named A (on unix, probably libA.a) offers a() and b(), but you don't specify how. You implied that a() and b() do not call each other, which I will assume.
If libA.a was built from a.o and b.o where each defines the corresponding single function, then the linker will include a.o and ignore b.o.
However, if libA.a included ab.o that defined both a() and b() then it will include ab.o in the link, satisfying the need for a(), and including the unused function b().
As others have mentioned, there are linkers that are capable of splitting individual functions out of modules, and including only those that are actually used. In many cases, that is a safe thing to do. But it is usually safest to assume that your linker does not do that unless you have specific documentation.
Something else to be aware of is that most linkers make as few passes as they can through the files and libraries that are named on the command line, and build up their symbol table as they go. As a practical matter, this means that it is good practice to always specify libraries after all of the object modules on the link command line.
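A small illustration of both points, in the spirit of the question's libA.a (the commands assume a Unix-style GCC/binutils toolchain and made-up file names):
# one function per object file gives the linker per-object granularity
$ gcc -c a.c b.c main.c
$ ar rcs libA.a a.o b.o
# object files first, libraries last: when libA.a is scanned, main.o's
# reference to a() is still undefined, so a.o is pulled in and b.o is not
$ gcc main.o libA.a -o prog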
It depends on the linker.
eg. Microsoft Visual C++ has an option "Enable function level linking" so you can enable it manually.
(I assume they have a reason for not just enabling it all the time...maybe linking is slower or something)
Usually (static) libraries are composed of objects created from source files. What linkers usually do is include an object if a function provided by that object is referenced. If your source file only contains one function, then only that function will be brought in by the linker. There are more sophisticated linkers out there, but most C-based linkers still work as outlined. There are tools available that split C sources containing multiple functions into artificially smaller source files, to make static linking more fine-grained.
If you are using shared libraries, then you don't impact your compiled size by using more or less of them. However, your runtime size will include them.
This lecture at Academic Earth gives a pretty good overview, linking is talked about near the later half of the talk, IIRC.
Without any optimization, yes, it'll be included. The linker, however, might be able to optimize out by statically analyzing the code and trying to remove unreachable code.
It depends on the linker, but in general only functions that are actually called get included in the final executable. The linker works by looking up the function name in the library and then using the code associated with the name.
There are very few books on linkers, which is strange when you think how important they are. The text for a good one can be found here.
It depends on the options passed to the linker, but typically the linker will leave out the object files in a library that are not referenced anywhere.
$ cat foo.c
int main(){}
$ gcc -static foo.c
$ size
text data bss dec hex filename
452659 1928 6880 461467 70a9b a.out
# force linking of libz.a even though it isn't used
$ gcc -static foo.c -Wl,-whole-archive -lz -Wl,-no-whole-archive
$ size
text data bss dec hex filename
517951 2180 6844 526975 80a7f a.out
It depends on the linker and how the library was built. Usually libraries are a combination of object files (import libraries are a major exception to this). Older linkers would pull things into the output file image at a granularity of the object files that were put into the library. So if function a() and function b() were both in the same object file, they would both be in the output file - even if only one of the 2 functions were actually referenced.
This is a reason why you'll often see library-oriented projects with a policy of a single C function per source file. That way each function is packaged in its own object file and linkers have no problem pulling in only what is referenced.
Note however that newer linkers (certainly newer Microsoft linkers) have the ability to pull in only parts of object files that are referenced, so there's less of a need today to enforce a one-function-per-source-file policy - though there are reasonable arguments that that should be done anyway for maintainability.

Resources