Tool to list callers of a function in C?

Background:
In a particular project there are about a couple of thousand functions spread across more than a hundred files. The functions are divided between two banks of code memory, fast_mem and slow_mem. But now, since the fast_mem area is limited, it's running out of space to accommodate any new code changes.
During code review, it was found that some functions in fast_mem have no callers. But the list of functions is too long to check one by one manually.
Question:
So, coming to the question, is there a tool that can list the callers of all the functions in the project? With this, I can go ahead and remove functions in fast_mem that don't have any callers.
I use cscope for code browsing along with ctags. But this requires one to input the function name manually. Can this be automated somehow to get the complete list?
I also tried Doxygen with its caller graph feature, but the resulting output is not very convenient to work with.

I use Scientific Toolworks Understand.

If your compiler is a recent GCC (or if you can switch to GCC 4.6, possibly as a cross-compiler) you might develop a GCC plugin or a MELT extension to find out.
Of course, if you are e.g. doing tricks with function pointers (e.g. unportable pointer arithmetic on function pointers) the original question is undecidable.
Actually, if you are using function pointers, often the only reasonable thing to say is that they can reach only functions of the same signature.
And perhaps the project is important enough that customizing the compiler to make a better (automatic or semi-automatic) trade-off between fast_mem and slow_mem is worthwhile. This is typically an excellent case for GCC plugins or MELT extensions (but it takes some work, days or weeks rather than hours, because you need to understand the internal GCC representations to customize GCC), and you are probably the only one who could do it (because your question is very specific to a rather unusual system).

Let's assume there aren't any odd function pointer games going on. Then you can break out the under-used cflow:
http://www.gnu.org/software/cflow/
Generate a "reverse index" with the -r flag. You'll get a list of every function, followed by where it's called. You can feed it multiple files.
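For example (the file list is illustrative; add any -I or -D preprocessor options your project needs):

    cflow -r file1.c file2.c file3.c

In the reverse listing, a function with no indented caller lines beneath it is never called from the files given, making it a candidate for removal.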

You can use static code analysis tool like cppcheck.
If you call it with the --enable=unusedFunction parameter, it will warn about unused functions.
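For example, run it over the whole project directory so the check sees every call site (you may need to add include paths for accurate results):

    cppcheck --enable=unusedFunction .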

Related

Enable mocking for unit testing a library in C

In our environment we're encountering a problem regarding mocking functions for our library unit tests.
The thing is that instead of mocking whole modules (.c files) we would like to mock single functions.
The library is compiled to an archive file and linked statically to the unit test. Without mocking there isn't any issue.
Now, when trying to mock single functions of the library, we would obviously get multiple definitions.
My approach now is to use the weak function attribute when compiling/linking the library so that the linker takes the mocked (non-weak) function when linking against the unit test. I already tested it and it seems to work as expected.
The downside of this is that we need many attribute declarations in the code.
My final approach would be to pass some compile or link arguments to the compiler so that every function is automatically declared as a weak symbol.
The question now is: Is there anything to do this in a nice way?
btw: We use clang 8 as a compiler.
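To illustrate, the weak-symbol approach described above looks roughly like this (function names are hypothetical; __attribute__((weak)) is supported by both GCC and Clang):

    /* library code, compiled into the archive */
    __attribute__((weak)) int parse_config(const char *path)
    {
        /* ... real implementation ... */
        return 0;
    }

    /* unit test object: a strong (non-weak) definition of the same
       function wins over the weak one when the test executable is
       linked against the library archive */
    int parse_config(const char *path)
    {
        (void)path;   /* unused in the mock */
        return -1;    /* force the error path under test */
    }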
James Grenning describes several options to solve this problem (http://blog.wingman-sw.com/linker-substitution-in-c-limitations-and-workarounds). The option "function pointer substitution" gives a high degree of freedom. It works as follows: Replace functions by pointers to functions. The function pointers are initialized to point to the original function, but each pointer can be redirected individually to a test double.
This approach allows you to have a single test executable in which you can still decide, for each test case individually, for which functions you use a test double and for which you use the original function.
It certainly also comes at a price:
One indirection for each call. But if you use link-time optimization, the optimizer will most likely eliminate that indirection again, so this may not be an issue.
You also make it possible to redirect function calls in production code. This would certainly be a misuse of the concept, however.
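A minimal sketch of the pattern (names are hypothetical; callers go through the pointer, never the function directly):

    /* production code */
    static int real_read_sensor(void) { return 42; /* e.g. hardware access */ }
    int (*read_sensor)(void) = real_read_sensor;   /* all callers use this pointer */

    int read_twice(void) { return read_sensor() + read_sensor(); }

    /* in a test case */
    static int fake_read_sensor(void) { return 7; }

    void test_read_twice(void)
    {
        int (*saved)(void) = read_sensor;
        read_sensor = fake_read_sensor;   /* redirect to the test double */
        /* assert(read_twice() == 14); */
        read_sensor = saved;              /* restore for the next test case */
    }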
I would suggest using VectorCAST
https://www.vector.com/us/en/products/products-a-z/software/vectorcast/
I've used Unity/CMock and others for unit testing C in the past, but after a while it's very tedious to create these mocks manually for a language that isn't really built around that concept; C is very much a "here's a hammer and a chisel, the world is yours" approach.
VectorCAST abstracts away the majority of the manual work that tools like Unity/CMock require; we can get results across a project/module sooner than we did in the past with the other tools.
Is VectorCAST expensive and very much an enterprise-level tool? Yes... but it's definitely worth its weight in gold. And that's coming from someone who takes a very old-school, manual approach to software development... just text editors, terminals and command-line debuggers.
VectorCAST handles function pointers and pointers extremely well, and stubbing functions is as easy as two clicks. It saved our team a lot of time... allowing us to focus on results and shortening the development feedback loop.

Can I run GCC as a daemon (or use it as a library)?

I would like to use GCC kind of as a JIT compiler, where I just compile short snippets of code every now and then. While I could of course fork a GCC process for each function I want to compile, I find that GCC's startup overhead is too large for that (it seems to be about 50 ms on my computer, which would make it take 50 seconds to compile 1000 functions). Therefore, I'm wondering if it's possible to run GCC as a daemon or use it as a library or something similar, so that I can just submit a function for compilation without the startup overhead.
In case you're wondering, the reason I'm not considering using an actual JIT library is because I haven't found one that supports all the features I want, which include at least good knowledge of the ABI so that it can handle struct arguments (lacking in GNU Lightning), nested functions with closure (lacking in libjit) and having a C-only interface (lacking in LLVM; I also think LLVM lacks nested functions).
And no, I don't think I can batch functions together for compilation; half the point is that I'd like to compile them only once they're actually called for the first time.
I've noticed libgccjit, but from what I can tell, it seems very experimental.
My answer is "No (you can't run GCC as a daemon process, or use it as a library)", assuming you are trying to use the standard GCC compiler code. I see at least two problems:
The C compiler deals in complete translation units: once it has finished reading the source, it compiles it and exits. You'd have to rejig the code (the compiler driver program) to stick around after reading each file. Since it runs multiple sub-processes, I'm not sure you'll save all that much time anyway.
You won't be able to call the functions you create as if they were normal statically compiled and linked functions. At the least you will have to load them (using dlopen() and its kin, or writing code to do the mapping yourself) and then call them via the function pointer.
The first objection deals with the direct question; the second addresses a question raised in the comments.
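To illustrate the second point, a minimal sketch of the load-and-call step (file and symbol names are hypothetical; the .so would come from a prior gcc -shared run; link this with -ldl):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *h = dlopen("./snippet.so", RTLD_NOW);    /* load the compiled snippet */
        if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        int (*fn)(int);
        *(void **)(&fn) = dlsym(h, "my_function");     /* look up the symbol */
        if (!fn) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        printf("result: %d\n", fn(21));                /* call via function pointer */
        dlclose(h);
        return 0;
    }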
I'm late to the party, but others may find this useful.
There exists a REPL (read-eval-print loop) for C++ called Cling, which is based on the Clang compiler. A big part of what it does is JIT compilation of C and C++ code. As such, you may be able to use Cling to get what you want done.
The even better news is that Cling is undergoing an attempt to upstream a lot of the Cling infrastructure into Clang and LLVM.
@acorn pointed out that you'd ruled out LLVM and co. for lack of a C API, but Clang itself does have one, and it is the only interface for which they guarantee stability: https://clang.llvm.org/doxygen/group__CINDEX.html

How to automatically call all functions in C source code

Have you ever heard about automatic C code generators?
I have to do a kind of strange API functionality research which includes at least one attempted execution of every function. It may lead to crashes or segmentation faults; that doesn't matter. I just need to register every function call.
So I got a long list (several hundred entries) of functions from the sources using
ctags -x --c-kinds=f *.c
Can I use any tool to generate code that calls every one of them? Thanks a lot.
UPD: thanks for all your answers.
You could also consider customizing the GCC compiler, e.g. with a MELT extension (which would, for instance, generate the testing code during a customized compilation). Then you might even define your own #pragma or __attribute__ to parameterize these functions (enabling their auto-testing, giving default arguments for testing, etc.).
However, I'm not sure it is the right approach for unit testing. There are many unit testing frameworks (but I am not very familiar with them).
Maybe something like autoconf could help you with that, as described here; in particular, check AC_CHECK_FUNCS. Autoconf creates small programs to test the existence of registered functions.
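If a full framework is overkill, the ctags listing from the question can also be turned into a crude driver mechanically. A sketch (it assumes function names are unique, none is named main, and that calling everything with no arguments is acceptable; mismatched signatures are undefined behavior, but the question explicitly tolerates crashes):

    ctags -x --c-kinds=f *.c | awk '
        { names[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) printf "void %s();\n", names[i]
            print "int main(void) {"
            for (i = 1; i <= NR; i++) printf "    %s();\n", names[i]
            print "    return 0;"
            print "}"
        }' > call_all.c

Compile call_all.c and link it against the objects under test to get one attempted call per function.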

Assembly-level function fingerprint

I would like to determine whether two functions in two executables were compiled from the same (C) source code, and would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembly-level function fingerprinting. The fingerprint of a function should have the following properties:
two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or similar one),
two functions compiled from different C source are likely to have different fingerprints,
(bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).
What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).
Assumptions
Of course, this is generally impossible, but there might exist something that will work in most cases. Here are some assumptions that could make it easier:
Linux ELF binaries (without debugging information available, though),
not obfuscated in any way,
compiled by gcc,
on x86 Linux (an approach that can be implemented on other architectures would be nice).
Ideas
Unfortunately, I have little to no experience with assembly. Here are some ideas for the abovementioned properties:
types of instructions contained in the function (e.g. floating-point instructions, memory barriers)
memory accesses from the function (does it read from/write to the heap? the stack?)
library functions called (their names should be available in the ELF; also their order shouldn't usually change)
shape of the control flow graph (I guess this will be highly dependent on the compiler)
Existing work
I was able to find only tangentially related work:
Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm
Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?
FLIRT uses byte-level pattern matching, so it breaks down with any changes in the instruction encodings (e.g. different register allocation/reordered instructions).
For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.
In my opinion, the easiest way to do something like this would be to decompile the function's assembly back into some higher-level form where constructs (like for, while, function calls etc.) exist, then match the structure of these higher-level constructs.
This would prevent instruction reordering, loop hoisting, loop unrolling and any other optimizations from messing with the comparison; you can even (de)optimize these higher-level structures as far as possible on both ends to ensure they meet at the same point, so that comparisons between unoptimized debug code and -O3 won't fail due to missing temporaries or a lack of register spills.
You can use something like Boomerang as a basis for the decompilation (except you wouldn't emit C code).
I suggest you approach this problem from the standpoint of the language the code was written in and what constraints that code puts on compiler optimization.
I'm not really familiar with the C standard, but C++ has the concept of "observable" behavior. The standard carefully defines this, and compilers are given great latitude to optimize as long as the result exhibits the same observable behavior. My recommendation for trying to determine whether two functions are the same would be to try to determine their observable behavior (what I/O they do, how they interact with other areas of memory, and in what order).
If the problem set can be reduced to a small set of known C or C++ source code functions being compiled by n different compilers, each with m[n] different sets of compiler options, then a straightforward, if tedious, solution would be to compile the code with every combination of compiler and options and catalog the resulting instruction bytes, or more efficiently, their hash signature in a database.
The set of compiler options likely to have been used is potentially large, but in actual practice engineers typically use a pretty standard and small set, usually just minimally optimized for debugging and fully optimized for release. Researching many project configurations might reveal that there are only two or three more in any engineering culture, reflecting prejudice or superstition about how compilers work, whether accurate or not.
I suspect this approach is closest to what you actually want: a way of investigating suspected misappropriation of source code. All the suggested techniques for reconstructing the compiler's parse tree might bear fruit, but they have great potential for overlooking symmetric solutions or running into ambiguous, unsolvable cases.
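A crude sketch of the cataloging step (file names are hypothetical; a real setup would isolate individual functions, e.g. by compiling with -ffunction-sections, rather than hashing the whole .text section):

    for opt in -O0 -O1 -O2 -O3 -Os; do
        gcc $opt -c suspect.c -o suspect.o
        objcopy -O binary --only-section=.text suspect.o text.bin   # extract code bytes
        echo "$opt $(sha256sum text.bin)"                           # hash signature for the database
    done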

Find unused functions in a C project by static analysis

I am trying to run static analysis on a C project to identify dead code, i.e. functions or lines of code that are never called. I can build this project with Visual Studio .NET for Windows or with gcc for Linux. I have been trying to find a reasonable tool that can do this for me, but so far I have not succeeded. I have read related questions on Stack Overflow (this and this) and I have tried -Wunreachable-code with gcc, but the output is not very helpful. It is of the following format:
/home/adnan/my_socket.c: In function ‘my_sockNtoH32’:
/home/adnan/my_socket.c:666: warning: will never be executed
but when I look at line 666 in my_socket.c, it's actually inside another function that is called from my_sockNtoH32(); the line will not be executed for this specific caller, but it will be executed when the function is called from elsewhere.
What I need is to find the code which will never be executed. Can someone please help with this?
PS: I can't convince management to buy a tool for this task, so please stick to free/open source tools.
If GCC isn't cutting it for you, try clang (or more accurately, its static analyzer). It generally has much better static analysis than GCC and produces much better output, though your mileage may vary, of course. It's used in Apple's Xcode, but it's open source and can be used separately.
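For example (both commands ship with clang; the make target is illustrative):

    clang --analyze my_socket.c
    scan-build make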
When GCC says "will never be executed", it means it. You may have a bug that, in fact, does make that code dead. For example, something like:
if (a = 42) {   // assignment, not comparison: the condition is always true
    // some code
} else {
    // warning: unreachable code
}
Without seeing the code it's not possible to be specific, of course.
Note that if there is a macro at line 666, it's possible GCC refers to a part of that macro as well.
GCC will help you find dead code within a compilation. I'd be surprised if it can find dead code across multiple compilation units. A file-level declaration of a function or variable in a compilation unit means that some other compilation unit might reference it. So GCC can't eliminate anything declared at the top level of a file, as it arguably sees only one compilation unit at a time.
The problem gets harder. Imagine that compilation unit A declares function a, and compilation unit B has a function b that calls a. Is a dead? On the face of it, no. But in fact, it depends; if b is dead, and the only reference to a is in b, then a is dead, too. We get the same problem if b merely takes &a and puts it into an array X. Now to decide if a is dead, we need a points-to analysis across the entire system, to see if that pointer to a is used anywhere.
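A minimal illustration of that situation (names are hypothetical):

    /* a.c */
    int a(void) { return 1; }

    /* b.c */
    extern int a(void);
    int (*X[1])(void) = { a };        /* b's file merely stores &a in an array */
    int b(void) { return X[0](); }    /* but is b itself ever called? */

Whether a is dead now depends on whether b, or any other use of X[0], is reachable, which is exactly the whole-program question described next.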
To get this kind of accurate "dead" information, you need a global view of the entire set of compilation units, and you need to compute a points-to analysis, followed by the construction of a call graph based on that points-to analysis. Function a is dead only if the call graph (as a tree, with main as the root) doesn't reference it somewhere.
(Some caveats are necessary: whatever the analysis is, as a practical matter it must be conservative, so even a full points-to analysis may not correctly identify a function as dead. You also have to worry about uses of a C artifact from outside the set of C functions, e.g., a call to a from some bit of assembler code.)
Threading makes this worse; each thread has some root function which is probably at the top of the call DAG. Since how a thread gets started isn't defined by C compilers, it should be clear that to determine if a multithreaded C application has dead code, somehow the analysis has to be told the thread root functions, or be told how to discover them by looking for thread-initialization primitives.
You aren't getting a lot of responses on how to get a correct answer. While it isn't open source, our DMS Software Reengineering Toolkit with its C Front End has all the machinery to do this, including C parsers, control- and data-flow analysis, local and global points-to analysis, and global call graph construction. DMS is easily customized to include extra information such as external calls from assembler, and/or a list of thread roots or specific source patterns that are thread-initialization calls, and we've actually done that (easily) for some large embedded engine controllers with millions of lines of code. DMS has been applied to systems as large as 26 million lines of code (some 18,000 compilation units) for the purpose of building such call graphs.
[An interesting aside: in processing individual compilation units, DMS, for scaling reasons, in effect deletes symbols and related code that aren't used in that compilation unit. Remarkably, this gets rid of about 95% of code by volume once you take into account the iceberg usually hiding in the include-file nest. It suggests that C software typically has poorly factored include files. I suspect you all know that already.]
Tools like GCC will remove dead code while compiling. That's helpful, but the dead code is still lying around in your source code, using up developers' attention (they have to figure out whether it is dead, too!). DMS, in its program transformation mode, can be configured, modulo some preprocessor issues, to actually remove that dead code from the source. On very large software systems, you don't really want to do this by hand.
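As a coarse cross-check of what GCC's own dead-code removal finds, you can also ask the linker to report which functions it discarded (these are standard GCC/binutils flags, though the results carry all the caveats about function pointers and thread roots discussed above):

    gcc -ffunction-sections -fdata-sections -Wl,--gc-sections,--print-gc-sections *.c -o app

Each "removing unused section" line names a function that nothing referenced.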
