Extract just the required functions from a C code project?

How can I extract just the required functions from a pile of C source files? Is there a tool that can be used on GNU/Linux?
Preferably FOSS, but GNU/Linux is a hard requirement.
Basically I have about 10 .h files; I'd like to grab part of the code and pull the required variables from the header files, then make a single small .h file corresponding to the code I'm using in another project.
My terms might not be 100% correct.

One tool that you may or may not be aware of is cscope; it can help with exactly this sort of investigation.
For a given set of files (more on what that means shortly), it gives you these options:
Find this C symbol:
Find this global definition:
Find functions called by this function:
Find functions calling this function:
Find this text string:
Change this text string:
Find this egrep pattern:
Find this file:
Find files #including this file:
Thus, if you know you want to use a function humungous_frogmondifier(), you can find where it is declared or defined by typing its name (or pasting its name) after 'Find this global definition'. If you then want to know what functions it calls, you use the next line. Once you've hit return after specifying the name, you will be given a list of the relevant lines in the source files above this menu on the screen. You can page through the list (if there are more entries than will fit on the screen), and at any time select one of the shown entries by number or letter, in which case cscope launches your editor on the file.
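For concreteness, here is a hypothetical fragment (the helper name frobnicate is just an illustration, not from any real project) and what you could ask cscope about it:

/* hypothetical code that cscope has indexed */
static int frobnicate(int x)          /* helper, file scope only */
{
    return x * 3 + 1;
}

int humungous_frogmondifier(int seed) /* the function you care about */
{
    return frobnicate(seed) + frobnicate(seed + 1);
}

Asking 'Find this global definition' for humungous_frogmondifier jumps to its definition; 'Find functions called by this function' lists frobnicate; 'Find functions calling this function' lists every caller of humungous_frogmondifier elsewhere in the indexed sources.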
How about that list of files? If you run cscope in a directory without any setup, it will scan the source files in the directory and build its cross-reference. However, if you prefer, you can set up a list of file names in cscope.files and it will analyze those files instead. You can also include -I /path/to/directory on the cscope command line and it will find referenced headers in those directories too.
I'm using cscope 15.7a on some sizeable projects - depending on which version of the project, between about 21,000 and 25,000 files (and some smaller ones with only 10-15 thousand files). It takes about half an hour to set up this project (so I carefully rebuild the indexes once per night, and use the files for the day, accepting that they are a little less accurate at the end of the day). It allows me to track down unused stuff, and find out where stuff is used, and so on.
If you're used to an IDE, it will be primitive. If you're used to curses-mode programs (vim, etc), then it is tolerably friendly.

You suggest (in comments to the main question) that you will be doing this more than once, possibly on different (non-library) code bases. I'm not sure I see the big value in this; I've been coding C on and off for 30+ years and don't feel the need to do this very often.
But given the assumption you will, what you really want is a tool that can, for a given identifier in a system of C files and headers, find the definition of that identifier in those files, and compute the transitive closure of all the dependencies which it has. This defines a partial order over the definitions based on the depends-on relationship. Finally you want to emit the code for those definitions to an output file, in a linear order that honors the partial order determined. (You can simplify this a bit by insisting that the identifier you want is in a particular C compilation unit, but the rest of it stays the same).
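Purely as an illustration of that requirement (hypothetical code, not the output of any particular tool): asking for frobnicate below must also pull in scale_factor and clamp, and must emit them before frobnicate so that the result still compiles.

/* definitions scattered across the original project */
static const int scale_factor = 42;   /* depended on by clamp and frobnicate */

static int clamp(int v)               /* depended on by frobnicate */
{
    return v > scale_factor ? scale_factor : v;
}

int frobnicate(int v)                 /* the identifier you asked for */
{
    return clamp(v) * scale_factor;
}

The extracted output is exactly these three definitions, emitted in an order (scale_factor, clamp, frobnicate) that honors the depends-on partial order.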
Our DMS Software Reengineering Toolkit with its C Front End can be used to do this. DMS is a general-purpose program transformation system, capable of parsing source files into ASTs, performing full name resolution (i.e., building symbol tables), and doing flow analysis (though that isn't needed for your task). Given those ASTs and the symbol tables, it can be configured to compute this transitive dependency set using the symbol table information, which records where symbols are defined in the ASTs. Finally, it can be configured to assemble the ASTs of interest into a linear order honoring the partial order.
We have done all this with DMS in the past, where the problem was to generate SOA-like interfaces based on other criteria; after generating the SOA code, the tool picked out all the dependencies for the SOA code and did exactly what was required. The dependency extraction machinery is part of the C front end.
A complication for the C world is that the preprocessor may get in the way; for the particular task we accomplished, the extraction was done over a specific configuration of the application and so the preprocessor directives were all expanded away. If you want this done and retain the C preprocessor directives, you'll need something beyond what DMS can do today. (We do have experimental work that captures macros and preprocessor conditionals in the AST but that's not ready for release to production).
You'd think this problem would be harder with C++, but it is not, because the preprocessor is used far more lightly in C++ programs. While we have not done extraction for C++, it would follow exactly the same approach as for C.
So that's the good part with respect to your question.
The not-so-good part from your point of view, perhaps, is that DMS isn't FOSS; it is a commercial tool designed to be used by my company and our customers to build custom analysis and transformation tools for all those tasks you can't get off the shelf but that make economic sense to automate. Nor does DMS run natively on Linux; rather, it is a Windows-based tool. It can reach across the network using NFS to access files on other systems, including Linux, and it does run under Wine on Linux.

Related

How to automatically merge C source files?

I have a single executable which consists of many .c source files across several directories.
Currently I need to run static analysis on the whole source code, not on each file separately.
I just found that GCC LTO (link-time optimisation) works by compressing GIMPLE, which mirrors the preprocessed source.
Also, when the compiler crashes during the LTO linking phase, it asks for the preprocessed sources to be sent with the bug report.
By merging source files, I mean combining all the files used for creating the executable into a single file. Compiling and linking that single file would create the same executable, effectively doing manually what LTO does. (But that's not the aim here; static analysers don't support things like IPO/LTO.)
Doing this manually would definitely take hours...
So, is there a way to merge C source files automatically? Or at least to get the LTO preprocessed sources? (It seems the -save-temps option does nothing interesting during linking with LTO.)
CIL (C Intermediate Language) has a 'merger' feature which I've successfully used for some simple merge operations.
I've used it to merge moderately complicated programs - around a hundred files in different folders. Of course, if your codebase includes any C++ the CIL merger won't be able to merge that.
No, because for example two files might have conflicting static declarations. There are other ways that moving source code into a single file might make it stop working, and diagnosing every possible one would require solving the Halting Problem. (Does an arbitrary program ever use the result of __FILE__ in such a way that it fails if two sections of code are in the same file?) File-scope declarations are the most likely to occur in the real world, though.
That said, you can try just concatenating the files and seeing what error messages you get. Most headers should keep working if you #include them twice. A conflicting identifier name can be fixed by a search-and-replace in the original files.
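A minimal sketch of the kind of clash meant above (the file and identifier names are made up):

/* a.c */
static int counter = 0;        /* file-scope, private to a.c */
int next_a(void) { return ++counter; }

/* b.c */
static double counter = 1.0;   /* same name, different type - fine while the files are separate */
double next_b(void) { return counter *= 2.0; }

/* concatenated into one file, the two definitions of 'counter' now collide
   ("conflicting types for 'counter'"); renaming one of them in the original
   sources, e.g. to a_counter and b_counter, resolves the clash */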

How does PC-Lint (by Gimpel) look across multiple modules?

I'm using Gimpel's PC-Lint v8.00 on a C codebase and am looking to understand how it traverses modules. The PC-lint manual only goes as far as to say that PC-Lint "looks across multiple modules". How does it do this? For example, does it start with one module and combine all related include files and source files into one large piece of code to analyze? How deep does it search in order to understand the program flow?
In a second related question, I have a use case where it is beneficial for me to lint one C module from the codebase at a time instead of providing every C module in a long list to PC-Lint. However, if I only provide one C module, will it automatically find the other C modules which it depends on, and use those to understand the program flow of the specified C module?
PC-lint creates some sort of run-time database when it parses your source files, noting things like global variables, extern declarations, etc.
When it has processed all compilation units (C files with all included files, recursively), it does what a linker does to generate your output, but instead of generating code it reports certain types of errors, for instance: an extern declaration that has never been used, an unused prototype without an implementation, or unused global functions. These are issues the linker does not usually report, since code generation is still perfectly possible; the items have simply never been used anywhere.
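As a hypothetical illustration of those cross-module findings (made-up file and symbol names, not taken from the manual):

/* util.h */
extern int table_size;          /* declared here ...                       */
int lookup(int key);            /* prototype with no implementation        */

/* util.c */
int table_size = 100;           /* ... defined, but never used anywhere    */

/* main.c */
#include "util.h"
int main(void) { return 0; }    /* never touches table_size or lookup()    */

The program compiles and links cleanly, yet a whole-project lint pass can report that table_size is never referenced and that lookup() is declared but never defined or called - exactly the class of issue a linker stays silent about.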
The search depth can be influenced by the option -passes, which enables much better value tracking at the cost of execution time. Refer to section 10.2.2.4 in the PDF manual (for version 9.x).
To your second question: no, if you only provide one (or a few) source (C) file name(s) on your Lint command line, PC-lint will process only those files - and all include files they use, recursively. You may want to use the option -u for "unit checkout" to tell PC-lint that it is only processing part of a full project; Lint will then suppress certain kinds of warnings that are not useful for a partial project.
I think in principle you're asking about LINT OBJECT MODULES, see Chapter 9 of Lint Manual PDF.
Using, say, lint -u a1.c -oo produces a1.lob, and the resulting .lob files can then be linked together using lint *.lob to produce the inter-module messages.
You also asked a related, specific question (Any tips for speeding up static analysis tool PC-Lint? Any experiences using .LOB files?), but I'm not sure I understand your concern with "How much would you say it affected linting time?", because I would say it depends. What is your current lint time/speed? You posted some years ago now; how about running the job on a newer machine with a newer CPU? KR

How to ensure unused symbols are not linked into the final executable?

First of all, my apologies to those of you who have followed my questions posted over the last few days. This might sound a little repetitive, as I had been asking questions related to -ffunction-sections & -fdata-sections and this one is along the same lines. Those questions and their answers didn't solve my problem, so I realized it is best for me to state the full problem here and let SO experts ponder it. Sorry for not doing so earlier.
So, here goes my problem:
I build a set of static libraries which provide a lot of functionalities. These static libraries will be provided to many products. Not all products will use all of the functionalities provided by my libs. The problem is that the library sizes are quite big and the products want it to be reduced. The main goal is to reduce the final executable size and not the library size itself.
Now, I did some research and found out that if there are 4 functions in a source file and only one of them is used by the application, the linker will still include the other 3 functions in the final executable, as they all belong to the same object file. I analyzed further and found that -ffunction-sections, -fdata-sections and --gc-sections (the last one is a linker option) will ensure that only that one function gets linked.
But, these options for some reasons beyond my control cannot be used now.
Is there any other way in which I can ensure that the linker will link only the function which is strictly required and exclude all other functions even if they are in the same object file?
Are there any other ways of dealing with the problem?
Note: Reorganizing my code is almost ruled out as it is a legacy code and big.
I am dealing mainly with VxWorks & GCC here.
Thanks for any help!
Ultimately, the only way to ensure that only the functions you want are linked is to ensure that each source (object) file in the library only exports one function symbol - one (visible) function per file. Typically, there are some files which export several functions which are always all used together - the initialization and finalization functions for a package, for example. Also, there are often functions used by the exported function that do not need to be visible outside the source (object) file - make sure they are static.
If you look at Plauger's "The Standard C Library", you'll find that every function is implemented in a separate file, even if the file ends up only a few lines long (a header include, the function signature, an open brace, one line of code, and a close brace).
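A minimal sketch of that layout, with made-up names:

/* frob.h - the one public header */
int frob_open(const char *name);
int frob_close(int handle);

/* frob_open.c - exports exactly one visible function */
#include "frob.h"

static int validate(const char *name)   /* internal helper: keep it static */
{
    return name != NULL && name[0] != '\0';
}

int frob_open(const char *name)
{
    return validate(name) ? 0 : -1;
}

/* frob_close.c - the companion lives in its own file, so a program that
   only ever calls frob_open() never pulls frob_close() out of the library */
#include "frob.h"

int frob_close(int handle)
{
    return handle >= 0 ? 0 : -1;
}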
Jay asked:
In the case of a big project, doesn't it become difficult to manage with so many files? Also, I don't find many open source projects following this model. OpenSSL is one example.
I didn't say it was widely used - it isn't. But it is the way to make sure that binaries are minimized. The compiler (linker) won't do the minimization for you - at least, I'm not aware of any that do. On a large project, you design the source files so that closely related functions that will normally all be used together are grouped in single source files. Functions that are only occasionally used should be placed in separate files. Ideally, the rarely used functions should each be in their own file; failing that, group small numbers of them into small (but non-minimal) files. That way, if one of the rarely used functions is used, you only get a limited amount of extra unused code linked.
As to number of files - yes, the technique espoused does mean a lot of files. You have to weigh the workload of managing (naming) lots of files against the benefit of minimal code size. Automatic build systems remove most of the pain; VCS systems handle lots of files.
Another alternative is to put the library code into a shared object - or dynamic link library (DLL). The programs then link with the shared object, which is loaded into memory just once and shared between programs using it. The (non-constant) data is replicated for each process. This reduces the size of the programs on disk, at the cost of fixups during the load process. However, you then don't need to worry about executable size; the executables do not include the shared objects. And you can update the library (if you're careful) without recompiling the main programs that use it. The reduced size of the executables is one reason shared libraries are popular.

How do you compare two files containing C code based on code structure, not merely textual differences?

I have two files containing C code which I wish to compare. I'm looking for a utility which will construct a syntax tree for each file, and compare the syntax trees, instead of merely comparing the text of the files. This way minor differences in formatting and style will be ignored. It would be nice to even be able to tell the comparison tool to ignore differences such as variable names, etc.
Correct me if I'm wrong, but diff doesn't have this capability. I'm a Ubuntu user. Thanks!
Our SD Smart Differencer does exactly what you want. It uses compiler-quality parsers to read source code and build ASTs for the two files you select. It then compares the trees guided by the syntax, so it doesn't get confused by whitespace, layout or comments. Because it normalizes the values of constants, it doesn't get confused by a change of radix or by how you expressed escape sequences!
The deltas are reported at the level of the language constructs (variable, expression, statement, declaration, function, ...) in terms of programmer intent (delete, insert, copy, move), complete with determining that an identifier has been renamed consistently throughout a changed block.
The SmartDifferencer has versions available for C (in a number of dialects; if you want a compiler-accurate parse, the language dialect matters) as well as for C++, Java, C#, JavaScript, COBOL, Python and many other languages.
If you want to understand how a set of files are related to one another, our SD CloneDR will accept a very large set of files, and tell you what they have in common. It finds code that has been copy-paste-edited across the entire set. You don't have to tell it what to look for; it finds it automatically. Using ASTs (as above), it isn't fooled by whitespace changes or renames of identifiers. There's a bunch of sample clone detection reports for various languages at the web site.
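To illustrate the kind of difference such tree-based tools are meant to ignore (hypothetical snippets, not product output):

/* version 1 */
int sum(int *a, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += a[i];
    return total;
}

/* version 2 - reformatted and with 'total' renamed to 'acc';
   a textual diff flags every line, while an AST-based comparison
   that tracks identifier renames sees the two as equivalent */
int sum(int *a, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
    {
        acc += a[i];
    }
    return acc;
}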
There is a program called Code Compare from Devart (http://www.devart.com/codecompare/benefits.html#cc) that includes the following feature (I know it is not exactly what you asked for, but it can probably be used for that).
The feature is called "Structure Comparison".
This functionality lets you compare different file revisions by the presence of structural blocks (classes, fields, methods), so different versions of the same file are compared independently of where those blocks sit in the file.
Structure comparison can be applied to the following languages:
C#
C++
Visual Basic
JavaScript
(I know it does not include C, but maybe with the C++ version you can solve the problem)

Compile-time lookup array creation for ANSI-C?

A previous programmer preferred to generate large lookup tables (arrays of constants) to save runtime CPU cycles rather than calculating values on the fly. He did this by creating custom Visual C++ projects that were unique for each individual lookup table... which generate array files that are then #included into a completely separate ANSI-C micro-controller (Renesas) project.
This approach is fine for his original calculation assumptions, but has become tedious when the input parameters need to be modified, requiring me to recompile all of the Visual C++ projects and re-import those files into the ANSI-C project. What I would like to do is port the Visual C++ source directly into the ANSI-C microcontroller project and let the compiler create the array tables.
So, my question is: Can ANSI-C compilers compute and generate lookup arrays during compile time? And if so, how should I go about it?
Thanks in advance for your help!
Is there some reason you can't import his code generation architecture to your build system?
I mean, in make I might consider something like:
# each table_* directory holds one code-generation project
TABLES := $(wildcard table_*)
# the header that each project generates, e.g. table_3/table_3.h
TABLE_INCS := $(foreach dir,$(TABLES),$(dir)/$(dir).h)
# each fragment supplies the rule for rebuilding its directory's header
include $(foreach dir,$(TABLES),$(dir)/makefile.inc)
$(MAIN): $(SRS) $(TABLE_INCS)
where each table_* directory contains a complete code-generation project whose sole purpose is to build table_n/table_n.h. Each table directory also holds a makefile fragment named makefile.inc that provides the dependency lines for its generated include file, so there is no recursive make.
Done right (and this implementation isn't finished, in part because the point is clearer this way but mostly because I am lazy), you could edit table_3/table_3.input, type make in the main directory and get table_3/table_3.h rebuilt and the program incrementally recompiled.
I guess that depends on the type of values you need to look up. If the processing needed to compute each value demands more than, say, constant-expression evaluation can deliver, you're going to have problems.
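For cases that do fit within constant expressions, a plain C compiler will fold the table at compile time; a minimal sketch (the SQUARE macro and table contents are just an example):

/* built entirely from constant expressions, so the compiler emits the
   finished table into the object file - no runtime computation at all */
#define SQUARE(n) ((n) * (n))

static const unsigned int square_table[] = {
    SQUARE(0), SQUARE(1), SQUARE(2), SQUARE(3),
    SQUARE(4), SQUARE(5), SQUARE(6), SQUARE(7)
};

/* anything needing loops or library calls (sin(), log(), ...) is beyond
   what C89 constant expressions can express, which is where an external
   generator - or heavy preprocessor iteration - comes in */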
Check out the Boost preprocessor library. It's written for C++ but as far as I'm aware, the two preprocessors are pretty much identical, and it can do this sort of thing.
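A sketch of that approach, generating the same sort of table as above by preprocessor iteration (this assumes Boost.Preprocessor is on the include path; BOOST_PP_ENUM is the library's standard repetition macro, while SQUARE_ENTRY and the table size are hypothetical):

#include <boost/preprocessor/repetition/enum.hpp>

/* BOOST_PP_ENUM(count, macro, data) expands to the comma-separated list
   macro(z, 0, data), macro(z, 1, data), ..., macro(z, count-1, data) */
#define SQUARE_ENTRY(z, n, data) ((n) * (n))

static const unsigned int square_table[] = {
    BOOST_PP_ENUM(16, SQUARE_ENTRY, ~)
};

Since it is all preprocessor expansion, this works under an ANSI-C compiler just as it does under C++, though each generated entry is still limited to what a constant expression can compute.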

Resources