Automatically flag unused structure members in C

Automatically flag unused structure members in C - c

I need a tool to automatically flag unused structure members in a C codebase. My definition of "unused" is simple - if the structure member definition is removed from the code, and the code compiles successfully, then the structure member is declared unused. The question is - how can this be done in an automated way? (speed isn't too much of concern as the codebase is small).
The existing stack overflow articles on this topic seem to hint that there is no existing static analysis tool that can do this today. On the other hand, given the modularity of Clang, I feel that this should be doable with AST manipulation. Let's take a single file, for example. What I would like to do is the following (this can be later generalized to a set of source files in the codebase):
Generate AST from C code.
Recursively visit all structure field defintions and remove them one-by-one. We can keep a "seen" dictionary to ensure we don't remove already seen field definition nodes.
Filter out field definitions to only those that are present in the codebase to analyzed (to avoid definitions in standard libraries, for example).
Compile the code.
If the code compiles successfully, then the corresponding field declaration is unused and is flagged.
Proceed to #1.
The keyword above is remove. How can I remove a field definition? There seems to be two ways using Clang.
At a source level, we can remove the field declaration using Clang Rewriter (there is a "RemoveText(SourceRange)" option). But, I don't know if this will work all the time (ex: for structures are autogenerated using MACRO expansion).
Delete the field declaration node from AST, and then "re-compile" the AST (whatever that means).
Among the above two options, #1 seems hacky - you'll need to create a copy of the source file, re-write it after a field definition is removed, and then re-compile the modified source. And, I am not sure how well it will work when there are complex MACROS involved for generating structure field definitions.
#2 seems clean, but from Googling, there seems to be no such thing as "deleting a AST node" (it is immutable). Please correct me if I am wrong. Even if I succeed in this, how I proceed from this point to re-evaluate the AST for missing references to structure fields? ("the compilation" step).
Any sugesstions appreciated (thanks in advance!). I've already have some initial success with #1 approach above, but I feel that this isn't the right direction.

cppcheck can do this. For example:
// test.cpp
struct Struct
{
int used;
int unused;
};
int main()
{
Struct s;
s.used = 0;
return s.used;
}
$ cppcheck test.cpp --enable=all
Checking test.cpp ...
test.cpp:5:9: style: struct member 'Struct::unused' is never used. [unusedStructMember]
int unused;
^
While I used C++ code in the example it behaves the same for C.

Related

How can I maintain correlation between structure definitions and their construction / destruction code?

When developing and maintaining code, I add a new member to a structure and sometimes forget to add the code to initialize or free it which may later result in a memory leak, an ineffective assertion, or run-time memory corruption.
I try to maintain symmetry in the code where things of the same type are structured and named in a similar manner, which works for matching Construct() and Deconstruct() code but because structures are defined in separate files I can't seem to align their definitions with the functions.
Question: is there a way through coding to make myself more aware that I (or someone else) has changed a structure and functions need updating?
Efforts:
The simple:
-Have improved code organization to help minimize the problem
-Have worked to get into the habit of updating everything at once
-Have used comments to document struct members, but this just means results in duplication
-Do use IDE's auto-suggest to take a look and compare suggested entries to implemented code, but this doesn't detect changes.
I had thought that maybe structure definitions could appear multiple times as long as they were identical, but that doesn't compile. I believe duplicate structure names can appear as long as they do not share visibility.
The most effective thing I've come up with is to use a compile time assertion:
static_assert(sizeof(struct Foobar) == 128, "Foobar structure size changed, reevaluate construct and destroy functions");
It's pretty good, definitely good enough. I don't mind updating the constant when modifying the struct. Unfortunately compile time assertions are very platform (compiler) and C Standard dependent, and I'm trying to maintain the backwards compatibility and cross platform compatibility of my code.
This is a good link regarding C Compile Time Assertions:
http://www.pixelbeat.org/programming/gcc/static_assert.html
Edit:
I just had a thought; although a structure definition can't easily be relocated to a source file (unless it does not need to be shared with other source files), I believe a function can actually be relocated to a header file by inlining it.
That seems like a hacked way to make the language serve my unintended purpose, which is not what I want. I want to be professional. If the professional practice is not to approach this code-maintainability issue this way, then that is the answer.

I've been programming in C for almost 40 years, and I don't know of a good solution to this problem.
In some circles it's popular to use a set of carefully-contrived macro definitions so that you can write the structure once, not as a direct C struct declaration but as a sequence of these macros and then, by defining the macro differently and re-expanding, turn your "definition" into either a declaration or a definition or an initialization. Personally, I feel that these techniques are too obfuscatory and are more trouble than they're worth, but they can be used to decent effect.
Otherwise, the only solution -- though it's not what you're looking for -- is "Be careful."
In an ideal project (although I realize full well there's no such thing) you can define your data structures first, and then spend the rest of your time writing and debugging the code that uses them. If you never have occasion to add fields to structs, then obviously you won't have this problem. (I'm sorry if this sounds like a facetious or unhelpful comment, but I think it's part of the reason that I, just as #CoffeeTableEspresso mentioned in a comment, tend not to have too many problems like this in practice.)
It's perhaps worth noting that C++ has more or less the same problem. My biggest wishlist feature in C++ was always that it would be possible to initialize class members in the class declaration. (Actually, I think I've heard that a recent revision to the C++ standard does allow this -- in which case another not-necessarily-helpful answer to your question is "Use C++ instead".)

C doesn't let you have benign struct redefinitions but it does let you have benign macro redefinitions.
So as long as you
save the struct body in a macro (according to a fixed naming convention)
redefine the macro at the point of your constructor
you will get a warning if the struct body changes and you haven't updated the corresponding constructor.
Example:
header.h:
#define MC_foo_bod \
int x; \
double y; \
void *p
struct foo{ MC_foo_bod; };
foo__init.c
#include "header.h"
#ifdef MC_foo_bod
//try for a silent redefinition
//if it wasn't silent, the macro changed and so should this code
#define MC_foo_bod \
int x; \
double y; \
void *p
#else
#error ""
//oops--not a redefinition
//perhaps a typo in the macro name or a failure to include the header?
#endif
void foo__init(struct foo*X)
{
//...
}

Is there a syntax error in this function declaration?

This is from a textbook:
/* This function locates the address of where a new structure
should be inserted within an existing list.
It receives the address of a name and returns the address of a
structure of type NameRec
*/
struct NameRec *linear Locate(char *name)
{
...
}
I understand it returns a pointer to a struct NameRec. Why is "linear" there and why is there a space between "linear" and "Locate"?

#define linear
will make it syntactically correct even if it wasn't before (though, technically, you'd probably want a #undef linear beforehand to avoid possible conflicting macro definitions).
It depends entirely on the context of the code, which you haven't shown. As it stands now, with no header inclusions or definitions like -Dlinear= on the compiler command line, it would not compile in a standards-conformant environment without extensions.
The best way to tell, of course, is to just try to actually compile the thing and see what happens :-)
Given that the solutions link for chapter 13 (the one you're asking about) has no mention of the linear word in the solution, I'd say it's a safe bet to assume your book is incorrect. I'd consider contacting the author (apparently currently working at FDU in New Jersey) to clear it up.

It's a typo in the book. See the locate function here:
https://users.ipfw.edu/chansavj/ACY2017/ANSI_C/ANSI_C_4thEd/Solutions%20to%20Exercises%20(Windows)/Solutions/83556-0s/Ch13/pgm13-5ex3.c
(Posted by ta.speot.is in the comments)

#define in C, legal character

There is a C structure
struct a
{
int val1,val2;
}
I have made changes to the code like
struct b
{
int val2;
}
struct a
{
int val1;
struct b b_obj;
}
Now, usage of val2 in the other C files is like a_obj->val2;.
I want to replace its declaration usage and there are a lot of them, so I have defined a macro in the header file where the struct a is defined as follows:
#define a_obj->val2 (a_obj->b_obj.val2)
It's not working. Is -> illegal in the identifier part of a macro definition #define?
Could someone please tell me where am I wrong?
Edit as suggested by #Basile -
It's a legacy source code, a very huge project. Not sure of LOC.
I want to make such changes because I want to make it more modular.
For example I want to group similar fields of the structure under a same name and that's the reason I want to create another struct B with fields which are related to B feature and also common to A.
I can't use Find Replace feature of other text editors, I am using VIM.

This kind of macro magic will get you into trouble soon,
because it is making your source code unreadable and brittle (credits Basile for the phrasing).
But this should work for what you describe.
struct b
{
int val2m;
}
struct a
{
int val1;
struct b b_obj;
}
#define val2 b_obj.val2m
The trick is to give the actual identifier inside the struct declaration a new name (val2m), so that the name all the other code uses can be turned into a magic alias,
which then can contain the modified access to take a detour via the additionally introduced inner struct.
This is only a kind of band-aid for the problematic situation of having to change something backstage in existing code with many references. Only use it if there is no chance of refactoring the code cleanly. ("band-aid", appropriate image by StoryTeller, credits).
I explicitly recommend looking at Basiles answer, for a cleaner more "future-proof" way. It is the way to go to avoid the trouble I predict with using this macro magic. Use it if you are not forced by very good reasons.

As other explained, the preprocessor works only on tokens, and you can only #define a name. Read the documentation of cpp and the C11 standard n1570.
What you want to do is very ugly (and there are few occasions where it is worthwhile). It makes your code messy, unreadable, and brittle.
Learn to use better your source code editor (you probably have some interactive replace, or interactive replace with regexp-s; if you don't, switch to a better editor like GNU emacs or vim - and study the documentation of your editor). You could also use scripting tools like ed, sed, grep, awk etc... to help you in doing those replacements.
In a small project, replacing relevant occurrences of ->val2 (or .val2) with ->b_obj.val2 (or .b_obj.val2) is really easy, even if you have a hundred of them. And that keeps your code readable. Don't forget to use some version control system (to keep both old and new versions of your code).
In a large project of at least a million of lines of source code, you might ask how to find every occurrence of field usage of val2 for a given type (but you should probably name val2 well enough to have most occurrences of it be relevant; in other words, take care of the naming of your fields). That is a very different question (e.g. you could write some GCC plugin to find such occurrences and help you in replacing the relevant ones).
If you are refactoring an old and large legacy code, you need to be sure to keep it readable, and you don't want fancy macro tricks. For example, you might add some static inline function to access that field. And it could be then worthwhile to use some better tools (e.g. a compiler plugin, some kind of C parser, etc...) to help you in that refactoring.
Keep the source code readable by human developers. Otherwise, you are shooting yourself in the foot. What you want to do is unreasonable, it decreases the readability of the code base.
I can't use Find Replace feature of other text editors, I am using VIM.
vim is scriptable (e.g. in lua) and accepts plugins (so if interactive replace is not enough, consider writing some vim plugin or script to help you), and has powerful find-replace-regexp facilities. You might also use some combination of scripts to help you. In many cases they are enough. If they are not, you should explain why.
Also, you could temporarily replace the val2 field of struct a with a unique name like val2_3TYRxW1PuK7 (or whatever is appropriate, making some unique "random-looking" name is easy). Then you run your full build (e.g. after some make clean). The compiler would emit error messages for every place where you need to replace val2 used as a field of struct a (but won't mind for any other occurrence of the val2 name used for some other purpose). That could help you a lot -once you have corrected your code to get rid of all errors- (especially when combined with some editor scripting) because then you just need to replace val2_3TYRxW1PuK7 with b_obj.val2 everywhere.

Is -> illegal in #define?
Yes.
#define identifier can only be letter, number or underscore.

Macros definitions must be regular identifiers, so you can't use any special character like - or >.
I've thinked that may be you can use an union, like this:
struct b
{
int val2;
}
struct a
{
int val1;
union {
struct b b_obj;
int val2;
}
}
so you can still using a_obj->val2.

How to compile and keep "unused" C declarations with clang -emit-llvm

Context
I'm writing a compiler for a language that requires lots of runtime functions. I'm using LLVM as my backend, so the codegen needs types for all those runtime types (functions, structs, etc) and instead of defining all of them manually using the LLVM APIs or handwriting the LLVM IR I'd like to write the headers in C and compile to the bitcode that the compiler can pull in with LLVMParseBitcodeInContext2.
Issue
The issue I'm having is that clang doesn't seem to keep any of the type declarations that aren't used by any any function definitions. Clang has -femit-all-decls which sounds like it's supposed to solve it, but it unfortunately isn't and Googling suggests it's misnamed as it only affects unused definitions, not declarations.
I then thought perhaps if I compile the headers only into .gch files I could pull them in with LLVMParseBitcodeInContext2 the same way (since the docs say they use "the same" bitcode format", however doing so errors with error: Invalid bitcode signature so something must be different. Perhaps the difference is small enough to workaround?
Any suggestions or relatively easy workarounds that can be automated for a complex runtime? I'd also be interested if someone has a totally alternative suggestion on approaching this general use case, keeping in mind I don't want to statically link in the runtime function bodies for every single object file I generate, just the types. I imagine this is something other compilers have needed as well so I wouldn't be surprised if I'm approaching this wrong.
e.g. given this input:
runtime.h
struct Foo {
int a;
int b;
};
struct Foo * something_with_foo(struct Foo *foo);
I need a bitcode file with this equivalent IR
runtime.ll
; ...etc...
%struct.Foo = type { i32, i32 }
declare %struct.Foo* #something_with_foo(%struct.Foo*)
; ...etc...
I could write it all by hand, but this would be duplicative as I also need to create C headers for other interop and it'd be ideal not to have to keep them in sync manually. The runtime is rather large. I guess I could also do things the other way around: write the declarations in LLVM IR and generate the C headers.
Someone else asked about this years back, but the proposed solutions are rather hacky and fairly impractical for a runtime of this size and type complexity: Clang - Compiling a C header to LLVM IR/bitcode

Clang's precompiled headers implementation does not seem to output LLVM IR, but only the AST (Abstract Syntax Tree) so that the header does not need to be parsed again:
The AST file itself contains a serialized representation of Clang’s
abstract syntax trees and supporting data structures, stored using the
same compressed bitstream as LLVM’s bitcode file format.
The underlying binary format may be the same, but it sounds like the content is different and LLVM's bitcode format is merely a container in this case. This is not very clear from the help page on the website, so I am just speculating. A LLVM/Clang expert could help clarify this point.
Unfortunately, there does not seem to be an elegant way around this. What I suggest in order to minimize the effort required to achieve what you want is to build a minimal C/C++ source file that in some way uses all the declarations that you want to be compiled to LLVM IR. For example, you just need to declare a pointer to a struct to ensure it does not get optimized away, and you may just provide an empty definition for a function to keep its signature.
Once you have a minimal source file, compile it with clang -O0 -c -emit-llvm -o precompiled.ll to get a module with all definitions in LLVM IR format.
An example from the snippet you posted:
struct Foo {
int a;
int b;
};
// Fake function definition.
struct Foo * something_with_foo(struct Foo *foo)
{
return NULL;
}
// A global variable.
struct Foo* x;
Output that shows that definitions are kept: https://godbolt.org/g/2F89BH

So, clang doesn't actually filter out the unused declarations. It defers emitting forward declarations till their first use. Whenever a function is used it checks if it has been emitted already, if not it emits the function declaration.
You can look at these lines in the clang repo.
// Forward declarations are emitted lazily on first use.
if (!FD->doesThisDeclarationHaveABody()) {
if (!FD->doesDeclarationForceExternallyVisibleDefinition())
return;
The simple fix here would be to either comment the last two lines or just add && false to the second condition.
// Forward declarations are emitted lazily on first use.
if (!FD->doesThisDeclarationHaveABody()) {
if (!FD->doesDeclarationForceExternallyVisibleDefinition() && false)
return;
This will cause clang to emit a declaration as soon as it sees it, this might also change the order in which definitions appear in your .ll (or .bc) files. Assuming that is not an issue.
To make it cleaner you can also add a command line flag --emit-all-declarations and check that here before you continue.

Is hidden declaration possible in a project?

In my project, a structre is being used in several functions.
like this:
void function1 (Struct_type1 * pstType1);
but when I search for Struct_type1 's references, I can't find any. This Structure must be defined somewhere. How to find the definition?
OS- Windows
Edit: I think its difficult to answer this without source code and I can't share that big project here. So, I've changed my question to:
Is Hidden Declaration possible in an embedded project?
(by hidden I mean no one can see the definition.)

Is Hidden Declaration possible in an embedded project?
If you have access to all source code in the project, then no.
This is only possible in one specific case, and that is when you have an external library for which you don't have the C code, you only have a header file and an object file or lib file (or DLL etc).
For such cases it is possible (and good practice) for the library header to forward-declare an incomplete type in the header, and hide the actual implementation in the C file which you don't have access to.
You would then have something like this in the h file:
typedef struct Struct_type1 Struct_type1;
The compiler might often do things like this with its own libraries too, if they want to hide away the implementation. One such example is the FILE struct.

Not an answer, but possibly a way to find the answer. Idea: Let compiler help you.
Define the struct yourself, then look at compiler errors like "struct struct_type1 is already defined in... at line ..."
If you get no compiler error in this case, maybe the struct is only forward declared, but not defined.
To explain why this is sometimes done, here a bit of code:
// Something.h
struct struct_type1; // Forward declaration.
struct struct_type1 *SomethingInit();
void SomethingDo( struct struct_type1 * context );
In code looking like the above, the definition of the struct is hidden inside the implementation. On the outside, it need not be known, how the struct is defined or its size etc, as it is only traded as a pointer to the struct (and never as a value). This technique is used to keep internal types out of public header files and used often by library designers. You can think of it as an opaque handle of sorts.
But then, you still should be able to find the forward declaration, albeit not the definition.