tracking uninitialized static variables - c

I need to debug an ugly and huge math C library, probably once produced by f2c. The code is abusing local static variables, and unfortunately somewhere it seems to exploit the fact that these are automatically initialized to 0. If its entry function is called with the same input twice, it is giving different results. If I unload the library and reload it again, it works correctly. It needs to be fast, so I would like to get rid of the load/unload.
My question is how to uncover these errors with valgrind or any other tool without manually walking through the entire code.
I am hunting places where a local static variable is declared, read first, and written only later. The problem is further complicated by the fact that the static variables are sometimes passed on via pointers (yep - it is so ugly).
I understand that one can argue that mistakes like this should not necessarily be detected by an automatic tool, as in some scenarios this is exactly the intended behaviour. Still, is there a way to make the auto-initialized local static variables "dirty"?

The devil is in the details, but this may work for you:
First, get Frama-C. If you are using Unix, your distribution may have a package. The package won't be the latest version, but it may be good enough, and installing it this way will save you some time.
Say your example is as below, only so much bigger that it's not obvious what is wrong:
int add(int x, int y)
{
    static int state;
    int result = x + y + state; // I tested it once and it worked.
    state++;
    return result;
}
Type a command like:
frama-c -lib-entry -main add -deps ugly.c
Options -lib-entry -main add mean "look at function add". Option -deps computes functional dependencies. You'll find these "functional dependencies" in the log:
[from] Function add:
state FROM state; (and default:false)
\result FROM x; y; state; (and default:false)
This lists the actual inputs the results of add depend on, and the actual outputs computed from these inputs, including static variables read from and modified. A static variable that was properly initialized before being used would normally not appear as input, unless the analyzer was unable to determine that it was always initialized before being read from.
The log shows state as dependency of \result. If you expected the returned result to depend only on the arguments (meaning two calls with the same arguments produce the same result), it's a hint there may be something wrong here, with the variable state.
Another hint shown in the above lines is that the function modifies state.
This may help or not. Option -lib-entry means that the analyzer does not assume that any non-const static variable has kept its value at the time the function under analysis is called, so that may be too imprecise for your code. There are ways around that, but then it is up to you whether you want to gamble the time it takes to learn these ways.
EDIT: here is a more complex example:
void initialize_1(int *p)
{
    *p = 0;
}
void initialize_2(int *p)
{
    *p; // I made a mistake here.
}
int add(int x, int y)
{
    static int state1;
    static int state2;
    initialize_1(&state1);
    initialize_2(&state2);
    // This is safe because I have initialized state1 and state2:
    int result = x + y + state1 + state2;
    state1++;
    state2++;
    return result;
}
On this example, the same command produces the results:
[from] Function initialize_1:
state1 FROM p
[from] Function initialize_2:
[from] Function add:
state1 FROM \nothing
state2 FROM state2
\result FROM x; y; state2
What you see for initialize_2 is an empty list of dependencies, meaning the function assigns nothing. I will make this case clearer by displaying an explicit message rather than just an empty list. If you know what any of the functions initialize_1, initialize_2 or add is supposed to do, you can compare this a priori knowledge to the results of the analysis and see that something is wrong for initialize_2 and add.
SECOND EDIT: and now my example shows something strange for initialize_1, so perhaps I should explain that. Variable state1 depends on p in the sense that p is used to write to state1, and if p had been different, then the final value of state1 would have been different. Here is a last example:
int t[10];
void initialize_index(int i)
{
    t[i] = 1;
}
int main(int argc, char **argv)
{
    initialize_index(argv[1][0]-'0');
}
With the command frama-c -deps t.c, the dependencies computed for initialize_index are:
[from] Function initialize_index:
t[0..9] FROM i (and SELF)
This means that each of the cells depends on i (it may be modified if i is the index of that particular cell). Each cell may also keep its value (if i indicates another cell): this is indicated with the (and SELF) mention in the latest version, and was indicated with a more obscure (and default:true) in previous versions.

Static code analysis tools are pretty good at finding typical programming errors like the use of uninitialized variables. Here is a list of free tools that do this for C.
Unfortunately I can't recommend any of the tools in the list. I am only familiar with two commercial products, Coverity and Klocwork. Coverity is very good (and expensive). Klocwork is so-so (but less expensive).

What I did in the end was remove all static qualifiers from the code with '#define static'. The former statics become ordinary automatic variables, so reading one before writing it is now an uninitialised read, and the kind of abuse I am hunting can be uncovered by the tools.
In my actual case this was enough to locate the bug, but in a more general situation, where some statics are actually doing something important, it should be refined by gradually re-adding 'static' wherever the code stops working.
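A minimal sketch of how that macro trick might be wired in, reusing the add example from the answer above. The switch name HUNT_STATICS is an assumption made up for this illustration, not something from the original code. Without the switch the function behaves as the library does today. With it, state becomes an automatic variable, so a tool such as valgrind can flag the read.

```c
/* Any #include of system headers should appear ABOVE this point:
 * redefining a keyword before including them is not allowed. */

/* Hypothetical debug switch (an assumption for this sketch):
 * build with -DHUNT_STATICS to strip every 'static' that follows. */
#ifdef HUNT_STATICS
#define static /* expands to nothing */
#endif

/* Stand-in for one routine of the library. */
int add(int x, int y)
{
    static int state;           /* zeroed only when the library is loaded */
    int result = x + y + state; /* read before any write in this call */
    state++;
    return result;
}
```

Note that the macro also strips static from file-scope functions and variables, which gives them external linkage and may cause name clashes at link time.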

I don't know of any library that does this for you, but I would look into using regular expressions to find them. Something like
rgrep "static\s\+int" path/to/src/root | grep -v = | grep -v "("
That should return all static int variables declared without an equals sign, and the last pipe should remove anything with parentheses in it (getting rid of functions). There's a good chance that this won't work exactly for you, but playing around with grep may be the fastest way for you to track this down.
Of course, once you find one that works you can replace int with all of the other kinds of variables to search for those too. HTH

My question is how to uncover these errors ...
But these aren't errors: the expectation that a static variable is initialized to 0 is perfectly valid, as is assigning some other value to it.
So asking for a tool that will automatically find non-errors is unlikely to produce a satisfying result.
From your description, it appears that somefunc() returns the correct result the first time it is called, and an incorrect result on subsequent calls.
The simplest way to debug such problems is to have two GDB sessions side-by-side: one freshly-loaded (will compute correct answer), and one with "second iteration" (will compute wrong answer). Then step through both sessions "in parallel", and see where their computation or control flow starts to diverge.
Since you can usually effectively divide the problem in half, it often doesn't take long to find the bug. Bugs that always reproduce are the easiest ones to find. Just do it.
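Before stepping through two sessions, it can help to have a small driver that reproduces the divergence on demand. The following is only a sketch: library_entry is a made-up stand-in for the real entry point (its signature is an assumption), and stateless_entry exists purely for comparison.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the library's entry point: the stale
 * static makes the second call differ from the first. */
double library_entry(double input)
{
    static int calls;                 /* never reset between calls */
    double result = input * 2.0 + calls;
    calls++;
    return result;
}

/* A well-behaved function for comparison. */
double stateless_entry(double input)
{
    return input * 2.0;
}

/* Returns true when two consecutive calls with the same input agree. */
bool is_repeatable(double (*entry)(double), double input)
{
    double first = entry(input);
    double second = entry(input);
    return first == second;
}
```

A breakpoint on the entry function then stops once in the "fresh" state and once in the "second iteration" state within the same run.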

Related

Usage of the const keyword

I know that using the const keyword on function arguments provides better performance, but I always forget to add it. Is the compiler (GCC in this case) smart enough to notice that the variable never changes during the function, and compile it as if I had added const explicitly?
You have a common misunderstanding about const. Only making an object const means that its value never changes, and then it's not just during a function, it never changes.
Making a pointer parameter to a function const does not mean the value it points to never changes; it just means that the function cannot change the value through that pointer. The value can change in other ways.
For example, look at this function:
void f(const int* x, int* y)
{
    cout << "x = " << *x << endl;
    *y = 5;
    cout << "x = " << *x << endl;
}
Notice that x is a pointer to const int. However, what if you call it like this:
int x = 10;
f(&x, &x);
Now, f has a pointer to const, but it points to a non-const object. So the value can change, and it does, since y is a non-const pointer to the same object. All of this is perfectly legal code. There's no funny business here.
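The same effect can be checked in plain C. This is a small sketch, with the made-up names observe and struct observation, that records what is seen through the const-qualified pointer before and after the write:

```c
/* What the function sees through x before and after writing via y. */
struct observation {
    int before;
    int after;
};

struct observation observe(const int *x, int *y)
{
    struct observation seen;
    seen.before = *x;   /* read through the pointer to const */
    *y = 5;             /* write through the other pointer */
    seen.after = *x;    /* must be re-read: x and y may alias */
    return seen;
}
```

Called as observe(&v, &v) with v equal to 10, before is 10 and after is 5, so the compiler cannot treat *x as unchanging just because x is declared const int *.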
So there's really no answer to your question since it's based entirely on false premises.
Is the compiler (GCC in this case) smart enough to notice that the variable never changes during the function, and compile it as if I had added const explicitly?
Not necessarily. For example:
void some_function(int *ptr); // defined in another translation unit
int foo(int a) {
    some_function(&a);
    return a + 1;
}
The compiler can't see what some_function does, so it can't assume that it won't modify a.
Link-time optimization could perhaps see what some_function really does and act accordingly, but as far as this answer is concerned I'll consider only optimization for which the definition of some_function is unavailable.
int bar(const int a) {
    some_function((int*)&a);
    return a + 1;
}
The compiler can't see what some_function does, but it can assume that the value of a does not change anyway. Therefore it can make any optimizations that apply: maybe it can keep a in a callee-saves register across the call to some_function; maybe it computes the return value before making the call instead of after, and zaps a. The program has undefined behavior if some_function modifies a, and so from the compiler's POV once that happens it doesn't matter whether it uses the "right" or "wrong" value for a.
So, in this example, by marking a const you have told the compiler something that it cannot otherwise know -- that some_function will not modify *ptr. Or anyway that if it does modify it, then you don't care what your program's behavior is.
int baz(int a) {
    some_function(NULL);
    return a + 1;
}
Here the compiler can see all relevant code as far as the standard is concerned. It doesn't know what some_function does, but it does know that it doesn't have any standard means to access a. So it should make no difference whether a is marked const or not because the compiler knows it doesn't change anyway.
Debugger support can complicate this situation, of course -- I don't know how things stand with gcc and gdb, but in theory at least if the compiler wants to support you breaking in with the debugger and modifying a manually then it might not treat it as unmodifiable. The same applies to the possibility that some_function uses platform-specific functionality to walk up the stack and mess with a. Platforms don't have to provide such functionality, but if they do then it conflicts with optimization.
I've seen an old version of gcc (3.x, can't remember x) that failed to make certain optimizations where I failed to make a local int variable const, but in my case gcc 4 did make the optimization. Anyway, the case I'm thinking of wasn't a parameter, it was an automatic variable initialized with a constant value.
There's nothing special about a being a parameter in any of what I've said -- it could just as well be any automatic variable defined in the function. Mind you, the only way for a parameter to get the effect of initialization with a constant value is to call the function with a constant value, and for the compiler to observe the value for that call. This tends to happen only when the function is inlined. So inlined calls to functions can have additional optimizations applied to them that the "out-of-line" function body isn't eligible for.
const, much like inline, is only a hint for the compiler and does not guarantee any performance gains. The more important task of const is to protect programmers from themselves, so that they do not unwittingly modify variables that shouldn't be modified.
1) const does not really affect performance directly. In some cases it may simplify points-to analysis (so prefer const char* to char*), but const is more about the semantics and readability of your code.
2) A cv-qualified type is a different type in C and C++. So even if the compiler sees a benefit in making something const by default, it will not do so, because that would change the type and could lead to surprising behaviour.
As part of optimization, the compiler takes a deep look at when memory locations are read or written. So the compiler is quite good at detecting when a variable is not changed (const) and when it is changed. The optimizer does not need you to tell it when a variable is const.
Nevertheless you should always use const when appropriate. Why? Because it makes interfaces more clear and easier to understand. And it helps detect bugs when you are changing a variable that you did not want to change.

gcc attributes for init-on-first-use functions

I've been using the gcc const and pure attributes for functions which return a pointer to "constant" data that's allocated and initialized on the first use, i.e. where the function will return the same value each time it's called. As an example (not my usage case, but a well-known example) think of a function that allocates and computes trig lookup tables on the first call and just returns a pointer to the existing tables after the first call.
The problem: I've been told this usage is incorrect because these attributes forbid side effects, and that the compiler could even optimize out the call completely in some cases if the return value is not used. Is my usage of const/pure attributes safe, or is there any other way to tell the compiler that N>1 calls to the function are equivalent to 1 call to the function, but that 1 call to the function is not equivalent to 0 calls to the function? Or in other words, that the function only has side effects the first time it's called?
I say this is correct based on my understanding of pure and const, but if anyone has a precise definition of the two, please speak up. This gets tricky because the GCC documentation doesn't lay out exactly what it means for a function to have "no effects except the return value" (for pure) or to "not examine any values except their arguments" (for const). Obviously all functions have some effects (they use processor cycles, modify memory) and examine some values (the function code, constants).
"Side effects" would have to be defined in terms of the semantics of the C programming language, but we can guess what the GCC folks mean based on the purpose of these attributes, which is to enable additional optimizations (at least, that's what I assume they are for).
Forgive me if some of the following is too basic...
Pure functions can participate in common subexpression elimination. Their defining feature is that they don't modify the environment, so the compiler is free to call them fewer times without changing the semantics of the program.
z = f(x);
y = f(x);
becomes:
z = y = f(x);
Or gets eliminated entirely if z and y are unused.
So my best guess is that a working definition of "pure" is "any function which can be called fewer times without changing the semantics of the program". However, function calls may not be moved, e.g.,
size_t l = strlen(str); // strlen is pure
*some_ptr = '\0';
// Obviously, strlen can't be moved here...
Const functions can be reordered, because they do not depend on the dynamic environment.
// Assuming x and y not aliased, sin can be moved anywhere
*some_ptr = '\0';
double y = sin(x);
*other_ptr = '\0';
So my best guess is that a working definition of "const" is "any function which can be called at any point without changing the semantics of the program". However, there is a danger:
__attribute__((const))
double big_math_func(double x, double theta, double iota)
{
    static double table[512];
    static bool initted = false;
    if (!initted) {
        ...
        initted = true;
    }
    ...
    return result;
}
Since it's const, the compiler could reorder it...
pthread_mutex_lock(&mutex);
...
z = big_math_func(x, theta, iota);
...
pthread_mutex_unlock(&mutex);
// big_math_func might go here, if the compiler wants to
In this case, it could be called simultaneously from two processors even though it only appears inside a critical section in your code. Then the processor could decide to postpone changes to table after a change to initted already went through, which is bad news. You can solve this with memory barriers or pthread_once.
I don't think this bug will ever show up on x86, and I don't think it shows up on many systems that don't have multiple physical processors (not cores). So it will work fine for ages and then fail suddenly on a dual-socket POWER computer.
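The pthread_once route mentioned above might look roughly like this. It is a sketch under assumptions: get_table, init_table, the table contents and the init_calls counter are made up for illustration, and the accessor is deliberately not marked const or pure.

```c
#include <pthread.h>
#include <stddef.h>

#define TABLE_SIZE 512

static double table[TABLE_SIZE];
static pthread_once_t table_once = PTHREAD_ONCE_INIT;
static int init_calls;              /* demonstration only: counts initializations */

/* Runs exactly once, no matter how many threads race to get here. */
static void init_table(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i] = (double)i / TABLE_SIZE;  /* placeholder contents */
    init_calls++;
}

/* Hypothetical accessor: initializes on first use, then just returns. */
const double *get_table(void)
{
    pthread_once(&table_once, init_table);
    return table;
}
```

pthread_once also guarantees that every caller returning from it sees the completed initialization, which is the ordering the hand-rolled initted flag does not provide. On some systems this needs to be built with -pthread.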
Conclusion: The advantage of these definitions is that they make it clear what kind of changes the compiler is allowed to make in the presence of these attributes, which is (I think) somewhat vague in the GCC docs. The disadvantage is that it's not clear that these are the definitions used by the GCC team.
If you look at the Haskell language specification, for example, you'll find a much more precise definition of purity, since purity is so important to the Haskell language.
Edit: I have not been able to coerce GCC or Clang into moving a solitary __attribute__((const)) function call across another function call, but it seems entirely possible that in the future, something like that would happen. Remember when -fstrict-aliasing became the default, and everybody suddenly had a lot more bugs in their programs? It's stuff like that that makes me cautious.
It seems to me that when you mark a function __attribute__((const)), you are promising the compiler that the result of the function call is the same no matter when it is called during your program's execution, as long as the parameters are the same.
However, I did come up with a way of moving a const function out of a critical section, although the way I did it could be called "cheating" of a sort.
__attribute__((const))
extern int const_func(int x);
int func(int x)
{
    int y1, y2;
    y1 = const_func(x);
    pthread_mutex_lock(&mutex);
    y2 = const_func(x);
    pthread_mutex_unlock(&mutex);
    return y1 + y2;
}
The compiler translates this into the following code (from the assembly):
int func(int x)
{
    int y;
    y = const_func(x);
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    return y * 2;
}
Note that this won't happen with only __attribute__((pure)): the const attribute, and only the const attribute, triggers this behavior.
As you can see, the call inside the critical section disappeared. It seems rather arbitrary that the earlier call was kept, and I would not be willing to wager that the compiler won't, in some future version, make a different decision about which call to keep, or whether it might move the function call somewhere else entirely.
Conclusion 2: Tread carefully, because if you don't know what promises you are making to the compiler, a future version of the compiler might surprise you.

Local synonymous variable to non exact type

I'm a little bit new to C, so I'm not familiar with how I would approach a solution to this issue. As you read on, you will notice it's not critical that I find a solution, but it sure would be nice for this application and future reference. :)
I have a parameter int hello and I want to make a synonym for its negation, !hello.
f(int hello, structType* otherParam){
    // I would like to have a synonym for (!hello)
}
My first thought was to make a local constant, but I'm not sure if there will be additional memory consumption. I'm building with GCC and I really don't know if it would recognize a constant derived from a parameter (before any modifications) as just a synonymous variable. I don't think so, because the parameter could (even though it won't) be changed later on in that function, which would not affect the constant.
I then thought about making a local typedef, but I'm not sure of the exact syntax for doing so. I attempted the following:
typedef (!hello) hi;
However I get the following error.
D:/src-dir/file.c: In function 'f':
D:/src-dir/file.c: 00: error: expected identifier or '(' before '!' token
Any help is appreciated.
In general, in C, you want to write the code that most clearly expresses your intentions, and allow the optimiser to figure out the most efficient way to implement that.
In your example of a frequently-reused calculation, storing the result in a const-qualified variable is the most appropriate way to do this - something like the following:
void f(int hello)
{
    const int non_hello = !hello;
    /* code that uses non_hello frequently */
}
or more likely:
void x(structType *otherParam)
{
    char * const d_name = otherParam->b->c->d->name;
    /* code that uses d_name frequently */
}
Note that such a const variable does not necessarily have to be allocated any memory (unless you take its address with & somewhere) - the optimiser might simply place it in a register (and bear in mind that even if it does get allocated memory, it will likely be stack memory).
Typedef defines an alias for a type, it's not what you want. So..
Just use !hello where you need it
Why would you need a "synonym" for a !hello ? Any programmer would instantly recognize !hello instead of looking for your clever trick for defining a "synonym".
Given:
f(int hello, structType* otherParam){
    // I would like to have a synonym for (!hello)
}
The obvious, direct answer to what you have here would be:
f(int hello, structType *otherParam) {
    int hi = !hello;
    // ...
}
I would not expect to see any major (or probably even minor) effect on execution speed from this. Realistically, there probably isn't a lot of room for improvement in the execution speed.
There are certainly times something like this can make the code more readable. Also note, however, that when/if you modify the value of hello, the value of hi will not be modified to match (unless you add code to update it). It's rarely an issue, but something to remain aware of nonetheless.
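To make that last caveat concrete, here is a small sketch. The function name hi_still_matches is made up for this illustration. It takes the snapshot, changes the parameter afterwards, and reports whether hi still equals !hello:

```c
/* Returns 1 if hi still equals !hello after hello has been modified,
 * 0 if the snapshot has gone stale. */
int hi_still_matches(int hello)
{
    const int hi = !hello;  /* snapshot of the negation, taken once */
    hello = !hello;         /* the parameter is changed later on */
    return hi == !hello;    /* hi was not updated, so this is 0 */
}
```

Whatever value is passed in, the result is 0: hi keeps the negation of the original hello rather than tracking the variable.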

Automatically deleting unused local variables from C source code

I want to delete unused local variables from C file.
Example:
int fun(int a, int b)
{
    int c, sum = 0;
    sum = a + b;
    return sum;
}
Here the unused variable is 'c'.
I will have an externally supplied list of all unused local variables. Given that list, I need to find the corresponding declarations in the source code and delete them.
In the example above, c is the unused variable, and I will already know that (I have code for that).
What remains is to find the declaration of c and delete it.
EDIT
The point is not to find unused local variables with an external tool. The point is to remove them from code given a list of them.
Turn up your compiler warning level, and it should tell you.
Putting your source fragment in "f.c":
% gcc -c -Wall f.c
f.c: In function 'fun':
f.c:1: warning: unused variable 'c'
Tricky - you will have to parse C code for this. How close does the result have to be?
Example of what I mean:
int a, /* foo */
b, /* << the unused one */
c; /* bar */
Now, it's obvious to humans that the second comment has to go.
Slight variation:
void test(/* in */ int a, /* unused */ int b, /* out */ int* c);
Again, the second comment has to go, the one before b this time.
In general, you want to parse your input, filter it, and emit everything that's not the declaration of an unused variable. Your parser would have to preserve comments and #include statements, but if you don't process the #included headers it may be impossible to recognize declarations (even more so if macros are used to hide the declaration). After all, you need headers to decide if A * B(); is a function declaration (when A is a type) or a multiplication (when A is a variable).
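The ambiguity is easy to reproduce in a few lines. This sketch uses the simpler form A * B instead of A * B(), and the function names are made up. The same tokens are a declaration in one function and a multiplication in the other, depending only on what A names at that point:

```c
int as_declaration(void)
{
    typedef int A;      /* here A names a type */
    int target = 7;
    A * B;              /* declaration: B is a pointer to A (that is, int *) */
    B = &target;
    return *B;
}

int as_expression(void)
{
    int A = 6, B = 7;   /* here A and B name variables */
    return A * B;       /* expression: multiplication */
}
```

A tool that has not seen the typedef (for example because it lives in a header that was skipped) cannot tell the two cases apart.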
[edit] Furthermore:
Even if you know that a variable is unused, the proper way to remove it depends a lot on remote context. For instance, assume
int foo(int a, int b, int c) { return a + b; }
Clearly, c is unused. Can you change it to the following?
int foo(int a, int b) { return a + b; }
Perhaps, but not if &foo is stored in an int(*)(int,int,int). And that may happen somewhere else. If (and only if) that happens, you should change it to
int foo(int a, int b, int /*unused*/ ) { return a + b; }
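A sketch of the situation described above, where the address of foo ends up in a table of three-argument function pointers. The names op3_fn, mul_add, ops and apply are made up for illustration. An unnamed parameter in a function definition is accepted by C++ and by C23 but not by earlier C standards, so this version keeps the name and uses the common (void)c idiom instead:

```c
#include <stddef.h>

typedef int (*op3_fn)(int, int, int);

int foo(int a, int b, int c)
{
    (void)c;            /* unused, kept so foo still matches op3_fn */
    return a + b;
}

int mul_add(int a, int b, int c)
{
    return a * b + c;
}

/* Dropping foo's third parameter would make this initializer ill-typed. */
static const op3_fn ops[] = { foo, mul_add };

int apply(size_t index, int a, int b, int c)
{
    return ops[index](a, b, c);
}
```

Whether such a table exists can only be decided by looking at every use of foo across the program, which is the "remote context" problem.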
Why do you want to do this? Assuming you have a decent optimizing compiler (GCC, Visual Studio et al.), the binary output will not be any different whether you remove the 'int c' in your original example or not.
If this is just about code cleanup, any recent IDE will give you quick links to the source code for each warning, just click and delete :)
My answer is more of an elaborate comment to MSalters' very thorough answer.
I would go beyond 'tricky' and say that such a tool is both impossible and inadvisable.
If you are looking to simply remove the references to the variable, then you could write a code parser of your own, but it would need to distinguish which function context it is in, such as
int foo(double a, double b)
{
b = 10.0;
return (int) b;
}
int bar(double a, double b)
{
a = 5.00;
return (int) a;
}
Any simple parser would have trouble here, because 'a' is unused only in foo and 'b' is unused only in bar.
Secondly, if you consider comments as MSalters has, you'll discover that people do not comment consistently:
double a;
/*a is designed as a dummy variable*/
double b;
/*a is designed as a dummy variable*/
double a;
double b;
double a; /*a is designed as a dummy variable*/
double b;
etc.
So simply removing the unused variables will create orphaned comments, which are arguably more dangerous than not commenting at all.
Ultimately, it is an obscenely difficult task to do elegantly, and you would be mangling code regardless. By automating the process, you would be making the code worse.
Lastly, you should be considering why the variables were in the code in the first place, and if they are deprecated, why they were not deleted when all their references were.
Static code analysis tools, in addition to the compiler warning level, as Paul correctly stated.
As well as being able to reveal these through warnings, the compiler will normally optimise these away if any optimisations are turned on. Checking if a variable is never referenced is quite trivial in terms of implementation in the compiler.
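As a small illustration of the warning route with GCC: -Wall turns on -Wunused-variable, and adding -Wextra also turns on -Wunused-parameter. The function sum and the variable scratch are invented for the example, and the file name in the comment is an assumption.

```c
/* Compile with, for example:  gcc -Wall -Wextra -c example.c */

int sum(int a, int b, int c)   /* c is reported by -Wunused-parameter */
{
    int scratch;               /* scratch is reported by -Wunused-variable */
    return a + b;
}
```

Each warning carries the file name and line number, which gives you the list of candidates without any extra tooling.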
You will need a good parser that preserves original character position of tokens (even in presence of preprocessor!). There are some tools for automated refactoring of C/C++, but they are far from mainstream.
I recommend you check out Taras' Blog. He has been doing some large automated refactorings of the Mozilla codebase, like replacing out-params with return values. His main tool for code rewriting is Pork:
Pork is a C++ parsing and rewriting tool chain. The core of Pork is a C++ parser that provides exact character positions for the start and end of every AST node, as well as the set of macro expansions that contain any location. This information allows C++ to be automatically rewritten in a precise way.
From the blog:
So far pork has been used for “minor” things like renaming classes&functions, rotating outparameters and correcting prbool bugs. Additionally, Pork proved itself in an experiment which involved rewriting almost every function (ie generating a 3+MB patch) in Mozilla to use garbage collection instead of reference-counting.
It is for C++, but it may suit your needs.
One of the posters above says "impossible and inadvisable".
Another says "tricky", which is the right answer.
You need 1) a full C (or whatever language of interest) parser,
2) inference procedures that understand the language
identifier references and data flows to determine that a variable
is indeed "dead", and 3) the ability to actually modify
the source code.
What's hard about all this is the huge energy needed to build
1), 2) and 3). You can't justify it for any individual cleanup task.
What one can do is to build such infrastructure specifically
with the goal of amortizing it across lots of different
program analysis and transformation tasks.
My company offers such a tool: The DMS Software Reengineering
Toolkit. See
http://www.semdesigns.com/Products/DMS/DMSToolkit.html
DMS has production quality front ends for many languages,
including C, C++, Java and COBOL.
We have in fact built an automated "find useless declarations"
tool for Java that does two things:
a) lists them all (thus producing the list!)
b) makes a copy of the code with the useless declarations
removed.
You choose which answer you want to keep :-)
To do the same for C would not be difficult. We already
have a tool that identifies such dead variables/functions.
One case we did not address is the "useless parameter"
case, because to remove a useless parameter, you have
to find all the calls from other modules,
verify that setting up the argument doesn't have a side
effect, and rip out the useless argument.
We in fact have full graphs of the entire software
system of interest, and so this would also be
possible.
So, it's just tricky, and not even very tricky
if you have the right infrastructure.
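The side-effect check in the "useless parameter" case can be illustrated with a small sketch. All names here (next_id, fresh_id, scale, caller) are hypothetical and only serve to show why the call sites matter.

```c
int next_id = 0;

/* Evaluating this has a side effect: it advances next_id. */
int fresh_id(void)
{
    return next_id++;
}

/* The second parameter is never used inside the function. */
int scale(int value, int unused)
{
    (void)unused;
    return value * 2;
}

int caller(void)
{
    /* Deleting the second argument would also delete the call to fresh_id(),
       so next_id would stop advancing and program behavior would change. */
    return scale(21, fresh_id());
}
```

A tool that removes the parameter from scale has to either keep fresh_id() as a separate statement at every such call site or prove that the argument expression has no side effect.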
You can solve the problem as a text processing problem. There should be only a small number of regexp patterns that describe how unused local variables are defined in the source code.
Using a list of unused variable names and the line numbers where they occur, you can process the C source code line by line. On each line you can iterate over the variable names. For each variable name you can try the patterns one by one. After a successful match you know the syntax of the definition, so you know how to delete the unused variable from it.
For example, if the source line is "int a, unused, b;" and the compiler reported "unused" as an unused variable in that line, then the pattern "/, unused,/" will match and you can replace that substring with a single ",".
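A rough sketch of that one pattern in plain C, using strstr instead of a regexp library. The function name drop_middle_declarator is made up, and it only covers the case where the variable sits in the middle of a declarator list. A declarator at the start or end of the list, or one with an initializer, would need its own pattern.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: replaces the first ", <name>," in line with ",".
   Edits line in place; returns 1 on a match and 0 otherwise. */
int drop_middle_declarator(char *line, const char *name)
{
    char pattern[128];
    int n = snprintf(pattern, sizeof pattern, ", %s,", name);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return 0;                      /* name too long for this sketch */

    char *hit = strstr(line, pattern);
    if (hit == NULL)
        return 0;                      /* pattern not present on this line */

    /* Keep the leading ',' and shift the rest of the line over " <name>,". */
    char *tail = hit + n;
    memmove(hit + 1, tail, strlen(tail) + 1);
    return 1;
}
```

Called on a buffer holding "int a, unused, b;" with the name "unused", this leaves "int a, b;" in the buffer. Comments attached to the removed variable are not handled at all.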
Also: splint.
Splint is a tool for statically checking C programs for security vulnerabilities and coding mistakes. With minimal effort, Splint can be used as a better lint. If additional effort is invested adding annotations to programs, Splint can perform stronger checking than can be done by any standard lint.

Resources