Detect start & end of a C declaration without a full C parser - c

I would like to partially parse a list of C declarations and/or function definitions.
That is, I want to split it into substrings, each containing one declaration, or function definition.
Each declaration (separately) will then be passed to another module (that does contain a full C parser, but that I cannot call directly.)
Obviously I could do this by including another full C parser in my program, but I hope to avoid this.
The tricky cases I've come up against so far involve the question of whether '}' terminates a declaration/definition or not. For example in
int main(int ac, char **av) {return 0;}
... the '}' is a terminator, whereas in
typedef struct foo {int bar;} *pfoo;
it is not. There may also be pathological pieces of code like this:
struct {int bar;} *getFooPtr(...) { /* code... */ }
Notes
Please assume the C code has already been fully preprocessed before my function sees it. (Actually it hasn't, but we have a workaround for that.)
My parser will probably be implemented in Lua with LPeg

To extend the state machine in your answer to deal with function definitions, add the following steps:
1. Set the fun/var state to 'unknown'.
2. Examine the character at the current position.
3. If it's ;, we have found the end of the declaration, and it's not a function definition (it might be a function declaration, though).
4. If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
5. If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
6. If the fun/var state is 'function' and we just skipped { .. }, we've found the end of the declaration, and it's a function definition.
7. If the fun/var state is 'unknown' and we just skipped ( .. ), set the fun/var state to 'function'.
8. If the current char is = or ,, set the fun/var state to 'not-function'.
9. Advance to the next input character, and go back to step 2.
Of course, this only works on post-pre-processed code -- if you have macros that do various odd things that haven't yet been expanded, all bets are off.

As far as I can tell, the following solution works for declarations only (that is, function definitions must be kept out of this section; adding semicolons after them may be a workaround):
1. Examine the character at the current position.
2. If it's ;, we have found the end of the declaration.
3. If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
4. If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary).
5. Otherwise, advance to the next input character and go to step 1.
If this proves to be unsatisfactory, I will switch to the clang parser.

Your best bet would be to extract the part of the C grammar which is related to declarations, and build a parser for that or an abbreviated version of that. Similarly, you want the grammar for function bodies, abbreviated in a similar way, so you can skip them.
This might produce a relatively trustworthy parser for declarations.
It is unfortunate that you will not likely be able to get your hands on a trustworthy C grammar; the one in the ANSI Standard(s) is not the one the compilers actually use. Every vendor has added goodies and complications to their compiler (e.g., MS C's declspecs, etc.).
The assumption that the preprocessor has run is interesting. Where are you going to get the preprocessor configuration (e.g., compiler command line defines, include paths, pragma settings, etc.)? This is harder than it looks, as each development environment defines different ways to set the preprocessor conditionals.
If you are willing to accept occasional errors, then any heuristic is a valid candidate, modulo how often it makes a mistake on an important client's code. This also means you can handle un-processed code, avoiding the preprocessor issue entirely.

Related

Regex to find all calls of the certain function and NOT its declaration

I need to write a regex which will match only lines with the C function call, not its declaration.
So, I need it to match only lines where funcName() is not preceded by int, double, float, char etc. and an arbitrary number of spaces.
The problem is, I can run into following expressions:
printf("Hello"); int f() {return 1;};
So I must consider even the situation where there are some other characters before the data-type name.
myStruct f();
In this situation I want regex to match it, ONLY basic data-types should be excluded.
So far I've got to this expression:
^(?!(void|int|double|char))\s*f\(\).*$
But I have no idea, how to take care of the situation with characters before the type name.
The following regex meets your specs:
(^|((^|\s)(?!(void|int|double|char))[^\s]+)\s+)([a-zA-Z_]+\(\)?)
The function name is defined by a character class containing letters and the underscore.
The line starts with the function call, or
the line contains at least one non-whitespace character before the function name. In that case ...
this non-WS sequence does not match the excluded keywords
there is at least 1 WS character before the function name
See the live demo at regex101.
Caveat
As several commenters have noted, this is not a robust solution. It will work for a tightly constrained set of function call and declaration patterns only.
A general regex-based solution (if possible at all, which would heavily depend on the regex engine features available) will be of theoretical interest only, as it would have to mimic the C preprocessor completely.

Parsing C files without preprocessing it

I want to run simple analysis on C files (such as: if you call the foo macro with INT_TYPE as argument, then cast the response to int*). I do not want to preprocess the file, I just want to parse it (so that, for instance, I'll have correct line numbers).
Ie, I want to get from
#include <a.h>
#define FOO(f)
int f() {FOO(1);}
a list of tokens like
<include_directive value="a.h"/>
<macro name="FOO"><param name="f"/><result/></macro>
<function name="f">
<return>int</return>
<body>
<macro_call name="FOO"><param>1</param></macro_call>
</body>
</function>
with no need to set include path, etc.
Is there any preexisting parser that does it? All parsers I know assume C is preprocessed. I want to have access to the macros and actual include instructions.
Our C Front End can parse code containing preprocessor elements to a fair extent and still build a usable AST. (Yes, the parse tree has precise file/line/column number information.)
There are a number of restrictions, which allows it to handle most code. In those few cases it cannot handle, often a small, easy change to the source file giving equivalent code solves the problem.
Here's a rough set of rules and restrictions:
#includes and #defines can occur wherever a declaration or statement can occur, but not in the middle of a statement. These rarely cause a problem.
macro calls can occur where function calls occur in expressions, or can appear without semicolon in place of statements. Macro calls that span non-well-formed chunks are not handled well (anybody surprised?). The latter occur occasionally but not rarely and need manual revision. OP's example of "j(v,oid)*" is problematic, but this is really rare in code.
#if ... #endif must be wrapped around major language concepts (nonterminals) (constant, expression, statement, declaration, function) or sequences of such entities, or around certain non-well-formed but commonly occurring idioms, such as if (exp) {. Each arm of the conditional must contain the same kind of syntactic construct as the other arms. #if wrapped around random text used as a bad kind of comment is problematic, but easily fixed in the source by making a real comment. Where these conditions are not met, you need to modify the original source code, often by moving the #if, #elif, #else or #endif a few tokens.
In our experience, one can revise a code base of 50,000 lines in a few hours to get around these issues. While that seems annoying (and it is), the alternative is to not be able to parse the source code at all, which is far worse than annoying.
You also want more than just a parser. See Life After Parsing, to know what happens after you succeed in getting a parse tree. We've done some additional work in building symbol tables in which the declarations are recorded with the preprocessor context in which they are embedded, enabling type checking to include the preprocessor conditions.
You can have a look at this ANTLR grammar. You will have to add rules for preprocessor tokens, though.
Your specific example can be handled by writing your own parsing and ignore macro expansion.
Because FOO(1) itself can be interpreted as a function call.
When more cases are considered, however, the parser becomes much more difficult. You can refer to the PDF Link to find more information.

Keyword-Label-Value style configuration file parsing library for C

Does a configuration parsing library exist already that will read the following style of file:
Keyword Label Value;
With nesting by { } replacing Values; optional Labels; support for "Include" would be nice.
An example configuration file might look like:
Listen Inside 127.0.0.1:1000;
Listen Outside {
IP 1.2.3.4;
Port 1000;
TLS {
CertFile /path/to/file;
};
};
ACL default_acl {
IP 192.168.0.0/24;
IP 10.0.0.0/24;
};
What programming languages are you familiar with? My impression from your question is C.
It looks like the tokens of your configuration language can be described by regular expressions:
Listen
127.0.0.1:1000
1000
;
{
}
etc.
Almost all modern programming languages have some form of support for those.
If the implementation is C, I'd probably use flex. It generates a function which will apply a set of regular expressions, put the matched text into a C string, and return the type of that regular expression (just an int, which you choose). The function is a 'lexical analyser' or 'tokeniser'. It chops up streams of characters into handy units that match your needs, one regular expression at a time.
Flex is pretty easy to use. It has several advantages over lex. One is that you can have multiple lexical analyser functions, so if you need to do something odd for an include file, then you could have a second lexical analyser function for that job.
Your language looks simple. Bison/Yacc are very powerful tools, and "with great power comes great responsibility" :-)
I think it is sufficiently simple that I might just write a parser by hand. It might only be a few functions to handle its structure. A technique that is very straightforward is called recursive descent parsing. Have you got a CS degree, or understand this stuff?
Lots of people will (at this stage) tell you to get the 'Dragon Book' or one of its newer versions, often because that is what they had at college. The Dragon book is great, but it is like telling someone to read all of Wikipedia to find out about whales. Great if you have the time, and you'll learn a lot.
A reasonable start is the Wikipedia Recursive Descent parser article. Recursive descent is very popular because it is relatively straightforward to understand. The thing that makes it straightforward is to have a proper grammar which is cast into a form which is easy for recursive descent to parse. Then you literally write a function for every rule, with a simple error handling mechanism (that's why I asked about this). There are probably tools to generate them, but you might find it quicker to just write it. A first cut might take a day, then you'd be in a good position to decide.
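To make "a function for every rule" concrete, here is a minimal recursive descent sketch in C for the configuration format in the question. The grammar is my own guess at the format (statement := WORD { WORD } [ '{' statements '}' ] ';'), the lexer is a tiny hand-written stand-in rather than flex, and every name in it (parse_config, struct parser and so on) is hypothetical. It only checks the syntax and counts statements; a real parser would build a data structure instead.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum tok { T_WORD, T_LBRACE, T_RBRACE, T_SEMI, T_EOF };

struct parser {
    const char *p;      /* next unread input character */
    enum tok tok;       /* current token */
    int statements;     /* number of statements parsed so far */
};

/* Tiny hand-written lexer: a word is any run of characters other than
   whitespace, '{', '}' and ';'. */
static void next_token(struct parser *ps)
{
    while (isspace((unsigned char)*ps->p))
        ps->p++;
    switch (*ps->p) {
    case '\0': ps->tok = T_EOF; return;
    case '{': ps->p++; ps->tok = T_LBRACE; return;
    case '}': ps->p++; ps->tok = T_RBRACE; return;
    case ';': ps->p++; ps->tok = T_SEMI; return;
    default:
        ps->p += strcspn(ps->p, " \t\r\n\v\f{};");
        ps->tok = T_WORD;
    }
}

static int parse_statements(struct parser *ps);

/* statement := WORD { WORD } [ '{' statements '}' ] ';' */
static int parse_statement(struct parser *ps)
{
    if (ps->tok != T_WORD)
        return 0;                       /* a keyword is expected here */
    while (ps->tok == T_WORD)           /* keyword, optional label, value */
        next_token(ps);
    if (ps->tok == T_LBRACE) {          /* nested block replaces the value */
        next_token(ps);
        if (!parse_statements(ps) || ps->tok != T_RBRACE)
            return 0;
        next_token(ps);
    }
    if (ps->tok != T_SEMI)
        return 0;
    next_token(ps);
    ps->statements++;
    return 1;
}

/* statements := { statement } */
static int parse_statements(struct parser *ps)
{
    while (ps->tok == T_WORD)
        if (!parse_statement(ps))
            return 0;
    return 1;
}

/* Returns the number of statements (nested ones included),
   or -1 on a syntax error. */
int parse_config(const char *text)
{
    struct parser ps = { text, T_EOF, 0 };

    next_token(&ps);
    if (!parse_statements(&ps) || ps.tok != T_EOF)
        return -1;
    return ps.statements;
}
```

Error handling here is just "return 0 and give up"; reporting the position of the offending token would be the obvious next step.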
A very nifty lex/flex feature is any characters which are not matched, are just echo'd to standard output. So you can see what your regular expressions are matching, and can add them incrementally. When the output 'dries up' everything is being matched.
Pontification alert: IMHO, more C programmers should learn to use flex. It is relatively easy to use, and very powerful for text handling. IMHO lots are put off because they are also told to use yacc/bison which are much more powerful, subtle and complex tools.
end Pontification.
If you need a bit of help with the grammar, please ask. If there is a nice grammar (might not be the case, but so far your examples look okay) then implementation is straightforward.
I found two links to stackoverflow answers which look helpful:
Recursive descent parser implementation
Looking for a tutorial on Recursive Descent Parsing
Here is an example of using flex.
Flex takes a 'script', and generates a C function called yylex(). This is the input script.
Remember that all of the regular expressions are being matched within that yylex function, so though the script looks weird, it is really an ordinary C function. To tell the caller, which will be your recursive descent parser, what type of regular expression is matched, it returns an integer value that you choose, just like any ordinary C function.
If there is nothing to tell the parser about, like white space, and probably some form of comment, it doesn't return. It 'silently' consumes those characters. If the syntax needs to use newline, then that would be recognised as a token, and a suitable token value returned to the parser. It is sometimes easier to let it be more free form, so this example consumes and ignores all white space.
Effectively the yylex function is everything from the first %% to the second %%. It behaves like a big switch() statement.
The regular expressions are like (very exotic) case: labels.
The code inside the { ... } is ordinary C. It can contain any C statements, and must be properly nested within the { ... }
The stuff before the first %% is the place to put flex definitions, and a few 'instructions' to flex.
The stuff inside %{ ... %} is ordinary C, and can include any headers needed by the C in the file, or even define global variables.
The stuff after the second %% is ordinary C, with no need for extra syntax, so no %{ ... %}.
/* scanner for a configuration files */
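%{
/* Put headers in here */
#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* exit */
#include "config.h"  /* token values (the enum shown further down) */
%}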
%%
[0-9]+ { return TOK_NUMBER; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+":"[0-9]+ { return TOK_IP_PORT; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+"/"[0-9]+ { return TOK_IP_RANGE; }
"Listen" { return TOK_KEYWORD_LISTEN; }
[A-Za-z][A-Za-z0-9_]* { return TOK_IDENTIFIER; }
"{" { return TOK_OPEN_BRACE; }
"}" { return TOK_CLOSE_BRACE; }
";" { return TOK_SEMICOLON; }
[ \t\n]+ /* eat up whitespace, do nothing */
. { fprintf(stderr, "Unrecognized character: %s\n", yytext );
exit(1);
}
%%
/* -------- A simple test ----------- */
int main(int argc, char *argv[])
{
int tok;
yyin = stdin;
/* yylex() returns 0 at end of input */
while ((tok = yylex()) != 0) {
fprintf(stderr, "%d %s\n", tok, yytext);
}
return 0;
}
That has a minimal, dummy main, which calls the yylex() function to get the next token (enum) value. yytext is the string matched by the regular expression, so main just prints it.
WARNING, this is barely tested, little more than:
flex config.l
gcc lex.yy.c -ll
./a.out <tinytest
The values are just integers, so an enum in a header:
#ifndef _CONFIG_H_
#define _CONFIG_H_
enum TOKENS {
TOK_KEYWORD_LISTEN = 256,
TOK_IDENTIFIER = 257,
TOK_OPEN_BRACE = 258,
TOK_CLOSE_BRACE = 259,
TOK_SEMICOLON = 260,
TOK_IP_PORT = 261,
TOK_IP_RANGE = 262,
TOK_NUMBER = 263,
};
#endif /* _CONFIG_H_ */
In your parser, call yylex when you need the next value. You'll probably wrap yylex in something which copies yytext before handing the token type value back to the parser.
You will need to be comfortable handling memory. If this were a large file, maybe use malloc to allocate space. But for small files, and to make it easy to get started and debug, it might make sense to write your own 'dumb' allocator. A 'dumb' memory management system can make debugging much easier. Initially just have a big char array statically allocated and a mymalloc() handing out pieces. I can imagine the configuration data never gets free()'d. Everything can be held in C strings initially, so it is straightforward to debug because the exact sequence of input is in the char array. An improved version might 'stat' a file and allocate a piece big enough.
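A 'dumb' allocator of that kind could look roughly like the sketch below. The pool size and the helper save_token (which copies a yytext-style string into the pool) are assumptions made up for this illustration. The pool is declared as an array of a small union rather than a plain char array, purely so that the returned pointers are suitably aligned for common types. Nothing is ever freed.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SIZE 65536   /* arbitrary size chosen for this sketch */

/* Allocation unit; a union is used so pool memory is aligned for
   the common basic types. */
typedef union { long l; double d; void *p; } align_t;

static align_t pool[POOL_SIZE / sizeof(align_t)];
static size_t pool_used;  /* units handed out so far */

/* Hand out n bytes from the static pool, or NULL if it is exhausted.
   There is deliberately no matching free. */
void *mymalloc(size_t n)
{
    size_t total = sizeof pool / sizeof pool[0];
    size_t units;

    if (n > POOL_SIZE)
        return NULL;
    units = (n + sizeof(align_t) - 1) / sizeof(align_t);
    if (units > total - pool_used)
        return NULL;
    pool_used += units;
    return &pool[pool_used - units];
}

/* Copy len characters of token text (for example yytext, with its
   length) into the pool as a NUL-terminated C string. */
char *save_token(const char *text, size_t len)
{
    char *copy = mymalloc(len + 1);

    if (copy != NULL) {
        memcpy(copy, text, len);
        copy[len] = '\0';
    }
    return copy;
}
```

Because the pieces are handed out in order, inspecting the pool in a debugger shows the saved strings in the order they were read.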
How you deal with the actual configuration values is a bit beyond what I can describe. Text strings might be all that is needed, or maybe there is already a mechanism for that. Often there is no need to store the text value of 'Keywords', because the parser has recognised what it means, and the program might convert other values, e.g. IP addresses, into some internal representation.
Have you looked at lex and yacc (or alternatively, flex and bison)? It's a little hairy, but we use those to parse files that look exactly like your config file there. You can define sub-structures using brackets, parse variable-length lists with the same key, etc.
By labels do you mean comments? You can define your own comment structure, we use '#' to denote a comment line.
It doesn't support includes AFAIK.
C libraries exist for JSON and YAML. They look like what you need.

Can you devise a simple macro to effectively produce a compiler error when used?

I am looking for a strange macro definition, on purpose: I need a macro defined in such a way, that in the event the macro is effectively used in compiled code, the compiler will unfailingly produce an error.
The background: Since C11 introduced several new keywords, and the C++11 standard also added a few, I would like to introduce a header file in my projects (mostly using C89/C95 compilers with a few additions) to force developers to refrain from using these new keywords as identifier names, unless, of course, they are recognized as keywords in the intended fashion.
In the ancient past, I did this for new like this:
#define new *** /* C++ keyword, do not use */
And yes, it worked. Until it didn't, when a programmer forgot the underscore in a parameter name:
void myfunction(uint16_t new parameter);
I used variants since, but I've never been challenged again.
Now I intend to create a file with all keywords not supported by various compilers, and I'm looking for a dependable solution, at best with a not too confusing error message. "Syntax error" would be OK, but "parameter missing" would be confusing already. I'm thinking along the lines of
#define atomic +*=*+ /* C11 derived keyword; do not use */
and aside from my usual hesitation, I'm quite sure that any use (but not the definition) of the macro will produce an error.
EDIT: To make it even more difficult, MISRA will only allow the use of the basic source and execution character set, so # or $ are not allowed.
But I'd like to ask the community: Do you have a better macro value? As effective, but shorter? Or even longer but more dependable in some strange situation? Or a completely different method to generate an error (only using the compiler, please, not external tools!) when a "discouraged" identifier is used for any purpose?
Disclaimer:
And, yes, I know I can use a grep or a parser to run on a nightly build, and report the warnings it finds. But dropping an immediate error on the developer's desk is quicker, and certain to be fixed before checking in.
If the sport is for the shortest token sequence that always produces an error, use any combination of two one-character operators that can't legally occur together, but:
don't use ({ or }) because gcc has a special meaning for that
don't use any sort of unbalanced parentheses because they can lead you far away until the error is recognized
don't use < or > because they could match template parameters for C++
don't use prefix operators as second character
don't use postfix operators as first character
This leaves some possibilities:
.., .| and other combinations with . since . expects a following identifier
&|, &/, &^, &,, &;
!|, !/, !^, !,, !;
But actually to be more user friendly I'd also first place a _Pragma in it so the compiler would also spit a warning.
#define atomic _Pragma("message \"some instructive text that you should read\"") ..
I think you can just use an illegal symbol:
#define bad_name #
Another one that would work would be this:
static const char *illegal_keyword = "";
#define bad_name (illegal_keyword = "bad_name")
Typical uses (declaring or calling something with that name) then fail to compile, and the error message will usually be quite good:
Line 8: error: called object 'illegal_keyword = "printf"' is not a function
And the final one that is perhaps the shortest and will always work is this:
#define bad_name #
Because the preprocessor will never replace twice, and # is illegal outside of the preprocessor, this will always produce an error.
#define atomic do not use atomic
The expansion is not recursive so it stops. The only way to stop it from being a compilation error is:
#define do
#define not
#define use
but that's verboten because do and not are keywords.
The error message might even include 'atomic'. You can increase the probability of that by rephrasing the message:
#define atomic atomic cannot be used
(Now you are not playing with keywords in the message, though.)
I think [[]] isn't a valid sequence of tokens anywhere, so you could use that:
#define keyword [[]]
The error will be a syntax error, complaining about [ or ].
My attempt:
#define new new[-1]
#define atomic atomic[-1]

How to match C function prototypes and variable definitions with a regexp?

This has been asked before but I have a specialised case which I should be able to handle with a regular expression.
I'm trying to read the warning log from Doxygen and the source is in C (so far, I dread to think about C++).
I need to match the functions and variable definitions found in that log and pick up the function and variable names.
More specifically the log has lines like
/home/me/blaa.c:10:Warning: Member a_function(int a, int b) (function) of file blaa.c is not documented
and
/home/me/blaa.h:10:Warning: Member a_variable[SOME_CONST(sizeof(SOME_STRUCT), 64)*ANOTHER_CONST] (variable) of file blaa.h is not documented
With all the variations you can have in C...
Can I match those with just one regexp or should I not even bother? The word after the "parameter" list (I use this loosely to also include the variables), in parentheses, comes from a set of certain words (function, variable, enum, etc.), so if nothing else helps, I could match with those, but I'd rather not in case there are types that I haven't seen yet in the logs.
My current attempt looks like
'(?P<full_path>.+):\d+:\s+Warning:\s+Member\s+(?P<member_name>.+)([\(\[](\**)\s*\w+([,)])[\)\]))*\s+\((?P<member_type>.+)\) of file\s+(?P<filename>.+)\s+is not documented'
(I use Python's re package.)
But it still fails to catch everything.
EDIT: There's some mistake in there that I have done in the last edit.
You were allowing zero or more matches between <member_name> and <member_type>. Try this instead:
'(?P<full_path>.+):\d+:\s+Warning:\s+Member\s+(?P<member_name>\w+).*\s+\((?P<member_type>\w+)\) of file\s+(?P<filename>.+)\s+is not documented'

Resources