Keyword-Label-Value style configuration file parsing library for C

Keyword-Label-Value style configuration file parsing library for C - c

Does a configuration parsing library exist already that will read the following style of file:
Keyword Label Value;
With nesting by { } replacing Values; optional Labels; support for "Include" would be nice.
An example configuration file might looks like:
Listen Inside 127.0.0.1:1000;
Listen Outside {
IP 1.2.3.4;
Port 1000;
TLS {
CertFile /path/to/file;
};
};
ACL default_acl {
IP 192.168.0.0/24;
IP 10.0.0.0/24;
};

What programming languages are you familiar with? My impression from your question is C.
It looks like like the tokens of your configuration language are regular expressions:
Listen
127.0.0.1:1000
1000
;
{
}
etc.
Almost all modern programming languages have some form of support for those.
If the implementation is C, I'd probably use flex. It generates a function which will apply a set of regular expressions, put the matched text into a C string, and return the type of that regular expression (just an int, which you choose). The function is a 'lexical analyser' or 'tokeniser'. It chops up streams of characters into handy units that match your needs, one regular expression at a time.
Flex is pretty easy to use. It has several advantages over lex. One is that you can have multiple lexical analysers functions, so if you need to do something odd for an include file, then you could have a second lexical analyser function for that job.
Your language looks simple. Bison/Yacc are very powerful tools, and "with great power comes great responsibility" :-)
I think it is sufficiently simple, that I might just write a parser by hand. It might only be a few functions to handle its structure. A technique that is very straightforward is called recursive descent parser. Have you got a CS degree, or understand this stuff?
Lots of people will (at this stage) tell you to get the 'Dragon Book' or one of its newer versions, often because that is what they had at college. The Dragon book is great, but it is like telling someone to read all of Wikipedia to find out about whales. Great if you have the time, and you'll learn a lot.
A reasonable start is the Wikipedia Recursive Descent parser article. Recursive descent is very popular because it is relatively straightforward to understand. The thing that makes it straightforward is to have a proper grammar which is cast into a form which is easy for recursive descent to parse. Then you literally write a function for every rule, with a simple error handling mechanism (that's why I asked about this). There are probably tools to generate them, but you might find it quicker to just write it. A first cut might take a day, then you'd be in a good position to decide.
A very nifty lex/flex feature is any characters which are not matched, are just echo'd to standard output. So you can see what your regular expressions are matching, and can add them incrementally. When the output 'dries up' everything is being matched.
Pontification alert: IMHO, more C programmers should learn to use flex. It is relatively easy to use, and very powerful for text handling. IMHO lots are put off because they are also told to use yacc/bison which are much more powerful, subtle and complex tools.
end Pontification.
If you need a bit of help with the grammar, please ask. If there is a nice grammar (might not be the case, but so far your examples look okay) then implementation is straightforward.
I found two links to stackoverflow answers which look helpful:
Recursive descent parser implementation
Looking for a tutorial on Recursive Descent Parsing
Here is an example of using flex.
Flex takes a 'script', and generates a C function called yylex(). This is the input script.
Remember that all of the regular expressions are being matched within that yylex function, so though the script looks weird, it is really an ordinary C function. To tell the caller, which will be your recursive descent parser, what type of regular expression is matched, it returns an integer value that you choose, just like any ordinary C function.
If there is nothing to tell the parser about, like white space, and probably some form of comment, it doesn't return. It 'silently' consumes those characters. If the syntax needs to use newline, then that would be recognised as a token, and a suitable token value returned to the parser. It is sometimes easier to let it be more free form, so this example consumes and ignores all white space.
Effectively the yylex function is everything from the first %% to the second %%. It behaves like a big switch() statement.
The regular expressions are like (very exotic) case: labels.
The code inside the { ... } is ordinary C. It can contain any C statements, and must be properly nested within the { ... }
The stuff before the first %% is the place to put flex definitions, and a few 'instructions' to flex.
The stuff inside %{ ... %} is ordinary C, and can include any headers needed by the C in the file, or even define global variables.
The stuff after the second %% is ordinary C, with no need for extra syntax, so no %{ ... %].
/* scanner for a configuration files */
%{
/* Put headers in here */
#include <config.h>
%}
%%
[0-9]+ { return TOK_NUMBER; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+":"[0-9]+ { return TOK_IP_PORT; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+"/"[0-9]+ { return TOK_IP_RANGE; }
"Listen" { return TOK_KEYWORD_LISTEN; }
[A-Za-z][A-Za-z0-9_]* { return TOK_IDENTIFIER; }
"{" { return TOK_OPEN_BRACE; }
"}" { return TOK_CLOSE_BRACE; }
";" { return TOK_SEMICOLON; }
[ \t\n]+ /* eat up whitespace, do nothing */
. { fprintf(stderr, "Unrecognized character: %s\n", yytext );
exit(1);
}
%%
/* -------- A simple test ----------- */
int main(int argc, char *argv[])
{
int tok;
yyin = stdin;
while (tok=yylex()) {
fprintf(stderr, "%d %s\n", tok, yytext);
}
}
That has a minimal, dummy main, which calls the yylex() function to get the next token
(enum) value. yytext is the string matched by the regular expression, so main just prints it.
WARNING, this is barely tested, little more than:
flex config.l
gcc lex.yy.c -ll
./a.out <tinytest
The values are just integers, so an enum in a header:
#ifndef _CONFIG_H_
#define _CONFIG_H_
enum TOKENS {
TOK_KEYWORD_LISTEN = 256,
TOK_IDENTIFIER = 257,
TOK_OPEN_BRACE = 258,
TOK_CLOSE_BRACE = 259,
TOK_SEMICOLON = 260,
TOK_IP_PORT = 261,
TOK_IP_RANGE = 262,
TOK_NUMBER = 263,
};
#endif _CONFIG_H_
In your parser, call yylex when you need the next value. You'll probably wrap yylex in something which copies yytext before handing the token type value back to the parser.
You will need to be comfortable handling memory. If this were a large file, maybe use malloc to allocate space. But for small files, and to make it easy to get started and debug, it might makes sense to write your own 'dumb' allocator. A 'dumb' memory management system, can make debugging much easier. Initially just have a big char array statically allocated and a mymalloc() handing out pieces. I can imagine the configuration data never gets free()'d. Everything can be held in C strings initially, so it is straightforward to debug because the exact sequence of input is in the char array. An improved version might 'stat' a file, and allocates a piece big enough.
How you deal with the actual configuration values is a bit beyond what I can describe. Text strings might be all that is needed, or maybe there is already a mechanism for that. Often there is no need to store the text value of 'Keywords', because the parser has recognised what it means, and the program might convert other values, e.g. IP addresses, into some internal representation.

Have you looked at lex and yacc (or alternatively, flex and bison)? It's a little hairy, but we use those to parse files that look exactly like your config file there. You can define sub-structures using brackets, parse variable-length lists with the same key, etc.
By labels do you mean comments? You can define your own comment structure, we use '#' to denote a comment line.
It doesn't support includes AFAIK.

Exist C library's to JSON and YAML. They look like what you need.

Related

Re-implementing strlen (C)

Noob alert:
for learning purposes i've been given the task to re implement the strlen() function, I've gotten the notion that this would be best done with a function like macro rather than a function,
my reasoning would be that using a macro i wouldn't have to deal with passing a string to a function.
what are your thoughts?
is it better to create a proper function or a macro in this case?

Macros are expanded exactly once, when the program is compiled. Since time travel is not part of the C language, it is impossible for a future execution of a program to retroactively change the consequence of a macro. So if a computation, such as computing the length of a string, depends on information not known when the program is compiled, a macro is completely useless. Unless the string happens to be a literal, this will be the case. And I venture to assert that in the vast majority of cases, the string whose length is required did not exist when the program was compiled.
A clear understanding of what macros actually do -- modify the program text before compilation -- will help avoid distractions such as the suggestion in this question.
It can occasionally be useful to use strlen on a constant string literal, in order to avoid bugs which might be introduced in the future when the string literal is modified. For example, the following (which tests whether line starts with the text Hello):
/* Code smell: magic number */
if (strncmp(line, "Hello", 5) == 0) { ... }
Would be better written as:
/* Code smell: redundant repetition, see below */
if (strncmp(line, "Hello", strlen("Hello")) == 0) { ... }
Obviously, if a computation can be performed once at compile-time, it would be better to do so rather than doing it repeatedly when the program runs. Once upon a time when compilers were primitive and almost incapacable of understanding control flow, it made some sense to worry about such things, although even then a lot of the hand-optimisations were much too complicated for the minor benefits achieved.
Today, even this excuse is unavailable for the premature optimizer. Most modern C compilers are perfection capable of replacing strlen("Hello"); with the constant 5, so that the library function is never called. No macro magic is required to achieve this optimisation.
As indicated, the test in the example still has an unnecessary repetition of the prefix string. What we really want to write is:
if (startsWith(line, "Hello")) { ... }
Here the temptation to define startsWith as a macro will be very strong, since it seems like simple compile-time substitution would suffice. This temptation should be avoided. Modern compilers are also capable of "inlining" function calls; that is, inserting the body of the call directly into the code.
So the definition:
static int startsWith(const char* line, const char* prefix) {
return strncmp(line, prefix, strlen(prefix)) == 0;
}
will be just as fast as its macro equivalent, and unlike the macro it will not lead to problems when it is called with a second argument with side effects:
/* Bad style but people do it */
if (startsWith(line, prefixes[++i])) { doAction(i); }
Once the call is inlined, the compiler can then proceed to apply other optimisations, such as the elimination of the call to strlen in case the prefix argument is a string literal.

When writing a compiler, how are tokens checked?

Does a compiler use if statements when deciding what to do if a certain keyword is encounered, and should someone writing a compiler use them for most operations when checking code? Or is there a more efficient way? For example, when I test a symbol against a symbol table and it comes back as being a valid "token", do I have to use an if statement to determine what to do for every single keyword, since it seems rather inefficient, for example the pseudocode:
/*Each keyword/token in my compiler has a numerical representation which is what the symbol table returns back for example #define IF 0 and so on*/
if(Token == IF){
//This will be done to generate the AST representation for IF statements
}else if(Token == ELSE){
//This will be done to generate the AST representation of an if statement
}else if(Token == INT){
//This will be done to generate the AST represnetation of an integer
}

What kind of compilers do you mean?
If the performance matters, you may want something like callback, in this way, use the keyword as key and the callback function as the value, so the pseudo code would looks like this:
func *fp = funcTbl.get(Token);
if (fp) { fp(); }
You may try the recursive descent too. The function related to the keyword got called just where they are expected to be.
Last but not least, what you write is ok as well.

Assuming you have already split your source language from string representation to a series of lexical tokens, your next step is to use a parser to build an AST from your tokens.
The parsing stage of compilation achieves two main goals:
It checks your language for syntactic correctness, throwing an error if your input cannot be parsed according to the structure of your grammar.
It generates an AST representation of your source code
Does a compiler use if statements when deciding what to do if a
certain keyword is encountered?
No, your parser should analyse the series of lexical tokens and check them against the structure of your language's grammar.
Parsing is a well understood topic in computer science which can be approached in different ways. it cannot be trivially implemented in the example code fragment you have provided above. In a realistic programming language you need to consider that grammars can be ambiguous, and that a simple predictive parser is appropriate for all grammars and some kind of backtracking will be needed. If you do not understand this concept, I recommend you use a Parser generator for this, such as Bison.
This diagram shows a simplistic overview of the most important stages of compilation and may help you to understand its pipeline structure.
This is a process which has been refined for decades by many academics about how to best 'divide and conquer' such a mammoth task. I strongly encourage you to follow it.
For further reading, check out Modern Compiler Implementation in Java by Andrew Appel.

How can I get the function name as text not string in a macro?

I am trying to use a function-like macro to generate an object-like macro name (generically, a symbol). The following will not work because __func__ (C99 6.4.2.2-1) puts quotes around the function name.
#define MAKE_AN_IDENTIFIER(x) __func__##__##x
The desired result of calling MAKE_AN_IDENTIFIER(NULL_POINTER_PASSED) would be MyFunctionName__NULL_POINTER_PASSED. There may be other reasons this would not work (such as __func__ being taken literally and not interpreted, but I could fix that) but my question is what will provide a predefined macro like __func__ except without the quotes? I believe this is not possible within the C99 standard so valid answers could be references to other preprocessors.
Presently I have simply created my own object-like macro and redefined it manually before each function to be the function name. Obviously this is a poor and probably unacceptable practice. I am aware that I could take an existing cpp program or library and modify it to provide this functionality. I am hoping there is either a commonly used cpp replacement which provides this or a preprocessor library (prefer Python) which is designed for extensibility so as to allow me to 'configure' it to create the macro I need.
I wrote the above to try to provide a concise and well defined question but it is certainly the Y referred to by #Ruud. The X is...
I am trying to manage unique values for reporting errors in an embedded system. The values will be passed as a parameter to a(some) particular function(s). I have already written a Python program using pycparser to parse my code and identify all symbols being passed to the function(s) of interest. It generates a .h file of #defines maintaining the values of previously existing entries, commenting out removed entries (to avoid reusing the value and also allow for reintroduction with the same value), assigning new unique numbers for new identifiers, reporting malformed identifiers, and also reporting multiple use of any given identifier. This means that I can simply write:
void MyFunc(int * p)
{
if (p == NULL)
{
myErrorFunc(MYFUNC_NULL_POINTER_PASSED);
return;
}
// do something actually interesting here
}
and the Python program will create the #define MYFUNC_NULL_POINTER_PASSED 7 (or whatever next available number) for me with all the listed considerations. I have also written a set of macros that further simplify the above to:
#define FUNC MYFUNC
void MyFunc(int * p)
{
RETURN_ASSERT_NOT_NULL(p);
// do something actually interesting here
}
assuming I provide the #define FUNC. I want to use the function name since that will be constant throughout many changes (as opposed to LINE) and will be much easier for someone to transfer the value from the old generated #define to the new generated #define when the function itself is renamed. Honestly, I think the only reason I am trying to 'solve' this 'issue' is because I have to work in C rather than C++. At work we are writing fairly object oriented C and so there is a lot of NULL pointer checking and IsInitialized checking. I have two line functions that turn into 30 because of all these basic checks (these macros reduce those lines by a factor of five). While I do enjoy the challenge of crazy macro development, I much prefer to avoid them. That said, I dislike repeating myself and hiding the functional code in a pile of error checking even more than I dislike crazy macros.
If you prefer to take a stab at this issue, have at.

__FUNCTION__ used to compile to a string literal (I think in gcc 2.96), but it hasn't for many years. Now instead we have __func__, which compiles to a string array, and __FUNCTION__ is a deprecated alias for it. (The change was a bit painful.)
But in neither case was it possible to use this predefined macro to generate a valid C identifier (i.e. "remove the quotes").
But could you instead use the line number rather than function name as part of your identifier?
If so, the following would work. As an example, compiling the following 5-line source file:
#define CONCAT_TOKENS4(a,b,c,d) a##b##c##d
#define EXPAND_THEN_CONCAT4(a,b,c,d) CONCAT_TOKENS4(a,b,c,d)
#define MAKE_AN_IDENTIFIER(x) EXPAND_THEN_CONCAT4(line_,__LINE__,__,x)
static int MAKE_AN_IDENTIFIER(NULL_POINTER_PASSED);
will generate the warning:
foo.c:5: warning: 'line_5__NULL_POINTER_PASSED' defined but not used

As pointed out by others, there is no macro that returns the (unquoted) function name (mainly because the C preprocessor has insufficient syntactic knowledge to recognize functions). You would have to explicitly define such a macro yourself, as you already did yourself:
#define FUNC MYFUNC
To avoid having to do this manually, you could write your own preprocessor to add the macro definition automatically. A similar question is this: How to automatically insert pragmas in your program
If your source code has a consistent coding style (particularly indentation), then a simple line-based filter (sed, awk, perl) might do. In its most naive form: every function starts with a line that does not start with a hash or whitespace, and ends with a closing parenthesis or a comma. With awk:
{
print $0;
}
/^[^# \t].*[,\)][ \t]*$/ {
sub(/\(.*$/, "");
sub(/^.*[ \t]/, "");
print "#define FUNC " toupper($0);
}
For a more robust solution, you need a compiler framework like ROSE.

Gnu-C has a __FUNCTION__ macro, but sadly even that cannot be used in the way you are asking.

Detect start & end of a C declaration without a full C parser

I would like to partially parse a list of C declarations and/or function definitions.
That is, I want to split it into substrings, each containing one declaration, or function definition.
Each declaration (separately) will then be passed to another module (that does contain a full C parser, but that I cannot call directly.)
Obviously I could do this by including another full C parser in my program, but I hope to avoid this.
The tricky cases I'e come up against so far involve the question of whether '}' terminates a declaration/definition or not. For example in
int main(int ac, char **av) {return 0;}
... the '}' is a terminator, whereas in
typedef struct foo {int bar;} *pfoo;
it is not. There may also be pathological pieces of code like this:
struct {int bar;} *getFooPtr(...) { /* code... */ }
Notes
Please assume the C code has already been fully preprocessed before my function sees it. (Actually it hasn't, but we have a workaround for that.)
My parser will probably be implemented in Lua with LPeg

To extend the state machine in your answer to deal with function definitions add the following steps:
set fun/var state to 'unknown'
Examine the character at the current position
If it's ;, we have found the end of the declaration, and its not a function definition (might be a function declaration, though).
If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary)
If fun/var state is 'function' and we just skipped { .. }, we've found the end of the declaration, and its a function definition
If fun/var state is 'unknown' and we just skipped ( .. ), set fun/var state to 'function'.
If the current char is = or ,, set fun/var state to 'not-function`.
Advance to the next input character, and go back to 2.
Of course, this only works on post-pre-processed code -- if you have macros that do various odd things that haven't yet been expanded, all bets are off.

As far as I can tell, the following solution works for declarations only (that is, function definitions must be kept out of this section, or adding semicolons after them may be a workaround:)
Examine the character at the current position
If it's ;, we have found the end of the declaration.
If it's " or ', jump to the matching quote, skipping over escape sequences if necessary.
If it's (, [ or {, jump to the matching ), ] or } (skipping over nested brackets and strings recursively if necessary)
Otherwise, advance to the next input character and goto step 1.
If this proves to be unsatisfactory, I will switch to the clang parser.

Your best bet would be to extract the part of the C grammar which is related to declarations, and build a parser for that or an abbreviated version of that. Similarly, you want the grammar for function bodies, abbreviated in a similar way, so you can skip them.
This might produce a relatively trustworthy parser for declarations.
It is unfortunate that you will not likely be able to get your hands on a trustworthy C grammar; the one in the ANSI Standard(s) is not the one the compilers actually use. Every vendor has added goodies and complications to their compiler (e.g., MS C's declspecs, etc.).
The assumption the preprocessor has run is interesting. Where are you going to get the preprocessor configuration? (e.g., compiler commmand line defines, include paths, pragma settings, etc.)? This is harder than it looks, as each development environment defines different ways to set the preprocessor conditionals.
If you are willing to accept occasional errors, then any heuristic is valid candidate,
modulo how often it makes a mistake on an important client's code. This also means you can handle un-processed code, avoiding the preprocessor issue entirely.

Is there a way to access C's built in keywords like 'int' or 'char' or 'return'?

Is there a function or method to access C's keywords as mentioned in the question? The only way I can think of it is creating constants that will just be checked to see if any match, but that could be a lot to type, since there are a lot of keywords. I was hoping there was something. (New to C)
It is for a homework, so I cannot use regular expressions or parsing libraries. The purpose of the HW is to give my program a function and just return the identifiers, hence, why I was hoping there was a way to access the keywords easier than typing them all.
Example:
int foo (int args)
{
int x = 7;
char c = 'a';
args = x + c;
return args;
}
And it should return foo, args, x, c.
I am not looking for an answer, so a good hint if there is one would be great! If not, then just let me know that the tedious way is the only option.

To identify the identifiers (as distinct from other token kinds) in the source, you need to lex the source.
One of the easiest ways to do this is to implement Thompson's Algorithm and use the preprocessing grammar from the C99 language specification. Once the source is lexed (or during lexing), you just need to create the list of preprocessing identifiers that are not C99 keywords. It's quite straightforward to implement this in a couple hundred lines of code.

You will need to write a program to read the file, building 'words' from sequences of alphanumeric characters. You'll need a list of the keywords in C - which is quite short. Then you'll compare the words you read against the list of keywords and print out the first occurrence of each (so you'll also need to store the words you've seen).
You'll need to know what you're expected to do with preprocessor directives; you may be able to ignore them. You'll need to know how to recognize numbers, character strings and character constants. You'll need to know how to recognize both /* ... */ and // ... to EOL comments (or maybe not in the first version).
Eventually, you might get sucked into nastinesses such as strings that extend over line breaks and comments such as:
/\
\
* This is a C comment
*\
\
/
However, you can almost certainly omit those subtleties in a first pass.

There is no built-in way of accessing the language from inside itself. Welcome to C, the land of do-it-yourself. Yes, you're going to have to tokenize the input stream and test each word. For tokenizing, check out the strcspn() function (a compliment string of " \t\n" (space, tab, newline) is probably good enough to get you going there.
Then build a NULL-terminated array of strings, e.g.
const char *identifiers [] = {
"int",
"continue",
NULL
};
and iterate over that, doing strcmp() on the input vs the members of the array. If you hit the terminating NULL, you know it's not in the array (bonus points for using a sorted array and libc's bsearch(3) utilities!).