How to write my own config format in C

I've developed my own file format for configuration files (plain text and line-based: one line is one configuration entry) for an application. The format is nothing special, and the only reason I'm doing this is to learn something! The reader and writer functions will be implemented in C (with GLib, because the file should be UTF-8 encoded).
So now I'm thinking about how to implement this format in C, and which steps I have to take to get error messages that are as good as possible. I've heard about lexers, parsers and so on, but I've never gone deep into them; I have only a very abstract idea of what they are. Which steps do I need to take to get a clean reader written in C for the format that is also maintainable for future changes? What are the topics to learn and think about?
And yes, I know: C is a pain, there are a lot of different "sexy" formats for this purpose, and so on. I want to learn something!
Cheers,
Gregor
Additional information
The reader/writer/parser (or whatever it's called) should depend as little as possible on third-party programs or components. The application around this config part already uses GLib, so that's why GLib is also used for UTF-8.

One cool way of creating a config format is to embed a scripting language.
This gives you the parser for free and gives you the possibility to generate data on the fly or define variables that are being reused:
Consider these examples of XML vs. an ugly pseudo-scripting language:
<InputPoints>
<Point>
<x>1.0</x>
<y>1.0</y>
</Point>
<Point>
<x>1.0</x>
<y>2.0</y>
</Point>
<Point>
<x>1.0</x>
<y>3.0</y>
</Point>
<Point>
<x>1.0</x>
<y>4.0</y>
</Point>
</InputPoints>
vs:
for(i = 1; i <= 4; ++i) {
InputPoint(1, i);
}
or perhaps
<Username>allanballan</Username>
<Accountname>allanballan</Accountname>
<HomeDirectory>/home/allanballan</HomeDirectory>
vs
user = "allanballan";
Username = user;
Accountname = user;
HomeDirectory = "/home/"+user;
The first example compresses a list of points into a few statements; the second example shows how to remove a lot of redundant data by using a temporary variable.
A popular language for this kind of situation is Lua. Exactly how to map a scripting language to configuration is up to the integrator, but it's really powerful and it comes with parsing and type checking for free.

You might want to look at the libconfig source code. It has a lightweight parser you could use as a starting point and that will probably help you in figuring out what a parser for your own format would have to look like.
Though, if you really want to learn about parsers and lexers, it would probably be better to implement a simple compiler. There's an MIT course you could follow.
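To give a rough idea of what a hand-written reader for a line-based format could look like, here is a minimal sketch. It assumes a hypothetical "key = value" syntax with "#" comments (the question does not show the real format), handles plain bytes only, and leaves UTF-8 validation aside. The function name parse_config_line and the error texts are made up for illustration. The point is that carrying the line number into every error message is what makes the messages useful.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Result of parsing one line of a hypothetical "key = value" format. */
typedef enum { LINE_EMPTY, LINE_PAIR, LINE_ERROR } line_kind;

/* Skip leading blanks and return a pointer to the first non-blank char. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/*
 * Parse a single line. On LINE_PAIR, key and val are filled in.
 * On LINE_ERROR, err holds a message prefixed with the line number.
 * Assumes keysz, valsz and errsz are all at least 1.
 */
line_kind parse_config_line(const char *line, int lineno,
                            char *key, size_t keysz,
                            char *val, size_t valsz,
                            char *err, size_t errsz)
{
    const char *p = skip_blanks(line);
    size_t n = 0;

    if (*p == '\0' || *p == '\r' || *p == '\n' || *p == '#')
        return LINE_EMPTY;              /* blank line or comment */

    /* Lexing step: a key consists of letters, digits and underscores. */
    while (isalnum((unsigned char)*p) || *p == '_') {
        if (n + 1 >= keysz) {
            snprintf(err, errsz, "line %d: key too long", lineno);
            return LINE_ERROR;
        }
        key[n++] = *p++;
    }
    key[n] = '\0';
    if (n == 0) {
        snprintf(err, errsz, "line %d: expected a key, found '%c'", lineno, *p);
        return LINE_ERROR;
    }

    /* Parsing step: the grammar of a line is  key '=' value . */
    p = skip_blanks(p);
    if (*p != '=') {
        snprintf(err, errsz, "line %d: expected '=' after key '%s'", lineno, key);
        return LINE_ERROR;
    }
    p = skip_blanks(p + 1);

    /* The value is the rest of the line without trailing whitespace. */
    n = strcspn(p, "\r\n");
    while (n > 0 && (p[n - 1] == ' ' || p[n - 1] == '\t'))
        n--;
    if (n == 0) {
        snprintf(err, errsz, "line %d: missing value for key '%s'", lineno, key);
        return LINE_ERROR;
    }
    if (n >= valsz) {
        snprintf(err, errsz, "line %d: value too long", lineno);
        return LINE_ERROR;
    }
    memcpy(val, p, n);
    val[n] = '\0';
    return LINE_PAIR;
}
```

A caller would typically read the file line by line (for example with fgets), increment a line counter, and print err whenever LINE_ERROR comes back.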

Depending on how deep you'd like to dive into learning the matter, you should think about not writing your parser manually. You can do so of course, but it will be a great deal more complicated and adding new features to your language will burden you with the problems of always adapting lexer and parser code.
The good thing is, there are lots of tools out there that enable you to generate this stuff from a high-level description of your input and its structure. Standard *nix tools to do so are Lex and Yacc (or their descendants Flex and Bison), but I'd like to point you to ANTLR (http://www.antlr.org) instead. One of its nice features is that it provides backends for many different languages (C/C++ as well as Java, Python, Ruby, C#, ...), so learning how to work with it will also help you if you want to switch languages at a later point.

Related

Extracting information from C headers (functions, structs, ...)

I would love to write more software with V but having to painstakingly copytype the APIs I would love to use really sucks. So I have been looking into how to change that. One option is SWIG, but just for grabbing the definitions of functions, structs/unions and globals, it's quite over-the-top, since I only need the tool to generate something akin to a V source file.
So I was thinking of just using a C99 compliant parser like TinyCC and seeing if I could extract an AST out of that. Unfortunately, this has not been as easy as I was hoping it would be because TinyCC doesn't just parse C into an AST, it already prepares information for compilation. So hacking the parser out of it is tricky, at best.
Since there are tools that are meant to statically analyze code coverage or aid in IDE support, I wouldn't be surprised if someone had done something similar already, even if it is just to jump to a definition at the very least.
Hence, my question: would it be safe to parse C with a parser library like MPC? Or is there a better, more straightforward alternative that you can suggest?

Turning strings into code?

So let's say I have a string containing some code in C, predictably read from a file that has other things in it besides normal C code. How would I turn this string into code usable by the program? Do I have to write an entire interpreter, or is there a library that already does this for me? The code in question may call subroutines that I declared in my actual C file, so one that only accounts for stock C commands may not work.
Whoo. With C this is actually pretty hard.
You've basically got a couple of options:
interpret the code
To do this, you'll have to write an interpreter, and interpreting C is a fairly hard problem. There have been C interpreters available in the past, but I haven't read about one recently. In any case, unless you really, really need this, writing your own interpreter is a big project.
Googling does show a couple of open-source (partial) C interpreters, like picoc
compile and dynamically load
If you can capture the code and wrap it so it makes a syntactically complete C source file, then you can compile it into a dynamically loadable library: a DLL on Windows, or a .so on most variants of UNIX. Then you can load the result at runtime.
Now, what normally would lead someone to do this is a need to be able to express some complicated scripting functions. Have you considered the possibility of using a different language? Python, Scheme (guile) and Lua are easily available to add as a scripting language to a C application.
C has nothing of this nature. That's because C is compiled: the compiler has to do most of the work of building the code before the program starts running (and hence before it receives a string as input), so the program can't easily change on the fly. Compiled languages have a rigidity to them, while interpreted languages have flexibility.
You're thinking of Perl, Python, PHP etc., the so-called "fourth generation languages". I'm sure there's a technical term in CS for this flexibility, but C doesn't have it. You'll need to switch to one of these languages (and give up performance) if you have a task that requires much of this sort of string use. Check out Perl's /e flag with regexes, for instance.
In C, you'll need to design your application so you don't need to do this. This is generally quite doable: for all its non-OO-ness and other deficiencies, many huge, complex applications run on well-written C just fine.
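One common way to design around the missing "eval" is a dispatch table: the text from the file only names a subroutine, and the program looks that name up in a table of functions it was compiled with. The following is a minimal sketch under assumptions. The command syntax ("name int int") and the functions add, mul, find_command and run_command are all hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Subroutines that already exist in the C program (hypothetical names). */
static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

/* Table mapping a command name found in the input to a C function. */
typedef int (*binary_fn)(int, int);
static const struct { const char *name; binary_fn fn; } commands[] = {
    { "add", add },
    { "mul", mul },
};

/* Look up a command by name; returns NULL if the name is unknown. */
binary_fn find_command(const char *name)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++)
        if (strcmp(commands[i].name, name) == 0)
            return commands[i].fn;
    return NULL;
}

/* Run a command string such as "mul 6 7".
 * Returns 0 on success and stores the value in *result, -1 on error. */
int run_command(const char *text, int *result)
{
    char name[16];
    int a, b;
    binary_fn fn;

    if (sscanf(text, "%15s %d %d", name, &a, &b) != 3)
        return -1;                      /* malformed command */
    fn = find_command(name);
    if (fn == NULL)
        return -1;                      /* no such subroutine */
    *result = fn(a, b);
    return 0;
}
```

This only covers calling existing subroutines by name. Anything resembling control flow in the input would still need an interpreter or an embedded scripting language.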

What libraries would be useful for implementing a small language interpreter in C?

For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For both that, procedure-calling, and construct-nesting you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. There again there are few valid arguments for not using C++ IMO.
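As an illustration of why stacks matter for expression evaluation, here is a small sketch in C that evaluates postfix (reverse Polish) expressions with an explicit stack. This is not the algorithm from the linked page. The function name eval_postfix and the restriction to non-negative integer literals and the four basic operators are assumptions made to keep the example short.

```c
#include <ctype.h>
#include <stdlib.h>

#define STACK_MAX 64

/*
 * Evaluate a postfix expression such as "3 4 + 2 *" using an explicit stack.
 * Returns 0 on success and stores the value in *out; returns -1 on malformed
 * input (unknown character, stack underflow or overflow, division by zero).
 */
int eval_postfix(const char *s, long *out)
{
    long stack[STACK_MAX];
    int top = 0;                        /* number of values on the stack */

    while (*s != '\0') {
        if (isspace((unsigned char)*s)) {
            s++;
        } else if (isdigit((unsigned char)*s)) {
            char *end;
            if (top == STACK_MAX)
                return -1;
            stack[top++] = strtol(s, &end, 10);   /* push an operand */
            s = end;
        } else {
            long a, b;
            if (top < 2)
                return -1;              /* an operator needs two operands */
            b = stack[--top];
            a = stack[--top];
            switch (*s++) {
            case '+': stack[top++] = a + b; break;
            case '-': stack[top++] = a - b; break;
            case '*': stack[top++] = a * b; break;
            case '/':
                if (b == 0)
                    return -1;
                stack[top++] = a / b;
                break;
            default:
                return -1;              /* not an operator we know */
            }
        }
    }
    if (top != 1)
        return -1;                      /* leftover or missing operands */
    *out = stack[0];
    return 0;
}
```

An infix evaluator usually adds a second stack for pending operators, and procedure calls and nested constructs reuse the same push and pop discipline.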
Before diving into which libraries to use, I suggest you learn about grammars and compiler design. Input processing in particular is similar for compilers and interpreters: tokenizing and parsing. Tokenizing converts a stream of characters (your input) into a stream of tokens. A parser takes this stream of tokens and matches it against your grammar.
You don't mention what language you're writing an interpreter for, but very likely that language contains recursion. In that case you will probably want a so-called bottom-up parser, which is impractical to write by hand and is normally generated. If you try to write such a parser by hand, you will end up with an error-prone mess.
If you're developing for a POSIX platform, you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex can generate code that implements the tokenizing process, and yacc can generate a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.
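To make the tokenizing step described above a little more concrete, here is a hand-written sketch of the kind of function that lex would otherwise generate. The token kinds, the token struct and the name next_token are assumptions for illustration. A real language would have more categories (strings, comments, multi-character operators).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NUMBER, TOK_IDENT, TOK_PUNCT, TOK_END, TOK_INVALID } token_kind;

typedef struct {
    token_kind kind;
    const char *start;   /* points into the input; not NUL-terminated */
    size_t len;          /* number of characters in the token */
} token;

/*
 * Read the token that starts at *cursor (after optional whitespace)
 * and advance *cursor past it, so repeated calls yield a token stream.
 */
token next_token(const char **cursor)
{
    const char *p = *cursor;
    token t;

    while (isspace((unsigned char)*p))
        p++;
    t.start = p;

    if (*p == '\0') {
        t.kind = TOK_END;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        t.kind = TOK_NUMBER;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.kind = TOK_IDENT;
    } else if (strchr("+-*/=();", *p) != NULL) {
        p++;
        t.kind = TOK_PUNCT;
    } else {
        p++;                            /* consume the bad character */
        t.kind = TOK_INVALID;
    }
    t.len = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

The parser then only ever looks at token kinds, which keeps character-level details such as whitespace out of the grammar.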

C Library to read configuration files with syntax based on curly brackets

For my C projects I'd like to use curly brackets based configuration files like:
account {
name = "test#test.com";
password = "test";
autoconnect = true;
}
etc. or some variations.
I'm trying to find some nice C libraries to suit my needs. Can you please advise?
Your desired syntax is nearly identical to Lua, which would look like this:
account = {
name = "test#test.com",
password = "test",
autoconnect = true,
}
If that suits you, I highly recommend Lua, as it's designed to be embeddable in C programs as a configuration or scripting facility. You can either use the raw Lua C API, or if you prefer C++ there are things like Luabind to make certain things prettier in that language.
Here is a trivial example using the pure C Lua API to retrieve values from a buffer which contains a Lua "chunk": http://lua-users.org/wiki/GettingValuesFromLua . You can basically read (or mmap) your configuration file in C, pass the pointer to the text to Lua, have Lua execute it, and then retrieve the bits and pieces iteratively. An alternative is to do "binding" (for which there is also an example on the Lua wiki). With binding, you set up C structures to represent your configuration data, bind them to Lua, and let the Lua configuration script actually populate (construct) a configuration object, which is then accessible from C. Depending on your exact needs this may be better or worse, but in pure C (as opposed to C++), the learning curve may be steeper than with the "get values" approach.
I would suggest using a lexer and parser for doing this, either the lex/yacc combo or flex/bison.
You basically write code in a .l and .y file to describe the layout and the lexer/parser generator creates C code that will process the file for you, calling functions to deliver the data to you.
Lexical analysis and parsing are a pain to do unless you're well versed in the art. Tools like those I've mentioned make the job a lot easier.
In the lexer, you get it to recognise the lexical elements like
e_account (account)
e_openbrace ({)
e_name (name)
e_string ("[^"]*")
e_semicolon (;)
and so on.
The lexer is used by the parser to detect the lexical elements and the parser has the higher level rules for deciding what constructs are valid. Things like an account section being e_account, e_openbrace, zero or more of e_stanza then finally e_closebrace. And also detecting e_stanza as being (among others) e_name, e_equals, e_string then e_semicolon.
Most of the intelligence is under the covers (and pretty ugly looking code at least for lex/yacc) but it's better than trying to write it yourself :-)
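To show how the lexical elements and the higher-level rules fit together, here is a hand-written sketch of roughly what the generated code does for this grammar. It is not lex/yacc output. The element names e_password, e_end and e_error are assumptions added to the names listed above, and only string-valued stanzas are handled (so "autoconnect = true" is left out).

```c
#include <ctype.h>
#include <string.h>

typedef enum {
    e_account, e_name, e_password, e_openbrace, e_closebrace,
    e_equals, e_string, e_semicolon, e_end, e_error
} lexeme;

/* Lexer: classify the next lexical element and advance *p past it. */
static lexeme next_lexeme(const char **p)
{
    const char *s = *p;
    lexeme kind = e_error;

    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        kind = e_end;
    } else if (*s == '{') {
        s++; kind = e_openbrace;
    } else if (*s == '}') {
        s++; kind = e_closebrace;
    } else if (*s == '=') {
        s++; kind = e_equals;
    } else if (*s == ';') {
        s++; kind = e_semicolon;
    } else if (*s == '"') {
        /* e_string is "[^"]*": everything up to the next quote */
        const char *close = strchr(s + 1, '"');
        if (close != NULL) {
            s = close + 1;
            kind = e_string;
        }
    } else if (isalpha((unsigned char)*s)) {
        const char *start = s;
        size_t n;
        while (isalpha((unsigned char)*s))
            s++;
        n = (size_t)(s - start);
        if (n == 7 && memcmp(start, "account", 7) == 0) kind = e_account;
        else if (n == 4 && memcmp(start, "name", 4) == 0) kind = e_name;
        else if (n == 8 && memcmp(start, "password", 8) == 0) kind = e_password;
    }
    *p = s;
    return kind;
}

/*
 * Parser: account := e_account e_openbrace stanza* e_closebrace
 *         stanza  := (e_name | e_password) e_equals e_string e_semicolon
 * Returns 1 if the whole text matches the grammar, 0 otherwise.
 */
int parse_account(const char *text)
{
    const char *p = text;
    lexeme t;

    if (next_lexeme(&p) != e_account || next_lexeme(&p) != e_openbrace)
        return 0;
    while ((t = next_lexeme(&p)) == e_name || t == e_password) {
        if (next_lexeme(&p) != e_equals || next_lexeme(&p) != e_string ||
            next_lexeme(&p) != e_semicolon)
            return 0;
    }
    return t == e_closebrace && next_lexeme(&p) == e_end;
}
```

With lex and yacc the same two layers appear as patterns in the .l file and rules in the .y file, and the generated parser calls your code when a rule matches instead of just returning yes or no.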
A variant of what you described would be JSON:
{
  "account": {
    "name": "test#test.com",
    "password": "test",
    "autoconnect": true
  }
}
http://www.json.org/
lists ~100 libraries to read and write JSON for every conceivable platform and language. There are seven libraries for C alone. The nice thing about JSON is interoperability, of course, and having a data format that is widely accepted (it even has an RFC: RFC 4627, later superseded by RFC 8259).
libconfuse has nearly the syntax you require:
/*
* This is a C-style multi-line comment
*/
BackLog = 2147483647
bookmark heimdal {
login = "anonymous"
password = ${ANONPASS:-anonymous#} # environment variable substitution
}

Is it possible to write code to write code?

I've heard that there are some things one cannot do as a computer programmer, but I don't know what they are. One thing that occurred to me recently was: wouldn't it be nice to have a class that could make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself. Is it possible for code to write code?
If you want to learn about the limits of computability, read about the halting problem:
In computability theory, the halting problem is a decision problem which can be stated as follows: given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.
Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist.
Start by looking at quines, then at Macro-Assemblers and then lex & yacc, and flex & bison. Then consider self-modifying code.
Here's a quine (formatted, use the output as the new input):
#include<stdio.h>
main()
{
char *a = "main(){char *a = %c%s%c; int b = '%c'; printf(a,b,a,b,b);}";
int b = '"';
printf(a,b,a,b,b);
}
Now, if you're just looking for things programmers can't do, look at undecidable problems rather than NP-complete ones.
Sure it is. That's how a lot of viruses work!
Get your head around this: computability theory.
Yes, that's what most Lisp macros do (for just one example).
Yes, it certainly is, though maybe not in the context you are referring to. Check out this post on T4.
If you look at functional programming, it offers many opportunities to write code that generates further code; the way a language like Lisp doesn't differentiate between code and data is a significant part of its power.
Rails generates the various default model and controller classes from the database schema when it's creating a new application. It's quite standard to do this kind of thing with dynamic languages. I have a few bits of PHP around that generate PHP files, just because it was the simplest solution to the problem I was dealing with at the time.
So it is possible. The question you are asking, though, is perhaps a little vague: what environment and language are you using? What do you expect the code to do, and why does it need to be added to? A concrete example may bring more directly relevant responses.
Yes it is possible to create code generators.
Most of the time they take user input and produce valid code. But there are other possibilities.
Self-modifying programs are also possible, but they were more common in the DOS era.
Of course you can! In fact, if you use a dynamic language, the class can change itself (or another class) while the program is still running. It can even create new classes that didn't exist before. This is called metaprogramming, and it lets your code become very flexible.
You are confusing/conflating two meanings of the word "write". One meaning is the physical writing of bytes to a medium, and the other is designing software. Of course you can have the program do the former, if it was designed to do so.
The only way for a program to do something that the programmer did not explicitly intend it to do, is to behave like a living creature: mutate (incorporate in itself bits of environment), and replicate different mutants at different rates (to avoid complete extinction, if a mutation is terminal).
Sure it is. I wrote an effect for Paint.NET* that gives you an editor and allows you to write a graphical effect "on the fly". When you pause typing it compiles it to a dll, loads it and executes it. Now, in the editor, you only need to write the actual render function, everything else necessary to create a dll is written by the editor and sent to the C# compiler.
You can download it free here: http://www.boltbait.com/pdn/codelab/
In fact, there is even an option to see all the code that was written for you before it is sent to the compiler. The help file (linked above) talks all about it.
The source code is available to download from that page as well.
*Paint.NET is a free image editor that you can download here: http://getpaint.net
In relation to artificial intelligence, take a look at Evolutionary algorithms.
make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself
You can also generate code, build it into a library instead of an executable, and then dynamically load the library without even exiting the program that is currently running.
Dynamic languages usually don't work quite as you suggest, in that they don't have a completely separate compilation step. It isn't necessary for a program to modify its own source code, recompile, and start from scratch. Typically the new functionality is compiled and linked in on the fly.
Common Lisp is a very good language to practice this in, but there are others where you can create code and run it then and there. Typically, this will be through a function called "eval" or something similar. Perl has an "eval" function, and it's generally common for scripting languages to have the ability.
There are a lot of programs that write other programs, such as yacc or bison, but they don't have the same dynamic quality you seem to be looking for.
Take a look at Langton's loops. They are among the simplest examples of a self-reproducing "program".
There is a whole class of such things called "Code Generators". (Although, a compiler also fits the description as you set it). And those describe the two areas of these beasts.
Most code generators take some form of user input (most take a database schema) and produce source code, which is then compiled.
More advanced ones can output executable code. With .NET, there's a whole namespace (System.CodeDom) dedicated to the creation of executable code. With these objects, you can take C# (or another language) code, compile it, and link it into your currently running program.
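As a small illustration of the first kind of generator (schema in, source code out), here is a sketch in C that turns a list of fields into the text of a struct declaration. The field type, the function name generate_struct and the schema itself are hypothetical. A real generator would read the schema from a database or a file rather than from an array.

```c
#include <stdio.h>
#include <string.h>

/* One column of a hypothetical schema: a C type and a field name. */
typedef struct { const char *ctype; const char *name; } field;

/*
 * Write the C source of a struct declaration into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small.
 */
int generate_struct(char *buf, size_t bufsz, const char *name,
                    const field *fields, size_t nfields)
{
    size_t used;
    int n = snprintf(buf, bufsz, "struct %s {\n", name);

    if (n < 0 || (size_t)n >= bufsz)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < nfields; i++) {
        n = snprintf(buf + used, bufsz - used, "    %s %s;\n",
                     fields[i].ctype, fields[i].name);
        if (n < 0 || (size_t)n >= bufsz - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, bufsz - used, "};\n");
    if (n < 0 || (size_t)n >= bufsz - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

The generated text would normally be written to a header or source file and compiled in a separate build step, which is the difference from the "eval"-style approaches in dynamic languages.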
I do this in PHP.
To persist settings for a class, I keep a local variable called $data. $data is just a dictionary/hashtable/assoc-array (depending on where you come from).
When you load the class, it includes a php file which basically defines data. When I save the class, it writes the PHP out for each value of data. It's a slow write process (and there are currently some concurrency issues) but it's faster than light to read. So much faster (and lighter) than using a database.
Something like this wouldn't work for all languages. It works for me in PHP because PHP is very much on-the-fly.
It has always been possible to write code generators. With XML technology, the use of code generators can be an essential tool. Suppose you work for a company that has to deal with XML files from other companies. It is relatively straightforward to write a program that uses the XML parser to parse the new XML file and write another program that has all the callback functions set up to read XML files of that format. You would still have to edit the new program to make it specific to your needs, but the development time when a new XML file (new structure, new names) arrives is cut down a lot by using this type of code generator. In my opinion, this is part of the strength of XML technology.
Lisp lisp lisp lisp :p
Joking aside, if you want code that generates code to run, and you've got time to lose learning it and breaking your mind with recursive stuff generating more code, try to learn Lisp :)
(eval '(or true false))
wouldn't it be nice to have a class that could make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself
There are almost no cases where that would solve a problem that cannot be solved "better" using non-self-modifying code.
That said, there are some very common (useful) cases of code writing other code. The most obvious is any server-side web application, which generates HTML/JavaScript (well, HTML is markup, but it's identical in theory). Also, any script that alters a terminal's environment usually outputs a shell script that is eval'd by the parent shell. wxGlade generates code that creates bare-bones wx-based GUIs.
See our DMS Software Reengineering Toolkit. This is general purpose machinery to read and modify programs, or generate programs by assembling fragments.
This is one of the fundamental questions of Artificial Intelligence. Personally I hope it is not possible - otherwise soon I'll be out of a job!!! :)
It is called meta-programming and is both a nice way of writing useful programs, and an interesting research topic. Jacques Pitrat's Artificial Beings: the conscience of a conscious machine book should interest you a lot. It is mostly related to meta-knowledge based computer programs.
Another related term is multi-staged programming (because there are several stages of programs, each generating the next one).
