What is the simplest parsing algorithm that can parse C code? - c

Does anyone know what the weakest family of widely-used parsing algorithms is that can parse C code? That is, is the C grammar LL(1), LR(0), LALR(1), etc.? I'm curious because as a side project I'm interested in writing a parser generator for one of these families and would like to ultimately be able to parse C code for another side project.

It seems that Bison uses an LALR(1) parser. LALR parsers are more robust than LL parsers, but are also more complex. From this I suspect that LALR(1) is probably the weakest parsing algorithm which can parse C code.
Unless you're really set on rolling your own recognizer. ANTLR would probably be your best bet to do this. ANTLR uses an LL* algorithm (which is, effectively, LALR).

Related

A more complete recursive descent c interpreter

I've seen several implementations of recursive descent c interpreters which all seem
to do a pretty good job - yet they all only implement a small portion of the C language -
for example they don't support structs or typedefs etc -
Does anyone know of any code that supports a large portion of the C language.
I know adding more functionality would be pretty trivial - but I'm a bit strapped
for time.
Picoc supports more that most of the Tiny/Small C interpreters. You might give it a look. And it does support structures.
If you just want to use it, this one looks awfully good for the job. There was a Dr. Dobb's article on it a while back ... there it is

antlr generate ast for c and parse the ast

I am doing static analyze on c program.And I search the antlr website ,there seems to be no appropriate grammar file that produce ast for c program.Does it mean I have to do it myself from the very start.Or is there a quicker method.I also need a tree parser that can traverse the ast created by the parser.
You indicated you want to do static analysis to detect buffer overflow.
First, writing a grammar for C is harder than it looks. There's all that stuff in the standard, and then there's what the real compilers actually accept. And you have to decide what to do about the preprocessor (and it varies from compiler to compiler!). If you don't get the grammar and preprocessing exactly right, you won't be able to parse real programs. (If you want to do toy languages, that's fine, but then you don't need a C grammar).
To do the analysis, you'll need far more machinery than an AST. You'll need symbol tables, control and data flow analysis, likely local and global points-to analysis, call graph extraction, and some type of range analysis.
People just don't seem to understand this.
** GETTING A PARSER IS A LONG WAY FROM DOING ANYTHING USEFUL WITH REAL LANGUAGES **
I'm shouting because I see this over, and over, and over.
If you want to get on with a specific program analysis or transformation task, unless you want to die of old age before you start your task, you better find a foundation that has most of what you need already. A foundation on a parser generator with a creaky grammar is not a foundation. (Don't get me wrong: ANTLR, YACC, JavaCC are all fine parser generators, and they're great for building a parser for a new language. They're great for implementing production parsers for real langauges when the investment gets made. But they produce parsers, and mostly people don't do the production part. And they don't provide the additional machinery by a long shot.)
Our DMS Software Reengineering Toolkit contains all the above machinery because it is almost always needed, and it is a royal headache to implement. (My team has 15 years invested so far.)
We've also instantiated that machinery is forms specifically useful for COBOL and Java, C, C++ (to somewhat lesser extent, the language is really hard), in a variety of dialects, so that others don't have to repeat this long process.
GCC and Clang are pretty mature for C and C++ as alternatives.
The hardest part is writing the grammar. Mixing in rewrite rules to create an AST isn't that hard, and creating a tree grammar from a parser grammar that emits an AST isn't that hard too (compared to writing the parser grammar, that is).
Here's a previous Q&A that shows how to create a proper AST: How to output the AST built using ANTLR?
And I couldn't find a decent SO-Q&A that explains how to go about creating a tree grammar, so here's a link to my personal blog that explains this: http://bkiers.blogspot.com/2011/03/6-creating-tree-grammar.html
Good luck.

What libraries would be useful for implementing a small language interpreter in C?

For my own learning experience, I want to try writing an interpreter for a simple programming language in C – the main thing I think I need is a hash table library, but a general purpose collection of data structures and helper functions would be pretty helpful. What would you guys recommend?
libbasekit - by the author of Io. You can also use libcoroutine.
One library I recommend looking into is libgc, a garbage collector for C.
You use it by replacing calls to malloc, realloc, strdup, etc. with their libgc counterparts (e.g. GC_MALLOC). It works by scanning the stack, global variables, and GC-allocated blocks, looking for numbers that might be pointers. Believe it or not, it actually performs quite well (almost on par with the very good ptmalloc, which is the default (non-garbage collected) malloc implementation in GNU/Linux), and a lot of programs use it (including Mono and GCJ). A disadvantage, though, is it might not play well with other libraries you may want to use, and you may even have to recompile some of them by hand to replace calls to malloc with GC_MALLOC.
Honestly - and I know some people will hate me for it - but I recommend you use C++. You don't have to bust a gut to learn it just to be able to start your project. Just use it like C, but in an hour you can learn how to use std::map<> (an associative container), std::string for easy textual data handling, and std::vector<> for a resizable heap-allocated array. If you want to spend an extra hour or two, learn to put member functions in classes (don't worry about polymorphism, virtual functions etc. to begin with), and you'll get a more organised program.
You need no more than the standard library for a suitably small language with simple constructs. The most complex part of an interpreted language is probably expression evaluation. For both that, procedure-calling, and construct-nesting you will need to understand and implement stack data structures.
The code at the link above is C++, but the algorithm is described clearly and you could re-implement it easily in C. There again there are few valid arguments for not using C++ IMO.
Before diving into what libraries to use I suggest you learn about grammars and compiler design. Especially input parsing is for compilers and interpreters similar, that is tokenizing and parsing. The process of tokenizing converts a stream characters (your input) into a stream of tokens. A parser takes this stream of tokens and matches it with your grammar.
You don't mention what language you're writing an interpreter for. But very likely that language contains recursion. In that case you need to use a so-called bottom-up parser which you cannot write by hand but needs to be generated. If you try write such a parser by hand you will end up with a error-prone mess.
If you're developing for a posix platform then you can use lex and yacc. These tools are a bit old but very powerful for building parsers. Lex can generate code that implements the tokenizing process and yacc can generate a bottom-up parser.
My answer probably raises more questions than it answers. That's because the field of compilers/interpreters is quite complex and cannot simply be explained in a short answer. Just get a good book on compiler design.

Need a way to parse algebraic expressions in C

I need to parse algebraic expressions for an application I'm working on and am hoping to garnish a bit of collective wisdom before taking a crack at it and, possibly, heading down the wrong road.
What I need to do is pretty straight forward: given a textual algebraic expression (3*x - 4(y - sin(pi))) create a object representation of the equation. The custom objects already exist, so I need a parser that creates a tree I can walk to instantiate the objects I need.
The basic requirements would be:
Ability to express the algebra as a grammar so I have control and can customize/extend it as necessary.
The initial syntax will include integers, real numbers, constants, variables, arithmetic operators (+, - , *, /), powers (^), equations (=), parenthesis, precedence, and simple functions (sin(pi)). I'm hoping to extend my app fairly quickly to support functions proper (f(x) = 3x +2).
Must compile in C as it needs to be integrated into my code.
I DON'T need to evaluate the expression mathematically, so software that solves for a variable or performs the arithmetic is noise.
I've done my Google homework and it looks like the best approach is to use a BNF grammar and software to generate a compiler in C. So my questions:
Does a BNF grammar with corresponding parser generator for algebraic expressions (or better yet, LaTex) already exist? Someone has to have done this already. I REALLY want to avoid rolling my own, mainly because I don't want to test it. I'd be willing to pay a reasonable amount for a library (under $50)
If not, which parser generator for C do you think is the easiest to learn/use here? Lex? YACC? Flex, Bison, Python/SymPy, Others? I'm not familiar with any of these.
The standard Linux tools flex and bison would probably be most appropriate here. IIRC the sample parsers and lexers used in these tools do something close to what you want, so you might be able to just modify that code to get what you need.
These tools seem like they meet your specifications. You can customize the grammars, compile down to C, and use any operator you want.
I've had very good luck with ANTLR. It has runtimes for many different languages, including C, and has a very nice syntax for specifying grammars and building trees. I recently wrote a similar grammar (algebraic expressions) in 131 lines, which is definitely manageable.
I used the code (found on the net) from the following:
Program Translation Fundamentals" by Peter Calingaert
I enhanced it to handle functions, which lets you implement things like "if(a, b, c)" (kind of like how Excel does things).
you can build simple parser yourself or use any of popular "compiler-compiler" (some of them were listed by other posts). just decide if your parser will be complicated enough to use (and learn) an external tool. in any case you'll need to define the grammar, usually it's the most brain intensive task if you don't have prior experience. the formal way to define syntactic grammars is BNF or EBNF

How do I implement parsing?

I am designing a compiler in C. I want to know which technique I should use, top-down or bottom up? I have only implemented operator precedence using bottom up. I have applied the following
rules:
E:=E+E
E:=E-E
E:=E/E
E:=E*E
E:=E^E
I want to know that am I going the right away?
If I want to include if-else, loops, arrays, functions, do I need to implement parsing?
If yes, how do I implement it? Any one can
I have only implemented token collection and operator precedence. What is the next steps?
Lex & Yacc is your answer. Or Flex and Bison which are branched version of original tools.
They are free, they are the real standard for writing lexers and parsers in C and used all around everywhere.
In addition O'Reilly has released a small pearl of 300 pages: Flex & Bison. I bought it and it really explains you how to write a good parser for a programming language and handle all the subtle things (error recovery, conflicts, scopes and so on). It will answer also your questions about how you are parsing expressions: your approach is right with a top-down parser but you'll discover that is not enough to handle operator precedences.
Of course, for hobby, you could write your own lexer and parser but it would be just an academic effort that is nice to understand how FSM and parser work but with no so much fun :)
If you are, instead, interested in programming language design or complex implementations I suggest this book: Programming Language Pragmatics that is not so famous because of the Dragon Book but it really explains why and how various characteristics can and should be implemented in a compiler. The Dragon Book is a bible too, and it will cover at a real low level how to write a parser.. but it would be sort of boring, I warn you..
The best way to implement a good parser in C is using flex & yacc
Your question is quite vague and hard to answer without a more specific, detailed question. The "Dragon book" is an excellent reference though for someone seeking to implement a compiler from scratch, or as others have pointed out Lex and Yacc.
If you intend to implement the parser by hand, you will want to do a recursive descent parser. The code directly reflects the grammar, so it's fairly easy to figure out and understand. It places some restrictions on your grammar (you can't have any left-recursive nonterminals), but you can work around those problems.
However, it depends on the complexity of the grammar; hand-hacking a parser for anything much more complicated than basic arithmetic expressions gets very tedious very quickly. If you're trying to implement anything that looks like a real programming language, use a parser generator like yacc or bison.

Resources