Difference between a source code parser and a grammar parser

I have been trying to research source code parsers, and I often find people talking about parsing grammars.
So I was wondering: what's the difference between a source code parser and a grammar parser? Are they the same thing?

The phrase "source code parser" by itself is clear enough: this is a mechanism that parses source text, using either a parser-generator engine based off of a formal grammar or some kind of hand-coded (typically recursive descent) parser derived from the grammar informally. It is unclear what the result of a "source code parser" is from just the phrase; it might just be "yes, that's valid syntax", more usually "produces a parse or abstract syntax tree", or it might be (sloppily) "full abstract syntax tree plus symbol table plus control and dataflow analyses".
The phrase "grammar parser" is not one I encounter much (and I work a lot in this field). It is likely something garbled from some other source. In the absence of a widely known definition, one would guess that this means a) a "source code parser" driven by a parser generator engine from a formal grammar, or b) a "source code parser" which parses a grammar (which is a kind of source code, too), as an analog to the phrase "Fortran parser". For the latter I would tend to write "parser for a grammar" to avoid confusion, although "Fortran parser" is pretty clear.
You used a third term, "parsing grammar" which I also don't encounter much. This is likely to mean b) in the above paragraph.
Where did your terms come from?

Bison is a general purpose parser generator that converts a grammar description for an LALR(1) context-free grammar into a C program to parse that grammar.
This kind of talk is not correct; there are three mistakes in it. It should read:
Bison is a general purpose parser generator that
1. reads a BNF grammar, which defines the syntax of a context-free language,
2. does an LALR(1) analysis and conflict resolution, and
3. outputs a C program that reads input written in the language whose syntax is defined in the BNF grammar.
My intent is not to criticize, but to get people to use the correct terminology.
There is already enough misunderstanding in this subject.
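For concreteness, a minimal sketch of the kind of BNF grammar Bison reads (the lexer, the yyerror definition, and main are omitted; NUMBER is an assumed token):

%{
/* expr.y: running "bison expr.y" produces expr.tab.c, a C program
   that parses the language defined by the rules below. */
int yylex(void);
void yyerror(const char *s);
%}
%token NUMBER
%left '+' '-'
%%
expr : expr '+' expr
     | expr '-' expr
     | NUMBER
     ;
%%

The %left declaration is exactly the LALR(1) conflict-resolution step the corrected wording mentions: without it, the grammar above has shift/reduce conflicts.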

Related

Which type of error is generated by this code in C? Syntax or Semantic?

Which error does the following code generate? According to what I have read so far, a syntax error is mostly recognized at compile time, and a semantic error is recognized at run time. So does the following code throw a syntax error?
void f() {
    return 0;
}
This error is detected at compile time.
The meaning of "syntax error" generally includes some analysis that goes beyond the formal grammar, especially since the compiler is probably implemented with a recursive descent parser or otherwise has embedded logic beyond what the pure mathematics of a formal parser would entail. C, in particular, requires feedback between the lexer and the parser and cannot be parsed with a pure context-free grammar. (If you are interested, it is because typedef names must be understood as type names while following the grammar.)
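A classic illustration of that lexer/parser feedback, as a minimal sketch:

typedef int T;

void g(void) {
    T * x;   /* a declaration: x is a pointer to T */
    /* If T were a variable rather than a typedef name, the same line
       would parse as the expression "T multiplied by x". The lexer must
       consult the symbol table to report T as a type name, which is
       exactly the feedback that breaks pure context-freeness. */
}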
If you're a developer, you'll call this a "syntax error" because it's easily found by the compiler, naturally as part of what it needs to understand in order to generate code. It doesn't require a deeper static analysis than what it has to do anyway.
If you're a CS student studying parsers and grammar, you'll notice that it's grammatically correct, but the error is in what is being stated. That means it is not a syntax error but an error of semantics.
That formal distinction is not very useful in real life, since real languages require some semantic knowledge in order to parse. So the parser can indeed be the agent issuing the error, if it was earlier set up to expect a bare return keyword with no expression in a void function. To such an implementation it is a matter of syntax, but only because semantic analysis was applied to modify the parser earlier in the process.
In short, it depends on your definitions. If the definition chosen is something about the implementation of the tool, then you can't really say in the abstract.

How do I lex Unicode characters in C?

I've written a lexer in C. It currently lexes ASCII files successfully, but I'm confused about how I would lex Unicode. Which Unicode encodings would I need to support, for instance UTF-8, UTF-16, etc.? What do languages like Rust or Go support?
Are there any libraries that can help me out? I would prefer to try to do it myself so I can learn, but even then, a small library that I could read and learn from would be great.
There are already versions of lex (and other lexer tools) that support Unicode, and they are tabulated on the Wikipedia page List of Lexer Generators. There is also a list of lexer tools on the Wikipedia parser page. In summary, the following tools handle Unicode:
JavaCC - JavaCC generates lexical analyzers written in Java.
JFlex - A lexical analyzer generator for Java.
Quex - A fast universal lexical analyzer generator for C and C++.
FsLex - A lexer generator for byte and Unicode character input for F#
And, of course, there are the techniques used by W3.org and cited by @jim mcnamara at http://www.w3.org/2005/03/23-lex-U.
You say you have written your own lexer in C, but you used the tag lex, which is the name of a specific tool; perhaps that was an oversight?
In the comments you say you have not used regular expressions, but also that you want to learn. Learning something about the theory of language recognition is key to writing an efficient, working lexer. The symbols being recognised form a Chomsky Type 3 language, or regular language, which can be described by regular expressions. Regular expressions can be implemented by code that implements a finite state automaton (or finite state machine). The standard implementation of a finite state machine is a loop containing a switch. Most experienced coders should know, and be able to recognise and exploit, this form:
while (not_at_eof(input)) {        /* one iteration per input symbol */
    switch (input_symbol) {
        case STATE_SYMBOL_0:       /* transition for this symbol class */
            ...
            break;
        case STATE_SYMBOL_1:
            ...
            break;
        default:                   /* catch-all transition or error */
            ...
            break;
    }
}
If you code in this style, the same code can work whether the symbols being handled are 8-bit or 16-bit, as the algorithmic coding pattern remains the same.
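For example, if the input is UTF-8, the loop can be fed one decoded code point at a time. Here is a hedged sketch of such a decoder (the function name is illustrative, and malformed input is simply mapped to U+FFFD, which a production lexer would handle more carefully):

#include <stdio.h>

/* Read one UTF-8 code point from stdin; return it, or -1 at end of input. */
long read_codepoint(void) {
    int c = getchar();
    if (c == EOF) return -1;
    if ((c & 0x80) == 0) return c;             /* 1-byte (ASCII) form */
    long cp; int extra;
    if      ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
    else return 0xFFFD;                        /* invalid lead byte */
    while (extra-- > 0) {                      /* consume continuation bytes */
        c = getchar();
        if (c == EOF || (c & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}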
Ad hoc coding of a lexical analyser without an understanding of the underlying theory and practice will eventually hit its limits. I think you will find it beneficial to read a little more in this area.

Compiler Construction for Formal Requirement Specification Language using JFlex and CUP

I am planning to build a compiler for a requirement specification language. I have come up with the idea of using JFlex as the lexical analyzer and CUP as the parser.
Can anyone let me know whether it is possible to use JFlex and CUP for a formal specification language? All the documentation and tutorials relate to programming languages only.
Is any tutorial available for building a formal-language compiler?
Lexer and parser generators do not care whether your language is a "conventional computer language", only that it has a grammar specification they can handle.
Often the way you get such a grammar specification is to take the specification for your formal system as given and bend it according to the constraints of your chosen parser generator. This bending process is at best inconvenient and at worst really hard, depending on the gap between the parser generator's capabilities and what your formal language specification says.
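A typical, illustrative instance of such bending: formal specifications often write lists left-recursively, which an LALR tool such as CUP accepts directly, but which an LL-style tool forces you to rewrite.

As the specification might write it (fine for an LALR tool):
list ::= list "," item | item

Bent into EBNF repetition for an LL tool, which cannot handle left recursion:
list ::= item ("," item)*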
I suggest you inspect the formal grammar of your "Requirement Specification Language" and decide which parser generator to use based on that, to minimize the amount of bending you have to do.

ANTLR: generate an AST for C and parse the AST

I am doing static analysis on C programs. I searched the ANTLR website, and there seems to be no appropriate grammar file that produces an AST for C. Does that mean I have to do it myself from the very start, or is there a quicker method? I also need a tree parser that can traverse the AST created by the parser.
You indicated you want to do static analysis to detect buffer overflow.
First, writing a grammar for C is harder than it looks. There's all that stuff in the standard, and then there's what the real compilers actually accept. And you have to decide what to do about the preprocessor (and it varies from compiler to compiler!). If you don't get the grammar and preprocessing exactly right, you won't be able to parse real programs. (If you want to do toy languages, that's fine, but then you don't need a C grammar).
To do the analysis, you'll need far more machinery than an AST. You'll need symbol tables, control and data flow analysis, likely local and global points-to analysis, call graph extraction, and some type of range analysis.
People just don't seem to understand this.
** GETTING A PARSER IS A LONG WAY FROM DOING ANYTHING USEFUL WITH REAL LANGUAGES **
I'm shouting because I see this over, and over, and over.
If you want to get on with a specific program analysis or transformation task, unless you want to die of old age before you start, you had better find a foundation that already has most of what you need. A foundation on a parser generator with a creaky grammar is not a foundation. (Don't get me wrong: ANTLR, YACC, and JavaCC are all fine parser generators, and they're great for building a parser for a new language. They're great for implementing production parsers for real languages when the investment gets made. But they produce parsers, and mostly people don't do the production part. And they don't provide the additional machinery by a long shot.)
Our DMS Software Reengineering Toolkit contains all the above machinery because it is almost always needed, and it is a royal headache to implement. (My team has 15 years invested so far.)
We've also instantiated that machinery in forms specifically useful for COBOL, Java, C, and C++ (to a somewhat lesser extent; the language is really hard), in a variety of dialects, so that others don't have to repeat this long process.
GCC and Clang are pretty mature for C and C++ as alternatives.
The hardest part is writing the grammar. Mixing in rewrite rules to create an AST isn't that hard, and creating a tree grammar from a parser grammar that emits an AST isn't that hard either (compared to writing the parser grammar, that is).
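As a small hedged sketch of what those tree-building operators look like in an ANTLR 3 grammar (sketched for the Java target; '^' makes a token the root of the subtree and '!' excludes a token from the AST):

grammar Expr;
options { output=AST; }

expr   : term (('+' | '-')^ term)* ;
term   : factor (('*' | '/')^ factor)* ;
factor : NUMBER
       | '('! expr ')'!
       ;

NUMBER : '0'..'9'+ ;
WS     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;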
Here's a previous Q&A that shows how to create a proper AST: How to output the AST built using ANTLR?
And I couldn't find a decent SO-Q&A that explains how to go about creating a tree grammar, so here's a link to my personal blog that explains this: http://bkiers.blogspot.com/2011/03/6-creating-tree-grammar.html
Good luck.

Need a way to parse algebraic expressions in C

I need to parse algebraic expressions for an application I'm working on and am hoping to garner a bit of collective wisdom before taking a crack at it and possibly heading down the wrong road.
What I need to do is pretty straightforward: given a textual algebraic expression (3*x - 4(y - sin(pi))), create an object representation of the equation. The custom objects already exist, so I need a parser that creates a tree I can walk to instantiate the objects I need.
The basic requirements would be:
Ability to express the algebra as a grammar so I have control and can customize/extend it as necessary.
The initial syntax will include integers, real numbers, constants, variables, arithmetic operators (+, -, *, /), powers (^), equations (=), parentheses, precedence, and simple functions (sin(pi)). I'm hoping to extend my app fairly quickly to support functions proper (f(x) = 3x + 2).
Must compile in C as it needs to be integrated into my code.
I DON'T need to evaluate the expression mathematically, so software that solves for a variable or performs the arithmetic is noise.
I've done my Google homework and it looks like the best approach is to use a BNF grammar and software to generate a compiler in C. So my questions:
Does a BNF grammar with corresponding parser generator for algebraic expressions (or better yet, LaTex) already exist? Someone has to have done this already. I REALLY want to avoid rolling my own, mainly because I don't want to test it. I'd be willing to pay a reasonable amount for a library (under $50)
If not, which parser generator for C do you think is the easiest to learn/use here? Lex? YACC? Flex, Bison, Python/SymPy, Others? I'm not familiar with any of these.
The standard Linux tools flex and bison would probably be most appropriate here. IIRC the sample parsers and lexers used in these tools do something close to what you want, so you might be able to just modify that code to get what you need.
These tools seem like they meet your specifications. You can customize the grammars, compile down to C, and use any operator you want.
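As a hedged sketch of that approach, here is what a Bison grammar that builds a tree (rather than evaluating) might look like; the Node type and constructor names are illustrative, and the flex lexer supplying NUMBER is omitted:

%{
#include <stdlib.h>

typedef struct Node {            /* illustrative tree node */
    int op;                      /* operator, or '#' for a number leaf */
    double value;
    struct Node *left, *right;
} Node;

static Node *mk_node(int op, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->value = 0; n->left = l; n->right = r;
    return n;
}
static Node *mk_num(double v) {
    Node *n = mk_node('#', NULL, NULL);
    n->value = v;
    return n;
}

int yylex(void);
void yyerror(const char *s);
%}

%union { double num; struct Node *node; }
%token <num> NUMBER
%type  <node> expr
%left '+' '-'
%left '*' '/'
%right '^'

%%
expr : expr '+' expr   { $$ = mk_node('+', $1, $3); }
     | expr '-' expr   { $$ = mk_node('-', $1, $3); }
     | expr '*' expr   { $$ = mk_node('*', $1, $3); }
     | expr '/' expr   { $$ = mk_node('/', $1, $3); }
     | expr '^' expr   { $$ = mk_node('^', $1, $3); }
     | '(' expr ')'    { $$ = $2; }
     | NUMBER          { $$ = mk_num($1); }
     ;
%%

Walking the resulting tree to instantiate your existing objects is then a straightforward recursive traversal.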
I've had very good luck with ANTLR. It has runtimes for many different languages, including C, and has a very nice syntax for specifying grammars and building trees. I recently wrote a similar grammar (algebraic expressions) in 131 lines, which is definitely manageable.
I used code (found on the net) from "Program Translation Fundamentals" by Peter Calingaert.
I enhanced it to handle functions, which lets you implement things like if(a, b, c) (kind of like how Excel does things).
You can build a simple parser yourself or use any of the popular "compiler-compilers" (some of them were listed in other posts). Just decide whether your parser will be complicated enough to justify using (and learning) an external tool. In any case you'll need to define the grammar; that is usually the most brain-intensive task if you don't have prior experience. The formal way to define a syntactic grammar is BNF or EBNF.
