Which type of error is generated by this code in C? Syntax or Semantic?

Which error does the following code generate? According to the information I have read so far, a syntax error is recognized mostly at compile time, and a semantic error is recognized at run time. So does the following code throw a syntax error?
void f() {
    return 0;
}

This error is detected at compile time.
The meaning of "syntax error" generally includes some analysis that goes beyond the formal grammar, especially since the compiler is probably implemented using a recursive descent parser or otherwise has embedded logic that goes beyond what the pure math of a formal parser would entail. C, in particular, requires some feedback between the lexer and the parser and can't be parsed by a pure context-free parser. (If you are interested, it's because typedef names must be understood as types when following the grammar.)
If you're a developer, you'll call this a "syntax error" because it's easily found by the compiler, naturally as part of what it needs to understand in order to generate code. It doesn't require a deeper static analysis than what it has to do anyway.
If you're a CS student studying parsers and grammar, you'll notice that it's grammatically correct, but the error is in what is being stated. That means it is not a syntax error but an error of semantics.
That formal distinction is not very useful in real life, as real languages require some semantic knowledge in order to parse. So the parser can indeed be the agent issuing the error, since it's been set up to want a bare keyword with no argument, earlier in the process. So it is a matter of syntax to such an implementation, but the semantic analysis was applied to modify the parser, earlier.
In short, it depends on your definitions. If the definition chosen is something about the implementation of the tool, then you can't really say in the abstract.

Related

C Compiler syntax error detection

Coming from a beginner-level programmer: does a C compiler build a concrete syntax tree for detecting errors like missing semicolons?
Or more generally, how does a C compiler detect syntax errors?
does a C compiler build a concrete syntax tree
Yes, or rather an 'abstract syntax tree', at least conceptually.
for detecting errors like missing semicolons?
Not for detecting syntax errors; after detecting and removing syntax errors.
Syntax errors are detected during parsing, on encountering a token that isn't a valid continuation of the current state. It's a large subject. Don Knuth is writing a monster tome on it, and has been for 20 years ;-), but there are plenty already in existence.
The short answer is yes: every compiler tries to build a parse tree from the input files and generates a syntax error when it fails.
Anyway, in order for the compiler to figure out what's wrong exactly, a little more intelligence is required. For example, the compiler may tentatively insert a semicolon where parsing breaks and see if that would fix the syntax error. If so, it may suggest that a semicolon is missing in the error message.
As a note, C syntax is well defined by the standard, while error messages like "missing semicolon" are a friendly addition of the compiler.
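For example, here is a minimal sketch of where a parser actually trips over a missing semicolon; the quoted message is only typical GCC-style wording, and other compilers phrase it differently:

int main(void) {
    int x = 1      /* the ';' is missing here */
    return x;      /* the parse fails at this token, e.g.
                      "expected ';' before 'return'" */
}

The declaration itself looks fine for as long as the parser reads it; only the return token, which is not a valid continuation, reveals the problem, and the recovery heuristic then suggests the missing semicolon.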

difference between Source code parser and Grammar parser

I have been trying to research source code parsers, and I often find people talking about parsing grammars.
So I was wondering: what's the difference between a source code parser and a grammar parser? Are they the same thing?
The phrase "source code parser" by itself is clear enough: this is a mechanism that parses source text, using either a parser-generator engine based off of a formal grammar or some kind of hand-coded (typically recursive descent) parser derived from the grammar informally. It is unclear what the result of a "source code parser" is from just the phrase; it might just be "yes, that's valid syntax", more usually "produces a parse or abstract syntax tree", or it might be (sloppily) "full abstract syntax tree plus symbol table plus control and dataflow analyses".
The phrase "grammar parser" is not one I encounter much (and I work a lot in this field). It is likely something garbled from some other source. In the absence of a widely known definition, one would guess that this means a) a "source code parser" driven by a parser generator engine from a formal grammar, or b) a "source code parser" which parses a grammar (which is a kind of source code, too), as an analog to the phrase "Fortran parser". For the latter I would tend to write "parser for a grammar" to avoid confusion, although "Fortran parser" is pretty clear.
You used a third term, "parsing grammar" which I also don't encounter much. This is likely to mean b) in the above paragraph.
Where did your terms come from?
Bison is a general purpose parser generator that converts a grammar description for an LALR(1) context-free grammar into a C program to parse that grammar.
This kind of talk is not correct. There are 3 mistakes in this. It should read:
Bison is a general purpose parser generator that
reads a BNF grammar, which defines the syntax of a context-free language,
does an LALR(1) analysis and conflict resolution, and
outputs a C program that reads input written in the language whose syntax is defined in the BNF grammar.
My intent is not to criticize, but to get people to use the correct terminology.
There is already enough misunderstanding in this subject.

Lexical and Semantic Errors in C

Recently I had to give examples for Lexical and Semantic Errors in C. I have provided the following examples.
I thought the following was a lexical error,
int a#;
And for a semantic error I gave the following example,
int a[10];
a=100;
But now I am a bit confused whether both are in fact syntax errors. Could someone give me some clarity about these errors?
First, the classification of errors (as lexical, syntactic, semantic, pragmatic) is somewhat arbitrary in its details.
If you define a lexical error as an error detected by the lexer, then a malformed number could be one, e.g. 12q4z5, or a name with a prohibited character like $.
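A minimal illustration of the malformed-number case: the whole of 12q4z5 is lexed as a single preprocessing-number token, so the error is reported during lexing and constant conversion rather than by the parser proper (the quoted message approximates GCC's wording):

int bad = 12q4z5;   /* error along the lines of:
                       invalid suffix "q4z5" on integer constant */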
You could define a syntactic error as one detected at parsing time. However, C is not, stricto sensu, a context-free language, and its parsers keep contextual information (e.g. in symbol tables).
Since all of a, #, and ; are valid lexemes, your a#; is not a lexical error.
Actually # is mostly useful at preprocessing time, and what is parsed is actually the preprocessed form, not the user-given source code!
Many semantic errors are related to the notion of undefined behavior, like printf("%d"), which lacks an integer argument, or an out-of-bounds access, in your case printf("%d\n", a[1234]);
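Both of those examples compile (possibly with a format warning), which is exactly why they are semantic rather than syntactic problems; a minimal sketch:

#include <stdio.h>

int main(void) {
    int a[10] = {0};
    printf("%d");              /* undefined behavior: "%d" promises an int
                                  argument that is never passed */
    printf("%d\n", a[1234]);   /* undefined behavior: reads far past the
                                  end of a */
    return 0;
}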
Some (but not all) semantic or pragmatic errors can be found through static analysis tools. You could even say that modern compilers (when all warnings are enabled) do some of that.
Your example of a=100; is a typing error (which could arbitrarily be called syntactic, since the C compiler finds it at parsing time, or semantic, since it is related to types, which are not a context-free property). There are also more specialized static analysis tools like Frama-C, and you could extend or customize the GCC compiler (e.g. with MELT, a domain-specific language) to add your own analyses (as in TALPO, for example).
A pragmatic error could be to declare a huge local variable, like here. Probably, when 1 TB of RAM is common, stacks will have many gigabytes, so that example would run, but not today.
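A hypothetical illustration of such a pragmatic error (the 1 GiB size is only chosen to dwarf a typical thread stack of a few megabytes):

void f(void) {
    char huge[1024L * 1024 * 1024];   /* ~1 GiB of automatic storage */
    huge[0] = 'x';                    /* compiles, but calling f() will almost
                                         certainly overflow the stack today */
}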

Are GCC and Clang parsers really handwritten?

It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine-generated, Bison/Flex-based, bottom-up parsers.
Could someone here please confirm that this is the case?
And if so, why do mainstream compiler frameworks use handwritten parsers?
Update: there is an interesting blog post on this topic here.
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
It isn't true.
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our "compiler" front ends; GLR is pretty wonderful parsing technology).
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nonetheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
What is true, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most of) the original standard, they also tend to have lots of dark-corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can
spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ code for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
EDIT April 2017: Now handles (draft) C++17.
Yes:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang uses a recursive-descent parser for several reasons:
Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags). With automatically generated parsers, you're limited to the capabilities of the generator.
Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
Weird answers there!
C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
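A minimal sketch of that ambiguity (the function names are invented): the very same statement text is a declaration or an expression depending on whether Foo currently names a typedef, which is why the parser must maintain a typedef table and consult it while parsing.

typedef int Foo;

void declares_a_pointer(void) {
    Foo * bar;     /* declaration: bar is a pointer to Foo */
    (void)bar;
}

void multiplies_two_ints(int Foo, int bar) {
    Foo * bar;     /* expression statement: Foo times bar, value discarded
                      (the parameters shadow the file-scope typedef) */
}

A compiler may warn that the second statement has no effect, but it is parsed as a multiplication, not a declaration.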
Ira Baxter: I don't see the point of your GLR approach. Why build a parse tree which contains ambiguities? Parsing means resolving ambiguities and building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly...
Yacc is an LR(1) parser generator (or LALR(1)), but it can easily be modified to be context sensitive. And there is nothing ugly in that. Yacc/Bison was created to help in parsing the C language, so it probably isn't the ugliest tool with which to generate a C parser...
Until GCC 3.x, the C parser was generated by yacc/bison, with the typedef table built during parsing. With this "in-parse" typedef table building, the C grammar becomes locally context-free and furthermore "locally LR(1)".
Now, in GCC 4.x, it is a recursive descent parser. It is essentially the same parser as in GCC 3.x: it recognizes the same grammar rules, still "locally LR(1)". The difference is that the yacc parser has been rewritten by hand, the shift/reduce decisions are now hidden in the call stack, and there is no "state454: if (nextsym == '(') goto state398" as in the GCC 3.x yacc parser, so it is easier to patch, to handle errors and print nicer messages, and to perform some of the next compilation steps during parsing. The price is much less "easy to read" code for a GCC noob.
Why did they switch from yacc to recursive descent? Because avoiding yacc is practically necessary for parsing C++, and because GCC aims to be a multi-language compiler, i.e. to share as much code as possible between the different languages it can compile. This is why the C++ and C parsers are written in the same way.
C++ is harder to parse than C because it isn't "locally" LR(1) like C; it is not even LR(k).
Look at func<4 > 2>, which is a template function instantiated with 4 > 2; i.e., func<4 > 2> has to be read as func<1>. This is definitely not LR(1). Now consider func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>. This is where a recursive descent parser can easily resolve the ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function: if parse_template_parameter(17tokens) fails, try again with parse_template_parameter(15tokens), then parse_template_parameter(13tokens), ... until it works).
I don't know why it wouldn't be possible to add recursive sub-grammars to yacc/bison; maybe that will be the next step in GCC/GNU parser development?
GCC's parser is handwritten. I suspect the same for Clang. This is probably for a few reasons:
Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution. Abstraction usually has a performance hit.
Timing: at least in the case of GCC, it predates a lot of free developer tools (GCC came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine-generated, Bison/Flex-based, bottom-up parsers.
I don't think Bison in particular can handle the grammar without parsing some things ambiguously and doing a second pass later.
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's O(n) and I doubt parsing is the bottleneck in much tooling.

How to define grammar which excludes a certain set of words?

I have built a small tool for static analysis of C code. The purpose of building it is to warn users about the use of functions such as strcpy() which could cause buffer overflows.
Now, to formalise this, I need to write a formal grammar which shows the excluded libraries as NOT part of the allowed set of accepted library functions.
For example,
AllowedSentence->ANSI C Permitted Code, NOT UnSafeLibraryMethods
UnSafeLibraryMethods->strcpy|other potentially unsafe methods
Any ideas on how this grammar can be formalised?
I think this should not be done at the grammar level. It should be a rule that is applied to the parse tree after parsing is done.
You hardly need a parser for the way you have posed the problem. If your only goal is to object to the presence of certain identifiers ("strcpy"), you can simply build a lexer that processes C and picks out identifiers. Special lexemes can recognize your list of "you shouldn't use this". This way you use positive recognition instead of negative recognition to pick out the identifiers that you believe to be trouble.
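A minimal sketch of that lexer-only approach, assuming the banned list below and ignoring comments, string literals, and the preprocessor for brevity (so it is illustrative, not production-ready):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* hypothetical list of identifiers to flag */
static const char *banned[] = { "strcpy", "strcat", "sprintf", "gets" };

static void check(const char *ident, int line) {
    for (size_t i = 0; i < sizeof banned / sizeof banned[0]; i++)
        if (strcmp(ident, banned[i]) == 0)
            printf("line %d: use of unsafe function '%s'\n", line, ident);
}

int main(void) {
    char ident[128];
    size_t len = 0;
    int c, line = 1;

    while ((c = getchar()) != EOF) {
        int starts = isalpha(c) || c == '_';
        int more = isalnum(c) || c == '_';
        if ((len == 0 && starts) || (len > 0 && more)) {
            if (len < sizeof ident - 1)
                ident[len++] = (char)c;    /* accumulate an identifier */
        } else {
            if (len > 0) {                 /* an identifier just ended */
                ident[len] = '\0';
                check(ident, line);
                len = 0;
            }
            if (c == '\n')
                line++;
        }
    }
    if (len > 0) {                         /* flush a trailing identifier */
        ident[len] = '\0';
        check(ident, line);
    }
    return 0;
}

Run it over a source file, e.g. ./check < program.c (the program and file names are made up); a real tool would also skip comments and strings and, as noted below, resolve names so it does not flag user-defined functions that merely share a name with a library function.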
If you want a more sophisticated analysis tool, you'll likely want to parse C, name-resolve the identifiers to their actual definitions, then scan the tree looking for identifiers that are objectionable. This will at least let you decide if the identifier is actually defined by the user, or comes from some known library; surely, if my code defines strcpy, you shouldn't complain unless you know my strcpy is defective somehow.
