C Compiler syntax error detection


Coming from a beginner level programmer - does a C compiler build a concrete syntax tree for detecting errors like missing semicolons?
Or more generally, how does a C compiler detect syntax errors?

does a C compiler build a concrete syntax tree
Yes, or rather an 'abstract syntax tree', at least conceptually.
for detecting errors like missing semicolons?
Not for detecting syntax errors; after detecting and removing syntax errors.
Syntax errors are detected during parsing, on encountering a token that isn't a valid continuation of the current state. It's a large subject. Don Knuth is writing a monster tome on it, and has been for 20 years ;-), but there are plenty already in existence.
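As a rough illustration of "a token that isn't a valid continuation of the current state", here is a minimal sketch for a toy grammar in which every statement is IDENT '=' NUMBER ';'. The token kinds and the function name find_syntax_error are made up for this example and are not taken from any real compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds for a toy grammar: stmt := IDENT '=' NUMBER ';' */
typedef enum { TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI, TOK_EOF } TokenKind;

/* Returns the index of the first token that is not a valid continuation of
   the current parser state, or -1 if the whole input parses. If the input
   runs out in the middle of a statement, returns count. */
int find_syntax_error(const TokenKind *toks, size_t count) {
    /* The only token sequence allowed inside one statement. */
    static const TokenKind expected[] = { TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI };
    size_t state = 0;  /* position inside the current statement */

    assert(toks != NULL);
    for (size_t i = 0; i < count; i++) {
        if (toks[i] == TOK_EOF) {
            /* End of input is only valid between statements. */
            return state == 0 ? -1 : (int)i;
        }
        if (toks[i] != expected[state]) {
            return (int)i;  /* unexpected token: the error is reported here */
        }
        state = (state + 1) % 4;
    }
    return state == 0 ? -1 : (int)count;
}
```

For the token stream of `x = 1 y = 2 ;` the function returns 3: the second identifier arrives where only ';' would be a valid continuation. A real C parser tracks far more state, but the principle is the same.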

The short answer is yes: every compiler tries to build a parse tree from the input files and reports a syntax error when it fails.
However, for the compiler to figure out what exactly is wrong, a little more intelligence is required. For example, the compiler may tentatively insert a semicolon where parsing breaks and see whether that fixes the syntax error. If so, it may suggest in the error message that a semicolon is missing.
As a note, C syntax is well defined by the standard, while error messages like "missing semicolon" are a friendly addition of the compiler.
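The "tentatively insert a semicolon" idea can be sketched as follows. This is only an illustration on a toy grammar (statements of the form IDENT '=' NUMBER ';'); the names first_error and suggest_semicolon are hypothetical, and real compilers use more elaborate recovery strategies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI } TokenKind;

/* Toy grammar: zero or more statements IDENT '=' NUMBER ';'.
   Returns the index of the first unexpected token, count if the input
   ends in the middle of a statement, or -1 on success. */
int first_error(const TokenKind *toks, size_t count) {
    static const TokenKind expected[] = { TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI };
    size_t state = 0;
    for (size_t i = 0; i < count; i++) {
        if (toks[i] != expected[state]) return (int)i;
        state = (state + 1) % 4;
    }
    return state == 0 ? -1 : (int)count;
}

/* Returns the index before which a ';' appears to be missing, or -1 when
   there is no error or when inserting ';' at the failure point does not
   make the whole input parse. */
int suggest_semicolon(const TokenKind *toks, size_t count) {
    assert(toks != NULL);
    int err = first_error(toks, count);
    if (err < 0) return -1;  /* nothing to fix */

    /* Build a copy of the input with one ';' inserted at the failure point. */
    TokenKind *patched = malloc((count + 1) * sizeof *patched);
    if (patched == NULL) return -1;
    memcpy(patched, toks, (size_t)err * sizeof *patched);
    patched[err] = TOK_SEMI;
    memcpy(patched + err + 1, toks + err, (count - (size_t)err) * sizeof *patched);

    /* Re-parse: if the patched input is fine, the ';' was what was missing. */
    int still_broken = first_error(patched, count + 1);
    free(patched);
    return still_broken < 0 ? err : -1;
}
```

For the tokens of `x = 1 y = 2 ;` this returns 3, so a diagnostic could say "missing ';' before 'y'". For `x y` it returns -1, because inserting a semicolon does not help and a more generic message is appropriate.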

Related

Which type of error is generated by this code in C? Syntax or Semantic?

Which error does the following code generate? According to what I have read so far, a syntax error is mostly recognized at compile time, and a semantic error is recognized at run time. So does the following code produce a syntax error?
void f(){
return 0;
}

This error is detected at compile time.
The meaning of "syntax error" generally includes some analysis that goes beyond the formal grammar, especially since the compiler is probably implemented using a recursive descent parser or otherwise has embedded logic that goes beyond what the pure math of a formal parser would entail. C, in particular, requires some feedback between the lexer and the parser and can't be a pure context-free grammar parser. (if you are interested, it's because typedef names must be understood as types when following the grammar)
If you're a developer, you'll call this a "syntax error" because it's easily found by the compiler, naturally as part of what it needs to understand in order to generate code. It doesn't require a deeper static analysis than what it has to do anyway.
If you're a CS student studying parsers and grammar, you'll notice that it's grammatically correct, but the error is in what is being stated. That means it is not a syntax error but an error of semantics.
That formal distinction is not very useful in real life, as real languages require some semantic knowledge in order to parse. So the parser can indeed be the agent issuing the error, since it's been set up to want a bare keyword with no argument, earlier in the process. So it is a matter of syntax to such an implementation, but the semantic analysis was applied to modify the parser, earlier.
In short, it depends on your definitions. If the definition chosen is something about the implementation of the tool, then you can't really say in the abstract.
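To make the typedef remark concrete: the same token sequence `T * x` is either a declaration or a multiplication, depending on what T names at that point. The sketch below is valid C and shows both readings; the function names are invented for the illustration.

```c
#include <assert.h>

typedef int T;  /* at file scope, T names a type */

/* Here a local variable named T hides the typedef, so "T * x" is an
   ordinary multiplication expression. */
int multiplication_case(void) {
    int T = 6;
    int x = 7;
    return T * x;
}

/* Here T still refers to the typedef, so "T * x;" is a declaration of
   x as a pointer to int. */
int declaration_case(void) {
    int value = 5;
    T * x;
    x = &value;
    assert(*x == value);  /* x really is a pointer to value */
    return *x;
}
```

multiplication_case() returns 42 and declaration_case() returns 5. The parser cannot choose between the two readings from the grammar alone; it has to know what T currently means, which is the feedback from semantic information into parsing described above.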

Why missing terminating " character is a warning? [closed]

When, for example, we compile the following code:
printf("hello);
we get a warning and then an error about the missing " character. In my opinion, warnings inform us about code that can be compiled but whose behaviour may differ from what the developer expects. So there are two things I do not understand:
Is there a complete program that compiles without errors while containing such a piece of code?
If no such program exists, why does this missing-character situation not produce just an error (instead of a warning plus an error)?
EDIT (I am doing my best to cope with off-topic votes recommendations):
1. Desired behavior : only one error diagnostic message, there is no need for a warning for the same thing.
Other related issues that do not let me accept the first answer:
2.1 Does printf_s() have the same issue? I tried to enable -c11 option with no success.
2.2 The historical reason for emitting the warning does not seem plausible to me: why is this double message not also used in similar cases (old accepted constructs that are forbidden in newer C versions)?

In my opinion, warnings inform us about code that can be compiled
but whose behaviour may differ from what the developer
expects. So there are two things I do not understand:
Your opinion is irrelevant here, and neither the C standard nor the C++ standard distinguishes different categories of diagnostic messages. That many compilers do in fact distinguish them is a historically based convention, albeit a widely observed one. What ultimately matters is what your compiler means by such a distinction (if indeed it makes one). On the other hand, and fortunately for you, GCC does adopt a convention similar to what you describe, as documented in its manual:
Errors report problems that make it impossible to compile your program. [...].
Warnings report other unusual conditions in your code that may indicate a problem, although compilation can (and does) proceed. [...]
(GCC 7.2 manual, section 13.9; the same or similar text appears also in earlier versions of the manual, back to at least v.4.)
Note well that the documentation frames the meaning of a warning slightly differently than you do: a GCC warning signals that compilation can proceed, but there is no assurance that it can complete successfully. If indeed it ultimately cannot then I would expect GCC, pursuant to its documentation, to also issue an error diagnostic. That is exactly what I observe with this test program, whether compiling as C or as C++:
#include <stdio.h>
int main(void) {
printf("hello);
}
I really think you are making far too much of the fact that GCC emits a warning in addition to an error in this case. That's an implementation quirk of no particular significance.
Is there a complete code that can be compiled without errors while containing such a portion of code.
It depends on exactly what you mean by that. Trivially, I could prefix the erroneous line in the above program with // to turn it into a comment, and that would make it perfectly valid C and C++ source. There are manifold other ways I could add to the given source without removing anything to make it valid -- some of them would even produce a program in which a printf() call is in fact performed.
I suppose that what you really want to know is whether there is code that would elicit the warning from GCC but not the corresponding error. To the best of my knowledge, modern GCC does not afford such code, but historically, GCC did allow it as an extension, in the form of embedded, unescaped newlines in string literals:
printf("hello);
Goodbye");
That behavior was already deprecated in GCC 3.2, and it was removed as early as GCC 4 (current is 7.2).
If such a code does not exist, why this missing character situation does not give us only an error (not a warning+error).
We can only guess, but it seems plausible that it derives from the historical existence of the language extension described above. And again, you are making far too much of this. GCC emits two diagnostics about the same problem -- so what? The ultimate purpose of the diagnostics is to help you figure out what is or may be wrong with your code, and the diagnostics GCC emits in this case do that job just fine.
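For what it is worth, detecting the problem itself is simple. The following is a minimal sketch of how a lexer might notice an unterminated string literal; it is not GCC's implementation, and the name scan_string_literal is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Scans a string literal that starts at src[0] == '"'.
   Returns the number of characters consumed (including both quotes),
   or -1 if a newline or the end of the input is reached first. */
int scan_string_literal(const char *src) {
    assert(src != NULL && src[0] == '"');
    for (size_t i = 1; src[i] != '\0'; i++) {
        if (src[i] == '\\' && src[i + 1] != '\0') {
            i++;                  /* skip the escaped character, e.g. \" */
        } else if (src[i] == '\n') {
            return -1;            /* line ended before the closing quote */
        } else if (src[i] == '"') {
            return (int)(i + 1);  /* found the terminating quote */
        }
    }
    return -1;                    /* input ended before the closing quote */
}
```

A caller that gets -1 for the text starting at `"hello);` knows the literal is unterminated. Whether it then reports one diagnostic or two, and whether it labels them as warning or error, is a separate policy decision in the compiler.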

Your compiler probably issues errors when it detects the program is ill-formed, and describes the immediate reason the program failed to be well formed at the location it happened.
This is often useless, because the mistake could have been many lines away.
It also issues warnings that are guesses (often educated guesses) what actually caused your problem. Maybe you forgot a ; on a previous line, failed to close a { or a (. The warning is not "this token is the point of error", but rather "this is where it all went wrong".
In reality, the C++ standard itself does not distinguish between warnings and errors; they are both diagnostics. It mandates that some things cause diagnostics, and does not bar compilers from issuing additional diagnostics. Compilers are even free to compile ill-formed programs with a warning.
I would expect an error for "newline in string", then a warning pointing at the open quote.

Validating C code using CDT classes

I am writing a simple C editor. I have to validate code in order to highlight misspellings, missing semicolons, uses of non-existent functions/variables/methods, assignments in if conditions, and so on.
Parsing and validating C is a very complex problem so I have decided to use CDT. However, I have no idea how to do so.
I have only found information about the method org.eclipse.cdt.core.dom.ast.gnu.c.GCCLanguage.getASTTranslationUnit(...), but this does not help much, because it only finds basic syntax errors. (Am I right?)
I need a function which gets a C code or an object of the class IASTTranslationUnit. It has to return list of all problems (errors and warnings). How can I do that, using the CDT API?

Many categories of errors can be checked by resolving names found in the AST (calling IASTName.resolveBinding()), and seeing whether the resulting binding is an IProblemBinding.
See how CDT's ProblemBindingChecker does this to show many errors in the editor.
Note that this won't catch all errors; you can look at CDT's other checkers for ideas about how to catch other categories of errors (some of the checkers also produce warnings).
However, even all of CDT's checkers put together won't diagnose all of the errors and warnings that a compiler would. If that is a requirement, I would suggest using an actual compiler's internals, such as libclang.

Debug Xtext could not even do k=1 for decision errors

I'm trying to create an Xtext parser for a scripting language that I use. The language is quite close to ANSI-C.
I started by converting this https://github.com/antlr/examples-v3/blob/master/C/C/C.g grammar to Xtext and removing the parts I don't need (structs, typedefs, etc.)
However, I run into problems and I don't know how to debug them properly and find my errors.
I receive
error(10): internal error: org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1279): could not even do k=1 for decision 39; reason: timed out (>10000ms)
and also OutOfMemoryError exceptions.
EDIT: I have already tried increasing the memory & timeout. However, even with LARGE values, this does not work.
Can anyone suggest ways to "debug" the grammar? Where is decision 39? I would love to locate the problem, but I couldn't find anything.
PS: I've posted the grammar listing here, to not clutter the post up http://pastebin.com/8AYNUbSD

You can generate an ANTLR grammar (.g) by activating debug mode in your workflow.mwe2. Add the following fragment:
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Then you can debug this generated grammar using the ANTLRWorks IDE.
Quick tutorial here

Lexical and Semantic Errors in C

Recently I had to give examples for Lexical and Semantic Errors in C. I have provided the following examples.
I thought the following was lexical error,
int a#;
And for semantic error I gave the following example,
int a[10];
a=100;
But now I am a bit confused whether both are in fact syntax errors. Please provide me some idea about these errors?

First, the classification of errors (as lexical, syntactic, semantic, pragmatic) is somewhat arbitrary in its details.
If you define a lexical error as an error detected by the lexer, then a malformed number could be one, e.g. 12q4z5. Or a name containing a prohibited character such as $.
You could define a syntactic error as one detected at parsing time. However, C is not stricto sensu a context-free language, and its parsers keep contextual information (e.g. in symbol tables).
Since a, # and ; are all valid lexemes, your a#; is not a lexical error.
Actually # is mostly useful at preprocessing time, and what is parsed is actually the preprocessed form, not the user-given source code!
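Here is a deliberately simplified sketch of a purely lexical check, to show why int a#; passes the lexer. It only asks whether each character can belong to some token. The function name first_lexical_error is hypothetical, and rejecting '$', '@' and '`' is an assumption of this sketch (some compilers accept '$' in identifiers as an extension).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns the index of the first character that cannot be part of any token
   in this simplified lexer, or -1 if every character is acceptable. */
int first_lexical_error(const char *src) {
    /* Punctuation characters used by C tokens; '$', '@' and '`' are absent. */
    static const char punctuation[] = "#;()[]{}<>=+-*/%&|^~!?:,.\"'\\";

    assert(src != NULL);
    for (int i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        if (isalnum(c) || c == '_' || isspace(c)) continue;  /* names, numbers, blanks */
        if (strchr(punctuation, c) != NULL) continue;        /* known punctuation */
        return i;                                            /* no token can contain this */
    }
    return -1;
}
```

first_lexical_error("int a#;") returns -1 because int, a, # and ; are all acceptable lexemes, so the error has to be reported later, by the parser. first_lexical_error("int a@;") returns 5, the position of the '@'.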
Many semantic errors are related to the notion of undefined behavior, like printf("%d") which lacks an integer argument. Or out of bound access, in your case printf("%d\n", a[1234]);
Some (but not all) semantic or pragmatic errors can be found through static analysis tools. You could even say that modern compilers (when all warnings are enabled) do some static analysis.
Your example of a=100; is a typing error (which could arbitrarily be called syntactic, since the C compiler finds it at parsing time, or semantic, since it relates to types, which are not a context-free property). There are also more specialized static analysis tools such as Frama-C, and you could extend or customize the GCC compiler (e.g. with MELT, a domain-specific language) to add your own checks (as in, e.g., TALPO).
A pragmatic error could be declaring a huge local variable. When 1 TB of RAM becomes common, stacks will probably be many gigabytes, so such an example would run, but not today.
