Clarification regarding lexical errors in C

I have already read this question and this one. They are quite helpful, but I still have some doubts regarding token generation in the lexical analyzer for C.
What if the lexical analyzer detects int a2.5c;? Then, according to my understanding, 7 tokens will be generated.
int keyword
a identifier
2 constant
. special symbol
5 constant
c identifier
; special symbol
So the lexical analyzer will not report any error, and the tokens will be generated successfully.
Is my understanding correct? If not then can you please help me to understand?
Also, if we declare a constant as double a = 10.10.10;
Will it generate any lexical errors? Why?
UPDATE: Asking out of curiosity, what if the lexical analyzer detects a smiley like :-) in the program? Will it generate any lexical error? As per my understanding, : will be treated as a special symbol, - will be treated as an operator, and ) will again be treated as a special symbol.
Thank You

Your first list of tokens is almost correct -- a2 is a valid identifier.
It's true that the first example won't generate any "lexical" errors per se, although there will be a parse error at the ..
It's hard to say whether the error in your second example is a lexical error or a parse error. The lexical structure of a floating-point constant is pretty complicated. I can imagine a compiler that grabs a string of digits and . and e/E and doesn't notice until it calls the equivalent of strtod that there are two decimal points, meaning that it might report a "lexical error". Strictly speaking, though, what we have there is two floating-point constants in a row -- 10.10 and .10, meaning that it's more likely a "parse error".
In the end, though, these are all just errors. Unless you're taking a compiler design/construction class, I'm not sure how important it is to classify errors as lexical or otherwise.
Addressing your follow-on question, yes, :-) would lex as three tokens :, -, and ).
Because just about any punctuation character is legal in C, there are relatively few character sequences that are lexically illegal (that is, that would generate errors during the lexical analysis phase). In fact, the only ones I can think of are:
Illegal character (I think the only unused ones are @, $, and `)
various problems with character and string constants (missing ' or ", bad escape sequences, etc.)
Indeed, almost any string of punctuation you care to bang out will make it through a C lexical analyzer, although of course it may or may not parse. (A somewhat infamous example is a+++++b, which unfortunately lexes as a++ ++ + b and is therefore a syntax error.)
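To see that "two floating-point constants in a row" reading concretely, here is a small sketch using strtod, which, like a maximal-munch lexer, grabs the longest valid constant and stops at the second decimal point. It is only an illustration of the point above, not anyone's actual lexer:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *src = "10.10.10";
    char *end;
    double first = strtod(src, &end);    /* consumes "10.10" */
    double second = strtod(end, &end);   /* consumes ".10" */
    printf("%g then %g, leftover: \"%s\"\n", first, second, end);
    return 0;
}
It prints 10.1 then 0.1 with nothing left over, which is exactly the "two constants back to back" situation a parser would then reject.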

The C lexer I wrote tokenizes this as
keyid int
white " "
keyid a2
const .5
keyid c
punct ;
white "\n"
where keyid is a keyword or identifier, const is a numerical constant, and punct is a punctuator (white is white space).
I would not say there is a lexical error, but there is certainly a syntax error that must be diagnosed: an identifier followed by a numerical constant, which no grammar rule can reduce.
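To make that maximal-munch behaviour tangible, here is a deliberately tiny sketch (not the answerer's actual lexer) that produces the same token sequence for int a2.5c;. It only knows identifiers/keywords, simple numeric constants, and single-character punctuators, and it skips white space instead of reporting it:
#include <ctype.h>
#include <stdio.h>

int main(void) {
    const char *p = "int a2.5c;";
    while (*p) {
        const char *start = p;
        if (isspace((unsigned char)*p)) {                          /* skip white space */
            p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {      /* keyword or identifier */
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            printf("keyid %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p) ||
                   (*p == '.' && isdigit((unsigned char)p[1]))) {  /* numeric constant */
            while (isdigit((unsigned char)*p)) p++;
            if (*p == '.') { p++; while (isdigit((unsigned char)*p)) p++; }
            printf("const %.*s\n", (int)(p - start), start);
        } else {                                                   /* single-character punctuator */
            p++;
            printf("punct %.*s\n", (int)(p - start), start);
        }
    }
    return 0;
}
Its output is keyid int, keyid a2, const .5, keyid c, punct ;, matching the dump above; the syntax error only appears later, when a parser sees an identifier followed directly by a constant.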

Related

C lexer, understanding documentation, preprocessing tokens

My goal is to build a parser for a reasonable subset of C, and right now I'm at the start, implementing the lexer.
Answers to a similar question on the same topic pointed towards the International Standard for C (700 pages of documentation) and the Yacc grammar webpage.
I would welcome any help with understanding the documentation: Is it true that the following picture from the documentation represents grammar rules, where the notation C -> (A, B) means that all occurrences of AB in that order get replaced by C?
identifier -> identifier-nondigit | (identifier,identifier-nondigit) | (identifier,digit)
identifier-nondigit -> nondigit | universal-character-name | other
digit -> 0 | 1 | 2 | ... | 9
nondigit -> _ | a | b | ... | z | A | ... | Z
I think I am confused because the documentation introduces 'preprocessing tokens' which I thought would be just labels of sequences of characters in the source produced without backtracking.
I.e. something like:
"15647 \n \t abdsfg8rg \t" -> "DWLDLW"
// D .. digits, W ... whitespace, L ... letters
It seems like the lexer is doing the same thing as the parser (just building a tree). What is the reason for introducing the preprocessing tokens and tokens?
Does it mean that the processing should be done 'in two waves'?
I was expecting the lexer to just use some regular expressions and maybe a few rules. But it seems like the result of lexing is a sequence of trees that can have the roots keyword, identifier, constant, string-literal, punctuator.
Thank you for any clarifications.
I think I am confused because the documentation introduces
'preprocessing tokens' which I thought would be just labels of
sequences of characters in the source produced without backtracking.
Preprocessing tokens are the input to the C preprocessor. In the process of translating C source code to an executable, the stream of preprocessing tokens and intervening whitespace is subject to manipulation by the preprocessor first, then the resulting stream of preprocessing tokens and whitespace is munged a bit further before being converted to (the Standard's word; perhaps "reinterpreted as" better conveys the idea) a stream of tokens. An overview of all this is presented in section 5.1.1.2 of the language standard.
The conversion from preprocessing tokens to tokens is a pretty straightforward mapping:
identifier --> identifier or enumeration-constant (the choice is context-sensitive, but that can be worked around in practice by avoiding making the distinction until semantic analysis).
pp-number --> integer-constant or floating-constant, as appropriate (two of the alternatives for constant)
character-constant --> character-constant (one of the alternatives for constant)
string-literal --> string-literal
punctuator --> punctuator
anything else remaining after deletion of preprocessing directives --> one or more single-character tokens
Note that header-name preprocessing tokens are a special case: there are ambiguities between each form of header-name and other possible tokenizations of preprocessing tokens. It is probably best to avoid analyzing anything as a header-name except in the context of an #include directive, and then you also don't need to worry about converting header-name preprocessing tokens to regular tokens because none of them will survive deletion of the preprocessing directives.
Additional details of the lexical analysis are presented in section 6.4 of the Standard and its subsections.
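As a rough illustration of that mapping, here is how the phase-7 conversion might look in code. The type names and the spelled_as_floating helper are hypothetical (no real compiler is being quoted), and the identifier/enumeration-constant choice is deferred to semantic analysis as suggested above:
enum pp_kind  { PP_IDENTIFIER, PP_NUMBER, PP_CHAR_CONST,
                PP_STRING_LITERAL, PP_PUNCTUATOR, PP_OTHER };
enum tok_kind { TOK_IDENTIFIER, TOK_INTEGER_CONST, TOK_FLOATING_CONST,
                TOK_CHAR_CONST, TOK_STRING_LITERAL, TOK_PUNCTUATOR,
                TOK_SINGLE_CHAR };

/* Assumed helper: inspects the spelling for '.', exponent markers, etc. */
int spelled_as_floating(const char *spelling);

enum tok_kind to_token(enum pp_kind kind, const char *spelling) {
    switch (kind) {
    case PP_IDENTIFIER:     return TOK_IDENTIFIER;  /* maybe an enumeration constant; decided later */
    case PP_NUMBER:         return spelled_as_floating(spelling)
                                   ? TOK_FLOATING_CONST : TOK_INTEGER_CONST;
    case PP_CHAR_CONST:     return TOK_CHAR_CONST;
    case PP_STRING_LITERAL: return TOK_STRING_LITERAL;
    case PP_PUNCTUATOR:     return TOK_PUNCTUATOR;
    default:                return TOK_SINGLE_CHAR; /* one token per leftover character */
    }
}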
It seems like the lexer is doing the same thing as the parser (just building a tree).
I don't see how you draw that conclusion, but it is certainly true that the distinction between lexical analysis and parsing is artificial. It is not necessary to divide the language-analysis process that way, but it turns out often to be convenient, both for coding and for computation.
What is the reason for introducing the preprocessing tokens and tokens?
It is basically that standard C is really two languages in one: a preprocessing language and the C language itself. Early on, they were processed by altogether different programs, and preprocessing was optional. The preprocessor has a view of the units it operates with and upon that is not entirely consistent with the classifications of the C grammar. Preprocessing-tokens are the units of the preprocessor's grammatical analysis and data, whereas tokens are the units of C's grammatical analysis.
Does it mean that the processing should be done 'in two waves'?
Logically, yes. In practice, I think most compilers integrate the two logical passes into a single physical pass through the source.
If you pay attention, you'll notice that the tokens are described with a regular grammar, which means they could also be described with regular expressions. Why the editors of the standard preferred one formalism over the other is open to speculation; one could think that using a single formalism for both parts was considered simpler.
The rules for white space and comments hint that the separation of concerns between the parser and the lexer was present in the minds of the designers. You can't use the description as-is in a lexer-less parser.
Note that the preprocessor is the reason for the introduction of preprocessing-token. Things like header-name and pp-number have consequences for the behavior of the preprocessor. Note also that some tokens are recognized only in some contexts (notably <header> and "header", which is subtly different from "string").
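For instance, the identifier productions quoted in the question are regular, so they collapse into the familiar pattern [_A-Za-z][_A-Za-z0-9]* (ignoring universal-character-names). A tiny recognizer for that pattern, just to illustrate the equivalence of the two formalisms:
#include <ctype.h>

/* Returns 1 if s is a plain ASCII identifier, 0 otherwise. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                                 /* must start with a nondigit */
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                             /* the rest: nondigit or digit */
    return 1;
}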

Octal digit in ANSI C grammar (lex)

I looked at the ANSI C grammar (lex).
And this is the regex for octal constants:
0{D}+{IS}? { count(); return(CONSTANT); }
My question is: why do they accept something like 0898?
It's not a valid octal constant.
So I thought they would account for that, but they just wrote it like that.
Could you explain why that is? Thank you.
You want reasonable, user-friendly error messages.
If your lexer accepts 0999, you can detect an illegal octal digit and output a reasonable message:
int x = 0999;
^
error: illegal octal digit, go back to school
If it doesn't, it will parse this as two separate tokens 0 and 999 and pass them to the parser. The resulting error messages could be quite confusing.
int x = 0999;
^
error: expected ‘,’ or ‘;’ before numeric constant
The invalid program is rejected either way, as it should be; however, the ostensibly incorrect lex grammar does a better job with error reporting.
This demonstrates that practical grammars built for tools such as lex or yacc do not have to correspond exactly to ideal grammars found in language definitions.
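As a sketch of how the friendlier diagnostic can be produced, the action attached to the 0{D}+{IS}? rule can accept the lexeme and then validate the digits itself. The lex_error helper is an assumption here; yytext is the usual lex convention for the matched text:
#include <string.h>

void lex_error(const char *msg);   /* assumed error-reporting helper */

/* Called from the 0{D}+{IS}? action with the matched text. */
void check_octal_constant(const char *yytext) {
    const char *p;
    for (p = yytext + 1; *p && strchr("uUlL", *p) == NULL; p++)
        if (*p == '8' || *p == '9') {
            lex_error("illegal digit in octal constant");
            return;
        }
}
The token is still returned as CONSTANT either way, so the parser's job does not change; only the quality of the message does.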
Keep in mind that this is only syntax, not semantics.
So it is sufficient to detect "Cannot be anything but a constant.".
It is not necessary (yet) to detect "A correct octal constant.".
Note that it does not even make a difference between octal, decimal, hexadecimal. All of them register as "CONSTANT".
The grammar you repeatedly link to in your questions was produced in 1985, 4 years prior to the publication of the first C standard revision in 1989.
That is not the grammar that was published in the standard of 1989, which clearly uses
octal-constant:
0
octal-constant octal-digit
octal-digit: one of
0 1 2 3 4 5 6 7
Even then, that Lex grammar is sufficient for tokenizing a valid program.

Why should the controlled group in a conditional inclusion be lexically valid when the conditional is false?

The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
return 0;
}
GCC Live demo here.
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
return 0;
}
GCC Live demo here.
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor will generate preprocessing tokens, and having a " end up in the catch-all category non-white-space character that cannot be one of the above
is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. ....
For reference, the grammar of preprocessing-token is:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Of which the unmatched " in your second example matches non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error (godbolt session). As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* you look for the next */, and for // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
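A simplified sketch (emphatically not GCC's code) of that depth counting. directive_name is an assumed helper that returns the directive identifier following # on a line, or NULL for ordinary skipped text; recognizing it reliably, e.g. not inside a string literal, is exactly why the skipped text still has to be tokenized:
#include <string.h>

const char *directive_name(const char *line);   /* assumed helper, not shown */

/* Call with *depth == 1 after seeing the failed #if/#ifdef/#ifndef.
   Returns nonzero on the line where normal processing resumes. */
int skipped_group_ends(const char *line, int *depth) {
    const char *d = directive_name(line);
    if (d == NULL)
        return 0;                                     /* ordinary skipped text */
    if (!strcmp(d, "if") || !strcmp(d, "ifdef") || !strcmp(d, "ifndef"))
        ++*depth;                                     /* nested conditional begins */
    else if (!strcmp(d, "endif") && --*depth == 0)
        return 1;                                     /* matching #endif found */
    else if ((!strcmp(d, "else") || !strcmp(d, "elif")) && *depth == 1)
        return 1;                                     /* back to a live group */
    return 0;
}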
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is
undefined.
So your second code causes undefined behaviour at translation phase 3.

What are the lexical and syntactic analysis phases during compilation in a C compiler?

What are the lexical and syntactic analysis phases during the process of compiling? Does the preprocessing happen after lexical and syntactic analysis?
Consider this code:
int a = 10;
if (a < 4)
{
printf("%d", a);
}
In the Lexical Analysis phase: You identify each word/token and assign a meaning to it.
In the code above, you start by identifying that i followed by n followed by t and then a space is the word int, and that it is a language keyword; 1 followed by 0 and a space is the number 10, and so on.
In the Syntactic Analysis phase: You verify whether the code follows the language syntax (grammar rules). For example, you check whether there is only one variable on the LHS of an operator (considering the C language), that each statement is terminated by a ;, that if is followed by a conditional/Boolean expression, etc.
Like others have mentioned, usually, preprocessing happens before lexical analysis or syntactical analysis.
Lexical analysis happens BEFORE the syntactic analysis. This is logical: when a macro has to be expanded, the borders of an identifier must be identified first, and that is done with lexical analysis. After that, syntactic analysis kicks in. Note that compilers typically do not generate the full preprocessed source before starting the syntactic analysis. They read the source picking one lexeme at a time, do the preprocessing if needed, and feed the result to syntactic analysis.
In one case lexical analysis happens twice. This is the paste buffering. Look at the code:
#define En(x) Abcd ## x ## x
enum En(5)
{
a, b = 20, c, d
};
This code defines an enum with the name Abcd55. When the ## operators are processed during the macro expansion, the data is placed into an internal buffer. After that, this buffer is scanned much like a small #include. During the scanning, the compiler will break the contents of the buffer into lexemes. It may happen that the borders of the scanned lexemes do not match the borders of the original lexemes that were placed into the buffer. In the example above, 3 lexemes are placed into the buffer but only one is retrieved.
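If you want to convince yourself that the pasted result really is the single identifier Abcd55, a small test program along these lines works (the printf just shows that the enumerator values survive):
#include <stdio.h>

#define En(x) Abcd ## x ## x

enum En(5) { a, b = 20, c, d };   /* expands to: enum Abcd55 { a, b = 20, c, d }; */

int main(void) {
    enum Abcd55 e = b;            /* the pasted tag is usable as one identifier */
    printf("%d\n", e);            /* prints 20 */
    return 0;
}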
Preprocessing happens before the lexical analysis, IIRC.
Comments get filtered out, #define is handled, etc., and after that a compiler generates tokens with a scanner/lexer (lexical analysis). After that, the compiler generates parse trees, which are used for the syntactic analysis.
There are exceptions, but it usually breaks out like this:
Preprocess - transform program text to program text
Lexical analysis - transform program text to "tokens", which are essentially small integers with attributes attached
Syntactic analysis - transform program text to abstract syntax
The definition of "abstract syntax" can vary. In one-pass compilers, abstract syntax amounts to tartget code. But theses days it's usually a tree or DAG that logically represents the structure of the program.
When we are talking about the C programming language, we should note that there is an ISO (ANSI) standard for the language. Here is the last public draft of C99 (ISO/IEC 9899:1999): www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
There is a section "5.1.1.2 Translation phases" which says how a C program should be parsed. The stages are:
... some steps for multi-byte, trigraph and backslash processing...
3). The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
This is lexical analysis for preprocessing. Only preprocessor directives, punctuation, string constants, identifiers, comments are lexed here.
4). Preprocessing directives are executed, macro invocations are expanded
This is preprocessing itself. This phase will also include files from #include, and then it will delete preprocessing directives (like #define, #ifdef, and others).
... processing of string literals...
7). White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
Conversion to token means language keyword detection and constants detection.
This is the step of final lexical analysis; syntactic and semantic analyses.
So, your question was:
Does the preprocessing happens after lexical and syntactic analysis ?
Some lexical analysis is needed to do preprocessing, so order is:
lexical_for_preprocessor, preprocessing, true_lexical, other_analysis.
PS: A real C compiler may be organized in a slightly different way, but it must behave in the same way as described in the standard.

Occurrences of question mark in C code

I am writing a simple program that should count the occurrences of the ternary operator ?: in C source code, and I am trying to simplify it as much as possible. So I've filtered these things out of the source code:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurrences of question marks.
So my question is: Is there any other symbol, operator, or anything else that could cause a problem, i.e. that could contain '?'?
Let's suppose that the source is syntactically valid.
I think you found all places where a question mark is introduced and therefore eliminated all possible false positives (for the ternary op). But maybe you eliminated too much: maybe you want to count those "?:"s that get introduced by macros; you don't count those. Is that what you intend? If so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
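A minimal sketch of such a counting pass over already-preprocessed source (for example the output of gcc -E file.c). String literals, character constants and comments are skipped with deliberately simplified escape handling, so treat it as an outline rather than a finished tool:
#include <stdio.h>

int count_question_marks(FILE *f) {
    int c, count = 0;
    while ((c = getc(f)) != EOF) {
        if (c == '?') {
            count++;                                  /* only ?: can put '?' here */
        } else if (c == '"' || c == '\'') {           /* string literal or character constant */
            int quote = c;
            while ((c = getc(f)) != EOF && c != quote)
                if (c == '\\') getc(f);               /* skip the escaped character */
        } else if (c == '/') {
            int d = getc(f);
            if (d == '/') {                           /* line comment */
                while ((c = getc(f)) != EOF && c != '\n') ;
            } else if (d == '*') {                    /* block comment */
                int prev = 0;
                while ((c = getc(f)) != EOF && !(prev == '*' && c == '/'))
                    prev = c;
            } else if (d != EOF) {
                ungetc(d, f);                         /* just a division sign */
            }
        }
    }
    return count;
}

int main(void) {
    printf("%d\n", count_question_marks(stdin));      /* usage: gcc -E file.c | ./count */
    return 0;
}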
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program; I'm trying to say you already went further than what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I were to compile this C file with -D somemacro="15 // this is a comment"?
