My goal is to build a parser for a reasonable subset of C, and right now I'm at the start, implementing the lexer.
Answers to a similar question on the same topic pointed towards the International Standard for C (700 pages of documentation) and the Yacc grammar webpage.
I would welcome any help with understanding the documentation: Is it true that the following picture from the documentation represents grammar rules, where the notation C -> (A, B) means that all occurrences of AB in that order get replaced by C?
identifier -> identifier-nondigit | (identifier, identifier-nondigit) | (identifier, digit)
identifier-nondigit -> nondigit | universal-character-name | other implementation-defined characters
digit -> 0 | 1 | 2 | ... | 9
nondigit -> _ | a | b | ... | z | A | ... | Z
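(If I read these rules correctly, in a hand-written lexer they would boil down to something like the sketch below; universal-character-names and the "other" implementation-defined characters are left out.)

#include <stddef.h>
#include <ctype.h>

/* Hypothetical helper (not from the standard): recognize an identifier
   starting at s, i.e. a nondigit followed by any mix of nondigits and
   digits.  Returns the number of characters consumed, 0 if none. */
static size_t scan_identifier(const char *s)
{
    size_t n = 0;
    if (s[0] == '_' || isalpha((unsigned char)s[0])) {
        n = 1;
        while (s[n] == '_' || isalnum((unsigned char)s[n]))
            n++;
    }
    return n;
}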
I think I am confused because the documentation introduces 'preprocessing tokens' which I thought would be just labels of sequences of characters in the source produced without backtracking.
I.e. something like:
"15647 \n \t abdsfg8rg \t" -> "DWLDLW"
// D .. digits, W ... whitespace, L ... letters
It seems like the lexer is doing the same thing as the parser (just building a tree). What is the reason for introducing the preprocessing tokens and tokens?
Does it mean that the processing should be done 'in two waves'?
I was expecting the lexer to just use some regular expressions and maybe a few rules. But it seems like the result of lexing is a sequence of trees that can have the roots keyword, identifier, constant, string-literal, punctuator.
Thank you for any clarifications.
I think I am confused because the documentation introduces
'preprocessing tokens' which I thought would be just labels of
sequences of characters in the source produced without backtracking.
Preprocessing tokens are the input to the C preprocessor. In the process of translating C source code to an executable, the stream of preprocessing tokens and intervening whitespace is subject to manipulation by the preprocessor first, then the resulting stream of preprocessing tokens and whitespace is munged a bit further before being converted to (the Standard's word; perhaps "reinterpreted as" better conveys the idea) a stream of tokens. An overview of all this is presented in section 5.1.1.2 of the language standard.
The conversion from preprocessing tokens to tokens is a pretty straightforward mapping (a rough code sketch follows the list):
identifier --> identifier or enumeration-constant (the choice is context-sensitive, but that can be worked around in practice by avoiding making the distinction until semantic analysis).
pp-number --> integer-constant or floating-constant, as appropriate (two of the alternatives for constant)
character-constant --> character-constant (one of the alternatives for constant)
string-literal --> string-literal
punctuator --> punctuator
anything else remaining after deletion of preprocessing directives --> one or more single-character tokens
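Here is a rough sketch of that mapping in code. Everything in it (the enum names, is_keyword, looks_floating) is invented for illustration, and the keyword check, which the Standard folds into the conversion of identifier pp-tokens, is shown explicitly:

#include <string.h>
#include <stdbool.h>

/* All names below are invented for illustration. */
enum pp_kind  { PP_IDENT, PP_NUMBER, PP_CHAR_CONST, PP_STRING, PP_PUNCT, PP_OTHER };
enum tok_kind { TOK_KEYWORD, TOK_IDENT, TOK_INT_CONST, TOK_FLOAT_CONST,
                TOK_CHAR_CONST, TOK_STRING, TOK_PUNCT, TOK_ERROR };

static bool is_keyword(const char *s)       /* placeholder keyword table */
{
    static const char *kw[] = { "int", "return", "if", "while" };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strcmp(s, kw[i]) == 0)
            return true;
    return false;
}

static bool looks_floating(const char *s)   /* crude; misclassifies hex constants such as 0xE1 */
{
    return strpbrk(s, ".eEpP") != NULL;
}

enum tok_kind pp_to_token(enum pp_kind k, const char *spelling)
{
    switch (k) {
    case PP_IDENT:      return is_keyword(spelling) ? TOK_KEYWORD : TOK_IDENT;
    case PP_NUMBER:     return looks_floating(spelling) ? TOK_FLOAT_CONST : TOK_INT_CONST;
    case PP_CHAR_CONST: return TOK_CHAR_CONST;
    case PP_STRING:     return TOK_STRING;
    case PP_PUNCT:      return TOK_PUNCT;
    default:            return TOK_ERROR;   /* a stray pp-token that survives this far
                                               is a constraint violation */
    }
}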
Note that header-name preprocessing tokens are a special case: there are ambiguities between each form of header-name and other possible tokenizations of preprocessing tokens. It is probably best to avoid analyzing anything as a header-name except in the context of an #include directive, and then you also don't need to worry about converting header-name preprocessing tokens to regular tokens because none of them will survive deletion of the preprocessing directives.
Additional details of the lexical analysis are presented in section 6.4 of the Standard and its subsections.
It seems like the lexer is doing the same thing as the parser (just building a tree).
I don't see how you draw that conclusion, but it is certainly true that the distinction between lexical analysis and parsing is artificial. It is not necessary to divide the language-analysis process that way, but it turns out often to be convenient, both for coding and for computation.
What is the reason for introducing the preprocessing tokens and tokens?
It is basically that standard C is really two languages in one: a preprocessing language and the C language itself. Early on, they were processed by altogether different programs, and preprocessing was optional. The preprocessor has a view of the units it operates with and upon that is not entirely consistent with the classifications of C grammar. Preprocessing-tokens are the units of the preprocessor's grammatic analysis and data, whereas tokens are the units of C's grammatic analysis.
Does it mean that the processing should be done 'in two waves'?
Logically, yes. In practice, I think most compilers integrate the two logical passes into a single physical pass through the source.
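As a toy illustration of that single physical pass (nothing here resembles a real preprocessor; the source text, the "directive handling" that merely discards # lines, and the token syntax are all made up), preprocessing and tokenizing can be interleaved like this:

#include <stdio.h>
#include <ctype.h>

/* Toy single-pass sketch: the "preprocessing" discards lines starting with
   '#', and the "lexer" only knows identifiers/numbers and single-character
   punctuators, but the two logical phases run interleaved in one pass. */
static const char *src =
    "#define N 4\n"
    "int a = 10;\n";

static const char *p;

static int next_token(char *buf, size_t bufsz)
{
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (*p == '#') {                 /* "preprocessing": drop the directive line */
            while (*p && *p != '\n')
                p++;
            continue;
        }
        break;
    }
    if (*p == '\0')
        return 0;

    size_t n = 0;
    if (isalnum((unsigned char)*p) || *p == '_') {
        while ((isalnum((unsigned char)*p) || *p == '_') && n + 1 < bufsz)
            buf[n++] = *p++;
    } else {
        buf[n++] = *p++;                 /* single-character punctuator */
    }
    buf[n] = '\0';
    return 1;
}

int main(void)
{
    char tok[64];
    p = src;
    while (next_token(tok, sizeof tok))
        printf("token: %s\n", tok);      /* prints: int, a, =, 10, ;  */
    return 0;
}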
If you pay attention, you'll notice that the tokens are described with a regular grammar, which means that they could also be described with regular expressions. Why the editors of the standard preferred one formalism over the other is open to speculation; perhaps using a single formalism for both parts was considered simpler.
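For instance, here is a sketch using POSIX <regex.h> (an assumption about the environment, not something the C standard mandates) that matches the identifier rule quoted in the question:

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* identifier = nondigit (nondigit | digit)*  as an extended regex */
    if (regcomp(&re, "^[_A-Za-z][_A-Za-z0-9]*", REG_EXTENDED) != 0)
        return 1;

    const char *input = "abdsfg8rg rest";
    regmatch_t m;
    if (regexec(&re, input, 1, &m, 0) == 0)
        printf("identifier of length %d\n", (int)(m.rm_eo - m.rm_so));  /* 9 */

    regfree(&re);
    return 0;
}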
The rules for white space and comments hint that the separation of concerns between the parser and the lexer was present in the designers' minds. You can't use the description as is in a lexer-less (scannerless) parser.
Note that the preprocessor is the reason for the introduction of preprocessing-token. Things like header-name and pp-number have consequences for the behavior of the preprocessor. Note also that some tokens are recognized only in some contexts (notably <header> and "header", which is subtly different from "string").
Related
I am writing my own C-preprocessor based on GCC. So far it is nearly identical, but what I consider redundant is to perform any form of checking on the tokens being concatenated by virtue of ##.
So in my preprocessor manual, I've written this:
3.5 Concatenation
...
GCC forbids concatenation of two mutually incompatible preprocessing tokens such as "x" and "+" (in any order). That would result in the following error: "pasting "x" and "+" does not give a valid preprocessing token". However, this isn't true for this preprocessor - concatenation may occur between any two tokens.
My reasoning is simple: if it expands to invalid code, then the compiler will produce an error anyway, so I don't have to explicitly handle such cases, which would make the preprocessor slower and increase its code complexity. If it results in valid code, then removing this restriction just makes the preprocessor more flexible (although probably only in rare cases).
So I would like to ask, why does this error actually happen, why is this restriction actually applied and is it a true crime if I dismiss it in my preprocessor?
As far as ISO C goes, if ## creates an invalid token, the behavior is undefined. But there is a somewhat strange situation there, in the following way. The output of the preprocessing translation phases of C is a stream of preprocessing tokens (pp-tokens). These are converted to tokens, and then syntactically and semantically analyzed. Now here is an important rule: it is a constraint violation if a pp-token doesn't have a form which lets it be converted to a token. Thus, a preprocessor token which is garbage that you write yourself without help from the ## operator must be diagnosed for bad lexical syntax. But if you use ## to create a bad preprocessing token, the behavior is undefined.
Note the subtlety there: the behavior of ## is undefined if it is used to create a bad preprocessing token. It's not the case that the pasting is well-defined, and then caught at the stage where pp-tokens are converted to tokens: it's undefined right from that point where ## is evaluated.
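A minimal illustration (the macro name PASTE is made up here):

#define PASTE(a, b) a ## b

int PASTE(foo, 1) = 0;   /* well-defined: ## forms the single identifier foo1      */
int i = PASTE(1, +) 2;   /* "1" ## "+" does not form one valid pp-token, so the    */
                         /* paste itself is undefined behavior; GCC diagnoses it,  */
                         /* whereas blind juxtaposition would happen to expand to  */
                         /* the perfectly valid text  int i = 1+ 2;                */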
Basically, this is historical. C preprocessors were historically (and probably some still are) separate programs, with lexical analysis that was different from, and looser than, that of the downstream compiler. The C standard tried to capture that somehow in terms of a single language with translation phases, and the result has some quirks and areas of perhaps surprising under-specification. (For instance, in the preprocessing translation phases, a number token ("pp-number") has a strange lexical grammar which allows gibberish, such as tokens with multiple floating-point E exponents.)
Now, back to your situation. Your textual C preprocessor does not actually output pp-token objects; it outputs another text stream. You may have pp-token objects internally, but they get flattened on output. Thus, you might think, why not allow your ## operator to just blindly glue together any two tokens? The net effect is as if those tokens were dumped into the output stream without any intervening whitespace. (And this is probably all it was, in early preprocessors which supported ##, and ran as separate programs).
Unfortunately what that means is that your ## operator is not purely a bona fide token pasting operator; it's just a blind juxtaposing operator which sometimes produces one token, when it happens to juxtapose two tokens that will be lexically analyzed as one by the downstream compiler. If you go that way, it may be best to be honest and document it as such, rather than portraying it as a flexibility feature.
A good reason, on the other hand, to reject bad preprocessing tokens in the ## operator is to catch situations in which it cannot achieve its documented job description: the requirement of making one token out of two. The idea is that the programmer knows the language spec (the contract between the programmer and the implementation) and knows that ## is supposed to make one token, and relies on that. For such a programmer, any situation involving a bad token paste is a mistake, and that programmer is best supported by diagnosis.
The maintainers of GCC and the GNU CPP preprocessor probably took this view: that the preprocessor isn't a flexible text munging tool, but part of a toolchain supporting disciplined C programming.
Moreover, the undefined behavior of a bad token paste job is easily diagnosed, so why not diagnose it? The lack of a diagnosis requirement in this area in the standard looks like just a historic concession. It is a kind of "low-hanging fruit" of diagnosis. Let those undefined behaviors go undiagnosed for which diagnosis is difficult or intractable, or requires run-time penalties.
The following program compiles:
// #define WILL_COMPILE
#ifdef WILL_COMPILE
int i =
#endif
int main()
{
    return 0;
}
But the following will issue a warning:
//#define WILL_NOT_COMPILE
#ifdef WILL_NOT_COMPILE
char* s = "failure
#endif
int main()
{
    return 0;
}
I understand that in the first example, the controlled group is removed by the time the compilation phase of the translation is reached. So it compiles without errors or warnings.
But why is lexical validity required in the second example when the controlled group is not going to be included?
Searching online I found this quote:
Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended.
But this does not state why the lexical validity is checked when the conditional fails.
Have I missed something here?
In translation phase 3 the preprocessor will generate preprocessing tokens, and having a " end up in the catch-all category each non-white-space character that cannot be one of the above is undefined behavior.
See C11 6.4 Lexical elements p3:
A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. ....
For reference the preprocessing-token are:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The unmatched " in your second example falls into that last category, each non-white-space character that cannot be one of the above.
Since this is undefined behavior and not a constraint violation, the compiler is not obliged to diagnose it, but it is certainly allowed to, and with -pedantic-errors it even becomes an error. As rici points out, it only becomes a constraint violation if the token survives preprocessing.
The gcc document you cite basically says the same thing:
... Even if a conditional fails, the controlled text inside it is still run through initial transformations and tokenization. Therefore, it must all be lexically valid C. Normally the only way this matters is that all comments and string literals inside a failing conditional group must still be properly ended. ...
"Why is [something about C] the way it is?" questions can't usually be answered, because none of the people who wrote the 1989 C standard are here to answer questions [as far as I know, anyway] and if they were here, it was nearly thirty years ago and they probably don't remember.
However, I can think of a plausible reason why the contents of skipped conditional groups are required to consist of a valid sequence of preprocessing tokens. Observe that comments are not required to consist of a valid sequence of preprocessing tokens:
/* this comment's perfectly fine even though it has an unclosed
character literal inside */
Observe also that it is really simple to scan for the end of a comment: for /* you look for the next */, and for // you look for the end of the line. The only complication is that trigraphs and backslash-newline are supposed to be converted first. Tokenizing the contents of comments would be extra code to no useful purpose.
By contrast, it is not simple to scan for the end of a skipped conditional group, because conditional groups nest. You have to be looking for #if, #ifdef, and #ifndef as well as #else and #endif, and counting your depth. And all of those directives are lexically defined in terms of preprocessor tokens, because that's the most natural way to look for them when you're not in a skipped conditional group. Requiring skipped conditional groups to be tokenizable allows the preprocessor to use the same code to process directives within skipped conditional groups as it does elsewhere.
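Here is a deliberately simplified, line-based sketch of that depth counting (a real preprocessor works on preprocessing tokens rather than raw lines, and also has to handle #else and #elif; this is only meant to show why the skipped text still gets scanned):

#include <stdio.h>
#include <string.h>

/* Returns nonzero if line is a preprocessing directive whose name starts
   with `name` (so "if" also matches #ifdef and #ifndef). */
static int is_directive(const char *line, const char *name)
{
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line++ != '#')
        return 0;
    while (*line == ' ' || *line == '\t')
        line++;
    return strncmp(line, name, strlen(name)) == 0;
}

/* Consume lines[i..n-1] until the #endif matching the conditional that
   opened the skipped group; returns the index of the first line after it. */
static size_t skip_group(const char **lines, size_t i, size_t n)
{
    int depth = 1;
    for (; i < n && depth > 0; i++) {
        if (is_directive(lines[i], "if"))
            depth++;
        else if (is_directive(lines[i], "endif"))
            depth--;
    }
    return i;
}

int main(void)
{
    const char *lines[] = { "  #ifdef INNER", "  int x;", "  #endif", "#endif", "int y;" };
    size_t next = skip_group(lines, 0, 5);
    printf("skipping resumes at line %zu: %s\n", next, lines[next]);
    return 0;
}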
By default, GCC issues only a warning when it encounters an un-tokenizable line inside a skipped conditional group, an error elsewhere:
#if 0
"foo
#endif
"bar
gives me
test.c:2:1: warning: missing terminating " character
"foo
^
test.c:4:1: error: missing terminating " character
"bar
^~~~
This is an intentional leniency, possibly one I introduced myself (it's only been twenty years since I wrote a third of GCC's current preprocessor, but I have still forgotten a lot of the details). You see, the original C preprocessor, the one K and R wrote, did allow arbitrary nonsense inside skipped conditional groups, because it wasn't built around the concept of tokens in the first place; it transformed text into other text. So people would put comments between #if 0 and #endif instead of /* and */, and naturally enough those comments would sometimes contain apostrophes. So, when Per Bothner and Neil Booth and Chiaki Ishikawa and I replaced GCC's original "C-Compatible Compiler Preprocessor"1 with the integrated, fully standards-compliant "cpplib", circa GCC 3.0, we felt we needed to cut a little compatibility slack here.
1 Raise your hand if you're old enough to know why RMS thought this name was funny.
The description of Translation phase 3 (C11 5.1.1.2/3), which happens before preprocessing directives are actioned:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments).
And the grammar for preprocessing-token is:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
Note in particular that a string-literal is a single preprocessing-token. The subsequent description (C11 6.4/3) clarifies that:
If a ' or a " character matches the last category, the behavior is undefined.
So your second code causes undefined behaviour at translation phase 3.
I have already read this and this question. They are quite helpful, but I still have some doubts regarding token generation in a lexical analyzer for C.
What if the lexical analyzer detects int a2.5c;? Then, according to my understanding, 7 tokens will be generated:
int keyword
a identifier
2 constant
. special symbol
5 constant
c identifier
; special symbol
So the lexical analyzer will not report any error and the tokens will be generated successfully.
Is my understanding correct? If not then can you please help me to understand?
Also, if we declare a constant as double a = 10.10.10;
Will it generate any lexical errors? Why?
UPDATE: Asking out of curiosity, what if the lexical analyzer detects a :-) smiley kind of thing in a program? Will it generate any lexical error? Because as per my understanding, : will be treated as a special symbol, - will be treated as an operator, and ) will again be treated as a special symbol.
Thank You
Your first list of tokens is almost correct -- a2 is a valid identifier.
It's true that the first example won't generate any "lexical" errors per se, although there will be a parse error at the ..
It's hard to say whether the error in your second example is a lexical error or a parse error. The lexical structure of a floating-point constant is pretty complicated. I can imagine a compiler that grabs a string of digits and . and e/E and doesn't notice until it calls the equivalent of strtod that there are two decimal points, meaning that it might report a "lexical error". Strictly speaking, though, what we have there is two floating-point constants in a row -- 10.10 and .10, meaning that it's more likely a "parse error".
In the end, though, these are all just errors. Unless you're taking a compiler design/construction class, I'm not sure how important it is to classify errors as lexical or otherwise.
Addressing your follow-on question, yes, :-) would lex as three tokens :, -, and ).
Because just about any punctuation character is legal in C, there are relatively few character sequences that are lexically illegal (that is, that would generate errors during the lexical analysis phase). In fact, the only ones I can think of are:
Illegal character (I think the only unused ones are @ and `)
various problems with character and string constants (missing ' or ", bad escape sequences, etc.)
Indeed, almost any string of punctuation you care to bang out will make it through a C lexical analyzer, although of course it may or may not parse. (A somewhat infamous example is a+++++b, which unfortunately lexes as a++ ++ + b and is therefore a syntax error.)
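Spelled out as a small program (the "spaced" version is added here only for contrast):

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c;

    /* c = a+++++b; */     /* maximal munch lexes this as  a ++ ++ + b, and since
                              the result of a++ is not an lvalue, (a++)++ is a
                              constraint violation: a syntax error, not a lexical one */
    c = a++ + ++b;         /* the spelling the programmer presumably meant */
    printf("%d\n", c);     /* prints 4 (1 + 3) */
    return 0;
}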
The C lexer I wrote tokenizes this as
keyid int
white " "
keyid a2
const .5
keyid c
punct ;
white "\n"
Where keyid is keyword or identifier, const is a numerical constant, and punct is a punctuator (white is white space).
I would not say there is a lexical error, but there is certainly a syntax error that must be diagnosed, because an identifier followed by a numerical constant cannot be reduced by any grammar rule.
I'm looking at implementing a C preprocessor in two phases, where the first phase converts the source file into an array of preprocessing tokens. This would be good for simplicity and performance, as the work of tokenizing would not need to be redone when a header file is included by multiple files in a project.
The snag:
#include <stdio.h>

#define f(x) #x

int main(void)
{
    puts(f(a+b));
    puts(f(a + b));
}
According to the standard, the output should be:
a+b
a + b
i.e. the information about whether constituent tokens were separated by whitespace is supposed to be preserved. This would require the two-phase design to be scrapped.
The uses of the # operator that I've seen so far don't actually need this, e.g. assert would still work fine if the output were always a + b regardless of whether the constituent tokens were separated by whitespace in the source file.
Is there any existing code anywhere that does depend on the exact behavior prescribed by the standard for this operator?
You might want to look at the preprocessor of the LCC compiler, written as an example ANSI C compiler for compiler courses. Another preprocessor is MCPP.
C/C++ preprocessing is quite tricky; if you stick with it, make sure to get at least drafts of the relevant standards, and pilfer test suites from somewhere.
What are lexical and syntactic analysis during the process of compiling? Does preprocessing happen after lexical and syntactic analysis?
Consider this code:
int a = 10;
if (a < 4)
{
    printf("%d", a);
}
In the Lexical Analysis phase: You identify each word/token and assign a meaning to it.
In the code above, you start by identifying that i followed by n followed by t and then a space is the word int, and that it is a language keyword; 1 followed by 0 and a space is the number 10, and so on.
In the Syntactic Analysis phase: You verify whether the code follows the language syntax (grammar rules). For example, you check whether there is only one variable on the LHS of an operator (considering the C language), that each statement is terminated by a ;, that if is followed by a conditional/Boolean expression, etc.
As others have mentioned, preprocessing usually happens before lexical and syntactic analysis.
Lexical analysis happens BEFORE the syntactic analysis. This is logical, because when a macro has to be expanded it is necessary to identify the boundaries of an identifier first, and that is done with lexical analysis. After that, syntactic analysis kicks in. Note that compilers typically do not generate the full preprocessed source before starting the syntactic analysis. They read the source picking one lexeme at a time, do the preprocessing if needed, and feed the result to syntactic analysis.
In one case lexical analysis happens twice. This is the paste buffering. Look at the code:
#define En(x) Abcd ## x ## x
enum En(5)
{
    a, b = 20, c, d
};
This code defines an enum with the name Abcd55. When the ## operators are processed during macro expansion, the data is placed into an internal buffer. After that, this buffer is scanned much like a small #include. During the scanning, the compiler will break the contents of the buffer into lexemes. It may happen that the boundaries of the scanned lexemes do not match the boundaries of the original lexemes that were placed into the buffer. In the example above, 3 lexemes are placed into the buffer but only one is retrieved.
Preprocessing happens before the lexical analysis, IIRC.
Comments get filtered out, #defines are expanded, and so on; after that, the compiler generates tokens with a scanner/lexer (lexical analysis). After that, the compiler generates parse trees, which are used for the syntactic analysis.
There are exceptions, but it usually breaks out like this:
Preprocess - transform program text to program text
Lexical analysis - transform program text to "tokens", which are essentially small integers with attributes attached
Syntactic analysis - transform the token stream to abstract syntax
The definition of "abstract syntax" can vary. In one-pass compilers, abstract syntax amounts to target code. But these days it's usually a tree or DAG that logically represents the structure of the program.
When we are talking about the C programming language, we should note that there is an ISO (ANSI) standard for the language. Here is the last public draft of C99 (ISO/IEC 9899:1999): www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
There is a section "5.1.1.2 Translation phases" which says how a C program should be parsed. The stages are:
... some steps for multi-byte, trigraph and backslash processing...
3). The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments).
This is lexical analysis for preprocessing. Only preprocessor directives, punctuation, string constants, identifiers, and comments are lexed here.
4). Preprocessing directives are executed, macro invocations are expanded
This is preprocessing itself. This phase will also include the files named by #include, and then it will delete the preprocessing directives (like #define, #ifdef, and others).
... processing of string literals...
7). White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
Conversion to tokens means detecting language keywords and constants.
This is the step of final lexical analysis, followed by the syntactic and semantic analyses.
So, your question was:
Does the preprocessing happens after lexical and syntactic analysis ?
Some lexical analysis is needed to do preprocessing, so order is:
lexical_for_preprocessor, preprocessing, true_lexical, other_analysis.
PS: A real C compiler may be organized in a slightly different way, but it must behave in the same way as described in the standard.