Why does the preprocessor forbid macro pasting of weird tokens?

I am writing my own C preprocessor based on GCC's. So far it is nearly identical, but one thing I consider redundant is performing any form of checking on the tokens being concatenated by the ## operator.
So in my preprocessor manual, I've written this:
3.5 Concatenation
...
GCC forbids concatenation with two mutually incompatible preprocessing
tokens such as "x" and "+" (in any order). That would result in the
following error: "pasting "x" and "+" does not give a valid
preprocessing token" However this isn't true for this preprocessor - concatenation
may occur between any token.
My reasoning is simple: if it expands to invalid code, then the compiler will produce an error anyway, so I don't have to handle such cases explicitly, which would make the preprocessor slower and more complex. If it results in valid code, then removing this restriction just makes the preprocessor more flexible (although probably only in rare cases).
So I would like to ask: why does this error actually happen, why is this restriction applied, and is it a true crime if I dismiss it in my preprocessor?

As far as ISO C goes, if ## creates an invalid token, the behavior is undefined. But there is a somewhat strange situation there, in the following way. The output of the preprocessing translation phases of C is a stream of preprocessing tokens (pp-tokens). These are converted to tokens, and then syntactically and semantically analyzed. Now here is an important rule: it is a constraint violation if a pp-token doesn't have a form which lets it be converted to a token. Thus, a preprocessor token which is garbage that you write yourself without help from the ## operator must be diagnosed for bad lexical syntax. But if you use ## to create a bad preprocessing token, the behavior is undefined.
Note the subtlety there: the behavior of ## is undefined if it is used to create a bad preprocessing token. It's not the case that the pasting is well-defined, and then caught at the stage where pp-tokens are converted to tokens: it's undefined right from that point where ## is evaluated.
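To illustrate, here is a minimal sketch (CAT is just a name chosen here):

#define CAT(a, b) a ## b

int CAT(foo, 1);   /* OK: the paste yields the single identifier token foo1 */
int CAT(x, +);     /* undefined per ISO C: "x+" is not a single preprocessing token;
                      GCC diagnoses it: pasting "x" and "+" does not give a valid
                      preprocessing token */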
Basically, this is historical. C preprocessors were historically (and probably some still are) separate programs, with lexical analysis that was different from and looser than that of the downstream compiler. The C standard tried to capture that somehow in terms of a single language with translation phases, and the result has some quirks and areas of perhaps surprising under-specification. (For instance, in the preprocessing translation phases, a number token ("pp-number") has a strange lexical grammar which allows gibberish, such as tokens with multiple floating-point E exponents.)
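For instance, 1e+2e+3 matches the pp-number grammar and is therefore one preprocessing token, even though it can never become a valid numeric constant. Stringizing it shows the preprocessor treating it as a single unit (a small sketch; STR is the usual two-level stringize idiom):

#define STR_(x) #x
#define STR(x)  STR_(x)
const char *s = STR(1e+2e+3);   /* expands to "1e+2e+3" with no complaint from the preprocessor */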
Now, back to your situation. Your textual C preprocessor does not actually output pp-token objects; it outputs another text stream. You may have pp-token objects internally, but they get flattened on output. Thus, you might think, why not allow your ## operator to just blindly glue together any two tokens? The net effect is as if those tokens were dumped into the output stream without any intervening whitespace. (And this is probably all it was in early preprocessors which supported ## and ran as separate programs.)
Unfortunately what that means is that your ## operator is not purely a bona fide token pasting operator; it's just a blind juxtaposing operator which sometimes produces one token, when it happens to juxtapose two tokens that will be lexically analyzed as one by the downstream compiler. If you go that way, it may be best to be honest and document it as such, rather than portraying it as a flexibility feature.
A good reason, on the other hand, to reject bad preprocessing tokens in the ## operator is to catch situations in which it cannot achieve its documented job description: the requirement of making one token out of two. The idea is that the programmer knows the language spec (the contract between the programmer and the implementation) and knows that ## is supposed to make one token, and relies on that. For such a programmer, any situation involving a bad token paste is a mistake, and that programmer is best supported by diagnosis.
The maintainers of GCC and the GNU CPP preprocessor probably took this view: that the preprocessor isn't a flexible text munging tool, but part of a toolchain supporting disciplined C programming.
Moreover, the undefined behavior of a bad token paste job is easily diagnosed, so why not diagnose it? The lack of a diagnosis requirement in this area in the standard looks like just a historic concession. It is a kind of "low-hanging fruit" of diagnosis. Let those undefined behaviors go undiagnosed for which diagnosis is difficult or intractable, or requires run-time penalties.

Related

Why is the C preprocessor a subject of undefined behavior?

I can understand that:
One of the origins of UB is performance: it lets the compiler remove code that can never execute, such as if (i + 1 < i) { /* never executed */ } (UPD: assuming i is a signed integer).
UB can be triggered at compile time because C does not clearly distinguish between compile time and run time. The whole language is based on the (rather unhelpful) concept of an "abstract machine" (link).
However, I cannot yet understand why the C preprocessor is subject to undefined behavior. It is known that preprocessing directives are executed at compile time.
Consider C11, 6.10.3.3 The ## operator, 3:
If the result is not a valid preprocessing token, the behavior is undefined.
Why not make it a constraint? For example:
The result shall be a valid preprocessing token.
The same question goes for all the other occurrences of "the behavior is undefined" in 6.10 Preprocessing directives.
When the C standard was created, there were some existing C preprocessors and there was some imaginary ideal C preprocessor in the minds of standardization committee members.
So there were gray areas where committee members weren't completely sure what they would want to do, and/or where existing C preprocessor implementations differed from each other in behavior.
So these cases are left as undefined behavior: the committee members were not completely sure what the behavior actually should be, so there is no requirement on what it should be.
One of the origins of the UB
Yes, one of.
UB may exist to ease implementing the language. For example, in the case of the preprocessor, the preprocessor writers don't have to care about what happens when an invalid preprocessing token results from ##.
Or UB may exist to reconcile existing implementations with different behaviors, or as a point for extensions. So a preprocessor that segfaults in case of UB, a preprocessor that accepts the code and works in case of UB, and a preprocessor that formats your hard drive in case of UB can all be standard-conforming (but I wouldn't want to work with the one that formats your drive).
Suppose a file which is read in via an #include directive ends with the partial line (no trailing newline):
#define foo bar
Depending upon the design of the preprocessor, the token bar might be concatenated to whatever appears at the start of the line following the #include directive, or whatever appears on that line might behave as though it were placed on the line with the #define directive but with whitespace separating it from the token bar; it would hardly be inconceivable that a build script might rely upon such behaviors. It's also possible that an implementation might behave as though a newline were inserted at the end of the included file, or might ignore the last partial line of such a file.
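A purely hypothetical sketch of that situation (file names are illustrative; a modern implementation may simply reject or fix up the missing newline):

/* header.h -- hypothetically ends with this partial line and NO trailing newline: */
#define foo bar

/* main.c */
#include "header.h"
baz   /* behavior 1: foo is defined as "barbaz"; behavior 2: foo is defined as
         "bar baz"; behavior 3: a newline is assumed, so foo is "bar" and baz
         stays on this line; behavior 4: the partial last line is ignored entirely */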
Any code which relied upon one of the former behaviors would clearly have been non-portable, but if code exploited such behavior to do something that would otherwise not be practical, such code would hardly be "erroneous", and the authors of the Standard would not have wanted to forbid an implementation that would process it usefully from continuing to do so.
When the Standard uses the phrase "non-portable or erroneous", that does not mean "non-portable, therefore erroneous". Prior to the publication of C89, C implementations defined many useful constructs, but none of them were defined by "the C Standard" since there wasn't one. If an implementation defined the behavior of some construct, some didn't, and the Standard left the construct as "Undefined", that would simply preserve the status quo where implementations that chose to define a useful behavior would do so, those that chose not to wouldn't, and programs that relied upon such behaviors would be "non-portable", working correctly on implementations that supported the behaviors, but not on those that didn't.
Without getting into specifics, my guess is, there exist several preprocessor implementations which have bugs, but the Standard doesn't want to declare them non-conforming, for compatibility reasons.
In human language: if you write a program which has X in it, preprocessor does weird stuff.
In standardese: the behavior of program with X is undefined.
If the standard says something like "The result shall be a valid preprocessing token", it might be unclear what "shall" means in this context.
The programmer shall write the program so this condition holds? If so, the wording with "undefined behavior" is clearer and more uniform (it appears in other places too).
The preprocessor shall make sure this condition holds? If so, this requires dedicated logic which checks the condition; that may be impractical to implement.

Is it safe to run the C preprocessor several times on the same source?

From my experience, the C preprocessor just behaves as no-op when running on a previously preprocessed source. But is this behaviour guaranteed by the standard? Or maybe an implementation could have a preprocessor that modifies previously preprocessed code and for example removes/modifies line directives, or performs other modifications that could confuse the compiler?
In general, preprocessing via cpp is not guaranteed to be idempotent (a no-op after the first run). A simple counterexample:
#define X #define Y z
X
Y
The first invocation will yield:
#define Y z
Y
The second one:
z
Having said that, valid C code shouldn't be doing something like that (because the output wouldn't be valid input for the next stages of the compiler).
Moreover, depending on what you are trying to do, cpp has options like -fpreprocessed that may help.
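For instance, with GCC's cpp something along these lines might work (file names are illustrative); according to the GCC documentation, -fpreprocessed tells the preprocessor that the input has already been preprocessed, which suppresses macro expansion and the processing of most directives:

cpp foo.c -o foo.i                   # ordinary preprocessing
cpp -fpreprocessed foo.i -o foo2.i   # second pass treats the input as already preprocessed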
The standard does not define a "preprocessor" as a separate component. The closest it comes is in the description of phase 4 of the translation process in §5.1.1.2:
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
However, the translation phases defined in that section are not separable, nor are they guaranteed to be independent of each other:
Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice. Source files, translation units, and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any
particular implementation. (Footnote 6 from the same section.)
So there is no contemplated mechanism to extract the result of translation phases 1-4 in any form, much less as a text file -- in fact, if the translation phases were implemented precisely as described, the output of phase 4 would be a sequence of tokens -- and neither is there a mechanism to feed that output back into the translator.
In other words, you might have some tool which calls itself a preprocessor, and it might even be part of a compiler suite. But that tool's behaviour is outside of the scope of the C standard. So there are no guarantees at all from the standard.
By the way, if the token stream which comes out of phase 4 were naively converted to text, it might not correctly preserve token boundaries. Most preprocessor tools inject extra whitespace at points where tokens would otherwise run together. That allows the output of the tool to be fed into a compiler, at least in most cases. (See #acorn's answer for an example where this wouldn't work correctly.) But this behaviour is neither required nor regulated by the standard, either.
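A small sketch of the problem (NEG is a name chosen here):

#define NEG -
int x = 1-NEG 2;   /* the expanded token stream is 1 - - 2, i.e. 1 - (-2) == 3;
                      a naive textual flattening would emit "1-- 2", which re-lexes
                      as 1 -- 2 and no longer compiles; GNU cpp instead injects a
                      space between the two - tokens to keep them apart */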

How to process macros in LEX?

How do I implement #define in yacc/bison?
For Example:
#define f(x) x*x
If f(x) appears anywhere in any function, it is replaced by the right-hand side of the macro, substituting the actual argument for 'x'.
For example, f(3) would be replaced with 3*3. The macro can call another macro too.
It's not usually possible to do macro expansion inside a parser, at least not C-style macros, because C-style macro expansion doesn't respect syntax. For example
#define IF if(
#define THEN )
is legal (although very bad style IMHO). But for that to be handled inside the grammar, it would be necessary to allow a macro identifier to appear anywhere in the input, not just where an identifier might be expected. The necessary modifications to the grammar are going to make it much less readable and are very likely to introduce parser action conflicts. [Note 1]
Alternatively, you could do the macro expansion in the lexical analyzer. The lexical analyzer is not a parser, but parsing a C-style macro invocation doesn't require much sophistication, and if macro parameters were not allowed, it would be even simpler. This is how Flex handles macro replacement in its regular expressions ({identifier}, for example). [Note 2] Since Flex macros are just raw character sequences, not token lists as with C-style macros, they can be handled by pushing the replacement text back into the input stream. (F)lex provides the unput special action for this purpose. unput pushes one character back into the input stream, so if you want to push an entire macro replacement, you have to unput it one character at a time, back to front, so that the last character unput is the first one to be read afterwards.
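A minimal sketch of that push-back, assuming it lives in the user-code section of the generated scanner (so that unput is visible) and that <string.h> is included; the helper name is made up here:

static void unput_string(const char *s)
{
    size_t len = strlen(s);
    /* unput one character at a time, starting from the end of the string,
       so the characters are later read back in their original order */
    while (len > 0)
        unput(s[--len]);
}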
That's workable but ugly. And it's not really scalable to even the small feature list provided by the C preprocessor. And it violates the fundamental principle of software design, which is that each component does just one thing (so that it can do it well).
So that leaves the most common approach, which is to add a separate macro processor component, so that instead of dividing the parse into lexical scan/syntax analysis, the parse becomes lexical scan/macro expansion/syntax analysis. [Note 3]
A C-style macro processor which works between the lexical analyser and the syntactic analyser could itself be written in Bison. As I mentioned above, the parsing requirements are generally minimal, but there is still parsing to be done and Bison is presumably already part of the project. Although I don't know of any macro processor (other than proof-of-concept programs I've written myself) which does this, I think it's a very flexible solution. In particular, the Bison syntactic analysis phase could be implemented with a push-parser, which avoids the need to produce the entire macro-expanded token stream in order to make it available to a traditional pull-parser.
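For reference, Bison's push-parser interface (a real Bison feature) could be wired up roughly like this; next_expanded_token is a hypothetical call into the macro-expansion layer:

/* in the .y file: */
%define api.push-pull push

/* in the driver, feeding tokens one at a time as the macro expander emits them: */
yypstate *ps = yypstate_new();
int status;
do {
    YYSTYPE lval;
    int tok = next_expanded_token(&lval);   /* hypothetical: yields the next macro-expanded token */
    status = yypush_parse(ps, tok, &lval);
} while (status == YYPUSH_MORE);
yypstate_delete(ps);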
That's not the only way to design macros, though. Indeed, it has a lot of shortcomings, because the macro expansions are not hygienic, respecting neither syntax nor scope. Probably anyone who has used C macros has at one time or other been bitten by these problems; the simplest manifestation is defining a macro like:
#define NEXT(a) a + 1
and then writing
int x = NEXT(a) * 3;   /* expands to a + 1 * 3, i.e. a + 3, not (a + 1) * 3 */
which is not going to produce the expected result (unless what is expected is a violation of the syntactic form of the last statement). Also, any macro expansion which needs to use a local variable will sooner or later produce an incorrect expansion because of unexpected name collision. Hygienic macro expansion seeks to solve these issues by viewing macro expansion as an operation on syntax trees, not token streams, making the parsing paradigm lexical scan/syntax analysis/macro expansion (of the parse tree). For that operation, the appropriate tool might well be some kind of tree parser.
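The conventional mitigation on the C side is to parenthesize the replacement, which fixes this particular case but is still not hygienic in general:

#define NEXT(a) ((a) + 1)
int x = NEXT(a) * 3;   /* now expands to ((a) + 1) * 3, as intended */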
Notes
Also, you'd want to remove the token from the parse tree. Yacc/Bison does have a poorly-documented feature, YYBACKUP, which might possibly help accomplish this. I don't know if that's one of its intended use cases; indeed, it is not clear to me what its intended use cases are.
The (f)lex documentation calls these definitions, but they really are macros, and they suffer from all the usual problems macros bring with them, such as mysterious interactions with surrounding syntax.
Another possibility is macro expansion/lexical scan/syntax analysis, which could be implemented using a macro processor like M4. But that completely divorces the macros from the rest of the language.
Yacc and Lex generate C source in the end, so you can use macros inside the parser and lexer actions.
The actual #define preprocessor directives can go in the first section of the lexer and parser files:
%{
// Somewhere here
#define f(x) x*x
%}
These sections will be copied verbatim to the generated C source.

Is GLR algorithm a must when bison parsing C grammar?

I'm trying to study C grammar with flex/bison.
I found that Bison cannot parse this Bison grammar: https://www.lysator.liu.se/c/ANSI-C-grammar-y.html, because the LALR algorithm cannot process recursively multiple expressions.
Is GLR algorithm a must for C grammar?
There is nothing wrong with that grammar except:
it represents a very old version of C
it requires a lexical analyser which can somehow distinguish between IDENTIFIER and TYPE_NAME
it does not even attempt to handle the preprocessor phases
Also, it has one shift/reduce conflict as a result of the "dangling else" ambiguity. However, that conflict can be ignored, because bison's conflict resolution algorithm produces the correct result in this case. You can suppress the warning either with an %expect directive or by including a precedence declaration which favours shifting else over reducing if; or you can eliminate the ambiguity in the grammar using the technique described in the Wikipedia page linked above. (Note: I'm not talking about copy-and-pasting code from the Wikipedia page. In the case of C, you need to consider all cases of compound statements which terminate with an if statement.)
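As a sketch, either of the first two options looks roughly like this (token names follow the linked grammar; IF_NO_ELSE is a made-up pseudo-token, and only one of the two approaches is needed):

%expect 1              /* option 1: acknowledge the single expected shift/reduce conflict */

%nonassoc IF_NO_ELSE   /* option 2: declared before ELSE, so ELSE has higher precedence */
%nonassoc ELSE
/* ... rest of the declarations and grammar ... */
selection_statement
    : IF '(' expression ')' statement %prec IF_NO_ELSE
    | IF '(' expression ')' statement ELSE statement
    ;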
Moreover, an LR parser is not recursive, and it has no problems which could be described as a failure to "process recursively multiple expressions". (You might have that problem with a recursive descent parser, although it's pretty easy to work around the issue.)
So any problems you might have experienced (if your question refers to a concrete issue) have nothing to do with what's described in your question.
Of the problems I listed above, the most troubling is the second one, which shows up as the syntactic ambiguity of the cast operator. The cast operator is not actually ambiguous; clearly, C compilers manage to compile such expressions correctly. But distinguishing between the two possible parses of, for example, (x)-y*z requires knowing whether x names a type or a variable.
In C, all names are lexically scoped, so it is certainly possible to resolve x at compile time. But the resolution is not context-free. Since GLR is also a technique for parsing context-free grammars, using a GLR parser won't directly help you. It might be useful in the sense that GLR parsers can theoretically produce "parse forests" rather than parse trees; that is, the output of a GLR parser might effectively contain all possible correct parses, leaving the possibility to resolve the ambiguity by building symbol tables for each scope and then choosing between alternative parses by examining the name binding in effect at each site. (This works because type alias declarations -- "typedefs" -- are not ambiguous, so all the potential parses will have the same alias declarations.)
The usual solution, though, is to parse the program text using a deterministic parser, maintaining a symbol table during the parse, and giving the lexical analyser access to this symbol table so that it can distinguish between IDENTIFIER and TYPE_NAME, as expected by the grammar you link. This technique is politely called "lexical feedback", although it's also often called "the lexer hack".
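A minimal sketch of that feedback path (function and table names are illustrative, not from any particular implementation):

/* called from the lexer rule that matched an identifier */
static int classify_identifier(const char *name)
{
    /* symtab_is_typedef_name() is a hypothetical lookup into the symbol table
       that the parser maintains as it reduces declarations */
    return symtab_is_typedef_name(name) ? TYPE_NAME : IDENTIFIER;
}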

What is the lexical and syntactic analysis during the process of compiling in C Compiler?

What are the lexical and syntactic analysis phases during the process of compiling? Does the preprocessing happen after lexical and syntactic analysis?
Consider this code:
int a = 10;
if (a < 4)
{
printf("%d", a);
}
In the Lexical Analysis phase: You identify each word/token and assign a meaning to it.
In the code above, you start by identifying that i followed by n followed by t and then a space is the word int, and that it is a language keyword; 1 followed by 0 and a space is the number 10, and so on.
In the Syntactic Analysis phase: You verify whether the code follows the language syntax (grammar rules). For example, you check that there is only one variable on the LHS of an assignment operator (considering the C language), that each statement is terminated by a ;, that if is followed by a conditional/Boolean expression, etc.
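For instance, for the snippet above, the lexical analyzer would produce roughly this token stream (a simplified sketch):

int   -> keyword
a     -> identifier
=     -> operator
10    -> integer constant
;     -> punctuator
if    -> keyword
(     -> punctuator
a     -> identifier
<     -> operator
4     -> integer constant
)     -> punctuator
...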
Like others have mentioned, usually, preprocessing happens before lexical analysis or syntactical analysis.
Lexical analysis happens BEFORE the syntactic analysis. This is logical because, in order to expand a macro, it is necessary to identify the boundaries of an identifier first, and that is done by lexical analysis. After that, syntactic analysis kicks in. Note that compilers typically do not generate the full preprocessed source before starting the syntactic analysis. They read the source picking one lexeme at a time, do the preprocessing if needed, and feed the result to syntactic analysis.
In one case, lexical analysis happens twice: this is the case of paste buffering. Look at the code:
#define En(x) Abcd ## x ## x
enum En(5)
{
    a, b = 20, c, d
};
This code defines an enum with the name Abcd55. When the ## operators are processed during macro expansion, the data is placed into an internal buffer. After that, this buffer is scanned much like a small #include. During this scan the compiler breaks the contents of the buffer into lexemes. The boundaries of the rescanned lexemes may not match the boundaries of the original lexemes that were placed into the buffer: in the example above, three lexemes (Abcd, 5, and 5) are placed into the buffer, but only one (Abcd55) is retrieved.
Preprocessing happens before the lexical analysis, IIRC.
Comments get filtered out, #define directives are handled, and so on; after that, the compiler generates tokens with a scanner/lexer (lexical analysis). After that, the compiler builds parse trees, which are used for the syntactic analysis.
There are exceptions, but it usually breaks down like this:
Preprocess - transform program text to program text
Lexical analysis - transform program text to "tokens", which are essentially small integers with attributes attached
Syntactic analysis - transform the token stream into abstract syntax
The definition of "abstract syntax" can vary. In one-pass compilers, abstract syntax amounts to tartget code. But theses days it's usually a tree or DAG that logically represents the structure of the program.
When we are talking about the C programming language, we should note that there is an ISO (ANSI) standard for the language. Here is the last public draft of C99 (ISO/IEC 9899:1999): www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
There is a section "5.1.1.2 Translation phases" which describes how a C program should be parsed. The stages are:
... some steps for multi-byte, trigraph and backslash processing...
3). The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
This is lexical analysis for preprocessing. Only preprocessing directives, punctuation, string constants, identifiers, and comments are lexed here.
4). Preprocessing directives are executed, macro invocations are expanded
This is preprocessing itself. This phase also includes the files named by #include directives and then deletes the preprocessing directives themselves (such as #define or #ifdef).
... processing of string literals...
7). White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
Conversion to tokens means detecting language keywords and constants.
This is the final lexical-analysis step; syntactic and semantic analyses follow.
So, your question was:
Does the preprocessing happen after lexical and syntactic analysis?
Some lexical analysis is needed to do preprocessing, so the order is:
lexical_for_preprocessor, preprocessing, true_lexical, other_analysis.
PS: A real C compiler may be organized in a slightly different way, but it must behave as if it followed the phases described in the standard.

Resources