Preprocessing the # (stringizing) operator and whitespace

I'm looking at implementing a C preprocessor in two phases, where the first phase converts the source file into an array of preprocessing tokens. This would be good for simplicity and performance, as the work of tokenizing would not need to be redone when a header file is included by multiple files in a project.
The snag:
#include <stdio.h>
#define f(x) #x
int main(void) {
    puts(f(a+b));
    puts(f(a + b));
    return 0;
}
According to the standard, the output should be:
a+b
a + b
i.e. the information about whether constituent tokens were separated by whitespace is supposed to be preserved. This would require the two-phase design to be scrapped.
The uses of the # operator that I've seen so far don't actually need this, e.g. assert would still work fine if the output were always a + b regardless of whether the constituent tokens were separated by whitespace in the source file.
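For illustration, here is a simplified assert-style macro (a sketch; MY_ASSERT is made up and is not the real <assert.h> definition) showing this typical use of #, where the exact spacing inside the stringized expression makes no practical difference:
#include <stdio.h>
#include <stdlib.h>
/* sketch of an assert-like macro: #e stringizes the condition so the
   failure message can show it; whether it prints "a+b" or "a + b"
   does not affect its purpose */
#define MY_ASSERT(e) \
    ((e) ? (void)0 \
         : (fprintf(stderr, "assertion failed: %s, file %s, line %d\n", \
                    #e, __FILE__, __LINE__), abort()))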
Is there any existing code anywhere that does depend on the exact behavior prescribed by the standard for this operator?

You might want to look at the preprocessor of the LCC compiler, written as an example ANSI C compiler for compiler courses. Another preprocessor is MCPP.
C/C++ preprocessing is quite tricky. If you stick with it, make sure to get at least drafts of the relevant standards, and pilfer test suites somewhere.
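For what it's worth, one way to keep the token-array design and still satisfy the stringizing rule (this is an assumption about a possible implementation, not a description of what LCC or MCPP actually do) is to record the whitespace information on each preprocessing token instead of re-reading the source:
/* sketch of a preprocessing-token record carrying whitespace info */
struct pp_token {
    int kind;                        /* identifier, pp-number, punctuator, ... */
    const char *spelling;            /* the token's text                       */
    unsigned has_leading_space : 1;  /* was it preceded by whitespace?         */
    unsigned at_line_start : 1;      /* needed to recognize directives         */
};
The # operator can then consult has_leading_space when it rebuilds the string literal, so the token array never has to be thrown away.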

Related

Is it safe to run the C preprocessor several times on the same source?

From my experience, the C preprocessor just behaves as a no-op when running on previously preprocessed source. But is this behaviour guaranteed by the standard? Or could an implementation have a preprocessor that modifies previously preprocessed code, for example removing/modifying line directives, or performing other modifications that could confuse the compiler?
In general, preprocessing via cpp is not guaranteed to be idempotent (a no-op after the first run). A simple counterexample:
#define X #define Y z
X
Y
The first invocation will yield:
#define Y z
Y
The second one:
z
Having said that, valid C code shouldn't be doing something like that (because the output wouldn't be valid input for the next stages of the compiler).
Moreover, depending on what you are trying to do, cpp has options like -fpreprocessed that may help.
The standard does not define a "preprocessor" as a separate component. The closest it comes is in the description of phase 4 of the translation process in §5.1.1.2:
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
However, the translation phases defined in that section are not separable, nor are they guaranteed to be independent of each other:
Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice. Source files, translation units, and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any
particular implementation. (Footnote 6 from the same section.)
So there is no contemplated mechanism to extract the result of translation phases 1-4 in any form, much less as a text file -- in fact, if the translation phases were implemented precisely as described, the output of phase 4 would be a sequence of tokens -- and neither is there a mechanism to feed that output back into the translator.
In other words, you might have some tool which calls itself a preprocessor, and it might even be part of a compiler suite. But that tool's behaviour is outside of the scope of the C standard. So there are no guarantees at all from the standard.
By the way, if the token stream which comes out of phase 4 were naively converted to text, it might not correctly preserve token boundaries. Most preprocessor tools inject extra whitespace at points where adjacent tokens would otherwise run together. That allows the output of the tool to be fed into a compiler, at least in most cases. (See #acorn's answer for an example where this wouldn't work correctly.) But this behaviour is neither required nor regulated by the standard, either.
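A small illustration of the boundary problem (the macro name PLUS is made up for the example):
#define PLUS +
int i = 1 PLUS+1;   /* expands to the token sequence: 1 + + 1                 */
                    /* gluing the text back together as "1 ++1" would re-lex  */
                    /* as 1 ++ 1, which is invalid, so a typical tool prints  */
                    /* something like "1 + +1" instead                        */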

How to process macros in LEX?

How do I implement #define in yacc/bison?
For Example:
#define f(x) x*x
If f(x) appears anywhere in any function, then it is replaced by the right-hand side of the macro, with the argument substituted for ‘x’.
For example, f(3) would be replaced with 3*3. The macro can call another macro too.
It's not usually possible to do macro expansion inside a parser, at least not C-style macros, because C-style macro expansion doesn't respect syntax. For example
#define IF if(
#define THEN )
is legal (although very bad style IMHO). But for that to be handled inside the grammar, it would be necessary to allow a macro identifier to appear anywhere in the input, not just where an identifier might be expected. The necessary modifications to the grammar are going to make it much less readable and are very likely to introduce parser action conflicts. [Note 1]
Alternatively, you could do the macro expansion in the lexical analyzer. The lexical analyzer is not a parser, but parsing a C-style macro invocation doesn't require much sophistication, and if macro parameters were not allowed, it would be even simpler. This is how Flex handles macro replacement in its regular expressions ({identifier}, for example). [Note 2] Since Flex macros are just raw character sequences, not token lists as with C-style macros, they can be handled by pushing the replacement text back into the input stream. (F)lex provides the unput special action for this purpose. unput pushes one character back into the input stream, so if you want to push an entire macro replacement, you have to unput it one character at a time, back to front so that the last character unput is the first one to be read afterwards.
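A sketch of that technique (unput_string is a made-up helper; it assumes it lives in the user-code section of the .l file, where flex's unput() macro is visible):
#include <string.h>   /* in practice this goes in the %{ %} block */

/* push a macro's replacement text back into the input stream;
   characters are unput last-to-first so they are re-read in order */
static void unput_string(const char *text) {
    size_t n = strlen(text);
    while (n > 0)
        unput(text[--n]);   /* unput() takes one character at a time */
}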
That's workable but ugly. And it's not really scalable to even the small feature list provided by the C preprocessor. And it violates the fundamental principle of software design, which is that each component does just one thing (so that it can do it well).
So that leaves the most common approach, which is to add a separate macro processor component, so that instead of dividing the parse into lexical scan/syntax analysis, the parse becomes lexical scan/macro expansion/syntax analysis. [Note 3]
A C-style macro processor which works between the lexical analyser and the syntactic analyser could itself be written in Bison. As I mentioned above, the parsing requirements are generally minimal, but there is still parsing to be done and Bison is presumably already part of the project. Although I don't know of any macro processor (other than proof-of-concept programs I've written myself) which does this, I think it's a very flexible solution. In particular, the Bison syntactic analysis phase could be implemented with a push-parser, which avoids the need to produce the entire macro-expanded token stream in order to make it available to a traditional pull-parser.
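A minimal sketch of that arrangement (it assumes the grammar is generated with %define api.push-pull push; next_expanded_token() stands in for the combined lexer and macro expander and is made up for the example):
#include "parser.tab.h"   /* Bison-generated header, assumed to expose the push API */

int next_expanded_token(YYSTYPE *lval);   /* assumed: lexer + macro expansion */

/* drive the Bison push parser with macro-expanded tokens, one at a time */
void parse_translation_unit(void)
{
    yypstate *ps = yypstate_new();
    YYSTYPE lval;
    int tok, status;

    do {
        tok = next_expanded_token(&lval);       /* returns 0 at end of input       */
        status = yypush_parse(ps, tok, &lval);  /* YYPUSH_MORE until input is done */
    } while (status == YYPUSH_MORE);

    yypstate_delete(ps);
}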
That's not the only way to design macros, though. Indeed, it has a lot of shortcomings, because the macro expansions are not hygienic, respecting neither syntax nor scope. Probably anyone who has used C macros has at one time or other been bitten by these problems; the simplest manifestation is defining a macro like:
#define NEXT(a) a + 1
and then writing
int x = NEXT(a) * 3;
which is not going to produce the expected result: it expands to int x = a + 1 * 3;, which means a + 3 rather than (a + 1) * 3. Also, any macro expansion which needs to use a local variable will sooner or later produce an incorrect expansion because of unexpected name collision. Hygienic macro expansion seeks to solve these issues by viewing macro expansion as an operation on syntax trees, not token streams, making the parsing paradigm lexical scan/syntax analysis/macro expansion (of the parse tree). For that operation, the appropriate tool might well be some kind of tree parser.
Notes
Also, you'd want to remove the token from the parse. Yacc/Bison does have a poorly-documented feature, YYBACKUP, which might make it possible to accomplish this. I don't know if that's one of its intended use cases; indeed, it is not clear to me what its intended use cases are.
The (f)lex documentation calls these definitions, but they really are macros, and they suffer from all the usual problems macros bring with them, such as mysterious interactions with surrounding syntax.
Another possibility is macro expansion/lexical scan/syntax analysis, which could be implemented using a macro processor like M4. But that completely divorces the macros from the rest of the language.
yacc and lex generate C source at the end. So you can use macros inside the parser and lexer actions.
The actual #define preprocessor directives can go in the first section of the lexer and parser file
%{
// Somewhere here
#define f(x) x*x
%}
These sections will be copied verbatim to the generated C source.

macro expansion order with included files

Let's say I have a macro in an inclusion file:
// a.h
#define VALUE SUBSTITUTE
And another file that includes it:
// b.h
#define SUBSTITUTE 3
#include "a.h"
Is it the case that VALUE is now defined to SUBSTITUTE and will be macro expanded in two passes to 3, or is it the case that VALUE has been set to the macro expanded value of SUBSTITUTE (i.e. 3)?
I ask this question in the interest of trying to understand the Boost preprocessor library and how its BOOST_PP_SLOT defines work (edit: and I mean the underlying workings). Therefore, while I am asking the above question, I'd also be interested if anyone could explain that.
(and I guess I'd also like to know where the heck the 'painted blue' rules are written down...)
VALUE is defined as SUBSTITUTE. The definition of VALUE is not aware at any point that SUBSTITUTE has also been defined. After VALUE is replaced, whatever it was replaced by will be scanned again, and potentially more replacements applied then. All defines exist in their own conceptual space, completely unaware of each other; they only interact with one another at the site of expansion in the main program text (defines are directives, and thus not part of the program proper).
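A quick way to see the rescanning in action with the two headers above (the variable name x is only for the example):
#include "b.h"
int x = VALUE;   /* VALUE is replaced by SUBSTITUTE, the result is rescanned, */
                 /* and SUBSTITUTE is then replaced by 3: int x = 3;          */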
The rules for the preprocessor are specified alongside the rules for C proper in the language standard. The standard documents themselves cost money, but you can usually download the "final draft" for free; the latest (C11) can be found here: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
For at-home use the draft is pretty much equivalent to the real thing. Most people who quote the standard are actually looking at copies of the draft. (Certainly it's closer to the actual standard than any real-world C compiler is...)
There's a more accessible description of the macro rules in the GCC manual: http://gcc.gnu.org/onlinedocs/cpp/Self_002dReferential-Macros.html
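As a concrete illustration of the self-reference ('painted blue') rule those pages describe (the names here are made up):
int main(void) {
    int FOO = 0;          /* an ordinary variable named FOO                    */
#define FOO FOO + 1       /* a self-referential macro with the same name       */
    int z = FOO;          /* expands to FOO + 1; the inner FOO is "painted     */
                          /* blue", so it is not expanded again and refers to  */
                          /* the variable: z == 1                              */
    return z;
}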
Additionally... I couldn't tell you much about the Boost preprocessor library, not having used it, but there's a beautiful pair of libraries by the same authors called Order and Chaos that are very "clean" (as macro code goes) and easy to understand. They're more academic in tone and intended to be pure rather than portable; which might make them easier reading.
(Since I don't know Boost PP I don't know how relevant this is to your question but) there's also a good introductory example of the kinds of techniques these libraries use for advanced metaprogramming constructs in this answer: Is the C99 preprocessor Turing complete?

What are lexical and syntactic analysis during the process of compiling in a C compiler?

What are lexical and syntactic analysis during the process of compiling? Does the preprocessing happen after lexical and syntactic analysis?
Consider this code:
int a = 10;
if (a < 4)
{
printf("%d", a);
}
In the Lexical Analysis phase: You identify each word/token and assign a meaning to it.
In the code above, you start by identifying that i followed by n followed by t and then a space is the word int, and that it is a language keyword; 1 followed by 0 and a space is the number 10, and so on.
In the Syntactic Analysis phase: You verify whether the code follows the language syntax (grammar rules). For example, you check whether there is only one variable on the LHS of an assignment operator (considering the C language), that each statement is terminated by a ;, that if is followed by a conditional/Boolean expression, etc.
Like others have mentioned, usually, preprocessing happens before lexical analysis or syntactical analysis.
Lexical analysis happens BEFORE the syntactic analysis. This is logical because, in order to expand a macro, it is necessary to identify the boundaries of an identifier first, and that is done by lexical analysis. After that, syntactic analysis kicks in. Note that compilers typically do not generate the full preprocessed source before starting the syntactic analysis. They read the source picking one lexeme at a time, do the preprocessing if needed, and feed the result to syntactic analysis.
In one case lexical analysis happens twice. This is the paste buffering. Look at the code:
#define En(x) Abcd ## x ## x
enum En(5)
{
    a, b = 20, c, d
};
This code defines an enum with the name Abcd55. When the ## operators are processed during macro expansion, the data is placed into an internal buffer. After that this buffer is scanned much like a small #include. During the scanning the compiler will break the contents of the buffer into lexemes. It may happen that the borders of the scanned lexemes do not match the borders of the original lexemes that were placed into the buffer. In the example above three lexemes are placed into the buffer but only one is retrieved.
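A smaller example of the same boundary effect (PASTE and count1 are made up for the example):
#define PASTE(a, b) a ## b
int PASTE(count, 1) = 0;   /* the two lexemes "count" and "1" go into the  */
                           /* paste buffer, but a single lexeme, the       */
                           /* identifier count1, comes back out            */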
Preprocessing happens before the lexical analysis, IIRC.
Comments get filtered out, #define, ... are handled, and after that a compiler generates tokens with a scanner/lexer (lexical analysis). After that, compilers generate parse trees, which are used for the syntactic analysis.
There are exceptions, but it usually breaks out like this:
Preprocess - transform program text to program text
Lexical analysis - transform program text to "tokens", which are essentially small integers with attributes attached
Syntactic analysis - transform program text to abstract syntax
The definition of "abstract syntax" can vary. In one-pass compilers, abstract syntax amounts to target code. But these days it's usually a tree or DAG that logically represents the structure of the program.
When we are talking about the C programming language, we should note that there is an ISO (ANSI) standard for the language. Here is the last public draft of C99 (ISO/IEC 9899:1999): www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
There is a section "5.1.1.2 Translation phases" which says how a C program should be parsed. The stages are:
... some steps for multi-byte, trigraph and backslash processing...
3). The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments).
This is lexical analysis for preprocessing. Only preprocessor directives, punctuation, string constants, identifiers, comments are lexed here.
4). Preprocessing directives are executed, macro invocations are expanded
This is preprocessing itself. This phase will also include files from #include and then it will delete preprocessing directives (like #define or #ifdef and others).
... processing of string literals...
7). White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
Conversion to tokens means detecting language keywords and constants.
This is the step of final lexical analysis, followed by syntactic and semantic analysis.
So, your question was:
Does the preprocessing happen after lexical and syntactic analysis?
Some lexical analysis is needed to do preprocessing, so order is:
lexical_for_preprocessor, preprocessing, true_lexical, other_analysis.
PS: A real C compiler may be organized in a slightly different way, but it must behave in the same way as described in the standard.

Occurrences of question mark in C code

I am doing a simple program that should count the occurrences of the ternary operator ?: in C source code. And I am trying to simplify that as much as possible. So I've filtered from the source code these things:
String literals " "
Character constants ' '
Trigraph sequences ??=, ??(, etc.
Comments
Macros
And now I am only counting the occurrences of question marks.
So my question is: Is there any other symbol, operator, or anything else that could cause a problem, i.e. contain a '?'?
Let's suppose that the source is syntactically valid.
I think you found all places where a question mark is introduced and therefore eliminated all possible false positives (for the ternary operator). But maybe you eliminated too much: maybe you want to count those "?:"'s that get introduced by macros; you don't count those. Is that what you intend? If so, you're done.
Run your tool on preprocessed source code (you can get this by running e.g. gcc -E). This will have done all macro expansions (as well as #include substitution), and eliminated all trigraphs and comments, so your job will become much easier.
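For illustration, a minimal counting pass over already-preprocessed input (a sketch under the assumptions above: comments, macros and trigraphs are gone, so only string and character literals need to be skipped; it reads from standard input):
#include <stdio.h>

int main(void) {
    int c, prev = 0, count = 0;
    enum { CODE, STR, CHR } state = CODE;

    while ((c = getchar()) != EOF) {
        if (state == CODE) {
            if (c == '"') state = STR;
            else if (c == '\'') state = CHR;
            else if (c == '?') count++;
        } else if (state == STR) {
            if (c == '"' && prev != '\\') state = CODE;
        } else {                       /* CHR */
            if (c == '\'' && prev != '\\') state = CODE;
        }
        /* remember the previous character, but treat a backslash that was
           itself escaped as consumed, so "\\" does not hide a closing quote */
        prev = (c == '\\' && prev == '\\') ? 0 : c;
    }
    printf("%d\n", count);
    return 0;
}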
In K&R ANSI C the only places where a question mark can validly occur are:
String literals " "
Character constants ' '
Comments
Now you might notice macros and trigraph sequences are missing from this list.
I didn't include trigraph sequences since they are a compiler extension and not "valid C". I don't mean you should remove the check from your program; I'm trying to say you already went further than what's needed for ANSI C.
I also didn't include macros because when you're talking about a character that can occur in macros you can mean two things:
Macro names/identifiers
Macro bodies
The ? character can not occur in macro identifiers (http://stackoverflow.com/questions/369495/what-are-the-valid-characters-for-macro-names), and I see macro bodies as regular C code so the first list (string literals, character constants and comments*) should cover them too.
* Can macros validly contain comments? Because if I use this:
#define somemacro 15 // this is a comment
then // this is a comment isn't part of the macro. But what if I compiled this C file with -D somemacro="15 // this is a comment"?
