#line and string literal concatenation in C

Given this piece of C code:
char s[] =
"start"
#ifdef BLAH
"mid"
#endif
"end";
what should the output of the preprocessor be? In other words, what should the actual compiler receive and be able to handle? To narrow the possibilities, let's stick to C99.
I'm seeing that some preprocessors output this:
#line 1 "tst00.c"
char s[] =
"start"
#line 9
"end";
or this:
# 1 "tst00.c"
char s[] =
"start"
# 7 "tst00.c"
"end";
gcc -E outputs this:
# 1 "tst00.c"
# 1 "<command-line>"
# 1 "tst00.c"
char s[] =
"start"
"end";
And gcc is perfectly fine compiling all of the above preprocessed code even with the -fpreprocessed option, meaning that no further preprocessing should be done as all of it has been done already.
The confusion stems from this wording of the 1999 C standard:
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following
phases.
...
4. Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. ... All preprocessing directives are
then deleted.
...
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are syntactically
and semantically analyzed and translated as a translation unit.
In other words, is it legal for the #line directive to appear between adjacent string literals? If it is, it means that the actual compiler must do another round of string literal concatenation, but that's not mentioned in the standard.
Or are we simply dealing with non-standard compiler implementations, gcc included?

The #line or # 1 lines you get from GCC -E (or a compatible tool) are added for the sake of human readers and any tools that might attempt to work with a text form of the output of the preprocessor. They are just for convenience.
In general, yes, directives may appear between concatenated string literal tokens. #line is no different from #ifdef in your example.
Or are we simply dealing with non-standard compiler implementations, gcc included?
-E and -fpreprocessed modes are not standardized. A standard preprocessor always feeds its output into a compiler, not a text file. Moreover:
The output of the preprocessor has no standard textual representation.
The reason for inserting #line directives is so that any __LINE__ and __FILE__ macros that you might insert into the already-preprocessed file, before preprocessing it again, will expand correctly. Perhaps, when compiling such a file, the compiler may notice and use the values when reporting errors. Usage of "preprocessed text files" is nonstandard and generally discouraged.

Preprocessor and compiler errors in C

When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?
Let's say I type in this line: "# include header.h" (The " is part of the line to make it a string literal).
Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?
Typically compiler output doesn't distinguish "pre-processor errors" from "compiler errors", as these aren't really standardized terms.
What's called "pre-processing" is often the process of forming pre-processor tokens, followed by resolving all includes, pragmas and macros. In the C standard, this "pre-processing" roughly corresponds to "translation phases" 3 and 4:
The source file is decomposed into preprocessing tokens and sequences of
white-space characters (including comments). A source file shall not end in a
partial preprocessing token or in a partial comment. Each comment is replaced by
one space character. New-line characters are retained. Whether each nonempty
sequence of white-space characters other than new-line is retained or replaced by
one space character is implementation-defined.
Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively. All preprocessing directives are then deleted.
The compiler will obviously not complain about finding a valid string literal "# include header.h" in either of the above phases, since a string literal is a valid preprocessing token. What you call "pre-processor errors" are probably errors that occur in any of the above phases.
(This is a simplified explanation; there's lots of other mildly interesting stuff happening as well, such as trigraph replacement and line splicing with \.)
But in this case, I think the compiler will complain in phase 7, emphasis mine:
White-space characters separating tokens are no longer significant. Each
preprocessing token is converted into a token. The resulting tokens are
syntactically and semantically analyzed and translated as a translation unit.
"Will the preprocessor have an issue with it or will it be the compiler that will treat it as a string without assigning it to anything?"
I've tried your example:
"#include <stdio.h>"
I get the following errors:
For GCC:
"error: expected identifier or '(' before string constant"
For Clang:
"error: expected identifier or '('"
Both GCC and Clang treat it as a string literal, which is reasonable, since character sequences enclosed in double quotes are specified as string literals:
"A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz"."
Source: ISO/IEC 9899:2018 (C18), §6.4.5/3.
This issue is one the compiler cares about, not the preprocessor. In general, since macros are expanded before compilation, the incorrectness or failure of preprocessor directives is usually also something the compiler complains about. There is usually no explicit error detection stage for the C preprocessor.
If the assignment were proper, e.g.:
const char* p = "#include <stdio.h>";
and you then use variables, functions, etc. that are declared in stdio.h, you can* get errors about undefined references to these variables/functions, since the compiler/linker can't see/find those declarations.
*Whether you get an error or not furthermore depends on whether the definition of that variable/function is visible before its use in the source code and on how you link several source files.
"When I have a syntax error in C, how can I know if it's a preprocessor error or a compiler error?"
As said above, there are no real preprocessor errors; the compiler covers these issues. The preprocessor doesn't really analyze the code for errors, it just expands. It is usually clear whether an error belongs to a macro or not, even though it is the compiler that diagnoses the syntactic issues.
As Eugene already said in the comments, you can look at the macro-expanded version of your code using the -E option of GCC and check whether the macros were expanded as desired.

Standard Behavior Of An Empty Macro Preceding A Preprocessing Directive

Take, for example, the following:
#define FOO
FOO #define BAR 1
BAR
What should, according to each of the ANSI C and C99 standards, be the preprocessed output of the above code?
It seems to me that this should evaluate to 1; however, running the above example through both gcc -E and clang -E produces the following:
#define BAR 1
BAR
The draft standard "ISO/IEC 9899:201x Committee Draft — April 12, 2011 N1570" section 6.10 actually contains an example of this:
EXAMPLE In:
#define EMPTY
EMPTY # include <file.h>
the sequence of preprocessing tokens on the second line is not a preprocessing directive, because it does not begin with a # at the start of translation phase 4, even though it will do so after the macro EMPTY has been replaced.
It tells us that "... the second line is not a preprocessing directive ..."
So for your code
FOO #define BAR 1
is not a preprocessing directive, meaning that only FOO will be replaced and BAR will not be defined. Consequently, the output of the preprocessor is:
#define BAR 1
BAR
Your code is not valid
ISO/IEC 9899:2011, Section 6.10 Preprocessing directives:
A preprocessing directive consists of a sequence of preprocessing
tokens that satisfies the following constraints: The first token in
the sequence is a # preprocessing token that (at the start of
translation phase 4) is either the first character in the source file
(optionally after white space containing no new-line characters) or
that follows white space containing at least one new-line character.
This example actually occurs in the Standard (C17 6.10/8):
EXAMPLE In:
#define EMPTY
EMPTY # include <file.h>
the sequence of preprocessing tokens on the second line is not a preprocessing directive, because it does not begin with a # at the start of translation phase 4, even though it will do so after the macro EMPTY has been replaced.
So the output you see from gcc -E is correct. (Note: the amount of whitespace here is not significant; at that stage of translation the program has been reduced to a sequence of preprocessing tokens, and the differing amounts of whitespace in the output are just an artefact of how gcc -E works.)

Preprocessing Tokens: '- -' vs. '--'

Why does the (GCC) preprocessor create two tokens - -B instead of a single one --B in the following example? What is the logic that the former should be correct and not the latter?
#define A -B
-A
Output according to gcc -E:
- -B
After all, -- is a valid operator, so theoretically a valid token as well.
Is this specific to the GCC preprocessor or does this follow from the C standards?
The preprocessor works on tokens, not strings. Macro substitution without ## cannot create a new token, and so, if the preprocessor output goes to a text file as opposed to going straight into the compiler, preprocessors insert whitespace so that the output text file can be used as C input again without changed semantics.
The space insertion doesn't seem to be in the standard, but then the standard describes the preprocessor as working on tokens and as feeding its output to the compiler proper, not to a text file.
Focusing on the white space insertion is missing the issue.
The macro A is defined as the sequence of preprocessing tokens - and B.
When the compiler parses a fragment of source code -A, it produces 2 tokens - and A. A is expanded as part of the preprocessing phase and the tokens are converted to C tokens: -, - and B.
If B is itself defined as a macro (#define B 4), A would expand to -, -, 4, which is parsed as an expression evaluating to the value 4 with type int.
gcc -E produces text. For the text to convert back to the same sequence of tokens as the original source code, a space needs to be inserted between the two - tokens to prevent -- from being parsed as a single token.

Compilers that required # on the first column?

Were there widely used pre-ANSI C compilers† that required the # to be on the first column?
† I would accept any compiler on this list. If I can find mention of it in the comp.lang.c Usenet newsgroup in a post dated before 1995, I would accept it.
K&R C did not specify whether whitespace was permitted before the #. From the original The C Programming Language, §12¶1 of the "C Reference Manual" in Appendix A:
The C compiler contains a preprocessor capable of macro substitution, conditional compilation, and inclusion of named files. Lines beginning with # communicate with this preprocessor.
Thus, whether or not whitespace was permitted to precede the # was unspecified. This would mean a pre-ANSI compiler could fail to compile a program if the directive did not begin on the first column.
In ISO C (and in ANSI C before that), the C preprocessing directives were explicitly permitted to be prefixed with whitespace. In ANSI C (C-89):
A preprocessing directive consists of a sequence of preprocessing
tokens that begins with a # preprocessing token that is either the
first character in the source file (optionally after white space
containing no new-line characters) or that follows white space
containing at least one new-line character, and is ended by the next
new-line character.
ISO C.2011 has similar language, but is clarified even further:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character. The last token in the sequence is the first
new-line character that follows the first token in the sequence.165) A new-line character
ends the preprocessing directive even if it occurs within what would otherwise be an
invocation of a function-like macro.
165) Thus, preprocessing directives are commonly called ‘‘lines’’. These ‘‘lines’’ have no other syntactic
significance, as all white space is equivalent except in certain situations during preprocessing (see the
# character string literal creation operator in 6.10.3.2, for example).
Short answer: Yes.
I remember writing things like
#if foo
/* ... */
#else
#if bar
/* ... */
#else
#error "neither foo nor bar specified"
#endif
#endif
so that the various pre-ANSI compilers that I once used wouldn't complain about "unrecognized preprocessor directive '#error'". This would have been with Ritchie's original cc for the pdp11, or pcc (the "portable C compiler" which, IIRC, was the basis for the Vax cc of the 80's or so). Both of those compilers -- more accurately, the preprocessor used with both of those compilers -- definitely required the # to be in the first column. (Actually, although those compilers were very different, they might both have used different variants of basically the same preprocessor, which was always a separate program in those days.)

Macro definition for ".global"

I have a C program that uses a ".global" assembler directive. My compiler does not allow this, and I need to ignore it, so I tried the following:
#define .global //global
But this gives a compiler error. Is there any other option that I can use? The compilation error is:
"expected an identifier"
You are almost certainly going to end up writing your own preprocessing script; it shouldn't be too difficult, if your source files are reasonably controlled. If you don't use the .global construct in string literals or comments, for example, it would be sufficient to do something like:
sed 's/\.global [_[:alpha:]][_[:alnum:]]*;//g'
(perhaps with a bit more attention to detail about whitespace).
You cannot manufacture a comment with a macro. (You also cannot define a macro whose name starts with a ., although I suppose there could be a compiler which accepts that as an extension.)
Comments are replaced with whitespace in phase 3 of the translation process. Preprocessor directives are not examined until phase 4, by which time all the comments have disappeared.
So there is no difference between
#define COMMENT //comment
#define COMMENT
Standards reference: §5.1.1.2/1:
The source file is decomposed into preprocessing tokens7) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character.…
Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed.…
