Unicode define statements in C preprocessor - c

I was doing some testing and fiddling about and noticed that the C preprocessor does not let me use Unicode with the define directive. Below is an example of my plight.
#include <stdio.h>
#define φαντασία fancy
int main() {
    printf("Is the C preprocessor φαντασία?");
}
It is giving me this output...
$ ./a.out
Is the C preprocessor φαντασία?
Is there some way to fix this behaviour? It seems worthy of a feature request, perhaps. Have I done something wrong?

Preprocessor replacement doesn't work on text; it works on tokens. When your printf line is processed, the text is first converted to 5 tokens:
identifier token printf
punctuator token (
string-literal token "Is the C preprocessor φαντασία?"
punctuator token )
punctuator token ;
Only identifier tokens are considered by the macro replacement system, so the φαντασία inside the string literal is never examined.

Related

Why these consecutive macro replacements do not result in an error?

This program outputs 5. But after replacing all the macros, the result would be --5. This should cause a compilation error, since it tries to decrement the constant 5. But it compiles and runs fine.
#include <stdio.h>
#define A -B
#define B -C
#define C 5
int main()
{
    printf("The value of A is %d\n", A);
    return 0;
}
Why is there no error?
Here are the steps for the compilation of the statement printf("The value of A is %d\n", A);:
the lexical parser produces the preprocessing tokens printf, (, "The value of A is %d\n", ,, A, ) and ;.
A is a macro that expands to the 2 tokens - and B.
B is also a macro and gets expanded to - and C.
C is again a macro and gets expanded to 5.
the tokens are then converted to C tokens, producing errors for preprocessing tokens that do not convert to proper C tokens (ex: 0a). In this example, the tokens are identical.
the compiler parses the resulting sequence according to the C grammar: printf, (, "The value of A is %d\n", ,, -, -, 5, ), ; matches a function call to printf with 2 arguments: a format string and a constant expression - - 5, which evaluates to 5 at compile time.
the code is therefore equivalent to printf("The value of A is %d\n", 5);. It will produce the output:
The value of A is 5
This sequence of macros is expanded as tokens, not strictly a sequence of characters, hence A does not expand as --5, but rather as - -5. Good C compilers would insert an extra space when preprocessing the source to textual output to ensure the resulting text produces the same sequence of tokens when reparsed. Note however that the C Standard does not say anything about preprocessing to textual output, it only specifies preprocessing as one of the parsing phases and it is a quality of implementation issue for compilers to not introduce potential side effects when preprocessing to textual output.
There is a separate feature for combining tokens into new tokens in the preprocessor called token pasting. It requires a specific operator ## and is quite tricky to use.
Note also that macros should be defined with parentheses around each argument and parentheses around the whole expansion to prevent operator precedence issues:
#define A (-B)
#define B (-C)
#define C 5
Two consecutive dashes are not combined into a single pre-decrement operator -- because the C preprocessor works with individual tokens, effectively inserting whitespace around macro substitutions. Running this program through gcc -E
#define A -B
#define B -C
#define C 5
int main() {
    return A;
}
produces the following output:
int main() {
    return - -5;
}
Note the space after the first -.
According to the standard, macro replacements are performed at the level of preprocessing tokens, not at the level of individual characters (section 6.10.3, paragraph 9):
A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of the macro name to be replaced by the replacement list of preprocessing tokens that constitute the remainder of the directive.
Therefore, the two dashes - constitute two different tokens, so they are kept separate from each other in the output of the preprocessor.
Whenever we use #define in a C program, the preprocessor replaces the macro name with its replacement list wherever it is used.
#define A -B
#define B -C
#define C 5
So when we print A, the expansion proceeds in the following steps.
A => -B
B => -C
C => 5
A => - -C => - -5
So when we print the value of A, it comes out to be 5.
Generally, #define directives like these are used to name constants that are used throughout the code.

The output of the following C code is T T , why not t t?

The output of the following C code is T T, but I think it should be t t.
#include<stdio.h>
#define T t
void main()
{
    char T = 'T';
    printf("\n%c\t%c\n",T,t);
}
The preprocessor does not perform substitution of any text within quotes, whether they are single quotes or double quotes.
So the character constant 'T' is unchanged.
From section 6.10.3 of the C standard:
9 A preprocessing directive of the form
# define identifier replacement-list new-line
defines an object-like macro that causes each subsequent instance of
the macro name 171) to be replaced by the replacement list of
preprocessing tokens that constitute the remainder of the
directive. The replacement list is then rescanned for more macro
names as specified below.
171) Since, by macro-replacement time, all character constants and
string literals are preprocessing tokens, not sequences possibly
containing identifier-like subsequences (see 5.1.1.2, translation
phases), they are never scanned for macro names or parameters.
TL;DR The variable name T is subject to macro replacement, not the initializer 'T'.
To elaborate, #define macros cause token replacement, and anything inside quotes (either '' or "") is a single token that is not subject to macro replacement.
If you run just the preprocessor on your code (for example, gcc -E test.c), the relevant lines look like
char t = 'T';
printf("\n%c\t%c\n",t,t);
You can check this yourself by running gcc -E main.c -o test.txt and inspecting test.txt.
The program, as expected, prints the value of the variable t twice: T T.
That said, for a hosted environment, the required signature for main() is int main(void), at least.

gcc parsing code which has been #if 0 out [duplicate]

int main(void)
{
#if 0
    something"
#endif
    return 0;
}
The simple program above generates a warning in gcc: missing terminating " character. This seems odd, because it means the compiler allows the block between #if 0 and #endif to contain an invalid statement like something, but not an unpaired double quote ". The same happens with #ifdef and #ifndef.
Real comments are fine here:
int main(void)
{
    /*
    something"
    */
    return 0;
}
Why? The single quote ' behaves similarly. Are there any other tokens that are treated specially?
See the comp.lang.c FAQ, question 11.19:
Under ANSI C, the text inside a "turned off" #if, #ifdef, or #ifndef must still consist of "valid preprocessing tokens." This means that the characters " and ' must each be paired just as in real C code, and the pairs mustn't cross line boundaries.
Compilation goes through several phases before an executable binary is generated.
You are not in the compiler proper yet; it is the preprocessor that flags this. The preprocessor does not check C language syntax, but it does have to split the text into preprocessing tokens, so an unterminated quote is diagnosed at this stage.
After the preprocessor pass, the remaining code goes to the C compiler, which is where syntax errors are detected.
The preprocessor works at the token level, and a string literal is considered a single token. The preprocessor is warning you that you have an invalid token.
According to the C99 standard, a preprocessing token is one of these things:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
The standard also says:
If a ' or a " character matches the last category, the behavior is
undefined.
The word something above would be an invalid statement to the C compiler, but it is a valid preprocessing token (an identifier), and the preprocessor discards it before it ever reaches the compiler.
Besides Kevin's answer, the "Incompatibilities of GCC" section of the GCC manual says:
GCC complains about unterminated character constants inside of preprocessing conditionals that fail. Some programs have English comments enclosed in conditionals that are guaranteed to fail; if these comments contain apostrophes, GCC will probably report an error. For example, this code would produce an error:
#if 0
You can't expect this to work.
#endif
The best solution to such a problem is to put the text into an actual C comment delimited by /*...*/.

How to make string or char constants with macro expansion using ## operator

I am trying to do the following:
#define mkstr(str) #str
#define cat(x,y) mkstr(x ## y)
int main()
{
    puts(cat(\,n));
    puts(cat(\,t));
    return 0;
}
Both of the puts statements cause an error. Since \ and n are both preprocessing tokens, I expected the macros to produce "\n" and "\t" in those puts statements, but the Bloodshed Dev-C++ compiler gives me the following error:
24:1 G:\BIN\cLang\macro2.cpp pasting "\" and "n" does not give a valid preprocessing token
What am I missing?
The preprocessor uses a tokenizer which will require C-ish input. So even when stringifying you cannot pass random garbage to a macro. ==> Don't make your preprocessor sad - it will eat kittens if you do so too often.
Actually, there is no way to create "\n" via compile-time concatenation, since "\\" "n" is a string consisting of the two literals joined together, i.e. "\\n" (a backslash followed by the letter n, not a newline).

Resources