Creating identifiers containing universal character names via token concatenation

Creating identifiers containing universal character names via token concatenation - c

I wrote this code that creates identifiers containing universal character names via token concatenation.
//#include <stdio.h>
int printf(const char*, ...);
#define CAT(a, b) a ## b
int main(void) {
//int \u306d\u3053 = 10;
int CAT(\u306d, \u3053) = 10;
printf("%d\n", \u306d\u3053);
//printf("%d\n", CAT(\u306d, \u3053));
return 0;
}
This code worked well with gcc 4.8.2 with -fextended-identifiers option and gcc 5.3.1, but didn't work with clang 3.3 with error message:
prog.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 error generated.
and local clang (Apple LLVM version 7.0.2 (clang-700.1.81)) with error message:
$ clang -std=c11 -Wall -Wextra -o uctest1 uctest1.c
warning: format specifies type 'int' but the argument has type
'<dependent type>' [-Wformat]
uctest1.c:10:17: error: use of undeclared identifier 'ねこ'
printf("%d\n", \u306d\u3053);
^
1 warning and 1 error generated.
When I used -E option to have the compilers output code with macro expanded, gcc 5.3.1 emitted this:
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"
int printf(const char*, ...);
int main(void) {
int \U0000306d\U00003053 = 10;
printf("%d\n", \U0000306d\U00003053);
return 0;
}
local clang emitted this:
# 1 "uctest1.c"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 326 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "uctest1.c" 2
int printf(const char*, ...);
int main(void) {
int \u306d\u3053 = 10;
printf("%d\n", ねこ);
return 0;
}
As you see, the identifiers declared and used in printf() matches in gcc's output, but they don't match in clang's output.
I know that creating universal character names via token concatenation invokes undefined behavior.
Quote from N1570 5.1.1.2 Translation phases:
If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined.
I thought that this character sequence \u306d\u3053 may "match the syntax of a universal character name" because it contains universal character names as its substring.
I also thought that "match" may mean that the entire token produced via concatenation stands for one universal character name, and that therefore this undefined behavior isn't invoked in this code.
Reading PRE30-C. Do not create a universal character name through concatenation, I found a comment saying this kind of concatenation is allowed:
What is forbidden, to create a new UCN via concatenation. Like doing
assign(\u0001,0401,a,b,4)
just concatenating stuff that happens to contain UCNs anywhere is okay.
And a log that shows that a code example like this case (but with 4 characters) is replaced with another code example.
Does my code example invoke some undefined behaviors (not limited to ones invoked by producing universal character names via token concatenation)?
Or is this a bug in clang?

Your code is not triggering the undefined behavior you mention, as universal character name (6.4.3) not being produced by token concatenation.
And, according to 6.10.3.3, as both the left side and the right side of operator ## is an identifier, and the produced token is also a valid preprocessing token (an identifier too), the ## operator itself not trigger an undefined behavior.
After reading description about identifier (6.4.2, D.1, D.2), universal character names (6.4.3), I'm pretty sure that it is more like a bug in clang preprocessor, which treats identifier produced by token concatenation and normal identifier differently.

Related

Preprocessor only on arbitrary file?

I wanted to demonstrate that the preprocessor is totally independant of the build process. It is another grammar and another lexer than the C language. In fact I wanted to show that the preprocessor could be applied to any type of file.
So I have this arbitrary file:
#define FOO
#ifdef FOO
I am foo
#endif
#
#
#
Something
#pragma Hello World
And I thought this would work:
$ gcc -E test.txt -o-
gcc: warning: test.txt: linker input file unused because linking not done
Unfortunately it only work with this:
$ cat test.txt | gcc -E -
Why is this error with GCC?

You need to tell gcc it's a C file.
gcc -xc -E test.txt

The C compiler uses the file name suffix as an indicator of the files that have to be compiled (ending in .c) files that have only to be linked (ending in .o or .so) For the files ending in .s it calls the assembler as(1) and for files ending in .f it call the fortran compiler, and for .cc it switches to C++ compiling.
Indeed, normally, C compilers take everything they don't match as a linker file, so once you pass it a linker file, it tries to link it, calling the linker ld(1). This is what happens with your .txt file. The linker has some similar way to recognise ld(1) scripts against object or shared object files.
BTW, the CPP language is indeed a macro language, but there's some similarities with C that cannot be avoided. It has, at least, to recognise C identifiers, as macro names have the same syntax as C identifiers, and it has to check that an identifier matches a macro name or not. In other side... It has to recognise C comments, and C strings (it indeed eliminates comments for the compiler), as macro expansion doesn't enter to expand inside them, and it has also to recognize parenthesis (they are considered for macro parameter detection and the , symbol, used to separate parameters). It also recognizes (inside the macro string) the tokens # (to stringify a parameter) and ## (to catenate and merge two symbols into one) (this last operator must force cpp to recognise almost any C token, as it must check for errors if you try to merge something like +##+ into ++, which is an error)
So, the conclussion is: the cpp doesn't have to implement the whole C syntax as a language, but the tokens of the C language must be recognised almost completely. The standard for the C language forces the c preprocessor to tokenize the input, so the ## operator can be used to merge tokens (and to check for validity) This means that, if you define a macro like:
#define M(p) +p
and then you call it like:
a = +M(-c);
you will get a string similar to:
a = + +-c;
in the output (it will insert a space in between the two + signs, so they don't get merged into ++ operator. The symbols + and - are together, because they will never be scanned as one token) See the next example (input is preceded by > symbol)
$ cpp - <<EOF
> #define M(p) +p
> a = +M(p);
> b = -M(p);
> p = +M(+p);
> p = +M(-p);
> EOF
# 1 "<stdin>"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 346 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "<stdin>" 2
a = + +p;
b = -+p;
p = + + +p;
p = + +-p;
Another example will show more difficulties in parsing the tokens (input is delimited with >, stderr with >> and stdout is unquoted):
$ cpp - <<EOF
#define M(a,b) a##b
> a = M(a+,+b)
> a = M(a+,-b)
> a = M(a,+b)
> a = M(a,b)
> a = M(a,300)
> a = M(a,300.2)
> EOF
>> <stdin>:3:5: error: pasting formed '+-', an invalid preprocessing token
>> a = M(a+,-b)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> <stdin>:4:5: error: pasting formed 'a+', an invalid preprocessing token
>> a = M(a,+b)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> <stdin>:7:5: error: pasting formed 'a300.2', an invalid preprocessing token
>> a = M(a,300.2)
>> ^
>> <stdin>:1:17: note: expanded from macro 'M'
>> #define M(a,b) a##b
>> ^
>> 3 errors generated.
# 1 "<stdin>"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 346 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "<stdin>" 2
a = a++b
a = a+-b
a = a+b
a = ab
a = a300
a = a 300.2
As you can see in this example, merging a and 300 goes fine, as one token makes an identifier, which is valid and cpp(1) doesn't complain, but when merging a and 300.2 the resulting token a300.2 is not a valid token in C, so it is rejected (it is also not joined and the tool inserts a space, to make the compiler see both tokens as separate ---should it joined both together, they would have been scanned as the tokens a300 and .2).
If you want to use a language independent macro preprocesor, consider using m4(1) as a macro language. It's far more powerful than cpp in many ways. But beware, it's difficult to learn due to the complexity of macro expansions it allows.

You can use the C preprocessor, cpp (or the more traditional form, /lib/cpp):
cpp test.txt
or
/lib/cpp test.txt

How does preprocessing work in this snippet?

#include<stdio.h>
#define A -B
#define B -C
#define C 5
int main() {
printf("The value of A is %dn", A);
return 0;
}
I came across the above code. I thought that after preprocessing, it gets transformed to
// code from stdio.h
int main() {
printf("The value of A is %dn", --5);
return 0;
}
which should result in a compilation error. But, the code compiles fine and produces output 5.
How does the code get preprocessed in this case so that it does not result into a compiler error?
PS: I am using gcc version 8.2.0 on Linux x86-64.

The preprocessor is defined as operating on a stream of tokens, not text. You have to read through all of sections 5.1.1, 6.4, and 6.10 of the C standard to fully understand how this works, but the critical bits are in 5.1.1.1 "Phases of translation": in phase 3, the source file is "decomposed into preprocessing tokens"; phases 4, 5, and 6 operate on those tokens; and in phase 7 "each preprocessing token is converted into a token". That indefinite article is critical: each preprocessing token becomes exactly one token.
What this means is, if you start with this source file
#define A -B
#define B -C
#define C 5
A
then, after translation phase 4 (macro expansion, among other things), what you have is a sequence of three preprocessing tokens,
<punctuator: -> <punctuator: -> <pp-number: 5>
and at the beginning of translation phase 7 that becomes
TK_MINUS TK_MINUS TK_INTEGER:5
which is then parsed as the expression -(-(5)) rather than as --(5). The standard offers no latitude in this: a C compiler that parses your example as --(5) is defective.
When you ask a compiler to dump out preprocessed source as text, the form of that text is not specified by the standard; typically, what you get has whitespace inserted as necessary so that a human will understand it the same way translation phase 7 would have.

Joining lines in C [Strange output not compatible with documentation]

I started coding C in vim and I have some problems.
The backslash is intended to join lines but when I try to write:
ret\
urn 0;
I get
return
0;
and when I add spaces before urn; it stay like that without join.
ret\
urn 0;
it stay like that.
why in the second case I don't get return 0; but
ret
urn 0;
code:
CPP output:
command:
gcc -E -Wall -Wextra -Wimplicit -pedantic -std=c99 main.c -o output.i
GCC 5.4,
Vim 7.4

-E output is not officially specified by the standard. It's an engineering tradeoff among several different design constraints, of which the relevant two are:
whitespace must be inserted or deleted as necessary so that the "compiler proper" (imagine feeding the -E output back into gcc -fpreprocessed — which is what -save-temps does) sees the same sequence of pp-tokens that it would normally (without -E). (See C99 section 6.4 for the definition of a pp-token.)
to the maximum extent possible, tokens should appear at the same line and column position that they did in the original source code, so that error messages and debugging information are as accurate as possible.
Here's how this applies to your examples:
ret\
urn 0;
The backslash-newline combines ret and urn into a single pp-token, which must therefore appear all together on one line in the output. The 0 and ;, however, should continue to be on their original line and column so that diagnostics are accurate. So you get
return
0;
with spaces inserted to keep the 0 in its original column.
ret\
urn 0;
Here the backslash-newline is immediately followed by whitespace, so ret and urn do not have to be combined, so, again, the diagnostics are most accurate if everything stays where it originally was, and the output is
ret
urn 0;
which looks like the backslash-newline had no effect at all.
You might find the output of gcc -E -P less surprising. -P tells the preprocessor not to bother trying to preserve token position (and also turns off all those lines beginning with # in the output). Your examples produce return 0; and ret urn 0;, both all on one line, in -P mode.
Finally, a word of advice: everyone who ever has to read your code (and that includes yourself six months later) will appreciate it if you never split a token in the middle with backslash-newline, except for very long string literals. It's a legacy misfeature that wouldn't be included in the language if it were designed from scratch today.

The white space is a token separator. Just because you split the line doesn't mean a white space will be ignored.
What the compiler sees is something like ret urn;. Which is not valid C, since it's two tokens which probably weren't defined before, nor are they in a valid expression.
Keywords must be written as a single token with no spaces.
Now, when you do :
ret\
urn;
The backslash followed by a newline is removed in the early translation phases, and the subsequent line is appended. If the line has no white spaces at the beginning, the result is a valid token that the compiler understands as the keyword return.
Long story short, you seem to be asking about specific behavior for GCC. It seems like a compiler bug. Since clang does the expected thing (although the line count remains the same):
clang -E -Wall -Wextra -Wimplicit -pedantic -std=c99 -x c main.cpp
# 1 "main.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 316 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "main.cpp" 2
int main(void) {
ret urn 0;
}
It doesn't seem crucial however, since in this particular case the code will be invalid either way.

The behavior of the C preprocessor on \ followed by a newline is to remove both bytes from the input. This is done in a very early phase of the parsing. Yet the preprocessor retains the original line number for each token it sees and tries to output tokens on separate lines for the compiler to issue correct diagnostics for later phases of compilation.
For the input:
ret\
urn 1;
it may produce:
#line 1 "myfile.c"
return
#line 2 "myfile.c"
1;
Which it may shorten as
return
1;
Note that you can split any input line at any position with an escaped newline:
#inclu\
de <st\
dio.h>\
"Hello word\\
n"
for (i = 0; i < n; i+\
+)
ret\
\
\
urn;
\
r\
et\
urn\
123;\

Why "foo\\<NEWLINE>bar" becomes "foo\bar" after "gcc -E"?

See following example:
$ cat foo.c
int main()
{
char *p = "foo\\
bar";
return 0;
}
$ gcc -E foo.c
# 1 "foo.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "foo.c"
int main()
{
char *p = "foo\bar";
return 0;
}
$
From my understanding the 2nd \ is escaped by the 1st \ so the 2nd \ should not be combined with the following <NEWLINE> to form the line continuation.

The rules are quite explicit in ISO/IEC 9899:2011 §5.1.1.2 Translation Phases:
Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice.
The character preceding the final backslash is not consulted. Phase 1 converts trigraphs into regular characters. That matters because ??/ is the trigraph for \.

The preprocessor removes all occurrences of backslash-newline before even trying to tokenize the input; there is no escape mechanism for this. It's not limited to string literals either:
#inclu\
de <st\
dio.h>
int m\
ain(void) {
/\
* yes, this is a comment */
put\
s("Hello,\
world!");
return 0;
}
This is valid code.
Using \\ to get a single \ only applies to string and character literals and happens much later in processing.

Why does including a header using the full path lead to better error messages?

There was a recent post on Ask Ubuntu, where OP was trying to compile a program which included term.h. When the code had #include <term.h>, the errors were:
In file included from clear_screen_UNIX.c:5:0:
clear_screen_UNIX.c:9:6: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘->’ token
void clear_screen(void) {
^
clear_screen_UNIX.c: In function ‘main’:
clear_screen_UNIX.c:23:14: error: called object is not a function or function pointer
clear_screen();
^
clear_screen_UNIX.c:26:14: error: called object is not a function or function pointer
clear_screen();
The OP then included the full path to term.h (#include "/usr/include/term.h"), which led to the much more useful message:
In file included from clear_screen_UNIX.c:7:0:
/usr/include/term.h:125:21: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘->’ token
#define CUR cur_term->type.
^
/usr/include/term.h:202:40: note: in expansion of macro ‘CUR’
#define clear_screen CUR Strings[5]
^
clear_screen_UNIX.c:9:6: note: in expansion of macro ‘clear_screen’
void clear_screen(void) {
^
clear_screen_UNIX.c: In function ‘main’:
clear_screen_UNIX.c:23:14: error: called object is not a function or function pointer
clear_screen();
^
clear_screen_UNIX.c:26:14: error: called object is not a function or function pointer
clear_screen();
These messages clearly indicate the problem is due to a macro expansion.
I verified the results myself, too. I wonder why GCC produced much better errors when the full path was given. Can I make it produce similar messages with the system include syntax as well?
I am using GCC 4.9.2, and I suspect OP was using GCC 4.8.2 (given the version of Ubuntu).

Conclusion
The reason GCC gives different/better messages if the full path of the header is given is that the GCC preprocessor gives information to GCC's cc1 compiler that the included header is a system header file or local header file by some digits at the end of the comment line of the preprocessor-generated .i file.
Then the cc1 compiler will generate more helpful messages if the header file is an local header file, and will suppress some error message if the header file is a system header, according to the GCC documentation.
To make the normal version of the code output error messages just like the code which specified the full path of the header file, as you asked for, GCC needs to stop including all system directories by specifying option -nostdinc, and then explicitly tell GCC the directories it could search for header files while not treat the directory as system directory using -I flag.
For your code, the command line could be like below (GCC_INCLUDE_DIR is your default GCC include directory, for the system default GCC, it could be /usr/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include/) :
gcc -c t.c -nostdinc -I/usr/include/ -IGCC_INCLUDE_DIR
Source code
Moved the source code here from this original post to make this answer more helpful.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <term.h>
//#include "/usr/include/term.h"
void clear_screen(void) {
if (!cur_term) {
int result;
setupterm( NULL, STDOUT_FILENO, &result );
if (result <= 0)
return;
}
putp( tigetstr( "clear" ) );
}
int main(void) {
puts("I am going to clear the screen");
sleep(1);
clear_screen();
puts("Screen Cleared");
sleep(1);
clear_screen();
return 0;
}
The difference between Preprocessor Generated Files
You could use the following command line to ask GCC to output the preproceser generated code. This code will be fed into the actual compiler of GCC, cc1. If the preprocessor generated files are exactly the same, the cc1 compiler's behavior should be exactly the same. (Assuming the code is put into file t.c)
gcc -E t.c -o t.i
Following is the difference between the two gcc preprocessor generated .i files. The t.fullpath.i is the file generated with the full path header file, while t.i is the code without full path (Some of the diff output have been removed, since they are only filename differences.)
$ diff t.i t.fullpath.i
2920,2922c2920,2924
< # 1 "/usr/include/term.h" 1 3 4
< # 47 "/usr/include/term.h" 3 4
---
> # 1 "/usr/include/term.h" 1
> # 47 "/usr/include/term.h"
2924,2925c2926,2927
< # 48 "/usr/include/term.h" 2 3 4
< # 80 "/usr/include/term.h" 3 4
---
> # 48 "/usr/include/term.h" 2
> # 80 "/usr/include/term.h"
3007,3008c3009,3010
< # 81 "/usr/include/term.h" 2 3 4
< # 673 "/usr/include/term.h" 3 4
---
> # 81 "/usr/include/term.h" 2
> # 673 "/usr/include/term.h"
3041c3043
< # 729 "/usr/include/term.h" 3 4
---
> # 729 "/usr/include/term.h"
The different flag in the preprocessor generated code's comments makes the difference
GCC's cc1 compiler will take advantage of the preprocessor genrated information to generate the source code location of the error message, and also the debug information which will be used for gdb in the future.
For the following format:
# line-number "source-file" [flags]
The digits 3 and 4 of the flags mean:
3: Following text comes from a system header file (#include <> vs #include "")
4: Following text should be treated as being wrapped in an implicit extern "C" block.
For more information about different kind of these flags, please refer to this link.
Therefore, for the code without fully specified path, cc1 compiler will treat it as a system header file, and assume that system code is mostly correct, and then just output error message of the user's code. That is why the error message is shorter.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight