Why can identifiers contain '$' in C?

Recently I saw code like this:
int $ = 123;
So why can '$' be in an identifier in C?
Is it the same in C++?

This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([A-Za-z0-9_]).
Surface Level
Unlike some other languages (bash, Perl), C does not use $ to denote the use of a variable, so the character is free to appear in identifiers. Strictly speaking, support for it is implementation-defined: C11 6.4.2 permits implementation-defined characters in identifiers, and draft n4659 of C++17 contains the same allowance. In practice it is supported by modern compilers.
As for your C++ question, let's test it!
int main(void) {
    int $ = 0;
    return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, parse that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly or LLVM IR). Your question really only concerns the first part: lexing.
The grammar (and thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, the arrow operator ->, etc. Consider the C code below:
int i_love_$ = 0;
After the lexer, this becomes a token stream like this:
["int", "i_love_$", "=", "0"]
If you were to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token stream like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ doesn't treat a character like $ as special, it is simply absorbed into the identifier token, unlike a character such as the period, which terminates it.
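To make this concrete, here is a toy scanner loop of my own (a sketch, not code from any real compiler) in which $ is simply included in the "identifier character" test, so it never terminates a token:

#include <ctype.h>
#include <stdio.h>

static int is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_' || c == '$';
}

int main(void)
{
    const char *src = "int i_love_$ = 0";
    for (const char *p = src; *p; ) {
        if (is_ident_char(*p)) {              /* scan one identifier/number */
            const char *start = p;
            while (is_ident_char(*p)) p++;
            printf("token: %.*s\n", (int)(p - start), start);
        } else if (!isspace((unsigned char)*p)) {
            printf("token: %c\n", *p++);      /* single-character token */
        } else {
            p++;                              /* skip whitespace */
        }
    }
    return 0;
}

Running it prints the same token stream as above: int, i_love_$, =, 0.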

The 2018 C standard says in 6.4.2 1 that an identifier consists of a nondigit character followed by zero or more nondigit or digit characters, where the nondigit characters are:
one of the characters _, a through z, or A through Z,
a universal-character-name, which is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, that is outside certain ranges1, or
implementation-defined characters.
The digit characters are 0 through 9.
Taking GCC as an example, its documentation says these additional characters are defined in its preprocessor section, and that section says GCC accepts $ and the characters that correspond to the universal character names.2 Thus, allowing $ is a choice made by the compiler implementors.
Draft n4659 of the 2017 C++ standard has the same rules, in clause 5.10 [lex.name], except it limits the universal character names further.
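As a hedged illustration of both allowances (these lines compile with recent GCC and Clang; older GCC needed -fextended-identifiers for the first):

int \u03c0 = 3;   /* universal character name: declares a variable named π */
int $cost = 10;   /* implementation-defined character: the $ extension */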
Footnotes
1 These \u and \U forms allow you to write any character as a hexadecimal code. The excluded ranges are those in C’s basic character set and codes reserved for control characters and special uses.
2 The “universal character names” are the \u and \U forms. The characters that correspond to them are the characters those forms represent. For example, π is a character, and \u03c0 is the universal character name for it.

Related

Creating a C string parser

In C (and similar languages), a string is declared for example as "abc". Another example is "ab\"c". I have a file which contains these strings. That is, the file contents are "abc" or "ab\"c" etc. Any literal string that can be defined in a .c file can be defined in the file I'm reading.
These strings can be malformed. E.g. "abc (no closing quotes). What is the best way to write a parser to make sure the string in the file is a valid C literal string? (so that if I copy the file contents and paste them after char* str =, the resulting expression will be accepted by the compiler when at the top of a function)
The strings are each in a separate line.
Alternatively, you can think of this as wanting to parse lines that declare literal string variables. Imagine I'm grepping a big file with char\* .* = (.*);$ and want to make sure the part in the parentheses will not cause compilation errors.
The grammar for C string literals is given in C 2018 6.4.5. Supposing you want to parse only plain strings, not those with encoding prefixes such as the u in u"xyz", the grammar for a string-literal is " s-char-sequence(opt) " (the subscript "opt" means optional), and an s-char-sequence is one or more s-char tokens. An s-char is any member of the source character set except ", \, or the new-line character, or is an escape-sequence.
The source character set includes at least the Latin alphabet (26 letters A-Z) in uppercase and lowercase, the ten digits, space, horizontal tab, vertical tab, form feed, and these characters:
!"#%&'()*+,-./:;<=>?[\]^_{|}~
However, a C implementation may include other characters in its source character set. Therefore, any character found in the string other than ", \, or the new-line character must be accepted as potentially valid in some C implementation.
An escape-sequence is defined in 6.4.4.4 1 to be one of:
\ followed by ', ", ?, \, a, b, f, n, r, t, v,
\ followed by one to three octal digits, or
\x followed by one or more hexadecimal digits, or
a universal-character-name.
Paragraph 7 says:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
A universal-character-name is defined in 6.4.3 to be \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits. Paragraph 2 limits these:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
This part of the C grammar looks fairly simple to parse:
A string literal must start with a ".
If the next character is anything other than ", \, or a new-line character, then accept it.
If the next character is \ and it is followed by one of the single characters listed above, accept it and the following character.
If the next character is \ and it is followed by one to three octal digits, accept it and up to three octal digits.
If the next two characters are \x and are followed by a hexadecimal digit, accept them and all the hexadecimal digits that follow.
If the next two characters are \u and are followed by four hexadecimal digits, accept those six characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
If the next two characters are \U and are followed by eight hexadecimal digits, accept those ten characters. However, if the value is one of those prohibited in the constraint above, this is not a valid C string literal.
Repeat the above until the next character is not accepted.
If the next character is not ", this is not a valid C string literal.
If the next character is ", accept it.
If that is the end of the line read from the file, it is a valid C string literal. Otherwise, it is not.
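Here is a minimal sketch of that procedure in C. All the names are mine, and the 6.4.3 range constraints on universal character names are noted but not enforced:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static int is_octal(char c) { return c >= '0' && c <= '7'; }

/* Returns 1 if line is exactly one valid plain C string literal, else 0. */
static int valid_c_string_literal(const char *line)
{
    const char *p = line;
    if (*p++ != '"')                          /* must start with " */
        return 0;
    while (*p != '"') {
        if (*p == '\0' || *p == '\n')         /* unterminated literal */
            return 0;
        if (*p != '\\') {                     /* ordinary s-char */
            p++;
            continue;
        }
        p++;                                  /* consume the backslash */
        if (*p && strchr("'\"?\\abfnrtv", *p)) {
            p++;                              /* simple escape sequence */
        } else if (is_octal(*p)) {
            int n = 0;                        /* one to three octal digits */
            while (is_octal(*p) && n < 3) { p++; n++; }
        } else if (*p == 'x' && isxdigit((unsigned char)p[1])) {
            p++;                              /* one or more hex digits */
            while (isxdigit((unsigned char)*p)) p++;
        } else if (*p == 'u' || *p == 'U') {
            int need = (*p == 'u') ? 4 : 8;   /* \u needs 4, \U needs 8 */
            p++;
            for (int i = 0; i < need; i++, p++)
                if (!isxdigit((unsigned char)*p))
                    return 0;
            /* The 6.4.3 range constraints are not checked here. */
        } else {
            return 0;                         /* not a valid escape */
        }
    }
    return p[1] == '\0';                      /* closing " must end the line */
}

int main(void)
{
    printf("%d\n", valid_c_string_literal("\"abc\""));      /* 1 */
    printf("%d\n", valid_c_string_literal("\"ab\\\"c\""));  /* 1 */
    printf("%d\n", valid_c_string_literal("\"abc"));        /* 0 */
    return 0;
}

Reading the file line by line and calling valid_c_string_literal() on each line implements the check described above.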

How does C recognize macro tokens as arguments

#include <stdio.h>

#define function(in) in+1

int main(void)
{
    printf("%d\n", function(1));
    return 0;
}
The macro call above is correctly expanded by the preprocessor, giving:
int main(void)
{
printf("%d\n",1+1);
return 0;
}
However, if the macro body in+1 is changed to in_1, the preprocessor will not perform the argument replacement and you will end up with this:
printf("%d\n",in_1);
What is the list of tokens that the preprocessor can correctly separate in order to insert the argument? (like the + sign)
Short answer: the replacement done by the preprocessor is not simple text substitution. In your case, the argument must be an identifier, and the preprocessor can only replace tokens identical to that identifier.
The related form of preprocessing is
#define identifier(identifier-list) token-sequence
In order for the replacement to take place, the identifiers in the identifier-list and the tokens in the token-sequence must be identical in the token sense, according to C's tokenization rule (the rule for parsing a character stream into tokens).
If you agree with the fact that
in C in and in_1 are two different identifiers (and C cannot relate one to the other), while
in+1 is not an identifier but a sequence of three tokens:
(1) identifier in,
(2) operator +, and
(3) integer constant 1,
then your question is clear: in and in_1 are just two identifiers between which C does not see any relationship, and cannot do the replacement as you wish.
Reference 1: In C, there are six classes of tokens:
(1) identifiers (e.g. in)
(2) keywords (e.g. if)
(3) constants (e.g. 1)
(4) string literals (e.g. "Hello")
(5) operators (e.g. +)
(6) other separators (e.g. ;)
Reference 2: In C, an identifier is a sequence of letters (including _) and digits (the first one cannot be a digit).
Reference 3: The tokenization rule:
... the next token is the longest string of characters that could constitute a token.
That is to say, when reading in+1, the compiler reads up to the + and knows that in is an identifier. But in the case of in_1, the compiler reads all the way to the white space after it and deems in_1 a single identifier.
All references are from the Reference Manual in K&R's The C Programming Language. The language has evolved, but these rules capture the essence.
See the C11 standard, section 6.4, for the tokenization grammar.
The relevant token type here is identifier, which is defined as any sequence of letters or digits that doesn't start with a digit; also there can be \u codes and other implementation-defined characters in an identifier.
Due to the "maximal munch" principle of tokenization, character sequence in+1 is tokenized as in + 1 (not i n + 1).
If you want in_1, use two hashes (the ## token-pasting operator):
#define function(in) in ## _1
So...
function(dave) --> dave_1
function(phil) --> phil_1
And for completeness, you can also use a single hash to turn the arg into a text string.
#define function(in) printf(#in "\n");
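Putting both operators into one compilable sketch (the macro names PASTE and STRINGIZE are mine):

#include <stdio.h>

#define PASTE(in)     in ## _1   /* token pasting: builds a new identifier */
#define STRINGIZE(in) #in        /* stringizing: turns the argument into a string literal */

int main(void)
{
    int dave_1 = 42;
    printf("%d\n", PASTE(dave));       /* expands to dave_1, prints 42 */
    printf("%s\n", STRINGIZE(dave));   /* expands to "dave" */
    return 0;
}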

Isn't there a syntax error? Should printf("one" ", two and " "%s.\n", "three" ); be valid code?

Take a look at this code:
#include <stdio.h>
#define _ONE "one"
#define _TWO_AND ", two and "
int main()
{
    const char THREE[6] = "three";
    printf(_ONE _TWO_AND "%s.\n", THREE);
    return 0;
}
The printf is effectively:
printf("one" ", two and " "%s.\n", "three" );
and the output is:
one, two and three.
gcc gives neither error nor warning messages when compiling this code.
Is the gcc compiler supposed to work that way, or is it a bug?
This is standard behavior: adjacent string literals are concatenated. The C99 draft standard, section 5.1.1.2 Translation phases, paragraph 6, says:
Adjacent string literal tokens are concatenated
gcc does have many non-standard extensions, but if you build using -pedantic then gcc should warn you when it is doing something non-standard; you can read more in the documentation section Extensions to the C Language Family.
The rationale is covered in the Rationale for International Standard—Programming Languages—C and it says in section 6.4.5 String literals:
A string can be continued across multiple lines by using the backslash–newline line continuation, but this requires that the continuation of the string start in the first position of the next line. To permit more flexible layout, and to solve some preprocessing problems (see §6.10.3), the C89 Committee introduced string literal concatenation. Two string literals in a row are pasted together, with no null character in the middle, to make one combined string literal. This addition to the C language allows a programmer to extend a string literal beyond the end of a physical line without having to use the backslash–newline mechanism and thereby destroying the indentation scheme of the program. An explicit concatenation operator was not introduced because the concatenation is a lexical construct rather than a run-time operation.
You did not get any error because there is nothing wrong.
Two adjacent strings such as "A" "B" are concatenated. This is a convention of the C language.
Try gcc -E to display the preprocessed source code. You will have something like this:
int main()
{
    const char THREE[6] = "three";
    printf("one" ", two and " "%s.\n", THREE);
    return 0;
}
Then, see the correct answer from @shafik-yaghmour above.

What's the use of universal characters on POSIX systems?

In C one can pass unicode characters to printf() like this:
printf("some unicode char: %c\n", "\u00B1");
But the problem is that on POSIX-compliant systems char is always 8 bits, and most UTF-8 characters, such as the one above, are wider than that and don't fit into a char; as a result, nothing is printed on the terminal. I can do this to achieve the effect, however:
printf("some unicode char: %s\n", "\u00B1");
The %s placeholder is expanded automatically and a unicode character is printed on the terminal. Also, the standard says:
If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.
When I do this:
printf("letter a: %c\n", "\u0061");
gcc says:
error: \u0061 is not a valid universal character
So this technique is also unusable for printing ASCII characters. This article on Wikipedia, http://en.wikipedia.org/wiki/Character_(computing)#cite_ref-3, says:
A char in the C programming language is a data type with the size of exactly one byte, which in turn is defined to be large enough to contain any member of the basic execution character set and UTF-8 code units.
But is this doable on POSIX systems?
Use of universal characters in byte-based strings is dependent on the compile-time and run-time character encodings matching, so it's generally not a good idea except in certain situations. However, they work very well in wide string and wide character literals: printf("%ls", L"\u00B1"); or printf("%lc", L'\u00B1'); will print U+00B1 in the correct encoding for your locale.
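A minimal compilable sketch of that suggestion (it assumes your locale uses an encoding, such as UTF-8, that can represent U+00B1):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the user's locale so wide characters are converted to the
       terminal's encoding on output. */
    setlocale(LC_ALL, "");
    printf("as a wide string:    %ls\n", L"\u00B1");
    printf("as a wide character: %lc\n", (wint_t)L'\u00B1');
    return 0;
}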

Who determines the ordering of characters

I have a query based on the program below:
#include <stdio.h>

int main(void)
{
    char ch = 'z';
    while (ch >= 'a')
    {
        printf("char is %c and the value is %d\n", ch, ch);
        ch = ch - 1;
    }
    return 0;
}
Why is printing the whole set of lowercase letters not guaranteed in the above program? If C doesn't make many guarantees about the ordering of characters in its internal form, then who actually determines it, and how?
The compiler implementor chooses their underlying character set. About the only thing the standard has to say is that a certain minimal number of characters must be available and that the numeric characters are contiguous.
The required characters for a C99 execution environment are A through Z, a through z, 0 through 9 (which must be together and in order), any of !"#%&'()*+,-./:;<=>?[\]^_{|}~, space, horizontal tab, vertical tab, form-feed, alert, backspace, carriage return and new line. This remains unchanged in the current draft of C1x, the next iteration of that standard.
Everything else depends on the implementation.
For example, code like:
int isUpperAlpha(char c) {
    return (c >= 'A') && (c <= 'Z');
}
will break on a mainframe that uses EBCDIC, where the uppercase letters are split into non-contiguous regions.
Truly portable code will take that into account. All other code should document its dependencies.
A more portable implementation of your example would be something along the lines of:
static char chrs[] = "zyxwvutsrqponmlkjihgfedcba";
char *pCh = chrs;
while (*pCh != 0) {
printf ("char is %c and the value is %d\n", *pCh, *pCh);
pCh++;
}
If you want a real portable solution, you should probably use islower() since code that checks only the Latin characters won't be portable to (for example) Greek using Unicode for its underlying character set.
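For instance, a hedged sketch using islower(): enumerate every char value and let the locale decide which ones are lowercase letters. The order follows code-point values rather than z down to a, but the classification is correct in any encoding:

#include <ctype.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
    for (int c = 0; c <= UCHAR_MAX; c++)
        if (islower(c))
            printf("char is %c and the value is %d\n", c, c);
    return 0;
}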
Why is the printing of the whole set of lowercase letters not guaranteed in the above program?
Because it's possible to use C with an EBCDIC character encoding, in which the letters aren't consecutive.
Obviously it is determined by the implementation of C you're using, but more than likely for you it's determined by the American Standard Code for Information Interchange (ASCII).
It is determined by whatever the execution character set is.
In most cases nowadays, that is the ASCII character set, but C has no requirement that a specific character set be used.
Note that there are some guarantees about the ordering of characters in the execution character set. For example, the digits '0' through '9' are guaranteed each to have a value one greater than the value of the previous digit.
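That digit guarantee is what makes the classic conversion idiom fully portable, for example:

/* Portable on every conforming implementation, ASCII or EBCDIC alike: */
int digit_value(char c)
{
    return c - '0';   /* '0'..'9' are guaranteed contiguous and ascending */
}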
These days, people going around calling your code non-portable are engaging in useless pedantry. Support for ASCII-incompatible encodings only remains in the C standard because of legacy EBCDIC mainframes that refuse to die. You will never encounter an ASCII-incompatible char encoding on any modern computer, now or in the future. Give it a few decades, and you'll never encounter anything but UTF-8.
To answer your question about who decides the character encoding: while it's nominally at the discretion of your implementation (the C compiler, library, and OS), it was ultimately decided by the internet, both existing practice and IETF standards. Presumably modern systems are intended to communicate and interoperate with one another, and it would be a huge headache to have to convert every protocol header, html file, javascript source, username, etc. back and forth between ASCII-compatible encodings and EBCDIC or some other local mess.
In recent times, it's become clear that a universal encoding not just for machine-parsed text but also for natural-language text is highly desirable. (Natural-language text interchange is not as fundamental as machine-parsed text, but it is still very common and important.) Unicode provided the character set, and as the only ASCII-compatible Unicode encoding, UTF-8 is pretty much the successor to ASCII as the universal character encoding.
