Even though an oldtimer, I fear I do not (anymore) have a complete grasp of parsing of constants in C. The second of the following 1-liners fails to compile:
int main( void ) { return (0xe +2); }
int main( void ) { return (0xe+2); }
$ gcc -s weird.c
weird.c: In function ‘main’:
weird.c:1:28: error: invalid suffix "+2" on integer constant
int main( void ) { return (0xe+2); }
^
The reason for the compilation failure is probably that 0xe+2 is parsed as a hexadecimal floating point constant as per C11 standard clause 6.4.4.2. My question is whether a convention exists to write simple additions of hexadecimal and decimal numbers in C, I do not like to have to rely on white space in parsing.
This was with gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9). Stopping compiling after preprocessing (-E) show that the compilation failure happens in gcc not cpp.
Because GCC thinks that 0xe+2 is a floating point number, while this is just an addition of two integers.
According to cppreference:
Due to maximal munch, hexadecimal integer constants ending in e and E,
when followed by the operators + or -, must be separated from the
operator with whitespace or parentheses in the source:
int x = 0xE+2; // error
int y = 0xa+2; // OK
int z = 0xE +2; // OK
int q = (0xE)+2; // OK
My question is whether a convention exists to write simple additions of hexadecimal and decimal numbers in C
The convention is to use spaces. This is actually mandated by C11 6.4 §3:
Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both.
Where plain space is the commonly used one.
Similar exotic issues exist here and there in the language, some examples:
---a must be rewritten as - --a.
a+++++b must be rewritten as a++ + ++b.
a /// comment
b;
must be rewritten as
a / // comment
b
And so on. The culprit in all of these cases is the token parser which follows the so-called "maximal munch rule", C11 6.4 §4:
If the input stream has been parsed into preprocessing tokens up to a given character, the
next preprocessing token is the longest sequence of characters that could constitute a
preprocessing token.
In this specific case, the pre-processor does not make any distinction between floating point constants and integer constants, when it builds up a pre-processing token called pp-number, defined in C11 6.4.8:
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
A preprocessing number begins with a digit optionally preceded by a period (.) and may
be followed by valid identifier characters and the character sequences e+, e-, E+, E-,
p+, p-, P+, or P-.
Here, pp-number does apparently not have to be a floating point constant, as far as the pre-processor is concerned.
( As a side note, a similar convention also exists when terminating hexadecimal escape sequences inside strings. If I for example want to print the string "ABBA" on a new line, then I can't write
puts("\xD\xABBA"); (CR+LF+string)
Because the string in this case could be interpreted as part of the hex escape sequence. Instead I have to use white space to end the escape sequence and then rely on pre-processor string concatenation: puts("\xD\xA" "BBA"). The purpose is the same, to guide the pre-processor how to parse the code. )
Related
This question already has answers here:
Multicharacter literal in C and C++
(6 answers)
Closed 5 years ago.
While making a compiler for C language, I was looking for its grammar. I stumbled upon ANSI C Grammar (Lex). There I found the following regular expression (RE):
{CP}?"'"([^'\\\n]|{ES})+"'" { return I_CONSTANT; }
where CP and ES are as follows
CP (u|U|L)
ES (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))
I understand that ES is RE for escape sequence.
If I understand the regular expression correctly then, u'123' or U'\n\t' or L'abc' are valid I_CONSTANTs.
I wrote the following small program to see what constant values they represent.
#include <stdio.h>
int main(void) {
printf("%d %d %d\n", u'123', U'\n\t', L'abc');
return 0;
}
This gave the following output.
51 9 99
I deciphered that they represent the ASCII value of the right-most character inside single quotes.
However, what I fail to understand is the use and importance of this kind of integer constant.
These are multicharacter literals, and their value is implementation-defined.
From C11 6.4.4.4 p10-11:
The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
...
The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.
From your testing, it looks like GCC chooses to ignore all but the rightmost character on wide multicharacter literals. However, if you don't specify L, u, or U, GCC will combine the characters given as different bytes of the resulting integer, in an order depending on endianess. This behavior should not be relied on in portable code.
I'm reading through some emulator code and I've countered something truly odd:
switch (reg){
case 'eax':
/* and so on*/
}
How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?
(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)
In order to achieve program readability, the witty developer is exploiting implementation defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32 bit int). And everyone knows you can switch on an int!
Finally, a standard reference:
The C99 standard says:
6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or
escape sequence that does not map to a single-byte execution
character, is implementation-defined."
According to the C Standard (6.8.4.2 The switch statement)
3 The expression of each case label shall be an integer constant
expression...
and (6.6 Constant expressions)
6 An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of
casts. Cast operators in an integer constant expression shall only
convert arithmetic types to integer types, except as part of an
operand to the sizeof operator.
Now what is 'eax'?
The C Standard (6.4.4.4 Character constants)
2 An integer character constant is a sequence of one or more
multibyte characters enclosed in single-quotes, as in 'x'...
So 'eax' is an integer character constant according to the paragraph 10 of the same section
...The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.
Pay attention to that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes) that has a type of a character array.
As other have said, this is an int constant and its actual value is implementation-defined.
I assume the rest of the code looks something like
if (SOMETHING)
reg='eax';
...
switch (reg){
case 'eax':
/* and so on*/
}
You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.
In a comment #Davislor lists some possible values for 'eax':
... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else
Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx',
and so on. If all these constants have the same value as 'e' you end up with
switch (reg){
case 'e':
...
case 'e':
...
...
}
This doesn't look too good, does it?
The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.
The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.
As #zwol pointed out in the comments, the situation is not quite as bad as I thought, in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.
The code fragment uses an historical oddity called multi-character character constant, also referred to as multi-chars.
'eax' is an integer constant whose value is implementation defined.
Here is an interesting page on multi-chars and how they can be used but should not:
http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html
Looking back further away into the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.
2.3.2 Character constants
A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
BS \b
NL \n
CR \r
HT \t
ddd \ddd
\ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.
Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in
the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.
The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.
Just like C code:
#include<stdio.h>
int main(void) {
char c = '\97';
printf("%d",c);
return 0;
}
the result is 55,but I can't understand how to calculate it.
I know the Octal number or hex number follow the '\', does the 97 is hex number?
\ is a octal escape sequence but 9 is not a valid octal digit so instead of interpreting it as octal it is being interpreted as a multi-character constant a \9 and a 1 whose value is implementation defined. Without any warning flags gcc provides the following warnings by default:
warning: unknown escape sequence: '\9' [enabled by default]
warning: multi-character character constant [-Wmultichar]
warning: overflow in implicit constant conversion [-Woverflow]
The C99 draft standard in section 6.4.4.4 Character constants paragraph 10 says (emphasis mine):
An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.
The value of an integer character constant containing more than one character (e.g.,
'ab'), or containing a character or escape sequence that does not map to a single-byte
execution character, is implementation-defined.
For example gcc implementation is documented here and is as follows:
The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed, regardless of whether single characters are signed or not (a slight change from versions 3.1 and earlier of GCC). If there are more characters in the constant than would fit in the target int the compiler issues a warning, and the excess leading characters are ignored.
For example, 'ab' for a target with an 8-bit char would be interpreted as ‘(int) ((unsigned char) 'a' * 256 + (unsigned char) 'b')’, and '\234a' as ‘(int) ((unsigned char) '\234' * 256 + (unsigned char) 'a')’.
As far as I can tell this is being interpreted as:
char c = ((unsigned char)'\71')*256 + '7' ;
which results in 55, which is consistent with the multi-character constant implementation above although the translation of \9 to \71 is not obvious.
Edit
I realized later on what is really happening is the \ is being dropped and so \9 -> 9, so what we really have is:
c = ((unsigned char)'9')*256 + '7' ;
which seems more reasonable but still arbitrary and not clear to me why this is not a straight out error.
Update
From reading The Annotated C++ Reference Manual we find out that in Classic C and older versions of C++ when backslash followed character was not defined as an scape sequence it was equal to the numeric value of the character. ARM section 2.5.2:
This differs from the interpretation by Classic C and early versions of C++, where the value of a sequence of a blackslash followed by a character in the source character set, if not defined as an escape sequence, was equal to the numeric value of the character. For example '\q' would be equal to 'q'.
\9 is not a valid escape, so the compiler ignores it and ascii '7' is 55.
I would not depend on this behavior, it's probably undefined. But that's where the 55 came from.
edit: Shafik points out it's not undefined, it's implementation defined. See his answer for the references.
First of all, I'm going to assume your code should read this, because it matches your title.
#include<stdio.h>
int main(void) {
char c = '\97';
printf("%d",c);
return 0;
}
\9 isn't valid, thus let's just assume the character is actually 7. 7 is ascii 55, which is the answer that was printed out.
I'm not sure what you wanted, but \97 isn't it...
\9 isn't a valid escape sequence, so it's likely falling back to a plain 9 character.
This means that it's the same thing as '97', which is undefined implementation defined (see Shafik Yaghmour's answer) behavior (2 characters can't fit into 1 character...).
To avoid things like this in the future, consider cranking up the warnings on your compiler. For example, a minimum for gcc should be -Wall -Wextra -pedantic.
It seems that LynxOS's implementation of strtod doesn't handle all the same cases as that of Linux, or for that matter Solaris. The problem I have is that I am trying to parse some text that can have decimal or hexadecimal numbers in it.
On Linux I call
a = strtod(pStr, (char **)NULL);
and I get the expected values in a for input strings such as 1.234567 and 0x40.
On LynxOS, the decimal numbers parse correctly, but the hex parses simply as 0 due to stopping when it hits the 'x'. Looking at the man pages, it seems that LynxOS's strtod only supports decimal strings in the input.
Does anyone here know of an alternative that will work on both Lynx and Linux?
Quote from the Standard (7.20.1.3) ( http://www.open-std.org/JTC1/sc22/wg14/www/docs/n1256.pdf )
The expected form of the subject sequence is an optional plus or minus sign, then one of
the following:
— a nonempty sequence of decimal digits optionally containing a decimal-point
character, then an optional exponent part as defined in 6.4.4.2;
— a 0x or 0X, then a nonempty sequence of hexadecimal digits optionally containing a
decimal-point character, then an optional binary exponent part as defined in 6.4.4.2;
— [...]
So, the compiler you're using on LynxOS is not a C99 compiler.
My copy of the C89 Standard has no reference to the 0x prefix:
4.10.1.4 The strtod function
[...]
The expected form of the subject sequence is an optional plus or
minus sign, then a nonempty sequence of digits optionally containing a
decimal-point character, then an optional exponent part [...]
strtod takes 3 arguments, not two. If you had prototyped it by including the correct header (stdlib.h), your compiler would have issued an error. Since you're calling a function with the wrong signature, your program has undefined behavior. Fix this and everything will be fine.
My homework is to write a regular expression representing the language of numerical literals from C programming language. I can use l for letter, d for digit, a for +, m for -, and p for point. Assume that there are no limits on the number of consecutive digits in any part of the expression.
Some of the examples of valid numerical literals were 13. , .328, 41.16, +45.80, -2.e+7, -.4E-7, 01E-06, +0
I came up with: (d+p+a+m)(d+p+E+e+a+m)*
update2: (l+d+p+a+m)(d+p+((E+e)(a+m+d)d*) )* im not sure how to prevent something like 1.0.0.0eee-e1.
Your regular expression does not support the various suffixes (l, u, f, etc.), nor does it support hexadecimal or octal constants.
The leading signs (+ or - in front of the number) are not lexically part of the constant; they are the unary + and - operators. Effectively, all integer and floating constants are positive.
If you need to fully support C99 floating constants, you need to support hexadecimal exponents (p instead of e).
Your regular expression also accepts many invalid sequences of characters, like 1.0.0.0eee-e1.
A single regular expression to match all C integer and floating literals would be quite long.
Untested, but this should be along the right lines for decimal at least. (Also, it accepts the string ".", or I think it does anyway; to fix that would eliminate the last of the common code between integer and FP, the leading [0-9]*.)
[0-9]*([0-9]([uU](ll?+LL?)+(ll?+LL?)?[uU]?)+(\.[0-9]*)?([eE][+-]?[0-9]+)[fFlL])
This Regex will match all your need:
[+-]?(?P<Dot1>\.)?\d+(?(Dot1)(?#if_dot_exist_in_the_beginning__do_nothing)|(?#if_dot_not_exist_yet__we_accept_optional_dot_now)(?P<Dot2>\.)?)\d*(?P<Exp>[Ee]?)(?(Exp)[+-]?\d*)