Regular expression for constants in C

I want to write a regular expression for constants in the C language. So I tried this:
Let
digit -> 0-9,
digit_oct -> 0-7,
digit_hex -> 0-9 | a-f | A-F
Then:
RE = digit+ U 0digit_oct+ U 0xdigit_hex+
I want to know whether the R.E. I have written is correct. Is there any other way of writing this?

There is another type of integer constants, namely integer character constants such as 'a' or '\n'. In C99 these are constants and their type is just int.
The best regular expressions for all these are found in the standard, section 6.4, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

The 'RE' makes sense if we interpret the 'U' as being similar to set union. However, it is more conventional to use a '|' symbol to denote alternatives.
First, you are only dealing with integer constants, not with floating point or character or string constants, let alone more complex constants.
Second, you have omitted '0X' as a valid hex prefix.
Third, you have omitted the various suffixes: U, L, LL, ULL (and their lower-case and mixed case synonyms and permutations).
Also, the C standard (§6.4.4.1) distinguishes between digits and non-zero digits in a decimal constant:
decimal-constant:
nonzero-digit
decimal-constant digit
Any integer constant starting with a zero is an octal constant, never a decimal constant. In particular, writing 0 is writing an octal constant.
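The points above (nonzero first digit for decimal, a leading 0 meaning octal, 0x or 0X for hex, optional suffixes) can be sketched as a small hand-written matcher. This is only an illustration, not the grammar from the standard verbatim; the names int_kind, is_int_suffix and classify_int_constant are made up for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Kinds of integer constant distinguished by C99 6.4.4.1. */
enum int_kind { INT_INVALID, INT_DECIMAL, INT_OCTAL, INT_HEX };

/* Accepts an optional suffix: u, l, ll, ul, ull, lu, llu in either letter
   case. A mixed-case "lL" or "Ll" is rejected, as in the standard. */
static bool is_int_suffix(const char *s)
{
    bool seen_u = false, seen_l = false;

    while (*s != '\0') {
        if ((*s == 'u' || *s == 'U') && !seen_u) {
            seen_u = true;
            s++;
        } else if ((*s == 'l' || *s == 'L') && !seen_l) {
            seen_l = true;
            /* "ll" or "LL": both letters must have the same case. */
            s += (s[1] == s[0]) ? 2 : 1;
        } else {
            return false;
        }
    }
    return true;
}

/* Classifies a complete string as a decimal, octal or hex integer constant. */
enum int_kind classify_int_constant(const char *s)
{
    enum int_kind kind;
    size_t i;

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        kind = INT_HEX;
        i = 2;
        if (!isxdigit((unsigned char)s[i]))
            return INT_INVALID;      /* "0x" needs at least one hex digit */
        while (isxdigit((unsigned char)s[i]))
            i++;
    } else if (s[0] == '0') {
        kind = INT_OCTAL;            /* a lone "0" is an octal constant too */
        i = 1;
        while (s[i] >= '0' && s[i] <= '7')
            i++;
    } else if (s[0] >= '1' && s[0] <= '9') {
        kind = INT_DECIMAL;
        i = 1;
        while (isdigit((unsigned char)s[i]))
            i++;
    } else {
        return INT_INVALID;
    }
    return is_int_suffix(s + i) ? kind : INT_INVALID;
}
```

With this sketch, classify_int_constant("0") gives INT_OCTAL and classify_int_constant("08") gives INT_INVALID, because 8 is not an octal digit.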

First, C does not support Unicode literals, so you can eliminate the last rule. You also only define integer literals, not floating-point literals and not string or character literals. For the sake of my convenience I assume that that is what you intended.
INT := OCTINT | DECINT | HEXINT
DECINT := [1-9] [0-9]* [uU]? [lL]? [lL]?
OCTINT := 0 [0-7]* [uU]? [lL]? [lL]?
HEXINT := 0[xX] [0-9a-fA-F]+ [uU]? [lL]? [lL]?
These only describe the form of the literals, not any logic such as maximum values.

From a Perl point of view, I came up with the following regexp after reading ISO C 2011:
my $I_CONSTANT = qr/^(?:(0[xX][a-fA-F0-9]+(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Hexadecimal
|([1-9][0-9]*(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Decimal
|(0[0-7]*(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Octal
|([uUL]?'(?:[^'\\\n]|\\(?:[\'\"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))+') # Character
)$/x;

Related

This source code is switching on a string in C. How does it do that?

I'm reading through some emulator code and I've encountered something truly odd:
switch (reg){
case 'eax':
/* and so on*/
}
How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?
(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)
In order to achieve program readability, the witty developer is exploiting implementation defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32 bit int). And everyone knows you can switch on an int!
Finally, a standard reference:
The C99 standard says:
6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or
escape sequence that does not map to a single-byte execution
character, is implementation-defined."
According to the C Standard (6.8.4.2 The switch statement)
3 The expression of each case label shall be an integer constant
expression...
and (6.6 Constant expressions)
6 An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of
casts. Cast operators in an integer constant expression shall only
convert arithmetic types to integer types, except as part of an
operand to the sizeof operator.
Now what is 'eax'?
The C Standard (6.4.4.4 Character constants)
2 An integer character constant is a sequence of one or more
multibyte characters enclosed in single-quotes, as in 'x'...
So 'eax' is an integer character constant according to the paragraph 10 of the same section
...The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.
Note that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes), which has a character array type.
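The difference in type shows up in sizeof. A minimal sketch, valid for C only (in C++ a single-character constant has type char); the two function names are made up for the example:

```c
#include <stddef.h>

/* In C, a character constant has type int, so this equals sizeof(int). */
size_t size_of_char_constant(void)
{
    return sizeof 'a';
}

/* "eax" is a string literal: an array of 4 char ('e', 'a', 'x', '\0'). */
size_t size_of_string_literal(void)
{
    return sizeof "eax";
}
```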
As others have said, this is an int constant and its actual value is implementation-defined.
I assume the rest of the code looks something like
if (SOMETHING)
reg='eax';
...
switch (reg){
case 'eax':
/* and so on*/
}
You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.
In a comment @Davislor lists some possible values for 'eax':
... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else
Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx',
and so on. If all these constants have the same value as 'e' you end up with
switch (reg){
case 'e':
...
case 'e':
...
...
}
This doesn't look too good, does it?
The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.
The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.
As @zwol pointed out in the comments, the situation is not quite as bad as I thought: in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.
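One way around the portability problem, sketched under the assumption that the register codes only need to be distinct integers: build the values with a macro whose byte layout you choose yourself. REG3 and reg_index are hypothetical names, not taken from the emulator code in the question.

```c
#include <stdint.h>

/* Hypothetical helper: packs three characters into one integer with a
   layout fixed by us, not by the compiler. With ASCII, REG3('e','a','x')
   is 0x656178. */
#define REG3(a, b, c) \
    (((uint32_t)(unsigned char)(a) << 16) | \
     ((uint32_t)(unsigned char)(b) << 8)  | \
      (uint32_t)(unsigned char)(c))

/* Returns a small register index, or -1 for an unknown code. The case
   labels are integer constant expressions, so every compiler must accept
   them and give each label a distinct value. */
int reg_index(uint32_t reg)
{
    switch (reg) {
    case REG3('e', 'a', 'x'): return 0;
    case REG3('e', 'b', 'x'): return 1;
    case REG3('e', 'c', 'x'): return 2;
    default:                  return -1;
    }
}
```

The code stays nearly as readable as case 'eax':, but the value no longer depends on how a particular compiler treats multi-character constants.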
The code fragment uses a historical oddity called a multi-character character constant, also referred to as a multi-char.
'eax' is an integer constant whose value is implementation defined.
Here is an interesting page on multi-chars and how they can be used but should not:
http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html
Looking back further away into the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.
2.3.2 Character constants
A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
BS \b
NL \n
CR \r
HT \t
ddd \ddd
\ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.
Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in
the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.
The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.
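For the curious, the PDP-11 layout described in the manual can be written out explicitly. This is only a sketch of the quoted rule (first character in the low byte, second in the high byte); pdp11_char_constant is a made-up name and says nothing about what a modern compiler does.

```c
#include <stdint.h>

/* Value of a 1- or 2-character constant as the old manual describes it:
   the first character goes in the low-order byte, the second (0 for a
   constant of length 1) in the high-order byte. */
uint16_t pdp11_char_constant(unsigned char first, unsigned char second)
{
    return (uint16_t)(first | ((unsigned)second << 8));
}
```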

How to escape from hex to decimal

I apologise if this is an obvious question. I've been searching online for an answer to this and cannot find one. This isn't relevant to my code per se, it's a curiosity on my part.
I am looking at testing my function to read start and end bytes of a buffer.
If I declare a char array as:
char *buffer;
buffer = "\x0212\x03";
meaning STX12ETX - switching between hex and decimal.
I get the expected error:
warning: hex escape sequence out of range [enabled by default]
I can test the code using all hex values:
"\x02\x31\x32\x03"
I am wanting to know, is there a way to escape the hex value to indicate that the following is a decimal value?
Will something like this work for you?
char *buffer;
buffer = "\x02" "12" "\x03";
According to the standard:
§ 5.1.1.2 6. Adjacent string literal tokens are concatenated.
§ 6.4.4.4 3. and 7. Each octal or hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence.
the escape characters:
\' - single quote '
\" - double quote "
\? - question mark ?
\\ - backslash \
\octal digits
\xhexadecimal digits
So the only way to do it is string literal concatenation (listing the literals one after another).
If you want to know more about how the compiler constructs literals, look at §6.4.4.4 and §6.4.5; they describe character constants and string literals respectively.
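A quick way to convince yourself that the concatenated form really yields STX, '1', '2', ETX is to compare it byte by byte. A minimal sketch; framed and is_stx_12_etx are names invented for the example:

```c
#include <stddef.h>
#include <string.h>

/* Escape sequences are translated before adjacent literals are joined,
   so \x02 cannot absorb the following "12". */
static const char framed[] = "\x02" "12" "\x03";

/* Returns 1 if buf holds STX, '1', '2', ETX and the terminating null. */
int is_stx_12_etx(const char *buf, size_t size)
{
    static const char expected[] = { 0x02, '1', '2', 0x03, '\0' };

    return size == sizeof expected && memcmp(buf, expected, size) == 0;
}
```

Calling is_stx_12_etx(framed, sizeof framed) returns 1, and strlen(framed) is 4.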
You can write
"\b12"
to represent a decimal value. Although you need to use a space after hex values for it to work.
buffer = "\x02 \b12\x03";
Or just 12
buffer = "\x02 12\x03";
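Basically you need to add a blank character after your hex values to indicate that it's a new value and not part of the same one. Note that the blank (and, in the first example, the backspace \b) then becomes a character of the string itself.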
No, there's no way to end a hexadecimal escape except by having an invalid (for the hex value) character, but then that character is of course interpreted in its own right.
The C11 draft says (in 6.4.4.4 14):
[...] a hexadecimal escape sequence is terminated only by a non-hexadecimal character.
Octal escapes don't have this problem, they are limited to three octal digits.
You can always use the octal format. An octal escape is at most 3 digits, so writing all three digits ends it unambiguously.
So to get the character '<-' you simply type \215
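Applied to the buffer from the question, the three-digit limit on octal escapes means the digits 12 can follow directly. A small sketch; the helper names are made up:

```c
#include <stddef.h>

/* \002 takes exactly three octal digits, so "12" starts a new character;
   \003 likewise ends the string with ETX. */
static const char octal_framed[] = "\00212\003";

/* Number of characters before the terminating null. */
size_t octal_framed_length(void)
{
    return sizeof octal_framed - 1;
}

/* Character at position i (i must be at most octal_framed_length()). */
char octal_framed_at(size_t i)
{
    return octal_framed[i];
}
```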

C standard: L prefix and octal/hexadecimal escape sequences

I couldn't find an explanation in the C standard of how the aforementioned escape sequences are processed in wide strings.
For example:
wchar_t *txt1 = L"\x03A9";
wchar_t *txt2 = L"\xA9\x03";
Are these somehow processed (like prefixing each byte with \x00 byte) or stored in memory exactly the same way as they are declared here?
Also, how does L prefix operate according to the standard?
EDIT:
Let's consider txt2. How would it be stored in memory? As \xA9\x00\x03\x00, or as \xA9\x03 as it was written? The same goes for \x03A9. Would this be considered one wide character, or 2 separate bytes which would be made into two wide characters?
EDIT2:
Standard says:
The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape
sequence are taken to be part of the construction of a single character for an integer
character constant or of a single wide character for a wide character constant. The
numerical value of the hexadecimal integer so formed specifies the value of the desired
character or wide character.
Now, we have a char literal:
wchar_t txt = L'\xFE\xFF';
It consists of 2 hex escape sequences, therefore it should be treated as two wide characters. If these are two wide characters they can't fit into one wchar_t space (yet it compiles in MSVC) and in my case this sequence is treated as the following:
wchar_t foo = L'\xFFFE';
which is the only hex escape sequence and therefore the only wide char.
EDIT3:
Conclusions: each oct/hex sequence is treated as a separate value ( wchar_t *txt2 = L"\xA9\x03"; consists of 3 elements). wchar_t txt = L'\xFE\xFF'; is not portable - implementation defined feature, one should use wchar_t txt = L'\xFFFE';
There's no processing. L"\x03A9" is simply an array wchar_t const[2] consisting of the two elements 0x3A9 and 0, and similarly L"\xA9\x03" is an array wchar_t const[3].
Note in particular C11 6.4.4.4/7:
Each octal or hexadecimal escape sequence is the longest sequence of characters that can
constitute the escape sequence.
And also C++11 2.14.3/4:
There is no limit to the number of digits in a hexadecimal sequence.
Note also that when you are using a hexadecimal sequence, it is your responsibility to ensure that your data type can hold the value. C11-6.4.4.4/9 actually spells this out as a requirement, whereas in C++ exceeding the type's range is merely "implementation-defined". (And a good compiler should warn you if you exceed the type's range.)
Your code doesn't make sense, though, because the left-hand sides are neither arrays nor pointers. It should be like this:
wchar_t const * p = L"\x03A9"; // pointer to the first element of a string
wchar_t arr1[] = L"\x03A9"; // an actual array
wchar_t arr2[2] = L"\x03A9"; // ditto, but explicitly typed
std::wstring s = L"\x03A9"; // C++ only
On a tangent: This question of mine elaborates a bit on string literals and escape sequences.
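The element counts claimed above are easy to check. A minimal sketch in C, assuming wchar_t is at least 16 bits wide (true on the common platforms); omega, two and the function names are invented for the example:

```c
#include <stddef.h>
#include <wchar.h>

/* One hex escape, the longest possible: a single wide character 0x3A9. */
static const wchar_t omega[] = L"\x03A9";

/* Two separate escapes: two wide characters, 0xA9 and 0x03. */
static const wchar_t two[] = L"\xA9\x03";

/* Element counts include the terminating null wide character. */
size_t omega_elements(void)
{
    return sizeof omega / sizeof omega[0];
}

size_t two_elements(void)
{
    return sizeof two / sizeof two[0];
}
```

Each element occupies a full wchar_t in memory, so nothing is stored "as written"; how many bytes that is depends on sizeof(wchar_t) on your platform.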

Write regular expression for C numerical literals

My homework is to write a regular expression representing the language of numerical literals from C programming language. I can use l for letter, d for digit, a for +, m for -, and p for point. Assume that there are no limits on the number of consecutive digits in any part of the expression.
Some of the examples of valid numerical literals were 13. , .328, 41.16, +45.80, -2.e+7, -.4E-7, 01E-06, +0
I came up with: (d+p+a+m)(d+p+E+e+a+m)*
update2: (l+d+p+a+m)(d+p+((E+e)(a+m+d)d*) )* I'm not sure how to prevent something like 1.0.0.0eee-e1.
Your regular expression does not support the various suffixes (l, u, f, etc.), nor does it support hexadecimal or octal constants.
The leading signs (+ or - in front of the number) are not lexically part of the constant; they are the unary + and - operators. Effectively, all integer and floating constants are positive.
If you need to fully support C99 floating constants, you need to support hexadecimal exponents (p instead of e).
Your regular expression also accepts many invalid sequences of characters, like 1.0.0.0eee-e1.
A single regular expression to match all C integer and floating literals would be quite long.
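To show what the regular expression has to enforce for the decimal floating case, here is a hand-written recognizer following C99 6.4.4.2. It is a sketch only: it ignores hexadecimal floating constants and, as noted above, treats a leading sign as a separate operator. The function names are invented for the example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Skips a run of decimal digits starting at s[*i]; returns how many. */
static size_t skip_digits(const char *s, size_t *i)
{
    size_t start = *i;

    while (isdigit((unsigned char)s[*i]))
        (*i)++;
    return *i - start;
}

/* Returns true if s is, in full, a decimal floating constant. */
bool is_decimal_floating_constant(const char *s)
{
    size_t i = 0;
    size_t int_digits = skip_digits(s, &i);
    size_t frac_digits = 0;
    bool has_point = false, has_exponent = false;

    if (s[i] == '.') {               /* at most one point is consumed */
        has_point = true;
        i++;
        frac_digits = skip_digits(s, &i);
    }
    if (int_digits + frac_digits == 0)
        return false;                /* "." or "e5": no digits at all */

    if (s[i] == 'e' || s[i] == 'E') {
        has_exponent = true;
        i++;
        if (s[i] == '+' || s[i] == '-')
            i++;
        if (skip_digits(s, &i) == 0)
            return false;            /* exponent needs at least one digit */
    }
    if (!has_point && !has_exponent)
        return false;                /* plain digits: an integer constant */

    if (s[i] == 'f' || s[i] == 'F' || s[i] == 'l' || s[i] == 'L')
        i++;                         /* optional floating suffix */
    return s[i] == '\0';
}
```

It accepts 13., .328, 2.e+7 and 01E-06, and rejects 1.0.0.0eee-e1 because only one point and one exponent part are ever consumed.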
Untested, but this should be along the right lines for decimal at least. (Also, it accepts the string ".", or I think it does anyway; to fix that would eliminate the last of the common code between integer and FP, the leading [0-9]*.)
[0-9]*([0-9]([uU](ll?+LL?)+(ll?+LL?)?[uU]?)+(\.[0-9]*)?([eE][+-]?[0-9]+)[fFlL])
This Regex will match all your need:
[+-]?(?P<Dot1>\.)?\d+(?(Dot1)(?#if_dot_exist_in_the_beginning__do_nothing)|(?#if_dot_not_exist_yet__we_accept_optional_dot_now)(?P<Dot2>\.)?)\d*(?P<Exp>[Ee]?)(?(Exp)[+-]?\d*)

What does \x mean in C/C++?

Example:
char arr[] = "\xeb\x2a";
BTW, are the following the same?
"\xeb\x2a" vs. '\xeb\x2a'
\x indicates a hexadecimal character escape. It's used to specify characters that aren't typeable (like a null '\x00').
And "\xeb\x2a" is a string literal (an array of 3 char: two bytes plus the null terminator), whereas '\xeb\x2a' is a multi-character constant (type is int, not null-terminated, with an implementation-defined value that is often just another way to write 0xEB2A or 60202 or 0165452). Not the same :)
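The string half of this can be checked directly. A minimal sketch; arr_size and arr_byte are helper names made up for the example:

```c
#include <stddef.h>

static const char arr[] = "\xeb\x2a";

/* Total size of the array, including the terminating null byte. */
size_t arr_size(void)
{
    return sizeof arr;
}

/* Byte at position i as an unsigned value, so 0xeb reads as 235 even
   where plain char is signed. */
unsigned arr_byte(size_t i)
{
    return (unsigned char)arr[i];
}
```

Here arr_size() is 3, arr_byte(0) is 0xEB (235 decimal) and arr_byte(1) is 0x2A.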
As others have said, the \x is an escape sequence that starts a "hexadecimal-escape-sequence".
Some further details from the C99 standard:
When used inside a set of single-quotes (') the characters are part of an "integer character constant" which is (6.4.4.4/2 "Character constants"):
a sequence of one or more multibyte characters enclosed in single-quotes, as in 'x'.
and
An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer. The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.
So the sequence in your example of '\xeb\x2a' is an implementation defined value. It's likely to be the int value 0xeb2a or 0x2aeb depending on whether the target platform is big-endian or little-endian, but you'd have to look at your compiler's documentation to know for certain.
When used inside a set of double-quotes (") the characters specified by the hex-escape-sequence are part of a null-terminated string literal.
From the C99 standard 6.4.5/3 "String literals":
The same considerations apply to each element of the sequence in a character string literal or a wide string literal as if it were in an integer character constant or a wide character constant, except that the single-quote ' is representable either by itself or by the escape sequence \', but the double-quote " shall be represented by the escape sequence \".
Additional info:
In my opinion, you should avoid using 'multi-character' constants. There are only a few situations where they provide any value over using a regular, old int constant. For example, '\xeb\x2a' could more portably be specified as 0xeb2a or 0x2aeb, depending on what value you really wanted.
One area that I've found multi-character constants to be of some use is to come up with clever enum values that can be recognized in a debugger or memory dump:
enum CommandId {
CMD_ID_READ = 'read',
CMD_ID_WRITE = 'writ',
CMD_ID_DEL = 'del ',
CMD_ID_FOO = 'foo '
};
There are few portability problems with the above (other than platforms that have small ints or warnings that might be spewed). Whether the characters end up in the enum values in little- or big-endian form, the code will still work (unless you're doing something else unholy with the enum values). If the characters end up in the value using an endianness that wasn't what you expected, it might make the values less easy to read in a debugger, but the 'correctness' isn't affected.
When you say:
BTW,are these the same:
"\xeb\x2a" vs '\xeb\x2a'
They are in fact not. The first creates a character string literal, terminated with a zero byte, containing the two characters whose hex representation you provide. The second creates an integer constant.
It's an escape sequence that indicates the digits that follow are a hexadecimal character code.
http://www.austincc.edu/rickster/COSC1320/handouts/escchar.htm
The \x means it's a hex character escape. So \xeb would mean character eb in hex, or 235 in decimal. See http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx for more information.
As for the second, no, they are not the same. The double-quotes, ", means it's a string of characters, a null-terminated character array, whereas a single quote, ', means it's a single character, the byte that character represents.
\x allows you to specify the character by its hexadecimal code.
This allows you to specify characters that are normally not printable (some of which have predefined escape sequences, such as '\n' = newline, '\t' = tab, and '\b' = backspace).
A useful website is here.
And I quote:
x Unsigned hexadecimal integer
That way, your \xeb is like 235 in decimal.

Resources