Write regular expression for C numerical literals

My homework is to write a regular expression representing the language of numerical literals from C programming language. I can use l for letter, d for digit, a for +, m for -, and p for point. Assume that there are no limits on the number of consecutive digits in any part of the expression.
Some of the examples of valid numerical literals were 13. , .328, 41.16, +45.80, -2.e+7, -.4E-7, 01E-06, +0
I came up with: (d+p+a+m)(d+p+E+e+a+m)*
Update 2: (l+d+p+a+m)(d+p+((E+e)(a+m+d)d*) )* I'm not sure how to prevent something like 1.0.0.0eee-e1.

Your regular expression does not support the various suffixes (l, u, f, etc.), nor does it support hexadecimal or octal constants.
The leading signs (+ or - in front of the number) are not lexically part of the constant; they are the unary + and - operators. Effectively, all integer and floating constants are positive.
If you need to fully support C99 floating constants, you need to support hexadecimal exponents (p instead of e).
Your regular expression also accepts many invalid sequences of characters, like 1.0.0.0eee-e1.
A single regular expression to match all C integer and floating literals would be quite long.
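As a rough illustration of the decimal part only, here is a minimal sketch using the POSIX <regex.h> API. The function name is_decimal_literal is invented for this example, and the pattern is an assumption about what the homework wants: it allows an optional leading sign (as in the question's examples, even though C treats the sign as a separate operator) and it ignores suffixes, hexadecimal, and octal forms.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s looks like a (signed) decimal
 * integer or floating literal in the homework's simplified sense.
 * At most one point and at most one exponent are allowed, and at least
 * one digit must appear before the exponent. */
bool is_decimal_literal(const char *s)
{
    static const char *pattern =
        "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* pattern failed to compile */

    bool ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this pattern, inputs such as "13.", ".328" and "-2.e+7" are accepted, while "1.0.0.0eee-e1" and a lone "." are rejected, because the point and the exponent each appear at most once in the grammar.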

Untested, but this should be along the right lines for decimal at least. (Also, it accepts the string ".", or I think it does anyway; to fix that would eliminate the last of the common code between integer and FP, the leading [0-9]*.)
[0-9]*([0-9]([uU](ll?+LL?)+(ll?+LL?)?[uU]?)+(\.[0-9]*)?([eE][+-]?[0-9]+)[fFlL])

This regex should match everything you need:
[+-]?(?P<Dot1>\.)?\d+(?(Dot1)(?#if_dot_exist_in_the_beginning__do_nothing)|(?#if_dot_not_exist_yet__we_accept_optional_dot_now)(?P<Dot2>\.)?)\d*(?P<Exp>[Ee]?)(?(Exp)[+-]?\d*)

Related

Conventions to write simple additions of hexadecimal and decimal numbers

Even though an oldtimer, I fear I do not (anymore) have a complete grasp of parsing of constants in C. The second of the following 1-liners fails to compile:
int main( void ) { return (0xe +2); }
int main( void ) { return (0xe+2); }
$ gcc -s weird.c
weird.c: In function ‘main’:
weird.c:1:28: error: invalid suffix "+2" on integer constant
int main( void ) { return (0xe+2); }
^
The reason for the compilation failure is probably that 0xe+2 is parsed as a hexadecimal floating point constant as per C11 standard clause 6.4.4.2. My question is whether a convention exists to write simple additions of hexadecimal and decimal numbers in C; I do not like having to rely on white space in parsing.
This was with gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9). Stopping compilation after preprocessing (-E) shows that the failure happens in gcc, not cpp.

GCC tokenizes 0xe+2 as a single number, because the trailing e+ looks like the start of a floating point exponent, although you intended an addition of two integers.
According to cppreference:
Due to maximal munch, hexadecimal integer constants ending in e and E,
when followed by the operators + or -, must be separated from the
operator with whitespace or parentheses in the source:
int x = 0xE+2; // error
int y = 0xa+2; // OK
int z = 0xE +2; // OK
int q = (0xE)+2; // OK
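A small sketch of the forms that do compile, wrapped in functions whose names are invented for this example. The failing form is shown only in a comment, since it would stop compilation.

```c
/* Each function adds a hexadecimal constant and a decimal constant. */

/* White space ends the token 0xE before the + operator. */
int add_with_space(void)  { return 0xE + 2; }

/* Parentheses also end the token, so no white space is needed. */
int add_with_parens(void) { return (0xE)+2; }

/* A hex constant that does not end in e or E is unaffected. */
int add_other_digit(void) { return 0xa+2; }

/* int broken(void) { return 0xE+2; }   rejected: "0xE+2" is one token */
```

The first two functions return 16 and the third returns 12.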

My question is whether a convention exists to write simple additions of hexadecimal and decimal numbers in C
The convention is to use spaces. This is actually mandated by C11 6.4 §3:
Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both.
Plain space is the one most commonly used.
Similar exotic issues exist here and there in the language, some examples:
---a must be rewritten as - --a.
a+++++b must be rewritten as a++ + ++b.
a /// comment
b;
must be rewritten as
a / // comment
b
And so on. The culprit in all of these cases is the token parser which follows the so-called "maximal munch rule", C11 6.4 §4:
If the input stream has been parsed into preprocessing tokens up to a given character, the
next preprocessing token is the longest sequence of characters that could constitute a
preprocessing token.
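The rewritten forms from the list above can be sketched as follows; the function names are made up for illustration. Without the inserted spaces, maximal munch would tokenize the expressions as a ++ ++ + b and -- - a, neither of which compiles.

```c
/* a+++++b does not compile; a++ + ++b does.
 * Evaluates to (old a) + (b + 1). */
int sum_post_and_pre(int a, int b)
{
    return a++ + ++b;
}

/* ---a does not compile; - --a does.
 * Evaluates to -(a - 1). */
int negate_predecrement(int a)
{
    return - --a;
}
```

For example, sum_post_and_pre(1, 2) yields 4 and negate_predecrement(5) yields -4.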
In this specific case, the pre-processor does not make any distinction between floating point constants and integer constants, when it builds up a pre-processing token called pp-number, defined in C11 6.4.8:
pp-number:
digit
. digit
pp-number digit
pp-number identifier-nondigit
pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .
A preprocessing number begins with a digit optionally preceded by a period (.) and may
be followed by valid identifier characters and the character sequences e+, e-, E+, E-,
p+, p-, P+, or P-.
Evidently, a pp-number does not have to be a valid floating point constant as far as the pre-processor is concerned.
( As a side note, a similar convention also exists when terminating hexadecimal escape sequences inside strings. If I for example want to print the string "ABBA" on a new line, then I can't write
puts("\xD\xABBA"); (CR+LF+string)
Because the string in this case could be interpreted as part of the hex escape sequence. Instead I have to use white space to end the escape sequence and then rely on pre-processor string concatenation: puts("\xD\xA" "BBA"). The purpose is the same, to guide the pre-processor how to parse the code. )
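A minimal sketch of the split-literal form from the side note, assuming an ASCII execution character set (so that \xD is CR and \xA is LF). The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The closing quote ends the \xA escape; the adjacent literals are then
 * concatenated into a single string: CR, LF, 'B', 'B', 'A'. */
const char *crlf_then_bba(void)
{
    return "\xD\xA" "BBA";
}

/* Length of the concatenated string, not counting the terminator. */
size_t crlf_then_bba_length(void)
{
    return strlen(crlf_then_bba());
}
```

Here crlf_then_bba_length() is 5: two control characters followed by the three letters.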

This source code is switching on a string in C. How does it do that?

I'm reading through some emulator code and I've encountered something truly odd:
switch (reg){
case 'eax':
/* and so on*/
}
How is this possible? I thought you could only switch on integral types. Is there some macro trickery going on?

(Only you can answer the "macro trickery" part - unless you paste up more code. But there's not much here for macros to work on - formally you are not allowed to redefine keywords; the behaviour on doing that is undefined.)
In order to achieve program readability, the witty developer is exploiting implementation defined behaviour. 'eax' is not a string, but a multi-character constant. Note very carefully the single quotation characters around eax. Most likely it is giving you an int in your case that's unique to that combination of characters. (Quite often each character occupies 8 bits in a 32 bit int). And everyone knows you can switch on an int!
Finally, a standard reference:
The C99 standard says:
6.4.4.4p10: "The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or
escape sequence that does not map to a single-byte execution
character, is implementation-defined."
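To make the point concrete, here is a small sketch (function names invented for this example) showing that 'eax' has type int in C. The numeric value is deliberately not inspected, since it is implementation-defined, and compilers commonly emit a warning about the multi-character constant.

```c
#include <stddef.h>

/* In C a character constant has type int, even with several characters,
 * so this is the same as sizeof(int). */
size_t multichar_size(void)
{
    return sizeof('eax');
}

/* Within one implementation the same spelling yields the same value,
 * which is what lets the switch in the question work there. */
int multichar_is_consistent(void)
{
    return 'eax' == 'eax';
}
```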

According to the C Standard (6.8.4.2 The switch statement)
3 The expression of each case label shall be an integer constant
expression...
and (6.6 Constant expressions)
6 An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer constants, and floating constants that are the immediate operands of
casts. Cast operators in an integer constant expression shall only
convert arithmetic types to integer types, except as part of an
operand to the sizeof operator.
Now what is 'eax'?
The C Standard (6.4.4.4 Character constants)
2 An integer character constant is a sequence of one or more
multibyte characters enclosed in single-quotes, as in 'x'...
So 'eax' is an integer character constant according to the paragraph 10 of the same section
...The value of an integer character constant containing more than one
character (e.g., 'ab'), or containing a character or escape
sequence that does not map to a single-byte execution character, is
implementation-defined.
So according to the first mentioned quote it can be an operand of an integer constant expression that may be used as a case label.
Note that a character constant (enclosed in single quotes) has type int and is not the same as a string literal (a sequence of characters enclosed in double quotes), which has a character array type.

As others have said, this is an int constant and its actual value is implementation-defined.
I assume the rest of the code looks something like
if (SOMETHING)
reg='eax';
...
switch (reg){
case 'eax':
/* and so on*/
}
You can be sure that 'eax' in the first part has the same value as 'eax' in the second part, so it all works out, right? ... wrong.
In a comment, @Davislor lists some possible values for 'eax':
... 0x65, 0x656178, 0x65617800, 0x786165, 0x6165, or something else
Notice the first potential value? That is just 'e', ignoring the other two characters. The problem is the program probably uses 'eax', 'ebx',
and so on. If all these constants have the same value as 'e' you end up with
switch (reg){
case 'e':
...
case 'e':
...
...
}
This doesn't look too good, does it?
The good part about "implementation-defined" is that the programmer can check the documentation of their compiler and see if it does something sensible with these constants. If it does, home free.
The bad part is that some other poor fellow can take the code and try to compile it using some other compiler. Instant compile error. The program is not portable.
As @zwol pointed out in the comments, the situation is not quite as bad as I thought: in the bad case the code doesn't compile. This will at least give you an exact file name and line number for the problem. Still, you will not have a working program.

The code fragment uses a historical oddity called a multi-character character constant, also referred to as a multi-char.
'eax' is an integer constant whose value is implementation defined.
Here is an interesting page on multi-chars and how they can be used but should not:
http://www.zipcon.net/~swhite/docs/computers/languages/c_multi-char_const.html
Looking back further away into the rearview mirror, here is how the original C manual by Dennis Ritchie from the good old days ( https://www.bell-labs.com/usr/dmr/www/cman.pdf ) specified character constants.
2.3.2 Character constants
A character constant is 1 or 2 characters enclosed in single quotes ‘‘ ' ’’. Within a character constant a single quote must be preceded by a back-slash ‘‘\’’. Certain non-graphic characters, and ‘‘\’’ itself, may be escaped according to the following table:
BS \b
NL \n
CR \r
HT \t
ddd \ddd
\ \\
The escape ‘‘\ddd’’ consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the value of the desired character. A special case of this construction is ‘‘\0’’ (not followed by a digit) which indicates a null character.
Character constants behave exactly like integers (not, in particular, like objects of character type). In conformity with the addressing structure of the PDP-11, a character constant of length 1 has the code for the given character in the low-order byte and 0 in the high-order byte; a character constant of length 2 has the code for the first character in the low byte and that for the second character in the high-order byte. Character constants with more than one character are inherently machine-dependent and should be avoided.
The last phrase is all you need to remember about this curious construction: Character constants with more than one character are inherently machine-dependent and should be avoided.
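If the goal is only readable case labels, one portable alternative is an enumeration. This is a sketch of that swapped-in technique, not what the emulator code does; the enum reg type, its enumerators, and reg_name are all hypothetical names.

```c
/* Hypothetical register identifiers with well-defined, distinct values. */
enum reg { REG_EAX, REG_EBX, REG_ECX };

/* Switching on an enum is portable: every case label is an ordinary
 * integer constant expression with a value fixed by the language. */
const char *reg_name(enum reg r)
{
    switch (r) {
    case REG_EAX: return "eax";
    case REG_EBX: return "ebx";
    case REG_ECX: return "ecx";
    }
    return "unknown";   /* value outside the enumerators */
}
```

The labels stay readable, and two distinct registers can never collapse to the same case value on another compiler.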

Why is there no sign character in the syntax of constants?

Why doesn't the standard include a sign character in the syntax of constants?
It mentions only digits, and a sign character is present only in exponents.

The standard does not bother with the sign in front of numeric literals because it would be redundant.
The syntax already captures the sign as part of unary plus + and unary minus - operators. When you write
int a = -4;
the syntax of the right-hand side could be adequately described as a unary minus - expression with the operand of 4. This is the approach that the standard takes.

If - were a part of the constant -2 then 4-2 would be a syntax error (since a token is always the longest possible sequence of characters). Also, the semantics of -2147483648 and - 2147483648 would be different (the first one would be an int and the second one a long, assuming int is 32 bits and long is longer). Both of those things would be confusing.
If the - is always an operator, the semantics of -2147483648 are sometimes a little unexpected, but the more common x-1 works as expected. So that's how most programming languages, including C, work.
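The "a little unexpected" case can be observed with sizeof. This is a sketch assuming a C99 or later compiler with a 32-bit int; the function names are invented for the example.

```c
#include <limits.h>
#include <stddef.h>

/* 2147483648 does not fit in a 32-bit int, so the constant itself gets a
 * wider type; the unary minus is then applied to that wider value. */
size_t size_of_negated_constant(void)
{
    return sizeof(-2147483648);
}

/* The usual way to spell the most negative int without leaving int:
 * both operands fit in int, so the whole expression has type int. */
size_t size_of_int_min_expression(void)
{
    return sizeof(-2147483647 - 1);
}

/* Binary minus keeps working because - is never part of the constant. */
int four_minus_two(void)
{
    return 4-2;
}
```

On a typical platform with 32-bit int and a 64-bit long or long long, the first function returns 8 and the second returns 4.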

GNU Flex regexp for negative numbers

I'm parsing math expressions in my C program. I use flex (without bison or yacc)
Everything works ok except negative or explicit positive numbers. Here is my current rule for integers and operators
integer -?([0-9]+)
.........
"-" {some action}
{integer} {some action}
And so on. It's ok for expressions like "1+2+3" but fails on "1-2-3" as it treats it as negative numbers, not single subtraction operator. So I have to escape numbers with brackets "1-(2)-(3)" or spaces. But it looks ugly.
I tried "[+-]+[ ]*-([0-9]+)" for negatives only, but it doesn't work because it includes the preceding operators in the match. Of course I can pre-process the string in order to count "-" and "+", but maybe it is possible with a regex within flex?

You need to read the numbers as positive integers, treating the leading '-' as a separate token that is interpreted syntactically (by bison/yacc) as either negation or subtraction depending on context.
So -1234 would be two tokens: '-', and 1234.
Similarly for '+'.
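The idea can be sketched without flex or bison as a tiny hand-written recursive-descent evaluator. This is an illustrative stand-in, not the asker's scanner: eval_sum and its helpers are hypothetical names, only + and - are handled, and there is no error reporting.

```c
#include <ctype.h>
#include <stdlib.h>

static const char *cursor;   /* current position in the input string */

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cursor))
        cursor++;
}

/* operand := '-' operand | '+' operand | unsigned-integer
 * The sign is consumed here as an operator; the number token itself
 * is always read as an unsigned run of digits. */
static long parse_operand(void)
{
    skip_spaces();
    if (*cursor == '-') { cursor++; return -parse_operand(); }
    if (*cursor == '+') { cursor++; return parse_operand(); }

    char *end;
    long value = strtol(cursor, &end, 10);
    cursor = end;
    return value;
}

/* expr := operand (('+' | '-') operand)*
 * A '-' seen after an operand is subtraction; a '-' seen where an
 * operand is expected is negation. */
long eval_sum(const char *expr)
{
    cursor = expr;
    long total = parse_operand();
    for (;;) {
        skip_spaces();
        if (*cursor == '+')      { cursor++; total += parse_operand(); }
        else if (*cursor == '-') { cursor++; total -= parse_operand(); }
        else                     return total;
    }
}
```

Because the position in the grammar decides the meaning of '-', "1-2-3" evaluates to -4 and "1--2" evaluates to 3, with no brackets or spaces required.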

Regular expression for constants in C

I want to write regular expression for constants in C language. So I tried this:
Let
digit -> 0-9,
digit_oct -> 0-7,
digit_hex -> 0-9 | a-f | A-F
Then:
RE = digit+ U 0digit_oct+ U 0xdigit_hex+
I want to know whether I have written correct R.E. Is there any other way of writing this?

There is another type of integer constants, namely integer character constants such as 'a' or '\n'. In C99 these are constants and their type is just int.
The best regular expressions for all these are found in the standard, section 6.4, http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf

The 'RE' makes sense if we interpret the 'U' as being similar to set union. However, it is more conventional to use a '|' symbol to denote alternatives.
First, you are only dealing with integer constants, not with floating point or character or string constants, let alone more complex constants.
Second, you have omitted '0X' as a valid hex prefix.
Third, you have omitted the various suffixes: U, L, LL, ULL (and their lower-case and mixed case synonyms and permutations).
Also, the C standard (§6.4.4.1) distinguishes between digits and non-zero digits in a decimal constant:
decimal-constant:
nonzero-digit
decimal-constant digit
Any integer constant starting with a zero is an octal constant, never a decimal constant. In particular, writing 0 is writing an octal constant.
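Taking those points together (the 0X prefix, the suffixes, and the nonzero-digit rule), a possible pattern can be tried out with the POSIX <regex.h> API. This is a sketch only: is_integer_constant is a made-up name, and the pattern covers decimal, octal and hexadecimal integer constants but not character constants.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s has the form of a C integer constant.
 * Value part: 0x/0X hex digits, or a nonzero digit followed by digits,
 *             or 0 followed by octal digits.
 * Suffix:     u/U optionally followed by l, L, ll or LL, or the
 *             length suffix first, optionally followed by u/U. */
bool is_integer_constant(const char *s)
{
    static const char *pattern =
        "^(0[xX][0-9a-fA-F]+|[1-9][0-9]*|0[0-7]*)"
        "([uU](ll|LL|[lL])?|(ll|LL|[lL])[uU]?)?$";
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* pattern failed to compile */

    bool ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Note that "0" is matched by the octal alternative, in line with the remark above, and that "08" is rejected because 8 is not an octal digit.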

First, C does not support Unicode literals, so you can eliminate the last rule. You also only define integer literals, not floating-point literals and not string or character literals. For the sake of my convenience I assume that that is what you intended.
INT := OCTINT | DECINT | HEXINT
DECINT := [1-9] [0-9]* [uU]? [lL]? [lL]?
OCTINT := 0 [0-7]* [uU]? [lL]? [lL]?
HEXINT := 0[xX] [0-9a-fA-F]+ [uU]? [lL]? [lL]?
These only describe the form of the literals, not any logic such as maximum values.

From a Perl point of view, I came up with the following regexp after reading ISO C 2011:
my $I_CONSTANT = qr/^(?:(0[xX][a-fA-F0-9]+(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Hexadecimal
|([1-9][0-9]*(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Decimal
|(0[0-7]*(?:[uU](?:ll|LL|[lL])?|(?:ll|LL|[lL])[uU]?)?) # Octal
|([uUL]?'(?:[^'\\\n]|\\(?:[\'\"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))+') # Character
)$/x;
