I looked ANSI C grammar (lex).
And this is octal digit regex
0{D}+{IS}? { count(); return(CONSTANT); }
My question is why do they accept something like 0898?
It's not an octal digit.
So i thought they would consider that, but they just have wrote like that.
Could you explain why is that? Thank you
You want reasonable, user-friendly error messages.
If your lexer accepts 0999, you can detect an illegal octal digit and output a reasonable message:
int x = 0999;
^
error: illegal octal digit, go back to school
If it doesn't, it will parse this as two separate tokens 0 and 999 and pass them to the parser. The resulting error messages could be quite confusing.
int x = 0999;
^
error: expected ‘,’ or ‘;’ before numeric constant
The invalid program is rejected either way, as it should, however the ostensibly incorrect lex grammar does a better job with error reporting.
This demonstrates that practical grammars built for tools such as lex or yacc do not have to correspond exactly to ideal grammars found in language definitions.
Keep in mind that this is only syntax, not semantic.
So it is sufficient to detect "Cannot be anything but a constant.".
It is not necessary (yet) to detect "A correct octal constant.".
Note that it does not even make a difference between octal, decimal, hexadecimal. All of them register as "CONSTANT".
The grammar you repeatedly link to in your questions was produced in 1985, 4 years prior to the publication of the first C standard revision in 1989.
That is not the grammar that was published in the standard of 1989, which clearly uses
octal-constant:
0
octal-constant octal-digit
octal-digit: one of
0 1 2 3 4 5 6 7
Even then, that Lex grammar is sufficient for tokenizing a valid program.
Related
I stumbled on some C++ code like this:
int $T$S;
First I thought that it was some sort of PHP code or something wrongly pasted in there but it compiles and runs nicely (on MSVC 2008).
What kind of characters are valid for variables in C++ and are there any other weird characters you can use?
The only legal characters according to the standard are alphanumerics
and the underscore. The standard does require that just about anything
Unicode considers alphabetic is acceptable (but only as single
code-point characters). In practice, implementations offer extensions
(i.e. some do accept a $) and restrictions (most don't accept all of the
required Unicode characters). If you want your code to be portable,
restrict symbols to the 26 unaccented letters, upper or lower case, the
ten digits, and the '_'.
It's an extension of some compilers and not in the C standard
MSVC:
Microsoft Specific
Only the first 2048 characters of Microsoft C++ identifiers are significant. Names for user-defined types are "decorated" by the compiler to preserve type information. The resultant name, including the type information, cannot be longer than 2048 characters. (See Decorated Names for more information.) Factors that can influence the length of a decorated identifier are:
Whether the identifier denotes an object of user-defined type or a type derived from a user-defined type.
Whether the identifier denotes a function or a type derived from a function.
The number of arguments to a function.
The dollar sign is also a valid identifier in Visual C++.
// dollar_sign_identifier.cpp
struct $Y1$ {
void $Test$() {}
};
int main() {
$Y1$ $x$;
$x$.$Test$();
}
https://web.archive.org/web/20100216114436/http://msdn.microsoft.com/en-us/library/565w213d.aspx
Newest version: https://learn.microsoft.com/en-us/cpp/cpp/identifiers-cpp?redirectedfrom=MSDN&view=vs-2019
GCC:
6.42 Dollar Signs in Identifier Names
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
http://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html#Dollar-Signs
In my knowledge only letters (capital and small), numbers (0 to 9) and _ are valid for variable names according to standard (note: the variable name should not start with a number though).
All other characters should be compiler extensions.
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([a-z][A-Z][0-9]_).
Surface Level
Unlike in other languages (bash, perl), C does not use $ to denote the usage of a variable. As such, it is technically valid. In C it most likely falls under C11, 6.4.2. This means that it does seem to be supported by modern compilers.
As for your C++ question, lets test it!
int main(void) {
int $ = 0;
return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (e.g. lexing).
The grammar (thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, skinny arrows, etc... As such, you may get an output from the lexer like this from the below c code:
int i_love_$ = 0;
After the lexer, this becomes a token steam like such:
["int", "i_love_$", "=", "0"]
If you where to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token steam like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
I know to code in C well but I thought of learning C from the book C - The Complete Reference by Herbert Schildt. Here is a quote from Chapter 2:
In C89, at least the first 6 characters of an external identifier and at
least the first 31 characters of an internal identifier will be significant. C99 has increased these values. In C99, an external identifier has at least 31 significant characters, and an internal identifier has at least 63 significant characters.
Can somebody explain what does it mean to be significant?
That means that it is used within the compiler to differ between different names.
E.g. if only the first 6 characters are significant, when having two variables:
int abcdef_1;
int abcdef_2;
They will be treated as the same variable, and possibly the compiler will generate a warning or error.
About the minimal significance:
Maybe the compiler/assembler can handle more, but the linker cannot. Or maybe external tools which are out of control of the manufacturer of the assembler/linker can handle less, thus a minimum value (per type, internal/external) is defined in the C standard(s).
The code snippet below works as is but if I uncomment the first #define and comment the second the compiler complains about expecting a ')' at the assignment statement. Thought it might be wanting a cast but that did not help. Please point out my stupid oversight.
Thanks,
jh
//#define SMI_READ (0b10 << 10)
#define SMI_READ (0x2 << 10)
...
command |= SMI_READ;
In general, to answer a question like this we need to see the complete and unedited text of the error messages, and it also really helps if you provide a complete program that we can attempt to compile for ourselves. (It might seem to you that the error messages are useless, but often it's just that they only make sense if you know how to think like a compiler engineer.)
However, in this case, I can make a high-confidence guess, because the only difference between the two macros is that the one that doesn't work uses a binary number, 0b10, and the one that does work uses a hexadecimal number, 0x2. Binary numbers are not part of any version of the C standard, although they are a common extension. I therefore deduce that your compiler doesn't support them and is giving an unclear error message when it encounters them.
From C standard (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf)
6.4.4.1 Integer constants
...
octal-constant:
0
octal-constant octal-digit
...
hexadecimal-prefix: one of
0x 0X
No other prefixes are described, especially nothing which would cover 0b10.
I have already read this and this questions. They are quite helpful but still I have some doubt regarding token generation in lexical analyzer for C.
What if lexical analyzer detects int a2.5c; then according to my understandings 7 tokens will be generated.
int keyword
a identifier
2 constant
. special symbol
5 constant
c identifier
; special symbol
So Lexical analyzer will not report any error and tokens will be generated successfully.
Is my understanding correct? If not then can you please help me to understand?
Also If we declare any constant as double a = 10.10.10;
Will it generate any lexical errors? Why?
UPDATE :Asking out of curiosity, what if lexical analyzer detects :-) smiley kind of thing in program?? Will it generate any lexical error? Because as per my understandings : will be treated as special symbol, - will be treated as operator and again ) will be treated as special symbol
Thank You
Your first list of tokens is almost correct -- a2 is a valid identifier.
Its true that the first example won't generate any "lexical" errors per se, although there will be a parse error at the ..
It's hard to say whether the error in your second example is a lexical error or a parse error. The lexical structure of a floating-point constant is pretty complicated. I can imagine a compiler that grabs a string of digits and . and e/E and doesn't notice until it calls the equivalent of strtod that there are two decimal points, meaning that it might report a "lexical error". Strictly speaking, though, what we have there is two floating-point constants in a row -- 10.10 and .10, meaning that it's more likely a "parse error".
In the end, though, these are all just errors. Unless you're taking a compiler design/construction class, I'm not sure how important it is to classify errors as lexical or otherwise.
Addressing your follow-on question, yes, :-) would lex as three tokens :, -, and ).
Because just about any punctuation character is legal in C, there are relatively few character sequences that are lexically illegal (that is, that would generate errors during the lexical analysis phase). In fact, the only ones I can think of are:
Illegal character (I think the only unused ones are ` and #)
various problems with character and string constants (missing ' or ", bad escape sequences, etc.)
Indeed, almost any string of punctuation you care to bang out will make it through a C lexical analyzer, although of course it may or may not parse. (A somewhat infamous example is a+++++b, which unfortunately lexes as a++ ++ + b and is therefore a syntax error.)
The C lexer I wrote tokenizes this as
keyid int
white " "
keyid a2
const .5
keyid c
punct ;
white "\n"
Where keyid is keyword or identifer; const is numerical constant, and punct is punctuator (white is white space).
I would not say there is a lexical error; but certainly a syntax error that must be diagnosed due to an identifer followed by a numerical constant, which no grammar rule can reduce.
I saw a line of C that looked like this:
!ErrorHasOccured() ??!??! HandleError();
It compiled correctly and seems to run ok. It seems like it's checking if an error has occurred, and if it has, it handles it. But I'm not really sure what it's actually doing or how it's doing it. It does look like the programmer is trying express their feelings about errors.
I have never seen the ??!??! before in any programming language, and I can't find documentation for it anywhere. (Google doesn't help with search terms like ??!??!). What does it do and how does the code sample work?
??! is a trigraph that translates to |. So it says:
!ErrorHasOccured() || HandleError();
which, due to short circuiting, is equivalent to:
if (ErrorHasOccured())
HandleError();
Guru of the Week (deals with C++ but relevant here), where I picked this up.
Possible origin of trigraphs or as #DwB points out in the comments it's more likely due to EBCDIC being difficult (again). This discussion on the IBM developerworks board seems to support that theory.
From ISO/IEC 9899:1999 §5.2.1.1, footnote 12 (h/t #Random832):
The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as
described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.
Well, why this exists in general is probably different than why it exists in your example.
It all started half a century ago with repurposing hardcopy communication terminals as computer user interfaces. In the initial Unix and C era that was the ASR-33 Teletype.
This device was slow (10 cps) and noisy and ugly and its view of the ASCII character set ended at 0x5f, so it had (look closely at the pic) none of the keys:
{ | } ~
The trigraphs were defined to fix a specific problem. The idea was that C programs could use the ASCII subset found on the ASR-33 and in other environments missing the high ASCII values.
Your example is actually two of ??!, each meaning |, so the result is ||.
However, people writing C code almost by definition had modern equipment,1 so my guess is: someone showing off or amusing themself, leaving a kind of Easter egg in the code for you to find.
It sure worked, it led to a wildly popular SO question.
ASR-33 Teletype
1. For that matter, the trigraphs were invented by the ANSI committee, which first met after C become a runaway success, so none of the original C code or coders would have used them.
It's a C trigraph. ??! is |, so ??!??! is the operator ||
As already stated ??!??! is essentially two trigraphs (??! and ??! again) mushed together that get replaced-translated to ||, i.e the logical OR, by the preprocessor.
The following table containing every trigraph should help disambiguate alternate trigraph combinations:
Trigraph Replaces
??( [
??) ]
??< {
??> }
??/ \
??' ^
??= #
??! |
??- ~
Source: C: A Reference Manual 5th Edition
So a trigraph that looks like ??(??) will eventually map to [], ??(??)??(??) will get replaced by [][] and so on, you get the idea.
Since trigraphs are substituted during preprocessing you could use cpp to get a view of the output yourself, using a silly trigr.c program:
void main(){ const char *s = "??!??!"; }
and processing it with:
cpp -trigraphs trigr.c
You'll get a console output of
void main(){ const char *s = "||"; }
As you can notice, the option -trigraphs must be specified or else cpp will issue a warning; this indicates how trigraphs are a thing of the past and of no modern value other than confusing people who might bump into them.
As for the rationale behind the introduction of trigraphs, it is better understood when looking at the history section of ISO/IEC 646:
ISO/IEC 646 and its predecessor ASCII (ANSI X3.4) largely endorsed existing practice regarding character encodings in the telecommunications industry.
As ASCII did not provide a number of characters needed for languages other than English, a number of national variants were made that substituted some less-used characters with needed ones.
(emphasis mine)
So, in essence, some needed characters (those for which a trigraph exists) were replaced in certain national variants. This leads to the alternate representation using trigraphs comprised of characters that other variants still had around.