What can create a lexical error in C? - c

Besides not closing a comment /*..., what constitutes a lexical error in C?

Here are some:
"abc<EOF>
where EOF is the end of the file. In fact, EOF in the middle of many lexemes should produce errors:
0x<EOF>
I assume that using bad escapes in strings is illegal:
"ab\qcd"
Probably trouble with floating point exponents
1e+%
Arguably, you shouldn't have stuff at the end of a preprocessor directive:
#if x %

Basically anything that is not conforming to ISO C 9899/1999, Annex A.1 "Lexical Grammar" is a lexical fault if the compiler does its lexical analysis according to this grammar. Here are some examples:
"abc<EOF> // invalid string literal (from Ira Baxter's answer) (ISO C 9899/1999 6.4.4.5)
'a<EOF> // invalid char literal (6.4.4.4)
where EOF is the end of the file.
double a = 1e*3; // misguided floating point literal (6.4.4.2)
int a = 0x0g; // invalid integer hex literal (6.4.4.1)
int a = 09; // invalid octal literal (6.4.4.1)
char a = 'aa'; // too long char literal (from Joel's answer, 6.4.4.4)
double a = 0x1p1q; // invalid hexadecimal floating point constant (6.4.4.2)
// instead of q, only a float suffix, that is 'f', 'l', 'F' or 'L' is allowed.
// invalid header name (6.4.7)
#include <<a.h>
#include ""a.h"

Aren't [#$`] and other symbols like that (maybe from unicode) lexical errors in C if put anywhere outside of string or comment? They are not constituting any valid lexical sequence of that language. They cannot pass the lexer because the lexer cannot recognize them as any kind of valid token. Usually lexers are FSMs or regex based so these symbols are just unrecognized input.
For example in the following code there are several lexical errors:
int main(void){
` int a = 3;
# —
return 0;
}
We can support it by feeding this to gcc, which gives
../a.c: In function ‘main’:
../a.c:2: error: stray ‘`’ in program
../a.c:3: error: stray ‘#’ in program
../a.c:3: error: stray ‘\342’ in program
../a.c:3: error: stray ‘\200’ in program
../a.c:3: error: stray ‘\224’ in program
GCC is smart and does error-recovery so it parsed a function definition (it knows we are in 'main') but these errors definitely look like lexical errors, they are not syntax errors and rightly so. GCC's lexer doesn't have any types of tokens that can be built from these symbols. Note that it even treats a three-byte UTF-8 symbol as three unrecognized symbols.

Illegal id
int 3d = 1;
Illegal preprocessor directive
#define x 1
Unexpected token
if [0] {}
Unresolvable id
while (0) {}

Lexical errors:
An unterminated comment
Any sequence of non-comment and non-whitespace characters that is not a valid preprocessor token
Any preprocessor token that is not a valid C token; an example is 0xe-2, which looks like an expression but is in fact a syntax error according to the standard -- an odd corner case resulting from the rules for pp-tokens.

Badly formed float constant (e.g. 123.34e, or 123.45.33).

Related

Trying to filter out invalid URL characters in C regex

I created code here that is supposed to determine if a URL contains an invalid set of characters, and regex may be a good way to go.
The problem here is that the target string in this code (stored in the value of the char array variable "find") is not being taken as a valid match even though my regex means match any character between square brackets at least once, and an exclamation mark is listed in the character set.
Also, when compiling with all warnings on, I receive these warnings:
./test2.c:6:25: warning: unknown escape sequence '\#'
./test2.c:6:25: warning: unknown escape sequence '\!'
./test2.c:6:25: warning: unknown escape sequence '\$'
./test2.c:6:25: warning: unknown escape sequence '\&'
./test2.c:6:25: warning: unknown escape sequence '\-'
./test2.c:6:25: warning: unknown escape sequence '\;'
./test2.c:6:25: warning: unknown escape sequence '\='
./test2.c:6:25: warning: unknown escape sequence '\]'
./test2.c:6:25: warning: unknown escape sequence '\_'
./test2.c:6:25: warning: unknown escape sequence '\~'
And the one that bugs me is:
./test2.c:6:25: warning: unknown escape sequence '\]'
because if I don't escape it, then I'm using it to end a set of characters to check for, yet I want that character to be included as a literal character in the check.
What can I do to fix this regex issue?
I want to be able to make an apache module from this after in C so that if a hacker tries using strange unacceptable characters in the URL, he will be directed to an error page. Once I figure this regex mess out, then I'll be on my way.
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
int main(){
const char* regex="/^[\#\!\$\&\-\;\=\?\[\]\_\~]+$/";
const char* find="!!!";
regex_t r;int s;
if ((s=regcomp(&r,regex,REG_EXTENDED)) != 0){
printf("Error compiling\n");return 1;
}
const int maxmat=10;
regmatch_t ml[maxmat];
if (regexec(&r,find,maxmat,ml,0) != 0){
printf("No match\n");
}else{
printf("Matched");
}
regfree(&r);
return 0;
}
This regex seems to work for me:
char* regex="(.*)[#!$&-;=?_~]+";
The various warnings you got were from the C compiler itself, not the regex compiler. The C compiler does not know anything about regular expressions or character sets. It does know about string lierals and the escape character for C strings is also '\', so it is trying to interpret all of the backslash characters as C string escape character for things like:
\n - newline
\" - quote character
\\ - backslash character
In order to pass a backslash to the regex engine, you must first escape it in the C string literal. Simply replace all of your \ with \\ and you will have more luck with you regular expressions.
If you have the option of compiling with C++11 compliant compiler you have the option of using raw strings, which get rid of all of the escaping in normal C strings:
strlen("\n") => 1
strlen(R"(\n)"); => 2
In the second case the string starts with R"( and continues until it finds )". So the second string consists of two characters \ and n rather than a single newline character.
This is very handy for using with regular expressions as it does not require multiple levels of escape characters.
A common beginner mistake is the assumption that you need or want to backslash stuff in a regular expression class. You don't; inside square brackets, every character represents just itself. There are a few special cases which require special handling, but not with backslashing.
If you want a literal ^ in the character class, it mustn't go first.
If you want a literal ] in the character class, it needs to go first (after any ^ to specify negation).
If you want a literal - in the character class, it needs to go first (even before any ], but after a ^ for negating the character class) or last.
By convention, if you want both ] and [, you usually put them next to each other.
So, you want
const char* regex="^[-][#!$&;=?_~]+$";
The slashes you had before and after the regex looked like you thought they were necessary or useful as regex separators; but they're not, so I took them out.
This will match a string consisting solely of the characters in your class. By your description, that's not really what you want. But you don't need a regex for finding an occurrence of one of these characters somewhere in a string; look at the general C string search functions.

gcc parsing code which has been #if 0 out [duplicate]

int main(void)
{
#if 0
something"
#endif
return 0;
}
A simple program above generates a warning: missing terminating " character in gcc. This seems odd, because it means that the compiler allow the code blocks between #if 0 and endif have invalid statement like something here, but not double quotes " that don't pair. The same happens in the use of #ifdef and #ifndef.
Real comments are fine here:
int main(void)
{
/*
something"
*/
return 0;
}
Why? And the single quote ' behave similarly, is there any other tokens that are treating specially?
See the comp.Lang.c FAQ, 11.19:
Under ANSI C, the text inside a "turned off" #if, #ifdef, or #ifndef must still consist of "valid preprocessing tokens." This means that the characters " and ' must each be paired just as in real C code, and the pairs mustn't cross line boundaries.
Compilation needs to go through many cycles, before generating executable binary.
You are not in the compiler yet. Your pre-processor is flagging this error. This will not check for C language syntax, but missing quotes, braces and things like that are pre-processor errors.
After this pre-processor pass, Your code will go to the C Compiler which will detect the error you are expecting...
The preprocessor works at the token level, and a string literal is considered a single token. The preprocessor is warning you that you have an invalid token.
According to the C99 standard, a preprocessing token is one of these things:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the
above
The standard also says:
If a ' or a " character matches the last category, the behavior is
undefined.
Things like "statement" above are invalid to the C compiler, but it is a valid token, and the preprocessor eliminates this token before it gets to the compiler.
Beside the Kevin's answer, Incompatibilities of GCC says:
GCC complains about unterminated character constants inside of preprocessing conditionals that fail. Some programs have English comments enclosed in conditionals that are guaranteed to fail; if these comments contain apostrophes, GCC will probably report an error. For example, this code would produce an error:
#if 0
You can't expect this to work.
#endif
The best solution to such a problem is to put the text into an actual C comment delimited by /*...*/.

Compilation Error in C-File handling

#include<stdio.h>
#include<conio.h>
FILE *fp;
int main()
{
int val;
char line[80];
fp=fopen("\Users\P\Desktop\Java\a.txt","rt");
while( fgets(line,80,fp)!=NULL )
{
sscanf(line,"%d",&val);
printf("val is:: %d",val);
}
fclose(fp);
return 0;
}
Why is there a compile error in the line fp=fopen("\Users\P\Desktop\Java\a.txt","rt")?
Escape your backslashes.
fp=fopen("\\Users\\P\\Desktop\\Java\\a.txt","rt");
xx.c:8:12: error: \u used with no following hex digits
fp=fopen("\Users\P\Desktop\Java\a.txt","rt");
^
xx.c:8:12: warning: unknown escape sequence '\P'
xx.c:8:12: warning: unknown escape sequence '\D'
xx.c:8:12: warning: unknown escape sequence '\J'
The issue with the backslash. Backslash is an escape in a C char string.
Try this
fp=fopen("\\Users\\P\Desktop\\Java\\a.txt","rt");
or this depending on your OS:
fp=fopen("/Users/P/Desktop/Java/a.txt","rt");
You may be familiar with how "\n" (newline) and "\t" (tab) are used in C-strings.
The compiler will look at any \<Character> and try to interpret it as an Escape-Sequence.
So, where you wrote "\Users\P\Desktop\Java\a.txt", the compiler is trying to treat
\U, \P, \D, \J and \a as special escape-sequences.
(The only one that seems to be valid is \a, which is the Bell/Beep sequence. The others should all generate errors)
As others have said, use \\ to insert a literal Backslash character, and not start an escape sequence.
P.S. Shame on you for not including the the compiler message in your question.
The worst questions all say, "I got an error", without ever describing what the error was.

Code blocks between #if 0 and #endif must have paired double quotes?

int main(void)
{
#if 0
something"
#endif
return 0;
}
A simple program above generates a warning: missing terminating " character in gcc. This seems odd, because it means that the compiler allow the code blocks between #if 0 and endif have invalid statement like something here, but not double quotes " that don't pair. The same happens in the use of #ifdef and #ifndef.
Real comments are fine here:
int main(void)
{
/*
something"
*/
return 0;
}
Why? And the single quote ' behave similarly, is there any other tokens that are treating specially?
See the comp.Lang.c FAQ, 11.19:
Under ANSI C, the text inside a "turned off" #if, #ifdef, or #ifndef must still consist of "valid preprocessing tokens." This means that the characters " and ' must each be paired just as in real C code, and the pairs mustn't cross line boundaries.
Compilation needs to go through many cycles, before generating executable binary.
You are not in the compiler yet. Your pre-processor is flagging this error. This will not check for C language syntax, but missing quotes, braces and things like that are pre-processor errors.
After this pre-processor pass, Your code will go to the C Compiler which will detect the error you are expecting...
The preprocessor works at the token level, and a string literal is considered a single token. The preprocessor is warning you that you have an invalid token.
According to the C99 standard, a preprocessing token is one of these things:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the
above
The standard also says:
If a ' or a " character matches the last category, the behavior is
undefined.
Things like "statement" above are invalid to the C compiler, but it is a valid token, and the preprocessor eliminates this token before it gets to the compiler.
Beside the Kevin's answer, Incompatibilities of GCC says:
GCC complains about unterminated character constants inside of preprocessing conditionals that fail. Some programs have English comments enclosed in conditionals that are guaranteed to fail; if these comments contain apostrophes, GCC will probably report an error. For example, this code would produce an error:
#if 0
You can't expect this to work.
#endif
The best solution to such a problem is to put the text into an actual C comment delimited by /*...*/.

Why can't we use the preprocessor to create custom-delimited strings?

I was playing around a bit with the C preprocessor, when something which seemed so simple failed:
#define STR_START "
#define STR_END "
int puts(const char *);
int main() {
puts(STR_START hello world STR_END);
}
When I compile it with gcc (note: similar errors with clang), it fails, with these errors:
$ gcc test.c
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
test.c: In function ‘main’:
test.c:7: error: missing terminating " character
test.c:7: error: ‘hello’ undeclared (first use in this function)
test.c:7: error: (Each undeclared identifier is reported only once
test.c:7: error: for each function it appears in.)
test.c:7: error: expected ‘)’ before ‘world’
test.c:7: error: missing terminating " character
Which sort of confused me, so I ran it through the pre-processor:
$ gcc -E test.c
# 1 "test.c"
# 1 ""
# 1 ""
# 1 "test.c"
test.c:1:19: warning: missing terminating " character
test.c:2:17: warning: missing terminating " character
int puts(const char *);
int main() {
puts(" hello world ");
}
Which, despite the warnings, produces completely valid code (in the bolded text)!
If, macros in C are simply a textual replace, why is it that my initial example would fail? Is this a compiler bug? If not, where in the standards does it have information pertaining to this scenario?
Note: I am not looking for how to make my initial snippet compile. I am simply looking for info on why this scenario fails.
The problem is that even though the code expands to " hello, world ", it's not being recognized as a single string literal token by the preprocessor; instead, it's being recognized as the (invalid) sequence of tokens ", hello, ,, world, ".
N1570:
6.4 Lexical elements
...
3 A token is the minimal lexical element of the language in translation phases 7 and 8. The
categories of tokens are: keywords, identifiers, constants, string literals, and punctuators.
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing tokens are: header names,
identifiers, preprocessing numbers, character constants, string literals, punctuators, and
single non-white-space characters that do not lexically match the other preprocessing
token categories.69) If a ' or a " character matches the last category, the behavior is
undefined. Preprocessing tokens can be separated by white space; this consists of
comments (described later), or white-space characters (space, horizontal tab, new-line,
vertical tab, and form-feed), or both. As described in 6.10, in certain circumstances
during translation phase 4, white space (or the absence thereof) serves as more than
preprocessing token separation. White space may appear within a preprocessing token
only as part of a header name or between the quotation characters in a character constant
or string literal.
69) An additional category, placemarkers, is used internally in translation phase 4 (see 6.10.3.3); it cannot
occur in source files.
Note that neither ' nor " are punctuators under this definition.
The preprocessor runs in multiple phases. Phase 3, tokenization, occurs before expansion, so preprocessor macros must represent full tokens. In your case, STR_START and STR_END are tokenized and then substituted, which makes those tokens invalid.
Here
#define STR_START "
compiler expects string literal. String literal must end with closing quote. That's why compiler complains about missing terminating " character.
After macro expansion compiler complains again, because invalid tokens.
For example, MSVC compiler complains in other words:
error C2001: newline in constant
and after expansion it complains about missing quotes.

Resources