Warning: unknown escape sequence: '\040', why not '\x20'? [closed] - c

When I compile a C file with the contents below:
#include <stdio.h>
#define FILE_NAME "text\ 1"
int main()
{
FILE* file_ptr = fopen(FILE_NAME, "w");
fclose(file_ptr);
return 0;
}
I get this warning:
tt.c: In function ‘main’:
tt.c:6:37: warning: unknown escape sequence: '\040'
6 | FILE* file_ptr = fopen(FILE_NAME, "w");
|
I know it is caused by the \ followed by a space in a string in my C code, and that octal 040 is decimal 32, the ASCII code for SPACE. Why does the warning say '\040' rather than '\x20'?
It also seems that bash passes \ followed by a space on to binaries as \040 (not sure).
Is there a rule that forces this?
Update: deleted '\32', which I had used to represent the ASCII code of SPACE in decimal.
How I encounter this problem?
I just wanted to know how Bash processes an ESCAPED SPACE. I thought Bash turns it into a SPACE, but after checking the Bash source code (hard for me), I found that Bash may treat \ followed by a space as an ordinary string of characters, since the source code below does not mention it:
#define slashify_in_quotes "\\`$\"\n"
#define slashify_in_here_document "\\`$"
#define shell_meta_chars "()<>;&|"
#define shell_break_chars "()<>;&| \t\n"
#define shell_quote_chars "\"`'"
So I think Bash passes the \ through to the command or binary to process, and I wrote the simple C file above to check how C treats it.
So my question is: why does the gcc warning say '\040' and not '\x20'?
As for how Bash treats \ followed by a space, I still need to check...

Answer to Updated Question
Why the warning is '\040' not '\x20'?
This is merely a choice by the compiler implementors. When you have \ in a string or character constant followed by something that is not a recognized escape sequence, the compiler warns you. For example, if you had \g, the compiler would warn you that \g is not recognized. When the character after the \ may be unclear, because it is a white space character that cannot be distinguished from others (like space from tab) or is not a printable character, the compiler shows it by value in the error message. This helps you find the exact character in your text editor, in case some unprintable character has slipped into the source code. The compiler authors could have used hexadecimal but simply chose to use octal.
I will fault them for using an inconsistent style. In GCC 10.2, \g results in the message unknown escape sequence: '\g', but \ followed by a space results in the message unknown escape sequence: '\040'. These should either be:
unknown escape sequence: 'g' and unknown escape sequence: '\040' or
unknown escape sequence: '\g' and unknown escape sequence: '\\040'.
Answer to Original Vague Question
C 2018 6.4.4.4 specifies character constants in C source code, and paragraph 1 lists four choices for escape-sequence: simple-escape-sequence, octal-escape-sequence, hexadecimal-escape-sequence, and universal-character-name.
An octal-escape-sequence is \ followed by one to three octal digits. Thus, \040 is the character with code 040 (octal) = 32 (decimal), and \32 is the character with code 32 (octal) = 26 (decimal).
There is no decimal escape sequence; \32 is an octal escape sequence, not decimal. (Also note that because octal escape sequences can have various lengths, if one wishes to follow it by an octal digit, one must use all three allowed digits. \324 will be parsed as one character, not as \32 followed by 4, whereas \0324 is \032 followed by 4.)
A hexadecimal-escape-sequence is \x followed by one or more hexadecimal digits. \x20 is equal to \040.
(A simple-escape-sequence is one of \', \", \?, \\, \a, \b, \f, \n, \r, \t, or \v. A universal-character-name is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits.)
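The equivalences above can be checked directly. This is only a sketch, assuming an ASCII execution character set; the names (space_octal, print_escape_demo, and so on) are made up for illustration and do not come from the question.

```c
#include <stdio.h>

/* Three spellings of the same character: octal escape, hex escape, literal. */
static const char space_octal = '\040';
static const char space_hex   = '\x20';
static const char space_plain = ' ';

/* \32 is an octal escape: octal 32 is decimal 26, not a space. */
static const char octal_32 = '\32';

/* An octal escape takes at most three digits, so "\324" is a single
   character, while "\0324" is \032 followed by the digit '4'. */
static const char one_char[]  = "\324";
static const char two_chars[] = "\0324";

/* Print the values so they can be compared by eye. */
void print_escape_demo(void) {
    printf("'\\040' = %d, '\\x20' = %d, ' ' = %d\n",
           space_octal, space_hex, space_plain);
    printf("'\\32' = %d\n", octal_32);
    printf("sizeof \"\\324\" = %zu, sizeof \"\\0324\" = %zu\n",
           sizeof one_char, sizeof two_chars);
}
```

Calling print_escape_demo() from main should show the same value for all three spellings of the space, a different value for '\32', and array sizes that differ by one (both sizes include the terminating null).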


Why can identifiers contain '$' in C? [duplicate]

Recently I saw code like this:
int $ = 123;
So why can '$' be in an identifier in C?
Is it the same in C++?
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([a-z][A-Z][0-9]_).
Surface Level
Unlike in other languages (bash, perl), C does not use $ to denote the usage of a variable. As such, it is technically valid. As of C++ 17, this is standards conformant, see Draft n4659. In C it most likely falls under C11, 6.4.2. This means that it does seem to be supported by modern compilers.
As for your C++ question, let's test it!
int main(void) {
int $ = 0;
return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (e.g. lexing).
The grammar (and thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, skinny arrows, etc. Take the C code below:
int i_love_$ = 0;
After the lexer, this becomes a token stream like this:
["int", "i_love_$", "=", "0"]
If you were to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token stream like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
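To make the lexing step concrete, here is a toy tokenizer. It is a sketch only and not how GCC or Clang actually lex: the names is_ident_char and toy_lex are hypothetical, and the rule "a run of identifier characters is one token, any other non-space character is its own token" is a simplification chosen for this example. Unlike the token lists above, it also emits the trailing ";" as a token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: may c appear inside an identifier? '$' is accepted
   here to mimic the compiler extension; a strict lexer would reject it. */
static int is_ident_char(int c) {
    return isalnum((unsigned char)c) || c == '_' || c == '$';
}

/* Split src into tokens and store them as NUL-terminated strings (each at
   most 31 characters; longer runs are truncated). Returns the token count,
   which never exceeds max_tokens. */
size_t toy_lex(const char *src, char tokens[][32], size_t max_tokens) {
    size_t count = 0;
    while (*src != '\0' && count < max_tokens) {
        if (isspace((unsigned char)*src)) {
            src++;                        /* whitespace only separates tokens */
            continue;
        }
        size_t len = 0;
        if (is_ident_char(*src)) {
            /* Consume the whole run of identifier characters. */
            while (is_ident_char(*src)) {
                if (len < 31) {
                    tokens[count][len++] = *src;
                }
                src++;
            }
        } else {
            /* Punctuation such as ',' or '.' is a token by itself. */
            tokens[count][len++] = *src++;
        }
        tokens[count][len] = '\0';
        count++;
    }
    return count;
}
```

Feeding it "int i_love_$,_and_.s = 0;" keeps i_love_$ together as one token while the comma and the period each split the input, which is the difference the answer describes.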
The 2018 C standard says in 6.4.2 1 that an identifier consists of a nondigit character followed by zero or more nondigit or digit characters, where the nondigit characters are:
one of the characters _, a through z, or A through Z,
a universal-character-name, which is \u followed by four hexadecimal digits or \U followed by eight hexadecimal digits, that is outside certain ranges[1], or
implementation-defined characters.
The digit characters are 0 through 9.
Taking GCC as an example, its documentation says these additional characters are defined in its preprocessor section, and that section says GCC accepts $ and the characters that correspond to the universal character names.[2] Thus, allowing $ is a choice made by the compiler implementors.
Draft n4659 of the 2017 C++ standard has the same rules, in clause 5.10 [lex.name], except it limits the universal character names further.
Footnotes
[1] These \u and \U forms allow you to write any character as a hexadecimal code. The excluded ranges are those in C’s basic character set and codes reserved for control characters and special uses.
[2] The “universal character names” are the \u and \U forms. The characters that correspond to them are the characters that those forms represent. For example, π is a universal character, and \u03c0 is the universal character name for it.
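As a minimal illustration of the implementation-defined case, the sketch below uses $ in both a variable and a function name. It assumes a compiler that accepts $ as an extension (GCC and Clang do on common targets), and the names total_$ and add_to_total_$ are invented for this example. A pedantic build may warn about it, and other compilers are free to reject it.

```c
/* '$' in identifiers is an implementation-defined extension; a strictly
   conforming program avoids it. */
static int total_$ = 0;

/* Hypothetical helper: add an amount to the running total and return it. */
int add_to_total_$(int amount) {
    total_$ += amount;
    return total_$;
}
```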

Regex in C to restrict Extended ASCII character set [closed]

I need a regular expression in C that matches any non-empty string containing none of the first 32 characters of extended ASCII. I thought the easiest way to do that would be a pattern like "^[^\\x00-\\x20]+$", but it's not working as I expected. For some reason it won't match any character from 48 to 92. Any ideas what's wrong with this pattern and how I can fix it?
The POSIX regex library (i.e. the functions in regex.h, including regcomp and regexec) does not interpret standard C backslash sequences. It really doesn't need to, since C will do those expansions when you compile the character string literal. (This is something you have to think about if you accept regular expressions from user input.) The only use of \ in a regex is to escape a special character (in REG_EXTENDED mode), or to make a character special (in basic regex mode, which should be avoided).
So if you want to exclude characters from \x01 to \x20, you would write:
"^[^\x01-\x20]+$"
Note that you must supply the REG_EXTENDED flag to regcomp for that to work.
As you might note, that does not exclude NUL (\x00). There's no way to insert a NUL into a regex pattern because NUL is not a valid character inside a C character string; it will terminate the string. For the same reason, it's pointless to try to exclude NUL characters from a C string, because there cannot be any. However, if it made you feel better, you could use:
"^[\x21-\xFF]+$"
Semantically, those two regex patterns are identical (at least, in the default "C" locale and assuming char is 8 bits).
The character class as you wrote it, [^\\x00-\\x20], contains everything but the character x and the range from 0 (48) to \ (92). (That range overlaps with the characters 0, 2 and \, which are named explicitly, some of them twice.)
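Put together, the single-backslash pattern might be used as in the sketch below. It assumes a POSIX system and the default "C" locale; the function name has_no_control_or_space is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is non-empty and contains no byte in the range 0x01..0x20,
   0 if it does not match, and -1 if the pattern fails to compile. */
int has_no_control_or_space(const char *s) {
    regex_t re;
    /* Single backslashes: the C compiler turns \x01 and \x20 into the actual
       bytes, so the regex library sees a plain range inside the brackets. */
    if (regcomp(&re, "^[^\x01-\x20]+$", REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

With this pattern the characters from 0 (48) to \ (92) match like any other printable character, while a string containing a space or a tab is rejected.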
I have never used regex in C. I would do it the following way, using unsigned char to fit extended ASCII:
void match(const unsigned char *src, unsigned char *dst) {
while (*src) {
if (*src >= 32) {
*dst++ = *src++;
} else {
src++;
}
}
*dst = 0;
}

Trying to filter out invalid URL characters in C regex

I created code here that is supposed to determine if a URL contains an invalid set of characters, and regex may be a good way to go.
The problem here is that the target string in this code (stored in the value of the char array variable "find") is not being taken as a valid match even though my regex means match any character between square brackets at least once, and an exclamation mark is listed in the character set.
Also, when compiling with all warnings on, I receive these warnings:
./test2.c:6:25: warning: unknown escape sequence '\#'
./test2.c:6:25: warning: unknown escape sequence '\!'
./test2.c:6:25: warning: unknown escape sequence '\$'
./test2.c:6:25: warning: unknown escape sequence '\&'
./test2.c:6:25: warning: unknown escape sequence '\-'
./test2.c:6:25: warning: unknown escape sequence '\;'
./test2.c:6:25: warning: unknown escape sequence '\='
./test2.c:6:25: warning: unknown escape sequence '\]'
./test2.c:6:25: warning: unknown escape sequence '\_'
./test2.c:6:25: warning: unknown escape sequence '\~'
And the one that bugs me is:
./test2.c:6:25: warning: unknown escape sequence '\]'
because if I don't escape it, then I'm using it to end a set of characters to check for, yet I want that character to be included as a literal character in the check.
What can I do to fix this regex issue?
I want to be able to make an apache module from this after in C so that if a hacker tries using strange unacceptable characters in the URL, he will be directed to an error page. Once I figure this regex mess out, then I'll be on my way.
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
int main(){
const char* regex="/^[\#\!\$\&\-\;\=\?\[\]\_\~]+$/";
const char* find="!!!";
regex_t r;int s;
if ((s=regcomp(&r,regex,REG_EXTENDED)) != 0){
printf("Error compiling\n");return 1;
}
const int maxmat=10;
regmatch_t ml[maxmat];
if (regexec(&r,find,maxmat,ml,0) != 0){
printf("No match\n");
}else{
printf("Matched");
}
regfree(&r);
return 0;
}
This regex seems to work for me:
char* regex="(.*)[#!$&-;=?_~]+";
The various warnings you got were from the C compiler itself, not the regex compiler. The C compiler does not know anything about regular expressions or character sets. It does know about string literals, and the escape character for C strings is also '\', so it is trying to interpret each backslash as the start of a C string escape sequence, such as:
\n - newline
\" - quote character
\\ - backslash character
In order to pass a backslash to the regex engine, you must first escape it in the C string literal. Simply replace all of your \ with \\ and you will have more luck with your regular expressions.
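For instance, to make the regex engine see \. (a literal dot), the C source has to spell it "\\.". The sketch below assumes a POSIX system; is_simple_decimal is a hypothetical name and the pattern is only an illustration, not something from the question.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s is digits, a literal dot, then digits (e.g. "3.14"),
   0 if it is not, and -1 if the pattern fails to compile. */
int is_simple_decimal(const char *s) {
    regex_t re;
    /* "\\." in C source becomes the two characters \. by the time regcomp
       sees it, which the regex engine reads as "a literal dot". */
    if (regcomp(&re, "^[0-9]+\\.[0-9]+$", REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

Written with a single backslash, the same literal would trigger the compiler's unknown escape sequence warning, and the regex engine would most likely receive an unescaped dot that matches any character.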
If you can compile with a C++11-compliant compiler, you can use raw strings, which get rid of all of the escaping needed in normal C strings:
strlen("\n") => 1
strlen(R"(\n)"); => 2
In the second case the string starts with R"( and continues until it finds )". So the second string consists of two characters \ and n rather than a single newline character.
This is very handy for using with regular expressions as it does not require multiple levels of escape characters.
A common beginner mistake is the assumption that you need or want to backslash stuff in a regular expression class. You don't; inside square brackets, every character represents just itself. There are a few special cases which require special handling, but not with backslashing.
If you want a literal ^ in the character class, it mustn't go first.
If you want a literal ] in the character class, it needs to go first (after any ^ to specify negation).
If you want a literal - in the character class, it needs to go first (after a ^ for negating the character class) or last; when ] already occupies the first position, put - last.
By convention, if you want both ] and [, you usually put them next to each other.
So, you want
const char* regex="^[][#!$&;=?_~-]+$";
The slashes you had before and after the regex looked like you thought they were necessary or useful as regex separators; but they're not, so I took them out.
This will match a string consisting solely of the characters in your class. By your description, that's not really what you want. But you don't need a regex for finding an occurrence of one of these characters somewhere in a string; look at the general C string search functions.
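Both approaches can be sketched side by side: a bracket expression with ] first and - last, and a plain string search with strpbrk. This assumes a POSIX system; the names url_punct, only_url_punct and contains_url_punct are hypothetical, and the character list is the one from the question.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* The characters of interest, as listed in the question. */
static const char url_punct[] = "#!$&-;=?[]_~";

/* Returns 1 if s consists solely of those characters (regex version),
   0 if not, and -1 if the pattern fails to compile. */
int only_url_punct(const char *s) {
    regex_t re;
    /* ']' comes first and '-' comes last so both are literal; no backslashes. */
    if (regcomp(&re, "^[][#!$&;=?_~-]+$", REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}

/* Returns 1 if s contains at least one of those characters (no regex). */
int contains_url_punct(const char *s) {
    return strpbrk(s, url_punct) != NULL;
}
```

The second function is probably closer to what a URL filter needs, since it reports an occurrence anywhere in the string rather than requiring every character to be in the set.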

Difference between \% and %% [duplicate]

After reading over some K&R C, I saw that printf can "recognize %% for itself". I tested this and it printed out "%". I then tried "\%", which also printed "%".
So, is there any difference?
Edit for code request:
#include <stdio.h>
int main()
{
printf("%%\n");
printf("\%\n");
return 0;
}
Output:
%
%
Compiled with GCC using -o
GCC version: gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]
%% is not a C escape sequence, but a printf formatter acting like an escape for its own special character.
\% is illegal because it has the syntax of a C escape sequence, but no defined meaning. Escape sequences besides the few listed as standard are compiler-specific. In all likelihood the compiler ignored the backslash, and printf did not see any backslash at runtime. If it had, it would have printed the backslash in the output, because backslash is not special to printf.
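The two layers can be observed separately: what the compiler stores for a literal, and what the printf family produces from it. The sketch below uses snprintf so the result can be inspected; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Length of the format string as the compiler stores it: "%%" is not a C
   escape sequence, so both percent signs survive compilation. */
size_t stored_format_length(void) {
    return strlen("%%");
}

/* Length of what printf-style formatting produces from that format: the
   formatter collapses "%%" into a single '%'. Writes the result into out. */
int formatted_length(char *out, size_t out_size) {
    return snprintf(out, out_size, "%%");
}

/* By contrast, "\\" is handled by the compiler: two source characters,
   one stored backslash, which printf would pass through unchanged. */
size_t stored_backslash_length(void) {
    return strlen("\\");
}
```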
The two are not the same. %% will print %, but with \% you will get a compiler warning:
[Warning] unknown escape sequence: '%' [enabled by default]
The warning is self explanatory that there is no escape sequence like \% in C.
6.4.4.4 Character constants;
says
The double-quote " and question-mark ? are representable either by themselves or by the escape sequences \" and \?, respectively, but the single-quote ' and the backslash \ shall be represented, respectively, by the escape sequences \' and \\.
It is clear that % can't be represented as \%. There isn't any \% in C.
When "%%" is passed to printf it will print % to standard output, but "\%" is not a valid escape sequence in C. Hence the program will compile, but it will not print anything and will generate a warning:
warning: spurious trailing ‘%’ in format [-Wformat=] printf("%");
The list of escape sequences in C can be found in Escape sequences in C.
This won't print % for the second printf.
int main()
{
printf("%%\n");
printf("\%");
printf("\n");
return 0;
}
Output:
%

How do I print the percent sign(%) in C? [duplicate]

Why doesn't this program print the % sign?
#include <stdio.h>
main()
{
printf("%");
getch();
}
Your problem is that you have to change:
printf("%");
to
printf("%%");
Or you could use ASCII code and write:
printf("%c", 37);
:)
So far nothing in this topic explains why, to print a percent sign, one must type %% and not, for example, a backslash escape such as \%.
From comp.lang.c FAQ list · Question 12.6:
The reason it's tricky to print % signs with printf is that % is
essentially printf's escape character. Whenever printf sees a %, it
expects it to be followed by a character telling it what to do next.
The two-character sequence %% is defined to print a single %.
To understand why \% can't work, remember that the backslash \ is the
compiler's escape character, and controls how the compiler interprets
source code characters at compile time. In this case, however, we want
to control how printf interprets its format string at run-time. As far
as the compiler is concerned, the escape sequence \% is undefined, and
probably results in a single % character. It would be unlikely for
both the \ and the % to make it through to printf, even if printf were
prepared to treat the \ specially.
So the reason why one must type printf("%%"); to print a single % is that's what is defined in the printf function. % is an escape character of printf's, and \ of the compiler.
Use "%%". The man page describes this requirement:
% A '%' is written. No argument is converted. The complete conversion specification is '%%'.
Try printing out this way
printf("%%");
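The options mentioned in these answers can be compared in one place. The sketch below formats a value followed by a percent sign in three ways, using snprintf so the result can be checked; the helper names are hypothetical.

```c
#include <stdio.h>

/* Each helper writes "<value>%" into out and returns the number of
   characters produced (excluding the terminating null). */
int percent_with_double(char *out, size_t n, int value) {
    return snprintf(out, n, "%d%%", value);        /* %% in the format */
}

int percent_with_char(char *out, size_t n, int value) {
    return snprintf(out, n, "%d%c", value, '%');   /* '%' as an argument */
}

int percent_with_string(char *out, size_t n, int value) {
    return snprintf(out, n, "%d%s", value, "%");   /* "%" as an argument */
}
```

All three yield the same text. When no formatting is needed at all, putchar('%') or fputs("%", stdout) sidesteps the issue, because those functions do not interpret format strings.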
