In the regex below, \s denotes a whitespace character. I imagine the regex parser goes through the string, sees \, and knows that the next character is special.
But this is not the case as double escapes are required.
Why is this?
var res = new RegExp('(\\s|^)' + foo).test(moo);
Is there a concrete example of how a single escape could be mis-interpreted as something else?
You are constructing the regular expression by passing a string to the RegExp constructor.
\ is an escape character in string literals.
The \ is consumed by the string literal parsing…
const foo = "foo";
const string = '(\s|^)' + foo;
console.log(string);
… so the data you pass to the RegEx compiler is a plain s and not \s.
You need to escape the \ to express the \ as data instead of being an escape character itself.
Inside the code where you're creating a string, the backslash is a JavaScript escape character first, which means escape sequences like \t, \n, \", etc. will be translated into their JavaScript counterparts (tab, newline, quote, etc.), and those will be made a part of the string. A double backslash represents a single backslash in the actual string itself, so if you want a backslash in the string, you escape that first.
So when you generate a string by saying var someString = '(\\s|^)', what you're really doing is creating an actual string with the value (\s|^).
The Regex needs a string representation of \s, which in JavaScript can be produced using the literal "\\s".
Here's a live example to illustrate why "\s" is not enough:
alert("One backslash: \s\nDouble backslashes: \\s");
Note how an extra \ before \s changes the output.
As has been said, inside a string literal, a backslash indicates an escape sequence, rather than a literal backslash character, but the RegExp constructor often needs literal backslash characters in the string passed to it, so the code should have \\s to represent a literal backslash, in most cases.
A problem is that double-escaping metacharacters is tedious. There is one way to pass a string to new RegExp without having to double escape them: use the String.raw template tag, an ES6 feature, which allows you to write a string that will be parsed by the interpreter verbatim, without any parsing of escape sequences. For example:
console.log('\\'.length); // length 1: an escaped backslash
console.log(`\\`.length); // length 1: an escaped backslash
console.log(String.raw`\\`.length); // length 2: no escaping in String.raw!
So, if you wish to keep your code readable, and you have many backslashes, you may use String.raw to type only one backslash, when the pattern requires a backslash:
const sentence = 'foo bar baz';
const regex = new RegExp(String.raw`\bfoo\sbar\sbaz\b`);
console.log(regex.test(sentence));
But there's a better option. Generally, there's not much good reason to use new RegExp unless you need to dynamically create a regular expression from existing variables. Otherwise, you should use regex literals instead, which do not require double-escaping of metacharacters, and do not require writing out String.raw to keep the pattern readable:
const sentence = 'foo bar baz';
const regex = /\bfoo\sbar\sbaz\b/;
console.log(regex.test(sentence));
Best to only use new RegExp when the pattern must be created on-the-fly, like in the following snippet:
const sentence = 'foo bar baz';
const wordToFind = 'foo'; // from user input
const regex = new RegExp(String.raw`\b${wordToFind}\b`);
console.log(regex.test(sentence));
\ is used in Strings to escape special characters. If you want a backslash in your string (e.g. for the \ in \s) you have to escape it via a backslash. So \ becomes \\ .
EDIT: Even had to do it here, because \\ in my answer turned to \.
Related
I came across a line like
char* template = "<html><head><title>%i %s</title></head><body><h1>%i %s</h1> </body></html>";
while reading through code to implement a web server.
I'm curious as I've never seen a string like this before - is template specifying a special type of string (I'm just guessing here because it was highlighted on my IDE)? Also, how would strlen() work with something like this?
Thanks
char* template = "<html>...</html>";
is fundamentally no different than
char *s = "hello";
The name template is not special, it's just an ordinary identifier, the name of the variable. (template happens to be a keyword in C++, but this is C.)
It would be better to define it as const, to enforce the fact that string literals cannot be modified, but it's not mandatory.
Note that template itself is not a string. It's a pointer to a string. The string itself (defined by the language as "a contiguous sequence of characters terminated by and including the first null character") is the sequence starting with "<html>" and ending with "</html>" and the implicit terminating null character.
And in answer to your second question, strlen(template) would work just fine, giving you the length of the string, not counting the terminating null character (74 for the literal exactly as shown above).
I imagine that there is another part of the code that uses this string to format an output string used as a page by the web server. The strlen function will return the length of the string.
Unless there's a null character somewhere in the initializer or an escape sequence using a \ character, which there isn't, there's nothing special about this string. A % is a normal character in a string and doesn't receive special treatment. The strlen function in particular will read %i as two characters, i.e. % and i. Similarly for %s.
In contrast, a \ is a special character in a string literal and denotes an escape sequence. The \ and the character that follows it in the string constant constitute a single character in the string itself. For example, \n means a newline character (ASCII 10) and \t is a tab character (ASCII 9).
This string is most likely used as a format string for printf or one of its relatives. Such a function reads the string and interprets %i and %s as conversion specifiers that consume an int and a char * argument respectively.
char* template = "<html>...</html>";
just creates a pointer named template that points to the string data "<html>...</html>"; you can change this name to any other name you want. The compiler adds a \0 to the end of the string literal, and strlen counts the characters from the start of the string up to that \0 (the \0 itself is not included).
I think your IDE highlights this string because it is used elsewhere in the code.
I created code here that is supposed to determine if a URL contains an invalid set of characters, and regex may be a good way to go.
The problem here is that the target string in this code (stored in the value of the char array variable "find") is not being taken as a valid match even though my regex means match any character between square brackets at least once, and an exclamation mark is listed in the character set.
Also, when compiling with all warnings on, I receive these warnings:
./test2.c:6:25: warning: unknown escape sequence '\#'
./test2.c:6:25: warning: unknown escape sequence '\!'
./test2.c:6:25: warning: unknown escape sequence '\$'
./test2.c:6:25: warning: unknown escape sequence '\&'
./test2.c:6:25: warning: unknown escape sequence '\-'
./test2.c:6:25: warning: unknown escape sequence '\;'
./test2.c:6:25: warning: unknown escape sequence '\='
./test2.c:6:25: warning: unknown escape sequence '\]'
./test2.c:6:25: warning: unknown escape sequence '\_'
./test2.c:6:25: warning: unknown escape sequence '\~'
And the one that bugs me is:
./test2.c:6:25: warning: unknown escape sequence '\]'
because if I don't escape it, then I'm using it to end a set of characters to check for, yet I want that character to be included as a literal character in the check.
What can I do to fix this regex issue?
I want to be able to make an apache module from this after in C so that if a hacker tries using strange unacceptable characters in the URL, he will be directed to an error page. Once I figure this regex mess out, then I'll be on my way.
This is my code so far:
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
int main(){
const char* regex="/^[\#\!\$\&\-\;\=\?\[\]\_\~]+$/";
const char* find="!!!";
regex_t r;int s;
if ((s=regcomp(&r,regex,REG_EXTENDED)) != 0){
printf("Error compiling\n");return 1;
}
const int maxmat=10;
regmatch_t ml[maxmat];
if (regexec(&r,find,maxmat,ml,0) != 0){
printf("No match\n");
}else{
printf("Matched");
}
regfree(&r);
return 0;
}
This regex seems to work for me:
char* regex="(.*)[#!$&;=?_~-]+";
The various warnings you got were from the C compiler itself, not the regex compiler. The C compiler does not know anything about regular expressions or character sets. It does know about string literals, and the escape character for C strings is also '\', so it is trying to interpret each backslash as the start of a C string escape sequence such as:
\n - newline
\" - quote character
\\ - backslash character
In order to pass a backslash to the regex engine, you must first escape it in the C string literal. Simply replace all of your \ with \\ and you will have more luck with your regular expressions.
If you have the option of compiling with a C++11-compliant compiler, you can use raw strings, which do away with escape processing inside the literal:
strlen("\n") => 1
strlen(R"(\n)"); => 2
In the second case the string starts with R"( and continues until it finds )". So the second string consists of two characters \ and n rather than a single newline character.
This is very handy for using with regular expressions as it does not require multiple levels of escape characters.
A common beginner mistake is the assumption that you need or want to backslash stuff in a regular expression class. You don't; inside square brackets, every character represents just itself. There are a few special cases which require special handling, but not with backslashing.
If you want a literal ^ in the character class, it mustn't go first.
If you want a literal ] in the character class, it needs to go first (after any ^ to specify negation).
If you want a literal - in the character class, it needs to go first (after a ^ for negating the character class) or last. When the class also contains a literal ], the ] takes the first position, so put the - last.
By convention, if you want both ] and [, you usually put them next to each other.
So, you want
const char* regex="^[][#!$&;=?_~-]+$";
The slashes you had before and after the regex looked like you thought they were necessary or useful as regex separators; but they're not, so I took them out.
This will match a string consisting solely of the characters in your class. By your description, that's not really what you want. But you don't need a regex for finding an occurrence of one of these characters somewhere in a string; look at the general C string search functions.
Basically, I can't figure this out, I want my C program to store the entire plaintext of a batch program then insert in a file and then run.
I finished my program, but holding the contents is my problem. How do I insert the code in a string and make it ignore ALL special characters like %s \ etc?
You have to escape special characters with a \, you can escape backslash itself with another backslash (i.e. \\).
As Ian previously mentioned, you can escape characters that aren't allowed in normal C strings with \; for instance, newline becomes \n, double-quote becomes \", and backslash becomes \\.
If you're unable or unwilling to do this for whatever reason, then you may be out of luck if your solution must be in C. However, if you're willing to switch to C++, then you can use raw strings:
const char* s1 = R"foo(
Hello
World
)foo";
This is equivalent to
const char* s2 = "\nHello\nWorld\n";
A raw string must begin with R" followed by an arbitrary delimiter (made of any source character but parentheses, backslash and spaces; can be empty; and at most 16 characters long), then (, and must end with ) followed by the delimiter and ". The delimiter must be chosen such that the termination substring (), delimiter, ") does not appear within the string.
Is it possible to write something like this:
printf(#"
-
-
-
-
");
I can do it in C#, but can't in C. It gives me an error in CodeBlocks. Am I allowed to do this?
Error message: error: stray '#' in program.
No. That syntax doesn't exist in C.
If you want a multiple-line string, write it as multiple double-quoted strings with no other tokens in between them. They will be combined.
printf(
"some string"
"more of the string"
"even more of the string"
);
(You will, of course, need to add a \n at the end of each line if that's what you want.)
No that's not a syntax that C understands, C doesn't have raw literals.
You can use \ as the last character to continue on the next line:
const char *str = "hello\n\
world";
Also, consecutive string literals will be concatenated. So you can do e.g.
const char *str = "Hello\n"
"world\n";
C#'s verbatim strings are not available in C. If you have some characters to escape, like " or \, escape them with '\'; there is no other option in this language.
If you want a string literal to span multiple lines, you can either insert \n at the appropriate locations in your string, or split the literal across source lines by ending each line with a backslash (the resulting string then contains no newline at those points):
printf("Here's\
a multiline\
string literal");
Line continuation with \ at the end of the line.
printf("\
\
-\
-\
-\
-\
");
String literals in C may not contain newlines. You have two workarounds:
Use implicit string concatenation (done by the compiler).
printf("The quick brown"
" fox jumps over"
" the sleazy dog.");
Escape the newline by placing a backslash in front of it.
printf("The quick brown\
fox jumps over\
the sleazy dog.");
Personally, I prefer the first form since the second looks ugly (my opinion) and forces you to ruin your code indentation.
In either case, the string will simply not contain the newlines. So if you really meant for them to be there, you'll have to add them via \n.
there is a string:
"fdsfsfsfsfsdomnol$natureOrder(0123)jqnm"
I want to match the substring $natureOrder(0123), so I do something like this:
regcomp(&reg, "\$natureOrder\([0-9]{1,4}\)", cflags);
but it doesn't work! How should I write the regex pattern?
Apart from escaping the $, you also need the parentheses in your regex, and those must be escaped too.
So the regular expression would be
\$natureOrder\([0-9]{1,4}\)
And inside a C string literal, where \ starts an escape sequence, each backslash has to be doubled:
regcomp(&reg, "\\$natureOrder\\([0-9]{1,4}\\)", cflags);