Escaping '{' and '}' in regex for C

I would like to use the following regex \{.+\} in C. For example, {HELLO} would be valid but HELLO}2, {HELLO and HELLO would not.
I am making use of the POSIX regex library regex.h.
However, I am getting a regcomp error 13 when inputting "\\{.+\\}", and "\{.+\}" is giving me an unknown escape sequence warning.
#include <regex.h>
int main()
{
regex_t regex_enclosed;
char* pattern_enclosed = "\\{.+\\}";
// regex is not compiling but returning error code 13
regcomp(&regex_enclosed, pattern_enclosed, 0);
return 0;
}
Is there any way around this? If I don't escape the { and }, the pattern isn't compiled correctly.

You must use the REG_EXTENDED flag to compile extended regular expressions. Basic regular expressions are not very intuitive and mostly obsolete. Furthermore, you probably want the shortest match, so that only {HOME} is matched in "{HOME}/{DATE}":
regex_t regex_enclosed;
const char *pattern_enclosed = "\\{[^}]+\\}"; // can also use "[{][^}]+[}]"
int res = regcomp(&regex_enclosed, pattern_enclosed, REG_EXTENDED);
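A minimal sketch of how this could be wrapped in a helper, assuming the goal is to validate the whole string (hence the added ^ and $ anchors, which are not part of the pattern above). The name is_enclosed is hypothetical. regerror is used so that a numeric regcomp failure code, such as the 13 from the question, is reported as a readable message.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if s is exactly "{...}" with at least one
 * non-'}' character inside, 0 if not, -1 if the pattern fails to compile. */
int is_enclosed(const char *s)
{
    regex_t re;
    /* ^ and $ are an assumption: they force the whole string to match. */
    int rc = regcomp(&re, "^\\{[^}]+\\}$", REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);   /* decode the error code */
        fprintf(stderr, "regcomp failed: %s\n", msg);
        return -1;
    }
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the inputs from the question, is_enclosed("{HELLO}") should report a match, while "HELLO}2", "{HELLO" and "HELLO" should not.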

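Without REG_EXTENDED, you are using POSIX BRE. You can still use POSIX BRE, provided that you:
Do not escape the braces (in BRE, \{ and \} delimit an interval expression, while plain { and } are literal)
Use the Kleene star * instead of + (+ matches a literal + in POSIX BRE)
Use a negated bracket expression to match the text between the braces
For example: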
regex_t regex_enclosed;
const char *pattern_enclosed = "{[^{}]*}";
int res = regcomp(&regex_enclosed, pattern_enclosed, 0);
EXPLANATION
{         a literal '{'
[^{}]*    any character except '{' and '}' (0 or more times, matching as many as possible)
}         a literal '}'
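A small sketch of the BRE variant as a whole-string check. The helper name is_enclosed_bre, the ^ and $ anchors, and the doubled bracket expression are assumptions made for illustration: since + is not an operator in BRE, "one or more" is spelled as X followed by X*.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: the same check as a basic regular expression
 * (REG_EXTENDED is deliberately not passed). Braces stay unescaped, and
 * "one or more" is written as [^{}][^{}]* because + is literal in BRE. */
int is_enclosed_bre(const char *s)
{
    regex_t re;
    if (regcomp(&re, "^{[^{}][^{}]*}$", REG_NOSUB) != 0)
        return -1;                      /* pattern did not compile */
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

If an empty pair of braces should also be accepted, the single [^{}]* from the answer is enough.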

Related

Can't use regular expression with .*

I've been trying to use regular expressions (<regex.h>) in a C project I am developing.
According to regex101, the regex is well formed and matches what I'm trying to match, but it doesn't work when I run it in C.
#include <stdio.h>
#include <regex.h>
int main() {
char pattern[] = "#include.*";
char line[] = "#include <stdio.h>";
regex_t string;
int regex_return = -1;
regex_return = regcomp(&string, line, 0);
regex_return += regexec(&string, pattern, 0, NULL, 0);
printf("%d", regex_return);
return 0;
}
This is a sample code I wrote to test the expression when I found out it didn't work.
It prints 1, when I expected 0.
It prints 0 if I change the line to "#include", which is just strange to me, because it's ignoring the .* at the end.
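line and pattern are swapped.
regcomp takes the pattern and regexec takes the string to check. As written, the code compiles "#include <stdio.h>" as the pattern and searches for it in the text "#include.*", which fails with REG_NOMATCH (hence the 1). With line set to "#include", that pattern does occur inside "#include.*", hence the 0.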
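A minimal sketch of the corrected argument order. The helper name matches is hypothetical, and it returns 1/0/-1 rather than adding the two return codes together as the question does.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern`, then test `text` against it.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_NOSUB) != 0)   /* the pattern goes to regcomp */
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);     /* the text goes to regexec */
    regfree(&re);
    return rc == 0;
}
```

Called as matches("#include.*", "#include <stdio.h>"), this should report a match. Swapping the two arguments reproduces the behaviour from the question.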

Matching forward slash in regex

I'm having trouble writing a regular expression that matches a forward slash ('/').
I need to match a string like "/ABC6" (a forward slash, then any 3 characters, then exactly one digit). I tried expressions like "^/.{3}[0-9]", "^\/.{3}[0-9]", "^\\/.{3}[0-9]", "^\\\\/.{3}[0-9]" - without success.
How should I do this?
My code:
#include <regex.h>
regex_t regex;
int reti;
/* Compile regular expression */
reti = regcomp(&regex, "^/.{3}[0-9]", 0);
// here checking compilation result - is OK (it means: equal 0)
/* Execute regular expression */
reti = regexec(&regex, "/ABC5", 0, NULL, 0);
// reti indicates no match!
NOTE: this is about C language (gcc) on linux (Debian). And of course the expression like "^\/.{3}[0-9]" causes gcc compilation warning (unknown escape sequence).
SOLUTION: as @tripleee suggested in his answer, the problem was not caused by the slash but by the braces '{' and '}', which are ordinary characters in BRE but form an interval expression in ERE. I changed one line and now everything works:
reti = regcomp(&regex, "^/.{3}[0-9]", REG_EXTENDED);
The slash is fine, the problem is that {3} is extended regular expression (ERE) syntax -- you need to pass REG_EXTENDED or use \{3\} instead (where of course in a C string those backslashes need to be doubled).
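To make the two options concrete, here is a rough sketch that compiles a pattern with caller-supplied flags. The helper name match_flags is hypothetical. Note how the BRE spelling needs doubled backslashes inside the C string literal.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile `pattern` with `cflags` (0 for BRE,
 * REG_EXTENDED for ERE) and test `text` against it.
 * Returns 1 on match, 0 on no match, -1 on a compile error. */
int match_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Both match_flags("^/.{3}[0-9]", "/ABC5", REG_EXTENDED) and match_flags("^/.\\{3\\}[0-9]", "/ABC5", 0) should report a match. The unescaped {3} with flags 0 should not, because the braces are then taken literally.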

Regular Expressions are not returning correct solution

I'm writing a C program that uses a regular expression to determine whether certain words read from a file are valid or invalid. I've attached the code that does my regular expression check. I used an online regex checker, and according to it my regex is correct. I'm not sure why else it would be wrong.
The regex should accept a string in the format AB1234, ABC1234, or ABCD1234.
//compile the regular expression
reti1 = regcomp(&regex1, "[A-Z]{2,4}\\d{4}", 0);
// does the actual regex test
status = regexec(&regex1,inputString,(size_t)0,NULL,0);
if (status==0)
printf("Matched (0 => Yes): %d\n\n",status);
else
printf(">>>NO MATCH<< \n\n");
You are using POSIX regular expressions, from regex.h. These don't support the syntax you are using, which is PCRE format, and is much more common these days. You are better off trying to use a library that will give you PCRE support. If you have to use POSIX expressions, I think this will work:
#include <regex.h>
#include <stdio.h>
int main(void) {
    int status;
    int reti1;
    regex_t regex1;
    const char *inputString = "ABCD1234";
    // compile the regular expression (ERE, anchored at both ends)
    reti1 = regcomp(&regex1, "^[[:upper:]]{2,4}[[:digit:]]{4}$", REG_EXTENDED);
    if (reti1 != 0) {
        printf("Could not compile regex\n");
        return 1;
    }
    // run the actual regex test
    status = regexec(&regex1, inputString, (size_t)0, NULL, 0);
    if (status == 0)
        printf("Matched (0 => Yes): %d\n\n", status);
    else
        printf(">>>NO MATCH<< \n\n");
    regfree(&regex1);
    return 0;
}
(Note that my C is extremely rusty, so this code is probably horrible.)
I found some good resources on this answer.
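As a rough illustration of why the ^ and $ anchors matter here, the sketch below compiles the same ERE once and counts how many candidate strings it accepts. The helper name count_valid and the sample inputs in the note are made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile the anchored ERE once, then count how many
 * of the n strings match it. Returns -1 if the pattern does not compile. */
int count_valid(const char *const *codes, size_t n)
{
    regex_t re;
    if (regcomp(&re, "^[[:upper:]]{2,4}[[:digit:]]{4}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (regexec(&re, codes[i], 0, NULL, 0) == 0)
            count++;
    regfree(&re);
    return count;
}
```

For the inputs "AB1234", "ABC1234", "ABCD1234", "A1234", "ABCDE1234", "AB123" and "ab1234", only the first three should be counted. Without the anchors, "ABCDE1234" would also be accepted, since regexec searches for the pattern anywhere in the string and "BCDE1234" is a valid substring.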

What is wrong with this Bison grammar?

I'm trying to build a Bison grammar and seem to be missing something. I have kept it very basic, yet I am getting a syntax error and can't figure out why:
Here is my Bison Code:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex(void);
int yyerror(char *s);
%}
// Define the types flex could return
%union {
long lval;
char *sval;
}
// Define the terminal symbol token types
%token <sval> IDENT;
%token <lval> NUM;
%%
Program:
Def ';'
;
Def:
IDENT '=' Lambda { printf("Successfully parsed file"); }
;
Lambda:
"fun" IDENT "->" "end"
;
%%
main() {
yyparse();
return 0;
}
int yyerror(char *s)
{
extern int yylineno; // defined and maintained in flex.flex
extern char *yytext; // defined and maintained in flex.flex
printf("ERROR: %s at symbol \"%s\" on line %i", s, yytext, yylineno);
exit(2);
}
Here is my Flex Code
%{
#include <stdlib.h>
#include "bison.tab.h"
%}
ID [A-Za-z][A-Za-z0-9]*
NUM [0-9][0-9]*
HEX [$][A-Fa-f0-9]+
COMM [/][/].*$
%%
fun|if|then|else|let|in|not|head|tail|and|end|isnum|islist|isfun {
printf("Scanning a keyword\n");
}
{ID} {
printf("Scanning an IDENT\n");
yylval.sval = strdup( yytext );
return IDENT;
}
{NUM} {
printf("Scanning a NUM\n");
/* Convert into long to loose leading zeros */
char *ptr = NULL;
long num = strtol(yytext, &ptr, 10);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
{HEX} {
printf("Scanning a NUM\n");
char *ptr = NULL;
/* convert hex into decimal using offset 1 because of the $ */
long num = strtol(&yytext[1], &ptr, 16);
if( errno == ERANGE ) {
printf("Number was to big");
exit(1);
}
yylval.lval = num;
return NUM;
}
";"|"="|"+"|"-"|"*"|"."|"<"|"="|"("|")"|"->" {
printf("Scanning an operator\n");
}
[ \t\n]+ /* eat up whitespace */
{COMM}* /* eat up one-line comments */
. {
printf("Unrecognized character: %s at linenumber %d\n", yytext, yylineno );
exit(1);
}
%%
And here is my Makefile:
all: parser
parser: bison flex
gcc bison.tab.c lex.yy.c -o parser -lfl
bison: bison.y
bison -d bison.y
flex: flex.flex
flex flex.flex
clean:
rm bison.tab.h
rm bison.tab.c
rm lex.yy.c
rm parser
Everything compiles just fine; I do not get any errors running make all.
Here is my testfile
f = fun x -> end;
And here is the output:
./parser < a0.0
Scanning an IDENT
Scanning an operator
Scanning a keyword
Scanning an IDENT
ERROR: syntax error at symbol "x" on line 1
Since x seems to be recognized as an IDENT, the rule should be correct, yet I am still getting a syntax error.
I feel like I am missing something important, hopefully somebody can help me out.
Thanks in advance!
EDIT:
I tried to remove the IDENT in the Lambda rule and the testfile, now it seems to run through the line, but still throws
ERROR: syntax error at symbol "" on line 1
after the EOF.
Your scanner recognizes keywords (and prints out a debugging line, but see below), but it doesn't bother reporting anything to the parser. So they are effectively ignored.
In your bison definition file, you use (for example) "fun" as a terminal, but you do not provide the terminal with a name which could be used in the scanner. The scanner needs this name, because it has to return a token id to the parser.
To summarize, what you need is something like this:
In your grammar, before the %%:
%token T_FUN "fun"
%token T_IF "if"
%token T_THEN "then"
/* Etc. */
In your scanner definition:
fun { return T_FUN; }
if { return T_IF; }
then { return T_THEN; }
/* Etc. */
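The contract between scanner and parser can also be pictured in plain C. This is only a sketch of the idea, not how flex implements it: a table lookup stands in for the separate flex rules, the function name keyword_or_ident is hypothetical, and the enum values are placeholders for the ids that bison would generate in bison.tab.h.

```c
#include <string.h>

/* Placeholder token ids; in a real build these come from bison.tab.h. */
enum { T_IDENT = 258, T_FUN, T_IF, T_THEN };

/* Hypothetical helper: map a scanned word to the token id the parser
 * expects. Every recognized lexeme must produce some id, otherwise the
 * parser never sees it. */
int keyword_or_ident(const char *text)
{
    static const struct { const char *word; int token; } keywords[] = {
        { "fun", T_FUN }, { "if", T_IF }, { "then", T_THEN },
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(text, keywords[i].word) == 0)
            return keywords[i].token;
    return T_IDENT;   /* anything else is an ordinary identifier */
}
```

The point is only that "fun" has to come back as its own token id instead of being swallowed, which is what the return statements in the scanner actions achieve.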
A couple of other notes:
Your scanner rule for recognizing operators also fails to return anything, so operators will also be ignored. That's clearly not desirable. flex and bison allow an easier solution for single-character operators, which is to let the character be its own token id. That avoids having to create a token name. In the parser, a single-quoted character represents a token-id whose value is the character; that's quite different from a double-quoted string, which is an alias for the declared token name. So you could do this:
"=" { return '='; }
/* Etc. */
but it's easier to do all the single-character tokens at once:
[;+*.<=()-] { return yytext[0]; }
and even easier to use a default rule at the end:
. { return yytext[0]; }
which will have the effect of handling unrecognized characters by returning an unknown token id to the parser, which will cause a syntax error.
This won't work for "->", since that is not a single character token, which will have to be handled in the same way as keywords.
Flex will produce debugging output automatically if you use the -d flag when you create the scanner. That's a lot easier than inserting your own debugging printout, because you can turn it off by simply removing the -d option. (You can use %option debug instead if you don't want to change the flex invocation in your makefile.) It's also better because it provides consistent information, including position information.
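Going back to the operator handling described above, the rule "a single character is its own token id, but "->" needs a named token" can be sketched in plain C as follows. The function name operator_token and the T_ARROW value are placeholders; bison would assign the real number.

```c
#include <string.h>

enum { T_ARROW = 258 };   /* placeholder; bison assigns the real value */

/* Hypothetical helper mirroring the scanner actions: a one-character
 * operator is returned as its own character code, while the
 * two-character "->" is mapped to a named token. */
int operator_token(const char *text)
{
    if (strcmp(text, "->") == 0)
        return T_ARROW;
    return (unsigned char)text[0];
}
```

In the grammar, '=' and ';' would then be matched by single-quoted characters, and the arrow by whatever name was declared for it.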
Some minor points:
The pattern [0-9][0-9]* could more easily be written [0-9]+
The comment pattern "//".* does not require a $ lookahead at the end, since .* will always match the longest sequence of non-newline characters; consequently, the first unmatched character must either be a newline or the EOF. $ lookahead will not match if the pattern is terminated with an EOF, which will cause odd errors if the file ends with a comment without a newline at the end.
There is no point using {COMM}* since the comment pattern does not match the newline which terminates the comment, so it is impossible for there to be two consecutive comment matches. But anyway, after matching a comment and the following newline, flex will continue to match a following comment, so {COMM} is sufficient. (Personally, I wouldn't use the COMM abbreviation; it really adds nothing to readability, IMHO.)

Compiling/Matching POSIX Regular Expressions in C

I'm trying to match the following items in the string pcode:
u followed by a 1 or 2 digit number
phaseu
phasep
x (surrounded by non-word chars)
y (surrounded by non-word chars)
z (surrounded by non-word chars)
I've tried to implement a regex match using the POSIX regex functions (shown below), but have two problems:
The compiled pattern seems to have no subpatterns (i.e. compiled.re_nsub == 0).
The pattern doesn't find matches in the string " u0", which it really should!
I'm confident that the regex string itself is working—in that it works in python and TextMate—my problem lies with the compilation, etc. in C. Any help with getting that working would be much appreciated.
Thanks in advance for your answers.
if(idata=tb_find(deftb,pdata)){
MESSAGE("Global variable!\n");
char pattern[80] = "((u[0-9]{1,2})|(phaseu)|(phasep)|[\\W]+([xyz])[\\W]+)";
MESSAGE("Pattern = \"%s\"\n",pattern);
regex_t compiled;
if(regcomp(&compiled, pattern, 0) == 0){
MESSAGE("Compiled regular expression \"%s\".\n", pattern);
}
int nsub = compiled.re_nsub;
MESSAGE("nsub = %d.\n",nsub);
regmatch_t matchptr[nsub];
int err;
if(err = regexec (&compiled, pcode, nsub, matchptr, 0)){
if(err == REG_NOMATCH){
MESSAGE("Regular expression did not match.\n");
}else if(err == REG_ESPACE){
MESSAGE("Ran out of memory.\n");
}
}
regfree(&compiled);
}
It seems you intend to use something resembling the "extended" POSIX regex syntax. POSIX defines two different regex syntaxes, a "basic" (read "obsolete") syntax and the "extended" syntax. To use the extended syntax, you need to add the REG_EXTENDED flag for regcomp:
...
if(regcomp(&compiled, pattern, REG_EXTENDED) == 0){
...
Without this flag, regcomp will use the "basic" regex syntax. There are some important differences, such as:
No support for the | operator
The brackets for submatches need to be escaped, \( and \)
It should be also noted that the POSIX extended regex syntax is not 1:1 compatible with Python's regex (don't know about TextMate). In particular, I'm afraid this part of your regexp does not work in POSIX, or at least is not portable:
[\\W]
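\W means a non-word character (anything other than a letter, digit, or underscore). The portable POSIX way to write that is a negated bracket expression:
[^[:alnum:]_]
Your whole regexp for POSIX should then look like this in C:
char *pattern = "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";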
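A sketch of the full compile-and-match cycle with REG_EXTENDED, under a few assumptions: the helper name first_match is hypothetical, it returns only the overall match, and it uses [^[:alnum:]_] as the bracket-expression spelling of a non-word character. One detail worth noting is that element 0 of the regmatch_t array holds the whole match, so re_nsub + 1 entries are needed to see every subexpression.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the first match of the pattern in `text` into
 * `out` (capacity `outlen`). Returns 1 on match, 0 on no match, -1 on error. */
int first_match(const char *text, char *out, size_t outlen)
{
    static const char pattern[] =
        "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";
    regex_t re;
    if (outlen == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Element 0 is the whole match, so re_nsub + 1 slots are needed. */
    size_t n = re.re_nsub + 1;
    regmatch_t *m = malloc(n * sizeof *m);
    if (m == NULL) {
        regfree(&re);
        return -1;
    }

    int found = (regexec(&re, text, n, m, 0) == 0);
    if (found) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= outlen)
            len = outlen - 1;              /* truncate to fit the buffer */
        memcpy(out, text + m[0].rm_so, len);
        out[len] = '\0';
    }
    free(m);
    regfree(&re);
    return found;
}
```

For the string " u0" from the question, this should find a match and copy "u0" into the buffer.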
