Compiling/Matching POSIX Regular Expressions in C

I'm trying to match the following items in the string pcode:
u followed by a 1 or 2 digit number
phaseu
phasep
x (surrounded by non-word chars)
y (surrounded by non-word chars)
z (surrounded by non-word chars)
I've tried to implement a regex match using the POSIX regex functions (shown below), but have two problems:
The compiled pattern seems to have no subpatterns (i.e. compiled.re_nsub == 0).
The pattern doesn't find matches in the string " u0", which it really should!
I'm confident that the regex string itself is working—in that it works in python and TextMate—my problem lies with the compilation, etc. in C. Any help with getting that working would be much appreciated.
Thanks in advance for your answers.
if(idata=tb_find(deftb,pdata)){
MESSAGE("Global variable!\n");
char pattern[80] = "((u[0-9]{1,2})|(phaseu)|(phasep)|[\\W]+([xyz])[\\W]+)";
MESSAGE("Pattern = \"%s\"\n",pattern);
regex_t compiled;
if(regcomp(&compiled, pattern, 0) == 0){
MESSAGE("Compiled regular expression \"%s\".\n", pattern);
}
int nsub = compiled.re_nsub;
MESSAGE("nsub = %d.\n",nsub);
regmatch_t matchptr[nsub];
int err;
if(err = regexec (&compiled, pcode, nsub, matchptr, 0)){
if(err == REG_NOMATCH){
MESSAGE("Regular expression did not match.\n");
}else if(err == REG_ESPACE){
MESSAGE("Ran out of memory.\n");
}
}
regfree(&compiled);
}

It seems you intend to use something resembling the "extended" POSIX regex syntax. POSIX defines two different regex syntaxes, a "basic" (read "obsolete") syntax and the "extended" syntax. To use the extended syntax, you need to add the REG_EXTENDED flag for regcomp:
...
if(regcomp(&compiled, pattern, REG_EXTENDED) == 0){
...
Without this flag, regcomp will use the "basic" regex syntax. There are some important differences, such as:
No support for the | operator
The parentheses for submatches need to be escaped, \( and \)
It should also be noted that the POSIX extended regex syntax is not 1:1 compatible with Python's regex (don't know about TextMate). In particular, I'm afraid this part of your regexp does not work in POSIX, or at least is not portable:
[\\W]
\W means "non-word character", and POSIX defines no such shorthand. The portable way to specify non-word characters is a negated bracket expression:
[^[:alnum:]_]
Your whole regexp for POSIX should then look like this in C:
char *pattern = "((u[0-9]{1,2})|(phaseu)|(phasep)|[^[:alnum:]_]+([xyz])[^[:alnum:]_]+)";

Related

Escaping '{' and '}' in regex for C

I would like to use the following regex \{.+\} in C. For example, {HELLO} would be valid but HELLO}2, {HELLO and HELLO would not.
I am making use of the POSIX regex library regex.h.
However, I am getting a regcomp error 13 when inputting "\\{.+\\}", and "\{.+\}" is giving me an unknown escape sequence warning.
#include <regex.h>
int main()
{
regex_t regex_enclosed;
char* pattern_enclosed = "\\{.+\\}";
// regex is not compiling but returning error code 13
regcomp(&regex_enclosed, pattern_enclosed, 0);
return 0;
}
Is there any way around this? As if I don't escape the { and }, the pattern isn't compiled correctly.
You must use the REG_EXTENDED flag to compile extended regular expressions. Basic regular expressions are not very intuitive and mostly obsolete. Furthermore, you probably want the shortest match, so that only {HOME} is matched in "{HOME}/{DATE}"; a negated bracket expression instead of . achieves that:
regex_t regex_enclosed;
const char *pattern_enclosed = "\\{[^}]+\\}"; // can also use "[{][^}]+[}]"
int res = regcomp(&regex_enclosed, pattern_enclosed, REG_EXTENDED);
Without REG_EXTENDED, you are using POSIX BRE. You can still go with the POSIX BRE, just:
Do not escape the braces (in BRE, \{ and \} delimit an interval expression)
Use the Kleene star * instead of + (+ matches a literal + in POSIX BRE)
Use a negated bracket expression to match the text between the braces
For example:
regex_t regex_enclosed;
const char *pattern_enclosed = "{[^{}]*}";
int res = regcomp(&regex_enclosed, pattern_enclosed, 0);
EXPLANATION
{        a literal '{'
[^{}]*   any character except '{' and '}', 0 or more times (matching as many as possible)
}        a literal '}'

Matching forward slash in regex

I'm having trouble writing a regular expression that matches a forward slash ('/').
I need to match string like "/ABC6" (forward slash, then any 3 characters, then exactly one digit). I tried expressions like "^/.{3}[0-9]", "^\/.{3}[0-9]", "^\\/.{3}[0-9]", "^\\\\/.{3}[0-9]" - without success.
How should I do this?
My code:
#include <regex.h>
regex_t regex;
int reti;
/* Compile regular expression */
reti = regcomp(&regex, "^/.{3}[0-9]", 0);
// here checking compilation result - is OK (it means: equal 0)
/* Execute regular expression */
reti = regexec(&regex, "/ABC5", 0, NULL, 0);
// reti indicates no match!
NOTE: this is about C language (gcc) on linux (Debian). And of course the expression like "^\/.{3}[0-9]" causes gcc compilation warning (unknown escape sequence).
SOLUTION: as @tripleee suggested in his answer, the problem was caused not by the slash but by the braces '{' and '}': unescaped, they are ordinary characters in BRE and only form an interval expression in ERE. I changed one line, and now everything works:
reti = regcomp(&regex, "^/.{3}[0-9]", REG_EXTENDED);
The slash is fine, the problem is that {3} is extended regular expression (ERE) syntax -- you need to pass REG_EXTENDED or use \{3\} instead (where of course in a C string those backslashes need to be doubled).

Regular Expressions are not returning correct solution

I'm writing a C program that uses regular expressions to determine whether certain words read from a file are valid or invalid. I've attached the code that does my regular expression check. I used an online regex checker, and according to it my regex is correct. I'm not sure why else it would be wrong.
The regex should accept a string in the format AB1234, ABC1234, or ABCD1234.
//compile the regular expression
reti1 = regcomp(&regex1, "[A-Z]{2,4}\\d{4}", 0);
// does the actual regex test
status = regexec(&regex1,inputString,(size_t)0,NULL,0);
if (status==0)
printf("Matched (0 => Yes): %d\n\n",status);
else
printf(">>>NO MATCH<< \n\n");
You are using POSIX regular expressions, from regex.h. These don't support the syntax you are using, which is PCRE syntax and is much more common these days: \d is not available, and {2,4} is only treated as an interval when REG_EXTENDED is set. You are better off trying to use a library that will give you PCRE support. If you have to use POSIX expressions, I think this will work:
#include <regex.h>
#include <stdio.h>
int main(void) {
    int status;
    int reti1;
    regex_t regex1;
    const char *inputString = "ABCD1234";
    // compile the regular expression (REG_EXTENDED enables {m,n} intervals)
    reti1 = regcomp(&regex1, "^[[:upper:]]{2,4}[[:digit:]]{4}$", REG_EXTENDED);
    if (reti1 != 0) {
        fprintf(stderr, "Could not compile regex\n");
        return 1;
    }
    // does the actual regex test
    status = regexec(&regex1, inputString, (size_t)0, NULL, 0);
    if (status == 0)
        printf("Matched (0 => Yes): %d\n\n", status);
    else
        printf(">>>NO MATCH<< \n\n");
    regfree(&regex1);
    return 0;
}
(Note that my C is extremely rusty, so this code is probably horrible.)
I found some good resources on this answer.

C: Regular Expression does not match floating point literals

I'm writing a simple script to match floating-point literals. Only forms like +123.23, -123.23, or 123.23 should match; exponent forms such as -1.0e-10 should not. My expression is simply [+-]?([0-9]*[.])?[0-9]+, which captures the optional sign, the digits, the dot, and the fraction. My C validation looks like this:
reti = regcomp(&regex, "[+-]?([0-9]*[.])?[0-9]+", 0);
if (reti) {
fprintf(stderr, "Could not compile regex of floating literals\n");
exit(1);
}
char * testString = "-87.21";
reti = regexec(&regex, testString, 0, NULL, 0);
if (!reti) {
printf("%s \n", testString);
}
However, the value of reti is 1, which means my regex failed on the test string "-87.21". I tested the regex on regexr.com, where it matches "-87.21", so I don't know what is going wrong here. Can anyone help?
You need to add the REG_EXTENDED flag when calling regcomp or else adapt your regex to be compatible with POSIX BRE (Basic Regular Expressions, the legacy syntax used by sed and grep without -E). BRE is the default and probably not what you want.

usage of + in Posix Regex library

This should be pretty simple, but I am having trouble understanding the basic working of '+' in regex.h library in C. Not sure what is going wrong.
Pasting a sample code which doesn't work. I want to find a string which starts with B and ends with A, there can be more than one occurrence of B so I want to use B+
int main(int argc, const char * argv[])
{
regex_t regex;
int reti;
/* Compile regular expression */
reti = regcomp(&regex, "^B+A$", 0);
if( reti)
{
printf("Could not compile regex\n");
exit(1);
}
/* Execute regular expression */
reti = regexec(&regex, "BBBA", 0, NULL, 0);
if (!reti )
{
printf("Match\n");
}
else if( reti == REG_NOMATCH )
{
printf("No match\n");
}
else
{
printf("Regex match failed\n");
exit(1);
}
/* Free compiled regular expression if you want to use the regex_t again */
regfree(&regex);
return 0;
}
This does not find the match, but I am not able to understand why.
Usage of ^BB*A$ works fine, but that is not something I would want.
As I also want to check for something like ^[BCD]+A$ which should match BBBA or CCCCA or DDDDA. Usage of ^[BCD][BCD]*A$ wont work for me as that could match BCCCA which is not the desired match.
Tried using parentheses and brackets in the expression but it doesn't seem to help.
Quick help is much appreciated.
By default regcomp() compiles a pattern as a so-called Basic Regular Expression; in such regular expressions the + operator is not available. The regex syntax you're trying to use is known as Extended Regular Expression syntax. In order to have regcomp() work with that more extended syntax you need to pass it the REG_EXTENDED flag.
By the way, this comment:
As I also want to check for something like ^[BCD]+A$ which should match BBBA or CCCCA or
DDDDA. Usage of ^[BCD][BCD]*A$ wont work for me as that could match BCCCA which is not the
desired match
is based on a misconception of how the quantifiers + and * work. The regular expressions ^[BCD]+A$ and ^[BCD][BCD]*A$ are exactly equivalent.
