How do you capture a group with regex? - c

I'm trying to extract a string from another using regex.
I'm using the POSIX regex functions (regcomp, regexec ...), and I fail at capturing a group ...
For instance, let the pattern be something as simple as "MAIL FROM:<(.*)>"
(with REG_EXTENDED cflags)
I want to capture everything between '<' and '>'
My problem is that regmatch_t gives me the boundaries of the whole pattern (MAIL FROM:<...>) instead of just what's between the parentheses ...
What am I missing ?
Thanks in advance,
edit: some code
#define SENDER_REGEX "MAIL FROM:<(.*)>"
int main(int ac, char **av)
{
    regex_t regex;
    int status;
    regmatch_t pmatch[1];
    if (regcomp(&regex, SENDER_REGEX, REG_ICASE|REG_EXTENDED) != 0)
        printf("regcomp error\n");
    status = regexec(&regex, av[1], 1, pmatch, 0);
    regfree(&regex);
    if (!status)
        printf( "matched from %d (%c) to %d (%c)\n"
              , pmatch[0].rm_so
              , av[1][pmatch[0].rm_so]
              , pmatch[0].rm_eo
              , av[1][pmatch[0].rm_eo]
              );
    return (0);
}
outputs:
$./a.out "012345MAIL FROM:<abcd>$"
matched from 6 (M) to 22 ($)
solution:
as RarrRarrRarr said, the indices are indeed in pmatch[1].rm_so and pmatch[1].rm_eo
hence regmatch_t pmatch[1]; becomes regmatch_t pmatch[2];
and regexec(&regex, av[1], 1, pmatch, 0); becomes regexec(&regex, av[1], 2, pmatch, 0);
Thanks :)

Here's a code example that demonstrates capturing multiple groups.
You can see that group '0' is the whole match, and subsequent groups are the parts within parentheses.
Note that this captures only the first match in the source string. A separate version (linked from the original answer) captures multiple groups in multiple matches.
#include <stdio.h>
#include <string.h>
#include <regex.h>
int main ()
{
    char * source = "___ abc123def ___ ghi456 ___";
    char * regexString = "[a-z]*([0-9]+)([a-z]*)";
    size_t maxGroups = 3;
    regex_t regexCompiled;
    regmatch_t groupArray[maxGroups];
    if (regcomp(&regexCompiled, regexString, REG_EXTENDED))
    {
        printf("Could not compile regular expression.\n");
        return 1;
    }
    if (regexec(&regexCompiled, source, maxGroups, groupArray, 0) == 0)
    {
        unsigned int g = 0;
        for (g = 0; g < maxGroups; g++)
        {
            // rm_so is a signed regoff_t; -1 marks a group that did not match.
            if (groupArray[g].rm_so == -1)
                break; // No more groups
            // Copy the source and cut it at the end of this group, so that
            // printing from rm_so shows exactly the group's text.
            char sourceCopy[strlen(source) + 1];
            strcpy(sourceCopy, source);
            sourceCopy[groupArray[g].rm_eo] = 0;
            printf("Group %u: [%2d-%2d]: %s\n",
                   g, (int)groupArray[g].rm_so, (int)groupArray[g].rm_eo,
                   sourceCopy + groupArray[g].rm_so);
        }
    }
    regfree(&regexCompiled);
    return 0;
}
Output:
Group 0: [ 4-13]: abc123def
Group 1: [ 7-10]: 123
Group 2: [10-13]: def

The 0th element of the pmatch array of regmatch_t structs will contain the boundaries of the whole string matched, as you have noticed. In your example, you are interested in the regmatch_t at index 1, not at index 0, in order to get information about the string matched by the subexpression.
If you need more help, try editing your question to include an actual small code sample so that people can more easily spot the problem.

Related

How do I get C to successfully match a regex?

So, I am trying to check the format of a key using the regex.h library in C. This is my code:
#include <stdio.h>
#include <regex.h>
int match(char *reg, char *string)
{
    regex_t regex;
    int res;
    res = regcomp(&regex, reg, 0);
    if (res)
    {
        fprintf(stderr, "Could not compile regex\n");
        return 1;
    }
    res = regexec(&regex, string, 0, NULL, 0);
    return res;
}
int main(void)
{
    char *regex = "[\\w-]{24}\\.[\\w-]{6}\\.[\\w-]{27}|mfa\\.[\\w-]{84}";
    char *key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
    if (match(regex, key) == 0) printf("Valid key!\n");
    else printf("Invalid key!\n");
    return 0;
}
When I run this code, I get the output:
Invalid key!
Why is this happening? If I try to test the same key with the same regex in Node.JS, I get that the key does match the regex:
> const regex = new RegExp("[\\w-]{24}\\.[\\w-]{6}\\.[\\w-]{27}|mfa\\.[\\w-]{84}");
undefined
> const key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
undefined
> regex.test(key)
true
How could I get the right result using C?
Thanks in advance,
Robin
There are at least two issues here and one extra potential problem:
Limiting quantifiers such as {24} are only recognized in this form in the POSIX ERE flavor, so, as has been pointed out in the comments, you need to regcomp the pattern with the REG_EXTENDED option (i.e. res = regcomp(&regex, reg, REG_EXTENDED))
The \w shorthand character class does not work inside bracket expressions as a word char matching pattern, you need to replace it with [:alnum:]_, i.e. [\w-] must be replaced with [[:alnum:]_-]. The solution will be:
char *regex = "[[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84}";
Besides, if your regex must match the two alternatives exactly, you need to use a group around the whole pattern and add ^ and $ anchors on both ends. The solution will be:
char *regex = "^([[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84})$";
See this C demo:
#include <stdio.h>
#include <regex.h>
int match(char *reg, char *string)
{
    regex_t regex;
    int res;
    res = regcomp(&regex, reg, REG_EXTENDED);
    if (res)
    {
        fprintf(stderr, "Could not compile regex\n");
        return 1;
    }
    res = regexec(&regex, string, 0, NULL, 0);
    regfree(&regex); // release the compiled pattern to avoid a leak
    return res;
}
int main(void)
{
    char *regex = "^([[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84})$";
    char *key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
    if (match(regex, key) == 0) printf("Valid key!\n");
    else printf("Invalid key!\n");
    return 0;
}
// => Valid key!

How to check if a word is a preposition (with regex in C) [duplicate]

I am reading a text in C and I want to check whether the current word is a preposition, using a regular expression.
I have tried this, but it didn't work:
int function(const char *testRegex){
    regex_t regex;
    if(regcomp(&regex, "^(a|an|the|in|on|of|and|is|are)$", 0)) {
        // handle error
    }
    int value;
    value = regexec(&regex, testRegex, 0, NULL, 0);
    return value;
}
Whatever word I pass to the function, it always reports no match, even when I pass one of the listed words (a, an, the, ...).
So what is the problem?
"^(a|an|the|in|on|of|and|is|are)$" is an extended regular expression: you should pass REG_EXTENDED to regcomp.
Also note that regexec returns 0 for a match and the regex_t object must be freed to avoid memory leaks.
#include <stdio.h>
#include <regex.h>
int isprep(const char *testRegex) {
    regex_t regex;
    int match;
    if (regcomp(&regex, "^(a|an|the|in|on|of|and|is|are)$", REG_EXTENDED)) {
        return -1;
    }
    match = !regexec(&regex, testRegex, 0, NULL, 0);
    regfree(&regex);
    return match;
}
int main() {
    printf("a -> %d\n", isprep("a"));
    printf("an -> %d\n", isprep("an"));
    printf("ann -> %d\n", isprep("ann"));
    return 0;
}
Output:
a -> 1
an -> 1
ann -> 0
Basic regular expressions require a \ before the ( to specify a subexpression and do not support alternations (foo|bar).
See more details in the Open Group documentation.

regexec get value of xml tags in c

I'm trying to get the values of XML tags in C using regexec, and I cannot use an XML parser.
Below is my sample code; can someone help me get the expected output?
char value[500];
regex_t regexp_data;
regmatch_t matched_data[10];
char pattern_str[] = "<CODE[ \t]*^*>[ \t]*\\(.*\\)[ \t]*<\\/CODE[ \t]*>";
char msg_str[] = "<ROOT><INFO><CODE>5001</CODE><MSG>msg one</MSG></INFO> <INFO><CODE>5002</CODE><MSG>msg two</MSG></INFO></ROOT>";
if ((regcomp(&regexp_data, pattern_str, REG_NEWLINE) == 0) &&
    (regexec(&regexp_data, msg_str, 10, matched_data, 0) == 0))
{
    int i;
    for (i=0; i < 10; ++i)
    {
        memset(value, '\0', sizeof(value));
        memcpy(value, &msg_str[matched_data[i].rm_so], (matched_data[i].rm_eo - matched_data[i].rm_so));
        printf ("value [%s]\n", value);
    }
    regfree(&regexp_data);
}
/*----------------------
Output
value [<CODE>5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002</CODE>]
value [5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002]
----------------------
Expected Output
value [5001]
value [5002]
----------------------*/
Per Wiktor's comment, .* is too greedy, so I updated the regex to "<CODE[ \t]*>\\s*([0-9]*)\\s*<\\/CODE[ \t]*>" and passed in the REG_EXTENDED flag to avoid having to escape the parentheses. Note that \s is a GNU extension rather than part of POSIX; [[:space:]] is the portable equivalent.
As for capturing multiple matches, you want to follow how the gist Wiktor linked captures multiple matches. In order to get every match, you have to call regexec on the string multiple times while advancing a pointer to the source string by the length of the entire match. The first array element in the array of matches is the entire match, while the subsequent elements are the captured groups. Since you only have one captured group, you only need to pass in a size of 2, not 10. Here's the full code I used:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <regex.h>
int main() {
    char value[500];
    regex_t regexp_data;
    regmatch_t matched_data[2];
    char pattern_str[] = "<CODE[ \t]*>\\s*([0-9]*)\\s*<\\/CODE[ \t]*>";
    char msg_str[] = "<ROOT><INFO><CODE>5001</CODE><MSG>msg one</MSG></INFO><INFO><CODE>5002</CODE><MSG>msg two</MSG></INFO></ROOT>";
    char *cursor = msg_str;
    if (regcomp(&regexp_data, pattern_str, REG_EXTENDED | REG_NEWLINE) != 0) {
        printf("Couldn't compile.\n");
        return 1;
    }
    // Loop only while regexec reports a match (0); comparing against
    // REG_NOMATCH alone would keep looping on any other error code.
    while (regexec(&regexp_data, cursor, 2, matched_data, 0) == 0) {
        memset(value, '\0', sizeof(value));
        // Offsets are relative to cursor; element 1 is the captured group.
        memcpy(value, cursor + matched_data[1].rm_so, (matched_data[1].rm_eo - matched_data[1].rm_so));
        printf("value [%s]\n", value);
        // Continue searching after the end of the whole match.
        cursor += matched_data[0].rm_eo;
    }
    regfree(&regexp_data);
    return 0;
}
Your regular expression is matching from the first instance of <CODE> to the last instance of </CODE>. To help prevent this, you can replace the \\(.*\\) with \\([^<]*\\), so your regex is now:
char pattern_str[] = "<CODE[ \t]*^*>[ \t]*\\([^<]*\\)[ \t]*<\\/CODE[ \t]*>";

Generate Email address Regex [duplicate]

Today I'm trying to build a regex that matches email addresses.
I've written one, but it doesn't work in all the cases I want.
I would like a regex that matches every email address ending with exactly 2 characters after the dot, or with .com.
I hope that is clear enough:
aaaaaa#bbbb.uk --> should work
aaaaaa#bbbb.com --> should work
aaaaaa#bbbb.cc --> should work
aaaaaa#bbbb.ukk --> should not work
aaaaaa#bbbb. --> should not work
this is my code:
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
int main (void)
{
    int match;
    int err;
    regex_t preg;
    regmatch_t pmatch[5];
    size_t nmatch = 5;
    const char *str_request = "1aaaak#aaaa.ukk";
    const char *str_regex = "[a-zA-Z0-9][a-zA-Z0-9_.]+#[a-zA-Z0-9_]+.[a-zA-Z0-9_.]+[a-zA-Z0-9]{2}";
    err = regcomp(&preg, str_regex, REG_EXTENDED);
    if (err == 0)
    {
        match = regexec(&preg, str_request, nmatch, pmatch, 0);
        nmatch = preg.re_nsub;
        regfree(&preg);
        if (match == 0)
        {
            printf ("match\n");
            int start = pmatch[0].rm_so;
            int end = pmatch[0].rm_eo;
            printf("%d - %d\n", start, end);
        }
        else if (match == REG_NOMATCH)
        {
            printf("unmatch\n");
        }
    }
    puts ("\nPress any key\n");
    getchar ();
    return (EXIT_SUCCESS);
}
"[a-zA-Z0-9][a-zA-Z0-9_.]+#[a-zA-Z0-9_]+\\.(com|[a-zA-Z]{2})$"
https://regex101.com/ is a very good tool for testing this.
\. means a literal dot;
(|) means an alternation;
$ means the end of the line, as we do not want trailing characters after the match.

regular expressions in C match and print

I have lines from a file like this:
{123} {12.3.2015 moday} {THIS IS A TEST}
Is it possible to get every value between the braces {} and insert it into an array?
I would also like to know if there is some other solution to this problem.
The result should look like this:
array( 123,
'12.3.2015 moday',
'THIS IS A TEST'
)
My try:
int r;
regex_t reg;
regmatch_t match[2];
char *line = "{123} {12.3.2015 moday} {THIS IS A TEST}";
regcomp(&reg, "[{](.*?)*[}]", REG_ICASE | REG_EXTENDED);
r = regexec(&reg, line, 2, match, 0);
if (r == 0) {
    printf("Match!\n");
    printf("0: [%.*s]\n", match[0].rm_eo - match[0].rm_so, line + match[0].rm_so);
    printf("1: %.*s\n", match[1].rm_eo - match[1].rm_so, line + match[1].rm_so);
} else {
    printf("NO match!\n");
}
This results in:
123} {12.3.2015 moday} {THIS IS A TEST
Does anyone know how to improve this?
The regex101 website is really useful for this kind of problem.
I suggest this regex:
/(?<=\{).*?(?=\})/g
Or any of these ones:
/\{\K.*?(?=\})/g
/\{\K[^\}]+/g
/\{(.*?)\}/g
Also available here for the first one:
https://regex101.com/r/bB6sE8/1
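The patterns above use PCRE syntax (lookarounds, \K, lazy quantifiers), which the POSIX regex functions do not support. In C you could start with this, which is adapted from an example found here: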
#include <stdio.h>
#include <string.h>
#include <regex.h>
int main ()
{
    char * source = "{123} {12.3.2015 moday} {THIS IS A TEST}";
    // Braces are written as bracket expressions so they are always literal.
    char * regexString = "[{]([^}]*)[}]";
    size_t maxGroups = 10;
    regex_t regexCompiled;
    regmatch_t groupArray[10];
    char * cursor;
    if (regcomp(&regexCompiled, regexString, REG_EXTENDED))
    {
        printf("Could not compile regular expression.\n");
        return 1;
    }
    cursor = source;
    while (!regexec(&regexCompiled, cursor, maxGroups, groupArray, 0))
    {
        unsigned int offset = 0;
        if (groupArray[1].rm_so == -1)
            break; // No more groups
        offset = groupArray[1].rm_eo;
        // Copy the remaining text and cut it at the end of group 1.
        char cursorCopy[strlen(cursor) + 1];
        strcpy(cursorCopy, cursor);
        cursorCopy[groupArray[1].rm_eo] = 0;
        printf("%s\n", cursorCopy + groupArray[1].rm_so);
        // Resume just after the captured text (at the closing brace).
        cursor += offset;
    }
    regfree(&regexCompiled);
    return 0;
}

Resources