I need help extracting a substring from a string using regex.h in C.
In this example, I am trying to extract all occurrences of character 'e' from a string 'telephone'. Unfortunately, I get stuck identifying the offsets of those characters. I am listing code below:
#include <stdio.h>
#include <regex.h>
int main(void) {
const int size=10;
regex_t regex;
regmatch_t matchStruct[size];
char pattern[] = "(e)";
char str[] = "telephone";
int failure = regcomp(&regex, pattern, REG_EXTENDED);
if (failure) {
printf("Cannot compile");
}
int matchFailure = regexec(&regex, pattern, size, matchStruct, 0);
if (!matchFailure) {
printf("\nMatch!!");
} else {
printf("NO Match!!");
}
return 0;
}
So per GNU's manual, I should get all of the occurrences of 'e' when a character is parenthesized. However, I always get only the first occurrence.
Essentially, I want to be able to see something like:
matchStruct[1].rm_so = 1;
matchStruct[1].rm_eo = 2;
matchStruct[2].rm_so = 4;
matchStruct[2].rm_eo = 5;
matchStruct[3].rm_so = 7;
matchStruct[3].rm_eo = 8;
or something along these lines. Any advice?
Please note that you are not actually matching your compiled regex against str ("telephone") but against the plain-text pattern itself. Check the second argument you pass to regexec. With that fixed, see for instance "regex in C language using functions regcomp and regexec toggles between first and second match", where the answer to your question is already given.
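A single regexec call reports only the first match (plus its parenthesized subexpressions), so finding every 'e' means calling it repeatedly and moving the start of the search past each match. The following is a minimal sketch of that loop, not the code from the linked answer; the helper name find_all_matches and its parameters are made up for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: stores the start offset of each non-overlapping match
 * of `pattern` in `str` into `starts` (at most `max` entries). Returns the
 * number of offsets stored, or -1 if the pattern does not compile. */
int find_all_matches(const char *pattern, const char *str,
                     regoff_t *starts, int max)
{
    regex_t regex;
    regmatch_t m[1];
    size_t offset = 0;
    int count = 0;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    while (count < max) {
        /* After the first call the buffer no longer starts at the beginning
         * of the line, so ^ must not match there. */
        int flags = (offset > 0) ? REG_NOTBOL : 0;
        if (regexec(&regex, str + offset, 1, m, flags) != 0)
            break;

        /* rm_so and rm_eo are relative to str + offset. */
        starts[count++] = (regoff_t)offset + m[0].rm_so;

        size_t end = offset + (size_t)m[0].rm_eo;
        if (m[0].rm_eo == m[0].rm_so) {
            /* Empty match: step one character forward to avoid looping. */
            if (str[end] == '\0')
                break;
            end++;
        }
        offset = end;
    }

    regfree(&regex);
    return count;
}
```

Called as find_all_matches("e", "telephone", starts, 8), this stores the offsets 1, 3 and 8, which are the real positions of 'e' in the string.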
I've been trying to use regular expressions (<regex.h>) in a C project I am developing.
According to regex101 the regex is well formed and matches what I'm trying to identify, but it doesn't work when I run it in C.
#include <stdio.h>
#include <regex.h>
int main() {
char pattern[] = "#include.*";
char line[] = "#include <stdio.h>";
regex_t string;
int regex_return = -1;
regex_return = regcomp(&string, line, 0);
regex_return += regexec(&string, pattern, 0, NULL, 0);
printf("%d", regex_return);
return 0;
}
This is a sample code I wrote to test the expression when I found out it didn't work.
It prints 1, when I expected 0.
It prints 0 if I change the line to "#include", which is just strange to me, because it's ignoring the .* at the end.
line and pattern are swapped.
regcomp takes the pattern and regexec takes the string to check.
I'd like to match all strings that begin with a set of characters a-z, then exactly one : and another set of characters a-z right after that.
So as an example, the string "an:example" would be a correct match.
And another example, "another:ex:ample" needs to be a mismatch.
I have tried to set it up like that, but it matches everything, even if I pass a bad string as input :(
So my regular expression is "[a-z]:[a-z]", but it evaluates the string "1an:example" as a match :/
How can I do this correctly?
#include <stdio.h>
#include <regex.h>
int main() {
regex_t regex;
int retis;
char* str = "1an:example";
retis = regcomp(&regex, "[a-z]:[a-z]", 0);
retis = regexec(&regex, str, 0, NULL, 0);
if(!retis) {
puts("Match");
}
else if(retis == REG_NOMATCH) {
puts("No match");
}
regfree(&regex);
return 0;
}
You need
retis = regcomp(&regex, "^[a-z]+:[a-z]+$", REG_EXTENDED);
See the C online demo.
That is:
^ (start of string) and $ (end of string) are anchors that require the regex to match the whole string
[a-z]+ matches one or more lowercase letters
REG_EXTENDED selects POSIX extended regex syntax. Without it the pattern is compiled as a basic regex, in which + is an ordinary character rather than a quantifier.
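To see the anchored pattern behave as described, here is a small sketch that checks it against the three strings from the question. The function name is_name_value is invented for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `s` consists of lowercase letters,
 * exactly one ':' and more lowercase letters; 0 if it does not;
 * -1 if the pattern fails to compile. */
int is_name_value(const char *s)
{
    regex_t regex;

    /* ^ and $ force the whole string to match; + needs REG_EXTENDED. */
    if (regcomp(&regex, "^[a-z]+:[a-z]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&regex, s, 0, NULL, 0);

    regfree(&regex);
    return rc == 0 ? 1 : 0;
}
```

Here "an:example" is accepted, while "another:ex:ample" and "1an:example" are both rejected.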
I want to implement a case-insensitive text search which supports parallel testing of multiple keywords. I was already able to achieve this in a way which to me does not seem to be efficient in terms of performance.
The function "strcasestr" (Link to Linux man page) seems to be doing a good job when searching for one keyword, but when you want to simultaneously test multiple keywords - in my understanding - you want to iterate the characters of the text (Haystack) only one single time to find an occurrence of the keywords (Needles).
Using "strcasestr" multiple times would cause, as I understand it, multiple iterations over the text (Haystack), which might not be the fastest solution. An example:
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
int main (void) {
// Text to search in
char *str = "This is a test!";
char *result = strcasestr(str, "not_found1");
if (result == NULL) {
result = strcasestr(str, "NOT_FOUND2");
}
if (result == NULL) {
result = strcasestr(str, "TEST!");
}
printf("Result pointer: %s\n", result );
return 0;
}
Is there a way to get the position of the first occurrence of one of the (case-insensitive) keywords in the text in a faster way than I did it?
I would appreciate it if the solution would be extensible so that I could continue looping over the text to find all positions of the occurrences of the keywords, because I am working on a full-text search with a result rating system. Frameworks and small hints to put me in the right direction are also very welcome.
After a long time of learning and testing I found a solution which is working well for me. I tested a one-keyword version of it and the performance was comparable to the function "strcasestr" (Tested with ca. 500 MB of text).
To explain what the below code does:
First the text (Haystack) and the keywords (Needles) are defined. Then the keywords are converted to lowercase up front for good performance. iter is an array of counters that record how many characters of each keyword the current text position has matched so far. The program iterates linearly over the characters of the text until one of the keywords is matched completely; in that case it stops and the result is "True". If no keyword is found (match stays 0), the result is "False".
I welcome tips in the comments for better code quality or higher performance.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main (void) {
int i, j;
int match = 0;
// Haystack
char *text = "This is a test!";
// Needles
int keywords_len = 3;
char keywords[][12] = {
"not_found1",
"NOT_FOUND2",
"TEST!"
};
// Make needles lowercase
for (i = 0; i < keywords_len; i++)
for (j = 0; keywords[i][j]; j++)
keywords[i][j] = tolower(keywords[i][j]);
// Define counters for keywords matches
int iter[] = { 0, 0, 0 };
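// Loop over all characters and test match
char ptext;
while (!match && (ptext = *text++) != '\0') {
// Lowercase the current character so the comparison is case-insensitive
// (the needles were lowercased above; x | 32 would also alter '_')
char c = (char)tolower((unsigned char)ptext);
for (i = 0; i < keywords_len; i++) {
// On a mismatch, restart this keyword and re-test the same character
// against its first character
// NOTE: this simple restart can still miss a match that begins inside
// a failed partial match; a KMP-style failure table would avoid that
if (c != keywords[i][iter[i]])
iter[i] = 0;
if (c == keywords[i][iter[i]]) {
if (keywords[i][++(iter[i])] == '\0') {
match = 1;
break;
}
}
}
}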
printf("Result: %s\n", match ? "True" : "False");
return 0;
}
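For comparison, the question's other requirement (getting the position of a hit and then continuing the scan) can be sketched with a plain left-to-right scan that tries every keyword at each text position. This is a different, simpler technique than the counter approach above, not a faster one: it is easy to verify, but it does more comparisons than an automaton-based method such as Aho-Corasick. The helper name find_any_keyword is hypothetical, and strncasecmp comes from the POSIX header <strings.h>.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: returns a pointer to the earliest position in `text`
 * at which any of the `count` keywords occurs, ignoring case, or NULL if
 * none occurs. If `which` is non-NULL it receives that keyword's index. */
const char *find_any_keyword(const char *text, const char *const *keywords,
                             size_t count, size_t *which)
{
    for (const char *p = text; *p != '\0'; p++) {
        for (size_t k = 0; k < count; k++) {
            size_t len = strlen(keywords[k]);
            /* Skip empty keywords; they would match at every position. */
            if (len > 0 && strncasecmp(p, keywords[k], len) == 0) {
                if (which != NULL)
                    *which = k;
                return p;
            }
        }
    }
    return NULL;
}
```

To collect every occurrence, call the function again with the returned pointer plus one as the new text until it returns NULL.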
I have the following string:
const char *str = "\"This is just some random text\" 130 28194 \"Some other string\" \"String 3\"";
I would like to get the integer 28194. Of course the integer varies, so I can't just do strstr(str, "28194").
So I was wondering what would be a good way to get that part of the string?
I was thinking of using #include <regex.h>, for which I already have a procedure to match regexps, but I'm not sure what the regexp would look like in C using POSIX-style notation (something like [:alpha:]+[:digit:]), or whether performance will be an issue. Or would it be better to use strchr or strstr?
Any ideas would be appreciated.
If you want to use regex, you can use:
const char *str = "\"This is just some random text\" 130 28194 \"Some other string\" \"String 3\"";
regex_t re;
regmatch_t matches[2];
int comp_ret = regcomp(&re, "([[:digit:]]+) \"", REG_EXTENDED);
if(comp_ret)
{
// Error occurred. See regex.h
}
if(!regexec(&re, str, 2, matches, 0))
{
long long result = strtoll(str + matches[1].rm_so, NULL, 10);
printf("%lld\n", result);
}
else
{
// Didn't match
}
regfree(&re);
You're correct that there are other approaches.
EDIT: Changed to use non-optional repetition and show more error checking.
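As one example of the other approaches mentioned above, here is a sketch that avoids regex entirely and uses strchr and strtol. It assumes the input always has the shape shown in the question (a quoted string, then two integers, with the wanted value being the second one); that layout and the helper name second_int_after_quote are assumptions for illustration only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: assumes the input looks like
 *   "quoted text" <int> <int> ...
 * and stores the second integer in *out.
 * Returns 0 on success, -1 if the input does not have that shape. */
int second_int_after_quote(const char *s, long *out)
{
    /* Locate the first quoted string and step past its closing quote. */
    const char *open = strchr(s, '"');
    if (open == NULL)
        return -1;
    const char *close = strchr(open + 1, '"');
    if (close == NULL)
        return -1;

    const char *p = close + 1;
    char *end;

    /* First integer: parsed only to skip over it. */
    strtol(p, &end, 10);
    if (end == p)
        return -1;
    p = end;

    /* Second integer: the value we want. strtol skips leading spaces. */
    long value = strtol(p, &end, 10);
    if (end == p)
        return -1;

    *out = value;
    return 0;
}
```

For the string in the question this stores 28194 in *out. If the layout can vary, the regex version above is the more flexible choice.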