C Regular Expressions: Extracting the Actual Matches

I am using regular expressions in C (using the "regex.h" library). After setting up the standard calls (and checks) for regcomp(...) and regexec(...), I can only manage to print the actual substrings that match my compiled regular expression.
Using regexec, according to the manual pages, means you store the substring matches in a structure known as "regmatch_t". The struct only contains rm_so and rm_eo to reference what I understand to be the addresses of the characters of the matched substring in memory, but my question is: how can I use these two offsets and two pointers to extract the actual substring and store it in an array (ideally a 2D array of strings)?
It works when you just print to standard out, but whenever you try to use the same setup but store it in a string/character array, it stores the entire string that was originally used to match against the expression.
Further, what is the "%.*s" inside the print statement? I imagine it's a regular expression in and of itself to read in the pointers to a character array correctly. I just want to store the matched substrings inside a collection so I can work with them elsewhere in my software.
Background: p and p2 are both pointers set to point to the start of string to match before entering the while loop in the code below:
[EDIT: "matches" is a 2D array meant to ultimately store the substring matches and was preallocated/initialized before the main loop you see below]
int ind = 0;
while(1){
regExErr1 = regexec(&r, p, 10, m, 0);
//printf("Did match regular expr, value %i\n", regExErr1);
if( regExErr1 != 0 ){
fprintf(stderr, "No more matches with the inherent regular expression!\n");
break;
}
printf("What was found was: ");
int i = 0;
while(1){
if(m[i].rm_so == -1){
break;
}
int start = m[i].rm_so + (p - p2);
int finish = m[i].rm_eo + (p - p2);
strcpy(matches[ind], ("%.*s\n", (finish - start), p2 + start));
printf("Storing: %.*s", matches[ind]);
ind++;
printf("%.*s\n", (finish - start), p2 + start);
i++;
}
p += m[0].rm_eo; // this will move the pointer p to the end of last matched pattern and on to the start of a new one
}
printf("We have in [0]: %s\n", temp);

There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.
The two structures it defines in <regex.h> are:
regex_t containing at least size_t re_nsub, the number of parenthesized subexpressions.
regmatch_t containing at least regoff_t rm_so, the byte offset from start of string to start of substring, and regoff_t rm_eo, the byte offset from start of string of the first character after the end of substring.
Note that 'offsets' are not pointers but indexes into the character array.
The execution function is:
int regexec(const regex_t *restrict preg, const char *restrict string,
size_t nmatch, regmatch_t pmatch[restrict], int eflags);
Your printing code should be:
for (int i = 0; i <= r.re_nsub; i++)
{
int start = m[i].rm_so;
int finish = m[i].rm_eo;
// strcpy(matches[ind], ("%.*s\n", (finish - start), p + start)); // Based on question
sprintf(matches[ind], "%.*s\n", (finish - start), p + start); // More plausible code
printf("Storing: %.*s\n", (finish - start), matches[ind]); // Print once
ind++;
printf("%.*s\n", (finish - start), p + start); // Why print twice?
}
Note that the code should be upgraded to ensure that the string copy (via sprintf()) does not overflow the target string — maybe by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string in the printing. For example:
printf("<<%.*s>>\n", (finish - start), p + start);
This makes it a whole heap easier to see spaces etc.
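To answer the "%.*s" question directly: it is a printf() conversion, not a regular expression. The .* is a precision taken from an int argument, and it caps how many bytes of the following string are printed. Below is a minimal sketch of storing every match in a preallocated 2D array along those lines. The helper names copy_match() and collect_matches() and the 64-byte slot size are assumptions made for illustration; they are not part of <regex.h>.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the text described by *m (offsets relative to
 * base) into dst, which holds dstsize bytes. Returns the copied length, or
 * -1 if the group did not participate or dst is too small. */
static int copy_match(char *dst, size_t dstsize, const char *base,
                      const regmatch_t *m)
{
    if (m->rm_so == -1)
        return -1;
    int len = (int)(m->rm_eo - m->rm_so);
    /* "%.*s" writes at most len bytes starting at base + rm_so */
    int n = snprintf(dst, dstsize, "%.*s", len, base + m->rm_so);
    if (n < 0 || (size_t)n >= dstsize)
        return -1;
    return n;
}

/* Hypothetical helper: store each successive match of pattern (an extended
 * regex) in matches[]. Returns the number stored, or -1 if the pattern does
 * not compile. */
static int collect_matches(const char *pattern, const char *text,
                           char matches[][64], int max_matches)
{
    regex_t re;
    regmatch_t m[1];
    int count = 0;
    const char *p = text;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* After the first search, p no longer points at the start of the line */
    while (count < max_matches &&
           regexec(&re, p, 1, m, p == text ? 0 : REG_NOTBOL) == 0) {
        if (copy_match(matches[count], sizeof matches[count], p, &m[0]) >= 0)
            count++;
        if (m[0].rm_eo == m[0].rm_so) {      /* empty match: step one byte */
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        } else {
            p += m[0].rm_eo;                 /* continue after this match */
        }
    }
    regfree(&re);
    return count;
}
```

Because the offsets are always relative to the pointer passed to regexec(), the sketch copies from p + rm_so and never needs a second pointer to the start of the original string.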
[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]
This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes'; little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>
#define tofind "^DAEMONS=\\(([^)]*)\\)[ \t]*$"
int main(int argc, char **argv)
{
FILE *fp;
char line[1024];
int retval = 0;
regex_t re;
regmatch_t rm[2];
//this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
const char *filename = "/etc/rc.conf";
if (argc > 1)
filename = argv[1];
if (regcomp(&re, tofind, REG_EXTENDED) != 0)
{
fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
return EXIT_FAILURE;
}
printf("Regex: %s\n", tofind);
printf("Number of captured expressions: %zu\n", re.re_nsub);
fp = fopen(filename, "r");
if (fp == 0)
{
fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
return EXIT_FAILURE;
}
while ((fgets(line, 1024, fp)) != NULL)
{
line[strcspn(line, "\n")] = '\0';
if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
{
printf("<<%s>>\n", line);
// Complete match
printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
// Match captured in (...) - the \( and \) match literal parenthesis
printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
char *src = line + rm[1].rm_so;
char *end = line + rm[1].rm_eo;
while (src < end)
{
size_t len = strcspn(src, " ");
if (src + len > end)
len = end - src;
printf("Name: <<%.*s>>\n", (int)len, src);
src += len;
src += strspn(src, " ");
}
}
}
regfree(&re);
fclose(fp);
return EXIT_SUCCESS;
}
This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.

Since g++ regex is bugged until who knows when, you can use my code instead (License: AGPL, no warranty, your own risk, ...)
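#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/**
* regexp (License: AGPL3 or higher)
* @param re extended POSIX regular expression
* @param nmatch maximum number of matches
* @param str string to match
* @return An array of nmatch+1 char pointers, or NULL on error or if there is no match. You have to free() the first element (string storage) and then the array itself. The second element is the string matching the full regex, then come the submatches.
*/
char **regexp(char *re, int nmatch, char *str) {
char **result=NULL;
char *storage=NULL;
char *out;
regex_t regex;
regmatch_t *match;
size_t total=1; /* never ask malloc() for 0 bytes */
int i;
if (nmatch<1)
return NULL;
if (regcomp(&regex, re, REG_EXTENDED)!=0) {
fprintf(stderr, "Failed to compile regex '%s'\n", re);
return NULL;
}
match=malloc(nmatch*sizeof(*match));
if (!match) {
fprintf(stderr, "Out of memory !");
regfree(&regex);
return NULL;
}
if (regexec(&regex,str,nmatch,match,0)) {
#ifdef DEBUG
fprintf(stderr, "String '%s' does not match regex '%s'\n",str,re);
#endif
goto done;
}
/* Each match gets its own NUL-terminated copy, so a nested submatch cannot truncate the match that encloses it */
for (i=0; i<nmatch; ++i) {
if (match[i].rm_so>=0)
total+=(size_t)(match[i].rm_eo-match[i].rm_so)+1;
}
storage=malloc(total);
result=malloc((nmatch+1)*sizeof(*result));
if (!storage || !result) {
fprintf(stderr, "Out of memory !");
free(storage);
free(result);
result=NULL;
goto done;
}
result[0]=storage;
out=storage;
for (i=0; i<nmatch; ++i) {
if (match[i].rm_so>=0) {
size_t len=(size_t)(match[i].rm_eo-match[i].rm_so);
memcpy(out, str+match[i].rm_so, len);
out[len]=0;
result[i+1]=out;
#ifdef DEBUG
printf("%s\n",out);
#endif
out+=len+1;
} else {
result[i+1]="";
}
}
done:
free(match);
regfree(&regex);
return result;
}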

Related

Custom STRCAT is overwhelmed by too many arguments

I am trying to code a custom strcat that separates arguments with \n except for the last one and terminates the string with \0.
It's working fine as is up to 5 arguments, but if I try passing a sixth one I get a strange line in response :
MacBook-Pro-de-Domingo% ./test ok ok ok ok ok
ok
ok
ok
ok
ok
MacBook-Pro-de-Domingo% ./test ok ok ok ok ok ok
ok
ok
ok
ok
ok
P/Users/domingodelmasok
Here is my custom strcat code:
char cat(char *dest, char *src, int current, int argc_nb)
{
int i = 0;
int j = 0;
while(dest[i])
i++;
while(src[j])
{
dest[i + j] = src[j];
j++;
}
if(current < argc_nb - 1)
dest[i + j] = '\n';
else
dest[i + j] = '\0';
return(*dest);
}
UPDATE Complete calling function:
char *concator(int argc, char **argv)
{
int i;
int j;
int size = 0;
char *str;
i = 1;
while(i < argc)
{
j = 0;
while(argv[i][j])
{
size++;
j++;
}
i++;
}
str = (char*)malloc(sizeof(*str) * (size + 1));
i = 1;
while(i < argc)
{
cat(str, argv[i], i, argc);
i++;
}
free(str);
return(str);
}
What's wrong here?
Thanks!
Edit: Fixed blunder.
There are quite a few issues with the code:
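sizeof (char) == 1 by the C standard, so the sizeof(*str) factor in the malloc() call is redundant.
cat() requires the destination to be a string (terminated by a \0), but does not append the terminator itself (except when current >= argc_nb - 1). This is a bug. The buffer returned by malloc() is also uninitialized, so even the very first while(dest[i]) scan reads indeterminate bytes.
The size calculation counts only the characters of the arguments. It leaves no room for the \n separators that cat() writes, so the buffer is overrun as soon as there is more than one argument.
free(str); return str; is a use-after-free bug. If you call free(str), the contents at str are irrevocably lost, inaccessible. The free(str) should simply be removed; it is not appropriate here.
Arrays in C are indexed at 0. However, the concator() function skips the first string pointer (because argv[0] contains the name used to execute the program). This is wrong, and will eventually trip someone. Instead, have concator() add all strings in the array, but call it using concator(argc - 1, argv + 1);.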
There might be even more, but at this point, I believe a rewrite from scratch, using a much more appropriate approach, is in order.
Consider the following join() function:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
char *join(const size_t parts, const char *part[],
const char *separator, const char *suffix)
{
const size_t separator_len = (separator) ? strlen(separator) : 0;
const size_t suffix_len = (suffix) ? strlen(suffix) : 0;
size_t total_len = 0;
size_t p;
char *dst, *end;
/* Calculate sum of part lengths */
for (p = 0; p < parts; p++)
if (part[p])
total_len += strlen(part[p]);
/* Add separator lengths */
if (parts > 1)
total_len += (parts - 1) * separator_len;
/* Add suffix length */
total_len += suffix_len;
/* Allocate enough memory, plus end-of-string '\0' */
dst = malloc(total_len + 1);
if (!dst)
return NULL;
/* Keep a pointer to the current end of the result string */
end = dst;
/* Append each part */
for (p = 0; p < parts; p++) {
/* Insert separator */
if (p > 0 && separator_len > 0) {
memcpy(end, separator, separator_len);
end += separator_len;
}
/* Insert part */
if (part[p]) {
const size_t len = strlen(part[p]);
if (len > 0) {
memcpy(end, part[p], len);
end += len;
}
}
}
/* Append suffix */
if (suffix_len > 0) {
memcpy(end, suffix, suffix_len);
end += suffix_len;
}
/* Terminate string. */
*end = '\0';
/* All done. */
return dst;
}
The logic is simple. First, we find out the length of each component. Note that separator is only added between parts (so occurs parts-1 times), and suffix at the very end.
(The (string) ? strlen(string) : 0 idiom just means "if string is non-NULL, strlen(string), otherwise 0". We do that, because we allow NULL separator and suffix, but strlen(NULL) is Undefined Behaviour.)
Next, we allocate enough memory for the result, including the end-of-string NUL char, \0, that was not included in the lengths.
To append each part, we keep the result pointer intact, and instead use a temporary end pointer. (It is the end of the string thus far.) We use a loop, where we copy the next part to the end. Before the second and subsequent parts, we copy the separator before the part.
Next, we copy the suffix, and finally the end-of-string '\0'. (It is important to return a pointer to the beginning of the string, rather than end, of course; and that is why we kept dst to point to the new resulting string, and end at the point we appended each substring.)
You could use it from the command line using for example the following main():
int main(int argc, char *argv[])
{
char *result;
if (argc < 4) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s SEPARATOR SUFFIX PART [ PART ... ]\n", argv[0]);
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
result = join(argc - 3, (const char **)(argv + 3), argv[1], argv[2]);
if (!result) {
fprintf(stderr, "Failed.\n");
return EXIT_FAILURE;
}
fputs(result, stdout);
free(result);
return EXIT_SUCCESS;
}
If you compile the above to e.g. example (I use gcc -Wall -O2 example.c -o example), then running
./example ', ' $'!\n' Hello world
in a Bash shell outputs
Hello, world!
(with a newline at end). Running
./example ' and ' $'.\n' a b c d e f g
outputs
a and b and c and d and e and f and g
(again with a newline at end). The $'...' is just a Bash idiom to specify special characters in strings; $'!\n' is the same in Bash as "!\n" is in C, and $'.\n' is the Bash equivalent of ".\n" in C.
(Removing the automatic newline between parts, and allowing a string rather than just one char to be used as a separator and suffix, was a deliberate choice for two reasons. The main one is to stop anyone from just copy-pasting this as an answer to some exercise. The secondary one is to show that while it might sound more complicated than just using single characters for them, it is actually very little additional code; and if you consider the practical use cases, allowing a string to be used as the separator opens up a lot of options.)
The example code above is only very lightly tested, and might contain bugs. If you find any, or disagree with anything I've written above, do let me know in a comment so I can review, and fix as necessary.

Find and replace all occurrences of a substring in C

I am trying to find and replace all occurrences of a substring within an array of strings in C. I think I have most of the logic down, however I don't know where I am messing up for the remaining parts.
Here is the relevant code - the string I am replacing is in searchStr and I am trying to replace it with replaceStr. The array of strings is called buff. I do not need to save the modified string back into the array after, I just need to print the modified string to the console.
for (size_t i = 0; i < numLines; i++) {
char *tmp = buff[i];
char finalStr[MAX_STR_LEN * 2];
char temporaryString[MAX_STR_LEN];
int match = 0;
while ((tmp = strstr(tmp, searchStr))) {
match = 1;
char temporaryString[MAX_STR_LEN];
char tmp2[MAX_STR_LEN];
printf("Buff[i]: %s", buff[i]);
sprintf(temporaryString, "%s", strstr(tmp, searchStr) + strlen(searchStr)); // Grab everything after the match
printf("Behind: %s", temporaryString);
strncpy(tmp2, buff[i], tmp - buff[i]); // Grab everything before the match
strcat(finalStr, tmp2);
printf("In Front: %s\n", finalStr);
strcat(finalStr, replaceStr); // Concat everything before with the replacing string
tmp = tmp + strlen(searchStr);
buff[i] = tmp; // Move buff pointer up so that it looks for another match in the remaining part of the string
}
if (match) {
strcat(finalStr, temporaryString); // Add on any remaining bytes
printf("Final: %s\n", finalStr);
}
}
I have a lot of printf in there just so I can see where everything is for debugging.
Example case:
If I am running it against the string what4is4this with searchStr = 4 and replaceStr = !!! this is the output in the console... I am adding annotations as well using //
Buff[i]: what4is4this // Just printing out the current string before we attempt to replace anything
Behind: is4this // Looking good here
In Front: hat // Why is it cutting off the 'w'?
Buff[i]: is4this // Good - this is the remaining string we need to look through
Behind: this // Again, looking good
In Front: hat!!!isat // It should be 'is'
Final: hat!!!isat!!!isat // final should be 'what!!!is!!!this'
Any ideas guys? I'm tearing my hair out trying to fix this.
Thanks!
It is an unhealthy mix of pointer-juggling and undefined behavior, but the commenters already told you that. If you simplify it a bit and make good and frugal use of pointers you can do something like:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// ALL CHECKS OMITTED!
#define MAX_STR_LEN 1024
int main(int argc, char **argv)
{
char *buff, *cpbuff;
char *searchStr;
char *replaceStr;
// pointers to the two parts with the search string in between
char *tmp, *after;
// the final output (a fixed length is not good,
// should be dynamically allocated)
char finalStr[MAX_STR_LEN * 2] = { '\0' };
if (argc < 4) {
fprintf(stderr, "Usage: %s string searchstr replacestr\n", argv[0]);
exit(EXIT_FAILURE);
}
buff = malloc(strlen(argv[1]) + 1);
strcpy(buff, argv[1]);
searchStr = malloc(strlen(argv[2]) + 1);
strcpy(searchStr, argv[2]);
replaceStr = malloc(strlen(argv[3]) + 1);
strcpy(replaceStr, argv[3]);
// Keep a finger on the start of buff
cpbuff = buff;
while (1) {
printf("Buff: %s\n", buff);
// Grab everything after the match
after = strstr(buff, searchStr);
// No further matches? Then we're done
if (after == NULL) {
strcat(finalStr, buff);
break;
}
// assuming strlen(searchStr) >= 1
tmp = buff;
// mark the end of the first part
tmp[after - buff] = '\0';
// set the after pointer to the start of the second part
after = after + strlen(searchStr);
printf("Behind: %s\n", after);
printf("Before: %s\n\n", tmp);
// Put the first part to its final place
strcat(finalStr, tmp);
// concat the replacement string
strcat(finalStr, replaceStr);
// Set buff to the start of the second part
// (after already points just past the match)
buff = after;
}
printf("Final: %s\n", finalStr);
// set the buff pointer back to its start
buff = cpbuff;
free(buff);
free(searchStr);
free(replaceStr);
exit(EXIT_SUCCESS);
}
The single point that can be called an abuse of pointer arithmetic is the line that marks the end of the first part. It can be avoided by measuring the lengths of the involved strings and doing arithmetic with them. It is slower, admittedly, so it depends on your individual use case.
It is still more complicated than I like it, but it's a start.
I hope you can extend it now to several lines of input by yourself.
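The length-based variant mentioned above could look roughly like the following. It is a minimal sketch, not the code from the question: the function name replace_all() is an assumption, the input is left unmodified, and the result is allocated to the exact size instead of using a fixed buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated copy of src with every
 * occurrence of search replaced by replace. Returns NULL if search is empty
 * or if allocation fails. The caller frees the result. */
char *replace_all(const char *src, const char *search, const char *replace)
{
    const size_t search_len = strlen(search);
    const size_t replace_len = strlen(replace);
    size_t count = 0;
    const char *p;

    if (search_len == 0)
        return NULL;

    /* First pass: count the matches so the result can be sized exactly */
    for (p = src; (p = strstr(p, search)) != NULL; p += search_len)
        count++;

    /* Every match removes search_len bytes and adds replace_len bytes */
    size_t out_len = strlen(src) - count * search_len + count * replace_len;
    char *out = malloc(out_len + 1);
    if (!out)
        return NULL;

    /* Second pass: copy the text before each match, then the replacement */
    char *dst = out;
    const char *match;
    p = src;
    while ((match = strstr(p, search)) != NULL) {
        size_t before = (size_t)(match - p);
        memcpy(dst, p, before);
        dst += before;
        memcpy(dst, replace, replace_len);
        dst += replace_len;
        p = match + search_len;
    }
    strcpy(dst, p);   /* the remaining tail, including the '\0' */
    return out;
}
```

With the example from the question, replace_all("what4is4this", "4", "!!!") yields what!!!is!!!this. Extending this to several lines means calling it once per element of buff and freeing each result after printing.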

Search and replace within a file using PCRE in C

I want to parse a shell style key-value config file with C and replace values as needed.
An example file could look like
FOO="test"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
To find the value, I want to use regular expressions. I'm a beginner with the PCRE library so I created some code to test around. This application takes two arguments: the first one is the key to search for. The second one is the value to fill into the double quotes.
#include <pcre.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#define OVECCOUNT 30
int main(int argc, char **argv){
const char *error;
int erroffset;
pcre *re;
int rc;
int i;
int ovector[OVECCOUNT];
char regex[64];
sprintf(regex,"(?<=^%s=\\\").+(?<!\\\")", argv[1]);
char *str;
FILE *conf;
conf = fopen("test.conf", "rw");
fseek(conf, 0, SEEK_END);
int confSize = ftell(conf)+1;
rewind(conf);
str = malloc(confSize);
fread(str, 1, confSize, conf);
fclose(conf);
str[confSize-1] = '\n';
re = pcre_compile (
regex, /* the pattern */
PCRE_CASELESS | PCRE_MULTILINE, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
0); /* use default character tables */
if (!re) {
printf("pcre_compile failed (offset: %d), %s\n", erroffset, error);
return -1;
}
rc = pcre_exec (
re, /* the compiled pattern */
0, /* no extra data - pattern was not studied */
str, /* the string to match */
confSize, /* the length of the string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* output vector for substring information */
OVECCOUNT); /* number of elements in the output vector */
if (rc < 0) {
switch (rc) {
case PCRE_ERROR_NOMATCH:
printf("String didn't match");
break;
default:
printf("Error while matching: %d\n", rc);
break;
}
free(re);
return -1;
}
for (i = 0; i < rc; i++) {
printf("========\nlength of vector: %d\nvector[0..1]: %d %d\nchars at start/end: %c %c\n", ovector[2*i+1] - ovector[2*i], ovector[0], ovector[1], str[ovector[0]], str[ovector[1]]);
printf("file content length is %d\n========\n", strlen(str));
}
int newContentLen = strlen(argv[2])+1;
char *newContent = calloc(newContentLen,1);
memcpy(newContent, argv[2], newContentLen);
char *before = malloc(ovector[0]);
memcpy(before, str, ovector[0]);
int afterLen = confSize-ovector[1];
char *after = malloc(afterLen);
memcpy(after, str+ovector[1],afterLen);
int newFileLen = newContentLen+ovector[0]+afterLen;
char *newFile = calloc(newFileLen,1);
sprintf(newFile,"%s%s%s", before,newContent, after);
printf("%s\n", newFile);
return 0;
}
This code is working in some cases but if I want to replace FOO or ANOTHER_KEY there's something fishy.
$ ./search_replace.out FOO baz
========
length of vector: 5
vector[0..1]: 5 10
chars at start/end: b "
file content length is 94
========
FOO="9#baz"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
$ ./search_replace.out ANOTHER_KEY insert
========
length of vector: 10
vector[0..1]: 52 62
chars at start/end: h "
file content length is 94
========
FOO="baaar"
SOME_KEY="some value here"
ANOTHER_KEY=")insert"
SOMETHING="0"
FOO_BAR_BAZ="2"
Now if I change the format of the input file slightly to
TEST="new inserted"
FOO="test"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
the code is working fine.
I don't get why the code behaves differently here.
The extra characters before the substituted text come from not properly null-terminating your before string. (Just as you hadn't null-terminated the whole buffer str, as Paul R has pointed out.) So:
char *before = malloc(ovector[0] + 1);
memcpy(before, str, ovector[0]);
before[ovector[0]] = '\0';
Anyway, the business of allocating substrings and copying the contents seems needlessly complicated and prone to errors. For example, do the somethingLen variables count the terminating null character or not? Sometimes they do, sometimes they don't. I'd recommend to pick one representation and use it consistently. (And you should really free all allocated buffers after no longer using them and probably also clean up the compiled regex.)
You could do the replacement with just one allocation for the target buffer by using the precision field of the %s format specifier on the "before" part:
int cutLen = ovector[1] - ovector[0];
int newFileLen = confSize + strlen(argv[2]) - cutLen;
char *newFile = malloc(newFileLen + 1);
snprintf(newFile, newFileLen + 1, "%.*s%s%s",
ovector[0], str, argv[2], str + ovector[1]);
Or you could just use fprintf to write to the target file directly if you don't need the temporary buffer.
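A minimal sketch of that fprintf variant, assuming the match offsets have already been obtained (ovector[0] and ovector[1] in the question's code). The helper name write_replaced() is hypothetical, and the subject string must be NUL-terminated for the trailing %s to be safe.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write str to out with the bytes in [start, end)
 * replaced by replacement. Returns 0 on success, -1 on bad offsets or on a
 * write error. */
int write_replaced(FILE *out, const char *str, int start, int end,
                   const char *replacement)
{
    size_t len = strlen(str);

    if (start < 0 || end < start || (size_t)end > len)
        return -1;
    /* "%.*s" limits the first string to the start bytes before the match;
     * the last %s prints everything from the end of the match onwards */
    if (fprintf(out, "%.*s%s%s", start, str, replacement, str + end) < 0)
        return -1;
    return 0;
}
```

In the question's program this would be called as write_replaced(fp, str, ovector[0], ovector[1], argv[2]) on a file opened for writing, which removes the before, after and newFile buffers altogether.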
You forgot to terminate str, so subsequently calling strlen(str) will give unpredictable results. Either change:
str = malloc(confSize);
fread(str, 1, confSize, conf);
to:
str = malloc(confSize + 1); // note: extra char for '\0' terminator
fread(str, 1, confSize, conf);
str[confSize] = '\0'; // terminate string!
and/or pass confSize instead of strlen(str) to pcre_exec.
Your string is allocated confSize bytes of memory. Let's say that confSize is 10 as an example.
str = malloc(confSize);
So valid indexes for your string are 0-9. But this line assigns '\n' to the 10th index, which is the 11th byte:
str[confSize] = '\n';
If you're wanting the last character to be '\n', it should be:
str[confSize - 1] = '\n';
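Putting the sizing and termination advice together, reading the whole file into a properly terminated buffer could be sketched as follows. The helper name read_whole_file() is an assumption; it terminates at the number of bytes fread() actually returned, so the size and the terminator can never disagree.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read all of path into a newly allocated,
 * NUL-terminated buffer. Stores the byte count (not counting the '\0') in
 * *size_out if it is non-NULL. Returns NULL on failure; the caller frees
 * the result. */
char *read_whole_file(const char *path, size_t *size_out)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return NULL;

    char *buf = NULL;
    long size;
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0) {
        rewind(fp);
        buf = malloc((size_t)size + 1);   /* one extra byte for '\0' */
        if (buf) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';              /* valid indexes are 0..size */
            if (size_out)
                *size_out = got;
        }
    }
    fclose(fp);
    return buf;
}
```

The returned count is what should be passed to pcre_exec as the subject length, and the same buffer can safely be handed to strlen() or printed with %s.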

regmatch_t how can i get match only?

I don't think I understand how to return only the matched regular expression. I have a file that is a webpage. I'm trying to get all the links in the page. The regex works fine. But if I printf it out it will print out the line in which that match occurs. I only want to display the match only. I see you can do grouping so I tried that and am getting back an int value for my second printf call. According to the doc it is an offset. But offset to what? It doesn't seem to be accurate either because it would say 32 when character 32 on that line has nothing to do with the regex. I put in an exit just see the first match. Where am I going wrong?
char line[1000];
FILE *fp_original;
fp_original = fopen (file_original_page, "r");
regex_t re_links;
regmatch_t group[2];
regcomp (&re_links, "(href|src)=[\"|'][^\"']*[\"|']", REG_EXTENDED);
while (fgets (line, sizeof line, fp_original) != NULL) {
if (regexec (&re_links, line, 2, group, 0) == 0) {
printf ("%s", line);
printf ("%u\n", line[group[1].rm_so]);
exit (1);
}
}
fclose (fp_original);
regmatch_t array
regmatch_t is the type of the match array that you pass to the regexec call. If we pass 2 as the number of matches to regexec, we obtain the whole match in element [0] of the array and the first submatch in element [1].
For instance:
size_t nmatch = 2;
regmatch_t pmatch[2];
rc = regexec(&re_links, line, nmatch, pmatch, 0);
If this succeeded you can get the subexpression as follows:
printf("a matched substring \"%.*s\" is found at position %d to %d.\n",
pmatch[1].rm_eo - pmatch[1].rm_so, &line[pmatch[1].rm_so],
pmatch[1].rm_so, pmatch[1].rm_eo - 1);
Here is an example on how to apply the above:
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
regex_t preg;
char *string = "I'm a link to somewhere";
char *pattern = ".*\\(link\\).*";
size_t nmatch = 2;
regmatch_t pmatch[2];
regcomp(&preg, pattern, 0);
regexec(&preg, string, nmatch, pmatch, 0);
printf("a matched substring \"%.*s\" is found at position %d to %d.\n",
pmatch[1].rm_eo - pmatch[1].rm_so, &string[pmatch[1].rm_so],
pmatch[1].rm_so, pmatch[1].rm_eo - 1);
regfree(&preg);
return 0;
}
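The above code is certainly not safe. It serves only as an example. If you exchange pmatch with your group it should work. Also don't forget to parenthesize the part of your regex you want to capture in your group --> \\(.*\\) (that is the basic-regex syntax used in this example; with REG_EXTENDED, as in your code, use plain parentheses instead).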
Edit
In order to avoid the warning by the compiler concerning the field precision, you can replace the whole printf part with this:
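char *result;
size_t len = (size_t)(pmatch[1].rm_eo - pmatch[1].rm_so);
result = malloc(len + 1); // one extra byte for the terminating '\0'
strncpy(result, &string[pmatch[1].rm_so], len); // needs <string.h>
result[len] = '\0'; // strncpy() does not terminate the copy here
printf("a matched substring \"%s\" is found at position %lld to %lld.\n",
result, (long long)pmatch[1].rm_so, (long long)(pmatch[1].rm_eo - 1));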
// later on ...
free(result);
The resulting match (your group) gives you a start index and an end index. You need to print just the characters between those two indices.
group[0] will be the entire regex match. The subsequent groups will be any captures you have in your regex.
// group[] needs at least re_nsub + 1 elements for this loop
for(size_t i = 0; i <= re_links.re_nsub; ++i) {
printf("match %zu from index %d to %d: ", i, (int)group[i].rm_so, (int)group[i].rm_eo);
for(int j = group[i].rm_so; j < group[i].rm_eo; ++j) {
printf("%c", line[j]);
}
printf("\n");
}
For a full example see my answer here.
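Applied to the pattern from the question, a minimal sketch could look like this. It assumes a second pair of parentheses is added around the URL part so that the value alone lands in its own group; the helper name first_link() is hypothetical.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copy the first href/src value found in line into out
 * (outsize bytes). Returns 1 if a link was found and fits, 0 otherwise. */
int first_link(const char *line, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t group[3];   /* [0] whole match, [1] href|src, [2] the URL */
    int found = 0;

    /* Assumption: the URL gets its own capture group, ([^"']*) */
    if (regcomp(&re, "(href|src)=[\"']([^\"']*)[\"']", REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, line, 3, group, 0) == 0 && group[2].rm_so != -1) {
        /* rm_so and rm_eo are byte offsets into line, not characters */
        size_t len = (size_t)(group[2].rm_eo - group[2].rm_so);
        if (len < outsize) {
            memcpy(out, line + group[2].rm_so, len);
            out[len] = '\0';
            found = 1;
        }
    }
    regfree(&re);
    return found;
}
```

This also shows why printing line[group[1].rm_so] with %u gave a number: that expression is a single char at the start offset, not the matched text.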

C programming strings, pointers, and allocation

this problem I think is solely a lack of memory allocation issue.
(maybe skip to the bottom and read the final question for some simple suggestions)
I am writing this program that reads file(s) entered by the user. If the file 'includes' other files, then they will be read also. To check if another file includes a file, I parse the first word of the string. To do this I wrote a function that returns the parsed word, and a pointer is passed in that gets set to the first letter of the next word. For example consider the string:
"include foo" NOTE files can only include 1 other file
firstWord == include, chPtr == f
My algorithm parses firstWord to test for string equality with 'include', it then parses the second word to test for file validity and to see if the file has already been read.
Now, my problem is that many files are being read and chPtr gets overwritten. So, when I return the pointer to the next word. The next word will sometimes contain the last few characters of the previous file. Consider the example files named testfile-1 and bogus:
Let chPtr originally equal testfile-1 and now consider the parsing of 'include bogus':
extracting firstWord will == include, and chPtr will be overwritten to point to the b in bogus. So, chPtr will equal b o g u s '\0' l e - 1. the l e - 1 is the last few characters of testfile-1 since chPtr points to the same address of memory each time my function is called. This is a problem for me because when I parse bogus, chPtr will point to the l. Here is the code for my function:
char* extract_word(char** chPtr, char* line, char parseChar)
//POST: word is returned as the first n characters read until parseChar occurs in line
// FCTVAL == a ptr to the next word in line
{
int i = 0;
while(line[i] != parseChar && line[i] != '\0')
{
i++;
}
char* temp = Malloc(i + 1); //I have a malloc wrapper to check validity
for(int j = 0; j < i; j++)
{
temp[j] = line[j];
}
temp[i+1] = '\0';
*chPtr = (line + i + 1);
char* word = Strdup(temp); //I have a wrapper for strdup too
return word;
So, is my problem diagnosis correct? If so, do I make deep copies of chPtr? Also, how do I make deep copies of chPtr?
Thanks so much!
If I understand this correctly, you want to scan a file and, when an 'include' directive is encountered, scan the file specified in the 'include' directive, and so on ad infinitum for any levels of include, i.e. read one file which may include other files which may in turn include other files.
If that is so (and please correct if I am wrong ) then this is a classic recursion problem. The advantage of recursion is that all variables are created on the stack and are naturally freed when the stack unwinds.
The following code will do this without any need for malloc or free or the need to make copies of anything:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define INCLUDE "include"
#define INCOFFSET 7
static void
process_record (char *name, char *buf)
{
// process record here
printf ("%s:%s\n", name, buf);
}
// change this to detect your particular include
static int
isinclude (char *buf)
{
//printf ("%s:Record %s INCLUDE=%s INCOFFSET=%d\n", __func__, buf, INCLUDE,
// INCOFFSET);
if (!strncmp (buf, INCLUDE, INCOFFSET))
{
//printf ("%s:Record == include", __func__);
return 1;
}
return 0;
}
static int
read_file (char *name)
{
//printf ("%s:File %s\n", __func__, name);
FILE *fd = fopen (name, "r");
if (!fd)
{
printf ("%s:Cannot open %s\n", __func__, name);
return -1;
}
char buf[1024];
while (fgets (buf, sizeof (buf), fd))
{
size_t n = strcspn (buf, "\n");
buf[n] = '\0';
//printf ("%s:Buf %s\n", __func__, buf);
if (isinclude (buf))
{
read_file (buf + (INCOFFSET + 1));
}
else
{
process_record (name, buf);
}
}
fclose (fd);
return 0;
}
int
main (int argc, char *argv[])
{
if (argc < 2)
{
fprintf (stderr, "Usage: %s FILE\n", argv[0]);
exit (EXIT_FAILURE);
}
int ret = read_file (argv[1]);
if (ret < 0)
{
exit (EXIT_FAILURE);
}
exit (EXIT_SUCCESS);
}
char* temp = Malloc(i + 1); //I have a malloc wrapper to check validity
for(int j = 0; j < i; j++)
{
temp[j] = line[j];
}
temp[i+1] = '\0'; <------- subscript out of range replace with temp[i] = '\0';
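With that subscript fixed, the whole function could be sketched as below. This is a minimal version under a few assumptions: plain malloc() stands in for the Malloc wrapper, the intermediate temp/Strdup copy is dropped because one allocation is enough, and the next-word pointer is not advanced past the terminator when no delimiter is found.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of the text before the first delim in line
 * (or of the whole line if delim does not occur). *next is set to the
 * character after delim, or to the terminating '\0'. Returns NULL if
 * malloc() fails. The caller frees the result. */
char *extract_word(const char **next, const char *line, char delim)
{
    size_t len = 0;
    while (line[len] != delim && line[len] != '\0')
        len++;

    char *word = malloc(len + 1);
    if (!word)
        return NULL;
    memcpy(word, line, len);
    word[len] = '\0';             /* index len, not len + 1 */

    /* Do not step past the terminator when the line ended without delim */
    *next = (line[len] != '\0') ? line + len + 1 : line + len;
    return word;
}
```

Each returned word is its own allocation, so a later call cannot overwrite it; stale characters such as the "le-1" in the question can only come from reusing one buffer for several lines.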
It isn't clear where your problem is. But you might use a tool to help locate it.
Valgrind is one such (free) tool. It will detect a variety of memory access errors. (It likely would not have found your temp[i+1]='\0' error because that isn't "very wrong".)
Our CheckPointer tool is another tool. It finds errors Valgrind cannot (e.g., it should have found your buggy temp assignment). While it is commercial, the evaluation version handles programs of small size, which may work for you. (I'm at home and don't remember the limits.)
