Search and replace within a file using PCRE in C - c

I want to parse a shell style key-value config file with C and replace values as needed.
An example file could look like
FOO="test"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
To find the value, I want to use regular expressions. I'm a beginner with the PCRE library so I created some code to test around. This application takes two arguments: the first one is the key to search for. The second one is the value to fill into the double quotes.
#include <pcre.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#define OVECCOUNT 30
int main(int argc, char **argv){
const char *error;
int erroffset;
pcre *re;
int rc;
int i;
int ovector[OVECCOUNT];
char regex[64];
sprintf(regex,"(?<=^%s=\\\").+(?<!\\\")", argv[1]);
char *str;
FILE *conf;
conf = fopen("test.conf", "rw");
fseek(conf, 0, SEEK_END);
int confSize = ftell(conf)+1;
rewind(conf);
str = malloc(confSize);
fread(str, 1, confSize, conf);
fclose(conf);
str[confSize-1] = '\n';
re = pcre_compile (
regex, /* the pattern */
PCRE_CASELESS | PCRE_MULTILINE, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
0); /* use default character tables */
if (!re) {
printf("pcre_compile failed (offset: %d), %s\n", erroffset, error);
return -1;
}
rc = pcre_exec (
re, /* the compiled pattern */
0, /* no extra data - pattern was not studied */
str, /* the string to match */
confSize, /* the length of the string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* output vector for substring information */
OVECCOUNT); /* number of elements in the output vector */
if (rc < 0) {
switch (rc) {
case PCRE_ERROR_NOMATCH:
printf("String didn't match");
break;
default:
printf("Error while matching: %d\n", rc);
break;
}
free(re);
return -1;
}
for (i = 0; i < rc; i++) {
printf("========\nlength of vector: %d\nvector[0..1]: %d %d\nchars at start/end: %c %c\n", ovector[2*i+1] - ovector[2*i], ovector[0], ovector[1], str[ovector[0]], str[ovector[1]]);
printf("file content length is %d\n========\n", strlen(str));
}
int newContentLen = strlen(argv[2])+1;
char *newContent = calloc(newContentLen,1);
memcpy(newContent, argv[2], newContentLen);
char *before = malloc(ovector[0]);
memcpy(before, str, ovector[0]);
int afterLen = confSize-ovector[1];
char *after = malloc(afterLen);
memcpy(after, str+ovector[1],afterLen);
int newFileLen = newContentLen+ovector[0]+afterLen;
char *newFile = calloc(newFileLen,1);
sprintf(newFile,"%s%s%s", before,newContent, after);
printf("%s\n", newFile);
return 0;
}
This code works in some cases, but if I want to replace FOO or ANOTHER_KEY there's something fishy.
$ ./search_replace.out FOO baz
========
length of vector: 5
vector[0..1]: 5 10
chars at start/end: b "
file content length is 94
========
FOO="9#baz"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
$ ./search_replace.out ANOTHER_KEY insert
========
length of vector: 10
vector[0..1]: 52 62
chars at start/end: h "
file content length is 94
========
FOO="baaar"
SOME_KEY="some value here"
ANOTHER_KEY=")insert"
SOMETHING="0"
FOO_BAR_BAZ="2"
Now if I change the format of the input file slightly to
TEST="new inserted"
FOO="test"
SOME_KEY="some value here"
ANOTHER_KEY="here.we.go"
SOMETHING="0"
FOO_BAR_BAZ="2"
the code is working fine.
I don't get why the code behaves differently here.

The extra characters before the substituted text come from not properly null-terminating your before string. (Just as you hadn't null-terminated the whole buffer str, as Paul R has pointed out.) So:
char *before = malloc(ovector[0] + 1);
memcpy(before, str, ovector[0]);
before[ovector[0]] = '\0';
Anyway, the business of allocating substrings and copying the contents seems needlessly complicated and prone to errors. For example, do the somethingLen variables count the terminating null character or not? Sometimes they do, sometimes they don't. I'd recommend picking one representation and using it consistently. (And you should really free all allocated buffers once you no longer use them, and probably also clean up the compiled regex.)
You could do the replacement with just one allocation for the target buffer by using the precision field of the %s format specifier on the "before" part:
int cutLen = ovector[1] - ovector[0];
int newFileLen = confSize + strlen(argv[2]) - cutLen;
char *newFile = malloc(newFileLen + 1);
snprintf(newFile, newFileLen + 1, "%.*s%s%s",
ovector[0], str, argv[2], str + ovector[1]);
Or you could just use fprintf to the target file if you don't need the temporary buffer.
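As a rough, self-contained sketch of the single-allocation approach: the helper below is hypothetical (replace_span and its parameters are names made up for illustration, not part of PCRE), and it assumes start and end are the ovector[0] and ovector[1] offsets into a '\0'-terminated buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `src` in which the
 * bytes in [start, end) are replaced by `repl`. The caller frees the result.
 * Returns NULL on bad offsets or allocation failure. */
char *replace_span(const char *src, size_t start, size_t end, const char *repl)
{
    size_t src_len = strlen(src);
    size_t repl_len = strlen(repl);
    size_t new_len;
    char *out;

    if (start > end || end > src_len)
        return NULL;

    new_len = src_len - (end - start) + repl_len;
    out = malloc(new_len + 1);            /* +1 for the terminating '\0' */
    if (!out)
        return NULL;

    /* "%.*s" prints at most `start` characters of src: the "before" part. */
    snprintf(out, new_len + 1, "%.*s%s%s", (int)start, src, repl, src + end);
    return out;
}
```

Because every length here excludes the terminator and the + 1 appears only at the allocation, there is just one convention to keep track of.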

You forgot to terminate str, so subsequently calling strlen(str) will give unpredictable results. Either change:
str = malloc(confSize);
fread(str, 1, confSize, conf);
to:
str = malloc(confSize + 1); // note: extra char for '\0' terminator
fread(str, 1, confSize, conf);
str[confSize] = '\0'; // terminate string!
and/or pass confSize instead of strlen(str) to pcre_exec.
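A minimal sketch of reading a whole file into a terminated buffer, under the assumption that the file is small enough to hold in memory. read_file is a hypothetical helper name, and opening with "rb" is my choice for the sketch rather than something taken from the question.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a newly allocated,
 * '\0'-terminated buffer. Stores the byte count in *len_out when it is not
 * NULL. Returns NULL on any failure; the caller frees the result. */
char *read_file(const char *path, size_t *len_out)
{
    FILE *fp = fopen(path, "rb");
    long size;
    size_t got;
    char *buf;

    if (!fp)
        return NULL;
    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    buf = malloc((size_t)size + 1);       /* extra byte for the terminator */
    if (!buf) {
        fclose(fp);
        return NULL;
    }
    got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);

    buf[got] = '\0';                      /* terminate after the bytes read */
    if (len_out)
        *len_out = got;
    return buf;
}
```

The returned length can then be passed to pcre_exec as the subject length, and strlen on the buffer is safe as well.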

Your string is allocated confSize bytes of memory. Let's say that confSize is 10 as an example.
str = malloc(confSize);
So valid indexes for your string are 0-9. But this line assigns '\n' to the 10th index, which is the 11th byte:
str[confSize] = '\n';
If you're wanting the last character to be '\n', it should be:
str[confSize - 1] = '\n';

Related

Custom STRCAT is overwhelmed by too many arguments

I am trying to code a custom strcat that separates arguments with \n except for the last one and terminates the string with \0.
It's working fine as is up to 5 arguments, but if I try passing a sixth one I get a strange line in response :
MacBook-Pro-de-Domingo% ./test ok ok ok ok ok
ok
ok
ok
ok
ok
MacBook-Pro-de-Domingo% ./test ok ok ok ok ok ok
ok
ok
ok
ok
ok
P/Users/domingodelmasok
Here is my custom strcat code:
char cat(char *dest, char *src, int current, int argc_nb)
{
int i = 0;
int j = 0;
while(dest[i])
i++;
while(src[j])
{
dest[i + j] = src[j];
j++;
}
if(current < argc_nb - 1)
dest[i + j] = '\n';
else
dest[i + j] = '\0';
return(*dest);
}
UPDATE Complete calling function:
char *concator(int argc, char **argv)
{
int i;
int j;
int size = 0;
char *str;
i = 1;
while(i < argc)
{
j = 0;
while(argv[i][j])
{
size++;
j++;
}
i++;
}
str = (char*)malloc(sizeof(*str) * (size + 1));
i = 1;
while(i < argc)
{
cat(str, argv[i], i, argc);
i++;
}
free(str);
return(str);
}
What's wrong here?
Thanks!
Edit: Fixed blunder.
There are quite a few issues with the code:
sizeof (char) == 1 by the C standard.
cat() requires the destination to be a string (terminated by a \0), but does not append it itself (except for current >= argc_nb - 1). This is a bug.
free(str); return str; is a use-after-free bug. If you call free(str), the contents at str are irrevocably lost, inaccessible. The free(str) should simply be removed; it is not appropriate here.
Arrays in C are indexed at 0. However, the concator() function skips the first string pointer (because argv[0] contains the name used to execute the program). This is wrong, and will eventually trip someone. Instead, have concator() add all strings in the array, but call it using concator(argc - 1, argv + 1);.
There might be even more, but at this point, I believe a rewrite from scratch, using a much more appropriate approach, is in order.
Consider the following join() function:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
char *join(const size_t parts, const char *part[],
const char *separator, const char *suffix)
{
const size_t separator_len = (separator) ? strlen(separator) : 0;
const size_t suffix_len = (suffix) ? strlen(suffix) : 0;
size_t total_len = 0;
size_t p;
char *dst, *end;
/* Calculate sum of part lengths */
for (p = 0; p < parts; p++)
if (part[p])
total_len += strlen(part[p]);
/* Add separator lengths */
if (parts > 1)
total_len += (parts - 1) * separator_len;
/* Add suffix length */
total_len += suffix_len;
/* Allocate enough memory, plus end-of-string '\0' */
dst = malloc(total_len + 1);
if (!dst)
return NULL;
/* Keep a pointer to the current end of the result string */
end = dst;
/* Append each part */
for (p = 0; p < parts; p++) {
/* Insert separator */
if (p > 0 && separator_len > 0) {
memcpy(end, separator, separator_len);
end += separator_len;
}
/* Insert part */
if (part[p]) {
const size_t len = strlen(part[p]);
if (len > 0) {
memcpy(end, part[p], len);
end += len;
}
}
}
/* Append suffix */
if (suffix_len > 0) {
memcpy(end, suffix, suffix_len);
end += suffix_len;
}
/* Terminate string. */
*end = '\0';
/* All done. */
return dst;
}
The logic is simple. First, we find out the length of each component. Note that separator is only added between parts (so occurs parts-1 times), and suffix at the very end.
(The (string) ? strlen(string) : 0 idiom just means "if string is non-NULL, strlen(string), otherwise 0". We do that because we allow NULL separator and suffix, but strlen(NULL) is Undefined Behaviour.)
Next, we allocate enough memory for the result, including the end-of-string NUL char, \0, that was not included in the lengths.
To append each part, we keep the result pointer intact, and instead use a temporary end pointer. (It is the end of the string thus far.) We use a loop, where we copy the next part to the end. Before the second and subsequent parts, we copy the separator before the part.
Next, we copy the suffix, and finally the end-of-string '\0'. (It is important to return a pointer to the beginning of the string, rather than end, of course; and that is why we kept dst to point to the new resulting string, and end at the point we appended each substring.)
You could use it from the command line using for example the following main():
int main(int argc, char *argv[])
{
char *result;
if (argc < 4) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s SEPARATOR SUFFIX PART [ PART ... ]\n", argv[0]);
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
result = join(argc - 3, (const char **)(argv + 3), argv[1], argv[2]);
if (!result) {
fprintf(stderr, "Failed.\n");
return EXIT_FAILURE;
}
fputs(result, stdout);
return EXIT_SUCCESS;
}
If you compile the above to e.g. example (I use gcc -Wall -O2 example.c -o example), then running
./example ', ' $'!\n' Hello world
in a Bash shell outputs
Hello, world!
(with a newline at end). Running
./example ' and ' $'.\n' a b c d e f g
outputs
a and b and c and d and e and f and g
(again with a newline at end). The $'...' is just a Bash idiom to specify special characters in strings; $'!\n' is the same in Bash as "!\n" is in C, and $'.\n' is the Bash equivalent of ".\n" in C.
(Removing the automatic newline between parts, and allowing a string rather than just one char to be used as a separator and suffix, was a deliberate choice for two reasons. The main one is to stop anyone from just copy-pasting this as an answer to some exercise. The secondary one is to show that while it might sound more complicated than just using single characters for them, it is actually very little additional code; and if you consider the practical use cases, allowing a string to be used as the separator opens up a lot of options.)
The example code above is only very lightly tested, and might contain bugs. If you find any, or disagree with anything I've written above, do let me know in a comment so I can review, and fix as necessary.

How to copy a sentence from a longer string into a new array while including period?

I want to save part of a string into a new char array while including the period. For example, the string is:
My name is John. I have 1 dog.
I want to copy each char up to and including the first period, so the new char array will contain:
My name is John.
The code I have written below copies only "My name is John" but omits the period.
ptrBeg and ptrEnd point to the char at the beginning and end, respectively, of the portion I want to copy. My intention was to copy ptrBeg into array newBuf through a pointer to newBuf and then increment both ptrBeg and the pointer to the array until ptrBeg and ptrEnd point to the same char, which should always be a period.
At this point, the text of the string should be copied, so I increment the pointer to char array once more and copy the period to the new space using
++ptrnewBuf;
*ptrnewBuf = *ptrEnd;
Finally, I print the contents of newBuf.
Here's the total code:
int main()
{
char buf[] = "My name is John. I have 1 dog.";
char * ptrBuf;
char * ptrBeg;
char * ptrEnd;
ptrBeg = buf;
ptrBuf = ptrBeg;
while (*ptrBuf != '.'){
ptrBuf++;
}
ptrEnd = ptrBuf;
char newBuf[100];
char * ptrnewBuf = newBuf;
while(*ptrBeg != *ptrEnd){
*ptrnewBuf = *ptrBeg;
ptrnewBuf++;
ptrBeg++;
}
++ptrnewBuf;
*ptrnewBuf = *ptrEnd;
printf("%s", newBuf);
}
How would I modify this code to include a period?
You are on the right track, but you may be making things a bit more complicated than needed and overlooking a few critical checks. The key to iterating by pointers or using pointer arithmetic is to always validate and protect your array or memory bounds during each iteration or arithmetic operation.
Another tip is to always map out your pointer positions on a piece of paper before coding everything up, so you have a clear picture of what your iteration limits are and what adjustments are needed. (You don't have to use full long strings and many boxes; a representation of what needs to be done with a handful of characters is enough.) In your case, where you wish to copy the substring up through the first '.', something simple like the following will do, e.g.
+---+---+---+---+---+---+
| A | . | | B | . |\0 |
+---+---+---+---+---+---+
^ ^
| pointer (when *p == '.')
buf
So to copy "A." from buf to a new buffer you can't simply iterate while (*p != '.') or you will not copy '.'. By drawing it out, you can clearly see you need to also copy the character when *p == '.', e.g.
+---+---+---+---+---+---+
| A | . | | B | . |\0 |
+---+---+---+---+---+---+
^ ^
| |-->| pointer (p + 1)
buf
Now regardless of the actual length of the string before '.', you now know you need p + 1 as the final address to include the last character in the copy.
You also know how many characters your new buffer can store. Say the size of new is MAXC characters (maximum number of characters). So you can store a string of at most MAXC-1 characters (plus the nul-character). When you are filling new you need to always validate you are within MAXC-1 characters.
You also need to ensure your new string is nul-terminated (or it isn't a string, it's simply an array of characters). One effective way to ensure nul-termination is by initializing all characters in new to 0 when it is declared, e.g.
char new[MAXC] = "";
which initializes the 1st character to 0 (e.g. '\0' empty-string) and all remaining characters 0 by default. Now if you fill no more than MAXC-1 characters, you are guaranteed the array will be a nul-terminated string.
Putting it all together, you could do something like the following:
#include <stdio.h>
#define MAXC 128 /* if you need a constant, #define one (or more) */
int main (void) {
char buf[] = "My name is John. I have 1 dog.",
*p = buf, /* pointer to buf */
new[MAXC] = "", /* buffer for substring */
*n = new; /* pointer to new */
size_t ndx = 0; /* index for new */
/* loop copying each char until new full, '.' copied, or end of buf */
for (; ndx + 1 < MAXC && *p; p++, n++, ndx++) {
*n = *p; /* copy char from buf to new */
if (*n == '.') /* if char was '.' break */
break;
}
printf ("buf: %s\nnew: %s\n", buf, new);
return 0;
}
(note: ndx is incremented as part of the for loop to track the number of characters copied with the pointers)
Example Use/Output
$ ./bin/str_cpy_substr
buf: My name is John. I have 1 dog.
new: My name is John.
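If you do not have the luxury of initializing the string to ensure nul-termination, you can always affirmatively nul-terminate after your copy is done. For example, you could add the following after the for loop to ensure an array of unknown initialization is properly terminated. (Note that this form is only correct when the loop exited through the break at '.', where n still points at the '.'. If the loop instead ran to the end of buf or filled new, n already points one past the last copied character and a plain *n = 0; is what you want.)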
*++n = 0; /* nul-terminate (if not already done by initialization) and
* note ++n applied before * due to C operator precedence.
*/
Look things over and let me know if you have further questions.
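For comparison, here is a sketch of the same copy done with strchr and memcpy instead of a character-by-character loop. That is a different technique from the one above, and copy_through is a hypothetical name. It terminates the destination on every exit path and never writes more than size bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies `src` up to and including the first `stop`
 * character into `dst` (capacity `size`, terminator included). If `stop`
 * does not occur, the whole string is copied; the copy is truncated when
 * it does not fit. Returns the number of characters copied. */
size_t copy_through(char *dst, size_t size, const char *src, char stop)
{
    const char *hit;
    size_t len;

    if (size == 0)
        return 0;

    hit = strchr(src, stop);
    len = hit ? (size_t)(hit - src) + 1   /* +1 keeps the stop character */
              : strlen(src);
    if (len > size - 1)
        len = size - 1;                   /* leave room for the '\0' */

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

A call such as copy_through(newBuf, sizeof newBuf, buf, '.') would fill newBuf with the first sentence, period included.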
One option is to break it out into a helper function that "extracts" the first sentence from a line. It copies the characters over one at a time until the end of the source string is hit, the period is found, or the maximum length of the destination buffer is reached.
void ExtractFirstSentence(const char* line, char* dst, int size)
{
int count = 0;
char c ='\0';
if ((line == NULL) || (dst == NULL) || (size <= 0))
{
return;
}
while ((*line) && ((count+1) < size) && (c != '.'))
{
c = *line++;
*dst++ = c;
count++;
}
*dst = '\0';
}
int main()
{
char buf[] = "My name is John. I have 1 dog.";
char newBuf[100];
ExtractFirstSentence(buf, newBuf, 100);
printf("%s", newBuf);
}
If you want something a bit easier without dealing with all those pointers, try:
int main()
{
char buf[] = "My name is John. I have 1 dog.";
int i = 0;
int j = 0;
while(buf[i] != '.' && buf[i] != '\0') {
i++;
}
char newbuf[i+2]; /* i+1 copied characters plus the terminating '\0' */
while (j <= i) {
newbuf[j] = buf[j];
j++;
}
newbuf[j] = '\0';
printf("%s\n",newbuf);
return 0;
}
The array needs i + 2 elements: the copy loop writes the i + 1 characters at indexes 0 through i (including the period), and newbuf[j] = '\0' then writes the terminator at index i + 1. With only i + 1 elements that last write would land one past the end of the array.
You can use strtok() to split a string. Just type man strtok and you will see:
Program source
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int
main(int argc, char *argv[])
{
char *str1, *str2, *token, *subtoken;
char *saveptr1, *saveptr2;
int j;
if (argc != 4) {
fprintf(stderr, "Usage: %s string delim subdelim\n",
argv[0]);
exit(EXIT_FAILURE);
}
for (j = 1, str1 = argv[1]; ; j++, str1 = NULL) {
token = strtok_r(str1, argv[2], &saveptr1);
if (token == NULL)
break;
printf("%d: %s\n", j, token);
for (str2 = token; ; str2 = NULL) {
subtoken = strtok_r(str2, argv[3], &saveptr2);
if (subtoken == NULL)
break;
printf(" --> %s\n", subtoken);
}
}
exit(EXIT_SUCCESS);
}
An example of the output produced by this program is the following:
$ ./a.out 'a/bbb///cc;xxx:yyy:' ':;' '/'
1: a/bbb///cc
--> a
--> bbb
--> cc
2: xxx
--> xxx
3: yyy
--> yyy

Parsing substrings from a string with "sscanf" function in C

I have a gps string like below:
char gps_string[] = "$GPRMC,080117.000,A,4055.1708,N,02918.9336,E,0.00,316.26,00,,,A*78";
I want to parse the substrings between the commas like below sequence:
$GPRMC
080117.000
A
4055.1708
.
.
.
I have tried sscanf function like below:
sscanf(gps_string,"%s,%s,%s,%s,%s,",char1,char2,char3,char4,char5);
But this is not working. The char1 array gets the whole string if I use the above function.
Actually, I used the strchr function in my previous algorithm and got it to work, but it would be easier and simpler if I could get it to work with sscanf and get those parameters as substrings.
By the way, the substrings between the commas can vary, but the comma sequence is fixed. For example, below is another GPS string; it does not contain some of its parts because of a satellite problem:
char gps_string[] = "$GPRMC,001041.799,V,,,,,0.00,0.00,060180,,,N*";
There have been a number of comments in other answers stating that there are a number of problems with strtok() and suggesting using strpbrk() instead. An example of how this is used can be found at Arrays and strpbrk in C
I do not have a compiler available so I could not test this. I could have typos or other misteaks in the code, but I am sure that you can figure out what is meant.
In this case you would use
char *String_Buffer = gps_string;
char *start = String_Buffer;
char *end;
char *fields[MAXFIELDS];
int i = 0;
int n = 0;
char *match = NULL;
while ((end = strpbrk(start, ",")) != NULL) // Get pointer to next delimiter
{
/* found it, allocate enough space for it and NUL */
/* If there are two consecutive delimiters, only the NUL gets entered */
n = end - start;
match = malloc(n + 1);
/* copy and NUL terminate */
/* Note that if n is 0, nothing will be copied so do not need to test */
memcpy(match, start, n);
match[n] = '\0';
printf("Found field entry: %s\n", match);
/* Now save the actual match string pointer into the fields array*/
/* Since the match pointer is in fields, it does not need to be freed */
fields[i++] = match;
start = end + 1;
}
/* Check that the last element in the gps_string is not ,
Then get the final field, which has the NUL termination of the string */
n = strlen(start);
match = malloc(n + 1);
/* Note that if n is 0, only the terminator will be put in */
strcpy(match, start);
printf("Found field entry: %s\n", match);
fields[i++] = match;
printf("Total number of fields: %d\n", i);
You can use strtok:
#include <stdio.h>
#include <string.h>
int main(void) {
char gps_string[] = "$GPRMC,080117.000,A,4055.1708,N,02918.9336,E,0.00,316.26,00,,,A*78";
char* c = strtok(gps_string, ",");
while (c != NULL) {
printf("%s\n", c);
c = strtok(NULL, ",");
}
return 0;
}
EDIT: As Carey Gregory mentioned, strtok modifies the given string. This is explained in the man page I linked to, and you can find some details here too.
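One more thing to keep in mind with strtok: it treats a run of consecutive delimiters as a single delimiter, so the empty fields in a string like "V,,,,,0.00" disappear and the remaining fields shift position. If the field positions matter, a small in-place splitter that keeps empty fields may be a better fit. The following is only a sketch; split_fields is a hypothetical helper, and like strtok it modifies the string it is given.

```c
#include <string.h>

/* Hypothetical helper: splits `s` in place at every `delim` (which must not
 * be '\0'), storing a pointer to each field in `fields` (at most `max`
 * entries). Unlike strtok(), consecutive delimiters yield empty fields.
 * Returns the number of fields stored. */
size_t split_fields(char *s, char delim, char *fields[], size_t max)
{
    size_t n = 0;

    while (n < max) {
        char *sep = strchr(s, delim);

        fields[n++] = s;
        if (!sep)
            break;              /* last field reached */
        *sep = '\0';            /* terminate this field in place */
        s = sep + 1;
    }
    return n;
}
```

With this, the field that follows "V" in the second GPS string comes back as an empty string instead of being skipped.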

How do I split a string by character position in c

I'm using C to read in an external text file. The input is not great and would look like:
0PAUL 22 ACACIA AVENUE 02/07/1986RN666
As you can see, I have no obvious delimiter, and sometimes the values have no space between them. However, I do know how long in characters each value should be when split, which is as follows:
id = 1
name = 20
house number = 5
street name = 40
date of birth = 10
reference = 5
I've set up a structure I want to hold this information in, and have tried using fscanf to read in the file.
However, I find that something along the lines of the following just isn't doing what I need:
fscanf(file_in, "%1d, %20s", person.id[i], person.name[i]);
(The actual line I use attempts to grab all input but you should see where I'm going...)
The long term intention is to reformat the input file into another output file which would be made a little easier on the eye.
I appreciate I'm probably going about this all the wrong way, but I would hugely appreciate it if somebody could set me on the right path. If you're able to take it easy on me in regard to an obvious lack of understanding, I'd appreciate that also.
Thanks for reading
Use fgets to read one line at a time, then extract each field from the input line. Warning: no range checks are performed on the buffers, so take care to size them appropriately.
For example, something like this (I haven't compiled it, so it may contain errors):
void copy_substr(const char * pBuffer, int content_size, int start_idx, int end_idx, char * pOutBuffer)
{
end_idx = end_idx > content_size ? content_size : end_idx;
int j = 0;
for (int i = start_idx; i < end_idx; i++)
pOutBuffer[j++] = pBuffer[i];
pOutBuffer[j] = 0;
return;
}
void test_solution()
{
char buffer_char[200];
fgets(buffer_char,sizeof(buffer_char),stdin); // use your own FILE handle instead of stdin
int len = strlen(buffer_char);
char temp_buffer[100];
// Reading first field: str[0..1), so only the char 0 (len=1)
int field_size = 1;
int filed_start_ofs = 0;
copy_substr(buffer_char, len, filed_start_ofs, filed_start_ofs + field_size, temp_buffer);
}
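scanf is a good way to do it; you just need to read each line into a buffer and call sscanf several times with the right offsets.
For example (using fgets to read the line, because "%s" would stop at the first space; the & assumes id and street_number are int arrays):
char buffer[100];
fgets(buffer, sizeof(buffer), file_in);
sscanf(buffer, "%1d", &person.id[i]);
sscanf(buffer+1, "%20s", person.name[i]); /* reads up to the first space only */
sscanf(buffer+1+20, "%5d", &person.street_number[i]);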
and so on.
I feel like it is the easiest way to do it.
Please also consider using an array of your struct instead of a struct of arrays, it just feels wrong to have person.id[i] and not person[i].id
If you have fixed column widths, you can use pointer arithmetic to access substrings of your string str. If you have a starting index begin,
printf("%s", str + begin) ;
will print the substring beginning at begin and up to the end. If you want to print a string of a certain length, you can use printf's precision specifier .*, which takes a maximum length as additional argument:
printf("%.*s", length, str + begin) ;
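If you want to copy the string to a temporary buffer, you could use strncpy, but note that it does not null-terminate the result when the source is at least as long as the count you pass (the normal case when cutting a field out of a longer line), so you have to write the terminator yourself. You could also use snprintf according to the above pattern: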
char buf[length + 1];
snprintf(buf, sizeof(buf), "%.*s", length, str + begin) ;
This will also copy leading and trailing spaces, which is probably not what you want. You could write a function to strip the unwanted whitespace; there should be plenty of examples here on SO.
You could also strip the whitespace when copying the substring. The example code below does this with the isspace function/macro from <ctype.h>:
#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
int extract(char *buf, const char *str, int len)
{
const char *end = str + len;
int tail = -1;
int i = 0;
// skip leading white space;
while (str < end && *str && isspace((unsigned char)*str)) str++;
// copy string
while (str < end && *str) {
if (!isspace((unsigned char)*str)) tail = i + 1;
buf[i++] = *str++;
}
if (tail < 0) tail= i;
buf[tail] = '\0';
return tail;
}
int main()
{
char str[][80] = {
"0PAUL 22 ACACIA AVENUE 02/07/1986RN666",
"1BOB 1 POLK ST 01/04/1988RN802",
"2ALICE 99 WEST HIGHLAND CAUSEWAY 28/06/1982RN774"
};
int i;
for (i = 0; i < 3; i++) {
char *p = str[i];
char id[2];
char name[20];
char number[6];
char street[35];
char bday[11];
char ref[11];
extract(id, p + 0, 1);
extract(name, p + 1, 19);
extract(number, p + 20, 5);
extract(street, p + 25, 34);
extract(bday, p + 59, 10);
extract(ref, p + 69, 10);
printf("<person id='%s'>\n", id);
printf(" <name>%s</name>\n", name);
printf(" <house>%s</house>\n", number);
printf(" <street>%s</street>\n", street);
printf(" <birthday>%s</birthday>\n", bday);
printf(" <reference>%s</reference>\n", ref);
printf("</person>\n\n");
}
return 0;
}
There's a danger here, however: when you access a string at a certain position str + pos, you should make sure that you don't go beyond the actual string length. For example, your string may be terminated after the name. When you access the birthday, you access valid memory, but it might contain garbage.
You can avoid this problem by padding the full string with spaces.
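As an alternative to padding, the length check can be folded into the extraction itself. The sketch below assumes '\0'-terminated input lines; field_at is a hypothetical helper, not a library function. It does not strip whitespace, so it would be combined with something like the extract function above.

```c
#include <string.h>

/* Hypothetical helper: copies the fixed-width field starting at offset
 * `pos` with width `len` from `line` into `buf` (capacity `size`). Fields
 * that start beyond the end of a short line come back empty. Returns the
 * number of characters copied. */
size_t field_at(const char *line, size_t pos, size_t len,
                char *buf, size_t size)
{
    size_t line_len = strlen(line);
    size_t n = 0;

    if (size == 0)
        return 0;
    if (pos < line_len) {
        n = line_len - pos;               /* characters available */
        if (n > len)
            n = len;                      /* clamp to the field width */
        if (n > size - 1)
            n = size - 1;                 /* clamp to the buffer */
        memcpy(buf, line + pos, n);
    }
    buf[n] = '\0';
    return n;
}
```

A line that ends right after the name then yields empty strings for the later fields rather than whatever happens to sit in memory behind the terminator.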

C Regular Expressions: Extracting the Actual Matches

I am using regular expressions in C (using the "regex.h" library). After setting up the standard calls (and checks) for regcomp(...) and regexec(...), I can only manage to print the actual substrings that match my compiled regular expression.
Using regexec, according to the manual pages, means you store the substring matches in a structure known as "regmatch_t". The struct only contains rm_so and rm_eo to reference what I understand to be the addresses of the characters of the matched substring in memory, but my question is how can I just use these two offsets and two pointers to extract the actual substring and store it into an array (ideally a 2D array of strings)?
It works when you just print to standard out, but whenever you try to use the same setup but store it in a string/character array, it stores the entire string that was originally used to match against the expression.
Further, what is the "%.*s" inside the print statement? I imagine it's a regular expression in and of itself to read in the pointers to a character array correctly. I just want to store the matched substrings inside a collection so I can work with them elsewhere in my software.
Background: p and p2 are both pointers set to point to the start of string to match before entering the while loop in the code below:
[EDIT: "matches" is a 2D array meant to ultimately store the substring matches and was preallocated/initialized before the main loop you see below]
int ind = 0;
while(1){
regExErr1 = regexec(&r, p, 10, m, 0);
//printf("Did match regular expr, value %i\n", regExErr1);
if( regExErr1 != 0 ){
fprintf(stderr, "No more matches with the inherent regular expression!\n");
break;
}
printf("What was found was: ");
int i = 0;
while(1){
if(m[i].rm_so == -1){
break;
}
int start = m[i].rm_so + (p - p2);
int finish = m[i].rm_eo + (p - p2);
strcpy(matches[ind], ("%.*s\n", (finish - start), p2 + start));
printf("Storing: %.*s", matches[ind]);
ind++;
printf("%.*s\n", (finish - start), p2 + start);
i++;
}
p += m[0].rm_eo; // this will move the pointer p to the end of last matched pattern and on to the start of a new one
}
printf("We have in [0]: %s\n", temp);
There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.
The two structures it defines in <regex.h> are:
regex_t containing at least size_t re_nsub, the number of parenthesized subexpressions.
regmatch_t containing at least regoff_t rm_so, the byte offset from start of string to start of substring, and regoff_t rm_eo, the byte offset from start of string of the first character after the end of substring.
Note that 'offsets' are not pointers but indexes into the character array.
The execution function is:
int regexec(const regex_t *restrict preg, const char *restrict string,
size_t nmatch, regmatch_t pmatch[restrict], int eflags);
Your printing code should be:
for (int i = 0; i <= r.re_nsub; i++)
{
int start = m[i].rm_so;
int finish = m[i].rm_eo;
// strcpy(matches[ind], ("%.*s\n", (finish - start), p + start)); // Based on question
sprintf(matches[ind], "%.*s\n", (finish - start), p + start); // More plausible code
printf("Storing: %.*s\n", (finish - start), matches[ind]); // Print once
ind++;
printf("%.*s\n", (finish - start), p + start); // Why print twice?
}
Note that the code should be upgraded to ensure that the string copy (via sprintf()) does not overflow the target string — maybe by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string in the printing. For example:
printf("<<%.*s>>\n", (finish - start), p + start);
This makes it a whole heap easier to see spaces etc.
[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]
This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes'; little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>
#define tofind "^DAEMONS=\\(([^)]*)\\)[ \t]*$"
int main(int argc, char **argv)
{
FILE *fp;
char line[1024];
int retval = 0;
regex_t re;
regmatch_t rm[2];
//this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
const char *filename = "/etc/rc.conf";
if (argc > 1)
filename = argv[1];
if (regcomp(&re, tofind, REG_EXTENDED) != 0)
{
fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
return EXIT_FAILURE;
}
printf("Regex: %s\n", tofind);
printf("Number of captured expressions: %zu\n", re.re_nsub);
fp = fopen(filename, "r");
if (fp == 0)
{
fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
return EXIT_FAILURE;
}
while ((fgets(line, 1024, fp)) != NULL)
{
line[strcspn(line, "\n")] = '\0';
if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
{
printf("<<%s>>\n", line);
// Complete match
printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
// Match captured in (...) - the \( and \) match literal parenthesis
printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
char *src = line + rm[1].rm_so;
char *end = line + rm[1].rm_eo;
while (src < end)
{
size_t len = strcspn(src, " ");
if (src + len > end)
len = end - src;
printf("Name: <<%.*s>>\n", (int)len, src);
src += len;
src += strspn(src, " ");
}
}
}
return EXIT_SUCCESS;
}
This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.
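To address the "%.*s" part of the question directly: it is not a regular expression but an ordinary printf conversion in which the precision (the maximum number of characters to print) is taken from an int argument that precedes the string argument. Combined with snprintf, that is enough to store each match in a 2D array. The following is a sketch only: collect_matches, MAX_MATCHES and MAX_LEN are illustrative names and limits, not taken from the question.

```c
#include <regex.h>
#include <stdio.h>

#define MAX_MATCHES 8    /* illustrative limits, not from the question */
#define MAX_LEN     32

/* Hypothetical helper: stores every non-overlapping match of the POSIX
 * extended regex `pattern` in `text` into `out`. Returns the number of
 * matches stored, or -1 if the pattern does not compile. */
int collect_matches(const char *pattern, const char *text,
                    char out[MAX_MATCHES][MAX_LEN])
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (count < MAX_MATCHES && regexec(&re, p, 1, m, 0) == 0) {
        int len = (int)(m[0].rm_eo - m[0].rm_so);

        /* rm_so/rm_eo are offsets from p; "%.*s" copies at most len bytes
         * and snprintf truncates to MAX_LEN - 1 if necessary. */
        snprintf(out[count], MAX_LEN, "%.*s", len, p + m[0].rm_so);
        count++;

        if (m[0].rm_eo > 0)
            p += m[0].rm_eo;     /* continue after this match */
        else if (*p != '\0')
            p++;                 /* empty match at start: step one char */
        else
            break;
    }
    regfree(&re);
    return count;
}
```

Note that this simple loop restarts the search at p without REG_NOTBOL, so a pattern anchored with ^ may match again at each restart point.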
Since g++ regex is bugged until who knows when, you can use my code instead (License: AGPL, no warranty, your own risk, ...)
/**
* regexp (License: AGPL3 or higher)
* @param re extended POSIX regular expression
* @param nmatch maximum number of matches
* @param str string to match
* @return An array of char pointers. You have to free() the first element (string storage). The second element is the string matching the full regex, then come the submatches.
*/
char **regexp(char *re, int nmatch, char *str) {
char **result;
char *string;
regex_t regex;
regmatch_t *match;
int i;
match=malloc(nmatch*sizeof(*match));
if (!match) { /* check the buffer that was just allocated */
fprintf(stderr, "Out of memory !");
return NULL;
}
if (regcomp(&regex, re, REG_EXTENDED)!=0) {
fprintf(stderr, "Failed to compile regex '%s'\n", re);
free(match);
return NULL;
}
string=strdup(str);
if (regexec(&regex,string,nmatch,match,0)) {
#ifdef DEBUG
fprintf(stderr, "String '%s' does not match regex '%s'\n",str,re);
#endif
regfree(&regex);
free(match);
free(string);
return NULL;
}
regfree(&regex); /* the compiled pattern is no longer needed */
/* one slot for the string storage plus one per (sub)match */
result=malloc((nmatch+1)*sizeof(*result));
if (!result) {
fprintf(stderr, "Out of memory !");
free(match);
free(string);
return NULL;
}
result[0]=string;
/* Caveat: each match is terminated in place, so a submatch that ends
 * before its enclosing match truncates the enclosing match as well. */
for (i=0; i<nmatch; ++i) {
if (match[i].rm_so>=0) {
string[match[i].rm_eo]=0;
result[i+1]=string+match[i].rm_so;
#ifdef DEBUG
printf("%s\n",string+match[i].rm_so);
#endif
} else {
result[i+1]="";
}
}
free(match);
return result;
}
