C Way to Extract Variables from Strings

I was wondering how C programmers usually extract data from a string. I have read a lot about strtok, but I personally dislike the way the function works: having to call it again with NULL as a parameter seems odd to me. I once stumbled upon this little piece of code, which I find pretty sleek:
sscanf(data, "%*[^=]%*c%[^&]%*[^=]%*c%[^&]", usr, pw);
This would extract data from a URL query string (only var1=value&var2=value).
Is there a reason to use strtok over sscanf? Performance maybe?

IMHO the best way is the most readable and understandable way. Both sscanf and strtok fail that test for your user/pw extraction from a URL.
Instead, look for the boundaries of the strings you are looking for (in a URL: the slash, the at-sign, the colon, what have you) with strchr and strrchr, then memcpy from start to end into the destination buffer and tack on a NUL. This also allows for appropriate error handling should the string have an unexpected format.
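As a minimal sketch of that boundary-search approach, applied to the var1=value&var2=value query string from the question: the helper name query_value and its return convention are assumptions made for illustration, not part of the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the value of `key` from a query string such as
 * "user=bob&pw=secret" into dst (capacity dstsize, including the NUL).
 * Returns 0 on success, -1 if the key is missing or the value does not fit. */
int query_value(const char *query, const char *key, char *dst, size_t dstsize)
{
    assert(query != NULL && key != NULL && dst != NULL);
    size_t keylen = strlen(key);
    const char *p = query;

    while (*p != '\0') {
        /* Each field runs up to the next '&' or to the end of the string. */
        const char *end = strchr(p, '&');
        if (end == NULL)
            end = p + strlen(p);

        /* A match is "key=" at the start of the field. */
        if ((size_t)(end - p) > keylen &&
            strncmp(p, key, keylen) == 0 && p[keylen] == '=') {
            const char *val = p + keylen + 1;
            size_t len = (size_t)(end - val);
            if (len >= dstsize)
                return -1;              /* value too long for dst */
            memcpy(dst, val, len);
            dst[len] = '\0';            /* tack on the NUL */
            return 0;
        }
        p = (*end == '&') ? end + 1 : end;
    }
    return -1;                          /* key not found */
}
```

Calling query_value(data, "pw", pw, sizeof pw) leaves the input untouched and reports a missing key or an oversized value through the return code, which is the error handling the sscanf one-liner cannot provide.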

They are each better or more convenient at certain kinds of tasks:
sscanf allows you to concisely specify a fairly complex template for parsing values out of a line of text, but it is very unforgiving. If your input text differs by even a character from your template, the scan will fail. For that reason, it's almost never the right tool to use for human-generated input, for example. It is most useful for scanning automatically generated output, e.g. server log lines.
strtok is much more flexible, but also much more verbose: parsing a line with only a few fields may take many lines of code. It is also destructive: it actually modifies the string that is passed to it, so you may need to make a copy of the data before invoking strtok.
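Because sscanf simply stops at the first mismatch, its return value is the only sign that the template did not fit. The sketch below wraps the pattern from the question in a hypothetical parse_login helper (the name and the 32-byte buffer size are assumptions) that checks the conversion count and bounds each field with a width.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "name=value&name=value" with sscanf, keeping
 * only the two values. usr and pw must each have room for 32 bytes.
 * Returns 0 when both values were converted, -1 otherwise. */
int parse_login(const char *data, char *usr, char *pw)
{
    assert(data != NULL && usr != NULL && pw != NULL);
    /* %31[^&] stops at '&' and stores at most 31 characters plus the NUL. */
    int n = sscanf(data, "%*[^=]=%31[^&]&%*[^=]=%31[^&]", usr, pw);
    return (n == 2) ? 0 : -1;
}
```

Anything other than a return of 2 from sscanf means at least one field was not filled in, so the buffers should not be trusted in that case.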

strtok is a much simpler, low level function mostly used to tokenize strings that have an unknown element count.
NULL is used to tell strtok to continue scanning the string from the last position, saving you some pointer manipulation and probably (internally to strtok) some initialization.
There's also the matter of readability. Looking at the code snippet, it takes some time to understand what's going on.

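sscanf supports only a very limited pattern syntax (scansets such as %[^&], reminiscent of regular expressions but far less expressive, though efficient to implement), so if you want to do something more complicated, you cannot use sscanf.
That being said, strtok isn't reentrant: it keeps its position in hidden static state, so it is unsafe to use from several threads or in nested tokenizing loops. POSIX provides strtok_r for that case.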
But generally speaking, the one that ends up running faster for a particular circumstance and is more elegant is often considered to be the most idiomatic for that circumstance.
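If neither strtok nor the POSIX-only strtok_r is acceptable, a reentrant tokenizer can be built from the standard strspn and strcspn functions. This is a sketch of a different technique, not something the answers above propose, and the name next_token is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the next token starting at *cursor and store
 * its length in *len, or return NULL when the input is exhausted. The input
 * is never modified and all state lives in *cursor, so nested or concurrent
 * use is safe. Tokens are NOT NUL-terminated; use the returned length. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL && len != NULL);
    const char *start = *cursor + strspn(*cursor, delims); /* skip delimiters */
    if (*start == '\0') {
        *cursor = start;
        return NULL;
    }
    *len = strcspn(start, delims);      /* token runs to the next delimiter */
    *cursor = start + *len;
    return start;
}
```

Like strtok, this version treats a run of adjacent delimiters as one, so empty fields are skipped.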

I created a small header file with a few helper functions, such as a char **Split(src, sep) function and an int DoubleArrLen(char **arr) function (named SplitLen in the listing below).
It was about an hour's work; if you can improve it in any way, here it is.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
char *substring(char *string, int position, int length)
{
char *pointer;
int c;
pointer = malloc(length+1);
if (pointer == NULL)
{
printf("Unable to allocate memory.\n");
exit(EXIT_FAILURE);
}
for (c = 0 ; c < position -1 ; c++)
string++;
for (c = 0 ; c < length ; c++)
{
*(pointer+c) = *string;
string++;
}
*(pointer+c) = '\0';
return pointer;
}
char **Split(char *a_str, const char a_delim)
{
char **result = 0;
size_t count = 0;
char *tmp = a_str;
char *last_comma = 0;
/* Count how many elements will be extracted. */
while (*tmp)
{
if (a_delim == *tmp)
{
count++;
last_comma = tmp;
}
tmp++;
}
/* Add space for trailing token. */
count += last_comma < (a_str + strlen(a_str) - 1);
/* Add space for terminating null string so caller
knows where the list of returned strings ends. */
count++;
result = malloc(sizeof(char *) * count);
if (result)
{
char delim[2] = { a_delim, '\0' }; // Fix for inconsistent splitting
size_t idx = 0;
char *token = strtok(a_str, delim);
while (token)
{
assert(idx < count);
*(result + idx++) = strdup(token);
token = strtok(0, delim);
}
assert(idx == count - 1);
*(result + idx) = 0;
}
return result;
}
static int SplitLen(char **array)
{
int i = 0;
while (*array++ != 0)
i++;
return i;
}
int IndexOf(char *str, char *ch)
{
    size_t len = strlen(str);
    size_t sublen = strlen(ch);
    size_t i, cnt;
    if (sublen == 0 || len < sublen)
        return -1;
    /* Try every start position that leaves room for the whole substring. */
    for (i = 0; i + sublen <= len; i++)
    {
        for (cnt = 0; cnt < sublen; cnt++)
        {
            if (str[i + cnt] != ch[cnt])
                break;
        }
        if (cnt == sublen)
            return (int)i; /* first full match */
    }
    return -1;
}
int IndexOfChar(char *str, char ch)
{
int result = -1;
int i = 0;
for(;i<strlen(str); i++)
{
if(str[i] == ch)
{
result = i;
break;
}
}
return result;
}
A little explanation of the functions:
substring() extracts part of a string, given a 1-based start position and a length.
IndexOf() searches for a string inside the source string, returning its index, or -1 if it is not found.
The others should be self-explanatory.
This includes the Split function I mentioned earlier; you can use it instead of calling strtok directly.

Related

Why does my string_split implementation not work?

My str_split function returns (or at least I think it does) a char** - so a list of strings essentially. It takes a string parameter, a char delimiter to split the string on, and a pointer to an int to place the number of strings detected.
The way I did it, which may be highly inefficient, is to make a buffer of length x (x = length of the string), then copy characters of the string until we reach the delimiter or the '\0' character. Then it copies the buffer to the char**, which is what we are returning (and has been malloced earlier, and can be freed from main()), then clears the buffer and repeats.
Although the algorithm may be iffy, the logic seems sound, as my debug code (the _D) shows it's being copied correctly. The part I'm stuck on is when I make a char** in main and set it equal to the result of my function. It doesn't return null, crash the program, or throw any errors, but it doesn't quite seem to work either. I'm assuming this is what is meant by the term Undefined Behavior.
Anyhow, after a lot of thinking (I'm new to all this) I tried something else, which you will see in the code, currently commented out. When I use malloc to copy the buffer to a new string, and pass that copy to aforementioned char**, it seems to work perfectly. HOWEVER, this creates an obvious memory leak as I can't free it later... so I'm lost.
When I did some research I found this post, which follows the idea of my code almost exactly and works, meaning there isn't an inherent problem with the format (return value, parameters, etc) of my str_split function. YET his only has 1 malloc, for the char**, and works just fine.
Below is my code. I've been trying to figure this out and it's scrambling my brain, so I'd really appreciate help!! Sorry in advance for the 'i', 'b', 'c' it's a bit convoluted I know.
Edit: should mention that with the following code,
ret[c] = buffer;
printf("Content of ret[%i] = \"%s\" \n", c, ret[c]);
it does indeed print correctly. It's only when I call the function from main that it gets weird. I'm guessing it's because the buffer is out of scope?
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define DEBUG
#ifdef DEBUG
#define _D if (1)
#else
#define _D if (0)
#endif
char **str_split(char[], char, int*);
int count_char(char[], char);
int main(void) {
int num_strings = 0;
char **result = str_split("Helo_World_poopy_pants", '_', &num_strings);
if (result == NULL) {
printf("result is NULL\n");
return 0;
}
if (num_strings > 0) {
for (int i = 0; i < num_strings; i++) {
printf("\"%s\" \n", result[i]);
}
}
free(result);
return 0;
}
char **str_split(char string[], char delim, int *num_strings) {
int num_delim = count_char(string, delim);
*num_strings = num_delim + 1;
if (*num_strings < 2) {
return NULL;
}
//return value
char **ret = malloc((*num_strings) * sizeof(char*));
if (ret == NULL) {
_D printf("ret is null.\n");
return NULL;
}
int slen = strlen(string);
char buffer[slen];
/* b is the buffer index, c is the index for **ret */
int b = 0, c = 0;
for (int i = 0; i < slen + 1; i++) {
char cur = string[i];
if (cur == delim || cur == '\0') {
_D printf("Copying content of buffer to ret[%i]\n", c);
//char *tmp = malloc(sizeof(char) * slen + 1);
//strcpy(tmp, buffer);
//ret[c] = tmp;
ret[c] = buffer;
_D printf("Content of ret[%i] = \"%s\" \n", c, ret[c]);
//free(tmp);
c++;
b = 0;
continue;
}
//otherwise
_D printf("{%i} Copying char[%c] to index [%i] of buffer\n", c, cur, b);
buffer[b] = cur;
buffer[b+1] = '\0'; /* extend the null char */
b++;
_D printf("Buffer is now equal to: \"%s\"\n", buffer);
}
return ret;
}
int count_char(char base[], char c) {
int count = 0;
int i = 0;
while (base[i] != '\0') {
if (base[i++] == c) {
count++;
}
}
_D printf("Found %i occurence(s) of '%c'\n", count, c);
return count;
}
You are storing pointers to a buffer that exists on the stack. Using those pointers after returning from the function results in undefined behavior.
To get around this requires one of the following:
Allow the function to modify the input string (i.e. replace delimiters with null-terminator characters) and return pointers into it. The caller must be aware that this can happen. Note that supplying a string literal, as you are doing here, is not allowed in that case, because modifying a string literal is undefined behavior in C; you would instead need to do:
char my_string[] = "Helo_World_poopy_pants";
char **result = str_split(my_string, '_', &num_strings);
In this case, the function should also make it clear that a string literal is not acceptable input, and keep its first parameter non-const (char *string or char string[]); a const char * parameter would wrongly promise that the string is left untouched.
Allow the function to make a copy of the string and then modify the copy. You have expressed concerns about leaking this memory, but that concern is mostly to do with your program's design rather than a necessity.
It's perfectly valid to duplicate each string individually and then clean them all up later. The main issue is that it's inconvenient, and also slightly pointless.
Let's address the second point. You have several options, but if you insist that the result be easily cleaned-up with a call to free, then try this strategy:
When you allocate the pointer array, also make it large enough to hold a copy of the string:
// Allocate storage for `num_strings` pointers, plus a copy of the original string,
// then copy the string into memory immediately following the pointer storage.
char **ret = malloc((*num_strings) * sizeof(char*) + strlen(string) + 1);
char *buffer = (char*)&ret[*num_strings];
strcpy(buffer, string);
Now, do all your string operations on buffer. For example:
// Extract all delimited substrings. Here, buffer will always point at the
// current substring, and p will search for the delimiter. Once found,
// the substring is terminated, its pointer appended to the substring array,
// and then buffer is pointed at the next substring, if any.
int c = 0;
for(char *p = buffer; *buffer; ++p)
{
if (*p == delim || !*p) {
char *next = p;
if (*p) {
*p = '\0';
++next;
}
ret[c++] = buffer;
buffer = next;
}
}
When you need to clean up, it's just a single call to free, because everything was stored together.
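Putting the two fragments above together, a complete version of the single-allocation strategy might look like the sketch below. The function name str_split_single is an assumption, and unlike the loop above it records empty fields (for example a trailing delimiter) so that the reported count always matches the array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: split `string` on `delim`. The pointer array and the copied
 * characters share one malloc block, so the caller releases everything
 * with a single free(). Returns NULL on allocation failure. */
char **str_split_single(const char *string, char delim, int *num_strings)
{
    assert(string != NULL && num_strings != NULL);
    int n = 1;
    for (const char *p = string; *p != '\0'; ++p)
        n += (*p == delim);

    char **ret = malloc((size_t)n * sizeof *ret + strlen(string) + 1);
    if (ret == NULL)
        return NULL;

    char *buffer = (char *)&ret[n];     /* characters live after the pointers */
    strcpy(buffer, string);

    int c = 0;
    ret[c++] = buffer;
    for (char *p = buffer; *p != '\0'; ++p) {
        if (*p == delim) {
            *p = '\0';                  /* terminate the current substring */
            ret[c++] = p + 1;           /* the next one starts after it */
        }
    }
    *num_strings = c;
    return ret;
}
```

Because the input is only read, passing a string literal is fine here, and main() from the question can keep its single free(result).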
The string pointers you store into the ret array with ret[c] = buffer; point to an automatic array that goes out of scope when the function returns. Using them afterwards has undefined behavior. You should allocate a copy of each string instead, for example with strdup().
Note also that it might not be appropriate to return NULL when the string does not contain a separator. Why not return an array with a single string?
Here is a simpler implementation:
#include <stdlib.h>
#include <string.h>
char **str_split(const char *string, char delim, int *num_strings) {
int i, n, from, to;
char **res;
for (n = 1, i = 0; string[i]; i++)
n += (string[i] == delim);
*num_strings = 0;
res = malloc(sizeof(*res) * n);
if (res == NULL)
return NULL;
for (i = from = to = 0;; from = to + 1) {
for (to = from; string[to] != delim && string[to] != '\0'; to++)
continue;
res[i] = malloc(to - from + 1);
if (res[i] == NULL) {
/* allocation failure: free memory allocated so far */
while (i > 0)
free(res[--i]);
free(res);
return NULL;
}
memcpy(res[i], string + from, to - from);
res[i][to - from] = '\0';
i++;
if (string[to] == '\0')
break;
}
*num_strings = n;
return res;
}

freeing malloc'd memory causes other malloc'd memory to garbage

I'm trying to learn C, and one of the things I'm finding tricky is strings and manipulating them. I think I understand the basics of it, but I've taken for granted a lot of what might go into strings in JS or PHP (where I'm coming from).
I'm trying now to write a function that explodes a string into an array, based on a delimiter, using strtok. Similar to PHP's implementation of explode().
Here's the code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
char **explode(char *input, char delimiter) {
char **output;
char *token;
char *string = malloc(sizeof(char) * strlen(input));
char delimiter_str[2] = {delimiter, '\0'};
int i;
int delim_count = 0;
for (i = 0; i < strlen(input); i++) {
string[i] = input[i];
if (input[i] == delimiter) {
delim_count++;
}
}
string[strlen(input)] = '\0';
output = malloc(sizeof(char *) * (delim_count + 1));
token = strtok(string, delimiter_str);
i = 0;
while (token != NULL) {
output[i] = token;
token = strtok(NULL, delimiter_str);
i++;
}
// if i uncomment this line, output gets all messed up
// free(string);
return output;
}
int main() {
char **row = explode("id,username,password", ',');
int i;
for (i = 0; i < 3; i++) {
printf("%s\n", row[i]);
}
free(row);
return 0;
}
The question I have is why if I try to free(string) in the function, the output gets messed up, and if I'm doing this incorrectly in the first place. I believe I'm just not mapping out the memory properly in my head and that's why I'm not understanding the issue.
You misunderstand what strtok does. It does not make new strings; it simply returns pointers to different parts of the original string (after overwriting the delimiters with '\0'). If you then free that string, all the pointers you stored become invalid. I think you need:
while (token != NULL) {
output[i] = strdup(token);
token = strtok(NULL, delimiter_str);
i++;
}
strdup will allocate a new string and copy the token into it for you. The caller must then free each element as well as the array.
In output you save pointers that point into string, so when you free string, you free the memory that the output pointers are pointing to.
It's not enough to save the pointers; you have to copy the actual strings, which means allocating memory for each entry of output.
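A sketch of what copying the actual strings can look like, using only standard C (strdup is POSIX, so malloc and memcpy are used instead). The names explode_copy and explode_free are hypothetical. Note also that the question's malloc(sizeof(char) * strlen(input)) leaves no room for the '\0' it later writes at string[strlen(input)]; the + 1 below avoids that.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Release an array returned by explode_copy(). */
void explode_free(char **parts)
{
    if (parts == NULL)
        return;
    for (size_t i = 0; parts[i] != NULL; i++)
        free(parts[i]);
    free(parts);
}

/* Sketch: split `input` on `delimiter` into individually allocated strings.
 * The result is NULL-terminated; *count (if non-NULL) receives the number
 * of fields. Returns NULL on allocation failure. */
char **explode_copy(const char *input, char delimiter, size_t *count)
{
    assert(input != NULL);
    size_t fields = 1;
    for (const char *p = input; *p != '\0'; p++)
        fields += (*p == delimiter);

    char **out = malloc((fields + 1) * sizeof *out); /* +1 for the sentinel */
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i <= fields; i++)
        out[i] = NULL;

    const char *start = input;
    for (size_t i = 0; i < fields; i++) {
        size_t len = 0;
        while (start[len] != '\0' && start[len] != delimiter)
            len++;
        out[i] = malloc(len + 1);       /* +1 for the terminating NUL */
        if (out[i] == NULL) {
            explode_free(out);
            return NULL;
        }
        memcpy(out[i], start, len);
        out[i][len] = '\0';
        start += len + (start[len] == delimiter);
    }
    if (count != NULL)
        *count = fields;
    return out;
}
```

Since every element owns its own memory, no temporary copy of the input is needed at all, and the caller cleans up with one call to explode_free(row).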

Can't modify an array within a loop (C)

I am currently developing a small program that requires a function returning a string (character array) and taking two parameters (phrase, c). 'phrase' is the input string and 'c' is the character to remove from the phrase. The left-over spaces will also be removed.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
//This method has two parameters: (str, c)
//It will remove all occurences of var 'c'
//inside of 'str'
char * rmchr(char * str, char *c) {
//Declare result array
char *strVal = (char *) malloc(sizeof(char) * strlen(str));
//Iterate through each character
for (int i = 0; i < strlen(str); i++) {
*(strVal+i) = str[i];
//Check if char matches 'c'
if (strVal[i] != *c){
//Assign filtered value to new array
*(strVal+i) = str[i];
printf("%c", strVal[i]);
}
}
return strVal;
}
int main()
{
char * result = rmchr("This is a great message to test with! It includes a lot of examples!","i");
return 1;
}
Inside of the 'rmchr' function (if-statement), the array prints out exactly how I'd like to return it:
Ths s a great message to test wth! It ncludes a lot of examples!
The problem is that my return variable, 'strVal', isn't being modified outside of the if-statement. How can I modify the array permanently so that my ideal output is returned in 'result' (inside of main)?
I see a few points to address. Primarily, this code directly copies the input string verbatim as it stands. The same *(strVal+i) = str[i]; assignment takes place in two locations in the code which disregards the comparison against *c. Without some secondary index variable j, it becomes difficult to keep track of the end of the receiving string.
Additional notes:
There is no free for your malloc; this creates a memory leak.
You return exit code 1, which indicates abnormal program termination. Return 0 to indicate a normal exit.
Don't cast the pointer malloc returns; this can hide errors.
Validate malloc success and exit if it failed.
strlen() is a linear time operation that iterates through the entire parameter string on each call. Call it once and store the result in a variable to save cycles.
This code does not handle removal of extra spaces as required.
See the sample below for a possible implementation that addresses some of the above points:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
char *rmchr(char *str, char *c) {
int i = 0;
int j = 0;
int len = strlen(str);
char *result = malloc(sizeof(*result) * (len + 1));
if (result == NULL) {
fprintf(stderr, "out of memory\n");
exit(1);
}
while (i < len) {
if (str[i] != *c) {
result[j++] = str[i++];
}
else {
for (i++; i < len && str[i] == ' '; i++);
}
}
result[j] = '\0';
return result;
}
int main() {
char *result = rmchr("This is a great message to test with! It includes a lot of examples!", "i");
for (int i = 0; i < strlen(result); i++) {
printf("%c", result[i]);
}
free(result);
return 0;
}
Output:
Ths s a great message to test wth! It ncludes a lot of examples!

sscanf parse formatted string

I would like to read a string containing an undefined number of suffixes, all separated by ;
example 1: « .txt;.jpg;.png »
example 2: « .txt;.ods;_music.mp3;.mjpeg;.ext1;.ext2 »
I browsed the web and wrote this piece of code, which doesn't work:
char *suffix[MAX]; /* will contain pointers to the different suffixes */
for (i = 0; i < MAX ; i++)
{
suffix[i] = NULL;
if (suffix_str && sscanf(suffix_str,"%[^;];%[^\n]",suffix[i],suffix_str) < 1)
suffix_str = NULL;
}
After the first iteration, the result of sscanf is 0. Why didn't it read the content of the string?
How should a string containing an undefined number of elements be parsed? Is sscanf a good choice?
First, as covered in the general comments, you're invoking undefined behavior by using the same buffer as both the input source and an output destination for sscanf; the C standard does not define the result when copying takes place between overlapping objects. (suffix[i] is also NULL at that point, so sscanf has no storage to write the suffix into.)
The correct function to use for this would likely be strtok. A very simple example appears below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char line[] = ".txt;.ods;_music.mp3;.mjpeg;.ext1;.ext2";
size_t slen = strlen(line); // worst case
char *suffix[slen/2+1], *ext;
size_t count=0;
for (ext = strtok(line, ";"); ext; ext = strtok(NULL, ";"))
suffix[count++] = ext;
// show suffix array entries we pulled
for (size_t i=0; i<count; ++i)
printf("%s ", suffix[i]);
fputc('\n', stdout);
}
Output
.txt .ods _music.mp3 .mjpeg .ext1 .ext2
Notes
This code assumes a worst-case suffix count of half the string length plus one, i.e. a list of single-character suffixes separated by the delimiter.
The suffix array contains pointers into the now-sliced-up original line buffer. The lifetime of usability for those pointers is therefore only as long as that of the line buffer itself.
Hope it helps.
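As for whether sscanf can do this at all: it can, provided the destination is separate storage and the read position is advanced by hand using the %n directive. The following is a sketch under assumptions (the name parse_suffixes and the 32-byte SUFFIX_LEN limit are invented for illustration), not a recommendation over strtok.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SUFFIX_LEN 32  /* assumed maximum suffix size, including the NUL */

/* Sketch: copy up to `max` ';'-separated suffixes from `list` into `out`.
 * Empty fields are skipped; a suffix longer than 31 characters is split
 * into several entries. Returns the number of suffixes stored. */
size_t parse_suffixes(const char *list, char out[][SUFFIX_LEN], size_t max)
{
    assert(list != NULL && out != NULL);
    size_t count = 0;
    while (count < max && *list != '\0') {
        int used = 0;
        /* %31[^;] reads one suffix (31 = SUFFIX_LEN - 1);
         * %n records how many characters were consumed. */
        if (sscanf(list, "%31[^;]%n", out[count], &used) == 1) {
            count++;
            list += used;
        }
        if (*list == ';')
            list++;                     /* step over the separator */
    }
    return count;
}
```

The input string is only read, never written, so the overlap problem described above does not arise.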
There are several ways to tokenize from a C string. In addition to using strtok and sscanf you could also do something like this:
char *temp = suffix_str;
char *suffix[MAX];
for (int i = 0; i < MAX; i++)
{
    size_t j = 0;
    char buf[32];
    /* Copy up to the next ';'; characters beyond buf's capacity are dropped. */
    while (*temp != '\0' && *temp != '\n' && *temp != ';')
    {
        if (j < sizeof(buf) - 1)
            buf[j++] = *temp;
        temp++;
    }
    buf[j] = '\0';
    if (*temp == ';') temp++;
    suffix[i] = malloc(j + 1);
    if (suffix[i] == NULL)
        break; /* handle memory allocation error */
    strcpy(suffix[i], buf);
}

Split function in C runtime error

I get a runtime error when running a C program.
Here is the C source (the parsing.h header code is a little lower):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "parsing.h"
int main()
{
printf("Enter text seperated by single spaces :\n");
char *a = malloc(sizeof(char)*10);
gets(a);
char **aa = Split(a, ' ');
int k = SplitLen(a, ' ');
int i = 0;
for(;i<k;i++)
{
printf("%s\n", aa[i]);
}
free(a);
free(aa);
return 0;
}
and the parsing.h file:
#include <string.h>
#include <stdlib.h>
#include <malloc.h>
#include <assert.h>
char** Split(char* a_str, const char a_delim)
{
char** result = 0;
int count = 0;
char* tmp = a_str;
char* last_comma = 0;
/* Count how many elements will be extracted. */
while (*tmp)
{
if (a_delim == *tmp)
{
count++;
last_comma = tmp;
}
tmp++;
}
/* Add space for trailing token. */
count += last_comma < (a_str + strlen(a_str) - 1);
/* Add space for terminating null string so caller
knows where the list of returned strings ends. */
count++;
result = malloc(sizeof(char*) * count);
if (result)
{
size_t idx = 0;
char* token = strtok(a_str, ",");
while (token)
{
assert(idx < count);
*(result + idx++) = strdup(token);
token = strtok(0, ",");
}
assert(idx == count - 1);
*(result + idx) = 0;
}
return result;
}
int SplitLen(char *src, char sep)
{
int result = 0;
int i;
for(i = 0; i<strlen(src); i++)
{
if(src[i] == sep)
{
result += 1;
}
}
return result;
}
I'm sure most of the code is unneeded, but I posted the whole lot in case it is relevant. Here is the runtime error:
a.out: parsing.h:69: Split: Assertion `idx == count - 1' failed.
Aborted
Thanks in advance. For info, I didn't write all of it myself; I took some pieces from other places, but most of it is my own code.
The purpose of the assert function is to stop your program if the condition passed as an argument is false. What this tells you is that when you ran your program, idx != count - 1 at line 69. I didn't take the time to check what effect that has on the execution of your program, but apparently idx was intended to equal count - 1 there.
Does that help?
There are many problems. I'm ignoring the code split into two files; I'm treating it as a single file (see comments to question).
Do not use gets(). Never use gets(). Do not ever use gets(). I said it three times; it must be true. Note that gets() is no longer a Standard C function (it was removed from the C11 standard — ISO/IEC 9899:2011) because it cannot be used safely. Use fgets() or another safe function instead.
You don't need to use dynamic memory allocation for a string of 10 characters; use a local variable (it is simpler).
You need a bigger string — think about 4096.
You don't check whether you got any data; always check input function calls.
You don't free all the substrings at the end of main(), thus leaking memory.
One major problem is that the Split() code slices and dices the input string, so SplitLen() cannot give you the same answer that Split() does for the number of fields. The strtok() function is destructive. It also treats multiple adjacent delimiters as a single delimiter. Your code won't account for the difference.
Another major problem is that you count fields based on the delimiter passed into the Split() function, but you call strtok(..., ",") to actually split on commas. This is more consistent with the comments and variable names (such as last_comma), but totally misleading to you. This is why your assertion fired.
You don't need to include <malloc.h> unless you are using the extra facilities it provides. You aren't, so you should not include it; <stdlib.h> declares malloc() and free() perfectly well.
This code works for me; I've annotated most of the places I made changes.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int altSplitLen(char **array);
static char **Split(char *a_str, const char a_delim);
static int SplitLen(char *src, char sep);
int main(void)
{
printf("Enter text separated by single spaces:\n");
char a[4096]; // Simpler
if (fgets(a, sizeof(a), stdin) != 0) // Error checked!
{
char **aa = Split(a, ' ');
int k = SplitLen(a, ' ');
printf("SplitLen() says %d; altSplitLen() says %d\n", k, altSplitLen(aa));
for (int i = 0; i < k; i++)
{
printf("%s\n", aa[i]);
}
/* Workaround for broken SplitLen() */
{
puts("Loop to null pointer:");
char **data = aa;
while (*data != 0)
printf("[%s]\n", *data++);
}
{
// Fix for major leak!
char **data = aa;
while (*data != 0)
free(*data++);
}
free(aa); // Major leak!
}
return 0;
}
char **Split(char *a_str, const char a_delim)
{
char **result = 0;
size_t count = 0;
char *tmp = a_str;
char *last_comma = 0;
/* Count how many elements will be extracted. */
while (*tmp)
{
if (a_delim == *tmp)
{
count++;
last_comma = tmp;
}
tmp++;
}
/* Add space for trailing token. */
count += last_comma < (a_str + strlen(a_str) - 1);
/* Add space for terminating null string so caller
knows where the list of returned strings ends. */
count++;
result = malloc(sizeof(char *) * count);
if (result)
{
char delim[2] = { a_delim, '\0' }; // Fix for inconsistent splitting
size_t idx = 0;
char *token = strtok(a_str, delim);
while (token)
{
assert(idx < count);
*(result + idx++) = strdup(token);
token = strtok(0, delim);
}
assert(idx == count - 1);
*(result + idx) = 0;
}
return result;
}
int SplitLen(char *src, char sep)
{
int result = 0;
for (size_t i = 0; i < strlen(src); i++)
{
if (src[i] == sep)
{
result += 1;
}
}
return result;
}
static int altSplitLen(char **array)
{
int i = 0;
while (*array++ != 0)
i++;
return i;
}
Sample run:
$ parsing
Enter text separated by single spaces:
a b c d e f gg hhh iii jjjj exculpatory evidence
SplitLen() says 0; altSplitLen() says 12
Loop to null pointer:
[a]
[b]
[c]
[d]
[e]
[f]
[gg]
[hhh]
[iii]
[jjjj]
[exculpatory]
[evidence
]
$
Note that fgets() keeps the newline and gets() does not, so the newline was included in output. Note also how the printf() printing the data showed the limits of the strings; that is enormously helpful on many occasions.
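If the trailing newline is unwanted, a common idiom is to cut the line at the first '\n' with strcspn. A minimal sketch follows; the helper names strip_newline and read_line are assumptions, not part of the answer's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: cut `line` at its first newline, if it has one. */
char *strip_newline(char *line)
{
    assert(line != NULL);
    line[strcspn(line, "\n")] = '\0';   /* no-op when there is no newline */
    return line;
}

/* Hypothetical helper: fgets() plus newline removal.
 * Returns buf, or NULL at end of file or on a read error. */
char *read_line(char *buf, size_t size, FILE *fp)
{
    assert(buf != NULL && size > 0 && fp != NULL);
    if (fgets(buf, (int)size, fp) == NULL)
        return NULL;
    return strip_newline(buf);
}
```

With read_line(a, sizeof(a), stdin) in place of the fgets() call, the last token in the sample run would be "evidence" without the embedded line break.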
