C query string parsing - c

I have the following query string
address=1234&port=1234&username=1234&password=1234&gamename=1234&square=1234&LOGIN=LOGIN
I am trying to parse it into different variables: address,port,username,password,gamename,square and command (which would hold LOGIN)
I was thinking of using strtok but I don't think it would work. How can I parse the string to capture the variables ?
P.S - some of the fields might be empty - no gamename provided or square

When parsing a string that may contain an empty field between delimiters, strtok cannot be used, because strtok treats any number of sequential delimiters as a single delimiter.
So in your case, if the variable=values fields may also contain an empty-field between the '&' delimiters, you must use strsep, or other functions such as strcspn, strpbrk or simply strchr and a couple of pointers to work your way down the string.
The strsep function is a BSD function and may not be included with your C library. GNU includes strsep and it was envisioned as a replacement for strtok simply because strtok cannot handle empty-fields.
(If you do not have strsep available, you will simply need to keep a start and end pointer and use a function like strchr to locate each occurrence of '&' setting the end pointer to one before the delimiter and then obtaining the var=value information from the characters between start and end pointer, then updating both to point one past the delimiter and repeating.)
Here, you can use strsep with a delimiter of "&\n" to locate each '&' (the '\n' is included on the presumption that the line was read from a file with a line-oriented input function such as fgets or POSIX getline). You can then simply call strtok to parse the var=value text from each token returned by strsep, using "=" as the delimiter (the '\n' having already been removed from the last token when parsing with strsep).
An example inserting a specific empty-field for handling between "...gamename=1234&&square=1234...", could be as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (void) {

    char array[] = "address=1234&port=1234&username=1234&password=1234"
                   "&gamename=1234&&square=1234&LOGIN=LOGIN",
        *query = strdup (array),  /* duplicate array, &array is not char** */
        *tokens = query,
        *p = query;

    while ((p = strsep (&tokens, "&\n"))) {
        char *var = strtok (p, "="),
             *val = NULL;
        if (var && (val = strtok (NULL, "=")))
            printf ("%-8s %s\n", var, val);
        else
            fputs ("<empty field>\n", stderr);
    }

    free (query);
}
(note: strsep takes a char** parameter as its first argument and will modify the argument to point one past the delimiter, so you must preserve a reference to the start of the original allocated string (query above)).
Example Use/Output
$ ./bin/strsep_query
address 1234
port 1234
username 1234
password 1234
gamename 1234
<empty field>
square 1234
LOGIN LOGIN
(note: the conversion of "1234" to a numeric value has been left to you)
Look things over and let me know if you have further questions.

Related

strtok function and multithreading

I read a lot of stuff about strtok(char* s1, char* s2) and its implementation. However, I still cannot understand what makes it a dangerous function to use in a multi-threaded program. Can somebody please give me an example of a multi-threaded program and explain the issue there? Please note that I am looking for an example that shows me where the problem arises.
ps: strtok(char* s1, char* s2) is part of the C standard library.
In the first call to strtok, you supply the string and the delimiters. In subsequent calls, the first parameter is NULL, and you just supply the delimiters. strtok remembers the string that you passed in.
In a multithreaded environment, this is dangerous because many threads may be calling strtok with different strings. It will only remember the last one and return the wrong result.
Here is a concrete example:
Suppose first that your program is multi-threaded, and in one thread of execution, the following code runs:
char str1[] = "split.me.up";
// call this line A
char *word1 = strtok(str1, "."); // returns "split", sets str1[5] = '\0'
// ...
// call this line B
char *word2 = strtok(NULL, "."); // we hope to get back "me"
And in another thread, the following code runs:
char str2[] = "multi;token;string";
// call this line C
char *token1 = strtok(str2, ";"); // returns "multi", sets str2[5] = '\0'
// ...
// call this line D
char *token2 = strtok(NULL, ";"); // we hope to get back "token"
The point is, we don't really know what will be in word2 and token2:
If the commands are run in the order (A), (B), (C), (D), then we will get what we want.
But if, say, the commands run in the order (A), (C), (B), (D), then command (B) will search for a . delimiter in "token;string"! This is because the NULL first argument to command (B) tells strtok to continue searching in the last non-NULL search string it was passed, and because command (C) has already run, strtok will use str2.
Then command (B) will return token;string, at the same time setting the new starting character of a search to the NUL terminator at the end of str2. Then command (D) will think it is searching an empty string, because it will begin its search at str2's NUL terminator, and so will return NULL.
Even if you place commands (A) and (B) right next to each other, and commands (C) and (D) right next to each other, there is no guarantee that (B) will be executed right after (A) before either (C) or (D), etc.
If you create some sort of mutex or alternate guard to protect the use of the strtok function, and only call strtok from a thread which has obtained a lock on said mutex, then strtok is safe to use. However, it is probably better just to use the thread-safe strtok_r as others have said.
Edit: There is one more issue, that nobody else has mentioned, namely that strtok modifies and potentially uses global (or static, whatever) variables, and does so in a probably-not-thread-safe way, so even if you don't rely on repeating calls to strtok to get successive "tokens" from the same string, it may not be safe to use it in a multi-threaded environment without guards, etc.
To explain in simple terms: when a function is described as not thread-safe, it means that its state does not belong to your thread alone, and other threads can modify it too. It is like a cake being shared by 5 friends at the same time: it is unpredictable who ate which part, or who altered it.
Every call to strtok() returns a reference to a NUL-terminated string, and the function keeps its parsing position in static storage. Any subsequent call refers to that same static state and alters it, regardless of who made the call, and that is why it is not thread-safe.
strtok_r(), on the other hand, takes an additional third argument called saveptr (which the caller must supply) that holds the position for subsequent calls. The state is therefore no longer hidden inside the library but under the developer's control.
An example (from Steven Robbins' book, UNIX Systems Programming):
An incorrect use of strtok to determine the average number of words per line.
#include <string.h>

#define LINE_DELIMITERS "\n"
#define WORD_DELIMITERS " "

static int wordcount(char *s) {
    int count = 1;

    if (strtok(s, WORD_DELIMITERS) == NULL)
        return 0;
    while (strtok(NULL, WORD_DELIMITERS) != NULL)
        count++;
    return count;
}

double wordaverage(char *s) { /* return average number of words per line in s */
    int linecount = 1;
    char *nextline;
    int words;

    nextline = strtok(s, LINE_DELIMITERS);
    if (nextline == NULL)
        return 0.0;
    words = wordcount(nextline);
    while ((nextline = strtok(NULL, LINE_DELIMITERS)) != NULL) {
        words += wordcount(nextline);
        linecount++;
    }
    return (double)words/linecount;
}
The wordaverage function determines the average number of words per line by using strtok to find the next line. The function then calls wordcount to count the number of words on this line. Unfortunately, wordcount also uses strtok, this time to parse the words on the line. Each of these functions by itself would be correct if the other one did not call strtok. The wordaverage function works correctly for the first line, but when wordaverage calls strtok to parse the second line, the internal state information kept by strtok has been reset by wordcount.

Obtaining zero-length string from strtok()

I have a CSV file containing data such as
value;name;test;etc
which I'm trying to split by using strtok(string, ";"). However, this file can contain zero-length data, like this:
value;;test;etc
which strtok() skips. Is there a way I can avoid strtok from skipping zero-length data like this?
A possible alternative is to use the BSD function strsep() instead of strtok(), if available.
From the man page:
The strsep() function is intended as a replacement for the strtok()
function. While the strtok() function should be preferred for
portability reasons (it conforms to ISO/IEC 9899:1990 ("ISO C90"))
it is unable to handle empty fields, i.e., detect fields delimited by
two adjacent delimiter characters, or to be used for more than a
single string at a time. The strsep() function first appeared in
4.4BSD.
A simple example (also copied from that man page):
char *token, *string, *tofree;
tofree = string = strdup("value;;test;etc");
while ((token = strsep(&string, ";")) != NULL)
printf("token=%s\n", token);
free(tofree);
Output:
token=value
token=
token=test
token=etc
so empty fields are handled correctly.
Of course, as others already said, none of these simple tokenizer functions handles
delimiter inside quotation marks correctly, so if that is an issue, you should use
a proper CSV parsing library.
There is no way to make strtok() not behave this way. From man page:
A sequence of two or more contiguous delimiter bytes in the parsed
string is considered to be a single delimiter. Delimiter bytes at the
start or end of the string are ignored. Put another way: the tokens
returned by strtok() are always nonempty strings.
But what you can do is check the number of '\0' characters before the token, on the assumption that strtok() replaces every delimiter it encounters with '\0'. That way you'll know how many tokens were skipped. Source info:
This end of the token is automatically replaced by a null-character,
and the beginning of the token is returned by the function.
And a code sample to show what I mean.
char* aStr = ...;
char* ptr = NULL;
ptr = strtok (...);
char* back = ptr;
int count = -1;
do {
    back--;
    if (back <= aStr) break; // to protect against reads before aStr
    count++;
} while (*back == '\0'); // compare with '\0'; a single '=' would assign
(written without ide or testing, may be an invalid implementation, but the idea stands).
No you can't.
From "man strtok":
A sequence of two or more contiguous delimiter characters in the
parsed string is considered to be a single delimiter. Delimiter
characters at the start or end of the string are ignored. Put
another way: the tokens returned by strtok() are always nonempty
strings.
You could also run into problems if your data contains the delimiter inside quotes or any other "escape".
I think the best solution is to get a CSV parsing library or write your own parsing function.
From recent experience, it looks like strtok() does not necessarily replace all delimiters with the end of string characters, but rather replaces the first delimiter it finds with an end of string character and skips the following delimiters but leaves them in place.
This means that in the nominal case (no zero-length strings before delimiters), every call to strtok() after the first call to strtok() will return a pointer to a string that begins after a \0 character.
In the case where strtok() reads zero-length strings between delimiters, strtok() will return a pointer to a string that begins after a delimiter character that has not been replaced with \0.
Here is my solution for finding out whether strtok() has skipped a zero-length string between delimiters.
// Previous code is needed to point strtok to a string and start ingesting from it.
char * field_string = strtok(NULL, ","); // the delimiter argument must be a string, not a char
// Note that this can't be done after the first call to strtok for a given buffer, since the previous character would be outside of the string's memory space.
if (*(field_string-1) == '\0') {
// no delimiters were skipped
} else {
// one or more delimiters were skipped
}

Tokenize string with strtok

Example Text:
bclk = /gsrpkg_te/gsrpkg/gsrdie/xxBCLK
I would like to ask a question regarding strtok. Below is example code that raised some doubts for me.
char *p4;
char *p5;
p4 = strtok (eqvline, "=");
p5 = strtok (NULL, ":");
if ( !strcmp (p4, "bclk") ) {
strcpy ( sa_de_bclk, p5 );
printf ( "[vTPSim] ---> bclk = %s.\n", p5);
}
The example text above contains no ":" (colon) anywhere. My understanding of strtok() is that when the specified delimiter is not found, NULL is returned as the result.
However, why in this case, even though there is no ":", is p5 still assigned "/gsrpkg_te/gsrpkg/gsrdie/xxBCLK"?
Thanks for your help.
My understanding of strtok() is that when the specified delimiter is not found, NULL is returned as the result
Perhaps you are confusing strtok() with strchr() or strstr(). If none of the separator symbols is found in the remaining part of the string, then strtok() returns that remaining part (more precisely, a pointer to its first character). It may be the entire string if no delimiters could be found in it at all. Docs.
Quote from the docs for haters and deniers:
If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer.
Subsequent. Not immediately the call which couldn't find more delimiters, but the ones following it.
If the first parameter is NULL, strtok tries to get the next token. Since the first call was strtok (eqvline, "=") with eqvline != NULL and the delimiter =, the second call finds the next part, which is /gsrpkg_te/gsrpkg/gsrdie/xxBCLK in your example.
Maybe you should read this documentation of strtok http://www.cplusplus.com/reference/cstring/strtok/.
char * strtok ( char * str, const char * delimiters );
A sequence of calls to this function split str into tokens, which are sequences of contiguous characters separated by any of the characters that are part of delimiters.
On a first call, the function expects a C string as argument for str, whose first character is used as the starting location to scan for tokens. In subsequent calls, the function expects a null pointer and uses the position right after the end of last token as the new starting location for scanning.
Looking at your example, I would expect p4 to contain "bclk " rather than "bclk", as you didn't include the whitespace among the delimiters. Changing the delimiters between calls, as you did on line 4 (p5 = ...), is allowed: the second call scans what is left after "=" for ":" and, finding none, returns the rest of the string.
Hope this helps

strtok() issue: If tokens are delimited by delimiters,why is last token between a delimiter and the null '\0'?

In the following program, strtok() works as expected in the major part but I just can't comprehend the reason behind one finding. I have read about strtok() that:
To determine the beginning and the end of a token, the function first scans from the starting location for the first character not contained in delimiters (which becomes the beginning of the token). And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token.
Source: http://www.cplusplus.com/reference/cstring/strtok/
And as we know, strtok() places a \0 at the end of each token. But in the following program, the last delimiter is a dot(.), after which there is Toad between that dot and the quotation mark ("). Now the dot is a delimiter in my program, but there is no delimiter after Toad, not even a white space (which is a delimiter in my program). Please clear the following confusion arising from this premise:
Why is strtok() considering Toad as a token even though it is not between 2 delimiters? This is what I read about strtok() when it encounters a null character (\0):
Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.
Source: http://www.cplusplus.com/reference/cstring/strtok/
Nowhere does it say that once a null character is encountered, a pointer to the beginning of the token is returned (we don't even have a token here, as we didn't get an end of the token: no delimiter character was found after the scan began from the beginning of the token (i.e. from 'T' of Toad); we only found a null character, not a delimiter). So why is the part between the last delimiter and the quotation mark of the argument string considered a token by strtok()? Please explain this.
Code:
#include <stdio.h>
#include <string.h>

int main ()
{
    char str[] =" Falcon,eagle-hawk..;buzzard,gull..pigeon sparrow,hen;owl.Toad";
    char * pch=strtok(str," ;,.-");
    while (pch != NULL)
    {
        printf ("%s\n",pch);
        pch = strtok (NULL, " ;,.-");
    }
    return 0;
}
Output:
Falcon
eagle
hawk
buzzard
gull
pigeon
sparrow
hen
owl
Toad
The standard's specification of strtok (7.24.5.8) is pretty clear. In particular paragraph 4 (emphasis added by me) is directly relevant to the question, if I understand that correctly:
3 The first call in the sequence searches the string pointed to by s1 for the first character that is not contained in the current separator string pointed to by s2. If no such character is found, then there are no tokens in the string pointed to by s1 and the strtok function returns a null pointer. If such a character is found, it is the start of the first token.
4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer. If such a character is found, it is overwritten by a null character, which terminates the current token. The strtok function saves a pointer to the following character, from which the next search for a token will start.
In a call
char *where = strtok(string_or_NULL, delimiters);
the token (a pointer to which is) returned - if any - extends from the first non-delimiter character found from the starting position (inclusive) until the next delimiter character (exclusive), if one exists, or the end of the string, if no later delimiter character exists.
The linked description doesn't explicitly mention the case of a token extending until the end of the string, as opposed to the standard, so it is incomplete in that respect.
Going to the description in POSIX for strtok(), the description says:
char *strtok(char *restrict s1, const char *restrict s2);
A sequence of calls to strtok() breaks the string pointed to by s1 into a sequence of tokens, each of which is delimited by a byte from the string pointed to by s2. The first call in the sequence has s1 as its first argument, and is followed by calls with a null pointer as their first argument. The separator string pointed to by s2 may be different from call to call.
The first call in the sequence searches the string pointed to by s1 for the first byte that is not contained in the current separator string pointed to by s2. If no such byte is found, then there are no tokens in the string pointed to by s1 and strtok() shall return a null pointer. If such a byte is found, it is the start of the first token.
The strtok() function then searches from there for a byte that is contained in the current separator string. If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer. If such a byte is found, it is overwritten by a NUL character, which terminates the current token. The strtok() function saves a pointer to the following byte, from which the next search for a token shall start.
Note the second sentence of the third paragraph:
If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer.
This clearly states that in the example in the question, Toad is indeed a token. One way to think of it is that the list of delimiters always includes the NUL '\0' at the end of the delimiter string.
Having diagnosed that, note that strtok() is not a good function to use — it is not thread safe or reentrant. On Windows, you can use strtok_s() instead; on Unix, you can usually use strtok_r(). These are better functions because they don't store internally the pointer at which the search is to resume.
Because strtok() is not reentrant, you cannot call a function that uses strtok() from inside a function that itself uses strtok() while it is using strtok(). Also, any library function that uses strtok() must be clearly identified as doing so because it cannot be called from a function that is using strtok(). So, using strtok() makes life hard.
The other problem with the strtok() family of functions (and with strsep(), which is related) is that they overwrite the delimiter; you can't find out what the delimiter was after the tokenizer has tokenized the string. This can matter in some applications, such as parsing shell command lines, where it matters whether the delimiter is a pipe or a semicolon or an ampersand (or ...). So shell parsers usually don't use strtok(), despite the number of questions on SO about shells where the parser does use strtok().
Generally, you should steer clear of plain strtok(), and it is up to you to decide whether strtok_r() or strtok_s() is appropriate for your purposes.
Because cplusplus.com isn't telling you the whole story. Cppreference.com has a better description.
Cplusplus.com also fails to mention that strtok is not thread-safe, and only documents the strtok function of the C++ programming language, whereas cppreference.com does mention the thread safety issue and documents the strtok functions of both the C and the C++ programming languages.
Are you perhaps just mis-reading the description?
Once the terminating null character of str has been found in a call to
strtok, all subsequent calls to this function with a null pointer
as the first argument return a null pointer.
Given 'subsequent', I'm reading this as every call to strtok after the one that discovered \0, not necessarily the current one itself. So, the definition is consistent with behavior (and with what you would expect from strtok).
strtok breaks a string into a sequence of tokens, separated by the given delimiters.
Delimiters only separate tokens; they do not necessarily terminate them on both sides.
