I have seen the following piece of code in a library. What is the behavior of strtok when an empty string is passed as the delimiter? As far as I can see, whatever buf contains is stored in the token variable after the strtok call.
char buf[256] = {0};
char *token = NULL;
...
...
while (!feof(filePtr))
{
os_memset(buf, 0, sizeof(buf));
if (fgets(buf, 256, filePtr))
{
token = strtok(buf, "");
...
...
}
}
strtok() starts by looking for the first character not in the delimiter list, to find the beginning of a token. Since no character is in an empty delimiter list, the first character of the string will be the beginning of the token.
Then it looks for the next character in the delimiter list, to find the end of the token. Since there are no delimiters, it will never find any of them, so it stops at the end of the string.
As a result, an empty delimiter list means the entire string will be returned as a single token (or NULL if the string itself is empty).
Why he wrote it like this is anyone's guess.
I want to use strtok and then return the string after the null terminator that strtok has placed.
char *foo(char *bar)
{
strtok(bar, " ");
return after_strtok_null(bar);
}
/*
examples:
foo("hello world") = "world"
foo("remove only the first") = "only the first"
*/
My goal is not to skip the first word (I know a simple while loop will do that); I want to call strtok once and then return the part that was not tokenized.
I will provide details of what I am trying to do at the end of the question, although I don't think it's really necessary.
One solution that came to mind was to simply skip all the null terminators until I reach a non-null character:
char *foo(char *bar)
{
bar = strtok(bar, " ");
while(!(*(bar++)));
return bar;
}
This works fine for the examples shown above, but with a single word I may mistake the string's own null terminator for the one written by strtok, and then access unallocated memory.
For example, if I call foo("demo" /* '\0' */), the result of strtok will be "demo" /* '\0' */
and then, if I run the while loop, I will access the memory after the string demo. Another solution I have tried is to use strlen, but it has exactly the same problem.
I am trying to create a function that takes a sentence. Some sentences have their first word terminated with a colon, although not necessarily. If the first word ends with a colon, the function needs to insert that word (without the colon) into a global table and return the sentence without that word and without the spaces that follow it. Otherwise, it just returns the sentence without its leading spaces.
You could use str[c]spn instead:
char *foo(char *bar) {
size_t pos = strcspn(bar, " ");
pos = strspn((bar += pos), " ");
// *bar = '\0'; // uncomment to mimic strtok
return bar + pos;
}
You will get the expected substring, or an empty string if nothing follows the first word.
A good point is that you can avoid changing the original string, even if mimicking strtok is trivial...
I am trying to tokenize a string at each newline.
rest = strdup(value);
while ((token = strtok_r(rest,"\n", &rest))) {
snprintf(new_value, MAX_BANNER_LEN + 1, "%s\n", token);
}
where 'value' is a string such as "This is an example\nHere is a newline".
But the above code does not tokenize 'value': the 'new_value' variable comes out unchanged, i.e. "This is an example\nHere is a newline".
Any suggestions to overcome this?
Thanks,
Poornima
Several things going on with your code:
strtok and strtok_r take the string to tokenize as first parameter. Subsequent tokenizations of the same string should pass NULL. (It is okay to tokenize the same string with different delimiters.)
The second parameter is a string of possible separators. In your case you should pass "\n". (strtok_r treats a run of separator characters as a single break. That means that tokenizing "a\n\n\nb" will produce two tokens.)
The third parameter to strtok_r is an internal parameter to the function. It will mark where the next tokenization should start, but you need not use it. Just define a char * and pass its address.
Especially, don't repurpose the source string variable as state. In your example, you will lose the handle to the strduped string, so that you cannot free it later, as you should.
It is not clear how you determine that your tokenization "doesn't work". You print each token to the same char buffer, overwriting the previous one. Do you want to keep only the part after the last newline? In that case, use strrchr(str, '\n'). If the result isn't NULL, your "tail" starts at the character after it. If it is NULL, the whole string is your tail.
Here's how tokenizing a string could work:
char *rest = strdup(str);
char *state;
char *token = strtok_r(rest, "\n", &state);
while (token) {
printf("'%s'\n", token);
token = strtok_r(NULL, "\n", &state);
}
free(rest);
I'm reading in a .csv file (delimited by commas) so I can analyze the data. Many of the fields are null, meaning a line might look like:
456,Delaware,14450,,,John,Smith
(where we don't have a phone number or email address for John Smith so these fields are null).
But when I try to separate these lines into tokens (so I can put them in a matrix to analyze the data), strtok doesn't return NULL or an empty string, instead it skips these fields and I wind up with mismatched columns.
In other words, where my desired result is:
a[0]=456
a[1]=Delaware
a[2]=14450
a[3]=NULL (or "", either is fine with me)
a[4]=NULL (or "")
a[5]=John
a[6]=Smith
Instead, the result I get is:
a[0]=456
a[1]=Delaware
a[2]=14450
a[3]=John
a[4]=Smith
Which is wrong. Any suggestions about how I can get the results I need will be greatly welcomed. Here is my code:
FILE* stream = fopen("filename.csv", "r");
i=0;
char* tmp;
char* field;
char line[1024];
while (fgets(line, 1024, stream))
{
j=0;
tmp = strdup(line);
field= strtok(tmp, ",");
while(field != NULL)
{
a[i][j] =field;
field = strtok(NULL, ",");
j++;
}
i++;
}
fclose(stream);
Quote from ISO/IEC 9899:TC3 7.21.5.8 The strtok function
3 The first call in the sequence searches the string pointed to by s1 for the first character
that is not contained in the current separator string pointed to by s2. If no such character
is found, then there are no tokens in the string pointed to by s1 and the strtok function
returns a null pointer. If such a character is found, it is the start of the first token.
And the relevant quote for you:
4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the
end of the string pointed to by s1, and subsequent searches for a token will return a null
pointer. If such a character is found, it is overwritten by a null character, which
terminates the current token. The strtok function saves a pointer to the following
character, from which the next search for a token will start.
So you can't detect empty fields between consecutive delimiters with strtok, as it isn't made for this.
It will simply skip them.
I am studying an implementation of strtok and have a question. On the line s[-1] = 0;, I don't understand how tok is limited to the first token, since we had previously assigned it everything contained in s.
char *strtok(char *s, const char *delim)
{
static char *last;
return strtok_r(s, delim, &last);
}
char *strtok_r(char *s, const char *delim, char **last)
{
char *spanp;
int c, sc;
char *tok;
if (s == NULL && (s = *last) == NULL)
return (NULL);
tok = s;
for (;;) {
c = *s++;
spanp = (char *)delim;
do {
if ((sc = *spanp++) == c) {
if (c == 0)
s = NULL;
else
s[-1] = 0;
*last = s;
return (tok);
}
} while (sc != 0);
}
}
tok was not previously assigned "everything contained in s". It was set to point to the same address as the address in s.
The s[-1] = 0; line is equivalent to *(s - 1) = '\0';, which sets the location just before where s is pointing to zero.
By setting that location to zero, returning the current value of tok will point to a string whose data spans from tok to s - 2 and is properly null-terminated at s - 1.
Also note that before tok is returned, *last is set to the current value of s, which is the starting scan position for the next token. strtok saves this value in a static variable so it can be remembered and automatically used for the next token.
This took much more space than I anticipated when I started, but I think it offers a useful explanation along with the others. (it became more of a mission really)
NOTE: In this pair, strtok_r is the reentrant function: the scan position is kept in the caller-supplied last pointer. strtok wraps strtok_r and stores that pointer in a static variable, so strtok itself is not reentrant. (I did not test reentrancy.)
The easiest way to understand this code (at least for me) is to understand what strtok and strtok_r do with the string they are operating on. Here strtok_r is where the work is done. strtok_r basically assigns a pointer to the string provided as an argument and then 'inch-worms' down the string, character-by-character, comparing each character to a delimiter character or null terminating character.
The key is to understand that the job of strtok_r is to chop the string up into separate tokens, which are returned on successive calls to the function. How does it work? The string is broken up into separate tokens by replacing each delimiter character found in the original string with a null-terminating character and returning a pointer to the beginning of the token (which will either be the start of the string on first call, or the next-character after the last delimiter on successive calls)
As with the string.h strtok function, the first call to strtok takes the original string as the first argument. For successive parsing of the same string NULL is used as the first argument. The original string is left littered with null-terminating characters after calls to strtok, so make a copy if you need it further. Below is an explanation of what goes on in strtok_r as you inch-worm down the string.
Consider for example the following string and strtok_r:
'this is a test'
The outer for loop stepping through string s
Ignoring the assignments and the NULL tests, the function assigns tok a pointer to the beginning of the string (tok = s). It then enters the for loop, where it steps through string s one character at a time. c is assigned the (int value of the) current character pointed to by s, and the pointer s is incremented to the next character (this is the for loop's increment of s). spanp is assigned a pointer to the delimiter array.
The inner do loop stepping through the delimiters 'delim'
The do loop is entered and, using the spanp pointer, goes through the delim array testing whether sc (the spanp character) equals the current for loop character c. If and only if c matches a delimiter (or the terminating null of delim), we reach the confusing if (c == 0) if-then-else test.
The if (c == 0) if-then-else test
This test is actually simple to understand when you think about it. We are crawling down string s, checking each character against the delim array. If we match one of the delimiters or hit the end, then what? We are about to return from the function, so what must we do?
Here we ask: did we reach the normal end of the string (c == 0)? If so, we set s = NULL; otherwise we matched a delimiter but are not at the end of the string.
Here is where the magic happens. We need to replace the delimiter character in the string with a null-terminating character (either 0 or '\0'). Why not set *s = 0 here? Answer: we can't; we incremented s when assigning c = *s++; at the beginning of the for loop, so s now points to the next character in the string rather than to the delimiter. So in order to replace the delimiter in string s with a null-terminating character, we must do s[-1] = 0;. This is where the string s gets chopped into a token. *last is assigned the current value of s, and tok (pointing to the original beginning of s) is returned by the function.
So, in the main program, you now have the return value of strtok_r: a pointer to the first character of the string s you passed in, which is now null-terminated at the first occurrence of a character from delim, giving you the token you asked for.
There are two ways to reach the statement return(tok);. One way is that at the point where tok = s; occurs, s contains none of the delimiter characters (contents of delim).
That means s is a single token. The for loop ends when c == 0, that is, at the
null byte at the end of s, and strtok_r returns tok (that is,
the entire string that was in s at the time of tok = s;), as it should.
The other way for that return statement to occur is when s contains some character
that is in delim. In that case, at some point *spanp == c will be true where
*spanp is not the terminating null of delim, and therefore c == 0 is false.
At this point, s points to the character after the one from which c was read,
and s - 1 points to the place where the delimiter was found.
The statement s[-1] = 0; overwrites the delimiter with a null character, so now
tok points to a string of characters that starts where tok = s; said to start,
and ends at the first delimiter that was found in that string. In other words,
tok now points to the first token in that string, no more and no less,
and it is correctly returned by the function.
The code is not very well self-documenting in my opinion, so it is understandable
that it is confusing.
In my code below I use strtok to parse a line of code from a file that looks like:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
As the file can have lines of different lengths, I don't use fscanf. The code below works except for one small glitch: it loops around one time too many and reads in a long empty string " " before looping again, recognizing the null token "" and exiting the while loop. I don't know why this happens.
Any help would be greatly appreciated.
fgets(line, sizeof(line), some_file);
while (line != OPC_NIL) {
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
input = op_prg_list_create();
while (token != NULL) {
test_token = strdup(token);
if (op_prg_list_size(input) == 0)
op_prg_list_insert(input,test_token,OPC_LISTPOS_HEAD);
else
op_prg_list_insert(input,test_token,OPC_LISTPOS_TAIL);
token = strtok (NULL, "\t");
}
fgets(line, sizeof(line), some_file);
}
You must use the correct list of delimiters. Your code contradicts its comment:
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
If you want to separate tokens by commas, use "," instead of "\t". In addition, you certainly don't want the tokens to contain the newline character \n (which appears at the end of each line read from file by fgets). So add the newline character to the list of delimiters:
token = strtok(line, ",\n"); //Pull the string apart into tokens using the commas
...
token = strtok (NULL, ",\n");
You might want to add the space character to the list of delimiters too (is 863.19 1001.05 a single token or two tokens? Do you want to remove spaces at end of line?).
Your use of sizeof(line) tells me that line is a fixed-size array. In that case, (line != OPC_NIL) will never be false, because an array's address is never null. However, fgets() will return NULL when the end of file is reached or an error occurs. Your outer while loop should be rewritten as:
while(fgets(line, sizeof(line), some_file)) {
...
}
Your input file likely also has a newline character at the end of the last input line resulting in a single blank line at the end. This is the difference between this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19↵
<blank line>
and this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
The first thing you should do in the while loop is check that the string is actually in the format you expect. If it's not then break:
while(fgets(line, sizeof(line), some_file)) {
if(line[0] == '\n') // blank line (fgets keeps the newline); or other checks such as "contains tab characters"
break;
...
}