I am studying the implementation of strtok and have a question. On the line s[-1] = 0, I don't understand how tok is limited to the first token, since we had previously assigned it everything contained in s.
char *strtok(char *s, const char *delim)
{
    static char *last;
    return strtok_r(s, delim, &last);
}
char *strtok_r(char *s, const char *delim, char **last)
{
    char *spanp;
    int c, sc;
    char *tok;
    if (s == NULL && (s = *last) == NULL)
        return (NULL);
    tok = s;
    for (;;) {
        c = *s++;
        spanp = (char *)delim;
        do {
            if ((sc = *spanp++) == c) {
                if (c == 0)
                    s = NULL;
                else
                    s[-1] = 0;
                *last = s;
                return (tok);
            }
        } while (sc != 0);
    }
}
tok was not previously assigned "everything contained in s". It was set to point to the same address as the address in s.
The s[-1] = 0; line is equivalent to *(s - 1) = '\0';, which sets the location just before where s is pointing to zero.
By setting that location to zero, returning the current value of tok will point to a string whose data spans from tok to s - 2 and is properly null-terminated at s - 1.
Also note that before tok is returned, *last is set to the current value of s, which is the starting scan position for the next token. strtok saves this value in a static variable so it can be remembered and automatically used for the next token.
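The effect of reading with c = *s++ and then writing through s[-1] can be seen in isolation. The following is a minimal sketch, not part of the original implementation; cut_at is a hypothetical helper name, and it handles a single target character instead of a delimiter set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scans s with the same c = *s++ idiom and cuts the
   string at the first occurrence of target. Returns a pointer to the text
   after the cut, or NULL if target does not occur. */
char *cut_at(char *s, char target)
{
    assert(s != NULL && target != '\0');
    for (;;) {
        char c = *s++;          /* s now points one past the char just read */
        if (c == '\0')
            return NULL;        /* reached the end without finding target */
        if (c == target) {
            s[-1] = '\0';       /* same as *(s - 1) = '\0' */
            return s;           /* start of the remaining text */
        }
    }
}
```

A pointer taken to the start of the buffer before the call now reads as the first token only, because it stops at the newly written null byte. Nothing was copied.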
This took much more space than I anticipated when I started, but I think it offers a useful explanation along with the others. (it became more of a mission really)
NOTE: In this pair of functions, strtok_r is the reentrant one: the caller supplies the save pointer last, so several strings can be tokenized independently. strtok wraps it to provide the usual strtok from string.h by keeping that pointer (the position where the next scan starts) in a static variable, which is why strtok itself is not reentrant.
The easiest way to understand this code (at least for me) is to understand what strtok and strtok_r do with the string they are operating on. Here strtok_r is where the work is done. strtok_r basically assigns a pointer to the string provided as an argument and then 'inch-worms' down the string, character-by-character, comparing each character to a delimiter character or null terminating character.
The key is to understand that the job of strtok_r is to chop the string up into separate tokens, which are returned on successive calls to the function. How does it work? The string is broken up into separate tokens by replacing each delimiter character found in the original string with a null-terminating character and returning a pointer to the beginning of the token (which will be either the start of the string on the first call, or the character after the last delimiter on successive calls).
As with the string.h strtok function, the first call to strtok takes the original string as the first argument. For successive parsing of the same string NULL is used as the first argument. The original string is left littered with null-terminating characters after calls to strtok, so make a copy if you need it further. Below is an explanation of what goes on in strtok_r as you inch-worm down the string.
Consider for example the following string and strtok_r:
'this is a test'
The outer for loop stepping through string s
Ignoring the assignments and the NULL tests, the function assigns tok a pointer to the beginning of the string (tok = s). It then enters the for loop, where it steps through string s one character at a time. c is assigned the (int value of the) current character pointed to by s, and s is incremented to point to the next character (c = *s++ serves as the for loop's increment of s). spanp is assigned a pointer to the start of the delimiter array.
The inner do loop stepping through the delimiters 'delim'
The do loop is entered and, using the spanp pointer, goes through the delim array testing whether sc (the character spanp points to) equals the current for-loop character c. Because the do loop also compares c against delim's own terminating null, the match succeeds either when c is a delimiter or when c is the end of the string. If and only if there is such a match, we reach the confusing if (c == 0) if-then-else test.
The if (c == 0) if-then-else test
This test is actually simple to understand when you think about it. We are crawling down string s, checking each character against the delim array. If we match one of the delimiters or hit the end, then what? We are about to return from the function, so what must we do?
Here we ask: did we reach the normal end of the string (c == 0)? If so, we set s = NULL. Otherwise we matched a delimiter but are not at the end of the string.
Here is where the magic happens. We need to replace the delimiter character in the string with a null-terminating character (either 0 or '\0'). Why not simply write *s = 0 here? Answer: we can't. We incremented s when assigning c = *s++; at the beginning of the for loop, so s is now pointing to the next character in the string rather than to the delimiter. So in order to replace the delimiter in string s with a null-terminating character, we must do s[-1] = 0;. This is where the string s gets chopped into a token. *last is assigned the current value of the pointer s, and tok (pointing to the original beginning of s) is returned by the function.
So, in the main program, you now have the return value of strtok_r: a pointer to the first character of the string s you passed to strtok_r, which is now null-terminated at the first occurrence of a character from delim. That gives you the token you asked for from the original string s.
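The "littering" of the original buffer described above can be observed directly. This is a small sketch using the standard strtok from string.h; count_tokens is a hypothetical helper name introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tokenizes buf in place and returns how many tokens
   were found. buf must be writable (not a string literal). */
size_t count_tokens(char *buf, const char *delim)
{
    assert(buf != NULL && delim != NULL);
    size_t n = 0;
    /* The first call passes the string, every later call passes NULL. */
    for (char *tok = strtok(buf, delim); tok != NULL; tok = strtok(NULL, delim))
        n++;
    return n;
}
```

For a buffer holding "this is a test" and the delimiter " ", every space ends up replaced by a null byte, so the buffer afterwards reads as just "this" when treated as a C string. Copy the string first if the original is still needed.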
There are two ways to reach the statement return(tok);. One way is that at the point where tok = s; occurs, s contains none of the delimiter characters (contents of delim).
That means s is a single token. The for loop ends when c == 0, that is, at the
null byte at the end of s, and strtok_r returns tok (that is,
the entire string that was in s at the time of tok = s;), as it should.
The other way for that return statement to occur is when s contains some character
that is in delim. In that case, at some point *spanp == c will be true where
*spanp is not the terminating null of delim, and therefore c == 0 is false.
At this point, s points to the character after the one from which c was read,
and s - 1 points to the place where the delimiter was found.
The statement s[-1] = 0; overwrites the delimiter with a null character, so now
tok points to a string of characters that starts where tok = s; said to start,
and ends at the first delimiter that was found in that string. In other words,
tok now points to the first token in that string, no more and no less,
and it is correctly returned by the function.
The code is not very well self-documenting in my opinion, so it is understandable
that it is confusing.
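For comparison, here is a more self-documenting sketch that is intended to behave like the strtok_r in the question, with the two return paths made explicit. The name readable_strtok_r is hypothetical, and strpbrk is swapped in for the hand-written inner loop. Like the code in the question (and unlike the standard strtok), it does not skip leading delimiters, so it can return empty tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rewrite of the question's strtok_r with named steps. */
char *readable_strtok_r(char *s, const char *delim, char **last)
{
    assert(delim != NULL && last != NULL);
    if (s == NULL)
        s = *last;                  /* continue where the previous call stopped */
    if (s == NULL)
        return NULL;                /* nothing left to scan */

    char *tok = s;                  /* the token starts here */
    char *hit = strpbrk(s, delim);  /* first delimiter in s, or NULL */
    if (hit == NULL) {
        *last = NULL;               /* path 1: the rest of s is one final token */
    } else {
        *hit = '\0';                /* path 2: cut the token at the delimiter */
        *last = hit + 1;            /* the next scan starts after the cut */
    }
    return tok;
}
```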
I want to use strtok and then return the string after the null terminator that strtok has placed.
char *foo(char *bar)
{
strtok(bar, " ");
return after_strtok_null(bar);
}
/*
examples:
foo("hello world") = "world"
foo("remove only the first") = "only the first"
*/
My code is not for skipping the first word (I know a simple while loop will do that), but I do want to use strtok once and then return the part that was not tokenized.
I will provide details of what I am trying to do at the end of the question, although I don't think it's really necessary.
One solution that came to mind was to simply skip all the null terminators until I reach a non-null character:
char *foo(char *bar)
{
bar = strtok(bar, " ");
while(!(*(bar++)));
return bar;
}
This works fine for the examples shown above, but when it comes to using it on single words, I may mistake the string's own null terminator for strtok's null terminator, and then I may access unallocated memory.
For example, if I try foo("demo" /* '\0' */), the result of strtok will be "demo" /* '\0' */,
and then, if I run the while loop, I will access the memory after the string demo. Another solution I tried was to use strlen, but it has the exact same problem.
I am trying to create a function that takes a sentence. Some of the sentences have their first word terminated with a colon, although not necessarily. The function needs to take the first word, if it is terminated with a colon, and insert it (without the colon) into some global table. It should then return the sentence without that colon-terminated first word and without the spaces that follow it. If the sentence does not start with a colon-terminated word, it should just return the sentence without its leading spaces.
You could use str[c]spn instead:
char *foo(char *bar) {
    size_t pos = strcspn(bar, " ");   // length of the first word
    pos = strspn((bar += pos), " ");  // number of spaces that follow it
    // *bar = '\0'; // uncomment to mimic strtok
    return bar + pos;
}
You will get the expected substring, or an empty string if there is only one word.
A good point is that you can avoid changing the original string, even if mimicking strtok is trivial...
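The same two functions can also cover the colon-terminated first word described at the end of the question. The following is only a sketch under assumptions: split_label and its parameters are hypothetical names, the label is copied into a caller-supplied buffer instead of a global table, and only spaces count as separators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if the first word of line ends with ':', copy it
   (without the colon) into label and return the text after the following
   spaces. Otherwise set label to "" and return line without its leading
   spaces. The input string is not modified. */
const char *split_label(const char *line, char *label, size_t cap)
{
    assert(line != NULL && label != NULL && cap > 0);
    label[0] = '\0';
    line += strspn(line, " ");           /* skip leading spaces */
    size_t word = strcspn(line, " ");    /* length of the first word */
    if (word > 0 && line[word - 1] == ':') {
        size_t n = word - 1;             /* drop the colon */
        if (n >= cap)
            n = cap - 1;                 /* truncate to fit the buffer */
        memcpy(label, line, n);
        label[n] = '\0';
        line += word;                    /* step over the label */
        line += strspn(line, " ");       /* and the spaces after it */
    }
    return line;
}
```

Because nothing is written into the input, there is no strtok terminator to confuse with the string's own terminator, which was the problem with single words.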
I am trying to tokenize a string at each newline.
rest = strdup(value);
while ((token = strtok_r(rest,"\n", &rest))) {
snprintf(new_value, MAX_BANNER_LEN + 1, "%s\n", token);
}
where 'value' is a string, say "This is an example\nHere is a newline".
But the above code does not tokenize 'value', and 'new_value' comes out unchanged, i.e. "This is an example\nHere is a newline".
Any suggestions to overcome this?
Thanks,
Poornima
Several things going on with your code:
strtok and strtok_r take the string to tokenize as first parameter. Subsequent tokenizations of the same string should pass NULL. (It is okay to tokenize the same string with different delimiters.)
The second parameter is a string of possible separators. In your case you should pass "\n". (strtok_r will treat stretches of the characters as single break. That means that tokenizing "a\n\n\nb" will produce two tokens.)
The third parameter to strtok_r is an internal parameter to the function. It will mark where the next tokenization should start, but you need not use it. Just define a char * and pass its address.
Especially, don't repurpose the source string variable as state. In your example, you will lose the handle to the strduped string, so that you cannot free it later, as you should.
It is not clear how you determine that your tokenization "doesn't work". You print each token to the same char buffer, so every iteration overwrites the previous one. Do you want to keep only the part after the last newline? In that case, use strrchr(str, '\n'). If the result isn't NULL, your "tail" starts one character after it. If it is NULL, the whole string is your tail.
Here's how tokenizing a string could work:
char *rest = strdup(str);
char *state;
char *token = strtok_r(rest, "\n", &state);
while (token) {
printf("'%s'\n", token);
token = strtok_r(NULL, "\n", &state);
}
free(rest);
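If only the part after the last newline is wanted, the strrchr approach mentioned above needs no tokenizing at all. A minimal sketch, with tail_after_last_newline as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the text after the last '\n' in s,
   or s itself when it contains no newline. s is not modified. */
const char *tail_after_last_newline(const char *s)
{
    assert(s != NULL);
    const char *nl = strrchr(s, '\n');   /* last newline, or NULL */
    return nl != NULL ? nl + 1 : s;
}
```

Note that a string ending in a newline yields an empty tail with this sketch.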
I am trying to understand what the following code does
void chomp (char* string, char delim) {
size_t len = strlen (string);
if (len == 0) return;
char* nlpos = string + len - 1;
if (*nlpos == delim) *nlpos = '\0';
}
What is a delimiter? Does the fourth line basically save the last character in the string?
If the last character of the string matches delim, then that character's position in the string (*nlpos) is assigned a zero byte, which effectively terminates the C string one position closer to the beginning of the string.
I think that the term chomp became popular with Perl that often trimmed off the terminating newline when doing line by line processing.
The delimiter is whatever character the caller passes as delim, typically the newline character.
The string length is computed with strlen and stored in len (size_t is the unsigned integer type that ISO C uses to represent sizes).
The length is then checked for zero, and the function returns to the caller if the string is empty.
This code removes the delimiter from the end of a string (which can be a buffer) by putting a null character (\0) in its place.
The fourth line stores a pointer to the last character of the string; the fifth line replaces that character with the null character if it equals delim.
A delimiter is a sequence of one or more characters used to mark a boundary in plain text or another region of data. Here it is expected at the end of the string.
The null character is used in C-style character strings to indicate where the string ends.
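A commonly used alternative for the newline case relies on strcspn. This is a sketch of a different technique from the chomp in the question, not a drop-in replacement: chomp_newline is a hypothetical name, and it cuts at the first newline anywhere in the string, whereas chomp only looks at the last character.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncates line at its first '\n', if any.
   strcspn returns the index of the first '\n', or the string length
   when there is none, so the write is always inside the string. */
void chomp_newline(char *line)
{
    assert(line != NULL);
    line[strcspn(line, "\n")] = '\0';
}
```

When there is no newline, the assignment simply rewrites the existing terminator, so no length check is needed.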
I have an array of characters that I fill with information using gets().
char inname[30];
gets(inname);
How can I add another character to this array without knowing the length of the string in C? (I mean the part that holds actual letters, not the empty memory after it or something.)
Note: my buffer is long enough for what I want to ask the user (a filename; probably not many people have names longer than 29 characters).
Note that gets is prone to buffer overflow and should be avoided.
Reading a line of input:
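char inname[30] = "";
size_t len = 0;
// fgets reads at most sizeof(inname) - 1 characters and keeps the newline
if (fgets(inname, sizeof(inname), stdin) != NULL) {
    len = strlen(inname);
    // Remove trailing newline
    if (len > 0 && inname[len-1] == '\n') {
        len--;
        inname[len] = '\0';
    }
}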
Appending to the string:
char *string_to_append = ".";
if (len + strlen(string_to_append) + 1 <= sizeof(inname)) {
// There is enough room to append the string
strcat(inname, string_to_append);
}
Optional way to append a single character to the string:
if (len + 1 < sizeof(inname)) {
// There is room to add another character
inname[len++] = '.'; // Add a '.' character to the string.
inname[len] = '\0'; // Don't forget to nul-terminate
}
As you have asked in comment, to determine the string length you can directly use
strlen(inname);
OR
you can loop through string in a for loop until \0 is found.
Now, after getting the length of the previous string, you can append a new string with
strcat(&inname[prevLength],"NEW STRING");
EDIT:
To find the null character you can write a for loop like this
int i;
for(i = 0; inname[i] != 0; i++)
{
//do nothing
}
Now you can use i directly to store any character at the end of the string, like:
inname[i] = your_char;
After this, increment i and store the null character (0) at that position.
P.S.
Every string in C ends with a null character. The ASCII null character '\0' is equivalent to 0 in decimal.
You know that the final character of a C string is '\0', e.g. the array:
char foo[10]={"Hello"};
stores these characters at the start of the array (the remaining elements are also '\0'):
['H'] ['e'] ['l'] ['l'] ['o'] ['\0']
Thus you can iterate over the array until you find the '\0' character, replace it with the character you want, and store a new '\0' right after it.
Alternatively you can use the strcat function from the string.h library.
The short answer is that you can't.
In C you must know the length of the string to append chars to it. The same applies in other languages; it just happens behind the scenes, since internally the same work must be done.
C strings are defined as sequences of bytes terminated by a special byte, the nul character, which has ASCII code 0 and is written '\0' in C.
You must find this terminator, put the new character in its place, and move the terminator one position further. To illustrate this, suppose you have
char hello[10] = "Hello";
then you want to append a '!' after the 'o' so you can just do this
size_t length;
length = strlen(hello);
/* move the '\0' one position after its current position */
hello[length + 1] = hello[length];
hello[length] = '!';
now the string is "Hello!".
Of course, you should take care that hello is large enough to hold one extra character. That is also not automatic in C, which is one of the things I love about working with it because it gives you maximum flexibility.
You can of course use some available functions to achieve this without worrying about moving the '\0' for example, with
strcat(hello, "!");
you will achieve the same.
Both strlen() and strcat() are defined in string.h header.
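The capacity check and the append can be combined in one small function. This is a sketch only; append_char and its parameter names are hypothetical, and the caller is assumed to pass the real size of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends ch to the string in buf, where cap is the
   total size of the array. Returns 1 on success, 0 if there is no room. */
int append_char(char *buf, size_t cap, char ch)
{
    assert(buf != NULL);
    size_t len = strlen(buf);     /* index of the current '\0' */
    if (len + 2 > cap)
        return 0;                 /* need room for ch plus the new '\0' */
    buf[len] = ch;                /* overwrite the old terminator */
    buf[len + 1] = '\0';          /* and terminate one position later */
    return 1;
}
```

With char hello[10] = "Hello"; a call such as append_char(hello, sizeof hello, '!') leaves "Hello!" in the array, while the same call on a full array leaves it untouched and returns 0.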
I'm reading in a .csv file (delimited by commas) so I can analyze the data. Many of the fields are null, meaning a line might look like:
456,Delaware,14450,,,John,Smith
(where we don't have a phone number or email address for John Smith so these fields are null).
But when I try to separate these lines into tokens (so I can put them in a matrix to analyze the data), strtok doesn't return NULL or an empty string, instead it skips these fields and I wind up with mismatched columns.
In other words, where my desired result is:
a[0]=456
a[1]=Delaware
a[2]=14450
a[3]=NULL (or "", either is fine with me)
a[4]=NULL (or "")
a[5]=John
a[6]=Smith
Instead, the result I get is:
a[0]=456
a[1]=Delaware
a[2]=14450
a[3]=John
a[4]=Smith
Which is wrong. Any suggestions about how I can get the results I need will be greatly welcomed. Here is my code:
FILE* stream = fopen("filename.csv", "r");
i=0;
char* tmp;
char* field;
char line[1024];
while (fgets(line, 1024, stream))
{
    j=0;
    tmp = strdup(line);
    field= strtok(tmp, ",");
    while(field != NULL)
    {
        a[i][j] =field;
        field = strtok(NULL, ",");
        j++;
    }
    i++;
}
fclose(stream);
Quote from ISO/IEC 9899:TC3 7.21.5.8 The strtok function
3 The first call in the sequence searches the string pointed to by s1 for the first character
that is not contained in the current separator string pointed to by s2. If no such character
is found, then there are no tokens in the string pointed to by s1 and the strtok function
returns a null pointer. If such a character is found, it is the start of the first token.
And the relevant quote for you:
4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the
end of the string pointed to by s1, and subsequent searches for a token will return a null
pointer. If such a character is found, it is overwritten by a null character, which
terminates the current token. The strtok function saves a pointer to the following
character, from which the next search for a token will start.
So you can't detect empty fields between consecutive delimiters with strtok, as it isn't made for this.
It will just skip them.
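One way to keep the empty fields is to split on every single delimiter by hand. The following is a minimal sketch, not a full CSV parser: split_fields is a hypothetical name, quoted fields are not handled, and the line is modified in place just as with strtok.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line at every occurrence of delim and stores
   up to max field pointers in out. Empty fields are kept as "". Returns the
   number of fields stored. */
size_t split_fields(char *line, char delim, char **out, size_t max)
{
    assert(line != NULL && out != NULL && delim != '\0');
    size_t n = 0;
    char *start = line;
    while (n < max) {
        out[n++] = start;                 /* a field begins here, even if empty */
        char *p = strchr(start, delim);   /* next delimiter, or NULL */
        if (p == NULL)
            break;                        /* this was the last field */
        *p = '\0';                        /* terminate the current field */
        start = p + 1;                    /* next field starts right after */
    }
    return n;
}
```

For the line 456,Delaware,14450,,,John,Smith this yields seven fields, with the fourth and fifth being empty strings. A line read with fgets still carries its trailing newline in the last field, so that would need to be removed separately.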