Tokenizing a String - C

I'm trying to tokenize a string in C on \r\n delimiters and print each token returned by successive calls to strtok(). Inside the while loop, some processing is done on each token.
When I include the processing code, the only output I get is the first token; when I take the processing code out, I get every token. This doesn't make sense to me, and I'm wondering what I could be doing wrong.
Here's the code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
int main()
{
int c = 0, c2 = 0;
char *tk, *tk2, *tk3, *tk4;
char buf[1024], buf2[1024], buf3[1024];
char host[1024], path[1024], file[1024];
strcpy(buf, "GET /~yourloginid/index.htm HTTP/1.1\r\nHost: remote.cba.csuohio.edu\r\n\r\n");
tk = strtok(buf, "\r\n");
while(tk != NULL)
{
printf("%s\n", tk);
/*
if(c == 0)
{
strcpy(buf2, tk);
tk2 = strtok(buf2, "/");
while(tk2 != NULL)
{
if(c2 == 1)
strcpy(path, tk2);
else if(c2 == 2)
{
tk3 = strtok(tk2, " ");
strcpy(file, tk3);
}
++c2;
tk2 = strtok(NULL, "/");
}
}
else if(c == 1)
{
tk3 = strtok(tk, " ");
while(tk3 != NULL)
{
if(c2 == 1)
{
printf("%s\n", tk3);
// strcpy(host, tk2);
// printf("%s\n", host);
}
++c2;
tk3 = strtok(NULL, " ");
}
}
*/
++c;
tk = strtok(NULL, "\r\n");
}
return 0;
}
Without those if else statements, I receive the following output...
GET /~yourloginid/index.htm HTTP/1.1
Host: remote.cba.csuohio.edu
...however, with those if else statements, I receive this...
GET /~yourloginid/index.htm HTTP/1.1
I'm not sure why I can't see the other token, because the program ends, which means that the loop must occur until the end of the entire string, right?

strtok stores "the point where the last token was found" :
"The point where the last token was found is kept internally by the function to be used on the next call (particular library implementations are not required to avoid data races)."
-- reference
That's why you can call it with NULL the second time.
So calling it again with a different pointer inside your loop makes you lose the state of the initial call (meaning tk = strtok(NULL, "\r\n") will return NULL at the end of the first iteration, because it continues from the state left by the inner loops).
So the solution is probably to change the last line of the while from:
tk = strtok(NULL, "\r\n");
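to something like the following. Check the bounds first: the new start must not go past the end of the original string (buf plus its length before tokenizing), and the inner processing must not have shortened tk with its own strtok calls: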
tk = strtok(tk + strlen(tk) + 1, "\r\n");
Or use strtok_r, which stores the state externally (like in this answer).
// first call
char *saveptr1;
tk = strtok_r(buf, "\r\n", &saveptr1);
while(tk != NULL) {
//...
tk = strtok_r(NULL, "\r\n", &saveptr1);
}
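To make the difference concrete, here is a minimal sketch (the function names are hypothetical, and strtok_r assumes a POSIX system). Both functions count the \r\n-separated lines of a request while splitting each line on spaces; only the strtok_r version keeps the outer position intact.

```c
#define _POSIX_C_SOURCE 200809L  /* for strtok_r */
#include <string.h>

/* Nested plain strtok: the inner calls overwrite the single hidden state,
   so the outer scan stops after the first line. */
int count_lines_plain(char *text)
{
    int lines = 0;
    char *line = strtok(text, "\r\n");
    while (line != NULL) {
        ++lines;
        char *word = strtok(line, " ");     /* replaces the hidden state */
        while (word != NULL)
            word = strtok(NULL, " ");
        line = strtok(NULL, "\r\n");        /* resumes the inner scan: NULL */
    }
    return lines;
}

/* Nested strtok_r: each level keeps its own position in its own pointer. */
int count_lines_reentrant(char *text)
{
    int lines = 0;
    char *outer, *inner;
    char *line = strtok_r(text, "\r\n", &outer);
    while (line != NULL) {
        ++lines;
        char *word = strtok_r(line, " ", &inner);
        while (word != NULL)
            word = strtok_r(NULL, " ", &inner);
        line = strtok_r(NULL, "\r\n", &outer);
    }
    return lines;
}
```

Called on two separate copies of "GET /a HTTP/1.1\r\nHost: example\r\n\r\n", the first function reports one line and the second reports two.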

strtok stores the state of the last token in a global variable, so that the next call to strtok knows where to continue. So when you call strtok(buf2, "/"); in the if, it clobbers the saved state of the outer tokenization.
The fix is to use strtok_r instead of strtok. This function takes an extra argument that is used to store the state:
char *save1, *save2, *save3;
tk = strtok_r(buf, "\r\n", &save1);
while(tk != NULL) {
printf("%s\n", tk);
if(c == 0) {
strcpy(buf2, tk);
tk2 = strtok_r(buf2, "/", &save2);
while(tk2 != NULL) {
if(c2 == 1)
strcpy(path, tk2);
else if(c2 == 2) {
tk3 = strtok_r(tk2, " ", &save3);
strcpy(file, tk3); }
++c2;
tk2 = strtok_r(NULL, "/", &save2); }
} else if(c == 1) {
tk3 = strtok_r(tk, " ", &save2);
while(tk3 != NULL) {
if(c2 == 1) {
printf("%s\n", tk3);
// strcpy(host, tk2);
// printf("%s\n", host);
}
++c2;
tk3 = strtok_r(NULL, " ", &save2); } }
++c;
tk = strtok_r(NULL, "\r\n", &save1); }
return 0;
}

One thing that stands out to me is that unless you are doing something else with the string buffer, there is no need to copy each token to its own buffer. The strtok function returns a pointer to the beginning of the token, so you can use the token in place. The following code may work better and be easier to understand:
#define MAX_PTR 4
char buff[] = "GET /~yourloginid/index.htm HTTP/1.1\r\nHost: remote.cba.csuohio.edu\r\n\r\n";
char *ptr[MAX_PTR];
int i;
for (i = 0; i < MAX_PTR; i++)
{
if (i == 0) ptr[i] = strtok(buff, "\r\n");
else ptr[i] = strtok(NULL, "\r\n");
if (ptr[i] != NULL) printf("%s\n", ptr[i]);
}
The way that I defined the buffer is something that I call a pre-loaded buffer. You can use an array that is set equal to a string to initialize the array. The compiler will size it for you without you needing to do anything else. Now inside the for loop, the if statement determines which form of strtok is used. So if i == 0, then we need to initialize strtok. Otherwise, we use the second form for all subsequent tokens. Then the printf just prints the different tokens. Remember, strtok returns a pointer to a spot inside the buffer.
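As a rough sketch of the same idea wrapped in a function (split_lines is a hypothetical name, not something from the question), the token pointers can be collected without copying anything. They refer to bytes inside the caller's buffer, so they are only valid while that buffer is.

```c
#include <stddef.h>
#include <string.h>

/* Splits buf in place on CR/LF and stores up to max token pointers in out.
   Returns the number of tokens stored. Every pointer refers to bytes
   inside buf, so nothing here needs to be freed. */
size_t split_lines(char *buf, char **out, size_t max)
{
    size_t n = 0;
    char *tok = strtok(buf, "\r\n");        /* first call: pass the buffer */
    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = strtok(NULL, "\r\n");         /* later calls: pass NULL */
    }
    return n;
}
```

With the request buffer above and room for four pointers, this stores two tokens, and the first one is the start of the buffer itself.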
If you really are doing something else with the data and you really do need the buffer for other things, then the following code will work as well. This uses malloc to allocate blocks of memory from the heap.
#define MAX_PTR 4
char buff[] = "GET /~yourloginid/index.htm HTTP/1.1\r\nHost: remote.cba.csuohio.edu\r\n\r\n";
char *ptr[MAX_PTR];
char *bptr; /* buffer pointer */
int i;
for (i = 0; i < MAX_PTR; i++)
{
if (i == 0) bptr = strtok(buff, "\r\n");
else bptr = strtok(NULL, "\r\n");
if (bptr != NULL)
{
ptr[i] = malloc(strlen(bptr) + 2); /* token length plus room for the terminator */
if (ptr[i] == NULL)
{
/* Malloc error check failed, exit program */
printf("Error: Memory Allocation Failed. i=%d\n", i);
exit(1);
}
strncpy(ptr[i], bptr, strlen(bptr) + 1);
ptr[i][strlen(bptr) + 1] = '\0';
printf("%s\n", ptr[i]);
}
else ptr[i] = NULL;
}
This code does pretty much the same thing, except that we are copying the token strings into buffers. Note that we use an array of char pointers to do this. The malloc call allocates memory, and then we check whether it failed: if malloc returns NULL, we exit the program. The strncpy function should be used instead of strcpy. Strcpy does not allow for checking the size of the target buffer, so a malicious user can execute a buffer overflow attack on your code. The malloc was given strlen(bptr) + 2. This is to guarantee that the size of the buffer is big enough to handle the size of the token. The strlen(bptr) + 1 expressions are to make sure that the copied data doesn't overrun the buffer. As an added precaution, the last byte in the buffer is set to 0x00. Then we print the string. Note the if (bptr != NULL) check: the main block of code is executed only if strtok returns a pointer to a valid string; otherwise we set the corresponding pointer entry in the array to NULL.
Don't forget to free() the pointers in the array when you are done with them.
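The copy-then-free pattern can be sketched as one small helper. The name copy_token is an assumption for illustration, and memcpy with an explicit length is swapped in here for the strncpy call above, since the length is already known.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of tok, or NULL if the allocation fails.
   The caller owns the result and must free() it. */
char *copy_token(const char *tok)
{
    size_t len = strlen(tok);
    char *copy = malloc(len + 1);   /* + 1 for the terminating '\0' */
    if (copy == NULL)
        return NULL;
    memcpy(copy, tok, len + 1);     /* copies the terminator as well */
    return copy;
}
```

Each pointer returned by this helper needs exactly one matching free(), while the pointers returned by strtok itself must not be freed.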
In your code, you are placing things in named buffers, which can be done, but it's not really good practice because then if you try to use the code somewhere else, you have to make extensive modifications to it.

Related

What is the proper way to use strtok()?

Just to clarify, I'm a complete novice in C programming.
I have a tokenize function and it isn't behaving like what I expect it to. I'm trying to read from the FIFO or named pipe that is passed by the client side, and this is the server side. The client side reads a file and pass it to the FIFO. The problem is that tokenize doesn't return a format where execvp can process it, as running gdb tells me that it failed at calling the execute function in main(). (append function appends a char into the string)
One bug is that tokens is neither initialized nor allocated any memory.
Here is an example on how to initialize and allocate memory for tokens:
char **tokenize(char *line){
line = append(line,'\0');
int i = 0, tlen = 0;
char **tokens = NULL, *line2, *token, *delimiter;
delimiter = " \t";
token = strtok(line,delimiter);
while (token != NULL) {
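if (i + 2 > tlen) {
// Allocate more space, always leaving room for the terminating NULL
tlen += 10;
tokens = realloc(tokens, tlen * sizeof *tokens);
if (tokens == NULL) {
exit(1);
}
}
tokens[i] = token;
token = strtok(NULL, delimiter);
i += 1;
}
if (tokens == NULL) {
// No tokens at all: still return a valid, empty, NULL-terminated array
tokens = malloc(sizeof *tokens);
if (tokens == NULL) {
exit(1);
}
}
tokens[i] = NULL;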
return tokens;
}
This code will allocate memory for 10 tokens at a time. If the memory allocation fails, it will end the program with a non-zero return value to indicate failure.

_platform_memmove$VARIANT$Unknown () from /usr/lib/system/libsystem_platform.dylib changing content of character pointer

I am trying to write a program that accepts a user string and then reverses the order of the words in the string and prints it. My code works for most tries, however, it seg faults on certain occasions, for the same input.
On stepping through I found that the content of character pointers words[0] and words[1] are getting changed to garbage values/Null.
I set a watchpoint on the words[1] and words[0] character pointers that are getting corrupted (incorrect address), and can see that the content of these pointers changes at '_platform_memmove$VARIANT$Unknown () from /usr/lib/system/libsystem_platform.dylib'. I can't figure out how this gets invoked and what's causing the content of the pointers to be overwritten.
I have posted my code below and would like any assistance in figuring out where I am going wrong. I am sorry about the indentation issues.
char* reverseWords(char *s) {
char** words = NULL;
int word_count = 0;
/*Create an array of all the words that appear in the string*/
const char *delim = " ";
char *token;
token = strtok(s, delim);
while(token != NULL){
word_count++;
words = realloc(words, word_count * sizeof(char*));
if(words == NULL){
printf("malloc failed\n");
exit(0);
}
words[word_count - 1] = strdup(token);
token = strtok(NULL, delim);
}
/*Traverse the list backwards and check the words*/
int count = word_count;
char *return_string = malloc(strlen(s) + 1);
if(return_string == NULL){
printf("malloc failed\n");
exit(0);
}
int offset = 0;
while(count > 0){
memcpy((char*)return_string + offset, words[count - 1], strlen(words[count - 1]));
free(words[count - 1]);
offset += strlen(words[count - 1]);
if(count != 1){
return_string[offset] = ' ';
offset++;
}
else {
return_string[offset] = '\0';
}
count--;
}
printf("%s\n",return_string);
free(words);
return return_string;
}
int main(){
char *string = malloc(1000);
if(string == NULL){
printf("malloc failed\n");
exit(0);
}
fgets(string, 1000, stdin);
string[strlen(string)] = '\0';
reverseWords(string);
return 0;
}
The problem is that the line
char *return_string = malloc(strlen(s) + 1);
doesn't allocate nearly enough memory to hold the output. For example, if the input string is "Hello world", you would expect strlen(s) to be 11. However, strlen(s) will actually return 5.
Why? Because strtok modifies the input line. Every time you call strtok, it finds the first delimiter and replaces it with a NUL character. So after the first while loop, the input string looks like this
Hello\0world\0
and calling strlen on that string will return 5.
So return_string is too small, and one or more memcpy calls will write past the end of the allocation, resulting in undefined behavior, e.g. a segmentation fault. The reason for the error message about memmove: the memcpy function internally invokes memmove as needed.
As @WhozCraig pointed out in the comments, you also need to make sure that you don't access memory after a call to free, so you need to swap these two lines
free(words[count - 1]);
offset += strlen(words[count - 1]);
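Putting both fixes together, a minimal sketch might look like this. The name reverse_words_copy is hypothetical, and it departs from the question's code by tokenizing a scratch copy, so the length is measured before strtok writes any NUL characters and no strdup or per-word free is needed.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated string holding the words of s in reverse
   order, or NULL on allocation failure. s itself is not modified. */
char *reverse_words_copy(const char *s)
{
    size_t len = strlen(s);              /* measured before any tokenizing */
    char *work = malloc(len + 1);        /* scratch copy for strtok to cut up */
    char *out = malloc(len + 1);         /* the output is never longer than s */
    char **words = malloc((len / 2 + 1) * sizeof *words);
    if (work == NULL || out == NULL || words == NULL) {
        free(work);
        free(out);
        free(words);
        return NULL;
    }
    memcpy(work, s, len + 1);

    size_t count = 0;
    for (char *tok = strtok(work, " \n"); tok != NULL; tok = strtok(NULL, " \n"))
        words[count++] = tok;            /* pointers into work */

    size_t offset = 0;
    while (count > 0) {
        --count;
        size_t wlen = strlen(words[count]);
        memcpy(out + offset, words[count], wlen);
        offset += wlen;
        if (count > 0)
            out[offset++] = ' ';
    }
    out[offset] = '\0';

    free(words);
    free(work);
    return out;
}
```

The caller frees the returned string. The newline that fgets leaves at the end of the input is treated as a delimiter here, which is an assumption about the desired output.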

strtok and free

What's the problem of doing this:
void *educator_func(void *param) {
char *lineE = (char *) malloc (1024);
size_t lenE = 1024;
ssize_t readE;
FILE * fpE;
fpE = fopen(file, "r");
if (fpE == NULL) {
printf("ERROR: couldnt open file\n");
exit(0);
}
while ((readE = getline(&lineE, &lenE, fpE)) != -1) {
char *pch2E = (char *) malloc (50);
pch2E = strtok(lineE, " ");
free(pch2E);
}
free(lineE);
fclose(fpE);
return NULL;
}
If I remove the line 'pch2E = strtok(lineE, " ");' it works fine.
Why can't I do a strtok() there? I tried strtok_r() as well but had no luck; it gives me an invalid free (Address 0x422af10 is 0 bytes inside a block of size 1,024 free'd)
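Your code is not doing what you think it is doing. The assignment pch2E = strtok(lineE, " "); overwrites pch2E with the return value of strtok, which points into the lineE buffer (or is NULL), so the 50 bytes you just allocated are leaked. The following free(pch2E) then frees part of lineE, a buffer that getline still uses and that you free again after the loop.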
You can fix it as follows...
while ((readE = getline(&lineE, &lenE, fpE)) != -1)
{
// first token of this line; it points into lineE, so do not malloc or free it
char* pch2E = strtok(lineE, " ");
}
free(lineE);
I should add, the more I look at your code, the more fundamentally flawed it looks to me. You need an inner loop in your code that deals with tokens while the outer loop is loading lines...
while ((readE = getline(&lineE, &lenE, fpE)) != -1)
{
char* pch2E;
int firstPass = 1;
while( (pch2E = strtok( firstPass ? lineE : NULL, " ")) != NULL )
{
firstPass = 0;
// do something with the pch2E return value
}
}
free(lineE);
strtok returns a pointer to the token, that is included in the string you have passed, so you can't free it, because it doesn't (always) point to something you've allocated with malloc.
That kind of assignment can't copy anything in C. If you wanted a function that copies the token into a buffer, it would look something like this:
void tokenize(char* string, char* delimiter, char* token);
And you would need to pass a valid pointer to token, for the function to copy the data in. In C, to copy data into the memory a pointer refers to, the function needs access to that pointer, so a function cannot do it through its return value.
An alternative (but worse) strategy would be a function that allocates memory internally and returns a pointer to a memory area that needs to be freed by the caller.
For your problem, strtok needs to be called several times to return all the tokens, until it returns null, so it should be:
while ((readE = getline(&lineE, &lenE, fpE)) != -1) {
char *pch2E = strtok(lineE, " "); //1st token
while (pch2E != NULL) {
//Do something with the token
pch2E = strtok(NULL, " "); //next token
}
}
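As a self-contained sketch of that loop structure (count_tokens is a hypothetical name, and getline assumes a POSIX system), the only buffer that is ever freed is the one getline manages:

```c
#define _POSIX_C_SOURCE 200809L  /* for getline */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Counts the space-separated tokens on every line of fp. */
long count_tokens(FILE *fp)
{
    char *line = NULL;      /* getline allocates and grows this buffer */
    size_t cap = 0;
    long total = 0;

    while (getline(&line, &cap, fp) != -1) {
        /* Each token pointer refers to bytes inside line: never free it. */
        char *tok = strtok(line, " \n");
        while (tok != NULL) {
            ++total;
            tok = strtok(NULL, " \n");
        }
    }
    free(line);             /* the only allocation, released exactly once */
    return total;
}
```

Adding "\n" to the delimiter set is a choice made for this sketch so that blank lines contribute no tokens.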

Parsing a file with strtok in C

I'm writing a short function to parse through a file by checking string tokens. It should stop when it hits "visgroups", which is the 9th line of the file I am using to test (which is in the buffer called *source). "versioninfo" is the first line. When I run this code it just repeatedly prints out "versioninfo" until I cancel the program manually. Why isn't the strtok function moving on?
I will be doing some different manipulation of the source when I reach this point, that's why the loop control variable is called "active". Would this have anything to do with the fact that strtok isn't thread-safe? I'm not using source in any other threads.
int countVisgroups(int *visgroups, char *source) {
const char delims[] = {'\t', '\n', ' '};
int active = 0;
char *temp;
while (!active){
temp = strtok(source, delims);
if (temp == NULL) {
printf("%s\n", "Reached end of file while parsing.");
return(0);
}
if (strncmp(temp, "visgroups", 9) == 0) {
active = 1;
return(0);
}
printf("%s\n", temp);
}
return(0);
}
Your delims array needs to be nul terminated. Otherwise how can strtok know how many separators you passed in? Normally you'd just use const char *delims = "\t\n " but you could simply add ..., 0 to your initializer.
After the first call to strtok with the string you want to tokenize, all subsequent calls must be done with the first parameter set to NULL.
temp = strtok(NULL, delims);
And no, it probably has nothing to do with thread safety.
Try to rewrite it like this:
int countVisgroups(int *visgroups, char *source) {
const char delims[] = {'\t', '\n', ' ', '\0'};
int active = 0;
char *temp;
temp = strtok(source, delims);
while (!active){
if (temp == NULL) {
printf("%s\n", "Reached end of file while parsing.");
return(0);
}
if (strncmp(temp, "visgroups", 9) == 0) {
active = 1;
return(0);
}
printf("%s\n", temp);
temp = strtok(NULL, delims);
}
return(0);
}
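Both points (a NUL-terminated delimiter string, and passing NULL after the first call) can be seen in this small sketch. The name tokens_before is hypothetical, and it uses an exact strcmp match where the code above uses a strncmp prefix match.

```c
#include <string.h>

/* Returns how many tokens precede the first token equal to stop,
   or -1 if stop never appears. source is modified in place. */
int tokens_before(char *source, const char *stop)
{
    const char *delims = "\t\n ";           /* string literal: NUL-terminated */
    int count = 0;
    char *tok = strtok(source, delims);     /* first call: pass the string */
    while (tok != NULL) {
        if (strcmp(tok, stop) == 0)
            return count;
        ++count;
        tok = strtok(NULL, delims);         /* later calls: pass NULL */
    }
    return -1;
}
```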

How can I access a global pointer outside of a C function?

I am trying to access the data of *tkn within a different function in my program, for example: putchar(*tkn); It is a global variable but it's not working correctly. Any ideas?
#define MAX 20
// globals
char *tkn;
char array[MAX];
...
void tokenize()
{
int i = 0, j = 0;
char *delim = " ";
tkn = strtok (str," "); // get token 1
if (tkn != NULL) {
printf("token1: ");
while ((*tkn != 0) && (tkn != NULL))
{
putchar(*tkn);
array[i] = *tkn;
*tkn++;
i++;
}
}
}
In this line:
while ((*tkn != 0) && (tkn != NULL))
you need to reverse the conditions. If tkn is a null pointer, you will crash when the first term is evaluated. If you need to check a pointer for validity, do so before dereferencing it.
while (tkn != NULL && *tkn != '\0')
The extra parentheses you added do no harm but are not necessary. And although 0 is a perfectly good zero, the '\0' emphasizes that *tkn is a character. Of course, given the prior tkn != NULL condition in the if statement, there is no real need to repeat the check in the while loop.
Working code based on yours - some work left to do for subsequent tokens in the string, for example...
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
enum { MAX = 20 };
char *tkn;
char array[MAX];
char str[2*MAX];
void tokenize(void)
{
int i = 0;
array[0] = '\0';
tkn = strtok(str, " "); // get token 1
if (tkn != NULL)
{
printf("token1: ");
while (tkn != NULL && *tkn != '\0' && i < MAX - 1)
{
putchar(*tkn);
array[i++] = *tkn++;
}
array[i] = '\0'; // terminate the copy so stale characters from an earlier call cannot show through
putchar('\n');
}
}
int main(void)
{
strcpy(str, "abc def");
tokenize();
printf("token = <<%s>>\n", array);
strcpy(str, "abcdefghijklmnopqrstuvwxyz");
tokenize();
printf("token = <<%s>>\n", array);
return(0);
}
Sample output:
token1: abc
token = <<abc>>
token1: abcdefghijklmnopqrs
token = <<abcdefghijklmnopqrs>>
Asked:
But what if I am taking in a string 'abc 3fc ghi' and I want to use just '3fc' in another function that say converts it from ascii to hex? How do I just use say tkn2 for 3fc and get that only using a pointer? – patrick 9 mins ago
That's where it gets trickier because strtok() has a moderately fiendish interface.
Leaving tokenize() unchanged, let's redefine str:
char *str;
Then, we can use (untested):
int main(void)
{
char buffer[2*MAX];
strcpy(buffer, "abc 3fc ghi");
str = buffer;
tokenize();
printf("token = <<%s>>\n", array); // "abc"
str = NULL;
tokenize();
printf("token = <<%s>>\n", array); // "3fc"
str = NULL;
tokenize();
printf("token = <<%s>>\n", array); // "ghi"
return(0);
}
Clearly, this relies on knowing that there are three tokens. To generalize, you'd need the tokenizer to tell you when there's nothing left to tokenize.
Note that having tkn as a global is really unnecessary - indeed, you should aim to avoid globals as much as possible.
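One way to generalize, sketched here under the assumption that a copy into a caller-supplied buffer is what is wanted (next_token is a hypothetical helper, not part of the code above), is to have the tokenizer return a flag saying whether it produced anything. That also removes the need for the globals.

```c
#include <string.h>

/* Copies the next space-separated token into out, truncating it to
   outsz - 1 characters, and returns 1. Returns 0 when no tokens remain.
   Pass the string on the first call and NULL on later calls. */
int next_token(char *str, char *out, size_t outsz)
{
    char *tok = strtok(str, " ");
    if (tok == NULL || outsz == 0)
        return 0;
    size_t n = strlen(tok);
    if (n >= outsz)
        n = outsz - 1;
    memcpy(out, tok, n);
    out[n] = '\0';                  /* always terminate the copy */
    return 1;
}
```

A caller can then loop with something like for (ok = next_token(buffer, tok, sizeof tok); ok; ok = next_token(NULL, tok, sizeof tok)) and pick out the second token, "3fc", without knowing the token count in advance.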
You should use
tkn++
rather than
*tkn++
Just use strlcpy(3) instead of hand-coding the copy (hint: you are forgetting the string's zero terminator). Note that strlcpy is a BSD extension rather than part of standard C:
strlcpy( array, tkn, MAX );
Although tkn itself is a global variable, you also have to make sure that what it points to (ie. *tkn) is still around when you try to use it.
When you set tkn with a line like:
tkn = strtok (str," ");
Then tkn is pointing to part of the string that str points to. So if str was pointing to a non-static array declared in a function, for example, and that function has exited - then *tkn isn't allowed any more. If str was pointing to a block of memory allocated by malloc(), and you've called free() on that memory - then accessing *tkn isn't allowed after that point.
