I've been assigned a homework from my college professor and I seem to have found some strange behavior of strtok
Basically, we have to parse a CSV file for my class, where the number of tokens in the CSV is known and the last element may have extra "," characters.
An example of a line:
Hello,World,This,Is,A lot, of Text
Where the tokens should be output as
1. Hello
2. World
3. This
4. Is
5. A lot, of Text
For this assignment we MUST use strtok. Because of this I found on some other SOF post that using strtok with an empty string (or passing "\n" as the second argument) results in reading until the end of the line. This is perfect for my application since the extra commas always appear in the last element.
I've created this code which works:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#define NUM_TOKENS 5
const char *line = "Hello,World,This,Is,Text";
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
// Make an array the correct size to hold num_tokens
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS; token = strtok(NULL, i < NUM_TOKENS - 1 ? ",\n" : "\n"))
{
tokens[i++] = strdup(token);
}
free(copy);
return tokens;
}
int main()
{
char **tokens = split_line(line, NUM_TOKENS);
for (int i = 0; i < NUM_TOKENS; i++)
{
printf("%s\n", tokens[i]);
free(tokens[i]);
}
}
Now this works and should get me full credit but I hate this ternary that shouldn't be needed:
token = strtok(NULL, i < NUM_TOKENS - 1 ? ",\n" : "\n");
I'd like to replace the method with this version:
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
// Make an array the correct size to hold num_tokens
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
tokens[i] = strdup(strtok(NULL, "\n"));
free(copy);
return tokens;
}
This tickles my fancy much nicer since it is much easier to see that there is a final case. You also get rid of the strange ternary operator.
Sadly though, this segfaults! I can't for the life of me figure out why.
Edit: Add some output examples:
[11:56:06] gravypod:test git:(master*) $ ./test_no_fault
Hello
World
This
Is
Text
[11:56:10] gravypod:test git:(master*) $ ./test_seg_fault
[1] 3718 segmentation fault (core dumped) ./test_seg_fault
[11:56:14] gravypod:test git:(master*) $
Please check the return value from strtok before you risk passing NULL to another function. Your loop is calling strtok one more time than you think.
It is more usual to use this return value to control your loop, then you are not at the mercy of your data. As for the delimitors, best to keep it simple and not try anything fancy.
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
char *token;
char delim1[] = ",\r\n";
char delim2[] = "\r\n";
char *delim = delim1; // start with a comma in the delimiter set
token = strtok(copy, delim);
while(token != NULL) { // strtok result comtrols the loop
tokens[i++] = strdup(token);
if(i == NUM_TOKENS) {
delim = delim2; // change the delimiters
}
token = strtok(NULL, delim);
}
free(copy);
return tokens;
}
Note you should also check the return values from malloc and strdup and free your memory properly
When you get to the last loop, you'll get
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
loop body
loop increment step, i.e. token = strtok(NULL, ",\n") (with the wrong second arg)
loop continuation check i < NUM_TOKENS - 1
i.e. it has still called strtok even though you're now out-of-range. You've also got an off-by-one on your array indices here: you'd want to initialise i=0 not 1.
You could avoid this by e.g.
making the initial strtok a special case outside the loop, e.g.
int i = 0;
tokens[i++] = strdup(strtok(copy, ",\n"));
then moving the strtok(NULL, ",\n") inside the loop
I'm also surprised you want the \n there at all, or even need to call the last strtok (wouldn't that already just point to the rest of the string? If you just trying to chop a trailing newline there are easier ways) but I haven't used strtok in years.
(As an aside you're also not freeing the malloced array you store the string pointers in. That said since it's the end of the program at that point that doesn't matter so much.)
Remember that strtok identifies a token when it finds any of the characters in the delimiter string (the second argument to strtok()) - it doesn't try to match the entire delimiter string itself.
Thus, the ternary operator was never needed in the first place - the string will be tokenized based on the occurrence of , OR \n in the input string, so the following works:
for (token = strtok(copy, ",\n"); i < NUM_TOKENS; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
The second example segfaults because it's already tokenized the input to the end of the string by the time it exits the for loop. Calling strtok() again sets token to NULL, and the segfault is generated when strdup() is called on the NULL pointer. Removing the extra call to strtok gives the expected results:
for (token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
tokens[i] = strdup(token);
Related
Description of what my function attempts to do
My function gets a string for example "Ab + abc EF++aG hi jkL" and turns it into ["abc", "hi"]
In addition, the function only takes into account letters and the letters all have to be lowercase.
The problem is that
char* str1 = "Ab + abc EF++aG hi jkL";
char* str2 = "This is a very famous quote";
char** tokens1 = get_tokens(str1);
printf("%s", tokens1[0]); <----- prints out "abc" correct output
char** tokens2 = get_tokens(str2);
printf("%s", tokens1[0]); <----- prints out "s" incorrect output
get_tokens function (Returns the 2d array)
char** get_tokens(const char* str) {
// implement me
int num_tokens = count_tokens(str);
char delim[] = " ";
int str_length = strlen(str);
char* new_str = malloc(str_length);
strcpy(new_str, str);
char* ptr = strtok(new_str, delim);
int index = 0;
char** array_2d = malloc(sizeof(char*) *num_tokens);
while (ptr != NULL){
if (check_string(ptr) == 0){
array_2d[index] = ptr;
index++;
}
ptr = strtok(NULL, delim);
}
free(new_str);
new_str = NULL;
free(ptr);
ptr = NULL;
return array_2d;
}
count_tokens function (returns the number of valid strings)
for example count_tokens("AB + abc EF++aG hi jkL") returns 2 because only "abc" and "hi" are valid
int count_tokens(const char* str) {
// implement me
//Seperate string using strtok
char delim[] = " ";
int str_length = strlen(str);
char* new_str = malloc(str_length);
strcpy(new_str, str);
char* ptr = strtok(new_str, delim);
int counter = 0;
while (ptr != NULL){
if (check_string(ptr) == 0){
counter++;
}
ptr = strtok(NULL, delim);
}
free(new_str);
return counter;
}
Lastly check_string() checks if a string is valid
For example check_string("Ab") is invalid because there is a A inside.
using strtok to split "Ab + abc EF++aG hi jkL" into separate parts
int check_string(char* str){
// 0 = false
// 1 = true
int invalid_chars = 0;
for (int i = 0; i<strlen(str); i++){
int char_int_val = (int) str[i];
if (!((char_int_val >= 97 && char_int_val <= 122))){
invalid_chars = 1;
}
}
return invalid_chars;
}
Any help would be much appreciated. Thank you for reading.
If you have any questions about how the code works please ask me. Also I'm new to stackoverflow, please tell me if I have to change something.
You have a few problems in your code. First I'll repeat what I've said in the comments:
Not allocating enough space for the string copies. strlen does not include the NUL terminator in its length, so when you do
char* new_str = malloc(str_length);
strcpy(new_str, str);
new_str overflows by 1 when strcpy adds the '\0', invoking undefined behavior. You need to allocate one extra:
char* new_str = malloc(str_length + 1);
strcpy(new_str, str);
You should not free any pointer returned from strtok. You only free memory that's been dynamically allocated using malloc and friends. strtok does no such thing, so it's incorrect to free the pointer it returns. Doing so also invokes UB.
Your final problem is because of this:
// copy str to new_str, that's correct because strtok
// will manipulate the string you pass into it
strcpy(new_str, str);
// get the first token and allocate size for the number of tokens,
// so far so good (but you should check that malloc succeeded)
char* ptr = strtok(new_str, delim);
char** array_2d = malloc(sizeof(char*) *num_tokens);
while (ptr != NULL){
if (check_string(ptr) == 0){
// whoops, this is where the trouble starts ...
array_2d[index] = ptr;
index++;
}
// get the next token, this is correct
ptr = strtok(NULL, delim);
}
// ... because you free new_str
free(new_str);
ptr is a pointer to some token in new_str. As soon as you free(new_str), Any pointer pointing to that now-deallocated memory is invalid. You've loaded up array_2d with pointers to memory that's no longer allocated. Trying to access those locations again invokes undefined behavior. There's two ways I can think of off the top to solve this:
Instead of saving pointers that are offsets to new_str, find the same tokens in str (the string from main) and point to those instead. Since those are defined in main, they will exist for as long as the program exists.
Allocate some more memory, and strcpy the token into array_2d[index]. I'll demonstrate this below:
while (ptr != NULL){
if (check_string(ptr) == false)
{
// allocate (enough) memory for the pointer at index
array_2d[index] = malloc(strlen(ptr) + 1);
// you should _always_ check that malloc succeeds
if (array_2d[index] != NULL)
{
// _copy_ the string pointed to by ptr into our new space rather
// than simply assigning the pointer
strcpy(array_2d[index], ptr);
}
else { /* handle no mem error how you want */ }
index++;
}
ptr = strtok(NULL, delim);
}
// now we can safely free new_str without invalidating anything in array_2d
free(new_str);
I have a working demonstration here. Note some other changes in the demo:
#include <stdbool.h> and used that instead of 0 and 1 ints.
Changed your get_tokens function a bit to "return" the number of tokens. This is useful in main for printing them out.
Replaced the ASCII magic numbers with their characters.
Removed the useless freedPointer = NULL lines.
Changed your ints to size_t types for everything involving a size.
One final note, while this is a valid implementation, it's probably doing a bit more work than it needs to. Rather than counting the number of tokens in a first pass, then retrieving them in a second pass, you can surely do everything you want in a single pass, but I'll leave that as an exercise to you if you're so inclined.
How to split a string into an tokens and then save them in an array?
Specifically, I have a string "abc/qwe/jkh". I want to separate "/", and then save the tokens into an array.
Output will be such that
array[0] = "abc"
array[1] = "qwe"
array[2] = "jkh"
please help me
#include <stdio.h>
#include <string.h>
int main ()
{
char buf[] ="abc/qwe/ccd";
int i = 0;
char *p = strtok (buf, "/");
char *array[3];
while (p != NULL)
{
array[i++] = p;
p = strtok (NULL, "/");
}
for (i = 0; i < 3; ++i)
printf("%s\n", array[i]);
return 0;
}
You can use strtok()
char string[] = "abc/qwe/jkh";
char *array[10];
int i = 0;
array[i] = strtok(string, "/");
while(array[i] != NULL)
array[++i] = strtok(NULL, "/");
Why strtok() is a bad idea
Do not use strtok() in normal code, strtok() uses static variables which have some problems. There are some use cases on embedded microcontrollers where static variables make sense but avoid them in most other cases. strtok() behaves unexpected when more than 1 thread uses it, when it is used in a interrupt or when there are some other circumstances where more than one input is processed between successive calls to strtok().
Consider this example:
#include <stdio.h>
#include <string.h>
//Splits the input by the / character and prints the content in between
//the / character. The input string will be changed
void printContent(char *input)
{
char *p = strtok(input, "/");
while(p)
{
printf("%s, ",p);
p = strtok(NULL, "/");
}
}
int main(void)
{
char buffer[] = "abc/def/ghi:ABC/DEF/GHI";
char *p = strtok(buffer, ":");
while(p)
{
printContent(p);
puts(""); //print newline
p = strtok(NULL, ":");
}
return 0;
}
You may expect the output:
abc, def, ghi,
ABC, DEF, GHI,
But you will get
abc, def, ghi,
This is because you call strtok() in printContent() resting the internal state of strtok() generated in main(). After returning, the content of strtok() is empty and the next call to strtok() returns NULL.
What you should do instead
You could use strtok_r() when you use a POSIX system, this versions does not need static variables. If your library does not provide strtok_r() you can write your own version of it. This should not be hard and Stackoverflow is not a coding service, you can write it on your own.
I tried using strtok to split a string and store individual words in a 2D array.I have put printf statements and checked whether individual words are getting stored or not and saw that words are getting stored.
But the problem is while loop does not terminate at all, while executing it does not print exit statement "Hi".
I am unable to find the mistake here.
Can someone help me with this?
int wordBreak(char* str, char words[MAX_WORDS][MAX_WORD_LENGTH])
{
int x=0;
const char *delimiters = " :,;\n";
char* token = strtok(str,delimiters);
strcpy(words[x],token);
x++;
while(token!=NULL) {
token = strtok(NULL,delimiters);
strcpy(words[x],token);
x++;
}
printf("Hi\n");
return x;
}
You must always ensure the return of strtok is not NULL before calling strcpy to copy to your array. A simple rearrangement will ensure that is the case:
int wordBreak (char* str, char words[MAX_WORDS][MAX_WORD_LENGTH])
{
int x = 0;
const char *delimiters = " :,;\n";
char *token = strtok (str,delimiters);
while (x < MAX_WORDS && token != NULL) {
if (strlen (token) < MAX_WORD_LENGTH) {
strcpy (words[x++], token);
token = strtok (NULL, delimiters);
}
else
fprintf (stderr, "warning: token '%s' exeeded MAX_WORD_LENGTH - skipped\n",
token);
}
return x;
}
You also must ensure x < MAX_WORDS by adding the additional condition as shown above and a test to ensure strlen(token) < MAX_WORD_LENGTH before copying to your array.
The code prints some lines from a .csv file, but there's a bug in a sequence of a code and I am not sure where.
It prints them the following way:
hello
how
are you
buddy
buddy buddy buddy buddy
const char* getfield(char* line, int num)
{
const char* tok;
for (tok = strtok(line, ",");
tok && *tok;
tok = strtok(NULL, ",\n"))
{
if (!--num)
return tok;
}
return NULL;
}
int lang1()
{
fp = fopen("lang.csv", "r");
int i = 0;
char line[1024];
const char* word[256];
char num[] = { 1 , 2 };
while (fgets(line, 1024, fp))
{
printf("%s\n", getfield(line, num[0]));
word[i] = getfield(line, num[0]);
++i;
}
for (int i = 0; i < 5; i++)
printf("%s ", word[i]);
fclose(fp);
return 0;
}
How it should actually print them:
hello
how
are you
buddy
hello how are you buddy
Thank you for your patience, I'd love to fix this (and have an explanation for the issue, if possible - I am trying to understand the problem)...
The problem is that strtok doesn't return different new strings, but rather pointers to the same string (the "source" argument). That is, you will always get pointers to the single array line. Which will only contain the last line you read.
This would have been very obvious if you stepped through the code in a debugger, as you would then see the pointer returned by strtok to always be the same.
Possible solutions is to use an array of arrays, and copy the "string" returned by getfield.
char word[256][256];
...
strcpy(word[i], getfield(...));
I learnt C in uni but haven't used it for quite a few years. Recently I started working on a tool which uses C as the programming language. Now I'm stuck with some really basic functions. Among them are how to split and join strings using a delimiter? (I miss Python so much, even Java or C#!)
Below is the function I created to split a string, but it does not seem to work properly. Also, even this function works, the delimiter can only be a single character. How can I use a string as a delimiter?
Can someone please provide some help?
Ideally, I would like to have 2 functions:
// Split a string into a string array
char** fSplitStr(char *str, const char *delimiter);
// Join the elements of a string array to a single string
char* fJoinStr(char **str, const char *delimiter);
Thank you,
Allen
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
char** fSplitStr(char *str, const char *delimiters)
{
char * token;
char **tokenArray;
int count=0;
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
if (!token) {
return tokenArray;
}
while (token != NULL ) { // While valid tokens are returned
tokenArray[count] = (char*)malloc(sizeof(token));
tokenArray[count] = token;
printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * count);
token = (char *)strtok(NULL, delimiters); // Get the next token
}
return tokenArray;
}
int main (void)
{
char str[] = "Split_The_String";
char ** splitArray = fSplitStr(str,"_");
printf ("%s", splitArray[0]);
printf ("%s", splitArray[1]);
printf ("%s", splitArray[2]);
return 0;
}
Answers: (Thanks to Moshbear, Joachim and sarnold):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
char** fStrSplit(char *str, const char *delimiters)
{
char * token;
char **tokenArray;
int count=0;
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
tokenArray[0] = NULL;
if (!token) {
return tokenArray;
}
while (token != NULL) { // While valid tokens are returned
tokenArray[count] = (char*)strdup(token);
//printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * (count + 1));
token = (char *)strtok(NULL, delimiters); // Get the next token
}
tokenArray[count] = NULL; /* Terminate the array */
return tokenArray;
}
char* fStrJoin(char **str, const char *delimiters)
{
char *joinedStr;
int i = 1;
joinedStr = realloc(NULL, strlen(str[0])+1);
strcpy(joinedStr, str[0]);
if (str[0] == NULL){
return joinedStr;
}
while (str[i] !=NULL){
joinedStr = (char*)realloc(joinedStr, strlen(joinedStr) + strlen(str[i]) + strlen(delimiters) + 1);
strcat(joinedStr, delimiters);
strcat(joinedStr, str[i]);
i++;
}
return joinedStr;
}
int main (void)
{
char str[] = "Split_The_String";
char ** splitArray = (char **)fStrSplit(str,"_");
char * joinedStr;
int i=0;
while (splitArray[i]!=NULL) {
printf ("%s", splitArray[i]);
i++;
}
joinedStr = fStrJoin(splitArray, "-");
printf ("%s", joinedStr);
return 0;
}
Use strpbrk instead of strtok, because strtok suffers from two weaknesses:
it's not re-entrant (i.e. thread-safe)
it modifies the string
For joining, use strncat for joining, and realloc for resizing.
The order of operations is very important.
Before doing the realloc;strncat loop, set the 0th element of the target string to '\0' so that strncat won't cause undefined behavior.
For starters, don't use sizeof to get the length of a string. strlen is the function to use. In this case strdup is better.
And you don't actually copy the string returned by strtok, you copy the pointer. Change you loop to this:
while (token != NULL) { // While valid tokens are returned
tokenArray[count] = strdup(token);
printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * count);
token = (char *)strtok(NULL, delimiters); // Get the next token
}
tokenArray[count] = NULL; /* Terminate the array */
Also, don't forget to free the entries in the array, and the array itself when you're done with it.
Edit At the beginning of fSplitStr, wait with allocating the tokenArray until after you check that token is not NULL, and if token is NULL why not return NULL?
I'm not sure the best solution for you, but I do have a few notes:
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
if (!token) {
return tokenArray;
}
At this point, if you weren't able to find any tokens in the string, you return a pointer to an "array" that is large enough to hold a single character pointer. It is un-initialized, so it would not be a good idea to use the contents of this array in any way. C almost never initializes memory to 0x00 for you. (calloc(3) would do that for you, but since you need to overwrite every element anyway, it doesn't seem worth switching to calloc(3).)
Also, the (char **) case before the malloc(3) call indicates to me that you've probably forgotten the #include <stdlib.h> that would properly prototype malloc(3). (The cast was necessary before about 1989.)
Do note that your while() { } loop is setting pointers to the parts of the original input string to your tokenArray elements. (This is one of the cons that moshbear mentioned in his answer -- though it isn't always a weakness.) If you change tokenArray[1][1]='H', then your original input string also changes. (In addition to having each of the delimiter characters replaced with an ASCII NUL character.)