Tokenize an environment variable and save the resulting token in a char** - c

I'm attempting to create an array of strings that represent the directories stored in the PATH variable. I'm writing this code in C, but I'm having trouble getting the memory allocation parts working.
char* shell_path = getenv ("PATH");
char* tok = strtok (shell_path, SHELL_PATH_SEPARATOR);
int number_of_tokens = 0, i = 0;
while (tok != NULL)
{
number_of_tokens++;
}
Shell_Path_Directories = malloc (/* This is where I need some help */);
shell_path = getenv ("PATH");
tok = strtok (shell_path, SHELL_PATH_SEPARATOR);
while (tok != NULL)
{
Shell_Path_Directories[i++] = tok;
tok = strtok (NULL, SHELL_PATH_SEPARATOR);
}
The issue I'm having is that I can't think of how I can know exactly how much memory to allocate.
I know I'm tokenizing the strings twice, and that it's probably stupid for me to be doing that, but I'm open to improvements if someone can figure out a better way to do this.

Just to give you basically the same answer as user411313's in a different dialect:
char* shell_path = getenv ("PATH");
/* Copy the environment string */
size_t const len = strlen(shell_path)+1;
char *copyenv = memcpy(malloc(len), shell_path, len);
/* start the tokenization */
char *p=strtok(copyenv,SHELL_PATH_SEPARATOR);
/* the path should always contain at least one element */
assert(p);
char **result = malloc(sizeof result[0]);
int i = 0;
while (1)
{
result[i] = strcpy(malloc(strlen(p)+1), p);
p=strtok(0,SHELL_PATH_SEPARATOR);
if (!p) break;
++i;
result = realloc( result, (i+1)*sizeof*result );
}

You can do:
Shell_Path_Directories = malloc (sizeof(char*) * number_of_tokens);
Also the way you are counting the number_of_tokens is incorrect. You need to call the strtok again in the loop passing it NULL as the 1st argument:
while (tok != NULL) {
number_of_tokens++;
tok = strtok (NULL, SHELL_PATH_SEPARATOR);
}

Since you've counted the number of tokens already, you can use that as the number of pointers to char to allocate:
char **Shell_Path_Directories = malloc(number_of_tokens * sizeof(char *));
Then you have one more minor issue: you're using strtok directly on the string returned by getenv, which leads to undefined behavior (strtok modifies the string you pass to it, and you're not allowed to modify the string returned by getenv, so you get undefined behavior). You probably want to duplicate the string first, then tokenize your copy instead.
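The count-then-allocate approach, combined with tokenizing a private copy, might look like the sketch below. The names count_path_entries and split_path are made up for illustration, and the path is passed in as a parameter (rather than read from getenv inside the function) so the sketch can be exercised with any string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count tokens without modifying the string. Same rules as strtok:
   a run of separators counts once and empty fields are skipped. */
size_t count_path_entries(const char *s, const char *sep)
{
    size_t n = 0;
    s += strspn(s, sep);            /* skip leading separators */
    while (*s != '\0') {
        n++;
        s += strcspn(s, sep);       /* skip the token itself */
        s += strspn(s, sep);        /* skip the separators after it */
    }
    return n;
}

/* Hypothetical helper: split `path` into a NULL-terminated array.
   The pointers point into *copy_out, so free(*copy_out) plus free(array)
   releases everything. Returns NULL on allocation failure. */
char **split_path(const char *path, const char *sep, char **copy_out)
{
    assert(path != NULL && sep != NULL && copy_out != NULL);
    size_t len = strlen(path) + 1;
    char *copy = malloc(len);
    if (copy == NULL) return NULL;
    memcpy(copy, path, len);        /* never tokenize getenv's string directly */

    size_t n = count_path_entries(path, sep);
    char **dirs = malloc((n + 1) * sizeof *dirs);
    if (dirs == NULL) { free(copy); return NULL; }

    size_t i = 0;
    for (char *tok = strtok(copy, sep); tok != NULL; tok = strtok(NULL, sep))
        dirs[i++] = tok;
    dirs[i] = NULL;                 /* terminator, so callers need no count */
    *copy_out = copy;
    return dirs;
}
```

A caller would check getenv("PATH") for NULL first, then pass the result together with SHELL_PATH_SEPARATOR. Counting with strspn/strcspn leaves the original string untouched, so only one copy is needed.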

You should not modify the string that getenv returns; it is safer to make a copy. With strtok you can corrupt the contents of your environment table.
char* shell_path = getenv ("PATH");
char *p,*copyenv = strcpy( malloc(strlen(shell_path)+1), shell_path );
char **result = 0;
int i = 0;
for( p=strtok(copyenv,SHELL_PATH_SEPARATOR); p; p=strtok(0,SHELL_PATH_SEPARATOR) )
{
result = realloc( result, ++i*sizeof*result );
strcpy( result[i-1]=malloc(strlen(p)+1), p );
}

Related

Why do I get segmentation error when calling a parsing function in C

I have been trying to understand how this custom function below works to parse lines from argv but I keep getting segmentation errors.
I have been trying to debug it for hours but I cannot find the bug that is eating into "restricted memory"
The function takes a string literal and a delimiter string (not character).
When I use valgrind to audit the function, it reports a SIGSEGV error when calling strtok.
I understand strtok cannot work directly with string literals because it can cause undefined behaviour. So I decided to copy the str to a local variable first.
Yes, I tried using an array as copy too, but it still throws the segmentation error.
What I really don't understand is: why is strtok not getting enough memory?
char **splitstring(char *str, const char *delim)
{
int i, wn;
char **array;
char *token;
char *copy;
copy = malloc(strlen(str) + 1);
if (copy == NULL)
{
perror("hsh");
return (NULL);
}
i = 0;
while (str[i])
{
copy[i] = str[i];
i++;
}
copy[i] = '\0';
token = strtok(copy, delim);
array = malloc((sizeof(char *) * 2));
array[0] = strdup(token);
i = 1;
wn = 3;
while (token)
{
token = strtok(NULL, delim);
array = realloc(array, (sizeof(char *) * (wn - 1)), (sizeof(char *) * wn));
array[i] = strdup(token);
i++;
wn++;
}
free(copy);
return (array);
}
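Judging only from the code shown, one likely cause is that the loop calls strtok and then hands the result straight to strdup. On the final iteration token is NULL, and so is the first token if the input contains only delimiters. The three-argument realloc call also does not match the standard realloc(ptr, size), so it presumably refers to a custom helper that is not shown. The sketch below uses only the standard library; split_string, copy_string and free_tokens are illustrative names, and returning a NULL-terminated array is an assumption about what the caller wants.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: portable stand-in for strdup. */
char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *p = malloc(len);
    if (p != NULL) memcpy(p, s, len);
    return p;
}

/* Free a NULL-terminated array produced by split_string. */
void free_tokens(char **array)
{
    if (array == NULL) return;
    for (size_t i = 0; array[i] != NULL; i++) free(array[i]);
    free(array);
}

char **split_string(const char *str, const char *delim)
{
    assert(str != NULL && delim != NULL);
    char *copy = copy_string(str);
    if (copy == NULL) return NULL;

    size_t count = 0;
    char **array = malloc(sizeof *array);    /* room for the terminator */
    if (array == NULL) { free(copy); return NULL; }
    array[0] = NULL;

    /* The token is tested before it is used; strtok returns NULL at the end. */
    for (char *tok = strtok(copy, delim); tok != NULL; tok = strtok(NULL, delim)) {
        char **grown = realloc(array, (count + 2) * sizeof *array);
        char *dup = copy_string(tok);
        if (grown != NULL) array = grown;
        if (grown == NULL || dup == NULL) {
            free(dup);
            free_tokens(array);              /* array is still NULL-terminated */
            free(copy);
            return NULL;
        }
        array[count++] = dup;
        array[count] = NULL;
    }
    free(copy);
    return array;
}
```

Because every stored string is its own allocation, freeing the working copy at the end does not invalidate the result.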

Function returns char**s, running the function twice causes the return value to differ

Description of what my function attempts to do
My function gets a string for example "Ab + abc EF++aG hi jkL" and turns it into ["abc", "hi"]
In addition, the function only takes into account letters and the letters all have to be lowercase.
The problem is that
char* str1 = "Ab + abc EF++aG hi jkL";
char* str2 = "This is a very famous quote";
char** tokens1 = get_tokens(str1);
printf("%s", tokens1[0]); <----- prints out "abc" correct output
char** tokens2 = get_tokens(str2);
printf("%s", tokens1[0]); <----- prints out "s" incorrect output
get_tokens function (Returns the 2d array)
char** get_tokens(const char* str) {
// implement me
int num_tokens = count_tokens(str);
char delim[] = " ";
int str_length = strlen(str);
char* new_str = malloc(str_length);
strcpy(new_str, str);
char* ptr = strtok(new_str, delim);
int index = 0;
char** array_2d = malloc(sizeof(char*) *num_tokens);
while (ptr != NULL){
if (check_string(ptr) == 0){
array_2d[index] = ptr;
index++;
}
ptr = strtok(NULL, delim);
}
free(new_str);
new_str = NULL;
free(ptr);
ptr = NULL;
return array_2d;
}
count_tokens function (returns the number of valid strings)
for example count_tokens("AB + abc EF++aG hi jkL") returns 2 because only "abc" and "hi" are valid
int count_tokens(const char* str) {
// implement me
//Separate string using strtok
char delim[] = " ";
int str_length = strlen(str);
char* new_str = malloc(str_length);
strcpy(new_str, str);
char* ptr = strtok(new_str, delim);
int counter = 0;
while (ptr != NULL){
if (check_string(ptr) == 0){
counter++;
}
ptr = strtok(NULL, delim);
}
free(new_str);
return counter;
}
Lastly check_string() checks if a string is valid
For example check_string("Ab") is invalid because there is a A inside.
using strtok to split "Ab + abc EF++aG hi jkL" into separate parts
int check_string(char* str){
// 0 = false
// 1 = true
int invalid_chars = 0;
for (int i = 0; i<strlen(str); i++){
int char_int_val = (int) str[i];
if (!((char_int_val >= 97 && char_int_val <= 122))){
invalid_chars = 1;
}
}
return invalid_chars;
}
Any help would be much appreciated. Thank you for reading.
If you have any questions about how the code works please ask me. Also I'm new to stackoverflow, please tell me if I have to change something.
You have a few problems in your code. First I'll repeat what I've said in the comments:
Not allocating enough space for the string copies. strlen does not include the NUL terminator in its length, so when you do
char* new_str = malloc(str_length);
strcpy(new_str, str);
new_str overflows by 1 when strcpy adds the '\0', invoking undefined behavior. You need to allocate one extra:
char* new_str = malloc(str_length + 1);
strcpy(new_str, str);
You should not free any pointer returned from strtok. You only free memory that's been dynamically allocated using malloc and friends. strtok does no such thing, so it's incorrect to free the pointer it returns. Doing so also invokes UB.
Your final problem is because of this:
// copy str to new_str, that's correct because strtok
// will manipulate the string you pass into it
strcpy(new_str, str);
// get the first token and allocate size for the number of tokens,
// so far so good (but you should check that malloc succeeded)
char* ptr = strtok(new_str, delim);
char** array_2d = malloc(sizeof(char*) *num_tokens);
while (ptr != NULL){
if (check_string(ptr) == 0){
// whoops, this is where the trouble starts ...
array_2d[index] = ptr;
index++;
}
// get the next token, this is correct
ptr = strtok(NULL, delim);
}
// ... because you free new_str
free(new_str);
ptr is a pointer to some token in new_str. As soon as you free(new_str), any pointer pointing to that now-deallocated memory is invalid. You've loaded up array_2d with pointers to memory that's no longer allocated. Trying to access those locations again invokes undefined behavior. There are two ways I can think of off the top of my head to solve this:
Instead of saving pointers that are offsets to new_str, find the same tokens in str (the string from main) and point to those instead. Since those are defined in main, they will exist for as long as the program exists.
Allocate some more memory, and strcpy the token into array_2d[index]. I'll demonstrate this below:
while (ptr != NULL){
if (check_string(ptr) == false)
{
// allocate (enough) memory for the pointer at index
array_2d[index] = malloc(strlen(ptr) + 1);
// you should _always_ check that malloc succeeds
if (array_2d[index] != NULL)
{
// _copy_ the string pointed to by ptr into our new space rather
// than simply assigning the pointer
strcpy(array_2d[index], ptr);
}
else { /* handle no mem error how you want */ }
index++;
}
ptr = strtok(NULL, delim);
}
// now we can safely free new_str without invalidating anything in array_2d
free(new_str);
I have a working demonstration here. Note some other changes in the demo:
#include <stdbool.h> and used that instead of 0 and 1 ints.
Changed your get_tokens function a bit to "return" the number of tokens. This is useful in main for printing them out.
Replaced the ASCII magic numbers with their characters.
Removed the useless freedPointer = NULL lines.
Changed your ints to size_t types for everything involving a size.
One final note, while this is a valid implementation, it's probably doing a bit more work than it needs to. Rather than counting the number of tokens in a first pass, then retrieving them in a second pass, you can surely do everything you want in a single pass, but I'll leave that as an exercise to you if you're so inclined.
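For readers who want to try that exercise, here is one possible single-pass sketch. It grows the result with realloc as valid tokens turn up, so no separate counting pass is needed. The names is_lowercase_word and get_tokens_single_pass are invented for this example, and islower is assumed to match the a-z range test used in check_string (true in the default "C" locale).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* True when s is non-empty and every character is a lowercase letter. */
bool is_lowercase_word(const char *s)
{
    if (*s == '\0') return false;
    for (; *s != '\0'; s++)
        if (!islower((unsigned char)*s)) return false;
    return true;
}

/* Single pass: returns a malloc'd array of malloc'd strings and stores the
   number of entries in *count_out. Returns NULL when there are no valid
   tokens or when an allocation fails. */
char **get_tokens_single_pass(const char *str, size_t *count_out)
{
    assert(str != NULL && count_out != NULL);
    *count_out = 0;
    size_t len = strlen(str) + 1;            /* +1 for the terminator */
    char *work = malloc(len);
    if (work == NULL) return NULL;
    memcpy(work, str, len);

    char **tokens = NULL;
    size_t count = 0;
    for (char *p = strtok(work, " "); p != NULL; p = strtok(NULL, " ")) {
        if (!is_lowercase_word(p)) continue;
        size_t n = strlen(p) + 1;
        char *dup = malloc(n);
        char **grown = realloc(tokens, (count + 1) * sizeof *tokens);
        if (grown != NULL) tokens = grown;
        if (dup == NULL || grown == NULL) {
            free(dup);
            while (count > 0) free(tokens[--count]);
            free(tokens);
            free(work);
            return NULL;
        }
        memcpy(dup, p, n);                   /* copy the text, not the pointer */
        tokens[count++] = dup;
    }
    free(work);                              /* safe: tokens own their copies */
    *count_out = count;
    return tokens;
}
```

The caller frees each tokens[i] and then tokens itself. Growing by one slot per token keeps the sketch short; a real implementation might grow geometrically.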

Segfault resulting from strdup and strtok

I've been assigned a homework from my college professor and I seem to have found some strange behavior of strtok
Basically, we have to parse a CSV file for my class, where the number of tokens in the CSV is known and the last element may have extra "," characters.
An example of a line:
Hello,World,This,Is,A lot, of Text
Where the tokens should be output as
1. Hello
2. World
3. This
4. Is
5. A lot, of Text
For this assignment we MUST use strtok. Because of this I found on some other SOF post that using strtok with an empty string (or passing "\n" as the second argument) results in reading until the end of the line. This is perfect for my application since the extra commas always appear in the last element.
I've created this code which works:
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#define NUM_TOKENS 5
const char *line = "Hello,World,This,Is,Text";
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
// Make an array the correct size to hold num_tokens
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS; token = strtok(NULL, i < NUM_TOKENS - 1 ? ",\n" : "\n"))
{
tokens[i++] = strdup(token);
}
free(copy);
return tokens;
}
int main()
{
char **tokens = split_line(line, NUM_TOKENS);
for (int i = 0; i < NUM_TOKENS; i++)
{
printf("%s\n", tokens[i]);
free(tokens[i]);
}
}
Now this works and should get me full credit but I hate this ternary that shouldn't be needed:
token = strtok(NULL, i < NUM_TOKENS - 1 ? ",\n" : "\n");
I'd like to replace the method with this version:
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
// Make an array the correct size to hold num_tokens
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
tokens[i] = strdup(strtok(NULL, "\n"));
free(copy);
return tokens;
}
This tickles my fancy much nicer since it is much easier to see that there is a final case. You also get rid of the strange ternary operator.
Sadly though, this segfaults! I can't for the life of me figure out why.
Edit: Add some output examples:
[11:56:06] gravypod:test git:(master*) $ ./test_no_fault
Hello
World
This
Is
Text
[11:56:10] gravypod:test git:(master*) $ ./test_seg_fault
[1] 3718 segmentation fault (core dumped) ./test_seg_fault
[11:56:14] gravypod:test git:(master*) $
Please check the return value from strtok before you risk passing NULL to another function. Your loop is calling strtok one more time than you think.
It is more usual to use this return value to control your loop, so that you are not at the mercy of your data. As for the delimiters, best to keep it simple and not try anything fancy.
char **split_line(const char *line, int num_tokens)
{
char *copy = strdup(line);
char **tokens = (char**) malloc(sizeof(char*) * num_tokens);
int i = 0;
char *token;
char delim1[] = ",\r\n";
char delim2[] = "\r\n";
char *delim = delim1; // start with a comma in the delimiter set
token = strtok(copy, delim);
while(token != NULL && i < num_tokens) { // strtok result controls the loop; never write past the array
tokens[i++] = strdup(token);
if(i == num_tokens - 1) {
delim = delim2; // switch delimiters before fetching the last token
}
token = strtok(NULL, delim);
}
free(copy);
return tokens;
}
Note that you should also check the return values from malloc and strdup, and free your memory properly.
When you get to the last loop, you'll get
for (char *token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
loop body
loop increment step, i.e. token = strtok(NULL, ",\n") (with the wrong second arg)
loop continuation check i < NUM_TOKENS - 1
i.e. it has still called strtok even though you're now out-of-range. You've also got an off-by-one on your array indices here: you'd want to initialise i=0 not 1.
You could avoid this by e.g.
making the initial strtok a special case outside the loop, e.g.
int i = 0;
tokens[i++] = strdup(strtok(copy, ",\n"));
then moving the strtok(NULL, ",\n") inside the loop
I'm also surprised you want the \n there at all, or even need to call the last strtok (wouldn't that already just point to the rest of the string? If you're just trying to chop a trailing newline there are easier ways), but I haven't used strtok in years.
(As an aside you're also not freeing the malloced array you store the string pointers in. That said since it's the end of the program at that point that doesn't matter so much.)
Remember that strtok identifies a token when it finds any of the characters in the delimiter string (the second argument to strtok()) - it doesn't try to match the entire delimiter string itself.
Thus, the ternary operator was never needed in the first place - the string will be tokenized based on the occurrence of , OR \n in the input string, so the following works:
for (token = strtok(copy, ",\n"); i < NUM_TOKENS; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
The second example segfaults because it's already tokenized the input to the end of the string by the time it exits the for loop. Calling strtok() again sets token to NULL, and the segfault is generated when strdup() is called on the NULL pointer. Removing the extra call to strtok gives the expected results:
for (token = strtok(copy, ",\n"); i < NUM_TOKENS - 1; token = strtok(NULL, ",\n"))
{
tokens[i++] = strdup(token);
}
tokens[i] = strdup(token);
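For input where the last field really does contain extra commas (the "A lot, of Text" line from the question), the delimiter set has to change before the last field is fetched, and every strtok result should be checked before it is copied. A minimal sketch under those assumptions follows; split_fields is an illustrative name, and treating a short line as an error is a design choice that the original code does not make.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split `line` into exactly `n` fields; the last field keeps any commas.
   Returns NULL if the line has fewer than n fields or memory runs out. */
char **split_fields(const char *line, size_t n)
{
    assert(line != NULL && n > 0);
    size_t len = strlen(line) + 1;
    char *copy = malloc(len);
    char **fields = calloc(n, sizeof *fields);
    if (copy == NULL || fields == NULL) {
        free(copy);
        free(fields);
        return NULL;
    }
    memcpy(copy, line, len);

    size_t i = 0;
    char *next = copy;          /* first call gets the string, later calls NULL */
    while (i < n) {
        /* Only the last field is read up to the end of the line. */
        const char *delim = (i == n - 1) ? "\n" : ",\n";
        char *tok = strtok(next, delim);
        next = NULL;
        if (tok == NULL) break;              /* fewer fields than expected */
        size_t tlen = strlen(tok) + 1;
        fields[i] = malloc(tlen);
        if (fields[i] == NULL) break;
        memcpy(fields[i], tok, tlen);
        i++;
    }
    free(copy);
    if (i < n) {                             /* clean up a partial result */
        while (i > 0) free(fields[--i]);
        free(fields);
        return NULL;
    }
    return fields;
}
```

The caller frees each of the n fields and then the array. Since strtok is never called once n fields have been collected, there is no extra call that could return NULL unnoticed.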

C: Losing content of char** after end of function [duplicate]

I have a problem I can't solve. I split a string into substrings and put these substrings in an array. Everything goes fine until the search function ends. The strtok function produces the right substrings and everything is put into the array correctly, but when the function ends the array loses all its content. I've tried a lot of different things but nothing seems to work. I want the words array to keep its content when the search function ends and returns to main.
int main(void)
{
char** words=NULL;
char argument[26] = "just+an+example";
search(argument, words);
}
search(char* argument, char** words)
{
char* p = strtok (argument, "+");
int n_spaces = 0;
while (p)
{
words = realloc(words, sizeof(char*)* ++n_spaces);
if (words == NULL)
exit(-1); // memory allocation failed
words[n_spaces-1] = p;
p = strtok(NULL, "+");
}
// realloc one extra element for the last NULL
words = realloc(words, sizeof(char*)* (n_spaces+1));
words[n_spaces] = 0;
}
I'm guessing that you want words in main to point to an array of pointers to the places where the delimiter is. You need to pass in the address of the variable words to search, and inside search, modify the memory pointed at by the variable words.
Try this:
int main(void)
{
char** words = NULL;
char argument[26] = "just+an+example";
search(argument, &words);
}
search(char* argument, char*** words)
{
char* p = strtok (argument, "+");
int n_spaces = 0;
while (p)
{
*words = realloc(*words, sizeof(char*) * ++n_spaces);
if (*words == NULL)
exit(-1); // memory allocation failed
(*words)[n_spaces-1] = p;
p = strtok(NULL, "+");
}
// realloc one extra element for the last NULL
*words = realloc(*words, sizeof(char*)* (n_spaces+1));
(*words)[n_spaces] = 0;
}
I didn't review your logic in search at all, so you may not be done debugging yet.
I was doing a few things wrong. First of all, in the main function when I called the search function I had to pass the address of my array (&words). My second mistake was that instead of copying the substrings themselves I copied the pointers to the substrings. At the end of the function these pointers are freed so my array lost its content at the end of the function. To fix this I had to malloc every time I wanted to copy a new string to my array and use strcpy to copy the string that the pointer points to into my array.
int main(void)
{
char** words = NULL;
char argument[26] = "just+an+example";
search(argument, &words);
}
search(char* argument, char*** words)
{
char* p = strtok (argument, "+");
int n_spaces = 0;
while (p)
{
*words = realloc(*words, sizeof(char*) * ++n_spaces);
if (*words == NULL)
exit(-1); // memory allocation failed
(*words)[n_spaces - 1] = malloc(sizeof(char)* (strlen(p) + 1));
strcpy((*words)[n_spaces - 1], p);
p = strtok(NULL, "+");
}
}
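An alternative to the char*** parameter is to return the array from the function, which avoids the extra level of indirection. The sketch below is one way to do that; split_words is an illustrative name. It stores pointers into the caller's buffer, which stay valid as long as that buffer does (here, the argument array in main), so no per-token copy is made in this variant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a NULL-terminated array of pointers into `argument`
   (which strtok modifies in place), or NULL on allocation failure.
   Only the returned array needs to be freed. */
char **split_words(char *argument, const char *delim)
{
    assert(argument != NULL && delim != NULL);
    char **words = NULL;
    size_t n = 0;
    char *p = strtok(argument, delim);
    for (;;) {
        /* One more slot for this entry: a token or the final NULL. */
        char **grown = realloc(words, (n + 1) * sizeof *words);
        if (grown == NULL) {
            free(words);
            return NULL;
        }
        words = grown;
        words[n++] = p;
        if (p == NULL) break;                /* terminator stored, done */
        p = strtok(NULL, delim);
    }
    return words;
}
```

In main this would be used as char **words = split_words(argument, "+"); followed by free(words) when done. If the tokens must outlive the input buffer, each one would need to be copied as in the answer above.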

Split and Join strings in C Language

I learnt C in uni but haven't used it for quite a few years. Recently I started working on a tool which uses C as the programming language. Now I'm stuck with some really basic functions. Among them are how to split and join strings using a delimiter? (I miss Python so much, even Java or C#!)
Below is the function I created to split a string, but it does not seem to work properly. Also, even if this function works, the delimiter can only be a single character. How can I use a string as a delimiter?
Can someone please provide some help?
Ideally, I would like to have 2 functions:
// Split a string into a string array
char** fSplitStr(char *str, const char *delimiter);
// Join the elements of a string array to a single string
char* fJoinStr(char **str, const char *delimiter);
Thank you,
Allen
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
char** fSplitStr(char *str, const char *delimiters)
{
char * token;
char **tokenArray;
int count=0;
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
if (!token) {
return tokenArray;
}
while (token != NULL ) { // While valid tokens are returned
tokenArray[count] = (char*)malloc(sizeof(token));
tokenArray[count] = token;
printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * count);
token = (char *)strtok(NULL, delimiters); // Get the next token
}
return tokenArray;
}
int main (void)
{
char str[] = "Split_The_String";
char ** splitArray = fSplitStr(str,"_");
printf ("%s", splitArray[0]);
printf ("%s", splitArray[1]);
printf ("%s", splitArray[2]);
return 0;
}
Answers: (Thanks to Moshbear, Joachim and sarnold):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char** fStrSplit(char *str, const char *delimiters)
{
char * token;
char **tokenArray;
int count=0;
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
tokenArray[0] = NULL;
if (!token) {
return tokenArray;
}
while (token != NULL) { // While valid tokens are returned
tokenArray[count] = (char*)strdup(token);
//printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * (count + 1));
token = (char *)strtok(NULL, delimiters); // Get the next token
}
tokenArray[count] = NULL; /* Terminate the array */
return tokenArray;
}
char* fStrJoin(char **str, const char *delimiters)
{
char *joinedStr;
int i = 1;
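if (str[0] == NULL){ /* Empty array: check before calling strlen on str[0] */
joinedStr = malloc(1);
if (joinedStr != NULL) joinedStr[0] = '\0';
return joinedStr;
}
joinedStr = malloc(strlen(str[0])+1);
strcpy(joinedStr, str[0]);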
while (str[i] !=NULL){
joinedStr = (char*)realloc(joinedStr, strlen(joinedStr) + strlen(str[i]) + strlen(delimiters) + 1);
strcat(joinedStr, delimiters);
strcat(joinedStr, str[i]);
i++;
}
return joinedStr;
}
int main (void)
{
char str[] = "Split_The_String";
char ** splitArray = (char **)fStrSplit(str,"_");
char * joinedStr;
int i=0;
while (splitArray[i]!=NULL) {
printf ("%s", splitArray[i]);
i++;
}
joinedStr = fStrJoin(splitArray, "-");
printf ("%s", joinedStr);
return 0;
}
Use strpbrk instead of strtok, because strtok suffers from two weaknesses:
it's not re-entrant (and therefore not thread-safe)
it modifies the string
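A rough sketch of what a strpbrk-based tokenizer could look like is shown here. The function name next_token and its cursor-style interface are invented for this example; the point is only that the input is never written to and that all state lives in the caller's cursor variable rather than in hidden static storage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy the next token found at *cursor into a new malloc'd string and
   advance *cursor past it. Returns NULL when no tokens remain or when
   malloc fails. Runs of delimiters are skipped, as with strtok. */
char *next_token(const char **cursor, const char *delims)
{
    assert(cursor != NULL && *cursor != NULL && delims != NULL);
    const char *start = *cursor + strspn(*cursor, delims); /* skip delimiters */
    if (*start == '\0') {
        *cursor = start;
        return NULL;                         /* nothing left */
    }

    const char *end = strpbrk(start, delims);    /* first delimiter, or NULL */
    size_t len = (end != NULL) ? (size_t)(end - start) : strlen(start);

    char *tok = malloc(len + 1);
    if (tok == NULL) return NULL;
    memcpy(tok, start, len);
    tok[len] = '\0';
    *cursor = start + len;
    return tok;
}
```

Usage would be a loop such as const char *cur = str; while ((tok = next_token(&cur, "_")) != NULL) { ...; free(tok); }. Because the input is const, string literals can be tokenized directly.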
For joining, use strncat to append and realloc to resize.
The order of operations is very important.
Before doing the realloc;strncat loop, set the 0th element of the target string to '\0' so that strncat won't cause undefined behavior.
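As a variation on the realloc-and-strncat loop described above, the total length can be computed up front so that a single allocation suffices. Note that this swaps in a different technique (one malloc plus memcpy) rather than implementing the loop as described. The name join_strings is hypothetical, and the input is assumed to be a NULL-terminated array like the one fStrSplit returns.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join a NULL-terminated array of strings, placing `sep` between elements.
   Returns a malloc'd string ("" for an empty array) or NULL on failure. */
char *join_strings(char *const *parts, const char *sep)
{
    assert(parts != NULL && sep != NULL);
    size_t sep_len = strlen(sep);
    size_t total = 1;                        /* terminating '\0' */
    for (size_t i = 0; parts[i] != NULL; i++)
        total += strlen(parts[i]) + (i > 0 ? sep_len : 0);

    char *out = malloc(total);
    if (out == NULL) return NULL;

    char *p = out;                           /* write position */
    for (size_t i = 0; parts[i] != NULL; i++) {
        if (i > 0) {
            memcpy(p, sep, sep_len);
            p += sep_len;
        }
        size_t n = strlen(parts[i]);
        memcpy(p, parts[i], n);
        p += n;
    }
    *p = '\0';
    return out;
}
```

Tracking the write position avoids rescanning the destination on every append, which is what repeated strcat or strncat calls do.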
For starters, don't use sizeof to get the length of a string. strlen is the function to use. In this case strdup is better.
And you don't actually copy the string returned by strtok, you copy the pointer. Change your loop to this:
while (token != NULL) { // While valid tokens are returned
tokenArray[count] = strdup(token);
printf ("%s", tokenArray[count]);
count++;
tokenArray = (char **)realloc(tokenArray, sizeof(char *) * (count + 1)); // +1 leaves room for the next token or the final NULL
token = (char *)strtok(NULL, delimiters); // Get the next token
}
tokenArray[count] = NULL; /* Terminate the array */
Also, don't forget to free the entries in the array, and the array itself when you're done with it.
Edit: At the beginning of fSplitStr, hold off on allocating tokenArray until after you have checked that token is not NULL; and if token is NULL, why not return NULL?
I'm not sure the best solution for you, but I do have a few notes:
token = (char *)strtok(str, delimiters); // Get the first token
tokenArray = (char**)malloc(1 * sizeof(char*));
if (!token) {
return tokenArray;
}
At this point, if you weren't able to find any tokens in the string, you return a pointer to an "array" that is large enough to hold a single character pointer. It is un-initialized, so it would not be a good idea to use the contents of this array in any way. C almost never initializes memory to 0x00 for you. (calloc(3) would do that for you, but since you need to overwrite every element anyway, it doesn't seem worth switching to calloc(3).)
Also, the (char **) cast before the malloc(3) call indicates to me that you've probably forgotten the #include <stdlib.h> that would properly prototype malloc(3). (The cast was necessary before about 1989.)
Do note that your while() { } loop is setting pointers to the parts of the original input string to your tokenArray elements. (This is one of the cons that moshbear mentioned in his answer -- though it isn't always a weakness.) If you change tokenArray[1][1]='H', then your original input string also changes. (In addition to having each of the delimiter characters replaced with an ASCII NUL character.)
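The question also asks how to split on a delimiter that is a whole string rather than a set of characters, which strtok cannot do. One possible approach is to search with strstr, as in the sketch below. The name split_on_string is made up, and keeping empty fields (unlike strtok, which skips them) is a design choice for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split `str` on every occurrence of the whole string `delim`.
   Empty fields are kept. Returns a NULL-terminated array of malloc'd
   strings, or NULL on allocation failure. `delim` must not be empty. */
char **split_on_string(const char *str, const char *delim)
{
    assert(str != NULL && delim != NULL && delim[0] != '\0');
    size_t dlen = strlen(delim);
    char **out = NULL;
    size_t count = 0;
    for (;;) {
        const char *hit = strstr(str, delim);    /* next delimiter, or NULL */
        size_t len = (hit != NULL) ? (size_t)(hit - str) : strlen(str);

        /* Room for this piece plus the terminating NULL. */
        char **grown = realloc(out, (count + 2) * sizeof *out);
        char *piece = malloc(len + 1);
        if (grown != NULL) out = grown;
        if (grown == NULL || piece == NULL) {
            free(piece);
            while (count > 0) free(out[--count]);
            free(out);
            return NULL;
        }
        memcpy(piece, str, len);
        piece[len] = '\0';
        out[count++] = piece;
        out[count] = NULL;
        if (hit == NULL) break;                  /* that was the last piece */
        str = hit + dlen;                        /* continue after delimiter */
    }
    return out;
}
```

The input is not modified, and the result can be walked until the NULL entry, freeing each string and then the array.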

Resources