strtok segmentation fault - c

I am trying to understand why the following snippet of code is giving a segmentation fault:
void tokenize(char* line)
{
    char* cmd = strtok(line, " ");
    while (cmd != NULL)
    {
        printf("%s\n", cmd);
        cmd = strtok(NULL, " ");
    }
}
int main(void)
{
    tokenize("this is a test");
}
I know that strtok() does not actually tokenize on string literals, but in this case, line points directly to the string "this is a test" which is internally an array of char. Is there any way of tokenizing line without copying it into an array?

The problem is that you're attempting to modify a string literal. Doing so causes your program's behavior to be undefined.
Saying that you're not allowed to modify a string literal is an oversimplification. Saying that string literals are const is incorrect; they're not.
WARNING : Digression follows.
The string literal "this is a test" is an expression of type char[15] (14 for the length, plus 1 for the terminating '\0'). In most contexts, including this one, such an expression is implicitly converted to a pointer to the first element of the array, of type char*.
The behavior of attempting to modify the array referred to by a string literal is undefined -- not because it's const (it isn't), but because the C standard specifically says that it's undefined.
Some compilers might permit you to get away with this. Your code might actually modify the static array corresponding to the literal (which could cause great confusion later on).
Most modern compilers, though, will store the array in read-only memory -- not physical ROM, but in a region of memory that's protected from modification by the virtual memory system. The result of attempting to modify such memory is typically a segmentation fault and a program crash.
So why aren't string literals const? Since you really shouldn't try to modify them, it would certainly make sense -- and C++ does make string literals const. The reason is historical. The const keyword didn't exist before it was introduced by the 1989 ANSI C standard (though it was probably implemented by some compilers before that). So a pre-ANSI program might look like this:
#include <stdio.h>
print_string(s)
char *s;
{
    printf("%s\n", s);
}
main()
{
    print_string("Hello, world");
}
There was no way to enforce the fact that print_string isn't allowed to modify the string pointed to by s. Making string literals const in ANSI C would have broken existing code, which the ANSI C committee tried very hard to avoid doing. There hasn't been a good opportunity since then to make such a change to the language. (The designers of C++, mostly Bjarne Stroustrup, weren't as concerned about backward compatibility with C.)

There's a very good reason that trying to tokenize a compile-time constant string will cause a segmentation fault: the constant string is in read-only memory.
The C compiler bakes compile-time constant strings into the executable, and the operating system loads them into read-only memory (.rodata in a *nix ELF file). Since this memory is marked as read-only, and since strtok writes into the string that you pass into it, you get a segmentation fault for writing into read-only memory.

As you said, you can't modify a string literal, which is what strtok does. You have to do
char str[] = "this is a test";
tokenize(str);
This creates the array str and initialises it with this is a test\0, and passes a pointer to it to tokenize.

strtok modifies its first argument in order to tokenize it. Hence you can't pass it a string literal: although its type in C is an array of char rather than const char *, modifying it is undefined behaviour. You have to copy the string literal into a char array that can be modified.

What point are you trying to make by your "...is internally an array of char" remark?
The fact that "this is a test" is internally an array of char does not change anything at all. It is still a string literal (all string literals are non-modifiable arrays of char). Your strtok still tries to tokenize a string literal. This is why it crashes.

I'm sure you'll get beaten up about this... but "strtok()" is inherently unsafe and prone to things like access violations.
Here, the cause is almost certainly the use of a string constant.
Try this instead:
#include <stdio.h>
#include <string.h>

void tokenize(char* line)
{
    char* cmd = strtok(line, " ");
    while (cmd != NULL)
    {
        printf("%s\n", cmd);
        cmd = strtok(NULL, " ");
    }
}
int main(void)
{
    char buff[80];  /* writable buffer, large enough for the text below */
    strcpy(buff, "this is a test");
    tokenize(buff);
}

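I also had big trouble with this error.
I found a simple solution:
please include <string.h>.
That removed the strtok segmentation fault. Without the header, older compilers assume that strtok returns int, which can truncate the returned pointer on 64-bit systems.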

I just hit the Segmentation Fault error from trying to use printf to print the token (cmd in your case) after it became NULL.

Related

Is String Literal in C really not modifiable?

As far as I know, a string literal can't be modified for example:
char* a = "abc";
a[0] = 'c';
That would not work since string literal is read-only. I can only modify it if:
char a[] = "abc";
a[0] = 'c';
However, in this post,
Parse $PATH variable and save the directory names into an array of strings, the first answer modified a string literal at these two places:
path_var[j]='\0';
array[current_colon] = path_var+j+1;
I'm not very familiar with C so any explanation would be appreciated.
In programming, there are quite a few rules that are up to you to follow, even though they are not — necessarily — enforced. And "String literals in C are not modifiable" is one of those. So is "Strings returned by getenv should not be modified".
There are some real-world analogies that apply. Here's one: If you're at an intersection, and the light is red, you're not supposed to cross. But, much of the time, if you break the rule, and cross, you might get away with it. You might get a ticket from a policeman — or you might not. You might cause a crash — or you might not. But if you get lucky, and neither of these things happens, that does not imply that crossing the intersection against the red light was okay — it's still quite true that it was very much against the rules.
Similarly, in C, if you write some code that modifies a string literal, or a string returned from getenv, you might get away with it. The compiler might give you a warning or error message — or it might not. Your program might crash — or it might not. But if the program seems to work, that does not imply that these strings are actually modifiable — they're not.
Code blocks from the post you linked:
const char *orig_path_var = getenv("PATH");
char *path_var = strdup(orig_path_var ? orig_path_var : "");
const char **array;
array = malloc((nb_colons+1) * sizeof(*array));
array[0] = path_var;
array[current_colon] = path_var+j+1;
First block:
In the 1st line getenv() returns a pointer to a string which is pointed to by orig_path_var. The string that getenv() returns should be treated as a read-only string as the behaviour is undefined if the program attempts to modify it.
In the 2nd line strdup() is called to make a duplicate of this string. The way strdup() does this is by calling malloc() and allocating memory for the size of the string + 1 and then copying the string into the memory.
Since malloc() is used, the string is stored on the heap, this allows us to edit the string and modify it.
Second block:
In the 1st line we can see that array points to an array of char * pointers. There are nb_colons+1 pointers in the array.
Then in the 2nd line the 0th element of array is initialized to path_var (remember it is not the string returned by getenv(), but a copy of it).
In the 3rd line, the current_colonth element of array is set to path_var+j+1. If you don't understand pointer arithmetic, this just means it assigns the address of the j+1th char of path_var to array[current_colon].
As you can see, the code is not operating on read-only strings like the one orig_path_var points to. Instead it uses a copy made with strdup(). This seems to be where your confusion stems from so take a look at this:
char *strdup(const char *s);
The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).
The above text shows what strdup() does according to its man page.
It may also help to read the malloc() man page.
In the example
char* a = "abc";
the token "abc" produces a literal object in the program image, and denotes an expression which yields that object's address.
In the example
char a[] = "abc";
The token "abc" serves as an array initializer, and doesn't denote a literal object. It is equivalent to:
char a[] = { 'a', 'b', 'c', 0 };
The individual character values of "abc" are literal data recorded somewhere and somehow in the program image, but they are not accessible as a string literal object.
The array a isn't a literal, needless to say. Modifying a doesn't constitute modifying a literal, because it isn't one.
Regarding the remark:
That would not work since string literal is read-only.
That isn't accurate. No version of the ISO C standard to date specifies any requirements for what happens if a program tries to modify a string literal. It is undefined behavior. If your implementation stops the program with some diagnostic message, that's because of undefined behavior, not because it is required.
C implementations are not required to support string literal modification, which has benefits like:
standard-conforming C programs can be translated into images that can be burned into ROM chips, such that their string literals are accessed directly from that ROM image without having to be copied into RAM on start-up.
compilers can condense the storage for string literals by taking advantage of situations when one literal is a suffix of another. The expression "string" + 2 == "ring" can yield true. Since a strictly conforming program will not do something like "ring"[0] = 'w', due to that being undefined behavior, such a program will thereby avoid falling victim to the surprise of "string" unexpectedly turning into "stwing".
There are several reasons why you had better not modify them:
First, the operating system and/or the compiler can enforce the non-writable property of string literals, putting them in read-only memory (e.g. ROM) or in the .text segment.
Second, the compiler is allowed to merge string literals together, so if you modify one (and do it successfully) you can get surprises later because other literals (that have been merged because e.g. one of them is a suffix of the other) change for no apparent reason.
If you need an initialized string that is modifiable, you can declare an array, which you can freely modify:
char array[100] = "abc"; // initialized to { 'a' ,'b', 'c', '\0',
// /* and 96 more '\0' characters */
// };

Segmentation fault of strcpy()

I was just going through C library functions to see what I can do with them. When I came across the strcpy function the code I wrote resulted in a segmentation fault and I would like to know why. The code I wrote should be printing WorldWorld. If I understood correctly, strcpy(x,y) will copy the contents of y into x.
main() {
    char *x = "Hello";
    char *y = "World";
    printf(strcpy(x,y));
}
If it worked, the code you wrote would print "World", not "WorldWorld". Nothing is appended, strcpy overwrites data only.
Your program crashes because "Hello" and "World" are string constants. It's undefined behavior to attempt to write to a constant, and in your case this manifests as a segmentation fault. You should use char x[] = "Hello"; and char y[] = "World"; instead, which reserve memory on the stack to hold the strings, where they can be overwritten.
There are more problems with your program, though:
First, you should never pass a variable string as the first argument to printf: either use puts, or use printf("%s", string). Passing a variable as a format string prevents compilers that support type-checking printf arguments from doing that verification, and it can transform into a serious vulnerability if users can control it.
Second, you should never use strcpy. strcpy will happily overrun buffers, which is another major security vulnerability. For instance, if you wrote:
char foo[] = "foo";
strcpy(foo, "this string is waaaaaay too long");
return;
you would cause undefined behavior, your program would likely crash again, and you would be opening the door to other serious vulnerabilities that you can avoid by specifying the size of the destination buffer.
AFAIK, there is actually no standard C function that will decently copy strings, but the least bad one would be strlcpy, which additionally requires a size argument.

When to allocate memory to char *

I am bit confused when to allocate memory to a char * and when to point it to a const string.
Yes, I understand that if I wish to modify the string, I need to allocate it memory.
But in cases when I don't wish to modify the string to which I point and just need to pass the value should I just do the below? What are the disadvantages in the below steps as compared to allocating memory with malloc?
char *str = NULL;
str = "This is a test";
str = "Now I am pointing here";
Let's try your example again with the -Wwrite-strings compiler warning flag. You will see a warning:
warning: initialization discards 'const' qualifier from pointer target type
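This is because, with -Wwrite-strings, GCC gives the literal "This is a test" a const-qualified array type (const char[N]) instead of the plain char[N] it has in standard C. So you are losing the constness information when you assign the literal's address to the pointer.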
For historical reasons, compilers will allow you to store string literals which are constants in non-const variables.
This is, however, bad practice, and I suggest you use -Wwrite-strings all the time.
If you want to prove it for yourself, try to modify the string:
char *str = "foo";
str[0] = 'a';
This program's behavior is undefined, but you may see a segmentation fault on many systems.
Running this example with Valgrind, you will see the following:
Process terminating with default action of signal 11 (SIGSEGV)
Bad permissions for mapped region at address 0x4005E4
The problem is that the binary generated by your compiler will store the string literals in a memory location which is read-only. By trying to write in it you cause a segmentation fault.
What is important to understand is that you are dealing here with two different systems:
1. The C typing system which is something to help you to write correct code and can be easily "muted" (by casting, etc.)
2. The Kernel memory page permissions which are here to protect your system and which shall always be honored.
Again, for historical reasons, this is a point where 1. and 2. do not agree. Or to be more clear, 1. is much more permissive than 2. (resulting in your program being killed by the kernel).
So don't be fooled by the compiler, the string literals you are declaring are really constant and you cannot do anything about it!
As for your pointer str itself, reading and writing it is OK.
However, to write correct code, it should be a const char * and not a char *. With the following change, your example is a valid piece of C:
const char *str = "some string";
str = "some other string";
(const char * pointer to a const string)
In this case, the compiler does not emit any warning. What you write and what will be in memory once the code is executed will match.
Note: a const pointer to a const string is written const char *const:
const char *const str = "foo";
The rule of thumb is: always be as constant as possible.
If you need to modify the string, use dynamic allocation (malloc() or, better, some higher-level string manipulation function from the libc such as strdup). If you don't need to, use a string literal.
If you know that str will always be read-only, why not declare it as such?
char const * str = NULL;
/* OR */
const char * str = NULL;
Well, actually there is one reason why this may be difficult - when you are passing the string to a read-only function that does not declare itself as such. Suppose you are using an external library that declares this function:
int countLettersInString(char c, char * str);
/* returns the number of times `c` occurs in `str`, or -1 if `str` is NULL. */
This function is well-documented and you know that it will not attempt to change the string str - but if you call it with a constant string, your compiler might give you a warning! You know there is nothing dangerous about it, but your compiler does not.
Why? Because as far as the compiler is concerned, maybe this function does try to modify the contents of the string, which would cause your program to crash. Maybe you rely very heavily on this library and there are lots of functions that all behave like this. Then maybe it's easier not to declare the string as const in the first place - but then it's all up to you to make sure you don't try to modify it.
On the other hand, if you are the one writing the countLettersInString function, then simply make sure the compiler knows you won't modify the string by declaring it with const:
int countLettersInString(char c, char const * str);
That way it will accept both constant and non-constant strings without issue.
One disadvantage of using string-literals is that they have length restrictions.
So you should keep in mind from the document ISO/IEC:9899
(emphasis mine)
5.2.4.1 Translation limits
1 The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:
[...]
— 4095 characters in a character string literal or wide string literal (after concatenation)
So if your constant text exceeds this count (which may sometimes happen, especially if you write a dynamic web server in C), you should not rely on the string literal approach if you want to stay system independent.
There is no problem in your code as long as you are not planning to modify the contents of that string. Also, the memory for such string literals will remain for the full lifetime of the program. The memory allocated by malloc is read-write, so you can manipulate the contents of that memory.
If you have a string literal that you do not want to modify, what you are doing is ok:
char *str = NULL;
str = "This is a test";
str = "Now I am pointing here";
Here str is a pointer. In the second line you make it point to the literal "This is a test", and in the third line you make it point to "Now I am pointing here". Only the pointer changes; no string data is written. This is legal in C.
You may find it a bit contradictory, but you can't modify the string itself, as in this:
str[0]='X' // will give a problem.
However, if you want to be able to modify it, use it as a buffer to hold a line of input and so on, use malloc:
char *str=malloc(BUFSIZE); // BUFSIZE size what you want to allocate
free(str); // freeing memory
Use malloc() when you don't know the amount of memory needed during compile time.
It is legal in C unfortunately, but any attempt to modify the string literal via the pointer will result in undefined behavior.
Say
str[0] = 'Y'; //No compiler error, undefined behavior
It will run fine, but you may get a warning by the compiler, because you are pointing to a constant string.
P.S.: It will run OK only when you are not modifying it. So the only disadvantage of not using malloc is that you won't be able to modify it.

bus error when trying to access character on a string in C

I have used this line of code many times (update: when string was a parameter to the function!), however when I try to do it now I get a bus error (both with gcc and clang). I am reproducing the simplest possible code;
char *string = "this is a string";
char *p = string;
p++;
*p='x'; //this line will cause the Bus error
printf("string is %s\n",string);
Why am I unable to change the second character of the string using the p pointer?
You are trying to modify read only memory (where that string literal is stored). You can use a char array instead if you need to modify that memory.
char str[] = "This is a string";
str[0] = 'S'; /* works */
I have used this line of code many times..
I sure hope not. At best you would get a segfault (I say "at best" because attempting to modify a string literal is undefined behavior, in which case anything can happen, and a crash is the best thing that can happen).
When you declare a pointer to a string literal it points to read only memory in the data segment (look at the assembly output if you like). Declaring your type as a char[] will copy that literal onto the function's stack, which will in turn allow it to be modified if needed.

C's strtok() and read only string literals

char *strtok(char *s1, const char *s2)
repeated calls to this function break string s1 into "tokens"--that is the string is broken into substrings, each terminating with a '\0', where the '\0' replaces any characters contained in string s2. The first call uses the string to be tokenized as s1; subsequent calls use NULL as the first argument. A pointer to the beginning of the current token is returned; NULL is returned if there are no more tokens.
Hi,
I have been trying to use strtok just now and found out that if I pass in a char* into s1, I get a segmentation fault. If I pass in a char[], strtok works fine.
Why is this?
I googled around and the reason seems to be something about how char* is read only and char[] is writeable. A more thorough explanation would be much appreciated.
What did you initialize the char * to?
If something like
char *text = "foobar";
then you have a pointer to some read-only characters
For
char text[7] = "foobar";
then you have a seven element array of characters that you can do what you like with.
strtok writes into the string you give it - overwriting the separator character with null and keeping a pointer to the rest of the string.
Hence, if you pass it a read-only string, it will attempt to write to it, and you get a segfault.
Also, because strtok keeps a reference to the rest of the string, it's not reentrant - you can use it only on one string at a time. It's best avoided, really - consider strsep(3) instead - see, for example, here: http://www.rt.com/man/strsep.3.html (although that still writes into the string so has the same read-only/segfault issue)
An important point that's implied but not stated explicitly:
Based on your question, I'm guessing that you're fairly new to programming in C, so I'd like to explain a little more about your situation. Forgive me if I'm mistaken; C can be hard to learn mostly because of subtle misunderstanding in underlying mechanisms so I like to make things as plain as possible.
As you know, when you write out your C program the compiler pre-creates everything for you based on the syntax. When you declare a variable anywhere in your code, e.g.:
int x = 0;
The compiler reads this line of text and says to itself: OK, I need to replace all occurrences in the current code scope of x with a constant reference to a region of memory I've allocated to hold an integer.
When your program is run, this line leads to a new action: I need to set the region of memory that x references to int value 0.
Note the subtle difference here: the memory location that reference point x holds is constant (and cannot be changed). However, the value that x points to can be changed. You do it in your code through assignment, e.g. x = 15;. Also note that the single line of code actually amounts to two separate commands to the compiler.
When you have a statement like:
char *name = "Tom";
The compiler's process is like this: OK, I need to replace all occurrences in the current code scope of name with a constant reference to a region of memory I've allocated to hold a char pointer value. And it does so.
But there's that second step, which amounts to this: I need to create a constant array of characters which holds the values 'T', 'o', 'm', and the terminating null character '\0'. Then I need to replace the part of the code which says "Tom" with the memory address of that constant string.
When your program is run, the final step occurs: setting the pointer to char's value (which isn't constant) to the memory address of that automatically created string (which is constant).
So a char * is not read-only. Only a const char * is read-only. But your problem in this case isn't that char *s are read-only, it's that your pointer references a read-only region of memory.
I bring all this up because understanding this issue is the barrier between you looking at the definition of that function from the library and understanding the issue yourself versus having to ask us. And I've somewhat simplified some of the details in the hopes of making the issue more understandable.
I hope this was helpful. ;)
I blame the C standard.
char *s = "abc";
could have been defined to give the same error as
const char *cs = "abc";
char *s = cs;
on grounds that string literals are unmodifiable. But it wasn't, it was defined to compile. Go figure. [Edit: Mike B has gone figured - "const" didn't exist at all in K&R C. ISO C, plus every version of C and C++ since, has wanted to be backward-compatible. So it has to be valid.]
If it had been defined to give an error, then you couldn't have got as far as the segfault, because strtok's first parameter is char*, so the compiler would have prevented you passing in the pointer generated from the literal.
It may be of interest that there was at one time a plan in C++ for this to be deprecated (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/1996/N0896.asc). But 12 years later I can't persuade either gcc or g++ to give me any kind of warning for assigning a literal to non-const char*, so it isn't all that loudly deprecated.
[Edit: aha: -Wwrite-strings, which isn't included in -Wall or -Wextra]
In brief:
char *s = "HAPPY DAY";
printf("\n %s ", s);
s = "NEW YEAR"; /* Valid */
printf("\n %s ", s);
s[0] = 'c'; /* Invalid */
If you look at your compiler documentation, odds are there is an option you can set to make those strings writable.

Resources