Is modifying a string pointed to by a pointer valid? - c

Here's a simple example of a program that concatenates two strings.
#include <stdio.h>
void strcat(char *s, char *t);
void strcat(char *s, char *t) {
while (*s++ != '\0');
s--;
while ((*s++ = *t++) != '\0');
}
int main() {
char *s = "hello";
strcat(s, " world");
while (*s != '\0') {
putchar(*s++);
}
return 0;
}
I'm wondering why it works. In main(), I have a pointer to the string "hello". According to the K&R book, modifying a string like that is undefined behavior. So why is the program able to modify it by appending " world"? Or is appending not considered as modifying?

Undefined behavior means a compiler can emit code that does anything. Working is a subset of undefined.

I +1'd MSN, but as for why it works, it's because nothing has come along to fill the space behind your string yet. Declare a few more variables, add some complexity, and you'll start to see some wackiness.

Perhaps surprisingly, your compiler has allocated the literal "hello" into read/write initialized data instead of read-only initialized data. Your assignment clobbers whatever is adjacent to that spot, but your program is small and simple enough that you don't see the effects. (Put it in a for loop and see if you are clobbering the " world" literal.)
It fails on Ubuntu x64 because gcc puts string literals in read-only data, and when you try to write, the hardware MMU objects.

You were lucky this time.
Especially in debug mode some compilers will put spare memory (often filled with some obvious value) around declarations so you can find code like this.

It also depends on how the pointer is declared. For example, this can change ptr and what ptr points to:
char * ptr;
Can change what ptr points to, but not ptr:
char * const ptr;
Can change ptr, but not what ptr points to (char const * ptr is the same type):
const char * ptr;
Can't change anything:
const char * const ptr;
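As a rough illustration of what the compiler accepts for each declaration, here is a minimal sketch. The function name const_pointer_demo and the local arrays are made up for this example; the commented-out lines are the ones a conforming compiler would reject. Note that the pointers here refer to writable arrays, not to string literals.

```c
/* Hypothetical demo: applies the operations each declaration permits and
 * returns 1 if the results are as described in the comments. */
int const_pointer_demo(void)
{
    char a[] = "abc";
    char b[] = "xyz";

    const char *p = a;        /* pointer to const char */
    p = b;                    /* OK: the pointer itself may change */
    /* *p = 'Q'; */           /* rejected: pointee is const through p */

    char *const q = a;        /* const pointer to char */
    *q = 'Q';                 /* OK: the pointee may change */
    /* q = b; */              /* rejected: q itself is const */

    const char *const r = b;  /* neither the pointer nor the pointee may change */

    return *p == 'x' && *q == 'Q' && *r == 'x' && a[0] == 'Q';
}
```

The const qualifier only controls what the compiler lets you do through that pointer; it does not make a string literal writable.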

According to the C99 specification (C99: TC3, 6.4.5, §5), string literals are
[...] used to initialize an array of static storage duration and length just
sufficient to contain the sequence. [...]
which means they have the type char [], i.e. modification is possible in principle. Why you shouldn't do it is explained in §6:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
Different string literals with the same contents may - but don't have to - be mapped to the same memory location. As the behaviour is undefined, compilers are free to put them in read-only sections in order to cleanly fail instead of introducing possibly hard to detect error sources.

I'm wondering why it works
It doesn't. It causes a Segmentation Fault on Ubuntu x64; for code to work it shouldn't just work on your machine.
Moving the modified data to the stack gets around the data area protection in Linux:
int main() {
char b[] = "hello";
char c[] = " ";
char *s = b;
strcat(s, " world");
puts(b);
puts(c);
return 0;
}
Though you are then only safe as long as " world" fits in the unused space between stack data; change b to "hello to" and Linux detects the stack corruption:
*** stack smashing detected ***: bin/clobber terminated

The compiler is allowing you to modify s because you have improperly marked it as non-const -- a pointer to a static string like that should be
const char *s = "hello";
With the const modifier missing, you've basically disabled the safety that prevents you from writing into memory that you shouldn't write into. C does very little to keep you from shooting yourself in the foot. In this case you got lucky and only grazed your pinky toe.

s points to a bit of memory that holds "hello", but was not intended to contain more than that. This means that it is very likely that you will be overwriting something else. That is very dangerous, even though it may seem to work.
Two observations:
The * in *s-- is not necessary. s-- would suffice, because you only want to decrement the value.
You don't need to write strcat yourself. It already exists (you probably knew that, but I'm telling you anyway:-)).
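One way to append " world" without writing past the literal is to build the result in memory you own. The following is a minimal sketch under that assumption; concat_new is a hypothetical helper name, not something from the question or from the standard library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding a followed
 * by b, or NULL if allocation fails. The caller must free() the result. */
char *concat_new(const char *a, const char *b)
{
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    char *out = malloc(len_a + len_b + 1);  /* +1 for the terminator */
    if (out == NULL)
        return NULL;
    memcpy(out, a, len_a);
    memcpy(out + len_a, b, len_b + 1);      /* also copies b's '\0' */
    return out;
}
```

Calling concat_new("hello", " world") leaves both literals untouched and returns a separate buffer that is safe to modify and must later be passed to free().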

Related

Do I misunderstand this example about scope of string literals?

I was reading up on common C pitfalls and came up to this article on some famous Uni website. (It is the 2nd link that comes up on google).
The last example on that page is,
// Memory allocation on the stack
void b(char **p) {
char * str="print this string";
*p = str;
}
int main(void) {
char * s;
b(&s);
s[0]='j'; //crash, since the memory for str is allocated on the stack,
//and the call to b has already returned, the memory pointed to by str
//is no longer valid.
return 0;
}
That explanation in the comment got me thinking: isn't the memory for string literals static?
Isn't the actual error there then that you are not supposed to modify string literals, because it is undefined behavior? Or are the comments there correct and my understanding of that example is wrong?
Upon searching further, I saw this question: referencing a char that went out of scope and I understood from that question that, the following is valid code.
#include <malloc.h>
char* a = NULL;
{
char* b = "stackoverflow";
a = b;
}
int main() {
puts(a);
}
Also this question agrees with the other stackoverflow question and my thinking, but opposes the comment from that website's code.
To test it, I tried the following,
#include <stdio.h>
#include <malloc.h>
void b(char **p)
{
char * str = "print this string";
*p = str;
}
int main(void)
{
char * s;
b(&s);
// s[0]='j'; //crash, since the memory for str is allocated on the stack,
//and the call to b has already returned, the memory pointed to by str is no longer valid.
printf("%s \n", s);
return 0;
}
which as expected does not give a segmentation fault.
The standard says (emphasis mine):
6.4.5 String literals
[...] The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. [...]
[...] If the program attempts to
modify such an array, the behavior is undefined. [...]
No, you misunderstand the reason for crash. String literals have static duration, meaning that they exist for the lifetime of the program. Since your pointer points to the literal, you can use it anytime.
The reason for the crash is the fact that string literals are read-only. In fact char* x = "" is an error in C++, as it should be const char* x = "". They are read-only from language perspective, and any attempt to modify them would lead to undefined behavior.
In practical terms, they are often put in the read-only segment, so any attempt at modification triggers a GPF - general protection fault. Usual response to GPF is a program termination - and this is what you are witnessing with your application.
String literals are generally placed in a read-only section of the executable (.rodata in an ELF file), and under Linux, Windows, and macOS they end up in a memory region that generates a fault when written to (the OS configures this through the MMU or MPU when loading the program).
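The two points above (the literal outlives the function call, but must not be written to) can be sketched as follows. The names get_message and get_modified_message are hypothetical and only serve this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Returning the address of a string literal is fine: the literal has
 * static storage duration and outlives the call. */
const char *get_message(void)
{
    const char *str = "print this string";
    return str;
}

/* Hypothetical helper: returns a heap copy of the message with its first
 * character replaced. Returns NULL on allocation failure; the caller must
 * free() the result. */
char *get_modified_message(char first)
{
    const char *src = get_message();
    size_t n = strlen(src) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, n);
    copy[0] = first;  /* writes to our own buffer, not to the literal */
    return copy;
}
```

Only the pointer variable str lives on the stack; the characters it points to do not, which is why reading through the returned pointer is valid.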

When to allocate memory to char *

I am a bit confused about when to allocate memory for a char * and when to point it to a const string.
Yes, I understand that if I wish to modify the string, I need to allocate it memory.
But in cases when I don't wish to modify the string to which I point and just need to pass the value should I just do the below? What are the disadvantages in the below steps as compared to allocating memory with malloc?
char *str = NULL;
str = "This is a test";
str = "Now I am pointing here";
Let's try again your example with the -Wwrite-strings compiler warning flag, you will see a warning:
warning: initialization discards 'const' qualifier from pointer target type
This is because -Wwrite-strings makes the compiler treat "This is a test" as an array of const char, so assigning its address to a plain char * drops the const qualifier. (In standard C the literal's type is a non-const array of char, but modifying it is still undefined behavior.)
For historical reasons, compilers will allow you to store the address of a string literal, which is effectively constant, in a pointer to non-const.
This is, however, bad practice, and I suggest you use -Wwrite-strings all the time.
If you want to prove it for yourself, try to modify the string:
char *str = "foo";
str[0] = 'a';
This program behavior is undefined but you may see a segmentation fault on many systems.
Running this example with Valgrind, you will see the following:
Process terminating with default action of signal 11 (SIGSEGV)
Bad permissions for mapped region at address 0x4005E4
The problem is that the binary generated by your compiler will store the string literals in a memory location which is read-only. By trying to write in it you cause a segmentation fault.
What is important to understand is that you are dealing here with two different systems:
The C typing system which is something to help you to write correct code and can be easily "muted" (by casting, etc.)
The Kernel memory page permissions which are here to protect your system and which shall always be honored.
Again, for historical reasons, this is a point where 1. and 2. do not agree. Or to be more clear, 1. is much more permissive than 2. (resulting in your program being killed by the kernel).
So don't be fooled by the compiler, the string literals you are declaring are really constant and you cannot do anything about it!
Reading and reassigning the pointer str itself is fine.
However, to write correct code, it should be a const char * and not a char *. With the following change, your example is a valid piece of C:
const char *str = "some string";
str = "some other string";
(const char * pointer to a const string)
In this case, the compiler does not emit any warning. What you write and what will be in memory once the code is executed will match.
Note: A const pointer to a const string being const char *const:
const char *const str = "foo";
The rule of thumb is: always be as constant as possible.
If you need to modify the string, use dynamic allocation (malloc() or better, some higher level string manipulation function such as strdup, etc. from the libc), if you don't need to, use a string literal.
If you know that str will always be read-only, why not declare it as such?
char const * str = NULL;
/* OR */
const char * str = NULL;
Well, actually there is one reason why this may be difficult - when you are passing the string to a read-only function that does not declare itself as such. Suppose you are using an external library that declares this function:
int countLettersInString(char c, char * str);
/* returns the number of times `c` occurs in `str`, or -1 if `str` is NULL. */
This function is well-documented and you know that it will not attempt to change the string str - but if you call it with a constant string, your compiler might give you a warning! You know there is nothing dangerous about it, but your compiler does not.
Why? Because as far as the compiler is concerned, maybe this function does try to modify the contents of the string, which would cause your program to crash. Maybe you rely very heavily on this library and there are lots of functions that all behave like this. Then maybe it's easier not to declare the string as const in the first place - but then it's all up to you to make sure you don't try to modify it.
On the other hand, if you are the one writing the countLettersInString function, then simply make sure the compiler knows you won't modify the string by declaring it with const:
int countLettersInString(char c, char const * str);
That way it will accept both constant and non-constant strings without issue.
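A minimal sketch of what such a const-correct function might look like, assuming the behavior described in the comment above (count occurrences, return -1 for NULL). The body is an illustration, not the code of any real library.

```c
#include <stddef.h>

/* Returns the number of times c occurs in str, or -1 if str is NULL.
 * The const qualifier promises the caller that str is only read. */
int countLettersInString(char c, char const *str)
{
    if (str == NULL)
        return -1;
    int count = 0;
    for (; *str != '\0'; ++str) {
        if (*str == c)
            ++count;
    }
    return count;
}
```

With this signature, both countLettersInString('l', "hello world") and a call passing a writable char array compile without a qualifier warning.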
One disadvantage of using string-literals is that they have length restrictions.
So you should keep in mind from the document ISO/IEC:9899
(emphasis mine)
5.2.4.1 Translation limits
1 The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:
[...]
— 4095 characters in a character string literal or wide string literal (after concatenation)
So if your constant text exceeds this count (which can happen, especially if you write a dynamic web server in C), you cannot rely on the string literal approach if you want to stay portable.
There is no problem in your code as long as you are not planning to modify the contents of that string. Also, the memory for such string literals will remain for the full lifetime of the program. The memory allocated by malloc is read-write, so you can manipulate the contents of that memory.
If you have a string literal that you do not want to modify, what you are doing is ok:
char *str = NULL;
str = "This is a test";
str = "Now I am pointing here";
Here str is a pointer. The second line makes it point to the literal "This is a test", and the third line makes it point to "Now I am pointing here" instead. Nothing is copied; only the pointer changes. This is legal in C.
It may seem contradictory, but you can't modify the string itself, like this:
str[0] = 'X'; // undefined behavior: modifies a string literal
However, if you want to be able to modify it, use it as a buffer to hold a line of input and so on, use malloc:
char *str=malloc(BUFSIZE); // BUFSIZE size what you want to allocate
free(str); // freeing memory
Use malloc() when you don't know the amount of memory needed during compile time.
Pointing a char * at a string literal is, unfortunately, legal in C, but any attempt to modify the literal through the pointer results in undefined behavior.
For example:
str[0] = 'Y'; // no compiler error, but undefined behavior
Your original code will run fine, but the compiler may warn you because you are pointing at a constant string.
P.S.: It will run OK only when you are not modifying it. So the only disadvantage of not using malloc is that you won't be able to modify it.

How do I change the value of a string passed by reference to a function?

I've been trying for the past hour in utter frustration, but no matter what I try, or look up, I can't find anything that's specific to CStrings.
So I have a function for a library I'm working on that goes like this (edited out the non-relevant bits from it)
char *String_set(char **string_one, char *string_two){
// Tests pointers to check if NULL, return NULL if one is
free(*string_one); // Free the pointer so as to not cause a leak.
*string_one = malloc(strlen(string_two) + 1); // Allocate string_one
memset(*string_one, 0, strlen(string_two) + 1); // Cleans the string
strcpy(*string_one, string_two); // Copy string_two into string_one by reference
return *string_one;
}
Now, I have also tried NOT freeing the *string_one, and instead reallocating the pointer to hold enough for string_two, THEN clearing it out (with memset), but both have the same result. Either A) Segmentation fault if a string literal was passed, or B) No change if a mutable string is passed.
The kicker (to me) is that I've added quite a few print statements to it to monitor the goings-on of the function, and if anything it confused me even more as I got output like this...
//Output before function is called. It outputs info about the string before function
String's value:
// Initialized it to "", so it's meant to be empty.
String's Memory Address: 0x51dd810
// Inside of function
String's value:
// Same value
String's Memory Address: 0x51dd810
// Same memory address
String_Two's Value: "Hello World"
// What I am attempting to replace it with.
// After operations in function, before return statement
Final String's Value: "Hello World"
// Gets set
Final String's memory address: 0x51dd950
// Different address
// After return
String's value:
// Nothing changed. Even after freeing the contents at memory address?
String's memory address: 0x51dd810
// Still same memory address ?
Then it fails my Unit test because the value did not change as expected. May I get an answer as to why? Now, I'm a bit of a newbie to C, but I figured that anything allocated on the heap is global in scope, hence accessible anywhere. Also modifiable anywhere as well. Why is it that, my changes did not go through at all? Why is it that the value of the string changes in the function but rolls back at the return of it? I know C is pass-by-value, but I figured passing the reference by value would work. How can I properly change the value of a string passed to a function, and what is wrong with my code?
Thank you in advance.
Edit: Gist of what should be runable code (remove the REVERSE, LOWERCASE, UPPERCASE lines)
Edit2: Updated GIST on mobile, May be some other errors, posted this in a hurry.
Edit3: Ideone of the... strangely working build. Strangely, this is also working on both Windows and a Linux virtual machine, so the problem may not be there specifically... I'm honestly at a loss for words (disregarding the runtime error). I try to compile my project and run the tests over and over, and the code in ideone is word-for-word verbatim (although there's no runtime error when I run it, strangely).
This is not a full answer, and I'm not sure this isn't becoming code-review, which is actually off-topic on SO. (Viewers please feel free to edit this answer if you find any additional flaws.)
String_Utils_concat() has no clean ownership semantics. If SELECTED(parameter, MODIFY), then it returns string_one (is literal in test), otherwise temp (mallocated). You cannot safely free result unless you remembered the value of parameter at time of call.
The code is very complex. Consider using strdup and asprintf.
Differences you see on platforms are probably due to different memory management schemes and different behavior of undefined behaviors.
Deep coupling of parameter is root of all troubles. Code can become less complex just by turning it inside-out. Can't provide a snippet, because all these string_xxx and parameter values, as well as the entire target, feel nonsense to me.
If you need a string library with duplicate/concat facilities, then:
char *strdup(const char *s); // already in libc (POSIX)
char *s; asprintf(&s, "%s%s", s1, s2); // GNU/BSD extension; glibc needs _GNU_SOURCE
... After aggressive cleaning just for this case, your functions became mostly trivial:
// String_Utils_copy() eliminated as strdup() ('parameter' was not used)
char *
String_Utils_set(char **string_one, char *string_two)
{
free(*string_one);
return (*string_one = strdup(string_two));
}
char *
String_Utils_concat(char *string_one, char *string_two, int parameter)
{
char *temp; asprintf(&temp, "%s%s", string_one, string_two);
if (SELECTED(parameter, MODIFY)) {
String_Utils_set(&string_one, temp);
// i.e. 1) free(string_one);
// ^ this probably frees literal
// 2) string_one = strdup(temp);
free(temp);
return string_one;
// (what was the point at all?)
// entire thing is same as "return temp" except for freeing literal
}
return temp;
}
I hope there are some clues now...
Quick edit: as you're already allocating and copying here and there without a reason, I assume you're not in a very tight loop nor constrained otherwise. Then all interfaces should stick with widely-default "get const char *, return char * that should be freed" rule. I.e.
char *String_Utils_set(...); // throw it away
char *String_Utils_concat(const char *s1, const char *s2);
char *strdup(const char *s); // already in libc
char *s = String_Utils_concat("Hello, ", "World!");
printf("%s\n", s);
free(s); s = NULL;
char *s = strdup("Hello!");
printf("%s\n", s);
free(s); s = NULL;
With that clean and proper interface you may do whatever you meant by parameter just in-place, without any headaches.
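For completeness, here is a hedged sketch of a "set" function that sticks to standard C (no strdup or asprintf) and copies before freeing, so a failed allocation leaves the old string intact. The name string_set is hypothetical, and the sketch assumes *dst is either NULL or was obtained from malloc; passing the address of a pointer to a string literal would still be an error, because free() would be called on the literal.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical replacement for String_set. *dst must be NULL or a pointer
 * previously returned by malloc/calloc/realloc, never a string literal.
 * Returns the new string, or NULL (leaving *dst untouched) on failure. */
char *string_set(char **dst, const char *src)
{
    if (dst == NULL || src == NULL)
        return NULL;
    size_t n = strlen(src) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, n);  /* copy first, so src may alias *dst */
    free(*dst);            /* free(NULL) is a no-op */
    *dst = copy;
    return copy;
}
```

The caller passes the address of its own pointer, for example string_set(&s, "Hello World") with char *s = NULL, and the change is visible after the call because the function writes through dst rather than to a local copy of the pointer.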

Abort instead of segfault with clear memory violation

I came upon this weird behaviour when dealing with C strings. This is an exercise from the K&R book where I was supposed to write a function that appends one string onto the end of another string. This obviously requires the destination string to have enough memory allocated so that the source string fits. Here is the code:
/* strcat: Copies contents of source at the end of dest */
char *strcat(char *dest, const char* source) {
char *d = dest;
// Move to the end of dest
while (*dest != '\0') {
dest++;
} // *dest is now '\0'
while (*source != '\0') {
*dest++ = *source++;
}
*dest = '\0';
return d;
}
During testing I wrote the following, expecting a segfault to happen while the program is running:
int main() {
char s1[] = "hello";
char s2[] = "eheheheheheh";
printf("%s\n", strcat(s1, s2));
}
As far as I understand s1 gets an array of 6 chars allocated and s2 an array of 13 chars. I thought that when strcat tries to write to s1 at indexes higher than 6 the program would segfault. Instead everything works fine, but the program doesn't exit cleanly, instead it does:
helloeheheheheheh
zsh: abort ./a.out
and exits with code 134, which I think just means abort.
Why am I not getting a segfault (or overwriting s2 if the strings are allocated on the stack)? Where are these strings in memory (the stack, or the heap)?
Thanks for your help.
I thought that when strcat tries to write to s1 at indexes higher than 6 the program would segfault.
Writing outside the bounds of memory you have allocated on the stack is undefined behaviour. Invoking this undefined behaviour usually (but not always) results in a segfault. However, you can't be sure that a segfault will happen.
The wikipedia link explains it quite nicely:
When an instance of undefined behavior occurs, so far as the language specification is concerned anything could happen, maybe nothing at all.
So, in this case, you could get a segfault, the program could abort, or sometimes it could just run fine. Or, anything. There is no way of guaranteeing the result.
Where are these strings in memory (the stack, or the heap)?
Since you've declared them as char [] inside main(), they are arrays that have automatic storage, which for practical purposes means they're on the stack.
Edit 1:
I'm going to try and explain how you might go about discovering the answer for yourself. I'm not sure what actually happens as this is not defined behavior (as others have stated), but you can do some simple debugging to figure out what your compiler is actually doing.
Original Answer
My guess would be that they are both on the stack. You can check this by modifying your code with:
int main() {
char c1 = 'X';
char s1[] = "hello";
char s2[] = "eheheheheheh";
char c2 = '3';
printf("%s\n", strcat(s1, s2));
}
c1 and c2 are going to be on the stack. Knowing that you can check if s1 and s2 are as well.
If the address of s1 lies between the addresses of c1 and c2, then it is on the stack. Otherwise it is probably in your .data section (which would be the smart thing to do but would break recursion).
The reason I'm banking on the strings being on the stack is that if you are modifying them in the function, and that function calls itself, then the second call would not have its own copy of the strings and hence would not be valid... However, the compiler still knows that this function isn't recursive and can put the strings in .data, so I could be wrong.
Assuming my guess that it is on the stack is right, in your code
int main() {
char s1[] = "hello";
char s2[] = "eheheheheheh";
printf("%s\n", strcat(s1, s2));
}
"hello" (with the null terminator) is pushed onto the stack, followed by "eheheheheheh" (with the null terminator).
They are both located one after the other (thanks to plain luck of the order in which you wrote them) forming a single memory block that you can write to (but shouldn't!)... That's why there is no seg fault, you can see this by breaking before printf and looking at the addresses.
(uintptr_t)s2 == (uintptr_t)s1 + strlen(s1) + 1 should be true if I'm right.
Modifying your code with
int main() {
char s1[] = "hello";
char c = '3';
char s2[] = "eheheheheheh";
printf("%s\n", strcat(s1, s2));
}
Should see c overwritten if I'm right...
However, if I'm wrong and it is in the .data section, then they could still be adjacent and you would be overwriting them without a seg fault.
If you really want to know, disassemble it:
Unfortunately I only know how to do it on Linux. Try using the nm <binary> > <text file>.txt command or objdump -t <your_binary> > <text file>.sym command to dump all the symbols from your program. The commands should also give you the section in which each symbol resides.
Search the file for the s1 and s2 symbols, if you don't find them it should mean that they are on the stack but we will check that in the next step.
Use the objdump -S your_binary > text_file.S command (make sure you built your binary with debug symbols) and then open the .S file in a text editor.
Again search for the s1 and s2 symbols, (hopefully there aren't any others, I suspect not but I'm not sure).
If you find their definitions followed by a push or sub %esp instruction, then they are on the stack. If you're unsure about what their definitions mean, post it back here and let us have a look.
There's no seg fault or even an overwrite because it can use the memory of the second string and still function. Even give the correct answer. The abort is a sign that the program realized something was wrong. Try reversing the order in which you declare the strings and try again. It probably won't be as pleasant.
int main() {
char s1[] = "hello";
char s2[] = "eheheheheheh";
printf("%s\n", strcat(s1, s2));
}
instead use:
int main() {
char s1[20] = "hello";
char s2[] = "eheheheheheh";
printf("%s\n", strcat(s1, s2));
}
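Sizing the destination by hand works, but the check can also be made explicit. Below is a minimal sketch of a bounds-checked append; append_bounded is a hypothetical name, and the sketch assumes dest already holds a null-terminated string and that dest_size is the total size of the destination array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dest, where dest_size
 * is the total size of the dest array. Returns 0 on success, or -1 (with
 * dest unchanged) if the result would not fit. */
int append_bounded(char *dest, size_t dest_size, const char *src)
{
    size_t used = strlen(dest);
    size_t extra = strlen(src);
    if (used >= dest_size || extra > dest_size - used - 1)
        return -1;
    memcpy(dest + used, src, extra + 1);  /* extra + 1 copies the '\0' too */
    return 0;
}
```

With char s1[20] = "hello", append_bounded(s1, sizeof s1, s2) succeeds; with char s1[] = "hello" (6 bytes) it returns -1 instead of writing past the array.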
Here is the reason why your program didn't crash:
Your strings are declared as arrays (s1[] and s2[]), so they're on the stack. It just so happens that the memory for s2[] is right after s1[]. So when strcat() is called, all it does is shift each character of s2[] over by one byte. The stack is readable and writable, so nothing stops what you're doing.
But I believe the compiler is free to place s1[] and s2[] wherever it sees fit, so this is just a happy accident.
Getting your program to crash is relatively easy:
Swap s1 and s2 in your call: instead of strcat(s1, s2), do strcat(s2, s1). This should trigger stack-smashing detection.
Change s1[] and s2[] to *s1 and *s2. This should cause a segfault when you write to the read-only segment.
Hmm, the strings are on the stack, since the heap is used only for dynamically allocated memory.
A segfault is raised for an invalid memory access, but here you are just writing past the bounds of the array, so I don't think the write itself has to cause a problem. In C it is left to the programmer to keep array accesses in bounds.
Reading through pointers need not cause a problem either, since you can keep reading as far as you like. But the functions in string.h rely on the null character '\0' to decide where to stop, which is why I think your function worked.
The abnormal termination could also indicate that some other variable located next to the strings got overwritten with character data, and that using it later caused the program to exit.
Hope this helps. Good question, by the way!

copy a string in c - memory question:

consider the following code:
t[7] = "Hellow\0";
s[3] = "Dad";
//now copy t to s using the following strcpy function:
void strcpy(char *s, char *t) {
int i = 0;
while ((s[i] = t[i]) != '\0')
i++;
}
the above code is taken from "The C programming Language book".
my question is - we are copying 7 bytes to what was declared as 3 bytes.
how do I know that after copying, other data that was after s[] in the memory
wasn't deleted?
and one more question please: char *s is identical to char* s?
Thank you !
As you correctly point out, passing s[3] as the first argument is going to overwrite some memory that could well be used by something else. At best your program will crash right there and then; at worst, it will carry on running, damaged, and eventually end up corrupting something it was supposed to handle.
The intended way to do this in C is to never pass an array shorter than required.
By the way, it looks like you've swapped s and t; what was meant was probably this:
void strcpy(char *t, char *s) {
int i = 0;
while ((t[i] = s[i]) != '\0')
i++;
}
You can now copy s[4] into t[7] using this amended strcpy routine:
char t[] = "Hellow";
char s[] = "Dad";
strcpy(t, s);
(edit: the length of s is now fixed)
About the first question.
If you're lucky your program will crash.
If you are not it will keep on running and overwrite memory areas that shouldn't be touched (as you don't know what's actually in there). This would be a hell to debug...
About the second question.
Both char* s and char *s do the same thing. It's just a matter of style.
That is, char* s could be interpreted as "s is of type char pointer" while char *s could be interpreted as "s is a pointer to a char". But really, syntactically it's the same.
That example does nothing, you're not invoking strcpy yet. But if you did this:
strcpy(s,t);
It would be wrong in several ways:
The string s is not null terminated. In C the only way strcpy can know where a string ends is by finding the '\0'. The function may think that s is infinite and it might corrupt your memory and make the program crash.
Even if was null terminated, as you said the size of s is only 3. Because of the same cause, strcpy would write memory beyond where s ends, with maybe catastrophic results.
The workaround for this in C is the function strncpy(dst, src, max), in which you specify the maximum number of chars to copy. Beware that this function produces a string that is not null terminated if src has max or more chars.
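A common way to deal with the strncpy termination caveat is to reserve the last byte and terminate explicitly. The sketch below assumes a hypothetical wrapper named copy_truncated and requires dst_size to be at least 1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst (total size dst_size, at least 1),
 * truncating if necessary and always null-terminating the result.
 * Returns 1 if src was truncated, 0 otherwise. */
int copy_truncated(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size - 1);  /* may leave dst unterminated */
    dst[dst_size - 1] = '\0';         /* so terminate explicitly */
    return strlen(src) >= dst_size;
}
```

Copying "Hellow" into a 4-byte array this way yields "Hel" and reports truncation, rather than overwriting whatever follows the array.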
I will assume that both s and t (above the function definition) are arrays of char.
how do I know that after copying, other data that was after s[] in the memory wasn't deleted?
No, this is worse, you are invoking undefined behavior and we know this because the standard says so. All you are allowed to do after the three elements in s is compare. Assignment is a strict no-no. Advance further, and you're not even allowed to compare.
and one more question please: char *s is identical to char* s?
In most cases it is a matter of style where you put your asterisk, except if you are going to declare/define more than one variable, in which case you need to attach one to every variable you name (as a pointer).
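The multiple-declaration case can be checked with sizeof. This is a small sketch with a made-up function name, declared_sizes, that reports the sizes of two variables declared in a single statement.

```c
#include <stddef.h>

/* Hypothetical demo: reports the sizes of two variables declared in one
 * statement. Only the first is a pointer; the second is a plain char. */
void declared_sizes(size_t *first, size_t *second)
{
    char* a, b;          /* parsed as: char *a; char b; */
    *first = sizeof a;   /* size of a pointer to char */
    *second = sizeof b;  /* always 1, because b is a char */
}
```

To make both variables pointers, the declaration would have to be written as char *a, *b;.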
The string literal "Hellow\0" is effectively the same string as "Hellow", since a literal already ends with an implicit '\0'.
If you define
char t[7] = "Hellow";
char s[7] = "Dad";
your example is well defined and does not crash.

Resources