In C, how can printf("Hello") ever output "Cello" in any circumstance? - c

It completely misses me how can printf("Hello") ever print Cello. It challenges my basic understanding of C. But from the top answer (by Carson Myers) for the following question on Stack Overflow, it seems it is possible. Can you please explain in simple terms how is it possible? Here's what the answer says:
Whenever you write a string in your source, that string is read only
(otherwise you would be potentially changing the behavior of the
executable--imagine if you wrote char *a = "hello"; and then changed
a[0] to 'c'. Then somewhere else wrote printf("hello");. If you were
allowed to change the first character of "hello", and your compiler
only stored it once (it should), then printf("hello"); would output
cello!)
Aforementioned question: Is it possible to modify a string of char in C?

Reasons:
Compilers usually store only one copy of identical string literals, so the string literal in char *a = "hello"; and in printf("hello") could be at a same memory location.
The answer in your link assumes that the memory location for storing string literals are mutable, which is typically not in modern architectures. However this is true if there's no memory access protection, e.g. in some embedded architectures or a 80386 working in real mode.
So when you modify the string referenced by a, the value for printf changes as well.

If you, somewhere in your source, have the string literal "Hello", that ends up in your executable as part of the code / data segment. This should be considered read-only at all times, because compilers are at liberty to optimize multiple occurences of the same literal into a single entity. You would have multiple cases of "Hello" in your source, and multiple pointers pointing to them, but they could all be pointing to the same address.
ISO/IEC 9899 "Programming languages - C", chapter 6.4.5 "String literals", paragraph 6:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
Thus, any pointer to such a string literal is to be declared as a pointer to constant contents, to make this clear on the source level:
char const * a = "Hello";
Given this definition, a[0] = 'C'; is not a valid operation: You cannot change a const value, the compiler would issue an error.
However, in more than one ways it is possible to "trick" the language. For one, you could cast the pointer:
char const * a = "Hello";
char * b = (char *)a;
b[0] = 'C';
As the above snippet from the standard states, this -- while syntactically correct -- is semantically undefined behaviour. It might even work "correctly" on certain platforms (mostly for historical reasons), and actually print "Cello". It might break on others.
Consider what would happen if your executable is burned into a ROM chip, and executed from there...
I said "historical reasons". In the beginning, there was no const. That is why C defines the type of a string literal as char[] (no const).
Note that:
C++98 does define string literals as being const, but allows conversion to char *.
C++03 still allows the conversion but deprecates it.
C++11 no longer allows the conversion without a cast.

This is a practical explanation (i.e., not dictated by the C-language standard):
First, you declare char *a = "hello" somewhere in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces char *a = "hello" with char *a = the address of "hello" in memory
Then, you call printf("hello") somewhere else in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces printf("hello") with printf(the address of "hello" in memory)
Now, theoretically (as explained by #Carson Myers), if you could change any of the characters in "hello", then it would affect the result of anything that refers to the data located at the address of that string in memory.
In practice, because the compiler places all constant strings in a read-only memory section, it is not feasible.

the *a points to a different "Hello" than the one that you pass to printf. (you have 2 "hello" in your system)
It will work if you ask printf to print the string at a.

Related

Is String Literal in C really not modifiable?

As far as I know, a string literal can't be modified for example:
char* a = "abc";
a[0] = 'c';
That would not work since string literal is read-only. I can only modify it if:
char a[] = "abc";
a[0] = 'c';
However, in this post,
Parse $PATH variable and save the directory names into an array of strings, the first answer modified a string literal at these two places:
path_var[j]='\0';
array[current_colon] = path_var+j+1;
I'm not very familiar with C so any explanation would be appreciated.
In programming, there are quite a few rules that are up to you to follow, even though they are not — necessarily — enforced. And "String literals in C are not modifiable" is one of those. So is "Strings returned by getenv should not be modified".
There are some real-world analogies that apply. Here's one: If you're at an intersection, and the light is red, you're not supposed to cross. But, much of the time, if you break the rule, and cross, you might get away with it. You might get a ticket from a policeman — or you might not. You might cause a crash — or you might not. But if you get lucky, and neither of these things happens, that does not imply that crossing the intersection against the red light was okay — it's still quite true that it was very much against the rules.
Similarly, in C, if you write some code that modifies a string literal, or a string returned from getenv, you might get away with it. The compiler might give you a warning or error message — or it might not. Your program might crash — or it might not. But if the program seems to work, that does not imply that these strings are actually modifiable — they're not.
Code blocks from the post you linked:
const char *orig_path_var = getenv("PATH");
char *path_var = strdup(orig_path_var ? orig_path_var : "");
const char **array;
array = malloc((nb_colons+1) * sizeof(*array));
array[0] = path_var;
array[current_colon] = path_var+j+1;
First block:
In the 1st line getenv() returns a pointer to a string which is pointed to by orig_path_var. The string that get_env() returns should be treated as a read-only string as the behaviour is undefined if the program attempts to modify it.
In the 2nd line strdup() is called to make a duplicate of this string. The way strdup() does this is by calling malloc() and allocating memory for the size of the string + 1 and then copying the string into the memory.
Since malloc() is used, the string is stored on the heap, this allows us to edit the string and modify it.
Second block:
In the 1st line we can see that array points to a an array of char * pointers. There is nb_colons+1 pointers in the array.
Then in the 2nd line the 0th element of array is initilized to path_var (remember it is not a string literal, but a copy of one).
In the 3rd line, the current_colonth element of array is set to path_var+j+1. If you don't understand pointer arithmetic, this just means it assigns the address of the j+1th char of path_var to array[current_colon].
As you can see, the code is not operating on const string literals like orig_path_var. Instead it uses a copy made with strdup(). This seems to be where your confusion stems from so take a look at this:
char *strdup(const char *s);
The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).
The above text shows what strdup() does according to its man page.
It may also help to read the malloc() man page.
In the example
char* a = "abc";
the token "abc" produces a literal object in the program image, and denotes an expression which yields that object's address.
In the example
char a[] = "abc";
The token "abc" is serves as an array initializer, and doesn't denote a literal object. It is equivalent to:
char a[] = { 'a', 'b', 'c', 0 };
The individual character values of "abc" are literal data is recorded somewhere and somehow in the program image, but they are not accessible as a string literal object.
The array a isn't a literal, needless to say. Modifying a doesn't constitute modifying a literal, because it isn't one.
Regarding the remark:
That would not work since string literal is read-only.
That isn't accurate. The ISO C standard (no version of it to date) doesn't specify any requirements for what happens if a program tries to modify a string literal. It is undefined behavior. If your implementation stops the program with some diagnostic message, that's because of undefined behavior, not because it is required.
C implementations are not required to support string literal modification, which has the benefits like:
standard-conforming C programs can be translated into images that can be be burned into ROM chips, such that their string literals are accessed directly from that ROM image without having to be copied into RAM on start-up.
compilers can condense the storage for string literals by taking advantage of situations when one literal is a suffix of another. The expression "string" + 2 == "ring" can yield true. Since a strictly conforming program will not do something like "ring"[0] = 'w', due to that being undefined behavior, such a program will thereby avoid falling victim to the surprise of "string" unexpectedly turning into "stwing".
There are several reasons for which you had better not to modify them:
The first is that the operating system and/or the compiler can enforce the non-writable property of string literals, putting them in read-only memory (e.g. ROM) or in the .text segment.
second, the compiler is allowed to merge string literals together, so if you modify (and do it successfully) you can get surprises later because other literals (that have been merged because e.g. one of them is a suffix of the other) change apparently by no reason.
if you need an initialized string that is modifiable, you can do it by allocating an array with a declaration, as in (which you can freely modify):
char array[100] = "abc"; // initialized to { 'a' ,'b', 'c', '\0',
// /* and 96 more '\0' characters */
// };

What is the reason behind the following C char array storage implementation?

What is the implementation reason behind the following char array implementation?
char *ch1 = "Hello"; // Read-only data
/* if we try ch1[1] = ch1[2];
we will get **Seg fault** since the value is stored in
the constant code segment */
char ch2[] = "World"; // Read-write data
/* if we try ch2[1] = ch2[2]; will work. */
According to the book Head first C (page 73,74), the ch2[] array is stored both in constant code segment but also in the function stack.
What is the reason behind duplicating both in code and
stack memory space?
Why the value can be kept only in stack if it is not read-only data?
First, let's clear something up. String literals are not necessarily read-only data, it's just that it's undefined behaviour to try and change them.
It doesn't necessarily have to crash, it may work just fine. But, being undefined behaviour, you shouldn't rely on it if you want you code to run in another implementation, another version of the same implementation, or even next Wednesday.
This may well stem from a time before standards were in place (the original ANSI/ISO mandate was to codify existing practice rather than create a new language). In many implementations, strings would share space for efficiency, such as the code:
char *good = "successful";
char *bad = "unsuccessful";
resulting in:
good---------+
bad--+ |
| |
V V
| u | n | s | u | c | c | e | s | s | f | u | l | \0 |
Hence, if you changed one of the characters in good, it would also change bad.
The reason you can do it with something like:
char indifferent[] = "meh";
is that, while good and bad point to a string literal, that statement actually creates a character array big enough to hold "meh" and then copies the data into it1. The copy of the data can be freely changed.
In fact the C99 rationale document explicitly cites this as one of the reasons:
String literals are not required to be modifiable. This specification allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and to perform certain optimizations.
But regardless as to why, the standard is quite clear on the what. From C11 6.4.5 String literals:
7/ It is unspecified whether these arrays are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.
For the latter case, this is covered in 6.7.6 Declarators and 6.7.9 Initialisation.
1 Though it's worth noting the the normal "as if" rules apply here (as long as an implementation acts as if it's following the standard, it can do what it pleases).
In other words, if the implementation can detect that you never try to change the data, it can quite happily bypass the copy and use the original.
We will get Seg fault since the value is stored in the constant
code segment
This is false: your program crashes because it receives a signal indicating a segment violation (SIGSEGV) which, by default, causes the program to terminate. But this is not the primary reason. Modifying a string literal is undefined behavior, whether it's stored in read-only segments or not, which is much wider than you think.
array is stored both in constant code segment but also in the function
stack.
This is an implementation detail and shouldn't concern you: as far as ISO C is concerned, those statements make no sense. This also means it could be implemented differently.
When you
char ch2[] = "World";
"World", which is a string literal, is copied into ch2, something you would end up doing if you used malloc and pointers. Now, why is that copied?
One reason for this may be that it's something you would expect. If you could modify such string literal, what if another part of the code referred to it and expected to have that value? Having shared string literals is efficient because you can share them across your program and saves space.
By copying it, you have your own copy of the string ( you "own" it) and you can modify it as you will.
Quoting "Rationale for American National Standard for Information Systems Programming Language C"
String literals are specied to be unmodiable. This specication allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and perform certain optimizations. However, string literals do not have the type array of const char, in order to avoid the problems of pointer type checking, particularly with library functions, since assigning a pointer to const char to a plain pointer to char is not valid.
This is only a partial answer with a counter-example to a claim that a string literal is stored in a read only memory:
int main() {
char a[]="World";
printf("%s", a);
}
gcc -O6 -S c.c
.LC0:
.string "%s" ;; String literal stored as expected
;; in read-only area within code
...
movl $1819438935, (%rsp) ;; First four bytes in "worl"
movw $100, 4(%rsp) ;; next to bytes in "d\0"
call printf
...
Here only the semantics of the concept literal is implemented; the literal "world\0" doesn't even exist.
In practice only when the string literals are long enough, an optimizing compiler will choose to memcpy data from the literal pool to stack, requiring the existence of the literal as null terminating string.
The semantics of char *ch1 = "Hello"; OTOH requires that there exists a linear array somewhere, whose address can be assigned to the pointer ch1.

In c, what are the rules governing how compilers merge the same strings into the executable

I am trying to find what the rules are for c and c++ compilers putting strings into the data section of executables and don't know where to look. I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec:
char * test1 = "hello";
const char * test2 = "hello";
static char * test3 = "hello";
static const char * test4 = "hello";
extern const char * test5; // Defined in another compilation unit as "hello"
extern const char * test6; // Defined in another shared object as "hello"
Testing on windows, they are all the same. However I do not know if they would be on all operating systems.
I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec
String literals are allowed to be the same object but are not required to.
C++ says:
(C++11, 2.14.5p12) "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined."
C says:
(C11, 6.5.2.5p7) "String literals, and compound literals with const-qualified types, need not designate distinct objects.101) This allows implementations to share storage for string literals and constant compound literals with the same or overlapping representations."
And C99 Rationale says:
"This specification allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and to perform certain optimizations"
Firstly, this has nothing to do with the operating system. It depends solely on the implementation, i.e on the compiler.
Secondly, the only "guarantees" you can hope for in this case will come from the compiler documentation. The formal rules of the language neither guarantee them to be the same, nor guarantee them to be different. (The latter applies to both C and C++.)
Thirdly, some compilers have such bizarre options like "make string literals modifiable". This usually implies that each literal is allocated in a unique region of storage and has unique address.
They can all be the same. Even x and y in the following can be the same. z can overlap with y
const char *x = "hello";
const char *y = "hello\0folks";
const char *z = "folks";
In C, I believe the only guarantee about a string literal is that it will evaluate to a pointer to a readable area of memory that will, assuming a program does not engage in Undefined Behavior, always contain the indicated characters followed by a zero byte. The compiler and linker are allowed to work together in any fashion they see fit to make that happen. While I don't know of any compiler/linker systems that do this, it would be perfectly legitimate for a compiler to put each string literal in its own constant section, and for a linker to place such sections in reverse order of length, and check before placing each one whether the appropriate sequence of bytes had already been placed somewhere. Note that the sequence of bytes wouldn't even have to be a string literal or defined constant; if the linker is trying to place the string "Hi!" and it notices that machine code contains the sequence of bytes [0x48, 0x69, 0x21, 0x00], the literal could evaluate to a pointer to the first of those.
Note that writing to the memory pointed to by a string literal is Undefined Behavior. On various system a write may trap, do nothing, or affect only the literal written, but it could also have totally unpredictable consequences [e.g. if the literal evaluated to a pointer into some machine code].

Where does constants, literals and globals get space

I have recently learnt that its possible to change values of constants in c using a pointer but its not possible for string literals. Possibly the explanation lies in the fact that constants and other strings are allocated space in modifiable region in space whereas string literals gets in non-modifiable region in space (possibly code segment). I have written a program that display addresses for these variables. Outputs are shown as well.
#include <stdio.h>
int x=0;
int y=0;
int main(int argc, char *argv[])
{
const int a =5;
const int b;
const int c =10;
const char *string = "simar"; //its a literal, gets space in code segment
char *string2 = "hello";
char string3[] = "bye"; // its array, so gets space in data segment
const char string4[] = "guess";
const int *pt;
int *pt2;
printf("\nx:%u\ny:%u Note that values are increasing\na:%u\nb:%u\nc:%u Note that values are dec, so they are on stack\nstring:%u\nstring2:%u Note that address range is different so they are in code segment\nstring3:%u Note it seems on stack as well\nstring4:%u\n",&x,&y,&a,&b,&c,string,string2,string3,string4);
}
Please explain where exactly these variables get space??
Where do globals get space, where do constants get and where does string literals get??
"Possible" is over-stating the case.
You can write code that attempts to modify a const object (for example by casting its address to a pointer-to-non-const type). You can also write code that attempts to modify a string literal.
In both cases your code has undefined behavior, meaning that the standard doesn't care what happens. The implementation can do what it likes, and what happens is usually an accidental side-effect of something else that does matter. You cannot rely on the behavior. In short, that code is wrong. This is true both of objects defined as const, and of string literals.
It may be that on a particular implementation, the effect is to change the object or the literal. It may be that on another implementation, you get an access error and the program crashes. It may be that on a third implementation, you get one behavior sometimes and the other at other times. It may be that something entirely different happens.
It's implementation-specific where variables get their space, but in a typical implementation:
x and y are in a modifiable data segment
a is on the stack. If it weren't for the fact that you take its address, then the variable storage could be optimized away entirely, and the value 5 used as an immediate value in any CPU instructions that the compiler emits for code that uses a.
b I think is an error -- uninitialized const object. Maybe it's allowed, but the compiler probably ought to warn.
c is on the stack, same as a.
the literals "simar" etc are all either in a code segment, a read-only data segment, or a modifiable data segment if the implementation doesn't bother with rodata.
string3 and string4 are arrays on the stack. Each is initialized by copying the contents of a string literal.
I have recently learnt that its possible to change values of constants in c using a pointer
Doing so leads to undefined behavior (see the standard 6.7.3), meaning that anything can happen. Practically, you can modify constants on some RAM-based systems.
its not possible for string literals
That is equally undefined behavior and may work, or may not work, or it may cause blue smoke to rise from your harddrive.
Possibly the explanation lies in the fact that constants and other strings are allocated space in modifiable region in space whereas string literals gets in non-modifiable region in space
This is system-dependant. On some systems they both lie in a constant/virtual RAM segment, on some they could lie in non-volatile flash memory. Having a discussion about where things end up in memory is pointless without specifying what system you are talking about. There is no generic case.

C Duration of strings, constants, compound literals, and why not, the code itself

I didn't remember where I read, that If I pass a string to a function like.
char *string;
string = func ("heyapple!");
char *func (char *string) {
char *p
p = string;
return p;
}
printf ("%s\n", string);
The string pointer continue to be valid because the "heyapple!" is in memory, it IS in the code the I wrote, so it never will be take off, right?
And about constants like 1, 2.10, 'a'?
And compound literals?
like If I do it:
func (1, 'a', "string");
Only the string will be all of my program execution, or the constans will be too?
For example I learned that I can take the address of string doing it
&"string";
Can I take the address of the constants literals? like 1, 2.10, 'a'?
I'm passing theses to functions arguments and it need to have static duration like strings without the word static.
Thanks a lot.
This doesn't make a whole lot of sense.
Values that are not pointers cannot be "freed", they are values, they can't go away.
If I do:
int c = 1;
The variable 'c' is not a pointer, it cannot do anything else than contain an integer value, to be more specific it can't NOT contain an integer value. That's all it does, there are no alternatives.
In practice, the literals will be compiled into the generated machine-code, so that somewhere in the code resulting from the above will be something like
load r0, 1
Or whatever the assembler for the underlying instruction set looks like. The '1' is a part of the instruction encoding, it can't go away.
Make sure you distinguish between values and pointers to memory. Pointers are themselves values, but a special kind of value that contains an address to memory.
With char* hello = "hello";, there are two things happening:
the string "hello" and a null-terminator are written somewhere in memory
a variable named hello contains a value which is the address to that memory
With int i = 0; only one thing happens:
a variable named i contains the value 0
When you pass around variables to functions their values are always copied. This is called pass by value and works fine for primitive types like int, double, etc. With pointers this is tricky because only the address is copied; you have to make sure that the contents of that address remain valid.
Short answer: yes. 1 and 'a' stick around due to pass by value semantics and "hello" sticks around due to string literal allocation.
Stuff like 1, 'a', and "heyapple!" are called literals, and they get stored in the compiled code, and in memory for when they have to be used. If they remain or not in memory for the duration of the program depends on where they are declared in the program, their size, and the compiler's characteristics, but you can generally assume that yes, they are stored somewhere in memory, and that they don't go away.
Note that, depending on the compiler and OS, it may be possible to change the value of literals, inadvertently or purposely. Many systems store literals in read-only areas (CONST sections) of memory to avoid nasty and hard-to-debug accidents.
For literals that fit into a memory word, like ints and chars it doesn't matter how they are stored: one repeats the literal throughout the code and lets the compiler decide how to make it available. For larger literals, like strings and structures, it would be bad practice to repeat, so a reference should be kept.
Note that if you use macros (#define HELLO "Hello!") it is up to the compiler to decide how many copies of the literal to store, because macro expansion is exactly that, a substitution of macros for their expansion that happens before the compiler takes a shot at the source code. If you want to make sure that only one copy exists, then you must write something like:
#define HELLO "Hello!"
char* hello = HELLO;
Which is equivalent to:
char* hello = "Hello!";
Also note that a declaration like:
const char* hello = "Hello!";
Keeps hello immutable, but not necessarily the memory it points to, because of:
char h = (char) hello;
h[3] = 'n';
I don't know if this case is defined in the C reference, but I would not rely on it:
char* hello = "Hello!";
char* hello2 = "Hello!"; // is it the same memory?
It is better to think of literals as unique and constant, and treat them accordingly in the code.
If you do want to modify a copy of a literal, use arrays instead of pointers, so it's guaranteed a different copy of the literal (and not an alias) is used each time:
char hello[] = "Hello!";
Back to your original question, the memory for the literal "heyapple!" will be available (will be referenceable) as long as a reference is kept to it in the running code. Keeping a whole module (a loadable library) in memory because of a literal may have consequences on overall memory use, but that's another concern (you could also force the unloading of the module that defines the literal and get all kind of strange results).
First,it IS in the code the I wrote, so it never will be take off, right? my answer is yes. I recommend you to have a look at the structure of ELF or runtime structure of executable. The position that the string literal stored is implementation dependent, in gcc, string literal is store in the .rdata segment. As the name implies, the .rdata is read-only. In your code
char *p
p = string;
the pointer p now point to an address in a readonly segment, so even after the end of function call, that address is still valid. But if you try to return a pointer point to a local variable then it is dangerous and may cause hard-to-find bugs:
int *func () {
int localVal = 100;
int *ptr = localVal;
return p;
}
int val = func ();
printf ("%d\n", val);
after the execution of func, as the stack space of func is retrieve by the c runtime, the memory address where localVal was stored will no longer guarantee to hold the original localVal value. It can be overidden by operation following the func.
Back to your question title
-
string literal have static duration.
As for "And about constants like 1, 2.10, 'a'?"
my answer is NO, your can't get address of a integer literal using &1. You may be confused by the name 'integer constant', but 1,2.10,'a' is not right value ! They do not identify a memory place,thus, they don't have duration, a variable contain their value can have duration
compound literals, well, I am not sure about this.

Resources