How are string literals compiled in C? As I understand it, in test1 the string "hello" is put in the data segment by the compiler, and in the 2nd line p is assigned that hard-coded virtual address. Is this correct? And is it also correct that there is no basic difference between how test1 works and how test2 works?
Some code:
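#include <stdio.h>
#include <string.h>   /* strcpy */
void test1(void);
void test2(void);
void test3(void);
int main(void)
{
    test1();
    test2();
    //test3();
    return 0;
}
void test1(void)
{
    char *p;
    p="hello";
}
void test2(void)
{
    char *p="hello";
}
void test3(void)
{
    char *p;            /* never initialized */
    strcpy(p,"hello");  /* writes through an indeterminate pointer: undefined behavior */
}
Any reference from the C standard will be greatly appreciated, so that I can understand this in depth from the compiler's point of view.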
From the C standard's point of view there is no particular requirement about where a string literal is placed. About the only requirements on the storage of string literals are in C99 6.4.5, paragraphs 5 and 6, "String literals":
"an array of static storage duration and length just sufficient to contain the sequence", which means that the literal will have a lifetime as long as the program.
"It is unspecified whether these arrays are distinct provided their elements have the appropriate values", which means the various "hello" literals in your example may or may not have the same address. You can't count on either behavior.
"If the program attempts to modify such an array, the behavior is undefined", which means that you can't change the string literal. On many platforms this is enforced (if you attempt to do so, the program will crash). On some platforms, the change may appear to work, so you can't count on the bug being readily evident.
Your understanding is essentially correct: the bytes of "hello" are typically placed in a read-only segment, and their relative virtual address is assigned to the pointers in the testX() functions.
However, those are compiler-specific perspectives, the C standard doesn't care about them.
EDIT: Per test3(), see τεκ's comment.
I completely fail to see how printf("Hello") could ever print Cello. It challenges my basic understanding of C. But according to the top answer (by Carson Myers) to the following question on Stack Overflow, it seems to be possible. Can you please explain in simple terms how? Here's what the answer says:
Whenever you write a string in your source, that string is read only
(otherwise you would be potentially changing the behavior of the
executable--imagine if you wrote char *a = "hello"; and then changed
a[0] to 'c'. Then somewhere else wrote printf("hello");. If you were
allowed to change the first character of "hello", and your compiler
only stored it once (it should), then printf("hello"); would output
cello!)
Aforementioned question: Is it possible to modify a string of char in C?
Reasons:
Compilers usually store only one copy of identical string literals, so the string literal in char *a = "hello"; and in printf("hello") could be at a same memory location.
The answer in your link assumes that the memory used to store string literals is writable, which is typically not the case on modern architectures. It can be the case where there is no memory access protection, e.g. on some embedded architectures or on an 80386 running in real mode.
So when you modify the string referenced by a, the value for printf changes as well.
If you, somewhere in your source, have the string literal "Hello", that ends up in your executable as part of the code / data segment. This should be considered read-only at all times, because compilers are at liberty to optimize multiple occurrences of the same literal into a single entity. You could have multiple instances of "Hello" in your source, and multiple pointers pointing to them, but they could all be pointing to the same address.
ISO/IEC 9899 "Programming languages - C", chapter 6.4.5 "String literals", paragraph 6:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
Thus, any pointer to such a string literal should be declared as a pointer to constant contents, to make this clear at the source level:
char const * a = "Hello";
Given this definition, a[0] = 'C'; is not a valid operation: You cannot change a const value, the compiler would issue an error.
However, there is more than one way to "trick" the language. For one, you could cast the pointer:
char const * a = "Hello";
char * b = (char *)a;
b[0] = 'C';
As the above snippet from the standard states, this -- while syntactically correct -- is semantically undefined behaviour. It might even work "correctly" on certain platforms (mostly for historical reasons), and actually print "Cello". It might break on others.
Consider what would happen if your executable is burned into a ROM chip, and executed from there...
I said "historical reasons". In the beginning, there was no const. That is why C defines the type of a string literal as char[] (no const).
Note that:
C++98 does define string literals as being const, but allows an implicit conversion to char *, which it already marks as deprecated.
C++03 keeps that deprecated conversion.
C++11 no longer allows the conversion without a cast.
This is a practical explanation (i.e., not dictated by the C-language standard):
First, you declare char *a = "hello" somewhere in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces char *a = "hello" with char *a = the address of "hello" in memory
Then, you call printf("hello") somewhere else in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces printf("hello") with printf(the address of "hello" in memory)
Now, theoretically (as explained by Carson Myers), if you could change any of the characters in "hello", it would affect the result of anything that refers to the data located at the address of that string in memory.
In practice, on platforms where the compiler places constant strings in a read-only memory section, the write fails (typically with a crash) instead.
a may point to a different "hello" than the one that you pass to printf (there can be two copies of "hello" in your program).
It will work if you ask printf to print the string at a.
I am trying to find what the rules are for c and c++ compilers putting strings into the data section of executables and don't know where to look. I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec:
char * test1 = "hello";
const char * test2 = "hello";
static char * test3 = "hello";
static const char * test4 = "hello";
extern const char * test5; // Defined in another compilation unit as "hello"
extern const char * test6; // Defined in another shared object as "hello"
Testing on windows, they are all the same. However I do not know if they would be on all operating systems.
I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec
String literals are allowed to be the same object but are not required to.
C++ says:
(C++11, 2.14.5p12) "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined."
C says:
(C11, 6.5.2.5p7) "String literals, and compound literals with const-qualified types, need not designate distinct objects." Footnote 101 adds: "This allows implementations to share storage for string literals and constant compound literals with the same or overlapping representations."
And C99 Rationale says:
"This specification allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and to perform certain optimizations"
Firstly, this has nothing to do with the operating system. It depends solely on the implementation, i.e., on the compiler and linker.
Secondly, the only "guarantees" you can hope for in this case will come from the compiler documentation. The formal rules of the language neither guarantee them to be the same, nor guarantee them to be different. (The latter applies to both C and C++.)
Thirdly, some compilers have bizarre options such as "make string literals modifiable". This usually implies that each literal is allocated in a unique region of storage and has a unique address.
They can all be the same. Even x and y in the following can be the same, and z can overlap with y:
const char *x = "hello";
const char *y = "hello\0folks";
const char *z = "folks";
In C, I believe the only guarantee about a string literal is that it will evaluate to a pointer to a readable area of memory that will, assuming a program does not engage in Undefined Behavior, always contain the indicated characters followed by a zero byte. The compiler and linker are allowed to work together in any fashion they see fit to make that happen. While I don't know of any compiler/linker systems that do this, it would be perfectly legitimate for a compiler to put each string literal in its own constant section, and for a linker to place such sections in reverse order of length, and check before placing each one whether the appropriate sequence of bytes had already been placed somewhere. Note that the sequence of bytes wouldn't even have to be a string literal or defined constant; if the linker is trying to place the string "Hi!" and it notices that machine code contains the sequence of bytes [0x48, 0x69, 0x21, 0x00], the literal could evaluate to a pointer to the first of those.
Note that writing to the memory pointed to by a string literal is Undefined Behavior. On various systems a write may trap, do nothing, or affect only the literal written, but it could also have totally unpredictable consequences [e.g. if the literal evaluated to a pointer into some machine code].
I have recently learnt that it's possible to change the values of constants in C using a pointer, but that it's not possible for string literals. Possibly the explanation lies in the fact that constants and other strings are allocated space in a modifiable region of memory, whereas string literals go in a non-modifiable region (possibly the code segment). I have written a program that displays the addresses of these variables. Outputs are shown as well.
#include <stdio.h>
int x=0;
int y=0;
int main(int argc, char *argv[])
{
const int a =5;
const int b;
const int c =10;
const char *string = "simar"; //its a literal, gets space in code segment
char *string2 = "hello";
char string3[] = "bye"; // its array, so gets space in data segment
const char string4[] = "guess";
const int *pt;
int *pt2;
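/* %p with a (void *) argument is the portable way to print an address */
printf("\nx:%p\ny:%p Note that values are increasing\n"
       "a:%p\nb:%p\nc:%p Note that values are dec, so they are on stack\n"
       "string:%p\nstring2:%p Note that address range is different so they are in code segment\n"
       "string3:%p Note it seems on stack as well\nstring4:%p\n",
       (void *)&x, (void *)&y, (void *)&a, (void *)&b, (void *)&c,
       (void *)string, (void *)string2, (void *)string3, (void *)string4);
return 0;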
}
Please explain where exactly these variables get their space.
Where do globals, constants and string literals get their space?
"Possible" is over-stating the case.
You can write code that attempts to modify a const object (for example by casting its address to a pointer-to-non-const type). You can also write code that attempts to modify a string literal.
In both cases your code has undefined behavior, meaning that the standard doesn't care what happens. The implementation can do what it likes, and what happens is usually an accidental side-effect of something else that does matter. You cannot rely on the behavior. In short, that code is wrong. This is true both of objects defined as const, and of string literals.
It may be that on a particular implementation, the effect is to change the object or the literal. It may be that on another implementation, you get an access error and the program crashes. It may be that on a third implementation, you get one behavior sometimes and the other at other times. It may be that something entirely different happens.
It's implementation-specific where variables get their space, but in a typical implementation:
x and y are in a modifiable data segment
a is on the stack. If it weren't for the fact that you take its address, then the variable storage could be optimized away entirely, and the value 5 used as an immediate value in any CPU instructions that the compiler emits for code that uses a.
b is an uninitialized const object. That is allowed in C for a local variable (C++ rejects it), but its value is indeterminate and the compiler probably ought to warn. It is on the stack as well.
c is on the stack, same as a.
the literals "simar" etc are all either in a code segment, a read-only data segment, or a modifiable data segment if the implementation doesn't bother with rodata.
string3 and string4 are arrays on the stack. Each is initialized by copying the contents of a string literal.
I have recently learnt that its possible to change values of constants in c using a pointer
Doing so leads to undefined behavior (see the standard 6.7.3), meaning that anything can happen. Practically, you can modify constants on some RAM-based systems.
its not possible for string literals
That is equally undefined behavior and may work, or may not work, or it may cause blue smoke to rise from your hard drive.
Possibly the explanation lies in the fact that constants and other strings are allocated space in modifiable region in space whereas string literals gets in non-modifiable region in space
This is system-dependent. On some systems they both lie in a constant/virtual RAM segment, on some they could lie in non-volatile flash memory. Having a discussion about where things end up in memory is pointless without specifying what system you are talking about. There is no generic case.
I don't remember where I read that if I pass a string to a function like this:
#include <stdio.h>

char *func (char *string) {
    char *p;
    p = string;
    return p;
}

int main (void) {
    char *string;
    string = func ("heyapple!");
    printf ("%s\n", string);
    return 0;
}
the string pointer continues to be valid because "heyapple!" is in memory; it IS in the code that I wrote, so it will never be taken away, right?
And what about constants like 1, 2.10, 'a'?
And compound literals?
For example, if I do this:
func (1, 'a', "string");
Will only the string exist for the whole execution of my program, or will the constants too?
For example, I learned that I can take the address of a string by doing this:
&"string";
Can I take the address of constant literals like 1, 2.10, 'a'?
I'm passing these as function arguments, and they need to have static duration like strings, without the word static.
Thanks a lot.
This doesn't make a whole lot of sense.
Values that are not pointers cannot be "freed", they are values, they can't go away.
If I do:
int c = 1;
The variable 'c' is not a pointer, it cannot do anything else than contain an integer value, to be more specific it can't NOT contain an integer value. That's all it does, there are no alternatives.
In practice, the literals will be compiled into the generated machine-code, so that somewhere in the code resulting from the above will be something like
load r0, 1
Or whatever the assembler for the underlying instruction set looks like. The '1' is a part of the instruction encoding, it can't go away.
Make sure you distinguish between values and pointers to memory. Pointers are themselves values, but a special kind of value that contains an address to memory.
With char* hello = "hello";, there are two things happening:
the string "hello" and a null-terminator are written somewhere in memory
a variable named hello contains a value which is the address to that memory
With int i = 0; only one thing happens:
a variable named i contains the value 0
When you pass around variables to functions their values are always copied. This is called pass by value and works fine for primitive types like int, double, etc. With pointers this is tricky because only the address is copied; you have to make sure that the contents of that address remain valid.
Short answer: yes. 1 and 'a' stick around due to pass by value semantics and "hello" sticks around due to string literal allocation.
Stuff like 1, 'a', and "heyapple!" are called literals, and they get stored in the compiled code, and in memory for when they have to be used. If they remain or not in memory for the duration of the program depends on where they are declared in the program, their size, and the compiler's characteristics, but you can generally assume that yes, they are stored somewhere in memory, and that they don't go away.
Note that, depending on the compiler and OS, it may be possible to change the value of literals, inadvertently or purposely. Many systems store literals in read-only areas (CONST sections) of memory to avoid nasty and hard-to-debug accidents.
For literals that fit into a memory word, like ints and chars it doesn't matter how they are stored: one repeats the literal throughout the code and lets the compiler decide how to make it available. For larger literals, like strings and structures, it would be bad practice to repeat, so a reference should be kept.
Note that if you use macros (#define HELLO "Hello!") it is up to the compiler to decide how many copies of the literal to store, because macro expansion is exactly that, a substitution of macros for their expansion that happens before the compiler takes a shot at the source code. If you want to make sure that only one copy exists, then you must write something like:
#define HELLO "Hello!"
char* hello = HELLO;
Which is equivalent to:
char* hello = "Hello!";
Also note that a declaration like:
const char* hello = "Hello!";
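Makes the characters read-only when accessed through hello, but does not stop a write through a cast pointer (which is undefined behavior when the target is a string literal):
char *h = (char *) hello;
h[3] = 'n';
Whether the following two pointers refer to the same memory is unspecified by the C standard, so I would not rely on it either way: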
char* hello = "Hello!";
char* hello2 = "Hello!"; // is it the same memory?
It is better to think of literals as unique and constant, and treat them accordingly in the code.
If you do want to modify a copy of a literal, use arrays instead of pointers, so it's guaranteed a different copy of the literal (and not an alias) is used each time:
char hello[] = "Hello!";
Back to your original question, the literal "heyapple!" has static storage duration, so its memory remains available for the whole run of the program, whether or not the running code keeps a reference to it. Keeping a whole module (a loadable library) in memory because of a literal may have consequences for overall memory use, but that's another concern (you could also force the unloading of the module that defines the literal and get all kinds of strange results).
First, "it IS in the code that I wrote, so it will never be taken away, right?" My answer is yes. I recommend you have a look at the structure of ELF, or at the runtime layout of an executable. Where a string literal is stored is implementation-dependent; with gcc on ELF targets, string literals are stored in the .rodata section (.rdata in Windows PE files). As the name implies, it is read-only. In your code
char *p;
p = string;
the pointer p now points to an address in a read-only segment, so even after the function call ends, that address is still valid. But if you return a pointer to a local variable, that is dangerous and may cause hard-to-find bugs:
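int *func (void) {
    int localVal = 100;
    int *ptr = &localVal;   /* address of a local (automatic) variable */
    return ptr;             /* dangling as soon as func returns */
}
int *val = func ();
printf ("%d\n", *val);      /* undefined behavior: localVal no longer exists */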
After func returns, its stack space is reclaimed, so the memory address where localVal was stored is no longer guaranteed to hold the original value. It can be overwritten by any operation that follows the call to func.
Back to your question title: string literals have static storage duration.
As for "And about constants like 1, 2.10, 'a'?"
my answer is NO, you can't get the address of an integer literal using &1. You may be confused by the name 'integer constant', but 1, 2.10 and 'a' are not lvalues! They do not identify a memory location, so they have no storage duration; a variable containing their value can have one.
compound literals, well, I am not sure about this.
I always try to avoid returning string literals, because I fear they aren't defined outside of the function. But I'm not sure if this is the case. Let's take, for example, this function:
const char *
return_a_string(void)
{
return "blah";
}
Is this correct code? It does work for me, but maybe it only works for my compiler (gcc). So the question is, do (string) literals have a scope or are they present/defined all the time.
This code is fine across all platforms. The string gets compiled into the binary as a static string literal. If you are on windows for example you can even open your .exe with notepad and search for the string itself.
Since it is a static string literal scope does not matter.
String pooling:
One thing to look out for is that in some cases, identical string literals can be "pooled" to save space in the executable file. In this case each string literal that was the same could have the same memory address. You should never assume that it will or will not be the case though.
In most compilers you can set whether or not to use static string pooling for string literals.
Maximum size of string literals:
Several compilers have a maximum size for the string literal. For example with VC++ this is approximately 2,048 bytes.
Modifying a string literal gives undefined behavior:
Modifying a string literal should never be done. It has undefined behavior.
char * sz = "this is a test";
sz[0] = 'T'; //<--- undefined results
Wide string literals:
All of the above applies equally to wide string literals.
Example: L"this is a wide string literal";
The C++ standard states: (section lex.string)
1 A string literal is a sequence of characters (as defined in
lex.ccon) surrounded by double quotes, optionally beginning with the
letter L, as in "..." or L"...". A string literal that does not begin
with L is an ordinary string literal, also referred to as a narrow
string literal. An ordinary string literal has type "array of n const
char" and static storage duration (basic.stc), where n is the size of
the string as defined below, and is initialized with the given
characters. A string literal that begins with L, such as L"asdf", is
a wide string literal. A wide string literal has type "array of n
const wchar_t" and has static storage duration, where n is the size of
the string as defined below, and is initialized with the given
characters.
2 Whether all string literals are distinct (that is, are stored in
nonoverlapping objects) is implementation-defined. The effect of
attempting to modify a string literal is undefined.
Here is an example that should clear up some of your confusion:
char *f()
{
char a[]="SUMIT";
return a;
}
this won't work.
but
char *f()
{
char *a="SUMIT";
return a;
}
this works.
Reason: "SUMIT" is a string literal, which has static storage duration (it exists for the whole run of the program),
while the array, which is just a sequence of characters {'S','U','M','I','T','\0'},
has automatic storage duration and vanishes as soon as the function returns.
This is valid in C (or C++), as others have explained.
The one thing I can think to watch out for is that if you're using dlls, then the pointer will not remain valid if the dll containing this code is unloaded.
The C (or C++) standard doesn't understand or take account of loading and unloading code at runtime, so anything which does that will face implementation-defined consequences: in this case the consequence is that the string literal, which is supposed to have static storage duration, appears from the POV of the calling code not to persist for the full duration of the program.
Yes, that's fine. They live in a global string table.
No, string literals do not have scope, so your code is guaranteed to work across all platforms and compilers. They are stored in your program's binary image, so you can always access them. However, trying to write to them (by casting away the const) will lead to undefined behavior.
You actually return a pointer to the zero-terminated string stored in the data section of the executable, an area loaded when you load the program. Just avoid trying to change the characters; it might give unpredictable results...
It's really important to note the undefined results that Brian mentioned. Since you have declared the function as returning a const char * type, you should be okay, but on many platforms string literals are placed in a read-only segment of the executable (usually the text segment), and modifying them will cause an access violation.