Scope of (string) literals - c

I always try to avoid returning string literals, because I fear they aren't defined outside of the function. But I'm not sure if this is the case. Let's take, for example, this function:
const char *
return_a_string(void)
{
return "blah";
}
Is this correct code? It does work for me, but maybe it only works for my compiler (gcc). So the question is, do (string) literals have a scope, or are they present/defined all the time?

This code is fine across all platforms. The string gets compiled into the binary as a static string literal. If you are on Windows, for example, you can even open your .exe with Notepad and search for the string itself.
Since the literal has static storage duration, scope does not matter.
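A minimal sketch of this point, assuming the hypothetical helper names greeting and greeting_length (neither is from the question): the pointer is used after the function that produced it has returned, which is valid because the literal has static storage duration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to a string literal.
   The literal has static storage duration, so the pointer remains
   valid after the function returns. */
const char *greeting(void)
{
    return "blah";
}

/* Uses the returned pointer after greeting() has already returned. */
size_t greeting_length(void)
{
    const char *s = greeting();
    return strlen(s);
}
```

Calling greeting_length() from anywhere in the program reads the same characters, since the literal is never destroyed.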
String pooling:
One thing to look out for is that in some cases, identical string literals can be "pooled" to save space in the executable file. In this case each string literal that was the same could have the same memory address. You should never assume that it will or will not be the case though.
In most compilers you can set whether or not to use static string pooling for string literals.
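As a rough illustration of why pooling must not be relied on, the sketch below (the function name literals_are_pooled is made up for this example) compares the addresses of two identical literals. Either result is permitted, so the only portable conclusion is that the answer is 0 or 1.

```c
/* Returns 1 if the two identical literals below share one address,
   0 if the implementation keeps them distinct. Either result is
   allowed, so portable code must not depend on it. */
int literals_are_pooled(void)
{
    const char *a = "pooled?";
    const char *b = "pooled?";
    return a == b;
}
```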
Maximum size of string literals:
Several compilers have a maximum size for the string literal. For example with VC++ this is approximately 2,048 bytes.
Modifying a string literal gives undefined behavior:
Never modify a string literal; doing so is undefined behavior.
char * sz = "this is a test";
sz[0] = 'T'; //<--- undefined results
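If a modified version of the text is needed, one option is to copy the literal into writable storage first. The following is only a sketch; the helper name capitalized_copy and its buffer-size convention are assumptions made for this example.

```c
#include <string.h>

/* Copies the literal into caller-provided writable storage and
   changes the first character there; the literal itself is never
   written to. Returns dst, or NULL if the buffer is too small. */
char *capitalized_copy(char *dst, size_t dst_size)
{
    const char *src = "this is a test";
    size_t len = strlen(src);

    if (dst_size < len + 1)
        return NULL;
    memcpy(dst, src, len + 1);   /* includes the terminating '\0' */
    dst[0] = 'T';                /* fine: dst is ordinary writable memory */
    return dst;
}
```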
Wide string literals:
All of the above applies equally to wide string literals.
Example: L"this is a wide string literal";
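The same pattern as for narrow literals can be sketched with a wide literal; wide_greeting is a hypothetical name used only for illustration.

```c
#include <wchar.h>

/* Returns a pointer to a wide string literal. Like a narrow literal,
   it has static storage duration, so the pointer stays valid after
   the function returns. */
const wchar_t *wide_greeting(void)
{
    return L"this is a wide string literal";
}
```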
The C++ standard states (section lex.string):
1 A string literal is a sequence of characters (as defined in lex.ccon) surrounded by double quotes, optionally beginning with the letter L, as in "..." or L"...". A string literal that does not begin with L is an ordinary string literal, also referred to as a narrow string literal. An ordinary string literal has type "array of n const char" and static storage duration (basic.stc), where n is the size of the string as defined below, and is initialized with the given characters. A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below, and is initialized with the given characters.
2 Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined.

Here is an example that may clear up the confusion:
char *f()
{
char a[]="SUMIT";
return a;
}
this won't work.
but
char *f()
{
char *a="SUMIT";
return a;
}
this works.
Reason: "SUMIT" is a string literal, which has static storage duration (it exists for the whole run of the program),
while the array, which is just a sequence of characters {'S','U','M','I','T','\0'},
is local to the function and vanishes as soon as the function returns.
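If a writable array really has to be returned, one possible variant is to give it static storage duration. This is a sketch under that assumption, with the made-up name f_static, and it comes with the usual caveat that all callers share the same storage.

```c
/* Variant of the first f(): making the array static gives it static
   storage duration, so returning its address is valid. Unlike a
   string literal, this array may be modified, but every caller
   sees the same storage. */
char *f_static(void)
{
    static char a[] = "SUMIT";
    return a;
}
```

A change made through the returned pointer is visible to the next caller, which is the price of this approach.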

This is valid in C (or C++), as others have explained.
The one thing I can think to watch out for is that if you're using dlls, then the pointer will not remain valid if the dll containing this code is unloaded.
The C (or C++) standard doesn't understand or take account of loading and unloading code at runtime, so anything which does that will face implementation-defined consequences: in this case the consequence is that the string literal, which is supposed to have static storage duration, appears from the POV of the calling code not to persist for the full duration of the program.

Yes, that's fine. They live in a global string table.

No, string literals do not have scope, so your code is guaranteed to work across all platforms and compilers. They are stored in your program's binary image, so you can always access them. However, trying to write to them (by casting away the const) will lead to undefined behavior.

You actually return a pointer to the zero-terminated string stored in the data section of the executable, an area loaded when you load the program. Just avoid trying to change the characters; doing so might give unpredictable results...

It's really important to make note of the undefined results that Brian mentioned. Since you have declared the function as returning a const char * type, you should be okay, but on many platforms string literals are placed into a read-only segment in the executable (usually the text segment), and modifying them there will cause an access violation.

Related

Is String Literal in C really not modifiable?

As far as I know, a string literal can't be modified for example:
char* a = "abc";
a[0] = 'c';
That would not work since string literal is read-only. I can only modify it if:
char a[] = "abc";
a[0] = 'c';
However, in this post,
Parse $PATH variable and save the directory names into an array of strings, the first answer modified a string literal at these two places:
path_var[j]='\0';
array[current_colon] = path_var+j+1;
I'm not very familiar with C so any explanation would be appreciated.
In programming, there are quite a few rules that are up to you to follow, even though they are not — necessarily — enforced. And "String literals in C are not modifiable" is one of those. So is "Strings returned by getenv should not be modified".
There are some real-world analogies that apply. Here's one: If you're at an intersection, and the light is red, you're not supposed to cross. But, much of the time, if you break the rule, and cross, you might get away with it. You might get a ticket from a policeman — or you might not. You might cause a crash — or you might not. But if you get lucky, and neither of these things happens, that does not imply that crossing the intersection against the red light was okay — it's still quite true that it was very much against the rules.
Similarly, in C, if you write some code that modifies a string literal, or a string returned from getenv, you might get away with it. The compiler might give you a warning or error message — or it might not. Your program might crash — or it might not. But if the program seems to work, that does not imply that these strings are actually modifiable — they're not.
Code blocks from the post you linked:
const char *orig_path_var = getenv("PATH");
char *path_var = strdup(orig_path_var ? orig_path_var : "");
const char **array;
array = malloc((nb_colons+1) * sizeof(*array));
array[0] = path_var;
array[current_colon] = path_var+j+1;
First block:
In the 1st line getenv() returns a pointer to a string, which is stored in orig_path_var. The string that getenv() returns should be treated as read-only, as the behaviour is undefined if the program attempts to modify it.
In the 2nd line strdup() is called to make a duplicate of this string. The way strdup() does this is by calling malloc() and allocating memory for the size of the string + 1 and then copying the string into the memory.
Since malloc() is used, the copy is stored on the heap, which allows us to modify it.
Second block:
In the 1st line we can see that array points to an array of char * pointers. There are nb_colons+1 pointers in the array.
Then in the 2nd line the 0th element of array is initialized to path_var (remember it is not a string literal, but a heap copy of the string returned by getenv()).
In the 3rd line, the current_colonth element of array is set to path_var+j+1. If you don't understand pointer arithmetic, this just means it assigns the address of the j+1th char of path_var to array[current_colon].
As you can see, the code is not operating on a read-only string like the one orig_path_var points to. Instead it uses a copy made with strdup(). This seems to be where your confusion stems from, so take a look at this:
char *strdup(const char *s);
The strdup() function returns a pointer to a new string which is a duplicate of the string s. Memory for the new string is obtained with malloc(3), and can be freed with free(3).
The above text shows what strdup() does according to its man page.
It may also help to read the malloc() man page.
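The copy-then-modify idea can be sketched as follows. This is not the code from the linked post: the helper name split_copy_on_colons is hypothetical, and it uses malloc() plus memcpy() in place of strdup() (a POSIX function) so that the sketch needs only the standard C library.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates src into malloc'ed storage,
   replaces each ':' in the copy with '\0', and returns the copy.
   *count receives the number of resulting pieces. The original
   string is only read. The caller must free() the result. */
char *split_copy_on_colons(const char *src, size_t *count)
{
    size_t len = strlen(src);
    char *copy = malloc(len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, src, len + 1);

    *count = 1;
    for (size_t i = 0; i < len; i++) {
        if (copy[i] == ':') {
            copy[i] = '\0';   /* legal: this is our own heap memory */
            (*count)++;
        }
    }
    return copy;
}
```

The source string, whether it is a literal or the result of getenv(), is left untouched; only the heap copy is written to.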
In the example
char* a = "abc";
the token "abc" produces a literal object in the program image, and denotes an expression which yields that object's address.
In the example
char a[] = "abc";
The token "abc" serves as an array initializer, and doesn't denote a literal object. It is equivalent to:
char a[] = { 'a', 'b', 'c', 0 };
The individual character values of "abc" are literal data recorded somewhere and somehow in the program image, but they are not accessible as a string literal object.
The array a isn't a literal, needless to say. Modifying a doesn't constitute modifying a literal, because it isn't one.
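The equivalence of the two initializer forms can be checked with a small sketch; the function names initializers_match and modify_array are invented for this example.

```c
#include <string.h>

/* Returns 1 if an array initialized from "abc" has the same size and
   bytes as one initialized element by element. */
int initializers_match(void)
{
    char a[] = "abc";
    char b[] = { 'a', 'b', 'c', 0 };

    return sizeof a == sizeof b && memcmp(a, b, sizeof a) == 0;
}

/* Modifies the local array, which is an ordinary object and not a
   string literal, and returns its new first character. */
char modify_array(void)
{
    char a[] = "abc";

    a[0] = 'c';
    return a[0];
}
```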
Regarding the remark:
That would not work since string literal is read-only.
That isn't accurate. The ISO C standard (no version of it to date) doesn't specify any requirements for what happens if a program tries to modify a string literal. It is undefined behavior. If your implementation stops the program with some diagnostic message, that's because of undefined behavior, not because it is required.
C implementations are not required to support string literal modification, which has benefits such as:
standard-conforming C programs can be translated into images that can be burned into ROM chips, such that their string literals are accessed directly from that ROM image without having to be copied into RAM on start-up.
compilers can condense the storage for string literals by taking advantage of situations when one literal is a suffix of another. The expression "string" + 2 == "ring" can yield true. Since a strictly conforming program will not do something like "ring"[0] = 'w', due to that being undefined behavior, such a program will thereby avoid falling victim to the surprise of "string" unexpectedly turning into "stwing".
There are several reasons why you had better not modify them:
First, the operating system and/or the compiler can enforce the non-writable property of string literals by putting them in read-only memory (e.g. ROM) or in the .text segment.
Second, the compiler is allowed to merge string literals together, so if you modify one (and do it successfully) you can get surprises later because other literals (that have been merged because e.g. one of them is a suffix of the other) change for no apparent reason.
If you need an initialized string that is modifiable, you can declare an array, as in the following (which you can freely modify):
char array[100] = "abc"; // initialized to { 'a' ,'b', 'c', '\0',
// /* and 96 more '\0' characters */
// };

In C, how can printf("Hello") ever output "Cello" in any circumstance?

It completely escapes me how printf("Hello") could ever print Cello. It challenges my basic understanding of C. But from the top answer (by Carson Myers) for the following question on Stack Overflow, it seems it is possible. Can you please explain in simple terms how it is possible? Here's what the answer says:
Whenever you write a string in your source, that string is read only
(otherwise you would be potentially changing the behavior of the
executable--imagine if you wrote char *a = "hello"; and then changed
a[0] to 'c'. Then somewhere else wrote printf("hello");. If you were
allowed to change the first character of "hello", and your compiler
only stored it once (it should), then printf("hello"); would output
cello!)
Aforementioned question: Is it possible to modify a string of char in C?
Reasons:
Compilers usually store only one copy of identical string literals, so the string literal in char *a = "hello"; and the one in printf("hello") could be at the same memory location.
The answer in your link assumes that the memory where string literals are stored is writable, which is typically not the case on modern architectures. It can be the case, however, when there is no memory access protection, e.g. on some embedded architectures or an 80386 working in real mode.
So when you modify the string referenced by a, the value for printf changes as well.
If you, somewhere in your source, have the string literal "Hello", that ends up in your executable as part of the code / data segment. This should be considered read-only at all times, because compilers are at liberty to optimize multiple occurrences of the same literal into a single entity. You could have multiple cases of "Hello" in your source, and multiple pointers pointing to them, but they could all be pointing to the same address.
ISO/IEC 9899 "Programming languages - C", chapter 6.4.5 "String literals", paragraph 6:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
Thus, any pointer to such a string literal is to be declared as a pointer to constant contents, to make this clear on the source level:
char const * a = "Hello";
Given this definition, a[0] = 'C'; is not a valid operation: You cannot change a const value, the compiler would issue an error.
However, there is more than one way to "trick" the language. For one, you could cast the pointer:
char const * a = "Hello";
char * b = (char *)a;
b[0] = 'C';
As the above snippet from the standard states, this -- while syntactically correct -- is semantically undefined behaviour. It might even work "correctly" on certain platforms (mostly for historical reasons), and actually print "Cello". It might break on others.
Consider what would happen if your executable is burned into a ROM chip, and executed from there...
I said "historical reasons". In the beginning, there was no const. That is why C defines the type of a string literal as char[] (no const).
Note that:
C++98 does define string literals as being const, but allows conversion to char *.
C++03 still allows the conversion but deprecates it.
C++11 no longer allows the conversion without a cast.
This is a practical explanation (i.e., not dictated by the C-language standard):
First, you declare char *a = "hello" somewhere in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces char *a = "hello" with char *a = the address of "hello" in memory
Then, you call printf("hello") somewhere else in your code.
As a result, the compiler:
Generates a constant string "hello" and places it in a read-only memory section within the executable image (typically within the RO data section), but only if it hasn't already done so
Replaces printf("hello") with printf(the address of "hello" in memory)
Now, theoretically (as explained by @Carson Myers), if you could change any of the characters in "hello", then it would affect the result of anything that refers to the data located at the address of that string in memory.
In practice, because the compiler places all constant strings in a read-only memory section, it is not feasible.
Here a may point to a different "hello" than the one that you pass to printf (you may have two copies of "hello" in your program).
It will work if you ask printf to print the string at a.

In c, what are the rules governing how compilers merge the same strings into the executable

I am trying to find what the rules are for c and c++ compilers putting strings into the data section of executables and don't know where to look. I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec:
char * test1 = "hello";
const char * test2 = "hello";
static char * test3 = "hello";
static const char * test4 = "hello";
extern const char * test5; // Defined in another compilation unit as "hello"
extern const char * test6; // Defined in another shared object as "hello"
Testing on windows, they are all the same. However I do not know if they would be on all operating systems.
I would like to know if the address of all of the following are guaranteed to be the same in c/c++ by the spec
String literals are allowed to be the same object but are not required to.
C++ says:
(C++11, 2.14.5p12) "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined. The effect of attempting to modify a string literal is undefined."
C says:
(C11, 6.5.2.5p7) "String literals, and compound literals with const-qualified types, need not designate distinct objects." Footnote 101 adds: "This allows implementations to share storage for string literals and constant compound literals with the same or overlapping representations."
And C99 Rationale says:
"This specification allows implementations to share copies of strings with identical text, to place string literals in read-only memory, and to perform certain optimizations"
Firstly, this has nothing to do with the operating system. It depends solely on the implementation, i.e. on the compiler.
Secondly, the only "guarantees" you can hope for in this case will come from the compiler documentation. The formal rules of the language neither guarantee them to be the same, nor guarantee them to be different. (The latter applies to both C and C++.)
Thirdly, some compilers have such bizarre options as "make string literals modifiable". This usually implies that each literal is allocated in a unique region of storage and has a unique address.
They can all be the same. Even x and y in the following can be the same. z can overlap with y
const char *x = "hello";
const char *y = "hello\0folks";
const char *z = "folks";
In C, I believe the only guarantee about a string literal is that it will evaluate to a pointer to a readable area of memory that will, assuming a program does not engage in Undefined Behavior, always contain the indicated characters followed by a zero byte. The compiler and linker are allowed to work together in any fashion they see fit to make that happen. While I don't know of any compiler/linker systems that do this, it would be perfectly legitimate for a compiler to put each string literal in its own constant section, and for a linker to place such sections in reverse order of length, and check before placing each one whether the appropriate sequence of bytes had already been placed somewhere. Note that the sequence of bytes wouldn't even have to be a string literal or defined constant; if the linker is trying to place the string "Hi!" and it notices that machine code contains the sequence of bytes [0x48, 0x69, 0x21, 0x00], the literal could evaluate to a pointer to the first of those.
Note that writing to the memory pointed to by a string literal is Undefined Behavior. On various systems a write may trap, do nothing, or affect only the literal written, but it could also have totally unpredictable consequences [e.g. if the literal evaluated to a pointer into some machine code].

How are string literals compiled in C?

How are string literals compiled in C? As per my understanding, in test1, the string "hello" is put in the data segment by the compiler, and in the 2nd line p is assigned that hard-coded virtual address. Is this correct? And is it true that there is no basic difference between how test1 works and how test2 works?
Some code:
#include <stdio.h>
test1();
test2();
test3();
main()
{
test1();
test2();
//test3();
}
test1()
{
char *p;
p="hello";
}
test2()
{
char *p="hello";
}
test3()
{
char *p;
strcpy(p,"hello");
}
Any reference from the C standard will be greatly appreciated, so that I can understand this in depth from the compiler's point of view.
From the C standard point of view there's no particular requirement about where the literal string will be placed. About the only requirements about the storage of string literals are in C99 6.4.5/5 "String literals":
"an array of static storage duration and length just sufficient to contain the sequence" , which means that the literal will have a lifetime as long as the program.
"It is unspecified whether these arrays are distinct provided their elements have the appropriate value", which means the various "hello" literals in your example may or may not have the same address. You can't count on either behavior.
"If the program attempts to modify such an array, the behavior is undefined", which means that you can't change the string literal. On many platforms this is enforced (if you attempt to do so, the program will crash). On some platforms, the change may appear to work so you can't count on the bug being readily evident.
Your understanding is correct, the data of "Hello" will be put in a RO segment, and its relative virtual address will be assigned to the pointers in the testX() functions.
However, those are compiler-specific perspectives, the C standard doesn't care about them.
EDIT: Per test3(), see τεκ's comment.
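The comment referred to is not reproduced here. One visible problem in test3() as posted is that p is uninitialized when strcpy() writes through it, which is undefined behavior. A possible corrected version, sketched under that reading and with the made-up name test3_fixed, gives p real storage first:

```c
#include <string.h>

/* Sketch of a corrected test3(): p points at an array large enough
   for "hello" plus its terminator before strcpy() writes through it.
   Returns the length of the copied string so the result is visible. */
size_t test3_fixed(void)
{
    char buf[sizeof "hello"];   /* 6 bytes: 5 characters + '\0' */
    char *p = buf;              /* p now points at real storage */

    strcpy(p, "hello");
    return strlen(p);
}
```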

C Duration of strings, constants, compound literals, and why not, the code itself

I don't remember where I read that if I pass a string to a function like this:
char *string;
string = func ("heyapple!");
char *func (char *string) {
char *p;
p = string;
return p;
}
printf ("%s\n", string);
The string pointer continues to be valid because "heyapple!" is in memory; it IS in the code that I wrote, so it will never be taken away, right?
And about constants like 1, 2.10, 'a'?
And compound literals?
For example, if I do this:
func (1, 'a', "string");
Will only the string last for all of my program's execution, or will the constants too?
For example, I learned that I can take the address of a string by doing this:
&"string";
Can I take the address of constant literals like 1, 2.10, 'a'?
I'm passing these as function arguments, and they need to have static duration like strings, without the word static.
Thanks a lot.
This doesn't make a whole lot of sense.
Values that are not pointers cannot be "freed", they are values, they can't go away.
If I do:
int c = 1;
The variable 'c' is not a pointer, it cannot do anything else than contain an integer value, to be more specific it can't NOT contain an integer value. That's all it does, there are no alternatives.
In practice, the literals will be compiled into the generated machine-code, so that somewhere in the code resulting from the above will be something like
load r0, 1
Or whatever the assembler for the underlying instruction set looks like. The '1' is a part of the instruction encoding, it can't go away.
Make sure you distinguish between values and pointers to memory. Pointers are themselves values, but a special kind of value that contains an address to memory.
With char* hello = "hello";, there are two things happening:
the string "hello" and a null-terminator are written somewhere in memory
a variable named hello contains a value which is the address to that memory
With int i = 0; only one thing happens:
a variable named i contains the value 0
When you pass around variables to functions their values are always copied. This is called pass by value and works fine for primitive types like int, double, etc. With pointers this is tricky because only the address is copied; you have to make sure that the contents of that address remain valid.
Short answer: yes. 1 and 'a' stick around due to pass by value semantics and "hello" sticks around due to string literal allocation.
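The difference between copying a plain value and copying a pointer can be sketched like this; both function names are hypothetical.

```c
/* The parameter n is a copy of the caller's value; changing it here
   does not affect the caller's variable. */
void set_to_zero(int n)
{
    n = 0;
    (void)n;   /* silence "set but not used" warnings */
}

/* The pointer is also passed by value, but the copy refers to the
   same object, so writing through it is visible to the caller. */
void set_target_to_zero(int *p)
{
    *p = 0;
}
```

After int v = 5; set_to_zero(v); the variable v still holds 5, while set_target_to_zero(&v); changes it to 0.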
Stuff like 1, 'a', and "heyapple!" are called literals, and they get stored in the compiled code, and in memory for when they have to be used. If they remain or not in memory for the duration of the program depends on where they are declared in the program, their size, and the compiler's characteristics, but you can generally assume that yes, they are stored somewhere in memory, and that they don't go away.
Note that, depending on the compiler and OS, it may be possible to change the value of literals, inadvertently or purposely. Many systems store literals in read-only areas (CONST sections) of memory to avoid nasty and hard-to-debug accidents.
For literals that fit into a memory word, like ints and chars it doesn't matter how they are stored: one repeats the literal throughout the code and lets the compiler decide how to make it available. For larger literals, like strings and structures, it would be bad practice to repeat, so a reference should be kept.
Note that if you use macros (#define HELLO "Hello!") it is up to the compiler to decide how many copies of the literal to store, because macro expansion is exactly that, a substitution of macros for their expansion that happens before the compiler takes a shot at the source code. If you want to make sure that only one copy exists, then you must write something like:
#define HELLO "Hello!"
char* hello = HELLO;
Which is equivalent to:
char* hello = "Hello!";
Also note that a declaration like:
const char* hello = "Hello!";
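Prevents writes through hello, but does not by itself make the memory it points to immutable, because the const can be cast away:
char *h = (char *) hello;
h[3] = 'n'; /* compiles, but writing to a string literal is undefined behavior */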
I don't know if this case is defined in the C reference, but I would not rely on it:
char* hello = "Hello!";
char* hello2 = "Hello!"; // is it the same memory?
It is better to think of literals as unique and constant, and treat them accordingly in the code.
If you do want to modify a copy of a literal, use arrays instead of pointers, so it's guaranteed a different copy of the literal (and not an alias) is used each time:
char hello[] = "Hello!";
Back to your original question, the memory for the literal "heyapple!" will be available (will be referenceable) as long as a reference is kept to it in the running code. Keeping a whole module (a loadable library) in memory because of a literal may have consequences on overall memory use, but that's another concern (you could also force the unloading of the module that defines the literal and get all kind of strange results).
First, regarding "it IS in the code that I wrote, so it will never be taken away, right?": my answer is yes. I recommend you have a look at the structure of ELF, or at the runtime structure of an executable. Where a string literal is stored is implementation-dependent; with gcc on ELF targets, string literals are stored in the .rodata section (the counterpart in Windows PE files is .rdata). As the name implies, this section is read-only. In your code
char *p;
p = string;
the pointer p now points to an address in a read-only segment, so even after the function call ends, that address is still valid. But returning a pointer to a local variable is dangerous and may cause hard-to-find bugs:
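int *func (void) {
int localVal = 100;
int *ptr = &localVal; /* address of a local (automatic) variable */
return ptr; /* dangling as soon as func returns */
}
int *val = func ();
printf ("%d\n", *val); /* undefined behavior: localVal no longer exists */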
After func returns, its stack space is reclaimed, so the memory address where localVal was stored is no longer guaranteed to hold the original value. It can be overwritten by whatever runs after func.
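Two common ways to avoid the dangling pointer are sketched below, with names (func_by_value, func_into) chosen only for this illustration.

```c
/* Option 1: return the value itself; it is copied out to the caller
   before localVal's lifetime ends. */
int func_by_value(void)
{
    int localVal = 100;

    return localVal;
}

/* Option 2: let the caller own the storage and fill it in. */
void func_into(int *out)
{
    *out = 100;
}
```

In both cases the object that ends up holding 100 belongs to the caller, so nothing refers to func's stack frame after it returns.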
Back to your question title: string literals have static storage duration.
As for "And about constants like 1, 2.10, 'a'?":
my answer is no, you can't get the address of an integer literal using &1. You may be confused by the name 'integer constant', but 1, 2.10 and 'a' are not lvalues. They do not identify a memory location, so they don't have a storage duration; a variable that contains their value can have one.
Compound literals: well, I am not sure about those.
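On compound literals, a sketch may help; it relies on C99 rules (a compound literal is an lvalue, with automatic storage duration inside a function body and static storage duration at file scope), and all names in it are made up for the example.

```c
/* At file scope a compound literal has static storage duration,
   so this pointer stays valid for the whole program. */
static int *file_scope_values = (int[]){ 42 };

int read_file_scope_value(void)
{
    return file_scope_values[0];
}

/* Inside a function body a compound literal has automatic storage
   duration tied to the enclosing block. Its address can be taken,
   but the pointer must not be used after that block ends. */
int sum_via_compound_literals(void)
{
    int *one = &(int){ 1 };
    const int *pair = (const int[]){ 2, 3 };

    return *one + pair[0] + pair[1];
}
```

So, unlike a string literal, a compound literal created inside a function should not be returned by address.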

Resources