differences in array initialization(char, string, other) regarding storage duration - c

In this question it was said in the comments:
char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; and char arr[10] =
"Hello"; are strictly the same thing. – Michael Walz
This got me thinking.
I know that "Hello" is string literal. String literals are stored with static storage duraction and are immutable.
But if both are are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal with.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
char a = 'a'; is the same thing as char a; ...; a = 'a';, so your thoughts are correct 'a' is simply written to a
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
EDIT:
I see that I haven't made it clear enough that I am particularly interested in the memory usage/storage duration of the literals. I will leave the question as it is, but would like to make the emphasis of the question more clear in this edit.

I know that "Hello" is string literal. String literals are stored with static storage duraction and are immutable.
Yes, but string literals are also a grammatical item in the C language. char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; is not a string literal, it is an initializer list. The initializer list does however behave as if it has static storage duration, remaining elements after the explicit \0 are set to zero etc.
The initializer list itself is stored in some manner of ROM memory. If your variable arr has static storage duration too, it will get allocated in the .data segment and initialized from the ROM init list before the program is started. If arr has automatic storage duration (local), then it is initialized from ROM in run-time, when the function containing arr is called.
The ROM memory where the initializer list is stored may or may not be the same ROM memory as used for string literals. Often there's a segment called .rodata where these things end up, but they may as well end up in some other segment, such as the code segment .text.
Compilers like to store string literals in a particular memory segment, because that means that they can perform an optimization called "string pooling". Meaning that if you have the string literal "Hello" several times in your program, the compiler will use the same memory location for it. It may not necessarily do this same optimization for initializer lists.
Regarding 'a' versus {'a'} in an initializer list, that's just a syntax hiccup in the C language. C11 6.7.6/11:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. The
initial value of the object is that of the expression (after conversion); the same type
constraints and conversions as for simple assignment apply,
In plain English, this means that a "non-array" (scalar) can be either initialized with or without braces, it has the same meaning. Apart from that, the same rules as for regular assignment apply.

I know that "Hello" is string literal. String literals are stored with static storage duraction and are immutable.
Yes. But with char arr[10] = "Hello";, you are copying the string literal to an array arr and there's no need to "keep" the string literal. So if an implementation chooses to do remove the string literal altogether after copying it to arr and that's totally valid.
But if both are are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Again there's no need to make/store a string literal for this.
Only if you directly have a pointer to a string literal, it'd be usually stored somewhere such as:
char *string = "Hello, world!\n";
Even then an implementation can choose not to do so under the "as-if" rule. E.g.,
#include <stdio.h>
#include <string.h>
static const char *str = "Hi";
int main(void)
{
char arr[10];
strcpy(arr, str);
puts(arr);
}
"Hi" can be eliminated because it's used only for copying it into arr and isn't accessed directly anywhere. So eliminating the string literal (and the strcpy call too) as if you had "char arr[10] = "Hi"; and wouldn't affect the observable behaviour.
Basically the C standard doesn't necessitate a string literal has to be stored anywhere as long as the properties associated with a string literal are satisfied.
Are there differences between: char a = 'a'; char a = {'a'}; How/where are the differences defined?
Yes. C11, 6.7.9 says:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. [..]
Per the syntax, even:
char c = {'a',}; is valid and equivalent too (though I wouldn't recommend this :).

In the abstract machine, char arr[10] = "Hello"; means that arr is initialized by copying data from the string literal "Hello" which has its own existence elsewhere; whereas the other version just has initial values like any other variable -- there is no string literal involved.
However, the observable behaviour of both versions is identical: there is created arr with values set as specified. This is what the other poster meant by the code being identical; according to the Standard, two programs are the same if they have the same observable behaviour. Compilers are allowed to generate the same assembly for both versions.
Your second question is entirely separate to the first; but char a = 'a'; and char a = {'a'}; are identical. A single initializer may optionally be enclosed in braces.

I belive your question is highly implementation dependant (HW and compiler wise). However, in general: arrays are placed in RAM, let it be global or not.
I know that "Hello" is string literal. String literals are stored with static storage duraction and are immutable.
Yes this saves the string "Hello" in ROM (read only memory). Your array is loaded the literal in runtime.
But if both are are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Yes but in this case the single characters are placed in ROM. The array you are initialized is loaded with character literals in runtime.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
If you use UTF-8, then yes, since char == uint8_t and those are the values.
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
I believe not.
In reply to edit
Do you mean the lifetime of storage of string literals? Have a look at this.
So a string literal has static storage duration. It remains throughout the lifetime of the program, hardcoded in memory.

Related

difference between a string defined by pointer or an array

I was reading about pointers in K&R book here:
https://hikage.freeshell.org/books/theCprogrammingLanguage.pdf
There is an important difference between these definitions:
char amessage[] = "now is the time"; /* an array */
char *pmessage = "now is the time"; /* a pointer */
amessage is an array, just big enough to hold the sequence of characters and ’\0’ that initializes it. Individual characters
within the array may be changed but amessage will always refer to the same storage. On the other hand, pmessage is a
pointer, initialized to point to a string constant; the pointer may subsequently be modified to point elsewhere, but the result is
undefined if you try to modify the string contents.
I dont understand why cwe cant modify the string content !
I dont understand why cwe cant modify the string content !
Because the C standard says so: “If the program attempts to modify such an array [the array defined by a string literal], the behavior is undefined” (C 2018 6.4.5 7). A string literal is a sequence of characters in quotes in source code, such as "Hello, world.\n". (String literals may also be preceded by an encoding prefix u8, u, U, or L, as in L"abc".) A string literal defines an array containing the characters of the string plus a terminating null character.
A reason that attempting to modify the string literal’s array is that string literals were, and are, widely used for strings that are constant—error messages to be printed at times, format strings for printf operations, hard-coded names of things, and so on. As C developed, and the standard was written, it made sense for string literals to be treated as read-only and to allow a compiler to put them in read-only storage. Additionally, some compilers would use the same storage for identical string literals that appeared in different places, and some would use the same storage for a string literal that was a trailing substring of another string literal. Because of this shared storage, modifying one string would also modify the other. So allowing programs to modify string literals could cause some problems.
So, if you merely point to a string literal, you are pointing to something that should not be modified. If you want your own copy that can be modified, simply define it with an array as you show with char amessage[] = "now is the time";. Such a definition defines an array, amessage that has its own storage. That array is initialized with the contents of the string literal but is separate from it.
char amessage[] = "now is the time"; /* an array */
amessage is a modifiable array of chars.
char *pmessage = "now is the time"; /* a pointer */
pmessage is a pointer to the string literal. Attempt to modify the string literal is an Undefined Behaviour.
When you initialize a pointer with a string literal, the compiler creates a read-only array (and indeed is free to merge the pointers into one if you have several initializers using the same literal string (character by character) as in:
char *a = "abcdef", *b = "abcdef";
it is probable that both pointers be initialized to the same address in memory. This is the reason by which you are not allowed to modify the string, and why the behaviour can be unpredictable (you don't know if the compiler has merged both strings)
The thing goes further, as the compiler is permitted to do the following, on the next scenario:
char *a = "foo bar", *b = "bar";
the compiler is permitted to initialize a to point to a char array with the characters {'f', 'o', 'o', ' ', 'b', 'a', 'r', '\0'} and initialize also the pointer b to the fifth position of the array, as one of the string literals is a suffix of the other.
Allowing this allows the compiler to make extensive savings in the final executable and so, the string literals are assigned a read-only segment in the executable (they are placed in the .text segment or a similar one)
On the other hand, initializing an array has no problems, as you are defining the array variable that will store the characters, and it is not the compiler which is doing this. An initialization like:
char a[] = "Hello";
will arrange things to have a global variable of type array of chars with space for six characters. But you can also specify between the brackets the array size, as in
char a[32] = "Hello";
and then the array will have 32 characters (from 0 to 31) and the first five will be initialized to the character literals 'H', 'e', 'l', 'l' and 'o', followed by 27 null characters '\0'.
You are also allowed to say:
char a[4] = "Hello";
but in this case you will get an array initialized as {'H', 'e', 'l', 'l'} (only the first four characters are used from the string literal, and you will get a warning from the compiler, signalling the dangerous bend)
Last, think always that an assignment and an initialization are different things, despite they use the same symbol = to indicate it, they are not the same thing. You will never be allowed to write a sentence like:
char a[26];
a = "foo bar";
because the expression "foo bar" represents a char * pointing to a static array (unmodifiable) and an array cannot be assigned.

Long-form notation of a string assignment

I was wondering what the equivalent "long-form" way of creating a string would be. My first thought as as follows:
char *string = "Hello";
is short-form for:
char _string[6] = {'H', 'e', 'l', 'l', 'o', '\0'};
char *string = &_string[0];
Is this the closest approximation of what a string assignment actually does? When was the short-form notation introduced?
String literals have three properties your proposed long form does not:
They have static storage duration, meaning they exist (in the computing model used by the C standard) for the duration of program execution.
The behavior of attempting to modify them is not defined by the C standard, even though they do not have a const qualifier (which is a legacy due to the historical development of C).
When the compiler sees identical string literals, it is allowed to consolidate them. This could be visible to the program by comparing addresses. The compiler is also allowed to consolidate "abcdef" and "def" to "abcdef", since the latter is a subsequence of the former.
So a closer equivalent would be:
static const char _string[] = {'H', 'e', 'l', 'l', 'o', '\0'};
char *string = (char *) _string;
Thus, string does not have the const qualifier, and it does point to an array of static storage duration for which attempting to modify it is not defined by the C standard.
I do not believe there is a good way to replicate the third property, so _string will be a distinct object even if the same data appears elsewhere in the program.
String literals have existed in C since at least 1978, as they appear in the first edition of The C Programming Language by Kernighan and Ritchie, which calls them string constants.

Static duration of string literal

On this answer by michael-burr on this question:
what-is-the-type-of-string-literals-in-c-and-c
I found that
In C the type of a string literal is a char[] - it's not const according to the type, but it is undefined behavior to modify the contents
from this I can think that sentence "How are you" can't be modified (just as char c*="how are you?") but once it is used to initialize some char[] then it can be unless declared as const.
Apart from this from that answer:
The multibyte character sequence is then used to initialize an array of static storage duration
and from C Primer Plus 6th Edition I found:
Character string constants are placed in the static storage class, which means that if you use
a string constant in a function, the string is stored just once and lasts for the duration of the
program, even if the function is called several times
But when I tried this code:
#include <stdio.h>
void fun() {
char c[] = "hello";
printf("%s\n", c);
c[2] = 'x';
}
int main(void) {
fun();
fun();
return 0;
}
The array inside function fun doesn't behave as if it has retained the changed value.
Where am I going wrong on this?
char c[]="hello"; is not at all the same thing as char *c="hello";. The latter initializes a pointer to the aforementioned static string storage, modifying c[2] would be undefined behavior. The former is equivalent to:
char c[] = {'h', 'e', 'l', 'l', 'o', '\0'};
It's initializing an array on the stack, it's not creating a reference or pointer to the static string in some other memory location. Like any other non-const stack array, you can modify it however you please (as long as you don't go out of bounds).
You’re not modifying the string literal. You’re modifying a local array that contains a copy of the string literal. Each time you call fun, a new instance of c is created and initialized. When fun exits, that instance of c ceases to exist.
Because it is an automatic variable and it's instance is initialized every time you call the function.
Change it to have static storage
static char c[]="hello";
And it will behave as you expect keeping the changed value between the function calls
This is something completely different from how the string literals and compound literals are stored and used in the variable initialization. It is left to the implementation - for example this initialization of the automatic storage variable may be done by copying data from .rodata segment or it can be just couple immediate store instructions and the literal will be stored in the .text segment

String literals vs array of char when initializing a pointer

Inspired by this question.
We can initialize a char pointer by a string literal:
char *p = "ab";
And it is perfectly fine.
One could think that it is equivalent to the following:
char *p = {'a', 'b', '\0'};
But apparently it is not the case. And not only because the string literals are stored in a read-only memory, but it appears that even through the string literal has a type of char array, and the initializer {...} has the type of char array, two declarations are handled differently, as the compiler is giving the warning:
warning: excess elements in scalar initializer
in the second case. What is the explanation of such a behavior?
Update:
Moreover, in the latter case the pointer p will have the value of 0x61 (the value of the first array element 'a') instead of a memory location, such that the compiler, as warned, taking just the first element of the initializer and assigning it to p.
I think you're confused because char *p = "ab"; and char p[] = "ab"; have similar semantics, but different meanings.
I believe that the latter case (char p[] = "ab";) is best regarded as a short-hand notation for char p[] = {'a', 'b', '\0'}; (initializes an array with the size determined by the initializer). Actually, in this case, you could say "ab" is not really used as a string literal.
However, the former case (char *p = "ab";) is different in that it simply initializes the pointer p to point to the first element of the read-only string literal "ab".
I hope you see the difference. While char p[] = "ab"; is representable as an initialization such as you described, char *p = "ab"; is not, as pointers are, well, not arrays, and initializing them with an array initializer does something entirely different (namely give them the value of the first element, 0x61 in your case).
Long story short, C compilers only "replace" a string literal with a char array initializer if it is suitable to do so, i.e. it is being used to initialize a char array.
String literals have a "magical" status in C. They're unlike anything else. To understand why, it's useful to think about this in terms of memory management. For example, ask yourself, "Where is a string literal stored in memory? When is it freed from memory?" and things will start making sense.
They're unlike numeric literals which translate easily to machine instructions. For a simplified example, something like this:
int x = 123;
... might translate to something like this at the machine level:
mov ecx, 123
When we do something like:
const char* str = "hello";
... we now have a dilemma:
mov ecx, ???
There's not necessarily some native understanding of the hardware of what a multi-byte, variable-length string actually is. It mainly knows about bits and bytes and numbers and has registers designed to store these things, yet a string is a memory block containing multiple of those.
So compilers have to generate instructions to store that string's memory block somewhere, and so they typically generate instructions when compiling your code to store that string somewhere in a globally-accessible place (typically a read-only memory segment or the data segment). They might also coalesce multiple literal strings that are identical to be stored in the same memory region to avoid redundancy. Now it can generate a mov/load instruction to load the address to the literal string, and you can then work with it indirectly through a pointer.
Another scenario we might run into is this:
static const char* some_global_ptr = "blah";
int main()
{
if (...)
{
const char* ptr = "hello";
...
some_global_ptr = ptr;
}
printf("%s\n", some_global_ptr);
}
Naturally ptr goes out of scope, but we need that literal string's memory to linger around for this program to have well-defined behavior. So literal strings translate not only to addresses to globally-accessible memory blocks, but they also don't get freed as long as your binary/program is loaded/running so that you don't have to worry about their memory management. [Edit: excluding potential optimizations: for the C programmer, we never have to worry about the memory management of a literal string, so the effect is like it's always there].
Now about character arrays, literal strings aren't necessarily character arrays, per se. At no point in the software can we capture them to an array r-value that can give us the number of bytes allocated using sizeof. We can only point to the memory through char*/const char*
This code actually gives us a handle to such an array without involving a pointer:
char str[] = "hello";
Something interesting happens here. A production compiler is likely going to apply all kinds of optimizations, but excluding those, at a basic level such code might create two separate memory blocks.
The first block is going to be persistent for the duration of the program, and will contain that literal string, "hello". The second block will be for that actual str array, and it's not necessarily persistent. If we wrote such code inside a function, it's going to allocate memory on the stack, copy that literal string to the stack, and the free the memory from the stack when str goes out of scope. The address of str is not going to match the literal string, to put it another way.
Finally, when we write something like this:
char str[] = {'h', 'e', 'l', 'l', 'o', '\0'};
... it's not necessarily equivalent, as here there are no literal strings involved. Of course an optimizer is allowed to do all kinds of things, but in this scenario, it is possible that we will simply create a single memory block (allocated on the stack and freed from the stack if we're inside a function) with instructions to move all these numbers (characters) you specified to the stack.
So while we're effectively achieving the same effect as the previous version as far as the logic of the software is concerned, we're actually doing something subtly different when we don't specify a literal string. Again, optimizers can recognize when doing something different can have the same logical effect, so they might get fancy here and make these two effectively the same thing in terms of machine instructions. But short of that, this is subtly different code we're writing.
Last but not least, when we use initializers like {...}, the compiler expects you to assign it to an aggregate l-value with memory that is allocated and freed at some point when things go out of scope. So that's why you're getting the error trying to assign such a thing to a scalar (a single pointer).
The second example is syntactically incorrect. In C, {'a', 'b', '\0'} can be used to initialize an array, but not a pointer.
Instead, you can use a C99 compound literal (also available in some compilers as extension, e.g, GCC) like this:
char *p = (char []){'a', 'b', '\0'};
Note that it's more powerful as the initializer isn't necessarily null-terminated.
From C99 we have
A character string literal is a sequence of zero or more multibyte characters enclosed in
double-quotes
So in the second definition there is no string literal as it is not within the double quotes. The pointer should be allocated memory before writing something to it or if you want to go by initializer list then
char p[] = {'a','b','\0'};
is what you want. Basically both are different declarations.

meaning of static array of characters?

somewhere I read the following lines :-
char *p = "string literal";
My program crashes if I try to assign a new value to p[i].
A:-It turns into an unnamed, static array of characters, and this unnamed array may be stored in read-only memory, and which therefore cannot necessarily be modified. In an expression context, the array is converted at once to a pointer, as usual (see section 6), so the declaration initializes p to point to the unnamed array's first element.
I know what static do but I did not understand the following in the above lines
static array of characters.
This does not refer to the static keyword, but static in the sense that it cannot be changed.
EDIT: Thinking better, it seems this phrase was badly written, I think the author back then (for those wondering, this comes from the C faq) meant "constant"
EDIT2: OP asked what is a string literal, here is the answer:
String literal is a string that is hardcoded in your source (and later in your compiled program), you do it by using double quotes " a example would be this "some string literal here"
When you assigned this to a pointer, the pointer points to the string literal, that is stored in your program running code, NOT on the main memory, this is why it cannot be modified.
You can assign a string literal to array, to initialize the array, the meaning there is different, where the array will be sent to the memory, and will have that string as its initial value.
Mind you, a string literal must be inside double quotes " if you attempt other hacks it won't compile at all. You cannot for example do this: char* someVar = {'f', 'o', 'o', '\0'}; it won't work at all. (my compiler gives the error: excess elements in scalar initializer)
"Static" refers to the storage duration of the object that will be created for the string literal.
To quote C99 6.4.5:
The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence.
Simply string literals refer to string constants about which C11 standard says that:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
It can't change during program execution. While the string variables can change during program execution. String variables are arrays of characters whose last element is a NUL character (\0).
All string (variables) are array of characters but all character arrays are not string.
When compiler encounters a string literal, then it stores it in the read only section of memory, i.e, ROM. Here the word static refers to unmodifiable not the keyword static.
A string literal:
char *string_literal = "string literal";
or this can also be seen as
char *string_literal = {'s','t','r','i','n','g',' ','l','i','t','e','r','a','l','\0'};
A string variable
char string_var[] = "string variable";
or it can also be seen as
char string_var[] = {'s','t','r','i','n','g',' ','v','a','r','i','a','b','l','e', '\0'};
A character array:
char character_array[] = {'c','h','a','r','a','c','t','e','r',' ', 'a', 'r', 'r', 'a', 'y'};

Resources