Long-form notation of a string assignment - c

I was wondering what the equivalent "long-form" way of creating a string would be. My first thought was as follows:
char *string = "Hello";
is short-form for:
char _string[6] = {'H', 'e', 'l', 'l', 'o', '\0'};
char *string = &_string[0];
Is this the closest approximation of what a string assignment actually does? When was the short-form notation introduced?

String literals have three properties your proposed long form does not:
They have static storage duration, meaning they exist (in the computing model used by the C standard) for the duration of program execution.
The behavior of attempting to modify them is not defined by the C standard, even though they do not have a const qualifier (which is a legacy due to the historical development of C).
When the compiler sees identical string literals, it is allowed to consolidate them. This could be visible to the program by comparing addresses. The compiler is also allowed to consolidate "abcdef" and "def" to "abcdef", since the latter is a trailing substring (suffix) of the former.
So a closer equivalent would be:
static const char _string[] = {'H', 'e', 'l', 'l', 'o', '\0'};
char *string = (char *) _string;
Thus, string does not have the const qualifier, and it does point to an array of static storage duration for which attempting to modify it is not defined by the C standard.
I do not believe there is a good way to replicate the third property, so _string will be a distinct object even if the same data appears elsewhere in the program.
String literals have existed in C since at least 1978, as they appear in the first edition of The C Programming Language by Kernighan and Ritchie, which calls them string constants.

Related

difference between a string defined by pointer or an array

I was reading about pointers in K&R book here:
https://hikage.freeshell.org/books/theCprogrammingLanguage.pdf
There is an important difference between these definitions:
char amessage[] = "now is the time"; /* an array */
char *pmessage = "now is the time"; /* a pointer */
amessage is an array, just big enough to hold the sequence of characters and ’\0’ that initializes it. Individual characters
within the array may be changed but amessage will always refer to the same storage. On the other hand, pmessage is a
pointer, initialized to point to a string constant; the pointer may subsequently be modified to point elsewhere, but the result is
undefined if you try to modify the string contents.
I don't understand why we can't modify the string content!
Because the C standard says so: “If the program attempts to modify such an array [the array defined by a string literal], the behavior is undefined” (C 2018 6.4.5 7). A string literal is a sequence of characters in quotes in source code, such as "Hello, world.\n". (String literals may also be preceded by an encoding prefix u8, u, U, or L, as in L"abc".) A string literal defines an array containing the characters of the string plus a terminating null character.
A reason that attempting to modify the string literal’s array is undefined is that string literals were, and are, widely used for strings that are constant: error messages to be printed at times, format strings for printf operations, hard-coded names of things, and so on. As C developed, and the standard was written, it made sense for string literals to be treated as read-only and to allow a compiler to put them in read-only storage. Additionally, some compilers would use the same storage for identical string literals that appeared in different places, and some would use the same storage for a string literal that was a trailing substring of another string literal. Because of this shared storage, modifying one string would also modify the other. So allowing programs to modify string literals could cause some problems.
So, if you merely point to a string literal, you are pointing to something that should not be modified. If you want your own copy that can be modified, simply define it with an array as you show with char amessage[] = "now is the time";. Such a definition defines an array, amessage that has its own storage. That array is initialized with the contents of the string literal but is separate from it.
char amessage[] = "now is the time"; /* an array */
amessage is a modifiable array of chars.
char *pmessage = "now is the time"; /* a pointer */
pmessage is a pointer to the string literal. Attempting to modify the string literal is undefined behaviour.
When you initialize a pointer with a string literal, the compiler creates a read-only array, and it is free to merge such arrays into one if several initializers use the same string literal (character for character). For example, with:
char *a = "abcdef", *b = "abcdef";
it is likely that both pointers will be initialized to the same address in memory. This is one reason why you are not allowed to modify the string, and why the behaviour can be unpredictable (you do not know whether the compiler has merged the two strings).
It goes further than that. In the following scenario:
char *a = "foo bar", *b = "bar";
the compiler is permitted to initialize a to point to a char array with the characters {'f', 'o', 'o', ' ', 'b', 'a', 'r', '\0'} and to initialize the pointer b to point to the fifth element of that array, since one of the string literals is a suffix of the other.
Allowing this lets the compiler save space in the final executable, and so string literals are typically placed in a read-only segment of the executable (the .text segment or a similar one).
On the other hand, initializing an array causes no such problems, because you are defining the array variable that stores the characters yourself, rather than leaving the storage to the compiler. An initialization like:
char a[] = "Hello";
will define a variable of type array of char with space for six characters (the five letters plus the terminating '\0'). You can also specify the array size between the brackets, as in
char a[32] = "Hello";
and then the array will have 32 characters (from 0 to 31) and the first five will be initialized to the character literals 'H', 'e', 'l', 'l' and 'o', followed by 27 null characters '\0'.
You are also allowed to say:
char a[4] = "Hello";
but in this case you will get an array initialized as {'H', 'e', 'l', 'l'}: only the first four characters of the string literal are used, and the compiler will warn you about it. Note that such an array has no terminating null character, so it is not a valid C string.
Last, always keep in mind that assignment and initialization are different things, even though they use the same symbol =. You will never be allowed to write statements like:
char a[26];
a = "foo bar";
because the expression "foo bar" decays to a char * pointing to a static (unmodifiable) array, and an array cannot be assigned to.

Why do we have to cast an array to type[] when initialising a pointer?

We do not cast a string when initialising a pointer:
char *string = "Hello World!";
However, if I try to define an array explicitly (whatever it is of), the compiler gives me a warning of type incompatibility:
char *string = {'H', 'e', 'l', 'l', 'o', '\0'};
Casting to char[] works, but I wonder why do we have to cast? Doesn't a compiler see that the initialising value {'H', 'e', 'l', 'l', 'o', '\0'} is already an array? If we initialise an array like string[] the same way, we do not have to cast though. I assume here the compiler sees what the initialising value is, why doesn't it see it when initialising a pointer?
string is a pointer, not an array, so it needs an initializer that is a pointer.
The first code snippet is OK because a string literal has array type, and that array decays into a pointer to its first element.
The second is not OK because you're assigning a set of characters to a pointer. Because string is not an array or struct, only the first member of the initializer list is used. So you have a character constant, which has type int, that you're trying to assign to a pointer.
You say it works if you cast. If you mean this:
char *string = (char []){'H', 'e', 'l', 'l', 'o', '\0'};
Then what you have is actually a compound literal on the right side which has array type, and like the first example an array decays to a pointer to its first member.
When you're initializing a pointer, you have to provide a value that's the address of an object, or a null pointer.
{'H', 'e', 'l', 'l', 'o', '\0'} is not the address of anything. It's the syntax for an initializer list, which can only be used to initialize a variable whose type is an array or structure type.
You haven't actually shown the cast you're talking about, but I assume it's
char *string = (char[]){'H', 'e', 'l', 'l', 'o', '\0'};
It's not actually a cast, although it uses similar syntax.
An initializer list preceded by an array or structure type in parentheses is called a compound literal. It creates an anonymous object of the specified type, and the value is that object.
When used with an array, the value decays to a pointer to the first element of the array, just like any other use of an array in r-value context. This allows you to use it as the initializer of a pointer variable.
So it's effectively equivalent to:
char temp[] = {'H', 'e', 'l', 'l', 'o', '\0'};
char *string = temp;
except that there's no name temp associated with the array.
You don't need this type of syntax when initializing with a string literal, because string literals already construct the array in static memory and evaluate to a pointer to the first element.
Brace-enclosed initializer lists like {'a'} or {'a', 1} or even {} (which on their own are not compound literals) are meant to be open for future language extensions.
Maybe the list {'a', 1, 'b', 17} will at some point be a valid rvalue for a possible future pure C hashmap's initialization.
The syntax of C does intentionally not assume too much.
C may grow.
Plus, the little bit of compatibility we have with C++ we don't want to entirely be jeopardized by C compilers making assumptions that would make it harder than necessary for C code to still go correctly through a C++ compiler.
C and C++ need to evolve with a lot of regard for each other.
And even the potential future growth of just C itself makes it necessary to not wildly interpret literals as things they might reasonably mean in C 😅

differences in array initialization(char, string, other) regarding storage duration

In this question it was said in the comments:
char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; and char arr[10] =
"Hello"; are strictly the same thing. – Michael Walz
This got me thinking.
I know that "Hello" is a string literal. String literals are stored with static storage duration and are immutable.
But if both are really the same, then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
char a = 'a'; is the same thing as char a; ...; a = 'a';, so your thoughts are correct 'a' is simply written to a
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
EDIT:
I see that I haven't made it clear enough that I am particularly interested in the memory usage/storage duration of the literals. I will leave the question as it is, but would like to make the emphasis of the question more clear in this edit.
I know that "Hello" is a string literal. String literals are stored with static storage duration and are immutable.
Yes, but string literals are also a grammatical item in the C language. char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; is not a string literal, it is an initializer list. The initializer list does however behave as if it has static storage duration, remaining elements after the explicit \0 are set to zero etc.
The initializer list itself is stored in some manner of ROM memory. If your variable arr has static storage duration too, it will get allocated in the .data segment and initialized from the ROM init list before the program is started. If arr has automatic storage duration (local), then it is initialized from ROM in run-time, when the function containing arr is called.
The ROM memory where the initializer list is stored may or may not be the same ROM memory as used for string literals. Often there's a segment called .rodata where these things end up, but they may as well end up in some other segment, such as the code segment .text.
Compilers like to store string literals in a particular memory segment, because that means that they can perform an optimization called "string pooling". Meaning that if you have the string literal "Hello" several times in your program, the compiler will use the same memory location for it. It may not necessarily do this same optimization for initializer lists.
Regarding 'a' versus {'a'} in an initializer list, that's just a syntax hiccup in the C language. C11 6.7.9/11:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. The
initial value of the object is that of the expression (after conversion); the same type
constraints and conversions as for simple assignment apply,
In plain English, this means that a "non-array" (scalar) can be either initialized with or without braces, it has the same meaning. Apart from that, the same rules as for regular assignment apply.
I know that "Hello" is a string literal. String literals are stored with static storage duration and are immutable.
Yes. But with char arr[10] = "Hello";, you are copying the string literal to an array arr and there's no need to "keep" the string literal. So if an implementation chooses to remove the string literal altogether after copying it to arr, that's totally valid.
But if both are really the same, then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Again there's no need to make/store a string literal for this.
Only if you directly have a pointer to a string literal, it'd be usually stored somewhere such as:
char *string = "Hello, world!\n";
Even then an implementation can choose not to do so under the "as-if" rule. E.g.,
#include <stdio.h>
#include <string.h>
static const char *str = "Hi";
int main(void)
{
char arr[10];
strcpy(arr, str);
puts(arr);
}
"Hi" can be eliminated because it's used only for copying into arr and isn't accessed directly anywhere. So eliminating the string literal (and the strcpy call too), as if you had written char arr[10] = "Hi";, wouldn't affect the observable behaviour.
Basically the C standard doesn't necessitate a string literal has to be stored anywhere as long as the properties associated with a string literal are satisfied.
Are there differences between: char a = 'a'; char a = {'a'}; How/where are the differences defined?
Yes. C11, 6.7.9 says:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. [..]
Per the syntax, even:
char c = {'a',}; is valid and equivalent too (though I wouldn't recommend this :).
In the abstract machine, char arr[10] = "Hello"; means that arr is initialized by copying data from the string literal "Hello" which has its own existence elsewhere; whereas the other version just has initial values like any other variable -- there is no string literal involved.
However, the observable behaviour of both versions is identical: arr is created with its values set as specified. This is what the other poster meant by the code being identical; according to the Standard, two programs are the same if they have the same observable behaviour. Compilers are allowed to generate the same assembly for both versions.
Your second question is entirely separate to the first; but char a = 'a'; and char a = {'a'}; are identical. A single initializer may optionally be enclosed in braces.
I believe your question is highly implementation dependent (hardware and compiler wise). However, in general: arrays are placed in RAM, whether they are global or not.
I know that "Hello" is a string literal. String literals are stored with static storage duration and are immutable.
Yes, this saves the string "Hello" in ROM (read-only memory). Your array is loaded with the literal at run time.
But if both are really the same, then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Yes, but in this case the single characters are placed in ROM. The array you are initializing is loaded with the character literals at run time.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
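If the character set is ASCII-compatible (for example UTF-8), then the array holds the same bytes as for "Hello", since those are the character codes. No string literal is involved, though.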
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
I believe not.
In reply to edit
Do you mean the lifetime of storage of string literals? Have a look at this.
So a string literal has static storage duration. It remains throughout the lifetime of the program, hardcoded in memory.

Cannot edit a C array by index

I am a noob to C programming (I come from the lands of JS and PHP), and as a learning exercise I attempted to write a program that asks for the user's name, and then prints it back out with the small exception of changing the first letter to a z. However, when I went to compile the code it returned the following error message in reference to the line name[0] = "Z";
warning: assignment makes integer from pointer without a cast
Is there a reason I can't assign a value to a specific index in a char array?
(Note: I have tried typecasting "Z" to a char but it just threw the error
warning: cast from pointer to integer of different size)
Unlike some languages that do not distinguish between strings and characters, C requires a different syntax for characters (vs. a single-character string).
You need to use single quotes:
name[0] = 'Z';
The error is quite cryptic, though. It is trying to say that "Z", a single-character C string, gets assigned to name[0], an integral type of char. C strings are arrays; arrays are convertible to pointers. Hence, C treats this as a pointer-to-int assignment without a cast.
replace name[0] = "Z"; with name[0] = 'Z';.
Single quotes are for a single character, and double quotes are for a string.
In C, single quotes and double quotes carry different meanings. In fact, there is no concept of "Strings" in C. You have the basic char data type, where a char is represented by single quotes. To represent strings, you store them as an array of chars. For example,
char text[] = {'h', 'e', 'l', 'l', 'o'};
This is a more tedious way of writing almost the same thing as
char text[] = "hello";
This is the same as the first example, except that there is a null character \0 at the end (this is how C detects the end of "strings"). It is equivalent to writing char text[] = {'h', 'e', 'l', 'l', 'o', '\0'};, and the terminator is what lets you do string-based processing on the array.
Coming to your question, if you want to index a certain character in a "string", you need to access it by its index in the array.
So, text[0] yields the character h, which is of type char. To assign a different value, you must assign a single-quoted char like so:
text[0] = 'Z';

Why is max length of C string literal different from max char[]?

Clarification: Given that a string literal can be rewritten as a const
char[] (see below), imposing a lower max length on literals than on
char[]s is just a syntactic inconvenience. Why does the C standard
encourage this?
The C89 standard has a translation limit for string literals:
509 characters in a character string literal or wide string literal (after concatenation)
There isn't a limit for char arrays; perhaps
32767 bytes in an object (in a hosted environment only)
applies (I'm not sure what object or hosted environment means), but at any rate it's a much higher limit.
My understanding is that a string literal is equivalent to char array containing characters, ie: it's always possible to rewrite something like this:
const char* str = "foo";
into this
static const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
const char* str = __THE_LITERAL;
So why such a hard limit on literals?
The limit on string literals is a compile-time requirement; there's a similar limit on the length of a logical source line. A compiler might use a fixed-size data structure to hold source lines and string literals.
(C99 increases these particular limits from 509 to 4095 characters.)
On the other hand, an object (such as an array of char) can be built at run time. The limits are likely imposed by the target machine architecture, not by the design of the compiler.
Note that these are not upper bounds imposed on programs. A compiler is not required to impose any finite limits at all. If a compiler does impose a limit on line length, it must be at least 509 or 4095 characters. (Most actual compilers, I think, don't impose fixed limits; rather they allocate memory dynamically.)
It's not that 509 characters is the limit for a string, it's the minimum required for ANSI compatibility, as explained here.
I think that the makers of the standard pulled the number 509 out of their ass, but unless we get some official documentation on this, there is no way for us to know.
As far as how many characters can actually be in a string literal, that is compiler-dependent.
Here are some examples:
MSVC: 2048
GCC: No Limit (up to 100,000 characters), but gives warning after 510 characters:
String literal of length 100000 exceeds maximum length 509 that C90 compilers are required to support
Sorry about the late answer, but I'd like to illustrate the difference between the two cases (Richard J. Ross already pointed out that they're not equivalent.)
Suppose you try this:
const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
const char* str = __THE_LITERAL;
char *str_writable = (char *) str; // Not so const anymore
str_writable[0] = 'g';
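Now str typically contains "goo" when the array is a local variable, although strictly speaking modifying an object defined as const is also undefined behaviour.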
But if you do this:
const char* str = "foo";
char *str_writable = (char *) str;
str_writable[0] = 'g';
Result: segfault! (on my platform, at least.)
Here is the fundamental difference: In the first case you have an array which is initialized to "foo", but in the second case you have an actual string literal.
On a side note,
const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
is exactly equivalent to
const char __THE_LITERAL[] = "foo";
Here the = acts as an array initializer rather than as assignment. This is very different from
const char *str = "foo";
where str is initialized with the address of the string literal's array.

Resources