Why is max length of C string literal different from max char[]? - c

Clarification: Given that a string literal can be rewritten as a const
char[] (see below), imposing a lower max length on literals than on
char[]s is just a syntactic inconvenience. Why does the C standard
encourage this?
The C89 standard has a translation limit for string literals:
509 characters in a character string literal or wide string literal (after concatenation)
There isn't a limit for char arrays; perhaps
32767 bytes in an object (in a hosted environment only)
applies (I'm not sure what object or hosted environment means), but at any rate it's a much higher limit.
My understanding is that a string literal is equivalent to a char array containing the same characters, i.e. it's always possible to rewrite something like this:
const char* str = "foo";
into this
static const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
const char* str = __THE_LITERAL;
So why such a hard limit on literals?

The limit on string literals is a compile-time requirement; there's a similar limit on the length of a logical source line. A compiler might use a fixed-size data structure to hold source lines and string literals.
(C99 increases these particular limits from 509 to 4095 characters.)
On the other hand, an object (such as an array of char) can be built at run time. The limits are likely imposed by the target machine architecture, not by the design of the compiler.
Note that these are not upper bounds imposed on programs. A compiler is not required to impose any finite limits at all. If a compiler does impose a limit on line length, it must be at least 509 or 4095 characters. (Most actual compilers, I think, don't impose fixed limits; rather they allocate memory dynamically.)

It's not that 509 characters is the limit for a string, it's the minimum required for ANSI compatibility, as explained here.
I think that the makers of the standard pulled the number 509 out of their ass, but unless we get some official documentation from this, there is no way for us to know.
As far as how many characters can actually be in a string literal, that is compiler-dependent.
Here are some examples:
MSVC: 2048
GCC: No fixed limit (accepts at least 100,000 characters), but gives a warning once the literal is longer than 509 characters:
String literal of length 100000 exceeds maximum length 509 that C90 compilers are required to support

Sorry about the late answer, but I'd like to illustrate the difference between the two cases (Richard J. Ross already pointed out that they're not equivalent.)
Suppose you try this:
const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
const char* str = __THE_LITERAL;
char *str_writable = (char *) str; // Not so const anymore
str_writable[0] = 'g';
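Now str points to "goo" (on my platform, at least; strictly speaking, writing to an object defined as const is undefined behavior as well).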
But if you do this:
const char* str = "foo";
char *str_writable = (char *) str;
str_writable[0] = 'g';
Result: segfault! (on my platform, at least.)
Here is the fundamental difference: In the first case you have an array which is initialized to "foo", but in the second case you have an actual string literal.
On a side note,
const char __THE_LITERAL[] = { 'f', 'o', 'o', '\0' };
is exactly equivalent to
const char __THE_LITERAL[] = "foo";
Here the = acts as an array initializer rather than as assignment. This is very different from
const char *str = "foo";
where the address of the string literal is assigned to str.

Related

I want to know how double quotes are used in C

In the code char * str = "hello";, I understand that "hello" places the word hello somewhere in memory and then stores the address of the first character of that memory in the variable str.
But with the code char str[10] = "hello";, I understand that the letters of hello are stored in the elements of the array.
If so, in the first case "hello" returns the address of the memory,
and in the second case "hello" returns the word h e l l o \n.
I want to know why they are different and, if I'm wrong, what double quotes actually return.
C is a bit quirky. You have two distinct use cases here. But let's first start with what "hello" is.
Your "hello" in the program source code is a character string literal, that is, a character sequence enclosed in double quotes. When the compiler compiles this source code, it appends a zero byte to the sequence so that standard library functions like strlen() can work on it. The resulting zero-terminated sequence is then used by the compiler to "initialize an array of static storage duration and length just sufficient to contain the sequence" (n1570 ISO C draft, 6.4.5/6). That length is 6: the 5 characters h, e, l, l and o plus the appended zero byte.
"Static storage duration" means that the array exists the entire time the program is running (as opposed to objects with automatic storage duration, e.g. local variables, and those with allocated storage duration, which are created via malloc() or calloc()).
You can memorize the address of that array, as in char *str = "hello";. This address will point to valid memory during the lifetime of the program.
The second use case is a special syntax for initializing character arrays. It is just syntactic sugar for this common use case, and a deviation from the fact that you cannot normally initialize arrays with arrays (see footnote 1).
This time you don't define a pointer, you define a proper array of 10 chars. You then use the string literal to initialize it. You always can use the generic method to initialize a character array by listing the individual array elements, separated by commas, in curly braces (by the way, this generic method works also for the other kind of compound types, namely structs):
char str[10] = { 'h', 'e', 'l', 'l', 'o', '\0' };
This is entirely equivalent to
char str[10] = "hello";
Now your array has more elements (10) than the number of characters in the initializing array produced from the string literal (6); the standard stipulates that "subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration". Those global and static variables are initialized with zero, which means that the character array str ends with 4 zero characters.
It is immediately obvious why Dennis Ritchie added the somewhat anti-paradigmatic initialization of character arrays via a string literal, probably after the second time he had to do it with the generic array initialization syntax. Designing your own language has its benefits.
Footnote 1: For example, static char src[] = "123"; char dest[] = src; doesn't work. You have to use strcpy().
The initialization:
char * str = "hello";
in most C implementations makes sure that the string hello is placed in a constant data section of the executable memory. Exactly six bytes are written, the last one being the string terminator '\0'.
The char pointer str contains the address of the first character 'h', so that anyone accessing the string knows that the following bytes have to be read until the terminator character is found.
The other initialization
char str[10] = "hello"; // <-- string must be enclosed in double quotes
is very similar, in that str refers to the first character of the string and the following characters are written to the following memory locations (including the string terminator).
But:
- Even if only six bytes are explicitly initialized, ten bytes are allocated because that's the size of the array. In this case, the four trailing bytes will contain zeroes.
- The data is not constant and can be changed, whereas in the previous example it couldn't, because that initialization, in most C implementations, instructs the compiler to use a constant data section.
You seem to be mixing up some things:
char str[10] = "hello';
This does not even compile: when you start with a double-quote, you should end with one:
char str[10] = "hello";
In memory, this has following effect:
str[0] : h
str[1] : e
str[2] : l
str[3] : l
str[4] : o
str[5] : 0 (the terminating zero character)
str[6] : 0
str[7] : 0
str[8] : 0
str[9] : 0
(The four elements after the terminator are not explicitly initialized, so they are set to zero as well.)
As a result, the array holds hello followed by \0 (the zero character), not hello\n (with an end-of-line character).
The double quotes just mark the beginning and the end of a string constant; they return nothing.

difference between a string defined by pointer or an array

I was reading about pointers in K&R book here:
https://hikage.freeshell.org/books/theCprogrammingLanguage.pdf
There is an important difference between these definitions:
char amessage[] = "now is the time"; /* an array */
char *pmessage = "now is the time"; /* a pointer */
amessage is an array, just big enough to hold the sequence of characters and ’\0’ that initializes it. Individual characters
within the array may be changed but amessage will always refer to the same storage. On the other hand, pmessage is a
pointer, initialized to point to a string constant; the pointer may subsequently be modified to point elsewhere, but the result is
undefined if you try to modify the string contents.
I don't understand why we can't modify the string content!
I don't understand why we can't modify the string content!
Because the C standard says so: “If the program attempts to modify such an array [the array defined by a string literal], the behavior is undefined” (C 2018 6.4.5 7). A string literal is a sequence of characters in quotes in source code, such as "Hello, world.\n". (String literals may also be preceded by an encoding prefix u8, u, U, or L, as in L"abc".) A string literal defines an array containing the characters of the string plus a terminating null character.
A reason that attempting to modify the string literal’s array is undefined is that string literals were, and are, widely used for strings that are constant: error messages to be printed at times, format strings for printf operations, hard-coded names of things, and so on. As C developed, and the standard was written, it made sense for string literals to be treated as read-only and to allow a compiler to put them in read-only storage. Additionally, some compilers would use the same storage for identical string literals that appeared in different places, and some would use the same storage for a string literal that was a trailing substring of another string literal. Because of this shared storage, modifying one string would also modify the other. So allowing programs to modify string literals could cause some problems.
So, if you merely point to a string literal, you are pointing to something that should not be modified. If you want your own copy that can be modified, simply define it with an array as you show with char amessage[] = "now is the time";. Such a definition defines an array, amessage that has its own storage. That array is initialized with the contents of the string literal but is separate from it.
char amessage[] = "now is the time"; /* an array */
amessage is a modifiable array of chars.
char *pmessage = "now is the time"; /* a pointer */
pmessage is a pointer to the string literal. Attempting to modify the string literal is undefined behaviour.
When you initialize a pointer with a string literal, the compiler creates a read-only array, and it is indeed free to merge the arrays into one if you have several initializers using the same literal string (character by character), as in:
char *a = "abcdef", *b = "abcdef";
Here it is likely that both pointers will be initialized to the same address in memory. This is why you are not allowed to modify the string, and why the behaviour can be unpredictable (you don't know whether the compiler has merged both strings).
It goes further, as the compiler is permitted to do the following in the next scenario:
char *a = "foo bar", *b = "bar";
the compiler is permitted to initialize a to point to a char array with the characters {'f', 'o', 'o', ' ', 'b', 'a', 'r', '\0'} and also to initialize the pointer b to point at the fifth position of that array, since one of the string literals is a suffix of the other.
Allowing this lets the compiler make extensive savings in the final executable, and so string literals are placed in a read-only segment of the executable (the .text segment or a similar one).
On the other hand, initializing an array has no problems, as you are defining the array variable that will store the characters, and it is not the compiler which is doing this. An initialization like:
char a[] = "Hello";
will define an array of char with space for six characters. But you can also specify the array size between the brackets, as in
char a[32] = "Hello";
and then the array will have 32 characters (from 0 to 31) and the first five will be initialized to the character literals 'H', 'e', 'l', 'l' and 'o', followed by 27 null characters '\0'.
You are also allowed to say:
char a[4] = "Hello";
but in this case you will get an array initialized as {'H', 'e', 'l', 'l'} (only the first four characters of the string literal are used, and you will get a warning from the compiler signalling the danger).
Last, always keep in mind that assignment and initialization are different things: although they use the same symbol =, they are not the same. You will never be allowed to write statements like:
char a[26];
a = "foo bar";
because the expression "foo bar" represents a char * pointing to a static array (unmodifiable) and an array cannot be assigned.

differences in array initialization(char, string, other) regarding storage duration

In this question it was said in the comments:
char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; and char arr[10] =
"Hello"; are strictly the same thing. – Michael Walz
This got me thinking.
I know that "Hello" is string literal. String literals are stored with static storage duration and are immutable.
But if both are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
char a = 'a'; is the same thing as char a; ...; a = 'a';, so your thoughts are correct 'a' is simply written to a
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
EDIT:
I see that I haven't made it clear enough that I am particularly interested in the memory usage/storage duration of the literals. I will leave the question as it is, but would like to make the emphasis of the question more clear in this edit.
I know that "Hello" is string literal. String literals are stored with static storage duration and are immutable.
Yes, but string literals are also a grammatical item in the C language. char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; is not a string literal, it is an initializer list. The initializer list does however behave as if it has static storage duration, remaining elements after the explicit \0 are set to zero etc.
The initializer list itself is stored in some manner of ROM memory. If your variable arr has static storage duration too, it will get allocated in the .data segment and initialized from the ROM init list before the program is started. If arr has automatic storage duration (local), then it is initialized from ROM in run-time, when the function containing arr is called.
The ROM memory where the initializer list is stored may or may not be the same ROM memory as used for string literals. Often there's a segment called .rodata where these things end up, but they may as well end up in some other segment, such as the code segment .text.
Compilers like to store string literals in a particular memory segment, because that means that they can perform an optimization called "string pooling". Meaning that if you have the string literal "Hello" several times in your program, the compiler will use the same memory location for it. It may not necessarily do this same optimization for initializer lists.
Regarding 'a' versus {'a'} in an initializer list, that's just a syntax hiccup in the C language. C11 6.7.9/11:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. The
initial value of the object is that of the expression (after conversion); the same type
constraints and conversions as for simple assignment apply,
In plain English, this means that a "non-array" (scalar) can be either initialized with or without braces, it has the same meaning. Apart from that, the same rules as for regular assignment apply.
I know that "Hello" is string literal. String literals are stored with static storage duration and are immutable.
Yes. But with char arr[10] = "Hello";, you are copying the string literal to an array arr and there's no need to "keep" the string literal. So if an implementation chooses to remove the string literal altogether after copying it to arr, that's totally valid.
But if both are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Again there's no need to make/store a string literal for this.
Only if you directly have a pointer to a string literal, it'd be usually stored somewhere such as:
char *string = "Hello, world!\n";
Even then an implementation can choose not to do so under the "as-if" rule. E.g.,
#include <stdio.h>
#include <string.h>

static const char *str = "Hi";

int main(void)
{
    char arr[10];
    strcpy(arr, str);
    puts(arr);
}
"Hi" can be eliminated because it's used only for copying into arr and isn't accessed directly anywhere. So the implementation can eliminate the string literal (and the strcpy call too), as if you had written char arr[10] = "Hi";, without affecting the observable behaviour.
Basically the C standard doesn't necessitate a string literal has to be stored anywhere as long as the properties associated with a string literal are satisfied.
Are there differences between: char a = 'a'; char a = {'a'}; How/where are the differences defined?
Yes. C11, 6.7.9 says:
The initializer for a scalar shall be a single expression, optionally enclosed in braces. [..]
Per the syntax, even:
char c = {'a',}; is valid and equivalent too (though I wouldn't recommend this :).
In the abstract machine, char arr[10] = "Hello"; means that arr is initialized by copying data from the string literal "Hello" which has its own existence elsewhere; whereas the other version just has initial values like any other variable -- there is no string literal involved.
However, the observable behaviour of both versions is identical: there is created arr with values set as specified. This is what the other poster meant by the code being identical; according to the Standard, two programs are the same if they have the same observable behaviour. Compilers are allowed to generate the same assembly for both versions.
Your second question is entirely separate to the first; but char a = 'a'; and char a = {'a'}; are identical. A single initializer may optionally be enclosed in braces.
I believe your question is highly implementation dependent (hardware- and compiler-wise). However, in general, arrays are placed in RAM, whether global or not.
I know that "Hello" is string literal. String literals are stored with static storage duration and are immutable.
Yes, this saves the string "Hello" in ROM (read-only memory). Your array is loaded with the literal at run time.
But if both are really the same then char arr[10] = { 'H', 'e', 'l', 'l', 'o', '\0'}; would also create a similar string literal.
Yes, but in this case the single characters are placed in ROM. The array you are initializing is loaded with the character literals at run time.
Does char b[10]= {72, 101, 108, 108, 111, 0}; also create a "string" literal with static storage duration? Because theoretically it is the same thing.
If you use UTF-8, then yes, since char == uint8_t and those are the values.
Are there differences between:
char a = 'a';
char a = {'a'};
How/where are the differences defined?
I believe not.
In reply to edit
Do you mean the lifetime of storage of string literals? Have a look at this.
So a string literal has static storage duration. It remains throughout the lifetime of the program, hardcoded in memory.

C strings declarations [duplicate]

I'm learning C and today I got stuck on "strings" in C. Basically I understand that there is no such thing as a string type in C.
In C, strings are arrays of characters terminated with \0 at the end.
So far so good.
char *name = "David";
char name[] = "David";
char name[5] = "David";
This is where the confusion starts. These are three different ways to declare "strings". Can you give me simple examples of which one to use in which situation? I've read a lot of tutorials on the web but still can't get the idea.
I read this How to declare strings in C question on stackoverflow but still can't get the difference..
The first one, char *name = "David";, points to a string literal, which resides in a read-only section of memory. You can't modify it. Better to write
const char *name = "David";
Second one char name[] = "David"; is a string of 6 chars including '\0'. Modification can be done.
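char name[5] = "David"; is valid C, but the array has no room for the terminating '\0', so name does not hold a string, and passing it to string functions invokes undefined behavior. "David" is a string of 6 chars (including the terminating '\0'). You need an array of 6 chars to store it.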
char name[6] = "David";
Further reading: C-FAQ 6. Arrays and Pointers.
This link provides a pretty good explanation.
char[] refers to an array, char* refers to a pointer, and they are not the same thing.
char a[] = "hello"; // array
char *p = "world"; // pointer
According to the standard, Annex J.2/1, it is undefined behavior when:
—The program attempts to modify a string literal (6.4.5).
6.4.5/5 says:
In translation phase 7, a byte or code of value zero is appended to
each multibyte character sequence that results from a string literal
or literals.
Therefore you actually need an array of six elements to account for the NUL character.
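In the first example, you declare a variable pointer to a string literal:
// A variable pointer to a string literal (i.e. an array of 6 bytes).
char *pName = "David";
Because pName is a plain char *, the compiler lets you write code that modifies the 6 bytes occupied by 'D', 'a', 'v', 'i', 'd', '\0'. Be aware, though, that modifying a string literal is undefined behavior: it may appear to work on some platforms and crash on others:
pName[0] = 'c'; // undefined behavior
*pName = 'c'; // undefined behavior
*(pName+0) = 'c'; // undefined behavior
strcpy(pName, "Eric"); // undefined behavior, even though "Eric" fits
And in any case the array is ONLY those 6 bytes:
// BUG: Would also write 2 bytes past the end of the 6-byte array.
strcpy(pName, "Fredrik");
The pointer can be altered at run time to point to another string, e.g.
pName = "Charlie Chaplin";
Writing through it is still undefined behavior, since it points to a string literal again:
pName[0] = 'c';
*pName = 'c';
*(pName+0) = 'c';
// Fits in size, since pName now points to the CC array,
// which is 16 bytes located somewhere else, but it still modifies a literal:
strcpy(pName, "Fredrik");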
As stated by others, you would normally use const char * in the pointer cases, which is also the preferred way to use a string. The reason is that the compiler will protect you from the most common (and hard-to-find) memory-trashing bugs:
// A variable pointer to a constant string (i.e. an array of 6 constant bytes).
const char *pName = "David";
// Pointer can be altered runtime to point to another string e.g.
pName = "Charlie";
// But, the compiler will warn you if you try to change the string
// using any of the normal ways:
pName[0] = 'c'; // BUG
*pName = 'c'; // BUG
*(pName+0) = 'c'; // BUG
strcpy(pName, "Eric");// BUG
The other ways, using an array, give less flexibility:
char aName[] = "David"; // aName is now an array in RAM.
// You can still modify the array using the normal ways:
aName[0] = 'd';
*aName = 'd';
*(aName+0) = 'd';
strcpy(aName, "Eric"); // OK
// But not change to a larger, or a different buffer
aName = "Charlie"; // BUG: This is not possible.
Similarly, a constant array helps you even more:
const char aName[] = "David"; // aName is now a constant array.
// The compiler will prevent modification of it:
aName[0] = 'd'; // BUG
*aName = 'd'; // BUG
*(aName+0) = 'd'; // BUG
strcpy(aName, "Eric");// BUG
// And you cannot of course change it this way either:
aName = "Charlie"; // BUG: This is not possible.
The major difference between the pointer and array declarations is the value of sizeof: sizeof(pName) is the size of a pointer, i.e. typically 4 or 8 bytes. sizeof(aName) is the size of the array, i.e. the length of the string + 1.
It matters most if the variable is declared inside a function, especially if the string is long: It occupies more of the precious stack. Thus, the array declaration is normally avoided.
It also matters when passing the variable to macros which use sizeof(). Such macros must be supplied with the intended type.
It also matters if you want to e.g. swap the strings. With strings declared as pointers this is straightforward and requires the CPU to access fewer bytes, since only the few bytes of the pointers themselves are moved around:
const char *pCharlie = "Charlie";
const char *pDavid = "David";
const char *pTmp;
pTmp = pCharlie;
pCharlie = pDavid;
pDavid = pTmp;
pCharlie is now "David", and pDavid is now "Charlie".
Using arrays, you must provide a temporary storage large enough for the largest string, and use strcpy(), which takes more CPU, copying byte for byte in the strings.
The last method is rarely used, since the compiler automatically calculates that David needs 6 bytes. No need to tell it what's obvious.
char aName[6] = "David";
But, it is sometimes used in cases where the array MUST be a fixed length, independent of its contents, e.g. in binary protocols or files. In that case, it can be of benefit to manually add the limit, in order to get help from the compiler, should anyone by accident add or remove a character from the string in the future.

Char array initialization dilemma

Consider following code:
// hacky, since "123" is 4 chars long (including terminating 0)
char symbols[3] = "123";
// clean, but lot of typing
char symbols[3] = {'1', '2', '3'};
So the twist is described in the comments in the code: is there a way to initialize a char[] with a string literal without the terminating zero?
Update: seems like IntelliSense is wrong indeed, this behaviour is explicitly defined in C standard.
This
char symbols[3] = "123";
is a valid statement.
According to the ANSI C Specification of 1988:
An array of character type may be initialized by a character string
literal, optionally enclosed in braces. Successive characters of the
character string literal (including the terminating null character if
there is room or if the array is of unknown size) initialize the
members of the array.
Therefore, what you're doing is technically fine.
Note that character arrays are an exception to the stated constraints on initializers:
There shall be no more initializers in an initializer list than there
are objects to be initialized.
However, the technical correctness of a piece of code is only a small part of that code's "goodness". The line char symbols[3] = "123"; will immediately strike the veteran programmer as suspect because it appears, at face value, to be a valid string initialization and later may be used as such, leading to unexpected errors and certain death.
If you wish to go this route you should be sure it's what you really want. Saving that extra byte is not worth the trouble this could get you into. The terminating null character, if anything, allows you to write better, more flexible code because it provides an unambiguous (in most instances) way of terminating the array.
(Draft specification available here.)
To co-opt Rudy's comment elsewhere on this page, the C99 Draft Specification's Example 8 in §6.7.8, paragraph 32 (p. 130) states that the lines
char s[] = "abc", t[3] = "abc";
are identical to
char s[] = { 'a', 'b', 'c', '\0' },
t[] = { 'a', 'b', 'c' };
From which you can deduce the answer you're looking for.
The C99 specification draft can be found here.
If your array is only 3 chars long, the first line of code is identical to the second line. The '\0' at the end of the string will simply not be stored. In other words, there is nothing "dirty" or "wrong" with it.
1) The problems you are mentioning are not problems.
2) Que: Is there a way to initialize char[] with string literal without terminating zero? -- you are already doing that.

Resources