Are C constant character strings always null terminated without exception?
For example, will the following C code always print "true":
const char* s = "abc";
if( *(s + 3) == 0 ){
printf( "true" );
} else {
printf( "false" );
}
A string is only a string if it contains a null character.
A string is a contiguous sequence of characters terminated by and including the first null character. C11 §7.1.1 1
"abc" is a string literal. It also always contains a null character. A string literal may contain more than 1 null character.
"def\0ghi" // 2 null characters.
In the following, though, x is not a string (it is an array of char without a null character). y and z are both arrays of char and both are strings.
char x[3] = "abc";
char y[4] = "abc";
char z[] = "abc";
With OP's code, s points to a string, the string literal "abc", so *(s + 3) and s[3] have the value 0. Attempting to modify s[3] is undefined behavior, as 1) s is a const char * and 2) the data pointed to by s is a string literal. Attempting to modify a string literal is also undefined behavior.
const char* s = "abc";
Deeper: C does not define "constant character strings".
The language defines a string literal, like "abc" to be a character array of size 4 with the value of 'a', 'b', 'c', '\0'. Attempting to modify these is UB. How this is used depends on context.
The standard C library defines string.
With const char* s = "abc";, s is a pointer to data of type char. As a const some_type * pointer, using s to modify data is UB. s is initialized to point to the string literal "abc". s itself is not a string. The memory s initially points to is a string.
In short, yes. A string constant is of course a string and a string is by definition 0-terminated.
If you use a string constant as an array initializer like this:
char x[5] = "hello";
you won't have a 0 terminator in x simply because there's no room for it.
But with
char x[] = "hello";
it will be there and the size of x is 6.
A string is defined as a sequence of characters terminated by a zero character. It is not important whether the sequence is modifiable or not, that is, whether the corresponding declaration has the qualifier const or not.
For example string literals in C have types of non-constant character arrays. So you may write for example
char *s = "Hello world";
In this declaration the identifier s points to the first character of the string.
You can initialize a character array yourself by a string using a string literal. For example
char s[] = "Hello world";
This declaration is equivalent to
char s[] = { 'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '\0' };
However in C you may exclude the terminating zero from an initialization of a character array.
For example
char s[11] = "Hello world";
Though the string literal used as the initializer contains the terminating zero, it is excluded from the initialization. As a result, the character array s does not contain a string.
In C, there isn't really a "string" datatype like in C++ and Java.
Important principle that every competent computer science degree program should mention: Information is symbols plus interpretation.
A "string" is defined conventionally as any sequence of characters ending in a null byte ('\0').
The "gotcha" that's being posted (character/byte arrays with the value 0 in the middle of them) is only a difference of interpretation. Treating a byte array as a string versus treating it as bytes (numbers in [0, 255]) has different applications. Obviously if you're printing to the terminal you might want to print characters until you reach a null byte. If you're saving a file or running an encryption algorithm on blocks of data you will need to support 0's in byte arrays.
It's also valid to take a "string" and optionally interpret as a byte array.
Related
In the code char * str = "hello";, I understand that code "hello" is to allocate the word hello to any other memory and then put the first value of that allocated memory into the variable str.
But when I use the code char str[10] = "hello";, I understood that the word hello is included in each element of the array.
If then, on the top, the code "hello" returns the address of the memory
and on the bottom, the code "hello" returns the word h e l l o \n.
I want to know why they are different and if I'm wrong, I want to know what double quotes return.
C is a bit quirky. You have two distinct use cases here. But let's first start with what "hello" is.
Your "hello" in the program source code is a character string literal, that is, a character sequence enclosed in double quotes. When the compiler is compiling this source code, it appends a zero byte to the sequence, so that standard library functions like strlen() can work on it. The resulting zero-terminated sequence is then used by the compiler to "initialize an array of static storage duration and length just sufficient to contain the sequence" (n1570 ISO C draft, 6.4.5/6). That length is 6: the 5 characters h, e, l, l and o as well as the appended zero byte.
"Static storage duration" means that the array exists the entire time the program is running (as opposed to objects with automatic local storage duration, e.g. local variables, and those with dynamic storage duration, which are created via malloc() or calloc()).
You can memorize the address of that array, as in char *str = "hello";. This address will point to valid memory during the lifetime of the program.
The second use case is a special syntax for initializing character arrays. It is just syntactic sugar for this common use case, and a deviation from the fact that you cannot normally initialize arrays with arrays.1
This time you don't define a pointer, you define a proper array of 10 chars. You then use the string literal to initialize it. You always can use the generic method to initialize a character array by listing the individual array elements, separated by commas, in curly braces (by the way, this generic method works also for the other kind of compound types, namely structs):
char str[10] = { 'h', 'e', 'l', 'l', 'o', '\0' };
This is entirely equivalent to
char str[10] = "hello";
Now your array has more elements (10) than the number of characters in the initializing array produced from the string literal (6); the standard stipulates that "subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration". Those global and static variables are initialized with zero, which means that the remaining 4 elements of str (str[6] through str[9]) are zero characters as well, following the terminator at str[5].
It is immediately obvious why Dennis Ritchie added the somewhat anti-paradigmatic initialization of character arrays via a string literal, probably after the second time he had to do it with the generic array initialization syntax. Designing your own language has its benefits.
1 For example, static char src[] = "123"; char dest[] = src; doesn't work. You have to use strcpy().
The initialization:
char * str = "hello";
in most C implementations makes sure that the string hello is placed in a constant data section of the executable memory. Exactly six bytes are written, the last one being the string terminator '\0'.
The char pointer str contains the address of the first character 'h', so that anyone accessing the string knows that the following bytes have to be read until the terminator character is found.
The other initialization
char str[10] = "hello"; // <-- string must be enclosed in double quotes
is very similar, in that str designates the first character of the string and the following characters are written to the following memory locations (including the string terminator). Here, however, str is an array rather than a pointer.
But:
Even if only six bytes are explicitly initialized, ten bytes are allocated because that's the size of the array. In this case, the four trailing bytes will contain zeroes
Data is not constant and can be changed, while in the previous example it wasn't possible because such initialization, in most C implementations, instructs the compiler to use a constant data section
You seem to be mixing up some things:
char str[10] = "hello';
This does not even compile: when you start with a double-quote, you should end with one:
char str[10] = "hello";
In memory, this has following effect:
str[0] : h
str[1] : e
str[2] : l
str[3] : l
str[4] : o
str[5] : 0 (the zero character constant)
str[6] : 0
str[7] : 0
str[8] : 0
str[9] : 0
(Elements without an explicit initializer are zero-initialized, so str[6] through str[9] are 0 as well.)
As a result, the code will not return hello\n (with an end-of-line character), just hello\0 (the zero character).
The double quotes just mention the beginning and the ending of a string constant and return nothing.
For instance, if I write:
void function(char *k){ printf("%s",k);}
and call it like this:
function("hello");
does the code translate that string to: "hello\0" ? Or I'm the one who has to add it?
In C (And C++), when you do
const char* mystr = "Hello";, the compiler will generate the following in (read-only) RAM:
0x7fff2fe0: 'H'
0x7fff2fe1: 'e'
0x7fff2fe2: 'l'
0x7fff2fe3: 'l'
0x7fff2fe4: 'o'
0x7fff2fe5: '\0'
Then, the compiler will replace
const char* mystr = "Hello";
with
const char* mystr = 0x7fff2fe0;
For your usage, your code will turn into
function(0x7fff2fe0)
Simple as that.
In C, a string literal has type char[N] (in C++ it is const char[N]), where the array contains all of the written characters, followed by a \0. The length N is 1 + the length of the string you write (char[6] for "Hello"). More information can be found here, where they also use the string "Hello" as an example. Thus, sizeof("Hello") == 6, and "Hello"[5] == '\0' (yes, "Hello"[5] is legal; remember, "Hello" is an array of 6 chars). We see this information exemplified in the following:
printf("%zu\n", sizeof("Hello")); // 6 (sizeof yields a size_t, so use %zu)
const char str[] = "Hello"; // Initializes a 6-element array from the literal,
// resulting in a copy of all 6 bytes
printf("%zu\n", sizeof(str)); // 6
const char* str2 = "Hello"; // The literal's array decays to a pointer to its first char
printf("%zu\n", sizeof(str2)); // typically 4 on a 32bit system, 8 on a 64bit system
Do note, when casting to a pointer, that you get some pointer e.g. 0x7fff2fe0 to an array of characters that is not modifiable - attempting to modify the data pointed at 0x7fff2fe0 or 0x7fff2fe5 is explicitly undefined behavior. This status is commonly represented with const; by writing const, the compiler will correctly complain if you try to edit it.
As an additional note, by writing
char myarr[] = "Hello";
You will create a duplicate stack-allocated character array named myarr, and that array may be modified. myarr will indeed still contain \0 and have a size of 6 chars, in particular, myarr will have type char[6], with sizeof(myarr) == 6.
From the C11 Standard
Section 6.4.5 String Literals, Paragraph 6 (p. 71):
In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals.78) The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence.
A string literal already includes a terminating \0 by itself, regardless of what you do with that literal. "hello" is always a char [6] array of h, e, l, l, o and \0, by definition. So, the fact that you "pass it to a function" is completely inconsequential here.
There's no need to add anything.
String literals themselves are not passed to functions, only a pointer to the first character. The referenced object contains all the chars plus the terminating zero.
I am learning C and I came across the pointers.
Even though I learned more with this tutorial than from the textbook I still wonder about the char pointers.
If I program this
#include <stdio.h>
int main()
{
char *ptr_str;
ptr_str = "Hello World";
printf(ptr_str);
return 0;
}
The result is
Hello World
I don't understand how there isn't an error while compiling since the pointer ptr_str is pointing directly to the text and not to the first character of the text. I thought that only this would work
#include <stdio.h>
int main()
{
char *ptr_str;
char var_str[] = "Hello World";
ptr_str = var_str;
printf(ptr_str);
return 0;
}
So in the first example how was I pointing directly to the text?
Your code works because string literals are essentially static arrays.
ptr_str = "Hello World";
is treated by the compiler as if it were
static char __tmp_0[] = {'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '\0' };
ptr_str = __tmp_0;
(except trying to modify the contents of a string literal has undefined behavior).
You can even apply sizeof to a string literal and you'll get the size of the array: sizeof "Hello" is 6, for example.
In the context of assignment to a char pointer the 'value' of a string literal is the address of its first character.
so
ptr_str = "Hello World";
sets ptr_str to the address of the 'H'
Why won't the first one work? It will work as you have seen.
String literals are arrays. From §6.4.5p6 C11 Standard N1570
The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence.
Now in the first case the literal array decayed into a pointer to its first element, so the decayed pointer points to 'H'. You assigned that pointer to ptr_str. printf treats its first argument as the format string; since "Hello World" contains no % conversion specifications, printf simply prints every character until it reaches the \0. (With an explicit format it would be printf("%s", ptr_str), where %s expects a char* argument.) That's all that happened. This is how you ended up pointing directly to the text.
Note that the second case is quite different from the first: in the second case a copy is made, which can be modified (trying to modify the first one would be undefined behavior). We are basically initializing a char array with the content of the string literal.
If this code is correct:
char v1[ ] = "AB";
char v2[ ] = {"AB"};
char v3[ ] = {'A', 'B'};
char v4[2] = "AB";
char v5[2] = {"AB"};
char v6[2] = {'A', 'B'};
char *str1 = "AB";
char *str2 = {"AB"};
Then why this other one is not?
char *str3 = {'A', 'B'};
To the best of my knowledge (please correct me if I'm wrong at any point) "AB" is a string literal and 'A' and 'B' are characters (integers,scalars). In char *str1 = "AB"; the string literal "AB" is defined and the char pointer is set to point to that string literal (to the first element). With char *str3 = {'A', 'B'}; two characters are defined and stored in subsequent memory positions, and the char pointer "should" be set to point to the first one. Why is that not correct?
In a similar way, a regular char array like v3[] or v6[2] can indeed be initialized with {'A', 'B'}. The two characters are defined, the array is set to point to them and thus, being "turned into" or treated like a string literal. Why a char pointer like char *str3 does not behave in the same way?
Just for the record, gcc compiler warnings I get are "initialization makes pointer from integer without a cast" when it gets to the 'A', and "excess elements in scalar initializer" when it gets to the 'B'.
Thanks in advance.
There is one thing you need to learn about constant string literals. Except when used to initialize an array (for example in the case of v1 in your example code) constant string literals are themselves arrays. For example if you use the literal "AB" it is stored somewhere by the compiler as an array of three characters: 'A', 'B' and the terminator '\0'.
When you initialize a pointer to point to a literal string, as in the case of str1 and str2, then you are making those pointers point to the first character in those arrays. You don't actually create an array named str1 (for example) you just make it point somewhere.
The definition
char *str1 = "AB";
is equivalent to
char *str1;
str1 = "AB";
Or rather
char unnamed_array_created_by_compiler[] = "AB";
char *str1 = unnamed_array_created_by_compiler;
There are also other problematic things with the definitions you show. First of all the arrays v3, v4, v5 and v6: you tell the compiler they will be arrays of two char elements. That means you cannot use them as strings in C, since strings need the special terminator character '\0'.
In fact if you check the sizes of v1 and v2 you will see that they are indeed three bytes large, one for each of the characters plus the terminator.
Another important thing you miss is that while constant string literals are arrays of char, you miss the constant part. String literals are really read-only, even if not stored as such. That's why you should never create a pointer to char (like str1 and str2) to point to them, you should create pointers to constant char. I.e.
const char *str1 = "AB";
Double quotes (" ") are for strings and single quotes (' ') are for characters. For a string literal, memory is allocated; for a character constant it is not. A pointer must point to some memory, so you have to give it memory to point to, whereas an array of characters has its own storage.
This is something that has bugged me for quite a while.
I am trying to declare an array of char. I am aware of the fact that a string is an array of char,
but what I want to know is this: when I declare, for example, an array of characters
(note that I mean characters, not a string) like
char alphabet[26]={"a", "b" ,"c" ......"z"}
is that same as
char alphabet[]="abcd...z"
let's say I would do a bubble sort(I know is slow) to switch the alphabet order other way around is there any difference in handling those 2?
just really really curious.
No, a string and an array of char are quite different, though an array of char may contain a string.
An array is a data type. A string is a data layout.
An array is an object consisting of a contiguous sequence of some specified element type.
A string in C is, by definition, "a contiguous sequence of characters terminated by and including the first null character" (reference: N1570 7.1.1, paragraph 1).
Although the terminating null character is part of a string, the length of a string is defined as "the number of bytes preceding the null character".
For example:
char arr[10] = "hello";
The array arr has 10 elements, with values { 'h', 'e', 'l', 'l', 'o', '\0', '\0', '\0', '\0', '\0' }.
The first 6 bytes of the array object arr contain a string whose value is { 'h', 'e', 'l', 'l', 'o', '\0' }, or, equivalently, "hello".
As for your declarations, the first one is valid if you change the double quotes to single quotes:
char alphabet[26]={'a', 'b', 'c', ..., 'z'};
but the array alphabet doesn't contain a string because there's no terminating '\0' null character.
In your second declaration:
char alphabet[]="abcd...z";
alphabet is 27 bytes long, because a string literal implicitly specifies that there's a trailing null character.
One exception: If the length of the string literal is exactly the same as the specified size of the array:
char not_a_string[5] = "hello";
then there is no null terminator. This is rarely a good idea.