Char array initialization dilemma - c

Consider the following code:
// hacky, since "123" is 4 chars long (including terminating 0)
char symbols[3] = "123";
// clean, but lot of typing
char symbols[3] = {'1', '2', '3'};
So the twist is described in the comments to the code: is there a way to initialize a char[] with a string literal without the terminating zero?
Update: seems like IntelliSense is wrong indeed, this behaviour is explicitly defined in C standard.

This
char symbols[3] = "123";
is a valid statement.
According to the ANSI C Specification of 1988:
An array of character type may be initialized by a character string
literal, optionally enclosed in braces. Successive characters of the
character string literal (including the terminating null character if
there is room or if the array is of unknown size) initialize the
members of the array.
Therefore, what you're doing is technically fine.
Note that character arrays are an exception to the stated constraints on initializers:
There shall be no more initializers in an initializer list than there
are objects to be initialized.
However, the technical correctness of a piece of code is only a small part of that code's "goodness". The line char symbols[3] = "123"; will immediately strike the veteran programmer as suspect because it appears, at face value, to be a valid string initialization and later may be used as such, leading to unexpected errors and certain death.
If you wish to go this route you should be sure it's what you really want. Saving that extra byte is not worth the trouble this could get you into. The null character, if anything, allows you to write better, more flexible code because it provides an unambiguous (in most instances) way of terminating the array.
(Draft specification available here.)
To co-opt Rudy's comment elsewhere on this page, Example 8 in §6.7.8 of the C99 Draft Specification (paragraph 32, p. 130) states that the lines
char s[] = "abc", t[3] = "abc";
are identical to
char s[] = { 'a', 'b', 'c', '\0' },
t[] = { 'a', 'b', 'c' };
From which you can deduce the answer you're looking for.
The C99 specification draft can be found here.

If your array is only 3 chars long, the first line of code is identical to the second line. The '\0' at the end of the string will simply not be stored. In other words, there is nothing "dirty" or "wrong" with it.

1) The problems you are mentioning are not problems.
2) Que: Is there a way to initialize char[] with string literal without terminating zero? -- you are already doing that.

Related

Utility of '\0' in C string [duplicate]

#include <stdio.h>
#include <string.h>
int main()
{
    char ch[20] = {'h','i'};
    int k=strlen(ch);
    printf("%d",k);
    return 0;
}
The output is 2.
As far as I know '\0' helps compiler identify the end of string but the output here suggests the strlen can detect the end on its own then why do we need '\0'?
long story short: it's your compiler making proactive decisions based on the standard.
long story:
char ch[20] = {'h','i'}
In the line above, what you are telling your compiler is:
allocate a memory big enough to store 20 characters (aka, array of 20 chars).
initialize first two slices (first two members of the array) as 'h' & 'i'.
implicitly initialize the rest.
Since you are initializing your char array, your compiler fills every element you did not list with zero, so the third element acts as the null terminator as long as the array has room for it. This is standard behavior for initialization.
If you were to remove the initializer and assign each member manually like below, the remaining elements would be left uninitialized, and passing the array to strlen would be undefined behavior.
char ch[20];
ch[0] = 'h';
ch[1] = 'i';
Also, if there were no extra space for the null terminator, passing the array to strlen or to printf's %s would still be undefined behavior even with an initializer, as you can test via the code snippet below:
char ch[2] = { 'h','i' };
int k = strlen(ch);
printf("%d\n%s\n", k, ch);
Now, if you increase the array size of 'ch' from 2 to 3 or any higher number, the extra elements are zero-initialized, which supplies the null terminator, so there is no more undefined behavior.
In this declaration:
char ch[20] = {'h','i'};
the first two elements are initialized explicitly and all other elements are initialized implicitly by zeroes.
The above declaration (with one exception: the third element of the array is also explicitly initialized) is in fact equivalent to:
char ch[20] = "hi";
Pay attention to the fact that the string literal is represented as the following array:
{ 'h', 'i', '\0' }
That is the array contains a string that is terminated by the zero character '\0' and the function strlen can successfully find the length of the stored string.
If you would write for example:
char ch[2] = "hi";
then in this case the array ch does not have a space to store the terminating zero of the string literal. In this case applying the function strlen to this array invokes undefined behavior.
A null byte (i.e. the value 0) is what defines the end of a string in C.
When you defined ch, you gave fewer initializers than there are elements in the array, so the remaining elements are set to 0. This results in a null-terminated string.
The strlen function is basically looking for that value and counting how many elements it sees before it finds the null byte.
As far as I know '\0' helps compiler identify the end of string
Technically, it helps user code and the C runtime library identify the ends of strings. To the extent that the compiler needs to know where strings end, it knows without looking for a terminator.
but the output here suggests the strlen can detect the end on its own
That would be a misinterpretation. The actual fact is that your string is null-terminated even though you did not put a null terminator in it explicitly. This is a consequence of declaring your array with an initializer that specifies values for only some of the elements. As some of your other answers describe in more detail, that does not produce a partial initialization. Rather, elements for which the initializer does not specify values are default-initialized. For elements of type char, that means initialization with 0, which serves as a string terminator.
Moreover, if the array were without a terminator then the result of passing it to strlen() would be undefined. You could not then conclude anything from the result.
then why do we need '\0'?
So that user code and many standard library functions can recognize the ends of strings. You already know this.
But in many cases we do not need to provide terminators explicitly. In particular, we do not need to represent them in string literals (and it means something different than you probably intended if you do), and you don't need to represent them in the initializers for char arrays storing strings, provided that the array has more elements than you specify in the initializer.
Your array ch is filled with zeros past the elements you initialized, so the byte after i is already set to zero. You can view it with a debugger or simply test it in the code. Trust me, strlen needs the zero to work.

Are char arrays guaranteed to be null terminated?

#include <stdio.h>
int main() {
    char a = 5;
    char b[2] = "hi"; // No explicit room for `\0`.
    char c = 6;
    return 0;
}
Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character
http://www.eskimo.com/~scs/cclass/notes/sx8.html
In the above example, b only has room for 2 characters, so the null terminating char doesn't have a spot to be placed at, and yet the compiler is reorganizing the memory store instructions so that a and c are stored before b in memory to make room for a \0 at the end of the array.
Is this expected or am I hitting undefined behavior?
It is allowed to initialize a char array with a string if the array is at least large enough to hold all of the characters in the string besides the null terminator.
This is detailed in section 6.7.9p14 of the C standard:
An array of character type may be initialized by a character string
literal or UTF−8 string literal, optionally enclosed in braces.
Successive bytes of the string literal (including the terminating null
character if there is room or if the array is of unknown size)
initialize the elements of the array.
However, this also means that you can't treat the array as a string since it's not null terminated. So as written, since you're not performing any string operations on b, your code is fine.
What you can't do is initialize with a string that's too long, i.e.:
char b[2] = "hello";
As this gives more initializers than can fit in the array and is a constraint violation. Section 6.7.9p2 states this as follows:
No initializer shall attempt to provide a value for an object not contained within the entity
being initialized.
If you were to declare and initialize the array like this:
char b[] = "hi";
Then b would be an array of size 3, which is large enough to hold the two characters in the string constant plus the terminating null byte, making b a string.
To summarize:
If the array has a fixed size:
  - If the string constant used to initialize it is shorter than the array, the array will contain the characters in the string with successive elements set to 0, so the array will contain a string.
  - If the array is exactly large enough to contain the elements of the string but not the null terminator, the array will contain the characters in the string without the null terminator, meaning the array is not a string.
  - If the string constant (not counting the null terminator) is longer than the array, this is a constraint violation which triggers undefined behavior.
If the array does not have an explicit size, the array will be sized to hold the string constant plus the terminating null byte.
Whenever we write a string, enclosed in double quotes, C automatically creates an array of characters for us, containing that string, terminated by the \0 character.
Those notes are mildly misleading in this case. I shall have to update them.
When you write something like
char *p = "Hello";
or
printf("world!\n");
C automatically creates an array of characters for you, of just the right size, containing the string, terminated by the \0 character.
In the case of array initializers, however, things are slightly different. When you write
char b[2] = "hi";
the string is merely the initializer for an array which you are creating. So you have complete control over the size. There are several possibilities:
char b0[] = "hi"; // compiler infers size
char b1[1] = "hi"; // error
char b2[2] = "hi"; // No terminating 0 in the array. (Illegal in C++, BTW)
char b3[3] = "hi"; // explicit size matches string literal
char b4[10] = "hi"; // space past end of initializer is always zero-initialized
For b0, you don't specify a size, so the compiler uses the string initializer to pick the right size, which will be 3.
For b1, you specify a size, but it's too small, so the compiler should give you an error.
For b2, which is the case you asked about, you specify a size which is just barely big enough for the explicit characters in the string initializer, but not the terminating \0. This is a special case. It's legal, but what you end up with in b2 is not a proper null-terminated string. Since it's unusual at best, the compiler might give you a warning. See this question for more information on this case.
For b3, you specify a size which is just right, so you get a proper string in an exactly-sized array, just like b0.
For b4, you specify a size which is too big, although this is no problem. There ends up being extra space in the array, beyond the terminating \0. (As a matter of fact, this extra space will also be filled with \0.) This extra space would let you safely do something like strcat(b4, ", wrld!").
Needless to say, most of the time you want to use the b0 form. Counting characters is tedious and error-prone. As Brian Kernighan (one of the creators of C) has written in this context, "Let the computer do the dirty work."
One more thing. You wrote:
and yet the compiler is reorganizing the memory store instructions so that a and c are stored before b in memory to make room for a \0 at the end of the array.
I don't know what's going on there, but it's safe to say that the compiler is not trying to "make room for a \0". Compilers can and often do store variables in their own inscrutable internal order, matching neither the order you declared them, nor alphabetical order, nor anything else you might think of. If under your compiler array b ended up with extra space after it which did contain a \0 as if to terminate the string, that was probably basically random chance, not because the compiler was trying to be nice to you and helping to make something like printf("%s\n", b) be better defined. (Under the two compilers where I tried it, printf("%s\n", b) printed hi^E and hi ??, clearly showing the presence of trailing random garbage, as expected.)
There are two things in your question.
String literal. A string literal (i.e. something enclosed in double quotes) is always a correctly null-terminated string.
char *p = "ABC"; // p references null character terminated string
Character array. A character array can only hold as many elements as it has, so if you try to initialize a two-element array with a string literal that needs three elements, only the first two will be written. The array will then not contain a null-terminated C string.
char p[2] = "AB"; // p is not a valid C string.
An array of char need not be terminated by anything at all. It is an array. If the actual content is smaller than the dimensions of the array then you need to track the size of that content.
Answers here seem to have degenerated into a string discussion. Not all arrays of char are strings. However it is a very strong convention to use a null terminator as a sentinel if they are to be handled as de facto strings.
Your array may use something else, and may also have separators and zones. After all, it may be part of a union or overlay a structure. It may be a staging area for another system.

meaning of static array of characters?

Somewhere I read the following lines:
char *p = "string literal";
My program crashes if I try to assign a new value to p[i].
A:-It turns into an unnamed, static array of characters, and this unnamed array may be stored in read-only memory, and which therefore cannot necessarily be modified. In an expression context, the array is converted at once to a pointer, as usual (see section 6), so the declaration initializes p to point to the unnamed array's first element.
I know what static does, but I did not understand the following phrase in the above lines:
static array of characters.
This does not refer to the static keyword, but static in the sense that it cannot be changed.
EDIT: Thinking better, it seems this phrase was badly written, I think the author back then (for those wondering, this comes from the C faq) meant "constant"
EDIT2: OP asked what is a string literal, here is the answer:
A string literal is a string that is hardcoded in your source (and later in your compiled program). You write it using double quotes; an example would be "some string literal here".
When you assign this to a pointer, the pointer points to the string literal itself, which is stored with your program's constant data and may be placed in read-only memory. This is why it cannot safely be modified.
You can also use a string literal to initialize an array. The meaning there is different: the array gets its own storage in writable memory and has that string as its initial value.
Mind you, a string literal must be inside double quotes " if you attempt other hacks it won't compile at all. You cannot for example do this: char* someVar = {'f', 'o', 'o', '\0'}; it won't work at all. (my compiler gives the error: excess elements in scalar initializer)
"Static" refers to the storage duration of the object that will be created for the string literal.
To quote C99 6.4.5:
The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence.
Put simply, string literals are string constants, about which the C11 standard says:
It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is
undefined.
A string literal can't change during program execution, while string variables can. String variables are arrays of characters whose last element is a NUL character (\0).
All strings (variables) are arrays of characters, but not all character arrays are strings.
When the compiler encounters a string literal, it typically stores it in a read-only section of memory. Here the word static refers to unmodifiable, not the keyword static.
A string literal:
char *string_literal = "string literal";
which can be pictured as a pointer to an unnamed array with the contents below (the brace form is not a valid initializer for a char *):
{'s','t','r','i','n','g',' ','l','i','t','e','r','a','l','\0'}
A string variable
char string_var[] = "string variable";
or it can also be seen as
char string_var[] = {'s','t','r','i','n','g',' ','v','a','r','i','a','b','l','e', '\0'};
A character array:
char character_array[] = {'c','h','a','r','a','c','t','e','r',' ', 'a', 'r', 'r', 'a', 'y'};

storing of strings in char arrays in C

#include <stdio.h>
int main()
{
    char a[5]="hello";
    puts(a); //prints hello
}
Why does the code compile correctly? We need six places to store "hello", correct?
The C compiler will let you run off the end of arrays; it does no checks of that sort.
The C compiler allows you to explicitly ask for no null terminator.
char a[] = "Hello"; /* adds a terminator implicitly */
char a[6] = "Hello"; /* adds a terminator implicitly */
char a[5] = "Hello"; /* skips it */
Any value smaller than 5 results in an error.
As for why - one possibility is that your strings are of a fixed size, or are being used as buffers of byte values. In these cases you do not need a null terminator.
Best practice is to use char a[] so the compiler can set it to the correct value (including terminator) automatically.
a doesn't contain a null terminated string (extra initializers for fixed size arrays - such as the null terminator in "hello" - are discarded), so the behaviour when a pointer to that array is passed to puts is undefined.
In my experience, a lot of compilers will let you get away with compiling this. It will usually crash at runtime, though (because you don't have a null terminator).
C char array initialization includes the terminating null only if there is room or if the array dimensions are not specified.
You need 6 characters to store "hello" as a null terminated string. But char arrays are not constrained to store nul terminated string, you may need the array for another purpose and forcing an additional nul character in those cases would be pointless.
That is because in C memory management is done manually, unlike in Java and some other languages.
The six places the string needs are not checked for during compilation, but when you actually store or use the string you may get a runtime error, because the program reserved five places in memory (where six are expected) for the characters and the compiler did not check!
"hello" string is kept in read-only memory with 0 in the end. "a" points to this string, this is why the program may work correctly. But I think that generally this is undefined behavior.
It is necessary to see Assembly code generated by compiler to see what happens exactly. If you want to get junk output in this situation, try:
char a[5] = {'h', 'e', 'l', 'l', 'o'};
The C compiler you are using does not check that the string literal fits into the char array. You need 6 characters in the array to fit the literal "Hello", since the literal includes a terminating zero. Modern compilers, such as Visual C++ 2010, do check these things and give you an error.

C string initializer doesn't include terminator?

I am a little confused by the following C code snippets:
printf("Peter string is %zu bytes\n", sizeof("Peter")); // Peter string is 6 bytes
This tells me that when C compiles a string in double quotes, it will automatically add an extra byte for the null terminator.
printf("Hello '%s'\n", "Peter");
The printf function knows when to stop reading the string "Peter" because it reaches the null terminator, so ...
char myString[2][9] = {"123456789", "123456789" };
printf("myString: %s\n", myString[0]);
Here, printf prints all 18 characters because there are no null terminators (and they wouldn't fit without taking out the 9's). Does C not add the null terminator in a variable definition?
Your string is [2][9]. Those [9] are ['1', '2', etc... '8', '9']. Because you only gave each row room for 9 chars, and because you used all 9, there is no room to place a '\0' character. Redefine your char array:
char string[2][10] = {"123456789", "123456789"};
And it should work.
Sure it does, you just aren't leaving enough room for the '\0' byte. Making it:
char string[2][10] = { "123456789", "123456789" };
Will work as you expect (will just print 9 characters).
If you tell C that an array is a given size, C cannot make the array any larger. It would be disobeying you if it did so! Remember that not every char array contains a null terminated string. Sometimes the array (as used) is truly an array of (individual) char. The compiler doesn't know what you are doing and cannot read your mind.
This is why C allows you to initialize a char array where the null terminator won't fit but everything else will. Try your example with a string one byte longer and the compiler will complain.
Note that your example will compile but will not do what you expect, as the contents are not (null terminated) strings. With GCC, running your example, I see the string I should, followed by garbage.
Alternatively, you can use:
char* myString[2] = {"123456789", "123456789" };
Like this, the initializer computes the right size for your null terminated strings.
C allows unterminated strings, C++ does not.
C allows character arrays to be
initialized with string constants. It
also allows a string constant
initializer to contain exactly one
more character than the array it
initializes, i.e., the implicit
terminating null character of the
string may be ignored. For example:
char name1[] = "Harry"; // Array of 6 char
char name2[6] = "Harry"; // Array of 6 char
char name3[] = { 'H', 'a', 'r', 'r', 'y', '\0' };
// Same as 'name1' initialization
char name4[5] = "Harry"; // Array of 5 char, no null char
C++ also allows character arrays to be
initialized with string constants, but
always includes the terminating null
character in the initialization. Thus
the last initializer (name4) in the
example above is invalid in C++.
Is there a reason why the compiler doesn't warn that there isn't enough room for the 0 byte? I get a warning if I try to add another '9' that won't fit, but it doesn't seem to care about dropping the 0 byte?
The '\0' byte isn't its problem. Most of the time, if you have this:
char code[9] = "123456789";
The next byte will be off the edge of the variable, but will be unused memory, and will most likely be 0 (unless you malloc() and don't set the values before using them). So most of the time it works, even if it's bad for you.
If you're using gcc, you might also want to use the -Wall flag, or one of the other (million) warning flags. This might help (not sure).

Resources