Behavior when initializing character array with larger array - c

I was trying out some array related stuff in C.
I did the following:
char a2[4] = {'g','e','e','k','s'};
printf("a2:%s,%u\n",a2,sizeof(a2));
printf("a2[4]:%d,%c\n",a2[4],a2[4]);
In one run, it prints:
a2:geek,4
a2[4]:0,
In another run, it prints:
a2:geek�,4
a2[4]:-1,�
Both runs used the same online compiler, so why is the output different? Is it because the standard defines this case as undefined behavior? If so, can you point me to the exact section of the standard?

Yes, this is undefined behavior. I don't have a reference to the standard, but %s format is for printing null-terminated strings, and you don't have a null terminator on a2. And when you access a2[4] you're accessing outside the array bounds, another cause of undefined behavior.
Finally, the array initializer also causes undefined behavior, see Is it ok to have excess elements in array initializer?

The presence of excess initializers violates a constraint in C 2018 6.7.9 2:
No initializer shall attempt to provide a value for an object not contained within the entity being initialized.
's' would provide an initial value for a2[4]. Since a2 has only four elements, a2[4] does not exist; it is not contained within a2, and the constraint is violated.
That said, compilers will typically provide a warning then ignore the excess initializers and continue. This is the least of the problems in the code you show and has no effect on the output you see.
After the definition of a2, you print it using %s. %s requires a pointer to the first character in a sequence of characters terminated by a null. However, there is no null character in a2. The resulting behavior is not defined by the C standard. Often, what happens is a program will continue to print characters from memory beyond the array. This is of course not guaranteed and is especially unreliable in the modern high-optimization environment.
Assuming the printf does continue to print characters beyond the array, it appears that, on one system, there happens to be a null character beyond the array, so the printf stops after four characters. When you later print a2[4] (also behavior not defined by the C standard) as an integer (%d) and a character (%c), we see there is indeed a null character there.
On the other system, there is a −1 value in the memory at a2[4], which displays as “�”. After it, there are presumably some number (possibly zero) of non-printing characters and a null character.
Additionally, you print sizeof(a2) using the printf specifier %u. This is incorrect and may have undefined behavior. A proper specifier for the result of sizeof is %zu.
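One way to print such an array without relying on a terminator is to pass an explicit length through the %.*s precision, and to use %zu for the size. The sketch below is only an illustration; describe_chars is a hypothetical helper name, not something from the question.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "<chars>,<count>" into out.
   The precision passed via "%.*s" limits how many bytes %s may read,
   so chars does not need a null terminator. */
int describe_chars(char *out, size_t out_size, const char *chars, size_t count)
{
    return snprintf(out, out_size, "%.*s,%zu", (int)count, chars, count);
}
```

For example, with char a2[4] = {'g','e','e','k'}; and a 32-byte buffer buf, the call describe_chars(buf, sizeof buf, a2, sizeof a2) stores geek,4 in buf.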

Related

Clarification about precedence of operators

I got this snippet from some exercises and the question: which is the output of following code:
main()
{
char *p = "ayqm";
printf("%c", ++*(p++));
}
My expected answer was z but the actual answer was in fact b. How is that possible?
Later edit: the snippet is taken as it is from an exercise and did not focus on the string literal or syntax issues existent in other than the printf() code zone.
As posted, the program has multiple problems:
it tries to modify the string literal "ayqm", which is described as undefined behavior in the C Standard.
it uses printf without a proper declaration, again producing undefined behavior.
its output is not terminated with a newline, causing implementation defined behavior.
the prototype for main without a return type is obsolete, no longer supported by the C Standard.
incrementing characters produces implementation defined behavior. If the execution character set is ASCII, 'a'+1 does produce 'b', but it is not guaranteed by the C Standard. Indeed, in the EBCDIC character set still used on some mainframe computers, letters are not in a single contiguous sequence (i.e. 'a'+1 == 'b' but 'i'+1 != 'j' in this character set).
Here is a corrected version:
#include <stdio.h>
int main(void) {
char str[] = "ayqm";
char *p = str;
printf("%c\n", ++*(p++));
return 0;
}
p is post-incremented, which means the current value of p is used for the * operator and the value of p is incremented before the next sequence point, namely the call to the printf function. The character read through p, 'a' is then incremented, which may or may not produce 'b' depending on the execution character set.
After printf returns to the main function, p points to str[1] and str contains the string "byqm".
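To make the order of operations explicit, the expression ++*(p++) can be unrolled into separate steps. The function below is a sketch of that unrolling; pre_inc_through_post_inc is a made-up name used only for illustration.

```c
/* Hypothetical helper: does the same work as ++*(p++) on the pointer *pp. */
char pre_inc_through_post_inc(char **pp)
{
    char *old = *pp;  /* p++ yields the old value of p ... */
    *pp = old + 1;    /* ... and advances p as a side effect */
    ++*old;           /* ++ is applied to the character at the old position */
    return *old;      /* the incremented character is the result */
}
```

Called with the address of a pointer into a modifiable array such as str above, it returns the incremented first character and leaves the pointer at str[1].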
Your program has undefined behavior because it tries to modify the string literal "ayqm". As per the standard, attempting to modify a string literal results in undefined behavior because it may be stored in read-only storage.
The pointer p is pointing to string literal "ayqm". This expression
printf ("%c", ++*(p++));
ends up attempting to modify the string literal that pointer p is pointing to.
When a program has undefined behavior, it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.

How char array behaves for longer strings?

I asked this question as one of multiple questions here, but people asked me to ask them separately. Hence this question.
Consider below code lines:
char a[5] = "geeks"; //1
char a3[] = {'g','e','e','k','s'}; //d
printf("a:%s,%u\n",a,sizeof(a)); //5
printf("a3:%s,%u\n",a3,sizeof(a3)); //j
printf("a[5]:%d,%c\n",a[5],a[5]);
printf("a3[5]:%d,%c\n",a3[5],a3[5]);
Output:
a:geeksV,5
a3:geeks,5
a[5]:86,V
a3[5]:127,
However the output in original question was:
a:geeks,5
a3:geeksV,5
The question 1 in original question was:
Does line #1 add \0? Notice that sizeof prints 5 in line #5, indicating \0 is not there. But then, how does #5 not print something like geeksU, as in the case of line #j? I feel \0 does indeed get added in line #1, but is not counted by sizeof, while it is considered by printf. Am I right about this?
Realizing that the output changed (for the same online compiler) when I took out only those code lines related to the first question in the original question, I now doubt what is going on here. I believe this is undefined behavior according to the C standard. Can someone shed more light? Possibly for another compiler?
Sorry again for asking a 2nd question.
char a[5] = "geeks"; //1
Here, you specify the array's size as 5 and initialize it with 5 characters.
Therefore, you do not have a "C string", which by definition is ended by a NUL (0).
printf("a:%s,%u\n",a,sizeof(a)); //5
The array itself still has a size of 5, which is correctly reported by the sizeof operator, but your call to printf is undefined behaviour and could print anything after the array's contents. In practice it will typically keep looking at the next address until it finds a 0 somewhere. That could be immediately, or it could print 1000000 garbage characters, or it could cause some sort of segfault or other crash.
char a3[] = {'g','e','e','k','s'}; //d
Because you don't specify the array's size, the compiler will, through the initialization syntax, determine the size of the array. However, the way you chose to initialize a3, it will still only provide 5 bytes of length.
The reason for that is that your initialization just is an initialization list, and not a "string". Therefore, your subsequent call to printf also is undefined behaviour, and it is just luck that at the position a3[5] there seems to be a 0 in your case.
Effectively, both examples have the very same error.
You could instead write:
char a3[] = "geeks";
Using a string literal for initialization of the array with unspecified size will cause the compiler to allocate enough memory to hold the string and the additional NUL-terminator, and sizeof (a3) will now yield 6.
"geeks" here is a string literal in C.
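When you write "geeks", the compiler automatically adds the null character to the end. This makes it 6 characters long.
But you are using it to initialize char a[5], which has no room for that null character. The initialization itself is allowed; the undefined behaviour comes later, when the unterminated array is treated as a string.
As mentioned by @DavidBowling, in this case the following condition applies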
(Section 6.7.8.14) C99 standard.
An array of character type may be initialized by a character string literal, optionally enclosed in braces. Successive characters of the character string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array
The characters of "geeks" will be copied into the array a, but the null character will not be copied.
So in this case, when you try to print the array, printf will continue printing until it encounters a \0 in memory.
From the further print statements it is seen that a[5] has the value V. Presumably the byte after that on your system is \0, and the printing stops there.
So, in your system, at that instance, "geeksV" is printed.
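Whether an array actually holds a terminator can be checked without reading past its end by searching only within its own size. A minimal sketch, assuming a hypothetical helper named has_terminator:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a null character occurs within the
   first size bytes of arr, 0 otherwise. memchr never looks beyond size. */
int has_terminator(const char *arr, size_t size)
{
    return memchr(arr, '\0', size) != NULL;
}
```

Called as has_terminator(a, sizeof a), it reports 0 for a five-element array holding exactly the five characters of geeks, and 1 for char a3[] = "geeks";, whose size is 6.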

C compiler Differences? NUL control character in Char

Below is C code that creates char arrays with and without an explicitly added null character. The results are unexpected in two compilers, and I'm not sure why we even have to explicitly add the null character?
//
// stringBugorNot.c
//
//
//
#include <string.h>
#include <stdio.h>
int main(void)
{
char aString[3] = {'a', 'b','c'};
char bString[4] = {'a', 'b', 'c', '\0'};
printf("\n");
printf("len of a is: %lu\n", strlen(aString));
printf("len of b is: %lu\n", strlen(bString));
printf("\n");
//Portion A
printf("last element of a is: '%c'\n", aString[strlen(aString)]);
printf("last element of b is: '%c'\n", bString[strlen(bString)]);
printf("\n");
//Portion B
printf("last element of a is: '%c'\n", aString[strlen(aString) - 1]);
printf("last element of b is: '%c'\n", bString[strlen(bString) - 1]);
}
Comments
+clang will give a runtime error because out of bounds on "aString".. makes sense
+gcc will not give any error and simply output "nothing" the null as expected. But maybe gcc is smarter and adds the null for me? Is the actual memory size different??
Clang OUTPUT ---->
len of a is: 3
len of b is: 3
bugOrNot.c:16:41: runtime error: index 3 out of bounds for type 'char [3]'
last element of a is: ''
last element of b is: ''
last element of a is: 'c'
last element of b is: 'c'
GCC OUTPUT ---->
len of a is: 9
len of b is: 3
last element of a is: ''
last element of b is: ''
last element of a is: ''
last element of b is: 'c'
Unexpected behavior that you see is called undefined behavior (UB) in the C standard:
Calling strlen on aString is UB because there is no null termination
Indexing aString with anything other than 0, 1, or 2 is UB
gcc could end up supplying a null terminator inadvertently, for example by aligning bString at a 4-byte boundary. It doesn’t change the fact that it’s still UB, though.
When you say
char bString[4] = {'a', 'b', 'c', '\0'};
you have properly constructed a null-terminated string. It is precisely as if you had said
char bString[4] = "abc";
Since this is a proper, null-terminated string, it is meaningful and legal to call strlen(bString), and you will get a result of 3.
When you say
char aString[3] = {'a', 'b','c'};
on the other hand, and as I think you know, you have not constructed a proper null-terminated string. Therefore, it is not legal or meaningful to call strlen(aString) -- formally, we say that the result is undefined, meaning that absolutely anything can happen.
You tried the code with two different compilers, and were surprised to get two different results. This is perfectly normal. (It's perfectly normal to get two different results, and it's perfectly normal to be surprised by this, because it is pretty surprising, the first few times you encounter it.)
It is not the case that one compiler is "smarter" than the other, or that it "guessed" that you were trying to construct a string and so automatically supplied the "missing" \0 for you. It was simply a fluke, a random happenstance. (It is also certainly not the case that one compiler or the other has any kind of a bug. Again, there's no right result here, so a compiler can't be wrong, no matter what it does.)
If you want to work with strings in C, make sure that they're all properly null-terminated. If you should ever happen to accidentally do something stringlike with a non-properly-null-terminated string, don't try to interpret the results, don't assume that they mean anything, and especially don't decide that it's the "right" result that you can therefore depend on. You can't. It's likely to change for no reason, like when you use a different compiler next week, or when your customer uses your program on vital data instead of test data.
In C, a string is a sequence of character values including the nul terminator. That terminator is how the various C library routines know where the end of the string is. If you don't terminate a string properly, library routines like strlen and strcpy and printf with %s will all scan past the end of the string into other memory, resulting in garbled output or runtime errors.
The reason you got different results for the length of a with the two different compilers is that in the clang case, the byte immediately following the last element of a contained 0, whereas in the gcc case the bytes immediately following a did not contain 0.
Strictly speaking, the behavior on passing a non-terminated sequence of characters to the string handling routines is undefined - the language specification places no requirements on the compiler or runtime environment to "do the right thing", whatever that would be. You've basically voided the warranty at that point, and pretty much anything can happen.
Note that the C language specification does not require bounds checks on array accesses - the fact that you got the index out of bounds exception for clang is due to the compiler being extra friendly and going beyond what the language standard actually requires.
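If a length is needed for a buffer that might not be terminated, the scan can be bounded by the array size instead of calling strlen. The sketch below assumes a hypothetical helper, bounded_length, which behaves much like POSIX strnlen but uses only standard C.

```c
#include <stddef.h>

/* Hypothetical helper: counts characters before the first null, but never
   examines more than size bytes. Returns size if no null was found. */
size_t bounded_length(const char *buf, size_t size)
{
    size_t n = 0;
    while (n < size && buf[n] != '\0') {
        n++;
    }
    return n;
}
```

With the arrays from the question, bounded_length(aString, sizeof aString) and bounded_length(bString, sizeof bString) both return 3. A result equal to the array size signals that no terminator is present.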

What if I define a character array of X elements and add Y (>X) elements to it?

There is the NUL character that is added after every string.
So if I define a character array of 10 elements and put 6 elements, the 7th element is automatically the null character. (If I'm wrong somewhere please correct me.)
So if I make a character array of 3 elements and put 4 letters into it like
char array[3]={'h','i','y','a'};
and I do not put the NUL character, what happens?
Is the null character added in the last position?
Do I get an error?
I'm so sorry, normally I would simply try running the code but my virtual machine keeps crashing for some reason.
According to the C Standard (C11 6.7.9. Initialization) it's a constraint violation that must be diagnosed:
Constraints
2 No initializer shall attempt to provide a value for an object not contained within the entity being initialized.
For example, gcc and clang will report
x.c: In function 'main':
x.c:4:30: warning: excess elements in array initializer
char array[3]={'h','i','y','a'};
^
x.c:4:30: note: (near initialization for 'array')
So what happens, i.e. what code does the compiler generate? In C lingo, violating a constraint is undefined behavior, so the answer is: we can't tell. Some compiler might ignore the extra character, some might extend the array dimension. Some might refuse to compile it. Some might create a corrupt program. This is the reason why good programmers steer clear of undefined behavior.
You initialize a 3 byte long buffer with 4 characters and no terminating zero. GCC uses the first 3 for you, ignores the 4th and gives you a warning:
warning: initializer-string for array of chars is too long
When you printf it, you just pass a pointer, in our case the array symbol means the start address of the array. printf has no idea about its length, it will just happily iterate over bytes in memory until it finds a terminating zero.
So printf will print the memory area following your array, which is undefined behaviour, as only the compiler knows what it has placed after your array. If you're lucky, there will be a zero byte pretty soon, and printf stops before reaching a page of virtual memory it can't read. You end up with hiy and some garbage.
If you're not so lucky, printf reaches an unreadable page of memory and you die a painful death of segmentation fault.
Just trying to explain what the lawyers mean with undefined behaviour... :)
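On the first part of the question (fewer initializers than elements), C does zero-initialize the elements that have no initializer, so a 10-element array given 6 characters is a valid string. A small sketch to check this, using a hypothetical helper named rest_is_zero:

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every element of arr from index used
   up to size - 1 is zero, 0 otherwise. */
int rest_is_zero(const char *arr, size_t used, size_t size)
{
    for (size_t i = used; i < size; i++) {
        if (arr[i] != '\0') {
            return 0;
        }
    }
    return 1;
}
```

For char array[10] = {'h','e','l','l','o','!'};, the call rest_is_zero(array, 6, sizeof array) returns 1, which is why strlen(array) is well defined there.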

C character array and its length

I am now studying C with "C Programming Absolute Beginner's Guide" (3rd Edition), which says that all character arrays should have a size equal to the string length + 1 (the extra element holding the string-terminating zero). But this code:
#include <stdio.h>
main()
{
char name[4] = "Givi";
printf("%s\n",name);
return 0;
}
outputs Givi and not Giv. Array size is 4 and in that case it should output Giv, because 4 (string length) + 1 (string-termination zero character length) = 5, and the character array size is only 4.
Why does my code output Givi and not Giv?
I am using MinGW 4.9.2 SEH for compilation.
You are hitting what is considered to be undefined behavior. It's working now, but due to chance, not correctness.
In your case, it's because the memory in your program is probably all zeroed out at the beginning. So even though your string is not terminated properly, it just so happens that the memory right after it is zero, so printf knows when to stop.
+-----------------------+
|G|i|v|i|\0|\0|... |
+-----------------------+
| your | rest of |
| stuff | memory (stack)|
+-----------------------+
Other languages, such as Java, have safeguards against this sort of situation. Languages like C, however, do less hand holding, which, on the one hand, allows more flexibility, but on the other, gives you many more ways to shoot yourself in the foot with subtle issues such as this one. In other words, the fact that your code compiles doesn't mean it's correct and won't blow up now, in 5 minutes or in 5 years.
In real life, the memory after your array is almost never conveniently zeroed, and your string might end up stored next to other things, which would then get printed out together with your string. You never want this. Situations like this might lead to crashes, exploits and leaked confidential information.
See the following diagram for an example. Imagine you're working on a web server and the string "secret"--a user's password or key is stored right next to your harmless string:
+-----------------------+
|G|i|v|i|s|e|c|r|e|t |
+-----------------------+
| your | rest of |
| stuff | memory (stack)|
+-----------------------+
Every time you would output what you would think is "Givi", you'd end up printing out the secret string, which is not what you want.
The byte after the last character always has to be 0; otherwise printf does not know where the string is terminated and keeps accessing bytes (or chars) until it finds one that is 0.
As Andrei said, apparently it just happened, that the compiler put at least one byte with the value 0 after your string data, so printf recognized the end of the string.
This can vary from compiler to compiler and thus is undefined behaviour.
There could, for instance, be a chance to have printf accessing an address, which your program is not allowed to. This would result in a crash.
In C text strings are stored as zero terminated arrays of characters. This means that the end of a text string is indicated by a special character, a numeric value of zero (0), to indicate the end of the string.
So the array of text characters to be used to store a C text string must include an array element for each of the characters as well as an additional array element for the end of string.
The C text string functions (strcpy(), strcmp(), strcat(), etc.) all expect that the end of a text string is indicated by a value of zero. This includes the printf() family of functions that print or output text to the screen or to a file. Since these functions depend on seeing a zero value to terminate the string, one source of errors when using C text strings is copying too many characters due to a missing zero terminator or copying a long text string into a smaller buffer. This type of error is known as a buffer overflow error.
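One way to avoid both problems when copying is to let snprintf enforce the destination size, since it always terminates the result when the size is nonzero. The sketch below assumes a hypothetical helper named copy_string; it truncates rather than overflows and reports whether the whole string fit.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies src into dst, writing at most dst_size bytes
   including the terminator. Returns 1 if src fit completely, 0 if it was
   truncated or an error occurred. */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

Copying "Givi" into a 5-element array succeeds, while copying it into a 4-element array returns 0 and leaves the terminated string Giv in the destination.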
The C compiler will perform some types of adjustments for you automatically. For instance:
char *pText = "four"; // pointer to a text string constant, compiler automatically adds zero to an additional array element for the constant "four"
char text[] = "four"; // compiler creates an array with 5 elements and puts the characters four in the first four array elements, a value of 0 in the fifth
char text[5] = "four"; // programmer creates array of 5 elements, compiler puts the characters four in the first four array elements, a value of 0 in the fifth
In the example you provided a good C compiler should issue at the minimum a warning and probably an error. However it looks like your compiler is truncating the string to the array size and is not adding the additional zero string terminator. And you are getting lucky in that there is a zero value after the end of the string. I suppose there is also the possibility that the C compiler is adding an additional array element anyway but that would seem unlikely.
What your book states is basically right, but there is missing the phrase "at least". The array can very well be larger.
You already stated the reason for the min length requirement. So what does that tell you about the example? It is crap!
What it exhibits is called undefined behaviour (UB) and might result in daemons flying out of your nose for the printf(), not the initializer. It is just not covered by the C standard (well, the standard actually says this is UB), so the compiler (and your libraries) are not expected to behave correctly.
For such cases, no terminator is appended, so the string is not properly terminated when passed to printf().
The reason this does not produce an error is likely legacy code that exploited it to save some bytes of memory. So, instead of reporting an error that the implicit trailing '\0' terminator does not fit, the compiler simply does not append it. Silently truncating the string literal would also be a bad idea.
The following line:
char name[4] = "Givi";
May give warning like:
string for array of chars is too long
Because the behavior is undefined, the compiler may still accept it. But if you debug, you will see:
name[0] 'G'
name[1] 'i'
name[2] 'v'
name[3] '\0'
And so the output is
Giv
Not Givi as you mentioned in the question!
I'm using GCC compiler.
But if you write something like this:
char name[4] = "Giv";
Compiles fine! And output is
Giv

Resources