C compiler Differences? NUL control character in Char - c

Below is C code that will generate char array but with explicitly adding Null character or not. The results are unexpected in two compiler and I'm not sure why we even have to explicitly add the Null character?
//
// stringBugorNot.c
//
//
//
#include <string.h>
#include <stdio.h>
int main(void)
{
char aString[3] = {'a', 'b','c'};
char bString[4] = {'a', 'b', 'c', '\0'};
printf("\n");
printf("len of a is: %lu\n", strlen(aString));
printf("len of b is: %lu\n", strlen(bString));
printf("\n");
//Portion A
printf("last element of a is: '%c'\n", aString[strlen(aString)]);
printf("last element of b is: '%c'\n", bString[strlen(bString)]);
printf("\n");
//Portion B
printf("last element of a is: '%c'\n", aString[strlen(aString) - 1]);
printf("last element of b is: '%c'\n", bString[strlen(bString) - 1]);
}
Comments
+clang will give a runtime error because out of bounds on "aString".. makes sense
+gcc will not give any error and simply output "nothing" the null as expected. But maybe gcc is smarter and adds the null for me? Is the actual memory size different??
Clang OUTPUT ---->
len of a is: 3
len of b is: 3
bugOrNot.c:16:41: runtime error: index 3 out of bounds for type 'char [3]'
last element of a is: ''
last element of b is: ''
last element of a is: 'c'
last element of b is: 'c'
GCC OUTPUT ---->
len of a is: 9
len of b is: 3
last element of a is: ''
last element of b is: ''
last element of a is: ''
last element of b is: 'c'

Unexpected behavior that you see is called undefined behavior (UB) in the C standard:
Calling strlen on aString is UB because there is no null termination
Dereferencing aString at its undefined index is UB, unless the index is 0, 1, or 2
gcc could insert null terminator inadvertently by aligning bString at 4-byte boundary. It doesn’t change the fact that it’s still a UB, though.

When you say
char bString[4] = {'a', 'b', 'c', '\0'};
you have properly constructed a null-terminated string. It is precisely as if you had said
char bString[4] = "abc";
Since this is a proper, null-terminated string, it is meaningful and legal to call strlen(bString), and you will get a result of 3.
When you say
char aString[3] = {'a', 'b','c'};
on the other hand, and as I think you know, you have not constructed a proper null-terminated string. Therefore, it is not legal or meaningful to call strlen(aString) -- formally, we say that the result is undefined, meaning that absolutely anything can happen.
You tried the code with two different compilers, and were surprised to get two different results. This is perfectly normal. (It's perfectly normal to get two different results, and it's perfectly normal to be surprised by this, because it is pretty surprising, the first few times you encounter it.)
It is not the case that one compiler is "smarter" than the other, or that it "guessed" that you were trying to construct a string and so automatically supplied the "missing" \0 for you. It was simply a fluke, a random happenstance. (It is also certainly not the case that one compiler or the other has any kind of a bug. Again, there's no right result here, so a compiler can't be wrong, no matter what it does.)
If you want to work with strings in C, make sure that they're all properly null-terminated. If you should ever happen to accidentally do something stringlike with a non-properly-null-terminated string, don't try to interpret the results, don't assume that they mean anything, and especially don't decide that it's the "right" result that you can therefore depend on. You can't. It's likely to change for no reason, like when you use a different compiler next week, or when your customer uses your program on vital data instead of test data.

In C, a string is a sequence of character values including the nul terminator. That terminator is how the various C library routines know where the end of the string is. If you don't terminate a string properly, library routines like strlen and strcpy and printf with %s will all scan past the end of the string into other memory, resulting in garbled output or runtime errors.
The reason you got different results for the length of a with the two different compilers is that in the clang case, the byte immediately following the last element of a contained 0, whereas in the gcc case the bytes immediately following a did not contain 0.
Strictly speaking, the behavior on passing a non-terminated sequence of characters to the string handling routines is undefined - the language specification places no requirements on the compiler or runtime environment to "do the right thing", whatever that would be. You've basically voided the warranty at that point, and pretty much anything can happen.
Note that the C language specification does not require bounds checks on array accesses - the fact that you got the index out of bounds exception for clang is due to the compiler being extra friendly and going beyond what the language standard actually requires.

Related

Why are C-strings (char arrays) getting merged together or printing erroneously sometimes in C?

I have been working with strings in C. While working with ways to declare them and initialize them, I found some weird behavior I don't understand.
#include<stdio.h>
#include<string.h>
int main()
{
char str[5] = "World";
char str1[] = "hello";
char str2[] = {'N','a','m','a','s','t','e'};
char* str3 = "Hi";
printf("%s %zu\n"
"%s %zu\n"
"%s %zu\n"
"%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Sample output:
Worldhello 10
hello 5
Namaste 7
Hi 2
In some cases, the above code makes str contain Worldhello, and the rest are as they were intialized. In some other cases, the above code makes str2 contain Namastehello. It happens with different variables I never concatenated. So, how are they are getting combined?
To work with strings, you must allow space for a null character at the end of each string. Where you have char str[5]="World";, you allow only five characters, and the compiler fills them with “World”, but there is no space for a null character after them. Although the string literal "World" includes an automatic null character at its end, you did not provide space for it in the array, so it is not copied.
Where you have char str1[]="hello";, the compiler determines the array size by counting the characters, including the null character at the end of the string literal.
Where you have char str2[]={'N','a','m','a','s','t','e'};, there is no string literal, just a list of individual characters. The compiler determines the array size by counting those. Since there is no null character, it does not provide space for it.
One potential consequence of failing to terminate a string with a null character is that printf will continue reading memory beyond the string and printing characters from the values it finds. When the compiler has placed other character arrays after such an array you are printing, characters from those arrays may appear in the output.
If you allow space for a null character in str and provide a zero value in str2, your program will print strings in an orderly way:
#include <stdio.h>
#include <string.h>
int main(void)
{
char str[6] = "World"; // 5 letters plus a null character.
char str1[] = "hello";
char str2[] = {'N', 'a', 'm', 'a', 's', 't', 'e', 0}; // Include a null.
char *str3 = "Hi";
printf("%s %zu\n%s %zu\n%s %zu\n%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Undefined behavior in non-null-terminated, adjacently-stored C-strings
Why do you get this part:
Worldhello 10
hello 5
...instead of this?
World 5
hello 5
The answer is that printf() prints chars until it hits a null character, which is a binary zero, frequently written as the '\0' char. And, the compiler happens to have placed the character array containing hello right after the character array containing World. Since you explicitly forced the size of str to be 5 via str[5], the compiler was unable to fit the automatic null character at the end of the string. So, with hello happening to be (not guaranteed to be) right after World, and printf() printing until it sees a binary zero, it printed World, saw no terminating null char, and continued right on into the hello string right after it. This resulted in it printing Worldhello, and then stopping only when it saw the terminating character after hello, which string is properly terminated.
This code relies on undefined behavior, which is a bug. It cannot be relied upon. But, that is the explanation for this case.
Run it with gcc on a 64-bit Linux machine online here: Online GDB: undefined behavior in NON null-terminated C strings
#Eric Postpischil has a great answer and provides more insight here.
From the C tag wiki:
This tag should be used with general questions concerning the C language, as defined in the ISO 9899 standard (the latest version, 9899:2018, unless otherwise specified — also tag version-specific requests with c89, c99, c11, etc).
You've asked a "how?" question about something that none of those documents defines, and so the answer is undefined in the context of C. You can only experience this phenomenon through undefined behaviour.
how are they are getting combined?
There is no such requirement that any of these variables are "combined" or are immediately located after each other; trying to observe that is undefined behaviour. It may appear to coincidentally work (whatever that means) for you at times on your machine, while failing at other times or using some other machine or compiler, etc. That's purely coincidental and not to be relied upon.
In some cases, the above code assigns str with Worldhello and the rest as they were intitated.
In the context of undefined behaviour, it makes no sense to make claims about how your code functions, as you've already noticed, the functionality is erratic.
I found some weird Behaviour with them.
If you want to prevent erratic behaviour, stop invoking undefined behaviour by accessing arrays out of bounds (i.e. causing strlen to run off the end of an array).
Only one of those variables is safe to pass to strlen; you need to ensure the array contains a null terminator.

How char array behaves for longer strings?

I asked this question as one of multiple questions here. But people asked me to ask them separately. So why this question.
Consider below code lines:
char a[5] = "geeks"; //1
char a3[] = {'g','e','e','k','s'}; //d
printf("a:%s,%u\n",a,sizeof(a)); //5
printf("a3:%s,%u\n",a3,sizeof(a3)); //j
printf("a[5]:%d,%c\n",a[5],a[5]);
printf("a3[5]:%d,%c\n",a3[5],a3[5]);
Output:
a:geeksV,5
a3:geeks,5
a[5]:86,V
a3[5]:127,
However the output in original question was:
a:geeks,5
a3:geeksV,5
The question 1 in original question was:
Does line #1 adds \0? Notice that sizeof prints 5 in line #5 indicating \0 is not there. But then, how #5 does not print something like geeksU as in case of line #j? I feel \0 does indeed gets added in line #1, but is not considered in sizeof, while is considered by printf. Am I right with this?
Realizing that the output has changed (for same online compiler) when I took out only those code lines which are related to first question in original question, now I doubt whats going on here? I believe these are undefined behavior by C standard. Can someone shed more light? Possibly for another compiler?
Sorry again for asking 2nd question.
char a[5] = "geeks"; //1
Here, you specify the array's size as '5', and initialize it with 5 characters.
Therefore, you do not have a "C string", which by definition is ended by a NUL. (0).
printf("a:%s,%u\n",a,sizeof(a)); //5
The array itself still has a size of 5, which is correctly reported by the sizeof operator, but your call to printf is undefined behaviour and could print anything after the arrray's contents - it will just keep looking at the next address until it finds a 0 somewhere. That could be immediately, or it could print a 1000000 garbage characters, or it could cause some sort of segfault or other crash.
char a3[] = {'g','e','e','k','s'}; //d
Because you don't specify the array's size, the compiler will, through the initialization syntax, determine the size of the array. However, the way you chose to initialize a3, it will still only provide 5 bytes of length.
The reason for that is that your initialization just is an initialization list, and not a "string". Therefore, your subsequent call to printf also is undefined behaviour, and it is just luck that at the position a3[5] there seems to be a 0 in your case.
Effectively, both examples have the very same error.
You could have it different thus:
char a3[] = "geeks";
Using a string literal for initialization of the array with unspecified size will cause the compiler to allocate enough memory to hold the string and the additional NUL-terminator, and sizeof (a3) will now yield 6.
"geeks" here is a string literal in C.
When you define "geeks" the compiler automatically adds the NULL character to the end. This makes it 6 characters long.
But you are assigning it to char a[5]. This will cause undefined behaviour.
As mentioned by #DavidBowling, in this case the following condition applies
(Section 6.7.8.14) C99 standard.
An array of character type may be initialized by a character string literal, optionally enclosed in braces. Successive characters of the character string literal (including the terminating null character if there is room or if the array is of unknown size) initialize the elements of the array
the elements "geeks" will be copied into the array 'a' but the NULL character will not be copied.
So in this case when you try to print the array, it will continue printing until it encounters a \0 in the memory.
From the further print statements it is seen that a[5] has the value V. Presumably the next byte on your system is \0 and the array print stops.
So, in your system, at that instance, "geeksV" is printed.

C character array and its length

I am studying now C with "C Programming Absolute Beginner's Guide" (3rd Edition) and there was written that all character arrays should have a size equal to the string length + 1 (which is string-termination zero length). But this code:
#include <stdio.h>
main()
{
char name[4] = "Givi";
printf("%s\n",name);
return 0;
}
outputs Givi and not Giv. Array size is 4 and in that case it should output Giv, because 4 (string length) + 1 (string-termination zero character length) = 5, and the character array size is only 4.
Why does my code output Givi and not Giv?
I am using MinGW 4.9.2 SEH for compilation.
You are hitting what is considered to be undefined behavior. It's working now, but due to chance, not correctness.
In your case, it's because the memory in your program is probably all zeroed out at the beginning. So even though your string is not terminated properly, it just so happens that the memory right after it is zero, so printf knows when to stop.
+-----------------------+
|G|i|v|i|\0|\0|... |
+-----------------------+
| your | rest of |
| stuff | memory (stack)|
+-----------------------+
Other languages, such as Java, have safeguards against this sort of situations. Languages like C, however, do less hand holding, which, on the one hand, allows more flexibility, but on the other, give you much, much more ways to shoot you in the foot with subtle issues such as this one. In other words, if your code compiles, that doesn't mean it's correct and it won't blow up now, in 5 minutes or in 5 years.
In real life, this is almost never the case, and your string might end up getting stored next to other things, which would always end up getting printed out together with your string. You never want this. Situations like this might lead to crashes, exploits and leaked confidential information.
See the following diagram for an example. Imagine you're working on a web server and the string "secret"--a user's password or key is stored right next to your harmless string:
+-----------------------+
|G|i|v|i|s|e|c|r|e|t |
+-----------------------+
| your | rest of |
| stuff | memory (stack)|
+-----------------------+
Every time you would output what you would think is "Givi", you'd end up printing out the secret string, which is not what you want.
The byte after the last character always has to be 0, otherwise printf would not know when the string is terminanted and would try to access bytes (or chars) while they are not 0.
As Andrei said, apparently it just happened, that the compiler put at least one byte with the value 0 after your string data, so printf recognized the end of the string.
This can vary from compiler to compiler and thus is undefined behaviour.
There could, for instance, be a chance to have printf accessing an address, which your program is not allowed to. This would result in a crash.
In C text strings are stored as zero terminated arrays of characters. This means that the end of a text string is indicated by a special character, a numeric value of zero (0), to indicate the end of the string.
So the array of text characters to be used to store a C text string must include an array element for each of the characters as well as an additional array element for the end of string.
All of the C text string functions (strcpy(), strcmp(), strcat(), etc.) all expect that the end of a text string is indicated by a value of zero. This includes the printf() family of functions that print or output text to the screen or to a file. Since these functions depend on seeing a zero value to terminate the string, one source of errors when using C text strings is copying too many characters due to a missing zero terminator or copying a long text string into a smaller buffer. This type of error is known as a buffer overflow error.
The C compiler will perform some types of adjustments for you automatically. For instance:
char *pText = "four"; // pointer to a text string constant, compiler automatically adds zero to an additional array element for the constant "four"
char text[] = "four"; // compiler creates a array with 5 elements and puts the characters four in the first four array elements, a value of 0 in the fifth
char text[5] = "four"; // programmer creates array of 5 elements, compiler puts the characters four in the first four array elements, a value of 0 in the fifth
In the example you provided a good C compiler should issue at the minimum a warning and probably an error. However it looks like your compiler is truncating the string to the array size and is not adding the additional zero string terminator. And you are getting lucky in that there is a zero value after the end of the string. I suppose there is also the possibility that the C compiler is adding an additional array element anyway but that would seem unlikely.
What your book states is basically right, but there is missing the phrase "at least". The array can very well be larger.
You already stated the reason for the min length requirement. So what does that tell you about the example? It is crap!
What it exhibits is called undefined behaviour (UB) and might result in daemons flying out your nose for the printf() - not the initializer. It is just not covered by the C standard (well ,the standard actually says this is UB), so the compiler (and your libraries) are not expected to behave correctly.
For such cases, no terminator will be appended explicitly, so the string is not properly terminated when passed to `printf()".
Reason this does not produce an error is likely some legacy code which did exploit this to safe some bytes of memory. So, instead of reporting an error that the implicit trailing '\0' terminator does not fit, it simply does not append it. Silently truncating the string literal would also be a bad idea.
The following line:
char name[4] = "Givi";
May give warning like:
string for array of chars is too long
Because the behavior is Undefined, still compiler may pass it. But if you debug, you will see:
name[0] 'G'
name[1] 'i'
name[2] 'V'
name[3] '\0'
And so the output is
Giv
Not Give as you mentioned in the question!
I'm using GCC compiler.
But if you write something like this:
char name[4] = "Giv";
Compiles fine! And output is
Giv

Strings without a '\0' char?

If by mistake,I define a char array with no \0 as its last character, what happens then?
I'm asking this because I noticed that if I try to iterate through the array with while(cnt!='\0'), where cnt is an int variable used as an index to the array, and simultaneously print the cnt values to monitor what's happening the iteration stops at the last character +2.The extra characters are of course random but I can't get it why it has to stop after 2.Does the compiler automatically inserts a \0 character? Links to relevant documentation would be appreciated.
To make it clear I give an example. Let's say that the array str contains the word doh(with no '\0'). Printing the cnt variable at every loop would give me this:
doh+
or doh^
and so on.
EDIT (undefined behaviour)
Accessing array elements outside of the array boundaries is undefined behaviour.
Calling string functions with anything other than a C string is undefined behaviour.
Don't do it!
A C string is a sequence of bytes terminated by and including a '\0' (NUL terminator). All the bytes must belong to the same object.
Anyway, what you see is a coincidence!
But it might happen like this
,------------------ garbage
| ,---------------- str[cnt] (when cnt == 4, no bounds-checking)
memory ----> [...|d|o|h|*|0|0|0|4|...]
| | \_____/ -------- cnt (big-endian, properly 4-byte aligned)
\___/ ------------------ str
If you define a char array without the terminating \0 (called a "null terminator"), then your string, well, won't have that terminator. You would do that like so:
char strings[] = {'h', 'e', 'l', 'l', 'o'};
The compiler never automatically inserts a null terminator in this case. The fact that your code stops after "+2" is a coincidence; it could just as easily stopped at +50 or anywhere else, depending on whether there happened to be \0 character in the memory following your string.
If you define a string as:
char strings[] = "hello";
Then that will indeed be null-terminated. When you use quotation marks like that in C, then even though you can't physically see it in the text editor, there is a null terminator at the end of the string.
There are some C string-related functions that will automatically append a null-terminator. This isn't something the compiler does, but part of the function's specification itself. For example, strncat(), which concatenates one string to another, will add the null terminator at the end.
However, if one of the strings you use doesn't already have that terminator, then that function will not know where the string ends and you'll end up with garbage values (or a segmentation fault.)
In C language the term string refers to a zero-terminated array of characters. So, pedantically speaking there's no such thing as "strings without a '\0' char". If it is not zero-terminated, it is not a string.
Now, there's nothing wrong with having a mere array of characters without any zeros in it, as long as you understand that it is not a string. If you ever attempt to work with such character array as if it is a string, the behavior of your program is undefined. Anything can happen. It might appear to "work" for some magical reasons. Or it might crash all the time. It doesn't really matter what such a program will actually do, since if the behavior is undefined, the program is useless.
This would happen if, by coincidence, the byte at *(str + 5) is 0 (as a number, not ASCII)
As far as most string-handling functions are concerned, strings always stop at a '\0' character. If you miss this null-terminator somewhere, one of three things will usually happen:
Your program will continue reading past the end of the string until it finds a '\0' that just happened to be there. There are several ways for such a character to be there, but none of them is usually predictable beforehand: it could be part of another variable, part of the executable code or even part of a larger string that was previously stored in the same buffer. Of course by the time that happens, the program may have processed a significant amount of garbage. If you see lots of garbage produced by a printf(), an unterminated string is a common cause.
Your program will continue reading past the end of the string until it tries to read an address outside its address space, causing a memory error (e.g. the dreaded "Segmentation fault" in Linux systems).
Your program will run out of space when copying over the string and will, again, cause a memory error.
And, no, the C compiler will not normally do anything but what you specify in your program - for example it won't terminate a string on its own. This is what makes C so powerful and also so hard to code for.
I bet that an int is defined just after your string and that this int takes only small values such that at least one byte is 0.

C String Null Zero?

I have a basic C programming question, here is the situation. If I am creating a character array and if I wanted to treat that array as a string using the %s conversion code do I have to include a null zero. Example:
char name[6] = {'a','b','c','d','e','f'};
printf("%s",name);
The console output for this is:
abcdef
Notice that there is not a null zero as the last element in the array, yet I am still printing this as a string.
I am new to programming...So I am reading a beginners C book, which states that since I am not using a null zero in the last element I cannot treat it as a string.
This is the same output as above, although I include the null zero.
char name[7] = {'a','b','c','d','e','f','\0'};
printf("%s",name);
You're just being lucky; probably after the end of that array, on the stack, there's a zero, so printf stops reading just after the last character. If your program is very short and that zone of stack is still "unexplored" - i.e. the stack hasn't grown yet up to that point - it's very easy that it's zero, since generally modern OSes give initially zeroed pages to the applications.
More formally: by not having explicitly the NUL terminator, you're going in the land of undefined behavior, which means that anything can happen; such anything may also be that your program works fine, but it's just luck - and it's the worst type of bug, since, if it generally works fine, it's difficult to spot.
TL;DR version: don't do that. Stick to what is guaranteed to work and you won't introduce sneaky bugs in your application.
The output of your fist printf is not predictable specifically because you failed to include the terminating zero character. If it appears to work in your experiment, it is only because by a random chance the next byte in memory happened to be zero and worked as a zero terminator. The chances of this happening depend greatly on where you declare your name array (it is not clear form your example). For a static array the chances might be pretty high, while for a local (automatic) array you'll run into various garbage being printed pretty often.
You must include the null character at the end.
It worked without error because of luck, and luck alone. Try this:
char name[6] = {'a','b','c','d','e','f'};
printf("%s",name);
printf("%d",name[6]);
You'll most probably see that you can read that memory, and that there's a zero in it. But it's sheer luck.
What most likely happened is that there happened to be the value of 0 at memory location name + 6. This is not defined behavior though, you could get different output on a different system or with a different compiler.
Yes. You do. There are a few other ways to do it.
This form of initialization, puts the NUL character in for you automatically.
char name[7] = "abcdef";
printf("%s",name);
Note that I added 1 to the array size, to make space for that NUL.
One can also get away with omitting the size, and letting the compiler figure it out.
char name[] = "abcdef";
printf("%s",name);
Another method is to specify it with a pointer to a char.
char *name = "abcdef";
printf("%s",name);

Resources