Consider the case:
char s1[] = "abc";
s1[3] = 'x';
printf("%s", s1);
As far as I know, printf prints characters until it finds the null character and then stops.
When I overwrite the null character with 'x', why does printf still print the s1 array correctly? How does it find the null character?
Your printf call invokes undefined behaviour because s1 doesn't have a zero (a.k.a. null byte) terminator.
s1 is an array of 4 characters, and overwriting the null byte is not a problem in itself.
After
s1[3] = 'x';
s1 will become:
[a][b][c][x]
But you can't print it as a string. A string in C is, by definition, a sequence of bytes terminated with a null byte. It just happens to work this time but you should never rely on that.
It only means that, by accident, there is a null character in memory right after this array. :)
You can try the following example
char s0[] = "xxx";
char s1[] = "abc";
char s2[] = "yyy";
s1[3] = 'x';
printf("%s",s1);
and see the result.
The printf function prints characters until it encounters a null character.
In your case, printf reads beyond the memory that was allocated for the array, and accessing memory beyond what is allocated is undefined behavior.
In this case the next byte accidentally happened to be a null character.
If it printed "abcx", it means that there was already a null byte just past the end of the array, where s1[4] would be. The value on the stack at that position depends on previous operations. It may always be zero there, but what is more likely is that there is a zero while you are debugging and nothing goes wrong, and then in a release build no zero is put in that position and you end up with a bug that is difficult to track down.
Undefined by the language definition does not mean undefined in an implementation. For example, MS Visual Studio, when compiling in Debug mode, sets memory to predictable values to aid debugging.
When and why will an OS initialise memory to 0xCD, 0xDD, etc. on malloc/free/new/delete?
I have been working with strings in C. While working with ways to declare them and initialize them, I found some weird behavior I don't understand.
#include<stdio.h>
#include<string.h>
int main()
{
char str[5] = "World";
char str1[] = "hello";
char str2[] = {'N','a','m','a','s','t','e'};
char* str3 = "Hi";
printf("%s %zu\n"
"%s %zu\n"
"%s %zu\n"
"%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Sample output:
Worldhello 10
hello 5
Namaste 7
Hi 2
In some cases, the above code makes str contain Worldhello, and the rest are as they were initialized. In some other cases, the above code makes str2 contain Namastehello. It happens with different variables I never concatenated. So, how are they getting combined?
To work with strings, you must allow space for a null character at the end of each string. Where you have char str[5]="World";, you allow only five characters, and the compiler fills them with “World”, but there is no space for a null character after them. Although the string literal "World" includes an automatic null character at its end, you did not provide space for it in the array, so it is not copied.
Where you have char str1[]="hello";, the compiler determines the array size by counting the characters, including the null character at the end of the string literal.
Where you have char str2[]={'N','a','m','a','s','t','e'};, there is no string literal, just a list of individual characters. The compiler determines the array size by counting those. Since there is no null character, it does not provide space for it.
One potential consequence of failing to terminate a string with a null character is that printf will continue reading memory beyond the string and printing characters from the values it finds. When the compiler has placed other character arrays after such an array you are printing, characters from those arrays may appear in the output.
If you allow space for a null character in str and provide a zero value in str2, your program will print strings in an orderly way:
#include <stdio.h>
#include <string.h>
int main(void)
{
char str[6] = "World"; // 5 letters plus a null character.
char str1[] = "hello";
char str2[] = {'N', 'a', 'm', 'a', 's', 't', 'e', 0}; // Include a null.
char *str3 = "Hi";
printf("%s %zu\n%s %zu\n%s %zu\n%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Undefined behavior in non-null-terminated, adjacently-stored C-strings
Why do you get this part:
Worldhello 10
hello 5
...instead of this?
World 5
hello 5
The answer is that printf() prints chars until it hits a null character, which is a binary zero, frequently written as the '\0' char. And the compiler happens to have placed the character array containing hello right after the character array containing World. Since you explicitly forced the size of str to be 5 via str[5], the compiler was unable to fit the automatic null character at the end of the string. So, with hello happening to be (not guaranteed to be) right after World, and printf() printing until it sees a binary zero, it printed World, saw no terminating null char, and continued right on into the hello string right after it. This resulted in it printing Worldhello and stopping only when it reached the terminating null character after hello, which is properly terminated.
This code relies on undefined behavior, which is a bug. It cannot be relied upon. But, that is the explanation for this case.
Run it with gcc on a 64-bit Linux machine online here: Online GDB: undefined behavior in NON null-terminated C strings
@Eric Postpischil has a great answer and provides more insight here.
From the C tag wiki:
This tag should be used with general questions concerning the C language, as defined in the ISO 9899 standard (the latest version, 9899:2018, unless otherwise specified — also tag version-specific requests with c89, c99, c11, etc).
You've asked a "how?" question about something that none of those documents defines, and so the answer is undefined in the context of C. You can only experience this phenomenon through undefined behaviour.
how are they getting combined?
There is no such requirement that any of these variables are "combined" or are immediately located after each other; trying to observe that is undefined behaviour. It may appear to coincidentally work (whatever that means) for you at times on your machine, while failing at other times or using some other machine or compiler, etc. That's purely coincidental and not to be relied upon.
In some cases, the above code assigns str with Worldhello and the rest as they were initialized.
In the context of undefined behaviour, it makes no sense to make claims about how your code functions; as you've already noticed, the functionality is erratic.
I found some weird Behaviour with them.
If you want to prevent erratic behaviour, stop invoking undefined behaviour by accessing arrays out of bounds (i.e. causing strlen to run off the end of an array).
Only two of those variables (str1 and str3) are safe to pass to strlen; for the others, you need to ensure the array contains a null terminator.
Given for example a char *p that points to the first character in "there is so \0ma\0ny \0 \\0 in t\0his stri\0ng !\0\0\0\0",
how would strrchr() find the last occurrence of the null character?
The following questions arise:
=> What condition would it depend on to stop the loop?
=> I think in all cases it will try to access the next memory area to check its condition, at some point bypassing the string boundaries, UB! So is it safe?
Please feel free to correct me if I'm wrong!
It's very simple, as explained in the comments.
The first \0 is the last and the only one in a C string.
So if you write
char *str = "there is so \0ma\0ny \0 \\0 in t\0his stri\0ng !\0\0\0\0";
char *p = strrchr(str, 's');
printf("%s\n", p);
it will print
so
because strrchr will find the 's' in "so", which is the last 's' in the string you gave it. And (to answer your specific question) if you write
p = strrchr(str, '\0');
printf("%d %s\n", (int)(p - str), p+1);
it will print
12 ma
proving that strrchr found the first \0.
It's obvious to you that str is a long string with some embedded \0's in it. But, in C, there is no such thing as a "string with embedded \0's in it". It is impossible, by definition, for a C string to contain an embedded \0. The first \0, by definition, ends the string.
One more point. You had mentioned that if you were to "access the next memory area", you would be "at some point bypassing the string boundaries, UB!" And you're right. In my answer, I flirted with danger when I said
p = strrchr(str, '\0');
printf("%d %s\n", (int)(p - str), p+1);
Here, p points to what strrchr thinks is the end of the string, so when I compute p+1 and try to print it using %s, if we don't know better it looks like I've indeed strayed into Undefined Behavior. In this case it's safe, of course, because we know exactly what's beyond the first \0. But if I were to write
char *str2 = "hello";
p = strrchr(str2, '\0');
printf("%s\n", p+1); /* WRONG */
then I'd definitely be well over the edge.
There is a difference between "a string", "an array of characters" and "a char* pointer".
A C String is a number of characters terminated by a null character.
An array of characters is a defined number of characters.
A char* pointer is technically a pointer to a single character, but often used to mark a point in a C style string.
You say you have a pointer to a character (char*p) and the value of *p is 't', but you believe that *p is the first character of a C style string
"there is so \0ma\0ny \0 \\0 in t\0his stri\0ng !\0\0\0\0".
As others have said, because you said this is a C style string and you don't know the length of it then the first null after p will mark the end of the string.
If this was a character array char str[40], then you could find the last null by looping from the end of the array towards the start, for (i=39; i>=0; i--), BUT you don't know the length, so that won't work.
Hope that helps, and please excuse me if I have strayed into C++, it's 25 years since I did C :)
In the case you present, you can never know if the null character you've found is the last one since you have no guarantee for the end of the string. As it is a c-string, it is guaranteed that the string ends with a '\0', but if you decide to go beyond that, you can't know if the memory you're accessing is yours. Accessing memory out of an array has undefined behaviour as you can either be accessing just the next object that is in memory that is yours or you could touch memory that is unallocated, but its block still belongs to your process, or you can try to touch a segment that is not yours at all. And only the third one will cause a SIGSEGV. You can see this question to check for segmentation fault without crashing your program, but your string could have ended way before you can catch it that way.
There is a reason for strings to have an ending character. If you insist on having \0 in multiple places in your string, you can terminate with another character instead, but note that all library functions will still consider the first \0 to be the end of the string.
It is considered a bad practice and a very bad thing to have multiple \0 in your strings so if you can, avoid it.
As we know a string terminates with '\0'.
That is how the compiler knows the string has ended, and it guards against garbage values.
But how does an array terminate?
If '\0' were used, it would be taken as 0, a valid integer.
So how does the compiler know the array has ended?
C does not perform bounds checking on arrays. That's part of what makes it fast. However that also means it's up to you to ensure you don't read or write past the end of an array. So the language will allow you to do something like this:
int arr[5];
arr[10] = 4;
But if you do, you invoke undefined behavior. So you need to keep track of how large an array is yourself and ensure you don't go past the end.
Note that this also applies to a character array, which can be treated as a string if it contains a sequence of characters terminated by a null byte. So this is a string:
char str[10] = "hello";
And so is this:
char str[5] = { 'h', 'i', 0, 0, 0 };
But this is not:
char str[5] = "hello"; // no space for the null terminator.
C doesn't provide any protections or guarantees to you about 'knowing the array is ended.' That's on you as the programmer to keep in mind in order to avoid accessing memory outside your array.
The C language does not have a native string type. In C, strings are one-dimensional arrays of characters terminated by a null character '\0'.
From the C Standard, 7.1.1p1:
A string is a contiguous sequence of characters terminated by and including the first null character. The term multibyte string is sometimes used instead to emphasize special processing given to multibyte characters contained in the string or to avoid confusion with a wide string. A pointer to a string is a pointer to its initial (lowest addressed) character. The length of a string is the number of bytes preceding the null character and the value of a string is the sequence of the values of the contained characters, in order.
A string is a special case of a character array: one that is terminated by a null character '\0'. All the standard library string functions read their input string based on this rule, i.e. they read until the first null character.
The null character '\0' has no special significance in arrays of any type other than character arrays in C.
So, apart from strings, for all other types of array the programmer is supposed to explicitly keep track of the number of elements in the array.
Also note that the first null character ('\0') indicates the end of the string, but it does not stop you from reading beyond it within the bounds of the array.
Consider this example:
#include <stdio.h>
int main(void) {
char str[5] = {'H', 'i', '\0', 'z'};
printf ("%s\n", str);
printf ("%c\n", str[3]);
return 0;
}
When you print the string
printf ("%s\n", str);
the output you will get is - Hi
because with the %s format specifier, printf() writes every byte up to but not including the first null character. You can still print the 4th element of the array, since it is within the bounds of the char array str even though it lies beyond the first '\0' character:
printf ("%c\n", str[3]);
the output you will get is - z
Additional:
Trying to access an array beyond its size leads to undefined behavior, which means the program may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
It’s just a matter of convention. If you wanted to, you could totally write code that handled array termination (for arrays of any type) via some sentinel value. Here’s an example that does just that, arbitrarily using -1 as the sentinel:
int length(int arr[]) {
int i;
for (i = 0; arr[i] != -1; i++) {}
return i;
}
However, this is obviously utterly impractical: You couldn’t use -1 in the array any longer.
By contrast, for C strings the sentinel value '\0' is less problematic because it’s expected that normal text won’t contain this character. This assumption is kind of valid. But even so there are obviously many strings which do contain '\0' as a valid character, and null-termination is therefore by no means universal.
One very common alternative is to store strings in a struct that looks something like this:
struct string {
unsigned int length;
char *buffer;
};
That is, we explicitly store a length alongside a buffer. This buffer isn’t null-terminated (although in practice it often has an additional terminal '\0' byte for compatibility with C functions).
Anyway, the answer boils down to: For C strings, null termination is a convenient convention. But it is only a convention, enforced by the C string functions (and by the C string literal syntax). You could use a similar convention for other array types but it would be prohibitively impractical. This is why other conventions developed for arrays. Notably, most functions that deal with arrays expect both an array and a length parameter. This length parameter determines where the array terminates.
My code is crashing because of a lack of the char '\0' at the end of some strings.
It's pretty clear to me why we have to use this termination char. My question is,
is there a problem adding a potential 2nd null character to a character array - to solve string problems?
I think it's cheaper just add a '\0' to every string than verify if it needs and then add it, but I don't know if it's a good thing to do.
is there a problem to have this char ('\0') twice at the end of a string?
This question lacks clarity as "string" means different things to people.
Let us use the C specification definition as this is a C post.
A string is a contiguous sequence of characters terminated by and including the first null character. C11 §7.1.1 1
So a string cannot have 2 null characters, as the string ends upon reaching its first one. @Michael Walz
Instead, re-parse to "is there a problem adding a potential 2nd null character to a character array - to solve string problems?"
A problem with attempting to add a null character to a string is confusion. The str...() functions work with C strings as defined above.
// If str1 was not a string, strcpy(str1, anything) would be undefined behavior.
strcpy(str1, "\0"); // no change to str1
char str2[] = "abc";
str2[strlen(str2)] = '\0'; // OK, but it only re-assigns a \0 to the existing \0
// attempt to add another \0
str2[strlen(str2)+1] = '\0'; // Bad: assigning outside `str2[]` as the array is too small
char str3[10] = "abc";
str3[strlen(str3)+1] = '\0'; // OK, in this case
puts(str3); // Adding that \0 served no purpose
As many have commented, adding a spare '\0' does not directly address the code's fundamental problem. @Haris @Malcolm McLean
That unposted code is the real issue that needs solving (@Yunnosch), and not by attempting to append a second '\0'.
I think it's cheaper just add a '\0' to every string than verify if it needs and then add it, but I don't know if it's a good thing to do.
Where would you add it? Let's assume we've done something like this:
char *p = malloc(32);
Now, if we know the allocated length, we could put a '\0' as the last character of the allocated area, as in p[31] = '\0'. But we don't know how long the contents of the string are supposed to be. If there's supposed to be just foobar, then there'd still be 25 bytes of garbage, which might cause other issues if processed or printed.
Let alone the fact that if all you have is the pointer to the string, it's hard to know the length of the allocated area.
Probably better to fix the places where you build the strings to do it correctly.
Having a second '\0' is not a problem, as long as you have not gone out of bounds of that char array.
You do have to understand that having '\0' twice means that any string operation would not even know that there is a second '\0'. It will just read until the first '\0' and be done with it. For string functions, the first '\0' is the null terminating character, and there should not be anything after that.
I need to define the characters in an array and print the string...But it always prints as string7 (in this case, test7)...What am I doing wrong here?
#include <stdio.h>
int main() {
char a[]={'t','e','s','t'};
printf("%s\n",a);
return 0;
}
Why this behavior?
Because you did not \0 terminate your array, so what you get is Undefined behavior.
What possibly happens behind the scenes ?
The printf tries to print the string till it encounters a \0 and in your case the string was never \0 terminated so it prints randomly till it encounters a \0.
Note that reading beyond the bounds of allocated memory is Undefined behavior so technically this is a UB.
What you need to do to solve the problem?
You need:
char a[]={'t','e','s','t','\0'};
or
char a[]="test";
Because your "string", or char[], is not null-terminated (i.e. terminated by \0).
Then, printf("%s", a); will attempt to print every character starting from the start of a and keep printing until it sees a \0.
That \0 is outside your array, and depends on the initial state of the memory of your program, over which you have pretty much no control.
To fix this, use
char a[]={'t','e','s','t','\0'};
The string you are printing must be null terminated, so your string declaration should be
char a[]={'t','e','s','t', '\0'};