I got this snippet from some exercises, with the question: what is the output of the following code?
main()
{
char *p = "ayqm";
printf("%c", ++*(p++));
}
My expected answer was z but the actual answer was in fact b. How is that possible?
Later edit: the snippet is taken verbatim from an exercise, which did not focus on the string literal or on syntax issues outside the printf() line.
As posted, the program has multiple problems:
it tries to modify the string literal "ayqm", which is described as undefined behavior in the C Standard.
it uses printf without a proper declaration, again producing undefined behavior.
its output is not terminated with a newline, causing implementation defined behavior.
the prototype for main without a return type is obsolete, no longer supported by the C Standard.
incrementing characters produces implementation-defined results. If the execution character set is ASCII, 'a'+1 does produce 'b', but this is not guaranteed by the C Standard. Indeed, in the EBCDIC character set still used on some mainframe computers, the letters do not form a single contiguous sequence ('a'+1 == 'b' but 'i'+1 != 'j' in this character set).
Here is a corrected version:
#include <stdio.h>
int main(void) {
char str[] = "ayqm";
char *p = str;
printf("%c\n", ++*(p++));
return 0;
}
p is post-incremented, which means the current value of p is used for the * operator and the value of p is incremented before the next sequence point, namely the call to the printf function. The character read through p, 'a' is then incremented, which may or may not produce 'b' depending on the execution character set.
After printf returns to the main function, p points to str[1] and str contains the string "byqm".
Your program has undefined behavior because it tries to modify the string literal "ayqm". Per the standard, attempting to modify a string literal results in undefined behavior; the literal may be stored in read-only storage.
The pointer p is pointing to string literal "ayqm". This expression
printf ("%c", ++*(p++));
ends up attempting to modify the string literal that pointer p is pointing to.
Undefined behavior means the program may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
I have been working with strings in C. While working with ways to declare them and initialize them, I found some weird behavior I don't understand.
#include<stdio.h>
#include<string.h>
int main()
{
char str[5] = "World";
char str1[] = "hello";
char str2[] = {'N','a','m','a','s','t','e'};
char* str3 = "Hi";
printf("%s %zu\n"
"%s %zu\n"
"%s %zu\n"
"%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Sample output:
Worldhello 10
hello 5
Namaste 7
Hi 2
In some cases, the above code makes str contain Worldhello, and the rest are as they were initialized. In some other cases, the above code makes str2 contain Namastehello. It happens with different variables I never concatenated. So, how are they getting combined?
To work with strings, you must allow space for a null character at the end of each string. Where you have char str[5]="World";, you allow only five characters, and the compiler fills them with “World”, but there is no space for a null character after them. Although the string literal "World" includes an automatic null character at its end, you did not provide space for it in the array, so it is not copied.
Where you have char str1[]="hello";, the compiler determines the array size by counting the characters, including the null character at the end of the string literal.
Where you have char str2[]={'N','a','m','a','s','t','e'};, there is no string literal, just a list of individual characters. The compiler determines the array size by counting those. Since there is no null character, it does not provide space for it.
One potential consequence of failing to terminate a string with a null character is that printf will continue reading memory beyond the string and printing characters from the values it finds. When the compiler has placed other character arrays after such an array you are printing, characters from those arrays may appear in the output.
If you allow space for a null character in str and provide a zero value in str2, your program will print strings in an orderly way:
#include <stdio.h>
#include <string.h>
int main(void)
{
char str[6] = "World"; // 5 letters plus a null character.
char str1[] = "hello";
char str2[] = {'N', 'a', 'm', 'a', 's', 't', 'e', 0}; // Include a null.
char *str3 = "Hi";
printf("%s %zu\n%s %zu\n%s %zu\n%s %zu\n",
str, strlen(str),
str1, strlen(str1),
str2, strlen(str2),
str3, strlen(str3));
return 0;
}
Undefined behavior in non-null-terminated, adjacently-stored C-strings
Why do you get this part:
Worldhello 10
hello 5
...instead of this?
World 5
hello 5
The answer is that printf() prints chars until it hits a null character, which is a binary zero, frequently written as the '\0' char. The compiler happens to have placed the character array containing hello right after the character array containing World. Since you explicitly forced the size of str to be 5 via str[5], there was no room for the automatic null character at the end of the string. So, with hello happening to be (not guaranteed to be) right after World, printf() printed World, saw no terminating null char, and continued right on into the hello array. It therefore printed Worldhello and stopped only when it reached the terminating null after hello, which is properly terminated.
This code relies on undefined behavior, which is a bug. It cannot be relied upon. But, that is the explanation for this case.
Run it with gcc on a 64-bit Linux machine online here: Online GDB: undefined behavior in NON null-terminated C strings
@Eric Postpischil has a great answer and provides more insight here.
From the C tag wiki:
This tag should be used with general questions concerning the C language, as defined in the ISO 9899 standard (the latest version, 9899:2018, unless otherwise specified — also tag version-specific requests with c89, c99, c11, etc).
You've asked a "how?" question about something that none of those documents defines, and so the answer is undefined in the context of C. You can only experience this phenomenon through undefined behaviour.
how are they getting combined?
There is no such requirement that any of these variables are "combined" or are immediately located after each other; trying to observe that is undefined behaviour. It may appear to coincidentally work (whatever that means) for you at times on your machine, while failing at other times or using some other machine or compiler, etc. That's purely coincidental and not to be relied upon.
In some cases, the above code assigns str with Worldhello and the rest as they were initiated.
In the context of undefined behaviour, it makes no sense to make claims about how your code functions. As you've already noticed, the functionality is erratic.
I found some weird Behaviour with them.
If you want to prevent erratic behaviour, stop invoking undefined behaviour by accessing arrays out of bounds (i.e. causing strlen to run off the end of an array).
Of those variables, only str1 and str3 are safe to pass to strlen; you need to ensure the array contains a null terminator.
#include <ctype.h>
#include <stdio.h>
int atoi(char *s);
int main()
{
printf("%d\n", atoi("123"));
}
int atoi(char *s)
{
int i;
while (isspace(*s))
s++;
int sign = (*s == '-') ? -1 : 1;
/* same mistake for passing pointer to isdigit, but will not cause CORE DUMP */
// isdigit(s), s++;// this will not lead to core dump
// return -1;
/* */
/* I know s is a pointer, but I don't quite understand why code above will not and code below will */
if (!isdigit(s))
s++;
return -1;
/* code here will cause CORE DUMP instead of a compile-time error */
for (i = 0; *s && isdigit(s); s++)
i = i * 10 + (*s - '0');
return i * sign;
}
I got "Segmentation fault (core dumped)" when I accidentally left out the * operator before 's', and the error confused me.
Why does "(!isdigit(s))" lead to a core dump while "isdigit(s), s++;" does not?
From isdigit [emphasis added]
The behavior is undefined if the value of ch is not representable as unsigned char and is not equal to EOF.
From isdigit [emphasis added]
The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined.
https://godbolt.org/z/PEnc8cW6T
Undefined behaviour means the program may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
All answers so far have failed to point out the actual problem, which is that implicit pointer to integer conversions are not allowed during assignment in C. Details here: "Pointer from integer/integer from pointer without a cast" issues
Specifically C17 6.5.2.2/7
If the expression that denotes the called function has a type that does include a prototype,
the arguments are implicitly converted, as if by assignment, to the types of the
corresponding parameters
Where "as if by assignment" sends us to check the rules of assignment 6.5.16.1, which are quoted in the above link. So isdigit(s) is equivalent to something like this:
char* s;
...
int param_to_isdigit = s; // constraint violation of 6.5.16.1
Here the compiler must issue a diagnostic message. If you didn't spot it or in case you are using a tool chain giving warnings instead of errors, check out What compiler options are recommended for beginners learning C? so that you prevent code like this from compiling, so that you don't have to spend time troubleshooting bugs that the compiler already spotted for you.
Furthermore, the ctype.h functions require that the passed integer must be representable as unsigned char, but that's another story. C17 7.4 Character handling <ctype.h>:
In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF
You are invoking undefined behavior. isdigit() is supposed to receive an int argument, but you pass in a pointer. This is effectively attempting to assign a pointer to an int (xref: Language / Expressions / Assignment operators / Simple assignment, ¶1).
Furthermore, there is a constraint that the argument to isdigit() be representable as an unsigned char or equal to EOF. (xref: Library / Character handling <ctype.h>, ¶1).
As a guess, the isdigit() function may be performing some kind of table lookup, and the input value may cause the function to access memory beyond the table.
Why no segfault from isdigit(s), s++;?
First of all, undefined behavior can manifest itself in a lot of ways, including the program working as intended. That's what undefined means.
But that line is not equivalent to your if statement. It executes isdigit(s), throws away the result, increments s, and also throws away the result of that operation.
However, isdigit does not have side effects, so it's quite probable that the compiler simply removes the call to that function and replaces this line with an unconditional s++. That would explain why it does not segfault. You would have to study the generated assembly to make sure, but it's a possibility.
You can read about the comma operator here What does the comma operator , do?
I wasn't able to repeat this behaviour in MacOS/Darwin, but I was able to in Debian Linux.
To investigate a bit further, I wrote the following program:
#include <ctype.h>
#include <stdio.h>
int main()
{
printf("isalnum('a'): %d\n", isalnum('a'));
printf("isalpha('a'): %d\n", isalpha('a'));
printf("iscntrl('\\n'): %d\n", iscntrl('\n'));
printf("isdigit('1'): %d\n", isdigit('1'));
printf("isgraph('a'): %d\n", isgraph('a'));
printf("islower('a'): %d\n", islower('a'));
printf("isprint('a'): %d\n", isprint('a'));
printf("ispunct('.'): %d\n", ispunct('.'));
printf("isspace(' '): %d\n", isspace(' '));
printf("isupper('A'): %d\n", isupper('A'));
printf("isxdigit('a'): %d\n", isxdigit('a'));
printf("isdigit(0x7fffffff): %d\n", isdigit(0x7fffffff));
return 0;
}
On macOS, this just prints out 1 for every result except the last one, implying that these functions are simply returning the result of a logical comparison.
The results are a bit different in Linux:
isalnum('a'): 8
isalpha('a'): 1024
iscntrl('\n'): 2
isdigit('1'): 2048
isgraph('a'): 32768
islower('a'): 512
isprint('a'): 16384
ispunct('.'): 4
isspace(' '): 8192
isupper('A'): 256
isxdigit('a'): 4096
Segmentation fault
This suggests to me that the library used in Linux fetches a value from a lookup table indexed by the argument and masks it with a bit pattern corresponding to the function called. For example, '1' (ASCII 49) is an alphanumeric character, a digit, a graphic character, a printable character and a hex digit, so entry 49 in this lookup table is probably equal to 8+2048+32768+16384+4096, which is 55304.
The documentation for these functions does mention that the argument must have either the value of an unsigned char (0-255) or EOF (-1), so any value outside this range is causing this table to be read out of bounds, resulting in a segmentation error.
Even though I'm only calling the isdigit() function with an integer argument, the value is outside the range the documentation permits, so the behaviour is undefined as far as the standard is concerned. I still think the library functions should be hardened against this sort of problem.
#include<stdio.h>
int main()
{
char *s = "Abc";
while(*s)
printf("%c", *s++);
return 0;
}
I have seen this (on a site) presented as correct code, but I feel this is undefined behavior.
My reasoning:
Here s stores the address of the string literal Abc. So while traversing through the while loop :
Iteration - 1:
Here *(s++) increments the address stored in s by 1 and returns the non-incremented address (i.e the previous/original value of s). So, no problem everything works fine and Abc is printed.
Iteration - 2:
Now s points to a completely different address (which may be either valid or not). Now when trying to perform while(*s) isn't it undefined behavior ?
Any help would be really appreciated!
No. There's no undefined behaviour here.
*s++ is evaluated as *(s++) because the postfix increment operator has higher precedence than the dereference operator. So the loop simply iterates over the string, prints the bytes, and stops when it sees the null byte.
Now s points to a completely different address (which may be either valid or not). Now when trying to perform while(*s) isn't it undefined behavior ?
No. In the first iteration s points to the address at the char A and at b in the next and at c in the next. And the loop terminates when s reaches the null byte at end of the string (i.e. *s is 0).
Basically, there's no modification of the string literal. The loop is functionally equivalent to:
while(*s) {
printf("%c", *s);
s++;
}
Iteration - 1:
Here *(s++) increments the address stored in s by 1 and returns the non-incremented address (i.e the previous/original value of s). So, no problem everything works fine and Abc is printed.
No, “Abc” is not printed. %c tells printf to expect a character value and print that. It prints a single character, not a string. Initially, s points to the first character of "Abc". s++ increments it to point to the next character.
Iteration - 2:
Now s points to a completely different address (which may be either valid or not). Now when trying to perform while(*s) isn't it undefined behavior ?
In iteration 2, s is pointing to “b”.
You may have been thinking of some char **p for which *p had been set to a pointer to "abc". In that case, incrementing p would change it to point to a different pointer (or to uncontrolled memory), and there would be a problem. That is not the case; for char *s, s points to a single character, and incrementing it adjusts it to point to the next character.
Now s points to a completely different address
Indeed, it is a different but well-defined address: s now references the next char of the string literal. The increment just adds 1 to the pointer.
Because a string literal is nul (zero) terminated, the while loop stops when s references the terminator.
There is no UB.
I was trying out some array related stuff in C.
I did following:
char a2[4] = {'g','e','e','k','s'};
printf("a2:%s,%u\n",a2,sizeof(a2));
printf("a2[4]:%d,%c\n",a2[4],a2[4]);
In this code, it prints:
a2:geek,4
a2[4]:0,
And in this code, it prints:
a2:geek�,4
a2[4]:-1,�
Both were run on the same online compiler, so why is the output different? Is it because the standard defines this case as undefined behavior? If yes, can you point me to the exact section of the standard?
Yes, this is undefined behavior. I don't have a reference to the standard, but %s format is for printing null-terminated strings, and you don't have a null terminator on a2. And when you access a2[4] you're accessing outside the array bounds, another cause of undefined behavior.
Finally, the array initializer also causes undefined behavior, see Is it ok to have excess elements in array initializer?
The presence of excess initializers violates a constraint in C 2018 6.7.9 2:
No initializer shall attempt to provide a value for an object not contained within the entity being initialized.
's' would provide an initial value for a2[4]. Since a2[4] does not exist, it is not contained within a2, and the constraint is violated.
That said, compilers will typically provide a warning then ignore the excess initializers and continue. This is the least of the problems in the code you show and has no effect on the output you see.
After the definition of a2, you print it using %s. %s requires a pointer to the first character in a sequence of characters terminated by a null. However, there is no null character in a2. The resulting behavior is not defined by the C standard. Often, what happens is a program will continue to print characters from memory beyond the array. This is of course not guaranteed and is especially unreliable in the modern high-optimization environment.
Assuming the printf does continue to print characters beyond the array, it appears that, on one system, there happens to be a null character beyond the array, so the printf stops after four characters. When you later print a2[4] (also behavior not defined by the C standard) as an integer (%d) and a character (%c), we see there is indeed a null character there.
On the other system, there is a −1 value in the memory at a2[4], which displays as “�”. After it, there are presumably some number (possibly zero) of non-printing characters and a null character.
Additionally, you print sizeof(a2) using the printf specifier %u. This is incorrect and may have undefined behavior. A proper specifier for the result of sizeof is %zu.
Below is C code that creates char arrays with and without an explicit null character. The results are unexpected with two compilers, and I'm not sure why we even have to add the null character explicitly.
//
// stringBugorNot.c
//
//
//
#include <string.h>
#include <stdio.h>
int main(void)
{
char aString[3] = {'a', 'b','c'};
char bString[4] = {'a', 'b', 'c', '\0'};
printf("\n");
printf("len of a is: %zu\n", strlen(aString));
printf("len of b is: %zu\n", strlen(bString));
printf("\n");
//Portion A
printf("last element of a is: '%c'\n", aString[strlen(aString)]);
printf("last element of b is: '%c'\n", bString[strlen(bString)]);
printf("\n");
//Portion B
printf("last element of a is: '%c'\n", aString[strlen(aString) - 1]);
printf("last element of b is: '%c'\n", bString[strlen(bString) - 1]);
}
Comments
+clang will give a runtime error because out of bounds on "aString".. makes sense
+gcc will not give any error and simply output "nothing" the null as expected. But maybe gcc is smarter and adds the null for me? Is the actual memory size different??
Clang OUTPUT ---->
len of a is: 3
len of b is: 3
bugOrNot.c:16:41: runtime error: index 3 out of bounds for type 'char [3]'
last element of a is: ''
last element of b is: ''
last element of a is: 'c'
last element of b is: 'c'
GCC OUTPUT ---->
len of a is: 9
len of b is: 3
last element of a is: ''
last element of b is: ''
last element of a is: ''
last element of b is: 'c'
Unexpected behavior that you see is called undefined behavior (UB) in the C standard:
Calling strlen on aString is UB because there is no null termination
Dereferencing aString at its undefined index is UB, unless the index is 0, 1, or 2
gcc could insert a null terminator inadvertently by aligning bString at a 4-byte boundary. That doesn't change the fact that it is still UB, though.
When you say
char bString[4] = {'a', 'b', 'c', '\0'};
you have properly constructed a null-terminated string. It is precisely as if you had said
char bString[4] = "abc";
Since this is a proper, null-terminated string, it is meaningful and legal to call strlen(bString), and you will get a result of 3.
When you say
char aString[3] = {'a', 'b','c'};
on the other hand, and as I think you know, you have not constructed a proper null-terminated string. Therefore, it is not legal or meaningful to call strlen(aString) -- formally, we say that the result is undefined, meaning that absolutely anything can happen.
You tried the code with two different compilers, and were surprised to get two different results. This is perfectly normal. (It's perfectly normal to get two different results, and it's perfectly normal to be surprised by this, because it is pretty surprising, the first few times you encounter it.)
It is not the case that one compiler is "smarter" than the other, or that it "guessed" that you were trying to construct a string and so automatically supplied the "missing" \0 for you. It was simply a fluke, a random happenstance. (It is also certainly not the case that one compiler or the other has any kind of a bug. Again, there's no right result here, so a compiler can't be wrong, no matter what it does.)
If you want to work with strings in C, make sure that they're all properly null-terminated. If you should ever happen to accidentally do something stringlike with a non-properly-null-terminated string, don't try to interpret the results, don't assume that they mean anything, and especially don't decide that it's the "right" result that you can therefore depend on. You can't. It's likely to change for no reason, like when you use a different compiler next week, or when your customer uses your program on vital data instead of test data.
In C, a string is a sequence of character values including the nul terminator. That terminator is how the various C library routines know where the end of the string is. If you don't terminate a string properly, library routines like strlen and strcpy and printf with %s will all scan past the end of the string into other memory, resulting in garbled output or runtime errors.
The reason you got different results for the length of a with the two different compilers is that in the clang case, the byte immediately following the last element of a contained 0, whereas in the gcc case the bytes immediately following a did not contain 0.
Strictly speaking, the behavior on passing a non-terminated sequence of characters to the string handling routines is undefined - the language specification places no requirements on the compiler or runtime environment to "do the right thing", whatever that would be. You've basically voided the warranty at that point, and pretty much anything can happen.
Note that the C language specification does not require bounds checks on array accesses - the fact that you got the index out of bounds exception for clang is due to the compiler being extra friendly and going beyond what the language standard actually requires.