Is strlen on a string with uninitialized values undefined behavior?

Is strlen on a string with uninitialized values undefined behavior? - c

strlen returns the number of characters that precede the terminating null character. An implementation of strlen might look like this:
size_t strlen(const char * str)
{
const char *s;
for (s = str; *s; ++s) {}
return(s - str);
}
This particular implementation dereferences s, where s may contain indeterminate values. It's equivalent to this:
int a;
int* p = &a;
*p;
So for example if one were to do this (which causes strlen to give an incorrect output):
char buffer[10];
buffer[9] = '\0';
strlen(buffer);
Is it undefined behavior?

Calling the standard function strlen causes undefined behaviour. DR 451 clarifies this:
library functions will exhibit undefined behavior when used on indeterminate values
For a more in-depth discussion see this thread.

The behavior of the variant that you are showing is well defined under these circumstances.
The bytes of the uninitialized array have all indeterminate values, with exception of the 10th element that you set to 0.
Accessing an indeterminate value would only be UB if the address of the underlying object would be never taken or if the value is a trap for the corresponding type.
Since this is an array and access to array elements is through pointer arithmetic, the first case is not relevant, here.
Any char value can be accessed without UB, the clauses about trap representations in the standard explicitly exclude all character types from that.
Thus the values that you are dealing with are simply "unspecified".
Reading unspecified values may according to some members of the C standards committee give different results each time, what some call a "whobly" state or so. This property is not relevant, here, since your function reads any such value at most once.
So your access to the array elements gives you any arbitrary but valid char value.
You are sure that your for loop stops at latest at position 9, so you will not overrun your array.
So no "bad" things beyond the visible may happen if you use your specific version of the function. But having a function call that produces unspecified results is certainly nothing you want to see in real code. Something like this here leads to very subtle bugs, and you should avoid it by all means.

No, it's not undefined behavior. Your strlen function will stop before the end of the buffer. If your strlen function referenced buffer[10], then, yes that is undefined.
It certainly will be unexpected behavior, since most of buffer contains random data. "Undefined" is special word for people writing language standards. It means that anything could happen, including memory faults or exiting the program. By unexpected, I mean that it sure not what the programmer wanted to happen. On some runs, the result of strlen could be 3 or it could be 10.

Yes, it's undefined behaviour. From the draft C11 standard, §J.2 "Undefined behavior":
The behavior is undefined in the following circumstances:
...
The value of an object with automatic storage duration is used while it is
indeterminate.

Related

Executable not running

I have the following code:
# include<stdio.h>
# include<string.h>
# define M 5
void mycopy(char* text)
{
char buffer[M];
strcpy(buffer, text);
}
int main()
{
char *name = "Kshitij";
int i = 0;
mycopy(name);
printf("i = %d", i);
return 0;
}
This code compiles in GCC on Apple LLVM version 8.0.0 (clang-800.0.42.1).
However, when I try to run the corresponding executable, I get a process abort output such as:
[1] PID abort ./executable.out
I understand that an error is expected since the size of buffer array is less than the length of text argument being passed to it in this case. However, I am unable to understand to grasp the concept behind this behavior. Why isn't a compile-time error stating the reason raised by the compiler here ?

The C11 standard says the following in 7.1.4. Use of library functions:
If a function argument is described as being an array, *the pointer actually passed to the function shall have a value such that all address computations and accesses to objects (that would be valid if the pointer did point to the first element of such an array) are in fact valid.
And of strcpy
2 The strcpy function copies the string pointed to by s2 (including the terminating null character) into the array pointed to by s1.
And in Appendix J.2 Undefined behavior:
1 The behavior is undefined in the following circumstances:
The pointer passed to a library function array parameter does not have a value such that all address computations and object accesses are valid (7.1.4).
Since the behaviour is undefined, according to standard, anything can happen. A compiler is actually allowed to do compile-time bounds checking, and the compilation might very well be aborted if it can be deduced that the program always writes out of bounds. Or the implementation can support range checking and abort with clear diagnostics. Or the strcpy might just copy the 4 first characters and add a terminating null, or copy 42 into the target string instead - and all these implementations would be standard-conforming.

Char and strcpy in C

I came across a part of question in which, I am getting an output, but I need a explanation why it is true and does work?
char arr[4];
strcpy(arr,"This is a link");
printf("%s",arr);
When I compile and execute, I get the following output.
Output:
This is a link

The short answer why it worked (that time) is -- you got lucky. Writing beyond the end of an array is undefined behavior. Where undefined behavior is just that, undefined, it could just a easily cause a segmentation fault as it did produce output. (though generally, stack corruption is the result)
When handling character arrays in C, you are responsible to insure you have allocated sufficient storage. When you intend to use the array as a character string, you also must allocate sufficient storage for each character +1 for the nul-terminating character at the end (which is the very definition of a nul-terminated string in C).
Why did it work? Generally, when you request say char arr[4]; the compiler is only guaranteeing that you have 4-bytes allocated for arr. However, depending on the compiler, the alignment, etc. the compiler may actually allocate whatever it uses as a minimum allocation unit to arr. Meaning that while you have only requested 4-bytes and are only guaranteed to have 4-usable-bytes, the compiler may have actually set aside 8, 16, 32, 64, or 128, etc-bytes.
Or, again, you were just lucky that arr was the last allocation requested and nothing yet has requested or written to the memory address starting at byte-5 following arr in memory.
The point being, you requested 4-bytes and are only guaranteed to have 4-bytes available. Yes it may work in that one printf before anything else takes place in your code, but your code is wholly unreliable and you are playing Russian-Roulette with stack corruption (if it has not already taken place).
In C, the responsibility falls to you to insure your code, storage and memory use is all well-defined and that you do not wander off into the realm of undefined, because if you do, all bets are off, and your code isn't worth the bytes it is stored in.
How could you make your code well-defined? Appropriately limit and validate each required step in your code. For your snippet, you could use strncpy instead of strcpy and then affirmatively nul-terminate arr before calling printf, e.g.
char arr[4] = ""; /* initialize all values */
strncpy(arr,"This is a link", sizeof arr); /* limit copy to bytes available */
arr[sizeof arr - 1] = 0; /* affirmatively nul-terminate */
printf ("%s\n",arr);
Now, you can rely on the contents of arr throughout the remainder of your code.

Your code has some memory issues (buffer overrun) . The function strcpy copies bytes until the null character. The function printf prints until the null character.

There is no guarantee on the behavior of this piece of code.
It's just like: you told me "I'll pick you up at 5:00 p.m." and when you came I would be there(guarantee). But I can't guarantee whether I had grabbed you a cup of coffee or not, because you didn't told me you want one. Maybe I'm very nice and bought two cups of coffee, or maybe I'm a cheapskate and just bought one for myself.

It may work. It may not. It may fail immediately and obviously. It may fail at some arbitrary future time and in subtle ways that will drive you insane.
That is the often-insidious nature of undefined behaviour. Don't do it.
If it works at all, it's totally by accident and in no way guaranteed. It's possible that you're overwriting stuff on the stack or in other memory (depending on the implementation and how/where the actual variable str is defined(a)) but that the memory being overwritten is not used after that point (given the simple nature of the code).
That possibility of it working accidentally in no way makes it a good idea.
For the language lawyers among us, section J.2 (instances of undefined behaviour) of C11 clearly states:
An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]).
That informative section references 6.5.6, which is normative, and which states when discussing pointer/integer addition (of which a[b] is an example):
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
(a) For example, on my system, declaring the variable inside main causes the program to crash because the buffer overflow trashes the return address on the stack.
However, if I put the declaration at file level (outside of main), it seems to run just fine, printing the message then exiting the program.
But I assure you that's only because the memory you've trashed is not important for the continuation of the program in this case. It will almost certainly be important in anything more substantial than this example.

your code will always work as long as the printf is placed just after strcpy. But it is wrong coding
Try following and it won't work
int j;
char arr[4];
int i;
strcpy(arr,"This is a link");
i=0;
j=0;
printf("%s",arr);
To understand why it is so you must understand the idea of stack. All local variables are allocated on stack. Hence in your code, program control has allocated 4 bytes for "arr" and when you copy a string which is larger than 4 bytes then you are overwriting/corrupting some other memory. But as you accessed "arr" just after strcpy hence the area you have overwritten which may belong to some other variables still not updated by program that's why your printf works fine. But as I suggested in example code where other variables are updated which fall into the memory region you have overwritten, you won't get correct (? or more appropriate is desired) output
Your code is working also because stack grows downwards if it would have been other way then also you had not get desired output

What happens in C when you print a declared, but unassigned variable?

Why does the following snippet cause random numbers to print to the screen with printf, but putchar always outputs 1?
#include <stdio.h>
int main() {
char c;
printf("%d\n", c );
putchar(c);
}

According to C99 standard, this is undefined behavior. Let's see why:
Section 6.7.8.9 says that
If an object that has automatic storage duration is not initialized explicitly, its value is indeterminate.
This applies to your variable c, because it has automatic storage duration, and it is not initialized explicitly.
Section J.2 says that
The behavior is undefined in the following circumstances:
...
The value of an object with automatic storage duration is used while it is
indeterminate
This applies to your code as well, because you read c when you pass it as a parameter to both printf and putchar, and the value of c is still indeterminate, because it has not been assigned.

1) The c variable first has a random value(default/garbage value) to itself as you declared-but-did-not-initialize your char c to any defined letter or value of ur interest(character).
2) Next you tried to printf the %d(digit/decimal/numerical value) of the char c, so now it is giving you a converted value of the garbage which was earlier assigned to c when you declared the char c in the first place.
3) Finally you tried to use putchar(c), which again behaves similarly because your char c is uninitialized and is still being read thereby re-trying to manage with an undetermined value to be printed onto the screen. (since the same un-initialized character variable c is being passed to both kind of printing as a parameter).
Yes these 3 statements are a bit clumsy to understand but they are as layman as it can get to help speed-up some understanding regarding this query of yours.
Pay attention to the 1st comment response to your question by #bluemoon. Those 3 words alone litterally have a huge amount of sensibility and meaningfull-ness to them, to a point that it also tells you what you have done erroneous in your own code(your actions)."UNDEFINED"(try relating the same with UNINITIALIZED).

Can unverified scanf call cause an undefined behavior?

Does below snippet invoke undefined behavior in case of an error?
#include <stdio.h>
int main() {
int i; /* Indeterminate */
if (scanf("%d", &i) == 1) /* Initialize */
printf("%d\n", i); /* Success! Print read value */
else
printf("%d\n", i); /* Input failed! Is printing `i` UB or not? */
return 0;
}
What if scanf fails, is an uninitialized variable accessed?
EDIT
Moreover what if I replace scanf("%d", &i) with my_initializer(&i):
int my_initializer(int *pi)
{
double room_temp_degc = get_room_temp_in_degc();
if(room_temp_degc < 12.0) {
// Cool
*pi = 42;
return 1;
} else {
return 0;
}
}

In C90, this is UB.
For C99 and C11, technically, it isn't, but the output is indeterminate. It's even possible, that another printf directly following will print a different value; uninitialized variables may appear to change without explicit action of the programme. Note, however, that an uninitialized variable can only be read if its address has been taken*) (which is done here in the scanf call). From n1570 6.3.2.1 p2:
If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.
In theory, this would allow for something like
int n;
&n;
printf("%d\n", n);
But compilers may still reorder statements or allocate registers based on the assumption that the first read doesn't occur before the first write and ignore the side-effect free &n; statement.
For any practical purpose, never read uninitialized values. First, there is no reason why you should want to; second, even an unspecified value allows surprsing optimizations: Some thought a "garbage" value could be used to gather some entropy for random numbers which lead to really bad bugs in cryptographic software, see e.g. Xi Wang's blog entry. For an even wierder example, where an uninitialized value is odd after multiplication with 2, see e.g. this blog (yes, indeterminate times 2 is simply indeterminate, not even and only otherwise indeterminate).
See also DR 260.
*) The quoted paragraph is missing in C99, but this should be considered a defect in the standard, not a change in C11. C99 makes it technically defined (for machines without trap representations) to read any uninitialized variable (though their values are still indeterminate and may still appear to change randomly, it's just not UB).
With DR 338, this was corrected, but not before C11. It was added to allow NaT on a Titanium platform (which exists for registers, but not for values in memory), even for integer types without trap representations. I don't know, if the &n in the code above has any effect on such a platform (by a strict reading of C11, it should, but I wouldn't rely on it).

Evaluating the condition containing unitialized pointer - UB, but can it crash?

Somewhere on the forums I encountered this:
Any attempt to evaluate an uninitialized pointer variable
invokes undefined behavior. For example:
int *ptr; /* uninitialized */
if (ptr == NULL) ...; /* undefined behavior */
What is meant here?
Is it meant that if I ONLY write:
if(ptr==NULL){int t;};
this statement is already UB?
Why? I am not dereferencing the pointer right?
(I noticed there maybe terminology issue, by UB in this case, I referred to: will my code crash JUST due to the if check?)

Using unitialized variables invokes undefined behavior. It doesn't matter whether it is pointer or not.
int i;
int j = 7 * i;
is undefined as well. Note that "undefined" means that anything can happen, including a possibility that it will work as expected.
In your case:
int *ptr;
if (ptr == NULL) { int i = 0; /* this line does nothing at all */ }
ptr might contain anything, it can be some random trash, but it can be NULL too. This code will most likely not crash since you are just comparing value of ptr to NULL. We don't know if the execution enters the condition's body or not, we can't be even sure that some value will be successfully read - and therefore, the behavior is undefined.

your pointer is not initialized. Your statement would be the same as:
int a;
if (a == 3){int t;}
since a is not initialized; its value can be anything so you have undefined behavior. It doesn't matter whether you dereference your pointer or not. If you would do that, you would get a segfault

The C99 draft standard says it is undefined clearly in Annex J.2 Undefined behavior:
The value of an object with automatic storage duration is used while it is
indeterminate (6.2.4, 6.7.8, 6.8).
and the normative text has an example that also says the same thing in section 6.5.2.5 Compound literals paragraph 17 which says:
Note that if an iteration statement were used instead of an explicit goto and a labeled statement, the lifetime of the unnamed object would be the body of the loop only, and on entry next time around p would have an indeterminate value, which would result in undefined behavior.
and the draft standard defines undefined behavior as:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
and notes that:
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

As Shafik has pointed out, the C99 standard draft declares any use of unintialized variables with automatic storage duration undefined behaviour. That amazes me, but that's how it is. My rationale for pointer use comes below, but similar reasons must be true for other types as well.
After int *pi; if (pi == NULL){} your prog is allowed to do arbitrary things. In reality, on PCs, nothing will happen. But there are architectures out there which have illegal address values, much like NaN floats, which will cause a hardware trap when they are loaded in a register. These to us modern PC users unheard of architectures are the reason for this provision. Cf. e.g. How does a hardware trap in a three-past-the-end pointer happen even if the pointer is never dereferenced?.

The behavior of this is undefined because of how the stack is used for various function calls. When a function is called the stack grows to make space for variables within the scope of that function, but this memory space is not cleared or zeroed out.
This can be shown to be unpredictable in code like the following:
#include <stdio.h>
void test()
{
int *ptr;
printf("ptr is %p\n", ptr);
}
void another_test()
{
test();
}
int main()
{
test();
test();
another_test();
test();
return 0;
}
This simply calls the test() function multiple times, which just prints where 'ptr' lives in memory. You'd expect maybe to get the same results each time, but as the stack is manipulated the physical location of where 'ptr' is has changed and the data at that address is unknown in advance.
On my machine running this program results in this output:
ptr is 0x400490
ptr is 0x400490
ptr is 0x400575
ptr is 0x400585
To explore this a bit more, consider the possible security implications of using pointers that you have not intentionally set yourself
#include <stdio.h>
void test()
{
int *ptr;
printf("ptr is %p\n", ptr);
}
void something_different()
{
int *not_ptr_or_is_it = (int*)0xdeadbeef;
}
int main()
{
test();
test();
something_different();
test();
return 0;
}
This results in something that is undefined even though it is predictable. It is undefined because on some machines this will work the same and others it might not work at all, it's part of the magic that happens when your C code is converted to machine code
ptr is 0x400490
ptr is 0x400490
ptr is 0xdeadbeef

Some implementations may be designed in such a way that an attempted rvalue conversion of an invalid pointer may case arbitrary behavior. Other implementations are designed in such a way that an attempt to compare any pointer object with null will never do anything other than yield 0 or 1.
Most implementations target hardware where pointer comparisons simply compare bits without regard for whether those bits represent valid pointers. The authors of many such implementations have historically considered it so obvious that a pointer comparison on such hardware should never have any side-effect other than to report that pointers are equal or report that they are unequal that they seldom bothered to explicitly document such behavior.
Unfortunately, it has become fashionable for implementations to aggressively "optimize" Undefined Behavior by identifying inputs that would cause a program to invoke UB, assuming such inputs cannot occur, and then eliminating any code that would be irrelevant if such inputs were never received. The "modern" viewpoint is that because the authors of the Standard refrained from requiring side-effect-free comparisons on implementations where such a requirement would
impose significant expense, there's no reason compilers for any platform should guarantee them.

You're not dereferencing the pointer, so you don't end up with a segfault. It will not crash. I don't understand why anyone thinks that comparing two numbers will crash. It's nonsense. So again:
IT WILL NOT CRASH. PERIOD.
But it's still UB. You don't know what memory address the pointer contains. It may or may not be NULL. So your condition if (ptr == NULL) may or may not evaluate to true.
Back to my IT WILL NOT CRASH statement. I've just tested the pointer going from 0 to 0xFFFFFFFF on the 32-bit x86 and ARMv6 platforms. It did not crash.
I've also tested the 0..0xFFFFFFFF and 0xFFFFFFFF00000000..0xFFFFFFFFFFFFFFFF ranges on and amd64 platform. Checking the full range would take a few thousand years I guess.
Again, it did not crash.
I challenge the commenters and downvoters to show a platform and value where it crashes. Until then, I'll probably be able to survive a few negative points.
There is also a SO link to
trap representation
which also indicates that it will not crash.