Somewhere on the forums I encountered this:
Any attempt to evaluate an uninitialized pointer variable
invokes undefined behavior. For example:
int *ptr; /* uninitialized */
if (ptr == NULL) ...; /* undefined behavior */
What is meant here?
Is it meant that if I ONLY write:
if(ptr==NULL){int t;};
this statement is already UB?
Why? I am not dereferencing the pointer, right?
(I noticed there may be a terminology issue; by UB in this case, I mean: will my code crash just due to the if check?)
Using uninitialized variables invokes undefined behavior. It doesn't matter whether it is a pointer or not.
int i;
int j = 7 * i;
is undefined as well. Note that "undefined" means that anything can happen, including a possibility that it will work as expected.
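The well-defined counterpart simply initializes the variable first; a trivial sketch:
int i = 3;     /* i now holds a determinate value */
int j = 7 * i; /* well defined: j == 21 */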
In your case:
int *ptr;
if (ptr == NULL) { int i = 0; /* this line does nothing at all */ }
ptr might contain anything; it can be some random trash, but it can be NULL too. This code will most likely not crash, since you are just comparing the value of ptr to NULL. But we don't know whether execution enters the condition's body, and we can't even be sure that the value can be read successfully - therefore, the behavior is undefined.
Your pointer is not initialized, so your statement would be the same as:
int a;
if (a == 3){int t;}
since a is not initialized, its value can be anything, so you have undefined behavior. It doesn't matter whether you dereference your pointer or not. If you did dereference it, you might well get a segfault.
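A minimal fix, sketched here, is to give the pointer a determinate value before comparing it:
int *ptr = NULL;   /* explicitly initialized */
if (ptr == NULL) { /* well defined: always true here */
    /* ... */
}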
The C99 draft standard clearly says it is undefined in Annex J.2 Undefined behavior:
The value of an object with automatic storage duration is used while it is
indeterminate (6.2.4, 6.7.8, 6.8).
and the normative text has an example that says the same thing, in section 6.5.2.5 Compound literals paragraph 17:
Note that if an iteration statement were used instead of an explicit goto and a labeled statement, the lifetime of the unnamed object would be the body of the loop only, and on entry next time around p would have an indeterminate value, which would result in undefined behavior.
and the draft standard defines undefined behavior as:
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
and notes that:
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
As Shafik has pointed out, the C99 draft standard declares any use of uninitialized variables with automatic storage duration undefined behaviour. That amazes me, but that's how it is. My rationale for the pointer case comes below, but similar reasons must hold for other types as well.
After int *pi; if (pi == NULL){} your program is allowed to do arbitrary things. In reality, on PCs, nothing will happen. But there are architectures out there which have illegal address values, much like NaN floats, which cause a hardware trap when they are merely loaded into a register. These architectures, unheard of to us modern PC users, are the reason for this provision. Cf. e.g. How does a hardware trap in a three-past-the-end pointer happen even if the pointer is never dereferenced?.
The behavior of this is undefined, and what makes it unpredictable in practice is how the stack is used across function calls. When a function is called, the stack grows to make space for variables within the scope of that function, but this memory is not cleared or zeroed out.
This can be shown to be unpredictable in code like the following:
#include <stdio.h>

void test(void)
{
    int *ptr;                           /* deliberately uninitialized */
    printf("ptr is %p\n", (void *)ptr); /* %p requires a void * argument */
}

void another_test(void)
{
    test();
}

int main(void)
{
    test();
    test();
    another_test();
    test();
    return 0;
}
This simply calls the test() function multiple times; each call just prints the (indeterminate) value that ptr happens to hold. You'd perhaps expect to get the same result each time, but as the stack is manipulated, the slot that ptr occupies gets reused by different callers, so the leftover bits found there are unknown in advance.
On my machine running this program results in this output:
ptr is 0x400490
ptr is 0x400490
ptr is 0x400575
ptr is 0x400585
To explore this a bit more, consider the possible security implications of using pointers that you have not intentionally set yourself:
#include <stdio.h>

void test(void)
{
    int *ptr;                           /* deliberately uninitialized */
    printf("ptr is %p\n", (void *)ptr);
}

void something_different(void)
{
    int *not_ptr_or_is_it = (int *)0xdeadbeef; /* leaves this value on the stack */
    (void)not_ptr_or_is_it;                    /* silence "unused variable" warnings */
}

int main(void)
{
    test();
    test();
    something_different();
    test();
    return 0;
}
This results in something that is undefined even though it is predictable here. It is undefined because on some machines this will behave the same way, while on others it might not work at all; it's part of the magic that happens when your C code is converted to machine code.
ptr is 0x400490
ptr is 0x400490
ptr is 0xdeadbeef
Some implementations may be designed in such a way that an attempted rvalue conversion of an invalid pointer may cause arbitrary behavior. Other implementations are designed in such a way that an attempt to compare any pointer object with null will never do anything other than yield 0 or 1.
Most implementations target hardware where pointer comparisons simply compare bits without regard for whether those bits represent valid pointers. The authors of many such implementations have historically considered it so obvious that a pointer comparison on such hardware should never have any side-effect other than to report that pointers are equal or report that they are unequal that they seldom bothered to explicitly document such behavior.
Unfortunately, it has become fashionable for implementations to aggressively "optimize" Undefined Behavior by identifying inputs that would cause a program to invoke UB, assuming such inputs cannot occur, and then eliminating any code that would be irrelevant if such inputs were never received. The "modern" viewpoint is that because the authors of the Standard refrained from requiring side-effect-free comparisons on implementations where such a requirement would
impose significant expense, there's no reason compilers for any platform should guarantee them.
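To illustrate that viewpoint, here is a hypothetical sketch of what such a compiler is permitted, though by no means required, to do:
#include <stdio.h>

int main(void)
{
    int *p; /* indeterminate value */
    /* Evaluating p invokes UB, so an aggressive optimizer may treat
       this branch as unreachable and fold or delete it entirely,
       instead of emitting a harmless bitwise comparison. */
    if (p == NULL)
        puts("p is null");
    else
        puts("p is not null");
    return 0;
}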
You're not dereferencing the pointer, so you don't end up with a segfault. It will not crash. I don't understand why anyone thinks that comparing two numbers will crash. It's nonsense. So again:
IT WILL NOT CRASH. PERIOD.
But it's still UB. You don't know what memory address the pointer contains. It may or may not be NULL. So your condition if (ptr == NULL) may or may not evaluate to true.
Back to my IT WILL NOT CRASH statement. I've just tested the pointer going from 0 to 0xFFFFFFFF on the 32-bit x86 and ARMv6 platforms. It did not crash.
I've also tested the 0..0xFFFFFFFF and 0xFFFFFFFF00000000..0xFFFFFFFFFFFFFFFF ranges on an amd64 platform. Checking the full range would take a few thousand years, I guess.
Again, it did not crash.
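For reference, a sketch of the kind of loop such a test might use; note that the integer-to-pointer conversion is implementation-defined, so this only says something about the platforms it was actually run on:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (uint64_t v = 0; v <= 0xFFFFFFFFu; ++v) {
        int *p = (int *)(uintptr_t)v; /* implementation-defined conversion */
        if (p == NULL)                /* compare only, never dereference */
            puts("compared equal to NULL");
    }
    return 0;
}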
I challenge the commenters and downvoters to show a platform and value where it crashes. Until then, I'll probably be able to survive a few negative points.
There is also an SO link to trap representation which also indicates that it will not crash.
Related
I posted a question earlier about some pointer issues I've been having:
C int pointer segmentation fault several scenarios, can't explain behaviour
From some of the comments, I've been led to believe that the following:
#include <stdlib.h>
#include <stdio.h>
int main(){
    int *p;
    *p = 1;
    printf("%d\n", *p);
    return 0;
}
is undefined behaviour. Is this true? I do this all the time, and I've even seen it in my C course.
However, when I do
#include <stdlib.h>
#include <stdio.h>
int main(){
    int *p=NULL;
    *p = 1;
    printf("%d\n", *p);
    return 0;
}
I get a seg fault right before printing the contents of p (after the line *p=1;). Does this mean I should have always been mallocing any time I actually assign a value for a pointer to point to?
If that's the case, then why does char *string = "this is a string" always work?
I'm quite confused, please help!
This:
int *p;
*p = 1;
Is undefined behavior because p isn't pointing anywhere. It is uninitialized. So when you attempt to dereference p you're essentially writing to a random address.
What undefined behavior means is that there is no guarantee what the program will do. It might crash, it might output strange results, or it may appear to work properly.
This is also undefined behavior:
int *p=NULL;
*p = 1;
Because you're attempting to dereference a NULL pointer.
This works:
char *string = "this is a string" ;
Because you're initializing string with the address of a string constant. It's not the same as the other two cases. It's actually the same as this:
char *string;
string = "this is a string";
Note that here string isn't being dereferenced. The pointer variable itself is being assigned a value.
Yes, doing int *p; *p = 1; is undefined behavior. You are dereferencing an uninitialized pointer (accessing the memory to which it points). If it works, it is only because the garbage in p happened to be the address of some region of memory which is writable, and whose contents weren't critical enough to cause an immediate crash when you overwrote them. (But you still might have corrupted some important program data causing problems you won't notice until later...)
An example as blatant as this should trigger a compiler warning. If it doesn't, figure out how to adjust your compiler options so it does. (On gcc, try -Wall -O).
Pointers have to point to valid memory before they can be dereferenced. That could be memory allocated by malloc, or the address of an existing valid object (p = &x;).
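For instance, a minimal sketch showing both options:
#include <stdlib.h>

int main(void)
{
    int x = 0;
    int *p = &x;                /* points to an existing object */
    *p = 1;                     /* well defined: writes to x */

    int *q = malloc(sizeof *q); /* points to allocated memory */
    if (q != NULL) {
        *q = 2;                 /* well defined: writes to the allocation */
        free(q);
    }
    return 0;
}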
char *string = "this is a string"; is perfectly fine because this pointer is not uninitialized; you initialized it! (The * in char *string is part of its declaration; you aren't dereferencing it.) Specifically, you initialized it with the address of some memory which you asked the compiler to reserve and fill in with the characters this is a string\0. Having done that, you can safely dereference that pointer (though only to read, since it is undefined behavior to write to a string literal).
is undefined behaviour. Is this true?
Sure is. It just looks like it's working on your system with what you've tried, but you're performing an invalid write. The version where you set p to NULL first is segfaulting because of the invalid write, but it's still technically undefined behavior.
You can only write to memory that's been allocated. If you don't need the pointer, the easiest solution is to just use a regular int.
int p = 1;
In general, avoid pointers when you can, since automatic variables are much easier to work with.
Your char* example works because of the way strings work in C: there's a block of memory with the sequence "this is a string\0" somewhere, and your pointer points at it. That memory is typically read-only though, and trying to change it (e.g., string[0] = 'T';) is undefined behavior.
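If you actually need a modifiable string, one common alternative, sketched here, is an array, which holds a writable copy of the literal:
char string[] = "this is a string"; /* the array holds a writable copy */
string[0] = 'T';                    /* fine: modifies the array, not the literal */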
With the line
char *string = "this is a string";
you are making the pointer string point to a place in read-only memory that contains the string "this is a string". The compiler/linker will ensure that this string will be placed in the proper location for you and that the pointer string will be pointing to the correct location. Therefore, it is guaranteed that the pointer string is pointing to a valid memory location without any further action on your part.
However, in the code
int *p;
*p = 1;
p is uninitialized, which means it is not pointing to a valid memory location. Dereferencing p will therefore result in undefined behavior.
It is not necessary to always use malloc to make p point to a valid memory location. It is one possible way, but there are many other possible ways, for example the following:
int i;
int *p;
p = &i;
Now p is also pointing to a valid memory location and can be safely dereferenced.
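For example, continuing the snippet above:
*p = 5; /* well defined: i is now 5 */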
Consider the code:
#include <stdio.h>
int main(void)
{
    int i=1, j=2;
    int *p;
    ... some code goes here
    *p = 3;
    printf("%d %d\n", i, j);
}
Would the statement *p = 3; write to i, j, or neither? It would write to i or j if p points to that object, but not if p points somewhere else. If the ... portion of the code doesn't do anything with p, then p might happen to point to i, or j, or something within the stdout object, or anything at all. If it happens to point to i or j, then the write *p = 3; might affect that object with no other side effects, but if it points to information within stdout that controls where output goes, it might cause the following printf to behave in unpredictable fashion. In a typical implementation, p might point anywhere, and there will be so many things to which p might point that it would be impossible to predict all of the possible effects of writing to them.
Note that the Standard classifies many actions as "Undefined Behavior" with the intention that many or even most implementations will extend the semantics of the language by documenting their behavior. Most implementations, for example, extend the meaning of the << operator to allow it to be used to multiply negative numbers by powers of two. Even on implementations that extend the language to specify that an assignment like *p = 3; will always perform a word-sized write of the value 3 to the indicated address, with whatever consequence results, there would be relatively few platforms(*) where it would be possible to fully characterize all possible effects of that action in cases where nothing is known about the value of p. In cases where pointers are read rather than written, some systems may be able to offer useful behavioral guarantees about the effect of arbitrary stray reads, but not all(**).
(*) Some freestanding platforms which keep code in read-only storage may be able to uphold some behavioral guarantees even if code writes to arbitrary pointer addresses. Such behavioral guarantees may be useful in systems whose state might be corrupted by electrical interference, but even when targeting such systems writing to a stray pointer would never be useful.
(**) On many platforms, stray reads will either yield a meaningless value without side effects or force an abnormal program termination, but on an Apple II with a Disk II card in the customary slot-6 location, if code reads from address 0xC0EF within a second of performing a disk access, the drive head will start overwriting whatever happens to be on the last track accessed. This is by design (software that needs to write to the disk does so by accessing address 0xC0EF, and having the hardware respond to both reads and writes required one less logic gate--and thus one less chip--than hardware that only responded to writes), but it does mean that code must be careful not to perform any stray reads.
I was going through here and found that malloc can cause unwanted behaviour if we don't include stdlib.h, cast its return value, and the pointer and integer sizes differ on the system.
Below is the code snippet given in that SO question. It was tried on a 64-bit machine where pointer and integer sizes differ.
int main()
{
    int* p;
    p = (int*)malloc(sizeof(int));
    *p = 10;
    return 0;
}
If we don't include stdlib.h, the compiler will assume malloc's return type is int; casting that value and assigning it to a pointer of a different size can cause unwanted behaviour. But my question is why casting the int to int* and assigning it to a pointer of a different size can cause the problem.
int main()
{
    int* p;
    p = (int*)malloc(sizeof(int));
    *p = 10;
    return 0;
}
Under C99 and C2011 rules, the call to malloc with no visible declaration is a constraint violation, meaning that a conforming compiler must issue a diagnostic. (This is about as close as C comes to saying that something is "illegal".) If your compiler doesn't warn about the call, you should find out what options to use to make it do so.
Under C90 rules, calling a function with no visible declaration causes the compiler to assume that the function actually returns a result of type int. Since malloc is actually defined with a return type of void*, the behavior is undefined; the compiler is not required to diagnose it, but the standard says exactly nothing about what happens when the call is evaluated.
What typically happens in practice is that the compiler generates code as if malloc were defined to return an int result. For example, malloc might put its 64-bit void* result in some particular CPU register, and the calling code might assume that that register contains a 32-bit int. (This is not a type conversion; it's just bad code that incorrectly treats a value of one type as if it were of a different type.) That (possibly garbage) int value is then converted to int* and stored in p. You might lose the high-order or low-order 32 bits of the returned pointer -- but that's only one out of arbitrarily many ways it can go wrong.
Or malloc might push its 64-bit result onto the stack, and the caller might pop only 32 bits off the stack, resulting in a stack misalignment that will cause all subsequent execution to be incorrect. For historical reasons, C compilers typically don't use this kind of calling convention, but the standard permits it.
If int, void*, and int* all happen to be the same size (as they often are on 32-bit systems), the code is likely to work -- but even that's not guaranteed. For example, a calling convention might use one register to return int results and a different one to return pointer results. Again, most existing C calling conventions allow for old bad code that makes assumptions like this.
Calling malloc requires #include <stdlib.h>, even though some compilers might not enforce that requirement. It's much easier to add the #include (and drop the cast) than to spend time thinking about what might happen if you don't.
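For completeness, a corrected version might look like this; with the header included, the cast is unnecessary in C:
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof *p); /* malloc is declared in <stdlib.h> and returns void * */
    if (p != NULL) {
        *p = 10;
        free(p);
    }
    return 0;
}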
Almost any function causes undefined behavior if it is called with no prototype given beforehand; the exception is when the implicitly assumed type happens to match the function's actual type, for example int function(int x);.
It's pretty obvious that if the size of a pointer is larger than the size of an int, and malloc() is assumed to return an int because of the implicit declaration, then the returned address might not be the real address, because, for example, it might not be possible to represent it with fewer bits.
Dereferencing it would be undefined behavior, which, by the way, you can't test for: since it's undefined, what would you expect to happen? It's undefined!!! So, nothing to test there.
Does the snippet below invoke undefined behavior in case of an error?
#include <stdio.h>

int main() {
    int i; /* Indeterminate */
    if (scanf("%d", &i) == 1) /* Initialize */
        printf("%d\n", i); /* Success! Print read value */
    else
        printf("%d\n", i); /* Input failed! Is printing `i` UB or not? */
    return 0;
}
What if scanf fails? Is an uninitialized variable accessed?
EDIT
Moreover, what if I replace scanf("%d", &i) with my_initializer(&i):
int my_initializer(int *pi)
{
    double room_temp_degc = get_room_temp_in_degc();
    if (room_temp_degc < 12.0) {
        // Cool
        *pi = 42;
        return 1;
    } else {
        return 0;
    }
}
In C90, this is UB.
For C99 and C11, technically, it isn't, but the output is indeterminate. It's even possible that another printf directly following will print a different value; uninitialized variables may appear to change without explicit action of the programme. Note, however, that an uninitialized variable can be read without undefined behaviour only if its address has been taken*) (which is done here in the scanf call). From n1570 6.3.2.1 p2:
If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.
In theory, this would allow for something like
int n;
&n;
printf("%d\n", n);
But compilers may still reorder statements or allocate registers based on the assumption that the first read doesn't occur before the first write, and ignore the side-effect-free &n; statement.
For any practical purpose, never read uninitialized values. First, there is no reason why you should want to; second, even an unspecified value allows surprising optimizations: some thought a "garbage" value could be used to gather entropy for random numbers, which led to really bad bugs in cryptographic software, see e.g. Xi Wang's blog entry. For an even weirder example, where an uninitialized value is odd after multiplication with 2, see e.g. this blog (yes, indeterminate times 2 is simply indeterminate, not necessarily even).
See also DR 260.
*) The quoted paragraph is missing in C99, but this should be considered a defect in the standard, not a change in C11. C99 makes it technically defined (for machines without trap representations) to read any uninitialized variable (though their values are still indeterminate and may still appear to change randomly, it's just not UB).
With DR 338, this was corrected, but not before C11. It was added to allow for NaT on the Itanium platform (which exists for registers, but not for values in memory), even for integer types without trap representations. I don't know if the &n in the code above has any effect on such a platform (by a strict reading of C11, it should, but I wouldn't rely on it).
strlen returns the number of characters that precede the terminating null character. An implementation of strlen might look like this:
size_t strlen(const char *str)
{
    const char *s;
    for (s = str; *s; ++s) {}
    return s - str;
}
This particular implementation dereferences s, and the bytes it reads may hold indeterminate values. It's equivalent to this:
int a;
int* p = &a;
*p;
So, for example, if one were to do this (which can cause strlen to give an incorrect result):
char buffer[10];
buffer[9] = '\0';
strlen(buffer);
Is it undefined behavior?
Calling the standard function strlen causes undefined behaviour. DR 451 clarifies this:
library functions will exhibit undefined behavior when used on indeterminate values
For a more in-depth discussion see this thread.
The behavior of the variant that you are showing is well defined under these circumstances.
The bytes of the uninitialized array all have indeterminate values, with the exception of the 10th element, which you set to 0.
Accessing an indeterminate value is only UB if the address of the underlying object is never taken or if the value is a trap representation for the corresponding type.
Since this is an array, and access to array elements goes through pointer arithmetic, the first case is not relevant here.
Any char value can be accessed without UB; the clauses about trap representations in the standard explicitly exclude all character types.
Thus the values that you are dealing with are simply "unspecified".
Reading unspecified values may, according to some members of the C standards committee, give different results each time, what some call a "wobbly" value. This property is not relevant here, since your function reads any such value at most once.
So your access to the array elements gives you any arbitrary but valid char value.
You are sure that your for loop stops at position 9 at the latest, so you will not overrun your array.
So no "bad" things beyond the visible may happen if you use your specific version of the function. But having a function call that produces unspecified results is certainly not something you want to see in real code. Things like this lead to very subtle bugs, and you should avoid them by all means.
No, it's not undefined behavior. Your strlen function will stop before the end of the buffer. If your strlen function referenced buffer[10], then yes, that would be undefined.
It certainly will be unexpected behavior, since most of buffer contains random data. "Undefined" is a special word for people writing language standards. It means that anything could happen, including memory faults or the program exiting. By unexpected, I mean that it is surely not what the programmer wanted to happen. On some runs, the result of strlen could be 3, or it could be 9.
Yes, it's undefined behaviour. From the draft C11 standard, §J.2 "Undefined behavior":
The behavior is undefined in the following circumstances:
...
The value of an object with automatic storage duration is used while it is
indeterminate.
int *p;
{
    int x = 0;
    p = &x;
}
// p is no longer valid
{
    int x = 0;
    if (&x == p) {
        *p = 2; // Is this valid?
    }
}
Accessing a pointer after the thing it points to has been freed is undefined behavior, but what happens if some later allocation happens in the same area, and you explicitly compare the old pointer to a pointer to the new thing? Would it have mattered if I cast &x and p to uintptr_t before comparing them?
(I know it's not guaranteed that the two x variables occupy the same spot. I have no reason to do this, but I can imagine, say, an algorithm where you intersect a set of pointers that might have been freed with a set of definitely valid pointers, removing the invalid pointers in the process. If a previously-invalidated pointer is equal to a known good pointer, I'm curious what would happen.)
By my understanding of the standard (6.2.4. (2))
The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.
you have undefined behaviour when you compare
if (&x == p) {
as that meets these points listed in Annex J.2:
— The value of a pointer to an object whose lifetime has ended is used (6.2.4).
— The value of an object with automatic storage duration is used while it is indeterminate (6.2.4, 6.7.9, 6.8).
Okay, this seems to be interpreted as a two-, make that three-part, question by some people.
First, there were concerns if using the pointer for a comparison is defined at all.
As is pointed out in the comments, the mere use of the pointer is UB, since §J.2 says the use of a pointer to an object whose lifetime has ended is UB.
However, if that obstacle is passed (which is well in the range of UB, it can work after all and will on many platforms), here is what I found about the other concerns:
Given the pointers do compare equal, the code is valid:
C Standard, §6.5.3.2,4:
[...] If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined.
Although a footnote at that location explicitly says that the address of an object after the end of its lifetime is an invalid pointer value, this does not apply here, since the if makes sure the pointer's value is the address of x and thus is valid.
C++ Standard, §3.9.2,3:
If an object of type T is located at an address A, a pointer of type cv T* whose value is the address A is said to point to that object, regardless of how the value was obtained. [ Note: For instance, the address one past the end of an array (5.7) would be considered to point to an unrelated object of the array’s element type that might be located at that address.
Emphasis is mine.
It will probably work with most compilers, but it still is undefined behavior. For the C language, these two x are different objects; one has ended its lifetime, so you have UB.
More seriously, some compilers may decide to fool you in a different way than you expect.
The C standard says
Two pointers compare equal if and only if both are null pointers, both
are pointers to the same object (including a pointer to an object and
a subobject at its beginning) or function, both are pointers to one
past the last element of the same array object, or one is a pointer to
one past the end of one array object and the other is a pointer to the
start of a different array object that happens to immediately follow
the first array object in the address space.
Note in particular the phrase "both are pointers to the same object". In the sense of the standard, the two "x"s are not the same object. They may happen to be realized in the same memory location, but that is at the discretion of the compiler. Since they are clearly two distinct objects, declared in different scopes, the comparison should in fact never be true. So an optimizer might well cut away that branch completely.
Another aspect that has not yet been discussed of all that is that the validity of this depends on the "lifetime" of the objects and not the scope. If you'd add a possible jump into that scope
{
    int x = 0;
    p = &x;
BLURB: ;
}
...
if (...)
    ...
if (something) goto BLURB;
the lifetime would extend as long as the scope of the first x is reachable. Then everything is valid behavior, but still your test would always be false, and would be optimized out by a decent compiler.
From all that, you can see that you'd better treat it as UB, and not play such games in real code.
It would work, if by "work" you use a very liberal definition, roughly equivalent to saying that it would not crash.
However, it is a bad idea. I cannot imagine a single reason why it is easier to cross your fingers and hope that the two local variables are stored in the same memory address than it is to write p=&x again. If this is just an academic question, then yes it's valid C - but whether the if statement is true or not is not guaranteed to be consistent across platforms or even different programs.
Edit: To be clear, the undefined behavior is whether &x == p in the second block. The value of p will not change; it's still a pointer to that address, that address just doesn't belong to you anymore. The compiler might (and probably will) put the second x at that same address (assuming there isn't any other intervening code). If that happens to be true, it's perfectly legal to dereference p just as you would &x, as long as its type is a pointer to int or something smaller. Just like it's legal to say p = (int *)0x00000042; if (p == &x) { *p = whatever; }.
The behaviour is undefined. However, your question reminds me of another case where a somewhat similar concept was employed. In that case, there were threads which would get different amounts of CPU time because of their priorities. So thread 1 would get a little more time because thread 2 was waiting for I/O or something. Once its job was done, thread 1 would write values to memory for thread 2 to consume. This was not "sharing" the memory in a controlled way: it would write to the call stack itself, where the variables of thread 2 would be allocated. Then, when thread 2 eventually got round to executing, its declared variables never had to be assigned values, because the locations they occupied already held valid values. I don't know what they did if something went wrong in the process, but this is one of the most hellish optimizations in C code I have ever witnessed.
Winner #2 in this undefined behavior contest is rather similar to your code:
#include <stdio.h>
#include <stdlib.h>

int main() {
    int *p = (int*)malloc(sizeof(int));
    int *q = (int*)realloc(p, sizeof(int));
    *p = 1;
    *q = 2;
    if (p == q)
        printf("%d %d\n", *p, *q);
}
According to the post:
Using a recent version of Clang (r160635 for x86-64 on Linux):
$ clang -O realloc.c ; ./a.out
1 2
This can only be explained if the Clang developers consider that this example, and yours, exhibit undefined behavior.
Putting aside the question of whether it is valid (and I'm convinced now that it's not, see Arne Mertz's answer), I still think that it's academic.
The algorithm you are thinking of would not produce very useful results, as you could only compare two pointers, but you would have no way to determine whether those pointers point to the same kind of object or to something completely different. A pointer to a struct could now be the address of a single char, for example.