I am reading an article about whole program optimization. The last paragraph in the Link-Time Code Generation section says zeroing an array allocated on the stack may not have the same effect depending on how it's zeroed:
Turning on whole program optimization did uncover several bugs that had undefined behavior. Without WPO, these had somehow not crashed. With WPO, they did. In one case, a member function call was being made through a pointer to uninitialized memory. In several other cases, it was assumed that arrays on the stack were identical to their own addresses. That is, it was assumed that memset(&charArray, 0, sizeof(charArray)) would have the same effect as memset(charArray, 0, sizeof(charArray)). This is not guaranteed by the standard, and appears to change under WPO.
I thought if I did char foo[1] that foo would always be == to &foo. Can someone explain what's happening here? Thanks
foo is an array, and in expressions foo is converted to a pointer to its first element, except when it is the operand of the unary & or sizeof operator. So, in such cases, foo == &foo[0]. &foo is the address of the array foo, not the address of the first element of foo.
Though the values of foo and &foo are equivalent, their types are different: foo is of type char * after decay, while &foo is of type char (*)[1].
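A minimal sketch of the distinction (the array size of 8 is arbitrary, chosen so the two pointer-arithmetic step sizes differ visibly; the printed addresses are implementation-specific, but the first two will always match):

#include <stdio.h>

int main(void)
{
    char foo[8]; /* 8 elements so the two +1 step sizes differ visibly */

    /* Same address, different types: foo decays to char *,
       while &foo has type char (*)[8]. */
    printf("foo      = %p\n", (void *)foo);
    printf("&foo     = %p\n", (void *)&foo);

    /* foo + 1 advances by sizeof(char) == 1 byte;
       &foo + 1 advances by sizeof(char[8]) == 8 bytes. */
    printf("foo + 1  = %p\n", (void *)(foo + 1));
    printf("&foo + 1 = %p\n", (void *)(&foo + 1));

    return 0;
}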
Is this behavior defined or not?
volatile long (*volatile ptr)[1] = (void*)NULL;
volatile long v = (long) *ptr;
printf("%ld\n", v);
It works because by dereferencing a pointer to an array we receive the array itself, and that array then decays to a pointer to its first element.
Updated demo: https://ideone.com/DqFF6T
Also, GCC even considers the following code a constant expression:
volatile long (*ptr2)[1] = (void*)NULL;
enum { this_is_constant_in_gcc = ((void*)ptr2 == (void*)*ptr2) };
printf("%d\n", this_is_constant_in_gcc);
Basically, it is dereferencing ptr2 at compile time.
This:
long (*ptr)[1] = NULL;
Is declaring a pointer to an "array of 1 long" (more precisely, the type is long int (*)[1]), with the initial value of NULL. Everything is fine; any pointer can be NULL.
Then, this:
long v = (long) *ptr;
Is dereferencing the NULL pointer, which is undefined behavior. All bets are off, if your program does not crash, the following statement could print any value or do anything else really.
Let me make this clear one more time: undefined behavior means that anything can happen. There is no explanation as to why anything strange happens after invoking undefined behavior, nor does there need to be. The compiler could very well emit 16-bit Real Mode x86 assembly, produce a binary that deletes your entire home folder, emit the Apollo 11 Guidance Computer assembly code, or whatever else. It is not a bug. It's perfectly conforming to the standard.
The only reason your code seems to work is because GCC decides, purely out of coincidence, to do the following (Godbolt link):
mov QWORD PTR [rbp-8], 0 ; put NULL on the stack
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rbp-16], rax ; move NULL to the variable v
Causing the NULL-dereference to never actually happen. This is most probably a consequence of the undefined behavior in dereferencing ptr ¯\_(ツ)_/¯
Funnily enough, I previously said in a comment:
dereferencing NULL is invalid and will basically always cause a segmentation fault.
But of course, since it is undefined behavior, that "basically always" is wrong. I think this is the first time I've ever seen a null-pointer dereference not cause a SIGSEGV.
Is this behavior defined or not?
Not.
long (*ptr)[1] = NULL;
long v = (long) *ptr;
printf("%ld\n", v);
It works because by dereferencing a pointer to an array we receive the array itself, and that array then decays to a pointer to its first element.
No, you are confusing type with value. It is true that the expression *ptr on the second line has type long[1], but evaluating that expression produces undefined behavior regardless of the data type, and regardless of the automatic conversion that would be applied to the result if it were defined.
The relevant section of the spec is paragraph 6.5.2.3/4:
The unary * operator denotes indirection. If the operand points to a function, the result is a function designator; if it points to an object, the result is an lvalue designating the object. If the operand has type ''pointer to type'', the result has type ''type''. If an invalid value has been assigned to the pointer, the behavior of the unary * operator is undefined.
A footnote goes on to clarify that
[...] Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer [...]
It may "work" for you in an empirical sense, but from a language perspective, any output at all or none is a conforming result.
Update:
It may be interesting to note that the answer would be different for explicitly taking the address of *ptr than it is for supposing that array decay will overcome the undefinedness of the dereference. The standard provides that, as a special case, where the operand of the unary & operator is the result of a unary * operator, neither of those operators is evaluated. Provided that all relevant constraints are satisfied, the result is as if they were both omitted altogether, except that it is never an lvalue.
Thus, this is ok:
long (*ptr)[1] = NULL;
long v = (long) &*ptr;
printf("%ld\n", v);
On many implementations it will reliably print 0, but do note that C does not specify that it must be 0.
The key distinction here is that in this case, the * operation is not evaluated (per the spec). The * operation in the original code is evaluated, notwithstanding the fact that if the pointer value were valid, the resulting array would be converted right back to a pointer (of a different type, but to the same location). That does suggest an obvious shortcut that implementations may take with the original code, and they may take it, if they wish, without regard to whether ptr's value is valid, because if it is invalid they can do whatever they want.
To just answer your questions as asked:
Is dereferencing a NULL pointer to array valid in C?
No.
Is this behavior defined or not?
It is classified as "undefined behavior", so it is not defined.
Never mind that this trick with the array may happen to work on some implementations, and that it fills absolutely no practical need (I assume you are asking out of curiosity): it is not valid per the C standard to dereference a NULL pointer in any way, and doing so causes undefined behavior.
Anything can happen when you implement such statements into your program.
Look at the answers on this question, which explain why:
What EXACTLY is meant by "de-referencing a NULL pointer"?
One quote from Adam Rosenfield's answer:
A null pointer is a pointer that does not point to any valid data (but it is not the only such pointer). The C standard says that it is undefined behavior to dereference a null pointer. This means that absolutely anything could happen: the program could crash, it could continue working silently, or it could erase your hard drive (although that's rather unlikely).
Is this behavior defined or not?
The behavior is undefined because you are applying the * operator to a pointer that compares equal to the null pointer constant.
The following stackoverflow thread tries to explain what undefined behavior is: Undefined, unspecified and implementation-defined behavior
There are tons of code like this one:
#include <stdio.h>

int main(void)
{
    int a[2][2] = {{0, 1}, {2, -1}};
    int *p = &a[0][0];

    while (*p != -1) {
        printf("%d\n", *p);
        p++;
    }
    return 0;
}
But based on this answer, the behavior is undefined.
N1570. 6.5.6 p8:
When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i−n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
Can someone explain this in detail?
The array whose base address (pointer to first element) is assigned to p is of type int[2]. This means the address in p can legally be dereferenced only at locations *p and *(p+1), or if you prefer subscript notation, p[0] and p[1]. Furthermore, p+2 is guaranteed to be legally evaluable as an address, and comparable to other addresses in that sequence, but it cannot be dereferenced. This is the one-past address.
The code you posted violates the one-past rule by dereferencing p once it passes the last element in the array in which it is homed. That the array in which it is homed is buttressed up against another array of similar dimension is not relevant to the formal definition cited.
That said, in practice it works; but as is often said, observed behavior is not, and should never be considered, defined behavior. Just because it works doesn't make it right.
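A small sketch of the rule, using a hypothetical two-element buf: the one-past pointer may be formed and compared, but never dereferenced:

#include <stdio.h>

int main(void)
{
    int buf[2] = {10, 20};
    int *p = buf;

    printf("%d %d\n", p[0], p[1]); /* fine: both elements exist */

    int *one_past = p + 2; /* legal to form and compare... */
    if (one_past > p)
        printf("one-past compares greater, as guaranteed\n");

    /* ...but evaluating *one_past (or p[2]) would be undefined behavior */
    return 0;
}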
The object representation of pointers is opaque, in C. There is no prohibition against pointers having bounds information encoded. That's one possibility to keep in mind.
More practically, implementations are also able to achieve certain optimizations based on assumptions which are asserted by rules like these: Aliasing.
Then there's the protection of programmers from accidents.
Consider the following code, inside a function body:
struct {
    char c;
    int i;
} foo;

char *cp1 = (char *) &foo;
char *cp2 = &foo.c;
Given this, cp1 and cp2 will compare as equal, but their bounds are nonetheless different. cp1 can point to any byte of foo and even to "one past" foo, but cp2 can only point to "one past" foo.c, at most, if we wish to maintain defined behaviour.
In this example, there might be padding between the foo.c and foo.i members. While the first byte of that padding coincides with "one past" the foo.c member, cp2 + 2 might point further into the padding. The implementation can notice this during translation and, instead of producing a program, it can advise you that you might be doing something you didn't think you were doing.
By contrast, if you read the initializer for the cp1 pointer, it intuitively suggests that it can access any byte of the foo structure, including padding.
In summary, this can produce undefined behaviour during translation (a warning or error) or during program execution (by encoding bounds information); there's no difference, standard-wise: The behaviour is undefined.
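To make the padding in the example visible, here is a sketch using offsetof (the struct is given the tag foo purely for illustration; the reported values are implementation-defined):

#include <stdio.h>
#include <stddef.h>

struct foo {
    char c;
    int  i;
};

int main(void)
{
    /* Any gap between c (offset 0, size 1) and i is padding;
       on many ABIs offsetof reports 4 here, leaving 3 padding
       bytes after c, but the exact layout is implementation-defined. */
    printf("offsetof(struct foo, i) = %zu\n", offsetof(struct foo, i));
    printf("sizeof(struct foo)      = %zu\n", sizeof(struct foo));
    return 0;
}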
You can cast your pointer into a pointer to array to ensure the correct array semantics.
This code is indeed not defined, but it works as an extension in every compiler in common usage today.
However, the correct way of doing it would be to cast the pointer into a pointer to array, like so:
((int (*)[2])p)[0][0]
to get the zeroth element or say:
((int (*)[2])p)[1][1]
to get the last.
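Put together as a complete program (a sketch reusing the a and p from the question above):

#include <stdio.h>

int main(void)
{
    int a[2][2] = {{0, 1}, {2, -1}};
    int *p = &a[0][0];

    /* Recover the two-dimensional view by casting back to
       "pointer to array of 2 int". */
    int (*q)[2] = (int (*)[2])p;

    printf("%d\n", q[0][0]); /* 0, the zeroth element */
    printf("%d\n", q[1][1]); /* -1, the last element */
    return 0;
}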
To be strict, the reason I think this is illegal is that you are breaking strict aliasing: pointers to different types may not point to the same address (variable).
In this case you are creating a pointer to an array of ints and a pointer to an int and pointing them at the same address; this is not allowed by the standard, as the only pointer type that may alias another is char *, and even that is rarely used properly.
I am new to C programming. While solving one of my class assignments, I came across the following code snippet. I did not understand what it does.
Can any one tell me what is the meaning of following C syntax,
((char *)0 + 1) or ((int *)0 + 1)
The (char *) 0 part creates a pointer to character data, at address 0. This address is then incremented by one, triggering undefined behavior, since pointers to address 0 (also known as NULL in C) cannot be used in pointer arithmetic. The second part does the same, but for a pointer to integer data.
If the compiler simply treats NULL as address 0 (which is common but, again, not required, which is why this is undefined behavior), the resulting addresses, viewed numerically, will not be the same, since pointer arithmetic in C is done in terms of the type being pointed at, and typically sizeof (int) > sizeof (char).
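The scaling is easy to demonstrate on a valid base pointer (a sketch; the printed addresses are implementation-specific, but the step sizes are sizeof(char) and sizeof(int)):

#include <stdio.h>

int main(void)
{
    int arr[2] = {0, 0};

    char *cp = (char *)arr; /* a valid base, unlike (char *)0 */
    int  *ip = arr;

    /* +1 advances by the size of the pointed-to type */
    printf("cp: %p -> cp + 1: %p (step %zu)\n",
           (void *)cp, (void *)(cp + 1), sizeof *cp);
    printf("ip: %p -> ip + 1: %p (step %zu)\n",
           (void *)ip, (void *)(ip + 1), sizeof *ip);
    return 0;
}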
Can any one tell me what is the meaning of following C syntax,
((char *)0 + 1) or ((int *)0 + 1)
Nothing, by the terms of the C standard, because it's not defined. This code invokes undefined behavior on the part of the C compiler. Let me explain:
In C, every pointer may either point to some object of the type the pointer dereferences to, or it may be 0, which is then called a null pointer. Null pointers cannot be used in pointer arithmetic.
Note that the actual representation of a null pointer on the metal, i.e. the bits the variable has on the machine, may be something other than all zeros. But on the C side of things, the null pointer always compares equal to an integer of value 0. Moreover, null pointers of different types also compare equal by definition. However, comparing non-null pointers of different types invokes undefined behavior. Also, you can cast any pointer to a void* pointer and back, and you can cast every pointer to an integer of type uintptr_t and back. But casting from a pointer to type A to a pointer to type B (where B is not void*) invokes undefined behavior.
The special function malloc is defined by the C language specification to return a void* pointer that can be cast to any pointer type, though. But say you use it to allocate some memory for an array of char and later cast that to int: this again invokes undefined behavior.
Now you may ask: "What is undefined behavior?". Well, it just means that the language standard doesn't define it, and an implementer may go about it in any way seen fit. On most platforms, writing something like ((char*)0 + 1) may do something naively expected (create a pointer pointing to address 1), but it may just as well make the compiler build an artificial intelligence that first chases you down the street, then gains consciousness, and finally takes over the world, turning humans into batteries. So be careful about what you do ;)
In C you have to tell the compiler which type you mean to use; this is called "casting".
For example:
char *c;             // define c as "char pointer" (pointer to char)
c = ((char *)0 + 1); // the cast binds to 0 first: this casts 0 to "char pointer" type and then adds 1
The following compiles and prints "string" as an output.
#include <stdio.h>

struct S { int x; char c[7]; };

struct S bar() {
    struct S s = {42, "string"};
    return s;
}

int main()
{
    printf("%s", bar().c);
}
Apparently this seems to invoke undefined behavior according to
C99 6.5.2.2/5: If an attempt is made to modify the result of a function call or to access it after the next sequence point, the behavior is undefined.
I don't understand where it says about "next sequence point". What's going on here?
You've run into a subtle corner of the language.
An expression of array type is, in most contexts, implicitly converted to a pointer to the first element of the array object. The exceptions, none of which apply here, are:
When the array expression is the operand of a unary & operator (which yields the address of the entire array);
When it's the operand of a unary sizeof or (as of C11) _Alignof operator (sizeof arr yields the size of the array, not the size of a pointer); and
When it's a string literal in an initializer used to initialize an array object (char str[6] = "hello"; doesn't convert "hello" to a char*.)
(The N1570 draft incorrectly adds _Alignof to the list of exceptions. In fact, for reasons that are not clear, _Alignof can only be applied to a type name, not to an expression.)
Note that there's an implicit assumption: that the array expression refers to an array object in the first place. In most cases, it does (the simplest case is when the array expression is the name of a declared array object) -- but in this one case, there is no array object.
If a function returns a struct, the struct result is returned by value. In this case, the struct contains an array, giving us an array value with no corresponding array object, at least logically. So the array expression bar().c decays to a pointer to the first element of ... er, um, ... an array object that doesn't exist.
The 2011 ISO C standard addresses this by introducing "temporary lifetime", which applies only to "a non-lvalue expression with structure or union type, where the structure or union contains a member with array type" (N1570 6.2.4p8). Such an object may not be modified, and its lifetime ends at the end of the containing full expression or full declarator.
So as of C2011, your program's behavior is well defined. The printf call gets a pointer to the first element of an array that's part of a struct object with temporary lifetime; that object continues to exist until the printf call finishes.
But as of C99, the behavior is undefined -- not necessarily because of the clause you quote (as far as I can tell, there is no intervening sequence point), but because C99 doesn't define the array object that would be necessary for the printf to work.
If your goal is to get this program to work, rather than to understand why it might fail, you can store the result of the function call in an explicit object:
const struct S result = bar();
printf("%s", result.c);
Now you have a struct object with automatic, rather than temporary, storage duration, so it exists during and after the execution of the printf call.
The sequence point occurs at the end of the full expression, i.e., when printf returns in this example. There are other cases where sequence points occur, too.
Effectively, this rule states that function temporaries do not live beyond the next sequence point, which in this case occurs well after their use, so your program has quite well-defined behaviour.
Here's a simple example of not well-defined behaviour:
char* c = bar().c; *c = 5; // UB
Here, the sequence point is met after c is created, and the memory it points to is destroyed, but we then attempt to access that memory through c (*c = 5), resulting in UB.
In C99 there is a sequence point at the call to a function, after the arguments have been evaluated (C99 6.5.2.2/10).
So, when bar().c is evaluated, it results in a pointer to the first element in the char c[7] array in the struct returned by bar(). However, that pointer gets copied into an argument (a nameless argument as it happens) to printf(), and by the time the call is actually made to the printf() function the sequence point mentioned above has occurred, so the member that the pointer was pointing to may no longer be alive.
As Keith Thompson mentions, C11 (and C++) make stronger guarantees about the lifetime of temporaries, so the behavior under those standards would not be undefined.
One of the examples of undefined behavior from the C standard reads (J.2):
— An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6)
If the declaration is changed from int a[4][5] to unsigned char a[4][5], does accessing a[1][7] still result in undefined behavior? My opinion is that it does not, but I have heard from others who disagree, and I'd like to see what some other would-be experts on SO think.
My reasoning:
By the usual interpretation of 6.2.6.1 paragraph 4, and 6.5 paragraph 7, the representation of the object a is sizeof (unsigned char [4][5])*CHAR_BIT bits and can be accessed as an array of type unsigned char [20] overlapped with the object.
a[1] has type unsigned char [5] as an lvalue, but used in an expression (as an operand to the [] operator, or equivalently as an operand to the + operator in *(a[1]+7)), it decays to a pointer of type unsigned char *.
The value of a[1] is also a pointer to a byte of the "representation" of a in the form unsigned char [20]. Interpreted in this way, adding 7 to a[1] is valid.
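Here is that reading expressed as code (a sketch: the unsigned char view of the whole object is uncontroversial; whether a[1] + 7 itself is defined is exactly what is in dispute, so the sketch sticks to the representation array):

#include <stdio.h>

int main(void)
{
    unsigned char a[4][5] = {0};

    /* View the whole object through its representation array,
       unsigned char [20], as 6.2.6.1p4 permits. */
    unsigned char *rep = (unsigned char *)a;

    /* If a[1][7] were defined, it would alias rep[1*5 + 7]. */
    rep[1 * 5 + 7] = 42;
    printf("%d\n", rep[12]); /* prints 42 */
    return 0;
}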
I would read this "informative example" in J.2 as a hint of what the standards body wanted: don't rely on the fact that an array index calculation accidentally lands inside the bounds of the "representation array". The intent is to ensure that all individual array subscripts always stay within their defined ranges.
In particular, this allows for an implementation to do an aggressive bounds check, and to bark at you either at compile time or run time if you use a[1][7].
This reasoning has nothing to do with the underlying type.
A compiler vendor who wants to write a conforming compiler is bound to what the Standard has to say, but not to your reasoning. The Standard says that an array subscript out of range is undefined behaviour, without any exception, so the compiler is allowed to blow up.
To cite my comment from our last discussion (Does C99 guarantee that arrays are contiguous?)
"Your original question was for a[0][6], with the declaration char a[5][5]. This is UB, no matter what. It is valid to use char *p = &a[3][4]; and access p[0] to p[5]. Taking the address &p[6] is still valid, but accessing p[6] is outside of the object, thus UB. Accessing a[0][6] is outside of the object a[0], which has type array[5] of chars. The type of the result is irrelevant, it is important how you reach it."
EDIT:
There are enough cases of undefined behaviour where you have to scan through the whole Standard, collect the facts, and combine them to finally reach the conclusion of undefined behaviour. This one is explicit, and you even cite the sentence from the Standard in your question. It is explicit and leaves no room for any workarounds.
I'm just wondering how much more explicitness in reasoning you expect from us before you become convinced that it really is UB?
EDIT 2:
After digging through the Standard and collecting information, here is another relevant citation:
6.3.2.1 - 3: Except when it is the operand of the sizeof operator or the unary & operator, or is a string literal used to initialize an array, an expression that has type ''array of type'' is converted to an expression with type ''pointer to type'' that points to the initial element of the array object and is not an lvalue. If the array object has register storage class, the behavior is undefined.
So I think this is valid:
unsigned char *p = a[1];
unsigned char c = p[7]; // Strict aliasing not applied for char types
This is UB:
unsigned char c = a[1][7];
Because a[1] is not an lvalue at this point, but is evaluated further, violating J.2 with an array subscript that is out of range. What really happens should depend on how the compiler actually implements array indexing in multidimensional arrays. So you may be right that it doesn't make any difference on every known implementation. But that's a valid undefined behaviour, too. ;)
From 6.5.6/8
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
In your example, a[1][7] points neither to an element of the array object a[1], nor one past the last element of a[1], so it is undefined behavior.
Under the hood, in the actual machine language, there is no difference between a[1][7] and a[2][2] for the definition int a[4][5]. As R.. said, this is because the array access is translated to the element offsets 1 * 5 + 7 = 12 and 2 * 5 + 2 = 12 (each then multiplied by sizeof(int), of course). The machine language doesn't know anything about arrays, matrices, or indexes. All it knows about is addresses. The C compiler above that can do anything it pleases, including naive bounds checking based on the indices - a[1][7] would then be out of bounds because the array a[1] doesn't have 8 cells. In this respect there is no difference between an int and char or unsigned char.
My guess is that the difference lies in the strict aliasing rules between int and char - even though the programmer doesn't actually do anything wrong, the compiler would be forced to do a "logical" type cast for the array, which it shouldn't do. As Jens Gustedt said, it looks more like a way to enable strict bounds checks, not a real issue with int versus char.
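The flattening arithmetic described above, as a quick sketch (it only computes element offsets, so it stays clear of the disputed access itself):

#include <stdio.h>

int main(void)
{
    int a[4][5];
    size_t row = sizeof a[0] / sizeof a[0][0]; /* 5 elements per row */

    /* Both subscript pairs flatten to the same element offset,
       hence the same machine address. */
    printf("a[1][7] -> element %zu\n", 1 * row + 7); /* 12 */
    printf("a[2][2] -> element %zu\n", 2 * row + 2); /* 12 */
    return 0;
}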
I've done some fiddling with the VC++ compiler and it seems to behave as you'd expect. Can anyone test this with gcc? In my experience gcc is much stricter about these sorts of things.
I believe that the reason the cited (J.2) sample is undefined behavior is that the linker is not required to put the sub-arrays a[1], a[2], etc. next to each other in memory. They could be scattered across memory or they could be adjacent but not in the expected order. Switching the base type from int to unsigned char changes none of this.