Where does an out-of-bounds array element refer to? - c

I want to know, in C, if I have an array whose length is 3 and I try to access the 4th element of the array, which memory address will it point to?
I have read similar questions about accessing an array element out of bounds. They say this is typical undefined behavior, which is unsafe, but are there any common rules for where the 4th element would point to?
For example, which memory address would array[3] refer to with the declarations given here?
int a = 10;
int array[3] = {1, 2, 3};
int b = 20;
printf("%d", array[3]); // access the 4th element here
May it point to a or b or array[x] or it's totally random?
The key point here is: if I declare variable A after variable B (especially when they are global or static variables), will they be stored contiguously in memory? Or does it depend entirely on the compiler?

First, to add some clarity to the "why" part of the undefined behavior you wrote about: as per the C11 spec, you are allowed to write an expression that gives you the address one past the last element, and that's fine, but you cannot dereference it. The latter invokes undefined behavior.
Considering array[3] being the same as *(array + 3), quoting the C11 standard, chapter §6.5.6, Additive operators
[....] If both the pointer
operand and the result point to elements of the same array object, or one past the last
element of the array object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.
and
If the result points one past the last element of the array object, it
shall not be used as the operand of a unary * operator that is evaluated.
So, to sum this up, (array + 3) will point to the location just past the last element of the array (i.e., just past array[2]), but whether that address is the address of a, of b, or of some other memory location is entirely compiler dependent.

May it point to a or b or array[x] or it's totally random?
TOTALLY RANDOM :)

This is totally undefined.
In fact, since those variables are local, they will probably be sitting on the stack. But depending on your system and/or compiler, some of them (for example a and b in this case) might be stored in a register instead of in memory.
So, your &array[3] will probably point somewhere in the stack space, one int further than &array[2], but it is undefined what value it points to, and accessing it is not allowed.
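To make the distinction between forming and dereferencing the one-past-the-end pointer concrete, here is a minimal sketch. The helper name end_coincides_with is hypothetical (it is not from the question); it only forms array + len and compares it with another address, and never reads through it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the one-past-the-end pointer of the
   array that starts at `first` and has `len` elements compares equal to
   `other`, and 0 otherwise. The end pointer is formed and compared, but
   never dereferenced. */
static int end_coincides_with(const int *first, size_t len, const int *other)
{
    assert(first != NULL && other != NULL);
    const int *end = first + len;   /* legal: one past the last element */
    return end == other;            /* == is defined for any two valid pointers */
}
```

With the declarations from the question, end_coincides_with(array, 3, &a) and end_coincides_with(array, 3, &b) may each return 0 or 1, and the answer can differ between compilers, targets, and optimization levels. That variability is the practical meaning of "compiler dependent" here.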

Related

Is it UB to access an element one past the end of a row of a 2d array?

Is the behavior of the following program undefined?
#include <stdio.h>
int main(void)
{
int arr[2][3] = { { 1, 2, 3 },
{ 4, 5, 6 }
};
int *ptr1 = &arr[0][0]; // pointer to first elem of { 1, 2, 3 }
int *ptr3 = ptr1 + 2; // pointer to last elem of { 1, 2, 3 }
int *ptr3_plus_1 = ptr3 + 1; // pointer to one past last elem of { 1, 2, 3 }
int *ptr4 = &arr[1][0]; // pointer to first elem of { 4, 5, 6 }
// int *ptr3_plus_2 = ptr3 + 2; // this is not legal
/* It is legal to compare ptr3_plus_1 and ptr4 */
if (ptr3_plus_1 == ptr4) {
puts("ptr3_plus_1 == ptr4");
/* ptr3_plus_1 is a valid address, but is it legal to dereference it? */
printf("*ptr3_plus_1 = %d\n", *ptr3_plus_1);
} else {
puts("ptr3_plus_1 != ptr4");
}
return 0;
}
According to §6.5.6 ¶8:
Moreover, if the expression P points to the last element of an
array object, the expression (P)+1 points one past the last
element of the array object.... If both the pointer operand and the
result point to elements of the same array object, or one past the
last element of the array object, the evaluation shall not produce an
overflow; otherwise, the behavior is undefined. If the result points
one past the last element of the array object, it shall not be used as
the operand of a unary * operator that is evaluated.
From this, it would appear that the behavior of the above program is undefined; ptr3_plus_1 points to an address one past the end of the array object from which it is derived, and dereferencing this address causes undefined behavior.
Further, Annex J.2 suggests that this is undefined behavior:
An array subscript is out of range, even if an object is apparently
accessible with the given subscript (as in the lvalue expression
a[1][7] given the declaration int a[4][5]) (6.5.6).
There is some discussion of this issue in the Stack Overflow question, One-dimensional access to a multidimensional array: well-defined C?. The consensus here appears to be that this kind of access to arbitrary elements of a two-dimensional array through one-dimensional subscripts is indeed undefined behavior.
The issue, as I see it, is that it is not even legal to form the address of the pointer ptr3_plus_2, so it is not legal to access arbitrary two-dimensional array elements in this way. But, it is legal to form the address of the pointer ptr3_plus_1 using this pointer arithmetic. Further, it is legal to compare the two pointers ptr3_plus_1 and ptr4, according to §6.5.9 ¶6:
Two pointers compare equal if and only if both are null pointers, both
are pointers to the same object (including a pointer to an object and
a subobject at its beginning) or function, both are pointers to one
past the last element of the same array object, or one is a pointer
to one past the end of one array object and the other is a pointer to
the start of a different array object that happens to immediately
follow the first array object in the address space.
So, if both ptr3_plus_1 and ptr4 are valid pointers that compare equal and that must point to the same address (the object pointed to by ptr4 must be adjacent in memory to the object pointed to by ptr3 anyway, since array storage must be contiguous), it would seem that *ptr3_plus_1 is as valid as *ptr4.
Is this undefined behavior, as described in §6.5.6 ¶8 and Annex J.2, or is this an exceptional case?
To Clarify
It seems unambiguous that it is undefined behavior to attempt to access the element one past the end of the final row of a two-dimensional array. My interest is in the question of whether it is legal to access the first element of the intermediate rows by forming a new pointer using a pointer to an element from the previous row and pointer arithmetic. It seems to me that a different example in Annex J.2 could have made this more clear.
Is it possible to reconcile the clear statement in §6.5.6 ¶8 that an attempted dereference of a pointer to the location one past the end of an array leads to undefined behavior with the idea that the pointer past the end of the first row of a two-dimensional array of type T[][] is also a pointer of type T * that points to an object of type T, namely the first element of an array of type T[]?
So, if both ptr3_plus_1 and ptr4 are valid pointers that compare equal and that must point to the same address
They are.
it would seem that *ptr3_plus_1 is as valid as *ptr4.
It is not.
The pointers are equal, but not equivalent. The trivial well-known example of the distinction between equality and equivalence is negative zero:
double a = 0.0, b = -0.0;
assert (a == b);
assert (1/a != 1/b);
Now, to be fair, there is a difference between the two cases: positive and negative zero have different representations, whereas ptr3_plus_1 and ptr4 have the same representation on typical implementations. This is not guaranteed, though, and on implementations where they have different representations, it should be clear that your code might fail.
Even on the typical implementations, while there are good arguments to be made that the same representation implies equivalent values, to the best of my knowledge, the official interpretation is that the standard does not guarantee this, therefore programs cannot rely on it, therefore implementations can assume programs do not do this and optimise accordingly.
A debugging implementation might use "fat" pointers. For example, a pointer may be represented as a tuple (address, base, size) to detect out-of-bounds access. There is absolutely nothing wrong or contrary to the standard about such representation. So any pointer arithmetic that brings the pointer outside the range of [base, base+size] fails, and any dereference outside of [base, base+size) also fails.
Note that base and size are not the address and the size of the 2D array but rather of the array that the pointer points into (the row in this case).
It might sound trivial in this case, but when deciding whether a certain pointer construction is UB or not, it is useful to mentally run your example through this hypothetical implementation.
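The hypothetical debugging implementation can be sketched in a few lines. Everything here (the fat_ptr type and the fat_* functions) is an illustrative assumption, not a real compiler feature: it models a pointer as (address, base, size) and rejects arithmetic outside [base, base+size] and loads outside [base, base+size).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "fat" pointer: address plus the bounds of the array
   (here: the row) it was derived from. */
typedef struct {
    int   *addr;  /* current position */
    int   *base;  /* first element of the originating array */
    size_t size;  /* number of elements in that array */
} fat_ptr;

static fat_ptr fat_from_array(int *first, size_t n)
{
    assert(first != NULL);
    fat_ptr p = { first, first, n };
    return p;
}

/* Pointer arithmetic: succeeds (returns 1) only if the result stays
   within [base, base + size], i.e. at most one past the end. */
static int fat_add(fat_ptr *p, ptrdiff_t n)
{
    ptrdiff_t target = (p->addr - p->base) + n;
    if (target < 0 || (size_t)target > p->size)
        return 0;
    p->addr = p->base + target;
    return 1;
}

/* Dereference: succeeds only within [base, base + size). */
static int fat_load(const fat_ptr *p, int *out)
{
    ptrdiff_t pos = p->addr - p->base;
    if (pos < 0 || (size_t)pos >= p->size)
        return 0;
    *out = *p->addr;
    return 1;
}
```

Running the question's program through this model: a pointer built from the first row with size 3 can be advanced to index 3 (one past the end), but fat_load fails there even though the same address holds arr[1][0], and advancing to index 4 fails outright.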

Is `*((*(&array + 1)) - 1)` safe to use to get the last element of an automatic array?

Suppose I want to get the last element of an automatic array whose size is unknown. I know that I can make use of the sizeof operator to get the size of the array and get the last element accordingly.
Is using *((*(&array + 1)) - 1) safe?
Like:
char array[SOME_SIZE] = { ... };
printf("Last element = %c", *((*(&array + 1)) - 1));
int array[SOME_SIZE] = { ... };
printf("Last element = %d", *((*(&array + 1)) - 1));
etc
No, it is not.
&array is of type pointer to char[SOME_SIZE] (in the first example given). This means &array + 1 points to memory immediately past the end of array. Dereferencing that, as in *(&array + 1), gives undefined behaviour.
No need to analyse further. Once there is any part of an expression that gives undefined behaviour, the whole expression does.
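For comparison, the sizeof-based approach mentioned in the question avoids the disputed expression entirely. A minimal sketch, assuming the argument is a true array (not a pointer or a function parameter) with at least one element; the macro names ARRAY_LEN and ARRAY_LAST are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in a real array (does not work on pointers). */
#define ARRAY_LEN(a)  (sizeof(a) / sizeof((a)[0]))
/* Last element of a non-empty real array. */
#define ARRAY_LAST(a) ((a)[ARRAY_LEN(a) - 1])

static int last_int_example(void)
{
    int array[] = { 10, 20, 30, 40 };
    static_assert(sizeof(array) / sizeof(array[0]) > 0, "array must not be empty");
    return ARRAY_LAST(array);   /* reads array[3] */
}
```

Only in-bounds elements are ever touched, so none of the clauses quoted in this thread come into play.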
I don't think it is safe.
From the standard, as @dasblinkenlight quoted in his answer (now removed), there is also something I would like to add:
C99 Section 6.5.6.8 -
[...]
if the expression P points to the last element of an array object, the expression (P)+1 points [...]
If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
So, as it says, we should not write *(&array + 1), as it goes one past the last element of the array, and so * should not be used.
It is also well known that dereferencing a pointer to an unauthorized memory location leads to undefined behaviour.
I believe it's undefined behavior for the reasons Peter mentions in his answer.
There is a huge debate going on about *(&array + 1). On the one hand, dereferencing &array + 1 seems to be legal because it's only changing the type from T (*)[] back to T [], but on the other hand, it's still a pointer to uninitialized, unused and unallocated memory.
My answer relies on the following:
C99 6.5.6.7 (Semantics of additive operators)
For the purposes of these operators, a pointer to an object that is
not an element of an array behaves the same as a pointer to the first
element of an array of length one with the type of the object as its
element type.
Since &array is not a pointer to an object that is an element of an array, then according to this, it means that the code is equivalent to:
char array_equiv[1][SOME_SIZE] = { ... };
/* ... */
printf("Last element = %c", *((*(&array_equiv[0] + 1)) - 1));
That is, &array is a pointer to an array of SOME_SIZE chars, so it behaves the same as a pointer to the first element of an array of length 1 where each element is an array of SOME_SIZE chars.
Now, that together with the clause that follows (already mentioned in other answers; this exact excerpt is blatantly stolen from ameyCU's answer):
C99 Section 6.5.6.8 -
[...]
if the expression P points to the last element of an array object, the expression (P)+1 points [...]
If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
Makes it pretty clear that it is UB: it's equivalent to dereferencing a pointer that points one past the last element of array_equiv.
Yes, in real world, it probably works, as in reality the original code doesn't really dereference a memory location, it's mostly a type conversion from T (*)[] to T [], but I'm pretty sure that from a strict standard-compliance point of view, it is undefined behavior.
It is probably safe, but there are some caveats.
Suppose we have
T array[LEN];
Then &array is of type T(*)[LEN].
Next, &array + 1 is again of type T(*)[LEN], pointing just past the end of the original array.
Next, *(&array + 1) is of type T[LEN], which may be implicitly converted to T*, still pointing just past the end of the original array. (So we did NOT dereference an invalid memory location: the * operator is not evaluated).
Next, *(&array + 1) - 1 is of type T*, pointing at the last array location.
Finally, we dereference this (which is legitimate if the array length is not zero): *(*(&array + 1) - 1) gives the last array element, a value of type T.
Note that the only time we actually dereference a pointer is in this last step.
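The types in the steps above can be checked without evaluating any of the expressions, because neither sizeof (for non-VLA operands) nor the controlling expression of a C11 _Generic selection is evaluated. A minimal sketch, assuming T is int and a made-up length LEN of 5; the function names are hypothetical. It only confirms the types and takes no position on whether the full expression is defined when actually evaluated.

```c
#include <assert.h>
#include <stddef.h>

enum { LEN = 5 };          /* hypothetical length, stands in for SOME_SIZE */
static int array[LEN];

/* *(&array + 1) has type int[LEN], so its size is that of the whole array.
   The operand of sizeof is not evaluated here. */
static_assert(sizeof *(&array + 1) == sizeof array,
              "*(&array + 1) has the array type");

/* &array + 1 has type int (*)[LEN]. */
static int plus_one_is_pointer_to_array(void)
{
    return _Generic(&array + 1, int (*)[LEN]: 1, default: 0);
}

/* *(&array + 1) - 1 has type int *. */
static int minus_one_is_pointer_to_element(void)
{
    return _Generic(*(&array + 1) - 1, int *: 1, default: 0);
}
```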
Now, the potential caveats.
First, *(&array + 1) formally looks like an attempt to dereference a pointer that points to an invalid memory location. But it really isn't. That's the nature of array pointers: this formal dereference only changes the type of the pointer and does not actually result in an attempt to retrieve a value from the referenced location. That is, array is of type T[LEN] but may be implicitly converted to type T *, pointing to the first element of the array; &array is a pointer to type T[LEN], pointing at the beginning of the array; *(&array + 1) is again of type T[LEN], which may be implicitly converted to type T *. At no point is a pointer actually dereferenced.
Second, &array + 1 may in fact be an invalid address, but it really isn't: My C++11 reference manual tells me explicitly that "Taking a pointer to the element one beyond the end of an array is guaranteed to work", and a similar statement is also made in K&R, so I believe it has always been standard behavior.
Finally, in case of a zero-length array, the expression dereferences the memory location just before the array, which may be unallocated/invalid. But this issue would also arise if one used a more conventional approach using sizeof() without testing for nonzero length first.
In short, I do not believe there is anything undefined or implementation-dependent about this expression's behavior.
IMHO that might work but is probably unwise. You should carefully review your software design and ask yourself why you want the last entry of the array. Is the content of the array completely unknown to you, or is it possible to define the structure in terms of C structs and unions? If that is the case, stay away from complex pointer operations on, for example, a char array, and define the data properly in your C code, in structs and unions wherever possible.
So instead of :
printf("Last element = %c", *((*(&array + 1)) - 1));
It could be :
printf("Checksum = %c", myStruct.MyUnion.Checksum);
This clarifies your code. The last letter in your array means nothing to a person not familiar with what's in this array. myStruct.MyUnion.Checksum makes sense to anyone. Studying the myStruct structure could explain the whole data structure to anyone. Please use something like that if it can be declared in such a way. If you are in the rare situation where you cannot, study the answers above; they make good sense, I think.
a)
If both the pointer operand and the result [of P + N] point to
elements of the same array object, or one past the last element of the
array object, the evaluation shall not produce an overflow;
[...]
if the expression P points either to an element of an array
object or one past the last element of an array object, and the
expression Q points to the last element of the same array object, the
expression ((Q)+1)−(P) has the same value as ((Q)−(P))+1 and as
−((P)−((Q)+1)), and has the value zero if the expression P points one
past the last element of the array object, even though the expression
(Q)+1 does not point to an element of the array object.
This states that computations using a pointer one past the last element are actually completely fine. Since some people here have written that using non-existent objects in computations is already illegal, I thought I would include that part.
Then we need to take care about this part:
If the result points one past the last element of the array object, it
shall not be used as the operand of a unary * operator that is
evaluated.
There is one important part that the other answers omitted and that is:
If the pointer operand points to an element of an array object
This is not the case. The pointer operand we dereference is not a pointer to an element of an array object, it is a pointer to a pointer. So this whole clause is completely irrelevant. But it is also stated:
For the purposes of these [additive] operators, a pointer to an object that is
not an element of an array behaves the same as a pointer to the first
element of an array of length one with the type of the object as its
element type.
What does this mean?
It means our pointer to a pointer is actually again a pointer to an array, one of length 1. And now we can close the loop, because as the first paragraph states, we are allowed to make calculations with one past the array, so we are allowed to make calculations with the array as if it were an array of length 2!
In a more graphical way:
ptr -> (ptr to int[10])[0] -> int[10]
-> (ptr to int[10])[1]
So, we are allowed to make calculations with (ptr to int[10])[1], even though it is technically outside the array of length 1.
b)
The steps that happen are:
array: ptr of type int[SOME_SIZE] to the first element of array.
&array: ptr to a ptr of type int[SOME_SIZE] to the first element of array.
+ 1: ptr, one more than the ptr of type int[SOME_SIZE] to the first element of array, to a ptr of type int. This is NOT yet a pointer to int[SOME_SIZE+1], according to C99 Section 6.5.6.8. This is NOT yet ptr + SOME_SIZE + 1.
*: We dereference the pointer to the pointer. NOW, after the dereferencing, we have a pointer according to C99 Section 6.5.6.8, which is past the last element of the array and which is not allowed to be dereferenced. This pointer is allowed to exist and we are allowed to use operators on it, except the unary * operator. But we don't use that one on that pointer yet.
- 1: Now we subtract one from the ptr of type int to one after the last element of the array, letting ptr point to the last element of the array.
*: Dereferencing a ptr to int to the last element of the array, which is legal.
c)
And last, but not least:
If it would be illegal, then the offsetof macro would be illegal, too, which is defined as:
((size_t)(&((st *)0)->m))
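The expression above is one traditional way implementations have written the macro internally; user code normally gets offsetof from <stddef.h> and does not need to spell out the null-pointer expression itself. A minimal usage sketch, with a hypothetical struct packet that is not part of the original answer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout, for illustration only. */
struct packet {
    char tag;
    int  payload;
    char checksum;
};

/* Byte offset of the checksum member, via the standard macro. */
static size_t checksum_offset(void)
{
    size_t off = offsetof(struct packet, checksum);
    assert(off < sizeof(struct packet));
    return off;
}
```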

2D Array indexing - undefined behavior?

I've recently got into some pieces of code doing some questionable 2D arrays indexing operations. Considering as an example the following code sample:
int a[5][5];
a[0][20] = 3;
a[-2][15] = 4;
a[5][-3] = 5;
Are the indexing operations above subject to undefined behavior?
It's undefined behavior, and here's why.
Multidimensional array access can be broken down into a series of single-dimensional array accesses. In other words, the expression a[i][j] can be thought of as (a[i])[j]. Quoting C11 §6.5.2.1/2:
The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2))).
This means the above is identical to *(*(a + i) + j). Following C11 §6.5.6/8 regarding addition of an integer and pointer (emphasis mine):
If both the pointer
operand and the result point to elements of the same array object, or one past the last
element of the array object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.
In other words, if a[i] is not a valid index, the behavior is immediately undefined, even if "intuitively" a[i][j] seems in-bounds.
So, in the first case, a[0] is valid, but the following [20] is not, because the type of a[0] is int[5]. Therefore, index 20 is out of bounds.
In the second case, a[-2] is already out-of-bounds, thus already UB.
In the last case, however, the expression a[5] points to one past the last element of the array, which is valid as per §6.5.6/8:
... if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object ...
However, later in that same paragraph:
If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
So, while a[5] is a valid pointer, dereferencing it will cause undefined behavior, which is caused by the final [-3] indexing (which is also out-of-bounds, therefore UB).
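If the intent behind code like a[0][20] is to address the 5x5 array by a single flat offset, the row and column can be computed explicitly so that every subscript stays in range. A minimal sketch under that assumption; the names cell, cell_flat, ROWS and COLS are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 5, COLS = 5 };

/* Returns &a[row][col], or NULL if either index is out of range.
   Indices are checked before any pointer arithmetic happens. */
static int *cell(int (*a)[COLS], ptrdiff_t row, ptrdiff_t col)
{
    assert(a != NULL);
    if (row < 0 || row >= ROWS || col < 0 || col >= COLS)
        return NULL;
    return &a[row][col];
}

/* Maps a flat offset 0 .. ROWS*COLS-1 to a (row, col) pair. */
static int *cell_flat(int (*a)[COLS], ptrdiff_t flat)
{
    if (flat < 0 || flat >= (ptrdiff_t)ROWS * COLS)
        return NULL;
    return cell(a, flat / COLS, flat % COLS);
}
```

All three accesses from the question are rejected by cell, while cell_flat(a, 20) yields &a[4][0], the element a[0][20] would "intuitively" reach.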
Indexing an array with a negative index is undefined behaviour. Unfortunately, a[-3] is the same as *(a - 3) and is accepted without warning by most compilers: the C language lets you add negative integers to a pointer, but the result must still point into the array (or one past its end). Of course, this is not even checked at runtime.
Also, there are some issues to be aware of when array declarations are mixed with pointers. You can leave only the first dimension unspecified, and no more, as in:
int a[][3][2]; /* array of unspecified size; as a function parameter this is an alias of int (*a)[3][2]; */
(indeed, as a function parameter the above is a pointer definition, not an array; just print sizeof a)
or
int a[4][3][2]; /* array of 24 integers, size is 24*sizeof(int) */
When you do this, the way the offset is evaluated differs for arrays and for pointers, so be careful. In the case of arrays, int a[I][J][K];
&a[i][j][k]
is placed at byte address
(char *)a + i*(sizeof(int)*J*K) + j*(sizeof(int)*K) + k*(sizeof(int))
but when you declare
int ***a;
then a[i][j][k] is the same as:
*(*(*(a+i)+j)+k), meaning you have to read the pointer a, add (sizeof(int **))*i to its value, dereference that, then add (sizeof(int *))*j to that value, dereference again, and add (sizeof(int))*k to that value to get the exact address of the data.
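The array case can be checked with a small sketch that compares the formula with the byte offset the compiler actually uses. The dimensions I, J, K are taken from the int a[4][3][2] example; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { I = 4, J = 3, K = 2 };

/* Byte offset of a[i][j][k] according to the formula for true arrays. */
static size_t offset_by_formula(size_t i, size_t j, size_t k)
{
    return i * (sizeof(int) * J * K) + j * (sizeof(int) * K) + k * sizeof(int);
}

/* Byte offset of a[i][j][k] measured from the start of the array.
   Only addresses are taken; no element is read. */
static size_t offset_by_address(int (*a)[J][K], size_t i, size_t j, size_t k)
{
    assert(i < I && j < J && k < K);
    return (size_t)((char *)&a[i][j][k] - (char *)a);
}
```

For an int ***a built from separately allocated blocks there is no such single formula, because each level of indexing follows a stored pointer instead of scaling an offset.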

== for pointer comparison

I quote from "The C Programming Language" by Kernighan & Ritchie:
Any pointer can be meaningfully compared for equality or inequality with zero. But the behavior is undefined for arithmetic or comparisons with pointers that do not point to members of the same array. (There is one exception: the address of the first element past the end of an array can be used in pointer arithmetic.)
Does this mean I cannot rely on == for checking equality of different pointers? What are the situations in which this comparison leads to a wrong result?
One example that comes to my mind is Harvard architecture with separate address spaces for code and for data. In computers of that architecture the compiler can store constant data in the code memory. Since the two address spaces are separate, a pointer to an address in the code memory could be numerically equal to a pointer in the data memory, without pointing to the same address.
The equality operator is defined for all valid pointers, and the only time it can give a "false positive" is when one pointer points to one element past the end of an array, and the other happens to point (or points by virtue of a structure definition) to another object stored just past the array in memory.
I think your mistake is treating K&R as normative. See the C99 standard (nice html version here: http://port70.net/~nsz/c/c99/n1256.html), 6.5.9 on the equality operator. The issue about comparisons being undefined only applies to relational operators (see 6.5.8):
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object or incomplete types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
I interpret this as follows:
short a[9];
int b[12];
short * c = a + 9;
Here it is valid to say that
c > a
because c results from a via pointer arithmetic,
but not necessarily that
b == c
or
c <= b
or something alike, because they result from different arrays, whose order and alignment in memory is not defined.
You cannot use pointer comparison for comparing pointers that point into different arrays.
So:
int arr[5] = {1, 2, 3, 4, 5};
int * p = &arr[0];
int anotherarr[] = {1, 2};
int * pf = &anotherarr[0];
You cannot do if (p == pf) since p and pf do not point into the same array. This will lead to undefined behaviour.
You can rely on pointer comparison if they point within the same array.
Not sure about the arithmetic case myself.
You can do == and != with pointers from different arrays.
<, <=, >, >= is not defined.
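The distinction can be sketched as follows: equality is used to test whether a pointer refers to an element of a given array, and a relational operator is applied only after both pointers are known to point into the same array. The function names points_into and comes_before are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Uses only ==, which is defined for pointers into different arrays:
   returns 1 if p points at one of arr[0] .. arr[len-1], else 0. */
static int points_into(const int *p, const int *arr, size_t len)
{
    assert(arr != NULL);
    for (size_t i = 0; i < len; i++) {
        if (p == arr + i)
            return 1;
    }
    return 0;
}

/* Uses <, which is only defined when both pointers refer to the same
   array (or one past its end); the assert documents that precondition. */
static int comes_before(const int *p, const int *q, const int *arr, size_t len)
{
    assert(points_into(p, arr, len) && points_into(q, arr, len));
    return p < q;
}
```

The linear scan in points_into is slow but portable; it is meant to show the rule, not to be an efficient bounds check.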

Pointer arithmetic in c and array bounds

I was browsing a web page with some C FAQs, and I found this statement:
Similarly, if a has 10 elements and ip
points to a[3], you can't compute or
access ip + 10 or ip - 5. (There is
one special case: you can, in this
case, compute, but not access, a
pointer to the nonexistent element
just beyond the end of the array,
which in this case is &a[10].)
I was confused by the statement
you can't compute ip + 10
I can understand that accessing an element out of bounds is undefined, but computing?!
I wrote the following snippet which computes (let me know if this is what the website meant by computing) a pointer out-of-bounds.
#include <stdio.h>
int main()
{
int a[10], i;
int *p;
for (i = 0; i<10; i++)
a[i] = i;
p = &a[3];
printf("p = %p and p+10 = %p\n", (void *)p, (void *)(p+10)); /* %p expects void * */
return 0;
}
$ ./a.out
p = 0xbfa53bbc and p+10 = 0xbfa53be4
We can see that p + 10 points 10 elements (40 bytes) past p. So what exactly does the statement on the web page mean? Did I misinterpret something?
Even in K&R (A.7.7) this statement is made:
The result of the + operator is the
sum of the operands. A pointer to an
object in an array and a value of any
integral type may be added. ... The
sum is a pointer of the same type as
the original pointer, and points to
another object in the same array,
appropriately offset from the original
object. Thus if P is a pointer to an
object in an array, the expression P+1
is a pointer to the next object in the
array. If the sum pointer points
outside the bounds of the array,
except at the first location beyond
the high end, the result is
undefined.
What does "undefined" mean here? Does it mean the sum is undefined, or only that the behavior is undefined when we dereference it? Is the operation undefined even when we do not dereference the result and just calculate a pointer to an out-of-bounds element?
Undefined behavior means exactly that: absolutely anything could happen. It could succeed silently, it could fail silently, it could crash your program, it could blue screen your OS, or it could erase your hard drive. Some of these are not very likely, but all of them are permissible behaviors as far as the C language standard is concerned.
In this particular case, yes, the C standard is saying that even computing the address of a pointer outside of valid array bounds, without dereferencing it, is undefined behavior. The reason it says this is that there are some arcane systems where doing such a calculation could result in a fault of some sort. For example, you might have an array at the very end of addressable memory, and constructing a pointer beyond that would cause an overflow in a special address register which generates a trap or fault. The C standard wants to permit this behavior in order to be as portable as possible.
In reality, though, you'll find that constructing such an invalid address without dereferencing it behaves predictably on the vast majority of systems you'll come across in common usage. Creating an invalid memory address will usually have no ill effects unless you attempt to dereference it. But of course, it's better to avoid creating those invalid addresses so that your code will work perfectly even on those arcane systems.
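One way to avoid creating such addresses is to do the range check on integer offsets first and only then form the pointer. A minimal sketch, assuming the caller knows the array's base and length; the name advance_checked is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Returns p + n if the result stays within [base, base + len]
   (one past the end is allowed), otherwise NULL. The check is done on
   integer offsets, so no out-of-range pointer is ever computed. */
static int *advance_checked(int *base, size_t len, int *p, ptrdiff_t n)
{
    ptrdiff_t pos = p - base;                 /* p must already be in range */
    ptrdiff_t room_up = (ptrdiff_t)len - pos; /* steps left up to one past the end */
    assert(pos >= 0 && room_up >= 0);
    if (n < -pos || n > room_up)
        return NULL;
    return p + n;
}
```

With int a[10] and ip pointing to a[3], as in the FAQ text, ip + 10 and ip - 5 are rejected, while ip + 7 (that is, &a[10]) is accepted as the one-past-the-end pointer.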
The web page wording is confusing, but technically correct. The C99 language specification (section 6.5.6) discusses additive expressions, including pointer arithmetic. Subitem 8 specifically states that computing a pointer one past the end of an array shall not cause an overflow, but beyond that the behavior is undefined.
In a more practical sense, C compilers will generally let you get away with it, but what you do with the resulting value is up to you. If you try to dereference the resulting pointer to a value, as K&R states, the behavior is undefined.
Undefined, in programming terms, means "Don't do that." Basically, it means the specification that defines how the language works does not define an appropriate behavior in that situation. As a result, theoretically anything can happen. Generally all that happens is you have a silent or noisy (segfault) bug in your program, but many programmers like to joke about other possible results from causing undefined behavior, like deleting all of your files.
The behaviour would be undefined in the following case
int a[3];
(a + 10) ; // this is UB too as you are computing &a[10]
*(a+10) = 10; // Ewwww!!!!
