I've the following program:
#include <stdio.h>
int main() {
    int v[100];
    int *p;
    for (p = &(v[0]); p != &(v[100]); ++p)
        if ((*p = getchar()) == EOF) {
            --p;
            break;
        }
    while (p != v)
        putchar(*--p);
    return 0;
}
And this is the output of gcc --version on the terminal:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.3.0
Thread model: posix
Why does taking the address of the element one past the last element of an array give me no warning, while taking, for example, the address of v[101] gives me the following warning?
test.c:8:29: warning: array index 101 is past the end of the array (which
contains 100 elements) [-Warray-bounds]
for(p = &(v[0]); p != &(v[101]); ++p)
^ ~~~
test.c:5:5: note: array 'v' declared here
int v[100];
^
1 warning generated.
I know that indexing elements out of the bounds of a buffer is undefined behaviour, so why isn't the compiler complaining about the first case?
Moving a pointer to one past the last element of an array is allowed as long as you do not dereference the pointer, so your program is valid if one or more characters are read before hitting EOF.
N1256 6.5.2.1 Array subscripting
The definition of the subscript operator []
is that E1[E2] is identical to (*((E1)+(E2))).
N1256 6.5.3.2 Address and indirection operators
If the operand is the result of a unary * operator,
neither that operator nor the & operator is evaluated and the result is as if both were
omitted, except that the constraints on the operators still apply and the result is not an
lvalue. Similarly, if the operand is the result of a [] operator, neither the & operator nor
the unary * that is implied by the [] is evaluated and the result is as if the & operator
were removed and the [] operator were changed to a + operator.
N1256 6.5.6 Additive operators
Moreover, if the expression P points to the last
element of an array object, the expression (P)+1 points one past the last element of the
array object, and if the expression Q points one past the last element of an array object,
the expression (Q)-1 points to the last element of the array object
It's about compatibility with sloppily written code.
As MikeCAT cited, for an array int ar[N], the expression ar+N is valid and results in a pointer that points to the past-the-end position. While this pointer cannot be dereferenced, it can be compared to any other pointer into the array, which allows you to write the nice for (p = ar; p != ar+N; ++p) loop.
Also, programmers like to write readable code, and arguably, if you want a pointer to the ith element of an array, writing &ar[i] conveys your intention more clearly than writing ar + i.
Combine these two, and you will get programmers who write &ar[N] to get the past-the-end pointer, and while this is technically accessing an invalid array index, no compiler will ever implement this as anything other than ar + N. In fact, the compiler would have to go out of its way to do it differently. Quite far, in fact.
So, since any compiler that doesn't reason very strictly about undefined behavior will do the thing programmers expect for the expression, there's no reason not to write it, and so lots of people wrote it. And now we have massive code bases that use this idiom, which means that even modern compilers with their value tracking and reasoning about undefined behavior have to support this idiom for compatibility. And since Clang's warnings are meant to be useful, this particular warning was written so as to not warn about a case that will work anyway, out of some sense of misplaced pedantry.
In C, suppose for a pointer p we do *p++ = 0. If p points to an int variable, is this defined behavior?
You can do arithmetic resulting in pointing one past the end of an "array object" per the standard, but I am unable to find a really precise definition of "array object" in the standard. I don't think in this context it means just an object explicitly defined as an array, because p=malloc(sizeof(int)); ++p; pretty clearly is intended to be defined behavior.
If a variable does not qualify as an "array object", then as far as I can tell *p++ = 0 is undefined behavior.
I am using the C23 draft, but an answer citing the C11 standard would probably answer the question too.
Yes it is well-defined. Pointer arithmetic is defined by the additive operators so that's where you need to look.
C17 6.5.6/7
For the purposes of these operators, a pointer to an object that is not an element of an array behaves
the same as a pointer to the first element of an array of length one with the type of the object as its
element type.
That is, int x; is to be regarded as equivalent to int x[1]; for the purpose of determining valid pointer arithmetic.
Given int x; int* p = &x; *p++ = 0; then it is fine to point 1 item past it but not to de-reference that item:
C17 6.5.6/8
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation
shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
This behavior has not changed in the various revisions of the standard. It's the very same from C90 to C23.
There are two separate questions: 1. What constructs does the Standard specify that correct conforming implementations should process meaningfully, and 2. What constructs do clang and gcc actually process meaningfully. The clear intention of the Standard is to define the behavior of a pointer "one past" an array object and a pointer to the start of another array object that happens to immediately follow it. The actual behavior of clang and gcc tells another story, however.
Given the source code:
#include <stdint.h>
extern int x[],y[];
int test1(int *p)
{
    y[0] = 1;
    if (p == x+1)
        *p = 2;
    return y[0];
}
int test2(int *p)
{
    y[0] = 1;
    uintptr_t p1 = 3*(uintptr_t)(x+1);
    uintptr_t p2 = 5*(uintptr_t)p;
    if (5*p1 == 3*p2)
        *p = 2;
    return y[0];
}
both clang and gcc will recognize in both functions that the *p=2 assignment will only run if p happens to be equal to a one-past pointer to x, and will conclude as a consequence that it would be impossible for p to equal y. Construction of an executable example where clang and gcc would erroneously make this assumption is difficult without the ability to execute a program containing two compilation units, but examination of the generated machine code at https://godbolt.org/z/x78GMqbrv will reveal that every ret instruction is immediately preceded by mov eax,1, which loads the return value with 1.
Note that the code in test2 doesn't compare pointers, nor even compare integers that are directly formed from pointers, but the fact that clang and gcc are able to show that the numbers being compared can only be equal if the pointers happened to be equal is sufficient for test2() to, as perceived by clang or gcc, invoke UB if the function is passed a pointer to y, and y happens to equal x+1.
I was playing around with some arrays and pointers in C and started wondering whether doing this would be undefined behavior.
int (*arr)[5] = malloc(sizeof(int[5][5]));
// Is this undefined behavior?
int val0 = arr[0][5];
// Rephrased, is it guaranteed it'll always have the same effect as this line?
int val1 = arr[1][0];
Thank you for any insights.
In C, what you're doing is undefined behavior.
The expression arr[0] has type int [5]. So the expression arr[0][5] dereferences one element past the end of the array arr[0], and dereferencing past the end of an array is undefined behavior.
Section 6.5.2.1p2 of the C standard regarding Array Subscripting states:
The definition of the subscript operator [] is that E1[E2] is identical to (*((E1)+(E2))).
And section 6.5.6p8 of the C standard regarding Additive Operators states:
When an expression that has integer type is added to or
subtracted from a pointer, the result has the type of the pointer
operand. If the pointer operand points to an element of an array
object, and the array is large enough, the result points to an element
offset from the original element such that the difference of the
subscripts of the resulting and original array elements equals the
integer expression. In other words, if the expression P points to
the i-th element of an array object, the expressions (P)+N
(equivalently, N+(P)) and (P)-N (where N has the value n)
point to, respectively, the i+n-th and i−n-th elements of the
array object, provided they exist. Moreover, if the
expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array
object, and if the expression Q points one past the last
element of an array object, the expression (Q)-1 points to the
last element of the array object. If both the pointer operand
and the result point to elements of the same array object,
or one past the last element of the array object, the evaluation
shall not produce an overflow; otherwise, the behavior is undefined.
If the result points one past the last element of the array object, it
shall not be used as the operand of a unary * operator that is
evaluated.
The last two sentences of that passage specify that the addition implicit in an array subscript may not result in a pointer more than one element past the end of an array, and that a pointer to one element past the end of an array may not be dereferenced.
The fact that the array in question is itself a member of an array, meaning the elements of each subarray are contiguous in memory, doesn't change this. Aggressive optimization settings in the compiler may note that it is undefined behavior to access past the end of the array and make optimizations based on this fact.
The Standard is clearly intended to avoid requiring that a compiler given something like:
int foo[5][10];
int test(int i)
{
    foo[1][0] = 1;
    foo[0][i] = 2;
    return foo[1][0];
}
must reload the value of foo[1][0] to accommodate the possibility that the write to foo[0][i] might affect foo[1][0]. On the other hand, before the Standard was written, it would have been idiomatic to write something like:
void dump_array(int *p, int rows, int cols)
{
    int i, j;
    for (i = 0; i < rows; i++)
    {
        for (j = 0; j < cols; j++)
            printf("%6d", *p++);
        printf("\n");
    }
}
int foo[5][10];
...
dump_array(foo[0], 5, 10);
and nothing in the published Rationale suggests that the authors had any intention of forbidding such constructs nor breaking code that used them. Indeed, the primary benefit of requiring that rows of an array be placed consecutively, even when adding padding would improve efficiency, is to allow such code to function.
At the time the Standard was written, when generating code for a function that received a pointer, compilers would treat the pointer as though it might identify some arbitrary part of some arbitrary larger object, without making any effort to know or care about what that enclosing object might be. They would thus, as a very popular form of "conforming language extension", support constructs like dump_array without regard for whether the Standard required them to do so, and consequently the authors of the Standard saw no reason to worry about when the Standard mandated such support. Instead, they left such matters as a Quality of Implementation issue over which the Standard could waive jurisdiction.
Unfortunately, because the authors of the Standard expected that compilers would treat the act of passing a pointer to a function as implicitly "laundering" it, the authors of the Standard saw no need to define any explicit method for laundering information about a pointer's enclosing objects in cases where it would be necessary for a function to treat a pointer identifying "raw" storage. Such distinctions didn't matter given the state of compiler technology in the 1980s, but may be quite relevant if e.g. code does something like:
int matrix[10][10];
void test2(int c)
{
    matrix[4][0] = 1;
    dump_array(matrix[0], 1, c);
    matrix[4][0] = 2;
}
or
void test3(int r)
{
    matrix[4][0] = 1;
    dump_array((int*)matrix, r, 10);
    matrix[4][0] = 2;
}
Depending upon what the function is intended to do, having a compiler optimize out the first write to matrix[4][0] in one or both may improve efficiency, or it may cause the generated code to behave uselessly. Treating explicit pointer conversions as erasing type information, but treating array-to-pointer decay as retaining it, would allow programmers to achieve required semantics if they write code as in the second example, while allowing compilers to perform the relevant optimizations when source code is written as in the first example. Unfortunately, the Standard makes no distinctions, and maintainers of free compilers are loath to forgo any "optimizations" they view the Standard as giving them, leaving the language with nothing but "hope for the best" semantics except on implementations that either refrain from cross-procedural optimizations or document what needs to be done to block them.
Decrementing a NULL pointer on my machine still gives a NULL pointer. I wonder if this is well defined.
char *p = NULL;
--p;
No, it is not well defined: the behavior is undefined.
--p is equivalent to p = p - 1 (except that p is only evaluated once, which doesn't matter in this case).
N1570 6.5.6 paragraph 8, discussing additive operators, says:
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
[...]
If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.
Since your pointer value p doesn't point to an element of an array object or one past the last element of an array object, the behavior of p - 1 is undefined.
(Incidentally, I'd be surprised if your code caused p to be a null pointer -- though since the behavior is undefined the language certainly permits it. I can imagine an optimizing compiler ignoring the --p; because it knows its behavior is undefined, but I haven't seen that myself. How do you know p is null?)
As far as I see with GCC it does not generate a null pointer. Decrementing is just subtracting a number. With underflow the number just wraps around. You can see that here.
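#include <stdio.h>
#include <inttypes.h>
int main()
{
    char *p = NULL;
    /* uintptr_t is printed with PRIxPTR; %zx is only for size_t */
    printf("%" PRIxPTR "\n", (uintptr_t)p);
    --p; /* undefined behavior; kept only to observe what GCC happens to do */
    printf("%" PRIxPTR "\n", (uintptr_t)p);
}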
Output is
0
ffffffffffffffff
https://wandbox.org/permlink/gNzc38RWGSBi9tS3
After hunting for a related or duplicate question concerning the following to no avail (I can only do marginal justice to describe the sheer number of pointer-arithmetic and post-decrement questions tagged with C, but suffice it to say "boatloads" does a grave injustice to that result set count) I toss this in the ring in hopes of clarification or a referral to a duplicate that eluded me.
If the post-decrement operator is applied to a pointer such as below, a simple reverse-iteration of an array sequence, does the following code invoke undefined behavior?
#include <stdio.h>
#include <string.h>
int main()
{
    char s[] = "some string";
    const char *t = s + strlen(s);
    while(t-->s)
        fputc(*t, stdout);
    fputc('\n', stdout);
    return 0;
}
It was recently proposed to me that 6.5.6p8, Additive operators, in conjunction with 6.5.2.4, Postfix increment and decrement operators, specifies that even performing a post-decrement upon t when it already contains the base address of s invokes undefined behavior, regardless of whether the resulting value of t (not the t-- expression result) is evaluated or not. I simply want to know if that is indeed the case.
The cited portions of the standard were:
6.5.6 Additive Operators
If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined.
and its nearly tightly coupled relationship with...
6.5.2.4 Postfix increment and decrement operators Constraints
The operand of the postfix increment or decrement operator shall have
atomic, qualified, or unqualified real or pointer type, and shall be a
modifiable lvalue.
Semantics
The result of the postfix ++ operator is the value of the operand. As a side effect, the value of the operand object is incremented (that is, the value 1 of the appropriate type is added to it). See the discussions of additive operators and compound assignment for information on constraints, types, and conversions and the effects of operations on pointers. The value computation of the result is sequenced before the side effect of updating the stored value of the operand. With respect to an indeterminately-sequenced function call, the operation of postfix ++ is a single evaluation. Postfix ++ on an object with atomic type is a read-modify-write operation with memory_order_seq_cst memory order semantics.
The postfix -- operator is analogous to the postfix ++ operator, except that the value of the operand is decremented (that is, the value 1 of the appropriate type is subtracted from it).
Forward references: additive operators (6.5.6), compound assignment (6.5.16.2).
The very reason for using the post-decrement operator in the posted sample is to avoid evaluating an eventually-invalid address value against the base address of the array. For example, the code above was a refactor of the following:
#include <stdio.h>
#include <string.h>
int main()
{
    char s[] = "some string";
    size_t len = strlen(s);
    char *t = s + len - 1;
    while(t >= s)
    {
        fputc(*t, stdout);
        t = t - 1;
    }
    fputc('\n', stdout);
}
Forgetting for a moment this has a non-zero-length string for s, this general algorithm clearly has issues (perhaps not as clearly to some). If s[] were instead "", then t would be assigned a value of s-1, which itself is not in the valid range of s through its one-past address, and the evaluation for comparison against s that ensues is no good. If s has non-zero length, that addresses the initial s-1 problem, but only temporarily, as eventually this is still counting on that value (whatever it is) being valid for comparison against s to terminate the loop. It could be worse. It could have naively been:
size_t len = strlen(s) - 1;
char *t = s + len;
This has disaster written all over it if s were a zero-length string. The refactored code this question opened with was intended to address all of these issues. But...
My paranoia may be getting to me, but it isn't paranoia if they're really all out to get you. So, per the standard (these sections, or perhaps others), does the original code (scroll to the top of this novel if you forgot what it looks like by now) indeed invoke undefined behavior or not?
I am pretty certain that the result of the post-decrement in this case is indeed undefined behaviour. The post-decrement clearly subtracts one from a pointer to the beginning of an object, so the result does not point to an element of the same array, and by the definition of pointer arithmetic (§6.5.6/8, as cited in the OP) that's undefined behaviour. The fact that you never use the resulting pointer is irrelevant.
What's wrong with:
char *t = s + strlen(s);
while (t > s) fputc(*--t, stdout);
Interesting but irrelevant fact: The implementation of reverse iterators in the standard C++ library usually holds in the reverse iterator a pointer to one past the target element. This allows the reverse iterator to be used normally without ever involving a pointer to "one before the beginning" of the container, which would be UB, as above.
There are tons of code like this one:
#include <stdio.h>
int main(void)
{
    int a[2][2] = {{0, 1}, {2, -1}};
    int *p = &a[0][0];
    while (*p != -1) {
        printf("%d\n", *p);
        p++;
    }
    return 0;
}
But based on this answer, the behavior is undefined.
N1570. 6.5.6 p8:
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
In other words, if the expression P points to the i-th element of an
array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N
(where N has the value n) point to, respectively, the i+n-th and
i−n-th elements of the array object, provided they exist. Moreover,
if the expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array object,
and if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the array
object. If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the last element
of the array object, it shall not be used as the operand of a unary
* operator that is evaluated.
Can someone explain this in detail?
The array whose base address (pointer to first element) is assigned to p is of type int[2]. This means the address in p can legally be dereferenced only at locations *p and *(p+1), or if you prefer subscript notation, p[0] and p[1]. Furthermore, p+2 is guaranteed to be legally evaluable as an address, and comparable to other addresses in that sequence, but cannot be dereferenced. This is the one-past address.
The code you posted violates the one-past rule by dereferencing p once it passes the last element in the array in which it is homed. That the array in which it is homed is buttressed up against another array of similar dimension is not relevant to the formal definition cited.
That said, in practice it works, but as is often said, observed behavior is not, and should never be considered, defined behavior. Just because it works doesn't make it right.
The object representation of pointers is opaque, in C. There is no prohibition against pointers having bounds information encoded. That's one possibility to keep in mind.
More practically, implementations are also able to achieve certain optimizations based on assumptions which are asserted by rules like these: Aliasing.
Then there's the protection of programmers from accidents.
Consider the following code, inside a function body:
struct {
    char c;
    int i;
} foo;
char * cp1 = (char *) &foo;
char * cp2 = &foo.c;
Given this, cp1 and cp2 will compare as equal, but their bounds are nonetheless different. cp1 can point to any byte of foo and even to "one past" foo, but cp2 can only point to "one past" foo.c, at most, if we wish to maintain defined behaviour.
In this example, there might be padding between the foo.c and foo.i members. While the first byte of that padding co-incides with "one past" the foo.c member, cp2 + 2 might point into the other padding. The implementation can notice this during translation and instead of producing a program, it can advise you that you might be doing something you didn't think you were doing.
By contrast, if you read the initializer for the cp1 pointer, it intuitively suggests that it can access any byte of the foo structure, including padding.
In summary, this can produce undefined behaviour during translation (a warning or error) or during program execution (by encoding bounds information); there's no difference, standard-wise: The behaviour is undefined.
You can cast your pointer into a pointer to array to ensure the correct array semantics.
This code is indeed not defined but provided as a C extension in every compiler in common usage today.
However the correct way of doing it would be to cast the pointer into a pointer to array as so:
((int (*)[2])p)[0][0]
to get the zeroth element or say:
((int (*)[2])p)[1][1]
to get the last.
To be strict, the reason I think this is illegal is that you are breaking strict aliasing: pointers to different types may not point to the same address (variable).
In this case you are creating a pointer to an array of ints and a pointer to an int and pointing them to the same value, this is not allowed by the standard as the only type that may alias another pointer is a char * and even this is rarely used properly.