It is said in C that when pointers refer to the same array or one element past the end of that array the arithmetics and comparisons are well defined. Then what about one before the first element of the array? Is it okay so long as I do not dereference it?
Given
int a[10], *p;
p = a;
(1) Is it legal to write --p?
(2) Is it legal to write p-1 in an expression?
(3) If (2) is okay, can I assert that p-1 < a?
There is some practical concern for this. Consider a reverse() function that reverses a C-string that ends with '\0'.
#include <stdio.h>
void reverse(char *p)
{
char *b, t;
b = p;
while (*p != '\0')
p++;
if (p == b) /* Do I really need */
return; /* these two lines? */
for (p--; b < p; b++, p--)
t = *b, *b = *p, *p = t;
}
int main(void)
{
char a[] = "Hello";
reverse(a);
printf("%s\n", a);
return 0;
}
Do I really need to do the check in the code?
Please share your ideas from language-lawyer/practical perspectives, and how you would cope with such situations.
(1) Is it legal to write --p?
It's "legal" as in the C syntax allows it, but it invokes undefined behavior. For the purpose of finding the relevant section in the standard, --p is equivalent to p = p - 1 (except p is only evaluated once). Then:
C17 6.5.6/8
If both the pointer
operand and the result point to elements of the same array object, or one past the last
element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
The evaluation invokes undefined behavior, meaning it doesn't matter if you de-reference the pointer or not - you already invoked undefined behavior.
Furthermore:
C17 6.5.6/9:
When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object;
If your code violates a "shall" in the ISO standard, it invokes undefined behavior.
(2) Is it legal to write p-1 in an expression?
Same as (1), undefined behavior.
As for examples of how this could cause problems in practice: imagine that the array is placed at the very beginning of a valid memory page. When you decrement outside that page, there could be a hardware exception or a pointer trap representation. This isn't a completely unlikely scenario for microcontrollers, particularly when they are using segmented memory maps.
The use of that kind of pointer arithmetic is bad coding practice, as it might lead to a significant bunch of hard to debug problems.
I only had to use this kind of thing once in more than 20 years. I was writing a call-back function, but I did not have access to the proper data. The calling function provided a pointer inside a proper array, and I needed the byte just before that pointer.
Considering that I had access to the entire source code, and I verified the behavior several times to prove that I get what I need, and I had it reviewed by other colleagues, I decided it is OK to let it go to production.
The proper solution would have been to change the caller function to return the proper pointer, but that was not feasible, time and money considering (that part of the software was licensed from a third-party).
So, a[-1] is possible, but should be used ONLY with very good care in very particular situations. Otherwise, there is no good reason to do that kind of self-hurting Voodoo ever.
Note: at a proper analysis, in my example, it is obvious that I did not access an element before the beginning of a proper array, but the element before a pointer, which was guaranteed to be inside of the same array.
Referring to the code provided:
it is NOT OK to use p[-1] with reverse(a);;
it is OK(-ish) to use it with reverse(a+1);, because you remain inside the array.
Related
Apologies for the loopy question wording, but it boils down to this. I'm maintaining C code that does something like this:
char *foo = "hello"
int i;
char *bar;
i = strlen(foo);
bar = &foo[i];
Is this safe? Could there be a case where the foo[i] leads to a segmentation fault, before it gets a pointer, and isn't covered by the compiler?
Yes, you are allowed to set a pointer one beyond the end of an array (although you are not actually doing that in your code - you are pointing to the NUL terminator).
Note though that the behaviour on dereferencing a pointer set to one past the end of an array would be undefined.
(Note that &foo[i] is required by the C standard to be evaluated without dereferencing the pointer: i.e. &foo[i + 1] is valid, but foo[i + 1] on its own isn't.)
I'm maintaining C code that does something like this:
char *foo = "hello"
int i;
char *bar;
i = strlen(foo);
bar = &foo[i];
Is this safe?
The code presented conforms with the current and all past C language standards, and has the same well-defined behavior according to each. In particular, be aware that C string literals represent null-terminated character arrays, just like all other C strings, so in the example code. foo[i] refers to the terminator, and &foo[i] points into the array (to its last element), not outside it.
Moreover, C also allows obtaining a pointer to just past the end of an array, so even 1 + &foo[i] is valid and conforming. It is not permitted to dereference such a pointer, but you can perform pointer arithmetic and comparisons with it (subject to the normal constraints).
Could there be a case where the foo[i] leads to a
segmentation fault, before it gets a pointer, and isn't covered by the
compiler?
C does not have anything whatever to say about segmentation faults. Whenever you receive one, either your implementation has exhibited non-conformance, or (almost always) your program is exercising undefined behavior. The code presented conforms, subject to a correct declaration of strlen() being in scope, and if it appears as part of a complete program that conforms and does not otherwise exercise UB then there is no reason short of general skepticism to fear a segfault.
I followed the discussion on One-byte-off pointer still valid in C?.
The gist of that discussion, as far as I could gather, was that if you have:
char *p = malloc(4);
Then it is OK to get pointers up to p+4 by using pointer arithmetic. If you get a pointer by using p+5, then the behavior is undefined.
I can see why dereferencing p+5 could cause undefined behavior. But undefined behavior using just pointer arithmetic?
Why would the arithmetic operators + and - not be valid operations? I don’t see any harm by adding or subtracting a number from a pointer. After all, a pointer is a represented by a number that captures the address of an object.
Of course, I was not in the standardization committee :) I am not privy to the discussions they had before codifying the standard. I am just curious. Any insight will be useful.
The simplest answer is that it is conceivable that a machine traps integer overflow. If that were the case, then any pointer arithmetic which wasn't confined to a single storage region might cause overflow, which would cause a trap, disrupting execution of the program. C shouldn't be obliged to check for possible overflow before attempting pointer arithmetic, so the standard allows a C implementation on such a machine to just allow the trap to happen, even if chaos ensues.
Another case is an architecture where memory is segmented, so that a pointer consists of a segment address (with implicit trailing 0s) and an offset. Any given object must fit in a single segment, which means that valid pointer arithmetic can work only on the offset. Again, overflowing the offset in the course of pointer arithmetic might produce random results, and the C implementation is under no obligation to check for that.
Finally, there may well be optimizations which the compiler can produce on the assumption that all pointer arithmetic is valid. As a simple motivating case:
if (iter - 1 < object.end()) {...}
Here the test can be omitted because it must be true for any pointer iter whose value is a valid position in (or just after) object. The UB for invalid pointer arithmetic means that the compiler is not under any obligation to attempt to prove that iter is valid (although it might need to ensure that it is based on a pointer into object), so it can just drop the comparison and proceed to generate unconditional code. Some compilers may do this sort of thing, so watch out :)
Here, by the way, is the important difference between unspecified behaviour and undefined behaviour. Comparing two pointers (of the same type) with == is defined regardless of whether they are pointers into the same object. In particular, if a and b are two different objects of the same type, end_a is a pointer to one-past-the-end of a and begin_b is a pointer to b, then
end_a == begin_b
is unspecified; it will be 1 if and only if b happens to be just after a in memory, and otherwise 0. Since you can't normally rely on knowing that (unless a and b are array elements of the same array), the comparison is normally meaningless; but it is not undefined behaviour and the compiler needs to arrange for either 0 or 1 to be produced (and moreover, for the same comparison to consistently have the same value, since you can rely on objects not moving around in memory.)
One case I can think of where the result of a + or - might give unexpected results is in the case of overflow or underflow.
The question you refer to points out that for p = malloc(4) you can do p+4 for comparison. One thing this needs to guarantee is that p+4 will not overflow. It doesn't guarantee that p+5 wont overflow.
That is to say that the + or - themselves wont cause any problems, but there is a chance, however small, that they will return a value that is unsuitable for comparison.
Performing basic +/- arithmetic on a pointer will not cause a problem. The order of pointer values is sequential: &p[0] < &p[1] < ... &p[n] for a type n objects long. But pointer arithmetic outside this range is not defined. &p[-1] may be less or greater than &p[0].
int *p = malloc(80 * sizeof *p);
int *q = p + 1000;
printf("p:%p q:%p\n", p, q);
Dereferencing pointers outside their range or even inside the memory range, but unaligned is a problem.
printf("*p:%d\n", *p); // OK
printf("*p:%d\n", p[79]); // OK
printf("*p:%d\n", p[80]); // Bad, but &p[80] will be greater than &p[79]
printf("*p:%d\n", p[-1]); // Bad, order of p, p[-1] is not defined
printf("*p:%d\n", p[81]); // Bad, order of p[80], p[81] is not defined
char *r = (char*) p;
printf("*p:%d\n", *((int*) (r + 1)) ); // Bad
printf("*p:%d\n", *q); // Bad
Q: Why is p[81] undefined behavior?
A: Example: memory runs 0 to N-1. char *p has the value N-81. p[0] to p[79] is well defined. p[80] is also well defined. p[81] would need to the value N to be consistent, but that overflows so p[81] may have the value 0, N or who knows.
A couple of things here, the reason p+4 would be valid in such a case is because iteration to one past the last position is valid.
p+5 would not be a problem theoretically, but according to me the problem will be when you will try to dereference (p+5) or maybe you will try to overwrite that address.
I was browsing through a webpage which had some c FAQ's, I found this statement made.
Similarly, if a has 10 elements and ip
points to a[3], you can't compute or
access ip + 10 or ip - 5. (There is
one special case: you can, in this
case, compute, but not access, a
pointer to the nonexistent element
just beyond the end of the array,
which in this case is &a[10].
I was confused by the statement
you can't compute ip + 10
I can understand accessing the element out of bounds is undefined, but computing!!!.
I wrote the following snippet which computes (let me know if this is what the website meant by computing) a pointer out-of-bounds.
#include <stdio.h>
int main()
{
int a[10], i;
int *p;
for (i = 0; i<10; i++)
a[i] = i;
p = &a[3];
printf("p = %p and p+10 = %p\n", p, p+10);
return 0;
}
$ ./a.out
p = 0xbfa53bbc and p+10 = 0xbfa53be4
We can see that p + 10 is pointing to 10 elements(40 bytes) past p. So what exactly does the statement made in the webpage mean. Did I mis-interpret something.
Even in K&R (A.7.7) this statement is made:
The result of the + operator is the
sum of the operands. A pointer to an
object in an array and a value of any
integral type may be added. ... The
sum is a pointer of the same type as
the original pointer, and points to
another object in the same array,
appropriately offset from the original
object. Thus if P is a pointer to an
object in an array, the expression P+1
is a pointer to the next object in the
array. If the sum pointer points
outside the bounds of the array,
except at the first location beyond
the high end, the result is
undefined.
What does being "undefined" mean. Does this mean the sum will be undefined, or does it only mean when we dereference it the behavior is undefined. Is the operation undefined even when we do not dereference it and just calculate the pointer to element out-of-bounds.
Undefined behavior means exactly that: absolutely anything could happen. It could succeed silently, it could fail silently, it could crash your program, it could blue screen your OS, or it could erase your hard drive. Some of these are not very likely, but all of them are permissible behaviors as far as the C language standard is concerned.
In this particular case, yes, the C standard is saying that even computing the address of a pointer outside of valid array bounds, without dereferencing it, is undefined behavior. The reason it says this is that there are some arcane systems where doing such a calculation could result in a fault of some sort. For example, you might have an array at the very end of addressable memory, and constructing a pointer beyond that would cause an overflow in a special address register which generates a trap or fault. The C standard wants to permit this behavior in order to be as portable as possible.
In reality, though, you'll find that constructing such an invalid address without dereferencing it has well-defined behavior on the vast majority of systems you'll come across in common usage. Creating an invalid memory address will have no ill effects unless you attempt to dereference it. But of course, it's better to avoid creating those invalid addresses so that your code will work perfectly even on those arcane systems.
The web page wording is confusing, but technically correct. The C99 language specification (section 6.5.6) discusses additive expressions, including pointer arithmetic. Subitem 8 specifically states that computing a pointer one past the end of an array shall not cause an overflow, but beyond that the behavior is undefined.
In a more practical sense, C compilers will generally let you get away with it, but what you do with the resulting value is up to you. If you try to dereference the resulting pointer to a value, as K&R states, the behavior is undefined.
Undefined, in programming terms, means "Don't do that." Basically, it means the specification that defines how the language works does not define an appropriate behavior in that situation. As a result, theoretically anything can happen. Generally all that happens is you have a silent or noisy (segfault) bug in your program, but many programmers like to joke about other possible results from causing undefined behavior, like deleting all of your files.
The behaviour would be undefined in the following case
int a[3];
(a + 10) ; // this is UB too as you are computing &a[10]
*(a+10) = 10; // Ewwww!!!!
Actually, memcpy works just fine when I use pointers to characters, but stops working when I use pointers to pointers to characters.
Can somebody please help me understand why memcpy fails here, or better yet, how I could have figured it out myself. I am finding it very difficult to understand the problems arising in my c/c++ code.
char *pc = "abcd";
char **ppc = &pc;
char **ppc2 = &pc;
setStaticAndDynamicPointers(ppc, ppc2);
char c;
c = (*ppc)[1];
assert(c == 'b'); // assertion doesn't fail.
memcpy(&c,&(*ppc[1]),1);
if(c!='b')
puts("memcpy didn't work."); // this gets printed out.
c = (*ppc2)[3];
assert(c=='d'); // assertion doesn't fail.
memcpy(&c, &(*ppc2[3]), 1);
if(c != 'd')
puts("memcpy didn't work again.");
memcpy(&c, pc, 1);
assert(c == 'a'); // assertion doesn't fail, even though used memcpy
void setStaticAndDynamicPointers(char **charIn, char **charIn2)
{
// sets the first arg to a pointer to static memory.
// sets the second arg to a pointer to dynamic memory.
char stat[5];
memcpy(stat, "abcd", 5);
*charIn = stat;
char *dyn = new char[5];
memcpy(dyn, "abcd", 5);
*charIn2 = dyn;
}
your comment implies that char stat[5] should be static, but it isn't. As a result charIn points to a block that is allocated on the stack, and when you return from the function, it is out of scope. Did you mean static char stat[5]?
char stat[5];
is a stack variable which goes out of scope, it's not // sets the first arg to a pointer to static memory.. You need to malloc/new some memory that gets the abcd put into it. Like you do for charIn2
Just like what Preet said, I don't think the problem is with memcpy. In your function "setStaticAndDynamicPointers", you are setting a pointer to an automatic variable created on the stack of that function call. By the time the function exits, the memory pointed to by "stat" variable will no longer exist. As a result, the first argument **charIn will point to something that's non-existent. Perhaps you can read in greater detail about stack frame (or activation record) here: link text
You have effectively created a dangling pointer to a stack variable in that code. If you want to test copying values into a stack var, make sure it's created in the caller function, not within the called function.
In addition to the definition of 'stat', the main problem in my eyes is that *ppc[3] is not the same as (*ppc)[3]. What you want is the latter (the fourth character from the string pointed to by ppc), but in your memcpy()s you use the former, the first character of the fourth string in the "string array" ppc (obviously ppc is not an array of char*, but you force the compiler to treat it as such).
When debugging such problems, I usually find it helpful to print the memory addresses and contents involved.
Note that the parenthesis in the expressions in your assignment statements are in different locations from the parenthesis in the memcpy expressions. So its not too suprising that they do different things.
When dealing with pointers, you have to keep the following two points firmly in the front of your mind:
#1 The pointer itself is separate from the data it points to. The pointer is just a number. The number tells us where, in memory, we can find the beginning of some other chunk of data. A pointer can be used to access the data it points to, but we can also manipulate the value of the pointer itself. When we increase (or decrease) the value of the pointer itself, we are moving the "destination" of the pointer forward (or backward) from the spot it originally pointed to. This brings us to the second point...
#2 Every pointer variable has a type that indicates what kind of data is being pointed to. A char * points to a char; a int * points to an int; and so on. A pointer can even point to another pointer (char **). The type is important, because when the compiler applies arithmetic operations to a pointer value, it automatically accounts for the size of the data type being pointed to. This allows us to deal with arrays using simple pointer arithmetic:
int *ip = {1,2,3,4};
assert( *ip == 1 ); // true
ip += 2; // adds 4-bytes to the original value of ip
// (2*sizeof(int)) => (2*2) => (4 bytes)
assert(*ip == 3); // true
This works because the array is just a list of identical elements (in this case ints), laid out sequentially in a single contiguous block of memory. The pointer starts out pointing to the first element in the array. Pointer arithmetic then allows us to advance the pointer through the array, element-by-element. This works for pointers of any type (except arithmetic is not allowed on void *).
In fact, this is exactly how the compiler translates the use of the array indexer operator []. It is literally shorthand for a pointer addition with a dereference operator.
assert( ip[2] == *(ip+2) ); // true
So, How is all this related to your question?
Here's your setup...
char *pc = "abcd";
char **ppc = &pc;
char **ppc2 = &pc;
for now, I've simplified by removing the call to setStaticAndDynamicPointers. (There's a problem in that function too—so please see #Nim's answer, and my comment there, for additional details about the function).
char c;
c = (*ppc)[1];
assert(c == 'b'); // assertion doesn't fail.
This works, because (*ppc) says "give me whatever ppc points to". That's the equivalent of, ppc[0]. It's all perfectly valid.
memcpy(&c,&(*ppc[1]),1);
if(c!='b')
puts("memcpy didn't work."); // this gets printed out.
The problematic part —as others have pointed out— is &(*ppc[1]), which taken literally means "give me a pointer to whatever ppc[1] points to."
First of all, let's simplify... operator precedence says that: &(*ppc[1]) is the same as &*ppc[1]. Then & and * are inverses and cancel each other out. So &(*ppc[1]) simplifies to ppc[1].
Now, given the above discussion, we're now equipped to understand why this doesn't work: In short, we're treating ppc as though it points to an array of pointers, when in fact it only points to a single pointer.
When the compiler encounters ppc[1], it applies the pointer arithmetic described above, and comes up with a pointer to the memory that immediately follows the variable pc -- whatever that memory may contain. (The behavior here is always undefined).
So the problem isn't with memcopy() at all. Your call to memcpy(&c,&(*ppc[1]),1) is dutifully copying 1-byte (as requested) from the memory that's pointed to by the bogus pointer ppc[1], and writing it into the character variable c.
As others have pointed out, you can fix this by moving your parenthesis around:
memcpy(&c,&((*ppc)[1]),1)
I hope the explanation was helpful. Good luck!
Although the previous answers raise valid points, I think the other thing you need to look at is your operator precedence rules when you memcpy:
memcpy(&c, &(*ppc2[3]), 1);
What happens here? It might not be what you're intending. The array notation takes higher precedence than the dereference operator, so you first attempt perform pointer arithmetic equivalent to ppc2++. You then dereference that value and pass the address into memcpy. This is not the same as (*ppc2)[1]. The result on my machine is an access violation error (XP/VS2005), but in general this is undefined behaviour. However, if you dereference the same way you did previously:
memcpy(&c, &((*ppc2)[3]), 1);
Then that access violation goes away and I get proper results.
Following an hot comment thread in another question, I came to debate of what is and what is not defined in C99 standard about C arrays.
Basically when I define a 2D array like int a[5][5], does the standard C99 garantee or not that it will be a contiguous block of ints, can I cast it to (int *)a and be sure I will have a valid 1D array of 25 ints.
As I understand the standard the above property is implicit in the sizeof definition and in pointer arithmetic, but others seems to disagree and says casting to (int*) the above structure give an undefined behavior (even if they agree that all existing implementations actually allocate contiguous values).
More specifically, if we think an implementation that would instrument arrays to check array boundaries for all dimensions and return some kind of error when accessing 1D array, or does not give correct access to elements above 1st row. Could such implementation be standard compilant ? And in this case what parts of the C99 standard are relevant.
We should begin with inspecting what int a[5][5] really is. The types involved are:
int
array[5] of ints
array[5] of arrays
There is no array[25] of ints involved.
It is correct that the sizeof semantics imply that the array as a whole is contiguous. The array[5] of ints must have 5*sizeof(int), and recursively applied, a[5][5] must have 5*5*sizeof(int). There is no room for additional padding.
Additionally, the array as a whole must be working when given to memset, memmove or memcpy with the sizeof. It must also be possible to iterate over the whole array with a (char *). So a valid iteration is:
int a[5][5], i, *pi;
char *pc;
pc = (char *)(&a[0][0]);
for (i = 0; i < 25; i++)
{
pi = (int *)pc;
DoSomething(pi);
pc += sizeof(int);
}
Doing the same with an (int *) would be undefined behaviour, because, as said, there is no array[25] of int involved. Using a union as in Christoph's answer should be valid, too. But there is another point complicating this further, the equality operator:
6.5.9.6
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space. 91)
91) Two objects may be adjacent in memory because they are adjacent elements of a larger array or adjacent members of a structure with no padding between them, or because the implementation chose to place them so, even though they are unrelated. If prior invalid pointer operations (such as accesses outside array bounds) produced undefined behavior, subsequent comparisons also produce undefined behavior.
This means for this:
int a[5][5], *i1, *i2;
i1 = &a[0][0] + 5;
i2 = &a[1][0];
i1 compares as equal to i2. But when iterating over the array with an (int *), it is still undefined behaviour, because it is originally derived from the first subarray. It doesn't magically convert to a pointer into the second subarray.
Even when doing this
char *c = (char *)(&a[0][0]) + 5*sizeof(int);
int *i3 = (int *)c;
won't help. It compares equal to i1 and i2, but it isn't derived from any of the subarrays; it is a pointer to a single int or an array[1] of int at best.
I don't consider this a bug in the standard. It is the other way around: Allowing this would introduce a special case that violates either the type system for arrays or the rules for pointer arithmetic or both. It may be considered a missing definition, but not a bug.
So even if the memory layout for a[5][5] is identical to the layout of a[25], and the very same loop using a (char *) can be used to iterate over both, an implementation is allowed to blow up if one is used as the other. I don't know why it should or know any implementation that would, and maybe there is a single fact in the Standard not mentioned till now that makes it well defined behaviour. Until then, I would consider it to be undefined and stay on the safe side.
I've added some more comments to our original discussion.
sizeof semantics imply that int a[5][5] is contiguous, but visiting all 25 integers via incrementing a pointer like int *p = *a is undefined behaviour: pointer arithmetics is only defined as long as all pointers invoved lie within (or one element past the last element of) the same array, as eg &a[2][1] and &a[3][1] do not (see C99 section 6.5.6).
In principle, you can work around this by casting &a - which has type int (*)[5][5] - to int (*)[25]. This is legal according to 6.3.2.3 §7, as it doesn't violate any alignment requirements. The problem is that accessing the integers through this new pointer is illegal as it violates the aliasing rules in 6.5 §7. You can work around this by using a union for type punning (see footnote 82 in TC3):
int *p = ((union { int multi[5][5]; int flat[25]; } *)&a)->flat;
This is, as far as I can tell, standards compliant C99.
If the array is static, like your int a[5][5] array, it's guaranteed to be contiguous.