Does this avoid UB? (C)

This question is more of an academic one, seeing as there is no valid reason to write your own offsetof macro anymore. Nevertheless, I've seen this home-grown implementation pop up here and there:
#define offsetof(s, m) ((size_t) &(((s *)0)->m))
Which is, technically speaking, dereferencing a NULL pointer (AFAICT):
C11 (ISO/IEC 9899:201x) §6.3.2.3 Pointers, paragraph 3:
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant
So the above implementation is, according to how I read the standard, the same as writing:
#define offsetof(s, m) ((size_t) &(((s *)NULL)->m))
It does make me wonder whether, by changing one tiny detail, the following definition of offsetof would become completely legal and reliable:
#define offsetof(s, m) (((size_t)&(((s *) 1)->m)) - 1)
Seeing as, instead of 0, 1 is used as a pointer, and I subtract 1 at the end, the result should be the same. I'm no longer using a NULL pointer. As far as I can tell the results are the same.
So basically: is there any reason why using 1 instead of 0 in this offsetof definition might not work? Can it still cause UB in certain cases, and if so: when and how? Basically, what I'm asking here is: Am I missing anything here?

Both definitions are undefined behavior: in the first definition a null pointer is dereferenced, and in your second definition you are dereferencing an invalid pointer (a pointer that does not point to a valid object). It is not possible in C to write a portable version of the offsetof macro.
Defect Report #44 says:
"In particular, this is why the offsetof macro exists: there was otherwise no portable means to compute such translation-time constants."
(DR#44 is for C89 but nothing has changed in the language in C99 and C11 that would allow a portable implementation.)

I believe the behaviour is implementation-defined. In 6.3.2.3 of n1256:
5 An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.

One problem is that your created pointer does not point to an object.
6.2.4 Storage durations of objects
The lifetime of an object is the portion of program execution during which storage is
guaranteed to be reserved for it. An object exists, has a constant address, 33) and retains
its last-stored value throughout its lifetime. 34) If an object is referred to outside of its
lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when
the object it points to (or just past) reaches the end of its lifetime.
and
J.2 Undefined behaviour
- The value of a pointer to an object whose lifetime has ended is used (6.2.4).
3.19.2 indeterminate value: either an unspecified value or a trap representation
When you convert 1 to a pointer, and the created pointer does not point to an object, the value of the pointer becomes indeterminate. You then use the pointer. Both of those cause undefined behavior.
The conversion of an integer to a pointer is also problematic:
6.3.2.3 Pointers
An integer may be converted to any pointer type. Except as previously specified, the
result is implementation-defined, might not be correctly aligned, might not point to an
entity of the referenced type, and might be a trap representation. 67)

The implementation of offsetof that dereferences a NULL pointer invokes undefined behavior. This implementation assumes the hypothetical structure begins at address 0. You may assume it begins at 1 instead, and yes, it will invoke UB too: not because a null pointer is dereferenced, but because an invalid pointer is dereferenced.

Nothing in any version of the C standard would forbid a compiler from doing anything it wanted with any macro that would attempt to achieve the effect without defining a storage location to hold the indicated object. Nonetheless, a form like:
#define offsetof(s, m) ((char*)&(((s*)0)->m)-(char*)0)
would probably be pretty safe for pre-C99 compilers. Note that it generates an integer by subtracting one char* from another. That is specified to work and yield a constant value when the pointers access parts of the same valid object, and will in practice work on any compiler which doesn't notice that a null pointer isn't a valid object. By contrast, the effect of casting a pointer to an integer or vice versa will vary on different platforms, and there are many platforms where (int)(((char*)&foo)+1) - (int)(char*)&foo may not yield 1.
Note also that the meaning of "Undefined Behavior" has changed recently. It used to be that Undefined Behavior meant that the specification didn't say what compilers had to do, but most compilers would generally choose (sometimes arbitrarily) behavior that was mathematically correct or would make sense on the underlying platform. For example, on a 32-bit processor, int32_t foo=2147483647; foo+=(unsigned char)x; if (foo > 100) ... a compiler might determine that for any possible value of x the mathematically-correct value assigned to foo would be in the range 2147483647 to 2147483903, and thus greater than 100 in any case. Or it might perform the operation using two's-complement arithmetic and perform the comparison on a possibly-wrapped-around value. Newer compilers, however, may do something even more interesting.
A new compiler may look at an expression like the example with foo and infer that if x is zero then foo must remain 2147483647, and that if x is non-zero the compiler would be allowed to do whatever it likes; it may therefore infer that x must equal zero when the statement is executed, so if the code is preceded by a test for (unsigned char)x==0, that expression would always be true. Given code like the offsetof macro, which would generate Undefined Behavior regardless of the values of any variables, a compiler would be entitled to eliminate not just any code using it, but also any preceding code which could not by any defined means cause program execution to terminate.
Note that casting a non-zero integer literal to a pointer is only Undefined Behavior if there does not exist any object whose address has been taken and cast to an integer so as to yield that same value. Thus, a compiler would not be able to recognize a variant of the pointer-difference-based offsetof macro which cast some non-zero value to a pointer as exhibiting Undefined Behavior unless it could determine that the number in question did not correspond to any pointer. On the other hand, an attempt to cast a non-zero integer to a pointer would on some systems perform a validation check to ensure that the pointer is valid; such a system may then trap if it isn't.

You're not actually dereferencing the pointer; what you're doing is more akin to pointer addition, so using zero should be fine.

Related

Is it defined behavior to add 0 to a null pointer? [duplicate]

I noticed this warning from Clang:
warning: performing pointer arithmetic on a null pointer
has undefined behavior [-Wnull-pointer-arithmetic]
In details, it is this code which triggers this warning:
int *start = ((int*)0);
int *end = ((int*)0) + count;
The constant literal zero converted to any pointer type yields a null pointer, which does not point to any area of memory but still has the pointer type needed to do pointer arithmetic.
Why would arithmetic on a null pointer be forbidden when doing the same on a non-null pointer obtained from an integer different than zero does not trigger any warning?
And more importantly, does the C standard explicitly forbid null pointer arithmetic?
Also, this code will not trigger the warning, but this is because the pointer is not evaluated at compile time:
int *start = ((int*)0);
int *end = start + count;
But a good way of avoiding the undefined behavior is to explicitly cast an integer value to the pointer:
int *end = (int *)(sizeof(int) * count);
The C standard does not allow it.
6.5.6 Additive operators (emphasis mine)
8 When an expression that has integer type is added to or
subtracted from a pointer, the result has the type of the pointer
operand. If the pointer operand points to an element of an array
object, and the array is large enough, the result points to an element
offset from the original element such that the difference of the
subscripts of the resulting and original array elements equals the
integer expression. In other words, if the expression P points to the
i-th element of an array object, the expressions (P)+N (equivalently,
N+(P)) and (P)-N (where N has the value n) point to, respectively, the
i+n-th and i-n-th elements of the array object, provided they exist.
Moreover, if the expression P points to the last element of an array
object, the expression (P)+1 points one past the last element of the
array object, and if the expression Q points one past the last element
of an array object, the expression (Q)-1 points to the last element of
the array object. If both the pointer operand and the result point to
elements of the same array object, or one past the last element of the
array object, the evaluation shall not produce an overflow; otherwise,
the behavior is undefined. If the result points one past the last
element of the array object, it shall not be used as the operand of a
unary * operator that is evaluated.
For the purposes of the above, a pointer to a single object is considered as pointing into an array of 1 element.
Now, ((uint8_t*)0) does not point at an element of an array object. Simply because a pointer holding a null pointer value does not point at any object. Which is said at:
6.3.2.3 Pointers
3 If a null pointer constant is converted to a pointer type, the
resulting pointer, called a null pointer, is guaranteed to compare
unequal to a pointer to any object or function.
So you can't do arithmetic on it. The warning is justified, because as the second highlighted sentence mentions, we are in the case of undefined behavior.
Don't be fooled by the fact the offsetof macro is possibly implemented like that. The standard library is not bound by the constraints placed on user programs. It can employ deeper knowledge. But doing this in our code is not well defined.
When the C Standard was written, the vast majority of C implementations would, for any non-void* pointer value p, uphold the invariants that p+0 and p-0 both yield p, and p-p will yield zero. More generally, operations like a size-zero memcpy or fwrite that operate on a buffer of size N would ignore the buffer address when N was zero. Such behavior would allow programmers to avoid having to write code to handle corner cases. For example, code to output a packet with an optional payload passed via address and length arguments would naturally process (NULL,0) as an empty payload.
Nothing in the published Rationale for the C Standard suggests that implementations whose target platforms would naturally behave in such fashion shouldn't continue to work as they always had. There were, however, a few platforms where it may have been expensive to uphold such behavioral guarantees in cases where p is null.
As with most situations where the vast majority of C implementations would process a construct identically, but implementations might exist where such treatment would be impractical, the Standard characterizes the addition of zero to a null pointer as Undefined Behavior. The Standard allows implementations to, as a form of "conforming language extension", define the behavior of constructs in cases where it imposes no requirements, and it allow conforming (but not strictly conforming) programs to make use of them. According to the published Rationale, the stated intention was that support for such "popular extensions" be regarded as a "quality of implementation" issue to be decided by the marketplace. Implementations that could support them at essentially zero cost would do so, but implementations where such support would be expensive would be free to support such constructs or not based upon their customers' needs.
If one is using a compiler that targets commonplace platforms, and is designed to process the widest range of useful programs reasonably efficiently, then the extended semantics surrounding pointer arithmetic may allow one to write code more efficiently than would otherwise be possible. If one is targeting a compiler that does not value compatibility with quality compilers, however, one should recognize that it may treat the Standard's allowance for quirky hardware as an invitation to behave nonsensically even on commonplace hardware. Of course, one should also be aware that such compilers may behave nonsensically in corner cases where adherence with the Standard would require them to forego optimizations that are unsound but would "usually" be safe.

Is casting a pointer to intptr_t, doing arithmetic on it and then casting back, defined behavior?

Let's say I want to move a void* pointer by 4 bytes. Are the following equivalent:
A:
void *new_address(void *in_ptr) {
    intptr_t tmp = (intptr_t)in_ptr;
    intptr_t new_address = tmp + 4;
    return (void *)new_address;
}
B:
void *new_address(void *in_ptr) {
    char *tmp = (char *)in_ptr;
    char *new_address = tmp + 4;
    return (void *)new_address;
}
Are both defined behavior? Is one more popular/accepted convention? Any other reason to use one over the other?
Let's only consider 64bit systems. If intptr_t is not available we can use int64_t instead.
The context is a custom memory allocator which needs to move the pointer before allocating new block of memory to a specific address (for alignment purposes). We don't know what object the resulting pointer is going to point to yet but we know we need to move it to a specific location which in the examples above is 4 bytes.
Michael Kerrisk says on page 1415 that,
The C standards make one exception to the rule that pointers of
different types need not have the same representation: pointers of the
types char * and void * are required to have the same internal
representation.
All the C standard guarantees (7.18.1.4) is that you can convert void* values to intptr_t (or uintptr_t) and back again and end up with an equal value for the pointer.
The nuance here is that we cannot apply arithmetic operations to a void * value itself.
Is casting a pointer to intptr_t [...] defined behavior?
Converting a pointer to any integer type is defined and the result is implementation defined, except when result can't be represented in integer type, then it's undefined behavior. See C11 6.3.2.3p6. But intptr_t has to be able to represent void* - the behavior is defined.
, doing arithmetic on it and then casting back, defined behavior?
Any integer may be converted to any pointer type. The resulting pointer is implementation defined - there is no guarantee that adding 4 to intptr_t will increment the pointer value by 4. See C11 6.3.2.3p5.
Are both defined behavior?
Yes, however the result is implementation defined.
Is one more popular/accepted convention?
Subjective: I'd say using uintptr_t is more popular than intptr_t. Converting a pointer to uintptr_t or to char* to do some arithmetic both happen in real code; I can't say which is more popular.
Any other reason to use one over the other?.
Not really, but I think go with char*.
When it comes to actually accessing the data behind the resulting pointer - it depends. If the resulting pointer points within the same object then you're fine (remember, conversion is implementation defined). If the resulting pointer does not point to the same object, I believe the best interpretation would be from reading c2263 Clarifying Pointer Provenance v4 2.2.3Q5 and I think that's: the current C11 standard does not clearly specify that, which would make the behavior not defined.
Because you tagged gcc, both code snippets should compile to equivalent code - I believe on all architectures pointers are converted 1:1 to (u)intptr_t on gcc. Gcc docs implementation defined behavior 4.7 arrays and pointers states casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined - so you're safe as long as the resulting pointer points to the same object.
The context is a custom memory allocator
See implementations of container_of and offsetof macros. Do not hardcode + 4 in your code, and if you do, do not depend on alignment requirements on accessing the resulting pointers - remember to use memcpy to safely copy the context or handle alignment properly. Do not reinvent the wheel - when in doubt see other implementations like glibc malloc.c or newlib malloc.c - they both calculate on char* in mem2chunk macro, but also happen to do calculations on uintptr_t integers.
No strictly conforming program uses A. Using the result may be Undefined Behaviour, as there is no requirement for addition on an intptr_t to be reflected in the pointer value when that intptr_t is converted back to a pointer.
It is both unspecified behaviour and implementation-defined.
If the optional type intptr_t is defined all you are guaranteed is that you can convert void * to intptr_t and then convert that value back to void * and the two values will compare equal (==).
The strictly conforming way to perform pointer arithmetic is B. B is guaranteed to work if and only if the pointer in_ptr is valid and, for the largest enclosing object, there are 3 or more bytes in that object beyond that value. It's 3 because it's valid to point to (but not dereference) the address that is (logically) one byte beyond the end of an object.
Object includes a declared object (including array) or block of memory such as returned by malloc().
All good practice is to prefer to write 'strictly conforming' programs where possible. So all good practice is to prefer B over A.
According to the standard the use of the pointer (as a pointer) may result in Undefined Behaviour because it may be (implementation defined) to be a trap representation.
A strictly conforming program is defined as "A strictly conforming program shall use only those features of the language and library specified in this International Standard.3) It shall not produce output dependent on any unspecified, undefined, or implementation-defined behavior, and shall not exceed any minimum implementation limit.
There's some disagreement about whether the code offered for A is unspecified or implementation defined. The standard says both because implementation-defined behaviour is a sub-category of unspecified. However because the implementation may document it as a trap representation using the value may result in Undefined Behaviour.
But I hope that is swept aside by the fact that 'strictly conforming programs' don't depend on unspecified, undefined or implementation defined behaviour.
So good practice here is certainly B.
Consider a secure environment that encrypts pointer values to deliberately confound the de-referencing of arbitrary pointer values. In principle it could provide intptr_t and be conformant.
Though I still maintain that if A doesn't work, then, intptr_t being an optional type, it would be better not to provide it at all. Whether it is defined is unspecified and implementation-dependent. That's because no 'strictly conforming program' uses it, and it has no practical use other than to manipulate a pointer as an arithmetic type in a way not supported by pointer arithmetic on a compatible pointer type such as char *. The snippet in A falls into that category.
To store a void * declare a void * or char[sizeof(void*)] or malloc() or similar. To overlay a void * over an arithmetic type, declare a union and benefit that the union will be aligned for a void *.
But according to the specification it is unspecified and implementation-defined, no 'strictly conforming program' can rely on it, and it may result in Undefined Behaviour.
A very long winded way of saying the answer, here, is B.

Does the C standard permit assigning an arbitrary value to a pointer and incrementing it?

Is the behaviour of this code well defined?
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    void *ptr = (char *)0x01;
    size_t val;

    ptr = (char *)ptr + 1;
    val = (size_t)(uintptr_t)ptr;
    printf("%zu\n", val);
    return 0;
}
I mean, can we assign some fixed number to a pointer and increment it even if it is pointing to some random address? (I know that you can not dereference it)
The assignment:
void *ptr = (char *)0x01;
Is implementation defined behavior because it is converting an integer to a pointer. This is detailed in section 6.3.2.3 of the C standard regarding Pointers:
5 An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined,
might not be correctly aligned, might not point to an entity
of the referenced type, and might be a trap representation.
As for the subsequent pointer arithmetic:
ptr = (char *)ptr + 1;
This is dependent on a few things.
First, the current value of ptr may be a trap representation as per 6.3.2.3 above. If it is, the behavior is undefined.
Next is the question of whether 0x1 points to a valid object. Adding a pointer and an integer is only valid if both the pointer operand and the result point to elements of an array object (a single object counts as an array of size 1) or one element past the array object. This is detailed in section 6.5.6:
7 For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a
pointer to the first element of an array of length one with the type
of the object as its element type
8 When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer
operand. If the pointer operand points to an element of an array
object, and the array is large enough, the result points to an element
offset from the original element such that the difference of the
subscripts of the resulting and original array elements equals the
integer expression. In other words, if the expression P points to the
i-th element of an array object, the expressions (P)+N (equivalently, N+(P) ) and (P)-N (where N has the value n ) point to,
respectively, the i+n-th and i−n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an
array object, the expression (P)+1 points one past the last element of
the array object, and if the expression Q points one past the
last element of an array object, the expression (Q)-1 points to
the last element of the array object. If both the pointer
operand and the result point to elements of the same array
object, or one past the last element of the array object, the
evaluation shall not produce an overflow; otherwise, the behavior is
undefined. If the result points one past the last element of the
array object, it shall not be used as the operand of a unary
* operator that is evaluated.
On a hosted implementation the value 0x1 almost certainly does not point to a valid object, in which case the addition is undefined. An embedded implementation could however support setting pointers to specific values, and if so it could be the case that 0x1 does in fact point to a valid object. If so, the behavior is well defined, otherwise it is undefined.
No, the behaviour of this program is undefined. Once an undefined construct is reached in a program, any future behaviour is undefined. Paradoxically, any past behaviour is undefined too.
The result of void *ptr = (char*)0x01; is implementation-defined, due in part to the fact that a char can have a trap representation.
But the behaviour of the ensuing pointer arithmetic in the statement ptr = (char *)ptr + 1; is undefined. This is because pointer arithmetic is only valid within arrays including one past the end of the array. For this purpose an object is an array of length one.
Yes, the code is well-defined as implementation-defined. It is not undefined. See ISO/IEC 9899:2011 [6.3.2.3]/5 and note 67.
The C language was originally created as a systems programming language. Systems programming required manipulating memory-mapped hardware, which meant stuffing hard-coded addresses into pointers, sometimes incrementing those pointers, and reading and writing data at the resulting addresses. To that end, assigning an integer to a pointer and manipulating that pointer using arithmetic is well defined by the language. By making it implementation-defined, the language allows all kinds of things to happen: from the classic halt-and-catch-fire to raising a bus error when trying to dereference an odd address.
The difference between undefined behaviour and implementation-defined behaviour is basically undefined behaviour means "don't do that, we don't know what will happen" and implementation-defined behaviour means "it's OK to go ahead and do that, it's up to you to know what will happen."
It is undefined behavior.
From N1570 (emphasis added):
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
If the value is a trap representation, reading it is undefined behavior:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined.) Such a representation is called a trap representation.
And
An identifier is a primary expression, provided it has been declared as designating an object (in which case it is an lvalue) or a function (in which case it is a function designator).
Therefore, the line void *ptr = (char *)0x01; is already potentially undefined behavior, on an implementation where (char*)0x01 or (void*)(char*)0x01 is a trap representation. The left-hand side is an lvalue expression that does not have character type and reads a trap representation.
On some hardware, loading an invalid pointer into a machine register could crash the program, so this was a forced move by the standards committee.
The Standard does not require that implementations process integer-to-pointer conversions in a meaningful fashion for any particular integer values, or even for any possible integer values other than Null Pointer Constants. The only thing it guarantees about such conversions is that a program which stores the result of such a conversion directly into an object of suitable pointer type and does nothing with it except examine the bytes of that object will, at worst, see Unspecified values. While the behavior of converting an integer to a pointer is Implementation-Defined, nothing would forbid any implementation (no matter what it actually does with such conversions!) from specifying that some (or even all) of the bytes of the representation have Unspecified values, and specifying that some (or even all) integer values may behave as though they yield trap representations.
The only reasons the Standard says anything at all about integer-to-pointer conversions are that:
In some implementations, the construct is meaningful, and some programs for those implementations require it.
The authors of the Standard did not like the idea that a construct which was used on some implementations would represent a constraint violation on others.
It would have been odd for the Standard to describe a construct but then specify that it has Undefined Behavior in all cases.
Personally, I think the Standard should have allowed implementations to treat integer-to-pointer conversions as constraint violations if they don't define any situations where they would be useful, rather than require that compilers accept the meaningless code, but that wasn't the philosophy at the time.
I think it would be simplest to simply say that any operation involving integer-to-pointer conversions with anything other than intptr_t or uintptr_t values received from pointer-to-integer conversions invokes Undefined Behavior, but then note that it is common for quality implementations intended for low-level programming to process Undefined Behavior "in a documented manner characteristic of the environment". The Standard doesn't specify when implementations should process programs that invoke UB in that fashion but instead treats it as a Quality of Implementation issue.
If an implementation specifies that integer-to-pointer conversions operate in a fashion that would define the behavior of
char *p = (char*)1;
p++;
as equivalent to char *p = (char*)2;, then the implementation should be expected to work that way. On the other hand, an implementation could define the behavior of integer-to-pointer conversion in such a way that even:
char *p = (char*)1;
char *q = p; // Not doing any arithmetic here--just a simple assignment
would release nasal demons. On most platforms, a compiler where arithmetic on pointers produced by integer-to-pointer conversions behaved oddly would not be viewed as a high-quality implementation suitable for low-level programming. A programmer that is not intending to target any other kind of implementations could thus expect such constructs to behave usefully on compilers for which the code was suitable, even though the Standard does not require it.

C - Reference after dereference terminology

This question is about terminology.
int main()
{
    unsigned char array[10] = {0};
    void *ptr = array;
    void *middle = &ptr[5]; // <== dereferencing ‘void *’ pointer
}
Gcc emits the warning Dereferencing void pointer.
I understand the warning because the compiler needs to compute the actual offset, and it couldn't because void has no standard size.
But I disagree with the error message. This is not a dereference. I can't find an explanation of "dereference" where it means anything other than taking the value of something.
Same thing for offsetof:
#define offsetof(a,b) ((int)(&(((a*)(0))->b)))
There are a lot of threads about whether this is UB because of a null pointer dereference. But this is not a null pointer dereference! Is it?
There is no storage access in the assembly code
mov rax, QWORD PTR [rbp-48]
add rax, 5
mov QWORD PTR [rbp-40], rax
What is the difference between dereference and storage access?
But I disagree with the error message. This is not a dereference. I can't find a dereference explanation where it is something else than taking value of something.
The standard does not provide a formal definition of the term "dereference". The only place it uses it at all is in (non-normative) footnote 102:
[...] Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer, an address inappropriately aligned for the type of object pointed to, and the address of an object after the end of its lifetime.
Note well, however, that this note characterizes dereferencing as the behavior of the unary * operator, not the effect of performing some other operation on the result. You can think of the operation as converting a pointer into the object to which it points, which you will recognize presents an issue if the pointer does not, in fact, point to an object of the pointed-to type, or if the pointed-to type is an incomplete one such as void. Such an issue exists formally even if the resulting object goes unused.
Now I acknowledge that there is room for confusion here on account of the fact that it is useless to perform a dereference without using the resulting object, but that's beside the point. Consider the following complete C statement:
1 + 2;
Would you deny that it performs an addition just because the result is unused?
Now, your (sub-)expression ptr[5] is defined to have meaning identical to that of (*((ptr)+(5))). The type of a pointer addition expression is the same as the type of the pointer involved, so that does indeed involve dereferencing a void *, in the sense of applying the unary * operator to an expression of that type.
Nevertheless, although I think the error message is correct, I do agree that it is a poor choice. A more fundamental problem here, and one that is reached first in evaluation order, is a violation of the language constraint that in pointer addition, the pointer must point to a complete type, which void is not. Indeed, it's hard to construe the message that is emitted as satisfying the requirement that constraint violations result in a diagnostic. It seems to be about a different problem -- one that produces undefined behavior, but does not involve a constraint violation.
You also remark:
Same thing for offsetof:
#define offsetof(a,b) ((int)(&(((a*)(0))->b)))
[...] But this is not a null pointer dereference! Is it?
Be careful, there. The C language does not define the specific form of the replacement text of the offsetof() macro; what you've presented is an implementation detail.
We could easily divert into semantics here, since "dereference" is not a defined term in the standard, so I'll address instead a similar question: when the macro arguments meet the requirements of the offsetof() macro, does the definition presented expand to an expression with well-defined behavior?
The standard does not define behavior for the indirect member selection operator (->) when its left-hand operand has an acceptable type but does not point to any object (such as when it is null). The behavior is therefore undefined. Or if we take a->b to be wholly equivalent to ((*a).b), then the behavior is explicitly undefined when a does not point to any object. Either way, the C language does not define behavior for the expression.
But this is where it becomes important that your particular macro definition is an implementation detail. The implementation from which it is drawn is free to provide whatever behavior it wishes, and in particular, it can provide behavior that reliably satisfies C's specifications for the offsetof() macro. You should not rely on such code yourself. Even on an implementation that provides an offsetof() definition of that form, you cannot be certain that it does not also employ some special internal magic -- not available directly to your own code -- to make it work.

Dereferencing null pointer valid and working (not sizeof) [duplicate]

Code sample:
struct name
{
int a, b;
};
int main()
{
&(((struct name *)NULL)->b);
}
Does this cause undefined behaviour? We could debate whether it "dereferences null"; however, C11 doesn't define the term "dereference".
6.5.3.2/4 clearly says that using * on a null pointer causes undefined behaviour; however it doesn't say the same for ->, and it does not define a->b as being (*a).b; it has separate definitions for each operator.
The semantics of -> in 6.5.2.3/4 says:
A postfix expression followed by the -> operator and an identifier designates a member
of a structure or union object. The value is that of the named member of the object to
which the first expression points, and is an lvalue.
However, NULL does not point to an object, so the second sentence seems underspecified.
Also relevant might be 6.5.3.2/1:
Constraints:
The operand of the unary & operator shall be either a function designator, the result of a
[] or unary * operator, or an lvalue that designates an object that is not a bit-field and is
not declared with the register storage-class specifier.
However I feel that the bolded text is defective and should read "lvalue that potentially designates an object", as per 6.3.2.1/1 (definition of lvalue) -- C99 messed up the definition of lvalue, so C11 had to rewrite it, and perhaps this section got missed.
6.3.2.1/1 does say:
An lvalue is an expression (with an object type other than void) that potentially
designates an object; if an lvalue does not designate an object when it is evaluated, the
behavior is undefined
however the & operator does evaluate its operand. (It doesn't access the stored value but that is different).
This long chain of reasoning seems to suggest that the code causes UB however it is fairly tenuous and it's not clear to me what the writers of the Standard intended. If in fact they intended anything, rather than leaving it up to us to debate :)
From a lawyer point of view, the expression &(((struct name *)NULL)->b); should lead to UB, since you cannot find a path through it that avoids UB. IMHO the root cause is that at some point you apply the -> operator to an expression that does not point to an object.
From a compiler point of view, assuming the compiler writer did not overcomplicate things, it is clear that the expression returns the same value as offsetof(name, b) would, and I'm pretty sure that, provided it compiles without error, any existing compiler will give that result.
As written, we could not blame a compiler that noticed that in the inner part you use operator -> on an expression that cannot point to an object (since it is null) and issued a warning or an error.
My conclusion is that until there is a special paragraph saying that dereferencing a null pointer is legal provided it is done only to take its address, this expression is not legal C.
Yes, this use of -> has undefined behavior in the direct sense of the English term undefined.
The behavior is only defined if the first expression points to an object, and not defined (= undefined) otherwise. In general you shouldn't read more into the term undefined; it means just that: the standard doesn't provide a meaning for your code. (Sometimes it points explicitly to situations that it doesn't define, but this doesn't change the general meaning of the term.)
This is a slackness that is introduced to help compiler builders deal with such things. They may define a behavior, even for the code that you are presenting. In particular, for a compiler implementation it is perfectly fine to use such code or similar for the offsetof macro. Making this code a constraint violation would block that path for compiler implementations.
Let's start with the indirection operator *:
6.5.3.2 p4:
The unary * operator denotes indirection. If the operand points to a function, the result is
a function designator; if it points to an object, the result is an lvalue designating the
object. If the operand has type "pointer to type", the result has type "type". If an
invalid value has been assigned to the pointer, the behavior of the unary * operator is
undefined. 102)
*E, where E is a null pointer, is undefined behavior.
There is a footnote that states:
102) Thus, &*E is equivalent to E (even if E is a null pointer), and &(E1[E2]) to ((E1)+(E2)). It is
always true that if E is a function designator or an lvalue that is a valid operand of the unary &
operator, *&E is a function designator or an lvalue equal to E. If *P is an lvalue and T is the name of
an object pointer type, *(T)P is an lvalue that has a type compatible with that to which T points.
Which means that &*E, where E is NULL, is defined. But the question is whether the same is true for &(*E).m, where E is a null pointer whose pointed-to type is a struct with a member m.
The C Standard doesn't define that behavior.
If it were defined, new problems would arise, one of which is listed below. The C Standard is correct to keep it undefined, and it provides a macro, offsetof, that handles the problem internally.
6.3.2.3 Pointers
An integer constant expression with the value 0, or such an expression cast to type
void *, is called a null pointer constant. 66) If a null pointer constant is converted to a
pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal
to a pointer to any object or function.
This means that an integer constant expression with the value 0 is a null pointer constant, and converting it to a pointer type yields a null pointer.
But the representation of a null pointer is not defined to be 0; it is implementation-defined.
7.19 Common definitions
The macros are
NULL
which expands to an implementation-defined null pointer constant
This means C allows an implementation where the null pointer has all bits set, and member access on such a value could result in an overflow, which is undefined behavior.
Another problem is how you would evaluate &(*E).m: do the brackets apply, and is * evaluated first? Keeping it undefined avoids this problem.
First, let's establish that we need a pointer to an object:
6.5.2.3 Structure and union members
4 A postfix expression followed by the -> operator and an identifier designates a member
of a structure or union object. The value is that of the named member of the object to
which the first expression points, and is an lvalue.96) If the first expression is a pointer to
a qualified type, the result has the so-qualified version of the type of the designated
member.
Unfortunately, no null pointer ever points to an object.
6.3.2.3 Pointers
3 An integer constant expression with the value 0, or such an expression cast to type
void *, is called a null pointer constant.66) If a null pointer constant is converted to a
pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal
to a pointer to any object or function.
Result: Undefined Behavior.
As a side-note, some other things to chew over:
6.3.2.3 Pointers
4 Conversion of a null pointer to another pointer type yields a null pointer of that type.
Any two null pointers shall compare equal.
5 An integer may be converted to any pointer type. Except as previously specified, the
result is implementation-defined, might not be correctly aligned, might not point to an
entity of the referenced type, and might be a trap representation.67)
6 Any pointer type may be converted to an integer type. Except as previously specified, the
result is implementation-defined. If the result cannot be represented in the integer type,
the behavior is undefined. The result need not be in the range of values of any integer
type.
67) The mapping functions for converting a pointer to an integer or an integer to a pointer are intended to be consistent with the addressing structure of the execution environment.
So even if the UB should happen to be benign this time, it might still result in some totally unexpected number.
Nothing in the C standard would impose any requirements on what a system could do with the expression. When the standard was written, it would have been perfectly reasonable for it to cause the following sequence of events at runtime:
Code loads a null pointer into the addressing unit
Code asks the addressing unit to add the offset of field b.
The addressing unit triggers a trap when attempting to add an integer to a null pointer (which should, for robustness, be a run-time trap, even though many systems don't catch it)
The system starts executing essentially random code after being dispatched through a trap vector that was never set, because code to set it would have been a waste of memory, as addressing traps shouldn't occur.
The very essence of what Undefined Behavior meant at the time.
Note that most of the compilers that have appeared since the early days of C would regard the address of a member of an object located at a constant address as being a compile-time constant, but I don't think such behavior was mandated then, nor has anything been added to the standard which would mandate that compile-time address calculations involving null pointers be defined in cases where run-time calculations would not.
No. Let's take this apart:
&(((struct name *)NULL)->b);
is the same as:
struct name * ptr = NULL;
&(ptr->b);
The first line is obviously valid and well defined.
In the second line, we calculate the address of a field relative to the address 0x0, which is perfectly legal as well. The Amiga, for example, kept the pointer to the kernel at address 0x4, so you could use a method like this to call kernel functions.
In fact, the same approach is used on the C macro offsetof (wikipedia):
#define offsetof(st, m) ((size_t)(&((st *)0)->m))
So the confusion here revolves around the fact that NULL pointers are scary. But from a compiler and standard point of view, the expression is legal in C (C++ is a different beast since you can overload the & operator).