Normal Structure Variable A decay Pointer or A Stack Frame - c

struct student
{
int roll_no;
int physics;
int chem;
float percentage;
};
int main()
{
struct student s = {1, 96, 98, 97.00};
printf("%d", *((int *) &s + 1));
}
Here in the above code we know that 's' is a structure variable and what my question is
whether 's' is a decay pointer or a separate stack frame is getting allocated for the structure, if it is a decay pointer we don't have to put that '&' with the 's' so i'm guessing it must be a stackframe cause it behaves just like a normal variable.
if we print 's' first members value gets printed and if we print '&s' first members address gets printed.
If it is anything other than this kindly help us to understand that too.
And also,
How to access if a Char array is a member of the structure, tried this *((char*)&s + offset_value); not working.
Your help will be much appreciated.

whether 's' is a decay pointer…
s is a structure. Structures are not automatically converted to pointers the way arrays are. (Some people say “decay,” but this is a colloquial term and is inaccurate.)
… or a separate stack frame is getting allocated for the structure,…
With s defined as you show inside main, the compiler automatically allocates space for it. In common C implementations, the compiler uses stack space for this (except that optimization may result in alternatives including using processor registers instead, incorporating use of the structure into other expressions, or eliminating part or all of the structure entirely). If the compiler does use stack space for s, it uses space within the stack frame for main; it does not allocate a separate stack frame.
(Generally, a stack frame is the space on the stack used by one function, sometimes with some modifications such as “leaf” functions that do not call other routines and may not establish a full stack frame.)
printf("%d", *((int *) &s + 1));
*((int *) &s + 1) is not proper C code because its behavior is not defined by the C standard and likely not by your compiler either.
&s is the address of the structure s. When we convert it to (int *), that gives the address of the first member of the structure, s.roll_no. That is because the C standard gives us a rule that says we can convert a pointer to a structure to a pointer to its first member. C 2018 6.7.2.1 15 says:
… A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa…
Then you add 1 to this pointer, (int *) &s + 1. That also has defined behavior. From above (int *) &s points to s.roll_no, so it is equivalent to &s.roll_no. Then (int *) &s + 1 is equivalent to &s.roll_no + 1. C 2018 6.5.6 specifies what happens with addition. I will omit specifics of the rules, but essentially &s.roll_no + 1 is defined and produces a pointer that points just beyond s.roll_no.
However, even if the next member of the structure, physics, is just beyond s.roll_no1, the C standard does not define what happens when * is applied to it. This is because C 2018 6.5.6 8 says:
If the result [of the addition] points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
In the context of 6.5.6, s.roll_no is an “array object” that contains just one element, and &s.roll_no + 1 points one beyond its last element. So *(&s.roll_no + 1) and, equivalently, *((int *) &s + 1) violates this rule. C 2018 4 2 says that when a rule like this is violated, the behavior is not defined:
If a "shall" or "shall not" requirement that appears outside of a constraint or runtime-constraint is violated, the behavior is undefined…
Footnote
1 This is not guaranteed because C implementations are allowed to insert padding between members in structures. However, typical C implementations will not insert padding between members of identical types, as it is not needed for alignment, and this can be checked at compile time. But the statements above apply even if there is no padding.

Structs parameters are passed by value unless the & operator is used. Arrays parameters are passed as pointers. A local variable is in the stack frame (or a register).
You might find the offsetof macro interesting
#define offsetof(S, F) (((char *)&((S *)0)->F - (char *)0))

Related

Is it OK to access past the size of a structure via member address, with enough space allocated?

Specifically, is the following code, the line below the marker, OK?
struct S{
int a;
};
#include <stdlib.h>
int main(){
struct S *p;
p = malloc(sizeof(struct S) + 1000);
// This line:
*(&(p->a) + 1) = 0;
}
People have argued here, but no one has given a convincing explanation or reference.
Their arguments are on a slightly different base, yet essentially the same
typedef struct _pack{
int64_t c;
} pack;
int main(){
pack *p;
char str[9] = "aaaaaaaa"; // Input
size_t len = offsetof(pack, c) + (strlen(str) + 1);
p = malloc(len);
// This line, with similar intention:
strcpy((char*)&(p->c), str);
// ^^^^^^^
The intent at least since the standardization of C in 1989 has been that implementations are allowed to check array bounds for array accesses.
The member p->a is an object of type int. C11 6.5.6p7 says that
7 For the purposes of [additive operators] a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.
Thus
&(p->a)
is a pointer to an int; but it is also as if it were a pointer to the first element of an array of length 1, with int as the object type.
Now 6.5.6p8 allows one to calculate &(p->a) + 1 which is a pointer to just past the end of the array, so there is no undefined behaviour. However, the dereference of such a pointer is invalid. From Appendix J.2 where it is spelt out, the behaviour is undefined when:
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated (6.5.6).
In the expression above, there is only one array, the one (as if) with exactly 1 element. If &(p->a) + 1 is dereferenced, the array with length 1 is accessed out of bounds and undefined behaviour occurs, i.e.
behavior [...], for which [The C11] Standard imposes no requirements
With the note saying that:
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
That the most common behaviour is ignoring the situation completely, i.e. behaving as if the pointer referenced the memory location just after, doesn't mean that other kind of behaviour wouldn't be acceptable from the standard's point of view - the standard allows every imaginable and unimaginable outcome.
There has been claims that the C11 standard text has been written vaguely, and the intention of the committee should be that this indeed be allowed, and previously it would have been alright. It is not true. Read the part from the committee response to [Defect Report #017 dated 10 Dec 1992 to C89].
Question 16
[...]
Response
For an array of arrays, the permitted pointer arithmetic in
subclause 6.3.6, page 47, lines 12-40 is to be understood by
interpreting the use of the word object as denoting the specific
object determined directly by the pointer's type and value, not other
objects related to that one by contiguity. Therefore, if an expression
exceeds these permissions, the behavior is undefined. For example, the
following code has undefined behavior:
int a[4][5];
a[1][7] = 0; /* undefined */
Some conforming implementations may
choose to diagnose an array bounds violation, while others may
choose to interpret such attempted accesses successfully with the
obvious extended semantics.
(bolded emphasis mine)
There is no reason why the same wouldn't be transferred to scalar members of structures, especially when 6.5.6p7 says that a pointer to them should be considered to behave the same as a pointer to the first element of an array of length one with the type of the object as its element type.
If you want to address the consecutive structs, you can always take the pointer to the first member and cast that as the pointer to the struct and advance that instead:
*(int *)((S *)&(p->a) + 1) = 0;
This is undefined behavior, as you are accessing something that is not an array (int a within struct S) as an array, and out of bounds at that.
The correct way to achieve what you want, is to use an array without a size as the last struct member:
#include <stdlib.h>
typedef struct S {
int foo; //avoid flexible array being the only member
int a[];
} S;
int main(){
S *p = malloc(sizeof(*p) + 2*sizeof(int));
p->a[0] = 0;
p->a[1] = 42; //Perfectly legal.
}
C standard guarantees that
§6.7.2.1/15:
[...] A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.
&(p->a) is equivalent to (int *)p. &(p->a) + 1 will be address of the element of the second struct. In this case, only one element is there, there will not be any padding in the structure so this will work but where there will be padding this code will break and leads to undefined behaviour.

Pointer difference across members of a struct?

The C99 standard states that:
When two pointers are subtracted, both shall point to elements of the same array object, or one past the last element of the array object
Consider the following code:
struct test {
int x[5];
char something;
short y[5];
};
...
struct test s = { ... };
char *p = (char *) s.x;
char *q = (char *) s.y;
printf("%td\n", q - p);
This obviously breaks the above rule, since the p and q pointers are pointing to different "array objects", and, according to the rule, the q - p difference is undefined.
But in practice, why should such a thing ever result in undefined behaviour? After all, the struct members are laid out sequentially (just as array elements are), with any potential padding between the members. True, the amount of padding will vary across implementations and that would affect the outcome of the calculations, but why should that result be "undefined"?
My question is, can we suppose that the standard is just "ignorant" of this issue, or is there a good reason for not broadening this rule? Couldn't the above rule be rephrased to "both shall point to elements of the same array object or members of the same struct"?
My only suspicion are segmented memory architectures where the members might end up in different segments. Is that the case?
I also suspect that this is the reason why GCC defines its own __builtin_offsetof, in order to have a "standards compliant" definition of the offsetof macro.
EDIT:
As already pointed out, arithmetic on void pointers is not allowed by the standard. It is a GNU extension that fires a warning only when GCC is passed -std=c99 -pedantic. I'm replacing the void * pointers with char * pointers.
Subtraction and relational operators (on type char*) between addresses of member of the same struct are well defined.
Any object can be treated as an array of unsigned char.
Quoting N1570 6.2.6.1 paragraph 4:
Values stored in non-bit-field objects of any other object type
consist of n × CHAR_BIT bits, where n is the size of an object of that
type, in bytes. The value may be copied into an object of type
unsigned char [ n ] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value.
...
My only suspicion are segmented memory architectures where the members
might end up in different segments. Is that the case?
No. For a system with a segmented memory architecture, normally the compiler will impose a restriction that each object must fit into a single segment. Or it can permit objects that occupy multiple segments, but it still has to ensure that pointer arithmetic and comparisons work correctly.
Pointer arithmetic requires that the two pointers being added or subtracted to be part of the same object because it doesn't make sense otherwise.
The quoted section of standard specifically refers to two unrelated objects such as int a[b]; and int b[5]. The pointer arithmetic requires to know the type of the object that the pointers pointing to (I am sure you are aware of this already).
i.e.
int a[5];
int *p = &a[1]+1;
Here p is calculated by knowing the that the &a[1] refers to an int object and hence incremented to 4 bytes (assuming sizeof(int) is 4).
Coming to the struct example, I don't think it can possibly be defined in a way to make pointer arithmetic between struct members legal.
Let's take the example,
struct test {
int x[5];
char something;
short y[5];
};
Pointer arithmatic is not allowed with void pointers by C standard (Compiling with gcc -Wall -pedantic test.c would catch that). I think you are using gcc which assumes void* is similar to char* and allows it.
So,
printf("%zu\n", q - p);
is equivalent to
printf("%zu", (char*)q - (char*)p);
as pointer arithmetic is well defined if the pointers point to within the same object and are character pointers (char* or unsigned char*).
Using correct types, it would be:
struct test s = { ... };
int *p = s.x;
short *q = s.y;
printf("%td\n", q - p);
Now, how can q-p be performed? based on sizeof(int) or sizeof(short) ? How can the size of char something; that's in the middle of these two arrays be calculated?
That should explain it's not possible to perform pointer arithmetic on objects of different types.
Even if all members are of same type (thus no type issue as stated above), then it's better to use the standard macro offsetof (from <stddef.h>) to get the difference between struct members which has the similar effect as pointer arithmetic between members:
printf("%zu\n", offsetof(struct test, y) - offsetof(struct test, x));
So I see no necessity to define pointer arithmetic between struct members by the C standard.
Yes, you are allowed to perform pointer arithmetric on structure bytes:
N1570 - 6.3.2.3 Pointers p7:
... When a pointer to an object is converted to a pointer to a character type,
the result points to the lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining bytes of the object.
This means that for the programmer, bytes of the stucture shall be seen as a continuous area, regardless how it may have been implemented in the hardware.
Not with void* pointers though, that is non-standard compiler extension. As mentioned on paragraph from the standard, it applies only to character type pointers.
Edit:
As mafso pointed out in comments, above is only true as long as type of substraction result ptrdiff_t, has enough range for the result. Since range of size_t can be larger than ptrdiff_t, and if structure is big enough, it's possible that addresses are too far apart.
Because of this it's preferable to use offsetof macro on structure members and calculate result from those.
I believe the answer to this question is simpler than it appears, the OP asks:
but why should that result be "undefined"?
Well, let's see that the definition of undefined behavior is in the draft C99 standard section 3.4.3:
behavior, upon use of a nonportable or erroneous program construct or
of erroneous data, for which this International Standard imposes no
requirements
it is simply behavior for which the standard does not impose a requirement, which perfectly fits this situation, the results are going to vary depending on the architecture and attempting to specify the results would have probably been difficult if not impossible in a portable manner. This leaves the question, why would they choose undefined behavior as opposed to let's say implementation of unspecified behavior?
Most likely it was made undefined behavior to limit the number of ways an invalid pointer could be created, this is consistent with the fact that we are provided with offsetof to remove the one potential need for pointer subtraction of unrelated objects.
Although the standard does not really define the term invalid pointer, we get a good description in Rationale for International Standard—Programming Languages—C which in section 6.3.2.3 Pointers says (emphasis mine):
Implicit in the Standard is the notion of invalid pointers. In
discussing pointers, the Standard typically refers to “a pointer to an
object” or “a pointer to a function” or “a null pointer.” A special
case in address arithmetic allows for a pointer to just past the end
of an array. Any other pointer is invalid.
The C99 rationale further adds:
Regardless how an invalid pointer is created, any use of it yields
undefined behavior. Even assignment, comparison with a null pointer
constant, or comparison with itself, might on some systems result in
an exception.
This strongly suggests to us that a pointer to padding would be an invalid pointer, although it is difficult to prove that padding is not an object, the definition of object says:
region of data storage in the execution environment, the contents of
which can represent values
and notes:
When referenced, an object may be interpreted as having a particular
type; see 6.3.2.1.
I don't see how we can reason about the type or the value of padding between elements of a struct and therefore they are not objects or at least is strongly indicates padding is not meant to be considered an object.
I should point out the following:
from the C99 standard, section 6.7.2.1:
Within a structure object, the non-bit-field members and the units in which bit-fields
reside have addresses that increase in the order in which they are declared. A pointer to a
structure object, suitably converted, points to its initial member (or if that member is a
bit-field, then to the unit in which it resides), and vice versa. There may be unnamed
padding within a structure object, but not at its beginning.
It isn't so much that the result of pointer subtraction between members is undefined so much as it is unreliable (i.e. not guaranteed to be the same between different instances of the same struct type when the same arithmetic is applied).

Strict aliasing and memory locations

Strict aliasing prevents us from accessing the same memory location using an incompatible type.
int* i = malloc( sizeof( int ) ) ; //assuming sizeof( int ) >= sizeof( float )
*i = 123 ;
float* f = ( float* )i ;
*f = 3.14f ;
this would be illegal according to C standard, because the compiler "knows" that int cannot accessed by a float lvalue.
What if I use that pointer to point to correct memory, like this:
int* i = malloc( sizeof( int ) + sizeof( float ) + MAX_PAD ) ;
*i = 456 ;
First I allocate memory for int, float and the last part is memory which will allow float to be stored on aligned address. float requires to be aligned on multiples of 4. MAX_PAD is usually 8 of 16 bytes depending on the system. In any case, MAX_PAD is large enough so float can be aligned properly.
Then I write an int into i, so far so good.
float* f = ( float* )( ( char* )i + sizeof( int ) + PaddingBytesFloat( (char*)i ) ) ;
*f= 2.71f ;
I use the pointer i, increment it with the size of int and align it correctly with the function PaddingBytesFloat(), which returns the number of bytes required to align a float, given an address. Then I write a float into it.
In this case, f points to a different memory location that doesn't overlap; it has a different type.
Here are some parts from the standard (ISO/IEC 9899:201x) 6.5 , I was relying on when writing this example.
Aliasing is when more than one lvalue points to the same memory location. Standard requires that those lvalues have a compatible type with the effective type of the object.
What is effective type, quote from standard:
The effective type of an object for an access to its stored value is the declared type of the
object, if any.87)If a value is stored into an object having no declared type through an
lvalue having a type that is not a character type, then the type of the lvalue becomes the
effective type of the object for that access and for subsequent accesses that do not modify
the stored value. If a value is copied into an object having no declared type using
memcpy or memmove, or is copied as an array of character type, then the effective type
of the modified object for that access and for subsequent accesses that do not modify the
value is the effective type of the object from which the value is copied, if it has one. For all other accesses to an object having no declared type, the effective type of the object is
simply the type of the lvalue used for the access.
87) Allocated objects have no declared type.
I'm trying to connect the pieces and figure out if this is allowed. In my interpretation the effective type of an allocated object can be changed depending on the type of the lvalue used on that memory, because of this part: For
all other accesses to an object having no declared type, the effective type of the object is simply the type of the lvalue used for the access.
Is this legal? If not, what if I used a void pointer as lvalue instead of an int pointer i in my second example? If even that wouldn't work, what if I got the address, which is assigned to the float pointer in the second example, as a memcopied value, and that address was never used as an lvalue before.
I think that yes, it is legal.
To illustrate my point, let's see this code:
struct S
{
int i;
float f;
};
char *p = malloc(sizeof(struct S));
int *i = p + offsetof(struct S, i); //this offset is 0 by definition
*i = 456;
float *f = p + offsetof(struct S, f);
*f= 2.71f;
This code is, IMO, clearly legal, and it is equivalent to yours from a compiler point of view, for appropriate values of PaddingBytesFloat() and MAX_PAD.
Note that my code does not use any l-value of type struct S, it is only used to ease the calculation of the paddings.
As I read the standard, in malloc'ed memory has no declared type until something is written there. Then the declared type is whatever is written. Thus the declared type of such memory can be changed any time, overwriting the memory with a value of different type, much like an union.
TL; DR: My conclusion is that with dynamic memory you are safe, with regard to strict-aliasing as long as you read the memory using the same type (or a compatible one) you use to last write to that memory.
Yes, this is legal. To see why, you don't even need to think about strict aliasing rule, because it doesn't apply in this case.
According to the C99 standard, when you do this:
int* i = malloc( sizeof( int ) + sizeof( float ) + MAX_PAD ) ;
*i = 456 ;
malloc will return a pointer to a memory block large enough to hold an object of size sizeof(int)+sizeof(float)+MAX_PAD. However, notice that you are only using a small piece of this size; in particular, you are only using the first sizeof(int) bytes. Consequently, you are leaving some free space that can be used to store other objects, as long as you store them into a disjoint offset (that is, after the first sizeof(int) bytes). This is tightly related with the definition of what exactly is an object. From C99 section 3.14:
Object: region of data storage in the execution environment, the
contents of which can represent values
The precise meaning of the contents of the object pointed to by i is the value 456; this implies that the integer object itself only takes a small portion of the memory block you allocated. There is nothing in the standard stopping you from storing a new, different object of any type a few bytes ahead.
This code:
float* f = ( float* )( ( char* )i + sizeof( int ) + PaddingBytesFloat( (char*)i ) ) ;
*f= 2.71f ;
Is effectively attaching another object to a sub-block of the allocated memory. As long as the resulting memory location for f doesn't overlap with that of i, and there is enough room left to store a float, you will always be safe. The strict aliasing rule doesn't even apply here, because the pointers point to objects that do not overlap - the memory locations are different.
I think the key point here is to understand that you are effectively manipulating two distinct objects, with two distinct pointers. It just so happens that both pointers point to the same malloc()'d block, but they are far enough from one another, so this is not a problem.
You can have a look at this related question: What alignment issues limit the use of a block of memory created by malloc? and read Eric Postpischil's great answer: https://stackoverflow.com/a/21141161/2793118 - after all, if you can store arrays of different types in the same malloc() block, why wouldn't you store an int and a float? You can even look at your code as the special case in which these arrays are one-element arrays.
As long as you take care of alignment issues, the code is perfectly fine and 100% portable.
UPDATE (follow-up, read comments below):
I believe your reasoning about the standard not enforcing strict aliasing on malloc()'d objects is wrong. It is true that the effective type of a dynamically allocated object can be changed, as conveyed by the standard (it is a matter of using an lvalue expression with a different type to store a new value in there), but note that once you do that, it is your job to ensure that no other lvalue expression with a different type will access the object value. This is enforced by rule 7 on section 6.5, and you quoted it in your question:
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:
- a type compatible with the effective type of the object;
Thus, by the time you change the effective type of an object, you are implicitly promising to the compiler that you won't access this object using an old pointer with an incompatible type (compared to the new effective type). This should be enough for the purposes of the strict aliasing rule.
I found a nice analogy. You may also find it useful. Quoting from ISO/IEC 9899:TC2 Committee Draft — May 6, 2005 WG14/N1124
6.7.2.1 Structure and union specifiers
[16] As a special case, the last element of a structure with more than one
named member may have an incomplete array type; this is called a
flexible array member. In most situations, the flexible array member
is ignored. In particular, the size of the structure is as if the
flexible array member were omitted except that it may have more
trailing padding than the omission would imply. However, when a . (or
->) operator has a left operand that is (a pointer to) a structure with a flexible array member and the right operand names that member,
it behaves as if that member were replaced with the longest array
(with the same element type) that would not make the structure larger
than the object being accessed; the offset of the array shall remain
that of the flexible array member, even if this would differ from that
of the replacement array. If this array would have no elements, it
behaves as if it had one element but the behavior is undefined if any
attempt is made to access that element or to generate a pointer one
past it.
[17] EXAMPLE After the declaration:
struct s { int n; double d[]; };
the structure struct s has a flexible array member d. A typical
way to use this is:
int m = /* some value */;
struct s *p = malloc(sizeof (struct s) + sizeof (double [m]));
and assuming that the
call to malloc succeeds, the object pointed to by p behaves, for most
purposes, as if p had been declared as:
struct { int n; double d[m]; } > *p;
(there are circumstances in which this equivalence is broken; in particular, the offsets of member d might not be the same).
It would be more fair to use an example like:
struct ss {
double da;
int ia[];
}; // sizeof(double) >= sizeof(int)
In example of above quote, size of struct s is same as int (+ padding) and it is then followed by double. (or some other type, float in your case)
Accessing memory sizeof(int) + PADDING bytes after the struct start as double (using syntactic sugar) looks fine as per this example, so I believe your example is legal C.
The strict aliasing rules are there to allow for more aggressive compiler optimizations, specifically by being able to reorder accesses to different types without having to worry about whether they point to the same location. So for instance, in your first example it is perfectly legal for a compiler to reorder the writes to i and f, and thus your code is an example of undefined behaviour (UB).
There is an exception to this rule and you have the relevant quote from the standards
having a type that is not a character type
Your second bit of code is entirely safe. The memory regions do no overlap so it does not matter if memory accesses are reordered across that boundary. Indeed the behaviour of the two pieces of code is completely different. The first one places an int in a memory region and then a float in to the same memory region, whereas the second one places an int in to a memory region and a float in to a bit of memory next to it. Even if these accesses are reordered then your code will have the same effect. Perfectly, legal.
I get the feeling I have thus missed the real question here.
The safest way to fiddle with low level memory if you really did want the behaviour in your first program is either (a) a union or (b) a char *. Using char * and then casting to the proper type is used in a lot of C code, e.g: in this pcap tutorial (scroll down to "for all those new C programmers who insist that pointers are useless, I smite you."

Is it undefined behaviour to access an array beyond its end, if that area is allocated? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Is the “struct hack” technically undefined behavior?
Normally accessing an array beyond its end is undefined behavior in C. For example:
int foo[1];
foo[5] = 1; //Undefined behavior
Is it still undefined behavior if I know that the memory area after the end of the array has been allocated, with malloc or on the stack? Here is an example:
#include <stdio.h>
#include <stdlib.h>
typedef struct
{
int len;
int data[1];
} MyStruct;
int main(void)
{
MyStruct *foo = malloc(sizeof(MyStruct) + sizeof(int) * 10);
foo->data[5] = 1;
}
I have seen this patten used in several places to make a variable length struct, and it seems to work in practice. Is it technically undefined behavior?
What you are describing is affectionately called "the struct hack". It's not clear if it's completely okay, but it was and is widely used.
As of late (C99), it has started to be replaced by the "flexible array member", where you're allowed to put an int data[]; field if it's the last field in the struct.
Under 6.5.6 Additive operators:
Semantics
8 - [...] If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. [...] If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.
If the memory is allocated by malloc then:
7.22.3 Memory management functions
1 - [...] The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated). The lifetime of an allocated object extends from the allocation until the deallocation.
This does not however countenance the use of such memory without an appropriate cast, so for MyStruct as defined above only the declared members of the object can be used. This is why flexible array members (6.7.2.1:18) were added.
Also note that appendix J.2 Undefined behavior calls out array access:
1 - The behavior is undefined in the following circumstances: [...]
— Addition or subtraction of a pointer into, or just beyond, an array object and an
integer type produces a result that does not point into, or just beyond, the same array
object.
— Addition or subtraction of a pointer into, or just beyond, an array object and an
integer type produces a result that points just beyond the array object and is used as
the operand of a unary * operator that is evaluated.
— An array subscript is out of range, even if an object is apparently accessible with the
given subscript (as in the lvalue expression a[1][7] given the declaration int
a[4][5]).
So, as you note this would be undefined behaviour:
MyStruct *foo = malloc(sizeof(MyStruct) + sizeof(int) * 10);
foo->data[5] = 1;
However, you would be allowed to do the following:
MyStruct *foo = malloc(sizeof(MyStruct) + sizeof(int) * 10);
((int *) foo)[(offsetof(MyStruct, data) / sizeof(int)) + 5] = 1;
C++ is laxer in this regard; 3.9.2 Compound types [basic.compound] has:
3 - [...] If an object of type T is located at an address A, a pointer of type cv T* whose value is the address A is said to point to that object, regardless of how the value was obtained.
This makes sense considered in the light of C's more aggressive optimisation opportunities for pointers, e.g. with the restrict qualifier.
The C99 rationale document talks about this in section 6.7.2.1.
A new feature of C99: There is a common idiom known as the “struct hack” for creating a structure containing a variable-size array:
...
The validity of this construct has always been questionable. In the response to one Defect Report, the Committee decided that it was undefined behavior because the array p->items contains only one item, irrespective of whether the space exists. An alternative construct was
suggested: make the array size larger than the largest possible case (for example, using int items[INT_MAX];), but this approach is also undefined for other reasons.
The Committee felt that, although there was no way to implement the “struct hack” in C89, it was nonetheless a useful facility. Therefore the new feature of “flexible array members” was introduced. Apart from the empty brackets, and the removal of the “-1” in the malloc call,
this is used in the same way as the struct hack, but is now explicitly valid code.
The struct hack is undefined behavior, as supported not only be the C specification itself (I'm sure there are citations in the other answers) but the committee has even recorded its opinion.
So the answer is yes, it is undefined behavior according to the standard document, but it is well defined according to the de facto C standard. I imagine most compiler writers are intimately familiar with the hack. From GCC's tree-vrp.c:
/* Accesses after the end of arrays of size 0 (gcc
extension) and 1 are likely intentional ("struct
hack"). */
I think there's a good chance you might even find the struct hack in compiler test suites.

Why does this implementation of offsetof() work?

In ANSI C, offsetof is defined as below.
#define offsetof(st, m) \
((size_t) ( (char *)&((st *)(0))->m - (char *)0 ))
Why won't this throw a segmentation fault since we are dereferencing a NULL pointer? Or is this some sort of compiler hack where it sees that only address of the offset is taken out, so it statically calculates the address without actually dereferencing it? Also is this code portable?
At no point in the above code is anything dereferenced. A dereference occurs when the * or -> is used on an address value to find referenced value. The only use of * above is in a type declaration for the purpose of casting.
The -> operator is used above but it's not used to access the value. Instead it's used to grab the address of the value. Here is a non-macro code sample that should make it a bit clearer
SomeType *pSomeType = GetTheValue();
int* pMember = &(pSomeType->SomeIntMember);
The second line does not actually cause a dereference (implementation dependent). It simply returns the address of SomeIntMember within the pSomeType value.
What you see is a lot of casting between arbitrary types and char pointers. The reason for char is that it's one of the only type (perhaps the only) type in the C89 standard which has an explicit size. The size is 1. By ensuring the size is one, the above code can do the evil magic of calculating the true offset of the value.
Although that is a typical implementation of offsetof, it is not mandated by the standard, which just says:
The following types and macros are defined in the standard header <stddef.h> [...]
offsetof(type,member-designator)
which expands to an integer constant expression that has type size_t, the value of
which is the offset in bytes, to the structure member (designated by member-designator),
from the beginning of its structure (designated by type). The type and member designator
shall be such that given
statictypet;
then the expression &(t.member-designator) evaluates to an address constant. (If the specified member is a bit-field, the behavior is undefined.)
Read P J Plauger's "The Standard C Library" for a discussion of it and the other items in <stddef.h> which are all border-line features that could (should?) be in the language proper, and which might require special compiler support.
It's of historic interest only, but I used an early ANSI C compiler on 386/IX (see, I told you of historic interest, circa 1990) that crashed on that version of offsetof but worked when I revised it to:
#define offsetof(st, m) ((size_t)((char *)&((st *)(1024))->m - (char *)1024))
That was a compiler bug of sorts, not least because the header was distributed with the compiler and didn't work.
In ANSI C, offsetof is NOT defined like that. One of the reasons it's not defined like that is that some environments will indeed throw null pointer exceptions, or crash in other ways. Hence, ANSI C leaves the implementation of offsetof( ) open to compiler builders.
The code shown above is typical for compilers/environments that do not actively check for NULL pointers, but fail only when bytes are read from a NULL pointer.
To answer the last part of the question, the code is not portable.
The result of subtracting two pointers is defined and portable only if the two pointers point to objects in the same array or point to one past the last object of the array (7.6.2 Additive Operators, H&S Fifth Edition)
Listing 1: A representative set of offsetof() macro definitions
// Keil 8051 compiler
#define offsetof(s,m) (size_t)&(((s *)0)->m)
// Microsoft x86 compiler (version 7)
#define offsetof(s,m) (size_t)(unsigned long)&(((s *)0)->m)
// Diab Coldfire compiler
#define offsetof(s,memb) ((size_t)((char *)&((s *)0)->memb-(char *)0))
typedef struct
{
int i;
float f;
char c;
} SFOO;
int main(void)
{
printf("Offset of 'f' is %zu\n", offsetof(SFOO, f));
}
The various operators within the macro are evaluated in an order such that the following steps are performed:
((s *)0) takes the integer zero and casts it as a pointer to s.
((s *)0)->m dereferences that pointer to point to structure member m.
&(((s *)0)->m) computes the address of m.
(size_t)&(((s *)0)->m) casts the result to an appropriate data type.
By definition, the structure itself resides at address 0. It follows that the address of the field pointed to (Step 3 above) must be the offset, in bytes, from the start of the structure.
It doesn't segfault because you're not dereferencing it. The pointer address is being used as a number that's subtracted from another number, not used to address memory operations.
It calculates the offset of the member m relative to the start address of the representation of an object of type st.
((st *)(0)) refers to a NULL pointer of type st *.
&((st *)(0))->m refers to the address of member m in this object. Since the start address of this object is 0 (NULL), the address of member m is exactly the offset.
char * conversion and the difference calculates the offset in bytes. According to pointer operations, when you make a difference between two pointers of type T *, the result is the number of objects of type T represented between the two addresses contained by the operands.
Quoting the C standard for the offsetof macro:
C standard, section 6.6, paragraph 9
An address constant is a null pointer, a pointer to an lvalue designating an object of static storage duration, or a pointer to a function designator; it shall be created explicitly using the unary & operator or an integer constant cast to pointer type, or implicitly by the use of an expression of array or function type. The array-subscript [] and member-access . and -> operators, the address & and indirection * unary operators, and pointer casts may be used in the creation of an address constant, but the value of an object shall not be accessed by use of these operators.
The macro is defined as
#define offsetof(type, member) ((size_t)&((type *)0)->member)
and the expression comprises the creation of an address constant.
Although genuinely speaking, the result is not an address constant because it does not point to an object of static storage duration. But this is still agreed upon that the value of an object shall not be accessed, so the integer constant cast to pointer type will not be dereferenced.
Also, consider this quote from the C standard:
C standard, section 7.19, paragraph 3
The type and member designator shall be such that given
static type t;
then the expression &(t.member-designator) evaluates to an address constant. (If the
specified member is a bit-field, the behavior is undefined.)
A struct in C is a composite data type (or record) declaration that defines a physically grouped list of variables under one name in a block of memory, allowing the different variables to be accessed via a single pointer or by the struct declared name which returns the same address.
From the compiler perspective, the struct declared name is an address and the member designator is an offset from that address.

Resources