How can *i and u.i print different numbers in this code, even though i is defined as int *i = &u.i;? I can only assuming that I'm triggering UB here, but I can't see how exactly.
(ideone demo replicates if I select 'C' as the language. But as #2501 pointed out, not if 'C99 strict' is the language. But then again, I get the problem with gcc-5.3.0 -std=c99!)
// gcc -fstrict-aliasing -std=c99 -O2
union
{
int i;
short s;
} u;
int * i = &u.i;
short * s = &u.s;
int main()
{
*i = 2;
*s = 100;
printf(" *i = %d\n", *i); // prints 2
printf("u.i = %d\n", u.i); // prints 100
return 0;
}
(gcc 5.3.0, with -fstrict-aliasing -std=c99 -O2, also with -std=c11)
My theory is that 100 is the 'correct' answer, because the write to the union member through the short-lvalue *s is defined as such (for this platform/endianness/whatever). But I think that the optimizer doesn't realize that the write to *s can alias u.i, and therefore it thinks that *i=2; is the only line that can affect *i. Is this a reasonable theory?
If *s can alias u.i, and u.i can alias *i, then surely the compiler should think that *s can alias *i? Shouldn't aliasing be 'transitive'?
Finally, I always had this assumption that strict-aliasing problems were caused by bad casting. But there is no casting in this!
(My background is C++, I'm hoping I'm asking a reasonable question about C here. My (limited) understanding is that, in C99, it is acceptable to write through one union member and then reading through another member of a different type.)
The disrepancy is issued by -fstrict-aliasing optimization option. Its behavior and possible traps are described in GCC documentation:
Pay special attention to code like this:
union a_union {
int i;
double d;
};
int f() {
union a_union t;
t.d = 3.0;
return t.i;
}
The practice of reading from a different union member than the one
most recently written to (called “type-punning”) is common. Even with
-fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type. So, the code above works as expected.
See Structures unions enumerations and bit-fields implementation. However, this code might
not:
int f() {
union a_union t;
int* ip;
t.d = 3.0;
ip = &t.i;
return *ip;
}
Note that conforming implementation is perfectly allowed to take advantage of this optimization, as second code example exhibits undefined behaviour. See Olaf's and others' answers for reference.
C standard (i.e. C11, n1570), 6.5p7:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
...
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
a character type.
The lvalue expressions of your pointers are not union types, thus this exception does not apply. The compiler is correct exploiting this undefined behaviour.
Make the pointers' types pointers to the union type and dereference with the respective member. That should work:
union {
...
} u, *i, *p;
Strict aliasing is underspecified in the C Standard, but the usual interpretation is that union aliasing (which supersedes strict aliasing) is only permitted when the union members are directly accessed by name.
For rationale behind this consider:
void f(int *a, short *b) {
The intent of the rule is that the compiler can assume a and b don't alias, and generate efficient code in f. But if the compiler had to allow for the fact that a and b might be overlapping union members, it actually couldn't make those assumptions.
Whether or not the two pointers are function parameters or not is immaterial, the strict aliasing rule doesn't differentiate based on that.
This code indeed invokes UB, because you do not respect the strict aliasing rule. n1256 draft of C99 states in 6.5 Expressions §7:
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the
object,
— a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
Between the *i = 2; and the printf(" *i = %d\n", *i); only a short object is modified. With the help of the strict aliasing rule, the compiler is free to assume that the int object pointed by i has not been changed, and it can directly use a cached value without reloading it from main memory.
It is blatantly not what a normal human being would expect, but the strict aliasing rule was precisely written to allow optimizing compilers to use cached values.
For the second print, unions are referenced in same standard in 6.2.6.1 Representations of types / General §7:
When a value is stored in a member of an object of union type, the bytes of the object
representation that do not correspond to that member but do correspond to other members
take unspecified values.
So as u.s has been stored, u.i have taken a value unspecified by standard
But we can read later in 6.5.2.3 Structure and union members §3 note 82:
If the member used to access the contents of a union object is not the same as the member last used to
store a value in the object, the appropriate part of the object representation of the value is reinterpreted
as an object representation in the new type as described in 6.2.6 (a process sometimes called "type
punning"). This might be a trap representation.
Although notes are not normative, they do allow better understanding of the standard. When u.s have been stored through the *s pointer, the bytes corresponding to a short have been changed to the 2 value. Assuming a little endian system, as 100 is smaller that the value of a short, the representation as an int should now be 2 as high order bytes were 0.
TL/DR: even if not normative, the note 82 should require that on a little endian system of the x86 or x64 families, printf("u.i = %d\n", u.i); prints 2. But per the strict aliasing rule, the compiler is still allowed to assumed that the value pointed by i has not changed and may print 100
You are probing a somewhat controversial area of the C standard.
This is the strict aliasing rule:
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to
the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
a character type.
(C2011, 6.5/7)
The lvalue expression *i has type int. The lvalue expression *s has type short. These types are not compatible with each other, nor both compatible with any other particular type, nor does the strict aliasing rule afford any other alternative that allows both accesses to conform if the pointers are aliased.
If at least one of the accesses is non-conforming then the behavior is undefined, so the result you report -- or indeed any other result at all -- is entirely acceptable. In practice, the compiler must produce code that reorders the assignments with the printf() calls, or that uses a previously loaded value of *i from a register instead of re-reading it from memory, or some similar thing.
The aforementioned controversy arises because people will sometimes point to footnote 95:
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.
Footnotes are informational, however, not normative, so there's really no question which text wins if they conflict. Personally, I take the footnote simply as an implementation guidance, clarifying the meaning of the fact that the storage for union members overlaps.
Looks like this is a result of the optimizer doing its magic.
With -O0, both lines print 100 as expected (assuming little-endian). With -O2, there is some reordering going on.
gdb gives the following output:
(gdb) start
Temporary breakpoint 1 at 0x4004a0: file /tmp/x1.c, line 14.
Starting program: /tmp/x1
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Temporary breakpoint 1, main () at /tmp/x1.c:14
14 {
(gdb) step
15 *i = 2;
(gdb)
18 printf(" *i = %d\n", *i); // prints 2
(gdb)
15 *i = 2;
(gdb)
16 *s = 100;
(gdb)
18 printf(" *i = %d\n", *i); // prints 2
(gdb)
*i = 2
19 printf("u.i = %d\n", u.i); // prints 100
(gdb)
u.i = 100
22 }
(gdb)
0x0000003fa441d9f4 in __libc_start_main () from /lib64/libc.so.6
(gdb)
The reason this happens, as others have stated, is because it is undefined behavior to access a variable of one type through a pointer to another type even if the variable in question is part of a union. So the optimizer is free to do as it wishes in this case.
The variable of the other type can only be read directly via a union which guarantees well defined behavior.
What's curious is that even with -Wstrict-aliasing=2, gcc (as of 4.8.4) doesn't complain about this code.
Whether by accident or by design, C89 includes language which has been interpreted in two different ways (along with various interpretations in-between). At issue is the question of when a compiler should be required to recognize that storage used for one type might be accessed via pointers of another. In the example given in the C89 rationale, aliasing is considered between a global variable which is clearly not part of any union and a pointer to a different type, and nothing in the code would suggest that aliasing could occur.
One interpretation horribly cripples the language, while the other would restrict the use of certain optimizations to "non-conforming" modes. If those who didn't to have their preferred optimizations given second-class status had written C89 to unambiguously match their interpretation, those parts of the Standard would have been widely denounced and there would have been some sort of clear recognition of a non-broken dialect of C which would honor the non-crippling interpretation of the given rules.
Unfortunately, what has happened instead is since the rules clearly don't require compiler writers apply a crippling interpretation, most compiler writers have for years simply interpreted the rules in a fashion which retains the semantics that made C useful for systems programming; programmers didn't have any reason to complain that the Standard didn't mandate that compilers behave sensibly because from their perspective it seemed obvious to everyone that they should do so despite the sloppiness of the Standard. Meanwhile, however, some people insist that since the Standard has always allowed compilers to process a semantically-weakened subset of Ritchie's systems-programming language, there's no reason why a standard-conforming compiler should be expected to process anything else.
The sensible resolution for this issue would be to recognize that C is used for sufficiently varied purposes that there should be multiple compilation modes--one required mode would treat all accesses of everything whose address was taken as though they read and write the underlying storage directly, and would be compatible with code which expects any level of pointer-based type punning support. Another mode could be more restrictive than C11 except when code explicitly uses directives to indicate when and where storage that has been used as one type would need to be reinterpreted or recycled for use as another. Other modes would allow some optimizations but support some code that would break under stricter dialects; compilers without specific support for a particular dialect could substitute one with more defined aliasing behaviors.
I'm writing some extremely repetitive code in C (reading XML), and I found that writing my code like makes it easier to copy and paste code in a constructor*:
something_t* something_new(void)
{
something_t* obj = malloc(sizeof(*obj));
/* initialize */
return obj;
}
What I'm wondering is, it is safe to use sizeof(*obj) like this, when I just defined obj? GCC isn't showing any warnings and the code works fine, but GCC tends to have "helpful" extensions so I don't trust it.
* And yes, I realize that I should have just written a Python program to write my C program, but it's almost done already.
something_t* obj = malloc(sizeof(*obj));
What I'm wondering is, it is safe to use sizeof(*obj)
like this, when I just defined obj?
You have a declaration consisting of:
type-specifier something_t
declarator * obj
=
initializer malloc(sizeof(*obj))
;
The C standard says in section Scopes of identifiers:
Structure, union, and enumeration tags have scope that begins just
after the appearance of the tag in a type specifier that declares the
tag. Each enumeration constant has scope that begins just after the
appearance of its defining enumerator in an enumerator list. Any other
identifier has scope that begins just after the completion of its
declarator.
Since obj has scope that begins just after the completion of its declarator, it is guaranteed by the standard that the identifier used in the initializer refers to the just defined object.
While we are giving the sizeof like this.
something_t* obj = malloc(sizeof(obj));
It will allocate the memory to that pointer variable as four bytes( bytes allocated to a pointer variable.)
something_t* obj = malloc(sizeof(*obj));
It will take the data type which is declared to that pointer.
For example,
char *p;
printf("%d\n",sizeof(p));
It will return the value as four.
printf("%d\n",sizeof(*p));
Now it will return value as one. Which is a byte allocated to the character. So when we are using the *p in sizeof it will take the datatype. I don't know this is the answer you are expecting.
It's safe. For example,
n = sizeof(*(int *)NULL);
In this case, NULL pointer access doesn't occur, because a compiler can caluculate the size of the operand without knowing the value of "*(int *)NULL" in run time.
The C89/90 standard guarantees the 'sizeof' expression is a constant one; it is translated into a constant (e.g. 0x04) and embedded into a binary code in compilation stage.
In C99 standard, the 'sizeof' expression is not always a compile-time constant, because of introducing variable length array. For example,
n = sizeof(int [*(int *)NULL]);
In this case, the value of "*(int *)NULL" needs to be known in run time to caluculate the size of 'int[]'.
Can someone explain this behavior?
Using the compiler flag std=c99 I get the following errors:
"initializer element is not constant" for b1.
"expected expression before '.' token" for b2
b3 is OK.
When not using -std=c99 all lines are OK.
When not using static b1 is ok.
I'm using GCC.
typedef struct A_tag {
int v;
int w;
} A;
typedef struct B_tag {
A super;
int x;
int y;
} B;
void test(){
static B b1 = ((B){.super={.v=100}, .x=10});
static B b2 = ({.super={.v=100}, .x=10});
static B b3 = {.super={.v=100}, .x=10};
}
(B){.super={.v=100}, .x=10} is not a "cast" but as a whole this is a "compound literal" a temporary object that only lives inside the corresponding expression (basically). Since this is not a constant but a temporary object, by the standard you can't initialize with it.
As stated above, this is a "compound literal". Whether it can be used for initialization is actually implementation defined, IMO. The C11 standard says in [6.7.9 §4] that "expressions in an initializer for an object that has static or thread storage duration shall be constant expressions or string literals". Then in [6.6 §7] it lists what constant expressions can be and [6.6 §10] it allows an implementation to "accept other forms of constant expressions".
Since a "compound literals" is constant by definition, it ought to be possible to use it for initialization, although the standard does not explicitly say so. (And many compilers do accept it.)
Reading the C11 standard (actually N1570 which is a late draft), you will see that the rules for initializers of objects of static storage duration are different from the rules for those of automatic storage duration.
As I read it, compound literals can only be used for initializers of objects of automatic storage duration. They are allowed by 6.7.9 paragraph 13.
I am reading the book Let us C by Yashavant Kanetkar.
In the Array of Pointers section there is a section of code which is giving me problems:
int main()
{
static int a[]={0,1,2,3,4}; //-----------(MY PROBLEM)
int *p[]={a,a+1,a+2,a+3,a+4};
printf("%u %u %d\n",p,*p,*(*p));
return 0;
}
What I don't understand is why has the array a have to be initialized as static. I tried initializing it without the static keyword but I got an error saying "illegal". Please help.
C90 (6.5.7) had
All the expressions in an initializer for an object that has static storage duration or in an initializer list for an object that has aggregate or union type shall be constant expressions.
And you are initializing an object that has an aggregate type, so the value must be known at compile time and the address of automatic variables are not in that case.
Note this has changed in C99 (6.7.8/4)
All the expressions in an initializer for an object that has static storage duration shall be constant expressions or string literals.
The constraint on object with aggregate or union type has been removed and I've not found it placed somewhere else. Your code with static removed should be accepted by a C99 compiler (it is by gcc -std=c99 for instance, which seems to confirm that I've not overlooked a constraint elsewhere).
My guess would be that the contents of an array initialiser have to be a compile-time constant. By using static on a local variable in a function you essentially make that variable global, except with a local scope.
I've run into a problem that seems to not be addressed by any of the C Standards after C89 save for the mention that structures initialization limits had been lifted. However, I've run into an error using the Open Watcom IDE (for debugging) where the compiler states that the initializer must be a constant expression.
Here's the gist of what's going on.
typedef struct{
short x;
short y;
} POINT;
void foo( short x, short y )
{
POINT here = { x, y }; /* <-- This is generating the error for the compiler */
/* ... */
}
Any ideas why, or what standard disallows that?
The following quote is from the C99 rationale:
The C89 Committee considered proposals for permitting automatic
aggregate initializers to consist of a brace-enclosed series of
arbitrary execution-time expressions, instead of just those usable for
a translation-time static initializer. Rather than determine a set of
rules which would avoid pathological cases and yet not seem too
arbitrary, the C89 Committee elected to permit only static
initializers. This was reconsidered and execution-time expressions are
valid in C99.
The problem is that C isn't an Object language and only does strict typing. Further, C maintains a difference between structs and arrays.
The way your code will have to work is
void foo( short x, short y )
{
POINT here;
here.x = x;
here.y = y;
}
This is normal for C89... initializers do need to be constant, ie. able to be determined at compile time. This means no variables in initializers, and it's true for other types as well, not just structs. In C99, your code would work.