^^^ THIS QUESTION IS NOT ABOUT TYPE PUNNING ^^^
It is my understanding that an object contained in a union can only be used if it is active, and that it is active iff it was the last member to have a value stored to it. This suggests that the following code should be undefined at the points I mark.
My question is if I am correct in my understanding of when it is defined to access a member of a union, particularly in the following situations.
#include <stddef.h>
#include <stdio.h>
void work_with_ints(int* p, size_t k)
{
size_t i = 1;
for(;i<k;++i) p[i]=p[i-1];
}
void work_with_floats(float* p, size_t k)
{
size_t i = 1;
for(;i<k;++i) p[i]=p[i-1];
}
int main(void)
{
union{ int I[4]; float F[4]; } u;
// this is undefined because no member of the union was previously
// selected by storing a value to the union object
work_with_ints(u.I,4);
printf("%d %d %d %d\n",u.I[0],u.I[1],u.I[2],u.I[3]);
u.I[0]=1; u.I[1]=2; u.I[2]=3; u.I[3]=4;
// this is undefined because u currently stores an object of type int[4]
work_with_floats(u.F,4);
printf("%f %f %f %f\n",u.F[0],u.F[1],u.F[2],u.F[3]);
// this is defined because the assignment makes u store an object of
// type float[4], which is subsequently accessed
u.F[0]=42.0;
work_with_floats(u.F,4);
printf("%f %f %f %f\n",u.F[0],u.F[1],u.F[2],u.F[3]);
return 0;
}
Am I correct in the three items I have noted?
My actual example is not possible to use here due to size, but it was suggested in a comment that I extend this example to something compileable. I compiled and ran the above in both clang (-Weverything -std=c11) and gcc (-pedantic -std=c11). Each gave the following:
0 0 0 0
0.000000 0.000000 0.000000 0.000000
42.000000 42.000000 42.000000 42.000000
That seems appropriate, but that does not mean the code is compliant.
EDIT:
To clarify what the code is doing, I will point out the exact instances where the property I mention in the first paragraph is applied.
First, the contents of an uninitialized union are read and modified. This is undefined behavior, rather than unspecified with a potential for UB with traps, if the principle I mention in the first paragraph is true.
Second, the contents of a union are used with the type of an inactive union member. Again, this is undefined behavior, rather than unspecified with a potential for UB with traps, if the principle I mention in the first paragraph is true.
Third, the item just mentioned as "second" produces unspecified behavior with a potential for UB with traps, if first one element of the array contained in the inactive member is modified. This makes the whole array the active member, hence the change in definedness.
I am demonstrating the consequences of the principle in the first paragraph of this question, to show how that principle, if correct, affects the nature of the C standard. Given how significantly it affects the standard in some circumstances, I am looking for help in determining whether the principle I have stated is a correct understanding of the standard.
EDIT:
I think it may help to describe how I get from the standard the principle in the first paragraph above, and how one might disagree. Not much is said on the matter in the standard, so there has to be some filling in of the gaps no matter what.
The standard describes a union as holding one object at a time. This seems to suggest treating it like a structure containing one element. It seems that anything deviating from that interpretation deserves mention. That is how I get to the principle I have stated.
On the other hand, the discussion of effective type does not define the term "declared type". If that term is understood such that union members do not have a declared type, then it could be argued that each subobject of a union need be interpreted as another member recursively. So, in the last example in my code, all floating point array members would need to be initialized, not just the first.
The two examples I give of undefined behavior are important to me to resolve. However, the last example, which relates to the above paragraph, seems most crucial. I could really see an argument either way there.
EDIT:
This is not a type punning question. First, I am talking about writing to unions, not reading from them. Second, I am talking about the validity of doing these writes with a pointer rather than with the union type. This is very different from type punning issues.
This question is more related to strict aliasing than it is to type punning. Because of strict aliasing, you cannot access memory however you want. This question deals with exactly how unions ease the constraints of strict aliasing on their members. The standard never says that they do, but if they didn't, you could never do something like the following.
union{int i;} u; u.i=0; function_working_with_an_int_pointer (&u.i);
So, clearly, unions affect the application of strict aliasing rules in some cases. My question is to confirm that the line I have drawn according to my reading of the standard is correct.
an object contained in a union can only be used if it is active, and that it is active iff it was the last member to have a value stored to it.
The statement is false. The behavior is reliable and defined.
union {
unsigned char c [4];
long d;
} v;
v .d = 0xaabbccddL;
printf ("%x\n", v .c [2]);
It is completely acceptable to access the c member even though it was not the last assigned. On a little endian machine, it will show bb and, on a big endian machine where long is 4 bytes, cc.
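A minimal sketch of the same idea, not part of the original answer: it uses uint32_t instead of long so that both union members have the same size on every platform, and the names word_bytes, byte_of_word and is_little_endian are made up for illustration. Reading the bytes of an object through an unsigned char member is the one case nobody disputes, because character types may alias anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: view a 32-bit word as its four bytes. */
union word_bytes {
    uint32_t word;
    unsigned char bytes[4];
};

/* Returns byte number idx (0..3) of the object representation of value. */
unsigned char byte_of_word(uint32_t value, size_t idx)
{
    union word_bytes v;
    v.word = value;
    return v.bytes[idx];   /* character-type access through the union */
}

/* Returns 1 if the least significant byte is stored first. */
int is_little_endian(void)
{
    return byte_of_word(1u, 0) == 1;
}
```

With this, byte_of_word(0xaabbccddu, 2) yields 0xbb when is_little_endian() returns 1 and 0xcc on a big endian machine.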
Related
In the following example code, is there any undefined or implementation defined behavior? Can I assign a value to one member of a union and read it back from another?
#include <stdio.h>
#include <stdint.h>
struct POINT
{
union
{
float Position[3];
struct { float X, Y, Z; };
};
};
struct INT
{
union
{
uint32_t Long;
uint16_t Words[2];
uint8_t Bytes[4];
};
};
int main(void)
{
struct POINT p;
p.Position[0] = 10;
p.Position[1] = 5;
p.Position[2] = 2;
printf("X: %f; Y: %f; Z: %f\n", p.X, p.Y, p.Z);
struct INT i;
i.Long = 0xDEADBEEF;
printf("0x%4x%4x\n", i.Words[0], i.Words[1]);
printf("0x%2x%2x%2x%2x\n", i.Bytes[0], i.Bytes[1], i.Bytes[2], i.Bytes[3]);
return 0;
}
The output on my machine is:
X: 10.000000; Y: 5.000000; Z: 2.000000
0xbeefdead
0xefbeadde
It's printing the words/bytes in reverse because x86 is little endian, as expected.
is there any undefined or implementation defined behavior?
Some basic concerns:
struct { float X, Y, Z; }; may have padding between X, Y, Z rendering printf("X: %f; Y: %f; Z: %f\n", p.X, p.Y, p.Z); undefined behavior as p.Z, etc. may not be initialized.
i.Long = 0xDEADBEEF; printf("0x%4x%4x\n", i.Words[0], i.Words[1]); leads to implementation defined behavior as C does not require a particular endian. (It appears OP knows of this already.)
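To illustrate the endianness concern, here is a small sketch (my own, with the hypothetical name split_u32) that extracts the bytes with shifts instead of a union. Shifts operate on the value rather than on the object representation, so the result is the same on little endian and big endian machines.

```c
#include <stdint.h>

/* Hypothetical helper: store the bytes of v in out[], most significant
   byte first. The result does not depend on the machine's byte order. */
void split_u32(uint32_t v, uint8_t out[4])
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

Calling split_u32(0xDEADBEEF, bytes) fills bytes with 0xDE, 0xAD, 0xBE, 0xEF in that order on any conforming implementation.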
Can I assign a value to one member of a union and read it back from another?
Yes - within limitations. Other answers well address this part.
Union type punning has been allowed since C99 (although only via a footnote, which is part of the beauty of this language). With some restrictions it is OK.
If the member used to read the contents of a union object is not the
same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as
described in 6.2.6 (a process sometimes called
‘‘type punning’’). This might be a trap representation.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_257.htm
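As a sketch of what the footnote permits, the following reads the bits of a float through a union. The function name float_bits is made up, and the code assumes that float and uint32_t have the same size, which is checked at compile time (C11 _Static_assert). The bit pattern you get back depends on the implementation's float format.

```c
#include <stdint.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* Hypothetical helper: reinterpret the object representation of a float
   as a 32-bit unsigned integer, going through the union object itself. */
uint32_t float_bits(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;      /* store through one member */
    return pun.u;   /* read through the other: type punning per the footnote */
}
```

On an implementation that uses IEEE 754 binary32 for float, float_bits(1.0f) is 0x3f800000.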
I don't think the authors of the Standard have ever reached a consensus on what constructs should be required or expected to behave usefully in what circumstances. Instead, they punt the issue as a "quality of implementation" matter, relying upon implementations to support whatever constructs their customers might need.
The C Standard specifies that reading a union object through a member other than the last member written will reinterpret the bytes therein using the new type. If one looks at the list of lvalue types that may be used to access struct or union objects, however, there is no provision to access structs or unions using objects of non-character member types. In most cases where a pointer or lvalue of a member type would be used, it would be visibly freshly derived from a pointer to, or lvalue of, the parent type, and if a compiler makes any reasonable effort to notice such derivation there would be no need for a general rule allowing use of those types. The question of when to recognize such derivation was left as a quality-of-implementation issue on the presumption that compilers who made any bona fide effort to meet the needs of their customers would probably do a better job than if the Standard tried to write out a set of precise rules.
Rather than making any effort to look for ways in which member-type pointers might be derived from struct or union objects, however, gcc and clang instead opt to go beyond what's actually specified to a far lesser degree than most committee members would have expected. They will treat an operation performed directly on an lvalue formed using value.member or ptr->member as an operation on the parent object. They will also recognize lvalues of the form value.member[index] or ptr->member[index]. On the other hand, despite the fact that (array)[index] is defined as being equivalent to (*((array)+(index))), they will not recognize (*((ptr->member)+(index))) as an operation on the object identified by ptr. They will also, generally needlessly, assume that objects of structure type may interact with unrelated pointers to objects of member type.
If one is writing code that would benefit from the ability to perform type punning, my recommendation would be to explicitly say in the documentation that reliable operation requires -fno-strict-aliasing. The purpose of the aliasing rules was to give compiler writers the freedom to perform optimizations that would not interfere with what their customers needed to do. Compiler writers were expected to recognize and support their customers' needs without regard for whether the Standard required them to do so.
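An alternative that does not depend on any compiler flag, offered here as a sketch rather than as part of the answer above, is to copy the bytes with memcpy. The copy is performed as if through character-type accesses, which 6.5p7 always allows. The function names and the choice of float and uint32_t are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bytes of a float into a uint32_t.
   Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse of float_to_bits. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Optimizing compilers commonly turn such a fixed-size memcpy into a plain register move, so this usually costs nothing at run time.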
In these two examples, does accessing members of the struct by offsetting pointers from other members result in Undefined / Unspecified / Implementation Defined Behavior?
struct {
int a;
int b;
} foo1 = {0, 0};
(&foo1.a)[1] = 1;
printf("%d", foo1.b);
struct {
int arr[1];
int b;
} foo2 = {{0}, 0};
foo2.arr[1] = 1;
printf("%d", foo2.b);
Paragraph 14 of C11 § 6.7.2.1 seems to indicate that this should be implementation-defined:
Each non-bit-field member of a structure or union object is aligned in an implementation-defined manner appropriate to its type.
and later goes on to say:
There may be unnamed padding within a structure object, but not at its beginning.
However, code like the following appears to be fairly common:
union {
int arr[2];
struct {
int a;
int b;
};
} foo3 = {{0, 0}};
foo3.arr[1] = 1;
printf("%d", foo3.b);
(&foo3.a)[1] = 2; // appears to be illegal despite foo3.arr == &foo3.a
printf("%d", foo3.b);
The standard appears to guarantee that foo3.arr is the same as &foo3.a, and it doesn't make sense that referring to it one way is legal and the other not, but equally it doesn't make sense that adding the outer union with the array should suddenly make (&foo3.a)[1] legal.
My reasoning for thinking the first examples must also therefore be legal:
1. foo3.arr is guaranteed to be the same as &foo3.a
2. foo3.arr + 1 and &foo3.b point to the same memory location
3. &foo3.a + 1 and &foo3.b must therefore point to the same memory location (from 1 and 2)
4. struct layouts are required to be consistent, so &foo1.a and &foo1.b should be laid out exactly the same as &foo3.a and &foo3.b
5. &foo1.a + 1 and &foo1.b must therefore point to the same memory location (from 3 and 4)
I've come across some outside sources that suggest that both the foo3.arr[1] and (&foo3.a)[1] examples are illegal, however I haven't been able to find a concrete statement in the standard that would make it so.
Even if they were both illegal, though, it's also possible to construct the same scenario with flexible array members, which, as far as I can tell, does have standard-defined behavior.
union {
struct {
int x;
int arr[];
};
struct {
int y;
int a;
int b;
};
} foo4;
The original application is considering whether or not a buffer overflow from one struct field into another is strictly speaking defined by the standard:
struct {
char buffer[8];
char overflow[8];
} buf;
strcpy(buf.buffer, "Hello world!");
puts(buf.overflow);
I would expect this to output "rld!" on nearly any real-world compiler, but is this behavior guaranteed by the standard, or is it an undefined or implementation-defined behavior?
Introduction: The standard is inadequate in this area, and there are decades of history of argument on this topic and strict aliasing with no convincing resolution or proposal to fix it.
This answer reflects my view rather than any imposition of the Standard.
Firstly: it's generally agreed that the code in your first code sample is undefined behaviour due to accessing outside the bounds of an array via direct pointer arithmetic.
The rule is C11 6.5.6/8. It says that indexing from a pointer must remain within "the array object" (or one past the end). It doesn't say which array object but it is generally agreed that in the case int *p = &foo.a; then "the array object" is foo.a, and not any larger object of which foo.a is a subobject.
Secondly: it's generally agreed that both of your union examples are correct. The standard explicitly says that any member of a union may be read; whatever the contents of the relevant memory location are, they are interpreted as the type of the union member being read.
You suggest that the union being correct implies that the first code should be correct too, but it does not. The issue is not with specifying the memory location read; the issue is with how we arrived at the expression specifying that memory location.
Even though we know that &foo.a + 1 and &foo.b are the same memory address, it's valid to access an int through the second and not valid to access an int through the first.
It's generally agreed that you can access the int by computing its address in other ways that don't break the 6.5.6/8 rule, e.g.:
((int *)((char *)&foo + offsetof(foo, b)))[0]
or
((int *)((uintptr_t)&foo.a + sizeof(int)))[0]
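A compilable version of the first expression might look like the sketch below. Note that offsetof takes a type name rather than an object, so the struct is given the hypothetical tag pair here; the function name pair_b is also made up.

```c
#include <stddef.h>

struct pair {
    int a;
    int b;
};

/* Hypothetical helper: reach member b from a pointer to the whole struct
   by byte arithmetic, without indexing past the end of member a. */
int *pair_b(struct pair *p)
{
    return (int *)((char *)p + offsetof(struct pair, b));
}
```

Given struct pair foo = {0, 0}; the statement *pair_b(&foo) = 1; stores into foo.b, and pair_b(&foo) compares equal to &foo.b.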
It's not generally agreed whether ((int *)&foo)[1] is valid. Some say it's basically the same as your first code, since the standard says "a pointer to a structure object, suitably converted, points to its initial member". Others say it's basically the same as my (char *) example above because it follows from the specification of pointer casting. A few even claim it's a strict aliasing violation because it aliases a struct as an array.
Maybe relevant is N2090 - Pointer provenance proposal. This does not directly address the issue, and doesn't propose a repeal of 6.5.6/8.
According to C11 draft N1570 6.5p7, an attempt to access the stored value of a struct or union object using anything other than an lvalue of character type, the struct or union type, or a containing struct or union type, invokes UB even if behavior would otherwise be fully described by other parts of the Standard. This section contains no provision that would allow an lvalue of a non-character member type (or any non-character numeric type, for that matter) to be used to access the stored value of a struct or union.
According to the published Rationale document, however, the authors of the Standard recognized that different implementations offered different behavioral guarantees in cases where the Standard imposed no requirements, and regarded such "popular extensions" as a good and useful thing. They judged that questions of when and how such extensions should be supported would be better answered by the marketplace than by the Committee. While it may seem weird that the Standard would allow an obtuse compiler to ignore the possibility that someStruct.array[i] might affect the stored value of someStruct, the authors of the Standard recognized that any compiler whose authors aren't deliberately obtuse will support such a construct whether the Standard mandates or not, and that any attempt to mandate any kind of useful behavior from obtusely-designed compilers would be futile.
Thus, a compiler's level of support for essentially anything having to do with structures or unions is a quality-of-implementation issue. Compiler writers who are focused on being compatible with a wide range of programs will support a wide range of constructs. Those which are focused on maximizing the performance of code that needs only those constructs without which the language would be totally useless, will support a much narrower set. The Standard, however, is devoid of guidance on such issues.
PS--Compilers that are configured to be compatible with MSVC-style volatile semantics will interpret that qualifier as indicating that an access to the pointer may have side-effects that interact with objects whose address has been taken and that aren't guarded by restrict, whether or not there is any other reason to expect such a possibility. Use of such a qualifier when accessing storage in "unusual" ways may make it more obvious to human readers that the code is doing something "weird" at the same time as it will thus ensure compatibility with any compiler that uses such semantics, even if such compiler would not otherwise recognize that access pattern. Unfortunately, some compiler writers refuse to support such semantics at anything other than optimization level 0 except with programs that demand it using non-standard syntax.
How can *i and u.i print different numbers in this code, even though i is defined as int *i = &u.i;? I can only assume that I'm triggering UB here, but I can't see how exactly.
(The ideone demo replicates this if I select 'C' as the language. But as @2501 pointed out, not if 'C99 strict' is the language. Then again, I get the problem with gcc-5.3.0 -std=c99!)
// gcc -fstrict-aliasing -std=c99 -O2
#include <stdio.h>
union
{
int i;
short s;
} u;
int * i = &u.i;
short * s = &u.s;
int main()
{
*i = 2;
*s = 100;
printf(" *i = %d\n", *i); // prints 2
printf("u.i = %d\n", u.i); // prints 100
return 0;
}
(gcc 5.3.0, with -fstrict-aliasing -std=c99 -O2, also with -std=c11)
My theory is that 100 is the 'correct' answer, because the write to the union member through the short-lvalue *s is defined as such (for this platform/endianness/whatever). But I think that the optimizer doesn't realize that the write to *s can alias u.i, and therefore it thinks that *i=2; is the only line that can affect *i. Is this a reasonable theory?
If *s can alias u.i, and u.i can alias *i, then surely the compiler should think that *s can alias *i? Shouldn't aliasing be 'transitive'?
Finally, I always had this assumption that strict-aliasing problems were caused by bad casting. But there is no casting in this!
(My background is C++, I'm hoping I'm asking a reasonable question about C here. My (limited) understanding is that, in C99, it is acceptable to write through one union member and then reading through another member of a different type.)
The discrepancy is caused by the -fstrict-aliasing optimization option. Its behavior and possible traps are described in the GCC documentation:
Pay special attention to code like this:
union a_union {
int i;
double d;
};
int f() {
union a_union t;
t.d = 3.0;
return t.i;
}
The practice of reading from a different union member than the one
most recently written to (called “type-punning”) is common. Even with
-fstrict-aliasing, type-punning is allowed, provided the memory is
accessed through the union type. So, the code above works as expected.
See Structures unions enumerations and bit-fields implementation.
However, this code might not:
int f() {
union a_union t;
int* ip;
t.d = 3.0;
ip = &t.i;
return *ip;
}
Note that a conforming implementation is perfectly allowed to take advantage of this optimization, as the second code example exhibits undefined behaviour. See Olaf's and others' answers for reference.
C standard (i.e. C11, n1570), 6.5p7:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
...
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
a character type.
The lvalue expressions of your pointers are not union types, thus this exception does not apply. The compiler is entitled to exploit this undefined behaviour.
Make the pointers' types pointers to the union type and dereference with the respective member. That should work:
union {
...
} u, *i, *p;
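Filling in that idea as a sketch, with a hypothetical union tag and function names: the functions below receive a pointer to the union and name the member at each access, so the compiler can see that both accesses go through the union object. The members are chosen so that they cover exactly the same bytes.

```c
#include <stdint.h>

union word_or_halves {
    uint32_t word;
    uint16_t halves[2];
};

/* Store through one member, via a pointer to the union type. */
void store_word(union word_or_halves *p, uint32_t value)
{
    p->word = value;
}

/* Read through the other member, again via the union type.
   Which half comes first depends on the byte order of the machine. */
uint16_t load_half(const union word_or_halves *p, int idx)
{
    return p->halves[idx];
}
```

After store_word(&u, 0xDEADBEEF), the two halves read back as 0xBEEF and 0xDEAD in an order that depends on endianness.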
Strict aliasing is underspecified in the C Standard, but the usual interpretation is that union aliasing (which supersedes strict aliasing) is only permitted when the union members are directly accessed by name.
For rationale behind this consider:
void f(int *a, short *b) {
The intent of the rule is that the compiler can assume a and b don't alias, and generate efficient code in f. But if the compiler had to allow for the fact that a and b might be overlapping union members, it actually couldn't make those assumptions.
Whether or not the two pointers are function parameters is immaterial; the strict aliasing rule doesn't differentiate based on that.
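To make the rationale concrete, here is a hypothetical function of the same shape (the name store_both and the body are mine, not the answer's). Because int and short are incompatible types, a compiler may assume the two stores go to different objects and return the constant 1 without reloading *a. That is fine when the pointers really refer to distinct objects, and it is exactly what goes wrong when they point at overlapping union members.

```c
/* Hypothetical example: the compiler may assume *a and *b do not overlap. */
int store_both(int *a, short *b)
{
    *a = 1;
    *b = 2;      /* assumed not to modify *a */
    return *a;   /* may be compiled as "return 1" */
}
```

Called with two separate objects, as in int x; short y; store_both(&x, &y); the function returns 1 and leaves 2 in y.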
This code indeed invokes UB, because you do not respect the strict aliasing rule. n1256 draft of C99 states in 6.5 Expressions §7:
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the
object,
— a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
Between the *i = 2; and the printf(" *i = %d\n", *i); only a short object is modified. With the help of the strict aliasing rule, the compiler is free to assume that the int object pointed to by i has not been changed, and it can directly use a cached value without reloading it from main memory.
It is blatantly not what a normal human being would expect, but the strict aliasing rule was precisely written to allow optimizing compilers to use cached values.
For the second print, unions are referenced in same standard in 6.2.6.1 Representations of types / General §7:
When a value is stored in a member of an object of union type, the bytes of the object
representation that do not correspond to that member but do correspond to other members
take unspecified values.
So once u.s has been stored, u.i has taken a value that the standard leaves unspecified.
But we can read later in 6.5.2.3 Structure and union members §3 note 82:
If the member used to access the contents of a union object is not the same as the member last used to
store a value in the object, the appropriate part of the object representation of the value is reinterpreted
as an object representation in the new type as described in 6.2.6 (a process sometimes called "type
punning"). This might be a trap representation.
Although notes are not normative, they do allow better understanding of the standard. When u.s was stored through the *s pointer, the bytes corresponding to a short were changed to the value 100. Assuming a little endian system, as 100 is smaller than the maximum value of a short, the representation as an int should now be 100, as the high order bytes were 0.
TL/DR: even if not normative, note 82 should require that on a little endian system of the x86 or x64 families, printf("u.i = %d\n", u.i); prints 100. But per the strict aliasing rule, the compiler is still allowed to assume that the value pointed to by i has not changed, so printf(" *i = %d\n", *i); may print 2.
You are probing a somewhat controversial area of the C standard.
This is the strict aliasing rule:
An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to
the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
a character type.
(C2011, 6.5/7)
The lvalue expression *i has type int. The lvalue expression *s has type short. These types are not compatible with each other, nor both compatible with any other particular type, nor does the strict aliasing rule afford any other alternative that allows both accesses to conform if the pointers are aliased.
If at least one of the accesses is non-conforming then the behavior is undefined, so the result you report -- or indeed any other result at all -- is entirely acceptable. In practice, the compiler presumably produces code that reorders the assignments with the printf() calls, or that uses a previously loaded value of *i from a register instead of re-reading it from memory, or some similar thing.
The aforementioned controversy arises because people will sometimes point to footnote 95:
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.
Footnotes are informational, however, not normative, so there's really no question which text wins if they conflict. Personally, I take the footnote simply as an implementation guidance, clarifying the meaning of the fact that the storage for union members overlaps.
Looks like this is a result of the optimizer doing its magic.
With -O0, both lines print 100 as expected (assuming little-endian). With -O2, there is some reordering going on.
gdb gives the following output:
(gdb) start
Temporary breakpoint 1 at 0x4004a0: file /tmp/x1.c, line 14.
Starting program: /tmp/x1
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Temporary breakpoint 1, main () at /tmp/x1.c:14
14 {
(gdb) step
15 *i = 2;
(gdb)
18 printf(" *i = %d\n", *i); // prints 2
(gdb)
15 *i = 2;
(gdb)
16 *s = 100;
(gdb)
18 printf(" *i = %d\n", *i); // prints 2
(gdb)
*i = 2
19 printf("u.i = %d\n", u.i); // prints 100
(gdb)
u.i = 100
22 }
(gdb)
0x0000003fa441d9f4 in __libc_start_main () from /lib64/libc.so.6
(gdb)
The reason this happens, as others have stated, is because it is undefined behavior to access a variable of one type through a pointer to another type even if the variable in question is part of a union. So the optimizer is free to do as it wishes in this case.
The variable of the other type can only be read directly via a union which guarantees well defined behavior.
What's curious is that even with -Wstrict-aliasing=2, gcc (as of 4.8.4) doesn't complain about this code.
Whether by accident or by design, C89 includes language which has been interpreted in two different ways (along with various interpretations in-between). At issue is the question of when a compiler should be required to recognize that storage used for one type might be accessed via pointers of another. In the example given in the C89 rationale, aliasing is considered between a global variable which is clearly not part of any union and a pointer to a different type, and nothing in the code would suggest that aliasing could occur.
One interpretation horribly cripples the language, while the other would restrict the use of certain optimizations to "non-conforming" modes. If those who didn't want to have their preferred optimizations given second-class status had written C89 to unambiguously match their interpretation, those parts of the Standard would have been widely denounced and there would have been some sort of clear recognition of a non-broken dialect of C which would honor the non-crippling interpretation of the given rules.
Unfortunately, what has happened instead is that, since the rules clearly don't require compiler writers to apply a crippling interpretation, most compiler writers have for years simply interpreted the rules in a fashion which retains the semantics that made C useful for systems programming; programmers didn't have any reason to complain that the Standard didn't mandate that compilers behave sensibly because from their perspective it seemed obvious to everyone that they should do so despite the sloppiness of the Standard. Meanwhile, however, some people insist that since the Standard has always allowed compilers to process a semantically-weakened subset of Ritchie's systems-programming language, there's no reason why a standard-conforming compiler should be expected to process anything else.
The sensible resolution for this issue would be to recognize that C is used for sufficiently varied purposes that there should be multiple compilation modes--one required mode would treat all accesses of everything whose address was taken as though they read and write the underlying storage directly, and would be compatible with code which expects any level of pointer-based type punning support. Another mode could be more restrictive than C11 except when code explicitly uses directives to indicate when and where storage that has been used as one type would need to be reinterpreted or recycled for use as another. Other modes would allow some optimizations but support some code that would break under stricter dialects; compilers without specific support for a particular dialect could substitute one with more defined aliasing behaviors.
Unfortunately, the descriptions of a particular behavior of unions in C in online resources (I can list a few if required) differ vastly from one source to another, and in some cases are insufficient. One of the resources says, "You can define a union with many members, but only one member can contain a value at any given time," and that's about it. And then another resource says, "in union, the only member whose value is currently stored will have the memory."
So, now if I run this program,
#include <stdio.h>
union item
{
int a;
float b;
char ch;
};
int main( )
{
union item it;
it.a = 12;
it.b = 20.2;
it.ch='z';
printf("%d\n",it.a);
printf("%f\n",it.b);
printf("%c\n",it.ch);
return 0;
}
I get output as:
1101109626
20.199940
z
The online website states that a and b both are corrupted, although I disagree slightly here as b is close to 20.2. Anyhow, now if I write char in the beginning and then write a and b (still same format), I see that b has right value but other two are corrupted. However, if I declare b as int, a and b both are correct. So I deduce that, if members of union are of the same format, then when you write any one member, the other members WILL contain the same value (since they are of same format) which you can read off at any time without any problem. But if they are all of different format, then the one who was written last is only the valid value. I found no online resource which states this categorically. Is this assumption correct?
But if they are all of different format, then the one who was written
last is only the valid value.
You are almost correct.
When you write one member of a union and read another (one that wasn't written last), the resulting value is unspecified and may be a trap representation.
From one footnote of the C11 n1570 draft (see footnote 95 in 6.5.2.3):
If the member used to read the contents of a union object is not the
same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as described
in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be
a trap representation.
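To illustrate the reinterpretation the footnote describes, here is a minimal sketch. The helper names float_bits_union and float_bits_memcpy are made up for this example, and the code assumes that float is 32 bits wide. Both helpers should return the same bit pattern, since each one only reinterprets the bytes of the stored float. This also accounts for the output in the question, assuming a little-endian machine with IEEE 754 floats and ASCII: 20.2f is stored as 0x41A1999A, writing 'z' (0x7A) over the lowest byte gives 0x41A1997A, which is 1101109626 as an int and still close to 20.2 as a float.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float occupies exactly 32 bits. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical helper: store a float, then read the same bytes as an integer. */
static uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;      /* store through one member */
    return pun.u;   /* read through the other: the bytes are reinterpreted */
}

/* Hypothetical helper: the same reinterpretation done by copying bytes. */
static uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

The memcpy version is included only as a cross-check; it copies the object representation byte for byte, which is what the union read amounts to.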
The whole idea of a C union is to share the same storage area for different types. If all members of the union were of the same type, then it would make no sense to have a union at all, because it would be equal to a single instance of that type for all purposes.
Unions can help you achieve type punning, i.e. "raw" conversion between different types, but the behavior should be considered UB and is platform and compiler dependent. Sometimes this behavior is exactly what you want: e.g. you may want to get the native representation of a 32-bit float converted into a 32-bit integer, or treat a struct of two 32-bit integers as a union with a single 64-bit integer to perform 64-bit arithmetic and still have simple access to the high and low words.
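The high/low word case might look like the sketch below. The type u64_words and the helpers low_word_is_first, low_word and high_word are hypothetical names for this example. Which struct member holds the low word depends on byte order, so the sketch probes the layout at run time instead of assuming it; it does assume the platform is either plain little-endian or plain big-endian.

```c
#include <stdint.h>

/* Hypothetical overlay of one 64-bit integer and a struct of two 32-bit words. */
union u64_words {
    uint64_t whole;
    struct { uint32_t first, second; } word;
};

/* Assumption for this sketch: the struct has no padding. */
_Static_assert(sizeof(union u64_words) == sizeof(uint64_t), "unexpected padding");

/* Nonzero if the low 32 bits are stored in `first` (little-endian layout). */
static int low_word_is_first(void)
{
    union u64_words probe;
    probe.whole = 1;
    return probe.word.first == 1;
}

/* Low 32 bits of v, read through the overlaid struct. */
static uint32_t low_word(uint64_t v)
{
    union u64_words x;
    x.whole = v;
    return low_word_is_first() ? x.word.first : x.word.second;
}

/* High 32 bits of v, read through the overlaid struct. */
static uint32_t high_word(uint64_t v)
{
    union u64_words x;
    x.whole = v;
    return low_word_is_first() ? x.word.second : x.word.first;
}
```

Shifts and masks (v >> 32 and v & 0xFFFFFFFF) give the same words without depending on byte order at all, which is worth weighing against the convenience of the overlay.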
Generally speaking, you will want to use it to conserve space when you only need to store a value of one type at any given moment. Keep in mind that a union can combine any mix of structs, not only primitive types, and its memory will still be used efficiently: the union is at least as large as its largest member.
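A small sketch of the size relationship, using made-up types small, large and either. It only checks properties the language guarantees: the union is at least as large as its largest member, and every member starts at offset 0.

```c
#include <stddef.h>

/* Hypothetical structs of very different sizes. */
struct small { char tag; };
struct large { double values[4]; };

/* Both structs share the same storage inside the union. */
union either {
    struct small s;
    struct large l;
};

/* The union must be able to hold its largest member. */
_Static_assert(sizeof(union either) >= sizeof(struct large),
               "union smaller than its largest member");

/* Bytes saved compared with storing both structs side by side. */
static size_t bytes_saved(void)
{
    return sizeof(struct small) + sizeof(struct large) - sizeof(union either);
}
```

The exact value returned by bytes_saved depends on the platform's sizes and alignment rules, so no particular number is claimed here.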
As the comments and other answers explain, the purpose of a union (and a struct) is to allow for compound variable types, and in the case of a union specifically, to share memory among the members. It makes sense that only one member at a time owns the memory allocated for the union. If, after one member has been assigned a value, another member appears to have kept its previously assigned value, that is purely by chance and should be considered undefined (or unspecified) behavior. In simple terms, don't rely on it.
Web references are sometimes OK for providing extra insight, but here is some of what the C standard says on the topic:
C99 6.2.5.20
A union type describes an overlapping nonempty set of member objects,
each of which has an optionally specified name and possibly distinct
type.
A few lines down:
C99 6.2.6.1.7
When a value is stored in a member of an object of union type, the
bytes of the object representation that do not correspond to that
member but do correspond to other members take unspecified values.
"You can define a union with many members, but only one member can contain a value at any given time." is the correct statement.
The size of a union is the size of its largest member (plus possibly some padding). Using a member instructs the compiler to use the type of that member.
#include <stdio.h>
#define ITHASANINT 1
#define ITHASALONG 2
#define ITHASAFLOAT 3
struct example {
int what_is_in_it; // tag recording which member of u was stored last
union {
int a;
long b;
float f;
} u;
} e;
// print_e is a hypothetical wrapper so that the switch sits inside a function
void print_e(void)
{
switch (e.what_is_in_it) {
case ITHASANINT: printf("%d\n", e.u.a); break; // compiler passes an int
case ITHASALONG: printf("%ld\n", e.u.b); break; // compiler passes a long
case ITHASAFLOAT:printf("%f\n", e.u.f); break; // compiler passes a float (promoted to double)
}
}
Is it possible to have the following assert fail with any compiler on any architecture?
union { int x; int y; } u;
u.x = 19;
assert(u.x == u.y);
C99 makes a special guarantee for the case when two members of a union are structures that share an initial sequence of fields:
struct X {int a; /* other fields may follow */ };
struct Y {int a; /* other fields may follow */ };
union {struct X x; struct Y y;} u; // C requires the struct keyword here
u.x.a = 19;
assert(u.x.a == u.y.a); // Guaranteed never to fail by 6.5.2.3-5.
6.5.2.3-5 : One special guarantee is made in order to simplify the use of unions: if a union contains
several structures that share a common initial sequence (see below), and if the union
object currently contains one of these structures, it is permitted to inspect the common
initial part of any of them anywhere that a declaration of the complete type of the union is
visible. Two structures share a common initial sequence if corresponding members have
compatible types (and, for bit-fields, the same widths) for a sequence of one or more
initial members.
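A common use of this guarantee is a tag stored as the first member of every structure in the union. The sketch below uses made-up names (int_msg, float_msg, msg, and the helpers); it reads the tag through one struct regardless of which struct was stored, which is the inspection 6.5.2.3-5 permits while the declaration of the union type is visible.

```c
/* Hypothetical tags and message types; both structs begin with an int `kind`. */
enum { KIND_INT = 1, KIND_FLOAT = 2 };

struct int_msg   { int kind; int   value; };
struct float_msg { int kind; float value; };

union msg {
    struct int_msg   i;
    struct float_msg f;
};

static union msg make_int_msg(int v)
{
    union msg m;
    m.i.kind = KIND_INT;
    m.i.value = v;
    return m;
}

static union msg make_float_msg(float v)
{
    union msg m;
    m.f.kind = KIND_FLOAT;
    m.f.value = v;
    return m;
}

/* Inspects the common initial member through `i`, whichever struct was stored. */
static int msg_kind(union msg m)
{
    return m.i.kind;
}

/* Uses the tag to pick the member that was actually stored. */
static double msg_as_double(union msg m)
{
    switch (msg_kind(m)) {
    case KIND_INT:   return m.i.value;
    case KIND_FLOAT: return m.f.value;
    default:         return 0.0;
    }
}
```

Only the shared `kind` member is read across structs; the `value` members are always read through the struct that was stored.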
However, I was unable to find a comparable guarantee for non-structured types inside a union. This may very well be an omission, though: if the standard goes to some length to describe what must happen with structured types that are not even the same, it should have clarified the same point for simpler, non-structured types.
The assert in the problem will never fail in an implementation of standard C because accessing u.y after an assignment to u.x is required to reinterpret the bytes of u.x as the type of u.y. Since the types are the same, the reinterpretation produces the same value.
This requirement is noted in C 2011 (N1570) 6.5.2.3 note 95, which indicates it derives from clause 6.2.6, which covers the representations of types. Note 95 says:
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.
(N1570 is an unofficial draft but is readily available on the net.)
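The argument above can be put into a small function. The name read_y_after_storing_x is made up for this sketch; it stores through x and reads through y, and under the reading of note 95 given here the two always agree because both members have type int.

```c
/* Hypothetical helper: store through x, read the same bytes back through y. */
static int read_y_after_storing_x(int value)
{
    union { int x; int y; } u;
    u.x = value;   /* y was not the member last stored ... */
    return u.y;    /* ... but it has the same type, so the value is unchanged */
}
```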
I believe this question is very hard to answer in the manner you seem to expect.
As far as I know, reading a field of a union other than the one most recently written to is undefined behavior.
Thus, it's impossible to answer with "no", since any compiler writer is free to specifically detect this and make it fail just out of spite, if they feel like it.