According to this reddit comment thread, the behavior is undefined if an attempt is made to read memory before it has been written to. I'm referring to normal heap memory which has been successfully malloc'ed.
... note that this is not strictly valid C: the compiler/runtime system is allowed to initialize uninitialized memory with so-called trap representations, which cause undefined behavior on access.
I find this hard to believe. Is there a Standard quote?
Of course, I understand that there is no guarantee that the memory has been zeroed out. The values in this uninitialized memory are essentially pseudo-random or arbitrary. But I can't really believe that the Standard would refer to this as undefined behaviour (in the sense that it might segfault, or delete all your files, or whatever). The rest of the reddit thread there didn't cast any more light on this issue.
If accessing through a char*, this is defined. But otherwise, this is undefined behavior.
(C99, 7.20.3.3) "The malloc function allocates space for an object whose size is specified by size and whose value is indeterminate."
on indeterminate value:
(C99, 3.17.2p1) "indeterminate value: either an unspecified value or a trap representation"
on trap representation reading through a non-character type being undefined behavior:
(C99, 6.2.6.1p5) "Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. [...] Such a representation is called a trap representation."
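A minimal sketch of the portable pattern these quotes imply, assuming a hosted implementation: store a value into every element before reading it through a non-character type. The helper names alloc_ints_filled and sum_ints are hypothetical and not from the thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n ints and store a value into each one,
   so that no element is still indeterminate when it is later read. */
int *alloc_ints_filled(size_t n, int fill)
{
    int *p = malloc(n * sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        p[i] = fill;            /* a store, not a read: always fine */
    return p;
}

/* Reads every element as an int; valid only because each was written first. */
long sum_ints(const int *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

Calling sum_ints on a block that came straight from malloc, without the fill loop, is the case the quotes above describe as potentially undefined.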
It rationally has to be undefined. Otherwise, the necessary behavior of a C program running under something like Valgrind, which diagnoses reads of uninitialized memory and throws appropriate errors when they occur, would be illegal under the standard.
Reading the standard, the key question is whether the values of malloc'ed memory are "unspecified values" (which must be some readable value), or "indeterminate values" (which may contain trap representations; cf. definition 3.17.2).
As per 7.20.3.3, quoted in the other answers, malloc returns a block of memory which contains indeterminate values, and therefore may contain trap representations. The relevant discussion of trap representations is 6.2.6.1, part 5:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. ... Such a representation is called a trap representation.
So, there you go. Basically, the C implementation is permitted to detect (i.e., "trap") references to indeterminate values, and deal with that error how it chooses, including in undefined ways.
ISO/IEC 9899:1999, 7.20.3.3 The malloc function:
The malloc function allocates space for an object whose size is specified by size and
whose value is indeterminate.
6.2.6.1 Representation of types, §5:
Certain object representations need not represent a value of the object type. If the stored
value of an object has such a representation and is read by an lvalue expression that does
not have character type, the behavior is undefined.
And footnote 41 makes it even more explicit (at least for automatic variables):
Thus, an automatic variable can be initialized to a trap representation without causing undefined behavior, but the value of the variable cannot be used until a proper value is stored in it.
The following is not undefined behavior in modern C:
union foo
{
    int i;
    float f;
};
union foo bar;
bar.f = 1.0f;
printf("%08x\n", bar.i);
and prints the hex representation of 1.0f.
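For comparison, here is a sketch of the same reinterpretation written two ways, assuming C11 and a platform where float is 32 bits wide. It uses uint32_t instead of int so that the result has no padding bits and matches an unsigned conversion specifier. The names float_bits, bits_via_union and bits_via_memcpy are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float and uint32_t have the same size on this platform. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

union float_bits {
    float f;
    uint32_t u;
};

/* Type punning through a union: store one member, read the other. */
uint32_t bits_via_union(float value)
{
    union float_bits pun;
    pun.f = value;
    return pun.u;
}

/* The same reinterpretation done by copying the object representation. */
uint32_t bits_via_memcpy(float value)
{
    uint32_t u;
    memcpy(&u, &value, sizeof u);
    return u;
}
```

On an implementation that uses IEEE 754 single precision for float, both functions return 0x3f800000 for 1.0f.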
However the following is undefined behavior:
int x;
printf("%08x\n", x);
What about this?
union xyzzy
{
    char c;
    int i;
};
union xyzzy plugh;
This ought to be undefined behavior since no member of plugh has been written.
printf("%08x\n", plugh.i);
But what about this? Is this undefined behavior or not?
plugh.c = 'A';
printf("%08x\n", plugh.i);
Most C compilers nowadays will have sizeof(char) < sizeof(int), with sizeof(int) being either 2 or 4. That means that in these cases, at most 50% or 25% of plugh.i will have been written to, but reading the remaining bytes will be reading uninitialized data, and hence should be undefined behavior. On the basis of this, is the entire read undefined behavior?
Defect report 283: Accessing a non-current union member ("type punning") covers this and tells us there is undefined behavior if there is a trap representation.
The defect report asked:
In the paragraph corresponding to 6.5.2.3#5, C89 contained this
sentence:
With one exception, if a member of a union object is accessed after a value has been stored in a different member of the object, the
behavior is implementation-defined.
Associated with that sentence was this footnote:
The "byte orders" for scalar types are invisible to isolated programs that do not indulge in type punning (for example, by
assigning to one member of a union and inspecting the storage by
accessing another member that is an appropriately sized array of
character type), but must be accounted for when conforming to
externally imposed storage layouts.
The only corresponding verbiage in C99 is 6.2.6.1#7:
When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that
member but do correspond to other members take unspecified values, but
the value of the union object shall not thereby become a trap
representation.
It is not perfectly clear that the C99 words have the same
implications as the C89 words.
The defect report added the following footnote:
Attach a new footnote 78a to the words "named member" in 6.5.2.3#3:
78a If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
C11 6.2.6.1 General tells us:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined.50) Such a representation is called a trap representation.
From 6.2.6.1 §7 :
When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.
So, the value of plugh.i would be unspecified after setting plugh.c.
From a footnote to 6.5.2.3 §3 :
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.
This says that type punning is specifically allowed (as you asserted in your question). But it might result in a trap representation, in which case reading the value has undefined behavior according to 6.2.6.1 §5 :
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. 50) Such a representation is called a trap representation.
If it's not a trap representation, there seems to be nothing in the standard that would make this undefined behavior, because from 4 §3, we get :
A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3.
Other answers address the main question of whether reading plugh.i produces undefined behavior when plugh was not initialized and only plugh.c was ever assigned. In short: no, unless the bytes of plugh.i constitute a trap representation at the time of the read.
But I want to speak directly to a preliminary assertion in the question:
Most C compilers nowadays will have sizeof(char) < sizeof(int), with
sizeof(int) being either 2 or 4. That means that in these cases at
most 50% or 25% of plugh.i will have been written to
The question seems to be supposing that assigning a value to plugh.c will leave undisturbed those bytes of plugh that do not correspond to c, but in no way does the standard support that proposition. In fact, it expressly denies any such guarantee, for as others have noted:
When a value is stored in a member of an object of union type, the
bytes of the object representation that do not correspond to that
member but do correspond to other members take unspecified values.
(C2011, 6.2.6.1/7; emphasis added)
Although this does not guarantee that the unspecified values taken by those bytes are different from their values prior to the assignment, it expressly provides that they might be. And it is entirely plausible that in some implementations they often will be. For example, on a platform that supports only word-sized writes to memory or where such writes are more efficient than byte-sized ones, it is likely that assignments to plugh.c are implemented with word-sized writes, without first loading the other bytes of plugh.i so as to preserve their values.
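A small sketch of what can be relied on after the store, assuming the union from the question: only the byte that corresponds to c is known, and the object can be inspected safely through unsigned char. The function name first_byte_after_store is hypothetical.

```c
#include <string.h>

union xyzzy {
    char c;
    int i;
};

/* Stores to plugh.c, then inspects the union's bytes as unsigned char.
   Only the byte that corresponds to c is known; the others hold
   unspecified values and are deliberately not examined here. */
unsigned char first_byte_after_store(char value)
{
    union xyzzy plugh;
    unsigned char bytes[sizeof plugh];

    plugh.c = value;
    memcpy(bytes, &plugh, sizeof bytes);   /* character-type access */
    return bytes[0];                       /* every member starts at offset 0 */
}
```

Nothing in this sketch reads plugh.i, so it does not depend on whether int has trap representations.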
C11 §6.2.6.1 p7 says:
When a value is stored in a member of an object of union type, the
bytes of the object representation that do not correspond to that
member but do correspond to other members take unspecified values.
So, plugh.i would be unspecified.
In cases where useful optimizations might cause some aspects of a program's execution to behave in a fashion inconsistent with the Standard (e.g. two consecutive reads of the same byte yielding inconsistent results), the Standard generally attempts to characterize situations where such effects might be observed, and then classify such situations as invoking Undefined Behavior. It doesn't make much effort to ensure that its characterizations don't "ensnare" some actions whose behavior should obviously be processed predictably, since it expects compiler writers to avoid behaving obtusely in such cases.
Unfortunately, there are some corner cases where this approach really doesn't work well. For example, consider:
#include <stdint.h>
struct c8 { uint32_t u; unsigned char arr[4]; };
union uc { uint32_t u; struct c8 dat; } uuc1, uuc2;
void wowzo(void)
{
    union uc u;
    u.u = 123;
    uuc1 = u;
    uuc2 = u;
}
I think it's clear that the Standard does not require that the bytes in uuc1.dat.arr or uuc2.dat.arr contain any particular value, and that a compiler would be allowed to, for each of the four bytes i==0..3, copy uuc1.dat.arr[i] to uuc2.dat.arr[i], copy uuc2.dat.arr[i] to uuc1.dat.arr[i], or write both uuc1.dat.arr[i] and uuc2.dat.arr[i] with matching values. I don't think it's clear whether the Standard intends to require that a compiler select one of those courses of action rather than simply leaving those bytes holding whatever they happen to hold.
Clearly the code is supposed to have fully defined behavior if nothing ever observes the contents of uuc1.dat.arr or uuc2.dat.arr, and there's nothing to suggest that examining those arrays should invoke UB. Further, there is no defined means via which the value of u.dat.arr could change between the assignments to uuc1 and uuc2. That would suggest that uuc1.dat.arr and uuc2.dat.arr should contain matching values. On the other hand, for some kinds of programs, storing obviously-meaningless data into uuc1.dat.arr and/or uuc2.dat.arr would seldom serve any useful purpose. I don't think the authors of the Standard particularly intended to require such stores, but saying that the bytes take on "Unspecified" values makes them necessary. I'd expect such a behavioral guarantee to be deprecated, but I don't know what could replace it.
Various esteemed, high-rep users on SO keep insisting that reading a variable with indeterminate value "is always UB". So where exactly is this mentioned in the C standard?
It is very clear that an indeterminate value could either be an unspecified value or a trap representation:
3.19.2
indeterminate value
either an unspecified value or a trap representation
3.19.3
unspecified value
valid value of the relevant type where this International Standard imposes no
requirements on which value is chosen in any instance
NOTE An unspecified value cannot be a trap representation.
3.19.4
trap representation
an object representation that need not represent a value of the object type
It is also clear that reading a trap representation invokes undefined behavior, 6.2.6.1:
Certain object representations need not represent a value of the object type. If the stored
value of an object has such a representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. If such a representation is produced
by a side effect that modifies all or any part of the object by an lvalue expression that
does not have character type, the behavior is undefined.50) Such a representation is called
a trap representation.
However, an indeterminate value does not necessarily contain a trap representation. In fact, trap representations are very rare for systems using two's complement.
Where in the C standard does it actually say that reading an indeterminate value invokes undefined behavior?
I was reading the non-normative Annex J of C11 and found that this is indeed listed as one case of UB:
The value of an object with automatic storage duration is used while it is
indeterminate (6.2.4, 6.7.9, 6.8).
However, the listed sections are irrelevant. 6.2.4 only states rules regarding lifetime and when a variable's value becomes indeterminate. Similarly, 6.7.9 is about initialization and states how a variable's value becomes indeterminate. 6.8 seems mostly irrelevant. None of the sections contains any normative text saying that accessing an indeterminate value can lead to UB. Is this a defect in Annex J?
There is however some relevant, normative text in 6.3.2.1 regarding lvalues:
If the lvalue designates an
object of automatic storage duration that could have been declared with the register
storage class (never had its address taken), and that object is uninitialized (not declared
with an initializer and no assignment to it has been performed prior to use), the behavior
is undefined.
But that is a special case, which only applies to variables of automatic storage duration that never had their address taken. I have always thought that this section of 6.3.2.1 is the only case of UB regarding indeterminate values (that are not trap representations). But people keep insisting that "it is always UB". So where exactly is this mentioned?
As far as I know, there is nothing in the standard that says that using an indeterminate value is always undefined behavior.
The cases that are spelled out as invoking undefined behavior are:
If the value happens to be a trap representation.
If the object holding the indeterminate value has automatic storage duration.
If the value is a pointer to an object whose lifetime has ended.
As an example, the C standard specifies that the type unsigned char has no padding bits and therefore none of its values can ever be a trap representation.
Portable implementations of functions such as memcpy take advantage of this fact to perform a copy of any value, including indeterminate values. Those values could potentially be trap representations when used as values of a type that contains padding bits, but they are simply unspecified when used as values of unsigned char.
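A minimal sketch of such a byte-wise copy, written in the spirit of memcpy rather than taken from any real library; the name copy_bytes is hypothetical. Every access goes through unsigned char, so padding bytes and not-yet-initialized bytes are copied as plain unspecified values.

```c
#include <stddef.h>

/* Hypothetical byte-wise copy: all reads and writes use unsigned char,
   which has no padding bits and therefore no trap representations. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

Copying a whole struct this way also copies its padding bytes, whose values are unspecified, without ever interpreting them as a value of a wider type.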
I believe that it is erroneous to assume that if something could invoke undefined behavior then it does invoke undefined behavior when the program has no safe way of checking. Consider the following example:
int read(int* array, int n, int i)
{
    if (0 <= i)
        if (i < n)
            return array[i];
    return 0;
}
In this case, the read function has no safe way of checking whether array really is of (at least) length n. Clearly, if the compiler considered these possible UB operations as definite UB, it would be nearly impossible to write any pointer code.
More generally, if the compiler cannot prove that something is UB, it has to assume that it isn't UB, otherwise it risks breaking conforming programs.
The only case where the possibility is treated like a certainty, is the case of objects of automatic storage. I think it's reasonable to assume that the reason for that is because those cases can be statically rejected, since all the information the compiler needs can be obtained through local flow analysis.
On the other hand, declaring it as UB for non-automatic storage objects would not give the compiler any useful information in terms of optimizations or portability (in the general case). Thus, the standard probably doesn't mention those cases because it wouldn't change anything in realistic implementations anyway.
To allow the best blend of optimization opportunities and useful semantics, types which have no trap representations should have Indeterminate Values subdivided into three kinds:
1. The first read will yield any value that could result from an unspecified
bit pattern; subsequent reads would be guaranteed to yield the same value.
This would be similar to "Unspecified value", except that the Standard
doesn't generally distinguish between types which do and don't have trap
representations, and in cases where the Standard calls for "Unspecified
Value" it requires that an implementation ensure the value is not a trap
representation; in the general case, that would require that an
implementation include code to guard against certain bit patterns.
2. Each read may independently yield any value that could result from an
unspecified bit pattern.
3. The value read, and the result of most computations performed upon it,
may behave non-deterministically as though the read had yielded any
possible value.
Unfortunately, the Standard doesn't make such distinctions, and there is some
disagreement about what it calls for. I would suggest that #2 should be the
default, but it should be possible for code to indicate all places where code
needs to force the compiler to pick a concrete value, and indicate that a
compiler may use #3-style semantics everywhere else. For example, if code for
a collection of distinct 16-bit values stored as:
struct COLLECTION { size_t count; uint16_t values[65536], locations[65536]; };
maintains the invariant that for each i < count, locations[values[i]]==i, it
should be possible to initialize such a structure merely by setting "count"
to zero, even if the storage had previously been used as some other type.
If casts are specified as always yielding concrete values, code which wants
to see if something is in the collection could use:
uint32_t index = (uint32_t)(collection->locations[value]);
if (index < collection->count && collection->values[index]==value)
... value was found
It would be acceptable to have the above code arbitrarily yield any number for "index" each time it reads an item from the array, but it would be essential that both uses of "index" in the second line use the same value.
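The structure described above can be sketched as follows. This is an illustration only: the lowercase struct tag and the functions collection_contains and collection_add are hypothetical names, and the index is read once into a local so that both uses in the test see the same value.

```c
#include <stddef.h>
#include <stdint.h>

struct collection {
    size_t count;
    uint16_t values[65536], locations[65536];
};

/* Membership test: read locations[value] once, then use that one value
   for both the range check and the cross-check against values[]. */
int collection_contains(const struct collection *c, uint16_t value)
{
    uint32_t index = c->locations[value];
    return index < c->count && c->values[index] == value;
}

/* Adds value if absent, keeping locations[values[i]] == i for all i < count. */
void collection_add(struct collection *c, uint16_t value)
{
    if (collection_contains(c, value))
        return;
    c->values[c->count] = value;
    c->locations[value] = (uint16_t)c->count;
    c->count++;
}
```

Emptying the collection is just a matter of setting count to zero. Whether the arrays may be left uninitialized before first use is exactly the point under dispute; code that wants to stay clear of it can start from zeroed storage.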
Unfortunately, some compiler writers seem to think compilers should treat all indeterminate values as #3, while some algorithms require #1 and some require #2, and there's no real way to distinguish the varying requirements.
3.19.2 permits an indeterminate value to be a trap representation, and both reading a trap representation and producing one through a non-character lvalue are undefined behaviour (6.2.6.1).
Your platform may give you guarantees (e.g. that integer types never have trap representations) but that is not required by the Standard, and if you rely on that, your code loses some portability. That's a valid choice, but shouldn't be made in ignorance.
More systems have trap representations for floating-point types than for integer types, but C programs may be run on processors that track register validity - see (Why) is using an uninitialized variable undefined behavior in C?. This degree of latitude is the principal reason for C's wide adoption across many hardware architectures.
This is a followup to Can a char array be used with any data type?
I know about dynamic memory and common implementations of malloc; references can be found on Wikipedia. I also know that the pointer returned by malloc can be cast to whatever the programmer wants, without even a warning, because the standard states in 6.3.2.3 Pointers §1:
A pointer to void may be converted to or from a pointer to any incomplete or object
type. A pointer to any incomplete or object type may be converted to a pointer to void
and back again; the result shall compare equal to the original pointer.
The question is: assuming I have a freestanding environment without malloc and free, how can I build an implementation of those two functions in conformant C?
If I take some freedom regarding the standard, it is easy:
start with a large character array
use a reasonably large alignment (8 should be enough for many architectures)
implement an algorithm that returns addresses from that array, at that alignment, keeping track of what has been allocated - nice examples can be found in malloc implementation?
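The three steps above can be sketched as a bump allocator that never frees, assuming C11 for _Alignas and max_align_t. The names pool, pool_used, POOL_SIZE and pool_alloc are hypothetical. Whether using the returned storage as a non-character type is strictly conforming is exactly what this question goes on to ask, and the sketch does not settle it.

```c
#include <stddef.h>

#define POOL_SIZE 4096u

/* Step 1: a large character array, aligned for any object type (C11). */
static _Alignas(max_align_t) unsigned char pool[POOL_SIZE];

/* Step 3: bookkeeping, reduced here to a single high-water mark. */
static size_t pool_used;

/* Hypothetical allocator: hands out aligned chunks and never frees them. */
void *pool_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);   /* step 2 */
    size_t rounded;

    if (size == 0 || size > POOL_SIZE)
        return NULL;
    rounded = (size + align - 1) / align * align;  /* round up to alignment */
    if (rounded > POOL_SIZE - pool_used)
        return NULL;

    void *p = &pool[pool_used];
    pool_used += rounded;
    return p;
}
```

A real implementation would also need a free list or similar structure so that released blocks can be handed out again.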
The problem is that the effective type of the storage those pointers designate will still be char.
And the standard says in the same section, §7:
A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned for the
pointed-to type, the behavior is undefined. Otherwise, when converted back again, the
result shall compare equal to the original pointer.
That does not seem to allow me to pretend that what was declared as simple characters can magically contain another type, and even different types in different parts of this array or at different moments in the same part. Said differently, dereferencing such pointers seems to be undefined behaviour under a strict interpretation of the standard. That is why common idioms use memcpy instead of aliasing when you get a byte representation of an object in a string buffer, for example when you read it from a network stream.
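The memcpy idiom mentioned above looks roughly like this. It is a sketch with a hypothetical helper name, load_u32, and it assumes the bytes in the buffer are already in the host's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: extract a uint32_t stored at an arbitrary offset
   of a byte buffer, in host byte order, without aliasing the buffer
   through a uint32_t pointer. */
uint32_t load_u32(const unsigned char *buf, size_t offset)
{
    uint32_t value;
    memcpy(&value, buf + offset, sizeof value);
    return value;
}
```

Because the copy is done byte by byte, the offset does not need to be suitably aligned for uint32_t.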
So how can I build a conformant implementation of malloc in pure C?
This answer is only an interpretation of the standard, because I could not find an explicit answer in C99 n1256 draft nor in C11 n1570.
The rationale comes from the C++ standard (C++14 draft n4296).
3.8 Object lifetime [basic.life] says (emphasis mine):
§ 1 The lifetime of an object of type T begins when:
storage with the proper alignment and size for type T is obtained, and
if the object has non-vacuous initialization, its initialization is complete.
The lifetime of an object of type T ends when:
if T is a class type with a non-trivial destructor (12.4), the destructor call starts, or
the storage which the object occupies is reused or released.
and
§ 3 The properties ascribed to objects throughout this International Standard apply for a given object only
during its lifetime.
I know that C and C++ are different languages, but they are related, and the above is only here to explain the following interpretation
The relevant part in C standard is 7.20.3 Memory management functions.
... The pointer returned if the allocation
succeeds is suitably aligned so that it may be assigned to a pointer to any type of object
and then used to access such an object or an array of such objects in the space allocated
(until the space is explicitly deallocated). The lifetime of an allocated object extends
from the allocation until the deallocation. Each such allocation shall yield a pointer to an
object disjoint from any other object. The pointer returned points to the start (lowest byte
address) of the allocated space...
My interpretation is that, provided you have a memory zone with the correct size and alignment, for example part of a large character array (an array of any other type could be used as well), you can pretend that it holds an uninitialized object or array of another type (say T), and convert a char or void pointer to the first byte of the zone to a pointer to the new type (T). But in order not to violate the strict aliasing rule, this zone must no longer be accessed through any previous value or pointer of the initial type. If the initial type was a character type, reading will still be allowed, but writing could produce a trap representation. As this object is not initialized, it can contain a trap representation, and reading it before its initialization is undefined behaviour. This T object and its associated pointer remain valid until you decide to use the memory zone for something else, at which point the pointer to T becomes dangling.
TL/DR: The strict aliasing rule only mandates that a memory zone can only contain an object of one effective type at one single moment. But you are allowed to re-use the memory zone for an object of a different type provided:
the size and alignment are compatible
you initialize the new object with a correct value before using it
you no longer access the initial object
Because that way you simply use the memory zone as allocated memory.
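For storage that really does come from malloc, the re-use described above is uncontroversial, because C11 6.5p6 lets a store through a non-character lvalue set the effective type of an object that has no declared type. The sketch below shows only that case, with a hypothetical function name, reuse_demo. Whether the same reasoning carries over to a declared character array is this answer's interpretation, not something the sketch demonstrates.

```c
#include <stdlib.h>

/* Allocates one zone, uses it as an int, then re-uses it as a float.
   Returns second + first on success, or 0.0f if the allocation fails. */
float reuse_demo(int first, float second)
{
    size_t size = sizeof(int) > sizeof(float) ? sizeof(int) : sizeof(float);
    void *zone = malloc(size);
    if (zone == NULL)
        return 0.0f;

    int *as_int = zone;
    *as_int = first;            /* effective type of the zone is now int */
    int seen = *as_int;         /* reading it back as int is fine */

    float *as_float = zone;
    *as_float = second;         /* a new store: effective type becomes float */
    /* from here on, the zone must not be read through as_int */
    float result = *as_float + (float)seen;

    free(zone);
    return result;
}
```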
Per the C standard, the lifetime of the initial object will not have ended (static objects last until the end of the program, and automatic ones until the end of their declaring scope), but you can no longer access it because of the strict aliasing rule.
The authors of the C Standard put far more effort into specifying behaviors which weren't obviously desirable than those that were, since they expected that sensible compiler writers would support useful behaviors whether or not the Standard mandated them, and since obtuse compiler writers could produce implementations that were fully compliant but completely useless(*).
It was possible to write reliable and efficient malloc() equivalents on many platforms prior to the advent of C89, and I see no reason to believe that the authors intended that people writing C89 compilers for a platform which had been able to handle malloc() equivalents previously would not make those implementations just as capable as their predecessors. Unfortunately, the language which was popular in the 1990s (which was a combined superset of C89 and its predecessors) has been replaced by a poor-quality dialect which omits features that the authors of C89 would have taken for granted and expected others to do likewise.
Even beyond the question of how one acquires memory, a larger issue is that
malloc() promises that newly-allocated memory will, at worst, hold
Indeterminate Value; because structure types have no trap representations,
reading such storage using a pointer of structure type will have defined
behavior. If the memory was previously written using some other type,
however, a structure-type read would have Undefined Behavior unless either
the free() or malloc() physically erases all of the storage in question,
thus negating the performance benefit of having malloc() rather than just
calloc().
(*)Provided that there exists at least one set of source files that the implementation processes in compliant fashion without UB, an implementation may require arbitrary (perhaps impossibly large) amounts of stack space when given any other set of source files, and behave in arbitrary fashion if that space is unavailable.
According to C99 J.2, the behavior is undefined when:
The value of an object with automatic storage duration is used while it is
indeterminate
What about all the other cases where an object has an indeterminate value? Do we also always invoke UB if we use them? Or do we invoke UB only when they contain a trap representation?
Examples include:
the value of an object allocated using malloc (7.20.3.3p2)
[storing in non-automatic storage] a FILE* after calling fclose on it (7.19.3p4)
[storing in non-automatic storage] a pointer after calling free on it (6.2.4p2)
...and so on.
I've used C99 for my references, but feel free to refer to either C99 or C11 in your answer.
I am using C11 revision here:
The definitions from the standard are:
indeterminate value
either an unspecified value or a trap representation
trap representation
an object representation that need not represent a value of the object type
unspecified value
valid value of the relevant type where this International Standard imposes no
requirements on which value is chosen in any instance
An unspecified value is a valid value of the relevant type and as such it does not cause undefined behaviour. Using a trap representation will.
The reason this wording exists in the standard is that it lets compilers issue diagnostics for, or even reject, programs that use the value of an uninitialized local variable while still remaining standard-compliant. Without it they could not, because some types, for example unsigned char, cannot have trap representations in memory, so an indeterminate value of such a type is always merely an unspecified value. Since using an unspecified value is not undefined behaviour, the standard would otherwise not allow an implementation to reject such a program.
Additionally, although an unsigned char normally has no trap representations, IIRC there are computer architectures where a register can be marked "uninitialized", and reading from such a register triggers a fault. Thus even though an unsigned char has no trap representations in memory, on such an architecture reading it will cause a hardware fault with 100% probability if it has automatic storage duration, the compiler decides to keep it in a register, and it is still uninitialized when it is read.
Say we declare in C:
int a;
Its value is undefined (let's forget those compilers that set zero by default). But does this undefined value still have the data type of the variable (integer)?
So it could be that a=19382, a=23, a=-33332... but not a=33.2?
This sounds pretty logical to me, but I'm not sure.
As you have left it uninitialised, although a will consume the same amount of memory as an initialised int, the contents are indeterminate.
The compiler has to treat it as an int, so it could have any integer value. It can't be 33.2 because that isn't an int. It could, however, on some architectures happen to contain a 'trapping value' which would cause an error whenever you tried to access 'a'. Offhand, I can't think of any examples for an int, although floating point numbers do have values called 'signalling NaNs', and some architectures will trap if you try to load an invalid pointer into an address register.
When it boils down to the bitwise representation of any variable, only by looking at that level, you cannot tell whether it is an int, float or whatever. That is why, the associated data type of a variable is required to understand and represent the type.
So, when a variable has indeterminate value, it is indeterminate. The point whether it is of the same type to the variable, does not make much sense, IMHO.
Just to make a point (not strictly speaking), you don't have any means to check the value of an uninitialized variable, as, trying to use (read) the value will generate undefined behaviour.
The value is undefined, so using it is undefined behaviour. It could be an int, but it's also possible to have a bit pattern that isn't a legal representation for any int.
In practice, it's unlikely to have bit patterns that aren't legal representations of int, but C doesn't guarantee you anything about it.
It can however never be a different type, because the type you give it determines how the bit pattern stored is to be interpreted.
If an ordinary int is declared outside of functions, it will be initialized to 0. That's guaranteed by the C standard, and if your compiler (in practice the runtime library that the compiler links it with) does not set it to zero, then it has a serious bug.
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf Section 6.7.9 paragraph 10.
If it's declared inside of a function, it'll have an indeterminate value, and accessing it might result in the dreaded undefined behavior.
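A short sketch of the two cases, with hypothetical names: the file-scope object is zero-initialized per 6.7.9 paragraph 10, while the local one has to be given a value explicitly before it is read.

```c
/* File scope, no initializer: static storage duration, so it starts at 0. */
static int file_scope_counter;

int read_file_scope(void)
{
    return file_scope_counter;      /* always well defined */
}

int read_local(void)
{
    int local = 0;   /* without "= 0" this read would use an indeterminate value */
    return local;
}
```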