It is common to assume that initializing an object to all bits 0 is a simple way to set all its members to 0. The standard does not guarantee this for non integer types as:
all bits zero might not be a valid representation for pointers, even null pointers, although all common modern systems use exactly that.
all bits zero might not be a legal representation for floating point numbers, although it is on IEEE compliant systems.
What about integers? Is the following code fully defined:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
int *p = calloc(sizeof(*p), 1);
if (p) {
printf("%d\n", *p);
memset(p, 0, sizeof(*p));
printf("%d\n", *p);
free(p);
}
return 0;
}
From C Standard, 6.2.6.2, Integer Types
For any integer type, the object representation where all the bits are zero shall be a representation of the value zero in that type.
The definition of a trap representation is, C11 6.2.6.1/5:
Certain object representations need not represent a value of the object type. If the stored
value of an object has such a representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. If such a representation is produced
by a side effect that modifies all or any part of the object by an lvalue expression that
does not have character type, the behavior is undefined.50) Such a representation is called
a trap representation.
This means that a trap representation has to be something that is not a valid value.
In case of two's complement, all binary combinations of an int are valid values, so trap representations aren't possible.
In case of wildly fictional one's complement systems, it would be possible to make the value 0xFFFFFFFF (assuming 32 bit int) a trap representation, if negative zeroes are not supported. Similarly, on a wildly fictional signed magnitude system, the value 0x80000000 could be used as a trap representation.
On even wilder fictional systems, integers may have padding bits and then such padding bits could be used to hold trap representations.
In any of these cases, the binary representation 0 is always a value. A lot of the C standard depends on this, such as initialization of objects with static storage duration, the calloc() function, the values of padding bytes in structs etc etc. In all of these, the outcome shall not be a trap representation.
Please note that if you aren't a programmer of wildly fictional systems, none of this is of any concern. There may have existed a few odd, experimental computers where this was a thing. You may even find someone yet alive who can tell you about them.
If you are designing for compatibility with such exotic, most likely fictional systems, you should document in detail exactly why this compatibility is needed for your product. Since your boss probably wants to know why you are spending lots of time designing for compatibility with computers that actually don't exist in the real world.
Related
Unsigned integer overflow is well defined by both the C and C++ standards. For example, the C99 standard (§6.2.5/9) states
A computation involving unsigned operands can never overflow,
because a result that cannot be represented by the resulting unsigned integer type is
reduced modulo the number that is one greater than the largest value that can be
represented by the resulting type.
However, both standards state that signed integer overflow is undefined behavior. Again, from the C99 standard (§3.4.3/1)
An example of undefined behavior is the behavior on integer overflow
Is there an historical or (even better!) a technical reason for this discrepancy?
The historical reason is that most C implementations (compilers) just used whatever overflow behaviour was easiest to implement with the integer representation it used. C implementations usually used the same representation used by the CPU - so the overflow behavior followed from the integer representation used by the CPU.
In practice, it is only the representations for signed values that may differ according to the implementation: one's complement, two's complement, sign-magnitude. For an unsigned type there is no reason for the standard to allow variation because there is only one obvious binary representation (the standard only allows binary representation).
Relevant quotes:
C99 6.2.6.1:3:
Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.
C99 6.2.6.2:2:
If the sign bit is one, the value shall be modified in one of the following ways:
— the corresponding value with sign bit 0 is negated (sign and magnitude);
— the sign bit has the value −(2N) (two’s complement);
— the sign bit has the value −(2N − 1) (one’s complement).
Nowadays, all processors use two's complement representation, but signed arithmetic overflow remains undefined and compiler makers want it to remain undefined because they use this undefinedness to help with optimization. See for instance this blog post by Ian Lance Taylor or this complaint by Agner Fog, and the answers to his bug report.
Aside from Pascal's good answer (which I'm sure is the main motivation), it is also possible that some processors cause an exception on signed integer overflow, which of course would cause problems if the compiler had to "arrange for another behaviour" (e.g. use extra instructions to check for potential overflow and calculate differently in that case).
It is also worth noting that "undefined behaviour" doesn't mean "doesn't work". It means that the implementation is allowed to do whatever it likes in that situation. This includes doing "the right thing" as well as "calling the police" or "crashing". Most compilers, when possible, will choose "do the right thing", assuming that is relatively easy to define (in this case, it is). However, if you are having overflows in the calculations, it is important to understand what that actually results in, and that the compiler MAY do something other than what you expect (and that this may very depending on compiler version, optimisation settings, etc).
First of all, please note that C11 3.4.3, like all examples and foot notes, is not normative text and therefore not relevant to cite!
The relevant text that states that overflow of integers and floats is undefined behavior is this:
C11 6.5/5
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or
not in the range of representable values for its type), the behavior
is undefined.
A clarification regarding the behavior of unsigned integer types specifically can be found here:
C11 6.2.5/9
The range of nonnegative values of a signed integer type is a subrange
of the corresponding unsigned integer type, and the representation of
the same value in each type is the same. A computation involving
unsigned operands can never overflow, because a result that cannot be
represented by the resulting unsigned integer type is reduced modulo
the number that is one greater than the largest value that can be
represented by the resulting type.
This makes unsigned integer types a special case.
Also note that there is an exception if any type is converted to a signed type and the old value can no longer be represented. The behavior is then merely implementation-defined, although a signal may be raised.
C11 6.3.1.3
6.3.1.3 Signed and unsigned integers
When a value with integer
type is converted to another integer type other than _Bool, if the
value can be represented by the new type, it is unchanged.
Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.
Otherwise, the new type is signed and the value
cannot be represented in it; either the result is
implementation-defined or an implementation-defined signal is raised.
In addition to the other issues mentioned, having unsigned math wrap makes the unsigned integer types behave as abstract algebraic groups (meaning that, among other things, for any pair of values X and Y, there will exist some other value Z such that X+Z will, if properly cast, equal Y and Y-Z will, if properly cast, equal X). If unsigned values were merely storage-location types and not intermediate-expression types (e.g. if there were no unsigned equivalent of the largest integer type, and arithmetic operations on unsigned types behaved as though they were first converted them to larger signed types, then there wouldn't be as much need for defined wrapping behavior, but it's difficult to do calculations in a type which doesn't have e.g. an additive inverse.
This helps in situations where wrap-around behavior is actually useful - for example with TCP sequence numbers or certain algorithms, such as hash calculation. It may also help in situations where it's necessary to detect overflow, since performing calculations and checking whether they overflowed is often easier than checking in advance whether they would overflow, especially if the calculations involve the largest available integer type.
Perhaps another reason for why unsigned arithmetic is defined is because unsigned numbers form integers modulo 2^n, where n is the width of the unsigned number. Unsigned numbers are simply integers represented using binary digits instead of decimal digits. Performing the standard operations in a modulus system is well understood.
The OP's quote refers to this fact, but also highlights the fact that there is only one, unambiguous, logical way to represent unsigned integers in binary. By contrast, Signed numbers are most often represented using two's complement but other choices are possible as described in the standard (section 6.2.6.2).
Two's complement representation allows certain operations to make more sense in binary format. E.g., incrementing negative numbers is the same that for positive numbers (expect under overflow conditions). Some operations at the machine level can be the same for signed and unsigned numbers. However, when interpreting the result of those operations, some cases don't make sense - positive and negative overflow. Furthermore, the overflow results differ depending on the underlying signed representation.
The most technical reason of all, is simply that trying to capture overflow in an unsigned integer requires more moving parts from you (exception handling) and the processor (exception throwing).
C and C++ won't make you pay for that unless you ask for it by using a signed integer. This isn't a hard-fast rule, as you'll see near the end, but just how they proceed for unsigned integers. In my opinion, this makes signed integers the odd-one out, not unsigned, but it's fine they offer this fundamental difference as the programmer can still perform well-defined signed operations with overflow. But to do so, you must cast for it.
Because:
unsigned integers have well defined overflow and underflow
casts from signed -> unsigned int are well defined, [uint's name]_MAX - 1 is conceptually added to negative values, to map them to the extended positive number range
casts from unsigned -> signed int are well defined, [uint's name]_MAX - 1 is conceptually deducted from positive values beyond the signed type's max, to map them to negative numbers)
You can always perform arithmetic operations with well-defined overflow and underflow behavior, where signed integers are your starting point, albeit in a round-about way, by casting to unsigned integer first then back once finished.
int32_t x = 10;
int32_t y = -50;
// writes -60 into z, this is well defined
int32_t z = int32_t(uint32_t(y) - uint32_t(x));
Casts between signed and unsigned integer types of the same width are free, if the CPU is using 2's compliment (nearly all do). If for some reason the platform you're targeting doesn't use 2's Compliment for signed integers, you will pay a small conversion price when casting between uint32 and int32.
But be wary when using bit widths smaller than int
usually if you are relying on unsigned overflow, you are using a smaller word width, 8bit or 16bit. These will promote to signed int at the drop of a hat (C has absolutely insane implicit integer conversion rules, this is one of C's biggest hidden gotcha's), consider:
unsigned char a = 0;
unsigned char b = 1;
printf("%i", a - b); // outputs -1, not 255 as you'd expect
To avoid this, you should always cast to the type you want when you are relying on that type's width, even in the middle of an operation where you think it's unnecessary. This will cast the temporary and get you the signedness AND truncate the value so you get what you expected. It's almost always free to cast, and in fact, your compiler might thank you for doing so as it can then optimize on your intentions more aggressively.
unsigned char a = 0;
unsigned char b = 1;
printf("%i", (unsigned char)(a - b)); // cast turns -1 to 255, outputs 255
The code below for testing endianness is expected to have implementation defined behavior:
int is_little_endian(void) {
int x = 1;
char *p = (char*)&x;
return *p == 1;
}
But is it possible that it may have undefined behavior on purposely contrived architectures? For example could the first byte of the representation of an int with value 1 (or another well chosen value) be a trap value for the char type?
As noted in comments, the type unsigned char would not have this issue as it cannot have trap values, but this question specifically concerns the char type.
Per C 2018 6.2.5 15, char behaves as either signed char or unsigned char. Suppose it is signed char. 6.2.6.2 2 discusses signed integer types, including signed char. At the end of this paragraph, it says:
Which of these [sign and magnitude, two’s complement, or ones’ complement] applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones’ complement), is a trap representation or a normal value.
Thus, this paragraph allows signed char to have a trap representation. However, the paragraph in the standard that says accessing trap representations may have undefined behavior, 6.2.6.1 5, specifically excludes character types:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.
Thus, although char may have trap representations, there is no reason we should not be able to access it. There is then the question of what happens if we use the value in an expression? If a char has a trap representation, it does not represent a value. So attempting to compare it to 1 in *p == 1 does not seem to have a defined behavior.
The specific value of 1 in an int will not result in a trap representation in char for any normal C implementation, as the 1 will be in the “rightmost” (lowest valued) bit of some byte of the int, and no normal C implementation puts the sign bit of a char in the bit in that position. However, the C standard apparently does not prohibit such an arrangement, so, theoretically, an int with value 1 might be encoded with bits 00000001 in one of its bytes, and those bits might be a trap representation for a char.
I don't think the Standard would forbid an implementation in which signed char used sign-magnitude or ones'-complement format, and trapped on attempts to load the bit pattern that would represent "negative zero". Nor does it require that such implementations must make char unsigned. It would be possible to contrive an architecture upon which your code could have arbitrary behavior. A few more important things to note, however:
There is no guarantee that the bits within a char are mapped in the same sequence as the ones in an int. The code wouldn't launch into UB-land if the bits aren't mapped in order, but the result would not be very meaningful.
So far as I can tell, every non-contrived conforming C99 implementation has used two's-complement format; I consider it doubtful that any will ever do otherwise.
It would be silly for an implementation to make char be a type with fewer representable values than bit patterns.
One could contrive a conforming implementation that would do almost anything with almost any source text, provided that there exists some source text that it would process in the fashion defined by the Standard.
One could contrive a conforming sign-magnitude implementation where the integer value 1 would have a bit pattern that would encode signed char value "negative zero", and which would trap on an attempt to load that. One could even contrive a conforming ones'-complement implementation that did so (have lots of padding bits on the "int" type, all of which get set when storing the value "1"). Given that one could contrive a conforming implementation that uses the One Program rule to justify doing anything it liked with the above source text regardless of what integer format it uses, however, I don't think the possibility of weird char type should really be a worry.
Note, btw, that the Standard makes no effort to forbid silly implementations; it might be improved by adding language mandating that char must either be a two's-complement type with no trap representations or an unsigned type, and either mandating the same for signed char or explicitly saying that is not required. It might also be improved if it recognized a category of implementations which can't support types like unsigned long long [which would be a major stumbling block for a 36-bit ones'-complement systems, and may be the reason that no conforming C99 implementations exist for such platforms].
I found a quote from the Standard that proves that no object representation is a trap value for unsigned char:
6.2.6.2 Integer types
1 For unsigned integer types other than unsigned char, the bits of the object
representation shall be divided into two groups: value bits and padding bits (there need
not be any of the latter). If there are N value bits, each bit shall represent a different
power of 2 between 1 and 2N−1, so that objects of that type shall be capable of
representing values from 0 to 2N − 1 using a pure binary representation; this shall be
known as the value representation. The values of any padding bits are unspecified.53)
The previous says that an unsigned char cannot have any padding bits.
The following footnote says that padding bits are what can be used for trap representations.
53) Some combinations of padding bits might generate trap representations, for example, if one padding
bit is a parity bit. Regardless, no arithmetic operation on valid values can generate a trap
representation other than as part of an exceptional condition such as an overflow, and this cannot occur
with unsigned types. All other combinations of padding bits are alternative object representations of
the value specified by the value bits.
So I guess the answer is that char is not guaranteed to not have any trap values but unsigned char is.
Frankly, is such a code valid (aside for the lack of necessary error checking, omitted here for simplicity)?
Code to send data over the Internet:
uint16_t i = htons(500);
sendto(sockfd, &i, sizeof(uint16_t), 0, &dest_addr, sizeof(struct sockaddr_in));
Code to receive this data:
uint16_t i;
recvfrom(sockfd, &i, sizeof(uint16_t), 0, src_addr, sizeof(struct sockaddr_in));
i = ntohs(i);
if(i < 100 || i > 1000)
fprintf(stderr, "Received invalid data over the network\n");
else
do_something(i);
My concern is, that since I read that the C standard allows trap value for any type other than unsigned char, is it possible that this way I might receive such a trap value over the network, and thus I’ll have UB as soon as I write i = ntohs(i)?
Or does POSIX guarantee that uint16_t and uint32_t shall not have trap values?
Or is it not guaranteed by any official standard, however the vast majority of implementations do not have trap values for uint16_t and uint32_t, and thus as per this de-facto standard I don’t have to fear this?
C99 specifies that the fixed width types must be two's complement and have no padding bits. The bit of the standard that talks about trap representations says that in integer types only padding bits can cause trap representations. So we don't even need to dig into POSIX to see that your code is fine.
POSIX additionally makes all integer types two's complement (I can't find this now, either it says so explicitly or it's a consequence of certain other things in POSIX, I don't remember).
I think the answer is "yes". What you're doing is fine.
The bit I think you're worried about is this:
Certain object representations need not represent a value of the object type. If the stored
value of an object has such a representation and is read by an lvalue expression that does
not have character type, the behavior is undefined.
For uinxx_t I don't believe there are any trap values. All combinations of bit values produce a valid number (i.e. object representation). I.e., all bit combinations "represent a value of the object type", therefore you cannot initialise a uintxx_t to a trap value/representation.
The standard then goes onto say, which I'd never read before, so this was quite interesting:
For signed integer types ...If the sign bit is one, the value shall be
modified in one of the following ways:
— the corresponding value with
sign bit 0 is negated (sign and magnitude);
— the sign bit has the
value -(2N) (two’s complement);
— the sign bit has the value -(2N - 1)
(one’s complement).
Which of these applies is implementation-defined,
as is whether the value with sign bit 1 and all value bits zero (for
the first two), or with sign bit and all value bits 1 (for one’s
complement), is a trap representation or a normal value.
So, it might be a problem if you were using signed integers, but it does not appear to be for unsigned integers.
uint16_t cannot have trap values in memory - there are 16 value bits with no padding bits. However, an uninitialized local variable, whose address has never been taken, will have an indeterminate value. As the address of i is taken here, it must reside in memory and even if recvfrom fails it will have a valid unspecified value.
You also need to consider endian-ness of client and server while sending and receiving binary data.
If your client is little endian and sending to big endian server, at server you dont need to convert binary data back if client had already converted it to network byte order
Question
As a fledgling C language-lawyer, I have run into a situation where I am uncertain if I understand what C specifications logically guarantee correctly.
As I understand it, "bitwise operators" (&, |, and &) will work as intuitively expected on non-negative values with any of C's integer types (char/short/int/long/etc, whether signed or unsigned) - regardless of underlying object representation.
Is this correct understanding of what is/isn't strictly well-defined behavior in C?
Key Point
In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result (from the operation itself, not from assigning/interpreting the result into/as an inappropriate type).
Example
Consider the following code:
#include <limits.h>
#define MOST_SIGNIFICANT_BIT = (unsigned char )((UCHAR_MAX >> 1) + 1)
/* ... in some function: */
unsigned char byte;
/* Using broad meaning of "byte", not necessarily "octet" */
int val;
/* val is assigned an arbitrary _non-negative_ value at runtime */
byte = val | MOST_SIGNIFICANT_BIT;
Do note the above comment that val receives a non-negative value at runtime (which can be represented by val's type).
My expectation is that byte has the most significant bit set, and the lower bits are the pure binary representation (no padding bits or trap representations) of the bottom CHAR_BIT - 1 bits of the numerical value of val.
I expect this to remain true even if the type of val is changed to any other integer type, but I expect this guarantee to disappear as soon as the value of val becomes negative (no one result guaranteed for all implementations), or the type of val is changed to any non-integer type (violates a constraint of C's definition of the bitwise operators).
Self-Answering
I am posting my explanation of my current understanding as an answer because I'm fairly confident in it, but I am looking for corrections to any of my misconceptions and will accept any better/correcting answer instead of mine.
Why (I Think) This Is (Probably) Correct
The bitwise operators &, |, and ^ are defined as operating on the actual binary representation of the converted operands: the operands are said to undergo the "usual arithmetic conversions".
As I understand it, it logically follows that when you use two integer type expressions with non-negative values as operands to one of those operators, regardless of padding bits or trap representations, their value bits will "line up": therefore, the result's value bits will have the numerical value which matches with what you'd expect if you just assumed a "pure binary representation".
The kicker is that so long as you start with valid (non-trap), non-negative values as the operands, the operands should always promote to an integer type which can represent both of their values, and thus can logically represent the result value for either of those three operations. You also never run into possible issues with e.g. "signed zero", because limiting yourself to non-negative values avoids such problems. And so long as the result is used as a type which can hold the result value (or as an unsigned integer type), you won't introduce other similar/related issues.
Can These Operations Produce A Trap Representation From Non-Negative Non-Trap Operands?
Footnotes 44/53 and 45/54 of the last C99/C11 drafts, respectively, seem to suggest that the answer to this uncertainty depends on whether foo | bar, foo & bar, and foo ^ bar are considered "arithmetic operations". If they are, then they are not allowed to produce a trap representation result given non-trap values.
The Index of C99 and C11 standard drafts lists the bitwise operators as a subset of "arithmetic operators", suggesting yes. Though C89 doesn't organize its index that way, and my C Programming Language (2nd Edition) has a section called "arithmetic operators" which includes just +, -, *, /, and %, leaving the bitwise operators to a separate section. In other words, there is no clear answer on this point.
In practice, I don't know of any systems where this would happen (given the constraint of non-negative values for both operands), for what that's worth.
And one may consider the following: the type unsigned char is expected (and essentially blessed by C99 and C11) to be capable of accessing all bits of the underlying object representation of a type - it seems likely that the intent is that bitwise operators would work correctly with unsigned char - which would be integer-promoted to int on most modern systems, an unsigned int on the rest: therefore it seems very unlikely that foo | bar, foo & bar, or foo ^ bar would be allowed to produce trap representations - at least if foo and bar are both values that can be held in an unsigned char, and if the result is assigned into an unsigned char.
It is very tempting to generalize from the prior two points that this is a non-issue, although I wouldn't call it a rigorous proof.
Applied to the Example
Here's why I think this is correct and will work as expected:
UCHAR_MAX >> 1 subjects UCHAR_MAX to "usual arithmetic conversions": By definition, UCHAR_MAX will fit into either an int or unsigned int, because on most systems int can represent all values of unsigned char, and on the few that don't, an unsigned int has to be able to represent all values of unsigned char`, so that's just an "integer promotion" in this case.
Because bit shifts are defined in terms of values and not bitwise representations, UCHAR_MAX >> 1 is the quotient of UCHAR_MAX being divided by 2. (Let's call this result UCHAR_MAX_DIV_2).
UCHAR_MAX_DIV_2 + 1 subjects both arguments to usual arithmetic conversion: If UCHAR_MAX fit into an int, then the result is an int, otherwise, it is an unsigned int. Either way the conversions stop at integer promotion.
The result of UCHAR_MAX_DIV_2 + 1 is a positive value, which, when converted into an unsigned char, will have the most significant bit of the unsigned char set, and all other bits cleared (because the conversion would preserve the numerical value, and unsigned char is very strictly defined to have a pure binary representation without any padding bits or trap representations - but even without such an explicit requirement, the resulting value would have the most significant value bit set).
The (unsigned char) cast of MOST_SIGNIFICANT_BIT is actually redundant in this context - cast or no cast, it's going to be subject to the "usual arithmetic conversions" when bitwise-OR'ed. (but it might be useful in other contexts).
The above five steps will be constant-folded on pretty much every compiler out there - but a proper compiler should not constant-fold in a way which differs from the semantics of the code if it hadn't, so all of the above applies.
val | MOST_SIGNIFICANT_BIT is where it gets interesting: unlike << and >>, | and the other binary operators are defined in terms of manipulating the binary representations. Both val and MOST_SIGNIFICANT_BIT are subject to usual arithmetic conversions: details like the layout of bits or trap representations might mean a different binary representation, but should preserve the value: Given two variables of the same integer-type, holding non-negative, non-trap values, the value bits should "line up" correctly, so I actually expect that val | MOST_SIGNIFICANT_BIT produces the correct value (let's call this result VAL_WITH_MSB_SET). I don't see an explicit guarantee that this step couldn't produce a trap representation, but I don't believe there's an implementation of C where it would.
byte = VAL_WITH_MSB_SET forces a conversion: conversion of an integer type (so long as the value is a non-trap value) into a smaller, unsigned integer type is well defined: In this case, the value is reduced modulo UCHAR_MAX + 1. Since val is stated to be positive, the end result is that byte has the value of the remainder of VAL_WITH_MSB_SET divided by UCHAR_MAX + 1.
Explaining Where It Doesn't Work
If val were to be a negative value or non-integer type, we'd be out of luck because there is no longer a logical certainty that the bits that get binary-OR'ed will have the same "meaning":
If val is a signed integer type but has a negative value, then even though MOST_SIGNIFICANT_BIT is promoted to a compatible type, even though the value bits "line up", the result doesn't have any guaranteed meaning (because C makes no guarantee about how negative numbers are encoded), nor is the result (for any encoding) going to have the same meaning, especially once assigned into the unsigned char at the last step.
If val has a non-integer type, it's already violating the C standard, which constrains |, &, and ^ operators to operating on the "integer types". But if your compiler allowed it (or you did some tricks using unions, etc), then you have no guarantees about what each bit means, and thus the bit that you set are meaningless.
In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result
This is covered by C11 section 6.2.6.2 (C99 is similar). There is a footnote that clarifies the intent of the more technical text:
Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types.
The bitwise operators are arithmetic operations as discussed here.
In this footnote, "trap representation" excludes the special case "negative zero". The negative zero may or may not cause UB, but it has its own text (in 6.2.6.2 also) separate from the trap representation text.
So your question can actually be answered for both signed and unsigned values; the only dangerous case is the "negative zero" possibility. (Which can't occur from non-negative input).
As per the C standard the value representation of a integer type is implementation defined. So 5 might not be represented as 00000000000000000000000000000101 or -1 as 11111111111111111111111111111111 as we usually assume in a 32-bit 2's complement. So even though the operators ~, << and >> are well defined, the bit patterns they will work on is implementation defined. The only defined bit pattern I could find was "§5.2.1/3 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.".
So my questions is - Is there a implementation independent way of converting integer types to a bit pattern?
We can always start with a null character and do enough bit operations on it to get it to a desired value, but I find it too cumbersome. I also realise that practically all implementations will use a 2's complement representation, but I want to know how to do it in a pure C standard way. Personally I find this topic quite intriguing due to the matter of device-driver programming where all code written till date assumes a particular implementation.
In general, it's not that hard to accommodate unusual platforms for the most cases (if you don't want to simply assume 8-bit char, 2's complement, no padding, no trap, and truncating unsigned-to-signed conversion), the standard mostly gives enough guarantees (a few macros to inspect certain implementation details would be helpful, though).
As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever this should mean), but what is observable by portable code. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for bitwise operators and shifts.
For negative values of signed types, different encodings are allowed, which leads to different (but well-defined for every case) results when re-interpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: They are equal for two's complement without padding bits, but different for the other encodings if n is negative.
Further, padding bits may exist, and signed integer types may have more padding bits than their corresponding unsigned counterparts (but not the other way round, type-punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so e.g. to get an unsigned value where only the sign-bit (of the corresponding signed type) is set, something like this must be used:
#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )
unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
TYPE_PUN(unsigned, int, -1) & ~1u;
(there are probably nicer ways) instead of
unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;
as this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it's 0 to determine it, Gcc can constant-fold that.) Padding bits can be inspected by memcpying into an unsigned char array, though they may appear to "wobble": Reading the same padding bit twice may give different results.
If you want the bit pattern (without padding bits) of a signed integer (little endian):
int print_bits_u(unsigned n) {
for(; n; n>>=1) {
putchar(n&1 ? '1' : '0'); // n&1 never traps
}
return 0;
}
int print_bits(int n) {
return print_bits_u(*(unsigned *)&n & INT_MAX);
/* This masks padding bits if int has more of them than unsigned int.
* Note that INT_MAX is promoted to unsigned int here. */
}
int print_bits_2scomp(int n) {
return print_bits_u(n);
}
print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern), print_bits_2scomp gives the two's complement representation (possibly with a greater width than a signed int has, if unsigned int has less padding bits).
Care must be taken not to generate trap representations when using bitwise operators and when type-punning from unsigned to signed, see below how these can potentially be generated (as an example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).
Unsigned-to-signed integer conversion (if the converted value isn't representable in the target type) is always implementation-defined, I would expect non-2's complement machines to differ from the common definition more likely, though technically, it could also become an issue on 2's complement implementations.
From C11 (n1570) 6.2.6.2:
(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2N-1, so that objects of that type shall be capable of representing values from 0 to 2N-1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.
(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed
type and N in the unsigned type, then M≤N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value -(2M) (two's complement);
the sign bit has the value -(2M-1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.
To add to mafso's excellent answer, there's a part of the ANSI C rationale which talks about this:
The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:
Bit-fields are specified by a number of bits, with no mention of “invalid integer” representation. The only reasonable encoding for such bit-fields is binary.
The integer formats for printf suggest no provision for “invalid integer” values, implying that any result of bitwise manipulation produces an integer result which can be printed by printf.
All methods of specifying integer constants — decimal, hex, and octal — specify an integer value. No method independent of integers is defined for specifying “bit-string constants.” Only a binary encoding provides a complete one-to-one mapping between bit strings and integer values.
The restriction to binary numeration systems rules out such curiosities as Gray code and makes
possible arithmetic definitions of the bitwise operators on unsigned types.
The relevant part of the standard might be this quote:
3.1.2.5 Types
[...]
The type char, the signed and unsigned integer types, and the
enumerated types are collectively called integral types. The
representations of integral types shall define values by use of a pure
binary numeration system.
If you want to get the bit-pattern of a given int, then bit-wise operators are your friends. If you want to convert an int to its 2-complement representation, then arithmetic operators are your friends. The two representations can be different, as it is implementation defined:
Std Draft 2011. 6.5/4. Some operators (the unary operator ~, and the
binary operators <<, >>, &, ^, and |, collectively described as
bitwise operators) are required to have operands that have integer
type. These operators yield values that depend on the internal
representations of integers, and have implementation-defined and
undefined aspects for signed types.
So it means that i<<1 will effectively shift the bit-pattern by one position to the left, but that the value produced can be different than i*2 (even for smal values of i).