Are the results of bitwise operations on signed integers defined?

Are the results of bitwise operations on signed integers defined? - c

I know that the behavior of >> on signed integer can be implementation dependent (specifically, when the left operand is negative).
What about the others: ~, >>, &, ^, |?
When their operands are signed integers of built-in type (short, int, long, long long), are the results guaranteed to be the same (in terms of bit content) as if their type is unsigned?

For negative operands, << has undefined behavior and the result of >> is implementation-defined (usually as "arithmetic" right shift). << and >> are conceptually not bitwise operators. They're arithmetic operators equivalent to multiplication or division by the appropriate power of two for the operands on which they're well-defined.
As for the genuine bitwise operators ^, ~, |, and &, they operate on the bit representation of the value in the (possibly promoted) type of the operand. Their results are well-defined for each possible choice of signed representation (twos complement, ones complement, or sign-magnitude) but in the latter two cases it's possible that the result will be a trap representation if the implementation treats the "negative zero" representation as a trap. Personally, I almost always use unsigned expressions with bitwise operators so that the result is 100% well-defined in terms of values rather than representations.
Finally, note that this answer as written may only apply to C. C and C++ are very different languages and while I don't know C++ well, I understand it may differ in some of these areas from C...

A left shift << of a negative value has undefined behaviour;
A right shift >> of a negative value gives an implementation-defined result;
The result of the &, | and ^ operators is defined in terms of the bitwise representation of the values. Three possibilities are allowed for the representation of negative numbers in C: two's complement, ones' complement and sign-magnitude. The method used by the implementation will determine the numerical result when these operators are used on negative values.
Note that the value with sign bit 1 and all value bits zero (for two's complement and sign-magnitude), or with sign bit and all value bits 1 (for ones’ complement) is explicitly allowed to be a trap representation, and in this case if you use arguments to these operators that would generate such a value the behaviour is undefined.

The bit content will be the same, but the resulting values will still be implementation dependent.
You really shouldn't see the values as signed or unsigned when using bitwise operations, because that is working on a different level.
Using unsigned types saves you from some of this trouble.

The C89 Standard defined the behavior of left-shifting signed numbers based upon bit positions. If neither signed nor unsigned types have padding bits, the required behavior for unsigned types, combined with the requirement that positive signed types share the same representation as unsigned types, would imply that the sign bit is immediately to the left of the most significant value bit.
This, in C89, -1<<1 would be -2 on two's-complement implementations which don't have padding bits and -3 on ones'-complement implementations which don't have padding bits. If there are any sign-magnitude implementations without padding bits, -1<<1 would equal 2 on those.
The C99 Standard changed left-shifts of negative values to Undefined Behavior, but nothing in the rationale gives any clue as to why (or even mentions the change at all). The behavior required by C89 may have been less than ideal in some ones'-complement implementations, and so it would made sense to allow those implementations the freedom to select something better. I've seen no evidence to suggest that the authors of the Standard didn't intended that quality two's-complement implementations should continue to provide the same behavior mandated by C89, but unfortunately they didn't actually say so.

Related

When are bitwise operations undefined in C? [duplicate]

Bitwise operators (~, &, | and ^) operate on the bitwise representation of their promoted operands. Can such operations cause undefined behavior?
For example, the ~ operator is defined this way in the C Standard:
6.5.3.3 Unary arithmetic operators
The result of the ~ operator is the bitwise complement of its (promoted) operand (that is, each bit in the result is set if and only if the corresponding bit in the converted operand is not set). The integer promotions are performed on the operand, and the result has the promoted type. If the promoted type is an unsigned type, the expression ~E is equivalent to the maximum value representable in that type minus E.
On all architectures, ~0 produces a bit pattern with the sign bit set to 1 and all value bits set to 1. On a one's complement architecture, this representation correspond to a negative zero. Can this bit pattern be a trap representation?
Are there other examples of undefined behavior involving simple bitwise operators for more common architectures?

For one's complement systems, there's explicitly listed the possibility of trap values for those that do not support negative zeros in signed integers (C11 6.2.6.2p4):
If the implementation does not support negative zeros, the behavior of the &, |, ^, ~, <<, and >> operators with operands that would produce such a value is undefined.
Then again, one's complement systems are not exactly common; as for example GCC doesn't support any!
C11 does imply that the implementation-defined and undefined aspects are just allowed for signed types (C11 6.5p4).

Clarifying How Bitwise Operators, Non-Negative Operants, and Type Conversion Interact

Question
As a fledgling C language-lawyer, I have run into a situation where I am uncertain if I understand what C specifications logically guarantee correctly.
As I understand it, "bitwise operators" (&, |, and &) will work as intuitively expected on non-negative values with any of C's integer types (char/short/int/long/etc, whether signed or unsigned) - regardless of underlying object representation.
Is this correct understanding of what is/isn't strictly well-defined behavior in C?
Key Point
In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result (from the operation itself, not from assigning/interpreting the result into/as an inappropriate type).
Example
Consider the following code:
#include <limits.h>
#define MOST_SIGNIFICANT_BIT = (unsigned char )((UCHAR_MAX >> 1) + 1)
/* ... in some function: */
unsigned char byte;
/* Using broad meaning of "byte", not necessarily "octet" */
int val;
/* val is assigned an arbitrary _non-negative_ value at runtime */
byte = val | MOST_SIGNIFICANT_BIT;
Do note the above comment that val receives a non-negative value at runtime (which can be represented by val's type).
My expectation is that byte has the most significant bit set, and the lower bits are the pure binary representation (no padding bits or trap representations) of the bottom CHAR_BIT - 1 bits of the numerical value of val.
I expect this to remain true even if the type of val is changed to any other integer type, but I expect this guarantee to disappear as soon as the value of val becomes negative (no one result guaranteed for all implementations), or the type of val is changed to any non-integer type (violates a constraint of C's definition of the bitwise operators).
Self-Answering
I am posting my explanation of my current understanding as an answer because I'm fairly confident in it, but I am looking for corrections to any of my misconceptions and will accept any better/correcting answer instead of mine.

Why (I Think) This Is (Probably) Correct
The bitwise operators &, |, and ^ are defined as operating on the actual binary representation of the converted operands: the operands are said to undergo the "usual arithmetic conversions".
As I understand it, it logically follows that when you use two integer type expressions with non-negative values as operands to one of those operators, regardless of padding bits or trap representations, their value bits will "line up": therefore, the result's value bits will have the numerical value which matches with what you'd expect if you just assumed a "pure binary representation".
The kicker is that so long as you start with valid (non-trap), non-negative values as the operands, the operands should always promote to an integer type which can represent both of their values, and thus can logically represent the result value for either of those three operations. You also never run into possible issues with e.g. "signed zero", because limiting yourself to non-negative values avoids such problems. And so long as the result is used as a type which can hold the result value (or as an unsigned integer type), you won't introduce other similar/related issues.
Can These Operations Produce A Trap Representation From Non-Negative Non-Trap Operands?
Footnotes 44/53 and 45/54 of the last C99/C11 drafts, respectively, seem to suggest that the answer to this uncertainty depends on whether foo | bar, foo & bar, and foo ^ bar are considered "arithmetic operations". If they are, then they are not allowed to produce a trap representation result given non-trap values.
The Index of C99 and C11 standard drafts lists the bitwise operators as a subset of "arithmetic operators", suggesting yes. Though C89 doesn't organize its index that way, and my C Programming Language (2nd Edition) has a section called "arithmetic operators" which includes just +, -, *, /, and %, leaving the bitwise operators to a separate section. In other words, there is no clear answer on this point.
In practice, I don't know of any systems where this would happen (given the constraint of non-negative values for both operands), for what that's worth.
And one may consider the following: the type unsigned char is expected (and essentially blessed by C99 and C11) to be capable of accessing all bits of the underlying object representation of a type - it seems likely that the intent is that bitwise operators would work correctly with unsigned char - which would be integer-promoted to int on most modern systems, an unsigned int on the rest: therefore it seems very unlikely that foo | bar, foo & bar, or foo ^ bar would be allowed to produce trap representations - at least if foo and bar are both values that can be held in an unsigned char, and if the result is assigned into an unsigned char.
It is very tempting to generalize from the prior two points that this is a non-issue, although I wouldn't call it a rigorous proof.
Applied to the Example
Here's why I think this is correct and will work as expected:
UCHAR_MAX >> 1 subjects UCHAR_MAX to "usual arithmetic conversions": By definition, UCHAR_MAX will fit into either an int or unsigned int, because on most systems int can represent all values of unsigned char, and on the few that don't, an unsigned int has to be able to represent all values of unsigned char`, so that's just an "integer promotion" in this case.
Because bit shifts are defined in terms of values and not bitwise representations, UCHAR_MAX >> 1 is the quotient of UCHAR_MAX being divided by 2. (Let's call this result UCHAR_MAX_DIV_2).
UCHAR_MAX_DIV_2 + 1 subjects both arguments to usual arithmetic conversion: If UCHAR_MAX fit into an int, then the result is an int, otherwise, it is an unsigned int. Either way the conversions stop at integer promotion.
The result of UCHAR_MAX_DIV_2 + 1 is a positive value, which, when converted into an unsigned char, will have the most significant bit of the unsigned char set, and all other bits cleared (because the conversion would preserve the numerical value, and unsigned char is very strictly defined to have a pure binary representation without any padding bits or trap representations - but even without such an explicit requirement, the resulting value would have the most significant value bit set).
The (unsigned char) cast of MOST_SIGNIFICANT_BIT is actually redundant in this context - cast or no cast, it's going to be subject to the "usual arithmetic conversions" when bitwise-OR'ed. (but it might be useful in other contexts).
The above five steps will be constant-folded on pretty much every compiler out there - but a proper compiler should not constant-fold in a way which differs from the semantics of the code if it hadn't, so all of the above applies.
val | MOST_SIGNIFICANT_BIT is where it gets interesting: unlike << and >>, | and the other binary operators are defined in terms of manipulating the binary representations. Both val and MOST_SIGNIFICANT_BIT are subject to usual arithmetic conversions: details like the layout of bits or trap representations might mean a different binary representation, but should preserve the value: Given two variables of the same integer-type, holding non-negative, non-trap values, the value bits should "line up" correctly, so I actually expect that val | MOST_SIGNIFICANT_BIT produces the correct value (let's call this result VAL_WITH_MSB_SET). I don't see an explicit guarantee that this step couldn't produce a trap representation, but I don't believe there's an implementation of C where it would.
byte = VAL_WITH_MSB_SET forces a conversion: conversion of an integer type (so long as the value is a non-trap value) into a smaller, unsigned integer type is well defined: In this case, the value is reduced modulo UCHAR_MAX + 1. Since val is stated to be positive, the end result is that byte has the value of the remainder of VAL_WITH_MSB_SET divided by UCHAR_MAX + 1.
Explaining Where It Doesn't Work
If val were to be a negative value or non-integer type, we'd be out of luck because there is no longer a logical certainty that the bits that get binary-OR'ed will have the same "meaning":
If val is a signed integer type but has a negative value, then even though MOST_SIGNIFICANT_BIT is promoted to a compatible type, even though the value bits "line up", the result doesn't have any guaranteed meaning (because C makes no guarantee about how negative numbers are encoded), nor is the result (for any encoding) going to have the same meaning, especially once assigned into the unsigned char at the last step.
If val has a non-integer type, it's already violating the C standard, which constrains |, &, and ^ operators to operating on the "integer types". But if your compiler allowed it (or you did some tricks using unions, etc), then you have no guarantees about what each bit means, and thus the bit that you set are meaningless.

In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result
This is covered by C11 section 6.2.6.2 (C99 is similar). There is a footnote that clarifies the intent of the more technical text:
Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types.
The bitwise operators are arithmetic operations as discussed here.
In this footnote, "trap representation" excludes the special case "negative zero". The negative zero may or may not cause UB, but it has its own text (in 6.2.6.2 also) separate from the trap representation text.
So your question can actually be answered for both signed and unsigned values; the only dangerous case is the "negative zero" possibility. (Which can't occur from non-negative input).

Any reason to assign -0?

I am going through some old C code using Lint which stumbled upon this line:
int16_t max = -0;
The Lint message is that the "Constant expression evaluates to 0 in operation '-'".
Is there any reason why someone would use -0?

In the C specification (6.2.6.2 Integer types), it states the following (emphasis mine):
For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; there shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are
M value bits in the signed type and N in the unsigned type, then M £ N). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and
magnitude);
the sign bit has the value -(2N) (two’s complement);
the sign bit has the value -(2N - 1) (one’s complement).
Which of these applies is implementation-defined, as is whether the
value with sign bit 1 and all value bits zero (for the first two), or
with sign bit and all value bits 1 (for one’s complement), is a trap
representation or a normal value. In the case of sign and magnitude
and one’s complement, if this representation is a normal value it is
called a negative zero.
In other words, C supports three different representations for signed integers and two of them have the concept of signed zero, which makes a distinction between a positive and a negative zero.
So, my explanation is that perhaps the author of your code snippet was trying to produce a negative zero value. But, as pointed out in Jens Gustedt's answer, this expression cannot actually produce a negative zero, which means the author may have made a wrong assumption there.

No, I can't see any reason for this. Others have mentioned that it is possible to have platforms with "negative zero", but such a negative zero can never be produced by this expression, so this is useless.
The corresponding paragraph in the C standard is 6.2.6.2 p3, emphasis is mine:
If the implementation supports negative zeros, they shall be generated
only by:
— the &, |, ^, ~, <<, and >> operators with operands that
produce such a value;
— the +, -, *, /, and % operators where one operand is a negative zero and the result is zero;
— compound
assignment operators based on the above cases.
To produce a negative zero on such a platform you could use ~INT_MAX, for example, but that would not be a zero for other representations, so the code wouldn't be very portable.

That is for an architecture using a CPU with one's complement numbers.
Ones complement is a way to represent negative numbers, having a -0. Still in use for floating point.
For unsigned numbers the maximal number is indeed -0 : all 1s.
We are accustomed to two's complement numbers, having one negative number more. Though one's complement has some troubles around zero, the same holds for two's complement around MIN_INT: -MIN_INT == MIN_INT.
In the code above probably intended was unsigned numbers:
uint16_t max = (uint16_t) -1;

No reason for it with C99 or later.
int16_t is an exact-width integer type.
The typedef name intN_t designates a signed integer type with width N, no padding bits, and a two’s complement representation. Thus, int8_t denotes such a signed integer type with a width of exactly 8 bits. C11dr §7.20.1.1 1
There is no signed zero with two’s complement. So code is equivalent to
int16_t max = 0;
IMO, the int16_t max = -0; hoped for result, on a non-2's complement platform, was to initialize max to -0 to flag an array of length 0 or one that only contained elements with the value -0.

How to do a bit representation in a C-standard way?

As per the C standard the value representation of a integer type is implementation defined. So 5 might not be represented as 00000000000000000000000000000101 or -1 as 11111111111111111111111111111111 as we usually assume in a 32-bit 2's complement. So even though the operators ~, << and >> are well defined, the bit patterns they will work on is implementation defined. The only defined bit pattern I could find was "§5.2.1/3 A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.".
So my questions is - Is there a implementation independent way of converting integer types to a bit pattern?
We can always start with a null character and do enough bit operations on it to get it to a desired value, but I find it too cumbersome. I also realise that practically all implementations will use a 2's complement representation, but I want to know how to do it in a pure C standard way. Personally I find this topic quite intriguing due to the matter of device-driver programming where all code written till date assumes a particular implementation.

In general, it's not that hard to accommodate unusual platforms for the most cases (if you don't want to simply assume 8-bit char, 2's complement, no padding, no trap, and truncating unsigned-to-signed conversion), the standard mostly gives enough guarantees (a few macros to inspect certain implementation details would be helpful, though).
As far as a strictly conforming program can observe (outside bit-fields), 5 is always encoded as 00...0101. This is not necessarily the physical representation (whatever this should mean), but what is observable by portable code. A machine using Gray code internally, for example, would have to emulate a "pure binary notation" for bitwise operators and shifts.
For negative values of signed types, different encodings are allowed, which leads to different (but well-defined for every case) results when re-interpreting as the corresponding unsigned type. For example, strictly conforming code must distinguish between (unsigned)n and *(unsigned *)&n for a signed integer n: They are equal for two's complement without padding bits, but different for the other encodings if n is negative.
Further, padding bits may exist, and signed integer types may have more padding bits than their corresponding unsigned counterparts (but not the other way round, type-punning from signed to unsigned is always valid). sizeof cannot be used to get the number of non-padding bits, so e.g. to get an unsigned value where only the sign-bit (of the corresponding signed type) is set, something like this must be used:
#define TYPE_PUN(to, from, x) ( *(to *)&(from){(x)} )
unsigned sign_bit = TYPE_PUN(unsigned, int, INT_MIN) &
TYPE_PUN(unsigned, int, -1) & ~1u;
(there are probably nicer ways) instead of
unsigned sign_bit = 1u << sizeof sign_bit * CHAR_BIT - 1;
as this may shift by more than the width. (I don't know of a constant expression giving the width, but sign_bit from above can be right-shifted until it's 0 to determine it, Gcc can constant-fold that.) Padding bits can be inspected by memcpying into an unsigned char array, though they may appear to "wobble": Reading the same padding bit twice may give different results.
If you want the bit pattern (without padding bits) of a signed integer (little endian):
int print_bits_u(unsigned n) {
for(; n; n>>=1) {
putchar(n&1 ? '1' : '0'); // n&1 never traps
}
return 0;
}
int print_bits(int n) {
return print_bits_u(*(unsigned *)&n & INT_MAX);
/* This masks padding bits if int has more of them than unsigned int.
* Note that INT_MAX is promoted to unsigned int here. */
}
int print_bits_2scomp(int n) {
return print_bits_u(n);
}
print_bits gives different results for negative numbers depending on the representation used (it gives the raw bit pattern), print_bits_2scomp gives the two's complement representation (possibly with a greater width than a signed int has, if unsigned int has less padding bits).
Care must be taken not to generate trap representations when using bitwise operators and when type-punning from unsigned to signed, see below how these can potentially be generated (as an example, *(int *)&sign_bit can trap with two's complement, and -1 | 1 can trap with ones' complement).
Unsigned-to-signed integer conversion (if the converted value isn't representable in the target type) is always implementation-defined, I would expect non-2's complement machines to differ from the common definition more likely, though technically, it could also become an issue on 2's complement implementations.
From C11 (n1570) 6.2.6.2:
(1) For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2N-1, so that objects of that type shall be capable of representing values from 0 to 2N-1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.
(2) For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed
type and N in the unsigned type, then M≤N ). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value -(2M) (two's complement);
the sign bit has the value -(2M-1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.

To add to mafso's excellent answer, there's a part of the ANSI C rationale which talks about this:
The Committee has explicitly restricted the C language to binary architectures, on the grounds that this stricture was implicit in any case:
Bit-fields are specified by a number of bits, with no mention of “invalid integer” representation. The only reasonable encoding for such bit-fields is binary.
The integer formats for printf suggest no provision for “invalid integer” values, implying that any result of bitwise manipulation produces an integer result which can be printed by printf.
All methods of specifying integer constants — decimal, hex, and octal — specify an integer value. No method independent of integers is defined for specifying “bit-string constants.” Only a binary encoding provides a complete one-to-one mapping between bit strings and integer values.
The restriction to binary numeration systems rules out such curiosities as Gray code and makes
possible arithmetic definitions of the bitwise operators on unsigned types.
The relevant part of the standard might be this quote:
3.1.2.5 Types
[...]
The type char, the signed and unsigned integer types, and the
enumerated types are collectively called integral types. The
representations of integral types shall define values by use of a pure
binary numeration system.

If you want to get the bit-pattern of a given int, then bit-wise operators are your friends. If you want to convert an int to its 2-complement representation, then arithmetic operators are your friends. The two representations can be different, as it is implementation defined:
Std Draft 2011. 6.5/4. Some operators (the unary operator ~, and the
binary operators <<, >>, &, ^, and |, collectively described as
bitwise operators) are required to have operands that have integer
type. These operators yield values that depend on the internal
representations of integers, and have implementation-defined and
undefined aspects for signed types.
So it means that i<<1 will effectively shift the bit-pattern by one position to the left, but that the value produced can be different than i*2 (even for smal values of i).

Bitwise operations on negative numbers

I was reading a book on C, where in some section it says: " The bitwise operations are typically used with unsigned types.".
Question: why?

Simply because it is not immediately clear what bit operations on the sign bit of a signed number should mean.
Unsigned types have no special bits, everything works straight
forward.
Signed types have a special sign bit and can interpret that with
three different encodings to represent negative values (ones and two's complement or sign and magnitude).

From the programming languages' point of view the unsigned does not mean, that it cannot be negative. It means, that the first bit of the number is not used for determining if the value is negative or positive.
So, for an 8 bit value the bitwise operators consider all the 8 bits, thus working on the range 0..255 (while the mathematical operators can consider the first bit as being the signedness indicator, thus working on a range of -128 to +127).

Unsigned operands do not have any special bits for sign representation.
For signed operands, the special bit for sign representation may interrupt operations which we want to perform.

Bitwise operations are typically used with unsigned types or signed types with non-negative values.
The C Standard supports 3 different representations for negative values: applying bitwise operators on such values has at best an implementation defined behavior.
For example -1 is represented:
as |1000|0000|0000|0001| on a 16-bit system with sign magnitude.
as |1111|1111|1111|1110| on a 16-bit system with one's complement.
and more commonly as |1111|1111|1111|1111| on a 16-bit system with two's complement.
As a consequence, -1 & 3 can have 3 different values:
1 on sign-magnitude systems,
2 on one's complement systems,
3 on two's complement systems,
Other operations may behave in even more unexpected ways on negative numbers, such as << and >>, especially on the DS9K.
Conversely, the bitwise operations are fully defined on unsigned types, which are mandated to have binary representation (with some restrictions on the right operand of shift operators).