Are bitwise operations portable? - c

Suppose we have the following code:
int j = -1 & 0xFF;
The resulting value in j could be one of the following based on the underlying representation:
System             Value
Two's complement   0xFF
One's complement   0xFE
Sign/Magnitude     0x01
But are the &, |, and ^ operators in C always defined in terms of two's complement (thus making j always be equal to 0xFF), or are they defined in terms of the underlying representation of the system?

They're defined in terms of the actual bit representation. From the C11 final draft:
The result of the binary & operator is the bitwise AND of the operands (that is, each bit in the result is set if and only if each of the corresponding bits in the converted operands is set).
...
The result of the ^ operator is the bitwise exclusive OR of the operands (that is, each bit in the result is set if and only if exactly one of the corresponding bits in the converted operands is set).
...
The result of the | operator is the bitwise inclusive OR of the operands (that is, each bit in the result is set if and only if at least one of the corresponding bits in the converted operands is set).
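If the goal is a result that does not depend on the signed representation, one option is to do the masking on an unsigned copy of the value. The sketch below is only an illustration; the helper name low_byte is made up for this example and is not from the question.

```c
/* Hypothetical helper: low 8 bits of v, computed on an unsigned copy. */
unsigned low_byte(int v)
{
    /* Conversion to unsigned is defined by value (modulo UINT_MAX + 1),
       so -1 becomes UINT_MAX whatever the signed representation is. */
    return (unsigned)v & 0xFFu;
}
```

With this helper, low_byte(-1) is 0xFF on two's complement, one's complement and sign/magnitude machines alike, because the AND is applied after the value-based conversion.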

Related

When are bitwise operations undefined in C? [duplicate]

Bitwise operators (~, &, | and ^) operate on the bitwise representation of their promoted operands. Can such operations cause undefined behavior?
For example, the ~ operator is defined this way in the C Standard:
6.5.3.3 Unary arithmetic operators
The result of the ~ operator is the bitwise complement of its (promoted) operand (that is, each bit in the result is set if and only if the corresponding bit in the converted operand is not set). The integer promotions are performed on the operand, and the result has the promoted type. If the promoted type is an unsigned type, the expression ~E is equivalent to the maximum value representable in that type minus E.
On all architectures, ~0 produces a bit pattern with the sign bit set to 1 and all value bits set to 1. On a one's complement architecture, this representation corresponds to a negative zero. Can this bit pattern be a trap representation?
Are there other examples of undefined behavior involving simple bitwise operators for more common architectures?
For one's complement systems, the standard explicitly allows for trap representations on implementations that do not support negative zeros in signed integers (C11 6.2.6.2p4):
If the implementation does not support negative zeros, the behavior of the &, |, ^, ~, <<, and >> operators with operands that would produce such a value is undefined.
Then again, one's complement systems are not exactly common; GCC, for example, doesn't support any!
C11 does imply that the implementation-defined and undefined aspects are just allowed for signed types (C11 6.5p4).
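Since the problematic cases are limited to signed types, a common way to sidestep them is to apply ~ to unsigned operands. The following is a small sketch under that assumption; the function names are invented for illustration.

```c
#include <limits.h>

/* Complement of an unsigned value: no sign bit, no negative zero. */
unsigned complement_bits(unsigned x)
{
    return ~x;
}

/* Checks the standard's statement that, for an unsigned type,
   ~E equals the maximum value of the type minus E. */
int complement_matches_max_minus(unsigned x)
{
    return complement_bits(x) == UINT_MAX - x;
}
```

Here complement_bits(0u) is simply UINT_MAX, an ordinary value, rather than a pattern that might be a negative zero.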

Why only & and | are called bit wise operators?

In programming languages operators like & and | are called bit wise operators. My question is even addition(+) and subtraction(-) or to that matter any mathematical expressions are bit wise operations. I mean the calculation happens on binary data as machine cannot understand decimals. I think for addition also there will be an add gate so why only operators like & and |(or) are called bit wise operators.
Because the bitwise operators only operate on the bits, they do nothing "more" and there's no question of the underlying format.
Addition treats a bunch of bits as a number, which might be signed (or even floating point); this means it must interpret the bits in a particular way (e.g. two's complement, signed magnitude, floating point, and so on), while the bitwise operators treat the bits as just "raw" bits, with no interpretation and no dependencies between bits as there might be in the higher-level numerical formats.
Also, you forgot some: there's also the ^ bitwise XOR operator, ~ which is bitwise not, and of course the shifting operators << and >>.
In the C language, there are a lot of operators called bitwise: & | ^ << >> ~ &= |= ^= <<= >>=. They have in common that they are only used for bit manipulation on the "raw binary level", regardless of what kind of data the variable contains.
But of course, all operators have the purpose of altering bits. Bitwise is just a naming convention by the C language. Strictly speaking, C groups operators together in different groups with related operators, like this (C11 6.5):
Additive operators + -
Bitwise shift operators >> <<
Bitwise AND operator &
Bitwise exclusive OR operator ^
Bitwise inclusive OR operator |
And so on.
^ (XOR) and the shift operators are also bitwise operators. The difference between these and other operators is mainly that bitwise operators do not assume a particular encoding of a value, as is the case with the two's complement representation of integers. A small exception to this rule is >> on a signed value: an arithmetic right shift only makes sense if the leftmost bit is interpreted as a sign bit.
In bitwise operations, the value of the bit at some particular position in the result may depend on the value of the bit in the same position in the operands, but does not depend on bit values in any other positions.
With one operand, there are 2 possible inputs (false, true) so 4 bitwise operations are possible (x, not x, 0 and 1). With two operands, there are 4 possible input combinations so in total 16 such operations are possible (and, or, xor, not x, not y, x, y, 0, 1, etc).
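The count of 16 two-operand operations can be made concrete by describing each one as a 4-bit truth table and applying it to every bit position independently. This is only an illustrative sketch; the function name and the table encoding are assumptions made for the example.

```c
/* truth_table is a 4-bit table: bit ((x_bit << 1) | y_bit) holds the
   result for that input pair. Each result bit depends only on the
   bits of x and y at the same position. */
unsigned apply_truth_table(unsigned truth_table, unsigned x, unsigned y)
{
    unsigned result = 0u;
    if (truth_table & 0x8u) result |=  x &  y;   /* inputs (1,1) */
    if (truth_table & 0x4u) result |=  x & ~y;   /* inputs (1,0) */
    if (truth_table & 0x2u) result |= ~x &  y;   /* inputs (0,1) */
    if (truth_table & 0x1u) result |= ~x & ~y;   /* inputs (0,0) */
    return result;
}
```

With this encoding, table 0x8 is AND, 0xE is OR, 0x6 is XOR and 0x3 is "not x"; the remaining values of the 4-bit table cover the other operations.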

negative integer number >> 31 = -1 not 1? [duplicate]

So, let's say I have a signed integer (a couple of examples):
-1101363339 = 10111110 01011010 10000111 01110101 in binary.
-2147463094 = 10000000 00000000 01010000 01001010 in binary.
-20552 = 11111111 11111111 10101111 10111000 in binary.
Now: -1101363339 >> 31, for example, should equal 1, right? But on my computer, I am getting -1. Regardless of what negative integer I pick, if x is a negative number, x >> 31 = -1. Why? Clearly in binary it should be 1.
Per C99 6.5.7 Bitwise shift operators:
If E1 has a signed type and a negative value, the resulting value is implementation-defined.
where E1 is the left-hand side of the shift expression. So it depends on your compiler what you'll get.
In most languages when you shift to the right it does an arithmetic shift, meaning it preserves the most significant bit. Therefore in your case you have all 1's in binary, which is -1 in decimal. If you use an unsigned int you will get the result you are looking for.
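To get the 1 the question expects, the shift can be done on an unsigned value, where zeros are always shifted in from the left. A minimal sketch, assuming a 32-bit value as in the question (the helper name top_bit is made up here):

```c
#include <stdint.h>

/* Returns the most significant bit of a 32-bit value as 0 or 1. */
unsigned top_bit(int32_t x)
{
    /* The cast is value-based and well defined; the shift on uint32_t
       fills with zeros, so the result is never "all ones". */
    return (unsigned)((uint32_t)x >> 31);
}
```

For every negative input, including the three examples in the question, top_bit returns 1.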
Per C 2011 6.5.7 Bitwise shift operators:
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.
Basically, the right-shift of a negative signed integer is implementation defined but most implementations choose to do it as an arithmetic shift.
The behavior you are seeing is called an arithmetic shift, which is when right shifting extends the sign bit. This means that the MSBs will carry the same value as the original sign bit. In other words, a negative number will always be negative after a right shift operation.
Note that this behavior is implementation defined and cannot be guaranteed with a different compiler.
What you are seeing is an arithmetic shift, in contrast to the bitwise shift you were expecting; i.e., the compiler, instead of "brutally" shifting the bits, is propagating the sign bit, thus dividing by 2^N.
When talking about unsigned ints and positive ints, a right shift is a very simple operation - the bits are shifted to the right by N places (inserting 0s on the left), regardless of their meaning. In such cases, the operation is equivalent to dividing by 2^N (and the C standard actually defines it like that).
The distinction comes up when talking about negative numbers. Several representations of negative numbers exist, although 2's complement is currently by far the most common one for integers.
The problem with a "brutal" bitwise shift here is, for starters, that one of the bits is used in some way to express the sign; thus, shifting the binary digits regardless of the representation of negative integers can give unexpected results.
For example, in 2's complement representation the most significant bit is 1 for negative numbers and 0 for nonnegative numbers; applying a bitwise shift (with zeroes inserted on the left) to a negative number would, among other things, make it positive, rather than producing the (usually expected) division by 2^N.
So, the arithmetic shift is introduced; negative numbers represented in 2's complement have an interesting property: the division-by-2^N behavior of the shift is preserved (rounding toward negative infinity) if, instead of inserting zeroes from the left, you insert bits that have the same value as the original sign bit.
In this way, signed divisions by 2^N can be performed with just a bit of extra logic in the shift, without having to resort to a fully-fledged division routine.
Now, is arithmetic shift guaranteed for signed integers? In some languages yes (Java, for example), but in C it's not like that - the behavior of the shift operators when dealing with negative integers is left as an implementation-defined detail.
As often happens, this is due to different hardware support for the operation; C is used on vastly different platforms, and, especially in the past, there was quite a difference in the "cost" of operations depending on the platform.
For example, if the standard required an arithmetic shift and the processor did not provide an arithmetic right shift instruction, the compiler would be forced to emit a much slower DIV instruction of some kind, which could be a problem in an inner loop on slower processors. For these reasons, the C standard leaves it up to the implementor to do the most appropriate thing for the current platform.
In your case, your implementation probably chose arithmetic shift because you are running on an x86 processor, that uses 2's complement arithmetic and provides both bitwise and arithmetic shift as single CPU instructions.
Actually, languages like Java even have separate arithmetic and logical (zero-filling) shift operators, >> and >>> - this is mainly due to the fact that they do not have unsigned types to e.g. store bitfields.
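When sign-propagating behavior is wanted regardless of what the compiler does for negative operands, it can be written using only shifts of nonnegative values. The following is a sketch of one well-known way to do that, not something taken from the answers above; the function name is hypothetical and the shift count is assumed to be in the range 0..31.

```c
#include <stdint.h>

/* Arithmetic right shift built only from well-defined operations.
   n is assumed to be between 0 and 31. */
int32_t arithmetic_shift_right(int32_t x, unsigned n)
{
    if (x >= 0)
        return x >> n;      /* defined: nonnegative operand */
    /* For negative x, ~x is nonnegative (int32_t is two's complement),
       so the inner shift is defined; the outer ~ restores the sign. */
    return ~(~x >> n);
}
```

Note that the result rounds toward negative infinity, so shifting -7 by one place gives -4, whereas -7 / 2 in C gives -3.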

Bitwise AND on signed chars

I have a file that I've read into an array of data type signed char. I cannot change this fact.
I would now like to do this: !((c[i] & 0xc0) & 0x80) where c[i] is one of the signed characters.
Now, I know from section 6.5.10 of the C99 standard that "Each of the operands [of the bitwise AND] shall have integral type."
And Section 6.5 of the C99 specification tells me:
Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.
My question is two-fold:
Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?
Is there a list of these "implementation-defined aspects" anywhere (say for MSVC and GCC)?
Or you could take a different route and argue that this produces the same result for both signed and unsigned chars for any value of c[i].
Naturally, I will reward references to relevant standards or authoritative texts and discourage "informed" speculation.
As others point out, in all likelihood your implementation is based on two's complement, and will give exactly the result you expect.
However, if you're worried about the results of an operation involving a signed value, and all you care about is the bit pattern, simply cast directly to an equivalent unsigned type. The results are defined under the standard:
6.3.1.3 Signed and unsigned integers
...
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
This is essentially specifying that the result will be the two's complement representation of the value.
Fundamental to this is that in two's complement maths the result of a calculation is modulo some power of two (i.e. 2^N, where N is the number of bits in the type), which in turn is exactly equivalent to masking off the relevant number of bits. And the complement of a number is the number subtracted from the power of two.
Thus adding a negative value is the same as adding any value which differs from the value by a multiple of that power of two.
i.e:
(0 + signed_value) mod (2^N)
==
(2^N + signed_value) mod (2^N)
==
(7 * 2^N + signed_value) mod (2^N)
etc. (if you know modulo, that should be pretty self-evidently true)
So if you have a negative number, adding a power of two will make it positive (-5 + 256 = 251), but the bottom 'N' bits will be exactly the same (0b11111011) and it will not affect the outcome of a mathematical operation. As values are then truncated to fit the type, the result is exactly the binary value you expected, even if the result 'overflows' (i.e. what you might think happens if the number was positive to start with - this wrapping is also well defined behaviour).
So in 8-bit two's complement:
-5 is the same as 251 (i.e 256 - 5) - 0b11111011
If you add 30, and 251, you get 281. But that's larger than 256, and 281 mod 256 equals 25. Exactly the same as 30 - 5.
251 * 2 = 502. 502 mod 256 = 246. 246 and -10 are both 0b11110110.
Likewise if you have:
unsigned int a;
int b;
a - b == a + (unsigned int) -b;
Under the hood, this cast is unlikely to be implemented with arithmetic and will almost certainly be a straight assignment from one register/value to another, or just optimised out altogether, as the maths does not make a distinction between signed and unsigned (interpretation of CPU flags is another matter, but that's an implementation detail). The standard exists to ensure that an implementation doesn't take it upon itself to do something strange instead, or I suppose, for some weird architecture which isn't using two's complement...
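The 8-bit arithmetic in this answer can be checked directly with fixed-width unsigned types. The sketch below just restates those examples in code; the helper names to_u8 and add_u8 are invented for illustration.

```c
#include <stdint.h>

/* Value-based conversion to 8 bits: the result is v modulo 256. */
uint8_t to_u8(int v)
{
    return (uint8_t)v;
}

/* 8-bit addition with wraparound (the sum is computed as int,
   then reduced modulo 256 by the cast). */
uint8_t add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

So to_u8(-5) is 251, add_u8(30, 251) is 25 (the same as 30 - 5), and both to_u8(251 * 2) and to_u8(-10) are 246.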
unsigned char UC = *(unsigned char*)&C - this is how you can convert signed C to unsigned keeping the "bit pattern". Thus you could change your code to something like this:
!(( (*(unsigned char*)(c+i)) & 0xc0) & 0x80)
Explanation(with references):
761 When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.
1124 When applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1.
Together, these two imply that the unsigned char pointer points to the same byte as the original signed char pointer.
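The pointer-cast approach reads the stored byte as-is instead of converting the value. A short sketch of it, next to a memcpy variant that does the same thing, is given below; both function names are hypothetical.

```c
#include <string.h>

/* Reads the object representation of c through an unsigned char pointer. */
unsigned char bits_via_pointer(signed char c)
{
    return *(unsigned char *)&c;
}

/* Same idea, copying the single byte instead of casting the pointer. */
unsigned char bits_via_memcpy(signed char c)
{
    unsigned char u;
    memcpy(&u, &c, 1);
    return u;
}
```

On a two's complement machine both return 251 for -5, which is also what the plain value cast (unsigned char)c gives; the two approaches could only differ on other representations.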
You appear to have something similar to:
signed char c[] = "\x7F\x80\xBF\xC0\xC1\xFF";
for (int i = 0; c[i] != '\0'; i++)
{
    if (!((c[i] & 0xC0) & 0x80))
        ...
}
You are (correctly) concerned about sign extension of the signed char type. In practice, however, (c[i] & 0xC0) will convert the signed character to a (signed) int, but the & 0xC0 will discard any set bits in the more significant bytes; the result of the expression will be in the range 0x00 .. 0xFF. This will, I believe, apply whether you use sign-and-magnitude, one's complement or two's complement binary values. The detailed bit pattern you get for a specific signed character value varies depending on the underlying representation; but the overall conclusion that the result will be in the range 0x00 .. 0xFF is valid.
There is an easy resolution for that concern — cast the value of c[i] to an unsigned char before using it:
if (!(((unsigned char)c[i] & 0xC0) & 0x80))
The value c[i] is converted to an unsigned char before it is promoted to an int (or, the compiler might promote to int, then coerce to unsigned char, then promote the unsigned char back to int), and the unsigned value is used in the & operations.
Of course, the code is now merely redundant. Using & 0xC0 followed by & 0x80 is entirely equivalent to just & 0x80.
If you're processing UTF-8 data and looking for continuation bytes, the correct test is:
if (((unsigned char)c[i] & 0xC0) == 0x80)
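As an illustration of that continuation-byte test on a signed char buffer, the sketch below counts code points by skipping continuation bytes. It assumes the input is already valid UTF-8, and the function names are made up for this example.

```c
#include <stddef.h>

/* True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(signed char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Counts code points in n bytes of (assumed valid) UTF-8 by counting
   every byte that is not a continuation byte. */
size_t utf8_code_points(const signed char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_utf8_continuation(s[i]))
            count++;
    }
    return count;
}
```

For the three bytes 'h', 0xC3, 0xA9 (the UTF-8 encoding of "hé"), utf8_code_points returns 2.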
"Since I want to work with the original bit patterns from the file,
how can I convert/cast my signed char to unsigned char so that the bit
patterns remain unchanged?"
As someone already explained in a previous answer to your question on the same topic, any small integer type, be it signed or unsigned, will get promoted to the type int whenever used in an expression.
C11 6.3.1.1
"If an int can represent all values of the original type (as
restricted by the width, for a bit-field), the value is converted to
an int; otherwise, it is converted to an unsigned int. These are
called the integer promotions."
Also, as explained in the same answer, unsuffixed integer literals that fit in an int, such as 0xC0 and 0x80, are of the type int.
Therefore, your expression will boil down to the pseudo code (int) & (int) & (int). The operations will be performed on three temporary int variables and the result will be of type int.
Now, if the original data contained bits that may be interpreted as sign bits for the specific signedness representation (in practice this will be two's complement on all systems), you will get problems. Because these bits will be preserved upon promotion from signed char to int.
And then the bit-wise & operator performs an AND on every single bit regardless of the contents of its integer operand (C11 6.5.10/3), be it signed or not. If you had data in the signed bits of your original signed char, it will now be lost. Because the integer literals (0xC0 or 0x80) will have no bits set that corresponds to the sign bits.
The solution is to prevent the sign bits from getting transferred to the "temporary int". One solution is to cast c[i] to unsigned char, which is completely well-defined (C11 6.3.1.3). This will tell the compiler that "the whole contents of this variable is an integer, there are no sign bits to be concerned about".
Better yet, make a habit of always using unsigned data in every form of bit manipulations. The purist, 100% safe, MISRA-C compliant way of re-writing your expression is this:
if ( (((uint8_t)c[i] & 0xc0u) & 0x80u) > 0u )
The u suffix actually forces the expression to have type unsigned int, but it is good practice to always cast to the intended type. It tells the reader of the code "I actually know what I am doing and I also understand all weird implicit promotion rules in C".
And then if we know our hex, (0xc0 & 0x80) is pointless, it is always true. And x & 0xC0 & 0x80 is always the same as x & 0x80. Therefore simplify the expression to:
if ( ((uint8_t)c[i] & 0x80u) > 0u)
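The claim that x & 0xC0 & 0x80 always equals x & 0x80 is easy to confirm exhaustively for a byte. A small sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Returns 1 if the double mask and the single mask agree for all
   256 possible byte values, 0 otherwise. */
int masks_agree_for_all_bytes(void)
{
    for (unsigned v = 0u; v <= 0xFFu; v++) {
        uint8_t x = (uint8_t)v;
        if (((x & 0xC0u) & 0x80u) != (x & 0x80u))
            return 0;
    }
    return 1;
}
```

This returns 1, since 0x80 is a subset of the bits in 0xC0 and masking with 0xC0 first cannot clear the bit that 0x80 selects.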
"Is there a list of these "implementation-defined aspects" anywhere"
Yes, the C standard conveniently lists them in Annex J.3. The only implementation-defined aspect you encounter in this case, though, is the signedness representation of integers, which in practice is always two's complement.
EDIT:
The quoted text in the question says that the various bitwise operators produce implementation-defined results. Even in the annex, this is only briefly mentioned as implementation-defined, with no exact references. The actual chapter 6.5 doesn't say much regarding implementation-defined behavior of & | etc. The only operators where it is explicitly mentioned are << and >>, where left shifting a negative number is even undefined behavior, but right shifting it is implementation-defined.
