I can't seem to find the relevant parts in the C standard fully defining the behavior of the unary minus operator with unsigned operands.
The 2003 C++ standard (yes, C++, bear with me for a few lines) says in 5.3.1c7: The negative of an unsigned quantity is computed by subtracting its value from 2^n, where n is the number of bits in the promoted operand.
The 1999 C standard, however, doesn't include such an explicit statement and does not clearly define the unary - behavior in either 6.5.3.3c1,3 or 6.5c4. The latter says "Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, ...) ... return values that depend on the internal representations of integers, and have implementation-defined and undefined aspects for signed types.", which excludes the unary minus, so things seem to remain vague.
This earlier question refers to the K&R ANSI C book, section A.7.4.5 that says The negative of an unsigned quantity is computed by subtracting the promoted value from the largest value of the promoted type and adding one.
What would be the 1999 C standard equivalent to the above quote from the book?
6.2.5c9 says: A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
Is that it? Or is there something else I'm missing?
Yes, 6.2.5c9 is exactly the paragraph that you looked for.
The behavior of the unary minus operator on unsigned operands has nothing to do with whether a machine uses two's-complement arithmetic with signed numbers. Instead, given unsigned int x,y; the statement y=-x; will cause y to receive whatever value it would have to hold to make x+y equal zero. If x is zero, y will likewise be zero. For any other value of x, it will be UINT_MAX-x+1, in which case the arithmetic value of x+y will be UINT_MAX+1+(x-x) which, when assigned to an unsigned integer, will have UINT_MAX+1 subtracted from it, yielding zero.
In every implementation I know of, a negative is calculated as two's complement...
int a = 12;
int b = -a;
int c = ~a + 1;
assert(b == c);
...so there is really no physical difference between negative signed and "negative" unsigned integers - the only difference is in how they are interpreted.
So in this example...
unsigned a = 12;
unsigned b = -a;
int c = -a;
...the b and c are going to contain the exact same bits. The only difference is that b is interpreted as 2^32-12 (or 2^64-12), while c is interpreted as "normal" -12.
So, a negative is calculated in the exact same way regardless of "sign-ness", and the casting between unsigned and signed is actually a no-op (and can never cause an overflow in a sense that some bits need to be "cut-off").
This is late, but anyway...
C states (in a rather hard way, as mentioned in other answers already) that any unsigned type is a binary representation with a type-specific number of bits, and that all arithmetic operations on unsigned types are done (mod 2^N), 'mod' being the mathematical definition of the modulus, and 'N' being the number of bits used to represent the type.
The unary minus operator applied to an unsigned type behaves as if the value had been promoted to the next bigger signed type, then negated, and then converted back to unsigned and truncated to the source type. (This is a slight simplification, because integer promotion happens for all types that have fewer bits than 'int', but it comes close enough I think.)
Some compilers do indeed give warnings when applying the unary minus to an unsigned type, but this is merely for the benefit of the programmer. IMHO the construct is well-defined and portable.
But if in doubt, just don't use the unary minus: write '0u - x' instead of '-x', and everything will be fine. Any decent code generator will create just a negate instruction from this, unless optimization is fully disabled.
So can I cast the values to unsigned values, do the operation and cast back, and get the same result? I want to do this because unsigned integers can overflow, while signed can't.
Unsigned integer arithmetic does not overflow in C terminology because it is defined to wrap modulo 2^N, where N is the number of bits in the unsigned type being operated on, per C 2018 6.2.5 9:
… A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
For other types, if an overflow occurs, the behavior is not defined by the C standard, per 6.5 5:
If an exceptional condition occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined. Note that not just the result is undefined; the entire behavior of the program is undefined. It could give a result you do not expect, it could trap, or it could execute entirely different code from what you expect.
Regarding your question:
So can I cast the values to unsigned values, do the operation and cast back, and get the same result?
we have two problems. First, consider a + b given int a, b;. If a + b overflows, then the behavior is not defined by the C standard. So we cannot say whether converting to unsigned, adding, and converting back to int will produce the same result because there is no defined result for a + b to start with.
Second, the conversion back is partly implementation-defined, per C 6.3.1.3. Consider int c = (unsigned) a + (unsigned) b;, which implicitly converts the unsigned sum to an int to store in c. Paragraph 1 tells us that, if the value of the sum is representable in int, it is the result of the conversion. But paragraph 3 tells us what happens if the value is not representable in int:
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
GCC, for example, defines the result to be the result of wrapping modulo 2^N. So, for int c = (unsigned) a + (unsigned) b;, GCC will produce the same result as int c = a + b; would if a + b wrapped modulo 2^N. However, GCC does not guarantee the latter. When optimizing, GCC expects overflow will not occur, which can result in it eliminating any code branches where the program does allow overflow to occur. (GCC may have some options regarding its treatment of overflow.)
Additionally, even if both signed arithmetic and unsigned arithmetic wrap, performing an operation using unsigned values and converting back does not mathematically produce the same result as doing the operation with signed values. For example, consider -3/2. The int result is −1. But if -3 is converted to 32-bit unsigned, the resulting value is 2^32−3, and then (int) ((unsigned) -3 / (unsigned) 2) is 2^31−2 = 2,147,483,646.
Question
As a fledgling C language-lawyer, I have run into a situation where I am uncertain if I understand what C specifications logically guarantee correctly.
As I understand it, "bitwise operators" (&, |, and ^) will work as intuitively expected on non-negative values with any of C's integer types (char/short/int/long/etc, whether signed or unsigned) - regardless of underlying object representation.
Is this correct understanding of what is/isn't strictly well-defined behavior in C?
Key Point
In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result (from the operation itself, not from assigning/interpreting the result into/as an inappropriate type).
Example
Consider the following code:
#include <limits.h>
#define MOST_SIGNIFICANT_BIT ((unsigned char)((UCHAR_MAX >> 1) + 1))
/* ... in some function: */
unsigned char byte;
/* Using broad meaning of "byte", not necessarily "octet" */
int val;
/* val is assigned an arbitrary _non-negative_ value at runtime */
byte = val | MOST_SIGNIFICANT_BIT;
Do note the above comment that val receives a non-negative value at runtime (which can be represented by val's type).
My expectation is that byte has the most significant bit set, and the lower bits are the pure binary representation (no padding bits or trap representations) of the bottom CHAR_BIT - 1 bits of the numerical value of val.
I expect this to remain true even if the type of val is changed to any other integer type, but I expect this guarantee to disappear as soon as the value of val becomes negative (no one result guaranteed for all implementations), or the type of val is changed to any non-integer type (violates a constraint of C's definition of the bitwise operators).
Self-Answering
I am posting my explanation of my current understanding as an answer because I'm fairly confident in it, but I am looking for corrections to any of my misconceptions and will accept any better/correcting answer instead of mine.
Why (I Think) This Is (Probably) Correct
The bitwise operators &, |, and ^ are defined as operating on the actual binary representation of the converted operands: the operands are said to undergo the "usual arithmetic conversions".
As I understand it, it logically follows that when you use two integer type expressions with non-negative values as operands to one of those operators, regardless of padding bits or trap representations, their value bits will "line up": therefore, the result's value bits will have the numerical value which matches with what you'd expect if you just assumed a "pure binary representation".
The kicker is that so long as you start with valid (non-trap), non-negative values as the operands, the operands should always promote to an integer type which can represent both of their values, and thus can logically represent the result value for either of those three operations. You also never run into possible issues with e.g. "signed zero", because limiting yourself to non-negative values avoids such problems. And so long as the result is used as a type which can hold the result value (or as an unsigned integer type), you won't introduce other similar/related issues.
Can These Operations Produce A Trap Representation From Non-Negative Non-Trap Operands?
Footnotes 44/53 and 45/54 of the last C99/C11 drafts, respectively, seem to suggest that the answer to this uncertainty depends on whether foo | bar, foo & bar, and foo ^ bar are considered "arithmetic operations". If they are, then they are not allowed to produce a trap representation result given non-trap values.
The Index of C99 and C11 standard drafts lists the bitwise operators as a subset of "arithmetic operators", suggesting yes. Though C89 doesn't organize its index that way, and my C Programming Language (2nd Edition) has a section called "arithmetic operators" which includes just +, -, *, /, and %, leaving the bitwise operators to a separate section. In other words, there is no clear answer on this point.
In practice, I don't know of any systems where this would happen (given the constraint of non-negative values for both operands), for what that's worth.
And one may consider the following: the type unsigned char is expected (and essentially blessed by C99 and C11) to be capable of accessing all bits of the underlying object representation of a type - it seems likely that the intent is that bitwise operators would work correctly with unsigned char - which would be integer-promoted to int on most modern systems, an unsigned int on the rest: therefore it seems very unlikely that foo | bar, foo & bar, or foo ^ bar would be allowed to produce trap representations - at least if foo and bar are both values that can be held in an unsigned char, and if the result is assigned into an unsigned char.
It is very tempting to generalize from the prior two points that this is a non-issue, although I wouldn't call it a rigorous proof.
Applied to the Example
Here's why I think this is correct and will work as expected:
UCHAR_MAX >> 1 subjects UCHAR_MAX to "usual arithmetic conversions": By definition, UCHAR_MAX will fit into either an int or unsigned int, because on most systems int can represent all values of unsigned char, and on the few that don't, an unsigned int has to be able to represent all values of unsigned char, so that's just an "integer promotion" in this case.
Because bit shifts are defined in terms of values and not bitwise representations, UCHAR_MAX >> 1 is the quotient of UCHAR_MAX being divided by 2. (Let's call this result UCHAR_MAX_DIV_2).
UCHAR_MAX_DIV_2 + 1 subjects both arguments to usual arithmetic conversion: If UCHAR_MAX fit into an int, then the result is an int, otherwise, it is an unsigned int. Either way the conversions stop at integer promotion.
The result of UCHAR_MAX_DIV_2 + 1 is a positive value, which, when converted into an unsigned char, will have the most significant bit of the unsigned char set, and all other bits cleared (because the conversion would preserve the numerical value, and unsigned char is very strictly defined to have a pure binary representation without any padding bits or trap representations - but even without such an explicit requirement, the resulting value would have the most significant value bit set).
The (unsigned char) cast of MOST_SIGNIFICANT_BIT is actually redundant in this context - cast or no cast, it's going to be subject to the "usual arithmetic conversions" when bitwise-OR'ed. (but it might be useful in other contexts).
The above five steps will be constant-folded on pretty much every compiler out there - but a proper compiler should not constant-fold in a way which differs from the semantics of the code if it hadn't, so all of the above applies.
val | MOST_SIGNIFICANT_BIT is where it gets interesting: unlike << and >>, | and the other binary operators are defined in terms of manipulating the binary representations. Both val and MOST_SIGNIFICANT_BIT are subject to usual arithmetic conversions: details like the layout of bits or trap representations might mean a different binary representation, but should preserve the value: Given two variables of the same integer-type, holding non-negative, non-trap values, the value bits should "line up" correctly, so I actually expect that val | MOST_SIGNIFICANT_BIT produces the correct value (let's call this result VAL_WITH_MSB_SET). I don't see an explicit guarantee that this step couldn't produce a trap representation, but I don't believe there's an implementation of C where it would.
byte = VAL_WITH_MSB_SET forces a conversion: conversion of an integer type (so long as the value is a non-trap value) into a smaller, unsigned integer type is well defined: In this case, the value is reduced modulo UCHAR_MAX + 1. Since val is stated to be non-negative, the end result is that byte has the value of the remainder of VAL_WITH_MSB_SET divided by UCHAR_MAX + 1.
Explaining Where It Doesn't Work
If val were to be a negative value or non-integer type, we'd be out of luck because there is no longer a logical certainty that the bits that get binary-OR'ed will have the same "meaning":
If val is a signed integer type but has a negative value, then even though MOST_SIGNIFICANT_BIT is promoted to a compatible type, even though the value bits "line up", the result doesn't have any guaranteed meaning (because C makes no guarantee about how negative numbers are encoded), nor is the result (for any encoding) going to have the same meaning, especially once assigned into the unsigned char at the last step.
If val has a non-integer type, it's already violating the C standard, which constrains |, &, and ^ operators to operating on the "integer types". But if your compiler allowed it (or you did some tricks using unions, etc), then you have no guarantees about what each bit means, and thus the bits that you set are meaningless.
In many ways, this question boils down to whether a conforming implementation is allowed to take two non-trap, non-negative values as operands to the bitwise operators, and produce a trap representation result
This is covered by C11 section 6.2.6.2 (C99 is similar). There is a footnote that clarifies the intent of the more technical text:
Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types.
The bitwise operators are arithmetic operations as discussed here.
In this footnote, "trap representation" excludes the special case "negative zero". The negative zero may or may not cause UB, but it has its own text (in 6.2.6.2 also) separate from the trap representation text.
So your question can actually be answered for both signed and unsigned values; the only dangerous case is the "negative zero" possibility. (Which can't occur from non-negative input).
This seemingly trivial line is taken from the C book by Mike Banahan & Brady (Section 2.8.8.2).
I can understand how implicit promotion comes into play in expressions like c=a+b depending on the types of the operands, but I am unable to grasp how and in which case the same can figure in something like -b, where b is any legitimate operand. Can you explain it and then give a proper example?
Extracted text follows:
The usual arithmetic conversions are applied to both of the operands
of the binary forms of the operators. Only the integral promotions are
performed on the operands of the unary forms of the operators.
Update:
Lest it goes unnoticed, here I am adding what I asked based on OUAH's answer in a comment–
The book says 'Only the integral promotions are performed'...Does it mean that in an expression like x=-y, where 'x' is a long double and 'y' is a float, 'y' won't be promoted to long double if we explicitly use a unary operator? I know it would be, but asking it nevertheless to get a clearer idea about the "Only the integral promotions..." part.
Update:
Can you explain with example how promotion comes into play for the following bit-wise operators? For the last three, should I assume that whenever those are used on a variable, it is promoted to integer type first? And what exactly does the "usual arithmetic conversions" mean for the first three? Can you give a small example? I don't want to post it as a separate question if it can be settled here.
Take this example on a 32-bit system:
unsigned char a = 42;
printf("%zu\n", sizeof a); // prints 1
printf("%zu\n", sizeof +a); // prints 4, a has been promoted to int
For a unary arithmetic operator, the C standard says (in section 6.5.3.3) that
The integer promotions are performed on the operand, and the result has the promoted type.
It also defines the term, in section 6.3.1.1:
If an int can represent all values of the original type (as
restricted by the width, for a bit-field), the value is converted to
an int; otherwise, it is converted to an unsigned int.
These are called the integer promotions. All other types are
unchanged by the integer promotions.
(References are to the N1570 draft of the 2011 C standard.)
I believe the rationale for this is that implementations are not required to support any arithmetic operations on types narrower than int (one "word"). Operands narrower than an int are converted to int, or to unsigned int, before they're operated on.
For binary operators (those taking two operands), there's an additional requirement, that both operands must be of the same type. Typical CPUs might have instructions to add two 32-bit signed integers, or two 32-bit unsigned integers, or two 64-bit signed or unsigned integers, but none that will directly add, for example, a 32-bit signed integer and a 64-bit unsigned integer. To allow for this, we have the usual arithmetic conversions, described in section 6.3.1.8. These rules tell you, for example, what happens when you try to add an int to a double: the int operand is promoted, by conversion, to type double, and the addition adds the resulting two double operands.
The shift operators don't require the usual arithmetic conversions, since there's no particular need for both operands to be of the same type. The left operand is a value to be operated on; the right operand specifies the number of bits to shift it.
Does it mean that in an expression like x=-y, where x is a long double and y is a float, y won't be promoted to long double if we explicitly use a unary operator?
Assignment causes the right operand to be converted to the type of the left operand. The expression -y is evaluated independently of the context in which it appears (this is true for most expressions). So the unary - is applied to its operand, which is of type float (the integer promotions don't affect that), yielding a result of type float. The assignment causes that float value to be converted to long double before being assigned to x.
The title of your question asks how this can possibly happen. I'm not sure what that means. The conversion rules are specified in the language standard. Compilers follow those rules.
I'm not sure, but I think that every operand is promoted to a proper type: first the conversion is done, and then the operation. The -b operation changes the value of the result, so the promotion is done first and then the sign is flipped.
An expression like +b is also an operation, so it goes through the same promotion-then-operation process. I don't know whether an optimizer could skip this process in this particular case.
With binary operators, the operands are converted to the highest needed type, for example from int to long, float, or double.
double c=2+3.5;
But with the unary operators + and -, only the integer promotions are applied: types narrower than int, such as char and short, are promoted to int (or to unsigned int).
unsigned char a=255;
cout<<sizeof(a)<<endl; //prints 1
cout<<sizeof(+a)<<endl; //prints 4
cout<<sizeof(++a)<<endl; //prints 1
So the integral promotions are not applied to the result of the other unary operators ++a and a++.
Consider following example:
#include <stdio.h>
int main(void)
{
unsigned char a = 15; /* one byte */
unsigned short b = 15; /* two bytes */
unsigned int c = 15; /* four bytes */
long x = -a; /* eight bytes */
printf("%ld\n", x);
x = -b;
printf("%ld\n", x);
x = -c;
printf("%ld\n", x);
return 0;
}
To compile I am using GCC 4.4.7 (and it gave me no warnings):
gcc -g -std=c99 -pedantic-errors -Wall -W check.c
My result is:
-15
-15
4294967281
The question is why both unsigned char and unsigned short values are "propagated" correctly to (signed) long, while unsigned int is not. Is there any reference or rule on this?
Here are results from gdb (words are in little-endian order) accordingly:
(gdb) x/2w &x
0x7fffffffe168: 11111111111111111111111111110001 11111111111111111111111111111111
(gdb) x/2w &x
0x7fffffffe168: 11111111111111111111111111110001 00000000000000000000000000000000
This is due to how the integer promotions are applied to the operand and the requirement that the result of unary minus have the same type. This is covered in section 6.5.3.3 Unary arithmetic operators and says (emphasis mine going forward):
The result of the unary - operator is the negative of its (promoted) operand. The integer promotions are performed on the operand, and the result has the promoted type.
and integer promotion which is covered in the draft c99 standard section 6.3 Conversions and says:
if an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.
In the first two cases, the promotion will be to int and the result will be int. In the case of unsigned int no promotion is required but the result will require a conversion back to unsigned int.
The -15 is converted to unsigned int using the rules set out in section 6.3.1.3 Signed and unsigned integers which says:
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.
So we end up with -15 + (UMAX + 1), which results in UMAX - 14, a large unsigned value. This is why you will sometimes see code use -1 converted to an unsigned value to obtain the max unsigned value of a type, since it will always end up being -1 + UMAX + 1, which is UMAX.
int is special. Everything smaller than int gets promoted to int in arithmetic operations.
Thus -a and -b are applications of unary minus to int values of 15, which just work and produce -15. This value is then converted to long.
-c is different. c is not promoted to an int as it is not smaller than int. The result of unary minus applied to an unsigned int value of k is again an unsigned int, computed as 2^N-k (N is the number of bits).
Now this unsigned int value is converted to long normally.
This behavior is correct. Quotes are from C 9899:TC2.
6.5.3.3/3:
The result of the unary - operator is the negative of its (promoted) operand. The integer promotions are performed on the operand, and the result has the promoted type.
6.2.5/9:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
6.3.1.1/2:
The following may be used in an expression wherever an int or unsigned int may be used:
An object or expression with an integer type whose integer conversion rank is less than or equal to the rank of int and unsigned int.
A bit-field of type _Bool, int, signed int, or unsigned int.
If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.
So for long x = -a;, since the operand a, an unsigned char, has conversion rank less than the rank of int and unsigned int, and all unsigned char values can be represented as int (on your platform), we first promote to type int. The negative of that is simple: the int with value -15.
Same logic for unsigned short (on your platform).
The unsigned int c is not changed by promotion. So the value of -c is calculated using modular arithmetic, giving the result UINT_MAX-14.
C's integer promotion rules are what they are because standards-writers wanted to allow a wide variety of existing implementations that did different things, in some cases because they were created before there were "standards", to keep on doing what they were doing, while defining rules for new implementations that were more specific than "do whatever you feel like". Unfortunately, the rules as written make it extremely difficult to write code which doesn't depend upon a compiler's integer size. Even if future processors would be able to perform 64-bit operations faster than 32-bit ones, the rules dictated by the standards would cause a lot of code to break if int ever grew beyond 32 bits.
It would probably in retrospect have been better to have handled "weird" compilers by explicitly recognizing the existence of multiple dialects of C, and recommending that compilers implement a dialect that handles various things in consistent ways, but providing that they may also implement dialects which do them differently. Such an approach may end up ultimately being the only way that int can grow beyond 32 bits, but I've not heard of anyone even considering such a thing.
I think the root of the problem with unsigned integer types stems from the fact that they are sometimes used to represent numerical quantities, and are sometimes used to represent members of a wrapping abstract algebraic ring. Unsigned types behave in a manner consistent with an abstract algebraic ring in circumstances which do not involve type promotion. Applying a unary minus to a member of a ring should (and does) yield a member of that same ring which, when added to the original, will yield zero [i.e. the additive inverse]. There is exactly one way to map integer quantities to ring elements, but multiple ways exist to map ring elements back to integer quantities. Thus, adding a ring element to an integer quantity should yield an element of the same ring regardless of the size of the integer, and conversion from rings to integer quantities should require that code specify how the conversion should be performed. Unfortunately, C implicitly converts rings to integers in cases where either the size of the ring is smaller than the default integer type, or when an operation uses a ring member with an integer of a larger type.
The proper solution to this problem would be to allow code to specify that certain variables, return values, etc. should be regarded as ring types rather than numbers; an expression like -(ring16_t)2 should yield 65534 regardless of the size of int, rather than yielding 65534 on systems where int is 16 bits, and -2 on systems where it's larger. Likewise, (ring32)0xC0000001 * (ring32)0xC0000001 should yield (ring32)0x80000001 even if int happens to be 64 bits [note that if int is 64 bits, the compiler could legally do anything it likes if code tries to multiply two unsigned 32-bit values which equal 0xC0000001, since the result would be too large to represent in a 64-bit signed integer].
Negatives are tricky. Especially when it comes to unsigned values. If you look at the c-documentation, you'll notice that (contrary to what you'd expect) unsigned chars and shorts are promoted to signed ints for computing, while an unsigned int will be computed as an unsigned int.
When you compute -c, c stays an unsigned int, so the result is the unsigned int value UINT_MAX-14; it is then stored in x, a signed long wide enough to hold it, as that large positive number.
For clarification - No ACTUAL promotion is done when negating an unsigned. When you assign a negative to any type of int (or take a negative) the 2's complement of the number is instead used. Since the only practical difference between unsigned and signed values is that the MSB acts as a sign flag, it is taken as a very large positive number instead of a negative one.
I have a file that I've read into an array of data type signed char. I cannot change this fact.
I would now like to do this: !((c[i] & 0xc0) & 0x80) where c[i] is one of the signed characters.
Now, I know from section 6.5.10 of the C99 standard that "Each of the operands [of the bitwise AND] shall have integral type."
And Section 6.5 of the C99 specification tells me:
Some operators (the unary operator ~, and the binary operators <<, >>, &, ^, and |, collectively described as bitwise operators) shall have operands that have integral type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.
My question is two-fold:
Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?
Is there a list of these "implementation-defined aspects" anywhere (say for MSVC and GCC)?
Or you could take a different route and argue that this produces the same result for both signed and unsigned chars for any value of c[i].
Naturally, I will reward references to relevant standards or authoritative texts and discourage "informed" speculation.
As others point out, in all likelihood your implementation is based on two's complement, and will give exactly the result you expect.
However, if you're worried about the results of an operation involving a signed value, and all you care about is the bit pattern, simply cast directly to an equivalent unsigned type. The results are defined under the standard:
6.3.1.3 Signed and unsigned integers
...
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
This is essentially specifying that the result will be the two's complement representation of the value.
Fundamental to this is that in two's complement maths the result of a calculation is taken modulo some power of two (2^N, where N is the number of bits in the type), which in turn is exactly equivalent to masking off all but the low N bits. And the complement of a number is the number subtracted from that power of two.
Thus adding a negative value is the same as adding any value which differs from the value by a multiple of that power of two.
i.e:
(0 + signed_value) mod (2^N)
==
(2^N + signed_value) mod (2^N)
==
(7 * 2^N + signed_value) mod (2^N)
etc. (if you know modulo, that should be pretty self-evidently true)
So if you have a negative number, adding a power of two will make it positive (-5 + 256 = 251), but the bottom 'N' bits will be exactly the same (0b11111011) and it will not affect the outcome of a mathematical operation. As values are then truncated to fit the type, the result is exactly the binary value you expected, even if the result 'overflows' (i.e. what you might think happens if the number was positive to start with; this wrapping is also well-defined behaviour).
So in 8-bit two's complement:
-5 is the same as 251 (i.e 256 - 5) - 0b11111011
If you add 30 and 251, you get 281. But that's larger than 256, and 281 mod 256 equals 25. Exactly the same as 30 - 5.
251 * 2 = 502. 502 mod 256 = 246. 246 and -10 are both 0b11110110.
Likewise if you have:
unsigned int a;
int b;
a - b == a + (unsigned int) -b;
Under the hood, this cast is unlikely to be implemented with arithmetic and will almost certainly be a straight assignment from one register/value to another, or just optimised out altogether, as the maths does not make a distinction between signed and unsigned (interpretation of CPU flags is another matter, but that's an implementation detail). The standard exists to ensure that an implementation doesn't take it upon itself to do something strange instead, or, I suppose, to cover some weird architecture which isn't using two's complement...
unsigned char UC = *(unsigned char*)&C - this is how you can convert signed C to unsigned keeping the "bit pattern". Thus you could change your code to something like this:
!(( (*(unsigned char*)(c+i)) & 0xc0) & 0x80)
Explanation(with references):
761 When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.
1124 When applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1.
These two imply that the unsigned char pointer points to the same byte as the original signed char pointer.
You appear to have something similar to:
signed char c[] = "\x7F\x80\xBF\xC0\xC1\xFF";
for (int i = 0; c[i] != '\0'; i++)
{
if (!((c[i] & 0xC0) & 0x80))
...
}
You are (correctly) concerned about sign extension of the signed char type. In practice, however, (c[i] & 0xC0) will convert the signed character to a (signed) int, but the & 0xC0 will discard any set bits in the more significant bytes; the result of the expression will be in the range 0x00 .. 0xFF. This will, I believe, apply whether you use sign-and-magnitude, one's complement or two's complement binary values. The detailed bit pattern you get for a specific signed character value varies depending on the underlying representation; but the overall conclusion that the result will be in the range 0x00 .. 0xFF is valid.
There is an easy resolution for that concern — cast the value of c[i] to an unsigned char before using it:
if (!(((unsigned char)c[i] & 0xC0) & 0x80))
The value c[i] is converted to an unsigned char before it is promoted to an int (or, the compiler might promote to int, then coerce to unsigned char, then promote the unsigned char back to int), and the unsigned value is used in the & operations.
Of course, the code is now merely redundant. Using & 0xC0 followed by & 0x80 is entirely equivalent to just & 0x80.
If you're processing UTF-8 data and looking for continuation bytes, the correct test is:
if (((unsigned char)c[i] & 0xC0) == 0x80)
"Since I want to work with the original bit patterns from the file,
how can I convert/cast my signed char to unsigned char so that the bit
patterns remain unchanged?"
As someone already explained in a previous answer to your question on the same topic, any small integer type, be it signed or unsigned, will get promoted to the type int whenever used in an expression.
C11 6.3.1.1
"If an int can represent all values of the original type (as
restricted by the width, for a bit-field), the value is converted to
an int; otherwise, it is converted to an unsigned int. These are
called the integer promotions."
Also, as explained in the same answer, integer literals that fit in an int, such as 0xC0 and 0x80, have the type int.
Therefore, your expression will boil down to the pseudo code (int) & (int) & (int). The operations will be performed on three temporary int variables and the result will be of type int.
Now, if the original data contained bits that may be interpreted as sign bits for the specific signedness representation (in practice this will be two's complement on all systems), you will get problems, because these bits will be preserved upon promotion from signed char to int.
And then the bit-wise & operator performs an AND on every single bit regardless of the contents of its integer operand (C11 6.5.10/3), be it signed or not. If you had data in the sign bits of your original signed char, it will now be lost, because the integer literals (0xC0 or 0x80) have no bits set that correspond to the sign bits.
The solution is to prevent the sign bits from getting transferred to the "temporary int". One solution is to cast c[i] to unsigned char, which is completely well-defined (C11 6.3.1.3). This will tell the compiler that "the whole contents of this variable is an integer, there are no sign bits to be concerned about".
Better yet, make a habit of always using unsigned data in every form of bit manipulations. The purist, 100% safe, MISRA-C compliant way of re-writing your expression is this:
if ( (((uint8_t)c[i] & 0xc0u) & 0x80u) > 0u)
The u suffix already forces the expression to have type unsigned int, but it is good practice to always cast to the intended type. It tells the reader of the code "I actually know what I am doing and I also understand all weird implicit promotion rules in C".
And then if we know our hex, (0xc0 & 0x80) is pointless, it is always true. And x & 0xC0 & 0x80 is always the same as x & 0x80. Therefore simplify the expression to:
if ( ((uint8_t)c[i] & 0x80u) > 0u)
"Is there a list of these "implementation-defined aspects" anywhere"
Yes, the C standard conveniently lists them in Annex J.3. The only implementation-defined aspect you encounter in this case, though, is the signedness representation of integers, which in practice is always two's complement.
EDIT:
The quoted text in the question says that the various bit-wise operators produce implementation-defined results. Even in Annex J this is only briefly mentioned as implementation-defined, with no exact references. The actual chapter 6.5 doesn't say much regarding implementation-defined behavior of & | etc. The only operators where it is explicitly mentioned are << and >>, where left shifting a negative number is even undefined behavior, but right shifting it is implementation-defined.