Convert Raw 14 bit Two's Complement to Signed 16 bit Integer - c

I am doing some work in embedded C with an accelerometer that returns data as a 14 bit 2's complement number. I am storing this result directly into a uint16_t. Later in my code I am trying to convert this "raw" form of the data into a signed integer to represent / work with in the rest of my code.
I am having trouble getting the compiler to understand what I am trying to do. In the following code I'm checking if the 14th bit is set (meaning the number is negative) and then I want to invert the bits and add 1 to get the magnitude of the number.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range) {
int16_t raw_signed;
if(raw & _14BIT_SIGN_MASK) {
// Convert 14 bit 2's complement to 16 bit 2's complement
raw |= (1 << 15) | (1 << 14); // 2's complement extension
raw_signed = -(~raw + 1);
}
else {
raw_signed = raw;
}
uint16_t divisor;
if(range == FXLS8471QR1_FS_RANGE_2G) {
divisor = FS_DIV_2G;
}
else if(range == FXLS8471QR1_FS_RANGE_4G) {
divisor = FS_DIV_4G;
}
else {
divisor = FS_DIV_8G;
}
return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
This code unfortunately doesn't work. The disassembly shows me that for some reason the compiler is optimizing out my statement raw_signed = -(~raw + 1); How do I achieve the result I desire?
The math works out on paper, but I feel like for some reason the compiler is fighting with me :(.

Converting the 14 bit 2's complement value to 16 bit signed, while maintaining the value, is simply a matter of:
int16_t accel = (int16_t)(raw << 2) / 4 ;
The left-shift pushes the sign bit into the 16 bit sign bit position; the divide by four restores the magnitude but maintains its sign. The divide avoids the implementation-defined behaviour of a right-shift of a negative value, but will normally result in a single arithmetic-shift-right on instruction sets that allow it. The cast is necessary because raw << 2 is an int expression, and unless int is 16 bit, the divide will simply restore the original value.
It would be simpler, however, to just shift the accelerometer data left by two bits and treat it as if the sensor were 16 bit in the first place. Normalising everything to 16 bit has the benefit that the code needs no change if you use a sensor with any number of bits up to 16. The magnitude will simply be four times greater, and the least significant two bits will be zero; no information is gained or lost, and the scaling is arbitrary in any case.
int16_t accel = raw << 2 ;
In both cases, if you want the unsigned magnitude then that is simply:
int32_t mag = (int32_t)labs( (int)accel ) ;
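As a rough sketch of the shift-then-divide approach above: the helper name accel14_to_int16 is made up for illustration, and the out-of-range conversion to int16_t is assumed to wrap the way it does on the usual two's complement compilers.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend a 14-bit two's complement sample that
   sits in the low 14 bits of a uint16_t. */
static int16_t accel14_to_int16(uint16_t raw)
{
    /* Move the 14-bit sign bit up into bit 15; the cast to uint16_t
       discards anything shifted above bit 15. */
    uint16_t shifted = (uint16_t)((uint32_t)raw << 2);

    /* Converting a value above INT16_MAX to int16_t is implementation-defined;
       this sketch assumes the common behaviour of keeping the bit pattern. */
    int16_t scaled = (int16_t)shifted;

    /* Dividing by 4 undoes the shift while keeping the sign. */
    return (int16_t)(scaled / 4);
}
```

For instance, 0x3FFF (all fourteen bits set) comes back as -1 and 0x2000 as -8192, while 0x1FFF stays 8191.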

I would do simple arithmetic instead. The result is 14-bit signed, which is represented as a number from 0 to 2^14 - 1. Test if the number is 2^13 or above (signifying a negative) and then subtract 2^14.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range)
{
int16_t raw_signed = raw;
if(raw_signed >= 1 << 13) {
raw_signed -= 1 << 14;
}
uint16_t divisor;
if(range == FXLS8471QR1_FS_RANGE_2G) {
divisor = FS_DIV_2G;
}
else if(range == FXLS8471QR1_FS_RANGE_4G) {
divisor = FS_DIV_4G;
}
else {
divisor = FS_DIV_8G;
}
return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
Please check my arithmetic. (Do I have 13 and 14 correct?)
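A quick way to check the 13 and 14 is to pull the conversion out into a small function and try the boundary values. This is only a sketch; the name sign14_by_subtraction is invented here, and masking the input to 14 bits is an extra precaution that the code above does not take.

```c
#include <stdint.h>

/* Hypothetical helper: convert by comparing against 2^13 and subtracting 2^14. */
static int16_t sign14_by_subtraction(uint16_t raw)
{
    int16_t value = (int16_t)(raw & 0x3FFFu);  /* keep only the 14 data bits */

    if (value >= (1 << 13)) {   /* 2^13 and above encodes a negative number */
        value -= (1 << 14);     /* subtract 2^14 to recover the signed value */
    }
    return value;
}
```

With these constants 0x1FFF maps to 8191, 0x2000 to -8192 and 0x3FFF to -1, which is the expected 14-bit two's complement range.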

Supposing that int in your particular C implementation is 16 bits wide, the expression (1 << 15), which you use in mangling raw, produces undefined behavior. In that case, the compiler is free to generate code to do pretty much anything -- or nothing -- if the branch of the conditional is taken wherein that expression is evaluated.
Also if int is 16 bits wide, then the expression -(~raw + 1) and all intermediate values will have type unsigned int == uint16_t. This is a result of "the usual arithmetic conversions", given that (16-bit) int cannot represent all values of type uint16_t. The result will have the high bit set and therefore be outside the range representable by type int, so assigning it to an lvalue of type int produces implementation-defined behavior. You'd have to consult your documentation to determine whether the behavior it defines is what you expected and wanted.
If you instead perform a 14-bit sign conversion, forcing the higher-order bits off ((~raw + 1) & 0x3fff) then the result -- the inverse of the desired negative value -- is representable by a 16-bit signed int, so an explicit conversion to int16_t is well-defined and preserves the (positive) value. The result you want is the inverse of that, which you can obtain simply by negating it. Overall:
raw_signed = -(int16_t)((~raw + 1) & 0x3fff);
Of course, if int were wider than 16 bits in your environment then I see no reason why your original code would not work as expected. That would not invalidate the expression above, however, which produces consistently-defined behavior regardless of the size of default int.
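Wrapped into a function, the masked negate might look like the sketch below. The function name is hypothetical, 0x2000 stands in for _14BIT_SIGN_MASK (an assumption about its value), and the non-negative branch masks off the top bits as an extra precaution.

```c
#include <stdint.h>

/* Hypothetical helper built around the masked negate shown above. */
static int16_t sign14_by_negation(uint16_t raw)
{
    if (raw & 0x2000u) {  /* bit 13 is the 14-bit sign bit */
        /* Negating within 14 bits gives the positive magnitude (1..8192),
           which always fits in int16_t, so the conversion is well defined. */
        int16_t magnitude = (int16_t)((~raw + 1u) & 0x3FFFu);
        return (int16_t)-magnitude;
    }
    return (int16_t)(raw & 0x1FFFu);  /* non-negative: value bits only */
}
```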

Assuming when code reaches return ((int32_t)raw_signed ..., it has a value in the [-8192 ... +8191] range:
If RAW_SCALE_FACTOR is a multiple of 4 then a little savings can be had.
So rather than
int16_t raw_signed = raw << 2;
raw_signed >>= 2;
instead
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw,enum fxls8471qr1_fs_range range){
int16_t raw_signed = raw << 2;
uint16_t divisor;
...
// return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
return ((int32_t)raw_signed * (RAW_SCALE_FACTOR/4)) / divisor;
}

To convert the 14-bit two's-complement into a signed value, you can flip the sign bit and subtract the offset:
int16_t raw_signed = (raw ^ 1 << 13) - (1 << 13);
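To see why this works: flipping bit 13 turns the 14-bit pattern into an offset-binary value (the signed value plus 2^13), and subtracting 2^13 removes the offset. A minimal sketch, with a made-up function name and an added mask in case the upper two bits are not zero:

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend 14 bits by flipping the sign bit and
   subtracting its weight. */
static int16_t sign14_by_xor(uint16_t raw)
{
    const int32_t sign_bit = 1 << 13;
    int32_t value = (int32_t)(raw & 0x3FFFu);      /* 0 .. 16383 */

    /* After the XOR the value is (signed value + 8192), in 0 .. 16383. */
    return (int16_t)((value ^ sign_bit) - sign_bit);
}
```

Every intermediate value stays within the range of int16_t, so no implementation-defined conversion is involved.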

Related

How can I convert this number representation to a float?

I read this 16-bit value from a temperature sensor (type MCP9808)
Ignoring the first three MSBs, what's an easy way to convert the other bits to a float?
I managed to convert the values 2^7 through 2^0 to an integer with some bit-shifting:
uint16_t rawBits = readSensor();
int16_t value = (rawBits << 3) / 128;
However I can't think of an easy way to also include the bits with an exponent smaller than 0, except for manually checking if they're set and then adding 1/2, 1/4, 1/8 and 1/16 to the result respectively.
Something like this seems pretty reasonable. Take the number portion, divide by 16, and fix the sign.
float tempSensor(uint16_t value) {
bool negative = (value & 0x1000);
return (negative ? -1 : 1) * (value & 0x0FFF) / 16.0f;
}
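Another way to look at it: the low 12 bits are simply a count of sixteenths of a degree, so the fractional bits need no special handling. The sketch below takes the same reading of the sign bit as the answer above (sign plus magnitude); that is an assumption, and if the sensor encodes negative temperatures in two's complement the sign handling would have to change. The function name is invented.

```c
#include <stdint.h>

/* Hypothetical helper: treat bits 11..0 as a magnitude in 1/16 degree steps
   and bit 12 as the sign. Bits 15..13 (alert flags) are ignored. */
static float raw_to_celsius(uint16_t raw)
{
    int32_t sixteenths = (int32_t)(raw & 0x0FFFu);

    if (raw & 0x1000u) {
        sixteenths = -sixteenths;   /* assumption: sign-and-magnitude */
    }
    return (float)sixteenths / 16.0f;
}
```

For example, 0x0191 is 401 sixteenths, which gives 25.0625.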
float convert(unsigned char msb, unsigned char lsb)
{
return ((lsb | ((msb & 0x0f) << 8)) * ((msb & 0x10) ? -1 : 1)) / 16.0f;
}
or
float convert(uint16_t val)
{
return (((val & 0x1000) ? -1 : 1) * (int32_t)(uint16_t)(val << 4)) / 256.0f; // the uint16_t cast drops the flag and sign bits
}
If performance isn't a super big deal, I would go for something less clever and more explicit, along the lines of:
bool is_bit_set(uint16_t value, uint16_t bit) {
uint16_t mask = 1 << bit;
return (value & mask) == mask;
}
float parse_temperature(uint16_t raw_reading) {
if (is_bit_set(raw_reading, 15)) { /* temp is above Tcrit. Do something about it. */ }
if (is_bit_set(raw_reading, 14)) { /* temp is above Tupper. Do something about it. */ }
if (is_bit_set(raw_reading, 13)) { /* temp is above Tlower. Do something about it. */ }
uint16_t whole_degrees = (raw_reading & 0x0FF0) >> 4;
float magnitude = (float) whole_degrees;
if (is_bit_set(raw_reading, 0)) magnitude += 1.0f/16.0f;
if (is_bit_set(raw_reading, 1)) magnitude += 1.0f/8.0f;
if (is_bit_set(raw_reading, 2)) magnitude += 1.0f/4.0f;
if (is_bit_set(raw_reading, 3)) magnitude += 1.0f/2.0f;
bool is_negative = is_bit_set(raw_reading, 12);
// TODO: What do the 3 most significant bits do?
return magnitude * (is_negative ? -1.0 : 1.0);
}
Honestly this is a lot of simple constant math, I'd be surprised if the compiler wasn't able to heavily optimize it. That would need confirmation, of course.
If your C compiler has a clz builtin or equivalent, it could be useful to avoid the mul operation.
In your case, as the provided temp value looks like a mantissa, and if your C compiler uses the IEEE-754 float representation, translating the temp value into its IEEE-754 equivalent may be the most efficient way:
Update: Compact the code a little and more clear explanation about the mantissa.
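#include <stdint.h>
#include <string.h>
float convert(uint16_t val) {
uint16_t mantissa = (uint16_t)(val <<4);
if (mantissa==0) return 0.0f;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
// Build the pattern in unsigned 32-bit arithmetic so that setting bit 31 is well defined
uint32_t r = ((uint32_t)(val & 0x1000) << 19) | ((uint32_t)(0x86 - e) << 23) | (((uint32_t)mantissa << (e+8)) & 0x07FFFFF);
float f;
memcpy(&f, &r, sizeof f); // copy the bits; casting the pointer would break strict aliasing
return f;
}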
or
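float convert(unsigned char msb, unsigned char lsb) {
uint16_t mantissa = (uint16_t)((msb<<8 | lsb) <<4);
if (mantissa==0) return 0.0f;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
// Build the pattern in unsigned 32-bit arithmetic so that setting bit 31 is well defined
uint32_t r = ((uint32_t)(msb & 0x10) << 27) | ((uint32_t)(0x86 - e) << 23) | (((uint32_t)mantissa << (e+8)) & 0x07FFFFF);
float f;
memcpy(&f, &r, sizeof f); // needs <string.h>; avoids the strict-aliasing problem of a pointer cast
return f;
}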
Explanation:
We use the fact that the temp value is effectively a mantissa in the range -255 to 255.
We can then consider that its IEEE-754 exponent will be 128 at max to -128 at min.
We use the clz builtin to get the "order" of the first bit set in the mantissa;
this way we can define the exponent as the theoretical max (2^7 => 128) less this "order".
We also use this order to left shift the temp value to get the IEEE-754 mantissa,
plus one more left shift to remove the implied '1' part of the significand for IEEE-754.
Thus we build a 32-bit binary IEEE-754 representation from the temp value with:
First, the sign bit at the 32nd bit of our binary IEEE-754 representation.
The computed exponent as the theoretical max 7 (2^7 => 128) plus the IEEE-754 bias (127) minus the actual "order" of the temp value.
The "order" of the temp value is deduced from the number of leading '0' bits of its 12-bit representation in the variable mantissa through the clz builtin.
Beware that here we consider that the clz builtin expects a 32-bit value as parameter, which is why we subtract 16 here. This code may require adaptation if your clz expects anything else.
The number of leading '0' bits can go from 0 (temp value of 128 or above, or -128 or below) to 11, as we directly return 0.0 for a zero temp value.
As the bit following the "order" is then 1 in the temp value, it is equivalent to a power of 2 reduction from the theoretical max 7.
Thus, with 7 + 127 => 0x86, we can simply subtract the "order" from that, as the number of leading '0' bits lets us deduce the 'first' base exponent for IEEE-754.
If the "order" is greater than 7 we will still get the negative exponent required for values less than 1.
We then add this 8-bit exponent to our binary IEEE-754 representation from the 24th bit to the 31st bit.
The temp value is effectively already a mantissa; we suppress the leading '0' bits and its first set bit by shifting it to the left (e + 1) while also shifting left by 7 bits to place the mantissa in the 32 bits (e+7+1 => e+8). We then mask only the desired 23 bits (AND &0x7FFFFF).
Its first set bit must be removed as it is the implied '1' of the significand in IEEE-754 (the power of 2 of the exponent).
We then have the IEEE-754 mantissa and place it in the low 23 bits (the 1st to the 23rd bit) of our binary IEEE-754 representation.
The 4 initial trailing 0 bits from our 16-bit temp value and the added seven 'right' 0 bits from the shifting won't change the effective IEEE-754 value.
As we start from a 32-bit value and use the or operator (|) on a 32-bit exponent and mantissa, we then have the final IEEE-754 representation.
We can then return this binary representation as an IEEE-754 C float value.
Due to the required clz and the IEEE-754 translation, this way is less portable. The main interest is to avoid MUL operations in the resulting machine code for performance on architectures with a "poor" FPU.
P.S.: Casts explanation. I've added explicit casts to let the C compiler know that we deliberately discard some bits:
uint16_t mantissa = (uint16_t)(val <<4); : The cast here tells the compiler that we know we'll "lose" the four leftmost bits, as that is the goal here. We discard the first four bits of the temp value for the mantissa.
(unsigned char)(__builtin_clz(mantissa) - 16) : We tell the C compiler that we will only consider an 8-bit range for the builtin return, as we know our mantissa has only 12 significant bits and thus an output range from 0 to 12. Thus we do not need the full int return.
uint32_t r = (uint32_t) ... : We tell the C compiler not to bother with the sign representation here as we build an IEEE-754 representation.
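To make the steps above concrete, here is a sketch that follows the same construction but counts the leading zeros with a plain loop instead of a clz builtin, so it does not depend on any particular compiler. The function name is invented, and the code assumes that float is a 32-bit IEEE-754 type stored with the same byte order as uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build the IEEE-754 bit pattern by hand. */
static float temp_bits_to_float(uint16_t val)
{
    /* Drop the three flag bits and the sign bit; the 12 value bits now
       occupy the top of a 16-bit word. */
    uint16_t mantissa = (uint16_t)((uint32_t)val << 4);
    if (mantissa == 0) {
        return 0.0f;
    }

    /* Count leading zeros within the 16-bit word (0..11). */
    unsigned e = 0;
    while ((((uint32_t)mantissa << e) & 0x8000u) == 0u) {
        e++;
    }

    uint32_t sign     = ((uint32_t)val & 0x1000u) << 19;   /* bit 12 -> bit 31 */
    uint32_t exponent = (uint32_t)(0x86u - e) << 23;       /* 7 + 127 - e */
    /* Shift the leading 1 up to bit 23, then mask it away (implied bit). */
    uint32_t fraction = ((uint32_t)mantissa << (e + 8)) & 0x007FFFFFu;

    uint32_t bits = sign | exponent | fraction;
    float result;
    memcpy(&result, &bits, sizeof result);   /* reinterpret without aliasing issues */
    return result;
}
```

Working through val = 0x0191 by hand: mantissa is 0x1910, e is 3, the exponent field is 0x83, the fraction is 0x488000, and the assembled pattern 0x41C88000 is 25.0625.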

Cast Integer to Float using Bit Manipulation breaks on some integers in C

Working on a class assignment, I'm trying to cast an integer to a float only using bit manipulations (limited to any integer/unsigned operations incl. ||, &&. also if, while). My code is working for most values, but some values are not generating the results I'm looking for.
For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something with my mantissa and rounding, but can't quite pin it down. I've looked at some other threads on SO as well (converting-int-to-float and how-to-manually).
unsigned dl22(int x) {
int tmin = 0x1 << 31;
int tmax = ~tmin;
unsigned signBit = 0;
unsigned exponent;
unsigned mantissa;
int bias = 127;
if (x == 0) {
return 0;
}
if (x == tmin) {
return 0xcf << 24;
}
if (x < 0) {
signBit = x & tmin;
x = (~x + 1);
}
exponent = bias + 31;
while ( ( x & tmin) == 0 ) {
exponent--;
x <<= 1;
}
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8) & mantissaMask;
return (signBit | exponent | mantissa);
}
EDIT/UPDATE
Found a viable solution - see below
Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.
Nevertheless, if we assume (unsafely)
that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
that int and unsigned int are at least 32 bits wide,
then your code seems correct, modulo rounding.
In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 2^24), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.
Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.
About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.
I didn't originally mention that I am trying to imitate a U2F union statement:
float u2f(unsigned u) {
union {
unsigned u;
float f;
} a;
a.u = u;
return a.f;
}
Thanks to guidance provided in the post ieee-754-bit-manipulation-rounding-error, I was able to manage the rounding issues by putting the following after my while statement. This clarified the rounding that was occurring.
lsb = (x >> 8) & 1;
roundBit = (x >> 7) & 1;
stickyBitFlag = !!(x & 0x7F);
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;
roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);
return (signBit | exponent | mantissa) + roundBit;
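Putting the pieces together, a complete round-to-nearest-even conversion could look roughly like the sketch below. It is not the assignment's reference solution: the function name is made up, it uses fixed-width types from <stdint.h> rather than int and unsigned, and it does all shifting on unsigned values to stay clear of the undefined behaviour discussed above.

```c
#include <stdint.h>

/* Hypothetical helper: bit pattern of (float)x with round-to-nearest-even. */
static uint32_t int_to_float_bits(int32_t x)
{
    if (x == 0) {
        return 0;
    }

    uint32_t sign = 0;
    uint32_t mag = (uint32_t)x;
    if (x < 0) {
        sign = 0x80000000u;
        mag = 0u - mag;              /* unsigned negate, also fine for INT32_MIN */
    }

    /* Normalise: shift until the leading 1 sits in bit 31. */
    uint32_t exponent = 127u + 31u;
    while ((mag & 0x80000000u) == 0u) {
        mag <<= 1;
        exponent--;
    }

    uint32_t lsb       = (mag >> 8) & 1u;       /* lowest bit that is kept */
    uint32_t round_bit = (mag >> 7) & 1u;       /* first bit that is dropped */
    uint32_t sticky    = (mag & 0x7Fu) != 0u;   /* any lower dropped bit set */
    uint32_t fraction  = (mag >> 8) & 0x007FFFFFu;

    /* Round up when above the halfway point, or exactly halfway and odd. */
    uint32_t round_up = round_bit & (sticky | lsb);

    /* A carry out of the fraction correctly bumps the exponent field. */
    return (sign | (exponent << 23) | fraction) + round_up;
}
```

For the failing input from the question, 0x807fffff, this returns 0xceff0000.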

C Bit-Level Int to Float Conversion Unexpected Output

Background:
I am playing around with bit-level coding (this is not homework - just curious). I found a lot of good material online and in a book called Hacker's Delight, but I am having trouble with one of the online problems.
It asks to convert an integer to a float. I used the following links as reference to work through the problem:
How to manually (bitwise) perform (float)x?
How to convert an unsigned int to a float?
http://locklessinc.com/articles/i2f/
Problem and Question:
I thought I understood the process well enough (I tried to document the process in the comments), but when I test it, I don't understand the output.
Test Cases:
float_i2f(2) returns 1073741824
float_i2f(3) returns 1077936128
I expected to see something like 2.0000 and 3.0000.
Did I mess up the conversion somewhere? I thought maybe this was a memory address, so I was thinking maybe I missed something in the conversion step needed to access the actual number? Or maybe I am printing it incorrectly? I am printing my output like this:
printf("Float_i2f ( %d ): ", 3);
printf("%u", float_i2f(3));
printf("\n");
But I thought that printing method was fine for unsigned values in C (I'm used to programming in Java).
Thanks for any advice.
Code:
/*
* float_i2f - Return bit-level equivalent of expression (float) x
* Result is returned as unsigned int, but
* it is to be interpreted as the bit-level representation of a
* single-precision floating point values.
* Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
* Max ops: 30
* Rating: 4
*/
unsigned float_i2f(int x) {
if (x == 0){
return 0;
}
//save the sign bit for later and get the absolute value of x
//the absolute value is needed to shift bits to put them
//into the appropriate position for the float
unsigned int signBit = 0;
unsigned int absVal = (unsigned int)x;
if (x < 0){
signBit = 0x80000000;
absVal = (unsigned int)-x;
}
//Calculate the exponent
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
unsigned int exponent = 158; //158 possibly because of place in byte range
while ((absVal & 0x80000000) == 0){//this checks for 0 or 1. when it reaches 1, the loop breaks
exponent--;
absVal <<= 1;
}
//find the mantissa (bit shift to the right)
unsigned int mantissa = absVal >> 8;
//place the exponent bits in the right place
exponent = exponent << 23;
//get the mantissa
mantissa = mantissa & 0x7fffff;
//return the reconstructed float
return signBit | exponent | mantissa;
}
Continuing from the comment. Your code is correct, and you are simply looking at the equivalent unsigned integer made up by the bits in your IEEE-754 single-precision floating point number. The IEEE-754 single-precision number format (made up of the sign, extended exponent, and mantissa), can be interpreted as a float, or those same bits can be interpreted as an unsigned integer (just the number that is made up by the 32-bits). You are outputting the unsigned equivalent for the floating point number.
You can confirm with a simple union. For example:
#include <stdio.h>
#include <stdint.h>
typedef union {
uint32_t u;
float f;
} u2f;
int main (void) {
u2f tmp = { .f = 2.0 };
printf ("\n u : %u\n f : %f\n", tmp.u, tmp.f);
return 0;
}
Example Usage/Output
$ ./bin/unionuf
u : 1073741824
f : 2.000000
Let me know if you have any further questions. It's good to see that your study resulted in the correct floating point conversion. (also note the second comment regarding truncation/rounding)
I'll just chime in here, because nothing specifically about endianness has been addressed. So let's talk about it.
The construction of the value in the original question was endianness-agnostic, using shifts and other bitwise operations. This means that regardless of whether your system is big- or little-endian, the actual value will be the same. The difference will be its byte order in memory.
IEEE-754 itself does not specify a byte order, so there is no formal requirement on implementations. In practice, nearly every platform stores floats with the same byte order as integers of the same size.
So, you can use this approach combined with a union if and only if you know that the endianness of floats and integers on your system is the same.
On the common Intel-based architectures this is the case: both integers and floats are little-endian, so the union (or memcpy) works directly. Only on an unusual platform where the two differ would you need to rearrange the bytes first. For example, if floats were stored big-endian regardless of the integer byte order, you could repack the bytes explicitly:
uint32_t n = float_i2f( input_val );
uint8_t bytes[4] = {
(uint8_t)((n >> 24) & 0xff),
(uint8_t)((n >> 16) & 0xff),
(uint8_t)((n >> 8) & 0xff),
(uint8_t)(n & 0xff)
};
float fval;
memcpy( &fval, bytes, sizeof(float) );
I'll stress that you only need to worry about this if you are trying to reinterpret your integer representation as a float or the other way round.
If you're only trying to output what the representation is in bits, then you don't need to worry. You can just display your integer in a useful form such as hex:
printf( "0x%08x\n", n );
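If in doubt about a particular platform, one way to check whether the bits can be reinterpreted directly is to copy a known float into an integer and compare it with the expected pattern. This is a sketch with made-up function names; it assumes float is 32 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical check: returns 1 when a float and a uint32_t share the same
   byte order, i.e. the bits of 2.0f read back as 0x40000000. */
static int float_shares_integer_byte_order(void)
{
    const float two = 2.0f;
    uint32_t bits;
    memcpy(&bits, &two, sizeof bits);
    return bits == 0x40000000u;
}

/* Hypothetical helper: reinterpret a bit pattern as a float. Only valid
   when the check above succeeds. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

When the check passes, bits_to_float(1073741824u) gives 2.0 and bits_to_float(1077936128u) gives 3.0, matching the two test cases in the question.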

How to find the nth bit of an integer in C

I've got an assignment where I need to convert from an 8 bit sign magnitude number to two's complement and then add those two numbers. I've got a relatively good idea as to how to do this, however I can't work out how to find the eighth bit of an integer such that I can tell what sign the number has.
The overall idea is that should the sign bit be 0 just return the number as it is already in two's complement if it is a one though then I want to set it to 0 before inverting all bits with the ~ operator and then add 1.
Thanks in advance
You can check if the high bit is set by creating a mask that has just that bit set and using a bitwise AND to see if the result is non-zero.
Once you know the high bit is set, you can convert to two's complement by clearing the sign bit, flipping all bits and adding one.
uint8_t x = (some value)
if (x & (1 << 7)) {
printf("sign bit set\n");
x = (uint8_t)((~(x & (0x7F))) & 0xFF) + 1;
printf("converted value: %02X\n", x);
}
Then you can add this number to any other normally.
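As a self-contained sketch of the same idea (the function name is invented for illustration), the conversion can be kept entirely in uint8_t so the result is the raw two's complement bit pattern:

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit sign-and-magnitude in, 8-bit two's complement
   bit pattern out. */
static uint8_t sign_magnitude_to_twos(uint8_t sm)
{
    if (sm & 0x80u) {                       /* sign bit set: negative */
        uint8_t magnitude = sm & 0x7Fu;     /* clear the sign bit */
        /* Flip the bits and add one; the cast keeps only the low 8 bits. */
        return (uint8_t)(~magnitude + 1u);
    }
    return sm;                              /* already valid two's complement */
}
```

For example, 0x85 (sign bit plus magnitude 5) becomes 0xFB, which is -5, and adding 0x03 to it with 8-bit wraparound yields 0xFE, which is -2. Negative zero (0x80) maps to 0x00.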
This answer assumes that your computer/compiler uses two's complement (almost certainly the case) and that you want the result to be in two's complement.
Use an uint8_t to hold the sign and magnitude number.
To check if a bit is set, use the bitwise AND operator &, together with a bit mask corresponding to the msb. To get a bit mask corresponding to bit n, left shift the value 1 n times. In C code:
#define SIGN (1 << 7)
uint8_t sm = ...;
if(sm & SIGN) // if non-zero, then the SIGN bit is set
{
}
else // it was zero, the SIGN bit is not set
{
}
To do the actual conversion, there are several ways. I simply would mask out and copy the relevant parts of the number, again with bitwise AND:
#define MAGNITUDE 0x7F
int8_t magnitude = sm & MAGNITUDE; // variable magnitude is two's compl.
EDIT complete solution (since someone already posted one):
#define SIGN (1 << 7)
#define MAGNITUDE 0x7F
uint8_t sm = ...;
int8_t twos_compl = sm & MAGNITUDE;
if(sm & SIGN) // if non-zero, then the SIGN bit is set
{
twos_compl = -twos_compl;
}
int8_t x = ...; // some other number in two's complement
int16_t result = twos_compl + x;
As a side note, be very careful when mixing the ~ operator with small integer types, because it performs an implicit integer promotion. For example uint8_t x = 1 and then ~x gives you 0xFFFFFFFE (32 bit system) and not 0xFE as you might expect.
For the above task, there is no need to use ~ at all.
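The promotion is easy to demonstrate with two tiny functions. This is just an illustrative sketch with invented names; the exact width of the promoted value depends on the size of int on the platform.

```c
#include <stdint.h>

/* The operand is promoted to int before ~ is applied, so the result is a
   (negative) int, not an 8-bit value. */
static int not_promoted(uint8_t x)
{
    return ~x;
}

/* Casting the result back to uint8_t keeps only the low 8 bits. */
static uint8_t not_narrowed(uint8_t x)
{
    return (uint8_t)~x;
}
```

With x equal to 1, not_promoted returns -2 (bit pattern 0xFFFFFFFE when int is 32 bits wide), while not_narrowed returns 0xFE.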

How to sign extend a 9-bit value when converting from an 8-bit value?

I'm implementing a relative branching function in my simple VM.
Basically, I'm given an 8-bit relative value. I then shift this left by 1 bit to make it a 9-bit value. So, for instance, if you were to say "branch +127" this would really mean, 127 instructions, and thus would add 256 to the IP.
My current code looks like this:
uint8_t argument = 0xFF; //-1 or whatever
int16_t difference = argument << 1;
*ip += difference; //ip is a uint16_t
I don't believe difference will ever be detected as less than 0 with this, however. I'm rusty on how signed to unsigned works. Beyond that, I'm not sure the difference would be correctly subtracted from IP in the case argument is, say, -1 or -2 or something.
Basically, I'm wanting something that would satisfy these "tests"
//case 1
argument = -5
difference -> -10
ip = 20 -> 10 //ip starts at 20, but becomes 10 after applying difference
//case 2
argument = 127 (must fit in a byte)
difference -> 254
ip = 20 -> 274
Hopefully that makes it a bit more clear.
Anyway, how would I do this cheaply? I saw one "solution" to a similar problem, but it involved division. I'm working with slow embedded processors (assumed to be without efficient ways to multiply and divide), so that's a pretty big thing I'd like to avoid.
To clarify: you worry that left shifting a negative 8 bit number will make it appear like a positive nine bit number? Just pad the top 9 bits with the sign bit of the initial number before left shift:
diff = 0xFF;
int16 diff16=(diff + (diff & 0x80)*0x01FE) << 1;
Now your diff16 is signed 2*diff
As was pointed out by Richard J Ross III, you can avoid the multiplication (if that's expensive on your platform) with a conditional branch:
int16 diff16 = (diff + ((diff & 0x80)?0xFF00:0))<<1;
If you are worried about things staying in range and such ("undefined behavior"), you can do
int16 diff16 = diff;
diff16 = (diff16 | ((diff16 & 0x80)?0x7F00:0))<<1;
At no point does this produce numbers that are going out of range.
The cleanest solution, though, seems to be "cast and shift":
diff16 = (signed char)diff; // recognizes and preserves the sign of diff
diff16 = (short int)((unsigned short)diff16 << 1); // left shift on the unsigned value, then cast back to preserve the sign
This produces the expected result, because the compiler automatically takes care of the sign bit (so no need for the mask) in the first line; and in the second line, it does a left shift on an unsigned int (for which overflow is well defined per the standard); the final cast back to short int ensures that the number is correctly interpreted as negative. I believe that in this form the construct is never "undefined".
All of my quotes come from the C standard, section 6.3.1.3. Unsigned to signed is well defined when the value is within range of the signed type:
1 When a value with integer type is converted to another integer type
other than _Bool, if the value can be represented by the new type, it
is unchanged.
Signed to unsigned is well defined:
2 Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.
Unsigned to signed, when the value lies out of range isn't too well defined:
3 Otherwise, the new type is signed and the value cannot be
represented in it; either the result is implementation-defined or an
implementation-defined signal is raised.
Unfortunately, your question lies in the realm of point 3. C doesn't guarantee any implicit mechanism to convert out-of-range values, so you'll need to explicitly provide one. The first step is to decide which representation you intend to use: ones' complement, two's complement or sign and magnitude.
The representation you use will affect the translation algorithm you use. In the example below, I'll use two's complement: If the sign bit is 1 and the value bits are all 0, this corresponds to your lowest value. Your lowest value is another choice you must make: In the case of two's complement, it'd make sense to use either of INT16_MIN (-32768) or INT8_MIN (-128). In the case of the other two, it'd make sense to use INT16_MIN + 1 or INT8_MIN + 1 due to the presence of negative zeros, which should probably be translated to be indistinguishable from regular zeros. In this example, I'll use INT8_MIN, since it makes sense that (uint8_t) -1 should translate to -1 as an int16_t.
Separate the sign bit from the value bits. The value should be the absolute value, except in the case of a two's complement minimum value when sign will be 1 and the value will be 0. Of course, the sign bit can be wherever you like it to be, though it's conventional for it to rest at the far left hand side. Hence, shifting right 7 places obtains the conventional "sign" bit:
uint8_t sign = input >> 7;
uint8_t value = input & (UINT8_MAX >> 1);
int16_t result;
If the sign bit is 1, we'll call this a negative number and add to INT8_MIN to construct the sign so we don't end up in the same conundrum we started with, or worse: undefined behaviour (which is the fate of one of the other answers).
if (sign == 1) {
result = INT8_MIN + value;
}
else {
result = value;
}
This can be shortened to:
int16_t result = (input >> 7) ? INT8_MIN + (input & (UINT8_MAX >> 1)) : input;
... or, better yet:
int16_t result = input <= INT8_MAX ? input
: INT8_MIN + (int8_t)(input % (uint8_t) INT8_MIN);
The sign test now involves checking if it's in the positive range. If it is, the value remains unchanged. Otherwise, we use addition and modulo to produce the correct negative value. This is fairly consistent with the C standard's language above. It works well for two's complement, because int16_t and int8_t are guaranteed to use a two's complement representation internally. However, types like int aren't required to use a two's complement representation internally. When converting unsigned int to int for example, there needs to be another check, so that we're treating values less than or equal to INT_MAX as positive, and values greater than or equal to (unsigned int) INT_MIN as negative. Any other values need to be handled as errors; In this case I treat them as zeros.
/* Generate some random input */
srand(time(NULL));
unsigned int input = rand();
for (unsigned int x = UINT_MAX / ((unsigned int) RAND_MAX + 1); x > 1; x--) {
input *= (unsigned int) RAND_MAX + 1;
input += rand();
}
int result = /* Handle positives: */ input <= INT_MAX ? input
: /* Handle negatives: */ input >= (unsigned int) INT_MIN ? INT_MIN + (int)(input % (unsigned int) INT_MIN)
: /* Handle errors: */ 0;
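For the 8-bit case in the question, the same idea fits in a small function. The sketch below uses a mask where the code above uses a modulo (the two are equivalent for a divisor of 128), and the function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: interpret an 8-bit two's complement pattern as a
   signed value without relying on implementation-defined conversions. */
static int16_t u8_to_signed(uint8_t input)
{
    if (input <= INT8_MAX) {
        return (int16_t)input;                    /* 0 .. 127 unchanged */
    }
    /* Sign bit set: start from -128 and add the seven value bits. */
    return (int16_t)(INT8_MIN + (input & 0x7F));
}
```

So 0xFF gives -1, 0xFB gives -5 and 0x80 gives -128.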
If the offset is in the 2's complement representation, then
convert this
uint8_t argument = 0xFF; //-1
int16_t difference = argument << 1;
*ip += difference;
into this:
uint8_t argument = 0xFF; //-1
int8_t signed_argument;
signed_argument = argument; // this relies on implementation-defined
// conversion of unsigned to signed, usually it's
// just a bit-wise copy on 2's complement systems
// OR
// memcpy(&signed_argument, &argument, sizeof argument);
*ip += signed_argument + signed_argument;
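The two test cases from the question can be checked with a sketch like the following. Note that it swaps in a different sign-extension step from the one above: flipping the sign bit and subtracting 128, which avoids the implementation-defined conversion altogether. Both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the 8-bit argument and double it. */
static int16_t branch_offset(uint8_t argument)
{
    /* (argument ^ 0x80) - 0x80 maps 0x00..0x7F to 0..127 and
       0x80..0xFF to -128..-1 using only in-range arithmetic. */
    int16_t offset = (int16_t)((argument ^ 0x80) - 0x80);
    return (int16_t)(offset + offset);    /* addition instead of multiply */
}

/* Hypothetical helper: apply the offset to a 16-bit instruction pointer,
   wrapping modulo 65536. */
static uint16_t apply_branch(uint16_t ip, uint8_t argument)
{
    return (uint16_t)(ip + branch_offset(argument));
}
```

With ip starting at 20, an argument of -5 (0xFB) gives an offset of -10 and a new ip of 10, and an argument of 127 gives an offset of 254 and a new ip of 274.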

Resources