Cast Integer to Float using Bit Manipulation breaks on some integers in C

Working on a class assignment, I'm trying to cast an integer to a float only using bit manipulations (limited to any integer/unsigned operations incl. ||, &&. also if, while). My code is working for most values, but some values are not generating the results I'm looking for.
For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something with my mantissa and rounding, but can't quite pin it down. I've looked at some other threads on SO as well, including converting-int-to-float and how-to-manually
unsigned dl22(int x) {
    int tmin = 0x1 << 31;
    int tmax = ~tmin;
    unsigned signBit = 0;
    unsigned exponent;
    unsigned mantissa;
    int bias = 127;
    if (x == 0) {
        return 0;
    }
    if (x == tmin) {
        return 0xcf << 24;
    }
    if (x < 0) {
        signBit = x & tmin;
        x = (~x + 1);
    }
    exponent = bias + 31;
    while ((x & tmin) == 0) {
        exponent--;
        x <<= 1;
    }
    exponent <<= 23;
    int mantissaMask = ~(tmin >> 8);
    mantissa = (x >> 8) & mantissaMask;
    return (signBit | exponent | mantissa);
}
EDIT/UPDATE
Found a viable solution - see below

Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.
Nevertheless, if we assume (unsafely)
that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
that int and unsigned int are at least 32 bits wide,
then your code seems correct, modulo rounding.
In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 2^24), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.
Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.
About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.
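If you want to see exactly which inputs are affected, one quick way is to compare your function against the compiler's own (float) conversion, bit for bit. The harness below is only a sketch: it assumes a 32-bit unsigned int and IEEE-754 floats, and the sample inputs are arbitrary.
#include <stdio.h>
#include <string.h>

unsigned dl22(int x);   /* the function above */

int main(void) {
    unsigned tests[] = { 0x807fffffu, 0x80000001u, 0x00ffffffu, 0x01000001u };
    unsigned i;
    for (i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        int x = (int)tests[i];          /* implementation-defined for values > INT_MAX */
        float f = (float)x;             /* the conversion being imitated */
        unsigned expect;
        memcpy(&expect, &f, sizeof expect);
        unsigned got = dl22(x);
        printf("%#010x: expect %#010x, got %#010x%s\n",
               tests[i], expect, got, expect == got ? "" : "  <-- differs");
    }
    return 0;
}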

I didn't originally mention that I am trying to imitate a U2F union statement:
float u2f(unsigned u) {
    union {
        unsigned u;
        float f;
    } a;
    a.u = u;
    return a.f;
}
Thanks to guidance provided in the post ieee-754-bit-manipulation-rounding-error I was able to manage the rounding issues by putting the following after my while statement. This clarified the rounding that was occurring.
unsigned lsb = (x >> 8) & 1;            /* lowest bit that will be kept */
unsigned roundBit = (x >> 7) & 1;       /* first bit that will be shifted off */
unsigned stickyBitFlag = !!(x & 0x7F);  /* nonzero if any lower discarded bit is set */
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;
/* round to nearest, ties to even; a carry out of the mantissa correctly bumps the exponent */
roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);
return (signBit | exponent | mantissa) + roundBit;

Related

How can I convert this number representation to a float?

I read this 16-bit value from a temperature sensor (type MCP9808)
Ignoring the first three MSBs, what's an easy way to convert the other bits to a float?
I managed to convert the values 2^7 through 2^0 to an integer with some bit-shifting:
uint16_t rawBits = readSensor();
int16_t value = (rawBits << 3) / 128;
However I can't think of an easy way to also include the bits with an exponent smaller than 0, except for manually checking if they're set and then adding 1/2, 1/4, 1/8 and 1/16 to the result respectively.
Something like this seems pretty reasonable. Take the number portion, divide by 16, and fix the sign.
float tempSensor(uint16_t value) {
    bool negative = (value & 0x1000);
    return (negative ? -1 : 1) * (value & 0x0FFF) / 16.0f;
}
float convert(unsigned char msb, unsigned char lsb)
{
    return ((lsb | ((msb & 0x0f) << 8)) * ((msb & 0x10) ? -1 : 1)) / 16.0f;
}
or
float convert(uint16_t val)
{
    return (((val & 0x1000) ? -1 : 1) * (val << 4)) / 256.0f;
}
If performance isn't a super big deal, I would go for something less clever and more explicit, along the lines of:
bool is_bit_set(uint16_t value, uint16_t bit) {
    uint16_t mask = 1 << bit;
    return (value & mask) == mask;
}

float parse_temperature(uint16_t raw_reading) {
    if (is_bit_set(raw_reading, 15)) { /* temp is above Tcrit. Do something about it. */ }
    if (is_bit_set(raw_reading, 14)) { /* temp is above Tupper. Do something about it. */ }
    if (is_bit_set(raw_reading, 13)) { /* temp is above Tlower. Do something about it. */ }

    uint16_t whole_degrees = (raw_reading & 0x0FF0) >> 4;
    float magnitude = (float) whole_degrees;
    if (is_bit_set(raw_reading, 0)) magnitude += 1.0f/16.0f;
    if (is_bit_set(raw_reading, 1)) magnitude += 1.0f/8.0f;
    if (is_bit_set(raw_reading, 2)) magnitude += 1.0f/4.0f;
    if (is_bit_set(raw_reading, 3)) magnitude += 1.0f/2.0f;

    bool is_negative = is_bit_set(raw_reading, 12);
    // TODO: What do the 3 most significant bits do?
    return magnitude * (is_negative ? -1.0 : 1.0);
}
Honestly this is a lot of simple constant math; I'd be surprised if the compiler wasn't able to heavily optimize it. That would need confirmation, of course.
If your C compiler has a clz builtin or equivalent, it could be useful for avoiding a multiply operation.
In your case, since the provided temp value looks like a mantissa, and if your C compiler uses the IEEE-754 float representation, translating the temp value into its IEEE-754 equivalent may be a more efficient way:
Update: compacted the code a little and clarified the explanation of the mantissa.
float convert(uint16_t val) {
    uint16_t mantissa = (uint16_t)(val << 4);
    if (mantissa == 0) return 0.0;
    unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
    uint32_t r = (uint32_t)((val & 0x1000) << 19 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
    return *((float *)(&r));
}
or
float convert(unsigned char msb, unsigned char lsb) {
    uint16_t mantissa = (uint16_t)((msb << 8 | lsb) << 4);
    if (mantissa == 0) return 0.0;
    unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
    uint32_t r = (uint32_t)((msb & 0x10) << 27 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
    return *((float *)(&r));
}
Explanation:
We use the fact that the temp value is essentially a mantissa in the range -255 to 255.
We can then consider that its IEEE-754 exponent will be at most 128 and at least -128.
We use the clz builtin to get the "order" of the first bit set in the mantissa;
this way we can define the exponent as the theoretical max (2^7 => 128) less this "order".
We also use this order to left-shift the temp value to get the IEEE-754 mantissa,
plus one extra left shift to remove the implied '1' of the IEEE-754 significand.
Thus we build a 32-bit binary IEEE-754 representation from the temp value with:
First, the sign bit, which goes to the 32nd bit of our binary IEEE-754 representation.
The computed exponent: the theoretical max 7 (2^7 => 128) plus the IEEE-754 bias (127) minus the actual "order" of the temp value.
The "order" of the temp value is deduced from the number of leading '0' bits of its 12-bit representation in the variable mantissa, via the clz builtin.
Beware that we assume the clz builtin expects a 32-bit value as its parameter; that is why we subtract 16 here. This code may require adaptation if your clz expects anything else.
The number of leading '0' bits can go from 0 (temp value above 128 or under -127) to 11, as we directly return 0.0 for a zero temp value.
As the bit right after those leading zeros is the first '1' of the temp value, the "order" corresponds to a power-of-2 reduction from the theoretical max 7.
Thus, with 7 + 127 => 0x86, we can simply subtract the "order" from that, since the number of leading '0' bits lets us deduce the 'first' base exponent for IEEE-754.
If the "order" is greater than 7 we still get the negative (unbiased) exponent required for values less than 1.
We then place this 8-bit exponent into bits 24 through 31 of our binary IEEE-754 representation.
The temp value is essentially already a mantissa; we suppress the leading '0' bits and its first set bit by shifting left by (e + 1), while also shifting left by 7 more bits to place the mantissa within the 32 bits (e+7+1 => e+8). We then mask only the desired 23 bits (AND & 0x7FFFFF).
Its first set bit must be removed as it is the implied '1' of the IEEE-754 significand (the power of 2 given by the exponent).
We then have the IEEE-754 mantissa and place it from the 8th to the 23rd bit of our binary IEEE-754 representation.
The 4 initial trailing zeros from our 16-bit temp value and the seven 'right' zeros added by the shifting won't change the effective IEEE-754 value.
As we start from a 32-bit value and use the OR operator (|) on a 32-bit exponent and mantissa, we then have the final IEEE-754 representation.
We can then return this binary representation as an IEEE-754 C float value.
Due to the required clz and the IEEE-754 translation, this way is less portable. The main interest is to avoid MUL operations in the resulting machine code, for performance on architectures with a "poor" FPU.
P.S.: About the casts. I've added explicit casts to let the C compiler know that we voluntarily discard some bits:
uint16_t mantissa = (uint16_t)(val << 4); : The cast here tells the compiler that we know we'll "lose" the four left bits, as that is the goal here. We discard the first four bits of the temp value for the mantissa.
(unsigned char)(__builtin_clz(mantissa) - 16) : We tell the C compiler that we only consider an 8-bit range for the builtin's return value, as we know our mantissa has only 12 significant bits and thus an output range from 0 to 12. We do not need the full int return.
uint32_t r = (uint32_t) ... : We tell the C compiler not to bother with the sign representation here, as we are building an IEEE-754 bit representation.
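As a worked check (with a hypothetical reading of val = 0x00C5, i.e. +12.3125 °C): mantissa = 0x00C5 << 4 = 0x0C50; __builtin_clz(0x0C50) on a 32-bit int is 20, so e = 4; the exponent byte is 0x86 - 4 = 0x82 (unbiased exponent 3); mantissa << (4+8) = 0x0C50000, masked to 0x450000; the sign bit is 0, so r = (0x82 << 23) | 0x450000 = 0x41450000, which is the IEEE-754 bit pattern of 12.3125f.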

C Bit-Level Int to Float Conversion Unexpected Output

Background:
I am playing around with bit-level coding (this is not homework - just curious). I found a lot of good material online and in a book called Hacker's Delight, but I am having trouble with one of the online problems.
It asks to convert an integer to a float. I used the following links as reference to work through the problem:
How to manually (bitwise) perform (float)x?
How to convert an unsigned int to a float?
http://locklessinc.com/articles/i2f/
Problem and Question:
I thought I understood the process well enough (I tried to document the process in the comments), but when I test it, I don't understand the output.
Test Cases:
float_i2f(2) returns 1073741824
float_i2f(3) returns 1077936128
I expected to see something like 2.0000 and 3.0000.
Did I mess up the conversion somewhere? I thought maybe this was a memory address, so I was thinking maybe I missed something in the conversion step needed to access the actual number? Or maybe I am printing it incorrectly? I am printing my output like this:
printf("Float_i2f ( %d ): ", 3);
printf("%u", float_i2f(3));
printf("\n");
But I thought that printing method was fine for unsigned values in C (I'm used to programming in Java).
Thanks for any advice.
Code:
/*
 * float_i2f - Return bit-level equivalent of expression (float) x
 *   Result is returned as unsigned int, but
 *   it is to be interpreted as the bit-level representation of a
 *   single-precision floating point value.
 *   Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
 *   Max ops: 30
 *   Rating: 4
 */
unsigned float_i2f(int x) {
    if (x == 0){
        return 0;
    }
    //save the sign bit for later and get the absolute value of x
    //the absolute value is needed to shift bits to put them
    //into the appropriate position for the float
    unsigned int signBit = 0;
    unsigned int absVal = (unsigned int)x;
    if (x < 0){
        signBit = 0x80000000;
        absVal = (unsigned int)-x;
    }
    //Calculate the exponent
    // Shift the input left until the high order bit is set to form the mantissa.
    // Form the floating exponent by subtracting the number of shifts from 158.
    unsigned int exponent = 158; //158 = 127 (bias) + 31 (highest bit position)
    while ((absVal & 0x80000000) == 0){ //when the high bit becomes 1, the loop breaks
        exponent--;
        absVal <<= 1;
    }
    //find the mantissa (bit shift to the right)
    unsigned int mantissa = absVal >> 8;
    //place the exponent bits in the right place
    exponent = exponent << 23;
    //get the mantissa
    mantissa = mantissa & 0x7fffff;
    //return the reconstructed float
    return signBit | exponent | mantissa;
}
Continuing from the comment: your code is correct, and you are simply looking at the equivalent unsigned integer made up by the bits of your IEEE-754 single-precision floating point number. The IEEE-754 single-precision format (made up of the sign, biased exponent, and mantissa) can be interpreted as a float, or those same bits can be interpreted as an unsigned integer (just the number made up by the 32 bits). You are outputting the unsigned equivalent of the floating point number.
You can confirm with a simple union. For example:
#include <stdio.h>
#include <stdint.h>

typedef union {
    uint32_t u;
    float f;
} u2f;

int main (void) {
    u2f tmp = { .f = 2.0 };
    printf ("\n u : %u\n f : %f\n", tmp.u, tmp.f);
    return 0;
}
Example Usage/Output
$ ./bin/unionuf
u : 1073741824
f : 2.000000
Let me know if you have any further questions. It's good to see that your study resulted in the correct floating point conversion. (also note the second comment regarding truncation/rounding)
I'll just chime in here, because nothing specifically about endianness has been addressed. So let's talk about it.
The construction of the value in the original question was endianness-agnostic, using shifts and other bitwise operations. This means that regardless of whether your system is big- or little-endian, the actual value will be the same. The difference will be its byte order in memory.
The generally accepted convention for IEEE-754 is that the byte order is big-endian (although I believe there is no formal specification of this, and therefore no requirement on implementations to follow it). This means if you want to directly interpret your integer value as a float, it needs to be laid out in big-endian byte order.
So, you can use this approach combined with a union if and only if you know that the endianness of floats and integers on your system is the same.
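If you're unsure whether that holds on your target, a minimal runtime check is possible. The sketch below (the function name is mine) assumes 32-bit IEEE-754 floats and compares the in-memory bytes of 1.0f against the bit pattern 0x3F800000 stored as a uint32_t:
#include <stdint.h>
#include <string.h>

/* Returns nonzero if float and uint32_t use the same byte order in memory.
   Assumes 32-bit IEEE-754 floats; 0x3F800000 is the bit pattern of 1.0f. */
int int_and_float_share_byte_order(void) {
    float f = 1.0f;
    uint32_t u = 0x3F800000u;
    unsigned char fb[sizeof f], ub[sizeof u];
    memcpy(fb, &f, sizeof f);
    memcpy(ub, &u, sizeof u);
    return memcmp(fb, ub, sizeof fb) == 0;
}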
If the integer and float byte orders on your system actually differ, or if your integer value was assembled in a byte order other than the machine's native one, then you need to rearrange the bytes before reinterpreting them. A simple approach is to repack the bytes explicitly; the example below lays them out most-significant byte first, i.e. it assumes the float is stored big-endian in memory:
uint32_t n = float_i2f( input_val );
uint8_t bytes[4] = {
    (uint8_t)((n >> 24) & 0xff),
    (uint8_t)((n >> 16) & 0xff),
    (uint8_t)((n >> 8) & 0xff),
    (uint8_t)(n & 0xff)
};
float fval;
memcpy( &fval, bytes, sizeof(float) );
I'll stress that you only need to worry about this if you are trying to reinterpret your integer representation as a float or the other way round.
If you're only trying to output what the representation is in bits, then you don't need to worry. You can just display your integer in a useful form such as hex:
printf( "0x%08x\n", n );

Convert Raw 14 bit Two's Complement to Signed 16 bit Integer

I am doing some work in embedded C with an accelerometer that returns data as a 14 bit 2's complement number. I am storing this result directly into a uint16_t. Later in my code I am trying to convert this "raw" form of the data into a signed integer to represent / work with in the rest of my code.
I am having trouble getting the compiler to understand what I am trying to do. In the following code I'm checking if the 14th bit is set (meaning the number is negative) and then I want to invert the bits and add 1 to get the magnitude of the number.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range) {
    int16_t raw_signed;
    if(raw & _14BIT_SIGN_MASK) {
        // Convert 14 bit 2's complement to 16 bit 2's complement
        raw |= (1 << 15) | (1 << 14); // 2's complement extension
        raw_signed = -(~raw + 1);
    }
    else {
        raw_signed = raw;
    }
    uint16_t divisor;
    if(range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if(range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }
    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
This code unfortunately doesn't work. The disassembly shows me that for some reason the compiler is optimizing out my statement raw_signed = -(~raw + 1); How do I achieve the result I desire?
The math works out on paper, but I feel like for some reason the compiler is fighting with me :(.
Converting the 14-bit 2's complement value to 16-bit signed, while maintaining the value, is simply a matter of:
int16_t accel = (int16_t)(raw << 2) / 4 ;
The left-shift pushes the sign bit into the 16-bit sign-bit position, and the divide by four restores the magnitude while maintaining its sign. The divide avoids the implementation-defined behaviour of a right-shift of a negative value, but will normally compile to a single arithmetic-shift-right on instruction sets that allow it. The cast is necessary because raw << 2 is an int expression, and unless int is 16 bits, the divide would otherwise simply restore the original value.
It would be simpler however to just shift the accelerometer data left by two bits and treat it as if the sensor was 16 bit in the first place. Normalising everything to 16 bit has the benefit that the code needs no change if you use a sensor with any number of bits up to 16. The magnitude will simply be four times greater, and the least significant two bits will be zero - no information is gained or lost, and the scaling is arbitrary in any case.
int16_t accel = raw << 2 ;
In both cases, if you want the unsigned magnitude then that is simply:
int32_t mag = (int32_t)labs( (int)accel ) ;
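As a quick sanity check, here is a small hypothetical test of the shift-and-divide conversion; it assumes a 32-bit int and that raw really holds a 14-bit two's-complement reading, and the sample values are made up:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t samples[] = { 0x1FFF, 0x2000, 0x3FFF };   /* +8191, -8192, -1 in 14-bit */
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int16_t accel = (int16_t)(samples[i] << 2) / 4;
        printf("raw 0x%04X -> %d\n", (unsigned)samples[i], accel);
    }
    return 0;   /* expected output: 8191, -8192, -1 */
}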
I would do simple arithmetic instead. The result is 14-bit signed, which is represented as a number from 0 to 2^14 - 1. Test if the number is 2^13 or above (signifying a negative) and then subtract 2^14.
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range)
{
    int16_t raw_signed = raw;
    if(raw_signed >= 1 << 13) {
        raw_signed -= 1 << 14;
    }
    uint16_t divisor;
    if(range == FXLS8471QR1_FS_RANGE_2G) {
        divisor = FS_DIV_2G;
    }
    else if(range == FXLS8471QR1_FS_RANGE_4G) {
        divisor = FS_DIV_4G;
    }
    else {
        divisor = FS_DIV_8G;
    }
    return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
}
Please check my arithmetic. (Do I have 13 and 14 correct?)
Supposing that int in your particular C implementation is 16 bits wide, the expression (1 << 15), which you use in mangling raw, produces undefined behavior. In that case, the compiler is free to generate code to do pretty much anything -- or nothing -- if the branch of the conditional is taken wherein that expression is evaluated.
Also if int is 16 bits wide, then the expression -(~raw + 1) and all intermediate values will have type unsigned int == uint16_t. This is a result of "the usual arithmetic conversions", given that (16-bit) int cannot represent all values of type uint16_t. The result will have the high bit set and therefore be outside the range representable by type int, so assigning it to an lvalue of type int produces implementation-defined behavior. You'd have to consult your documentation to determine whether the behavior it defines is what you expected and wanted.
If you instead perform a 14-bit sign conversion, forcing the higher-order bits off ((~raw + 1) & 0x3fff) then the result -- the inverse of the desired negative value -- is representable by a 16-bit signed int, so an explicit conversion to int16_t is well-defined and preserves the (positive) value. The result you want is the inverse of that, which you can obtain simply by negating it. Overall:
raw_signed = -(int16_t)((~raw + 1) & 0x3fff);
Of course, if int were wider than 16 bits in your environment then I see no reason why your original code would not work as expected. That would not invalidate the expression above, however, which produces consistently-defined behavior regardless of the size of default int.
Assuming that when the code reaches return ((int32_t)raw_signed ..., raw_signed has a value in the [-8192 ... +8191] range:
If RAW_SCALE_FACTOR is a multiple of 4 then a little savings can be had.
So rather than
int16_t raw_signed = raw << 2;
raw_signed >>= 2;
instead
int16_t fxls8471qr1_convert_raw_accel_to_mag(uint16_t raw, enum fxls8471qr1_fs_range range){
    int16_t raw_signed = raw << 2;
    uint16_t divisor;
    ...
    // return ((int32_t)raw_signed * RAW_SCALE_FACTOR) / divisor;
    return ((int32_t)raw_signed * (RAW_SCALE_FACTOR/4)) / divisor;
}
To convert the 14-bit two's-complement into a signed value, you can flip the sign bit and subtract the offset:
int16_t raw_signed = (raw ^ 1 << 13) - (1 << 13);
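As a quick worked check of that identity (sample readings chosen for illustration): a raw reading of 0x3FFF (-1 in 14-bit two's complement) gives (0x3FFF ^ 0x2000) - 0x2000 = 0x1FFF - 0x2000 = -1, and a raw reading of 0x1FFF (+8191) gives (0x1FFF ^ 0x2000) - 0x2000 = 0x3FFF - 0x2000 = 8191.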

Rounding point issues when converting to float bitwise

I am working on a homework assignment, where we are supposed to convert an int to float via bitwise operations. The following code works, except it encounters rounding. My function seems to always round down, but in some cases it should round up.
For example 0x80000001 should be represented as 0xcf000000 (exponent 31, mantissa 0), but my function returns 0xceffffff (exponent 30, mantissa 0xffffff).
I am not sure how to continue to fix these rounding issues. What steps should I take to make this work?
unsigned float_i2f(int x) {
    if(x==0) return 0;
    int sign = 0;
    if(x<0) {
        sign = 1<<31;
        x = -x;
    }
    unsigned y = x;
    unsigned exp = 31;
    while ((y & 0x80000000) == 0)
    {
        exp--;
        y <<= 1;
    }
    unsigned mantissa = y >> 8;
    return sign | ((exp+127) << 23) | (mantissa & 0x7fffff);
}
Possible duplicate of this, but the question is not properly answered.
You are obviously ignoring the lowest 8 bits of y when you calculate mantissa.
The usual rule is called "round to nearest even": if the lowest 8 bits of y are > 0x80 then increase mantissa by 1. If the lowest 8 bits of y are == 0x80 and bit 8 of y is 1 then increase mantissa by 1. In either case, if mantissa becomes >= 0x1000000, shift mantissa right by one and increase exponent.
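Putting that rule into the posted function, a minimal sketch (assuming 32-bit int/unsigned and two's complement, as in the question) might look like this:
unsigned float_i2f(int x) {
    if (x == 0) return 0;
    unsigned sign = 0;
    unsigned y = x;
    if (x < 0) {
        sign = 1u << 31;
        y = 0u - (unsigned)x;           /* safe even for INT_MIN */
    }
    unsigned exp = 31;
    while ((y & 0x80000000u) == 0) {    /* normalize: bring the leading 1 to the top */
        exp--;
        y <<= 1;
    }
    unsigned mantissa = y >> 8;
    unsigned discarded = y & 0xFF;      /* the 8 bits that get shifted off */
    if (discarded > 0x80 || (discarded == 0x80 && (mantissa & 1)))
        mantissa++;                     /* round to nearest, ties to even */
    if (mantissa >= 0x1000000) {        /* rounding carried out of the 24-bit significand */
        mantissa >>= 1;
        exp++;
    }
    return sign | ((exp + 127) << 23) | (mantissa & 0x7fffff);
}
With this, the example input 0x80000001 comes out as 0xcf000000 as expected.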

How to represent negation using bitwise operators, C

Suppose you have 2 numbers:
int x = 1;
int y = 2;
Using bitwise operators, how can I represent x-y?
When comparing the bits of two numbers A and B there are three possibilities. The following assumes unsigned numbers.
A == B : All of the bits are the same
A > B: The most significant bit that differs between the two numbers is set in A and not in B
A < B: The most significant bit that differs between the two numbers is set in B and not in A
Code might look like the following
int getDifType(uint32_t A, uint32_t B)
{
    uint32_t bitMask = 0x80000000;
    // From MSB to LSB
    for (bitMask = 0x80000000; 0 != bitMask; bitMask >>= 1)
    {
        if ((A & bitMask) != (B & bitMask))
            return (A & bitMask) - (B & bitMask);
    }
    // No difference found
    return 0;
}
You need to read about two's complement arithmetic. Addition, subtraction, negation, sign testing, and everything else are all done by the hardware using bitwise operations, so you can definitely do it in your C program. The wikipedia link above should teach you everything you need to know to solve your problem.
Your first step will be to implement addition using only bitwise operators. After that, everything should be easy. Start small- what do you have to do to implement 00 + 00, 01 + 01, etc? Go from there.
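For example, here is a minimal sketch of that idea: a ripple-carry adder built from XOR, AND, and shifts, then x - y computed as x plus the two's complement of y. The function names are mine, and unsigned arithmetic is used internally to sidestep signed-overflow issues.
/* add two values using only bitwise operations */
unsigned bitwise_add(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned carry = a & b;   /* positions that generate a carry */
        a = a ^ b;                /* sum of each bit pair, ignoring carries */
        b = carry << 1;           /* carries propagate one position to the left */
    }
    return a;
}

/* x - y == x + (~y + 1): add the two's complement of y */
int bitwise_sub(int x, int y) {
    unsigned neg_y = bitwise_add(~(unsigned)y, 1u);
    return (int)bitwise_add((unsigned)x, neg_y);
}
With the question's values, bitwise_sub(1, 2) comes out as -1 on the usual two's-complement platforms.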
You need to start checking from the most significant end to find if a number is greater or not. This logic will work only for non-negative integers.
int x,y;
//get x & y
unsigned int mask = 1;                // make the mask 000..0001
mask = mask << (8*sizeof(int) - 1);   // make the mask 1000..000
while(mask != 0)
{
    if((x & mask) > (y & mask))
        { printf("x greater"); break; }
    else if((y & mask) > (x & mask))
        { printf("y greater"); break; }
    mask = mask >> 1;                 // shift 1 in mask to the right
}
Compare the bits from left to right, looking for the leftmost bit that differs. Assuming a machine that is two's complement, the topmost bit determines the sign and will have a flipped comparison sense versus the other bits. This should work on any two's complement machine:
int compare(int x, int y) {
    unsigned int mask = ~0U - (~0U >> 1);   // select left-most bit
    if (x & mask && ~y & mask)
        return -1;  // x < 0 and y >= 0, therefore y > x
    else if (~x & mask && y & mask)
        return 1;   // x >= 0 and y < 0, therefore x > y
    for (; mask; mask >>= 1) {
        if (x & mask && ~y & mask)
            return 1;
        else if (~x & mask && y & mask)
            return -1;
    }
    return 0;
}
[Note that this technically isn't portable. C makes no guarantees that signed arithmetic will be two's complement. But you'll be hard pressed to find a C implementation on a modern machine that behaves differently.]
To see why this works, consider first comparing two unsigned numbers, 13d = 1101b and 11d = 1011b. (I'm assuming a 4-bit wordsize for brevity.) The leftmost differing bit is the second from the left, which the former has set, while the other does not. The former number is therefore the larger. It should be fairly clear that this principle holds for all unsigned numbers.
Now, consider two's complement numbers. You negate a number by complementing the bits and adding one. Thus, -1d = 1111b, -2d = 1110b, -3d = 1101b, -4d = 1100b, etc. You can see that two negative numbers can be compared as though they were unsigned. Likewise, two non-negative numbers can also be compared as though unsigned. Only when the signs differ do we have to consider them -- but if they differ, the comparison is trivial!
