Rounding point issues when converting to float bitwise - c

I am working on a homework assignment, where we are supposed to convert an int to float via bitwise operations. The following code works, except it encounters rounding. My function seems to always round down, but in some cases it should round up.
For example 0x80000001 should be represented as 0xcf000000 (exponent 31, mantissa 0), but my function returns 0xceffffff. (exponent 30, mantissa 0xffffff).
I am not sure how to continue to fix these rounding issues. What steps should i take to make this work?
unsigned float_i2f(int x) {
if(x==0) return 0;
int sign = 0;
if(x<0) {
sign = 1<<31;
x = -x;
}
unsigned y = x;
unsigned exp = 31;
while ((y & 0x80000000) == 0)
{
exp--;
y <<= 1;
}
unsigned mantissa = y >> 8;
return sign | ((exp+127) << 23) | (mantissa & 0x7fffff);
}
Possible duplicate of this, but the question is not properly answered.

You are obviously ignoring the lowest 8 bits of y when you calculate mantissa.
The usual rule is called "round to nearest even": If the lowest 8 bit of y are > 0x80 then increase mantissa by 1. If the lowest 8 bit of y are = 0x80 and bit 8 is 1 then increase mantissa by 1. In either case, if mantissa becomes >= 0x1000000 then shift mantissa to the right and increase exponent.

Related

How can I convert this number representation to a float?

I read this 16-bit value from a temperature sensor (type MCP9808)
Ignoring the first three MSBs, what's an easy way to convert the other bits to a float?
I managed to convert the values 2^7 through 2^0 to an integer with some bit-shifting:
uint16_t rawBits = readSensor();
int16_t value = (rawBits << 3) / 128;
However I can't think of an easy way to also include the bits with an exponent smaller than 0, except for manually checking if they're set and then adding 1/2, 1/4, 1/8 and 1/16 to the result respectively.
Something like this seems pretty reasonable. Take the number portion, divide by 16, and fix the sign.
float tempSensor(uint16_t value) {
bool negative = (value & 0x1000);
return (negative ? -1 : 1) * (value & 0x0FFF) / 16.0f;
}
float convert(unsigned char msb, unsigned char lsb)
{
return ((lsb | ((msb & 0x0f) << 8)) * ((msb & 0x10) ? -1 : 1)) / 16.0f;
}
or
float convert(uint16_t val)
{
return (((val & 0x1000) ? -1 : 1) * (val << 4)) / 256.0f;
}
If performance isn't a super big deal, I would go for something less clever and more explcit, along the lines of:
bool is_bit_set(uint16_t value, uint16_t bit) {
uint16_t mask = 1 << bit;
return (value & mask) == mask;
}
float parse_temperature(uint16_t raw_reading) {
if (is_bit_set(raw_reading, 15)) { /* temp is above Tcrit. Do something about it. */ }
if (is_bit_set(raw_reading, 14)) { /* temp is above Tupper. Do something about it. */ }
if (is_bit_set(raw_reading, 13)) { /* temp is above Tlower. Do something about it. */ }
uint16_t whole_degrees = (raw_reading & 0x0FF0) >> 4;
float magnitude = (float) whole_degrees;
if (is_bit_set(raw_reading, 0)) magnitude += 1.0f/16.0f;
if (is_bit_set(raw_reading, 1)) magnitude += 1.0f/8.0f;
if (is_bit_set(raw_reading, 2)) magnitude += 1.0f/4.0f;
if (is_bit_set(raw_reading, 3)) magnitude += 1.0f/2.0f;
bool is_negative = is_bit_set(raw_reading, 12);
// TODO: What do the 3 most significant bits do?
return magnitude * (is_negative ? -1.0 : 1.0);
}
Honestly this is a lot of simple constant math, I'd be surprised if the compiler wasn't able to heavily optimize it. That would need confirmation, of course.
If your C compiler has a clz buitlin or equivalent, it could be useful to avoid mul operation.
In your case, as the provided temp value looks like a mantissa and if your C compiler uses IEEE-754 float representation, translating the temp value in its IEEE-754 equivalent may be a most efficient way :
Update: Compact the code a little and more clear explanation about the mantissa.
float convert(uint16_t val) {
uint16_t mantissa = (uint16_t)(val <<4);
if (mantissa==0) return 0.0;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
uint32_t r = (uint32_t)((val & 0x1000) << 19 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
return *((float *)(&r));
}
or
float convert(unsigned char msb, unsigned char lsb) {
uint16_t mantissa = (uint16_t)((msb<<8 | lsb) <<4);
if (mantissa==0) return 0.0;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
uint32_t r = (uint32_t)((msb & 0x10) << 27 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
return *((float *)(&r));
}
Explanation:
We use the fact that the temp value is somehow a mantissa in the range -255 to 255.
We can then consider that its IEEE-754 exponent will be 128 at max to -128 at min.
We use the clz buitlin to get the "order" of the first bit set in the mantissa,
this way we can define the exponent as the therorical max (2^7 =>128) less this "order".
We use also this order to left shift the temp value to get the IEEE-754 mantissa,
plus one left shift to substract the '1' implied part of the significand for IEEE-754.
Thus we build a 32 bits binary IEEE-754 representation from the temp value with :
At first the sign bit to the 32th bit of our binary IEEE-754 representation.
The computed exponent as the theorical max 7 (2^7 =>128) plus the IEEE-754 bias (127) minus the actual "order" of the temp value.
The "order" of the temp value is deducted from the number of leading '0' of its 12 bits representation in the variable mantissa through the clz builtin.
Beware that here we consider that the clz builtin is expecting a 32 bit value as parameter, that is why we substract 16 here. This code may require adaptation if your clz expects anything else.
The number of leading '0' can go from 0 (temp value above 128 or under -127) to 11 as we directly return 0.0 for a zero temp value.
As the following bit of the "order" is then 1 in the temp value, it is equivalent to a power of 2 reduction from the theorical max 7.
Thus, with 7 + 127 => 0x86, we can simply substract to that the "order" as the number of leading '0' permits us to deduce the 'first' base exponent for IEEE-754.
If the "order" is greater than 7 we will still get the negative exponent required for less than 1 values.
We add then this 8bits exponent to our binary IEEE-754 representation from 24th bit to 31th bit.
The temp value is somehow already a mantissa, we suppress the leading '0' and its first bit set by shifting it to the left (e + 1) while also shifting left for 7 bits to place the mantissa in the 32 bits (e+7+1 => e+8) . We mask then only the desired 23 bits (AND &0x7FFFFF).
Its first bit set must be removed as it is the '1' implied significand in IEEE-754 (the power of 2 of the exponent).
We have then the IEEEE-754 mantissa and place it from the 8th bit to the 23th bit of our binary IEEE-754 representation.
The 4 initial trailing 0 from our 16 bits temp value and the added seven 'right' 0 from the shifting won't change the effective IEEE-754 value.
As we start from a 32 bits value and use or operator (|) on a 32 bits exponent and mantissa, we have then the final IEEE-754 representation.
We can then return this binary representation as an IEEE-754 C float value.
Due to the required clz and the IEEE-754 translation, this way is less portable. The main interest is to avoid MUL operations in the resulting machine code for performance on arch with a "poor" FPU.
P.S.: Casts explanation. I've added explicit casts to let the C compiler know that we discard voluntary some bits :
uint16_t mantissa = (uint16_t)(val <<4); : The cast here tells the compiler that we know we'll "loose" four left bits, as it the goal here. We discard the four first bits of the temp value for the mantissa.
(unsigned char)(__builtin_clz(mantissa) - 16) : We tell to the C compiler that we will only consider a 8 bits range for the builtin return, as we know our mantissa has only 12 significatives bits and thus a range output from 0 to 12. Thus we do not need the full int return.
uint32_t r = (uint32_t) ... : We tell the C compiler to not bother with the sign representation here as we build an IEEE-754 representation.

Cast Integer to Float using Bit Manipulation breaks on some integers in C

Working on a class assignment, I'm trying to cast an integer to a float only using bit manipulations (limited to any integer/unsigned operations incl. ||, &&. also if, while). My code is working for most values, but some values are not generating the results I'm looking for.
For example, if x is 0x807fffff, I get 0xceff0001, but the correct result should be 0xceff0000. I think I'm missing something with my mantissa and rounding, but can't quite pin it down. I've looked at some other threads on SO as well converting-int-to-float and how-to-manually
unsigned dl22(int x) {
int tmin = 0x1 << 31;
int tmax = ~tmin;
unsigned signBit = 0;
unsigned exponent;
unsigned mantissa;
int bias = 127;
if (x == 0) {
return 0;
}
if (x == tmin) {
return 0xcf << 24;
}
if (x < 0) {
signBit = x & tmin;
x = (~x + 1);
}
exponent = bias + 31;
while ( ( x & tmin) == 0 ) {
exponent--;
x <<= 1;
}
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8) & mantissaMask;
return (signBit | exponent | mantissa);
}
EDIT/UPDATE
Found a viable solution - see below
Your code produces the expected output for me on the example you presented. As discussed in comments, however, from C's perspective it does exhibit undefined behavior -- not just in the computation of tmin, but also, for the same reason, in the loop wherein you compute the exponent. To whatever extent this code produces results that vary from environment to environment, that will follow either from the undefined behavior or from your assumption about the size of [unsigned] int being incorrect for the C implementation in use.
Nevertheless, if we assume (unsafely)
that shifts of ints operate as if the left operand were reinterpreted as an unsigned int with the same bit pattern, operated upon, and the resulting bit pattern reinterpreted as an int, and
that int and unsigned int are at least 32 bits wide,
then your code seems correct, modulo rounding.
In the event that the absolute value of the input int has more than 24 significant binary digits (i.e. it is at least 224), however, some precision will be lost in the conversion. In that case the correct result will depend on the FP rounding mode you intend to implement. An incorrectly rounded result will be off by 1 unit in the last place; how many results that affects depends on the rounding mode.
Simply truncating / shifting off the extra bits as you do yields round toward zero mode. That's one of the standard rounding modes, but not the default. The default rounding mode is to round to the nearest representable number, with ties being resolved in favor of the result having least-significant bit 0 (round to even); there are also three other standard modes. To implement any mode other than round-toward-zero, you'll need to capture the 8 least-significant bits of the significand after scaling and before shifting them off. These, together with other details depending on the chosen rounding mode, will determine how to apply the correct rounding.
About half of the 32-bit two's complement numbers will be rounded differently when converted in round-to-zero mode than when converted in any one of the other modes; which numbers exhibit a discrepancy depends on which rounding mode you consider.
I didn't originally mention that I am trying to imitate a U2F union statement:
float u2f(unsigned u) {
union {
unsigned u;
float f;
} a;
a.u = u;
return a.f;
}
Thanks to guidance provided in the postieee-754-bit-manipulation-rounding-error I was able to manage the rounding issues by putting the following after my while statement. This clarified the rounding that was occurring.
lsb = (x >> 8) & 1;
roundBit = (x >> 7) & 1;
stickyBitFlag = !!(x & 0x7F);
exponent <<= 23;
int mantissaMask = ~(tmin >> 8);
mantissa = (x >> 8);
mantissa &= mantissaMask;
roundBit = (roundBit & stickyBitFlag) | (roundBit & lsb);
return (signBit | exponent | mantissa) + roundBit;

Divide a signed integer by a power of 2

I'm working on a way to divide a signed integer by a power of 2 using only binary operators (<< >> + ^ ~ & | !), and the result has to be round toward 0. I came across this question also on Stackoverflow on the problem, however, I cannot understand why it works. Here's the solution:
int divideByPowerOf2(int x, int n)
{
return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
I understand the x >> 31 part (only add the next part if x is negative, because if it's positive x will be automatically round toward 0). But what's bothering me is the (1 << n) + ~0 part. How can it work?
Assuming 2-complement, just bit-shifting the dividend is equivalent to a certain kind of division: not the conventional division where we round the dividend to next multiple of divisor toward zero. But another kind where we round the dividend toward negative infinity. I rediscovered that in Smalltalk, see http://smallissimo.blogspot.fr/2015/03/is-bitshift-equivalent-to-division-in.html.
For example, let's divide -126 by 8. traditionally, we would write
-126 = -15 * 8 - 6
But if we round toward infinity, we get a positive remainder and write it:
-126 = -16 * 8 + 2
The bit-shifting is performing the second operation, in term of bit patterns (assuming 8 bits long int for the sake of being short):
1000|0010 >> 3 = 1111|0000
1000|0010 = 1111|0000 * 0000|1000 + 0000|0010
So what if we want the traditional division with quotient rounded toward zero and remainder of same sign as dividend? Simple, we just have to add 1 to the quotient - if and only if the dividend is negative and the division is inexact.
You saw that x>>31 corresponds to first condition, dividend is negative, assuming int has 32 bits.
The second term corresponds to the second condition, if division is inexact.
See how are encoded -1, -2, -4, ... in two complement: 1111|1111 , 1111|1110 , 1111|1100. So the negation of nth power of two has n trailing zeros.
When the dividend has n trailing zeros and we divide by 2^n, then no need to add 1 to final quotient. In any other case, we need to add 1.
What ((1 << n) + ~0) is doing is creating a mask with n trailing ones.
The n last bits don't really matter, because we are going to shift to the right and just throw them away. So, if the division is exact, the n trailing bits of dividend are zero, and we just add n 1s that will be skipped. On the contrary, if the division is inexact, then one or more of the n trailing bits of the dividend is 1, and we are sure to cause a carry to the n+1 bit position: that's how we add 1 to the quotient (we add 2^n to the dividend). Does that explain it a bit more?
This is "write-only code": instead of trying to understand the code, try to create it by yourself.
For example, let's divide a number by 8 (shift right by 3).
If the number is negative, the normal right-shift rounds in the wrong direction. Let's "fix" it by adding a number:
int divideBy8(int x)
{
if (x >= 0)
return x >> 3;
else
return (x + whatever) >> 3;
}
Here you can come up with a mathematical formula for whatever, or do some trial and error. Anyway, here whatever = 7:
int divideBy8(int x)
{
if (x >= 0)
return x >> 3;
else
return (x + 7) >> 3;
}
How to unify the two cases? You need to make an expression that looks like this:
(x + stuff) >> 3
where stuff is 7 for negative x, and 0 for positive x. The trick here is using x >> 31, which is a 32-bit number whose bits are equal to the sign-bit of x: all 0 or all 1. So stuff is
(x >> 31) & 7
Combining all these, and replacing 8 and 7 by the more general power of 2, you get the code you asked about.
Note: in the description above, I assume that int represents a 32-bit hardware register, and hardware uses two's complement representation to do right shift.
OP's reference is of a C# code and so many subtle differences that cause it to be bad code with C, as this post is tagged.
int is not necessarily 32-bits so using a magic number of 32 does not make for a robust solution.
In particular (1 << n) + ~0 results in implementation defined behavior when n causes a bit to be shifted into the sign place. Not good coding.
Restricting code to only using "binary" operators << >> + ^ ~ & | ! encourages a coder to assume things about int which is not portable nor compliant with the C spec. So OP's posted code does not "work" in general, although may work in many common implementations.
OP code fails when int is not 2's complement, not uses the range [-2147483648 .. 2147483647] or when 1 << n uses implementation behavior that is not as expected.
// weak code
int divideByPowerOf2(int x, int n) {
return (x + ((x >> 31) & ((1 << n) + ~0))) >> n;
}
A simple alternative, assuming long long exceeds the range of int follows. I doubt this meets some corner of OP's goals, but OP's given goals encourages non-robust coding.
int divideByPowerOf2(int x, int n) {
long long ill = x;
if (x < 0) ill = -ill;
while (n--) ill >>= 1;
if (x < 0) ill = -ill;
return (int) ill;
}

Bitwise operation and masks

I am having problem understanding how this piece of code works. I understand when the x is a positive number, actually only (x & ~mark) have a value; but cannot figure what this piece of code is doing when x is a negative number.
e.g. If x is 1100(-4), and mask would be 0001, while ~mask is 1110.
The result of ((~x & mask) + (x & ~mask)) is 0001 + 1100 = 1011(-3), I tried hard but cannot figure out what this piece of code is doing, any suggestion is helpful.
/*
* fitsBits - return 1 if x can be represented as an
* n-bit, two's complement integer.
* 1 <= n <= 32
* Examples: fitsBits(5,3) = 0, fitsBits(-4,3) = 1
* Legal ops: ! ~ & ^ | + << >>
* Max ops: 15
* Rating: 2
*/
int fitsBits(int x, int n) {
/* mask the sign bit against ~x and vice versa to get highest bit in x. Shift by n-1, and not. */
int mask = x >> 31;
return !(((~x & mask) + (x & ~mask)) >> (n + ~0));
}
Note: this is pointless and only worth doing as an academic exercise.
The code makes the following assumptions (which are not guaranteed by the C standard):
int is 32-bit (1 sign bit followed by 31 value bits)
int is represented using 2's complement
Right-shifting a negative number does arithmetic shift, i.e. fill sign bit with 1
With these assumptions in place, x >> 31 will generate all-bits-0 for positive or zero numbers, and all-bits-1 for negative numbers.
So the effect of (~x & mask) + (x & ~mask) is the same as (x < 0) ? ~x : x .
Since we assumed 2's complement, ~x for negative numbers is -(x+1).
The effect of this is that if x is positive it remains unchanged. and if x is negative then it's mapped onto the range [0, INT_MAX] . In 2's complement there are exactly as many negative numbers as non-negative numbers, so this works.
Finally, we right-shift by n + ~0. In 2's complement, ~0 is -1, so this is n - 1. If we shift right by 4 bits for example, and we shifted all the bits off the end; it means that this number is representable with 1 sign bit and 4 value bits. So this shift tells us whether the number fits or not.
Putting all of that together, it is an arcane way of writing:
int x;
if ( x < 0 )
x = -(x+1);
// now x is non-negative
x >>= n - 1; // aka. x /= pow(2, n-1)
if ( x == 0 )
return it_fits;
else
return it_doesnt_fit;
Here is a stab at it, unfortunately it is hard to summarize bitwise logic easily. The general idea is to try to right shift x and see if it becomes 0 as !0 returns 1. If right shifting a positive number n-1 times results in 0, then that means n bits are enough to represent it.
The reason for what I call a and b below is due to negative numbers being allowed one extra value of representation by convention. An integer can represent some number of values, that number of values is an even number, one of the numbers required to represent is 0, and so what is left is an odd number of values to be distributed among negative and positive numbers. Negative numbers get to have that one extra value (by convention) which is where the abs(x)-1 comes into play.
Let me know if you have questions:
int fitsBits(int x, int n) {
int mask = x >> 31;
/* -------------------------------------------------
// A: Bitwise operator logic to get 0 or abs(x)-1
------------------------------------------------- */
// mask == 0x0 when x is positive, therefore a == 0
// mask == 0xffffffff when x is negative, therefore a == ~x
int a = (~x & mask);
printf("a = 0x%x\n", a);
/* -----------------------------------------------
// B: Bitwise operator logic to get abs(x) or 0
----------------------------------------------- */
// ~mask == 0xffffffff when x is positive, therefore b == x
// ~mask == 0x0 when x is negative, therefore b == 0
int b = (x & ~mask);
printf("b = 0x%x\n", b);
/* ----------------------------------------
// C: A + B is either abs(x) or abs(x)-1
---------------------------------------- */
// c is either:
// x if x is a positive number
// ~x if x is a negative number, which is the same as abs(x)-1
int c = (a + b);
printf("c = %d\n", c);
/* -------------------------------------------
// D: A ridiculous way to subtract 1 from n
------------------------------------------- */
// ~0 == 0xffffffff == -1
// n + (-1) == n-1
int d = (n + ~0);
printf("d = %d\n", d);
/* ----------------------------------------------------
// E: Either abs(x) or abs(x)-1 is shifted n-1 times
---------------------------------------------------- */
int e = (c >> d);
printf("e = %d\n", e);
// If e was right shifted into 0 then you know the number would have fit within n bits
return !e;
}
You should be performing those operations with unsigned int instead of int.
Some operations like >> will perform an arithmetic shift instead of logical shift when dealing with signed numbers and you will have this sort of unexpected outcome.
A right arithmetic shift of a binary number by 1. The empty position in the most significant bit is filled with a copy of the original MSB instead of zero. -- from Wikipedia
With unsigned int though this is what happens:
In a logical shift, zeros are shifted in to replace the discarded bits. Therefore the logical and arithmetic left-shifts are exactly the same.
However, as the logical right-shift inserts value 0 bits into the most significant bit, instead of copying the sign bit, it is ideal for unsigned binary numbers, while the arithmetic right-shift is ideal for signed two's complement binary numbers. -- from Wikipedia

How to represent negation using bitwise operators, C

Suppose you have 2 numbers:
int x = 1;
int y = 2;
Using bitwise operators, how can i represent x-y?
When comparing the bits of two numbers A and B there are three posibilities. The following assumes unsigned numbers.
A == B : All of the bits are the same
A > B: The most significant bit that differs between the two numbers is set in A and not in B
A < B: The most significant bit that differs between the two numbers is set in B and not in A
Code might look like the following
int getDifType(uint32_t A, uint32_t B)
{
uint32_t bitMask = 0x8000000;
// From MSB to LSB
for (bitMask = 0x80000000; 0 != bitMask; bitMask >>= 1)
{
if (A & bitMask != B & bitMask)
return (A & bitMask) - (B & bitMask);
}
// No difference found
return 0;
}
You need to read about two's complement arithmetic. Addition, subtraction, negation, sign testing, and everything else are all done by the hardware using bitwise operations, so you can definitely do it in your C program. The wikipedia link above should teach you everything you need to know to solve your problem.
Your first step will be to implement addition using only bitwise operators. After that, everything should be easy. Start small- what do you have to do to implement 00 + 00, 01 + 01, etc? Go from there.
You need to start checking from the most significant end to find if a number is greater or not. This logic will work only for non-negative integers.
int x,y;
//get x & y
unsigned int mask=1; // make the mask 000..0001
mask=mask<<(8*sizeoF(int)-1); // make the mask 1000..000
while(mask!=0)
{
if(x & mask > y & mask)
{printf("x greater");break;}
else if(y & mask > x & mask)
{printf("y greater");break;}
mask=mask>>1; // shift 1 in mask to the right
}
Compare the bits from left to right, looking for the leftmost bits that differ. Assuming a machine that is two's complement, the topmost bit determines the sign and will have a flipped comparison sense versus the other bits. This should work on any two's complement machine:
int compare(int x, int y) {
unsigned int mask = ~0U - (~0U >> 1); // select left-most bit
if (x & mask && ~y & mask)
return -1; // x < 0 and y >= 0, therefore y > x
else if (~x & mask && y & mask)
return 1; // x >= 0 and y < 0, therefore x > y
for (; mask; mask >>= 1) {
if (x & mask && ~y & mask)
return 1;
else if (~x & mask && y & mask)
return -1;
}
return 0;
}
[Note that this technically isn't portable. C makes no guarantees that signed arithmetic will be two's complement. But you'll be hard pressed to find a C implementation on a modern machine that behaves differently.]
To see why this works, consider first comparing two unsigned numbers, 13d = 1101b and 11d = 1011b. (I'm assuming a 4-bit wordsize for brevity.) The leftmost differing bit is the second from the left, which the former has set, while the other does not. The former number is therefore the larger. It should be fairly clear that this principle holds for all unsigned numbers.
Now, consider two's complement numbers. You negate a number by complementing the bits and adding one. Thus, -1d = 1111b, -2d = 1110b, -3d = 1101b, -4d = 1100b, etc. You can see that two negative numbers can be compared as though they were unsigned. Likewise, two non-negative numbers can also be compared as though unsigned. Only when the signs differ do we have to consider them -- but if they differ, the comparison is trivial!

Resources