what happens if you cast a big int to float - c

This is a general question about what precisely happens when I cast a very big/small SIGNED integer to a floating point value using gcc 4.4.
I see some weird behaviour when doing the casting. Here are some examples:
MUSTBE is obtained with this method:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
./btest -f float_i2f -1 0x80800001
input: 10000000100000000000000000000001
absolute value: 01111111011111111111111111111111
exponent: 10011101
mantissa: 00000000011111101111111111111111 (right shifted absolute value)
EXPECT: 11001110111111101111111111111111 (sign|exponent|mantissa)
MUST BE: 11001110111111110000000000000000 (sign ok, exponent ok,
mantissa???)
./btest -f float_i2f -1 0x3f7fffe0
EXPECT: 01001110011111011111111111111111
MUST BE: 01001110011111100000000000000000
./btest -f float_i2f -1 0x80004999
EXPECT: 11001110111111111111111101101100
MUST BE: 11001110111111111111111101101101 (<- 1 added at the end)
What bothers me is that in these examples the mantissa differs from what I get by simply shifting my integer value to the right. The zeros at the end, for instance: where do they come from?
I only see this behaviour on big/small values. Values in the range -2^24, 2^24 work fine.
I wonder if someone can enlighten me about what happens here. What are the steps to take on very big/small values?
This is an add on question to : function to convert float to int (huge integers) which is not as general as this one here.
EDIT
Code:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
EDIT 2:
After Adam's remarks and the reference to the book Write Great Code, I updated my routine with rounding. I still get some rounding errors (now, fortunately, only 1 bit off).
Now if I do a test run, I get mostly good results but a couple of rounding errors like this:
input: 0xfefffff5
result: 11001011100000000000000000000101
GOAL: 11001011100000000000000000000110 (1 too low)
input: 0x7fffff
result: 01001010111111111111111111111111
GOAL: 01001010111111111111111111111110 (1 too high)
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* mask to check which bits get shifted out when rounding */
static unsigned masks[24] = {
0, 1, 3, 7,
0xf, 0x1f,
0x3f, 0x7f,
0xff, 0x1ff,
0x3ff, 0x7ff,
0xfff, 0x1fff,
0x3fff, 0x7fff,
0xffff, 0x1ffff,
0x3ffff, 0x7ffff,
0xfffff, 0x1fffff,
0x3fffff, 0x7fffff
};
/* mask to check whether to round up or down */
static unsigned HOmasks[24] = {
0,
1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
};
int S = a & masks[8];
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
if (S > HOmasks[8]) {
/* round up */
m += 1;
} else if (S == HOmasks[8]) {
/* round down */
m = m + (m & 1);
}
/* special case where last bit of exponent is also set in mantissa
* and mantissa itself is 0 */
if (m & (0x1 << 23)) {
e += 1;
m = 0;
}
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
Does someone have any idea where the problem lies?

A 32-bit float uses some of the bits for the exponent and therefore cannot represent all 32-bit integer values exactly.
A 64-bit double can store any 32-bit integer value exactly.
Wikipedia has an abbreviated entry on IEEE 754 floating point, and lots of details of the internals of floating point numbers at IEEE 754-1985; the current standard is IEEE 754:2008. It notes that a 32-bit float uses one bit for the sign, 8 bits for the exponent, leaving 23 explicit and 1 implicit bit for the mantissa, which is why absolute values up to 2^24 can be represented exactly.
I thought that it was clear that a 32 bit integer can't be exactly stored into a 32bit float. My question is: What happens IF I store an integer bigger 2^24 or smaller -2^24? And how can I replicate it?
Once the absolute values are larger than 2^24, the integer values cannot be represented exactly in the 24 effective digits of the mantissa of a 32-bit float, so only the leading 24 digits are reliably available. Floating point rounding also kicks in.
You can demonstrate with code similar to this:
#include <inttypes.h>
#include <stdio.h>
typedef union Ufloat
{
uint32_t i;
float f;
} Ufloat;
static void dump_value(uint32_t i, uint32_t v)
{
Ufloat u = { .i = v };
printf("0x%.8" PRIX32 ": 0x%.8" PRIX32 " = %15.7e = %15.6A\n", i, v, u.f, u.f);
}
int main(void)
{
uint32_t lo = 1 << 23;
uint32_t hi = 1 << 28;
Ufloat u;
for (uint32_t v = lo; v < hi; v <<= 1)
{
u.f = v;
dump_value(v, u.i);
}
lo = (1 << 24) - 16;
hi = lo + 64;
for (uint32_t v = lo; v < hi; v++)
{
u.f = v;
dump_value(v, u.i);
}
return 0;
}
Sample output:
0x00800000: 0x4B000000 = 8.3886080e+06 = 0X1.000000P+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x02000000: 0x4C000000 = 3.3554432e+07 = 0X1.000000P+25
0x04000000: 0x4C800000 = 6.7108864e+07 = 0X1.000000P+26
0x08000000: 0x4D000000 = 1.3421773e+08 = 0X1.000000P+27
0x00FFFFF0: 0x4B7FFFF0 = 1.6777200e+07 = 0X1.FFFFE0P+23
0x00FFFFF1: 0x4B7FFFF1 = 1.6777201e+07 = 0X1.FFFFE2P+23
0x00FFFFF2: 0x4B7FFFF2 = 1.6777202e+07 = 0X1.FFFFE4P+23
0x00FFFFF3: 0x4B7FFFF3 = 1.6777203e+07 = 0X1.FFFFE6P+23
0x00FFFFF4: 0x4B7FFFF4 = 1.6777204e+07 = 0X1.FFFFE8P+23
0x00FFFFF5: 0x4B7FFFF5 = 1.6777205e+07 = 0X1.FFFFEAP+23
0x00FFFFF6: 0x4B7FFFF6 = 1.6777206e+07 = 0X1.FFFFECP+23
0x00FFFFF7: 0x4B7FFFF7 = 1.6777207e+07 = 0X1.FFFFEEP+23
0x00FFFFF8: 0x4B7FFFF8 = 1.6777208e+07 = 0X1.FFFFF0P+23
0x00FFFFF9: 0x4B7FFFF9 = 1.6777209e+07 = 0X1.FFFFF2P+23
0x00FFFFFA: 0x4B7FFFFA = 1.6777210e+07 = 0X1.FFFFF4P+23
0x00FFFFFB: 0x4B7FFFFB = 1.6777211e+07 = 0X1.FFFFF6P+23
0x00FFFFFC: 0x4B7FFFFC = 1.6777212e+07 = 0X1.FFFFF8P+23
0x00FFFFFD: 0x4B7FFFFD = 1.6777213e+07 = 0X1.FFFFFAP+23
0x00FFFFFE: 0x4B7FFFFE = 1.6777214e+07 = 0X1.FFFFFCP+23
0x00FFFFFF: 0x4B7FFFFF = 1.6777215e+07 = 0X1.FFFFFEP+23
0x01000000: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000001: 0x4B800000 = 1.6777216e+07 = 0X1.000000P+24
0x01000002: 0x4B800001 = 1.6777218e+07 = 0X1.000002P+24
0x01000003: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000004: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000005: 0x4B800002 = 1.6777220e+07 = 0X1.000004P+24
0x01000006: 0x4B800003 = 1.6777222e+07 = 0X1.000006P+24
0x01000007: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000008: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x01000009: 0x4B800004 = 1.6777224e+07 = 0X1.000008P+24
0x0100000A: 0x4B800005 = 1.6777226e+07 = 0X1.00000AP+24
0x0100000B: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000C: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000D: 0x4B800006 = 1.6777228e+07 = 0X1.00000CP+24
0x0100000E: 0x4B800007 = 1.6777230e+07 = 0X1.00000EP+24
0x0100000F: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000010: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000011: 0x4B800008 = 1.6777232e+07 = 0X1.000010P+24
0x01000012: 0x4B800009 = 1.6777234e+07 = 0X1.000012P+24
0x01000013: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000014: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000015: 0x4B80000A = 1.6777236e+07 = 0X1.000014P+24
0x01000016: 0x4B80000B = 1.6777238e+07 = 0X1.000016P+24
0x01000017: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000018: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x01000019: 0x4B80000C = 1.6777240e+07 = 0X1.000018P+24
0x0100001A: 0x4B80000D = 1.6777242e+07 = 0X1.00001AP+24
0x0100001B: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001C: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001D: 0x4B80000E = 1.6777244e+07 = 0X1.00001CP+24
0x0100001E: 0x4B80000F = 1.6777246e+07 = 0X1.00001EP+24
0x0100001F: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000020: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000021: 0x4B800010 = 1.6777248e+07 = 0X1.000020P+24
0x01000022: 0x4B800011 = 1.6777250e+07 = 0X1.000022P+24
0x01000023: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000024: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000025: 0x4B800012 = 1.6777252e+07 = 0X1.000024P+24
0x01000026: 0x4B800013 = 1.6777254e+07 = 0X1.000026P+24
0x01000027: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000028: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x01000029: 0x4B800014 = 1.6777256e+07 = 0X1.000028P+24
0x0100002A: 0x4B800015 = 1.6777258e+07 = 0X1.00002AP+24
0x0100002B: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002C: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002D: 0x4B800016 = 1.6777260e+07 = 0X1.00002CP+24
0x0100002E: 0x4B800017 = 1.6777262e+07 = 0X1.00002EP+24
0x0100002F: 0x4B800018 = 1.6777264e+07 = 0X1.000030P+24
The first part of the output demonstrates that some integer values can still be stored exactly; specifically, powers of 2 can be stored exactly. In fact, more precisely (but less concisely), any integer where binary representation of the absolute value has no more than 24 significant digits (any trailing digits are zeros) can be represented exactly. The values can't necessarily be printed exactly, but that's a separate issue from storing them exactly.
The second (larger) part of the output demonstrates that up to 2^24-1, the integer values can be represented exactly. The value of 2^24 itself is also exactly representable, but 2^24+1 is not, so it appears the same as 2^24. By contrast, 2^24+2 can be represented with just 24 binary digits followed by 1 zero and hence can be represented exactly. Repeat ad nauseam for increments larger than 2. It looks as though 'round even' mode is in effect; that's why the results show 1 value then 3 values.
(I note in passing that there isn't a way to stipulate that the double passed to printf() — converted from float by the rules of default argument promotions (ISO/IEC 9899:2011 §6.5.2.2 Function calls, ¶6) be printed as a float() — the h modifier would logically be used, but is not defined.)
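The representability rule described above (no more than 24 significant bits once trailing zeros are ignored) can be sketched as a small check. This is only an illustration under the assumption that float is IEEE 754 binary32; `fits_in_float` is a hypothetical name, not a standard function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Hypothetical helper: true if x converts to float without rounding,
 * assuming IEEE 754 binary32 (24-bit significand). */
static bool fits_in_float(int32_t x)
{
    /* Magnitude as unsigned, so INT32_MIN is handled without overflow. */
    uint32_t a = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x;

    /* Trailing zero bits only affect the exponent, so strip them. */
    while (a != 0 && (a & 1u) == 0)
        a >>= 1;

    /* What remains must fit into 24 significant bits. */
    return a < (UINT32_C(1) << 24);
}
```

With this check, 16777216 (2^24) and 16777218 are reported as exact while 16777217 is not, which matches the sample output.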

C/C++ floats tend to be compatible with the IEEE 754 floating point standard (e.g. in gcc). The zeros come from the rounding rules.
Shifting a number to the right makes some bits from the right-hand side go away. Let's call them guard bits. Now let's call HO the highest bit and LO the lowest bit of our number. Now suppose that the guard bits are still a part of our number. If, for example, we have 3 guard bits it means that the value of our LO bit is 8 (if it is set). Now if:
If value of guard bits > 0.5 * value of LO:
    round the number to the smallest possible greater value, ignoring the sign
If value of guard bits == 0.5 * value of LO:
    use the current number value if the LO bit is 0
    number += 1 otherwise
If value of guard bits < 0.5 * value of LO:
    use the current number value
why do 3 guard bits mean the LO value is 8 ?
Suppose we have a binary 8 bit number:
weights: 128 64 32 16 8 4 2 1
binary num: 0 0 0 0 1 1 1 1
Let's shift it right by 3 bits:
weights: x x x 128 64 32 16 8 | 4 2 1
binary num: 0 0 0 0 0 0 0 1 | 1 1 1
As you see, with 3 guard bits the LO bit ends up at the 4th position and has a weight of 8. It is true only for the purpose of rounding. The weights have to be 'normalized' afterwards, so that the weight of LO bit becomes 1 again.
And how can I check with bit operations if guard bits > 0.5 * value ??
The fastest way is to employ lookup tables. Suppose we're working on an 8 bit number:
unsigned number; //our number
unsigned bitsToShift; //number of bits to shift
assert(bitsToShift < 8); //8 bits
unsigned guardMasks[8] = {0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f};
unsigned LOvalues[8] = {0, 1, 2, 4, 0x8, 0x10, 0x20, 0x40}; //weight of LO divided by 2 for faster comparison
unsigned guardBits = number & guardMasks[bitsToShift]; //value of the guard bits
number = number >> bitsToShift;
if(guardBits > LOvalues[bitsToShift]) {
...
} else if (guardBits == LOvalues[bitsToShift]) {
...
} else { //guardBits < LOvalues[bitsToShift]
...
}
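The same three cases can also be computed without lookup tables. The following is a minimal sketch, assuming 32-bit unsigned values and round-to-nearest with ties to even; `shift_right_rne` is a hypothetical name used only for illustration and does not come from the book.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift v right by `shift` bits (1..31), rounding the
 * result to nearest, with ties going to the even value. */
static uint32_t shift_right_rne(uint32_t v, unsigned shift)
{
    assert(shift >= 1 && shift <= 31);

    uint32_t guard = v & ((UINT32_C(1) << shift) - 1); /* bits shifted out   */
    uint32_t half  = UINT32_C(1) << (shift - 1);       /* 0.5 * weight of LO */
    uint32_t q     = v >> shift;                       /* truncated result   */

    if (guard > half)
        q += 1;          /* above the midpoint: round up   */
    else if (guard == half)
        q += q & 1u;     /* exactly halfway: round to even */
    /* below the midpoint: keep the truncated value */

    return q;
}
```

For example, shifting 12 right by 3 (1.5 in units of the new LO bit) gives 2, while shifting 20 right by 3 (2.5) also gives 2, because ties go to the even neighbour.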
Reference: Write Great Code, Volume 1 by Randall Hyde

Related

Multiply float by a number using bitwise operators

I have this function that takes in the bits of a float (f) as a uint32_t. It should use bit operations and + to calculate f * 2048 and should return the bits of this value as a uint32_t.
If the result is too large to be represented as a float, +inf or -inf should be returned; and if f is +0, -0, +inf, -inf, or NaN, it should be returned unchanged.
uint32_t float_2048(uint32_t f) {
uint32_t a = (f << 1) ;
int result = a << 10;
return result;
}
This is what I have so far but if I give it the value '1' it returns 0 instead of 2048. How do I fix this?
Some example inputs and outputs:
./float_2048 1
2048
./float_2048 3.14159265
6433.98193
./float_2048 -2.718281828e-20
-5.56704133e-17
./float_2048 1e38
inf
As mentioned in the comments, to multiply a floating-point number by a power of 2 (assuming, as is likely, that it is represented in IEEE-754 format), we can just add that power to the (binary) exponent part of the representation.
For a single-precision (32-bit) float value, that exponent is stored in bits 30-23 and the following code shows how to extract those, add the required value (11, because 2048 = 2^11), then replace the exponent bits with that modified value.
uint32_t fmul2048(uint32_t f)
{
#define EXPONENT 0x7F800000u
#define SIGN_BIT 0x80000000u
uint32_t expon = (f & EXPONENT) >> 23; // Get exponent value
f &= ~EXPONENT; // Remove old exponent
expon += 11; // Adding 11 to exponent multiplies by 2^11 (= 2048);
if (expon > 254) return EXPONENT | (f & SIGN_BIT); // Too big: return +/- Inf
f |= (expon << 23); // Insert modified exponent
return f;
}
There will, no doubt, be some "bit trickery" that can be applied to make the code smaller and/or more efficient; but I have avoided doing so in order to keep the code clear. I have also included one error check (for a too large exponent) and the code returns the standard representation for +/- Infinity (all exponent bits set to 1, and keeping the original sign) if that test fails. (I leave other error-checking as an "exercise for the reader".)
To handle all float takes more code.
Do some tests so the code can assume the expected float size, matching endianness and (IEEE) encoding. C does not require float to be 32-bit, to match the endianness of an integer, or to use binary32 encoding, even though that is common.
Extract the biased exponent and look for its min and max value.
Max values signify NAN or infinity.
Min values are sub-normals and zero and need special handling. The significand needs to be shifted. If that result is now a normal float, re-encode it.
Biased exponents in between simply need an increment and a test for exceeding FLT_MAX's exponent.
Tested successfully for all float.
#include <assert.h>
#include <stdint.h>
static_assert(sizeof(uint32_t) == sizeof(float), "Unexpected float size");
#define IEEE_MASK_BIASED_EXPO 0x7F800000u
#define IEEE_MASK_BIASED_EXPO_LSB 0x00800000u
#define IEEE_MASK_SIGNIFICAND 0x007FFFFFu
#define IEEE_SIGNIFICAND_MAX 0x00FFFFFFu
#define IEEE_INFINITY 0x7F800000u
// Scale value by 2048
uint32_t float_2048(uint32_t f) {
uint32_t expo = f & IEEE_MASK_BIASED_EXPO;
// Test for infinity or NAN
if (expo == IEEE_MASK_BIASED_EXPO) {
return f;
}
// Sub-normal and zero test
if (expo == 0) {
uint64_t sig = f & IEEE_MASK_SIGNIFICAND;
sig <<= 11; // *= 2048;
// If value now a normal one
if (sig > IEEE_MASK_SIGNIFICAND) {
expo += IEEE_MASK_BIASED_EXPO_LSB;
while (sig > IEEE_SIGNIFICAND_MAX) {
sig >>= 1;
expo += IEEE_MASK_BIASED_EXPO_LSB;
}
f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
}
f = (f & ~IEEE_MASK_SIGNIFICAND) | (sig & IEEE_MASK_SIGNIFICAND);
} else {
expo += 11 * IEEE_MASK_BIASED_EXPO_LSB; // *= 2048;
if (expo >= IEEE_MASK_BIASED_EXPO) {
f &= ~(IEEE_MASK_BIASED_EXPO | IEEE_MASK_SIGNIFICAND);
f |= IEEE_INFINITY;
} else {
f = (f & ~IEEE_MASK_BIASED_EXPO) | (expo & IEEE_MASK_BIASED_EXPO);
}
}
return f;
}
Test code.
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef union {
uint32_t u32;
float f;
} fu32;
int main(void ) {
// Lightweight test to see if endian matches and IEEE encoding
assert((fu32) {.u32 = 0x87654321}.f == -1.72477726182e-34f);
float f[] = {0, FLT_TRUE_MIN, FLT_MIN, 1, FLT_MAX};
size_t n = sizeof f/sizeof f[0];
for (size_t i = 0; i<n; i++) {
fu32 x = { .f = f[i] };
float y0 = x.f * 2048.0f;
fu32 y1 = { .u32 = float_2048(x.u32) };
if (memcmp(&y0, &y1.f, sizeof y0)) {
printf("%.9g %.9g\n", y0, y1.f);
}
}
fu32 x = { .u32 = 0 };
do {
fu32 y0 = { .f = isnan(x.f) ? x.f : x.f * 2048.0f };
fu32 y1 = { .u32 = float_2048(x.u32) };
if (memcmp(&y0.f, &y1.f, sizeof y0)) {
printf("%.9g %.9g\n", y0.f, y1.f);
printf("%08lx %08lx %08lx\n", (unsigned long) x.u32,
(unsigned long) y0.u32, (unsigned long) y1.u32);
break;
}
x.u32++;
} while (x.u32 != 0);
puts("Done");
}

Converting negative numbers to positive numbers but keeping positive numbers unchanged

I want to apply a bitmask to a number that will mimic the absolute value function for 2's complement encoded signed 32 bit integers. So far, I have
int absoluteValue(int x) {
int sign = x >> 31; //get most significant byte...all 1's if x is < 0, all 0's if x >= 0
int negated = (~x + 1) & sign; //negates the number if negative, sets to 0 if positive
//what should go here???
}
Am I going in the right direction? I'm not really sure where to go from here (mostly just how to apply a mask to keep the original positive value). I also don't want to use any conditional statements
Bizarre question. What about
return (negated << 1) + x;
So put together this makes:
int absoluteValue(int x) {
int sign = x >> 31; //get most significant byte...all 1's if x is < 0, all 0's if x >= 0
int negated = (~x + 1) & sign; //negates the number if negative, sets to 0 if positive
return (negated << 1) + x;
}
The last part
negated = (~x + 1) & sign;
is wrong, you are going to get either 1 or 0, you have to create a mask with all
first 31 bits to 0 and only the last one to either 0 or 1.
Assuming that your target uses 32-bit integers in two's complement, you can do this:
#include <stdio.h>
// assuming 32bit, 2 complement
int sign_inverse(int n)
{
int mask = ~n & 0x80000000U;
if(n == 0)
mask = 0;
return (~n + 1) | mask;
}
int main(void)
{
int a = 5;
int b = -4;
int c = 54;
int d = 0;
printf("sign_inverse(%d) = %d\n", a, sign_inverse(a));
printf("sign_inverse(%d) = %d\n", b, sign_inverse(b));
printf("sign_inverse(%d) = %d\n", c, sign_inverse(c));
printf("sign_inverse(%d) = %d\n", d, sign_inverse(d));
return 0;
}
but you need at least 1 if for the case of 0, because the mask for 0 is 0x80000000.
The output of this is:
$ ./b
sign_inverse(5) = -5
sign_inverse(-4) = 4
sign_inverse(54) = -54
sign_inverse(0) = 0
Please note that two's complement representation is not guaranteed, and that the behaviour of operator >> on signed values, where the result gets "filled" with 1-bits, is implementation-defined (cf., for example, cppreference.com/arithmetic operations):
For negative LHS, the value of LHS >> RHS is implementation-defined
where in most implementations, this performs arithmetic right shift
(so that the result remains negative). Thus in most implementations,
right shifting a signed LHS fills the new higher-order bits with the
original sign bit (i.e. with 0 if it was non-negative and 1 if it was
negative).
But if you take this as given, and if you just want to use bitwise operations and operator +, you are already going in the right direction.
The only thing is that you should take into account the mask you create (i.e. your sign), so that you toggle the bits of x only in the case where x is negative. You can achieve this with the XOR operator as follows:
int x = -3000;
unsigned int mask = x >> 31;
int sign = mask & 0x01;
int positive = (x^mask) + sign;
printf("x:%d mask:%0X sign:%d positive:%d\n",x,mask,sign,positive);
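The XOR idea can be wrapped in a function that avoids the implementation-defined signed right shift by doing all the work on an unsigned copy. This is a sketch only, assuming a 32-bit int; `abs_bits` is a hypothetical name, and the result is returned as unsigned so that INT_MIN also has a representable answer.

```c
#include <assert.h>
#include <limits.h>

static_assert(sizeof(int) * CHAR_BIT == 32, "sketch assumes a 32-bit int");

/* Hypothetical helper: magnitude of x using only bitwise operators and +. */
static unsigned abs_bits(int x)
{
    unsigned ux   = (unsigned)x;  /* well-defined modular conversion     */
    unsigned sign = ux >> 31;     /* 1 if x is negative, 0 otherwise     */
    unsigned mask = ~sign + 1u;   /* 0xFFFFFFFF if negative, 0 otherwise */

    /* Negative: flip all bits and add 1 (two's complement negate).
     * Non-negative: XOR with 0 and adding 0 leave the value unchanged. */
    return (ux ^ mask) + sign;
}
```

Here abs_bits(-3000) and abs_bits(3000) both yield 3000, and abs_bits(INT_MIN) yields 2147483648 as an unsigned value.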

Sign extension, addition and subtraction binary in C

How would I go about implementing a sign extend from 16 bits to 32 bits in C code?
I am supposed to be using bitwise operators. I also need to add and subtract; can anyone point me in the right direction? I did the first 4 but am confused on the rest. I have to incorporate a for loop somewhere as well for 1 of the cases.
I am not allowed to use any arithmetic operators (+, -, /, *) and no if statements.
Here is the code for the switch statement I am currently editing:
unsigned int csc333ALU(const unsigned int opcode,
const unsigned int argument1,
const unsigned int argument2) {
unsigned int result;
switch(opcode) {
case(0x01): // result = NOT argument1
result = ~(argument1);
break;
case(0x02): // result = argument 1 OR argument 2
result = argument1 | argument2;
break;
case(0x03): // result = argument 1 AND argument 2
result = argument1 & argument2;
break;
case(0x04): // result = argument 1 XOR argument 2
result = argument1 ^ argument2;
break;
case(0x05): // result = 16 bit argument 1 sign extended to 32 bits
result = 0x00000000;
break;
case(0x06): // result = argument1 + argument2
result = 0x00000000;
break;
case(0x07): // result = -argument1. In two's complement, negate and add 1.
result = 0x00000000;
break;
default:
printf("Invalid opcode: %X\n", opcode);
result = 0xFFFFFFFF;
}
partial answer for sign extension:
result = (argument1 & 0x8000) == 0x8000 ? 0xFFFF0000 | argument1 : argument1;
To sign-extend a 16 bit number to 32 bit, you need to copy bit 15 to the upper bits. The naive way to do this is with 16 instructions, copying bit 15 to bit 16, then 17, then 18, and so on. But you can do it more efficiently by using previously copied bits and doubling the number of bits you've copied each time like this:
unsigned int ext = (argument1 & 0x8000U) << 1;
ext |= ext << 1;
ext |= ext << 2;
ext |= ext << 4;
ext |= ext << 8;
result = (argument1 & 0xffffU) | ext;
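The doubling approach can be packaged and checked in isolation. The sketch below assumes a 32-bit unsigned int, as the ALU code appears to; `sign_extend16` is a hypothetical name for illustration.

```c
#include <assert.h>
#include <limits.h>

static_assert(sizeof(unsigned) * CHAR_BIT == 32, "sketch assumes a 32-bit unsigned int");

/* Hypothetical helper: sign-extend the low 16 bits of v to 32 bits using
 * only shifts and bitwise operators (no arithmetic, no branches). */
static unsigned sign_extend16(unsigned v)
{
    unsigned ext = (v & 0x8000U) << 1; /* copy bit 15 into bit 16 */
    ext |= ext << 1;                   /* bits 16..17 now set     */
    ext |= ext << 2;                   /* bits 16..19             */
    ext |= ext << 4;                   /* bits 16..23             */
    ext |= ext << 8;                   /* bits 16..31             */
    return (v & 0xffffU) | ext;        /* ext is 0 if bit 15 was clear */
}
```

An input of 0x8000 becomes 0xFFFF8000, while 0x7FFF is returned unchanged; any bits above bit 15 in the input are discarded.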
To add two 32 bit numbers "manually" then you can simply do it bit by bit.
unsigned carry = 0;
result = 0;
for (int i = 0; i < 32; i++) {
// Extract the ith bit from argument1 and argument 2.
unsigned a1 = (argument1 >> i) & 1;
unsigned a2 = (argument2 >> i) & 1;
// The ith bit of result is set if 1 or 3 of a1, a2, carry is set.
unsigned v = a1 ^ a2 ^ carry;
result |= v << i;
// The new carry is 1 if at least two of a1, a2, carry is set.
carry = (a1 & a2) | (a1 & carry) | (a2 & carry);
}
Subtraction works with almost exactly the same code: a - b is the same as a + (~b+1) in two's complement arithmetic. Because you aren't allowed to simply add 1, you can achieve the same by initialising carry to 1 instead of 0.
unsigned carry = 1;
result = 0;
for (int i = 0; i < 32; i++) {
unsigned a1 = (argument1 >> i) & 1;
unsigned a2 = (~argument2 >> i) & 1;
unsigned v = a1 ^ a2 ^ carry;
result |= v << i;
carry = (a1 & a2) | (a1 & carry) | (a2 & carry);
}
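Since the two loops differ only in the initial carry and in whether the second operand is inverted, they can share one routine. A minimal sketch, assuming a 32-bit unsigned int; `add_bits` and `sub_bits` are hypothetical names. As in the loops above, the loop counter still uses ++.

```c
#include <assert.h>
#include <limits.h>

static_assert(sizeof(unsigned) * CHAR_BIT == 32, "sketch assumes a 32-bit unsigned int");

/* Hypothetical helper: ripple-carry computation of a + b + carry (carry is 0 or 1). */
static unsigned add_bits(unsigned a, unsigned b, unsigned carry)
{
    unsigned result = 0;
    for (int i = 0; i < 32; i++) {
        unsigned a1 = (a >> i) & 1;
        unsigned b1 = (b >> i) & 1;
        result |= (a1 ^ b1 ^ carry) << i;                /* sum bit               */
        carry = (a1 & b1) | (a1 & carry) | (b1 & carry); /* majority of the three */
    }
    return result;
}

/* a - b == a + ~b + 1, with the +1 supplied as the initial carry. */
static unsigned sub_bits(unsigned a, unsigned b)
{
    return add_bits(a, ~b, 1);
}
```

For instance, add_bits(5, 7, 0) gives 12 and sub_bits(5, 7) gives 0xFFFFFFFE, the two's complement pattern for -2.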
To find two's complement without doing the negation, similar ideas apply. Bitwise negate and then add 1. Adding 1 is simpler than adding argument2, so the code is correspondingly simpler.
result = ~argument1;
unsigned carry = 1;
for (int i = 0; i < 32 && carry; i++) {
carry &= (result >> i);
result ^= (1u << i);
}
to get sign extension from short int to int....
short int iShort = value;
int i = iShort; // compiler automatically creates code that performs sign extension
Note: going from i to iShort will generate a compiler warning.
however, for other situations...
There is no need for an explicit comparison: the & yields either 0 or a nonzero value, which ?: tests directly. Be sure to cast the parts of the calculation to int.
int i = (argument & 0x8000) ? (int)(0xFFFF0000u | argument) : (int)argument;

How to manually (bitwise) perform (float)x?

Now, here is the function header of the function I'm supposed to implement:
/*
* float_from_int - Return bit-level equivalent of expression (float) x
* Result is returned as unsigned int, but
* it is to be interpreted as the bit-level representation of a
* single-precision floating point values.
* Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
* Max ops: 30
* Rating: 4
*/
unsigned float_from_int(int x) {
...
}
We aren't allowed to do float operations, or any kind of casting.
Now I tried to implement the first algorithm given at this site: http://locklessinc.com/articles/i2f/
Here's my code:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if(x == 0x80000000){ //Updated to add this
return 0xcf000000;
}
//int shiftsNeeded = 0;
/*while(){
shiftsNeeded++;
}*/
unsigned I2F_MAX_BITS = 15;
unsigned I2F_MAX_INPUT = ((1 << I2F_MAX_BITS) - 1);
unsigned I2F_SHIFT = (24 - I2F_MAX_BITS);
unsigned result, i, exponent, fraction;
if ((absValOfX & I2F_MAX_INPUT) == 0)
result = 0;
else {
exponent = 126 + I2F_MAX_BITS;
fraction = (absValOfX & I2F_MAX_INPUT) << I2F_SHIFT;
i = 0;
while(i < I2F_MAX_BITS) {
if (fraction & 0x800000)
break;
else {
fraction = fraction << 1;
exponent = exponent - 1;
}
i++;
}
result = (xIsNegative << 31) | exponent << 23 | (fraction & 0x7fffff);
}
return result;
}
But it didn't work (see test error below):
ERROR: Test float_from_int(8388608[0x800000]) failed...
...Gives 0[0x0]. Should be 1258291200[0x4b000000]
I don't know where to go from here. How should I go about parsing the float from this int?
EDIT #1:
You might be able to see from my code that I also started working on this algorithm (see this site):
I assumed 10-bit, 2’s complement, integers since the mantissa is only
9 bits, but the process generalizes to more bits.
Save the sign bit of the input and take the absolute value of the input.
Shift the input left until the high order bit is set and count the number of shifts required. This forms the floating mantissa.
Form the floating exponent by subtracting the number of shifts from step 2 from the constant 137 or (0h89-(#of shifts)).
Assemble the float from the sign, mantissa, and exponent.
But, that doesn't seem right. How could I convert 0x80000000? Doesn't make sense.
EDIT #2:
I think it's because I say max bits is 15... hmmm...
EDIT #3: Screw that old algorithm, I'm starting over:
unsigned float_from_int(int x) {
// grab sign bit
int xIsNegative = 0;
int absValOfX = x;
if(x < 0){
xIsNegative = 1;
absValOfX = -x;
}
// zero case
if(x == 0){
return 0;
}
if (x == 0x80000000){
return 0xcf000000;
}
int shiftsNeeded = 0;
int counter = 0;
while(((absValOfX >> counter) & 1) != 1 && shiftsNeeded < 32){
counter++;
shiftsNeeded++;
}
unsigned exponent = shiftsNeeded + 127;
unsigned result = (xIsNegative << 31) | (exponent << 23);
return result;
Here's the error I get on this one (I think I got past the last error):
ERROR: Test float_from_int(-2139095040[0x80800000]) failed...
...Gives -889192448[0xcb000000]. Should be -822149120[0xceff0000]
May be helpful to know that:
absValOfX = 7f800000
(using printf)
EDIT #4: Ah, I'm finding the exponent wrong, need to count from the left, then subtract from 32 I believe.
EDIT #5: I started over, now trying to deal with weird rounding problems...
if (x == 0){
return 0; // 0 is a special case because it has no 1 bits
}
if (x >= 0x80000000 && x <= 0x80000040){
return 0xcf000000;
}
// Save the sign bit of the input and take the absolute value of the input.
unsigned signBit = 0;
unsigned absX = (unsigned)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (unsigned)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
unsigned exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
unsigned negativeRoundUp = (absX >> 7) & 1 & (absX >> 8);
// compute mantissa
unsigned mantissa = (absX >> 8) + ((negativeRoundUp) || (!signBit & (absX >> 7) & (exponent < 156)));
printf("absX = %x, absX >> 8 = %x, exponent = %i, mantissa = %x\n", absX, (absX >> 8), exponent, mantissa);
// Assemble the float from the sign, mantissa, and exponent.
return signBit | ((exponent << 23) + (signBit & negativeRoundUp)) | ( (mantissa) & 0x7fffff);
-
absX = fe000084, absX >> 8 = fe0000, exponent = 156, mantissa = fe0000
ERROR: Test float_from_int(1065353249[0x3f800021]) failed...
...Gives 1316880384[0x4e7e0000]. Should be 1316880385[0x4e7e0001]
EDIT #6
Did it again, still, the rounding doesn't work properly. I've tried to hack together some rounding, but it just won't work...
unsigned float_from_int(int x) {
/*
If N is negative, negate it in two's complement. Set the high bit (2^31) of the result.
If N < 2^23, left shift it (multiply by 2) until it is greater or equal to.
If N ≥ 2^24, right shift it (unsigned divide by 2) until it is less.
Bitwise AND with ~2^23 (one's complement).
If it was less, subtract the number of left shifts from 150 (127+23).
If it was more, add the number of right shifts to 150.
This new number is the exponent. Left shift it by 23 and add it to the number from step 3.
*/
printf("---------------\n");
//printf("x = %i (%x), -x = %i, (%x)\n", x, x, -x, -x);
if(x == 0){
return 0;
}
if(x == 0x80000000){
return 0xcf000000;
}
// If N is negative, negate it in two's complement. Set the high bit of the result
unsigned signBit = 0;
if (x < 0){
signBit = 0x80000000;
x = -x;
}
printf("abs val of x = %i (%x)\n", x, x);
int roundTowardsZero = 0;
int lastDigitLeaving = 0;
int shiftAmount = 0;
int originalAbsX = x;
// If N < 2^23, left shift it (multiply it by 2) until it is great or equal to.
if(x < (8388608)){
while(x < (8388608)){
//printf(" minus shift and x = %i", x );
x = x << 1;
shiftAmount--;
}
} // If N >= 2^24, right shift it (unsigned divide by 2) until it is less.
else if(x >= (16777215)){
while(x >= (16777215)){
/*if(x & 1){
roundTowardsZero = 1;
printf("zzz Got here ---");
}*/
lastDigitLeaving = (x >> 1) & 1;
//printf(" plus shift and x = %i", x);
x = x >> 1;
shiftAmount++;
}
//Round towards zero
x = (x + (lastDigitLeaving && (!(originalAbsX > 16777216) || signBit)));
printf("x = %i\n", x);
//shiftAmount = shiftAmount + roundTowardsZero;
}
printf("roundTowardsZero = %i, shiftAmount = %i (%x)\n", roundTowardsZero, shiftAmount, shiftAmount);
// Bitwise AND with 0x7fffff
x = x & 0x7fffff;
unsigned exponent = 150 + shiftAmount;
unsigned rightPlaceExponent = exponent << 23;
printf("exponent = %i, rightPlaceExponent = %x\n", exponent, rightPlaceExponent);
unsigned result = signBit | rightPlaceExponent | x;
return result;
The problem is that the lowest int is -2147483648, but the highest is 2147483647, so there is no absolute value of -2147483648. While you could work around it, I would just make a special case for that one bit pattern (like you do for 0):
if (x == 0)
return 0;
if (x == -2147483648)
return 0xcf000000;
The other problem is that you copied an algorithm that only works for numbers from 0 to 32767. Further down in the article they explain how to expand it to all ints, but it uses operations that you're likely not allowed to use.
I would recommend writing it from scratch based on the algorithm mentioned in your edit. Here's a version in C# that rounds towards 0:
uint float_from_int(int x)
{
if (x == 0)
return 0; // 0 is a special case because it has no 1 bits
// Save the sign bit of the input and take the absolute value of the input.
uint signBit = 0;
uint absX = (uint)x;
if (x < 0)
{
signBit = 0x80000000u;
absX = (uint)-x;
}
// Shift the input left until the high order bit is set to form the mantissa.
// Form the floating exponent by subtracting the number of shifts from 158.
uint exponent = 158;
while ((absX & 0x80000000) == 0)
{
exponent--;
absX <<= 1;
}
// compute mantissa
uint mantissa = absX >> 8;
// Assemble the float from the sign, mantissa, and exponent.
return signBit | (exponent << 23) | (mantissa & 0x7fffff);
}
The basic formulation of the algorithm is to determine the sign, exponent and mantissa bits, then pack the result into an integer. Breaking it down this way makes it easy to clearly separate the tasks in code and makes solving the problem (and testing your algorithm) much easier.
The sign bit is the easiest, and getting rid of it makes finding the exponent easier. You can distinguish four cases: 0, 0x80000000, [-0x7fffffff, -1], and [1, 0x7fffffff]. The first two are special cases, and you can trivially get the sign bit in the last two cases (and the absolute value of the input). If you're going to cast to unsigned, you can get away with not special-casing 0x80000000 as I mentioned in a comment.
Next up, find the exponent -- there's an easy (and costly) looping way, and a trickier but faster way to do this. My absolute favourite page for this is Sean Anderson's bit hacks page. One of the algorithms shows a very quick loop-less way to find the log2 of an integer in only seven operations.
Once you know the exponent, then finding the mantissa is easy. You just drop the leading one bit, then shift the result either left or right depending on the exponent's value.
If you use the fast log2 algorithm, you can probably end up with an algorithm which uses no more than 20 operations.
Dealing with 0x80000000 is pretty easy:
int xIsNegative = 0;
unsigned int absValOfX = x;
if (x < 0)
{
xIsNegative = 1;
absValOfX = -(unsigned int)x;
}
It gets rid of special casing -2147483648 since that value is representable as an unsigned value, and absValOfX should always be positive.
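Putting the pieces from the answers together (unsigned negation, normalising the leading 1 to bit 31, then rounding the 8 discarded bits), a version that rounds to nearest with ties to even might look like the sketch below. It assumes IEEE 754 binary32 and that the compiler's own (float)x uses the default round-to-nearest-even mode. The name `float_from_int_rne` is hypothetical, and the sketch uses casts and makes no attempt to stay within the assignment's operator limit.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Hypothetical name; returns the binary32 bit pattern of (float)x,
 * rounding to nearest with ties to even. */
static uint32_t float_from_int_rne(int32_t x)
{
    if (x == 0)
        return 0;

    uint32_t sign = (x < 0) ? UINT32_C(0x80000000) : 0;
    /* Negate in unsigned arithmetic so INT32_MIN needs no special case. */
    uint32_t a = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x;

    /* Normalise: move the leading 1 to bit 31, tracking the biased exponent. */
    uint32_t exponent = 158; /* 127 + 31 */
    while ((a & UINT32_C(0x80000000)) == 0) {
        a <<= 1;
        exponent--;
    }

    uint32_t mantissa = a >> 8;  /* 24 bits, including the implicit 1 */
    uint32_t guard = a & 0xFFu;  /* the 8 bits that are shifted out   */

    /* Round up above the midpoint, or at the midpoint when the LSB is odd. */
    if (guard > 0x80u || (guard == 0x80u && (mantissa & 1u)))
        mantissa++;

    /* A carry out of bit 23 means the value rounded up to the next power of two. */
    if (mantissa >> 24) {
        mantissa >>= 1;
        exponent++;
    }

    return sign | (exponent << 23) | (mantissa & UINT32_C(0x7FFFFF));
}
```

On the failing cases quoted in the question, this sketch gives 0x4B000000 for 0x800000, 0xCEFF0000 for 0x80800000, and 0x4E7E0001 for 0x3F800021.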

Two's complement stm32 c

I have a number that is the "significant byte"; it may be 0 or 255,
which means 0 or -1.
How can I convert 255 to -1 in one step?
I have an expression that doesn't work for me:
acc->x = ((raw_data[1]) << 8) | raw_data[0];
Assuming that the 8th bit being set means the value is negative (254 == -2), a widening conversion from a signed type should do:
int n = (signed char)somebyte;
so
unsigned char raw_data[2] = ...;
int msbyte = (signed char)raw_data[1];
acc->x = (msbyte << 8) | (raw_data[0] & 0xFF);
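An alternative that avoids left-shifting a negative value is to assemble the 16-bit pattern first and then adjust for the sign. This is a sketch, assuming the two bytes hold a little-endian two's complement 16-bit value (low byte first, as the indexing in the question suggests); `le16_to_int` is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");

/* Hypothetical helper: combine a little-endian byte pair into a signed
 * 16-bit two's complement value, returned as int. */
static int le16_to_int(const unsigned char raw[2])
{
    int v = (raw[1] << 8) | raw[0]; /* 0 .. 65535 */
    if (v & 0x8000)                 /* bit 15 set: the value is negative */
        v -= 0x10000;
    return v;
}
```

With the bytes {0xFF, 0xFF} the helper returns -1, and with {0xFE, 0xFF} it returns -2.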
I am not sure what is required but here are the rules for arithmetic conversions of integers.
If an integer is assigned to a narrower integer type, the data will be truncated.
Example:
struct A {
int c1 : 8;
unsigned c2 : 8;
} a;
int main()
{
short int i = 255; // right 8 bits containing all bits set
a.c1 = i; // or a.c1 = 255. casting not required.
a.c2 = i; // same as above.
// prints -1, 255
printf("c1: %d c2: %d\n", a.c1, a.c2);
i = 511; // 9 number of 1 bits
a.c1 = i; // left 9th bit will be truncated. casting not required.
a.c2 = i; // same as above
// prints -1, 255
printf("c1: %d c2: %d\n", a.c1, a.c2);
return 0;
}
If a signed 8-bit integer (or a char, where char is signed) is assigned to a wider integer (say int), its sign bit is extended into the upper bits.
Example:
char c = 255; // stored as -1 where char is signed (implementation-defined)
int i = c; // i is now -1: the sign bit is copied into the upper bits
