Bit shifting for fixed point arithmetic on float numbers in C - c

i wrote the following test code to check fixed point arithmetic and bit shifting.
void main(){
float x = 2;
float y = 3;
float z = 1;
unsigned int * px = (unsigned int *) (& x);
unsigned int * py = (unsigned int *) (& y);
unsigned int * pz = (unsigned int *) (& z);
*px <<= 1;
*py <<= 1;
*pz <<= 1;
*pz =*px + *py;
*px >>= 1;
*py >>= 1;
*pz >>= 1;
printf("%f %f %f\n",x,y,z);
}
The result is
2.000000 3.000000 0.000000
Why is the last number 0? I was expecting to see a 5.000000
I want to use some kind of fixed point arithmetic to bypass the use of floating point numbers on an image processing application. Which is the best/easiest/most efficient way to turn my floating point arrays into integers? Is the above "tricking the compiler" a robust workaround? Any suggestions?

If you want to use fixed point, dont use type 'float' or 'double' because them has internal structure. Floats and Doubles have specific bit for sign; some bits for exponent, some for mantissa (take a look on color image here); so they inherently are floating point.
You should either program fixed point by hand storing data in integer type, or use some fixed-point library (or language extension).
There is a description of Floating point extensions implemented in GCC: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
There is some MACRO-based manual implementation of fixed-point for C: http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C

What you are doing are cruelties to the numbers.
First, you assign values to float variables. How they are stored is system dependant, but normally, IEEE 754 format is used. So your variables internally look like
x = 2.0 = 1 * 2^1 : sign = 0, mantissa = 1, exponent = 1 -> 0 10000000 00000000000000000000000 = 0x40000000
y = 3.0 = 1.5 * 2^1 : sign = 0, mantissa = 1.5, exponent = 1 -> 0 10000000 10000000000000000000000 = 0x40400000
z = 1.0 = 1 * 2^0 : sign = 0, mantissa = 1, exponent = 0 -> 0 01111111 00000000000000000000000 = 0x3F800000
If you do some bit shiftng operations on these numbers, you mix up the borders between sign, exponent and mantissa and so anything can, may and will happen.
In your case:
your 2.0 becomes 0x80000000, resulting in -0.0,
your 3.0 becomes 0x80800000, resulting in -1.1754943508222875e-38,
your 1.0 becomes 0x7F000000, resulting in 1.7014118346046923e+38.
The latter you lose by adding -0.0 and -1.1754943508222875e-38, which becomes the latter, namely 0x80800000, which should be, after >>ing it by 1, 3.0 again. I don't know why it isn't, probably because I made a mistake here.
What stays is that you cannot do bit-shifting on floats an expect a reliable result.
I would consider converting them to integer or other fixed-point on the ARM and sending them over the line as they are.

It's probable that your compiler uses IEEE 754 format for floats, which in bit terms, looks like this:
SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF
^ bit 31 ^ bit 0
S is the sign bit s = 1 implies the number is negative.
E bits are the exponent. There are 8 exponent bits giving a range of 0 - 255 but the exponent is biased - you need to subtract 127 to get the true exponent.
F bits are the fraction part, however, you need to imagine an invisible 1 on the front so the fraction is always 1.something and all you see are the binary fraction digits.
The number 2 is 1 x 21 = 1 x 2128 - 127 so is encoded as
01000000000000000000000000000000
So if you use a bit shift to shift it right you get
10000000000000000000000000000000
which by convention is -0 in IEEE754, so rather than multiplying your number by 2 your shift has made it zero.
The number 3 is [1 + 0.5] x 2128 - 127
which is represented as
01000000010000000000000000000000
Shifting that left gives you
10000000100000000000000000000000
which is -1 x 2-126 or some very small number.
You can do the same for z, but you probably get the idea that shifting just screws up floating point numbers.

Fixed point doesn't work that way. What you want to do is something like this:
void main(){
// initing 8bit fixed point numbers
unsigned int x = 2 << 8;
unsigned int y = 3 << 8;
unsigned int z = 1 << 8;
// adding two numbers
unsigned int a = x + y;
// multiplying two numbers with fixed point adjustment
unsigned int b = (x * y) >> 8;
// use numbers
printf("%d %d\n", a >> 8, b >> 8);
}

Related

Turn int to IEEE 754, extract exponent, and add 1 to exponent

Problem
I need to multiply a number without using * or + operator or other libs, only binary logic
To multiply a number by two using the IEEE norm, you add one to the exponent, for example:
12 = 1 10000010 100000(...)
So the exponent is: 10000010 (130)
If I want to multiply it by 2, I just add 1 to it and it becomes 10000011 (131).
Question
If I get a float, how do I turn it into, binary, then IEEE norm? Example:
8.0 = 1000.0 in IEEE I need it to have only one number on the left side, so 1.000 * 2^3. Then how do I add one so I multiply it by 2?
I need to get a float, ie. 6.5
Turn it to binary 110.1
Then to IEEE 754 0 10000001 101000(...)
Extract the exponent 10000001
Add one to it 10000010
Return it to IEEE 754 0 10000010 101000(...)
Then back to float 13
Given that the C implementation is known to use IEEE-754 basic 32-bit binary floating-point for its float type, the following code shows how to take apart the bits that represent a float, adjust the exponent, and reassemble the bits. Only simple multiplications involving normal numbers are handled.
#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int main(void)
{
float f = 6.125;
// Copy the bits that represent the float f into a 32-bit integer.
uint32_t u;
assert(sizeof f == sizeof u);
memcpy(&u, &f, sizeof u);
// Extract the sign, exponent, and significand fields.
uint32_t sign = u >> 31;
uint32_t exponent = (u >> 23) & 0xff;
uint32_t significand = u & 0x7fffff;
// Assert the exponent field is in the normal range and will remain so.
assert(0 < exponent && exponent < 254);
// Increment the exponent.
++exponent;
// Reassemble the bits and copy them back into f.
u = sign << 31 | exponent << 23 | significand;
memcpy(&f, &u, sizeof f);
// Display the result.
printf("%g\n", f);
}
Maybe not exactly what you are looking for, but C has a library function ldexp which does exactly what you need:
double x = 6.5;
x = ldexp(x, 1); // now x is 13
Maybe unions is the tool you need.
#include<iostream>
union fb {
float f;
struct b_s {
unsigned int sign :1;
unsigned int mant :22;
unsigned int exp :8;
} b;
};
fb num;
int main() {
num.f = 3.1415;
num.b.exp++;
std::cout << num.f << std::endl;
return 0;
}

Getting the mantissa (of a float) of either an unsigned int or a float (C)

So, i am trying to program a function which prints a given float number (n) in its (mantissa * 2^exponent) format. I was abled to get the sign and the exponent, but not the mantissa (whichever the number is, mantissa is always equal to 0.000000). What I have is:
unsigned int num = *(unsigned*)&n;
unsigned int m = num & 0x007fffff;
mantissa = *(float*)&m;
Any ideas of what the problem might be?
The C library includes a function that does this exact task, frexp:
int expon;
float mant = frexpf(n, &expon);
printf("%g = %g * 2^%d\n", n, mant, expon);
Another way to do it is with log2f and exp2f:
if (n == 0) {
mant = 0;
expon = 0;
} else {
expon = floorf(log2f(fabsf(n)));
mant = n * exp2f(-expon);
}
These two techniques are likely to give different results for the same input. For instance, on my computer the frexpf technique describes 4 as 0.5 × 23 but the log2f technique describes 4 as 1 × 22. Both are correct, mathematically speaking. Also, frexp will give you the exact bits of the mantissa, whereas log2f and exp2f will probably round off the last bit or two.
You should know that *(unsigned *)&n and *(float *)&m violate the rule against "type punning" and have undefined behavior. If you want to get the integer with the same bit representation as a float, or vice versa, use a union:
union { uint32_t i; float f; } u;
u.f = n;
num = u.i;
(Note: This use of unions is well-defined in C since roughly 2003, but, due to the C++ committee's long-standing habit of not paying sufficient attention to changes going into C, it is not officially well-defined in C++.)
You should also know IEEE floating-point numbers use "biased" exponents. When you initialize a float variable's mantissa field but leave its exponent field at zero, that gives you the representation of a number with a large negative exponent: in other words, a number so small that printf("%f", n) will print it as zero. Whenever printf("%f", variable) prints zero, change %f to %g or %a and rerun the program before assuming that variable actually is zero.
You are stripping off the bits of the exponent, leaving 0. An exponent of 0 is special, it means the number is denormalized and is quite small, at the very bottom of the range of representable numbers. I think you'd find if you looked closely that your result isn't quite exactly zero, just so small that you have trouble telling the difference.
To get a reasonable number for the mantissa, you need to put an appropriate exponent back in. If you want a mantissa in the range of 1.0 to 2.0, you need an exponent of 0, but adding the bias means you really need an exponent of 127.
unsigned int m = (num & 0x007fffff) | (127 << 23);
mantissa = *(float*)&m;
If you'd rather have a fully integer mantissa you need an exponent of 23, biased it becomes 150.
unsigned int m = (num & 0x007fffff) | ((23+127) << 23);
mantissa = *(float*)&m;
In addition to zwol's remarks: if you want to do it yourself you have to acquire some knowledge about the innards of an IEEE-754 float. Once you have done so you can write something like
#include <stdlib.h>
#include <stdio.h>
#include <math.h> // for testing only
typedef union {
float value;
unsigned int bits; // assuming 32 bit large ints (better: uint32_t)
} ieee_754_float;
// clang -g3 -O3 -W -Wall -Wextra -Wpedantic -Weverything -std=c11 -o testthewest testthewest.c -lm
int main(int argc, char **argv)
{
unsigned int m, num;
int exp; // the exponent can be negative
float n, mantissa;
ieee_754_float uf;
// neither checks nor balances included!
if (argc == 2) {
n = atof(argv[1]);
} else {
exit(EXIT_FAILURE);
}
uf.value = n;
num = uf.bits;
m = num & 0x807fffff; // extract mantissa (i.e.: get rid of sign bit and exponent)
num = num & 0x7fffffff; // full number without sign bit
exp = (num >> 23) - 126; // extract exponent and subtract bias
m |= 0x3f000000; // normalize mantissa (add bias)
uf.bits = m;
mantissa = uf.value;
printf("n = %g, mantissa = %g, exp = %d, check %g\n", n, mantissa, exp, mantissa * powf(2, exp));
exit(EXIT_SUCCESS);
}
Note: the code above is one of the quick&dirty(tm) species and is not meant for production. It also lacks handling for subnormal (denormal) numbers, a thing you must include. Hint: multiply the mantissa with a large power of two (e.g.: 2^25 or in that ballpark) and adjust the exponent accordingly (if you took the value of my example subtract 25).

How to generate an IEEE 754 Single-precision float using only integer arithmetic?

Assuming a low end microprocessor with no floating point arithmetic, I need to generate an IEE754 single precision floating point format number to push out to a file.
I need to write a function that takes three integers being the sign, whole and the fraction and returns a byte array with 4 bytes being the IEEE 754 single precision representation.
Something like:
// Convert 75.65 to 4 byte IEEE 754 single precision representation
char* float = convert(0, 75, 65);
Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.
You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).
Bear in mind that the fraction has an implicit leading '1' bit, which means that the most significant leading '1' bit (2^22) is not stored in the IEEE format. For example, given a fraction of 0x755555 (24 bits), the actual bits stored would be 0x355555 (23 bits).
Also bear in mind that the fraction is shifted so that the binary point is immediately to the right of the implicit leading '1' bit. So an IEEE 23-bit fraction of 11 0101 0101... represents the 24-bit binary fraction 1.11 0101 0101...
This means that the exponent has to be adjusted accordingly.
Does the value have to be written big endian or little endian? Reversed bit ordering?
If you are free, you should think about writing the value as string literal. That way you can easily convert the integer: just write the int part and write "e0" as exponent (or omit the exponent and write ".0").
For the binary representation, you should have a look at Wikipedia. Best is to first assemble the bitfields to an uint32_t - the structure is given in the linked article. Note that you might have to round if the integer has more than 23 bits value. Remember to normalize the generated value.
Second step will be to serialize the uint32_t to an uint8_t-array. Mind the endianess of the result!
Also note to use uint8_t for the result if you really want 8 bit values; you should use an unsigned type. For the intermediate representation, using uint32_t is recommended as that will guarantee you operate on 32 bit values.
You haven't had a go yet so no give aways.
Remember you can regard two 32-bit integers a & b to be interpreted as a decimal a.b as being a single 64-bit integer with an exponent of 2^-32 (where ^ is exponent).
So without doing anything you've got it in the form:
s * m * 2^e
The only problem is your mantissa is too long and your number isn't normalized.
A bit of shifting and adding/subtracting with a possible rounding step and you're done.
You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
The basic premise is to:
Given binary32 float.
Form a binary fixed-point representation of the combined whole and factional parts hundredths. This code uses a structure encoding both whole and hundredths fields separately. Important that the whole field is at least 32 bits.
Shift left/right (*2 and /2) until MSbit is in the implied bit position whilst counting the shifts. A robust solution would also note non-zero bits shifted out.
Form a biased exponent.
Round mantissa and drop implied bit.
Form sign (not done here).
Combine the above 3 steps to form the answer.
As Sub-normals, infinites & Not-A-Number will not result with whole, hundredths input, generating those float special cases are not addressed here.
.
#include <assert.h>
#include <stdint.h>
#define IMPLIED_BIT 0x00800000L
typedef struct {
int_least32_t whole;
int hundreth;
} x_xx;
int_least32_t covert(int whole, int hundreth) {
assert(whole >= 0 && hundreth >= 0 && hundreth < 100);
if (whole == 0 && hundreth == 0) return 0;
x_xx x = { whole, hundreth };
int_least32_t expo = 0;
int sticky_bit = 0; // Note any 1 bits shifted out
while (x.whole >= IMPLIED_BIT * 2) {
expo++;
sticky_bit |= x.hundreth % 2;
x.hundreth /= 2;
x.hundreth += (x.whole % 2)*(100/2);
x.whole /= 2;
}
while (x.whole < IMPLIED_BIT) {
expo--;
x.hundreth *= 2;
x.whole *= 2;
x.whole += x.hundreth / 100;
x.hundreth %= 100;
}
int32_t mantissa = x.whole;
// Round to nearest - ties to even
if (x.hundreth >= 100/2 && (x.hundreth > 100/2 || x.whole%2 || sticky_bit)) {
mantissa++;
}
if (mantissa >= (IMPLIED_BIT * 2)) {
mantissa /= 2;
expo++;
}
mantissa &= ~IMPLIED_BIT; // Toss MSbit as it is implied in final
expo += 24 + 126; // Bias: 24 bits + binary32 bias
expo <<= 23; // Offset
return expo | mantissa;
}
void test_covert(int whole, int hundreths) {
union {
uint32_t u32;
float f;
} u;
u.u32 = covert(whole, hundreths);
volatile float best = whole + hundreths / 100.0;
printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundreths, u.f, best,
best == u.f);
}
#include <limits.h>
int main(void) {
test_covert(75, 65);
test_covert(0, 1);
test_covert(INT_MAX, 99);
return 0;
}
Output
75.65 --> 7.565000e+01 7.565000e+01 Same:1
0.01 --> 1.000000e-02 1.000000e-02 Same:1
2147483647.99 --> 2.147484e+09 2.147484e+09 Same:1
Known issues: sign not applied.

why the result is nan in this c float conversion?

float and int types are all 4 bytes and I try converting in this way:
unsigned int x = 0; // 00000000
x = ~x>>1; // 7fffffff
float f = *((float *)&x);
printf("%f\n", f);
Because the first bit in c float number represents +/- and the next 8 bits is exp in 2^(exp-127) and the rest will be converted to 0.xxxxx..., it means I can get max float number:0|11111111|111...111 but finally I get a nan.
So is there anything wrong?
You are close, but your exponent is out of range so you have a NaN. FLT_MAX is:
0 11111110 11111111111111111111111
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
Note that the max exponent is 11111110, as 11111111 is reserved for NaNs.
The corresponding hex value is:
0x7f7fffff
So your code should be:
unsigned int x = 0x7f7fffff;
float f = *((float *)&x);
printf("%f\n", f);
and the result will be:
3.4028235E38
If you're interested in IEEE-754 format then check out this handy online calculator which converts between binary, hex and float formats: http://www.h-schmidt.net/FloatConverter/IEEE754.html
A bit-wise IEEE floating-point standard single precision (32-bit) NaN(Not a Number) would be:
s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx
where s is the sign (most often ignored in applications) and x is non-zero (the value zero encodes infinities).
In order to get the binary representation of the max float value, execute the "inverse":
float f = FLT_MAX;
int x = *((int*)&f);
printf("0x%.8X\n",x);
The result is 0x7F7FFFFF (and not 0x7FFFFFFF as you have assumed).
The C-language standard does not dictate sizeof(float) == sizeof(int).
So you will have to verify this on your platform in order to ensure correct execution.

"Shared Exponent" representation of a floating point vector in OpenCL C

In OpenCL, I want to store a vector (3D) using a "Shared Exponent" representation for compact storage. Typically, if you store a 3D floating point vector, you simply store 3 separate float values (or 4 when aligned properly). This requires 12 (16) bytes storage for single precision and if you don't require this accuracy you can use the "half" precision float and shrink it down to 6 (8) bytes.
When using half precision and 3 separate values, the memory looks like this (no alignment considered):
x coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
y coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
z coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
I'd like to shrink this down to 4 bytes by using a shared exponent, as OpenGL uses this in one of its internal texture formats ("RGB9_E5"). This means, the absolutely largest component decides what the exponent of the whole number is. This exponent is then used for each component implicitly. Tricks such as "normalized" storage with an implicit "1." in front of the mantissa don't work in this case. Such a representation works like this (we could tweak the acutal parameters, so this is an example):
x coordinate: 1 bit sign, 8 bits mantissa
y coordinate: 1 bit sign, 8 bits mantissa
z coordinate: 1 bit sign, 8 bits mantissa
5 bits shared exponent
I'd like to store this in an OpenCL uint type (32 bits) or something equivalent (e.g. uchar4). The question now is:
How can I convert from and into this representation to and from float3 as fast as possible?
My idea is like this, but I'm sure there is some "bit hacking" trick which uses the bit representation of IEEE floats to circumvent the floating point ALU:
Use uchar4 as the representative type. Store x, y, z mantisssa in x, y, z components of this uchar4. The w component is split up into 5 less significant bits (w & 0x1F) for the shared exponent and the three more significant bits (w >> 5) & 1, (w >> 6) & 1 and (w >> 7) & 1 are the signs for x, y and z, respectively.
Note that the exponent is "biased" by 16, i.e. a stored value of 16 means that the represented numbers are up to (not including) 1.0, a stored value of 19 means values up to (not including) 8.0 and so on.
"Unpacking" this representation into a float3 could be done using this code:
float3 unpackCompactVector(uchar4 packed) {
float exp = (float)(packed.w & 0x1F) - 16.0;
float factor = exp2(exp) / 256.0;
float x = (float)(packed.x) * factor * (packed.w & 0x20 ? -1.0 : 1.0);
float y = (float)(packed.y) * factor * (packed.w & 0x40 ? -1.0 : 1.0);
float z = (float)(packed.z) * factor * (packed.w & 0x80 ? -1.0 : 1.0);
float3 result = { x, y, z };
return result;
}
"Packing" a float3 into this representation could be done using this code:
uchar4 packCompactVector(float3 vec) {
float xAbs = abs(vec.x); uchar xSign = vec.x < 0.0 ? 0x20 : 0;
float yAbs = abs(vec.y); uchar ySign = vec.y < 0.0 ? 0x40 : 0;
float zAbs = abs(vec.z); uchar zSign = vec.z < 0.0 ? 0x80 : 0;
float maxAbs = max(max(xAbs, yAbs), zAbs);
int exp = floor(log2(maxAbs)) + 1;
float factor = exp2(exp);
uchar xMant = floor(xAbs / factor * 256);
uchar yMant = floor(yAbs / factor * 256);
uchar zMant = floor(zAbs / factor * 256);
uchar w = ((exp + 16) & 0x1F) + xSign + ySign + zSign;
uchar4 result = { xMant, yMant, zMant, w };
return result;
}
I've put an equivalent implementation in C++ online on ideone. The test cases shows the transition from exp = 3 to exp 4 (with the bias of 16 this is encoded as 19 and 20, respectively) by encoding numbers around 8.0.
This implementation seems to work on the first sight. But:
There are some corner cases I didn't cover, in particular over- and underflow (of the exponent).
I don't want to use floating point math functions like log2 because they are slow.
Can you suggest a better way to achieve my goal?
Note that I only need an OpenCL "device code" for this, I don't need to convert between the representations in the host program. But I added the C tag since a solution is most probably independent of the OpenCL language features (OpenCL is almost C and it also uses IEEE 754 floats, bit manipulation works the same, etc.).
If you used CL/GL interop and stored your data in an OpenGL texture in RGB9_E5 format and if you could create an OpenCL image from that texture, you could leverage the hardware texture unit to do the conversion into a float4 upon reading from the image. It might be worth trying.

Resources