"Shared Exponent" representation of a floating point vector in OpenCL C

In OpenCL, I want to store a vector (3D) using a "Shared Exponent" representation for compact storage. Typically, if you store a 3D floating point vector, you simply store 3 separate float values (or 4 when aligned properly). This requires 12 (16) bytes storage for single precision and if you don't require this accuracy you can use the "half" precision float and shrink it down to 6 (8) bytes.
When using half precision and 3 separate values, the memory looks like this (no alignment considered):
x coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
y coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
z coordinate: 1 bit sign, 5 bits exponent, 10 bits mantissa
I'd like to shrink this down to 4 bytes by using a shared exponent, as OpenGL does in one of its internal texture formats ("RGB9_E5"). This means that the component with the largest absolute value decides the exponent of the whole vector. This exponent is then used implicitly for each component. Tricks such as "normalized" storage with an implicit "1." in front of the mantissa don't work in this case. Such a representation works like this (the actual parameters could be tweaked, so this is an example):
x coordinate: 1 bit sign, 8 bits mantissa
y coordinate: 1 bit sign, 8 bits mantissa
z coordinate: 1 bit sign, 8 bits mantissa
5 bits shared exponent
I'd like to store this in an OpenCL uint type (32 bits) or something equivalent (e.g. uchar4). The question now is:
How can I convert from and into this representation to and from float3 as fast as possible?
My idea is like this, but I'm sure there is some "bit hacking" trick which uses the bit representation of IEEE floats to circumvent the floating point ALU:
Use uchar4 as the representative type. Store the x, y, z mantissas in the x, y, z components of this uchar4. The w component is split into the 5 least significant bits (w & 0x1F) for the shared exponent and the three most significant bits (w >> 5) & 1, (w >> 6) & 1 and (w >> 7) & 1, which are the signs for x, y and z, respectively.
Note that the exponent is "biased" by 16, i.e. a stored value of 16 means that the represented numbers are up to (not including) 1.0, a stored value of 19 means values up to (not including) 8.0 and so on.
"Unpacking" this representation into a float3 could be done using this code:
float3 unpackCompactVector(uchar4 packed) {
    // Shared exponent: low 5 bits of w, biased by 16.
    float exp = (float)(packed.w & 0x1F) - 16.0f;
    // The mantissas are 8-bit fractions, hence the division by 256.
    float factor = exp2(exp) / 256.0f;
    float x = (float)(packed.x) * factor * (packed.w & 0x20 ? -1.0f : 1.0f);
    float y = (float)(packed.y) * factor * (packed.w & 0x40 ? -1.0f : 1.0f);
    float z = (float)(packed.z) * factor * (packed.w & 0x80 ? -1.0f : 1.0f);
    float3 result = { x, y, z };
    return result;
}
"Packing" a float3 into this representation could be done using this code:
uchar4 packCompactVector(float3 vec) {
    // fabs, not abs: in OpenCL C abs() is defined for integer types only.
    float xAbs = fabs(vec.x); uchar xSign = vec.x < 0.0f ? 0x20 : 0;
    float yAbs = fabs(vec.y); uchar ySign = vec.y < 0.0f ? 0x40 : 0;
    float zAbs = fabs(vec.z); uchar zSign = vec.z < 0.0f ? 0x80 : 0;
    float maxAbs = max(max(xAbs, yAbs), zAbs);
    int exp = (int)floor(log2(maxAbs)) + 1;
    float factor = exp2((float)exp);
    uchar xMant = (uchar)floor(xAbs / factor * 256.0f);
    uchar yMant = (uchar)floor(yAbs / factor * 256.0f);
    uchar zMant = (uchar)floor(zAbs / factor * 256.0f);
    uchar w = ((exp + 16) & 0x1F) + xSign + ySign + zSign;
    uchar4 result = { xMant, yMant, zMant, w };
    return result;
}
I've put an equivalent implementation in C++ online on ideone. The test cases show the transition from exp = 3 to exp = 4 (with the bias of 16 this is encoded as 19 and 20, respectively) by encoding numbers around 8.0.
This implementation seems to work at first sight. But:
There are some corner cases I didn't cover, in particular over- and underflow (of the exponent).
I don't want to use floating point math functions like log2 because they are slow.
Can you suggest a better way to achieve my goal?
Note that I only need an OpenCL "device code" for this, I don't need to convert between the representations in the host program. But I added the C tag since a solution is most probably independent of the OpenCL language features (OpenCL is almost C and it also uses IEEE 754 floats, bit manipulation works the same, etc.).

If you used CL/GL interop and stored your data in an OpenGL texture in RGB9_E5 format and if you could create an OpenCL image from that texture, you could leverage the hardware texture unit to do the conversion into a float4 upon reading from the image. It might be worth trying.


Convert integer in a new floating point format

This code is intended to convert a signed 16-bit integer to a new floating point format (similar to the normal IEEE 754 floating point format). I understand the regular IEEE 754 floating point format, but I don't understand how this code works or what this floating point format looks like. I would be grateful for some insight into the idea behind the code, and into how many bits are used for the significand and how many for the exponent in this new format.
#include <stdint.h>

uint32_t short2fp (int16_t inp)
{
    int x, f, i;
    if (inp == 0)
        return 0;
    else if (inp < 0)
    {
        i = -inp;
        x = 191;
    }
    else
    {
        i = inp;
        x = 63;
    }
    for (f = i; f > 1; f >>= 1)
        x++;
    for (f = i; f < 0x8000; f <<= 1);
    return (x * 0x8000 + f - 0x8000);
}
These two tricks should help you recognize the parameters (exponent size and mantissa size) of a custom floating-point format:
First of all, how many bits is this float number long?
We know that the sign bit is the highest bit set in any negative float number. If we calculate short2fp(-1) we obtain 0b10111111000000000000000, that is a 23-bit number. Therefore, this custom float format is a 23-bit float.
If we want to know the exponent's and mantissa's sizes, we can convert the number 3, because this will set both the highest exponent's bit and the highest mantissa's bit. If we do short2fp(3), we obtain 0b01000000100000000000000, and if we split this number we get 0 1000000 100000000000000: the first bit is the sign, then we have 7 bits of exponent, and finally 15 bits of mantissa.
Float format size: 23 bits
Exponent size: 7 bits
Mantissa size: 15 bits
NOTE: this conclusion may be wrong for a number of reasons (e.g. a float format very different from the IEEE754 ones, a short2fp() function that does not work properly, too much coffee this morning, etc.), but in general this works for every binary floating-point format defined by IEEE754 (binary16, binary32, binary64, etc.), so I'm confident it works for your custom float format too.
P.S.: the short2fp() function is written very poorly; you may want to improve its clarity if you plan to investigate its inner workings.
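The two probes can be reproduced with a few lines of C. This is only a sketch: short2fp is copied from the question with its braces written out, and bit_length is a made-up helper that reports how many bits a value occupies.

```c
#include <stdint.h>

/* short2fp() as in the question, with its braces written out. */
uint32_t short2fp(int16_t inp) {
    int x, f, i;
    if (inp == 0)
        return 0;
    else if (inp < 0) {
        i = -inp;
        x = 191;
    } else {
        i = inp;
        x = 63;
    }
    for (f = i; f > 1; f >>= 1)
        x++;
    for (f = i; f < 0x8000; f <<= 1)
        ;
    return (uint32_t)(x * 0x8000 + f - 0x8000);
}

/* Number of bits needed to hold v: position of the highest set bit plus one. */
int bit_length(uint32_t v) {
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}
```

Calling bit_length(short2fp(-1)) gives the total width of the format, and printing short2fp(3) in binary shows where the exponent and mantissa fields begin.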
The two statements x = 191; and x = 63; set x to either 1•128 + 63 or 0•128 + 63, according to whether the number is negative or positive. Therefore 128 (2^7) has the sign bit at this point. As x is later multiplied by 0x8000 (2^15), the sign bit is 2^22 in the result.
These statements also initialize the exponent to 0, which is encoded as 63 due to a bias of 63. This follows the IEEE-754 pattern of using a bias of 2^(n−1)−1 for an exponent field of n bits. (The “single” format has eight exponent bits and a bias of 2^7−1 = 127, and the “double” format has 11 exponent bits and a bias of 2^10−1 = 1023.) Thus we expect an exponent field of 7 bits, with bias 2^6−1 = 63.
This loop:
    for (f = i; f > 1; f >>= 1)
        x++;
detects the magnitude of i (the absolute value of the input), adding one to the exponent for each power of two that f is detected to exceed. For example, if the input is 4, 5, 6, or 7, the loop executes two times, adding two to x and reducing f to 1, at which point the loop stops. This confirms the exponent bias; if i is 1, x is left as is, so the initial value of 63 corresponds to an exponent of 0 and a represented value of 2^0 = 1.
The loop for (f = i; f < 0x8000; f <<= 1); scales f in the opposite direction, moving its leading bit to be in the 0x8000 position.
In return (x * 0x8000 + f - 0x8000);, x * 0x8000 moves the sign bit and exponent field from their initial positions (bit 7 and bits 6 to 0) to their final positions (bit 22 and bits 21 to 15). f - 0x8000 removes the leading bit from f, giving the trailing bits of the significand. This is then added to the final value, forming the primary encoding of the significand in bits 14 to 0.
Thus the format has the sign bit in bit 22, exponent bits in bits 21 to 15 with a bias of 63, and the trailing significand bits in bits 14 to 0.
The format could encode subnormal numbers, infinities, and NaNs in the usual way, but this is not discernible from the code shown, as it encodes only integers in the normal range.
As a comment suggested, I would use a small number of strategically selected test cases to reverse engineer the format. The following assumes an IEEE-754-like binary floating-point format using sign-magnitude encoding with a sign bit, exponent bits, and significand (mantissa) bits.
short2fp (1) = 001f8000 while short2fp (-1) = 005f8000. The exclusive OR of these is 0x00400000 which means the sign bit is in bit 22 and this floating-point format comprises 23 bits.
short2fp (1) = 001f8000, short2fp (2) = 00200000, and short2fp (4) = 00208000. The difference between consecutive values is 0x00008000 so the least significant bit of the exponent field is bit 15, the exponent field comprises 7 bits, and the exponent is biased by (0x001f8000 >> 15) = 0x3F = 63.
This leaves the least significant 15 bits for the significand. We can see from short2fp (2) = 00200000 that the integer bit of the significand (mantissa) is not stored, that is, it is implicit as in IEEE-754 formats like binary32 or binary64.
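To double-check the layout derived above (sign in bit 22, exponent in bits 21 to 15 with bias 63, 15 stored significand bits with an implicit leading 1), a decoder can be sketched. The name fp23_to_double is made up, and only zero and normal values are handled, since the code shown produces nothing else.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical decoder for the 23-bit format; zero and normal values only. */
double fp23_to_double(uint32_t enc) {
    if (enc == 0)
        return 0.0;
    int sign = (int)((enc >> 22) & 1);              /* bit 22 */
    int exponent = (int)((enc >> 15) & 0x7F) - 63;  /* bits 21..15, bias 63 */
    uint32_t fraction = enc & 0x7FFF;               /* bits 14..0 */
    double significand = 1.0 + fraction / 32768.0;  /* restore the implicit 1 */
    double value = ldexp(significand, exponent);
    return sign ? -value : value;
}
```

Feeding it the encodings quoted above (0x1F8000, 0x5F8000, 0x200000, 0x208000) should give back 1, -1, 2 and 4.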

How can I convert this number representation to a float?

I read this 16-bit value from a temperature sensor (type MCP9808)
Ignoring the first three MSBs, what's an easy way to convert the other bits to a float?
I managed to convert the values 2^7 through 2^0 to an integer with some bit-shifting:
uint16_t rawBits = readSensor();
int16_t value = (rawBits << 3) / 128;
However I can't think of an easy way to also include the bits with an exponent smaller than 0, except for manually checking if they're set and then adding 1/2, 1/4, 1/8 and 1/16 to the result respectively.
Something like this seems pretty reasonable. Take the number portion, divide by 16, and fix the sign.
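float tempSensor(uint16_t value) {
    bool negative = (value & 0x1000);
    return (negative ? -1 : 1) * (value & 0x0FFF) / 16.0f;
}

// Alternative taking the two register bytes:
float convert(unsigned char msb, unsigned char lsb) {
    return ((lsb | ((msb & 0x0f) << 8)) * ((msb & 0x10) ? -1 : 1)) / 16.0f;
}

// Alternative using a shift; the uint16_t cast drops the flag and sign bits
// that the shift pushes past bit 15.
float convert(uint16_t val) {
    return (((val & 0x1000) ? -1 : 1) * (uint16_t)(val << 4)) / 256.0f;
}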
If performance isn't a super big deal, I would go for something less clever and more explicit, along the lines of:
bool is_bit_set(uint16_t value, uint16_t bit) {
    uint16_t mask = 1 << bit;
    return (value & mask) == mask;
}

float parse_temperature(uint16_t raw_reading) {
    if (is_bit_set(raw_reading, 15)) { /* temp is above Tcrit. Do something about it. */ }
    if (is_bit_set(raw_reading, 14)) { /* temp is above Tupper. Do something about it. */ }
    if (is_bit_set(raw_reading, 13)) { /* temp is above Tlower. Do something about it. */ }
    uint16_t whole_degrees = (raw_reading & 0x0FF0) >> 4;
    float magnitude = (float) whole_degrees;
    if (is_bit_set(raw_reading, 0)) magnitude += 1.0f/16.0f;
    if (is_bit_set(raw_reading, 1)) magnitude += 1.0f/8.0f;
    if (is_bit_set(raw_reading, 2)) magnitude += 1.0f/4.0f;
    if (is_bit_set(raw_reading, 3)) magnitude += 1.0f/2.0f;
    bool is_negative = is_bit_set(raw_reading, 12);
    // TODO: What do the 3 most significant bits do?
    return magnitude * (is_negative ? -1.0f : 1.0f);
}
Honestly this is a lot of simple constant math, I'd be surprised if the compiler wasn't able to heavily optimize it. That would need confirmation, of course.
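Whether the compiler collapses the explicit version is a separate question, but its results can at least be checked against the divide-by-16 approach over every possible raw reading. A minimal sketch, with made-up names (temp_by_division, temp_by_bits, count_mismatches) and the same sign-magnitude reading of bit 12 that the snippets above use:

```c
#include <stdint.h>

/* Magnitude taken as the low 12 bits divided by 16. */
float temp_by_division(uint16_t raw) {
    float magnitude = (float)(raw & 0x0FFF) / 16.0f;
    return (raw & 0x1000) ? -magnitude : magnitude;
}

/* Magnitude built from whole degrees plus the four fractional bits. */
float temp_by_bits(uint16_t raw) {
    float magnitude = (float)((raw & 0x0FF0) >> 4);
    if (raw & 0x0001) magnitude += 1.0f / 16.0f;
    if (raw & 0x0002) magnitude += 1.0f / 8.0f;
    if (raw & 0x0004) magnitude += 1.0f / 4.0f;
    if (raw & 0x0008) magnitude += 1.0f / 2.0f;
    return (raw & 0x1000) ? -magnitude : magnitude;
}

/* Counts the raw readings on which the two functions disagree. */
int count_mismatches(void) {
    int mismatches = 0;
    for (uint32_t raw = 0; raw <= 0xFFFF; raw++) {
        if (temp_by_division((uint16_t)raw) != temp_by_bits((uint16_t)raw))
            mismatches++;
    }
    return mismatches;
}
```

All the values involved are multiples of 1/16 below 256, so both computations are exact in single precision and the comparison with != is meaningful.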
If your C compiler has a clz builtin or equivalent, it can be used to avoid the mul operation.
In your case, since the provided temp value looks like a mantissa, and if your C compiler uses the IEEE-754 float representation, translating the temp value into its IEEE-754 equivalent may be the most efficient way:
Update: compacted the code a little and made the explanation of the mantissa clearer.
float convert(uint16_t val) {
    uint16_t mantissa = (uint16_t)(val << 4);
    if (mantissa == 0) return 0.0f;
    unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
    // Cast to uint32_t before shifting so the sign bit can reach bit 31.
    uint32_t r = ((uint32_t)(val & 0x1000) << 19) | ((uint32_t)(0x86 - e) << 23) | (((uint32_t)mantissa << (e + 8)) & 0x07FFFFF);
    float f;
    memcpy(&f, &r, sizeof f); // needs <string.h>; avoids the aliasing problem of *(float *)&r
    return f;
}

float convert(unsigned char msb, unsigned char lsb) {
    uint16_t mantissa = (uint16_t)((msb << 8 | lsb) << 4);
    if (mantissa == 0) return 0.0f;
    unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
    uint32_t r = ((uint32_t)(msb & 0x10) << 27) | ((uint32_t)(0x86 - e) << 23) | (((uint32_t)mantissa << (e + 8)) & 0x07FFFFF);
    float f;
    memcpy(&f, &r, sizeof f);
    return f;
}
We use the fact that the temp value is essentially a mantissa with a magnitude below 256.
Its IEEE-754 exponent therefore ranges from 7 at most (magnitudes of 128 and above) down to -4 (the value 1/16).
We use the clz builtin to get the "order" of the first bit set in the mantissa,
so that we can define the exponent as the theoretical max (7, since 2^7 = 128) minus this "order".
We also use this order to left-shift the temp value to get the IEEE-754 mantissa,
plus one more left shift to drop the implied '1' of the IEEE-754 significand.
Thus we build a 32-bit binary IEEE-754 representation from the temp value as follows:
First, the sign bit goes to the 32nd bit of our binary IEEE-754 representation.
The exponent is computed as the theoretical max 7 (2^7 = 128) plus the IEEE-754 bias (127) minus the actual "order" of the temp value.
The "order" of the temp value is deduced from the number of leading '0's of its 12-bit representation in the variable mantissa, obtained through the clz builtin.
Beware that we assume here that the clz builtin expects a 32-bit value as its parameter, which is why we subtract 16. This code may need adapting if your clz expects anything else.
The number of leading '0's can go from 0 (magnitude of 128 or more) to 11, as we directly return 0.0 for a zero temp value.
Since the bit following those leading '0's is a 1 in the temp value, each leading '0' is equivalent to one power of 2 below the theoretical max of 7.
Thus, with 7 + 127 = 134 = 0x86, we can simply subtract the "order" from it, as the number of leading '0's lets us deduce the base exponent for IEEE-754.
If the "order" is greater than 7 we still get the negative exponent required for values less than 1.
We then add this 8-bit exponent to our binary IEEE-754 representation, from the 24th bit to the 31st bit.
The temp value is more or less already a mantissa: we remove the leading '0's and its first set bit by shifting it left by (e + 1), while also shifting left by 7 more bits to place the mantissa in the 32 bits (e + 7 + 1 = e + 8). We then keep only the desired 23 bits (AND with 0x7FFFFF).
Its first set bit must be removed as it is the implied '1' of the IEEE-754 significand (the power of 2 of the exponent).
We then have the IEEE-754 mantissa and place it from the 8th bit to the 23rd bit of our binary IEEE-754 representation.
The 4 trailing zeros of our 16-bit temp value and the zero bits shifted in on the right do not change the effective IEEE-754 value.
As every part is a 32-bit value and the parts are combined with the OR operator (|), we then have the final IEEE-754 representation.
We can then return this binary representation as an IEEE-754 C float value.
Because of the required clz and the IEEE-754 translation, this approach is less portable. Its main benefit is avoiding MUL operations in the resulting machine code, for performance on architectures with a "poor" FPU.
P.S.: Explanation of the casts. I've added explicit casts to let the C compiler know that we deliberately discard some bits:
uint16_t mantissa = (uint16_t)(val <<4); : The cast here tells the compiler that we know we'll lose the four leftmost bits, as that is the goal here. We discard the first four bits of the temp value for the mantissa.
(unsigned char)(__builtin_clz(mantissa) - 16) : We tell the C compiler that we only consider an 8-bit range for the builtin's return value, as we know our mantissa has only 12 significant bits and thus an output range from 0 to 12. We therefore do not need the full int return value.
uint32_t r = (uint32_t) ... : We tell the C compiler not to bother with the sign representation here, as we are building an IEEE-754 representation.
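The steps above can be traced with a variant that names each field separately and does not depend on a compiler builtin. This is a sketch only: clz32 is a made-up portable stand-in for a 32-bit clz, convert_bits is a hypothetical name, and the float type is assumed to be IEEE-754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for a 32-bit clz builtin; v must be non-zero. */
int clz32(uint32_t v) {
    int n = 0;
    while ((v & 0x80000000u) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

float convert_bits(uint16_t val) {
    uint16_t mantissa = (uint16_t)(val << 4);      /* drop flags and sign */
    if (mantissa == 0)
        return 0.0f;
    int e = clz32(mantissa) - 16;                   /* the "order", 0..11 */
    uint32_t sign = (uint32_t)(val & 0x1000) << 19; /* bit 12 -> bit 31 */
    uint32_t exponent = (uint32_t)(0x86 - e) << 23; /* 7 + 127 - order */
    uint32_t fraction = ((uint32_t)mantissa << (e + 8)) & 0x7FFFFF;
    uint32_t r = sign | exponent | fraction;
    float f;
    memcpy(&f, &r, sizeof f);                       /* reinterpret the bits */
    return f;
}
```

For a raw reading of 0x0190 (25 degrees) the order is 3, so the exponent field is 0x83 and the fraction is 0x480000, which is 1.5625 x 2^4.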

How is float to int type conversion done in C? [duplicate]

I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage.
From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there?
As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?
I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.
Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0.
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 2^1)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 2^1 = 2.0.
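The same split can be done in code. A small sketch, assuming float is IEEE-754 binary32; float_fields and split_float are made-up names.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a binary32 value, as stored (exponent still biased). */
typedef struct {
    unsigned sign;            /* bit 31 */
    unsigned exponent_field;  /* bits 30..23, bias 127 */
    uint32_t fraction;        /* bits 22..0, hidden 1 not included */
} float_fields;

float_fields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret without pointer punning */
    float_fields out;
    out.sign = (unsigned)(bits >> 31);
    out.exponent_field = (unsigned)((bits >> 23) & 0xFF);
    out.fraction = bits & 0x7FFFFF;
    return out;
}
```

For 2.0f this yields sign 0, exponent field 0x80 and fraction 0, matching the breakdown of 0x40000000 above.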
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0; // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
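The loss is easy to observe with the compiler's own conversion. A minimal sketch, assuming float is IEEE-754 binary32; survives_float is a made-up name, and it is only meant for magnitudes well below INT32_MAX so that the conversion back to an integer stays in range.

```c
#include <stdint.h>

/* Returns 1 if v is unchanged by a round trip through float, 0 otherwise.
   Only meant for magnitudes well below INT32_MAX. */
int survives_float(int32_t v) {
    volatile float f = (float)v;   /* volatile: force a real binary32 value */
    return (int32_t)f == v;
}
```

Every integer up to 2^24 = 16777216 survives, while 16777217 is the first one that does not: it needs 25 significant bits and is rounded to a neighbouring representable value.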
Have you checked the IEEE 754 floating-point representation?
In 32-bit normalized form, it has (mantissa's) sign bit, 8-bit exponent (excess-127, I think) and 23-bit mantissa in "decimal" except that the "0." is dropped (always in that form) and the radix is 2, not 10. That is: the MSB value is 1/2, the next bit 1/4 and so on.
Joe Z's answer is elegant but its range of input values is highly limited. A 32-bit float can store all integer values from the following range:
[-2^24...+2^24] = [-16777216...+16777216]
and some other values outside this range.
The whole range would be covered by this:
#include <limits.h>

float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion

    // we can safely reinterpret it as 0.0
    if (value == 0) return 0.0;

    if (value == INT_MIN) // ie -2^31
    {
        // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
        // value = 0xCF000000;
        return (float)INT_MIN; // *((float*)&value); is undefined behaviour
    }

    unsigned int sign = 0;

    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // right shift of a negative signed value is implementation-defined;
    // all compilers (that I know) do an arithmetic shift (copy sign into MSB),
    // hence using unsigned abs_value_copy for the shift
    unsigned int abs_value_copy = value;

    // find leading one (bit 0 included, so that value == 1 is handled)
    int bit_num = 31;
    int shift_count = 0;

    for(; bit_num >= 0; bit_num--)
    {
        if (abs_value_copy & (1U<<bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // exponent is biased by 127
    unsigned int exp = bit_num + 127;

    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    unsigned int coeff = abs_value_copy & ~(1U<<23);

    // move exp to the right place
    exp <<= 23;

    union
    {
        unsigned int rint;
        float rfloat;
    }ret = { sign | exp | coeff };

    return ret.rfloat;
}
Of course there are other (branchless) ways to find the absolute value of an int. Similarly, counting leading zeros can also be done without a branch, so treat this example as just an example ;-).

C/C++ - convert 32-bit floating-point value to 24-bit normalized fixed-point value?

Please let me know how to convert a 32-bit float to a 24-bit normalized value. What I tried is (units * (1 << 24)), but it doesn't seem to be working. Please help me with this. Thanks.
Of course it is not working: (1 << 24) is too large, by exactly 1, to be stored in a 24-bit number whose range starts at 0. To put this another way, 1 << 24 is actually a 25-bit number.
Consider (units * ((1 << 24) - 1)) instead.
(1 << 24) - 1 is the largest value an unsigned 24-bit integer that begins at 0 can represent.
Now, a floating-point number in the range [0.0 - 1.0] will actually fit into an unsigned 24-bit fixed-point integer without overflow.
A normalized fixed-point representation means that the maximum representable value, not strictly reachable, is 1. So 1 is represented by 1<<24. See also Q Formats.
For example, Q24 means 24 fractional bits, 0 integer bits and no sign. If a 32-bit unsigned integer is used to hold a Q24 value, the remaining 8 bits can be used to ease calculations.
Before translating from floating-point to fixed-point representation, you always have to define the range for your original value. Example: the floating point value is a physical value in the range from [0, 5), so 0 is included and 5 is not included in the range, and your fixed-point value is normalized to 5.
#include <stdint.h>
#include <stdio.h>

float length_flp = 4.5; // Units: meters. Range: [0,5)
float time_flp = 1.2; // Seconds. Range: [0,2)
float speed_flp = 1.2; // m/sec. Range: [0,2.5)
uint32_t length_fixp; // Meters. Representation: Q24 = 24 bit normalized to MAX_LENGTH=5
uint32_t time_fixp; // Seconds. Representation: Q24 = 24 bit normalized to MAX_TIME=2
uint32_t speed_fixp; // m/sec. Repr: Q24 = 24 bit normalized to MAX_SPEED=(MAX_LENGTH/MAX_TIME)=2.5

int main(void)
{
    printf("length_flp=%f m\n", length_flp);
    printf("time_flp=%f sec\n", time_flp);
    printf("speed_flp=%f m/sec\n\n", length_flp / time_flp);
    length_fixp = (length_flp / 5) * (1 << 24);
    time_fixp = (time_flp / 2) * (1 << 24);
    // Drop 12 bits of the divisor first, then restore the scale of the quotient.
    speed_fixp = (length_fixp / (time_fixp >> 12)) << 12;
    printf("length_fixp=%u m\n", (unsigned)length_fixp);
    printf("time_fixp=%u sec\n", (unsigned)time_fixp);
    printf("speed_fixp = %u m/sec [fixed-point] = %f m/sec\n", (unsigned)speed_fixp, (float)speed_fixp / (1 << 24) * 2.5);
    return 0;
}
The advantage with normalized representation is that operations between normalized values return a normalized value.
By the way, you have to define a generic function for each operation (division, multiplication, etc.), to prevent overflow and save precision.
As you can see I've used a small trick to calculate speed_fixp.
The output is
length_flp=4.500000 m
time_flp=1.200000 sec
speed_flp=3.750000 m/sec
length_fixp = 15099494 m [fixed-point]
time_fixp = 10066330 sec [fixed-point]
speed_fixp = 25169920 m/sec [fixed-point] = 3.750610 m/sec
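The per-operation functions mentioned above could look roughly like the following. This is a sketch under the half-open range convention used in this answer (a value in [0, range_max) maps to [0, 1 << 24)); to_q24, from_q24 and q24_div are made-up names. The division widens to 64 bits instead of dropping 12 bits of the divisor, which keeps more precision than the shift trick.

```c
#include <stdint.h>

#define Q24_ONE (1u << 24)

/* Value in [0, range_max) -> Q24 fraction of range_max. */
uint32_t to_q24(double value, double range_max) {
    return (uint32_t)(value / range_max * Q24_ONE);
}

/* Q24 fraction of range_max -> value in physical units. */
double from_q24(uint32_t fixed, double range_max) {
    return (double)fixed / Q24_ONE * range_max;
}

/* Ratio of two Q24 values, itself in Q24. b must be non-zero and the
   quotient must fit in 32 bits (a / b below 256). */
uint32_t q24_div(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a << 24) / b);
}
```

With the fixed-point length and time printed above, from_q24(q24_div(15099494, 10066330), 2.5) comes out much closer to 3.75 than the 3.750610 obtained with the 12-bit shift.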

Bit shifting for fixed point arithmetic on float numbers in C

I wrote the following test code to check fixed point arithmetic and bit shifting.
#include <stdio.h>

int main(void){
    float x = 2;
    float y = 3;
    float z = 1;
    unsigned int * px = (unsigned int *) (& x);
    unsigned int * py = (unsigned int *) (& y);
    unsigned int * pz = (unsigned int *) (& z);
    *px <<= 1;
    *py <<= 1;
    *pz <<= 1;
    *pz =*px + *py;
    *px >>= 1;
    *py >>= 1;
    *pz >>= 1;
    printf("%f %f %f\n",x,y,z);
    return 0;
}
The result is
2.000000 3.000000 0.000000
Why is the last number 0? I was expecting to see a 5.000000
I want to use some kind of fixed point arithmetic to bypass the use of floating point numbers on an image processing application. Which is the best/easiest/most efficient way to turn my floating point arrays into integers? Is the above "tricking the compiler" a robust workaround? Any suggestions?
If you want to use fixed point, don't use the types 'float' or 'double', because they have an internal structure. Floats and doubles have a specific bit for the sign, some bits for the exponent and some for the mantissa, so they are inherently floating point.
You should either program fixed point by hand, storing the data in an integer type, or use a fixed-point library (or language extension).
There is a description of the fixed-point extensions implemented in GCC: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
There is some MACRO-based manual implementation of fixed-point for C: http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
What you are doing are cruelties to the numbers.
First, you assign values to float variables. How they are stored is system dependent, but normally, IEEE 754 format is used. So your variables internally look like
x = 2.0 = 1 * 2^1 : sign = 0, mantissa = 1, exponent = 1 -> 0 10000000 00000000000000000000000 = 0x40000000
y = 3.0 = 1.5 * 2^1 : sign = 0, mantissa = 1.5, exponent = 1 -> 0 10000000 10000000000000000000000 = 0x40400000
z = 1.0 = 1 * 2^0 : sign = 0, mantissa = 1, exponent = 0 -> 0 01111111 00000000000000000000000 = 0x3F800000
If you do some bit shifting operations on these numbers, you mix up the borders between sign, exponent and mantissa and so anything can, may and will happen.
In your case:
your 2.0 becomes 0x80000000, resulting in -0.0,
your 3.0 becomes 0x80800000, resulting in -1.1754943508222875e-38,
your 1.0 becomes 0x7F000000, resulting in 1.7014118346046923e+38.
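The latter you lose, because *pz = *px + *py overwrites it with the integer sum of the two shifted bit patterns, not with a floating-point sum. 0x80000000 + 0x80800000 does not fit into 32 bits and wraps around to 0x00800000; after >>ing it by 1 this is 0x00400000, a subnormal number of about 5.9e-39, which prints as 0.000000. x and y are simply shifted back to 0x40000000 and 0x40400000, which is why they print as 2.0 and 3.0 again.
What stays is that you cannot do bit-shifting on floats and expect a reliable result.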
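The sequence from the question can be replayed on plain 32-bit integers to see each intermediate pattern. A sketch assuming IEEE-754 binary32 floats; float_bits, bits_float and replay_z are made-up names, and memcpy is used instead of pointer casts to stay clear of aliasing problems.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float. */
uint32_t float_bits(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);
    return b;
}

/* Float with the given bit pattern. */
float bits_float(uint32_t b) {
    float f;
    memcpy(&f, &b, sizeof f);
    return f;
}

/* Replays the question's shifts and addition on the bit patterns of x and y
   and returns what ends up in z. */
float replay_z(float x, float y) {
    uint32_t px = float_bits(x) << 1;
    uint32_t py = float_bits(y) << 1;
    uint32_t pz = px + py;          /* integer addition, wraps modulo 2^32 */
    return bits_float(pz >> 1);
}
```

For x = 2 and y = 3 the final pattern is 0x00400000, far too small to show up with %f.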
I would consider converting them to integer or other fixed-point on the ARM and sending them over the line as they are.
It's probable that your compiler uses IEEE 754 format for floats, which in bit terms, looks like this:
SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF
^ bit 31                       ^ bit 0
S is the sign bit s = 1 implies the number is negative.
E bits are the exponent. There are 8 exponent bits giving a range of 0 - 255 but the exponent is biased - you need to subtract 127 to get the true exponent.
F bits are the fraction part, however, you need to imagine an invisible 1 on the front so the fraction is always 1.something and all you see are the binary fraction digits.
The number 2 is 1 x 2^1 = 1 x 2^(128 - 127) so is encoded as
01000000000000000000000000000000
So if you use a bit shift to shift it left you get
10000000000000000000000000000000
which by convention is -0 in IEEE754, so rather than multiplying your number by 2 your shift has made it zero.
The number 3 is [1 + 0.5] x 2^(128 - 127)
which is represented as
01000000010000000000000000000000
Shifting that left gives you
10000000100000000000000000000000
which is -1 x 2^-126 or some very small number.
You can do the same for z, but you probably get the idea that shifting just screws up floating point numbers.
Fixed point doesn't work that way. What you want to do is something like this:
#include <stdio.h>

int main(void){
    // initing 8bit fixed point numbers
    unsigned int x = 2 << 8;
    unsigned int y = 3 << 8;
    unsigned int z = 1 << 8;
    // adding two numbers
    unsigned int a = x + y;
    // multiplying two numbers with fixed point adjustment
    unsigned int b = (x * y) >> 8;
    // use numbers
    printf("%u %u\n", a >> 8, b >> 8);
    return 0;
}
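The same idea extends to values with a fractional part. A minimal sketch with 8 fractional bits, for non-negative values only; the fix8 type and the fix8_* helper names are made up for illustration.

```c
#include <stdint.h>

#define FIX8_SHIFT 8

typedef uint32_t fix8;   /* unsigned fixed point with 8 fractional bits */

/* Convert a non-negative real value, rounding to the nearest 1/256. */
fix8 fix8_from_double(double v) {
    return (fix8)(v * (1 << FIX8_SHIFT) + 0.5);
}

double fix8_to_double(fix8 v) {
    return (double)v / (1 << FIX8_SHIFT);
}

/* Addition needs no adjustment: both operands share the same scale. */
fix8 fix8_add(fix8 a, fix8 b) {
    return a + b;
}

/* The raw product carries 16 fractional bits, so shift 8 of them away.
   The 64-bit intermediate avoids overflow for large operands. */
fix8 fix8_mul(fix8 a, fix8 b) {
    return (fix8)(((uint64_t)a * b) >> FIX8_SHIFT);
}
```

For example, 2.5 is stored as 640 and 1.5 as 384; their fixed-point product is 960, which converts back to 3.75.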
