Convert integer in a new floating point format - c

This code is intended to convert a signed 16-bit integer to a new floating point format (similar to the normal IEEE 754 floating point format). I unterstand the regular IEEE 754 floating point format, but i don't understand how this code works and how this floating point format looks like. I would be grateful for some insights into what the idea of the code is respectively how many bits are used for representing the significand and how many bits are used for representing the exponent in this new format.
#include <stdint.h>
uint32_t short2fp (int16_t inp)
{
int x, f, i;
if (inp == 0)
{
return 0;
}
else if (inp < 0)
{
i = -inp;
x = 191;
}
else
{
i = inp;
x = 63;
}
for (f = i; f > 1; f >>= 1)
x++;
for (f = i; f < 0x8000; f <<= 1);
return (x * 0x8000 + f - 0x8000);
}

This couple of tricks should help you recognize the parameters (exponent's size and mantissa's size) of a custom floating-point format:
First of all, how many bits is this float number long?
We know that the sign bit is the highest bit set in any negative float number. If we calculate short2fp(-1) we obtain 0b10111111000000000000000, that is a 23-bit number. Therefore, this custom float format is a 23-bit float.
If we want to know the exponent's and mantissa's sizes, we can convert the number 3, because this will set both the highest exponent's bit and the highest mantissa's bit. If we do short2fp(3), we obtain 0b01000000100000000000000, and if we split this number we get 0 1000000 100000000000000: the first bit is the sign, then we have 7 bits of exponent, and finally 15 bits of mantissa.
Conclusion:
Float format size: 23 bits
Exponent size: 7 bits
Mantissa size: 15 bits
NOTE: this conclusion may be wrong for a different number of reasons (e.g.: float format particularly different from IEEE754 ones, short2fp() function not working properly, too much coffee this morning, etc.), but in general this works for every binary floating-point format defined by IEEE754 (binary16, binary32, binary64, etc.) so I'm confident this works for your custom float format too.
P.S.: the short2fp() function is written very poorly, you may try improve its clearness if you want to investigate the inner workings of the function.

The two statements x = 191; and x = 63; set x to either 1•128 + 63 or 0•128 + 63, according to whether the number is negative or positive. Therefore 128 (27) has the sign bit at this point. As x is later multiplied by 0x8000 (215), the sign bit is 222 in the result.
These statements also initialize the exponent to 0, which is encoded as 63 due to a bias of 63. This follows the IEEE-754 pattern of using a bias of 2n−1−1 for an exponent field of n bits. (The “single” format has eight exponent bits and a bias of 27−1 = 127, and the “double” format has 11 exponent bits and a bias of 210−1 = 1023.) Thus we expect an exponent field of 7 bits, with bias 26−1 = 63.
This loop:
for (f = i; f > 1; f >>= 1)
x++;
detects the magnitude of i (the absolute value of the input), adding one to the exponent for each power of two that f is detected to exceed. For example, if the input is 4, 5, 6, or 7, the loop executes two times, adding two to x and reducing f to 1, at which point the loop stops. This confirms the exponent bias; if i is 1, x is left as is, so the initial value of 63 corresponds to an exponent of 0 and a represented value of 20 = 1.
The loop for (f = i; f < 0x8000; f <<= 1); scales f in the opposite direction, moving its leading bit to be in the 0x8000 position.
In return (x * 0x8000 + f - 0x8000);, x * 0x8000 moves the sign bit and exponent field from their initial positions (bit 7 and bits 6 to 0) to their final positions (bit 22 and bits 21 to 15). f - 0x8000 removes the leading bit from f, giving the trailing bits of the significand. This is then added to the final value, forming the primary encoding of the significand in bits 14 to 0.
Thus the format has the sign bit in bit 22, exponent bits in bits 21 to 15 with a bias of 63, and the trailing significand bits in bits 14 to 0.
The format could encode subnormal numbers, infinities, and NaNs in the usual way, but this is not discernible from the code shown, as it encodes only integers in the normal range.

As a comment suggested, I would use a small number of strategically selected test cases to reverse engineer the format. The following assumes an IEEE-754-like binary floating-point format using sign-magnitude encoding with a sign bit, exponent bits, and significand (mantissa) bits.
short2fp (1) = 001f8000 while short2fp (-1) = 005f8000. The exclusive OR of these is 0x00400000 which means the sign bit is in bit 22 and this floating-point format comprises 23 bits.
short2fp (1) = 001f8000, short2fp (2) = 00200000, and short2fp (4) = 00208000. The difference between consecutive values is 0x00008000 so the least significant bit of the exponent field is bit 15, the exponent field comprises 7 bits, and the exponent is biased by (0x001f8000 >> 15) = 0x3F = 63.
This leaves the least significant 15 bits for the significand. We can see from short2fp (2) = 00200000 that the integer bit of the significand (mantissa) is not stored, that is, it is implicit as in IEEE-754 formats like binary32 or binary64.

Related

what does float cf = *(float *)&ci; in C do?

i'm trying to find out what this program prints exactly.
#include <stdio.h>
int main() {
float bf = -62.140625;
int bi = *(int *)&bf;
int ci = bi+(1<<23);
float cf = *(float *)&ci;
printf("%X\n",bi);
printf("%f\n",cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line ? I do not understand .
Thanks in advance.
It is a convoluted way of doubling an 32bit floating point number by adding one to its exponent. Moreover it is incorrect due to violation of strict aliasing rule by accesing object if type float via type int.
Exponent is located at bits number 23 to 30. Adding 1<<23 increment the exponent by one what works like multiplication of the original number by 2.
If we rewrite this program to remove pointer punning
int main() {
float bf = -62.140625;
memcpy(&bi, &bf, sizeof(bi));
for(int i = 0; i < 32; i += 8)
printf("%02x ", ((unsigned)bi & (0xff << i)) >> i);
bi += (1<<23);
memcpy(&bf, &bi, sizeof(bi));;
printf("%f\n",bf);
}
Float numbers have the format:
-62.140625 has exponent == 0.
bi += (1<<23);
sets the exponent to 1 so the resulting float number will be -62.140625 * 2^1 and it is equal to -124.281250. If you change that line to
bi += (1<<24);
it will set the exponent to 4 so the resulting float number will be -62.140625 * 2^2 and it is equal to -248.562500.
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F•be, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 1021, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 1023 have special meaning in the exponent field. 1022 is a normal value, but adding 1 to it overflows the exponent in the special code 1023, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the number represented by 2. ±F•be becomes ±F•be+1.
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: It says to reinterpet the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.

How can I convert this number representation to a float?

I read this 16-bit value from a temperature sensor (type MCP9808)
Ignoring the first three MSBs, what's an easy way to convert the other bits to a float?
I managed to convert the values 2^7 through 2^0 to an integer with some bit-shifting:
uint16_t rawBits = readSensor();
int16_t value = (rawBits << 3) / 128;
However I can't think of an easy way to also include the bits with an exponent smaller than 0, except for manually checking if they're set and then adding 1/2, 1/4, 1/8 and 1/16 to the result respectively.
Something like this seems pretty reasonable. Take the number portion, divide by 16, and fix the sign.
float tempSensor(uint16_t value) {
bool negative = (value & 0x1000);
return (negative ? -1 : 1) * (value & 0x0FFF) / 16.0f;
}
float convert(unsigned char msb, unsigned char lsb)
{
return ((lsb | ((msb & 0x0f) << 8)) * ((msb & 0x10) ? -1 : 1)) / 16.0f;
}
or
float convert(uint16_t val)
{
return (((val & 0x1000) ? -1 : 1) * (val << 4)) / 256.0f;
}
If performance isn't a super big deal, I would go for something less clever and more explcit, along the lines of:
bool is_bit_set(uint16_t value, uint16_t bit) {
uint16_t mask = 1 << bit;
return (value & mask) == mask;
}
float parse_temperature(uint16_t raw_reading) {
if (is_bit_set(raw_reading, 15)) { /* temp is above Tcrit. Do something about it. */ }
if (is_bit_set(raw_reading, 14)) { /* temp is above Tupper. Do something about it. */ }
if (is_bit_set(raw_reading, 13)) { /* temp is above Tlower. Do something about it. */ }
uint16_t whole_degrees = (raw_reading & 0x0FF0) >> 4;
float magnitude = (float) whole_degrees;
if (is_bit_set(raw_reading, 0)) magnitude += 1.0f/16.0f;
if (is_bit_set(raw_reading, 1)) magnitude += 1.0f/8.0f;
if (is_bit_set(raw_reading, 2)) magnitude += 1.0f/4.0f;
if (is_bit_set(raw_reading, 3)) magnitude += 1.0f/2.0f;
bool is_negative = is_bit_set(raw_reading, 12);
// TODO: What do the 3 most significant bits do?
return magnitude * (is_negative ? -1.0 : 1.0);
}
Honestly this is a lot of simple constant math, I'd be surprised if the compiler wasn't able to heavily optimize it. That would need confirmation, of course.
If your C compiler has a clz buitlin or equivalent, it could be useful to avoid mul operation.
In your case, as the provided temp value looks like a mantissa and if your C compiler uses IEEE-754 float representation, translating the temp value in its IEEE-754 equivalent may be a most efficient way :
Update: Compact the code a little and more clear explanation about the mantissa.
float convert(uint16_t val) {
uint16_t mantissa = (uint16_t)(val <<4);
if (mantissa==0) return 0.0;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
uint32_t r = (uint32_t)((val & 0x1000) << 19 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
return *((float *)(&r));
}
or
float convert(unsigned char msb, unsigned char lsb) {
uint16_t mantissa = (uint16_t)((msb<<8 | lsb) <<4);
if (mantissa==0) return 0.0;
unsigned char e = (unsigned char)(__builtin_clz(mantissa) - 16);
uint32_t r = (uint32_t)((msb & 0x10) << 27 | (0x86 - e) << 23 | ((mantissa << (e+8)) & 0x07FFFFF));
return *((float *)(&r));
}
Explanation:
We use the fact that the temp value is somehow a mantissa in the range -255 to 255.
We can then consider that its IEEE-754 exponent will be 128 at max to -128 at min.
We use the clz buitlin to get the "order" of the first bit set in the mantissa,
this way we can define the exponent as the therorical max (2^7 =>128) less this "order".
We use also this order to left shift the temp value to get the IEEE-754 mantissa,
plus one left shift to substract the '1' implied part of the significand for IEEE-754.
Thus we build a 32 bits binary IEEE-754 representation from the temp value with :
At first the sign bit to the 32th bit of our binary IEEE-754 representation.
The computed exponent as the theorical max 7 (2^7 =>128) plus the IEEE-754 bias (127) minus the actual "order" of the temp value.
The "order" of the temp value is deducted from the number of leading '0' of its 12 bits representation in the variable mantissa through the clz builtin.
Beware that here we consider that the clz builtin is expecting a 32 bit value as parameter, that is why we substract 16 here. This code may require adaptation if your clz expects anything else.
The number of leading '0' can go from 0 (temp value above 128 or under -127) to 11 as we directly return 0.0 for a zero temp value.
As the following bit of the "order" is then 1 in the temp value, it is equivalent to a power of 2 reduction from the theorical max 7.
Thus, with 7 + 127 => 0x86, we can simply substract to that the "order" as the number of leading '0' permits us to deduce the 'first' base exponent for IEEE-754.
If the "order" is greater than 7 we will still get the negative exponent required for less than 1 values.
We add then this 8bits exponent to our binary IEEE-754 representation from 24th bit to 31th bit.
The temp value is somehow already a mantissa, we suppress the leading '0' and its first bit set by shifting it to the left (e + 1) while also shifting left for 7 bits to place the mantissa in the 32 bits (e+7+1 => e+8) . We mask then only the desired 23 bits (AND &0x7FFFFF).
Its first bit set must be removed as it is the '1' implied significand in IEEE-754 (the power of 2 of the exponent).
We have then the IEEEE-754 mantissa and place it from the 8th bit to the 23th bit of our binary IEEE-754 representation.
The 4 initial trailing 0 from our 16 bits temp value and the added seven 'right' 0 from the shifting won't change the effective IEEE-754 value.
As we start from a 32 bits value and use or operator (|) on a 32 bits exponent and mantissa, we have then the final IEEE-754 representation.
We can then return this binary representation as an IEEE-754 C float value.
Due to the required clz and the IEEE-754 translation, this way is less portable. The main interest is to avoid MUL operations in the resulting machine code for performance on arch with a "poor" FPU.
P.S.: Casts explanation. I've added explicit casts to let the C compiler know that we discard voluntary some bits :
uint16_t mantissa = (uint16_t)(val <<4); : The cast here tells the compiler that we know we'll "loose" four left bits, as it the goal here. We discard the four first bits of the temp value for the mantissa.
(unsigned char)(__builtin_clz(mantissa) - 16) : We tell to the C compiler that we will only consider a 8 bits range for the builtin return, as we know our mantissa has only 12 significatives bits and thus a range output from 0 to 12. Thus we do not need the full int return.
uint32_t r = (uint32_t) ... : We tell the C compiler to not bother with the sign representation here as we build an IEEE-754 representation.

How is float to int type conversion done in C? [duplicate]

I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage.
From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there?
As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?
I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.
Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 224. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: Positive value
Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 21)
Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1).
So the value is 1.0 x 21 = 2.0.
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
// Only support 0 < significand < 1 << 24.
if (significand == 0 || significand >= 1 << 24)
return -1.0; // or abort(); or whatever you'd like here.
int shifts = 0;
// Align the leading 1 of the significand to the hidden-1
// position. Count the number of shifts required.
while ((significand & (1 << 23)) == 0)
{
significand <<= 1;
shifts++;
}
// The number 1.0 has an exponent of 0, and would need to be
// shifted left 23 times. The number 2.0, however, has an
// exponent of 1 and needs to be shifted left only 22 times.
// Therefore, the exponent should be (23 - shifts). IEEE-754
// format requires a bias of 127, though, so the exponent field
// is given by the following expression:
unsigned int exponent = 127 + 23 - shifts;
// Now merge significand and exponent. Be sure to strip away
// the hidden 1 in the significand.
unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);
// Reinterpret as a float and return. This is an evil hack.
return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 224, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 224, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
Have you checked the IEEE 754 floating-point representation?
In 32-bit normalized form, it has (mantissa's) sign bit, 8-bit exponent (excess-127, I think) and 23-bit mantissa in "decimal" except that the "0." is dropped (always in that form) and the radix is 2, not 10. That is: the MSB value is 1/2, the next bit 1/4 and so on.
Joe Z's answer is elegant but range of input values is highly limited. 32 bit float can store all integer values from the following range:
[-224...+224] = [-16777216...+16777216]
and some other values outside this range.
The whole range would be covered by this:
float int2float(int value)
{
// handles all values from [-2^24...2^24]
// outside this range only some integers may be represented exactly
// this method will use truncation 'rounding mode' during conversion
// we can safely reinterpret it as 0.0
if (value == 0) return 0.0;
if (value == (1U<<31)) // ie -2^31
{
// -(-2^31) = -2^31 so we'll not be able to handle it below - use const
// value = 0xCF000000;
return (float)INT_MIN; // *((float*)&value); is undefined behaviour
}
int sign = 0;
// handle negative values
if (value < 0)
{
sign = 1U << 31;
value = -value;
}
// although right shift of signed is undefined - all compilers (that I know) do
// arithmetic shift (copies sign into MSB) is what I prefer here
// hence using unsigned abs_value_copy for shift
unsigned int abs_value_copy = value;
// find leading one
int bit_num = 31;
int shift_count = 0;
for(; bit_num > 0; bit_num--)
{
if (abs_value_copy & (1U<<bit_num))
{
if (bit_num >= 23)
{
// need to shift right
shift_count = bit_num - 23;
abs_value_copy >>= shift_count;
}
else
{
// need to shift left
shift_count = 23 - bit_num;
abs_value_copy <<= shift_count;
}
break;
}
}
// exponent is biased by 127
int exp = bit_num + 127;
// clear leading 1 (bit #23) (it will implicitly be there but not stored)
int coeff = abs_value_copy & ~(1<<23);
// move exp to the right place
exp <<= 23;
union
{
int rint;
float rfloat;
}ret = { sign | exp | coeff };
return ret.rfloat;
}
Of course there are other means to find abs value of int (branchless). Similarly couting leading zeros can also be done without a branch so treat this example as example ;-).

How to generate an IEEE 754 Single-precision float using only integer arithmetic?

Assuming a low end microprocessor with no floating point arithmetic, I need to generate an IEE754 single precision floating point format number to push out to a file.
I need to write a function that takes three integers being the sign, whole and the fraction and returns a byte array with 4 bytes being the IEEE 754 single precision representation.
Something like:
// Convert 75.65 to 4 byte IEEE 754 single precision representation
char* float = convert(0, 75, 65);
Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.
You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).
Bear in mind that the fraction has an implicit leading '1' bit, which means that the most significant leading '1' bit (2^22) is not stored in the IEEE format. For example, given a fraction of 0x755555 (24 bits), the actual bits stored would be 0x355555 (23 bits).
Also bear in mind that the fraction is shifted so that the binary point is immediately to the right of the implicit leading '1' bit. So an IEEE 23-bit fraction of 11 0101 0101... represents the 24-bit binary fraction 1.11 0101 0101...
This means that the exponent has to be adjusted accordingly.
Does the value have to be written big endian or little endian? Reversed bit ordering?
If you are free, you should think about writing the value as string literal. That way you can easily convert the integer: just write the int part and write "e0" as exponent (or omit the exponent and write ".0").
For the binary representation, you should have a look at Wikipedia. Best is to first assemble the bitfields to an uint32_t - the structure is given in the linked article. Note that you might have to round if the integer has more than 23 bits value. Remember to normalize the generated value.
Second step will be to serialize the uint32_t to an uint8_t-array. Mind the endianess of the result!
Also note to use uint8_t for the result if you really want 8 bit values; you should use an unsigned type. For the intermediate representation, using uint32_t is recommended as that will guarantee you operate on 32 bit values.
You haven't had a go yet so no give aways.
Remember you can regard two 32-bit integers a & b to be interpreted as a decimal a.b as being a single 64-bit integer with an exponent of 2^-32 (where ^ is exponent).
So without doing anything you've got it in the form:
s * m * 2^e
The only problem is your mantissa is too long and your number isn't normalized.
A bit of shifting and adding/subtracting with a possible rounding step and you're done.
You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
The basic premise is to:
Given binary32 float.
Form a binary fixed-point representation of the combined whole and factional parts hundredths. This code uses a structure encoding both whole and hundredths fields separately. Important that the whole field is at least 32 bits.
Shift left/right (*2 and /2) until MSbit is in the implied bit position whilst counting the shifts. A robust solution would also note non-zero bits shifted out.
Form a biased exponent.
Round mantissa and drop implied bit.
Form sign (not done here).
Combine the above 3 steps to form the answer.
As Sub-normals, infinites & Not-A-Number will not result with whole, hundredths input, generating those float special cases are not addressed here.
.
#include <assert.h>
#include <stdint.h>
#define IMPLIED_BIT 0x00800000L
typedef struct {
int_least32_t whole;
int hundreth;
} x_xx;
int_least32_t covert(int whole, int hundreth) {
assert(whole >= 0 && hundreth >= 0 && hundreth < 100);
if (whole == 0 && hundreth == 0) return 0;
x_xx x = { whole, hundreth };
int_least32_t expo = 0;
int sticky_bit = 0; // Note any 1 bits shifted out
while (x.whole >= IMPLIED_BIT * 2) {
expo++;
sticky_bit |= x.hundreth % 2;
x.hundreth /= 2;
x.hundreth += (x.whole % 2)*(100/2);
x.whole /= 2;
}
while (x.whole < IMPLIED_BIT) {
expo--;
x.hundreth *= 2;
x.whole *= 2;
x.whole += x.hundreth / 100;
x.hundreth %= 100;
}
int32_t mantissa = x.whole;
// Round to nearest - ties to even
if (x.hundreth >= 100/2 && (x.hundreth > 100/2 || x.whole%2 || sticky_bit)) {
mantissa++;
}
if (mantissa >= (IMPLIED_BIT * 2)) {
mantissa /= 2;
expo++;
}
mantissa &= ~IMPLIED_BIT; // Toss MSbit as it is implied in final
expo += 24 + 126; // Bias: 24 bits + binary32 bias
expo <<= 23; // Offset
return expo | mantissa;
}
void test_covert(int whole, int hundreths) {
union {
uint32_t u32;
float f;
} u;
u.u32 = covert(whole, hundreths);
volatile float best = whole + hundreths / 100.0;
printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundreths, u.f, best,
best == u.f);
}
#include <limits.h>
int main(void) {
test_covert(75, 65);
test_covert(0, 1);
test_covert(INT_MAX, 99);
return 0;
}
Output
75.65 --> 7.565000e+01 7.565000e+01 Same:1
0.01 --> 1.000000e-02 1.000000e-02 Same:1
2147483647.99 --> 2.147484e+09 2.147484e+09 Same:1
Known issues: sign not applied.

Convert from binary to floating point

I'm doing some exercises for Computer Science university and one of them is about converting an int array of 64 bit into it's double-precision floating point value.
Understanding the first bit, the sign +/-, is quite easy. Same for the exponent, as well as we know that the bias is 1023.
We are having problems with the significand. How can I calculate it?
In the end, I would like to obtain the real numbers that the bits meant.
computing the significand of the given 64 bit is quite easy.
according to the wiki article using the IEEE 754, the significand is made up the first 53 bits (from bit 0 to bit 52).
Now if you want to convert number having like 67 bits to your 64 bits value, it would be rounded by setting the trailing 64th bits of your value to 1, even if it was one before... because of the other 3 bits:
11110000 11110010 11111 becomes 11110000 11110011 after the rounding of the last byte;
therefore the there is no need to store the 53th bits because it has always a value a one.
that's why you only store in 52 bits in the significand instead of 53.
now to compute it, you just need to target the bit range of the significand [bit(1) - bit(52)] -bit(0) is always 1- and use it .
int index_signf = 1; // starting at 1, not 0
int significand_length = 52;
int byteArray[53]; // array containing the bits of the significand
double significand_endValue = 0;
for( ; index_signf <= significand_length ; index_signf ++)
{
significand_endValue += byteArray[index_signf] * (pow(2,-(index_signf)));
}
significand_endValue += 1;
Now you just have to fill byteArray accordlingly before computing it, using function like that:
int* getSignificandBits(int* array64bits){
//returned array
int significandBitsArray[53];
// indexes++
int i_array64bits = 0;
int i_significandBitsArray=1;
//set the first bit = 1
significandBitsArray[0] = 1;
// fill it
for(i_significandBitsArray=1, i_array64bits = (63 - 1); i_array64bits >= (64 - 52); i_array64bits--, i_significandBitsArray ++)
significandBitsArray[i_significandBitsArray] = array64bits[i_array64bits];
return significandBitsArray;
}
You could just load the bits into an unsigned integer of the same size as a double, take the address of that and cast it to a void* which you then cast to a double* and dereference.
Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.
If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.
double convert_to_double(uint64_t x) {
double result;
mempcy(&result, &x, sizeof(x));
return result;
}
You will often see code like *(double *)&x to do the conversion, but whereas in practice this will always work it's undefined behavior in C.

Resources