Convert from binary to floating point - C

I'm doing some exercises for my Computer Science degree, and one of them is about converting a 64-element int array of bits into its double-precision floating-point value.
Understanding the first bit, the sign +/-, is quite easy. Same for the exponent, since we know the bias is 1023.
We are having problems with the significand. How can I calculate it?
In the end, I would like to obtain the real number that the bits represent.

Computing the significand of the given 64 bits is quite easy.
According to the Wikipedia article on IEEE 754, the significand is made up of 53 bits (bit 0 to bit 52), where bit 0 is an implicit leading 1.
If a number needs more significant bits than the format can hold, it gets rounded to the nearest representable value; for example, the bits 11110000 11110010 11111 become 11110000 11110011 after rounding away the last five bits (the dropped bits are more than half of the dropped range, so the kept part is rounded up).
Independently of rounding, a normalized binary number always begins with a 1 bit, so there is no need to store the leading bit of the significand: it always has the value 1.
That's why you only store 52 bits of the significand instead of 53.
Now to compute it, you just need to target the bit range of the stored significand, bit 1 to bit 52 (bit 0 is always 1), and use it:
#include <math.h>

int index_signf = 1;          // starting at 1, not 0
int significand_length = 52;
int byteArray[53];            // array containing the bits of the significand
double significand_endValue = 0;

for ( ; index_signf <= significand_length; index_signf++)
{
    significand_endValue += byteArray[index_signf] * pow(2, -index_signf);
}
significand_endValue += 1;    // add the implicit leading 1 (bit 0)
Now you just have to fill byteArray accordingly before computing it, using a function like this:
// Fills significandBitsArray[0..52] with the bits of the significand.
// Assumes array64bits[0] is the sign bit, array64bits[1..11] the exponent,
// and array64bits[12..63] the 52 stored fraction bits, most significant first.
// The caller provides the output array: returning a pointer to a local
// array would be undefined behaviour in C.
void getSignificandBits(const int *array64bits, int *significandBitsArray)
{
    // indexes
    int i_array64bits;
    int i_significandBitsArray;

    // set the implicit first bit = 1
    significandBitsArray[0] = 1;

    // fill it
    for (i_significandBitsArray = 1, i_array64bits = 12;
         i_array64bits <= 63;
         i_array64bits++, i_significandBitsArray++)
        significandBitsArray[i_significandBitsArray] = array64bits[i_array64bits];
}

You could just load the bits into an unsigned integer of the same size as a double, take the address of that and cast it to a void* which you then cast to a double* and dereference.
Of course, this might be "cheating" if you really are supposed to parse the floating point standard, but this is how I would have solved the problem given the parameters you've stated so far.
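A minimal sketch of that idea (note: the pointer cast is formally undefined behavior in C, as discussed further down; many compilers accept it anyway, sometimes behind a flag like -fno-strict-aliasing):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t bits = 0x4037A00000000000ULL; // example bit pattern; encodes 23.625
    double d = *(double *)(void *)&bits;   // formally UB in C; see below
    printf("%f\n", d);                     // prints 23.625000
    return 0;
}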

If you have a byte representation of an object you can copy the bytes into the storage of a variable of the right type to convert it.
#include <stdint.h>
#include <string.h>

double convert_to_double(uint64_t x) {
    double result;
    memcpy(&result, &x, sizeof(x));
    return result;
}
You will often see code like *(double *)&x used to do the conversion, but while in practice this usually works, it is undefined behavior in C.
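Another alternative you will see is a union; unlike the pointer cast, reading a different union member than the one last written is defined behavior in C99 and later (though not in C++). A minimal sketch:
#include <stdint.h>

double convert_to_double_union(uint64_t x) {
    union { uint64_t u; double d; } pun;
    pun.u = x;
    return pun.d; // type punning through a union is defined in C99/C11
}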

Related

what does float cf = *(float *)&ci; in C do?

I'm trying to find out what this program prints, exactly.
#include <stdio.h>

int main() {
    float bf = -62.140625;
    int bi = *(int *)&bf;
    int ci = bi + (1 << 23);
    float cf = *(float *)&ci;
    printf("%X\n", bi);
    printf("%f\n", cf);
}
This prints out:
C2789000
-124.281250
But what happens line by line? I do not understand.
Thanks in advance.
It is a convoluted way of doubling a 32-bit floating-point number by adding one to its exponent. Moreover, it is incorrect due to violating the strict aliasing rule by accessing an object of type float via type int.
The exponent is located at bits 23 to 30. Adding 1<<23 increments the exponent by one, which works like multiplying the original number by 2.
If we rewrite this program to remove the pointer punning:
#include <stdio.h>
#include <string.h>

int main() {
    float bf = -62.140625;
    int bi;
    memcpy(&bi, &bf, sizeof(bi));
    for (int i = 0; i < 32; i += 8)
        printf("%02x ", ((unsigned)bi & (0xffu << i)) >> i);
    bi += (1 << 23);
    memcpy(&bf, &bi, sizeof(bi));
    printf("%f\n", bf);
}
Float numbers have the format (-1)^sign * 1.mantissa * 2^(exponent - 127).
The line
bi += (1<<23);
adds 1 to the exponent field, so the resulting float number will be -62.140625 * 2^1, which is equal to -124.281250. If you change that line to
bi += (1<<24);
it adds 2 to the exponent field, so the resulting float number will be -62.140625 * 2^2, which is equal to -248.562500.
float bf = -62.140625;
This creates a float object named bf and initializes it to −62.140625.
int bi = *(int *)&bf;
&bf takes the address of bf, which produces a pointer to a float. (int *) says to convert this to a pointer to an int. Then * says to access the pointed-to memory, as if it were an int.
The C standard does not define the behavior of this access, but many C implementations support it, sometimes requiring a command-line switch to enable support for it.
A float value is normally encoded in some way. −62.140625 is not an integer, so it cannot be stored as a binary numeral that represents an integer. It is encoded. Reinterpreting the bytes of memory as an int using * (int *) &bf is an attempt to get the bits into an int so they can be manipulated directly, instead of through floating-point operations.
int ci = bi+(1<<23);
The format most commonly used for the float type is IEEE-754 binary32, also called “single precision.” In this format, bit 31 is a sign bit, bits 30-23 encode an exponent and/or some other information, and bits 22-0 encode most of a significand (or, in the case of a NaN, other information). (The significand is the fraction part of a floating-point representation. A floating-point format represents a number as ±F·b^e, where b is a fixed base, F is a number with a fixed precision in a certain range, and e is an exponent in a certain range. F is the significand.)
1<<23 is 1 shifted 23 bits, so it is 1 in the exponent field, bits 30-23.
If the exponent field contains 1 to 253, then adding 1 to it increases the encoded exponent by 1. (The codes 0 and 255 have special meaning in the exponent field. 254 is a normal value, but adding 1 to it overflows the exponent into the special code 255, so it will not increase the exponent in a normal way.)
Since the base b of a binary floating-point format is 2, increasing the exponent by 1 multiplies the number represented by 2. ±F·b^e becomes ±F·b^(e+1).
float cf = *(float *)&ci;
This is the opposite of the previous reinterpretation: it says to reinterpret the bytes of the int as a float.
printf("%X\n",bi);
This says to print bi using a hexadecimal format. This is technically wrong; the %X format should be used with an unsigned int, not an int, but most C implementations let it pass.
printf("%f\n",cf);
This prints the new float value.
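For completeness, here is a sketch of the same experiment without the type-punned pointers, using memcpy as described above:
#include <stdio.h>
#include <string.h>

int main(void) {
    float bf = -62.140625f;
    unsigned int bits;
    memcpy(&bits, &bf, sizeof bits); // well-defined replacement for *(int *)&bf
    bits += 1u << 23;                // add 1 to the exponent field
    float cf;
    memcpy(&cf, &bits, sizeof cf);
    printf("%X\n", bits);            // prints C2F89000 (C2789000 plus 800000)
    printf("%f\n", cf);              // prints -124.281250
    return 0;
}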

How is float to int type conversion done in C? [duplicate]

I was wondering if you could help explain the process of converting an integer to a float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding of casting from type to type will help me more at this stage.
From what I know so far, for int to float, you have to convert the integer into binary, normalize it by finding the significand, exponent, and fraction, and then output the value in float from there?
As for float to int, you have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value?
I tried to follow the instructions from this question: Casting float to int (bitwise) in C.
But I was not really able to understand it.
Also, could someone explain why rounding is necessary for values wider than 23 bits when converting int to float?
First, a paper you should consider reading if you want to understand floating-point foibles better: "What Every Computer Scientist Should Know About Floating-Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single-precision float from an unsigned int in the range 0 < value < 2^24. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. (The Wikipedia article linked below has a diagram of this layout.)
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
Bit 31 = 0: positive value
Bits 30 .. 23 = 0x80: exponent = 128 - 127 = 1 (i.e. 2^1)
Bits 22 .. 0 are all 0: significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1.)
So the value is 1.0 x 2^1 = 2.0.
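A quick way to check that worked example (a sketch using memcpy rather than a type-punned pointer):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t bits = 0x40000000;
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("%f\n", f); // prints 2.000000
    return 0;
}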
To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:
Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation.
While aligning the integer, records the total number of shifts made.
Masks away the hidden 1.
Using the number of shifts made, computes the exponent and appends it to the number.
Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
float uint_to_float(unsigned int significand)
{
    // Only support 0 < significand < 1 << 24.
    if (significand == 0 || significand >= 1 << 24)
        return -1.0;  // or abort(); or whatever you'd like here.

    int shifts = 0;

    // Align the leading 1 of the significand to the hidden-1
    // position. Count the number of shifts required.
    while ((significand & (1 << 23)) == 0)
    {
        significand <<= 1;
        shifts++;
    }

    // The number 1.0 has an exponent of 0, and would need to be
    // shifted left 23 times. The number 2.0, however, has an
    // exponent of 1 and needs to be shifted left only 22 times.
    // Therefore, the exponent should be (23 - shifts). IEEE-754
    // format requires a bias of 127, though, so the exponent field
    // is given by the following expression:
    unsigned int exponent = 127 + 23 - shifts;

    // Now merge significand and exponent. Be sure to strip away
    // the hidden 1 in the significand.
    unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

    // Reinterpret as a float and return. This is an evil hack.
    return *reinterpret_cast< float* >( &merged );
}
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".)
You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
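For example, with GCC or Clang you could let __builtin_clz find the leading 1 instead of the shift loop. A sketch, with the same restricted input range as above and sign handling still omitted:
#include <string.h>

float uint_to_float_clz(unsigned int value) // assumes 0 < value < 1 << 24
{
    // __builtin_clz counts the zero bits above the leading 1 (out of 32);
    // the leading 1 must end up in bit 23, so shift by clz - (32 - 24).
    int shifts = __builtin_clz(value) - 8;
    unsigned int aligned = value << shifts;
    unsigned int exponent = 127 + 23 - shifts;  // same formula as above
    unsigned int merged = (exponent << 23) | (aligned & 0x7FFFFF);
    float result;
    memcpy(&result, &merged, sizeof result);    // avoids the type-punned pointer
    return result;
}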
For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": you lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating-point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If a value was >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
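You can see the effect directly: 2^24 + 1 = 16777217 needs 25 significant bits, so it cannot survive the conversion intact:
#include <stdio.h>

int main(void) {
    float f = (float)16777217;  // one more than 2^24
    printf("%.1f\n", f);        // prints 16777216.0 under round-to-nearest-even
    return 0;
}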
Have you checked the IEEE 754 floating-point representation?
In 32-bit normalized form, it has a sign bit, an 8-bit exponent (excess-127), and a 23-bit mantissa holding the binary fraction, except that the leading "1." is dropped (it is always there in normalized form) and the radix is 2, not 10. That is: the MSB of the stored mantissa has the value 1/2, the next bit 1/4, and so on.
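For example, 2.5 encodes as:
2.5 = 1.01_2 x 2^1
sign = 0, exponent field = 1 + 127 = 128 = 1000 0000_2, mantissa = 0100...0 (leading 1 dropped)
word = 0 10000000 01000000000000000000000 = 0x40200000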
Joe Z's answer is elegant, but the range of input values is highly limited. A 32-bit float can store all integer values from the following range:
[-2^24 ... +2^24] = [-16777216 ... +16777216]
and some other values outside this range.
The whole range would be covered by this:
#include <limits.h>

float int2float(int value)
{
    // handles all values from [-2^24...2^24]
    // outside this range only some integers may be represented exactly
    // this method will use truncation 'rounding mode' during conversion

    // we can safely reinterpret it as 0.0
    if (value == 0) return 0.0;

    if (value == (1U << 31)) // i.e. -2^31
    {
        // -(-2^31) = -2^31, so we'll not be able to handle it below - use const
        // value = 0xCF000000;
        return (float)INT_MIN; // *((float*)&value); is undefined behaviour
    }

    int sign = 0;

    // handle negative values
    if (value < 0)
    {
        sign = 1U << 31;
        value = -value;
    }

    // although right shift of signed is implementation-defined, all compilers
    // (that I know) do an arithmetic shift (copies sign into MSB), which is
    // what I prefer here - hence using unsigned abs_value_copy for the shifts
    unsigned int abs_value_copy = value;

    // find leading one
    int bit_num = 31;
    int shift_count = 0;

    for (; bit_num > 0; bit_num--)
    {
        if (abs_value_copy & (1U << bit_num))
        {
            if (bit_num >= 23)
            {
                // need to shift right
                shift_count = bit_num - 23;
                abs_value_copy >>= shift_count;
            }
            else
            {
                // need to shift left
                shift_count = 23 - bit_num;
                abs_value_copy <<= shift_count;
            }
            break;
        }
    }

    // exponent is biased by 127
    int exp = bit_num + 127;

    // clear leading 1 (bit #23) (it will implicitly be there but not stored)
    int coeff = abs_value_copy & ~(1 << 23);

    // move exp to the right place
    exp <<= 23;

    union
    {
        int rint;
        float rfloat;
    } ret = { sign | exp | coeff };

    return ret.rfloat;
}
Of course there are other ways to find the absolute value of an int (branchless). Similarly, counting leading zeros can also be done without a branch, so treat this as an example ;-).
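For illustration, one well-known branchless absolute value (a sketch: it assumes two's complement and an arithmetic right shift of negative signed values, which C leaves implementation-defined, and INT_MIN still overflows, as in the code above):
int mask = value >> 31;                          // 0 when positive, -1 when negative
unsigned int abs_value = (value ^ mask) - mask;  // equals ~value + 1 for negatives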

C Bit-Level Int to Float Conversion Unexpected Output

Background:
I am playing around with bit-level coding (this is not homework - just curious). I found a lot of good material online and in a book called Hacker's Delight, but I am having trouble with one of the online problems.
It asks to convert an integer to a float. I used the following links as reference to work through the problem:
How to manually (bitwise) perform (float)x?
How to convert an unsigned int to a float?
http://locklessinc.com/articles/i2f/
Problem and Question:
I thought I understood the process well enough (I tried to document the process in the comments), but when I test it, I don't understand the output.
Test Cases:
float_i2f(2) returns 1073741824
float_i2f(3) returns 1077936128
I expected to see something like 2.0000 and 3.0000.
Did I mess up the conversion somewhere? I thought maybe this was a memory address, so I was thinking maybe I missed something in the conversion step needed to access the actual number? Or maybe I am printing it incorrectly? I am printing my output like this:
printf("Float_i2f ( %d ): ", 3);
printf("%u", float_i2f(3));
printf("\n");
But I thought that printing method was fine for unsigned values in C (I'm used to programming in Java).
Thanks for any advice.
Code:
/*
 * float_i2f - Return bit-level equivalent of expression (float) x
 *   Result is returned as unsigned int, but
 *   it is to be interpreted as the bit-level representation of a
 *   single-precision floating point value.
 *   Legal ops: Any integer/unsigned operations incl. ||, &&. also if, while
 *   Max ops: 30
 *   Rating: 4
 */
unsigned float_i2f(int x) {
    if (x == 0) {
        return 0;
    }

    // save the sign bit for later and get the absolute value of x;
    // the absolute value is needed to shift bits to put them
    // into the appropriate position for the float
    unsigned int signBit = 0;
    unsigned int absVal = (unsigned int)x;
    if (x < 0) {
        signBit = 0x80000000;
        absVal = (unsigned int)-x;
    }

    // Calculate the exponent:
    // shift the input left until the high-order bit is set to form the mantissa;
    // form the floating exponent by subtracting the number of shifts from 158.
    unsigned int exponent = 158; // 158 = 127 (bias) + 31 (position of the high-order bit)
    while ((absVal & 0x80000000) == 0) { // loop until the leading 1 reaches bit 31
        exponent--;
        absVal <<= 1;
    }

    // find the mantissa (bit shift to the right)
    unsigned int mantissa = absVal >> 8;

    // place the exponent bits in the right place
    exponent = exponent << 23;

    // keep only the 23 stored mantissa bits (drops the implicit leading 1)
    mantissa = mantissa & 0x7fffff;

    // return the reconstructed float
    return signBit | exponent | mantissa;
}
Continuing from the comment: your code is correct, and you are simply looking at the equivalent unsigned integer made up by the bits of your IEEE-754 single-precision floating-point number. The IEEE-754 single-precision format (made up of the sign, biased exponent, and mantissa) can be interpreted as a float, or those same 32 bits can be interpreted as an unsigned integer (just the number that is made up by the 32 bits). You are outputting the unsigned equivalent of the floating-point number.
You can confirm with a simple union. For example:
#include <stdio.h>
#include <stdint.h>

typedef union {
    uint32_t u;
    float f;
} u2f;

int main (void) {
    u2f tmp = { .f = 2.0 };
    printf ("\n u : %u\n f : %f\n", tmp.u, tmp.f);
    return 0;
}
Example Usage/Output
$ ./bin/unionuf
u : 1073741824
f : 2.000000
Let me know if you have any further questions. It's good to see that your study resulted in the correct floating point conversion. (also note the second comment regarding truncation/rounding)
I'll just chime in here, because nothing specifically about endianness has been addressed. So let's talk about it.
The construction of the value in the original question was endianness-agnostic, using shifts and other bitwise operations. This means that regardless of whether your system is big- or little-endian, the actual value will be the same. The difference will be its byte order in memory.
The generally accepted convention for IEEE-754 is that the byte order is big-endian (although I believe there is no formal specification of this, and therefore no requirement on implementations to follow it). This means if you want to directly interpret your integer value as a float, it needs to be laid out in big-endian byte order.
So, you can use this approach combined with a union if and only if you know that the endianness of floats and integers on your system is the same.
On the common Intel-based architectures this is actually fine: integers and floats are both stored little-endian there, so the union (or memcpy) approach works. If, however, your platform stores floats and integers with different byte orders, or you need a specific byte order such as big-endian for a file format, you must repack the bytes explicitly. The following extraction produces big-endian byte order regardless of the integer's endianness:
uint32_t n = float_i2f( input_val );
uint8_t bytes[4] = {
    (uint8_t)((n >> 24) & 0xff),
    (uint8_t)((n >> 16) & 0xff),
    (uint8_t)((n >> 8) & 0xff),
    (uint8_t)(n & 0xff)
};
float fval;
// note: this reconstructs the float correctly only where floats are big-endian
memcpy( &fval, bytes, sizeof(float) );
I'll stress that you only need to worry about this if you are trying to reinterpret your integer representation as a float or the other way round.
If you're only trying to output what the representation is in bits, then you don't need to worry. You can just display your integer in a useful form such as hex:
printf( "0x%08x\n", n );

How to generate an IEEE 754 Single-precision float using only integer arithmetic?

Assuming a low-end microprocessor with no floating-point arithmetic, I need to generate an IEEE 754 single-precision floating-point format number to push out to a file.
I need to write a function that takes three integers being the sign, whole and the fraction and returns a byte array with 4 bytes being the IEEE 754 single precision representation.
Something like:
// Convert 75.65 to 4-byte IEEE 754 single-precision representation
char *bytes = convert(0, 75, 65);
Does anybody have any pointers or example C code please? I'm particularly struggling to understand how to convert the mantissa.
You will need to generate the sign (1 bit), the exponent (8 bits, a biased power of 2), and the fraction/mantissa (23 bits).
Bear in mind that the significand has an implicit leading '1' bit, which means its most significant bit (weight 2^23 of the 24-bit significand) is not stored in the IEEE format. For example, given a 24-bit significand of 0xB55555, the actual bits stored would be 0x355555 (23 bits).
Also bear in mind that the fraction is shifted so that the binary point is immediately to the right of the implicit leading '1' bit. So a stored 23-bit fraction of 011 0101 0101... represents the 24-bit binary significand 1.011 0101 0101...
This means that the exponent has to be adjusted accordingly.
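To make this concrete, here is 75.65 worked by hand (rounded to nearest):
75.65 ≈ 1001011.10100110011..._2 = 1.00101110100110011..._2 x 2^6
sign     = 0
exponent = 6 + 127 = 133 = 1000 0101_2
fraction = 0010 1110 1001 1001 1001 101 (23 bits, rounded up)
word     = 0 10000101 00101110100110011001101 = 0x42974CCD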
Does the value have to be written big endian or little endian? Reversed bit ordering?
If you are free to choose the output format, you should think about writing the value as a string literal. That way you can easily convert the integer: just write the int part and write "e0" as the exponent (or omit the exponent and write ".0").
For the binary representation, you should have a look at Wikipedia. Best is to first assemble the bit fields into a uint32_t - the structure is given in the linked article. Note that you might have to round if the integer has more than 23 value bits. Remember to normalize the generated value.
The second step will be to serialize the uint32_t to a uint8_t array. Mind the endianness of the result!
Also note that you should use uint8_t for the result if you really want 8-bit values: use an unsigned type. For the intermediate representation, uint32_t is recommended, as that guarantees you operate on 32-bit values.
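A sketch of that serialization step, here in big-endian byte order (swap the indices for little-endian; the function name is just for illustration):
#include <stdint.h>

void serialize_be(uint32_t v, uint8_t out[4]) {
    out[0] = (uint8_t)(v >> 24); // most significant byte first
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}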
You haven't had a go yet, so no giveaways.
Remember that you can regard two 32-bit integers a and b, interpreted as the number a.b, as a single 64-bit integer with an exponent of 2^-32 (where ^ denotes exponentiation).
So without doing anything you've got it in the form:
s * m * 2^e
The only problem is your mantissa is too long and your number isn't normalized.
A bit of shifting and adding/subtracting with a possible rounding step and you're done.
You can use a software floating point compiler/library.
See https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
The basic premise is to:
Start with the goal of a binary32 float.
Form a binary fixed-point representation of the combined whole and fractional (hundredths) parts. This code uses a structure encoding the whole and hundredths fields separately. It is important that the whole field be at least 32 bits.
Shift left/right (*2 and /2) until the MSbit is in the implied-bit position, counting the shifts. A robust solution also notes any non-zero bits shifted out.
Form a biased exponent.
Round the mantissa and drop the implied bit.
Form the sign (not done here).
Combine the above to form the answer.
Since sub-normals, infinities and Not-a-Number cannot result from whole/hundredths input, those float special cases are not addressed here.
#include <assert.h>
#include <stdint.h>

#define IMPLIED_BIT 0x00800000L

typedef struct {
    int_least32_t whole;
    int hundreth;
} x_xx;

int_least32_t covert(int whole, int hundreth) {
    assert(whole >= 0 && hundreth >= 0 && hundreth < 100);
    if (whole == 0 && hundreth == 0) return 0;

    x_xx x = { whole, hundreth };
    int_least32_t expo = 0;
    int sticky_bit = 0; // Note any 1 bits shifted out

    while (x.whole >= IMPLIED_BIT * 2) {
        expo++;
        sticky_bit |= x.hundreth % 2;
        x.hundreth /= 2;
        x.hundreth += (x.whole % 2) * (100 / 2);
        x.whole /= 2;
    }
    while (x.whole < IMPLIED_BIT) {
        expo--;
        x.hundreth *= 2;
        x.whole *= 2;
        x.whole += x.hundreth / 100;
        x.hundreth %= 100;
    }

    int32_t mantissa = x.whole;

    // Round to nearest - ties to even
    if (x.hundreth >= 100 / 2 && (x.hundreth > 100 / 2 || x.whole % 2 || sticky_bit)) {
        mantissa++;
    }
    if (mantissa >= (IMPLIED_BIT * 2)) {
        mantissa /= 2;
        expo++;
    }
    mantissa &= ~IMPLIED_BIT; // Toss MSbit as it is implied in final
    expo += 24 + 126;         // Bias: 24 bits + binary32 bias
    expo <<= 23;              // Offset
    return expo | mantissa;
}
#include <stdio.h>
#include <limits.h>

void test_covert(int whole, int hundreths) {
    union {
        uint32_t u32;
        float f;
    } u;
    u.u32 = covert(whole, hundreths);
    volatile float best = whole + hundreths / 100.0;
    printf("%10d.%02d --> %15.6e %15.6e Same:%d\n", whole, hundreths, u.f, best,
           best == u.f);
}

int main(void) {
    test_covert(75, 65);
    test_covert(0, 1);
    test_covert(INT_MAX, 99);
    return 0;
}
Output
75.65 --> 7.565000e+01 7.565000e+01 Same:1
0.01 --> 1.000000e-02 1.000000e-02 Same:1
2147483647.99 --> 2.147484e+09 2.147484e+09 Same:1
Known issues: sign not applied.

Rules for Explicit int32 -> float32 Casting

I have a homework assignment to emulate floating point casts, e.g.:
int y = /* ... */;
float x = (float)(y);
... but obviously without using the cast. That's fine, and I wouldn't have a problem, except I can't find any specific, concrete definition of how exactly such casts are supposed to operate.
I have written an implementation that works fairly well, but it doesn't quite match up occasionally (for example, it might put a value of three in the exponent and fill the mantissa with ones, but the "ground truth" will have a value of four in the exponent and fill the mantissa with zeroes). The fact that the two are equivalent (sorta, by infinite series) is frustrating because the bit pattern is still "wrong".
Sure, I get vague things, like "round toward zero", from scattered websites, but honestly my searches keep getting clogged with C newbie questions (e.g., "What's a cast?", "When do I use it?"). So, I can't find a general rule that works for explicitly defining the exponent and the mantissa.
Help?
Since this is homework, I'll just post some notes about what I think is the tricky part - rounding when the magnitude of the integer is larger than the precision of the float will hold. It sounds like you already have a solution for the basics of obtaining the exponent and mantissa already.
I'll assume that your float representation is IEEE 754, and that rounding is performed the same way that MSVC and MinGW do: using a "banker's rounding" scheme (I'm honestly not sure if that particular rounding scheme is required by the standard; it's what I tested against, though). The remaining discussion assumes the int to be converted is greater than 0. Negative numbers can be handled by dealing with their absolute value and setting the sign bit at the end. Of course, 0 needs to be handled specially in any case (because there's no msb to find).
Since there are 24 bits of precision in the mantissa (including the implied 1 for the msb), ints up to 16777215 (or 0x00ffffff) can be represented exactly. There's nothing particularly special to do other than the bit shifting to get things in the right place and calculating the correct exponent depending on the shifts.
However, if there are more than 24 bits of precision in the int value, you'll need to round. I performed the rounding using these steps:
If the msb of the dropped bits is 0, nothing more needs to be done. The mantissa and exponent can be left alone.
If the msb of the dropped bits is 1, and the remaining dropped bits have one or more bits set, the mantissa needs to be incremented. If the mantissa overflows (beyond 24 bits, assuming you haven't already dropped the implied msb), then the mantissa needs to be shifted right, and the exponent incremented.
If the msb of the dropped bits is 1, and the remaining dropped bits are all 0, then the mantissa is incremented only if the lsb of the mantissa is 1. Handle overflow of the mantissa similarly to case 2.
Since the mantissa increment will overflow only when it's all 1's, if you're not carrying around the mantissa's msb (i.e., if you've already dropped it since it'll be dropped in the ultimate float representation), then the cases where the mantissa increment overflows can be fixed up simply by setting the mantissa to zero and incrementing the exponent.
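A compact sketch of those three cases (a hypothetical helper: drop is the number of bits being discarded, assumed to be between 1 and 31, and the caller still handles mantissa overflow as described):
#include <stdint.h>

uint32_t round_nearest_even(uint32_t v, int drop) {
    uint32_t kept = v >> drop;              // mantissa candidate
    uint32_t rem  = v & ((1u << drop) - 1); // the dropped bits
    uint32_t half = 1u << (drop - 1);       // msb of the dropped bits
    if (rem > half)                         // case 2: msb 1, others non-zero
        kept++;
    else if (rem == half && (kept & 1))     // case 3: tie - round to even lsb
        kept++;
    return kept;                            // case 1 (rem < half): unchanged
}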
I saw your question and remembered some code for floating-point emulation I had written a long time ago. First of all, a very important piece of advice for floating-point numbers: read "What Every Programmer Should Know About Floating-Point"; it's a very nice and complete guide on the subject.
As for my code, I dug around and found it, but I have to warn you it's ugly, and since it was for a personal project (my undergrad thesis) it's not properly commented. Also, the code might have certain peculiarities since it targeted an embedded system (a robot). The link to the page that explains the project and has a download link for the code is here. Don't mind the website, I am no web designer I'm afraid :)
This is how I represented floating points in that project:
typedef struct
{
    union {
        struct {
            unsigned long mantissa: 23;
            unsigned long exponent: 8;
            unsigned long sign: 1;
        } float_parts; // the struct shares the same memory space as the float,
                       // allowing us to access its parts with the bitfields
        float all;
    };
} _float __attribute__((__packed__));
It uses bitfields, the explanation of which is, I guess, out of this topic's scope, so refer to the link if you want to learn more.
What would interest you in there, I suppose, is this function. Please note that the code is not very well written and I have not looked at it for years. Also note that since I was targeting only the specific robot's architecture, the code has no checks for endianness. But in any case I hope it's of use to you.
_float intToFloat(int number)
{
    int i;
    // will hold the resulting float
    _float result;

    // depending on the number's sign, determine the floating number's sign
    if (number > 0)
        result.float_parts.sign = 0;
    else if (number < 0)
    {
        number *= -1; // since it would have been in two's complement,
                      // being negative and all
        result.float_parts.sign = 1;
    }
    else // 0 is kind of a special case
    {
        parseFloat(0.0, &result); // parseFloat is another helper from the same project
        return result;
    }

    // get the individual bytes (not considering endianness here,
    // since it is for the robot only for now)
    unsigned char *bytes = (unsigned char *)&number;

    // we have to get the most significant bit of the int
    for (i = 31; i >= 0; i--)
    {
        if (bytes[i/8] & (0x01 << (i - ((i/8)*8))))
            break;
    }

    // add the bias and put it into the exponent of the float, because the
    // exponent says where the binary point is placed relative to the
    // beginning of the mantissa
    result.float_parts.exponent = i + 127;

    // now let's prepare for mantissa calculation
    result.float_parts.mantissa = (bytes[2] << 16 | bytes[1] << 8 | bytes[0]);

    // actual calculation of the mantissa
    i = 0;
    while (!(result.float_parts.mantissa & (0x01 << 22)) && i < 23) // the i is to make sure that
    {                                                               // for all-zero mantissas we don't
        result.float_parts.mantissa <<= 1;                          // get an infinite loop
        i++;
    }
    result.float_parts.mantissa <<= 1;

    // finally we got the number
    return result;
}
Thanks everyone for the very useful help! In particular, the rules for rounding were especially helpful!
I am pleased to say that, with the help of this question's responses, and all you glorious people, I successfully implemented the function. My final function is:
unsigned float_i2f(int x) {
    /* Apply a complex series of operations to make the cast. Rounding was
       achieved with the help of my post
       http://stackoverflow.com/questions/9288241/rules-for-explicit-int32-float32-casting. */
    int sign, exponent, y;
    int shift, shift_is_pos, shifted_x, deshifted_x, dropped;
    int mantissa;

    if (x == 0) return 0;

    sign = x < 0 ? 0x80000000 : 0; // extract sign
    x = sign ? -x : x;             // absolute value, sorta

    // Check how big the exponent needs to be to offset the necessary shift
    // to the mantissa.
    exponent = 0;
    y = x;
    while (y /= 2) {
        ++exponent;
    }

    // How much to shift x to get the mantissa, and whether that shift is left or right.
    shift = exponent - 23;
    shift_is_pos = shift >= 0;
    shifted_x = (shift_is_pos ? (x >> shift) : (x << -shift));                   // Shift x
    deshifted_x = (shift_is_pos ? (shifted_x << shift) : (shifted_x >> -shift)); // Unshift it (fills right with zeros)
    dropped = x - deshifted_x;         // Subtract the difference. This gives the rounding error.
    mantissa = 0x007FFFFF & shifted_x; // Remove leading MSB (it is represented implicitly)

    // It is only possible for bits to have been dropped if the shift was positive (right).
    if (shift_is_pos) {
        // We dropped some bits. Rounding may be necessary.
        if ((0x01 << (shift - 1)) & dropped) {
            // The MSB of the dropped bits is 1. Rounding may be necessary.
            // Kill the MSB of the dropped bits (taking into account hardware
            // ignoring 32-bit shifts).
            if (shift == 1) dropped = 0;
            else dropped <<= 33 - shift;
            if (dropped) {
                // The remaining dropped bits have one or more bits set.
                goto INC_MANTISSA;
            }
            // The remaining dropped bits are all 0
            else if (mantissa & 0x01) {
                // LSB is 1
                goto INC_MANTISSA;
            }
        }
    }
    // No rounding is necessary
    goto CONTINUE;

    // For incrementing the mantissa. Handles overflow by incrementing the
    // exponent and setting the mantissa to 0.
INC_MANTISSA:
    ++mantissa;
    if (mantissa & 0x00800000) {
        mantissa = 0;
        ++exponent;
    }

    // Resuming normal program flow.
CONTINUE:
    exponent += 127;                           // Bias the exponent
    return sign | (exponent << 23) | mantissa; // Or it all together and return.
}
It solves all test cases correctly, although I'm certain it does not handle everything correctly (for example, if x is 0x80000000, then the "absolute value" section will return 0x80000000, because of overflow).
Once again, I want to thank all of you greatly for your help!
Thanks,
Ian
