Format a float using IEEE-754 - c

I have an assignment in C where I'm given a float number and have to print it using the IEEE-754 format:
__(sign)mantissa * 2^(exponent)__
The sign would be either '-' or ' ', the exponent an int value and the mantissa a float value. I also cannot use structs or any function present in any library other than the ones in stdio.h (for the printf function). This should be done using bitwise operations.
I was given two functions:
unsignedToFloat: given an unsigned int, returns a float with the same bits;
floatToUnsigned: given a float, returns an unsigned int with the same bits;
After getting the float representation as an unsigned int, I have managed to determine the sign and exponent of any float number. I have also managed to determine the mantissa in bits. For example, given the float 3.75:
> bit representation: 01000000011100000000000000000000;
> sign: 0 (' ');
> exponent = 10000000 (128 - bias = 1)
> mantissa = 11100000000000000000000
Now, in this example, I should represent the mantissa as "1.875". However, I have not been able to do this conversion. So far, I have tried to create another float as 000000000[mantissa], but it gives me the wrong result (which I now understand why). I was instructed that the mantissa has a 1 in the beginning, meaning in this example, the mantissa would become 1.111, but I wasn't instructed on how exactly to do this. I tried searching online, but couldn't find any way of adding a 1 to the beginning of the mantissa.
I also had the idea of doing this portion by going through every bit of the mantissa and getting its decimal representation, and then adding 1. In this example, I would do the following:
> 11100000000000000000000
> as_float = 2^-1 + 2^-2 + 2^-3
> as_float += 1
However, this approach seems very hacky and slow, and could perhaps give me a wrong result.
With this said, I am completely out of ideas. How would I represent the mantissa of a float number as its own thing?

1) Convert the float to unsigned int.
2) Clear the sign bit (using bitwise AND &).
3) Set the exponent so that it's equal to the bias (using bitwise AND & and OR |).
4) Convert the modified value from unsigned int to float.
5) Use printf to print the new value.
This works because clearing the sign bit and setting the exponent to the bias leaves you with (positive)mantissa * 2^0 = mantissa, so printf will print the mantissa for you.

Related

How to decide when to round when converting decimal/int to float

I have an integer -2147483647, which is 0x80000001 in hex.
When I use an online float converter, it says that it will instead convert -2147483648, due to rounding.
Now I have to implement this int-to-float function in C with single precision.
I have all the other features already implemented but I was not aware of the rounding.
When do I decide when to round?
EDIT:
Here are the features I have so far. (in the order of how the function works)
1) First I check if the int is negative, in which case I negate it and assign it to an unsigned int. I also set the sign bit to 1.
2) To get the exponent, I search for the first '1' bit from the left of the unsigned int, and get its position.
3) I add 127 to the position and shift the result 23 bits to the left to place it in the exponent field, matching the IEEE 754 layout.
4) With the position counter from 2), I shift the unsigned int so that bit 31 holds the bit just after the leading '1' bit (the leading '1' is omitted, as it is implicit in every normalized IEEE float).
5) I then shift that int 9 bits to the right so that it fits the mantissa field of the standard.
6) I add the sign bit, the exponent, and the mantissa to get the final bit representation of the original int.
The three closest single-precision values to -2147483647, shown in hex and as integers, are:
hex ceffffff = -2147483520
hex cf000000 = -2147483648
hex cf000001 = -2147483904
so hex cf000000 = -2147483648 is used since it's the closest.

How to extract the sign, mantissa and exponent from a 32-Bit float [duplicate]

This question already has answers here:
How to get the sign, mantissa and exponent of a floating point number
So I got a task where I have to extract the sign, exponent and mantissa from a floating point number given as a uint32_t. I have to do that in C, and as you might expect: how do I do that?
For the sign I would look at the MSB (most significant bit), since it tells me whether my number is positive or negative, depending on whether it's 0 or 1.
Or let's get straight to my idea, can I "splice" my 32 bit number into three parts ?
Get the 1 bit for msb/sign
Then after that follows 1 byte which stands for the exponent
and at last 23 bits for the mantissa
It probably doesn't work like that, but can you give me a hint/solution?
I know of frexp, but I want an alternative where I learn a little more of C.
Thank you.
If you know the bitwise layout of your floating point type (e.g. because your implementation supports IEEE floating point representations) then convert a pointer to your floating point variable (of type float, double, or long double) into a pointer to unsigned char. From there, treat the variable like an array of unsigned char and use bitwise operations to extract the parts you need.
Otherwise, do this;
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 4.25E12; /* value picked at random */
    double significand;
    int exponent;
    int sign;

    sign = (x >= 0) ? 1 : -1; /* deem 0 to be positive sign */
    significand = frexp(x, &exponent);
    printf("%d * %g * 2^%d\n", sign, significand, exponent);
    return 0;
}
The calculation of sign in the above should be obvious.
significand may be positive or negative, for non-zero x. The absolute value of significand is in the range [0.5,1) which, when multiplied by 2 to the power of exponent gives the original value.
If x is 0, both exponent and significand will be 0.
This will work regardless of what floating point representations your compiler supports (assuming double values).

Is it possible to round-trip a floating point double to two decimal integers with fidelity?

I am trying to discern whether it is possible to decompose a double precision IEEE floating point value into two integers and recompose them later with full fidelity. Imagine something like this:
double foo = <inputValue>;
double ipart = 0;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * <someConstant>;
double bar = ((double)ipart) + ((double)intFPart) / <someConstant>;
assert(foo == bar);
It's logically obvious that any 64-bit quantity can be stored in 128 bits (i.e. just store the literal bits.) The goal here is to decompose the integer part and the fractional part of the double into integer representations (to interface with an API whose storage format I don't control) and get back a bit-exact double when recomposing the two 64-bit integers.
I have a conceptual understanding of IEEE floating point, and I get that doubles are stored base-2. I observe, empirically, that with the above approach, sometimes foo != bar for even very large values of <someConstant>. I've been out of school a while, and I can't quite close the loop in my head, in terms of understanding whether this is possible or not given the different bases (or some other factor).
EDIT:
I guess this was implied/understood in my brain but not captured here: in this situation, I'm guaranteed that the overall magnitude of the double in question will always be within +/- 2^63 (and > 2^-64). With that understanding, the integer part is guaranteed to fit within a 64-bit int type, and my expectation is that with ~16 decimal digits of precision, the fractional part should be easily representable in a 64-bit int type as well.
If you know the number is in [-2^63, +2^63) and the ULP (the value of the lowest bit in the number) is at least 2^-63, then you can use this:
double ipart;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * 0x1p63;
double bar = intIPart + intFPart * 0x1p-63;
If you just want a couple of integers from which the value can be reconstructed and do not care about the meaning of those integers (e.g., it is not necessary that one of them be the integer part), then you can use frexp to disassemble the number into its significand (with sign) and exponent, and you can use ldexp to reassemble it:
int exp;
int64_t I = frexp(foo, &exp) * 0x1p53;
int64_t E = exp;
double bar = ldexp(I, E-53);
This code will work for any finite value of an IEEE-754 64-bit binary floating-point object. It does not support infinities or NaNs.
It is even possible to pack I and E into a single int64_t, if you want to go to the trouble.
The goal here is to decompose the integer part and the fractional part
of the double into integer representations
You can't even get just the integer part or just the fractional part reliably. The problem is that you seem to misunderstand how floating point numbers are stored. They don't have an integer part and a fractional part. They have a significant digits part, called the mantissa, and an exponent. The exponent essentially scales the mantissa up or down, similar to how scientific notation works.
A double-precision floating point number has 11 bits for the exponent, giving a range of values that's something like 2^-1022...2^1023. If you want to store the integer and fractional parts, then, you'll need two integers that each have about 2^10 bits. That'd be a silly way to do things, though -- most of those bits would go unused because only the bits in the mantissa are significant. Using two very long integers would let you represent all the values across the entire range of a double with the same precision everywhere, something that you can't do with a double. You could have, for example, a very large integer part with a very small fractional part, but that's a number that a double couldn't accurately represent.
Update
If, as you indicate in your comment, you know that the value in question is within the range ±2^63, you can use the answer to Extract fractional part of double *efficiently* in C, like this:
double whole = /* your original value */;
long long iPart = (long long)whole;
double fraction = whole - iPart;
long long fPart = fraction * 0x1p63; /* 2 << 63 overflows; 0x1p63 is 2^63 as a double */
I haven't tested that, but it should get you what you want.
See wikipedia for the format of a double:
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
IEEE double format encodes three integers: the significand, the exponent, and the sign bit.
Here is code which will extract the three constituent integers in IEEE double format:
double d = 2.0;
// sign bit
bool s = (*reinterpret_cast<int64_t*>(&d)) >> 63;
// significand
int64_t m = *reinterpret_cast<int64_t*>(&d) & 0x000FFFFFFFFFFFFFULL;
// exponent
int64_t e = ((*reinterpret_cast<int64_t*>(&d) >> 52) & 0x00000000000007FFULL) - 1023;
// now the double d is exactly equal to (-1)^s * (1 + m / 2^52) * 2^e
// print out the exact arithmetic expression for d:
std::cout << "d = " << std::dec << (s ? "-" : "") << "(1 + " << m << "/" << (1ULL << 52) << ") x 2^" << e;

Integer representation to float representation

Is there an algorithm that can convert a 32-bit integer in its integer representation to a IEEE 754 float representation by just using integer operations?
I have a couple of thoughts on this but none of these works so far. (Using C)
I was thinking about shifting the integers, but then I failed to construct the new float representation from that.
I suppose I could convert the integer to binary, but it has the same problem as the first approach.
excellent resource on float
Address +3 +2 +1 +0
Format SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
S represents the sign bit where 1 is negative and 0 is positive.
E is the biased exponent with an offset of 127.
M is the 23-bit normalized mantissa. The highest bit is always 1
and, therefore, is not stored
Then look here for two's complement
I'll use num as an array of bits. I know this isn't standard C array-range accessing, but you get the point.
So for a basic algorithm, we start with filling out S.
bit S = 0;
if (num[0] ==1) {
S = 1;
num[1..32] = -num[1..32] + 1; //ignore the leading bit. flip all the bits then add 1
}
Now we have set S and we have a scalar value for the rest of the number.
Then we can position our number into the mantissa, by finding the first index of 1. Which will also let us find the exponent. Note here that the exponent will always be positive since we can't have fractional int values. (also, make a special case for checking if the value is 0 first, to avoid an infinite loop in here, or just modify the loop appropriately, I'm lazy)
int pos = 1;
signed byte E = 30; //num[1] holds the 2^30 bit
bit[23] M;
while(num[pos] == 0) {
--E;
++pos;
}
int finalPos = min(32, pos+23); //don't get too many bits
M = num[pos+1..finalPos]; //set the mantissa bits
Then you construct your float with the bits in S,E,M

problems in floating point comparison [duplicate]

This question already has answers here:
strange output in comparison of float with float literal
#include <stdio.h>

int main(void)
{
    float f = 0.98;
    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    return 0;
}
I am getting this problem here. On using different floating point values of f, I am getting different results.
Why is this happening?
f is using float precision, but 0.98 is in double precision by default, so the statement f <= 0.98 is compared using double precision.
The f is therefore converted to a double in the comparison; the conversion is exact, but the float value stored in f is already slightly larger than 0.98.
Use
if(f <= 0.98f)
or use a double for f instead.
In detail... assuming float is IEEE single-precision and double is IEEE double-precision.
These kinds of floating point numbers are stored in a base-2 representation. In base 2, this number needs infinite precision to represent, as it is a repeating binary fraction:
0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...
A float can only store 24 bits of significant figures, i.e.
0.111110101110000101000111_101...
^ round off here
= 0.111110101110000101001000
= 16441672 / 2^24
= 0.98000001907...
A double can store 53 bits of significant figures, so
0.11111010111000010100011110101110000101000111101011100_00101000...
^ round off here
= 0.11111010111000010100011110101110000101000111101011100
= 8827055269646172 / 2^53
= 0.97999999999999998224...
So the 0.98 will become slightly larger in float and smaller in double.
It's because floating point values are not exact representations of the number. Decimal numbers must be represented on the computer as base-2 numbers, and many decimal fractions have no finite base-2 representation, so precision is lost in the conversion.
Read more about this at http://en.wikipedia.org/wiki/Floating_point
An example (from encountering this problem in my VB6 days)
To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.
Bit 1 is the sign bit (is it negative [1] or positive [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation )
So for 1.1 the single floating point value is stored as follows (this is the truncated value; the compiler may round the least significant bit behind the scenes, but all I do here is truncate, which is slightly less accurate but doesn't change the results of this example):
s --exp--- -------mantissa--------
0 01111111 00011001100110011001100
If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.
sign = 0 = a positive number
exponent: 01111111 = 127
mantissa: 00011001100110011001100 = 838860
With the mantissa we need to convert it to a decimal value. The reason is there is an implied integer ahead of the binary number (i.e. 1.00011001100110011001100). The implied number is because the mantissa represents a normalized value to be used in the scientific notation: 1.0001100110011.... * 2^(x-127).
To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 gives us 1.099999904632568359375. The exponent is 127, but the formula calls for 2^(x-127).
So here is the math:
(1 + 0.099999904632568359375) * 2^(127-127)
1.099999904632568359375 * 1 = 1.099999904632568359375
As you can see 1.1 is not really stored in the single floating point value as 1.1.
