Is there an algorithm that can convert a 32-bit integer from its integer representation to an IEEE 754 float representation using only integer operations?
I have a couple of thoughts on this, but none of them has worked so far. (Using C.)
I was thinking about shifting the integer, but I failed to construct the new float representation from that.
I suppose I could convert the integer to binary, but that has the same problem as the first approach.
Here is an excellent resource on float representation:
Address +3 +2 +1 +0
Format SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
S represents the sign bit, where 1 is negative and 0 is positive.
E is the biased exponent, with an offset of 127.
M is the 23-bit normalized mantissa. The highest bit is always 1 and, therefore, is not stored.
Then look here for two's complement
I'll use num as an array of bits; I know this isn't standard C array-range accessing, but you get the point.
So, for a basic algorithm, we start by filling in S.
bit S = 0;
if (num[0] == 1) {
    S = 1;
    num[1..31] = ~num[1..31] + 1; // ignore the leading bit: flip all the bits, then add 1
}
Now we have set S and we have a scalar value for the rest of the number.
Then we can position our number into the mantissa by finding the index of the first 1 bit, which also gives us the exponent. Note that the exponent will always be non-negative here, since an int can't hold fractional values. (Also, special-case a value of 0 first, to avoid an infinite loop below, or just modify the loop appropriately; I'm lazy.)
int pos = 1;
signed byte E = 30;  // exponent of the most significant value bit, num[1]
bit[23] M;
while (num[pos] == 0) {
    --E;
    ++pos;
}
int finalPos = min(31, pos + 23); // don't take too many bits
M = num[pos+1..finalPos];         // set the mantissa bits (the leading 1 is dropped)
Then you construct your float from the bits S, E, and M (remembering to store E with the 127 offset added).
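Here is a minimal runnable sketch of this algorithm in actual C, working on the integer as a word rather than as a bit array (int_to_float is a hypothetical name; a 32-bit int and IEEE-754 single precision are assumed; it truncates rather than rounds the discarded low bits, so magnitudes above 2^24 can differ from a true conversion by one ulp):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

float int_to_float(int32_t n)
{
    if (n == 0)
        return 0.0f;                       /* special-case zero: avoid the loop below */

    uint32_t sign = 0;
    uint32_t mag = (uint32_t)n;
    if (n < 0) {
        sign = 1u << 31;                   /* the S bit */
        mag = ~(uint32_t)n + 1;            /* flip all the bits, then add 1 */
    }

    int e = 31;                            /* exponent = index of the highest set bit */
    while (!(mag & (1u << e)))
        --e;

    /* Drop the implicit leading 1 and align the remaining bits into the
       23-bit mantissa field. */
    uint32_t mantissa = mag & ((1u << e) - 1);
    if (e > 23)
        mantissa >>= e - 23;               /* too many bits: truncate */
    else
        mantissa <<= 23 - e;

    uint32_t bits = sign | (uint32_t)(e + 127) << 23 | mantissa;
    float result;
    memcpy(&result, &bits, sizeof result); /* reinterpret the bits as a float */
    return result;
}

int main(void)
{
    printf("%f %f %f\n", int_to_float(24), int_to_float(-24), int_to_float(1));
    return 0;
}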
I have the hex literal 400D99999999999A, which is the bit pattern for 3.7 as a double.
How do I write this in C? I saw this page about floating_literal and hex. Maybe it's obvious and I need to sleep, but I'm not seeing how to write the bit pattern as a float. I understand it's supposed to let a person write a more precise fraction, but I'm not sure how to translate a bit pattern to a literal.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    double d = 0x400D99999999999Ap0;
    printf("%f\n", d); // incorrect

    unsigned long l = 0x400D99999999999A;
    memcpy(&d, &l, 8);
    printf("%f\n", d); // correct, 3.7
    return 0;
}
> how to write the bitpattern as a float.
The bit pattern 0x400D99999999999A commonly encodes (alternate encodings exist) a double with a value of about 3.7 *1.
double d;
unsigned long l = 0x400D99999999999A;
// Assume same size, same endian
memcpy(&d, &l, 8);
printf("%g\n", d);
// output 3.7
To write the value out using the "%a" format, with a hexadecimal significand and a decimal power-of-2 exponent:
printf("%a\n", d);
// output 0x1.d99999999999ap+1
The double constant (not literal) 0x1.d99999999999ap+1 has an explicit 1 bit followed by the lower 52 bits of 0x400D99999999999A, and its exponent of +1 comes from the biased exponent bits (the 11 bits below the sign bit), 0x400, minus the bias of 0x3FF.
Now code can use double d = 0x1.d99999999999ap+1; instead of the memcpy() to initialize d.
*1 The closest double to 3.7 is exactly
3.70000000000000017763568394002504646778106689453125
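A quick way to see this is a one-line check (my own; binary fractions always terminate in decimal, so a C library that prints exact decimal expansions, such as glibc's, will show the stored value in full):
#include <stdio.h>

int main(void)
{
    /* print the double nearest to 3.7 with enough digits to show it exactly */
    printf("%.50f\n", 3.7);
    /* prints 3.70000000000000017763568394002504646778106689453125 */
    return 0;
}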
The value you're trying to use is an IEEE bit pattern. C doesn't support this directly. To get the desired bit pattern, you need to specify the mantissa, as an ordinary hex integer, along with a power-of-two exponent.
In this case, the desired IEEE bit pattern is 400D99999999999A. If you strip off the sign bit and the exponent, you're left with D99999999999A. There's an implied leading 1 bit, so to get the actual mantissa value, that needs to be explicitly added, giving 0x1D99999999999A. This represents the mantissa as an integer with no fractional part. It then needs to be scaled, in this case by a power-of-two exponent of -51 (the 52 fraction bits were shifted up into the integer, and the encoded exponent is +1, so the scale is -52 + 1 = -51). So the desired constant is:
double d = 0x1D99999999999Ap-51;
If you plug this into your code, you will get the desired bit pattern of 400D99999999999A.
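For a quick check, here is a small test program of my own (assuming a 64-bit IEEE-754 double):
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    double d = 0x1D99999999999Ap-51; /* integer mantissa scaled by 2^-51 */
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* reinterpret the double's bits */
    printf("%" PRIX64 " = %g\n", bits, d);
    /* expected output: 400D99999999999A = 3.7 */
    return 0;
}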
The following program shows how to interpret a string of bits as a double, using either the native double format or the IEEE-754 double-precision binary format (binary64).
#include <math.h>
#include <stdint.h>
#include <string.h>

// Create a mask of n bits, in the low bits.
#define Mask(n) (((uint64_t) 1 << (n)) - 1)

/* Given a uint64_t containing 64 bits, this function interprets them in the
   native double format.
*/
double InterpretNativeDouble(uint64_t bits)
{
    double result;
    _Static_assert(sizeof result == sizeof bits, "double must be 64 bits");

    // Copy the bits into a native double.
    memcpy(&result, &bits, sizeof result);

    return result;
}
/* Given a uint64_t containing 64 bits, this function interprets them in the
   IEEE-754 double-precision binary format. (Checking that the native double
   format has sufficient bounds and precision to represent the result is
   omitted. For NaN results, a NaN is returned, but the signaling
   characteristic and the payload bits are not supported.)
*/
double InterpretDouble(uint64_t bits)
{
    /* Set some parameters of the format. (This routine is not fully
       parameterized for all IEEE-754 binary formats; some hardcoded constants
       are used.)
    */
    static const int Emax = 1023;    // Maximum exponent.
    static const int Precision = 53; // Precision (number of digits).

    // Separate the fields in the encoding.
    int SignField = bits >> 63;
    int ExponentField = bits >> 52 & Mask(11);
    uint64_t SignificandField = bits & Mask(52);

    // Interpret the exponent and significand fields.
    int Exponent;
    double Significand;
    switch (ExponentField)
    {
        /* An exponent field of all zero bits indicates a subnormal number,
           for which the exponent is fixed at its minimum and the leading bit
           of the significand is zero. This includes zero, which is not
           classified as a subnormal number but is consistent in the encoding.
        */
        case 0:
            Exponent = 1 - Emax;
            Significand = 0 + ldexp(SignificandField, 1-Precision);
                // ldexp(x, y) computes x * pow(2, y).
            break;

        /* An exponent field of all one bits indicates a NaN or infinity,
           according to whether the significand field is zero or not.
        */
        case Mask(11):
            Exponent = 0;
            Significand = SignificandField ? NAN : INFINITY;
            break;

        /* All other exponent fields indicate normal numbers, for which the
           exponent is encoded with a bias (equal to Emax) and the leading bit
           of the significand is one.
        */
        default:
            Exponent = ExponentField - Emax;
            Significand = 1 + ldexp(SignificandField, 1-Precision);
            break;
    }

    // Combine the exponent and significand.
    Significand = ldexp(Significand, Exponent);

    // Interpret the sign field.
    if (SignField)
        Significand = -Significand;

    return Significand;
}
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t bits = 0x400D99999999999A;
    printf("The bits 0x%16" PRIx64 " interpreted as:\n", bits);
    printf("\ta native double represent %.9999g, and\n",
        InterpretNativeDouble(bits));
    printf("\tan IEEE-754 double-precision datum represent %.9999g.\n",
        InterpretDouble(bits));
}
As an IEEE-754 double-precision value, that bit pattern 400D99999999999A actually consists of three parts:
the first bit, 0, is the sign;
the next 11 bits, 10000000000 or 0x400, are the exponent; and
the remaining 52 bits, 0xD99999999999A, are the significand (also known as the "mantissa").
But the exponent has a bias of 1023 (0x3ff), so numerically it's 0x400 - 0x3ff = 1. And the significand is all fractional, and has an implicit 1 bit to its left, so it's really 0x1.D99999999999A.
So the actual number this represents is
0x1.D99999999999A × 2¹
which is about 1.85 × 2, or 3.7.
Or, using C's "hex float" or %a representation, it's 0x1.D99999999999Ap1.
In "hex float" notation, the leading 1 and the decimal point (really a "radix point") are explicit, and the p at the end indicates a power-of-two exponent.
Although the decomposition I've shown here may seem reasonably straightforward, actually writing code to reliably decompose a 64-bit number like 400D99999999999A into its three component parts, and manipulate and recombine them to determine what floating-point value they represent (or even to form an equivalent hex float constant like 0x1.D99999999999Ap1) can be surprisingly tricky. See Eric Postpischil's answer for more of the details.
I have an assignment in C where I'm given a float number and have to print it using the IEEE-754 format:
__(sign)mantissa * 2^(exponent)__
The sign would be either '-' or ' ', the exponent an int value, and the mantissa a float value. I also cannot use structs or any function present in any library other than the ones in stdio.h (for the printf function). This should be done using bitwise operations.
I was given two functions:
unsignedToFloat: given an unsigned int, returns a float with the same bits;
floatToUnsigned: given a float, returns an unsigned int with the same bits;
After getting the float representation as an unsigned int, I have managed to determine the sign and exponent of any float number. I have also managed to determine the mantissa in bits. For example, given the float 3.75:
> bit representation: 01000000011100000000000000000000;
> sign: 0 (' ');
> exponent = 10000000 (128 - bias = 1)
> mantissa = 11100000000000000000000
Now, in this example, I should represent the mantissa as "1.875". However, I have not been able to do this conversion. So far, I have tried to create another float as 000000000[mantissa], but it gives me the wrong result (and I now understand why). I was instructed that the mantissa has a 1 at the beginning, meaning that in this example the mantissa would become 1.111, but I wasn't instructed on how exactly to do this. I tried searching online, but couldn't find any way of adding a 1 to the beginning of the mantissa.
I also had the idea of doing this portion by going through every bit of the mantissa and getting its decimal representation, and then adding 1. In this example, I would do the following:
> 11100000000000000000000
> as_float = 2^-1 + 2^-2 + 2^-3
> as_float += 1
However, this approach seems very hacky and slow, and could perhaps give me the wrong result.
With this said, I am completely out of ideas. How would I represent the mantissa of a float number as its own thing?
1) Convert the float to unsigned int.
2) Clear the sign bit (using bitwise AND &).
3) Set the exponent so that it's equal to the bias (using bitwise AND & and OR |).
4) Convert the modified value from unsigned int back to float.
5) Use printf to print the new value.
This works because clearing the sign bit and setting the exponent to the bias leaves you with (positive)mantissa * 2^0 = mantissa, so printf will print the mantissa for you.
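A minimal sketch of these steps (assuming 32-bit IEEE-754 floats; the two helpers stand in for the given unsignedToFloat/floatToUnsigned functions and use memcpy only to keep the example self-contained):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t floatToUnsigned(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float unsignedToFloat(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

int main(void)
{
    uint32_t bits = floatToUnsigned(3.75f);
    bits &= 0x7FFFFFFF;          /* clear the sign bit */
    bits &= ~(0xFFu << 23);      /* clear the exponent field */
    bits |= 127u << 23;          /* set the exponent to the bias, i.e. 2^0 */
    printf("%f\n", unsignedToFloat(bits)); /* prints 1.875000 */
    return 0;
}
With 3.75, the bits go from 0x40700000 to 0x3FF00000, which is exactly 1.875.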
I have an integer, -2147483647, which is 0x80000001 in hex.
When I use an online float converter, it says that it will instead convert -2147483648, due to rounding.
Now I have to implement this int-to-float function in C with single precision.
I have all the other features already implemented, but I was not aware of the rounding.
How do I decide when to round?
EDIT:
Here are the steps my function performs so far (in order):
1) First I check if the int is negative, in which case I negate it and assign it to an unsigned int. I also set the sign bit to 1.
2) To get the exponent, I search for the first '1' bit from the left of the unsigned int, and get its position.
3) I add 127 to the position and shift the result 23 bits to the left to match the IEEE 754 layout.
4) Using the position counter from step 2, I shift the unsigned int so that the bit just below the leading '1' lands in bit 31; the leading '1' itself is dropped (because it is implied in every IEEE float).
5) I shift that int 9 bits to the right so that it fits the 23-bit mantissa field.
6) I add the sign bit, the exponent, and the mantissa to get the final bit representation of the original int.
The three closest single-precision values to -2147483647, shown as hex bit patterns and their integer values, are:
hex ceffffff = -2147483520
hex cf000000 = -2147483648
hex cf000001 = -2147483904
so hex cf000000 = -2147483648 is used since it's the closest.
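As for when to round: single precision keeps only 24 significant bits, so when the magnitude's highest set bit is above bit 23, the shifted-out low bits decide. Below is a sketch of the default IEEE-754 rule (round to nearest, ties to even) applied to the algorithm you describe; int_to_float_rounded is a hypothetical name, and a 32-bit int plus IEEE-754 single precision are assumed:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

float int_to_float_rounded(int32_t n)
{
    if (n == 0)
        return 0.0f;

    uint32_t sign = n < 0 ? 1u << 31 : 0;
    uint32_t mag  = n < 0 ? ~(uint32_t)n + 1 : (uint32_t)n;

    int e = 31;                                 /* find the highest set bit */
    while (!(mag & (1u << e)))
        --e;

    uint32_t bits;
    if (e <= 23) {
        /* All bits fit: shift them into place; no rounding needed. */
        bits = sign | (uint32_t)(e + 127) << 23 | ((mag << (23 - e)) & 0x7FFFFF);
    } else {
        int shift = e - 23;                     /* number of bits that won't fit */
        uint32_t mantissa = mag >> shift;       /* top 24 bits (incl. leading 1) */
        uint32_t rest = mag & ((1u << shift) - 1);
        uint32_t half = 1u << (shift - 1);
        /* Round to nearest; on an exact tie, round to the even mantissa. */
        if (rest > half || (rest == half && (mantissa & 1)))
            ++mantissa;
        if (mantissa >> 24) {                   /* rounding carried out the top */
            mantissa >>= 1;
            ++e;
        }
        bits = sign | (uint32_t)(e + 127) << 23 | (mantissa & 0x7FFFFF);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void)
{
    float f = int_to_float_rounded(-2147483647);
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    printf("0x%08" PRIX32 " = %.1f\n", u, f);   /* 0xCF000000 = -2147483648.0 */
    return 0;
}
For -2147483647, the magnitude 0x7FFFFFFF has its leading bit at position 30; the seven discarded bits are all ones (more than half), so the mantissa rounds up, the carry ripples out, and the exponent bumps to 31, giving 0xCF000000.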
So I got a task where I have to extract the sign, exponent, and mantissa from a floating-point number given as a uint32_t. I have to do that in C; how do I do it?
For the sign, I would look at the MSB (most significant bit), since it tells me whether my number is positive or negative, depending on whether it's 0 or 1.
Or, to get straight to my idea: can I "splice" my 32-bit number into three parts?
Get 1 bit for the MSB/sign;
then after that follows 1 byte, which stands for the exponent;
and at last 23 bits for the mantissa.
It probably doesn't work like that, but can you give me a hint/solution?
I know of frexp, but I want an alternative where I learn a little more C.
Thank you.
If you know the bitwise layout of your floating point type (e.g. because your implementation supports IEEE floating point representations), then convert a pointer to your floating point variable (of type float, double, or long double) into a pointer to unsigned char. From there, treat the variable like an array of unsigned char and use bitwise operations to extract the parts you need.
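For your uint32_t case specifically, here is a minimal sketch of that field extraction (assuming the uint32_t holds an IEEE-754 single-precision bit pattern; the 3.75 value is just an example of mine):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t bits = 0x40700000;               /* 3.75f as an example */
    uint32_t sign     = bits >> 31;           /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 bits */
    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}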
Otherwise, do this:
#include <math.h>
#include <stdio.h>

int main()
{
    double x = 4.25E12; /* value picked at random */
    double significand;
    int exponent;
    int sign;

    sign = (x >= 0) ? 1 : -1; /* deem 0 to be positive sign */
    significand = frexp(x, &exponent);
    printf("sign %d, significand %g, exponent %d\n", sign, significand, exponent);
}
The calculation of sign in the above should be obvious.
significand may be positive or negative, for non-zero x. The absolute value of significand is in the range [0.5,1) which, when multiplied by 2 to the power of exponent gives the original value.
If x is 0, both exponent and significand will be 0.
This will work regardless of what floating point representations your compiler supports (assuming double values).
I was given this code to convert a signed integer into two's complement, but I don't understand how it really works, especially when the input is negative.
#include <stdint.h>

#define NUM_BITS 32

void convertB2T(int32_t num) {
    uint8_t bInt[NUM_BITS];
    int32_t mask = 0x01;
    for (int position = 0; position < NUM_BITS; position++) {
        bInt[position] = (num & mask) ? 1 : 0;
        mask = mask << 1;
    }
}
So my questions are:
num is an integer and mask is written in hex, so how does num & mask work? Does C just convert num to its binary representation and do the bitwise AND? Also, the output of num & mask is an integer, correct? So if this output is non-zero it is treated as TRUE, and if zero, FALSE, right?
How does this work if num is negative? I tried running the code and did not get the answer I expected (all the higher bits are 1's).
This program basically extracts each bit of the number and puts it in an array, so every bit becomes an array element. It has nothing to do with two's complement conversion (although the resulting bit vector will be in two's complement, since that is the internal representation of numbers).
The computer has no idea what hex means. Every value is stored in binary, because binary is the only thing the computer understands. So the "integer" and the hex value are both converted to binary (the hex value is an integer too), and the binary operators are applied to those binary representations.
In order to understand what happens with the result when num is negative, you need to understand that the result is basically the two's complement representation of num, and you need to know how the two's complement representation works. Wikipedia is a good starting point.
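To see this, here is a small self-contained demonstration (my own variant of the loop, using an unsigned mask so the final shift into bit 31 is well defined), printing the extracted bits of -24 from most to least significant:
#include <stdio.h>
#include <stdint.h>

#define NUM_BITS 32

int main(void)
{
    int32_t num = -24;
    uint8_t bInt[NUM_BITS];
    uint32_t mask = 0x01;  /* unsigned: shifting into bit 31 is not UB */
    for (int position = 0; position < NUM_BITS; position++) {
        bInt[position] = (num & mask) ? 1 : 0;
        mask = mask << 1;
    }
    for (int position = NUM_BITS - 1; position >= 0; position--)
        printf("%d", bInt[position]);
    printf("\n"); /* prints 11111111111111111111111111101000 */
    return 0;
}
All the high bits being 1 is the correct two's complement pattern for a negative number, which is exactly what the asker observed.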
To answer your questions:
1. Yes, num is an integer written in decimal and mask is an integer written in hex. The C compiler treats both num and mask as their binary equivalents. Say:
num = 24;    // binary value on a 32-bit machine is 00000000000000000000000000011000
mask = 0x01; // binary value on a 32-bit machine is 00000000000000000000000000000001
The compiler then performs the bitwise AND (&) on these binary values. And yes, if the output is nonzero, it is treated as true.
2. If a number is negative, it is represented in 2's complement form. Basically, your code is just storing the binary representation of the number into an array; it is not converting to two's complement (the stored bits already are the two's complement representation). An MSB of 1 indicates that the number is negative. For a negative number:
num = -24; // start from the binary value of 24
 00000000000000000000000000011000 -> apply 1's complement, then add 1
 11111111111111111111111111100111 -> 1's complement
+00000000000000000000000000000001 -> add 1
---------------------------------
 11111111111111111111111111101000 -> representation of -24
---------------------------------
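Equivalently, in C, unsigned negation follows the same flip-the-bits-then-add-one rule (a tiny check of my own):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t x = 24;
    /* two's complement negation: flip all the bits, then add 1 */
    uint32_t neg = ~x + 1;
    printf("0x%08X\n", neg); /* prints 0xFFFFFFE8, the bit pattern of -24 */
    return 0;
}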