I have the hex literal 400D99999999999A, which is the bit pattern for 3.7 as a double.
How do I write this in C? I saw this page about floating_literal and hex. Maybe it's obvious and I need to sleep, but I'm not seeing how to write the bitpattern as a float. I understand it's supposed to let a person write a more precise fraction, but I'm not sure how to translate a bit pattern to a literal.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    double d = 0x400D99999999999Ap0;
    printf("%f\n", d); // incorrect

    unsigned long l = 0x400D99999999999A;
    memcpy(&d, &l, 8);
    printf("%f\n", d); // correct, 3.7

    return 0;
}
how to write the bitpattern as a float.
The bit pattern 0x400D99999999999A commonly encodes (alternate encodings exist) the double with a value of about 3.7. *1
double d;
unsigned long l = 0x400D99999999999A;
// Assume same size, same endian
memcpy(&d, &l, 8);
printf("%g\n", d);
// output 3.7
To write the value out with a hexadecimal significand and a decimal power-of-2 exponent, use the "%a" format:
printf("%a\n", d);
// output 0x1.d99999999999ap+1
The double constant (not literal) 0x1.d99999999999ap+1 has an explicit 1 bit followed by the lower 52 bits of 0x400D99999999999A, and its exponent of +1 comes from the biased exponent field (the 12 most significant bits, excluding the sign bit), 0x400, minus the bias of 0x400 - 1.
Now code can use double d = 0x1.d99999999999ap+1; instead of the memcpy() to initialize d.
*1 Closest double to 3.7 is exactly
3.70000000000000017763568394002504646778106689453125
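For example, here is a short self-contained check (my addition, assuming an IEEE-754 64-bit double and a 64-bit unsigned long long) that the hex-float constant reproduces the original bit pattern:
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 0x1.d99999999999ap+1;    // the hex-float constant discussed above
    unsigned long long bits;
    memcpy(&bits, &d, sizeof bits);     // same size-and-endian assumption as before
    printf("%f -> 0x%llX\n", d, bits);  // expect 3.700000 -> 0x400D99999999999A
    return 0;
}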
The value you're trying to use is an IEEE bit pattern. C doesn't support this directly. To get the desired bit pattern, you need to specify the mantissa, as an ordinary hex integer, along with a power-of-two exponent.
In this case, the desired IEEE bit pattern is 400D99999999999A. If you strip off the sign bit and the exponent, you're left with D99999999999A. There's an implied leading 1 bit, so to get the actual mantissa value, that needs to be explicitly added, giving 1D99999999999A. This represents the mantissa as an integer with no fractional part. It then needs to be scaled, in this case by a power-of-two exponent value of -51. So the desired constant is:
double d = 0x1D99999999999Ap-51;
If you plug this into your code, you will get the desired bit pattern of 400D99999999999A.
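As a quick sanity check (my addition; it assumes a 64-bit IEEE-754 double and a 64-bit unsigned long long), printing the constant with "%a" and dumping its bytes shows the round trip:
#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 0x1D99999999999Ap-51;   // integer significand scaled by 2^-51
    unsigned long long bits;
    memcpy(&bits, &d, sizeof bits);
    printf("%a -> 0x%llX\n", d, bits); // expect 0x1.d99999999999ap+1 -> 0x400D99999999999A
    return 0;
}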
The following program shows how to interpret a string of bits as a double, using either the native double format or using the IEEE-754 double-precision binary format (binary64).
#include <math.h>
#include <stdint.h>
#include <string.h>

// Create a mask of n bits, in the low bits.
#define Mask(n) (((uint64_t) 1 << (n)) - 1)

/* Given a uint64_t containing 64 bits, this function interprets them in the
   native double format.
*/
double InterpretNativeDouble(uint64_t bits)
{
    double result;
    _Static_assert(sizeof result == sizeof bits, "double must be 64 bits");
    // Copy the bits into a native double.
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Given a uint64_t containing 64 bits, this function interprets them in the
   IEEE-754 double-precision binary format. (Checking that the native double
   format has sufficient bounds and precision to represent the result is
   omitted. For NaN results, a NaN is returned, but the signaling
   characteristic and the payload bits are not supported.)
*/
double InterpretDouble(uint64_t bits)
{
    /* Set some parameters of the format. (This routine is not fully
       parameterized for all IEEE-754 binary formats; some hardcoded constants
       are used.)
    */
    static const int Emax = 1023;       // Maximum exponent.
    static const int Precision = 53;    // Precision (number of digits).

    // Separate the fields in the encoding.
    int SignField = bits >> 63;
    int ExponentField = bits >> 52 & Mask(11);
    uint64_t SignificandField = bits & Mask(52);

    // Interpret the exponent and significand fields.
    int Exponent;
    double Significand;
    switch (ExponentField)
    {
        /* An exponent field of all zero bits indicates a subnormal number,
           for which the exponent is fixed at its minimum and the leading bit
           of the significand is zero. This includes zero, which is not
           classified as a subnormal number but is consistent in the encoding.
        */
        case 0:
            Exponent = 1 - Emax;
            Significand = 0 + ldexp(SignificandField, 1-Precision);
                // ldexp(x, y) computes x * pow(2, y).
            break;

        /* An exponent field of all one bits indicates a NaN or infinity,
           according to whether the significand field is zero or not.
        */
        case Mask(11):
            Exponent = 0;
            Significand = SignificandField ? NAN : INFINITY;
            break;

        /* All other exponent fields indicate normal numbers, for which the
           exponent is encoded with a bias (equal to Emax) and the leading bit
           of the significand is one.
        */
        default:
            Exponent = ExponentField - Emax;
            Significand = 1 + ldexp(SignificandField, 1-Precision);
            break;
    }

    // Combine the exponent and significand.
    Significand = ldexp(Significand, Exponent);

    // Interpret the sign field.
    if (SignField)
        Significand = -Significand;

    return Significand;
}

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t bits = 0x400D99999999999A;

    printf("The bits 0x%16" PRIx64 " interpreted as:\n", bits);
    printf("\ta native double represent %.9999g, and\n",
        InterpretNativeDouble(bits));
    printf("\tan IEEE-754 double-precision datum represent %.9999g.\n",
        InterpretDouble(bits));
}
As an IEEE-754 double-precision value, that bit pattern 400D99999999999A actually consists of three parts:
the first bit, 0, is the sign;
the next 11 bits, 10000000000 or 0x400, are the exponent; and
the remaining 52 bits, 0xD99999999999A, are the significand (also known as the "mantissa").
But the exponent has a bias of 1023 (0x3ff), so numerically it's 0x400 - 0x3ff = 1. And the significand is all fractional, and has an implicit 1 bit to its left, so it's really 0x1.D99999999999A.
So the actual number this represents is
0x1.D99999999999A × 2¹
which is about 1.85 × 2, or 3.7.
Or, using C's "hex float" or %a representation, it's 0x1.D99999999999Ap1.
In "hex float" notation, the leading 1 and the decimal point (really a "radix point") are explicit, and the p at the end indicates a power-of-two exponent.
Although the decomposition I've shown here may seem reasonably straightforward, actually writing code to reliably decompose a 64-bit number like 400D99999999999A into its three component parts, and manipulate and recombine them to determine what floating-point value they represent (or even to form an equivalent hex float constant like 0x1.D99999999999Ap1) can be surprisingly tricky. See Eric Postpischil's answer for more of the details.
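To illustrate just the decomposition described above, here is a simplified sketch of my own (normal numbers only, assuming IEEE-754 binary64; see the fuller program in that answer for the subnormal, infinity, and NaN cases):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t bits = 0x400D99999999999A;

    int sign          = (int)(bits >> 63);                  // 1 sign bit
    int exponent      = (int)((bits >> 52) & 0x7FF) - 1023; // 11 exponent bits, bias 1023
    uint64_t fracbits = bits & ((1ULL << 52) - 1);          // 52 significand bits

    // Normal numbers only: prepend the implicit 1 bit and scale the fraction by 2^-52.
    double significand = 1.0 + ldexp((double)fracbits, -52);
    double value = ldexp(sign ? -significand : significand, exponent);

    printf("sign=%d exponent=%d significand=%a value=%g\n",
           sign, exponent, significand, value);             // value is 3.7
    return 0;
}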
Related
I have an assignment in C where I'm given a float number and have to print it using the IEEE-754 format:
(sign)mantissa * 2^(exponent)
The sign would be either '-' or ' ', the exponent an int value and the mantissa a float value. I also cannot use structs or any function present in any library other than the ones in stdio.h (for the printf function). This should be done using bitwise operations.
I was given two functions:
unsignedToFloat: given an unsigned int, returns a float with the same bits;
floatToUnsigned: given a float, returns an unsigned int with the same bits;
After getting the float representation as an unsigned int, I have managed to determine the sign and exponent of any float number. I have also managed to determine the mantissa in bits. For example, given the float 3.75:
> bit representation: 01000000011100000000000000000000;
> sign: 0 (' ');
> exponent = 10000000 (128 - bias = 1)
> mantissa = 11100000000000000000000
Now, in this example, I should represent the mantissa as "1.875". However, I have not been able to do this conversion. So far, I have tried to create another float as 000000000[mantissa], but it gives me the wrong result (which I now understand why). I was instructed that the mantissa has a 1 in the beginning, meaning in this example, the mantissa would become 1.111, but I wasn't instructed on how exactly to do this. I tried searching online, but couldn't find any way of adding a 1 to the beginning of the mantissa.
I also had the idea of doing this portion by going through every bit of the mantissa and getting its decimal representation, and then adding 1. In this example, I would do the following:
> 11100000000000000000000
> as_float = 2^-1 + 2^-2 + 2^-3
> as_float += 1
However, this approach seems very hack-y, slow and could perhaps give me wrong result.
With this said, I am completely out of ideas. How would I represent the mantissa of a float number as its own thing?
Convert the float to unsigned int.
Clear the sign bit (using bitwise AND &).
Set the exponent so that it's equal to the bias (using bitwise AND & and OR |).
Convert the modified value from unsigned int to float.
Use printf to print the new value.
This works because clearing the sign bit and setting the exponent to the bias leaves you with (positive)mantissa * 2^0 = mantissa, so printf will print the mantissa for you.
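A minimal sketch of those steps (my own, using memcpy in place of the assignment's floatToUnsigned/unsignedToFloat helpers and assuming 32-bit IEEE-754 floats):
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 3.75f;
    unsigned int bits;
    memcpy(&bits, &f, sizeof bits);             // stand-in for floatToUnsigned

    bits &= 0x7FFFFFFFu;                        // clear the sign bit
    bits = (bits & 0x807FFFFFu) | (127u << 23); // force the exponent field to the bias

    float mantissa;
    memcpy(&mantissa, &bits, sizeof mantissa);  // stand-in for unsignedToFloat
    printf("%f\n", mantissa);                   // prints 1.875000 for 3.75
    return 0;
}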
int x=25,i;
float *p=(float *)&x;
printf("%f\n",*p);
I understand that bit representation for floating point numbers and int are different, but no matter what value I store, the answer is always 0.000000. Shouldn't it be some other value depending on the floating point representation?
Your code has undefined behavior -- but it will most likely behave as you expect, as long as the size and alignment of types int and float are compatible.
By using the "%f" format to print *p, you're losing a lot of information.
Try this:
#include <stdio.h>

int main(void) {
    int x = 25;
    float *p = (float*)&x;
    printf("%g\n", *p);
    return 0;
}
On my system (and probably on yours), it prints:
3.50325e-44
The int value 25 has zeros in most of its high-order bits. Those bits are probably in the same place as the exponent field of type float -- resulting in a very small number.
Look up IEEE floating-point representation for more information. Byte order is going to be an issue. (And don't do this kind of thing in real code unless you have a very good reason.)
As rici suggests in a comment, a better way to learn about floating-point representation is to start with a floating-point value, convert it to an unsigned integer of the same size, and display the integer value in hexadecimal. For example:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void show(float f) {
    unsigned int rep;
    memcpy(&rep, &f, sizeof rep);
    printf("%g --> 0x%08x\n", f, rep);
}

int main(void) {
    if (sizeof (float) != sizeof (unsigned int)) {
        fprintf(stderr, "Size mismatch\n");
        exit(EXIT_FAILURE);
    }
    show(0.0);
    show(1.0);
    show(1.0/3.0);
    show(-12.34e5);
    return 0;
}
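For reference (my addition, not part of the original answer), on an implementation with 32-bit IEEE-754 floats this should print something like:
0 --> 0x00000000
1 --> 0x3f800000
0.333333 --> 0x3eaaaaab
-1.234e+06 --> 0xc996a280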
For the purposes of this discussion, we're going to assume both int and float are 32 bits wide. We're also going to assume IEEE-754 floats.
Floating point values are represented as sign * β^exp * significand. For 32-bit binary floats, β is 2, the exponent exp ranges from -126 to 127, and the significand is a normalized binary fraction, such that there is a single leading non-zero bit before the radix point. For example, the binary integer representation of 25 is
11001₂
while the binary floating point representation of 25.0 would be:
1.1001₂ * 2^4 // normalized
The IEEE-754 encoding for a 32-bit float is
s eeeeeeee fffffffffffffffffffffff
where s denotes the sign bit, e denotes the exponent bits, and f denotes the significand (fraction) bits. The exponent is encoded using "excess 127" notation, meaning an exponent value of 127 (01111111₂) represents 0, while 1 (00000001₂) represents -126 and 254 (11111110₂) represents 127. The leading bit of the significand is not explicitly stored, so 25.0 would be encoded as
0 10000011 10010000000000000000000 // exponent 131-127 = 4
However, what happens when you map the bit pattern for the 32-bit integer value 25 onto a 32-bit floating point format? We wind up with the following:
0 00000000 00000000000000000011001
It turns out that in IEEE-754 floats, exponent value 00000000₂ is reserved for representing 0.0 and subnormal (or denormal) numbers. A subnormal number is a number close to 0 that can't be represented as 1.??? * 2^exp, because the exponent would have to be smaller than what we can encode in 8 bits. Such numbers are interpreted as 0.??? * 2^-126, with as many leading 0s as necessary.
In this case, it adds up to 0.00000000000000000011001₂ * 2^-126, which gives us 3.50325 * 10^-44.
You'll have to map large integer values (in excess of 2^24) to see anything other than 0 out to a bunch of decimal places. And, like Keith says, this is all undefined behavior anyway.
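A quick way to confirm that arithmetic (my own check, assuming a 32-bit unsigned int and IEEE-754 floats): the bit pattern 25 lands entirely in the significand field, so the value is 25 * 2^-149, which ldexpf can compute directly.
#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int bits = 25;
    float viaBits;
    memcpy(&viaBits, &bits, sizeof viaBits); // reinterpret the int's bits as a float

    // Subnormal: significand field 25, scaled by 2^-23, times 2^-126 = 25 * 2^-149.
    float viaFormula = ldexpf(25.0f, -149);

    printf("%g\n%g\n", viaBits, viaFormula); // both print 3.50325e-44
    return 0;
}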
So I got a task where I have to extract the sign, exponent and mantissa from a floating point number given as a uint32_t. I have to do that in C, and as you might expect, how do I do that?
For the sign I would search for the MSB (Most Significant Bit), since it tells me whether my number is positive or negative, depending on whether it's 0 or 1.
Or let's get straight to my idea: can I "splice" my 32-bit number into three parts?
Get the 1 bit for msb/sign
Then after that follows 1 byte which stands for the exponent
and at last 23 bits for the mantissa
It probably doesn't work like that, but can you give me a hint/solution?
I know of frexp, but I want an alternative where I learn a little more of C.
Thank you.
If you know the bitwise layout of your floating point type (e.g. because your implementation supports IEEE floating point representations), then convert a pointer to your floating point variable (of type float, double, or long double) into a pointer to unsigned char. From there, treat the variable like an array of unsigned char and use bitwise operations to extract the parts you need.
Otherwise, do this:
#include <math.h>

int main()
{
    double x = 4.25E12; /* value picked at random */
    double significand;
    int exponent;
    int sign;

    sign = (x >= 0) ? 1 : -1; /* deem 0 to be positive sign */
    significand = frexp(x, &exponent);
}
The calculation of sign in the above should be obvious.
significand may be positive or negative, for non-zero x. The absolute value of significand is in the range [0.5,1) which, when multiplied by 2 to the power of exponent gives the original value.
If x is 0, both exponent and significand will be 0.
This will work regardless of what floating point representations your compiler supports (assuming double values).
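For example, to print the pieces in the (sign)mantissa * 2^(exponent) style asked about earlier, you could add a line like this at the end of main above (my addition, not part of the original answer):
printf("%c%g * 2^%d\n", sign < 0 ? '-' : ' ', fabs(significand), exponent);
// for 4.25E12 this prints something like " 0.966338 * 2^42"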
I am trying to discern whether it is possible to decompose a double precision IEEE floating point value into two integers and recompose them later with full fidelity. Imagine something like this:
double foo = <inputValue>;
double ipart = 0;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * <someConstant>;
double bar = ((double)ipart) + ((double)intFPart) / <someConstant>;
assert(foo == bar);
It's logically obvious that any 64-bit quantity can be stored in 128 bits (i.e. just store the literal bits.) The goal here is to decompose the integer part and the fractional part of the double into integer representations (to interface with an API whose storage format I don't control) and get back a bit-exact double when recomposing the two 64-bit integers.
I have a conceptual understanding of IEEE floating point, and I get that doubles are stored base-2. I observe, empirically, that with the above approach, sometimes foo != bar even for very large values of <someConstant>. I've been out of school a while, and I can't quite close the loop in my head, in terms of understanding whether this is possible or not given the different bases (or some other factor).
EDIT:
I guess this was implied/understood in my brain but not captured here: in this situation, I'm guaranteed that the overall magnitude of the double in question will always be within +/- 2^63 (and > 2^-64). With that understanding, the integer part is guaranteed to fit within a 64-bit int type, and my expectation is that, with ~16 digits of decimal precision, the fractional part should be easily representable in a 64-bit int type as well.
If you know the number is in [-2^63, +2^63) and the ULP (the value of the lowest bit in the number) is at least 2^-63, then you can use this:
double ipart;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * 0x1p63;
double bar = intIPart + intFPart * 0x1p-63;
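As a quick round-trip check (my addition; 3.7 satisfies both preconditions, since it is far below 2^63 and its ULP is 2^-51):
#include <assert.h>
#include <math.h>
#include <stdint.h>

int main(void)
{
    double foo = 3.7;                      // within [-2^63, +2^63), ULP 2^-51 >= 2^-63
    double ipart;
    double fpart = modf(foo, &ipart);
    int64_t intIPart = ipart;              // 3
    int64_t intFPart = fpart * 0x1p63;     // fractional part scaled to an integer
    double bar = intIPart + intFPart * 0x1p-63;
    assert(foo == bar);                    // the round trip is exact
    return 0;
}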
If you just want a couple of integers from which the value can be reconstructed and do not care about the meaning of those integers (e.g., it is not necessary that one of them be the integer part), then you can use frexp to disassemble the number into its significand (with sign) and exponent, and you can use ldexp to reassemble it:
int exp;
int64_t I = frexp(foo, &exp) * 0x1p53;
int64_t E = exp;
double bar = ldexp(I, E-53);
This code will work for any finite value of an IEEE-754 64-bit binary floating-point object. It does not support infinities or NaNs.
It is even possible to pack I and E into a single int64_t, if you want to go to the trouble.
The goal here is to decompose the integer part and the fractional part
of the double into integer representations
You can't even get just the integer part or just the fractional part reliably. The problem is that you seem to misunderstand how floating point numbers are stored. They don't have an integer part and a fractional part. They have a significant digits part, called the mantissa, and an exponent. The exponent essentially scales the mantissa up or down, similar to how scientific notation works.
A double-precision floating point number has 11 bits for the exponent, giving a range of values that's something like 2^-1022...2^1023. If you want to store the integer and fractional parts, then, you'll need two integers that each have about 2^10 bits. That'd be a silly way to do things, though -- most of those bits would go unused because only the bits in the mantissa are significant. Using two very long integers would let you represent all the values across the entire range of a double with the same precision everywhere, something that you can't do with a double. You could have, for example, a very large integer part with a very small fractional part, but that's a number that a double couldn't accurately represent.
Update
If, as you indicate in your comment, you know that the value in question is within the range ±263, you can use the answer to Extract fractional part of double *efficiently* in C, like this:
double whole = // your original value
int64_t iPart = (int64_t)whole;       // int64_t (from <stdint.h>) so the full +/- 2^63 range fits
double fraction = whole - iPart;
int64_t fPart = fraction * 0x1p63;    // scale by 2^63; shifting an int (2 << 63) would overflow
I haven't tested that, but it should get you what you want.
See wikipedia for the format of a double:
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
IEEE double format encodes three integers: the significand, the exponent, and the sign bit.
Here is code which will extract the three constituent integers in IEEE double format:
double d = 2.0;
// sign bit
bool s = (*reinterpret_cast<int64_t*>(&d)) >> 63;
// significand
int64_t m = *reinterpret_cast<int64_t*>(&d) & 0x000FFFFFFFFFFFFFULL;
// exponent
int64_t e = ((*reinterpret_cast<int64_t*>(&d) >> 52) & 0x00000000000007FFULL) - 1023;
// now the double d is exactly equal to (-1)^s * (1 + (m / 2^52)) * 2^e
// print out the exact arithmetic expression for d:
std::cout << "d = " << std::dec << (s ? "-(1 + (" : "(1 + (") << m << "/" << (1ULL << 52) << ")) x 2^" << e;
Is there an algorithm that can convert a 32-bit integer in its integer representation to a IEEE 754 float representation by just using integer operations?
I have a couple of thoughts on this but none of these works so far. (Using C)
I was thinking about shifting the integers, but then I failed to construct the new float representation from that.
I suppose I could convert the integer to binary, but it has the same problem as the first approach.
An excellent resource on float:
Address +3 +2 +1 +0
Format SEEEEEEE EMMMMMMM MMMMMMMM MMMMMMMM
S represents the sign bit where 1 is negative and 0 is positive.
E is the two’s complement exponent with an offset of 127.
M is the 23-bit normalized mantissa. The highest bit is always 1
and, therefore, is not stored
Then look here for two's complement
I'll use num as an array of bits; I know this isn't standard C array range accessing, but you get the point.
So for a basic algorithm, we start with filling out S.
bit S = 0;
if (num[0] == 1) {
    S = 1;
    num[1..32] = -num[1..32] + 1; // ignore the leading bit. flip all the bits then add 1
}
Now we have set S and we have a scalar value for the rest of the number.
Then we can position our number into the mantissa, by finding the first index of 1. Which will also let us find the exponent. Note here that the exponent will always be positive since we can't have fractional int values. (also, make a special case for checking if the value is 0 first, to avoid an infinite loop in here, or just modify the loop appropriately, I'm lazy)
int pos = 1;
signed byte E = 32;
bit[23] M;
while (num[pos] == 0) {
    --E;
    ++pos;
}
int finalPos = min(32, pos+23); // don't get too many bits
M = num[pos+1..finalPos];       // set the mantissa bits
Then you construct your float with the bits in S, E, and M.
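The pseudocode above glosses over rounding and the exact bit positions. For reference, here is a self-contained C sketch of my own (not the answer's pseudocode made literal; it assumes IEEE-754 binary32 and a two's-complement int32_t) that builds the float bit pattern from a 32-bit signed integer using only integer operations, with round-to-nearest-even, and compares the result against the compiler's own conversion:
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Build the IEEE-754 binary32 bit pattern for a 32-bit signed integer,
// using only integer operations (round to nearest, ties to even).
uint32_t int32_to_float_bits(int32_t n)
{
    if (n == 0)
        return 0;                                              // +0.0f

    uint32_t sign = (n < 0) ? 0x80000000u : 0u;
    uint32_t mag  = (n < 0) ? 0u - (uint32_t)n : (uint32_t)n;  // |n|, safe for INT32_MIN

    // Find the position of the highest set bit; that is the unbiased exponent.
    int exp = 31;
    while (!(mag & (1u << exp)))
        --exp;

    uint32_t mantissa;
    if (exp <= 23) {
        // The value fits in 24 significand bits: shift it into place exactly.
        mantissa = mag << (23 - exp);
    } else {
        // Too many bits: shift right and round to nearest, ties to even.
        int shift = exp - 23;
        uint32_t rest = mag & ((1u << shift) - 1);             // bits that will be shifted out
        uint32_t half = 1u << (shift - 1);
        mantissa = mag >> shift;
        if (rest > half || (rest == half && (mantissa & 1)))
            ++mantissa;
        if (mantissa == (1u << 24)) {                          // rounding overflowed the significand
            mantissa >>= 1;
            ++exp;
        }
    }

    mantissa &= 0x007FFFFFu;                                   // drop the implicit leading 1 bit
    return sign | ((uint32_t)(exp + 127) << 23) | mantissa;
}

int main(void)
{
    int32_t tests[] = { 25, -25, 1, 16777217, 2147483647, -2147483647 - 1 };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; ++i) {
        uint32_t bits = int32_to_float_bits(tests[i]);
        float check = (float)tests[i];                         // compiler's conversion, for comparison
        uint32_t checkbits;
        memcpy(&checkbits, &check, sizeof checkbits);
        printf("%11" PRId32 " -> 0x%08" PRIX32 " (compiler: 0x%08" PRIX32 ")\n",
               tests[i], bits, checkbits);
    }
    return 0;
}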