Is this x==(int)(float)x always true?

x is an int.
I think x == (int)(float)x is always true, but the book says it's false when x is TMax.
I've checked it when x = TMax, and it's still true. Is the book wrong?

Assuming IEEE-754 is used, a single-precision float can represent integers exactly only up to 2^24, but int is normally 32 bits on modern machines, so integers outside that range may not survive the round trip in (int)(float)x.
If double is used instead of float, x == (int)(double)x is true for all 32-bit integers, because double can represent all integers up to 2^53.

float has a precision of 24 bits in the IEEE-754 floating point formats. As soon as your integer value needs more precision than that, you lose some of it. Try the same on a system that has 32-bit ints and you'll see the difference.
For example, take
#include <stdio.h>

int main()
{
    unsigned int x = 4000000003U;
    float y = x;
    printf("%u %.20g %.20g %u\n", x, (float)x, y, (unsigned int)(float)x);
}
which will convert this large number to a float. The float is incapable of holding the whole number, so it approximates it.
After converting back to an int, you get a different value.
At least, you should get one, but I cannot reproduce this on my system here: the program above outputs
4000000003 4000000003 4000000000 4000000003
while I expected the second number to be equal to the third...
However, if I change the code to 64 bit integers:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main()
{
    uint64_t x = 400000000000000003U;
    double y = x;
    printf("%" PRIu64 " %.20g %.20g %" PRIu64 "\n", x, (double)x, y,
           (uint64_t)(double)x);
}
it works as expected: 64 bits are more than a double can hold exactly (that would be 53 bits), and thus the effect shows up as wanted.

It isn't necessarily true, because floating point loses precision as you increase the magnitude of your values.

tczf, you must be reading the book CSAPP; have a look at practice problem 2.49 on page 144. A floating-point format with an n-bit fraction can't represent integers larger than 2^(n+1)-1, because it will lose precision.
The IEEE 754 single-precision format has a 23-bit fraction, and the int type has 32 bits, so if an integer is larger than 2^24-1, the float type can't represent it exactly. But the double format can, since it has a 52-bit fraction.

FatSheep's answer has an error:
A floating-point format with an n-bit fraction can't represent integers larger than 2^(n+1)-1, because it will lose precision.
The bound is not 2^(n+1)-1; the first integer that cannot be represented is 2^(n+1)+1, i.e. the problem starts with values larger than or equal to 2^(n+1)+1.
The n+1 comes from the implied leading 1 of normalized values: the significand is M = 1 + f, where f is the fraction.
This code shows where the round trip first fails:
#include <stdio.h>
#include <limits.h>

int main()
{
    for (int i = 0; i < INT_MAX; i++) {
        int x = i;
        int y = (int)(float)x;  // when x >= 2^(n+1) + 1, x != y
        if (x != y) {
            printf("%d,%d\n", x, y);
            break;
        }
    }
    return 0;
}


Write double bitpattern as a literal

I have the hex literal 400D99999999999A, which is the bit pattern for 3.7 as a double. How do I write this in C? I saw this page about floating_literal and hex. Maybe it's obvious and I need to sleep, but I'm not seeing how to write the bit pattern as a float. I understand it's supposed to let a person write a more precise fraction, but I'm not sure how to translate a bit pattern into a literal.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    double d = 0x400D99999999999Ap0;
    printf("%f\n", d); // incorrect
    unsigned long l = 0x400D99999999999A;
    memcpy(&d, &l, 8);
    printf("%f\n", d); // correct, 3.7
    return 0;
}
how to write the bitpattern as a float.
The bit pattern 0x400D99999999999A commonly encodes (alternate encodings exist) a double with a value of about 3.7 *1.
double d;
unsigned long l = 0x400D99999999999A;
// Assume same size, same endianness
memcpy(&d, &l, 8);
printf("%g\n", d);
// output: 3.7
To write the value out with a hexadecimal significand and a decimal power-of-2 exponent, use the "%a" format:
printf("%a\n", d);
// output: 0x1.d99999999999ap+1
The double constant (not literal) 0x1.d99999999999ap+1 has an explicit 1 bit followed by the lower 52 bits of 0x400D99999999999A, and the exponent of +1 comes from the biased exponent field (the 12 most significant bits, except the sign bit): 0x400 minus the bias of 0x3FF is 1.
Now code can use double d = 0x1.d99999999999ap+1 instead of the memcpy() to initialize d.
*1 The closest double to 3.7 is exactly
3.70000000000000017763568394002504646778106689453125
The value you're trying to use is an IEEE bit pattern. C doesn't support this directly. To get the desired bit pattern, you need to specify the mantissa, as an ordinary hex integer, along with a power-of-two exponent.
In this case, the desired IEEE bit pattern is 400D99999999999A. If you strip off the sign bit and the exponent, you're left with D99999999999A. There's an implied leading 1 bit, so to get the actual mantissa value, that needs to be explicitly added, giving 1D99999999999A. This represents the mantissa as an integer with no fractional part. It then needs to be scaled, in this case by a power-of-two exponent value of -51. So the desired constant is:
double d = 0x1D99999999999Ap-51;
If you plug this into your code, you will get the desired bit pattern of 400D99999999999A.
The following program shows how to interpret a string of bits as a double, using either the native double format or using the IEEE-754 double-precision binary format (binary64).
#include <math.h>
#include <stdint.h>
#include <string.h>

// Create a mask of n bits, in the low bits.
#define Mask(n) (((uint64_t) 1 << (n)) - 1)

/* Given a uint64_t containing 64 bits, this function interprets them in the
   native double format.
*/
double InterpretNativeDouble(uint64_t bits)
{
    double result;
    _Static_assert(sizeof result == sizeof bits, "double must be 64 bits");
    // Copy the bits into a native double.
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Given a uint64_t containing 64 bits, this function interprets them in the
   IEEE-754 double-precision binary format. (Checking that the native double
   format has sufficient bounds and precision to represent the result is
   omitted. For NaN results, a NaN is returned, but the signaling
   characteristic and the payload bits are not supported.)
*/
double InterpretDouble(uint64_t bits)
{
    /* Set some parameters of the format. (This routine is not fully
       parameterized for all IEEE-754 binary formats; some hardcoded constants
       are used.)
    */
    static const int Emax = 1023;     // Maximum exponent.
    static const int Precision = 53;  // Precision (number of digits).

    // Separate the fields in the encoding.
    int SignField = bits >> 63;
    int ExponentField = bits >> 52 & Mask(11);
    uint64_t SignificandField = bits & Mask(52);

    // Interpret the exponent and significand fields.
    int Exponent;
    double Significand;
    switch (ExponentField)
    {
        /* An exponent field of all zero bits indicates a subnormal number,
           for which the exponent is fixed at its minimum and the leading bit
           of the significand is zero. This includes zero, which is not
           classified as a subnormal number but is consistent in the encoding.
        */
        case 0:
            Exponent = 1 - Emax;
            Significand = 0 + ldexp(SignificandField, 1-Precision);
                // ldexp(x, y) computes x * pow(2, y).
            break;

        /* An exponent field of all one bits indicates a NaN or infinity,
           according to whether the significand field is zero or not.
        */
        case Mask(11):
            Exponent = 0;
            Significand = SignificandField ? NAN : INFINITY;
            break;

        /* All other exponent fields indicate normal numbers, for which the
           exponent is encoded with a bias (equal to Emax) and the leading bit
           of the significand is one.
        */
        default:
            Exponent = ExponentField - Emax;
            Significand = 1 + ldexp(SignificandField, 1-Precision);
            break;
    }

    // Combine the exponent and significand.
    Significand = ldexp(Significand, Exponent);

    // Interpret the sign field.
    if (SignField)
        Significand = -Significand;

    return Significand;
}

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t bits = 0x400D99999999999A;
    printf("The bits 0x%16" PRIx64 " interpreted as:\n", bits);
    printf("\ta native double represents %.9999g, and\n",
        InterpretNativeDouble(bits));
    printf("\tan IEEE-754 double-precision datum represents %.9999g.\n",
        InterpretDouble(bits));
}
As an IEEE-754 double-precision value, that bit pattern 400D99999999999A actually consists of three parts:
the first bit, 0, is the sign;
the next 11 bits, 10000000000 or 0x400, are the exponent; and
the remaining 52 bits, 0xD99999999999A, are the significand (also known as the "mantissa").
But the exponent has a bias of 1023 (0x3ff), so numerically it's 0x400 - 0x3ff = 1. And the significand is all fractional, and has an implicit 1 bit to its left, so it's really 0x1.D99999999999A.
So the actual number this represents is
0x1.D99999999999A × 2¹
which is about 1.85 × 2, or 3.7.
Or, using C's "hex float" or %a representation, it's 0x1.D99999999999Ap1.
In "hex float" notation, the leading 1 and the decimal point (really a "radix point") are explicit, and the p at the end indicates a power-of-two exponent.
Although the decomposition I've shown here may seem reasonably straightforward, actually writing code to reliably decompose a 64-bit number like 400D99999999999A into its three component parts, and manipulate and recombine them to determine what floating-point value they represent (or even to form an equivalent hex float constant like 0x1.D99999999999Ap1) can be surprisingly tricky. See Eric Postpischil's answer for more of the details.

Difference between the result after dividing by 2 and multiplying with 0.5

#include <stdio.h>

int main() {
    unsigned long long int c = 9999999999999999999U / 2;
    unsigned long long int d = 9999999999999999999U * 0.5;
    unsigned long long int e = 9999999999999999999U >> 1;
    printf("%llu\n%llu\n%llu\n", c, d, e);
    return 0;
}
So the output of that is:
4999999999999999999
5000000000000000000
4999999999999999999
Why is there a difference when multiplying by 0.5?
And why doesn't this difference show up when the numbers are small?
In the case of d, 9999999999999999999 is promoted to a double, which, if your C implementation uses IEEE 754 doubles, is converted to 10000000000000000000 (if I did my calculations correctly), because doubles only have 53 bits available in the significand, one of which is an implied 1. Multiplying 10000000000000000000 by 0.5 gives 5000000000000000000. Floating point is weird. Read up on it at https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.
9999999999999999999U is a large number. It requires 64 bits to represent in binary. Type unsigned long long int is guaranteed by the C Standard to have at least 64 value bits, so depending on the actual ranges of the smaller integer types, it is an integer constant with type unsigned int, unsigned long int, or at most unsigned long long int.
The expressions 9999999999999999999U / 2 and 9999999999999999999U >> 1 are thus fully defined and evaluate to 4999999999999999999, typically at compile time through constant folding, with the same type. This value can be stored into c and e and output correctly by printf with a format %llu as expected.
Conversely, 9999999999999999999U * 0.5 (or similarly 9999999999999999999U / 2.0) is evaluated as a floating point expression, (double)9999999999999999999U * 0.5, and the floating point result of type double is converted to an unsigned long long int when assigned to d.
The double type is only guaranteed to provide enough precision to convert numbers of up to 10 decimal digits without loss, a lot less than required for your number. Most C implementations use the IEEE-754 representation for the double type, which has exactly 53 bits of precision. The value 9999999999999999999 is thus rounded to 1E19 when converted to a double. Multiplying by 0.5 or dividing by 2.0 is performed exactly, as it only changes the binary exponent part. The result 5E18 is converted to unsigned long long int and printed as 5000000000000000000, as you see on your system.
The differences are explained with type propagation.
First example, dividing integer by integer. Dividing by two and right-shifting are equivalent here, they are done on the operands as-is.
Second example, multiplying an integer by a double (the 0.5). Here, the compiler will first convert the integer operand to a double (which is only guaranteed to preserve about ten decimal digits, I think) and then perform the multiplication. In order to store the result in an integer again, it is truncated.
I hope that illustrates that there are different operations going on that are caused by different types of operands, even though they seem to be the similar from a mathematical point of view.

How big of a number can you store in double and float in c?

I am trying to figure out exactly how big a number I can use as a floating point number and as a double. But it does not store the way I expected, except for the integer value. A double should hold 8 bytes of information, which is enough to hold variable a, but it does not hold it right: it shows 1234567890123456768, in which the last two digits are different. And when I stored 2147483648 (or changed its last digit) in the float variable b, it shows the same value 2147483648, which is supposed to be the limit. So what's going on?
double a;
float b;
int c;
a = 1234567890123456789;
b = 2147483648;
c = 2147483647;
printf("Bytes of double: %zu\n", sizeof(double));
printf("Bytes of integer: %zu\n", sizeof(int));
printf("Bytes of float: %zu\n", sizeof(float));
printf("\n");
printf("You can count up to %.0f in 4 bytes\n", pow(2, 32));
printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2, 31));
printf("You can count up to %.0f in 8 bytes\n", pow(2, 64));
printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2, 63));
printf("\n");
printf("double number: %.0f\n", a);
printf("floating point: %.0f\n", b);
printf("integer: %d\n", c);
return 0;
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in an IEEE-754 standard float, 11 bits in a double), and a mantissa (23 and 52 bits in float and double, respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after normalizing the exponent; its binary value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions can not be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to magnitude 2^(mantissa_bits + 1). Thus an IEEE-754 float can represent all integers up to 2^24 and a double all integers up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart, i.e., you can represent some integers greater than 2^(mantissa_bits + 1), but every integer only up to that magnitude.
For example:
float f = powf(2.0f, 24.0f);
float f1 = f + 1.0f, f2 = f1 + 2.0f;
double d = pow(2.0, 53.0);
double d1 = d + 1.0, d2 = d + 2.0;
(void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
(void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa_bits + 1) makes no difference since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart, since the multiplier has doubled).
 
TL;DR An IEEE-754 float can precisely represent all integers up to 2^24 and a double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent integers up to 2^53 precisely, which is less than 1234567890123456789.
See also Double-precision floating-point format.
You can use these constants to know what the limits are:
FLT_MAX
DBL_MAX
LDBL_MAX
From CPP reference
You can print the actual limits of the standard types using the constants defined in the float.h header (in C++, the equivalent is std::numeric_limits).
Hardware cannot represent real numbers exactly; it uses a fixed bit length to approximate a floating type. Since you don't have infinite length for floating types, a double variable can only be stored and shown to a specific precision. Most hardware uses the IEEE-754 standard for its floating type representation.
To get more precision you could try long double (depending on the hardware this could be quadruple precision rather than double precision), AVX/SSE registers, big-num libraries, or you could do it yourself.
The sizeof an object only reports the memory space it occupies. It does not show the valid range. It would be quite possible for an unsigned int with, e.g., 2^16 (65536) possible values to occupy 32 bits in memory.
For floating point objects, it is more difficult. They consist of (simplified) two fields: an integer mantissa and an exponent (see details in the linked article). Both with a fixed width.
As the mantissa only has a limited range, trailing bits are truncated or rounded and the exponent is corrected, if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with 4 digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
The parameters for your implementation are defined in float.h, similar to limits.h, which provides the parameters for integers.
On Linux, #include <values.h>
On Windows, #include <float.h>
There is a fairly comprehensive list of defines

Issue with floating point representation

int x = 25, i;
float *p = (float *)&x;
printf("%f\n", *p);
I understand that bit representation for floating point numbers and int are different, but no matter what value I store, the answer is always 0.000000. Shouldn't it be some other value depending on the floating point representation?
Your code has undefined behavior -- but it will most likely behave as you expect, as long as the size and alignment of types int and float are compatible.
By using the "%f" format to print *p, you're losing a lot of information.
Try this:
#include <stdio.h>

int main(void) {
    int x = 25;
    float *p = (float *)&x;
    printf("%g\n", *p);
    return 0;
}
On my system (and probably on yours), it prints:
3.50325e-44
The int value 25 has zeros in most of its high-order bits. Those bits are probably in the same place as the exponent field of type float -- resulting in a very small number.
Look up IEEE floating-point representation for more information. Byte order is going to be an issue. (And don't do this kind of thing in real code unless you have a very good reason.)
As rici suggests in a comment, a better way to learn about floating-point representation is to start with a floating-point value, convert it to an unsigned integer of the same size, and display the integer value in hexadecimal. For example:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void show(float f) {
    unsigned int rep;
    memcpy(&rep, &f, sizeof rep);
    printf("%g --> 0x%08x\n", f, rep);
}

int main(void) {
    if (sizeof (float) != sizeof (unsigned int)) {
        fprintf(stderr, "Size mismatch\n");
        exit(EXIT_FAILURE);
    }
    show(0.0);
    show(1.0);
    show(1.0/3.0);
    show(-12.34e5);
    return 0;
}
For the purposes of this discussion, we're going to assume both int and float are 32 bits wide. We're also going to assume IEEE-754 floats.
Floating point values are represented as sign × β^exp × significand. For 32-bit binary floats, β is 2, the exponent exp ranges from -126 to 127, and the significand is a normalized binary fraction, such that there is a single leading non-zero bit before the radix point. For example, the binary integer representation of 25 is
11001₂
while the binary floating point representation of 25.0 would be:
1.1001₂ × 2^4 // normalized
The IEEE-754 encoding for a 32-bit float is
s eeeeeeee fffffffffffffffffffffff
where s denotes the sign bit, e denotes the exponent bits, and f denotes the significand (fraction) bits. The exponent is encoded using "excess 127" notation, meaning an exponent value of 127 (01111111₂) represents 0, while 1 (00000001₂) represents -126 and 254 (11111110₂) represents 127. The leading bit of the significand is not explicitly stored, so 25.0 would be encoded as
0 10000011 10010000000000000000000 // exponent 131-127 = 4
However, what happens when you map the bit pattern for the 32-bit integer value 25 onto a 32-bit floating point format? We wind up with the following:
0 00000000 00000000000000000011001
It turns out that in IEEE-754 floats, the exponent value 00000000₂ is reserved for representing 0.0 and subnormal (or denormal) numbers. A subnormal number is a number close to 0 that can't be represented as 1.??? × 2^exp, because the exponent would have to be smaller than what we can encode in 8 bits. Such numbers are interpreted as 0.??? × 2^-126, with as many leading 0s as necessary.
In this case, it adds up to 0.00000000000000000011001₂ × 2^-126, which gives us 3.50325 × 10^-44.
You'll have to map large integer values (in excess of 2^24) to see anything other than 0 out to a bunch of decimal places. And, like Keith says, this is all undefined behavior anyway.

Is it possible to round-trip a floating point double to two decimal integers with fidelity?

I am trying to discern whether it is possible to decompose a double-precision IEEE floating point value into two integers and recompose them later with full fidelity. Imagine something like this:
double foo = <inputValue>;
double ipart = 0;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * <someConstant>;
double bar = ((double)ipart) + ((double)intFPart) / <someConstant>;
assert(foo == bar);
It's logically obvious that any 64-bit quantity can be stored in 128 bits (i.e. just store the literal bits.) The goal here is to decompose the integer part and the fractional part of the double into integer representations (to interface with an API whose storage format I don't control) and get back a bit-exact double when recomposing the two 64-bit integers.
I have a conceptual understanding of IEEE floating point, and I get that doubles are stored base-2. I observe, empirically, that with the above approach, sometimes foo != bar even for very large values of <someConstant>. I've been out of school a while, and I can't quite close the loop in my head in terms of understanding whether this is possible or not given the different bases (or some other factor).
EDIT:
I guess this was implied/understood in my brain but not captured here: in this situation, I'm guaranteed that the overall magnitude of the double in question will always be within ±2^63 (and > 2^-64). With that understanding, the integer part is guaranteed to fit in a 64-bit int type, and my expectation is that with ~16 decimal digits of precision, the fractional part should be easily representable in a 64-bit int type as well.
If you know the number is in [-2^63, +2^63) and the ULP (the value of the lowest bit in the number) is at least 2^-63, then you can use this:
double ipart;
double fpart = modf(foo, &ipart);
int64_t intIPart = ipart;
int64_t intFPart = fpart * 0x1p63;
double bar = intIPart + intFPart * 0x1p-63;
If you just want a couple of integers from which the value can be reconstructed and do not care about the meaning of those integers (e.g., it is not necessary that one of them be the integer part), then you can use frexp to disassemble the number into its significand (with sign) and exponent, and you can use ldexp to reassemble it:
int exp;
int64_t I = frexp(foo, &exp) * 0x1p53;
int64_t E = exp;
double bar = ldexp(I, E-53);
This code will work for any finite value of an IEEE-754 64-bit binary floating-point object. It does not support infinities or NaNs.
It is even possible to pack I and E into a single int64_t, if you want to go to the trouble.
The goal here is to decompose the integer part and the fractional part of the double into integer representations
You can't even get just the integer part or just the fractional part reliably. The problem is that you seem to misunderstand how floating point numbers are stored. They don't have an integer part and a fractional part. They have a significant digits part, called the mantissa, and an exponent. The exponent essentially scales the mantissa up or down, similar to how scientific notation works.
A double-precision floating point number has 11 bits for the exponent, giving a range of values that's something like 2^-1022...2^1023. If you want to store the integer and fractional parts, then, you'll need two integers that each have about 2^10 bits. That'd be a silly way to do things, though -- most of those bits would go unused, because only the bits in the mantissa are significant. Using two very long integers would let you represent all the values across the entire range of a double with the same precision everywhere, something that you can't do with a double. You could have, for example, a very large integer part with a very small fractional part, but that's a number that a double couldn't accurately represent.
Update
If, as you indicate in your comment, you know that the value in question is within the range ±2^63, you can use the answer to Extract fractional part of double *efficiently* in C, like this:
double whole = /* your original value */;
long iPart = (long)whole;
double fraction = whole - iPart;
long fPart = fraction * 0x1p63;  // scale by 2^63; (2 << 63) would overflow
I haven't tested that, but it should get you what you want.
See wikipedia for the format of a double:
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
IEEE double format encodes three integers: the significand, the exponent, and the sign bit.
Here is code which will extract the three constituent integers in IEEE double format:
double d = 2.0;
// sign bit
bool s = (*reinterpret_cast<int64_t*>(&d)) >> 63;
// significand
int64_t m = *reinterpret_cast<int64_t*>(&d) & 0x000FFFFFFFFFFFFFULL;
// exponent
int64_t e = ((*reinterpret_cast<int64_t*>(&d) >> 52) & 0x00000000000007FFULL) - 1023;
// now the double d is exactly equal to (-1)^s * (1 + (m / 2^52)) * 2^e
// print out the exact arithmetic expression for d:
std::cout << "d = " << std::dec << (s ? "-(1 + (" : "(1 + (")
          << m << "/" << (1ULL << 52) << ")) x 2^" << e;
