Here's the code:
#include <stdio.h>
#include <math.h>

static double const x = 665857;
static double const y = 470832;

int main()
{
    double z = x*x*x*x - (y*y*y*y*4 + y*y*4);
    printf("%f \n", z);
    return 0;
}
Mysteriously (to me), this code prints "0.0" if compiled on 32-bit machines (or with the -m32 flag on 64-bit machines, as in my case) with GCC 4.6. As far as I know about floating point operations, it is possible to overflow/underflow them or to lose precision with them, but... a 0? How?
Thanks in advance.
The problem is not that the numbers overflow. The problem is that doubles don't have enough precision to distinguish between the two operands of your subtraction.
The value of x*x*x*x is 196573006004558194713601.
The value of y*y*y*y*4+y*y*4 is 196573006004558194713600.
These numbers both need 78 bits, and they differ only in the last bit. A double, however, has only 53 bits of precision, so values this wide are rounded to 53 significant bits.
In your case, the two operands are rounded to the same number, and so their difference is 0.
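As a quick check (a small sketch of my own, not from the original post), you can feed the two exact integer values directly to the compiler and confirm that they round to the same double:

#include <stdio.h>

int main(void)
{
    /* The exact integer values of the two operands, written as decimal
       constants; each one is rounded to the nearest representable double. */
    double a = 196573006004558194713601.0; /* 665857^4 */
    double b = 196573006004558194713600.0; /* 4*(470832^4 + 470832^2) */

    printf("a == b : %d\n", a == b);  /* expected: 1 */
    printf("a - b  : %f\n", a - b);   /* expected: 0.000000 */
    return 0;
}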
Even stranger things happen if you slightly rewrite your expression for z:
double z = x * x * x * x - ((y * y + 1) * y * y * 4);
With this change, you get 33554432! Why? Because the way intermediate results were rounded caused the last bit of the right operand to be different. The value of the last bit is 2^(78-53)=2^25.
Evaluating the expression with arbitrary precision integers:
Prelude> 665857^4 - 4*(470832^4 + 470832^2)
1
Since a double normally only has 53 bits of precision and the intermediate results have 78 bits, the precision isn't sufficient to calculate the result exactly, hence it is rounded, the last bits are forgotten at some point.
There is no floating-point overflow or underflow in your code. The two quantities are of the order of 1.96573006 × 10^23 and fit comfortably within a double. Your example simply illustrates catastrophic cancellation: when you subtract two nearly equal quantities, the relative precision of the result becomes horrible.
See http://en.wikipedia.org/wiki/Loss_of_significance
This is a result of the way IEEE 754 represents floating point numbers in normalized form. A float or double or any other IEEE 754 compliant representation is stored like:
1.xxxxxxxxxxxxxxxxxxx * 2^exp
where xxxxxxxxxxxxxxxxxxx is the fractional part of the mantissa, so the mantissa itself is always in the range [1, 2). The integer part, which is always 1, is not stored in the representation. The number of x bits defines the precision; it is 52 bits for the double. The exponent is stored in an offset form (one must subtract 1023 in order to obtain its value), but that is irrelevant now.
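If you want to reproduce the bit patterns listed below yourself, here is a small sketch of my own that dumps the fields of a double (it uses memcpy for the reinterpretation so the behaviour is well defined):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    double v = 665857.0 * 665857.0 * 665857.0 * 665857.0;
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);   /* reinterpret the double's bits */

    unsigned long long sign     = bits >> 63;
    unsigned long long exponent = (bits >> 52) & 0x7FF;        /* biased exponent */
    unsigned long long mantissa = bits & 0xFFFFFFFFFFFFFULL;   /* 52 stored fraction bits */

    printf("sign     : %llu\n", sign);
    printf("exponent : %llu (unbiased %lld)\n", exponent,
           (long long)exponent - 1023);   /* expected: 1100 biased, i.e. 2^77 */
    printf("mantissa : 0x%013llX\n", mantissa);
    return 0;
}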
665857^4 in 64-bit IEEE 754 and 80-bit x86 extended precision is:
0 10001001100 (1)0100110100000001111100111011101010000101110010100010
0 100000001001100 1 010011010000000111110011101110101000010111001010000111000111011
+ exponent mantissa
(The first bit is the sign bit: 0 = positive, 1 = negative. The bit in parentheses is the implicit integer bit, not actually stored in the 64-bit format; in the 80-bit x86 extended format the integer part is an explicit part of the representation - a deviation from IEEE 754. I've aligned the mantissas for clarity.)
4*470832^4 in 64-bit IEEE 754 and 80-bit x86 extended precision is:
0 10001001100 (1)0100110100000001111100111011101001111111010101100111
0 100000001001100 1 010011010000000111110011101110100111111101010110011100100010000
4*470832^2 in 64-bit IEEE 754 and 80-bit x86 extended precision is:
0 10000100110 (1)1001110011101010100101010100100000000000000000000000
0 100000000100110 1 100111001110101010010101010010000000000000000000000000000000000
When you sum up the last two numbers, the procedure is the following: the smaller value has its exponent adjusted to match the larger value's exponent while the mantissa is shifted to the right in order to preserve the value. Since the two exponents differ by 38, the mantissa of the smaller number is shifted 38 bits to the right:
470832^2*4 in adjusted 64-bit IEEE 754 and 80-bit x86 extended precision:
this bit came from 1.xxxx ------------------------------v
0 10001001100 (0)0000000000000000000000000000000000000110011100111010|1010
0 100000001001100 0 0000000000000000000000000000000000000110011100111010101001010101
Now both quantities have the same exponents and their mantissas could be summed:
0 10001001100 (1)0100110100000001111100111011101001111111010101100111|0010
0 10001001100 (0)0000000000000000000000000000000000000110011100111010|1010
--------------------------------------------------------------------------
0 10001001100 (1)0100110100000001111100111011101010000101110010100001|1100
I kept some of the 80-bit precision bits on the right of the bar, because the summation internally is done in the greater precision of 80 bits.
Now let's perform the subtraction in 64-bit + some bits of the 80-bit rep:
minuend 0 10001001100 (1)0100110100000001111100111011101010000101110010100001|1100
subtrahend 0 10001001100 (1)0100110100000001111100111011101010000101110010100001|1100
-------------------------------------------------------------------------------------
difference 0 10001001100 (0)0000000000000000000000000000000000000000000000000000|0000
A pure 0! If you perform the calculations in full 80-bit, you would once again obtain a pure 0.
The real problem here is that 1.0 cannot be represented in 64-bit precision with an exponent of 2^77 - there are no 77 bits of precision in the mantissa. This is also true for the 80-bit precision - there are only 63 bits in the mantissa, 14 bits less than necessary to represent 1.0 given an exponent of 2^77.
So that's it! It's just the wonderful world of scientific computing where nothing works the way you were taught in the math classes...
Related
So, it's almost time for midterms and the professor gave us some sample questions.
What I THINK the answer is:
We are given a float that is f=50000.
if we do f*f we get 2,500,000,000.
Now, I'm assuming we're working with a 32-bit machine, as that is what we have studied so far. So, if that is the case, then 2,500,000,000 as a 32-bit float not declared unsigned is considered signed by default. Since 2,500,000,000 is a little over half of the 32-bit range of 4,294,967,296, and it is signed, we would have a negative value returned, so the statement f * f < 0 would be true, right?
I've only been studying systems programming for 4 weeks, PLEASE correct me if I am wrong here.
Unlike the int type, which is typically represented as a two's complement number, a float is a floating point type, which means it stores values using a mantissa and an exponent. This means that the typical wrapping behavior seen with signed integer types doesn't apply to floating point types.
In the case of 2,500,000,000, this will actually get stored as 0x1.2A05F2 x 2^31.
Floating point types are typically stored using IEEE 754 floating point format. In the case of a single precision floating point (which a float typically is), it has 1 sign bit, 8 exponent bits, and 24 mantissa bits (with 23 bits stored, as the high order "1" bit is implied).
While this format can't "wrap" from positive to negative, it is subject to 2 things:
Loss of precision
Overflow of the exponent
As an example of precision loss, let's use a decimal floating point format with a 3 digit mantissa and a 2 digit exponent. If we multiply 2.34 x 10^10 by 6.78 x 10^10, you get 1.58652 x 10^21, but because of the 3 digit precision it gets truncated to 1.58 x 10^21. So we lose the least significant digits.
To illustrate exponent overflow, suppose we were to multiply 2.00 x 10^60 by 3.00 x 10^50. You'd get 6.00 x 10^110. But because the maximum value of an exponent is 99, this is an overflow. IEEE 754 has a special notation for infinity which it uses in the case of overflow, where it sets the mantissa to all 0 bits and the exponent to all 1 bits, and the sign bit can be used to distinguish positive infinity and negative infinity.
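A small demonstration of both points (a sketch of my own, using the question's f = 50000 case plus a separate overflow case):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float f  = 50000.0f;
    float ff = f * f;   /* 2,500,000,000 happens to fit a float's 24-bit significand exactly */

    printf("f * f     = %f\n", ff);
    printf("f * f < 0 : %d\n", ff < 0.0f);   /* expected: 0 - no wrap to negative */

    float big  = 3.0e30f;
    float prod = big * big;   /* 9e60 exceeds FLT_MAX, so the float result is +inf */
    printf("big * big = %f (isinf: %d)\n", prod, isinf(prod));
    return 0;
}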
Having this code, (this is a well-known fast square root inverse algorithm, that was used in Quake 3) I am unable to understand the print output. I understand the entire algorithm superficially but want to get an in-depth understanding. What is the value of i when printf prints it? This gives 1120403456 for me. Does it depend on the computer's architecture? I've read somewhere that type-punning like this results in undefined behaviour. On the other website I've read that the value of i at that point is the exact value of bits that are used by this variable. I am confused with this and honestly expected the value of i to be 100. How do I interpret the resulting 1120403456? How do I convert this value to decimal 100? Are those bits encoded somehow? This is the code excerpt:
#include <stdio.h>

int main()
{
    float x = 100;
    float xhalf = 0.5f * x;
    int i = *(int*)&x;
    printf("%d", i);
    i = 0x5f3759df - (i >> 1);
    x = *(float*)&i;
    x = x * (1.5f - xhalf * x * x);
    return x;
}
The value of i printed after int i = *(int*)&x; is the bit representation of the floating-point 100.0, which is what x was initialised to.
Since you're using the %d format in printf it will print that representation as a decimal integer.
The bit pattern of 100.0 in an IEEE 32-bit float is 0x42c80000, which is 1120403456 in decimal.
To interpret 1120403456 or any other number as an IEEE basic 32-bit binary floating-point number:
Write the number as 32 bits, in this case, 01000010110010000000000000000000.
The first bit is the sign. 0 is + and 1 is −.
The next eight bits are an encoding of the exponent. They are 10000101, which is 133 in decimal. The exponent is encoded with a bias, 127, meaning the actual power of two represented is 2^(133−127) = 2^6. (If the exponent is 0 or 255, it is a special value that has a different meaning.)
The remaining 23 bits encode the significand. For normal exponents (codes 1 to 254), they encode a significand formed by starting with “1.” and appending the 23 bits, “10010000000000000000000”, to make a binary numeral, “1.10010000000000000000000”. In decimal, this is 1.5625.
The full value encoded is the sign with the power of two multiplied by the significand: +2^6 • 1.5625, which equals 64 • 1.5625, which is 100.
If the exponent code were 0, it represents 2^−126, and the significand would be formed by starting with “0.” instead of “1.”
If the exponent code were 255, and the significand bits were zero, the object would represent infinity (+∞ or −∞ according to the sign bit). If the exponent code were 255, and the significand bits were not zero, the object would represent a Not a Number (NaN). There are additional semantics to NaN not covered in this answer.
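A small C sketch of my own showing one well-defined way to pull those fields apart (it uses memcpy rather than the pointer cast from the question, to sidestep the aliasing concern mentioned there):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float x = 100.0f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* copy the float's bits into an integer */

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF; /* biased exponent */
    unsigned mantissa = bits & 0x7FFFFF;     /* 23 stored significand bits */

    printf("raw      : %u (0x%08X)\n", (unsigned)bits, (unsigned)bits); /* 1120403456, 0x42C80000 */
    printf("sign     : %u\n", sign);                                    /* 0 */
    printf("exponent : %u -> 2^%d\n", exponent, (int)exponent - 127);   /* 133 -> 2^6 */
    printf("mantissa : 0x%06X -> 1.5625\n", mantissa);                  /* 0x480000 */
    /* value = +1 * 2^6 * 1.5625 = 100 */
    return 0;
}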
It is much older than Quake III, although the exact magic constant has varied a bit.
The encoding of a finite single-precision floating point number according to IEEE-754 (which is what most modern computers use) is
31 30 23 22 0
├─┼─┬─┬─┬─┬─┬─┬─┬─┼─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┤
│S│ Exponent │ Mantissa │
└─┴───────────────┴─────────────────────────────────────────────┘
The highest bit, S is the sign. The value is negative if it is set, positive otherwise.
If Exponent and Mantissa are both zero, then the value is either 0.0 (if S is zero) or -0.0 (if S is 1).
If Exponent is zero, but Mantissa nonzero, you have a denormal number very close to zero.
Representations where Exponent is 255 (all bits set) are reserved for infinity (if Mantissa is zero), and not-a-number (if Mantissa is nonzero).
For Exponent values from 1 to 254, inclusive, Mantissa contains the fractional bits of the mantissa, with an implicit 1 in the integer position (this is what the (1) means in e.g. 0 10000001 (1)1011011011011011011011). In other words, the value Mantissa represents for these Exponent values ranges from 1.00000000000000000000000 in binary (1.0 in decimal) to 1.11111111111111111111111 in binary (1.99999994 in decimal), inclusive.
Now, if we consider only nonnegative values (with S clear), with E = Exponent (between 1 and 254, inclusive), and M the logical value of Mantissa (between 1.0 and 1.99999994), then the value V the single-precision floating point represents is
V = 2^(E − 127) × M
The 127 is the exponent bias. The square root of V is
√V = 2^((E − 127)/2) × √M = 2^(E/2 − 127) × 2^63 × √2 × √M
and the inverse square root, which is the target operation here,
1/√V = 2^((127 − E)/2) × 1/√M = 2^(−E/2) × 2^63 × √2 × 1/√M
noting that 2^(127/2) = 2^63.5 = 2^63 × √2.
If we take the integer representation and shift it one bit to the right, we effectively halve E. To multiply by 2^63, we need to add 63 to the exponent field, i.e. add 63×2^23 to the integer representation. However, for the square root operation, instead of multiplying by √2 × √M, we effectively just multiply by M/2. This means that (shifting the integer representation one bit right, then adding 63 to the exponent) yields an estimate of the square root that is off by a factor between 0.7071067 and 0.75 for arguments greater than 1.0, and by a factor between 0.7071067 and 1448.155 for values between 0 and 1. (It yields 5.421×10^−20 for √0, and 0.75 for √1.)
Note that for the inverse square root operation, we wish to negate E.
It turns out that if you shift the integer representation right by one bit, and then subtract that from 0x5f3759df (1597463007 in decimal; 1 01111100 1101110101100111011111 in binary; interpreted as a single-precision float it is an approximation of √(2^127) ≈ 13043817825332782212), you get a pretty good approximation of the inverse square root (1/√V). For all finite normal arguments, the approximation is off by a factor of 0.965624 to 1.0339603 from the correct value; a very good approximation! All you need on top is one or two iterations of Newton's method for the inverse square root operation. (For f(X) = 0 iff X = 1/√V you can take f(X) = 1/X² − V, so each iteration updates X to X − f(X)/f'(X) = X × (1.5 − 0.5 × V × X × X), which should look quite familiar.)
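Here is a minimal sketch of the technique described above (my own rewrite, not the verbatim Quake III source; it uses memcpy instead of the pointer cast so the type punning is well defined):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch of the classic approximation: bit trick for the initial estimate,
   then one Newton-Raphson iteration to refine it. */
static float fast_rsqrt(float v)
{
    uint32_t i;
    float x;

    memcpy(&i, &v, sizeof i);          /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);         /* initial estimate of 1/sqrt(v) */
    memcpy(&x, &i, sizeof x);

    x = x * (1.5f - 0.5f * v * x * x); /* one Newton-Raphson iteration */
    return x;
}

int main(void)
{
    printf("fast_rsqrt(100.0f) = %f (exact value 0.1)\n", fast_rsqrt(100.0f));
    return 0;
}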
Title pretty much sums it all.
I know that floats are 32 bits total, with 23 bits for the mantissa, 8 bits for the exponent, and 1 sign bit.
Calculating the range of int is pretty simple: 32 bits − 1 sign bit = 31 bits, so the range is roughly 2³¹ ≈ 2.14e9.
The formula makes sense...
Now I've looked around Stack Overflow, but all the answers I've found regarding float range calculations lacked substance: just a bunch of numbers appearing in the responses and magically reaching the 3.4e38 conclusion.
I'm looking for an answer from someone with real knowledge of the subject, someone who can explain through the use of a formula how this range is calculated.
Thank you all.
Mo.
C does not define float exactly as described by the OP. The format the OP suggests, binary32, is the most popular, but it is only one of many conforming formats.
What C does define
5.2.4.2.2 Characteristics of floating types
s sign (±1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum emin and a maximum emax)
p precision (the number of base-b digits in the significand)
f_k nonnegative integers less than b (the significand digits)
x = s*power(b,e)*Σ(k=1, p, f[k]*power(b,-k))
For binary32, the max value is
x = (+1)*power(2, 128)*(0.1111111111 1111111111 1111 binary)
x = 3.402...e+38
Given 32 bits to define a float, many other possibilities occur. Example: a float could exist just like binary32, yet not support infinity/not-a-number. That frees one more exponent value for finite numbers, and the max value is then 2*3.402...e+38.
binary32 describes its significand ranging up to 1.11111... binary. The C characteristic formula above ranges up to 0.111111...
C's float is typically IEEE 754 single precision, which means that a 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The mantissa is calculated by summing each mantissa bit * 2^(-(bit_index)). The exponent is calculated by converting the 8-bit binary number to decimal and subtracting 127 (thus you can have negative exponents as well), and the sign bit indicates whether or not the value is negative. The formula is thus:
(-1)^S * 1.M * 2^(E - 127)
Where S is the sign, M is the mantissa, and E is the exponent. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for a better mathematical explanation.
To explicitly answer your question, that means for a 32-bit float the largest finite value is (-1)^0 * 1.99999988079071044921875 * 2^127, which is approximately 3.40282347 × 10^38 (the value of FLT_MAX). The smallest value is the negative of that. (The biased exponent 255 is reserved for infinities and NaNs, so the largest usable power of two is 2^127, not 2^128.)
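As a quick sanity check (a sketch of my own), the limits in <float.h> agree with the (2 - 2^-23) * 2^127 formula:

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    /* Largest finite float: (2 - 2^-23) * 2^127; FLT_EPSILON is 2^-23. */
    float computed = ldexpf(2.0f - FLT_EPSILON, 127);

    printf("FLT_MAX      = %.9g\n", FLT_MAX);    /* 3.40282347e+38 */
    printf("computed     = %.9g\n", computed);
    printf("FLT_MANT_DIG = %d, FLT_MAX_EXP = %d\n", FLT_MANT_DIG, FLT_MAX_EXP);
    return 0;
}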
The code
#include <stdio.h>

int main(void)
{
    float x = 3.141592653589793238;
    double z = 3.141592653589793238;
    printf("x=%f\n", x);
    printf("z=%f\n", z);
    printf("x=%20.18f\n", x);
    printf("z=%20.18f\n", z);
    return 0;
}
will give you the output
x=3.141593
z=3.141593
x=3.141592741012573242
z=3.141592653589793116
where on the third line of output 741012573242 is garbage and on the fourth line 116 is garbage. Do doubles always have 16 significant figures while floats always have 7 significant figures? Why don't doubles have 14 significant figures?
Floating point numbers in C use IEEE 754 encoding.
This type of encoding uses a sign, a significand, and an exponent.
Because of this encoding, many numbers will have small changes to allow them to be stored.
Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one.
Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1 sign bit.
Double precision (double) gives you 52 bits of significand, 11 bits of exponent, and 1 sign bit.
Do doubles always have 16 significant figures while floats always have 7 significant figures?
No. Doubles always have 53 significant bits and floats always have 24 significant bits (except for denormals, infinities, and NaN values, but those are subjects for a different question). These are binary formats, and you can only speak clearly about the precision of their representations in terms of binary digits (bits).
This is analogous to the question of how many digits can be stored in a binary integer: an unsigned 32 bit integer can store integers with up to 32 bits, which doesn't precisely map to any number of decimal digits: all integers of up to 9 decimal digits can be stored, but a lot of 10-digit numbers can be stored as well.
Why don't doubles have 14 significant figures?
The encoding of a double uses 64 bits (1 bit for the sign, 11 bits for the exponent, 52 explicit significant bits and one implicit bit), which is double the number of bits used to represent a float (32 bits).
float: 23 bits of significand, 8 bits of exponent, and 1 sign bit.
double: 52 bits of significand, 11 bits of exponent, and 1 sign bit.
It's usually based on significant figures of both the exponent and significand in base 2, not base 10. From what I can tell in the C99 standard, however, there is no specified precision for floats and doubles (other than the fact that 1 and 1 + 1E-5 must be distinguishable for float, and 1 and 1 + 1E-9 for double). The number of significant figures is left to the implementer, as is which base they use internally; in other words, an implementation could decide to base it on 18 digits of precision in base 3. [1]
If you need to know these values, the constants FLT_RADIX and FLT_MANT_DIG (and DBL_MANT_DIG / LDBL_MANT_DIG) are defined in float.h.
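For example, a short program of my own that prints what a given implementation actually provides (FLT_DIG and DBL_DIG are added here for the decimal-digit view; all of these are standard <float.h> macros):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_RADIX    = %d\n", FLT_RADIX);      /* base used internally, usually 2 */
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);   /* usually 24 significand bits */
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);   /* usually 53 significand bits */
    printf("FLT_DIG      = %d\n", FLT_DIG);        /* decimal digits that round-trip, usually 6 */
    printf("DBL_DIG      = %d\n", DBL_DIG);        /* usually 15 */
    return 0;
}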
The reason it's called a double is because the number of bytes used to store it is double the number of a float (but this includes both the exponent and significand). The IEEE 754 standard (used by most compilers) allocate relatively more bits for the significand than the exponent (23 to 9 for float vs. 52 to 12 for double), which is why the precision is more than doubled.
1: Section 5.2.4.2.2 ( http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf )
A float has 23 bits of precision, and a double has 52.
It's not exactly double precision because of how IEEE 754 works, and because binary doesn't really translate well to decimal. Take a look at the standard if you're interested.
This question already has answers here: strange output in comparison of float with float literal.
#include <stdio.h>
#include <conio.h>   /* for getch() - non-standard, as in the original code */

int main(void)
{
    float f = 0.98;
    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getch();
    return 0;
}
I am getting this problem here. On using different floating point values of f I am getting different results.
Why is this happening?
f is using float precision, but 0.98 is in double precision by default, so the statement f <= 0.98 is compared using double precision.
f is therefore converted to a double in the comparison, and since the float rounding of 0.98 is slightly larger than the double rounding, the converted value ends up slightly larger than 0.98.
Use
if(f <= 0.98f)
or use a double for f instead.
In detail... assuming float is IEEE single-precision and double is IEEE double-precision.
These kinds of floating point numbers are stored with a base-2 representation. In base 2 this number needs infinite precision to represent, as it is a repeating binary fraction:
0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...
A float can only store 24 bits of significant figures, i.e.
0.111110101110000101000111_101...
^ round off here
= 0.111110101110000101001000
= 16441672 / 2^24
= 0.98000001907...
A double can store 53 bits of significant figures, so
0.11111010111000010100011110101110000101000111101011100_00101000...
^ round off here
= 0.11111010111000010100011110101110000101000111101011100
= 8827055269646172 / 2^53
= 0.97999999999999998224...
So the 0.98 will become slightly larger in float and smaller in double.
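A quick way to see both roundings side by side (a small sketch of my own, not part of the original answer):

#include <stdio.h>

int main(void)
{
    float  f = 0.98f;
    double d = 0.98;

    printf("float  0.98f = %.20f\n", f);   /* 0.98000001907348632812... */
    printf("double 0.98  = %.20f\n", d);   /* 0.97999999999999998224... */
    printf("f <= 0.98    : %d\n", f <= 0.98);   /* 0: the float value is larger */
    printf("f <= 0.98f   : %d\n", f <= 0.98f);  /* 1: compared at float precision */
    return 0;
}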
It's because floating point values are not exact representations of the number. All base ten numbers need to be represented on the computer as base 2 numbers. It's in this conversion that precision is lost.
Read more about this at http://en.wikipedia.org/wiki/Floating_point
An example (from encountering this problem in my VB6 days)
To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.
Bit 1 is the sign bit (is it negative [1] or positive [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation )
So for 1.1 the single-precision floating point value is stored as follows (this is the truncated value; the compiler may round the least significant bit behind the scenes, but all I do here is truncate, which is slightly less accurate but doesn't change the results of this example):
s --exp--- -------mantissa--------
0 01111111 00011001100110011001100
If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.
sign = 0 = a positive number
exponent: 01111111 = 127
mantissa: 00011001100110011001100 = 838860
With the mantissa we need to convert it to a decimal value. The reason is that there is an implied integer bit ahead of the binary number (i.e. 1.00011001100110011001100). The implied bit is there because the mantissa represents a normalized value to be used in scientific notation: 1.0001100110011.... * 2^(x-127).
To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 to the mantissa gives us 1.099999904632568359375. The exponent is 127, and the formula calls for 2^(x-127).
So here is the math:
(1 + 0.099999904632568359375) * 2^(127-127)
1.099999904632568359375 * 1 = 1.099999904632568359375
As you can see 1.1 is not really stored in the single floating point value as 1.1.
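For reference, a tiny check of my own: a real compiler rounds rather than truncates, so the stored value ends up slightly above 1.1 rather than slightly below it, but it is still not exactly 1.1:

#include <stdio.h>

int main(void)
{
    float f = 1.1f;
    /* The nearest float to 1.1 (with round-to-nearest) is slightly larger than 1.1. */
    printf("%.25f\n", f);   /* 1.1000000238418579101562500 */
    return 0;
}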