Consider the following code:
#include <stdio.h>
#define k 0.7
#define p 0.5

int main(void)
{
    float a = k, b = p;
    double aa = k;

    if (a < k)
    {
        if (b < p) printf("2 is right");
        else       printf("1 is right");
    }
    else printf("0 is right");
    printf("\n");
    if (a > k)
    {
        if (b < p) printf("2 is right");
        else       printf("1 is right");
    }
    else printf("0 is right");
    return 0;
}
Consider this part II of this question: there, the understanding was that the extra precision of a floating-point constant (numeric constants are doubles by default) is lost when the constant is converted to the corresponding float value, with values like X.5 and X.0 being the exceptions. But I observed the following results:
Input            First line     Second line
k = 0.1 to 0.4   0 is right     1 is right
k = 0.5          0 is right     1 is right
k = 0.6          0 is right     1 is right
k = 0.7          1 is right     0 is right
k = 0.8          0 is right     1 is right
k = 0.9          1 is right     0 is right
k = 8.4          1 is right     0 is right
Why this odd behavior? How come only a few floating point values display this property? Can't we assume that float precision values are always less than double precision values?? How do we explain this behavior?
Can't we assume that float precision values are always less than double precision values??
Not really, they may both have the same precision. You can assume that the range and precision of double is not smaller than that of float.
But, for all practical purposes, it's a profitable bet that double has 53 bits of precision and float has 24. And that double has 11-bit exponents, float 8-bit.
So, disregarding exotic architectures, float has less precision and a smaller range than double, every float value is representable as a double, but not vice versa. So casting from float to double is value-preserving, but casting from double to float will change all values needing more than 24 bits of precision.
The cast is generally performed (for values in the float range) by rounding the significand of the double value to 24 bits of precision in the following way:
if the 25th most significant bit of the significand is 0, the significand is truncated (the value is rounded towards zero)
if the 25th most significant bit is 1, and not all bits of lower significance are 0, the value is rounded away from zero (the significand is truncated and then the value of the least significant bit is added)
otherwise, the significand is rounded so that the 24th most significant bit is zero ("ties to even"), which rounds away from zero in half the cases and towards zero in the other half.
You cannot predict if casting a double to float increases or decreases the value by looking at the decimal expansion, except in the few cases where you can see that the value will be unchanged. It's the binary expansion that matters.
Related
I typed the following C code:
#include <stdio.h>
#include <stddef.h>

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}

int main(void)
{
    float f = 3.9;
    show_float(f);
    return 0;
}
The output of this code is:
Output: 0x4079999A
Manual Calculations
1.1111001100110011001100 × 2^1
M: 1111001100110011001100
E: 1 + 127 = 128(d) = 10000000(b)
Final Bits: 01000000011110011001100110011000
Hex: 0x40799998
Why is the last digit displayed as A instead of 8? According to my manual calculation, the answer in hex should be 0x40799998.
Those undisclosed manual calculations must be wrong. The correct result is 0x4079999A.
In the format commonly used for float, IEEE-754 binary32 or “single precision,” numbers are represented as an integer with magnitude less than 2^24 multiplied by a power of two within certain limits. (The floating-point representation is often described in other forms, such as sign, a 24-digit binary significand with radix point after the first digit, and a power of two. These forms are mathematically equivalent.)
The two numbers in this form closest to 3.9 are 16,357,785·2^-23 and 16,357,786·2^-23. These are, respectively, 3.8999998569488525390625 and 3.900000095367431640625. Lining them up, we can see the latter is closer to 3.9:
3.8999998569488525390625
3.9000000000000000000000
3.9000000953674316406250
as the former differs by about 1.4 at the seventh digit after the decimal point, whereas the latter differs by about 9.5 at the eighth digit after the decimal point.
Therefore, the best conversion of 3.9 to this float format produces 16,357,786·2^-23. In hexadecimal, 16,357,786 is 0xF9999A. In the encoding of the representation into the bits of a float, the low 23 bits of the significand are put into the primary significand field. The low 23 bits are 0x79999A, and that is what we should see in the primary significand field.
Also note we can easily see the binary for 3.9 is
11.11100110011001100110011001100110011001100110…2. The first 24 bits, 11.1110011001100110011001, are what fit in the float significand. Immediately after them comes 1001…, which exceeds half of the last kept bit, so the value ought to round up, and therefore the last four bits of the significand should be 1010.
(Also note that good C implementations convert numerals in source text to the nearest representable number, especially for numbers without many decimal digits, but the C standard does not require this. It says “For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.” However, the encoding shown in the question, 0x40799998, is not either of the adjacent representable values, 0x40799999 and 0x4079999A. It is farther away than either.)
While testing the float type and printing it with its format specifier %f, I was testing its rounding behaviour.
I declared a variable as float and gave it the value 5.123456. As you know, a float must represent at least 6 significant figures.
I then changed its value to 5.1234567 and printed it with %f. It baffles me that it prints as 5.123456. But if I change the value to 5.1234568, it prints as 5.123457. That rounds properly.
If I haven't made myself clear or the explanation is very confusing:
float a = 5.1234567;
printf("%f", a);
// prints 5.123456

float a = 5.1234568;
printf("%f", a);
// prints 5.123457
I've compiled using CodeBlocks and MinGW, same result.
OP is experiencing the effects of double rounding
First, the values 5.123456, 5.1234567, etc. are rounded by the compiler to the closest representable float. Then printf() is rounding the float value to the closest 0.000001 decimal textual representation.
I've declared the variable as float and gave it the value 5.123456. As you know float must represent at least 6 significant figures.
A float can represent about 2^32 different values. 5.123456 is not one of them. The closest value a typical float can represent is 5.12345600128173828125 and that is correct for 6 significant digits: 5.12345...
float x = 5.123456f;
// 5.123455524444580078125 representable float just smaller than 5.123456
// 5.123456 OP's code
// 5.12345600128173828125 representable float just larger than 5.123456 (best)
// The following prints 7 significant digits
// %f prints 6 places after the decimal point.
printf("%f", 5.123456f); // --> 5.123456
With 5.1234567, the closest float has an exact value of 5.123456478118896484375. When using "%f", this is expected to print rounded to the closest 0.000001, i.e. 5.123456.
float x = 5.1234567f;
// 5.123456478118896484375 representable float just smaller than 5.1234567 (best)
// 5.1234567 OP's code
// 5.1234569549560546875 representable float just larger than 5.1234567
// %f prints 6 places after the decimal point.
printf("%f", 5.1234567f); // --> 5.123456
Significant digits is not the number of digits after the decimal point. It is the number of digits starting with the left-most (most significant) digit.
To print a float to 6 significant figures, use "%.*e".
See Printf width specifier to maintain precision of floating-point value for more details.
float x = 5.1234567;
printf("%.*e\n", 6 - 1, x); // 5.12346e+00
// x xxxxx 6 significant digits
There is no exact float representation for the number 5.1234567 you intend to show here.
If you check the converter at
https://www.h-schmidt.net/FloatConverter/IEEE754.html
you can see that this number is converted to 5.1234565 (the double value 5.1234564781188965), which rounds down when printed,
while 5.1234568 converts to a float whose double value is 5.123456954956055, which rounds up.
There are two levels of rounding going on:
Your constant of 5.1234567 gets rounded to the nearest value which can be represented by a float (5.123456478...).
The float gets rounded to 6 digits when printed.
It will become obvious if you print the value with more digits.
What it comes down to is that the mantissa of a float has 23 bits and this is not the same as 6 decimal digits (or any number of digits really). Even some apparently simple values like 0.1 don't have an exact float representation.
I am trying to figure out exactly how big a number I can store as a floating point number and as a double. But the values are not stored the way I expected, except for the integer. A double holds 8 bytes of information, which should be enough to hold variable a, but it does not hold it exactly: it shows 1234567890123456768, in which the last two digits are different. And when I stored 2147483648 (or changed its last digit) in the float variable b, it shows the same value 2147483648, which is supposed to be the limit. So what's going on?
#include <stdio.h>
#include <math.h>

int main(void)
{
    double a;
    float b;
    int c;

    a = 1234567890123456789;
    b = 2147483648;
    c = 2147483647;

    printf("Bytes of double: %zu\n", sizeof(double));
    printf("Bytes of integer: %zu\n", sizeof(int));
    printf("Bytes of float: %zu\n", sizeof(float));
    printf("\n");
    printf("You can count up to %.0f in 4 bytes\n", pow(2, 32));
    printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2, 31));
    printf("You can count up to %.0f in 8 bytes\n", pow(2, 64));
    printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2, 63));
    printf("\n");
    printf("double number: %.0f\n", a);
    printf("floating point: %.0f\n", b);
    printf("integer: %d\n", c);
    return 0;
}
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in IEEE-754 standard float, 11 bits in double), and a mantissa (23 and 52 bits in float and double respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after removing the exponent's bias; its encoded value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions can not be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to 2^(mantissa bits + 1) in magnitude. Thus an IEEE-754 float can represent all integers up to 2^24 and a double up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart, i.e., you can represent some integers greater than 2^(mantissa bits + 1), but not every integer beyond that magnitude.
For example:
#include <stdio.h>
#include <math.h>

int main(void)
{
    float f = powf(2.0f, 24.0f);
    float f1 = f + 1.0f, f2 = f1 + 2.0f;
    double d = pow(2.0, 53.0);
    double d1 = d + 1.0, d2 = d + 2.0;

    (void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
    (void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
    return 0;
}
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa bits + 1) makes no difference since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart since the multiplier has doubled).
TL;DR An IEEE-754 float can precisely represent all integers up to 2^24 and a double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent integers exactly only up to 2^53, which is less than 1234567890123456789.
See also Double-precision floating-point format.
You can use these constants to know the limits:
FLT_MAX
DBL_MAX
LDBL_MAX
From CPP reference
You can inspect the actual limits of the standard types through the macros in the float.h and limits.h headers (the C++ equivalent is std::numeric_limits).
Hardware cannot represent arbitrary real numbers exactly; it uses a fixed bit-length to approximate a floating type. Since you do not have infinite storage for floating types, a double variable can only hold a value to a specific precision. Most hardware represents floating types using the IEEE-754 standard.
To get more precision you could try long double (depending on the hardware this may be quadruple precision rather than double precision), AVX/SSE registers, big-num libraries, or you could implement it yourself.
The sizeof an object only reports the memory space it occupies. It does not show the valid range. It would be entirely possible for an unsigned int with, e.g., 2^16 (65536) possible values to occupy 32 bits in memory.
For floating point objects, it is more difficult. They consist of (simplified) two fields: an integer mantissa and an exponent (see details in the linked article). Both with a fixed width.
As the mantissa only has a limited width, trailing bits are truncated or rounded and the exponent is adjusted if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with 4 digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
The parameters for your implementation are defined in float.h, similar to limits.h, which provides the parameters for integers.
On Linux, #include <values.h>
On Windows,#include <float.h>
There is a fairly comprehensive list of defines
So I got a task where I have to extract the sign, exponent and mantissa from a floating point number given as a uint32_t. I have to do that in C; how do I do it?
For the sign I would look at the MSB (most significant bit), since it tells me whether my number is positive or negative, depending on whether it is 0 or 1.
Or, to get straight to my idea: can I "splice" my 32-bit number into three parts?
Get the 1 bit for the MSB/sign,
then the following byte, which stands for the exponent,
and at last 23 bits for the mantissa.
It probably doesn't work like that, but can you give me a hint/solution?
I know of frexp, but I want an alternative where I learn a little more C.
Thank you.
If you know the bitwise layout of your floating point type (e.g. because your implementation supports IEEE floating point representations) then convert a pointer to your floating point variable (of type float, double, or long double) into a pointer to unsigned char. From there, treat the variable like an array of unsigned char and use bitwise operations to extract the parts you need.
Otherwise, do this:
#include <math.h>

int main(void)
{
    double x = 4.25E12; /* value picked at random */
    double significand;
    int exponent;
    int sign;

    sign = (x >= 0) ? 1 : -1; /* deem 0 to have positive sign */
    significand = frexp(x, &exponent);
    return 0;
}
The calculation of sign in the above should be obvious.
significand may be positive or negative, for non-zero x. The absolute value of significand is in the range [0.5,1) which, when multiplied by 2 to the power of exponent gives the original value.
If x is 0, both exponent and significand will be 0.
This will work regardless of what floating point representations your compiler supports (assuming double values).
#include <stdio.h>
#include <conio.h>

int main(void)
{
    float f = 0.98;

    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getch();
    return 0;
}
I am getting this problem here: on using different floating point values of f, I get different results.
Why is this happening?
f is using float precision, but 0.98 is a double by default, so the statement f <= 0.98 is compared in double precision.
f is therefore converted to a double in the comparison, which exposes that the stored float value is slightly larger than 0.98.
Use
if(f <= 0.98f)
or use a double for f instead.
In detail... assuming float is IEEE single-precision and double is IEEE double-precision.
These kinds of floating point numbers are stored in a base-2 representation. In base 2 this number needs infinite precision to represent, because it is a repeating binary fraction:
0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...
A float can only store 24 significant bits, i.e.
0.111110101110000101000111_101...
^ round off here
= 0.111110101110000101001000
= 16441672 / 2^24
= 0.98000001907...
A double can store 53 significant bits, so
0.11111010111000010100011110101110000101000111101011100_00101000...
^ round off here
= 0.11111010111000010100011110101110000101000111101011100
= 8827055269646172 / 2^53
= 0.97999999999999998224...
So the 0.98 will become slightly larger in float and smaller in double.
It's because floating point values are not exact representations of the number. All base ten numbers need to be represented on the computer as base 2 numbers. It's in this conversion that precision is lost.
Read more about this at http://en.wikipedia.org/wiki/Floating_point
An example (from encountering this problem in my VB6 days)
To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.
Bit 1 is the sign bit (is it negative [1] or positive [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation )
So for 1.1 the single-precision floating point value is stored as follows. (This is the truncated value; the compiler may round the least significant bit behind the scenes, but all I do here is truncate, which is slightly less accurate but doesn't change the point of this example.)
s --exp--- -------mantissa--------
0 01111111 00011001100110011001100
If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.
sign = 0 = a positive number
exponent: 01111111 = 127
mantissa: 00011001100110011001100 = 838860
With the mantissa we need to convert it to a decimal value. The reason is there is an implied integer ahead of the binary number (i.e. 1.00011001100110011001100). The implied number is because the mantissa represents a normalized value to be used in the scientific notation: 1.0001100110011.... * 2^(x-127).
To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 gives us 1.099999904632568359375. The exponent is 127, and the formula calls for 2^(x-127).
So here is the math:
(1 + 0.099999904632568359375) * 2^(127-127)
1.099999904632568359375 * 1 = 1.099999904632568359375
As you can see 1.1 is not really stored in the single floating point value as 1.1.