Rounding floats in C

While testing the float type and printing it with its format specifier %f, I was testing its rounding behavior.
I declared the variable as float and gave it the value 5.123456. As you know, a float must represent at least 6 significant figures.
I then changed its value to 5.1234567 and printed it with %f. It baffles me why it prints out as 5.123456. But if I change the value to 5.1234568, it prints out as 5.123457, rounding properly.
If I haven't made myself clear or the explanation is very confusing:
float a = 5.1234567;
printf("%f", a);
// prints out as 5.123456
float a = 5.1234568;
printf("%f", a);
// prints out as 5.123457
I've compiled using Code::Blocks with MinGW; same result.

OP is experiencing the effects of double rounding
First, the values 5.123456, 5.1234567, etc. are rounded by the compiler to the closest representable float. Then printf() is rounding the float value to the closest 0.000001 decimal textual representation.
I've declared the variable as float and gave it the value 5.123456. As you know float must represent at least 6 significant figures.
A float can represent about 2^32 different values. 5.123456 is not one of them. The closest value a typical float can represent is 5.12345600128173828125 and that is correct for 6 significant digits: 5.12345...
float x = 5.123456f;
// 5.123455524444580078125 representable float just smaller than 5.123456
// 5.123456 OP's code
// 5.12345600128173828125 representable float just larger than 5.123456 (best)
// The following prints 7 significant digits
// %f prints 6 places after the decimal point.
printf("%f", 5.123456f); // --> 5.123456
With 5.1234567, the closest float has an exact value of 5.123456478118896484375. When using "%f", this is expected to print rounded to the closest 0.000001, i.e. 5.123456.
float x = 5.1234567f;
// 5.123456478118896484375 representable float just smaller than 5.1234567 (best)
// 5.1234567 OP's code
// 5.1234569549560546875 representable float just larger than 5.1234567
// %f prints 6 places after the decimal point.
printf("%f", 5.1234567f); // --> 5.123456
Significant digits are not the number of digits after the decimal point. They are the number of digits counted starting with the left-most (most significant) digit.
To print a float to 6 significant figures, use "%.*e".
See Printf width specifier to maintain precision of floating-point value for more details.
float x = 5.1234567f;
printf("%.*e\n", 6 - 1, x); // 5.12346e+00
// x xxxxx 6 significant digits

There is no exact float representation for the number 5.1234567 you intend to show here.
If you check here:
https://www.h-schmidt.net/FloatConverter/IEEE754.html
you can see that this number is converted into 5.1234565 (the double value 5.1234564781188965), and this rounds down, while the number 5.1234568 is representable in float, has a double representation of 5.123456954956055, and this rounds up.
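One way to see this for yourself is to print both constants with %.9g, since nine significant digits are enough to uniquely identify any float value; a minimal sketch (the outputs in the comments assume IEEE-754):

#include <stdio.h>

int main(void)
{
    /* %.9g prints enough digits to uniquely identify any float value */
    printf("%.9g\n", 5.1234567f);  /* 5.12345648 -- rounded down */
    printf("%.9g\n", 5.1234568f);  /* 5.12345695 -- rounded up   */
    return 0;
}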

There are two levels of rounding going on:
Your constant of 5.1234567 gets rounded to the nearest value which can be represented by a float (5.123456478...).
The float gets rounded to 6 digits when printed.
It will become obvious if you print the value with more digits.
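For example, a minimal sketch (the exact digits shown in the comment assume an IEEE-754 float):

#include <stdio.h>

int main(void)
{
    float a = 5.1234567f;
    /* Ask for far more digits than a float can hold: the extra digits
       reveal the exact value of the nearest representable float */
    printf("%.20f\n", a);   /* e.g. 5.12345647811889648438 */
    return 0;
}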
What it comes down to is that the mantissa of a float has 23 bits and this is not the same as 6 decimal digits (or any number of digits really). Even some apparently simple values like 0.1 don't have an exact float representation.

Related

Floating Point Representation in Hexadecimal using C Language

I typed the following C code:
#include <stdio.h>

typedef unsigned char* byte_pointer;

void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}

int main(void)
{
    float f = 3.9;
    show_float(f);
    return 0;
}
The output of this code is:
Output: 0x4079999A
Manual calculations:
1.1111001100110011001100 × 2^1
M: 1111001100110011001100
E: 1 + 127 = 128 (decimal) = 10000000 (binary)
Final bits: 01000000011110011001100110011000
Hex: 0x40799998
Why is the last digit displayed as A instead of 8? As per my manual calculations, the answer in hex should be 0x40799998.
Those undisclosed manual calculations must be wrong. The correct result is 0x4079999A.
In the format commonly used for float, IEEE-754 binary32 or “single precision,” numbers are represented as an integer with magnitude less than 2^24 multiplied by a power of two within certain limits. (The floating-point representation is often described in other forms, such as sign, a 24-digit binary significand with radix point after the first digit, and a power of two. These forms are mathematically equivalent.)
The two numbers in this form closest to 3.9 are 16,357,785·2^−23 and 16,357,786·2^−23. These are, respectively, 3.8999998569488525390625 and 3.900000095367431640625. Lining them up, we can see the latter is closer to 3.9:
3.8999998569488525390625
3.9000000000000000000000
3.9000000953674316406250
as the former differs by about 1.4 at the seventh digit after the decimal point, whereas the latter differs by about 9.5 at the eighth digit after the decimal point.
Therefore, the best conversion of 3.9 to this float format produces 16,357,786·2^−23. In hexadecimal, 16,357,786 is 0xF9999A. In the encoding of the representation into the bits of a float, the low 23 bits of the significand are put into the primary significand field. The low 23 bits are 0x79999A, and that is what we should see in the primary significand field.
Also note we can easily see the binary for 3.9 is
11.1110011001100110011001 1001 1001 1001… (binary)
where the 24 bits before the first space are the ones that fit in the float significand. Immediately after them is 1001…, which we can see ought to round up, since it exceeds half of the previous bit, and therefore the last four bits of the significand should be 1010.
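One can verify the encoding on a real machine by copying the float's bytes into a 32-bit integer; a minimal sketch, assuming float is IEEE-754 binary32 and 32 bits wide:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = 3.9f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       /* copy the object representation */
    printf("0x%08X\n", (unsigned) bits);  /* 0x4079999A on an IEEE-754 system */
    return 0;
}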
(Also note that good C implementations convert numerals in source text to the nearest representable number, especially for numbers without many decimal digits, but the C standard does not require this. It says “For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.” However, the encoding shown in the question, 0x40799998, is not for either of the adjacent representable values, 0x40799999 and 0x4079999A. It is farther away than either.)

How big a number can you store in double and float in C?

I am trying to figure out exactly how big a number I can use as a floating point number and as a double. But it does not store the way I expected, except for integer values. A double should hold 8 bytes of information, which is enough to hold variable a, but it does not hold it right: it shows 1234567890123456768, in which the last 2 digits are different. And when I stored 2147483648 (or changed the last digit) in the float variable b, it shows the same value 2147483648, which is supposed to be the limit. So what's going on?
#include <stdio.h>
#include <math.h>

int main(void)
{
    double a;
    float b;
    int c;
    a = 1234567890123456789;
    b = 2147483648;
    c = 2147483647;
    printf("Bytes of double: %zu\n", sizeof(double));
    printf("Bytes of integer: %zu\n", sizeof(int));
    printf("Bytes of float: %zu\n", sizeof(float));
    printf("\n");
    printf("You can count up to %.0f in 4 bytes\n", pow(2, 32));
    printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2, 31));
    printf("You can count up to %.0f in 8 bytes\n", pow(2, 64));
    printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2, 63));
    printf("\n");
    printf("double number: %.0f\n", a);
    printf("floating point: %.0f\n", b);
    printf("integer: %d\n", c);
    return 0;
}
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in an IEEE-754 standard float, 11 bits in a double), and a mantissa (23 and 52 bits in float and double, respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after normalizing the exponent; its binary value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions can not be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to a magnitude of 2^(mantissa_bits + 1). Thus an IEEE-754 float can represent all integers up to 2^24 and a double up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart, i.e., you can represent some integers greater than 2^(mantissa_bits + 1), but every integer only up to that magnitude.
For example:
#include <stdio.h>
#include <math.h>
int main(void)
{
    float f = powf(2.0f, 24.0f);
    float f1 = f + 1.0f, f2 = f + 2.0f;
    double d = pow(2.0, 53.0);
    double d1 = d + 1.0, d2 = d + 2.0;
    (void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
    (void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
    return 0;
}
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa_bits + 1) makes no difference since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart, since the multiplier has doubled).
 
TL;DR An IEEE-754 float can precisely represent all integers up to 2^24 and a double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent integers of at most 2^53 precisely, which is less than 1234567890123456789.
See also Double-precision floating-point format.
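A minimal sketch demonstrating the effect from the question (the printed value assumes a typical IEEE-754 double):

#include <stdio.h>

int main(void)
{
    /* 1234567890123456789 needs about 61 significant bits;
       a double's 53-bit significand cannot hold them all */
    double a = 1234567890123456789;
    printf("%.0f\n", a);   /* prints 1234567890123456768 */
    return 0;
}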
You can use these constants to know the limits:
FLT_MAX
DBL_MAX
LDBL_MAX
From CPP reference
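For example, a minimal program printing these limits (the %e and %Le conversions are standard; the exact values depend on the platform):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MAX  = %e\n", FLT_MAX);    /* about 3.4e+38 on IEEE-754  */
    printf("DBL_MAX  = %e\n", DBL_MAX);    /* about 1.8e+308 on IEEE-754 */
    printf("LDBL_MAX = %Le\n", LDBL_MAX);
    return 0;
}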
You can print the actual limits of the standard types using the constants in the 'limits.h' header file (for C++ the equivalent is the 'std::numeric_limits' identifier).
Hardware cannot represent floating point types exactly; it uses a fixed bit-length to approximate them. Since you don't have infinite length for floating types, you can only represent a double variable to a specific precision. Most hardware uses the IEEE-754 standard for the floating type representation.
To get more precision you could try 'long double' (depending on the hardware this could be quadruple-precision rather than double-precision), AVX/SSE registers, big-num libraries, or you could do it yourself.
The sizeof of an object only reports the memory space it occupies. It does not show the valid range. It would be quite possible to have an unsigned int with e.g. 2**16 (65536) possible values occupy 32 bits in memory.
For floating point objects, it is more difficult. They consist of (simplified) two fields: an integer mantissa and an exponent (see details in the linked article). Both with a fixed width.
As the mantissa only has a limited range, trailing bits are truncated or rounded and the exponent is corrected, if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with a 4-digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
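The same loss shows up directly in binary floating point; a minimal C sketch (output shown for a typical IEEE-754 double), illustrating why currency should not be stored this way:

#include <stdio.h>

int main(void)
{
    /* Summing ten "dimes" of 0.10 does not give exactly 1.00 */
    double total = 0.0;
    for (int i = 0; i < 10; i++)
        total += 0.10;
    printf("total      = %.17f\n", total);      /* 0.99999999999999989 */
    printf("total == 1 : %d\n", total == 1.0);  /* 0, i.e. false */
    return 0;
}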
The parameters for your implementation are defined in float.h, similar to limits.h, which provides the parameters for integers.
On Linux, #include <values.h>
On Windows, #include <float.h>
There is a fairly comprehensive list of defines.
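For example, a short program printing a few of the parameters from float.h (the commented values are the usual IEEE-754 ones):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG); /* significand bits, typically 24 */
    printf("FLT_DIG      = %d\n", FLT_DIG);      /* decimal digits, typically 6    */
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG); /* typically 53 */
    printf("DBL_DIG      = %d\n", DBL_DIG);      /* typically 15 */
    return 0;
}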

Unexpected floating point comparisons [duplicate]

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 7 years ago.
Consider the following code:
#include <stdio.h>

#define k 0.7
#define p 0.5

int main()
{
    float a = k, b = p;
    double aa = k;
    if (a < k)
    {
        if (b < p) printf("2 is right");
        else printf("1 is right");
    }
    else printf("0 is right");
    printf("\n");
    if (a > k)
    {
        if (b < p) printf("2 is right");
        else printf("1 is right");
    }
    else printf("0 is right");
    return 0;
}
Consider this as part II of this question. There, the understanding was that the double-precision values of floating point constants (numeric constants are doubles by default) are lost when converted to the corresponding float values; the exceptional values were X.5 and X.0. But I observed the following results:
Input          Output
K=0.1 to 0.4   0 is right / 1 is right
K=0.5          0 is right / 1 is right
K=0.6          0 is right / 1 is right
K=0.7          1 is right / 0 is right
K=0.8          0 is right / 1 is right
K=0.9          1 is right / 0 is right
K=8.4          1 is right / 0 is right
(each run prints two lines, one from each if/else block)
Why this odd behavior? How come only a few floating point values display this property? Can't we assume that float-precision values are always less than double-precision values? How do we explain this behavior?
Can't we assume that float-precision values are always less than double-precision values?
Not really, they may both have the same precision. You can assume that the range and precision of double is not smaller than that of float.
But, for all practical purposes, it's a profitable bet that double has 53 bits of precision and float has 24. And that double has 11-bit exponents, float 8-bit.
So, disregarding exotic architectures, float has less precision and a smaller range than double, every float value is representable as a double, but not vice versa. So casting from float to double is value-preserving, but casting from double to float will change all values needing more than 24 bits of precision.
The cast is generally performed (for values in the float range) by rounding the significand of the double value to 24 bits of precision in the following way:
if the 25th most significant bit of the significand is 0, the significand is truncated (the value is rounded towards zero);
if the 25th most significant bit is 1, and not all bits of lower significance are 0, the value is rounded away from zero (the significand is truncated and then the value of the least significant bit is added);
otherwise, the significand is rounded so that the 24th most significant bit is zero, which rounds away from zero in half the cases and towards zero in half.
You cannot predict if casting a double to float increases or decreases the value by looking at the decimal expansion, except in the few cases where you can see that the value will be unchanged. It's the binary expansion that matters.
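For illustration, a small sketch (assuming IEEE-754 float and double) showing that the same conversion rounds down for one literal and up for another:

#include <stdio.h>

int main(void)
{
    /* Whether the float lands above or below the decimal literal depends
       on the binary expansion, not the decimal one */
    printf("0.7f  = %.17g   0.7  = %.17g\n", (double)0.7f, 0.7);
    /* 0.7f  = 0.69999998807907104   0.7  = 0.69999999999999996 */
    printf("0.98f = %.17g   0.98 = %.17g\n", (double)0.98f, 0.98);
    /* 0.98f = 0.98000001907348633   0.98 = 0.97999999999999998 */
    return 0;
}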

Integer to floating conversion in C

float f1 = 123.125;
int i1 = -150;
f1 = i1; // integer to floating conversion
printf("%i assigned to an float produces %f\n", i1, f1);
Output:
-150 assigned to a float produces -150.000000
My question is why the result has 6 zeros (000000) after the "." and not 7 or 8 or some other number?
That's just what printf does. See the man page where it says
f, F
The double argument shall be converted to decimal notation in the style "[-]ddd.ddd", where the number of digits after the radix character is equal to the precision specification. If the precision is missing, it shall be taken as 6; if the precision is explicitly zero and no '#' flag is present, no radix character shall appear. If a radix character appears, at least one digit appears before it. The low-order digit shall be rounded in an implementation-defined manner.
(emphasis mine)
It has nothing to do with how 150 is represented as a floating point number in memory (and in fact, it's promoted to a double because printf is varargs).
The number of zeros you see is a result of the default precision used by the %f printf conversion. It's basically unrelated to the integer to floating point conversion.
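For example, a minimal sketch showing how the precision field controls the digit count:

#include <stdio.h>

int main(void)
{
    int i1 = -150;
    float f1 = i1;          /* integer to floating conversion */
    printf("%f\n", f1);     /* -150.000000 (default precision: 6) */
    printf("%.2f\n", f1);   /* -150.00 */
    printf("%.10f\n", f1);  /* -150.0000000000 */
    return 0;
}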
Because the C standard (§7.19.6.1) says that in the absence of information to the contrary, %f will print 6 decimal places.
f, F — A double argument representing a floating-point number is converted to decimal notation in the style [−]ddd.ddd, where the number of digits after the decimal-point character is equal to the precision specification. If the precision is missing, it is taken as 6; if the precision is zero and the # flag is not specified, no decimal-point character appears.
Floating point arithmetic is not exact. printf is just showing that number of zeroes.
From the documentation:
The default number of digits after the decimal point is six, but this can be changed with a precision field. If a decimal point appears, at least one digit appears before it. The "double" value is rounded to the correct number of decimal places.

problems in floating point comparison [duplicate]

This question already has answers here:
strange output in comparison of float with float literal
(8 answers)
Closed 7 years ago.
void main()
{
    float f = 0.98;
    if (f <= 0.98)
        printf("hi");
    else
        printf("hello");
    getch();
}
I am getting this problem here: on using different floating point values of f, I am getting different results. Why is this happening?
f is using float precision, but 0.98 is in double precision by default, so the statement f <= 0.98 is compared using double precision.
The f is therefore converted to a double in the comparison, but the converted value may be slightly larger than 0.98.
Use
if(f <= 0.98f)
or use a double for f instead.
In detail... assuming float is IEEE single-precision and double is IEEE double-precision.
These kinds of floating point numbers are stored in a base-2 representation. In base-2 this number needs infinite precision to represent, as it is a repeating fraction:
0.98 = 0.1111101011100001010001111010111000010100011110101110000101000...
A float can only store 24 significant bits, i.e.
0.111110101110000101000111_101...
^ round off here
= 0.111110101110000101001000
= 16441672 / 2^24
= 0.98000001907...
A double can store 53 significant bits, so
0.11111010111000010100011110101110000101000111101011100_00101000...
^ round off here
= 0.11111010111000010100011110101110000101000111101011100
= 8827055269646172 / 2^53
= 0.97999999999999998224...
So 0.98 becomes slightly larger as a float and slightly smaller as a double.
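This is easy to verify by printing both with enough digits; a minimal sketch (the digits shown assume IEEE-754 float and double):

#include <stdio.h>

int main(void)
{
    /* Enough digits to show the exact stored values */
    printf("float : %.21f\n", (double)0.98f); /* 0.980000019073486328125 */
    printf("double: %.21f\n", 0.98);          /* 0.979999999999999982236 */
    return 0;
}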
It's because floating point values are not exact representations of the number. All base ten numbers need to be represented on the computer as base 2 numbers. It's in this conversion that precision is lost.
Read more about this at http://en.wikipedia.org/wiki/Floating_point
An example (from encountering this problem in my VB6 days)
To convert the number 1.1 to a single precision floating point number we need to convert it to binary. There are 32 bits that need to be created.
Bit 1 is the sign bit (is it negative [1] or positive [0])
Bits 2-9 are for the exponent value
Bits 10-32 are for the mantissa (a.k.a. significand, basically the coefficient of scientific notation )
So for 1.1 the single floating point value is stored as follows (this is the truncated value; the compiler may round the least significant bit behind the scenes, but all I do is truncate it, which is slightly less accurate but doesn't change the results of this example):
s --exp--- -------mantissa--------
0 01111111 00011001100110011001100
If you notice in the mantissa there is the repeating pattern 0011. 1/10 in binary is like 1/3 in decimal. It goes on forever. So to retrieve the values from the 32-bit single precision floating point value we must first convert the exponent and mantissa to decimal numbers so we can use them.
sign = 0 = a positive number
exponent: 01111111 = 127
mantissa: 00011001100110011001100 = 838860
With the mantissa we need to convert it to a decimal value. There is an implied leading 1 ahead of the binary number (i.e. 1.00011001100110011001100), because the mantissa represents a normalized value to be used in scientific notation: 1.0001100110011... * 2^(x-127).
To get the decimal value out of 838860 we simply divide by 2^23, as there are 23 bits in the mantissa. This gives us 0.099999904632568359375. Adding the implied 1 to the mantissa gives us 1.099999904632568359375. The exponent is 127, but the formula calls for 2^(x-127).
So here is the math:
(1 + 0.099999904632568359375) * 2^(127-127)
= 1.099999904632568359375 * 1 = 1.099999904632568359375
As you can see 1.1 is not really stored in the single floating point value as 1.1.
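A similar decomposition can be done in C by masking the bit fields; a sketch assuming a 32-bit IEEE-754 float. Note that a real compiler rounds the last bit instead of truncating, so it stores mantissa 838861 rather than the truncated 838860 used above:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    float f = 1.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* grab the object representation */

    uint32_t sign     = bits >> 31;          /* 1 sign bit       */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 exponent bits  */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 mantissa bits */

    printf("sign = %u, exponent = %u, mantissa = %u\n",
           sign, exponent, mantissa);        /* 0, 127, 838861 */

    /* Reconstruct the value: (1 + mantissa/2^23) * 2^(exponent-127) */
    double value = ldexp(1.0 + mantissa / 8388608.0, (int)exponent - 127);
    printf("value = %.23f\n", value);        /* 1.10000002384185791015625 */
    return 0;
}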
