Why the difference in output? - c

Using the C Code given below (written in Visual Studio):
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
float i = 2.0/3.0;
printf("%5.6f", i);
return 0;
}
produces the output:
0.666667
however when the %5.6f is changed to %5.20f the output changes to :
0.66666668653488159000
My question is why the subtle changes in output for the similar decimal?

When you use 32-bit float, the computer represents the result of 2./3. as 11,184,811 / 16,777,216, which is exactly 0.666666686534881591796875. In the floating-point format you are using, numbers are always represented as some integer multiplied by some power of two (which may be a negative power of two). Due to limits on how large the integer can be (when you use float, the integer must fit in 24 bits, not including the sign), the closest representable value to 2/3 is 11,184,811 / 16,777,216.
The reason that printf with “%5.6f” displays “0.666667” is that “%5.6f” requests just six digits after the decimal point, so the number is rounded at the sixth digit.
The reason that printf with %5.20f displays “0.66666668653488159000” is that your printf implementation “gives up” after 17 digits, figuring that is close enough in some sense. Some implementations of printf, which one might argue are better, print the represented value as closely as the requested format permits. In this case, they would display “0.66666668653488159180”, and, if you requested more digits, they would display the exact value, “0.666666686534881591796875”.
(The floating-point format is often presented as a sign, a fraction between 1 [inclusive] and 2 [exclusive], and an exponent, instead of a sign, an integer, and an exponent. Mathematically, they are the same with an adjustment in the exponent: Each number representable with a sign, a 24-bit unsigned integer, and an exponent is equal to some number with a sign, a fraction between 1 and 2, and an adjusted exponent. Using the integer version tends to make proofs easier and sometimes helps explanation.)

Unlike integers, which can be represented exactly in any base, relatively few decimal fractions have an exact representation in the base-2 fractional format.
This means that FP integers are exact (as long as they fit in the significand), and, generally, FP fractions are not.
So for two-digit fractions, say, 0.01 to 0.99, only 0.25, 0.50, and 0.75 (and 0) have exact representations. Normally it doesn't matter, as output gets rounded, and really, few if any physical constants are known to the precision available in the format.

This is because you may not have an exact representation of 0.6666666666666666...66667 in floating point.

The value is stored in exponential format, i.e. like (-/+)a x 2^n. A 32-bit float spends 1 bit on the sign, 8 bits on the exponent n, and the remaining 23 bits on a. That is only about 7 significant decimal digits, so the value cannot be stored accurately out to the 20th digit after the point. With float you will therefore never get the exact value of 2/3, whatever the compiler.

The float type has only 23 stored bits for the fraction part of the significand (about 7 significant decimal digits); 20 digits is too many.

Related

How can I know in advance which real numbers would have an imprecise representation using float variables in C?

I know that the number 159.95 cannot be precisely represented using float variables in C.
For example, considering the following piece of code:
#include <stdio.h>
int main()
{
float x = 159.95;
printf("%f\n",x);
return 0;
}
It outputs 159.949997.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
Best regards.
Succinctly, for the format most commonly used for float, a number is exactly representable if and only if it is representable as an integer F times a power of two, 2^E, such that:
the magnitude of F is less than 2^24, and
−149 ≤ E < 105.
More generally, C 2018 5.2.4.2.2 specifies the characteristics of floating-point types. A floating-point number is represented as s•b^e•sum(f_k b^(−k), 1 ≤ k ≤ p), where:
s is a sign, +1 or −1,
b is a fixed base chosen by the C implementation, often 2,
e is an exponent, which is an integer between a minimum e_min and a maximum e_max, chosen by the C implementation,
p is the precision, the number of base-b digits in the significand, and
f_k are digits in base b, nonnegative integers less than b.
The significand is the fraction portion of the representation, sum(f_k b^(−k), 1 ≤ k ≤ p). It is written as a sum so that we can express the variable number of digits it may have. (p is a variable set by the C implementation, not by the programmer using the C implementation.) When we write out a significand in base b, it can be a numeral, such as .001110101001100101010110₂ for a 24-bit significand in base 2. Note that, in this form (and the sum), the significand has all its digits after the radix point.
To make it slightly easier to tell if a number is in this format, we can adjust the scale so the significand is an integer instead of having digits after the radix point: s•b^(e−p)•sum(f_k b^(p−k), 1 ≤ k ≤ p). This changes the above significand from .001110101001100101010110₂ to 001110101001100101010110₂. Since it has p digits, it is always a non-negative integer less than b^p.
Now we can figure out if a finite number is representable in this format:
Get b, p, e_min, and e_max for the target C implementation. If it uses IEEE-754 binary32 for float, then b is 2, p is 24, e_min is −125, and e_max is 128. When <float.h> is included, these are defined as FLT_RADIX, FLT_MANT_DIG, FLT_MIN_EXP, and FLT_MAX_EXP.
Ignore the sign. Write the absolute value of the number as a rational number n/d in simplest form. If it is an integer, let d be 1.
If d is not a power of b, the number is not representable in the format.
If n is a multiple of b greater than or equal to b^p, divide it by b and divide d by b until n is not a multiple of b or is less than b^p. (After this step d may be a negative power of b, such as 1/b.)
If n is greater than or equal to b^p, the number is not representable in the format.
Let e be such that 1/d = b^(e−p). If e_min ≤ e ≤ e_max, the number is representable in the format. Otherwise, it is not.
Some floating-point formats might not support subnormal numbers, in which f_1 is zero. This is indicated by FLT_HAS_SUBNORM being defined to be zero and would require modifications to the above.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In general, floating point numbers can only represent numbers whose denominator is a power of 2.
To check if a number can be represented as a floating-point value (of any floating-point type) at all, take the decimal digits after the decimal point, interpret them as a number, and check if that number is divisible by 5^n, where n is the number of digits:
159.95 => 95, 2 digits => 95%(5*5) = 20 => Cannot be represented as floating-point value
Counterexample:
159.625 => 625, 3 digits => 625%(5*5*5) = 0 => Can be represented as floating-point value
You also have to consider the fact that floating-point values only have a limited number of significant bits:
In principle, 123456789 can be represented by a floating-point value exactly (it is an integer), however float does not have enough bits!
To check if an integer value can be represented by float exactly, divide the number by 2 until the result is odd. If the result is < 2^24, the number can be represented by float exactly.
In the case of a rational number, first do the "divisible by 5^n" check described above. Then multiply the number by 2 until the result is an integer. Check if it is < 2^24.
I would like to know if there is some way to know in advance which real value... would be represented in an imprecise way
The short and only partly facetious answer is... all of them!
There are roughly 2^32 = 4294967296 values of type float. And there are an uncountably infinite number of real numbers. So, for a randomly-chosen real number, the chance that it can be exactly represented as a value of type float is 4294967296/∞, which is 0.
If you use type double, there are approximately 2^64 = 18446744073709551616 of those, so the chance that a randomly-chosen real number can be exactly represented as a double is 18446744073709551616/∞, which is again... 0.
I realize I'm not answering quite the question you asked, but in general, it's usually a bad idea to use binary floating-point types as if they were an exact representation of decimal fractions. Attempts to assume that they're ever an exact representation usually lead to trouble. In general, it's best to assume that floating-point types are an imperfect (approximate) realization of real numbers, period (that is, without assuming decimal). If you never assume they're exact (which for true real numbers, they virtually never are), you'll never get into trouble in cases where you thought they'd be exact, but they weren't.
[Footnote 1: As Eric P. reminds in a comment, there's no such thing as a "randomly-chosen real number", which is why this is a partially facetious answer.]
[Footnote 2: I now see your comment where you say that you do assume they are all imprecise, but that you would "like to understand the phenomenon in a deeper way", in which case my answer does you no good, but hopefully some of the others do. I can especially commend Martin Rosenau's answer, which goes straight to the heart of the matter: a rational number is representable exactly in base 2 if and only if its reduced denominator is a pure power of 2, or stated another way, has only 2's in its prime factorization. That's why, if you take any number you can actually store in a float or double, and print it back out using %f and enough digits, with a properly-written printf, you'll notice that the numbers always end in things like ...625 or ...375. Binary fractions are like the English rulers still used in the U.S.: everything is halves and quarters and eighths and sixteenths and thirty-seconds and sixty-fourths.]
Usually, a float is an IEEE754 binary32 float (this is not guaranteed by spec and may be different on some compilers/systems). This data type specifies a 24-bit significand; this means that if you write the number in binary, it should require no more than 24 bits excluding trailing zeros.
159.95's binary representation is 10011111.11110011001100110011... with repeating 0011 forever, so it requires an infinite number of bits to represent precisely with a binary format.
Other examples:
1073741760 has a binary representation of 111111111111111111111111000000. It has 30 bits in that representation, but only 24 significant bits (since the remainder are trailing zero bits). It has an exact float representation.
1073741761 has a binary representation of 111111111111111111111111000001. It has 30 significant bits and cannot be represented exactly as a float.
0.000000059604644775390625 has a binary representation of 0.000000000000000000000001. It has one significant bit and can be represented exactly.
0.750000059604644775390625 has a binary representation of 0.110000000000000000000001, which is 24 significant bits. It can be represented exactly as a float.
1.000000059604644775390625 has a binary representation of 1.000000000000000000000001, which is 25 significant bits. It cannot be represented exactly as a float.
Another factor (which applies to very large and very small numbers) is that the exponent is limited to the -126 to +127 range. With some handwaving around denormal values and other special cases, this generally allows values ranging from roughly 2^-126 to slightly under 2^128.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In another answer I semiseriously answered "all of them",
but let's look at it another way. Specifically, let's look at
which numbers can be exactly represented.
The key fact to remember is that floating point formats use binary.
(The major, popular formats, anyway.) So the numbers that can be
represented exactly are the ones with exact binary representations.
Here is a table of a few of the single-precision float values
that can be represented exactly, specifically the seven
contiguous values near 1.0.
I'm going to show them as hexadecimal fractions, binary
fractions, and decimal fractions.
(That is, along each horizontal row, all three values are exactly
the same, just represented in different bases. But note that the
fractional hexadecimal and binary representations I'm using here
are not directly acceptable in C.)
hexadecimal   binary                        decimal                      delta
0x0.fffffd    0b0.111111111111111111111101  0.999999821186065673828125   5.96e-08
0x0.fffffe    0b0.111111111111111111111110  0.999999880790710449218750   5.96e-08
0x0.ffffff    0b0.111111111111111111111111  0.999999940395355224609375   5.96e-08
0x1.000000    0b1.00000000000000000000000   1.000000000000000000000000
0x1.000002    0b1.00000000000000000000001   1.000000119209289550781250   1.19e-07
0x1.000004    0b1.00000000000000000000010   1.000000238418579101562500   1.19e-07
0x1.000006    0b1.00000000000000000000011   1.000000357627868652343750   1.19e-07
There are several things to notice about this table:
The decimal numbers look pretty weird.
The hexadecimal and binary numbers look pretty normal, and show pretty clearly that single-precision floating point has 24 bits of precision.
If you look at the decimal column, the precision seems to be about equivalent to 7 decimal digits.
It's clearly not exactly 7 digits, though.
The difference between consecutive values less than 1.0 is about 0.00000005, and greater than 1.0 is twice that, about 0.00000010. (More on this later.)
Here is a similar table for type double.
(I'm showing fewer columns because there's not enough room
horizontally for everything.)
hexadecimal         decimal                                                  delta
0x0.ffffffffffffe8  0.99999999999999966693309261245303787291049957275390625  1.11e-16
0x0.fffffffffffff0  0.99999999999999977795539507496869191527366638183593750  1.11e-16
0x0.fffffffffffff8  0.99999999999999988897769753748434595763683319091796875  1.11e-16
0x1.0000000000000   1.0000000000000000000000000000000000000000000000000000
0x1.0000000000001   1.0000000000000002220446049250313080847263336181640625   2.22e-16
0x1.0000000000002   1.0000000000000004440892098500626161694526672363281250   2.22e-16
0x1.0000000000003   1.0000000000000006661338147750939242541790008544921875   2.22e-16
You can see right away that type double has much better precision:
53 bits, or about 15 decimal digits' worth instead of 7, and with a much
finer spacing between "adjacent" numbers.
What does it mean for these numbers to be "contiguous" or
"adjacent"? Aren't real numbers continuous? Yes, true real
numbers are continuous, but we're not looking at true real
numbers: we're looking at finite-precision floating point, and we
are, literally, seeing the finite limit of the precision here.
In type float, there simply is no value — no representable
value, that is — between 1.00000000 and 1.00000012.
In type double, there is no value between 1.00000000000000000
and 1.00000000000000022.
So let's go back to your question, asking whether there's "some way
to know which decimal values are represented in a precise or imprecise way."
If you look at the one-digit decimal fractions between 1 and 2:
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
the answer is, only one of them is exactly representable in binary: 1.5.
If you break the interval down into 100 fractions, like this:
1.01
1.02
1.03
1.04
1.05
…
1.95
1.96
1.97
1.98
1.99
it turns out there are three fractions you can represent exactly:
.25, .50, and .75, corresponding to
¼, ½, and ¾.
If we looked at three-digit decimal fractions, there are at most
seven of them we can represent: .125, .250, .375, .500, .625, .750, and .875. These correspond to eighths, that is, ordinary
fractions with 8 in the denominator.
I said "at most seven" because it's not true (none of these
estimates are true) for all ranges of numbers. Remember,
precision is finite, and digits to the left of the decimal part
— that is, in the integral part of your numbers — count against
your precision budget, too. So it turns out that if you were to
look at the range, say, 4000000–4000001, and tried to subdivide
it, you would find that you could represent 4000000.25 and
4000000.50 as type float, but not 4000000.125 or 4000000.375.
You can't really see it if you look at the decimal
representation, but what's happening inside is that type float
has exactly 24 binary bits of available precision, and the
integer part 4000000 uses up 22 of those bits, so you've only got
two bits left over for the fractional part, and with two bits you
can do halves and quarters, but not eighths.
You're probably noticing a pattern by now: the fractions we've
looked at so far that can be represented exactly in binary
involve halves, quarters, and eighths, and if we looked further,
this pattern would continue: sixteenths, thirty-seconds,
sixty-fourths, etc. And this should come as no real surprise:
just as in decimal the "exact" fractions involve tenths,
hundredths, thousandths, etc.; when we move to binary (base 2) the fractions
all involve powers of two. ½ in binary is 0b0.1.
¼ and ¾ are 0b0.01 and 0b0.11.
⅜ and ⅝ are 0b0.011 and 0b0.101.
What about a fraction like 1/3? You can't represent it exactly
in binary, but since you can't represent it in decimal, either,
this doesn't tend to bother us too much. In decimal it's the
infinitely repeating fraction 0.333333…, and in binary it's the
infinitely-repeating fraction 0b0.0101010101….
But then we come to the humble fraction 1/10, or one tenth.
This obviously can be represented as a decimal fraction — 0.1 —
but it turns out that it cannot be represented exactly in binary.
In binary it's the infinitely-repeating fraction 0b0.0001100110011….
And this is why, as we saw above, you can't represent most of the other
"single digit" decimal fractions 0.2, 0.3, 0.4, …, either
(with the notable exception of 0.5), and you can't represent most
of the double-digit decimal fractions 0.01, 0.02, 0.03, …,
or most of the triple-digit decimal fractions, etc.
So returning once more to your question of which decimal
fractions can be represented exactly, we can say:
For single-digit fractions 0.1, 0.2, 0.3, …, we can exactly represent .5, and to be charitable we can say that we can also represent .0, so that's two out of ten, or 20%.
For double-digit fractions 0.01, 0.02, 0.03, …, we can exactly represent .00, 0.25, 0.50, and 0.75, so that's four out of a hundred, or 4%.
For three-digit fractions 0.001, 0.002, 0.003, …, we can exactly represent the eight fractions involving eighths, so that's 8/1000 = 0.8%.
So while there are some decimal fractions we can represent
exactly, there aren't very many, and the percentage seems to be
going down as we add more digits. :-(
The fact — and depending on your point of view it's either an
unfortunate fact or a sad fact or a perfectly normal fact —
is that most decimal fractions can not be represented exactly
in binary and so can not be represented exactly using computer
floating point.
The numbers that can be represented exactly using computer
floating point, although they can all be exactly converted into
numerically equivalent decimal fractions, end up converting to
rather weird-looking numbers for the most part, with lots of digits, as we saw above.
(In fact, for type float, which internally has 24 bits of
significance, the exact decimal conversions of values near 1.0 end up having up to
24 decimal digits. And the fractions always end in 5.)
One last point concerns the spacing between these "contiguous",
exactly-representable binary fractions. In the examples I've
shown, why is there tighter spacing for numbers less than 1.0
than for numbers greater than 1.0?
The answer lies in an earlier statement that "precision is
finite, and digits to the left of the decimal part count against
your precision budget, too". Switching to decimal fractions for
a moment, if I told you you had exactly 7 significant decimal
digits to work with, you could represent
1234.567
1234.568
1234.569
and
12345.67
12345.68
12345.69
but you could not represent
12345.678
because that would require 8 significant digits.
Stated another way, for numbers between 1000 and 10000 you can
have three more digits after the decimal point, but for numbers
from 10000 to 100000 you can only have two. Mathematicians call
these intervals like 1000-10000 and 10000-100000 decades,
and within each decade, all the numbers have the same number of
fractional digits for a given precision, and the same exponents:
1.000000×10^3 – 9.999999×10^3,
1.000000×10^4 – 9.999999×10^4, etc.
(This usage is rather different than ordinary usage, in which the
word "decade" refers to a period of 10 years.)
But for binary floating point, once again, the intervals of
interest involve powers of 2, not 10. (In binary, some computer
scientists call these intervals binades, by analogy with "decades".)
The interesting intervals are from 1 to 2, 2–4, 4–8, 8–16, etc.
For numbers between 1 and 2, you've got 1 bit to the left of the
decimal point (really the "binary point"), so in single precision
you've got 23 bits left over to use for the fractional part to the right.
But for numbers between 2 and 4, you've got 2 bits to the left,
so you've only got 22 bits to use for the fraction.
This works in the other direction, too: for numbers between
½ and 1, you don't need any bits to the left of the binary
point, so you can use all 24 for the fraction to the right.
(Below ½ it gets even more interesting). So that's why we
saw twice the precision (numbers half the size in the "delta"
column) for numbers just below 1.0 than for numbers just above.
We'd see similar shifts in available precision when crossing all the other
powers of two: 2.0, 4.0, 8.0, …, and also ½, ¼,
⅛, etc.
This has been a rather long answer, longer than I had intended.
Thanks for reading.
Hopefully now you have a better appreciation for which numbers can be
exactly represented in binary floating point, and why most of them can't.

Why does C print float values after the decimal point different from the input value? [duplicate]

Why does C print float values after the decimal point different from the input value?
Following is the code.
CODE:
#include <stdio.h>
#include<math.h>
void main()
{
float num=2118850.132000;
printf("num:%f",num);
}
OUTPUT:
num:2118850.250000
This should have printed 2118850.132000, But instead it is changing the digits after the decimal to .250000. Why is it happening so?
Also, what can one do to avoid this?
Please guide me.
Your computer uses binary floating point internally. Type float has 24 bits of precision, which translates to approximately 7 decimal digits of precision.
Your number, 2118850.132, has 10 decimal digits of precision. So right away we can see that it probably won't be possible to represent this number exactly as a float.
Furthermore, due to the properties of binary numbers, no decimal fraction that ends in 1, 2, 3, 4, 6, 7, 8, or 9 (that is, numbers like 0.1 or 0.2 or 0.132) can be exactly represented in binary. So those numbers are always going to experience some conversion or roundoff error.
When you enter the number 2118850.132 as a float, it is converted internally into the binary fraction 1000000101010011000010.01. That's equivalent to the decimal fraction 2118850.25. So that's why the .132 seems to get converted to 0.25.
As I mentioned, float has only 24 bits of precision. You'll notice that 1000000101010011000010.01 is exactly 24 bits long. So we can't, for example, get closer to your original number by using something like 1000000101010011000010.001, which would be equivalent to 2118850.125, which would be closer to your 2118850.132. No, the next lower 24-bit fraction is 1000000101010011000010.00 which is equivalent to 2118850.00, and the next higher one is 1000000101010011000010.10 which is equivalent to 2118850.50, and both of those are farther away from your 2118850.132. So 2118850.25 is as close as you can get with a float.
If you used type double you could get closer. Type double has 53 bits of precision, which translates to approximately 16 decimal digits. But you still have the problem that .132 ends in 2 and so can never be exactly represented in binary. As type double, your number would be represented internally as the binary number 1000000101010011000010.0010000111001010110000001000010 (note 53 bits), which is equivalent to 2118850.132000000216066837310791015625, which is much closer to your 2118850.132, but is still not exact. (Also notice that 2118850.132000000216066837310791015625 begins to diverge from your 2118850.1320000000 after 16 digits.)
So how do you avoid this? At one level, you can't. It's a fundamental limitation of finite-precision floating-point numbers that they cannot represent all real numbers with perfect accuracy. Also, the fact that computers typically use binary floating-point internally means that they can almost never represent "exact-looking" decimal fractions like .132 exactly.
There are two things you can do:
If you need more than about 7 digits worth of precision, definitely use type double, don't try to use type float.
If you believe your data is accurate to three places past the decimal, print it out using %.3f. If you take 2118850.132 as a double, and printf it using %.3f, you'll get 2118850.132, like you want. (But if you printed it with %.12f, you'd get the misleading 2118850.132000000216.)
This will print the expected digits if you use double instead of float (the stored value is still not exact, but %f rounds it to six decimal places):
#include <stdio.h>
int main(void)
{
double num=2118850.132000;
printf("num:%f\n",num);
return 0;
}

Understanding casts from integer to float

Could someone explain this weird looking output on a 32 bit machine?
#include <stdio.h>
int main() {
printf("16777217 as float is %.1f\n",(float)16777217);
printf("16777219 as float is %.1f\n",(float)16777219);
return 0;
}
Output
16777217 as float is 16777216.0
16777219 as float is 16777220.0
The weird thing is that 16777217 casts to a lower value and 16777219 casts to a higher value...
In the IEEE-754 basic 32-bit binary floating-point format, all integers from −16,777,216 to +16,777,216 are representable. From 16,777,216 to 33,554,432, only even integers are representable. Then, from 33,554,432 to 67,108,864, only multiples of four are representable. (Since the question does not necessitate discussion of which numbers are representable, I will omit explanation and just take this for granted.)
The most common default rounding mode is to round the exact mathematical result to the nearest representable value and, in case of a tie, to round to the representable value which has zero in the low bit of its significand.
16,777,217 is equidistant between the two representable values 16,777,216 and 16,777,218. These values are represented as 100000000000000000000000₂•2^1 and 100000000000000000000001₂•2^1. The former has 0 in the low bit of its significand, so it is chosen as the result.
16,777,219 is equidistant between the two representable values 16,777,218 and 16,777,220. These values are represented as 100000000000000000000001₂•2^1 and 100000000000000000000010₂•2^1. The latter has 0 in the low bit of its significand, so it is chosen as the result.
You may have heard of the concept of "precision", as in "this fractional representation has 3 digits of precision".
This is very easy to think about in a fixed-point representation. If I have, say, three digits of precision past the decimal, then I can exactly represent 1/2 = 0.5, and I can exactly represent 1/4 = 0.25, and I can exactly represent 1/8 = 0.125, but if I try to represent 1/16, I can not get 0.0625; I will either have to settle for 0.062 or 0.063.
But that's for fixed-point. The computer you're using uses floating-point, which is a lot like scientific notation. You get a certain number of significant digits total, not just digits to the right of the decimal point. For example, if you have 3 decimal digits worth of precision in a floating-point format, you can represent 0.123 but not 0.1234, and you can represent 0.0123 and 0.00123, but not 0.01234 or 0.001234. And if you have digits to the left of the decimal point, those take away from the number you can use to the right of the decimal point. You can use 1.23 but not 1.234, and 12.3 but not 12.34, and 123.0 but not 123.4 or 123.anythingelse.
And -- you can probably see the pattern by now -- if you're using a floating-point format with only three significant digits, you can't represent all numbers greater than 999 perfectly accurately at all, even though they don't have a fractional part. You can represent 1230 but not 1234, and 12300 but not 12340.
So that's decimal floating-point formats. Your computer, on the other hand, uses a binary floating-point format, which ends up being somewhat trickier to think about. We don't have an exact number of decimal digits' worth of precision, and the numbers that can't be exactly represented don't end up being nice even multiples of 10 or 100.
In particular, type float on most machines has 24 binary bits worth of precision, which works out to 6-7 decimal digits' worth of precision. That's obviously not enough for numbers like 16777217.
So where did the numbers 16777216 and 16777220 come from? As Eric Postpischil has already explained, it ends up being because they're multiples of 2. If we look at the binary representations of nearby numbers, the pattern becomes clear:
16777208 111111111111111111111000
16777209 111111111111111111111001
16777210 111111111111111111111010
16777211 111111111111111111111011
16777212 111111111111111111111100
16777213 111111111111111111111101
16777214 111111111111111111111110
16777215 111111111111111111111111
16777216 1000000000000000000000000
16777218 1000000000000000000000010
16777220 1000000000000000000000100
16777215 is the biggest number that can be represented exactly in 24 bits. After that, you can represent only even numbers, because the low-order bit is the 25th, and essentially has to be 0.
Type float cannot hold that much significance. The significand can only hold 24 bits. Of those 23 are stored and the 24th is 1 and not stored, because the significand is normalised.
Please read this which says "Integers in [ − 16777216 , 16777216 ] can be exactly represented", but yours are out of that range.
Floating representation follows a method similar to what we use in everyday life and we call exponential representation. This is a number using a number of digits that we decide will suffice to realistically represent the value, we call it mantissa, or significant, that we will multiply to a base, or radix, value elevated to a power that we call exponent. In plain words:
num*base^exp
We generally use 10 as the base, because we have 10 fingers on our hands, so we are used to numbers like 1e2, which is 100=1*10^2.
Of course we would not bother with exponential notation for such small numbers, but we prefer it when working with very large numbers, or, better, when the number already has as many digits as we consider enough to represent the quantity we are measuring.
The right number of digits could be how many we can handle mentally, or how many an engineering application requires. Once we have decided how many digits we need, we no longer care how closely the numeric representation tracks the real value. E.g. for a number like 123456.789e5 it is understood that when adding 99 units we can tolerate the rounded representation and consider it acceptable anyway; if not, we should change the representation and use one with an appropriate number of digits, as in 12345678900.
On a computer, when you have to handle very large numbers that do not fit in a standard integer, or when you have to represent a real number (with a fractional part), the right choice is a single- or double-precision floating-point representation. It uses the same layout we discussed above, but the base is 2 instead of 10. This is because a computer has only 2 fingers, the states 0 and 1. So the formula we used before, to represent 100, becomes:
1100100*2^0
That still isn't the real floating-point representation, but it gives the idea. Now consider that in a computer the floating-point format is standardized. A standard float, as per IEEE-754, uses the following memory layout: 23 bits for the mantissa (we will see later why 1 more bit is assumed), 1 bit for the sign, and 8 bits for the exponent, stored with a bias of 127 (that simply means that it ranges between -126 and +127 without the need for a sign bit, with the values 0x00 and 0xff reserved for special meanings).
Now consider using 0 as the exponent: 2^exponent=2^0=1, so the value multiplied by the mantissa behaves like a 23-bit integer. This implies that when incrementing a counter, as in:
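The layout described above can be inspected directly. The following is a minimal sketch, assuming that float is an IEEE-754 32-bit value on your platform; the names float_fields and split_float are hypothetical helpers introduced only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit IEEE-754 float. */
struct float_fields {
    unsigned sign;      /* 1 bit */
    unsigned exponent;  /* 8 bits, still including the bias of 127 */
    uint32_t mantissa;  /* 23 stored bits, without the implicit leading 1 */
};

/* Split a float into its raw fields. Assumes float is IEEE-754 binary32. */
struct float_fields split_float(float value)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &value, sizeof bits);  /* well-defined way to read the bits */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, split_float(100.0f) yields sign 0, exponent 133 (that is, 2^6 once the bias of 127 is removed) and mantissa 0x480000, which together encode 1.5625*2^6.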
float f = 0;
while(1)
{
f +=1;
printf ("%f\n", f);
}
You will see the printed value increase linearly by one until the mantissa bits are saturated at 2^24=16777216. From there on the exponent has grown, consecutive floats are 2 apart, and adding 1 no longer changes the value.
If the base, or radix, of our floating-point number were 10, we would see an increase every 10 loops for the first 100 (10^2) values, then an increase of 100 for the next 1000 (10^3) values, and so on. You see that this corresponds to the truncation we have to make due to the limited number of available digits.
The same phenomenon is observed with the binary base, only the changes happen at power-of-2 intervals.
What we discussed up to now is called the denormalized form of a floating-point number; what is normally used is its counterpart, the normalized form. The latter simply means that there is a 24th bit, not stored, that is always 1. In plain words, we don't use an exponent of 0 for numbers less than 2^24; instead we shift the number left (multiply by 2) until its most significant 1 bit reaches the 24th bit, then the exponent is adjusted to the negative value that shifts the number back to its original value on conversion.
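To find where the counting loop stops making progress without reading millions of printed lines, the loop can be wrapped in a function that returns as soon as an increment has no effect. This is only a sketch: first_stuck_count is a hypothetical name, and the result assumes IEEE-754 32-bit float with the default round-to-nearest mode.

```c
/* Count upward in steps of 1 and return the first value that an
 * increment of 1 can no longer change. */
float first_stuck_count(void)
{
    float f = 0.0f;

    for (;;) {
        /* volatile forces the sum to be rounded to float precision */
        volatile float next = f + 1.0f;

        if (next == f)
            return f;
        f = next;
    }
}
```

Under those assumptions the function returns 16777216, which is 2^24: the sum 16777217 is not representable and rounds back to 16777216.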
Remember the reserved values of the exponent we mentioned above? An exponent==0x00 means that we have a denormalized number (or zero, if mantissa==0). An exponent==0xff indicates +/-infinity if mantissa==0, or a NaN (not-a-number) otherwise.
It should be clear now that when the number we express needs more than the 24 bits of the significand (mantissa), we should expect an approximation of the real value, with an error that depends on how far beyond 2^24 we are.
Now the numbers you are using are just on the edge of 2^24=16,777,216:
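The reserved exponent values can be checked with a few bit tests. A minimal sketch, assuming IEEE-754 32-bit float; the enum float_kind and the functions float_from_bits and classify_float are hypothetical names used only here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the categories selected by the exponent field. */
enum float_kind { FK_ZERO, FK_DENORMAL, FK_NORMAL, FK_INFINITY, FK_NAN };

/* Build a float from a raw 32-bit pattern (assumes IEEE-754 binary32). */
float float_from_bits(uint32_t bits)
{
    float value;

    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Classify a float by looking only at its exponent and mantissa fields. */
enum float_kind classify_float(float value)
{
    uint32_t bits, exponent, mantissa;

    memcpy(&bits, &value, sizeof bits);
    exponent = (bits >> 23) & 0xFFu;
    mantissa = bits & 0x7FFFFFu;

    if (exponent == 0x00u)
        return mantissa == 0 ? FK_ZERO : FK_DENORMAL;
    if (exponent == 0xFFu)
        return mantissa == 0 ? FK_INFINITY : FK_NAN;
    return FK_NORMAL;
}
```

In real code the standard fpclassify macro from <math.h> reports the same five categories; the hand-written version is shown only to make the role of the two reserved exponent values visible.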
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| = 16,277,215
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Now increasing by 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| = 16,277,216
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Note that the exponent has increased by 1: from now on we are beyond the 24-bit range, and consecutive representable values are 2^1=2 apart. We can only advance by 2, that is, represent only even numbers (multiples of 2^1=2). E.g. setting the least significant bit to 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| = 16,277,218
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Increasing again:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0| = 16,277,220
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
As you can see, we cannot exactly represent 16,777,219. In your code:
// This prints 16777216: 16777217 lies halfway between the
// representable neighbours 16777216 and 16777218, and under the
// default rounding mode the tie goes to the even significand
printf("16777217 as float is %.1f\n",(float)16777217);
// This prints 16777220: 16777219 lies halfway between 16777218
// and 16777220, and the tie again goes to the even significand
printf("16777219 as float is %.1f\n",(float)16777219);
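The step size on each side of 2^24 can also be measured rather than read off the bit diagrams. The sketch below assumes IEEE-754 32-bit float and the default round-to-nearest mode; gap_to_next_float and stored_as_float are hypothetical helpers, and the first one is only meant for positive finite values below FLT_MAX.

```c
#include <stdint.h>
#include <string.h>

/* Gap between a positive finite float and the next larger float,
 * found by adding 1 to the raw bit pattern (assumes IEEE-754 binary32). */
float gap_to_next_float(float value)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &value, sizeof bits);
    bits += 1;
    memcpy(&next, &bits, sizeof next);
    return next - value;
}

/* Convert an integer to float and back to see which value is stored. */
long stored_as_float(long n)
{
    volatile float f = (float)n;  /* volatile forces rounding to float */

    return (long)f;
}
```

With these, gap_to_next_float(16777215.0f) is 1 while gap_to_next_float(16777216.0f) is 2, and stored_as_float maps 16777217 to 16777216 and 16777219 to 16777220, matching the two printf calls above.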
As said above, the numeric format must be chosen to suit the usage. A floating-point number is only an approximate representation of a real number, and it is definitely our duty to choose the right type carefully.
If we need more precision we could use a double, or an integer type such as long long int.
Just for the sake of completeness, I would add a few words on the approximate representation of fractional numbers that are not finite sums of powers of 2. Such numbers can never be stored exactly in float format, so the stored value is always an approximation, and it must be rounded to the intended value when converting back to decimal.
For more details see:
https://en.wikipedia.org/wiki/IEEE_754
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Online demo applets:
https://babbage.cs.qc.cuny.edu/IEEE-754/
https://evanw.github.io/float-toy/
https://www.h-schmidt.net/FloatConverter/IEEE754.html

How can I round a float to a given decimal precision

I need to convert a floating-point number with system precision to one with a specified precision (e.g. 3 decimal places) for the printed output. The fprintf function will not suffice for this as it will not correctly round some numbers. All the other solutions I've tried fail in that they all reintroduce undesired precision when I convert back to a float. For example:
float xf_round1_f(float input, int prec) {
printf("%f\t",input);
int trunc = round(input * pow(10, prec));
printf("%f\t",(float)trunc);
input=(float)trunc / pow(10, prec);
printf("%f\n",input);
return (input);
}
This function prints the input, the truncated integer and the output to each line, and the result looks like this for some numbers supposed to be truncated to 3 decimal places:
49.975002 49975.000000 49.974998
49.980000 49980.000000 49.980000
49.985001 49985.000000 49.985001
49.990002 49990.000000 49.990002
49.995003 49995.000000 49.994999
50.000000 50000.000000 50.000000
You can see that the second step works as intended - even when "trunc" is cast to float for printing - but as soon as I convert it back to a float the precision returns. The 1st and 5th rows illustrate problem cases.
Surely there must be a way of resolving this - even if the 1st row result remained 49.975002 a formatted print would give the desired effect, but in this case there is a real problem.
Any solutions?
Binary floating-point cannot represent most decimal numerals exactly. Each binary floating-point number is formed by multiplying an integer by a power of two. For the common implementation of float, IEEE-754 32-bit binary floating-point, that integer must be in (-2^24, 2^24). There is no integer x and integer y such that x•2^y exactly equals 49.975. Therefore, when you divide 49975 by 1000, the result must be an approximation.
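To see which integer and which power of two a given float actually holds, the value can be taken apart with the standard frexpf and ldexpf functions. This is a sketch under the assumption that float is IEEE-754 32-bit; float_as_scaled_integer is a hypothetical name.

```c
#include <math.h>
#include <stdint.h>

/* Express a float as integer * 2^(*exp2), with |integer| < 2^24.
 * Assumes float is IEEE-754 binary32. */
int32_t float_as_scaled_integer(float value, int *exp2)
{
    int e;
    float m = frexpf(value, &e);    /* value = m * 2^e, 0.5 <= |m| < 1 */

    *exp2 = e - 24;
    return (int32_t)ldexpf(m, 24);  /* exact: m has at most 24 bits */
}
```

For 49.975f this gives the integer 13100646 with exponent -18, i.e. 13100646/262144, which is about 49.9749985 and is what the third column of the first output row shows as 49.974998.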
If you merely need to format a number for output, you can do this with the usual fprintf format specifiers. If you need to compute exactly with such numbers, you may be able to do it by scaling them to representable values and doing the arithmetic either in floating-point or in integer arithmetic, depending on your needs.
Edit: it appears you may only care about the printed results. printf is generally smart enough to do proper rounding to the number of digits you specify. If you give a format of "%.3f" you will probably get what you need.
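If only the printed text matters, the rounding can be left to the formatting call. A minimal sketch using the standard snprintf with a "%.*f" format; format_fixed and prints_same are hypothetical helper names, and the 64-byte buffers are an arbitrary choice that is large enough for any float at small precisions.

```c
#include <stdio.h>
#include <string.h>

/* Format value with prec digits after the decimal point.
 * The rounding is done by snprintf. Returns the length written. */
int format_fixed(float value, int prec, char *buf, size_t size)
{
    return snprintf(buf, size, "%.*f", prec, value);
}

/* Return 1 if two floats print identically at the given precision. */
int prints_same(float a, float b, int prec)
{
    char text_a[64], text_b[64];

    format_fixed(a, prec, text_a, sizeof text_a);
    format_fixed(b, prec, text_b, sizeof text_b);
    return strcmp(text_a, text_b) == 0;
}
```

Both 49.974998f and 49.975002f from the question's first row format as 49.975 at three decimal places, so the difference between them never reaches the output.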
If your only problem is with the cases that are below the desired number, you can easily fix it by making everything higher than the desired number instead. Unfortunately this increases the absolute error of the answer; even a result that was exact before, such as 50.000 is now off.
Simply add this line to the end of the function:
input=nextafterf(input, input*1.0001);
See it in action at http://ideone.com/iHNTzs
49.975002 49975.000000 49.974998 49.975002
49.980000 49980.000000 49.980000 49.980003
49.985001 49985.000000 49.985001 49.985004
49.990002 49990.000000 49.990002 49.990005
49.995003 49995.000000 49.994999 49.995003
50.000000 50000.000000 50.000000 50.000004
If you require exact representation of all decimal fractions with three digits after the decimal point, you can work in thousandths. Use an integer data type to represent one thousand times the actual number for all intermediate results.
Use fixed-point numbers. That is where you keep the actual numbers in a wide integer format, for example long or long long, along with the number of decimal places. You will also need methods to scale the fixed-point number by the decimal places, and some way to convert to/from strings.
The reason you are having trouble is that 1/10 is not representable exactly as a sum of fractional powers of 2 (1/2, 1/4, 1/8, etc). This is the same reason that 1/3 is a repeating decimal in base 10 (0.33333...).
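A fixed-point scheme with three decimal places could look roughly like the following. It is only a sketch: the function names are hypothetical, the scale of 1000 is hard-coded, and overflow of long is not checked.

```c
#include <stdio.h>

/* Convert a value to a whole number of thousandths, rounding to nearest. */
long to_thousandths(double value)
{
    double scaled = value * 1000.0;

    return (long)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Write a count of thousandths as a decimal string such as "49.975".
 * Returns the length written, as snprintf does. */
int format_thousandths(long t, char *buf, size_t size)
{
    unsigned long magnitude = t < 0 ? 0UL - (unsigned long)t
                                    : (unsigned long)t;

    return snprintf(buf, size, "%s%lu.%03lu",
                    t < 0 ? "-" : "", magnitude / 1000, magnitude % 1000);
}
```

Once a value has been converted, all arithmetic is done on the integer count, so 49975 stays exactly 49975 and is only turned back into text for display.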
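The repeating pattern can be produced by doing the base-2 long division by hand. The sketch below uses only integer arithmetic; binary_fraction is a hypothetical helper that expects 0 <= num < den.

```c
/* Write the binary digits of num/den (with 0 <= num < den) that follow
 * the binary point into out, up to max_digits of them.
 * Returns 1 if the expansion terminates within max_digits, else 0.
 * out must have room for max_digits + 1 characters. */
int binary_fraction(unsigned num, unsigned den, char *out, int max_digits)
{
    unsigned long long rem = num;
    int i;

    for (i = 0; i < max_digits && rem != 0; i++) {
        rem *= 2;                 /* shift one binary place to the left */
        if (rem >= den) {
            out[i] = '1';
            rem -= den;
        } else {
            out[i] = '0';
        }
    }
    out[i] = '\0';
    return rem == 0;
}
```

For 3/4 the digits are "11" and the function reports that the expansion terminates. For 1/10 the first eight digits are "00011001" and the remainder never reaches zero, because the group 0011 repeats forever.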

My floating point number has extra digits when I print it

I define a floating point number as float transparency = 0.85f; And in the next line, I pass it to a function -- fcn_name(transparency) -- but it turns out that the variable transparency has value 0.850000002, and when I print it with the default setting, it is 0.850000002. For the value 0.65f, it is 0.649999998.
How can I avoid this issue? I know floating point is just an approximation, but if I define a float with just a few decimals, how can I make sure it is not changed?
Floating-point values represented in binary format do not have any specific decimal precision. Just because you read in some spec that the number can represent some fixed amount of decimal digits, it doesn't really mean much. It is just a rough conversion of the physical (and meaningful) binary precision to its much less meaningful decimal approximation.
One property of binary floating-point format is that it can only represent precisely (within the limits of its mantissa width) the numbers that can be expressed as finite sums of powers of 2 (including negative powers of 2). Numbers like 0.5, 0.25, 0.75 (decimal) will be represented precisely in binary floating-point format, since these numbers are either powers of 2 (2^-1, 2^-2) or sums thereof.
Meanwhile, a number such as decimal 0.1 cannot be expressed as a finite sum of powers of 2. The representation of decimal 0.1 in floating-point binary has infinite length. This immediately means that 0.1 can never be represented precisely in a finite binary floating-point format. Note that 0.1 has only one decimal digit. However, this number is still not representable. This illustrates the fact that expressing floating-point precision in terms of decimal digits is not very useful.
Values like 0.85 and 0.65 from your example are also non-representable, which is why you see these values distorted after conversion to a finite binary floating-point format. Actually, you have to get used to the fact that most fractional decimal numbers you will encounter in everyday life will not be representable precisely in binary floating-point types, regardless of how large these floating-point types are.
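A quick way to tell the two kinds of values apart is to narrow a double to float and compare. This is only a sketch with hypothetical function names, and it assumes IEEE-754 float and double; note that a double literal such as 0.85 is itself only the nearest double, so the check compares float against double precision rather than against the true decimal value.

```c
/* Return 1 if converting the double to float and back loses nothing,
 * i.e. the value fits in float's 24-bit significand. */
int survives_float(double value)
{
    volatile float narrowed = (float)value;  /* volatile forces rounding */

    return (double)narrowed == value;
}

/* Difference between what a float stores and the intended double value. */
double float_error(double intended)
{
    volatile float stored = (float)intended;

    return (double)stored - intended;
}
```

Here survives_float returns 1 for 0.5, 0.25 and 0.75, and 0 for 0.1, 0.85 and 0.65. The sign of float_error agrees with the question: the stored 0.85f is slightly above 0.85 and the stored 0.65f is slightly below 0.65.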
The only way I can think of solving this problem is to pass characteristic and mantissa to the function separately and let IT work on setting the values appropriately.
Also if you want more precision,
http://www.drdobbs.com/cpp/fixed-point-arithmetic-types-for-c/184401992 is the article I know. Though this works for C++ only. (Searching for an equivalent C implementation).
I tried this on VS2010,
#include <stdio.h>
void printfloat(float f)
{
printf("%f",f);
}
int main(int argc, char *argv[])
{
float f = 0.24f;
printfloat(f);
return 0;
}
OUTPUT: 0.240000

Resources