My floating point number has extra digits when I print it - c

I define a floating point number as float transparency = 0.85f; and in the next line I pass it to a function -- fcn_name(transparency) -- but it turns out that the variable transparency has the value 0.850000002, which is also what is printed with the default settings. For the value 0.65f, it is 0.649999998.
How can I avoid this issue? I know floating point is just an approximation, but if I define a float with just a few decimals, how can I make sure it is not changed?

Floating-point values represented in binary format do not have any specific decimal precision. Just because some spec says the type can represent a fixed number of decimal digits doesn't really mean much. It is just a rough conversion of the physical (and meaningful) binary precision into its much less meaningful decimal approximation.
One property of binary floating-point format is that it can only represent precisely (within the limits of its mantissa width) the numbers that can be expressed as finite sums of powers of 2 (including negative powers of 2). Numbers like 0.5, 0.25, 0.75 (decimal) will be represented precisely in binary floating-point format, since these numbers are either powers of 2 (2^-1, 2^-2) or sums thereof.
Meanwhile, such number as decimal 0.1 cannot be expressed by a finite sum of powers of 2. The representation of decimal 0.1 in floating-point binary has infinite length. This immediately means that 0.1 cannot be ever represented precisely in finite binary floating-point format. Note that 0.1 has only one decimal digit. However, this number is still not representable. This illustrates the fact that expressing floating-point precision in terms of decimal digits is not very useful.
Values like 0.85 and 0.65 from your example are also non-representable, which is why you see these values distorted after conversion to a finite binary floating-point format. Actually, you have to get used to the fact that most fractional decimal numbers you will encounter in everyday life will not be representable precisely in binary floating-point types, regardless of how large these floating-point types are.
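To see this directly, print more digits than the default six; here is a minimal sketch (the exact digits shown assume an IEEE-754 binary32 float and may differ slightly on other systems):

#include <stdio.h>

int main(void)
{
    float transparency = 0.85f;
    /* %.9f asks for more digits than float actually carries,
       exposing the nearest representable value. */
    printf("%.9f\n", transparency);  /* e.g. 0.850000024 */
    printf("%.9f\n", 0.65f);         /* e.g. 0.649999976 */
    return 0;
}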

The only way I can think of to solve this problem is to pass the characteristic and mantissa to the function separately and let IT set the values appropriately.
Also, if you want more precision,
http://www.drdobbs.com/cpp/fixed-point-arithmetic-types-for-c/184401992 is the article I know of, though it works for C++ only. (I am still searching for an equivalent C implementation.)
I tried this on VS2010:
#include <stdio.h>

void printfloat(float f)
{
    printf("%f", f);
}

int main(int argc, char *argv[])
{
    float f = 0.24f;
    printfloat(f);
    return 0;
}
OUTPUT: 0.240000
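Extending the same program to request more digits makes the stored value visible; a sketch (output assumes IEEE-754 binary32):

#include <stdio.h>

void printfloat(float f)
{
    printf("%f\n", f);    /* rounded to six digits: 0.240000 */
    printf("%.9f\n", f);  /* e.g. 0.239999995 -- the nearest representable value */
}

int main(int argc, char *argv[])
{
    float f = 0.24f;
    printfloat(f);
    return 0;
}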

Related

How can I know in advance which real numbers would have an imprecise representation using float variables in C?

I know that the number 159.95 cannot be precisely represented using float variables in C.
For example, considering the following piece of code:
#include <stdio.h>

int main()
{
    float x = 159.95;
    printf("%f\n", x);
    return 0;
}
It outputs 159.949997.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
Succinctly, for the format most commonly used for float, a number is exactly representable if and only if it can be written as an integer F times a power of two, 2^E, such that:
the magnitude of F is less than 2^24, and
−149 ≤ E < 105.
More generally, C 2018 5.2.4.2.2 specifies the characteristics of floating-point types. A floating-point number is represented as s · b^e · sum(f_k · b^(−k), 1 ≤ k ≤ p), where:
s is a sign, +1 or −1,
b is a fixed base chosen by the C implementation, often 2,
e is an exponent, which is an integer between a minimum emin and a maximum emax, chosen by the C implementation,
p is the precision, the number of base-b digits in the significand, and
f_k are digits in base b, nonnegative integers less than b.
The significand is the fraction portion of the representation, sum(f_k · b^(−k), 1 ≤ k ≤ p). It is written as a sum so that we can express the variable number of digits it may have. (p is set by the C implementation, not by the programmer using the C implementation.) When we write out a significand in base b, it can be a numeral such as .001110101001100101010110 (base 2) for a 24-bit significand. Note that, in this form (and the sum), the significand has all its digits after the radix point.
To make it slightly easier to tell whether a number is in this format, we can adjust the scale so the significand is an integer instead of having digits after the radix point: s · b^(e−p) · sum(f_k · b^(p−k), 1 ≤ k ≤ p). This changes the above significand from .001110101001100101010110 to 001110101001100101010110. Since it has p digits, it is always a nonnegative integer less than b^p.
Now we can figure out if a finite number is representable in this format:
Get b, p, emin, and emax for the target C implementation. If it uses IEEE-754 binary32 for float, then b is 2, p is 24, emin is −125, and emax is 128. When <float.h> is included, these are defined as FLT_RADIX, FLT_MANT_DIG, FLT_MIN_EXP, and FLT_MAX_EXP.
Ignore the sign. Write the absolute value of the number as a rational number n/d in simplest form. If it is an integer, let d be 1.
If d is not a power of b, the number is not representable in the format.
If n is a multiple of b greater than or equal to b^p, divide it by b and divide d by b until n is not a multiple or is less than b^p. (When d starts at 1, it becomes a fraction 1/b^j here; that is fine for the final step.)
If n is greater than or equal to b^p, the number is not representable in the format.
Let e be such that 1/d = b^(e−p). If emin ≤ e ≤ emax, the number is representable in the format. Otherwise, it is not.
Some floating-point formats might not support subnormal numbers, in which f_1 is zero. This is indicated by FLT_HAS_SUBNORM being defined to be zero and would require modifications to the above.
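Here is a rough C sketch of the procedure above for b = 2 and IEEE-754 binary32; representable_binary32 is a made-up name, and it tracks the exponent e directly instead of adjusting d:

#include <stdio.h>
#include <stdint.h>

/* Is |x| = n/d (in lowest terms, d a positive integer) exactly
   representable as an IEEE-754 binary32 float? */
static int representable_binary32(uint64_t n, uint64_t d)
{
    const int p = 24, emin = -125, emax = 128;
    int e = p;  /* 1/d = 2^(e-p), so e = p - log2(d) */

    if (n == 0)
        return 1;
    if (d & (d - 1))  /* d must be a power of 2 */
        return 0;
    for (uint64_t t = d; t > 1; t /= 2)
        e--;
    /* Shift factors of 2 out of n while it is too wide for the significand. */
    while (n >= ((uint64_t)1 << p) && n % 2 == 0) {
        n /= 2;
        e++;  /* same as dividing d by 2 */
    }
    if (n >= ((uint64_t)1 << p))
        return 0;  /* needs more than p significand bits */
    return emin <= e && e <= emax;
}

int main(void)
{
    printf("%d\n", representable_binary32(3199, 20));       /* 159.95  -> 0 */
    printf("%d\n", representable_binary32(1277, 8));        /* 159.625 -> 1 */
    printf("%d\n", representable_binary32(1073741761, 1));  /* -> 0 */
    return 0;
}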
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In general, floating point numbers can only represent numbers whose denominator is a power of 2.
To check if a number can be represented as floating point value (of any floating-point type) at all, take the decimal digits after the decimal point, interpret them as number and check if they can be divided by 5^n while n is the number of digits:
159.95 => 95, 2 digits => 95%(5*5) = 20 => Cannot be represented as floating-point value
Counterexample:
159.625 => 625, 3 digits => 625%(5*5*5) = 0 => Can be represented as floating-point value
You also have to consider the fact that floating-point values only have a limited number of digits after the decimal point:
In principle, 123456789 can be represented by a floating-point value exactly (it is an integer), however float does not have enough bits!
To check if an integer value can be represented by float exactly, divide the number by 2 until the result is odd. If the result is < 2^24, the number can be represented by float exactly.
In the case of a rational number, first do the "divisible by 5^n" check described above. Then multiply the number by 2 until the result is an integer. Check if it is < 2^24.
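A quick sketch of the divisibility test in C; fraction_is_binary is a made-up helper that checks only the 5^n condition (not the significand-width limit), and it ignores overflow for very large n:

#include <stdio.h>
#include <stdint.h>

/* frac/10^n = frac/(2^n * 5^n): the denominator reduces to a pure
   power of two exactly when frac absorbs the factor 5^n. */
static int fraction_is_binary(uint64_t frac, int n)
{
    uint64_t pow5 = 1;
    for (int i = 0; i < n; i++)
        pow5 *= 5;
    return frac % pow5 == 0;
}

int main(void)
{
    printf("%d\n", fraction_is_binary(95, 2));   /* 159.95  -> 0, not representable */
    printf("%d\n", fraction_is_binary(625, 3));  /* 159.625 -> 1, representable     */
    return 0;
}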
I would like to know if there is some way to know in advance which real value... would be represented in an imprecise way
The short and only partly facetious answer is... all of them!
There are roughly 2^32 = 4294967296 values of type float. And there are an uncountably infinite number of real numbers. So, for a randomly-chosen real number, the chance that it can be exactly represented as a value of type float is 4294967296/∞, which is 0.
If you use type double, there are approximately 2^64 = 18446744073709551616 of those, so the chance that a randomly-chosen real number can be exactly represented as a double is 18446744073709551616/∞, which is again... 0.
I realize I'm not answering quite the question you asked, but in general, it's usually a bad idea to use binary floating-point types as if they were an exact representation of decimal fractions. Attempts to assume that they're ever an exact representation usually lead to trouble. In general, it's best to assume that floating-point types are an imperfect (approximate) realization of real numbers, period (that is, without assuming decimal). If you never assume they're exact (which for true real numbers, they virtually never are), you'll never get into trouble in cases where you thought they'd be exact, but they weren't.
[Footnote 1: As Eric P. reminds in a comment, there's no such thing as a "randomly-chosen real number", which is why this is a partially facetious answer.]
[Footnote 2: I now see your comment where you say that you do assume they are all imprecise, but that you would "like to understand the phenomenon in a deeper way", in which case my answer does you no good, but hopefully some of the others do. I can especially commend Martin Rosenau's answer, which goes straight to the heart of the matter: a rational number is representable exactly in base 2 if and only if its reduced denominator is a pure power of 2, or stated another way, has only 2's in its prime factorization. That's why, if you take any number you can actually store in a float or double, and print it back out using %f and enough digits, with a properly-written printf, you'll notice that the numbers always end in things like ...625 or ...375. Binary fractions are like the English rulers still used in the U.S.: everything is halves and quarters and eighths and sixteenths and thirty-seconds and sixty-fourths.]
Usually, a float is an IEEE754 binary32 float (this is not guaranteed by spec and may be different on some compilers/systems). This data type specifies a 24-bit significand; this means that if you write the number in binary, it should require no more than 24 bits excluding trailing zeros.
159.95's binary representation is 10011111.11110011001100110011... with repeating 0011 forever, so it requires an infinite number of bits to represent precisely with a binary format.
Other examples:
1073741760 has a binary representation of 111111111111111111111111000000. It has 30 bits in that representation, but only 24 significant bits (since the remainder are trailing zero bits). It has an exact float representation.
1073741761 has a binary representation of 111111111111111111111111000001. It has 30 significant bits and cannot be represented exactly as a float.
0.000000059604644775390625 has a binary representation of 0.000000000000000000000001. It has one significant bit and can be represented exactly.
0.750000059604644775390625 has a binary representation of 0.110000000000000000000001, which is 24 significant bits. It can be represented exactly as a float.
1.000000059604644775390625 has a binary representation of 1.000000000000000000000001, which is 25 significant bits. It cannot be represented exactly as a float.
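You can watch the 24-bit limit in action with a small test (output assumes IEEE-754 binary32 with the default round-to-nearest):

#include <stdio.h>

int main(void)
{
    float a = 1073741760.0f;  /* 24 significant bits: stored exactly */
    float b = 1073741761.0f;  /* 30 significant bits: rounds to a neighbor */
    printf("%.1f\n", a);      /* 1073741760.0 */
    printf("%.1f\n", b);      /* e.g. 1073741760.0 -- the trailing 1 is lost */
    return 0;
}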
Another factor (which applies to very large and very small numbers) is that the exponent is limited to the −126 to +127 range. With some handwaving around denormal values and other special cases, this generally allows values ranging from roughly 2^−126 to slightly under 2^128.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In another answer I semiseriously answered "all of them",
but let's look at it another way. Specifically, let's look at
which numbers can be exactly represented.
The key fact to remember is that floating point formats use binary.
(The major, popular formats, anyway.) So the numbers that can be
represented exactly are the ones with exact binary representations.
Here is a table of a few of the single-precision float values
that can be represented exactly, specifically the seven
contiguous values near 1.0.
I'm going to show them as hexadecimal fractions, binary
fractions, and decimal fractions.
(That is, along each horizontal row, all three values are exactly
the same, just represented in different bases. But note that the
fractional hexadecimal and binary representations I'm using here
are not directly acceptable in C.)
hexadecimal   binary                         decimal                      delta
0x0.fffffd    0b0.111111111111111111111101   0.999999821186065673828125   5.96e-08
0x0.fffffe    0b0.111111111111111111111110   0.999999880790710449218750   5.96e-08
0x0.ffffff    0b0.111111111111111111111111   0.999999940395355224609375   5.96e-08
0x1.000000    0b1.00000000000000000000000    1.000000000000000000000000
0x1.000002    0b1.00000000000000000000001    1.000000119209289550781250   1.19e-07
0x1.000004    0b1.00000000000000000000010    1.000000238418579101562500   1.19e-07
0x1.000006    0b1.00000000000000000000011    1.000000357627868652343750   1.19e-07
There are several things to notice about this table:
The decimal numbers look pretty weird.
The hexadecimal and binary numbers look pretty normal, and show pretty clearly that single-precision floating point has 24 bits of precision.
If you look at the decimal column, the precision seems to be about equivalent to 7 decimal digits.
It's clearly not exactly 7 digits, though.
The difference between consecutive values less than 1.0 is about 0.00000005, and greater than 1.0 is twice that, about 0.00000010. (More on this later.)
Here is a similar table for type double.
(I'm showing fewer columns because there's not enough room
horizontally for everything.)
hexadecimal          decimal                                                   delta
0x0.ffffffffffffe8   0.99999999999999966693309261245303787291049957275390625   1.11e-16
0x0.fffffffffffff0   0.99999999999999977795539507496869191527366638183593750   1.11e-16
0x0.fffffffffffff8   0.99999999999999988897769753748434595763683319091796875   1.11e-16
0x1.0000000000000    1.0000000000000000000000000000000000000000000000000000
0x1.0000000000001    1.0000000000000002220446049250313080847263336181640625    2.22e-16
0x1.0000000000002    1.0000000000000004440892098500626161694526672363281250    2.22e-16
0x1.0000000000003    1.0000000000000006661338147750939242541790008544921875    2.22e-16
You can see right away that type double has much better precision:
53 bits, or about 15 decimal digits' worth instead of 7, and with a much
finer spacing between "adjacent" numbers.
What does it mean for these numbers to be "contiguous" or
"adjacent"? Aren't real numbers continuous? Yes, true real
numbers are continuous, but we're not looking at true real
numbers: we're looking at finite-precision floating point, and we
are, literally, seeing the finite limit of the precision here.
In type float, there simply is no value — no representable
value, that is — between 1.00000000 and 1.00000012.
In type double, there is no value between 1.00000000000000000
and 1.00000000000000022.
So let's go back to your question, asking whether there's "some way
to know which decimal values are represented in a precise or imprecise way."
If you look at the nine decimal values between 1 and 2:
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
the answer is, only one of them is exactly representable in binary: 1.5.
If you break the interval down into 100 fractions, like this:
1.01
1.02
1.03
1.04
1.05
…
1.95
1.96
1.97
1.98
1.99
it turns out there are three fractions you can represent exactly:
.25, .50, and .75, corresponding to
¼, ½, and ¾.
If we looked at three-digit decimal fractions, there are at most
seven of them we can represent: .125, .250, .375, .500, .625, .750, and .875. These correspond to eighths, that is, ordinary
fractions with 8 in the denominator.
I said "at most seven" because it's not true (none of these
estimates are true) for all ranges of numbers. Remember,
precision is finite, and digits to the left of the decimal part
— that is, in the integral part of your numbers — count against
your precision budget, too. So it turns out that if you were to
look at the range, say, 4000000–4000001, and tried to subdivide
it, you would find that you could represent 4000000.25 and
4000000.50 as type float, but not 4000000.125 or 4000000.375.
You can't really see it if you look at the decimal
representation, but what's happening inside is that type float
has exactly 24 binary bits of available precision, and the
integer part 4000000 uses up 22 of those bits, so you've only got
two bits left over for the fractional part, and with two bits you
can do halves and quarters, but not eighths.
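A sketch that demonstrates this (IEEE-754 float with the default round-to-nearest assumed; .125 and .375 land exactly halfway between representable quarters, so they round to a neighbor):

#include <stdio.h>

int main(void)
{
    /* Near 4000000, the 24-bit significand leaves only 2 bits
       for the fraction, so only quarters survive. */
    printf("%.3f\n", 4000000.25f);   /* 4000000.250 */
    printf("%.3f\n", 4000000.5f);    /* 4000000.500 */
    printf("%.3f\n", 4000000.125f);  /* e.g. 4000000.000 */
    printf("%.3f\n", 4000000.375f);  /* e.g. 4000000.500 */
    return 0;
}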
You're probably noticing a pattern by now: the fractions we've
looked at so far that can be represented exactly in binary
involve halves, quarters, and eighths, and if we looked further,
this pattern would continue: sixteenths, thirty-seconds,
sixty-fourths, etc. And this should come as no real surprise:
just as in decimal the "exact" fractions involve tenths,
hundredths, thousandths, etc.; when we move to binary (base 2) the fractions
all involve powers of two. ½ in binary is 0b0.1.
¼ and ¾ are 0b0.01 and 0b0.11.
⅜ and ⅝ are 0b0.011 and 0b0.101.
What about a fraction like 1/3? You can't represent it exactly
in binary, but since you can't represent it in decimal, either,
this doesn't tend to bother us too much. In decimal it's the
infinitely repeating fraction 0.333333…, and in binary it's the
infinitely-repeating fraction 0b0.0101010101….
But then we come to the humble fraction 1/10, or one tenth.
This obviously can be represented as a decimal fraction — 0.1 —
but it turns out that it cannot be represented exactly in binary.
In binary it's the infinitely-repeating fraction 0b0.0001100110011….
And this is why, as we saw above, you can't represent most of the other
"single digit" decimal fractions 0.2, 0.3, 0.4, …, either
(with the notable exception of 0.5), and you can't represent most
of the double-digit decimal fractions 0.01, 0.02, 0.03, …,
or most of the triple-digit decimal fractions, etc.
So returning once more to your question of which decimal
fractions can be represented exactly, we can say:
For single-digit fractions 0.1, 0.2, 0.3, …, we can exactly represent .5, and to be charitable we can say that we can also represent .0, so that's two out of ten, or 20%.
For double-digit fractions 0.01, 0.02, 0.03, …, we can exactly represent .00, 0.25, 0.50, and 0.75, so that's four out of a hundred, or 4%.
For three-digit fractions 0.001, 0.002, 0.003, …, we can exactly represent the eight fractions involving eighths, so that's 8/1000 = 0.8%.
So while there are some decimal fractions we can represent
exactly, there aren't very many, and the percentage seems to be
going down as we add more digits. :-(
The fact — and depending on your point of view it's either an
unfortunate fact or a sad fact or a perfectly normal fact —
is that most decimal fractions can not be represented exactly
in binary and so can not be represented exactly using computer
floating point.
The numbers that can be represented exactly using computer
floating point, although they can all be exactly converted into
numerically equivalent decimal fractions, end up converting to
rather weird-looking numbers for the most part, with lots of digits, as we saw above.
(In fact, for type float, which internally has 24 bits of
significance, the exact decimal conversions end up having up to
24 decimal digits. And the fractions always end in 5.)
One last point concerns the spacing between these "contiguous",
exactly-representable binary fractions. In the examples I've
shown, why is there tighter spacing for numbers less than 1.0
than for numbers greater than 1.0?
The answer lies in an earlier statement that "precision is
finite, and digits to the left of the decimal part count against
your precision budget, too". Switching to decimal fractions for
a moment, if I told you you had exactly 7 significant decimal
digits to work with, you could represent
1234.567
1234.568
1234.569
and
12345.67
12345.68
12345.69
but you could not represent
12345.678
because that would require 8 significant digits.
Stated another way, for numbers between 1000 and 10000 you can
have three more digits after the decimal point, but for numbers
from 10000 to 100000 you can only have two. Mathematicians call
these intervals like 1000-10000 and 10000-100000 decades,
and within each decade, all the numbers have the same number of
fractional digits for a given precision, and the same exponents:
1.000000×10^3 – 1.999999×10^3,
1.000000×10^4 – 1.999999×10^4, etc.
(This usage is rather different than ordinary usage, in which the
word "decade" refers to a period of 10 years.)
But for binary floating point, once again, the intervals of
interest involve powers of 2, not 10. (In binary, some computer
scientists call these intervals binades, by analogy with "decades".)
The interesting intervals are from 1 to 2, 2–4, 4–8, 8–16, etc.
For numbers between 1 and 2, you've got 1 bit to the left of the
decimal point (really the "binary point"), so in single precision
you've got 23 bits left over to use for the fractional part to the right.
But for numbers between 2 and 4, you've got 2 bits to the left,
so you've only got 22 bits to use for the fraction.
This works in the other direction, too: for numbers between
½ and 1, you don't need any bits to the left of the binary
point, so you can use all 24 for the fraction to the right.
(Below ½ it gets even more interesting). So that's why we
saw twice the precision (numbers half the size in the "delta"
column) for numbers just below 1.0 than for numbers just above.
We'd see similar shifts in available precision when crossing all the other
powers of two: 2.0, 4.0, 8.0, …, and also ½, ¼,
⅛, etc.
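You can measure the change in spacing across 1.0 directly with C99's nextafterf; here is a sketch (printed values assume IEEE-754 binary32):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float below = nextafterf(1.0f, 0.0f);  /* largest float below 1.0 */
    float above = nextafterf(1.0f, 2.0f);  /* smallest float above 1.0 */
    printf("gap below 1.0: %g\n", 1.0f - below);  /* e.g. 5.96046e-08 */
    printf("gap above 1.0: %g\n", above - 1.0f);  /* e.g. 1.19209e-07 */
    return 0;
}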
This has been a rather long answer, longer than I had intended.
Thanks for reading.
Hopefully now you have a better appreciation for which numbers can be
exactly represented in binary floating point, and why most of them can't.

Book says C standard provides floating point accuracy to six significant figures, but this isn't true?

I am reading C Primer Plus by Stephen Prata, and one of the first ways it introduces floats is talking about how they are accurate to a certain point. It says specifically "The C standard provides that a float has to be able to represent at least six significant figures...A float has to represent accurately the first six numbers, for example, 33.333333"
This is odd to me, because it makes it sound like a float is accurate up to six digits, but that is not true. 1.4 is stored as 1.39999... and so on. You still have errors.
So what exactly is being provided? Is there a cutoff for how accurate a number is supposed to be?
In C, you can't store more than six significant figures in a float without getting a compiler warning, but why? If you were to do more than six figures it seems to go just as accurately.
This is made even more confusing by the section on underflow and subnormal numbers. When you have a number that is the smallest a float can be, and divide it by 10, the errors you get don't seem to be subnormal? They seem to just be the regular rounding errors mentioned above.
So why is the book saying floats are accurate to six digits and how is subnormal different from regular rounding errors?
Suppose you have a decimal numeral with q significant digits:
d_(q−1) . d_(q−2) d_(q−3) … d_0,
and let’s also make it a floating-point decimal numeral, meaning we scale it by a power of ten:
d_(q−1) . d_(q−2) d_(q−3) … d_0 · 10^e.
Next, we convert this number to float. Many such numbers cannot be exactly represented in float, so we round the result to the nearest representable value. (If there is a tie, we round to make the low digit even.) The result (if we did not overflow or underflow) is some floating-point number x. By the definition of floating-point numbers (in C 2018 5.2.4.2.2 3), it is represented by some number of digits in some base scaled by that base to a power. Supposing the base is two, x is:
b_(p−1) . b_(p−2) b_(p−3) … b_0 · 2^e.
Next, we convert this float x back to decimal with q significant digits. Similarly, the float value x might not be exactly representable as a decimal numeral with q digits, so we get some possibly new number:
n_(q−1) . n_(q−2) n_(q−3) … n_0 · 10^m.
It turns out that, for any float format, there is some number q such that, if the decimal numeral we started with is limited to q digits, then the result of this round-trip conversion will equal the original number. Each decimal numeral of q digits, when rounded to float and then back to q decimal digits, results in the starting number.
In the 2018 C standard, clause 5.2.4.2.2, paragraph 12, tells us this number q must be at least 6 (a C implementation may support larger values), and the C implementation should define a preprocessor symbol for it (in float.h) called FLT_DIG.
So considering your example number, 1.4: when we convert it to float in the IEEE-754 basic 32-bit binary format, we get exactly 1.39999997615814208984375 (that is its mathematical value, shown in decimal for convenience; the actual bits in the object represent it in binary). When we convert that to decimal with full precision, we get “1.39999997615814208984375”. But if we convert it to decimal rounded to six significant digits, we get “1.40000”. So 1.4 survives the round trip.
In other words, it is not true in general that six decimal digits can be represented in float without change, but it is true that float carries enough information that you can recover six decimal digits from it.
Of course, once you start doing arithmetic, errors will generally compound, and you can no longer rely on six decimal digits.
Thanks to Govind Parmar for citing an on-line example of C11 (or, for that matter C99).
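A small demonstration of the round trip (output assumes IEEE-754 binary32; FLT_DIG comes from <float.h>):

#include <stdio.h>
#include <float.h>

int main(void)
{
    float f = 1.4f;
    printf("%.*g\n", FLT_DIG, f);  /* 1.4 -- six digits survive the round trip */
    printf("%.17g\n", f);          /* e.g. 1.3999999761581421 -- the stored value */
    return 0;
}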
The "6" you're referring to is "FLT_DECIMAL_DIG".
http://c0x.coding-guidelines.com/5.2.4.2.2.html
number of decimal digits, n, such that any floating-point number with
p radix b digits can be rounded to a floating-point number with n
decimal digits and back again without change to the value,
    p × log10(b)          if b is a power of 10
    ⌈1 + p × log10(b)⌉    otherwise

FLT_DECIMAL_DIG   6
DBL_DECIMAL_DIG   10
LDBL_DECIMAL_DIG  10
"Subnormal" means:
What is a subnormal floating point number?
A number is subnormal when the exponent bits are zero and the mantissa
is non-zero. They're numbers between zero and the smallest normal
number. They don't have an implicit leading 1 in the mantissa.
STRONG SUGGESTION:
If you're unfamiliar with "floating point arithmetic" (or, frankly, even if you are), this is an excellent article to read (or review):
What Every Programmer Should Know About Floating-Point Arithmetic

Why the difference in output?

Using the C Code given below (written in Visual Studio):
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
float i = 2.0/3.0;
printf("%5.6f", i);
return 0;
}
produces the output:
0.666667
however when the %5.6f is changed to %5.20f the output changes to :
0.66666668653488159000
My question is why the subtle changes in output for the similar decimal?
When you use 32-bit float, the computer represents the result of 2./3. as 11,184,811 / 16,777,216, which is exactly 0.666666686534881591796875. In the floating-point you are using, numbers are always represented as some integer multiplied by some power of two (which may be a negative power of two). Due to limits on how large the integer can be (when you use float, the integer must fit in 24 bits, not including the sign), the closest representable value to 2/3 is 11,184,811 / 16,777,216.
The reason that printf with %5.6f displays “0.666667” is that %5.6f requests just six digits after the decimal point, so the number is rounded at the sixth digit.
The reason that printf with %5.20f displays “0.66666668653488159000” is that your printf implementation “gives up” after 17 digits, figuring that is close enough in some sense. Some implementations of printf, which one might argue are better, print the represented value as closely as the requested format permits. In this case, they would display “0.66666668653488159180”, and, if you requested more digits, they would display the exact value, “0.666666686534881591796875”.
(The floating-point format is often presented as a sign, a fraction between 1 [inclusive] and 2 [exclusive], and an exponent, instead of a sign, an integer, and an exponent. Mathematically, they are the same with an adjustment in the exponent: Each number representable with a sign, a 24-bit unsigned integer, and an exponent is equal to some number with a sign, a fraction between 1 and 2, and an adjusted exponent. Using the integer version tends to make proofs easier and sometimes helps explanation.)
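You can recover that integer-times-power-of-two form from any float with C99's frexpf and ldexpf; a sketch (assuming the usual 24-bit significand):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float f = 2.0f / 3.0f;
    int e;
    float m = frexpf(f, &e);  /* f == m * 2^e, with 0.5 <= m < 1 */
    /* Scale the significand up to a 24-bit integer. */
    printf("%.0f / 2^%d\n", ldexpf(m, 24), 24 - e);  /* e.g. 11184811 / 2^24 */
    return 0;
}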
Unlike integers, which can be represented exactly in any base, relatively few decimal fractions have an exact representation in the base-2 fractional format.
This means that FP integers are exact, and, generally, FP fractions are not.
So for two-digits, say, 0.01 to 0.99, only 0.25, 0.50, and 0.75 (and 0) have exact representations. Normally it doesn't matter as output gets rounded, and really, few if any physical constants are known to the precision available in the format.
This is because you may not have an exact representation of 0.6666666666666666...66667 in floating point.
The precision is stored in exponential format, i.e. like (−/+) a × 2^n. A 32-bit float spends 1 bit on the sign, 8 bits on the exponent n, and the remaining 23 bits on the significand a. It cannot store 20 exact digits after the decimal point, so with this type you will never get the exact value.
The float type has only 23 stored bits (plus one implied bit) for the significand; 20 digits after the decimal point is far too many.

Why is not a==0 in the following code?

#include <stdio.h>

int main(void)
{
    float a = 1.0;
    long i;

    for (i = 0; i < 100; i++)
    {
        a = a - 0.01;
    }
    printf("%e\n", a);
}
Result is: 6.59e-07
It's a binary floating point number, not a decimal one - therefore you need to expect rounding errors. See the Basic section in this article:
What Every Programmer Should Know About Floating-Point Arithmetic
For example, the value 0.01 does not have a precise representation in a binary floating-point type. To get a "correct" result in your sample you would have to either round or use a decimal floating-point type (see Wikipedia):
Binary fixed-point types are most commonly used, because the rescaling operations can be implemented as fast bit shifts. Binary fixed-point numbers can represent fractional powers of two exactly, but, like binary floating-point numbers, cannot exactly represent fractional powers of ten. If exact fractional powers of ten are desired, then a decimal format should be used. For example, one-tenth (0.1) and one-hundredth (0.01) can be represented only approximately by binary fixed-point or binary floating-point representations, while they can be represented exactly in decimal fixed-point or decimal floating-point representations. These representations may be encoded in many ways, including BCD.
There are two questions here. If you're asking, why is my printf statement displaying the result as 6.59e-07 instead of 0.000000659, it's because you've used the format specifier for Scientific Notation: %e. You want %f for the floating point a.
printf("%f\n",a);
If you're asking why the result is not exactly zero rather than 0.000000659, the answer is (as others have pointed out) that with floating point arithmetic using binary numbers you need to expect rounding.
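A sketch of that advice applied to the loop in the question (the 1e-6f tolerance is an arbitrary choice for illustration):

#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 1.0f;
    long i;

    for (i = 0; i < 100; i++)
        a = a - 0.01;
    /* Test against a tolerance instead of expecting exact zero. */
    if (fabsf(a) < 1e-6f)
        printf("a is approximately zero (residue %e)\n", a);
    return 0;
}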
If you specify %f when printing, the default six digits of precision round the value of a to 0.000001 -- close to zero, but still not exactly zero.
Those are floating-point rounding errors at work. Each time you subtract 0.01 you get approximately the result you'd expect on paper, so the final result is very close to zero, but not necessarily exactly zero.
Floating-point arithmetic is not exact, which is why you see this result.

exact representation of floating points in c

#include <stdio.h>

int main(void)
{
    float a = 0.7;
    if (a < 0.7)
        printf("c");
    else
        printf("c++");
    return 0;
}
In the above program, "c" will be printed for 0.7, but for 0.8, "c++" will be printed. Why?
And how is any float represented in binary form?
At some places, it is mentioned that internally 0.7 will be stored as 0.699997, but 0.8 as 0.8000011. Why so?
Basically, with float you get 32 bits that encode

VALUE = SIGN * (1.MANTISSA) * 2 ^ (EXPONENT - 127)

and that is stored, most significant bit first, as

[SIGN][EXPONENT][MANTISSA]
1 bit  8 bits    23 bits

Since you only get 23 stored mantissa bits, that's the amount of "precision" you can store. If you are trying to represent a fraction that is irrational (or repeating) in base 2, the sequence of bits will be "rounded off" at the 23rd bit.
0.7 base 10 is 7 / 10, which in binary is 0b111 / 0b1010; doing the division, you get:
0.1011001100110011001100110011001100110011001100110011... etc
Since this repeats, in fixed precision there is no way to exactly represent it. The
same goes for 0.8 which in binary is:
0.1100110011001100110011001100110011001100110011001101... etc
To see what the fixed-precision value of these numbers is, you have to "cut them off" at the number of bits you have and do the math. The only trick is that the leading 1 is implied and not stored, so you technically get an extra bit of precision. Because of rounding, the last bit will be a 1 or a 0 depending on the value of the truncated bits.
So the value of 0.7 is effectively 11,744,051 / 2^24 (no rounding effect) = 0.699999988 and the value of 0.8 is effectively 13,421,773 / 2^24 (rounded up) = 0.800000012.
That's all there is to it :)
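Printing with nine fractional digits shows exactly those effective values (assuming IEEE-754 binary32):

#include <stdio.h>

int main(void)
{
    printf("%.9f\n", 0.7f);  /* e.g. 0.699999988 */
    printf("%.9f\n", 0.8f);  /* e.g. 0.800000012 */
    return 0;
}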
A good reference for this is What Every Computer Scientist Should Know About Floating-Point Arithmetic. You can use higher precision types (e.g. double) or a Binary Coded Decimal (BCD) library to achieve better floating point precision if you need it.
The internal representation is IEEE 754.
You can also use this calculator to convert decimal to float, I hope this helps to understand the format.
floats will be stored as described in IEEE 754: 1 bit for sign, 8 for a biased exponent, and the rest storing the fractional part.
Think of numbers representable as floats as points on the number line, some distance apart; frequently, decimal fractions will fall in between these points, and the nearest representation will be used; this leads to the counterintuitive results you describe.
"What every computer scientist should know about floating point arithmetic" should answer all your questions in detail.
If you want to know how float/double is presented in C(and almost all languages), please refert to Standard for Floating-Point Arithmetic (IEEE 754) http://en.wikipedia.org/wiki/IEEE_754-2008
Using single-precision floats as an example, here is the bit layout:
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
31                             0   (bit #)

s = sign bit, e = exponent, m = mantissa
Another good resource to see how floating point numbers are stored as binary in computers is Wikipedia's page on IEEE-754.
Floating point numbers in C/C++ are represented in IEEE-754 standard format. There are many articles on the internet, that describe in much better detail than I can here, how exactly a floating point is represented in binary. A simple search for IEEE-754 should illuminate the mystery.
0.7 is a numeric literal; its value is the mathematical real number 0.7, rounded to the nearest double value.
After initialising float a = 0.7, the value of a is 0.7 rounded to float, that is the real number 0.7, rounded to the nearest double value, rounded to the nearest float value. Except by a huge coincidence, you wouldn't expect a to be equal to 0.7.
"if (a < 0.7)" compares 0.7 rounded to double then to float with the number 0.7 rounded to double. It seems that in the case of 0.7, the rounding produced a smaller number. And in the same experiment with 0.8, rounding 0.8 to float will produce a larger number than 0.8.
Floating-point comparisons are not reliable, whatever you do. You should use a threshold-tolerant (epsilon) comparison of floating-point values, as sketched below.
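For example, a minimal epsilon-comparison sketch for the code in the question (EPS is an arbitrary illustrative tolerance):

#include <stdio.h>
#include <math.h>

#define EPS 1e-6

int main(void)
{
    float a = 0.7f;

    if (fabs(a - 0.7) < EPS)  /* treat "close enough" as equal */
        printf("equal within tolerance");
    else if (a < 0.7)
        printf("c");
    else
        printf("c++");
    return 0;
}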
Try IEEE-754 Floating-Point Conversion and see what you get. :)
