Exact representation of floating point numbers in C

#include <stdio.h>

int main(void)
{
    float a = 0.7;
    if (a < 0.7)
        printf("c");
    else
        printf("c++");
    return 0;
}
In the above code, for 0.7 "c" will be printed, but for 0.8 "c++" will be printed. Why?
And how is any float represented in binary form?
At some places, it is mentioned that internally 0.7 will be stored as 0.699997, but 0.8 as 0.8000011. Why so?

basically with float you get 32 bits that encode

VALUE = (-1)^SIGN * 1.MANTISSA * 2^(EXPONENT - 127)

32 bits = 1 sign bit + 8 exponent bits + 23 mantissa bits

and that is stored as

MSB                      LSB
[SIGN][EXPONENT][MANTISSA]

since you only get 23 bits, that's the amount of "precision" you can store. If you are trying to represent a fraction whose binary expansion repeats forever (or an irrational number), the sequence of bits has to be "rounded off" at the 23rd bit.
0.7 base 10 is 7 / 10, which in binary is 0b111 / 0b1010; doing the division you get:
0.1011001100110011001100110011001100110011001100110011... etc
Since this repeats, in fixed precision there is no way to exactly represent it. The
same goes for 0.8 which in binary is:
0.1100110011001100110011001100110011001100110011001100... etc
To see what the fixed-precision value of these numbers is, you have to "cut them off" at the number of bits you have and do the math. The only trick is that the leading 1 is implied and not stored, so you technically get an extra bit of precision (24 significant bits in all). Because of rounding, the last stored bit will be a 1 or a 0 depending on the value of the bits that were truncated.
So the value of 0.7 is effectively 11,744,051 / 2^24 (no rounding effect) = 0.699999988 and the value of 0.8 is effectively 13,421,773 / 2^24 (rounded up) = 0.800000012.
That's all there is to it :)
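If you want to verify this on your own machine, here is a minimal C sketch (assuming IEEE-754 binary32 floats, which virtually every current platform uses) that prints the rounded values and reconstructs them from the fractions above:

#include <stdio.h>

int main(void)
{
    /* Print more digits than printf's default six to expose the rounding. */
    printf("0.7f is stored as %.9f\n", 0.7f);   /* typically 0.699999988 */
    printf("0.8f is stored as %.9f\n", 0.8f);   /* typically 0.800000012 */

    /* The same values reconstructed as integer / 2^24 fractions. */
    printf("11744051 / 2^24 = %.9f\n", 11744051.0 / 16777216.0);
    printf("13421773 / 2^24 = %.9f\n", 13421773.0 / 16777216.0);
    return 0;
}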

A good reference for this is What Every Computer Scientist Should Know About Floating-Point Arithmetic. You can use higher precision types (e.g. double) or a Binary Coded Decimal (BCD) library to achieve better floating point precision if you need it.

The internal representation is IEEE 754.
You can also use this calculator to convert decimal to float, I hope this helps to understand the format.

floats will be stored as described in IEEE 754: 1 bit for sign, 8 for a biased exponent, and the rest storing the fractional part.
Think of numbers representable as floats as points on the number line, some distance apart; frequently, decimal fractions will fall in between these points, and the nearest representation will be used; this leads to the counterintuitive results you describe.
"What every computer scientist should know about floating point arithmetic" should answer all your questions in detail.

If you want to know how float/double is represented in C (and almost all languages), please refer to the Standard for Floating-Point Arithmetic (IEEE 754): http://en.wikipedia.org/wiki/IEEE_754-2008
Using single-precision floats as an example, here is the bit layout:
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
^                              ^
bit 31                     bit 0

s = sign bit, e = exponent, m = mantissa
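To poke at those fields yourself, here is a small C sketch (the helper name dump_float is made up for this example) that copies a float's bits into a uint32_t and masks out the three fields, again assuming the IEEE-754 binary32 layout shown above, which the C standard does not strictly guarantee:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void dump_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* portable way to view the bits */

    unsigned sign     = bits >> 31;           /* 1 bit  */
    unsigned exponent = (bits >> 23) & 0xFF;  /* 8 bits, biased by 127 */
    unsigned mantissa = bits & 0x7FFFFF;      /* 23 stored bits */

    printf("%g: s=%u e=%u (2^%d) m=0x%06X\n",
           f, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void)
{
    dump_float(0.7f);
    dump_float(0.8f);
    return 0;
}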

Another good resource to see how floating point numbers are stored as binary in computers is Wikipedia's page on IEEE-754.

Floating point numbers in C/C++ are represented in IEEE-754 standard format. There are many articles on the internet, that describe in much better detail than I can here, how exactly a floating point is represented in binary. A simple search for IEEE-754 should illuminate the mystery.

0.7 is a numeric literal; its value is the mathematical real number 0.7, rounded to the nearest double value.
After initialising float a = 0.7, the value of a is the real number 0.7, rounded to the nearest double value, then rounded to the nearest float value. Except by a huge coincidence, you wouldn't expect a to be equal to 0.7.
if (a < 0.7) compares a, converted back to double (that conversion is exact), with the number 0.7 rounded to double. In the case of 0.7, the extra rounding to float produced a smaller number; in the same experiment with 0.8, rounding 0.8 to float produces a number larger than the double 0.8.

Floating point comparisons are not reliable, whatever you do. You should use a threshold-tolerant (epsilon) comparison of floating point values instead.
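For illustration, here is a minimal sketch of such a comparison; the helper name nearly_equal and the 1e-6 tolerance are arbitrary choices for this example, and a production version would pick the tolerance to match the scale of the computation:

#include <stdio.h>
#include <math.h>

/* Compare within a tolerance scaled by the operands' magnitudes. */
static int nearly_equal(double a, double b, double eps)
{
    return fabs(a - b) <= eps * fmax(1.0, fmax(fabs(a), fabs(b)));
}

int main(void)
{
    float a = 0.7f;
    printf("a == 0.7     : %d\n", (double)a == 0.7);           /* 0: exact compare fails */
    printf("nearly equal : %d\n", nearly_equal(a, 0.7, 1e-6)); /* 1 */
    return 0;
}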
Try IEEE-754 Floating-Point Conversion and see what you get. :)


Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 × 2^-49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:           # single precision
        int_pack      = 'I'
        float_pack    = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:         # double precision. all python floats are this
        int_pack      = 'Q'
        float_pack    = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^(# of bits - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413 × 10^23
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2^(# of bits - 1) - 1 to get the true exponent
Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 × 2^3
Which we can then convert from binary to decimal:
1.1499999999999999 × 2^3 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 × 2^3
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 × 2^(3 - 52)
Convert to decimal:
5179139571476070 × 2^(3 - 52)
Subtract the exponent:
5179139571476070 × 2^-49
Turn negative exponent into division:
5179139571476070 / 2^49
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
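As a quick sanity check, a short C sketch (assuming IEEE-754 doubles) confirms that this fraction and the literal 9.2 are the very same double:

#include <stdio.h>

int main(void)
{
    double fraction = 5179139571476070.0 / 562949953421312.0;  /* n / 2^49 */

    printf("%.17g\n", 9.2);                        /* typically 9.1999999999999993 */
    printf("%.17g\n", fraction);                   /* the same digits */
    printf("same double? %d\n", fraction == 9.2);  /* 1 */
    return 0;
}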
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 × 2^3
Shift the decimal point:
10011 × 2^(3 - 4)
Subtract the exponent:
10011 × 2^-1
Binary to decimal:
19 × 2^-1
Negative exponent to division:
19 / 2^1
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)₁₀ = 0.2₃
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)₁₀ = 0.5₁₀ = 0.1₂ = 0.1111...₃
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that is exactly what "irrational" means: such a number cannot be written as a ratio of integers, so no positional notation in any base can hold one of them in a finite number of digits. No amount of bit storage in the world would be enough for even one of them. Only symbolic arithmetic is able to preserve their precision.
If you limit your math needs to rational numbers only, however, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b, and all your arithmetic would have to be done on fractions just like in high school math (e.g. a/b * c/d = ac/bd); a toy sketch of this idea follows below.
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
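To make the idea concrete, here is a toy C sketch of such a fraction type; the names frac, frac_mul, and so on are invented for this example, and a real library would use arbitrary-precision integers because the components overflow quickly:

#include <stdio.h>

typedef struct { long long num, den; } frac;

static long long gcd(long long a, long long b)
{
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* Keep fractions in lowest terms after every operation. */
static frac frac_reduce(frac f)
{
    long long g = gcd(f.num, f.den);
    f.num /= g;
    f.den /= g;
    return f;
}

static frac frac_mul(frac x, frac y)
{
    frac r = { x.num * y.num, x.den * y.den };  /* a/b * c/d = ac/bd */
    return frac_reduce(r);
}

int main(void)
{
    frac a = {7, 10};                     /* exactly 0.7 */
    frac b = {8, 10};                     /* exactly 0.8 */
    frac p = frac_mul(a, b);
    printf("%lld/%lld\n", p.num, p.den);  /* 14/25, i.e. exactly 0.56 */
    return 0;
}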
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-point numbers are extremely accurate. They can represent any number in a very wide range with as many as 15 exact decimal digits. For daily-life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in its lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2^b * 5^c).
On the other hand, the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2^b). A small sketch of this test appears below.
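Here is a small C sketch of that test (the helper name finite_in_base2 is made up for this example): given a fraction already reduced to lowest terms, its base-2 expansion terminates exactly when the denominator contains no prime factor other than 2.

#include <stdio.h>

static int finite_in_base2(unsigned long long den)
{
    while (den % 2 == 0)     /* strip all factors of 2 */
        den /= 2;
    return den == 1;         /* anything left is another prime factor */
}

int main(void)
{
    printf("7/10 : %s\n", finite_in_base2(10) ? "terminates" : "repeats"); /* repeats */
    printf("3/8  : %s\n", finite_in_base2(8)  ? "terminates" : "repeats"); /* terminates */
    printf("46/5 : %s\n", finite_in_base2(5)  ? "terminates" : "repeats"); /* repeats (92/10 reduced) */
    return 0;
}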
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number, then after each calculation you need to reduce the fraction to its lowest terms. That means that for every operation you basically need to do a greatest-common-divisor calculation.
If after your calculation you end up with an unrepresentable result, because the numerator or denominator has grown too large, you need to find the closest representable result. This is non-trivial.
Some languages do offer fraction types, but usually they do it in combination with arbitrary precision; this avoids needing to worry about approximating fractions, but it creates its own problem: when a number passes through a large number of calculation steps, the size of the denominator, and hence the storage needed for the fraction, can explode.
Some languages also offer decimal floating point types; these are mainly used in scenarios where it is important that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). They are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.

How can I know in advance which real numbers would have an imprecise representation using float variables in C?

I know that the number 159.95 cannot be precisely represented using float variables in C.
For example, considering the following piece of code:
#include <stdio.h>

int main(void)
{
    float x = 159.95;
    printf("%f\n", x);
    return 0;
}
It outputs 159.949997.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
Best regards.
Succinctly, for the format most commonly used for float, a number is exactly representable if and only if it can be written as an integer F times a power of two, 2^E, such that:
the magnitude of F is less than 2^24, and
-149 ≤ E < 105.
More generally, C 2018 5.2.4.2.2 specifies the characteristics of floating-point types. A floating-point number is represented as s · b^e · Σ(f_k · b^(-k), 1 ≤ k ≤ p), where:
s is a sign, +1 or -1,
b is a fixed base chosen by the C implementation, often 2,
e is an exponent, which is an integer between a minimum emin and a maximum emax, chosen by the C implementation,
p is the precision, the number of base-b digits in the significand, and
f_k are digits in base b, nonnegative integers less than b.
The significand is the fraction portion of the representation, Σ(f_k · b^(-k), 1 ≤ k ≤ p). It is written as a sum so that we can express the variable number of digits it may have. (p is set by the C implementation, not by the programmer using it.) Written out in base b, a significand can be a numeral such as .001110101001100101010110 (binary) for a 24-bit significand in base 2. Note that, in this form (and in the sum), the significand has all its digits after the radix point.
To make it slightly easier to tell whether a number is in this format, we can adjust the scale so the significand is an integer instead of having digits after the radix point: s · b^(e-p) · Σ(f_k · b^(p-k), 1 ≤ k ≤ p). This changes the above significand from .001110101001100101010110 to 001110101001100101010110. Since it has p digits, it is always a nonnegative integer less than b^p.
Now we can figure out if a finite number is representable in this format:
Get b, p, emin, and emax for the target C implementation. If it uses IEEE-754 binary32 for float, then b is 2, p is 24, emin is -125, and emax is 128. When <float.h> is included, these are defined as FLT_RADIX, FLT_MANT_DIG, FLT_MIN_EXP, and FLT_MAX_EXP.
Ignore the sign. Write the absolute value of number as a rational number n/d in simplest form. If it is an integer, let d be 1.
If d is not a power of b, the number is not representable in the format.
If n is a multiple of b and is greater than or equal to b^p, divide n by b and divide d by b (d may become a fraction of the form 1/b^j); repeat until n is not a multiple of b or is less than b^p.
If n is greater than or equal to b^p, the number is not representable in the format.
Let e be such that 1/d = b^(e-p). If emin ≤ e ≤ emax, the number is representable in the format. Otherwise, it is not.
Some floating-point formats might not support subnormal numbers, in which f_1 is zero. This is indicated by FLT_HAS_SUBNORM being defined to be zero, and it would require modifications to the above.
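As an illustration, here is a C sketch of the procedure above, specialized to IEEE-754 binary32 (b = 2, p = 24, emin = -125, emax = 128); the function name is invented for this example, and the caller is expected to pass the magnitude as n/d already in lowest terms:

#include <stdio.h>

static int representable_as_float(unsigned long long n, unsigned long long d)
{
    /* d must be a power of 2; track d = 2^k. */
    int k = 0;
    while (d % 2 == 0) { d /= 2; k++; }
    if (d != 1)
        return 0;

    /* Scale n down while it is even and too large; each step lowers k. */
    while (n % 2 == 0 && n >= (1ULL << 24)) { n /= 2; k--; }
    if (n >= (1ULL << 24))
        return 0;

    /* From 1/d = 2^(e-p) with p = 24: e = 24 - k. */
    int e = 24 - k;
    return -125 <= e && e <= 128;
}

int main(void)
{
    printf("159.95  = 3199/20 -> %d\n", representable_as_float(3199, 20)); /* 0 */
    printf("159.625 = 1277/8  -> %d\n", representable_as_float(1277, 8));  /* 1 */
    return 0;
}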
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In general, floating point numbers can only represent numbers whose denominator is a power of 2.
To check whether a number can be represented as a binary floating-point value (of any floating-point type) at all, take the decimal digits after the decimal point, interpret them as a number, and check whether they can be divided by 5^n, where n is the number of digits:
159.95 => 95, 2 digits => 95%(5*5) = 20 => Cannot be represented as floating-point value
Counterexample:
159.625 => 625, 3 digits => 625%(5*5*5) = 0 => Can be represented as floating-point value
You also have to consider the fact that floating-point values only have a limited number of digits after the decimal point:
In principle, 123456789 can be represented by a floating-point value exactly (it is an integer), however float does not have enough bits!
To check if an integer value can be represented by float exactly, divide the number by 2 until the result is odd. If the result is < 2^24, the number can be represented by float exactly.
In the case of a rational number, first do the "divisible by 5^n" check described above. Then multiply the number by 2 until the result is an integer. Check if it is < 2^24.
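A minimal C sketch of the "divisible by 5^n" part of this test (the function name is invented for this example):

#include <stdio.h>

/* Interpret the digits after the decimal point as an integer and
   check divisibility by 5^(number of digits). */
static int digits_have_finite_base2(unsigned long long digits, int n)
{
    unsigned long long p5 = 1;
    for (int i = 0; i < n; i++)
        p5 *= 5;
    return digits % p5 == 0;
}

int main(void)
{
    printf("159.95  -> %d\n", digits_have_finite_base2(95, 2));   /* 0: 95 % 25 = 20  */
    printf("159.625 -> %d\n", digits_have_finite_base2(625, 3));  /* 1: 625 % 125 = 0 */
    return 0;
}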
I would like to know if there is some way to know in advance which real value... would be represented in an imprecise way
The short and only partly facetious answer is... all of them!
There are roughly 2^32 = 4294967296 values of type float. And there are an uncountably infinite number of real numbers. So, for a randomly-chosen real number, the chance that it can be exactly represented as a value of type float is 4294967296/∞, which is 0.
If you use type double, there are approximately 2^64 = 18446744073709551616 of those, so the chance that a randomly-chosen real number can be exactly represented as a double is 18446744073709551616/∞, which is again... 0.
I realize I'm not answering quite the question you asked, but in general, it's usually a bad idea to use binary floating-point types as if they were an exact representation of decimal fractions. Attempts to assume that they're ever an exact representation usually lead to trouble. In general, it's best to assume that floating-point types are an imperfect (approximate) realization of real numbers, period (that is, without assuming decimal). If you never assume they're exact (which for true real numbers, they virtually never are), you'll never get into trouble in cases where you thought they'd be exact, but they weren't.
[Footnote 1: As Eric P. reminds in a comment, there's no such thing as a "randomly-chosen real number", which is why this is a partially facetious answer.]
[Footnote 2: I now see your comment where you say that you do assume they are all imprecise, but that you would "like to understand the phenomenon in a deeper way", in which case my answer does you no good, but hopefully some of the others do. I can especially commend Martin Rosenau's answer, which goes straight to the heart of the matter: a rational number is representable exactly in base 2 if and only if its reduced denominator is a pure power of 2, or stated another way, has only 2's in its prime factorization. That's why, if you take any number you can actually store in a float or double, and print it back out using %f and enough digits, with a properly-written printf, you'll notice that the numbers always end in things like ...625 or ...375. Binary fractions are like the English rulers still used in the U.S.: everything is halves and quarters and eighths and sixteenths and thirty-seconds and sixty-fourths.]
Usually, a float is an IEEE754 binary32 float (this is not guaranteed by spec and may be different on some compilers/systems). This data type specifies a 24-bit significand; this means that if you write the number in binary, it should require no more than 24 bits excluding trailing zeros.
159.95's binary representation is 10011111.11110011001100110011... with repeating 0011 forever, so it requires an infinite number of bits to represent precisely with a binary format.
Other examples:
1073741760 has a binary representation of 111111111111111111111111000000. It has 30 bits in that representation, but only 24 significant bits (since the remainder are trailing zero bits). It has an exact float representation.
1073741761 has a binary representation of 111111111111111111111111000001. It has 30 significant bits and cannot be represented exactly as a float.
0.000000059604644775390625 has a binary representation of 0.000000000000000000000001. It has one significant bit and can be represented exactly.
0.750000059604644775390625 has a binary representation of 0.110000000000000000000001, which is 24 significant bits. It can be represented exactly as a float.
1.000000059604644775390625 has a binary representation of 1.000000000000000000000001, which is 25 significant bits. It cannot be represented exactly as a float.
Another factor (which applies to very large and very small numbers) is that the exponent is limited to the -126 to +127 range. With some handwaving around denormal values and other special cases, this generally allows values ranging from roughly 2^-126 to slightly under 2^128.
I would like to know if there is some way to know in advance which real value (in decimal system) would be represented in an imprecise way like the 159.95 number.
In another answer I semiseriously answered "all of them",
but let's look at it another way. Specifically, let's look at
which numbers can be exactly represented.
The key fact to remember is that floating point formats use binary.
(The major, popular formats, anyway.) So the numbers that can be
represented exactly are the ones with exact binary representations.
Here is a table of a few of the single-precision float values
that can be represented exactly, specifically the seven
contiguous values near 1.0.
I'm going to show them as hexadecimal fractions, binary
fractions, and decimal fractions.
(That is, along each horizontal row, all three values are exactly
the same, just represented in different bases. But note that the
fractional hexadecimal and binary representations I'm using here
are not directly acceptable in C.)
hexadecimal     binary                          decimal                      delta
0x0.fffffd      0b0.111111111111111111111101    0.999999821186065673828125   5.96e-08
0x0.fffffe      0b0.111111111111111111111110    0.999999880790710449218750   5.96e-08
0x0.ffffff      0b0.111111111111111111111111    0.999999940395355224609375   5.96e-08
0x1.000000      0b1.00000000000000000000000     1.000000000000000000000000
0x1.000002      0b1.00000000000000000000001     1.000000119209289550781250   1.19e-07
0x1.000004      0b1.00000000000000000000010     1.000000238418579101562500   1.19e-07
0x1.000006      0b1.00000000000000000000011     1.000000357627868652343750   1.19e-07
There are several things to notice about this table:
The decimal numbers look pretty weird.
The hexadecimal and binary numbers look pretty normal, and show pretty clearly that single-precision floating point has 24 bits of precision.
If you look at the decimal column, the precision seems to be about equivalent to 7 decimal digits.
It's clearly not exactly 7 digits, though.
The difference between consecutive values less than 1.0 is about 0.00000005, and greater than 1.0 is twice that, about 0.00000010. (More on this later.)
Here is a similar table for type double.
(I'm showing fewer columns because there's not enough room
horizontally for everything.)
hexadecimal            decimal                                                  delta
0x0.ffffffffffffe8     0.99999999999999966693309261245303787291049957275390625  1.11e-16
0x0.fffffffffffff0     0.99999999999999977795539507496869191527366638183593750  1.11e-16
0x0.fffffffffffff8     0.99999999999999988897769753748434595763683319091796875  1.11e-16
0x1.0000000000000      1.0000000000000000000000000000000000000000000000000000
0x1.0000000000001      1.0000000000000002220446049250313080847263336181640625   2.22e-16
0x1.0000000000002      1.0000000000000004440892098500626161694526672363281250   2.22e-16
0x1.0000000000003      1.0000000000000006661338147750939242541790008544921875   2.22e-16
You can see right away that type double has much better precision:
53 bits, or about 15 decimal digits' worth instead of 7, and with a much
finer spacing between "adjacent" numbers.
What does it mean for these numbers to be "contiguous" or
"adjacent"? Aren't real numbers continuous? Yes, true real
numbers are continuous, but we're not looking at true real
numbers: we're looking at finite-precision floating point, and we
are, literally, seeing the finite limit of the precision here.
In type float, there simply is no value — no representable
value, that is — between 1.00000000 and 1.00000012.
In type double, there is no value between 1.00000000000000000
and 1.00000000000000022.
So let's go back to your question, asking whether there's "some way
to know which decimal values are represented in a precise or imprecise way."
If you look at the nine one-digit decimal fractions between 1 and 2:
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
the answer is, only one of them is exactly representable in binary: 1.5.
If you break the interval down into 100 fractions, like this:
1.01
1.02
1.03
1.04
1.05
…
1.95
1.96
1.97
1.98
1.99
it turns out there are three fractions you can represent exactly:
.25, .50, and .75, corresponding to
¼, ½, and ¾.
If we looked at three-digit decimal fractions, there are at most
seven of them we can represent: .125, .250, .375, .500, .625, .750, and .875. These correspond to eighths, that is, ordinary
fractions with 8 in the denominator.
I said "at most seven" because it's not true (none of these
estimates are true) for all ranges of numbers. Remember,
precision is finite, and digits to the left of the decimal part
— that is, in the integral part of your numbers — count against
your precision budget, too. So it turns out that if you were to
look at the range, say, 4000000–4000001, and tried to subdivide
it, you would find that you could represent 4000000.25 and
4000000.50 as type float, but not 4000000.125 or 4000000.375.
You can't really see it if you look at the decimal
representation, but what's happening inside is that type float
has exactly 24 binary bits of available precision, and the
integer part 4000000 uses up 22 of those bits, so you've only got
two bits left over for the fractional part, and with two bits you
can do halves and quarters, but not eighths.
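You can watch exactly this happen with a short C program (assuming IEEE-754 binary32 with the usual round-to-nearest-even mode; the eighths below sit exactly halfway between representable quarters, so ties go to the even neighbor):

#include <stdio.h>

int main(void)
{
    /* Near 4,000,000 a float's spacing is 0.25: quarters survive,
       eighths get rounded to a neighboring quarter. */
    printf("%.3f\n", 4000000.25f);   /* 4000000.250 */
    printf("%.3f\n", 4000000.50f);   /* 4000000.500 */
    printf("%.3f\n", 4000000.125f);  /* 4000000.000 (tie rounded to even) */
    printf("%.3f\n", 4000000.375f);  /* 4000000.500 (tie rounded to even) */
    return 0;
}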
You're probably noticing a pattern by now: the fractions we've
looked at so far that can be represented exactly in binary
involve halves, quarters, and eighths, and if we looked further,
this pattern would continue: sixteenths, thirty-seconds,
sixty-fourths, etc. And this should come as no real surprise:
just as in decimal the "exact" fractions involve tenths,
hundredths, thousandths, etc.; when we move to binary (base 2) the fractions
all involve powers of two. ½ in binary is 0b0.1.
¼ and ¾ are 0b0.01 and 0b0.11.
⅜ and ⅝ are 0b0.011 and 0b0.101.
What about a fraction like 1/3? You can't represent it exactly
in binary, but since you can't represent it in decimal, either,
this doesn't tend to bother us too much. In decimal it's the
infinitely repeating fraction 0.333333…, and in binary it's the
infinitely-repeating fraction 0b0.0101010101….
But then we come to the humble fraction 1/10, or one tenth.
This obviously can be represented as a decimal fraction — 0.1 —
but it turns out that it cannot be represented exactly in binary.
In binary it's the infinitely-repeating fraction 0b0.0001100110011….
And this is why, as we saw above, you can't represent most of the other
"single digit" decimal fractions 0.2, 0.3, 0.4, …, either
(with the notable exception of 0.5), and you can't represent most
of the double-digit decimal fractions 0.01, 0.02, 0.03, …,
or most of the triple-digit decimal fractions, etc.
So returning once more to your question of which decimal
fractions can be represented exactly, we can say:
For single-digit fractions 0.1, 0.2, 0.3, …, we can exactly represent .5, and to be charitable we can say that we can also represent .0, so that's two out of ten, or 20%.
For double-digit fractions 0.01, 0.02, 0.03, …, we can exactly represent .00, 0.25, 0.50, and 0.75, so that's four out of a hundred, or 4%.
For three-digit fractions 0.001, 0.002, 0.003, …, we can exactly represent the eight fractions involving eighths, so that's 8/1000 = 0.8%.
So while there are some decimal fractions we can represent
exactly, there aren't very many, and the percentage seems to be
going down as we add more digits. :-(
The fact — and depending on your point of view it's either an
unfortunate fact or a sad fact or a perfectly normal fact —
is that most decimal fractions can not be represented exactly
in binary and so can not be represented exactly using computer
floating point.
The numbers that can be represented exactly using computer
floating point, although they can all be exactly converted into
numerically equivalent decimal fractions, end up converting to
rather weird-looking numbers for the most part, with lots of digits, as we saw above.
(In fact, for type float, which internally has 24 bits of
significance, the exact decimal conversions end up having up to
24 decimal digits. And the fractions always end in 5.)
One last point concerns the spacing between these "contiguous",
exactly-representable binary fractions. In the examples I've
shown, why is there tighter spacing for numbers less than 1.0
than for numbers greater than 1.0?
The answer lies in an earlier statement that "precision is
finite, and digits to the left of the decimal part count against
your precision budget, too". Switching to decimal fractions for
a moment, if I told you you had exactly 7 significant decimal
digits to work with, you could represent
1234.567
1234.568
1234.569
and
12345.67
12345.68
12345.69
but you could not represent
12345.678
because that would require 8 significant digits.
Stated another way, for numbers between 1000 and 10000 you can
have three more digits after the decimal point, but for numbers
from 10000 to 100000 you can only have two. Mathematicians call
these intervals like 1000-10000 and 10000-100000 decades,
and within each decade, all the numbers have the same number of
fractional digits for a given precision, and the same exponents:
1.000000×10^3 – 1.999999×10^3,
1.000000×10^4 – 1.999999×10^4, etc.
(This usage is rather different than ordinary usage, in which the
word "decade" refers to a period of 10 years.)
But for binary floating point, once again, the intervals of
interest involve powers of 2, not 10. (In binary, some computer
scientists call these intervals binades, by analogy with "decades".)
The interesting intervals are from 1 to 2, 2–4, 4–8, 8–16, etc.
For numbers between 1 and 2, you've got 1 bit to the left of the
decimal point (really the "binary point"), so in single precision
you've got 23 bits left over to use for the fractional part to the right.
But for numbers between 2 and 4, you've got 2 bits to the left,
so you've only got 22 bits to use for the fraction.
This works in the other direction, too: for numbers between
½ and 1, you don't need any bits to the left of the binary
point, so you can use all 24 for the fraction to the right.
(Below ½ it gets even more interesting). So that's why we
saw twice the precision (numbers half the size in the "delta"
column) for numbers just below 1.0 than for numbers just above.
We'd see similar shifts in available precision when crossing all the other
powers of two: 2.0, 4.0, 8.0, …, and also ½, ¼,
⅛, etc.
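You can measure those gaps directly with nextafterf from <math.h> (C99), which returns the neighboring representable float in a given direction; a minimal sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    float below = nextafterf(1.0f, 0.0f);  /* largest float below 1.0  */
    float above = nextafterf(1.0f, 2.0f);  /* smallest float above 1.0 */

    printf("gap below 1.0 = %g\n", 1.0f - below);  /* ~5.96e-08, i.e. 2^-24 */
    printf("gap above 1.0 = %g\n", above - 1.0f);  /* ~1.19e-07, i.e. 2^-23 */
    return 0;
}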
This has been a rather long answer, longer than I had intended.
Thanks for reading.
Hopefully now you have a better appreciation for which numbers can be
exactly represented in binary floating point, and why most of them can't.

My floating point number has extra digits when I print it

I define a floating point number as float transparency = 0.85f; and in the next line I pass it to a function -- fcn_name(transparency) -- but it turns out that the variable transparency holds the value 0.850000002, which is also what gets printed with the default settings. For the value 0.65f, it is 0.649999998.
How can I avoid this issue? I know floating point is just an approximation, but if I define a float with just a few decimals, how can I make sure it is not changed?
Floating-point values represented in binary format do not have any specific decimal precision. Just because you read in some spec that the number can represent some fixed amount of decimal digits, it doesn't really mean much. It is just a rough conversion of the physical (and meaningful) binary precision to its much less meaningful decimal approximation.
One property of binary floating-point format is that it can only represent precisely (within the limits of its mantissa width) the numbers that can be expressed as finite sums of powers of 2 (including negative powers of 2). Numbers like 0.5, 0.25, 0.75 (decimal) will be represented precisely in binary floating-point format, since these numbers are either powers of 2 (2^-1, 2^-2) or sums thereof.
Meanwhile, a number such as decimal 0.1 cannot be expressed by a finite sum of powers of 2. The representation of decimal 0.1 in binary floating-point has infinite length, which immediately means that 0.1 can never be represented precisely in a finite binary floating-point format. Note that 0.1 has only one decimal digit. However, this number is still not representable. This illustrates the fact that expressing floating-point precision in terms of decimal digits is not very useful.
Values like 0.85 and 0.65 from your example are also non-representable, which is why you see these values distorted after conversion to a finite binary floating-point format. Actually, you have to get used to the fact that most fractional decimal numbers you will encounter in everyday life will not be representable precisely in binary floating-point types, regardless of how large these floating-point types are.
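One practical consequence: to see what your variable really holds, ask printf for more digits than its default six. A minimal sketch (the digits shown in the comments assume IEEE-754 binary32):

#include <stdio.h>

int main(void)
{
    float transparency = 0.85f;

    printf("%f\n",    transparency);  /* 0.850000 : default six digits */
    printf("%.20f\n", transparency);  /* roughly 0.85000002384185791016;
                                         the nearest float to 0.85 is
                                         exactly 0.85000002384185791015625 */
    return 0;
}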
The only way I can think of solving this problem is to pass characteristic and mantissa to the function separately and let IT work on setting the values appropriately.
Also, if you want more precision, http://www.drdobbs.com/cpp/fixed-point-arithmetic-types-for-c/184401992 is an article I know of, though it applies to C++ only. (I am still searching for an equivalent C implementation.)
I tried this on VS2010,
#include <stdio.h>

void printfloat(float f)
{
    printf("%f", f);
}

int main(int argc, char *argv[])
{
    float f = 0.24f;
    printfloat(f);
    return 0;
}
OUTPUT: 0.240000

Difference in floating point comparison C

float a = 0.7;
if (a < 0.7)
    printf("true");
else
    printf("false");
OUTPUT: true
Now, if I change the value of a to, say, 1.7, then
float a = 1.7;
if (a < 1.7)
    printf("true");
else
    printf("false");
OUTPUT: false
Since 0.7 is treated as a double (HIGH PRECISION) and a is a float (LESS PRECISION), therefore a < 0.7; and in the second case it should be the same again, so it should also print true. Why the difference in output here?
PS : I have already seen this link.
Since you saw my answer to the question you linked, let's work through it and make the necessary changes to examine your second scenario:
In binary, 1.7 is:
b1.1011001100110011001100110011001100110011001100110011001100110...
However, 1.7 is a double-precision literal, whose value is 1.7 rounded to the closest representable double-precision value, which is:
b1.1011001100110011001100110011001100110011001100110011
In decimal, that's exactly:
1.6999999999999999555910790149937383830547332763671875
When you write float a = 1.7, that double value is rounded again to single-precision, and a gets the binary value:
b1.10110011001100110011010
which is exactly
1.7000000476837158203125
in decimal (note that it rounded up!)
When you do the comparison (a < 1.7), you are comparing this single-precision value (converted to double, which does not round, because all single-precision values are representable in double precision) to the original double-precision value. Because
1.7000000476837158 > 1.6999999999999999555910790149937383830547332763671875
the comparison correctly returns false, and your program prints "false".
OK, so why are the results different with 0.7 and 1.7? It's all in the rounding. Single-precision numbers have 24 bits. When we write down 0.7 in binary, it looks like this:
b.101100110011001100110011 00110011...
(there is space after the 24th bit to show where it is). Because the next digit after the 24th bit is a zero, when we round to 24 bits, we round down.
Now look at 1.7:
b1.10110011001100110011001 10011001...
because we have the leading 1., the position of the 24th bit shifts, and now the next digit after the 24th bit is a one, and we round up instead.
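A short program tying the two cases together (assuming IEEE-754 single and double precision, as above):

#include <stdio.h>

int main(void)
{
    float a = 0.7f;   /* rounds DOWN from the real number 0.7 */
    float b = 1.7f;   /* rounds UP from the real number 1.7   */

    /* Each float is promoted to double for the comparison, so we are
       really comparing the float rounding against the double rounding. */
    printf("a < 0.7 : %s\n", a < 0.7 ? "true" : "false");  /* true  */
    printf("b < 1.7 : %s\n", b < 1.7 ? "true" : "false");  /* false */
    return 0;
}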
If float and double are IEEE-754 32 bit and 64 bit floating point formats respectively, then the closest float to exactly 1.7 is ~1.7000000477, and the closest double is ~1.6999999999999999556. In this case the closest float just happens to be numerically greater than the closest double.
0.7 and 1.7 aren't represented exactly in base 2, so the stored value may be slightly more or slightly less than the exact decimal value.
It all has to do with base 2 representation of floating-point. Here is a good reference about the subject: What Every Computer Scientist Should Know About Floating-Point Arithmetic.

Can anyone explain me this feature simply?

I have the following code,
float a = 0.7;
if (0.7 > a)
    printf("Hi\n");
else
    printf("Hello\n"); // Line1
and
float a = 0.98;
if (0.98 > a)
    printf("Hi\n");
else
    printf("Hello\n"); // Line2
here Line1 outputs Hi but Line2 outputs Hello. I assumed there would be a consistent rule about double constants versus floats, i.e. that one of them would always come out larger on evaluation. But these two snippets show that sometimes the double constant ends up larger and sometimes the float does. Is there a rounding-off issue behind this? If so, please explain it to me; I badly need this cleared up.
Thanks in advance.
What you have is called representation error.
To see what is going on you might find it easier to first consider the decimal representations of 1/3, 1/2 and 2/3 stored with different precision (3 decimal places or 6 decimal places):
a = 0.333
b = 0.333333
a < b
a = 0.500
b = 0.500000
a == b
a = 0.667
b = 0.666667
a > b
Increasing the precision can make the number slightly larger, slightly smaller, or have the same value.
The same logic applies to binary floating point numbers.
float a = 0.7;
Now a is the closest single-precision floating point value to 0.7. For the comparison 0.7 > a that is promoted to double, since the type of the constant 0.7 is double, and its value is the closest double-precision floating point value to 0.7. These two values are different, since 0.7 isn't exactly representable, so one value is larger than the other.
The same applies to 0.98. Sometimes, the closest single-precision value is larger than the decimal fraction and the closest double-precision number smaller, sometimes the other way round.
This is part of What Every Computer Scientist Should Know About Floating-Point Arithmetic.
This is simply one of the issues with floating point precision.
While there are infinitely many real numbers, there are only finitely many floating point representations due to the bit constraints, so there will be rounding errors when using floats in this manner.
There is no simple criterion for whether a given value rounds up or down; it depends on where the decimal fraction happens to fall between the neighboring representable values.
See here: http://en.wikipedia.org/wiki/Floating_point, and http://en.wikipedia.org/wiki/IEEE_754 for more details.
