I'm trying to implement printf, and I want to know how printf rounds floating-point numbers, because I cannot find a general rule.
For example, given this input:
printf("|%.f| |%.1f| |%.2f| |%.5f| |%.12f", 0.000099, 0.000099, 0.000099, 0.000099, 0.000099);
the output is:
|0| |0.0| |0.00| |0.00010| |0.000099000000
I am using the IEEE-754 representation, so the floating-point number actually in memory is: 0.000098999999999999994037755413067714016506215557456016540527343750
My question is when and how should I round my floating-point number?
I am looking for a general rule that I must follow for all floating-point numbers.
Not sure if this is exactly how printf does it, but this seems to work for your example:
Add 5/(10^(number of decimal places + 1)), then truncate.
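For illustration, here is a minimal sketch of that idea in C. naive_round is a hypothetical helper, it assumes a non-negative input, and the scaling arithmetic is itself done in binary floating point, so it is not correctly rounded in every case:

#include <math.h>
#include <stdio.h>

/* Naive rounding sketch: add 5/10^(n+1), then truncate to n decimal places.
   Assumes x >= 0; the intermediate arithmetic can itself introduce error. */
static double naive_round(double x, int n)
{
    double scale = pow(10.0, n);
    return trunc((x + 5.0 / (scale * 10.0)) * scale) / scale;
}

int main(void)
{
    printf("%.17g\n", naive_round(0.000099, 5));  /* roughly 0.0001 */
    return 0;
}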
Your interpretation is flawed: unsuffixed floating constants in C have type double, so your compiler is passing doubles to printf. It's not using the float value 0.0000989999...; it's using the more accurate double equivalent.
Try this:
printf("|%.f| |%.1f| |%.2f| |%.5f| |%.12f", (float)0.000099, (float)0.000099, (float)0.000099, (float)0.000099, (float)0.000099);
Output:
|0| |0.0| |0.00| |0.00010| |0.000098999997
The problem you are trying to address is actually very difficult to solve correctly.
Many existing printf implementations use conversion code dtoa.c written by David M. Gay almost 30 years ago.
You can learn more about this from this question, which is not an exact duplicate:
Why does "dtoa.c" contain so much code?
And these sites:
Rick Regan's blog article https://www.exploringbinary.com/quick-and-dirty-floating-point-to-decimal-conversion/
Clinger's How to Read Floating Point Numbers Accurately
David M. Gay's paper, Correctly Rounded Binary-Decimal and Decimal-Binary Conversions.
David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic.
The C standard provides the following in §7.21.6.1p13:
For e, E, f, F, g, and G conversions, if the number of significant decimal digits is at most DECIMAL_DIG, then the result should be correctly rounded. If the number of significant decimal digits is more than DECIMAL_DIG but the source value is exactly representable with DECIMAL_DIG digits, then the result should be an exact representation with trailing zeros. Otherwise, the source value is bounded by two adjacent decimal strings L<U, both having DECIMAL_DIG significant digits; the value of the resultant decimal string D should satisfy L ≤ D ≤ U, with the extra stipulation that the error should have a correct sign for the current rounding direction.
However, that paragraph is part of a subsection headed "Recommended Practice". (If Annex F is in effect, then the recommended practice is required. See F.5.)
"Correctly rounded" is defined by the current rounding direction. See fesetround.
Related
My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to single using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2^−9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2^−9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
            ^
            difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
                                   ^
                                   changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
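You can see those two neighboring doubles directly with nextafter. A small sketch; the exact digits assume IEEE-754 binary64:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double d = 0.001044397222448;
    printf("%.20f\n", nextafter(d, 0.0));  /* nearest double just below */
    printf("%.20f\n", d);                  /* 0.00104439722244799998... */
    printf("%.20f\n", nextafter(d, 1.0));  /* nearest double just above */
    return 0;
}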
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly be thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840 • 2^−62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304 • 2^−33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
All single precisions should be the same
So here is the thing. According to documentation, both MATLAB and C comply with the IEEE 754 standard, which means that there should not be any difference between what is actually stored in memory.
You could compute the binary representation by hand, but according to this handy website (thanks #Danijel), the representation of 0.001044397222448 should be 0x3a88e428.
The question is: how precise is your representation? It is a bit tricky with floating point, but the short answer is that your number is accurate up to the 9th decimal place and has digits down to the 33rd decimal place. If you want the long answer, see the two paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in MATLAB use: fprintf('%.33f', value_float);
To do so in C use: printf("%.33f\n", gf);
About floating point precision
Now in more detail, the question was: how precise is this representation? Well, the tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is 32 bits wide and is divided into 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as (-1)^sign * 2^(exponent-127) * 1.fraction. This basically means that the maximal error/precision (depending on how you want to call it) is 2^(exponent-127-23); the 23 here accounts for the 23 bits of the fraction. (There are a few edge cases; I won't elaborate on them.) In our case the exponent is 117, which means your precision is 2^(117-127-23) = 2^-33 = 1.16415321826934814453125e-10. That means that your single-precision float should represent your number accurately up to the 9th decimal place; after that it is up to luck.
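If you want to inspect those fields yourself, here is a small sketch in C, assuming float is IEEE-754 binary32 and using memcpy to read the bits portably:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 0.001044397222448f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* copy the raw bit pattern */

    uint32_t sign     = bits >> 31;            /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23 bits */

    /* Expect 0x3a88e428: sign 0, exponent 117, per the discussion above. */
    printf("0x%08x sign=%u exponent=%u fraction=0x%06x\n",
           (unsigned)bits, (unsigned)sign, (unsigned)exponent, (unsigned)fraction);
    return 0;
}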
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.
This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
So I have been trying to make my own printf, and now I'm stuck at %f.
The problem I have is that I don't know what printf does in the background when I give it a float number like f = 1.4769996: it prints 1.477000.
But when I give it f = 1.4759995 it prints the value 1.475999.
float f = 1.4769996;
printf("%f\n", f); // 1.477000
f = 1.4759995;
printf("%f\n", f); // 1.475999
What I thought was that printf sees the 5 at the end and adds one, but that doesn't work in the second example.
What is the logic behind this floating point?
Your C implementation likely uses the IEEE-754 binary32 and binary64 formats for float and double. Given this, float f = 1.4769996; results in setting f to 1.47699964046478271484375, and f = 1.4759995; results in setting f to 1.47599947452545166015625.
Then it is easy to see that rounding 1.47699964046478271484375 to six digits after the decimal point results in 1.477000 (because the next digit is 6, so we round up), and rounding 1.47599947452545166015625 to six digits after the decimal point results in 1.475999 (because the next digit is 4, so we round down).
When working with floating-point numbers, it is important to understand that each floating-point value represents one number exactly (unless it is a Not a Number [NaN] encoding). When you write 1.4769996 in source code, it is converted to a value representable in double. When you assign it to a float, it is converted to a value representable in float. Operations on the floating-point object behave as if the object has exactly the value it represents, not as if its value is the numeral you wrote in source code.
To provide some further details, the C standard requires (in C 2018 7.21.6.1 13) that formatting with f be correctly rounded if the number of digits requested is at most DECIMAL_DIG. DECIMAL_DIG is the number of decimal digits in the widest floating-point format the implementation supports such that converting any number in that format to a numeral with DECIMAL_DIG significant decimal digits and back to the floating-point format yields the original value (5.2.4.2.2 12). DECIMAL_DIG must be at least 10. If more than DECIMAL_DIG digits are requested, the C standard allows some leeway in rounding. However, high-quality C implementations will round correctly as specified by IEEE-754 (to the nearest number with the requested number of digits, with ties favoring an even low digit).
If you are trying to write your own printf, and if you are stuck on %f, there are three or four things you need to know:
When a "varargs" function like printf is called, arguments of type float are always implicitly promoted to type double. So when you've seen %f in the format string, and you're using va_arg() to pluck the next argument from the list, you'll want to pluck an argument of type double, not float. (This also means that you have just one case to handle, not two. Inside printf, you don't have to worry about handling type float at all.)
Printing the whole-number part of a double is easy; it's more or less the same problem as printing an int, which I'm guessing you've already figured out, if you've got %d working. And to do a straightforward, simpleminded job of printing the fractional part, it usually works pretty well to just repeatedly multiply by 10. That is, if you're trying to print 123.456, and you've already got the 123 part taken care of, you can then proceed to print the rest by taking the fractional part 0.456, multiplying by 10 to get 4.56 then truncating to get 4, then taking the new fractional part 0.56 and repeating.
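A simpleminded sketch of that multiply-by-10 loop; print_fraction is a hypothetical helper, and it truncates rather than rounds the last digit, which is exactly the hard part discussed below:

#include <stdio.h>

/* Print 'digits' decimal places of frac, where 0 <= frac < 1. */
static void print_fraction(double frac, int digits)
{
    putchar('.');
    while (digits-- > 0) {
        frac *= 10.0;            /* shift the next digit into the ones place */
        int d = (int)frac;       /* truncate to get that digit */
        putchar('0' + d);
        frac -= d;               /* keep the remaining fractional part */
    }
}

int main(void)
{
    printf("123");               /* whole part, handled like %d */
    print_fraction(0.456, 6);    /* prints .456000 (roughly) */
    putchar('\n');
    return 0;
}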
There is no such number as 1.4769996. (There's no such number as the 123.456 I was just using, either.) When we write numbers like 1.4769996 and 123.456 we're thinking about decimal fractions, but most computers (including the one you're using) use binary fractions internally, and you can't represent decimal fractions like 1.4769996 and 123.456 exactly in binary, so the actual numbers are always a little bit different than you expect, which is why you often get slight "roundoff error", or extra 999's at the end when you expected 000.
Doing a proper job on this stuff is really, really hard. If you're trying to write your own printf, and you've gotten to %f, and if you can get it working pretty well most of the time, consider yourself lucky, and call it a day. Don't get bogged down on the last digit -- or if you're bound and determined to get the last digit right in every case (which is certainly a noble goal), do some research and set aside some time, because you're going to be working at it for a while.
This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
When given the input 150, I expect the output to be the mathematically correct answer 70685.7750, but I am getting the wrong output 70685.7812.
#include <stdio.h>

int main()
{
    float A, n, R;

    n = 3.14159;
    scanf("%f", &R);
    A = n * (R * R);
    printf("A=%.4f\n", A);
}
float and double numbers are not represented exactly in memory. The main reason is that memory is limited, and most non-integers are not.
The best example is PI. You can specify as many digits as you want, but it will still be an approximation.
The limited precision of the representation is the reason for the following rule:
when working with float and double numbers, do not check for equality (m == n), but check that the difference between them is smaller than a certain error (fabs(m - n) < e)
Please note, as mentioned in the comments too, that the above rule is not "the mother rule of all rules". There are other rules also.
Careful analysis must be done for each particular situation, in order to have a properly working application.
(Thanks #EricPostpischil for the reminder)
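A minimal sketch of such a comparison; nearly_equal is a hypothetical helper, and choosing the tolerance (and whether it should be relative or absolute) is exactly the per-situation analysis mentioned above:

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Approximate equality with a relative tolerance. */
static bool nearly_equal(double m, double n, double e)
{
    return fabs(m - n) <= e * fmax(fabs(m), fabs(n));
}

int main(void)
{
    printf("%d\n", nearly_equal(0.1 + 0.2, 0.3, 1e-9));  /* 1 (true) */
    return 0;
}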
It is common for a variable of type float to be an IEEE-754 32-bit floating point number.
The number 3.14159 cannot be stored exactly in an IEEE-754 32-bit float - the closest value is approximately 3.14159012. 150 * 150 * 3.14159012 is 70685.7777, and the closest value to this that can be represented in a 32-bit float is 70685.78125, which you are then printing with %.4f so you see 70685.7812.
Another way of thinking about this is that your n value only ends up being accurate to the sixth significant figure, so, as you are just calculating a series of multiplications, your result is also only accurate to the sixth significant figure (i.e. 70685.8). (In the general case this can be worse; for example, subtraction of two close values can lead to a large increase in the relative error.)
If you switch to using variables of type double (and change the scanf() to use %lf), then you will likely get the answer you are after. double is typically a 64-bit float, which means that the error in the representation of your n values and the result is small enough not to affect the fourth decimal place.
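For illustration, a sketch of that change, switching to double and %lf in scanf; with input 150 this should print A=70685.7750:

#include <stdio.h>

int main(void)
{
    double A, n, R;

    n = 3.14159;
    if (scanf("%lf", &R) == 1) {   /* %lf reads a double */
        A = n * (R * R);
        printf("A=%.4f\n", A);     /* %f prints a double; no l needed */
    }
    return 0;
}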
Have you heard that float and double values aren't always perfectly accurate, that they have limited precision? Have you heard that type float gives you the equivalent of only about 7 decimal digits' worth of precision? This is what that means. Your expected and actual answers, 70685.7750 and 70685.7812, differ in the seventh digit, just about as expected.
I expect the output to be the mathematically correct answer
I am sorry to disappoint you, but that's your mistake. As a general rule, when you're doing floating-point arithmetic, you will never get the mathematically correct answer, you will always get a limited-precision approximation of the mathematically correct answer.
The canonical SO answers to this sort of question are collected at Is floating point math broken?. You might want to read some of those answers for more enlightenment.
I am reading C Primer Plus by Stephen Prata, and one of the first ways it introduces floats is talking about how they are accurate to a certain point. It says specifically "The C standard provides that a float has to be able to represent at least six significant figures...A float has to represent accurately the first six numbers, for example, 33.333333"
This is odd to me, because it makes it sound like a float is accurate up to six digits, but that is not true. 1.4 is stored as 1.39999... and so on. You still have errors.
So what exactly is being provided? Is there a cutoff for how accurate a number is supposed to be?
In C, you can't store more than six significant figures in a float without getting a compiler warning, but why? If you were to do more than six figures it seems to go just as accurately.
This is made even more confusing by the section on underflow and subnormal numbers. When you have a number that is the smallest a float can be, and divide it by 10, the errors you get don't seem to be subnormal? They seem to just be the regular rounding errors mentioned above.
So why is the book saying floats are accurate to six digits and how is subnormal different from regular rounding errors?
Suppose you have a decimal numeral with q significant digits:
d_{q−1}.d_{q−2}d_{q−3}…d_0,
and let’s also make it a floating-point decimal numeral, meaning we scale it by a power of ten:
d_{q−1}.d_{q−2}d_{q−3}…d_0 • 10^e.
Next, we convert this number to float. Many such numbers cannot be exactly represented in float, so we round the result to the nearest representable value. (If there is a tie, we round to make the low digit even.) The result (if we did not overflow or underflow) is some floating-point number x. By the definition of floating-point numbers (in C 2018 5.2.4.2.2 3), it is represented by some number of digits in some base scaled by that base to a power. Supposing it is base two, x is:
b_{p−1}.b_{p−2}b_{p−3}…b_0 • 2^f (for some exponent f).
Next, we convert this float x back to decimal with q significant digits. Similarly, the float value x might not be exactly representable as a decimal numeral with q digits, so we get some possibly new number:
n_{q−1}.n_{q−2}n_{q−3}…n_0 • 10^m.
It turns out that, for any float format, there is some number q such that, if the decimal numeral we started with is limited to q digits, then the result of this round-trip conversion will equal the original number. Each decimal numeral of q digits, when rounded to float and then back to q decimal digits, results in the starting number.
In the 2018 C standard, clause 5.2.4.2.2, paragraph 12, tells us this number q must be at least 6 (a C implementation may support larger values), and the C implementation should define a preprocessor symbol for it (in float.h) called FLT_DIG.
So considering your example number, 1.4, when we convert it to float in the IEEE-754 basic 32-bit binary format, we get exactly 1.39999997615814208984375 (that is its mathematical value, shown in decimal for convenience; the actual bits in the object represent it in binary). When we convert that to decimal with full precision, we get "1.39999997615814208984375". But if we convert it to decimal rounding to six digits, we get "1.40000". So 1.4 survives the round trip.
In other words, it is not true in general that six decimal digits can be represented in float without change, but it is true that float carries enough information that you can recover six decimal digits from it.
Of course, once you start doing arithmetic, errors will generally compound, and you can no longer rely on six decimal digits.
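Here is a small sketch demonstrating the round trip, assuming IEEE-754 binary32 float; %.*g prints a given number of significant digits:

#include <float.h>
#include <stdio.h>

int main(void)
{
    float f = 1.4f;
    printf("%.*g\n", FLT_DIG, (double)f);  /* 1.4: six digits survive */
    printf("%.17g\n", (double)f);          /* full value: 1.3999999761581421 */
    return 0;
}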
Thanks to Govind Parmar for citing an on-line example of C11 (or, for that matter C99).
The "6" you're referring to is "FLT_DECIMAL_DIG".
http://c0x.coding-guidelines.com/5.2.4.2.2.html
number of decimal digits, n, such that any floating-point number with
p radix b digits can be rounded to a floating-point number with n
decimal digits and back again without change to the value,
    p log10 b            if b is a power of 10
    ⌈1 + p log10 b⌉      otherwise

FLT_DECIMAL_DIG  6
DBL_DECIMAL_DIG  10
LDBL_DECIMAL_DIG 10
"Subnormal" means:
What is a subnormal floating point number?
A number is subnormal when the exponent bits are zero and the mantissa
is non-zero. They're numbers between zero and the smallest normal
number. They don't have an implicit leading 1 in the mantissa.
STRONG SUGGESTION:
If you're unfamiliar with "floating point arithmetic" (or, frankly, even if you are), this is an excellent article to read (or review):
What Every Programmer Should Know About Floating-Point Arithmetic
This question already has answers here:
Why is printf round floating point numbers up?
(3 answers)
Closed 6 years ago.
#include <stdio.h>

void main()
{
    float f = 2.5367;
    printf("%3.3f", f);
}
Output is : 2.537
But how?
I think output should be 2.536 but the output is 2.537.
Because 2.5367 is closer to 2.537. The delta from 2.536 is 0.0007, while the delta from 2.537 is just 0.0003.
What you want to do, just deleting the last characters of its string representation, is truncation, and it is mathematically wrong.
In addition to existing answers, note that many C compilers try to follow IEEE 754 for floating-point matters. IEEE 754 recommends rounding according to the current rounding mode for conversions from binary floating-point to decimal. The default rounding mode is “round to nearest and ties to even”. Some compilation platforms do not take the rounding mode into account and always round according to the default nearest-even mode in conversions from floating-point to decimal.
Since float f = 2.5367 needs rounding up and there should be 3 digits after the decimal point, the value will be 2.537.
The documentation says:
The precision value specifies the number of digits after the decimal point. If a decimal point appears, at least one digit appears before it. The value is rounded to the appropriate number of digits.
That said, please use another book or resource for learning; whoever taught you that it's OK to use void main() is plain wrong and should not teach C.