Is "multistage rounding" of fractional numbers a thing? - rounding-error

Problem:
Due to precision errors I get a number like 0.004999999999. Mathematically speaking, 0.00499(9) equals 0.005 exactly, but not in computers.
That level of precision is fine internally, but when I display the number the user expects to see 0.01, i.e. the value rounded to two decimal places. Obviously, any sane rounding algorithm returns 0.00 instead, since 0.004999999999 is closer to 0.00 than to 0.01.
But the user understandably expects to see 0.01.
Solution(?):
It can be done with "multistage rounding" like 0.004999999999.round(10).round(2), given that we internally calculate everything to a precision of 10 decimal places.
This seems like a very common problem, but surprisingly I couldn't find any conventional solution to it.

There's nothing wrong with the double-rounding approach; just be aware that binary floating point can't represent most decimal fractions exactly, so sometimes the results aren't what you expect. Your example works in Python, but another very similar one does not:
>>> round(round(0.004999999999, 10), 2)
0.01
>>> round(round(0.044999999999, 10), 2)
0.04
In the first case the first rounding produces a number just over 0.005, while in the second it produces a number just under 0.045; the exact values are 0.005000000000000000104083408558608425664715468883514404296875 and 0.04499999999999999833466546306226518936455249786376953125. Nothing closer is representable in binary. If your language has a decimal number type, you could use that to get consistent results by avoiding the conversion to binary.
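C has no built-in decimal type, but the same idea can be sketched with scaled integers: hold the value as an exact count of 10^-12 units, and both rounding steps become exact integer arithmetic. The unit choice and the round_half_up helper below are illustrative assumptions, not part of the answer above.
#include <stdio.h>

/* Illustrative helper: half-up rounding of a non-negative integer count
   of decimal units to a coarser decimal step. All arithmetic is exact. */
static long long round_half_up(long long v, long long step)
{
    return (v + step / 2) / step * step;
}

int main(void)
{
    /* 0.044999999999 stored exactly as a count of 1e-12 units. */
    long long v = 44999999999LL;

    long long r10 = round_half_up(v, 100LL);            /* to 10 places: 0.0450000000 */
    long long r2  = round_half_up(r10, 10000000000LL);  /* to 2 places:  0.05 */

    printf("%.12f -> %.2f\n", v / 1e12, r2 / 1e12);     /* 0.044999999999 -> 0.05 */
    return 0;
}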
An alternate approach is to multiply the number by a factor very slightly greater than 1, so that values just under a rounding boundary get pushed over it.
>>> round(0.004999999999 * 1.000000001, 2)
0.01
>>> round(0.044999999999 * 1.000000001, 2)
0.05
Again, this approach won't work for every number, but it prioritizes the case you care about: you're more likely to get numbers rounded up when they should be rounded down.

Related

Floating point inaccuracies in C

I know floating point values are limited in the numbers they can express accurately, and I have found many sites that describe why this happens. But I have not found any information on how to deal with this problem efficiently. I'm sure NASA isn't OK with 0.2/0.1 = 0.199999. Example:
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    float number = 4.20;
    float denominator = 0.25;

    printf("number = %f\n", number);
    printf("denominator = %f\n", denominator);
    printf("quotient as a float = %f should be 16.8\n", number/denominator);
    printf("the remainder of 4.20 / 0.25 = %f\n", number - ((int) number/denominator)*denominator);
    printf("now if i divide 0.20000 by 0.1 i get %f not 2\n", (number - ((int) number/denominator)*denominator)/0.1);
}
output:
number = 4.200000
denominator = 0.250000
quotient as a float = 16.799999 should be 16.8
the remainder of 4.20 / 0.25 = 0.200000
now if i divide 0.20000 by 0.1 i get 1.999998 not 2
So how do I do arithmetic with floats (or decimals or doubles) and get accurate results? I hope I haven't just missed something super obvious. Any help would be awesome! Thanks.
The solution is to not use floats for applications where you can't accept roundoff errors. Use an extended precision library (a.k.a. arbitrary precision library) like GNU MP Bignum. See this Wikipedia page for a nice list of arbitrary-precision libraries. See also the Wikipedia article on rational data types and this thread for more info.
If you are going to use floating point representations (float, double, etc.) then write code using accepted methods for dealing with roundoff errors (e.g., avoiding ==). There's lots of on-line literature about how to do this and the methods vary widely depending on the application and algorithms involved.
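As a small sketch of that advice (avoiding == in particular), here is one common pattern, comparing with a relative tolerance. The nearly_equal helper and the 1e-9 tolerance are illustrative choices for this example, not a standard API:
#include <math.h>
#include <stdio.h>

/* Compare doubles with a relative tolerance instead of ==.
   The tolerance must be chosen to suit the application. */
static int nearly_equal(double a, double b, double rel_tol)
{
    return fabs(a - b) <= rel_tol * fmax(fabs(a), fabs(b));
}

int main(void)
{
    double s = 0.1 + 0.2;                        /* 0.30000000000000004... */
    printf("%d\n", s == 0.3);                    /* 0 on IEEE-754 systems */
    printf("%d\n", nearly_equal(s, 0.3, 1e-9));  /* 1 */
    return 0;
}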
Floating point is fine, most of the time. Here are the key things I try to keep in mind:
There's really a big difference between float and double. double gives you enough precision for most things, most of the time; float surprisingly often doesn't. Unless you know what you're doing and have a really good reason, just always use double.
There are some things that floating point is not good for. Although C doesn't support it natively, fixed point is often a good alternative. You're essentially using fixed point if you do your financial calculations in cents rather than dollars -- that is, if you use an int or a long int representing pennies, and remember to put a decimal point two places from the right when it's time to print out as dollars.
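A minimal sketch of the cents idea (the printed figures assume IEEE-754 doubles):
#include <stdio.h>

int main(void)
{
    /* Fixed point: hold money as integer cents; format as dollars only on output. */
    long cents = 0;
    for (int i = 0; i < 10; i++)
        cents += 10;                  /* ten cents, ten times: exactly 100 */
    printf("$%ld.%02ld\n", cents / 100, cents % 100);   /* $1.00 */

    /* The same running total in dollars-as-double drifts: */
    double dollars = 0.0;
    for (int i = 0; i < 10; i++)
        dollars += 0.10;
    printf("%.17f\n", dollars);       /* 0.99999999999999989, not 1.0 */
    return 0;
}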
The algorithm you use can really matter. Naïve or "obvious" algorithms can easily end up magnifying the effects of roundoff error, while more sophisticated algorithms minimize them. One simple example is that the order you add up floating-point numbers can matter.
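A toy illustration of ordering effects, assuming IEEE-754 floats with round-to-nearest:
#include <stdio.h>

int main(void)
{
    /* Big-first: each 1.0f is below half a ULP of 1e8f and rounds away. */
    float a = 1e8f;
    for (int i = 0; i < 100; i++)
        a += 1.0f;

    /* Small-first: the 1.0fs accumulate exactly before meeting the big term. */
    float b = 0.0f;
    for (int i = 0; i < 100; i++)
        b += 1.0f;
    b += 1e8f;

    printf("%f\n%f\n", a, b);   /* 100000000.000000 vs 100000096.000000 */
    return 0;
}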
Never worry about 16.8 versus 16.799999. That sort of thing always happens, but it's not a problem, unless you make it a problem. If you want one place past the decimal, just print it using %.1f, and printf will round it for you. (Also don't try to compare floating-point numbers for exact equality, but I assume you've heard that by now.)
Related to the above, remember that 0.1 is not representable exactly in binary (just as 1/3 is not representable exactly in decimal). This is just one of many reasons that you'll always get what look like tiny roundoff "errors", even though they're perfectly normal and needn't cause problems.
Occasionally you need a multiple precision (MP or "bignum") library, which can represent numbers to arbitrary precision, but these are (relatively) slow and (relatively) cumbersome to use, and fortunately you usually don't need them. But it's good to know they exist, and if you're a math nerd they can be a lot of fun to use.
Occasionally a library for representing rational numbers is useful. Such a library represents, for example, the number 1/3 as the pair of numbers (1, 3), so it doesn't have the inaccuracies inherent in trying to represent that number as 0.333333333.
Others have recommended the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is very good, and the standard reference, although it's long and fairly technical. An easier and shorter read I can recommend is this handout from a class I used to teach: https://www.eskimo.com/~scs/cclass/handouts/sciprog.html#precision . This is a little dated by now, but it should get you started on the basics.
There isn't a good answer, and it's often a problem.
If the data is integral, e.g. amounts of money in cents, then store it as integers; that can even mean a double constrained to hold a whole number of cents rather than a fractional number of dollars. But that only helps in a few circumstances.
As a general rule, you get inaccuracies when dividing by numbers that are close to zero, so you have to write your algorithms to avoid or suppress such operations. There are lots of discussions of "numerically stable" versus "unstable" algorithms, and it's too big a subject to do justice here. Beyond that, it's usually best to treat floating point numbers as though they have small random errors; if they ultimately represent measurements of analogue values in the real world, there must be a certain tolerance or inaccuracy in them anyway.
If you are doing maths rather than processing data, simply don't use C or C++. Use a symbolic algebra package such as Maple, which stores values like sqrt(2) as expressions rather than floating point numbers, so sqrt(2) * sqrt(2) will always give exactly 2, rather than a number very close to 2.

Calculate square root using integer arithmetic

I want to calculate the square root of some integer without using any floating point arithmetic. The catch, however, is that I don't want to discard precision from the output. That is to say, I do not want a rounded integer as the result; I would like to get the digits after the decimal point as well, at least the first two. As an example:
sqrt(9) = 3
sqrt(10) = 3.16
sqrt(999999) = 999.99
I've been thinking about it but I haven't particularly come up with solutions, nor has searching helped much since most similar questions are just that, only similar.
Output is acceptable in any form which is not a floating point number and accurately represents the data. Preferably, I would have two ints, one for the portion before the decimal and one for the portion after the decimal.
I'm okay with just pseudo-code or an explained algorithm, though C would be best. Thanks.
You can calculate an integer numerator and an integer denominator, such that the floating-point division of the numerator by the denominator will yield the square root of the input number.
Please note, however, that no square-root method exists such that the result is 100% accurate for every natural number, as the square root of such a number can be irrational.
Here is the algorithm; it is Newton's method applied to a fraction root = numerator/denominator:
Function (input number, input num_of_iterations, output root):
    Set root.numerator = number
    Set root.denominator = 1
    Run num_of_iterations times:
        Set root = root - (root^2 - number) / (root * 2)
Carried out on the numerator/denominator pair, each step works out to numerator' = numerator^2 + number * denominator^2 and denominator' = 2 * numerator * denominator, so only integer operations are needed.
You might find this C++ implementation useful (it also includes the conversion of the numerator divided by the denominator into a numerical string with predefined floating-point precision).
Please note that no floating-point operations are required (as demonstrated at the given link).
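If you'd rather avoid the fraction entirely, here is a related sketch (an assumption on my part, not the linked implementation) that runs Newton's method in plain integers on a scaled input. Multiplying by 100^2 = 10000 makes the integer square root carry the two digits after the decimal point the question asks for, truncated as in the sqrt(10) = 3.16 example:
#include <stdio.h>

/* Integer Newton's method: largest s such that s*s <= n. */
static unsigned long long isqrt(unsigned long long n)
{
    if (n < 2)
        return n;
    unsigned long long x = n, y = (x + 1) / 2;
    while (y < x) {
        x = y;
        y = (x + n / x) / 2;
    }
    return x;
}

int main(void)
{
    unsigned long long tests[] = { 9, 10, 999999 };
    for (int i = 0; i < 3; i++) {
        unsigned long long s = isqrt(tests[i] * 10000ULL);
        /* prints 3.00, 3.16, 999.99 */
        printf("sqrt(%llu) = %llu.%02llu\n", tests[i], s / 100, s % 100);
    }
    return 0;
}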

upper bound for the floating point error for a number

There are many questions (and answers) on this subject, but I am too thick to figure it out. In C, for a floating point of a given type, say double:
double x;
scanf("%lf", &x);
Is there a generic way to calculate an upper bound (as small as possible) for the error between the decimal fraction string passed to scanf and the internal representation of what is now in x?
If I understand correctly, there is sometimes going to be an error, and it will increase as the absolute value of the decimal fraction increases (in other words, 0.1 will be a bit off, but 100000000.1 will be off by much more).
This aspect of the C standard is slightly under-specified, but you can expect the conversion from decimal to double to be within one Unit in the Last Place of the original.
You seem to be looking for a bound on the absolute error of the conversion. With the above assumption, you can compute such a bound as DBL_EPSILON * fabs(x). DBL_EPSILON is typically 2^-52.
A tighter bound on the error that can have been made during the conversion can be computed as follows:
double va = fabs(x);
double error = nextafter(va, INFINITY) - va;
The best conversion functions guarantee a result within half a ULP in the default round-to-nearest mode. If you are using conversion functions with this guarantee, you can divide the bound I offer by two.
The above applies when the original number represented in decimal is 0 or when its absolute value lies between DBL_MIN (approx. 2*10^-308) and DBL_MAX (approx. 2*10^308). If the nonzero decimal number's absolute value is lower than DBL_MIN, then the absolute error is only bounded by DBL_MIN * DBL_EPSILON. If the absolute value is higher than DBL_MAX, you are likely to get infinity as the result of the conversion.
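Wrapped into a runnable program, under the same assumptions, this is just the snippet above plus the coarser DBL_EPSILON bound:
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x;
    if (scanf("%lf", &x) != 1)
        return 1;

    double va = fabs(x);
    double ulp = nextafter(va, INFINITY) - va;   /* one unit in the last place */
    double coarse = DBL_EPSILON * va;            /* simpler bound, normal range only */

    printf("one ULP:  %.17g\n", ulp);
    printf("eps*|x|:  %.17g\n", coarse);
    return 0;
}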
You can't think of this in terms of base 10; the error is in base 2, which won't necessarily correspond to a specific decimal place in base 10.
There are two underlying issues in your question. First, scanf takes an ASCII string and converts it to a binary number; that is one piece of software which uses a number of C libraries, and I have seen, for example, compile-time parsing and runtime parsing give different conversion results on the same system. So in terms of error, if you want an exact number, convert it yourself and place that binary number in the register/variable; otherwise accept what you get from the conversion and understand there may be rounding or clipping you didn't expect (which results in an accuracy issue: you didn't get the number you expected).
The second and real problem Pascal already answered: you only have a fixed number of binary places. In decimal terms, if you had 3 decimal places, the number 1.2345 would have to be represented as either 1.234 or 1.235. The same goes for binary: with 3 bits of mantissa, 1.0011 becomes either 1.001 or 1.010 depending on rounding. The mantissa length for IEEE floating point numbers is well documented; you can simply google how many binary places you get for each precision.
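For reference, those widths are also exposed by <float.h> itself:
#include <float.h>
#include <stdio.h>

int main(void)
{
    printf("float:  %d mantissa bits, %d reliable decimal digits\n",
           FLT_MANT_DIG, FLT_DIG);   /* typically 24 and 6 */
    printf("double: %d mantissa bits, %d reliable decimal digits\n",
           DBL_MANT_DIG, DBL_DIG);   /* typically 53 and 15 */
    return 0;
}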

trouble with double truncation and math in C

I'm making a function that fits balls into boxes. The code that computes the number of balls that can fit on each side of the box is below. Assume that the balls pack as if they were cubes; I know this is not the optimal way, but just go with it.
The problem is that although I get numbers like 4.000000 * 4.000000 * 2.000000, the product is 31 instead of 32. What's going on?
Two additional things: this error only happens when the optimal side length is reached. For example, with a side length of 12.2, a box thickness of .1, and a ball radius of 1.5, exactly 4 balls fit along that side. If I DON'T cast to int, it works out, but if I do cast to int, I get the aforementioned error (31 instead of 32). Also, the print line runs once if the side length is optimal but twice if it's not; I don't know what that means.
double ballsFit(double r, double l, double w, double h, double boxthick)
{
    double ballsInL, ballsInW, ballsInH;
    int ballsinbox;

    ballsInL = (int)((l-(2*boxthick))/(r*2));
    ballsInW = (int)((w-(2*boxthick))/(r*2));
    ballsInH = (int)((h-(2*boxthick))/(r*2));
    ballsinbox = (ballsInL*ballsInW*ballsInH);

    printf("LENGTH=%f\nWidth=%f\nHight=%f\nBALLS=%d\n", ballsInL, ballsInW, ballsInH, ballsinbox);
    return ballsinbox;
}
The fundamental problem is that floating-point math is inexact.
For example, the number 0.1 -- that you mention as the value of thickness in the problematic example -- cannot be represented exactly as a double. When you assign 0.1 to a variable, what gets stored is an approximation of 0.1.
I recommend that you read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
although I get numbers like 4.000000 * 4.000000 * 2.000000, the product is 31 instead of 32. What's going on?
It is almost certainly the case that the multiplicands (at least some of them) are not what they look like. If they were exactly 4.0, 4.0 and 2.0, their product would be exactly 32.0. If you printed out all the digits that the doubles are capable of representing, I am pretty sure you'd see lots of 9s, as in 3.99999999999... etc. As a consequence, the product is a tiny bit less than 32. The double-to-int conversion simply chops off the fractional part, so you end up with 31.
Of course, you don't always get numbers that are less than what they would be if the computation were exact; you can also get numbers that are greater than what you might expect.
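One common mitigation, offered here as a sketch rather than something from the answer above: round to the nearest integer before converting, e.g. with lround from <math.h>. The 0.29 * 100 expression below is just a stand-in that reproduces the just-under-an-integer effect:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double count = 0.29 * 100.0;    /* mathematically 29; actually 28.999999999999996 */

    int truncated = (int)count;     /* conversion chops the fraction: 28 */
    long rounded  = lround(count);  /* round to nearest first: 29 */

    printf("%.17f -> trunc %d, lround %ld\n", count, truncated, rounded);
    return 0;
}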
Fixed precision floating point numbers, such as the IEEE-754 numbers commonly used in modern computers, cannot represent all decimal numbers accurately, much like 1/3 cannot be represented accurately in decimal.
For example, 0.1 can come back as something along the lines of 0.100000000000000004... when converted to binary and back. The difference is small, but significant.
I have occasionally managed to (partially) deal with such issues by using extended or arbitrary precision arithmetic to maintain a degree of precision while computing and then down-converting to double for the final results. There is usually a noticeable drop in performance, but IMHO correctness is infinitely more important.
I recently used algorithms from the high-precision arithmetic libraries listed here with good results on both the precision and performance fronts.

Confusion with floating point numbers

#include <stdio.h>

int main()
{
    float x = 3.4e2;
    printf("%f", x);
    return 0;
}
Output:
340.000000 // It's ok.
But if I write x=3.1234e2 the output is 312.339996, and if x=3.12345678e2 the output is 312.345673.
Why are the outputs like these? I think if I write x=3.1234e2 the output should be 312.340000, but the actual output is 312.339996 using the GCC compiler.
Not all fractional numbers have an exact binary equivalent, so the value is rounded to the nearest one that does.
A simplified example: if you have 3 bits for the fraction, you can have:
0
0.125
0.25
0.375
...
0.5 has an exact representation, but 0.1 will be shown as 0.125.
Of course the real differences are much smaller.
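The same effect at real float granularity, assuming a 32-bit IEEE-754 float; printing extra digits makes the stored neighbour of 0.1 visible:
#include <stdio.h>

int main(void)
{
    float f = 0.1f;            /* nearest float, not exactly 0.1 */
    printf("%.20f\n", f);      /* 0.10000000149011611938... */
    return 0;
}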
Floating-point numbers are normally represented as binary fractions times a power of two, for efficiency. This is about as accurate as base-10 representation, except that there are decimal fractions that cannot be exactly represented as binary fractions. They are, instead, represented as approximations.
Moreover, a float is normally 32 bits long, which means that it doesn't have all that many significant digits. You can see in your examples that they're accurate to about 8 significant digits.
You are, however, printing the numbers to slightly beyond their significance, and therefore you're seeing the difference. Look at your printf format string documentation to see how to print fewer digits.
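For instance, with the question's value (assuming the usual 32-bit float):
#include <stdio.h>

int main(void)
{
    float x = 3.1234e2;
    printf("%f\n", x);      /* 312.339996: %f defaults to 6 fractional digits */
    printf("%.4f\n", x);    /* 312.3400: within float's reliable precision */
    return 0;
}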
You may need to represent decimal numbers exactly; this often happens in financial applications. In that case, you need to use a special library to represent numbers, or simply calculate everything as integers (such as representing amounts as cents rather than as dollars and fractions of a dollar).
The standard reference is What Every Computer Scientist Should Know About Floating-Point Arithmetic, but it looks like that would be very advanced for you. Alternatively, you could Google floating-point formats (particularly IEEE standard formats) or look them up on Wikipedia, if you wanted the details.
