I know floating point values are limited in the numbers the can express accurately and i have found many sites that describe why this happens. But i have not found any information of how to deal with this problem efficiently. But I'm sure NASA isn't OK with 0.2/0.1 = 0.199999. Example:
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(void)
{
float number = 4.20;
float denominator = 0.25;
printf("number = %f\n", number);
printf("denominator = %f\n", denominator);
printf("quotient as a float = %f should be 16.8\n", number/denominator);
printf("the remainder of 4.20 / 0.25 = %f\n", number - ((int) number/denominator)*denominator);
printf("now if i divide 0.20000 by 0.1 i get %f not 2\n", ( number - ((int) number/denominator)*denominator)/0.1);
}
output:
number = 4.200000
denominator = 0.250000
quotient as a float = 16.799999 should be 16.8
the remainder of 4.20 / 0.25 = 0.200000
now if i divide 0.20000 by 0.1 i get 1.999998 not 2
So how do i do arithmetic with floats (or decimals or doubles) and get accurate results. Hope i haven't just missed something super obvious. Any help would be awesome! Thanks.
The solution is to not use floats for applications where you can't accept roundoff errors. Use an extended precision library (a.k.a. arbitrary precision library) like GNU MP Bignum. See this Wikipedia page for a nice list of arbitrary-precision libraries. See also the Wikipedia article on rational data types and this thread for more info.
If you are going to use floating point representations (float, double, etc.) then write code using accepted methods for dealing with roundoff errors (e.g., avoiding ==). There's lots of on-line literature about how to do this and the methods vary widely depending on the application and algorithms involved.
Floating point is pretty fine, most of the time. Here are the key things I try to keep in mind:
There's really a big difference between float and double. double gives you enough precision for most things, most of the time; float surprisingly often gives you not enough. Unless you know what you're doing and have a really good reason, just always use double.
There are some things that floating point is not good for. Although C doesn't support it natively, fixed point is often a good alternative. You're essentially using fixed point if you do your financial calculations in cents rather than dollars -- that is, if you use an int or a long int representing pennies, and remember to put a decimal point two places from the right when it's time to print out as dollars.
The algorithm you use can really matter. Naïve or "obvious" algorithms can easily end up magnifying the effects of roundoff error, while more sophisticated algorithms minimize them. One simple example is that the order you add up floating-point numbers can matter.
Never worry about 16.8 versus 16.799999. That sort of thing always happens, but it's not a problem, unless you make it a problem. If you want one place past the decimal, just print it using %.1f, and printf will round it for you. (Also don't try to compare floating-point numbers for exact equality, but I assume you've heard that by now.)
Related to the above, remember that 0.1 is not representable exactly in binary (just as 1/3 is not representable exactly in decimal). This is just one of many reasons that you'll always get what look like tiny roundoff "errors", even though they're perfectly normal and needn't cause problems.
Occasionally you need a multiple precision (MP or "bignum") library, which can represent numbers to arbitrary precision, but these are (relatively) slow and (relatively) cumbersome to use, and fortunately you usually don't need them. But it's good to know they exist, and if you're a math nurd they can be a lot of fun to use.
Occasionally a library for representing rational numbers is useful. Such a library represents, for example, the number 1/3 as the pair of numbers (1, 3), so it doesn't have the inaccuracies inherent in trying to represent that number as 0.333333333.
Others have recommended the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is very good, and the standard reference, although it's long and fairly technical. An easier and shorter read I can recommend is this handout from a class I used to teach: https://www.eskimo.com/~scs/cclass/handouts/sciprog.html#precision . This is a little dated by now, but it should get you started on the basics.
There's isn't a good answer and it's often a problem.
If data is integral, e.g. amounts of money in cents, then store it as integers, which can mean a double that is constrained to hold an integer number of cents rather than a rational number of dollars. But that only helps in a few circumstances.
As a general rule, you get inaccuracies when trying to divide by numbers that are close to zero. So you just have to write the algorithms to avoid or suppress such operations. There are lots of discussions of "numerically stable" versus "unstable" algorithms and it's too big a subject to do justice to it here. And then, usually, it's best to treat floating point numbers as though they have small random errors. If they ultimately represent measurements of analogue values in the real world, there must be a certain tolerance or inaccuracy in them anyway.
If you are doing maths rather than processing data, simply don't use C or C++. Use a symbolic algebra package such a Maple, which stores values such as sqrt(2) as an expression rather than a floating point number, so sqrt(2) * sqrt(2) will always give exactly 2, rather than a number very close to 2.
Related
I have my code below and I want to ask what's the best way in solving numbers (division, multiplication, logarithm, exponents) up to 4 decimals places? I'm using PIC16F1789 as my device.
float sensorValue;
float sensorAverage;
void main(){
//Get an average data by testing 100 times
for(int x = 0; x < 100; x++){
// Get the total sum of all 100 data
sensorValue = (sensorValue + ADC_GetConversion(SENSOR));
}
// Get the average
sensorAverage = sensorValue/100.0;
}
In general, on MCUs, floating point types are more costly (clocks, code) to process than integer types. While this is often true for devices which have a hardware floating point unit, it becomes a vital information on devices without, like the PIC16/18 controllers. These have to emulate all floating point operations in software. This can easily cost >100 clock cycles per addition (much more for multiplication) and bloats the code.
So, best is to avoid float (not to speak of double on such systems.
For your example, the ADC returns an integer type anyway, so the summation can be done purely with integer types. You just have to make sure the summand does not overflow, so it has to hold ~100 * for your code.
Finally, to calculate the average, you can either divide the integer by the number of iterations (round to zero), or - better - apply a simple "round to nearest" by:
#define NUMBER_OF_ITERATIONS 100
sensorAverage = (sensorValue + NUMBER_OF_ITERATIONS / 2) / NUMBER_OF_ITERATIONS;
If you really want to speed up your code, set NUMBER_OF_ITERATIONS to a power of two (64 or 128 here), if your code can tolerate this.
Finally: To get not only the integer part of the division, you can treat the sum (sensoreValue) as a fractional value. For the given 100 iterations, you can treat it as decimal fraction: when converting to a string, just print a decimal point left of the lower 2 digits. As you divide by 100, there will be no more than two significal digits of decimal fraction. If you really need 4 digits, e.g. for other operations, you can multiply the sum by 100 (actually, it is 10000, but you already have multipiled it by 100 by the loop).
This is called decimal fixed point. Faster for processing (replaces multiplication by shifts) would be to use binary fixed point, as I stated above.
On PIC16, I would strongly suggest to think of using binary fraction as multiplication and division are very costly on this platform. In general, they are not well suited for signal processing. If you need to sustain some performance, an ARM Cortex-M0 or M4 would be the far better choice - at similar prices.
In your example it is trivial to avoid non-integer representations altogether, however to answer your question more generally an ISO compliant compiler will support floating point arithmetic and the library, but for performance and code size reasons you may want to avoid that.
Fixed-point arithmetic is what you probably need. For simple calculations an ad-hoc approach to fixed point can be used whereby for example you treat the units of sensorAverage in your example as hundredths (1/100), and avoid the expensive division altogether. However if you want to perform full maths library operations, then a better approach is to use a fixed-point library. One such library is presented in Optimizing Applications with Fixed-Point Arithmetic by Anthony Williams. The code is C++ and PIC16 may lack a decent C++ compiler, but the methods can be ported somewhat less elegantly to C. It also uses a huge 64bit fixed-point 36Q28 format, which would be expensive and slow on PIC16; you might want to adapt it to use 16Q16 perhaps.
If you are really concerned about performance, stick to integer arithmetics, try to make the number of samples to average a power of two so the division can be made by means of bit shifts, however if it is not a power of two lets say 100 (as Olaf point out for fixed point) you can also use bit shifts and additions: How can I multiply and divide using only bit shifting and adding?
If you are not concerned about performace and still want to work with floats (you already got warned this may not be very fast in a PIC16 and may use a lot of flash), math.h has the following functions: http://en.cppreference.com/w/c/numeric/math including exponeciation: pow(base,exp) and logarithms* only base 2, base 10 and base e, for arbitrary base use the change of base logarithmic property
I am running a computation on an embedded processor which involves a float operation like this:
a = (float)23 / (float)3; // a = 7.666....7
This is a long computation, and my app is OK with a certain amount of rounding error; I mean 7.67 or 7.66 doesn't matter. Is there a way to cut down the amount of computation time spent in computing the float, or telling math.h to 2 digits ?
Any idea how to do this ?
PS:
Now I know many will suggest using fixed point, but I have specific requirements.
Precision is determined by the data type, common types like float and double have a fixed precision that can't change, but there are libraries like libfixmath that let you perform fast non-integer maths (in this case using int32_t)
There is nothing you can 'tell' math.h (in this regard); float does not have precision options.
If you want less computational intensive calculations you must use other types or methods.
You already mention the best candidate: fixed-point. But, for obscure reasons, you say that you can't use it.
Other idea is to scale up your calculations by for example a factor 10,000, and keep everything as int (or longer). Then, only scale down to float when needed. But of course, it depends on your problem if you can do this.
Im making a functions that fits balls into boxes. the code that computes the number of balls that can fit on each side of the box is below. Assume that the balls fit together as if they were cubes. I know this is not the optimal way but just go with it.
the problem for me is that although I get numbers like 4.0000000*4.0000000*2.000000 the product is 31 instead of 32. whats going on??
two additional things, this error only happens when the optimal side length is reached; for example, the side length is 12.2, the box thickness is .1 and the ball radius is 1.5. this leads to exactly 4 balls fit on that side. if I DONT cast as an int, it works out but if I do cast as an int, I get the aforementioned error (31 instead of 32). Also, the print line runs once if the side length is optimal but twice if it's not. I don't know what that means.
double ballsFit(double r, double l, double w, double h, double boxthick)
{
double ballsInL, ballsInW, ballsInH;
int ballsinbox;
ballsInL= (int)((l-(2*boxthick))/(r*2));
ballsInW= (int)((w-(2*boxthick))/(r*2));
ballsInH= (int)((h-(2*boxthick))/(r*2));
ballsinbox=(ballsInL*ballsInW*ballsInH);
printf("LENGTH=%f\nWidth=%f\nHight=%f\nBALLS=%d\n", ballsInL, ballsInW, ballsInH, ballsinbox);
return ballsinbox;
}
The fundamental problem is that floating-point math is inexact.
For example, the number 0.1 -- that you mention as the value of thickness in the problematic example -- cannot be represented exactly as a double. When you assign 0.1 to a variable, what gets stored is an approximation of 0.1.
I recommend that you read What Every Computer Scientist Should Know About Floating-Point Arithmetic.
although I get numbers like 4.0000000*4.0000000*2.000000 the product is 31 instead of 32. whats going on??
It is almost certainly the case that the multiplicands (at least some of them) are not what they look like. If they were exactly 4.0, 4.0 and 2.0, their product would be exactly 32.0. If you printed out all the digits that the doubles are capable of representing, I am pretty sure you'd see lots of 9s, as in 3.99999999999... etc. As a consequence, the product is a tiny bit less than 32. The double-to-int conversion simply chops off the fractional part, so you end up with 31.
Of course, you don't always get numbers that are less than what they would be if the computation were exact; you can also get numbers that are greater than what you might expect.
Fixed precision floating point numbers, such as the IEEE-754 numbers commonly used in modern computers cannot represent all decimal numbers accurately - much like 1/3 cannot be represented accurately in decimal.
For example 0.1 can be something along the lines of 0.100000000000000004... when converted to binary and back. The difference is small, but significant.
I have occasionally managed to (partially) deal with such issues by using extended or arbitrary precision arithmetic to maintain a degree of precision while computing and then down-converting to double for the final results. There is usually a noticeable drop in performance, but IMHO correctness is infinitely more important.
I recently used algorithms from the high-precision arithmetic libraries listed here with good results on both the precision and performance fronts.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Dealing with accuracy problems in floating-point numbers
I was quite surprised why I tried to multiply a float in C (with GCC 3.2) and that it did not do as I expected.. As a sample:
int main() {
float nb = 3.11f;
nb *= 10;
printf("%f\n", nb);
}
Displays: 31.099998
I am curious regarding the way floats are implemented and why it produces this unexpected behavior?
First off, you can multiply floats. The problem you have is not the multiplication itself, but the original number you've used. Multiplication can lose some precision, but here the original number you've multiplied started with lost precision.
This is actually an expected behavior. floats are implemented using binary representation which means they can't accurately represent decimal values.
See MSDN for more information.
You can also see in the description of float that it has 6-7 significant digits accuracy. In your example if you round 31.099998 to 7 significant digits you will get 31.1 so it still works as expected here.
double type would of course be more accurate, but still has rounding error due to it's binary representation while the number you wrote is decimal.
If you want complete accuracy for decimal numbers, you should use a decimal type. This type exists in languages like C#. http://msdn.microsoft.com/en-us/library/system.decimal.aspx
You can also use rational numbers representation. Using two integers which will give you complete accuracy as long as you can represent the number as a division of two integers.
This is working as expected. Computers have finite precision, because they're trying to compute floating point values from integers. This leads to floating point inaccuracies.
The Floating point wikipedia page goes into far more detail on the representation and resulting accuracy problems than I could here :)
Interesting real-world side-note: this is partly why a lot of money calculations are done using integers (cents) - don't let the computer lose money with lack of precision! I want my $0.00001!
The number 3.11 cannot be represented in binary. The closest you can get with 24 significant bits is 11.0001110000101000111101, which works out to 3.1099998950958251953125 in decimal.
If your number 3.11 is supposed to represent a monetary amount, then you need to use a decimal representation.
In the Python communities we often see people surprised at this, so there are well-tested-and-debugged FAQs and tutorial sections on the issue (of course they're phrased in terms of Python, not C, but since Python delegates float arithmetic to the underlying C and hardware anyway, all the descriptions of float's mechanics still apply).
It's not the multiplication's fault, of course -- remove the statement where you multiply nb and you'll see similar issues anyway.
From Wikipedia article:
The fact that floating-point numbers
cannot precisely represent all real
numbers, and that floating-point
operations cannot precisely represent
true arithmetic operations, leads to
many surprising situations. This is
related to the finite precision with
which computers generally represent
numbers.
Floating points are not precise because they use base 2 (because it's binary: either 0 or 1) instead of base 10. And base 2 converting to base 10, as many have stated before, will cause rounding precision issues.
Yesterday I asked a floating point question, and I have another one. I am doing some computations where I use the results of the math.h (C language) sine, cosine and tangent functions.
One of the developers muttered that you have to be careful of the return values of these functions and I should not make assumptions on the return values of the gcc math functions. I am not trying to start a discussion but I really want to know what I need to watch out for when doing computations with the standard math functions.
x
You should not assume that the values returned will be consistent to high degrees of precision between different compiler/stdlib versions.
That's about it.
You should not expect sin(PI/6) to be equal to cos(PI/3), for example. Nor should you expect asin(sin(x)) to be equal to x, even if x is in the domain for sin. They will be close, but might not be equal.
Floating point is straightforward. Just always remember that there is an uncertainty component to all floating point operations and functions. It is usually modelled as being random, even though it usually isn't, but if you treat it as random, you'll succeed in understanding your own code. For instance:
a=a/3*3;
This should be treated as if it was:
a=(a/3+error1)*3+error2;
If you want an estimate of the size of the errors, you need to dig into each operation/function to find out. Different compilers, parameter choice etc. will yield different values. For instance, 0.09-0.089999 on a system with 5 digits precision will yield an error somewhere between -0.000001 and 0.000001. this error is comparable in size with the actual result.
If you want to learn how to do floating point as precise as posible, then it's a study by it's own.
The problem isn't with the standard math functions, so much as the nature of floating point arithmetic.
Very short version: don't compare two floating point numbers for equality, even with obvious, trivial identities like 10 == 10 / 3.0 * 3.0 or tan(x) == sin(x) / cos(x).
you should take care about precision:
Structure of a floating-point number
are you on 32bits, 64 bits Platform ?
you should read IEEE Standard for Binary Floating-Point Arithmetic
there are some intersting libraries such GMP, or MPFR.
you should learn how Comparing floating-point numbers
etc ...
Agreed with all of the responses that say you should not compare for equality. What you can do, however, is check if the numbers are close enough, like so:
if (abs(numberA - numberB) < CLOSE_ENOUGH)
{
// Equal for all intents and purposes
}
Where CLOSE_ENOUGH is some suitably small floating-point value.