Floating point again - C

Yesterday I asked a floating point question, and I have another one. I am doing some computations where I use the results of the math.h (C language) sine, cosine and tangent functions.
One of the developers muttered that you have to be careful with the return values of these functions, and that I should not make assumptions about the return values of the gcc math functions. I am not trying to start a discussion, but I really want to know what I need to watch out for when doing computations with the standard math functions.

You should not assume that the values returned will be consistent to high degrees of precision between different compiler/stdlib versions.
That's about it.

You should not expect sin(PI/6) to be equal to cos(PI/3), for example. Nor should you expect asin(sin(x)) to be equal to x, even if x is in the domain for sin. They will be close, but might not be equal.

Floating point is straightforward. Just always remember that there is an uncertainty component to all floating point operations and functions. It is usually modelled as being random, even though it usually isn't, but if you treat it as random, you'll be able to reason correctly about your own code. For instance:
a=a/3*3;
This should be treated as if it was:
a=(a/3+error1)*3+error2;
If you want an estimate of the size of the errors, you need to dig into each operation/function to find out. Different compilers, parameter choices, etc. will yield different values. For instance, 0.09 - 0.089999 on a system with 5 digits of precision will yield an error somewhere between -0.000001 and 0.000001. This error is comparable in size to the actual result (0.000001).
If you want to learn how to do floating point computations as precisely as possible, that is a field of study in its own right.
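A quick way to watch the accumulated error appear is to add 0.1 to itself repeatedly; 0.1 is not exactly representable in binary, and each addition rounds. A minimal sketch, assuming IEEE-754 double:
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;                /* each addition rounds the running sum */
    printf("%.17g\n", sum);        /* prints 0.99999999999999989, not 1 */
    printf("%d\n", sum == 1.0);    /* prints 0 */
    return 0;
}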

The problem isn't with the standard math functions, so much as the nature of floating point arithmetic.
Very short version: don't compare two floating point numbers for equality, even with obvious, trivial identities like 10 == 10 / 3.0 * 3.0 or tan(x) == sin(x) / cos(x).

You should take care about precision:
the structure of a floating-point number;
whether you are on a 32-bit or a 64-bit platform;
you should read the IEEE Standard for Binary Floating-Point Arithmetic;
there are some interesting libraries, such as GMP or MPFR;
you should learn how to compare floating-point numbers;
etc.

Agreed with all of the responses that say you should not compare for equality. What you can do, however, is check whether the numbers are close enough, like so:
if (fabs(numberA - numberB) < CLOSE_ENOUGH)   /* fabs from <math.h>; plain abs would truncate to int */
{
    /* Equal for all intents and purposes */
}
Where CLOSE_ENOUGH is some suitably small floating-point value.
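A fixed CLOSE_ENOUGH is too strict for numbers of large magnitude and too loose for tiny ones, so a relative tolerance often works better. A sketch, where REL_TOL is a hypothetical application-chosen constant, not anything standard:
#include <math.h>

#define REL_TOL 1e-9

int nearly_equal(double a, double b)
{
    /* scale the tolerance by the larger operand's magnitude */
    return fabs(a - b) <= REL_TOL * fmax(fabs(a), fabs(b));
}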

Related

Why didn't I get tan(PI/2) = infinity in C

When I calculate tan(PI/2) in C I get -22877332, but tan(PI/2) is infinity, and Google gives it as 3060023.30695. Why am I getting a different answer? I tried it with the MinGW compiler and in Google, and they give different answers.
/* float32 is presumably a project-specific typedef for float */
typedef float float32;

float32 Tan_f32(float32 ValValue)
{
    float32 Result_Val;
    Result_Val = tanf(ValValue);   /* tanf from <math.h> */
    return Result_Val;
}
With the MinGW compiler it gives -22877332, and Google gives 3060023.30695.
It is impossible to pass π/2 to tan or tanf because π is irrational, so any floating-point number, no matter how precise, will be at least slightly different from π/2. Therefore, tanf(ValValue) returns the tangent of some value close to π/2, and that tangent is large but not infinite.
In the common format used for float, IEEE-754 basic 32-bit binary floating-point, the closest representable number to π/2 is 1.57079637050628662109375. The tangent of that number is approximately −22877332.4289, and the closest value representable in float is −22877332, which is the result you got. So your tanf is giving you the best possible result for the input number you gave it.
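You can see both behaviours in one small program. A sketch, assuming IEEE-754 float and double and a faithfully-rounded tan; the printed values match the analysis above:
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* neither constant is exactly pi/2, so neither tangent is infinite */
    float  pf = 3.14159265358979323846f / 2.0f;   /* nearest float, slightly above pi/2 */
    double pd = 3.14159265358979323846 / 2.0;     /* nearest double, slightly below pi/2 */
    printf("tanf(float pi/2) = %f\n", tanf(pf));  /* about -22877332 */
    printf("tan(double pi/2) = %e\n", tan(pd));   /* about 1.633124e+16 */
    return 0;
}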
Neither the C standard nor the common (but by no means ubiquitous) IEEE-754 floating point standard gives any guarantee of the accuracy of tan (contrast sqrt, which IEEE-754 does require to be correctly rounded). An implementation will make a compromise between getting a good result and taking a reasonable number of clock cycles.
In particular, the behaviour of the trigonometric function near an asymptote is particularly unpredictable; and that's the case here.
Accepting that the fault is not due to your value of pi (worth a check although note that because pi is transcendental it can't be represented exactly in any floating point system), if you want a well-behaved tan function across the whole domain, you'll be better off using a third party mathematics library.
Finally, note that under IEEE-754 you might get more consistent behaviour around an asymptote if you let floating point division deal with the pole, using the identity tan^2(x) = 1/cos^2(x) - 1:
double c = cos(x);
double t = sqrt(1.0 / (c * c) - 1.0);   /* |tan(x)|; the sign must be restored separately */
This might be more numerically stable, as IEEE-754 defines division by zero (it yields infinity).

Floating point inaccuracies in C

I know floating point values are limited in the numbers they can express accurately, and I have found many sites that describe why this happens. But I have not found any information about how to deal with this problem efficiently. I'm sure NASA isn't OK with 0.2/0.1 = 1.999998. Example:
#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    float number = 4.20;
    float denominator = 0.25;
    printf("number = %f\n", number);
    printf("denominator = %f\n", denominator);
    printf("quotient as a float = %f should be 16.8\n", number / denominator);
    /* note: the cast binds to number alone, i.e. ((int)number) / denominator */
    printf("the remainder of 4.20 / 0.25 = %f\n", number - ((int) number / denominator) * denominator);
    printf("now if i divide 0.20000 by 0.1 i get %f not 2\n", (number - ((int) number / denominator) * denominator) / 0.1);
}
output:
number = 4.200000
denominator = 0.250000
quotient as a float = 16.799999 should be 16.8
the remainder of 4.20 / 0.25 = 0.200000
now if i divide 0.20000 by 0.1 i get 1.999998 not 2
So how do I do arithmetic with floats (or decimals or doubles) and get accurate results? Hope I haven't just missed something super obvious. Any help would be awesome! Thanks.
The solution is to not use floats for applications where you can't accept roundoff errors. Use an extended precision library (a.k.a. arbitrary precision library) like GNU MP Bignum. See this Wikipedia page for a nice list of arbitrary-precision libraries. See also the Wikipedia article on rational data types and this thread for more info.
If you are going to use floating point representations (float, double, etc.) then write code using accepted methods for dealing with roundoff errors (e.g., avoiding ==). There's lots of on-line literature about how to do this and the methods vary widely depending on the application and algorithms involved.
Floating point is pretty fine, most of the time. Here are the key things I try to keep in mind:
There's really a big difference between float and double. double gives you enough precision for most things, most of the time; float surprisingly often gives you not enough. Unless you know what you're doing and have a really good reason, just always use double.
There are some things that floating point is not good for. Although C doesn't support it natively, fixed point is often a good alternative. You're essentially using fixed point if you do your financial calculations in cents rather than dollars -- that is, if you use an int or a long int representing pennies, and remember to put a decimal point two places from the right when it's time to print out as dollars.
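A minimal sketch of the pennies approach; print_dollars is a hypothetical helper, not any standard API:
#include <stdio.h>
#include <stdlib.h>

/* money as a count of cents in a long: integer arithmetic, no roundoff */
static void print_dollars(long cents)
{
    printf("$%ld.%02ld\n", cents / 100, labs(cents % 100));
}

int main(void)
{
    long price = 1999;         /* $19.99, stored exactly */
    long total = 3 * price;    /* exact: no floating point involved */
    print_dollars(total);      /* prints $59.97 */
    return 0;
}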
The algorithm you use can really matter. Naïve or "obvious" algorithms can easily end up magnifying the effects of roundoff error, while more sophisticated algorithms minimize them. One simple example is that the order you add up floating-point numbers can matter.
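One classic illustration of ordering: add many small terms to a large value first versus last. A sketch using float (so the effect shows up quickly), assuming IEEE-754:
#include <stdio.h>

int main(void)
{
    float sum1 = 1.0e8f;
    float sum2 = 0.0f;
    for (int i = 0; i < 10000; i++) {
        sum1 += 1.0f;    /* each 1.0f is under half an ulp of 1e8 and is rounded away */
        sum2 += 1.0f;    /* small terms accumulate exactly here */
    }
    sum2 += 1.0e8f;      /* add the large term last */
    printf("%.1f\n%.1f\n", sum1, sum2);   /* 100000000.0 versus 100010000.0 */
    return 0;
}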
Never worry about 16.8 versus 16.799999. That sort of thing always happens, but it's not a problem, unless you make it a problem. If you want one place past the decimal, just print it using %.1f, and printf will round it for you. (Also don't try to compare floating-point numbers for exact equality, but I assume you've heard that by now.)
Related to the above, remember that 0.1 is not representable exactly in binary (just as 1/3 is not representable exactly in decimal). This is just one of many reasons that you'll always get what look like tiny roundoff "errors", even though they're perfectly normal and needn't cause problems.
Occasionally you need a multiple precision (MP or "bignum") library, which can represent numbers to arbitrary precision, but these are (relatively) slow and (relatively) cumbersome to use, and fortunately you usually don't need them. But it's good to know they exist, and if you're a math nerd they can be a lot of fun to use.
Occasionally a library for representing rational numbers is useful. Such a library represents, for example, the number 1/3 as the pair of numbers (1, 3), so it doesn't have the inaccuracies inherent in trying to represent that number as 0.333333333.
Others have recommended the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is very good and is the standard reference, although it's long and fairly technical. An easier and shorter read I can recommend is this handout from a class I used to teach: https://www.eskimo.com/~scs/cclass/handouts/sciprog.html#precision . It is a little dated by now, but it should get you started on the basics.
There isn't a good answer, and it's often a problem.
If the data is integral, e.g. amounts of money in cents, then store it as integers; that can even mean a double that is constrained to hold an integer number of cents rather than a fractional number of dollars. But that only helps in a few circumstances.
As a general rule, you get inaccuracies when trying to divide by numbers that are close to zero. So you just have to write the algorithms to avoid or suppress such operations. There are lots of discussions of "numerically stable" versus "unstable" algorithms and it's too big a subject to do justice to it here. And then, usually, it's best to treat floating point numbers as though they have small random errors. If they ultimately represent measurements of analogue values in the real world, there must be a certain tolerance or inaccuracy in them anyway.
If you are doing maths rather than processing data, simply don't use C or C++. Use a symbolic algebra package such as Maple, which stores values such as sqrt(2) as an expression rather than a floating point number, so sqrt(2) * sqrt(2) will always give exactly 2, rather than a number very close to 2.
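For contrast, here is what the same identity looks like with C doubles (a sketch assuming IEEE-754; sqrt itself is correctly rounded, but the rounded result squared is no longer exactly 2):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double r = sqrt(2.0) * sqrt(2.0);
    printf("%.17g\n", r);         /* prints 2.0000000000000004 */
    printf("%d\n", r == 2.0);     /* prints 0 */
    return 0;
}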

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
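You can at least check what your compiler claims at translation time. FLT_EVAL_METHOD comes from <float.h> in C99; a minimal sketch:
#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 0: evaluate in each operand's own type
       1: evaluate float and double as double
       2: evaluate everything as long double (typical for x87 code)
      -1: indeterminate */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}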
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
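A quick demonstration of that edge, assuming IEEE-754 double:
#include <stdio.h>

int main(void)
{
    double d = 9007199254740992.0;    /* 2^53 */
    printf("%.1f\n", d);              /* 9007199254740992.0 */
    printf("%.1f\n", d + 1.0);        /* still 9007199254740992.0: 2^53 + 1 has no double */
    printf("%.1f\n", d + 2.0);        /* 9007199254740994.0 */
    return 0;
}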
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).

Why do float calculation results differ in C and on my calculator?

I am working on a problem, and the results returned by my C program are not as good as those returned by a simple calculator; they are not equally precise, to be precise.
On my calculator, when I divide 2000008 by 3, I get 666669.333333
But in my C program, I am getting 666669.312500
This is what I'm doing-
printf("%f\n",2000008/(float)3);
Why are the results different? What should I do to get the same result as the calculator? I tried double, but then it returns the result in a different format. Do I need to go through a conversion? Please help.
See http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for an in-depth explanation.
In short, floating point numbers are approximations of real numbers, and there is a limit on the number of digits they can hold. With float this limit is quite small; with double it is larger, but still not perfect.
Try
printf("%20.12lf\n",(double)2000008/(double)3);
and you'll see a better, but still not perfect result. What it boils down to is, you should never assume floating point numbers to be precise. They aren't.
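Putting the float and double results side by side (a sketch; the printed values are for a typical IEEE-754 system):
#include <stdio.h>

int main(void)
{
    printf("float:  %f\n", 2000008 / (float)3);   /* 666669.312500 */
    printf("double: %f\n", 2000008 / 3.0);        /* 666669.333333 */
    return 0;
}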
Floating point numbers take a fixed amount of memory and therefore have a limited precision. Limited precision means you can't represent all possible real numbers, and that in turn means that some calculations result in rounding errors. Use double instead of float to gain extra precision, but mind you that even a double can't represent everything even if it's enough for most practical purposes.
Gunthram summarizes it very well in his answer:
What it boils down to is, you should never assume floating point numbers to be precise. They aren't.

Fortran/C Interlanguage problems: results differ in the 14th digit

I have to use C and Fortran together to do some simulations. In the course of these, I use the same memory in both programming language parts, by defining a pointer in C to access memory allocated by Fortran.
The datatype of the problematic variable is
real(kind=8)
for Fortran, and
double
for C. The results of the same calculations now differ between the two languages, and I need to compare them directly and get a difference of zero. All calculations are done only at the above precision. The difference is always in the 13th-14th digit.
What would be a good way to resolve this? Any compiler flags? Just cut off after some digits?
Many thanks!
Floating point is not perfectly accurate. Ever. Even cos(x) == cos(y) can be false if x == y.
So when doing your comparisons, take this into account, and allow the values to differ by some small epsilon value.
This is a problem with the inaccuracy of floating point numbers - they will be inaccurate at a certain place. You usually compare them either by rounding them to a digit that you know is within the accurate range, or by providing an epsilon of appropriate value (small enough not to impact further calculations, and big enough to take care of the inaccuracy while comparing).
One thing you might check is to be sure that the FPU control word is the same in both cases. If it is set to 53-bit precision in one case and 64-bit in the other, it would likely produce different results. You can use the instructions fstcw and fldcw to read and load the control word value. Nonetheless, as others have mentioned, you should not depend on the accuracy being identical even if you can make it work in one situation.
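On x86 with glibc you can inspect the control word from C; a sketch using the glibc-specific <fpu_control.h> header (not portable):
#include <stdio.h>
#include <fpu_control.h>    /* glibc extension, x86 only */

int main(void)
{
    fpu_control_t cw;
    _FPU_GETCW(cw);         /* wraps the fstcw instruction */
    /* bits 8-9 are the x87 precision control field:
       0 = 24-bit (single), 2 = 53-bit (double), 3 = 64-bit (extended) */
    printf("control word = 0x%04x, precision field = %d\n",
           (unsigned)cw, (int)((cw >> 8) & 3));
    return 0;
}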
Perfect portability is very difficult to achieve in floating point operations. Changing the order of the machine instructions might change the rounding. One compiler might keep values in registers, while another copies them to memory, which can change the precision. Currently the Fortran and C languages allow a certain amount of latitude. The IEEE module of Fortran 2008, when implemented, will allow requiring more specific and therefore more portable floating point computations.
Since you are compiling for an x86 architecture, it's likely that one of the compilers is maintaining intermediate values in floating point registers, which are 80 bits as opposed to the 64 bits of a C double.
For GCC, you can supply the -ffloat-store option to inhibit this optimisation. You may also need to change the code to explicitly store some intermediate results in double variables. Some experimentation is likely in order.
