How to compare double numbers? - c

I know that when I would like to check if double == double I should write:
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
But what about when I want to check whether a > b or b > a?

There is no general solution for comparing floating-point numbers that contain errors from previous operations. The code that must be used is application-specific. So, to get a proper answer, you must describe your situation more specifically. For example, if you are sorting numbers in a list or other data structure, you should not use any tolerance for comparison.
Usually, if your program needs to compare two numbers for order but cannot do so because it has only approximations of those numbers, then you should redesign the program rather than try to allow numbers to be ordered incorrectly.
The underlying problem is that performing a correct computation using incorrect data is in general impossible. If you want to compute some function of two exact mathematical values x and y, but the only data you have are approximations x̃ and ỹ that contain errors from previous operations, it is generally impossible to compute the exactly correct result. For example, suppose you want to know the sum x+y, but you only know that x̃ is 3 and ỹ is 4; you do not know what the true, exact x and y are. Then you cannot compute x+y.
If you know that x̃ and ỹ are approximately x and y, then you can compute an approximation of x+y by adding x̃ and ỹ. This works when the function being computed has a reasonable derivative: slightly changing the inputs of a function with a reasonable derivative slightly changes its outputs. It fails when the function you want to compute has a discontinuity or a large derivative. For example, if you want to compute the square root of x (in the real domain) using an approximation x̃, but x̃ might be negative due to previous rounding errors, then computing sqrt(x̃) may produce an exception. Similarly, comparing for inequality or order is a discontinuous function: a slight change in inputs can change the answer completely.
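To make the discontinuity concrete, here is a minimal C sketch (mine, not from the answer) in which a value that is mathematically zero comes out slightly negative and breaks sqrt:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Mathematically, 0.3 - (0.1 + 0.2) == 0, so its square root is 0.
       In double arithmetic the result is a tiny negative number, and
       sqrt of a negative argument is a domain error. */
    double x = 0.3 - (0.1 + 0.2);
    printf("%g\n", x);        /* -5.55112e-17 */
    printf("%g\n", sqrt(x));  /* nan (or -nan, depending on the platform) */
    return 0;
}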
The common bad advice is to compare with a “tolerance”. This method trades false negatives (incorrect rejections of numbers that would satisfy the comparison if the true mathematical values were compared) for false positives (incorrect acceptance of numbers that would not satisfy the comparison).
Whether or not an application can tolerate false acceptance depends on the application. Therefore, there is no general solution.
The level of tolerance to set, and even the nature by which it is calculated, depend on the data, the errors, and the previous calculations. So, even when it is acceptable to compare with a tolerance, the amount of tolerance to use and how to calculate it depends on the application. There is no general solution.

The analogous comparisons are:
a > b - EPSILON
and
b > a - EPSILON
I am assuming that EPSILON is some small positive number.
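Putting the question's AreSame together with the ordering comparisons above, a minimal C sketch (EPSILON and the helper names are illustrative placeholders, and the caveats from the first answer still apply):

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

#define EPSILON 1e-9  /* application-specific; see the caveats above */

bool are_same(double a, double b)    /* a == b, within tolerance */
{
    return fabs(a - b) < EPSILON;
}

bool is_greater(double a, double b)  /* a > b, where a ~ b also passes */
{
    return a > b - EPSILON;
}

int main(void)
{
    printf("%d\n", are_same(0.1 + 0.2, 0.3));   /* 1 */
    printf("%d\n", is_greater(0.3, 0.1 + 0.2)); /* 1: within tolerance */
    return 0;
}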

Related

Numerical Integration in fortran with infinity as one of the limits

I am asked to normalize a probability distribution P = A(x^2)(e^-x) on 0 to infinity by finding the value of A. I know algorithms to compute the numerical value of an integral, but how do I deal with one of the limits being infinity?
The only way I have been able to solve this problem with good accuracy (full accuracy, in fact) is by doing some math first, in order to obtain the Taylor series that represents the integral of the integrand.
I have been looking for my sample code, but I can't find it. I'll edit my post if I get a working solution.
The basic idea is to calculate all the derivatives of the function exp(-(x*x)) and use the coefficients to derive the series of the integral (by dividing each coefficient by one more than the exponent of x in the corresponding term). This gives you the Taylor series of the integral. (I recommend using the unnormalized version described above to get simple numeric coefficients, then adjusting the result by multiplying by the proper constants.) You'll get a Taylor series with good convergence, giving precise values at full precision. (Direct quadrature requires a lot of subdivision, and you cannot divide an unbounded interval into a finite number of intervals that are all finite.)
I'll edit this post if I find the code I wrote (so stay tuned, and don't change the channel :) )
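While we wait for the original code, here is a minimal C sketch of the series idea for the answer's example integrand exp(-x^2) (everything here is illustrative; the question's integrand x^2 e^-x would need its own series and constants):

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Antiderivative F(x) of exp(-x^2) with F(0) = 0, from the Taylor series
   F(x) = sum over n >= 0 of (-1)^n x^(2n+1) / (n! (2n+1)).
   The series converges for all x, but large x causes cancellation,
   so keep x moderate. */
double integral_exp_neg_x2(double x)
{
    double term = x;  /* n = 0 term */
    double sum  = x;
    for (int n = 1; n < 200; ++n) {
        /* term_n / term_(n-1) = -x^2 (2n-1) / (n (2n+1)) */
        term *= -x * x * (2.0 * n - 1.0) / (n * (2.0 * n + 1.0));
        sum  += term;
        if (fabs(term) < DBL_EPSILON * fabs(sum))
            break;  /* further terms no longer change the sum */
    }
    return sum;
}

int main(void)
{
    printf("%.15f\n", integral_exp_neg_x2(1.0)); /* ~0.746824132812427 */
    return 0;
}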

Matlab's bvp4c: output arrays not always the same length as the initial guess

The Matlab function bvp4c solves boundary value problems. It takes a differential equation, boundary conditions and an initial guess as input, and returns a structure array containing arrays of x, y and yp (which stands for "y prime", or y').
The length of the output arrays should be the same as that of the initial guess, but I found that it isn't always. I have checked the dimensions of the input (the initial guess, always 1x101 double for x and 16x101 double for y) and the output (sometimes 1x101 double for x and 16x101 double for y and yp as it should be, but often different values, such as 1x91 double and 16x91 double or 1x175 double and 16x175 double).
Looking at the output array x when its length is off, some extra values are squeezed in, or some are taken out. For example, the initial guess has 100 positions between x=0 and x=1, and the x array should be [0 0.01 0.02 ... 1], but sometimes a new position like 0.015 shows up.
Question: Why does this happen, and how can this be solved?
"The length of the output arrays should be the same as that of the initial guess ...." This is incorrect.
As described in the bvp4c documentation, sol.x contains a "[mesh] selected by bvp4c" with an "[approximation] to y(x) at the mesh points of sol.x". In order to evaluate bvp4c's solution on your mesh, use deval.
Why does bvp4c choose a mesh? Quoting from the cited paper [1], which is available in full from MathWorks if you have an account:
Because BVPs can have more than one solution, BVP codes require users to supply a guess for the solution desired. The guess includes a guess for an initial mesh that reveals the behavior of the desired solution. The codes then adapt the mesh so as to obtain an accurate numerical solution with a modest number of mesh points.
Because a steady BVP generally has global behavior that depends strongly on its boundary values, the spatial mesh between the two boundaries may need to be refined in order to properly approximate the desired solution with the method's locally chosen basis functions. However, there may also be portions of the mesh that do not need refinement and can even be coarsened in some cases while maintaining a reasonably small residual and an accurate approximation. Therefore, for general efficiency, the guess mesh is adaptively refined or coarsened according to some locally evaluated metric (since bvp4c is collocation based, the metric is presumably evaluated per mesh point or per subinterval) such that the mesh returned by bvp4c is, in some sense, adequate for generic interpolation within the boundaries.
I'll also note that this is different from numerically solving IVPs, whose state is not global across the entire integration interval: each step depends only on the current state (and possibly on previous steps, when using a multi-step method or solving a delay differential equation), which makes the refinement inherently local. This local behavior of IVPs is what allows functions like ode45 to return a solution at pre-selected time values: the solver can locally refine the solution at each requested point while performing the time march (this is known as dense output).
[1] Shampine, L.F., M.W. Reichelt, and J. Kierzenka, "Solving Boundary Value Problems for Ordinary Differential Equations in MATLAB with bvp4c".

are floating point numbers changed when sorted in perl?

I'm running a statistical bootstrap with 10,000 permutations, and I'm trying to compare an observed value against the results. The observed value is supposed to be identical to the max of the 10k permutations. The way I am measuring this is by attempting to find its percentile.
All results of the 10k permutations (10,000 random numbers) are stored in an array, which I sort using:
my @sorted = sort { $a <=> $b } @permutednumbers;
When I then compare the observed value $truevalue, I'm getting an inaccurate comparison. These are stored as floating point numbers. The bootstrapping procedure uses the same formula for generating the random number so it should be absolutely identical, but when comparing the same value, it becomes inaccurate. I'm testing this with:
if ($sorted[$#sorted] == $truevalue) {
    print "sorted: $sorted[$#sorted] is eq truevalue:$truevalue\n";
} elsif ($sorted[$#sorted] > $truevalue) {
    print "sorted: $sorted[$#sorted] is gt truevalue:$truevalue\n";
} elsif ($sorted[$#sorted] < $truevalue) {
    print "sorted: $sorted[$#sorted] is lt truevalue:$truevalue, totalpermvalues; $totalpermvalues\n";
}
output:
sorted: 0.937864522389543 is gt truevalue:0.937864522389543
So I get that floating point numbers aren't printed in complete accuracy, but I always assumed internally the computer stores the correct numbers. Is that not a correct assumption? Of course I can fix this quickly by changing them into integers of some sort, but is this something that I should be doing automatically all the time? Are floating point numbers just dangerous to use? Those exact values should be identical given that they are outputs of identical inputs, which is what is confusing me...
If this matters, the values are individually calculated using the linear_interpolate function in the Math::Interpolate package, but the inputs are identical.
If I understand correctly, you are wondering why == is returning false and > is returning true for what appear to be identical numbers. Obviously, the numbers are not actually identical. You can see this by printing more digits.
printf "sorted: %.80e is gt truevalue:%.80e\n", $sorted[$#sorted], $truevalue;
No, sort will not change values. One has to assume that there is a difference in the way these two values have been produced.
It is most certainly possible to use == with floating point numbers (FPNs): it returns true if the two 64-bit quantities are identical. But one has to be very careful when asking the question "Are these two FPNs equal?"
A (relatively small, but still considerable) set of integers and rational numbers can be represented exactly in an FPN. For these (and only these), questions such as "Is the FPN a equal to 1.5?" (written as $a == 1.5) may make sense, but only if you are confident about the genesis of the value in $a. Don't take this lightly: will both of the following statements print "1"?
print 0.12345678901234567 == 1.2345678901234567E-1,"\n";
print 0.12345678901234567 == 12.345678901234567E-2,"\n";
An FPN does not only stand for the one value x that it represents exactly. It is also responsible for an interval of real numbers, including rational, irrational, and transcendental (and even integer) numbers "a little greater and a little smaller" than x. You can quantify "a little": it is about 1e-16 for x == 1.0, and it shrinks or grows proportionally to x. So, for instance, 1 + 1e-17 will be 1.0 on your computer. You can type that number in, but the FPN will be 1.0 all the same. Asking whether the FPN result of some computation equals 1 + 1e-17 doesn't make sense, since you cannot even tell the computer that value.
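A quick C illustration of that granularity (nextafter, from math.h, returns the adjacent representable double; the printed values assume IEEE 754 binary64):

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%d\n", 1.0 + 1e-17 == 1.0);      /* 1: the 1e-17 is absorbed */
    printf("%.17g\n", nextafter(1.0, 2.0));  /* 1.0000000000000002 */
    printf("%.17g\n", nextafter(1.0, 0.0));  /* 0.99999999999999989 */
    return 0;
}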
The solution isn't difficult. Instead of asking for equality you have to ask "Is the FPN a in an interval [p,q] around x?" Determining p and q should be given a little thought, as a suitable choice of these values primarily depends on x. The usual formula is something like
abs( $a - $expect ) <= $expect*PRECISION
where PRECISION could be, for instance, 1e-12. (The value to use here may depend on the algorithm you use for computing $a, or on your needs, or both.)
Finally: due to the mathematical properties of FP machine instructions, the usual arithmetic laws of associativity and distributivity are not guaranteed. The effect of truncation in addition or subtraction may, for instance, cause heavy distortion in the result. A typical example to illustrate this: compute some Taylor series, once adding terms in decreasing order until they fall below a given limit, and once adding the same terms in increasing order.
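A small C sketch of that experiment, summing the Taylor terms of e in both orders (the terms are the same; only the order differs, and the two sums typically disagree in the last bits):

#include <stdio.h>

int main(void)
{
    enum { N = 20 };
    double term[N];
    term[0] = 1.0;                  /* Taylor terms of e: 1/n! */
    for (int n = 1; n < N; ++n)
        term[n] = term[n - 1] / n;

    double fwd = 0.0, bwd = 0.0;
    for (int n = 0; n < N; ++n)      /* large terms first */
        fwd += term[n];
    for (int n = N - 1; n >= 0; --n) /* small terms first */
        bwd += term[n];

    printf("forward:  %.17g\n", fwd);
    printf("backward: %.17g\n", bwd);  /* usually the more accurate sum */
    return 0;
}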

Gamma in the Baum Welch algorithm and float precision

I am currently trying to implement the Baum-Welch algorithm in C, but I run into the following problem with the gamma function:
gamma(i,t) = alpha(i,t) * beta(i,t) / (sum over i of alpha(i,t) * beta(i,t))
Unfortunately, for large enough observation sets, alpha drops rapidly to 0 as t increases, and beta drops rapidly to 0 as t decreases, meaning that, due to rounding down, there is never a spot where both alpha and beta are non-zero, which makes things rather problematic.
Is there a way around this problem, or should I just try to increase the precision of the values? I fear the problem would just pop up again, as alpha and beta drop by about one order of magnitude per observation.
You should do these computations, and generally all computations for probability models, in log-space:
lg_gamma(i, t) = (lg_alpha(i, t) + lg_beta(i, t)
- logsumexp over i of (lg_alpha(i, t) + lg_beta(i, t)))
where lg_gamma(i, t) represents the logarithm of gamma(i, t), etc., and logsumexp(x_1, ..., x_n) computes log(exp(x_1) + ... + exp(x_n)) in a numerically stable way. At the end of the computation, you can convert back to probabilities using exp, if needed (that's typically only needed for displaying probabilities, and even there logs may be preferable).
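A self-contained C sketch of logsumexp (factoring out the maximum keeps every exp argument non-positive, so nothing overflows; the test values are illustrative):

#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* log(sum over i of exp(x[i])), computed stably by factoring out the max. */
double logsumexp(const double *x, size_t n)
{
    double m = x[0];
    for (size_t i = 1; i < n; ++i)
        if (x[i] > m)
            m = x[i];
    if (isinf(m) && m < 0.0)
        return m;               /* all inputs are log(0) */
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += exp(x[i] - m);     /* each term is in (0, 1] */
    return m + log(s);
}

int main(void)
{
    double lg[3] = { -1000.0, -1001.0, -1002.0 };  /* naive exp underflows */
    printf("%.15g\n", logsumexp(lg, 3));           /* ~ -999.592394 */
    return 0;
}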
The base of the logarithm is not important, as long as you use the same base everywhere. I prefer the natural logarithm, because log saves typing compared to log2 :)

How to convert floating point input to integers and preserve maximum precision?

I have to use an algorithm that expects a matrix of integers as input. The input I have is real-valued, so I want to convert it to integers before passing it to the algorithm.
I thought of scaling the input by a large constant and then rounding to integers. This looks like a good solution, but how does one decide on a good constant, especially since the range of the float input can vary from case to case? Any other ideas are also welcome.
Probably the best general answer to this question is to find out the maximum integer value that your algorithm can accept as a matrix element without causing overflow inside the algorithm itself. Once you have this maximum, find the maximum floating-point magnitude in your input data, scale your inputs by the ratio of these two values, and round to the nearest integer (avoid truncation).
In practice you probably cannot do this, because you probably cannot determine the maximum integer value the algorithm accepts without overflowing. Perhaps you don't know the details of the algorithm, or it depends in a complicated way on all of the input values. In that case, you'll just have to pick an arbitrary maximum input value that seems to work well enough.
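A minimal C sketch of that ratio-and-round recipe (INT_BOUND is a hypothetical stand-in for whatever maximum the algorithm tolerates, not something from the question):

#include <math.h>
#include <stddef.h>
#include <stdio.h>

#define INT_BOUND 1000000L  /* hypothetical largest safe matrix element */

/* Scale n doubles into out[] so the largest magnitude maps to INT_BOUND. */
void scale_to_ints(const double *in, long *out, size_t n)
{
    double max_abs = 0.0;
    for (size_t i = 0; i < n; ++i)
        if (fabs(in[i]) > max_abs)
            max_abs = fabs(in[i]);
    double k = (max_abs > 0.0) ? (double)INT_BOUND / max_abs : 0.0;
    for (size_t i = 0; i < n; ++i)
        out[i] = lround(in[i] * k);  /* round to nearest, don't truncate */
}

int main(void)
{
    double in[4] = { 0.25, -1.5, 3.0, 0.0 };
    long out[4];
    scale_to_ints(in, out, 4);
    for (int i = 0; i < 4; ++i)
        printf("%ld\n", out[i]);  /* 83333 -500000 1000000 0 */
    return 0;
}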
First normalize your input to the [0,1) range, then use the common affine map to scale it:
f(x) = (range_max_exclusive - range_min_inclusive) * x + range_min_inclusive
After that, round f(x) to integer (or cast, if truncation is acceptable). This way you can handle situations where the real values are in [0,1) or in [0,n) with n > 1.
In general, your favourite library already contains matrix operations with which you can implement this technique easily, and with better performance than a hand-rolled implementation.
EDIT: Scaling down and then scaling up is sure to lose some precision. I favor it because a normalization operation generally comes with the library. You can also skip the downscaling by using:
f(x) = (range_max_exclusive - range_min_inclusive) / max_element * x + range_min_inclusive
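The same idea as a C sketch, folding the normalization into a single affine map (the names are illustrative):

#include <math.h>
#include <stdio.h>

/* Map x (with known maximum magnitude max_element) into
   [range_min, range_max) and round to a long. */
long map_to_int(double x, double max_element, long range_min, long range_max)
{
    double f = (double)(range_max - range_min) / max_element * x
             + (double)range_min;
    return lround(f);
}

int main(void)
{
    printf("%ld\n", map_to_int(2.5, 5.0, 0, 1000000));  /* 500000 */
    return 0;
}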
