Double precision computations - c

I am trying to compute numerically (using analytical formulae) the values of the following sequence of integrals:
I(k,t) = int_0^{N/2-1} u^k e^(-i*u*delta*t) du
where "i" is the imaginary unit. For small k, this integral can be computed by hand, but for larger k it is more convenient to notice that there is an iterative relationship between the terms of sequence that can be derived by integration by parts. This is implemented below by the function i1.
#include <complex.h>
#include <math.h>
#include <stdlib.h>

/* Fills *result with I(k,t) for k = 0 .. N-1 using the recurrence above. */
void i1(int N, double t, double delta, double complex **result){
    unsigned int k;
    (*result) = (double complex*)malloc(sizeof(double complex) * N);
    if (t == 0) {
        /* I(k,0) = M^(k+1)/(k+1), with M = (N-2)/2 */
        for (k = 0; k < N; k++) {
            (*result)[k] = pow(N - 2, k + 1) / (pow(2, k + 1) * (k + 1));
        }
    }
    else {
        (*result)[0] = 2 / (delta * t) * sin(delta * (N - 2) * t / 4) * cexp(-I * (N - 2) * t * delta / 4);
        for (k = 1; k < N; k++) {
            (*result)[k] = I / (delta * t) * (pow(N - 2, k) / pow(2, k) * cexp(-I * delta * (N - 2) * t / 2) - k * (*result)[k - 1]);
        }
    }
}
The problem is that in my case t is very small (1e-12) and delta is typically around 1e6. When testing the case N=4, I noticed some weird results for k=3: the results were suddenly very large, much larger than they should be, since the norm of an integral is always smaller than the integral of the norm. The results of the test are printed below:
I1(0,1.0000e-12)=1.0000000000e+00+-5.0000000000e-07I
Norm=1.0000000000e+00
compare = 1.0000000000e+00
I1(1,1.0000e-12)=5.0000000000e-01+-3.3328895199e-07I
Norm=5.0000000000e-01
compare = 5.0000000000e-01
I1(2,1.0000e-12)=3.3342209601e-01+-2.5013324745e-07I
Norm=3.3342209601e-01
compare = 3.3333333333e-01
I1(3,1.0000e-12)=2.4960025766e-01+-2.6628804517e+02I
Norm=2.6628816215e+02
compare = 2.5000000000e-01
Since k=3 is not particularly big, I computed the value of the integral by hand; using a calculator and my analytical formula I obtained the same larger-than-expected result for the imaginary part in that case. I also realized that if I changed the order of the terms, the result changed. It therefore appears to be a precision problem: in the iterative process there is a subtraction of very large but almost equal terms, and following what was said in this thread: How to divide tiny double precision numbers correctly without precision errors?, this can cause small errors to be amplified. However, I am finding it difficult to see how to resolve the issue in my case, and was also wondering if someone could briefly explain why this occurs?

You have to be very careful with floating point addition and subtraction.
Suppose a decimal floating point format with 6 digits of precision (to keep things simple). Adding/subtracting a small number to/from a large one discards some, or even all, of the smaller one. So:
5.00000E+9 + 1.45678E+4 is: (5.00000 + 0.0000145678)E+9 -> 5.00001E+9
which is as good as it gets. But if you add a series of small numbers to a large one, then you may be better off adding the small numbers together first, and adding the result to the large number.
Subtraction of similar size numbers is another way of losing precision. So:
5.12346E+4 - 5.12345E+4 = 1.00000E-1
Now, the two numbers can be at best their real value +/- half the least significant digit, in this case 0.5E-1 -- which is a relative error of about +/-1E-6. The result of the subtraction is still +/- 0.5E-1 (we cannot reduce the error !), which is a relative error of +/- 0.5 !!!
Multiplication and division are much better behaved -- until you over-/under-flow.
But as soon as you are doing anything iterative with add/subtract, keep saying (loudly) to yourself: floating point numbers are not (entirely) like real numbers.
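A small C demonstration of both effects; the constants are purely illustrative:
#include <stdio.h>

int main(void)
{
    /* Absorption: for float, ULP(5e9) is 512, so an addend of 200
       (less than half an ULP) is lost entirely when added alone... */
    float big = 5.0e9f;
    float small = 200.0f;
    printf("one at a time : %.7e\n", ((big + small) + small) + small);
    /* ...but three of them added together first do survive. */
    printf("smalls first  : %.7e\n", big + (small + small + small));

    /* Cancellation: the subtraction below is exact, but only the
       already-rounded low-order digits of the operands remain. */
    double a = 5.12346e4, b = 5.12345e4;
    printf("a - b         = %.17g\n", a - b);   /* close to, but not exactly, 0.1 */
    return 0;
}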

Related

I am multiplying two complex numbers whose imaginary parts are both zero. I was expecting the result to be purely real, but I get an imaginary part

I am multiplying the reciprocal of the determinant of a matrix by the transposed cofactor matrix to get the inverse matrix. Some of the values in the transposed cofactor matrix will have an imaginary part that is not equal to zero but is very close to zero.
I am trying to replicate code originally written in Matlab, so there are exact target values I am trying to achieve; any differences in the values propagate themselves throughout the rest of the calculations, resulting in very different final values. Is it possible to do? Or will there always be differences between the two codes' calculations?
(I have revised my code to show the small values.) This is the function and the output:
void MatrixScalarMultiply(int r, int c, double complex x, double complex mat[r][c],
                          double complex result[r][c])
{
    for (int R = 0; R < r; R++) {
        for (int C = 0; C < c; C++) {
            printf("%.16g%+.16gi times %.16g%+.16gi\n",
                   creal(x), cimag(x), creal(mat[R][C]), cimag(mat[R][C]));
            result[R][C] = x * mat[R][C];
            printf("result[%d][%d]:%.16g%+.16gi\n",
                   R, C, creal(result[R][C]), cimag(result[R][C]));
        }
    }
}
output:
1122579414.726753+0i times 0.0004943535237422733-2.632898458153072e-21i
result[0][0]:554951.0893507092-2.955637610188447e-12i
I am multiplying two complex numbers whose imaginary parts are both zero,
As OP found out by using exponential notation, the imaginary parts were not both zero.
... any differences in the values propagate themselves throughout the rest of the calculations resulting in very different final values. Is it possible to do?
Yes, it is possible, yet it is often unlikely that a floating-point calculation on 2 platforms will produce the exact same result. A more reasonable approach is to tolerate a small difference. What constitutes a small difference depends on the calculation, which is not yet shown.
Or will there always be differences between the two codes' calculations?
No, they will not always differ. Again, what constitutes a small difference depends on the calculation, which is not yet shown.
... can see that the imaginary part of the second number is actually a very small number. Is there a way I can round those off to zero?
Yes, code could round as it did using the "%+.16f" format in an earlier version of the question. That rounded the display value and not mat[R][C].
Instead of attempting to "round those off to zero", consider analyzing code and determine what tolerance is possible. A simply, though not so mathematical sound, approach adjusts the various input real and imaginary arguments 1 unit in the last place (ULP), both up and down with nextafter() and noticing the range of outputs.
Alternatively, the algorithm and true code should be posted in a separate question to help analyze why the imaginary part does not meet OP's expectations.
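A minimal sketch of that ULP-perturbation probe; since OP's real calculation is not shown, a simple stand-in function is used here:
#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for the real computation being probed. */
static double calculation(double x)
{
    return sqrt(x * x + 1.0) - x;   /* loses accuracy for large x */
}

int main(void)
{
    double x = 1.0e8;

    /* Perturb the input by 1 ULP in each direction and compare outputs;
       the spread gives a feel for how large a tolerance is needed. */
    double lo  = calculation(nextafter(x, -INFINITY));
    double mid = calculation(x);
    double hi  = calculation(nextafter(x, +INFINITY));

    printf("low  : %.17g\n", lo);
    printf("mid  : %.17g\n", mid);
    printf("high : %.17g\n", hi);
    return 0;
}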

How to correctly compare 2 doubles?

I'm doing calculations with triangles. But I need to know if three given points aren't on the same line. To do that, I'm calculating an area of the triangle
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
If the area equals zero, then all three points are colinear.
But the problem is that it never really equals zero, since doubles and floats are very inaccurate, so
if (area == 0) {
    printf("It's not a triangle");
}
won't work. How is the correct way of overcoming this problem?
Let us clear up some casual misunderstandings and dig deeper.
Wrong formula for area
The area is 1/2 of OP's formula, yet that does not make a difference when comparing to 0.0.
// area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By))/2;
Inaccuracy
"since doubles and floats are very inaccurate" is itself inaccurate. All finite FP values are exact, just like integers. It is when comparing their operations against mathematical divide, they get the mis-nomer of "inaccurate". Like integer divide, FP divide and other basic FP math OPs, they are defined differently than math operations. 7/3 and 7.0/3.0 both do not result in the mathematical 21/3, but a different value. When C employs an IEEE math model, that "quotient" is not approximate, but exact.
Comparing how many?
"compare 2 doubles" misleads as effectively it a complicated compare of 6 double that code needs to perform.
Review of the test formula
Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By) with double operands will behave without rounding as long as the sub-steps do not round. In general, this is not possible. The work-arounds are:
#1 Use higher precision
Perform the test using long double. This does not eliminate the issue, just makes it smaller/less likely. Note long double is not required to have higher precision.
#2 Employ some sort of epsilon
A naive approach takes the result |computed area| and compares against an epsilon. Absolute areas below that are considered "zero". This does not scale well as the epsilon really depends on the magnitude of operands relative to the area. A relative epsilon is needed. Suggest fmax(|ax|,|bx|,|cx|) * fmax(|ay|,|by|,|cy|) * DBL_EPSILON. This is only a first order approximation.
#3 Look for a sign change
The area formula is a signed area. Effectively, reversing the order of a,b,c inverts the sign of the area. Should a small perturbation of any one of the 6 operands, operand_new = operand*(1 +/- DBL_EPSILON), result in an area sign change, the area can be assessed as "close enough to zero".
#4 Re-order the formula.
It is the subtraction of distant values that kills precision. Exchanging xs with ys may help in the inner term subtractions. Re-ordering the subtraction of the 3 products can help.
A better re-ordering (#4b) can take the form of forming the 6 products: AxBy, -AxCy, BxCy, -BxAy, CxAy, -CxBy and then summing those.
Both of these benefit from using the Kahan summation algorithm, perhaps taking advantage of fma().
For me, I'd explore #4b or #3. Had OP posted a Minimal, Complete, and Verifiable example with sample data and expected results, true code could be had. Lacking that, consider these starting ideas for a fuzzy problem; a sketch of #2 follows below.
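A minimal sketch of #2, the relative epsilon; the safety factor of 8 is an arbitrary illustration and, as noted, this is only a first-order approximation:
#include <float.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Compare the doubled signed area against a tolerance scaled to the
   magnitudes of the coordinates. */
static bool nearly_colinear(double ax, double ay, double bx, double by,
                            double cx, double cy)
{
    double area2 = ax * (by - cy) + bx * (cy - ay) + cx * (ay - by);
    double sx = fmax(fabs(ax), fmax(fabs(bx), fabs(cx)));
    double sy = fmax(fabs(ay), fmax(fabs(by), fabs(cy)));
    double tol = sx * sy * DBL_EPSILON * 8;   /* 8: illustrative safety factor */
    return fabs(area2) <= tol;
}

int main(void)
{
    printf("%d\n", nearly_colinear(0, 0, 1e8, 1e8, 2e8, 2e8));       /* colinear: 1 */
    printf("%d\n", nearly_colinear(0, 0, 1e8, 1e8, 2e8, 2e8 + 1));   /* not: 0 */
    return 0;
}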
You figure out how large the rounding errors could be. If you use double precision, and the result of a single operation is x, then the rounding error can be up to abs(x) / 2^52. (If you don't use double, then use double.)
You do this and find the rounding error in By-Cy, Cy-Ay, Ay-By. These three errors are multiplied by Ax, Bx and Cx. The three products have their own rounding error. Then you have an error adding the first two products, then adding the third product. You add up all these errors and get a maximum total error e.
So if the area is less than e, then you can assume they are on a straight line.
To improve on this: If Ax, Bx, Cx are all positive (say 100, 101, 102.5), then you calculate the average and subtract from Ax, Bx and Cx. That makes your numbers smaller and the rounding errors smaller.
I would try something like this:
#include <float.h>
...
if ((area < FLT_EPSILON) && (area > -FLT_EPSILON))
{
    printf("It's not a triangle");
}

C thinking : float vs. integers and float representation

When using integers in C (and in many other languages), one must pay attention to precision when dividing. It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
But what about floats? Does that still hold? Or are they represented in such a way that it is better to divide number of similar orders of magnitude rather than large ones by small ones?
The representation of floats/doubles, and similar floating-point working, is geared towards retaining a number of significant digits (aka "precision"), rather than a fixed number of decimal places, as happens in fixed-point or integer working.
It is best to avoid combining quantities in ways that may give rise to implicit under- or overflow in terms of the exponent, i.e. at the limits of the floating-point number range.
Hence, addition/subtraction of quantities of widely differing magnitudes (either explicitly, or due to having opposite signs) should be avoided and re-arranged, where possible, to avoid this well-known route to lost precision.
Example: it's better to refactor/re-order
small + big + small + big + small
as
(small+small+small) + big + big
since the smalls individually might make no difference to a big, and hence their contribution might disappear.
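A small C illustration of this re-ordering; the magnitudes are chosen purely so that ULP(big) exceeds the small addend:
#include <stdio.h>

int main(void)
{
    float big = 1.0e8f;     /* ULP of 1e8f is 8, so an addend of 1 is below half an ULP */
    float small = 1.0f;

    /* Adding the smalls to the big one, one at a time: each is absorbed
       and the sum never moves off 100000000. */
    float s1 = big;
    for (int i = 0; i < 1000; i++)
        s1 += small;

    /* Adding the smalls together first, then adding the total to big. */
    float t = 0.0f;
    for (int i = 0; i < 1000; i++)
        t += small;
    float s2 = big + t;

    printf("one at a time : %.1f\n", s1);   /* 100000000.0 */
    printf("smalls first  : %.1f\n", s2);   /* 100001000.0 */
    return 0;
}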
If there is any "noise" or imprecision in the lower bits of any quantity, it's also wise to be aware how loss of significant bits propagates through a computation.
With integers:
As long as there is no overflow, +,-,* is always exact.
With division, the result is truncated and often not equal to the mathematical answer.
With ia, ib, ic, multiplying before dividing (ia*ib/ic vs. ia*(ib/ic)) is better, as the quotient is based on more bits of the product ia*ib than of ib alone.
With floating point:
Issues are subtle. Again, as long as there is no over/underflow, the order of a *,/ sequence makes less impact than with integers. FP multiply/divide is akin to adding/subtracting logs. Typical results are within 0.5 ULP of the mathematically correct answer.
With FP and +,-, the result with fa, fb, fc can differ significantly from the mathematically correct one when 1) the values are far apart in magnitude, or 2) subtracting values that are nearly equal, where the error from a prior calculation now becomes significant.
Consider the quadratic equation:
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (-b + d)/(2*a);
double root2 = (-b - d)/(2*a);
Versus
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (b < 0) ? (-b + d)/(2*a) : (-b - d)/(2*a);
double root2 = c/(a*root1);   // assume a*root1 != 0
The 2nd has much better precision for root2 when one root is near 0 and |b| is nearly d. This is because the b,d subtraction in the first form cancels many bits of significance, allowing the error in the calculation of d to become significant.
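A self-contained sketch of the effect, with coefficients chosen (x^2 - 1e8*x + 1) so that the true roots are close to 1e8 and 1e-8:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0, b = -1.0e8, c = 1.0;

    double d = sqrt(b * b - 4 * a * c);

    /* Naive formula: (-b - d) subtracts two nearly equal numbers. */
    double naive_root2 = (-b - d) / (2 * a);

    /* Cancellation-avoiding formula: uses root1 * root2 = c/a. */
    double root1 = (b < 0) ? (-b + d) / (2 * a) : (-b - d) / (2 * a);
    double stable_root2 = c / (a * root1);

    printf("naive  root2 = %.17g\n", naive_root2);   /* visibly off from 1e-8 */
    printf("stable root2 = %.17g\n", stable_root2);  /* very close to 1e-8 */
    return 0;
}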
(for integer) It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
Does that still hold (for floats)?
In general the answer is No
It is easy to construct an example where adding all input before division will give you a huge rounding error.
Assume you want to add 10000000000 values and divide them by 1000. Further assume that each value is 1. So the expected result is 10000000.
Method 1
However, if you add all the values before division, you'll get the result 16777.216 (for a 32 bit float). As you can see it is pretty much off.
Method 2
So is it better to divide each value by 1000 before adding it to the result? If you do that, you'll get the result 32768.0 (for a 32 bit float). As you can see it is pretty much off as well.
Method 3
However, if you keep adding values until the temporary result is greater than 1000000, then divide the temporary result by 1000 and add that intermediate result to the final result, and repeat that until you have added a total of 10000000000 values, you will get the correct result.
So there is no simple "always add before division" or "always divide before adding" when dealing with floating point. As a general rule it is typically a good idea to keep operands in similar magnitude. That is what the third example does.
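A scaled-down C sketch of methods 1 and 3 (100 million values of 1.0f instead of 10 billion, so it runs quickly; the exact answer here is 100000):
#include <stdio.h>

int main(void)
{
    const long n = 100000000L;

    /* Method 1: add everything, then divide. The running float sum gets
       stuck at 16777216 (2^24), where adding 1.0f no longer changes it. */
    float sum = 0.0f;
    for (long i = 0; i < n; i++)
        sum += 1.0f;
    printf("method 1: %f\n", sum / 1000.0f);

    /* Method 3: drain the accumulator into the result whenever it grows
       large, so operands stay in similar magnitude. */
    float result = 0.0f, acc = 0.0f;
    for (long i = 0; i < n; i++) {
        acc += 1.0f;
        if (acc >= 1000000.0f) {
            result += acc / 1000.0f;
            acc = 0.0f;
        }
    }
    result += acc / 1000.0f;
    printf("method 3: %f\n", result);
    return 0;
}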

Rounding a float to two decimal places

I wish to round a float value, say val to its nearest multiple of 0.05. An explanation of my intent is here. I already have upper and lower bounds of val, say valU and valL respectively. I can do this in the following ways:
Search for the nearest multiple of 0.05 in the range [valL, valU] and assign the value accordingly. Ties are settled by taking the lower value. OR
Using something like this (of course replacing 20 by 100 in the solution given in the link).
I find that method 2 yields wrong results sometimes. Can someone please tell me why method 1 is the right way?
... round a float value, say val to its nearest multiple of 0.05 ...
Given typical binary floating point, the best that can be had is
... round a float value to the nearest multiple of 0.05, call it R, and save it as the nearest representable float r.
Method #1 is unclear in the corner cases. What OP has is not code, but an outline of code from a math perspective rather than actual C code. I'd go with a variation on #2 that works very well. @Barmar
Almost method 2: Multiply, round, then divide.
#include <math.h>

double factor = 20.0;   /* 1/0.05 */
float val = foo();      /* foo(): wherever the value comes from */
double rounded_val = round(val * factor)/factor;   /* could also be stored back into a float */
This method has two subtle points that make it superior.
The multiplication is done with greater precision and range than the referenced answer - this allows for an exact product and a very precise quotient. If the product/quotient were calculated with only float math, some edge cases would end up with the wrong answer and of course some large values would overflow to infinity.
"Ties are settled by taking lower value." is a tough and unusual goal. Sounds like a goal geared to skew selection. round(double) nicely rounds half way cases away from zero regardless of the current rounding direction. To accomplish "lower", change the current rounding direction and use rint() or nearbyint().
#include <fenv.h>
int current_rounding_direction = fegetround();
fesetround(FE_DOWNWARD);
double rounded_val = rint(val * factor)/factor;
fesetround(current_rounding_direction);
... method-2 yields wrong results some times...
OP needs to post the code and the exact values used and calculated for a quality explanation of the strengths/weaknesses of the various methods. Try printf("%a %a\n", val, rounded_val);. Often, problems occur due to an imprecise understanding of the exact values in play when code prints them with printf("%f\n", val);
Further: "I already have upper and lower bounds of val, say valU and valL respectively. I can do this in the following ways:"
This is doubtfully accurate, as the derivation of valU and valL is just another iteration of the original problem - finding rounded_val. The code to find valL and valU would itself need upper and lower bounds, else what is to prevent the range [valL ... valU] from having inaccurate endpoints?
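Putting the pieces above together, a minimal self-contained sketch (factor and round() as described; the test values are arbitrary):
#include <math.h>
#include <stdio.h>

/* Round to the nearest multiple of 0.05 (i.e. to the nearest representable double). */
static double round_to_005(double val)
{
    const double factor = 20.0;   /* 1/0.05 */
    return round(val * factor) / factor;
}

int main(void)
{
    double tests[] = { 0.12, 0.125, 1.01, 2.37, -0.08 };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        double r = round_to_005(tests[i]);
        /* %a shows the exact value actually stored, as suggested above. */
        printf("%.2f -> %.2f   (%a)\n", tests[i], r, r);
    }
    return 0;
}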

How to round 8.475 to 8.48 in C (rounding function that takes into account representation issues)? Reducing probability of issue

I am trying to round 8.475 to 8.48 (to two decimal places in C). The problem is that 8.475 internally is represented as 8.47499999999999964473:
double input_test =8.475;
printf("input tests: %.20f, %.20f \n", input_test, *&input_test);
gives:
input tests: 8.47499999999999964473, 8.47499999999999964473
So an ideal round function would round 8.475 = 8.4749999... to 8.47, and the internal round function is therefore not appropriate for me. I see that the rounding problem arises in cases of "underflow" and therefore I am trying to use the following algorithm:
double MyRound2(double *value) {
    double ad;
    long long mzr;
    double resval;
    if (*value < 0.000000001)
        ad = -0.501;
    else
        ad = 0.501;
    mzr = (long long)(*value);
    resval = *value - mzr;
    resval = ((long long)(resval * 100 + ad)) / 100.0;
    return resval;
}
This solves the "underflow" issue and it works well for "overflow" issues as well. The problem is that there are valid values x.xxx99 for which this function incorrectly gives bigger value (because of 0.001 in 0.501). How to solve this issue, how to devise algorithm that can detect floating point representation issue and that can round taking account this issue? Maybe C already has such clever rounding function? Maybe I can select different value for constant ad - such that probability of such rounding errors goes to zero (I mostly work with money values with up to 4 decimal ciphers).
I have read all the popoular articles about floating point representation and I know that there are tricky and unsolvable issues, but my client do not accept such explanation because client can clearly demonstrate that Excel handles (reproduces, rounds and so on) floating point numbers without representation issues.
(The C and C++ standards are intentionally flexible when it comes to the specification of the double type; quite often it is the IEEE754 64 bit type. So your observed result is platform-dependent.)
You are observing one of the pitfalls of using floating point types.
Sadly there isn't an "out-of-the-box" fix for this. (Adding a small constant pre-rounding just pushes the problem to other numbers).
Moral of the story: don't use floating point types for money.
Use a special currency type instead, or work in "pence" using an integral type.
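A minimal sketch of that integer-"pence" idea; the pence typedef and print_money helper are purely illustrative:
#include <stdint.h>
#include <stdio.h>

/* Keep money as an integer count of the smallest unit and only format
   it as pounds/dollars on output. */
typedef int64_t pence;

static void print_money(pence p)
{
    printf("%lld.%02lld\n", (long long)(p / 100), (long long)(p % 100));
}

int main(void)
{
    pence price = 847;                /* 8.47 */
    pence tax   = price * 20 / 100;   /* 20% tax, truncated; pick a rounding rule explicitly */
    print_money(price + tax);         /* 10.16 */
    return 0;
}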
By the way, Excel does use an IEEE754 double precision floating point for its number type, but it also has some clever tricks up its sleeve. Essentially it treats the final significant digits carefully and is also clever with its formatting. This is how it can evaluate 1/3 + 1/3 + 1/3 exactly. But even it will get money calculations wrong sometimes.
For financial calculations, it is better to work in base 10 to avoid representation issues when converting to/from binary. In many countries, financial software is even legally required to do so. Here is one library for IEEE 754R Decimal Floating-Point Arithmetic; I have not tried it myself:
http://www.netlib.org/misc/intel/
Also note that working in decimal floating-point instead of a fixed-point representation allows clever algorithms like the Kahan summation algorithm to avoid accumulation of rounding errors. A noteworthy difference to normal floating point is that numbers with few significant digits are not normalized, so you can have e.g. both 1*10^2 and .1*10^3.
An implementation note is that one representation in the standard uses a binary significand, to allow software implementations using a standard binary ALU.
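For reference, a minimal C sketch of the Kahan summation algorithm mentioned above; the test data (one huge value plus a million 1.0s, which a naive sum absorbs entirely) is only illustrative:
#include <stdio.h>

/* Kahan (compensated) summation: the running compensation c recovers the
   low-order bits lost in each addition. */
static double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;     /* corrected next term */
        double t = sum + y;      /* low bits of y may be lost here... */
        c = (t - sum) - y;       /* ...and are recovered into c */
        sum = t;
    }
    return sum;
}

int main(void)
{
    enum { N = 1000000 };
    static double data[N + 1];
    data[0] = 1.0e16;            /* ULP(1e16) is 2, so a lone +1.0 is absorbed */
    for (size_t i = 1; i <= N; i++)
        data[i] = 1.0;

    double naive = 0.0;
    for (size_t i = 0; i <= N; i++)
        naive += data[i];

    printf("naive : %.17g\n", naive);                  /* stays at 1e16 */
    printf("kahan : %.17g\n", kahan_sum(data, N + 1)); /* 1e16 + 1000000 */
    return 0;
}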
How about this one: define some threshold. This threshold is the distance to the next multiple of 0.005 below which you assume the difference could be an imprecision error. Round as usual, and at the end, if you detected that the value was within that distance of the multiple, add 0.01.
That said, this is only a workaround and somewhat of a code smell. If you don't need too much speed, go for some type other than float, like your own type that works like:
class myDecimal { int digits; int exponent_of_ten; };  // value = digits * 10^exponent_of_ten
I am not trying to argue that using floating point numbers to represent money is advisable - it is not! But sometimes you have no choice... We do kind of work with money (life insurance calculations) and are forced to use floating point numbers for everything, including values representing money.
Now there are quite a few different rounding behaviours out there: round up, round down, round half up, round half down, round half even, maybe more. It looks like you were after the round-half-up method.
Our round-half-up function - here translated from Java - looks like this:
#include <iostream>
#include <cmath>
#include <cfloat>
using namespace std;

int main()
{
    double value = 8.47499999999999964473;
    double result = value * pow(10, 2);
    result = nextafter(result + (result > 0.0 ? 1e-8 : -1e-8), DBL_MAX);
    double integral = floor(result);
    double fraction = result - integral;
    if (fraction >= 0.5) {
        result = ceil(result);
    } else {
        result = integral;
    }
    result /= pow(10, 2);
    cout << result << endl;
    return 0;
}
where nextafter is a function returning the next floating point value after the given value. This code is proven to work with C++11 (AFAIK nextafter is also available in Boost); the result written to the standard output is 8.48.
