How to correctly compare 2 doubles? - c

I'm doing calculations with triangles. But I need to know if three given points aren't on the same line. To do that, I'm calculating an area of the triangle
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
If the area equals zero, then all three points are colinear.
But the problem is, that it never really equals zero since doubles and floats are very inaccurate, so
if(area==0){
printf("It's not a triangle");
}
won't work. What is the correct way of overcoming this problem?

Let us clear up some casual understandings and dig deeper.
Wrong formula for area
The area is 1/2 of OP's formula, yet that does not make a difference when comparing to 0.0.
// area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By))/2;
Inaccuracy
"since doubles and floats are very inaccurate" is itself inaccurate. All finite FP values are exact, just like integers. It is only when their operations are compared against mathematical arithmetic that they earn the misnomer "inaccurate". Like integer division, FP division and the other basic FP operations are defined differently from the mathematical operations. Neither 7/3 nor 7.0/3.0 results in the mathematical 2 1/3; each yields a different value. When C employs an IEEE math model, that "quotient" is not approximate, but exact.
Comparing how many?
"compare 2 doubles" misleads, as effectively it is a complicated compare of 6 doubles that the code needs to perform.
Review of the test formula
Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By) with double operands will behave without rounding as long as the sub-steps do not round. In general, this is not possible. The work-arounds are
Use higher precision
Perform the test using long double. This does not eliminate the issue, just makes it smaller/less likely. Note long double is not required to have higher precision.
Employ some sort of epsilon
A naive approach takes the result |computed area| and compares against an epsilon. Absolute areas below that are considered "zero". This does not scale well as the epsilon really depends on the magnitude of operands relative to the area. A relative epsilon is needed. Suggest fmax(|ax|,|bx|,|cx|) * fmax(|ay|,|by|,|cy|) * DBL_EPSILON. This is only a first order approximation.
Look for sign change
The area formula is a signed area. Effectively, reversing the order of a,b,c inverts the sign of the area. Should a small perturbation of any one of the 6 operands by operand_new = operand*(1 +/- DBL_EPSILON) result in an area sign change, the area can be assessed "close enough to zero".
Re-order the formula.
It is the subtraction of distant values that kills precision. Exchanging xs with ys may help in the inner term subtractions. Re-ordering subtraction of the 3 products can help.
A better re-ordering can take the form of forming the 6 products: AxBy, -AxCy, BxCy, -BxAy, CxAy, -CxBy and then sum those.
Both of these benefit from using the Kahan summation algorithm, perhaps taking advantage of fma().
For me, I'd explore #4b (summing the 6 products) or #3 (looking for a sign change). Had OP posted a Minimal, Complete, and Verifiable example, with sample data and expected sample results, true code could be had. Lacking that, consider these starting ideas for a fuzzy problem.

You figure out: How much could the rounding errors be? If you use double precision, and the result of a single operation is x, then the rounding error can be up to abs (x) / 2^52. (If you don't use double then use double. )
You do this and find the rounding error in By-Cy, Cy-Ay, Ay-By. These three errors are multiplied by Ax, Bx and Cx. The three products have their own rounding error. Then you have an error adding the first two products, then adding the third product. You add up all these errors and get a maximum total error e.
So if the absolute value of the area is less than e, then you can assume they are on a straight line.
To improve on this: If Ax, Bx, Cx are all positive (say 100, 101, 102.5), then you calculate the average and subtract from Ax, Bx and Cx. That makes your numbers smaller and the rounding errors smaller.
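The error budget described above could be sketched roughly as follows. This is a first-order estimate only, the function name collinear_within_error is made up for illustration, and DBL_EPSILON is used because it equals 2^-52 for IEEE doubles.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: accumulate a first-order bound e on the rounding
 * error of each step, then compare |area| against it. */
bool collinear_within_error(double ax, double ay, double bx, double by,
                            double cx, double cy)
{
    const double u = DBL_EPSILON;        /* abs(x) / 2^52 per operation */

    double d1 = by - cy, d2 = cy - ay, d3 = ay - by;
    double p1 = ax * d1, p2 = bx * d2, p3 = cx * d3;
    double s1 = p1 + p2;
    double area2 = s1 + p3;              /* twice the signed area */

    double e = 0.0;
    /* errors of the three subtractions, scaled by Ax, Bx, Cx */
    e += (fabs(ax * d1) + fabs(bx * d2) + fabs(cx * d3)) * u;
    /* errors of the three products */
    e += (fabs(p1) + fabs(p2) + fabs(p3)) * u;
    /* errors of the two additions */
    e += (fabs(s1) + fabs(area2)) * u;

    return fabs(area2) <= e;
}
```

Subtracting the average from the x coordinates first, as suggested, would shrink every term that feeds into e.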

I would try something like this:
#include <float.h>
...
if ( (area < FLT_EPSILON) && (area > -FLT_EPSILON) )
{
printf("It's not a triangle");
}

Related

I am multiplying two complex numbers whose imaginary parts are both zero, I was expecting the result to be only real but I get an imaginary part

I am multiplying the reciprocal of the determinant of a matrix by the transposed cofactor matrix to get the inverse matrix. Some of the values in the transposed cofactor matrix will have an imaginary value not equaled to or very close to zero.
I am trying to replicate code originally written in MATLAB, so there are exact target values I am trying to achieve; any differences in the values propagate themselves throughout the rest of the calculations, resulting in very different final values. Is it possible to do? Or will there always be differences between the two codes' calculations?
(I have revised my code to show the small values.) This is the function and the output.
void MatrixScalarMultiply(int r, int c, double complex x, double complex mat[r][c],
double complex result[r][c]){
for (int R=0; R<r; R++){
for (int C=0; C<c; C++){
printf("%.16g%+.16gi times %.16g%+.16gi\n", creal(x), cimag(x), creal(mat[R][C]), cimag(mat[R][C]));
result[R][C] = x * mat[R][C];
printf("result[%d][%d]:%.16g%+.16gi\n", R, C, creal(result[R][C]), cimag(result[R][C]));
}
}
}
output:
1122579414.726753+0i times 0.0004943535237422733-2.632898458153072e-21i
result[0][0]:554951.0893507092-2.955637610188447e-12i
I am multiplying two complex numbers whose imaginary parts are both zero,
As OP found out by using exponential notation, the imaginary parts were not both zero.
... any differences in the values propagate themselves throughout the rest of the calculations resulting in very different final values. Is it possible to do?
Yes, it is possible, yet often not likely, for a floating-point calculation on 2 platforms to produce exactly the same result. A more reasonable approach is to tolerate a small difference. What constitutes a small difference depends on the calculation, which is not yet shown.
Or will there always be differences between the two codes' calculations?
No, they will not always differ. Again, what constitutes a small difference depends on the calculation, which is not yet shown.
... can see that the imaginary part of the second number is actually a very small number. Is there a way I can round those off to zero?
Yes, code could round as it did using the "%+.16f" format in an earlier version of the question. That rounded the display value and not mat[R][C].
Instead of attempting to "round those off to zero", consider analyzing the code to determine what tolerance is possible. A simple, though not so mathematically sound, approach adjusts the various input real and imaginary arguments by 1 unit in the last place (ULP), both up and down, with nextafter() and notes the range of outputs.
Alternatively, the algorithm and true code should be posted in a separate question to help analyze why the imaginary part does not meet OP's expectations.

C thinking : float vs. integers and float representation

When using integers in C (and in many other languages), one must pay attention when dividing about precision. It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
But what about floats? Does that still hold? Or are they represented in such a way that it is better to divide number of similar orders of magnitude rather than large ones by small ones?
The representation of floats/doubles and similar floating-point working, is geared towards retaining numbers of significant digits (aka "precision"), rather than a fixed number of decimal places, such as happens in fixed-point, or integer working.
It is best to avoid combining quantities, that may give rise to implicit under or overflow in terms of the exponent, ie at the limits of the floating-point number range.
Hence, addition/subtraction of quantities of widely differing magnitudes (either explicitly, or due to having opposite signs)) should be avoided and re-arranged, where possible, to avoid this well-known route to lost precision.
Example: it's better to refactor/re-order
small + big + small + big + small * big
as
(small+small+small) + big + big
since the smalls individually might make no difference to a big, and hence their contribution might disappear.
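A small sketch of that effect, assuming IEEE doubles with round-to-nearest; sum_in_order is just an illustrative name:

```c
#include <stddef.h>

/* Plain left-to-right summation; the result depends on element order. */
double sum_in_order(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```

With { 1e16, 1, 1, 1, 1 } each 1 is lost in turn because the spacing between doubles near 1e16 is 2, whereas { 1, 1, 1, 1, 1e16 } first collects the smalls into 4 and keeps their contribution.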
If there is any "noise" or imprecision in the lower bits of any quantity, it's also wise to be aware how loss of significant bits propagates through a computation.
With integers:
As long as there is no overflow, +,-,* is always exact.
With division, the result is truncated and often not equal to the mathematical answer.
Given integers ia, ib, ic, multiplying before dividing, ia*ib/ic, is better than ia*(ib/ic), as the quotient is based on more bits of the product ia*ib than of ib.
With floating point:
Issues are subtle. Again, as long as there is no over/underflow, the order of a *,/ sequence makes less impact than with integers. FP *,/ is akin to adding/subtracting logs. Typical results are within 0.5 ULP of the mathematically correct answer.
With FP +,-, the result for fa,fb,fc can differ significantly from the mathematically correct one when 1) values are far apart in magnitude or 2) values that are nearly equal are subtracted and the error in a prior calculation now becomes significant.
Consider the quadratic equation:
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (-b + d)/(2*a);
double root2 = (-b - d)/(2*a);
Versus
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (b < 0) ? (-b + d)/(2*a) : (-b - d)/(2*a);
double root2 = c/(a*root1); // assume a*root1 != 0
The 2nd gives a much more precise root2 when one root is near 0 and |b| is nearly d. This is because the b,d subtraction cancels many bits of significance, allowing the error in the calculation of d to become significant.
(for integer) It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
Does that still hold (for floats)?
In general the answer is No
It is easy to construct an example where adding all input before division will give you a huge rounding error.
Assume you want to add 10000000000 values and divide them by 1000. Further assume that each value is 1. So the expected result is 10000000.
Method 1
However, if you add all the values before division, you'll get the result 16777.216 (for a 32 bit float). As you can see it is pretty much off.
Method 2
So is it better to divide each value by 1000 before adding it to the result? If you do that, you'll get the result 32768.0 (for a 32 bit float). As you can see it is pretty much off as well.
Method 3
However, if you go on adding values until the temporary result is greater than 1000000, then divide the temporary result by 1000 and add that intermediate result to the final result, and repeat that until you have added a total of 10000000000 values, you will get the correct result.
So there is no simple "always add before division" or "always divide before adding" when dealing with floating point. As a general rule it is typically a good idea to keep operands in similar magnitude. That is what the third example does.
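A scaled-down sketch of Methods 1 and 3, using 20000000 ones instead of 10000000000 so that it runs quickly, and leaving out the division by 1000; the function names are made up for illustration and a 32-bit IEEE float is assumed.

```c
/* Method 1 style: one running float sum. It stalls at 2^24 = 16777216,
 * because adding 1 to that value rounds back to the same float. */
float sum_ones_naive(long n)
{
    float sum = 0.0f;
    for (long i = 0; i < n; i++)
        sum += 1.0f;
    return sum;
}

/* Method 3 style: accumulate a chunk while it is still small, then add
 * the chunk to the total, so both operands stay of similar magnitude. */
float sum_ones_chunked(long n, float chunk_limit)
{
    float total = 0.0f, chunk = 0.0f;
    for (long i = 0; i < n; i++) {
        chunk += 1.0f;
        if (chunk >= chunk_limit) {
            total += chunk;
            chunk = 0.0f;
        }
    }
    return total + chunk;
}
```

With a chunk limit of 1000000 the chunked version reaches 20000000, while the naive one never gets past 16777216.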

Rounding a float to two decimal places

I wish to round a float value, say val to its nearest multiple of 0.05. An explanation of my intent is here. I already have upper and lower bounds of val, say valU and valL respectively. I can do this in the following ways:
Search for the nearest multiple of 0.05 in the range [valL, valU] and assigning the value accordingly. Ties are settled by taking lower value. OR
Using something like this (of course replacing 100 by 20 in the solution given in the link).
I find that method-2 yields wrong results some times. Can someone please tell me why method-1 is right way?
... round a float value, say val to its nearest multiple of 0.05 ...
Given typical binary floating point, the best that can be had is
... round a float value to nearest to a multiple of 0.05 R and save as the nearest representable float r.
Method #1 is unclear in the corner cases. What OP has is not code, but an outline of code from a math perspective rather than actual C code. I'd go with a variation on #2 that works very well. @Barmar
Almost method 2: Multiply, round, then divide.
#include <math.h>
double factor = 20.0;
float val = foo();
double_or_float rounded_val = round(val * factor)/factor;
This method has two subtle points that make it superior.
The multiplication is done with greater precision and range than the referenced answer - this allows for an exact product and a very precise quotient. If the product/quotient were calculated with only float math, some edge cases would end up with the wrong answer and of course some large values would overflow to infinity.
"Ties are settled by taking lower value." is a tough and unusual goal. Sounds like a goal geared to skew selection. round(double) nicely rounds half way cases away from zero regardless of the current rounding direction. To accomplish "lower", change the current rounding direction and use rint() or nearbyint().
#include <fenv.h>
int current_rounding_direction = fegetround();
fesetround(FE_DOWNWARD);
double rounded_val = rint(val * factor)/factor;
fesetround(current_rounding_direction);
... method-2 yields wrong results some times...
OP needs to post the code and the exact values used and calculated for a quality explanation of the strength/weakness of various methods. Try printf("%a %a\n", val, rounded_val);. Often, problems occur due to an imprecise understanding of the exact values used when code uses printf("%f\n", val);
Further: "I already have upper and lower bounds of val, say valU and valL respectively. I can do this in the following ways:"
This is doubtfully accurate, as the derivation of valU and valL is just an iteration of the original problem - to find rounded_val. The code to find valL and valU each needs an upper and lower bound, else what is to prevent range [valL ... valU] from itself having inaccurate endpoints?

Compiler does not recognise matching float values [duplicate]

I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if for example frame.origin.x is 0 it makes me feel sick:
if (theView.frame.origin.x == 0) {
// do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
It is a floating point and they have imprecision problems: 0.0000000000041 for example.
Is Objective-C handling this internally when comparing, or can it happen that an origin.x which reads as zero does not compare to 0 as true?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
First of all, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
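The first two points can be checked with a few lines of C. This is a sketch assuming a 32-bit IEEE float in the default rounding mode; the helper names are invented, and volatile is used only to force the value to be stored as a float.

```c
#include <stdbool.h>

/* Add two floats and force the result to be rounded to float. */
float add_floats(float x, float y)
{
    volatile float sum = x + y;
    return sum;
}

/* True when v can be stored in a float without changing its value. */
bool survives_float(double v)
{
    volatile float f = (float)v;
    return (double)f == v;
}
```

Here add_floats(33554430.0f, 1.0f), that is 0x1fffffe + 1, gives 33554432 (0x2000000), and survives_float(0.1) is false while survives_float(0.5) is true.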
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
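Wrapped into a function for doubles, the check might look like this. It is only a sketch: nearly_equal is an illustrative name, and k stands for the K discussed above, which the caller still has to justify.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Magnitude-relative comparison with a fallback for values near zero. */
bool nearly_equal(double x, double y, double k)
{
    double diff = fabs(x - y);
    return diff < k * DBL_EPSILON * fabs(x + y) || diff < DBL_MIN;
}
```

With k = 4, nearly_equal(0.1 + 0.2, 0.3, 4.0) holds, while nearly_equal(1e-13, 0.0, 4.0) does not: a relative test never treats a small non-zero value as equal to an exact zero unless it is below DBL_MIN.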
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Talking perfectly representable values, you get 24 bits of range in a power-of-two notion (single precision). So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits are in 24-bits, you are golden. So 10.625 can be represented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any non-power-of-2 (NPOT) is involved. So it may sound odd, but 0.1 is an infinite series in binary and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of rounding errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
int main (void)
{
double x = 1e-13;
double y = 0.0;
double K = 1e22;
int i = 0;
for (; i < 32; i++, K = K/10.0)
{
printf ("K:%40.16lf -> ", K);
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
printf ("YES\n");
else
printf ("NO\n");
}
}
ebg@ebg$ gcc -o test test.c
ebg@ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'.
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change' then examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use code like this to compare a float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
// do important operation
}
This will compare with 0.01 accuracy (the value is multiplied by 100 before truncation), which is enough for CGFloat in this case.
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform, it's just that they may not agree with a different platform. E.g. CPython on X86 uses 64 bit doubles to represent floats, while CircuitPython on a Cortex M0 has to use software FP, and only uses 32 bit floats. Needless to say that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
-(BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue{
BOOL isEqual = NO;
NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
return isEqual;
}
I am using the following comparison function to compare a number of decimal places:
bool compare(const double value1, const double value2, const int precision)
{
int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
return intValue1 == intValue2;
}
// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
// do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator. 2) a setAcceptableDifference method. 3)the value itself. The equality operator returns true if the absolute difference of two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.

Double precision computations

I am trying to compute numerically (using analytical formulae) the values of the following sequence of integrals:
I(k,t) = int_0^{N/2-1} u^k e^(-i*u*delta*t) du
where "i" is the imaginary unit. For small k, this integral can be computed by hand, but for larger k it is more convenient to notice that there is an iterative relationship between the terms of sequence that can be derived by integration by parts. This is implemented below by the function i1.
void i1(int N, double t, double delta, double complex ** result){
unsigned int k;
(*result)=(double complex*)malloc(sizeof(double complex)*N);
if(t==0){
for(k=0;k<N;k++){
(*result)[k]=pow(N-2,k+1)/(pow(2,k+1)*(k+1));
}
}
else{
(*result)[0]=2/(delta*t)*sin(delta*(N-2)*t/4)*cexp(-I*(N-2)*t*delta/4);
for(k=1;k<N;k++){
(*result)[k]=I/(delta*t)*(pow(N-2,k)/pow(2,k)*cexp(-I*delta*(N-2)*t/2)-k*(*result)[k-1]);
}
}
}
The problem is that in my case t is very small (1e-12) and delta is typically around 1e6. When testing in the case N=4, I noticed some weird results appearing for k=3, namely the results were suddenly very large, much larger than they should be, as the norm of an integral is always smaller than the integral of the norm. The results of the test are printed below:
I1(0,1.0000e-12)=1.0000000000e+00+-5.0000000000e-07I
Norm=1.0000000000e+00
compare = 1.0000000000e+00
I1(1,1.0000e-12)=5.0000000000e-01+-3.3328895199e-07I
Norm=5.0000000000e-01
compare = 5.0000000000e-01
I1(2,1.0000e-12)=3.3342209601e-01+-2.5013324745e-07I
Norm=3.3342209601e-01
compare = 3.3333333333e-01
I1(3,1.0000e-12)=2.4960025766e-01+-2.6628804517e+02I
Norm=2.6628816215e+02
compare = 2.5000000000e-01
k=3 not being particularly big, I computed the value of the integral by hand, but using the calculator and the analytical formula I obtained the same larger-than-expected results for the imaginary part. I also realized that if I changed the order of the terms, the result changed. It therefore appears to be a problem with precision, as in the iterative process there is a subtraction of very large but almost equal terms, and following what was said on this thread: How to divide tiny double precision numbers correctly without precision errors?, this can cause small errors to be amplified. However, I am finding it difficult to see how to resolve the issue in my case, and was also wondering if someone could briefly explain why this occurs?
You have to be very careful with floating point addition and subtraction.
Suppose a decimal floating point with 6 digits precision (to keep things simple). Adding/subtracting a small number to/from a large one discards some or even all of the smaller. So:
5.00000E+9 + 1.45678E+4 is: 5.00000 + 0.000014 E+9 = 5.00001E+9
which is as good as it gets. But if you add a series of small numbers to a large one, then you may be better off adding the small numbers together first, and adding the result to the large number.
Subtraction of similar size numbers is another way of losing precision. So:
5.12346E+4 - 5.12345E+4 = 1.00000E-1
Now, the two numbers can be at best their real value +/- half the least significant digit, in this case 0.5E-1 -- which is a relative error of about +/-1E-6. The result of the subtraction is still +/- 0.5E-1 (we cannot reduce the error !), which is a relative error of +/- 0.5 !!!
Multiplication and division are much better behaved -- until you over-/under-flow.
But as soon as you are doing anything iterative with add/subtract, keep saying (loudly) to yourself: floating point numbers are not (entirely) like real numbers.
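The same subtraction can be tried in binary single precision. This is a small sketch assuming 32-bit IEEE floats, with function names invented for illustration: 51234.6 cannot be stored exactly and becomes 51234.6015625, so the difference comes out as 0.1015625 instead of 0.1.

```c
#include <math.h>

/* Relative error of approx with respect to exact (exact must be non-zero). */
static double rel_err(double approx, double exact)
{
    return fabs(approx - exact) / fabs(exact);
}

/* Subtract two nearby floats and report how far the result is from 0.1. */
double cancellation_rel_err(void)
{
    volatile float a = 51234.6f;   /* stored as 51234.6015625 */
    volatile float b = 51234.5f;   /* stored exactly */
    float diff = a - b;            /* the subtraction itself is exact */
    return rel_err(diff, 0.1);
}
```

The inputs carry a relative error of about 3e-8, yet the difference is off by more than 1%: the subtraction adds no new error, it only exposes the error that was already there.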
