Rounding a float to two decimal places - c

I wish to round a float value, say val, to its nearest multiple of 0.05. An explanation of my intent is here. I already have upper and lower bounds of val, say valU and valL respectively. I can do this in one of the following ways:
Search for the nearest multiple of 0.05 in the range [valL, valU] and assign the value accordingly. Ties are settled by taking the lower value. OR
Use something like this (of course replacing 20 by 100 in the solution given in the link).
I find that method 2 sometimes yields wrong results. Can someone please tell me why method 1 is the right way?

... round a float value, say val to its nearest multiple of 0.05 ...
Given typical binary floating point, the best that can be had is
... round a float value to the nearest multiple of 0.05 R and save it as the nearest representable float r.
Method #1 is unclear in the corner cases. What OP has is not code, but an outline of code from a math perspective rather than actual C. I'd go with a variation on #2 that works very well. @Barmar
Almost method 2: Multiply, round, then divide.
#include <math.h>

double factor = 20.0;   /* 20 = 1/0.05 */
float val = foo();      /* foo() stands in for wherever val comes from */
double rounded_val = round(val * factor) / factor;
This method has two subtle points that make it superior.
The multiplication is done with greater precision and range than the referenced answer - this allows for an exact product and a very precise quotient. If the product/quotient were calculated with only float math, some edge cases would end up with the wrong answer and of course some large values would overflow to infinity.
"Ties are settled by taking lower value." is a tough and unusual goal. Sounds like a goal geared to skew selection. round(double) nicely rounds half way cases away from zero regardless of the current rounding direction. To accomplish "lower", change the current rounding direction and use rint() or nearbyint().
#include <fenv.h>
#include <math.h>

int current_rounding_direction = fegetround();   /* save current mode */
fesetround(FE_DOWNWARD);
double rounded_val = rint(val * factor) / factor;
fesetround(current_rounding_direction);          /* restore */
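If changing the rounding mode is undesirable, a minimal alternative sketch (not from the answer above, and subject to the same caveats about intermediate rounding) gets round-to-nearest with ties toward the lower multiple via ceil():

#include <math.h>

/* Sketch: round val to the nearest 1/factor, exact halves going to the
   lower multiple; ceil(x - 0.5) rounds x to nearest with ties downward. */
double round_ties_down(double val, double factor)
{
    return ceil(val * factor - 0.5) / factor;
}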
... method-2 yields wrong results some times...
OP needs to post the code and the exact values used and calculated for a quality explanation of the strengths/weaknesses of the various methods. Try printf("%a %a\n", val, rounded_val);. Often, problems occur due to an imprecise understanding of the exact values in use when code prints with printf("%f\n", val);.
Further: "I already have upper and lower bounds of val, say valU and valL respectively. I can do this in the following ways:"
This is doubtfully accurate, as the derivation of valU and valL is just an iteration of the original problem: to find rounded_val. The code that finds valL and valU would itself need upper and lower bounds; otherwise, what prevents the range [valL ... valU] from having inaccurate endpoints?

Related

How to correctly compare 2 doubles?

I'm doing calculations with triangles, but I need to know that the three given points aren't on the same line. To do that, I'm calculating the area of the triangle:
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
If the area equals zero, then all three points are collinear.
But the problem is, that it never really equals zero since doubles and floats are very inaccurate, so
if (area == 0) {
    printf("It's not a triangle");
}
won't work. What is the correct way of overcoming this problem?
Let us clear up some casual misunderstandings and dig deeper.
Wrong formula for area
The area is 1/2 of OP's formula, yet that does not make a difference when comparing to 0.0.
// area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By));
area=(Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By))/2;
Inaccuracy
"since doubles and floats are very inaccurate" is itself inaccurate. All finite FP values are exact, just like integers. It is when comparing their operations against mathematical divide, they get the mis-nomer of "inaccurate". Like integer divide, FP divide and other basic FP math OPs, they are defined differently than math operations. 7/3 and 7.0/3.0 both do not result in the mathematical 21/3, but a different value. When C employs an IEEE math model, that "quotient" is not approximate, but exact.
Comparing how many?
"compare 2 doubles" misleads as effectively it a complicated compare of 6 double that code needs to perform.
Review of the test formula
Ax* (By-Cy) + Bx* (Cy-Ay) + Cx* (Ay-By) with double operands will behave without rounding as long as the sub-steps do not round. In general, this is not possible. The work-arounds are
Use higher precision
Perform the test using long double. This does not eliminate the issue, just makes it smaller/less likely. Note long double is not required to have higher precision.
Employ some sort of epsilon
A naive approach takes the result |computed area| and compares it against an epsilon. Absolute areas below that are considered "zero". This does not scale well, as the epsilon really depends on the magnitude of the operands relative to the area. A relative epsilon is needed. Suggest fmax(|ax|,|bx|,|cx|) * fmax(|ay|,|by|,|cy|) * DBL_EPSILON (applying fmax() pairwise, since it takes two arguments). This is only a first order approximation.
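A minimal sketch of that relative-epsilon test, assuming IEEE doubles (the collinear_eps name is illustrative):

#include <math.h>
#include <float.h>

/* Sketch: relative-epsilon collinearity test; first order only. */
int collinear_eps(double ax, double ay, double bx, double by,
                  double cx, double cy)
{
    double area = ax * (by - cy) + bx * (cy - ay) + cx * (ay - by);
    double eps = fmax(fabs(ax), fmax(fabs(bx), fabs(cx)))
               * fmax(fabs(ay), fmax(fabs(by), fabs(cy)))
               * DBL_EPSILON;
    return fabs(area) <= eps;
}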
Look for sign change
The area formula is a signed area. Effectively, reversing the order of a, b, c inverts the sign of the area. Should a small perturbation of any one of the 6 operands by operand_new = operand*(1 +/- DBL_EPSILON) result in an area sign change, the area can be assessed "close enough to zero".
Re-order the formula.
It is the subtraction of distant values that kills precision. Exchanging xs with ys may help in the inner term subtractions. Re-ordering the subtraction of the 3 products can help.
A better re-ordering can take the form of forming the 6 products: AxBy, -AxCy, BxCy, -BxAy, CxAy, -CxBy and then sum those.
Both of these benefit from using the Kahan summation algorithm, perhaps taking advantage of fma().
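A minimal Kahan-summation sketch over those 6 products, assuming IEEE doubles and strict FP semantics (aggressive optimization such as -ffast-math can defeat the compensation; the kahan_sum helper is illustrative):

/* Sketch: compensated (Kahan) summation of n doubles. */
double kahan_sum(const double *v, int n)
{
    double sum = 0.0, c = 0.0;      /* c carries the lost low-order bits */
    for (int i = 0; i < n; i++) {
        double y = v[i] - c;
        double t = sum + y;
        c = (t - sum) - y;          /* what was rounded away in t */
        sum = t;
    }
    return sum;
}

/* usage: double p[6] = { ax*by, -ax*cy, bx*cy, -bx*ay, cx*ay, -cx*by };
          double area = kahan_sum(p, 6); */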
For me, I'd explore #4b or #3. Had OP posted a Minimal, Complete, and Verifiable example, sample data and expected sample results, true code could be had. Lacking that, consider these starting ideas for a fuzzy problem.
You figure out: how much could the rounding errors be? If you use double precision, and the result of a single operation is x, then the rounding error can be up to abs(x) / 2^52. (If you don't use double, then use double.)
You do this and find the rounding errors in By-Cy, Cy-Ay, Ay-By. These three errors are multiplied by Ax, Bx and Cx. The three products have their own rounding errors. Then you have an error adding the first two products, then adding the third product. You add up all these errors and get a maximum total error e.
So if the area is less than e, then you can assume they are on a straight line.
To improve on this: If Ax, Bx, Cx are all positive (say 100, 101, 102.5), then you calculate the average and subtract from Ax, Bx and Cx. That makes your numbers smaller and the rounding errors smaller.
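A minimal sketch of that error bookkeeping, assuming IEEE doubles (the collinear_bound name and the exact accounting are illustrative; the per-operation bound abs(x)/2^52 is DBL_EPSILON * abs(x)):

#include <math.h>
#include <float.h>

/* Sketch: accumulate per-operation error bounds for
   area = Ax*(By-Cy) + Bx*(Cy-Ay) + Cx*(Ay-By), then test |area| < e. */
int collinear_bound(double ax, double ay, double bx, double by,
                    double cx, double cy)
{
    double d1 = by - cy, d2 = cy - ay, d3 = ay - by;
    double p1 = ax * d1, p2 = bx * d2, p3 = cx * d3;
    double s = p1 + p2;
    double area = s + p3;

    double e = (fabs(ax) * fabs(d1) + fabs(bx) * fabs(d2) + fabs(cx) * fabs(d3)
              + fabs(p1) + fabs(p2) + fabs(p3)   /* product roundings */
              + fabs(s) + fabs(area))            /* addition roundings */
              * DBL_EPSILON;

    return fabs(area) < e;
}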
I would try something like this:
#include <float.h>
...
if ((area < FLT_EPSILON) && (area > -FLT_EPSILON))
{
    printf("It's not a triangle");
}

How to round 8.475 to 8.48 in C (rounding function that takes into account representation issues)? Reducing probability of issue

I am trying to round 8.475 to 8.48 (to two decimal places in C). The problem is that 8.475 internally is represented as 8.47499999999999964473:
double input_test = 8.475;
printf("input tests: %.20f, %.20f \n", input_test, *&input_test);
gives:
input tests: 8.47499999999999964473, 8.47499999999999964473
So, if I had an ideal round function then it would round 8.475 = 8.4749999... to 8.47. So the internal round function is not appropriate for me. I see that the rounding problem arises in cases of "underflow" and therefore I am trying to use the following algorithm:
double MyRound2(double *value) {
    double ad;
    long long mzr;
    double resval;
    if (*value < 0.000000001)
        ad = -0.501;
    else
        ad = 0.501;
    mzr = (long long)(*value);                   /* integral part */
    resval = *value - mzr;                       /* fractional part */
    resval = (double)((long long)(resval * 100 + ad)) / 100;
    return resval;
}
This solves the "underflow" issue and it works well for "overflow" issues as well. The problem is that there are valid values x.xxx99 for which this function incorrectly gives bigger value (because of 0.001 in 0.501). How to solve this issue, how to devise algorithm that can detect floating point representation issue and that can round taking account this issue? Maybe C already has such clever rounding function? Maybe I can select different value for constant ad - such that probability of such rounding errors goes to zero (I mostly work with money values with up to 4 decimal ciphers).
I have read all the popular articles about floating point representation and I know that there are tricky and unsolvable issues, but my client does not accept such an explanation, because the client can clearly demonstrate that Excel handles (reproduces, rounds and so on) floating point numbers without representation issues.
(The C and C++ standards are intentionally flexible when it comes to the specification of the double type; quite often it is IEEE754 64 bit type. So your observed result is platform-dependent).
You are observing one of the pitfalls of using floating point types.
Sadly there isn't an "out-of-the-box" fix for this. (Adding a small constant pre-rounding just pushes the problem to other numbers).
Moral of the story: don't use floating point types for money.
Use a special currency type instead, or work in "pence" using an integral type.
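A minimal sketch of the "work in pence" idea, assuming amounts fit in 64 bits (the cents typedef is illustrative):

#include <stdio.h>
#include <inttypes.h>

/* Sketch: represent money as an exact integer count of cents. */
typedef int64_t cents;

int main(void)
{
    cents price = 847;                 /* 8.47, stored exactly */
    cents total = price * 3;           /* exact integer arithmetic */
    printf("total: %" PRId64 ".%02" PRId64 "\n", total / 100, total % 100);
    return 0;
}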
By the way, Excel does use an IEEE754 double precision floating point for its number type, but it also has some clever tricks up its sleeve. Essentially it tracks the significant digits carefully and is also clever with its formatting. This is how it can evaluate 1/3 + 1/3 + 1/3 exactly. But even it will get money calculations wrong sometimes.
For financial calculations, it is better to work in base 10 to avoid representation issues when going to/from binary. In many countries, financial software is even legally required to do so. Here is one library for IEEE 754R Decimal Floating-Point Arithmetic; I have not tried it myself:
http://www.netlib.org/misc/intel/
Also note that working in decimal floating-point instead of a fixed-point representation allows clever algorithms like the Kahan summation algorithm to avoid accumulation of rounding errors. A noteworthy difference from normal floating point is that numbers with few significant digits are not normalized, so you can have e.g. both 1*10^2 and .1*10^3.
An implementation note is that one representation in the standard uses a binary significand, to allow software implementations using a standard binary ALU.
How about this one: define some threshold, namely the distance to the next multiple of 0.005 within which you assume the gap is an error of imprecision. Round as usual; at the end, if you detected that the value fell within that distance below the boundary, add 0.01.
That said, this is only a work around and somewhat of a code smell. If you don't need too much speed, go for some other type than float. Like your own type that works like
class myDecimal { int digits; int exponent_of_ten; };  // value = digits * 10^exponent_of_ten
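For example (an illustrative encoding, not from the answer), 8.475 would then be stored exactly as digits = 8475, exponent_of_ten = -3.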
I am not trying to argue that using floating point numbers to represent money is advisable - it is not! But sometimes you have no choice... We do work with money (life insurance calculations) and are forced to use floating point numbers for everything, including values representing money.
Now there are quite a few different rounding behaviours out there: round up, round down, round half up, round half down, round half even, maybe more. It looks like you were after the round-half-up method.
Our round-half-up function - here translated from Java - looks like this:
#include <iostream>
#include <cmath>
#include <cfloat>
using namespace std;
int main()
{
    double value = 8.47499999999999964473;
    double result = value * pow(10, 2);
    result = nextafter(result + (result > 0.0 ? 1e-8 : -1e-8), DBL_MAX);
    double integral = floor(result);
    double fraction = result - integral;
    if (fraction >= 0.5) {
        result = ceil(result);
    } else {
        result = integral;
    }
    result /= pow(10, 2);
    cout << result << endl;
    return 0;
}
where nextafter is a function returning the next floating point value after the given value - this code is proven to work using C++11 (AFAIK nextafter is also available in Boost); the result written to the standard output is 8.48.

Algorithm for tetration to work with floating point numbers

Tetration is the level above exponentiation (e.g. 2^^4 = 2^(2^(2^2)) = 65536).
So far, I've figured out an algorithm for tetration that works.
However, although the variable a can be a floating point or integer value, the variable b must, unfortunately, be an integer.
How can I modify the pseudo-code algorithm so that both a and b can be floating point numbers and the correct answer will be produced?
// Hyperoperation type 4:
public float tetrate(float a, float b)
{
    float total = a;
    for (int i = 1; i < b; i++) total = pow(a, total);
    return total;
}
In an attempt to solve this, I've created my own custom power() function (trying to avoid roots and log functions), and then successfully generalized it to multiplication. Unfortunately, when I then try to generalize to tetration, the numbers go pear-shaped.
I would like an algorithm to be precise up to x amount of decimal places, and not an approximation as Wikipedia talks about. To clarify, preferably, it would need to satisfy at least the first three requirements, and the fourth requirement can be up to the answerer.
base_num ^^ tetration_num =
e^( base_num * ln (e^(tetration_num * ln base_num)))
Natural log can be calculated with a Taylor series to a whatever accuracy you need.
e^x can also be calculated to whatever accuracy you need with a Taylor series.
With some care about over/underflow, you should be able to work with whatever values you need using the above.
Just in case you need the series, this page lists the ones you would need. Having coded something similar to this in fixed point math (ints, no floats), I can say that it isn't all that hard to get up and running, but you need to be careful about the order in which you do things or you will overflow numbers quickly.
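As a small illustration of the Taylor-series claim, here is a minimal e^x sketch (the exp_taylor name and the convergence cutoff are illustrative; for large |x|, apply range reduction first to control rounding and overflow):

#include <math.h>

double exp_taylor(double x)
{
    double term = 1.0, sum = 1.0;
    for (int n = 1; n < 200; n++) {
        term *= x / n;                    /* builds x^n / n! incrementally */
        sum += term;
        if (fabs(term) <= fabs(sum) * 1e-17)
            break;                        /* converged to double precision */
    }
    return sum;
}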
Update
It turns out my above only works for some tetrations as I did not fully understand how tetrations work. Silly rabbit.

Compiler does not recognise matching float values [duplicate]

I know UIKit uses CGFloat because of the resolution independent coordinate system.
But every time I want to check if, for example, frame.origin.x is 0, it makes me feel sick:
if (theView.frame.origin.x == 0) {
    // do important operation
}
Isn't CGFloat vulnerable to false positives when comparing with ==, <=, >=, <, >?
They are floating point values, and floating point has imprecision problems: 0.0000000000041, for example.
Is Objective-C handling this internally when comparing, or can it happen that an origin.x which reads as zero does not compare to 0 as true?
First of all, floating point values are not "random" in their behavior. Exact comparison can and does make sense in plenty of real-world usages. But if you're going to use floating point you need to be aware of how it works. Erring on the side of assuming floating point works like real numbers will get you code that quickly breaks. Erring on the side of assuming floating point results have large random fuzz associated with them (like most of the answers here suggest) will get you code that appears to work at first but ends up having large-magnitude errors and broken corner cases.
First of all, if you want to program with floating point, you should read this:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
Yes, read all of it. If that's too much of a burden, you should use integers/fixed point for your calculations until you have time to read it. :-)
Now, with that said, the biggest issues with exact floating point comparisons come down to:
The fact that lots of values you may write in the source, or read in with scanf or strtod, do not exist as floating point values and get silently converted to the nearest approximation. This is what demon9733's answer was talking about.
The fact that many results get rounded due to not having enough precision to represent the actual result. An easy example where you can see this is adding x = 0x1fffffe and y = 1 as floats. Here, x has 24 bits of precision in the mantissa (ok) and y has just 1 bit, but when you add them, their bits are not in overlapping places, and the result would need 25 bits of precision. Instead, it gets rounded (to 0x2000000 in the default rounding mode).
The fact that many results get rounded due to needing infinitely many places for the correct value. This includes both rational results like 1/3 (which you're familiar with from decimal where it takes infinitely many places) but also 1/10 (which also takes infinitely many places in binary, since 5 is not a power of 2), as well as irrational results like the square root of anything that's not a perfect square.
Double rounding. On some systems (particularly x86), floating point expressions are evaluated in higher precision than their nominal types. This means that when one of the above types of rounding happens, you'll get two rounding steps, first a rounding of the result to the higher-precision type, then a rounding to the final type. As an example, consider what happens in decimal if you round 1.49 to an integer (1), versus what happens if you first round it to one decimal place (1.5) then round that result to an integer (2). This is actually one of the nastiest areas to deal with in floating point, since the behaviour of the compiler (especially for buggy, non-conforming compilers like GCC) is unpredictable.
Transcendental functions (trig, exp, log, etc.) are not specified to have correctly rounded results; the result is just specified to be correct within one unit in the last place of precision (usually referred to as 1ulp).
When you're writing floating point code, you need to keep in mind what you're doing with the numbers that could cause the results to be inexact, and make comparisons accordingly. Often times it will make sense to compare with an "epsilon", but that epsilon should be based on the magnitude of the numbers you are comparing, not an absolute constant. (In cases where an absolute constant epsilon would work, that's strongly indicative that fixed point, not floating point, is the right tool for the job!)
Edit: In particular, a magnitude-relative epsilon check should look something like:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y))
Where FLT_EPSILON is the constant from float.h (replace it with DBL_EPSILON for doubles or LDBL_EPSILON for long doubles) and K is a constant you choose such that the accumulated error of your computations is definitely bounded by K units in the last place (and if you're not sure you got the error bound calculation right, make K a few times bigger than what your calculations say it should be).
Finally, note that if you use this, some special care may be needed near zero, since FLT_EPSILON does not make sense for denormals. A quick fix would be to make it:
if (fabs(x-y) < K * FLT_EPSILON * fabs(x+y) || fabs(x-y) < FLT_MIN)
and likewise substitute DBL_MIN if using doubles.
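Packaged as a helper, that check might look like this minimal sketch (the nearly_equal name is illustrative; K remains the caller's error-bound estimate):

#include <math.h>
#include <float.h>

/* Sketch: magnitude-relative comparison with a guard near zero. */
int nearly_equal(double x, double y, double K)
{
    return fabs(x - y) < K * DBL_EPSILON * fabs(x + y)
        || fabs(x - y) < DBL_MIN;
}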
Since 0 is exactly representable as an IEEE754 floating-point number (or using any other implementation of f-p numbers I've ever worked with) comparison with 0 is probably safe. You might get bitten, however, if your program computes a value (such as theView.frame.origin.x) which you have reason to believe ought to be 0 but which your computation cannot guarantee to be 0.
To clarify a little, a computation such as :
areal = 0.0
will (unless your language or system is broken) create a value such that (areal==0.0) returns true but another computation such as
areal = 1.386 - 2.1*(0.66)
may not.
If you can assure yourself that your computations produce values which are 0 (and not just that they produce values which ought to be 0) then you can go ahead and compare f-p values with 0. If you can't assure yourself to the required degree, best stick to the usual approach of 'toleranced equality'.
In the worst cases the careless comparison of f-p values can be extremely dangerous: think avionics, weapons-guidance, power-plant operations, vehicle navigation, almost any application in which computation meets the real world.
For Angry Birds, not so dangerous.
I want to give a bit of a different answer than the others. They are great for answering your question as stated but probably not for what you need to know or what your real problem is.
Floating point in graphics is fine! But there is almost no need to ever compare floats directly. Why would you need to do that? Graphics uses floats to define intervals. And comparing if a float is within an interval also defined by floats is always well defined and merely needs to be consistent, not accurate or precise! As long as a pixel (which is also an interval!) can be assigned that's all graphics needs.
So if you want to test if your point is outside a [0..width[ range this is just fine. Just make sure you define inclusion consistently. For example always define inside is (x>=0 && x < width). The same goes for intersection or hit tests.
However, if you are abusing a graphics coordinate as some kind of flag, like for example to see if a window is docked or not, you should not do this. Use a boolean flag that is separate from the graphics presentation layer instead.
Comparing to zero can be a safe operation, as long as the zero wasn't a calculated value (as noted in an above answer). The reason for this is that zero is a perfectly representable number in floating point.
Talking about perfectly representable values, you get 24 bits of range in a power-of-two notion (single precision). So 1, 2, 4 are perfectly representable, as are .5, .25, and .125. As long as all your important bits fit in 24 bits, you are golden. So 10.625 can be represented precisely.
This is great, but will quickly fall apart under pressure. Two scenarios spring to mind:
1) When a calculation is involved. Don't trust that sqrt(3)*sqrt(3) == 3. It just won't be that way. And it probably won't be within an epsilon, as some of the other answers suggest.
2) When any non-power-of-2 (NPOT) is involved. So it may sound odd, but 0.1 is an infinite series in binary and therefore any calculation involving a number like this will be imprecise from the start.
(Oh and the original question mentioned comparisons to zero. Don't forget that -0.0 is also a perfectly valid floating-point value.)
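A tiny demonstration sketch of the first scenario (on a typical IEEE platform the computed product is not exactly 3.0, though the precise output is platform-dependent):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double v = sqrt(3.0) * sqrt(3.0);
    printf("v = %.17g, v == 3.0 is %d\n", v, v == 3.0);
    return 0;
}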
[The 'right answer' glosses over selecting K. Selecting K ends up being just as ad-hoc as selecting VISIBLE_SHIFT but selecting K is less obvious because unlike VISIBLE_SHIFT it is not grounded on any display property. Thus pick your poison - select K or select VISIBLE_SHIFT. This answer advocates selecting VISIBLE_SHIFT and then demonstrates the difficulty in selecting K]
Precisely because of round errors, you should not use comparison of 'exact' values for logical operations. In your specific case of a position on a visual display, it can't possibly matter if the position is 0.0 or 0.0000000003 - the difference is invisible to the eye. So your logic should be something like:
#define VISIBLE_SHIFT 0.0001 // for example
if (fabs(theView.frame.origin.x) < VISIBLE_SHIFT) { /* ... */ }
However, in the end, 'invisible to the eye' will depend on your display properties. If you can upper bound the display (you should be able to); then choose VISIBLE_SHIFT to be a fraction of that upper bound.
Now, the 'right answer' rests upon K so let's explore picking K. The 'right answer' above says:
K is a constant you choose such that the accumulated error of your
computations is definitely bounded by K units in the last place (and
if you're not sure you got the error bound calculation right, make K a
few times bigger than what your calculations say it should be)
So we need K. If getting K is more difficult, less intuitive than selecting my VISIBLE_SHIFT then you'll decide what works for you. To find K we are going to write a test program that looks at a bunch of K values so we can see how it behaves. Ought to be obvious how to choose K, if the 'right answer' is usable. No?
We are going to use, as the 'right answer' details:
if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
Let's just try all values of K:
#include <math.h>
#include <float.h>
#include <stdio.h>
int main (void)
{
    double x = 1e-13;
    double y = 0.0;
    double K = 1e22;
    int i = 0;

    for (; i < 32; i++, K = K/10.0)
    {
        printf ("K:%40.16lf -> ", K);
        if (fabs(x-y) < K * DBL_EPSILON * fabs(x+y) || fabs(x-y) < DBL_MIN)
            printf ("YES\n");
        else
            printf ("NO\n");
    }
    return 0;
}
ebg#ebg$ gcc -o test test.c
ebg#ebg$ ./test
K:10000000000000000000000.0000000000000000 -> YES
K: 1000000000000000000000.0000000000000000 -> YES
K: 100000000000000000000.0000000000000000 -> YES
K: 10000000000000000000.0000000000000000 -> YES
K: 1000000000000000000.0000000000000000 -> YES
K: 100000000000000000.0000000000000000 -> YES
K: 10000000000000000.0000000000000000 -> YES
K: 1000000000000000.0000000000000000 -> NO
K: 100000000000000.0000000000000000 -> NO
K: 10000000000000.0000000000000000 -> NO
K: 1000000000000.0000000000000000 -> NO
K: 100000000000.0000000000000000 -> NO
K: 10000000000.0000000000000000 -> NO
K: 1000000000.0000000000000000 -> NO
K: 100000000.0000000000000000 -> NO
K: 10000000.0000000000000000 -> NO
K: 1000000.0000000000000000 -> NO
K: 100000.0000000000000000 -> NO
K: 10000.0000000000000000 -> NO
K: 1000.0000000000000000 -> NO
K: 100.0000000000000000 -> NO
K: 10.0000000000000000 -> NO
K: 1.0000000000000000 -> NO
K: 0.1000000000000000 -> NO
K: 0.0100000000000000 -> NO
K: 0.0010000000000000 -> NO
K: 0.0001000000000000 -> NO
K: 0.0000100000000000 -> NO
K: 0.0000010000000000 -> NO
K: 0.0000001000000000 -> NO
K: 0.0000000100000000 -> NO
K: 0.0000000010000000 -> NO
Ah, so K should be 1e16 or larger if I want 1e-13 to be 'zero'.
So, I'd say you have two options:
Do a simple epsilon computation using your engineering judgement for the value of 'epsilon', as I've suggested. If you are doing graphics and 'zero' is meant to be a 'visible change' than examine your visual assets (images, etc) and judge what epsilon can be.
Don't attempt any floating point computations until you've read the non-cargo-cult answer's reference (and gotten your Ph.D in the process) and then use your non-intuitive judgement to select K.
The correct question: how does one compare points in Cocoa Touch?
The correct answer: CGPointEqualToPoint().
A different question: Are two calculated values are the same?
The answer posted here: They are not.
How to check if they are close? If you want to check if they are close, then don't use CGPointEqualToPoint(). But, don't check to see if they are close. Do something that makes sense in the real world, like checking to see if a point is beyond a line or if a point is inside a sphere.
The last time I checked the C standard, there was no requirement for floating point operations on doubles (64 bits total, 53 bit mantissa) to be accurate to more than that precision. However, some hardware might do the operations in registers of greater precision, and the requirement was interpreted to mean no requirement to clear lower order bits (beyond the precision of the numbers being loaded into the registers). So you could get unexpected results of comparisons like this depending on what was left over in the registers from whoever slept there last.
That said, and despite my efforts to expunge it whenever I see it, the outfit where I work has lots of C code that is compiled using gcc and run on linux, and we have not noticed any of these unexpected results in a very long time. I have no idea whether this is because gcc is clearing the low-order bits for us, the 80-bit registers are not used for these operations on modern computers, the standard has been changed, or what. I'd like to know if anyone can quote chapter and verse.
You can use code like this to compare a float with zero:
if ((int)(theView.frame.origin.x * 100) == 0) {
    // do important operation
}
This compares with 0.01 accuracy, which is enough for CGFloat in this case.
Another issue that may need to be kept in mind is that different implementations do things differently. One example of this that I am very familiar with is the FP units on the Sony Playstation 2. They have significant discrepancies when compared to the IEEE FP hardware in any X86 device. The cited article mentions the complete lack of support for inf and NaN, and it gets worse.
Less well known is what I came to know as the "one bit multiply" error. For certain values of float x:
y = x * 1.0;
assert(y == x);
would fail the assert. In the general case, sometimes, but not always, the result of a FP multiply on the Playstation 2 had a mantissa that was a single bit less than the equivalent IEEE mantissa.
My point being that you should not assume that porting FP code from one platform to another will produce the same results. Any given platform is internally consistent, in that results don't change on that platform; it's just that they may not agree with a different platform. E.g. CPython on X86 uses 64 bit doubles to represent floats, while CircuitPython on a Cortex M0 has to use software FP, and only uses 32 bit floats. Needless to say, that will introduce discrepancies.
A quote I learned over 40 years ago is as true today as the day I learned it. "Doing floating point maths on a computer is like moving a pile of sand. Every time you do anything, you leave a little sand behind and pick up a little dirt."
Playstation is a registered trademark of Sony Corporation.
- (BOOL)isFloatEqual:(CGFloat)firstValue secondValue:(CGFloat)secondValue {
    BOOL isEqual = NO;
    NSNumber *firstValueNumber = [NSNumber numberWithDouble:firstValue];
    NSNumber *secondValueNumber = [NSNumber numberWithDouble:secondValue];
    isEqual = [firstValueNumber isEqualToNumber:secondValueNumber];
    return isEqual;
}
I am using the following comparison function to compare a number of decimal places:
#include <cmath>
#include <cstdint>

bool compare(const double value1, const double value2, const int precision)
{
    int64_t magnitude = static_cast<int64_t>(std::pow(10, precision));
    int64_t intValue1 = static_cast<int64_t>(value1 * magnitude);
    int64_t intValue2 = static_cast<int64_t>(value2 * magnitude);
    return intValue1 == intValue2;
}

// Compare 9 decimal places:
if (compare(theView.frame.origin.x, 0, 9)) {
    // do important operation
}
I'd say the right thing is to declare each number as an object, and then define three things in that object: 1) an equality operator, 2) a setAcceptableDifference method, and 3) the value itself. The equality operator returns true if the absolute difference of the two values is less than the value set as acceptable.
You can subclass the object to suit the problem. For example, round bars of metal between 1 and 2 inches might be considered of equal diameter if their diameters differed by less than 0.0001 inches. So you'd call setAcceptableDifference with parameter 0.0001, and then use the equality operator with confidence.
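A minimal sketch of that design in plain C (the struct and helper names are illustrative; taking the smaller of the two tolerances is one possible design choice):

#include <math.h>
#include <stdbool.h>

typedef struct {
    double value;
    double acceptable_difference;   /* e.g. 0.0001 for the bar diameters */
} toleranced;

/* equal if the absolute difference is below both tolerances */
bool toleranced_equal(toleranced a, toleranced b)
{
    double tol = fmin(a.acceptable_difference, b.acceptable_difference);
    return fabs(a.value - b.value) < tol;
}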

Double precision computations

I am trying to compute numerically (using analytical formulae) the values of the following sequence of integrals:
I(k,t) = int_0^{N/2-1} u^k e^(-i*u*delta*t) du
where "i" is the imaginary unit. For small k, this integral can be computed by hand, but for larger k it is more convenient to notice that there is an iterative relationship between the terms of sequence that can be derived by integration by parts. This is implemented below by the function i1.
#include <stdlib.h>
#include <math.h>
#include <complex.h>

void i1(int N, double t, double delta, double complex **result){
    int k;
    (*result) = (double complex *)malloc(sizeof(double complex) * N);
    if (t == 0) {
        for (k = 0; k < N; k++) {
            (*result)[k] = pow(N-2, k+1) / (pow(2, k+1) * (k+1));
        }
    }
    else {
        (*result)[0] = 2/(delta*t) * sin(delta*(N-2)*t/4) * cexp(-I*(N-2)*t*delta/4);
        for (k = 1; k < N; k++) {
            (*result)[k] = I/(delta*t) * (pow(N-2, k)/pow(2, k) * cexp(-I*delta*(N-2)*t/2) - k*(*result)[k-1]);
        }
    }
}
The problem is that in my case t is very small (1e-12) and delta is typically around 1e6. When testing the case N=4, I noticed some weird results appearing for k=3; namely, the results were suddenly very large, much larger than they should be, since the norm of an integral is always smaller than the integral of the norm. The results of the test are printed below:
I1(0,1.0000e-12)=1.0000000000e+00+-5.0000000000e-07I
Norm=1.0000000000e+00
compare = 1.0000000000e+00
I1(1,1.0000e-12)=5.0000000000e-01+-3.3328895199e-07I
Norm=5.0000000000e-01
compare = 5.0000000000e-01
I1(2,1.0000e-12)=3.3342209601e-01+-2.5013324745e-07I
Norm=3.3342209601e-01
compare = 3.3333333333e-01
I1(3,1.0000e-12)=2.4960025766e-01+-2.6628804517e+02I
Norm=2.6628816215e+02
compare = 2.5000000000e-01
k=3 not being particularly big, I computed the value of the integral by hand using the analytical formula and a calculator, and I obtained the same larger-than-expected results for the imaginary part in that case. I also realized that if I changed the order of the terms, the result changed. It therefore appears to be a problem with precision: in the iterative process there is a subtraction of very large but almost equal terms, and following what was said in this thread: How to divide tiny double precision numbers correctly without precision errors?, this can cause small errors to be amplified. However, I am finding it difficult to see how to resolve the issue in my case, and I was also wondering if someone could briefly explain why this occurs?
You have to be very careful with floating point addition and subtraction.
Suppose a decimal floating point with 6 digits precision (to keep things simple). Adding/subtracting a small number to/from a large one discards some or even all of the smaller. So:
5.00000E+9 + 1.45678E+4 is: (5.00000 + 0.0000145678)E+9 -> 5.00001E+9
which is as good as it gets. But if you add a series of small numbers to a large one, then you may be better off adding the small numbers together first, and adding the result to the large number.
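A small demonstration sketch of that advice, using floats so the effect is easy to provoke (the values are illustrative):

#include <stdio.h>

int main(void)
{
    float big = 1.0e8f, small = 1.0f;
    float naive = big, grouped = 0.0f;

    for (int i = 0; i < 1000; i++) naive += small;    /* each 1.0f is lost: ulp near 1e8f is 8 */
    for (int i = 0; i < 1000; i++) grouped += small;  /* sum the small terms first */

    printf("naive:   %.1f\n", naive);                 /* stays 100000000.0 */
    printf("grouped: %.1f\n", big + grouped);         /* 100001000.0 */
    return 0;
}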
Subtraction of similar size numbers is another way of losing precision. So:
5.12346E+4 - 5.12345E+4 = 1.00000E-1
Now, each of the two numbers is at best its real value +/- half the least significant digit, in this case 0.5E-1 -- which is a relative error of about +/- 1E-6. The result of the subtraction is still +/- 0.5E-1 (we cannot reduce the error!), which is a relative error of +/- 0.5 !!!
Multiplication and division are much better behaved -- until you over-/under-flow.
But as soon as you are doing anything iterative with add/subtract, keep saying (loudly) to yourself: floating point numbers are not (entirely) like real numbers.

Resources