How to check for floating point precision in Double

Take this simple function for example.
int checkIfTriangleIsValid(double a, double b, double c) {
//fix the precision problem
int c1, c2, c3;
c1 = a+b>c ? 0 : 1;
c2 = b+c>a ? 0 : 1;
c3 = c+a>b ? 0 : 1;
if(c1 == 0 && c2 == 0 && c3 == 0)
return 0;
else {
printf("%d, %d, %d\n",c1, c2, c3);
return 1;
}
}
I pass a = 1.923, b = 59.240, c = 61.163.
Now for some reason, when I check the condition for c1 it should give me 1, but instead it gives me 0. I tried a printf with %.30f and found that the stored values differ from the ones I typed.
How can I fix this problem?
EDIT: I checked the other questions that are similar to mine but they don't even have a double.

Likely your C implementation uses the IEEE-754 basic 64-bit binary floating-point format for double. When 1.923, 59.240, and 61.163 are properly converted to the nearest values representable in double, the results are exactly:
1.9230000000000000426325641456060111522674560546875,
59.24000000000000198951966012828052043914794921875, and
61.1629999999999967030817060731351375579833984375.
As you can see, the first two of these sum to more than the third. This means that, by the time you assign these values to double objects, they have already been altered in a way that changes their relationship. No subsequent calculations can repair this, because the original information is gone.
Since no solution after conversion to double can work, you need a solution that operates before or instead of conversion to double. If you want to compute exactly, or more precisely, with the values 1.923, 59.240, and 61.163, you may need to write your own decimal arithmetic code or find some other code that supports decimal arithmetic. If you only want to work with numbers with three decimal places, then a possible solution is to write some code that reads input such as “59.240” and returns it in an integer object scaled by 1000, so that 59240 is returned. The resulting values could then easily be tested for the triangle inequality.

when I check for the condition in c1 it should give me 1, but instead it gives me 0
How can I fix this problem?
Change your expectations.
A typical double can represent exactly about 2^64 different values. 1.923, 59.240, and 61.163 are typically not in that set, as double is usually encoded in binary, e.g. binary64.
When a, b, c are assigned 1.923, 59.240, 61.163, they get values more like the ones below, which are the closest double values.
a 1.923000000000000042632564145606...
b 59.240000000000001989519660128281...
c 61.162999999999996703081706073135...
In my case, the a, and b both received a slightly higher value than the decimal code form, while c received a slightly lower one.
When adding a+b, the sum was rounded up, further away from c.
printf("a+b %35.30f\n", a+b);
a+b 61.163000000000003808509063674137
a + b > c was true, as well as other compares and OP's
checkIfTriangleIsValid(1.923, 59.240, 61.163) should return valid (0) as it is really more like checkIfTriangleIsValid(1.9230000000000000426..., 59.24000000000000198..., 61.16299999999999670...)
Adding a+b is further complicated in that the addition may occur using double or long double math. Research FLT_EVAL_METHOD for details. Rounding mode also can affect the final sum.
#include <float.h>
printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
As to an alternative triangle check, subtract the largest 2 values and then compare against the smallest.
a > (c-b) can preserve significantly more precision than (a+b) > c.
// Assume a,b,c >= 0
int checkIfTriangleIsValid_2(double a, double b, double c) {
// Sort so `c` is largest, then b, a.
if (c < b) {
double t = b; b = c; c = t;
}
if (c < a) {
double t = a; a = c; c = t;
}
if (a > b) {
double t = b; b = a; a = t;
}
// So far, no loss of precision is expected due to compares/swaps.
// Only now need to check a + b > c for a valid triangle
// To preserve precision, subtract from `c` the value closest to it (`b`).
return a > (c-b);
}
I will review more later as time permits. This approach significantly helps toward a precise answer, yet more edge cases need to be assessed. It reports a valid triangle for checkIfTriangleIsValid_2(1.923, 59.240, 61.163).
FLT_EVAL_METHOD, rounding mode and double encoding can result in different answers on other platforms.
Notes:
It appears a checkIfTriangleIsValid() returning 0 means valid triangle.
It also appears when the triangle has 0 area, the expected result is 1 or invalid.

Related

Floating-point arithmetic in C: epsilon comparison

I'm trying to compare values with double precision using epsilon. However, I have a problem - initially I have thought that the difference should be equal to the epsilon, but it isn't. Additionally, when I've tried to check the binary representation using the successive multiplication something strange has happened and I feel confused, therefore I would appreciate your explanation to the problem and comments on my way of thinking
#include <stdio.h>
#define EPSILON 1e-10
void double_equal(double a, double b) {
printf("a: %.12f, b: %.12f, a - b = %.12f\n", a, b, a - b);
printf("a: %.12f, b: %.12f, b - a = %.12f\n", a, b, b - a);
if (a - b < EPSILON) printf("a - b < EPSILON\n");
if (a - b == EPSILON) printf("a - b == EPSILON\n");
if (a - b <= EPSILON) printf("a - b <= EPSILON\n");
if (b - a <= EPSILON) printf("b - a <= EPSILON\n");
}
int main(void) {
double wit1 = 1.0000000001;
double wit2 = 1.0;
double_equal(wit1, wit2);
return 0;
}
The output is:
a: 1.000000000100, b: 1.000000000000, a - b = 0.000000000100
a: 1.000000000100, b: 1.000000000000, b - a = -0.000000000100
b - a <= EPSILON
Numeric constants in C are declared as doubles if we don't provide "F"/"f" sign right after the number (#define EPSILON 1e-10F), therefore I can't see here the problem of conversion as in this question. Therefore, I have created really simple program for THESE SPECIFIC examples (I know it should include handling converting integral parts to binary numbers).
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
char* convert(double a) {
char* res = malloc(200);
int count = 0;
double integral;
a = modf(a, &integral);
if (integral == 1) {
res[count++] = integral + '0';
res[count++] = '.';
} else {
res[count++] = '0';
res[count++] = '.';
}
while(a != 0 && count < 200) {
printf("%.100f\n", a);
a *= 2;
a = modf(a, &integral);
if (integral == 1) res[count++] = integral + '0';
else res[count++] = '0';
}
res[count] = '\0';
return res;
}
int main(void) {
double wit1 = 1.0000000001;
double diff = 0.0000000001;
char* res = convert(wit1);
char* di = convert(diff);
printf("this: %s\n", res);
printf("diff: %s\n", di);
return 0;
}
Direct output:
this: 1.0000000000000000000000000000000001101101111100111
diff: 0.00000000000000000000000000000000011011011111001101111111011001110101111011110110111011
First question: why there are so many ending zero-ones in the difference? Why do the results after the binary point differ?
However, if we look at the process of calculation and the fractional part, printed out (I'm presenting only the first few lines):
1.0000000001:
0.0000000001000000082740370999090373516082763671875000000000000000000000000000000000000000000000000000
0.0000000002000000165480741998180747032165527343750000000000000000000000000000000000000000000000000000
0.0000000004000000330961483996361494064331054687500000000000000000000000000000000000000000000000000000
0.0000000001:
0.0000000001000000000000000036432197315497741579165547065599639608990401029586791992187500000000000000
0.0000000002000000000000000072864394630995483158331094131199279217980802059173583984375000000000000000
0.0000000004000000000000000145728789261990966316662188262398558435961604118347167968750000000000000000
Second question: why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?
Analyzing the subtraction, I can see, why the result is bigger than the epsilon. I follow the procedure:
Prepare a complement sequence of zero-ones for the sequence to subtract
"Add" the sequences
Subtract the one in the beginning, add it to the rightmost bit
Therefore:
1.0000000000000000000000000000000001101101111100111
- 1.0000000000000000000000000000000000000000000000000
|
\/
1.0000000000000000000000000000000001101101111100111
"+"0.1111111111111111111111111111111111111111111111111
--------------------------------------------------------
10.0000000000000000000000000000000001101101111100110
|
\/
0.0000000000000000000000000000000001101101111100111
Comparing with the calculated value of epsilon:
0.000000000000000000000000000000000110110111110011 0 1111111011001110101111011110110111011
0.000000000000000000000000000000000110110111110011 1
Spaces indicate the difference.
Third question: do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?

Why do the results after the binary point differ?
Because that is the difference.
Expecting something else comes from thinking 1.0000000001 and 0.0000000001 as double have those 2 values. They do not. Their difference is not 1.0. They have values near those two, each with about 53 binary digits of significance. Their difference is close to the unit in the last place of 1.0000000001.
why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?
Somewhat.
double can encode about 2^64 different numbers. 1.0000000001 and 0.0000000001 are not in that set. Instead, nearby values are used, and those are what look like strange ending numbers.
do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?
Yes, change use of epsilon. epsilon is useful for the relative difference, not absolute one. Very large consecutive double values are far more than epsilon apart. About 45% of all double, (the small ones) are all less than epsilon in magnitude. Either if (a - b <= EPSILON) printf("a - b <= EPSILON\n"); or if (b - a <= EPSILON) printf("b - a <= EPSILON\n"); will be true for small a, b even though they are trillions of times different in magnitude.
Oversimplification:
if (fabs(a-b) < EPSILON*fabs(a + b)) {
return values_a_b_are_near_each_other;
}

This answer assumes your C implementation uses IEEE-754 binary64, also known as the “double” format for its double type. This is common.
If the C implementation rounds correctly, then double wit1 = 1.0000000001; initializes wit1 to 1.0000000001000000082740370999090373516082763671875. This is because the two representable values nearest 1.0000000001 are 1.000000000099999786229432174877729266881942749023437500000000000000000000000000000000000000000000000 and 1.0000000001000000082740370999090373516082763671875. The latter is chosen since it is closer.
If correctly rounded, the 1e-10 used for EPSILON will produce 0.000000000100000000000000003643219731549774157916554706559963960899040102958679199218750000000000000.
Clearly wit1 - 1 exceeds EPSILON, so the test a - b < EPSILON in double_equal evaluates as false.
First question: why there are so many ending zero-ones in the difference?
Count the number of bits from the first 1 to the last 1. In each number, there are 53. That is because there are 53 bits in the significand of a double. It is a bit of a coincidence your numbers happened to end in a 1 bit. About half the time, the trailing bit is 0, and a quarter of the time, the last two bits are zeros, and so on. However, since there are 53 bits in the significand of a double, there will be exactly 53 bits from the first 1 bit to the last bit that is part of the represented value.
Since your first number starts with a 1 in the integer position, it has at most 52 bits after that. At that point, the number must be rounded to the nearest representable value.
Since your second number is between 2^-34 and 2^-33, its first 1 bit is in the 2^-34 position, and it can go to the 2^-86 position before it has to be rounded.
Third question: do I have to worry if I can't compare the value equal to the epsilon?
Why do you want to compare to the epsilon? There is no general solution for comparing floating-point numbers that contain errors from previous operations. Whether or not an “epsilon comparison” can or should be used is dependent on the application and the operations and numbers involved.

Inequalities in c not working

I'm learning c, and am confused as my code seems to evaluate ( 1e16 - 1 >= 1e16 ) as true when it should be false. My code is below, it returns
9999999999999999 INVALIDBIG\n
when I would expect it not to return anything. I thought any problems with large numbers could be avoided by using long long.
int main(void)
{
long long z;
z = 9999999999999999;
if ( z >= 1e16 || z < 0 )
{
printf("%lli INVALIDBIG\n",z);
}
}

1e16 is a literal of type double, and floats/doubles can be imprecise for decimal arithmetic/comparison (just one of many common examples: decimal 0.2). The comparison converts the long long z to double, and the standard double representation can't store the precision needed: 9999999999999999 needs 54 bits, while a double has a 53-bit significand (maybe someone else can demonstrate the binary mantissa/sign representations).
Try changing the 1e16 to (long double)1e16; on platforms where long double has a wider significand than double, it then doesn't print your message. (Update: or, as the other question-commenter added, change 1e16 to an integer literal.)

Doubles and floats can hold only a limited number of digits. In your case the double values converted from 9999999999999999 and 1e16 have identical 8-byte representations. You can check them byte by byte:
long long z = 9999999999999999;
double dz1 = z;
double dz2 = 1e16;
/* prints 0 */
printf("memcmp: %d\n", memcmp(&dz1, &dz2, sizeof(double)));
So, they are equal.
Smaller integers can be stored in double with perfect precision. For example, see Double-precision floating-point format or biggest integer that can be stored in a double
The largest integer up to which every integer can be converted to double exactly is 2^53 (9007199254740992).

float vs double comparison [duplicate]

This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
int main(void)
{
float me = 1.1;
double you = 1.1;
if ( me == you ) {
printf("I love U");
} else {
printf("I hate U");
}
}
This prints "I hate U". Why?

Floats use binary fraction. If you convert 1.1 to float, this will result in a binary representation.
Each bit to the right of the binary point halves the weight of the digit, just as each decimal place divides the weight by ten. Each bit to the left of the point doubles it (times ten for decimal).
in decimal: ... 0*2 + 1*1 + 0*0.5 + 0*0.25 + 0*0.125 + 1*0.0625 + ...
binary:         0     1  .  0       0        0         1        ...
2's exp:        1     0    -1      -2       -3        -4
(each bit's weight is 2 raised to the exponent shown)
Problem is that 1.1 cannot be converted exactly to binary representation. For double, there are, however, more significant digits than for float.
If you compare the values, first, the float is converted to double. But as the computer does not know about the original decimal value, it simply fills the trailing digits of the new double with all 0, while the double value is more precise. So both do compare not equal.
This is a common pitfall when using floats. For this and other reasons (e.g. rounding errors), you should not use an exact comparison for equality or inequality, but a ranged comparison with a small tolerance such as the machine epsilon:
#include <float.h>
#include <math.h>
...
// check for "almost equal"
if ( fabs(fval - dval) <= FLT_EPSILON )
...
Note the usage of FLT_EPSILON, which is the gap between 1.0f and the next larger float, for single-precision values. Also note the <=, not <: for values between 1 and 2, adjacent floats differ by exactly FLT_EPSILON, so < would in effect require an exact match.
If you compare two doubles, you might use DBL_EPSILON, but be careful with that.
Depending on intermediate calculations, the tolerance has to be increased (you cannot reduce it further than epsilon), as rounding errors, etc. will sum up. Floats in general are not forgiving with wrong assumptions about precision, conversion and rounding.
Edit:
As suggested by @chux, this might not work as expected for larger values, as you have to scale EPSILON according to the exponents. This conforms to what I stated: float comparison is not as simple as integer comparison. Think before comparing.

In short, you should NOT use == to compare floating points.
for example
float i = 1.1; // or double
float j = 1.1; // or double
This argument
(i==j) == true // is not always valid
for a correct comparison you should use epsilon (very small number):
(fabs(i-j)<epsilon)== true // this argument is valid; fabs needs <math.h> (abs is for integers)

The question simplifies to why do me and you have different values?
Usually, C floating point is based on a binary representation. Many compilers & hardware follow IEEE 754 binary32 and binary64. Rare machines use a decimal, base-16 or other floating point representation.
OP's machine certainly does not represent 1.1 exactly as 1.1, but to the nearest representable floating point number.
Consider the below which prints out me and you to high precision. The previous representable floating point numbers are also shown. It is easy to see me != you.
#include <math.h>
#include <stdio.h>
int main(void) {
float me = 1.1;
double you = 1.1;
printf("%.50f\n", nextafterf(me,0)); // previous float value
printf("%.50f\n", me);
printf("%.50f\n", nextafter(you,0)); // previous double value
printf("%.50f\n", you);
return 0;
}
Output:
1.09999990463256835937500000000000000000000000000000
1.10000002384185791015625000000000000000000000000000
1.09999999999999986677323704498121514916420000000000
1.10000000000000008881784197001252323389053300000000
But it is more complicated: C allows code to use higher precision for intermediate calculations depending on FLT_EVAL_METHOD. So on another machine, where FLT_EVAL_METHOD==1 (evaluate all FP to double), the compare test may pass.
Comparing for exact equality is rarely used in floating point code, aside from comparison to 0.0. More often code uses an ordered compare a < b. Comparing for approximate equality involves another parameter to control how near. @R.. has a good answer on that.

Because you are comparing two Floating point!
Floating point comparison is not exact because of rounding errors. Simple values like 1.1 or 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. For example:
float a = 0.1f + 0.2f;
double b = 0.3;
if(a == b) // can be false!
if(a <= b) // can also be false!
Even
if(fabs(a-b) < 0.0001) // wrong - don't do this
This is a bad way to do it because a fixed epsilon (0.0001), chosen because it “looks small”, could actually be way too large when the numbers being compared are very small as well.
I personally use the following method; maybe it will help you:
#include <iostream> // std::cout
#include <cmath> // std::abs
#include <algorithm> // std::min
using namespace std;
#define MIN_NORMAL 1.17549435E-38f
#define MAX_VALUE 3.4028235E38f
bool nearlyEqual(float a, float b, float epsilon) {
float absA = std::abs(a);
float absB = std::abs(b);
float diff = std::abs(a - b);
if (a == b) {
return true;
} else if (a == 0 || b == 0 || diff < MIN_NORMAL) {
return diff < (epsilon * MIN_NORMAL);
} else {
return diff / std::min(absA + absB, MAX_VALUE) < epsilon;
}
}
This method passes tests for many important special cases, for different a, b and epsilon.
And don't forget to read What Every Computer Scientist Should Know About Floating-Point Arithmetic!

Comparing two numbers without comparison operators

As part of a program that I am writing for an assignment, I need to compare two numbers. Essentially, the program computes the eccentricity of an ellipse given its two axes and it has to compare the value of the calculated eccentricity to the (given) eccentricity of the Moon's orbit around the Earth, and Earth's orbit around the Sun. If the calculated eccentricity is greater than the given eccentricity, then this needs to be represented by a value of 1, otherwise, a value of 0. All of these values are floating-point, specifically, long double.
The constraints of the assignment do not allow me to use comparison operators (like >) or any sort of logic (!x or if-else). However, I am allowed to use the pow and sqrt functions from the math.h library. I am also allowed to use arithmetic operations as well as the modulo operation.
I know that I can take advantage of integer division to truncate the decimal if the denominator is greater than the numerator, i.e.:
int x = eccentricity / MOON_ORBIT_ECCENTRICITY;
... will be 0 if MOON_ORBIT_ECCENTRICITY is greater than eccentricity. However, if this relationship is inverted, then the value of x could be any non-zero integer. In such a case, the desired result is 1.
The first and most intuitive (and naïve) solution was:
int y = (x / x);
This will return 1 if x is non-zero. However, if x is 0, then my program crashes due to division by zero. In fact, I keep running into the problem of dividing by zero. This also happens in the case of:
int y = (x + 1) % x;
Does anyone have an idea of how to solve this? This seems so frustratingly easy.

@lurker's comment above is a good approach to handle eccentricity as restricted by OP.
So as not to copy that, consider the not-so-serious following:
// Return e1 > e2
int Eccentricity_Compare(long double e1, long double e2) {
char buf[20];
// print a number beginning with
// if e2 >= e1: `+`
// else `-`
sprintf(buf, "%+Le", e2 - e1); // reverse subtraction for = case
const char *pm = "+-";
char *p = strchr(pm, buf[0]);
return (int) (p - pm);
}
Wink, wink: OP said nothing about <stdio.h> functions.

How do I compute maximum/minimum of 8 different float values

I need to find maximum and minimum of 8 float values I get. I did as follows. But float comparisons are going awry as warned by any good C book!
How do I compute the max and min in an accurate way?
main()
{
float mx,mx1,mx2,mx3,mx4,mn,mn1,mn2,mn3,mn4,tm1,tm2;
mx1 = mymax(2.1,2.01); //this returns 2.09999 instead of 2.1 because a is passed as 2.09999.
mx2 = mymax(-3.5,7.000001);
mx3 = mymax(7,5);
mx4 = mymax(7.0000011,0); //this returns incorrectly- 7.000001
tm1 = mymax(mx1,mx2);
tm2 = mymax(mx3,mx4);
mx = mymax(tm1,tm2);
mn1 = mymin(2.1,2.01);
mn2 = mymin(-3.5,7.000001);
mn3 = mymin(7,5);
mn4 = mymin(7.0000011,0);
tm1 = mymin(mx1,mx2);
tm2 = mymin(mx3,mx4);
mn = mymin(tm1,tm2);
printf("Max is %f, Min is %f \n",mx,mn);
getch();
}
float mymax(float a,float b)
{
if(a >= b)
{
return a;
}
else
{
return b;
}
}
float mymin(float a,float b)
{
if(a <= b)
{
return a;
}
else
{
return b;
}
}
How can I do exact comparisons of these floats? This is all C code.
thank you.
-AD.

You are doing exact comparison of these floats. The problem (with your example code at least) is that float simply does not have enough digits of precision to represent the values of your literals sufficiently. 7.000001 and 7.0000011 simply are so close together that the mantissa of a 32 bit float cannot represent them differently.
But the example seems artificial. What is the real problem you're trying to solve? What values will you actually be working with? Or is this just an academic exercise?
The best solution depends on the answer to that. If your actual values just require somewhat more precision than float can provide, use double. If you need exact representation of decimal digits, use a decimal type library. If you want to improve your understanding of how floating point values work, read The Floating-Point Guide.

You can do exact comparison of floats. Either directly as floats, or by casting them to int with the same bit representation.
float a = 1.0f;
float b = 2.0f;
int32_t ia, ib; /* needs <stdint.h> and <string.h> */
memcpy(&ia, &a, sizeof ia); /* copy the bits; a pointer cast would break aliasing rules */
memcpy(&ib, &b, sizeof ib);
/* You can compare a and b, or ia and ib; for non-negative, non-NaN floats
the results will be the same.
IEEE-754 floats of the same sign are ordered like their bits read as integers
(provided that float is 32 bits); negative floats order in reverse because
the format is sign-magnitude.
*/
But you will never be able to represent exactly 2.1 as a float.
Your problem is not a problem of comparison, it is a problem of representation of a value.

I'd claim that these comparisons are actually exact, since no value is altered.
The problem is that many float literals can't be represented exactly by IEEE-754 floating point numbers; 2.1 is one example.
If you need an exact representation of base-10 fractional numbers you could, for example, write your own fixed-point BCD arithmetic.
Concerning finding min and max at the same time:
A way that needs fewer comparisons is to first compare the two elements of each index pair (2*i, 2*i+1), which yields a pair minimum and a pair maximum (n/2 comparisons).
Then find the minimum of the minima (n/2 - 1 comparisons) and the maximum of the maxima (n/2 - 1 comparisons).
So we get 3*n/2 - 2 comparisons instead of the 2*n - 2 needed when finding the minimum and maximum separately. For n = 8 that is 10 instead of 14.

The < and > comparisons always work correctly with floats or doubles. Only the == comparison has problems, which is why you are advised to use an epsilon.
So your method of calculating min and max has no issue. Note that if you use float, you should write the literal as 2.1f instead of 2.1.
