Questions about float characteristics [duplicate] - c

This question already has answers here:
Why are floating point numbers inaccurate?
(5 answers)
Closed 5 years ago.
Q1: For what reason isn't it recommended to compare floats by == or != like in V1?
Q2: Does fabs() in V2 work the same way, like I programmed it in V3?
Q3: Is it ok to use (x >= y) and (x <= y)?
Q4: According to Wikipedia, float has a precision between 6 and 9 decimal digits, in my case 7 digits. So what determines which precision between 6 and 9 digits my float actually has? See [1]
[1] float characteristics
Source: Wikipedia
Type | Size | Precision | Range
Float | 4 bytes = 32 bits | 6-9 decimal digits | up to (2-2^-23)*2^127 ≈ 3.4E+38
Source: tutorialspoint
Type | Size | Precision | Range
Float | 4 bytes = 32 bits | 6 decimal digits | 1.2E-38 to 3.4E+38
Source: chortle
Type | Size | Precision | Range
Float | 4 bytes = 32 bits | 7 decimal digits | -3.4E+38 to +3.4E+38
The following three programs produce the same result, yet the first variant is said to be not recommended.
1. Variant
#include <stdio.h> // printf() scanf()

int main()
{
    float a = 3.1415926;
    float b = 3.1415930;

    if (a == b)
    {
        printf("a(%+.7f) == b(%+.7f)\n", a, b);
    }
    if (a != b)
    {
        printf("a(%+.7f) != b(%+.7f)\n", a, b);
    }
    return 0;
}
V1-Output:
a(+3.1415925) != b(+3.1415930)
2. Variant
#include <stdio.h> // printf() scanf()
#include <float.h> // FLT_EPSILON == 2^-23 ≈ 1.19e-07
#include <math.h>  // fabs()

int main()
{
    float x = 3.1415926;
    float y = 3.1415930;

    if (fabs(x - y) < FLT_EPSILON)
    {
        printf("x(%+.7f) == y(%+.7f)\n", x, y);
    }
    if (fabs(x - y) > FLT_EPSILON)
    {
        printf("x(%+.7f) != y(%+.7f)\n", x, y);
    }
    return 0;
}
V2-Output:
x(+3.1415925) != y(+3.1415930)
3. Variant:
#include <stdio.h>  // printf() scanf()
#include <float.h>  // FLT_EPSILON == 2^-23 ≈ 1.19e-07
#include <stdlib.h> // abs()

int main()
{
    float x = 3.1415926;
    float y = 3.1415930;
    const int FPF = 10000000; // Float_Precision_Factor

    if ((float)(abs((x - y) * FPF)) / FPF < FLT_EPSILON) // if (x == y)
    {
        printf("x(%+.7f) == y(%+.7f)\n", x, y);
    }
    if ((float)(abs((x - y) * FPF)) / FPF > FLT_EPSILON) // if (x != y)
    {
        printf("x(%+.7f) != y(%+.7f)\n", x, y);
    }
    return 0;
}
V3-Output:
x(+3.1415925) != y(+3.1415930)
I am grateful for any help, links, references and hints!

When working with floating-point operations, almost every step may introduce a small rounding error. Convert a number from decimal in the source code to the floating-point format? There is a small error, unless the number is exactly representable. Add two numbers? Their exact sum often has more bits than fit in the floating-point format, so it has to be rounded to fit. The same is true for multiplication and division. Take a square root? The result is usually irrational and cannot be represented in the floating-point format, so it is rounded. Call the library to get the cosine or the logarithm? The exact result is usually irrational, so it is rounded. And most math libraries have some additional error as well, because calculating those functions very precisely is hard.
So, let’s say you calculate some value and have a result in x. It has a variety of errors incorporated into it. And you calculate another value and have a result in y. Suppose that, if calculated with exact mathematics, these two values would be equal. What is the chance that the errors in x and y are exactly the same?
It is unlikely. If x and y were calculated in different ways, they experienced different errors, and it is essentially chance whether they have the same total error or not. Therefore, even if the exact mathematical results would be equal, x == y may be false because of the errors.
Similarly, two exact mathematical values might be different, but the errors might coincide so that x == y returns true.
Therefore x == y and x != y generally cannot be used to tell if the desired exact mathematical values are equal or not.
What can be used? Unfortunately, there is no general solution to this. Your examples use FLT_EPSILON as an error threshold, but that is not useful. After doing more than a few floating-point operations, the error may easily accumulate to more than FLT_EPSILON, either as an absolute error or a relative error.
In order to make a comparison, you need to have some knowledge about how large the accumulated error might be, and that depends greatly on the particular calculations you have performed. You also need to know what the consequences of false positives and false negatives are—is it more important to avoid falsely stating two things are equal or to avoid falsely stating two things are unequal? These issues are specific to each algorithm and its data.
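As an illustration only (not from the answer above), here is a sketch of what a relative-tolerance comparison might look like; the helper name nearly_equal and the tolerance value are assumptions, and a real tolerance would have to come from an error analysis of your actual calculation:

#include <math.h>    // fabsf(), fmaxf()
#include <stdbool.h>

/* Sketch: treat x and y as "equal" if they differ by less than tol
   relative to the larger magnitude. The tolerance is a placeholder;
   choose it from the accumulated error of your computation. */
static bool nearly_equal(float x, float y, float tol)
{
    float diff = fabsf(x - y);
    float largest = fmaxf(fabsf(x), fabsf(y));
    return diff <= largest * tol;
}

For example, nearly_equal(x, y, 8 * FLT_EPSILON) accepts values that differ only in roughly their last few bits, while a fixed absolute threshold would behave very differently for large and for tiny operands.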

Because on a 64-bit machine you will find out that 0.1 * 3 = 0.30000000000000004 :-)
See the links @yano and @PM 77-1 provided as comments.
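For example, a minimal program (my own illustration, not from the linked comments) that makes this visible:

#include <stdio.h>

int main(void)
{
    double d = 0.1 * 3;        // the exact product cannot be stored
    printf("%.17g\n", d);      // typically prints 0.30000000000000004
    printf("%d\n", d == 0.3);  // typically prints 0: the two roundings differ
    return 0;
}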

The machine stores everything using 0s and 1s.
Not every floating-point value is representable in binary within a limited number of bits.
The computer stores the nearest representable binary value of the given number.
So there is a difference between 2.0000001 and 2.0000000 in the eyes of the computer (even though we say they are equal!).
This trouble does not always appear, but it is risky.

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x - 1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x, 2);
        printf("x squared = %Lf\n", a);
        return sqrt(g * g + 1.) - 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x, 2) + 1.) - 1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32-bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
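For illustration, a sketch of how that expansion could be used for the small-x branch of the original function. The function name and the cutoff 1e-4 are my assumptions; with only the lowest-order term the accuracy near the cutoff is limited, so the hypot/identity-based answers below are preferable for a robust implementation:

#include <math.h>

/* Sketch: f(x) = sqrt(x*x + 1) - 1, using the Taylor term 0.5*x*x for
   small x so the result is not lost when adding 1 in double precision. */
double feval_taylor(double x)
{
    if (fabs(x) < 1e-4)              /* assumed cutoff */
        return 0.5 * x * x;          /* lowest-order Taylor term */
    return sqrt(x * x + 1.0) - 1.0;  /* naive formula elsewhere */
}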
There are two issues here when implementing this in the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot(x, y) that computes sqrt(x*x + y*y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fixing subtractive cancellation is to transform the computation algebraically so that it uses only multiplications and/or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
    return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
             = (x*x+1-1) / (sqrt(x*x+1)+1)
             = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
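A direct translation of that last formula into C might look like the sketch below (my own illustration; note that, unlike the hypotf-based version above, x*x can still overflow for very large x):

#include <math.h>

/* Sketch: sqrt(x*x+1)-1 rewritten as x*x / (sqrt(x*x+1) + 1),
   which removes the subtractive cancellation for small x. */
double feval_identity(double x)
{
    return (x * x) / (sqrt(x * x + 1.0) + 1.0);
}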
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

From a float to an integer

Given this code that my professor gave us in an exam, which means we cannot modify the code nor use functions from other libraries (except stdio.h):
float x;
/* suppose x does NOT have an integer part */
while (CONDITION) {
    x = x * 10;
}
I have to find the condition that makes sure that x has no significant digits to the right of the decimal point, without giving attention to the precision problems of a float (after the decimal point we must have only zeros). I tried this condition:
while ((fmod((x * 10), 10))) {
    x = x * 10;
}
printf(" %f ", x);
example:
INPUT x=0.456;  --------> OUTPUT: 456.000
INPUT x=0.4567; --------> OUTPUT: 4567.000
It is important to be sure that after the decimal point we don't have any significant digits.
But for that I had to include the math.h library, and my professor doesn't allow us to use it in this specific case (I'm not even allowed to use (long), since we have never seen it in class).
So what is the condition that solve the problem properly without this library?
As pointed out here previously: due to the accuracy of floats this is not really possible, but I think your Prof wants to get something like
while (x - (int)x != 0 )
or
while (x - (int)x >= 0.00000001 )
You can get rid of the zeroes by using the g modifier instead of f:
printf(" %g \n",x);
There is fuzziness ("not giving attention to the problems of precision of a float number") in the question, yet I think a sought answer is below: assign x to an integer type until x no longer has a fractional part.
Success of this method depends on INT_MIN <= x <= INT_MAX. This is expected when the number of bits in the significand of float does not exceed the value bits of int. Although this is common, it is not specified by C. As an alternative, code could use a wider integer type like long long, with a far smaller chance of running into this range restriction.
Given the rounding introduced with *10, this method is not a good foundation of float to text conversion.
float Dipok(float x) {
    int i;
    while ((i = x) != x) {
        x = x * 10;
    }
    return x;
}
#include <assert.h>
#include <stdio.h>
#include <float.h>

void Dipok_test(float x) {
    // suppose x NOT having an integer part
    assert(x > -1.0 && x < 1.0);
    float y = Dipok(x);
    printf("x:%.*f y:%.f\n", FLT_DECIMAL_DIG, x, y);
}

int main(void) {
    Dipok_test(0.456);
    Dipok_test(0.4567);
    return 0;
}
Output
x:0.456000000 y:456
x:0.456699997 y:4567
As already pointed out by 2501, this is just not possible.
Floats are not exact. Depending on your platform, the float value 0.001 is in fact represented as something like 0.0010000001.
What would you expect the code to calculate: 10000001 or 1?
Any solution will work for some values only.
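For instance (my own illustration), simply printing the stored value already shows the problem:

#include <stdio.h>

int main(void)
{
    float f = 0.001f;
    printf("%.12f\n", f);   /* on an IEEE-754 float: 0.001000000047 */
    return 0;
}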
I'll try to answer my exam question; please correct me if I say something wrong!
It is not possible to find a proper condition that makes sure that there are no significant digits after the decimal point. For example: we want to know the result of 0.4*20, which is 8.000, but due to imprecision the output will be different:
float f = 0.4;
for (int i = 1; i < 20; i++)
    f = f + 0.4;
printf("The number f=0.4*20 is ");
if (f != 8.0) { printf(" not "); }
printf(" %f ", 8.0);
printf("The real answer is f=0.4*20= %f", f);
Our OUTPUT will be:
The number f=0.4*20 is not 8.000000
The real answer is f=0.4*20= 8.000001

How to compare two complex numbers?

In C, complex numbers are built on float or double and have the same problems as those real types:
#include <stdio.h>
#include <complex.h>
int main(void)
{
    double complex a = 0 + I * 0;
    double complex b = 1 + I * 1;

    for (int i = 0; i < 10; i++) {
        a += .1 + I * .1;
    }

    if (a == b) {
        puts("Ok");
    }
    else {
        printf("Fail: %f + i%f != %f + i%f\n", creal(a), cimag(a), creal(b), cimag(b));
    }
    return 0;
}
The result:
$ clang main.c
$ ./a.out
Fail: 1.000000 + i1.000000 != 1.000000 + i1.000000
I tried this syntax:
a - b < DBL_EPSILON + I * DBL_EPSILON
But the compiler hates it:
main.c:24:15: error: invalid operands to binary expression ('_Complex double' and '_Complex double')
if (a - b < DBL_EPSILON + I * DBL_EPSILON) {
~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This last one works fine, but it's a little tedious:
fabs(creal(a) - creal(b)) < DBL_EPSILON && fabs(cimag(a) - cimag(b)) < DBL_EPSILON
Comparing 2 complex floating point numbers is much like comparing 2 real floating point numbers.
Comparing for exact equivalences often is insufficient as the numbers involved contain small computational errors.
So rather than if (a == b) code needs to be if (nearlyequal(a,b))
The usual approach is double diff = cabs(a - b) and then comparing diff to some small constant value like DBL_EPSILON.
This fails when a, b are large numbers, as their difference may be many orders of magnitude larger than DBL_EPSILON even though a and b differ only by their least significant bit.
This fails for small numbers too, as the difference between a and b may be relatively large yet many orders of magnitude smaller than DBL_EPSILON, and so the test returns true even when the values are relatively quite different.
Complex numbers literally add another dimensional problem to the issue as the real and imaginary components themselves may be greatly different. Thus the best answer for nearlyequal(a,b) is highly dependent on the code's goals.
For simplicity, let us use the magnitude of the difference as compared to the average magnitude of a,b. A control constant ULP_N approximates the number of binary digits of least significance that a,b are allowed to differ.
#define ULP_N 4

bool nearlyequal(complex double a, complex double b) {
    double diff = cabs(a - b);
    double mag = (cabs(a) + cabs(b)) / 2;
    return diff <= (mag * DBL_EPSILON * (1ull << ULP_N));
}
Instead of comparing the complex number components, you can compute the complex absolute value (also known as norm, modulus or magnitude) of their difference, which is the distance between the two on the complex plane:
if (cabs(a - b) < DBL_EPSILON) {
    // complex numbers are close
}
Small complex numbers will appear to be close to zero even if there is no precision issue, a separate issue that is also present for real numbers.
Since complex numbers are represented as floating point numbers, you have to deal with their inherent imprecision. Floating point numbers are "close enough" if they're within the machine epsilon.
The usual way is to subtract them, take the absolute value, and see if it's close enough.
#include <complex.h>
#include <stdbool.h>
#include <float.h>
static inline bool ceq(double complex a, double complex b) {
    return cabs(a - b) < DBL_EPSILON;
}
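Used with the loop from the question, this helper makes the comparison succeed (a sketch combining the two; note that for operands much larger than 1 the fixed DBL_EPSILON threshold would again be too strict, as discussed above):

#include <stdio.h>
#include <complex.h>
#include <stdbool.h>
#include <float.h>

static inline bool ceq(double complex a, double complex b) {
    return cabs(a - b) < DBL_EPSILON;
}

int main(void)
{
    double complex a = 0 + I * 0;
    double complex b = 1 + I * 1;

    for (int i = 0; i < 10; i++) {
        a += .1 + I * .1;
    }
    if (ceq(a, b)) {
        puts("Ok");   /* succeeds: |a - b| is tiny but nonzero */
    }
    return 0;
}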

float vs double comparison [duplicate]

This question already has answers here:
Comparing float and double
(3 answers)
Closed 7 years ago.
#include <stdio.h>

int main(void)
{
    float me = 1.1;
    double you = 1.1;

    if (me == you) {
        printf("I love U");
    } else {
        printf("I hate U");
    }
}
This prints "I hate U". Why?
Floats use binary fractions. If you convert 1.1 to float, this will result in a binary representation.
Each bit to the right of the binary point halves the weight of the digit, just as each decimal place to the right divides by ten. Bits to the left of the point double it (times ten for decimal).
in decimal:   ... 0*2 + 1*1 + 0*0.5 + 0*0.25 + 0*0.125 + 1*0.0625 + ...
binary:             0     1  .  0       0        0         1      ...
2's exponent:       1     0    -1      -2       -3        -4
(each digit's weight is 2 raised to the exponent shown)
The problem is that 1.1 cannot be converted exactly to a binary representation. A double, however, has more significant digits than a float.
If you compare the values, the float is converted to double first. But as the computer does not know the original decimal value, it simply fills the trailing bits of the new double with zeros, while the double value was rounded to more significant digits in the first place. So the two do not compare equal.
This is a common pitfall when using floats. For this and other reasons (e.g. rounding errors), you should not use exact comparison for equal/unequal, but a ranged compare using a small tolerance such as the machine epsilon:
#include "float.h"
...
// check for "almost equal"
if ( fabs(fval - dval) <= FLT_EPSILON )
...
Note the usage of FLT_EPSILON, which is the aforementioned value for single precision float values. Also note the <=, not <, as the latter will actually require exact match).
If you compare two doubles, you might use DBL_EPSILON, but be careful with that.
Depending on intermediate calculations, the tolerance has to be increased (you cannot reduce it further than epsilon), as rounding errors, etc. will sum up. Floats in general are not forgiving with wrong assumptions about precision, conversion and rounding.
Edit:
As suggested by @chux, this might not work as expected for larger values, as you have to scale the epsilon according to the exponents. This conforms to what I stated: float comparison is not as simple as integer comparison. Think about it before comparing.
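A sketch of such a scaled comparison (my own illustration, not part of the answer above): the tolerance grows with the magnitude of the operands instead of staying fixed at FLT_EPSILON.

#include <math.h>    // fabs(), fmax()
#include <float.h>   // FLT_EPSILON
#include <stdbool.h>

/* Sketch: compare a float-derived and a double-derived value with a
   tolerance scaled by their magnitude rather than a fixed epsilon. */
static bool nearly_equal_scaled(double fval, double dval)
{
    double scale = fmax(fabs(fval), fabs(dval));
    return fabs(fval - dval) <= scale * FLT_EPSILON;
}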
In short, you should NOT use == to compare floating points.
for example
float i = 1.1; // or double
float j = 1.1; // or double
This argument
(i==j) == true // is not always valid
for a correct comparison you should use epsilon (very small number):
(fabs(i-j) < epsilon) == true // this check is valid (use fabs/fabsf, not the integer abs)
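A complete (illustrative) version of that check; the epsilon value here is an arbitrary example and would normally be chosen relative to the magnitudes involved, as other answers explain:

#include <stdio.h>
#include <math.h>   // fabsf()

int main(void)
{
    float i = 1.1f;
    float j = 1.1f;
    const float epsilon = 1e-6f;   /* assumed tolerance for values near 1 */

    if (fabsf(i - j) < epsilon)    /* instead of i == j */
        printf("nearly equal\n");
    return 0;
}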
The question simplifies to why do me and you have different values?
Usually, C floating point is based on a binary representation. Many compilers & hardware follow IEEE 754 binary32 and binary64. Rare machines use a decimal, base-16 or other floating point representation.
OP's machine certainly does not represent 1.1 exactly as 1.1, but to the nearest representable floating point number.
Consider the below which prints out me and you to high precision. The previous representable floating point numbers are also shown. It is easy to see me != you.
#include <math.h>
#include <stdio.h>
int main(void) {
    float me = 1.1;
    double you = 1.1;

    printf("%.50f\n", nextafterf(me, 0)); // previous float value
    printf("%.50f\n", me);
    printf("%.50f\n", nextafter(you, 0)); // previous double value
    printf("%.50f\n", you);
}
Output:
1.09999990463256835937500000000000000000000000000000
1.10000002384185791015625000000000000000000000000000
1.09999999999999986677323704498121514916420000000000
1.10000000000000008881784197001252323389053300000000
But it is more complicated: C allows code to use higher precision for intermediate calculations depending on FLT_EVAL_METHOD. So on another machine, where FLT_EVAL_METHOD==1 (evaluate all FP to double), the compare test may pass.
Comparing for exact equality is rarely used in floating point code, aside from comparison to 0.0. More often code uses an ordered compare a < b. Comparing for approximate equality involves another parameter to control how near. @R.. has a good answer on that.
Because you are comparing two floating point numbers!
Floating point comparison is not exact because of Rounding Errors. Simple values like 1.1 or 9.0 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. For example:
float a = 9.0 + 16.0;
double b = 25.0;
if (a == b)  // can be false!
if (a >= b)  // can also be false!
Even
if(abs(a-b) < 0.0001) // wrong - don't do this
This is a bad way to do it, because a fixed epsilon (0.0001) chosen because it "looks small" could actually be way too large when the numbers being compared are very small as well.
I personally use the following method, may be this will help you:
#include <iostream> // std::cout
#include <cmath> // std::abs
#include <algorithm> // std::min
using namespace std;
#define MIN_NORMAL 1.17549435E-38f
#define MAX_VALUE 3.4028235E38f
bool nearlyEqual(float a, float b, float epsilon) {
    float absA = std::abs(a);
    float absB = std::abs(b);
    float diff = std::abs(a - b);

    if (a == b) {
        return true;
    } else if (a == 0 || b == 0 || diff < MIN_NORMAL) {
        return diff < (epsilon * MIN_NORMAL);
    } else {
        return diff / std::min(absA + absB, MAX_VALUE) < epsilon;
    }
}
This method passes tests for many important special cases, for different a, b and epsilon.
And don't forget to read What Every Computer Scientist Should Know About Floating-Point Arithmetic!

How do I determine whether the value of a float is a whole number? [duplicate]

This question already has answers here:
Checking if float is an integer
(8 answers)
Closed 8 years ago.
I have a program in which I need to print FLOAT in case of a fractional number or print INTEGER in case of a whole number.
For example, pseudo code:
float num = 1.5;
if (num mod sizeof(int) == 0)
    printf("INTEGER");
else
    printf("FLOAT");
For example:
1.6 would print "FLOAT"
1.0 would print "INTEGER"
Will something like this work?
All float types have the same size, so your method won't work. You can check if a float is an integer by using ceilf
float num = 1.5;
if (ceilf(num) == num)
    printf("INTEGER");
else
    printf("FLOAT");
You can use modff():
const char * foo(float num) {
    float x;
    modff(num, &x);
    return (num == x) ? "INTEGER" : "FLOAT";
}
modff() will take a float argument, and break it into its integer and fractional parts. It stores the integer part in the second argument, and the fractional part is returned.
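For example, a small usage sketch (foo() reproduced from above; the example inputs are mine):

#include <stdio.h>
#include <math.h>   // modff()

const char * foo(float num) {
    float x;
    modff(num, &x);
    return (num == x) ? "INTEGER" : "FLOAT";
}

int main(void)
{
    printf("%s\n", foo(1.0f));   /* INTEGER */
    printf("%s\n", foo(1.6f));   /* FLOAT   */
    return 0;
}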
The "easy" way, but with a catch:
You could use roundf, like this:
float z = 1.0f;
if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
The problem with this and other similar techniques (such as ceilf) is that, while they work great for whole number constants, they will fail if the number is a result of a calculation that was subject to floating-point round-off error. For example:
float z = powf(powf(3.0f, 0.05f), 20.0f);
if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
Prints "fraction", even though (31/20)20 should equal 3, because the actual calculation result ended up being 2.9999992847442626953125.
So how do we deal with this?
Any similar method, be it fmodf or whatever, is subject to this. In applications that perform complex or rounding-prone calculations, usually what you want to do is define some "tolerance" value for what constitutes a "whole number" (this goes for floating-point equality comparisons in general). We often call this tolerance epsilon. For example, lets say that we'll forgive the computer for up to +/- 0.00001 rounding error. Then, if we are testing z, we can choose an epsilon of 0.00001 and do:
if (fabsf(roundf(z) - z) <= 0.00001f) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
You don't really want to use ceilf here because e.g. ceilf(1.0000001) is 2 not 1, and ceilf(-1.99999999) is -1 not -2.
Choose a tolerance value that is appropriate for your application. For more information, check out this article on comparing floating-point numbers.
Will something like this work?
No. For example on the x86_32 and ARM 32 bit architectures sizeof(int) == 4 and sizeof(float) == 4.
Also whatever you think mod is, it clearly shows you don't understand what the sizeof operator does.
