IEEE-754: "smallest" overflow condition - c

Before I start, just some background information:
I'm running a bare-metal application on an ARM7 microcontroller (LPC2294/01) compiled in Keil uVision3, using the compiler standard math library (which is IEEE-754 compliant).
The issue:
I'm having trouble wrapping my head around what exactly constitutes an 'overflow' on the sum of 2 single-precision floating point inputs.
Initially, I was under the impression that if I attempted to add any positive value to the largest value that can be represented by IEEE-754 notation, the result would generate an overflow exception.
So for instance, suppose I have:
a = 0x7f7fffff (ie. 3.4028235..E38);
b = 0x3f800000 (ie. 1.0)
I expected that summing these two values would result in overflow as defined in IEEE-754. To my initial surprise, the result simply returned the value of 'a' with no exception being flagged.
So then I thought, since the precision (or resolution if you prefer) decreases as the value being represented increases, it's likely the value '1' in this case is being effectively rounded down to 0 due to its relative insignificance.
So that begged the question: What would be the smallest value of 'b' in this case that would cause an overflow exception? Does it depend on the specific implementation of IEEE-754?
Maybe it's as simple as me not understanding how to determine the minimum 'significant' precision in this particular case, but given the code below, why would the second sum cause an overflow and not the first?
static union sFloatConversion32
{
    unsigned int unsigned32Value;
    float floatValue;
} sFloatConversion32;

t_bool test_Float32_Addition(void)
{
    float a;
    float b;
    float c;

    sFloatConversion32.unsigned32Value = 0x7f7fffff;
    a = sFloatConversion32.floatValue;

    sFloatConversion32.unsigned32Value = 0x72ffffff;
    b = sFloatConversion32.floatValue;
    /* This sum returns (c = a) without overflow */
    c = a + b;

    sFloatConversion32.unsigned32Value = 0x73000000;
    b = sFloatConversion32.floatValue;
    /* This sum, however, causes an overflow exception */
    c = a + b;
}
Is there a generalized rule that can be applied such that it would be possible to know ahead of time (ie. without performing the sum), that given two floats, their sum will cause an overflow as defined by IEEE-754?

Overflow occurs when the result is affected by the range of the format. As long as normal rounding keeps the result within the finite range, no overflow occurs, because the result is the same as it would be if the exponent were unbounded—the result was reduced by the normal rounding, before range was considered. So there is no exception due to range.
When the rounded result does not fit into the finite range of the format, then a finite result cannot be produced, so an overflow exception occurs and infinity is produced.
In IEEE 754, a normal operation is in effect two steps:
Calculate the exact mathematical result.
Round the exact mathematical result to the nearest representable value.
IEEE 754 defines overflow to occur if and only if the result of the above exceeds in magnitude the largest representable finite value. In other words, overflow does not occur just because you went above the largest representable value but only if you go so far above the largest representable value that the normal way arithmetic works in floating-point does not work.
So, if you start with the largest representable value and add a small number to it, the result would simply round to the largest representable value anyway (when using round-to-nearest). IEEE 754 regards this as normal—all arithmetic operations round, and if that rounding kept the result in bounds, that is normal and unexceptionable. Even if the exponent range were unbounded, normal rounding would have produced the same result. Since this is a normal result not affected by the limited range, nothing exceptional has occurred.
Overflow occurs only when the mathematical result is so large that rounding would produce the next higher number if we were not limited by the exponent. (But, since we have reached the limits of the exponent range, we must return infinity.)
The largest representable value in IEEE-754 basic 32-bit binary floating-point is 2^128 - 2^104. At this point, the steps between representable numbers are in units of 2^104. With the round-to-nearest rule, adding any number less than half a step, 2^103, to this will round to 2^128 - 2^104, and no overflow occurs. If you add a number greater than 2^103, then the result would round to 2^128 if the exponent could go that high. Instead, infinity is produced and an overflow exception occurs. (If you add exactly 2^103, the rule for ties is used. This rule says to choose the candidate with the even low bit. That produces 2^128, so it also overflows.)
So, with round-to-nearest, overflow occurs at the midpoint of a step. With other rounding rules, overflow occurs at different points. With round-toward-infinity (round up), adding any positive value, even 2^-149, to 2^128 - 2^104 will cause an overflow. With round-toward-zero, adding any value less than 2^104 to 2^128 - 2^104 will not overflow.
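Where the implementation exposes the IEEE-754 status flags through <fenv.h> (a hosted-toolchain assumption; a bare-metal Keil math library may not support this, and strictly one should also use #pragma STDC FENV_ACCESS ON), this midpoint rule can be checked with a minimal sketch, not the original poster's code:
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile float max = FLT_MAX;         /* 2^128 - 2^104 */
    volatile float half_step = 0x1p+103f; /* half the step size at FLT_MAX */
    volatile float c;

    feclearexcept(FE_ALL_EXCEPT);
    c = max + nextafterf(half_step, 0.0f); /* just under half a step */
    printf("%g overflow=%d\n", (double)c, fetestexcept(FE_OVERFLOW) != 0);
    /* expected: 3.40282e+38 overflow=0 */

    feclearexcept(FE_ALL_EXCEPT);
    c = max + half_step;                   /* exactly half a step: tie rounds to even, up to 2^128 */
    printf("%g overflow=%d\n", (double)c, fetestexcept(FE_OVERFLOW) != 0);
    /* expected: inf overflow=1 */
    return 0;
}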

Does it depend on the specific implementation of IEEE-754?
Yes, and it also depends on the rounding mode active at the time.
Consider the step between the float just before FLT_MAX and FLT_MAX itself.
float max = FLT_MAX;
float before_max = nextafterf(max, 0.0f);
float delta = max - before_max;
printf("max:   %-20a %.*g\n", max, FLT_DECIMAL_DIG, max);
printf("b4max: %-20a %.*g\n", before_max, FLT_DECIMAL_DIG, before_max);
printf("1st d: %-20a %.*g\n", delta, FLT_DECIMAL_DIG, delta);
// Typical output
// max:   0x1.fffffep+127    3.40282347e+38
// b4max: 0x1.fffffcp+127    3.40282326e+38
// 1st d: 0x1p+104           2.02824096e+31
The largest float is about twice the smallest float that shares the same step size (ULP). Think of that smaller float as having all its explicit significand bits clear, versus all set as in FLT_MAX.
float m0 = nextafterf(max/2, max);
printf("m0: %- 20a %.*g\n", m0, FLT_DECIMAL_DIG, m0);
// m0: 0x1p+127 1.70141183e+38
Now compare this to FLT_EPSILON, the step from 1.0 to the next larger float:
float eps = FLT_EPSILON;
printf("epsil: %- 20a %.*g\n", eps, FLT_DECIMAL_DIG, eps);
// Output
// epsil: 0x1p-23 1.1920929e-07
Notice the ratio delta/m0 is FLT_EPSILON.
float r = delta/m0;
printf("r:     %-20a %.*g\n", r, FLT_DECIMAL_DIG, r);
// r: 0x1p-23 1.1920929e-07
Consider the typical rounding mode of rounding to nearest, ties to even.
Now let us try adding delta/2 to FLT_MAX, and then try adding the next smaller float instead.
float sum = max + delta/2;
printf("sum:   %-20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
sum = max + nextafterf(delta/2, 0.0f);
printf("sum:   %-20a %.*g\n", sum, FLT_DECIMAL_DIG, sum);
// sum: inf                  inf
// sum: 0x1.fffffep+127      3.40282347e+38
We can see the smallest overflowing delta is about FLT_MAX * 1/2 * 1/2 * FLT_EPSILON.
float small = FLT_MAX*0.25f*FLT_EPSILON;
printf("small: %- 20a %.*g\n", small, FLT_DECIMAL_DIG, small);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
small = nextafterf(small, max);
printf("sum: % -20a %.*g\n", max+small, FLT_DECIMAL_DIG, max+small);
// sum: 0x1.fffffep+127 3.40282347e+38
// sum: inf inf
Given the various possible encodings for float, your results may differ, yet this approach gives an idea of how to determine the smallest delta that causes overflow.

Run this program long enough and see what happens:
float x = 10000000.0f;
while (1)
{
    printf("%f\n", x);
    x += 1.0f;
}
I think it will answer your question: once x reaches 16777216.0f (2^24), adding 1.0f no longer changes it, because the step between adjacent floats at that magnitude is 2, and x + 1 rounds back to x.

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>

double feval(double x) {
    /* Insert your code here */
    if (x > 1e299)
    {
        return x - 1;
    }
    if (x < 1e-6)
    {
        long double g;
        g = x;
        printf("x = %Lf\n", g);
        long double a;
        a = pow(x, 2);
        printf("x squared = %Lf\n", a);
        return sqrt(g*g + 1.) - 1.;
    }
    else
    {
        printf("x = %f\n", x);
        printf("Used third \n");
        return sqrt(pow(x, 2) + 1.) - 1;
    }
}

int main(void)
{
    double x;
    printf("Input: ");
    scanf("%lf", &x);
    double b;
    b = feval(x);
    printf("%f\n", b);
    return 0;
}
For small inputs, you're getting truncation error when you compute 1 + x^2. If x = 1e-7f, x*x will happily fit into a 32-bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation), but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1 + x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
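A minimal sketch of that approach in C (the 1e-4 cutoff is an illustrative choice, not from the question; near it, the omitted O(x^4) term and the cancellation error of the direct formula are comparably small):
#include <math.h>

/* sqrt(x*x + 1) - 1, switching to the Taylor result 0.5*x*x for small |x|.
   The 1e-4 cutoff is illustrative: below it, dropping the O(x^4) term
   costs a relative error of about x*x/4. */
double feval_taylor(double x) {
    if (fabs(x) < 1e-4)
        return 0.5 * x * x;
    return sqrt(x * x + 1.0) - 1.0;
}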
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues here when implementing this the naive way: overflow or underflow in the intermediate computation of x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot(x, y) that computes sqrt(x*x + y*y) accurately while avoiding underflow and overflow in the intermediate computation. A common approach to fixing issues with subtractive cancellation is to transform the computation algebraically so that it turns into multiplications and/or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
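A quick check of the difference this makes (a fragment; assumes the func above is in scope and <stdio.h>/<math.h> are included): with x = 1e-7f the naive formula loses everything to rounding, while this version keeps full accuracy.
float x = 1e-7f;
printf("naive: %g\n", (double)(sqrtf(x * x + 1.0f) - 1.0f)); /* prints 0 */
printf("hypot: %g\n", (double)func(x));                      /* prints ~5e-15 */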
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1) / (sqrt(x*x+1)+1)
             = (x*x+1-1) / (sqrt(x*x+1)+1)
             = x*x / (sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
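A sketch of that last formula as C code (note that, unlike the hypotf version above, x*x here can still overflow for very large x):
#include <math.h>

/* sqrt(x*x + 1) - 1 rewritten as x*x / (sqrt(x*x + 1) + 1):
   no subtraction of nearly equal values, so no cancellation. */
double f(double x) {
    return x * x / (sqrt(x * x + 1.0) + 1.0);
}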
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

Float data type uncertainty

I am doing a numerical analysis of math software I developed. I want to identify the uncertainty of my result. With f() being my method and x an input value, I want to express my result as f(x) +/- y. My f() method performs multiple operations between float variables. To study the error propagation that occurs in f(), I have to apply the statistical propagation-of-uncertainty formulas, and in order to do so I have to know the uncertainty of a float variable.
I do understand the architecture of a float variable as specified in the IEEE 754 standard and the rounding error converting a decimal value to float inherent to the latter.
From what I understood of the literature, the FLT_EPSILON macro in http://www.cplusplus.com/reference/cfloat/
defines my y value but this quick test proves it wrong:
float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092897e-7f;
float f3 = 1.234567f + 1.192092896e-7f;
printf("Inicial:\t%f\n", f1);
printf("Inicial:\t%f\n", f2);
printf("Inicial:\t%f\n\n", f3);
Output:
Inicial: 1.234567
Inicial: 1.234567
Inicial: 1.234567
When the expected output should be:
Inicial: 1.234567
Inicial: 1.234568 <---
Inicial: 1.234567
What is it that I am wrong about?
Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?
EDIT: My question is: with R being the float value of x, what is the y value such that x + y or x - y yields the same float value R?
Propagation of uncertainty is from the field of statistics and refers to how uncertainties in inputs affect mathematical functions of them. The analysis of errors that occur in computational arithmetic is numerical analysis.
FLT_EPSILON is not a measure of uncertainty or error in floating-point results. It is the distance between 1 and the next value representable in the float type. Hence, it is the size of steps between representable numbers at the magnitude of 1.
When you convert a decimal numeral to floating-point, the rounding error that results may have a magnitude of up to ½ the step size when the common round-to-nearest mode is used. The reason the bound is ½ the step size is that for any number x (within the finite domain of the floating-point format), there is a representable value within ½ the step size (inclusive). This is because, if there is a representable number more than ½ the step size in one direction, there is a representable number less than ½ the step size in the other direction.
The step size varies with the magnitudes of the numbers. With binary floating-point, it doubles at 2, and again at 4, then 8, and so on. Below 1, it halves, and again at ½, ¼, and so on.
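This doubling can be verified directly with nextafterf; a minimal sketch (assuming IEEE-754 binary32 float):
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Step size to the next float at several magnitudes:
       2^-25 at 0.25, 2^-24 at 0.5, 2^-23 at 1 (FLT_EPSILON),
       2^-22 at 2, 2^-21 at 4. */
    float xs[] = { 0.25f, 0.5f, 1.0f, 2.0f, 4.0f };
    for (int i = 0; i < 5; i++)
        printf("step at %g = %a\n",
               (double)xs[i], (double)(nextafterf(xs[i], 8.0f) - xs[i]));
    return 0;
}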
When you perform floating-point arithmetic operations, the rounding that occurs in the computation may compound or cancel previous errors. There is no general formula for the final error.
The two numerals used in your sample code, 1.192092897e-7f and 1.192092896e-7f, are so close together that they convert to the same float value, 2^-23. That is why there is no difference between your f2 and f3.
There is a difference between f1 and f2, but you did not print enough digits to display it.
You ask “Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?”, but your code does not contain x - FLT_EPSILON.
Re: “My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?” This is trivially satisfied by y = 0. Did you mean to ask what is the largest value of y that satisfies the condition? That is a bit complicated.
The step size for a number x is called the ULP of x, which we may consider as a function ULP(x). ULP stands for Unit of Least Precision. It is the place value of the least digit in the floating-point representation of x. It is not a constant; it is a function of x.
For most values representable in a floating-point format, the largest y that satisfies your condition is ½ ULP(x) if the least digit in the floating-point representation of x is even; if the digit is odd, it is just under ½ ULP(x). This complication arises from the rule that the results of arithmetic are rounded to the nearest representable value and, in case of a tie, the value with the even low digit is chosen. Thus, adding ½ ULP(x) to x will yield a tie that will round to x if the low digit is even, but will not round to x if the low digit is odd.
However, for x that are on the boundary where the ULP changes, the largest y that satisfies your condition is ¼ ULP(x). This is because, just below x (in magnitude), the step size changes, and the next number lower than x is half of x’s step size away instead of the usual full step size. So you can only go halfway toward that value before changing the result of the subtraction, so the most y can be is ¼ ULP(x).
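A small demonstration of the even/odd distinction (a sketch assuming round-to-nearest, ties-to-even; the volatile stores force rounding to float even where intermediate expressions are evaluated in higher precision):
#include <math.h>
#include <stdio.h>

int main(void) {
    float even = 1.0f;                          /* low significand bit is 0 (even) */
    float ulp  = nextafterf(even, 2.0f) - even; /* step size at 1.0: 2^-23 */
    float odd  = even + ulp;                    /* low significand bit is 1 (odd) */

    volatile float s1 = even + ulp / 2; /* tie: rounds to the even candidate, back to 1.0 */
    volatile float s2 = odd + ulp / 2;  /* tie: rounds to the even candidate, away from odd */
    printf("%d %d\n", s1 == even, s2 == odd);   /* prints: 1 0 */
    return 0;
}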
float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand, i.e. float has about 7 decimal digits of precision.
Increase the number of digits printf prints to see more, but after about 7 significant digits it's just noise:
#include <stdio.h>
int main(void) {
float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092897e-7f;
float f3 = 1.234567f + 1.192092896e-7f;
printf("Inicial:\t%.16f\n", f1);
printf("Inicial:\t%.16f\n", f2);
printf("Inicial:\t%.16f\n\n", f3);
return 0;
}
Output:
Inicial: 1.2345670461654663
Inicial: 1.2345671653747559
Inicial: 1.2345671653747559
float f1 = 1.234567f;
float f2 = f1 + 1.192092897e-7f;
float f3 = f1 + 1.192092896e-7f;
printf("Inicial:\t%.20f\n", f1);
printf("Inicial:\t%.20f\n", f2);
printf("Inicial:\t%.20f\n\n", f3);
Output:
Inicial: 1.23456704616546630000
Inicial: 1.23456716537475590000
Inicial: 1.23456716537475590000
No, your expectation is wrong.
In the first printf call, you're printing the variable f1 with nothing added to it, which is just 1.234567f.

Understanding the maximum values that can be stored in floats in C

I have come across some behaviour with the float type in C that I do not understand, and was hoping might be explained. Using the macros defined in float.h I can determine the maximum/minimum values that the datatype can store on the given hardware. However when performing a calculation that should not exceed these limits, I find that a typed float variable fails where a double succeeds.
The following is a minimal example, which compiles on my machine.
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

int main(int argc, char **argv)
{
    int gridsize;
    long gridsize3;
    float *datagrid;
    float sumval_f;
    double sumval_d;
    long i;

    gridsize = 512;
    gridsize3 = (long)gridsize*gridsize*gridsize;

    datagrid = calloc(gridsize3, sizeof(float));
    if(datagrid == NULL)
    {
        free(datagrid);
        printf("Memory allocation failed\n");
        exit(0);
    }

    for(i=0; i<gridsize3; i++)
    {
        datagrid[i] += 1.0;
    }

    sumval_f = 0.0;
    sumval_d = 0.0;
    for(i=0; i<gridsize3; i++)
    {
        sumval_f += datagrid[i];
        sumval_d += (double)datagrid[i];
    }

    printf("\ngridsize3 = %e\n", (float)gridsize3);
    printf("FLT_MIN = %e\n", FLT_MIN);
    printf("FLT_MAX = %e\n", FLT_MAX);
    printf("DBL_MIN = %e\n", DBL_MIN);
    printf("DBL_MAX = %e\n", DBL_MAX);
    printf("\nfloat sum = %f\n", sumval_f);
    printf("double sum = %lf\n", sumval_d);
    printf("sumval_d/sumval_f = %f\n\n", sumval_d/(double)sumval_f);

    free(datagrid);
    return(0);
}
Compiling with gcc I find the output:
gridsize3 = 1.342177e+08
FLT_MIN = 1.175494e-38
FLT_MAX = 3.402823e+38
DBL_MIN = 2.225074e-308
DBL_MAX = 1.797693e+308
float sum = 16777216.000000
double sum = 134217728.000000
sumval_d/sumval_f = 8.000000
Whilst compiling with icc the sumval_f = 67108864.0 and hence the final ratio is instead 2.0*. Note that the float sum is incorrect, whilst the double sum is correct.
As far as I can tell the output of FLT_MAX suggests that the sum should fit into a float, and yet it seems to plateau out at either an eighth or a half of the full value.
Is there a compiler specific override to the values found using float.h?
Why is a double required to correctly find the sum of this array?
*Interestingly the inclusion of an if statement inside the for loop that prints values of the array causes the value to match the gcc output, i.e. an eighth of the correct sum, rather than a half.
The problem here isn't the range of values but the precision.
Assuming a 32-bit IEEE754 float, this datatype has a maximum of 24 bits of precision. This means that not all integers larger than 16777216 can be represented exactly.
So when your sum reaches 16777216, adding 1 to it is outside the precision of what the datatype can store, so the number doesn't get any bigger.
A (presumably) 64-bit double has 53 bits of precision. This is enough bits to hold all integer values up to your sum of 134217728, so it gives you an accurate result.
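A two-line illustration of that cutoff (a fragment; the volatile stores guard against extended intermediate precision):
volatile float f = 16777216.0f;  /* 2^24 */
volatile float g = f + 1.0f;     /* 16777217 is not representable; rounds back to 2^24 */
printf("%d\n", g == f);          /* prints 1 */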
A float can precisely represent any integer between -16777215 and +16777215, inclusive. It can also represent all even integers between -2*16777215 and +2*16777215 (including +/- 2*8388608, i.e. 16777216), all multiples of 4 between -4*16777215 and +4*16777215, and likewise for all power-of-two scaling factors up to 2^104 (roughly 2.028E+31). Additionally, it can represent multiples of 1/2 from -16777215/2 to +16777215/2, multiples of 1/4 from -16777215/4 to +16777215/4, etc., down to multiples of 1/2^149 from -16777215/(2^149) to +16777215/(2^149).
Between any two numbers there are infinitely many real values, but a computer cannot hold an infinite number of values. So a compromise is made: a floating point number holds an approximation of the value.
This means that if you pick a value that is "more" than the stored floating point number, but not enough to arrive at the "next" storable approximation, then storing that logically bigger number won't actually change the floating point value.
The "error" in a floating point approximation is variable. For small numbers, the absolute error is small; for bigger numbers, the relative error is about the same, but the absolute error is bigger.

Computing floating point accuracy (K&R 2-1)

I found Stevens Computing Services – K & R Exercise 2-1 a very thorough answer to K&R 2-1. This slice of the full code computes the maximum value of a float type in the C programming language.
Unluckily my theoretical comprehension of float values is quite limited. I know they are composed of significand (mantissa.. ) and a magnitude which is a power of 2.
#include <stdio.h>
#include <limits.h>
#include <float.h>

main()
{
    float flt_a, flt_b, flt_c, flt_m;

    /* FLOAT */
    printf("\nFLOAT MAX\n");
    printf("<limits.h> %E ", FLT_MAX);

    flt_a = 2.0;
    flt_b = 1.0;
    while (flt_a != flt_b) {
        flt_m = flt_b;           /* MAX POWER OF 2 IN MANTISSA */
        flt_a = flt_b = flt_b * 2.0;
        flt_a = flt_a + 1.0;
    }
    flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */

    flt_a = flt_b = flt_c = flt_m;
    while (flt_b == flt_c) {
        flt_c = flt_a;
        flt_a = flt_a * 2.0;
        flt_b = flt_a / 2.0;
    }
    printf("COMPUTED %E\n", flt_c);
}
I understand that the latter part basically checks to which power of 2 it's possible to raise the significand with a three variable algorithm. What about the first part?
I can see that a progression of multiples of 2 should eventually determine the value of the significand, but I tried to trace a few small numbers to check how it should work and it failed to find the right values...
What are the concepts this program is based upon, and does this program get more precise as longer and non-integer numbers have to be found?
The first loop determines the number of bits contributing to the significand by finding the least power of 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.
By the same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less. That value is recorded in variable flt_m.
The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.
Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.
With respect to the second part of your question,
does this program gets more precise as longer and non-integer numbers have to be found?
the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.

Explain this code in K&R 2-1

I'm trying to determine range of the various floating-point types. When I read this code:
#include <stdio.h>

main()
{
    float fl, fltest, last;
    double dbl, dbltest, dblast;

    fl = 0.0;
    fltest = 0.0;
    while (fl == 0.0) {
        last = fltest;
        fltest = fltest + 1111e28;
        fl = (fl + fltest) - fltest;
    }
    printf("Maximum range of float variable: %e\n", last);

    dbl = 0.0;
    dbltest = 0.0;
    while (dbl == 0.0) {
        dblast = dbltest;
        dbltest = dbltest + 1111e297;
        dbl = (dbl + dbltest) - dbltest;
    }
    printf("Maximum range of double variable: %e\n", dblast);
    return 0;
}
I don't understand why the author added 1111e28 to the fltest variable.
The loop terminates when fltest reaches +Inf, as at that point fl = (fl + fltest) - fltest becomes NaN, which is unequal to 0.0. last contains a value which when added to 1111e28 produces +Inf and so is close to the upper limit of float.
1111e28 is chosen to reach +Inf reasonably quickly; it also needs to be large enough that when added to large values the loop continues to progress i.e. it is at least as large as the gap between the largest and second-largest non-infinite float values.
OP: ... why did the author add 1111e28 to the fltest variable?
A: [Edit] For the code to work using float, this delta value of 1111e28 (i.e. 1.111e31) needs careful selection. It should be big enough such that if fltest were FLT_MAX, the sum fltest + delta would overflow and become float infinity. With round-to-nearest mode, the minimum such delta is FLT_MAX*FLT_EPSILON/4. On my machine:
min_delta            1.014120601e+31   (1/2 step between 2nd largest float and FLT_MAX)
FLT_MAX              3.402823466e+38
FLT_EPSILON          1.192092896e-07
FLT_MAX*FLT_EPSILON  4.056481679e+31
delta needs to be small enough so that if fltest is the 2nd largest number, adding delta does not sum right up to float infinity and skip FLT_MAX. This is 3x min_delta:
max_delta 3.042361441e+31
So 1.014120601e+31 <= 1111e28 < 3.042361441e+31.
@david.pfx: Yes. 1111e28 is a cute number and it is in range.
Note: Complications occur when the math and its intermediate values, even though the variables are float, may be calculated at higher precision, such as double. This is allowed in C and controlled by FLT_EVAL_METHOD or very careful coding.
1111e28 is a curious value that makes sense if the author already knew the general range of FLT_MAX.
The below code is expected to loop many times (24946069 on one test platform). Eventually the value fltest becomes "infinite". Then fl becomes NaN, as the difference Infinity - Infinity is NaN, and the while loop ends because NaN != 0.0. @ecatmur
while (fl == 0.0) {
    last = fltest;
    fltest = fltest + 1111e28;
    fl = (fl + fltest) - fltest;
}
The looping, if done in small enough increments, will arrive at a precise answer. Prior knowledge of FLT_MAX and FLT_EPSILON is needed to ensure this.
The problem with this is that C does not define FLT_MAX and DBL_MAX other than that they must be at least 1E+37. So if the maximum value were quite large, the increment of 1111e28 or 1111e297 would have no effect. Example: dbltest = dbltest + 1111e297;, with dbltest = 1e400, would certainly not increase 1e400 unless dbltest had a hundred decimal digits of precision.
If DBL_MAX were smaller than 1111e297, the method fails too. (Note: on simple platforms in 2014, it is not surprising to find double and float to both be 4-byte IEEE binary32.) The first time through the loop, dbltest becomes infinity and the loop stops, reporting "Maximum range of double variable: 0.000000e+00".
There are many ways to efficiently derive the maximum floating-point value. A sample follows that uses a random initial value to help show its resilience to a potentially different FLT_MAX.
#include <math.h>
#include <stdlib.h>

float float_max(void) {
    /* Random start in [1.0, 2.0) shows the method does not depend on the starting value. */
    float nextx = 1.0f + (float) rand() / RAND_MAX;
    float x;
    /* Double until the next doubling overflows to infinity. */
    do {
        x = nextx;
        nextx *= 2;
    } while (!isinf(nextx));
    /* Binary-search the remaining gap between x and infinity. */
    float delta = x;
    do {
        nextx = x + delta / 2;
        if (!isinf(nextx)) {
            x = nextx;
        }
        delta /= 2;
    } while (delta >= 1.0f);
    return x;
}
isinf() is a C99 addition; simple enough to roll your own if needed.
In re: @didierc's comment:
[Edit]
The precision of a float and double is implied by "epsilon": "the difference between 1 and the least value greater than 1 that is representable in the given floating point type ...". The C standard requires these to be at most:
FLT_EPSILON 1E-5
DBL_EPSILON 1E-9
Per @Pascal Cuoq's comment, "... 1111e28 being chosen larger than FLT_MAX*FLT_EPSILON": 1111e28 needs to be at least FLT_MAX*FLT_EPSILON/4 to impact the loop's addition near the top of the range, yet small enough to precisely reach the number before infinity. Again, prior knowledge of FLT_MAX and FLT_EPSILON is needed to make this determination. If these values are known ahead of time, then the code simply could have been:
printf("Maximum range of float variable: %e\n", FLT_MAX);
The largest value representable in a float is 3.40282e+38. The constant 1111e28 is chosen such that adding that constant to a number in the range of 10^38 still produces a different floating point value, so that the value of fltest will continue to increase as the function runs. It needs to be large enough that it will still be significant at the 10^38 range, and small enough that the result will be accurate.
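A quick sanity check of that choice (a sketch; volatile avoids constant folding, and the result assumes IEEE-754 binary32 float with round-to-nearest):
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile float f = FLT_MAX;
    volatile float g = f + 1111e28f; /* 1.111e31 exceeds the half-step 2^103 ~ 1.014e31 */
    printf("%d\n", isinf(g));        /* prints 1 */
    return 0;
}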
