Why does strtof() unexpectedly convert "3.40282356779733650000e38" to infinity even though it is within 0.5 ULP of FLT_MAX?
FLT_MAX (float32) is 0x1.fffffep+127 or about 3.4028234663852885981170e+38.
1/2 ULP above FLT_MAX is 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, so I expected any decimal text between FLT_MAX and this value to convert to FLT_MAX when in "round to nearest" mode.
This works as the decimal text increases from FLT_MAX to about 3.40282356779733642700e38, yet for decimal text values above about that, like "3.40282356779733650000e38", the conversion result is infinity.
The following code reveals the issue. It gently creeps up a decimal text string, looking for the value at which the conversion changes to infinity.
Your results may differ, as not all C implementations use the same floating point.
#include <assert.h>
#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
void bar(unsigned n) {
char buf[100];
assert (n < 90);
int len = sprintf(buf, "%.*fe%d", n+1, 0.0, FLT_MAX_10_EXP);
puts(buf);
printf("%-*s %-*s %s\n", len, "string", n+3, "float", "double");
float g = 0;
for (unsigned i = 0; i < n; i++) {
for (int digit = '1'; digit <= '9'; digit++) {
unsigned offset = i ? 1+i : i;
buf[offset]++;
errno = 0;
float f = strtof(buf, 0);
if (errno) {
buf[offset]--;
break;
}
g = f;
}
printf("\"%s\" %.*e %a\n", buf, n + 3, g, atof(buf));
}
double delta = FLT_MAX - nextafterf(FLT_MAX, 0);
double flt_max_ulp_d2 = FLT_MAX + delta/2.0;
printf(" %.*e %a FLT_MAX + 1/2 ULP - 1 dULP\n", n + 3, nextafter(flt_max_ulp_d2,0),nextafter(flt_max_ulp_d2,0));
printf(" %.*e %a FLT_MAX + 1/2 ULP\n", n + 3, flt_max_ulp_d2,flt_max_ulp_d2);
printf(" %.*e %a FLT_MAX\n", n + 3, FLT_MAX, FLT_MAX);
printf(" 1 23456789 123456789 123456789\n");
printf("FLT_ROUNDS %d (0: toward zero, 1: to nearest)\n", FLT_ROUNDS);
}
int main() {
printf("%a %.20e\n", FLT_MAX, FLT_MAX);
printf("%a\n", strtof("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.4028235677973366163754e+3", 0));
bar(19);
}
Output
0x1.fffffep+127 3.40282346638528859812e+38
inf
0x1.ffffffp+127
0x1.a95a5aaada733p+11
0.00000000000000000000e38
string float double
"3.00000000000000000000e38" 3.0000000054977557577780e+38 0x1.c363cbf21f28ap+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40200000000000000000e38" 3.4020000005553803402978e+38 0x1.ffe045fe9918p+127
"3.40280000000000000000e38" 3.4027999387901483621794e+38 0x1.ffff169a83f08p+127
"3.40282000000000000000e38" 3.4028200183756559773331e+38 0x1.ffffdbd19d02cp+127
"3.40282300000000000000e38" 3.4028230607370965250836e+38 0x1.fffff966ad924p+127
"3.40282350000000000000e38" 3.4028234663852885981170e+38 0x1.fffffe54daff8p+127
"3.40282356000000000000e38" 3.4028234663852885981170e+38 0x1.fffffeec5116ep+127
"3.40282356700000000000e38" 3.4028234663852885981170e+38 0x1.fffffefdfcbbcp+127
"3.40282356770000000000e38" 3.4028234663852885981170e+38 0x1.fffffeffc119p+127
"3.40282356779000000000e38" 3.4028234663852885981170e+38 0x1.fffffefffb424p+127
"3.40282356779700000000e38" 3.4028234663852885981170e+38 0x1.fffffeffffc85p+127
"3.40282356779730000000e38" 3.4028234663852885981170e+38 0x1.fffffefffff9fp+127
"3.40282356779733000000e38" 3.4028234663852885981170e+38 0x1.fffffefffffeep+127
"3.40282356779733600000e38" 3.4028234663852885981170e+38 0x1.fffffeffffffep+127
"3.40282356779733640000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127 <-- Actual
"3.40282356779733660000e38" 3.4028234663852885981170e+38 ... <-- Expected
"3.40282356779733642000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
"3.40282356779733642700e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
3.4028235677973362385861e+38 0x1.fffffefffffffp+127 FLT_MAX + 1/2 ULP - 1 dULP
3.4028235677973366163754e+38 0x1.ffffffp+127 FLT_MAX + 1/2 ULP
3.4028234663852885981170e+38 0x1.fffffep+127 FLT_MAX
1 23456789 123456789 123456789
FLT_ROUNDS 1 (0: toward zero, 1: to nearest)
Notes: GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
[Edit]
The exact value of FLT_MAX + 1/2 ULP:
0x1.ffffffp+127 340282356779733661637539395458142568448.0
I stumbled on this problem today when trying to determine the maximum decimal text passed to strtof() that returned a finite float.
This is a self-answer, per "Can I answer my own question?". Other answers are welcome.
Why does strtof() unexpectedly convert "3.40282356779733650000e38" to infinity even though it is within 0.5 ULP of FLT_MAX?
Certainly double rounding.
"Double" here refers to doing something twice, not the type double.
Call the value 1/2 of a float ULP above FLT_MAX, 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, the threshold.
About 3.40282356779733642748e38 is one half of a double ULP below threshold. Apparently values like "3.40282356779733650000e38" are prematurely rounded, as a double, to threshold. threshold, as a float, is half-way between FLT_MAX and the next larger float (if the encoding were extended). Being a half-way tie, it rounds to the "even" value, which is the larger one in this case. Since the next larger float is beyond the max encodable finite value, the result is infinity.
Conclusions
A better strtof() would correctly handle this corner case.
Failing that, it is reasonable to treat decimal digits past FLT_DECIMAL_DIG + 3 (see below) in strtof() input as noise.
IEEE 754 allows a strtof() implementation to treat all decimal digits past a certain significance as zero. This permits conversion to the second-closest float when the text is near the halfway point between two floats. With the common float, that significance is FLT_DECIMAL_DIG + 3, or 12 significant decimal digits. That is not what happens here, as digits in the 19th place affect the result.
Related
I need to find a number that is a power of 2 and that, when added to FLT_MAX, causes overflow. However, when I printf a very large power, like 2^300, inf still doesn't appear. Also, I thought that since FLT_MAX is the maximum representable floating-point value, adding 1 to it would cause overflow immediately.
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(){
float f = FLT_MAX;
printf("%f", f + pow(2,300));
}
Any help would be appreciated. Thanks!
The answer is (FLT_MAX - nextafterf(FLT_MAX, 0))/2, that is, exactly 0x1p+103 or approximately 1.014120480e+31.
There is a mistake in the method you use to determine the answer: the standard function pow returns a double, and C's “usual arithmetic conversions” (C11 6.3.1.8:1) mean that the expression f + pow(2,300) is computed as a double. It is then printed as a double because of how arguments are passed to variadic functions.
This C program shows how you can arrive at the float value that, added to FLT_MAX with float addition, results in float infinity:
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(){
float f = FLT_MAX;
printf("FLT_MAX: %a\n", f);
float b = nextafterf(f, 0);
printf("number before FLT_MAX: %a\n", b);
float d = f - b;
printf("difference: %a\n", d);
printf("FLT_MAX + d: %a\n", f + d);
printf("FLT_MAX + d/2: %a\n", f + d/2);
printf("FLT_MAX + nextafterf(d/2,0): %a\n", f + nextafterf(d/2,0));
float answer = d/2;
printf("answer: %a %.9e\n", answer, answer);
}
It prints:
FLT_MAX: 0x1.fffffep+127
number before FLT_MAX: 0x1.fffffcp+127
difference: 0x1p+104
FLT_MAX + d: inf
FLT_MAX + d/2: inf
FLT_MAX + nextafterf(d/2,0): 0x1.fffffep+127
answer: 0x1p+103 1.014120480e+31
It shows that if you take the difference between FLT_MAX and its lower neighbor (call this difference d), as you could expect, d added to FLT_MAX produces inf. But this is not the smallest float you can add to FLT_MAX to produce inf; there are smaller candidates. It is enough to add exactly half of d to FLT_MAX in order for the result to round up to inf. If you add less than that, on the other hand, the result is rounded down to FLT_MAX.
This line is working with double not float.
printf("%f", f + pow(2,300));
To be working with float you need
printf("%f", f + powf(2,300));
and in this case the output is
inf
In the second case the float result is promoted to double in the call to printf, but it's too late, the value is already in an overflow representation.
//float=(-1) ^ s * 2 ^ (x - 127) * (1 + n * 2 ^ -23)
// s xxxxxxxx nnnnnnnnnnnnnnnnnnnnnnn
//FLT_MAX 3.402823466e+38F 2 ^ 128 0 11111110 11111111111111111111111
//FLT_MIN 1.175494351e-38F 2 ^ -126 0 00000001 00000000000000000000000
//FLT_TRUE_MIN 1.401298464e-45F 2 ^ -149 0 00000000 00000000000000000000001
//ONE 1f 2 ^ 0 0 01111111 00000000000000000000000
//INFINITY - 2 ^ 128+ 0 11111111 00000000000000000000000
union
{
float f;
int i;
}k,k2,k3;
k.i = 0b01111111011111111111111111111111; // 2^128 FLT_MAX
k2.i = 0b01110011000000000000000000000000; // 2^103
k3.f = k.f + k2.f; // 2^128+ INFINITY
Typically, rounding to 2 decimal places is very easy with
printf("%.2lf",<variable>);
However, the rounding usually sends ties to the nearest even digit. For example,
2.554 -> 2.55
2.555 -> 2.56
2.565 -> 2.56
2.566 -> 2.57
And what I want to achieve is that
2.555 -> 2.56
2.565 -> 2.57
In fact, rounding half-up is doable in C, but for integers only:
int a = (int)(b+0.5);
So I'm asking how to do the same thing with 2 decimal places on positive values, rather than integers, to get the printed results I described above.
It is not clear whether you actually want to "round half-up", or rather "round half away from zero", which requires different treatment for negative values.
Single-precision binary float is precise to at least 6 decimal digits, and double to at least 15, so nudging an FP value by DBL_EPSILON (defined in float.h) will cause a round-up to the next 100th by printf( "%.2lf", x ) for n.nn5 values, without affecting the displayed value for values that are not n.nn5.
double x2 = x * (1 + DBL_EPSILON) ; // round half-away from zero
printf( "%.2lf", x2 ) ;
For different rounding behaviours:
double x2 = x * (1 - DBL_EPSILON) ; // round half-toward zero
double x2 = x + DBL_EPSILON ; // round half-up
double x2 = x - DBL_EPSILON ; // round half-down
Following is precise code to round a double to the nearest 0.01 double.
The code functions like x = round(100.0*x)/100.0; except that it uses manipulations to ensure scaling by 100.0 is done exactly, without precision loss.
Likely this is more code than OP is interested in, but it does work.
It works for the entire double range -DBL_MAX to DBL_MAX. (still should do more unit testing).
It depends on FLT_RADIX == 2, which is common.
#include <float.h>
#include <math.h>
#include <stdio.h>
void r100_best(const char *s) {
double x;
sscanf(s, "%lf", &x);
// Break x into whole number and fractional parts.
// Code only needs to round the fractional part.
// This preserves the entire `double` range.
double xi, xf;
xf = modf(x, &xi);
// Multiply the fractional part by N (256).
// Break into whole and fractional parts.
// This provides the needed extended precision.
// N should be >= 100 and a power of 2.
// The multiplication by a power of 2 will not introduce any rounding.
double xfi, xff;
xff = modf(xf * 256, &xfi);
// Multiply both parts by 100.
// *100 incurs 7 more bits of precision of which the preceding code
// ensures the 8 LSbit of xfi, xff are zero.
int xfi100, xff100;
xfi100 = (int) (xfi * 100.0);
xff100 = (int) (xff * 100.0); // Cast here will truncate (towards 0)
// sum the 2 parts.
// sum is the exact truncate-toward-0 version of xf*256*100
int sum = xfi100 + xff100;
// add in half N
if (sum < 0)
sum -= 128;
else
sum += 128;
xf = sum / 256;
xf /= 100;
double y = xi + xf;
printf("%6s %25.22f ", "x", x);
printf("%6s %25.22f %.2f\n", "y", y, y);
}
int main(void) {
r100_best("1.105");
r100_best("1.115");
r100_best("1.125");
r100_best("1.135");
r100_best("1.145");
r100_best("1.155");
r100_best("1.165");
return 0;
}
[Edit] OP clarified that only the printed value needs rounding to 2 decimal places.
OP's observation about rounding "half-way" numbers per "round to even" or "round away from zero" is misleading. Of 100 "half-way" numbers like 0.005, 0.015, 0.025, ... 0.995, typically only 4 are exactly "half-way": 0.125, 0.375, 0.625, 0.875. This is because floating-point formats use base 2, and numbers like 2.565 cannot be represented exactly.
Instead, sample numbers like 2.565 have as their closest double a value of 2.564999999999999947..., assuming binary64. Rounding that number to the nearest 0.01 should give 2.56 rather than the 2.57 desired by OP.
Thus only numbers ending with 0.125 and 0.625 are exactly half-way and round down rather than up as desired by OP. Suggest accepting that and using:
printf("%.2lf",variable); // This should be sufficient
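To see this without relying on terminal output, the formatting can be captured in a buffer. This is only a sketch, assuming binary64 double and a C library whose printf family rounds the exact stored value correctly (not guaranteed everywhere); the helper name format_2dp is made up for illustration.

```c
#include <stdio.h>

/* Format x with two decimals into out (capacity n).
   Returns what snprintf returns: the length that was needed. */
int format_2dp(double x, char *out, size_t n) {
    return snprintf(out, n, "%.2f", x);
}
```

Under those assumptions, 2.565 should format as 2.56 because the stored value lies slightly below 2.565, while 2.555 should format as 2.56 because its stored value lies slightly above 2.555. Neither case involves a tie, so the tie-breaking rule never comes into play. Formatting the same values with "%.20f" shows the stored digits.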
To get close to OP's goal, numbers could be A) tested against ending with 0.125 or 0.625 or B) increased slightly. The smallest increase would be
#include <math.h>
printf("%.2f", nextafter(x, 2*x));
Another nudge method is found in @Clifford's answer.
[Former answer that rounds a double to the nearest double multiple of 0.01]
Typical floating-point uses formats like binary64, which employs base 2. "Rounding to the nearest mathematical 0.01 with ties away from 0.0" is challenging.
As @Pascal Cuoq mentions, floating point numbers like 2.555 typically are only near 2.555 and have a more precise value like 2.555000000000000159872... which is not half way.
@BLUEPIXY's solution below is the best and most practical.
x = round(100.0*x)/100.0;
"The round functions round their argument to the nearest integer value in floating-point format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.6.
The ((int)(100 * (x + 0.005)) / 100.0) approach has 2 problems: it may round in the wrong direction for negative numbers (OP did not specify) and integers typically have a much smaller range (INT_MIN to INT_MAX) than double.
There are still some cases, such as double x = atof("1.115");, which end up near 1.12 when the result really should be 1.11, because 1.115 as a double is really closer to 1.11 and not "half-way".
string   x                             rounded x
1.115    1.1149999999999999911182e+00  1.1200000000000001065814e+00
OP has not specified the rounding of negative numbers; this assumes y = -f(-x).
There seem to be two definitions of the machine epsilon:
1. The maximum relative error when rounding a real number to the next floating-point number.
2. The minimum positive number such that 1.0 + machine_eps != 1.0
First of all, I fail to see how these two correlate.
Second, DBL_EPSILON does not conform to definition 2, in my understanding.
The following Program prints:
DBL_EPSILON: 2.220446049250313080847e-16
DBL_EPSILON / 2: 1.110223024625156540424e-16
1.0 + DBL_EPSILON: 1.000000000000000222045e+00
1.0 + DBL_EPSILON / 2: 1.000000000000000000000e+00
m_eps 2.220446049250313080847e-16
m_eps -1u 2.220446049250312834328e-16
1.0 + m_eps -1u 1.000000000000000222045e+00
(m_eps -1u < DBL_EPSILON): True
(m_eps -1u == DBL_EPSILON/2): False
m_eps -1u should be a number smaller than but really close to DBL_EPSILON. With
definition 2, should 1.0 + m_eps -1u not evaluate to 1.0? Why is it necessary
to divide DBL_EPSILON by 2 for this?
#include <stdio.h>
#include <stdint.h>
#include <float.h>
union Double_t {
double f;
int64_t i;
};
int main(int argc, char *argv[])
{
union Double_t m_eps;
printf("DBL_EPSILON: \t\t%.*e\n", DECIMAL_DIG, DBL_EPSILON);
printf("DBL_EPSILON / 2: \t%.*e\n", DECIMAL_DIG, DBL_EPSILON / 2);
printf("1.0 + DBL_EPSILON: \t%.*e\n", DECIMAL_DIG, 1.0 + DBL_EPSILON);
printf("1.0 + DBL_EPSILON / 2: \t%.*e\n", DECIMAL_DIG, 1.0 + DBL_EPSILON / 2);
m_eps.f = DBL_EPSILON;
printf("\nm_eps \t\t\t%.*e\n", DECIMAL_DIG, m_eps.f);
m_eps.i -= 1;
printf("m_eps -1u\t\t%.*e\n", DECIMAL_DIG, m_eps.f);
printf("\n1.0 + (m_eps -1u)\t\t%.*e\n", DECIMAL_DIG, 1.0 + m_eps.f);
printf("\n(m_eps -1u < DBL_EPSILON): %s\n",
(m_eps.f < DBL_EPSILON) ? "True": "False"
);
printf("(m_eps -1u == DBL_EPSILON/2): %s\n",
(DBL_EPSILON/2 == m_eps.f) ? "True": "False"
);
return 0;
}
A wrong definition of DBL_EPSILON, the one you quote as “The minimum positive number such that 1.0 + machine_eps != 1”, is floating around. You can even find it in standard libraries and in otherwise fine answers on StackOverflow. When found in standard libraries, it is in a comment near a value that obviously does not correspond to the comment, but corresponds to the correct definition:
DBL_EPSILON: This is the difference between 1 and the smallest
floating point number of type double that is greater than 1. (correct definition taken from the GNU C library)
The C99 standard phrases it this way:
the difference between 1 and the least value greater than 1 that is representable in the given floating point type, b^(1−p)
This is probably the cause of your confusion. Forget about the wrong definition. I wrote a rant about this here (which is very much like your question).
The other definition in your question, “the maximum relative Error when rounding a real number to the next floating-point number”, is correct-ish when the result of the rounding is a normal floating-point number. Rounding a real to a finite floating-point number produces a floating-point number within 1/2 ULP of the real value. For a normal floating-point number, this 1/2 ULP absolute error translates to a relative error that can be between DBL_EPSILON/2 and DBL_EPSILON/4 depending on where the floating-point number is located in its binade.
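The behaviour observed in the question can be checked directly. The sketch below assumes IEEE 754 binary64, round-to-nearest-even, and double arithmetic evaluated in double (FLT_EVAL_METHOD == 0); with extended-precision evaluation the middle case may come out differently. The helper names changes_one and half_eps_next are made up for illustration.

```c
#include <float.h>

/* Return nonzero if 1.0 + x, rounded to double, differs from 1.0. */
int changes_one(double x) {
    volatile double sum = 1.0 + x;   /* store forces a double result */
    return sum != 1.0;
}

/* The next double above DBL_EPSILON / 2.
   2^-53 * (1 + 2^-52) is exactly representable, so no rounding occurs. */
double half_eps_next(void) {
    return DBL_EPSILON / 2 * (1 + DBL_EPSILON);
}
```

Under those assumptions, changes_one(DBL_EPSILON / 2) is false because 1 + DBL_EPSILON/2 is an exact tie that rounds to the even neighbour, 1.0. Anything larger, starting with half_eps_next(), lies past the tie and rounds up to 1 + DBL_EPSILON. That is why values just below DBL_EPSILON still change 1.0, and why the "minimum number such that 1.0 + x != 1.0" lands near DBL_EPSILON/2 rather than at DBL_EPSILON.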
For the following code,
#include <stdio.h>
#include <limits.h>
#include <float.h>
int main(void) {
printf("double max = %??\n", DBL_MAX);
printf("double min = %??\n", DBL_MIN);
printf("double epsilon = %??\n", DBL_EPSILON);
printf("float epsilon = %??\n", FLT_EPSILON);
printf("float max = %??\n", FLT_MAX);
printf("float min = %??\n\n", FLT_MIN);
return 0;
}
what specifiers would I have to use in place of the ??'s in order for printf to display the various quantities as appropriately-sized decimal numbers?
Use the same format you'd use for any other values of those types:
#include <float.h>
#include <stdio.h>
int main(void) {
printf("FLT_MAX = %g\n", FLT_MAX);
printf("DBL_MAX = %g\n", DBL_MAX);
printf("LDBL_MAX = %Lg\n", LDBL_MAX);
}
Arguments of type float are promoted to double for variadic functions like printf, which is why you use the same format for both.
%f prints a floating-point value using decimal notation with no exponent, which will give you a very long string of (mostly insignificant) digits for very large values.
%e forces the use of an exponent.
%g uses either %f or %e, depending on the magnitude of the number being printed.
On my system, the above prints the following:
FLT_MAX = 3.40282e+38
DBL_MAX = 1.79769e+308
LDBL_MAX = 1.18973e+4932
As Eric Postpischil points out in a comment, the above prints only approximations of the values. You can print more digits by specifying a precision (the number of digits you'll need depends on the precision of the types); for example, you can replace %g by %.20g.
Or, if your implementation supports it, C99 added the ability to print floating-point values in hexadecimal with as much precision as necessary:
printf("FLT_MAX = %a\n", FLT_MAX);
printf("DBL_MAX = %a\n", DBL_MAX);
printf("LDBL_MAX = %La\n", LDBL_MAX);
But the result is not as easily human-readable as the usual decimal format:
FLT_MAX = 0x1.fffffep+127
DBL_MAX = 0x1.fffffffffffffp+1023
LDBL_MAX = 0xf.fffffffffffffffp+16380
(Note: main() is an obsolescent definition; use int main(void) instead.)
To print approximations of the maximums with enough digits to represent the actual values (the result of converting the printed value back to floating-point should be the original value), you can use:
#include <float.h>
#include <stdio.h>
int main(void)
{
printf("%.*g\n", DECIMAL_DIG, FLT_MAX);
printf("%.*g\n", DECIMAL_DIG, DBL_MAX);
printf("%.*Lg\n", DECIMAL_DIG, LDBL_MAX);
return 0;
}
In C 2011, you can use the more specific FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, and LDBL_DECIMAL_DIG in place of DECIMAL_DIG.
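The round-trip property mentioned above can be tested rather than taken on faith. This is a small sketch; the helper name round_trips is made up for illustration, and the fallback value 9 for FLT_DECIMAL_DIG is an assumption that holds for IEEE 754 binary32.

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef FLT_DECIMAL_DIG
#define FLT_DECIMAL_DIG 9   /* assumed value for binary32 on pre-C11 headers */
#endif

/* Return nonzero if x survives printing with FLT_DECIMAL_DIG significant
   digits followed by reading the text back with strtof(). */
int round_trips(float x) {
    char buf[64];
    snprintf(buf, sizeof buf, "%.*g", FLT_DECIMAL_DIG, x);
    return strtof(buf, NULL) == x;
}
```

For finite values such as FLT_MAX, FLT_MIN and FLT_EPSILON this should return nonzero, provided the library's printing and parsing are correctly rounded.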
To print the exact values, instead of approximations, you need to specify more precision. (int) (log10(x)+1) digits should be enough.
Approximations of the minimums and the epsilons can be printed with sufficient accuracy in the same way. However, calculating the numbers of digits needed for exact values may be more complicated than for the maximums. (Technically, it may be impossible in exotic C implementations. E.g., a base-three floating-point system would have a minimum not representable in any finite number of decimal digits. I am not aware of any such implementations in use.)
You could use the last three prints in my solution to the exercise 2.1 from The C Programming Language
// float or IEEE754 binary32
printf(
"float: {min: %e, max: %e}, comp: {min: %e, max: %e}\n",
FLT_MIN, FLT_MAX, pow(2,-126), pow(2,127) * (2 - pow(2,-23))
);
// double or IEEE754 binary64
printf(
"double: {min: %e, max: %e}, comp: {min: %e, max: %e}\n",
DBL_MIN, DBL_MAX, pow(2,-1022), pow(2,1023) * (2 - pow(2,-52))
);
// long double or IEEE754 binary 128
printf(
"long double: {min: %Le, max: %Le}, comp: {min: %Le, max: %Le}\n",
LDBL_MIN, LDBL_MAX, powl(2,-16382), powl(2,16383) * (2 - powl(2,-112))
);
Obviously, the maximal values are calculated according to IEEE 754. The full solution is available via link:
https://github.com/mat90x/tcpl/blob/master/types_ranges.c
There is a constant, FLT_MIN, that is the float nearest to zero. How can I get the value nearest to some other number?
As an example:
float nearest_to_1000 = 1000.0f + epsilon;
// epsilon must be the smallest value satisfying condition:
// nearest_to_1000 > 1000.0f
I would prefer numeric formula without using special functions.
C provides a function for this, in the <math.h> header. nextafterf(x, INFINITY) is the next representable value after x, in the direction toward INFINITY.
However, if you'd prefer to do it yourself:
The following returns the epsilon you seek, for single precision (float), assuming IEEE 754. See notes at the bottom about using library routines.
#include <float.h>
#include <math.h>
/* Return the ULP of q.
This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
05.12_, Faculty for Information and Communication Sciences, Hamburg
University of Technology, November 13, 2005.
*/
float ULP(float q)
{
// SmallestPositive is the smallest positive floating-point number.
static const float SmallestPositive = FLT_EPSILON * FLT_MIN;
/* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
something in [.75 ULP, 1.5 ULP) (even with rounding).
*/
static const float Scale = 0.75f * FLT_EPSILON;
q = fabsf(q);
/* In fmaf(q, -Scale, q), we subtract q*Scale from q, and q*Scale is
something more than .5 ULP but less than 1.5 ULP. That must produce q
- 1 ULP. Then we subtract that from q, so we get 1 ULP.
The significand 1 is of particular interest. We subtract .75 ULP from
q, which is midway between the greatest two floating-point numbers less
than q. Since we round to even, the lesser one is selected, which is
less than q by 1 ULP of q, although 2 ULP of itself.
*/
return fmaxf(SmallestPositive, q - fmaf(q, -Scale, q));
}
The following returns the next value representable in float after the value it is passed (treating −0 and +0 as the same).
#include <float.h>
#include <math.h>
/* Return the next floating-point value after the finite value q.
This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
05.12_, Faculty for Information and Communication Sciences, Hamburg
University of Technology, November 13, 2005.
*/
float NextAfterf(float q)
{
/* Scale is .625 ULP, so multiplying it by any significand in [1, 2)
yields something in [.625 ULP, 1.25 ULP].
*/
static const float Scale = 0.625f * FLT_EPSILON;
/* Either of the following may be used, according to preference and
performance characteristics. In either case, use a fused multiply-add
(fmaf) to add to q a number that is in [.625 ULP, 1.25 ULP]. When this
is rounded to the floating-point format, it must produce the next
number after q.
*/
#if 0
// SmallestPositive is the smallest positive floating-point number.
static const float SmallestPositive = FLT_EPSILON * FLT_MIN;
if (fabsf(q) < 2*FLT_MIN)
return q + SmallestPositive;
return fmaf(fabsf(q), Scale, q);
#else
return fmaf(fmaxf(fabsf(q), FLT_MIN), Scale, q);
#endif
}
Library routines are used, but fmaxf (maximum of its arguments) and fabsf (absolute value) are easily replaced. fmaf should compile to a hardware instruction on architectures with fused multiply-add. Failing that, fmaf(a, b, c) in this use can be replaced by (double) a * b + c. (IEEE-754 binary64 has sufficient range and precision to replace fmaf. Other choices for double might not.)
Another alternative to the fused-multiply add would be to add some tests for cases where q * Scale would be subnormal and handle those separately. For other cases, the multiplication and addition can be performed separately with ordinary * and + operators.