I need to find a number that is a power of 2 and that, when added to FLT_MAX, causes overflow. However, when I print the sum with a very large power, like 2^300, inf still doesn't appear. Also, I thought that since FLT_MAX is the largest representable float, adding 1 to it would cause overflow immediately.
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(){
float f = FLT_MAX;
printf("%f", f + pow(2,300));
}
Any help would be appreciated. Thanks!
The answer is (FLT_MAX - nextafterf(FLT_MAX, 0))/2, that is, exactly 0x1p+103 or approximately 1.014120480e+31.
There is a mistake in the method you use to determine the answer: the standard function pow returns a double, and C's “usual arithmetic conversions” (C11 6.3.1.8:1) mean that the expression f + pow(2,300) is computed as a double. It is then printed as a double because of how arguments are passed to variadic functions.
This C program shows how you can arrive at the float value that, added to FLT_MAX with float addition, results in float infinity:
#include <stdio.h>
#include <float.h>
#include <math.h>
int main(){
float f = FLT_MAX;
printf("FLT_MAX: %a\n", f);
float b = nextafterf(f, 0);
printf("number before FLT_MAX: %a\n", b);
float d = f - b;
printf("difference: %a\n", d);
printf("FLT_MAX + d: %a\n", f + d);
printf("FLT_MAX + d/2: %a\n", f + d/2);
printf("FLT_MAX + nextafterf(d/2,0): %a\n", f + nextafterf(d/2,0));
float answer = d/2;
printf("answer: %a %.9e\n", answer, answer);
}
It prints:
FLT_MAX: 0x1.fffffep+127
number before FLT_MAX: 0x1.fffffcp+127
difference: 0x1p+104
FLT_MAX + d: inf
FLT_MAX + d/2: inf
FLT_MAX + nextafterf(d/2,0): 0x1.fffffep+127
answer: 0x1p+103 1.014120480e+31
It shows that if you take the difference between FLT_MAX and its lower neighbor (call this difference d), then, as you could expect, d added to FLT_MAX produces inf. But this is not the smallest float you can add to FLT_MAX to produce inf; there are smaller candidates. It is enough to add exactly half of d to FLT_MAX for the result to round up to inf. If you add less than that, on the other hand, the result is rounded down to FLT_MAX.
This line works with double, not float.
printf("%f", f + pow(2,300));
To be working with float you need
printf("%f", f + powf(2,300));
and in this case the output is
inf
In the second case the float result is promoted to double in the call to printf, but by then it is too late: the value is already infinity (powf(2,300) on its own already overflows float).
//float=(-1) ^ s * 2 ^ (x - 127) * (1 + n * 2 ^ -23)
// s xxxxxxxx nnnnnnnnnnnnnnnnnnnnnnn
//FLT_MAX 3.402823466e+38F 2 ^ 128 0 11111110 11111111111111111111111
//FLT_MIN 1.175494351e-38F 2 ^ -126 0 00000001 00000000000000000000000
//FLT_TRUE_MIN 1.401298464e-45F 2 ^ -149 0 00000000 00000000000000000000001
//ONE 1f 2 ^ 0 0 01111111 00000000000000000000000
//INFINITY - 2 ^ 128+ 0 11111111 00000000000000000000000
union
{
float f;
int i;
}k,k2,k3;
k.i = 0b01111111011111111111111111111111; // FLT_MAX, just below 2^128
k2.i = 0b01110011000000000000000000000000; // 2^103
k3.f = k.f + k2.f; // rounds up to INFINITY
Why, with strtof(), is "3.40282356779733650000e38" unexpectedly converted to infinity even though it is within 0.5 ULP of FLT_MAX?
FLT_MAX (float32) is 0x1.fffffep+127 or about 3.4028234663852885981170e+38.
1/2 ULP above FLT_MAX is 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, so I expected any decimal text between FLT_MAX and this value to convert to FLT_MAX when in "round to nearest" mode.
This works as the decimal text increases from FLT_MAX to about 3.40282356779733642700e38, yet for decimal text values above that, like "3.40282356779733650000e38", the conversion result is infinity.
The following code reveals the issue. It gently creeps up a decimal text string, looking for the value at which the conversion changes to infinity.
Your results may differ as not all C implementations use the same floating point.
#include <assert.h>
#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
void bar(unsigned n) {
char buf[100];
assert (n < 90);
int len = sprintf(buf, "%.*fe%d", n+1, 0.0, FLT_MAX_10_EXP);
puts(buf);
printf("%-*s %-*s %s\n", len, "string", n+3, "float", "double");
float g = 0;
for (unsigned i = 0; i < n; i++) {
for (int digit = '1'; digit <= '9'; digit++) {
unsigned offset = i ? 1+i : i;
buf[offset]++;
errno = 0;
float f = strtof(buf, 0);
if (errno) {
buf[offset]--;
break;
}
g = f;
}
printf("\"%s\" %.*e %a\n", buf, n + 3, g, atof(buf));
}
double delta = FLT_MAX - nextafterf(FLT_MAX, 0);
double flt_max_ulp_d2 = FLT_MAX + delta/2.0;
printf(" %.*e %a FLT_MAX + 1/2 ULP - 1 dULP\n", n + 3, nextafter(flt_max_ulp_d2,0),nextafter(flt_max_ulp_d2,0));
printf(" %.*e %a FLT_MAX + 1/2 ULP\n", n + 3, flt_max_ulp_d2,flt_max_ulp_d2);
printf(" %.*e %a FLT_MAX\n", n + 3, FLT_MAX, FLT_MAX);
printf(" 1 23456789 123456789 123456789\n");
printf("FLT_ROUNDS %d (0: toward zero, 1: to nearest)\n", FLT_ROUNDS);
}
int main() {
printf("%a %.20e\n", FLT_MAX, FLT_MAX);
printf("%a\n", strtof("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.4028235677973366163754e+3", 0));
bar(19);
}
Output
0x1.fffffep+127 3.40282346638528859812e+38
inf
0x1.ffffffp+127
0x1.a95a5aaada733p+11
0.00000000000000000000e38
string float double
"3.00000000000000000000e38" 3.0000000054977557577780e+38 0x1.c363cbf21f28ap+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40200000000000000000e38" 3.4020000005553803402978e+38 0x1.ffe045fe9918p+127
"3.40280000000000000000e38" 3.4027999387901483621794e+38 0x1.ffff169a83f08p+127
"3.40282000000000000000e38" 3.4028200183756559773331e+38 0x1.ffffdbd19d02cp+127
"3.40282300000000000000e38" 3.4028230607370965250836e+38 0x1.fffff966ad924p+127
"3.40282350000000000000e38" 3.4028234663852885981170e+38 0x1.fffffe54daff8p+127
"3.40282356000000000000e38" 3.4028234663852885981170e+38 0x1.fffffeec5116ep+127
"3.40282356700000000000e38" 3.4028234663852885981170e+38 0x1.fffffefdfcbbcp+127
"3.40282356770000000000e38" 3.4028234663852885981170e+38 0x1.fffffeffc119p+127
"3.40282356779000000000e38" 3.4028234663852885981170e+38 0x1.fffffefffb424p+127
"3.40282356779700000000e38" 3.4028234663852885981170e+38 0x1.fffffeffffc85p+127
"3.40282356779730000000e38" 3.4028234663852885981170e+38 0x1.fffffefffff9fp+127
"3.40282356779733000000e38" 3.4028234663852885981170e+38 0x1.fffffefffffeep+127
"3.40282356779733600000e38" 3.4028234663852885981170e+38 0x1.fffffeffffffep+127
"3.40282356779733640000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127 <-- Actual
"3.40282356779733660000e38" 3.4028234663852885981170e+38 ... <-- Expected
"3.40282356779733642000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
"3.40282356779733642700e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
3.4028235677973362385861e+38 0x1.fffffefffffffp+127 FLT_MAX + 1/2 ULP - 1 dULP
3.4028235677973366163754e+38 0x1.ffffffp+127 FLT_MAX + 1/2 ULP
3.4028234663852885981170e+38 0x1.fffffep+127 FLT_MAX
1 23456789 123456789 123456789
FLT_ROUNDS 1 (0: toward zero, 1: to nearest)
Notes: GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
[Edit]
The exact value of FLT_MAX + 1/2 ULP:
0x1.ffffffp+127 340282356779733661637539395458142568448.0
I stumbled on this problem today when trying to determine the maximum decimal text passed to strtof() that returned a finite float.
This is a self-answer (see "Can I answer my own question?"). Other answers are welcome.
Why, with strtof(), is "3.40282356779733650000e38" unexpectedly converted to infinity even though it is within 0.5 ULP of FLT_MAX?
Certainly double rounding.
"Double" here refers to doing something twice, not the type double.
Call the value 1/2 of a float ULP above FLT_MAX, 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, the threshold.
About 3.4028235677973364274808e38 is one half of a double ULP below threshold. Apparently values like "3.40282356779733650000e38" are prematurely rounded, as a double, to threshold. threshold, as a float, is halfway between FLT_MAX and the next larger float (if the encoding were extended). Being a halfway tie, it rounds to the "even" value, the larger one in this case. Since the next larger float is beyond the max encodable finite value, the result is infinity.
Conclusions
A better strtof() would correctly handle this corner case.
Instead, it is reasonable to treat decimal digits past FLT_DECIMAL_DIG + 3 (see the following) in strtof() input as noise.
IEEE 754 allows an implementation's decimal text conversion to treat all decimal digits past a certain significance as zero. That permits a conversion to return the second-closest float when the text is near the halfway point between two floats. With common float, that significance is FLT_DECIMAL_DIG + 3, or 12 decimal digits. That allowance is not what is used here, as digits in the 19th place affect the result.
I'm using the online compiler https://www.onlinegdb.com/ and in the following code when I multiply 2.1 with 100 the output becomes 209 instead of 210.
#include<stdio.h>
#include <stdint.h>
int main()
{
float x = 1.8;
x = x + 0.3;
int coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = (uint16_t)(x * coefficient);
printf("y: %d\n", y);
return 0;
}
What am I doing wrong? And what should I do to obtain 210?
I tried all sorts of type casts, and it still doesn't work.
The following assumes the compiler uses IEEE-754 binary32 and binary64 for float and double, which is overwhelmingly common.
float x = 1.8;
Since 1.8 is a double constant, the compiler converts 1.8 to the nearest double value, 1.8000000000000000444089209850062616169452667236328125. Then, to assign it to the float x, it converts that to the nearest float value, 1.7999999523162841796875.
x = x + 0.3;
The compiler converts 0.3 to the nearest double value, 0.299999999999999988897769753748434595763683319091796875. Then it adds x and that value using double arithmetic, which produces 2.09999995231628400205181605997495353221893310546875.
Then, to assign that to x, it converts it to the nearest float value, 2.099999904632568359375.
uint16_t y = (uint16_t)(x * coefficient);
Since x is float and coefficient is int, the compiler converts the coefficient to float and performs the multiplication using float arithmetic. This produces 209.9999847412109375.
Then the conversion to uint16_t truncates the number, producing 209.
One way to get 210 instead is to use uint16_t y = lroundf(x * coefficient);. (lroundf is declared in <math.h>.) However, to determine what the right way is, you should explain what these numbers are and why you are doing this arithmetic with them.
Floating-point numbers are not exact. When you add 1.8 + 0.3,
the FPU may generate a result slightly different from the expected 2.1 (by a margin smaller than float epsilon).
Read more about floating-point representation on Wikipedia: https://en.wikipedia.org/wiki/Machine_epsilon
What happens in your case is:
(1.8 + 0.3) * 100 = 209.99999...
Then you truncate it to an integer, resulting in 209.
You might also find this question relevant: Why float.Epsilon and not zero? A possible fix:
#include<stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <math.h>
int main()
{
float x = 1.8;
x = x + 0.3;
uint16_t coefficient = 100;
printf("x: %2f\n", x);
uint16_t y = round(x * coefficient);
printf("y: %" PRIu16 "\n", y);
return 0;
}
Float max/min is
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368
Compiling to assembly, I see the literal is 0xffefffffffffffff. I am unable to understand how to write it in float literal form. I tried -0xFFFFFFFFFFFFFp972, which resulted in 0xFFEFFFFFFFFFFFFE. Notice the last digit is E instead of F. I have no idea why the last bit is wrong or why 972 gave me the closest number. I didn't understand what I should be doing with the exponent bias either. I used 13 F's because that would set 52 bits (the number of bits in the mantissa), but I'm clueless about everything else.
I want to be able to write double min/max as a literal and understand it well enough that I can parse it into an 8-byte hex value.
How do I write float max as a float literal?
Use FLT_MAX. If making your own code, use exponential notation either as hex (preferred) or decimal. If in decimal, use FLT_DECIMAL_DIG significant digits. Any more is not informative. Append an f.
#include <float.h>
#include <stdio.h>
int main(void) {
printf("%a\n", FLT_MAX);
printf("%.*g\n", FLT_DECIMAL_DIG, FLT_MAX);
float m0 = FLT_MAX;
float m1 = 0x1.fffffep+127f;
float m2 = 3.40282347e+38f;
printf("%d %d\n", m1 == m0, m2 == m0);
}
Sample output
0x1.fffffep+127
3.40282347e+38
1 1
Likewise for double, yet no f.
printf("%a\n", DBL_MAX);
printf("%.*g\n", DBL_DECIMAL_DIG, DBL_MAX);
0x1.fffffffffffffp+1023
1.7976931348623157e+308
double m0 = DBL_MAX;
double m1 = 0x1.fffffffffffffp+1023;
double m2 = 1.7976931348623157e+308;
Rare machines will have different max values.
I have this code
#include <stdio.h>
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
int main()
{
double a=1/3;
double b=1.0/3.0;
double c=1.0f/3.0f;
printf("a = %20.15lf, b = %20.15lf, c = %20.15lf\n", a,b,c);
float d=1/3;
float e=1.0/3.0;
float f=1.0f/3.0f;
printf("d = %20.15f, e = %20.15f, f = %20.15f\n", d,e,f);
double g=Third*3.0;
double h=ThirdFloat*3.0;
float i=ThirdFloat*3.0f;
printf("(1/3)*3: g = %20.15lf; h = %20.15lf, i = %20.15f\n", g, h, i);
}
Which gives that output
a = 0.000000000000000, b = 0.333333333333333, c = 0.333333343267441
d = 0.000000000000000, e = 0.333333343267441, f = 0.333333343267441
(1/3)*3: g = 1.000000000000000; h = 1.000000029802322, i = 1.000000000000000
I assume that the output for a and d looks like this because the compiler converts the integer value to floating point only after the division.
b looks good; e is wrong because of low float precision, and so are c and f.
But I have no idea why g has the correct value (I thought that 1.0/3.0 = 1.0lf/3.0lf, but then i should be wrong) and why h isn't the same as i.
Let us first look closer: use "%.17e" (approximate decimal) and "%a" (exact).
#include <stdio.h>
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
#define FMT "%.17e, %a"
int main(void) {
double a=1/3;
double b=1.0/3.0;
double c=1.0f/3.0f;
printf("a = " FMT "\n", a,a);
printf("b = " FMT "\n", b,b);
printf("c = " FMT "\n", c,c);
puts("");
float d=1/3;
float e=1.0/3.0;
float f=1.0f/3.0f;
printf("d = " FMT "\n", d,d);
printf("e = " FMT "\n", e,e);
printf("f = " FMT "\n", f,f);
puts("");
double g=Third*3.0;
double h=ThirdFloat*3.0;
float i=ThirdFloat*3.0f;
printf("g = " FMT "\n", g,g);
printf("h = " FMT "\n", h,h);
printf("i = " FMT "\n", i,i);
}
Output
a = 0.00000000000000000e+00, 0x0p+0
b = 3.33333333333333315e-01, 0x1.5555555555555p-2
c = 3.33333343267440796e-01, 0x1.555556p-2
d = 0.00000000000000000e+00, 0x0p+0
e = 3.33333343267440796e-01, 0x1.555556p-2
f = 3.33333343267440796e-01, 0x1.555556p-2
g = 1.00000000000000000e+00, 0x1p+0
h = 1.00000002980232239e+00, 0x1.0000008p+0
i = 1.00000000000000000e+00, 0x1p+0
But i have no idea why g has correct value
(1.0/3.0)*3.0 can be evaluated as a double at compile time or run time, and the rounded result is exactly 1.0.
(1.0/3.0)*3.0 can also be evaluated at compile time or run time using wider than double math, and the rounded result is exactly 1.0. Research FLT_EVAL_METHOD.
and why h isn't the same as i.
(1.0f/3.0f) can use float math to form the float quotient, which is noticeably different from one-third: 0.333333343267.... It is not surprising that a final *3.0 gives a result different from 1.0.
The outputs are all correct. We need to see why the expectation was amiss.
OP further asks: "Why is h (float * double) less accurate than i (float * float)?"
Both start with 0.333333343267... * 3.0, not one-third * 3.0.
float * double is more accurate. Both form a product, yet float * float is a float product rounded to the nearest 1 part in 2^24, whereas the more accurate float * double product is a double and rounds to the nearest 1 part in 2^53. The float * float product rounds to 1.0000000, whereas the float * double product rounds to 1.0000000298...
But i have no idea why g has correct value (i thought that 1.0/3.0 = 1.0lf/3.0lf
g has exactly the value it should, based on:
#define Third (1.0/3.0)
...
double g=Third*3.0;
which is g=(1.0/3.0)*3.0;
Which is 1.000000000000000 (when printed with "%20.15lf")
I think I got the answer.
#define Third (1.0/3.0)
#define ThirdFloat (1.0f/3.0f)
printf("%20.15f, %20.15lf\n", ThirdFloat*3.0, ThirdFloat*3.0);//float*double
printf("%20.15f, %20.15lf\n", ThirdFloat*3.0f, ThirdFloat*3.0f);//float*float
printf("%20.15f, %20.15lf\n", Third*3.0, Third*3.0);//double*double
printf("%20.15f, %20.15lf\n\n", Third*3.0f, Third*3.0f);//double*float
printf("%20.15f, %20.15lf\n", Third, Third);
printf("%20.15f, %20.15lf\n", ThirdFloat, ThirdFloat);
printf("%20.15f, %20.15lf\n", 3.0, 3.0);
printf("%20.15f, %20.15lf\n", 3.0f, 3.0f);
And output:
1.000000029802322, 1.000000029802322
1.000000000000000, 1.000000000000000
1.000000000000000, 1.000000000000000
1.000000000000000, 1.000000000000000
0.333333333333333, 0.333333333333333
0.333333343267441, 0.333333343267441
3.000000000000000, 3.000000000000000
3.000000000000000, 3.000000000000000
The first line is not accurate because of the limitations of float. The constant ThirdFloat has low precision, so when it is multiplied by a double, the compiler takes this rough approximation (0.333333343267441), converts it to double and multiplies it by the double 3.0, which keeps the error visible in the result (1.000000029802322).
But if ThirdFloat, which is a float, is multiplied by 3.0f, which is a float as well, the product is rounded to float precision. The product starts from the same approximation of 1/3, but the nearest float to 1.0000000298... is exactly 1.0, which is why I got what looks like an exact result.
In this example, the behaviour of floor differs and I do not understand why:
printf("floor(34000000.535 * 100 + 0.5) : %lf \n", floor(34000000.535 * 100 + 0.5));
printf("floor(33000000.535 * 100 + 0.5) : %lf \n", floor(33000000.535 * 100 + 0.5));
The output for this code is:
floor(34000000.535 * 100 + 0.5) : 3400000053.000000
floor(33000000.535 * 100 + 0.5) : 3300000054.000000
Why is the first result not equal to 3400000054.0, as we would expect?
double in C does not represent every possible number that can be expressed in text.
double can typically represent about 2^64 different numbers. Neither 34000000.535 nor 33000000.535 is in that set when double is encoded as a binary floating point number. Instead, the closest representable number is used.
Text 34000000.535
closest double 34000000.534999996423...
Text 33000000.535
closest double 33000000.535000000149...
With double as a binary floating point number, multiplying by a non-power-of-2, like 100.0, can introduce additional rounding differences. Yet in these cases, it still results in two products, one at or just above xxx.5 and the other below.
Adding 0.5, a simple power of 2, does not incur rounding issues here, as 0.5 is not extreme compared to 3x00000053.5.
Printing the intermediate results with higher precision shows the step-by-step process.
#include <stdio.h>
#include <float.h>
#include <math.h>
void fma_test(double a, double b, double c) {
int n = DBL_DIG + 3;
printf("a b c %.*e %.*e %.*e\n", n, a, n, b, n, c);
printf("a*b %.*e\n", n, a*b);
printf("a*b+c %.*e\n", n, a*b+c);
printf("a*b+c %.*e\n", n, floor(a*b+c));
puts("");
}
int main(void) {
fma_test(34000000.535, 100, 0.5);
fma_test(33000000.535, 100, 0.5);
}
Output
a b c 3.400000053499999642e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.400000053499999523e+09
a*b+c 3.400000053999999523e+09
a*b+c 3.400000053000000000e+09
a b c 3.300000053500000015e+07 1.000000000000000000e+02 5.000000000000000000e-01
a*b 3.300000053500000000e+09
a*b+c 3.300000054000000000e+09
a*b+c 3.300000054000000000e+09
The issue is more complex than this simple answer suggests, as various platforms can 1) use higher precision math like long double or 2) rarely, use a decimal floating point double. So the code's results may vary.
This question has already been answered here.
Basically, floating-point numbers are just approximations. If we have a program like this:
float a = 0.2 + 0.3;
float b = 0.25 + 0.25;
if (a == b) {
//might happen
}
if (a != b) {
// also might happen
}
The only guaranteed thing is that a-b is relatively small.