For the following code,
#include <stdio.h>
#include <limits.h>
#include <float.h>
int main(void) {
printf("double max = %??\n", DBL_MAX);
printf("double min = %??\n", DBL_MIN);
printf("double epsilon = %??\n", DBL_EPSILON);
printf("float epsilon = %??\n", FLT_EPSILON);
printf("float max = %??\n", FLT_MAX);
printf("float min = %??\n\n", FLT_MIN);
return 0;
}
what specifiers would I have to use in place of the ??'s in order for printf to display the various quantities as appropriately-sized decimal numbers?
Use the same format you'd use for any other values of those types:
#include <float.h>
#include <stdio.h>
int main(void) {
printf("FLT_MAX = %g\n", FLT_MAX);
printf("DBL_MAX = %g\n", DBL_MAX);
printf("LDBL_MAX = %Lg\n", LDBL_MAX);
}
Arguments of type float are promoted to double for variadic functions like printf, which is why you use the same format for both.
%f prints a floating-point value using decimal notation with no exponent, which will give you a very long string of (mostly insignificant) digits for very large values.
%e forces the use of an exponent.
%g uses either %f or %e, depending on the magnitude of the number being printed.
On my system, the above prints the following:
FLT_MAX = 3.40282e+38
DBL_MAX = 1.79769e+308
LDBL_MAX = 1.18973e+4932
As Eric Postpischil points out in a comment, the above prints only approximations of the values. You can print more digits by specifying a precision (the number of digits you'll need depends on the precision of the types); for example, you can replace %g by %.20g.
Or, if your implementation supports it, C99 added the ability to print floating-point values in hexadecimal with as much precision as necessary:
printf("FLT_MAX = %a\n", FLT_MAX);
printf("DBL_MAX = %a\n", DBL_MAX);
printf("LDBL_MAX = %La\n", LDBL_MAX);
But the result is not as easily human-readable as the usual decimal format:
FLT_MAX = 0x1.fffffep+127
DBL_MAX = 0x1.fffffffffffffp+1023
LDBL_MAX = 0xf.fffffffffffffffp+16380
(Note: main() is an obsolescent definition; use int main(void) instead.)
To print approximations of the maximums with enough digits to represent the actual values (the result of converting the printed value back to floating-point should be the original value), you can use:
#include <float.h>
#include <stdio.h>
int main(void)
{
printf("%.*g\n", DECIMAL_DIG, FLT_MAX);
printf("%.*g\n", DECIMAL_DIG, DBL_MAX);
printf("%.*Lg\n", DECIMAL_DIG, LDBL_MAX);
return 0;
}
In C 2011, you can use the more specific FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, and LDBL_DECIMAL_DIG in place of DECIMAL_DIG.
To print the exact values, instead of approximations, you need to specify more precision. (int) (log10(x)+1) digits should be enough.
Approximations of the minimums and the epsilons can be printed with sufficient accuracy in the same way. However, calculating the numbers of digits needed for exact values may be more complicated than for the maximums. (Technically, it may be impossible in exotic C implementations. E.g., a base-three floating-point system would have a minimum not representable in any finite number of decimal digits. I am not aware of any such implementations in use.)
You could use the last three prints in my solution to the exercise 2.1 from The C Programming Language
// float or IEEE754 binary32
printf(
"float: {min: %e, max: %e}, comp: {min: %e, max: %e}\n",
FLT_MIN, FLT_MAX, pow(2,-126), pow(2,127) * (2 - pow(2,-23))
);
// double or IEEE754 binary64
printf(
"double: {min: %e, max: %e}, comp: {min: %e, max: %e}\n",
DBL_MIN, DBL_MAX, pow(2,-1022), pow(2,1023) * (2 - pow(2,-52))
);
// long double, assuming IEEE754 binary128
printf(
"long double: {min: %Le, max: %Le}, comp: {min: %Le, max: %Le}\n",
LDBL_MIN, LDBL_MAX, powl(2,-16382), powl(2,16383) * (2 - powl(2,-112))
);
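The "comp" values are calculated from the IEEE 754 binary32, binary64, and binary128 parameters, so the long double line matches only where long double is binary128 (it is not, for example, on x86 with GCC, where long double is typically the 80-bit x87 extended format). The full solution is available at: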
https://github.com/mat90x/tcpl/blob/master/types_ranges.c
Why, with strtof(), is "3.40282356779733650000e38" unexpectedly converted to infinity even though it is within 0.5 ULP of FLT_MAX?
FLT_MAX (float32) is 0x1.fffffep+127 or about 3.4028234663852885981170e+38.
1/2 ULP above FLT_MAX is 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, so I expected any decimal text below this and the lower FLT_MAX to convert to FLT_MAX when in "round to nearest" mode.
This works as the decimal text increases from FLT_MAX to about 3.40282356779733642700e38, yet for decimal text values slightly above that, like "3.40282356779733650000e38", the conversion result is infinity.
The following code reveals the issue. It gently creeps up a decimal text string, looking for the value at which the conversion changes to infinity.
Your results may differ as not all C implementations use the same floating point.
#include <assert.h>
#include <errno.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
void bar(unsigned n) {
char buf[100];
assert (n < 90);
int len = sprintf(buf, "%.*fe%d", n+1, 0.0, FLT_MAX_10_EXP);
puts(buf);
printf("%-*s %-*s %s\n", len, "string", n+3, "float", "double");
float g = 0;
for (unsigned i = 0; i < n; i++) {
for (int digit = '1'; digit <= '9'; digit++) {
unsigned offset = i ? 1+i : i;
buf[offset]++;
errno = 0;
float f = strtof(buf, 0);
if (errno) {
buf[offset]--;
break;
}
g = f;
}
printf("\"%s\" %.*e %a\n", buf, n + 3, g, atof(buf));
}
double delta = FLT_MAX - nextafterf(FLT_MAX, 0);
double flt_max_ulp_d2 = FLT_MAX + delta/2.0;
printf(" %.*e %a FLT_MAX + 1/2 ULP - 1 dULP\n", n + 3, nextafter(flt_max_ulp_d2,0),nextafter(flt_max_ulp_d2,0));
printf(" %.*e %a FLT_MAX + 1/2 ULP\n", n + 3, flt_max_ulp_d2,flt_max_ulp_d2);
printf(" %.*e %a FLT_MAX\n", n + 3, FLT_MAX, FLT_MAX);
printf(" 1 23456789 123456789 123456789\n");
printf("FLT_ROUNDS %d (0: toward zero, 1: to nearest)\n", FLT_ROUNDS);
}
int main() {
printf("%a %.20e\n", FLT_MAX, FLT_MAX);
printf("%a\n", strtof("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.40282356779733650000e38", 0));
printf("%a\n", strtod("3.4028235677973366163754e+3", 0));
bar(19);
}
Output
0x1.fffffep+127 3.40282346638528859812e+38
inf
0x1.ffffffp+127
0x1.a95a5aaada733p+11
0.00000000000000000000e38
string float double
"3.00000000000000000000e38" 3.0000000054977557577780e+38 0x1.c363cbf21f28ap+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40000000000000000000e38" 3.3999999521443642490773e+38 0x1.ff933c78cdfadp+127
"3.40200000000000000000e38" 3.4020000005553803402978e+38 0x1.ffe045fe9918p+127
"3.40280000000000000000e38" 3.4027999387901483621794e+38 0x1.ffff169a83f08p+127
"3.40282000000000000000e38" 3.4028200183756559773331e+38 0x1.ffffdbd19d02cp+127
"3.40282300000000000000e38" 3.4028230607370965250836e+38 0x1.fffff966ad924p+127
"3.40282350000000000000e38" 3.4028234663852885981170e+38 0x1.fffffe54daff8p+127
"3.40282356000000000000e38" 3.4028234663852885981170e+38 0x1.fffffeec5116ep+127
"3.40282356700000000000e38" 3.4028234663852885981170e+38 0x1.fffffefdfcbbcp+127
"3.40282356770000000000e38" 3.4028234663852885981170e+38 0x1.fffffeffc119p+127
"3.40282356779000000000e38" 3.4028234663852885981170e+38 0x1.fffffefffb424p+127
"3.40282356779700000000e38" 3.4028234663852885981170e+38 0x1.fffffeffffc85p+127
"3.40282356779730000000e38" 3.4028234663852885981170e+38 0x1.fffffefffff9fp+127
"3.40282356779733000000e38" 3.4028234663852885981170e+38 0x1.fffffefffffeep+127
"3.40282356779733600000e38" 3.4028234663852885981170e+38 0x1.fffffeffffffep+127
"3.40282356779733640000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127 <-- Actual
"3.40282356779733660000e38" 3.4028234663852885981170e+38 ... <-- Expected
"3.40282356779733642000e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
"3.40282356779733642700e38" 3.4028234663852885981170e+38 0x1.fffffefffffffp+127
3.4028235677973362385861e+38 0x1.fffffefffffffp+127 FLT_MAX + 1/2 ULP - 1 dULP
3.4028235677973366163754e+38 0x1.ffffffp+127 FLT_MAX + 1/2 ULP
3.4028234663852885981170e+38 0x1.fffffep+127 FLT_MAX
1 23456789 123456789 123456789
FLT_ROUNDS 1 (0: toward zero, 1: to nearest)
Notes: GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
[Edit]
The exact value of FLT_MAX + 1/2 ULP:
0x1.ffffffp+127 340282356779733661637539395458142568448.0
I stumbled on this problem today when trying to determine the maximum decimal text passed to strtof() that returned a finite float.
This is a Can I answer my own question? answer. Other answers are welcomed.
Why, with strtof(), is "3.40282356779733650000e38" unexpectedly converted to infinity even though it is within 0.5 ULP of FLT_MAX?
Certainly double rounding.
"Double" here refers to doing something twice, not the type double.
Call the value 1/2 of a float ULP above FLT_MAX, 0x1.ffffffp+127 or about 3.4028235677973366163754e+38, the threshold.
About 3.40282356779733642748e38 is one half of a double ULP below threshold. Apparently, values above that, like "3.40282356779733650000e38", are first rounded as a double to threshold. threshold, as a float, is half-way between FLT_MAX and the next larger float (if the encoding were extended). Being a half-way tie, it rounds to the "even" value, which is the larger one in this case. Since the next larger float is beyond the max encodable finite value, the result is infinity.
Conclusions
A better strtof() would correctly handle this corner case.
Otherwise, it is reasonable to consider decimal digits past FLT_DECIMAL_DIG + 3 (see following) in strtof() input as noise.
For an alternative strtof() implementation, IEEE 754 allows such decimal text conversions to treat all decimal digits past a certain significance as zero, thus allowing conversion to the 2nd closest float when the text is near the half-way point between 2 floats. With common float, that significance is FLT_DECIMAL_DIG + 3, or 12 decimal digits. That allowance is not what is used here, as a digit in the 19th place affects the result.
Float max/min is
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368
Compiling to assembly, I see the literal is 0xffefffffffffffff. I am unable to understand how to write it in a float literal form. I tried -0xFFFFFFFFFFFFFp972, which resulted in 0xFFEFFFFFFFFFFFFE. Notice the last digit is E instead of F. I have no idea why the last bit is wrong or why 972 gave me the closest number. I didn't understand what I should be doing with the exponent bias either. I used 13 F's because that would set 52 bits (the number of bits in the mantissa), but I'm clueless about everything else.
I want to be able to write double min/max as a literal and understand it well enough that I can parse it into an 8-byte hex value.
How do I write float max as a float literal?
Use FLT_MAX. If making your own code, use exponential notation either as hex (preferred) or decimal. If in decimal, use FLT_DECIMAL_DIG significant digits. Any more is not informative. Append an f.
#include <float.h>
#include <stdio.h>
int main(void) {
printf("%a\n", FLT_MAX);
printf("%.*g\n", FLT_DECIMAL_DIG, FLT_MAX);
float m0 = FLT_MAX;
float m1 = 0x1.fffffep+127f;
float m2 = 3.40282347e+38f;
printf("%d %d\n", m1 == m0, m2 == m0);
}
Sample output
0x1.fffffep+127
3.40282347e+38
1 1
Likewise for double, but without the f suffix.
printf("%a\n", DBL_MAX);
printf("%.*g\n", DBL_DECIMAL_DIG, DBL_MAX);
0x1.fffffffffffffp+1023
1.7976931348623157e+308
double m0 = DBL_MAX;
double m1 = 0x1.fffffffffffffp+1023;
double m2 = 1.7976931348623157e+308;
Rare machines will have different max values.
Suppose I have a floating-point value of type float or double (i.e. 32 or 64 bits on typical machines). I want to print this value as text (e.g. to the standard output stream), and then later, in some other process, scan it back in, with fscanf() if I'm using C, or perhaps with istream::operator>>() if I'm using C++. But I need the scanned float to end up being exactly identical to the original value (up to equivalent representations of the same value). Also, the printed value should be easily readable, to a human, as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
How should I do this? I'm assuming the standard library of (C and C++) won't be sufficient, but perhaps I'm wrong. I suppose that a sufficient number of decimal digits might be able to guarantee an error that's underneath the precision threshold - but that's not the same as guaranteeing the rounding/truncation will happen just the way I want it.
Notes:
The scanning does not have to be perfectly accurate w.r.t. the value it scans, only the original value.
If it makes it easier, you may assume the value is a number and is not infinity.
denormal support is desired but not required; still if we get a denormal, failure should be conspicuous.
First, you should use the %a format with fprintf and fscanf. This is what it was designed for, and the C standard requires it to work (reproduce the original number) if the implementation uses binary floating-point.
Failing that, you should print a float with at least FLT_DECIMAL_DIG significant digits and a double with at least DBL_DECIMAL_DIG significant digits. Those constants are defined in <float.h> and are defined as:
… number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,… [b is the base used for the floating-point format, defined in FLT_RADIX, and p is the number of base-b digits in the format.]
For example:
printf("%.*g\n", FLT_DECIMAL_DIG, 1.f/3);
or:
#define QuoteHelper(x) #x
#define Quote(x) QuoteHelper(x)
…
printf("%." Quote(FLT_DECIMAL_DIG) "g\n", 1.f/3);
In C++, these constants are defined in <limits> as std::numeric_limits<Type>::max_digits10, where Type is float or double or another floating-point type.
Note that the C standard only recommends that such a round-trip through a decimal numeral work; it does not require it. For example, C 2018 5.2.4.2.2 15 says, under the heading “Recommended practice”:
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back should be the identity function. [DECIMAL_DIG is the equivalent of FLT_DECIMAL_DIG or DBL_DECIMAL_DIG for the widest floating-point format supported in the implementation.]
In contrast, if you use %a, and FLT_RADIX is a power of two (meaning the implementation uses a floating-point base that is two, 16, or another power of two), then the C standard requires that the result of scanning the numeral produced with %a equals the original number.
I need the scanned float to end up being exactly, identical to the original value.
As already pointed out in the other answers, that can be achieved with the %a format specifier.
Also, the printed value should be easily readable - to a human - as floating-point, i.e. I don't want to print 0x42355316 and reinterpret that as a 32-bit float.
That's more tricky and subjective. The first part of the string that %a produces is in fact a fraction composed by hexadecimal digits, so that an output like 0x1.4p+3 may take some time to be parsed as 10 by a human reader.
An option could be to print all the decimal digits needed to represent the floating-point value, but there may be a lot of them. Consider, for example, the value 0.1; its closest representation as a 64-bit float may be
0x1.999999999999ap-4 == 0.1000000000000000055511151231257827021181583404541015625
While printf("%.*lf\n", DBL_DECIMAL_DIG, 0.1); (see e.g. Eric's answer) would print
0.10000000000000001 // If DBL_DECIMAL_DIG == 17
My proposal is somewhere in the middle. Similarly to what %a does, we can exactly represent any floating-point value with radix 2 as a fraction multiplied by 2 raised to some integer power. We can transform that fraction into a whole number (increasing the exponent accordingly) and print it as a decimal value.
0x1.999999999999ap-4 --> 1.999999999999a_16 * 2^-4 --> 1999999999999a_16 * 2^-56
--> 7205759403792794_10 * 2^-56
That whole number has a limited number of digits (it's < 2^53), but the result is still an exact representation of the original double value.
The following snippet is a proof of concept, without any check for corner cases. The format specifier %a separates the mantissa and the exponent with a p character (as in "... multiplied by two raised to the Power of..."), I'll use a q instead, for no particular reason other than using a different symbol.
The value of the mantissa will also be reduced (and the exponent raised accordingly), removing all the trailing zero bits. The idea being that 5q+1 (parsed as 5_10 * 2^1) should be more "easily" identified as 10 than 2814749767106560q-48.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
void to_my_format(double x, char *str)
{
int exponent;
double mantissa = frexp(x, &exponent);
long long m = 0;
if ( mantissa ) {
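// frexp() returns a fraction in [0.5, 1) carrying up to 53 significant
// bits, so scale by 2^53 to make it a whole number without losing the last bit
exponent -= 53;
m = (long long)scalbn(mantissa, 53);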
// A reduced mantissa should be more readable
while (m && m % 2 == 0) {
++exponent;
m /= 2;
}
}
sprintf(str, "%lldq%+d", m, exponent);
// ^
// Here 'q' is used to separate the mantissa from the exponent
}
double from_my_format(char const *str)
{
char *end;
long long mantissa = strtoll(str, &end, 10);
long exponent = strtol(str + (end - str + 1), &end, 10);
return scalbn(mantissa, exponent);
}
int main(void)
{
double tests[] = { 1, 0.5, 2, 10, -256, acos(-1), 1000000, 0.1, 0.125 };
size_t n = (sizeof tests) / (sizeof *tests);
char num[32];
for ( size_t i = 0; i < n; ++i ) {
to_my_format(tests[i], num);
double x = from_my_format(num);
printf("%22s%22a ", num, tests[i]);
if ( tests[i] != x )
printf(" *** %22a *** Round-trip failed\n", x);
else
printf("%58.55g\n", x);
}
return 0;
}
Testable here.
Generally, the improvement in readability is admittedly little to none, and surely a matter of opinion.
You can use the %a format specifier to print the value as hexadecimal floating point. Note that this is not the same as reinterpreting the float as an integer and printing the integer value.
For example:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
float x;
scanf("%f", &x);
printf("x=%.7f\n", x);
char str[20];
sprintf(str, "%a", x);
printf("str=%s\n", str);
float y;
sscanf(str, "%f", &y);
printf("y=%.7f\n", y);
printf("x==y: %d\n", (x == y));
return 0;
}
With an input of 4, this outputs:
x=4.0000000
str=0x1p+2
y=4.0000000
x==y: 1
With an input of 3.3, this outputs:
x=3.3000000
str=0x1.a66666p+1
y=3.3000000
x==y: 1
As you can see from the output, the %a format specifier prints in exponential format with the significand in hex and the exponent in decimal. This format can then be converted directly back to the exact same value as demonstrated by the equality check.
I am solving one of C Primer Plus exercises dealing with float underflow. The task is to simulate it. I did it this way:
#include<stdio.h>
#include<float.h>
int main(void)
{
// print min value for a positive float retaining full precision
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision:",FLT_MIN);
// print min value for a positive float retaining full precision divided by two
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision divided by two:",FLT_MIN/2.0);
// print min value for a positive float retaining full precision divided by four
printf("%s\n %.150f\n", "Minimum positive float value retaining full precision divided by four:",FLT_MIN/4.0);
return 0;
}
The result is
Minimum positive float value retaining full precision: 0.000000000000000000000000000000000000011754943508222875079687365372222456778186655567720875215087517062784172594547271728515625000000000000000000000000
Minimum positive float value retaining full precision divided by two: 0.000000000000000000000000000000000000005877471754111437539843682686111228389093327783860437607543758531392086297273635864257812500000000000000000000000
Minimum positive float value retaining full precision divided by four: 0.000000000000000000000000000000000000002938735877055718769921841343055614194546663891930218803771879265696043148636817932128906250000000000000000000000
I expected less precision for min float value divide by two and four but it seems the precision is ok and there is no underflow situation. How is it possible? Did I miss something?
Thank you very much
This is an incorrect method of assessing precision, as the code simply divides FLT_MIN (certainly a power of 2) by 2. A power of 2 has a single significant bit, so halving it loses nothing until the value reaches zero.
Instead, start with a number that is just above a power of 2, so its binary significand is something like 1.000...(maybe a total of 24 binary digits)...0001. Ensure the values printed are originally float. (FLT_MIN/2.0 is a double.)
Notice below that precision is lost when the numbers become less than FLT_MIN, the minimum normalized positive floating-point number.
Also consider FLT_TRUE_MIN: minimum positive floating-point number. See binary32
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void) {
char *format = "%.10e %a\n";
printf(format, FLT_MIN, FLT_MIN);
printf(format, FLT_TRUE_MIN, FLT_TRUE_MIN);
float f = nextafterf(1.0f, 2.0f);
do {
f /= 2;
printf(format, f, f); // print in decimal and hex for detail
} while (f);
return 0;
}
Output
1.1754943508e-38 0x1p-126
1.4012984643e-45 0x1p-149
5.0000005960e-01 0x1.000002p-1
2.5000002980e-01 0x1.000002p-2
1.2500001490e-01 0x1.000002p-3
...
2.3509889819e-38 0x1.000002p-125
1.1754944910e-38 0x1.000002p-126
5.8774717541e-39 0x1p-127 // lost least significant bit of precision
2.9387358771e-39 0x1p-128
...
2.8025969286e-45 0x1p-148
1.4012984643e-45 0x1p-149
0.0000000000e+00 0x0p+0
The following code gives some odd results:
#include <stdio.h>
#include <float.h>
int main()
{
float t = 1.0;
float res;
float myFltMax = 340282346638528859.0;
printf("FLT_MAX %f\n", FLT_MAX);
res = FLT_MAX - t;
printf("res %f\n", res);
res = myFltMax - t;
printf("res myFltMax %f\n", res);
return 1;
}
The results are:
FLT_MAX 340282346638528859811704183484516925440.000000
res 340282346638528859811704183484516925440.000000
res myFltMax 340282356122255360.000000
So, if I subtract 1 from FLT_MAX, the result is the same, and if I subtract 1 from another big float, the result is greater than the initial number.
I am using gcc version 4.7.2.
Thank you.
If you subtract 1 from myFltMax, you don't get a result greater than the initial number. You get the same number. Print myFltMax as well and you'll see that it's 340282356122255360 and not 340282346638528859.
Proof.
Basically, the compiler rounds your 340282346638528859 to the nearest value that can be represented in the floating point type and that happens to be 340282356122255360.