While experimenting with rounding modes, I found that FE_DOWNWARD (and FE_TOWARDZERO) apparently fail to form the expected sum of infinity and instead form DBL_MAX when adding DBL_MAX and 1 ULP of DBL_MAX.
Below is test code that demonstrates the unexpected sum. Under each rounding mode, it adds values near 0.5 ULP and 1.0 ULP to DBL_MAX. No problems were noted in FE_TONEAREST and FE_UPWARD.
Questions:
Do you agree it is an error?
Does the code form the correct answer on another machine?
This sadly follows another near DBL_MAX problem recently reported, so perhaps my math library is sub-par. Advice on how to report is requested.
Compiler notes:
Invoking: Cygwin C Compiler
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i.o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
Target: x86_64-pc-cygwin
gcc version 11.3.0 (GCC)
GNU C11 (GCC) version 11.3.0 (x86_64-pc-cygwin)
compiled by GNU C version 11.3.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.25-GMP
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2)/4;
int P = (LDBL_MANT_DIG + 2)/4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf("mode: Addendum SUM (double) SUM (long double)\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
double sum = max + delta[d];
printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
}
puts("");
}
}
Output: note 4 unexpected lines.
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double)
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024
--> inf expected <--
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024
Checking for and dealing with overflow happens in two stages. Note that the actual software/circuitry may use a different strategy, but the result will be as if this were the procedure.
The infinite-precision result of the computation is rounded, with rules based on the current rounding mode, but with no limits on the exponent. During this stage there's no such thing as "infinity", but the numbers can get as big as they need to get.
If the result is outside the representable range, it is "corrected" to be within the representable range, in a manner based on the rounding mode. For FE_TONEAREST (the "normal" mode) the number is corrected to infinity. For FE_TOWARDZERO, however, it's corrected to +/-DBL_MAX. (For the other rounding modes it depends on sign: rounding away from zero leads to infinity and rounding toward zero leads to +/-DBL_MAX.)
The overflow rules for a given mode are thus reminiscent of the rounding rules for that mode, but not really the same. Arguably FE_TONEAREST is the weird one, since it acts like the (IEEE-754-only) ties-away-from-zero rounding mode under all out-of-range situations, rather than noticing that infinity is way farther away than DBL_MAX. But the basic behavior for the other modes is to output +/-DBL_MAX when the mode's preferred direction (given the sign) is toward zero, and infinity when its preferred direction is away from zero.
Also note that when the result of stage 1 is a value outside the representable range, an overflow exception will still be emitted even when the result of stage 2 is DBL_MAX. The overflow doesn't indicate "I made an infinity", but "I had to do stage 2".
@Sneftel's idea to use a two-stage rounding model was useful and explains well why my expected result was amiss and the code was right.
Below, for reference, is augmented code that reports the overflow flag and helps illustrate the model's application.
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)
const char* exceptstr(fexcept_t *flag) {
static char buf[100];
buf[0] = 0;
if (*flag & FE_DIVBYZERO)
strcat(buf, STRINGIFY(FE_DIVBYZERO));
if (*flag & FE_INEXACT)
strcat(buf, STRINGIFY(FE_INEXACT));
if (*flag & FE_INVALID)
strcat(buf, STRINGIFY(FE_INVALID));
if (*flag & FE_OVERFLOW)
strcat(buf, STRINGIFY(FE_OVERFLOW));
if (*flag & FE_UNDERFLOW)
strcat(buf, STRINGIFY(FE_UNDERFLOW));
return buf;
}
int main() {
const double max = DBL_MAX;
const double max_ulp = max - nextafter(max, 0);
int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
int n = sizeof mode / sizeof mode[0];
int p = (DBL_MANT_DIG + 2) / 4;
int P = (LDBL_MANT_DIG + 2) / 4;
printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);
printf("LDBL_MAX :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG,
LDBL_MAX);
printf("DBL_MAX :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
printf("\n");
printf(
"mode: Addendum SUM (double) SUM (long double) FE\n");
for (int i = 0; i < n; i++) {
if (fesetround(mode[i])) {
perror("Invalid mode");
return -1;
}
double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
max_ulp, nextafter(max_ulp, INFINITY)};
const char *deltas[] = {"0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
"ulp-", "ulp", "ulp+"};
int dn = sizeof delta / sizeof delta[0];
for (int d = 0; d < dn; d++) {
if (feclearexcept(FE_ALL_EXCEPT)) {
perror("feclearexcept()");
return -1;
}
///////////////////////////////
double sum = max + delta[d];
///////////////////////////////
fexcept_t flag;
if (fegetexceptflag(&flag, FE_ALL_EXCEPT)) {
perror("fegetexceptflag()");
return -1;
}
printf(
"mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La %s\n", //
modes[i], deltas[d], p, delta[d], p, sum, P, 1.0L * max + delta[d],
exceptstr(&flag));
}
puts("");
}
}
Output (Look to the far right)
FLT_EVAL_METHOD:0
LDBL_MAX :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX :0x1.fffffffffffffp+1023 1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971 1.9958403095347198e+292
mode: Addendum SUM (double) SUM (long double) FE
mode:FE_DOWNWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_DOWNWARD ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_DOWNWARD ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_DOWNWARD ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TONEAREST 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TONEAREST ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO 0.5 ulp-:0x1.fffffffffffffp+969 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff7fep+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp :0x1.0000000000000p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO 0.5 ulp+:0x1.0000000000001p+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffff800p+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp- :0x1.fffffffffffffp+970 sum:0x1.fffffffffffffp+1023 0x1.fffffffffffffffep+1023 FE_INEXACT
mode:FE_TOWARDZERO ulp :0x1.0000000000000p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_TOWARDZERO ulp+ :0x1.0000000000001p+971 sum:0x1.fffffffffffffp+1023 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp-:0x1.fffffffffffffp+969 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp :0x1.0000000000000p+970 sum:inf 0x1.fffffffffffff800p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD 0.5 ulp+:0x1.0000000000001p+970 sum:inf 0x1.fffffffffffff802p+1023 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp- :0x1.fffffffffffffp+970 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp :0x1.0000000000000p+971 sum:inf 0x1.0000000000000000p+1024 FE_INEXACTFE_OVERFLOW
mode:FE_UPWARD ulp+ :0x1.0000000000001p+971 sum:inf 0x1.0000000000000002p+1024 FE_INEXACTFE_OVERFLOW
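A side note on the flag printing: the names in the last column run together (FE_INEXACTFE_OVERFLOW), and testing bits of an fexcept_t directly is not portable because the C standard leaves the representation of that type unspecified. A hedged alternative is to take the int returned by fetestexcept(FE_ALL_EXCEPT) and format it with a separator. The function name except_names and the "|" separator below are my own choices for this sketch, not part of the code above.

```c
#include <fenv.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the names of the exceptions set in
 * `excepts` (a value such as fetestexcept(FE_ALL_EXCEPT)) into buf,
 * separated by '|'. Returns buf; output is truncated if buf is small. */
const char *except_names(int excepts, char *buf, size_t size) {
  static const struct { int bit; const char *name; } table[] = {
    {FE_DIVBYZERO, "FE_DIVBYZERO"}, {FE_INEXACT, "FE_INEXACT"},
    {FE_INVALID, "FE_INVALID"},     {FE_OVERFLOW, "FE_OVERFLOW"},
    {FE_UNDERFLOW, "FE_UNDERFLOW"},
  };
  size_t used = 0;

  if (size == 0) return buf;
  buf[0] = '\0';
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
    if (excepts & table[i].bit) {
      int n = snprintf(buf + used, size - used, "%s%s",
                       used ? "|" : "", table[i].name);
      if (n < 0 || (size_t)n >= size - used) break; /* no room left */
      used += (size_t)n;
    }
  }
  return buf;
}
```

In the loop above, one would call fetestexcept(FE_ALL_EXCEPT) right after the addition and pass the result to except_names together with a local buffer.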
Related
Compiling and running this code:
// List the range of the long double.
#include <stdio.h>
#include <float.h>
int main() {
printf("Long double: %Lg - %Lg\n", LDBL_MIN, LDBL_MAX);
}
Gives this result:
Long double: 3.3621e-4932 - 1.18973e+4932
Yet this code:
#include <stdio.h>
#include <float.h>
int main() {
long double ld = 1.18973e+4932;
printf("Longest double: %Lg", ld);
}
Gives this warning when compiled:
gcc -std=gnu99 -o fj -Wall -Wno-format-overflow -g r2.c -lm
r2.c:4:3: warning: floating constant exceeds range of ‘double’ [-Woverflow]
4 | long double ld = 1.18973e+4932;
| ^~~~
However, if you compile:
#include <stdio.h>
#include <float.h>
int main() {
long double ld = LDBL_MAX;
printf("Longest double: %Lg\n", ld);
}
It compiles and runs:
Longest double: 1.18973e+4932
What's going on here? It should accept the numerical limit that was listed in the first program, but it does just fine with the LDBL_MAX version of it.
My compiler:
gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
My computer:
AMD Ryzen 5 5600G with Radeon Graphics
CPU MHz: 3057.560
BogoMIPS: 7785.19
CPU cache size: 512 KB
Debian GNU/Linux 11 (bullseye)
1.18973e+4932 is an out-of-range double constant.
LDBL_MAX is an in range long double constant.
Make the floating point constant a long double by appending an L.
Lower case l is an option too, yet harder to distinguish from 1.
// long double ld = 1.18973e+4932;
long double ld = 1.18973e+4932L; // Yet this is not quite the max
// max value better with adequate precision as
long double ld = 1.18973149535723176502e+4932L;
// Should print the same.
printf("Longest double: %.*Lg\n", LDBL_DECIMAL_DIG, ld);
printf("Longest double: %.*Lg\n", LDBL_DECIMAL_DIG, LDBL_MAX);
When coding near the limits, consider hex notation for better control in rounding issues.
long double ld = 1.1897315e+4932L; // --> Infinity
long double ld = 1.18973149535723177e+4932L; // --> Infinity
long double ld = 1.18973149535723176506e+4932L; // --> Infinity
// 1.18973149535723176502e+4932L
long double ld = 0x1.fffffffffffffffep+16383L;
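As a quick check of the "adequate precision" point, the sketch below prints a long double with LDBL_DECIMAL_DIG significant digits and parses the text back with strtold. The helper name round_trips is invented for this example, and it assumes a C11 library whose printf and strtold round correctly (glibc does; older Windows runtimes may not).

```c
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if printing v with LDBL_DECIMAL_DIG
 * significant digits and parsing the text back yields v again. */
int round_trips(long double v) {
  char buf[128];
  snprintf(buf, sizeof buf, "%.*Lg", LDBL_DECIMAL_DIG, v);
  return strtold(buf, NULL) == v;
}
```

With fewer digits, such as the 6 that a plain %Lg prints, the text 1.18973e+4932 no longer identifies LDBL_MAX, which is why the short constant above is "not quite the max".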
I'm using Windows 10 and the MinGW GCC compiler (not the mingw64 one), but when I try the code below, the long double value prints incorrectly.
This is my GCC version,
C:\Users\94768>gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=c:/mingw/bin/../libexec/gcc/mingw32/6.3.0/lto-wrapper.exe
Target: mingw32
Configured with: ../src/gcc-6.3.0/configure --build=x86_64-pc-linux-gnu --host=mingw32 --target=mingw32 --with-gmp=/mingw --with-mpfr --with-mpc=/mingw --with-isl=/mingw --prefix=/mingw --disable-win32-registry --with-arch=i586 --with-tune=generic --enable-languages=c,c++,objc,obj-c++,fortran,ada --with-pkgversion='MinGW.org GCC-6.3.0-1' --enable-static --enable-shared --enable-threads --with-dwarf2 --disable-sjlj-exceptions --enable-version-specific-runtime-libs --with-libiconv-prefix=/mingw --with-libintl-prefix=/mingw --enable-libstdcxx-debug --enable-libgomp --disable-libvtv --enable-nls
Thread model: win32
gcc version 6.3.0 (MinGW.org GCC-6.3.0-1)
The code is,
#include <stdio.h>
void main() {
float a = 1.12345;
double b = 1.12345;
long double c = 1.12345;
printf("float value is is %f\n", a);
printf("double value is %lf\n", b);
printf("long double value is %Lf\n", c);
}
I got this output, which is not what I expected. I don't understand what the issue is; I am an absolute beginner to C programming.
float value is 1.123450
double value is 1.123450
long double value is -0.000000
Immensely appreciate your guidance!
GCC 6.3? That's ancient!
Latest GCC release at the time of this writing is 11.1.
I recommend you move away from MinGW and switch to MinGW-w64 as it is much more up to date (and supports both 32-bit and 64-bit Windows).
You can get a recent standalone MinGW-w64 build from https://winlibs.com/
Built with that compiler the output of your code is:
float value is is 1.123450
double value is 1.123450
long double value is 1.123450
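If switching toolchains is not an option, one workaround is to avoid passing a long double to printf at all. The usual explanation for the bad output, which I am stating as an assumption since nothing in this thread confirms it, is that the old MinGW printf comes from Microsoft's C runtime, where long double has the same format as double, while GCC passes an 80-bit value for %Lf. A minimal sketch of the workaround, with a made-up helper name:

```c
#include <stdio.h>

/* Hypothetical helper: formats a long double by converting it to
 * double first, so only the plain %f conversion is needed.
 * Any precision beyond double is lost. Returns snprintf's result. */
int format_via_double(char *buf, size_t size, long double value) {
  return snprintf(buf, size, "%f", (double)value);
}
```

For values like 1.12345 the lost precision does not show at six decimal places, so this is enough for simple printing.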
I ran a program like this:
#include <stdio.h>
int main(void)
{
float aboat = 32000.0;
double abet = 2.14e9;
long double dip =5.32e-5;
printf("%f can be written %e\n",aboat,aboat);
printf("And it's %a in hexadecimal, powers of 2 notation\n",aboat);
printf("%f can be written %e\n",abet, abet);
printf("%Lf can be written %Le\n",dip,dip);
// this statement does not print correctly
return 0;
}
but it prints this:
32000.000000 can be written 3.200000e+004
And it's 0x1.f40000p+14 in hexadecimal, powers of 2 notation
2140000000.000000 can be written 2.140000e+009
-1950228512509697500000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000.0
00000 can be written 272.500183
As you can see, this result must be incorrect, but I cannot find the error. Is it a bug attributable to the IDE? I use JetBrains CLion 2017.3.4.
Thanks for the reminder. This is my CMakeLists.txt:
cmake_minimum_required(VERSION 3.9)
project(untitled1 C)
set(CMAKE_C_STANDARD 11)
add_executable(untitled1 main.c)
And the gcc version is 6.3.0
The following program (adapted from here) is giving inconsistent results when compiled with GCC (4.8.2) and Clang (3.5.1). In particular, the GCC result does not change even when FLT_EVAL_METHOD does.
#include <stdio.h>
#include <float.h>
int r1;
double ten = 10.0;
int main(int c, char **v) {
printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
r1 = 0.1 == (1.0 / ten);
printf("0.1 = %a, 1.0/ten = %a\n", 0.1, 1.0 / ten);
printf("r1=%d\n", r1);
}
Tests:
$ gcc -std=c99 t.c && ./a.out
FLT_EVAL_METHOD = 0
0.1 = 0x1.999999999999ap-4, 1.0/ten = 0x1.999999999999ap-4
r1=1
$ gcc -std=c99 -mfpmath=387 t.c && ./a.out
FLT_EVAL_METHOD = 2
0.1 = 0x0.0000000000001p-1022, 1.0/ten = 0x0p+0
r1=1
$ clang -std=c99 t.c && ./a.out
FLT_EVAL_METHOD = 0
0.1 = 0x1.999999999999ap-4, 1.0/ten = 0x1.999999999999ap-4
r1=1
$ clang -std=c99 -mfpmath=387 -mno-sse t.c && ./a.out
FLT_EVAL_METHOD = 2
0.1 = 0x0.07fff00000001p-1022, 1.0/ten = 0x0p+0
r1=0
Note that, according to this blog post, GCC 4.4.3 used to output 0 instead of 1 in the second test.
A possibly related question indicates that a bug has been corrected in GCC 4.6, which might explain why GCC's result is different.
I would like to confirm if any of these results would be incorrect, or if some subtle evaluation steps (e.g. a new preprocessor optimization) would justify the difference between these compilers.
This answer is about something that you should resolve before you go further, because it is going to make reasoning about what happens much harder otherwise:
Surely printing 0.1 = 0x0.07fff00000001p-1022 or 0.1 = 0x0.0000000000001p-1022 can only be a bug on your compilation platform caused by ABI mismatch when using -mfpmath=387. None of these values can be excused by excess precision.
You could try to include your own conversion-to-readable-format in the test file, so that that conversion is also compiled with -mfpmath=387. Or make a small stub in another file, not compiled with that option, with a minimalistic call convention:
In other file:
#include <stdio.h>

double d;
void print_double(void)
{
printf("%a", d);
}
In the file compiled with -mfpmath=387:
extern double d;
void print_double(void);

int main(void) {
d = 0.1;
print_double();
}
Ignoring the printf problem which Pascal Cuoq addressed, I think GCC is correct here: according to the C99 standard, FLT_EVAL_METHOD == 2 should
evaluate all operations and constants to the range and precision of the long double type.
So, in this case, both 0.1 and 1.0 / ten are being evaluated to an extended precision approximation of 1/10.
I'm not sure what Clang is doing, though this question might provide some help.
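If the goal is a comparison at plain double precision regardless of FLT_EVAL_METHOD, one option is to store both operands into double objects first, since C99 says assignments and casts discard extra range and precision. The sketch below does that; the function name is invented here, and volatile is used only to force real stores on compilers that might otherwise keep the values in extended-precision registers.

```c
/* Hypothetical helper: compares a and b after forcing both through
 * double objects, so any excess precision is dropped first. */
int equal_as_double(double a, double b) {
  volatile double x = a; /* store strips extra precision */
  volatile double y = b;
  return x == y;
}
```

With this, equal_as_double(0.1, 1.0 / ten) should return 1 under both evaluation methods shown in the question, because both sides end up as the double nearest to 1/10.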
How to convert each of the following mathematical expressions to its equivalent statement in C?
1 / (x^2 + y^2)
square root of (b^2 - 4ac)
1.0 / (pow(x,2) + pow(y,2))
sqrt(pow(b,2) - 4*a*c)
See pow() and sqrt() functions manual.
You can also write x*x instead of pow(x, 2). With optimization enabled, GCC typically produces the same result and the same code for both, because it treats pow as a built-in and knows how to simplify it.
(For commenters)
GCC outputs the exact same assembler code for both of these functions:
#include <math.h>
double pow2_a(double x) {
return pow(x, 2);
}
double pow2_b(double x) {
return x * x;
}
Assembler:
fldl 4(%esp)
fmul %st(0), %st
ret