-m32 and -m64 compiler options produce different output - C

Sample Code
#include "stdio.h"
#include <stdint.h>
int main()
{
double d1 = 210.01;
uint32_t m = 1000;
uint32_t v1 = (uint32_t) (d1 * m);
printf("%d",v1);
return 0;
}
Output
1. When compiling with the -m32 option (i.e. gcc -g3 -m32 test.c)
/test 174 # ./a.out
210009
2. When compiling with the -m64 option (i.e. gcc -g3 -m64 test.c)
test 176 # ./a.out
210010
Why do I get a difference?
My understanding "was", m would be promoted to double and multiplication would be cast downward to unit32_t. Moreover, since we are using stdint type integer, we would be further removing ambiguity related to architecture etc etc.
I know something is fishy here, but not able to pin it down.
Update:
Just to clarify (for one of the comments), the above behavior is seen for both gcc and g++.

I can confirm the results on my gcc (Ubuntu 5.2.1-22ubuntu2). What seems to happen is that the unoptimized 32-bit code uses the 387 FPU with the FMUL instruction, whereas the 64-bit code uses the SSE MULSD instruction (just run gcc -S test.c with the different options and compare the assembler output). And as is well known, the 387 FPU registers hold 80-bit extended-precision values with a 64-bit significand, more than the 53 bits of a double, so it rounds differently here. The reason, of course, is that the exact value of the 64-bit IEEE double 210.01 is not that, but
210.009999999999990905052982270717620849609375
and when you multiply by 1000, you are not actually just shifting the decimal point - after all, there is no decimal point but a binary point in the floating-point value - so the result must be rounded. In 64-bit double arithmetic it is rounded up, to exactly 210010. In the 80-bit 387 FPU registers the calculation is more precise, the product stays just below 210010, and the conversion to uint32_t truncates it down to 210009.
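A minimal way to see both ingredients (a sketch of mine, assuming a typical -m64 build where FLT_EVAL_METHOD is 0, i.e. plain SSE double arithmetic) is to print the values with more digits than the default:
#include <stdio.h>

int main(void)
{
    double d1 = 210.01;
    /* 210.01 is not exactly representable as a double. */
    printf("d1        = %.30f\n", d1);
    /* Rounded directly to double, the product lands exactly on 210010.0,
       so the later truncation to uint32_t yields 210010. */
    printf("d1 * 1000 = %.30f\n", d1 * 1000.0);
    return 0;
}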
After reading about this a bit more, I believe the result generated by gcc on the 32-bit arch is not standard conforming. Thus if you force the standard to C99 or C11 with -std=c99 or -std=c11, you will get the correct result:
% gcc -m32 -std=c11 test.c; ./a.out
210010
If you do not want to force C99 or C11 standard, you could also use the -fexcess-precision=standard switch.
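For example, with the same test.c as above, the following invocation should also print 210010:
% gcc -m32 -fexcess-precision=standard test.c; ./a.out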
However, the fun does not stop here.
% gcc -m32 test.c; ./a.out
210009
% gcc -m32 -O3 test.c; ./a.out
210010
So you get the "correct" result if you compile with -O3; this is of course because the compiler constant-folds the calculation at compile time, using 64-bit double arithmetic rather than the 80-bit x87 unit.
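One way to convince yourself that it really is constant folding that changes the answer (a sketch of mine, not from the original post) is to hide the operand from the compiler, for example with volatile; the multiplication is then forced to happen at run time on the x87, so -m32 -O3 should again print 210009:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    volatile double d1 = 210.01;  /* volatile defeats compile-time folding */
    uint32_t m = 1000;
    uint32_t v1 = (uint32_t)(d1 * m);
    printf("%u\n", v1);
    return 0;
}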
To confirm that extra precision affects it, you can use a long double:
#include "stdio.h"
#include <stdint.h>
int main()
{
long double d1 = 210.01; // double constant to long double!
uint32_t m = 1000;
uint32_t v1 = (uint32_t) (d1 * m);
printf("%d",v1);
return 0;
}
Now even -m64 rounds it to 210009.
% gcc -m64 test.c; ./a.out
210009

Related

Why doesn't the same generated assembler code lead to the same output?

Sample code (t0.c):
#include <stdio.h>

float f(float a, float b, float c) __attribute__((noinline));

float f(float a, float b, float c)
{
    return a * c + b * c;
}

int main(void)
{
    void* p = V;
    printf("%a\n", f(4476.0f, 20439.0f, 4915.0f));
    return 0;
}
Invocation & execution (via godbolt.org):
# icc 2021.1.2 on Linux on x86-64
$ icc t0.c -fp-model=fast -O3 -DV=f
0x1.d32322p+26
$ icc t0.c -fp-model=fast -O3 -DV=0
0x1.d32324p+26
Generated assembler code is the same: https://godbolt.org/z/osra5jfYY.
Why doesn't the same generated assembler code lead to the same output?
Why does void* p = f; matter?
Godbolt shows you the assembly emitted by running the compiler with -S. But in this case, that's not the code that actually gets run, because further optimizations can be done at link time.
Try checking the "Compile to binary" box instead (https://godbolt.org/z/ETznv9qP4), which will actually compile and link the binary and then disassemble it. We see that in your -DV=f version, the code for f is:
addss xmm0,xmm1
mulss xmm0,xmm2
ret
just as before. But with -DV=0, we have:
movss xmm0,DWORD PTR [rip+0x2d88]
ret
So f has been converted to a function which simply returns a constant loaded from memory. At link time, the compiler was able to see that f was only ever called with a particular set of constant arguments, and so it could perform interprocedural constant propagation and have f merely return the precomputed result.
Having an additional reference to f evidently defeats this. Probably the compiler or linker sees that f had its address taken, and didn't notice that nothing was ever done with the address. So it assumes that f might be called elsewhere in the program, and therefore it has to emit code that would give the correct result for arbitrary arguments.
As to why the results are different: The precomputation is done strictly, evaluating both a*c and b*c as float and then adding them. So its result of 122457232 is the "right" one by the rules of C, and it is also what you get when compiling with -O0 or -fp-model=strict. The runtime version has been optimized to (a+b)*c, which is actually more accurate because it avoids an extra rounding; it yields 122457224, which is closer to the exact value of 122457225.
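You can reproduce both values directly in C (a small sketch of mine, assuming x86-64 SSE math, i.e. FLT_EVAL_METHOD == 0, and no fast-math reassociation):
#include <stdio.h>

int main(void)
{
    float a = 4476.0f, b = 20439.0f, c = 4915.0f;
    printf("%a\n", a * c + b * c);  /* two roundings: expect 0x1.d32324p+26 */
    printf("%a\n", (a + b) * c);    /* one rounding:  expect 0x1.d32322p+26 */
    return 0;
}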

Why does `gcc -ffast-math` disable the correct result of `isnan()` and `isinf()`?

I understand that using the -ffast-math flag allows for unsafe math operations and disables signalling NaNs. However, I expected the functions isnan() and isinf() to still be able to return the correct results, which they do not.
Here's an example:
File test_isnan.c:
#include <stdio.h>
#include <math.h>

int main(void){
    /* Produce a NaN */
    const float my_nan = sqrtf(-1.f);
    /* Produce an inf */
    const float my_inf = 1.f/0.f;
    printf("This should be a NaN: %.6e\n", my_nan);
    printf("This should be inf: %.6e\n", my_inf);
    if (isnan(my_nan)) {
        printf("Caught the nan!\n");
    } else {
        printf("isnan failed?\n");
    }
    if (isinf(my_inf)) {
        printf("Caught the inf!\n");
    } else {
        printf("isinf failed?\n");
    }
}
Now let's compile and run the program without -ffast-math:
$ gcc test_isnan.c -lm -o test_isnan.o && ./test_isnan.o
This should be a NaN: -nan
This should be inf: inf
Caught the nan!
Caught the inf!
But with it:
$ gcc test_isnan.c -lm -o test_isnan.o -ffast-math && ./test_isnan.o
This should be a NaN: -nan
This should be inf: inf
isnan failed?
isinf failed?
So why don't isnan() and isinf() catch these nans and infs? What am I missing?
In case it might be relevant, here's my gcc version:
gcc (Spack GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html:
-ffast-math
Sets the options -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range and -fexcess-precision=fast.
Where:
-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.
The moment you break that assumption, you can't expect those functions to work.
I understand that you were hoping for this setting to optimize all the other operations while still providing a correct result for these two functions, but that's just not the way it works. I don't think there's a way to solve this. Maybe you can have a look at Clang, but I don't expect it to be different.
-ffast-math
Sets the options ... -ffinite-math-only ...
-ffinite-math-only
Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.
The compiler optimizes the code to:
printf("This should be a NaN: %.6e\n", sqrtf(-1.f));
printf("This should be inf: %.6e\n", 1.f/0.f);
printf("isnan failed?\n");
printf("isinf failed?\n");
because the compiler assumes that these expressions cannot produce a NaN or an inf.
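If you really need to classify values in code compiled with -ffast-math, one common workaround (my own sketch, not from the answers above) is to test the bit pattern yourself, since integer comparisons are unaffected by -ffinite-math-only. For IEEE-754 binary32, a NaN has an all-ones exponent and a non-zero mantissa, while an infinity has an all-ones exponent and a zero mantissa:
#include <stdint.h>
#include <string.h>

/* Bit-level classification that -ffinite-math-only cannot fold away. */
static int my_isnan(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu) != 0;
}

static int my_isinf(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits & 0x7fffffffu) == 0x7f800000u;
}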

Which of gcc's -O3 optimization flags enable "hardware accelerated instructions"

So basically, for my project, there is a restriction that it must not use the -O3 flag (we must only use -O2). The reasoning for this is that the -O3 flag apparently introduces "hardware accelerated instructions".
The gcc version is 5.4 and the manual page for this version's optimization flags is: this
I want to include as many of -O3's flags as possible.
The list of flags introduced by -O3 are:
-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone
So I am planning to use -O2 and manually include as many of the above flags as possible.
Which flags above enable "hardware accelerated instructions" optimizations? How can I tell if a flag enables "hardware accelerated instructions" optimizations by reading the descriptions? What constitutes that?
The set of instructions used is controlled by -march, not by -O3. It is true that -O3 may make more use of SIMD instructions for vectorization, but -O3 does not specifically add or remove instructions from consideration during code generation.
If you want to compile your code with the simplest instructions only, choose the simplest march for your platform. For example, -march=core2 would be a conservative choice for x86-64, as this refers to the Intel Core 2 processor family which is quite old.
Still, Core 2 supports MMX and SSE up to SSE3 and SSSE3. To disable those, add:
-mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3
I feel like John has already answered this question, so I will just try to provide some examples.
Consider following minimal program:
#include <cstring>

void copy(long *dst, const long *src)
{
    std::memcpy(dst, src, sizeof(long) * 4);
}
Compiled with GCC 7.2 (g++ -O2) for x86_64, this gives the following output:
copy(long*, long const*):
        movdqu (%rsi), %xmm0
        movups %xmm0, (%rdi)
        movdqu 16(%rsi), %xmm0
        movups %xmm0, 16(%rdi)
        ret
Compiled with GCC 7.2 (g++ -O2 -mno-sse) for x86_64, it gives the following output:
copy(long*, long const*):
        movq (%rsi), %rax
        movq %rax, (%rdi)
        movq 8(%rsi), %rax
        movq %rax, 8(%rdi)
        movq 16(%rsi), %rax
        movq %rax, 16(%rdi)
        movq 24(%rsi), %rax
        movq %rax, 24(%rdi)
        ret
As you can see, GCC is able to generate SSE instructions even at the -O2 level. A separate flag is required to suppress generation of those instructions.
At the same time, GCC 5.4 generates the same code with and without the -mno-sse flag, and the same holds at the -O3 optimization level.
So your goal is a bit misguided here. Using a superset of the -O2 flags might avoid generation of SSE and similar instructions in some cases, but this is not guaranteed, as the optimization level is only indirectly related to which instructions are generated. If you really want to suppress them, use the -mno-sse flag, but this will probably put you at a disadvantage. Just stick with -O2 - this way everyone will be on equal terms.
I used https://godbolt.org/ to demonstrate this.
I would like to add a couple of points.
Your TA said:
Using -O3 enables hardware accelerated instructions such as SSE
I think that's not really correct. The -f options are intended to be machine-independent. A compiler parses the source code and transforms it into something generally called an intermediate representation (IR). After that, the compiler optimizes the IR itself, in a machine-independent way. Then assembly code is generated from the optimized IR, and a different set of optimizations is applied at that stage. Examples of IRs are LLVM IR, Sun IR, and GCC's GIMPLE trees and RTL. I believe all modern compilers have one or more IRs.
GCC's -f options are, I believe, basically intended to be machine-independent, while -m options are for machine-dependent settings. -O2 or -O3 selects a set of -f options, which are ideally machine-independent. Using fancy instructions is up to the machine-dependent parts of the compiler.
In reality, the border between the machine-dependent and machine-independent worlds might not be crystal clear.
Vectorization, turned on by -O3, is an example of an optimization in that gray area. Some machines do not support SIMD; some do, but the vector size, for example, differs from architecture to architecture. Here is my example code:
// code.c
long long int inner_product(int* v0, int* v1, int sz) {
    long long int res = 0;
    for (int i = 0; i < sz; i++) {
        res += (v0[i] * v1[i]);
    }
    return res;
}
If it is compiled as follows:
$ gcc -O3 -S -march=core2 code.c -o code.s -fdump-tree-all
GCC vectorizes it, but the vector size is 2 long long ints or 4 ints:
$ cat code.*.optimized | grep 'vector('
vector(2) long long int vect_res_16.21;
vector(2) long long int vect_res_16.19;
vector(2) long long int vect__8.18;
vector(4) int vect__7.17;
vector(4) int vect__6.16;
vector(4) int * vectp_v1.15;
vector(4) int vect__4.13;
vector(4) int * vectp_v0.12;
On the other hand, if it is compiled with -march=skylake-avx512, whose vector size is 4 times that of core2, the outcome is:
$ gcc -O3 -S -march=skylake-avx512 code.c -o code.s -fdump-tree-all && cat code.*.optimized | grep 'vector('
vector(8) long long int vect_res_16.21;
vector(8) long long int vect_res_16.19;
vector(8) long long int vect__8.18;
vector(16) int vect__7.17;
vector(16) int vect__6.16;
vector(16) int * vectp_v1.15;
vector(16) int vect__4.13;
vector(16) int * vectp_v0.12;
Note that code generation has not yet started at this point; the IR already looks different depending on the -march value. I guess vectorization is not the only optimization that shows this kind of behavior.
Nonetheless, I do not think it is fair to call such optimizations machine-dependent. A typical machine-dependent optimization is instruction scheduling, which has almost nothing to do with intermediate representations; it is about machine instructions and the micro-architecture. For the machine-independent optimizations in this gray area, a compiler could conceptually generate a unified vectorized IR (say, always using a vector size of 4) and let each code generator deal with it (merging or splitting vector operands as needed). In practice, however, giving the vector operands an appropriate size is considered easier at the IR level than at the code-generation level, so people gave up on living in the world of "Ideas". Nonetheless, I believe such optimizations can still reasonably be called machine-independent.
You might need to ask what kind of fancy instructions the TA does not want to see. If it is only vector instructions, you could use -O3 with all vectorization-related flags disabled. My example code can be used to check whether any SIMD instructions are still generated:
$ gcc -O3 -S code.c
$ cat code.s | egrep xmm
If the output is empty, you are good.

Same FLT_EVAL_METHOD, different results in GCC/Clang

The following program (adapted from here) is giving inconsistent results when compiled with GCC (4.8.2) and Clang (3.5.1). In particular, the GCC result does not change even when FLT_EVAL_METHOD does.
#include <stdio.h>
#include <float.h>

int r1;
double ten = 10.0;

int main(int c, char **v) {
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    r1 = 0.1 == (1.0 / ten);
    printf("0.1 = %a, 1.0/ten = %a\n", 0.1, 1.0 / ten);
    printf("r1=%d\n", r1);
}
Tests:
$ gcc -std=c99 t.c && ./a.out
FLT_EVAL_METHOD = 0
0.1 = 0x1.999999999999ap-4, 1.0/ten = 0x1.999999999999ap-4
r1=1
$ gcc -std=c99 -mfpmath=387 t.c && ./a.out
FLT_EVAL_METHOD = 2
0.1 = 0x0.0000000000001p-1022, 1.0/ten = 0x0p+0
r1=1
$ clang -std=c99 t.c && ./a.out
FLT_EVAL_METHOD = 0
0.1 = 0x1.999999999999ap-4, 1.0/ten = 0x1.999999999999ap-4
r1=1
$ clang -std=c99 -mfpmath=387 -mno-sse t.c && ./a.out
FLT_EVAL_METHOD = 2
0.1 = 0x0.07fff00000001p-1022, 1.0/ten = 0x0p+0
r1=0
Note that, according to this blog post, GCC 4.4.3 used to output 0 instead of 1 in the second test.
A possibly related question indicates that a bug has been corrected in GCC 4.6, which might explain why GCC's result is different.
I would like to confirm if any of these results would be incorrect, or if some subtle evaluation steps (e.g. a new preprocessor optimization) would justify the difference between these compilers.
This answer is about something that you should resolve before you go further, because it is going to make reasoning about what happens much harder otherwise:
Surely printing 0.1 = 0x0.07fff00000001p-1022 or 0.1 = 0x0.0000000000001p-1022 can only be a bug on your compilation platform, caused by an ABI mismatch when using -mfpmath=387. None of these values can be excused by excess precision.
You could try to include your own conversion-to-readable-format in the test file, so that the conversion is also compiled with -mfpmath=387. Or make a small stub in another file, not compiled with that option, with a minimalistic calling convention:
In the other file:
#include <stdio.h>

double d;

void print_double(void)
{
    printf("%a", d);
}
In the file compiled with -mfpmath=387:
extern double d;
void print_double(void);

/* ... inside some function ... */
d = 0.1;
print_double();
Ignoring the printf problem which Pascal Cuoq addressed, I think GCC is correct here: according to the C99 standard, FLT_EVAL_METHOD == 2 should
evaluate all operations and constants to the range and precision of the long double type.
So, in this case, both 0.1 and 1.0 / ten are being evaluated to an extended precision approximation of 1/10.
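To see concretely what "the range and precision of the long double type" means for 1/10 (an illustration of my own; the exact digits depend on the platform's long double format), you can print both approximations:
#include <stdio.h>

int main(void)
{
    /* double (53-bit significand) vs. x87 long double (64-bit significand)
       approximations of 1/10; under FLT_EVAL_METHOD == 2, constants and
       operations are evaluated with the latter precision. */
    printf("0.1  as double:      %a\n", 0.1);
    printf("0.1L as long double: %La\n", 0.1L);
    return 0;
}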
I'm not sure what Clang is doing, though this question might provide some help.

Is there a document describing how Clang handles excess floating-point precision?

It is nearly impossible(*) to provide strict IEEE 754 semantics at reasonable cost when the only floating-point instructions one is allowed to use are the 387 ones. It is particularly hard when one wishes to keep the FPU working on the full 64-bit significand so that the long double type is available for extended precision. The usual “solution” is to do intermediate computations at the only available precision, and to convert to the lower precision at more or less well-defined occasions.
Recent versions of GCC handle excess precision in intermediate computations according to the interpretation laid out by Joseph S. Myers in a 2008 post to the GCC mailing list. This description makes a program compiled with gcc -std=c99 -mno-sse2 -mfpmath=387 completely predictable, to the last bit, as far as I understand. And if by chance it doesn't, it is a bug and it will be fixed: Joseph S. Myers' stated intention in his post is to make it predictable.
Is it documented how Clang handles excess precision (say when the option -mno-sse2 is used), and where?
(*) EDIT: this is an exaggeration. It is slightly annoying but not that difficult to emulate binary64 when one is allowed to configure the x87 FPU to use a 53-bit significand.
Following a comment by R.. below, here is the log of a short interaction of mine with the most recent version of Clang I have:
Hexa:~ $ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
Hexa:~ $ cat fem.c
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <fenv.h>

double x;
double y = 2.0;
double z = 1.0;

int main(){
    x = y + z;
    printf("%d\n", (int) FLT_EVAL_METHOD);
}
Hexa:~ $ clang -std=c99 -mno-sse2 fem.c
Hexa:~ $ ./a.out
0
Hexa:~ $ clang -std=c99 -mno-sse2 -S fem.c
Hexa:~ $ cat fem.s
…
movl $0, %esi
fldl _y(%rip)
fldl _z(%rip)
faddp %st(1)
movq _x@GOTPCREL(%rip), %rax
fstpl (%rax)
…
This does not answer the originally posed question, but if you are a programmer working with similar issues, this answer might help you.
I really don't see where the perceived difficulty is. Providing strict IEEE-754 binary64 semantics while being limited to 80387 floating-point math, and retaining 80-bit long double computation, seems to require only following the well-specified C99 casting rules, with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).
Edited to add: Yet, Pascal Cuoq is correct: neither gcc-4.6.3 nor clang-llvm-3.0 actually enforces those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are correctly applied to expressions evaluated at compile time, but not to run-time expressions. There are workarounds, listed after the break below.
I do molecular dynamics simulation code, and am very familiar with the repeatability/predictability requirements and also with the desire to retain maximum precision available when possible, so I do claim I know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of or not using those tools.
(A favorite example of mine is the Kahan summation algorithm. With C99 and proper casting (adding casts to e.g. the Wikipedia example code), no tricks or extra temporary variables are needed at all, and the implementation works regardless of the compiler optimization level, including -O3 and -Ofast.)
C99 explicitly states (in e.g. 5.2.4.2.2) that casting and assignment both remove all extra range and precision. This means that you can use long double arithmetic by declaring the temporary variables used during the computation as long double, also casting your input variables to that type; whenever an IEEE-754 binary64 value is needed, just cast to double.
On the '387, the cast generates a store to memory followed by a load on both of the above compilers; this does correctly round the 80-bit value to IEEE-754 binary64. The cost is very reasonable in my opinion. The exact time taken depends on the architecture and the surrounding code; usually it can be interleaved with other code to bring the cost down to negligible levels. When MMX, SSE or AVX is available, their registers are separate from the 80-bit 80387 registers, and the cast is usually done by moving the value to an MMX/SSE/AVX register.
(I prefer production code to use a specific floating-point type, say tempdouble or such, for temporary variables, so that it can be defined to either double or long double depending on architecture and speed/precision tradeoffs desired.)
In a nutshell:
Don't assume (expression) is of double precision just because all the variables and literal constants are. Write it as (double)(expression) if you want the result at double precision.
This applies to compound expressions, too, and may sometimes lead to unwieldy expressions with many levels of casts.
If you have expr1 and expr2 that you wish to compute at 80-bit precision, but also need a product in which each factor is first rounded to 64 bits, use
long double expr1;
long double expr2;
double product = (double)(expr1) * (double)(expr2);
Note that product is computed as a product of two 64-bit values; it is not computed at 80-bit precision and then rounded down. Calculating the product at 80-bit precision and then rounding down would be
double other = expr1 * expr2;
or, adding descriptive casts that tell you exactly what is happening,
double other = (double)((long double)(expr1) * (long double)(expr2));
It should be obvious that product and other often differ.
The C99 casting rules are just another tool you must learn to wield, if you do work with mixed 32-bit/64-bit/80-bit/128-bit floating point values. Really, you encounter the exact same issues if you mix binary32 and binary64 floats (float and double on most architectures)!
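As a quick illustration of that last point (a sketch of mine), the float approximation of 1/10, once promoted to double for the comparison, no longer equals the double approximation:
#include <stdio.h>

int main(void)
{
    float f = 0.1f;
    double d = 0.1;
    /* f is promoted to double for the comparison; the promoted value is the
       float approximation of 1/10, which differs from the double one. */
    printf("f == d:        %s\n", (f == d) ? "true" : "false");        /* false */
    printf("(float)d == f: %s\n", ((float)d == f) ? "true" : "false"); /* true */
    return 0;
}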
Perhaps rewriting Pascal Cuoq's exploration code to correctly apply the casting rules makes this clearer?
#include <stdio.h>

#define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")

int main(void)
{
    double d = 1.0 / 10.0;
    long double ld = 1.0L / 10.0L;

    printf("sizeof (double) = %d\n", (int)sizeof (double));
    printf("sizeof (long double) == %d\n", (int)sizeof (long double));

    printf("\nExpect true:\n");
    TEST(d == (double)(0.1));
    TEST(ld == (long double)(0.1L));
    TEST(d == (double)(1.0 / 10.0));
    TEST(ld == (long double)(1.0L / 10.0L));
    TEST(d == (double)(ld));
    TEST((double)(1.0L/10.0L) == (double)(0.1));
    TEST((long double)(1.0L/10.0L) == (long double)(0.1L));

    printf("\nExpect false:\n");
    TEST(d == ld);
    TEST((long double)(d) == ld);
    TEST(d == 0.1L);
    TEST(ld == 0.1);
    TEST(d == (long double)(1.0L / 10.0L));
    TEST(ld == (double)(1.0L / 10.0));

    return 0;
}
The output, with both GCC and clang, is
sizeof (double) = 8
sizeof (long double) == 12
Expect true:
d == (double)(0.1): true
ld == (long double)(0.1L): true
d == (double)(1.0 / 10.0): true
ld == (long double)(1.0L / 10.0L): true
d == (double)(ld): true
(double)(1.0L/10.0L) == (double)(0.1): true
(long double)(1.0L/10.0L) == (long double)(0.1L): true
Expect false:
d == ld: false
(long double)(d) == ld: false
d == 0.1L: false
ld == 0.1: false
d == (long double)(1.0L / 10.0L): false
ld == (double)(1.0L / 10.0): false
except that recent versions of GCC promote the right hand side of ld == 0.1 to long double first (i.e. to ld == 0.1L), yielding true, and that with SSE/AVX, long double is 128-bit.
For the pure '387 tests, I used
gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
with various optimization flag combinations as ..., including -fomit-frame-pointer, -O0, -O1, -O2, -O3, and -Os.
Using any other flags or C99 compilers should lead to the same results, except for the long double size (and ld == 0.1 for current GCC versions). If you encounter any differences, I'd be very grateful to hear about them; I may need to warn my users of such compilers (compiler versions). Note that Microsoft does not support C99, so they are completely uninteresting to me.
Pascal Cuoq does bring up an interesting problem in the comment chain below, which I didn't immediately recognize.
When evaluating an expression, both GCC and clang with -mfpmath=387 evaluate all expressions using 80-bit precision. This leads to, for example,
7491907632491941888 = 0x1.9fe2693112e14p+62 = 110011111111000100110100100110001000100101110000101000000000000
5698883734965350400 = 0x1.3c5a02407b71cp+62 = 100111100010110100000001001000000011110110111000111000000000000
7491907632491941888 * 5698883734965350400 = 42695510550671093541385598890357555200 = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000
yielding incorrect results, because that string of ones in the middle of the binary result is just at the difference between 53- and 64-bit mantissas (64 and 80-bit floating point numbers, respectively). So, while the expected result is
42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125 = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000
the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387 is
42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125 = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000
In theory, you should be able to tell gcc and clang to enforce the correct C99 rounding rules by using options
-std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard
However, this only affects expressions the compiler optimizes, and does not seem to fix the 387 handling at all. If you use e.g. clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test with test.c being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules -- but only because the compiler optimizes away the expression, not using the 387 at all.
Simply put, instead of computing
(double)d1 * (double)d2
both gcc and clang actually tell the '387 to compute
(double)((long double)d1 * (long double)d2)
This is, I believe, a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2 means operations on double-precision arguments are always done at extended precision, but I cannot see any sane reason -- aside from having to rewrite parts of libm on the '387 -- to do that in C99, considering that the IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And, given that the compiler options that should force this behaviour -- -std=c99 -ffloat-store -fexcess-precision=standard -- make no sense if FLT_EVAL_METHOD=2 behaviour is actually desired, there are no backwards-compatibility issues, either.) It is important to note that, given the proper compiler flags, expressions evaluated at compile time do get evaluated correctly, and that only expressions evaluated at run time get incorrect results.
The simplest workaround, and the portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results towards zero.
In some cases, rounding towards zero may help with predictability and pathological cases. In particular, for intervals like x = [0,1), rounding towards zero means the upper limit is never reached through rounding; important if you evaluate e.g. piecewise splines.
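Here is a minimal sketch of that portable route using the standard C99 <fenv.h> interface (my own illustration; the author's fp387() helper below goes further and also sets the 387 precision):
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    /* Round all subsequent floating-point results toward zero.
       fesetround() returns 0 on success; link with -lm if needed. */
    if (fesetround(FE_TOWARDZERO) != 0) {
        fprintf(stderr, "FE_TOWARDZERO not supported\n");
        return 1;
    }
    printf("rounding mode set: %d\n", fegetround() == FE_TOWARDZERO);
    return 0;
}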
For the other rounding modes, and in particular for controlling the 387 precision, you need to program the 387 hardware directly.
You can use either _FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

#define FP387_NEAREST   0x0000
#define FP387_ZERO      0x0C00
#define FP387_UP        0x0800
#define FP387_DOWN      0x0400

#define FP387_SINGLE    0x0000
#define FP387_DOUBLE    0x0200
#define FP387_EXTENDED  0x0300

static inline void fp387(const unsigned short control)
{
    unsigned short cw = (control & 0x0F00) | 0x007f;
    __asm__ volatile ("fldcw %0" : : "m" (*&cw));
}

const char *bits(const double value)
{
    const unsigned char *const data = (const unsigned char *)&value;
    static char buffer[CHAR_BIT * sizeof value + 1];
    char *p = buffer;
    size_t i = CHAR_BIT * sizeof value;
    while (i-->0)
        *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
    *p = '\0';
    return (const char *)buffer;
}

int main(int argc, char *argv[])
{
    double d1, d2;
    char dummy;

    if (argc != 3) {
        fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[2]);
        return EXIT_FAILURE;
    }

    printf("%s:\td1 = %.0f\n\t %s in binary\n", argv[1], d1, bits(d1));
    printf("%s:\td2 = %.0f\n\t %s in binary\n", argv[2], d2, bits(d2));

    printf("\nDefaults:\n");
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to nearest integer:\n");
    fp387(FP387_EXTENDED | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to nearest integer:\n");
    fp387(FP387_DOUBLE | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to zero:\n");
    fp387(FP387_EXTENDED | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to zero:\n");
    fp387(FP387_DOUBLE | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    return 0;
}
Using clang-llvm-3.0 to compile and run, I get the correct results,
clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
./precision 7491907632491941888 5698883734965350400
7491907632491941888: d1 = 7491907632491941888
0100001111011001111111100010011010010011000100010010111000010100 in binary
5698883734965350400: d2 = 5698883734965350400
0100001111010011110001011010000000100100000001111011011100011100 in binary
Defaults:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Extended precision, rounding to nearest integer:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Double precision, rounding to nearest integer:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Extended precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Double precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
In other words, you can work around the compiler issues by using fp387() to set the precision and rounding mode.
The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library fpu_control.h on x86_64 has the comment "libm requires extended precision". Fortunately, you can take the '387 implementations from e.g. GNU C library, and implement them in a header file or write a known-to-work libm, if you need the math.h functionality; in fact, I think I might be able to help there.
For the record, below is what I found by experimentation. The following program shows various behaviors when compiled with Clang:
#include <stdio.h>

int r1, r2, r3, r4, r5, r6, r7;
double ten = 10.0;

int main(int c, char **v)
{
    r1 = 0.1 == (1.0 / ten);
    r2 = 0.1 == (1.0 / 10.0);
    r3 = 0.1 == (double) (1.0 / ten);
    r4 = 0.1 == (double) (1.0 / 10.0);
    ten = 10.0;
    r5 = 0.1 == (1.0 / ten);
    r6 = 0.1 == (double) (1.0 / ten);
    r7 = ((double) 0.1) == (1.0 / 10.0);
    printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
}
The results vary with the optimization level:
$ clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
$ clang -mno-sse2 -std=c99 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=0 r7=1
$ clang -mno-sse2 -std=c99 -O2 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=1 r7=1
The cast (double) that differentiates r5 and r6 at -O0 has no effect at -O2, nor for the variables r3 and r4. The result r1 is different from r5 at all optimization levels, whereas r6 only differs from r3 at -O2.

Resources