Problems adding doubles in x86_64 assembly - C

Hello, I am trying to learn assembly and how to work with floating-point numbers in x86_64. From what I understand, floating-point arguments are passed in xmm0, xmm1, xmm2, and so on, and the result is returned in xmm0. So I am trying to write a simple assembly function that adds two doubles together. Here is the function:
.text
.global floatadd
.type floatadd,#function
floatadd:
addsd %xmm1,%xmm0
ret
And here is the C code I am using as well.
#include <stdio.h>
int main() {
    double a = 1.5;
    double b = 1.2;
    double c = floatadd(a, b);
    printf("result = %f\n", c);
}
I have been trying to follow what is happening in gdb. When I set a breakpoint in my function, I can see that xmm0 has 1.5 and xmm1 has 1.2, and adding them gives 2.7. In gdb, print $xmm0 gives v2_double = {2.7000000000000002, 0}. However, after my function returns to main, the compiled code executes
cvtsi2sd %eax,%xmm0
and print $xmm0 becomes v2_double = {2, 0}. I am not sure why gcc emits that instruction, or why it uses the 32-bit register instead of the 64-bit register. I have tried the conversion specifiers %lf and %f, and both do the same thing.
What is happening?

The problem is that you failed to declare floatadd before calling it, so the compiler assumes it returns an int in %eax and inserts a conversion from that int to a double. Add the declaration:
double floatadd(double, double);
before main.
Using -Wall or whatever equivalent your compiler uses to enable warnings would probably have told you about this problem...
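For reference, here is the complete corrected caller (a minimal sketch; the assembly file stays exactly as you wrote it):

#include <stdio.h>

/* With the prototype visible, the compiler knows the result comes back
   in %xmm0 and emits no int-to-double conversion. */
double floatadd(double, double);

int main(void) {
    double a = 1.5;
    double b = 1.2;
    double c = floatadd(a, b);
    printf("result = %f\n", c);  /* prints result = 2.700000 */
    return 0;
}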

Related

Why doesn't the same generated assembler code lead to the same output?

Sample code (t0.c):
#include <stdio.h>

float f(float a, float b, float c) __attribute__((noinline));
float f(float a, float b, float c)
{
    return a * c + b * c;
}

int main(void)
{
    void* p = V;
    printf("%a\n", f(4476.0f, 20439.0f, 4915.0f));
    return 0;
}
Invocation & execution (via godbolt.org):
# icc 2021.1.2 on Linux on x86-64
$ icc t0.c -fp-model=fast -O3 -DV=f
0x1.d32322p+26
$ icc t0.c -fp-model=fast -O3 -DV=0
0x1.d32324p+26
Generated assembler code is the same: https://godbolt.org/z/osra5jfYY.
Why doesn't the same generated assembler code lead to the same output?
Why does void* p = f; matter?
Godbolt shows you the assembly emitted by running the compiler with -S. But in this case, that's not the code that actually gets run, because further optimizations can be done at link time.
Try checking the "Compile to binary" box instead (https://godbolt.org/z/ETznv9qP4), which will actually compile and link the binary and then disassemble it. We see that in your -DV=f version, the code for f is:
addss xmm0,xmm1
mulss xmm0,xmm2
ret
just as before. But with -DV=0, we have:
movss xmm0,DWORD PTR [rip+0x2d88]
ret
So f has been converted to a function which simply returns a constant loaded from memory. At link time, the compiler was able to see that f was only ever called with a particular set of constant arguments, and so it could perform interprocedural constant propagation and have f merely return the precomputed result.
Having an additional reference to f evidently defeats this. Probably the compiler or linker sees that f had its address taken, and didn't notice that nothing was ever done with the address. So it assumes that f might be called elsewhere in the program, and therefore it has to emit code that would give the correct result for arbitrary arguments.
As to why the results are different: The precomputation is done strictly, evaluating both a*c and b*c as float and then adding them. So its result of 122457232 is the "right" one by the rules of C, and it is also what you get when compiling with -O0 or -fp-model=strict. The runtime version has been optimized to (a+b)*c, which is actually more accurate because it avoids an extra rounding; it yields 122457224, which is closer to the exact value of 122457225.
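You can reproduce both results in portable C by spelling out the two evaluation orders (a sketch; the volatile qualifiers keep the compiler from folding the expressions, and it assumes plain float arithmetic without FMA contraction):

#include <stdio.h>

int main(void) {
    volatile float a = 4476.0f, b = 20439.0f, c = 4915.0f;
    float strict = a * c + b * c;  /* two products, each rounded to float */
    float reassoc = (a + b) * c;   /* the form the optimizer produced */
    printf("%a\n", strict);        /* 0x1.d32324p+26 = 122457232 */
    printf("%a\n", reassoc);       /* 0x1.d32322p+26 = 122457224 */
    return 0;
}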

What is happening here in pow function?

I have seen various answers here that depict the strange behavior of the pow function in C. But I have something different to ask here.
In the code below I have initialized int x = pow(10,2) and int y = pow(10,n) (with int n = 2).
In the first case, printing the result gives 100; in the other case it comes out to be 99.
I know that pow returns a double and that it gets truncated when stored in an int, but I want to ask why the two outputs are different.
CODE1
#include <stdio.h>
#include <math.h>
int main()
{
    int n = 2;
    int x;
    int y;
    x = pow(10,2); // Printing gives output 100
    y = pow(10,n); // Printing gives output 99
    printf("%d %d", x, y);
}
Output : 100 99
Why is the output coming out different?
My gcc version is 4.9.2
Update :
Code 2
int main()
{
    int n = 2;
    int x;
    int y;
    x = pow(10,2); // Printing gives output 100
    y = pow(10,n); // Printing gives output 99
    double k = pow(10,2);
    double l = pow(10,n);
    printf("%d %d\n", x, y);
    printf("%f %f\n", k, l);
}
Output : 100 99
100.000000 100.000000
Update 2: Assembly instructions for CODE1
Generated assembly (GCC 4.9.2, using gcc -S -masm=intel):
.LC1:
.ascii "%d %d\0"
.text
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
push ebp
mov ebp, esp
and esp, -16
sub esp, 48
call ___main
mov DWORD PTR [esp+44], 2
mov DWORD PTR [esp+40], 100 //Concerned Line
fild DWORD PTR [esp+44]
fstp QWORD PTR [esp+8]
fld QWORD PTR LC0
fstp QWORD PTR [esp]
call _pow //Concerned Line
fnstcw WORD PTR [esp+30]
movzx eax, WORD PTR [esp+30]
mov ah, 12
mov WORD PTR [esp+28], ax
fldcw WORD PTR [esp+28]
fistp DWORD PTR [esp+36]
fldcw WORD PTR [esp+30]
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+8], eax
mov eax, DWORD PTR [esp+40]
mov DWORD PTR [esp+4], eax
mov DWORD PTR [esp], OFFSET FLAT:LC1
call _printf
leave
ret
.section .rdata,"dr"
.align 8
LC0:
.long 0
.long 1076101120
.ident "GCC: (tdm-1) 4.9.2"
.def _pow; .scl 2; .type 32; .endef
.def _printf; .scl 2; .type 32; .endef
I know that pow returns double and it gets truncated on storing in int, but I want to ask why the output comes to be different.
You must first, if you haven't already, divest yourself of the idea that floating-point numbers are in any way sensible or predictable. double only approximates real numbers and almost anything you do with a double is likely to be an approximation to the actual result.
That said, as you have realized, pow(10, n) resulted in a value like 99.99999999999997, which is an approximation accurate to 15 significant figures. And then you told it to truncate to the largest integer less than that, so it threw away most of those.
(Aside: there is rarely a good reason to convert a double to an int. Usually you should either format it for display with something like sprintf("%.0f", x), which does rounding correctly, or use the floor function, which can handle floating-point numbers that may be out of the range of an int. If neither of those suit your purpose, like in currency or date calculations, possibly you should not be using floating point numbers at all.)
There are two weird things going on here. First, why is pow(10, n) inaccurate? 10, 2, and 100 are all precisely representable as double. The best answer I can offer is that the C standard library you are using has a bug. (The compiler and the standard library, which I assume are gcc and glibc, are developed on different release schedules and by different teams. If pow is returning inaccurate results, that is probably a bug in glibc, not gcc.)
In the comments on your question, amdn found a glibc bug to do with FP rounding that might be related and another Q&A that goes into more detail about why this happens and how it's not a violation of the C standard. chux's answer also addresses this. (C doesn't require implementation of IEEE 754, but even if it did, pow isn't required to use correct rounding.) I will still call this a glibc bug, because it's an undesirable property.
(It's also conceivable, though unlikely, that your processor's FPU is wrong.)
Second, why is pow(10, n) different from pow(10, 2)? This one is far easier. gcc optimizes away function calls for which the result can be calculated at compile time, so pow(10, 2) is almost certainly being optimized to 100.0. If you look at the generated assembly code, you will find only one call to pow.
The GCC manual, section 6.59 describes which standard library functions may be treated in this way (follow the link for the full list):
The remaining functions are provided for optimization purposes.
With the exception of built-ins that have library equivalents such as the standard C library functions discussed below, or that expand to library calls, GCC built-in functions are always expanded inline and thus do not have corresponding entry points and their address cannot be obtained. Attempting to use them in an expression other than a function call results in a compile-time error.
[...]
The ISO C90 functions abort, abs, acos, asin, atan2, atan, calloc, ceil, cosh, cos, exit, exp, fabs, floor, fmod, fprintf, fputs, frexp, fscanf, isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit, tolower, toupper, labs, ldexp, log10, log, malloc, memchr, memcmp, memcpy, memset, modf, pow, printf, putchar, puts, scanf, sinh, sin, snprintf, sprintf, sqrt, sscanf, strcat, strchr, strcmp, strcpy, strcspn, strlen, strncat, strncmp, strncpy, strpbrk, strrchr, strspn, strstr, tanh, tan, vfprintf, vprintf and vsprintf are all recognized as built-in functions unless -fno-builtin is specified (or -fno-builtin-function is specified for an individual function).
So it would seem you can disable this behavior with -fno-builtin-pow.
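For example (assuming the code above is in t.c; -lm becomes necessary once the calls are no longer folded away):
$ gcc -fno-builtin-pow t.c -lm
With the built-in disabled, both pow calls go through libm at run time, so x and y should at least agree with each other, even if both come out slightly off.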
Why is the output coming out different? (in the updated appended code)
From that printed text alone, we do not actually know whether the values differ.
When comparing the textual output of an int and a double, be sure to print the double with sufficient precision to see whether it is 100.000000 or merely near 100.000000, or print it in hex to remove all doubt.
printf("%d %d\n" , x , y);
// printf("%f %f\n" , k , l);
// Is it the FP number just less than 100?
printf("%.17e %.17e\n" , k , l); // maybe 9.99999999999999858e+01
printf("%a %a\n" , k , l); // maybe 0x1.8ffffffffffff0000p+6
Why is the output coming out different? (in the original code)
C does not specify the accuracy of most <math.h> functions. The following are all compliant results.
// Higher quality functions return 100.0
pow(10,2) --> 100.0
// Lower quality and/or faster one may return nearby results
pow(10,2) --> 100.0000000000000142...
pow(10,2) --> 99.9999999999999857...
Assigning a floating-point (FP) number to an int simply drops the fraction, regardless of how close the fraction is to 1.0.
When converting FP to an integer, it is better to control the conversion and round, to cope with minor computational differences.
// long int lround(double x);
long i = lround(pow(10.0,2.0));
You're not the first to find this. Here's a discussion from 2013:
pow() cast to integer, unexpected result
I'm speculating that the assembly code produced by the tcc guys is causing the second value to be rounded down after calculating a result that is REALLY close to 100.
Like mikijov said in that historic post, it looks like the bug has been fixed.
As others have mentioned, the 99 comes from floating-point truncation. The reason the constant call returns a different, correct answer is a compile-time optimization: when both arguments are constants, the compiler can evaluate the call itself, and for a small positive integer power that amounts to exact repeated multiplication, so there is no roundoff and no function call is made.
Passing the runtime variable n forces a real call to pow, which treats the inputs as arbitrary reals and gives an approximate answer. That answer happens to be slightly under 100, e.g. 99.999999..., and is then truncated to 99.
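If you just want integer results, a defensive pattern is to round rather than truncate, or to avoid pow for integer exponents altogether. A sketch (ipow and pow_as_int are hypothetical helpers, not standard functions):

#include <math.h>

/* Integer power computed purely in integer arithmetic, immune to
   floating-point rounding (no overflow checking here). */
long ipow(long base, unsigned exp) {
    long result = 1;
    while (exp--)
        result *= base;
    return result;
}

/* Or, if pow must be used, round the result instead of truncating it. */
int pow_as_int(int base, int exp) {
    return (int)lround(pow(base, exp));
}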

Passing pointer from c to asm and add value to sse register

I want to pass a pointer to an array between C and ASM code. I've got an array of four double values, and I need to pass them to asm, load them into xmm registers, multiply, and return the pointer to the four values back to C. I get an error while loading data into xmm0.
How do I pass these pointers to ASM and back to C?
How do I load all four numbers into xmm0?
Here is the code:
.text
.globl calkasse
.type calkasse, #function
calkasse:
pushq %rbp
movq %rsp, %rbp
movq 8(%rbp), %rax
movaps 16(%rax), %xmm0
mulps %xmm0,%xmm0
movq %rbp, %rsp
popq %rbp
ret
and C code:
double (*calkasse(double (*)[4]))[4];

int main(void) {
    double suma = 0.0;
    double poczatek = 1.0;
    double koniec = 5.0;
    double step = 0.001;
    double i = poczatek;
    double array[4];
    double (*wynik)[4];
    array[0] = i;
    array[1] = i + step;
    array[2] = i + (2 * step);
    array[3] = i + (3 * step);
    wynik = calkasse(&array);
    suma += *wynik[0] + *wynik[1] + *wynik[2] + *wynik[3];
    return 1;
}
You'll need to compile your project to an assembly file and THEN make these changes. If you use inline assembly, a great many changes happen between your C code and the inline assembly (a ton, from what I've seen in my testing): the register state you see before entering the assembly portion is not preserved, and values will be pushed and popped around your code.
To avoid clobbering your registers before entering the assembly, you can add your assembly to the object file after it is created. Use GDB to figure out what your code is doing before you start adding to it; you should be able to get the address you're looking for manually and then place it in a variable. I would use a variable rather than a register, because with a register you'd have to trace the surrounding assembly to make sure the value lands in the right register at the right time. With a variable, you can use a mov (or lea, IIRC) instruction to get the pointer value into it, initialize the variable to null in your C code, and write the C code first as a mock before writing the assembly manually. Hope this helps, and good luck.
You have to make your compiler behave predictably. A C function can be compiled in many ways, and the result is OS- and platform-dependent: parameter passing via registers, register optimizations, OS-reserved registers, and so on all change the compilation. Try writing the function with optimizations off and looking at the compiler's assembly output. That way you get a more controlled compilation that will at least be consistent on one OS and platform; then you can add your inline assembly to that function. Always keep an eye on the assembly output. This takes a medium amount of reverse engineering.
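For completeness, here is what a working version could look like under the x86-64 SysV ABI, where nothing needs to be fetched from the stack: the array pointer arrives in %rdi and the returned pointer goes back in %rax. This is a sketch, not a drop-in solution: it squares the elements in place, uses movapd/mulpd (packed double, not the single-precision movaps/mulps), needs two XMM registers because four doubles do not fit in one 128-byte... one 128-bit register, and requires the C array to be 16-byte aligned (e.g. declared with _Alignas(16), or switch to movupd):

.text
.globl calkasse
.type calkasse, #function
calkasse:
    movapd (%rdi), %xmm0      # elements 0 and 1
    movapd 16(%rdi), %xmm1    # elements 2 and 3
    mulpd %xmm0, %xmm0        # square them
    mulpd %xmm1, %xmm1
    movapd %xmm0, (%rdi)      # store the results back in place
    movapd %xmm1, 16(%rdi)
    movq %rdi, %rax           # return the same pointer to the caller
    ret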

GCC doesn't use built in atomics with -fopenmp [duplicate]

I have stumbled upon the following problem. The below code snippet does not link on Mac OS X with any Xcode I tried (4.4, 4.5)
#include <stdlib.h>
#include <string.h>
#include <emmintrin.h>

int main(int argc, char *argv[])
{
    char *temp;
    #pragma omp parallel
    {
        __m128d v_a, v_ar;
        memcpy(temp, argv[0], 10);
        v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2(0,1));
    }
}
The code is just provided as an example and would segfault if you ran it. The point is that it does not build (the link step fails). The compilation is done using the following command line:
/Applications/Xcode.app/Contents/Developer/usr/bin/gcc test.c -arch x86_64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.7.sdk -mmacosx-version-min=10.7 -fopenmp
Undefined symbols for architecture x86_64:
"___builtin_ia32_shufpd", referenced from:
_main.omp_fn.0 in ccJM7RAw.o
"___builtin_object_size", referenced from:
_main.omp_fn.0 in ccJM7RAw.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
The code builds just fine without the -fopenmp flag. Now, I googled around and found a solution for the first problem, connected with memcpy: adding -fno-builtin or -D_FORTIFY_SOURCE=0 to the gcc argument list. I did not manage to solve the second problem (the SSE intrinsic).
Can anyone help me solve this? The questions:
most importantly: how do I get rid of the "___builtin_ia32_shufpd" error?
what exactly is the reason for the memcpy problem, and what does the -D_FORTIFY_SOURCE=0 flag actually do?
This is a bug in the way Apple's LLVM-backed GCC (llvm-gcc) transforms OpenMP regions and handles calls to the built-ins inside them. The problem can be diagnosed by examining the intermediate tree dumps (obtainable by passing the -fdump-tree-all argument to gcc). Without OpenMP enabled, the following final code representation is generated (from test.c.016t.fap):
main (argc, argv)
{
D.6544 = __builtin_object_size (temp, 0);
D.6545 = __builtin_object_size (temp, 0);
D.6547 = __builtin___memcpy_chk (temp, D.6546, 10, D.6545);
D.6550 = __builtin_ia32_shufpd (v_a, v_a, 1);
}
This is a C-like representation of how the compiler sees the code internally after all transformations, and it is what then gets turned into assembly instructions. (Only the lines that refer to the built-ins are shown here.)
With OpenMP enabled, the parallel region is extracted into its own function, main.omp_fn.0:
main.omp_fn.0 (.omp_data_i)
{
void * (*<T4f6>) (void *, const <unnamed type> *, long unsigned int, long unsigned int) __builtin___memcpy_chk.21;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.20;
vector double (*<T6b5>) (vector double, vector double, int) __builtin_ia32_shufpd.23;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.19;
__builtin_object_size.19 = __builtin_object_size;
D.6587 = __builtin_object_size.19 (D.6603, 0);
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
D.6593 = __builtin_ia32_shufpd.23 (v_a, v_a, 1);
__builtin_object_size.20 = __builtin_object_size;
D.6588 = __builtin_object_size.20 (D.6605, 0);
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
D.6590 = __builtin___memcpy_chk.21 (D.6609, D.6589, 10, D.6588);
}
Again, only the code that refers to the built-ins is shown. What is apparent (though the reason for it is not immediately apparent to me) is that the OpenMP code transformer insists on calling all the built-ins through function pointers. These pointer assignments:
__builtin_object_size.19 = __builtin_object_size;
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
__builtin_object_size.20 = __builtin_object_size;
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
generate external references to symbols which are not really symbols but rather names that get special treatment by the compiler. The linker then tries to resolve them but is unable to find any of the __builtin_* names in any of the object files that the code is linked against. This is also observable in the assembly code that one can obtain by passing -S to gcc:
LBB2_1:
movapd -48(%rbp), %xmm0
movl $1, %eax
movaps %xmm0, -80(%rbp)
movaps -80(%rbp), %xmm1
movl %eax, %edi
callq ___builtin_ia32_shufpd
movapd %xmm0, -32(%rbp)
This is basically a function call that takes three arguments: one integer in %edi (staged through %eax) and two XMM arguments in %xmm0 and %xmm1, with the result returned in %xmm0 (as per the SysV AMD64 ABI calling convention). In contrast, the code generated without -fopenmp is an instruction-level expansion of the intrinsic, as it is supposed to be:
LBB1_3:
movapd -64(%rbp), %xmm0
shufpd $1, %xmm0, %xmm0
movapd %xmm0, -80(%rbp)
What happens when you pass -D_FORTIFY_SOURCE=0 is that memcpy is not replaced by the "fortified" checking version, and a regular call to memcpy is used instead. This eliminates the references to __builtin_object_size and __builtin___memcpy_chk, but it cannot remove the call to the __builtin_ia32_shufpd built-in.
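Concretely, here is what fortification changes (a rough sketch of the glibc-style mechanism, matching the tree dump above):

#include <string.h>

void copy10(char *dst, const char *src)
{
    /* With -D_FORTIFY_SOURCE=1 (or 2) and optimization enabled, the
       headers rewrite this call to roughly
       __builtin___memcpy_chk(dst, src, 10, __builtin_object_size(dst, 0)),
       which is exactly the pair of built-ins visible in the dump above.
       With -D_FORTIFY_SOURCE=0 it stays a plain memcpy call. */
    memcpy(dst, src, 10);
}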
This is obviously a compiler bug. If you really really really must use Apple's GCC to compile the code, then an interim solution would be to move the offending code to an external function as the bug apparently only affects code that gets extracted from parallel regions:
void func(char *temp, char *argv0)
{
    __m128d v_a, v_ar;
    memcpy(temp, argv0, 10);
    v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2(0,1));
}

int main(int argc, char *argv[])
{
    char *temp;
    #pragma omp parallel
    {
        func(temp, argv[0]);
    }
}
The overhead of one additional function call is negligible compared to the overhead of entering and exiting the parallel region. You can use OpenMP pragmas inside func; they will work because of the dynamic scoping of the parallel region.
Maybe Apple will provide a fixed compiler in the future, maybe they won't, given their commitment to replacing GCC with Clang.

Floating point anomaly when an unused statement is not commented out?

When the program as shown below is run, it produces ok output:
j= 0 9007199616606190.000000 = x
k= 0 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
But when the commented-out line (i.e. //if (argc>1) r = atol(argv[1]);) is uncommented, it produces:
j= 20000 9007199616606190.000000 = x
k= 17285 9007199616606190.000000 = [x]
r= 31443101 0.000000 = m*(x-[x])
even though that line should have no effect, since argc>1 is false. Has anybody got a plausible explanation for this problem? Is it reproducible on any other systems?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int j, k, m = 10000;
    double r = 31443101, jroot = sqrt(83), x;
    //if (argc>1) r = atol(argv[1]);
    x = r * r * jroot;
    j = m * (x - floor(x));
    k = floor(m * (x - floor(x)));
    printf("j= %9d %24.6f = x\n", j, x);
    printf("k= %9d %24.6f = [x]\n", k, floor(x));
    printf("r= %9.0f %24.6f = m*(x-[x]) \n", r, m * (x - floor(x)));
    return 0;
}
Note, test system = AMD Athlon 64 5200+ system with Linux 2.6.35.14-96.fc14.i686 (i.e., booted to run a 32-bit OS on 64-bit HW) with gcc (GCC) 4.5.1 20100924 (Red Hat 4.5.1-4)
Update -- A few hours ago I posted a comment that code generated with and without the if statement differed only in stack offsets and some skipped code. I now find that comment was not entirely correct; i.e. it is true for non-optimized code, but not true for the -O3 code I executed.
Effect of optimization switch on problem:
-O0 : Both program versions run ok
-O2 or -O3 : Version with comment has error as above, where j=20000 and k=17285
-O1 : Version with comment has j=20000 (an error) and k=0 (OK)
Anyhow, looking at -O3 -S code listings, the two cases differ mostly in skipped if code and stack offsets up to the line before call floor, at which point the with-if code has one more fstpl than the without-if code:
... ;; code without comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
call floor
movl $.LC2, (%esp)
fnstcw 86(%esp)
movzwl 86(%esp), %eax
...
... ;; versus code with comment:
fmul %st, %st(1)
fxch %st(1)
fstpl (%esp)
fxch %st(1)
fstpl 48(%esp)
fstpl 32(%esp)
fstpl 64(%esp)
call floor
movl $.LC3, (%esp)
fnstcw 102(%esp)
movzwl 102(%esp), %eax
...
I haven't figured out the reason for the difference.
Not reproduced on my system, Win7 running Cygwin with gcc 4.3.4. Both with and without the if statement, the value of j is set to zero, not 20000.
My only suggestion would be to use gcc -S to get a look at the assembler output. That should hopefully tell you what's going wrong.
Specifically, generate the assembler output to two separate files, one each for the working and non-working variant, then vgrep them (eyeball them side by side) to try and ascertain the difference.
This is a serious failure in your environment, by the way. With m being 10000 and j coming out as 20000, x - floor(x) would have to equal 2, and I can't for the life of me think of any real number where that would be the case :-)
I think there are two reasons why that line could have an effect:
Without that line, the values of all of these variables can be (and, IMHO, most likely are) determined at compile-time; with that line, the computations have to be performed at run-time. But obviously, the compiler's precomputed values are supposed to be the same as values computed at run-time, and I'm inclined to discount this as the actual reason for the different observed behavior. (It would certainly show up as a huge difference in the assembler output, though!)
On many machines, floating-point arithmetic is performed using more bits in intermediate values than can actually be stored in a double-precision floating-point number. Your second version, by creating two different code-paths to set x, basically restricts x to what can be stored in a double-precision floating-point number, whereas your first version can allow the initially-calculated value for x to still be available as an intermediate value, with extra bits, when computing subsequent values. (This could be the case whether all of these values are computed at compile-time or at run-time.)
The reason that uncommenting that line might affect the result is that without that line, the compiler can see that r and jroot cannot change after initialisation, so it can calculate x at compile-time rather than runtime. When the line is uncommented, r might change, so the calculation of x must be deferred to runtime, which can result it in being done with a different precision (particularly if 387 floating point math is being used).
You can try using -mfpmath=sse -march=native to use the SSE unit for floating point calculations, which doesn't exhibit excess precision; or you can try using the -ffloat-store switch.
Your subtraction x - floor(x) exhibits catastrophic cancellation - that is the root cause of the problem, and something to be avoided ;).
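If you want to see the excess-precision effect in isolation, here is a small sketch (behavior is target-dependent: compiled with -m32 -mfpmath=387 -O2 the two lines can differ, while with SSE math they should agree):

#include <math.h>
#include <stdio.h>

int main(void) {
    double r = 31443101, jroot = sqrt(83);
    double x = r * r * jroot;    /* may be kept in an 80-bit x87 register */
    volatile double stored = x;  /* forces rounding to a 64-bit double */
    printf("%d\n", (int)(10000 * (x - floor(x))));
    printf("%d\n", (int)(10000 * (stored - floor(stored))));
    return 0;
}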
EDITED:
I also do not see a difference when I compile your code on my computer using -O0, -O1, -O2, and -O3.
AMD Phenom Quad 64 bit.
gcc (Ubuntu 4.4.3-4ubuntu5) 4.4.3
I also tried clang (llvm) from release 3.0 with and without the if same results.
I agree that the compiler can pre-compute everything without that if line, but you would definitely see that in the assembly output.
Floating point and C can be nasty; there is a lot you need to know to get it to really work. Forcing the int-to-double conversions is good for accuracy, but be aware that the C library built into the compiler and the C library linked into or used by your program can and will differ, and even when the FPU is good, those libraries have been known to have problems. int-to/from-float conversion is also where FPUs tend to have their bugs (I think I saw that mentioned with TestFloat or somewhere like that). You might try running TestFloat on your system to see if your FPU is good. Between the famous Pentium floating-point bug and the Pentium IV days and beyond, most processors had floating-point bugs: the Pentium III I had was solid, but the Pentium IV I had would fail. I rarely use floating point anymore, so I don't bother testing my systems.
Playing with optimization did change your results according to your edit, so this is most likely a gcc problem or a combination of your code and gcc (and not a hardware FPU problem). Try a different version of gcc on the same computer - 4.4.x instead of 4.5.x, for example.
