Different Truncation Results When Casting

Different Truncation Results When Casting - c

I'm having some some difficulty predicting how my C code will truncate results. Refer to the following:
float fa,fb,fc;
short ia,ib;
fa=160
fb=0.9;
fc=fa*fb;
ia=(short)fc;
ib=(short)(fa*fb);
The results are ia=144, ib=143.
I can understand the reasoning for either result, but I don't understand why the two calculations are treated differently. Can anyone refer me to where this behaviour is defined or explain the difference?
Edit: the results are compiled with MS Visual C++ Express 2010 on Intel core i3-330m. I get the same results on gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) under Virtual Box on the same machine.

The compiler is allowed to use more precision for a subexpression like fa*fb than it uses when assigning to a float variable like fc. So it's the fc= part which is very slightly changing the result (and happening to then make a difference in the integer truncation).

aschepler explained the mechanics of what's going on well, but the fundamental problem with your code is using a value which does not exist as a float in code that depends upon the value of its approximation in an unstable way. If you want to multiply by 0.9 (the actual number 0.9=9/10, not the floating point value 0.9 or 0.9f) you should multiply by 9 then divide by 10, or forget about floating point types and use a decimal arithmetic library.
A cheap and dirty way around the problem, when the unstable points are isolated as in your example here, is to just add a value (typically 0.5) which you know will be larger than the error but smaller than the difference from the next integer before truncating.

This is compiler dependent. On mine (gcc 4.4.3) it produces the same result for both expressions, namely -144, probably because the identical expression is optimized away.
Others explained well what happened. In other words I would say that the differences probably happens because your compiler internally promotes floats to 80 bits fpu registers before performing the multiplication, then convert back either to float or to short.
If my hypothesis is true if you write ib = (short)(float)(fa * fb); you should get the same result than when casting fc to short.

Related

Fortran/C Interlanguage problems: results differ in the 14th digit

I have to use C and Fortran together to do some simulations. In their course I use the same memory in both programming language parts, by defining a pointer in C to access memory allocated by Fortran.
The datatype of the problematic variable is
real(kind=8)
for Fortran, and
double
for C. The results of the same calculations now differ in the respective programming languages, and I need to directly compare them and get a zero. All calculations are done only with the above accuracies. The difference is always in the 13-14th digit.
What would be a good way to resolve this? Any compiler-flags? Just cut-off after some digits?
Many thanks!

Floating point is not perfectly accurate. Ever. Even cos(x) == cos(y) can be false if x == y.
So when doing your comparisons, take this into account, and allow the values to differ by some small epsilon value.

This is a problem with the inaccuracy with floating point numbers - they will be inaccurate and a certain place. You usually compare them either by rounding them to a digit that you know will be in the accurate area, or by providing an epsilon of appropiate value (small enough to not impact further calculations, and big enough to take care of the inaccuracy while comparing).

One thing you might check is to be sure that the FPU control word is the same in both cases. If it is set to 53-bit precision in one case and 64-bit in the other, it would likely produce different results. You can use the instructions fstcw and fldcw to read and load the control word value. Nonetheless, as others have mentioned, you should not depend on the accuracy being identical even if you can make it work in one situation.

Perfect portability is very difficult to achieve in floating point operations. Changing the order of the machine instructions might change the rounding. One compiler might keep values in registers, while another copy it to memory, which can change the precision. Currently the Fortran and C languages allow a certain amount of latitude. The IEEE module of Fortran 2008, when implemented, will allow requiring more specific and therefore more portable floating point computations.

Since you are compiling for an x86 architecture, it's likely that one of the compilers is maintaining intermediate values in floating point registers, which are 80 bits as opposed to the 64 bits of a C double.
For GCC, you can supply the -ffloat-store option to inhibit this optimisation. You may also need to change the code to explicitly store some intermediate results in double variables. Some experimentation is likely in order.

Floating point again

Yesterday I asked a floating point question, and I have another one. I am doing some computations where I use the results of the math.h (C language) sine, cosine and tangent functions.
One of the developers muttered that you have to be careful of the return values of these functions and I should not make assumptions on the return values of the gcc math functions. I am not trying to start a discussion but I really want to know what I need to watch out for when doing computations with the standard math functions.
x

You should not assume that the values returned will be consistent to high degrees of precision between different compiler/stdlib versions.
That's about it.

You should not expect sin(PI/6) to be equal to cos(PI/3), for example. Nor should you expect asin(sin(x)) to be equal to x, even if x is in the domain for sin. They will be close, but might not be equal.

Floating point is straightforward. Just always remember that there is an uncertainty component to all floating point operations and functions. It is usually modelled as being random, even though it usually isn't, but if you treat it as random, you'll succeed in understanding your own code. For instance:
a=a/3*3;
This should be treated as if it was:
a=(a/3+error1)*3+error2;
If you want an estimate of the size of the errors, you need to dig into each operation/function to find out. Different compilers, parameter choice etc. will yield different values. For instance, 0.09-0.089999 on a system with 5 digits precision will yield an error somewhere between -0.000001 and 0.000001. this error is comparable in size with the actual result.
If you want to learn how to do floating point as precise as posible, then it's a study by it's own.

The problem isn't with the standard math functions, so much as the nature of floating point arithmetic.
Very short version: don't compare two floating point numbers for equality, even with obvious, trivial identities like 10 == 10 / 3.0 * 3.0 or tan(x) == sin(x) / cos(x).

you should take care about precision:
Structure of a floating-point number
are you on 32bits, 64 bits Platform ?
you should read IEEE Standard for Binary Floating-Point Arithmetic
there are some intersting libraries such GMP, or MPFR.
you should learn how Comparing floating-point numbers
etc ...

Agreed with all of the responses that say you should not compare for equality. What you can do, however, is check if the numbers are close enough, like so:
if (abs(numberA - numberB) < CLOSE_ENOUGH)
{
// Equal for all intents and purposes
}
Where CLOSE_ENOUGH is some suitably small floating-point value.

optimizing with IEEE floating point - guaranteed mathematical identities?

I am having some trouble with IEEE floating point rules preventing compiler optimizations that seem obvious. For example,
char foo(float x) {
if (x == x)
return 1;
else
return 0;
}
cannot be optimized to just return 1 because NaN == NaN is false. Okay, fine, I guess.
However, I want to write such that the optimizer can actually fix stuff up for me. Are there mathematical identities that hold for all floats? For example, I would be willing to write !(x - x) if it meant the compiler could assume that it held all the time (though that also isn't the case).
I see some reference to such identities on the web, for example here, but I haven't found any organized information, including in a light scan of the IEEE 754 standard.
It'd also be fine if I could get the optimizer to assume isnormal(x) without generating additional code (in gcc or clang).
Clearly I'm not actually going to write (x == x) in my source code, but I have a function that's designed for inlining. The function may be declared as foo(float x, float y), but often x is 0, or y is 0, or x and y are both z, etc. The floats represent onscreen geometric coordinates. These are all cases where if I were coding by hand without use of the function I'd never distinguish between 0 and (x - x), I'd just hand-optimize stupid stuff away. So, I really don't care about the IEEE rules in what the compiler does after inlining my function, and I'd just as soon have the compiler ignore them. Rounding differences are also not very important since we're basically doing onscreen drawing.
I don't think -ffast-math is an option for me, because the function appears in a header file, and it is not appropriate that the .c files that use the function compile with -ffast-math.

Another reference that might be of some use for you is a really nice article on floating-point optimization in Game Programming Gems volume 2, by Yossarian King. You can read the article here. It discusses the IEEE format in quite detail, taking into account implementations and architecture, and provides many optimization tricks.

I think that you are always going to struggle to make computer floating-point-number arithmetic behave like mathematical real-number arithmetic, and suggest that you don't for any reason. I suggest that you are making a type error trying to compare the equality of 2 fp numbers. Since fp numbers are, in the overwhelming majority, approximations, you should accept this and use approximate-equality as your test.
Computer integers exist for equality testing of numerical values.
Well, that's what I think, you go ahead and fight the machine (well, all the machines actually) if you wish.
Now, to answer some parts of your question:
-- for every mathematical identity you are familiar with from real-number arithmetic, there are counter examples in the domain of floating-point numbers, whether IEEE or otherwise;
-- 'clever' programming almost always makes it more difficult for a compiler to optimise code than straightforward programming;
-- it seems that you are doing some graphics programming: in the end the coordinates of points in your conceptual space are going to be mapped to pixels on a screen; pixels always have integer coordinates; your translation from conceptual space to screen space defines your approximate-equality function
Regards
Mark

If you can assume that floating-point numbers used in this module will not be Inf/NaN, you can compile it with -ffinite-math-only (in GCC). This may "improve" the codegen for examples like the one you posted.

You could compare for bitwise equality. Although you might get bitten for some values that are equivalent but bitwise different, it will catch all those cases where you have a true equality as you mentioned. And I am not sure the compiler will recognize what you do and remove it when inlining (which I believe is what you are after), but that can easily be checked.

What happened when you tried it the obvious way and profiled it? or examined the generated asm?
If the function is inlined with values known at the call site, the optimizer has this information available. For example: foo(0, y).
You may be surprised at the work you don't have to do, but at the very least profiling or looking at what the compiler actually does with the code will give you more information and help you figure out where to proceed next.
That said, if you know certain things that the optimizer can't figure out itself, you can write multiple versions of the function, and specify the one you want to call. This is something of a hassle, but at least with inline functions they will all be specified together in one header. It's also quite a bit easier than the next step, which is using inline asm to do exactly what you want.

cosf(M_PI_2) not returning zero

This started suddenly today morning.
Original lines were this
float angle = (x+90)*(M_PI/180.0);
float xx = cosf(angle);
float yy = sinf(angle);
After putting a breakpoint and hovering cursor.. I get the correct answer for yy as 1. but xx is NOT zero.
I tried with cosf(M_PI_2); still no luck.. it was working fine till yesterday.. I did not change any compiler setting etc..
I am using Xcode latest version as of todays date

The first thing to notice is that you're using floats. These are inherently inaccurate, and for most calculations give you only a close approximation of the mathematically-correct answer. Assuming that x in your code has value 0, angle will have a close approximation to π/2. xx will therefore have an approximation to cos(π/2). However, this is unlikely to be exactly zero due to approximation and rounding issues.
If you were able to change your code to us doubles rather than floats you're likely to get more accuracy, and an answer nearer zero. However, if it is important for your code to produce a value of exactly zero at this point, you're going to have to rethink how you're doing the calculations.
If this doesn't answer your particular problem, give us some more details and we'll have another think.

Contrary to what others have said, this is not an x87 co-processor issue. XCode uses SSE for floating-point computation on Intel by default (except for long double arithmetic).
The "problem" is: when you write cosf(M_PI_2), you are actually telling the XCode compiler (gcc or llvm-gcc or clang) to do the following:
Look up the expansion of M_PI_2 in <math.h>. Per the POSIX standard, it is a double precision literal that converts to the correctly rounded value of π/2.
Round the converted double precision value to single precision.
Call the math library function cosf on the single precision value.
Note that, throughout this process, you are not operating on the actual value of π/2. You are instead operating on that value rounded to a representable floating-point number. While cos(π/2) is exactly zero, you are not telling the compiler to do that computation. You are instead telling the compiler to do cos(π/2 + tiny), where tiny is the difference between the rounded value (float)M_PI_2 and the (unrepresentable) exact value of π/2. If cos is computed with no error at all, the result of cos(π/2 + tiny) is approximately -tiny. If it returned zero, that would be an error.
edit: a step-by-step expansion of the computation on an Intel mac with the current XCode compiler:
M_PI_2 is defined to be
1.57079632679489661923132169163975144
but that's not actually a representable double precision number. When the compiler converts it to a double precision value it becomes exactly
1.5707963267948965579989817342720925807952880859375
This is the closest double-precision number to π/2, but it differs from the actual mathematical value of π/2 by about 6.12*10^(-17).
Step (2) rounds this number to single-precision, which changes the value to exactly
1.57079637050628662109375
Which is approximately π/2 + 4.37*10^(-8). When we compute cosf of this number then, we get:
-0.00000004371138828673792886547744274139404296875
which is very nearly the exact value of cosine evaluated at that point:
-0.00000004371139000186241438857289400265215231661...
In fact, it is the correctly rounded result; there is no value that the computation could have returned that would be more accurate. The only error here is that the computation that you asked the compiler to perform is different from the computation that you thought you were asking it to do.

I suspect the answer is as near as damnit to 0 as not to be worth worrying about.
If i run the same thing through I get the answer "-4.3711388e-008" which can also be written as "-0.000000043711388". Which is pretty damned close to 0. Definitely near enough to not worry about it being out at the 8th decimal place.
Edit: Further to what LiraLuna is saying I wrote the following piece of x87 assembler under visual studio
float fRes;
_asm
{
fld1
fld1
fadd st, st(1)
fldpi
fdiv st, st(1)
fcos
fstp [fRes]
}
char str[16];
sprintf( str, "%f", fRes );
Basically this uses the x87's fcos instruction to do a cosine of pi/2. the value held in str is "0.000000"
This, however, is not actually what fcos returned. It ACTUALLY returned 6.1230318e-017. This implies that the error occurs at the 17th decimal place and, lets be honest, thats far less significant than the standard debug cosf above.
As SSE3 has no specific cosine instruction I suspect (though i cannot confirm without seeing the assembler generated) that it is either using its own taylor series expansion or it is using the fcos instruction anyway. Either way you are still unlikely to get better precision than the error occurring at the 17th decimal place, in my opinion.

The only thing I can think of is a malicious macro substituion i.e. M_PI_2 is no longer 1.57079632679489661923.
Try calling cosf( 1.57079632679489661923 ) to test this.

The real thing you should be careful about is the sign of cosine. Make sure it is the same as you expected. E.g. if you operate with angles between 0 and pi/2. make sure that what you use as PI_2 is less that actual value of pi/2!
And the difference between 0.000001 and 0.0 is less than you think.

The reason
What you are experiencing is the infamous x87 math co-processor float truncate 'bug' - or rather - a feature. IEEE floats have an amazing range of numbers, but at a cost. They sacrifice precession for high range.
They are not inaccurate as you think, though - this is a semi-myth generate by Intel's x87 chip design, that internally uses 80bit internal representation for floats - they have far superior precession though a bit slower.
When you perform a float comparison, x87 caches the float as an 80bit float, then when it's stack is full, it saves the 32bit representation in RAM, decreasing accuracy by a large degree.
The solution
x87 is old, really old. It's replacement is SSE. SSE computes 32bit floats and 64bit floats natively, leading to minimal precession lost on math. Please note that precession issues with floats still exist, but printf("%f\n", cosf(M_PI_2)); should be zero. Heck - even float comparison with SSE is accurate again! (unlike x87).
Since latest Xcode is actually GCC 4.2.1, use the compiler switch -msse3 -mfpmath=sse and see how you get a perfectly round 0.00000 (Note: if you get -0.00000, do not worry, it's perfectly fine and still equals 0.00000 under the IEEE spec (read more at this wikipedia article)).
All Intel macs are guaranteed to have SSE3 support (OSx86 Macs excluded, if you want to support those, use -msse2).

Floating-point precision when moving from i386 to x86_64

I have an application that was developed for Linux x86 32 bits. There are lots of floating-point operations and a lot of tests depending on the results. Now we are porting it to x86_64, but the test results are different in this architecture. We don't want to keep a separate set of results for each architecture.
According to the article An Introduction to GCC - for the GNU compilers gcc and g++ the problem is that GCC in X86_64 assumes fpmath=sse while x86 assumes fpmath=387. The 387 FPU uses 80 bit internal precision for all operations and only convert the result to a given floating-point type (float, double or long double) while SSE uses the type of the operands to determine its internal precision.
I can force -mfpmath=387 when compiling my own code and all my operations work correctly, but whenever I call some library function (sin, cos, atan2, etc.) the results are wrong again. I assume it's because libm was compiled without the fpmath override.
I tried to build libm myself (glibc) using 387 emulation, but it caused a lot of crashes all around (don't know if I did something wrong).
Is there a way to force all code in a process to use the 387 emulation in x86_64? Or maybe some library that returns the same values as libm does on both architectures? Any suggestions?
Regarding the question of "Do you need the 80 bit precision", I have to say that this is not a problem for an individual operation. In this simple case the difference is really small and makes no difference. When compounding a lot of operations, though, the error propagates and the difference in the final result is not so small any more and makes a difference. So I guess I need the 80 bit precision.

I'd say you need to fix your tests. You're generally setting yourself up for disappointment if you assume floating point math to be accurate. Instead of testing for exact equality, test whether it's close enough to the expected result. What you've found isn't a bug, after all, so if your tests report errors, the tests are wrong. ;)
As you've found out, every library you rely on is going to assume SSE floating point, so unless you plan to compile everything manually, now and forever, just so you can set the FP mode to x87, you're better off dealing with the problem now, and just accepting that FP math is not 100% accurate, and will not in general yield the same result on two different platforms. (I believe AMD CPU's yield slightly different results in x87 math as well).
Do you absolutely need 80-bit precision? (If so, there obviously aren't many alternatives, other than to compile everything yourself to use 80-bit FP.)
Otherwise, adjust your tests to perform comparisons and equality tests within some small epsilon. If the difference is smaller than that epsilon, the values are considered equal.

80 bit precision is actually dangerous. The problem is that it is actually preserved as long as the variable is stored in the CPU register. Whenever it is forced out to RAM, it is truncated to the type precision. So you can have a variable actually change its value even though nothing happened to it in the code.

If you want long double precision, use long double for all of your floating point variables, rather than expecting float or double to have extra magic precision. This is really a no-brainer.

SSE floating point and 387 floating point use entirely different instructions, and so there's no way to convince SSE fp instructions to use the 387. Probably the best way to deal with this is resign your test suite to getting slightly different results, and not depend on results being the same to the last bit.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight