Round floating-point value to e.g. single precision - c

C and C++ provide floating-point data types of several widths, but they leave precision unspecified. The compiler is free to use idealized arithmetic to simplify expressions, to use double precision in computing an expression over float values, or to use a double-precision register to keep the value of a float variable or common subexpression.
Correct me if I'm wrong (this claim is wrong, see the edit), but it's even legal to hoist a float in memory into a double-precision register, so storing a value and then loading it back doesn't necessarily truncate bits.
What is the safest, most portable way to convert a number to a lower precision? Ideally, it should be efficient too, compiling to cvtsd2ss on SSE2. (So, while volatile may be an answer, I'd prefer something better.)
Edit: Summarizing some of the comments and findings…
Wider precision for intermediate results is always fair game.
Expression simplification is allowed in C++, and in C given FP_CONTRACT on.
Using double precision for a single-precision float is not allowed (in C or C++).
However, some compilers (particularly GCC on x86-32) illegally forget some precision conversions.
Edit 2: Some folks are expressing doubt as to the conformance of failing to narrow intermediate results.
C11 § (same as the C99 ref cited in the answer) is specific about "remove all extra range and precision" because it specifies how other computations may be done in wider precision. Among several conforming alternative precisions is "indeterminable," which to me means no constraint whatsoever.
C11 §7.12.2 and §6.5/8 define #pragma STDC FP_CONTRACT ON, which enables the compiler to use infinite precision where possible.
The intermediate operations in the contracted expression are evaluated as if to infinite range and precision, while the final operation is rounded to the format determined by the expression evaluation method. A contracted expression might also omit the raising of floating-point exceptions.
C++14 likewise specifically waives the constraints of finite precision and range on intermediate results. N4567 §5/12:
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
Note that allowing the identity x - x = 0 to simplify a + b - b + c into a + c is not the same as making addition commutative or associative. a + b + c is still not the same as a + c + b or a + (b + c), when the CPU only provides addition with two addends and a rounded result.

The C99 standard explicitly says that
assignment and cast [..] remove all extra range and precision
So, if you want to limit the range and precision to that of a float, just cast to float, or assign to a float variable.
You can even do stuff like (double)((float)d) (with extra parentheses to make sure humans read it correctly), limiting a variable d to float precision and range, then casting it back to double. (A standard C compiler is NOT allowed to optimize that away even if d is a double; it must limit the precision and range to that of a float.)
I've used this in practical implementations of e.g. Kahan summation algorithm, where it can be utilized to allow the C compiler to do very aggressive optimization, but without risk of invalidation.
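As a minimal sketch of the cast idiom described above (the helper name round_to_float is made up for illustration, and the behaviour assumes a conforming compiler, IEEE 754 float, the default rounding mode, and no options such as -ffast-math):

```c
/* Hypothetical helper: limit a double to the range and precision of float,
   then widen it again. The cast to float is what discards the extra bits. */
static double round_to_float(double d)
{
    return (double)((float)d);
}
```

Under those assumptions, round_to_float(16777217.0) yields 16777216.0, because 2^24 + 1 is not representable as a float, while round_to_float(0.5) returns 0.5 unchanged.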

I'm not so sure I share your fear here ... I tried this glorified cast-as-a-function:
float to_float(double x)
{
    return (float) x;
}
when entered into the Compiler Explorer, I get this:
push rbp
mov rbp, rsp
movsd QWORD PTR [rbp-8], xmm0
cvtsd2ss xmm0, QWORD PTR [rbp-8]
pop rbp
That seems to generate the requested opcode (cvtsd2ss) right away, and I didn't even enter any compiler options to force SSE2 or anything.
I'd say that a cast has to convert to the target type, the compiler isn't free to ignore casts as far as I know.
Can you provide some case where you think the compiler can ignore a cast, that you've seen happen? Perhaps there's undefined behavior of some kind lurking in the code, which makes the compiler take unexpected shortcuts.


can't figure out the sizeof(long double) in C is 16 bytes or 10 bytes

Although there are some answers on this website, I still can't figure out the meaning of sizeof(long double). Why does printing var3 output 3.141592653589793115998?
When I run code from another person, my output differs from theirs. Could somebody help me solve this problem?
My testing codes:
float var1 =3.1415926535897932;
double var2=3.1415926535897932;
long double var3 =3.141592653589793213456;
printf("%d\n",sizeof(long double));
output of my testing codes:
Codes are the same with another person, but the output from another person is:
Could somebody tell me why the output of us are different?
Floating point numbers -inside our computers- are not mathematical real numbers.
They have lots of counter-intuitive properties (e.g. (1.0-x) + x in your C code can differ from 1). For more, read the
Also be aware that a number is not its representation in digits. For example, most of your examples are approximations of the number π (which, intuitively speaking, has an infinite number of digits or bits, since it is a transcendental number, as proven by Ferdinand von Lindemann in 1882). The continuum hypothesis is related.
I still can't figure out the meaning of sizeof(long double).
It is the implementation specific ratio of the number of bytes (or octets) in a long double automatic variable vs the number of bytes in a char automatic variable.
The C11 standard (read n1570 and see this reference) does allow an implementation to have sizeof(long double) being, like sizeof(char), equal to 1. I cannot name such an implementation, but it might be (theoretically) the case on some weird computer architectures (e.g. some DSP).
Could somebody tell me why the output of us are different?
What makes you think they should be equal?
Practically speaking, floating point numbers are often IEEE754. But on IBM mainframes (e.g. z/Series) or on VAXes they are not.
float var1 =3.1415926535897932;
double var2 =3.1415926535897932;
Be aware that it could be possible to have a C implementation where (double)var1 != var2 or where var1 != (float)var2 after executing these above instructions.
If you need more precision than what long double achieves on your particular C implementation (e.g. your recent GCC compiler, which could be a cross-compiler), consider using some arbitrary precision arithmetic library such as GMPlib.
I recommend carefully reading the documentation of printf(3), and of every other function that you are using from your C standard library. I also suggest reading the documentation of your C compiler.
You might be interested by static program analysis tools such as Frama-C or the Clang static analyzer. Read also this draft report.
If your C compiler is a recent GCC, compile with all warnings and debug info, so gcc -Wall -Wextra -g and learn how to use the GDB debugger.
Could somebody tell me why the output of us are different?
C allows different compilers/implementations to use different floating point encoding and handle evaluations in slightly different ways.
The difference in sizeof hints that the two implementations may employ different precision. Yet the difference could be due to padding: in this case, extra bytes added to preserve alignment for performance reasons.
A better precision assessment is to print epsilon: the difference between 1.0 and the next larger value of the type.
#include <float.h>
printf("%e %e %Le\n", FLT_EPSILON, DBL_EPSILON, LDBL_EPSILON);
Sample result
1.192093e-07 2.220446e-16 1.084202e-19
Evaluation can also differ through FLT_EVAL_METHOD. When it is 0, floating-point operations are evaluated in their own type. With other values such as 2, floating-point operations are evaluated using wider types, and only at the end is the result saved to the target type.
Two of several possible values indicated below:
0 evaluate all operations and constants just to the range and precision of the type;
2 evaluate all operations and constants to the range and precision of the long double type.
Notice the constants 3.1415926535897932 and 3.141592653589793213456 are both normally double constants. Neither has an L suffix that would make it a long double. Both have the same double value of 3.1415926535897931..., so var2 and var3 should get the same value. Yet with FLT_EVAL_METHOD==2, constants can be evaluated as long double, and that is certainly what happened in "the output from another person" code.
Print FLT_EVAL_METHOD to see that difference.
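A small sketch of such a report follows. The function names are invented for illustration, and the values printed depend entirely on the compiler and target, so no particular output is implied:

```c
#include <float.h>
#include <stdio.h>

/* Returns nonzero when the L suffix changes the stored value, i.e. when the
   unsuffixed constant was rounded to double before being widened. */
static int suffix_changes_value(void)
{
    long double without_suffix = 3.141592653589793213456;   /* double constant */
    long double with_suffix    = 3.141592653589793213456L;  /* long double constant */
    return without_suffix != with_suffix;
}

/* Prints the facts that explain differing output between two machines. */
static void report_float_environment(void)
{
    printf("FLT_EVAL_METHOD     = %d\n", (int)FLT_EVAL_METHOD);
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    printf("LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG);
    printf("L suffix matters    = %d\n", suffix_changes_value());
}
```

Calling report_float_environment() on both machines should show whether the difference comes from the evaluation method, from the width of long double, or only from padding (same LDBL_MANT_DIG, different sizeof). Note the %zu conversion, which is the one that matches the size_t result of sizeof.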

Union in C changing machine behavior of Float Addition

New to C programming, and I've been told to avoid unions which in general makes perfect sense and I agree with. However, as part of an academic exercise I'm writing an emulator for hardware single-precision floating point addition by doing bit manipulation operations on unsigned 32-bit integers. I only mention that to explain why I want to use unions; I'm having no trouble with the emulation.
In order to test this emulator, I wrote a test program. But of course I'm trying to find the bit representation of floats on my hardware, so I thought this could be the perfect use for a union. I wrote this union:
typedef union {
float floatRep;
uint32_t unsignedIntRep;
} FloatExaminer;
This way, I can initialize a float with the floatRep member and then examine the bits with the unsignedIntRep member.
This worked most of the time, but when I got to NaN addition, I started running into trouble. The exact situation was that I wrote a function to automate these tests. The gist of it was this:
void addTest(float op1, float op2){
    FloatExaminer result;
    result.floatRep = op1 + op2;
    printf("%f + %f = %f\n", op1, op2, result.floatRep);
    //print bit pattern as well
    printf("Bit pattern of result: %08x", result.unsignedIntRep);
}
OK, now for the confusing part:
I added a NAN and a NAN with different mantissa bit patterns to differentiate between the two. On my particular hardware, it's supposed to return the second NAN operand (making it quiet if it was signalling). (I'll explain how I know this below.) However, passing the bit patterns op1=0x7fc00001, op2=0x7fc00002 would return op1, 0x7fc00001, every time!
I know it's supposed to return the second operand because I tried--outside the function--initializing as an integer and casting to a float as below:
uint32_t intRep1 = 0x7fc00001;
uint32_t intRep2 = 0x7fc00002;
float *op1 = (float *) &intRep1;
float *op2 = (float *) &intRep2;
float result = *op1 + *op2;
uint32_t *intResult = (uint32_t *)&result;
printf("%08x", *intResult); //bit pattern 0x7fc00002
In the end, I've concluded that unions are evil and I should never use them. However, does anyone know why I'm getting the result I am? Did I make stupid mistake or assumption? (I understand that hardware architecture varies, but this just seems bizarre.)
I'm assuming that when you say "my particular hardware", you are referring to an Intel processor using SSE floating point. But in fact, that architecture has a different rule, according to the Intel® 64 and IA-32 Architectures Software Developer's Manual. Here's a summary of Table 4.7 ("Rules for handling NaNs") from Volume 1 of that documentation, which describes the handling of NaNs in arithmetic operations: (QNaN is a quiet NaN; SNaN is a signalling NaN; I've only included information about two-operand instructions)
SNaN and QNaN
x87 FPU — QNaN source operand.
SSE — First source operand, converted to a QNaN.
Two SNaNs
x87 FPU — SNaN source operand with the larger significand, converted to a QNaN
SSE — First source operand, converted to a QNaN.
Two QNaNs
x87 FPU — QNaN source operand with the larger significand
SSE — First source operand
NaN and a floating-point value
x87/SSE — NaN source operand, converted to a QNaN.
SSE floating point machine instructions generally have the form op xmm1, xmm2/m32, where the first operand is the destination register and the second operand is either a register or a memory location. The instruction will then do, in effect, xmm1 <- xmm1 (op) xmm2/m32, so the first operand is both the left-hand operand of the operation and the destination. That's the meaning of "first operand" in the above chart. AVX adds three-operand instructions, where the destination might be a different register; it is then the third operand and does not figure in the above chart. The x87 FPU uses a stack-based architecture, where the top of the stack is always one of the operands and the result replaces either the top of the stack or the other operand; in the above chart, it will be noted that the rules do not attempt to decide which operand is "first", relying instead on a simple comparison.
Now, suppose we're generating code for an SSE machine, and we have to handle the C statement:
a = b + c;
where none of those variables are in a register. That means we might emit code something like this: (I'm not using real instructions here, but the principle is the same)
LOAD r1, b (r1 <- b)
ADD r1, c (r1 <- r1 + c)
STORE r1, a (a <- r1)
But we could also do this, with (almost) the same result:
LOAD r1, c (r1 <- c)
ADD r1, b (r1 <- r1 + b)
STORE r1, a (a <- r1)
That will have precisely the same effect, except for additions involving NaNs (and only when using SSE). Since arithmetic involving NaNs is unspecified by the C standard, there is no reason why the compiler should care which of these two options it chooses. In particular, if r1 happened to already have the value c in it, the compiler would probably choose the second option, since it saves a load instruction. (And who is going to complain? We all want the compiler to generate code which runs as quickly as possible, no?)
So, in short, the order of the operands of the ADD instruction will vary with the intricate details of how the compiler chooses to optimize the code, and the particular state of the registers at the moment in which the addition operator is being emitted. It is possible that this will be affected by the use of a union, but it is equally or more likely that it has to do with the fact that in your code using the union, the values being added are arguments to the function and therefore are already placed in registers.
Indeed, different versions of gcc, and different optimization settings, produce different results for your code. And forcing the compiler to emit x87 FPU instructions produces yet different results, because the hardware operates according to a different logic.
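To experiment with this, here is a sketch that inspects the bits with memcpy instead of the pointer casts used in the question (dereferencing a float * that points at a uint32_t object breaks C's aliasing rules; reading the other member of a union, as FloatExaminer does, is permitted in C). The helper names are hypothetical, and the code assumes float is a 32-bit IEEE 754 type:

```c
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a float into an integer. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Build a float from a 32-bit pattern. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Add two floats given as bit patterns and return the bit pattern of the sum.
   volatile keeps the compiler from folding the addition at compile time. */
static uint32_t add_bits(uint32_t lhs, uint32_t rhs)
{
    volatile float a = bits_to_float(lhs);
    volatile float b = bits_to_float(rhs);
    float sum = a + b;
    return float_to_bits(sum);
}
```

add_bits(0x7fc00001u, 0x7fc00002u) always returns some NaN pattern, but which payload survives depends on the operand order the compiler happens to emit, so printing the result at different optimization levels is a way to observe the effect described above.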
If you want some bedtime reading, you can download the entire Intel SDM (currently 4,684 pages / 23.3MB, but it keeps on getting bigger) from their site.

Clang reciprocal to 1 optimisations

After a discussion with colleagues, I ended up testing whether clang would optimize two divisions, one of them a reciprocal (1 / x), into a single division.
const float x = a / b; //x not used elsewhere
const float y = 1 / x;
Theoretically clang could optimize to const float y = b / a if x is used only as a temporary step value, no?
Here's the input and output of a simple test case: (in both outputs you can see that it's performing the two divisions, instead of optimizing)
This related question is beyond my comprehension and seems to focus only on why a specific instruction isn't used, whereas in my case it's the optimisation that isn't done: Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math
No, clang can not do that.
But first, why are you using float? float has six digits precision, double has 15. Unless you have a good reason, that you can explain, use double.
1 / (a / b) in floating-point arithmetic is not the same as b / a. What the compiler has to do, is in the first case:
Divide a by b
Round the result to the nearest floating-point number
Divide 1 by the result
Round the result to the nearest floating-point number.
In the second case:
Divide b by a.
Round the result to the nearest floating-point number.
The compiler can only change the code if the result is guaranteed to be the same, and if the compiler writer cannot produce a mathematical proof that the result is the same, the compiler cannot change the code. There are two rounding operations in the first case, rounding different numbers, so it is unlikely that the result can be guaranteed to be the same.
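A concrete pair of inputs shows the two roundings at work. This is only a sketch with made-up function names, and it assumes IEEE 754 binary32 float, round-to-nearest, and evaluation in float itself (FLT_EVAL_METHOD == 0, as with SSE on x86-64):

```c
/* 1 / (a / b), with each step rounded to float.
   volatile forces both roundings to happen as written. */
static float reciprocal_of_quotient(float a, float b)
{
    volatile float x = a / b;      /* first rounding */
    volatile float y = 1.0f / x;   /* second rounding */
    return y;
}

/* b / a, rounded once. */
static float direct_quotient(float a, float b)
{
    return b / a;
}
```

With a = 1.0f and b = 7.0f, the quotient 1/7 is rounded up to the nearest float, and the reciprocal of that slightly too large value rounds to 7 - 2^-21 rather than 7, whereas direct_quotient returns exactly 7. The two forms are therefore not interchangeable without something like -ffast-math.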
The compiler doesn't think like a mathematician. Where you think simplifying the expression is trivial mathematically, the compiler has a lot of other things to consider. It is actually quite likely that the compiler is much smarter than the programmer and also knows far more about the C standard.
Something like this is probably what goes through the optimizing compiler's "mind":
Ah they wrote a / b but only use x at one place, so we don't have to allocate that variable on the stack. I'll remove it and use a CPU register.
Hmm, integer literal 1 divided with a float variable. Okay, we have to invoke balancing here before anything else and turn that literal into a float 1.0f.
The programmer is counting on me to generate code that contains the potential floating point inaccuracy involved in dividing 1.0f with another float variable! So I can't just swap this expression with b / a because then that floating point inaccuracy that the programmer seems to want here would be lost.
And so on. There's a lot of considerations. What machine code you end up with is hard to predict in advance. Just know that the compiler follows your instructions to the letter.

Is operator ≤ UB for floating point comparison?

There are numerous references on the subject (here or here). However, I still fail to understand why the following is not considered UB and properly reported by my favorite compiler (insert clang and/or gcc) with a neat warning:
// f1, f2 and epsilon are defined as double
if ( f1 / f2 <= epsilon )
As per C99:TC3, §8: we have:
Except for assignment and cast (which remove all extra range and
precision), the values of operations with floating operands and values
subject to the usual arithmetic conversions and of floating constants
are evaluated to a format whose range and precision may be greater
than required by the type. [...]
Using typical compilation, f1 / f2 would be read directly from the FPU. I've tried this using gcc -m32, with gcc 5.2. So f1 / f2 is (over here) in an 80-bit floating point register (just a guess, I don't have the exact spec here). There is no type promotion here (per the standard).
I've also tested clang 3.5, this compiler seems to cast the result of f1 / f2 back to a normal 64 bits floating point representation (this is an implementation defined behavior but for my question I prefer the default gcc behavior).
As per my understanding, the comparison will be done between a type for which we don't know the size (i.e. a format whose range and precision may be greater) and epsilon, whose size is exactly 64 bits.
What I really find hard to understand is an equality comparison between a well-known C type (e.g. 64-bit double) and something whose range and precision may be greater. I would have assumed that somewhere in the standard some kind of promotion would be required (e.g. the standard would mandate that epsilon be promoted to a wider floating point type).
So the only legitimate syntaxes should instead be:
if ( (double)(f1 / f2) <= epsilon )
double res = f1 / f2;
if ( res <= epsilon )
As a side note, I would have expected the literature to document only the operator <, in my case:
if ( f1 / f2 < epsilon )
Since it is always possible to compare floating point values of different sizes using operator <.
So in which cases would the first expression make sense? In other words, how could the standard define some kind of equality operator between two floating point representations of different sizes?
EDIT: The whole confusion here was that I assumed it was possible to compare two floats of different sizes, which cannot possibly happen. (Thanks @DevSolar!)
<= is well-defined for all possible floating point values.
There is one exception though: the case when at least one of the arguments is uninitialised. But that's more to do with reading an uninitialised variable being UB; not the <= itself
I think you're confusing implementation-defined with undefined behavior. The C language doesn't mandate IEEE 754, so all floating point operations are essentially implementation-defined. But this is different from undefined behavior.
After a bit of chat, it became clear where the miscommunication came from.
The quoted part of the standard explicitly allows an implementation to use wider formats for floating operands in calculations. This includes, but is not limited to, using the long double format for double operands.
The standard section in question also does not call this "type promotion". It merely refers to a format being used.
So, f1 / f2 may be done in some arbitrary internal format, but without making the result any other type than double.
So when the result is compared (by either <= or the problematic ==) to epsilon, there is no promotion of epsilon (because the result of the division never got a different type), but by the same rule that allowed f1 / f2 to happen in some wider format, epsilon is allowed to be evaluated in that format as well. It is up to the implementation to do the right thing here.
The value of FLT_EVAL_METHOD might tell exactly what an implementation is doing (if set to 0, 1, or 2 respectively), or it might have a negative value, which indicates "indeterminable" (-1) or "implementation-defined" (other negative values), which means "look it up in your compiler manual".
This gives an implementation "wiggle room" to do any kind of funny things with floating operands, as long as at least the range / precision of the actual type is preserved. (Some older FPUs had "wobbly" precisions, depending on the kind of floating operation performed. The quoted part of the standard caters for exactly that.)
In no case may any of this lead to undefined behaviour. Implementation-defined, yes. Undefined, no.
The only case where you would get undefined behavior is when a large floating point variable gets demoted to a smaller one which cannot represent the contents. I don't quite see how that applies in this case.
The text you quote is concerned about whether or not floats may be evaluated as doubles etc, as indicated by the text you unfortunately didn't include in the quote:
The use of evaluation formats is characterized by the
implementation-defined value of FLT_EVAL_METHOD:
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
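The list above translates directly into a lookup. The following is a sketch with a hypothetical function name; the strings paraphrase the standard's wording rather than quote it:

```c
#include <float.h>

/* Map a FLT_EVAL_METHOD value to a short description of the evaluation format. */
static const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "each operation uses the range and precision of its own type";
    case 1:  return "float and double operations use double; long double uses long double";
    case 2:  return "all operations use long double";
    case -1: return "indeterminable";
    default: return "implementation-defined; see the compiler manual";
    }
}
```

Passing the macro itself, as in puts(describe_eval_method(FLT_EVAL_METHOD)) with <stdio.h> included, reports what the current compiler and flags actually do.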
However, I don't believe this macro overwrites the behavior of the usual arithmetic conversions. The usual arithmetic conversions guarantee that you can never compare two float variables of different size. So I don't see how you could run into undefined behavior here. The only possible issue you would have is performance.
In theory, in case FLT_EVAL_METHOD == 2 then your operands could indeed get evaluated as type long double. But please note that if the compiler allows such implicit promotions to larger types, there will be a reason for it.
According to the text you cited, explicit casting will counter this compiler behavior.
In which case the code if ( (double)(f1 / f2) <= epsilon ) is nonsense. By the time you cast the result of f1 / f2 to double, the calculation is already done and has been carried out in long double. The calculation of the result <= epsilon will however be carried out in double, since you forced this with the cast.
To avoid long double entirely, you would have to write the code as:
if ( (double)((double)f1 / (double)f2) <= epsilon )
or to increase readability, preferably:
double div = (double)f1 / (double)f2;
if( (double)div <= (double)epsilon )
But again, code like this does only make sense if you know that there will be implicit promotions, which you wish to avoid to increase performance. In practice, I doubt you'll ever run into that situation, as the compiler is most likely far more capable than the programmer to make such decisions.

How to avoid floating point round off error in unit tests?

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
static const int N = 1024;
static const float MSCALAR = 42.42f;
static void setup(void) {
    input = _mm_malloc(sizeof(*input) * N, 16);
    ainput = _mm_malloc(sizeof(*ainput) * N, 16);
    output = _mm_malloc(sizeof(*output) * N, 16);
    expected = _mm_malloc(sizeof(*expected) * N, 16);
    memset(output, 0, sizeof(*output) * N);
    for (int i = 0; i < N; i++) {
        input[i] = i * 0.4f;
        ainput[i] = i * 2.1f;
        expected[i] = (input[i] * MSCALAR) + ainput[i];
    }
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean should pass on both 32 and 64 bit architectures.
Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
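The exact-values approach could look roughly like the following sketch. All names here are hypothetical, scale_add is only a plain C stand-in for the SSE function under test, and the particular moduli are arbitrary choices that keep every value a small integer:

```c
#include <stddef.h>

/* Fill A and C so that A[i] * b + C[i] is exact in float when b is a small
   power of two: every product and sum fits in float's 24-bit significand. */
static void fill_exact_inputs(float *A, float *C, float b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        A[i] = (float)(i % 64);        /* small integers 0..63 */
        C[i] = (float)(i % 32) * b;    /* small integers times b */
    }
}

/* Plain C reference for the operation being tested. */
static void scale_add(const float *A, float b, const float *C, float *D, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        D[i] = A[i] * b + C[i];
}

/* Count elements that differ at all (NaN counts as a mismatch).
   With exact inputs, any nonzero count indicates a bug, not rounding. */
static size_t count_mismatches(const float *actual, const float *expected, size_t n)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (!(actual[i] == expected[i]))
            ++mismatches;
    return mismatches;
}
```

With b = 8.0f, for instance, element 100 is 36 * 8 + 4 * 8 = 320 on any conforming implementation, whether the arithmetic is done by x87, SSE, or a fused multiply-add, so the test can compare with == and needs no tolerance.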
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^-24, provided all the values are in the normal range of the floating-point format. (2^-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2^-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2^-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^-24+2^-53+2^-48)+|c|•(2•2^-24+2^-53+2^-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^-24+|c|•2.001•2^-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)
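Packaged as a helper, the check might look like the sketch below. The function name and parameter list are invented for illustration. It departs from the snippet above in two labeled ways: the tolerance is computed in double, and the comparison is <= rather than <, so that a case where a•b and c are both zero (tolerance zero, such as index 0 of the question's data) still passes when the result is exact:

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when `actual` is close enough to `expected` for a float
   computation of a*b + c, using the bound derived above.
   A NaN in `actual` makes the comparison false, so it is reported as a failure. */
static bool result_within_tolerance(float a, float b, float c,
                                    float actual, float expected)
{
    double tolerance = fabs((double)a * b) * 0x3.001p-24
                     + fabs((double)c)     * 0x2.001p-24;
    double difference = fabs((double)actual - (double)expected);
    return difference <= tolerance;
}
```

A test loop would then call result_within_tolerance(input[i], MSCALAR, ainput[i], output[i], expected[i]) and report an error whenever it returns false.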
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as 32-bit compiler, you can tell it to generate SSE2 code still with options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics, and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
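As a brief illustration, the constant from the question could be spelled in hexadecimal as follows. The names MSCALAR_EXACT and same_constant are hypothetical, and the equality holds only on a compiler that rounds decimal literals to the nearest float, which the previous paragraph treats as the normal case:

```c
/* The float nearest to 42.42, written exactly in C99 hexadecimal notation. */
static const float MSCALAR_EXACT = 0x1.535c28p+5f;

/* Returns 1 when the compiler converts the decimal literal to the same float. */
static int same_constant(void)
{
    return MSCALAR_EXACT == 42.42f;
}
```

Because the hexadecimal form names the bits directly, every conforming compiler has to produce the same value for it, which removes the literal conversion as a source of differences between platforms.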
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.
