How to avoid floating point round off error in unit tests? - c

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
#include <string.h>      /* memset */
#include <xmmintrin.h>   /* _mm_malloc (on MSVC it is declared in <malloc.h>) */

static const int N = 1024;
static const float MSCALAR = 42.42f;

/* single-precision input/output buffers, allocated in setup() */
static float *input, *ainput, *output, *expected;

static void setup(void) {
    input    = _mm_malloc(sizeof(*input) * N, 16);
    ainput   = _mm_malloc(sizeof(*ainput) * N, 16);
    output   = _mm_malloc(sizeof(*output) * N, 16);
    expected = _mm_malloc(sizeof(*expected) * N, 16);
    memset(output, 0, sizeof(*output) * N);
    for (int i = 0; i < N; i++) {
        input[i] = i * 0.4f;
        ainput[i] = i * 2.1f;
        expected[i] = (input[i] * MSCALAR) + ainput[i];
    }
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean it should pass on both 32- and 64-bit architectures.

Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
    D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
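For example, here is a minimal sketch of such an exact-value setup (my own illustration, not part of the original answer), reusing the arrays from the question but replacing MSCALAR with a power of two and filling the inputs with small integers:
static const float B = 8.0f;                /* a power of two, so input[i] * B is exact */
for (int i = 0; i < N; i++) {
    input[i]    = (float)(i % 64);          /* small integers */
    ainput[i]   = (float)(i % 32) * B;      /* small integers times B */
    expected[i] = input[i] * B + ainput[i]; /* every intermediate and final result is exact */
}
With data like this, the function under test must reproduce expected[] exactly, and any mismatch at all indicates a bug rather than rounding.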
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2^-24, provided all the values are in the normal range of the floating-point format. (2^-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2^-24 + 2^-53 + 2^-48) + |c|•(2•2^-24 + 2^-53 + 2^-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2^-24 + |c|•2.001•2^-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)

On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as a 32-bit compiler, you can tell it to generate SSE2 code with the options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.
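As a quick way to see exactly what value your compiler stored for 42.42f, you can print it with C99's %a conversion (this snippet is my own illustration, not part of the answer above):
#include <stdio.h>

int main(void) {
    float m = 42.42f;
    printf("%a\n", m);     /* exact hexadecimal form; a correctly rounding compiler stores 0x1.535c28p+5 */
    printf("%.9g\n", m);   /* decimal form with enough digits to round-trip a float */
    return 0;
}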

Related

platform independent way to reduce precision of floating point constant values

The use case:
I have some large data arrays containing floating point constants.
The file defining that array is generated and the template can be easily adapted.
I would like to run some tests on how reduced precision influences the results, both in terms of quality and in terms of compressibility of the binary.
Since I do not want to change other source code than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (setting the lower ones to 0). But since floating point literals are written in decimal, it is difficult to specify numbers in a way that the binary representation contains all zeros in the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(float) /* some macro */
static const float32_t veryLargeArray[] = {
    FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
    // ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)
IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.
#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x)))
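As an illustration of how this might be used (my own sketch, not part of the answer), assuming float32_t is plain float and using 0x1p8f to remove eight bits:
typedef float float32_t;   /* assumption: float32_t is the IEEE binary32 type */

static const float32_t veryLargeArray[] = {
    RemoveBits(23.423f,   0x1p8f),   /* low eight bits of the significand removed, with rounding */
    RemoveBits(0.000023f, 0x1p8f),
    RemoveBits(290.2342f, 0x1p8f),
};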
What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms that use extended precision by default (including 32-bit x86) under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
#define FP_REDUCE(x,p) ((x)+(p)-(p))
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, a preprocessor-token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
Edit: I believe this is fixable, so as not to need an absolute precision but rather automatically scale to the value, but it depends on correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
union fr { float x; uint32_t r; };
#define FP_REDUCE(x) ((union fr){.r=(union fr){x}.r & (0xffffffffu<<n)}.x)
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.
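If you want to check the effect of the masking at run time rather than in an initializer, here is a small self-contained sketch of the same idea (my own, assuming IEEE binary32 float):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Clears the low n bits of a float's significand (rounds toward zero),
   the same masking idea as FP_REDUCE above, as a run-time function. */
static float reduce_bits(float x, int n) {
    uint32_t r;
    memcpy(&r, &x, sizeof r);   /* bit copy avoids strict-aliasing problems */
    r &= 0xffffffffu << n;
    memcpy(&x, &r, sizeof r);
    return x;
}

int main(void) {
    printf("%a -> %a\n", 23.423f, reduce_bits(23.423f, 8));
    return 0;
}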

sine cosine modular extended precision arithmetic

I've seen in many implementations of sine/cosine a so-called extended modular precision arithmetic. But what is it for?
For instance, in the cephes implementation, after reduction to the range [0, pi/4], they do this modular precision arithmetic to improve the precision.
Here is the code:
z = ((x - y * DP1) - y * DP2) - y * DP3;
where DP1, DP2 and DP3 are some hardcoded coefficients.
How does one find those coefficients mathematically? I understand the purpose of "modular extension arithmetic" for bignums, but what is its exact purpose here?
In the context of argument reduction for trigonometric functions, what you are looking at is Cody-Waite argument reduction, a technique introduced in the book: William J. Cody and William Waite, Software Manual for the Elementary Functions, Prentice-Hall, 1980. The goal is to achieve, for arguments up to a certain magnitude, an accurate reduced argument, despite subtractive cancellation in intermediate computation. For this purpose, the relevant constant is represented with more than native precision, by using a sum of multiple numbers of decreasing magnitude (here: DP1, DP2, DP3), such that all of the intermediate products except the least significant one can be computed without rounding error.
Consider as an example the computation of sin (113) in IEEE-754 binary32 (single precision). The typical argument reduction would conceptually compute i=rintf(x/(π/2)); reduced_x = x-i*(π/2). The binary32 number closest to π/2 is 0x1.921fb6p+0. We compute i=72, the product rounds to 0x1.c463acp+6, which is close to the argument x=0x1.c40000p+6. During subtraction, some leading bits cancel, and we wind up with reduced_x = -0x1.8eb000p-4. Note the trailing zeros introduced by renormalization. These zero bits carry no useful information. Applying an accurate approximation to the reduced argument, sin(x) = -0x1.8e0eeap-4, whereas the true result is -0x1.8e0e9d39...p-4. We wind up with large relative error and large ulp error.
We can remedy this by using a two-step Cody-Waite argument reduction. For example, we could use pio2_hi = 0x1.921f00p+0, and pio2_lo = 0x1.6a8886p-17. Note the eight trailing zero bits in the single-precision representation of pio2_hi, which allow us to multiply with any 8-bit integer i and still have the product i * pio2_hi representable exactly as a single-precision number. When we compute ((x - i * pio2_hi) - i * pio2_lo), we get reduced_x = -0x1.8eafb4p-4, and therefore sin(x) = -0x1.8e0e9ep-4, a quite accurate result.
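A minimal C sketch of this two-step reduction (my own illustration, not code from the answer; valid only while i fits in eight bits, so that i * pio2_hi is exact):
#include <math.h>

static float reduce_pio2(float x, int *i_out) {
    const float pio2_hi = 0x1.921f00p+0f;   /* high part of pi/2, eight trailing zero bits */
    const float pio2_lo = 0x1.6a8886p-17f;  /* low part: pi/2 - pio2_hi, rounded */
    float i = rintf(x / 0x1.921fb6p+0f);    /* nearest integer multiple of pi/2 */
    *i_out = (int)i;
    return (x - i * pio2_hi) - i * pio2_lo; /* two-step Cody-Waite reduction */
}
For x = 113.0f this gives i = 72 and a reduced argument of about -0x1.8eafb4p-4, matching the worked example above.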
The best way to split the constant into a sum will depend on the magnitude of i we need to handle, on the maximum number of bits subject to subtractive cancellation for a given argument range (based on how close integer multiples of π/2 can get to integers), and performance considerations. Typical real-life use cases involve two- to four-stage Cody-Waite reduction schemes. The availability of fused multiply-add (FMA) allows the use of constituent constants with fewer trailing zero bits. See this paper: Sylvie Boldo, Marc Daumas, and Ren-Cang Li, "Formally verified argument reduction with a fused multiply-add." IEEE Transactions on Computers, 58:1139–1145, 2009. For a worked example using fmaf() you might want to look at the code in one of my previous answers.

Clang reciprocal to 1 optimisations

After a discussion with colleagues, I ended up testing whether clang would optimize two divisions, with a reciprocal to 1, into a single division.
const float x = a / b; //x not used elsewhere
const float y = 1 / x;
Theoretically clang could optimize to const float y = b / a if x is used only as a temporary step value, no?
Here's the input & output of a simple test case: https://gist.github.com/Jiboo/d6e839084841d39e5ab6 (in both outputs you can see that it's performing the two divisions, instead of optimizing)
This related question is beyond my comprehension and seems to focus only on why a specific instruction isn't used, whereas in my case it's the optimisation that isn't done: Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math
Thanks,
JB.
No, clang cannot do that.
But first, why are you using float? float has about six decimal digits of precision; double has about 15. Unless you have a good reason that you can explain, use double.
1 / (a / b) in floating-point arithmetic is not the same as b / a. What the compiler has to do in the first case is:
Divide a by b
Round the result to the nearest floating-point number
Divide 1 by the result
Round the result to the nearest floating-point number.
In the second case:
Divide b by a.
Round the result to the nearest floating-point number.
The compiler can only change the code if the result is guaranteed to be the same, and if the compiler writer cannot produce a mathematical proof that the result is the same, the compiler cannot change the code. There are two rounding operations in the first case, rounding different numbers, so it is unlikely that the result can be guaranteed to be the same.
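To see concretely that the two expressions can differ, here is a small brute-force search (my own sketch; it assumes strict float evaluation, i.e. SSE math and no -ffast-math) that reports a counterexample if one exists in the searched range:
#include <stdio.h>

int main(void) {
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            float a = (float)i, b = (float)j;
            float x  = a / b;        /* rounded once */
            float y1 = 1.0f / x;     /* rounded again */
            float y2 = b / a;        /* rounded once */
            if (y1 != y2) {
                printf("a=%d b=%d: 1/(a/b)=%a but b/a=%a\n", i, j, y1, y2);
                return 0;
            }
        }
    }
    printf("no mismatch found in the searched range\n");
    return 0;
}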
The compiler doesn't think like a mathematician. Where you think simplifying the expression is trivial mathematically, the compiler has a lot of other things to consider. It is actually quite likely that the compiler is much smarter than the programmer and also knows far more about the C standard.
Something like this is probably what goes through the optimizing compiler's "mind":
Ah they wrote a / b but only use x at one place, so we don't have to allocate that variable on the stack. I'll remove it and use a CPU register.
Hmm, integer literal 1 divided with a float variable. Okay, we have to invoke balancing here before anything else and turn that literal into a float 1.0f.
The programmer is counting on me to generate code that contains the potential floating point inaccuracy involved in dividing 1.0f with another float variable! So I can't just swap this expression with b / a because then that floating point inaccuracy that the programmer seems to want here would be lost.
And so on. There's a lot of considerations. What machine code you end up with is hard to predict in advance. Just know that the compiler follows your instructions to the letter.

Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between the following two?
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = f1 / f2;
and:
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero divisor inputs it makes no difference. Given a finite result, the IEEE 754 standard requires the result to be the result of the real-number division f1/f2, rounded to the type being used in the division.
If it is done as a float division, that is the closest float to the exact result. If it is done as a double division, it will be the closest double, with an additional rounding step for the assignment to result.
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
For simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
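For what it is worth, that claim is easy to probe empirically; with a quick sketch like the following (my own, assuming SSE-style strict float evaluation), no mismatch should be reported for division:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    srand(1);
    for (long k = 0; k < 10000000; k++) {
        float f1 = (float)(rand() % 2000001 - 1000000) / 1000.0f;
        float f2 = (float)(rand() % 2000001 - 1000000) / 1000.0f;
        if (f2 == 0.0f) continue;
        float direct    = f1 / f2;                             /* one rounding, to float */
        float viaDouble = (float)((double)f1 / (double)f2);    /* round to double, then to float */
        if (direct != viaDouble) {
            printf("mismatch: f1=%a f2=%a\n", f1, f2);
            return 1;
        }
    }
    printf("no mismatches found\n");
    return 0;
}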
If the result of an individual floating-point addition, subtraction, multiply, or divide, is immediately stored to a float, there will be no accuracy improvement using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using them. In Turbo Pascal circa 1986 code like:
Function TriangleArea(A: Single; B: Single; C: Single): Single;
Var S: Extended; (* S stands for Semi-perimeter *)
Begin
  S := (A+B+C) * 0.5;
  TriangleArea := Sqrt((S-A)*(S-B)*(S-C)*S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single or double precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; that failure to provide a variable type which could hold intermediate results led people to unfairly criticize the concept of a higher-precision intermediate-result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method into a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
    double s = (a + b + c) * 0.5;
    return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that even though code like the above will work perfectly in some environments but yield completely bogus results in others, compilers will generally not give any warning about the situation.
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by loss of promotion (e.g. the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
Math.Sqrt((a+b+c)*(b-a+c)*(a-b+c)*(a-c+b))*0.25
increases the number of additions to eight, but will work correctly even if they are performed at single precision.)
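To make the contrast concrete in C (my own sketch; the C# example above is the answer's), here is the same triangle computed with float throughout, using both the Heron form and the rearranged form, with a double-precision reference:
#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 16777215.0f, b = 16777215.0f, c = 4.0f;

    float s = (a + b + c) * 0.5f;                       /* semi-perimeter collapses in float */
    float heron = sqrtf((s - a) * (s - b) * (s - c) * s);

    float rearranged = sqrtf((a + b + c) * (b - a + c) * (a - b + c) * (a - c + b)) * 0.25f;

    double sd = ((double)a + b + c) * 0.5;              /* reference computed in double */
    double ref = sqrt((sd - a) * (sd - b) * (sd - c) * sd);

    printf("float Heron: %.9g\nrearranged:  %.9g\nreference:   %.9g\n", heron, rearranged, ref);
    return 0;
}
With round-to-nearest float arithmetic the Heron form loses the area badly for this triangle, while the rearranged form stays close to the double reference, as described above.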
"Accuracy gain when casting to double and back when doing float division?"
The result depends on factors other than just the two posted methods.
C allows evaluation of float operations to happen at different levels depending on FLT_EVAL_METHOD. (See the table below.) If the current setting is 1 or 2, the two methods posted by OP will provide the same answer.
Depending on other code and compiler optimization levels, the quotient result may be used at wider precision in subsequent calculations in either of OP's cases.
Because of this, a float division that overflows or flushes to 0.0 (a result with total loss of precision) with extreme float values may, if the quotient is carried forward as double into subsequent optimized calculations, in fact not overflow or underflow at all.
To compel the quotient to become a float for future calculations in the midst of potential optimizations, code often uses volatile
volatile float result = f1 / f2;
C does not specify the precision of math operations, yet common application of standards like IEEE 754 provides that a single operation like a binary32 divide will result in the closest representable answer. Should the divide occur at a wider format like double or long double, then the conversion of the wider quotient back to float involves another rounding step that on rare occasions will result in a different answer than the direct float/float division.
FLT_EVAL_METHOD
-1: indeterminable;
0: evaluate all operations and constants just to the range and precision of the type;
1: evaluate operations and constants of type float and double to the range and precision of the double type; evaluate long double operations and constants to the range and precision of the long double type;
2: evaluate all operations and constants to the range and precision of the long double type.
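To see which of these applies for a given compiler and set of flags, the macro can simply be printed (illustrative snippet, not from the answer):
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}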
Practical guidelines:
Use float vs. double to conserve space when needed. (float is usually narrower than double, rarely the same size.) If precision is important, use double (or long double).
Using float vs. double to improve speed may or may not work, as a platform's native operations may all be done in double. It may be faster, the same, or slower - profile to find out. Much of C was originally designed with double as the only precision at which floating-point math was carried out, aside from conversions to and from float. Later C added functions like sinf() to facilitate faster, direct float operations. So the more modern the compiler/platform, the more likely float will be faster. Again: profile to find out.

Floating point rounding when truncating

This is probably a question for an x86 FPU expert:
I am trying to write a function which generates a random floating point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2) - ie, I want an inclusive upper bound, but my "source" generated value is from an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.
What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.
Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:
float MersenneFloat( float min, float max )
{
    //genrand returns a double in [1,2)
    const float random = (float)genrand_close1_open2();
    //return in desired range
    return min + ( random - 1.0f ) * (max - min);
}
If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?
Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.
The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation -- the math truly is done in 32 or 64 bits. The good news is that you can easily test it and see if it changes your results by simply specifying the /ARCH:SSE2 command line option to MSVC, which will cause it to use the SSE scalar ops instead of x87 FPU instructions for ordinary floating point math.
I'm not sure offhand of what the exact rounding behavior is around the integer boundaries, but you can test to see what'll happen when 1.999... gets rounded from 64 to 32 bits, e.g.:
static uint64_t OnePointNineRepeating = 0x3FFFFFFFFFFFFFFFULL; // exponent 0 (biased to 1023), all 1 bits in mantissa
double asDouble = *(double *)(&OnePointNineRepeating);
float asFloat = (float)asDouble;
return asFloat;
Edit, result: original poster ran this test and found that with truncation, the 1.99999 will round up to 2 both with and without /arch:SSE2.
If you do adjust the rounding so that it does include both ends of the range, will those extreme values not be only half as likely as any of the non-extreme ones?
With truncation, you are never going to be inclusive of the max.
Are you sure you really need the max? There is literally an almost 0 chance that you will land on exactly the maximum.
That said, you can exploit the fact that you are giving up precision and do something like this:
float MersenneFloat( float min, float max )
{
    double random = 100000.0; // just a dummy value to force at least one loop iteration
    while ((float)random > 65535.0)
    {
        //genrand returns a double in [1,2)
        random = genrand_close1_open2() - 1.0; // now it's [0,1)
        random *= 65536.0; // now it's [0,65536). We try again if it's > 65535.0
    }
    //return in desired range
    return min + (float)(random/65535.0) * (max - min);
}
Note that, now, it has a slight chance of multiple calls to genrand each time you call MersenneFloat. So you have given up possible performance for a closed interval. Since you are downcasting from double to float, you end up sacrificing no precision.
Edit: improved algorithm
