I am working on a function to report test results together with the lower and upper limits for that specific test. These three values are converted with a specified formula (aX + b)/c, where X is testResult/lowerLimit/upperLimit and a, b, and c are floating-point numbers.
If the reported test result is inside/outside the specified limits before the conversion, it shall also be inside/outside the limits after the conversion, in order to ensure the validity of the reported results.
I have identified two cases where an invalid test result moves inside the range after the conversion, but I have yet to find a case where the test result is inside the range before the conversion and outside the specified limits after it. Can this case even occur? I don't believe so. Can it?
Below is some code that produces the two cases that I mentioned together with corrections to ensure the validity of the reported test result.
TLDR: Can the ((TRUE == insideLimitBefore) && (FALSE == insideLimitAfter)) case in the code below happen?
#include <stdio.h>
#include <stdint.h>
#define TRUE (uint8_t)0x01
#define FALSE (uint8_t)0x00
int32_t LinearMapping(const int32_t input);
void Convert(int32_t testResult, int32_t lowerLimit, int32_t upperLimit);
int main(void)
{
int32_t lowerLimit = 504;
int32_t testResult = 503;
int32_t upperLimit = 1000;
printf("INPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit);
Convert(testResult, lowerLimit, upperLimit);
lowerLimit = 500;
testResult = 504;
upperLimit = 503;
printf("INPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit);
Convert(testResult, lowerLimit, upperLimit);
return 0;
}
int32_t LinearMapping(const int32_t input)
{
float retVal;
const float a = 1.0;
const float b = 1.0;
const float c = 2.3;
retVal = a * input;
retVal += b;
retVal /= c;
return (int32_t)retVal;
}
void Convert(int32_t testResult, int32_t lowerLimit, int32_t upperLimit)
{
uint8_t insideLimitAfter;
uint8_t belowLowerLimit;
uint8_t insideLimitBefore = ((lowerLimit <= testResult) && (testResult <= upperLimit)) ? TRUE : FALSE;
if (FALSE == insideLimitBefore)
{
/* testResult is either below or above lowerLimit/upperLimit respectively */
if (testResult < lowerLimit)
{
belowLowerLimit = TRUE;
}
else /* testResult > upperLimit */
{
belowLowerLimit = FALSE;
}
}
testResult = LinearMapping(testResult);
lowerLimit = LinearMapping(lowerLimit);
upperLimit = LinearMapping(upperLimit);
insideLimitAfter = ((lowerLimit <= testResult) && (testResult <= upperLimit)) ? TRUE : FALSE;
if ((FALSE == insideLimitBefore) && (TRUE == insideLimitAfter))
{
if (TRUE == belowLowerLimit)
{
printf("OUTPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit+1, testResult, upperLimit);
}
else /* belowLowerLimit == FALSE => testResult > upperLimit */
{
printf("OUTPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit-1);
}
}
else if ((TRUE == insideLimitBefore) && (FALSE == insideLimitAfter))
{
/* Is this case even possible? */
}
else
{
/* Do nothing */
}
}
… I have yet to find a case where the test result is inside the range before the conversion and will be outside the specified limits after conversion. Can this case even occur?
No, given sane a, b, c, lowerLimit, testResult, upperLimit.
Given 3 values lo, x, hi with lo <= x <= hi before the linear conversion in LinearMapping(), the converted values will maintain lo_new <= x_new <= hi_new as long as the conversion is (positively) linear: no division by 0; a, b, c are not Not-A-Numbers; and no conversion of a float that is out of the range of int32_t.
The primary reason: on edge cases where x is just inside or at a limit of [lo...hi], LinearMapping() may reduce the effective precision of all 3 values. The new x may now equal lo or hi, and == favors "in range". So lo <= x <= hi still holds.
OP originally found examples where an "invalid test result will move inside the range after the conversion" because x was just outside [lo...hi] and the effective precision reduction made x equal to either lo or hi. Since == favors "in range", the move from outside to inside is seen.
Note: if LinearMapping() has a negative slope like -1, then lo <= x <= hi can easily be broken, as 1 <= 2 <= 3 --> -1 > -2 > -3. This makes lowerLimit > upperLimit, in which case "in range" cannot be satisfied for any x.
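A minimal sketch of the negative-slope failure (the mapping and values are chosen here purely for illustration):
#include <stdio.h>
/* Hypothetical mapping with a negative slope: (aX + b)/c with a = -1, b = 0, c = 1 */
static int negative_slope_map(int x) { return -x; }
int main(void) {
    int lo = 1, x = 2, hi = 3;          /* lo <= x <= hi holds */
    int lo2 = negative_slope_map(lo);   /* -1 */
    int x2  = negative_slope_map(x);    /* -2 */
    int hi2 = negative_slope_map(hi);   /* -3 */
    /* lo2 > hi2 now, so "in range" can never be satisfied */
    printf("%d <= %d <= %d : %s\n", lo2, x2, hi2,
           (lo2 <= x2 && x2 <= hi2) ? "inside" : "outside");
    return 0;
}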
For reference, OP's code simplified:
#include <stdio.h>
#include <stdint.h>
int LinearMapping(const int input) {
const float a = 1.0;
const float b = 1.0;
const float c = 2.3;
float retVal = a * input;
retVal += b;
retVal /= c;
return (int) retVal;
}
void Convert(int testResult, int lowerLimit, int upperLimit) {
printf("Before %d %s %d %s %d\n", lowerLimit,
lowerLimit <= testResult ? "<=" : "> ", testResult,
testResult <= upperLimit ? "<=" : "> ", upperLimit);
testResult = LinearMapping(testResult);
lowerLimit = LinearMapping(lowerLimit);
upperLimit = LinearMapping(upperLimit);
printf("After %d %s %d %s %d\n\n", lowerLimit,
lowerLimit <= testResult ? "<=" : "> ", testResult,
testResult <= upperLimit ? "<=" : "> ", upperLimit);
}
int main(void) {
Convert(503, 504, 1000);
Convert(504, 500, 503);
return 0;
}
Output
Before 504 > 503 <= 1000
After 219 <= 219 <= 435
Before 500 <= 504 > 503
After 217 <= 219 <= 219
… I have yet to find a case where the test result is inside the range before the conversion and will be outside the specified limits after conversion. Can this case even occur? I don't believe so? Can it?
Yes, this can occur, in theory, although due to C behavior rather than due to the underlying floating-point operations. The C standard does not guarantee that IEEE-754 floating-point arithmetic is used, or even that the same precision is used throughout the evaluation of an expression, and this can cause identical inputs to an expression to have different results.
Although LinearMapping is shown as a single routine, the compiler may inline it. That is, where the routine is called, the compiler may replace the call with the body of the routine. Furthermore, when it does this in different places, it may evaluate the expressions using different methods. So, in this code, LinearMapping may be evaluated using different floating-point operations in each call:
testResult = LinearMapping(testResult);
lowerLimit = LinearMapping(lowerLimit);
upperLimit = LinearMapping(upperLimit);
This means that (a * testResult + b) / c might be evaluated using only 32-bit floating-point arithmetic, while (a * upperLimit + b) / c might be evaluated using 64-bit floating-point arithmetic, converted to 32-bit after the division. (I have merged your three assignment statements into one expression here for brevity. The issue applies either way.)
One consequence of this is double rounding. When a result is calculated with one precision and then converted to another, two roundings occur, one in the initial calculation and a second in the conversion. Consider an exact mathematical result that looks like this:
1.xxxxx01011111111xxxx1xx
^ ^ Start of bits to be rounded in wider format.
| Start of bits to be rounded in narrower format.
If this were the result of a calculation in the narrower format, we would examine the bits 011111111xxxx1xx and round them down (they are less than ½ at the position where we are rounding), so the final result would be 1.xxxxx01. However, if we do the calculation in the wider format first, the bits to be rounded away are 1xxxx1xx (more than ½), and these are rounded up, making the intermediate result 1.xxxxx0110000000. When we convert to the narrower format, the bits to be rounded away are 10000000, which is exactly the midpoint (½), so the round-to-nearest-ties-to-even rule tells us to round up, which makes the final result 1.xxxxx10.
Thus, even if testResult and upperLimit are equal, the results of applying LinearMapping to them may be unequal, and testResult may appear to be outside the interval.
There may be some ways to avoid this problem:
If your C implementation conforms to Annex F of the C standard (which essentially says that it uses IEEE-754 operations and binds them to C operators in expected ways), or at least conforms to certain parts of it, then double rounding should not occur with well-written source code.
The C standard says an implementation should define FLT_EVAL_METHOD in <float.h>. If FLT_EVAL_METHOD is 0, all floating-point operations and constants are evaluated with their nominal type; in this case, double rounding will not occur if you use a single floating-point type in your source code. If FLT_EVAL_METHOD is 1, float operations are evaluated with double; in this case, you could avoid double rounding by using double instead of float. If it is 2, operations are evaluated with long double, and you could avoid double rounding by using long double. If FLT_EVAL_METHOD is -1, the floating-point format used for evaluation is indeterminable, so double rounding is a concern. (A quick way to inspect the macro is sketched after this list.)
For specific values or intervals of values and/or known floating-point formats, it might be possible to prove that double rounding does not occur. E.g., given that your inputs are all int32_t, that the linear mapping parameters are certain values, and that only 32-bit or 64-bit IEEE-754 binary floating-point is in use, it might be possible to prove double rounding does not occur.
Even if your implementation conforms to Annex F or defines FLT_EVAL_METHOD to be non-negative, you must still take care in your code not to use expressions that have type double which you then assign to objects of type float. This will cause double rounding because your source code explicitly requires it, rather than because C is lax about floating-point.
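A minimal sketch for inspecting the evaluation method on a given implementation (FLT_EVAL_METHOD is standard since C99; the interpretation is as described above):
#include <float.h>
#include <stdio.h>
int main(void) {
    /* 0: operations use their nominal types; 1: float evaluated as double;
       2: evaluated as long double; -1: indeterminable */
#ifdef FLT_EVAL_METHOD
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
#else
    puts("FLT_EVAL_METHOD is not defined by this implementation");
#endif
    return 0;
}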
As a specific example, consider (1.0 * 13546 + 1.0) / 2.3. If the floating-point constants are represented in 64-bit binary floating-point (53-bit significands) and the expression is evaluated in 64-bit binary, the result is 5890.0000000000009094947017729282379150390625. However, if the same constants are used (in 64-bit binary) but the expression is evaluated with Intel’s 80-bit binary (64-bit significands) and then converted to 64-bit, the result is 5890.
In this case, the exact mathematical quotient is:
1.01110000001000000000000000000000000000000000000000001000000000001011…
^ Bits rounded away in double.
If we round this in double, we can see the bits to be rounded away, 1000000000001011…, are greater than ½ at the rounding position, so we round up. If we round it in long double, the bits to be rounded away are 01011…. These round down, leaving:
1.011100000010000000000000000000000000000000000000000010000000000
^ Bits rounded away in double.
Now, when we round to double, the bits to be rounded away are 10000000000, which is exactly the midpoint. The rule says to round to make the low bit even; the low retained bit is already 0, so we round down, and the result is:
1.0111000000100000000000000000000000000000000000000000
which is exactly 5890 once the exponent is applied.
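This is easy to reproduce. A sketch, assuming a platform where long double is Intel's 80-bit format (e.g., x86 Linux); elsewhere the two results may coincide:
#include <stdio.h>
int main(void) {
    const double a = 1.0, b = 1.0, c = 2.3;  /* constants held in 64-bit binary */
    double as_double = (a * 13546 + b) / c;  /* evaluated entirely in double */
    double via_long_double = (double)(((long double)a * 13546 + b) / c);
    printf("double:          %.25g\n", as_double);        /* 5890.000000000000909... */
    printf("via long double: %.25g\n", via_long_double);  /* 5890 with 80-bit long double */
    printf("equal: %d\n", as_double == via_long_double);
    return 0;
}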
Related
I'm looking for an alternative to the ceil() and floor() functions in C, because I am not allowed to use them in a project.
What I have built so far is a tricky back-and-forth that uses the cast operator: I convert a floating-point value (in my case a double) into an int and, since I need the closest integers above and below the given floating-point value to also be double values, back to double:
#include <stdio.h>
int main(void) {
double original = 124.576;
double floorint;
double ceilint;
int f;
int c;
f = (int)original; //Truncation to closest floor integer value
c = f + 1;
floorint = (double)f;
ceilint = (double)c;
printf("Original Value: %lf, Floor Int: %lf , Ceil Int: %lf", original, floorint, ceilint);
}
Output:
Original Value: 124.576000, Floor Int: 124.000000 , Ceil Int: 125.000000
For this example I would normally not need the ceil and floor integer values c and f converted back to double, but I need them as doubles in my real program. Consider that a requirement for the task.
Although the output gives the desired values and seems right so far, I'm still concerned whether this method is really appropriate, or, to put it more clearly, whether it introduces any bad behavior or issues into the program, or a performance loss compared to other alternatives, if there are any.
Do you know a better alternative? And if so, why should this one be better?
Thank you very much.
Do you know a better alternative? And if so, why should this one be better?
OP's code fails when:
original is already a whole number.
original is negative, like -1.5: truncation is not floor there.
original is just outside int range.
original is not-a-number.
Alternative construction
double my_ceil(double x)
Using the cast-to-some-integer-type trick is a problem when x is outside the integer's range. So first check whether x is inside the range of a wide enough integer (one whose precision exceeds double's); x values outside that range are already whole numbers. I recommend going for the widest integer, (u)intmax_t.
Remember that a cast to an integer rounds toward 0, which is not a floor. Different handling is needed for negative and positive x depending on whether the code is ceil() or floor(). OP's code missed this.
I'd avoid if (x >= INTMAX_MAX) as that involves (double)INTMAX_MAX, whose rounding, and hence precise value, is "chosen in an implementation-defined manner". Instead, I'd compare against INTMAX_MAX_P1 below: some_integer_MAX is a Mersenne number, while with 2's complement, ..._MIN is a negated power of 2.
#include <inttypes.h>
#define INTMAX_MAX_P1 ((INTMAX_MAX/2 + 1)*2.0)
double my_ceil(double x) {
if (x >= INTMAX_MAX_P1) {
return x;
}
if (x < INTMAX_MIN) {
return x;
}
intmax_t i = (intmax_t) x; // this rounds towards 0
if (i < 0 || x == i) return i; // negative x is already rounded up.
return i + 1.0;
}
As x may be a not-a-number, it is more useful to reverse the compare as relational compare of a NaN is false.
double my_ceil(double x) {
if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
intmax_t i = (intmax_t) x; // this rounds towards 0
if (i < 0 || x == i) return i; // negative x is already rounded up.
return i + 1.0;
}
return x;
}
double my_floor(double x) {
if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
intmax_t i = (intmax_t) x; // this rounds towards 0
if (i > 0 || x == i) return i; // positive x is already rounded down.
return i - 1.0;
}
return x;
}
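A quick usage sketch, assuming the my_ceil()/my_floor() definitions above are in scope (the expected values in the comments follow from the definitions):
#include <stdio.h>
int main(void) {
    printf("%g %g\n", my_ceil(124.576), my_floor(124.576)); /* 125 124 */
    printf("%g %g\n", my_ceil(-1.5), my_floor(-1.5));       /* -1 -2 */
    printf("%g %g\n", my_ceil(2.0), my_floor(2.0));         /* 2 2 */
    return 0;
}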
You're missing an important step: you need to check if the number is already integral. So for ceil, assuming non-negative numbers (generalisation is trivial), use something like
#include <limits.h> /* LLONG_MAX */
double ceil(double f){
    if (f >= LLONG_MAX){
        // f will be integral unless you have a really funky platform
        return f;
    } else {
        long long i = f;
        return 0.0 + i + (f != i); // to obviate potential long long overflow
    }
}
Another missing piece of the puzzle, which is covered by my enclosing if, is to check whether f is within the bounds of a long long. On common platforms, if f were outside the bounds of a long long, it would be integral anyway.
Note that floor is trivial for non-negative f, due to the fact that truncation to long long is always towards zero.
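For completeness, a matching floor sketch under the same non-negative assumption (the name my_floor_nonneg is mine):
#include <limits.h>
/* floor for f >= 0: truncation toward zero is already the floor */
double my_floor_nonneg(double f) {
    if (f >= LLONG_MAX) {
        return f; /* already integral on common platforms */
    }
    long long i = f; /* truncates toward zero */
    return 0.0 + i;
}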
I have one double, and one int64_t. I want to know if they hold exactly the same value, and if converting one type into the other does not lose any information.
My current implementation is the following:
int int64EqualsDouble(int64_t i, double d) {
return (d >= INT64_MIN)
&& (d < INT64_MAX)
&& (round(d) == d)
&& (i == (int64_t)d);
}
My question is: is this implementation correct? And if not, what would be a correct answer? To be correct, it must leave no false positive, and no false negative.
Some sample inputs:
int64EqualsDouble(0, 0.0) should return 1
int64EqualsDouble(1, 1.0) should return 1
int64EqualsDouble(0x3FFFFFFFFFFFFFFF, (double)0x3FFFFFFFFFFFFFFF) should return 0, because 2^62 - 1 can be exactly represented with int64_t, but not with double.
int64EqualsDouble(0x4000000000000000, (double)0x4000000000000000) should return 1, because 2^62 can be exactly represented in both int64_t and double.
int64EqualsDouble(INT64_MAX, (double)INT64_MAX) should return 0, because INT64_MAX can not be exactly represented as a double
int64EqualsDouble(..., 1.0e100) should return 0, because 1.0e100 can not be exactly represented as an int64_t.
Yes, your solution works correctly because it was designed to do so: int64_t is represented in two's complement by definition (C99 7.18.1.1:1), and the reasoning below assumes platforms that use something resembling binary IEEE 754 double-precision for the double type. It is basically the same as this one.
Under these conditions:
d < INT64_MAX is correct because it is equivalent to d < (double)INT64_MAX, and in the conversion to double, the number INT64_MAX, equal to 0x7fffffffffffffff, rounds up. Thus you want d to be strictly less than the resulting double to avoid triggering UB when executing (int64_t)d.
On the other hand, INT64_MIN, being -0x8000000000000000, is exactly representable, meaning that a double equal to (double)INT64_MIN can be equal to some int64_t and should not be excluded (and such a double can be converted to int64_t without triggering undefined behavior).
It goes without saying that, since we have specifically used the assumptions of 2's complement integers and binary floating-point, the correctness of the code is not guaranteed by this reasoning on platforms that differ. Take a platform with binary 64-bit floating-point and a 64-bit 1's complement integer type T. On that platform, T_MIN is -0x7fffffffffffffff. The conversion of that number to double rounds down, resulting in -0x1.0p63. On that platform, using your program as written, -0x1.0p63 for d makes the first three conditions true, resulting in undefined behavior in (T)d, because overflow in the conversion from floating-point to integer is undefined behavior.
If you have access to full IEEE 754 features, there is a shorter solution:
#include <fenv.h>
…
#pragma STDC FENV_ACCESS ON
feclearexcept(FE_INEXACT), f == i && !fetestexcept(FE_INEXACT)
This solution takes advantage of the conversion from integer to floating-point setting the INEXACT flag iff the conversion is inexact (that is, if i is not representable exactly as a double).
The INEXACT flag remains unset and f is equal to (double)i if and only if f and i represent the same mathematical value in their respective types.
This approach requires the compiler to have been warned that the code accesses the FPU's state, normally with #pragma STDC FENV_ACCESS ON, but that is typically not supported, and you have to use a compilation flag instead.
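Wrapped into a complete function for clarity (a sketch; as noted, FENV_ACCESS support varies):
#include <fenv.h>
#include <stdint.h>
#pragma STDC FENV_ACCESS ON
int int64EqualsDouble(int64_t i, double d) {
    feclearexcept(FE_INEXACT);
    /* d == i converts i to double, raising FE_INEXACT iff i is not
       exactly representable as a double; && sequences the flag test after it */
    return d == i && !fetestexcept(FE_INEXACT);
}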
OP's code has a dependency that can be avoided.
For a successful compare, d must be a whole number, and round(d) == d takes care of that. Even a d that is NaN would fail that test.
d must be mathematically in the range [INT64_MIN ... INT64_MAX], and if the if conditions properly ensure that, then the final i == (int64_t)d completes the test.
So the question comes down to comparing INT64 limits with the double d.
Let us assume FLT_RADIX == 2, but not necessarily IEEE 754 binary64.
d >= INT64_MIN is not a problem, as -INT64_MIN is a power of 2 and INT64_MIN converts exactly to a double of the same value, so the >= is exact.
Code would like to do the mathematical d <= INT64_MAX, but that may not work, and so it is a problem. INT64_MAX is a "power of 2 minus 1" and may not convert exactly; whether it does depends on whether the precision of double exceeds 63 bits, rendering the compare unclear. A solution is to halve the comparison: d/2 suffers no precision loss, and INT64_MAX/2 + 1 converts exactly to a double power of 2:
d/2 < (INT64_MAX/2 + 1)
[Edit]
// or simply
d < ((double)(INT64_MAX/2 + 1))*2
Thus, if code does not want to rely on double having less precision than a 64-bit integer (an assumption that likely fails with long double), a more portable solution would be
int int64EqualsDouble(int64_t i, double d) {
return (d >= INT64_MIN)
&& (d < ((double)(INT64_MAX/2 + 1))*2) // (d/2 < (INT64_MAX/2 + 1))
&& (round(d) == d)
&& (i == (int64_t)d);
}
Note: No rounding mode issues.
[Edit] Deeper limit explanation
Ensuring mathematically that INT64_MIN <= d <= INT64_MAX can be re-stated as INT64_MIN <= d < (INT64_MAX + 1), as we are dealing with whole numbers. Since the raw application of (INT64_MAX + 1) in code overflows, an alternative is ((double)(INT64_MAX/2 + 1))*2. This can be extended, for rare machines with a double radix of a higher power of 2, to ((double)(INT64_MAX/FLT_RADIX + 1))*FLT_RADIX. The comparison limits, being exact powers of 2, convert to double with no precision loss, and (lo_limit <= d) && (d < hi_limit) is exact regardless of the precision of the floating point. Note: a rare floating point with FLT_RADIX == 10 is still a problem.
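A small sketch verifying that the alternative limit is an exact power of 2 (the output comments assume IEEE 754 binary64):
#include <inttypes.h>
#include <stdio.h>
int main(void) {
    double hi_limit = ((double)(INT64_MAX/2 + 1)) * 2; /* INT64_MAX/2 + 1 == 2^62, exact; *2 == 2^63, exact */
    printf("%a\n", hi_limit);                      /* 0x1p+63 */
    printf("%d\n", (double)INT64_MAX == hi_limit); /* 1: INT64_MAX itself rounds up on conversion */
    return 0;
}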
In addition to Pascal Cuoq's elaborate answer, and given the extra context you give in comments, I would add a test for negative zeros. You should preserve negative zeros unless you have good reason not to, and you need a specific test to avoid converting them to (int64_t)0. With your current proposal, negative zeros will pass your test, get stored as int64_t, and be read back as positive zeros.
I am not sure what the most efficient way to test for them is; maybe this:
int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < INT64_MAX)
        && (round(d) == d)
        && (i == (int64_t)d)
        && (!signbit(d) || d != 0.0);
}
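A hypothetical quick check, assuming the function above is in scope; with the added term, a -0.0 input is now rejected:
#include <math.h>   /* round(), signbit() */
#include <stdint.h>
#include <stdio.h>
int main(void) {
    printf("%d\n", int64EqualsDouble(0, 0.0));  /* 1 */
    printf("%d\n", int64EqualsDouble(0, -0.0)); /* 0: negative zero is excluded */
    return 0;
}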
Can two floating point values (IEEE 754 binary64) be compared as integers? Eg.
long long a = * (long long *) ptr_to_double1,
b = * (long long *) ptr_to_double2;
if (a < b) {...}
assuming the size of long long and double is the same.
YES - Comparing the bit-patterns for two floats as if they were integers (aka "type-punning") produces meaningful results under some restricted scenarios...
Identical to floating-point comparison when:
Both numbers are positive, positive-zero, or positive-infinity.
One positive and one negative number, and you are using a signed integer comparison.
Inverse of floating-point comparison when:
Both numbers are negative, negative-zero, or negative-infinity.
One positive and one negative number, and you are using an unsigned integer comparison.
Not comparable to floating-point comparison when:
Either number is one of the NaN values - floating-point comparisons with a NaN always return false, and this simply can't be modeled in integer operations, where exactly one of the following is always true: (A < B), (A == B), (B < A).
Negative floating-point numbers are a bit funky b/c they are handled very differently from the 2's complement arithmetic used for integers. Doing an integer +1 on the representation of a negative float will make it a bigger negative number.
With a little bit manipulation, you can make both positive and negative floats comparable with integer operations (this can come in handy for some optimizations):
#include <bit>      // std::bit_cast (C++20)
#include <cstdint>
std::int32_t float_to_comparable_integer(float f) {
    std::uint32_t bits = std::bit_cast<std::uint32_t>(f);
    const std::uint32_t sign_bit = bits & 0x80000000ul;
    // Modern compilers turn this IF-statement into a conditional move (CMOV) on x86,
    // which is much faster than a branch that the cpu might mis-predict.
    if (sign_bit) {
        bits = 0x80000000ul - bits; // reverses the order of negatives; -0.0f and +0.0f map to the same key
    }
    return static_cast<std::int32_t>(bits);
}
Again, this does not work for NaN values, which always return false from comparisons, and have multiple valid bit representations:
Signaling NaNs (w/ sign bit): Anything between 0xFF800001, and 0xFFBFFFFF.
Signaling NaNs (w/o sign bit): Anything between 0x7F800001, and 0x7FBFFFFF.
Quiet NaNs (w/ sign bit): Anything between 0xFFC00000, and 0xFFFFFFFF.
Quiet NaNs (w/o sign bit): Anything between 0x7FC00000, and 0x7FFFFFFF.
IEEE-754 bit format: http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm
More on Type-Punning: https://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/
No. Two floating point values (IEEE 754 binary64) cannot compare simply as integers with if (a < b).
IEEE 754 binary64
The order of the values of double is not the same order as integers (unless you are on a rare sign-magnitude machine). Think positive vs. negative numbers.
double has values like 0.0 and -0.0 which have the same value but different bit patterns.
double has "Not-a-number"s that do not compare like their binary equivalent integer representation.
If both double values were x > 0 and not "Not-a-number", and endianness, aliasing, alignment, etc. were not an issue, OP's idea would work.
Alternatively, a more complex if() ... condition would work - see below
[non-IEEE 754 binary64]
Some double implementations use an encoding where there are multiple representations of the same value. This would differ from an "integer" compare.
Tested code: needs 2's complement, the same endianness for double and the integers, and does not account for NaN.
int compare(double a, double b) {
union {
double d;
int64_t i64;
uint64_t u64;
} ua, ub;
ua.d = a;
ub.d = b;
// Cope with -0.0 right away
if (ua.u64 == 0x8000000000000000) ua.u64 = 0;
if (ub.u64 == 0x8000000000000000) ub.u64 = 0;
// Signs differ?
if ((ua.i64 < 0) != (ub.i64 < 0)) {
return ua.i64 >= 0 ? 1 : -1;
}
// If numbers are negative
if (ua.i64 < 0) {
ua.u64 = -ua.u64;
ub.u64 = -ub.u64;
}
return (ua.u64 > ub.u64) - (ua.u64 < ub.u64);
}
Thanks to @David C. Rankin for a correction.
Test code
void testcmp(double a, double b) {
int t1 = (a > b) - (a < b);
int t2 = compare(a, b);
if (t1 != t2) {
printf("%le %le %d %d\n", a, b, t1, t2);
}
}
#include <float.h>
void testcmps() {
// Various interesting `double`
static const double a[] = {
-1.0 / 0.0, -DBL_MAX, -1.0, -DBL_MIN, -0.0,
+0.0, DBL_MIN, 1.0, DBL_MAX, +1.0 / 0.0 };
int n = sizeof a / sizeof a[0];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
testcmp(a[i], a[j]);
}
}
puts("!");
}
If you strictly cast the bit value of a floating-point number to its correspondingly-sized signed integer (as you've done), then signed integer comparison of the results will match the comparison of the original floating-point values when at least one operand is non-negative; for two negative operands the integer order is inverted, as noted in the other answers. Excluding NaN values and with that caveat, the comparison is legitimate for all representable finite and infinite numeric values.
In other words, for double-precision (64 bits), this comparison will be valid if the following tests pass:
long long exponentMask = 0x7ff0000000000000;
long long mantissaMask = 0x000fffffffffffff;
bool isNumber = ((x & exponentMask) != exponentMask) // Not exp 0x7ff
|| ((x & mantissaMask) == 0); // Infinities
for each operand x.
Of course, if you can pre-qualify your floating-point values, then a quick isNaN() test would be much more clear. You'd have to profile to understand performance implications.
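For illustration, a self-contained sketch of that pre-qualification (the helper names are mine; for IEEE 754 binary64, isnan() agrees with the mask test above):
#include <inttypes.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
/* True iff d is safe for the integer-comparison trick, i.e., d is not a NaN. */
static bool is_comparable(double d) {
    return !isnan(d);
}
/* Bit pattern of a double, obtained without violating strict aliasing. */
static int64_t double_bits(double d) {
    int64_t x;
    memcpy(&x, &d, sizeof x);
    return x;
}
int main(void) {
    printf("%d %d\n", is_comparable(1.0), is_comparable(nan(""))); /* 1 0 */
    printf("%" PRIx64 "\n", (uint64_t)double_bits(1.0));           /* 3ff0000000000000 */
    return 0;
}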
There are two parts to your question:
Can two floating point numbers be compared? The answer to this is yes. It is perfectly valid to compare the size of floating point numbers. Generally you want to avoid equality comparisons due to truncation issues (see here), but
if (a < b)
will work just fine.
Can two floating point numbers be compared as integers? This answer is also yes, but this will require casting. This question should help with that answer: convert from long long to int and the other way back in c++
Maybe it seems a bit of a rare question, but I would like to find a function able to transform a double (C number) into a long (C number). It's not necessary to preserve the double's information. The most important thing is:
double a,b;
long c,d;
c = f(a);
d = f(b);
This must be true:
if (a < b) then c < d, for all doubles a, b and the corresponding longs c, d
Thank you to all of you.
Your requirement is feasible if the following two conditions hold:
The compiler defines sizeof(double) the same as sizeof(long)
The hardware uses IEEE 754 double-precision binary floating-point format
While the 2nd condition holds on every widely-used platform, the 1st condition does not.
If both conditions do hold on your platform, then you can implement the function as follows:
long f(double x)
{
if (x > 0)
return double_to_long(x);
if (x < 0)
return -double_to_long(-x);
return 0;
}
You have several different ways to implement the conversion function:
#include <string.h> /* memcpy */
long double_to_long(double x)
{
long y;
memcpy(&y,&x,sizeof(x));
return y;
}
long double_to_long(double x)
{
long y;
y = *(long*)&x;
return y;
}
long double_to_long(double x)
{
union
{
double x;
long y;
}
u;
u.x = x;
return u.y;
}
Please note that the second option is not recommended, because it breaks strict-aliasing rule.
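A quick check of the order-preservation claim on a few sample values, assuming the two conditions above hold and that f() and one of the double_to_long() variants are in scope:
#include <stdio.h>
int main(void) {
    double samples[] = { -1e300, -2.5, -1.0, 0.0, 1.0, 2.5, 1e300 };
    int n = sizeof samples / sizeof samples[0];
    for (int i = 0; i + 1 < n; i++) {
        /* samples are in strictly increasing order, so f() must be too */
        printf("f(%g) < f(%g): %s\n", samples[i], samples[i + 1],
               f(samples[i]) < f(samples[i + 1]) ? "yes" : "NO");
    }
    return 0;
}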
There are four basic transformations from floating-point to integer types:
floor - Rounds towards negative infinity, i.e. next lowest integer.
ceil[ing] - Rounds towards positive infinity, i.e. next highest integer.
trunc[ate] - Rounds towards zero, i.e. strips the floating-point portion and leaves the integer.
round - Rounds towards the nearest integer.
None of these transformations will give the behaviour you specify, but floor will permit the slightly weaker condition (a < b) implies (c <= d).
If a double value uses more space to represent than a long, then there is no mapping that can meet your initial constraint, thanks to the pigeonhole principle. Basically, since the double type can represent many more distinct values than a long type, there is no way to preserve the strict partial order of the < relationship, as multiple double values would be forced to map to the same long value.
See also:
Difference between Math.Floor() and Math.Truncate() (Stack Overflow)
Pigeonhole principle (Wikipedia)
Use frexp() to get you most of the way there. It splits the number into exponent and significand (fraction).
Assume long is at least the same size as double; otherwise this is pointless (pigeonhole principle).
#include <assert.h>
#include <math.h>
#define EXPOWIDTH 11
#define FRACWIDTH 52
long f(double x) {
    assert(sizeof(long) >= sizeof(double));
    int ipart;
    double fraction = frexp(fabs(x), &ipart);
    long lg = ipart;
    lg += (1L << EXPOWIDTH)/2;
    if (lg < 0) lg = 0;                                       /* clamp underflowed exponents */
    if (lg >= (1L << EXPOWIDTH)) lg = (1L << EXPOWIDTH) - 1;  /* clamp overflowed exponents */
    lg <<= FRACWIDTH;
    lg += (long) (fraction * (1L << FRACWIDTH));
    if (x < 0) {
        lg = -lg;
    }
    return lg;
}
Notes:
The proper value for EXPOWIDTH depends on DBL_MAX_EXP and DBL_MIN_EXP and particulars of the double type.
This solution may map distinct double values near the extremes of double to the same long. I will look and test more later.
Otherwise as commented above: overlay the two types.
As long is often 2's complement and double is laid out in a sign-magnitude fashion, extra work is needed when the double is negative. Also watch out for -0.0.
#include <assert.h>
#include <limits.h>
long f(double x) {
    assert(sizeof x == sizeof (long));
    union {
        double d;
        long lg;
    } u = { x + 0.0 }; // + 0.0 turns -0.0 into +0.0 (x*1.0 would preserve the sign)
    // If 2's complement - which is the common situation
    if (u.lg < 0) {
        u.lg = LONG_MAX - u.lg;
    }
    return u.lg;
}
In C, when ±0.0 is supported, -0.0 or +0.0 assigned to a double typically makes no arithmetic difference. Although they have different bit patterns, they arithmetically compare as equal.
double zp = +0.0;
double zn = -0.0;
printf("0 == memcmp %d\n", 0 == memcmp(&zn, &zp, sizeof zp));// --> 0 == memcmp 0
printf("== %d\n", zn == zp); // --> == 1
Inspired by a @Pascal Cuoq comment, I am looking for a few more functions in standard C that provide arithmetically different results for f(+0.0) and f(-0.0).
Note: many functions, like sin(), return +0.0 from f(+0.0) and -0.0 from f(-0.0), but these two results are not arithmetically different. Also, the 2 results should not both be NaN.
There are a few standard operations and functions that form numerically different answers between f(+0.0) and f(-0.0).
Different rounding modes or other floating point implementations may give different results.
#include <math.h>
double inverse(double x) { return 1/x; }
double atan2m1(double y) { return atan2(y, -1.0); }
double sprintf_d(double x) {
char buf[20];
// sprintf(buf, "%+f", x); Changed to e
sprintf(buf, "%+e", x);
return buf[0]; // returns `+` or `-`
}
double copysign_1(double x) { return copysign(1.0, x); }
double signbit_d(double x) {
int sign = signbit(x); // my compile returns 0 or INT_MIN
return sign;
}
double pow_m1(double x) { return pow(x, -1.0); }
void zero_test(const char *name, double (*f)(double)) {
double fzp = (f)(+0.0);
double fzn = (f)(-0.0);
int differ = fzp != fzn;
if (fzp != fzp && fzn != fzn) differ = 0; // if both NAN
printf("%-15s f(+0):%-+15e %s f(-0):%-+15e\n",
name, fzp, differ ? "!=" : "==", fzn);
}
void zero_tests(void) {
zero_test("1/x", inverse);
zero_test("atan2(x,-1)", atan2m1);
zero_test("printf(\"%+e\")", sprintf_d);
zero_test("copysign(x,1)", copysign_1);
zero_test("signbit()", signbit_d);
zero_test("pow(x,-odd)", pow_m1);; // #Pascal Cuoq
zero_test("tgamma(x)", tgamma); // #vinc17 #Pascal Cuoq
}
Output:
1/x f(+0):+inf != f(-0):-inf
atan2(x,-1) f(+0):+3.141593e+00 != f(-0):-3.141593e+00
printf("%+e") f(+0):+4.300000e+01 != f(-0):+4.500000e+01
copysign(x,1) f(+0):+1.000000e+00 != f(-0):-1.000000e+00
signbit() f(+0):+0.000000e+00 != f(-0):-2.147484e+09
pow(x,-odd) f(+0):+inf != f(-0):-inf
tgamma(x) f(+0):+inf != f(-0):+inf
Notes:
tgamma(x) came up == on my gcc 4.8.2 machine, but correctly != on others.
rsqrt(), AKA 1/sqrt(), is a possible future C standard function. It may or may not also work.
double zero = +0.0; memcmp(&zero, &x, sizeof x) can show that x has a different bit pattern than +0.0, but x could still be a +0.0. I think some FP formats have many bit patterns that are +0.0 and -0.0. TBD.
This is a self-answer, as encouraged by https://stackoverflow.com/help/self-answer.
The IEEE 754-2008 function rsqrt (that will be in the future ISO C standard) returns ±∞ on ±0, which is quite surprising. And tgamma also returns ±∞ on ±0. With MPFR, mpfr_digamma returns the opposite of ±∞ on ±0.
I'm thinking about this method, but I can't check it before the weekend, so someone might experiment with it if they like, or just tell me that it is nonsense:
Generate a -0.0f. It should be possible to generate it statically, by assigning a tiny negative constant that underflows float representation.
Assign this constant to a volatile double and back to float.
By changing the bit representation twice, I assume that the compiler-specific standard bit representation for -0.0f is now in the variable. The compiler can't outsmart me there, because a totally different value could be in the volatile variable between those 2 copies.
Compare the input to 0.0f, to detect if we have a 0.0f/-0.0f case.
If it is equal, assign the input to the volatile double variable, and then back to float.
I again assume that it now has the standard compiler representation for 0.0f.
Access the bit patterns via a union and compare them, to decide whether it is -0.0f.
The code might be something like:
#include <float.h> /* DBL_MIN */
typedef union
{
float fvalue;
/* assuming int has at least the same number of bits as float */
unsigned int bitpat;
} tBitAccess;
float my_signf(float x)
{
/* assuming double has smaller min and
other bit representation than float */
volatile double refitbits;
tBitAccess tmp;
unsigned int pat0, patX;
if (x < 0.0f) return -1.0f;
if (x > 0.0f) return 1.0f;
refitbits = (double) (float) -DBL_MIN;
tmp.fvalue = (float) refitbits;
pat0 = tmp.bitpat;
refitbits = (double) x;
tmp.fvalue = (float) refitbits;
patX = tmp.bitpat;
return (patX == pat0)? -1.0f : 1.0f;
}
It is not a standard function or an operator, but a function that should differentiate between the signs of -0.0 and 0.0.
It is based (mainly) on the assumption that the compiler vendor does not use different bit patterns for -0.0f as a result of changing formats, even if the floating-point format would allow it; if this holds, it is independent of the chosen bit pattern.
For floating-point formats that have exactly one pattern for -0.0f, this function should safely do the trick without knowledge of the bit ordering in that pattern.
The other assumptions (about the sizes of the types and so on) can be handled with preprocessor switches on the float.h constants.
Edit: On second thought: if we can force the value comparing equal to (0.0 || -0.0) below the smallest representable denormal (subnormal) floating-point number, or its negative counterpart, and there is no second pattern for -0.0f (exact) in the FP format, we could drop the casting to volatile double. (But maybe keep the float volatile, to ensure that, with deactivated denormals, the compiler can't do any fancy tricks to ignore operations that further reduce the absolute value of things comparing equal to 0.0.)
The Code then might look like:
#include <float.h> /* DBL_MIN */
#include <math.h>  /* fabsf() */
typedef union
{
float fvalue;
/* assuming int has at least the same number of bits as float */
unsigned int bitpat;
} tBitAccess;
float my_signf(float x)
{
volatile tBitAccess tmp;
unsigned int pat0, patX;
if (x < 0.0f) return -1.0f;
if (x > 0.0f) return 1.0f;
tmp.fvalue = -DBL_MIN;
/* forcing something compares equal to 0.0f below smallest subnormal
- not sure if one abs()-factor is enough */
tmp.fvalue = tmp.fvalue * fabsf(tmp.fvalue);
pat0 = tmp.bitpat;
tmp.fvalue = x;
tmp.fvalue = tmp.fvalue * fabsf(tmp.fvalue);
patX = tmp.bitpat;
return (patX == pat0)? -1.0f : 1.0f;
}
This might not work with fancy rounding modes that prevent rounding from negative values towards -0.0.
Not exactly an answer to the question, but can be useful to know:
Just faced the case a - b = c => b = a - c, which fails to hold if a is 0.0 and b is -0.0: we get 0.0 - (-0.0) = 0.0 => b = 0.0 - 0.0 = 0.0. The sign is lost; the -0.0 is not recovered.
Taken from here.
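A minimal sketch of that sign loss:
#include <math.h>
#include <stdio.h>
int main(void) {
    double a = 0.0, b = -0.0;
    double c = a - b;   /* 0.0 - (-0.0) = +0.0 */
    double b2 = a - c;  /* 0.0 - 0.0 = +0.0: recovers +0.0, not -0.0 */
    printf("signbit(b)  = %d\n", !!signbit(b));  /* 1 */
    printf("signbit(b2) = %d\n", !!signbit(b2)); /* 0: the sign is lost */
    return 0;
}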