How to test for lossless double / integer conversion? - c

I have one double, and one int64_t. I want to know if they hold exactly the same value, and if converting one type into the other does not lose any information.
My current implementation is the following:
int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < INT64_MAX)
        && (round(d) == d)
        && (i == (int64_t)d);
}
My question is: is this implementation correct? And if not, what would be a correct answer? To be correct, it must leave no false positive, and no false negative.
Some sample inputs:
int64EqualsDouble(0, 0.0) should return 1
int64EqualsDouble(1, 1.0) should return 1
int64EqualsDouble(0x3FFFFFFFFFFFFFFF, (double)0x3FFFFFFFFFFFFFFF) should return 0, because 2^62 - 1 can be exactly represented with int64_t, but not with double.
int64EqualsDouble(0x4000000000000000, (double)0x4000000000000000) should return 1, because 2^62 can be exactly represented in both int64_t and double.
int64EqualsDouble(INT64_MAX, (double)INT64_MAX) should return 0, because INT64_MAX cannot be exactly represented as a double.
int64EqualsDouble(..., 1.0e100) should return 0, because 1.0e100 cannot be exactly represented as an int64_t.

Yes, your solution works correctly because it was designed to do so: int64_t is represented in two's complement by definition (C99 7.18.1.1:1), and the reasoning below holds on platforms that use something resembling binary IEEE 754 double precision for the double type.
Under these conditions:
d < INT64_MAX is correct because it is equivalent to d < (double)INT64_MAX, and in the conversion to double, the number INT64_MAX, equal to 0x7fffffffffffffff, rounds up. Thus you want d to be strictly less than the resulting double to avoid triggering undefined behavior when executing (int64_t)d.
On the other hand, INT64_MIN, being -0x8000000000000000, is exactly representable. This means that a double equal to (double)INT64_MIN can be equal to some int64_t and should not be excluded (and such a double can be converted to int64_t without triggering undefined behavior).
It goes without saying that since we have specifically used the assumptions of two's complement integers and binary floating-point, the correctness of the code is not guaranteed by this reasoning on platforms that differ. Take a platform with binary 64-bit floating-point and a 64-bit one's complement integer type T. On that platform, T_MIN is -0x7fffffffffffffff. The conversion to double of that number rounds down, resulting in -0x1.0p63. On that platform, using your program as written, a d of -0x1.0p63 makes the first three conditions true, resulting in undefined behavior in (T)d, because overflow in the conversion from floating-point to integer is undefined behavior.
If you have access to full IEEE 754 features, there is a shorter solution:
#include <fenv.h>
…
#pragma STDC FENV_ACCESS ON
feclearexcept(FE_INEXACT), f == i && !fetestexcept(FE_INEXACT)
This solution takes advantage of the conversion from integer to floating-point setting the INEXACT flag iff the conversion is inexact (that is, if i is not representable exactly as a double).
The INEXACT flag remains unset and f is equal to (double)i if and only if f and i represent the same mathematical value in their respective types.
This approach requires the compiler to have been warned that the code accesses the FPU's state, normally with #pragma STDC FENV_ACCESS ON, but that pragma is typically unsupported and you have to use a compilation flag instead.
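For concreteness, here is that snippet wrapped into a self-contained function (my arrangement of the fragment above, not from the answer itself; it assumes the implementation honors FENV access and that the int-to-double conversion raises FE_INEXACT exactly when it is inexact, as IEEE 754 requires):

#include <fenv.h>
#include <stdint.h>

#pragma STDC FENV_ACCESS ON

/* Returns 1 iff i and d represent the same mathematical value. */
int int64EqualsDoubleFenv(int64_t i, double d) {
    feclearexcept(FE_INEXACT);
    /* d == i converts i to double, raising FE_INEXACT if that loses bits. */
    return d == i && !fetestexcept(FE_INEXACT);
}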

OP's code has a dependency that can be avoided.
For a successful compare, d must be a whole number, and round(d) == d takes care of that. A NaN d would also fail that test.
d must be mathematically in the range [INT64_MIN ... INT64_MAX]; if the conditions properly ensure that, then the final i == (int64_t)d completes the test.
So the question comes down to comparing INT64 limits with the double d.
Let us assume FLT_RADIX == 2, but not necessarily IEEE 754 binary64.
d >= INT64_MIN is not a problem, as -INT64_MIN is a power of 2 and INT64_MIN converts exactly to a double of the same value, so the >= is exact.
Code would like to perform the mathematical d <= INT64_MAX, but that may not work as written, and so is a problem. INT64_MAX is a "power of 2 minus 1" and may not convert exactly; whether it does depends on whether the precision of double exceeds 63 bits, rendering the compare unreliable. A solution is to halve the comparison: d/2 suffers no precision loss, and INT64_MAX/2 + 1 converts exactly to a double power of 2:
d/2 < (INT64_MAX/2 + 1)
[Edit]
// or simply
d < ((double)(INT64_MAX/2 + 1))*2
Thus, if code does not want to rely on double having less precision than int64_t (something that likely matters with long double), a more portable solution would be:
int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < ((double)(INT64_MAX/2 + 1))*2) // (d/2 < (INT64_MAX/2 + 1))
        && (round(d) == d)
        && (i == (int64_t)d);
}
Note: No rounding mode issues.
[Edit] Deeper limit explanation
Ensuring mathematically that INT64_MIN <= d <= INT64_MAX can be restated as INT64_MIN <= d < (INT64_MAX + 1), as we are dealing with whole numbers. The raw INT64_MAX + 1 cannot be computed directly in code (signed integer overflow is undefined behavior), so an alternative is ((double)(INT64_MAX/2 + 1))*2. This can be extended for rare machines with double of a higher power-of-2 radix to ((double)(INT64_MAX/FLT_RADIX + 1))*FLT_RADIX. The comparison limits being exact powers of 2, conversion to double suffers no precision loss, and (lo_limit <= d) && (d < hi_limit) is exact regardless of the precision of the floating point. Note: a rare floating point with FLT_RADIX == 10 is still a problem.
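A sketch of those radix-aware limits as code (the names lo_limit, hi_limit, and in_int64_range are mine; assumes FLT_RADIX is a power of 2, per the note above):

#include <float.h>
#include <stdint.h>

/* Both limits are exact powers of 2, so the conversions are exact. */
static const double lo_limit = (double)INT64_MIN;
static const double hi_limit = ((double)(INT64_MAX / FLT_RADIX + 1)) * FLT_RADIX;

int in_int64_range(double d) {
    return (lo_limit <= d) && (d < hi_limit);
}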

In addition to Pascal Cuoq's elaborate answer, and given the extra context you give in comments, I would add a test for negative zeros. You should preserve negative zeros unless you have good reasons not to. You need a specific test to avoid converting them to (int64_t)0. With your current proposal, negative zeros will pass your test, get stored as int64_t and read back as positive zeros.
I am not sure what the most efficient way to test for them is; maybe this:
int int64EqualsDouble(int64_t i, double d) {
    return (d >= INT64_MIN)
        && (d < INT64_MAX)
        && (round(d) == d)
        && (i == (int64_t)d)
        && (!signbit(d) || d != 0.0);
}
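A quick sanity check of the negative-zero handling (hypothetical test values of my choosing; assumes the function above plus <assert.h>, <math.h>, and <stdint.h>):

#include <assert.h>

int main(void) {
    assert(int64EqualsDouble(0, 0.0) == 1);   /* +0.0 still matches integer 0 */
    assert(int64EqualsDouble(0, -0.0) == 0);  /* -0.0 is now rejected */
    return 0;
}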

Related

Floating point rounding error moves reported result inside range

I am working on a function to report test results together with the lower and upper limits for that specific test result. These three values are converted with a specified formula (aX + b)/c, where X is testResult/lowerLimit/upperLimit and a, b, and c are floating-point numbers.
If the reported test result is inside/outside the specified limits before the conversion it shall also be inside/outside the limit after the conversion in order to ensure the validity of the reported results.
I have identified two cases where an invalid test result will move inside the range after the conversion but I have yet to find a case where the test result is inside the range before the conversion and will be outside the specified limits after conversion. Can this case even occur? I don't believe so? Can it?
Below is some code that produces the two cases that I mentioned together with corrections to ensure the validity of the reported test result.
TLDR: Can the ((TRUE == insideLimitBefore) && (FALSE == insideLimitAfter)) case in the code below happen?
#include <stdio.h>
#include <stdint.h>

#define TRUE  (uint8_t)0x01
#define FALSE (uint8_t)0x00

int32_t LinearMapping(const int32_t input);
void Convert(int32_t testResult, int32_t lowerLimit, int32_t upperLimit);

int main(void)
{
    int32_t lowerLimit = 504;
    int32_t testResult = 503;
    int32_t upperLimit = 1000;

    printf("INPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit);
    Convert(testResult, lowerLimit, upperLimit);

    lowerLimit = 500;
    testResult = 504;
    upperLimit = 503;

    printf("INPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit);
    Convert(testResult, lowerLimit, upperLimit);

    return 0;
}

int32_t LinearMapping(const int32_t input)
{
    float retVal;
    const float a = 1.0;
    const float b = 1.0;
    const float c = 2.3;

    retVal = a * input;
    retVal += b;
    retVal /= c;

    return (int32_t)retVal;
}

void Convert(int32_t testResult, int32_t lowerLimit, int32_t upperLimit)
{
    uint8_t insideLimitAfter;
    uint8_t belowLowerLimit;
    uint8_t insideLimitBefore = ((lowerLimit <= testResult) && (testResult <= upperLimit)) ? TRUE : FALSE;

    if (FALSE == insideLimitBefore)
    {
        /* testResult is either below or above lowerLimit/upperLimit respectively */
        if (testResult < lowerLimit)
        {
            belowLowerLimit = TRUE;
        }
        else /* testResult > upperLimit */
        {
            belowLowerLimit = FALSE;
        }
    }

    testResult = LinearMapping(testResult);
    lowerLimit = LinearMapping(lowerLimit);
    upperLimit = LinearMapping(upperLimit);

    insideLimitAfter = ((lowerLimit <= testResult) && (testResult <= upperLimit)) ? TRUE : FALSE;

    if ((FALSE == insideLimitBefore) && (TRUE == insideLimitAfter))
    {
        if (TRUE == belowLowerLimit)
        {
            printf("OUTPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit+1, testResult, upperLimit);
        }
        else /* belowLowerLimit == FALSE => testResult > upperLimit */
        {
            printf("OUTPUT:\n\tLower limit:\t%d\t\n\tTest result:\t%d\t\n\tUpper limit:\t%d\t\n", lowerLimit, testResult, upperLimit-1);
        }
    }
    else if ((TRUE == insideLimitBefore) && (FALSE == insideLimitAfter))
    {
        /* Is this case even possible? */
    }
    else
    {
        /* Do nothing */
    }
}
… to find a case where the test result is inside the range before the conversion and will be outside the specified limits after conversion. Can this case even occur?
No, given sane a,b,c, lowerLimit, testResult, upperLimit.
Given 3 values lo, x, hi with lo <= x <= hi before the linear conversion in LinearMapping(), the relationship lo_new <= x_new <= hi_new is maintained afterward, as long as the conversion is (positively) linear (no division by 0; a, b, c are not Not-a-Numbers) and no conversion of a float that is out of range of int32_t occurs.
The primary reason is that on edge cases, with x just inside or at a limit of [lo...hi], LinearMapping() may reduce the effective precision of all 3. The new x may now equal lo or hi, and == favors "in range". So no change in lo <= x <= hi.
OP originally found examples where an "invalid test result will move inside the range after the conversion" because x was just outside [lo...hi] and the effective precision reduction made x equal to either lo or hi. Since == favors "in range", the move from outside to inside is seen.
Note: if LinearMapping() has a negative slope like -1, then lo <= x <= hi can easily be broken, as 1 <= 2 <= 3 --> -1 > -2 > -3. This makes lowerLimit > upperLimit, so "in range" cannot be satisfied for any x.
For reference, OP's code simplified:
#include <stdio.h>
#include <stdint.h>

int LinearMapping(const int input) {
    const float a = 1.0;
    const float b = 1.0;
    const float c = 2.3;
    float retVal = a * input;
    retVal += b;
    retVal /= c;
    return (int) retVal;
}

void Convert(int testResult, int lowerLimit, int upperLimit) {
    printf("Before %d %s %d %s %d\n", lowerLimit,
            lowerLimit <= testResult ? "<=" : "> ", testResult,
            testResult <= upperLimit ? "<=" : "> ", upperLimit);
    testResult = LinearMapping(testResult);
    lowerLimit = LinearMapping(lowerLimit);
    upperLimit = LinearMapping(upperLimit);
    printf("After  %d %s %d %s %d\n\n", lowerLimit,
            lowerLimit <= testResult ? "<=" : "> ", testResult,
            testResult <= upperLimit ? "<=" : "> ", upperLimit);
}

int main(void) {
    Convert(503, 504, 1000);
    Convert(504, 500, 503);
    return 0;
}
Output
Before 504 > 503 <= 1000
After 219 <= 219 <= 435
Before 500 <= 504 > 503
After 217 <= 219 <= 219
… I have yet to find a case where the test result is inside the range before the conversion and will be outside the specified limits after conversion. Can this case even occur? I don't believe so? Can it?
Yes, this can occur in theory, although due to C semantics rather than the underlying floating-point operations. The C standard does not guarantee that IEEE-754 floating-point arithmetic is used, or even that the same precision is used throughout the evaluation of an expression, and this can cause identical inputs to an expression to produce different results.
Although LinearMapping is shown as a single routine, the compiler may inline it. That is, where the routine is called, the compiler may replace the call with the body of the routine. Furthermore, when it does this in different places, it may evaluate the expressions using different methods. So, in this code, LinearMapping may be evaluated using different floating-point operations in each call:
testResult = LinearMapping(testResult);
lowerLimit = LinearMapping(lowerLimit);
upperLimit = LinearMapping(upperLimit);
This means that (a * testResult + b) / c might be evaluated using only 32-bit floating-point arithmetic, while (a * upperLimit + b) / c might be evaluated using 64-bit floating-point arithmetic, converted to 32-bit after the division. (I have merged your three assignment statements into one expression here for brevity. The issue applies either way.)
One consequence of this is double rounding. When a result is calculated with one precision and then converted to another, two roundings occur, one in the initial calculation and a second in the conversion. Consider an exact mathematical result that looks like this:
1.xxxxx01011111111xxxx1xx
         ^       ^ Start of bits to be rounded in wider format.
         | Start of bits to be rounded in narrower format.
If this were the result of a calculation in the narrower format, we would examine the bits 011111111xxxx1xx and round them down (they are less than ½ at the position where we are rounding), so the final result would be 1.xxxxx01. However, if we do the calculation in the wider format first, the bits to be rounded away are 1xxxx1xx (more than ½), and these are rounded up, making the intermediate result 1.xxxxx0110000000. When we convert to the narrower format, the bits to be rounded away are 10000000, which is exactly the midpoint (½), so the round-to-nearest-ties-to-even rule tells us to round up, which makes the final result 1.xxxxx10.
Thus, even if testResult and upperLimit are equal, the results of applying LinearMapping to them may be unequal, and testResult may appear to be outside the interval.
There may be some ways to avoid this problem:
If your C implementation conforms to Annex F of the C standard (which essentially says that it uses IEEE-754 operations and binds them to C operators in expected ways) or at least conforms to certain parts of it, then double rounding should not occur with well written source code.
The C standard says an implementation should define FLT_EVAL_METHOD in <float.h>. If FLT_EVAL_METHOD is 0, it means all floating-point operations and constants are evaluated with their nominal type. In this case, double rounding will not occur if you use a single floating-point type in your source code. If FLT_EVAL_METHOD is 1, float operations are evaluated with double. In this case, you could avoid double rounding by using double instead of float. If it is 2, operations are evaluated with long double, and you could avoid double rounding by using long double. If FLT_EVAL_METHOD is -1, the floating-point format used for evaluation is indeterminable, so double rounding is a concern.
For specific values or intervals of values and/or known floating-point formats, it might be possible to prove that double rounding does not occur. E.g., given that your inputs are all int32_t, that the linear mapping parameters are certain values, and that only 32-bit or 64-bit IEEE-754 binary floating-point is in use, it might be possible to prove double rounding does not occur.
Even if your implementation conforms to Annex F or defines FLT_EVAL_METHOD to be non-negative, you must still take care in your code not to use expressions that have type double which you then assign to objects of type float. This will cause double rounding because your source code explicitly requires it, rather than because C is lax about floating-point.
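A quick way to see which evaluation method your implementation claims (a minimal sketch; FLT_EVAL_METHOD comes from <float.h> in C99, though some compilers report -1):

#include <float.h>
#include <stdio.h>

int main(void) {
    /* 0: nominal types; 1: float ops evaluated as double;
       2: everything as long double; -1: indeterminable. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}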
As a specific example, consider (1.0 * 13546 + 1.0) / 2.3. If the floating-point constants are represented in 64-bit binary floating-point (53-bit significands) and the expression is evaluated in 64-bit binary, the result is 5890.0000000000009094947017729282379150390625. However, if the same constants are used (in 64-bit binary) but the expression is evaluated with Intel’s 80-bit binary (64-bit significands) and then converted to 64-bit, the result is 5890.
In this case, the exact mathematical quotient is:
1.01110000001000000000000000000000000000000000000000001000000000001011…
                                                      ^ Bits rounded away in double.
If we round this in double, we can see the bits to be rounded away, 1000000000001011…, are greater than ½ at the rounding position, so we round up. If we round it in long double, the bits to be rounded away are 01011…. These round down, leaving:
1.011100000010000000000000000000000000000000000000000010000000000
                                                      ^ Bits rounded away in double.
Now, when we round to double, the bits to be rounded away are 10000000000, which is the midpoint. The rule says to round to make the low bit even, so the result is:
1.0111000000100000000000000000000000000000000000000000
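A sketch that reproduces the effect on x86, where long double is Intel's 80-bit format (the variable names are mine; on platforms where long double has the same format as double, both lines print the same value):

#include <stdio.h>

int main(void) {
    /* The constants are doubles in both cases; only the evaluation width differs. */
    double a = 1.0, b = 1.0, c = 2.3;
    double direct = (a * 13546 + b) / c;                        /* 64-bit throughout   */
    double via_ld = (double)(((long double)a * 13546 + b) / c); /* 80-bit, then 64-bit */
    printf("direct: %.17g\nvia long double: %.17g\n", direct, via_ld);
    return 0;
}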

Compare floating point numbers as integers

Can two floating point values (IEEE 754 binary64) be compared as integers? E.g.:

long long a = * (long long *) ptr_to_double1,
          b = * (long long *) ptr_to_double2;
if (a < b) {...}
assuming the size of long long and double is the same.
YES - Comparing the bit-patterns for two floats as if they were integers (aka "type-punning") produces meaningful results under some restricted scenarios...
Identical to floating-point comparison when:
Both numbers are positive, positive-zero, or positive-infinity.
One positive and one negative number, and you are using a signed integer comparison.
Inverse of floating-point comparison when:
Both numbers are negative, negative-zero, or negative-infinity.
One positive and one negative number, and you are using a unsigned integer comparison.
Not comparable to floating-point comparison when:
Either number is one of the NaN values - Floating point comparisons with a NaN always returns false, and this simply can't be modeled in integer operations where exactly one of the following is always true: (A < B), (A == B), (B < A).
Negative floating-point numbers are a bit funky because they are handled very differently than in the two's complement arithmetic used for integers. Doing an integer +1 on the representation of a negative float makes it a bigger negative number.
With a little bit manipulation, you can make both positive and negative floats comparable with integer operations (this can come in handy for some optimizations):
int32_t float_to_comparable_integer(float f) {
    uint32_t bits = std::bit_cast<uint32_t>(f);  // needs <bit> (C++20) and <cstdint>
    const uint32_t sign_bit = bits & 0x80000000ul;
    // Modern compilers turn this IF-statement into a conditional move (CMOV) on x86,
    // which is much faster than a branch that the cpu might mis-predict.
    if (sign_bit) {
        bits = 0x7FFFFFFFul - bits;  // reverse the ordering of the negative range
    }
    return static_cast<int32_t>(bits);
}
Again, this does not work for NaN values, which always return false from comparisons, and have multiple valid bit representations:
Signaling NaNs (w/ sign bit): Anything between 0xFF800001, and 0xFFBFFFFF.
Signaling NaNs (w/o sign bit): Anything between 0x7F800001, and 0x7FBFFFFF.
Quiet NaNs (w/ sign bit): Anything between 0xFFC00000, and 0xFFFFFFFF.
Quiet NaNs (w/o sign bit): Anything between 0x7FC00000, and 0x7FFFFFFF.
IEEE-754 bit format: http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm
More on Type-Punning: https://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/
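Since the surrounding questions are about C, here is a sketch of the same trick in C, using memcpy for the type-pun instead of std::bit_cast (the function name is mine; assumes IEEE 754 binary32 float and two's complement int32_t):

#include <stdint.h>
#include <string.h>

/* Maps a float's bits to an int32_t whose signed ordering matches the float
   ordering. NaNs are still excluded, and -0.0f maps just below +0.0f. */
int32_t float_key(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if (bits & 0x80000000u) {
        bits = 0x7FFFFFFFu - bits;  /* reverse the ordering of the negative range */
    }
    return (int32_t)bits;
}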
No. Two floating point values (IEEE 754 binary64) cannot compare simply as integers with if (a < b).
IEEE 754 binary64
The order of the values of double is not the same as the order of integers (unless you are on a rare sign-magnitude machine). Think positive vs. negative numbers.
double has values like 0.0 and -0.0 which have the same value but different bit patterns.
double has "Not-a-number"s that do not compare like their binary equivalent integer representation.
If both double values were x > 0 and not "Not-a-number", and endianness, aliasing, alignment, etc. were not an issue, OP's idea would work.
Alternatively, a more complex if() ... condition would work - see below
[non-IEEE 754 binary64]
Some double implementations use an encoding in which there are multiple representations of the same value. This would differ from an "integer" compare.
Tested code: needs 2's complement, same endian for double and the integers, does not account for NaN.
int compare(double a, double b) {
    union {
        double d;
        int64_t i64;
        uint64_t u64;
    } ua, ub;
    ua.d = a;
    ub.d = b;
    // Cope with -0.0 right away
    if (ua.u64 == 0x8000000000000000) ua.u64 = 0;
    if (ub.u64 == 0x8000000000000000) ub.u64 = 0;
    // Signs differ?
    if ((ua.i64 < 0) != (ub.i64 < 0)) {
        return ua.i64 >= 0 ? 1 : -1;
    }
    // If numbers are negative
    if (ua.i64 < 0) {
        ua.u64 = -ua.u64;
        ub.u64 = -ub.u64;
    }
    return (ua.u64 > ub.u64) - (ua.u64 < ub.u64);
}
Thanks to @David C. Rankin for a correction.
Thanks to #David C. Rankin for a correction.
Test code
void testcmp(double a, double b) {
    int t1 = (a > b) - (a < b);
    int t2 = compare(a, b);
    if (t1 != t2) {
        printf("%le %le %d %d\n", a, b, t1, t2);
    }
}

#include <float.h>
void testcmps() {
    // Various interesting `double`
    static const double a[] = {
        -1.0 / 0.0, -DBL_MAX, -1.0, -DBL_MIN, -0.0,
        +0.0, DBL_MIN, 1.0, DBL_MAX, +1.0 / 0.0 };
    int n = sizeof a / sizeof a[0];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            testcmp(a[i], a[j]);
        }
    }
    puts("!");
}
If you reinterpret the bit pattern of a floating-point number as its correspondingly-sized signed integer (as you've done), then signed integer comparison of the results matches the comparison of the original floating-point values for representable finite and infinite numeric values, with two caveats: NaN values must be excluded, and, as the other answers note, the order of two negative values comes out reversed.
In other words, for double-precision (64 bits), the NaN exclusion holds if the following test passes:
long long exponentMask = 0x7ff0000000000000;
long long mantissaMask = 0x000fffffffffffff;

bool isNumber = ((x & exponentMask) != exponentMask)  // Not exp 0x7ff
             || ((x & mantissaMask) == 0);            // Infinities
for each operand x.
Of course, if you can pre-qualify your floating-point values, then a quick isNaN() test would be much more clear. You'd have to profile to understand performance implications.
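As a self-contained sketch of that screen in C (the function name is mine; assumes IEEE 754 binary64 and a 64-bit unsigned type for the bits):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if d encodes a finite number or an infinity, false for any NaN. */
bool is_fp_number(double d) {
    uint64_t x;
    memcpy(&x, &d, sizeof x);
    const uint64_t exponentMask = UINT64_C(0x7ff0000000000000);
    const uint64_t mantissaMask = UINT64_C(0x000fffffffffffff);
    return ((x & exponentMask) != exponentMask)  /* exponent not all-ones */
        || ((x & mantissaMask) == 0);            /* all-ones exponent, zero mantissa: infinity */
}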
There are two parts to your question:
Can two floating point numbers be compared? The answer to this is yes. It is perfectly valid to compare the sizes of floating-point numbers. Generally you want to avoid equals comparisons due to truncation issues, but
if (a < b)
will work just fine.
Can two floating point numbers be compared as integers? This answer is also yes, but this will require casting. This question should help with that answer: convert from long long to int and the other way back in c++

float vs double comparison [duplicate]

#include <stdio.h>

int main(void)
{
    float me = 1.1;
    double you = 1.1;

    if ( me == you ) {
        printf("I love U");
    } else {
        printf("I hate U");
    }
}
This prints "I hate U". Why?
Floats use binary fractions. If you convert 1.1 to a float, the result is a binary representation.
Each bit right of the binary point halves the weight of the digit, just as for decimal it divides by ten; bits left of the point double it (times ten for decimal):
in decimal: ... 0*2 + 1*1 + 0*0.5 + 0*0.25 + 0*0.125 + 1*0.0625 + ...
binary:         0     1   . 0       0        0          1        ...
2's exp:        1     0     -1      -2       -3         -4
(each digit's weight is 2 raised to the exponent shown)
The problem is that 1.1 cannot be converted exactly to a binary representation. double, however, has more significant digits than float.
If you compare the values, the float is converted to double first. But as the computer does not know about the original decimal value, it simply fills the trailing digits of the new double with all 0, while the double converted directly from 1.1 is more precise. So the two compare unequal.
This is a common pitfall when using floats. For this and other reasons (e.g. rounding errors), you should not use exact comparison for equal/unequal, but a ranged compare using the smallest value different from 0:
#include "float.h"
...
// check for "almost equal"
if ( fabs(fval - dval) <= FLT_EPSILON )
...
Note the usage of FLT_EPSILON, which is the aforementioned value for single-precision float. Also note the <=, not <, as the latter would effectively require an exact match.
If you compare two doubles, you might use DBL_EPSILON, but be careful with that.
Depending on intermediate calculations, the tolerance has to be increased (you cannot reduce it further than epsilon), as rounding errors, etc. will sum up. Floats in general are not forgiving with wrong assumptions about precision, conversion and rounding.
Edit:
As suggested by @chux, this might not work as expected for larger values, as you have to scale the epsilon according to the exponents. This conforms to what I stated: float comparison is not as simple as integer comparison. Think about it before comparing.
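A sketch of such scaling (my illustration, not from the answer above; the factor of 4 is an arbitrary slack allowance for a few accumulated rounding steps):

#include <float.h>
#include <math.h>
#include <stdbool.h>

/* Tolerance grows with the magnitude of the operands. */
bool nearly_equal(double a, double b) {
    double scale = fmax(fabs(a), fabs(b));
    return fabs(a - b) <= 4 * DBL_EPSILON * scale;
}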
In short, you should NOT use == to compare floating-point values.
For example:

float i = 1.1; // or double
float j = 1.1; // or double

This assertion

(i == j) == true // is not always valid

For a correct comparison you should use an epsilon (a very small number):

(fabs(i - j) < epsilon) == true // this argument is valid
The question simplifies to why do me and you have different values?
Usually, C floating point is based on a binary representation. Many compilers and much hardware follow IEEE 754 binary32 and binary64. Rare machines use a decimal, base-16, or other floating-point representation.
OP's machine certainly does not represent 1.1 exactly as 1.1, but as the nearest representable floating-point number.
Consider the below which prints out me and you to high precision. The previous representable floating point numbers are also shown. It is easy to see me != you.
#include <math.h>
#include <stdio.h>

int main(void) {
    float me = 1.1;
    double you = 1.1;
    printf("%.50f\n", nextafterf(me, 0)); // previous float value
    printf("%.50f\n", me);
    printf("%.50f\n", nextafter(you, 0)); // previous double value
    printf("%.50f\n", you);
}

Output:

1.09999990463256835937500000000000000000000000000000
1.10000002384185791015625000000000000000000000000000
1.09999999999999986677323704498121514916419982910156
1.10000000000000008881784197001252323389053344726562
But it is more complicated: C allows code to use higher precision for intermediate calculations, depending on FLT_EVAL_METHOD. So on another machine, where FLT_EVAL_METHOD == 1 (evaluate all FP to double), the compare test may pass.
Comparing for exact equality is rarely used in floating-point code, aside from comparison with 0.0. More often an ordered compare a < b is used. Comparing for approximate equality involves another parameter to control how near. @R.. has a good answer on that.
Because you are comparing two floating-point values!
Floating-point comparison is not exact because of rounding errors. Simple values like 1.1 cannot be precisely represented using binary floating-point numbers, and the limited precision of floating-point numbers means that slight changes in the order of operations can change the result. Different compilers and CPU architectures store temporary results at different precisions, so results will differ depending on the details of your environment. For example:

double a = 0.15 + 0.15;
double b = 0.1 + 0.2;

if (a == b)  // can be false!
if (a >= b)  // can also be false!

Even

if (fabs(a - b) < 0.0001) // wrong - don't do this

is a bad way to do it, because a fixed epsilon (0.0001) chosen because it "looks small" could actually be way too large when the numbers being compared are very small as well.
I personally use the following method; maybe it will help you:
#include <iostream>  // std::cout
#include <cmath>     // std::abs
#include <algorithm> // std::min
using namespace std;

#define MIN_NORMAL 1.17549435E-38f
#define MAX_VALUE 3.4028235E38f

bool nearlyEqual(float a, float b, float epsilon) {
    float absA = std::abs(a);
    float absB = std::abs(b);
    float diff = std::abs(a - b);

    if (a == b) {
        return true;
    } else if (a == 0 || b == 0 || diff < MIN_NORMAL) {
        return diff < (epsilon * MIN_NORMAL);
    } else {
        return diff / std::min(absA + absB, MAX_VALUE) < epsilon;
    }
}
This method passes tests for many important special cases, for different a, b and epsilon.
And don't forget to read What Every Computer Scientist Should Know About Floating-Point Arithmetic!

Implement ceil() in C

I want to implement my own ceil() in C. I searched through the libraries for source code and found it here, but it seems pretty difficult to understand. I want clean and elegant code.
I also searched on SO and found some answers here. None of the answers seems to be correct. One of them is:
#define CEILING_POS(X) ((X-(int)(X)) > 0 ? (int)(X+1) : (int)(X))
#define CEILING_NEG(X) ((X-(int)(X)) < 0 ? (int)(X-1) : (int)(X))
#define CEILING(X) ( ((X) > 0) ? CEILING_POS(X) : CEILING_NEG(X) )
AFAIK, the return type of ceil() is not int. Will a macro be type-safe here?
Further, will the above implementation work for negative numbers?
What will be the best way to implement it?
Can you provide the clean code?
The macro you quoted definitely won't work correctly for numbers that are greater than INT_MAX but which can still be represented exactly as a double.
The only way to implement ceil() correctly (assuming you can't implement it using an equivalent assembly instruction) is to do bit-twiddling on the binary representation of the floating point number, as is done in the s_ceil.c source file behind your first link. Understanding how the code works requires an understanding of the floating point representation of the underlying platform -- the representation is most probably going to be IEEE 754 -- but there's no way around this.
Edit:
Some of the complexities in s_ceil.c stem from the special cases it handles (NaNs, infinities) and the fact that it needs to do its work without being able to assume that a 64-bit integral type exists.
The basic idea of all the bit-twiddling is to mask off the fractional bits of the mantissa and add 1 to it if the number is greater than zero... but there's a bit of additional logic involved as well to make sure you do the right thing in all cases.
Here's an illustrative version of ceil() for floats that I cobbled together. Beware: this does not handle the special cases correctly and is not tested extensively, so don't actually use it. It does, however, serve to illustrate the principles involved in the bit-twiddling. I've tried to comment the routine extensively, but the comments assume that you understand how floating-point numbers are represented in IEEE 754 format.
union float_int
{
    float f;
    int i;
};

float myceil(float x)
{
    union float_int val;
    val.f = x;

    // Extract sign, exponent and mantissa
    // Bias is removed from exponent
    int sign = val.i >> 31;
    int exponent = ((val.i & 0x7fffffff) >> 23) - 127;
    int mantissa = val.i & 0x7fffff;

    // Is the exponent less than zero?
    if (exponent < 0)
    {
        // In this case, x is in the open interval (-1, 1)
        if (x <= 0.0f)
            return 0.0f;
        else
            return 1.0f;
    }
    else
    {
        // Construct a bit mask that will mask off the
        // fractional part of the mantissa
        int mask = 0x7fffff >> exponent;

        // Is x already an integer (i.e. are all the
        // fractional bits zero?)
        if ((mantissa & mask) == 0)
            return x;
        else
        {
            // If x is positive, we need to add 1 to it
            // before clearing the fractional bits
            if (!sign)
            {
                mantissa += 1 << (23 - exponent);

                // Did the mantissa overflow?
                if (mantissa & 0x800000)
                {
                    // The mantissa can only overflow if all the
                    // integer bits were previously 1 -- so we can
                    // just clear out the mantissa and increment
                    // the exponent
                    mantissa = 0;
                    exponent++;
                }
            }

            // Clear the fractional bits
            mantissa &= ~mask;
        }
    }

    // Put sign, exponent and mantissa together again
    val.i = (sign << 31) | ((exponent + 127) << 23) | mantissa;

    return val.f;
}
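Hypothetical spot checks comparing the routine against the library's ceilf() (my harness, not part of the answer; assumes IEEE 754 binary32 floats, as the routine itself does):

#include <math.h>
#include <stdio.h>

int main(void) {
    float tests[] = { 0.5f, -0.5f, 1.0f, -1.5f, 123.25f, 8388607.5f };
    int n = sizeof tests / sizeof tests[0];
    for (int i = 0; i < n; i++) {
        printf("myceil(%g) = %g, ceilf(%g) = %g\n",
               tests[i], myceil(tests[i]), tests[i], ceilf(tests[i]));
    }
    return 0;
}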
Nothing you will write is more elegant than using the standard library implementation. No code at all is always more elegant than elegant code.
That aside, this approach has two major flaws:
If X is greater than INT_MAX + 1 or less than INT_MIN - 1, the behavior of your macro is undefined. This means that your implementation may give incorrect results for nearly half of all floating-point numbers. You will also raise the invalid flag, contrary to IEEE-754.
It gets the edge cases for -0, +/-infinity, and nan wrong. In fact, the only edge case it gets right is +0.
You can implement ceil in a manner similar to what you tried, like so (this implementation assumes IEEE-754 double precision):

#include <math.h>

double ceil(double x) {
    // All floating-point numbers larger than 2^52 are exact integers, so we
    // simply return x for those inputs. We also handle ceil(nan) = nan here.
    if (isnan(x) || fabs(x) >= 0x1.0p52) return x;

    // Now we know that |x| < 2^52, and therefore we can use conversion to
    // long long to force truncation of x without risking undefined behavior.
    const double truncation = (long long)x;

    // If the truncation of x is smaller than x, then it is one less than the
    // desired result. If it is greater than or equal to x, it is the result.
    // Adding one cannot produce a rounding error because `truncation` is an
    // integer smaller than 2^52.
    const double ceiling = truncation + (truncation < x);

    // Finally, we need to patch up one more thing; the standard specifies that
    // ceil(-small) be -0.0, whereas we will have 0.0 right now. To handle this
    // correctly, we apply the sign of x to the result.
    return copysign(ceiling, x);
}
Something like that is about as elegant as you can get and still be correct.
I flagged a number of concerns with the (generally good!) implementation that Martin put in his answer. Here's how I would implement his approach:
#include <stdint.h>
#include <string.h>

static inline uint64_t toRep(double x) {
    uint64_t r;
    memcpy(&r, &x, sizeof x);
    return r;
}

static inline double fromRep(uint64_t r) {
    double x;
    memcpy(&x, &r, sizeof x);
    return x;
}

double ceil(double x) {
    const uint64_t signbitMask     = UINT64_C(0x8000000000000000);
    const uint64_t significandMask = UINT64_C(0x000fffffffffffff);

    const uint64_t xrep = toRep(x);
    const uint64_t xabs = xrep & ~signbitMask;

    // If |x| is larger than 2^52 or x is NaN, the result is just x.
    if (xabs >= toRep(0x1.0p52)) return x;

    if (xabs < toRep(1.0)) {
        // If x is in (-1.0, 0.0], the result is copysign(0.0, x).
        // We can generate this value by clearing everything except the signbit.
        if (x <= 0.0) return fromRep(xrep & signbitMask);
        // Otherwise x is in (0.0, 1.0), and the result is 1.0.
        else return 1.0;
    }

    // Now we know that the unbiased exponent of x is strictly in the range
    // [0, 51], which means that x contains both integral and fractional bits.
    // We generate a mask covering the fractional bits.
    const int exponent = (int)(xabs >> 52) - 1023;
    const uint64_t fractionalBits = significandMask >> exponent;

    // If x is negative, we want to truncate, so we simply mask off the
    // fractional bits.
    if (xrep & signbitMask) return fromRep(xrep & ~fractionalBits);

    // x is positive; to force rounding to go away from zero, we first *add*
    // the fractionalBits to x, then truncate the result. The add may
    // overflow the significand into the exponent, but this produces the
    // desired result (zero significand, incremented exponent), so we just
    // let it happen.
    return fromRep((xrep + fractionalBits) & ~fractionalBits);
}
One thing to note about this approach is that it does not raise the inexact floating-point flag for non-integral inputs. That may or may not be a concern for your usage. The first implementation that I listed does raise the flag.
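A small sketch of observing that flag (my illustration; assumes FENV access is honored and that ceil here is the first, truncation-based implementation, which raises FE_INEXACT on typical hardware):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    feclearexcept(FE_INEXACT);
    volatile double r = ceil(1.5);  /* volatile discourages constant folding */
    printf("result %g, FE_INEXACT %s\n", r,
           fetestexcept(FE_INEXACT) ? "raised" : "not raised");
    return 0;
}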
I don't think a macro function is a good solution: it isn't type-safe, and it evaluates its argument more than once (side effects). You should rather write a clean and elegant function.

As I would have expected more jokes in the answers, I will try a couple:

#define CEILING(X) ceil(X)

Bonus: a macro with not so many side effects.
If you don't care too much about negative zeros:

#define CEILING(X) (-floor(-(X)))

If you care about negative zero, then:

#define CEILING(X) (NEGATIVE_ZERO - floor(-(X)))

Portable definition of NEGATIVE_ZERO left as an exercise....
Bonus: it will also set FP flags (OVERFLOW, INVALID, INEXACT).

Checking if float is an integer

How can I check if a float variable contains an integer value? So far, I've been using:
float f = 4.5886;
if (f - (int)f == 0)
    printf("yes\n");
else
    printf("no\n");
But I wonder if there is a better solution, or if this one has any (or many) drawbacks.
Apart from the fine answers already given, you can also use ceilf(f) == f or floorf(f) == f. Both expressions return true if f is an integer. They also return false for NaNs (NaNs always compare unequal) and true for ±infinity, and they don't have the problem of overflowing the integer type used to hold the truncated result, because floorf()/ceilf() return floats.
Keep in mind that most of the techniques here are valid presuming that round-off error due to prior calculations is not a factor. E.g. you could use roundf, like this:
float z = 1.0f;

if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
The problem with this and other similar techniques (such as ceilf, casting to long, etc.) is that, while they work great for whole number constants, they will fail if the number is a result of a calculation that was subject to floating-point round-off error. For example:
float z = powf(powf(3.0f, 0.05f), 20.0f);

if (roundf(z) == z) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
Prints "fraction", even though (31/20)20 should equal 3, because the actual calculation result ended up being 2.9999992847442626953125.
Any similar method, be it fmodf or whatever, is subject to this. In applications that perform complex or rounding-prone calculations, usually what you want to do is define some "tolerance" value for what constitutes a "whole number" (this goes for floating-point equality comparisons in general). We often call this tolerance epsilon. For example, let's say that we'll forgive the computer for up to ±0.00001 of rounding error. Then, if we are testing z, we can choose an epsilon of 0.00001 and do:
if (fabsf(roundf(z) - z) <= 0.00001f) {
    printf("integer\n");
} else {
    printf("fraction\n");
}
You don't really want to use ceilf here because e.g. ceilf(1.0000001) is 2 not 1, and ceilf(-1.99999999) is -1 not -2.
You could use rintf in place of roundf if you prefer.
Choose a tolerance value that is appropriate for your application (and yes, sometimes zero tolerance is appropriate). For more information, check out this article on comparing floating-point numbers.
if (fmod(f, 1) == 0.0) {
    ...
}
Don't forget math.h and libm.
The standard library's modff(float x, float *ipart) splits a float into integral and fractional parts; check whether the return value (the fractional part) == 0.
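A minimal sketch of that (the variable names are mine; modff is the float variant declared in <math.h>):

#include <math.h>
#include <stdio.h>

int main(void) {
    float f = 4.5886f;
    float ipart;
    float frac = modff(f, &ipart);  /* f == ipart + frac */
    printf("%s\n", (frac == 0.0f) ? "integer" : "fraction");
    return 0;
}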
if (f <= LONG_MIN || f >= LONG_MAX || f == (long)f) /* it's an integer */
This deals with computational round-off. You set the epsilon as desired:
bool IsInteger(float value)
{
    return fabs(ceilf(value) - value) < EPSILON;
}
I'm not 100% sure but when you cast f to an int, and subtract it from f, I believe it is getting cast back to a float. This probably won't matter in this case, but it could present problems down the line if you are expecting that to be an int for some reason.
I don't know if it's a better solution per se, but you could use modulus math instead. Note that C's % operator is not defined for floating-point types, so use fmodf() from <math.h>, for example:

#include <math.h>
#include <stdbool.h>

float f = 4.5886;
bool isInt = (fmodf(f, 1.0f) == 0.0f);

isInt should be true if everything to the right of the decimal point is zero, and false otherwise.
/* Assumes IEEE 754 binary32 float and round-to-nearest: adding and then
   subtracting 2^23 discards any fractional bits, since every float with
   magnitude >= 2^23 is already a whole number. */
#define twop23 (0x1.0p+23f)
#define ABS(x) (fabsf(x))
#define isFloatInteger(x) ((ABS(x) >= twop23) || (((ABS(x) + twop23) - twop23) == ABS(x)))
