Usually nextafter is implemented in the following way:
double nextafter(double x, double y)
{
// handle corner cases
int delta = ((x > 0) == (x < y)) ? 1 : -1;
unsigned long long mant = __mant(x); // get mantissa as int
mant += delta;
...
}
Here, a binary representation is obtained using __mant(x).
Out of curiosity: is it possible to implement nextafter without obtaining a binary representation? For example, using a sequence of arithmetic floating point operations.
The code below implements nextafter in the ascending direction for finite values for IEEE-754 arithmetic with round-to-nearest-ties-to-even. Handling of NaNs, infinities, and the descending direction is obvious.
Without assuming IEEE-754 or round-to-nearest, the floating-point properties are sufficiently well characterized by C 2018 5.2.4.2.2 that we can implement nextafter (again in the ascending direction) this way:
If the input is a NaN, return it, and report an error if it is a signaling NaN.
If the input is −∞, return -DBL_MAX.
If the input is -DBL_TRUE_MIN, return zero.
If the input is zero, return +DBL_TRUE_MIN.
If the input is +DBL_MAX, return +∞.
If the input is +∞, return +∞. (Note this never occurs with a full nextafter(x, y) implementation, as it moves the first argument in the direction of the second argument, so we never ascend from +∞ because we never receive a second argument greater than +∞.)
Otherwise, if it is positive, use logb to find the exponent e. If e is less than DBL_MIN_EXP, return the input plus DBL_TRUE_MIN (the ULP of subnormals and the lowest normals). If e is not less than DBL_MIN_EXP, return the input plus scalb(1, e + 1 - DBL_MANT_DIG) (the specific ULP for the input). The rounding method is irrelevant as these additions are exact.
Otherwise, the input is negative. Use the above except if the input is exactly a power of FLT_RADIX (the input equals scalb(1, e)), decrement the second argument of scalb by one (because this nextafter step transitions from a greater exponent to a lower one).
Note that FLT_RADIX is correct above; there is no DBL_RADIX; all floating-point formats use the same radix.
If you want to consider logb and scalb as functions that manipulate the floating-point representation, then they could be replaced by ordinary arithmetic. log could find a quick approximation that could be quickly refined to the true exponent, and scalb can be implemented in a variety of ways, possibly simply a table look-up. If log remains objectionable, then trial comparisons would suffice.
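As a rough illustration of the "trial comparisons" idea, here is a minimal sketch that finds the exponent of a positive finite value using only comparisons, multiplication and division. The name exponent_by_comparison is hypothetical, and the loops are deliberately naive (linear in the exponent); it assumes that scaling a power of FLT_RADIX by FLT_RADIX is exact, which holds for the usual binary formats.

```c
#include <float.h>

/* Hypothetical helper, not from the answer above: returns e such that
   FLT_RADIX^e <= x < FLT_RADIX^(e+1), for positive finite x.
   Only comparisons, multiplication and division are used. */
int exponent_by_comparison(double x)
{
    int e = 0;
    double p = 1.0;                 /* invariant: p == FLT_RADIX^e */

    while (x >= p * FLT_RADIX) {    /* x is at least one radix step above p */
        p *= FLT_RADIX;
        ++e;
    }
    while (x < p) {                 /* x is below p: step down */
        p /= FLT_RADIX;
        --e;
    }
    return e;
}
```

A faster variant could start from a log-based guess or use a binary search over a table of powers; the linear loops are only meant to show that no access to the representation is needed.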
The above handles formats with or without subnormals because, if subnormals are supported, it steps into them with the decrement, and, if subnormals are not supported, the minimum normal magnitude is DBL_TRUE_MIN, so it is recognized in the above as the point where we step to zero next.
There is one caveat; the C standard allows it to be “indeterminable” whether an implementation supports subnormals or not “if floating-point operations do not consistently interpret subnormal representations as zero, nor as nonzero.” In that case, I do not see that the standard specifies what the standard nextafter does, so there is nothing for us to do to match it in our implementation. Supposing that subnormals are sometimes supported, DBL_TRUE_MIN must be a subnormal value, and the above will attempt to work as if subnormal support is currently on (e.g., flush-to-zero is off) and, if it is off, you will get whatever you get.
#include <float.h>
#include <math.h>
/* Return the next floating-point value after the finite value q.
This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
05.12_, Faculty for Information and Communication Sciences, Hamburg
University of Technology, November 13, 2005.
IEEE-754 and the default rounding mode,
round-to-nearest-ties-to-even, may be required.
*/
double NextAfter(double q)
{
/* Scale is .625 ULP, so multiplying it by any significand in [1, 2)
yields something in [.625 ULP, 1.25 ULP].
*/
static const double Scale = 0.625 * DBL_EPSILON;
/* Either of the following may be used, according to preference and
performance characteristics. In either case, use a fused multiply-add
(fma) to add to q a number that is in [.625 ULP, 1.25 ULP]. When this
is rounded to the floating-point format, it must produce the next
number after q.
*/
#if 0
// SmallestPositive is the smallest positive floating-point number.
static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
if (fabs(q) < 2*DBL_MIN)
return q + SmallestPositive;
return fma(fabs(q), Scale, q);
#else
return fma(fmax(fabs(q), DBL_MIN), Scale, q);
#endif
}
#if defined CompileMain
#include <stdio.h>
#include <stdlib.h>
#define NumberOf(a) (sizeof (a) / sizeof *(a))
int main(void)
{
int status = EXIT_SUCCESS;
static const struct { double in, out; } cases[] =
{
{ INFINITY, INFINITY },
{ 0x1.fffffffffffffp1023, INFINITY },
{ 0x1.ffffffffffffep1023, 0x1.fffffffffffffp1023 },
{ 0x1.ffffffffffffdp1023, 0x1.ffffffffffffep1023 },
{ 0x1.ffffffffffffcp1023, 0x1.ffffffffffffdp1023 },
{ 0x1.0000000000003p1023, 0x1.0000000000004p1023 },
{ 0x1.0000000000002p1023, 0x1.0000000000003p1023 },
{ 0x1.0000000000001p1023, 0x1.0000000000002p1023 },
{ 0x1.0000000000000p1023, 0x1.0000000000001p1023 },
{ 0x1.fffffffffffffp1022, 0x1.0000000000000p1023 },
{ 0x1.fffffffffffffp1, 0x1.0000000000000p2 },
{ 0x1.ffffffffffffep1, 0x1.fffffffffffffp1 },
{ 0x1.ffffffffffffdp1, 0x1.ffffffffffffep1 },
{ 0x1.ffffffffffffcp1, 0x1.ffffffffffffdp1 },
{ 0x1.0000000000003p1, 0x1.0000000000004p1 },
{ 0x1.0000000000002p1, 0x1.0000000000003p1 },
{ 0x1.0000000000001p1, 0x1.0000000000002p1 },
{ 0x1.0000000000000p1, 0x1.0000000000001p1 },
{ 0x1.fffffffffffffp-1022, 0x1.0000000000000p-1021 },
{ 0x1.ffffffffffffep-1022, 0x1.fffffffffffffp-1022 },
{ 0x1.ffffffffffffdp-1022, 0x1.ffffffffffffep-1022 },
{ 0x1.ffffffffffffcp-1022, 0x1.ffffffffffffdp-1022 },
{ 0x1.0000000000003p-1022, 0x1.0000000000004p-1022 },
{ 0x1.0000000000002p-1022, 0x1.0000000000003p-1022 },
{ 0x1.0000000000001p-1022, 0x1.0000000000002p-1022 },
{ 0x1.0000000000000p-1022, 0x1.0000000000001p-1022 },
{ 0x0.fffffffffffffp-1022, 0x1.0000000000000p-1022 },
{ 0x0.ffffffffffffep-1022, 0x0.fffffffffffffp-1022 },
{ 0x0.ffffffffffffdp-1022, 0x0.ffffffffffffep-1022 },
{ 0x0.ffffffffffffcp-1022, 0x0.ffffffffffffdp-1022 },
{ 0x0.0000000000003p-1022, 0x0.0000000000004p-1022 },
{ 0x0.0000000000002p-1022, 0x0.0000000000003p-1022 },
{ 0x0.0000000000001p-1022, 0x0.0000000000002p-1022 },
{ 0x0.0000000000000p-1022, 0x0.0000000000001p-1022 },
{ -0x1.fffffffffffffp1023, -0x1.ffffffffffffep1023 },
{ -0x1.ffffffffffffep1023, -0x1.ffffffffffffdp1023 },
{ -0x1.ffffffffffffdp1023, -0x1.ffffffffffffcp1023 },
{ -0x1.0000000000004p1023, -0x1.0000000000003p1023 },
{ -0x1.0000000000003p1023, -0x1.0000000000002p1023 },
{ -0x1.0000000000002p1023, -0x1.0000000000001p1023 },
{ -0x1.0000000000001p1023, -0x1.0000000000000p1023 },
{ -0x1.0000000000000p1023, -0x1.fffffffffffffp1022 },
{ -0x1.0000000000000p2, -0x1.fffffffffffffp1 },
{ -0x1.fffffffffffffp1, -0x1.ffffffffffffep1 },
{ -0x1.ffffffffffffep1, -0x1.ffffffffffffdp1 },
{ -0x1.ffffffffffffdp1, -0x1.ffffffffffffcp1 },
{ -0x1.0000000000004p1, -0x1.0000000000003p1 },
{ -0x1.0000000000003p1, -0x1.0000000000002p1 },
{ -0x1.0000000000002p1, -0x1.0000000000001p1 },
{ -0x1.0000000000001p1, -0x1.0000000000000p1 },
{ -0x1.0000000000000p-1021, -0x1.fffffffffffffp-1022 },
{ -0x1.fffffffffffffp-1022, -0x1.ffffffffffffep-1022 },
{ -0x1.ffffffffffffep-1022, -0x1.ffffffffffffdp-1022 },
{ -0x1.ffffffffffffdp-1022, -0x1.ffffffffffffcp-1022 },
{ -0x1.0000000000004p-1022, -0x1.0000000000003p-1022 },
{ -0x1.0000000000003p-1022, -0x1.0000000000002p-1022 },
{ -0x1.0000000000002p-1022, -0x1.0000000000001p-1022 },
{ -0x1.0000000000001p-1022, -0x1.0000000000000p-1022 },
{ -0x1.0000000000000p-1022, -0x0.fffffffffffffp-1022 },
{ -0x0.fffffffffffffp-1022, -0x0.ffffffffffffep-1022 },
{ -0x0.ffffffffffffep-1022, -0x0.ffffffffffffdp-1022 },
{ -0x0.ffffffffffffdp-1022, -0x0.ffffffffffffcp-1022 },
{ -0x0.0000000000004p-1022, -0x0.0000000000003p-1022 },
{ -0x0.0000000000003p-1022, -0x0.0000000000002p-1022 },
{ -0x0.0000000000002p-1022, -0x0.0000000000001p-1022 },
{ -0x0.0000000000001p-1022, -0x0.0000000000000p-1022 },
};
for (int i = 0; i < NumberOf(cases); ++i)
{
double in = cases[i].in, expected = cases[i].out;
double observed = NextAfter(in);
printf("NextAfter(%a) = %a.\n", in, observed);
if (! (observed == expected))
{
printf("\tError, expected %a.\n", expected);
status = EXIT_FAILURE;
}
}
return status;
}
#endif // defined CompileMain
Consider that FP may or may not support subnormals or infinity, may or may not have a unique encoding for each value (e.g. using 2 float for double), and may or may not support +/- 0. Without an intimate knowledge of the FP encoding, assuming that mant += delta; yields the next value leads to portability failures, even when using binary ops. Using only FP ops would need many assumptions on the FP encoding.
I think a more useful approach would be to post candidate code that uses "a sequence of arithmetic floating point operations" and then ask for 1) what conditions it fails 2) how to improve?
For my sample ISO-C99 implementation below I used float mapped to IEEE-754 binary32 because I can get better test coverage that way. The assumptions are that IEEE-754 bindings for C are in effect (so C floating-point types are bound to IEEE-754 binary types, subnormals are supported etc), the rounding mode in effect is round-to-nearest-or-even, and exception signaling requirements specified by the ISO-C standard (FE_INEXACT, FE_OVERFLOW, FE_UNDERFLOW) are waived (the masked response is delivered).
The code may be more complex than it needs to be; I simply separated out the various operand classes and handled them one by one. I used the Intel compiler with the strictest floating-point settings to compile. The implementation of nextafterf() from the Intel math library is used as a golden reference. I consider it highly unlikely, but obviously not impossible, that my implementation has a bug that matches a bug in Intel's library implementation.
I am demonstrating three variants below, one of which is suitable for platforms without FMA (fused multiply-add), while another avoids floating-point division which may be slow on some platforms. In general, the stepping mechanism differs between normalized operands and subnormal / zero operands. Stepping from a subnormal towards negative zero does not readily work with any general scheme. Each of the three variants uses a different approach of dealing with this: by special casing one single case, by reducing slightly the step size for subnormal operands, or by scaling to positive integers followed by back-scaling.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <float.h>
#include <string.h>
#include <math.h>
#define VARIANT (1) // 1, 2, or 3
float my_nextafterf (float a, float b)
{
const float FP32_MIN_NORMAL = 0x1.000000p-126f;
const float FP32_MAX_NORMAL = 0x1.fffffep+127f;
const float FP32_EPSILON = 0x1.0p-23f;
const float FP32_ONE = 1.0f;
const float FP32_HALF = 0.5f;
const float FP32_ZERO = 0.0f;
const float FP32_NEG_ZERO = FP32_ZERO * (-FP32_ONE);
const float FP32_MIN_SUBNORM = FP32_MIN_NORMAL * FP32_EPSILON;
const float FP32_SUBNORM_SCALE = FP32_ONE / FP32_MIN_NORMAL;
const float FP32_INT_SCALE = FP32_ONE / FP32_EPSILON;
const float FP32_ONE_M_ULP = FP32_ONE - FP32_EPSILON * FP32_HALF;
const float FP32_ONE_P_ULP = FP32_ONE + FP32_EPSILON;
const float FP32_INC = FP32_ONE_P_ULP * FP32_EPSILON * FP32_HALF;
float r;
if ((!(fabsf(a) <= INFINITY)) || (!(fabsf(b) <= INFINITY))) { // unordered
r = a + b;
}
else if (a == b) { // equal
r = b;
}
else if (fabsf (a) == INFINITY) { // infinity
r = (a >= FP32_ZERO) ? FP32_MAX_NORMAL : (-FP32_MAX_NORMAL);
}
#if VARIANT == 1
else if (fabsf (a) > FP32_MIN_NORMAL) { // normal
float factor = ((a >= FP32_ZERO) == (a < b)) ? FP32_INC : (-FP32_INC);
r = fmaf (factor, a, a);
} else { // zero, subnormal, or smallest normal
float scal = (a >= FP32_ZERO) ? FP32_INT_SCALE : (-FP32_INT_SCALE);
float incr = ((a >= FP32_ZERO) == (a < b)) ? FP32_ONE : (-FP32_ONE);
r = (a * scal * FP32_SUBNORM_SCALE + incr) / scal / FP32_SUBNORM_SCALE;
}
#elif VARIANT == 2
else if (fabsf (a) > FP32_MIN_NORMAL) { // normal
r = ((a < b) == (a >= FP32_ZERO)) ?
(a / FP32_ONE_M_ULP) : (a * FP32_ONE_M_ULP);
} else { // zero, subnormal, or smallest normal
float incr = (a >= FP32_ZERO) ? FP32_MIN_SUBNORM : (-FP32_MIN_SUBNORM);
r = ((a < b) == (a >= FP32_ZERO)) ? (a + incr) :
((a == (-FP32_MIN_SUBNORM)) ? FP32_NEG_ZERO : (a - incr));
}
#elif VARIANT == 3
else {
float factor1 = (fabsf (a) > FP32_MIN_NORMAL) ?
(((a < b) == (a >= FP32_ZERO)) ? FP32_INC : (-FP32_INC)) :
((a >= FP32_ZERO) ? FP32_MIN_SUBNORM : (-FP32_MIN_SUBNORM));
float factor2 = (fabsf (a) > FP32_MIN_NORMAL) ? a :
(((a < b) == (a >= FP32_ZERO)) ? FP32_ONE_M_ULP : (-FP32_ONE_M_ULP));
r = fmaf (factor1, factor2, a);
}
#else // VARIANT
#error unknown VARIANT
#endif // VARIANT
return r;
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static uint32_t kiss_z = 362436069;
static uint32_t kiss_w = 521288629;
static uint32_t kiss_jsr = 123456789;
static uint32_t kiss_jcong = 380116160;
#define znew (kiss_z=36969*(kiss_z&0xffff)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&0xffff)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17),kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+13579)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
float a, b, res, ref;
uint32_t ia, ib, ires, iref;
const uint32_t FP32_QNAN_BIT = 0x00400000; // x86 and other architectures
printf ("Testing nextafterf() variant %d\n", VARIANT);
ia = 0x0000000;
do {
for (int i = 0; i < 20; i++) {
switch (i) {
case 0: ib = ia;
break;
case 1: ib = ia - 1;
break;
case 2: ib = ia + 1;
break;
case 3: ib = 0x00000000;
break;
case 4: ib = 0x80000000;
break;
case 5: ib = 0x7f800000;
break;
case 6: ib = 0xff800000;
break;
default: ib = KISS;
break;
}
a = uint32_as_float (ia);
b = uint32_as_float (ib);
res = my_nextafterf (a, b);
ref = nextafterf (a, b);
ires = float_as_uint32 (res);
iref = float_as_uint32 (ref);
if (ires != iref) {
/* if both 'from' and 'to' are NaN, result may be either NaN, quietened */
if (!(isnan (a) && isnan (b) &&
((ires == (ia | FP32_QNAN_BIT)) || (ires == (ib | FP32_QNAN_BIT))))) {
printf ("error: a=%08x b=%08x res=%08x ref=%08x\n", ia, ib, ires, iref);
}
}
}
ia++;
if (!(ia & 0xffffff)) printf ("\ria = 0x%08x", ia);
} while (ia);
return EXIT_SUCCESS;
}
I am trying to code a program that will take a floating point number in base 10 and convert its fractional part in base 2. In the following code, I am intending to call my converting function into a printf, and format the output; the issue I have lies in my fra_binary() where I can't figure out the best way to return an integer made of the result of the conversion at each turn respectively (concatenation). Here is what I have done now (the code is not optimized because I am still working on it) :
#include <stdio.h>
#include <math.h>
int fra_binary(double fract) ;
int main()
{
long double n ;
double fract, deci ;
printf("base 10 :\n") ;
scanf("%Lf", &n) ;
fract = modf(n, &deci) ;
int d = deci ;
printf("base 2: %d.%d\n", d, fra_binary(fract)) ;
return(0) ;
}
int fra_binary(double F)
{
double fl ;
double decimal ;
int array[30] ;
for (int i = 0 ; i < 30 ; i++) {
fl = F * 2 ;
F = modf(fl, &decimal) ;
array[i] = decimal ;
if (F == 0) break ;
}
return array[0] ;
}
Obviously this returns partly the desired output, because I would need the whole array concatenated as one int or char to display the series of 1 and 0s I need. So at each turn, I want to use the decimal part of the number I work on as the binary number to concatenate (1 + 0 = 10 and not 1). How would I go about it?
Hope this makes sense!
return array[0] ; is only the first value of int array[30] set in fra_binary(). Code discards all but the first calculation of the loop for (int i = 0 ; i < 30 ; i++).
convert its fractional part in base 2
OP's loop idea is a good starting point. Yet int array[30] is insufficient to encode the fractional portion of all double into a "binary".
can't figure out the best way to return an integer
Returning an int will be insufficient. Instead consider using a string - or manage an integer array in a likewise fashion.
Use defines from <float.h> to drive the buffer requirements.
#include <stdio.h>
#include <math.h>
#include <float.h>
char *fra_binary(char *dest, double x) {
_Static_assert(FLT_RADIX == 2, "Unexpected FP base");
double deci;
double fract = modf(x, &deci);
fract = fabs(fract);
char *s = dest;
do {
double d;
fract = modf(fract * 2.0, &d);
*s++ = "01"[(int) d];
} while (fract);
*s = '\0';
// For debug
printf("%*.*g --> %.0f and .", DBL_DECIMAL_DIG + 8, DBL_DECIMAL_DIG, x,
deci);
return dest;
}
int main(void) {
// Perhaps 53 - -1021 + 1
char fraction_string[DBL_MANT_DIG - DBL_MIN_EXP + 1];
puts(fra_binary(fraction_string, -0.0));
puts(fra_binary(fraction_string, 1.0));
puts(fra_binary(fraction_string, acos(-1))); // machine pi
puts(fra_binary(fraction_string, -0.1));
puts(fra_binary(fraction_string, DBL_MAX));
puts(fra_binary(fraction_string, DBL_MIN));
puts(fra_binary(fraction_string, DBL_TRUE_MIN));
}
Output
-0 --> -0 and .0
1 --> 1 and .0
3.1415926535897931 --> 3 and .001001000011111101101010100010001000010110100011
-0.10000000000000001 --> -0 and .0001100110011001100110011001100110011001100110011001101
1.7976931348623157e+308 --> 179769313486231570814527423731704356798070600000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 and .0
2.2250738585072014e-308 --> 0 and .00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
4.9406564584124654e-324 --> 0 and .000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
Also unclear why input is long double, yet processing is with double. Recommend using just one FP type.
Note that your algorithm finds out the binary representation of the fraction most significant bit first.
One way to convert the fractional part to a binary string, would be to supply the function with a string and a string length, and have the function fill it with up to that many binary digits:
#include <math.h>
#include <stddef.h>
/* This function returns the number of chars needed in dst
to describe the fractional part of value in binary,
not including the trailing NUL ('\0').
Returns zero in case of an error (non-finite value).
*/
size_t fractional_bits(char *dst, size_t len, double value)
{
double fraction, integral;
size_t i = 0;
if (!isfinite(value))
return 0;
if (value > 0.0)
fraction = modf(value, &integral);
else
if (value < 0.0)
fraction = modf(-value, &integral);
else {
/* Zero fraction. */
if (len > 1) {
dst[0] = '0';
dst[1] = '\0';
} else
if (len > 0)
dst[0] = '\0';
/* One binary digit was needed for exact representation. */
return 1;
}
while (fraction > 0.0) {
fraction = fraction * 2.0;
if (fraction >= 1.0) {
fraction = fraction - 1.0;
if (i < len)
dst[i] = '1';
} else
if (i < len)
dst[i] = '0';
i++;
}
if (i < len)
dst[i] = '\0';
else
if (len > 0)
dst[len - 1] = '\0';
return i;
}
The above function works very much like snprintf(), except that it takes only the double whose fractional bits are to be stored as a string of binary digits (0 or 1), and returns 0 in case of an error (non-finite double value).
Another option is to use an unsigned integer type to hold the bits. For example, if your code is intended to work on architectures where double is an IEEE-754 Binary64 type or similar, the mantissa has up to 53 bits of precision, and a uint64_t would suffice.
Here is an example of that:
#include <errno.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
uint64_t fractional_bits(const double val, size_t bits)
{
double fraction, integral;
uint64_t result = 0;
if (bits < 1 || bits > 64) {
errno = EINVAL;
return 0;
}
if (!isfinite(val)) {
errno = EDOM;
return 0;
}
if (val > 0.0)
fraction = modf(val, &integral);
else
if (val < 0.0)
fraction = modf(-val, &integral);
else {
errno = 0;
return 0;
}
while (bits-->0) {
result = result << 1;
fraction = fraction * 2.0;
if (fraction >= 1.0) {
fraction = fraction - 1.0;
result = result + 1;
}
}
errno = 0;
return result;
}
The return value is the binary representation of the fractional part: fractional_part ≈ result / 2^bits, where bits is between 1 and 64, inclusive.
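To make the relation concrete, here is a small sketch that maps such a result back to a fraction. The helper name fraction_from_bits is hypothetical; note that converting result to double may round when result needs more than 53 significant bits.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: computes result / 2^bits, the value that the
   bit pattern stands for. ldexp scales by a power of two exactly
   (barring underflow). */
double fraction_from_bits(uint64_t result, int bits)
{
    return ldexp((double)result, -bits);
}
```

For example, a fraction of 0.625 with bits = 3 is represented by the binary digits 101, that is result = 5, and fraction_from_bits(5, 3) gives 5/8 = 0.625 back.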
In order for the caller to detect an error, the function clears errno to zero if no error occurred. If an error does occur, the function returns zero with errno set to EDOM if the value is not finite, or to EINVAL if bits is less than 1 or greater than 64.
You can combine the two approaches, if you implement an arbitrary-size unsigned integer type, or a bitmap type.
I was thinking about the floor function available in math.h. It is very easy to use it:
#include <stdio.h>
#include <math.h>
int main(void)
{
for (double a = 12.5; a < 13.4; a += 0.1)
printf("floor of %.1lf is %.1lf\n", a, floor(a));
return 0;
}
What if I would like to write my own implementation of it? Would it look simply like this:
#include <stdio.h>
#include <math.h>
double my_floor(double num)
{
return (int)num;
}
int main(void)
{
double a;
for (a = 12.5; a < 13.4; a += 0.1)
printf("floor of %.1lf is %.1lf\n", a, floor(a));
printf("\n\n");
for (a = 12.5; a < 13.4; a += 0.1)
printf("floor of %.1lf is %.1lf\n", a, my_floor(a));
return 0;
}
?
It seems it does not work with negative numbers (my_floor), but the second one seems to be fine (my_floor_2):
#include <stdio.h>
#include <math.h>
double my_floor(double num)
{
return (int)num;
}
double my_floor_2(double num)
{
if(num < 0)
return (int)num - 1;
else
return (int)num;
}
int main(void)
{
double a1 = -12.5;
printf("%lf\n", floor(a1));
printf("%lf\n", my_floor(a1));
printf("%lf\n", my_floor_2(a1));
return 0;
}
program output:
-13.000000
-12.000000
-13.000000
Is one of them eventually correct or not?
Both of your attempts have limitations:
If the double value is outside the range of the int type, converting to int has undefined behavior.
If the double value is negative but integral, returning (int)num - 1 is incorrect.
Here is an (almost) portable version that tries to handle all cases:
#include <limits.h>
double my_floor_2(double num) {
if (num >= LLONG_MAX || num <= LLONG_MIN || num != num) {
/* handle large values, infinities and nan */
return num;
}
long long n = (long long)num;
double d = (double)n;
if (d == num || num >= 0)
return d;
else
return d - 1;
}
It should be correct if type long long has more value bits than type double, which is the case on most modern systems.
No, you can't tackle it this way. The best way of writing your own implementation is to take the one from the C Standard Library on your platform. But note that might contain platform specific nuances so might not be portable.
The C Standard Library floor function is typically clever in that it doesn't work by taking a conversion to an integral type. If it did then you'd run the risk of signed integer overflow, the behaviour of which is undefined. (Note that the smallest possible range for an int is -32767 to +32767).
The precise implementation is also dependent on the floating point scheme used on your platform.
For a platform using IEEE754 floating point, and a long long type you could adopt this scheme:
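If the magnitude of the number is greater than 2^53, return it (as it's already integral).
Else, cast to a 64-bit type (long long), and return it back as a double, subtracting one if the input was negative and not integral (the cast truncates toward zero).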
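A minimal sketch of this scheme, assuming double is IEEE-754 binary64 (the name floor_via_llong is hypothetical). Because the cast truncates toward zero, the sketch subtracts one whenever truncation moved the value upward, which only happens for negative non-integral inputs.

```c
#include <math.h>

/* Hypothetical sketch; assumes IEEE-754 binary64 doubles. */
double floor_via_llong(double x)
{
    /* Values of magnitude >= 2^53 are already integral. Written as a
       negated comparison so that NaN and infinities also pass through. */
    if (!(fabs(x) < 0x1p53))
        return x;

    long long n = (long long)x;   /* safe: |x| < 2^53 fits in long long */
    double d = (double)n;         /* exact: |n| < 2^53 */

    /* Truncation rounds toward zero; correct it for negative fractions.
       Note that -0.0 comes back as +0.0 in this sketch. */
    return (d > x) ? d - 1.0 : d;
}
```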
In C++ and 32 bit arithmetics it can be done for example like this:
//---------------------------------------------------------------------------
// IEEE 754 double MSW masks
const DWORD _f64_sig =0x80000000; // sign
const DWORD _f64_exp =0x7FF00000; // exponent
const DWORD _f64_exp_sig=0x40000000; // exponent sign
const DWORD _f64_exp_bia=0x3FF00000; // exponent bias
const DWORD _f64_exp_lsb=0x00100000; // exponent LSB
const DWORD _f64_exp_pos= 20; // exponent LSB bit position
const DWORD _f64_man =0x000FFFFF; // mantissa
const DWORD _f64_man_msb=0x00080000; // mantissa MSB
const DWORD _f64_man_bits= 52; // mantissa bits
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantissa
const DWORD _f32_man_msb=0x00400000; // mantissa MSB
const DWORD _f32_man_bits= 23; // mantissa bits
//---------------------------------------------------------------------------
double f64_floor(double x)
{
const int h=1; // may be platform dependent MSB/LSB order
const int l=0;
union _f64 // semi result
{
double f; // 64bit floating point
DWORD u[2]; // 2x32 bit uint
} y;
DWORD m,a;
int sig,exp,sh;
y.f=x;
// extract sign
sig =y.u[h]&_f64_sig;
// extract exponent
exp =((y.u[h]&_f64_exp)>>_f64_exp_pos)-(_f64_exp_bia>>_f64_exp_pos);
// floor bit shift
sh=_f64_man_bits-exp; a=0;
if (exp<0)
{
a=y.u[l]|(y.u[h]&_f64_man);
if (sig) return -1.0;
return 0.0;
}
// LSW
if (sh>0)
{
if (sh<32) m=(0xFFFFFFFF>>sh)<<sh; else m=0;
a=y.u[l]&(m^0xFFFFFFFF); y.u[l]&=m;
}
// MSW
sh-=32;
if (sh>0)
{
if (sh<_f64_exp_pos) m=(0xFFFFFFFF>>sh)<<sh; else m=_f64_sig|_f64_exp;
a|=y.u[h]&(m^0xFFFFFFFF); y.u[h]&=m;
}
if ((sig)&&(a)) y.f--;
return y.f;
}
//---------------------------------------------------------------------------
float f32_floor(float x)
{
union // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
DWORD m,a;
int sig,exp,sh;
y.f=x;
// extract sign
sig =y.u&_f32_sig;
// extract exponent
exp =((y.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos);
// floor bit shift
sh=_f32_man_bits-exp; a=0;
if (exp<0)
{
a=y.u&_f32_man;
if (sig) return -1.0;
return 0.0;
}
if (sh>0)
{
if (sh<_f32_exp_pos) m=(0xFFFFFFFF>>sh)<<sh; else m=_f32_sig|_f32_exp;
a|=y.u&(m^0xFFFFFFFF); y.u&=m;
}
if ((sig)&&(a)) y.f--;
return y.f;
}
//---------------------------------------------------------------------------
The point is to make a mask that clears the fractional bits from the mantissa and, in case of negative input and nonzero cleared bits, to decrement the result. To access individual bits you can convert your floating point value to an integral representation with a union (like in the example) or use pointers instead.
I tested this in simple VCL app like this:
float f32;
double f64;
AnsiString txt="";
// 64 bit
txt+="[double]\r\n";
for (f64=-10.0;f64<=10.0;f64+=0.1)
if (fabs(floor(f64)-f64_floor(f64))>1e-6)
{
txt+=AnsiString().sprintf("%5.3lf %5.3lf %5.3lf\r\n",f64,floor(f64),f64_floor(f64));
f64_floor(f64);
}
for (f64=1;f64<=1e307;f64*=1.1)
{
if (fabs(floor( f64)-f64_floor( f64))>1e-6) { txt+=AnsiString().sprintf("%lf %lf %lf\r\n", f64,floor( f64),f64_floor( f64));
f64_floor( f64); }
if (fabs(floor(-f64)-f64_floor(-f64))>1e-6) { txt+=AnsiString().sprintf("%lf %lf %lf\r\n",-f64,floor(-f64),f64_floor(-f64));
f64_floor(-f64); }
}
// 32 bit
txt+="[float]\r\n";
for (f32=-10.0;f32<=10.0;f32+=0.1)
if (fabs(floor(f32)-f32_floor(f32))>1e-6)
{
txt+=AnsiString().sprintf("%5.3lf %5.3lf %5.3lf\r\n",f32,floor(f32),f32_floor(f32));
f32_floor(f32);
}
for (f32=1;f32<=1e37;f32*=1.1)
{
if (fabs(floor( f32)-f32_floor( f32))>1e-6) { txt+=AnsiString().sprintf("%lf %lf %lf\r\n", f32,floor( f32),f32_floor( f32));
f32_floor( f32); }
if (fabs(floor(-f32)-f32_floor(-f32))>1e-6) { txt+=AnsiString().sprintf("%lf %lf %lf\r\n",-f32,floor(-f32),f32_floor(-f32));
f32_floor(-f32); }
}
mm_log->Lines->Add(txt);
with no differences reported (so in all tested cases it matches the math.h floor() values). If you want to give this a shot outside VCL, just change AnsiString to any string type you have at hand and change the output from TMemo::mm_log to anything you have (like console cout or whatever).
The double call of fxx_floor() in case of a difference is for debugging purposes (you can place a breakpoint and step into the error case directly).
[Notes]
Beware that the order of words (MSW, LSW) is platform dependent, so you should adjust the h,l constants accordingly. This code is not optimized, so that it stays easy to understand; do not expect it to be fast.
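One way to sidestep the word-order dependence is to work on a single 64-bit word. The following is a minimal sketch of the same masking idea in plain C, not the code from this answer: the name floor_by_mask is hypothetical, it assumes double is IEEE-754 binary64 and that integers and doubles share the same byte order, and it copies the bits with memcpy instead of a union.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch; assumes IEEE-754 binary64 doubles. */
double floor_by_mask(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Unbiased exponent from the 11-bit exponent field. */
    int exp = (int)((bits >> 52) & 0x7FF) - 1023;

    if (exp >= 52)                 /* already integral, or inf/NaN */
        return x;

    if (exp < 0) {                 /* |x| < 1 */
        if ((bits << 1) == 0)      /* +0.0 or -0.0: keep as is */
            return x;
        return (bits >> 63) ? -1.0 : 0.0;
    }

    /* The low (52 - exp) mantissa bits hold the fractional part. */
    uint64_t frac_mask = (UINT64_C(1) << (52 - exp)) - 1;
    if ((bits & frac_mask) == 0)   /* no fractional bits set */
        return x;

    bits &= ~frac_mask;            /* truncate toward zero */
    double t;
    memcpy(&t, &bits, sizeof t);

    /* For negative inputs, truncation went up, so step down by one. */
    return (x < 0) ? t - 1.0 : t;
}
```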
When the precision of the floating point type is small enough as compared to a wide integer type, cast to that integer type when the floating point value is in the integer range.
Review the function for values outside the intmax_t range, NAN, infinity and -0.0 and adjust as desired.
#include <float.h>
#if DBL_MANT_DIG >= 64
#error TBD code
#endif
#include <inttypes.h>
// INTMAX_MAX is not exact as a double, yet INTMAX_MAX + 1 is an exact double
#define INTMAX_MAX_P1 ((INTMAX_MAX/2 + 1)*2.0)
double my_floor(double x) {
if (x >= 0.0) {
if (x < INTMAX_MAX_P1) {
return (double)(intmax_t)x;
}
return x;
} else if (x < 0.0) {
if (x >= INTMAX_MIN) {
intmax_t ix = (intmax_t) x;
return (ix == x) ? x : (double)(ix-1);
}
return x;
}
return x; // NAN
}
Try this as your function:
// we need as much space as possible
typedef long double largestFloat;
largestFloat myFloor(largestFloat x)
{
largestFloat xcopy = (x < 0) ? (x * -1) : x;
unsigned int zeros = 0;
largestFloat n = 1;
// Count digits before the decimal
for (n = 1; xcopy > (n * 10); n *= 10, ++zeros)
;
// Make xcopy follow 0 <= xcopy < 1
for (xcopy -= n; zeros != -1; xcopy -= n) {
if (xcopy < 0) {
xcopy += n;
n /= 10;
--zeros;
}
}
xcopy += n;
// Follow standard floor behavior
if (x < 0)
return (xcopy == 0) ? x : (x + xcopy - 1);
else
return x - xcopy;
}
This is an explanation of the code.
Create xcopy (absolute value of x)
Use the first for loop to figure out the number of digits before the decimal point.
Use that number to continually decrease xcopy until it satisfies 0 <= xcopy < 1
Based on whether x was originally positive or negative, either return x - xcopy or x - (1 - xcopy).
I'm looking for a reasonably efficient way of determining if a floating point value (double) can be exactly represented by an integer data type (long, 64 bit).
My initial thought was to check whether the exponent was 0 (or, more precisely, equal to the bias, which is 1023 for a double). But that won't work because 2.0 would be e=1 m=1...
So basically, I am stuck. I have a feeling that I can do this with bit masks, but I'm just not getting my head around how to do that at this point.
So how can I check to see if a double is exactly representable as a long?
Thanks
I think I have found a way to clamp a double into an integer in a standard-conforming fashion (this is not really what the question is about, but it helps a lot). First, we need to see why the obvious code is not correct.
// INCORRECT CODE
uint64_t double_to_uint64 (double x)
{
if (x < 0.0) {
return 0;
}
if (x > UINT64_MAX) {
return UINT64_MAX;
}
return x;
}
The problem here is that in the second comparison, UINT64_MAX is implicitly converted to double. The C standard does not specify exactly how this conversion works, only that the value is rounded up or down to a representable value. This means that the second comparison may be false even if it should mathematically be true (which can happen when UINT64_MAX is rounded up and 'x' is mathematically between UINT64_MAX and (double)UINT64_MAX). As such, the conversion of double to uint64_t can result in undefined behavior in that edge case.
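To see the edge case concretely, here is a minimal sketch. It assumes a platform where double is IEEE-754 binary64 and the conversion rounds to nearest, which the C standard does not guarantee; the helper names uint64_max_rounds_up and naive_test_clamps are made up for this illustration.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if (double)UINT64_MAX came out as exactly 2^64. */
bool uint64_max_rounds_up(void)
{
    /* The demonstration only makes sense if double has fewer than 64 bits
       of precision, so that UINT64_MAX cannot be stored exactly. */
    assert(FLT_RADIX == 2 && DBL_MANT_DIG < 64);
    volatile double d = (double)UINT64_MAX; /* volatile: force rounding to double */
    return d == 18446744073709551616.0;     /* 2^64 */
}

/* Hypothetical helper: the range test from the incorrect code. The cast is
   the same conversion that `x > UINT64_MAX` performs implicitly. */
bool naive_test_clamps(double x)
{
    return x > (double)UINT64_MAX;
}
```

On such a platform, naive_test_clamps(18446744073709551616.0) is false, so the incorrect function would fall through to `return x` with x equal to 2^64, which is exactly the out-of-range conversion described above.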
Surprisingly, the solution is very simple. Consider that while UINT64_MAX may not be exactly representable in a double, UINT64_MAX+1, being a power of two (and not too large), certainly is. So, if we first round the input to an integer, the comparison x > UINT64_MAX is equivalent to x >= UINT64_MAX+1, except for possible overflow in the addition. We can fix the overflow by using ldexp instead of adding one to UINT64_MAX. That being said, the following code should be correct.
/* Input: a double 'x', which must not be NaN.
* Output: If 'x' is less than zero, then zero;
* otherwise, if 'x' is greater than UINT64_MAX, then UINT64_MAX;
* otherwise, 'x', rounded down to an integer.
*/
uint64_t double_to_uint64 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 64)) {
return UINT64_MAX;
}
return y;
}
Now, back to your question: is x exactly representable in a uint64_t? Only if it was neither rounded nor clamped.
/* Input: a double 'x', which must not be NaN.
* Output: If 'x' is exactly representable in an uint64_t,
* then 1, otherwise 0.
*/
int double_representable_in_uint64 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}
The same algorithm can be used for integers of different size, and also for signed integers with a minor modification. The code that follows does some very basic tests of the uint32_t and uint64_t versions (only false positives can possibly be caught), but is also suitable for manual examination of the edge cases.
#include <inttypes.h>
#include <math.h>
#include <limits.h>
#include <assert.h>
#include <stdio.h>
uint32_t double_to_uint32 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 32)) {
return UINT32_MAX;
}
return y;
}
uint64_t double_to_uint64 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 64)) {
return UINT64_MAX;
}
return y;
}
int double_representable_in_uint32 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 32));
}
int double_representable_in_uint64 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}
int main ()
{
{
printf("Testing 32-bit\n");
for (double x = 4294967295.999990; x < 4294967296.000017; x = nextafter(x, INFINITY)) {
uint32_t y = double_to_uint32(x);
int representable = double_representable_in_uint32(x);
printf("%f -> %" PRIu32 " representable=%d\n", x, y, representable);
assert(!representable || (double)(uint32_t)x == x);
}
}
{
printf("Testing 64-bit\n");
for (double x = 18446744073709510656.0; x < 18446744073709629440.0; x = nextafter(x, INFINITY)) {
uint64_t y = double_to_uint64(x);
int representable = double_representable_in_uint64(x);
printf("%f -> %" PRIu64 " representable=%d\n", x, y, representable);
assert(!representable || (double)(uint64_t)x == x);
}
}
}
Here's one method that could work in most cases. I'm not sure if/how it will break if you give it NaN, INF, very large (overflow) numbers...
(Though I think they will all return false - not exactly representable.)
You could:
Cast it to an integer.
Cast it back to a floating-point.
Compare with original value.
Something like this:
double val = ... ; // Value
if ((double)(long long)val == val){
// Exactly representable
}
floor() and ceil() are also fair game (though they may fail if the value overflows an integer):
floor(val) == val
ceil(val) == val
And here's a messy bit-mask solution:
This uses union type-punning and assumes IEEE double-precision. Union type-punning is only valid in C99 TC3 and later.
int representable(double x){
// Handle corner cases:
if (x == 0)
return 1;
// -2^63 is representable as a signed 64-bit integer, but +2^63 is not.
if (x == -9223372036854775808.)
return 1;
// Warning: Union type-punning is only valid in C99 TC3 or later.
union{
double f;
uint64_t i;
} val;
val.f = x;
uint64_t exp = val.i & 0x7ff0000000000000ull;
uint64_t man = val.i & 0x000fffffffffffffull;
man |= 0x0010000000000000ull; // Implicit leading 1-bit.
int shift = (exp >> 52) - 1075;
// Out of range
if (shift < -52 || shift > 10)
return 0;
// Test mantissa
if (shift < 0){
shift = -shift;
return ((man >> shift) << shift) == man;
}else{
return ((man << shift) >> shift) == man;
}
}
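If the union type-punning is a concern, the bit pattern can also be copied out with memcpy, which is well defined in every version of the C standard. A minimal sketch, still assuming that double is a 64-bit IEEE-754 binary64 value; double_bits is a hypothetical helper name, not part of the code above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a double.
   Assumes double and uint64_t have the same size and byte order. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits); /* byte-wise copy, no aliasing issues */
    return bits;
}
```

With this helper, `val.i` in the function above could be replaced by the result of double_bits(x), and the exponent and mantissa masks would stay the same.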
You can use the modf function to split a float into the integer and fraction parts. modf is in the standard C library.
#include <math.h>
#include <limits.h>
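double val = ...;
double i;
long l;
/* check if fractional part is 0 */
if (modf(val, &i) == 0.0) {
/* val is an integer. check if it can be stored in a long.
   LONG_MAX may round up to LONG_MAX + 1 when converted to double,
   so compare against LONG_MAX + 1 computed exactly instead. */
if (val >= LONG_MIN && val < 2.0 * (LONG_MAX / 2 + 1)) {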
/* can be exactly represented by a long */
l = val;
}
}
How to check if float can be exactly represented as an integer?
I'm looking for a reasonably efficient way of determining if a floating point value double can be exactly represented by an integer data type long, 64 bit.
Range (LONG_MIN, LONG_MAX) and fraction (modf()) tests are needed. Also watch out for not-a-numbers.
The usual idea is a test like (double)(long)x == x, but its direct use must be avoided: (long)x is undefined behavior (UB) when x is out of range.
The valid range of conversion for (long)x is LONG_MIN - 1 < x < LONG_MAX + 1, as the conversion discards any fractional part of x. So the code needs to test, using FP math, whether x is in range.
#include <limits.h>
#include <math.h>
#include <stdbool.h>
#define DBL_LONG_MAXP1 (2.0*(LONG_MAX/2+1))
#define DBL_LONG_MINM1 (2.0*(LONG_MIN/2-1))
bool double_to_long_exact_possible(double x) {
if (x < DBL_LONG_MAXP1) {
double whole_number_part;
if (modf(x, &whole_number_part) != 0.0) {
return false; // Fractional part exists.
}
#if -LONG_MAX == LONG_MIN
// rare non-2's complement machine
return x > DBL_LONG_MINM1;
#else
return x - LONG_MIN > -1.0;
#endif
}
return false; // Too large or NaN
}
Any IEEE floating-point double or float value with a magnitude at or above 2^52 or 2^23, respectively, will be a whole number. Adding 2^52 or 2^23 to a non-negative number whose magnitude is less than that will cause it to be rounded to a whole number. Subtracting the value that was added then yields a whole number, which equals the original iff the original was a whole number. Note that this algorithm will fail with some numbers larger than 2^52, but it isn't needed for numbers that big.
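The add-and-subtract idea could be sketched for double as follows. This assumes IEEE-754 binary64 with the default round-to-nearest mode; is_whole_number is a hypothetical name, and the sketch works on the absolute value so that negative inputs are covered as well. It only answers whether the value is a whole number, so a range check is still needed before converting to long.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true if x is a finite whole number.
   Assumes IEEE-754 binary64 and round-to-nearest. */
bool is_whole_number(double x)
{
    assert(!isnan(x));
    if (isinf(x)) {
        return false;
    }
    const double two52 = 4503599627370496.0; /* 2^52 */
    double a = fabs(x);
    if (a >= two52) {
        return true;  /* every finite double this large is a whole number */
    }
    /* In [2^52, 2^53) doubles are spaced 1 apart, so the sum is rounded to
       a whole number. volatile keeps the sum from being held in a wider
       intermediate format. */
    volatile double t = a + two52;
    return t - two52 == a;
}
```

For example, 3.5 + 2^52 rounds to a whole number, so subtracting 2^52 again cannot give back 3.5 and the test reports false, while 3.0 survives the round trip unchanged.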
Could you use a modulus operation to check if the double is divisible by one... or am I completely misunderstanding the question? Note that in C the % operator does not accept floating-point operands, so fmod() from math.h is used here.
double val = ... ; // Value
if(fmod(val, 1.0) == 0.0) {
// val is evenly divisible by 1 and is therefore a whole number
}