Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication - c

How can we convert floating point numbers to their "fixed-point representations", and use those representations in fixed-point operations such as addition and multiplication? The result of the fixed-point operation must yield the correct answer when converted back to floating point.
Say:
(double)(xa_double) + (double)(xb_double) = ?
Then we convert both addends to a fixed point representation (integer),
(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)
To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer,
FixedToDouble(xsum_fixed) => xsum_double
Specifically, if the range of the values of xa_double and xb_double is between -1.65 and 1.65, I want to convert xa_double and xb_double to their respective 10-bit fixed-point representations (0x0000 to 0x03FF).
WHAT I HAVE TRIED
#include <math.h> // for round()
int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;
double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));
int DoubleToFixed(double x)
{
return round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method
}
double FixedToDouble(int x)
{
return (double)((((x) - fixed_MIN)/slope) + Value_MIN);
}
int sum_fixed(int x, int y)
{
return (x + y - (1.65*slope)); //analysis, just basic math
}
int subtract_fixed(int x, int y)
{
return (x - y + (1.65*slope));
}
int product_fixed(int x, int y)
{
return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}
And if I want to add (double)(1.00) + (double)(2.00), which should yield (double)(3.00),
With my code,
xsum_fixed = DoubleToFixed(1.00) + DoubleToFixed(2.00);
xsum_double = FixedToDouble(xsum_fixed);
I get the answer:
xsum_double = 3.001613
which is very close to the correct answer, (double)(3.00).
Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.
HERE'S THE CATCH:
So I know my code is working, but how can I perform addition, multiplication, and subtraction on these fixed-point representations without having INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?
So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.
So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.
UPDATE:
I also found a simpler code in converting fractional numbers to fixed-point:
const int scale = 16; // 1/2^16 in 32 bits
#define DoubleToFixed(x) (int)((x) * (double)(1<<scale))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionMask ((1<<scale)-1) // the low 'scale' bits
#define FractionPart(x) ((x) & FractionMask)
#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<scale)/(y))
However, this converts only UNSIGNED fractions to UNSIGNED fixed-point, and I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this with the code above? Does the range or number of bits have something to do with the conversion process? Is this code only for positive fractions?
credits to #chux
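For reference, the offset arithmetic above can be done with no floating point at all once the scale and zero offset are frozen into integer constants. A minimal sketch, assuming S = 310 codes per unit (1023/3.3) and Z = 512 = round(1.65 * S) as the code for 0.0; the names and test values are illustrative, not from the original post:
#include <stdio.h>

enum { S = 310, Z = 512 }; /* scale (codes per unit) and zero offset */

int sum_fixed2(int a, int b)      { return a + b - Z; } /* (a-Z)+(b-Z), re-biased by Z */
int subtract_fixed2(int a, int b) { return a - b + Z; }

int product_fixed2(int a, int b)
{
    /* (a-Z) and (b-Z) carry the values scaled by S; their product is
       scaled by S*S, so one division by S restores the S-scaling. */
    long long p = (long long)(a - Z) * (long long)(b - Z);
    return (int)(p / S) + Z;
}

int main(void)
{
    int one = Z + S;     /* 1.0 encodes as 822 */
    int two = Z + 2 * S; /* 2.0 encodes as 1132; 2.0 > 1.65 already overflows 10 bits */
    printf("sum %d (expect %d)\n", sum_fixed2(one, two), Z + 3 * S);
    printf("product %d (expect %d)\n", product_fixed2(one, two), Z + 2 * S);
    return 0;
}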

You can have the mantissa of the floating point representation of your number be equal to its fixed point representation. Since FP addition shifts the smaller operand's mantissa until both operands have the same exponent, you can add a certain 'magic number' to force it. For double, it's 1ll<<(52-precision) (52 is double's mantissa size, 'precision' is the required number of binary precision digits). So the conversion would look like this:
union { double f; long long i; } u = { xfloat + (1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa
After that you can use xfixed in integer math (for multiplication, you'd have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision);
Note that it doesn't handle negatives. If you need them, you'd have to convert them to the complementary representation manually (first fabs the double, then negate the int result if the input was negative).
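A worked sketch of this magic-number trick for non-negative values (PRECISION and the function name are illustrative; assumes IEEE 754 binary64 and a 64-bit long long):
#include <stdio.h>
#include <string.h>

#define PRECISION 10 /* number of fractional bits */

long long DoubleToFixedMagic(double x)
{
    /* Adding 2^(52-PRECISION) forces x's mantissa into the low bits. */
    double magic = x + (double)(1LL << (52 - PRECISION));
    long long bits;
    memcpy(&bits, &magic, sizeof bits); /* well-defined type punning */
    return bits & ((1LL << 52) - 1);    /* extract the mantissa */
}

int main(void)
{
    long long a = DoubleToFixedMagic(1.25); /* 1280 = 1.25 * 2^10 */
    long long b = DoubleToFixedMagic(0.50); /*  512 = 0.50 * 2^10 */
    long long sum = a + b;                  /* plain integer addition */
    long long prod = (a * b) >> PRECISION;  /* integer multiply, then rescale */
    printf("sum  = %f\n", (double)sum * (1.0 / (1 << PRECISION)));  /* 1.750000 */
    printf("prod = %f\n", (double)prod * (1.0 / (1 << PRECISION))); /* 0.625000 */
    return 0;
}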

Related

Why do we multiply normalized fraction with 0.5 to get the significand in IEEE 754 representation?

I have a question about the pack754() function defined in Section 7.4 of Beej's Guide to Network Programming.
This function converts a floating point number f into its IEEE 754 representation where bits is the total number of bits to represent the number and expbits is the number of bits used to represent only the exponent.
I am concerned with single-precision floating numbers only, so for this question, bits is specified as 32 and expbits is specified as 8. This implies that 23 bits is used to store the significand (because one bit is sign bit).
My question is about this line of code.
significand = fnorm * ((1LL<<significandbits) + 0.5f);
What is the role of + 0.5f in this code?
Here is a complete code that uses this function.
#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
int main(void)
{
float f = 3.1415926;
uint32_t fi;
printf("float f: %.7f\n", f);
fi = pack754(f, 32, 8);
printf("float encoded: 0x%08" PRIx32 "\n", fi);
return 0;
}
What purpose does + 0.5f serve in this code?
The code is an incorrect attempt at rounding.
long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f); // bad code
The first clue of incorrectness is the f in 0.5f: it indicates float, a nonsensical introduction of float math in a routine working with long double f and fnorm. float math has no application in the function.
Yet adding 0.5f does not mean that the code is limited to float math in (1LL<<significandbits) + 0.5f. See FLT_EVAL_METHOD, which may allow higher-precision intermediate results and may have fooled the code author in testing.
A rounding attempt does make sense, as the argument is long double and the target representations are narrower. Adding 0.5 is a common approach - but it is not done right here. IMO, the author's lack of a comment concerning 0.5f hints that the intent was "obvious" - not subtle, albeit incorrect.
As commented, moving the 0.5 is closer to being correct for rounding, but may mislead some into thinking the addition is done with float math (it is long double math: adding a long double product to a float causes the 0.5f to be promoted to long double first).
// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;
// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
To round without calling the preferred <math.h> rounding routines like rintl(), roundl(), nearbyintl(), llrintl(), adding an explicitly typed 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly in many cases. The +0.5 trick relies on that sum being exact.
Consider
long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5; // double rounding?
product + 0.5 itself may go through a rounding before truncation/assignment to long long - in effect double rounding.
Best to use the right tool in the C shed of standard library functions.
significand = llrintl(fnorm * (1ULL<<significandbits));
A corner case remains with this rounding: significand may now be one too great, so significand and exp need adjustment. As well identified by #Nayuki, the code has other shortcomings too. Also, it fails on -0.0.
The + 0.5f serves no purpose in the code, and may be harmful or misleading.
The expression (1LL<<significandbits) + 0.5f results in a float. But even for the small case of significandbits = 23 for single-precision floating-point, the expression evaluates to (float)(2^23 + 0.5), which rounds to exactly 2^23 (round half to even).
Replacing + 0.5f with + 0.0f results in the same behavior. Heck, drop that term entirely, because fnorm will cause the right-hand-side argument of * to be converted to long double anyway. This would be a better way to rewrite the line: long long significand = fnorm * (long double)(1LL << significandbits);
Side note: This implementation of pack754() handles zero correctly (and collapses negative zero to positive zero), but mishandles subnormal numbers (wrong bits), infinities (infinite loop), and NaN (wrong bits). It's best to not treat it as a reference model function.
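For ordinary values, a quick sanity check is to compare pack754()'s output with the platform's own encoding (a sketch, assuming IEEE 754 binary32 on the host and the pack754() definition above in scope):
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    float f = 3.1415926f;
    uint32_t native;
    memcpy(&native, &f, sizeof native);            /* the host's encoding */
    uint32_t packed = (uint32_t)pack754(f, 32, 8); /* the function's encoding */
    printf("native 0x%08" PRIx32 " packed 0x%08" PRIx32 " %s\n",
           native, packed, native == packed ? "(match)" : "(mismatch)");
    return 0;
}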

Compare floating point numbers as integers

Can two floating point values (IEEE 754 binary64) be compared as integers? E.g.
long long a = * (long long *) ptr_to_double1,
b = * (long long *) ptr_to_double2;
if (a < b) {...}
assuming the size of long long and double is the same.
YES - Comparing the bit-patterns for two floats as if they were integers (aka "type-punning") produces meaningful results under some restricted scenarios...
Identical to floating-point comparison when:
Both numbers are positive, positive-zero, or positive-infinity.
One positive and one negative number, and you are using a signed integer comparison.
Inverse of floating-point comparison when:
Both numbers are negative, negative-zero, or negative-infinity.
One positive and one negative number, and you are using an unsigned integer comparison.
Not comparable to floating-point comparison when:
Either number is one of the NaN values - Floating point comparisons with a NaN always returns false, and this simply can't be modeled in integer operations where exactly one of the following is always true: (A < B), (A == B), (B < A).
Negative floating-point numbers are a bit funky because they are handled very differently from the two's complement arithmetic used for integers. Doing an integer +1 on the representation of a negative float will make it a bigger negative number.
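A quick demonstration of that funkiness (assumes IEEE 754 binary32; memcpy does the type punning):
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float f = -1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* -1.0f is 0xBF800000 */
    bits += 1;                      /* integer +1 on the representation */
    memcpy(&f, &bits, sizeof f);
    printf("%.9g\n", f);            /* -1.00000012: a bigger negative number */
    return 0;
}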
With a little bit manipulation, you can make both positive and negative floats comparable with integer operations (this can come in handy for some optimizations):
#include <bit>     // std::bit_cast (C++20)
#include <cstdint>
int32_t float_to_comparable_integer(float f) {
uint32_t bits = std::bit_cast<uint32_t>(f);
const uint32_t sign_bit = bits & 0x80000000ul;
// Modern compilers turn this IF-statement into a conditional move (CMOV) on x86,
// which is much faster than a branch that the cpu might mis-predict.
if (sign_bit) {
bits = 0x80000000ul - bits; // mirror the negative range; also maps -0.0 to 0
}
return static_cast<int32_t>(bits);
}
Again, this does not work for NaN values, which always return false from comparisons, and have multiple valid bit representations:
Signaling NaNs (w/ sign bit): Anything between 0xFF800001, and 0xFFBFFFFF.
Signaling NaNs (w/o sign bit): Anything between 0x7F800001, and 0x7FBFFFFF.
Quiet NaNs (w/ sign bit): Anything between 0xFFC00000, and 0xFFFFFFFF.
Quiet NaNs (w/o sign bit): Anything between 0x7FC00000, and 0x7FFFFFFF.
IEEE-754 bit format: http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm
More on Type-Punning: https://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/
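The same trick rendered in C (a sketch; memcpy replaces std::bit_cast, the function name is invented here, and NaNs remain unordered as discussed above):
#include <stdint.h>
#include <string.h>

int32_t float_to_comparable_int(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if (bits & 0x80000000u)
        bits = 0x80000000u - bits; /* mirror the negative range; -0.0 maps to 0 */
    /* Conversion of values above INT32_MAX is implementation-defined,
       but wraps as expected on two's complement targets. */
    return (int32_t)bits;
}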
No. Two floating point values (IEEE 754 binary64) cannot compare simply as integers with if (a < b).
IEEE 754 binary64
The order of the values of double is not the same order as integers (unless you are on a rare sign-magnitude machine). Think positive vs. negative numbers.
double has values like 0.0 and -0.0 which have the same value but different bit patterns.
double has "Not-a-number"s that do not compare like their binary equivalent integer representation.
If both double values were > 0, not "Not-a-number", and endianness, aliasing, alignment, etc. were not an issue, OP's idea would work.
Alternatively, a more complex if() ... condition would work - see below
[non-IEEE 754 binary64]
Some double encodings allow multiple representations of the same value. This would differ from an "integer" compare.
Tested code: needs 2's complement, same endian for double and the integers, does not account for NaN.
int compare(double a, double b) {
union {
double d;
int64_t i64;
uint64_t u64;
} ua, ub;
ua.d = a;
ub.d = b;
// Cope with -0.0 right away
if (ua.u64 == 0x8000000000000000) ua.u64 = 0;
if (ub.u64 == 0x8000000000000000) ub.u64 = 0;
// Signs differ?
if ((ua.i64 < 0) != (ub.i64 < 0)) {
return ua.i64 >= 0 ? 1 : -1;
}
// If numbers are negative
if (ua.i64 < 0) {
ua.u64 = -ua.u64;
ub.u64 = -ub.u64;
}
return (ua.u64 > ub.u64) - (ua.u64 < ub.u64);
}
Thanks to #David C. Rankin for a correction.
Test code
void testcmp(double a, double b) {
int t1 = (a > b) - (a < b);
int t2 = compare(a, b);
if (t1 != t2) {
printf("%le %le %d %d\n", a, b, t1, t2);
}
}
#include <float.h>
void testcmps() {
// Various interesting `double`
static const double a[] = {
-1.0 / 0.0, -DBL_MAX, -1.0, -DBL_MIN, -0.0,
+0.0, DBL_MIN, 1.0, DBL_MAX, +1.0 / 0.0 };
int n = sizeof a / sizeof a[0];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
testcmp(a[i], a[j]);
}
}
puts("!");
}
If you strictly cast the bit value of a floating point number to its correspondingly-sized signed integer (as you've done), then signed integer comparison of the results matches the comparison of the original floating-point values for all representable finite and infinite numeric values - excluding NaN values, and with the caveat from the answers above that the order is inverted when both operands are negative.
In other words, for double-precision (64 bits), this comparison will be valid if the following tests pass:
long long exponentMask = 0x7ff0000000000000;
long long mantissaMask = 0x000fffffffffffff;
bool isNumber = ((x & exponentMask) != exponentMask) // Not exp 0x7ff
|| ((x & mantissaMask) == 0); // Infinities
for each operand x.
Of course, if you can pre-qualify your floating-point values, then a quick isNaN() test would be much more clear. You'd have to profile to understand performance implications.
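Wrapped up as a self-contained helper (is_orderable is a name invented here, not part of the answer):
#include <stdbool.h>
#include <stdint.h>

static bool is_orderable(int64_t x) /* x = bit pattern of a double */
{
    const int64_t exponentMask = 0x7ff0000000000000;
    const int64_t mantissaMask = 0x000fffffffffffff;
    return ((x & exponentMask) != exponentMask) /* exponent not all ones */
        || ((x & mantissaMask) == 0);           /* or an infinity */
}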
There are two parts to your question:
Can two floating point numbers be compared? The answer to this is yes. It is perfectly valid to compare the sizes of floating point numbers. Generally you want to avoid equality comparisons due to rounding issues, but
if (a < b)
will work just fine.
Can two floating point numbers be compared as integers? This answer is also yes, but it will require casting. This question should help with that answer: convert from long long to int and the other way back in c++

Returning the exponent value of a floating point number

How can I return only the exponent part of a floating point number? It's single precision, so I need to get the value of the eight exponent bits.
int floatExp(float f) {
//Return the exponent value for f. return tmax if f is nan or infinity
}
The standard way to extract the binary exponent is to use the frexp() function from math.h. See http://www.csse.uwa.edu.au/programming/ansic-library.html#float
If you need the decimal exponent, use the log10() function in math.h.
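For example (note that frexp() normalizes the fraction to [0.5, 1), so 8.0 reports exponent 4, not 3):
#include <stdio.h>
#include <math.h>

int main(void)
{
    int exp;
    double m = frexp(8.0, &exp);
    printf("8.0 = %g * 2^%d\n", m, exp); /* prints: 8.0 = 0.5 * 2^4 */
    return 0;
}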
Code certainly should pass a float rather than unsigned.
// int floatExp(unsigned f)
int floatExp(float f)
Use isfinite and frexpf().
#include <math.h>
int floatExp2(float value) {
int exp;
if (isfinite(value)) {
frexpf(value, &exp);
return exp;
}
return tmax;
}
For base 10, a bit more work.
#include <math.h>
int floatExp10_(float value) {
if (isfinite(value)) {
if (value == 0.0) return 0;
float exp = log10f(fabsf(value));
// use floorf() rather than `(int) exp`.
// This does the proper rounding when `exp` is + or -
return (int) floorf(exp);
}
return tmax;
}
C does not define that floating point types, nor float, have "eight exponent bits". It also does not define that the exponent is a binary (power-of-2) one. Many implementations use the IEEE 754 single-precision binary floating-point format, binary32, which does have an 8-bit power-of-2 biased exponent.
A hack is to use a union. This non-portable method is highly dependent on knowing the floating point format, the size of the integer, and the endianness of the float and integer. The following, or a variation of it, may work in OP's environment.
int floatExp_hack(float value) {
if (isfinite(value)) {
union {
float f;
unsigned long ul;
} u;
assert(sizeof u.f == sizeof u.ul);
u.f = value;
return (u.ul >> 23) & 0xFF;
}
return tmax;
}
Take a log in whatever base you're interested in to get an exponent in that base.
Hacking around with bits is only going to work if your platform uses IEEE floats, which isn't guaranteed. Wikipedia's article on single-precision IEEE floating point will show you which bits to mask, but note that some exponents have special values.

pow() seems to be out by one here

What's going on here:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %lf\n", pow(17, 12));
printf("17^13 = %lf\n", pow(17, 13));
printf("17^14 = %lf\n", pow(17, 14));
}
I get this output:
17^12 = 582622237229761.000000
17^13 = 9904578032905936.000000
17^14 = 168377826559400928.000000
13 and 14 do not match with Wolfram Alpha:
12: 582622237229761.000000 (exact: 582622237229761)
13: 9904578032905936.000000 (exact: 9904578032905937)
14: 168377826559400928.000000 (exact: 168377826559400929)
Moreover, it's not wrong by some strange fraction - it's wrong by exactly one!
If this is down to me reaching the limits of what pow() can do for me, is there an alternative that can calculate this? I need a function that can calculate x^y, where x^y is always less than ULLONG_MAX.
pow works with double numbers. These represent numbers of the form s * 2^e where s is a 53 bit integer. Therefore double can store all integers below 2^53, but only some integers above 2^53. In particular, it can only represent even numbers > 2^53, since for e > 0 the value is always a multiple of 2.
17^13 needs 54 bits to represent exactly, so e is set to 1 and hence the calculated value becomes an even number. The correct value is odd, so it's not surprising that it's off by one. Likewise, 17^14 takes 58 bits to represent. That it too is off by one is a lucky coincidence (as long as you don't apply too much number theory): it just happens to be one off from a multiple of 32, which is the granularity at which double numbers of that magnitude are rounded.
For exact integer exponentiation, you should use integers all the way. Write your own double-free exponentiation routine. Use exponentiation by squaring if y can be large, but I assume it's always less than 64, making this issue moot.
The numbers you get are too big to be represented with a double accurately. A double-precision floating-point number has essentially 53 significant binary digits and can represent all integers up to 2^53 or 9,007,199,254,740,992.
For higher numbers, the last digits get truncated and the result of your calculation is rounded to the nearest number that can be represented as a double. For 17^13, which is only slightly above the limit, this is the closest even number. For numbers greater than 2^54 it is the closest number divisible by four, and so on.
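This is easy to see directly: write the exact value of 17^13 as a literal and let the compiler round it to the nearest double:
#include <stdio.h>

int main(void)
{
    double d = 9904578032905937.0; /* exact 17^13 needs 54 bits */
    printf("%.1f\n", d);           /* 9904578032905936.0 - the nearest even */
    return 0;
}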
If your input arguments are non-negative integers, then you can implement your own pow.
Recursively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
if (y == 0)
return 1;
if (y == 1)
return x;
return pow(x,y/2)*pow(x,y-y/2);
}
Iteratively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y--)
res *= x;
return res;
}
Efficiently:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y > 0)
{
if (y & 1)
res *= x;
y >>= 1;
x *= x;
}
return res;
}
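A quick check of the squaring version against the exact value quoted earlier (renamed ipow here so the test can coexist with <math.h>):
#include <stdio.h>

unsigned long long ipow(unsigned long long x, unsigned int y)
{
    unsigned long long res = 1;
    while (y > 0) {
        if (y & 1)
            res *= x;
        y >>= 1;
        x *= x;
    }
    return res;
}

int main(void)
{
    printf("17^13 = %llu\n", ipow(17, 13)); /* 9904578032905937, exact */
    return 0;
}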
A small addition to other good answers: on the x86 architecture there is usually an x87 80-bit extended format available, which most C compilers support via the long double type. This format can represent integers up to 2^64 without gaps.
There is an analogue of pow() in <math.h> intended for long double numbers: powl(). Note also that the format specifier for long double values differs from the one for double: %Lf. So the correct program using the long double type looks like this:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %Lf\n", powl(17, 12));
printf("17^13 = %Lf\n", powl(17, 13));
printf("17^14 = %Lf\n", powl(17, 14));
}
As Stephen Canon noted in the comments, there is no guarantee that this program will give an exact result.

Implement ceil() in C

I want to implement my own ceil() in C. I searched through the libraries for source code and found it here, but it seems pretty difficult to understand. I want clean and elegant code.
I also searched on SO and found some answers here. None of the answers seem to be correct. One of them is:
#define CEILING_POS(X) ((X-(int)(X)) > 0 ? (int)(X+1) : (int)(X))
#define CEILING_NEG(X) ((X-(int)(X)) < 0 ? (int)(X-1) : (int)(X))
#define CEILING(X) ( ((X) > 0) ? CEILING_POS(X) : CEILING_NEG(X) )
AFAIK, the return type of ceil() is not int. Will a macro be type-safe here?
Further, will the above implementation work for negative numbers?
What will be the best way to implement it?
Can you provide the clean code?
The macro you quoted definitely won't work correctly for numbers that are greater than INT_MAX but which can still be represented exactly as a double.
The only way to implement ceil() correctly (assuming you can't implement it using an equivalent assembly instruction) is to do bit-twiddling on the binary representation of the floating point number, as is done in the s_ceil.c source file behind your first link. Understanding how the code works requires an understanding of the floating point representation of the underlying platform -- the representation is most probably going to be IEEE 754 -- but there's no way around this.
Edit:
Some of the complexities in s_ceil.c stem from the special cases it handles (NaNs, infinities) and the fact that it needs to do its work without being able to assume that a 64-bit integral type exists.
The basic idea of all the bit-twiddling is to mask off the fractional bits of the mantissa and add 1 to it if the number is greater than zero... but there's a bit of additional logic involved as well to make sure you do the right thing in all cases.
Here's an illustrative version of ceil() for floats that I cobbled together. Beware: This does not handle the special cases correctly and it is not tested extensively - so don't actually use it. It does however serve to illustrate the principles involved in the bit-twiddling. I've tried to comment the routine extensively, but the comments do assume that you understand how floating point numbers are represented in IEEE 754 format.
union float_int
{
float f;
int i;
};
float myceil(float x)
{
union float_int val;
val.f=x;
// Extract sign, exponent and mantissa
// Bias is removed from exponent
int sign=val.i >> 31;
int exponent=((val.i & 0x7fffffff) >> 23) - 127;
int mantissa=val.i & 0x7fffff;
// Is the exponent less than zero?
if(exponent<0)
{
// In this case, x is in the open interval (-1, 1)
if(x<=0.0f)
return 0.0f;
else
return 1.0f;
}
else
{
// Construct a bit mask that will mask off the
// fractional part of the mantissa
int mask=0x7fffff >> exponent;
// Is x already an integer (i.e. are all the
// fractional bits zero?)
if((mantissa & mask) == 0)
return x;
else
{
// If x is positive, we need to add 1 to it
// before clearing the fractional bits
if(!sign)
{
mantissa+=1 << (23-exponent);
// Did the mantissa overflow?
if(mantissa & 0x800000)
{
// The mantissa can only overflow if all the
// integer bits were previously 1 -- so we can
// just clear out the mantissa and increment
// the exponent
mantissa=0;
exponent++;
}
}
// Clear the fractional bits
mantissa&=~mask;
}
}
// Put sign, exponent and mantissa together again
val.i=(sign << 31) | ((exponent+127) << 23) | mantissa;
return val.f;
}
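A few illustrative spot checks (assuming the union and myceil() above are in scope; recall that NaNs and infinities are not handled):
#include <stdio.h>

int main(void)
{
    printf("%f\n", myceil(2.1f));  /*  3.000000 */
    printf("%f\n", myceil(-2.1f)); /* -2.000000 */
    printf("%f\n", myceil(0.5f));  /*  1.000000 */
    printf("%f\n", myceil(4.0f));  /*  4.000000 */
    return 0;
}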
Nothing you will write is more elegant than using the standard library implementation. No code at all is always more elegant than elegant code.
That aside, this approach has two major flaws:
If X is greater than INT_MAX + 1 or less than INT_MIN - 1, the behavior of your macro is undefined. This means that your implementation may give incorrect results for nearly half of all floating-point numbers. You will also raise the invalid flag, contrary to IEEE-754.
It gets the edge cases for -0, +/-infinity, and nan wrong. In fact, the only edge case it gets right is +0.
You can implement ceil in a manner similar to what you tried, like so (this implementation assumes IEEE-754 double precision):
#include <math.h>
double ceil(double x) {
// All floating-point numbers larger than 2^52 are exact integers, so we
// simply return x for those inputs. We also handle ceil(nan) = nan here.
if (isnan(x) || fabs(x) >= 0x1.0p52) return x;
// Now we know that |x| < 2^52, and therefore we can use conversion to
// long long to force truncation of x without risking undefined behavior.
const double truncation = (long long)x;
// If the truncation of x is smaller than x, then it is one less than the
// desired result. If it is greater than or equal to x, it is the result.
// Adding one cannot produce a rounding error because `truncation` is an
// integer smaller than 2^52.
const double ceiling = truncation + (truncation < x);
// Finally, we need to patch up one more thing; the standard specifies that
// ceil(-small) be -0.0, whereas we will have 0.0 right now. To handle this
// correctly, we apply the sign of x to the result.
return copysign(ceiling, x);
}
Something like that is about as elegant as you can get and still be correct.
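Illustrative usage (assuming the routine is built under a non-reserved name such as my_ceil, since redefining the library's ceil is not strictly portable):
#include <stdio.h>

int main(void)
{
    printf("%f\n", my_ceil(2.5));   /*  3.000000 */
    printf("%f\n", my_ceil(-2.5));  /* -2.000000 */
    printf("%f\n", my_ceil(-0.25)); /* -0.000000: the signed-zero case mentioned above */
    return 0;
}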
I flagged a number of concerns with the (generally good!) implementation that Martin put in his answer. Here's how I would implement his approach:
#include <stdint.h>
#include <string.h>
static inline uint64_t toRep(double x) {
uint64_t r;
memcpy(&r, &x, sizeof x);
return r;
}
static inline double fromRep(uint64_t r) {
double x;
memcpy(&x, &r, sizeof x);
return x;
}
double ceil(double x) {
const uint64_t signbitMask = UINT64_C(0x8000000000000000);
const uint64_t significandMask = UINT64_C(0x000fffffffffffff);
const uint64_t xrep = toRep(x);
const uint64_t xabs = xrep & ~signbitMask; // clear the sign bit to get |x|'s representation
// If |x| is larger than 2^52 or x is NaN, the result is just x.
if (xabs >= toRep(0x1.0p52)) return x;
if (xabs < toRep(1.0)) {
// If x is in (-1.0, 0.0], the result is copysign(0.0, x).
// We can generate this value by clearing everything except the signbit.
if (x <= 0.0) return fromRep(xrep & signbitMask);
// Otherwise x is in (0.0, 1.0), and the result is 1.0.
else return 1.0;
}
// Now we know that the exponent of x is strictly in the range [0, 51],
// which means that x contains both integral and fractional bits. We
// generate a mask covering the fractional bits.
const int exponent = (int)(xabs >> 52) - 1023; // unbias the exponent field
const uint64_t fractionalBits = significandMask >> exponent;
// If x is negative, we want to truncate, so we simply mask off the
// fractional bits.
if (xrep & signbitMask) return fromRep(xrep & ~fractionalBits);
// x is positive; to force rounding to go away from zero, we first *add*
// the fractionalBits to x, then truncate the result. The add may
// overflow the significand into the exponent, but this produces the
// desired result (zero significand, incremented exponent), so we just
// let it happen.
return fromRep((xrep + fractionalBits) & ~fractionalBits);
}
One thing to note about this approach is that it does not raise the inexact floating-point flag for non-integral inputs. That may or may not be a concern for your usage. The first implementation that I listed does raise the flag.
I don't think a macro function is a good solution: it isn't type-safe and it evaluates its arguments multiple times (side effects). You should rather write a clean and elegant function.
As I would have expected more jokes in the answers, I will try a couple.
#define CEILING(X) ceil(X)
Bonus: a macro with not so many side effects
If you don't care too much about negative zeroes
#define CEILING(X) (-floor(-(X)))
If you care about negative zero, then
#define CEILING(X) (NEGATIVE_ZERO - floor(-(X)))
Portable definition of NEGATIVE_ZERO left as an exercise....
Bonus: it will also set FP flags (OVERFLOW, INVALID, INEXACT).
