As in whether it falls within 2^3 - 2^4, 2^4 - 2^5, etc. The number returned would be the EXPONENT itself (minus an offset).
How can this be done as quickly and efficiently as possible? This function will be called a lot in a program that is EXTREMELY dependent on speed. This is my current code, but it is far too inefficient because it uses a for loop.
static inline size_t getIndex(size_t numOfBytes)
{
int i = 3;
for (; i < 32; i++)
{
if (numOfBytes < (1 << i))
return i - OFFSET;
}
return (NUM_OF_BUCKETS - 1);
}
Thank you very much!
What you're after is simply log2(n), as far as I can tell.
It might be worth cheating and using some inline assembly if your target architecture(s) have instructions that can do this. See the Wikipedia entry on "find first set" for lots of discussion and information about hardware support.
One way to do it would be to find the highest order bit that is set to 1. I'm trying to think if this is efficient though, since you'll still have to do n checks in worst case.
Maybe you could do a binary search style where you check if it's greater than 2^16, if so, check if it's greater than 2^24 (assuming 32 bits here), and if not, then check if it's greater than 2^20, etc... That would be log(n) checks, but I'm not sure of the efficiency of a bit check vs a full int comparison.
Could get some perf data on either.
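The binary-search idea can be sketched in C roughly like this. It is only an illustration, assuming a nonzero 32-bit unsigned input; the name floor_log2_bsearch is made up for this example. It finds the highest set bit in five comparisons instead of up to 32:

```c
#include <stdint.h>

/* Sketch: floor(log2(v)) by binary search over the bit positions.
   Assumes v != 0 (for v == 0 this returns 0). */
static int floor_log2_bsearch(uint32_t v)
{
    int r = 0;
    if (v >= (UINT32_C(1) << 16)) { v >>= 16; r += 16; }
    if (v >= (UINT32_C(1) << 8))  { v >>= 8;  r += 8;  }
    if (v >= (UINT32_C(1) << 4))  { v >>= 4;  r += 4;  }
    if (v >= (UINT32_C(1) << 2))  { v >>= 2;  r += 2;  }
    if (v >= (UINT32_C(1) << 1))  { r += 1; }
    return r;
}
```

Whether this beats the loop in practice depends on the branch predictor and the input distribution, so it is worth measuring.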
There is a particularly efficient algorithm using de Bruijn sequences described on Sean Eron Anderson's excellent Bit Twiddling Hacks page:
uint32_t v; // find the log base 2 of 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
It works in 13 operations without branching!
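Wrapped up as a function and plugged into the bucket lookup from the question, it might look like the sketch below. The values of OFFSET and NUM_OF_BUCKETS are placeholders (the question does not give them), and floor_log2_u32 and get_index are names invented for this example:

```c
#include <stdint.h>

#define OFFSET 3           /* placeholder value, not from the question */
#define NUM_OF_BUCKETS 30  /* placeholder value, not from the question */

/* floor(log2(v)) using the de Bruijn multiply-and-lookup trick. */
static int floor_log2_u32(uint32_t v)
{
    static const int table[32] = {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };
    v |= v >> 1;   /* smear the highest set bit downwards */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return table[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}

/* Same result as the loop in the question: the smallest i in [3, 31]
   with num_of_bytes < 2^i, minus OFFSET, or the last bucket otherwise. */
static int get_index(uint32_t num_of_bytes)
{
    if (num_of_bytes < 8)
        return 3 - OFFSET;
    int i = floor_log2_u32(num_of_bytes) + 1;
    if (i >= 32)
        return NUM_OF_BUCKETS - 1;
    return i - OFFSET;
}
```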
You are basically trying to compute: floor(log2(x))
Take the logarithm to the base 2, then take the floor.
The most portable way to do this in C is to use the logf() function, which finds the log to the base e, then adjust: log2(x) == logf(x) / logf(2.0)
See the answer here: How to write log base(2) in c/c++
If you just cast the resulting float value to int, you compute floor() at the same time.
But, if it is available to you and you can use it, there is an extremely fast way to compute log2() of a floating point number: logbf()
From the man page:
The integer constant FLT_RADIX, defined in <float.h>, indicates the radix used
for the system's floating-point representation. If FLT_RADIX is 2,
logb(x) is equal to floor(log2(x)), except that it is probably faster.
http://linux.die.net/man/3/logb
If you think about how floating-point numbers are stored, you realize that the value floor(log2(x)) is part of the number, and if you just extract that value you are done. A little bit of shifting and bit-masking, then subtract the bias from the exponent field, and there you have it: the fastest way possible to compute floor(log2(x)) for any positive, normal float value x.
http://en.wikipedia.org/wiki/Single_precision
But actually logbf() converts the result to a float before giving it to you, and handles errors. If you write your own function to extract the exponent as an integer, it will be slightly faster and an integer is what you want anyway. If you wanted to write your own function you need to use a C union to gain access to the bits inside the float; trying to play with pointers will get you warnings or errors related to "type-punning", at least on GCC. I will give details on how to do this, if you ask. I have written this code before, as an inline function.
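For reference, here is a minimal sketch of such an exponent extraction. It uses memcpy rather than a union (another way to get at the bits without type-punning warnings), and it assumes float is IEEE 754 single precision and x is positive and normal. The function name is made up for this example:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: floor(log2(x)) read straight from the exponent field.
   Assumes IEEE 754 binary32 and a positive, normal x. */
static int float_floor_log2(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* copy out the raw bits */
    return (int)((bits >> 23) & 0xFF) - 127;   /* drop significand, remove bias */
}
```

To use it on an integer, convert to float first; note that integers above 2^24 may round up during that conversion, which can push the result one too high.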
If you only have a small range of numbers to test, you could possibly cast your numbers to integer and then use a lookup table.
You can make use of floating number representation:
double n_bytes = numOfBytes;
Taking the exponent bits should give you the result as floating numbers are represented as:
(-1)^S X (1. + M) X 2^E
Where:
S - Sign
M - Mantissa
E - Exponent
To construct the mask and shift you would have to read about the exact bit pattern of the floating point type you are using.
The CPU floating point support does most of the work for you.
An even better way would be to use the built-in function:
double frexp (double x, int * exp );
Floating point representation
#include <limits.h> // For CHAR_BIT.
#include <math.h> // For frexp.
#include <stdio.h> // For printing results, as a demonstration.
// These routines assume 0 < x.
/* This requires GCC (or any other compiler that supplies __builtin_clz). It
should perform well on any machine with a count-leading-zeroes instruction
or something similar.
*/
static int log2A(unsigned int x)
{
return sizeof x * CHAR_BIT - 1 - __builtin_clz(x);
}
/* This requires that a double be able to exactly represent any unsigned int.
(This is true for 32-bit integers and 64-bit IEEE 754 floating-point.) It
might perform well on some machines and poorly on others.
*/
static int log2B(unsigned int x)
{
int exponent;
frexp(x, &exponent);
return exponent - 1;
}
int main(void)
{
// Demonstrate the routines.
for (unsigned int x = 1; x; x <<= 1)
printf("0x%08x: log2A -> %2d, log2B -> %2d.\n", x, log2A(x), log2B(x));
return 0;
}
This is generally fast on any machine with hardware floating point unit:
((union { float val; uint32_t repr; }){ x }.repr >> 23) - 0x7f
The only assumptions it makes are that floating point is IEEE and integer and floating point endianness match, both of which are true on basically all real-world systems (certainly all modern ones).
Edit: When I've used this in the past, I didn't need it for large numbers. Eric points out that it will give the wrong result for ints that don't fit in float. Here is a revised (albeit possibly slower) version that fixes that and supports values up to 52 bits (in particular, all 32-bit positive integer inputs):
((union { double val; uint64_t repr; }){ x }.repr >> 52) - 0x3ff
Also note that I'm assuming x is a positive (not just non-negative, also nonzero) number. If x is negative you'll get a bogus result, and if x is 0, you'll get a large negative result (approximating negative infinity as the logarithm).
I want to create a big integer from string representation and to do that efficiently I need an upper bound on the number of digits in the target base to avoid reallocating memory.
Example:
A 640 bit number has 640 digits in base 2, but only ten digits in base 2^64, so I will have to allocate ten 64 bit integers to hold the result.
The function I am currently using is:
int get_num_digits_in_different_base(int n_digits, double src_base, double dst_base){
return ceil(n_digits*log(src_base)/log(dst_base));
}
Where src_base is in {2, ..., 10 + 26} and dst_base is in {2^8, 2^16, 2^32, 2^64}.
I am not sure if the result will always be correctly rounded though. log2 would be easier to reason about, but I read that older versions of Microsoft Visual C++ do not support that function. It could be emulated like log2(x) = log(x)/log(2) but now I am back where I started.
GMP probably implements a function to do base conversion, but I may not read the source or else I might get GPL cancer so I can not do that.
I imagine speed is of some concern, or else you could just try the floating point-based estimate and adjust if it turned out to be too small. In that case, one can sacrifice tightness of the estimate for speed.
In the following, let dst_base be 2^w, src_base be b, and n_digits be n.
Let k(b,w)=max {j | b^j < 2^w}. This represents the largest power of b that is guaranteed to fit within a w-wide binary (non-negative) integer. Because of the relatively small number of source and destination bases, these values can be precomputed and looked up in a table, but mathematically k(b,w)=[w log 2/log b] (where [.] denotes the integer part), except when b^j = 2^w exactly for some j, in which case the strict inequality makes k one smaller than this closed form.
For a given n let m=ceil( n / k(b,w) ). Then the maximum number of dst_base digits required to hold a number less than b^n is:
ceil(log (b^n-1)/log (2^w)) ≤ ceil(log (b^n) / log (2^w) )
≤ ceil( m . log (b^k(b,w)) / log (2^w) ) ≤ m.
In short, if you precalculate the k(b,w) values, you can quickly get an upper bound (which is not tight!) by dividing n by k, rounding up.
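A rough integer-only sketch of this estimate follows. It uses the strict definition b^j < 2^w, so it is slightly conservative for power-of-two source bases. It assumes 2 <= b < 2^w and 1 <= w <= 64; both function names are invented for this example:

```c
#include <stdint.h>

/* k(b, w): the largest j with b^j < 2^w, computed without overflow. */
static unsigned max_src_digits_per_word(unsigned b, unsigned w)
{
    uint64_t limit = (w >= 64) ? UINT64_MAX : ((UINT64_C(1) << w) - 1);
    uint64_t power = 1;
    unsigned j = 0;
    while (power <= limit / b) {   /* equivalent to power * b <= limit */
        power *= b;
        ++j;
    }
    return j;
}

/* Upper bound on base-2^w digits needed for n base-b digits: ceil(n / k). */
static unsigned dst_digits_upper_bound(unsigned n, unsigned b, unsigned w)
{
    unsigned k = max_src_digits_per_word(b, w);
    return (n + k - 1) / k;
}
```

In a real program the k values would come from a small precomputed table indexed by base and word width rather than from the loop.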
I'm not sure about float point rounding in this case, but it is relatively easy to implement this using only integers, as log2 is a classic bit manipulation pattern and integer division can be easily rounded up. The following code is equivalent to yours, but using integers:
// Returns log2(x) rounded up using bit manipulation (not most efficient way)
unsigned int log2(unsigned int x)
{
unsigned int y = 0;
--x;
while (x) {
y++;
x >>= 1;
}
return y;
}
// Returns ceil(a/b) using integer division
unsigned int roundup(unsigned int a, unsigned int b)
{
return (a + b - 1) / b;
}
unsigned int get_num_digits_in_different_base(unsigned int n_digits, unsigned int src_base, unsigned int log2_dst_base)
{
return roundup(n_digits * log2(src_base), log2_dst_base);
}
Please, note that:
This function returns different results from yours! However, in every case I checked, both were still correct (the smaller value was more accurate, but your requirement is just an upper bound).
The integer version I wrote receives log2_dst_base instead of dst_base to avoid overflow for 2^64.
log2 can be made more efficient using lookup tables.
I've used unsigned int instead of int.
A buddy of mine had these puzzles, and this is one that is eluding me. Here is the problem: you are given a number and you want to return that number times 3 and divided by 16, rounding towards 0. Should be easy. The catch? You can only use the ! ~ & ^ | + << >> operators, and at most 12 of them in total.
int mult(int x){
//some code here...
return y;
}
My attempt at it has been:
int hold = x + x + x;
int hold1 = 8;
hold1 = hold1 & hold;
hold1 = hold1 >> 3;
hold = hold >> 4;
hold = hold + hold1;
return hold;
But that doesn't seem to be working. I think I have a problem of losing bits but I can't seem to come up with a way of saving them. Another perspective would be nice. Just to add, you also can only use variables of type int and no loops, if statements or function calls may be used.
Right now I have the number 0xfffffff. It is supposed to return 0x2ffffff but it is returning 0x3000000.
For this question you need to worry about the lost bits before your division (obviously).
Essentially, if it is negative then you want to add 15 after you multiply by 3. A simple if statement (using your operators) should suffice.
I am not going to give you the code but a step by step would look like,
x = x*3
get the sign and store it in variable foo.
have another variable hold x + 15;
Set up an if statement so that if x is negative it uses that added 15 and if not then it uses the regular number (times 3 which we did above).
Then divide by 16 which you already showed you know how to do. Good luck!
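For anyone who wants to check their work afterwards, the steps above could come out roughly like this. It is a sketch that assumes a 32-bit int, an arithmetic right shift for negative values (implementation-defined in C, but what most compilers do), and no overflow in 3x. The name mul3_div16 is made up:

```c
/* Sketch: (x * 3) / 16 rounded toward zero, using only + >> and &.
   Assumes 32-bit int, arithmetic right shift, and that 3x does not overflow. */
static int mul3_div16(int x)
{
    int t = x + x + x;           /* 3x */
    int bias = (t >> 31) & 15;   /* 15 if t is negative, else 0 */
    return (t + bias) >> 4;      /* the bias turns floor into truncation */
}
```

That is six operators in total, well under the limit of 12.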
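This seems to work (as long as no overflow occurs), although it rounds toward negative infinity (like Math.floor) rather than toward zero, so it differs from the requested result for negative inputs whose product is not a multiple of 16: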
((num<<2)+~num+1)>>4
Try this JavaScript code, run in console:
for (var num = -128; num <= 128; ++num) {
var a = Math.floor(num * 3 / 16);
var b = ((num<<2)+~num+1)>>4;
console.log(
"Input:", num,
"Regular math:", a,
"Bit math:", b,
"Equal: ", a===b
);
}
The Maths
When you divide a positive integer n by 16, you get a positive integer quotient k and a remainder c < 16:
(n/16) = k + (c/16).
(Or simply apply Euclidean division.) The question asks for multiplication by 3/16, so multiply by 3:
(n/16) * 3 = 3k + (c/16) * 3.
The number k is an integer, so the part 3k is still a whole number. However, int arithmetic rounds down, so the second term may lose precision if you divide first. Since c < 16, you can safely multiply first without overflowing (assuming int has at least 7 bits). So the algorithm design can be
(3n/16) = 3k + (3c/16).
The design
The integer k is simply n/16 rounded down towards 0. So k can be found by applying a single AND operation. Two further operations will give 3k. Operation count: 3.
The remainder c can also be found using an AND operation (with the missing bits). Multiplication by 3 uses two more operations, and a shift finishes the division. Operation count: 4.
Adding them together gives you the final answer.
Total operation count: 8.
Negatives
The above algorithm uses shift operations. It may not work well on negatives. However, assuming two's complement, the sign of n is stored in a sign bit. It can be removed before applying the algorithm and reapplied on the answer.
To find and store the sign of n, a single AND is sufficient.
To remove this sign, OR can be used.
Apply the above algorithm.
To restore the sign bit, use a final OR operation on the algorithm output with the stored sign bit.
This brings the final operation count up to 11.
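What you can do is first divide by 4, add the result three times, then divide by 4 again:
3*x/16=(x/4+x/4+x/4)/4
Note that this is only approximate, because x/4 discards the low two bits (for x = 7 it gives 0 rather than 1). With this logic the program can be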
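#include <stdio.h>

int main(void)
{
    int x=(int)0xefffffff; /* negative test value; assumes 32-bit int */
    int y;
    printf("%x",(unsigned)x);
    y=x&(0x80000000); /* isolate the sign bit */
    y=y>>31; /* all ones if x was negative (arithmetic shift assumed), else 0 */
    x=(y&(~x+1))+(~y&(x)); /* x = |x| */
    x=x>>2; /* divide by 4 */
    x=x&(0x3fffffff);
    x=x+x+x; /* multiply by 3 */
    x=x>>2; /* divide by 4 again */
    x=x&(0x3fffffff);
    x=(y&(~x+1))+(~y&(x)); /* restore the original sign */
    printf("\n%x %d",(unsigned)x,x);
    return 0;
}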
The AND with 0x3fffffff clears the two most significant bits after each shift, which also forces the value to be positive.
This relies on two's complement representation of negative numbers. Dividing negative numbers directly by shifting loses bit accuracy, so use this workaround: convert the negative number to a positive one, perform the division, then restore the sign.
Note that the C99 standard states in section 6.5.7 that right-shifting a negative signed integer yields an implementation-defined value. Provided that int comprises 32 bits and that right shifts of signed integers map to an arithmetic shift instruction, the following code works for all int inputs. A fully portable solution that also fulfills the requirements set out in the question may be possible, but I cannot think of one right now.
My basic idea is to split the number into high and low bits to prevent intermediate overflow. The high bits are divided by 16 first (this is an exact operation), then multiplied by three. The low bits are first multiplied by three, then divided by 16. Since arithmetic right shift rounds towards negative infinity instead of towards zero like integer division, a correction needs to be applied to the right shift for negative numbers. For a right shift by N, one needs to add 2^N - 1 prior to the shift if the number to be shifted is negative.
#include <stdio.h>
#include <stdlib.h>
int ref (int a)
{
long long int t = ((long long int)a * 3) / 16;
return (int)t;
}
int main (void)
{
int a, t, r, c, res;
a = 0;
do {
t = a >> 4; /* high order bits */
r = a & 0xf; /* low order bits */
c = (a >> 31) & 15; /* shift correction. Portable alternative: (a < 0) ? 15 : 0 */
res = t + t + t + ((r + r + r + c) >> 4);
if (res != ref(a)) {
printf ("!!!! error a=%08x res=%08x ref=%08x\n", a, res, ref(a));
return EXIT_FAILURE;
}
a++;
} while (a);
return EXIT_SUCCESS;
}
I'm working with a microchip that doesn't have room for floating point. However, I need to account for fractional values in some equations. So far I've had good luck using the old *100 -> /100 method like so:
increment = (short int)(((value1 - value2)*100 / totalSteps));
// later in the code I loop through the number of totalSteps,
// adding back the increment to arrive at the total I want at the precise
// time I need it.
newValue = oldValue + (increment / 100);
This works great for values from 0-255 divided by a totalSteps of up to 300. After 300, the fractional values to the right of the decimal place, become important, because they add up over time of course.
I'm curious if anyone has a better way to save decimal accuracy within an integer paradigm? I tried using *1000 /1000, but that didn't work at all.
Thank you in advance.
Fractions with integers is called fixed point math.
Try Googling "fixed point".
Fixed point tips and tricks are out of the scope of SO answer...
Example: 5 tap FIR filter
// C is the filter coefficients using 2.8 fixed precision.
// 2 MSB (of 10) is for integer part and 8 LSB (of 10) is the fraction part.
// Actual fraction precision here is 1/256.
int FIR_5(int* in, // input samples
int inPrec, // sample fraction precision
int* c, // filter coefficients
int cPrec) // coefficients fraction precision
{
const int coefHalf = (cPrec > 0) ? 1 << (cPrec - 1) : 0; // value of 0.5 using cPrec
int sum = 0;
for ( int i = 0; i < 5; ++i )
{
sum += in[i] * c[i];
}
// sum's precision is X.N. where N = inPrec + cPrec;
// return to original precision (inPrec)
sum = (sum + coefHalf) >> cPrec; // adding coefHalf for rounding
return sum;
}
int main()
{
const int filterPrec = 8;
int C[5] = { 8, 16, 208, 16, 8 }; // 1.0 == 256 in 2.8 fixed point. Filter value are 8/256, 16/256, 208/256, etc.
int W[5] = { 10, 203, 40, 50, 72}; // A sampling window (example)
int res = FIR_5(W, 0, C, filterPrec);
return 0;
}
Notes:
In the above example:
the samples are integers (no fraction)
the coefs have fractions of 8 bit.
8 bit fractions mean that each change of 1 is treated as 1/256. 1 << 8 == 256.
Useful notation is Y.Xu or Y.Xs, where Y is how many bits are allocated for the integer part and X for the fraction; u/s denote unsigned/signed.
when multiplying 2 fixed point numbers, their precisions (numbers of fraction bits) are added to each other.
Example: A is 0.8u, B is 0.2u. C=A*B. C is 0.10u
when dividing, use a shift operation to lower the result precision. Amount of shifting is up to you. Before lowering precision it's better to add a half to lower the error.
Example: A=129 in 0.8u which is a little over 0.5 (129/256). We want the integer part so we right shift it by 8. Before that we want to add a half which is 128 (1<<7). So A = (A + 128) >> 8 --> 1.
Without adding a half you'll get a larger error in the final result.
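The rounding and multiplication notes above can be condensed into two small helpers. This is only a sketch for unsigned values with at least one fraction bit; the names fx_round_shift and fx8_mul are made up for this example:

```c
#include <stdint.h>

/* Drop frac_bits fraction bits from an unsigned fixed-point value,
   rounding to nearest by adding one half first. Requires frac_bits >= 1. */
static uint32_t fx_round_shift(uint32_t v, unsigned frac_bits)
{
    return (v + (UINT32_C(1) << (frac_bits - 1))) >> frac_bits;
}

/* Multiply two unsigned 8.8 values and return an 8.8 result.
   The raw product has 8 + 8 = 16 fraction bits, so 8 are shifted out. */
static uint32_t fx8_mul(uint32_t a, uint32_t b)
{
    return fx_round_shift(a * b, 8);
}
```

For example, fx_round_shift(129, 8) reproduces the A = 129 case above, and fx8_mul(0x0180, 0x0200) multiplies 1.5 by 2.0.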
Don't use this approach.
New paradigm: Do not accumulate using FP math or fixed point math. Do your accumulation and other equations with integer math. Anytime you need to get some scaled value, divide by your scale factor (100), but do the "add up" part with the raw, unscaled values.
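As a sketch of what "directly interpolate at each step" could look like with plain integers: the value at step i is computed from the raw endpoints every time, so no rounding error accumulates. The function name and parameter names are made up, and long is assumed to be at least 32 bits so the product does not overflow for small values like 0-255:

```c
/* Sketch: value at step i (0..steps) on a straight line from `from` to `to`.
   The division truncates toward zero; steps must be nonzero. */
static int interpolate(int from, int to, long i, long steps)
{
    return from + (int)(((long)(to - from) * i) / steps);
}
```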
Here's a quick attempt at a precise rational (Bresenham-esque) version of the interpolation if you truly cannot afford to directly interpolate at each step.
div_t frac_step = div(target - source, num_steps);
if(frac_step.rem < 0) {
// Annoying special case to deal with rounding towards zero.
// Alternatively check for the error term slipping to < -num_steps as well
frac_step.rem += num_steps; // keeps quot * num_steps + rem unchanged after --quot
--frac_step.quot;
}
unsigned int error = 0;
do {
// Add the integer term plus an accumulated fraction
error += frac_step.rem;
if(error >= num_steps) {
// Time to carry
error -= num_steps;
++source;
}
source += frac_step.quot;
} while(--num_steps);
A major drawback compared to the fixed-point solution is that the fractional term gets rounded off between iterations if you are using the function to continually walk towards a moving target at differing step lengths.
Oh, and for the record your original code does not seem to be properly accumulating the fractions when stepping, e.g. a 1/100 increment will always be truncated to 0 in the addition no matter how many times the step is taken. Instead you really want to add the increment to a higher-precision fixed-point accumulator and then divide it by 100 (or preferably right shift to divide by a power-of-two) each iteration in order to compute the integer "position".
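A minimal sketch of that accumulator idea, assuming 16 fraction bits, endpoint values in the 0-255 range from the question, and a 32-bit accumulator. The name ramp_q16 and the output-array interface are invented for this example:

```c
#include <stdint.h>

/* Sketch: walk from `from` to `to` in `steps` equal steps, keeping the
   running position in a 16.16 fixed-point accumulator.
   Assumes 0 <= from, to <= 255 and steps > 0, so nothing overflows and
   the accumulator never goes negative. */
static void ramp_q16(int16_t from, int16_t to, int steps, int16_t *out)
{
    int32_t acc = (int32_t)from * 65536;                   /* position, 16.16 */
    int32_t inc = (((int32_t)to - from) * 65536) / steps;  /* step, 16.16 */
    for (int i = 0; i < steps; ++i) {
        acc += inc;                                 /* fractions accumulate here */
        out[i] = (int16_t)((acc + 32768) >> 16);    /* round to nearest integer */
    }
}
```

The integer position is derived from the accumulator on every iteration, so a tiny increment such as 1/100 is no longer truncated away.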
Do take care with the different integer types and ranges required in your calculations. A multiplication by 1000 will overflow a 16-bit integer unless one term is a long. Go through your calculations and keep track of input ranges and the headroom at each step, then select your integer types to match.
Maybe you can simulate floating point behaviour by storing values according to the IEEE 754 specification.
So you save the mantissa, exponent, and sign as unsigned int values.
For calculations you then use bitwise operations on the mantissa and exponent, and so on.
Multiplication and division can be replaced by shift and add operations.
I think it is a lot of programming work to emulate that, but it should work.
Your choice of type is the problem: short int is likely to be 16 bits wide. That's why large multipliers don't work - you're limited to +/-32767. Use a 32 bit long int, assuming that your compiler supports it. What chip is it, by the way, and what compiler?
I want to implement my own ceil() in C. Searched through the libraries for source code & found here, but it seems pretty difficult to understand. I want clean & elegant code.
I also searched on SO and found some answers here. None of the answers seems to be correct. One of the answers is:
#define CEILING_POS(X) ((X-(int)(X)) > 0 ? (int)(X+1) : (int)(X))
#define CEILING_NEG(X) ((X-(int)(X)) < 0 ? (int)(X-1) : (int)(X))
#define CEILING(X) ( ((X) > 0) ? CEILING_POS(X) : CEILING_NEG(X) )
AFAIK, the return type of the ceil() is not int. Will macro be type-safe here?
Further, will the above implementation work for negative numbers?
What will be the best way to implement it?
Can you provide the clean code?
The macro you quoted definitely won't work correctly for numbers that are greater than INT_MAX but which can still be represented exactly as a double.
The only way to implement ceil() correctly (assuming you can't implement it using an equivalent assembly instruction) is to do bit-twiddling on the binary representation of the floating point number, as is done in the s_ceil.c source file behind your first link. Understanding how the code works requires an understanding of the floating point representation of the underlying platform -- the representation is most probably going to be IEEE 754 -- but there's no way around this.
Edit:
Some of the complexities in s_ceil.c stem from the special cases it handles (NaNs, infinities) and the fact that it needs to do its work without being able to assume that a 64-bit integral type exists.
The basic idea of all the bit-twiddling is to mask off the fractional bits of the mantissa and add 1 to it if the number is greater than zero... but there's a bit of additional logic involved as well to make sure you do the right thing in all cases.
Here's an illustrative version of ceil() for floats that I cobbled together. Beware: This does not handle the special cases correctly and it is not tested extensively -- so don't actually use it. It does however serve to illustrate the principles involved in the bit-twiddling. I've tried to comment the routine extensively, but the comments do assume that you understand how floating point numbers are represented in IEEE 754 format.
union float_int
{
float f;
int i;
};
float myceil(float x)
{
union float_int val;
val.f=x;
// Extract sign, exponent and mantissa
// Bias is removed from exponent
int sign=val.i >> 31;
int exponent=((val.i & 0x7fffffff) >> 23) - 127;
int mantissa=val.i & 0x7fffff;
// Is the exponent less than zero?
if(exponent<0)
{
// In this case, x is in the open interval (-1, 1)
if(x<=0.0f)
return 0.0f;
else
return 1.0f;
}
else
{
// Construct a bit mask that will mask off the
// fractional part of the mantissa
int mask=0x7fffff >> exponent;
// Is x already an integer (i.e. are all the
// fractional bits zero?)
if((mantissa & mask) == 0)
return x;
else
{
// If x is positive, we need to add 1 to it
// before clearing the fractional bits
if(!sign)
{
mantissa+=1 << (23-exponent);
// Did the mantissa overflow?
if(mantissa & 0x800000)
{
// The mantissa can only overflow if all the
// integer bits were previously 1 -- so we can
// just clear out the mantissa and increment
// the exponent
mantissa=0;
exponent++;
}
}
// Clear the fractional bits
mantissa&=~mask;
}
}
// Put sign, exponent and mantissa together again
val.i=(sign << 31) | ((exponent+127) << 23) | mantissa;
return val.f;
}
Nothing you will write is more elegant than using the standard library implementation. No code at all is always more elegant than elegant code.
That aside, this approach has two major flaws:
If X is greater than INT_MAX + 1 or less than INT_MIN - 1, the behavior of your macro is undefined. This means that your implementation may give incorrect results for nearly half of all floating-point numbers. You will also raise the invalid flag, contrary to IEEE-754.
It gets the edge cases for -0, +/-infinity, and nan wrong. In fact, the only edge case it gets right is +0.
You can implement ceil in manner similar to what you tried, like so (this implementation assumes IEEE-754 double precision):
#include <math.h>
double ceil(double x) {
// All floating-point numbers larger than 2^52 are exact integers, so we
// simply return x for those inputs. We also handle ceil(nan) = nan here.
if (isnan(x) || fabs(x) >= 0x1.0p52) return x;
// Now we know that |x| < 2^52, and therefore we can use conversion to
// long long to force truncation of x without risking undefined behavior.
const double truncation = (long long)x;
// If the truncation of x is smaller than x, then it is one less than the
// desired result. If it is greater than or equal to x, it is the result.
// Adding one cannot produce a rounding error because `truncation` is an
// integer smaller than 2^52.
const double ceiling = truncation + (truncation < x);
// Finally, we need to patch up one more thing; the standard specifies that
// ceil(-small) be -0.0, whereas we will have 0.0 right now. To handle this
// correctly, we apply the sign of x to the result.
return copysign(ceiling, x);
}
Something like that is about as elegant as you can get and still be correct.
I flagged a number of concerns with the (generally good!) implementation that Martin put in his answer. Here's how I would implement his approach:
#include <stdint.h>
#include <string.h>
static inline uint64_t toRep(double x) {
uint64_t r;
memcpy(&r, &x, sizeof x);
return r;
}
static inline double fromRep(uint64_t r) {
double x;
memcpy(&x, &r, sizeof x);
return x;
}
double ceil(double x) {
const uint64_t signbitMask = UINT64_C(0x8000000000000000);
const uint64_t significandMask = UINT64_C(0x000fffffffffffff);
const uint64_t xrep = toRep(x);
const uint64_t xabs = xrep & ~signbitMask;
// If |x| is larger than 2^52 or x is NaN, the result is just x.
if (xabs >= toRep(0x1.0p52)) return x;
if (xabs < toRep(1.0)) {
// If x is in (-1.0, 0.0], the result is copysign(0.0, x).
// We can generate this value by clearing everything except the signbit.
if (x <= 0.0) return fromRep(xrep & signbitMask);
// Otherwise x is in (0.0, 1.0), and the result is 1.0.
else return 1.0;
}
// Now we know that the exponent of x is strictly in the range [0, 51],
// which means that x contains both integral and fractional bits. We
// generate a mask covering the fractional bits.
const int exponent = (int)(xabs >> 52) - 1023; // remove the exponent bias
const uint64_t fractionalBits = significandMask >> exponent;
// If x is negative, we want to truncate, so we simply mask off the
// fractional bits.
if (xrep & signbitMask) return fromRep(xrep & ~fractionalBits);
// x is positive; to force rounding to go away from zero, we first *add*
// the fractionalBits to x, then truncate the result. The add may
// overflow the significand into the exponent, but this produces the
// desired result (zero significand, incremented exponent), so we just
// let it happen.
return fromRep((xrep + fractionalBits) & ~fractionalBits);
}
One thing to note about this approach is that it does not raise the inexact floating-point flag for non-integral inputs. That may or may not be a concern for your usage. The first implementation that I listed does raise the flag.
I don't think a macrofunction is a good solution: it isn't type safe and there is a multi-evaluation of the arguments (side-effects). You should rather write a clean and elegant function.
As I would have expected more jokes in answers, I will try a couple
#define CEILING(X) ceil(X)
Bonus: a macro with not so many side effects
If you don't care too much of negative zeroes
#define CEILING(X) (-floor(-(X)))
If you care of negative zero, then
#define CEILING(X) (NEGATIVE_ZERO - floor(-(X)))
Portable definition of NEGATIVE_ZERO left as an exercise....
Bonus, it will also set FP flags (OVERFLOW INVALID INEXACT)
I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low 32-bit half and x2 is its high 32-bit half
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // Decompose this
G*(a/b) = (a1*G) / (b1 + b2*G) + (a2*G*G) / (b1 + b2*G)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
(((uint128_t)x << 32) / y)
Well they aren't and I need a solution. Thank you for your help.
You can decompose a larger division into multiple chunks that do division with less bits. As another poster already mentioned the algorithm can be found in TAOCP from Knuth.
However, no need to buy the book!
There is code on the Hacker's Delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128 bits you have to double the width of all types, masks and constants, e.g. a short becomes an int, 0xffff becomes 0xffffffffll, etc.
After this easy change you should be able to do 128-bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96 bits, one of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh - and before I forget this: The code only works with unsigned values. To convert from signed to unsigned divide you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way - the ordinary shift and subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.
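To illustrate the shift-and-subtract route for the unsigned case, here is a sketch that needs only 64-bit arithmetic. It takes the integer part from an ordinary 64-bit division and then produces the remaining 32 quotient bits one at a time from the remainder. It assumes b is nonzero and that the true quotient fits in 64 bits; the name udiv_32_32 is made up:

```c
#include <stdint.h>

/* Sketch: unsigned 32.32 division, returning floor((a * 2^32) / b).
   Assumes b != 0 and that the result fits in 64 bits. */
static uint64_t udiv_32_32(uint64_t a, uint64_t b)
{
    uint64_t quot = a / b;   /* upper part of the raw quotient */
    uint64_t rem  = a % b;   /* remainder, always < b */
    uint64_t frac = 0;       /* the 32 low quotient bits */

    for (int i = 0; i < 32; ++i) {
        uint64_t carry = rem >> 63;   /* bit the shift below pushes out */
        rem <<= 1;
        frac <<= 1;
        if (carry || rem >= b) {
            rem -= b;                 /* wraps to the right value if carry is set */
            frac |= 1;
        }
    }
    return (quot << 32) | frac;
}
```

Signed inputs can go through the sign-handling wrapper shown above. The carry check is what keeps this correct when the remainder has its top bit set, which is the case where simply shifting the remainder up by 32 bits would overflow.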
Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution possible that finds the right shifts to use quicker, but I'd have to think much deeper than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow in the upshift of the x % y remainder if this remainder has any bits set in the high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.
I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x * (2 - D * x)
where D is the denominator. The iteration converges to 1/D, and the quotient is then N * x, where N is the numerator.
This just uses multiplies and adds, which you already have, and it converges very quickly to about 1 ULP of precision. On the other hand, you won't be able to achieve the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.
Quick -n- dirty.
Do the A/B divide with double precision floating point.
This gives you C~=A/B. It's only approximate because of floating point precision and 53 bits of mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat, now computing D/B with floating point. Again, round the answer to an integer. Add each division result together as you go. You can stop when your remainder is so small that your floating point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, see if increasing it improves the error.. you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.