what (r+1 + (r >> 8)) >> 8 does? - c

In some old C/C++ graphics related code, that I have to port to Java and JavaScript I found this:
b = (b+1 + (b >> 8)) >> 8; // very fast
Where b is short int for blue, and same code is seen for r and b (red & blue). The comment is not helpful.
I cannot figure out what it does, apart from obvious shifting and adding. I can port without understanding, I just ask out of curiosity.

y = ( x + 1 + (x>>8) ) >> 8 // very fast
This is a fixed-point approximation of division by 255. Conceptually, this is useful for normalizing calculations based on pixel values such that 255 (typically the maximum pixel value) maps to exactly 1.
It is described as very fast because fully general integer division is a relatively slow operation on many CPUs -- although it is possible that your compiler would make a similar optimization for you if it can deduce the input constraints.
This works based on the idea that 257/(256*256) is a very close approximation of 1/255, and that x*257/256 can be formulated as x+(x>>8). The +1 is rounding support which allows the formula to exactly match the integer division x/255 for all values of x in [0..65534].
Some algebra on the inner portion may make things a bit more clear...
x*257/256
= (x*256+x)/256
= x + x/256
= x + (x>>8)
There is more discussion here: How to do alpha blend fast? and here: Division via Multiplication
By the way, if you want round-to-nearest, and your CPU can do fast multiplies, the following is accurate for all uint16_t dividend values -- actually [0..(2^16)+126].
y = ((x+128)*257)>>16 // divide by 255 with round-to-nearest for x in [0..65662]

Looks like it is meant to check if blue (or red or green) is fully used. It evaluates to 1, when b is 255, and is 0 for all lower values.

A common use case of when you'd want to use a formula that's more accurate than 257/256 is when you have to combine a lot of alpha values together for each pixel. As one example, when doing image shrinking, you need to combine 4 alphas for each source pixel contributing to the destination, and then combine all the source pixels contributing to the destination.
I posted an infinitely accurate bit twiddling version of /255 but it was rejected without reason. So I'll add that I implement alpha blending hardware for a living, I write real time graphics code and game engines for a living, and I've published articles on this topic in conferences like MICRO, so I really know what I'm talking about. And it might be useful or at least entertaining for people to understand the more accurate formula that is EXACTLY 1/255:
Version 1: x = (x + (x >> 8)) >> 8
- no constant added, won't satisfy (x * 255) / 255 = x, but will look fine in most cases.
Version 2: x = (x + (x >> 8) + 1) >> 8
- WILL satisfy (x * 255) / 255 = x for integers, but won't hit correct integer values for all alphas
Version 3: (simple integer rounding):
(x + (x >> 8) + 128) >> 8
- Won't hit correct integer values for all alphas, but will on average be closer than Version 2 at the same cost.
Version 4: Infinitely accurate version, to any level of precision desired, for any number of composite alphas: (useful for image resizing, rotation, etc.):
[(x + (x >> 8)) >> 8] + [ ( (x & 255) + (x >> 8) ) >> 8]
Why is version 4 infinitely accurate?
Because 1/255 = 1/256 + 1/65536 + 1/256^3 + 1/256^4 + ...
The simplest expression above (version 1) doesn't handle rounding, but it also doesn't handle the carries that occur from this infinite number of identical sum columns. The new term added above determines the carry out (0 or 1) from this infinite number of base 256 digits. By adding it, you are getting the same result as if you added all the infinite addends. At which point you can round by adding a half bit to whatever accuracy point you want.
Not needed for the OP perhaps, but people should know that you don't need to approximate at all. The formula above is actually more accurate than double precision floating point.
As for speed: In hardware, this method is faster than even a single (full width) add. In software, you have to consider throughput vs latency. In latency, it may still be faster than a narrow multiply (definitely faster than a full width multiply), but in the OP context, you can unroll many pixels at once, and since modern multiply units are pipelined, you are still OK. In translation to Java, you probably have no narrow multiplies, so this could still be faster, but need to check.
WRT the one person who said "why not use the built in OS capabilities for alpha blitting?": If you already have a substantial graphical code base in that OS, this might be a fine option. If not, you're looking at hundreds to thousands as many lines of code to leverage the OS version - code that's far harder to write and debug than this code. And in the end, the OS code you have isn't portable at all, while this code can be used anywhere.

I suspect that it is trying to do the following:
boolean isBFullyOn = false;
if (b == 0xff) {
isBFullyOn = true;
}
Back in the days of slow processors; smart bit-shifting tricks like the above could be faster than the obvious if-then-else logic. It avoids a jump statement which was costly.
It probably also sets an overflow flag in the processor which was used for some latter logic. This is all highly dependant upon the target processor.
And also on my part speculative!!

Is value of b+1 + b/256, this calculation divided by 256.
In that way, using bit shift the compiler tranlte using CPU level shift instruction, instead of using FPU or library division functions.

b = (b + (b >> 8)) >> 8; is basically b = b *257/256 .
I would consider +1 being an ugly hack of the -0.5 mean reduce caused by the inner >>8.
I would write it as b = (b + 128 + ((b +128)>> 8)) >> 8; instead.

Running this test code:
public void test() {
Set<Integer> results = new HashSet<Integer>();
// short int ranges between -32767 and 32767
for (int i = -32767; i <= 32767; i++) {
int b = (i + 1 + (i >> 8)) >> 8;
if (!results.contains(b)) {
System.out.println(i + " -> " + b);
results.add(b);
}
}
}
Produces all possible values between -129 and 128. However, if you are working with 8-bit colours (0 - 255) then the only possible outputs are 0 (for 0 - 254) and 1 (for 255) so it is likely that it is attempting the function #kaykay posted.

Related

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash numbers 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */
if (a < 255) return a;
if (a < (2 * 255)) return a - 255;
return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.
The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
I would just do this:
uchar fast_mod255( uint a32 ) {
ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
uchar a8 = (a16 >> 8) + (a16 & 0xFF); /* sum base 2**8 digits */
return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient in computing the distances and dot products, even in 4 dimensions. And it is a valid way of hashing as well. Dsicarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need that values that fall in the 255 zone have a scattered pattern instead of 1?
I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then arithmetically shifting the result to the right by k digits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.

Generating Logarithmically Spaced Values on an Operation Limited Microcontroller

I've recently come across a problem where, using a cheap 16 bit uC (MSP430 series), I've had to generate a logarithmically spaced output value based on the 10 bit ADC read. The reason for this is that I require fine grain control at the low end of the integer space, while, at the same time, requiring the use of the larger values, though at less precision, (to me, the difference between 2^15 and 2^16 in my feedback loop is of little consequence). I've never done this before and I had no luck finding examples online, so I came up with a little scheme to do this on my operation-limited uC.
With my method here, the ADC result is linearly interpolated between the two closest integer powers-of-two via only integer multiplication/addition/summation and bitwise shifting, (outlined below).
My question is, is there a better, (faster/less operations), way than this to generate a smooth, (or smooth-ish), set of data logarithmically spaced over the integer resolution? I haven't found anything online, hence my attempt at coming up with something from scratch in the first place.
N is the logarithmic resolution of the micro controller, (here assumed to be 16 bit). M is the integer resolution of the ADC, (here assumed to be 10 bit). ADC_READ is the value read by the ADC at a given time. On a uC that supports floating point operations, doing this is trivial:
x = N / M #16/1024
y = (float) ADC_READ / M #ADC_READ/1024
result = 2 ^ ( x * y )
In all of the plots below, this is the "Ideal" set of values. The "Resultant" values are generated by variations of the following:
unsigned int returnValue( adcRead ){
unsigned int e;
unsigned int a;
unsigned int rise;
unsigned int base;
unsigned int xoffset;
unsigned int yoffset;
e = adcRead >> 6;
a = 1 << e;
rise = ( 1 << (e + 1) ) - ( 1 << e );
base = e << 6;
xoffset = adcRead - base;
yoffset = ( rise >> rise_shift ) * (xoffset >> offset_shift); //this is an operation to prevent rolling over. rise_shift + offset_shift = M/N, here = 6
result = a + yoffset;
return result;
}
The extra declarations and what not are for readability only. Assume the final product is condensed. Basically, it does as intended, with varying degrees of discretization at the low end and smoothness at the high end based on the values of rise_shift and offset_shift. Here, they are both equal to 3:
Here rise_shift = 2, offset_shift = 4
Here rise_shift = 4, offset_shift = 2
I'm interested to see if anyone has come up with or knows of anything better. Currently, I only have to run this code ~20-30 times a second, so I obviously have not encountered any delays. But, with a 16MHz clock, and using information from here, I estimate this entire operation taking at most ~110 clock cycles, or ~7us. This is on the scale the ADC read time, which is ~4us.
Thanks
EDIT: By "better" I do not necessarily just mean faster, (it's already quite fast, apparently). Immediately, one sees that the low end has fairly drastic discretization to the integer powers of two, which results from the shifting operations to prevent roll-ever. Other than a look-up table, (suggested below), the answer to how this could be improved is not immediate.
based on the 10 bit ADC read.
This ADC can output only 1024 different values (0-1023), so you can use a table of 1024 16-Bit values, which would consume 2KB Flash memory:
const uint16_t LogarithmicTable[1024] = { 0, 1, ... , 64380};
Calculating the logarithmic output is now a simple array access:
result = LogarithmicTable[ADC_READ];
You can use a tool like Excel to generate the constants in this Table for you.
It sounds like you want to compute the function 2n/64, which would map 1024 to 65536 just above the high end but maps anything up to 64 to zero (or one, depending on rounding). Other exponential functions could avoid the low-end discretization, but it's not clear whether that would help the functionality.
We can factor 2n/64 into 2floor( n/64 ) × 2(n mod 64)/64. Usually multiplying by an integer power of 2 involves a left shift, but because the other side is a fraction between one and two, we're better off doing a right shift.
uint16_t exp_table[ 64 ] = {
32768u,
pow( 2, 1./64 ) * 32768u,
pow( 2, 2./64 ) * 32768u,
...
};
uint16_t adc_exp( uint16_t linear ) {
return exp_table[ linear % 64 ] >> ( 15 - linear / 64 );
}
This loses no precision against a full, 2-kilobyte table. To save more space, use linear interpolation.

C quick calculation of next multiple of 4?

What's a fast way to round up an unsigned int to a multiple of 4?
A multiple of 4 has the two least significant bits 0, right? So I could mask them out and then do a switch statement, adding either 1,2 or 3 to the given uint.
That's not a very elegant solution..
There's also the arithmetic roundup:
myint == 0 ? 0 : ((myint+3)/4)*4
Probably there's a better way including some bit operations?
(myint + 3) & ~0x03
The addition of 3 is so that the next multiple of 4 becomes previous multiple of 4, which is produced by a modulo operation, doable by masking since the divisor is a power of 2.
I assume that what you are trying to achieve is the alignment of the input number, i.e. if the original number is already a multiple of 4, then it doesn't need to be changed. However, this is not clear from your question. Maybe you want next multiple even when the original number is already a multiple? Please, clarify.
In order to align an arbitrary non-negative number i on an arbitrary boundary n you just need to do
i = i / n * n;
But this will align it towards the negative infinity. In order to align it to the positive infinity, add n - 1 before peforming the alignment
i = (i + n - 1) / n * n;
This is already good enough for all intents and purposes. In your case it would be
i = (i + 3) / 4 * 4;
However, if you would prefer to to squeeze a few CPU clocks out of this, you might use the fact that the i / 4 * 4 can be replaced with a bit-twiddling i & ~0x3, giving you
i = (i + 3) & ~0x3;
although it wouldn't surprise me if modern compilers could figure out the latter by themselves.
If by "next multiple of 4" you mean the smallest multiple of 4 that is larger than your unsigned int value myint, then this will work:
(myint | 0x03) + 1;
(myint + 4) & 0xFFFC
If you want the next multiple of 4 strictly greater than myint, this solution will do (similar to previous posts):
(myint + 4) & ~3u
If you instead want to round up to the nearest multiple of 4 (leaving myint unchanged if it is a multiple of 4), this should work:
(0 == myint & 0x3) ? myint : ((myint + 4) & ~3u);
myint = (myint + 4) & 0xffffffc
This is assuming that by "next multiple of 4" that you are always moving upwards; i.e. 5 -> 8 and 4 -> 8.
This is branch-free, generally configurable, easy to understand (if you know about C byte strings), and it lets you avoid thinking about the bit size of myInt:
myInt += "\x00\x03\x02\x01"[myInt & 0x3];
Only downside is a possible single memory access to elsewhere (static string storage) than the stack.

implementation of rand()

I am writing some embedded code in C and need to use the rand() function. Unfortunately, rand() is not supported in the library for the controller. I need a simple implementation that is fast, but more importantly has little space overhead, that produces relatively high-quality random numbers. Does anyone know which algorithm to use or sample code?
EDIT: It's for image processing, so "relatively high quality" means decent cycle length and good uniform properties.
Check out this collection of random number generators from George Marsaglia. He's a leading expert in random number generation, so I'd be confident using anything he recommends. The generators in that list are tiny, some requiring only a couple unsigned longs as state.
Marsaglia's generators are definitely "high quality" by your standards of long period and good uniform distribution. They pass stringent statistical tests, though they wouldn't do for cryptography.
Use the C code for LFSR113 from L'écuyer:
unsigned int lfsr113_Bits (void)
{
static unsigned int z1 = 12345, z2 = 12345, z3 = 12345, z4 = 12345;
unsigned int b;
b = ((z1 << 6) ^ z1) >> 13;
z1 = ((z1 & 4294967294U) << 18) ^ b;
b = ((z2 << 2) ^ z2) >> 27;
z2 = ((z2 & 4294967288U) << 2) ^ b;
b = ((z3 << 13) ^ z3) >> 21;
z3 = ((z3 & 4294967280U) << 7) ^ b;
b = ((z4 << 3) ^ z4) >> 12;
z4 = ((z4 & 4294967168U) << 13) ^ b;
return (z1 ^ z2 ^ z3 ^ z4);
}
Very high quality and fast. Do NOT use rand() for anything.
It is worse than useless.
Here is a link to a ANSI C implementation of a few random number generators.
I've made a collection of random number generators, "simplerandom", that are compact and suitable for embedded systems. The collection is available in C and Python.
I've looked around for a bunch of simple and decent ones I could find, and put them together in a small package. They include several Marsaglia generators (KISS, MWC, SHR3), and a couple of L'Ecuyer LFSR ones.
All the generators return an unsigned 32-bit integer, and typically have a state made of 1 to 4 32-bit unsigned integers.
Interestingly, I found a few issues with the Marsaglia generators, and I've tried to fix/improve all those issues. Those issues were:
SHR3 generator (component of Marsaglia's 1999 KISS generator) was broken.
MWC low 16 bits have only an approx 229.1 period. So I made a slightly improved MWC, which gives the low 16 bits a 259.3 period, which is the overall period of this generator.
I uncovered a few issues with seeding, and tried to make robust seeding (initialisation) procedures, so they won't break if you give them a "bad" seed value.
I recommend the academic paper Two Fast Implementations of the Minimal Standard Random Number Generator by David Carta. You can find free PDF through Google. The original paper on the Minimal Standard Random Number Generator is also worth reading.
Carta's code gives fast, high-quality random numbers on 32-bit machines. For a more thorough evaluation, see the paper.
Mersenne twister
A bit from Wikipedia:
It was designed to have a period of 219937 − 1 (the creators of the algorithm proved this property). In practice, there is little reason to use a larger period, as most applications do not require 219937 unique combinations (219937 is approximately 4.3 × 106001; this is many orders of magnitude larger than the estimated number of particles in the observable universe, which is 1080).
It has a very high order of dimensional equidistribution (see linear congruential generator). This implies that there is negligible serial correlation between successive values in the output sequence.
It passes numerous tests for statistical randomness, including the Diehard tests. It passes most, but not all, of the even more stringent TestU01 Crush randomness tests.
source code for many languages available on the link.
I'd take one from the GNU C library, the source is available to browse online.
http://qa.coreboot.org/docs/libpayload/rand_8c-source.html
But if you have any concern at all about the quality of the random numbers, you should probably look at more carefully written mathematically libraries. It's a big subject and the standard rand implementations aren't highly thought of by experts.
Here's another possibility: http://www.boost.org/doc/libs/1_39_0/libs/random/index.html
(If you find you have too many options, you could always pick one at random.)
I found this: Simple Random Number Generation, by John D. Cook.
It should be easy to adapt to C, given that it's only a few lines of code.
Edit: and you could clarify what you mean by "relatively high-quality". Are you generating encryption keys for nuclear launch codes, or random numbers for a game of poker?
Better yet, use multiple linear feedback shift registers combine them together.
Assuming that sizeof(unsigned) == 4:
unsigned t1 = 0, t2 = 0;
unsigned random()
{
unsigned b;
b = t1 ^ (t1 >> 2) ^ (t1 >> 6) ^ (t1 >> 7);
t1 = (t1 >> 1) | (~b << 31);
b = (t2 << 1) ^ (t2 << 2) ^ (t1 << 3) ^ (t2 << 4);
t2 = (t2 << 1) | (~b >> 31);
return t1 ^ t2;
}
The standard solution is to use a linear feedback shift register.
There is one simple RNG named KISS, it is one random number generator according to three numbers.
/* Implementation of a 32-bit KISS generator which uses no multiply instructions */
static unsigned int x=123456789,y=234567891,z=345678912,w=456789123,c=0;
unsigned int JKISS32() {
int t;
y ^= (y<<5); y ^= (y>>7); y ^= (y<<22);
t = z+w+c; z = w; c = t < 0; w = t&2147483647;
x += 1411392427;
return x + y + w;
}
Also there is one web site to test RNG http://www.phy.duke.edu/~rgb/General/dieharder.php

I need a fast 96-bit on 64-bit specific division algorithm for a fixed-point math library

I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low nibble and x2 is its high
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // Decompose this
G*(a/b) = (a1*G) / (b1*G + b2) + (a2*G*G) / (b1*G + b2)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
((uint128_t)x << 32) / y)
Well they aren't and I need a solution. Thank you for your help.
You can decompose a larger division into multiple chunks that do division with less bits. As another poster already mentioned the algorithm can be found in TAOCP from Knuth.
However, no need to buy the book!
There is a code on the hackers delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128-bit you have to widen all types, masks and constans by two e.g. a short becomes a int, a 0xffff becomes 0xffffffffll ect.
After this easy easy change you should be able to do 128bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96-bit, One of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh - and before I forget this: The code only works with unsigned values. To convert from signed to unsigned divide you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way - the ordinary shift and subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.
Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution possible that finds the right shifts to use quicker, but I'd have to think much deeper than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow in the upshift of the x % y remainder if this remainder has any bits set in the high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.
I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x (N + N - N * D * x)
where N is the numerator and D is the demoninator.
This just uses multiplies and adds, which you already have, and it converges very quickly to about 1 ULP of precision. On the other hand, you won't be able to acheive the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.
Quick -n- dirty.
Do the A/B divide with double precision floating point.
This gives you C~=A/B. It's only approximate because of floating point precision and 53 bits of mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat , now computing D/B with floating point. Again, round the answer to an integer. Add each division result together as you go. You can stop when your remainder is so small that your floating point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, see if increasing it improves the error.. you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.

Resources