Most optimized way to calculate modulus in C - c

I have minimize cost of calculating modulus in C.
say I have a number x and n is the number which will divide x
when n == 65536 (which happens to be 2^16):
mod = x % n (11 assembly instructions as produced by GCC)
or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)
so, GCC doesn't optimize it to this extent.
In my case n is not x^(int) but is largest prime less than 2^16 which is 65521
as I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I preform when n == 65521 to calculate modulus.

First, make sure you're looking at optimized code before drawing conclusion about what GCC is producing (and make sure this particular expression really needs to be optimized). Finally - don't count instructions to draw your conclusions; it may be that an 11 instruction sequence might be expected to perform better than a shorter sequence that includes a div instruction.
Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask that any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is as opposed to dividing by an arbitrary number.
With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:
Archive of http://www.hackersdelight.org/
Archive of http://www.hackersdelight.org/magic.htm
There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.

x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).
Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.

rIf you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. I.e. if x is an unsigned int you want to reduce then you can do the following:
unsigned int low = x &0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;
This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you only are guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.
This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in its residue class.
E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently one would only do modular reductions after a carefully computed number of additions. A reduction shown as above is enough and only the computation of the hash will need a full modulo operation.

The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.

If the constant with which you want to take the modulo is known at compile time
and you have a decent compiler (e.g. gcc), tis usually best to let the compiler
work its magic. Just declare the modulo const.
If you don't know the constant at compile time, but you are going to take - say -
a billion modulos with the same number, then use this http://libdivide.com/

As an approach when we deal with powers of 2, can be considered this one (mostly C flavored):
.
.
#define THE_DIVISOR 0x8U; /* The modulo value (POWER OF 2). */
.
.
uint8 CheckIfModulo(const sint32 TheDividend)
{
uint8 RetVal = 1; /* TheDividend is not modulus THE_DIVISOR. */
if (0 == (TheDividend & (THE_DIVISOR - 1)))
{
/* code if modulo is satisfied */
RetVal = 0; /* TheDividend IS modulus THE_DIVISOR. */
}
else
{
/* code if modulo is NOT satisfied */
}
return RetVal;
}

If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely.
A loop going
x += i; if (x >= n) x -= n;
is way faster than
x = (x + i) % n;
which you unfortunately find in many text books...
If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient
x = x + (x+i < n ? i : i-n)

idiv — Integer Division
The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.
source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Related

Is there a way to know if an integer division operation had a remainder?

I don't want to know what the remainder is, I just want to know if there was a remainder as a boolean value.
As such, using the modulo operator is not what I'm looking for.
Something in C would be preferable, but any language works.
If really you cannot use the remainder operation (homework constraint ?), then you can use
check_remainder = b*(a/b) != a;
But the remainder use % is the natural way to go.
Use div(), which, hopefully, calculates quotient and remainder with 1 single operation.
#include <stdlib.h>
//...
int a = 42, b = 5;
div_t c = div(a, b);
if (c.rem) /* remainder for a/b is not zero */;
//...
You cannot tell if there is a remainder until you calculate such remainder, which means you have to perform the division.
That said, there are some shortcuts you can take but only if the divisor is a power of 2. For example, a division by 2 will have a remainder if the least significant bit is 1. In general, a division by 2 raised to the power n will have a remainder if the least n significant bits have a value other than 0000.....0 (n zeroes).
At the assembly level, some architectures, such as x86 provide you with both the quotient and the remainder as he result of a DIV instruction.
If using the modulus operator (or the division operator which is used internally) is not an option, you can do it the naive way:
unsigned has_remainder (unsigned a, unsigned b)
{
while (a >= b)
a -= b;
return a;
}

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash numbers 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */
if (a < 255) return a;
if (a < (2 * 255)) return a - 255;
return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.
The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
I would just do this:
uchar fast_mod255( uint a32 ) {
ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
uchar a8 = (a16 >> 8) + (a16 & 0xFF); /* sum base 2**8 digits */
return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient in computing the distances and dot products, even in 4 dimensions. And it is a valid way of hashing as well. Dsicarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need that values that fall in the 255 zone have a scattered pattern instead of 1?
I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then arithmetically shifting the result to the right by k digits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.

In C bits, multiply by 3 and divide by 16

A buddy of mine had these puzzles and this is one that is eluding me. Here is the problem, you are given a number and you want to return that number times 3 and divided by 16 rounding towards 0. Should be easy. The catch? You can only use the ! ~ & ^ | + << >> operators and of them only a combination of 12.
int mult(int x){
//some code here...
return y;
}
My attempt at it has been:
int hold = x + x + x;
int hold1 = 8;
hold1 = hold1 & hold;
hold1 = hold1 >> 3;
hold = hold >> 4;
hold = hold + hold1;
return hold;
But that doesn't seem to be working. I think I have a problem of losing bits but I can't seem to come up with a way of saving them. Another perspective would be nice. Just to add, you also can only use variables of type int and no loops, if statements or function calls may be used.
Right now I have the number 0xfffffff. It is supposed to return 0x2ffffff but it is returning 0x3000000.
For this question you need to worry about the lost bits before your division (obviously).
Essentially, if it is negative then you want to add 15 after you multiply by 3. A simple if statement (using your operators) should suffice.
I am not going to give you the code but a step by step would look like,
x = x*3
get the sign and store it in variable foo.
have another variable hold x + 15;
Set up an if statement so that if x is negative it uses that added 15 and if not then it uses the regular number (times 3 which we did above).
Then divide by 16 which you already showed you know how to do. Good luck!
This seems to work (as long as no overflow occurs):
((num<<2)+~num+1)>>4
Try this JavaScript code, run in console:
for (var num = -128; num <= 128; ++num) {
var a = Math.floor(num * 3 / 16);
var b = ((num<<2)+~num+1)>>4;
console.log(
"Input:", num,
"Regular math:", a,
"Bit math:", b,
"Equal: ", a===b
);
}
The Maths
When you divide a positive integer n by 16, you get a positive integer quotient k and a remainder c < 16:
(n/16) = k + (c/16).
(Or simply apply the Euclidan algorithm.) The question asks for multiplication by 3/16, so multiply by 3
(n/16) * 3 = 3k + (c/16) * 3.
The number k is an integer, so the part 3k is still a whole number. However, int arithmetic rounds down, so the second term may lose precision if you divide first, And since c < 16, you can safely multiply first without overflowing (assuming sizeof(int) >= 7). So the algorithm design can be
(3n/16) = 3k + (3c/16).
The design
The integer k is simply n/16 rounded down towards 0. So k can be found by applying a single AND operation. Two further operations will give 3k. Operation count: 3.
The remainder c can also be found using an AND operation (with the missing bits). Multiplication by 3 uses two more operations. And shifts finishes the division. Operation count: 4.
Add them together gives you the final answer.
Total operation count: 8.
Negatives
The above algorithm uses shift operations. It may not work well on negatives. However, assuming two's complement, the sign of n is stored in a sign bit. It can be removed beforing applying the algorithm and reapplied on the answer.
To find and store the sign of n, a single AND is sufficient.
To remove this sign, OR can be used.
Apply the above algorithm.
To restore the sign bit, Use a final OR operation on the algorithm output with the stored sign bit.
This brings the final operation count up to 11.
what you can do is first divide by 4 then add 3 times then again devide by 4.
3*x/16=(x/4+x/4+x/4)/4
with this logic the program can be
main()
{
int x=0xefffffff;
int y;
printf("%x",x);
y=x&(0x80000000);
y=y>>31;
x=(y&(~x+1))+(~y&(x));
x=x>>2;
x=x&(0x3fffffff);
x=x+x+x;
x=x>>2;
x=x&(0x3fffffff);
x=(y&(~x+1))+(~y&(x));
printf("\n%x %d",x,x);
}
AND with 0x3fffffff to make msb's zero. it'l even convert numbers to positive.
This uses 2's complement of negative numbers. with direct methods to divide there will be loss of bit accuracy for negative numbers. so use this work arround of converting -ve to +ve number then perform division operations.
Note that the C99 standard states in section section 6.5.7 that right shifts of signed negative integer invokes implementation-defined behavior. Under the provisions that int is comprised of 32 bits and that right shifting of signed integers maps to an arithmetic shift instruction, the following code works for all int inputs. A fully portable solution that also fulfills the requirements set out in the question may be possible, but I cannot think of one right now.
My basic idea is to split the number into high and low bits to prevent intermediate overflow. The high bits are divided by 16 first (this is an exact operation), then multiplied by three. The low bits are first multiplied by three, then divided by 16. Since arithmetic right shift rounds towards negative infinity instead of towards zero like integer division, a correction needs to be applied to the right shift for negative numbers. For a right shift by N, one needs to add 2N-1 prior to the shift if the number to be shifted is negative.
#include <stdio.h>
#include <stdlib.h>
int ref (int a)
{
long long int t = ((long long int)a * 3) / 16;
return (int)t;
}
int main (void)
{
int a, t, r, c, res;
a = 0;
do {
t = a >> 4; /* high order bits */
r = a & 0xf; /* low order bits */
c = (a >> 31) & 15; /* shift correction. Portable alternative: (a < 0) ? 15 : 0 */
res = t + t + t + ((r + r + r + c) >> 4);
if (res != ref(a)) {
printf ("!!!! error a=%08x res=%08x ref=%08x\n", a, res, ref(a));
return EXIT_FAILURE;
}
a++;
} while (a);
return EXIT_SUCCESS;
}

How can I strength reduce division by 2^n + 1?

I need to perform some integer divisions in the hot path of my code. I've already determined via profiling and cycle counting that the integer divisions are costing me. I'm hoping there's something I can do to strength reduce the divisions into something cheaper.
In this path, I am dividing by 2^n+1, where n is variable. Essentially I want to optimize this function to remove the division operator:
unsigned long compute(unsigned long a, unsigned int n)
{
return a / ((1 << n) + 1);
}
If I were dividing by 2^n, I would just replace the div with a shift-right by n. If I were dividing by a constant, I would let the compiler strength reduce that specific division, likely turning it into a multiply and some shifts.
Is there a similar optimization that applies to 2^n+1?
Edit: a here can be an arbitrary 64-bit integer. n takes only a few values between 10 and, say, 25. I can certainly precompute some values for each n, but not for a.
Since you can only shift an int so many places, you can put all those cases into a choice of one of several divisions by a constant:
unsigned long compute(unsigned long a, unsigned int n)
{
// assuming a 32-bit architecture (making this work for 64-bits
// is left as an exercise for the reader):
switch (n) {
case 0: return a / ((1 << 0) + 1);
case 1: return a / ((1 << 1) + 1);
case 2: return a / ((1 << 2) + 1);
// cases 3 through 30...
case 31: return a / ((1 << 31) + 1);
}
}
So now each division is by a constant, which compilers will typically reduce to a series of multiply/shift/add instructions (as the question mentioned). See Does a c/c++ compiler optimize constant divisions by power-of-two value into shifts? for deatils.
You can replace integer division by a constant, by multiplication (modulo wordsize) with a magic number and a shift.
The magic numbers can be pre-calculated for known constants.
Since n can't take many values e.g. 0..31 it is "easy" to pre-calculate these magic numbers for all n and store it in a table with 32 elements.
Javascript Page for calculating the magic numbers
A good compiler can compute the magic numbers and replace integer division by multiplication and shift if the divisor is constant at compile time. Depending on how the rest of the code is structured around the performance critical code you could use macro or inline tricks to unroll for all possible values of n and let the compiler do the work of finding the magic numbers (similar to the answer with the switch, but I would put more code in the constant region otherwise it might be a tradeof not worth it -- branching can cost you performance also)
Detailed description together with code for calculating the magic numbers can be fund in the Book "Hackers Delight" by Henry S. Warren, Jr. (highly recommended must have book!) pp. 180ff.
Link to Google Books for the relevant chapter:
Chapter 10-9 Unsigned Division by Divisors >= 1

I need a fast 96-bit on 64-bit specific division algorithm for a fixed-point math library

I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low nibble and x2 is its high
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // Decompose this
G*(a/b) = (a1*G) / (b1*G + b2) + (a2*G*G) / (b1*G + b2)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
((uint128_t)x << 32) / y)
Well they aren't and I need a solution. Thank you for your help.
You can decompose a larger division into multiple chunks that do division with less bits. As another poster already mentioned the algorithm can be found in TAOCP from Knuth.
However, no need to buy the book!
There is a code on the hackers delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128-bit you have to widen all types, masks and constans by two e.g. a short becomes a int, a 0xffff becomes 0xffffffffll ect.
After this easy easy change you should be able to do 128bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96-bit, One of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh - and before I forget this: The code only works with unsigned values. To convert from signed to unsigned divide you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way - the ordinary shift and subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.
Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution possible that finds the right shifts to use quicker, but I'd have to think much deeper than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow in the upshift of the x % y remainder if this remainder has any bits set in the high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.
I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x (N + N - N * D * x)
where N is the numerator and D is the demoninator.
This just uses multiplies and adds, which you already have, and it converges very quickly to about 1 ULP of precision. On the other hand, you won't be able to acheive the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.
Quick -n- dirty.
Do the A/B divide with double precision floating point.
This gives you C~=A/B. It's only approximate because of floating point precision and 53 bits of mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat , now computing D/B with floating point. Again, round the answer to an integer. Add each division result together as you go. You can stop when your remainder is so small that your floating point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, see if increasing it improves the error.. you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.

Resources