I am using a base-conversion algorithm to generate a permutation from a large integer (split into 32-bit words).
I use a relatively standard algorithm for this:
/* N = count, K = permutation index (0..N!-1), A[0..N-1] contains 0..N-1 */
i = 0;
while (N > 1) {
    swap A[i] and A[i + (K % N)]
    K = K / N
    N = N - 1
    i = i + 1
}
Unfortunately, the divide and modulo in each iteration add up, especially when moving to large integers. But it seems I could just use multiplication!
/* As before, N is count, K is index, A[0..N-1] contains 0..N-1 */
/* Split is arbitrarily 128 (bits), for my current choice of N */
/* "Adjust" is precalculated: (1 << Split) / N! */
a = K * Adjust; /* a can be treated as a fixed-point fraction */
i = 0;
while (N > 1) {
    a = a * N;
    index = a >> Split;
    a = a & ((1 << Split) - 1); /* actually, just zeroing a register */
    swap A[i] and A[i + index]
    N = N - 1
    i = i + 1
}
This is nicer, but doing large integer multiplies is still sluggish.
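For small N the whole scheme fits in a single 64-bit word. Here is a minimal C sketch of the multiply-only loop (my own illustration, with Split = 32, so it requires N! < 2^32, i.e. N <= 12; note that the scaled fraction must be rounded up, otherwise truncation error can pull a digit one too low whenever K is a multiple of a smaller factorial):

#include <stdint.h>

void decode_perm(int a[], uint32_t n, uint32_t k) /* k in 0..n!-1 */
{
    uint64_t fact = 1;
    for (uint32_t j = 2; j <= n; j++)
        fact *= j;                           /* n!, must stay below 2^32 */
    uint64_t frac = ((uint64_t)k << 32) / fact + 1; /* the "Adjust" step, rounded up */
    for (uint32_t i = 0; n > 1; i++, n--) {
        frac *= n;                           /* push the next digit into the top */
        uint32_t index = (uint32_t)(frac >> 32);    /* integer part */
        frac &= 0xFFFFFFFFull;               /* keep the fractional part */
        int t = a[i]; a[i] = a[i + index]; a[i + index] = t;
    }
}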
Question 1:
Is there a way of doing this faster?
Eg. Since I know that N*(N-1) is less than 2^32, could I pull out those numbers from one word, and merge in the 'leftovers'?
Or, is there a way to modify an arithmetic decoder to pull out the indices one at a time?
Question 2:
For the sake of curiosity: if I use multiplication to convert a number to base 10 without the adjustment, the result is multiplied by (10^digits / 2^shift). Is there a tricky way to remove this factor while working with the decimal digits? Even with the adjustment factor, this seems like it would be faster, so why don't standard libraries use this instead of divide and mod?
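To make Question 2 concrete, here is a sketch of the multiply-only base-10 extraction for a value below 10^9 (my own code, not a standard-library technique; it uses GCC's unsigned __int128 for the 64x64 product, and rounds the precomputed scale factor up so the accumulated truncation error never corrupts a digit):

#include <stdint.h>
#include <stdio.h>

void print_decimal9(uint32_t v) /* prints v as exactly 9 digits; requires v < 10^9 */
{
    const uint64_t adjust = UINT64_MAX / 1000000000u + 1; /* ~2^64/10^9, rounded up */
    uint64_t frac = (uint64_t)v * adjust;   /* v/10^9 as a 0.64 fixed-point fraction */
    for (int d = 0; d < 9; d++) {
        unsigned __int128 t = (unsigned __int128)frac * 10;
        putchar('0' + (int)(t >> 64));      /* integer part = next decimal digit */
        frac = (uint64_t)t;                 /* keep the fraction */
    }
    putchar('\n');
}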
Seeing that you are talking about numbers like 2^128/(N!), it seems that in your problem N is going to be rather small (N < 35 according to my calculations).
I suggest taking the original algorithm as a starting point; first switch the direction of the loop:
i = 2;
while (i <= N) {
    swap A[N - i] and A[N - i + k % i]
    k = k / i
    i = i + 1
}
Now change the loop to do several of these steps per iteration. I'd guess the speed of a division is about the same regardless of the divisor i, as long as i < 2^32.
Split the range 2...N into sub-ranges so that the product of the numbers in each sub-range is less than 2^32:
2, 3, 4, ..., 12: product is 479001600
13, 14, ..., 19: product is 253955520
20, 21, ..., 26: product is 3315312000
27, 28, ..., 32: product is 652458240
33, 34, 35: product is 39270
Then, divide the long number k by the products instead of dividing by i. Each iteration will yield a remainder (less than 2^32) and a smaller number k. When you have the remainder, you can work with it in an inner loop using the original algorithm; which will now be faster because it doesn't involve long division.
Here is some code:
static const int rangeCount = 5;
static const int rangeLimit[rangeCount] = {13, 20, 27, 33, 36};
static uint32_t rangeProduct[rangeCount] = {
    479001600,
    253955520,
    3315312000,
    652458240,
    39270
};

for (int rangeIndex = 0; rangeIndex < rangeCount; ++rangeIndex)
{
    // The following two lines involve long division;
    // math libraries probably calculate both quotient and remainder
    // in one function call
    uint32_t rangeRemainder = k % rangeProduct[rangeIndex];
    k /= rangeProduct[rangeIndex];
    // A range starts where the previous range ended
    int rangeStart = (rangeIndex == 0) ? 2 : rangeLimit[rangeIndex - 1];
    // Iterate over the range
    for (int i = rangeStart; i < rangeLimit[rangeIndex] && i <= n; ++i)
    {
        // The following two lines involve a 32-bit division;
        // it produces both quotient and remainder in one Pentium instruction
        int remainder = rangeRemainder % i;
        rangeRemainder /= i;
        std::swap(permutation[n - i], permutation[n - i + remainder]);
    }
}
Of course, this code can be extended to more than 128 bits.
Another optimization could involve extraction of powers of 2 from the products of ranges; this might add a slight speedup by making the ranges longer. Not sure whether this is worthwhile (maybe for large values of N, like N=1000).
I don't know about better algorithms, but the ones you use seem pretty simple, so I don't really see how you can optimize the algorithm itself.
You may use alternative approaches:
use ASM (assembler). From my experience: after spending a long time figuring out how a certain algorithm should be written in ASM, it ended up being slower than the version generated by the compiler :) Probably because the compiler also knows how to lay out the code so the CPU cache is used more efficiently, and which instructions are actually faster in which situations (this was on GCC/Linux).
use multi-processing:
make your algorithm multithreaded, and run with as many threads as there are available CPU cores (most CPUs nowadays have multiple cores or hardware threads)
make your algorithm capable of running on multiple machines on a network, and devise a way of distributing these numbers across the network, so you can use their CPU power as well.
EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to dot product of two vectors, but there is a difference. I've got two vectors: one vector of bits and one vector of floats of the same length. So I need to calculate sum:
float[0]*bit[0]+float[1]*bit[1]+..+float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip some fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
The following code is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent, but it is still slow and I don't know how to improve it further.
The bit array is implemented as an array of 64-bit unsigned ints; this allows me to use BitScanForward to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
    unsigned int nAddr = i / 64;
    unsigned int nShift = i & 63;
    unsigned __int64 v = bitarray[nAddr] >> nShift;
    unsigned long idx;
    if (!_BitScanForward64(&idx, v))
    {
        i += 64 - nShift;
        continue;
    }
    i += idx;
    fSum += floatarray[i];
    i += nSkip;
} while (i < nEnd);
The profiler shows the 3 slowest hotspots:
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But probably there is a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating a classical dot product; I haven't tried it yet, but honestly I don't believe it will be faster, given the extra memory accesses.
You have too many operations inside the loop. You also have only one loop, so many of the operations that only need to happen once per flag word (the 64-bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try to not do that too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, it is not actually an expensive operation; but we still need to get as many operations as we can out of the loops.
/* This is completely untested, but incorporates the optimizations
   that I outlined, as well as a few others.

   I process the arrays backwards, which allows for the elimination of
   comparisons of variables against other variables; those are much
   slower than comparisons of variables against 0, which is essentially
   free on many processors when you have just operated on or loaded the
   value into a register.

   Going backwards at the bit level also allows for the possibility that
   the compiler will take advantage of the fact that testing the top bit
   is the same as testing for negative, which is cheap and mostly free
   for all but the first time through the inner loop (for each time
   through the outer loop). */
double acc = 0.0;
unsigned i_end = nEnd - 1;
unsigned i_bit_end;
int i_word_end;
if (i_end == 0)
{
    return acc;
}
i_bit_end = i_end % 64;
i_word_end = i_end / 64;
do
{
    unsigned __int64 v = bitarray[i_word_end];
    unsigned i_upper = i_word_end << 6;  /* i_word_end * 64 */
    v <<= 63 - i_bit_end;  /* discard the bits above the last valid one */
    while (v)
    {
        if (v & 0x8000000000000000ULL)  /* test the top bit */
        {
            // The following code is semantically the same as
            // unsigned i = i_bit_end + (i_word_end * 64);
            unsigned i = i_bit_end | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit_end--;
    }
    i_bit_end = 63;
    i_word_end--;
} while (i_word_end >= 0);
I think you should check "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of presenting a particular problem.
I cannot see why you are incrementing the same variable (i) in two places instead of one.
Also, I think you should declare variables only once, not in every iteration.
I'm implementing a 2-byte radix sort. The concept is to use counting sort to sort on the lower 16 bits of the integers, then on the upper 16 bits, which lets me run the sort in 2 passes. The first thing I had to figure out was how to handle negatives. Since the sign bit is set on negative numbers, in hex form that makes negatives compare greater than positives. To combat this I flip the sign bit when the number is positive, mapping [0, 2^31) onto [0x80000000, 0xFFFFFFFF]; when it is negative I flip all the bits, mapping it into [0x00000000, 0x7FFFFFFF]. This site helped me with that information. To finish it off, I split the integer into either the top or the bottom 16 bits depending on the pass. The following is the code that does this.
static uint32_t position(int number, int pass) {
    int mask;
    if (number <= 0) mask = 0x80000000;
    else mask = (number >> 31) | 0x80000000;
    uint32_t out = number ^ mask;
    return pass == 0 ? out & 0xffff : (out >> 16) & 0xffff;
}
To start the actual radix sort, I needed to form a histogram of 65536 elements. The problem I ran across was that when the number of input elements was very large, it would take a while to create the histogram, so I implemented it in parallel, using processes and shared memory. I partitioned the array into subsections of size/8 elements. Then, over a shared-memory array of 65536 * 8 entries, I had each process create its own histogram. Afterwards, I summed them all together to form a single histogram. The following is the code for that:
for (i = 0; i < 8; i++) {
    pid_t pid = fork();
    if (pid < 0) _exit(0);
    if (pid == 0) {
        const int start = (i * size) >> 3;
        const int stop = i == 7 ? size : ((i + 1) * size) >> 3;
        const int curr = i << 16;
        for (j = start; j < stop; ++j)
            hist[curr + position(array[j], pass)]++;
        _exit(0);
    }
}
for (i = 0; i < 8; i++) wait(NULL);
for (i = 1; i < 8; i++) {
    const int pos = i << 16;
    for (j = 0; j < 65536; j++)
        hist[j] += hist[pos + j];
}
The next part was where I spent most of my time: analyzing how the cache affected the performance of the prefix sum. With an 8-bit or 11-bit radix, the whole histogram fits within L1 cache; with 16 bits, it only fits within L2 cache. In the end the 16-bit histogram ran the sum the fastest, since I only had to run 2 passes with it. I also ran the prefix sum in parallel, as per the CUDA website's recommendations; at 250 million elements, that ran about 1.5 seconds slower than the serial 16-bit version. So my prefix sum ended up looking like this:
for (i = 1; i < 65536; i++)
    hist[i] += hist[i - 1];
The only thing left was to traverse backwards through the array and put all the elements into their respective spots in the temp array. Since I only had to go through twice, instead of copying from temp back to array and running the code again, I ran the sort first using array as the input and temp as the output, then ran it a second time using temp as the input and array as the output. This kept me from mem-copying back to array both times. The code for the actual sort looks like this:
histogram(array, size, 0, hist);
for (i = size - 1; i >= 0; i--)
    temp[--hist[position(array[i], 0)]] = array[i];
memset(hist, 0, arrSize);
histogram(temp, size, 1, hist);
for (i = size - 1; i >= 0; i--)
    array[--hist[position(temp[i], 1)]] = temp[i];
This link contains the full code that I have so far. I ran a test against quicksort, and it ran between 5 and 10 times faster with integers and floats, and about 5 times faster with 8-byte data types. Is there a way to improve on this?
My guess would be that handling the sign of the integers during the sort is not worth it. It complicates and slows down your code. I'd go for a first sort as unsigned, and then a second pass that just reorders the two halves and inverts the one with the negatives.
Also, from your code I don't see how your processes cooperate. How do you collect the histograms in the parent? Do you have a process-shared variable? In any case, using pthreads would be much more appropriate here.
I have an array of size 4, 9, 16 or 25 (according to the input), containing each of the numbers from 0 up to size-1 exactly once (if the array size is 9, the biggest element in the array is 8).
I would like some algorithm to generate a sort of checksum for the array, so that I can check that 2 arrays are equal without looping through the whole array and comparing each element one by one.
Where can I get this sort of information? I need something that is as simple as possible. Thank you.
edit: just to be clear on what I want:
-All the numbers in the array are distinct, so [0,1,1,2] is not valid because there is a repeated element (1)
-The position of the numbers matter, so [0,1,2,3] is not the same as [3,2,1,0]
-The array will contain the number 0, so this should also be taken into consideration.
EDIT:
Okay I tried to implement the Fletcher's algorithm here:
http://en.wikipedia.org/wiki/Fletcher%27s_checksum#Straightforward
int fletcher(int array[], int size) {
    int i;
    int sum1 = 0;
    int sum2 = 0;
    for (i = 0; i < size; i++) {
        sum1 = (sum1 + array[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (sum2 << 8) | sum1;
}
To be honest I have no idea what the return line does, but unfortunately, the algorithm does not work.
For arrays [2,1,3,0] and [1,3,2,0] I get the same checksum.
EDIT2:
Okay, here's another one, the Adler checksum:
http://en.wikipedia.org/wiki/Adler-32#Example_implementation
#define MOD 65521
unsigned long adler(int array[], int size) {
    int i;
    unsigned long a = 1;
    unsigned long b = 0;
    for (i = 0; i < size; i++) {
        a = (a + array[i]) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}
This also does not work.
Arrays [2,0,3,1] and [1,3,0,2] generate the same checksum.
I'm losing hope here, any ideas?
Let's take the case of your array of 25 integers. You explain that it can contain any permutation of the unique integers 0 to 24. According to this page, there are 25! (25 factorial) possible permutations, that is 15511210043330985984000000, far more than a 32-bit integer can contain.
The conclusion is that you will have collisions, no matter how hard you try.
Now, here is a simple algorithm that accounts for position:
unsigned checksum(int array[], int size) {
    unsigned c = 0;
    for (int i = 0; i < size; i++) {
        c += array[i];
        c = c << 3 | c >> (32 - 3); // rotate a little
        c ^= 0xFFFFFFFFu; // invert just for fun
    }
    return c;
}
I think what you want is in the answers to the following thread:
Fast permutation -> number -> permutation mapping algorithms
You just take the number your permutation maps to and use that as your checksum. As there is exactly one checksum per permutation, there can't be a smaller checksum that is collision-free.
How about a weighted-sum checksum? Let's take the example [0,1,2,3]. First pick a seed and a limit: say, 7 for the seed and 10000007 for the limit.
a[4] = {0, 1, 2, 3}
limit = 10000007, seed = 7
result = 0
result = ((result + a[0]) * seed) % limit = ((0 + 0) * 7) % 10000007 = 0
result = ((result + a[1]) * seed) % limit = ((0 + 1) * 7) % 10000007 = 7
result = ((result + a[2]) * seed) % limit = ((7 + 2) * 7) % 10000007 = 63
result = ((result + a[3]) * seed) % limit = ((63 + 3) * 7) % 10000007 = 462
Your checksum is 462 for [0, 1, 2, 3].
The reference is http://www.codeabbey.com/index/wiki/checksum
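In code, the same computation might look like this (a direct transcription of the worked example above, not code from the linked page):

unsigned long weighted_checksum(const int a[], int size)
{
    const unsigned long seed = 7, limit = 10000007;
    unsigned long result = 0;
    for (int i = 0; i < size; i++)
        result = ((result + a[i]) * seed) % limit;
    return result; /* 462 for {0, 1, 2, 3} */
}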
For an array of N unique integers from 1 to N, just adding up the elements will always be N*(N+1)/2. Therefore the only difference is in the ordering. If by "checksum" you imply that you tolerate some collisions, then one way is to sum the differences between consecutive numbers. So for example, the delta checksum for {1,2,3,4} is 1+1+1=3, but the delta checksum for {4,3,2,1} is -1+-1+-1=-3.
No requirements were given for collision rates or computational complexity, but if the above doesn't suit, then I recommend a position-dependent checksum.
From what I understand, your array contains a permutation of the numbers from 0 to N-1. One checksum which will be useful is the rank of the array in its lexicographic ordering. What does that mean? Given 0, 1, 2,
you have the possible permutations:
1: 0, 1, 2
2: 0, 2, 1
3: 1, 0, 2
4: 1, 2, 0
5: 2, 0, 1
6: 2, 1, 0
The checksum will be the number in the first column, computed when you create the array. There are solutions proposed in
Find the index of a given permutation in the list of permutations in lexicographic order
which can be helpful, although it seems the best algorithm there was of quadratic complexity. To speed it up you should cache the values of the factorials beforehand.
The advantage? ZERO collision.
EDIT: Computation
The value is like the evaluation of a polynomial, where factorials are used for the monomials instead of powers. So the function is
f(x_0, ..., x_{n-1}) = x_0 * 0! + x_1 * 1! + x_2 * 2! + ... + x_{n-1} * (n-1)!
The idea is that each value selects a sub-range of the permutations, and with enough values you pinpoint a unique permutation.
Now for the implementation (like that of a polynomial):
pre-compute 0! ... (n-1)! at the beginning of the program
each time you create an array, use f(elements) to compute its checksum
compare in O(1) using this checksum
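A sketch of that ranking (identifiers are mine): the digit for position i is the count of elements to the right of i that are smaller than array[i], weighted by a cached factorial. Note the rank only fits in a uint64_t for n <= 20; the size-25 case would need a 128-bit or big-integer type.

#include <stdint.h>

uint64_t perm_rank(const int array[], int n) /* lexicographic rank, n <= 20 */
{
    uint64_t fact[21];                 /* factorials cached beforehand */
    fact[0] = 1;
    for (int i = 1; i <= 20; i++)
        fact[i] = fact[i - 1] * i;

    uint64_t rank = 0;
    for (int i = 0; i < n; i++) {
        int smaller = 0;               /* elements after i that are smaller */
        for (int j = i + 1; j < n; j++)
            if (array[j] < array[i])
                smaller++;
        rank += (uint64_t)smaller * fact[n - 1 - i];
    }
    return rank;                       /* 0 for sorted, n!-1 for reversed */
}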
I'm looking for a way to find the closest prime number. Greater or less than, it doesn't matter, simply the closest (without overflowing, preferably). As for speed, if it can compute it in approximately 50 milliseconds on a 1 GHz machine (in software, running inside Linux), I'd be ecstatic.
The largest prime gap in the range up to (2^32 - 1) is (335). There are (6542) primes less than (2^16) that can be tabulated and used to sieve successive odd values after a one-time setup. Obviously, only primes <= floor(sqrt(candidate)) need be tested for a particular candidate value.
Alternatively: The deterministic variant of the Miller-Rabin test, with SPRP bases: {2, 7, 61} is sufficient to prove primality for a 32-bit value. Due to the test's complexity (requires exponentiation, etc), I doubt it would be as fast for such small candidates.
Edit: Actually, if multiply/reduce can be kept to 32-bits in exponentiation (might need 64-bit support), the M-R test might be better. The prime gaps will typically be much smaller, making the sieve setup costs excessive. Without large lookup tables, etc., you might also get a boost from better cache locality.
Furthermore: The product of primes {2, 3, 5, 7, 11, 13, 17, 19, 23} = (223092870). Explicitly test any candidate in [2, 23]. Calculate greatest common divisor: g = gcd(u, 223092870UL). If (g != 1), the candidate is composite. If (g == 1 && u < (29 * 29)), the candidate (u > 23) is definitely prime. Otherwise, move on to the more expensive tests. A single gcd test using 32-bit arithmetic is very cheap, and according to Mertens' (?) theorem, this will detect ~ 68.4% of all odd composite numbers.
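A sketch of that prefilter (my rendering of the paragraph above; candidates in [2, 23] are assumed to be handled explicitly before this is called):

#include <stdint.h>

static uint32_t gcd32(uint32_t a, uint32_t b)
{
    while (b) { uint32_t t = a % b; a = b; b = t; }
    return a;
}

/* 1 = proven prime, 0 = proven composite, -1 = unknown (fall through to
   the sieve or the Miller-Rabin test). Assumes u > 23. */
static int prefilter(uint32_t u)
{
    const uint32_t P = 223092870UL;   /* 2*3*5*7*11*13*17*19*23 */
    uint32_t g = gcd32(u, P);
    if (g != 1) return 0;             /* shares a prime factor <= 23 */
    if (u < 29 * 29) return 1;        /* no factor <= 23 and u < 29^2 */
    return -1;                        /* needs the more expensive tests */
}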
UPDATE 2: Fixed (in a heavy-handed way) some bugs that caused wrong answers for small n. Thanks to Brett Hale for noticing! Also added some asserts to document some assumptions.
UPDATE: I coded this up and it seems plenty fast enough for your requirements (solved 1000 random instances from [2^29, 2^32-1] in <100ms, on a 2.2GHz machine -- not a rigorous test but convincing nonetheless).
It is written in C++ since that's what my sieve code (which I adapted from) was in, but the conversion to C should be straightforward. The memory usage is also (relatively) small which you can see by inspection.
You can see that because of the way the function is called, the number returned is the nearest prime that fits in 32 bits, but in fact this is the same thing since the primes around 2^32 are 4294967291 and 4294967311.
I tried to make sure there wouldn't be any bugs due to integer overflow (since we're dealing with numbers right up to UINT_MAX); hopefully I didn't make a mistake there. The code could be simplified if you wanted to use 64-bit types (or you knew your numbers would be smaller than 2^32-256) since you wouldn't have to worry about wrapping around in the loop conditions. Also this idea scales for bigger numbers as long as you're willing to compute/store the small primes up to the needed limit.
I should note also that the small-prime sieve runs quite quickly for these numbers (4-5 ms from a rough measurement), so if you are especially memory-starved, running it every time instead of storing the small primes is doable (you'd probably want to make the mark[] arrays more space-efficient in that case).
#include <iostream>
#include <cmath>
#include <climits>
#include <cassert>
using namespace std;

typedef unsigned int UI;
const UI MAX_SM_PRIME = 1 << 16;
const UI MAX_N_SM_PRIMES = 7000;
const UI WINDOW = 256;

void getSMPrimes(UI primes[]) {
    UI pos = 0;
    primes[pos++] = 2;
    bool mark[MAX_SM_PRIME / 2] = {false};
    UI V_SM_LIM = UI(sqrt(MAX_SM_PRIME / 2));
    for (UI i = 0, p = 3; i < MAX_SM_PRIME / 2; ++i, p += 2)
        if (!mark[i]) {
            primes[pos++] = p;
            if (i < V_SM_LIM)
                for (UI j = p*i + p + i; j < MAX_SM_PRIME/2; j += p)
                    mark[j] = true;
        }
}

UI primeNear(UI n, UI min, UI max, const UI primes[]) {
    bool mark[2*WINDOW + 1] = {false};
    if (min == 0) mark[0] = true;
    if (min <= 1) mark[1-min] = true;
    assert(min <= n);
    assert(n <= max);
    assert(max-min <= 2*WINDOW);
    UI maxP = UI(sqrt(max));
    for (int i = 0; primes[i] <= maxP; ++i) {
        UI p = primes[i], k = min / p;
        if (k < p) k = p;
        UI mult = p*k;
        if (min <= mult)
            mark[mult-min] = true;
        while (mult <= max-p) {
            mult += p;
            mark[mult-min] = true;
        }
    }
    for (UI s = 0; (s <= n-min) || (s <= max-n); ++s)
        if ((s <= n-min) && !mark[n-s-min])
            return n-s;
        else if ((s <= max-n) && !mark[n+s-min])
            return n+s;
    return 0;
}

int main() {
    UI primes[MAX_N_SM_PRIMES];
    getSMPrimes(primes);
    UI n;
    while (cin >> n) {
        UI win_min = (n >= WINDOW) ? (n-WINDOW) : 0;
        UI win_max = (n <= UINT_MAX-WINDOW) ? (n+WINDOW) : UINT_MAX;
        if (!win_min)
            win_max = 2*WINDOW;
        else if (win_max == UINT_MAX)
            win_min = win_max-2*WINDOW;
        UI p = primeNear(n, win_min, win_max, primes);
        cout << "found nearby prime " << p << " from window "
             << win_min << ' ' << win_max << '\n';
    }
}
You can sieve intervals in that range if you know the primes up to 2^16 (there are only 6542 of them; you should go a bit higher if the prime itself could be greater than 2^32 - 1). Not necessarily the fastest way, but very simple, and fancier prime-testing techniques are really suited to much larger ranges.
Basically, do a regular Sieve of Eratosthenes to get the "small" primes (say the first 7000). Obviously you only need to do this once at the start of the program, but it should be very fast.
Then, supposing your "target" number is 'a', consider the interval [a-n/2, a+n/2) for some value of n. Probably n = 128 is a reasonable place to start; you may need to try adjacent intervals if the numbers in the first one are all composite.
For every "small" prime p, cross out its multiples in the range, using division to find where to start. One optimization is that you only need to start crossing off multiples starting at p*p (which means that you can stop considering primes once p*p is above the interval).
Most of the primes except the first few will have either one or zero multiples inside the interval; to take advantage of this you can pre-ignore multiples of the first few primes. The simplest thing is to ignore all even numbers, but it's not uncommon to ignore multiples of 2, 3, and 5; this leaves integers congruent to 1, 7, 11, 13, 17, 19, 23, and 29 mod 30 (there are eight, which map nicely to the bits of a byte when sieving a large range).
...Sort of went off on a tangent there; anyway, once you've processed all the small primes (up until p*p > a + n/2), you just look in the interval for numbers you didn't cross out; since you want the closest to a, start looking there and search outward in both directions.
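The crossing-out step above, as a compact sketch (identifiers mine; assumes lo is larger than the largest small prime, so 0 and 1 never appear in the window):

#include <stdint.h>
#include <string.h>

void sieve_interval(uint64_t lo, uint32_t W, const uint32_t small_primes[],
                    int n_primes, uint8_t composite[/* W */])
{
    memset(composite, 0, W);
    for (int i = 0; i < n_primes; i++) {
        uint64_t p = small_primes[i];
        if (p * p >= lo + W) break;            /* nothing left to mark */
        uint64_t start = (lo + p - 1) / p * p; /* first multiple of p >= lo */
        if (start < p * p) start = p * p;      /* smaller multiples already
                                                  have a smaller prime factor */
        for (uint64_t m = start; m < lo + W; m += p)
            composite[m - lo] = 1;             /* survivors are the primes */
    }
}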
Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).
Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)
Overview
This function will be called at least 10^13 times, and possibly as often as 10^15. That is, this code will run for months in all likelihood, so any performance tips would be helpful.
This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).
At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.
Key Observation
If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.
This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).
Performance Information
smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs is the 1-based position of the least significant binary 1, star is (n-1) >> vals(n-1), and pr is an array of the primes from 2 to 313.
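To pin those descriptions down, reference versions might look like this (my sketches, not the asker's optimized code; ulong is assumed to be a 64-bit unsigned type as in the question):

static inline ulong ugcd(ulong a, ulong b)  /* greatest common divisor */
{
    while (b) { ulong t = a % b; a = b; b = t; }
    return a;
}
static inline ulong minuu(ulong a, ulong b) { return a < b ? a : b; }
static inline ulong vals(ulong n)  { return (ulong)__builtin_ctzll(n); } /* trailing 0s */
static inline ulong star(ulong n)  { return (n - 1) >> vals(n - 1); }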
The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
I would be happy to answer any questions you have about the function or its constituents.
What it actually does
Added in response to a request. You don't need to read this part.
The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:
smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc. up to the most significant bit which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that number. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.
Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime is counted using a mathematical formula best explained by the code.
Improvements so far
Pushed the early exit test up. This clearly saves work so I made the change.
The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute__ ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible across over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.
Code
ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
    if (!smallprimes & !q)
        return 0;
    ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
    ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
    ulong nn = star(n);
    ulong prod = 1;
    while (smallprimes) {
        ulong bit = smallprimes & (-smallprimes);
        ulong p = pr[__builtin_ffsll(bit)];
        nu = minuu(nu, vals(p - 1));
        prod *= ugcd(nn, star(p));
        n /= p;
        while (n % p == 0)
            n /= p;
        smallprimes ^= bit;
    }
    if (q) {
        nu = minuu(nu, vals(q - 1));
        prod *= ugcd(nn, star(q));
        n /= q;
        while (n % q == 0)
            n /= q;
    } else {
        goto BASES_END;
    }
    if (r) {
        nu = minuu(nu, vals(r - 1));
        prod *= ugcd(nn, star(r));
        n /= r;
        while (n % r == 0)
            n /= r;
    } else {
        goto BASES_END;
    }
    if (s) {
        nu = minuu(nu, vals(s - 1));
        prod *= ugcd(nn, star(s));
        n /= s;
        while (n % s == 0)
            n /= s;
    }
BASES_END:
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        f++;
    }
    // This happens ~88% of the time in my tests, so special-case it.
    if (nu == 1)
        return prod << 1;
    ulong tmp = f * nu;
    long fac = 1 << tmp;
    fac = (fac - 1) / ((1 << f) - 1) + 1;
    return fac * prod;
}
You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of the divisor (division: ~15-80(!) cycles, depending on the divisor; multiplication: ~4 cycles), if of course you can precompute the reciprocals.
While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)
Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.
EDIT:
After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).
It seems that for every divisor d which is odd, with base B = 2^word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). This means you can simply replace a division with a multiplication, with no added corrections, checks, or rounding problems. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal defined by the previous method!
How to check whether a given x is an exact multiple of d - i.e. x mod d = 0 ? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.
So, in code:
unsigned x, d;
unsigned inv_d = mulinv(d);        //precompute this!
unsigned limit = (unsigned)-1 / d; //precompute this!
unsigned q = x*inv_d;
if (q <= limit)
{
    //x % d == 0
    //q == x/d
} else {
    //x % d != 0
    //q is garbage
}
Assuming the pr[] array becomes an array of struct prime:
struct prime {
    ulong p;
    ulong inv_p; //equal to mulinv(p)
    ulong limit; //equal to (ulong)-1 / p
};
the while(smallprimes) loop in your code becomes:
while (smallprimes) {
    ulong bit = smallprimes & (-smallprimes);
    int bit_ix = __builtin_ffsll(bit);
    ulong p = pr[bit_ix].p;
    ulong inv_p = pr[bit_ix].inv_p;
    ulong limit = pr[bit_ix].limit;
    nu = minuu(nu, vals(p - 1));
    prod *= ugcd(nn, star(p));
    n *= inv_p;
    for (;;) {
        ulong q = n * inv_p;
        if (q > limit)
            break;
        n = q;
    }
    smallprimes ^= bit;
}
And for the mulinv() function:
ulong mulinv(ulong d) //d needs to be odd
{
    ulong x = d;
    for (;;)
    {
        ulong tmp = d * x;
        if (tmp == 1)
            return x;
        x *= 2 - tmp;
    }
}
Note you can replace ulong with any other unsigned type - just use the same type consistently.
The proofs, whys and hows are all available in the book. A heartily recommended read :-).
If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:
ulong star(ulong n) __attribute__ ((const));
This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.
Is there a reason why you've opencoded vals() instead of using __builtin_ctz() ?
It is still somewhat unclear what you are searching for. Quite frequently, number-theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfy.
If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768, which you mention), then it might be possible to use the facts that the number of non-witnesses cannot be larger than phi(n)/4, and that the integers which have this many non-witnesses are either the product of two primes of the form
(k+1)*(2k+1)
or Carmichael numbers with 3 prime factors.
I'd think that above some limit all integers in the sequence have this form, and that it is possible to verify this by proving an upper bound for the witnesses of all other integers.
E.g. integers with 4 or more factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from your formula for the number of bases for other integers.
As for micro-optimizations: whenever you know that an integer is divisible by some divisor q, you can replace the division by a multiplication with the inverse of q modulo 2^64. And the test n % q == 0 can be replaced by the test
n * inverse_q < max_q,
where inverse_q = q^(-1) mod 2^64 and max_q = 2^64 / q.
Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.
Small optimization but:
ulong f;
ulong nn;
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong prod = 1;

if (!smallprimes & !q)
    return 0;

// no need to do these operations earlier, because of the previous return
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);
BTW: you should edit your post to add the definitions of star() and the other functions you use.
Try replacing this pattern (for r and q too):
n /= p;
while (n % p == 0)
    n /= p;
With this:
ulong m;
...
m = n / p;
do {
    n = m;
    m = n / p;
} while (m * p == n);
In my limited tests, I got a small speedup (10%) from eliminating the modulo.
Also, if p, q or r were constant, the compiler would replace the divisions by multiplications. If there are few choices for p, q or r, or if certain values are more frequent, you might gain something by specializing the function for those values.
Have you tried using profile-guided optimisation?
Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).
Then re-compile and link it with the -fprofile-use option instead.
1) I would make the compiler spit out the assembly it generates and try to deduce whether what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that the functions you hope it will inline (like star and vals) are really inlined. (You might need to add pragmas, or even turn them into macros.)
2) It's great that you try this on a multicore machine, but this loop is single-threaded. I'm guessing there is an umbrella function which splits the load across a few threads so that more cores are used?
3) It's difficult to suggest speedups if what the actual function tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm. So a few comments might help ;^)
4) If you really want a speedup of 10x or more, check out CUDA or OpenCL, which allow you to run C programs on your graphics hardware. It shines with functions like these!
5) You are doing loads of modulo and divide operations right after each other. In C these are 2 separate operations (first '/' and then '%'). However, in assembly this is 1 instruction, 'DIV' or 'IDIV', which returns both the remainder and the quotient in one go:
B.4.75 IDIV: Signed Integer Divide
IDIV r/m8 ; F6 /7 [8086]
IDIV r/m16 ; o16 F7 /7 [8086]
IDIV r/m32 ; o32 F7 /7 [386]
IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:
For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.
For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.
For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.
So it would require some inline assembly, but I'm guessing there'll be a significant speedup, as there are a few places in your code which can benefit from this.
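A hedged aside (my code, not the answerer's): in practice, current compilers already perform this fusion. When the quotient and remainder of the same operands appear next to each other, GCC, Clang and MSVC emit a single DIV/IDIV, and the C library's div()/ldiv() express the pairing explicitly, so inline assembly is rarely needed:

#include <stdlib.h>

void divmod(long n, long d, long *quot, long *rem)
{
    *quot = n / d;   /* one IDIV computes both... */
    *rem  = n % d;   /* ...this reuses its remainder output */
}

void divmod2(long n, long d, long *quot, long *rem)
{
    ldiv_t qr = ldiv(n, d);  /* library equivalent: quotient and remainder in one call */
    *quot = qr.quot;
    *rem  = qr.rem;
}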
Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.
Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.
Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like
BASES_END:
    ulong addToF = 0;
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        addToF = 1;
    }
    // ... early out if nu == 1...
    // ... compute f ...
    f += addToF;
Hope that helps.
First, some nitpicking ;-) You should be more careful about the types that you are using. In some places you seem to assume that ulong is 64 bits wide; use uint64_t there. And for all the other types, rethink carefully what you expect of them and use the appropriate type.
The optimization that I can see is in the integer division. Your code does that a lot, and it is probably the most expensive thing you are doing. Division of small integers (uint32_t) may be much more efficient than division of big ones. In particular, for uint32_t there is an assembler instruction, divl, that does division and modulo in one go.
If you use the appropriate types, your compiler might do all of that for you. But you'd better check the assembler (option -S to gcc), as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like this in some code of mine:
register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
    : "=a" (a), "=d" (ret)
    : "0" (a), "1" (ret), "rm" (divisor));
As you can see this uses special registers eax and edx and stuff like that...
Did you try a table-lookup version of the first while loop? You could divide smallprimes into 4 16-bit values, look up their contributions and merge them. But maybe you need the side effects.
Did you try passing in an array of primes instead of splitting them into smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes into a smallprimes bitmap, and inside this function you effectively convert the bitmap back to an array of primes. In addition, you seem to do identical processing for the elements of smallprimes, q, r, and s. It should save you a small amount of processing per call.
Also, you seem to know that the passed-in primes divide n. Do you know enough, on the outside, about the power of each prime that divides n? You could save a lot of time if you could eliminate the modulo operations by passing that information in to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and you know more about these e_i's and n_leftover, passing them in would mean a lot of things you don't have to do in this function.
There may be a way to discover n_leftover (the unfactored part of n) with fewer modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to repeatedly remove the known factors from n until you get rid of all of them. Let me give some almost-C code:
factors = p_0*p_1*...*p_k*q*r*s;
n_leftover = n/factors;
do {
    factors = gcd(n_leftover, factors);
    n_leftover = n_leftover/factors;
} while (factors != 1);
I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions in the other answers, but I think it is worth a try. I feel it will be a win, especially for numbers with a high number of small prime factors.
You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.
BTW, I've got some really fast code for finding the factors you're using without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. Can find it and post if you think it may help.
edit had to recreate the code here:
#include <stdio.h>
#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)

typedef struct {
    int p;
    int next;
} p_type;

p_type primes[SIZE];
int sieve[SIZE];

void init_sieve()
{
    int i, n;
    int count = 1;
    primes[1].p = 3;
    sieve[1] = 1;
    for (n = 5; SIZE > n; n += 2)
    {
        int flag = 0;
        for (i = 1; count >= i; i++)
        {
            if ((n % primes[i].p) == 0)
            {
                flag = 1;
                break;
            }
        }
        if (flag == 0)
        {
            count++;
            primes[count].p = n;
            sieve[n >> 1] = count;
        }
    }
}
int main()
{
    int ptr, n;
    init_sieve();
    printf("init_done\n");
    // factor odd numbers starting with 3
    for (n = 1; 1000000000 > n; n++)
    {
        ptr = sieve[n & MASK];
        if (ptr == 0) //prime
        {
            // printf("%d is prime", n*2+1);
        }
        else //composite
        {
            // printf("%d has divisors:", n*2+1);
            while (ptr != 0)
            {
                // printf("%d ", primes[ptr].p);
                sieve[n & MASK] = primes[ptr].next;
                //move the prime to the next number it divides
                primes[ptr].next = sieve[(n + primes[ptr].p) & MASK];
                sieve[(n + primes[ptr].p) & MASK] = ptr;
                ptr = sieve[n & MASK];
            }
        }
        // printf("\n");
    }
    return 0;
}
The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.
The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).
Your algorithm could be merged into this to make use of the factors as they are identified rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for the prime numbers outside the factor base).
Hope it helps.
BTW, this is the first time I've posted this algorithm publicly.
Just a thought, but maybe using your compiler's optimization options would help, if you haven't already. Another thought: if money isn't an issue, you could use the Intel C/C++ compiler, assuming you're using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) have similar compilers.
If you are going to exit immediately on (!smallprimes&!q) why not do that test before even calling the function, and save the function call overhead?
Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop.
bases1(s,n,q), bases2(s,n,q,r), and bases3(s,n,q,r,s).
It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:
if (!(smallprimes|q)) { r = 0; }
else if (s) { r = bases3(s,n,q,r,s); }
else if (r) { r = bases2(s,n,q,r); }
else { r = bases1(s,n,q); }
This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.
If the divisions you're using are by numbers that aren't known at compile time, but are used frequently at runtime (dividing by the same number many times), then I would suggest using the libdivide library, which basically implements at runtime the optimisations that compilers do for compile-time constants (using shifts, masks, etc.). This can provide a huge benefit. Also, avoiding x % y == 0 in favour of something like z = x/y; z * y == x, as ergosys suggested above, should also give a measurable improvement.
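A sketch of how libdivide is typically used (per its C interface; treat the details as an assumption and check the library's documentation):

#include <stdint.h>
#include "libdivide.h"

uint64_t sum_of_quotients(const uint64_t xs[], int n, uint64_t divisor)
{
    struct libdivide_u64_t d = libdivide_u64_gen(divisor); /* one-time setup */
    uint64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += libdivide_u64_do(xs[i], &d); /* xs[i] / divisor, no div instruction */
    return sum;
}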
Is the code in your top post the optimized version? If yes, there are still too many divide operations, which eat up CPU cycles.
This code does a bit of unnecessary work:
if (!smallprimes & !q)
    return 0;
Change the bitwise & to the logical &&:
if (!smallprimes && !q)
    return 0;
This will short-circuit, returning faster without evaluating q.
And the following code
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
which is used to find the lowest set bit of smallprimes. Why don't you use the simpler way (keeping the same 1-based indexing that ffsll produces):
ulong p = pr[__builtin_ctzll(smallprimes) + 1];
Another culprit for decreased performance may be too much branching. You could consider switching to less-branchy or branch-free equivalents.