Using bitwise operators for division by 10 and modulus 10

Using bitwise operators for division by 10 and modulus 10 - c

I am working on a programme to calculate 2^p using the traditional school arithmetic technique, multiply by 2, carry 1 if > 9.
I have written two lines of code that return n / 10 and n % 10 using bitwise operators. These are accurate for 0 - 19 inclusive, which is all that is needed when the multiplier is 2: the largest number is (2 * 9) + 1. However this method is inaccurate from 20 onwards and not needed. These techniques speed up the programme.
Because I would like to be certain I am using C correctly, is this technique good practice?
#include <stdio.h>
int main(void) {
unsigned int d, m;
/* division by 10 */
for (int i=0; i<20; ++i) {
d = (i + 6) >> 4;
printf("%u ", d);
}
printf("\n");
/* modulus 10 */
for (int i=0; i<20; ++i) {
m = i - (((i + 6) >> 4) * 10);
printf("%u ", m);
}
printf("\n");
return 0;
}

This is not good practice.
If you compile code that divides or modulos by constants, and enable optimizations, the compiler will do this for you when it's possible and equivalent (on checking, gcc uses a somewhat more complicated set of imul, shifts and a subtraction rather than an integer division instruction; more expensive than what you wrote, but still much cheaper than integer division). And your code won't be an unreadable mess, that breaks when passed values even slightly outside its design parameters.
There are extremely rare cases where you might do something like this, if:
Profiling has shown that the code in question is the bottleneck preventing your code from achieving adequate performance, and tweaking optimization levels is inadequate to fix it, and
You comment the hell out of what you're doing, including the restrictions on inputs and the rationale for doing this
But outside of that case, you never want to write unreadable, unmaintainable, brittle code just to shave a few cycles.

Related

C - portable fast dot product between a float matrix and a sparse boolean value matrix

I am working on a spiking neural network project in C where spikes are boolean values. Right now I have built a custom bit matrix type to represent the spike matrixes.
I frequently need the dot product of the bit matrix and a matrix of single precision floats of the same size, so I was wondering how I should speed things up?
I also need to do pointwise multiplication of the float matrix and the bit matrix later.
My plan right now was just to loop through with and if statement and bitshift. I want to speed this up.
float current = 0;
for (int i = 0; i < n_elem; i++, bit_vec >>= 1) {
if (bit_vec & 1)
current += weights[i];
}
I don't necessarily need to use a bit vector, it could be represented in other ways too. I have seen other answers here, but they are hardware specific and I am looking for something that can be portable.
I am not using any BLAS functions either, mostly because I am never operating on two floats. Should I be?
Thanks.

The bit_vec >>= 1 and current += weights[i] instruction cause a loop carried dependency that will certainly prevent the compiler to generate a fast implementation (and also prevent the processor to execute it efficiently).
You can solve this by unrolling the loop. Additionally, most mainstream compilers are not smart enough so to optimize out the condition en use a blend instruction available on most architecture. Conditional branches are slow, especially when they cannot be easily predicted (which is certainly you case). You can use a multiplication so to help the compiler generating better instructions. Here is the result:
const unsigned int blockSize = 4;
float current[blockSize] = {0.f};
int i;
for (i = 0; i < n_elem-blockSize+1; i+=blockSize, bit_vec >>= blockSize)
for(int j = 0 ; j < blockSize ; ++j)
current[j] += weights[i] * (bit_vec >> j);
for (; i < n_elem; ++i, bit_vec >>= 1)
if (bit_vec & 1)
current[0] += weights[i];
float sum = 0.f;
for(int j = 0 ; j < blockSize ; ++j)
sum += current[j];
This code should be faster assuming n_elem is relatively big. It should still be far from being efficient since compilers like GCC and Clang fail to auto-vectorize it. This is sad since it would be several time faster with SIMD instructions (like SSE, AVX, Neon, etc.). That being said, this is exactly why people use non-portable code: to manually use efficient instruction since compiler often fail to do that in non-trivial cases.

How to sum large numbers?

I am trying to calculate 1 + 1 * 2 + 1 * 2 * 3 + 1 * 2 * 3 * 4 + ... + 1 * 2 * ... * n where n is the user input.
It works for values of n up to 12. I want to calculate the sum for n = 13, n = 14 and n = 15. How do I do that in C89? As I know, I can use unsigned long long int only in C99 or C11.
Input 13, result 2455009817, expected 6749977113
Input 14, result 3733955097, expected 93928268313
Input 15, result 1443297817, expected 1401602636313
My code:
#include <stdio.h>
#include <stdlib.h>
int main()
{
unsigned long int n;
unsigned long int P = 1;
int i;
unsigned long int sum = 0;
scanf("%lu", &n);
for(i = 1; i <= n; i++)
{
P *= i;
sum += P;
}
printf("%lu", sum);
return 0;
}

In practice, you want some arbitrary precision arithmetic (a.k.a. bigint or bignum) library. My recommendation is GMPlib but there are other ones.
Don't try to code your own bignum library. Efficient & clever algorithms exist, but they are unintuitive and difficult to grasp (you can find entire books devoted to that question). In addition, existing libraries like GMPlib are taking advantage of specific machine instructions (e.g. ADC -add with carry) that a standard C compiler won't emit (from pure C code).
If this is a homework and you are not allowed to use external code, consider for example representing a number in base or radix 1000000000 (one billion) and code yourself the operations in a very naive way, similar to what you have learned as a kid. But be aware that more efficient algorithms exist (and that real bignum libraries are using them).
A number could be represented in base 1000000000 by having an array of unsigned, each being a "digit" of base 1000000000. So you need to manage arrays (probably heap allocated, using malloc) and their length.

You could use a double, especially if your platform uses IEEE754.
Such a double gives you 53 bits of precision, which means integers are exact up to the 53rd power of 2. That's good enough for this case.
If your platform doesn't use IEEE754 then consult the documentation on the floating point scheme adopted. It might be adequate.

A simple approach when you're just over the limit of MaxInt, is to do the computations modulo 10^n for a suitable n and you do the same computation as floating point computation but where you divide everything by 10^r.The former result will give you the first n digits while the latter result will give you the last digits of the answer with the first r digits removed. Then the last few digits here will be inaccurate due to roundoff errors, so you should choose r a bit smaller than n. In this case taking n = 9 and r = 5 will work well.

Why does rand() + rand() produce negative numbers?

I observed that rand() library function when it is called just once within a loop, it almost always produces positive numbers.
for (i = 0; i < 100; i++) {
printf("%d\n", rand());
}
But when I add two rand() calls, the numbers generated now have more negative numbers.
for (i = 0; i < 100; i++) {
printf("%d = %d\n", rand(), (rand() + rand()));
}
Can someone explain why I am seeing negative numbers in the second case?
PS: I initialize the seed before the loop as srand(time(NULL)).

rand() is defined to return an integer between 0 and RAND_MAX.
rand() + rand()
could overflow. What you observe is likely a result of undefined behaviour caused by integer overflow.

The problem is the addition. rand() returns an int value of 0...RAND_MAX. So, if you add two of them, you will get up to RAND_MAX * 2. If that exceeds INT_MAX, the result of the addition overflows the valid range an int can hold. Overflow of signed values is undefined behaviour and may lead to your keyboard talking to you in foreign tongues.
As there is no gain here in adding two random results, the simple idea is to just not do it. Alternatively you can cast each result to unsigned int before the addition if that can hold the sum. Or use a larger type. Note that long is not necessarily wider than int, the same applies to long long if int is at least 64 bits!
Conclusion: Just avoid the addition. It does not provide more "randomness". If you need more bits, you might concatenate the values sum = a + b * (RAND_MAX + 1), but that also likely requires a larger data type than int.
As your stated reason is to avoid a zero-result: That cannot be avoided by adding the results of two rand() calls, as both can be zero. Instead, you can just increment. If RAND_MAX == INT_MAX, this cannot be done in int. However, (unsigned int)rand() + 1 will do very, very likely. Likely (not definitively), because it does require UINT_MAX > INT_MAX, which is true on all implementations I'm aware of (which covers quite some embedded architectures, DSPs and all desktop, mobile and server platforms of the past 30 years).
Warning:
Although already sprinkled in comments here, please note that adding two random values does not get a uniform distribution, but a triangular distribution like rolling two dice: to get 12 (two dice) both dice have to show 6. for 11 there are already two possible variants: 6 + 5 or 5 + 6, etc.
So, the addition is also bad from this aspect.
Also note that the results rand() generates are not independent of each other, as they are generated by a pseudorandom number generator. Note also that the standard does not specify the quality or uniform distribution of the calculated values.

This is an answer to a clarification of the question made in comment to this answer,
the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
The problem was to avoid 0. There are (at least) two problems with the proposed solution. One is, as the other answers indicate, that rand()+rand() can invoke undefined behavior. Best advice is to never invoke undefined behavior. Another issue is there's no guarantee that rand() won't produce 0 twice in a row.
The following rejects zero, avoids undefined behavior, and in the vast majority of cases will be faster than two calls to rand():
int rnum;
for (rnum = rand(); rnum == 0; rnum = rand()) {}
// or do rnum = rand(); while (rnum == 0);

Basically rand() produce numbers between 0 and RAND_MAX, and 2 RAND_MAX > INT_MAX in your case.
You can modulus with the max value of your data-type to prevent overflow. This ofcourse will disrupt the distribution of the random numbers, but rand is just a way to get quick random numbers.
#include <stdio.h>
#include <limits.h>
int main(void)
{
int i=0;
for (i=0; i<100; i++)
printf(" %d : %d \n", rand(), ((rand() % (INT_MAX/2))+(rand() % (INT_MAX/2))));
for (i=0; i<100; i++)
printf(" %d : %ld \n", rand(), ((rand() % (LONG_MAX/2))+(rand() % (LONG_MAX/2))));
return 0;
}

May be you could try rather a tricky approach by ensuring that the value returned by sum of 2 rand() never exceeds the value of RAND_MAX. A possible approach could be sum = rand()/2 + rand()/2; This would ensure that for a 16 bit compiler with RAND_MAX value of 32767 even if both rand happens to return 32767, even then (32767/2 = 16383) 16383+16383 = 32766, thus would not result in negative sum.

the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind.
A simple solution (okay, call it a "Hack") which never produces a zero result and will never overflow is:
x=(rand()/2)+1 // using divide -or-
x=(rand()>>1)+1 // using shift which may be faster
// compiler optimization may use shift in both cases
This will limit your maximum value, but if you don't care about that, then this should work fine for you.

To avoid 0, try this:
int rnumb = rand()%(INT_MAX-1)+1;
You need to include limits.h.

thx. the reason i was adding was to avoid '0' as the random number in my code. rand()+rand() was the quick dirty solution which readily came to my mind
It sounds like an XY problem to me, in which in order to not get a 0 from rand(), you call rand() two times, doing the program slower, with a new setback and the possibility of getting a 0 is still there.
Another solution is using uniform_int_distribution, which creates a random and uniformly distributed number in the defined interval:
https://wandbox.org/permlink/QKIHG4ghwJf1b7ZN
#include <random>
#include <array>
#include <iostream>
int main()
{
const int MAX_VALUE=50;
const int MIN_VALUE=1;
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> distrib(MIN_VALUE, MAX_VALUE);
std::array<int,MAX_VALUE-MIN_VALUE> weight={0};
for(int i=0; i<50000; i++) {
weight[distrib(gen)-MIN_VALUE]++;
}
for(int i=0;i<(int)weight.size();i++) {
std::cout << "value: " << MIN_VALUE+i << " times: " << weight[i] << std::endl;
}
}

While what everyone else has said about the likely overflow could very well be the cause of the negative, even when you use unsigned integers. The real problem is actually using time/date functionality as the seed. If you have truly become familiar with this functionality you will know exactly why I say this. As what it really does is give a distance (elapsed time) since a given date/time. While the use of the date/time functionality as the seed to a rand(), is a very common practice, it really is not the best option. You should search better alternatives, as there are many theories on the topic and I could not possibly go into all of them. You add into this equation the possibility of overflow and this approach was doomed from the beginning.
Those that posted the rand()+1 are using the solution that most use in order to guarantee that they do not get a negative number. But, that approach is really not the best way either.
The best thing you can do is take the extra time to write and use proper exception handling, and only add to the rand() number if and/or when you end up with a zero result. And, to deal with negative numbers properly. The rand() functionality is not perfect, and therefore needs to be used in conjunction with exception handling to ensure that you end up with the desired result.
Taking the extra time and effort to investigate, study, and properly implement the rand() functionality is well worth the time and effort. Just my two cents. Good luck in your endeavors...

Exponent not working properly in C

When i run the following code
/*Program to find the greatest common divisor of two nonnegative integer
values*/
#include <stdio.h>
int main(void){
printf(" n | n^2\n");
printf("-----------------\n");
for(int n = 1; n<11; n++){
int nSquared = n^2;
printf("%i %i\n",n,nSquared);
}
}
The table that gets returned to the terminal displays as follows
n | n^2
-----------------
1 3
2 0
3 1
4 6
5 7
6 4
7 5
8 10
9 11
10 8
why does the "n^2" side generate the wrong numbers? And is there a way to write superscripts and subscripts in C, so I do not have to display "n^2" and can display that side of the column as "n²" instead?

Use pow function from math.h.
^ is the bitwise exclusive OR operator and has to nothing to do with a power function.

The ^ is the XOR operation. You'd either want to use the math.h function "pow", or write your own.

^ is the bitwise xor operator. You should use the pow function declared in the math.h header.
#include <stdio.h>
#include <math.h>
int main(void) {
printf(" n | n^2\n");
printf("-----------------\n");
for(int n = 1; n < 11; n++){
int nSquared = pow(n, 2); // calculate n raised to 2
printf("%i %i\n", n, nSquared);
}
return 0;
}
Include the math library by the flag -lm for gcc compilation.

As others have pointed out, the problem is that ^ is the bitwise xor operator. C has no exponentiation operator.
You're being advised to use the pow() function to compute the square of an int value.
That's likely to work (if you're careful), but it's not the best approach. The pow function takes two double arguments and returns a double result. (There are powf and powl functions that operator on float and long double, respectively.) That means that pow has to be able to handle arbitrary floating-point exponents; for example, pow(2.0, 1.0/3.0) will give you an approximation of the cube root of two.
Like many floating-point operations, pow is subject to the possibility of rounding errors. It's possible that pow(3.0, 2.0) will yield a result that's just slightly less than 9.0; converting that to int will give you 8 rather than 9. And even if you manage to avoid that problem, converting from integer to floating-point, performing an expensive operation, and then converting back to integer is massive overkill. (The implementation might optimize calls to pow with integer exponents, but I wouldn't count on that.)
It's been said (with slight exaggeration) that premature optimization is the root of all evil, and the time spent doing the extra computations is not likely to be noticeable. But in this case there's a way to do what you want that's both simpler and more efficient. Rather than
int nSquared = n^2;
which is incorrect, or
int nSquared = pow(n, 2);
which is inefficient and possibly unreliable, just write:
int nSquared = n * n;

Optimize me! (C, performance) -- followup to bit-twiddling question

Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).
Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)
Overview
This function will be called at least 1013 times, and possibly as often as 1015. That is, this code will run for months in all likelihood, so any performance tips would be helpful.
This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).
At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.
Key Observation
If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.
This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).
Performance Information
smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs is the location of the leftmost binary 1, star is (n-1) >> vals(n-1), pr is an array of the primes from 2 to 313.
The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
I would be happy to answer any questions you have about the function or its constituents.
What it actually does
Added in response to a request. You don't need to read this part.
The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:
smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc. up to the most significant bit which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that number. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.
Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime are counted using a mathematical formula best explained by the code.
Improvements so far
Pushed the early exit test up. This clearly saves work so I made the change.
The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible with over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.
Code
ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
if (!smallprimes & !q)
return 0;
ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong nn = star(n);
ulong prod = 1;
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n /= p;
while (n % p == 0)
n /= p;
smallprimes ^= bit;
}
if (q) {
nu = minuu(nu, vals(q - 1));
prod *= ugcd(nn, star(q));
n /= q;
while (n % q == 0)
n /= q;
} else {
goto BASES_END;
}
if (r) {
nu = minuu(nu, vals(r - 1));
prod *= ugcd(nn, star(r));
n /= r;
while (n % r == 0)
n /= r;
} else {
goto BASES_END;
}
if (s) {
nu = minuu(nu, vals(s - 1));
prod *= ugcd(nn, star(s));
n /= s;
while (n % s == 0)
n /= s;
}
BASES_END:
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
f++;
}
// This happens ~88% of the time in my tests, so special-case it.
if (nu == 1)
return prod << 1;
ulong tmp = f * nu;
long fac = 1 << tmp;
fac = (fac - 1) / ((1 << f) - 1) + 1;
return fac * prod;
}

You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of divisor (division: ~15-80(!) cycles, depending on the divisor, multiplication: ~4 cycles), IF of course you can precompute the reciprocals.
While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)
Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.
EDIT:
After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).
It seems that for every divisor d which is odd and base B = 2word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions: d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). Which means you can simply replace a division with a multiplication, no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!
How to check whether a given x is an exact multiple of d - i.e. x mod d = 0 ? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.
So, in code:
unsigned x, d;
unsigned inv_d = mulinv(d); //precompute this!
unsigned limit = (unsigned)-1 / d; //precompute this!
unsigned q = x*inv_d;
if(q <= limit)
{
//x % d == 0
//q == x/d
} else {
//x % d != 0
//q is garbage
}
Assuming the pr[] array becomes an array of struct prime:
struct prime {
ulong p;
ulong inv_p; //equal to mulinv(p)
ulong limit; //equal to (ulong)-1 / p
}
the while(smallprimes) loop in your code becomes:
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
int bit_ix = __builtin_ffsll(bit);
ulong p = pr[bit_ix].p;
ulong inv_p = pr[bit_ix].inv_p;
ulong limit = pr[bit_ix].limit;
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n *= inv_p;
for(;;) {
ulong q = n * inv_p;
if (q > limit)
break;
n = q;
}
smallprimes ^= bit;
}
And for the mulinv() function:
ulong mulinv(ulong d) //d needs to be odd
{
ulong x = d;
for(;;)
{
ulong tmp = d * x;
if(tmp == 1)
return x;
x *= 2 - tmp;
}
}
Note you can replace ulong with any other unsigned type - just use the same type consistently.
The proofs, whys and hows are all available in the book. A heartily recommended read :-).

If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:
ulong star(ulong n) __attribute__ ((const));
This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.
Is there a reason why you've opencoded vals() instead of using __builtin_ctz() ?

It is still somewhat unclear, what you are searching for. Quite frequently number theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfiy.
If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768 that you mention) then it might be possible to use that the number of non-witnesses cannot be larger than phi(n)/4 and that the integers for which have this many non-witnesses are either are the product of two primes of the form
(k+1)*(2k+1)
or they are Carmichael numbers with 3 prime factors.
I'd think above some limit all integers in the sequence have this form and that it is possible to verify this by proving an upper bound for the witnesses of all other integers.
E.g. integers with 4 or more factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from you formula for the number of bases for other integers.
As for micro-optimizations: Whenever you know that an integer is divisible by some quotient, then it is possible to replace the division by a multiplication with the inverse of the quotient modulo 2^64. And the tests n % q == 0 can be replaced by a test
n * inverse_q < max_q,
where inverse_q = q^(-1) mod 2^64 and max_q = 2^64 / q.
Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.

Small optimization but:
ulong f;
ulong nn;
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong prod = 1;
if (!smallprimes & !q)
return 0;
// no need to do this operations before because of the previous return
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);
BTW: you should edit your post to add star() and other functions you use definition

Try replacing this pattern (for r and q too):
n /= p;
while (n % p == 0)
n /= p;
With this:
ulong m;
...
m = n / p;
do {
n = m;
m = n / p;
} while ( m * p == n);
In my limited tests, I got a small speedup (10%) from eliminating the modulo.
Also, if p, q or r were constant, the compiler will replace the divisions by multiplications. If there are few choices for p, q or r, or if certain ones are more frequent, you might gain something by specializing the function for those values.

Have you tried using profile-guided optimisation?
Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).
Then re-compile and link it with the -fprofile-use option instead.

1) I would make the compiler spit out the assembly it generates and try and deduce if what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that functions you hope it'll inline (like star and vals) are really inlined. (You might need to add pragma's, or even turn them into macros)
2) It's great that you try this on a multicore machine, but this loop is singlethreaded. I'm guessing that there is an umbrella functions which splits the load across a few threads so that more cores are used?
3) It's difficult to suggest speed ups if what the actual function tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm. So a bit of comments might help ;^)
4) If you really want a speed up of 10* or more, check out CUDA or openCL which allows you to run C programs on your graphics hardware. It shines with functions like these!
5) You are doing loads of modulo and divides right after each other. In C this is 2 separate commands (first '/' and then '%'). However in assembly this is 1 command: 'DIV' or 'IDIV' which returns both the remainder and the quotient in one go:
B.4.75 IDIV: Signed Integer Divide
IDIV r/m8 ; F6 /7 [8086]
IDIV r/m16 ; o16 F7 /7 [8086]
IDIV r/m32 ; o32 F7 /7 [386]
IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:
For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.
For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.
For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.
So it will require some inline assembly, but I'm guessing there'll be a significant speedup as there are a few places in your code which can benefit from this.

Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.
Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.
Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like
BASES_END:
ulong addToF = 0;
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
addToF = 1;
}
// ... early out if nu == 1...
// ... compute f ...
f += addToF;
Hope that helps.

First some nitpicking ;-) you should be more careful about the types that you are using. In some places you seem to assume that ulong is 64 bit wide, use uint64_t there. And also for all other types, rethink carefully what you expect of them and use the appropriate type.
The optimization that I could see is integer division. Your code does that a lot, this is probably the most expensive thing you are doing. Division of small integers (uint32_t) maybe much more efficient than by big ones. In particular for uint32_t there is an assembler instruction that does division and modulo in one go, called divl.
If you use the appropriate types your compiler might do that all for you. But you'd better check the assembler (option -S to gcc) as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like that in some code of mine:
register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
: "=a" (a), "=d" (ret)
: "0" (a), "1" (ret), "rm" (divisor));
As you can see this uses special registers eax and edx and stuff like that...

Did you try a table lookup version of the first while loop? You could divide smallprimes in 4 16 bit values, look up their contribution and merge them. But maybe you need the side effects.

Did you try passing in an array of primes instead of splitting them in smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes to a smallprimes bitmap, and inside this function, you convert the bitmap back to an array of primes, effecively. In addition, you seem to do identical processing for elements of smallprimes, q, r, and s. It should save you a tiny amount of processing per call.
Also, you seem to know that the passed in primes divide n. Do you have enough knowledge outside about the power of each prime that divides n? You could save a lot of time if you can eliminate the modulo operation by passing in that information to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and if you know more about these e_is and n_leftover, passing them in would mean a lot of things you don't have to do in this function.
There may be a way to discover n_leftover (the unfactored part of n) with less number of modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to remove known factors from n repeatedly until you get rid of all known prime factors. Let me give some almost-c-code:
factors=p_0*p_1*...*p_k*q*r*s;
n_leftover=n/factors;
do {
factors=gcd(n_leftover, factors);
n_leftover = n_leftover/factors;
} while (factors != 1);
I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions you can find in other answers, but I think it is worth a try. I feel that it will be a win, especially for numbers with high numbers of small prime factors.

You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.
BTW, I've got some really fast code for finding the factors you're using without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. Can find it and post if you think it may help.
edit had to recreate the code here:
#include
#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)
typedef struct {
int p;
int next;
} p_type;
p_type primes[SIZE];
int sieve[SIZE];
void init_sieve()
{
int i,n;
int count = 1;
primes[1].p = 3;
sieve[1] = 1;
for (n=5;SIZE>n;n+=2)
{
int flag = 0;
for (i=1;count>=i;i++)
{
if ((n%primes[i].p) == 0)
{
flag = 1;
break;
}
}
if (flag==0)
{
count++;
primes[count].p = n;
sieve[n>>1] = count;
}
}
}
int main()
{
int ptr,n;
init_sieve();
printf("init_done\n");
// factor odd numbers starting with 3
for (n=1;1000000000>n;n++)
{
ptr = sieve[n&MASK];
if (ptr == 0) //prime
{
// printf("%d is prime",n*2+1);
}
else //composite
{
// printf ("%d has divisors:",n*2+1);
while(ptr!=0)
{
// printf ("%d ",primes[ptr].p);
sieve[n&MASK]=primes[ptr].next;
//move the prime to the next number it divides
primes[ptr].next = sieve[(n+primes[ptr].p)&MASK];
sieve[(n+primes[ptr].p)&MASK] = ptr;
ptr = sieve[n&MASK];
}
}
// printf("\n");
}
return 0;
}
The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.
The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).
Your algorithm could be merged into this to make use of the factors as they are identified rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for the prime numbers outside the factor base).
Hope it helps.
BTW, this is the first time I've posted this algorithm publicly.

just a thought but maybe using your compilers optimization options would help, if you haven't already. another thought would be that if money isn't an issue you could use the Intel C/C++ compiler, assuming your using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) would have similar compilers

If you are going to exit immediately on (!smallprimes&!q) why not do that test before even calling the function, and save the function call overhead?
Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop.
bases1(s,n,q), bases2(s,n,q,r), and bases3(s,n,q,r,s).
It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:
if (!(smallprimes|q)) { r = 0;}
else if (s) { r = bases3(s,n,q,r,s);}
else if (r) { r = bases2(s,n,q,r); }
else { r = bases1(s,n,q);
This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.

If the divisions you're using are with numbers that aren’t known at compile time, but are used frequently at runtime (dividing by the same number many times), then I would suggest using the libdivide library, which basically implements at runtime the optimisations that compilers do for compile time constants (using shifts masks etc.). This can provide a huge benefit. Also avoiding using x % y == 0 for something like z = x/y, z * y == x as ergosys suggested above should also have a measurable improvement.

Does the code on your top post is the optimized version? If yes, there is still too many divide operations which greatly eat CPU cycles.
This code is overexecute innecessarily a bit
if (!smallprimes & !q)
return 0;
change to logical and &&
if (!smallprimes && !q)
return 0;
will make it short circuited faster without eveluating q
And the following code
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
which is used to find the last set bit of smallprimes. Why don't you use the simpler way
ulong p = pr[__builtin_ctz(smallprimes)];
Another culprit for decreased performance maybe too many program branching. You may consider changing to some other less-branch or branch-less equivalents

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight