Thanks to some very helpful Stack Overflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).
Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)
Overview
This function will be called at least 10^13 times, and possibly as often as 10^15. That is, this code will run for months in all likelihood, so any performance tips would be helpful.
This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).
At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.
Key Observation
If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.
This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).
Performance Information
smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs gives the (1-based) position of the least-significant binary 1, star is (n-1) >> vals(n-1), and pr is an array of the primes from 2 to 313.
The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
I would be happy to answer any questions you have about the function or its constituents.
What it actually does
Added in response to a request. You don't need to read this part.
The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:
smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc., up to the most significant bit, which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that prime. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function, multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.
Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime is counted using a mathematical formula best explained by the code.
Improvements so far
Pushed the early exit test up. This clearly saves work so I made the change.
The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute__ ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible in over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.
Code
ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
    if (!smallprimes & !q)
        return 0;

    ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
    ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
    ulong nn = star(n);
    ulong prod = 1;

    while (smallprimes) {
        ulong bit = smallprimes & (-smallprimes);
        ulong p = pr[__builtin_ffsll(bit)];
        nu = minuu(nu, vals(p - 1));
        prod *= ugcd(nn, star(p));
        n /= p;
        while (n % p == 0)
            n /= p;
        smallprimes ^= bit;
    }
    if (q) {
        nu = minuu(nu, vals(q - 1));
        prod *= ugcd(nn, star(q));
        n /= q;
        while (n % q == 0)
            n /= q;
    } else {
        goto BASES_END;
    }
    if (r) {
        nu = minuu(nu, vals(r - 1));
        prod *= ugcd(nn, star(r));
        n /= r;
        while (n % r == 0)
            n /= r;
    } else {
        goto BASES_END;
    }
    if (s) {
        nu = minuu(nu, vals(s - 1));
        prod *= ugcd(nn, star(s));
        n /= s;
        while (n % s == 0)
            n /= s;
    }

BASES_END:
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        f++;
    }

    // This happens ~88% of the time in my tests, so special-case it.
    if (nu == 1)
        return prod << 1;

    ulong tmp = f * nu;
    long fac = 1 << tmp;
    fac = (fac - 1) / ((1 << f) - 1) + 1;
    return fac * prod;
}
You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of the divisor (division: ~15-80(!) cycles, depending on the divisor; multiplication: ~4 cycles), IF of course you can precompute the reciprocals.
While this seems unlikely to be possible with q, r, s (due to the range of those variables), it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)
Now, obtaining an exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.
EDIT:
After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).
It seems that for every divisor d which is odd and base B = 2^word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions: d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). Which means you can simply replace a division with a multiplication, no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!
How do you check whether a given x is an exact multiple of d, i.e. x mod d = 0? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.
So, in code:
unsigned x, d;
unsigned inv_d = mulinv(d); //precompute this!
unsigned limit = (unsigned)-1 / d; //precompute this!
unsigned q = x*inv_d;
if(q <= limit)
{
    //x % d == 0
    //q == x/d
} else {
    //x % d != 0
    //q is garbage
}
Assuming the pr[] array becomes an array of struct prime:
struct prime {
    ulong p;
    ulong inv_p; //equal to mulinv(p)
    ulong limit; //equal to (ulong)-1 / p
};
the while(smallprimes) loop in your code becomes:
while (smallprimes) {
    ulong bit = smallprimes & (-smallprimes);
    int bit_ix = __builtin_ffsll(bit);
    ulong p = pr[bit_ix].p;
    ulong inv_p = pr[bit_ix].inv_p;
    ulong limit = pr[bit_ix].limit;
    nu = minuu(nu, vals(p - 1));
    prod *= ugcd(nn, star(p));
    n *= inv_p;
    for (;;) {
        ulong q = n * inv_p;
        if (q > limit)
            break;
        n = q;
    }
    smallprimes ^= bit;
}
And for the mulinv() function:
ulong mulinv(ulong d) //d needs to be odd
{
    ulong x = d;
    for (;;)
    {
        ulong tmp = d * x;
        if (tmp == 1)
            return x;
        x *= 2 - tmp;
    }
}
Note you can replace ulong with any other unsigned type - just use the same type consistently.
The proofs, whys and hows are all available in the book. A heartily recommended read :-).
If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:
ulong star(ulong n) __attribute__ ((const));
This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.
Is there a reason why you've open-coded vals() instead of using __builtin_ctz()?
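For what it's worth, a minimal sketch of what that might look like, assuming vals() really is "number of trailing binary zeros", that ulong is the 64-bit unsigned type used elsewhere in the question, and using the 64-bit variant of the builtin (undefined for a zero argument, just like ctz itself):

static inline ulong vals(ulong n)
{
    return (ulong)__builtin_ctzll(n); // count of trailing zero bits
}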
It is still somewhat unclear what you are searching for. Quite frequently, number theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfy.
If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768, which you mention), then it might be possible to use the fact that the number of non-witnesses cannot be larger than phi(n)/4, and that the integers that have this many non-witnesses are either the product of two primes of the form
(k+1)*(2k+1)
or they are Carmichael numbers with 3 prime factors.
I'd think that above some limit all integers in the sequence have this form, and that it is possible to verify this by proving an upper bound for the non-witnesses of all other integers.
E.g. integers with 4 or more prime factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from your formula for the number of bases for other integers.
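To make the shape of those candidates concrete, here is a small illustrative sketch (my code, not the poster's; is_prime() is plain trial division, purely for demonstration) that enumerates n = (k+1)(2k+1) with both factors prime:

#include <stdio.h>

static int is_prime(unsigned long n)
{
    if (n < 2) return 0;
    for (unsigned long d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

int main(void)
{
    for (unsigned long k = 1; k <= 50; k++) {
        unsigned long p = k + 1, q = 2 * k + 1;
        if (is_prime(p) && is_prime(q))
            printf("k = %lu: n = %lu * %lu = %lu\n", k, p, q, p * q);
    }
    return 0;
}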
As for micro-optimizations: whenever you know that an integer is divisible by some divisor, you can replace the division by a multiplication with the inverse of that divisor modulo 2^64. And the test n % q == 0 can be replaced by a test
n * inverse_q < max_q,
where inverse_q = q^(-1) mod 2^64 and max_q = 2^64 / q.
Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.
Small optimization but:
ulong f;
ulong nn;
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong prod = 1;
if (!smallprimes & !q)
return 0;
// no need to do these operations earlier because of the early return above
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);
BTW: you should edit your post to add the definitions of star() and the other functions you use.
Try replacing this pattern (for r and q too):
n /= p;
while (n % p == 0)
n /= p;
With this:
ulong m;
...
m = n / p;
do {
    n = m;
    m = n / p;
} while (m * p == n);
In my limited tests, I got a small speedup (10%) from eliminating the modulo.
Also, if p, q or r were constant, the compiler would replace the divisions by multiplications. If there are few choices for p, q or r, or if certain ones are more frequent, you might gain something by specializing the function for those values.
Have you tried using profile-guided optimisation?
Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).
Then re-compile and link it with the -fprofile-use option instead.
1) I would make the compiler spit out the assembly it generates and try to deduce whether what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that the functions you hope it will inline (like star and vals) really are inlined. (You might need to add pragmas, or even turn them into macros.)
2) It's great that you try this on a multicore machine, but this loop is single-threaded. I'm guessing there is an umbrella function which splits the load across a few threads so that more cores are used?
3) It's difficult to suggest speedups if what the function actually tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm, so a few comments might help ;^)
4) If you really want a speedup of 10x or more, check out CUDA or OpenCL, which allow you to run C programs on your graphics hardware. They shine with functions like these!
5) You are doing loads of modulos and divides right after each other. In C these are two separate operations (first '/' and then '%'). However, in assembly this is one instruction, 'DIV' or 'IDIV', which returns both the remainder and the quotient in one go:
B.4.75 IDIV: Signed Integer Divide
IDIV r/m8 ; F6 /7 [8086]
IDIV r/m16 ; o16 F7 /7 [8086]
IDIV r/m32 ; o32 F7 /7 [386]
IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:
For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.
For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.
For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.
So it will require some inline assembly, but I'm guessing there'll be a significant speedup as there are a few places in your code which can benefit from this.
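As an aside (not part of the original answer): if inline assembly is unappealing, the standard lldiv() from <stdlib.h> expresses the same idea at C level, since it hands back quotient and remainder from one division. A sketch of the n /= p; while (n % p == 0) n /= p; pattern rewritten that way, assuming p divides n at least once as in the question:

#include <stdlib.h>

static unsigned long long strip_factor(unsigned long long n, unsigned long long p)
{
    lldiv_t qr = lldiv((long long)n, (long long)p); // quotient and remainder from one division
    while (qr.rem == 0) {
        n = (unsigned long long)qr.quot;            // p divided n, so keep the quotient
        qr = lldiv((long long)n, (long long)p);
    }
    return n;
}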
Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.
Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.
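For illustration, a hypothetical table along those lines (the names are mine; it assumes pr[] has 65 entries covering the primes 2..313, ulong as in the question, and that star() and vals() are the question's helpers):

#define NPRIMES 65 /* assumption: pr[] holds the primes from 2 to 313 */

typedef struct {
    ulong p;         /* the prime itself          */
    ulong star_p;    /* star(p), precomputed      */
    ulong vals_pm1;  /* vals(p - 1), precomputed  */
} prime_entry;

static prime_entry prtab[NPRIMES];

void init_prtab(void)
{
    for (int i = 0; i < NPRIMES; i++) {
        prtab[i].p        = pr[i];
        prtab[i].star_p   = star(pr[i]);
        prtab[i].vals_pm1 = vals(pr[i] - 1);
    }
}

Inside the loop, prtab[__builtin_ffsll(bit)].star_p and .vals_pm1 would then replace the star() and vals() calls.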
Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like
BASES_END:
    ulong addToF = 0;
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        addToF = 1;
    }
    // ... early out if nu == 1...
    // ... compute f ...
    f += addToF;
Hope that helps.
First some nitpicking ;-) You should be more careful about the types that you are using. In some places you seem to assume that ulong is 64 bits wide; use uint64_t there. And for all other types, think carefully about what you expect of them and use the appropriate type.
The optimization that I can see is in the integer division. Your code does that a lot, and it is probably the most expensive thing you are doing. Division of small integers (uint32_t) may be much more efficient than division of big ones. In particular, for uint32_t there is an assembler instruction that does division and modulo in one go, called divl.
If you use the appropriate types your compiler might do that all for you. But you'd better check the assembler (option -S to gcc) as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like that in some code of mine:
register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
: "=a" (a), "=d" (ret)
: "0" (a), "1" (ret), "rm" (divisor));
As you can see this uses special registers eax and edx and stuff like that...
Did you try a table-lookup version of the first while loop? You could divide smallprimes into four 16-bit values, look up their contributions and merge them. But maybe you need the side effects.
Did you try passing in an array of primes instead of splitting them into smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes to a smallprimes bitmap, and that inside this function you effectively convert the bitmap back to an array of primes. In addition, you seem to do identical processing for the elements of smallprimes, q, r, and s. It should save you a tiny amount of processing per call.
Also, you seem to know that the passed in primes divide n. Do you have enough knowledge outside about the power of each prime that divides n? You could save a lot of time if you can eliminate the modulo operation by passing in that information to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and if you know more about these e_is and n_leftover, passing them in would mean a lot of things you don't have to do in this function.
There may be a way to discover n_leftover (the unfactored part of n) with fewer modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to remove known factors from n repeatedly until you get rid of all known prime factors. Let me give some almost-C code:
factors = p_0*p_1*...*p_k*q*r*s;
n_leftover = n/factors;
do {
    factors = gcd(n_leftover, factors);
    n_leftover = n_leftover/factors;
} while (factors != 1);
I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions you can find in other answers, but I think it is worth a try. I feel that it will be a win, especially for numbers with high numbers of small prime factors.
You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.
BTW, I've got some really fast code for finding the factors you're using, without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. I can find it and post it if you think it may help.
Edit: I had to recreate the code here:
#include <stdio.h>

#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)

typedef struct {
    int p;
    int next;
} p_type;

p_type primes[SIZE];
int sieve[SIZE];

void init_sieve()
{
    int i,n;
    int count = 1;
    primes[1].p = 3;
    sieve[1] = 1;
    for (n=5;SIZE>n;n+=2)
    {
        int flag = 0;
        for (i=1;count>=i;i++)
        {
            if ((n%primes[i].p) == 0)
            {
                flag = 1;
                break;
            }
        }
        if (flag==0)
        {
            count++;
            primes[count].p = n;
            sieve[n>>1] = count;
        }
    }
}

int main()
{
    int ptr,n;
    init_sieve();
    printf("init_done\n");
    // factor odd numbers starting with 3
    for (n=1;1000000000>n;n++)
    {
        ptr = sieve[n&MASK];
        if (ptr == 0) //prime
        {
            // printf("%d is prime",n*2+1);
        }
        else //composite
        {
            // printf ("%d has divisors:",n*2+1);
            while(ptr!=0)
            {
                // printf ("%d ",primes[ptr].p);
                sieve[n&MASK]=primes[ptr].next;
                //move the prime to the next number it divides
                primes[ptr].next = sieve[(n+primes[ptr].p)&MASK];
                sieve[(n+primes[ptr].p)&MASK] = ptr;
                ptr = sieve[n&MASK];
            }
        }
        // printf("\n");
    }
    return 0;
}
The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.
The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).
Your algorithm could be merged into this to make use of the factors as they are identified rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for the prime numbers outside the factor base).
Hope it helps.
BTW, this is the first time I've posted this algorithm publicly.
Just a thought, but maybe using your compiler's optimization options would help, if you haven't already. Another thought: if money isn't an issue, you could use the Intel C/C++ compiler, assuming you're using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) have similar compilers.
If you are going to exit immediately on (!smallprimes&!q) why not do that test before even calling the function, and save the function call overhead?
Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop.
bases1(s,n,q), bases2(s,n,q,r), and bases3(s,n,q,r,s).
It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:
if (!(smallprimes|q)) { r = 0; }
else if (s) { r = bases3(s,n,q,r,s); }
else if (r) { r = bases2(s,n,q,r); }
else { r = bases1(s,n,q); }
This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.
If the divisions you're using are with numbers that aren't known at compile time, but are used frequently at runtime (dividing by the same number many times), then I would suggest using the libdivide library, which basically implements at runtime the optimisations that compilers do for compile-time constants (using shifts, masks, etc.). This can provide a huge benefit. Also, avoiding x % y == 0 in favour of something like z = x/y; z * y == x, as ergosys suggested above, should also give a measurable improvement.
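For reference, a rough sketch of what using libdivide might look like for a divisor reused many times (this assumes libdivide's C API, libdivide_u64_gen / libdivide_u64_do, and a made-up caller; consult the library's documentation for the real details):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h" /* assumption: libdivide's single-header C interface */

uint64_t sum_of_quotients(const uint64_t *xs, size_t n, uint64_t d)
{
    struct libdivide_u64_t fast_d = libdivide_u64_gen(d); /* one-time setup for divisor d */
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += libdivide_u64_do(xs[i], &fast_d);        /* computes xs[i] / d without a div */
    return total;
}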
Is the code in your top post the optimized version? If yes, there are still too many divide operations, which eat a lot of CPU cycles.
This code does a bit of unnecessary work:
if (!smallprimes & !q)
return 0;
Change it to a logical and, &&:
if (!smallprimes && !q)
return 0;
This will short-circuit faster, without evaluating q.
And the following code
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
which is used to find the lowest set bit of smallprimes. Why not use the simpler way (note that __builtin_ffsll returns a 1-based index, hence the + 1):
ulong p = pr[__builtin_ctzll(smallprimes) + 1];
Another culprit for decreased performance may be too much branching. You may consider changing some of it to less-branchy or branch-free equivalents.
Related
I was writing a very simple program to examine if a number could divide another number evenly:
// use the divider squared to reduce iterations
for(divider = 2; (divider * divider) <= number; divider++)
    if(number % divider == 0)
        printf("%d can be divided by %d\n", number, divider);
Now I was curious if the task could be done by finding the square root of number and compare it to divider. However, it seems that sqrt() isn't really able to boost the efficiency. How was sqrt() handled in C and how can I boost the efficiency of sqrt()? Also, is there any other way to approach the answer with even greater efficiency?
Also, the
number % divider == 0
is used to test whether divider evenly divides number. Is there also a more efficient way to do the test besides using %?
I'm not going to address what the best algorithm to find all factors of an integer is. Instead I would like to comment on your current method.
There are three conditional test cases to consider:
(divider * divider) <= number
divider <= number/divider
divider <= sqrt(number)
See Conditional tests in primality by trial division for more details.
The case to use depends on your goals and hardware.
The advantage of case 1 is that it does not require a division. However, it can overflow when divider*divider is larger than the largest integer. Case 2 does not have the overflow problem, but it requires a division. For case 3 the sqrt only needs to be calculated once, but it requires that the sqrt function gets perfect squares right.
But there is something else to consider: many instruction sets, including the x86 instruction set, return the remainder as well when doing a division. Since you're already computing number % divider, this means that you get number / divider for free.
Therefore, case 1 is only useful on systems where the division and remainder are not calculated in one instruction and where you're not worried about overflow.
Between case 2 and case 3, I think the main issue is again the instruction set. Choose case 2 if the sqrt is too slow compared to the division, or if your sqrt function does not handle perfect squares correctly. Choose case 3 if the instruction set does not calculate the quotient and remainder in one instruction.
For the x86 instruction set, case 1, case 2 and case 3 should give essentially equal performance, so there should be no reason to use case 1 (however, see a subtle point below). The C standard library guarantees that the sqrt of a perfect square is computed correctly, so there is no disadvantage to case 3 either.
But there is one subtle point about case 2. I have found that some compilers don't recognize that the division and remainder can be calculated together. For example, in the following code
for(divider = 2; divider <= number/divider; divider++)
    if(number % divider == 0)
GCC generates two division instructions even though only one is necessary. One way to fix this is to keep the division and remainder close together, like this:
divider = 2, q = number/divider, r = number%divider;
for(; divider <= q; divider++, q = number/divider, r = number%divider)
    if(r == 0)
In this case GCC produces only one division instruction, and case 1, case 2 and case 3 all have the same performance. But this code is a bit less readable than
int cut = sqrt(number);
for(divider = 2; divider <= cut; divider++)
    if(number % divider == 0)
so I think that overall case 3 is the best choice, at least with the x86 instruction set.
However, it seems that sqrt() isn't really able to boost the efficiency
That is to be expected, as the saved multiplication per iteration is largely dominated by the much slower division operation inside the loop.
Also, the number % divider == 0 is used to test whether divider evenly divides number; is there also a more efficient way to do the test besides using %?
Not that I know of. Checking whether a % b == 0 is at least as hard as checking a % b == c for some c, because we can use the former to compute the latter (with one extra addition). And at least on Intel architectures, computing the latter is just as computationally expensive as a division, which is among the slowest operations in typical, modern processors.
If you want significantly better performance, you need a better factorization algorithm, of which there are plenty. One particularly simple one with runtime O(n^(1/4)) is Pollard's ρ algorithm. You can find a straightforward C++ implementation in my algorithms library. Adaptation to C is left as an exercise for the reader:
int rho(int n) { // will find a factor < n, but not necessarily prime
    if (~n & 1) return 2;
    int c = rand() % n, x = rand() % n, y = x, d = 1;
    while (d == 1) {
        x = (1ll*x*x % n + c) % n;
        y = (1ll*y*y % n + c) % n;
        y = (1ll*y*y % n + c) % n;
        d = __gcd(abs(x - y), n);
    }
    return d == n ? rho(n) : d;
}
void factor(int n, map<int, int>& facts) {
    if (n == 1) return;
    if (rabin(n)) { // simple randomized prime test (e.g. Miller–Rabin)
        // we found a prime factor
        facts[n]++;
        return;
    }
    int f = rho(n);
    factor(n/f, facts);
    factor(f, facts);
}
Constructing the factors of n from its prime factors is then an easy task. Just use all possible exponents for the found prime factors and combine them in each possible way.
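A small C sketch of that last step (my adaptation of the C++ above; parallel primes[]/exps[] arrays stand in for the map), enumerating every divisor by picking each exponent from 0 up to its maximum:

#include <stdio.h>

static void print_divisors(const long long *primes, const int *exps, int k,
                           int idx, long long current)
{
    if (idx == k) {              // one exponent has been chosen for every prime
        printf("%lld\n", current);
        return;
    }
    long long pw = 1;
    for (int e = 0; e <= exps[idx]; e++) {
        print_divisors(primes, exps, k, idx + 1, current * pw);
        pw *= primes[idx];
    }
}

int main(void)
{
    long long primes[] = {2, 3, 7}; // 504 = 2^3 * 3^2 * 7
    int exps[]         = {3, 2, 1};
    print_divisors(primes, exps, 3, 0, 1);
    return 0;
}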
In C, you can take square roots of floating point numbers with the sqrt() family of functions in the header <math.h>.
Taking square roots is usually slower than dividing because the algorithm to take square roots is more complicated than the division algorithm. This is not a property of the C language but of the hardware that executes your program. On modern processors, taking square roots can be just as fast as dividing. This holds, for example, on the Haswell microarchitecture.
However, if the algorithmic improvements are good, the slightly slower speed of a sqrt() call usually doesn't matter.
To only compare up to the square root of number, employ code like this:
#include <math.h>
/* ... */
int root = (int)sqrt((double)number);
for(divider = 2; divider <= root; divider++)
    if(number % divider == 0)
        printf("%d can be divided by %d\n", number, divider);
This is just my random thought, so please comment on and criticize it if it's wrong.
The idea is to precompute all the prime numbers below a certain bound and use them as a table.
Looping through the table, check whether the prime number is a factor; if it is, increment the counter for that prime number, if not, increment the index. Terminate when the index reaches the end or the prime number to check exceeds the input.
At the end, the result is a table of all the prime factors of the input, and their counts. Then generating all natural factors should be trivial, shouldn't it?
Worst case, the loop needs to go to the end, then it takes 6542 iterations.
Considering the input is in [0, 4294967296], this is roughly O(n^(3/8)).
Here's MATLAB code that implements this method. If p is generated by p = primes(65536), it should work for all inputs in [0, 4294967296] (but this is not tested):
function [ output_non_zero ] = fact2(input, p)
    output_table = zeros(size(p));
    i = 1;
    while(i < length(p))
        if(input < 1.5)
            % break condition: input has been divided down to 1,
            % all prime factors are found.
            break;
        end
        if(rem(input, p(i)) < 1)
            % if divisible, increment the counter and don't increment the index;
            % keep checking until no longer divisible
            output_table(i) = output_table(i) + 1;
            input = input/p(i);
        else
            % not divisible, try the next prime
            i = i + 1;
        end
    end
    % remove all zeros, should be handled more efficiently
    output_non_zero = [p(output_table~=0); ...
                       output_table(output_table~=0)];
    if(input > 1.5)
        % the last and largest prime factor could be larger than 65536
        % and hence be missing from the table; add it to the end of the output
        % if it exists
        output_non_zero = [output_non_zero, [input; 1]];
    end
end
Test:
p = primes(65536);
t = floor(rand()*4294967296);
b = fact2(t, p);
% check that the prime factors multiply back to the input and that they are all primes
assert((prod(b(1,:).^b(2,:))==t)&&all(isprime(b(1,:))), 'test failed');
EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to a dot product of two vectors, but there is a difference. I've got two vectors: one vector of bits and one vector of floats, of the same length. So I need to calculate the sum
float[0]*bit[0] + float[1]*bit[1] + ... + float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip some fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
The following code is called many times with different data so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent but still slow, but I don't know how to further improve it.
The bit array is implemented as an array of 64-bit unsigned ints; this allows me to use BitScanForward to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
    unsigned int nAddr = i / 64;
    unsigned int nShift = i & 63;
    unsigned __int64 v = bitarray[nAddr] >> nShift;
    unsigned long idx;
    if (!_BitScanForward64(&idx, v))
    {
        i += 64 - nShift;
        continue;
    }
    i += idx;
    fSum += floatarray[i];
    i += nSkip;
} while (i < nEnd);
The profiler shows the 3 slowest hotspots:
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But probably there is a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating a classical dot product; I haven't tried it yet, but honestly I don't believe it will be faster, given the extra memory accesses.
You have too many operations inside the loop. You also have only one loop, so many of the operations that need to happen only once for each flag word (the 64-bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try to not do that too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, it is not actually an expensive operation, but we still need to move as many operations as we can out of the loops.
/* this is completely untested, but incorporates the optimizations
that I outlined as well as a few others.
I process the arrays backwards, which allows for elimination of
comparisons of variables against other variables, which is much
slower than comparisons of variables against 0, which is essentially
free on many processors when you have just operated or loaded the
value to a register.
Going backwards at the bit level also allows for the possibility that
the compiler will take advantage of the comparison of the top bit
being the same as test for negative, which is cheap and mostly free
for all but the first time through the inner loop (for each time
through the outer loop).
*/
double acc = 0.0;
unsigned i_end = nEnd - 1;
unsigned i_bit_end;
int i_word_end;

if (i_end == 0)
{
    return acc;
}

i_bit_end = i_end % 64;
i_word_end = i_end / 64;

do
{
    unsigned __int64 v = bitarray[i_word_end];
    unsigned i_upper = i_word_end << 6;   // i_word_end * 64
    v <<= 63 - i_bit_end;                 // align the highest bit to examine with the top bit
    while (v)
    {
        if (v & 0x8000000000000000ull)    // test the top bit
        {
            // The following code is semantically the same as
            // unsigned i = i_bit_end + (i_word_end * 64);
            unsigned i = i_bit_end | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit_end--;
    }
    i_bit_end = 63;
    i_word_end--;
} while (i_word_end >= 0);
I think you should check "How to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of presenting a particular problem.
I cannot see why you are incrementing the same variable in two places instead of one (i).
Also, I think you should declare variables only once, not in every iteration.
I am running a bunch of physical simulations in which I need random numbers. I'm using the standard rand() function in C++.
So it works like this: first I precalculate a bunch of probabilities of the form 1/(1+exp(a)), for a set of different a. They're of type double, as returned by the exp function in the math library. Then things must happen with those probabilities: there are only two of them, so I generate a random number uniformly distributed between 0 and 1 and compare it with those precalculated probabilities. To do that, I used:
double p = double(rand()%101)/100.0;
so I'm given random values between 0 and 1, both included. This didn't yield correct physical results. I tried this:
double p = double(rand()%1000001)/1000000.0;
And this worked. I don't really understand why, so I would like some criteria for how to do it. My intuition tells me that if I do
double p = double(rand()%(N+1))/double(N);
with N big enough such that the smallest increment (1/N) is much smaller than the smallest probability 1/(1+exp(a)), then I will be getting realistic random numbers.
I would like to understand why, though.
rand() returns a random number between 0 and RAND_MAX.
Therefore you need this:
double p = double(rand()) / double(RAND_MAX);
Also run this snippet and you will understand:
int i;

for (i = 1; i < 30; i++)
{
    int rnd = rand();
    double p0 = double(rnd % 101) / 100.0;
    double p1 = double(rnd % 1000001) / 1000000.0;
    printf("%d\t%f\t%f\n", rnd, p0, p1);
}

for (i = 1; i < 30; i++)
{
    int rnd = rand();
    double p0 = double(rnd) / double(RAND_MAX);
    printf("%d\t%f\n", rnd, p0);
}
You have multiple problems.
rand() isn't very random at all. On almost all operating systems it returns badly distributed, horribly biased numbers. It's actually quite hard to find a good random number generator, but I can guarantee you that rand() will be among the worst you can find.
rand() % N gives a biased distribution. Think about the pigeonhole principle. Let's simplify it: assume that rand returns numbers in [0,8) and your N is 6. 0 to 5 map to 0 to 5, 6 maps to 0 and 7 maps to 1, meaning that 0 and 1 are twice as likely to come out.
Converting the numbers to double before division does not remove the bias from 2, it just makes it less visible. The pigeonhole principle applies regardless of the conversions you do.
Converting a well-distributed random number from integer to float/double is harder than it looks. Simple division ignores the problems of how floating point math works.
I can't help you much with 1, you need to do research. Look around the net for random number libraries. If you want something very random and unpredictable you need to look for cryptographic random libraries. If you want a repeatable but good random number Mersenne Twister should probably be good enough. But you need to do the research here.
For 2 and 3 there are standard solutions. You are mapping a set of M elements to N elements, and rand() % N will only be unbiased if N divides M. Since on most systems M will be a power of two, that means N also has to be a power of two. So, assuming that M is a power of two, the algorithm is: find the nearest power of 2 higher than or equal to N, let's call it P. Generate randomness_source() % P. If the number is N or higher, throw it away and try again. This is the only safe way to do this. Cleverer people than you and me have spent years on this problem; there's no better way to remove the bias.
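A bare-bones sketch of that rejection loop (using rand() as the imperfect source purely for illustration; the masking step is only unbiased if the source itself is uniform over a power-of-two range at least as large as P):

#include <stdlib.h>

unsigned rand_below(unsigned n) /* returns a value in [0, n), n > 0 */
{
    unsigned p = 1;
    while (p < n)                       /* round n up to the next power of two */
        p <<= 1;
    unsigned r;
    do {
        r = (unsigned)rand() & (p - 1); /* p is a power of two, so this just keeps the low bits */
    } while (r >= n);                   /* reject values in [n, p) and try again */
    return r;
}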
For 4, you can probably ignore the problem and just divide; in the vast majority of cases this should be good enough. If you really want to study the problem, I've done some work on it and published the code on github. There I go through some basic principles of how floating point numbers work and how they relate to generating random numbers.
// produces pseudorandom bits. These are NOT crypto quality bits. Has the same underlying unpredictability as uncooked
// rand() output. It buffers rand() bits to produce a more convenient zero-to-the-argument range including negative
// arguments, corrects for the toward-zero bias of the modular construction I'd be using otherwise, eliminates the
// RAND_MAX range limitation, (use INT64_MAX instead) and effectively obscures biases and sequence telltales due to
// annoyingly bad rand libraries. It does not correct these biases; anyone tracking the arguments and outputs has
// enough information to reconstruct the rand() output and detect them. But it makes the relationships drastically more complicated.
// needs stdint, stdlib.
int64_t privaterandom(int64_t range, int reset){
    static uint64_t state = 0;
    int64_t retval;
    if (reset != 0){
        srand((unsigned int)range);
        state = (uint64_t)range;
    }
    if (range == 0) return (0);
    if (range < 0) return -privaterandom(-range, 0);
    if (range > UINT64_MAX/0xFFFFFFFF){
        retval = (privaterandom(range/0xFFFFFFFF, 0) * 0xFFFFFFFF); // order of operations matters
        return (retval + privaterandom(0xFFFFFFFF, 0));
    }
    while (state < UINT64_MAX / 0xFF){
        state *= RAND_MAX;
        state += rand();
    }
    retval = (state % range);
    // makes "pigeonhole" bias alternate unpredictably between toward-even and toward-odd
    if ((state/range > (state - (retval) )/ range) && state % 2 == 0) retval++;
    state /= range;
    return retval;
}
int64_t Random(int64_t range){ return (privaterandom(range, 0));}
int64_t Random_Init(int64_t seed){return (privaterandom(seed, 1));}
I have the following C function, used to determine whether one number is a multiple of another, to an arbitrary tolerance:
#include <math.h>

#define TOLERANCE 0.0001

int IsMultipleOf(double x, double mod)
{
    return (fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining 25% in fabs. I'm trying to figure out a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up table would not be an issue; typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just compute result = x / mod and then check whether the decimal part of result is close to 0? For instance:
11 / 5.4999 ≈ 2.000036 ==> 0.000036 < TOLERANCE
Or something like that.
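A rough sketch of that idea (note that the tolerance here is measured as a fraction of mod rather than as an absolute distance, so it is not a drop-in replacement for the original):

#include <math.h>

#define TOLERANCE 0.0001

int IsNearMultiple(double x, double mod)
{
    double q = x / mod;
    double frac = q - floor(q);        /* fractional part of the quotient */
    /* a multiple can also show up as a fraction just below 1.0 */
    return frac < TOLERANCE || (1.0 - frac) < TOLERANCE;
}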
Division (floating point or not, fmod in your case) is often an operation whose execution time varies a lot depending on the CPU and compiler:
GCC has a builtin replacement for it if you give it the right compile flags or if you use __builtin_fmod explicitly. This might then map the operation to a small number of assembler instructions.
There may be special units like SSE on Intel processors where this operation is implemented more efficiently.
With such tricks, depending on your environment (you didn't say which), the time may vary from a few clock cycles to a few hundred. I think it is best to look into the documentation of your compiler and CPU for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
    return (true);
if (exp < 0)
    return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 for IEEE 754 double)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL fmod() function: x86 FPUs have FPREM/FPREM1 instructions, which compute remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod, the following should be an equivalent inlined version, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (Also, I don't even know for sure if this is correct.)
#include <math.h>

int IsMultipleOf(double x, double mod) {
    long n = x / mod;                // You should probably test for /0 or NAN result here
    double new_x = mod * n;
    double delta = x - new_x;
    return fabs(delta) < TOLERANCE;  // and for NAN result from fabs
}
Maybe you can get away with long long instead of double, if your data have a comparable scale. For example, long long would be enough for over 60 astronomical units at micrometer resolution.
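A sketch of that fixed-point idea, assuming the caller pre-scales everything to integer "ticks" (e.g. micrometers), so the whole test becomes exact integer arithmetic:

#include <stdint.h>

/* tol_ticks plays the role of TOLERANCE, expressed in the same ticks */
int IsMultipleOfTicks(int64_t x_ticks, int64_t mod_ticks, int64_t tol_ticks)
{
    int64_t r = x_ticks % mod_ticks;
    if (r < 0)
        r = -r;            /* mirrors the fabs() in the original */
    return r < tol_ticks;
}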
Does it need to be double precision? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>

#define TOLERANCE 0.0001f

bool IsMultipleOf(float x, float mod)
{
    return (fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x, m) {
    while (x > m) {
        x = x - m
    }
    return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x, m) {
    q = 1
    while (m * q < x) {
        q = q * 2
    }
    return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m then return x.
I have to minimize the cost of calculating a modulus in C.
Say I have a number x, and n is the number which will divide x.
when n == 65536 (which happens to be 2^16):
mod = x % n (11 assembly instructions as produced by GCC)
or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)
so, GCC doesn't optimize it to this extent.
In my case n is not a power of 2; it is the largest prime less than 2^16, which is 65521.
As I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I perform when n == 65521 to calculate the modulus?
First, make sure you're looking at optimized code before drawing conclusions about what GCC is producing (and make sure this particular expression really needs to be optimized). Finally, don't count instructions to draw your conclusions; it may be that an 11-instruction sequence performs better than a shorter sequence that includes a div instruction.
Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask, any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is, as opposed to dividing by an arbitrary number.
With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:
Archive of http://www.hackersdelight.org/
Archive of http://www.hackersdelight.org/magic.htm
There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.
x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).
Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.
If you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. That is, if x is an unsigned int you want to reduce, you can do the following:
unsigned int low = x & 0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;
This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you only are guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.
This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in its residue class.
E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently one would only do modular reductions after a carefully computed number of additions. A reduction shown as above is enough and only the computation of the hash will need a full modulo operation.
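As a small illustration of that "reduce only occasionally" pattern (my example, not Adler-32 itself), here is a sketch that sums 16-bit values modulo 65521 using the partial fold from above inside the loop and a single full % at the end:

#include <stddef.h>
#include <stdint.h>

uint32_t sum_mod_65521(const uint16_t *a, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += a[i];                              /* each addend fits in 16 bits */
        if (s >= (1u << 26))                    /* fold before the accumulator can overflow */
            s = (s & 0xffff) + 15 * (s >> 16);  /* stays congruent mod 65521, since 2^16 ≡ 15 (mod 65521) */
    }
    return s % 65521;                           /* one full reduction at the very end */
}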
The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.
If the constant with which you want to take the modulo is known at compile time and you have a decent compiler (e.g. gcc), it is usually best to let the compiler work its magic. Just declare the modulus const.
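For instance, with a constant divisor gcc already turns the % into a multiply-and-shift sequence on its own (easily checked with gcc -O2 -S), so nothing manual is needed:

static inline unsigned mod_65521(unsigned x)
{
    return x % 65521u; /* compiled without a div instruction when the divisor is a constant */
}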
If you don't know the constant at compile time, but you are going to take, say, a billion modulos with the same number, then use libdivide: http://libdivide.com/
When dealing with powers of 2, the following approach can be considered (mostly C flavored):
.
.
#define THE_DIVISOR 0x8U /* The modulo value (POWER OF 2). */
.
.
uint8 CheckIfModulo(const sint32 TheDividend)
{
    uint8 RetVal = 1; /* TheDividend is not a multiple of THE_DIVISOR. */
    if (0 == (TheDividend & (THE_DIVISOR - 1)))
    {
        /* code if modulo is satisfied */
        RetVal = 0; /* TheDividend IS a multiple of THE_DIVISOR. */
    }
    else
    {
        /* code if modulo is NOT satisfied */
    }
    return RetVal;
}
If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely.
A loop going
x += i; if (x >= n) x -= n;
is way faster than
x = (x + i) % n;
which you unfortunately find in many text books...
If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient
x = x + (x+i < n ? i : i-n)
idiv — Integer Division
The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.
source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html