VHDL simulation stuck in for loop

VHDL simulation stuck in for loop - loops

I'm doing simulation testing for some VHDL I wrote and when I run it in ModelSim it gets stuck. When I hit 'break' it has an arrow pointing to the For loop in the following function:
function MOD_3 (a, b, c : UNSIGNED (1023 downto 0)) return UNSIGNED is
VARIABLE x : UNSIGNED (1023 downto 0) := TO_UNSIGNED(1, 1024);
VARIABLE y : UNSIGNED (1023 downto 0) := a;
VARIABLE b_temp : UNSIGNED (1023 downto 0) := b;
begin
for I in 0 to 1024 loop
if b_temp > 0 then
if b_temp MOD 2 = 1 then
x := (x * y) MOD c;
end if;
y := (y * y) MOD c;
b_temp := b_temp / 2;
else
exit;
end if;
end loop;
return x MOD c;
end function;
I originally had this as a while loop which I realize is not good for synthesizing. So I converted it to a for loop with the condition that b_temp is greater than 0. b_temp is a 1024-bit unsigned and so if it is the largest number that could be represented by 1024 bits and divided in half (which I do in each iteration) 1024 times, shouldn't it definitely be 0?
I have a feeling my problem lies in the large multiplications...if I comment out x := (x * y) MOD c and y := (y * y) MOD c then it exits the loop. So the only thing I can think of is it takes too long to carry out these 1024-bit multiplications? If this is the case, is there any built-in way I can optimize this to make it faster, or is my only option to implement something like Karatsuba multiplication, etc...?

I'm of the opinion implementing a Montgomery multiplier in numeric_std function calls may not improve simulation as much as you'd like (while giving synthesis eligibility).
The issue is the number of dynamically elaborated subprogram calls vs. their operand sizes vs. fitting in your CPU-running-Modelsim's L1/L2/L3 caches.
It does do wonders for targeting synthesis in an FPGA or a SIMD GPU implementation.
See Subversion Repositories BasicRSA file modmult.vhd (which has a generic size). I successfully converted this to using numeric_std[_unsigned].
If I recall correctly this appears inspired by
a Masters thesis (Efficient Hardware Architectures for Modular Multiplication) by David Narh Amanor in 2005 outlining a Java and a VHDL implementation in various word sizes.
I found the OpenCores implementation mentioned in a Stackoverflow question (Montgomery multiplication VHDL Implementation) and found the generic sized version in the SVN repository (the downloadable version is 16 bit) and the mention of the thesis in A 1024 – Bit Implementation of the Faster Montgomery Multiplier Using VHDL (by David Narh Anamor, the original link having expired). Note the quoted FPGA implementation performance under 42 usec.
Notice the length 1024 version specified by a generic would still be performing dynamically elaborated function calls with length 1024 operands (although not the "*"s, the "mod"s or the "/"s. You'd still be doing millions of function calls with dynamically elaborated (passed on an expression stack) 1024 bit parameters. We're simply changing how many millions of large parameter subroutine calls and how long they can take.
And that also brings up the possibility of an integer vector implementation (bignum equivalent) in VHDL, which would potential increase simulation performance even more (and you're likely in uncharted territory here).
A subprogram based version of the OpenCores model using variable parameters would be telling. (Whether or not you can impress anyone showing them a simulation model executing, or whether there's this looong pause interrupted by everyone taking furtive glances at the wall clock and looking bored).

Related

calculating with very big numbers in c

Im new to programming.
I want to calculate the modulo of a number which is in range from [0,10^24].
For example: (12 * 10^22) % 89
I know that I can't do this with the usual data types like long, integer and so on.
How can I do this? Is there a way to do this?
Thanks in advance

The expression shown can be easily calculated by reducing modulo 89 after each operation:
#include <stdio.h>
int main(void)
{
// Initialize a product.
int t = 1;
// Calculate 10^22 (modulo 89) by multiplying by 10 (modulo 89) 22 times.
for (int i = 0; i < 22; ++i)
t = t * 10 % 89;
// Multiply by 12 (modulo 89).
t = t * 12 % 89;
// Show the result.
printf("%d\n", t);
}

As the previous answers say, you can get by with regular integers in this case.
If you really need very large integers (like in many cryptographic applications), there are several open source libraries around. Check out the GNU MP library or Flint (Fast Library for Number Theory, take a peek at the (header only!) C++ precision package or take a peek at Wikipedia's list. Most of them are available as packages for your favorite Linux distrtibution.
Many languages handle large integers transparently, GNU's bc (at your nearest Linux box) uses them.

The largest guaranteed C data type is a uint64_t (see stdint.h), which holds up to 2^64 - 1. I think it is preferable to maintain conformance to standard C when possible, so I'd suggest a way of simplifying the problem first so as to not use data types wider than that. For that, let's borrow from a modular arithmetic proof:
xy % z == ((x % z) * (y % z)) % z
From this, we can gather that, (12 * 10^22) % 89 == (12 % 89 * 10^11 % 89 * 10^11 %89) % 89 This new version certainly isn't as pretty, but it does make all the factors fit nicely into standard C data types. Simplification prior to computation such as this can be used for all the data in your range.
However, there is the problem of having storing the value you want to modulo in the first place. Clarification on what the use case is might make for a more applicable answer. For instance, is the value being stored in a string, perhaps from user input?
Alternatively, you can make use of certain libraries designed for dealing with really big numbers. I don't know which would be best for your purposes (I don't even know what OS you're on), but if you'd like to explore those options, a simple web search for "bignum", "arbitrary-precision", or "infinite-precision" libraries should give you what you want.

Fast hashing of 32 bit values to between 0 and 254 inclusive

I'm looking for a fast way in C to hash numbers 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits */
if (a < 255) return a;
if (a < (2 * 255)) return a - 255;
return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.

The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will
be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.

I would just do this:
uchar fast_mod255( uint a32 ) {
ushort a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
uchar a8 = (a16 >> 8) + (a16 & 0xFF); /* sum base 2**8 digits */
return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
return (dot(a) % 255); // or return (distance(a) % 255);
}
GPUs are very efficient in computing the distances and dot products, even in 4 dimensions. And it is a valid way of hashing as well. Dsicarding the overflowed values.
No branching, and a clever compiler can even optimize it out. Or do you really need that values that fall in the 255 zone have a scattered pattern instead of 1?

I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then arithmetically shifting the result to the right by k digits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1),
then map_to_0_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.

Fast and efficient array lookup and modification

I have been wondering for a while which of the two following methods are faster or better.
MY CURRENT METHOD
I'm developing a chess game and the pieces are stored as numbers (really bytes to preserve memory) into a one-dimensional array. There is a position for the cursor corresponding to the index in the array. To access the piece at the current position in the array is easy (piece = pieces[cursorPosition]).
The problem is that to get the x and y values for checking if the move is a valid move requires the division and a modulo operators (x = cursorPosition % 8; y = cursorPosition / 8).
Likewise when using x and y to check if moves are valid (you have to do it this way for reasons that would fill the entire page), you have to do something like - purely as an example - if pieces[y * 8 + x] != 0: movePiece = False. The obvious problem is having to do y * 8 + x a bunch of times to access the array.
Ultimately, this means that getting a piece is trivial but then getting the x and y requires another bit of memory and a very small amount of time to compute it each round.
A MORE TRADITIONAL METHOD
Using a two-dimensional array, one can implement the above process a little easier except for the fact that piece lookup is now a little harder and more memory is used. (I.e. piece = pieces[cursorPosition[0]][cursorPosition[1]] or piece = pieces[x][y]).
I don't think this is faster and it definitely doesn't look less memory intensive.
GOAL
My end goal is to have the fastest possible code that uses the least amount of memory. This will be developed for the unix terminal (and potentially Windows CMD if I can figure out how to represent the pieces without color using Ansi escape sequences) and I will either be using a secure (encrypted with protocol and structure) TCP connection to connect people p2p to play chess or something else and I don't know how much memory people will have or how fast their computer will be or how strong of an internet connection they will have.
I also just want to learn to do this the best way possible and see if it can be done.
-
I suppose my question is one of the following:
Which of the above methods is better assuming that there are slightly more computations involving move validation (which means that the y * 8 + x has to be used a lot)?
or
Is there perhaps a method that includes both of the benefits of 1d and 2d arrays with not as many draw backs as I described?

First, you should profile your code to make sure that this is really a bottleneck worth spending time on.
Second, if you're representing your position as an unsigned byte decomposing it into X and Y coordinates will be very fast. If we use the following C code:
int getX(unsigned char pos) {
return pos%8;
}
We get the following assembly with gcc 4.8 -O2:
getX(unsigned char):
shrb $3, %dil
movzbl %dil, %eax
ret
If we get the Y coordinate with:
int getY(unsigned char pos) {
return pos/8;
}
We get the following assembly with gcc 4.8 -O2:
getY(unsigned char):
movl %edi, %eax
andl $7, %eax
ret

There is no short answer to this question; it all depends on how much time you spend optimizing.
On some architectures, two-dimensional arrays might work better than one-dimensional. On other architectures, bitmapped integers might be the best.
Do not worry about division and multiplication.
You're dividing, modulating and multiplying by 8.
This number is in the power of two, thus any computer can use bitwise operations in order to achieve the result.
(x * 8) is the same as (x << 3)
(x % 8) is the same as (x & (8 - 1))
(x / 8) is the same as (x >> 3)
Those operations are normally performed in a single clock cycle. On many modern architectures, they can be performed in less than a single clock cycle (including ARM architectures).
Do not worry about using bitwise operators instead of *, % and /. If you're using a compiler that's less than a decade old, it'll optimize it for you and use bitwise operations.
What you should focus on instead, is how easy it will be for you to find out whether or not a move is legal, for instance. This will help your computer-player to "think quickly".
If you're using an 8*8 array, then it's easy for you to see where a castle can move by checking if only x or y is changed. If checking the queen, then X must either be the same or move the same number of steps as the Y position.
If you use a one-dimensional array, you also have advantages.
But performance-wise, it might be a real good idea to use a 16x16 array or a 1x256 array.
Fill the entire array with 0x80 values (eg. "illegal position"). Then fill the legal fields with 0x00.
If using a 1x256 array, you can check bit 3 and 7 of the index. If any of those are set, then the position is outside the board.
Testing can be done this way:
if(position & 0x88)
{
/* move is illegal */
}
else
{
/* move is legal */
}
... or ...
if(0 == (position & 0x88))
{
/* move is legal */
}
'position' (the index) should be an unsigned byte (uint8_t in C). This way, you'll never have to worry about pointing outside the buffer.
Some people optimize their chess-engines by using 64-bit bitmapped integers.
While this is good for quickly comparing the positions, it has other disadvantages; for instance checking if the knight's move is legal.
It's not easy to say which is better, though.
Personally, I think the one-dimensional array in general might be the best way to do it.
I recommend getting familiar (very familiar) with AND, OR, XOR, bit-shifting and rotating.
See Bit Twiddling Hacks for more information.

Optimize me! (C, performance) -- followup to bit-twiddling question

Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).
Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)
Overview
This function will be called at least 1013 times, and possibly as often as 1015. That is, this code will run for months in all likelihood, so any performance tips would be helpful.
This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).
At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.
Key Observation
If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.
This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).
Performance Information
smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs is the location of the leftmost binary 1, star is (n-1) >> vals(n-1), pr is an array of the primes from 2 to 313.
The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
I would be happy to answer any questions you have about the function or its constituents.
What it actually does
Added in response to a request. You don't need to read this part.
The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:
smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc. up to the most significant bit which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that number. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.
Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime are counted using a mathematical formula best explained by the code.
Improvements so far
Pushed the early exit test up. This clearly saves work so I made the change.
The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible with over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.
Code
ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
if (!smallprimes & !q)
return 0;
ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong nn = star(n);
ulong prod = 1;
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n /= p;
while (n % p == 0)
n /= p;
smallprimes ^= bit;
}
if (q) {
nu = minuu(nu, vals(q - 1));
prod *= ugcd(nn, star(q));
n /= q;
while (n % q == 0)
n /= q;
} else {
goto BASES_END;
}
if (r) {
nu = minuu(nu, vals(r - 1));
prod *= ugcd(nn, star(r));
n /= r;
while (n % r == 0)
n /= r;
} else {
goto BASES_END;
}
if (s) {
nu = minuu(nu, vals(s - 1));
prod *= ugcd(nn, star(s));
n /= s;
while (n % s == 0)
n /= s;
}
BASES_END:
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
f++;
}
// This happens ~88% of the time in my tests, so special-case it.
if (nu == 1)
return prod << 1;
ulong tmp = f * nu;
long fac = 1 << tmp;
fac = (fac - 1) / ((1 << f) - 1) + 1;
return fac * prod;
}

You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of divisor (division: ~15-80(!) cycles, depending on the divisor, multiplication: ~4 cycles), IF of course you can precompute the reciprocals.
While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)
Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.
EDIT:
After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).
It seems that for every divisor d which is odd and base B = 2word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions: d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). Which means you can simply replace a division with a multiplication, no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!
How to check whether a given x is an exact multiple of d - i.e. x mod d = 0 ? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.
So, in code:
unsigned x, d;
unsigned inv_d = mulinv(d); //precompute this!
unsigned limit = (unsigned)-1 / d; //precompute this!
unsigned q = x*inv_d;
if(q <= limit)
{
//x % d == 0
//q == x/d
} else {
//x % d != 0
//q is garbage
}
Assuming the pr[] array becomes an array of struct prime:
struct prime {
ulong p;
ulong inv_p; //equal to mulinv(p)
ulong limit; //equal to (ulong)-1 / p
}
the while(smallprimes) loop in your code becomes:
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
int bit_ix = __builtin_ffsll(bit);
ulong p = pr[bit_ix].p;
ulong inv_p = pr[bit_ix].inv_p;
ulong limit = pr[bit_ix].limit;
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n *= inv_p;
for(;;) {
ulong q = n * inv_p;
if (q > limit)
break;
n = q;
}
smallprimes ^= bit;
}
And for the mulinv() function:
ulong mulinv(ulong d) //d needs to be odd
{
ulong x = d;
for(;;)
{
ulong tmp = d * x;
if(tmp == 1)
return x;
x *= 2 - tmp;
}
}
Note you can replace ulong with any other unsigned type - just use the same type consistently.
The proofs, whys and hows are all available in the book. A heartily recommended read :-).

If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:
ulong star(ulong n) __attribute__ ((const));
This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.
Is there a reason why you've opencoded vals() instead of using __builtin_ctz() ?

It is still somewhat unclear, what you are searching for. Quite frequently number theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfiy.
If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768 that you mention) then it might be possible to use that the number of non-witnesses cannot be larger than phi(n)/4 and that the integers for which have this many non-witnesses are either are the product of two primes of the form
(k+1)*(2k+1)
or they are Carmichael numbers with 3 prime factors.
I'd think above some limit all integers in the sequence have this form and that it is possible to verify this by proving an upper bound for the witnesses of all other integers.
E.g. integers with 4 or more factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from you formula for the number of bases for other integers.
As for micro-optimizations: Whenever you know that an integer is divisible by some quotient, then it is possible to replace the division by a multiplication with the inverse of the quotient modulo 2^64. And the tests n % q == 0 can be replaced by a test
n * inverse_q < max_q,
where inverse_q = q^(-1) mod 2^64 and max_q = 2^64 / q.
Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.

Small optimization but:
ulong f;
ulong nn;
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong prod = 1;
if (!smallprimes & !q)
return 0;
// no need to do this operations before because of the previous return
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);
BTW: you should edit your post to add star() and other functions you use definition

Try replacing this pattern (for r and q too):
n /= p;
while (n % p == 0)
n /= p;
With this:
ulong m;
...
m = n / p;
do {
n = m;
m = n / p;
} while ( m * p == n);
In my limited tests, I got a small speedup (10%) from eliminating the modulo.
Also, if p, q or r were constant, the compiler will replace the divisions by multiplications. If there are few choices for p, q or r, or if certain ones are more frequent, you might gain something by specializing the function for those values.

Have you tried using profile-guided optimisation?
Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).
Then re-compile and link it with the -fprofile-use option instead.

1) I would make the compiler spit out the assembly it generates and try and deduce if what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that functions you hope it'll inline (like star and vals) are really inlined. (You might need to add pragma's, or even turn them into macros)
2) It's great that you try this on a multicore machine, but this loop is singlethreaded. I'm guessing that there is an umbrella functions which splits the load across a few threads so that more cores are used?
3) It's difficult to suggest speed ups if what the actual function tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm. So a bit of comments might help ;^)
4) If you really want a speed up of 10* or more, check out CUDA or openCL which allows you to run C programs on your graphics hardware. It shines with functions like these!
5) You are doing loads of modulo and divides right after each other. In C this is 2 separate commands (first '/' and then '%'). However in assembly this is 1 command: 'DIV' or 'IDIV' which returns both the remainder and the quotient in one go:
B.4.75 IDIV: Signed Integer Divide
IDIV r/m8 ; F6 /7 [8086]
IDIV r/m16 ; o16 F7 /7 [8086]
IDIV r/m32 ; o32 F7 /7 [386]
IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:
For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.
For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.
For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.
So it will require some inline assembly, but I'm guessing there'll be a significant speedup as there are a few places in your code which can benefit from this.

Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.
Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.
Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like
BASES_END:
ulong addToF = 0;
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
addToF = 1;
}
// ... early out if nu == 1...
// ... compute f ...
f += addToF;
Hope that helps.

First some nitpicking ;-) you should be more careful about the types that you are using. In some places you seem to assume that ulong is 64 bit wide, use uint64_t there. And also for all other types, rethink carefully what you expect of them and use the appropriate type.
The optimization that I could see is integer division. Your code does that a lot, this is probably the most expensive thing you are doing. Division of small integers (uint32_t) maybe much more efficient than by big ones. In particular for uint32_t there is an assembler instruction that does division and modulo in one go, called divl.
If you use the appropriate types your compiler might do that all for you. But you'd better check the assembler (option -S to gcc) as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like that in some code of mine:
register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
: "=a" (a), "=d" (ret)
: "0" (a), "1" (ret), "rm" (divisor));
As you can see this uses special registers eax and edx and stuff like that...

Did you try a table lookup version of the first while loop? You could divide smallprimes in 4 16 bit values, look up their contribution and merge them. But maybe you need the side effects.

Did you try passing in an array of primes instead of splitting them in smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes to a smallprimes bitmap, and inside this function, you convert the bitmap back to an array of primes, effecively. In addition, you seem to do identical processing for elements of smallprimes, q, r, and s. It should save you a tiny amount of processing per call.
Also, you seem to know that the passed in primes divide n. Do you have enough knowledge outside about the power of each prime that divides n? You could save a lot of time if you can eliminate the modulo operation by passing in that information to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and if you know more about these e_is and n_leftover, passing them in would mean a lot of things you don't have to do in this function.
There may be a way to discover n_leftover (the unfactored part of n) with less number of modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to remove known factors from n repeatedly until you get rid of all known prime factors. Let me give some almost-c-code:
factors=p_0*p_1*...*p_k*q*r*s;
n_leftover=n/factors;
do {
factors=gcd(n_leftover, factors);
n_leftover = n_leftover/factors;
} while (factors != 1);
I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions you can find in other answers, but I think it is worth a try. I feel that it will be a win, especially for numbers with high numbers of small prime factors.

You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.
BTW, I've got some really fast code for finding the factors you're using without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. Can find it and post if you think it may help.
edit had to recreate the code here:
#include
#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)
typedef struct {
int p;
int next;
} p_type;
p_type primes[SIZE];
int sieve[SIZE];
void init_sieve()
{
int i,n;
int count = 1;
primes[1].p = 3;
sieve[1] = 1;
for (n=5;SIZE>n;n+=2)
{
int flag = 0;
for (i=1;count>=i;i++)
{
if ((n%primes[i].p) == 0)
{
flag = 1;
break;
}
}
if (flag==0)
{
count++;
primes[count].p = n;
sieve[n>>1] = count;
}
}
}
int main()
{
int ptr,n;
init_sieve();
printf("init_done\n");
// factor odd numbers starting with 3
for (n=1;1000000000>n;n++)
{
ptr = sieve[n&MASK];
if (ptr == 0) //prime
{
// printf("%d is prime",n*2+1);
}
else //composite
{
// printf ("%d has divisors:",n*2+1);
while(ptr!=0)
{
// printf ("%d ",primes[ptr].p);
sieve[n&MASK]=primes[ptr].next;
//move the prime to the next number it divides
primes[ptr].next = sieve[(n+primes[ptr].p)&MASK];
sieve[(n+primes[ptr].p)&MASK] = ptr;
ptr = sieve[n&MASK];
}
}
// printf("\n");
}
return 0;
}
The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.
The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).
Your algorithm could be merged into this to make use of the factors as they are identified rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for the prime numbers outside the factor base).
Hope it helps.
BTW, this is the first time I've posted this algorithm publicly.

just a thought but maybe using your compilers optimization options would help, if you haven't already. another thought would be that if money isn't an issue you could use the Intel C/C++ compiler, assuming your using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) would have similar compilers

If you are going to exit immediately on (!smallprimes&!q) why not do that test before even calling the function, and save the function call overhead?
Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop.
bases1(s,n,q), bases2(s,n,q,r), and bases3(s,n,q,r,s).
It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:
if (!(smallprimes|q)) { r = 0;}
else if (s) { r = bases3(s,n,q,r,s);}
else if (r) { r = bases2(s,n,q,r); }
else { r = bases1(s,n,q);
This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.

If the divisions you're using are with numbers that aren’t known at compile time, but are used frequently at runtime (dividing by the same number many times), then I would suggest using the libdivide library, which basically implements at runtime the optimisations that compilers do for compile time constants (using shifts masks etc.). This can provide a huge benefit. Also avoiding using x % y == 0 for something like z = x/y, z * y == x as ergosys suggested above should also have a measurable improvement.

Does the code on your top post is the optimized version? If yes, there is still too many divide operations which greatly eat CPU cycles.
This code is overexecute innecessarily a bit
if (!smallprimes & !q)
return 0;
change to logical and &&
if (!smallprimes && !q)
return 0;
will make it short circuited faster without eveluating q
And the following code
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
which is used to find the last set bit of smallprimes. Why don't you use the simpler way
ulong p = pr[__builtin_ctz(smallprimes)];
Another culprit for decreased performance maybe too many program branching. You may consider changing to some other less-branch or branch-less equivalents

Most optimized way to calculate modulus in C

I have minimize cost of calculating modulus in C.
say I have a number x and n is the number which will divide x
when n == 65536 (which happens to be 2^16):
mod = x % n (11 assembly instructions as produced by GCC)
or
mod = x & 0xffff which is equal to mod = x & 65535 (4 assembly instructions)
so, GCC doesn't optimize it to this extent.
In my case n is not x^(int) but is largest prime less than 2^16 which is 65521
as I showed for n == 2^16, bit-wise operations can optimize the computation. What bit-wise operations can I preform when n == 65521 to calculate modulus.

First, make sure you're looking at optimized code before drawing conclusion about what GCC is producing (and make sure this particular expression really needs to be optimized). Finally - don't count instructions to draw your conclusions; it may be that an 11 instruction sequence might be expected to perform better than a shorter sequence that includes a div instruction.
Also, you can't conclude that because x mod 65536 can be calculated with a simple bit mask that any mod operation can be implemented that way. Consider how easy dividing by 10 in decimal is as opposed to dividing by an arbitrary number.
With all that out of the way, you may be able to use some of the 'magic number' techniques from Henry Warren's Hacker's Delight book:
Archive of http://www.hackersdelight.org/
Archive of http://www.hackersdelight.org/magic.htm
There was an added chapter on the website that contained "two methods of computing the remainder of division without computing the quotient!", which you may find of some use. The 1st technique applies only to a limited set of divisors, so it won't work for your particular instance. I haven't actually read the online chapter, so I don't know exactly how applicable the other technique might be for you.

x mod 65536 is only equivalent to x & 0xffff if x is unsigned - for signed x, it gives the wrong result for negative numbers. For unsigned x, gcc does indeed optimise x % 65536 to a bitwise and with 65535 (even on -O0, in my tests).
Because 65521 is not a power of 2, x mod 65521 can't be calculated so simply. gcc 4.3.2 on -O3 calculates it using x - (x / 65521) * 65521; the integer division by a constant is done using integer multiplication by a related constant.

rIf you don't have to fully reduce your integers modulo 65521, then you can use the fact that 65521 is close to 2**16. I.e. if x is an unsigned int you want to reduce then you can do the following:
unsigned int low = x &0xffff;
unsigned int hi = (x >> 16);
x = low + 15 * hi;
This uses that 2**16 % 65521 == 15. Note that this is not a full reduction. I.e. starting with a 32-bit input, you only are guaranteed that the result is at most 20 bits and that it is of course congruent to the input modulo 65521.
This trick can be used in applications where there are many operations that have to be reduced modulo the same constant, and where intermediary results do not have to be the smallest element in its residue class.
E.g. one application is the implementation of Adler-32, which uses the modulus 65521. This hash function does a lot of operations modulo 65521. To implement it efficiently one would only do modular reductions after a carefully computed number of additions. A reduction shown as above is enough and only the computation of the hash will need a full modulo operation.

The bitwise operation only works well if the divisor is of the form 2^n. In the general case, there is no such bit-wise operation.

If the constant with which you want to take the modulo is known at compile time
and you have a decent compiler (e.g. gcc), tis usually best to let the compiler
work its magic. Just declare the modulo const.
If you don't know the constant at compile time, but you are going to take - say -
a billion modulos with the same number, then use this http://libdivide.com/

As an approach when we deal with powers of 2, can be considered this one (mostly C flavored):
.
.
#define THE_DIVISOR 0x8U; /* The modulo value (POWER OF 2). */
.
.
uint8 CheckIfModulo(const sint32 TheDividend)
{
uint8 RetVal = 1; /* TheDividend is not modulus THE_DIVISOR. */
if (0 == (TheDividend & (THE_DIVISOR - 1)))
{
/* code if modulo is satisfied */
RetVal = 0; /* TheDividend IS modulus THE_DIVISOR. */
}
else
{
/* code if modulo is NOT satisfied */
}
return RetVal;
}

If x is an increasing index, and the increment i is known to be less than n (e.g. when iterating over a circular array of length n), avoid the modulus completely.
A loop going
x += i; if (x >= n) x -= n;
is way faster than
x = (x + i) % n;
which you unfortunately find in many text books...
If you really need an expression (e.g. because you are using it in a for statement), you can use the ugly but efficient
x = x + (x+i < n ? i : i-n)

idiv — Integer Division
The idiv instruction divides the contents of the 64 bit integer EDX:EAX (constructed by viewing EDX as the most significant four bytes and EAX as the least significant four bytes) by the specified operand value. The quotient result of the division is stored into EAX, while the remainder is placed in EDX.
source: http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight