Linear Recursion Vectorization - c

To vectorize the following mathematical expression (linear recursion):
f(i) = f(i-1)/c + g(i), where i starts at 1, and f(0) and c are given constants.
I can get better speed by using list-comprehension, that is:
def function():
    txt= [0,2,0,2,0,2,0,2,2,2,0,2,2,2,0,2,0,2,2,2,0,2,0,2,0,2,0,2,2,2,0,2,0,2,0,2,0,2,2,2]
    indices_0=[]
    vl=0
    sb_l=10
    CONST=512
    [vl := vl+pow(txt[i],(i+1)) for i in range(sb_l)]
    if (vl==1876):
        indices_0=[0]
    p=[i for i in range(1,len(txt)-sb_l+1) if (vl := (vl-txt[i-1])/2+ txt[i+sb_l-1]*CONST)==1876]
    print(indices_0+p)
function()
I am looking for a vectorized/faster than vectorization (if possible!) implementation of the above code in python/c.
Note:
1.
A linear recursive function is a function that only makes a single call to itself each time the function runs (as opposed to one that would call itself multiple times during its execution). The factorial function is a good example of linear recursion.
2.
Note that all variables and arrays are given for demonstration purposes; the main part to vectorize is:
[vl := vl+pow(txt[i],(i+1)) for i in range(sb_l)]
if (vl==1876):
    indices_0=[0]
p=[i for i in range(1,len(txt)-sb_l+1)
   if (vl := (vl-txt[i-1])/2+ txt[i+sb_l-1]*CONST)==1876]
Here, f(i-1)= (vl-txt[i-1]), c=2, g(i)= txt[i+sb_l-1]*CONST.
POSTSCRIPT: I am currently doing it in python, would it be much faster if it is implemented in C language's vectorization?
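For reference, a recursion of this form can be unrolled into a closed form, which is what makes a scan-style (prefix-sum) formulation conceivable. The derivation below is only a sketch and assumes c is nonzero and n is small enough that the powers of c stay representable:

```latex
\begin{aligned}
f(1) &= \frac{f(0)}{c} + g(1) \\
f(2) &= \frac{f(0)}{c^{2}} + \frac{g(1)}{c} + g(2) \\
f(n) &= \frac{f(0)}{c^{n}} + \sum_{k=1}^{n} \frac{g(k)}{c^{\,n-k}}
\quad\Longleftrightarrow\quad
c^{n} f(n) = f(0) + \sum_{k=1}^{n} c^{k}\, g(k)
\end{aligned}
```

The right-hand form is a plain cumulative sum of the terms c^k g(k), so in principle a cumulative-sum routine can produce every f(n) at once. In practice c^n grows quickly, so this probably only suits short ranges or blockwise processing.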

Here is an example of an equivalent C program doing the same thing but faster (note that it uses integer division, whereas the Python version uses float division):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
// See: https://stackoverflow.com/questions/29787310/does-pow-work-for-int-data-type-in-c
int64_t int_pow(int64_t base, int exp)
{
int64_t result = 1;
while (exp)
{
// Multiply the result by the current base when the lowest bit of exp is set
if(exp % 2)
result *= base;
exp /= 2;
base *= base;
}
return result;
}
void function()
{
// Neither the txt values nor sb_l may be too big, or there will be silent overflows (ie. wrong results)
const int64_t txt[] = {0,2,0,2,0,2,0,2,2,2,0,2,2,2,0,2,0,2,2,2,0,2,0,2,0,2,0,2,2,2,0,2,0,2,0,2,0,2,2,2};
const size_t txtSize = sizeof(txt) / sizeof(txt[0]);
const int sb_l = 10;
const int64_t CONST = 512;
int64_t vl = 0;
int64_t* results = (int64_t*)malloc(txtSize * sizeof(int64_t));
size_t cur = 0;
// Optimization so not to compute pow(0,i+1) which is 0
for (int i = 0; i < sb_l; ++i)
if(txt[i] != 0)
vl += int_pow(txt[i], i+1);
if (vl == 1876)
{
results[cur] = 0;
cur++;
}
for (int i = 1; i < txtSize-sb_l+1; ++i)
{
vl = (vl - txt[i-1]) / 2 + txt[i+sb_l-1] * CONST;
if(vl == 1876)
results[cur++] = i;
}
// Printing
printf("[");
for (int i = 0; i < cur; ++i)
{
if(i > 0)
printf(", ");
printf("%lld", (long long)results[i]);
}
printf("]\n");
fflush(stdout);
free(results);
}
int main(int argc, char* argv[])
{
function();
return 0;
}
Be careful with overflows. If you are unsure, you can put assertions in specific places (note that they make the code slower when enabled). Please do not forget to compile the program with optimizations (e.g. -O3 with GCC and Clang, /O2 with MSVC).
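As a rough illustration of such overflow checks, here is a minimal sketch of a power function that reports overflow instead of silently returning a wrong value. The names checked_mul_nonneg and checked_int_pow are hypothetical (they are not part of the program above), and the sketch assumes non-negative inputs only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: multiplies two non-negative int64_t values.
 * Returns false instead of overflowing; on success stores a * b in *out. */
static bool checked_mul_nonneg(int64_t a, int64_t b, int64_t *out)
{
    assert(a >= 0 && b >= 0);      /* this sketch covers non-negative inputs only */
    if (a != 0 && b > INT64_MAX / a)
        return false;              /* a * b would exceed INT64_MAX */
    *out = a * b;
    return true;
}

/* Exponentiation by squaring that fails loudly on overflow. */
static bool checked_int_pow(int64_t base, int exp, int64_t *out)
{
    assert(base >= 0 && exp >= 0);
    int64_t result = 1;
    while (exp) {
        if (exp % 2 && !checked_mul_nonneg(result, base, &result))
            return false;
        exp /= 2;
        /* Square the base only if another iteration will actually use it. */
        if (exp && !checked_mul_nonneg(base, base, &base))
            return false;
    }
    *out = result;
    return true;
}
```

With 64-bit integers, checked_int_pow(2, 62, &r) succeeds while checked_int_pow(2, 63, &r) returns false, so the caller can decide how to react instead of continuing with a wrapped value.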

Related

How to find the minimum number of coins needed for a given target amount(different from existing ones)

This is a classic question: a list of coin amounts is given in coins[], len is the length of the coins[] array, and we try to find the minimum number of coins needed to reach the target.
The coins array is sorted in ascending order.
NOTE: I am trying to optimize the efficiency. Obviously I can run a for loop through the coins array and add the target%coins[i] together, but this gives a wrong result when I have, for example, coins[] = {1,3,4} and target = 6: the for loop method would give 3 (1,1,4), but the optimal solution is 2 (3,3).
I haven't learned matrices and multi-dimensional arrays yet; are there ways to do this problem without them? I wrote a function, but it seems to be running in an infinite loop.
int find_min(const int coins[], int len, int target) {
int i;
int min = target;
int curr;
for (i = 0; i < len; i++) {
if (target == 0) {
return 0;
}
if (coins[i] <= target) {
curr = 1 + find_min(coins, len, target - coins[i]);
if (curr < min) {
min = curr;
}
}
}
return min;
}
I can suggest this reading:
https://www.geeksforgeeks.org/generate-a-combination-of-minimum-coins-that-results-to-a-given-value/
The only drawback is that there is no C version of the code, but if you really need one you can port it yourself.
Since no one has given a good answer and I figured it out myself, I might as well post an answer.
I add an array called lp, which is initialized in main,
int lp[4096];
int i;
for (i = 0; i <= COINS_MAX_TARGET; i++) {
lp[i] = -1;
}
After this loop every element of lp is -1, which marks that entry as not computed yet.
int find_min(int tar, const int coins[], int len, int lp[])
{
// Base case
if (tar == 0) {
lp[0] = 0;
return 0;
}
if (lp[tar] != -1) {
return lp[tar];
}
// Initialize result
int result = COINS_MAX_TARGET;
// Try every coin that is smaller than tar
for (int i = 0; i < len; i++) {
if (coins[i] <= tar) {
int x = find_min(tar - coins[i], coins, len, lp);
if (x != COINS_MAX_TARGET)
result = ((result > (1 + x)) ? (1+x) : result);
}
}
lp[tar] = result;
return result;
}
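The same idea can also be written without recursion, using only a one-dimensional array. The following is a minimal sketch rather than the poster's code: min_coins and best are hypothetical names, -1 is an assumed convention for "target cannot be formed", and coins are assumed to be positive.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Returns the minimum number of coins that sum to target, or -1 if the
 * target cannot be formed (or if the allocation fails). */
int min_coins(const int coins[], int len, int target)
{
    assert(len >= 0 && target >= 0);
    int *best = malloc(((size_t)target + 1) * sizeof *best);
    if (!best)
        return -1;
    best[0] = 0;                           /* zero coins are needed for amount 0 */
    for (int t = 1; t <= target; t++) {
        best[t] = INT_MAX;                 /* INT_MAX marks "unreachable so far" */
        for (int i = 0; i < len; i++) {
            if (coins[i] > 0 && coins[i] <= t
                && best[t - coins[i]] != INT_MAX
                && best[t - coins[i]] + 1 < best[t])
                best[t] = best[t - coins[i]] + 1;
        }
    }
    int result = (best[target] == INT_MAX) ? -1 : best[target];
    free(best);
    return result;
}
```

Each amount from 1 to target is solved once from smaller amounts, so the running time is proportional to target * len and there is no recursion depth to worry about.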

How can I resolve this recursive overflow?

#define MIN -2147483648
long max(long x,long y)
{
long m=x;
if(y>x)
m=y;
return m;
}
long f(int x,int y,int **p)
{
long result;
if(x<0||y<0)
result = MIN;
else
if(x==0&&y==0)
result = p[0][0];
else
result = max(f(x-1,y,p),f(x,y-1,p))+p[x][y];
return result;
}
int main(void)
{
int n;
scanf("%d",&n);
int** p = (int **)malloc(n*sizeof(int*));
for(int i=0;i<n;i++)
{
p[i] = (int*)malloc(n*sizeof(int));
for(int j=0;j<n;j++)
scanf("%d",p[i]+j);
}
printf("haha\n");
printf("%ld\n",f(n-1,n-1,p));
return 0;
}
When I assign 10 to n, it works well. But when I assign 20 to n, no result is printed. I googled it and guessed that the error may be a recursive overflow. So how can I resolve this problem?
You're making a very large number of recursive calls. Each call makes two further calls, so the number of calls grows exponentially with N. When N is 20, the call f(0,0) alone is reached once per monotone path through the grid, which is C(38,19), roughly 3.5 * 10^10 times. That takes a long time.
Most of these calls keep recomputing the same values over and over again. Rather than recomputing these values, calculate them only once.
Here's a non-recursive method of doing this:
long f(int x,int y,int **p)
{
long **p2;
int i, j;
p2 = malloc(sizeof(long *)*(x+1));
for (i=0;i<=x;i++) {
p2[i] = malloc(sizeof(long)*(y+1));
for (j=0;j<=y;j++) {
if (i==0 && j==0) {
p2[i][j] = p[i][j];
} else if (i==0) {
p2[i][j] = p2[i][j-1] + p[i][j];
} else if (j==0) {
p2[i][j] = p2[i-1][j] + p[i][j];
} else {
p2[i][j] = max(p2[i-1][j], p2[i][j-1]) + p[i][j];
}
}
}
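long result = p2[x][y];
/* free the table so that repeated calls do not leak memory */
for (i=0;i<=x;i++)
free(p2[i]);
free(p2);
return result;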
}
EDIT:
If you still want a recursive solution, you can do the following. This only makes recursive calls if the necessary values have not yet been computed.
long f(int x,int y,int **p)
{
static long**p2=NULL;
int i, j;
if (!p2) {
p2 = malloc(sizeof(long*)*(x+1));
for (i=0;i<=x;i++) {
p2[i] = malloc(sizeof(long)*(y+1));
for (j=0;j<=y;j++) {
p2[i][j] = MIN;
}
}
}
if (x==0 && y==0) {
p2[x][y] = p[x][y];
} else if (x==0) {
if (p2[x][y-1] == MIN) {
p2[x][y-1] = f(x,y-1,p);
}
p2[x][y] = p2[x][y-1] + p[x][y];
} else if (y==0) {
if (p2[x-1][y] == MIN) {
p2[x-1][y] = f(x-1,y,p);
}
p2[x][y] = p2[x-1][y] + p[x][y];
} else {
if (p2[x][y-1] == MIN) {
p2[x][y-1] = f(x,y-1,p);
}
if (p2[x-1][y] == MIN) {
p2[x-1][y] = f(x-1,y,p);
}
p2[x][y] = max(p2[x-1][y], p2[x][y-1]) + p[x][y];
}
return p2[x][y];
}
You don't specify which compiler you are using. Look into how to increase stack size for your program. However, even if you get it to work for n=20, there will be a limit (may not be far from n=20) due to combinatorial explosion as mentioned in previous comment.
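For n > 0, each call to f makes two more recursive calls, so the total number of calls at least doubles with each level.
For n = 20, that is at least 2^20 calls. Note that only the calls on the current call chain are on the stack at any one time, so the recursion depth itself stays small (about 2n frames); the explosion mainly costs running time.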
EDIT
Actually, according to the documentation, the linker controls the stack size.
You can try increasing the stack size and implementing a more efficient algorithm, as proposed in the other answers.
To increase the stack size with ld (the GNU linker):
--stack reserve
--stack reserve,commit
Specify the number of bytes of memory to reserve (and optionally commit) to be used as stack for this program. The default is 2Mb reserved, 4K committed. [This option is specific to the i386 PE targeted port of the linker]

Optimizing a large if-else branch with binary search

So there is an if-else branch in my program with about 30 if-else statements. This part runs more than 100 times per second, so I saw it as an opportunity to optimize, and made it do a binary search with a function pointer array (practically a balanced tree map) instead of doing linear if-else condition checks. But it ran slower, at about 70% of the previous speed.
I made a simple benchmark program to test the issue, and it gave a similar result: the if-else part runs faster, both with and without compiler optimizations.
I also counted the number of comparisons done, and as expected the binary search did about half as many comparisons as the simple if-else branch. But it still ran 20~30% slower.
I want to know where all my computing time is being wasted, and why the linear if-else runs faster than the logarithmic binary search.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
long long ifElseCount = 0;
long long binaryCount = 0;
int ifElseSearch(int i) {
++ifElseCount;
if (i == 0) {
return 0;
}
++ifElseCount;
if (i == 1) {
return 1;
}
++ifElseCount;
if (i == 2) {
return 2;
}
++ifElseCount;
if (i == 3) {
return 3;
}
++ifElseCount;
if (i == 4) {
return 4;
}
++ifElseCount;
if (i == 5) {
return 5;
}
++ifElseCount;
if (i == 6) {
return 6;
}
++ifElseCount;
if (i == 7) {
return 7;
}
++ifElseCount;
if (i == 8) {
return 8;
}
++ifElseCount;
if (i == 9) {
return 9;
}
}
int getZero(void) {
return 0;
}
int getOne(void) {
return 1;
}
int getTwo(void) {
return 2;
}
int getThree(void) {
return 3;
}
int getFour(void) {
return 4;
}
int getFive(void) {
return 5;
}
int getSix(void) {
return 6;
}
int getSeven(void) {
return 7;
}
int getEight(void) {
return 8;
}
int getNine(void) {
return 9;
}
struct pair {
int n;
int (*getN)(void);
};
struct pair zeroToNine[10] = {
{0, getZero},
{2, getTwo},
{4, getFour},
{6, getSix},
{8, getEight},
{9, getNine},
{7, getSeven},
{5, getFive},
{3, getThree},
{1, getOne},
};
int sortCompare(const void *p, const void *p2) {
if (((struct pair *)p)->n < ((struct pair *)p2)->n) {
return -1;
}
if (((struct pair *)p)->n > ((struct pair *)p2)->n) {
return 1;
}
return 0;
}
int searchCompare(const void *pKey, const void *pElem) {
++binaryCount;
if (*(int *)pKey < ((struct pair *)pElem)->n) {
return -1;
}
if (*(int *)pKey > ((struct pair *)pElem)->n) {
return 1;
}
return 0;
}
int binarySearch(int key) {
return ((struct pair *)bsearch(&key, zeroToNine, 10, sizeof(struct pair), searchCompare))->getN();
}
struct timer {
clock_t start;
clock_t end;
};
void startTimer(struct timer *timer) {
timer->start = clock();
}
void endTimer(struct timer *timer) {
timer->end = clock();
}
double getSecondsPassed(struct timer *timer) {
return (timer->end - timer->start) / (double)CLOCKS_PER_SEC;
}
int main(void) {
#define nTests 500000000
struct timer timer;
int i;
srand((unsigned)time(NULL));
printf("%d\n\n", rand());
for (i = 0; i < 10; ++i) {
printf("%d ", zeroToNine[i].n);
}
printf("\n");
qsort(zeroToNine, 10, sizeof(struct pair), sortCompare);
for (i = 0; i < 10; ++i) {
printf("%d ", zeroToNine[i].n);
}
printf("\n\n");
startTimer(&timer);
for (i = 0; i < nTests; ++i) {
ifElseSearch(rand() % 10);
}
endTimer(&timer);
printf("%f\n", getSecondsPassed(&timer));
startTimer(&timer);
for (i = 0; i < nTests; ++i) {
binarySearch(rand() % 10);
}
endTimer(&timer);
printf("%f\n", getSecondsPassed(&timer));
printf("\n%lli %lli\n", ifElseCount, binaryCount);
return EXIT_SUCCESS;
}
possible output:
78985494
0 2 4 6 8 9 7 5 3 1
0 1 2 3 4 5 6 7 8 9
12.218656
16.496393
2750030239 1449975849
You should look at the generated instructions to see (gcc -S source.c), but generally it comes down to these three:
1) N is too small.
If you only have 8 different branches, you execute an average of 4 checks (assuming equally probable cases, otherwise it could be even faster).
If you make it a binary search, that is log2(8) == 3 checks, but these checks are much more complex, resulting in more code executed overall.
So, unless your N is in the hundreds, it probably doesn't make sense to do this. You could do some profiling to find the actual break-even value for N.
2) Branch prediction is harder.
In case of a linear search, every condition is true in 1/N cases, meaning the compiler and branch predictor can assume no branching, and then recover only once. For a binary search, you likely end up flushing the pipeline once every layer. And for N < 1024, 1/log(N) chance of misprediction actually hurts the performance.
3) Pointers to functions are slow
When executing a pointer to a function you have to get it from memory, then you have to load your function into instruction cache, then execute the call instruction, the function setup and return. You can not inline functions called through a pointer, so that is several extra instructions, plus memory access, plus moving things in/out of the cache. It adds up pretty quickly.
All in all, this only makes sense for a large N, and you should always profile before applying these optimizations.
Use a switch statement.
Compilers are clever. They will produce the most efficient code for your particular values. They will even do a binary search (with inline code) if that is deemed more efficient.
And as a huge benefit, the code is readable, and doesn't require you to make changes in half a dozen places to add a new case.
PS. Obviously your code is a good learning experience. Now you've learned, so don't do it again :-)
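As a rough sketch of the switch approach suggested above: handle_low and handle_high are hypothetical stand-ins for the real branch bodies, and returning -1 for unknown values is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical branch bodies; the real program would do its own work here. */
static int handle_low(int i)
{
    assert(i >= 0 && i <= 4);
    return i * 10;
}

static int handle_high(int i)
{
    assert(i >= 5 && i <= 9);
    return i + 100;
}

/* One switch replaces the chain of if-else tests. The compiler is free to
 * lower it to a jump table, a binary search, or plain compares. */
int dispatch(int i)
{
    switch (i) {
    case 0: case 1: case 2: case 3: case 4:
        return handle_low(i);
    case 5: case 6: case 7: case 8: case 9:
        return handle_high(i);
    default:
        return -1;   /* no matching case */
    }
}
```

Because the handlers are called directly rather than through pointers, the compiler can also inline them, which the function pointer table prevents.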

Multiplying large numbers through strings

I'm trying to write a program that will receive 2 strings representing numbers of any length
(for instance, char *a = "10000000000000";, char *b = "9999999999999999";) and multiply them.
This is what I came up with so far, not sure how to continue (nullify simply fills the whole string with '0'):
char *multiply(char *hnum, const char *other)
{
int num1=0, num2=0, carry=0, hnumL=0, otherL=0, i=0, temp1L=0, temp2L=0, n=0;
char *temp1, *temp2;
if(!hnum || !other) return NULL;
for(hnumL=0; hnum[hnumL] != '\0'; hnumL++);
for(otherL=0; other[otherL] != '\0'; otherL++);
temp1 = (char*)malloc(otherL+hnumL);
if(!temp1) return NULL;
temp2 = (char*)malloc(otherL+hnumL);
if(!temp2) return NULL;
nullify(temp1);
nullify(temp2);
hnumL--;
otherL--;
for(otherL; otherL >= 0; otherL--)
{
carry = 0;
num1 = other[otherL] - '0';
for(hnumL; hnumL >= 0; hnumL--)
{
num2 = hnum[hnumL] - '0';
temp1[i+n] = (char)(((int)'0') + ((num1 * num2 + carry) % 10));
carry = (num1 * num2 + carry) / 10;
i++;
temp1L++;
}
if(carry > 0)
{
temp1[i+n] = (char)(((int)'0') + carry);
temp1L++;
}
p.s. Is there a library that handles this already? Couldn't find anything like it.
On paper, you would probably do as follows:
  999 x 99
 --------
    8991
   8991
 ========
   98901
The process is to multiply individual digits starting from the right of each number and to add them up, keeping a carry in mind each time ("9 times 9 equals 81, write 1, keep 8 in mind"). I'm pretty sure you covered that in elementary school.
The process can be easily put into an algorithm:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
struct result
{
int carry;
int res;
};
/*
* multiply two numbers between 0 and 9 into result.res. If there is a carry, put it into
* result.carry
*/
struct result mul(int a, int b)
{
struct result res;
res.res = a * b;
if (res.res > 9)
{
res.carry = res.res / 10;
res.res %= 10;
}
else
res.carry = 0;
return res;
}
/*
* add
* adds a digit (b) to str at pos. If the result generates a carry,
* it's added also (recursively)
*/
void add(char str[], int pos, int b)
{
int res;
int carry;
res = str[pos] - '0' + b;
if (res > 9)
{
carry = res / 10;
res %= 10;
add(str, pos - 1, carry);
}
str[pos] = res + '0';
}
void nullify(char *numstr, int len)
{
while (--len >= 0)
numstr[len] = '0';
}
int main(void)
{
struct result res;
char *mp1 = "999";
char *mp2 = "999";
char sum[strlen(mp1) + strlen(mp2) + 1];
int i;
int j;
nullify(sum, strlen(mp1) + strlen(mp2));
sum[strlen(mp1) + strlen(mp2)] = '\0'; /* terminate the string before printing it */
for (i = strlen(mp2) - 1; i >= 0; i--)
{
/* iterate from right over second multiplicand */
for (j = strlen(mp1) - 1; j >= 0; j--)
{
/* iterate from right over first multiplicand */
res = mul((mp2[i] - '0'), (mp1[j] - '0'));
add(sum, i + j + 1, res.res); /* add sum */
add(sum, i + j, res.carry); /* add carry */
}
}
printf("%s * %s = %s\n", mp1, mp2, sum);
return 0;
}
This is just the same as on paper, except that you don't need to remember individual summands since we add up everything on the fly.
This might not be the fastest way to do it, but it doesn't need malloc() (provided you have a C99 compiler, otherwise you would need to dynamically allocate sum) and works for arbitrarily long numbers (up to the stack limit, since add() is implemented as a recursive function).
Yes, there are libraries that handle this. It's actually a pretty big subject area that a lot of research has gone into. I haven't looked through your code that closely, but I know that the library implementations of big num operations have very efficient algorithms that you're unlikely to discover on your own. For example, the multiplication routine we all learned in grade school (pre common-core) is an O(n^2) solution to multiplication, but there exist faster methods, such as Karatsuba multiplication, which runs in about O(n^1.585).
The standard GNU C big num library is GNU MP:
https://gmplib.org/

Optimizing I/O(Output) in C code + a loop

I have code that reads around 10^5 ints from stdin and, after performing ## on them, outputs them on stdout. I have taken care of the input part by using setvbuf and reading lines with fgets_unlocked(), then parsing them to get the required ints.
I have 2 issues that I am not able to overcome:
1.) Printing 5 million ints on stdout takes a lot of time. Is there any way to reduce this? (I tried using fwrite(), but the output contains unprintable characters, because fread was used to read into an int buffer.)
2.) After parsing the input for an int, say x, I find its divisors by applying % (mod) in a loop (see the code below). This may also be a reason my code times out.
Any suggestions on how to improve this?
Many thanks.
This is actually a problem from http://www.codechef.com/problems/PD13
# include <stdio.h>
# define SIZE 32*1024
char buf[SIZE];
main(void)
{
int i=0,chk =0;
unsigned int j =0 ,div =0;
int a =0,num =0;
char ch;
setvbuf(stdin,(char*)NULL,_IOFBF,0);
scanf("%d",&chk);
while(getchar_unlocked() != '\n');
while((a = fread_unlocked(buf,1,SIZE,stdin)) >0)
{
for(i=0;i<a;i++)
{
if(buf[i] != '\n')
{
num = (buf[i] - '0')+(10*num);
}
else
if(buf[i] == '\n')
{
div = 1;
for(j=2;j<=(num/2);j++)
{
if((num%j) == 0) // Prob 2
{
div +=j;
}
}
num = 0;
printf("%d\n",div); // problem 1
}
}
}
return 0;
}
You can print far faster than printf.
Look into itoa(), or write your own simple function that converts integers to ascii very quickly.
Here's a quick-n-dirty version of itoa that should work fast for your purposes:
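char* custom_itoa(int i)
{
static char output[24]; // 64-bit MAX_INT is 20 digits
char* p = &output[23];
*p = '\0';
// Emit digits from least to most significant, filling the buffer backwards.
// The do/while guarantees that 0 is printed as "0".
do {
*--p = (char)('0' + i % 10);
i /= 10;
} while (i != 0);
return p;
}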
Note that this function has some serious built-in limits, including:
it doesn't handle negative numbers
it doesn't currently handle numbers greater than 23-characters in decimal form.
it is inherently thread-dangerous. Do not attempt in a multi-threaded environment.
the return value will be corrupted as soon as the function is called again.
I wrote this purely for speed, not for safety or convenience.
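Along the same lines, most of the remaining cost can usually be removed by formatting many numbers into one large buffer and writing it out in a single call. The sketch below is an illustration under assumptions: put_uint_line and write_all are hypothetical names, values are unsigned, and unsigned int is assumed to be at most 32 bits wide (so a line never exceeds 11 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes the decimal digits of v followed by '\n' at dst.
 * Returns the number of bytes written. */
size_t put_uint_line(char *dst, unsigned int v)
{
    char tmp[16];
    size_t n = 0, len = 0;
    do {                          /* digits come out least significant first */
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    while (n > 0)                 /* copy them back in the right order */
        dst[len++] = tmp[--n];
    dst[len++] = '\n';
    return len;
}

/* Formats all values into buf and flushes it with as few fwrite calls as
 * possible. Returns the total number of bytes handed to fwrite successfully. */
size_t write_all(FILE *out, const unsigned int *vals, size_t count,
                 char *buf, size_t bufsize)
{
    size_t used = 0, total = 0;
    assert(bufsize >= 16);        /* room for at least one formatted line */
    for (size_t i = 0; i < count; i++) {
        if (bufsize - used < 16) {            /* flush when nearly full */
            total += fwrite(buf, 1, used, out);
            used = 0;
        }
        used += put_uint_line(buf + used, vals[i]);
    }
    total += fwrite(buf, 1, used, out);
    return total;
}
```

A caller would typically pass stdout and a static buffer of a few hundred kilobytes. Unlike the static-buffer version above, the formatting function writes into caller-provided memory, so earlier results are not overwritten by later calls.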
Version 2 based on suggestion by #UmNyobe and #wildplasser(see above comments)
The code execution took 0.12 seconds and 3.2 MB of memory on the online judge.
I myself checked with 2*10^5 int(input) in the range from 1 to 5*10^5 and the execution took:
real 0m0.443s
user 0m0.408s
sys 0m0.024s
Please see if some more optimization can be done.
/** Solution for the sum of the proper divisor problem from codechef **/
/** # author dZONE **/
# include <stdio.h>
# include <math.h>
# include <stdlib.h>
# include <error.h>
# define SIZE 200000
inline int readnum(void);
void count(int num);
int pft[]={2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,661,673,677,683,691,701,709};
unsigned long long int sum[SIZE];
int k = 0;
inline int readnum(void)
{
int num = 0;
char ch;
while((ch = getchar_unlocked()) != '\n')
{
if(ch >=48 && ch <=57)
{
num = ch -'0' + 10*num;
}
}
if(num ==0)
{
return -1;
}
return num;
}
void count(int num)
{
unsigned int i = 0;
unsigned long long tmp =0,pfac =1;
int flag = 0;
tmp = num;
sum[k] = 1;
for(i=0;i<127;i++)
{
if((tmp % pft[i]) == 0)
{
flag =1; // For Prime numbers not in pft table
pfac =1;
while(tmp % pft[i] == 0)
{
tmp =tmp /pft[i];
pfac *= pft[i];
}
pfac *= pft[i];
sum[k] *= (pfac-1)/(pft[i]-1);
}
}
if(flag ==0)
{
sum[k] = 1;
++k;
return;
}
if(tmp != 1) // For numbers with some prime factors in the pft table + one prime > 709
{
sum[k] *=((tmp*tmp) -1)/(tmp -1);
}
sum[k] -=num;
++k;
return;
}
int main(void)
{
int i=0,terms =0,num = 0;
setvbuf(stdin,(char*)NULL,_IOFBF,0);
scanf("%d",&terms);
while(getchar_unlocked() != '\n');
while(terms--)
{
num = readnum();
if(num ==1)
{
continue;
}
if(num == -1)
{
perror("\n ERROR\n");
return 0;
}
count(num);
}
i =0;
while(i<k)
{
printf("%llu\n",sum[i]);
++i;
}
return 0;
}
// Prob 2 is your biggest issue right now. You just want to find the number of divisors?
My first suggestion would be to cache your results to some degree, but this potentially requires twice the amount of storage you have at the beginning.
What you can do is generate a list of prime numbers beforehand (using the sieve algorithm). Ideally, you know the biggest number N in your list and generate all primes up to its square root. Then, for each number n in your list, you find its representation as a product of prime powers, i.e.
n = a1^p1 * a2^p2 * ... * ak^pk
Then the sum of divisors will be
((a1^(p1+1) - 1)/(a1 - 1)) * ((a2^(p2+1) - 1)/(a2 - 1)) * ... * ((ak^(pk+1) - 1)/(ak - 1))
To see why, take n = 8 = 2^3: 1 + 2 + 4 + 8 = 15 = (16 - 1)/(2 - 1).
This will drastically improve the speed, but integer factorization (what you are really doing) is really costly...
Edit:
In your link the maximum is 5000000 so you have at most 700 primes
Simple decomposition algorithm
void primedecomp(int number, const int* primetable, int* primecount,
int pos,int tablelen){
while(pos < tablelen && number % primetable[pos] !=0 )
pos++;
if(pos == tablelen)
return;
while(number % primetable[pos] ==0 ){
number = number / primetable[pos];
primecount[pos]++;
}
//number has been modified
//too lazy to write a loop, so recursive call
primedecomp(number,primetable,primecount, pos+1,tablelen);
}
EDIT: rather than counting, compute a^(n+1) directly using primepow = a; primepow = a*primepow;
It would be much cleaner in C++ or Java, where you have a hash map. At the end, primecount contains the pi values I was talking about above.
Even if it looks scary, you create the primetable only once. This algorithm runs in O(tablelen) in the worst case, which is O(sqrt(Nmax)), whereas your initial loop ran in O(Nmax).
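Putting the formula and the prime table together, a minimal sketch in C could look like the following. The name sum_proper_divisors and its parameters are hypothetical, and the result is only correct under the assumption that the table contains every prime up to the square root of n:

```c
#include <assert.h>

/* Sum of the proper divisors of n (every divisor except n itself).
 * primes must list, in ascending order, all primes up to sqrt(n). */
unsigned long long sum_proper_divisors(unsigned int n,
                                       const unsigned int *primes, int nprimes)
{
    assert(n >= 1);
    unsigned long long sigma = 1;   /* product of (p^(e+1) - 1)/(p - 1) terms */
    unsigned int m = n;             /* part of n that is not yet factored */
    for (int i = 0; i < nprimes
                    && (unsigned long long)primes[i] * primes[i] <= m; i++) {
        unsigned int p = primes[i];
        unsigned long long term = 1, power = 1;
        while (m % p == 0) {        /* term becomes 1 + p + p^2 + ... + p^e */
            m /= p;
            power *= p;
            term += power;
        }
        sigma *= term;
    }
    if (m > 1)                      /* a single prime factor above sqrt(n) is left */
        sigma *= 1ULL + m;
    return sigma - n;               /* drop n itself to keep proper divisors only */
}
```

Accumulating 1 + p + ... + p^e directly avoids the division in the closed formula, and stopping once p*p exceeds the remaining factor means most numbers need far fewer than tablelen iterations.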

Resources