Summing with OpenMP using C

I've been trying to parallelize this piece of code for about two days and keep running into logical errors. The program approximates an integral as the sum of many small slices of width dx, evaluating the integrand at each discrete point. I am trying to implement this with OpenMP, but I have no experience with it and would appreciate your help. The goal is to split the accumulation into suma across the threads, so that each thread computes fewer values of the integral. The program compiles successfully, but when I execute it, it returns wrong results.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[]) {
    float down = 1, up = 100, dx, suma = 0, j;
    int steps, i, nthreads, tid;
    long starttime, finishtime, runtime;

    starttime = omp_get_wtime();
    steps = atoi(argv[1]);
    dx = (up - down) / steps;
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();

    #pragma omp parallel for private(i, j, tid) reduction(+:suma)
    for (i = 0; i < steps; i++) {
        for (j = (steps / nthreads) * tid; j < (steps / nthreads) * (tid + 1); j += dx) {
            suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
        }
    }

    printf("For %d steps the area of the integral 3 * x^2 + 1 from %f to %f is: %f\n", steps, down, up, suma);
    finishtime = omp_get_wtime();
    runtime = finishtime - starttime;
    printf("Runtime: %ld\n", runtime);
    return 0;
}

The problem lies within your for-loop. If you use the for-pragma, OpenMP does the loop splitting for you:
#pragma omp parallel for private(i) reduction(+:suma)
for (i = 0; i < steps; i++) {
    // recover the x-position of the i-th step
    float x = down + i * dx;
    // evaluate the function at x
    float y = 3.0f * x * x + 1;
    // add the area of the rectangle to the overall integral
    suma += y * dx;
}
Even if you converted to a parallelisation scheme where you computed the indices yourself, this code would be problematic: the outer loop should then be executed only nthreads times, not steps times.
You should also consider switching to double for increased accuracy.
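Putting those fixes together, here is a minimal self-contained sketch of a corrected program (my assembly of the pieces above, using the trapezoidal rule discussed later in this thread; note also that omp_get_wtime() returns a double, not a long, so the timing variables are changed accordingly):
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    double down = 1, up = 100, suma = 0;
    int steps, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <steps>\n", argv[0]);
        return 1;
    }
    steps = atoi(argv[1]);
    double dx = (up - down) / steps;

    double starttime = omp_get_wtime();  // seconds, as a double

    #pragma omp parallel for reduction(+:suma)
    for (i = 0; i < steps; i++) {
        double x = down + i * dx;   // left edge of the i-th slice
        // trapezoidal average of the integrand over the slice
        double y = ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;
        suma += y * dx;
    }

    double runtime = omp_get_wtime() - starttime;
    printf("For %d steps the area of the integral 3 * x^2 + 1 from %f to %f is: %f\n",
           steps, down, up, suma);
    printf("Runtime: %f s\n", runtime);
    return 0;
}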

Let's just consider the threads=1 case. This:
#pragma omp parallel for private(i, j, tid) reduction(+:suma)
for (i = 0; i < steps; i++) {
    for (j = (steps / nthreads) * tid; j < (steps / nthreads) * (tid + 1); j += dx) {
        suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
    }
}
turns into this:
for (i = 0; i < steps; i++) {
    for (j = 0; j < steps; j += dx) {
        suma += ((j * j * j) + ((j + dx) * (j + dx) * (j + dx))) / 2 * dx;
    }
}
and you can start to see the problem; you're basically looping over steps² points.
In addition, your second loop doesn't make any sense, as you're incrementing by dx. That same confusion between indices (i, j) and locations in the physical domain (i*dx) shows up in your increment: j += dx doesn't make any sense. Presumably you want to be increasing suma by (f(x) + f(x'))*dx/2 (i.e., the trapezoidal rule); that should be
float x = down + i*dx;
suma += dx * ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;
As ebo points out, you want to be summing the integrand, not its antiderivative.
Now if we include a check on the answer:
printf("For %d steps the area of the integral 3 * x^2 + 1 from %f to %f is: %f (expected: %f)\n",
steps, down, up, suma, up*up*up-down*down*down + up - down);
and we run it in serial, we start getting the right answer:
$ ./foo 10
For 10 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1004949.375000 (expected: 1000098.000000)
Runtime: 0
$ ./foo 100
For 100 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000146.562500 (expected: 1000098.000000)
Runtime: 0
$ ./foo 1000
For 1000 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0
There's no point at all in worrying about the OpenMP case until the serial case works.
Once it comes time to OpenMP this, as ebo points out, the easiest thing to do is just to let OpenMP do the loop decomposition for you, e.g.:
#pragma omp parallel for reduction(+:suma)
for (i = 0; i < steps; i++) {
    float x = down + i * dx;
    suma += dx * ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;
}
Running this, one gets
$ setenv OMP_NUM_THREADS 1
$ ./foo 1000
For 1000 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 2
$ ./foo 1000
For 1000 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.437500 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 4
$ ./foo 1000
For 1000 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.625000 (expected: 1000098.000000)
Runtime: 0
$ setenv OMP_NUM_THREADS 8
$ ./foo 1000
For 1000 steps the area of the integral 3 * x^2 + 1 from 1.000000 to 100.000000 is: 1000098.500000 (expected: 1000098.000000)
One can do the blocking explicitly in OpenMP if you really want to, but you should have a reason for doing that.
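For illustration only, a sketch of what such explicit blocking could look like: each thread computes its own index range inside a plain parallel region. This also highlights that omp_get_num_threads() must be called inside the parallel region; in the original code it was called outside, where it returns 1.
#pragma omp parallel reduction(+:suma)
{
    int nthreads = omp_get_num_threads();   // valid here, inside the region
    int tid = omp_get_thread_num();
    int chunk = (steps + nthreads - 1) / nthreads;   // ceiling division
    int start = tid * chunk;
    int end = (start + chunk < steps) ? start + chunk : steps;

    for (int i = start; i < end; i++) {
        float x = down + i * dx;
        suma += dx * ((3 * x * x + 1) + (3 * (x + dx) * (x + dx) + 1)) / 2;
    }
}
This just re-implements what the for pragma already does, typically with worse load balancing options, which is why you should have a reason before doing it by hand.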

Related

How do I make C threads do a cyclic sum?

So I have this program I'm working on, and the gist of it is that I need to do some operations with threads, following this scheme: the j-th thread Hj calculates a group of 100 consecutive iterations of the sum, making a cyclic distribution of the groups among all the threads. For example, if H = 4, thread H2 does the calculation of iterations [100..199, 500..599, 900..999, ...].
To ensure no data races occur, the threads must each work on a different sum variable.
Then, after joining the threads, compare the result achieved by the threads with the one computed sequentially.
Here is the code:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#include <sys/time.h>

#define H 4

double res[H] = {0};

// Time difference helper (seconds)
float restar_tiempo(struct timeval *inicio, struct timeval *fin) {
    return (fin->tv_sec - inicio->tv_sec) + 1e-6 * (fin->tv_usec - inicio->tv_usec);
}

// Thread function
void *_hilo(void *arg) {
    int a = *((int *)arg);
    double pi = 0;
    double n = 100 * a;
    while (n < 10000000) {
        res[a] += (pow(-1, n) / pow(4, n)) * ((2 / (4 * n + 1)) + (2 / (4 * n + 2)) + (1 / (4 * n + 3)));
        pi++;
        n++;
        if ((int) n % 100 == 0)
            n += (H - 1) * 100;
    }
    printf("Thread result[%d]: %f\n", a, res[a]);
    pthread_exit(NULL);
}

int main() {
    pthread_t hilo[H];
    struct timeval in, mid, fin;

    gettimeofday(&in, NULL);
    for (int i = 0; i < H; i++) {
        int *p = malloc(sizeof(int));
        *p = i;
        printf("This is i: %d\n", i);
        res[i] = 0;
        if (pthread_create(&hilo[i], NULL, _hilo, p) != 0) {
            perror("Error creating thread");
            exit(EXIT_FAILURE);
        }
        free(p);
    }
    // Join
    for (int i = 0; i < H; i++)
        pthread_join(hilo[i], NULL);
    // Partial sums
    double f = 0;
    for (int i = 0; i < H; i++) {
        printf("Partial result of thread %d: %f\n", i, res[i]);
        f += res[i];
    }
    // Total of the partial sums
    printf("Total result: %lf\n", f);
    gettimeofday(&mid, NULL);
    // Sequential sum
    double s = 0;
    for (double n = 0; n < 10000000; n++)
        s += (pow(-1, n) / pow(4, n)) * ((2 / (4 * n + 1)) + (2 / (4 * n + 2)) + (1 / (4 * n + 3)));
    // Print sequential result
    printf("Sequential result: %f\n", s);
    gettimeofday(&fin, NULL);
    // Result difference
    printf("Difference results: %f\n", fabs(f - s));
    // Threaded time
    printf("Time per threads: %f\n", restar_tiempo(&in, &mid));
    // Sequential time
    printf("Sequential time: %f\n", restar_tiempo(&mid, &fin));
    // Time difference
    printf("Difference times: %f\n", restar_tiempo(&in, &mid) - restar_tiempo(&mid, &fin));
    return 0;
}
I can compile everything without warnings, but when I execute the program, the result produced by the first thread is erratic: it changes between executions (the rest of the threads display 0 because they work with very small terms).
Example with some added prints inside the thread function and after doing the join:
First execution:
This is i:0
This is i:1
This is i:2
This is i:3
//Inside thread funct
Thread result[2]: 0.000000
Thread result[2]: 0.000000
Thread result[3]: 0.000000
Thread result[0]: 3.141593
//After join
Partial result of thread 0: 3.141593
Partial result of thread 1: 0.000000
Partial result of thread 2: 0.000000
Partial result of thread 3: 0.000000
Total result: 3.141593
Sequential result: 3.141593
Difference results: 0.000000
Time per threads: 0.183857
Sequential time: 0.034788
Difference times: 0.149069
Second execution:
This is i:0
This is i:1
This is i:2
This is i:3
Thread result[2]: 0.000000
Thread result[0]: 6.470162
Thread result[0]: 6.470162
Thread result[3]: 0.000000
Partial result of thread 0: 6.470162
Partial result of thread 1: 0.000000
Partial result of thread 2: 0.000000
Partial result of thread 3: 0.000000
Total result: 6.470162
Sequential result: 3.141593
Difference results: 3.328570
Time per threads: 0.189794
Sequential time: 0.374017
Difference times: -0.184223
How can I make the sum work properly?
I think it has something to do with arg in the function _hilo, or the subsequent cast to int a.
Okay, I solved it, but I don't fully understand why this caused issues: I just deleted the free(p) statement and now it works like a charm. If someone can enlighten me on why this happens, I'll be grateful.
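Why deleting free(p) helps: pthread_create only starts the thread, it does not wait for the thread to read its argument. Freeing p immediately afterwards means the new thread may dereference already-freed (and possibly reused) memory, so two threads can end up with the same index, which is consistent with the duplicated "Thread result[2]" lines in the output above. A minimal sketch of the usual fix, passing the index by value instead of through the heap (hilo_by_value is an illustrative name, not from the original code):
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define H 4

double res[H] = {0};

// The index travels by value inside the void* argument, so there is
// nothing for main to free and no use-after-free race.
static void *hilo_by_value(void *arg) {
    int a = (int)(intptr_t)arg;
    res[a] = a;   // placeholder for the real summation body
    return NULL;
}

int main(void) {
    pthread_t hilo[H];
    for (int i = 0; i < H; i++)
        pthread_create(&hilo[i], NULL, hilo_by_value, (void *)(intptr_t)i);
    for (int i = 0; i < H; i++)
        pthread_join(hilo[i], NULL);
    for (int i = 0; i < H; i++)
        printf("res[%d] = %f\n", i, res[i]);
    return 0;
}
Keeping the malloc but calling free(p) inside the thread, after reading *arg, would also work.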

Poor maths performance in C vs Python/numpy

Near-duplicate / related:
How does BLAS get such extreme performance? (If you want fast matmul in C, seriously just use a good BLAS library unless you want to hand-tune your own asm version.) But that doesn't mean it's not interesting to see what happens when you compile less-optimized matrix code.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
Matrix Multiplication with blocks
Out of interest, I decided to compare the performance of (inexpertly) handwritten C vs. Python/numpy performing a simple matrix multiplication of two large, square matrices filled with random numbers from 0 to 1.
I found that Python/numpy outperformed my C code by over 10,000x. This is clearly not right, so what is wrong with my C code that causes it to perform so poorly? (Even compiled with -O3 or -Ofast.)
The python:
import time
import numpy as np
t0 = time.time()
m1 = np.random.rand(2000, 2000)
m2 = np.random.rand(2000, 2000)
t1 = time.time()
m3 = m1 @ m2
t2 = time.time()
print('creation time: ', t1 - t0, ' \n multiplication time: ', t2 - t1)
The C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    clock_t t0 = clock(), t1, t2;
    // create matrices and allocate memory
    int m_size = 2000;
    int i, j, k;
    double running_sum;
    double *m1[m_size], *m2[m_size], *m3[m_size];
    double f_rand_max = (double)RAND_MAX;
    for (i = 0; i < m_size; i++) {
        m1[i] = (double *)malloc(sizeof(double) * m_size);
        m2[i] = (double *)malloc(sizeof(double) * m_size);
        m3[i] = (double *)malloc(sizeof(double) * m_size);
    }
    // populate with random numbers 0 - 1
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++) {
            m1[i][j] = (double)rand() / f_rand_max;
            m2[i][j] = (double)rand() / f_rand_max;
        }
    t1 = clock();
    // multiply together
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++) {
            running_sum = 0;
            for (k = 0; k < m_size; k++)
                running_sum += m1[i][k] * m2[k][j];
            m3[i][j] = running_sum;
        }
    t2 = clock();
    float t01 = (float)(t1 - t0) / CLOCKS_PER_SEC;
    float t12 = (float)(t2 - t1) / CLOCKS_PER_SEC;
    printf("creation time: %f", t01);
    printf("\nmultiplication time: %f", t12);
    return 0;
}
EDIT: I have corrected the Python to do a proper dot product, which closes the gap a little, and the C to time with a resolution of microseconds and to use the comparable double data type, rather than float, as originally posted.
Outputs:
$ gcc -O3 -march=native bench.c
$ ./a.out
creation time: 0.092651
multiplication time: 139.945068
$ python3 bench.py
creation time: 0.1473407745361328
multiplication time: 0.329038143157959
It has been pointed out that the naive algorithm implemented here in C could be improved in ways that lend themselves to make better use of compiler optimisations and the cache.
EDIT: Having modified the C code to transpose the second matrix in order to achieve a more efficient access pattern, the gap closes further.
The modified multiplication code:
// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        m3[j][i] = m2[i][j];
// swap the row pointers so m2 now refers to the transposed data
for (i = 0; i < m_size; i++) {
    double *mtemp = m3[i];
    m3[i] = m2[i];
    m2[i] = mtemp;
}
// multiply together
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++) {
        running_sum = 0;
        for (k = 0; k < m_size; k++)
            running_sum += m1[i][k] * m2[j][k];
        m3[i][j] = running_sum;
    }
The results:
$ gcc -O3 -march=native bench2.c
$ ./a.out
creation time: 0.107767
multiplication time: 10.843431
$ python3 bench.py
creation time: 0.1488208770751953
multiplication time: 0.3335080146789551
EDIT: Compiling with -Ofast, which I am reassured is a fair comparison, brings the difference down to just over an order of magnitude (in numpy's favour).
$ gcc -Ofast -march=native bench2.c
$ ./a.out
creation time: 0.098201
multiplication time: 4.766985
$ python3 bench.py
creation time: 0.13812589645385742
multiplication time: 0.3441300392150879
EDIT: It was suggested to change the indexing from arr[i][j] to arr[i*m_size + j]; this yielded a small performance increase:
for m_size = 10000
$ gcc -Ofast -march=native bench3.c # indexed by arr[ i * m_size + j ]
$ ./a.out
creation time: 1.280863
multiplication time: 626.327820
$ gcc -Ofast -march=native bench2.c # indexed by arr[i][j]
$ ./a.out
creation time: 2.410230
multiplication time: 708.979980
$ python3 bench.py
creation time: 3.8284950256347656
multiplication time: 39.06089973449707
The up-to-date code, bench3.c:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    clock_t t0, t1, t2;
    t0 = clock();
    // create matrices and allocate memory
    int m_size = 10000;
    int i, j, k, x, y;
    double running_sum;
    double *m1 = (double *)malloc(sizeof(double) * m_size * m_size),
           *m2 = (double *)malloc(sizeof(double) * m_size * m_size),
           *m3 = (double *)malloc(sizeof(double) * m_size * m_size);
    double f_rand_max = (double)RAND_MAX;
    // populate with random numbers 0 - 1
    for (i = 0; i < m_size; i++) {
        x = i * m_size;
        for (j = 0; j < m_size; j++) {
            m1[x + j] = ((double)rand()) / f_rand_max;
            m2[x + j] = ((double)rand()) / f_rand_max;
            m3[x + j] = ((double)rand()) / f_rand_max;
        }
    }
    t1 = clock();
    // transpose m2 in order to capitalise on cache efficiencies
    // store transposed matrix in m3 for now
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++)
            m3[j * m_size + i] = m2[i * m_size + j];
    // swap the pointers
    double *mtemp = m3;
    m3 = m2;
    m2 = mtemp;
    // multiply together
    for (i = 0; i < m_size; i++) {
        x = i * m_size;
        for (j = 0; j < m_size; j++) {
            running_sum = 0;
            y = j * m_size;
            for (k = 0; k < m_size; k++)
                running_sum += m1[x + k] * m2[y + k];
            m3[x + j] = running_sum;
        }
    }
    t2 = clock();
    float t01 = (float)(t1 - t0) / CLOCKS_PER_SEC;
    float t12 = (float)(t2 - t1) / CLOCKS_PER_SEC;
    printf("creation time: %f", t01);
    printf("\nmultiplication time: %f", t12);
    return 0;
}
CONCLUSION: So the original absurd factor-of-10,000 difference was largely due to mistakenly comparing element-wise multiplication in Python/numpy against C code that was not compiled with all of the available optimisations and was written with a highly inefficient memory access pattern that likely didn't utilise the cache.
A 'fair' comparison (i.e. a correct but highly inefficient single-threaded algorithm, compiled with -Ofast) yields a performance factor difference of about 350x.
A number of simple edits to improve the memory access pattern brought the comparison down to a factor of 16x (in numpy's favour) for large-matrix (10000 x 10000) multiplication. Furthermore, numpy automatically utilises all four virtual cores on my machine whereas this C code does not, so the per-core difference could be a factor of 4x to 8x (depending on how well this program would scale across hyperthreads). I consider a factor of 4x to 8x fairly sensible, given that I don't really know what I'm doing and just knocked a bit of code together, whereas numpy is based on BLAS, which I understand has been extensively optimised over the years by experts from all over the place, so I consider the question answered/solved.
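For anyone pushing the handwritten version further, the usual next step (see the 'Matrix Multiplication with blocks' link above) is loop tiling, so each block of m1 and m2 is reused while it is still in cache. A minimal sketch under the bench3.c layout (flat arrays, m2 already transposed); it assumes m_size is a multiple of B and that m3 has been zeroed first:
#define B 100   // block size; a tuning knob, must divide m_size in this sketch

for (int ii = 0; ii < m_size; ii += B)
    for (int jj = 0; jj < m_size; jj += B)
        for (int kk = 0; kk < m_size; kk += B)
            // multiply one B x B block of m1 by one B x B block of m2
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++) {
                    double sum = 0;
                    for (int k = kk; k < kk + B; k++)
                        sum += m1[i * m_size + k] * m2[j * m_size + k];
                    m3[i * m_size + j] += sum;   // accumulate across kk blocks
                }
This still won't approach BLAS, which adds vectorisation, prefetching, and careful register blocking on top.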

Optimizing neighbor count function for conway's game of life in C

Having some trouble optimizing a function that returns the number of neighbors of a cell in a Conway's Game of Life implementation. I'm trying to learn C and just get better at coding. I'm not very good at recognizing potential optimizations, and I've spent a lot of time online reading various methods but it's not really clicking for me yet.
Specifically I'm trying to figure out how to unroll this nested for loop in the most efficient way, but each time I try I just make the runtime longer.
I'm including the function; I don't think any other context is needed. Thanks for any advice you can give!
Here is the code for the countNeighbors() function:
static int countNeighbors(board b, int x, int y)
{
    int n = 0;
    int x_left = max(0, x-1);
    int x_right = min(HEIGHT, x+2);
    int y_left = max(0, y-1);
    int y_right = min(WIDTH, y+2);
    int xx, yy;
    for (xx = x_left; xx < x_right; ++xx) {
        for (yy = y_left; yy < y_right; ++yy) {
            n += b[xx][yy];
        }
    }
    return n - b[x][y];
}
Instead of declaring the board as b[WIDTH][HEIGHT], declare it as b[WIDTH + 2][HEIGHT + 2]. This gives an extra margin of zeros around the edges, which prevents out-of-bounds indexing. So, instead of:
x x
x x
We will have:
0 0 0 0
0 x x 0
0 x x 0
0 0 0 0
x denotes used cells, 0 will be unused.
A typical trade-off: a bit of memory for speed.
Thanks to that, we don't have to call the min and max functions (whose if statements are bad for performance here).
Finally, I would write your function like this:
int countNeighborsFast(board b, int x, int y)
{
    int n = 0;
    n += b[x-1][y-1];
    n += b[x][y-1];
    n += b[x+1][y-1];
    n += b[x-1][y];
    n += b[x+1][y];
    n += b[x-1][y+1];
    n += b[x][y+1];
    n += b[x+1][y+1];
    return n;
}
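For completeness, a minimal sketch of the padded-board setup this relies on (dimensions and typedef are illustrative, not from the original answer); the zeroed one-cell margin is what makes the unguarded b[x-1][y-1] accesses safe:
#define WIDTH  100
#define HEIGHT 100

typedef int board[WIDTH + 2][HEIGHT + 2];

// Zero-initialise everything once; live cells are only ever written at
// indices 1..WIDTH and 1..HEIGHT, so the margin stays 0 forever and
// behaves as a ring of permanently dead neighbours.
board b = {0};

// e.g. set a live cell (x and y are 1-based because of the margin):
// b[x][y] = 1;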
Benchmark (updated)
Full, working source code.
Thanks to Jongware's comment, I added linearization (reducing the array's dimensions from 2 to 1) and changed int to char.
I also made the main loop linear and calculate the returned sum directly, without an intermediate n variable.
The 2D array was 10002 x 10002; the 1D version had 100040004 elements.
The CPU I have is a Pentium Dual-Core T4500 at 2.30 GHz, further details here (output of cat /proc/cpuinfo).
Results on default optimization level O0:
Original: 15.50s
Mine: 10.13s
Linear: 2.51s
LinearAndChars: 2.48s
LinearAndCharsAndLinearLoop: 2.32s
LinearAndCharsAndLinearLoopAndSum: 1.53s
That's about 10x faster compared to the original version.
Results on O2:
Original: 6.42s
Mine: 4.17s
Linear: 0.55s
LinearAndChars: 0.53s
LinearAndCharsAndLinearLoop: 0.42s
LinearAndCharsAndLinearLoopAndSum: 0.44s
About 15x faster.
On O3:
Original: 10.44s
Mine: 1.47s
Linear: 0.26s
LinearAndChars: 0.26s
LinearAndCharsAndLinearLoop: 0.25s
LinearAndCharsAndLinearLoopAndSum: 0.24s
About 44x faster.
The last version, LinearAndCharsAndLinearLoopAndSum is:
typedef char board3[(HEIGHT + 2) * (WIDTH + 2)];

int i;
for (i = WIDTH + 3; i <= (WIDTH + 2) * (HEIGHT + 1) - 2; i++)
    countNeighborsLinearAndCharsAndLinearLoopAndSum(b3, i);

int countNeighborsLinearAndCharsAndLinearLoopAndSum(board3 b, int pos)
{
    return
        b[pos - 1 - (WIDTH + 2)] +
        b[pos - (WIDTH + 2)] +
        b[pos + 1 - (WIDTH + 2)] +
        b[pos - 1] +
        b[pos + 1] +
        b[pos - 1 + (WIDTH + 2)] +
        b[pos + (WIDTH + 2)] +
        b[pos + 1 + (WIDTH + 2)];
}
Changing 1 + (WIDTH + 2) to WIDTH + 3 won't help, because the compiler takes care of that anyway (even at optimization level O0).

Numerical Evaluation of Pi

I would like to evaluate Pi approximately by running the following code, which fits a regular polygon of n sides inside a circle with unit diameter and calculates its perimeter using the recurrence in the code. However, the output after the 34th term is 0 when long double is used, or it increases without bound when double is used. How can I remedy this situation? Any suggestion or help is appreciated and welcome.
Thanks
P.S: Operating system: Ubuntu 12.04 LTS 32-bit, Compiler: GCC 4.6.3
#include <stdio.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>

#define increment 0.25

int main()
{
    int i = 0, k = 0, n[6] = {3, 6, 12, 24, 48, 96};
    double per[61] = {0}, per2[6] = {0};

    // Since the algorithm is recursive we need to seed the perimeter for n = 3
    per[3] = 0.5 * 3 * sqrtl(3);
    for (i = 3; i <= 60; i++)
    {
        per[i + 1] = powl(2, i) * sqrtl(2 * (1.0 - sqrtl(1.0 - (per[i] / powl(2, i)) * (per[i] / powl(2, i)))));
        printf("%d %f \n", i, per[i]);
    }
    return 0;
}
Some ideas:
Use y = (1.0 - x)*(1.0 + x) instead of y = 1.0 - x*x. This helps with one stage of "subtraction of nearly equal values", but I am still stuck on the next 1.0 - sqrtl(y) as y approaches 1.0.
// per[i + 1] = powl(2, i) * sqrtl(2 * (1.0 - sqrtl(1.0 - (per[i] / powl(2, i)) * (per[i] / powl(2, i)))));
long double p = powl(2, i);
// per[i + 1] = p * sqrtl(2 * (1.0 - sqrtl(1.0 - (per[i] / p) * (per[i] / p))));
long double x = per[i] / p;
// per[i + 1] = p * sqrtl(2 * (1.0 - sqrtl(1.0 - x * x)));
// per[i + 1] = p * sqrtl(2 * (1.0 - sqrtl((1.0 - x)*(1.0 + x)) ));
long double y = (1.0 - x)*( 1.0 + x);
per[i + 1] = p * sqrtl(2 * (1.0 - sqrtl(y) ));
Change the array size or the for() bound:
double per[61+1] = { 0 }; // Add 1 here
...
for (i = 3; i <= 60; i++) {
...
per[i + 1] =
The following is a similar method for pi:
unsigned n = 6;
double sine = 0.5;
double cosine = sqrt(0.75);
double pi = n * sine;
static const double mpi = 3.1415926535897932384626433832795;
do {
    sine = sqrt((1 - cosine) / 2);
    cosine = sqrt((1 + cosine) / 2);
    n *= 2;
    pi = n * sine;
    printf("%6u s:%.17e c:%.17e pi:%.17e %%:%.6e\n", n, sine, cosine, pi, (pi - mpi) / mpi);
} while (n < 500000);
Subtracting 1.0 from a nearly-1.0 number leads to "catastrophic cancellation", where the relative error of a floating-point calculation skyrockets due to the loss of significant digits. Try evaluating pow(2, i) - (pow(2, i) - 1.0) for each i between 0 and 60 and you'll see what I mean.
The only real solution to this issue is reorganizing your equations to avoid subtracting nearly-equal nonzero quantities. For more details, see Acton, Real Computing Made Real, or Higham, Accuracy and Stability of Numerical Algorithms.
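To make such a reorganisation concrete, here is a sketch (mine, not from the original answers) applying the conjugate trick to the half-angle iteration above: since (1 - cosine)/2 equals sine*sine / (2*(1 + cosine)), the subtraction of nearly equal values disappears entirely:
#include <stdio.h>
#include <math.h>

int main(void)
{
    unsigned n = 6;
    double sine = 0.5;
    double cosine = sqrt(0.75);
    static const double mpi = 3.1415926535897932384626433832795;
    do {
        // Stable half-angle step: (1 - cosine)/2 == sine*sine / (2*(1 + cosine)),
        // which avoids subtracting two nearly equal numbers as cosine -> 1.
        sine = sine / sqrt(2 * (1 + cosine));   // must use the old cosine
        cosine = sqrt((1 + cosine) / 2);
        n *= 2;
        double pi = n * sine;
        printf("%6u pi:%.17e rel.err:%.6e\n", n, pi, (pi - mpi) / mpi);
    } while (n < 500000);
    return 0;
}
The same identity, 1 - sqrt(1 - x*x) == x*x / (1 + sqrt(1 - x*x)), can be applied to the question's per[] recurrence to remove its cancellation as well.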

The outermost for loop does not work as intended

I have been using Ubuntu 12.04 LTS with GCC to compile the code for my assignments for a while. However, recently I have run into two issues, as follows:
The following code calculates zero for a nonzero value when the second formula is used.
There is a large amount of error in the calculation of the integral of the standard normal distribution from 0 to 5 or more standard deviations.
How can I remedy these issues? I am especially obsessed with the first one. Any help or suggestion is appreciated. Thanks in advance.
The code is as follows:
#include <stdio.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>

#define N 599

long double
factorial(long double n)
{
    // Here s is the free parameter which is increased by one in each step;
    // initialising the product pro to 1 also covers the case of zero factorial.
    int s = 1;
    long double pro = 1;
    if (n < 0) {
        printf("Factorial is not defined for a negative number \n");
        return 0;
    }
    while (n >= s) {
        pro *= s;
        s++;
    }
    return pro;
}

int main()
{
    // Since the function given is the standard normal distribution
    // probability density function we have mean = 0 and variance = 1.
    // Hence we also have z = x, while dealing with only positive values of
    // x and keeping in mind that the PDF is symmetric around the mean.
    long double *summand1 = malloc((N + 1) * sizeof(long double)); // indices 0..N
    long double *summand2 = malloc((N + 1) * sizeof(long double));
    int p = 0, k, z[5] = {0, 3, 5, 10, 20};
    long double sum1[5] = {0}, sum2[5] = {0}, factor = 1.0;

    for (p = 0; p <= 4; p++)
    {
        for (k = 0; k <= N; k++)
        {
            summand1[k] = (1 / sqrtl(M_PI * 2)) * powl(-1, k) * powl(z[p], 2 * k + 1) / (factorial(k) * (2 * k + 1) * powl(2, k));
            sum1[p] += summand1[k];
        }
        // WolframAlpha gives the same value here
        for (k = 0; k <= N; k++)
        {
            factor *= (2 * k + 1);
            summand2[k] = (1 / sqrtl(M_PI * 2)) * powl(z[p], 2 * k + 1) / factor;
            //printf("%Le \n", factor);
            sum2[p] += summand2[k];
        }
        sum2[p] = sum2[p] * expl((-powl(z[p], 2)) / 2);
    }
    for (p = 0; p < 4; p++)
    {
        printf("The sum obtained for z between %d - %d \n"
               "using the first formula is %Lf \n", z[p], z[p+1], sum1[p+1]);
        printf("The sum obtained for z between %d - %d \n"
               "using the second formula is %Lf \n", z[p], z[p+1], sum2[p+1]);
    }
    return 0;
}
The working code, without the outermost for loop, is:
#include <stdio.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>

#define N 1200

long double
factorial(long double n)
{
    // Here s is the free parameter which is increased by one in each step;
    // initialising the product pro to 1 also covers the case of zero factorial.
    int s = 1;
    long double pro = 1;
    if (n < 0) {
        printf("Factorial is not defined for a negative number \n");
        return 0;
    }
    while (n >= s) {
        pro *= s;
        s++;
    }
    return pro;
}

int main()
{
    // Since the function given is the standard normal distribution
    // probability density function we have mean = 0 and variance = 1.
    // Hence we also have z = x, while dealing with only positive values of
    // x and keeping in mind that the PDF is symmetric around the mean.
    long double *summand1 = malloc((N + 1) * sizeof(long double)); // indices 0..N
    long double *summand2 = malloc((N + 1) * sizeof(long double));
    int k, z = 3;
    long double sum1 = 0, sum2 = 0, pro = 1.0;

    for (k = 0; k <= N; k++)
    {
        summand1[k] = (1 / sqrtl(M_PI * 2)) * powl(-1, k) * powl(z, 2 * k + 1) / (factorial(k) * (2 * k + 1) * powl(2, k));
        sum1 += summand1[k];
    }
    // WolframAlpha gives the same value here
    printf("The sum obtained for z between 0-3 using the first formula is %Lf \n", sum1);
    for (k = 0; k <= N; k++)
    {
        pro *= (2 * k + 1);
        summand2[k] = (1 / sqrtl(M_PI * 2)) * powl(z, 2 * k + 1) / pro;
        //printf("%Le \n", pro);
        sum2 += summand2[k];
    }
    sum2 = sum2 * expl((-powl(z, 2)) / 2);
    printf("The sum obtained for z between 0-3 using the second formula is %Lf \n", sum2);
    return 0;
}
I'm quite certain that the problem is factor not being reset to 1 in the outer loop:
factor *= (2 * k + 1);
(in the loop that calculates sum2). The second version you provide, the one that works, only ever runs with z = 3. In the first version, however, since you never reset factor between iterations of p, by the time you reach z[2] it is already a huge number.
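In code, the fix is one line at the top of the outer loop (sketched against the first version; the comment elisions stand for the existing inner loops):
for (p = 0; p <= 4; p++)
{
    factor = 1.0;   // reset the running product 1*3*5*... for each new z[p]
    // ... inner loop over k accumulating sum1[p] ...
    // ... inner loop over k accumulating sum2[p], which multiplies factor ...
    sum2[p] = sum2[p] * expl((-powl(z[p], 2)) / 2);
}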
EDIT: Possible help with precision.
Basically you have a huge number, powl(z[p], 2 * k + 1), divided by another huge number, factor. Huge floating-point numbers lose precision. The way to avoid that is to perform the division as soon as possible. Instead of first calculating powl(z[p], 2 * k + 1) and then dividing by factor:
(z[p] * z[p] * ... * z[p]) / (1 * 3 * 5 * ... * (2*k+1))
rearrange the calculation as:
(z[p]/1) * (z[p]*z[p]/3) * (z[p]*z[p]/5) * ... * (z[p]*z[p]/(2*k+1))
You can do this in the summand2 calculation, and a similar trick in summand1.
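A minimal sketch of that rearrangement (my illustration, not code from the original answer): each term is obtained from the previous one by multiplying by z*z/(2k+1), so no huge intermediate power or factorial is ever formed:
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int N = 1200;
    const long double z = 3;

    // Each term of z/1 + z^3/(1*3) + z^5/(1*3*5) + ... comes from the
    // previous one via term *= z*z / (2k+1), so the division happens
    // early and the intermediate values stay moderate in size.
    long double term = z;   // k = 0 term: z / 1
    long double sum = term;
    for (int k = 1; k <= N; k++) {
        term *= z * z / (2 * k + 1);
        sum += term;
    }
    sum *= expl(-z * z / 2) / sqrtl(2 * M_PI);

    printf("The sum obtained for z between 0-3 using the second formula is %Lf\n", sum);
    return 0;
}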
