Parallelization for Monte Carlo pi approximation - C

I am writing a C program to parallelize pi approximation with OpenMP. I think my code works fine, with convincing output. I am running it with 4 threads now. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the thread action in this code?
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

double sample_interval(double a, double b) {
    double x = ((double) rand())/((double) RAND_MAX);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int i;
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
    {
        srand(time(NULL));
        for (int i=0; i < N; ++i)
        {
            x = sample_interval(-1.,1.);
            y = sample_interval(-1.,1.);
            z = ((x*x)+(y*y));
            if (z <= 1)
            {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter/ (double)N;
    printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
    return 0;
}
Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
This looks OK for now. Your suggestions on the race condition and seed placement are most welcome.

There are a few problems in your code that I can see. The main one, from my standpoint, is that it isn't parallelized. Or more precisely, you didn't enable the OpenMP parallelism you introduced when compiling it. Here is how one can see that:
The way the code is written, the main for loop should be executed in full by all the threads (there is no worksharing here: no #pragma omp parallel for, only a #pragma omp parallel). Therefore, since you set the number of threads to 4, the global number of iterations should be 4*N, and your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty much what I got. However, when I don't enable OpenMP, the output is similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time for getting a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for, for example; a sketch of that variant follows the corrected code below)
But that is if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to use another random number generator, one that allows you to privatize its state. That could be rand_r() for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly exceed the period of the RNG and repeat the same sequence over and over. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want your parallel sequences to be uncorrelated with each other. And just using a different seed value per thread doesn't guarantee that (although with a wide-enough RNG, you have a bit of headroom for it).
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per thread, based on the thread id for example, while keeping in mind that this won't ensure proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to eventually have overlapping series).
With this done (along with a few minor fixes), your code becomes, for example, the following:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

typedef struct drand48_data RNGstate;

double sample_interval(double a, double b, RNGstate *state) {
    double x;
    drand48_r(state, &x);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;
    time_t ctime = time(NULL);

    #pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
    {
        RNGstate state;
        srand48_r(ctime + omp_get_thread_num(), &state);
        for (int i=0; i < N; ++i) {
            x = sample_interval(-1, 1, &state);
            y = sample_interval(-1, 1, &state);
            z = ((x*x)+(y*y));
            if (z <= 1) {
                counter++;
            }
        }
    }

    // cast to double before multiplying: NumThreads * N overflows int for large N
    double approx_pi = 4.0 * counter / ((double)NumThreads * N);
    printf("%i %1.6e %1.6e\n", N, approx_pi, fabs(approx_pi - pi) / pi);
    return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp
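If you prefer the worksharing variant mentioned above, a #pragma omp for splits the N iterations among the threads, so the total count stays N and there is nothing to divide by NumThreads. A sketch of just the changed part (same RNG setup as above):

    #pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
    {
        RNGstate state;
        srand48_r(ctime + omp_get_thread_num(), &state);
        #pragma omp for
        for (int i=0; i < N; ++i) {
            x = sample_interval(-1, 1, &state);
            y = sample_interval(-1, 1, &state);
            z = ((x*x)+(y*y));
            if (z <= 1) {
                counter++;
            }
        }
    }
    // the estimate then goes back to:
    double approx_pi = 4.0 * counter / (double)N;

Run it with the point count as argument, e.g. ./pi_omp 1000000.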

Related

Unable to figure out where the race condition occurs in OpenMP program in C

I am trying to integrate sin(x) from 0 to pi, but every time I run the program I get different outputs. I know it is because of a race condition, but I am unable to figure out where the problem lies.
This is my code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include <time.h>

#define NUM_THREADS 4

static long num_steps = 10000000;

float rand_generator(float a)
{
    //srand((unsigned int)time(NULL));
    return ((float)rand()/(float)(RAND_MAX)) * a;
}

int main(int argc, char *argv[])
{
    // srand((unsigned int)time(NULL));
    omp_set_num_threads(NUM_THREADS);
    float result;
    float sum[NUM_THREADS];
    float area = 3.14;
    int nthreads;

    #pragma omp parallel
    {
        int id, nthrds;
        id = omp_get_thread_num();
        sum[id] = 0.0;
        printf("%d\n", id);
        nthrds = omp_get_num_threads();
        printf("%d\n", nthrds);
        //if(id==0) nthreads=nthrds;
        for (int i = id; i < num_steps; i = i + nthrds)
        {
            //float y=rand_generator(1);
            //printf("%f\n",y );
            float x = rand_generator(3.14);
            sum[id] += sin(x);
        }
        //printf(" sum is: %lf\n", sum);
        //float p=(float)sum/num_steps*area;
    }

    float p = 0.0;
    for (int i = 0; i < NUM_THREADS; ++i)
    {
        p += (sum[i]/num_steps)*area;
    }
    printf(" p is: %lf\n", p);
}
I tried adding #pragma omp atomic but it doesn't help either.
Any help will be appreciated :).
The problem comes from the use of rand(). rand() is not thread-safe. The reason is that it uses a common state for all the calls and is thus sensitive to races. See Using stdlib's rand() from multiple threads.
There is a thread-safe random generator called rand_r(). Instead of storing the generator state in a hidden global variable, the state is a parameter to the function and can be made thread-local.
You can use it like that
float rand_generator_r(float a, unsigned int *state)
{
    //srand((unsigned int)time(NULL));
    return ((float)rand_r(state)/(float)(RAND_MAX)) * a;
}
In your parallel block, add :
unsigned int rand_state=id*time(NULL); // or whatever thread dependent seed
and in your code call
float x = rand_generator_r(3.14, &rand_state);
and it should work.
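Putting the pieces together, the parallel block would look roughly like this (a sketch assembled from the snippets above, not a full program):

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        unsigned int rand_state = id*time(NULL); // or whatever thread-dependent seed
        sum[id] = 0.0;
        for (int i = id; i < num_steps; i = i + nthrds)
        {
            float x = rand_generator_r(3.14, &rand_state);
            sum[id] += sin(x);
        }
    }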
By the way, I have the impression that there is false sharing in your code that could slow down performance.
float sum[NUM_THREADS];
It is modified by all threads and is very likely to be stored in a single cache line. Every store (and there are many stores to it) will invalidate the line in all other caches, and that may significantly slow down performance.
You should ensure that the values are in different cache lines with:
#define CACHE_LINE_SIZE 64
struct {
    float s;
    char padding[CACHE_LINE_SIZE - sizeof(float)];
} sum_nofalse_sharing[NUM_THREADS];
and in your code, accumulate in sum_nofalse_sharing[id].s
Alternatively, create a local sum in the parallel block and write its value to sum[id] at the end, as in the sketch below.
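A minimal sketch of that alternative (the accumulator is thread-private, so the shared array is written only once per thread):

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        unsigned int rand_state = id*time(NULL);
        float local_sum = 0.0f; // lives on the thread's own stack, no false sharing
        for (int i = id; i < num_steps; i = i + nthrds)
        {
            float x = rand_generator_r(3.14, &rand_state);
            local_sum += sin(x);
        }
        sum[id] = local_sum; // single store per thread
    }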

Why does the time for this simple program to run double if run quickly in succession?

I have been working through the introductory OpenMP examples, and on the first multithreaded example - a numerical integration to pi - I knew the bit about false sharing would be coming, so I implemented the following:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"

#define STEPS 100000000.0
#define MAX_THREADS 4

void pi(double start, double end, double **sum);

int main(){
    double *sum[MAX_THREADS];
    omp_set_num_threads(MAX_THREADS);
    double inc;
    bool set_inc = false;
    double start = omp_get_wtime();

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        #pragma omp critical
        if(!set_inc){
            int num_threads = omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc = 1.0/num_threads;
            set_inc = true;
        }
        pi(ID*inc, (ID+1)*inc, &sum[ID]);
    }

    double end = omp_get_wtime();
    double tot = 0.0;
    for(int i=0; i<MAX_THREADS; i++){
        tot = tot + *sum[i];
    }
    tot = tot/STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", tot, end-start);
    return 0;
}

void pi(double start, double end, double **sum_ptr){
    double *sum = (double *) calloc(1, sizeof(double));
    for(double i=start; i<end; i=i+1/STEPS){
        *sum = *sum + 4.0/(1.0+i*i);
    }
    *sum_ptr = sum;
}
My idea was that by using calloc, the probability of the returned pointers being contiguous, and thus being pulled into the same cache lines, was virtually zero (though I'm a tad unsure why there would be false sharing anyway, since a double is 64 bits here and I thought my cache lines were 8 bytes as well, so enlighten me there too -- I now realize cache lines are typically 64 bytes, not bits).
For fun, after compiling I ran the program in quick succession, and here's a short example of what I got (I was definitely pushing up-arrow and enter in the terminal at more than one press per half second):
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.104703 secs.
user@user-kubuntu:~/git/openmp-practice$ ./pi_mp.exe
Using 4 threads.
The value of pi is: 3.14159273. Took 0.196900 secs.
I thought that maybe something was happening because of the way I tried to avoid the false sharing, and since I am still ignorant about the complete happenings among the levels of memory, I chalked it up to that. So I followed the method prescribed in the tutorial, using a "critical" section like so:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include "omp.h"

#define STEPS 100000000.0
#define MAX_THREADS 4

double pi(double start, double end);

int main(){
    double sum = 0.0;
    omp_set_num_threads(MAX_THREADS);
    double inc;
    bool set_inc = false;
    double start = omp_get_wtime();

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        #pragma omp critical
        if(!set_inc){
            int num_threads = omp_get_num_threads();
            printf("Using %d threads.\n", num_threads);
            inc = 1.0/num_threads;
            set_inc = true;
        }
        double temp = pi(ID*inc, (ID+1)*inc);
        #pragma omp critical
        sum += temp;
    }

    double end = omp_get_wtime();
    sum = sum/STEPS;
    printf("The value of pi is: %.8f. Took %f secs.\n", sum, end-start);
    return 0;
}

double pi(double start, double end){
    double sum = 0.0;
    for(double i=start; i<end; i=i+1/STEPS){
        sum = sum + 4.0/(1.0+i*i);
    }
    return sum;
}
The doubling in run time is virtually identical. What's the explanation for this? Does it have anything to do with low-level memory? And can you answer my intermediate question above?
Thanks a lot.
Edit:
The compiler is gcc 7 on Kubuntu 17.10. The options used were -fopenmp -W -o (in that order).
The system specs include an i5 6500 @ 3.2GHz and 16 GB of DDR4 RAM (though I forget its clock speed).
As some have asked, the program time does not continue to double if run more than twice in quick succession. After the initial doubling, it remains at around the same time (~.2 secs) for as many successive runs as I have tested (5+). After waiting a second or two, the run time returns to the lesser amount. However, when the runs are not launched manually in succession but rather in one command line such as ./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe;./pi_mp.exe; I get:
The value of pi is: 3.14159273. Took 0.100528 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.097707 secs.
Using 4 threads.
The value of pi is: 3.14159273. Took 0.098078 secs.
...
Adding gcc optimization options (-O3) had no change on any of the results.

Increase of execution time while using multithreaded FFTW

I am new to the FFTW library. I have successfully implemented 1D and 2D FFTs using FFTW. When I converted my 2D FFT code into a multithreaded 2D FFT, the results were the complete opposite of what I expected: the multithreaded 2D FFT code takes longer to run than the serialized 2D FFT code. I am missing something somewhere. I followed all the instructions given in the FFTW documentation to parallelize the code.
This is my parallelized 2D FFT C program:
#include <mpi.h>
#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 2000
#define M 2000
#define index(i, j) (j + i*M)

int i, j;

void get_input(fftw_complex *in) {
    for(i=0; i<N; i++){
        for(j=0; j<M; j++){
            in[index(i, j)][0] = sin(i + j);
            in[index(i, j)][1] = sin(i * j);
        }
    }
}

void show_out(fftw_complex *out){
    for(i=0; i<N; i++){
        for(j=0; j<M; j++){
            printf("%lf %lf \n", out[index(i, j)][0], out[index(i, j)][1]);
        }
    }
}

int main(){
    clock_t start, end;
    double time_taken;
    start = clock();
    int a = fftw_init_threads();
    printf("%d\n", a);

    fftw_complex *in, *out;
    fftw_plan p;
    in = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
    out = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
    get_input(in);

    fftw_plan_with_nthreads(4);
    p = fftw_plan_dft_2d(N, M, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    /*p = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    puts("In Real Domain");
    show_out(out);*/

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup_threads();

    end = clock();
    time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%g \n", time_taken);
    return 0;
}
Can someone please help me by pointing out the mistake I am making?
That kind of behavior is typical of incorrect binding.
Generally speaking, OpenMP threads should all be bound to cores of the same socket in order to avoid NUMA effects (which can make performance suboptimal, or even worse).
Also, make sure MPI tasks are correctly bound (one task should be bound to several cores from the same socket, and you should use one OpenMP thread per core).
Because of MPI, there is a risk your OpenMP threads end up time-sharing.
First, I recommend you print both the MPI and OpenMP bindings.
How to achieve that depends on both the MPI library and the OpenMP runtime. If you use Open MPI and the Intel compilers, you can run KMP_AFFINITY=verbose mpirun --report-bindings --tag-output ...
Then, as suggested earlier, I recommend you start simple and increase complexity:
1 MPI task and 1 OpenMP thread
1 MPI task and x OpenMP threads (x is the number of cores on one socket)
x MPI tasks and 1 OpenMP thread per task
x MPI tasks and y OpenMP threads per task
Hopefully, 2 will be faster than 1, and 4 will be faster than 3.
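For example (assuming the binary is called fftw_test and one socket has 4 cores; the exact flags depend on your MPI library), those four steps could look like:

    OMP_NUM_THREADS=1 mpirun -np 1 ./fftw_test   # 1 MPI task, 1 OpenMP thread
    OMP_NUM_THREADS=4 mpirun -np 1 ./fftw_test   # 1 MPI task, 4 OpenMP threads
    OMP_NUM_THREADS=1 mpirun -np 4 ./fftw_test   # 4 MPI tasks, 1 thread each
    OMP_NUM_THREADS=4 mpirun -np 4 ./fftw_test   # 4 MPI tasks, 4 threads each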

How to measure overall performance of parallel programs (with papi)

I asked myself what would be the best way to measure the performance (in FLOPS) of a parallel program. I read about PAPI_flops(). This seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.
I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance from the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"

int main(int argc, char *argv[])
{
    int i, j, l, k, n, m, idx, iter;
    int mat, mat_min, mat_max;
    int threads;
    double *A, *B, *C;
    double alpha = 1.0, beta = 0.0;
    float rtime1, rtime2, ptime1, ptime2, mflops;
    long long flpops;

    #pragma omp parallel
    {
        #pragma omp master
        threads = omp_get_num_threads();
    }

    if(argc < 4){
        printf("pass me 3 arguments!\n");
        return( -1 );
    }
    else
    {
        mat_min = atoi(argv[1]);
        mat_max = atoi(argv[2]);
        iter = atoi(argv[3]);
    }

    m = mat_max; n = mat_max; k = mat_max;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
    A = (double *) malloc( m*k * sizeof(double) );
    B = (double *) malloc( k*n * sizeof(double) );
    C = (double *) malloc( m*n * sizeof(double) );

    printf (" Intializing matrix data \n\n");
    for (i = 0; i < (m*k); i++)
        A[i] = (double)(i+1);
    for (i = 0; i < (k*n); i++)
        B[i] = (double)(-i-1);
    memset(C, 0, m*n*sizeof(double));

    // actual measurement
    for(mat=mat_min; mat<=mat_max; mat+=5)
    {
        m = mat; n = mat; k = mat;
        for( idx=-1; idx<iter; idx++ ){
            PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, A, k, B, n, beta, C, n);
            PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
        }
        printf("%d threads: %d in %f sec, %f MFLOPS\n", threads, mat, rtime2-rtime1, mflops);
        fflush(stdout);
    }
    printf("Done\n"); fflush(stdout);
    free(A);
    free(B);
    free(C);
    return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
We can see from the execution time that the gemm function scales. But the FLOPS I am measuring reflect only the performance of thread 0.
My question is: How can I measure the overall performance? I am grateful for any input.
First, I'm just curious - why do you need the FLOPS? Don't you just care how much time is taken? Or maybe the time taken compared to other BLAS libraries?
PAPI is thread-based, which is not much help on its own here.
What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (HT is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
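As a sketch of that approach, using omp_get_wtime() for wall-clock time and the 2*n^3 operation count already mentioned in the question (identifiers as in the question's code):

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
    double t1 = omp_get_wtime();
    // FLOPS across all threads combined, derived from time rather than counters
    double gflops = 2.0 * (double)m * n * k / (t1 - t0) / 1e9;
    printf("%f sec, %f GFLOPS\n", t1 - t0, gflops);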
Other than that, there are 2 things you can do to really measure it:
1. Use whatever you use now, but inject your start/end measurement code around the BLAS call. One way to do that (in Linux) is by pre-loading a library that defines pthread_create, using your own function that calls the original but does some extra measurement (see the sketch after this list). Another way is to override the function pointer when the process is already running (= trampoline). In Linux that lives in the GOT/PLT; in Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or better yet, report the number of floating point instructions executed. A little problem with this is that SSE instructions multiply or add 2 or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
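A minimal sketch of the pre-loading idea from point 1 (the file name shim.c, the build line, and the printed message are illustrative; real measurement code would go where the comment indicates):

    // shim.c -- build: gcc -shared -fPIC shim.c -o shim.so -ldl
    // run:            LD_PRELOAD=./shim.so ./your_benchmark
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>

    static int (*real_pthread_create)(pthread_t *, const pthread_attr_t *,
                                      void *(*)(void *), void *);

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg)
    {
        if (!real_pthread_create)
            real_pthread_create = (int (*)(pthread_t *, const pthread_attr_t *,
                                           void *(*)(void *), void *))
                                  dlsym(RTLD_NEXT, "pthread_create");
        // start/attach your per-thread measurement here before handing off
        fprintf(stderr, "intercepted pthread_create\n");
        return real_pthread_create(thread, attr, start_routine, arg);
    }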

OpenMP - dot product

I am implementing a parallel dot product in OpenMP.
I have this code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <omp.h>

#define SIZE 1000

int main (int argc, char *argv[]) {
    float u[SIZE], v[SIZE], dp, dpp;
    int i, j, tid;
    dp = 0.0;
    for(i=0; i<SIZE; i++){
        u[i] = 1.0*(i+1);
        v[i] = 1.0*(i+2);
    }
    printf("\n values of u and v:\n");
    for (i=0; i<SIZE; i++){
        printf(" u[%d]= %.1f\t v[%d]= %.1f\n", i, u[i], i, v[i]);
    }

    #pragma omp parallel shared(u,v,dp,dpp) private (tid,i)
    {
        tid = omp_get_thread_num();
        #pragma omp for private (i)
        for(i=0; i<SIZE; i++){
            dpp += u[i]*v[i];
            printf("thread: %d\n", tid);
        }
        #pragma omp critical
        {
            dp = dpp;
            printf("thread %d\n", tid);
        }
    }
    printf("\n dot product is %f\n", dp);
}
I am compiling it with: pgcc -B -Mconcur -Minfo -o prog prog.c
And the result I get in the console is:
33, Loop not parallelized: innermost
39, Loop not vectorized/parallelized: contains call
48, Loop not vectorized/parallelized: contains call
What am I doing wrong? From my point of view, everything looks OK.
First of all, a simple 1,000-element dot product does not have enough computational cost to justify multi-threading --- you will pay so much more in communication and synchronization costs than you will gain in performance that it is not worth it.
Secondly, it looks like you are computing the full dot product in each thread, not dividing the computation across multiple threads and combining the result at the end.
Here is an example of how to do vector dot products from https://computing.llnl.gov/tutorials/openMP/#SHARED
#include <omp.h>
#include <stdio.h>

int main ()
{
    int i, n, chunk;
    float a[100], b[100], result;

    /* Some initializations */
    n = 100;
    chunk = 10;
    result = 0.0;
    for (i=0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    #pragma omp parallel for       \
        default(shared) private(i) \
        schedule(static,chunk)     \
        reduction(+:result)
    for (i=0; i < n; i++)
        result += (a[i] * b[i]);

    printf("Final result= %f\n", result);
    return 0;
}
Basically, OpenMP is good for doing coarse-grained parallelism when you have large, expensive loops. In general, when you are doing parallel programming, the larger the "chunks" of computation you can do before re-synchronizing, the better. Especially as the number of cores grows, the communication and synchronization costs will grow. Pretend that each synchronization (grabbing a new index or chunk of indexes to execute, entering a critical section, etc.) costs you 10ms, or 1M instructions to get a better idea of when/where/how to parallelize your code.
The problem is still the same as in your latest question: you are accumulating values into a variable, and you must tell OpenMP how to do that:
#pragma omp for reduction(+: dpp)
for(size_t i=0; i<SIZE; i++){
    dpp += u[i]*v[i];
}
Use a loop-local variable for the index, and that is all you need; forget about all the stuff you are doing around it. If you want to see what the compiler makes of your code, compile with -S and check the assembler output. This can be very instructive, because you then learn what simple statements like that amount to when they are parallelized.
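For instance, with gcc (file names as in the question; any compiler with an assembly-output flag works similarly):

    gcc -fopenmp -O2 -S prog.c -o prog.s   # then read prog.s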
And don't use int for loop indices. Sizes and stuff like that are size_t.
