I am learning MPI and doing some exercises to understand some aspects of it. I have written a program that should perform a simple Monte Carlo simulation.
It has two nested loops: an outer one over the time steps T and a smaller inner one over the number of molecules N. After I attempt to move every molecule, the program goes to the next time step.
I tried to parallelize it by dividing the operations on the molecules among the different processors. Unfortunately the code, which works for 1 processor, prints the wrong results for total_E when p>1.
The problem probably lies in the following function, more precisely in the call to MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
I don't understand why at all. What am I doing wrong? (besides using a primitive parallelization strategy)
My logic was that for every time step I could calculate the moves on the molecules on the different processors. Unfortunately, while I work with the local vectors local_r on the various processors, to calculate the energy difference local_DE, I need the global vector r since the energy of the i-th molecule depends on all the others. Therefore I thought to call MPI_Allgather since I have to update the global vector as well as the local ones.
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
return ;
Here is the complete "working" code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
#define N 100
#define L 5.0
#define T_ 5000
#define delta 2.0
void Step(double (*)(double,double),double*,double*,double*,int,int);
double H(double ,double );
double E(double (*)(double,double),double* ,double*,int ,int );
double E_single(double (*)(double,double),double* ,double*,int ,int ,int);
double * pos_ini(void);
double periodic(double );
double dist(double , double );
double sign(double );
int main(int argc,char** argv){
if (argc < 2) {
printf("./program <outfile>\n");
int my_rank;
int p;
FILE* outfile = fopen(argv[1],"w");
double total_E,E_;
int n;
n = N/p;
int t;
double * r = calloc(N,sizeof(double)),*local_r = calloc(n,sizeof(double));
for(t = 0;t<=T_;t++){
if(t ==0){
r = pos_ini();
MPI_Scatter(r,n,MPI_DOUBLE, local_r,n,MPI_DOUBLE, 0, MPI_COMM_WORLD);
E_ = E(H,local_r,r,n,my_rank);
total_E = 0;
if(my_rank == 0){
return 0;
double sign(double a){
if(a < 0){
return -1.0 ;
return 1.0 ;
double periodic(double a){
if(sqrt(a*a) > L/2.0){
a = a - sign(a)*L;
return a;
double dist(double a, double b){
double d = a-b;
d = periodic(d);
return sqrt(d*d);
double * pos_ini(void){
double * r = calloc(N,sizeof(double));
int i;
for(i = 0;i<N;i++){
r[i] = ((double) lrand48()/RAND_MAX)*L - L/2.0;
return r;
double H(double a,double b){
return exp(-dist(a,b)*dist(a,b))/dist(a,b);
return 0.0;
double E(double (*H)(double,double),double* local_r,double*r,int n,int my_rank){
double local_V = 0;
int i;
for(i = 0;i<n;i++){
local_V += E_single(H,local_r,r,i,n,my_rank);
local_V *= 0.5;
return local_V;
double E_single(double (*H)(double,double),double* local_r,double*r,int i,int n,int my_rank){
double local_V = 0;
int j;
for(j = 0;j<N;j++){
if( (i + n*my_rank) != j ){
return local_V;
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
return ;
You cannot expect to get the same energy with different numbers of MPI processes for one simple reason: the generated configurations differ greatly depending on how many processes there are. The cause is not MPI_Allgather but the way the Monte Carlo sweeps are performed.
Given one process, you attempt to move atom 1, then atom 2, then atom 3, and so on, until you reach atom N. Each attempt sees the configuration resulting from the previous one, which is fine.
Given two processes, you attempt to move atom 1 while at the same time attempting to move atom N/2. Atom 1 does not see the eventual displacement of atom N/2, nor the other way round, but then atoms 2 and N/2+1 see the displacement of both atom 1 and atom N/2. You end up with two partial configurations that you simply merge with the all-gather. This is not equivalent to the previous case, where a single process does all the MC attempts. The same applies when there are more than two processes.
There is another source of difference - the pseudo-random number (PRN) sequence. The sequence produced by the repeated calls to lrand48() in one process is not the same as the combined sequence produced by multiple independent calls to lrand48() in different processes, therefore even if you sequentialise the trials, still the acceptance will differ due to the locally different PRN sequences.
Forget about the specific values of the energy produced after each step. In a proper MC simulation those are insignificant. What counts is the average value over a large number of steps. Those averages should be the same (within a certain margin proportional to 1/sqrt(N), where N is the number of steps averaged over) no matter which update algorithm is used.
It has been a long time since I last used MPI, but it seems that your program hangs because only some of the processes try to "gather" and update the data, and it is unpredictable which processes will need to do the gathering.
In that case, a simple solution is to have the remaining processes send some dummy data that the others can simply ignore. For instance,
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
// filter out the dummy data out of "r" here
} else {
MPI_Allgather(dummy_sendbuf, n, MPI_DOUBLE, dummy_recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD);
Dummy data could be some exceptional wrong numbers which should not be in the results, so other processes could filter them out.
But as I mentioned, this is quite wasteful: you don't really need to receive that much data from every process, and we would like to avoid it, especially when there is a lot of data to send.
In this case, you can first gather some "flags" from the other processes so that every process knows which ones have data to send.
// pseudocode
// for example, place 1 at local_flags[my_rank] if it's got data to send, otherwise 0
MPI_Allgather(local_flags, n, MPI_BYTE, recv_flags, n, MPI_BYTE, MPI_COMM_WORLD)
// so now all the processes know which processes will send
// receive data from those processes
I remember with MPI_Allgatherv, you could specify the number of elements to receive from a specific process. Here's an example: http://mpi.deino.net/mpi_functions/MPI_Allgatherv.html
But bear in mind that this might be overkill if the program is not well parallelized. For example, in your case the call is placed inside a loop, so the processes without data still need to wait for the next gathering of the flags.
You should take MPI_Allgather() outside the for loop. I tested with the following code, but note that I modified the lines involving RAND_MAX in order to get consistent results. With these changes, the code gives the same answer for 1, 2, and 4 processes.
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
//local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = local_r[i] + delta*((double)lrand48()-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
//if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX )
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48() )
(*E_) += local_DE;
local_r[i] = local_rt[i];
return ;
The code below is a direct translation from a YouTube video on estimating pi using OpenMP and Monte Carlo. Even with the same inputs I am not getting their output. In fact, I seem to get about half the expected value.
int main() {
int num; // number of iterations
printf("Enter number of iterations you want the loop to run for: ");
scanf_s("%d", &num);
double x, y, z, pi;
long long int i;
int count = 0;
int num_thread;
printf("Enter number of threads you want to run to parallelize the process:\t");
scanf_s("%d", &num_thread);
#pragma omp parallel firstprivate(x,y,z,i) shared(count) num_threads(num_thread)
srand((int)time(NULL) ^ omp_get_thread_num());
for (i = 0; i < num; i++) {
x = (double)rand() / (double)RAND_MAX;
y = (double)rand() / (double)RAND_MAX;
z = pow(((x * x) + (y * y)), .5);
if (z <= 1) {
pi = ((double)count / (double)(num * num_thread)) * 4;
printf("The value of pi obtained is %f\n", pi);
return 0;
I've also used a similar algorithm straight from the Oak Ridge National Laboratory's website (https://www.olcf.ornl.gov/tutorials/monte-carlo-pi/):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
int main(int argc, char* argv[])
int niter = 1000000; //number of iterations per FOR loop
double x,y; //x,y value for the random coordinate
int i; //loop counter
int count=0; //Count holds all the number of how many good coordinates
double z; //Used to check if x^2+y^2<=1
double pi; //holds approx value of pi
int numthreads = 16;
#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
for (i=0; i<niter; ++i) //main loop
x = (double)random()/RAND_MAX; //gets a random x coordinate
y = (double)random()/RAND_MAX; //gets a random y coordinate
z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle
if (z<=1)
++count; //if it is, consider it a valid random point
//print the value of each thread/rank
pi = ((double)count/(double)(niter*numthreads))*4.0;
printf("Pi: %f\n", pi);
return 0;
And I have the exact same problem, so I think the cause isn't the code but somehow my machine.
I am running Visual Studio 2022 on Windows 11 with a 16-core i9-12900KF and 32 GB of RAM.
Edit: I forgot to mention I did alter the second algorithm to use srand() and rand() instead.
There are many errors in the code:
As pointed out by @JeromeRichard and @JohnBollinger, rand/srand/random are not thread-safe; you should use a thread-safe alternative.
There is a race condition at the line ++count; (different threads read and write a shared variable). You should use a reduction to avoid it.
The code assumes that you use numthreads threads, but OpenMP does not guarantee that you actually get all of the threads you requested. If you got PI/2 as a result, I think the cause is the difference between the requested and the obtained number of threads. If you use #pragma omp parallel for... before the loop, you do not need any assumption about the number of threads (i.e. in this case the formula for PI does not contain the number of threads).
A minor comment: you do not need the time-consuming pow function.
Putting it together your code should be something like this:
#pragma omp parallel for reduction(+:count) num_threads(num_thread)
for (long long int i = 0; i < num; i++) {
    const double x = threadsafe_random_number_between_0_1();
    const double y = threadsafe_random_number_between_0_1();
    const double z = x * x + y * y;
    if (z <= 1) {
        count++;
    }
}
double pi = ((double) count / (double) num ) * 4.0;
One guess, though I may be wrong: you seed the random generator with the time, so different threads may end up using the same seed and therefore generate the same random numbers. The result will then be really poor, because the same values are counted several times. This is a problem for the Monte Carlo method, where identical points distort the result.
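Putting the points from these answers together, here is one possible self-contained sketch. It replaces rand() with a small generator whose state is private to each thread, seeds each thread differently, and counts hits with a reduction. The generator (a splitmix64-style mixer) and the names next_u64, next_unit and estimate_pi are choices made for this illustration; they are not taken from the video or the ORNL tutorial.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* splitmix64-style step; the state lives in the caller, so threads share nothing. */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform double in [0, 1) built from the top 53 bits. */
static double next_unit(uint64_t *state)
{
    return (double)(next_u64(state) >> 11) * (1.0 / 9007199254740992.0);
}

/* Estimate pi from num points in total, however many threads actually run. */
double estimate_pi(long long num, int nthreads, uint64_t seed)
{
    assert(num > 0 && nthreads > 0);
    long long count = 0;
    #pragma omp parallel num_threads(nthreads) reduction(+:count)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Derive a well-mixed, per-thread starting state from seed and tid. */
        uint64_t seeder = seed + (uint64_t)tid;
        uint64_t state = next_u64(&seeder);

        #pragma omp for
        for (long long i = 0; i < num; i++) {
            double x = next_unit(&state);
            double y = next_unit(&state);
            if (x * x + y * y <= 1.0)  /* no sqrt or pow needed */
                count++;
        }
    }
    return 4.0 * (double)count / (double)num;
}
```

Because the loop is shared among the threads with omp for, the divisor is simply num, so the estimate no longer depends on how many threads were actually granted. A call such as estimate_pi(1000000, 16, (uint64_t)time(NULL)) would mirror the original programs. Compiled without OpenMP support, the pragmas are ignored and the function runs serially.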
For my CS assignment we were asked to create a program to approximate pi using Viete's Formula. I have done that, however, I don't exactly like my code and was wondering if there was a way I could do it without using two while loops.
(My professor is asking us to use a while loop, so I want to keep at least one!)
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
int main()
double n, x, out, c, t, count, approx;
printf("enter the number of iterations to approximate pi\n");
scanf("%lf", &n);
c = 1;
out = 1;
t = 0;
count = 1;
x = sqrt(2);
while (count<=n)
while (c<t)
printf("%lf is the approximation of pi\n", approx);
I just feel like my code could somehow be simpler, but I'm not sure how to simplify it.
Consider how many times the inner loop runs in each iteration of the outer loop
on the first iteration, it does not run at all (c == t == 1)
on each subsequent iteration, it runs exactly once (as t has been incremented once since the last iteration of the outer loop).
So you could replace this inner while with an if:
if (count > 1) {
Once you do that, t and c are completely unnecessary and can be eliminated.
If you change the initial value of x (before the loop), you could have the first iteration calculate it here, thus getting rid of the if too. That leaves a minimal loop:
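out = 1;
count = 1;
x = 0;
while (count<=n) {
    x = sqrt(2 + x);      /* first pass gives sqrt(2), as before */
    out = out * (x / 2);
    count = count + 1;
}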
I just feel like my code could somehow be simpler, but I'm not sure how to simplify it.
I don't like the fact that I am using two while loops. I was wondering if there was a way to code this program using only one, rather than the two I am currently using
Seems simple enough to use a single loop.
In OP's code, the while (c < t) loop could be replaced with if (c < t) and achieve the same outcome: the loop body is executed either once or not at all. With an adjustment of the initial c or t, the block could be executed exactly once each time, making the test unnecessary.
A few additional adjustments are in Viete().
#include <stdio.h>
#include <math.h>
double Viete(unsigned n) {
const char *pi = "pi 3.141592653589793238462643383...";
printf("m_pi=%.17f\n", acos(-1));
double term = sqrt(2.0);
double v = 1.0;
while (n-- > 0) {
v = v * term / 2;
printf("v_pi=%.17f %u\n", 2 / v, n);
term = sqrt(2 + term);
return 2 / v;
int op_pi(unsigned n) {
unsigned c = 1;
unsigned t = 0;
unsigned count = 1;
double out = 1;
double x = sqrt(2);
while (count <= n) {
t = t + 1;
// while (c < t) {
// or
if (c < t) {
x = sqrt(2 + x);
c = c + 1;
out = out * (x / 2);
count = count + 1;
printf("%lf is the approximation of pi %u\n", 2 / out, count);
double approx = 2 / out;
printf("%lf is the approximation of pi\n", approx);
int main(void) {
2.828427 is the approximation of pi 2
3.061467 is the approximation of pi 3
3.121445 is the approximation of pi 4
3.136548 is the approximation of pi 5
3.140331 is the approximation of pi 6
3.140331 is the approximation of pi
pi 3.141592653589793238462643383...
v_pi=2.82842712474618985 4
v_pi=3.06146745892071825 3
v_pi=3.12144515225805197 2
v_pi=3.13654849054593887 1
v_pi=3.14033115695475251 0
pi 3.141592653589793238462643383...
Additional minor simplifications possible.
I am struggling to figure out how to parallelize this code with OpenMP, any help is appreciated. Below is the base code and a description.
In the simulation of a collection of soft particles (such as proteins in a fluid), there is a repulsive force between a pair of particles when they overlap. The goal of this assignment is to use parallel computing to accelerate the computation of these repulsive forces, using multiple cores with OpenMP.
In the force repulsion function, the particles are assumed to have unit radius. The particles are in a “simulation box” of dimensions L × L × L. The dimension L is chosen such that the volume fraction of particles is φ = 0.3. The simulation box has periodic (wrap-around) boundary conditions, which explains why we need to use the remainder function to compute the distance between two particles. If the particles overlap, i.e., the distance s between two particles is less than 2, then the repulsive force is proportional to k(2−s) where k is a force constant. The force is along the vector joining the two particles.
Write a program that tests the correctness of your code. This can be done by computing the correct forces and comparing them to the forces computed by your optimized code. Give evidence in your report that your program works correctly using your test program
How much faster is your accelerated code compared to the provided baseline code? Include timings for different problem sizes. Be sure to include a listing of your code in your report.
Code to parallelize
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
double get_walltime() {
struct timeval tp;
gettimeofday(&tp, NULL);
return (double) (tp.tv_sec + tp.tv_usec*1e-6); }
void force_repulsion(int np, const double *pos, double L, double krepulsion, double *forces)
int i, j;
double posi [4]; double rvec [4];
double s2, s, f;
// initialize forces to zero
for (i=0; i<3*np; i++)
forces[i] = 0.;
// loop over all pairs
for (i=0; i<np; i++)
posi[0] = pos[3*i ];
posi[1] = pos[3*i+1]; posi[2] = pos[3*i+2];
for (j=i+1; j<np; j++)
// compute minimum image difference
rvec[0] = remainder(posi[0] - pos[3*j ], L);
rvec[1] = remainder(posi[1] - pos[3*j+1], L);
rvec[2] = remainder(posi[2] - pos[3*j+2], L);
s2 = rvec [0]* rvec [0] + rvec [1]* rvec [1] + rvec [2]* rvec [2];
if (s2 < 4)
s = sqrt(s2);
rvec[0] /= s; rvec[1] /= s;
rvec[2] /= s;
f = krepulsion*(2.-s);
forces[3*i ] += f*rvec[0];
forces[3*i+1] += f*rvec[1];
forces[3*i+2] += f*rvec[2];
forces[3*j ] += -f*rvec[0];
forces[3*j+1] += -f*rvec[1];
forces[3*j+2] += -f*rvec[2]; }
} }
int main(int argc, char *argv[]) {
int i;
int np = 100; // default number of particles
double phi = 0.3; // volume fraction
double krepulsion = 125.; // force constant
double *pos; double *forces;
double L, time0 , time1;
if (argc > 1)
np = atoi(argv[1]);
L = pow(4./3.*3.1415926536*np/phi, 1./3.);
// generate random particle positions inside simulation box
forces = (double *) malloc(3*np*sizeof(double));
pos = (double *) malloc(3*np*sizeof(double));
for (i=0; i<3*np; i++)
pos[i] = rand()/(double)RAND_MAX*L;
// measure execution time of this function
time0 = get_walltime ();
force_repulsion(np, pos, L, krepulsion, forces);
time1 = get_walltime ();
printf("number of particles: %d\n", np);
printf("elapsed time: %f\n", time1-time0);
return 0; }
Theoretically, it would be as simple as this:
void force_repulsion(int np, const double *pos, double L, double krepulsion,
double *forces)
// initialize forces to zero
#pragma omp parallel for
for (int i = 0; i < 3 * np; i++)
forces[i] = 0.;
// loop over all pairs
#pragma omp parallel for
for (int i = 0; i < np; i++)
double posi[4];
double rvec[4];
double s2, s, f;
posi[0] = pos[3 * i];
g++ -fopenmp example.cc -o example
Note that I did not check for correctness. Make sure that no variable declared outside the loop is written inside the parallel for (which is why I moved the declarations into the loop body when updating your code).
I'm having trouble keeping randomly generated values that are normally distributed between 0 and 1 (including 0, excluding 1). I believe the algorithm is basically correct, I am just stumped here. Any insight would be great.
These are the needed include files:
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
The normally distributed random number generator function:
float rand_normal(float mean, float stddev)
static float n2 = 0.0;
float x, y, r;
static int n2_cached = 0;
if (!n2_cached)
x = 2.0*rand()/RAND_MAX - 1;
y = 2.0*rand()/RAND_MAX - 1;
r = x*x + y*y;
} while (r==0.0 || r>1.0);
float d = sqrt(-2.0*log(r)/r);
float n1 = x*d;
float result = n1*stddev + mean;
n2 = y*d;
n2_cached = 1;
return result;
n2_cached = 0;
return n2*stddev + mean;
main function used only for testing purposes.
int main()
int i;
float min = 0.5, max = 0.5, r, avg = 0;
float x, w;
int n = 10000000;
for (i=0; i<n; i++)
r = rand_normal(0.5, 0.09);
if (r < min)
min = r;
else if ( r>max)
max = r;
avg += r;
avg /= (float)n;
printf("min = %f\nmax = %f\navg = %f\n", min, max, avg);
return 0;
In case anyone was wondering, this function is needed for a "genetic inheritance in plants" simulation.
Why would you expect the result to stay between 0 and 1? The Gaussian distribution has full support, so whatever interval you look at and whatever mean and variance you choose, there will always be a (possibly very small) non-zero probability of falling outside that interval. If you really want to restrict yourself to [0,1] for some reason, you can simply call rand_normal repeatedly until the result falls into that interval.
Note also that while Box-Müller (the algorithm you are using, in its polar form) is easy to implement, it is one of the worst and most costly ways of generating a Gaussian random variable. The best and fastest algorithm I know is the "Ziggurat" method, an implementation of which can be found at
I would definitely create a function to convert "rand()" to a normalized floating point value. For example:
double nrand(void) {
    /* floating-point division (not integer); result lies in [0, 1) */
    return rand() / (RAND_MAX + 1.0);
}
Also, here are a few links that might help:
Here is my function that tests whether a point (x, y) is in the Mandelbrot set after MAX_ITERATION (255) iterations. It should return 0 if it is not and 1 if it is.
int isMandelbrot (int x, int y) {
int i;
int j;
double Re[255];
double Im[255];
double a;
double b;
double dist;
double finaldist;
int check;
while (i < MAX_ITERATION) {
a = Re[j];
b = Im[j];
Im[i]=(2 * a * b) + y;
finaldist = sqrt(pow(Re[MAX_ITERATION],2)+pow(Im[MAX_ITERATION],2));
if (dist > 2) { //not in mandelbrot
check = 0;
} else if (dist <= 2) { //in mandelbrot set
check = 1;
return check;
Given that it's correct (can someone verify... or write a more efficient one?).
Here is my code to print it, however it does not work! (it keeps giving all points are in the set). What have I done wrong here?
int main(void) {
double col;
double row;
int checkSet;
row = -4;
col = -1;
while (row < 1.0 ) {
while (col < 1.0) {
checkSet = isMandelbrot(row, col);
if (checkSet == 1) {
} else if (checkSet == 0) {
return 0;
There are some bugs in your code. For example, you do this:
a = Re[j];
b = Im[j];
But at the first iteration, j = -1, so you're getting the value at index -1 of the arrays. That is not what you wanted to do.
Also, why are Re and Im arrays - do you really need to keep track of all the intermediate results in the calculation?
Wikipedia contains pseudocode for the algorithm, you might want to check your own code against that.
Another bug: your function takes int arguments, so the values of your double inputs will be truncated (i.e. the fractional part will be discarded).
You should probably be checking for escape inside the while loop. That is to say, if ((a*a + b*b) > 4) at any time, then that pixel has escaped, end of story. By continuing to iterate those pixels you not only waste CPU cycles: the values grow without bound and seem to exceed what can be represented in a double. The result is NaN, so your finaldist computation produces garbage.
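The suggestions from these answers (double arguments, no arrays, escape test inside the loop) could be combined into something like the sketch below. The name is_mandelbrot and the max_iter parameter are choices made for this example; the question uses a fixed MAX_ITERATION of 255.

```c
#include <assert.h>

/* Returns 1 if c = cx + i*cy has not escaped after max_iter iterations
   of z -> z*z + c starting from z = 0, and 0 as soon as |z| exceeds 2. */
int is_mandelbrot(double cx, double cy, int max_iter)
{
    assert(max_iter > 0);
    double re = 0.0, im = 0.0;
    for (int i = 0; i < max_iter; i++) {
        double next_re = re * re - im * im + cx;
        im = 2.0 * re * im + cy;
        re = next_re;
        /* compare squared magnitudes: |z| > 2 is the same as |z|^2 > 4 */
        if (re * re + im * im > 4.0)
            return 0;
    }
    return 1;
}
```

Only the current value of z is kept, so no index can run outside an array, and the early return stops the values from growing into infinity or NaN. In main, the call would then be is_mandelbrot(row, col, 255) with row and col stepped in small fractional increments.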
I think you would benefit from more resolution in your main. Your code as you've put it here isn't computing enough pixels to really see much of the set.