Parallel computing using multiple cores with OpenMP - C

I am struggling to figure out how to parallelize this code with OpenMP; any help is appreciated. Below is the base code and a description.
In the simulation of a collection of soft particles (such as proteins in a fluid), there is a repulsive force between a pair of particles when they overlap. The goal of this assignment is to use parallel computing to accelerate the computation of these repulsive forces, using multiple cores with OpenMP.
In the force repulsion function, the particles are assumed to have unit radius. The particles are in a “simulation box” of dimensions L × L × L. The dimension L is chosen such that the volume fraction of particles is φ = 0.3. The simulation box has periodic (wrap-around) boundary conditions, which explains why we need to use the remainder function to compute the distance between two particles. If the particles overlap, i.e., the distance s between two particles is less than 2, then the repulsive force is proportional to k(2−s) where k is a force constant. The force is along the vector joining the two particles.
Write a program that tests the correctness of your code. This can be done by computing the correct forces and comparing them to the forces computed by your optimized code. Give evidence in your report that your program works correctly, using your test program.
How much faster is your accelerated code compared to the provided baseline code? Include timings for different problem sizes. Be sure to include a listing of your code in your report.
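One way to implement the correctness test mentioned above is to run the baseline and the accelerated routine on the same particle positions and compare the force arrays element-wise within a small tolerance; a minimal sketch, where force_repulsion_omp is only a placeholder name for the parallel version:
#include <math.h>
#include <stdio.h>

// Compare two force arrays element-wise; return 1 if every component
// agrees within the absolute tolerance tol, 0 otherwise.
int forces_match(int np, const double *f_ref, const double *f_test, double tol)
{
    for (int i = 0; i < 3*np; i++) {
        if (fabs(f_ref[i] - f_test[i]) > tol) {
            printf("mismatch at component %d: %g vs %g\n", i, f_ref[i], f_test[i]);
            return 0;
        }
    }
    return 1;
}

// Usage sketch:
//   force_repulsion(np, pos, L, krepulsion, forces_ref);       // baseline
//   force_repulsion_omp(np, pos, L, krepulsion, forces_omp);   // hypothetical parallel version
//   if (!forces_match(np, forces_ref, forces_omp, 1e-12)) { /* report failure */ }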
Code to parallelize
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
double get_walltime() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double) (tp.tv_sec + tp.tv_usec*1e-6);
}

void force_repulsion(int np, const double *pos, double L, double krepulsion,
                     double *forces)
{
    int i, j;
    double posi[4];
    double rvec[4];
    double s2, s, f;

    // initialize forces to zero
    for (i = 0; i < 3*np; i++)
        forces[i] = 0.;

    // loop over all pairs
    for (i = 0; i < np; i++)
    {
        posi[0] = pos[3*i  ];
        posi[1] = pos[3*i+1];
        posi[2] = pos[3*i+2];

        for (j = i+1; j < np; j++)
        {
            // compute minimum image difference
            rvec[0] = remainder(posi[0] - pos[3*j  ], L);
            rvec[1] = remainder(posi[1] - pos[3*j+1], L);
            rvec[2] = remainder(posi[2] - pos[3*j+2], L);

            s2 = rvec[0]*rvec[0] + rvec[1]*rvec[1] + rvec[2]*rvec[2];
            if (s2 < 4)
            {
                s = sqrt(s2);
                rvec[0] /= s;
                rvec[1] /= s;
                rvec[2] /= s;
                f = krepulsion*(2.-s);

                forces[3*i  ] +=  f*rvec[0];
                forces[3*i+1] +=  f*rvec[1];
                forces[3*i+2] +=  f*rvec[2];
                forces[3*j  ] += -f*rvec[0];
                forces[3*j+1] += -f*rvec[1];
                forces[3*j+2] += -f*rvec[2];
            }
        }
    }
}

int main(int argc, char *argv[]) {
    int i;
    int np = 100;             // default number of particles
    double phi = 0.3;         // volume fraction
    double krepulsion = 125.; // force constant
    double *pos;
    double *forces;
    double L, time0, time1;

    if (argc > 1)
        np = atoi(argv[1]);

    L = pow(4./3.*3.1415926536*np/phi, 1./3.);

    // generate random particle positions inside simulation box
    forces = (double *) malloc(3*np*sizeof(double));
    pos = (double *) malloc(3*np*sizeof(double));
    for (i = 0; i < 3*np; i++)
        pos[i] = rand()/(double)RAND_MAX*L;

    // measure execution time of this function
    time0 = get_walltime();
    force_repulsion(np, pos, L, krepulsion, forces);
    time1 = get_walltime();

    printf("number of particles: %d\n", np);
    printf("elapsed time: %f\n", time1-time0);

    free(forces);
    free(pos);
    return 0;
}

Theoretically, it would be as simple as this:
void force_repulsion(int np, const double *pos, double L, double krepulsion,
                     double *forces)
{
    // initialize forces to zero
    #pragma omp parallel for
    for (int i = 0; i < 3 * np; i++)
        forces[i] = 0.;

    // loop over all pairs
    #pragma omp parallel for
    for (int i = 0; i < np; i++)
    {
        double posi[4];
        double rvec[4];
        double s2, s, f;
        posi[0] = pos[3 * i];
        //...
Compilation:
g++ -fopenmp example.cc -o example
Note that I did not check for correctness. Make sure you don't have variables that are declared outside the parallel for but written inside it (which is why I moved the declarations into the loop body in the updated code above).
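Note also that with this simple approach each thread writes forces[3*j], forces[3*j+1], forces[3*j+2] for indices j handled by other iterations, which is a data race. One way to keep the pair loop over j > i and still parallelize over i is an OpenMP array reduction (available since OpenMP 4.5); a minimal sketch under that assumption, compiled with -fopenmp:
#include <math.h>

// Sketch: same pair interaction as the baseline, with per-thread partial
// sums of the forces array combined by an array-section reduction.
void force_repulsion_omp(int np, const double *pos, double L,
                         double krepulsion, double *forces)
{
    for (int i = 0; i < 3*np; i++)
        forces[i] = 0.;

    #pragma omp parallel for schedule(dynamic) reduction(+:forces[:3*np])
    for (int i = 0; i < np; i++) {
        for (int j = i+1; j < np; j++) {
            double rvec[3];
            rvec[0] = remainder(pos[3*i  ] - pos[3*j  ], L);
            rvec[1] = remainder(pos[3*i+1] - pos[3*j+1], L);
            rvec[2] = remainder(pos[3*i+2] - pos[3*j+2], L);
            double s2 = rvec[0]*rvec[0] + rvec[1]*rvec[1] + rvec[2]*rvec[2];
            if (s2 < 4) {
                double s = sqrt(s2);
                double f = krepulsion*(2.-s)/s;   // folds the 1/s normalization into f
                forces[3*i  ] += f*rvec[0];
                forces[3*i+1] += f*rvec[1];
                forces[3*i+2] += f*rvec[2];
                forces[3*j  ] -= f*rvec[0];
                forces[3*j+1] -= f*rvec[1];
                forces[3*j+2] -= f*rvec[2];
            }
        }
    }
}
schedule(dynamic) is used because the inner loop gets shorter as i grows, so a static split over i would be unbalanced.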

Related

Why am I not getting the same estimation of PI using a parallelized (OpenMP) algorithm copied from working code?

The code below is a direct translation from a YouTube video on estimating PI using OpenMP and Monte Carlo. Even with the same inputs I'm not getting their output; in fact, I get around half the expected value.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

int main() {
    int num; // number of iterations
    printf("Enter number of iterations you want the loop to run for: ");
    scanf_s("%d", &num);
    double x, y, z, pi;
    long long int i;
    int count = 0;
    int num_thread;
    printf("Enter number of threads you want to run to parallelize the process:\t");
    scanf_s("%d", &num_thread);
    printf("\n");
    #pragma omp parallel firstprivate(x,y,z,i) shared(count) num_threads(num_thread)
    {
        srand((int)time(NULL) ^ omp_get_thread_num());
        for (i = 0; i < num; i++) {
            x = (double)rand() / (double)RAND_MAX;
            y = (double)rand() / (double)RAND_MAX;
            z = pow(((x * x) + (y * y)), .5);
            if (z <= 1) {
                count++;
            }
        }
    } // END PRAGMA
    pi = ((double)count / (double)(num * num_thread)) * 4;
    printf("The value of pi obtained is %f\n", pi);
    return 0;
}
I've also used a similar algorithm straight from the Oak Ridge National Laboratory's website (https://www.olcf.ornl.gov/tutorials/monte-carlo-pi/):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <math.h>

int main(int argc, char* argv[])
{
    int niter = 1000000; //number of iterations per FOR loop
    double x,y;          //x,y value for the random coordinate
    int i;               //loop counter
    int count=0;         //Count holds all the number of how many good coordinates
    double z;            //Used to check if x^2+y^2<=1
    double pi;           //holds approx value of pi
    int numthreads = 16;

    #pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
    {
        srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
        for (i=0; i<niter; ++i) //main loop
        {
            x = (double)random()/RAND_MAX; //gets a random x coordinate
            y = (double)random()/RAND_MAX; //gets a random y coordinate
            z = sqrt((x*x)+(y*y));         //Checks to see if number is inside unit circle
            if (z<=1)
            {
                ++count; //if it is, consider it a valid random point
            }
        }
        //print the value of each thread/rank
    }
    pi = ((double)count/(double)(niter*numthreads))*4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
And I have the exact same problem, so I think it isn't the code but somehow my machine.
I am running Visual Studio 2022 on Windows 11 with a 16-core i9-12900KF and 32 GB of RAM.
Edit: I forgot to mention I did alter the second algorithm to use srand() and rand() instead.
There are many errors in the code:
As pointed out by @JeromeRichard and @JohnBollinger, rand/srand/random are not thread-safe; you should use a thread-safe solution.
There is a race condition at the line ++count; (different threads read and write a shared variable). You should use a reduction to avoid it.
The code assumes that you use numthreads threads, but OpenMP does not guarantee that you actually got all of the threads you requested. I think if you got PI/2 as a result, the problem is the difference between the requested and obtained number of threads. If you use #pragma omp parallel for... before the loop, you do not need any assumption about the number of threads (i.e. in this case the formula to calculate PI does not contain the number of threads).
A minor comment is that you do not need to use the time-consuming pow function.
Putting it together, your code should be something like this:
#pragma omp parallel for reduction(+:count) num_threads(num_thread)
for (long long int i = 0; i < num; i++) {
    const double x = threadsafe_random_number_between_0_1();
    const double y = threadsafe_random_number_between_0_1();
    const double z = x * x + y * y;
    if (z <= 1) {
        count++;
    }
}
double pi = ((double) count / (double) num ) * 4.0;
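As one concrete (hedged) way to fill in the threadsafe_random_number_between_0_1() placeholder, POSIX rand_r can be used with a per-thread seed; note that rand_r is not available under MSVC, so on Windows another per-thread generator would be needed. A minimal self-contained sketch:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long long num = 100000000LL; // number of samples (assumed value)
    long long count = 0;

    #pragma omp parallel reduction(+:count)
    {
        // each thread owns its seed, so rand_r() is safe to call concurrently
        unsigned int seed = 1234u + 97u * (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long long i = 0; i < num; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                count++;
        }
    }
    printf("Pi estimate: %f\n", 4.0 * (double)count / (double)num);
    return 0;
}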
One assumption, but I may be wrong: you initialize the random generator with the time, so it may happen that different threads use the same time, which results in the same random numbers being generated. The result will then be quite poor, since you get the same values multiple times. This is a problem for the Monte Carlo method, where identical points skew the result.

Segmentation faults with OpenMP and pointer operations

I'm developing a code to simulate the response of some dynamical systems for my PhD research. To this end, I'm trying to implement parallelization with OpenMP to increase the performance of the code.
Basically I have a solution function parallel_dynamical_diagram_solution that calls other functions within the same file (realloc_vector, perturb_wolf, rk4, lyapunov_wolf, store_LE, get_attractor) that contain some operations and numerical methods. Also, the solution function calls a function pointer edosys that is declared in another file. The solution function is shown below:
void parallel_dynamical_diagram_solution(int dim, int np, int ndiv, int trans, int *attrac, int maxper, double t, double **x, double *varparX, double *varparY, double *parrange, double *par, void (*edosys)(int, double *, double, double *, double *), int bifmode) {
// Allocate memory for x' = f(x)
double *f = malloc(dim * sizeof *f);
// Allocate memory for vectors necessary for lyapunov exponents calculation
double *cum = malloc(dim * sizeof *cum); // Cumulative Vector
double *lambda = malloc(dim *sizeof *lambda); // Lyapunov Exponents vector
double *s_cum = malloc(dim * sizeof *s_cum); // Short Cumulative Vector
double *s_lambda = malloc(dim * sizeof *s_lambda); // Short Lyapunov Exponents Vector
double *znorm = malloc(dim * sizeof *znorm); // Norm of Vectors
double *gsc = malloc((dim - 1) * sizeof *gsc); // Inner Products Vector
// Store Initial Conditions
double t0 = t;
double *IC = malloc(dim * sizeof *IC);
for (int i = 0; i < dim; i++) {
IC[i] = (*x)[i];
}
// Declare rk4 timestep, final time, short initial time and pi
double h, tf, s_T0;
const double pi = 4 * atan(1); // Pi number definition
// Declare and define increment of control parameters
double varstep[2];
varstep[0] = (parrange[1] - parrange[0])/(parrange[2] - 1); // -1 in the denominator ensures the input resolution
varstep[1] = (parrange[4] - parrange[3])/(parrange[5] - 1); // -1 in the denominator ensures the input resolution
// Numerical control parameters
int ndim = dim + (dim * dim); // Define new dimension to include linearized dynamical equations
// Declare vector and allocate memory to store poincare map values: poinc[number of permanent regime forcing periods][dimension original system]
double **poinc = malloc((np - trans) * sizeof **poinc);
for (int i = 0; i < np - trans; i++) {
poinc[i] = malloc(dim * sizeof **poinc);
}
// Declare vector to store the chosen Lyapunov Exponents to determine the attractor
double *LE = malloc(dim * sizeof *LE);
// Declare vector for temporary storage of periodicity values to check if all directions are equal
int *tmp_attrac = malloc(dim * sizeof *tmp_attrac);
// Declare variable to flag if all directions present same periodicity or not (0 = all the same, 1 = not the same)
int diffAttrac = -1;
// Prepare x vector to include perturbed values
realloc_vector(x, ndim);
// Starts the parallel block
int k;
#pragma omp parallel default(none) firstprivate(x, t, par, IC, varstep, diffAttrac, poinc, varparY, varparX) \
private(k, f, attrac, lambda, s_lambda, LE, cum, s_cum, znorm, gsc, h, tf, s_T0, tmp_attrac, edosys) \
shared(parrange, dim, ndim, np, ndiv, trans, t0, bifmode, maxper)
{
// Starts to increment bifurcation control parameter
#pragma omp for schedule(static)
// Loop for Y control parameter
for (k = 0; k < (int)parrange[5]; k++) {
// Value of Y control parameter based on index k
(*varparY) = parrange[3] + k*varstep[1];
printf("(Iteration: %d) varparY = %lf\n", k, (*varparY));
// Reset Initial conditions for the beginning of a horizontal line
for (int i = 0; i < dim; i++) {
(*x)[i] = IC[i];
printf("(Iteration: %d) x[%d] = %lf\n", k, i, (*x)[i]);
}
// Loop for X control parameter
for (int m = 0; m < (int)parrange[2]; m++) {
// Value of X control parameter based on index m
(*varparX) = parrange[0] + m*varstep[0];
printf("(Iteration: %d) a\n", k);
// Reset Variables
t = t0;
printf("(Iteration: %d) b\n", k);
for (int i = 0; i < dim; i++) {
lambda[i] = 0.0;
s_lambda[i] = 0.0;
LE[i] = 0.0;
printf("(Iteration: %d) c\n", k);
}
// Check the mode of the bifurcation
if (bifmode == 1) {
// Reset Initial conditions in each bifurcation step
for (int i = 0; i < dim; i++) {
(*x)[i] = IC[i];
}
}
// Vary timestep if varpar = par[0], varying also final time and short initial time
h = (2 * pi) / (ndiv * par[0]); // par[0] = OMEGA
tf = h*np*ndiv; // Final time
s_T0 = ((double) trans/ (double) np) * tf; // Advanced initial time
// Assign initial perturbation
perturb_wolf(x, dim, ndim, &cum, &s_cum);
// Call Runge-Kutta 4th order integrator N = nP * nDiv times
for (int i = 0; i < np; i++) {
for (int j = 0; j < ndiv; j++) {
rk4(ndim, *x, t, h, par, f, edosys);
lyapunov_wolf(x, t, h, dim, ndim, s_T0, &cum, &s_cum, &lambda, &s_lambda, &znorm, &gsc);
t = t + h;
// Apply poincare map at permanent regime
if (i >= trans) {
// Choose any point in the trajectory for poincare section placement
if (j == 1) {
// Stores poincare values in poinc[np - trans][dim] vector
for (int p = 0; p < dim; p++) {
poinc[i - trans][p] = (*x)[p];
}
}
}
}
}
// Define which lyapunov will be taken: lambda[dim] or s_lambda[dim]
store_LE(dim, lambda, s_lambda, LE);
// Verify the type of motion of the system
(*attrac) = get_attractor(poinc, LE, dim, np, trans, tmp_attrac, &diffAttrac, maxper);
printf("[k = %d] [m = %d]: Attractor = %d, lambda_1 = %lf, lambda_2 = %lf\n",k, m, (*attrac), lambda[0], lambda[1]);
}
}
// Free memory
free(f); free(cum); free(s_cum); free(lambda); free(s_lambda);
free(znorm); free(gsc); free(LE); free(tmp_attrac); free(IC);
for (int i = 0; i < np - trans; i++) {
free(poinc[i]);
}
free(poinc);
} // Ends Parallel Block
}
When I run the code, a segmentation fault error occurs as shown below:
(Iteration: 0) varparY = 0.010000
(Iteration: 0) x[0] = 1.000000
(Iteration: 0) x[1] = 0.000000
(Iteration: 0) a
(Iteration: 0) b
zsh: segmentation fault ./dyndiag
It appears to be happening when I assign the values of lambda, s_lambda and LE, but I don't know why, as these variables are declared as private. I'm new to OpenMP and parallelization in general; could someone help me? What am I missing here?
Thanks in advance!
The easy part is to tell why you get segmentation faults: many of your variables are uninitialized inside the parallel block (attrac, lambda, s_lambda, LE, cum, s_cum, gsc, etc.). Consider the following code segment:
double *lambda = malloc(dim * sizeof *lambda);
#pragma omp parallel private(lambda)
{
    lambda[0] = 0; // lambda is uninitialized --> undefined behavior
}
The private clause makes the pointer private, but it will not initialize your private variable or allocate memory for each thread. The code above practically means the following (it may be easier to see what is actually happening):
double *lambda = malloc(dim * sizeof *lambda);
#pragma omp parallel
{
    double* lambda; // a local variable is created for each thread
    lambda[0] = 0;  // BUT it is uninitialized --> undefined behavior
}
The solution is to allocate (and free) the memory inside the parallel block:
#pragma omp parallel
{
    double *lambda = malloc(dim * sizeof *lambda);
    ....
    lambda[0] = 0; // lambda is private and initialized -- OK
    ...
    free(lambda);
}
This will solve your segmentation fault problem, but unfortunately your code will still not work properly. You still have issues with your firstprivate variables, but the biggest problem is that you wish to parallelize the iterations themselves. As far as I understand your code, the output of one iteration is the input of the next one. You simply cannot parallelize that; such iteration is a sequential process. You should first read a decent book on OpenMP, then rewrite your code and parallelize the work inside an iteration.
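A minimal, self-contained illustration of the allocate-inside-the-region pattern (this is only a sketch of the memory handling, not of the numerical work in the question):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const int dim = 4;   // assumed small dimension for the demo
    const int nk  = 8;   // number of outer iterations

    #pragma omp parallel
    {
        // per-thread scratch buffer, allocated and freed inside the region
        double *lambda = malloc(dim * sizeof *lambda);

        #pragma omp for schedule(static)
        for (int k = 0; k < nk; k++) {
            for (int i = 0; i < dim; i++)
                lambda[i] = 0.0;   // safe: lambda points to thread-local memory
            lambda[0] = (double)k;
            printf("iteration %d on thread %d, lambda[0] = %f\n",
                   k, omp_get_thread_num(), lambda[0]);
        }

        free(lambda);
    }
    return 0;
}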

Error in parallelization MPI_Allgather

EDITED IN LIGHT OF THE COMMENTS
I am learning MPI and I am doing some exercises to understand some aspects of it. I have written a code that should perform a simple Monte-Carlo.
There are two main loops in it that have to be accomplished: one on the time steps T and a smaller one inside this one on the number of molecules N. So after I attempt to move every molecule the program goes to the next time step.
I tried to parallelize it by dividing the operations on the molecules on the different processors. Unfortunately the code, which works for 1 processor, prints the wrong results for total_E when p>1.
The problem probably lies in the following function, and more precisely in the call to MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
I completely don't understand why. What am I doing wrong? (besides a primitive parallelization strategy)
My logic was that for every time step I could calculate the moves on the molecules on the different processors. Unfortunately, while I work with the local vectors local_r on the various processors, to calculate the energy difference local_DE, I need the global vector r since the energy of the i-th molecule depends on all the others. Therefore I thought to call MPI_Allgather since I have to update the global vector as well as the local ones.
void Step(double (*H)(double,double), double* local_r, double* r, double *E_, int n, int my_rank){
    int i;
    double* local_rt = calloc(n,sizeof(double));
    double local_DE;
    for(i=0;i<n;i++){
        local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
        local_rt[i] = periodic(local_rt[i]);
        local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
        if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
            (*E_) += local_DE;
            local_r[i] = local_rt[i];
        }
        MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
    }
    return ;
}
Here is the complete "working" code:
#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
#define N 100
#define L 5.0
#define T_ 5000
#define delta 2.0
void Step(double (*)(double,double),double*,double*,double*,int,int);
double H(double ,double );
double E(double (*)(double,double),double* ,double*,int ,int );
double E_single(double (*)(double,double),double* ,double*,int ,int ,int);
double * pos_ini(void);
double periodic(double );
double dist(double , double );
double sign(double );
int main(int argc,char** argv){
if (argc < 2) {
printf("./program <outfile>\n");
exit(-1);
}
srand48(0);
int my_rank;
int p;
FILE* outfile = fopen(argv[1],"w");
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
MPI_Comm_size(MPI_COMM_WORLD,&p);
double total_E,E_;
int n;
n = N/p;
int t;
double * r = calloc(N,sizeof(double)),*local_r = calloc(n,sizeof(double));
for(t = 0;t<=T_;t++){
if(t ==0){
r = pos_ini();
MPI_Scatter(r,n,MPI_DOUBLE, local_r,n,MPI_DOUBLE, 0, MPI_COMM_WORLD);
E_ = E(H,local_r,r,n,my_rank);
}else{
Step(H,local_r,r,&E_,n,my_rank);
}
total_E = 0;
MPI_Allreduce(&E_,&total_E,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
if(my_rank == 0){
fprintf(outfile,"%d\t%lf\n",t,total_E/N);
}
}
MPI_Finalize();
return 0;
}
double sign(double a){
if(a < 0){
return -1.0 ;
}else{
return 1.0 ;
}
}
double periodic(double a){
if(sqrt(a*a) > L/2.0){
a = a - sign(a)*L;
}
return a;
}
double dist(double a, double b){
double d = a-b;
d = periodic(d);
return sqrt(d*d);
}
double * pos_ini(void){
double * r = calloc(N,sizeof(double));
int i;
for(i = 0;i<N;i++){
r[i] = ((double) lrand48()/RAND_MAX)*L - L/2.0;
}
return r;
}
double H(double a,double b){
if(dist(a,b)<2.0){
return exp(-dist(a,b)*dist(a,b))/dist(a,b);
}else{
return 0.0;
}
}
double E(double (*H)(double,double),double* local_r,double*r,int n,int my_rank){
double local_V = 0;
int i;
for(i = 0;i<n;i++){
local_V += E_single(H,local_r,r,i,n,my_rank);
}
local_V *= 0.5;
return local_V;
}
double E_single(double (*H)(double,double),double* local_r,double*r,int i,int n,int my_rank){
double local_V = 0;
int j;
for(j = 0;j<N;j++){
if( (i + n*my_rank) != j ){
local_V+=H(local_r[i],r[j]);
}
}
return local_V;
}
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
for(i=0;i<n;i++){
local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
}
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
}
return ;
}
You cannot expect to get the same energy given different number of MPI processes for one simple reason - the configurations generated are very different depending on how many processes are there. The reason is not MPI_Allgather, but the way the Monte-Carlo sweeps are performed.
Given one process, you attempt to move atom 1, then atom 2, then atom 3, and so on, until you reach atom N. Each attempt sees the configuration resulting from the previous one, which is fine.
Given two processes, you attempt to move atom 1 while at the same time attempting to move atom N/2. Neither atom 1 sees the eventual displacement of atom N/2 nor the other way round, but then atoms 2 and N/2+1 see the displacement of both atom 1 and atom N/2. You end up with two partial configurations that you simply merge with the all-gather. This is not equivalent to the previous case when a single process does all the MC attempts. The same applies for the case of more than two processes.
There is another source of difference - the pseudo-random number (PRN) sequence. The sequence produced by the repeated calls to lrand48() in one process is not the same as the combined sequence produced by multiple independent calls to lrand48() in different processes, therefore even if you sequentialise the trials, still the acceptance will differ due to the locally different PRN sequences.
Forget about the specific values of the energy produced after each step. In a proper MC simulation those are insignificant. What counts is the average value over a large number of steps. Those should be the same (within a certain margin proportional to 1/sqrt(N)) no matter the update algorithm used.
It's been quite a while since I last used MPI, but it seems that your program halts when you try to "gather" and update the data in only some of the processes, and it is unpredictable which processes would need to do the gathering.
So in this case a simple solution is to let the rest of the processes send some dummy data so they can simply be ignored by the others. For instance,
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
    (*E_) += local_DE;
    local_r[i] = local_rt[i];
    MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
    // filter the dummy data out of "r" here
} else {
    MPI_Allgather(dummy_sendbuf, n, MPI_DOUBLE, dummy_recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD);
}
The dummy data could be some exceptional, impossible values which should never appear in the real results, so the other processes can filter them out.
But as I mentioned, this is quite wasteful, as you don't really need to receive that much data from all processes, and we would like to avoid it, especially when there is a lot of data to send.
In this case, you can gather some "flags" from the other processes so that every process knows which processes have data to send.
// pseudo codes
// for example, place 1 at local_flags[my_rank] if it's got data to send, otherwise 0
MPI_Allgather(local_flags, n, MPI_BYTE, recv_flags, n, MPI_BYTE, MPI_COMM_WORLD)
// so now all the processes know which processes will send
// receive data from those processes
MPI_Allgatherv(...)
I remember that with MPI_Allgatherv you can specify the number of elements to receive from each specific process. Here's an example: http://mpi.deino.net/mpi_functions/MPI_Allgatherv.html
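A hedged, more concrete sketch of that idea: each rank first announces how many doubles it will contribute, then the variable-sized contributions are combined with MPI_Allgatherv (recv_buf is assumed to be large enough to hold the sum of all counts):
#include <mpi.h>
#include <stdlib.h>

void gather_updates(double *local_data, int local_count,
                    double *recv_buf, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    // step 1: every rank learns how much each rank will send
    int *counts = malloc(p * sizeof *counts);
    MPI_Allgather(&local_count, 1, MPI_INT, counts, 1, MPI_INT, comm);

    // step 2: build displacements and gather the variable-sized data
    int *displs = malloc(p * sizeof *displs);
    displs[0] = 0;
    for (int i = 1; i < p; i++)
        displs[i] = displs[i - 1] + counts[i - 1];

    MPI_Allgatherv(local_data, local_count, MPI_DOUBLE,
                   recv_buf, counts, displs, MPI_DOUBLE, comm);

    free(counts);
    free(displs);
}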
But bear in mind that this might be overkill if the program is not well parallelized. For example, in your case this is placed inside a loop, so the processes without data still need to wait for the next gathering of the flags.
You should take MPI_Allgather() outside the for loop. I tested with the following code; note that I modified the lines involving RAND_MAX in order to get consistent results. As a result, the code gives the same answer for 1, 2, and 4 processors.
void Step(double (*H)(double,double), double* local_r, double* r, double *E_, int n, int my_rank){
    int i;
    double* local_rt = calloc(n,sizeof(double));
    double local_DE;
    for(i=0;i<n;i++){
        //local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
        local_rt[i] = local_r[i] + delta*((double)lrand48()-0.5);
        local_rt[i] = periodic(local_rt[i]);
        local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
        //if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX )
        if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48() )
        {
            (*E_) += local_DE;
            local_r[i] = local_rt[i];
        }
    }
    MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
    return ;
}

Correct way to implement windowing

I'm trying to implement windowing in a program; for that I've written a sine signal with 2048 samples. I'm reading the values and trying to calculate the PSD using the "rect" window. When my window is 2048 samples wide, the result is accurate; otherwise the result doesn't make any sense to me.
Here is the code that I'm using:
#include <fftw3.h>
#include <math.h>
#include <stdio.h>
#include <complex.h>
int main (){
FILE* inputFile = NULL;
FILE* outputFile= NULL;
double* inputData=NULL;
double* outputData=NULL;
double* windowData=NULL;
unsigned int windowSize = 512;
int overlaping =128;
int index1 =0,index2=0, i=0;
double powVal= 0.0;
fftw_plan plan_r2hc;
// memory allocation
inputData = (double*) fftw_malloc(sizeof(double)*windowSize);
outputData= (double*) fftw_malloc(sizeof(double)*windowSize);
windowData= (double*) fftw_malloc(sizeof(double)*windowSize);
plan_r2hc = fftw_plan_r2r_1d(windowSize, inputData, windowData, FFTW_R2HC, FFTW_PATIENT);
// Opening files
inputFile = fopen("sinusD","rb");
outputFile= fopen("windowingResult","wb+");
if(inputFile==NULL ){
printf("Couldn't open either the input or the output file \n");
return -1;
}
while((i=fread(inputData,sizeof(double),windowSize,inputFile))==windowSize){
fftw_execute_r2r(plan_r2hc, inputData, windowData);
for( index1 =0; index1 < windowSize;index1++){
outputData[index1]+=windowData[index1];
printf("index %d \t %lf\n",index1,inputData[index1]);
}
if(overlaping!=0)
fseek(inputFile,(-overlaping)*sizeof(double),SEEK_CUR);
}
if( i!=0){
i = -i;
fseek(inputFile ,i*sizeof(double),SEEK_END);
fread(inputData,sizeof(double),-i,inputFile);
fftw_execute_r2r(plan_r2hc, inputData, windowData);
for( index1=0;index1< windowSize; index1++){
outputData[index1]+=windowData[index1];
}
}
powVal = outputData[0]*outputData[0];
powVal /= (windowSize*windowSize)/2;
index1 = 0;
fprintf(outputFile,"%lf ",powVal);
printf(" PSD \t %lf\n",powVal);
for (index1 =1; index1<=windowSize/2;index1++){
powVal = outputData[index1]*outputData[index1]+outputData[windowSize-index1]*outputData[windowSize- index1];
powVal/=(windowSize*windowSize)/2;
// powVal = 20*log10(fabs(powVal));
fprintf(outputFile,"%lf ",powVal);
printf(" PsD %d \t %10.5lf\n",index1,powVal);
}
fftw_free(inputData);
fftw_free(outputData);
fftw_free(windowData);
fclose(inputFile);
fclose(outputFile);
}
You need to premultiply the signal with a window function. This can be precomputed if you are calculating multiple FFTs.
For example, a Hanning window is calculated as follows:
#define WINDOW_SIZE 2048
int i;
double w[WINDOW_SIZE];
for (i=0; i<WINDOW_SIZE; i++) {
    w[i] = (1.0 - cos(2.0 * M_PI * i/(WINDOW_SIZE-1))) * 0.5;
}
Before computing the Fourier transform, multiply your input data by this window as follows:
for (i=0; i<WINDOW_SIZE; i++) inputData[i] *= w[i];
Explanation
When you calculate the Fourier transform of a finite set of samples, what you actually get is the frequency spectrum of the infinite signal that you would get by repeating these samples forever. Unless you're sampling a signal whose frequency is an exact multiple of the sampling frame rate, you will get large discontinuities where the end of one sample frame runs into the start of the next. A window function flattens out the samples at the edges of the sample frame to eliminate these discontinuities.
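Applied to the loop in the question, this just means multiplying each freshly read frame by the precomputed window before executing the transform; a small sketch assuming the plan, buffers, and windowSize are set up as in the question:
#include <fftw3.h>

// Taper one frame with a precomputed window, then run the r2r transform.
// Assumes `plan` was created with fftw_plan_r2r_1d(windowSize, inputData,
// windowData, FFTW_R2HC, ...) as in the question's code.
static void windowed_transform(fftw_plan plan, double *inputData,
                               double *windowData, const double *w,
                               unsigned int windowSize)
{
    for (unsigned int k = 0; k < windowSize; k++)
        inputData[k] *= w[k];               // apply the window in place
    fftw_execute_r2r(plan, inputData, windowData);
}
When a non-rectangular window is used, the PSD normalization typically also changes: dividing by the window's energy (the sum of w[i]*w[i]) rather than by windowSize compensates for the power removed by the taper.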

DTRMM & DTRSM hangs on certain matrix sizes

I'm testing the performance of ?GEMM, ?TRMM, ?TRSM using MKL's automatic offload on the new Intel Xeon Phi coprocessors and am having some issues with DTRMM and DTRSM. I have code to test the performance for matrix sizes in steps of 1024 up to 10240, and performance seems to drop off significantly somewhere after N=M=K=8192. When I tried to pin down exactly where by using step sizes of 2, my script hung. I then checked step sizes of 512, which work fine; 256 works as well, but anything under 256 just stalls. I cannot find any known issues regarding this problem. All single precision versions work, as well as single and double precision ?GEMM. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include <time.h>
#include "mkl.h"
#define DBG 0
int main(int argc, char **argv)
{
char transa = 'N', side = 'L', uplo = 'L', diag = 'U';
MKL_INT N, NP; // N = M, N, K, lda, ldb, ldc
double alpha = 1.0; // Scaling factors
double *A, *B; // Matrices
int matrix_bytes; // Matrix size in bytes
int matrix_elements; // Matrix size in elements
int i, j; // Counters
int msec;
clock_t start, diff;
N = atoi(argv[1]);
start = clock();
matrix_elements = N * N;
matrix_bytes = sizeof(double) * matrix_elements;
// Allocate the matrices
A = malloc(matrix_bytes);
if (A == NULL)
{
printf("Could not allocate matrix A\n");
return -1;
}
B = malloc(matrix_bytes);
if (B == NULL)
{
printf("Could not allocate matrix B\n");
return -1;
}
for (i = 0; i < matrix_elements; i++)
{
A[i] = 0.0;
B[i] = 0.0;
}
// Initialize the matrices
for (i = 0; i < N; i++)
for (j = 0; j <= i; j++)
{
A[i+N*j] = 1.0;
B[i+N*j] = 2.0;
}
// DTRMM call
dtrmm(&side, &uplo, &transa, &diag, &N, &N, &alpha, A, &N, B, &N);
diff = clock() - start;
msec = diff * 1000 / CLOCKS_PER_SEC;
printf("%f\n", (float)msec * 10e-4);
if (DBG == 1)
{
printf("\nMatrix dimension is set to %d \n\n", (int)N);
// Display the result
printf("\nResulting matrix B:\n");
if (N > 10)
{
printf("NOTE: B is too large, print only upper-left 10x10 block...\n");
NP = 10;
}
else
NP = N;
printf("\n");
for (i = 0; i < NP; i++)
{
for (j = 0; j < NP; j++)
printf("%7.3f ", B[i + j * N]);
printf("\n");
}
}
// Free the matrix memory
free(A);
free(B);
return 0;
}
Any help or insight would be greatly appreciated.
This phenomenon has been extensively discussed in other questions, and also in Intel's Software Optimization Manual and Agner Fog's notes.
Typically, you are experiencing a perfect storm of evictions in the memory hierarchy, such that suddenly (nearly) every single access misses cache and/or TLB (one can determine exactly which resource is missing by looking at the specific data access pattern or by using the PMCs; I can do the calculation later when I'm near a whiteboard, unless Mystical gets to you first).
You can also search through some of my or Mystical's answers to find previous answers.
The issue was an older version of Intel's icc compiler (the beta 10 update, I believe). The gold update works like a charm.
