MPI Program Does Not "Speed Up" After Implementing Parallel Computing Techniques


I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code compiles without errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (speed up) according to Amdahl’s law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>
int main(int argc, char* argv[]) {
MPI_Init(&argc, &argv);
int rank, size, start_val, end_val, upperLimit;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
upperLimit = 33;
start_val = rank * (upperLimit / size) + 1;
int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
double start, end;
// begin timing
start = MPI_Wtime();
// arbitrarily inflate the number of computations
// to make the program take longer to compute
// change to i < 1 for only 1 computation
for (int i = 0; i < 10000000; i++) {
// generate an algorithm that defines the range
// each process handles for the fibb_sequence problem.
if (rank == (size - 1)) {
end_val = upperLimit;
}
else {
end_val = start_val + (upperLimit / size) - 1;
}
/*
calculations before this code indicate that it will take exactly 32 separate algorithm computations
to get to the largest number before exceeding 4,000,000 in the fibb sequence. This can be done with a simple
computation, but this calculation will not be listed in this code.
*/
long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;
// calculate fibb sequence positions for the sequence using a formula
double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
position11 = floor(position11);
position12 = floor(position12);
// dynamically assign values to each process to generate a solution quickly
if (rank == 0) {
for (int i = start_val; i < end_val; i++) {
SumNum = num1 + num2;
num1 = num2;
num2 = SumNum;
if (SumNum % 2 == 0) {
x = x + SumNum;
//printf("Process 0 reports %d \n \n", SumNum);
//fflush(stdout);
}
}
}
else {
for (int i = start_val; i < end_val; i++) {
SumNum = position12 + position11;
if (SumNum % 2 == 0) {
x = x + SumNum;
//printf("Process %d reports %d \n \n", rank, SumNum);
//fflush(stdout);
}
position11 = position12;
position12 = SumNum;
}
}
int recieve_buf = 0;
MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
//printf("This is the final solution: %d \n \n", recieve_buf);
//fflush(stdout);
}
}
// end timing
end = MPI_Wtime();
// timer goes here
double elapsed_time = end - start;
printf("I am rank %d, and I report a walltime of %f seconds.", rank, elapsed_time);
// end the MPI code
MPI_Finalize();
return 0;
}
Note that I utilize 10000000 computations in a for loop to intentionally increase the computation time.
I have attempted to solve this problem by utilizing time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, it seems as if the computation time increases as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again

This is a case of the granularity being too fine. Granularity is defined as the amount of work between synchronization points vs the cost of synchronization.
Let's say your MPI_Reduce takes one, or a couple of, microseconds. (A figure that has stayed fairly constant over the past few decades!) That's enough time to do a few thousand operations. So for speedup to occur, you need many thousands of operations between the reductions. You don't have that, so the time of your code is completely dominated by the cost of the MPI calls, and that does not go down with the number of processes.
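To make the granularity argument concrete, here is a rough cost model, sketched under stated assumptions: the work divides perfectly across processes, and every reduction costs a fixed amount as soon as more than one process takes part. The function names and any numbers fed into them are illustrative, not measurements of the code in the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cost model (not measured): total time for `iters` iterations,
 * each doing `work_s` seconds of perfectly divisible work followed by one
 * reduction that costs `sync_s` seconds when more than one process runs. */
double modeled_time(int iters, double work_s, double sync_s, int procs)
{
    assert(iters > 0 && procs >= 1);
    double per_iter = work_s / procs;
    if (procs > 1) {
        per_iter += sync_s;
    }
    return iters * per_iter;
}

/* Speedup relative to a single process under the same model. */
double modeled_speedup(int iters, double work_s, double sync_s, int procs)
{
    return modeled_time(iters, work_s, sync_s, 1) /
           modeled_time(iters, work_s, sync_s, procs);
}

/* Print the modeled time and speedup for 1 to 4 processes. */
void print_model(int iters, double work_s, double sync_s)
{
    for (int p = 1; p <= 4; p++) {
        printf("procs=%d  time=%.4f s  speedup=%.2f\n", p,
               modeled_time(iters, work_s, sync_s, p),
               modeled_speedup(iters, work_s, sync_s, p));
    }
}
```

Calling print_model(10000000, 10e-9, 1e-6) assumes 10 ns of work per iteration and a 1 us reduction, and the model then predicts a slowdown as soon as a second process is added. With an assumed 100 us of work per iteration instead, the same model predicts a speedup close to the process count.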

Related

Why OpenMP reduction is slower than MPI on share memory structure?

I have tested OpenMP and MPI parallel implementations of the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI.
The MPI code I am using is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
double ttime = -omp_get_wtime();
int np, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int n = 10000;
int repeat = 10000;
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = my_rank * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double dot = 0;
double sum = 1;
int j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
sum += (dot/(double)(n));
}
time += omp_get_wtime();
if (my_rank == 0)
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
return 0;
}
I have tried several different implementations with OpenMP.
Here is the version that is not too complicated and is close to the best performance I can achieve.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int n = 10000;
int repeat = 10000;
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
int nstart =0;
int sublength =n;
double loc_dot = 0;
double sum = 1;
#pragma omp parallel
{
int i, j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
#pragma omp for reduction(+: loc_dot)
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
#pragma omp single
{
sum += (loc_dot/(double)(n));
loc_dot =0;
}
}
time += omp_get_wtime();
#pragma omp single nowait
printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
}
return 0;
}
Here are my test results:
OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32
MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32
Can anyone tell me what I am missing?
thanks!
update:
I have written an acceptable reduce function for OMP. The performance is now close to that of the MPI reduce function. The code is as follows.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
double darr[2][64];
int nreduce=0;
#pragma omp threadprivate(nreduce)
double OMP_Allreduce_dsum(double loc_dot,int tid,int np)
{
darr[nreduce][tid]=loc_dot;
#pragma omp barrier
double dsum =0;
int i;
for (i=0; i<np; i++)
{
dsum += darr[nreduce][i];
}
nreduce=1-nreduce;
return dsum;
}
int main(int argc, char* argv[])
{
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
double ttime = -omp_get_wtime();
int n = 10000;
int repeat = 10000;
#pragma omp parallel
{
int tid = omp_get_thread_num();
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = tid * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double sum = 1;
double time = -omp_get_wtime();
int j, k;
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
double dot =OMP_Allreduce_dsum(loc_dot,tid,np);
sum +=(dot/(double)(n));
}
time += omp_get_wtime();
#pragma omp master
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
}
return 0;
}

First of all, this code is very sensitive to synchronization overheads (both software and hardware), which results in apparently strange behavior tied to both the OpenMP runtime implementation and low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required at each iteration of the j-based loop, and the whole loop executes in about 45 ms. With 10000 iterations, this means 4.5 us/iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and broadcast. If each core accumulates its own value in a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us, since this process is done sequentially on x86 processors so far. This small additional time represents an overhead of 43% on the overall execution time because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, but in a way that could scale a bit better).
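The arithmetic in the paragraph above can be reproduced with two small helpers. This is only a sketch: the function names are made up for illustration, and the 60 ns per atomic add is the example figure from this answer, not a measured value.

```c
#include <assert.h>

/* Time available per outer iteration, in seconds, given the total loop
 * time and the number of iterations. */
double time_per_iteration(double loop_time_s, int iterations)
{
    assert(iterations > 0);
    return loop_time_s / iterations;
}

/* Fraction of one iteration consumed if `cores` atomic adds of
 * `atomic_add_s` seconds each are serialized one after another. */
double serialized_atomic_fraction(int cores, double atomic_add_s, double iter_s)
{
    assert(cores > 0 && iter_s > 0.0);
    return cores * atomic_add_s / iter_s;
}
```

With a 45 ms loop over 10000 iterations, time_per_iteration gives 4.5 us, and serialized_atomic_fraction(32, 60e-9, 4.5e-6) gives roughly 0.43, which is the 43% quoted above.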
The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, as does omp single. The reduction itself can be implemented in several ways. The OpenMP runtime of ICC uses a clever tree-based atomic implementation which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core updating it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value from the L3 cache directly, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, as the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to each core so they can start a new iteration.
The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory given the algorithm), and caches are better used. The accumulation part may not be ideal, as all cores will likely fetch data previously located in all other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation may scale poorly, but it should not be fully sequential either.
Note that the last OpenMP implementation suffers from false sharing. Indeed, items of darr are stored contiguously in memory and share the same cache line. As a result, when a thread writes to darr, the associated core requests the cache line and invalidates the copies located on other cores. This causes cache-line bouncing between cores. However, on current x86 processors, cache lines are 64 bytes wide and a double variable takes 8 bytes, resulting in 8 items per cache line. This typically limits the cache-line bouncing to 8 cores out of the 32. That being said, the item packing has some benefits, as only 4 cache-line fetches are required per core to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and reserve some space between items so that 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier can be merged together for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
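The padding idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the 64-byte line size is an assumption about the target processor, and padded_double, slots, publish_partial and sum_partials are hypothetical names standing in for darr and the body of OMP_Allreduce_dsum from the question.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64   /* assumed cache-line size */
#define MAX_THREADS 64        /* same capacity as darr in the question */

/* One partial sum per 64-byte slot. Consecutive value fields sit 64 bytes
 * apart, so no two of them can share a cache line. */
typedef struct {
    double value;
    char pad[CACHE_LINE_BYTES - sizeof(double)];
} padded_double;

/* Two banks, alternated between reductions as in the question's code. */
static padded_double slots[2][MAX_THREADS];

/* Each thread publishes its partial sum into its own slot. */
void publish_partial(int bank, int tid, double partial)
{
    assert(bank == 0 || bank == 1);
    assert(tid >= 0 && tid < MAX_THREADS);
    slots[bank][tid].value = partial;
}

/* Sums the first np slots of a bank. In the threaded code this must only
 * run after a barrier that follows every thread's publish_partial call. */
double sum_partials(int bank, int np)
{
    assert(bank == 0 || bank == 1);
    assert(np >= 1 && np <= MAX_THREADS);
    double total = 0.0;
    for (int i = 0; i < np; i++) {
        total += slots[bank][i].value;
    }
    return total;
}
```

In the question's reduce function, the write to darr[nreduce][tid] would become a publish_partial call and the summation loop would become sum_partials, with the existing #pragma omp barrier kept between the two. The trade-off is the one described above: each core now fetches one cache line per thread instead of 4 lines for 32 threads.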
Note that all implementations suffer from the very frequent thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage device, user, system processes, etc.). One critical issue is frequency scaling on modern x86 processors: not all cores run at the same frequency, and their frequencies change over time. The slowest thread will slow down all the others because of the barrier. In the worst case, some threads may wait passively, allowing some cores to sleep (C-states) and then take more time to wake up, slowing the others down further depending on the platform configuration.
The takeaway is:
the more synchronized a code is, the lower its scaling and the more challenging its optimization.

openCL Kernel segmentation fault in pi calculation

Good evening all,
I am trying to design an openCL kernel to calculate pi. This is a school assignment and we were told to use this equation:
Pi/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...
Here is my kernel design that is currently generating a segfault and I am not sure why:
__kernel void calculatePi(int numIterations, __global float *outputPi, __local float* local_result, int numWorkers)
{
// Get global ID for worker
const uint gid = get_global_id(0);
const uint lid = get_local_id(0);
const uint offset = numIterations*gid*2;
float sum = 0.0f;
for (int i = 0; i < numWorkers; i++)
{
local_result[i] = 0.0f;
}
barrier(CLK_LOCAL_MEM_FENCE);
for (int i=0; i<numIterations; i++)
{
if (i % 2 == 0)
sum += 1 / (1 + 2*i + offset);
else
sum -= 1 / (1 + 2*i + offset);
}
local_result[gid] = sum;
barrier(CLK_GLOBAL_MEM_FENCE);
if (lid == 0)
{
outputPi[0] = 0;
for (int i = 0; i < numWorkers; i++)
{
outputPi[0] += local_result[i];
}
outputPi[0] *= 4;
}
}
Basically, my thought process was to have 16 workers in parallel. Each worker will take numIterations terms and determine a partial calculation of pi. In this case, I'm also using 16 for numIterations. The terms alternate, so for every odd term I subtract and for every even term I add. The first worker is responsible for calculating the first 16 terms, the next worker the next 16 terms, and so on, to create 16 partial sums of 16 terms each. Once each worker has calculated its partial sum, I have the first worker take all of the partial sums and add them up to send out. I also multiply by 4 to complete the equation.
My issue is that I keep getting a segmentation fault within my main program at the following line:
ret = clEnqueueReadBuffer(command_queue, result_buffer, CL_TRUE, 0, sizeof(result), &result, 0, NULL, NULL);
Here are the other uses of "result" that could be causing this issue:
float result[1] = {0}; // Initialized at top of main
/* Create buffers to hold the text characters and count */
cl_mem result_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(result), result, NULL);
printf("Final calculated value: %f \n", result[0]);
Can anyone give me insight as to why I'm getting the segfault when trying to read the result buffer back into result?
The full code can be seen within my github: https://github.com/TreverWagenhals/TreverWagenhals/tree/master/School/Heterogeneous%20Computing/Lab2
Thanks.
EDIT: I've found the issue that was in my code. I was creating a variable called numWorkers and passing that into one of the kernel arguments, which apparently wasn't correct. In the process of simplifying my code I was able to remove it and use the global_size variable directly, which now resolves the seg fault issue and shows the data each call.
Now I'm having an issue within my Kernel where 4 is being returned instead of the value for pi. I will debug further and create a new question if I can't see the issue. I'll have

Error in parallelization MPI_Allgather

EDITED IN LIGHT OF THE COMMENTS
I am learning MPI and I am doing some exercises to understand some aspects of it. I have written a code that should perform a simple Monte-Carlo simulation.
There are two main loops in it that have to be accomplished: one on the time steps T and a smaller one inside this one on the number of molecules N. So after I attempt to move every molecule the program goes to the next time step.
I tried to parallelize it by dividing the operations on the molecules on the different processors. Unfortunately the code, which works for 1 processor, prints the wrong results for total_E when p>1.
The problem probably lies in the following function and more precisely is given by a call to MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
I completely don't understand why. What am I doing wrong? (besides a primitive parallelization strategy)
My logic was that for every time step I could calculate the moves on the molecules on the different processors. Unfortunately, while I work with the local vectors local_r on the various processors, to calculate the energy difference local_DE, I need the global vector r since the energy of the i-th molecule depends on all the others. Therefore I thought to call MPI_Allgather since I have to update the global vector as well as the local ones.
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
for(i=0;i<n;i++){
local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
}
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
}
return ;
}
Here is the complete "working" code:
#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <mpi.h>
#define N 100
#define L 5.0
#define T_ 5000
#define delta 2.0
void Step(double (*)(double,double),double*,double*,double*,int,int);
double H(double ,double );
double E(double (*)(double,double),double* ,double*,int ,int );
double E_single(double (*)(double,double),double* ,double*,int ,int ,int);
double * pos_ini(void);
double periodic(double );
double dist(double , double );
double sign(double );
int main(int argc,char** argv){
if (argc < 2) {
printf("./program <outfile>\n");
exit(-1);
}
srand48(0);
int my_rank;
int p;
FILE* outfile = fopen(argv[1],"w");
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
MPI_Comm_size(MPI_COMM_WORLD,&p);
double total_E,E_;
int n;
n = N/p;
int t;
double * r = calloc(N,sizeof(double)),*local_r = calloc(n,sizeof(double));
for(t = 0;t<=T_;t++){
if(t ==0){
r = pos_ini();
MPI_Scatter(r,n,MPI_DOUBLE, local_r,n,MPI_DOUBLE, 0, MPI_COMM_WORLD);
E_ = E(H,local_r,r,n,my_rank);
}else{
Step(H,local_r,r,&E_,n,my_rank);
}
total_E = 0;
MPI_Allreduce(&E_,&total_E,1,MPI_DOUBLE,MPI_SUM,MPI_COMM_WORLD);
if(my_rank == 0){
fprintf(outfile,"%d\t%lf\n",t,total_E/N);
}
}
MPI_Finalize();
return 0;
}
double sign(double a){
if(a < 0){
return -1.0 ;
}else{
return 1.0 ;
}
}
double periodic(double a){
if(sqrt(a*a) > L/2.0){
a = a - sign(a)*L;
}
return a;
}
double dist(double a, double b){
double d = a-b;
d = periodic(d);
return sqrt(d*d);
}
double * pos_ini(void){
double * r = calloc(N,sizeof(double));
int i;
for(i = 0;i<N;i++){
r[i] = ((double) lrand48()/RAND_MAX)*L - L/2.0;
}
return r;
}
double H(double a,double b){
if(dist(a,b)<2.0){
return exp(-dist(a,b)*dist(a,b))/dist(a,b);
}else{
return 0.0;
}
}
double E(double (*H)(double,double),double* local_r,double*r,int n,int my_rank){
double local_V = 0;
int i;
for(i = 0;i<n;i++){
local_V += E_single(H,local_r,r,i,n,my_rank);
}
local_V *= 0.5;
return local_V;
}
double E_single(double (*H)(double,double),double* local_r,double*r,int i,int n,int my_rank){
double local_V = 0;
int j;
for(j = 0;j<N;j++){
if( (i + n*my_rank) != j ){
local_V+=H(local_r[i],r[j]);
}
}
return local_V;
}
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
for(i=0;i<n;i++){
local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
}
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
}
return ;
}

You cannot expect to get the same energy given a different number of MPI processes for one simple reason: the configurations generated are very different depending on how many processes there are. The reason is not MPI_Allgather, but the way the Monte-Carlo sweeps are performed.
Given one process, you attempt to move atom 1, then atom 2, then atom 3, and so on, until you reach atom N. Each attempt sees the configuration resulting from the previous one, which is fine.
Given two processes, you attempt to move atom 1 while at the same time attempting to move atom N/2. Neither atom 1 sees the eventual displacement of atom N/2 nor the other way round, but then atoms 2 and N/2+1 see the displacement of both atom 1 and atom N/2. You end up with two partial configurations that you simply merge with the all-gather. This is not equivalent to the previous case when a single process does all the MC attempts. The same applies for the case of more than two processes.
There is another source of difference - the pseudo-random number (PRN) sequence. The sequence produced by the repeated calls to lrand48() in one process is not the same as the combined sequence produced by multiple independent calls to lrand48() in different processes, therefore even if you sequentialise the trials, still the acceptance will differ due to the locally different PRN sequences.
Forget about the specific values of the energy produced after each step. In a proper MC simulation those are insignificant. What counts is the average value over a large number of steps. Those should be the same (within a certain margin proportional to 1/sqrt(N)) no matter the update algorithm used.
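The sweep-order argument can be seen in a deterministic toy model. The sketch below is not the Monte-Carlo code from the question: toy_move is a made-up update rule chosen only because its result depends on which earlier updates are visible, and sweep_split emulates several processes inside one program instead of using MPI.

```c
#include <assert.h>

#define MAX_PROCS 16  /* illustrative cap on the number of emulated processes */

/* Toy "move": the new value of element i depends on the current value of
 * its left neighbour, so the result depends on which updates are visible. */
static int toy_move(const int *r, int n_total, int i)
{
    return r[(i + n_total - 1) % n_total] + 1;
}

/* One process: every move sees all earlier moves of the same sweep. */
void sweep_sequential(int *r, int n_total)
{
    for (int i = 0; i < n_total; i++) {
        r[i] = toy_move(r, n_total, i);
    }
}

/* Emulates `procs` processes that each own n_total / procs elements.
 * At step i every process makes its i-th move from the same shared state,
 * and the results only become visible together (the all-gather). */
void sweep_split(int *r, int n_total, int procs)
{
    assert(procs >= 1 && procs <= MAX_PROCS && n_total % procs == 0);
    int n = n_total / procs;
    int staged[MAX_PROCS];
    for (int i = 0; i < n; i++) {
        for (int p = 0; p < procs; p++) {
            staged[p] = toy_move(r, n_total, p * n + i);
        }
        for (int p = 0; p < procs; p++) {
            r[p * n + i] = staged[p];
        }
    }
}
```

Starting from four zeros, the sequential sweep ends with 1, 2, 3, 4, while the two-process split ends with 1, 2, 1, 2. Both sweeps are valid, but they do not produce the same configuration, which is why the per-step energies differ.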

It's been quite a long time since I last used MPI, but it seems that your program halts when you try to "gather" and update the data in only some of the processes, and it is unpredictable which processes will need to do the gathering.
So in this case a simple solution is to let the rest of the processes send some dummy data that the others can simply ignore. For instance,
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX ) {
(*E_) += local_DE;
local_r[i] = local_rt[i];
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
// filter out the dummy data out of "r" here
} else {
MPI_Allgather(dummy_sendbuf, n, MPI_DOUBLE, dummy_recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD);
}
The dummy data could be some exceptional, invalid numbers that can never appear in the results, so that other processes can filter them out.
But as I mentioned, this is quite wasteful, as you don't really need to receive that much data from all processes, and we would like to avoid it, especially when there is a lot of data to send.
In this case, you can gather some "flags" from the other processes so that every process knows which ones have data to send.
// pseudo codes
// for example, place 1 at local_flags[my_rank] if it's got data to send, otherwise 0
MPI_Allgather(local_flags, n, MPI_BYTE, recv_flags, n, MPI_BYTE, MPI_COMM_WORLD)
// so now all the processes know which processes will send
// receive data from those processes
MPI_Allgatherv(...)
I remember that with MPI_Allgatherv you can specify the number of elements to receive from each process. Here's an example: http://mpi.deino.net/mpi_functions/MPI_Allgatherv.html
But bear in mind that this might be overkill if the program is not well parallelized. For example, in your case, this is placed inside a loop, so the processes without data still need to wait for the next gathering of the flags.

You should move MPI_Allgather() outside the for loop. I tested with the following code, but note that I modified the lines involving RAND_MAX in order to get consistent results. As a result, the code gives the same answer for 1, 2, and 4 processors.
void Step(double (*H)(double,double),double* local_r,double* r,double *E_,int n,int my_rank){
int i;
double* local_rt = calloc(n,sizeof(double));
double local_DE;
for(i=0;i<n;i++){
//local_rt[i] = local_r[i] + delta*((double)lrand48()/RAND_MAX-0.5);
local_rt[i] = local_r[i] + delta*((double)lrand48()-0.5);
local_rt[i] = periodic(local_rt[i]);
local_DE = E_single(H,local_rt,r,i,n,my_rank) - E_single(H,local_r,r,i,n,my_rank);
//if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48()/RAND_MAX )
if ( local_DE <= 0.0 || exp(-local_DE) > (double) lrand48() )
{
(*E_) += local_DE;
local_r[i] = local_rt[i];
}
}
MPI_Allgather(local_r,n,MPI_DOUBLE,r,n,MPI_DOUBLE,MPI_COMM_WORLD);
return ;
}

Dividing processes evenly among threads

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below does divide all the work evenly, except for the last case, where I throw in what is left over.
Is there a way to fix this so that the work is spread more evenly?
void main(void)
{
int processBegin[100];
int processEnd[100];
int activeProcessCount = 6243;
int threadCount = 24;
int processsInBundle = (int) (activeProcessCount / threadCount);
int processBalance = activeProcessCount - (processsInBundle * threadCount);
for (int i = 0; i < threadCount; ++i)
{
processBegin[ i ] = i * processsInBundle;
processEnd[ i ] = (processBegin[ i ] + processsInBundle) - 1;
}
processEnd[ threadCount - 1 ] += processBalance;
FILE *debug = fopen("s:\\data\\testdump\\debug.csv", WRITE);
for (int i = 0; i < threadCount; ++i)
{
int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
}
fclose(debug);
}

Give each of the first activeProcessCount % threadCount threads processInBundle + 1 processes, and give each of the others processInBundle processes.
int processInBundle = (int) (activeProcessCount / threadCount);
int processSoFar = 0;
for (int i = 0; i < activeProcessCount % threadCount; i++){
processBegin[i] = processSoFar;
processSoFar += processInBundle + 1;
processEnd[i] = processSoFar - 1;
}
for (int i = activeProcessCount % threadCount; i < threadCount; i++){
processBegin[i] = processSoFar;
processSoFar += processInBundle;
processEnd[i] = processSoFar - 1;
}
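The same split can also be written in closed form, so that any thread can compute its own range without the running counter. This is a sketch: chunk_begin and chunk_end are hypothetical helper names, and the rule is the one described above, where the first (items % workers) workers take one extra item.

```c
#include <assert.h>

/* First item index owned by worker i when `items` items are split over
 * `workers` workers and the first (items % workers) workers take one extra.
 * Passing i == workers returns `items`, the one-past-the-end index. */
int chunk_begin(int items, int workers, int i)
{
    assert(items >= 0 && workers > 0 && i >= 0 && i <= workers);
    int base = items / workers;
    int extra = items % workers;
    return i * base + (i < extra ? i : extra);
}

/* Last item index (inclusive) owned by worker i.
 * A result smaller than chunk_begin(...) means the worker has no items. */
int chunk_end(int items, int workers, int i)
{
    assert(i >= 0 && i < workers);
    return chunk_begin(items, workers, i + 1) - 1;
}
```

For the 6243 processes and 24 threads from the question, the first three threads get 261 processes each and the remaining 21 get 260, so no thread differs from another by more than one.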

That's the same problem as trying to divide 5 pennies among 3 people. It's just impossible unless you can saw the pennies in half.
Also, even if all processes need an equal amount of theoretical runtime, it doesn't mean that they will be executed in the same amount of time, due to kernel scheduling, cache performance and various other hardware-related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (which can be of size 1) and have your threads take one batch at a time, run it, then take the next one. This way the threads will always be working until all batches are gone.
More advanced is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool. OpenMP refers to this as guided scheduling.
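A minimal sketch of the dynamic-scheduling idea, assuming C11 atomics are available: threads share one counter and each claims the next batch with an atomic add. The work_pool type and the pool_init and pool_take names are made up for illustration, and the thread creation itself is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared work pool: items [0, total) are handed out in fixed-size batches. */
typedef struct {
    atomic_int next;   /* index of the first item not yet handed out */
    int total;
    int batch;
} work_pool;

void pool_init(work_pool *pool, int total, int batch)
{
    assert(total >= 0 && batch > 0);
    atomic_init(&pool->next, 0);
    pool->total = total;
    pool->batch = batch;
}

/* Claims the next batch. Returns false once the pool is drained.
 * On success the caller owns items [*begin, *end). */
bool pool_take(work_pool *pool, int *begin, int *end)
{
    int start = atomic_fetch_add(&pool->next, pool->batch);
    if (start >= pool->total) {
        return false;
    }
    int stop = start + pool->batch;
    *begin = start;
    *end = stop < pool->total ? stop : pool->total;
    return true;
}
```

Each worker thread would loop on pool_take until it returns false, processing the returned range in between. A smaller batch balances the load better but hits the shared counter more often; guided scheduling reduces that cost by starting with large batches and shrinking them, which this sketch does not attempt.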

Segfault when running openmpi

I am currently working on a project where I need to implement a parallel fft algorithm using openmpi. I have a compiling piece of code, but when I run it over the cluster I get segmentation faults.
I have my hunches about where things are going wrong, but I don't think I have enough of an understanding of pointers and references to be able to make an efficient fix.
The first chunk that could be going wrong is in the passing of the arrays to the helper functions. I believe that either my looping is inconsistent, or I am not understanding how to pass these pointers and get back the things I need.
The second possible spot would be within the actual MPI_Send/MPI_Recv calls. I am sending a type that is not supported by the Open MPI C datatypes, so I am using the MPI_BYTE type to send the raw data instead. Is this a viable option? Or should I be looking into an alternative to this method?
/* function declarations */
double complex get_block(double complex c[], int start, int stop);
double complex put_block(double complex from[], double complex to[],
int start, int stop);
void main(int argc, char **argv)
{
/* Initialize MPI */
MPI_Init(&argc, &argv);
double complex c[N/p];
int myid;
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
//printf("My id is %d\n",myid);
MPI_Status status;
int i;
for(i=0;i<N/p;i++){
c[i] = 1.0 + 1.0*I;
}
int j = log(p)/log(2) + 1;
double q;
double complex z;
double complex w = exp(-2*PI*I/N);
double complex block[N/(2*p)]; // half the size of chunk c
int e,l,t,k,m,rank,plus,minus;
int temp = (log(N)-log(p))/log(2);
//printf("temp = %d", temp);
for(e = 0; e < (log(p)/log(2)); e++){
/* loop constants */
t = pow(2,e); l = pow(2,e+temp);
q = n/2*l; z = cpow(w,(complex)q);
j = j-1; int v = pow(2,j);
if(e != 0){
plus = (myid + p/v)%p;
minus = (myid - p/v)%p;
} else {
plus = myid + p/v;
minus = myid - p/v;
}
if(myid%t == myid%(2*t)){
MPI_Recv((char*)&c,
sizeof(c),
MPI_BYTE,
plus,
MPI_ANY_TAG,
MPI_COMM_WORLD,
&status);
/* transform */
for(k = 0; k < N/p; k++){
m = (myid * N/p + k)%l;
c[k] = c[k] + c[k+N/v] * cpow(z,m);
c[k+N/v] = c[k] - c[k + N/v] * cpow(z,m);
printf("(k,k+N/v) = (%d,%d)\n",k,k+N/v);
}*/
printf("\n\n");
/* end transform */
*block = get_block(c, N/v, N/v + N/p + 1);
MPI_Send((char*)&block,
sizeof(block),
MPI_BYTE,
plus,
1,
MPI_COMM_WORLD);
} else {
// send data of this PE to the (i- p/v)th PE
MPI_Send((char*)&c,
sizeof(c),
MPI_BYTE,
minus,
1,
MPI_COMM_WORLD);
// after the transformation, receive data from (i-p/v)th PE
// and store them in c:
MPI_Recv((char*)&block,
sizeof(block),
MPI_BYTE,
minus,
MPI_ANY_TAG,
MPI_COMM_WORLD,
&status);
*c = put_block(block, c, N/v, N/v + N/p - 1);
//printf("Process %d send/receive %d\n",myid, plus);
}
}
/* shut down MPI */
MPI_Finalize();
}
/* helper functions */
double complex get_block(double complex *c, int start, int stop)
{
double complex block[stop - start + 1];
//printf("%d = %d\n",sizeof(block)/sizeof(double complex), sizeof(&c)/sizeof(double complex));
int j = 0;
int i;
for(i = start; i < stop+1; i++){
block[j] = c[i];
j = j+1;
}
return *block;
}
double complex put_block(double complex from[], double complex to[], int start, int stop)
{
int j = 0;
int i;
for(i = start; i<stop+1; i++){
to[i] = from[j];
j = j+1;
}
return *to;
}
I really appreciate the feedback!

You are using arrays and pointers to arrays in the wrong way. For example, you declare an array as double complex block[N], which is fine (although uncommon; in most cases it is better to use malloc), and then you receive into it via MPI_Recv(&block). However, "block" already converts to a pointer to the first element of that array, so "&block" is a pointer to the whole array rather than the element pointer MPI_Recv is meant to get. For a true array the two hold the same address, but the habit breaks as soon as block is itself a pointer (for example a malloc'd buffer or a function parameter), where "&block" really is a pointer to a pointer. Pass block directly or, if you want to use the "&" notation, write &block[0], which gives you the pointer to the first element of the block array.
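One detail worth checking for yourself: for a true array, block, &block[0] and &block all evaluate to the same address and differ only in type, whereas for a pointer variable, &p is the address of the pointer itself. The sketch below demonstrates this with a stand-in function; buffer_address is hypothetical and merely plays the role of a routine that takes a raw buffer, such as the first argument of MPI_Recv.

```c
#include <assert.h>

/* Stand-in for a routine that takes a raw buffer (like the first argument
 * of MPI_Recv). It simply returns the address it was given. */
const void *buffer_address(const void *buf)
{
    return buf;
}

/* Returns 1 if block, &block[0] and &block all give the same address
 * when block is a true array. */
int array_spellings_agree(void)
{
    double block[8] = {0};
    const void *first = buffer_address(&block[0]);
    return buffer_address(block) == first && buffer_address(&block) == first;
}

/* Returns 1 if &p differs from p when p is a pointer variable. */
int pointer_spellings_differ(void)
{
    double storage[8] = {0};
    double *p = storage;
    return buffer_address(&p) != buffer_address(p);
}

/* Runs both checks and aborts if either expectation fails. */
void check_buffer_spellings(void)
{
    assert(array_spellings_agree());
    assert(pointer_spellings_differ());
}
```

Passing the array name or &block[0] is the spelling that stays correct if the buffer is later changed from a fixed-size array to a malloc'd pointer.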

Have you tried debugging your code? This can be a pain in a parallel setting, but it can tell you exactly where it is failing and usually also why.
If you're using Linux or OS X, you could run your code as follows on the command line:
mpirun -np 4 xterm -e gdb -ex run --args ./yourprog yourargs
where I'm assuming yourprog is the name of your program and yourargs are any command-line arguments you want to pass.
What this command will do is launch four xterm windows. Each xterm will in turn launch gdb as specified by the option -e. gdb will then execute the command run as specified by the option -ex and launch your executable with the given options, as specified by --args.
What you get are four xterm windows running four instances of your program in parallel with MPI. If any of the instances crashes, gdb will tell you where and why.
