MPI in C: parallelisation of a custom function running in serial

I am a beginner MPI user and I may have made some mistakes in my parallel code for my calculation.
I need to compute an iterative estimation on a large data set and I want to calculate it in parallel using MPI in C.
I made a standard (ANSI) C function ('myFunc') to estimate an element of the output dataset ('param_2') based on the input parameters ('param_1', 'param_3', 'table_1', 'table_2', 'table_3') and the estimate from the previous iteration ('param_2'). The calculation could be done in parallel if we partition the new estimate ('param_2') into chunks.
When I profiled the code, I realised that the calculation starts at almost the same time on every process, but finishes in a serial fashion, one after another (with a fixed time interval between them).
It looks like they are using some shared resource or something like that... I tried to eliminate all concurrency between the processes, but I am afraid I do not have enough experience with MPI to solve the problem.
I thought every MPI process has its own 'copy' of the declared variables and uses them independently of the others, so I do not understand why the processes wait for each other to finish the calculation when they each have their own copy of the parameters...
Here is the simplified version of the code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

#define X 131
#define Y 131
#define Z 150
#define MASTER 0

float table_1[31][8];
float table_2[31][4];
float table_3[31][2];

float myFunc(int i, float *param_1, float *param_2, float param_3);

int main(int argc, char* argv[]) {
    float *param_1;
    float *param_2;
    float param_3;
    float *chunk;
    int file_length = X*Y*Z;
    int taskid, numtasks, chunk_size, it, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    chunk_size = (int)ceil((double)file_length / numtasks);

    /* Allocate memory for the input parameters and the local chunk */
    param_1 = malloc(file_length*sizeof(float));
    param_2 = malloc(file_length*sizeof(float));
    chunk   = malloc(chunk_size*sizeof(float));

    if (taskid == MASTER) {
        /* Read parameters from file (table_1, table_2, table_3, param_1) */
    }

    MPI_Bcast(table_1, 31*8, MPI_FLOAT, MASTER, MPI_COMM_WORLD);
    MPI_Bcast(table_2, 31*4, MPI_FLOAT, MASTER, MPI_COMM_WORLD);
    MPI_Bcast(table_3, 31*2, MPI_FLOAT, MASTER, MPI_COMM_WORLD);
    MPI_Bcast(param_1, file_length, MPI_FLOAT, MASTER, MPI_COMM_WORLD);

    for (it = 0; it < 10; it++) {
        for (i = 0; i < chunk_size; i++) {
            chunk[i] = myFunc((taskid*chunk_size)+i, param_1, param_2, param_3);
        }
        MPI_Gather(chunk, chunk_size, MPI_FLOAT, param_2, chunk_size, MPI_FLOAT, MASTER, MPI_COMM_WORLD);
        MPI_Bcast(param_2, file_length, MPI_FLOAT, MASTER, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    free(...);
    return 0;
}

float myFunc(int i, float *param_1, float *param_2, float param_3) {
    /* Uses the global tables (table_1, table_2, table_3) and some locally declared variables */
    /* No MPI functions here, only math functions */
}
If you have a solution, advice, or a comment, please share it with me; I would be grateful, thank you!

Related

Short MPI program (C code): declaring a variable in root doesn't work

So I have a question regarding my little MPI program. (I am new to programming, so it's probably just some beginner's mistake.)
I made a program which calculates pi using the formula: (1/n) * the sum of 4/(1+(i/n)^2).
The only problem is, I have to define the number of iterations in the root process, but as soon as I put any kind of braces around it, the program doesn't work anymore. Is there any way to define "n" outside of root but give it a value inside the root block? Or is it just a braces problem, and if I set them correctly it will still work fine?
Thanks in advance
Code (the problem starts at "if(process_rank == ROOT)" - yes, there are no braces right now):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
#include <math.h>

#define ROOT 0

void print_time(double time)
{
    printf("TIME:%.10f\n", time * 1000.0); // Print execution time in ms
}

int main(int argc, char *argv[])
{
    int communicator_size, process_rank;
    double pi_appr;
    double PI_ORIGINAL = 3.141592653589793238462643; // Original pi value for comparison
    double i, n;
    double result = 0.0, sum = 0.0, begin = 0.0, end = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &communicator_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_rank);

    // Synchronize all processes and get the begin time
    MPI_Barrier(MPI_COMM_WORLD);
    begin = MPI_Wtime();

    if (process_rank == ROOT)
        n = 1000000; // Root defines the number of computation iterations

    n = 1000000; // if I don't define n here again, it doesn't work

    // Each process calculates a part of the sum
    for (i = process_rank; i < n; i += communicator_size)
    {
        result += 4/(1+((i/n)*(i/n))); // for some reason pow() didn't work
    }

    // Now we sum up the results with MPI_Reduce
    MPI_Reduce(&result, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    // Synchronize all processes and get the end time
    MPI_Barrier(MPI_COMM_WORLD);
    end = MPI_Wtime();

    if (process_rank == ROOT)
    {
        pi_appr = (1/n)*sum; // calculate pi
        printf("%f\n", end-begin); // print the elapsed time (end minus begin)
        printf("Computed pi: %.16f (Error = %.16f)\n", pi_appr, fabs(pi_appr - PI_ORIGINAL));
    }

    MPI_Finalize();
    return 0;
}
You are misunderstanding how MPI works. It has independent processes with distributed memory, so if you initialize a variable only in one process, it will not be initialized in the other processes. You can use an MPI_Bcast call to communicate a value from the root to the other processes.
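For example, here is a minimal sketch of that pattern, keeping the names from the question (ROOT and n) and stripping everything else away:

#include <stdio.h>
#include <mpi.h>

#define ROOT 0

int main(int argc, char *argv[])
{
    int process_rank;
    double n = 0.0;                 // every process has its own n

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_rank);

    if (process_rank == ROOT)
        n = 1000000;                // only the root knows the value here...

    // ...so broadcast it: after this call every rank has n == 1000000
    MPI_Bcast(&n, 1, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);

    printf("rank %d sees n = %.0f\n", process_rank, n);

    MPI_Finalize();
    return 0;
}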

Increase in execution time when using multithreaded FFTW

I am new to the FFTW library. I have successfully implemented 1D and 2D FFTs using it. I converted my 2D FFT code into a multithreaded 2D FFT, but the results were the complete opposite of what I expected: the multithreaded 2D FFT code takes longer to run than the serial 2D FFT code. I am missing something somewhere, although I followed all the instructions given in the FFTW documentation for parallelizing the code.
This is my parallelized 2D FFT C program:
#include <mpi.h>
#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 2000
#define M 2000
#define index(i, j) (j + i*M)

int i, j;

void get_input(fftw_complex *in) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            in[index(i, j)][0] = sin(i + j);
            in[index(i, j)][1] = sin(i * j);
        }
    }
}

void show_out(fftw_complex *out) {
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            printf("%lf %lf \n", out[index(i, j)][0], out[index(i, j)][1]);
        }
    }
}

int main() {
    clock_t start, end;
    double time_taken;

    start = clock();
    int a = fftw_init_threads();
    printf("%d\n", a);

    fftw_complex *in, *out;
    fftw_plan p;

    in = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
    out = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));

    get_input(in);

    fftw_plan_with_nthreads(4);
    p = fftw_plan_dft_2d(N, M, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);

    /*p = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    puts("In Real Domain");
    show_out(out);*/

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup_threads();

    end = clock();
    time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%g \n", time_taken);
    return 0;
}
Can someone please help me by pointing out the mistake I am making?
That kind of behavior is typical of incorrect binding.
Generally speaking, OpenMP threads should all be bound to cores of the same socket in order to avoid NUMA effects (which can make performance suboptimal or even worse).
Also, make sure the MPI tasks are correctly bound (one task should be bound to several cores from the same socket, and you should use one OpenMP thread per core).
Because of MPI, there is a risk your OpenMP threads end up time-sharing.
As a first step, I recommend printing both the MPI and OpenMP bindings.
How to achieve that depends on both the MPI library and the OpenMP runtime. If you use Open MPI and the Intel compilers, you can run KMP_AFFINITY=verbose mpirun --report-bindings --tag-output ...
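If those launcher options are not available to you, a quick sanity check can also be done from inside the program. Here is a minimal hybrid MPI+OpenMP sketch (Linux/glibc-specific, since it relies on sched_getcpu()) that prints where each thread of each rank is currently running:

// Minimal sketch: print which CPU each OpenMP thread of each MPI rank runs on.
// Linux/glibc only (sched_getcpu); build e.g. with: mpicc -fopenmp bindcheck.c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        // Only reports where the thread runs *right now*; without proper
        // binding it may migrate between cores from one call to the next.
        printf("MPI rank %d, OpenMP thread %d/%d, on CPU %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

If threads of different ranks report the same CPU, or all threads of one rank are stacked on a single core, the time-sharing scenario described above is likely what you are seeing.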
Then, as suggested earlier, I recommend you start easy and increase the complexity:
1. 1 MPI task and 1 OpenMP thread
2. 1 MPI task and x OpenMP threads (x is the number of cores on one socket)
3. x MPI tasks and 1 OpenMP thread per task
4. x MPI tasks and y OpenMP threads per task
Hopefully, 2. will be faster than 1., and 4. will be faster than 3.

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a pi approximation with OpenMP. I think my code works fine, with convincing output, and I am running it with 4 threads now. What I am not sure of is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the thread actions in this code?
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

double sample_interval(double a, double b) {
    double x = ((double) rand())/((double) RAND_MAX);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int i;
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
    {
        srand(time(NULL));
        for (int i=0; i < N; ++i)
        {
            x = sample_interval(-1.,1.);
            y = sample_interval(-1.,1.);
            z = ((x*x)+(y*y));
            if (z<= 1)
            {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter/ (double)N;
    printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
    return 0;
}
Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
Which looks OK for now. Your suggestions about the race condition and seed placement are most welcome.
There are a few problems in your code that I can see. The main one, from my standpoint, is that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP when compiling it. Here is how one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time to get a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for, for example, as sketched below).
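For illustration, here is a minimal sketch of that worksharing variant, using rand_r() as a stand-in thread-safe generator (its shortcomings are discussed just below). With #pragma omp for the N iterations are split between the threads instead of being replicated, so the result is divided by N only:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int N = atoi(argv[1]);          // number of points, as in the question
    const double pi = 3.141592653589793;
    double counter = 0;

    #pragma omp parallel reduction(+:counter) num_threads(4)
    {
        // one private seed per thread, so rand_r() has no shared state
        unsigned int seed = (unsigned int)time(NULL) + omp_get_thread_num();

        // the N iterations are shared between the threads, not replicated
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            double x = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
            double y = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
            if (x * x + y * y <= 1.0)
                counter++;
        }
    }

    // the total number of iterations is N, so divide by N (not NumThreads*N)
    double approx_pi = 4.0 * counter / (double)N;
    printf("%i %1.6e %1.6e\n", N, approx_pi, fabs(approx_pi - pi) / pi);
    return 0;
}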
But that is only if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator which allows you to privatize its state. That could be rand_r() for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly go past the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want your parallel sequences to be uncorrelated with each other. And just using a different seed value per thread doesn't guarantee that (although with a wide enough RNG, you have a bit of headroom there).
Anyway, the bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux).
Initialize its state per thread, based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to end up with overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>

typedef struct drand48_data RNGstate;

double sample_interval(double a, double b, RNGstate *state) {
    double x;
    drand48_r(state, &x);
    return (b-a)*x + a;
}

int main (int argc, char **argv) {
    int N = atoi( argv[1] ); // convert command-line input to N = number of points
    int NumThreads = 4;
    const double pi = 3.141592653589793;
    double x, y, z;
    double counter = 0;
    time_t ctime = time(NULL);

    #pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
    {
        RNGstate state;
        srand48_r(ctime+omp_get_thread_num(), &state);
        for (int i=0; i < N; ++i) {
            x = sample_interval(-1, 1, &state);
            y = sample_interval(-1, 1, &state);
            z = ((x*x)+(y*y));
            if (z<= 1) {
                counter++;
            }
        }
    }

    double approx_pi = 4.0 * counter / (NumThreads * N);
    printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
    return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp

How to measure the overall performance of parallel programs (with PAPI)

I asked myself what would be the best way to measure the performance (in flops) of a parallel program. I read about PAPI_flops. This seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.
I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the op count is known, ops(gemm) = 2*n^3, so I could calculate the performance from the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"

int main(int argc, char *argv[] )
{
    int i, j, l, k, n, m, idx, iter;
    int mat, mat_min, mat_max;
    int threads;
    double *A, *B, *C;
    double alpha = 1.0, beta = 0.0;
    float rtime1, rtime2, ptime1, ptime2, mflops;
    long long flpops;

    #pragma omp parallel
    {
        #pragma omp master
        threads = omp_get_num_threads();
    }

    if (argc < 4) {
        printf("pass me 3 arguments!\n");
        return( -1 );
    }
    else {
        mat_min = atoi(argv[1]);
        mat_max = atoi(argv[2]);
        iter = atoi(argv[3]);
    }

    m = mat_max; n = mat_max; k = mat_max;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);

    A = (double *) malloc( m*k * sizeof(double) );
    B = (double *) malloc( k*n * sizeof(double) );
    C = (double *) malloc( m*n * sizeof(double) );

    printf (" Initializing matrix data \n\n");
    for (i = 0; i < (m*k); i++)
        A[i] = (double)(i+1);
    for (i = 0; i < (k*n); i++)
        B[i] = (double)(-i-1);
    memset(C, 0, m*n*sizeof(double));

    // actual measurement
    for (mat = mat_min; mat <= mat_max; mat += 5)
    {
        m = mat; n = mat; k = mat;
        for (idx = -1; idx < iter; idx++) {
            PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n, k, alpha, A, k, B, n, beta, C, n);
            PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
        }
        printf("%d threads: %d in %f sec, %f MFLOPS\n", threads, mat, rtime2-rtime1, mflops);
        fflush(stdout);
    }

    printf("Done\n"); fflush(stdout);
    free(A);
    free(B);
    free(C);
    return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
We can see from the execution time that the gemm function scales. But the flops I am measuring are only the performance of thread 0.
My question is: how can I measure the overall performance? I am grateful for any input.
First, I'm just curious - why do you need the FLOPS? Don't you just care how much time is taken? Or maybe the time taken compared to other BLAS libraries?
PAPI is thread-based, so it is not much help on its own here.
What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (HT is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
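For the gemm case, where the operation count is known (roughly 2*m*n*k flops), a wall-clock timer around the call already gives an overall figure that accounts for all the threads MKL spawns. A minimal sketch of that idea, dropped into your existing loop in place of the PAPI_flops calls, using omp_get_wtime() as the timer (any wall-clock timer would do):

// Sketch: overall MFLOPS from wall-clock time and the known dgemm op count.
// Unlike the per-thread PAPI counters, this covers all threads MKL spawns.
double t0 = omp_get_wtime();
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
            m, n, k, alpha, A, k, B, n, beta, C, n);
double t1 = omp_get_wtime();
double mflops_total = 2.0 * m * n * k / ((t1 - t0) * 1.0e6);
printf("%d threads: %d in %f sec, %f MFLOPS (wall clock)\n",
       threads, m, t1 - t0, mflops_total);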
Other than that, there are 2 things you can do to really measure it:
1. Use whatever you use now, but inject your start/end measurement code around the BLAS code. One way to do that (on Linux) is by pre-loading a lib that defines pthread_create and using your own functions that call the originals but do some extra measurements. Another way is to override the function pointer when the process is already running (a trampoline). On Linux it's in the GOT/PLT; on Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or better yet, report the number of floating point instructions executed. A small problem with this is that SSE instructions multiply or add 2 or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.

C, MPI variable scope

I'm looking at someone else's MPI code and there are a number of places where variables are declared in main() and used in other functions (some MPI-specific). I am new to MPI, but in my programming experience that is normally not supposed to be done. Basically, it is difficult for me to determine whether it is safe to do this (no errors are thrown).
The entire code is quite long, so I will just give a simplified version below:
int main(int argc, char** argv) {
    // ...unrelated code

    int num_procs, local_rank, name_len;
    MPI_Comm comm_new;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(proc_name, &name_len);
    create_ring_topology(&comm_new, &local_rank, &num_procs);

    // ...unrelated code

    MPI_Comm_free(&comm_new);
    MPI_Finalize();
}

void create_ring_topology(MPI_Comm* comm_new, int* local_rank, int* num_procs) {
    MPI_Comm_size(MPI_COMM_WORLD, num_procs);

    int dims[1], periods[1];
    int dimension = 1;
    dims[0] = *num_procs;
    periods[0] = 1;

    int* local_coords = malloc(sizeof(int)*dimension);

    MPI_Cart_create(MPI_COMM_WORLD, dimension, dims, periods, 0, comm_new);
    MPI_Comm_rank(*comm_new, local_rank);
    MPI_Comm_size(*comm_new, num_procs);
    MPI_Cart_coords(*comm_new, *local_rank, dimension, local_coords);

    sprintf(s_local_coords, "[%d]", local_coords[0]);
}
That's just regular pointer usage. Nothing wrong with that.
The variables are declared in main and remain in scope until main returns, i.e. for almost the entire duration of the program.
Note that MPI does not actually add anything to C; it is just an extra library and does not extend the language.
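In plain C terms, create_ring_topology just uses out-parameters: main owns the variables and the callee writes to them through pointers. A minimal sketch of the same idiom without MPI (the names here are made up purely for illustration):

#include <stdio.h>

// The callee writes through the pointers, so the caller's variables are updated.
static void fill_values(int *count, double *ratio)
{
    *count = 42;
    *ratio = 3.14;
}

int main(void)
{
    int count;        // lives in main and stays valid until main returns
    double ratio;

    fill_values(&count, &ratio);    // same idiom as create_ring_topology(&comm_new, ...)
    printf("count = %d, ratio = %.2f\n", count, ratio);
    return 0;
}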
