How to measure overall performance of parallel programs (with papi) - c

I have been wondering what the best way is to measure the performance (in FLOPS) of a parallel program. I read about PAPI_flops, which seems to work fine for a serial program, but I don't know how to measure the overall performance of a parallel program.
I would like to measure the performance of a BLAS/LAPACK function, gemm in my example below. But I also want to measure other functions, especially ones where the number of operations is not known. (For gemm the operation count is known, ops(gemm) = 2*n^3, so I could calculate the performance from the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
This is my example:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include "omp.h"
#include "papi.h"
int main(int argc, char *argv[] )
{
int i, j, l, k, n, m, idx, iter;
int mat, mat_min, mat_max;
int threads;
double *A, *B, *C;
double alpha =1.0, beta=0.0;
float rtime1, rtime2, ptime1, ptime2, mflops;
long long flpops;
#pragma omp parallel
{
#pragma omp master
threads = omp_get_num_threads();
}
if(argc < 4){
printf("pass me 3 arguments!\n");
return( -1 );
}
else
{
mat_min = atoi(argv[1]);
mat_max = atoi(argv[2]);
iter = atoi(argv[3]);
}
m = mat_max; n = mat_max; k = mat_max;
printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
" A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
A = (double *) malloc( m*k * sizeof(double) );
B = (double *) malloc( k*n * sizeof(double) );
C = (double *) malloc( m*n * sizeof(double) );
printf (" Initializing matrix data \n\n");
for (i = 0; i < (m*k); i++)
A[i] = (double)(i+1);
for (i = 0; i < (k*n); i++)
B[i] = (double)(-i-1);
memset(C,0,m*n*sizeof(double));
// actual measurement
for(mat=mat_min;mat<=mat_max;mat+=5)
{
m = mat; n = mat; k = mat;
for( idx=-1; idx<iter; idx++ ){
PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
m, n, k, alpha, A, k, B, n, beta, C, n);
PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
}
printf("%d threads: %d in %f sec, %f MFLOPS\n",threads,mat,rtime2-rtime1,mflops);fflush(stdout);
}
printf("Done\n");fflush(stdout);
free(A);
free(B);
free(C);
return 0;
}
This is one output (for matrix size 200):
1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS
From the execution time we can see that gemm scales. But the FLOPS figure I am measuring is only the performance of thread 0.
My question is: how can I measure the overall performance? I am grateful for any input.

First, I'm just curious: why do you need the FLOPS? Don't you just care how much time is taken, or maybe the time taken compared to other BLAS libraries?
PAPI counters are per thread, so PAPI is not much help on its own here.
What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (hyper-threading is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds on 1 core should become 2.5 seconds on 4 cores.
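A minimal sketch of that "measure around the call" approach, assuming a POSIX system. The helper names wall_seconds and gemm_mflops are hypothetical, and the 2*n^3 operation count is the one given in the question. A wall clock is used on purpose: on POSIX systems clock() reports CPU time summed over all threads of the process, so it does not shrink when more threads are used.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <time.h>

/* Wall-clock time in seconds, read from a monotonic clock. */
double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Overall MFLOPS of an n x n x n gemm, assuming 2*n^3 operations. */
double gemm_mflops(int n, double seconds) {
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e6;
}

/* Usage sketch around the call from the question:
 *   double t0 = wall_seconds();
 *   cblas_dgemm(...);
 *   double t1 = wall_seconds();
 *   printf("%f MFLOPS\n", gemm_mflops(n, t1 - t0));
 */
```

Because the time is taken in the calling thread around the whole call, the result covers all threads the library spawns. It only works for routines whose operation count is known in advance.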
Other than that, there are 2 things you can do to really measure it:
1. Keep what you use now, but inject your start/end measurement code around the code the BLAS threads run. One way to do that (on Linux) is to preload a library that defines pthread_create, so that your own function calls the original but takes some extra measurements. Another way is to override the function pointer while the process is already running (a trampoline). On Linux the pointer lives in the GOT/PLT; on Windows it's more complicated, so look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about or, better yet, the number of floating-point instructions executed. One complication is that SSE instructions multiply or add 2 or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
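The accounting for packed instructions in option 2 can be sketched as follows. The function name and its parameters are hypothetical, and the instruction counts are assumed to come from whatever profiler is used. A 128-bit SSE2 instruction operates on 2 doubles and a 256-bit AVX instruction on 4, so lanes would be 2 or 4 under the "maximum operands" assumption.

```c
/* Estimated floating-point operations from instruction counts.
 * scalar_instr: scalar FP instructions (one operation each)
 * packed_instr: packed (SIMD) FP instructions
 * lanes:        doubles per packed instruction, assumed always full */
long long estimate_flops(long long scalar_instr, long long packed_instr, int lanes) {
    return scalar_instr + packed_instr * (long long)lanes;
}
```

Since packed instructions are assumed to use every lane, the result is an upper estimate rather than an exact count.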

Related

Increase of execution time while using multithreaded FFTW

I am new to the FFTW library. I have successfully implemented 1D and 2D FFTs using FFTW. I then converted my 2D FFT code into a multithreaded 2D FFT, but the results were the opposite of what I expected: the multithreaded 2D FFT code takes longer to run than the serial 2D FFT code. I must be missing something somewhere. I followed all the instructions given in the FFTW documentation to parallelize the code.
This is my parallelized 2D FFT C program:
#include <mpi.h>
#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define N 2000
#define M 2000
#define index(i, j) (j + i*M)
int i, j;
void get_input(fftw_complex *in) {
for(i=0;i<N;i++){
for(j=0;j<M;j++){
in[index(i, j)][0] = sin(i + j);
in[index(i, j)][1] = sin(i * j);
}
}
}
void show_out(fftw_complex *out){
for(i=0;i<N;i++){
for(j=0;j<M;j++){
printf("%lf %lf \n", out[index(i, j)][0], out[index(i, j)][1]);
}
}
}
int main(){
clock_t start, end;
double time_taken;
start = clock();
int a = fftw_init_threads();
printf("%d\n", a);
fftw_complex *in, *out;
fftw_plan p;
in = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
out = (fftw_complex *)fftw_malloc(N * M * sizeof(fftw_complex));
get_input(in);
fftw_plan_with_nthreads(4);
p = fftw_plan_dft_2d(N, M, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(p);
/*p = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(p);
puts("In Real Domain");
show_out(out);*/
fftw_destroy_plan(p);
fftw_free(in);
fftw_free(out);
fftw_cleanup_threads();
end = clock();
time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("%g \n", time_taken);
return 0;
}
Can someone please help me by pointing out what mistake I am making?
That kind of behavior is typical of incorrect binding.
Generally speaking, OpenMP threads should all be bound to cores of the same socket in order to avoid NUMA effects (which can make performance suboptimal or even worse).
Also, make sure MPI tasks are correctly bound (one task should be bound to several cores from the same socket, and you should use one OpenMP thread per core).
Because of MPI, there is a risk that your OpenMP threads end up time sharing.
First, I recommend you start by printing both the MPI and the OpenMP binding.
How to achieve that depends on both the MPI library and the OpenMP runtime. If you use Open MPI and Intel compilers, you can run KMP_AFFINITY=verbose mpirun --report-bindings --tag-output ...
Then, as suggested earlier, I recommend you start easy and increase complexity:
1. 1 MPI task and 1 OpenMP thread
2. 1 MPI task and x OpenMP threads (x is the number of cores on one socket)
3. x MPI tasks and 1 OpenMP thread per task
4. x MPI tasks and y OpenMP threads per task
Hopefully, 2 will be faster than 1, and 4 will be faster than 3.

Parallelization for Monte Carlo pi approximation

I am writing a C program to parallelize a pi approximation with OpenMP. I think my code works fine and gives convincing output. I am running it with 4 threads now. What I am not sure about is whether this code is vulnerable to a race condition, and if it is, how do I coordinate the threads in this code?
The code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
double sample_interval(double a, double b) {
double x = ((double) rand())/((double) RAND_MAX);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int i;
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
{
srand(time(NULL));
for (int i=0; i < N; ++i)
{
x = sample_interval(-1.,1.);
y = sample_interval(-1.,1.);
z = ((x*x)+(y*y));
if (z<= 1)
{
counter++;
}
}
}
double approx_pi = 4.0 * counter/ (double)N;
printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
return 0;
}
Also, I was wondering whether the seed for the random number generator should be set inside or outside the parallel region. My output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
Which looks OK for now. Your suggestions on the race condition and seed placement are most welcome.
There are a few problems in your code that I can see. The main one is from my standpoint that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP while compiling it. Here is the way one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to be 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty-much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time for getting a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for for example)
But that is if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator, one that allows you to privatize its state. That could be rand_r(), for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in terms of randomness and periodicity. As you increase your number of tries, you'll rapidly exceed the period of the RNG and repeat the same sequence over and over again. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue, in the sense that you want your parallel sequences to be uncorrelated with each other. Just using a different seed value per thread doesn't guarantee that (although with a wide-enough RNG, you have a bit of headroom for that).
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per-thread based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to finally have overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
typedef struct drand48_data RNGstate;
double sample_interval(double a, double b, RNGstate *state) {
double x;
drand48_r(state, &x);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
time_t ctime = time(NULL);
#pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
{
RNGstate state;
srand48_r(ctime+omp_get_thread_num(), &state);
for (int i=0; i < N; ++i) {
x = sample_interval(-1, 1, &state);
y = sample_interval(-1, 1, &state);
z = ((x*x)+(y*y));
if (z<= 1) {
counter++;
}
}
}
// convert to double before multiplying: NumThreads * N overflows int for large N
double approx_pi = 4.0 * counter / ((double)NumThreads * N);
printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp
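The other option mentioned above, distributing the loop over N with a #pragma omp for instead of dividing the result by NumThreads, might look like the sketch below. The names estimate_pi and next_uniform are hypothetical. The small xorshift64* generator is only a self-contained stand-in for a per-thread RNG and is not a recommendation over drand48_r(), and the per-thread seeding carries the same decorrelation caveat as above. The _OPENMP guard lets the same file build serially when OpenMP is not enabled.

```c
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* xorshift64* step: returns a double in [0, 1). State must be non-zero. */
static double next_uniform(uint64_t *state) {
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    /* keep the top 53 bits and scale by 2^-53 */
    return (double)((x * 2685821657736338717ULL) >> 11) / 9007199254740992.0;
}

/* Monte Carlo estimate of pi from n points shared among the threads. */
double estimate_pi(long n, uint64_t seed) {
    long hits = 0;
#pragma omp parallel reduction(+:hits)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* private generator state, derived from the seed and the thread id */
        uint64_t state = seed ^ (0x9E3779B97F4A7C15ULL * (uint64_t)(tid + 1));
        if (state == 0) state = 1;
#pragma omp for
        for (long i = 0; i < n; ++i) {
            double x = 2.0 * next_uniform(&state) - 1.0;
            double y = 2.0 * next_uniform(&state) - 1.0;
            if (x * x + y * y <= 1.0) hits++;
        }
    }
    /* the n iterations were split among threads, so divide by n only */
    return 4.0 * (double)hits / (double)n;
}
```

With the worksharing construct each thread runs only its share of the N iterations, so the total number of samples is N regardless of the thread count.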

Benchmarking, sequential x parallel program. Sublinear speedup?

Update2. Solved! This is memory issue. Some benching about it here:
http://dontpad.com/bench_mem
Update. My goal is to achieve best throughput. All my results are here.
Sequential Results:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdE8yQVNHRkRiQ2VzeElIRWwxMWtRcVE&usp=sharing
Parallel Results*:
https://docs.google.com/spreadsheet/ccc?key=0AjKHxPB2qgJXdEhTb2plT09PNEs3ajBvWUlVaWt0ZUE&usp=sharing
multsoma_par1_vN, N determines how data is accessed by each thread.
N: 1 - NTHREADS displacement, 2 - L1 displacement, 3 - L2 displacement, 4 - TAM/NTHREADS
I am having a hard time trying to figure out why my parallel code runs just slightly faster than the sequential code.
What I basically do is loop through a big array (10^8 elements) of a type (int/float/double) and apply the computation A = A * CONSTANT + B, where A and B are arrays of the same size.
The sequential code only does a single function call.
The parallel version creates pthreads and uses the same function as the starting function.
I am using gettimeofday(), RDTSC() and more recently getrusage() to measure timings. My main results are expressed by Clocks per Element (CPE).
My processor is an i5-3570K. 4 Cores, no hyper-threading.
The problem is that I get about 2.00 CPE with the sequential code, and when going parallel my best result was 1.84 CPE. I know that I incur some overhead by creating pthreads and calling more timing routines, but I don't think this is the reason for not getting better timings.
I measured each thread's CPE and ran the program with 1, 2, 3 and 4 threads. When creating only one thread, I get the expected CPE of around 2.00 (plus some overhead of a few milliseconds, but the overall CPE is not affected at all).
When running with 2 or more threads the main CPE decreases, but each thread's CPE increases.
With 2 threads I get a main CPE of around 1.9 and 3.8 for each thread (why is this not 2.0?!).
The same happens with 3 and 4 threads.
With 4 threads I get a main CPE of around 1.85 (my best timing) and 7.0~7.5 CPE for each thread.
Using more threads than available cores (4), I still get a CPE under 2.0, but not better than 1.85 (most times higher due to overhead).
I suspect that context switching could be the limiting factor here. When running with 2 threads I can count 5 to 10 involuntary context switches for each thread...
But I am not so sure about this. Are those seemingly few context switches enough to almost double my CPE? I was expecting to get at least around 1.00 CPE using all my CPU cores.
I went further on this and analyzed the assembly code for this function. They are identical, except for some extra shifts and adds (4 instructions) at the very beginning of the function and they are out of loops.
In case you want to see some code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <cpuid.h>
typedef union{
unsigned long long int64;
struct {unsigned int lo, hi;} int32;
} tsc_counter;
#define RDTSC(cpu_c) \
__asm__ __volatile__ ("rdtsc" : \
"=a" ((cpu_c).int32.lo), \
"=d" ((cpu_c).int32.hi) )
#define CNST 5
#define NTHREADS 4
#define L1_SIZE 8096
#define L2_SIZE 72512
typedef int data_t;
data_t * A;
data_t * B;
int tam;
double avg_thread_CPE;
tsc_counter thread_t0[NTHREADS], thread_t1[NTHREADS];
struct timeval thread_sec0[NTHREADS], thread_sec1[NTHREADS];
void fillA_B(int tam){
int i;
for (i=0;i<tam;i++){
A[i]=2; B[i]=2;
}
return;
}
void* multsoma_par4_v4(void *arg){
int w;
int i,j;
int *id = (int *) arg;
int limit = tam-14;
int size = tam/NTHREADS;
int tam2 = ((*id+1)*size);
int limit2 = tam2-14;
gettimeofday(&thread_sec0[*id],NULL);
RDTSC(thread_t0[*id]);
//Mult e Soma
for (i=(*id)*size;i<limit2 && i<limit;i+=15){
A[i] = A[i] * CNST + B[i];
A[i+1] = A[i+1] * CNST + B[i+1];
A[i+2] = A[i+2] * CNST + B[i+2];
A[i+3] = A[i+3] * CNST + B[i+3];
A[i+4] = A[i+4] * CNST + B[i+4];
A[i+5] = A[i+5] * CNST + B[i+5];
A[i+6] = A[i+6] * CNST + B[i+6];
A[i+7] = A[i+7] * CNST + B[i+7];
A[i+8] = A[i+8] * CNST + B[i+8];
A[i+9] = A[i+9] * CNST + B[i+9];
A[i+10] = A[i+10] * CNST + B[i+10];
A[i+11] = A[i+11] * CNST + B[i+11];
A[i+12] = A[i+12] * CNST + B[i+12];
A[i+13] = A[i+13] * CNST + B[i+13];
A[i+14] = A[i+14] * CNST + B[i+14];
}
for (; i<tam2 && i<tam; i++)
A[i] = A[i] * CNST + B[i];
RDTSC(thread_t1[*id]);
gettimeofday(&thread_sec1[*id],NULL);
double CPE, elapsed_time;
CPE = ((double)(thread_t1[*id].int64-thread_t0[*id].int64))/((double)(size));
elapsed_time = (double)(thread_sec1[*id].tv_sec-thread_sec0[*id].tv_sec)*1000;
elapsed_time+= (double)(thread_sec1[*id].tv_usec - thread_sec0[*id].tv_usec)/1000;
//printf("Thread %d workset - %d\n",*id,size);
//printf("CPE Thread %d - %lf\n",*id, CPE);
//printf("Time Thread %d - %lf\n",*id, elapsed_time/1000);
avg_thread_CPE+=CPE;
free(arg);
pthread_exit(NULL);
}
void imprime(int tam){
int i;
int ans = 12;
for (i=0;i<tam;i++){
//printf("%d ",A[i]);
//checking...
if (A[i]!=ans) printf("WA!!\n");
}
printf("\n");
return;
}
int main(int argc, char *argv[]){
tsc_counter t0,t1;
struct timeval sec0,sec1;
pthread_t thread[NTHREADS];
double CPE;
double elapsed_time;
int i;
int* id;
tam = atoi(argv[1]);
A = (data_t*) malloc (tam*sizeof(data_t));
B = (data_t*) malloc (tam*sizeof(data_t));
fillA_B(tam);
avg_thread_CPE = 0;
//Start Computing...
gettimeofday(&sec0,NULL);
RDTSC(t0); //Time Stamp 0
for (i=0;i<NTHREADS;i++){
id = (int*) malloc(sizeof(int));
*id = i;
if (pthread_create(&thread[i], NULL, multsoma_par4_v4, (void*)id)) {
printf("--ERRO: pthread_create()\n"); exit(-1);
}
}
for (i=0; i<NTHREADS; i++) {
if (pthread_join(thread[i], NULL)) {
printf("--ERRO: pthread_join() \n"); exit(-1);
}
}
RDTSC(t1); //Time Stamp 1
gettimeofday(&sec1,NULL);
//End Computing...
imprime(tam);
CPE = ((double)(t1.int64-t0.int64))/((double)(tam)); //diferenca entre Time_Stamps/repeticoes
elapsed_time = (double)(sec1.tv_sec-sec0.tv_sec)*1000;
elapsed_time+= (double)(sec1.tv_usec - sec0.tv_usec)/1000;
printf("Main CPE: %lf\n",CPE);
printf("Avg Thread CPE: %lf\n",avg_thread_CPE/NTHREADS);
printf("Time: %lf\n",elapsed_time/1000);
free(A); free(B);
return 0;
}
I appreciate any help.
After seeing the full code, I rather agree with the guess of @nosid in the comments: since the ratio of compute operations to memory loads is low, and the data (about 800 MB, if I am not mistaken) doesn't fit in cache, memory bandwidth is likely the limiting factor. The link to main memory is shared by all cores in a processor, so when its bandwidth is saturated, all memory operations start stalling and take longer; thus CPE increases.
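A rough back-of-the-envelope check of the bandwidth argument can be sketched as below. The helper name is hypothetical. The inputs are assumptions rather than measurements: 12 bytes of traffic per int element (read A, read B, write A) and a nominal clock of 3.4 GHz for the i5-3570K.

```c
/* Memory traffic in GB/s implied by a clocks-per-element (CPE) figure.
 * cpe:            measured cycles per element
 * clock_ghz:      clock rate behind the cycle counter (assumed nominal value)
 * bytes_per_elem: bytes moved to and from memory per element (assumed) */
double implied_bandwidth_gbs(double cpe, double clock_ghz, double bytes_per_elem) {
    double elems_per_ns = clock_ghz / cpe;   /* GHz means cycles per nanosecond */
    return elems_per_ns * bytes_per_elem;    /* bytes per nanosecond equals GB/s */
}
```

Under those assumptions, 2.00 CPE corresponds to roughly 20.4 GB/s. If the machine had, for example, dual-channel DDR3-1600 (25.6 GB/s theoretical peak), a single thread would already be close to what the memory system can deliver, leaving little room for additional cores to help.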
Also, the following place in your code is a data race:
avg_thread_CPE+=CPE;
as you sum up CPE values calculated on different threads to a single global variable without any synchronization.
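One way to avoid that race, sketched under the assumption that each thread only needs to report a single number: give every thread its own slot and add the slots up after pthread_join. The names thread_cpe, worker_arg and average_thread_cpe are hypothetical, and the stored value is a placeholder for the measured per-thread CPE.

```c
#include <pthread.h>

#define NTHREADS 4

/* One slot per thread: no two threads ever write the same location. */
static double thread_cpe[NTHREADS];

typedef struct { int id; } worker_arg;

static void *worker(void *p) {
    const worker_arg *arg = p;
    /* placeholder for the CPE measured inside this thread */
    thread_cpe[arg->id] = 2.0 + arg->id;
    return NULL;
}

/* Runs the workers and returns the average per-thread CPE, or -1.0 on error. */
double average_thread_cpe(void) {
    pthread_t threads[NTHREADS];
    worker_arg args[NTHREADS];
    int started = 0;
    for (int i = 0; i < NTHREADS; i++) {
        args[i].id = i;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);   /* join orders the writes before the reads below */
    if (started != NTHREADS) return -1.0;
    double sum = 0.0;
    for (int i = 0; i < NTHREADS; i++)
        sum += thread_cpe[i];             /* summed by one thread only */
    return sum / NTHREADS;
}
```

Protecting the shared accumulator with a pthread_mutex_t would also work; the per-slot version simply avoids any locking.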
Below I left the part of my initial answer, including the "first statement" referred in the comments. I still find it correct, for the definition of CPE as the number of clocks taken by the operations over a single element.
You should not expect the clocks per element (CPE) metric to decrease due to the use of multiple threads. By definition, it's how fast a single data item is processed, on average. Threading helps to process all data faster (by simultaneous processing on different cores), so the elapsed wall-clock time, i.e. the time to execute the whole program, should be expected to decrease.
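The numbers in the question are consistent with this once the two normalisations in the posted code are compared: each thread divides its cycle count by size = tam/NTHREADS, while main divides by tam. Writing $c$ for the cycles that elapse while the $T$ threads run concurrently (a simplifying assumption that ignores creation and join overhead):

```latex
\mathrm{CPE}_{\text{thread}} = \frac{c}{\mathrm{tam}/T},
\qquad
\mathrm{CPE}_{\text{main}} \approx \frac{c}{\mathrm{tam}}
\quad\Longrightarrow\quad
\mathrm{CPE}_{\text{main}} \approx \frac{\mathrm{CPE}_{\text{thread}}}{T}
```

With $T = 2$ this gives $3.8 / 2 = 1.9$, and with $T = 4$ it gives $7.0/4$ to $7.5/4$, i.e. about 1.75 to 1.88, which matches the reported main CPE of around 1.85.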

MPI wrapper that imitates OpenMP's for-loop pragma

I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region( chunk_size=100 , num_proc=10 );
for( int i=0 ; i<1000 ; i++ )
{
//some computation
}
end_parallel_region();
The code above distributes computation inside the for loop to 10 slave MPI processors.
Upon entering the parallel region, the chunk size and number of slave processors are provided.
Upon leaving the parallel region, the MPI processors are synchronized and put idle.
EDITED in response to High Performance Mark.
I have no intention to simulate the OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathematical functions.
In these mathematical functions, there often exist for loops like the one below.
for( int i=0 ; i<n ; i++ )
{
s = s + sin(x[i]);
}
So I want to first be able to distribute sin(x[i]) to slave processors and at the end reduce to a single variable, just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
Thanks.
There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops. What is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade single-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
EDIT
Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum on to one process using MPI_Reduce.
I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.
Yes, you could do this, for specific tasks. But you shouldn't.
Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
typedef struct state_t {
int globaln;
int localn;
int *locals;
int *offsets;
double *localin;
double *localout;
double (*map)(double);
} state;
state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
state *s = malloc(sizeof(state));
s->globaln = n;
s->map = map;
/* figure out decomposition */
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
s->locals = malloc(size * sizeof(int));
s->offsets = malloc(size * sizeof(int));
s->offsets[0] = 0;
for (int i=0; i<size; i++) {
s->locals[i] = (n+i)/size;
if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
}
/* allocate local arrays */
s->localn = s->locals[rank];
s->localin = malloc(s->localn*sizeof(double));
s->localout = malloc(s->localn*sizeof(double));
/* distribute */
MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
s->localin, s->locals[rank], MPI_DOUBLE,
0, MPI_COMM_WORLD);
return s;
}
double end_parallel_mapandsum(state **s) {
double localanswer=0., answer;
/* sum up local answers */
for (int i=0; i<((*s)->localn); i++) {
localanswer += ((*s)->localout)[i];
}
/* and get global result. Everyone gets answer */
MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
free( (*s)->localin );
free( (*s)->localout );
free( (*s)->locals );
free( (*s)->offsets );
free( (*s) );
return answer;
}
int main(int argc, char **argv) {
int rank;
double *inputs = NULL; /* only rank 0 allocates the send buffer */
double result;
int n=100;
const double pi=4.*atan(1.);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
inputs = malloc(n * sizeof(double));
for (int i=0; i<n; i++) {
inputs[i] = 2.*pi/n*i;
}
}
state *s=begin_parallel_mapandsum(inputs, n, sin);
for (int i=0; i<s->localn; i++) {
s->localout[i] = (s->map)(s->localin[i]);
}
result = end_parallel_mapandsum(&s);
if (rank == 0) {
printf("Calculated result: %lf\n", result);
double trueresult = 0.;
for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
printf("True result: %lf\n", trueresult);
}
MPI_Finalize();
}
That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit in any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just
for( int i=0 ; i<localn ; i++ )
{
s = s + sin(x[i]);
}
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.
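The decomposition loop in begin_parallel_mapandsum above splits n items over size ranks using (n+i)/size. Pulled out as a standalone helper (the name block_decompose is hypothetical), it can be checked without MPI. It fills the per-rank counts and displacements in the form MPI_Scatterv expects.

```c
/* Split n items over `size` ranks as evenly as possible.
 * counts[i]  = number of items owned by rank i, computed as (n + i) / size
 * offsets[i] = index of rank i's first item in the global array
 * Both arrays must have room for `size` entries. */
void block_decompose(int n, int size, int *counts, int *offsets) {
    int next = 0;
    for (int i = 0; i < size; i++) {
        counts[i] = (n + i) / size;   /* integer division; later ranks absorb the remainder */
        offsets[i] = next;
        next += counts[i];
    }
}
```

In the idiomatic approach described above, each rank would compute its own count once at start-up and create or read only its own slice of the data, instead of scattering from rank 0 around every loop.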
Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:
http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project is dealing with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among threads in such a way that the particles are not cached efficiently in each core?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
// printf("%d", omp_get_thread_num());
}
If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering why you aren't getting any speedup from parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have enough total work to make parallelizing worthwhile, so threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).
There are no correctness issues such as data races in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see OpenMP related performance issues in this code. By default, OpenMP will split the loop iterations statically in equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e. use the SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such reorganization assumes that most time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net amount of cache lines loaded is nearly the same.
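The loop above presupposes a data layout that is not shown. A minimal sketch of what the struct-of-arrays declarations and an update routine might look like follows; the type and function names (particle_arrays, simulation_soa, init_soa, update_soa, free_soa) are hypothetical and the initial values are arbitrary.

```c
#include <stdlib.h>

/* Struct of arrays: one contiguous array per particle attribute. */
typedef struct {
    double *pos;
    double *vel;
} particle_arrays;

typedef struct {
    particle_arrays particles;
    double force;
} simulation_soa;

/* Allocates and initialises np particles; returns 0 on success, -1 on failure. */
int init_soa(simulation_soa *s, unsigned np) {
    s->force = 1.;
    s->particles.pos = malloc(np * sizeof(double));
    s->particles.vel = malloc(np * sizeof(double));
    if (!s->particles.pos || !s->particles.vel) {
        free(s->particles.pos);
        free(s->particles.vel);
        return -1;
    }
    for (unsigned i = 0; i < np; i++) {
        s->particles.pos[i] = 1.;
        s->particles.vel[i] = 1.;
    }
    return 0;
}

/* Same update as in the question, on the struct-of-arrays layout. */
void update_soa(simulation_soa *s, unsigned psize, double dt) {
#pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i) {
        s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
        s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt * s->force;
    }
}

void free_soa(simulation_soa *s) {
    free(s->particles.pos);
    free(s->particles.vel);
}
```

Each array is walked with unit stride, which is the access pattern compilers usually find easiest to vectorize. Whether it is actually faster depends on the machine and should be measured.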
How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core nehalem, I get for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds)
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
typedef struct particle_struct {
double pos;
double vel;
} particle;
typedef struct simulation_struct {
particle *particles;
double force;
} simulation;
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
for (unsigned i = 0; i < psize; ++i)
{
s->particles[i].pos = s->particles[i].pos+ dt * s->particles[i].vel;
s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
}
}
void init(simulation *s, unsigned np) {
s->force = 1.;
s->particles = malloc(np*sizeof(particle));
for (unsigned i=0; i<np; i++) {
s->particles[i].pos = 1.;
s->particles[i].vel = 1.;
}
}
int main(void)
{
const unsigned np=200000;
simulation s;
struct timeval clock;
init(&s, np);
tick(&clock);
for (int iter=0;iter< 100; iter++)
update(&s, np, 0.75);
double elapsed=tock(&clock)*1000.;
printf("Took time %lf\n", elapsed);
free(s.particles);
}

Resources