Creating an optimized reduce on sum in C

I have just started taking courses in HPC and am doing an assignment where we are expected to implement a Reduce function equivalent to MPI_Reduce with MPI_SUM... Easy enough, right? Here is what I did:
I started with the basic concept of sending the data/array from all nodes to the root node (the 0-th ranked process) and computing the sum there.
As a second step I optimized it further so that each process sends its data to its mirror image, which computes the sum, and this process keeps repeating until the result is finally present in the root node (the 0-th process). My implementation is as follows:
for (k = (size-1); k > 0; k /= 2)
{
    if (rank <= k)
    {
        if (rank <= (k/2))
        {
            // receiving the buffers from different processes and computing them
            MPI_Recv(rec_buffer, count, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (i = 0; i < count; i++)
            {
                res[i] += rec_buffer[i];
            }
        }
        else
        {
            MPI_Send(res, count, MPI_INT, k - rank, 0, MPI_COMM_WORLD);
        }
    }
}
But the thing is, this code performs significantly worse than the MPI_Reduce function itself.
So how can I optimize it further? What can I do differently to make it better? I can't make the sum loop multi-threaded, as we are required to do it in a single thread. I could maybe optimize the sum loop, but I'm not sure how or where to begin.
I apologize for a pretty basic question, but I am really just getting my feet wet in the field of HPC. Thanks!

Your second approach is the right one: you do the same number of communications, but you have parallelized both the reduce operation (the sum, in your case) and the communication (since you communicate between subsets). A reduction is typically done as described in Reduction operator.
However, you may want to try asynchronous communication using MPI_Isend and MPI_Irecv to improve performance and get closer to MPI_Reduce's performance.
@GillesGouillardet provided one implementation; you can see that the communications in that code are done with isend and irecv (look for "MCA_PML_CALL( isend" and "MCA_PML_CALL( irecv").
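For illustration, a minimal sketch of such a non-blocking binomial-tree reduction (assuming MPI_INT data and a sum; this is only an illustration, not the actual MPI_Reduce implementation) could look like this:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Reduce "res" (count ints) from all ranks onto rank 0 with a binomial tree,
   using the non-blocking MPI_Irecv/MPI_Isend calls suggested above. */
static void tree_reduce_sum(int *res, int count, int rank, int size)
{
    int *tmp = malloc(count * sizeof(int));
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == 0) {
            int partner = rank + step;
            if (partner < size) {
                MPI_Request req;
                MPI_Irecv(tmp, count, MPI_INT, partner, 0, MPI_COMM_WORLD, &req);
                /* independent work could be overlapped here before the wait */
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                for (int i = 0; i < count; i++)
                    res[i] += tmp[i];
            }
        } else {
            MPI_Request req;
            MPI_Isend(res, count, MPI_INT, rank - step, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            break;  /* this rank has handed off its partial sum */
        }
    }
    free(tmp);
}

int main(int argc, char **argv)
{
    int rank, size, val;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    val = rank;  /* each rank contributes its own rank number */
    tree_reduce_sum(&val, 1, rank, size);

    if (rank == 0)
        printf("sum = %d (expected %d)\n", val, size * (size - 1) / 2);

    MPI_Finalize();
    return 0;
}

The real gain from the non-blocking calls comes when the waits are overlapped with the local additions or with other work; posting an Isend/Irecv and waiting immediately behaves much like the blocking calls.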

Related

Why do I get a stack smashing error using MPI_Bcast and -O3 compiler flag, but everything works without -O3?

I am pretty new to MPI, so apologies if this is simple.
I have some code from a month or two ago that has been working fine, but I decided to go back and revise it. (It was written when I was just starting out, and it's not a performance critical section.) The code basically generates a random graph on one process and then shares the results with all other processes. An excerpt from the baby's-first-steps version follows:
unsigned int *graph;
if (commrank == 0) {
    graph = gengraph(params); // allocates graph memory in the function
    if (commsize > 1) {
        for (int k = 1; k < commsize; k++)
            MPI_Send(graph, n*n, MPI_UNSIGNED, k, 0, MPI_COMM_WORLD);
    }
} else {
    MPI_Status recvStatus;
    graph = malloc(sizeof(unsigned int)*n*n);
    MPI_Recv(graph, n*n, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD, &recvStatus);
}
While obviously naive, this worked just fine for a while, before I chose to go back and do it in what I thought was the proper fashion:
if (commrank == 0) {
    graph = gengraph(params);
    MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
} else {
    graph = malloc(sizeof(unsigned int)*n*n);
    MPI_Bcast(graph, n*n, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
}
The problem is, I keep getting "stack smashing" errors in the second version when I compile with the -O3 optimization flag, though it works fine when compiled unoptimized. Note that I have checked the graph allocation function multiple times and debugged it, and it appears to be fine. I have also debugged the second version, and it appears to work fine. The crash occurs later, when I try to free the graph memory. (Note that this is not a double-free error, and, again, it works fine in the naive implementation and has for some time.)
One final wrinkle: The first version also fails if, instead of using the recvStatus variable, I instead use MPI_STATUS_IGNORE. And, again, this only fails with -O3.
Any thoughts would be greatly appreciated. If it's any help, I'm using mpicc on top of gcc 7.5.0, but I imagine I'm doing something stupid rather than encountering a compiler problem.
I changed the mpicc compiler to Clang and used AddressSanitizer, per the suggestion of @hristo-iliev, and found an error in a subsequent MPI call (a recv with the wrong count). This led to the undefined behavior. Notably, the address sanitizer pinpointed the location of the error quite clearly, while valgrind only gave rather opaque indications that something was going on inside MPI (as, well, it always does).
Apologies to the StackOverflow community for this, as the code above was not the culprit (not entirely surprising). It was just some standard undefined behavior due to sloppiness.
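For anyone hitting something similar, the bug class looked roughly like this (a hypothetical reconstruction, not the actual code): the receive count no longer matched the allocated buffer, so MPI wrote past the end of the allocation.

/* hypothetical: buffer sized for n elements, but n*n elements received */
unsigned int *buf = malloc(sizeof(unsigned int) * n);
MPI_Recv(buf, n * n, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Building with the sanitizer enabled (for example mpicc -g -O3 -fsanitize=address ...) makes AddressSanitizer report the exact out-of-bounds write, which is what finally exposed it here.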

Why this OpenMP parallel for loop doesn't work properly?

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
    ...
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i];  // Euler
        v[i+1] = v[i]/(1-h);  // implicit Euler
    }
    ...
    return 0;
}
I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>

int main() {
    ...
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (i = 0; i < n-1; i++) {
        u[i+1] = (1+h)*u[i];  // Euler
        v[i+1] = v[i]/(1-h);  // implicit Euler
    }
    ...
    return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread actually does the work, while the others do not...
What am I doing wrong?
OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
    u[i+1] = (1+h)*u[i];  // Euler
    v[i+1] = v[i]/(1-h);  // implicit Euler
}
Iteration i of the loop writes the value that iteration i+1 reads. Meanwhile, iteration i+1 might be running at the same time on another thread.
Unless you can make the iterations independent, this isn't a good use-case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know the state at t+1 without first knowing the state at t, there's no way to parallelize across the iterations of Euler's method.
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
Therefore, you can parallelize your code like this:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    u[i] = pow((1+h), i)*u[0];
    v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function, you can call it once per thread rather than once per iteration, like this (since the number of threads is much smaller than n):
#pragma omp parallel
{
    int nt = omp_get_num_threads();
    int t  = omp_get_thread_num();
    int s  = (t+0)*n/nt;
    int f  = (t+1)*n/nt;
    u[s] = pow((1+h), s)*u[0];
    v[s] = v[0]*pow(1.0/(1-h), s);
    for (int i = s; i < f-1; i++) {
        u[i+1] = (1+h)*u[i];
        v[i+1] = v[i]/(1-h);
    }
}
You can also write your own pow(double, int) function optimized for integer powers.
Note that the relationship I used is not in fact 100% equivalent, because floating-point arithmetic is not associative. That's not usually a problem, but it's something one should be aware of.
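If you go that route, a sketch of such a helper (exponentiation by squaring; the name ipow is mine, not a standard function) could be:

/* computes base^exp for a non-negative integer exponent in O(log exp) multiplies */
double ipow(double base, unsigned int exp)
{
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1u)
            result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}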
Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that can logically happen at the same time, and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results, due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of is computing the area under a curve by means of numerical integration.
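For instance, a minimal self-contained sketch of that idea (illustrative names; it integrates f(x) = x*x on [0,1] with the midpoint rule) could look like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long nsteps = 10000000;
    const double a = 0.0, b = 1.0;
    const double dx = (b - a) / nsteps;
    double area = 0.0;

    /* every iteration is independent; the only shared update is the sum,
       which the reduction clause handles safely */
    #pragma omp parallel for reduction(+:area)
    for (long i = 0; i < nsteps; i++) {
        double x = a + (i + 0.5) * dx;
        area += x * x * dx;
    }

    printf("area = %f (exact value: 1/3)\n", area);
    return 0;
}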
Welcome to the parallel (or "just"-concurrent) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with a hidden (i.e. not correctly handled) breach of data-access and data-value integrity over time.
A pure-[SERIAL] flow of processing is free of such dangers, because the strictly serialized steps (a rigid order of executing nothing but one step after another, as a sequence) leave no chance to "touch" the same memory location more than once at the same time.
This peace of mind is lost the moment a process goes "just"-[CONCURRENT] or true-[PARALLEL].
Suddenly the order is almost random (in the "just"-[CONCURRENT] case) or everything happens effectively at once (in the true-[PARALLEL] case): like a robot with 6 degrees of freedom that arrives at each trajectory point by driving all six axes in parallel, not one after another in a pure-[SERIAL] manner, nor some-now-some-later-and-the-rest-as-it-gets in a "just"-[CONCURRENT] fashion, in which case the arm's 3D trajectory would become hardly predictable and mutual collisions on a car assembly line would be frequent.
Solution:
Use either a defensive tool, such as atomic operations, or take the principal approach and design a (b)locking-free algorithm where possible, or explicitly signal and coordinate the reads and writes (at a cost in extra time and degraded performance), so as to guarantee that values do not get damaged into inconsistent digital trash. Without such protective steps (ensuring all "old" writes get safely through before any "next" read goes ahead to grab the "right" value), you get exactly the behaviour demonstrated above.
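As a concrete (if simplistic) sketch of the atomic-operations route, here is a shared counter updated from many threads; without the atomic directive the increments would race and the final value would be wrong:

#include <stdio.h>

int main(void)
{
    long hits = 0;

    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++) {
        if (i % 3 == 0) {
            #pragma omp atomic   /* the update is performed indivisibly */
            hits++;
        }
    }

    printf("hits = %ld\n", hits);   /* 333334, with or without threads */
    return 0;
}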
Epilogue:
Using a tool like OpenMP on a problem where it cannot bring any advantage will cost time and decrease performance (all the tool-related overheads still have to be handled, while there is literally zero net effect of parallelism in cases where the algorithm does not allow any parallelism to be enjoyed), so one ends up paying far more than one gets.
A good place to learn about OpenMP best practices is the material from Lawrence Livermore National Laboratory (indeed very competent) and similar publications on using OpenMP.

Does root processor apply MPI_Reduce to itself as well?

When using MPI_Reduce, does the root processor apply the specified MPI operation to its own data as well?
For example, assume the following code is run by all processors, including the root: does the root reduce its local_sum into global_sum as if it were a non-root rank?
int local_sum = 0;   // must be initialized, otherwise the accumulation below is undefined
int global_sum = 0;
int i;
for (i = 0; i < 5; i++) {
    local_sum += rand_nums[i];
}
MPI_Reduce(&local_sum, &global_sum, 1, MPI_INT, MPI_SUM, ROOT, MPI_COMM_WORLD);
Yes, the reduction is also applied to the root itself. You might be picturing MPI as simply adding the numbers from the other ranks to the local_sum variable on the root.
What MPI actually does, however, is reduce across all ranks (including the root) in your communicator and put the result in global_sum.
If MPI didn't include the root's own contribution, there would be little point in MPI_Reduce taking separate send and receive buffers.
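A quick way to convince yourself (a minimal sketch, not from the question): have every rank, including the root, contribute rank + 1. The root then prints N*(N+1)/2 for N ranks, which it can only do if its own value was part of the reduction.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* the root contributes 1 */
    int global = 0;
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global = %d (expected %d)\n", global, size * (size + 1) / 2);

    MPI_Finalize();
    return 0;
}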

MPI writing file unequal size vectors

I have a small doubt regarding file writing in MPI. Let's say I have N processes working on a program. At the end of the program, each process has m particles (positions + velocities), but the number of particles m differs for each process. How would I write all the particle info (pos + vel) to a single file? From what I understood while searching, I can do this with MPI_File_open, MPI_File_set_view and MPI_File_write_all, but it seems I would need the same number of particles in each process. Any ideas how I could do it in my case?
You don't need the same number of particles on each processor. What you do need is for every processor to participate; one or more could very well have zero particles, even.
Allgather is a fine way to do it, and the single integer exchanged among all processes is not much overhead.
However, a better way is to use MPI_SCAN:
incr = numparts;
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
         MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr; /* or skip this with MPI_EXSCAN, but
                       then rank 0 has an undefined result */
MPI_File_write_at_all(fh, new_offset, buf, count, datatype, status);
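Put together, a more complete sketch of this approach might look as follows (illustrative names; it assumes each rank holds numparts doubles in buf and that the file keeps the default view, in which the offset is given in bytes):

#include <mpi.h>

/* each rank writes its own particles contiguously after those of the lower ranks */
void write_particles(const double *buf, long long numparts)
{
    MPI_File fh;
    long long mybytes = numparts * (long long)sizeof(double);
    long long offset = 0;

    MPI_File_open(MPI_COMM_WORLD, "particles.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* inclusive prefix sum of byte counts, minus own contribution = exclusive scan */
    MPI_Scan(&mybytes, &offset, 1, MPI_LONG_LONG_INT, MPI_SUM, MPI_COMM_WORLD);
    offset -= mybytes;

    MPI_File_write_at_all(fh, (MPI_Offset)offset, buf, (int)numparts,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}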
You need to perform
call MPI_Allgather(np, 1, MPI_INTEGER, procnp, 1, &
                   MPI_INTEGER, MPI_COMM_WORLD, ierr)
where np is the number of particles per process and procnp is an array of size nprocs (the number of processes). This gives every process an array of the particle counts on all the other processes. That way the MPI_File_set_view can be chosen correctly for each process by calculating the offset based on the process id. Pseudocode to get the offset looks something like this:
procdisp = 0
! obtain the displacement of each process using the other procs' np
do i = 1, irank-1
    procdisp = procdisp + procnp(i)*datasize
enddo
This was taken from Fortran code, so irank runs from 1 to nprocs.
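A C counterpart of that pseudocode (illustrative names; it assumes each rank holds np doubles in buf and that fh is already open) could compute the displacement and set the view like this:

#include <stdlib.h>
#include <mpi.h>

void write_with_view(MPI_File fh, const double *buf, int np, int rank, int nprocs)
{
    int *procnp = malloc(nprocs * sizeof(int));

    /* every rank learns every other rank's particle count */
    MPI_Allgather(&np, 1, MPI_INT, procnp, 1, MPI_INT, MPI_COMM_WORLD);

    /* byte displacement = sum of the counts of all lower ranks */
    MPI_Offset disp = 0;
    for (int i = 0; i < rank; i++)
        disp += (MPI_Offset)procnp[i] * sizeof(double);

    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, np, MPI_DOUBLE, MPI_STATUS_IGNORE);

    free(procnp);
}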

Nested kernels in CUDA

CUDA currently does not allow nested kernels.
To be specific, I have the following problem:
I have N data points, each of dimension M. To process each of the N data points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot create one kernel that calls the three kernels. Therefore, I have to process each data point serially.
One solution is to write a big kernel containing the functionality of all three kernels, but I think that would be sub-optimal.
Can anyone suggest how streams can be used to run the N data points in parallel, while retaining the three smaller kernels?
Thanks.
Well, if you want to use streams... you will want to create N streams:
cudaStream_t *streams;
streams = malloc(N * sizeof(cudaStream_t));
for (i = 0; i < N; i++)
{
    cudaStreamCreate(&streams[i]);
}
Then for the ith data point, you want to use cudaMemcpyAsync for transfers:
cudaMemcpyAsync(dst, src, count, kind, streams[i]);
and call your kernels with all four configuration parameters (sharedMemory can be 0, of course):
kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
and of course cleanup:
for(i=0; i<N; i++)
{
cudaStreamDestroy(streams[i]);
}
free(streams)
As an update to the selected answer: NVIDIA GPUs with Compute Capability 3.5 and above now allow nested kernels, via what NVIDIA calls Dynamic Parallelism.
Nowadays, on Fermi-class and newer hardware, it is also possible to launch kernels concurrently (in separate streams).
