How to parallelize function calls with MPI or OpenMP

1st question:
I wonder how I can parallelize function calls to the same function, but with different input parameters in a for loop. For example (C code):
//a[i] and b[i] are defined as elements of a list with 2 columns and N rows
//i is the row number
#pragma omp parallel
{
char cmd[1000];
#pragma omp for nowait
for(i=0; i<N; i++) {
//call the serial program
sprintf(cmd, "./serial_program %f %f", a[i], b[i]);
system(cmd);
}
}
If I just apply a pragma omp for (plus the omp header, of course), nothing happens. Maybe this is not possible with OpenMP, but would it be possible with MPI, and what would it look like? I have experience only with OpenMP so far, not with MPI.
update: defined cmd within parallel region
Status: solved
2nd question:
If I have an OpenMP-parallelized program and I want to use it across different nodes of a cluster, how can I distribute the calls among the nodes with MPI, and how would I compile it?
//a[i] and b[i] are defined as elements of a list with 2 columns and N rows
//i is the row number
for(i=0; i<N; i++) {
//call the parallelized program
sprintf(cmd, "./openmp_parallelized_program %f %f", a[i], b[i]);
system(cmd);
}
Status: unsolved

MPI is a method for communicating between the nodes of a computing cluster: it lets processes on one machine exchange data with processes on another. It is meant for clusters and large computing tasks, not for parallelizing desktop applications.
Communications in MPI are done by explicitly sending and receiving data.
Unlike OpenMP, there is no #pragma that will automatically facilitate parallelization.
There is also something odd about the code you posted: it is a C program that acts like a bash script. If all you want is to launch a program repeatedly, a shell script does that more directly:
#!/bin/bash
N=10
for i in `seq 1 $N`;
do
./program $i &
done
On many clusters, calls to system() will execute only on the host node, resulting in no speedup and in I/O problems. The command you showed is wholly unworkable there.

With MPI you would do something like:
int rank, size;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int start = (rank*N)/size;
int end = ((rank+1)*N)/size;
for (i = start; i < end; i++)
{
sprintf(cmd, "./openmp_parallelized_program %f %f", a[i], b[i]);
system(cmd);
}
MPI_Finalize();
Then run the MPI job with one process per node. There is a caveat though. Some MPI implementations do not allow processes to call fork() under certain conditions (and system() calls fork()), e.g. if they communicate over RDMA-based networks like InfiniBand. Instead, you could merge both programs in order to create one hybrid MPI/OpenMP program.

Related

Accessing the executing thread's private variables within a task in OpenMP

I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task. Let me demonstrate it with an example:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
}
}
}
return 0;
}
An example output of this code is as follows:
Thread ID of the #single: 1
thread_id, ID of the executing thread: 1, 2
thread_id, ID of the executing thread: 1, 0
thread_id, ID of the executing thread: 1, 3
thread_id, ID of the executing thread: 1, 1
...
It is evident that the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code).
What if I wanted to refer to the executing thread's own private variables then? Are they unrecoverably shadowed? Is there a clause to make this code output "number, same number" at the end of each line instead?

"I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task."
"[T]hreads do not retain their own data" is an odd way to describe it. Attributing data ownership to threads themselves instead of to the tasks they are performing is perhaps the key conceptual problem here. It is absolutely natural and to be expected that a thread performing a given task operates with and on the data environment of that task.
But if you're not accustomed to explicit tasks, then it is understandable that you've gotten away so far without appreciating the distinction here. The (many) constructs that give rise to implicit tasks are generally structured in ways that are not amenable to detecting the difference.
So with your example, yes, "the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code)."
Although it may not be immediately obvious, that follows from the OMP specification:
"When a thread encounters a task construct, an explicit task is generated from the code for the associated structured-block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct, per-data environment ICVs, and any defaults that apply."
(OMP 5.0 Specification, section 2.10.1; emphasis added)
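The only way that can be satisfied is if the task captures its data from the context in which it is created, which is indeed what you observe. Here thread_id is private to each thread of the parallel region, so by default it is firstprivate in the task: the task gets its own copy, initialized with the value held by the thread that encountered the task construct. Moreover, this is typically what one wants: the data on which a task is to operate should be established at the point of and by the context of its declaration, else how would one direct what a task is to do?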
"What if I wanted to refer to the executing thread's own private variables then?"
Threads do not have variables, at least not in the terminology of OMP. Those belong to the "data environment" of whatever tasks they are executing at any given time.
"Are they unrecoverably shadowed?"
When a thread is executing a given task, it accesses the data environment of that task. That environment may include variables that are shared with other tasks, but only in that sense can it access the variables of another task. "Unrecoverably shadowed" is not the wording I would use to describe the situation, but it gets the idea across.
"Is there a clause to make this code output number, same number instead at the end of each line?"
There are ways to restructure the code to achieve that, but none of them are as simple as just adding a clause to the omp task directive. In fact, I don't think any of them involve explicit tasks at all. The most natural way to get that would be with a parallel loop:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(void) {
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 10; i++) {
int thread_id = omp_get_thread_num();
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
return 0;
}
Of course, that also simplifies it to the point where it seems trivial, but perhaps that helps drive home the point. A large part of the purpose of declaring an explicit task is that that task may be executed by a different thread than the one that created it, which is exactly what you need to avoid to achieve the behavior you are asking for.

The problem is that here you create four parallel threads:
#pragma omp parallel num_threads(4)
and here, you restrict the further execution to one single thread
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
From now on, only the context of this single thread is used, hence the same instance of the variable thread_id is used. Here
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
you indeed distribute the loop iterations across four threads, but based on the state of the single task (together with the corresponding instance of thread_id) to which you restricted execution above. So a first measure is to end the single section directly after the printf (before the loop iterations start):
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
}
// Now outside the "single"
for (int i = 0; i < 10; i++) {
...
Now, for each iteration in the for loop, a task is created immediately. And this is performed for each of the four threads. So, you now have 40 tasks pending with
10 x thread_id == 0
10 x thread_id == 1
10 x thread_id == 2
10 x thread_id == 3
These tasks are now distributed amongst the threads arbitrarily. This is where the association between thread_id and the omp thread number gets lost. There is not much you can do about it, except for removing the
#pragma omp task
which leads to a similar result (with matching omp thread id and thread_id numbers), but works a bit differently internally (the dissociation of the tasks and the omp threads does not take place).

Not seeing any speedup using openmp

I am very new to OpenMP and am trying to understand its constructs.
Here is a simple program I wrote (it computes the square of each number):
#include <omp.h>
#include <stdio.h>
#define SIZE 20000
#define NUM_THREADS 50
int main(){
int id;
int output[SIZE];
omp_set_num_threads(NUM_THREADS);
double start = omp_get_wtime();
#pragma omp parallel for
//{
//id = omp_get_thread_num();
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
//printf("current thread :%d of %d threads\n", id, omp_get_num_threads());
output[i] = i*i;
}
//}
double end = omp_get_wtime();
printf("time elapsed: %f for %d threads\n", end-start, NUM_THREADS);
}
Now, increasing the number of threads should decrease the time, but it actually increases it.
What am I doing wrong?

This is most likely due to your choice of problem to inspect. Let's look at your parallel loop:
#pragma omp parallel for
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
output[i] = i*i;
}
You have specified 50 threads and stated you have 16 cores.
The serial case ignores the OMP directive and can perform aggressive optimization of the loop. Each element output[i] is i*i, a simple multiplication that depends on nothing but the loop index. id can be optimized out completely. This probably gets completely vectorized, and if your processor is modern it can probably do 4 multiplies in a single instruction (SIMD), meaning that for SIZE=20000 you are looking at 5000 SIMD multiplications (with no data fetch overhead and a cache-friendly store). This is going to be very fast.
Now let's look at the parallel version. You are initializing 50 threads, which is expensive. You are introducing many context switches because, even with processor affinity, you have oversubscribed your cores. Each of the 50 threads is going to run 400 iterations of your loop. If you are lucky, the compiler unrolled the loop a bit so that each thread could instead do 100 iterations of a SIMD multiply. The multiplies, whether SIMD or not, are still going to be fast. What you end up with is the same amount of real work, with each core getting roughly 1/16th of it, but the overhead of creating and destroying 50 threads costs more than the parallel gain. This is a good example of something that doesn't benefit from parallelization.
The first thing you want to do is limit your number of threads to your actual cores. You are not going to gain anything by adding needless context switches to your execution time. More threads than cores is generally not going to make it go faster.
The second thing you want to do is to do something more complicated in your loop, and do it many times (google for examples, there are many). When constructing your work loop you will also want to keep cache performance in mind, as badly constructed loops don't speed up well.
Once your work is more expensive than the thread overhead, embarrassingly parallel, and cache-friendly, you can start to see a real benefit from OpenMP. The last thing you'll want to do is benchmark your loop from serial up to 16 threads.

the error of MPI initialization

I have a question about parallelism. I have a portion of code that I have parallelized, and this part must be repeated N times in a loop. I cannot initialize MPI inside the loop, because that fails with "MPI_Init(89): Cannot call MPI_INIT or MPI_INIT_THREAD more than once", and if I initialize it before the loop, each process executes the whole loop, which is not the goal.
for (int i = 0; i <N; i ++)
{
the parallel area
}
For every i in the loop, I want the K processes to execute the parallel area together.

The canonical way to distribute calculations in your case is to run the loop in all MPI processes and make each rank except rank 0 skip the serial parts:
// Obtain the rank
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
for (i = 0; i < N; i++)
{
if (rank == 0)
{
// Some serial calculations
}
//
// Parallel part
//
if (rank == 0)
{
// More serial calculations within the loop
}
}
If you need to communicate data produced in rank 0 in the serial calculations, you could either use point-to-point operations or collectives like MPI_Bcast or MPI_Scatter at the beginning of the parallel part. You could then bring the distributed data back to rank 0 for further processing in the second serial part with MPI_Gather or MPI_Reduce at the end of the parallel block.
Another canonical approach is to use the master-worker pattern where one process (the master) distributes work items to a set of worker processes that are simply spinning in a loop of receive work -> process work -> return results.
As to the multiple initialisation of MPI, one could check if it has already been done:
int done_already;
MPI_Initialized(&done_already);
if (!done_already)
MPI_Init(NULL, NULL);
Note that once you finalise MPI by calling MPI_Finalize, it cannot be reinitialised for the duration of the current execution of the program.

Simple OpenMP For Loop in C wrong output

Trying to get a simple OpenMP loop going, but I keep getting weird outputs. It doesn't list from 1 to 1000 straight, but goes from 501 to 750, then 1 to 1000. I'm guessing there's a threading issue? I'm compiling and running on VS2013.
#include <stdio.h>
#include <math.h>
int main(void)
{
int counter = 0;
double root = 0;
// OPEN MP SECTION
printf("Open MP section: \n\n");
getchar(); //Pause
#pragma omp parallel for
for (counter = 0; counter <= 1000; counter++)
{
root = sqrt(counter);
printf("The root of %d is %.2f\n", counter, root);
}
return(0);
}

The whole point of OpenMP is to run things in parallel, distributing work to different execution engines.
Hence, it's likely that the individual iterations of your loop are done out of order because that is the very nature of multi-threading.
While it may make sense for the calculations to be done in parallel (and hence possibly out of order), that's not really what you want for the printing of results.
One way to ensure the results are printed in the correct order is to defer the printing until after the parallel execution is complete. In other words, parallelise the calculation but serialise the output.
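That of course means being able to store the information in, for example, an array, while the parallel operations are running. Giving each iteration its own array element also removes the data race on root, which in your code is a single variable shared by all threads.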
In other words, something like:
// Make array instead of single value.
double root[1001];
// Parallelise just the calculation bit.
#pragma omp parallel for
for (counter = 0; counter <= 1000; counter++)
root[counter] = sqrt(counter);
// Leave the output as a serial operation,
// once all parallel operations are done.
for (counter = 0; counter <= 1000; counter++)
printf("The root of %d is %.2f\n", counter, root[counter]);

Store the results in an array and get the printf out of the loop. It has to serialize to the display.

Your code will not be run sequentially.
OpenMP Parallel Pragma:
#pragma omp parallel
{
// Code inside this region runs in parallel.
printf("Hello!\n");
}
"This code creates a team of threads, and each thread executes the same code. It prints the text "Hello!" followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like "HeHlellolo", depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as if in non-threaded program."
http://bisqwit.iki.fi/story/howto/openmp/#ParallelPragma

How many threads in a loop

If I create a loop
for(int i=0;i<n;i++){//do something}
and run it through Visual Studio, will my program create a thread for every iteration, for the whole loop, or it's a variable number?

"and run it through Visual Studio, will my program create a thread for every iteration, for the whole loop, or it's a variable number?"
None of the above. Your program will by default have a single thread of execution and it will execute each iteration of the loop in series, without creating new ones.
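Only with a feature like OpenMP (or similar) are the iterations spread over several threads, and even then there is not one thread per iteration: the runtime creates a team of threads (by default usually one per logical core) and divides the iterations among them.
#include <stdio.h>
#include <omp.h>
int main(void) {
// The iterations are divided among the threads of the team
#pragma omp parallel for
for(int n=0; n<10; ++n) { printf(" %d", n); }
printf(".\n");
return 0;
}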