Trying to get a simple OpenMP loop going, but I keep getting weird outputs. It doesn't list from 1 to 1000 straight, but goes from 501 to 750, then 1 to 1000. I'm guessing there's a threading issue? I'm compiling and running on VS2013.
#include <stdio.h>
#include <math.h>
int main(void)
{
int counter = 0;
double root = 0;
// OPEN MP SECTION
printf("Open MP section: \n\n");
getchar(); //Pause
#pragma omp parallel for
for (counter = 0; counter <= 1000; counter++)
{
root = sqrt(counter);
printf("The root of %d is %.2f\n", counter, root);
}
return(0);
}
The whole point of OpenMP is to run things in parallel, distributing work to different execution engines.
Hence, it's likely that the individual iterations of your loop are done out of order because that is the very nature of multi-threading.
While it may make sense for the calculations to be done in parallel (and hence possibly out of order), that's not really what you want for the printing of results.
One way to ensure the results are printed in the correct order is to defer the printing until after the parallel execution is complete. In other words, parallelise the calculation but serialise the output.
That of course means being able to store the information in, for example, an array, while the parallel operations are running.
In other words, something like:
// Make array instead of single value.
double root[1001];
// Parallelise just the calculation bit.
#pragma omp parallel for
for (counter = 0; counter <= 1000; counter++)
root[counter] = sqrt(counter);
// Leave the output as a serial operation,
// once all parallel operations are done.
for (counter = 0; counter <= 1000; counter++)
printf("The root of %d is %.2f\n", counter, root[counter]);
Store the results in an array and get the printf out of the loop. It has to serialize to the display.
Your code will not be run sequentially.
OpenMP Parallel Pragma:
#pragma omp parallel
{
// Code inside this region runs in parallel.
printf("Hello!\n");
}
'This code creates a team of threads, and each thread executes the same code. It prints the text "Hello!" followed by a newline, as many times as there are threads in the team created. For a dual-core system, it will output the text twice. (Note: It may also output something like "HeHlellolo", depending on system, because the printing happens in parallel.) At the }, the threads are joined back into one, as if in non-threaded program."'
http://bisqwit.iki.fi/story/howto/openmp/#ParallelPragma
Related
I am trying to learn OpenMP, and have stumbled upon the fact that threads do not retain their own data when executing tasks, but they rather have a copy of the data of the thread which has generated the task. Let me demonstrate it with an example:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
}
}
}
return 0;
}
An example output of this code is as follows:
Thread ID of the #single: 1
thread_id, ID of the executing thread: 1, 2
thread_id, ID of the executing thread: 1, 0
thread_id, ID of the executing thread: 1, 3
thread_id, ID of the executing thread: 1, 1
...
It is evident that the thread_id within the task refers to a copy that is assigned to the thread_id of the thread that has created the task (i.e. the one running the single portion of the code).
What if I wanted to refer the executing thread's own private variables then? Are they unrecoverably shadowed? Is there a clause to make this code output number, same number instead at the end of each line?
I am trying to learn OpenMP, and have stumbled upon the fact that
threads do not retain their own data when executing tasks, but they
rather have a copy of the data of the thread which has generated the
task.
"[T]hreads do not retain their own data" is an odd way to describe it. Attributing data ownership to threads themselves instead of to the tasks they are performing is perhaps the key conceptual problem here. It is absolutely natural and to be expected that a thread performing a given task operates with and on the data environment of that task.
But if you're not accustomed to explicit tasks, then it is understandable that you've gotten away so far without appreciating the distinction here. The (many) constructs that give rise to implicit tasks are generally structured in ways that are not amenable to detecting the difference.
So with your example, yes,
the thread_id within the task refers to a copy that
is assigned to the thread_id of the thread that has created the task
(i.e. the one running the single portion of the code).
Although it may not be immediately obvious, that follows from the OMP specification:
When a thread encounters a task construct, an explicit task is
generated from the code for the associated structured-block. The data
environment of the task is created according to the data-sharing
attribute clauses on the task construct, per-data environment ICVs,
and any defaults that apply.
(OMP 5.0 Specification, section 2.10.1; emphasis added)
The only way that can be satisfied is if the task closes over any shared data from the context of its declaration, which is indeed what you observe. Moreover, this is typically what one wants -- the data on which a task is to operate should be established at the point of and by the context of its declaration, else how would one direct what a task is to do?
What if I wanted to refer the executing thread's own private variables
then?
Threads do not have variables, at least not in the terminology of OMP. Those belong to the "data environment" of whatever tasks they are executing at any given time.
Are they unrecoverably shadowed?
When a thread is executing a given task, it accesses the data environment of that task. That environment may include variables that are shared with other tasks, but only in that sense can it access the variables of another task. "Unrecoverably shadowed" is not the wording I would use to describe the situation, but it gets the idea across.
Is there a clause to make this
code output number, same number instead at the end of each line?
There are ways to restructure the code to achieve that, but none of them are as simple as just adding a clause to the omp task directive. In fact, I don't think any of them involve explicit tasks at all. The most natural way to get that would be with a parallel loop:
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main(void) {
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 10; i++) {
int thread_id = omp_get_thread_num();
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
return 0;
}
Of course, that also simplifies it to the point where it seems trivial, but perhaps that helps drive home the point. A large part of the purpose of declaring an explicit task is that that task may be executed by a different thread than the one that created it, which is exactly what you need to avoid to achieve the behavior you are asking for.
The problem is, that here you create four parallel threads:
#pragma omp parallel num_threads(4)
and here, you restrict the further execution to one single thread
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
From now on, only the context of this single thread is used, hence the same instance of the variable thread_id is used. Here
for (int i = 0; i < 10; i++) {
#pragma omp task
{
sleep(1);
printf("thread_id, ID of the executing thread: %d, %d\n", thread_id, omp_get_thread_num());
}
you indeed distribute the loop iteration on four threads, but based on the state of the single task (together with the corresponding instance of thread_id to which you restricted execution above. So a first measure is to end the single section directly after the printf (before the loop iterations start):
int thread_id = omp_get_thread_num();
#pragma omp single
{
printf("Thread ID of the #single: %d\n", omp_get_thread_num());
}
// Now outside the "single"
for (int i = 0; i < 10; i++) {
...
Now, for each iteration in the for loop, a task is created immediately. And this is performed for each of the four threads. So, you now have 40 tasks pending with
10 x thread_id == 0
10 x thread_id == 1
10 x thread_id == 2
10 x thread_id == 3
These tasks are now distributed amongst the threads arbitrarily. This is where the association between thread_id and the omp thread number gets lost. There is not much you can do about it, except for removing the
#pragma omp task
which leads to a similar result (with corresponding omp thread id and thread_id numbers), but works a bit different internally (the dissociation of the tasks and the omp threads does not take place).
I am very new in openmp and am trying to understand its constructs..
Here is a simple code I wrote... (square of the number)..
#include <omp.h>
#include <stdio.h>
#define SIZE 20000
#define NUM_THREADS 50
int main(){
int id;
int output[SIZE];
omp_set_num_threads(NUM_THREADS);
double start = omp_get_wtime();
#pragma omp parallel for
//{
//id = omp_get_thread_num();
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
//printf("current thread :%d of %d threads\n", id, omp_get_num_threads());
output[i] = i*i;
}
//}
double end = omp_get_wtime();
printf("time elapsed: %f for %d threads\n", end-start, NUM_THREADS);
}
Now, changing number of threads should decrease the time.. but actually it is increasing the time?
What am i doing wrong?
This is most likely due to your choice of problem to inspect. Lets look at your parallel loop:
#pragma omp parallel for
for (int i=0; i<SIZE;i++){
id = omp_get_thread_num();
output[i] = i*i;
}
You have specified 50 threads and stated you have 16 cores.
The serial case ignores the OMP directive and can perform aggressive optimization of the loop. Each element i is i*i, a simple multiplication dependent on nothing but the loop index. id can be optimized out completely. This probably gets completely vectorized and if your processor is modern it can probably do 4 multiplies in a single instruction (SIMD) meaning for size=2000, you are looking at 500 SIMD multiplications (with no data fetch overhead and a cache friendly store). This is going to be very fast.
Alternatively, lets look at the parallel version. You are initializing 50 threads -- expensive!. You are introducing many context switches as even if you have processor affinity, you have oversubscribed your cores. Each of the 50 threads is going to run 40 iterations of your loop. If you are lucky the compiler unrolled the loop a bit so it could instead do 10 iterations of a SIMD multiply. The multiplies, whether SIMD or not, are still going to be fast. What you end up with is the same amount of real work, so each processor has 1/16th of the work but the overhead of 50 threads being created and destroyed creates more work than the parallel gain. This is a good example of something that doesn't benefit from parallelization.
The first thing you want to do is limit your number of threads to your actual cores. You are not going to gain anything by adding needless context switches to your execution time. More threads than cores is generally not going to make it go faster.
The second thing you want to do is to do something more complicated in your loop, and do it many times (google for examples, there are many). When constructing your work loop you will also want to keep cache performance in mind, as badly constructed loops don't speedup well.
When you change your work to be more complex than the thread overhead, embarassingly parallel and great cache performance you can start to see a real benefit to OpenMP. The last thing you'll want to do is benchmark your loop from serial to 16 threads. e.g.:
I am trying to create an OpenMP program that will sequentially iterate through a loop. I realize threads are not intended for sequential programs -- I'm trying to either get a little speedup compared to a single thread, or at least keep the execution time similar to a single-threaded program.
Inside my #pragma omp parallel section, each thread computes its own section of a large array and gets the sum of that portion. These all may run in parallel. Then I want the threads to run in order, and each sum is added to the TotalSum IN ORDER. So thread 1 has to wait for thread 0 to complete, and so on. I have this part inside a #pragma omp critical section. Everything runs fine, except that only thread 0 is completing and then the program exits. How can I ensure that the other threads will keep polling? I've tried sleep() and while loops, but it continues to exit after thread 0 completes.
I am not using #pragma omp parallel for because I need to keep track of the specific ranges of the master array that each thread accesses. Here is a shortened version of the code section in question:
//DONE and MasterArray are global arrays. DONE keeps track of all the threads that have completed
int Function()
{
#pragma omp parallel
{
int ID = omp_get_thread_num
variables: start,end,i,j,temp(array) (all are initialized here)
j = 0;
for (i = start; i < end; i++)
{
if(i != start)
temp[j] = MasterArray[i];
else
temp[j] = temp[j-1] + MasterArray[i];
j++;
}
#pragma omp critical
{
while(DONE[ID] == 0 && ERROR == 0) {
int size = sizeof(temp) / sizeof(temp[0]);
if (ID == 0) {
Sum = temp[size];
DONE[ID] = 1;
if (some situation)
ERROR = 1; //there's an error and we need to exit the function and program
}
else if (DONE[ID-1] == 1) {
Sum = temp[size];
DONE[ID] = 1;
if (some situation)
ERROR = 1; //there's an error and we need to exit the function and program
}
}
}
}
if (ERROR == 1)
return(-1);
else
return(0);
}
this function is called from main after initializing the number of threads. It seems to me that the parallel portion completes, then we check for an error. If an error is found, the loop terminates. I realize something is wrong here, but I can't figure out what it is, and now I'm just going in circles. Any help would be great. Again, my problem is that the function exits after only thread 0 executes, but no error has been flagged. I have it running in pthreads too, but that has been simpler to execute.
Thanks!
Your attempt of ordering threads with #pragma omp critical is totally incorrect. There can be just one thread in a critical section at any time, and the order in which the threads arrive to the critical section is not determined. So in your code it can happen that e.g. the thread #2 enters the critical section first and never leaves it, waiting for thread #1 to complete, while the thread #1 and the rest are waiting at #pragma omp critical. And even if some threads, e.g. thread #0, are lucky to complete the critical section in right order, they will wait on an implicit barrier at the end of the parallel region. In other words, the deadlock is almost guaranteed in this code.
I suggest you do something much simpler and natural to order your threads, namely an ordered section. It should look like this:
#pragma omp parallel
{
int ID = omp_get_thread_num();
// Computations done by each thread
#pragma omp for ordered schedule(static,1)
for( int t=0; t<omp_get_num_threads(); ++t )
{
assert( t==ID );
#pragma omp ordered
{
// Do the stuff you want to be in order
}
}
}
So you create a parallel loop with the number of iterations equal to the number of threads in the region. The schedule(static,1) clause makes it explicit that the iterations are assigned one per thread in the order of thread IDs; and the ordered clause allows to use ordered sections inside the loop. Now in the body of the loop you put an ordered section (the block following #pragma omp ordered), and it will be executed in the order of iterations, which is also the order of thread IDs (as ensured by the assertion).
For more information, you may look at this question: How does the omp ordered clause work?
If I create a loop
for(int i=0;i<n;i++){//do something}
and run it through Visual Studio, will my program create a thread for every iteration, for the whole loop, or it's a variable number?
and run it through Visual Studio, will my program create a thread for every iteration, for the whole loop, or it's a variable number?
None of the above. Your program will by default have a single thread of execution and it will execute each iteration of the loop in series, without creating new ones.
Only with a feature like OpenMP (or similar) could you spawn different threads per iteration.
#include <omp.h>
#pragma omp parallel for
for(int n=0; n<10; ++n) { printf(" %d", n); }
printf(".\n");
1st question:
I wonder how I can parallelize function calls to the same function, but with different input parameters in a for loop. For example (C code):
//a[i] and b[i] are defined as elements of a list with 2 columns and N rows
//i is the row number
#pragma omp parallel
{
char cmd[1000];
#pragma omp for nowait
for(i=0; i<N; i++) {
//call the serial programm
sprintf(cmd, "./serial_program %f %f", a[i], b[i]);
system(cmd);
}
}
If I just apply a pragma omp for (+the omp header of course) nothing happens. Maybe this is not possible with OpenMP, but would it be possible with MPI and how would it look like then? I have experience only with OpenMP so far, but not with MPI.
update: defined cmd within parallel region
Status: solved
2nd question:
If i have a OpenMP parallelized program and i want to use it among different nodes within a cluster, how can i distribute the calls among the different nodes with MPI and how would i compile it?
//a[i] and b[i] are defined as elements of a list with 2 columns and N rows
//i is the row number
for(i=0; i<N; i++) {
//call the parallelized program
sprintf(cmd, "./openmp_parallelized_program %f %f", a[i], b[i]);
system(cmd);
}
Status: unsolved
MPI is a method to communicate between nodes of a computing cluster. It enables one motherboard to talk to another. MPI is for clusters and large computing tasks, it is not for parallelizing desktop applications.
Communications in MPI are done by explicitly sending and receiving data.
Unlike OpenMP, there is no #pragma that will automatically facilitate parallelization.
Also there is something really messed up about the code that you posted, specifically, it is a C program that acts like a bash script.
#!/bin/bash
N=10
for i in `seq 1 $N`;
do
./program $i &
done
On many clusters calls to system will execute only on the host node, resulting in no speedup and io problems. The command you showed is wholly unworkable.
With MPI you would do something like:
int rank, size;
MPI_Init();
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int start = (rank*N)/size;
int end = ((rank+1)*N)/size;
for (i = start; i < end; i++)
{
sprintf(cmd, "./openmp_parallelized_program %f %f", a[i], b[i]);
system(cmd);
}
MPI_Finalize();
Then run the MPI job with one process per node. There is a caveat though. Some MPI implementations do not allow processes to call fork() under certain conditions (and system() calls fork()), e.g. if they communicate over RDMA-based networks like InfiniBand. Instead, you could merge both programs in order to create one hybrid MPI/OpenMP program.