Why doesn't this OpenMP parallel for loop work properly? - c

I would like to implement OpenMP to parallelize my code. I am starting from a very basic example to understand how it works, but I am missing something...
So, my example looks like this, without parallelization:
int main() {
...
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
I omitted some parts in the "..." because they are not relevant. It works, and if I print the u[] and v[] arrays to a file, I get the expected results.
Now, if I try to parallelize it just by adding:
#include <omp.h>
int main() {
...
omp_set_num_threads(2);
#pragma omp parallel for
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
...
return 0;
}
The code compiles and the program runs, BUT the u[] and v[] arrays are half full of zeros.
If I set omp_set_num_threads( 4 ), I get three quarters of zeros.
If I set omp_set_num_threads( 1 ), I get the expected result.
So it looks like only the first thread is being executed, while not the other ones...
What am I doing wrong?

OpenMP assumes that each iteration of a loop is independent of the others. When you write this:
for (i = 0; i < n-1; i++) {
u[i+1] = (1+h)*u[i]; // Euler
v[i+1] = v[i]/(1-h); // implicit Euler
}
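Iteration i writes u[i+1] and v[i+1], which are exactly the values iteration i+1 reads. Meanwhile, iteration i+1 might be running at the same time on another thread. With two threads, the second thread typically starts in the middle of the array and reads an element that has not been computed yet (most likely still zero), so everything it derives from that element is zero as well. That is why half of the array is empty.
Unless you can make the iterations independent, this isn't a good use case for parallelism.
And, if you think about what Euler's method does, it should be obvious that it is not possible to parallelize the code you're working on in this way. Euler's method calculates the state of a system at time t+1 based on information at time t. Since you cannot know what's at t+1 without first knowing what's at t, there's no way to parallelize across the iterations of Euler's method.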

u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
is equivalent to
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
Therefore you can parallelize your code like this:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
u[i] = pow((1+h), i)*u[0];
v[i] = v[0]*pow(1.0/(1-h), i);
}
If you want to mitigate the cost of the pow function, you can call it once per thread rather than once per iteration, like this (the number of threads is much smaller than n):
#pragma omp parallel
{
int nt = omp_get_num_threads();
int t = omp_get_thread_num();
int s = (t+0)*n/nt;
int f = (t+1)*n/nt;
u[s] = pow((1+h), s)*u[0];
v[s] = v[0]*pow(1.0/(1-h), s);
for(int i=s; i<f-1; i++) {
u[i+1] = (1+h)*u[i];
v[i+1] = v[i]/(1-h);
}
}
You can also write your own pow(double, int) function optimized for integer powers.
Note that the relationship I used is not in fact 100% equivalent because floating point arithmetic is not associative. That's not usually a problem but it's something one should be aware of.
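To make the last two remarks concrete, here is a minimal sketch of an integer-power helper and a check of how far the closed form drifts from the serial recurrence. The names ipow, euler_closed_form and max_rel_diff are made up for this illustration, and the tolerance you accept is up to you.

```c
/* Hypothetical helper: base raised to a non-negative integer power,
   computed by repeated squaring (about log2(exp) multiplications). */
static double ipow(double base, unsigned int exp)
{
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1u)
            result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}

/* Fill u[0..n-1] with u[i] = (1+h)^i * u0. Every iteration is independent,
   so the loop can safely be shared among threads. */
static void euler_closed_form(double *u, int n, double h, double u0)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        u[i] = ipow(1.0 + h, (unsigned int)i) * u0;
}

/* Largest relative difference between u[] and the serial recurrence
   u[i+1] = (1+h)*u[i] started from u[0]. Assumes no value is zero. */
static double max_rel_diff(const double *u, int n, double h)
{
    double ref = u[0], worst = 0.0;
    int i;
    for (i = 1; i < n; i++) {
        double d;
        ref = (1.0 + h) * ref;
        d = (u[i] - ref) / ref;
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

The difference is normally a few units in the last place, but it is not guaranteed to be zero, which is the point of the remark about associativity.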

Before parallelizing your code you must identify its concurrency, i.e. the set of tasks that are logically happening at the same time and then figure out a way to make them actually happen in parallel.
As mentioned above, this is not a good example to apply parallelism to, because there is no concurrency in its nature. Attempting to use parallelism like that will lead to wrong results due to so-called race conditions.
If you just want to learn how OpenMP works, try to come up with examples where you can clearly identify conceptually independent tasks. One of the simplest I can think of would be computing the area under a curve by means of numerical integration.
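A minimal sketch of that integration exercise, assuming the midpoint rule and a made-up integrand f(x) = 4/(1+x^2), whose integral over [0, 1] is pi. Each slice is independent of the others, and the reduction clause gives every thread its own partial sum that is combined at the end.

```c
#include <assert.h>

/* Hypothetical integrand chosen for illustration. */
static double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint rule on [a, b] with n slices. */
static double integrate(double a, double b, int n)
{
    double width, sum = 0.0;
    int i;

    assert(n > 0);
    width = (b - a) / n;

    /* No iteration reads anything another iteration writes;
       'sum' is handled by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        double x = a + (i + 0.5) * width;
        sum += f(x);
    }
    return sum * width;
}
```

Calling integrate(0.0, 1.0, 1000000) should give a value close to pi regardless of the number of threads, although the last digits may vary because the partial sums are added in a different order.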

Welcome to the parallel ( or "just"-concurrent ) plurality of computing realities.
Why?
Any non-sequential schedule of processing the loop will have problems with a hidden ( not correctly handled ) breach of data-{-access | -value} integrity in time.
A pure-[SERIAL] flow of processing is free from such dangers as the principally serialised steps indirectly introduce ( right by a rigid order of executing nothing but a one-step-after-another as a sequence ) order, in which there is no chance to "touch" the same memory location twice or more times at the same time.
This "peace-of-mind" is inadvertently lost, once a process goes into a "just"-[CONCURRENT] or the true-[PARALLEL] processing.
Suddenly there is an almost random order ( in a case of a "just"-[CONCURRENT] ) or a principally "immediate" singularity ( avoiding any original meaning of "order" - in the case of a true-[PARALLEL] code execution mode -- like a robot, having 6DoF, arrives into each and every trajectory-point in a true-[PARALLEL] fashion, driving all 6DoF-axes in parallel, not a one-after-another, in a pure-[SERIAL]-manner, not in a some-now-some-other-later-and-the-rest-as-it-gets in a "just"-[CONCURRENT] fashion, as the 3D-trajectory of robot-arm will become hardly predictable and mutual collisions would be often on a car assembly line ... ).
Solution:
Using either a defensive tool, called atomic operations, or a principal approach - design (b)locking-free algorithm, where possible, or explicitly signal and coordinate reads and writes ( sure, at a cost in excess-time and degraded performance ), so as to warrant the values will not get damaged into an inconsistent digital trash, if protective steps ( ensuring all "old"-writes get safely "through" before any "next"-reads go ahead to grab a "right"-value ) were not coded in ( as was demonstrated above ).
Epilogue:
Using a tool like OpenMP for problems where it cannot bring any advantage will result in wasted time and decreased performance ( as all tool-related overheads still have to be paid, while there is literally zero net effect of parallelism in cases where the algorithm does not allow any parallelism to be exploited ), so one finally pays way more than one gets.
A good point to learn about OpenMP best practices could be sources for example from Lawrence Livermore National Laboratory ( indeed very competent ) and similar publications on using OpenMP.

Related

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently been reading up about OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> Open MP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming: once you execute things side by side, you won't be able to guarantee order unless you synchronize the access (at which point you may as well leave out the parallelization, as execution becomes effectively serial). If you need to rely on order, you can parallelize the calculations but need to write the results out from a single thread.
If the points themselves are messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
When you have a loop with a nested loop, there is no need for a second omp pragma.
The first loop is already parallelized. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequential increment, so you cannot execute the second loop in a random order. OMP pragmas are a very easy and cool way to parallelize code, but do not overuse them!
More details here -> Parallel Loops with OpenMP
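Putting the advice from these answers together, a minimal sketch could look like the following. The point_t type, the data layout and the way k is derived from i and j are assumptions made for the illustration, not the asker's real image structure. Only the outer loop is parallel, every variable that is written inside the loop is declared inside it (and is therefore private), and the file is written afterwards from a single thread so the order is preserved.

```c
#include <stdio.h>

/* Hypothetical stand-in for one cloud point. */
typedef struct {
    int x, y, z, gray;
    int keep;               /* 1 if the point passed the threshold */
} point_t;

/* Parallel part: each (i, j) pair writes only its own slot k. */
static void collect_points(const unsigned short *data, int nrows, int ncols,
                           int threshold, point_t *out)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < nrows; i++) {
        const unsigned short *row = data + (long)i * ncols;  /* private */
        int j;                                               /* private */
        for (j = 0; j < ncols; j++) {
            int k = i * ncols + j;   /* computed, not incremented */
            out[k].keep = row[j] < threshold;
            out[k].x = j;            /* placeholder values */
            out[k].y = i;
            out[k].z = row[j];
            out[k].gray = row[j];
        }
    }
}

/* Serial part: write the kept points in index order. */
static int write_points(FILE *fp, const point_t *pts, int count)
{
    int k, written = 0;
    for (k = 0; k < count; k++) {
        if (pts[k].keep) {
            fprintf(fp, "%d %d %d %d\n",
                    pts[k].x, pts[k].y, pts[k].z, pts[k].gray);
            written++;
        }
    }
    return written;
}
```

Because k is computed from i and j instead of being incremented, no thread depends on a counter that another thread modifies.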

problems when creating many plans and executing plans

I am a little confused about creating a "many" plan by calling fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to achieve here is to see whether explicitly using OpenMP and organizing the FFTW data myself could work together. ( I know I "should" use the multithreaded version of FFTW, but I failed to get the expected speedup from it ).
My code looks like this:
/* I ignore some helper APIs */
#define N 1024*1024 //N is the total size of 1d fft
fftwf_plan p;
float * in;
fftwf_complex *out;
omp_set_num_threads(threadNum); // Suppose threadNum is 2 here
in = fftwf_alloc_real(2*(N/2+1));
std::fill(in,in+2*(N/2+1),1.1f); // just try with a random real floating numbers
out = (fftwf_complex *)&in[0]; // for in-place transformation
/* Problems start from here */
int n[] = {N/threadNum}; // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL,1 ,1, out, NULL, 1, 1, FFTW_ESTIMATE);
#pragma omp parallel for
for (int i = 0; i < threadNum; i ++)
{
fftwf_execute(p);
// fftwf_execute_dft_r2c(p,in+i*N/threadNum,out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result does not seem correct. ( I compared the result with a version that uses neither many_plan nor OpenMP. )
If I use fftwf_execute_dft_r2c(), I get a segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or is this approach incorrect in the first place?
Thank you in advance.
flyree
Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex)*numberOfOutputColumns);
You are trying to access 'p' inside your parallel block, without specifically telling openMP how to use it. It should be:
pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell omp to use n threads:
pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you removed the for loop and openMP call and executed fftwf_execute(p) just once does it work?
I don't know much about FFTW's plans for many, but it seems like p is really many plans, not one single plan. So, when you "execute" p, you are executing all plans at once, right? You don't really need to iteratively execute p.
I'm still learning about OpenMP + FFTW so I could be wrong on these. StackOverflow doesn't like it when I put a # in front of pragma, but you need one.

unbalanced nested for loops in openmp

I've been trying to parallelize an algorithm with unbalanced nested for loops using OpenMP. I can't post the original code as it's a secret project of an unheard government but here's a toy example:
for (i = 0; i < 100; i++) {
#pragma omp parallel for private(j, k)
for (j = 0; j < 1000000; j++) {
for (k = 0; k < 2; k++) {
temp = i * j * k; /* dummy operation (don't mind the race) */
}
if (i % 2 == 0) temp = 0; /* so I can't use openmp collapse */
}
}
Currently this example is working slower in multiple threads (~1 sec in single thread ~2.4 sec in 2 threads etc.).
Things to note:
Outer for loop needs to be done in order (dependent on the previous step) (As far as I know, OpenMP handles inner loops well so threads don't get created/destroyed at each step, right?)
Typical index numbers are given in the example (100, 1000000, 2)
Dummy operation consists of just a few operations
There are some conditional operations outside the inner most loop so collapse is not an option (doesn't seem like it would increase the performance anyways)
Looks like an embarrassingly parallel algorithm but I can't seem to get any speedups for the last two days. What would be the best strategy here?
Unfortunately this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that besides i, temp is also a shared automatic variable, I will assume so for the rest of this text. It also tells me that you have a pre-Nehalem CPU...
There are two sources of slowdown here: code transformation and cache coherency.
Parallel regions are implemented by extracting their code into separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similar to this:
typedef struct {
int i;
int temp;
} main_omp_fn_0_shared_vars;
void main_omp_fn_0 (void *data) {
main_omp_fn_0_shared_vars *vars = data;
int j, k;
// compute values of j_min and j_max for this thread
for (j = j_min; j < j_max; j++) {
for (k = 0; k < 2; k++) {
vars->temp = vars->i * j * k;
}
if (vars->i % 2 == 0) vars->temp = 0;
}
}
int main (void) {
int i, temp;
main_omp_fn_0_shared_vars vars;
for (i = 0; i < 100; i++)
{
vars.i = i;
vars.temp = temp;
// This is how GCC implements parallel regions with libgomp
// Start main_omp_fn_0 in the other threads
GOMP_parallel_start(main_omp_fn_0, &vars, 0);
// Start main_omp_fn_0 in the main thread
main_omp_fn_0(&vars);
// Wait for other threads to finish (implicit barrier)
GOMP_parallel_end();
i = vars.i;
temp = vars.temp;
}
}
You pay a small penalty for accessing temp and i this way as their intermediate values cannot be stored in registers but are loaded and stored each time.
The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line and although vars.i is only read from and vars.temp is only written to, full cache invalidation is likely to occur at each iteration of the inner loop.
Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections and performance degradation is well expected in that case.
Think of the overheads:
Since your outer loop needs to be in order you're creating x threads to perform the work in the inner loop, destroying them, then creating them again... and so on 100 times.
You have to wait until the longest task within the inner loop completes its work before performing the next step in the outer loop, so essentially this is a synchronization overhead. The tasks don't look irregular, but if the work to perform is small there's only so much speedup you can get out of this.
You have the cost of thread creation here, and allocating the private variables.
If the work inside the inner loop is small the benefits of parallelising this loop might not necessarily outweigh the cost of the parallelisation overheads above, hence you end up with a slowdown.
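One way to act on both answers, as a sketch under assumptions rather than a guaranteed speedup: open the parallel region once around the outer loop, keep temp private, and share only the inner loop with an omp for. The function name run_steps and the checksum it returns are invented for this illustration. Whether it helps depends on the OpenMP runtime (many keep their threads alive between regions anyway) and on how much work each step contains.

```c
/* Runs 'steps' ordered outer steps; the inner loop of each step is
   divided among the threads. Returns a checksum of the dummy results. */
static long run_steps(int steps, int n)
{
    long total = 0;

    #pragma omp parallel
    {
        int i, j, k;   /* declared inside the region, so private */

        /* Every thread walks through the same outer steps. */
        for (i = 0; i < steps; i++) {
            #pragma omp for reduction(+:total)
            for (j = 0; j < n; j++) {
                long temp = 0;   /* private: no shared cache line is written */
                for (k = 0; k < 2; k++)
                    temp = (long)i * j * k;
                if (i % 2 == 0)
                    temp = 0;
                total += temp;
            }
            /* The implicit barrier at the end of the omp for keeps
               the outer steps in order. */
        }
    }
    return total;
}
```

The team of threads is requested once instead of once per outer iteration, and the only remaining synchronization is the barrier at the end of each step.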

Local copies of arrays for threads in OpenMP?

I am new to OpenMP so this might be very basic.
I have a function:
void do_calc(int input1[], int input2[], int results[]);
Now, the function modifies input1[] during calculations but still can use it for another iteration (it sorts it in various ways), input2[] is different for every iteration and the function stores results in results[].
In one threaded version of the program I just iterate through various input2[]. In parallel version I try this:
#pragma omp parallel for reduction (+:counter) schedule(static) private (i,j)
for (i = 0; i < NUMITER ; i++){
int tempinput1[1000];
int tempresults[1000];
int tempinput2[5] = derive_input_from_i(i, input2[]);
array_copy(input, tempinput);
do_calc(tempinput, tempinput2, tempresults);
for (j = 0; j < 1000; j++)
counter += tempresults[j]; //simplified
}
This code works but is very inefficient because I am copying input to tempinput every iteration and I need only one copy per thread. This copy could be then reused in subsequent do_calc invocations. What I would like to do is this:
#do this only once for every thread worker:
array_copy(input, tempinput);
and then tell the thread to store tempinput for iterations it does in the future.
How do I go about it in OpenMP?
Additional performance issues:
a) I would like to have the code which works on dual/quad/octal core processors and let OpenMP determine number of thread workers and for every of them copy input once;
b) My algorithm benefits from input[] being sorted in previous iteration (as then next sort is faster as keys change only slightly for similar i's) so I would like to make sure that number of iterations is divided equally among threads and that thread no 1 gets 0 ... NUMITER/n portion of iterations, thread no 2 gets NUMITER/n ... 2*NUMITER/n etc.
b) Is not that important but it would be very cool to have :)
(I am using Visual Studio 2010 and I have OpenMP 2.0 version)
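One possible approach, sketched under assumptions: split the combined directive into a parallel region and a separate omp for. Anything declared inside the region before the loop exists once per thread, so the copy is made once per thread and reused for all of that thread's iterations. The do_calc_stub function below is a made-up stand-in for do_calc, and INPUT_LEN stands for the array size. With schedule(static) and no chunk size, each thread receives one contiguous block of iterations, which is the distribution asked for in (b).

```c
#include <string.h>

#define INPUT_LEN 1000

/* Hypothetical stand-in for do_calc: it reorders its input (like the
   sort in the question) and returns a single result value. */
static long do_calc_stub(int input1[], int i)
{
    long sum = 0;
    int k, tmp;

    tmp = input1[0];
    input1[0] = input1[i % INPUT_LEN];
    input1[i % INPUT_LEN] = tmp;

    for (k = 0; k < INPUT_LEN; k++)
        sum += input1[k];
    return sum + i;
}

static long run_all(const int input[], int numiter)
{
    long counter = 0;

    #pragma omp parallel
    {
        int local_input[INPUT_LEN];   /* one array per thread */
        int i;

        /* Executed once by each thread, not once per iteration. */
        memcpy(local_input, input, sizeof local_input);

        #pragma omp for schedule(static) reduction(+:counter)
        for (i = 0; i < numiter; i++)
            counter += do_calc_stub(local_input, i);
    }
    return counter;
}
```

The number of threads is left to the runtime, so the same code should behave the same way on dual, quad or octal core machines.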

What would be an efficient way to add multithreading to this simple algorithm?

I would say my knowledge in C is fair, and I wish to extend a program to enhance my knowledge of parallel programming.
Essentially, the program I am referring to is a brute-force generator that increments through passwords, such as from 0000 .. zzzz, over a specific character set:
Need help with brute force code for crypt(3)
The algorithm is outlined below (credit to Jerome for this)
int len = 3;
char letters[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
int nbletters = sizeof(letters)-1;
int main() {
int i, entry[len];
for(i=0 ; i<len ; i++) entry[i] = 0;
do {
for(i=0 ; i<len ; i++) putchar(letters[entry[i]]);
putchar('\n');
for(i=0 ; i<len && ++entry[i] == nbletters; i++) entry[i] = 0;
} while(i<len);
}
In what logical way would you say this could be extended by multithreading?
CUDA is a silly, if simple, solution. I had heard of OpenMP which in my books looks like a good solution, how do you think this could be split up to benefit from multiple cores of my computer? I.e. core 1 computing aaaa..ffff, and core 2 computing ffff...zzzz, is this the only method that would make sense with this?
I think you answered your own question. The aaaa..ffff on thread #1 and ffff..zzzz on thread #2 is probably the way to go, except to maybe break it down into more threadable parts in case you have more cores available. Trying to start a thread to perform some part of the do loop would probably introduce more overhead than benefit in such a tight algorithm.
I assume that you want to see your output characters in the order they are referenced in the entry array.
This is a sequential operation you can not parallelize it.
Edit:
OK, now I see how wrong I was :) You actually CAN parallelize this program, but you have to implement an additional layer that handles the order of the letters in the output. You also need to implement synchronization.
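A minimal sketch of the range-splitting idea from the first answer, assuming the goal is to test candidates rather than print them in order. Each iteration of the parallel loop fixes the first character and runs the original odometer over the remaining positions. The matches function is a made-up placeholder for the crypt(3) comparison, and search is a hypothetical name.

```c
#include <string.h>

#define LEN 3

static const char letters[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
#define NBLETTERS ((int)(sizeof letters - 1))

/* Hypothetical check standing in for the crypt(3) comparison. */
static int matches(const char *candidate, const char *target)
{
    return strcmp(candidate, target) == 0;
}

/* Tests every candidate of length LEN. Returns how many were tested
   and sets *found to 1 if the target was among them. */
static long search(const char *target, int *found)
{
    long tested = 0;
    int hit = 0;
    int first;

    /* One slice of the search space per first character. */
    #pragma omp parallel for reduction(+:tested) reduction(|:hit)
    for (first = 0; first < NBLETTERS; first++) {
        char word[LEN + 1];
        int entry[LEN - 1];
        int i;

        for (i = 0; i < LEN - 1; i++)
            entry[i] = 0;
        word[0] = letters[first];
        word[LEN] = '\0';

        do {
            for (i = 0; i < LEN - 1; i++)
                word[i + 1] = letters[entry[i]];
            tested++;
            if (matches(word, target))
                hit = 1;
            /* Same odometer step as in the original code. */
            for (i = 0; i < LEN - 1 && ++entry[i] == NBLETTERS; i++)
                entry[i] = 0;
        } while (i < LEN - 1);
    }

    *found = hit;
    return tested;
}
```

Splitting on the first character gives 62 slices, which is usually enough to keep a handful of cores busy. If the candidates had to be printed in order, each slice would need its own buffer that is written out serially afterwards.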

Resources