unbalanced nested for loops in openmp

unbalanced nested for loops in openmp - c

I've been trying to parallelize an algorithm with unbalanced nested for loops using OpenMP. I can't post the original code as it's a secret project of an unheard government but here's a toy example:
for (i = 0; i < 100; i++) {
#pragma omp parallel for private(j, k)
for (j = 0; j < 1000000; j++) {
for (k = 0; k < 2; k++) {
temp = i * j * k; /* dummy operation (don't mind the race) */
}
if (i % 2 == 0) temp = 0; /* so I can't use openmp collapse */
}
}
Currently this example is working slower in multiple threads (~1 sec in single thread ~2.4 sec in 2 threads etc.).
Things to note:
Outer for loop needs to be done in order (dependent on the previous step) (As far as I know, OpenMP handles inner loops well so threads don't get created/destroyed at each step, right?)
Typical index numbers are given in the example (100, 1000000, 2)
Dummy operation consists of just a few operations
There are some conditional operations outside the inner most loop so collapse is not an option (doesn't seem like it would increase the performance anyways)
Looks like an embarrassingly parallel algorithm but I can't seem to get any speedups for the last two days. What would be the best strategy here?

Unfortunately this embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystall ball tells me that besides i, temp is also a shared automatic variable, I would assume it for the rest of this text. It also tells me that you have a pre-Nehalem CPU...
There are two sources of slowdown here - code transformation and cache coherency.
The way parallel regions are implmentend is that their code is extracted in separate functions. Shared local variables are extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similiar to this:
typedef struct {
int i;
int temp;
} main_omp_fn_0_shared_vars;
void main_omp_fn_0 (void *data) {
main_omp_fn_0_shared_vars *vars = data;
// compute values of j_min and j_max for this thread
for (j = j_min; j < j_max; j++) {
for (k = 0; k < 2; k++) {
vars->temp = vars->i * j * k;
if (vars->i % 2 == 0) vars->temp = 0;
}
}
int main (void) {
int i, temp;
main_omp_fn_0_shared_vars vars;
for (i = 0; i < 100; i++)
{
vars.i = i;
vars.temp = temp;
// This is how GCC implements parallel regions with libgomp
// Start main_omp_fn_0 in the other threads
GOMP_parallel_start(main_omp_fn_0, &vars, 0);
// Start main_omp_fn_0 in the main thread
main_omp_fn_0(&vars);
// Wait for other threads to finish (implicit barrier)
GOMP_parallel_end();
i = vars.i;
temp = vars.temp;
}
}
You pay a small penalty for accessing temp and i this way as their intermediate values cannot be stored in registers but are loaded and stored each time.
The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line and although vars.i is only read from and vars.temp is only written to, full cache invalidation is likely to occur at each iteration of the inner loop.
Normally access to shared variables is protected by explicit synchronisation constructs like atomic statements and critical sections and performance degradation is well expected in that case.

Think of the overheads:
Since your outer loop needs to be in order you're creating x threads to perform the work in the inner loop, destroying them, then creating them again... and so on 100 times.
You have to wait until the longest task within the inner loop completes its work before performing the next step in the outer loop, so essentially this is a synchronization overhead. The tasks don't look irregular, but if the work to perform is small there's only so much speedup you can get out of this.
You have the cost of thread creation here, and allocating the private variables.
If the work inside the inner loop is small the benefits of parallelising this loop might not necessarily outweigh the cost of the parallelisation overheads above, hence you end up with a slowdown.

Related

Make out-of-order CPU run instructions in-order

Consider the loop:
for (int i = 0; i < n; i++) {
sum += a[i];
}
An out-of-order CPU can execute many instructions in advance, it can e.g. have 20 parallel pending loads of a[i] from 20 different iterations of the loop.
But, for me, this is a hindrance. I want that the CPU works like an in-order CPU. I want it to not start a load in the next iteration until it has finished the load in the current iteration.
The reason I want this is very simple: I want to save the memory bandwidth for other processes running on other CPU core. This process is low priority, and I want to limit is as much as possible even though it will get slower.
Two techniques come to mind: fake loop-carried dependencies and memory barriers.
For fake dependencies, something like this can be used:
double* a_current = a;
for (int i = 0; i < n; i++) {
volatile int a_val = *a_current;
sum += a_val;
a_current += 1 + (a_val - a_val);
}
This is horrible code and I wonder if there is something better.
About memory barriers, I don't know almost anything. What could be useful there?

Manual synchronization in OpenMP while loop

I recently started working with OpenMP to do some 'research' for an project in university. I have a rectangular and evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one in x- and y-direction of the grid each) wrapped by a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was to do a spatial a parallelization on the for loops.
Works fine too.
The approach I have problems with is a more tricky idea. Each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished the thread goes on with the next row (y=1) and so on. At the same time thread #2 can already start at y=0, because all the necessary information are already available. I just need to do a kind of a manual synchronization between the threads so they can't overtake each other.
Therefore I used an array called check. It contains the thread-id that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (value in check[j] is not correct), the thread goes into an empty while-loop, until it is.
Things will get clearer with a MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main()
{
// initialize variables
int iter = 0; // iteration step counter
int check[100] = { 0 }; // initialize all rows for thread #0
#pragma omp parallel num_threads(2)
{
int ID, num_threads, nextID;
double u[100 * 300] = { 0 };
// get parallelization info
ID = omp_get_thread_num();
num_threads = omp_get_num_threads();
// determine next valid id
if (ID == num_threads - 1) nextID = 0;
else nextID = ID + 1;
// iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
while (iter<1000)
{
// rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
for (int j = 1; j < (100 - 1); j++)
{
// manual sychronization: wait until previous thread completed enough rows
while (check[j + 1] != ID)
{
//printf("Thread #%d is waiting!\n", ID);
}
// gridpoints in row j
for (int i = 1; i < (300 - 1); i++)
{
// solve PDE on gridpoint
// replaced by random operation to consume time
double ignore = pow(8.39804,10.02938) - pow(12.72036,5.00983);
}
// update of check array in atomic to avoid race condition
#pragma omp atomic write
{
check[j] = nextID;
}
}// for j
#pragma omp atomic write
check[100 - 1] = nextID;
#pragma omp atomic
iter++;
#pragma omp single
{
printf("Iteration step: %d\n\n", iter);
}
}//while
}// omp parallel
}//main
The thing is, this MWE actually works on my machine. But if I copy it into my project, it doesn't. Additionally the outcome is always different: It stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while-loop it works! The output contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I created somehow a race condition, but I don't know where.
Does somebody has an idea what the problem could be? Or a hint how to realize this kind of synchronization?

I think you are mixing up atomicity and memory consitency. The OpenMP standard actually describes it very nicely in
1.4 Memory Model (emphasis mine):
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
1.4.3 The Flush Operation
The memory model has relaxed-consistency because a thread’s temporary
view of memory is not required to be consistent with memory at all
times. A value written to a variable can remain in the thread’s
temporary view until it is forced to memory at a later time. Likewise,
a read from a variable may retrieve the value from the thread’s
temporary view, unless it is forced to read from memory. The OpenMP
flush operation enforces consistency between the temporary view and
memory.
To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause to your atomic constructs. This clause forces an implicit flush to the operation. (It is called a sequentially consistent atomic construct)
int c;
// manual sychronization: wait until previous thread completed enough rows
do
{
#pragma omp atomic read
c = check[j + 1];
} while (c != ID);
Disclaimer: I can't really try the code right now.
Furhter Notes:
I think the iter stop criteria is bogus, the way you use it, but I guess that's irrelevant given that it is not your actual criteria.
I assume this variant will perform worse than the spatial decomposition. You loose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0"

tasks run in thread takes longer than in serial?

So im doing some computation on 4 million nodes.
the very bask serial version just have a for loop which loops 4 million times and do 4 million times of computation. this takes roughly 1.2 sec.
when I split the for loop to, say, 4 for loops and each does 1/4 of the computation, the total time became 1.9 sec.
I guess there are some overhead in creating for loops and maybe has to do with cpu likes to compute data in chunk.
The real thing bothers me is when I try to put 4 loops to 4 thread on a 8 core machine, each thread would take 0.9 seconds to finish.
I am expecting each of them to only take 1.9/4 second instead.
I dont think there are any race condition or synchronize issue since all I do was having a for loop to create 4 threads, which took 200 microseconds. And then a for loop to joins them.
The computation read from a shared array and write to a different shared array.
I am sure they are not writing to the same byte.
Where could the overhead came from?
main: ncores: number of cores. node_size: size of graph (4 million node)
for(i = 0 ; i < ncores ; i++){
int *t = (int*)malloc(sizeof(int));
*t = i;
int iret = pthread_create( &thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
{
pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
Void *calculate_rank_pthread(void *argument) {
int index = *(int*)argument;
for(i = index; i < node_size ; i+=ncores)
current_vector[i] = calc_r(i, vector);
return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
double prank = 0;
int j;
for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
prank += vector[col_ind[j]] * val[j];
}
return prank;
}
everything that is not declared are global variable

The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vector are allocated.
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data i += ncores give each of them a range of data to work on.

For me the same surprise when build and run in Debug (other test code though).
In release all as expected ;)

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently reading up about OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction to do multi-threaded programing. I'd appreciate any pointers please to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my Compiler settings (C/C++ -> Language -> Open MP Support (/openmp).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.

Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side-by-side you wont be able to guarantee order unless you synchronize the access (at which point you can just leave out the parallelization as it becomes effectively linear). If you need to rely on order, you can parallelize any calculations but need to write it down in one thread.
If the points itself are messed up, check where your variables are declared and if multiple threads are accessing the same.

A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.

When you have a loop with a nested loop there is no need for a second omp pragma.
It will already paralelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequencial incrementation, so you can not execute the second loop in a random order. OMP pragmas are a very easy and cool way to paralelize code but do not use them too much!
More details here -> Parallel Loops with OpenMP

Local copies of arrays for threads in OpenMP?

I am new to OpenMP so this might be very basic.
I have a function:
void do_calc(int input1[], int input2[], int results[]);
Now, the function modifies input1[] during calculations but still can use it for another iteration (it sorts it in various ways), input2[] is different for every iteration and the function stores results in results[].
In one threaded version of the program I just iterate through various input2[]. In parallel version I try this:
#pragma omp parallel for reduction (+:counter) schedule(static) private (i,j)
for (i = 0; i < NUMITER ; i++){
int tempinput1[1000];
int tempresults[1000];
int tempinput2[5] = derive_input_from_i(i, input2[]);
array_copy(input, tempinput);
do_calc(tempinput, tempinput2, tempresults);
for (j = 0; j < 1000; j++)
counter += tempresults[i] //simplified
}
This code works but is very inefficient because I am copying input to tempinput every iteration and I need only one copy per thread. This copy could be then reused in subsequent do_calc invocations. What I would like to do is this:
#do this only once for every thread worker:
array_copy(input, tempinput);
and then tell the thread to store tempinput for iterations it does in the future.
How do I go about it in OpenMP?
Additional performance issues:
a) I would like to have the code which works on dual/quad/octal core processors and let OpenMP determine number of thread workers and for every of them copy input once;
b) My algorithm benefits from input[] being sorted in previous iteration (as then next sort is faster as keys change only slightly for similar i's) so I would like to make sure that number of iterations is divided equally among threads and that thread no 1 gets 0 ... NUMITER/n portion of iterations, thread no 2 gets NUMITER/n ... 2*NUMITER/n etc.
b) Is not that important but it would be very cool to have :)
(I am using Visual Studio 2010 and I have OpenMP 2.0 version)