OpenMP For - group loops for cache optimization


I am working to adapt a program to use OpenMP. I have a group of nested for loops. The outermost for loop iterates over the y-axis, going down an image. I would like to run multiple parallel threads on this loop, but I'm having trouble making it fast.
Currently when I run 8 threads it runs like:
thread 0 -> row 0,8,16...
thread 1 -> row 1,9,17...
thread 2 -> row 2,10,18...
thread 3 -> row 3,11,19...
I would like it to run in blocks, so that thread 0 does the first 1/8 of the rows. What is the best way to do this?
Current code:
...
int y_percent = data_size_Y/8;
int thread = 0;
#pragma omp parallel for num_threads(8) firstprivate(vecs, bufferedOut,data_size_X, data_size_Y, kern_cent_X, kern_cent_Y, sum)
for(int y = y_percent*omp_get_thread_num(); y < (omp_get_thread_num()+1)*y_percent; y++){ // the y coordinate of the output location we're focusing on

You can use the schedule clause on the pragma to specify the chunk size you want each thread to process. In the example below, I specify the static scheduling method with a chunk size, which is the number of contiguous iterations handed to a thread at a time. In this simple example, each thread gets one chunk of 8 iterations (thread 0 gets iterations 0-7, thread 1 gets iterations 8-15, and so on). If you aren't concerned with which thread gets which chunk (e.g. you don't care whether thread 0 gets the first chunk), you can replace static with dynamic. dynamic assigns chunks to threads as they become free instead of preassigning them at the start, which is useful for load balancing when some iterations take longer than others. For more information on the scheduling methods, check out the following:
Wikipedia article - Scheduling Clauses
LLNL docs - DO/for Directive
Example:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main() {
int i;
int iterations = 32;
int num_threads = 4;
#pragma omp parallel for schedule(static, 8) num_threads(num_threads)
for(i=0; i<iterations; i++) {
printf("thread %d: %d\n", omp_get_thread_num(), i);
}
}
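To make the round-robin mapping concrete, here is a minimal sketch of the arithmetic behind schedule(static, chunk). The helper name static_chunk_owner is made up for illustration and is not part of OpenMP; it only mirrors the rule that chunks are dealt out to threads in thread-number order, and it assumes the team really has nthreads threads.

```c
#include <assert.h>

/* Hypothetical helper (not an OpenMP API): which thread runs iteration i
   under schedule(static, chunk) with a team of nthreads threads.
   Chunks are dealt out round-robin in thread-number order. */
int static_chunk_owner(int i, int chunk, int nthreads)
{
    assert(i >= 0 && chunk > 0 && nthreads > 0);
    return (i / chunk) % nthreads;
}
```

With a chunk of 8 and 4 threads, iteration 9 belongs to thread 1 and iteration 32 would wrap around to thread 0 again. For the image loop in the question, a chunk size of about data_size_Y/8 with 8 threads gives each thread a single block of rows.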

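You could simply use the following code to achieve that. Most implementations default to a static schedule, which gives each thread one contiguous block of rows; add schedule(static) if you want to be explicit.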
#pragma omp parallel for num_threads(8)
for(int y = 0; y < data_size_Y; y++) {
....
}
Generally, I think the long firstprivate list is not necessary. Depending on exactly how you use those variables, most of them can probably be declared shared.
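One way to check the block assignment on your own machine is to record which thread handled each row. The sketch below is only an illustration: count_owner_changes is a made-up helper, the per-row work is replaced by a single store, and the serial fallback is an assumption so that it also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical helper: record which thread handled each row, then count
   how often the owner changes between neighbouring rows. */
int count_owner_changes(int *owner, int rows)
{
    int y, changes = 0;
    assert(owner != 0 && rows > 0);

    /* schedule(static) without a chunk size: at most one chunk per thread */
    #pragma omp parallel for schedule(static)
    for (y = 0; y < rows; y++) {
        owner[y] = omp_get_thread_num();   /* stand-in for the per-row work */
    }

    for (y = 1; y < rows; y++) {
        if (owner[y] != owner[y - 1]) changes++;
    }
    return changes;
}
```

Because each thread receives at most one chunk, the number of owner changes stays below the number of threads. An interleaved assignment like the one described in the question would change owner on almost every row.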

Related

Execute for loop iterations in openmp in order with dynamic schedule

I'd like to run a for loop in openmp with dynamic schedule.
#pragma omp for schedule(dynamic,chunk) private(i) nowait
for(i=0;i<n;i++){
//loop code here
}
and I'd like each thread to execute ordered chunks, e.g.
thread 1 -> iterations 0 to k
thread 2 -> iterations k+1 to k+chunk
etc.
A static schedule partly does what I want, but I'd like to load-balance the iterations dynamically.
The ordered clause doesn't do it either, if I understood correctly what it does.
My question is: how can I make sure that the chunks are assigned in order?
I am using openmp 3.1 with gcc

You can implement this yourself without resorting to omp for, which is considered a convenience function by expert OpenMP programmers.
The following roughly illustrates what you might do. Please check the arithmetic carefully.
#pragma omp parallel
{
int me = omp_get_thread_num();
int nt = omp_get_num_threads();
int chunk = (n + nt - 1) / nt; /* ceiling of n/nt, so no iterations are dropped */
int start = me * chunk;
int end = (me+1) * chunk;
if (end > n) end = n;
for (int i = start; i < end; i++) {
/* do work */
}
} /* end parallel */
This does not do any dynamic load-balancing. You can do that yourself by assigning loop iterations unevenly to threads if you know the cost function a priori. You might read up on the inspector-executor model (e.g. 1).
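Since the answer asks you to check the arithmetic, here is a small sketch of the bounds computation on its own, so it can be tested without any threads. chunk_bounds is a hypothetical helper name; it assumes half-open ranges and ceiling division, which is one reasonable choice rather than the only one.

```c
#include <assert.h>

/* Hypothetical helper: half-open range [*start, *end) for thread `me`
   out of `nt` threads over n iterations. Ceiling division makes sure the
   last iterations are not dropped when nt does not divide n. */
void chunk_bounds(int n, int nt, int me, int *start, int *end)
{
    int chunk, s, e;
    assert(n >= 0 && nt > 0 && me >= 0 && me < nt);

    chunk = (n + nt - 1) / nt;   /* ceil(n / nt) */
    s = me * chunk;
    e = s + chunk;
    if (s > n) s = n;            /* threads past the end get an empty range */
    if (e > n) e = n;
    *start = s;
    *end = e;
}
```

For n = 10 and nt = 4 this yields the ranges [0,3), [3,6), [6,9) and [9,10). When there are more threads than iterations, the surplus threads receive empty ranges.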

Manual synchronization in OpenMP while loop

I recently started working with OpenMP to do some 'research' for a project at university. I have a rectangular, evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one each in the x- and y-direction of the grid) wrapped by a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was to do a spatial parallelization of the for loops.
Works fine too.
The approach I have problems with is a trickier idea. Each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it has finished, it goes on to the next row (y=1) and so on. At the same time thread #2 can already start at y=0, because all the necessary information is already available. I just need to do a kind of manual synchronization between the threads so that they can't overtake each other.
Therefore I used an array called check. It contains the thread-id that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (value in check[j] is not correct), the thread goes into an empty while-loop, until it is.
Things will get clearer with a MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main()
{
// initialize variables
int iter = 0; // iteration step counter
int check[100] = { 0 }; // initialize all rows for thread #0
#pragma omp parallel num_threads(2)
{
int ID, num_threads, nextID;
double u[100 * 300] = { 0 };
// get parallelization info
ID = omp_get_thread_num();
num_threads = omp_get_num_threads();
// determine next valid id
if (ID == num_threads - 1) nextID = 0;
else nextID = ID + 1;
// iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
while (iter<1000)
{
// rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
for (int j = 1; j < (100 - 1); j++)
{
// manual synchronization: wait until previous thread completed enough rows
while (check[j + 1] != ID)
{
//printf("Thread #%d is waiting!\n", ID);
}
// gridpoints in row j
for (int i = 1; i < (300 - 1); i++)
{
// solve PDE on gridpoint
// replaced by random operation to consume time
double ignore = pow(8.39804,10.02938) - pow(12.72036,5.00983);
}
// update of check array in atomic to avoid race condition
#pragma omp atomic write
{
check[j] = nextID;
}
}// for j
#pragma omp atomic write
check[100 - 1] = nextID;
#pragma omp atomic
iter++;
#pragma omp single
{
printf("Iteration step: %d\n\n", iter);
}
}//while
}// omp parallel
}//main
The thing is, this MWE actually works on my machine. But if I copy it into my project, it doesn't. Additionally the outcome is always different: It stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while-loop it works! The output contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I created somehow a race condition, but I don't know where.
Does anybody have an idea what the problem could be? Or a hint on how to implement this kind of synchronization?

I think you are mixing up atomicity and memory consistency. The OpenMP standard actually describes it very nicely in
1.4 Memory Model (emphasis mine):
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
1.4.3 The Flush Operation
The memory model has relaxed-consistency because a thread’s temporary
view of memory is not required to be consistent with memory at all
times. A value written to a variable can remain in the thread’s
temporary view until it is forced to memory at a later time. Likewise,
a read from a variable may retrieve the value from the thread’s
temporary view, unless it is forced to read from memory. The OpenMP
flush operation enforces consistency between the temporary view and
memory.
To avoid that, you should also make the read of check[] atomic and add the seq_cst clause to your atomic constructs (available since OpenMP 4.0). This clause attaches an implicit flush to the operation; the result is called a sequentially consistent atomic construct.
int c;
// manual synchronization: wait until the previous thread has completed enough rows
do
{
#pragma omp atomic read seq_cst
c = check[j + 1];
} while (c != ID);
Disclaimer: I can't really try the code right now.
Further notes:
I think the iter stop criterion is bogus the way you use it, but I guess that's irrelevant given that it is not your actual criterion.
I assume this variant will perform worse than the spatial decomposition. You lose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0".
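The hand-off pattern can be tried in isolation before putting it back into the solver. The following is a minimal sketch, not the asker's code: token_ring is a made-up function in which a single shared variable turn plays the role of one check[] entry, and both the spin-wait read and the hand-off write use seq_cst atomics. It assumes a compiler with OpenMP 4.0 or later; the serial fallback is only there so it builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical demo: threads take turns incrementing `total`.
   Returns 1 if every increment arrived, 0 otherwise. */
int token_ring(int rounds)
{
    int turn = 0;     /* id of the thread that may proceed */
    long total = 0;   /* protected only by the turn protocol */
    int team = 1;     /* actual team size, stored by thread 0 */
    assert(rounds >= 0);

    #pragma omp parallel num_threads(2) shared(turn, total, team)
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int next = (id + 1) % nt;
        int r, t;
        if (id == 0) team = nt;

        for (r = 0; r < rounds; r++) {
            /* spin until it is this thread's turn; the atomic read
               forces a fresh look at memory on every pass */
            do {
                #pragma omp atomic read seq_cst
                t = turn;
            } while (t != id);

            total += 1;   /* the "work" for this step */

            /* hand over; the implied flush publishes `total` as well */
            #pragma omp atomic write seq_cst
            turn = next;
        }
    }
    return total == (long)rounds * team;
}
```

Without the atomic read, the compiler is free to keep turn in a register and spin forever, which matches the observation that adding a printf call to the wait loop made the original program work.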

Specify which positions in an array a thread access

I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions each thread is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19 and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80
int main (int argc, char *argv[])
{
int nthreads = 8, tid, i, base, a[N];
#pragma omp parallel
{
tid = omp_get_thread_num();
base = ((float)tid/(float)nthreads) * N;
for (i = 0; i < N/nthreads; i++) {
a[base + i] = 0;
printf("%d %d\n", tid, base+i);
}
}
return 0;
}
This program, however, doesn't access all positions as I expected. The output is different every time I run it; for example, it might be:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.

The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for(i = 0; i < 8; i++) {
for(j = 0; j < 10; j++) {
a[10*i+j] = 0;
printf("thread %d updated element %d\n", omp_get_thread_num(), 10*i+j);
}
}
I was unable to test this right now but I'm 90% sure this does exactly what you want (and you have "complete control" over how things work when you do it like this). However it may not be the most efficient thing to do. For one thing - when you just want to set a bunch of elements to zero, you want to use a built in function like memset, not a loop...

You're missing a fair bit. The directive
#pragma omp parallel
only tells the run time that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work your code will need another directive, something like this
#pragma omp parallel
{
#pragma omp for
...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular of getting wrong matters that the compiler and run-time will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads investigate the schedule clause. I suggest that you start your parallel region something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a)
for (i = 0; i < N; i++) {
a[i] = 0;
}
Note also that I have specified the accessibility of variables. This is good practice, especially when learning OpenMP. The compiler will make i private automatically. (N is a macro in your code, so it is not a variable and does not belong in a data-sharing clause.)
As I have written it, the run-time will typically divide the iterations over i into chunks, one for each thread. The first thread will get i = 0..(N/num_threads)-1, the second i = N/num_threads..(2N/num_threads)-1, and so on.
Later you can add a schedule clause explicitly to the directive. With the usual default, what I have written above is equivalent to
#pragma omp parallel for default(none) shared(a) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.

#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you are actually trying to distribute the work by hand. The reason it does not work is most probably race conditions in computing the parameters for the for loop.
If I recall properly, any variables declared outside of the parallel region are shared among threads. So ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better way is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
tid = omp_get_thread_num();
#pragma omp for
for (i = 0; i < N; i++) {
a[i] = 0;
printf("%d %d\n", tid, i);
}
}
Note that private(tid) makes a local copy of tid for each thread, so the threads do not overwrite each other's result from omp_get_thread_num(). It is also possible to declare shared(a), because we want every thread to work on the same copy of the array; here that is already implicit. The loop variable i of an omp for loop is made private automatically, even though it is declared outside the parallel region.
EDIT: I noticed original underlying problem so I took out irrelevant parts.
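If you do want to keep the manual decomposition, the race disappears once every per-thread variable is declared inside the parallel region. The sketch below is an illustration under assumptions: fill_blocks is a made-up helper, the block size uses ceiling division, and the team size is read with omp_get_num_threads() rather than hard-coded, because the runtime may not give you 8 threads.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical helper: each thread stamps its id into its own block of a[].
   Returns the number of positions nobody wrote to (0 is expected). */
int fill_blocks(int *a, int n)
{
    int k, missed = 0;
    assert(a != 0 && n >= 0);
    for (k = 0; k < n; k++) a[k] = -1;

    #pragma omp parallel
    {
        /* Declared inside the region, so every thread has its own copy. */
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();        /* actual team size */
        int chunk = (n + nthreads - 1) / nthreads;   /* ceil(n / nthreads) */
        int base = tid * chunk;
        int i;
        for (i = base; i < base + chunk && i < n; i++) {
            a[i] = tid;
        }
    }

    for (k = 0; k < n; k++) {
        if (a[k] < 0) missed++;
    }
    return missed;
}
```

With 80 elements and 8 threads, thread 0 writes positions 0-9, thread 1 writes 10-19, and so on, which is the layout the question asks for.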

Local copies of arrays for threads in OpenMP?

I am new to OpenMP so this might be very basic.
I have a function:
void do_calc(int input1[], int input2[], int results[]);
Now, the function modifies input1[] during the calculation, but the modified array can still be used for the next iteration (the function sorts it in various ways); input2[] is different for every iteration, and the function stores its results in results[].
In the single-threaded version of the program I just iterate through the various input2[]. In the parallel version I try this:
#pragma omp parallel for reduction (+:counter) schedule(static) private (i,j)
for (i = 0; i < NUMITER ; i++){
int tempinput1[1000];
int tempresults[1000];
int tempinput2[5] = derive_input_from_i(i, input2[]);
array_copy(input, tempinput);
do_calc(tempinput, tempinput2, tempresults);
for (j = 0; j < 1000; j++)
counter += tempresults[j]; //simplified
}
This code works but is very inefficient because I am copying input to tempinput every iteration and I need only one copy per thread. This copy could be then reused in subsequent do_calc invocations. What I would like to do is this:
#do this only once for every thread worker:
array_copy(input, tempinput);
and then tell the thread to store tempinput for iterations it does in the future.
How do I go about it in OpenMP?
Additional performance issues:
a) I would like to have the code which works on dual/quad/octal core processors and let OpenMP determine number of thread workers and for every of them copy input once;
b) My algorithm benefits from input[] being sorted in previous iteration (as then next sort is faster as keys change only slightly for similar i's) so I would like to make sure that number of iterations is divided equally among threads and that thread no 1 gets 0 ... NUMITER/n portion of iterations, thread no 2 gets NUMITER/n ... 2*NUMITER/n etc.
b) Is not that important but it would be very cool to have :)
(I am using Visual Studio 2010 and I have OpenMP 2.0 version)
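One common way to get a single copy per thread is to split the combined directive: open a parallel region, make the copy there, and put the work-sharing for loop inside it. The sketch below only illustrates that structure. do_calc_stub is a made-up stand-in for do_calc (it sorts the working copy and returns one element), run_with_thread_copies is a hypothetical name, and the buffer is heap-allocated for simplicity. It uses only constructs that should already exist in OpenMP 2.0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical stand-in for do_calc: sorts the working copy in place
   and returns its (i mod n)-th smallest element. */
static int do_calc_stub(int *work, int n, int i)
{
    qsort(work, (size_t)n, sizeof work[0], cmp_int);
    return work[i % n];
}

long run_with_thread_copies(const int *input, int n, int iters)
{
    long counter = 0;
    assert(input != 0 && n > 0 && iters >= 0);

    #pragma omp parallel shared(counter)
    {
        int i;
        /* done once per thread, not once per iteration */
        int *work = malloc((size_t)n * sizeof *work);
        if (work == NULL) abort();
        memcpy(work, input, (size_t)n * sizeof *work);

        /* static schedule without a chunk size: one contiguous block of
           iterations per thread, so neighbouring i share a working copy */
        #pragma omp for schedule(static) reduction(+:counter)
        for (i = 0; i < iters; i++) {
            counter += do_calc_stub(work, n, i);
        }
        free(work);
    }
    return counter;
}
```

The number of threads is left to the runtime, which covers point a), and schedule(static) covers point b).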

How to generate random numbers in parallel?

I want to generate pseudorandom numbers in parallel using openMP, something like this:
int i;
#pragma omp parallel for
for (i=0;i<100;i++)
{
printf("%d %d %d\n",i,omp_get_thread_num(),rand());
}
return 0;
I've tested it on Windows and got a huge speedup, but each thread generated exactly the same numbers. I've also tested it on Linux and got a huge slowdown: the parallel version on an 8-core processor was about 10 times slower than the sequential one, but each thread generated different numbers.
Is there any way to have both speedup and different numbers?
Edit 27.11.2010
I think I've solved it using an idea from Jonathan Dursi post. It seems that following code works fast on both linux and windows. Numbers are also pseudorandom. What do You think about it?
int seed[10];
int main(int argc, char **argv)
{
int i,s;
for (i=0;i<10;i++)
seed[i] = rand();
#pragma omp parallel private(s)
{
s = seed[omp_get_thread_num()];
#pragma omp for
for (i=0;i<1000;i++)
{
printf("%d %d %d\n",i,omp_get_thread_num(),s);
s=(s*17931+7391); // those numbers should be chosen more carefully
}
seed[omp_get_thread_num()] = s;
}
return 0;
}
PS.: I haven't accepted any answer yet, because I need to be sure that this idea is good.

I'll post here what I posted to Concurrent random number generation :
I think you're looking for rand_r(), which explicitly takes the current RNG state as a parameter. Then each thread should have its own copy of seed data (whether you want each thread to start off with the same seed or different ones depends on what you're doing, here you want them to be different or you'd get the same row again and again). There's some discussion of rand_r() and thread-safety here: whether rand_r is real thread safe? .
So say you wanted each thread to have its seed start off with its thread number (which is probably not what you want, as it would give the same results every time you ran with the same number of threads, but just as an example):
#pragma omp parallel default(none)
{
int i;
unsigned int myseed = omp_get_thread_num();
#pragma omp for
for(i=0; i<100; i++)
printf("%d %d %d\n",i,omp_get_thread_num(),rand_r(&myseed));
}
Edit: Just on a lark, checked to see if the above would get any speedup. Full code was
#define NRANDS 1000000
int main(int argc, char **argv) {
struct timeval t;
int a[NRANDS];
tick(&t);
#pragma omp parallel default(none) shared(a)
{
int i;
unsigned int myseed = omp_get_thread_num();
#pragma omp for
for(i=0; i<NRANDS; i++)
a[i] = rand_r(&myseed);
}
double sum = 0.;
double time=tock(&t);
for (long int i=0; i<NRANDS; i++) {
sum += a[i];
}
printf("Time = %lf, sum = %lf\n", time, sum);
return 0;
}
where tick and tock are just wrappers to gettimeofday(), and tock() returns the difference in seconds. Sum is printed just to make sure that nothing gets optimized away, and to demonstrate a small point; you will get different numbers with different numbers of threads because each thread gets its own threadnum as a seed; if you run the same code again and again with the same number of threads you'll get the same sum, for the same reason. Anyway, timing (running on a 8-core nehalem box with no other users):
$ export OMP_NUM_THREADS=1
$ ./rand
Time = 0.008639, sum = 1074808568711883.000000
$ export OMP_NUM_THREADS=2
$ ./rand
Time = 0.006274, sum = 1074093295878604.000000
$ export OMP_NUM_THREADS=4
$ ./rand
Time = 0.005335, sum = 1073422298606608.000000
$ export OMP_NUM_THREADS=8
$ ./rand
Time = 0.004163, sum = 1073971133482410.000000
So there is a speedup, if not a great one; as @ruslik points out, this is not really a compute-intensive process, and other issues like memory bandwidth start playing a role. Thus, only a shade over 2x speedup on 8 cores.

You cannot use the C rand() function from multiple threads; this results in undefined behavior. Some implementations might give you locking (which will make it slow); others might allow threads to clobber each other's state, possibly crashing your program or just giving "bad" random numbers.
To solve the problem, either write your own PRNG implementation or use an existing one that allows the caller to store and pass the state to the PRNG iterator function.
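As a rough illustration of the "caller holds the state" approach, here is a sketch using Marsaglia's 32-bit xorshift step in place of rand(). This generator is swapped in only because it is short; it is statistically weak, and the seeding rule below (a constant multiplied by the thread number plus one) is an assumption for the example, not a recommendation. fill_random is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch also builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
#endif

/* Marsaglia's 32-bit xorshift step; *state must be nonzero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: fill out[0..n-1], each thread using its own state. */
void fill_random(uint32_t *out, int n)
{
    assert(out != 0 && n >= 0);
    #pragma omp parallel
    {
        /* Per-thread state; the odd multiplier keeps every seed nonzero.
           This seeding rule is an illustrative assumption. */
        uint32_t state = 0x9E3779B9u * ((uint32_t)omp_get_thread_num() + 1u);
        int i;
        #pragma omp for schedule(static)
        for (i = 0; i < n; i++) {
            out[i] = xorshift32(&state);
        }
    }
}
```

No state is shared between threads, so no locking is needed. All threads walk the same underlying cycle from different starting points, so for serious work a generator designed for parallel streams would be a better fit.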

Get each thread to set a different seed based on its thread id, e.g. srand(omp_get_thread_num() * 1000);

It seems that rand has a global state shared between all threads on Linux and thread-local state on Windows. The shared state on Linux causes your slowdown because of the necessary synchronization.
I don't think there is a portable way in the C library to use the RNG in parallel on multiple threads, so you need another one. You could use a Mersenne Twister. As marcog said, you need to initialize the seed differently for each thread.

On linux/unix you can use
long jrand48(unsigned short xsubi[3]);
where xsubi[3] encodes the state of the random number generator, like this:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
// per-thread generator state, declared inside the region so it is private
unsigned short xsub[3];
xsub[0]=xsub[1]=xsub[2]= 3+omp_get_thread_num();
int j;
#pragma omp for
for(j=0;j<10;j++)
printf("%d [%d] %ld\n", j, omp_get_thread_num(), jrand48(xsub));
}
}
compile with
g++-mp-4.4 -Wall -Wextra -O2 -march=native -fopenmp -D_GLIBCXX_PARALLEL jrand.cc -o jrand
(replace g++-mp-4.4 with whatever you need to call g++ version 4.4 or 4.3)
and you get
$ ./jrand
0 [0] 1344229389
1 [0] 1845350537
2 [0] 229759373
3 [0] 1219688060
4 [0] -553792943
5 [1] 360650087
6 [1] -404254894
7 [1] 1678400333
8 [1] 1373359290
9 [1] 171280263
i.e. 10 different pseudorandom numbers without any mutex locking or race conditions.

Random numbers can be generated very fast, so memory is usually the bottleneck. By dividing this task between several threads you create additional communication and synchronization overhead (and synchronizing the caches of different cores is not cheap).
It would be better to use a single thread with a better random() function.
