another question about OpenMP...
I'm trying to speed up counting sort with OpenMP, but my code runs fastest on 1 thread and slows down as I add threads (I've got 4 cores). The results are correct.
I'm parallelizing only the loop in which the counter is incremented; the rest is computed sequentially (is that OK?). Here I try making the increment an atomic operation. I also tried a version in which every thread had its own "counters" table, but it was even slower.
#pragma omp parallel for private(i) num_threads(4) default(none) shared(counters, table, possible_values, table_size)
for (i = 0; i < table_size; i++) {
    #pragma omp atomic
    counters[(int)(table[i]*100)]++;
}
table - contains the unsorted values
possible_values - 100 (I've got numbers from 0 to 0.99)
table_size - size of table
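For reference, the per-thread "counters" version I mentioned looked roughly like this (just a sketch of the idea, reusing the names from the snippet above; local_counters is a name I'm using here for each thread's private histogram, merged into the shared counters at the end):
#pragma omp parallel num_threads(4)
{
    int local_counters[100] = {0};   /* one private histogram per thread (possible_values is 100) */
    int i;
    #pragma omp for
    for (i = 0; i < table_size; i++)
        local_counters[(int)(table[i]*100)]++;
    #pragma omp critical             /* merge each private histogram into the shared counters */
    for (int v = 0; v < 100; v++)
        counters[v] += local_counters[v];
}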
How can I speed things up?
Assume that each loop iteration of my code takes the same time. Please note that each loop iteration involves memory accesses to disjoint portions of a large contiguous block of memory. I am using the VS2019 compiler.
I thought it should not matter whether I use
#pragma omp for schedule(static, CHUNKSIZE)
OR
#pragma omp for schedule(static)
I have used values like 5 for CHUNKSIZE.
I am asking this because I see the first variation performing slightly better.
Can someone shed some light?
If you do not specify a chunk
#pragma omp for schedule(static)
OpenMP will:
Divide the loop into equal-sized chunks or as equal as possible in the
case where the number of loop iterations is not evenly divisible by
the number of threads multiplied by the chunk size. By default, chunk
size is loop_count/number_of_threads
Hence, take CHUNKSIZE=5, 2 threads and a loop (to be parallelized) with 22 iterations. Without a chunk size, thread ID=0 will be assigned the iterations {0 to 10} and thread ID=1 the iterations {11 to 21}, i.e. 11 iterations each. However, for:
#pragma omp for schedule(static, CHUNKSIZE)
thread ID=0 will be assigned the iterations {0 to 4}, {10 to 14} and {20 to 21}, whereas thread ID=1 will work with the iterations {5 to 9} and {15 to 19}. Therefore, the first and second threads are assigned 12 and 10 iterations, respectively.
All this to show that having
#pragma omp for schedule(static)
and
#pragma omp for schedule(static, CHUNKSIZE)
is not the same. Different chunk sizes might directly affect load balancing and cache misses, among other things. This is true even if one can:
Assume that each loop iteration of my code takes the same time
Naturally, things get more complicated if each iteration of the loop being parallelized performs a different amount of work. For instance:
for (int i = 0; i < 22; i++)
    for (int j = i + 1; j < 22; j++)
        // do the same work.
With
#pragma omp for schedule(static)
thread ID=0 would execute 176 inner iterations whereas thread ID=1 would execute 55, a load imbalance of 176 - 55 = 121,
whereas with
#pragma omp for schedule(static, CHUNKSIZE)
thread ID=0 would execute 141 inner iterations and thread ID=1 would execute 90, a load imbalance of 141 - 90 = 51.
As you can see, in this case without the chunk one thread performed 121 more inner iterations than the other thread, whereas with a chunk of 5 the difference was reduced to 51.
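If you want to reproduce those per-thread counts yourself, a small test like the following can be used (a sketch; the counts array and the hard-coded 2 threads are only for this illustration, and you can swap the schedule clause to compare both variants):
#include <stdio.h>
#include <omp.h>

int main(void) {
    long counts[2] = {0, 0};                  /* inner iterations executed per thread */
    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        #pragma omp for schedule(static, 5)   /* compare against plain schedule(static) */
        for (int i = 0; i < 22; i++)
            for (int j = i + 1; j < 22; j++)
                counts[tid]++;
    }
    printf("thread 0: %ld, thread 1: %ld\n", counts[0], counts[1]);
    return 0;
}
This should reproduce the 141/90 split with the chunk of 5 and the 176/55 split with plain schedule(static).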
To conclude: it depends on your code, the hardware the code runs on, how you perform the benchmark, how big the time difference is, and so on. The bottom line is: you need to analyze it, look for potential load-balancing problems, measure cache misses, and so on. Profiling is always the answer.
I have the following piece of code that I would like to write in openmp.
My code abstractly looks like the following
I start by dividing the N=100 iterations equally among p=10 pieces, and I store the iterations allocated to each piece in a vector
Nvec[1]={0,1,..,9}
Nvec[2]={10,11,..,19}
Nvec[p]={N-10,..,N-1}
then I loop on the iterations
for(k=0; k<p; k++){                            // loop on each piece of Nvec
    for(j=0; j<2; j++){                        // here is a nested loop
        for(i=Nvec[k][0]; i<=Nvec[k][N/p-1]; i++){
            // then I loop between the first and
            // last value of the array corresponding to piece k
        }
    }
}
Now, as you can see, the code is sequential, with a total of 2*100=200 iterations. I wanted to parallelize it using OpenMP, with the absolute condition of keeping the order of the iterations!
I tried the following
#pragma omp parallel for schedule(static) collapse(2)
for(j=0; j<2; j++){
    for(i=0; i<n; i++){
        // loop code here
    }
}
This setting doesn't keep the order of the iterations of the sequential version.
In the sequential version, each chunk is processed entirely with j=0 and then entirely with j=1.
In my OpenMP version, every thread takes a chunk of iterations and processes it entirely with a single j; in effect, each thread handles only j=0 cases or only j=1 cases. With p=10, every worker processes 200/10=20 iterations, but the problem is that all of them have j=0 or all of them have j=1.
How can I make sure that every thread get a chunk of iterations, performs the loop code with j=0 on all the iterations, then j=1 on the same chunk of iterations?
EDIT
What I want, exactly, for every chunk of 20 iterations:
worker 1
j:0
i:0--->9
j:1
i:0--->9
worker p
j:0
i:90--->99
j:1
i:90--->99
The OpenMP code above does:
worker 1
j:0
i:0--->19
worker p
j:1
i:80--->99
It's actually simple - just make the outer j-loop non-worksharing:
#pragma omp parallel
for (int j = 0; j < 2; j++) {
    #pragma omp for schedule(static)
    for (int i = 0; i < 10; i++) {
        ...
    }
}
If you use the static schedule, OpenMP guarantees that each worker will handle the same range of i values for both j=0 and j=1.
Note: moving the parallel construct out to enclose the j-loop is merely an optimization to avoid thread-management overhead. The code works similarly if you just place a parallel for between the two loops.
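For completeness, a minimal sketch of that alternative form (the loop body here just prints which thread handles each (j, i) pair; all of the j=0 work finishes before any j=1 work starts, because the first parallel region completes before the second one begins):
#include <stdio.h>
#include <omp.h>

int main(void) {
    for (int j = 0; j < 2; j++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < 10; i++)
            printf("j=%d i=%d handled by thread %d\n", j, i, omp_get_thread_num());
    }
    return 0;
}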
I have two versions of code that produce equivalent results, where I am trying to parallelize only the inner loop of a nested for loop. I am not getting much speedup, but I didn't expect a 1-to-1 speedup since I'm only parallelizing the inner loop.
My main question is why these two versions have similar runtimes. Doesn't the second version fork threads only once and avoid the overhead of starting new threads on every iteration over i, as in the first version?
The first version of code starts up threads on every iteration of the outer loop like this:
for (i = 0; i < 2000000; i++) {
    sum = 0;
    #pragma omp parallel for private(j) reduction(+:sum)
    for (j = 0; j < 1000; j++) {
        sum += 1;
    }
    final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.847s
user 0m5.628s
sys 0m0.212s
OMP_NUM_THREADS=4
final=1000
real 0m4.017s
user 0m15.612s
sys 0m0.336s
The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:
#pragma omp parallel private(i,j)
for (i = 0; i < 2000000; i++) {
    sum = 0;
    #pragma omp barrier
    #pragma omp for reduction(+:sum)
    for (j = 0; j < 1000; j++) {
        sum += 1;
    }
    #pragma omp single
    final += sum;
}
printf("final=%d\n",final/2000000);
With this output and runtime:
OMP_NUM_THREADS=1
final=1000
real 0m5.476s
user 0m4.964s
sys 0m0.504s
OMP_NUM_THREADS=4
final=1000
real 0m4.347s
user 0m15.984s
sys 0m1.204s
Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every loop iteration or am I doing something wrong?
An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.
See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
You appear to be running into Amdahl's Law: it bounds the speedup of a parallel program by 1 / ((1 - P) + P/N), where P is the fraction of the work that can be parallelized and N is the number of threads, and on top of that every parallel region adds its own overhead. No matter how much parallelism you put into a program, the serial portion and that overhead limit the speedup you can reach. Parallelism only starts to improve run time when each parallel region contains enough work to compensate for the extra cost of coordinating the processing power.
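To put rough numbers on that, here is a small back-of-the-envelope sketch (the 0.9 parallel fraction is only an illustrative guess, not something measured from the code above):
#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / ((1 - P) + P / N),
   where P is the parallelizable fraction and N the number of threads. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.9;                       /* illustrative parallel fraction */
    for (int n = 1; n <= 8; n *= 2)
        printf("threads=%d  max speedup=%.2f\n", n, amdahl(p, n));
    return 0;
}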
I'm trying to create a program that creates an array and, with OpenMP, assigns values to each position in that array. That would be trivial, except that I want to specify which positions each thread is responsible for.
For example, if I have an array of length 80 and 8 threads, I want to make sure that thread 0 only writes to positions 0-9, thread 1 to 10-19, and so on.
I'm very new to OpenMP, so I tried the following:
#include <omp.h>
#include <stdio.h>
#define N 80
int main (int argc, char *argv[])
{
    int nthreads = 8, tid, i, base, a[N];

    #pragma omp parallel
    {
        tid = omp_get_thread_num();
        base = ((float)tid/(float)nthreads) * N;
        for (i = 0; i < N/nthreads; i++) {
            a[base + i] = 0;
            printf("%d %d\n", tid, base+i);
        }
    }
    return 0;
}
This program, however, doesn't access all positions, as I expected. The output is different every time I run it, and it might be for example:
4 40
5 51
5 52
5 53
5 54
5 55
5 56
5 57
5 58
5 59
5 50
4 40
6 60
6 60
3 30
0 0
1 10
I think I'm missing a directive, but I don't know which one it is.
The way to ensure that things work the way you want is to have a loop of just 8 iterations as the outer (parallel) loop, and have each thread execute an inner loop which accesses just the right elements:
#pragma omp parallel for private(j)
for (i = 0; i < 8; i++) {
    for (j = 0; j < 10; j++) {
        a[10*i+j] = 0;
        printf("thread %d updated element %d\n", omp_get_thread_num(), 10*i+j);
    }
}
I was unable to test this right now, but I'm 90% sure it does exactly what you want (and you have "complete control" over how things work when you do it like this). However, it may not be the most efficient thing to do. For one thing, when you just want to set a bunch of elements to zero, you want to use a built-in function like memset, not a loop...
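For instance, with the int a[N] declared in the question, a single call does the job (a tiny sketch, no OpenMP involved):
#include <string.h>
#include <stdio.h>
#define N 80

int main(void)
{
    int a[N];
    memset(a, 0, sizeof a);                      /* zero all N ints in one call */
    printf("a[0]=%d a[N-1]=%d\n", a[0], a[N-1]); /* both print 0 */
    return 0;
}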
You're missing a fair bit. The directive
#pragma omp parallel
only tells the runtime that the following block of code is to be executed in parallel, essentially by all threads. But it doesn't specify that the work is to be shared out across threads, just that all threads are to execute the block. To share the work, your code will need another directive, something like this:
#pragma omp parallel
{
#pragma omp for
...
It's the for directive which distributes the work across threads.
However, you are making a mistake in the design of your program which is even more serious than your unfamiliarity with the syntax of OpenMP. Manual decomposition of work across threads, as you propose, is just what OpenMP is designed to help programmers avoid. By trying to do the decomposition yourself you are programming against the grain of OpenMP and run two risks:
Of getting things wrong; in particular, of getting wrong things that the compiler and runtime will get right with no effort or thought on your part.
Of carefully crafting a parallel program which runs more slowly than its serial equivalent.
If you want some control over the allocation of work to threads, investigate the schedule clause. I suggest that you start your parallel region with something like this (note that I am fusing the two directives into one statement):
#pragma omp parallel for default(none) shared(a)
for (i = 0; i < N; i++) {
    a[i] = 0;
}
Note also that I have specified the accessibility of the variables. This is good practice, especially when learning OpenMP. The loop variable i is made private automatically (and N is a preprocessor constant, so it needs no clause).
As I have written it, the run-time will divide the iterations over i into chunks, one for each thread. The first thread will get i = 0..(N/num_threads)-1, the second i = N/num_threads..(2*N/num_threads)-1, and so on.
Later you can add a schedule clause explicitly to the directive. What I have written above is equivalent to
#pragma omp parallel for default(none) shared(a) schedule(static)
but you can also experiment with
#pragma omp parallel for default(none) shared(a) schedule(dynamic,chunk_size)
and a number of other options which are well documented in the usual places.
#pragma omp parallel is not enough for the for loop to be parallelized.
Ummm... I noticed that you are actually trying to distribute the work by hand. The reason it does not work is most probably because of race conditions on the variables used to compute the bounds of the for loop.
If I recall correctly, any variables declared outside of the parallel region are shared among threads, so ALL threads write to i, tid and base at once. You could make it work with appropriate private/shared clauses.
However, a better ways is to let OpenMP distribute the work.
This is sufficient:
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = 0;
        printf("%d %d\n", tid, i);
    }
}
Note that private(tid) gives each thread its own copy of tid, so the threads do not overwrite each other's result of omp_get_thread_num(). It is also possible to write shared(a) explicitly, because we want every thread to work on the same copy of the table; that is what happens implicitly here anyway. The iterator of a worksharing for should be private, and the pragma takes care of that even when the variable is declared outside the parallel region, but you could still force it to be shared by hand and mess things up.
EDIT: I noticed the original underlying problem, so I took out the irrelevant parts.
I have the following code:
#pragma omp parallel shared(a,n) private(i,j,k,x,pid,rows,mymin,mymax)
{
    // nprocs=1;
    #ifdef _OPENMP
    nprocs = omp_get_num_threads();
    #endif
    #ifdef _OPENMP
    pid = omp_get_thread_num();
    #endif
    rows = n/nprocs;
    mymin = pid * rows;
    mymax = mymin + rows - 1;
    for (k = 0; k < n; k++) {
        if (k >= mymin && k <= mymax) {
            #pragma omp for schedule(static,rows)
            for (x = k+1; x < n; x++) {
                a[k][x] = a[k][x]/a[k][k];
            }
            #pragma omp barrier
        }
    }
}
Here I am selecting which thread will update which row of the matrix based on the if condition. For example, if there are two threads, thread 1 will update the first two rows of matrix 'a' and thread 2 will update the other two.
After selecting the row, I divide the iterations over the columns of that row by parallelizing the inner loop (where I start for(x=k+1; x<n; x++)) between the two threads. I also put a barrier after the inner for loop so that, once every column value of a single row is updated, the threads are synchronized.
But the problem is that I am not getting properly synchronized values. In the final matrix, some values updated by thread 0 appear in some rows and some by the other thread, but not all of them.
Using omp barrier here is useless, since there is an implicit barrier at the end of an omp for construct unless a nowait clause is specified.
On the other hand, you don't need to manually specify how to decompose the work among threads, and the way you decompose it is not correct.
What you are trying to do can in fact be written as follows:
#pragma omp parallel for shared(a,n) private(k,x)
for (k = 0; k < n; k++) {
    for (x = k+1; x < n; x++) {
        a[k][x] = a[k][x]/a[k][k];
    }
}
Since the workload is not balanced across different values of k, you may want to use a schedule(dynamic, ...) clause as well. Please refer to the OpenMP documentation for more info:
http://msdn.microsoft.com/en-us/library/b5b5b6eb.aspx
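For illustration, the schedule(dynamic, ...) variant mentioned above could look like this (a sketch mirroring the loop above; the chunk size of 16 is just an example value to tune):
#pragma omp parallel for shared(a,n) private(k,x) schedule(dynamic, 16)
for (k = 0; k < n; k++) {
    for (x = k+1; x < n; x++) {
        a[k][x] = a[k][x]/a[k][k];   /* rows with larger k have less work, so hand them out dynamically */
    }
}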