OpenMP on outer loop: How to print output after inner loop? - c

I have a program that defines a matrix Pixel, a function iBlend (which I can't see or change) that operates on two cells in Pixel returning a float, and actually just contains the following:
float RowResult[columns];
#pragma omp parallel for
for (int i = 0; i < rows; i++) {
    float sum = 0;
    RowResult[0] = 0;
    for (int j = 1; j < columns; j++) {
        RowResult[j] = iBlend(Pixel[i], Pixel[j], RowResult[j-1]);
        sum += RowResult[j];
    }
    printf("Row %d scored a total of: %f\n", i, sum);
    if (sum > 100) {
        for (int j = 0; j < columns; j++) printf("%f,", RowResult[j]);
        printf("\n");
    }
}
This code goes through a matrix, a row at a time, doing some fancy-pants math with the indices i and j.
Naturally, this leads to gibberish when I try to run it with more than one thread, because the threads call printf on a first-come, first-served basis.
I tried adding a simple #pragma omp single to the second part of the function (creating a block from the first printf until the second-to-last line), but I get the classic error about not being able to nest omp blocks:
> error: work-sharing region may not be closely nested inside of
> work-sharing, ‘critical’, ‘ordered’, ‘master’, explicit ‘task’ or
> ‘taskloop’ region
> #pragma omp single
> ^~~
A few solutions came to mind. First was "just define a big enough matrix to hold all the answers and then loop over them at the end with one thread." But this way I actually run out of memory -- it turns out only a few rows really get printed, so doing it on the fly is totally essential. I also can't change the format of the output, although which row is printed in which order doesn't matter, as long as it's printed row-wise and the pixels are in order per row.
Another solution I thought about was "just flip the two loops and have the omp loop go through the pixels in the other direction, and then when that parallel loop is done I'm in single-threaded land again!" -- but turns out the rows need to have their pixels calculated in order because there's a carried dependency (iBlend needs the result of a previous computation).
At first I thought this was a simple problem, so maybe I'm just missing something obvious. I just want to exit "multithread" mode when I'm printing. Maybe something like "add this result to the stack of answers (treat adding to stack as 'atomic' with OpenMP?)" and then "if I'm the master thread, print everything on the stack so far in order"? But that'd be slow and a little tricky to implement. I'm sure there's an easier way. ...Right?
Thanks for any help.
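For reference, a minimal sketch of one common fix (not tested against the original iBlend; Pixel, rows and columns are assumed to exist exactly as posted): keep the computation parallel, give each thread its own RowResult by declaring it inside the loop, and serialize only the printing with a critical section so each row's output stays contiguous while row order remains arbitrary, which the question allows.

// Sketch only: iBlend, Pixel, rows and columns are assumed as in the question.
#pragma omp parallel for
for (int i = 0; i < rows; i++) {
    float RowResult[columns];   // per-thread buffer, one row at a time
    float sum = 0;
    RowResult[0] = 0;
    for (int j = 1; j < columns; j++) {
        RowResult[j] = iBlend(Pixel[i], Pixel[j], RowResult[j-1]);
        sum += RowResult[j];
    }
    #pragma omp critical        // only the printing is serialized
    {
        printf("Row %d scored a total of: %f\n", i, sum);
        if (sum > 100) {
            for (int j = 0; j < columns; j++) printf("%f,", RowResult[j]);
            printf("\n");
        }
    }
}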

Related

OpenMP in C array reduction / parallelize the code

I have a problem with my code: it should print the number of appearances of a certain number.
I want to parallelize this code with OpenMP, and I tried to use a reduction for arrays, but it obviously didn't work as I wanted.
The error is: "segmentation fault". Should some variables be private? Or is the problem the way I'm trying to use the reduction?
I think each thread should count some part of the array, and then the partial counts should be merged somehow.
#pragma omp parallel for reduction (+: reasult[:i])
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        if (numbers[j] == i) {
            result[i]++;
        }
    }
}
Here N is a big number telling how many numbers I have, numbers is the array of all the numbers, and result is the array holding the count of each number.
First, you have a typo in the name:
#pragma omp parallel for reduction (+: reasult[:i])
It should actually be "result", not "reasult".
Nonetheless, why are you sectioning the array with result[:i]? Based on your code, it seems that you want to reduce the entire array, namely:
#pragma omp parallel for reduction (+: result)
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        if (numbers[j] == i)
            result[i]++;
When the compiler does not support the OpenMP 4.5 array reduction feature, one can alternatively implement the reduction explicitly (check this SO thread to see how).
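For illustration, a minimal sketch of such an explicit reduction (assuming numbers, result, M and N as in the question, and that stdlib.h is included for calloc/free): each thread accumulates counts into its own local array and merges it into result under a critical section.

#pragma omp parallel
{
    /* per-thread histogram, zero-initialised */
    int *local = calloc(M, sizeof *local);

    #pragma omp for
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            if (numbers[j] == i)
                local[i]++;

    /* merge the per-thread counts one thread at a time */
    #pragma omp critical
    for (int i = 0; i < M; i++)
        result[i] += local[i];

    free(local);
}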
As pointed out by @Hristo Iliev in the comments:
> Provided that M * sizeof(result[0]) / #threads is a multiple of the
> cache line size, and even if it isn't when the value of M is large
> enough, there is absolutely no need to involve reduction in the
> process. Unless the program is running on a NUMA system, that is.
Assuming that the aforementioned conditions are met, and if you analyze it carefully, the outermost loop iterations (i.e., over the variable i) are distributed among the threads; since the variable i is used to index the result array, each thread updates a different set of positions of that array. Therefore, you can simplify your code to:
#pragma omp parallel for
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        if (numbers[j] == i)
            result[i]++;

Error in OpenMP, trying to vectorize a Matrix Multiplication for loop

I'm trying to vectorize an old matrix multiplication program I made, specifically this function, using a parallel for directive in OpenMP. I keep getting this error:
matrix_multiply.c(26): error: invalid entity for this variable list in omp clause
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
Any help would be much appreciated, as I've tried looking up the error and can't find any documentation that was helpful. I'm compiling with ICC, if that makes a difference.
void matrix_mult(int * matrix_A, int * matrix_B, int n)
{
    #pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            int sum = 0;
            for (int k = 0; k < n; k++)
            {
                int index_a = i * n + k;
                int index_b = j + k * n;
                sum += matrix_A[index_a] * matrix_A[index_b];
            }
            matrix_B[i * n + j] = sum;
        }
    }
}
There are two things worth mentioning here:
What you're actually doing here isn't vectorizing (although your compiler might be doing that for you); it is parallelizing. Here you're creating threads to split the work among. Each thread may or may not use the CPU's vector units to speed the computations up even more, but that has nothing to do with the parallelization directives you've put in.
The error the compiler reports only says that it doesn't know the variables you've listed in your private clause. Indeed, if you look closer, none of i, j, k, and sum has been declared before the directive line. So for the compiler, they don't exist (yet). As a matter of fact, since you only declare them when you need them (which is very good), which is inside the parallel region, you don't have to declare them private anyway, since they are already private to the thread in which they are created. So just removing the private clause should fix your issue.
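For illustration, the directive would then simply read as follows (a sketch of that fix; everything else in the function stays exactly as posted):

/* i, j, k and sum are declared inside the parallel region, so they
   are automatically private to each thread; no private clause needed */
#pragma omp parallel for schedule(static) default(shared)
for (int i = 0; i < n; i++)
{
    /* ... same body as above ... */
}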
Finally, if performance matters to you, rather than trying to parallelize or vectorize this code yourself, consider replacing it with a call to an efficient library that will do the job for you. Unfortunately, since you're dealing with integers, BLAS won't do. But I'm sure there are good options out there.

How to declare arrays in omp pragma

I'm modifying an existing library from single-threaded to multi-threaded. I have code like the one provided below. I can't understand how to declare the arrays x, y, array1 and array2. Which of them should I declare as shared or threadprivate? Do I need to use flush? If yes, in which case?
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    y[i] = i*i - 3*i - 10*random();
    x[i] = myfunc(i, y[i]);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    array1[array2[j]] += z[j] + j;
    return array1[j];
}
The problem I see in your code is in this line
array1[array2[j]] += z[j]+j;
This means that array1 can potentially be modified through any value of the index j. And j, in the context of the function myfunc(), corresponds to the index i at the upper level. The trouble is that i is the index upon which the loop is parallelised; therefore, array1 can be modified concurrently at any moment by any thread.
The crucial question now is to know whether array2 can hold the same value for different indexes:
If you are sure that for whatever j1 != j2 you have array2[j1] != array2[j2], then your code is trivially parallelisable.
If there are values j1 != j2 for which you have array2[j1] == array2[j2], then you have dependencies across iterations for array1 and the code is no longer (simply and/or effectively) parallelisable.
So let's assume we are in the former case; then the OpenMP directives you already have in the code are sufficient:
i needs to be private, but it implicitly already is, since it is the index of the parallelised loop;
x and y should be shared (which they are by default) since their access index is the one that is distributed in parallel (namely i), so their parallel updates do not overlap;
array2 is only accessed in read mode, so it's a no-brainer shared (which it is by default, again);
array1 is read and written, but due to our initial assumption, there are no possible collisions between threads since their sets of indexes for accessing it are disjoint. Therefore, the default shared qualifier just works fine.
But now, if we are in the case where array2 allows for non-disjoint sets of indexes for accessing array1, we will have to preserve the ordering of these accesses/updates of array1. This can be done with the ordered clause/directive. And since we still want the parallelisation to be (somewhat) effective, we will have to add a schedule(static,1) clause to the parallel directive. For more details about this, please refer to this great answer. Your code would now look like this:
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];
#pragma omp parallel for schedule(static,1) ordered
for (i = 0; i < 100; i++)
{
    y[i] = i*i - 3*i - 10*random();
    x[i] = myfunc(i, y[i]);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    int tmp = z[j] + j;
    #pragma omp ordered
    array1[array2[j]] += tmp;
    return array1[j];
}
This would (I think) work and, in terms of parallelism, not be too bad (for a limited number of threads), but it has a big (enormous) flaw: it generates tons of false sharing while updating x and y. Therefore, it might be more advantageous to use per-thread copies of these and to only update the global arrays at the end. The central part of the code snippet would then look something like this (not tested at all):
int nbth;
#pragma omp parallel
#pragma omp single
nbth = omp_get_num_threads();

/* calloc so the entries a thread never writes are zero when the
   per-thread copies are summed up below */
int *xm = calloc(1000000*nbth, sizeof(int));
int *ym = calloc(1000000*nbth, sizeof(int));
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int *xx = xm + 1000000*tid;
    int *yy = ym + 1000000*tid;
    #pragma omp for schedule(static,1) ordered
    for (i = 0; i < 100; i++)
    {
        yy[i] = i*i - 3*i - 10*random();
        xx[i] = myfunc(i, y[i]);
    }
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        int j;
        x[i] = 0;
        y[i] = 0;
        for (j = 0; j < nbth; j++)
        {
            x[i] += xm[j*1000000 + i];
            y[i] += ym[j*1000000 + i];
        }
    }
}
free(xm);
free(ym);
This will avoid the false sharing, but will increase the number of memory accesses and the overhead of parallelisation. So it might not be very beneficial after all. You'll have to see it for yourself in your actual code.
BTW, the fact that i only loops until 100 looks suspicious to me when the corresponding arrays are declared to be 1000000 long. If 100 is truly the correct size for the loop, then probably the parallelisation isn't worth it anyway...
EDIT:
As Jim Cownie pointed out in a comment, I missed the call to random() as a source of dependency across iterations, preventing proper parallelisation. I'm not sure how relevant this is in the context of your actual code (I doubt you truly fill your y array with random data), but if you do, you'll have to change this part in order to do it in parallel (otherwise, the serialisation needed to generate the random number series will just kill whatever gain you get from the parallelisation). But generating non-correlated pseudo-random series in parallel is not as simple as it sounds. You can use rand_r() instead of random() as a thread-safe alternative for the RNG and initialise its seed per thread to different values. However, you cannot be sure that one thread's series won't collide with another thread's too soon (with one thread starting to generate the very same series as another after a while, messing up your expected asymptotic behaviour).
As I'm pretty sure you're not truly interested in that, I won't develop it any further (this is a whole question by itself); I will just use the (not so good) rand_r() trick. If you want more details on a possible alternative for generating good parallel random series, just ask another question.
In the case where no problem comes from array2 (disjoint sets of indexes), the code would become:
// global variable
unsigned int seed;
#pragma omp threadprivate(seed)

// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread

// then the parallelised loop
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    y[i] = i*i - 3*i - 10*rand_r(&seed);
    x[i] = myfunc(i, y[i]);
}
Then the other case would have to use the same trick in addition to what has already been described. But again, keep in mind that this isn't good enough for serious RNG-based computation (like Monte-Carlo methods). It does the job if all you want is to generate some values for testing purposes, but it won't pass any serious statistical quality test.

C code inside a loop not being executed

Background: the overall program is designed to carry out 2D DIC between a reference image and 1800 target images (for tomographic reconstruction). In my code, there is this for loop block:
for (k=0; k<kmax; k++)
{
    K=nm12+(k*(h-n+1))/(kmax-1);
    printf("\nk=%d\nL= ", K);
    for (l=0; l<lmax; l++)
    {
        ///For each subset, calculate and store its mean and standard deviation.
        ///Also want to know the sum and sum of squares of subset, but in two sections, stored in fm/df[k][l][0 and 1].
        L=nm12+(l*(w-n+1))/(lmax-1);
        printf("%d ", L);
        fm[k][l][0]=0;
        df[k][l][0]=0;
        fm[k][l][1]=0;
        df[k][l][1]=0;
        ///loops are j then i as it is more efficient (saves m-1 recalculations of b=j+L)
        for (j=0; j<m; j++)
        {
            b=j+L;
            for (i=0; i<M; i++)
            {
                a=i+K;
                fm[k][l][0]+=ref[a][b];
                df[k][l][0]+=ref[a][b]*ref[a][b];
            }
            for (i=M; i<m; i++)
            {
                a=i+K;
                fm[k][l][1]+=ref[a][b];
                df[k][l][1]+=ref[a][b]*ref[a][b];
            }
        }
        fm[k][l][2]=m2r*(fm[k][l][1]+fm[k][l][0]);
        df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]);
        a+=1;
    }
}
Each time l reaches 10, the line df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]); appears to no longer be executed. By this I mean the debugger shows that the value of df[k][l][2] is not changed from zero to the correct sum. Also, df[k][l][0 and 1] remain fixed regardless of k and l, as long as l>=10.
kmax=15, lmax=20, n=121, m=21, M=(3*m)/4=15, nm12=(n-m+1)/2=50.
The arrays fm and df are double arrays, declared double fm[kmax][lmax][3], df[kmax][lmax][3];
Also, the line a+=1; is just there to be used as a breakpoint to check the value of df[k][l][2], and has no effect on the code's functionality.
Any help as to why this is happening, how to fix it, etc. will be much appreciated!
EDIT: MORE INFO.
The array ref (containing the reference image pixel values) is a dynamic array, with memory allocated using malloc, in this code block:
double **dark, **flat, **ref, **target, **target2, ***gm, ***dg;
dark=(double**)malloc(h * sizeof(double*));
flat=(double**)malloc(h * sizeof(double*));
ref=(double**)malloc(h * sizeof(double*));
target=(double**)malloc(h * sizeof(double*));
target2=(double**)malloc(h * sizeof(double*));
size_t wd=w*sizeof(double);
for (a=0; a<h; a++)
{
    dark[a]=(double*)malloc(wd);
    flat[a]=(double*)malloc(wd);
    ref[a]=(double*)malloc(wd);
    target[a]=(double*)malloc(wd);
    target2[a]=(double*)malloc(wd);
}
where h=1040 and w=1388 are the dimensions of the image.
You don't mention much about which compiler, IDE or framework you're using. But a way to isolate the problem is to create a new, small (console) project containing only the snippet you've posted. This way you'll eliminate most kinds of input/thread/stack/memory/compiler etc. issues.
And if that doesn't reveal the problem, the project will be small enough to post as a whole sample here on Stack Overflow, for us to take apart and ponder.
Ergo you should create a self contained unit test for your algorithm.
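As an illustration of that advice, here is a sketch of such a self-contained test. The constants are taken from the question; m2 and m2r are not defined in the question, so the values below are placeholders, and the reference image is filled with a constant so the expected sums are easy to predict by hand.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const int kmax = 15, lmax = 20, n = 121, m = 21, M = (3*m)/4;
    const int nm12 = (n - m + 1)/2, h = 1040, w = 1388;
    const double m2 = 1.0/(m*m), m2r = 1.0/(m*m);  /* placeholders: not given in the question */
    int i, j, k, l, a, b, K, L;

    /* constant reference image: every subset sum is then known in advance */
    double **ref = malloc(h * sizeof *ref);
    for (a = 0; a < h; a++)
    {
        ref[a] = malloc(w * sizeof **ref);
        for (b = 0; b < w; b++)
            ref[a][b] = 1.0;
    }

    double fm[kmax][lmax][3], df[kmax][lmax][3];

    /* ... paste the k/l loop from the question here ... */

    /* check one of the values that "stops being computed" in the full program */
    printf("df[0][10][2] = %f\n", df[0][10][2]);
    return 0;
}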

Optimising and why openmp is much slower than sequential way?

I am a newbie at programming with OpenMP. I wrote a simple C program to multiply a matrix with a vector. Unfortunately, by comparing execution times I found that the OpenMP version is much slower than the sequential one.
Here is my code (here the matrix is N*N int, the vector is N int, and the result is N long long):
#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for (i = 0; i < m_size; i++)
{
    for (j = 0; j < m_size; j++)
    {
        result[i] += matrix[i][j] * vector[j];
    }
}
And this is the code for sequential way:
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];
When I tried these two implementations with a 999x999 matrix and a 999 vector, the execution time is:
Sequential: 5439 ms
Parallel: 11120 ms
I really cannot understand why OpenMP is so much slower than the sequential algorithm (over 2 times slower!). Can anyone solve my problem?
Your code partially suffers from the so-called false sharing, typical for all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores and they have to refresh their copy from the upper level cache or from the main memory. As result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, and besides result[i] there are 7 other array elements in the same cache line. Therefore it is possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in a cache line. For example, you can apply schedule(static,something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads, you would apply the schedule(static,256) clause to the parallel for construct.
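For illustration, a sketch of that clause applied to the loop from the question (only the schedule clause is new; i, j and the arrays are as declared in the original code):

/* blocks of 256 iterations: 256 is a multiple of the 8 long long
   elements that share a 64-byte cache line, so two threads do not
   write to the same cache line of result[] */
#pragma omp parallel for schedule(static,256) private(i,j) shared(matrix,vector,result,m_size)
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        result[i] += matrix[i][j] * vector[j];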
Another partial reason for the code to run slower might be that, when OpenMP is enabled, the compiler might become reluctant to apply some code optimisations when shared variables are being assigned to. OpenMP provides the so-called relaxed memory model, where the local memory view of a shared variable is allowed to differ in each thread, and the flush construct is provided in order to synchronise the views. But compilers usually see shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and the value of result[i] is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result from the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporary desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary variable trick with OpenMP both enabled and disabled.
Because when OpenMP distributes the work among threads there is a lot of administration/synchronisation going on to ensure the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only: humans see that easily, your compiler may not.
Things to try out for pedagogic reasons:
0) What happens if matrix and vector are not shared?
1) Parallelize the inner "j-loop" first, keep the outer "i-loop" serial. See what happens.
2) Do not collect the sum in result[i], but in a variable temp and assign its contents to result[i] only after the inner loop is finished to avoid repeated index lookups. Don't forget to init temp to 0 before the inner loop starts.
I did this in reference to Hristo's comment. I tried using schedule(static, 256). For me, changing the default chunk size does not help; maybe it even makes things worse. I printed out the thread number and its index, with and without setting the schedule, and it's clear that OpenMP already chooses the thread indices to be far from one another, so false sharing does not seem to be an issue. For me, this code already gives a good boost with OpenMP.
#include "stdio.h"
#include <omp.h>
void loop_parallel(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
#pragma omp parallel for schedule(static, 250)
//#pragma omp parallel for
for (int i=0;i<m_size;i++) {
//printf("%d %d\n", omp_get_thread_num(), i);
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
void loop(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
for (int i=0;i<m_size;i++) {
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
int main() {
const int m_size = 1000;
int *matrix = new int[m_size*m_size];
int *vector = new int[m_size];
long long*result = new long long[m_size];
double dtime;
dtime = omp_get_wtime();
loop(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
dtime = omp_get_wtime();
loop_parallel(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
}
