Here's my code, which lets different threads compute parts of a conv2d and merge the results back into the result matrix.
#pragma omp parallel private(tid)
float *gptr;
gptr = malloc(M * M * sizeof(float) / NUMTHREADS);
tid = omp_get_thread_num();
#pragma omp for
for (int i = 0; i < M; i++)
{
for (int j = 0; j < M; j++)
{
float tmp = 0.;
for (int k = 0; k < GW; k++)
{
int ii = i + k - W2;
for (int l = 0; l < GW; l++)
{
int jj = j + l - W2;
if (ii >= 0 && ii < M && jj >= 0 && jj < M)
{
tmp += float_m[k * M + l] * GK[ii * GW + jj];
}
}
}
*(gptr + (i - tid * M / NUMTHREADS) * M + j) = tmp;
}
}
But the #pragma omp parallel private(tid) directive didn't work properly.
It gives the following error message for the float declaration on the next line:
.\omp.c: In function 'main':
.\omp.c:86:5: error: expected expression before 'float'
     float *gptr;
     ^~~~~
Where did this go wrong and how to fix it?
Your parallel region contains more than a single statement, so you have to enclose it in curly braces:
#pragma omp parallel private(tid)
{
//your code
}
UPDATE - a more precise answer with references:
From the OpenMP specification, the syntax of the parallel construct is as follows:
#pragma omp parallel [clause[ [,] clause] ... ] new-line
structured-block
The structured block is:
an executable statement, possibly compound, with a single
entry at the top and a single exit at the bottom, or an OpenMP
construct.
The definition of compound statement:
A compound statement, or block, is a brace-enclosed sequence of
statements and declarations.
In your code
#pragma omp parallel private(tid)
float *gptr;
float *gptr; is a declaration, not an executable statement, a compound statement, or an OpenMP construct, therefore you get the error message. You have to create a compound statement by putting your code between { and }.
I see three problems with your code.
Your immediate problem is that you need curly braces around the material of the parallel region.
Less importantly, consider putting a collapse(2) on the i,j loops.
But most importantly, are you sure that allocating gptr in the parallel region is what you want? It means that each thread creates its own copy, which stays local to the parallel region. You probably want to allocate it outside and pass the pointer in as shared, as in the sketch below.
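A minimal sketch of that structure (assuming float_m is the M x M input image, GK is the GW x GW kernel and W2 = GW/2; note the original snippet indexes these two arrays the other way around) could look like this:

float *gptr = malloc(M * M * sizeof(float)); // one shared result buffer, allocated outside

#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < M; j++)
        {
            float tmp = 0.f;
            for (int k = 0; k < GW; k++)
            {
                int ii = i + k - W2;
                for (int l = 0; l < GW; l++)
                {
                    int jj = j + l - W2;
                    if (ii >= 0 && ii < M && jj >= 0 && jj < M)
                        tmp += float_m[ii * M + jj] * GK[k * GW + l];
                }
            }
            gptr[i * M + j] = tmp; // each (i,j) is written by exactly one thread
        }
    }
}

Since every (i,j) pair belongs to exactly one thread, no synchronization is needed when writing to gptr, and tid is no longer needed at all.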
I have this code that transposes a matrix using a loop tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict-free, as in: every iteration writes to a different location. If two iterations write (potentially) to the same location, you need to (1) use a reduction, (2) use a critical section or other synchronization, (3) decide that this loop is not worth parallelizing, or (4) rewrite your algorithm.
In your case, the write location depends on k and l. Since the write index is k + l*n with k < n, no two pairs (k,l) and (k',l') write to the same location. Furthermore, no two inner iterations have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
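For reference, a fully collapsed version might look like the sketch below. Note that the bounds of the k and l loops depend on i and j, so collapse(4) over this non-rectangular nest requires a compiler with OpenMP 5.0 support; blocksize is assumed to be set to a positive tile size that divides n and m.

#pragma omp parallel for collapse(4)
for (int i = 0; i < n; i += blocksize)
    for (int j = 0; j < m; j += blocksize)
        for (int k = i; k < i + blocksize; ++k)
            for (int l = j; l < j + blocksize; ++l)
                dst[k + l*n] = src[l + k*m];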
You can use the collapse specifier to parallelize over two loops.
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.
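That suggestion could look like this (a sketch, not benchmarked): with l in the middle and k innermost, consecutive inner iterations write dst[k + l*n] with stride 1, while the reads from src become strided instead.

#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize)
    for (int j = 0; j < m; j += blocksize)
        // transpose the block beginning at [i,j], writing sequentially
        for (int l = j; l < j + blocksize; ++l)
            for (int k = i; k < i + blocksize; ++k)
                dst[k + l*n] = src[l + k*m];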
I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple nested loops one can use OpenMP's collapse clause. Bear in mind, however, that:
(As pointed out by Victor Eijkhout.) Even though it does not directly apply to your code snippet, for each additional loop that you parallelize you should typically reason about the new race conditions the parallelization might introduce, for example different threads writing concurrently to the same dst position.
In some cases parallelizing nested loops may result in slower execution than parallelizing a single loop, since the concrete implementation of the collapse clause uses a more complex heuristic than plain loop parallelization to divide the iterations among threads, which can introduce more overhead than the gains it provides.
You should benchmark with a single parallel loop, then with two, and compare the results (a minimal timing sketch is given at the end of this answer).
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
#pragma omp parallel for collapse(...)
for (int i = 0; i < n; i += blocksize)
for (int j = 0; j < m; j += blocksize)
for (int k = i; k < i + blocksize; ++k)
for (int l = j; l < j + blocksize; ++l)
dst[k + l*n] = src[l + k*m];
}
Depending upon the number of threads, cores, size of the matrices, and other factors, it might be that running sequentially is actually faster than the parallel versions. This is especially true for code like yours, which is not very CPU intensive (i.e., dst[k + l*n] = src[l + k*m];).
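A minimal timing sketch for such a comparison, using omp_get_wtime(); the names transposer_collapse1 and transposer_collapse2 are hypothetical copies of transposer() that differ only in the collapse clause (requires <omp.h> and <stdio.h>):

double t0 = omp_get_wtime();
transposer_collapse1(n, m, dst, src); // version with a single parallel loop
double t1 = omp_get_wtime();
transposer_collapse2(n, m, dst, src); // version with collapse(2)
double t2 = omp_get_wtime();
printf("collapse(1): %f s, collapse(2): %f s\n", t1 - t0, t2 - t1);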
I want to parallelize the for loops, but I can't seem to grasp the concept; every time I try to parallelize them, the code still works but slows down dramatically.
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
forces[i][k] += f;
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
}
I tried using synchronization with barrier and critical in some cases, but either nothing happens or the processing simply does not finish.
Update: this is the state I'm at right now. It works without crashes, but calculation times worsen the more threads I add (Ryzen 5 2600, 6 cores / 12 threads).
#pragma omp parallel shared(d,d2,d3,nbodies,rij,pos,cut2,forces) private(i,j,k) num_threads(n)
{
clock_t begin = clock();
#pragma omp for schedule(auto)
for(i=0; i<nbodies; ++i){
for(j=i+1; j<nbodies; ++j) {
d2 = 0.0;
for(k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
d = sqrt(d2);
d3 = d*d2;
#pragma omp parallel for shared(d3) private(k) schedule(auto) num_threads(n)
for(k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
#pragma omp single
printf("Calculation time %lf sec\n",time_spent);
}
I incorporated the timer in the actual parallel code (I think it is a few milliseconds faster this way). Also, I think I got most of the shared and private variables right. The forces are written to an output file.
Using barriers or other synchronization will slow down your code unless the amount of unsynchronized work is larger by a good factor, which is not the case here. You probably need to reformulate your code to remove the synchronization.
You are doing something like an N-body simulation. I've worked out a couple of solutions here: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-examples.html#N-bodyproblems
Also: your d2 loop is a reduction, so you can treat it like that, but it is probably enough if that variable is private to the i,j iterations.
You should always define your variables in their minimal required scope, especially if performance is an issue. (Note that if you do so, your compiler can create more efficient code.) Besides performance, it also helps to avoid data races.
I think you have misplaced a curly brace, and the condition in the first for loop should be i<nbodies-1. The variable ene can be summed using a reduction, and to avoid a data race atomic operations have to be used to update the forces array, so you do not need slow barriers or critical sections. Your code should look something like this (assuming int for indices and double for calculations):
#pragma omp parallel for reduction(+:ene)
for(int i=0; i<nbodies-1; ++i){
for(int j=i+1; j<nbodies; ++j) {
double d2 = 0.0;
double rij[3];
for(int k=0; k<3; ++k) {
rij[k] = pos[i][k] - pos[j][k];
d2 += rij[k]*rij[k];
}
if (d2 <= cut2) {
double d = sqrt(d2);
double d3 = d*d2;
for(int k=0; k<3; ++k) {
double f = -rij[k]/d3;
#pragma omp atomic
forces[i][k] += f;
#pragma omp atomic
forces[j][k] -= f;
}
ene += -1.0/d;
}
}
}
Solved; it turns out all I needed was
#pragma omp parallel for nowait
It doesn't need the atomic either.
It's a weird solution and I don't fully understand how it works, but it does, and the output file has no corrupt results whatsoever.
I am trying to compute the average value over adjacent elements within a matrix, but am stuck getting OpenMP's vectorization to work. As I understand it, for the second nested for-loop the reduction clause should ensure that no race conditions occur when writing to elements of next. However, when compiling the code (I tried auto-vectorization with both GCC 7.3.0 and ICC, using OpenMP > 4.5) I get the report: "error: reduction variable "next" must be shared on entry to this OpenMP pragma". Why does this occur when variables are shared by default? How can I fix this issue, since adding shared(next) does not seem to help?
// CODE ABOVE (...)
size_t const width = 100;
size_t const height = 100;
float * restrict next = malloc(sizeof(float)*width*height);
// INITIALIZATION OF 'next' (this works fine)
#pragma omp for simd collapse(2)
for(size_t j = 1; j < height-1; j++)
for(size_t i = 1; i < width-1; i++)
next[j*width+i] = 0.0f;
// COMPUTE AVERAGE FOR INNER ELEMENTS
#pragma omp for simd collapse(4) reduction(+:next[0:width*height])
for(size_t j = 1; j < height-1; ++j){
for(size_t i = 1; i < width-1; ++i){
// compute average over adjacent elements
for(size_t _j = 0; _j < 3; _j++) {
for(size_t _i = 0; _i < 3; _i++) {
next[j*width + i] += (1.0 / 9.0) * next[(j-1 +_j)*width + (i-1 + _i)];
}
}
}
}
The problem is that GCC 7.3.0 does not support
#pragma omp for simd collapse(4) reduction(+:next[0:width*height])
i.e., the use of reductions over array sections in this context.
This feature is supported from GCC 9 onwards:
Since GCC 9, there is initial OpenMP 5 support (essentially C/C++,
only). GCC 10 added some more features, mainly for C/C++ but also for
Fortran.
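If in doubt about what a given compiler claims to support, one quick sanity check (a sketch) is to print the _OPENMP version macro, whose value encodes the supported specification date (201511 corresponds to OpenMP 4.5, 201811 to OpenMP 5.0):

#include <stdio.h>

int main(void) {
#ifdef _OPENMP
    printf("_OPENMP = %d\n", _OPENMP); // e.g. 201511 = 4.5, 201811 = 5.0
#else
    printf("OpenMP support is not enabled\n");
#endif
    return 0;
}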
How could I parallelize this with OpenMP 3.1? I have tried a collapse, but the compiler says this:
error: initializer expression refers to iteration variable 'k'
for (j = k+1; j < N; ++j){
And when I try a simple parallel for, it is as if the threads sometimes repeat work and sometimes skip it, so the result is sometimes greater and other times smaller.
int N = 100;
int *x;
x = (int*) malloc ((N+1)*sizeof(int));
//... initialization of the array x ...
// ...
for (k = 1; k < N-1; ++k)
{
for (j = k+1; j < N; ++j)
{
s = x[k] + x[j];
if (fn(s) == 1){
count++;
}
    }
}
Count must be 62 but is random
Based on the code snippet that you have provided, and according to the restrictions to nested parallel loops specified by the OpenMP 3.1 standard:
The iteration count for each associated loop is computed before entry to the outermost loop. If execution of any associated loop changes any of the values used to compute any of the iteration counts, then the behavior is unspecified.
Since the iterations of your inner loop depend upon the iterations of your outer loop (i.e., j = k+1) you can not do the following:
#pragma omp parallel for collapse(2) schedule(static, 1) private(j) reduction(+:count)
for (k = 1; k < N-1; ++k)
for (j = k+1; j < N; ++j)
...
Moreover, from the OpenMP 3.1 "Loop Construct" section (relevant to this question) one can read:
for (init-expr; test-expr; incr-expr) structured-block
where init-expr is one of the following:
...
integer-type var = lb
...
and test-expr :
...
var relational-op b
with the restriction that lb and b must be:
Loop invariant expressions of a type compatible with the type of var.
Nonetheless, as kindly pointed out by @Hristo Iliev, "that changed in 5.0 where support for non-rectangular loops was added." As one can read from the OpenMP 5.0 "Loop Construct" section, the restriction on lb and b is now:
Expressions of a type compatible with the type of var that are loop
invariant with respect to the outermost associated loop or are one of
the following (where var-outer, a1, and a2 have a type compatible with
the type of var, var-outer is var from an outer associated loop, and
a1 and a2 are loop invariant integer expressions with respect to the
outermost loop):
...
var-outer + a2
...
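Under an OpenMP 5.0-capable compiler the triangular nest could therefore be collapsed directly; a sketch (assuming fn has no side effects and x, count, N are as in the question):

#pragma omp parallel for collapse(2) reduction(+:count)
for (k = 1; k < N-1; ++k)
    for (j = k+1; j < N; ++j) // lower bound of the allowed form var-outer + a2
        if (fn(x[k] + x[j]) == 1)
            count++;

Both k and j are iteration variables of the collapsed nest, so OpenMP privatizes them implicitly.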
As an alternative to the collapse clause, you can use a normal parallel for on the outer loop. Bear in mind that you had a race condition during the update of the variable count (and of the temporary s), hence the reduction and the private clause below.
#pragma omp parallel for schedule(static, 1) private(j, s) reduction(+:count)
for (k = 1; k < N-1; ++k){
for (j = k+1; j < N; ++j)
{
s = x[k] + x[j];
if (fn(s) == 1){
count++;
}
    }
}
Important note: although k does not have to be explicitly made private, since it is the iteration variable of the parallelized loop and OpenMP implicitly privatizes it, the same does not apply to the variables j and s. Hence, one of the reasons why
Count must be 62 but is random
the other being the lack of the reduction(+:count).
I am doing it like this:
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
int * clusterMemberCount = (int *) calloc (numClusters,sizeof(int));
#pragma omp parallel
{
int ** localClusterMemberCount;
int * activeCluster;
#pragma omp single
{
localClusterMemberCount = (int **) malloc (omp_get_num_threads() * sizeof(int *));
//localClusterMemberCount[0] = (int *) calloc (omp_get_num_threads()*numClusters,sizeof(int));
for (int i = 0; i < omp_get_num_threads(); ++i) {
localClusterMemberCount[i] = calloc (numClusters,sizeof(int));
//localClusterMemberCount[i] = localClusterMemberCount[i-1] + numClusters;
}
activeCluster = (int *) calloc (omp_get_num_threads(),sizeof(int));
}
// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
// which cluster is it in?
activeCluster[omp_get_thread_num()] = clusterAssignmentCurrent[i];
// update count of members in that cluster
++localClusterMemberCount[omp_get_thread_num()][activeCluster[omp_get_thread_num()]];
// sum point coordinates for finding centroid
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[activeCluster[omp_get_thread_num()]*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
// now divide each coordinate sum by number of members to find mean/centroid
// for each cluster
for (int i = 0; i < numClusters; ++i) {
if (localClusterMemberCount[omp_get_thread_num()][i] != 0)
// for each dimension
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[i*numCoords + j] /= localClusterMemberCount[omp_get_thread_num()][i]; /// XXXX will divide by zero here for any empty clusters!
}
// free memory
#pragma omp single
{
free (localClusterMemberCount[0]);
free (localClusterMemberCount);
free (activeCluster);
}
}
    free(clusterMemberCount);
}
But I get the error: Segmentation fault (core dumped), so I am doing something wrong. I think the error is in how I malloc the pointers, because the sequential code works fine. I have also tried the parallel code without mallocs (using global variables with atomic) and that works fine too. The error only appears when I try to create private pointers and malloc them.
Any idea how can I solve it?
Two reasons for the segfault:
localClusterMemberCount should be a shared variable declared outside of the parallel region and initialized within the parallel region by a single thread. Otherwise, each thread has its own copy of the variable, and for all but the thread that has gone through the single section, that copy points to a random memory location.
An implicit or explicit barrier is needed before the section of code where the pointers are freed. All threads need to be done before the memory can be deallocated; otherwise one thread may free pointers still being used by other threads.
There are a few other issues with the code. See below, with my own comments flagged with ***:
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
int * clusterMemberCount = (int *) calloc (numClusters,sizeof(int));
/* ***
* This has to be a shared variable that each thread can access
* If declared inside the parallel region, it will be a thread-local variable
* which is left un-initialized for all but one thread. Further attempts to access
* that variable will lead to segfaults
*/
int ** localClusterMemberCount;
#pragma omp parallel shared(localClusterMemberCount,clusterMemberCount)
{
// *** Make activeCluster a thread-local variable rather than a shared array (shared array will result in false sharing)
int activeCluster;
#pragma omp single
{
localClusterMemberCount = (int **) malloc (omp_get_num_threads() * sizeof(int *));
//localClusterMemberCount[0] = (int *) calloc (omp_get_num_threads()*numClusters,sizeof(int));
for (int i = 0; i < omp_get_num_threads(); ++i) {
localClusterMemberCount[i] = calloc (numClusters,sizeof(int));
//localClusterMemberCount[i] = localClusterMemberCount[i-1] + numClusters;
}
}
// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
// which cluster is it in?
activeCluster = clusterAssignmentCurrent[i];
// update count of members in that cluster
++localClusterMemberCount[omp_get_thread_num()][activeCluster];
// sum point coordinates for finding centroid
// *** This may be slower in parallel because of the atomic operation
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
/* ***
* Missing: one reduction step
* The global cluster member count needs to be updated
* one option is below :
*/
#pragma omp critical
for (int i = 0; i < numClusters; ++i) clusterMemberCount[i] += localClusterMemberCount[omp_get_thread_num()][i];
#pragma omp barrier // wait here before moving on
// *** The code below was wrong; to compute the average, coordinates should be divided by the global count
// *** Successive divisions by local count will fail. E.g., 1/(4+6) is not the same as (1/4)/6
// now divide each coordinate sum by number of members to find mean/centroid
// for each cluster
#pragma omp for
for (int i = 0; i < numClusters; ++i) {
if (clusterMemberCount[i] != 0)
// for each dimension
#pragma omp simd //not sure this will help, the compiler may already vectorize that
for (int j = 0; j < numCoords; ++j)
clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
// *** ^^ atomic is not needed
// *** only one thread will access each value of clusterCentroID
}
#pragma omp barrier
/* ***
* A barrier is needed otherwise the first thread arriving there will start to free the memory
* Other threads may still be in the previous loop attempting to access localClusterMemberCount
* If the pointer has been freed already, this will result in a segfault
*
* With the corrected code, the implicit barrier at the end of the distributed
* for loop would be sufficient. With your initial code, an explicit barrier
* would have been needed.
*/
// free memory
#pragma omp single
{
// *** Need to free all pointers and not only the first one
for (int i = 0; i < omp_get_num_threads(); ++i) free (localClusterMemberCount[i]);
free (localClusterMemberCount);
}
}
    free(clusterMemberCount);
}
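For completeness, a hypothetical driver for the corrected function (the sizes and data are illustrative only): two clusters of 2-D points, four points, alternating assignments.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int numCoords = 2, numObjs = 4, numClusters = 2;
    float dataSetMatrix[] = { 0.f, 0.f,  2.f, 2.f,  4.f, 4.f,  6.f, 6.f };
    int clusterAssignmentCurrent[] = { 0, 1, 0, 1 };
    float *clustersCentroID = calloc(numClusters * numCoords, sizeof(float));

    calculateClusterCentroIDs(numCoords, numObjs, numClusters,
                              dataSetMatrix, clusterAssignmentCurrent,
                              clustersCentroID);

    // expected centroids: cluster 0 -> (2.0, 2.0), cluster 1 -> (4.0, 4.0)
    for (int c = 0; c < numClusters; ++c)
        printf("centroid %d: (%.1f, %.1f)\n", c,
               clustersCentroID[c * numCoords], clustersCentroID[c * numCoords + 1]);

    free(clustersCentroID);
    return 0;
}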