OpenMP gives segmentation fault (core dumped) - C

I have a loop which I want to parallelize with OpenMP. When I compile with gcc -o prog prog.c -lm -fopenmp I get no errors, but when I execute the program I get a segmentation fault (core dumped). The problem surely comes from the OpenMP directives, because the program works when I delete the #pragma...
Here is the parallel loop:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
k = 1;
# pragma omp parallel for private(j,jx,jy,r,R,voisin) shared(NTOT,k,i,ix,iy) num_threads(2) schedule(auto)
for(j = 1;j <= NTOT;j++){
if(j != i){
jx = (j-1)%ILIGNE+1;
jy = (j-1)/ICOLONE+1;
r[k][0] = (jx-ix)*a;
r[k][1] = (jy-iy)*a;
R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
voisin[k] = j;
k++;
}
}
I tried changing the stack size to unlimited, but that doesn't fix the problem. Please tell me whether this is a memory leak, a race condition, or something else. Thank you for your help.

As a side note, be careful when you make an array private.
If you declared it as a fixed-size array, e.g.
int R[5];
or something similar, then that's fine: each thread gets its own personal copy :).
If you malloc it however, e.g.:
int *R = malloc(5*sizeof(int));
then the underlying memory will act as a shared array regardless of whether you declare the pointer private (which could potentially lead to undefined behaviour, segfaults, gibberish in the array, etc.).
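A minimal sketch of that pitfall (not the poster's code; N and the loop body are placeholders): privatising a malloc'd pointer only privatises the pointer, not the heap block it points to, so with firstprivate every thread still writes to the same memory. Allocating inside the parallel region gives each thread a genuinely private buffer.

#include <stdlib.h>

void example(int N)
{
    int *buf = malloc(N * sizeof(int));        /* one heap block */

    #pragma omp parallel firstprivate(buf)
    {
        /* every thread's copy of 'buf' points to the SAME block: writes to it race */
        int *mine = malloc(N * sizeof(int));   /* one block per thread: truly private */
        for (int n = 0; n < N; n++)
            mine[n] = n;                       /* safe, no sharing */
        free(mine);
    }

    free(buf);
}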

I'm not sure what your code does, but I'm pretty sure the OpenMP version is wrong. Indeed, you parallelised over the j loop, but the heart of your algorithm revolves around k, which is loosely derived from j and i (the latter not shown here, by the way).
So when you distribute your j indexes across your OpenMP threads, they all start from a different value of j but from the same value of k, which is shared. From there, k is incremented quite randomly, and the accesses to the various arrays using k are very likely to generate segmentation faults.
Moreover, the arrays r, R and voisin shouldn't be declared private if one wants the parallelisation to have any effect.
Finally, C loops like for(j = 1;j <= NTOT;j++) look utterly suspicious to me for off-by-one accesses... Shouldn't that rather be for(j = 0;j < NTOT;j++)? (Just mentioning this since the initial value of k is 1 as well...)
Bottom line: you'd probably better derive k from j's value with k = j<i ? j : j-1 instead of incrementing it inside the loop.
Assuming all the rest is correct, this might be a valid version:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
# pragma omp parallel for private(j,jx,jy,k) num_threads(2) schedule(auto)
for(j = 1;j <= NTOT;j++){
if(j != i){
k = j<i ? j : j-1;
jx = (j-1)%ILIGNE+1;
jy = (j-1)/ICOLONE+1;
r[k][0] = (jx-ix)*a;
r[k][1] = (jy-iy)*a;
R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
voisin[k] = j;
}
}
Still, be careful with the C indexing from 0 to size-1, not from 1 to size...

Related

Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelize the inner loop of a program that has data dependencies (min) outside the scope of the loops. I'm having an issue where the residual calculations occur outside the scope of the inner j loop. The code gets errors if the "#pragma omp parallel" part is included on the j loop, even if that loop doesn't run at all because k is too small (say 1, 2 or 3, for example).
for (i = 0; i < 10; i++)
{
#pragma omp parallel for shared(min) private (j, a, b, storer, arr) //
for (j = 0; j < k-4; j += 4)
{
mm_a = _mm_load_ps(&x[j]);
mm_b = _mm_load_ps(&y[j]);
mm_a = _mm_add_ps(mm_a, mm_b);
_mm_store_ps(storer, mm_a);
#pragma omp critical
{
if (storer[0] < min)
{
min = storer[0];
}
if (storer[1] < min)
{
min = storer[1];
}
//etc
}
}
do
{
#pragma omp critical
{
if (x[j]+y[j] < min)
{
min = x[j]+y[j];
}
}
} while (j++ < (k - 1));
round_min = min;
}
The j-based loop is a parallel loop, so you cannot use j after the loop. This is especially true since you explicitly put j in a private clause, so it is only visible locally in each thread, not after the parallel region. You can explicitly compute the position of the remaining j value using (k-4+3)/4*4 just after the parallel loop.
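A rough sketch of that idea, keeping the original loop structure (the vector body, the min handling and the critical sections are elided; this only shows how to recover j after the parallel loop):

int j;
#pragma omp parallel for private(j)   /* j is private, its value is lost after the loop */
for (j = 0; j < k - 4; j += 4)
{
    /* ... vectorized body ... */
}
j = (k - 4 + 3) / 4 * 4;              /* first index the vectorized loop did not cover */
for (; j < k; j++)
{
    /* ... scalar residual body ... */
}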
Furthermore, here are a few important points:
You may not really need to vectorize the code yourself: you can use an omp simd reduction. OpenMP can do all the boring work of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may also be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case this will just annihilate any possible improvement from the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be used here.
Here is a corrected code taking into account the above points:
for (i = 0; i < 10; i++)
{
// Assume min is already initialized correctly here
#pragma omp parallel for simd reduction(min:min) private(j)
for (j = 0; j < k; ++j)
{
const float tmp = x[j] + y[j];
if(tmp < min)
min = tmp;
}
// Use min here
}
The above code is vectorized correctly on x86 architectures by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).

OpenMP: Parallelize for loops inside a while loop

I am implementing a force-directed algorithm to compute a graph layout. I would like to add OpenMP directives to accelerate some loops. After reading some courses, I added some OpenMP directives according to my understanding. The code compiles, but doesn't return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
In particular, these two loops consume a lot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
if(i == j) continue;
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
// power used for repulsive force model (standard is 1/r, 1/r^2 works well)
// double coef = 0.0; -C*K*K/dist; // power 1/r
double coef = -C*K*K*K/(dist*dist); // power 1/r^2
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
int j = graph->adjncy[k]; /* edge (i,j) */
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
double coef = dist*dist/K;
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g. when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write into a variable shared by threads unless you use atomic accesses (#pragma omp atomic read/write/update) or synchronize access to that variable explicitly (critical sections, locks). Read a tutorial about OpenMP first; there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use a so-called compare-and-exchange atomic operation to solve this). A much better solution here is to let each thread calculate its local maximum and then reduce it into a global maximum.
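A minimal sketch of that approach using OpenMP's built-in reductions (max reductions need OpenMP 3.1 or later; n, compute_move and compute_force2 are hypothetical stand-ins for the real per-vertex work):

double maxmove = 0.0, energy = 0.0;
#pragma omp parallel for reduction(max:maxmove) reduction(+:energy)
for (int i = 0; i < n; i++) {
    double nmove   = compute_move(i);     /* hypothetical per-vertex displacement */
    double nforce2 = compute_force2(i);   /* hypothetical per-vertex force norm   */
    if (nmove > maxmove) maxmove = nmove; /* updates this thread's private copy   */
    energy += nforce2;                    /* combined across threads at the end   */
}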

Parallelized execution in nested for using Cilk

I'm trying to implement a 2D-stencil algorithm that manipulates a matrix. For each field in the matrix, the fields above, below, left and right of it are to be added and divided by 4 in order to calculate the new value. This process may be iterated multiple times for a given matrix.
The program is written in C and compiles with the cilkplus gcc binary.
Edit: I figured you might be interested in the compiler flags:
~/cilkplus/bin/gcc -fcilkplus -lcilkrts -pedantic-errors -g -Wall -std=gnu11 -O3 `pkg-config --cflags glib-2.0 gsl` -c -o sal_cilk_tst.o sal_cilk_tst.c
Please note that the real code involves some pointer arithmetic to keep everything consistent. The sequential implementation works. I'm omitting these steps here to enhance understandability.
The pseudocode looks something like this (no edge-case handling):
for(int i = 0; i < iterations; i++){
for(int j = 0; j < matrix.width; j++){
for(int k = 0; k < matrix.height; k++){
result_matrix[j][k] = (matrix[j-1][k] +
matrix[j+1][k] +
matrix[j] [k+1] +
matrix[j] [k-1]) / 4;
}
}
matrix = result_matrix;
}
The stencil calculation itself is then moved to the function apply_stencil(...)
for(int i = 0; i < iterations; i++){
for(int j = 0; j < matrix.width; j++){
for(int k = 0; k < matrix.height; k++){
apply_stencil(matrix, result_matrix, j, k);
}
}
matrix = result_matrix;
}
and parallelization is attempted:
for(int i = 0; i < iterations; i++){
for(int j = 0; j < matrix.width; j++){
cilk_for(int k = 0; k < matrix.height; k++){ /* <--- */
apply_stencil(matrix, result_matrix, j, k);
}
}
matrix = result_matrix;
}
This version compiles without errors or warnings, but straight out produces a floating point exception when executed. In case you are wondering: it does not matter which of the for loops is turned into a cilk_for loop. All configurations (except no cilk_for at all) produce the same error.
The other possible method:
for(int i = 0; i < iterations; i++){
for(int j = 0; j < matrix.width; j++){
for(int k = 0; k < matrix.height; k++){
cilk_spawn apply_stencil(matrix, result_matrix, j, k); /* <--- */
}
}
cilk_sync; /* <--- */
matrix = result_matrix;
}
This produces 3 warnings when compiled: i, j and k appear to be uninitialized.
When trying to execute it, the function which performs the matrix = result_matrix; step appears to be undefined.
Now for the actual question: Why and how does Cilk break my sequential code; or rather how can I prevent it from doing so?
The actual code is of course available too, should you be interested. However, this project is for a university class and therefore subject to plagiarism by other students who find this thread, which is why I would prefer not to share it publicly.
UPDATE:
As suggested, I attempted to run the algorithm with only 1 worker thread, effectively making the Cilk implementation sequential. This did, surprisingly enough, work out fine. However, as soon as I change the number of workers to two, the familiar errors return.
I don't think this behavior is caused by race conditions, though. Since the working matrix is only changed after each iteration and cilk_sync is called, there is effectively no critical section; no thread depends on data written by the others in the same iteration.
The next step I will attempt is to try other versions of the cilkplus compiler, to see if it is maybe an error on their side.
With regards to the floating point exception in a cilk_for, there are some issues that have been fixed in some versions of the Cilk Plus runtime. Is it possible that you are using an outdated version?
https://software.intel.com/en-us/forums/intel-cilk-plus/topic/558825
Also, what were the specific warning messages that were produced? There are some "uninitialized variable" warnings that occur with older versions of Cilk Plus GCC, which I believe were spurious warnings.
The Cilk runtime uses a recursive divide and conquer algorithm to parallelize your loop. Essentially, it breaks the range in half, and recursively calls itself twice, spawning half and calling half.
As part of the initialization, it calculates a "grain size", which is the minimum size it will break your range into. By default, that's loopRange/8P, where P is the number of cores.
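If you want to experiment with that value, Cilk Plus lets you override the default per loop; a sketch (the constant 64 is arbitrary):

#pragma cilk grainsize = 64
cilk_for (int k = 0; k < matrix.height; k++) {
    apply_stencil(matrix, result_matrix, j, k);
}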
One interesting experiment would be to set the number of Cilk workers to 1. When you do this, all of the cilk_for machinery is exercised, but because there's only 1 worker, nothing gets stolen.
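For reference, two standard Cilk Plus ways to pin the worker count to 1, shown as a sketch:

/* From the shell: */
/*   CILK_NWORKERS=1 ./your_program */

/* Or programmatically, before the first Cilk construct runs: */
#include <cilk/cilk_api.h>
__cilkrts_set_param("nworkers", "1");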
Another possibility is to try running your code under Cilkscreen - the Cilk race detector. Unfortunately only the cilkplus branch of GCC generates the annotations that Cilkscreen needs. Your choices are to use the Intel compiler, or to try the cilkplus branch of GCC 4.9. Directions on how to pull down the code and build it are at the cilkplus.org website.

OpenMP - make array counter with a different start number per thread

I am struggling with a (I guess) quite tricky problem with parallel file reading.
Right now I have a file mapped using mmap, and I want to read its values into three arrays. Well, maybe that explanation isn't so clear, so here is my current code:
int counter = header;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) firstprivate(counter)
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
green[i][j] = map[counter+1];
blue[i][j] = map[counter+2];
counter+=3;
}
}
The header is at the beginning of the file - just image-related info like the size and some comments.
The problem here is that counter has to be private. What I am finding difficult is coming up with a way of dividing the counter across threads, each with a different start number.
Can someone give me a hint as to how this can be achieved?
You could, if I've understood your code correctly, leave counter as a shared variable and enclose the update to it in a single section, something like
blue[i][j] = map[counter+2];
#pragma omp single
{
counter+=3;
}
}
I write "something like" because I haven't carefully checked the syntax, nor considered the usefulness of some of the optional clauses for the single directive. I suspect that this might prove a real drag on performance.
An alternative would be to radically re-order your loop nest, perhaps (and again this is not carefully checked):
#pragma omp parallel shared (red,green,blue,map) private (i,j)
#pragma omp for collapse(2) schedule(hmmm, this needs some thought)
for(counter = 0; counter < counter_max; counter += 3)
{
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
}
}
}
... loop for blue now ...
... loop for green now ...
Note that in this version OpenMP will collapse the first two loops into one iteration space. I expect this will provide better performance than uncollapsed looping, but it's worth experimenting to find out. It's probably also worth experimenting with the schedule clause.
If I understood well, your problem is that of initializing counter to different values for different threads.
If this is the case, then it can be solved like the linearization of matrix indexes:
size_t getLineStart(size_t ii, size_t start, size_t length) {
return start + ii*length*3;
}
int counter;
#pragma omp parallel for shared (red,green,blue,map) private(counter)
for(size_t ii = 0; ii < x; ii++) {
////////
// Get the starting point for the current iteration
counter = getLineStart(ii,header,y);
////////
for(size_t jj = 0; jj < y; jj++) {
red  [ii][jj] = map[counter  ];
green[ii][jj] = map[counter+1];
blue [ii][jj] = map[counter+2];
counter+=3;
}
}
These kinds of counters are useful when it's hard to know in advance how many values will be counted, but they don't work easily with OpenMP. In your case, however, it's trivial to figure out what the index is: 3*(y*i+j). I suggest fusing the loops and doing something like this:
#pragma omp parallel for
for(n=0; n<x*y; n++) {
(&red [0][0])[n] = map[3*n+header+0];
(&green[0][0])[n] = map[3*n+header+1];
(&blue [0][0])[n] = map[3*n+header+2];
}

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently been reading up on OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is that I don't have points 1, 2, 3, 4, etc. written to my file; I have points 1, 4, 7, 8, etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> OpenMP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a quad-core processor on Windows XP 32-bit.
Are all points written to the file, just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side by side you won't be able to guarantee order unless you synchronize the accesses (at which point you can just leave out the parallelization, as it becomes effectively linear). If you need to rely on order, you can parallelize the calculations, but you need to write the results from one thread.
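A sketch of that pattern (the Point type, points array, npoints and compute_point are placeholders, not the asker's actual structures): do the per-point work in a parallel loop that fills an array, then write the file from a single serial loop so the output order is deterministic.

typedef struct { float x, y, z; int gray; } Point;  /* placeholder type */

#pragma omp parallel for
for (int n = 0; n < npoints; n++) {
    points[n] = compute_point(n);                   /* hypothetical per-point calculation */
}

for (int n = 0; n < npoints; n++) {                 /* single thread: file order preserved */
    fprintf(cloudPointsFile, "%f %f %f %d\n",
            points[n].x, points[n].y, points[n].z, points[n].gray);
}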
If the point data itself is messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for; the outer loop should contain enough work to keep all cores busy.
Also, in the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y and z are also shared among the inner-loop threads and get overwritten "randomly". Remove the inner directive and see how it goes.
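A rough sketch of what that might look like (a sketch only, reusing the question's variable names; here k is derived from i and j, which is one plausible reading of the intended indexing, so each thread writes distinct elements, and the fprintf is wrapped in a critical section, which will limit the speed-up):

#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++)
{
    /* row is declared inside the loop so each thread gets its own copy
       (its element type is not shown in the question; adjust as needed) */
    const short *row = Image.data + i * pixelIncrement;
    for (int j = 0; j < Image.ncols; j++)
    {
        if (row[j] < threshHold)
        {
            int k = i * Image.ncols + j;   /* derived from i and j, not incremented */
            float x = Image.x[k];
            float y = Image.y[k];
            float z = Image.z[k];
            int grayValue = (int)Image.gray[k];
            cloudObject->points[k].x = x;
            cloudObject->points[k].y = y;
            cloudObject->points[k].z = z;
            cloudObject->points[k].grayValue = grayValue;
            #pragma omp critical
            fprintf(cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
        }
    }
}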
When you have a loop with a nested loop, there is no need for a second omp pragma.
It will already parallelize the first loop. Remember that this is only valid if the second loop has to be executed in sequence. You have a sequential incrementation, so you cannot execute the second loop in a random order. OMP pragmas are a very easy and cool way to parallelize code, but do not use them too much!
More details here -> Parallel Loops with OpenMP
