Parallelized execution in nested for loops using Cilk - C

I'm trying to implement a 2D-stencil algorithm that manipulates a matrix. For each field in the matrix, the fields above, below, left and right of it are to be added and divided by 4 in order to calculate the new value. This process may be iterated multiple times for a given matrix.
The program is written in C and compiles with the cilkplus gcc binary.
**Edit:** I figured you might be interested in the compiler flags:
~/cilkplus/bin/gcc -fcilkplus -lcilkrts -pedantic-errors -g -Wall -std=gnu11 -O3 `pkg-config --cflags glib-2.0 gsl` -c -o sal_cilk_tst.o sal_cilk_tst.c
Please note that the real code involves some pointer arithmetic to keep everything consistent. The sequential implementation works. I'm omitting these steps here to enhance understandability.
The pseudocode looks something like this (no edge-case handling):
for(int i = 0; i < iterations; i++){
    for(int j = 0; j < matrix.width; j++){
        for(int k = 0; k < matrix.height; k++){
            result_matrix[j][k] = (matrix[j-1][k] +
                                   matrix[j+1][k] +
                                   matrix[j][k+1] +
                                   matrix[j][k-1]) / 4;
        }
    }
    matrix = result_matrix;
}
The stencil calculation itself is then moved to the function apply_stencil(...)
for(int i = 0; i < iterations; i++){
    for(int j = 0; j < matrix.width; j++){
        for(int k = 0; k < matrix.height; k++){
            apply_stencil(matrix, result_matrix, j, k);
        }
    }
    matrix = result_matrix;
}
and parallelization is attempted:
for(int i = 0; i < iterations; i++){
    for(int j = 0; j < matrix.width; j++){
        cilk_for(int k = 0; k < matrix.height; k++){ /* <--- */
            apply_stencil(matrix, result_matrix, j, k);
        }
    }
    matrix = result_matrix;
}
This version compiles without errors or warnings, but immediately produces a floating point exception when executed. In case you are wondering: it does not matter which of the for loops is made into a cilk_for loop. All configurations (except no cilk_for at all) produce the same error.
The other possible method:
for(int i = 0; i < iterations; i++){
    for(int j = 0; j < matrix.width; j++){
        for(int k = 0; k < matrix.height; k++){
            cilk_spawn apply_stencil(matrix, result_matrix, j, k); /* <--- */
        }
    }
    cilk_sync; /* <--- */
    matrix = result_matrix;
}
This produces 3 warnings when compiled: i, j and k appear to be uninitialized.
When I try to execute it, the function that performs the matrix = result_matrix; step appears to be undefined.
Now for the actual question: Why and how does Cilk break my sequential code; or rather how can I prevent it from doing so?
The actual code is of course available too, should you be interested. However, this project is for a university class and could be plagiarized by other students who find this thread, which is why I would prefer not to share it publicly.
**Update:**
As suggested, I attempted to run the algorithm with only 1 worker thread, effectively making the Cilk implementation sequential. This did, surprisingly enough, work out fine. However, as soon as I change the number of workers to two, the familiar errors return.
I don't think this behavior is caused by race conditions, though. Since the working matrix is swapped only after each iteration and cilk_sync is called, there is effectively no critical section: no thread depends on data written by the others within the same iteration.
The next step I will attempt is to try out other versions of the cilkplus compiler, to see if it's perhaps an error on their side.

With regards to the floating point exception in a cilk_for, there are some issues that have been fixed in some versions of the Cilk Plus runtime. Is it possible that you are using an outdated version?
https://software.intel.com/en-us/forums/intel-cilk-plus/topic/558825
Also, what were the specific warning messages that were produced? There are some "uninitialized variable" warnings that occur with older versions of Cilk Plus GCC, which I thought were spurious warnings.

The Cilk runtime uses a recursive divide and conquer algorithm to parallelize your loop. Essentially, it breaks the range in half, and recursively calls itself twice, spawning half and calling half.
As part of the initialization, it calculates a "grain size", which is the minimum size it will break your range into. By default, that's loopRange/8P, where P is the number of cores.
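If you want to experiment with that setting, the default can be overridden per loop with the grainsize pragma. A minimal sketch (the value 64 and the function are arbitrary, purely for illustration):
#include <cilk/cilk.h>

void scale(double *a, int n)
{
    /* Chunks of at most 64 iterations are handed to workers and not split further. */
    #pragma cilk grainsize = 64
    cilk_for (int i = 0; i < n; i++)
        a[i] *= 0.25;
}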
One interesting experiment would be to set the number of Cilk workers to 1. When you do this, all of the cilk_for mechanism is exercised, but because there's only 1 worker, nothing gets stolen.
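One way to run that experiment, as a sketch: either launch the program with the environment variable CILK_NWORKERS=1, or set the worker count programmatically before the runtime starts:
#include <cilk/cilk_api.h>

int main(void)
{
    /* Must be called before the first Cilk construct spins up the runtime;
       equivalent to running with CILK_NWORKERS=1. */
    __cilkrts_set_param("nworkers", "1");

    /* ... run the stencil code here ... */

    return 0;
}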
Another possibility is to try running your code under Cilkscreen - the Cilk race detector. Unfortunately only the cilkplus branch of GCC generates the annotations that Cilkscreen needs. Your choices are to use the Intel compiler, or to try the cilkplus branch of GCC 4.9. Directions on how to pull down the code and build it are on the cilkplus.org website.

Related

Parallelizing inner loop with residual calculations in OpenMP with SSE vectorization

I'm trying to parallelize the inner loop of a program that has a data dependency (min) outside the scope of the loops. I'm having an issue with the residual calculations that occur outside the scope of the inner j loop. The code gets errors if the "#pragma omp parallel" part is included on the j loop, even if the loop doesn't run at all because the k value is too low, say (1, 2, 3) for example.
for (i = 0; i < 10; i++)
{
    #pragma omp parallel for shared(min) private (j, a, b, storer, arr)
    for (j = 0; j < k-4; j += 4)
    {
        mm_a = _mm_load_ps(&x[j]);
        mm_b = _mm_load_ps(&y[j]);
        mm_a = _mm_add_ps(mm_a, mm_b);
        _mm_store_ps(storer, mm_a);
        #pragma omp critical
        {
            if (storer[0] < min)
            {
                min = storer[0];
            }
            if (storer[1] < min)
            {
                min = storer[1];
            }
            //etc
        }
    }
    do
    {
        #pragma omp critical
        {
            if (x[j]+y[j] < min)
            {
                min = x[j]+y[j];
            }
        }
    } while (j++ < (k - 1));
    round_min = min;
}
The j-based loop is a parallel loop, so you cannot use j after the loop. This is especially true since you explicitly put j as private, so it is only visible locally inside each thread but not outside the parallel region. You can explicitly compute the position of the remaining j value using (k-4+3)/4*4 just after the parallel loop.
Furthermore, here are a few important points:
You may not really need to vectorize the code yourself: you can use an omp simd reduction. OpenMP can do all the boring work of computing the residual calculations for you automatically. Moreover, the code will be portable and much simpler. The generated code may also be faster than yours. Note however that some compilers might not be able to vectorize the code (GCC and ICC do, while Clang and MSVC often need some help).
Critical sections (omp critical) are very costly. In your case this will just annihilate any possible improvement related to the parallel section. The code will likely be slower due to cache-line bouncing.
Reading data written by _mm_store_ps is inefficient here, although some compilers (like GCC) may be able to understand the logic of your code and generate a faster implementation (extracting lane data).
Horizontal SIMD reductions are inefficient. Use vertical ones, which are much faster and can easily be used here.
Here is corrected code taking the above points into account:
for (i = 0; i < 10; i++)
{
    // Assume min is already initialized correctly here

    #pragma omp parallel for simd reduction(min:min) private(j)
    for (j = 0; j < k; ++j)
    {
        const float tmp = x[j] + y[j];
        if(tmp < min)
            min = tmp;
    }

    // Use min here
}
The above code is vectorized correctly on x86 architectures by GCC/ICC (both with -O3 -fopenmp), Clang (with -O3 -fopenmp -ffast-math) and MSVC (with /O2 /fp:precise -openmp:experimental).
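For completeness, here is a rough sketch of the vertical-reduction idea mentioned above, for anyone who still wants to keep manual SSE intrinsics; the array names x, y and the length k are taken from the question, and a real version would of course be tuned further:
#include <float.h>
#include <xmmintrin.h>

float min_of_sums(const float *x, const float *y, int k)
{
    __m128 vmin = _mm_set1_ps(FLT_MAX);   /* per-lane running minima */
    int j;
    for (j = 0; j + 4 <= k; j += 4)
    {
        __m128 sum = _mm_add_ps(_mm_loadu_ps(&x[j]), _mm_loadu_ps(&y[j]));
        vmin = _mm_min_ps(vmin, sum);     /* vertical (lane-wise) minimum */
    }
    float lane[4];
    _mm_storeu_ps(lane, vmin);            /* single horizontal step at the very end */
    float min = lane[0];
    for (int l = 1; l < 4; ++l)
        if (lane[l] < min)
            min = lane[l];
    for (; j < k; ++j)                    /* scalar tail for the residual elements */
    {
        const float tmp = x[j] + y[j];
        if (tmp < min)
            min = tmp;
    }
    return min;
}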

How can I best "parallelise" a set of four nested for()-loops in a Brute-Force attack?

I have the following homework task:
I need to brute-force a 4-char passphrase with the following mask
%%##
(where # is a numeric character and % is an alpha character)
in several threads using OpenMP.
Here is a piece of code, but I'm not sure if it is doing the right thing:
int i, j, m, n;
const char alph[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";

#pragma omp parallel for private(pass) schedule(dynamic) collapse(4)
for (i = 0; i < 26; i++)
    for (j = 0; j < 26; j++)
        for (m = 0; m < 10; m++)
            for (n = 0; n < 10; n++) {
                pass[0] = alph[i];
                pass[1] = alph[j];
                pass[2] = num[m];
                pass[3] = num[n];
                /* Working with pass here */
            }
So my question is:
How to correctly specify the "parallel for" instruction, in order to split the range of passphrases between several cores?
Help is much appreciated.
Your code is pretty much right, except for using alph instead of num. If you're able to define the pass variable within the loop, that'll save you many a headache.
A full MWE might look like:
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>

int PassCheck(char *pass){
  usleep(50); //Sleep for 50 microseconds to simulate work
  return strncmp(pass, "qr34", 4)==0;
}

int main(){
  const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
  const char num[11]  = "0123456789";

  char goodpass[5] = "----"; //Provide a default password to indicate an error state

  int i, j, m, n;

  #pragma omp parallel for collapse(4)
  for (i = 0; i < 26; i++)
  for (j = 0; j < 26; j++)
  for (m = 0; m < 10; m++)
  for (n = 0; n < 10; n++){
    char pass[4];
    pass[0] = alph[i];
    pass[1] = alph[j];
    pass[2] = num[m];
    pass[3] = num[n];
    if(PassCheck(pass)){
      //It is good practice to use `critical` here in case two
      //passwords are somehow both valid. This won't arise in
      //your code, but is worth thinking about.
      #pragma omp critical
      {
        memcpy(goodpass, pass, 4);
        goodpass[4] = '\0';
        //#pragma omp cancel for //Escape for loops!
      }
    }
  }

  printf("Password was '%s'.\n",goodpass);

  return 0;
}
Dynamic scheduling
Using a dynamic schedule here is probably pointless. Your expectation should be that each password takes, on average, about the same amount of time to check, so each iteration of the loop takes about the same amount of time. There is therefore no need for dynamic scheduling; the work will remain evenly distributed with the default schedule.
Visual noise
Note that the loop nest is stacked, rather than indented. You'll often see this in code where there are many nested loops as it tends to reduce visual noise.
Breaking early
#pragma omp cancel for is available as of OpenMP 4.0; however, I got a warning using it in this context, so I've commented it out. If you are able to get it working, it will roughly halve your run-time, since any effort spent after the correct password has been found is wasted, and the password will, on average, be located half-way through the search space.
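For reference, a sketch of what the early exit could look like with OpenMP 4.0 cancellation (not part of the original answer, and untested here). It reuses PassCheck, alph, num and goodpass from the MWE above, and cancellation must also be enabled at run time with OMP_CANCELLATION=true or both pragmas are ignored:
  #pragma omp parallel for collapse(4)
  for (i = 0; i < 26; i++)
  for (j = 0; j < 26; j++)
  for (m = 0; m < 10; m++)
  for (n = 0; n < 10; n++){
    #pragma omp cancellation point for //Threads poll here and stop once cancellation is requested
    char pass[4];
    pass[0] = alph[i];
    pass[1] = alph[j];
    pass[2] = num[m];
    pass[3] = num[n];
    if(PassCheck(pass)){
      #pragma omp critical
      {
        memcpy(goodpass, pass, 4);
        goodpass[4] = '\0';
      }
      #pragma omp cancel for //Request that all threads abandon the search
    }
  }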
Where the guessed password is generated
One of the commenters suggests moving, e.g., pass[0] out of the innermost loop. This is a bad idea, as doing so will prevent you from using collapse(4). As a result you could parallelize the outer loop, but you run the risk that its iteration count cannot be evenly divided by the number of threads, resulting in a large load imbalance. Alternatively, you could parallelize the inner loop, which exposes you to the same problem plus high synchronization costs each time the loop ends.
Why usleep?
The usleep function causes the code to run slowly. This is intentional; it provides feedback on the effect of parallelism, since the workload is so small.
If I remove the usleep, then the code completes in 0.003s on a single core and 0.004s on 4 cores. You cannot tell that the parallelism is even working. Leaving usleep in gives 8.950s on a single core and 2.257s on 4 cores, an apt demonstration of the effectiveness of the parallelism.
Naturally, you would remove this line once you're sure that parallelism is working correctly.
Further, any actual brute-force password cracker would likely be computing an expensive hash function inside the PassCheck function. Including usleep() here allows us to simulate that cost and experiment with the high-level design without having to write that function first.

OpenMP gives (core dumped)

I have a loop which I want to parallelize with OpenMP. When I compile with gcc -o prog prog.c -lm -fopenmp I get no errors, but when I execute it, I get a segmentation fault (core dumped). The problem surely comes from the OpenMP directives, because the program works when I delete the #pragma...
Here is the parallel loop:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
k = 1;
# pragma omp parallel for private(j,jx,jy,r,R,voisin) shared(NTOT,k,i,ix,iy) num_threads(2) schedule(auto)
for(j = 1;j <= NTOT;j++){
    if(j != i){
        jx = (j-1)%ILIGNE+1;
        jy = (j-1)/ICOLONE+1;
        r[k][0] = (jx-ix)*a;
        r[k][1] = (jy-iy)*a;
        R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
        voisin[k] = j;
        k++;
    }
}
I tried changing the stack size to unlimited, but it doesn't fix the problem. Is this a memory leak, a race condition, or something else? Thank you for your help.
As a side note, be careful when you make an array private.
If you allocated it as a static array, e.g. int R[5] or something similar, then that's fine: each thread gets its own personal copy :).
If you malloc it however, e.g. int *R = malloc(5*sizeof(int));, then it will act as a shared array regardless of whether you declare it private (which could potentially lead to undefined behaviour, segfaults, gibberish in the array, etc.).
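A minimal illustration of that distinction (the variable names here are hypothetical): privatizing a pointer privatizes the pointer itself, not the buffer it points to.
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int stack_buf[5] = {0};
    int *heap_buf = malloc(5 * sizeof(int));

    #pragma omp parallel private(stack_buf) firstprivate(heap_buf)
    {
        stack_buf[0] = omp_get_thread_num();  /* fine: each thread has its own 5-int copy */
        heap_buf[0]  = omp_get_thread_num();  /* data race: every private copy of the
                                                 pointer still targets the same heap block */
    }

    free(heap_buf);
    return 0;
}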
I'm not sure what your code does, but I'm pretty sure the OpenMP version is wrong. Indeed, you parallelised over the j loop, but the heart of your algorithm revolves around k, which is loosely derived from j and i (and i is not shown here, BTW).
So when you distribute your j indexes across your OpenMP threads, they all start from a different value of j, but all from the same value of k, which is shared. From there, k is incremented more or less randomly, and the accesses to the various arrays using k are very likely to generate segmentation faults.
Moreover, the arrays r, R and voisin shouldn't be declared private if one wants the parallelisation to have any effect.
Finally, C loops like for(j = 1;j <= NTOT;j++) look utterly suspicious to me for off-by-one accesses... Shouldn't that rather be for(j = 0;j < NTOT;j++)? (Just mentioning this since the initial value of k is 1 as well...)
The bottom line is that you'd probably be better off deriving k from j's value with k = j<i ? j : j-1 instead of trying to increment it inside the loop.
Assuming all the rest is correct, this might be a valid version:
ix = (i-1)%ILIGNE+1;
iy = (i-1)/ILIGNE+1;
# pragma omp parallel for private(j,jx,jy,k) num_threads(2) schedule(auto)
for(j = 1;j <= NTOT;j++){
    if(j != i){
        k = j<i ? j : j-1;
        jx = (j-1)%ILIGNE+1;
        jy = (j-1)/ICOLONE+1;
        r[k][0] = (jx-ix)*a;
        r[k][1] = (jy-iy)*a;
        R[k] = sqrt(pow(r[k][0],2.0)+pow(r[k][1],2.0));
        voisin[k] = j;
    }
}
Still, be careful with the C indexing from 0 to size-1, not from 1 to size...

Segmentation fault right at the end of the program

I have a problem with this code.
It works as expected, except that it gets a segmentation fault right at the end.
Here is the code:
void distribuie(int *nrP, pach *pachet, post *postas) {
    int nrPos, k, i, j;
    nrPos = 0;
    for (k = 0; k < 18; k++)
        pos[k].nrPac = 0;
    for (i = 0; i < *nrP; i++) {
        int distributed = 0;
        for (j = 0; j < nrPos; j++)
            if (pac[i].idCar == pos[j].id) {
                pos[j].vec[pos[j].nrPac] = pac[i].id;
                pos[j].nrPac++;
                distributed = 1;
                break;
            }
        if (distributed == 0) {
            pos[nrPos].id = pac[i].idCar;
            pos[nrPos].vec[0] = pac[i].id;
            pos[nrPos].nrPac = 1;
            nrPos++;
        }
    }
    for (i = 0; i < nrPos; i++) {
        printf("%d %d ", pos[i].id, pos[i].nrPac);
        for (j = 0; j < pos[i].nrPac; j++)
            printf("%d ", pos[i].vec[j]);
        printf("\n");
    }
}
and the function is called from main().
Running with gdb resulted in this error:
Program received signal SIGSEGV, Segmentation fault.
0x00000001 in ?? ()
If gdb can't find the stack trace, it means your code wrote over the stack so thoroughly that neither the normal C runtime nor gdb can find the information about where the function should return on the stack.
Or, in other words, you have a (major) stack overflow.
Somewhere, your code is writing out of bounds of an array. It is curious that the code posted references global variables pos and pac but is passed (unused) variables postas and pachet. It suggests that the code you're showing isn't the code you're executing. However, assuming that pos and pac are really spelled the same as postas and pachet, then it could be that you are mishandling the call to your distribuie() function. (If, as a comment suggests, pos and pac really are global variables, then why does the function get passed postas and pachet?)
Are you getting any compilation warnings? Have you enabled compilation warnings? If you've got GCC, does the code compile cleanly with -Wall? What about with -Wall -Wextra? If you're getting any warnings, fix the causes. Remember, at this stage in your career, it is probable that the C compiler knows more about C than you do.
You can help yourself with the debugging by printing key values (like *nrP) on entry to the function. If that isn't a sane value, you know where to start looking. You might also take a good look at the data for the line:
pos[j].vec[pos[j].nrPac] = pac[i].id;
There is lots of room there for things to go badly astray!
I lack the information to completely help you: I don't know the size of the pos[] array. The loop with k<18 suggests it has 18 elements (but it could be fewer; I simply don't know). Then you start processing *nrP pachets, but you don't check that you process at most 18 of these. If there are more, you overwrite some other memory. Then you want to print the result et voilà: a segmentation fault, meaning some memory got corrupted and is later used by something that thinks it holds a valid pointer, but the pointer is invalid and... bang, segfault.
So the for loop should at least check the bounds (assuming 18):
for (i = 0; i < *nrP && i < 18; i++) {
In the same way, the pos structure apparently has a vec array, but its size is unknown and by the same reasoning it could be 18, less, or more:
pos[j].vec[pos[j].nrPac]
If you add all your bounds checks it will probably run.
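As a sketch of what those guards could look like, assuming hypothetical capacities MAX_POS for pos[] and MAX_VEC for each vec[] (replace them with the real sizes):
for (i = 0; i < *nrP && nrPos < MAX_POS; i++) {
    int distributed = 0;
    for (j = 0; j < nrPos; j++)
        if (pac[i].idCar == pos[j].id) {
            if (pos[j].nrPac < MAX_VEC) {   /* never write past the end of vec[] */
                pos[j].vec[pos[j].nrPac] = pac[i].id;
                pos[j].nrPac++;
            }
            distributed = 1;
            break;
        }
    if (distributed == 0) {
        pos[nrPos].id = pac[i].idCar;
        pos[nrPos].vec[0] = pac[i].id;
        pos[nrPos].nrPac = 1;
        nrPos++;
    }
}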

Parallelizing a for loop in Visual Studio 2010 (OpenMP)

I've recently been reading up on OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have points 1, 2, 3, 4, etc. written to my file; I have points 1, 4, 7, 8, etc. I suspect this is because I am not keeping track of the threads, and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction for multi-threaded programming. I'd appreciate any pointers to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>

pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++)
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
        disparityThresholdValue = row[j];
        // Don't want to save certain points
        if (disparityThresholdValue < threshHold)
        {
            // Get the data points
            x = (int)Image.x[k];
            y = (int)Image.y[k];
            z = (int)Image.z[k];
            grayValue = (int)Image.gray[k];
            cloudObject->points[k].x = x;
            cloudObject->points[k].y = y;
            cloudObject->points[k].z = z;
            cloudObject->points[k].grayValue = grayValue;
            fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
        }
    }
}
fclose( pointFile );
I did enable OpenMP in my compiler settings (C/C++ -> Language -> OpenMP Support (/openmp)).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side-by-side you won't be able to guarantee order unless you synchronize the access (at which point you can just leave out the parallelization, as it becomes effectively linear). If you need to rely on the order, you can still parallelize the calculations, but the writing needs to happen in order, e.g. from a single thread (one way to do this is sketched below).
If the points themselves are messed up, check where your variables are declared and whether multiple threads are accessing the same ones.
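On the first point, one standard way to keep the computation parallel while writing in sequential order is OpenMP's ordered construct; a minimal sketch (not taken from the question's code, so the names here are made up):
#include <stdio.h>

void write_points(FILE *out, const float *vals, int n)
{
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++)
    {
        float v = vals[i] * 0.5f;                /* the per-point work runs in parallel */
        #pragma omp ordered
        fprintf(out, "Point %d: %f\n", i, v);    /* output is emitted in loop order */
    }
}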
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++)
{
    int k = 0;
    row = Image.data + i * pixelIncrement;
    #pragma omp parallel for
    for (int j = 0; j < Image.ncols; j++)
    {
        k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
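A sketch of what that could look like, with one extra assumption spelled out: the index into Image.x/y/z and cloudObject->points is taken as i * Image.ncols + j so that every (i, j) pair touches a distinct slot (the original per-row k counter would make different rows collide). Per-iteration variables (whose types are assumed here) are declared inside the loop so they are private automatically, and the fprintf is serialized because the FILE* is shared:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++)
{
    const unsigned short *row = Image.data + i * pixelIncrement;  /* element type assumed */
    for (int j = 0; j < Image.ncols; j++)
    {
        int disparityThresholdValue = row[j];
        if (disparityThresholdValue < threshHold)
        {
            int k = i * Image.ncols + j;         /* unique index per point (assumption) */
            float x = (float)Image.x[k];
            float y = (float)Image.y[k];
            float z = (float)Image.z[k];
            int grayValue = (int)Image.gray[k];

            cloudObject->points[k].x = x;
            cloudObject->points[k].y = y;
            cloudObject->points[k].z = z;
            cloudObject->points[k].grayValue = grayValue;

            #pragma omp critical
            fprintf(cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
        }
    }
}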
When you have a loop with a nested loop there is no need for a second omp pragma.
It will already parallelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequential incrementation, so you cannot execute the second loop in a random order. OMP pragmas are a very easy and cool way to parallelize code, but do not use them too much!
More details here -> Parallel Loops with OpenMP