I am implementing an algorithm to compute graph layout using force-directed. I would like to add OpenMP directives to accelerate some loops. After reading some courses, I added some OpenMP directives according to my understanding. The code is compiled, but don’t return the same result as the sequential version.
I wonder if you would be kind enough to look at my code and help me to figure out what is going wrong with my OpenMP version.
Please download the archive here:
http://www.mediafire.com/download/3m42wdiq3v77xbh/drawgraph.zip
As you see, the portion of code which I want to parallelize is:
unsigned long graphLayout(Graph * graph, double * coords, unsigned long maxiter)
Particularly, these two loops which consumes alot of computational resources:
/* compute repulsive forces (electrical: f=-C.K^2/|xi-xj|.Uij) */
for(int j = 0 ; j < graph->nvtxs ; j++) {
if(i == j) continue;
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
// power used for repulsive force model (standard is 1/r, 1/r^2 works well)
// double coef = 0.0; -C*K*K/dist; // power 1/r
double coef = -C*K*K*K/(dist*dist); // power 1/r^2
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
/* compute attractive forces (spring: f=|xi-xj|^2/K.Uij) */
for(int k = graph->xadj[i] ; k < graph->xadj[i+1] ; k++) {
int j = graph->adjncy[k]; /* edge (i,j) */
double * _xj = _position+j*DIM;
double dist = DISTANCE(_xi,_xj);
double coef = dist*dist/K;
for(int d = 0 ; d < DIM ; d++) force[d] += coef*(_xj[d]-_xi[d])/dist;
}
Thank you in advance for any help you can provide!
You have data races in your code, e.g., when doing maxmove = nmove; or energy += nforce2;. In any multi-threaded code, you cannot write into a variable shared by threads until you use an atomic access (#pragma omp atomic read/write/update) or until you synchronize an access to such a variable explicitly (critical sections, locks). Read some tutorial about OpenMP first, there are more problems with your code, e.g.
if(nmove > maxmove) maxmove = nmove;
this line will generally not work even with atomics (you would have to use so-called compare-and-exchange atomic operation to solve this). Much better solution here is to let each thread to calculate its local maximum and then reduce it into a global maximum.
Related
I have the following homework task:
I need to brute force 4-char passphrase with the following mask
%%##
( where # - is a numeric character, % - is an alpha character )
in several threads using OpenMP.
Here is a piece of code, but I'm not sure if it is doing the right thing:
int i, j, m, n;
const char alph[26] = "abcdefghijklmnopqrstuvwxyz";
const char num[10] = "0123456789";
#pragma omp parallel for private(pass) schedule(dynamic) collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++) {
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
/* Working with pass here */
}
So my question is :
How to correctly specify the "parallel for" instruction, in order to split the range of passphrases between several cores?
Help is much appreciated.
Your code is pretty much right, except for using alph instead of num. If you're able to define the pass variable within the loop, that'll save you many a headache.
A full MWE might look like:
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>
int PassCheck(char *pass){
usleep(50); //Sleep for 100 microseconds to simulate work
return strncmp(pass, "qr34", 4)==0;
}
int main(){
const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
const char num[11] = "0123456789";
char goodpass[5] = "----"; //Provide a default password to indicate an error state
int i, j, m, n;
#pragma omp parallel for collapse(4)
for (i = 0; i < 26; i++)
for (j = 0; j < 26; j++)
for (m = 0; m < 10; m++)
for (n = 0; n < 10; n++){
char pass[4];
pass[0] = alph[i];
pass[1] = alph[j];
pass[2] = num[m];
pass[3] = num[n];
if(PassCheck(pass)){
//It is good practice to use `critical` here in case two
//passwords are somehow both valid. This won't arise in
//your code, but is worth thinking about.
#pragma omp critical
{
memcpy(goodpass, pass, 4);
goodpass[4] = '\0';
//#pragma omp cancel for //Escape for loops!
}
}
}
printf("Password was '%s'.\n",goodpass);
return 0;
}
Dynamic scheduling
Using a dynamic schedule here is probably pointless. Your expectation should be that each password will take, on average, about the same amount of time to check. Therefore, each iteration of the loop will take about the same amount of time. Therefore, there is no need to use dynamic scheduling because your loops will remain evenly distributed.
Visual noise
Note that the loop nest is stacked, rather than indented. You'll often see this in code where there are many nested loops as it tends to reduce visual noise.
Breaking early
#pragma omp cancel for is available as of OpenMP 4.0; however, I got a warning using it in this context, so I've commented it out. If you are able to get it working, that'll reduce your run-time by half since all effort is wasted once the correct password has been found and the password will, on average, be located half-way through the search space.
Where the guessed password is generated
One of the commentors suggests moving, e.g. pass[0] so that it is not in the innermost loop. This is a bad idea as doing so will prevent you from using collapse(4). As a result you could parallelize the outer loop, but you run the risk that its iteration count cannot be evenly divided by the number of threads, resulting in a large load imbalance. Alternatively, you could parallelize the inner loop, which exposes you to the same problem plus high synchronization costs each time the loop ends.
Why usleep?
The usleep function causes the code to run slowly. This is intentional; it provides feedback on the effect of parallelism, since the workload is so small.
If I remove the usleep, then the code completes in 0.003s on a single core and 0.004s on 4 cores. You cannot tell that the parallelism is even working. Leaving usleep in gives 8.950s on a single core and 2.257s on 4 cores, an apt demonstration of the effectiveness of the parallelism.
Naturally, you would remove this line once you're sure that parallelism is working correctly.
Further, any actual brute-force password cracker would likely be computing an expensive hash function inside the PassCheck function. Including usleep() here allows us to simulate that function and experiment with high-level design without having to the function first.
I have an application where I need to find the position of peaks in a given set of data. The resolution must be much higher than the spacing between the datapoints (i.e. it is not sufficient to find the highest datapoint, instead a "virtual" peak position has to be estimated given the shape of the peak). A peak is made of about 4 or 5 datapoints. A dataset is acquired every few ms and the peak detection has to be performed in real time.
I compared several methods in LabVIEW and I found the best result (in terms of resolution and speed) is given by the LabVIEW PeakDetector.vi, which scans the dataset with a moving window (>= 3 points width) and for each position performs a quadratic fit. The resulting quadratic function (a parabola) has a local maximum, which is in turn compared to nearby points.
Now I want to implement the same method in C. The polynomial fit is implemented as follows (using Gaussian matrix):
// Fits *y from x_start to (x_start + window) with a parabola and returns x_max and y_max
int polymax(uint16_t * y_data, int x_start, int window, double *x_max, double *y_max)
{
float sum[10],mat[3][4],temp=0,temp1=0,a1,a2,a3;
int i,j;
float x[window];
for(i = 0; i < window; i++)
x[i] = (float)i;
float y[window];
for(i = 0; i < window; i++)
y[i] = (float)(y_data[x_start + i] - y_data[x_start]);
for(i = 0; i < window; i++)
{
temp=temp+x[i];
temp1=temp1+y[i];
}
sum[0]=temp;
sum[1]=temp1;
sum[2]=sum[3]=sum[4]=sum[5]=sum[6]=0;
for(i = 0;i < window;i++)
{
sum[2]=sum[2]+(x[i]*x[i]);
sum[3]=sum[3]+(x[i]*x[i]*x[i]);
sum[4]=sum[4]+(x[i]*x[i]*x[i]*x[i]);
sum[5]=sum[5]+(x[i]*y[i]);
sum[6]=sum[6]+(x[i]*x[i]*y[i]);
}
mat[0][0]=window;
mat[0][1]=mat[1][0]=sum[0];
mat[0][2]=mat[1][2]=mat[2][0]=sum[2];
mat[1][2]=mat[2][3]=sum[3];
mat[2][2]=sum[4];
mat[0][3]=sum[1];
mat[1][3]=sum[5];
mat[2][3]=sum[6];
temp=mat[1][0]/mat[0][0];
temp1=mat[2][0]/mat[0][0];
for(i = 0, j = 0; j < 3 + 1; j++)
{
mat[i+1][j]=mat[i+1][j]-(mat[i][j]*temp);
mat[i+2][j]=mat[i+2][j]-(mat[i][j]*temp1);
}
temp=mat[2][4]/mat[1][5];
temp1=mat[0][6]/mat[1][7];
for(i = 1,j = 0; j < 3 + 1; j++)
{
mat[i+1][j]=mat[i+1][j]-(mat[i][j]*temp);
mat[i-1][j]=mat[i-1][j]-(mat[i][j]*temp1);
}
temp=mat[0][2]/mat[2][2];
temp1=mat[1][2]/mat[2][2];
for(i = 0, j = 0; j < 3 + 1; j++)
{
mat[i][j]=mat[i][j]-(mat[i+2][j]*temp);
mat[i+1][j]=mat[i+1][j]-(mat[i+2][j]*temp1);
}
a3 = mat[2][3]/mat[2][2];
a2 = mat[1][3]/mat[1][8];
a1 = mat[0][3]/mat[0][0];
// zX^2 + yX + x
if (a3 < 0)
{
temp = - a2 / (2*a3);
*x_max = temp + x_start;
*y_max = (a3*temp*temp + a2*temp + a1) + y_data[x_start];
return 0;
}
else
return -1;
}
The scan is performed in an outer function, which calls the above function repeatedly and chooses then the highest local y_max.
The above works and peaks are found. Only the noise is much worse than the LabVIEW counterpart (i.e. I get a very oscillating peak position, given the same input dataset and the same parameters). As the algorithm works the above code should be conceptually correct, so I think it might be a numerical problem as I simply use "floats" without further effort to improve numerical accuracy. Is this a possible answer? Does anyone have a tip, where I should be looking to?
Thanks.
PS: I have done my search and found this very good overview and also this question, similar to mine (unfortunately with not many answers). I will study these further.
EDIT: I have found my problems being elsewhere. Improving the algorithm by removing certain output values (a sort of post-validation in which a result is only accepted if the result is within the moving window) brought the solution to the issue. Now I am satisfied with the results, i.e. they are comparable to those from LabVIEW. Nevertheless, thanks a lot for your comments.
Sorry to be late to the part, but if you have C/C++ it is really easy to port it to C# code using VS2013 Express (free version) and just port that into Labview using the .NET toolset.
I am strugling with quite tricky (I guess) problem with parallel file reading.
Right now I have maped file using mmap and I want it to read values and put into three arrays. Well, maybe explanations isn't so clear, so this is my current code:
int counter = header;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) firstprivate(counter)
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
green[i][j] = map[counter+1];
blue[i][j] = map[counter+2];
counter+=3;
}
}
the header is beginning of a file - just image related info like size and some comments.
The problem here is, that counter has to be private. What I am finding difficult is to come up with dividing the counter across threads with different start numbers.
Can someone give me a hint how it can be achieved?
You could, if I've understood your code correctly, leave counter as a shared variable and enclose the update to it in a single section, something like
blue[i][j] = map[counter+2];
#pragma omp single
{
counter+=3;
}
}
I write something like because I haven't carefully checked the syntax nor considered the usefulness of some of the optional clauses for the single directive. I suspect that this might prove a real drag on performance.
An alternative would be to radically re-order your loop nest, perhaps (and again this is not carefully checked):
#pragma omp parallel shared (red,green,blue,map) private (i,j)
#pragma for collapse(2) schedule(hmmm, this needs some thought)
for(counter = 0; counter < counter_max; counter += 3)
{
for(i = 0; i < x; i++)
{
for(j = 0; j < y; j++)
{
red[i][j] = map[counter];
}
}
}
... loop for blue now ...
... loop for green now ...
Note that in this version OpenMP will collapse the first two loops into one iteration space, I expect this will provide better performance than uncollapsed looping but it's worth experimenting to figure that out. It's probably also worth experimenting with the schedule clause.
If I understood well your problem is that of initializing counter to different values for different threads.
If this is the case, then it can be solved like the linearization of matrix indexes:
size_t getLineStart(size_t ii, size_t start, size_t length) {
return start + ii*length*3;
}
int counter;
#pragma omp parallel for shared (red,green,blue,map) private (i,j) private(counter)
for(size_t ii = 0; ii < x; ii++) {
////////
// Get the starting point for the current iteration
counter = getLineStart(ii,header,y);
////////
for(size_t jj = 0; jj < y; jj++) {
red [i][j] = map[counter ];
green[i][j] = map[counter+1];
blue [i][j] = map[counter+2];
counter+=3;
}
}
These kind of counters are useful when it's hard to know how many values will be counted but then they don't work easily with OpenMP. But in your case it's trivial to figure out what the index is: 3*(y*i+j). I suggest fusing the loop and doing something like this:
#pragma omp parallel for
for(n=0; n<x*y; n++) {
(&red [0][0])[n] = map[3*n+header+0];
(&green[0][0])[n] = map[3*n+header+1];
(&blue [0][0])[n] = map[3*n+header+2];
}
I wanted to optimize below code using openMP
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;
#define _MSE(m, t) \
val = refData[t] - calData[t]; \
m += val*val;
#pragma omp parallel
{
#pragma omp for
for( i=0; i<(width*height)/2; i++ ) { //yuv422: 2 pixels at a time
_MSE(m_u, 0);
_MSE(m_y, 1);
_MSE(m_v, 2);
_MSE(m_y, 3);
#pragma omp reduction(+:refData) reduction(+:calData)
refData += 4;
calData += 4;
// int id = omp_get_thread_num();
//printf("Thread %d performed %d iterations of the loop\n",id ,i);
}
}
Any suggestion welcome for optimizing above code currently I have wrong output.
I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:
double sqError[4] = {0};
const int numBytes = width * height * 2;
#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
for( int i = elem; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
sqError[elem] += (double)(val*val);
}
}
This way, each thread operates exclusively on one thing and there is no contention.
Maybe it's not the most advanced use of OMP, but you should see a speedup.
After your comment about performance hit, I did some experiments and found that indeed the performance was worse. I suspect this may be due to cache misses.
You said:
performance hit this time with openMP : Time :0.040637 with serial
Time :0.018670
So I reworked it using the reduction on each variable and using a single loop:
#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
for( int i = 0; i < numBytes; i += 4 ) {
int val = refData[i] - calData[i];
e0 += (double)(val*val);
val = refData[i+1] - calData[i+1];
e1 += (double)(val*val);
val = refData[i+2] - calData[i+2];
e2 += (double)(val*val);
val = refData[i+3] - calData[i+3];
e3 += (double)(val*val);
}
With my test case on a 4-core machine, I observed a little less than 4-fold improvement:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:
Your first piece of code is a terrible example of what false sharing
does in multithreaded codes. As sqError has only 4 elements of 8 bytes
each, it fits in a single cache line (even in a half cache line on
modern x86 CPUs). With 4 threads constantly writing to neighbouring
elements, this would generate a massive amount of inter-core cache
invalidation due to false sharing. One can get around this by using
instead a structure like this struct _error { double val; double
pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate
cache line, hence no false sharing.
The code looks like it's calculating the MSE but adding to the same sum, m. For parallelism to work properly, you need to eliminate sharing of m, one approach would be preallocating an array (width*height/2 I imagine) just to store the different sums, or ms. Finally, add up all the sums at the end.
Also, test that this is actually faster!
I've recently reading up about OpenMP and was trying to parallelize some existing for loops in my program to get a speed-up. However, for some reason I seem to be getting garbage data written to the file. What I mean by that is I don't have Points 1,2,3,4 etc. written to my file, I have Points 1,4,7,8 etc. I suspect this is because I am not keeping track of the threads and it just leads to race conditions?
I have been reading as much as I can find about OpenMP, since it seems like a great abstraction to do multi-threaded programing. I'd appreciate any pointers please to get to the bottom of what I might be doing incorrectly.
Here is what I have been trying to do so far (only the relevant bit of code):
#include <omp.h>
pixelIncrement = Image.rowinc/2;
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
disparityThresholdValue = row[j];
// Don't want to save certain points
if ( disparityThresholdValue < threshHold)
{
// Get the data points
x = (int)Image.x[k];
y = (int)Image.y[k];
z = (int)Image.z[k];
grayValue= (int)Image.gray[k];
cloudObject->points[k].x = x;
cloudObject->points[k].y = y;
cloudObject->points[k].z = z;
cloudObject->points[k].grayValue = grayValue;
fprintf( cloudPointsFile, "%f %f %f %d\n", x, y, z, grayValue);
}
}
}
fclose( pointFile );
I did enable OpenMP in my Compiler settings (C/C++ -> Language -> Open MP Support (/openmp).
Any suggestions as to what might be the problem? I am using a Quadcore processor on Windows XP 32-bit.
Are all points written to the file, but just not sequentially, or is the actual point data messed up?
The first case is expected in parallel programming - once you execute something side-by-side you wont be able to guarantee order unless you synchronize the access (at which point you can just leave out the parallelization as it becomes effectively linear). If you need to rely on order, you can parallelize any calculations but need to write it down in one thread.
If the points itself are messed up, check where your variables are declared and if multiple threads are accessing the same.
A few problems here:
#pragma omp parallel for
for (int i = 0; i < Image.nrows; i++ )
{
int k =0;
row = Image.data + i * pixelIncrement;
#pragma omp parallel for
for (int j = 0; j < Image.ncols; j++)
{
k++;
There's no need for the inner parallel for. The outer loop should contain enough work to keep all cores busy.
Also, for the inner loop k is a shared variable and gets incremented in a non-atomic way. x, y, z are also shared among the inner loop threads and overwritten "randomly". Remove the inner directive and see how it goes.
When you have a loop with a nested loop there is no need for a second omp pragma.
It will already paralelize the first loop. Remember that this is valid only if the second loop has to be executed in sequence. You have a sequencial incrementation, so you can not execute the second loop in a random order. OMP pragmas are a very easy and cool way to paralelize code but do not use them too much!
More details here -> Parallel Loops with OpenMP