OpenMP collapsed parallel loop with reduction - c

I'm trying to parallelize these nested loops with OpenMP's collapse clause, but this is what I got:
"smooth.c:47:6: error: not enough perfectly nested loops before ‘sum’ sum = 0;"
Does anybody know a good way to parallelize this? I've been stuck on this problem for 2 days.
Here are my loops:
long long int sum;
#pragma omp parallel for collapse(3) default(none) shared(DY, DX) private(dx, dy) reduction(+:sum)
for (y = 0; y < height; y++) {
    for (x = 0; x < width; x++) {
        sum = 0;
        for (d = 0; d < 9; d++) {
            dx = x + DX[d];
            dy = y + DY[d];
            if (dx >= 0 && dx < width && dy >= 0 && dy < height)
                sum += image(dy, dx);
        }
        smooth(y, x) = sum / 9;
    }
}
Full code:
https://github.com/fernandesbreno/smooth_

I'm trying to parallelize these nested loops with OpenMP's collapse clause, but this is what I got: "smooth.c:47:6: error: not enough perfectly nested loops before ‘sum’ sum = 0;"
You cannot collapse three loop levels because the third level is not perfectly nested inside the second. There is
sum = 0;
before it and
smooth(y, x) = sum / 9;
after it in the middle loop. (I suppose smooth() is a macro, else the assignment doesn't make sense. Don't do that, though, because it's confusing.)
Consider how you would rewrite that loop nest into an equivalent single loop by hand, using your knowledge of the problem structure and details. I submit that it would be challenging to do so, and that the result would furthermore have unavoidable data dependencies. But if you managed to do it without introducing dependencies, then voila! You have a single flat loop to parallelize, no collapsing needed.
Your simplest way forward, however, would probably be to collapse only two levels instead of three. Moreover, you want to compare with not collapsing at all, as it's not at all clear that collapsing will yield an improvement vs. parallelizing only the outer loop, and collapsing might even be worse.
But if you must have OpenMP collapse all three levels of the nest, then you need to take the two lines I called out above, and lift them out of the loop nest. Possibly you could do that in part by getting rid of sum altogether and working directly with the result raster. Again, this is not necessarily going to produce an improvement.
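A minimal sketch of the two-level variant suggested above, not the asker's actual program. The function name smooth_image, the flat unsigned char rasters, and the DX/DY offset tables are assumptions for illustration; they stand in for the image() and smooth() macros. Because sum is declared inside the loop body, each iteration gets its own copy and no reduction clause is needed.

```c
#include <stddef.h>

/* Hypothetical 3x3 neighbourhood offsets, standing in for the asker's DX/DY. */
static const int DX[9] = {-1, 0, 1, -1, 0, 1, -1, 0, 1};
static const int DY[9] = {-1, -1, -1, 0, 0, 0, 1, 1, 1};

/* Box-smooths `image` into `smooth`; both are row-major width*height rasters.
 * Only the y and x loops are collapsed: they are perfectly nested, whereas
 * the d loop has statements before and after it. */
void smooth_image(const unsigned char *image, unsigned char *smooth,
                  int width, int height)
{
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            long long sum = 0;  /* declared here, so private to each iteration */
            for (int d = 0; d < 9; d++) {
                int dx = x + DX[d];
                int dy = y + DY[d];
                if (dx >= 0 && dx < width && dy >= 0 && dy < height)
                    sum += image[(size_t)dy * width + dx];
            }
            smooth[(size_t)y * width + x] = (unsigned char)(sum / 9);
        }
    }
}
```

Dropping collapse(2) leaves a plain parallel outer loop, which makes it easy to time both variants against each other.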

Related

How to loop through blocks of pixels with minimum number of for loops

I have an image of width * height pixels in which I want to loop through blocks of pixels, say with a block size of 10 * 10. How can I do this with the minimum number of loops?
I tried first looping over the columns and then over the rows, taking the starting x and y position of each block from these two outer loops. The inner loops then run from the block's start position across the block size and manipulate the pixels. This takes four nested loops.
for (int i = 0; i < Width; i+=Block_Size) {
    for (int j = 0; j < Height; j+=Block_Size) {
        for (int x = i; x < i + Block_Size; x++) {
            for (int y = j; y < j + Block_Size; y++) {
                //Get pixel values within the block
            }
        }
    }
}
How can I do this with the minimum number of loops?
You can reduce the number of loops by completely unrolling as many loop levels as you like. For fixed raster dimensions, you could unroll them all, yielding a (probably lengthy) implementation with zero loops. For known Block_Size you can unroll one or both of the inner loops regardless of whether the overall dimensions are known, yielding as few as two loops remaining.
But why would you consider such a thing? The question seems to assume that there would be some kind of inherent advantage to reducing the depth of loop nest, but that's not necessarily true, and whatever effect there might be is likely to be small.
I'm inclined to guess that you've studied a bit of computational complexity theory, and taken away the idea that deep loop nests necessarily yield poorly-scaling performance, or even that deep loop nests have inherently poor performance, period. These are misconceptions, albeit relatively common ones, and they anyway look at the problem backwards.
The primary consideration in how the performance of your loop nest scales is how many times the body of the innermost loop,
//Get pixel values within the block
, is executed. You'll have roughly the same performance for any reasonable approach that causes it to be executed exactly once for every pixel in the raster, regardless of how many loops are involved. With that being the case, code clarity should be your goal, and your original four-loop nest is pretty clear.
It is possible to achieve this with three loops, but to do that you need to store where each block of pixels starts and how many blocks there are in total.
Independent of that, both the width and the height of the image have to be multiples of your Block_Size.
Here is how it is possible with three loops:
int numberOfBlocks = (Width / Block_Size) * (Height / Block_Size);
/* Top-left corner of every block; both arrays must be filled in beforehand. */
int blockStartX[numberOfBlocks];
int blockStartY[numberOfBlocks];
for (int i = 0; i < numberOfBlocks; i++) {
    for (int x = blockStartX[i]; x < blockStartX[i] + Block_Size; x++) {
        for (int y = blockStartY[i]; y < blockStartY[i] + Block_Size; y++) {
            // Get pixel data at (x, y)
        }
    }
}
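As a sketch of the three-loop idea without stored starting points: a block's top-left corner can be recomputed from its index. The function sum_blocks, its parameters, and the per-block sum are all hypothetical, chosen only to give the loop body something to do. It assumes width and height are multiples of block_size.

```c
#include <stddef.h>

/* Sums the pixels of each block_size x block_size block of a row-major
 * raster. Assumes width and height are multiples of block_size, and that
 * block_sums has room for (width/block_size)*(height/block_size) entries. */
void sum_blocks(const unsigned char *pixels, int width, int height,
                int block_size, long *block_sums)
{
    int blocks_per_row = width / block_size;
    int number_of_blocks = blocks_per_row * (height / block_size);

    for (int b = 0; b < number_of_blocks; b++) {
        /* Recover the block's top-left corner from its index. */
        int start_x = (b % blocks_per_row) * block_size;
        int start_y = (b / blocks_per_row) * block_size;
        long sum = 0;
        for (int y = start_y; y < start_y + block_size; y++)
            for (int x = start_x; x < start_x + block_size; x++)
                sum += pixels[(size_t)y * width + x];
        block_sums[b] = sum;
    }
}
```

The innermost statement still runs exactly once per pixel, so this should cost about the same as the four-loop version; the only difference is how the block origin is obtained.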

How to parallelize nested (dependent) for loop in C, OpenMP

I'm having trouble parallelizing a block of C code.
The block is something like this:
for(n = 0; n < N; n++)
{
    for(x = 0; x < X; x++)
    {
        var_1[x] = var_1[x] * 3 * var_2[x];
        var_2[x] = var_2[x] * 2;
    }
    for(y = 0; y < Y; y++)
    {
        var_3[y] = var_1[y] * var_2[y];
    }
}
That's not the actual code (it's for an assignment so I can't post the source code) but the problem is that the variables lie in a nested loop, and each iteration is dependent upon the previous.
Simply adding a #pragma omp for in front of the outer loop doesn't work because each thread's work begins at the initial values of var_1, var_2 and var_3.
Please let me know if I can explain any better! I'm quite lost.
As HighPerformanceMark comments, the inner loop has no loop-carried dependency. This invites vectorisation first, parallelisation second, at least to me. More importantly, the cheapest computation is the one you never do in the first place. As it is written, var_3[] is never read in either loop. You can remove the second inner loop, and only compute var_3[] after the last iteration of the outer loop.
I suspect further performance gains are entirely academic. It seems to me that this algorithm will very quickly overflow (if used on integers), or lose precision (if used on floats), so the maximum number of iterations has to be small.
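One way the restructuring described above might look, as a sketch only. The function name update and its signature are hypothetical. It assumes N >= 1 and Y <= X (var_3[y] reads var_1[y] and var_2[y]). Since no element x depends on any other, the x loop can become the parallel outer loop, with the sequential n iterations running inside it.

```c
/* Hypothetical restructuring of the question's loop nest.
 * Assumes N >= 1 and Y <= X. */
void update(double *var_1, double *var_2, double *var_3, int N, int X, int Y)
{
    /* Each x is independent, so threads split the x range and every
     * thread runs the n iterations for its own elements in order. */
    #pragma omp parallel for
    for (int x = 0; x < X; x++) {
        for (int n = 0; n < N; n++) {
            var_1[x] = var_1[x] * 3 * var_2[x];
            var_2[x] = var_2[x] * 2;
        }
    }

    /* var_3 is overwritten on every outer iteration in the original, so
     * only the values from the last iteration survive; compute them once. */
    #pragma omp parallel for
    for (int y = 0; y < Y; y++)
        var_3[y] = var_1[y] * var_2[y];
}
```

The per-element operations happen in the same order as in the original nest, so the results should match the serial version exactly.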

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient outer-product type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i], and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to vector-vector outer product or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
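/* CACHE_SIZE is the number of Y elements per block, chosen to fit in cache. */
for (jj = 0; jj < M; jj += CACHE_SIZE) { // Iterate over blocks
    int jend = (jj + CACHE_SIZE < M) ? jj + CACHE_SIZE : M; // Clamp the last block
    for (i = 0; i < N; i++) {
        for (j = jj; j < jend; j++) { // Iterate within block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}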
The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).

What sort of indexing method can I use to store the distances between X^2 vectors in an array without redundancy?

I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors, and finds the distance between each one, meaning it runs the distance function X^4 times, even though (I think) there are only (X^2)/2 unique distances.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for(int i = 0; i < MATRIX_WIDTH; i++)
{
    for(int j = 0; j < MATRIX_WIDTH; j++)
    {
        float xd, yd;
        float distance;
        for(int k = 0; k < MATRIX_WIDTH; k++)
        {
            for(int l = 0; l < MATRIX_WIDTH; l++)
            {
                int index_a = (i * MATRIX_WIDTH) + j;
                int index_b = (k * MATRIX_WIDTH) + l;
                xd = matrix[index_a][0] - matrix[index_b][0];
                yd = matrix[index_a][1] - matrix[index_b][1];
                distance = sqrtf(powf(xd, 2) + powf(yd, 2));
            }
        }
        // More code that uses the distances between each vector
    }
}
What I'd like to do is create and populate an array of (X^2) / 2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
performance ideas:
a) if possible work with the squared distance, to avoid root calculation
b) never use pow for constant, integer powers - instead use xd*xd
I would consider changing your algorithm - O(n^4) is really bad. When dealing with interactions in physics (also O(n^4) for distances in 2d field) one would implement b-trees etc and neglect particle interactions with a low impact. But it will depend on what "more code that uses the distance..." really does.
After some thought: the number of unique distances is 0.5*n*(n+1) with n = w*h (this count includes each vector's zero distance to itself).
If you write down when unique distances occur, you will see that both inner loops can be reduced, by starting at i and j.
Additionally if you only need to access those distances via the matrix index, you can set up a 4D-distance matrix.
If memory is limited we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line index to avoid summing up on every access:
float distanceArray[(H*W+1)*H*W/2];
int lineIndices[H*W]; /* lineIndices[j] = j*(j+1)/2, the offset of row j */
float searchDistance(int i, int j)
{
    return i<j ? distanceArray[i+lineIndices[j]] : distanceArray[j+lineIndices[i]];
}
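A sketch of how the triangular table might be filled and indexed, assuming the question's MATRIX_WIDTH and vec2_t. The names N_VECS, dist2, tri_index and fill_dist2 are hypothetical. It stores squared distances (suggestion a) and computes the row offset in closed form instead of through a precomputed lineIndices array.

```c
#include <stddef.h>

#define MATRIX_WIDTH 8
#define N_VECS (MATRIX_WIDTH * MATRIX_WIDTH)

typedef float vec2_t[2];

/* One slot per unordered pair (i, j) with i <= j, diagonal included. */
static float dist2[N_VECS * (N_VECS + 1) / 2];

/* Offset of pair (i, j) in the packed triangle; argument order is irrelevant. */
static size_t tri_index(size_t i, size_t j)
{
    size_t lo = i < j ? i : j;
    size_t hi = i < j ? j : i;
    return hi * (hi + 1) / 2 + lo;
}

/* Fills the table with squared distances, visiting each pair exactly once. */
static void fill_dist2(vec2_t vecs[])
{
    for (size_t j = 0; j < N_VECS; j++) {
        for (size_t i = 0; i <= j; i++) {
            float xd = vecs[i][0] - vecs[j][0];
            float yd = vecs[i][1] - vecs[j][1];
            dist2[tri_index(i, j)] = xd * xd + yd * yd;
        }
    }
}
```

A lookup is then dist2[tri_index(a, b)] for any two vector indices a and b. Take sqrtf of the result only where the true distance is actually needed.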

optimizing the nested for loop in c

I have the following piece of c code,
double findIntraClustSimFullCoverage(cluster * pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;
    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;
        for (j = 0; j < 10000; j++)
        {
            perElemSimilarity += arr[i][j];
        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}
NOTE: arr is a matrix of size 10000 X 10000
This is a portion of a GA code, hence this nested for loop runs many times.
This affects the performance of the code, i.e. it takes a hell of a lot of time to give the results.
I profiled the code using valgrind / kcachegrind.
This indicated that 70 % of the process execution time was spent in running this nested for loop.
The register variables i and j, do not seem to be stored in register values (profiling with and without "register" keyword indicated this)
I simply can not find a way to optimize this nested for loop portion of code (as it is very simple and straight forward).
Please help me in optimizing this portion of code.
I'm assuming that you change the arr matrix frequently, else you could just compute the sum (see Lucian's answer) once and remember it.
You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has (likely) been changed, you can store a 'sum' value somewhere, and have every piece of code that updates the matrix update the stored sum appropriately. For instance, assuming you start with an array of all zeros:
double arr[10000][10000];
< initialize it to all zeros >
double sum = 0;
// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];
// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];
You might want to completely re-calculate the sum from time to time to avoid accumulation of errors.
If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.
It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way. That is to say that &arr[i][j+1] - &arr[i][j] == 1, i.e. consecutive elements of a row are exactly sizeof(double) bytes apart.
You may also try to unroll your inner loop, in case your compiler does not already do it. Your code :
for (j = 0; j < 10000; j++)
{
    perElemSimilarity += arr[i][j];
}
Would for example become :
for (j = 0; j < 10000; j+=10)
{
    perElemSimilarity += arr[i][j+0];
    perElemSimilarity += arr[i][j+1];
    perElemSimilarity += arr[i][j+2];
    perElemSimilarity += arr[i][j+3];
    perElemSimilarity += arr[i][j+4];
    perElemSimilarity += arr[i][j+5];
    perElemSimilarity += arr[i][j+6];
    perElemSimilarity += arr[i][j+7];
    perElemSimilarity += arr[i][j+8];
    perElemSimilarity += arr[i][j+9];
}
These are the basic ideas, difficult to say more without knowing your platform, compiler, looking at the generated assembly code.
You might want to take a look at this presentation for more complete examples of optimization opportunities.
If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation across multiple threads.
Another step would be to try with OpenMP, something along the following (untested) :
/* j is declared outside the loop, so it must be made private as well */
#pragma omp parallel for private(j, perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    /* INSERT INNER LOOP HERE */
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
But note that even if you bring this portion of code to 0% (which is impossible) of your execution time, your GA algorithm will still take hours to run. Your performance bottleneck is elsewhere now that this portion of code takes 'only' 22% of your running time.
I might be wrong here, but isn't the following equivalent:
for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / ( pCluster->size * pCluster->size ) );
The register keyword is an optimizer hint, if the optimizer doesn't think the register is well spent there, it won't be.
Is the matrix well packed, i.e. is it a contiguous block of memory?
Is 'j' the minor index (i.e. are you going from one element to the next in memory), or are you jumping from one element to that plus 10000?
Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop only depends on the row/column that j traverses, so calculating it lazily and storing it for future reference will make a big difference
The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values, that's 800 MB. Whatever you do is limited by the time it takes to read 800 MB of data.
On the other hand, are you also writing 10,000 x 10,000 values each time this is called? If not, you could for example store the sum of each row together with a boolean flag per row indicating that the row's sum needs to be recalculated, which is set each time you change an element of that row. Or you could even update the sum for a row each time an array element is changed.
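The cached row sums described above might be sketched as follows. The small ROWS/COLS, the arrays row_sum and row_dirty, and the helpers set_cell and total_sum are all hypothetical stand-ins for the question's 10000 x 10000 arr. The sketch assumes every write to the matrix goes through set_cell.

```c
#include <stdbool.h>
#include <stddef.h>

#define ROWS 4   /* small stand-in for the question's 10000 */
#define COLS 4

static double arr[ROWS][COLS];
static double row_sum[ROWS];   /* cached sum of each row */
static bool   row_dirty[ROWS]; /* true when row_sum[i] is stale */

/* All writes to arr go through here so the cache knows what changed. */
static void set_cell(size_t i, size_t j, double value)
{
    arr[i][j] = value;
    row_dirty[i] = true;
}

/* Re-sums only the rows that changed since the last call. */
static double total_sum(void)
{
    double total = 0;
    for (size_t i = 0; i < ROWS; i++) {
        if (row_dirty[i]) {
            double s = 0;
            for (size_t j = 0; j < COLS; j++)
                s += arr[i][j];
            row_sum[i] = s;
            row_dirty[i] = false;
        }
        total += row_sum[i];
    }
    return total;
}
```

Re-summing a dirty row from scratch, rather than adjusting it incrementally, also avoids the slow build-up of floating-point error mentioned earlier.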
