OpenMP and 17 Nested For-Loops - C

I have a giant nested for-loop, designed to set a large array to its default value. I'm trying to use OpenMP for the first time to parallelize it, and have no idea where to begin. I have been reading tutorials, and am afraid the process will be performed independently on each of N cores, instead of the N cores dividing the work among themselves toward a common output. The code is in C, compiled in Visual Studio v14. Any help for this newbie is appreciated -- thanks!
(Attached below is the monster nested for-loop...)
for (j = 0; j < box1; j++)
{
  for (k = 0; k < box2; k++)
  {
    for (l = 0; l < box3; l++)
    {
      for (m = 0; m < box4; m++)
      {
        for (x = 0; x < box5; x++)
        {
          for (y = 0; y < box6; y++)
          {
            for (xa = 0; xa < box7; xa++)
            {
              for (xb = 0; xb < box8; xb++)
              {
                for (nb = 0; nb < memvara; nb++)
                {
                  for (na = 0; na < memvarb; na++)
                  {
                    for (nx = 0; nx < memvarc; nx++)
                    {
                      for (nx1 = 0; nx1 < memvard; nx1++)
                      {
                        for (naa = 0; naa < adirect; naa++)
                        {
                          for (nbb = 0; nbb < tdirect; nbb++)
                          {
                            for (ncc = 0; ncc < fs; ncc++)
                            {
                              for (ndd = 0; ndd < bs; ndd++)
                              {
                                for (o = 0; o < outputnum; o++)
                                {
                                  lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; //set to default value
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

If n is actually a multidimensional array, you can do this:
size_t i;
size_t count = sizeof(lookup->n) / sizeof(int); /* total number of int elements in the array */
int *p = (int *)lookup->n;                      /* treat the whole array as one flat block of ints */
for (i = 0; i < count; i++)
{
    p[i] = -3;
}
Now, that's much easier to parallelize.
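For example, adding a basic OpenMP directive to that flattened loop might look like the sketch below (reusing p and count from above, but with a signed counter, since the OpenMP 2.0 implementation in Visual Studio requires a signed loop variable; it also assumes the total element count fits in an int):
int i; /* signed loop counter, as required by OpenMP 2.0 */
#pragma omp parallel for
for (i = 0; i < (int)count; i++)
{
    p[i] = -3;
}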
Read more on why this works here (applies to C as well): How do I use arrays in C++?

This is more of an extended comment than an answer.
Find the iteration limit (i.e. the variable among box1, box2, etc.) with the largest value. Revise your loop nest so that the outermost loop runs over that variable. Then simply parallelise the outermost loop. Choosing the largest value means that, in the limit, each thread gets an approximately equal number of inner-loop iterations to run.
Collapsing loops, whether you can use OpenMP's collapse clause or have to do it by hand, is only useful when you have reason to believe that parallelising over only the outermost loop will result in significant load imbalance. That seems very unlikely in this case, so distributing the work (approximately) evenly across the available threads at the outermost level would probably provide reasonably good load balancing.
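A reduced sketch of that idea with just three loops (illustrative only, not the poster's actual code; here n_big is assumed to have the largest trip count, so its loop is outermost and is the only one parallelised):
#include <stddef.h>

static void reset_block(signed char *data, int n_big, int n_mid, int n_small)
{
    int i, j, k;
    #pragma omp parallel for private(j, k)   /* only the outermost loop is split across threads */
    for (i = 0; i < n_big; i++)
        for (j = 0; j < n_mid; j++)          /* the inner loops run serially within each thread */
            for (k = 0; k < n_small; k++)
                data[((size_t)i * n_mid + j) * n_small + k] = -3;
}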

I believe, based on tertiary research, that the solution might be found in adding #pragma omp parallel for collapse(N) directly above the nested loops. However, this seems to work only in OpenMP v3.0 and later, and the whole project is based on Visual Studio (and therefore OpenMP v2.0) for now...
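For reference, on a compiler that supports OpenMP 3.0 or later, the clause would look roughly like this (a sketch only; collapse(4) fuses the first four loops into one iteration space, and those four loops must be perfectly nested):
#pragma omp parallel for collapse(4)
for (j = 0; j < box1; j++)
    for (k = 0; k < box2; k++)
        for (l = 0; l < box3; l++)
            for (m = 0; m < box4; m++)
            {
                /* ...the remaining 13 loops and the assignment stay exactly as before... */
            }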

Related

Loop Unrolling Using Nested Loops and If/Else Statements 10 x 10 Unrolling

So, I have two loops that I would like to attempt a 10 x 10 unrolling on. I really have never done this before. I have seen some simple examples that did not involve if/else statements or nested loops, so I am kind of at a loss as to how to do this for these loops.
All the variables are int.
The first loop is:
for (j = 0; j < WIDTH; ++j) {
    for (i = 0; i < HEIGHT; ++i) {
        n = Calculate(prv, i, j);
        if (prv[i][j] && (n == 3 || n == 2))
            nxt[i][j] = true;
        else if (!prv[i][j] && (n == 3))
            nxt[i][j] = true;
        else
            nxt[i][j] = false;
    }
}
I believe the secret is using some sort of multiple accumulators; I am just not quite sure how that would look.
The second loop:
for (ii = i_left; ii < i_right; ++ii) {
    for (jj = j_left; jj < j_right; ++jj) {
        n += b[ii][jj];
    }
}
Again, I believe this would involve some sort of multiple-accumulator approach as well (see the sketch below).
Any help getting started on this would be greatly appreciated. Also, if there are any other ways to optimize the loops, I would appreciate those suggestions as well.
Thank you
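For what it's worth, here is a minimal sketch of the multiple-accumulator idea applied to the second (summation) loop, using 2 accumulators instead of 10 for brevity; N_COLS and SumBlock are made-up names for the illustration:
#define N_COLS 100   /* placeholder width; the question's b is a 2-D int array */

static int SumBlock(int b[][N_COLS], int i_left, int i_right,
                    int j_left, int j_right)
{
    int n0 = 0, n1 = 0;           /* two accumulators break the dependency on a single n */
    for (int ii = i_left; ii < i_right; ++ii) {
        int jj;
        for (jj = j_left; jj + 1 < j_right; jj += 2) {
            n0 += b[ii][jj];
            n1 += b[ii][jj + 1];
        }
        if (jj < j_right)         /* pick up an odd trailing column */
            n0 += b[ii][jj];
    }
    return n0 + n1;
}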

Efficiently print every x iterations in for loop

I am writing a program in which a certain for-loop gets iterated many, many times.
A single iteration doesn't take too long, but since the program iterates the loop so often, it takes quite some time to compute.
In an effort to get more information on the progress of the program without slowing it down too much, I would like to print the progress every xth step.
Is there a different way to do this than a conditional with a modulo, like so:
for(int i = 0; i < some_large_number; i++){
    if(i % x == 0)
        printf("%f%%\r", percent);
    //some other code
    .
    .
    .
}
?
Thanks in advance
This code:
for(int i = 0; i < some_large_number; i++){
    if(i % x == 0)
        printf("%f%%\r", percent);
    //some other code
    .
    .
    .
}
can be restructured as:
/* Partition the execution into blocks of x iterations, possibly including a
   final fragmentary block. The expression (some_large_number+(x-1))/x
   calculates some_large_number/x with any fraction rounded up.
*/
for (int block = 0, i = 0; block < (some_large_number+(x-1))/x; ++block)
{
    printf("%f%%\r", percent);
    // Set limit to the lesser of the end of the current block or some_large_number.
    int limit = (block+1) * x;
    if (some_large_number < limit) limit = some_large_number;
    // Iterate the original code.
    for (; i < limit; ++i)
    {
        //some other code
    }
}
With the following caveats and properties:
The inner loop has no more work than the original loop (it has no extra variable to count or test) and has the i % x == 0 test completely removed. This is optimal for the inner loop in the sense it reduces the nominal amount of work as much as possible, although real-world hardware sometimes has finicky behaviors that can result in more compute time for less actual work.
New identifiers block and limit are introduced but can be changed to avoid any conflicts with uses in the original code.
Other than the above, the inner loop operates identically to the original code: It sees the same values of i in the same order as the original code, so no changes are needed in that code.
some_large_number+(x-1) could overflow int.
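As an aside (not part of the original answer), one overflow-free way to compute that rounded-up block count is:
int blocks = some_large_number / x + (some_large_number % x != 0);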
I would do it like this:
int j = x;
for (int i = 0; i < some_large_number; i++){
    if (--j == 0) {
        printf("%f%%\r", percent);
        j = x;
    }
    //some other code
    .
    .
    .
}
Divide some_large_number by x. Then loop x times in an outer loop, nest the inner loop over that quotient, and print the percentage after each outer pass. I meant this:
int temp = some_large_number/x; /* note: any remainder (some_large_number % x) iterations are dropped */
for (int i = 0; i < x; i++){
    for (int j = 0; j < temp; j++){
        //some code
    }
    printf("%f%%\r", percent);
}
The fastest approach regarding your performance concern would be to use a nested loop:
unsigned int x = 6;
unsigned int segments = some_large_number / x;
unsigned int y;

for ( unsigned int i = 0; i < segments; i++ ) {
    printf("%f%%\r", percent);
    for ( unsigned int j = 0; j < x; j++ ) {
        /* some code here */
    }
}

// If some_large_number can't be divided evenly by `x`:
if (( y = (some_large_number % x)) != 0 )
{
    for ( unsigned int i = 0; i < y; i++ ) {
        /* same code as inside of the former inner loop. */
    }
}
Another example would be to use a separate counting variable for the check that triggers the print, comparing it to x - 1 and resetting it to -1 when it matches:
unsigned int x = 6;
unsigned int some_large_number = 100000000;

for ( unsigned int i = 0, j = 0; i < some_large_number; i++, j++ ) {
    if (j == (x - 1))
    {
        printf("%f%%\r", percent);
        j = -1; /* unsigned wrap-around; the j++ in the loop header brings it back to 0 */
    }
    /* some code here */
}

Initialize a minesweeper in C

I'm currently rewriting a minesweeper program in C using the CSFML library.
I'm having some issues managing the initialization that happens only after the first click, more precisely the part where I'm supposed to make the tiles around the click empty.
I can't find a way to make these tiles empty without risking the removal of some bombs.
Here's my init code block for now:
int current = 0;
temp.bombs = BOMB_EASY;
temp.difficulty = EASY;
temp.mapEasy = malloc(sizeof(sTILE *) * (Y_EASY + 1));
for (int i = 0; i < Y_EASY + 1; i++)
{
    temp.mapEasy[i] = malloc(sizeof(sTILE) * (X_EASY + 1));
}
for (int i = 0; i < X_EASY + 1; i++)
{
    temp.mapEasy[Y_EASY][i].type = 0;
}
while (current < BOMB_EASY)
{
    for (int i = 0; i < Y_EASY; i++)
    {
        for (int j = 0; j < X_EASY; j++)
        {
            int isBomb = rand() % 10;
            if (isBomb == 0 && current < BOMB_EASY && temp.mapEasy[i][j].type != 9)
            {
                temp.mapEasy[i][j].type = 9;
                current++;
            }
            else if (temp.mapEasy[i][j].type != 9)
            {
                temp.mapEasy[i][j].type = 0;
            }
        }
    }
}
for (int i = 0; i < Y_EASY; i++)
{
    for (int j = 0; j < X_EASY; j++)
    {
        if (temp.mapEasy[i][j].type == 0)
        {
            temp.mapEasy[i][j].type = HowManyBombs(temp.mapEasy, i, j, Y_EASY, X_EASY);
        }
        temp.mapEasy[i][j].isRevealed = sfFalse;
        temp.mapEasy[i][j].isFlagged = sfFalse;
    }
}
}
I know my question might seem stupid and someone has probably already answered it, but I couldn't find the answer, so thanks to those who reply.
Create an empty matrix
Fill it with n mines at random locations. Upon generating (x, y) coordinates, check if they are already taken.
If the coordinates are already taken, you should place the mine at the next available position.
For example by increasing x by 1, check if free, if not, increase x again. Upon reaching the end of the row, increase y instead and start over with x=0.
If you simply generate a new random number, your algorithm could in theory get forever stuck. In practice it will probably work, but I would expect such an algorithm to generate the grid slower1) than one that just picks the next free spot.
1) rand() call overhead is the most likely bottleneck in this algorithm. But also, if you pick the next spot rather than calling rand() again, the CPU might be able to speculatively load (parts of) the array into the data cache via prefetching. This wouldn't be possible when the memory location is literally random each time you pick it.
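A minimal, self-contained sketch of that placement strategy (illustrative only, not the question's CSFML code; the grid size, the value 9 for a mine, and the place_mines name are assumptions):
#include <stdlib.h>

#define ROWS 9
#define COLS 9

/* Place n_mines mines; when a randomly chosen cell is taken, walk forward
   to the next free cell (wrapping around). Assumes n_mines <= ROWS * COLS. */
static void place_mines(int grid[ROWS][COLS], int n_mines)
{
    for (int placed = 0; placed < n_mines; placed++) {
        int x = rand() % COLS;
        int y = rand() % ROWS;
        while (grid[y][x] == 9) {      /* cell already holds a mine */
            if (++x == COLS) {         /* end of the row: move to the next row */
                x = 0;
                if (++y == ROWS)
                    y = 0;             /* wrap around the whole grid */
            }
        }
        grid[y][x] = 9;
    }
}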

Using Hash Tables in Lieu of Multidimensional Array

EDIT: Found a solution! Like the commenters suggested, using memset is an insanely better approach. Replace the entire for loop with
memset(lookup->n, -3, (dimensions*sizeof(signed char)));
where
/* note: this product is evaluated in int arithmetic; if it can exceed INT_MAX, use size_t and cast the first factor, e.g. (size_t)box1 * box2 * ... */
long int dimensions = box1 * box2 * box3 * box4 * box5 * box6 * box7 * box8 * memvara * memvarb * memvarc * memvard * adirect * tdirect * fs * bs * outputnum;
Intro
Right now, I'm looking at a beast of a for-loop:
for (j = 0; j < box1; j++)
{
  for (k = 0; k < box2; k++)
  {
    for (l = 0; l < box3; l++)
    {
      for (m = 0; m < box4; m++)
      {
        for (x = 0; x < box5; x++)
        {
          for (y = 0; y < box6; y++)
          {
            for (xa = 0; xa < box7; xa++)
            {
              for (xb = 0; xb < box8; xb++)
              {
                for (nb = 0; nb < memvara; nb++)
                {
                  for (na = 0; na < memvarb; na++)
                  {
                    for (nx = 0; nx < memvarc; nx++)
                    {
                      for (nx1 = 0; nx1 < memvard; nx1++)
                      {
                        for (naa = 0; naa < adirect; naa++)
                        {
                          for (nbb = 0; nbb < tdirect; nbb++)
                          {
                            for (ncc = 0; ncc < fs; ncc++)
                            {
                              for (ndd = 0; ndd < bs; ndd++)
                              {
                                for (o = 0; o < outputnum; o++)
                                {
                                  lookup->n[j][k][l][m][x][y][xa][xb][nb][na][nx][nx1][naa][nbb][ncc][ndd][o] = -3; //set to default value
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
The Problem
This loop is called every cycle in the main run to reset values to an initial state. Unfortunately, it is necessary for the structure of the program that this many values are kept in a single data structure.
Here's the kicker: for every 60 seconds of program run time, 57 seconds goes to this function alone.
The Question
My question is this: would hash tables be an appropriate substitute for a linear array? This array has an O(n^17) cardinality, yet hash tables have an ideal of O(1).
If so, what hash library would you recommend? This program is in C and has no native hash support.
If not, what would you recommend instead?
Can you provide some pseudo-code on how you think this should be implemented?
Notes
OpenMP was used in an attempt to parallelize this loop. Numerous implementations only resulted in slightly-to-greatly increased run time.
Memory usage is not particularly an issue -- this program is intended to be run on an insanely high-spec'd computer.
We are student researchers, thrust into a heretofore unknown world of optimization and parallelization -- please bear with us, and thank you for any help
Hash vs Array
As comments have specified, an array should not be a problem here. Lookup into an array with a known offset is O(1).
The Bottleneck
It seems to me that the bulk of the work here (and the reason it is slow) is the number of pointer de-references in the inner-loop.
To explain in a bit more detail, consider myData[x][y][z] in the following code:
for (int x = 0; x < someVal1; x++) {
    for (int y = 0; y < someVal2; y++) {
        for (int z = 0; z < someVal3; z++) {
            myData[x][y][z] = -3; // x and y only change in outer-loops.
        }
    }
}
To compute the location for the -3, we do a lookup and add a value - once for myData[x], then again to get to myData[x][y], and once more finally for myData[x][y][z].
Since this lookup is in the inner-most portion of the loop, we have redundant reads. myData[x] and myData[x][y] are being recomputed, even when only z's value is changing. The lookups were performed during a previous iteration, but the results weren't stored.
For your loop, there are many layers of lookups being computed each iteration, even when only the value of o is changing in that inner-loop.
An Improvement for the Bottleneck
To perform each lookup only once, at the loop level where its index changes, simply store the intermediate lookups. Using int* as the indirection (though any type would work here), the sample code above (with myData) would become:
int **a, *b;
for (int x = 0; x < someVal1; x++) {
    a = myData[x]; // Store the lookup.
    for (int y = 0; y < someVal2; y++) {
        b = a[y]; // Indirection based on the stored lookup.
        for (int z = 0; z < someVal3; z++) {
            b[z] = -3; // This can be extrapolated as needed to deeper levels.
        }
    }
}
This is just sample code; small adjustments may be necessary to get it to compile (casts and so forth). Note that there is probably no advantage to using this approach with a 3-dimensional array. However, for a 17-dimensional large data set with simple inner-loop operations (such as assignment), this approach should help quite a bit.
Finally, I'm assuming you aren't actually just assigning the value of -3. You can use memset to accomplish that goal much more efficiently.
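A sketch of that memset form, with the caveat that memset writes a byte value, so it only sets every element to -3 when the elements are one byte wide (e.g. the signed char array in the question's edit), and assuming n is a true multidimensional array rather than an array of pointers:
#include <string.h>

/* sizeof gives the total size in bytes, which equals the element count for a
   single-byte element type such as signed char. */
memset(lookup->n, -3, sizeof(lookup->n));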

Parallelizing giving wrong output

I have some problems trying to parallelize an algorithm. The intention is to make some modifications to a 100x100 matrix. When I run the algorithm without OpenMP everything runs smoothly in about 34-35 seconds; when I parallelize on 2 threads (I need it to be with 2 threads only) it gets down to about 22 seconds, but the output is wrong and I think it's a synchronization problem that I cannot fix.
Here's the code:
for (p = 0; p < sapt; p++){
    memset(count, 0, Nc*sizeof(int));
    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++){
            for (m = 0; m < Nc; m++)
                dist[m] = N+1;
            omp_set_num_threads(2);
            #pragma omp parallel for shared(configurationMatrix, dist) private(k,m) schedule(static,chunk)
            for (k = 0; k < N; k++){
                for (m = 0; m < N; m++){
                    if (i == k && j == m)
                        continue;
                    if (MAX(abs(i-k),abs(j-m)) < dist[configurationMatrix[k][m]])
                        dist[configurationMatrix[k][m]] = MAX(abs(i-k),abs(j-m));
                }
            }
            int max = -1;
            for (m = 0; m < Nc; m++){
                if (dist[m] == N+1)
                    continue;
                if (dist[m] > max){
                    max = dist[m];
                    configurationMatrix2[i][j] = m;
                }
            }
        }
    }
    memcpy(configurationMatrix, configurationMatrix2, N*N*sizeof(int));
    #pragma omp parallel for shared(count, configurationMatrix) private(i,j)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            count[configurationMatrix[i][j]]++;
    for (i = 0; i < Nc; i++)
        fprintf(out, "%i ", count[i]);
    fprintf(out, "\n");
}
In which: sapt = 100;
count --> a vector that holds how many of each element the matrix contains at each step;
(EX: count[1] = 60 --> I have the element '1' 60 times in my matrix, and so on)
dist --> a vector that holds the maximum distance from an element (i,j) of, say, value K to the farthest element (k,m) of the same value K.
(EX: dist[1] = 10 --> the distance from an element of value 1 to the farthest element of value 1)
Then I write the results to an output file, but again, the output is wrong.
If I understand your code correctly this line
count[configurationMatrix[i][j]] ++;
increments count at the element whose index is at configurationMatrix[i][j]. I don't see that your code takes any steps to ensure that threads are not simultaneously trying to increment the same element of count. It's entirely feasible that two different elements of configurationMatrix provide the same index into count and that those two elements are handled by different threads. Since ++ is not an atomic operation your code has a data race; multiple threads can contend for update access to the same variable and you lose any guarantees of correctness, or determinism, in the result.
I think you may have other examples of the same problem in other parts of your code too. You are silent on the errors you observe in the results of the parallel program compared with the results from the serial program yet those errors are often very useful in diagnosing a problem. For example, if the results of the parallel program are not the same every time you run it, that is very suggestive of a data race somewhere in your code.
How to fix this? Since you only have 2 threads, the easiest fix would be to not parallelise this part of the program. You could wrap the data race inside an OpenMP critical section, but that's really just another way of serialising your code. Finally, you could possibly modify your algorithm and data structures to avoid this problem entirely.
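For illustration, the critical-section variant mentioned above would look roughly like this (it serialises the increments, so it mainly shows the mechanics rather than giving a speed-up; a #pragma omp atomic on the increment would be a lighter-weight alternative):
#pragma omp parallel for shared(count, configurationMatrix) private(i, j)
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
    {
        #pragma omp critical /* one thread at a time, which removes the race on count[] */
        count[configurationMatrix[i][j]]++;
    }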
