Different results using parallelism despite there being no data dependency/data race? (C)

I have a piece of code in C which generates an int matrix and assigns 0 to every field. Afterwards, when I run this:
#pragma omp parallel for
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        a[i][j] = a[i][j] + 1;
without OpenMP, I get, as expected, 1s in every field.
But when I run it in parallel, I get splotches of wrong values (0s and sometimes even 2s) here and there, despite (what I think is) a piece of code with no data dependency. Every time it runs, it produces a different result with different splotches of messy values. Am I missing something? I made sure it's the same code by writing it in serial first, then copying it over and just adding the extra line to make it parallel. Thanks in advance!

Your i and j variables aren't declared inside the parallel region.
According to http://supercomputingblog.com/openmp/tutorial-parallel-for-loops-with-openmp/ this causes the j variable to be shared across all threads (the loop variable i of the parallel for is privatized automatically, but j is not), meaning it gets incremented too many times and rows get skipped (causing 0s).
I suspect that with the right interleaving this also loses increments (causing 2s, 3s and 4s), but I'm not sure exactly what ordering that is off the top of my head.
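A minimal sketch of the fix, assuming i and j are declared outside the loops as in the question: either mark j private explicitly, or declare both variables inside the for statements so each thread gets its own copies.
#pragma omp parallel for private(j)
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        a[i][j] = a[i][j] + 1;
Or, with C99-style declarations, both loop variables are automatically private:
#pragma omp parallel for
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
        a[i][j] = a[i][j] + 1;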

Related

How to ensure data synchronization with OpenMP?

When I evaluate the math expression in the following code, the matrix values come out inconsistent; how can I fix that?
#pragma omp parallel num_threads(NUM_THREADS)
{
    #pragma omp for
    for (int i = 1; i < qtdPassos; i++)
    {
        #pragma omp critical
        matriz[i][0] = matriz[i-1][0];
        for (int j = 1; j < qtdElementos-1; j++)
        {
            matriz[i][j] = (matriz[i-1][j-1] + (2 * matriz[i-1][j]) + matriz[i-1][j+1]) / 4; // Xi(t+1) = [Xi-1(t) + 2*Xi(t) + Xi+1(t)] / 4
        }
        matriz[i][qtdElementos-1] = matriz[i-1][qtdElementos-1];
    }
}
The problem comes from a race condition, which is due to a loop-carried dependency: each iteration of the outer loop reads the row of matriz written by the previous iteration, while other threads may still be writing it. Neither the encompassing loop nor the inner loop can be parallelised as written.
Note that OpenMP does not check whether a loop can be parallelized (in fact, it theoretically cannot in the general case); that is your responsibility. Additionally, using a critical section for the whole iteration would serialize the execution, defeating the purpose of a parallel loop (in fact, it would be slower than serial code due to the overhead of the critical section). Note also that #pragma omp critical applies only to the next statement, so protecting the line matriz[i][0] = matriz[i-1][0]; alone is not enough to avoid the race condition.
I do not think this code can be (efficiently) parallelised as written. That being said, if your goal is to implement a 1D/2D stencil, you can use a double-buffering technique (i.e. write to a 2D array that is different from the input array). A similar logic applies to a 1D stencil repeated multiple times (which is apparently what you want to do), though note that the results will be different in that case. For the 1D stencil, double buffering fixes the dependency issue and lets you parallelize the inner loop, as sketched below. For the 2D stencil, both nested loops can be parallelized.
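As a minimal sketch of that idea for the 1D case: since row i depends only on row i-1, the rows of matriz can themselves serve as the two buffers. Keep the time loop serial and parallelize only the inner loop, whose iterations are independent:
for (int i = 1; i < qtdPassos; i++)
{
    matriz[i][0] = matriz[i-1][0];                            // boundary values
    matriz[i][qtdElementos-1] = matriz[i-1][qtdElementos-1];
    #pragma omp parallel for
    for (int j = 1; j < qtdElementos-1; j++)                  // independent iterations
        matriz[i][j] = (matriz[i-1][j-1] + (2 * matriz[i-1][j]) + matriz[i-1][j+1]) / 4;
}
Each parallel region only reads row i-1, which is fully written before the region starts, so there is no race.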

OpenMP parallelize grouped array sum using pointers

I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve execution. The dimensions l and ng of the arrays and the number of threads nth are available to me and fixed beforehand. I cannot access the arrays directly; only the pointers are passed to the function that does the parallel sum.
Your code has a data race (at the line pout[pg[i]] += ...); you should fix that first, then worry about performance.
If ng is not too big and you can use OpenMP 4.5+, the most efficient solution is an array-section reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, the best option on a typical PC is most probably to fall back to a serial version of the program. Note that your code becomes correct by adding #pragma omp atomic before pout[pg[i]] += .., but its performance is questionable.
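For reference, a sketch of the two variants side by side (the reduction clause gives each thread a private, zero-initialized copy of pout that is summed into the shared array at the end; the atomic version serializes each conflicting update instead):
// OpenMP 4.5+ array-section reduction (reasonable when ng is small)
#pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
for (int i = 0; i < l; ++i)
    pout[pg[i]] += px[i];

// Correct but possibly slow alternative: make each update atomic
#pragma omp parallel for num_threads(nth)
for (int i = 0; i < l; ++i)
{
    #pragma omp atomic
    pout[pg[i]] += px[i];
}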
From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
#pragma omp parallel shared(sum,threadsum)
{
    int thread = omp_get_thread_num(),
        myfirst = thread*ngroups;          // this thread's slice of threadsum
    // Phase 1: each thread accumulates into its own private slice.
    #pragma omp for
    for ( int i=0; i<biglen; i++ )
        threadsum[ myfirst+indexes[i] ] += 1;
    // Phase 2: combine the per-thread slices, one group per iteration.
    #pragma omp for
    for ( int igrp=0; igrp<ngroups; igrp++ )
        for ( int t=0; t<nthreads; t++ )
            sum[igrp] += threadsum[ t*ngroups+igrp ];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.
Final word: I also coded up @Laci's solution of just using a reduction. Testing with an output of 1M groups: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferable overall.

Which sequence is more effective in Assembly language?

I have 2 C sequences which both multiply two matrices.
Sequence 1:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++)
    for (j = 0; j < P; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
Sequence 2:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--)
    for (j = P - 1; j >= 0; j--)
        for (k = N - 1; k >= 0; k--)
            C[i][j] += A[i][k] * B[k][j];
My question is: which of them is more efficient when translated in Assembly language?
I'm pretty sure that the second one can be written using the loop instruction, while the first one can be written using inc/jl.
First, you should understand that source code does not dictate what the assembly language is. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)
Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
If the arrays in your example are large, then the time it takes to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory; the CPU will be forced to wait for that data, and it will easily complete any loop control or array address arithmetic while it waits.
Typical systems deal with the slow-memory problem by including some fast memory, called cache. Often there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there may be other levels of cache beyond that. Memory in cache is organized into lines, which are segments of consecutive data from memory; one cache line might hold eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid much of the memory delay by using eight consecutive int objects: when you read the first one (or even before, since the processor may predict the read and start fetching ahead of time), all eight are fetched together, so your program only has to wait for the first. When it goes on to use the second through the eighth, they are already in cache, immediately available to the processor.
Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
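A minimal sketch of the blocking idea, assuming for simplicity that M, N, and P are multiples of the block size BS (edge handling is omitted):
enum { BS = 8 };  /* block size; in practice tuned to the cache */
for (int ii = 0; ii < M; ii += BS)
    for (int kk = 0; kk < N; kk += BS)
        for (int jj = 0; jj < P; jj += BS)
            /* one BS x BS block of B stays hot in cache while it is reused */
            for (int i = ii; i < ii + BS; i++)
                for (int k = kk; k < kk + BS; k++)
                    for (int j = jj; j < jj + BS; j++)
                        C[i][j] += A[i][k] * B[k][j];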
Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
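If the matrices do arrive only as pointers, C99's restrict qualifier is one way to promise the compiler that they do not overlap. This is a sketch under that assumption, not something from the answer above:
/* restrict asserts the three matrices never alias, restoring the
   compiler's freedom to reorder the loops. Row-major flat layout. */
void matmul(int m, int n, int p,
            const int *restrict a,   /* m x n */
            const int *restrict b,   /* n x p */
            int *restrict c)         /* m x p, assumed pre-zeroed */
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < p; j++)
                c[i*p + j] += a[i*n + k] * b[k*p + j];
}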
There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Update an array of 1000000 items in parallel

This is a question which was asked in previous year exam.
Consider the following fragment of C code:
int i, array[1000000];
array[0] = 0;
for (i = 1; i < 1000000; i++)
    array[i] = array[i-1] + 3;
Can we simply run 1,000,000 of the array update statement in the
for loop in parallel? If not, change the update statement so that it
can run in parallel and still produce the same final contents of the
array.
I understand that it is not possible to simply run 1,000,000 copies of the array update statement in the for loop in parallel. The only ways that come to mind are recursion, which is not parallel, and using 1,000,000 threads, which is not a great idea.
So is there another way of getting this done in parallel with very few update statements? We may use OpenMPI or OpenCL.
Edit: This is not a homework question, but I think it was given as homework in some school. It is from a past exam paper.
The problem is that you cannot parallelize that loop, not even with only 2 threads, since each iteration depends on the previous one.
Your algorithm produces:
array[0] = 0;
array[1] = 3;
array[2] = 6;
...
So you can rewrite the update statement in a way that each iteration does not depend on the previous one:
int i, array[1000000];
array[0] = 0;
for (i = 1; i < 1000000; i++)
    array[i] = 3*i;
In this way you have removed the data dependency and you can easily parallelize the loop (e.g. with OpenMP or MPI).
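For example, with OpenMP the parallel version is a one-line addition (a sketch; starting at i = 0 also folds the array[0] = 0 base case into the loop):
int array[1000000];
#pragma omp parallel for
for (int i = 0; i < 1000000; i++)
    array[i] = 3*i;   /* each element is now independent of all others */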

Why does the order of loops in a matrix multiply algorithm affect performance? [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array? (7 answers)
I am given two functions for finding the product of two matrices:
void MultiplyMatrices_1(int **a, int **b, int **c, int n){
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}

void MultiplyMatrices_2(int **a, int **b, int **c, int n){
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?
I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.
Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.
Most programs are written so that if one value is read from memory, the program soon reads multiple other values from around that memory region as well. Consequently, caches are designed so that when you read a single value from memory, a block of consecutive memory around that value (a cache line, typically 64 bytes on current hardware) is pulled into the cache along with it. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.
Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.
Given this, let's look at your code. The first version looks like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration is likely to have a cache miss when loading the value of b[k][j]: because the matrix is stored in row-major order, each time you increment k you skip over an entire row of the matrix and jump much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j don't change within the inner loop), nor will you probably miss a[i][k], because the values are laid out in row-major order, and if a[i][k] was cached on the previous iteration, the a[i][k] read on this iteration is an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.
But consider this second version:
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.
Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.
Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.
Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.
Hope this helps!
This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.
The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.
Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.
for (int k = 0; k < n; k++)
    c[i][j] = c[i][j] + a[i][k]*b[k][j];
You can see that in this inner loop i and j do not change. This means it can be rewritten as
for (int k = 0; k < n; k += 4) {
    int *aik = &a[i][k];
    c[i][j] += aik[0]*b[k][j]
             + aik[1]*b[k+1][j]
             + aik[2]*b[k+2][j]
             + aik[3]*b[k+3][j];
}
You can see there will be:
- four times fewer loop iterations and accesses to c[i][j]
- a[i][k] accessed contiguously in memory
- memory accesses and multiplies that can be pipelined (almost concurrently) in the CPU
What if n is not a multiple of 4 (or 6, or 8, or whatever the compiler decides to unroll to)? The compiler handles that tidy-up for you. ;)
To speed this up further, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to the transposed b are also contiguous in memory (as you are swapping [k] with [j]).
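A sketch of that idea, where bt is a hypothetical pre-filled transpose of b (so bt[j][k] == b[k][j]):
/* Both operands are now walked along rows, so every access in the
   inner loop hits consecutive memory. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        int sum = c[i][j];
        for (int k = 0; k < n; k++)
            sum += a[i][k] * bt[j][k];
        c[i][j] = sum;
    }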
Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.
Lastly, you might consider using float or double. You might think int would be faster, but that is not always the case, as floating-point operations can be more heavily optimised (both in hardware and by the compiler).
In the second example, c[i][j] changes on every iteration of the inner loop, which makes it harder to optimise.
Probably the slower one has to skip around in memory more to access the array elements. It might be something else, too; you could check the compiled code to see what is actually happening.
