Nested Loop Optimisation in C and OpenMP

Actually I have two questions. The first one is: considering the cache, which of the following two pieces of code is faster?
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[i][j]++;
}
}
or
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[j][i]++;
}
}
I am guessing the first one will be much faster since there are far fewer cache misses. My second question is: if you are using OpenMP, what kind of technique would you use to optimise such a nested loop? My strategy is to divide the outer loop into 4 chunks and assign them to 4 cores. Is there a better (more cache-friendly) way to do it?
Thanks!
Bob

As maxihatop pointed out, the first one performs better because it has better cache locality.
Dividing the outer loop into chunks is a good strategy in a case like this, where the cost of the work inside the loop is constant.
You might want to take a look at #pragma omp for schedule(static). This divides the iterations evenly into contiguous chunks, one per thread. The directive only distributes work inside a parallel region, so with the combined parallel for form your code should look like:
#pragma omp parallel for schedule(static)
for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        a[i][j]++;
    }
}
Lawrence Livermore National Laboratory provides a fantastic OpenMP tutorial, where you can find more information:
https://computing.llnl.gov/tutorials/openMP/
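To make the cache-locality argument above concrete, here is a minimal sketch of both traversal orders plus the statically scheduled OpenMP version. The function names and the flattened n x n layout are assumptions for illustration and are not from the original posts. All three functions produce the same result; only the memory access pattern differs.

```c
#include <stddef.h>

/* Row-major order: the inner loop walks consecutive addresses,
 * so every fetched cache line is fully used. */
void increment_row_major(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j]++;
}

/* Column-major order: consecutive inner-loop accesses are n ints apart,
 * so for large n almost every access touches a different cache line. */
void increment_col_major(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[j * n + i]++;
}

/* Row-major order with the outer loop split into contiguous chunks,
 * one per thread. Without OpenMP enabled the pragma is ignored and
 * the loop simply runs serially. */
void increment_parallel(int *a, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        for (size_t j = 0; j < n; j++)
            a[(size_t)i * n + j]++;
}
```

To compare the variants, one could allocate an n x n array on the heap (a 10000 x 10000 int array is about 400 MB, too large for the stack) and time each call with clock() from <time.h>.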

Related

OpenMP reduction on multiple variables (array)

I am trying to do a reduction on multiple variables (an array) using OpenMP, but I'm not sure how to implement it. See the code below.
#pragma omp parallel for reduction( ??? )
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        [ compute value ... ]
        y[j] += value;
    }
}
I thought I could do something like this with the atomic keyword, but realised this would prevent two threads from updating y at the same time even if they are updating different elements.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        [ compute value ... ]
        #pragma omp atomic
        y[j] += value;
    }
}
Does OMP have any functionality for something like this or otherwise how would I achieve this optimally without OMP's reduction keyword?

Array reductions have been available in OpenMP since version 4.5:
#pragma omp parallel for reduction(+:y[:m])
where m is the size of the array. The main limitation is that implementations typically place each thread's private copy of the array on the stack, so very large arrays can overflow it.
The atomic operation you mentioned should also work, but it may be less efficient than the reduction. It depends on the actual circumstances (e.g. the values of n and m, the time needed to compute value, false sharing, etc.).
#pragma omp atomic
y[j] += value;
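As a minimal sketch of the array reduction described above, assuming a compiler with OpenMP 4.5 support: compute_value is a hypothetical stand-in for the "[ compute value ... ]" placeholder in the question, and accumulate_columns is an illustrative name.

```c
/* Hypothetical stand-in for the "[ compute value ... ]" step. */
static double compute_value(int i, int j)
{
    return (double)(i + j);
}

/* Adds compute_value(i, j) into y[j] for all i in [0, n) and j in [0, m).
 * Each thread accumulates into its own zero-initialised copy of y[0..m-1];
 * the copies are added back into y when the loop finishes. */
void accumulate_columns(double *y, int n, int m)
{
    #pragma omp parallel for reduction(+:y[:m])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            y[j] += compute_value(i, j);
        }
    }
}
```

The array section y[:m] is needed because y is a pointer here, so the compiler cannot deduce the length on its own.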

OpenMP in C array reduction / parallelize the code

I have a problem with my code. It should print the number of appearances of each number.
I want to parallelize this code with OpenMP, and I tried to use a reduction on the array, but it obviously isn't working as I wanted.
The error is "segmentation fault". Should some variables be private? Or is the problem the way I'm trying to use the reduction?
I think each thread should count some part of the array, and then the results should be merged somehow.
#pragma omp parallel for reduction (+: reasult[:i])
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
if ( numbers[j] == i){
result[i]++;
}
}
}
Here N is a large number telling how many numbers I have, numbers is the array of all numbers, and result is the array holding the count of each number.

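First, you have a typo in the array name:
#pragma omp parallel for reduction (+: reasult[:i])
It should be "result", not "reasult".
Beyond that, why are you sectioning the array with result[:i]? The section length is whatever i happens to hold before the loop starts, which is most likely not what you intended. Based on your code, it seems that you want to reduce the entire array (if result is a pointer rather than an array, write result[:M] instead). The inner loop counter j should also be private, otherwise all threads share a single j and race on it:
#pragma omp parallel for private(j) reduction (+: result)
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        if (numbers[j] == i)
            result[i]++;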
When one's compiler does not support the OpenMP 4.5 array reduction feature one can alternatively explicitly implement the reduction (check this SO thread to see how).
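A hand-written reduction could look roughly like the sketch below. It is not taken from the linked thread; count_values and its parameters are hypothetical names, and it swaps the M x N double loop for a single pass over numbers, which produces the same counts. Each thread counts into its own heap-allocated array, and the per-thread arrays are merged under a critical section.

```c
#include <stdlib.h>

/* Counts how often each value in [0, m) occurs in numbers[0..n-1]. */
void count_values(const int *numbers, long n, int *result, int m)
{
    for (int i = 0; i < m; i++)
        result[i] = 0;

    #pragma omp parallel
    {
        /* Per-thread private counts, on the heap to avoid stack limits. */
        int *local = calloc((size_t)m, sizeof *local);

        #pragma omp for
        for (long j = 0; j < n; j++) {
            int v = numbers[j];
            if (v < 0 || v >= m)
                continue;            /* ignore out-of-range values */
            if (local != NULL) {
                local[v]++;
            } else {
                /* Allocation failed: fall back to a slower atomic update. */
                #pragma omp atomic
                result[v]++;
            }
        }
        /* The implicit barrier after the loop guarantees that all counting
         * (including atomic fallbacks) is done before any merge starts. */
        if (local != NULL) {
            #pragma omp critical
            for (int i = 0; i < m; i++)
                result[i] += local[i];
            free(local);
        }
    }
}
```

The merge costs m additions per thread, so this pays off mainly when n is much larger than m.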
As pointed out by @Hristo Iliev in the comments:
Provided that M * sizeof(result[0]) / #threads is a multiple of the
cache line size, and even if it isn't when the value of M is large
enough, there is absolutely no need to involve reduction in the
process. Unless the program is running on a NUMA system, that is.
Assuming that the aforementioned conditions are met: if you look carefully, the iterations of the outermost loop (i.e., the values of i) are distributed among the threads, and since i is the index into the result array, each thread updates different positions of result. Therefore, you can simplify your code to the following (j is still made private so that the threads do not share one inner loop counter):
#pragma omp parallel for private(j)
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        if (numbers[j] == i)
            result[i]++;

Which method is better for incrementing and traversing in a loop in C

These are examples of incrementing each element of an array by 10.
for (i = 0; i< 100; i++){
arr[i] += 10;
}
or
for (i = 0; i< 100; i+=2){
arr[i] += 10;
arr[i+1] += 10;
}
Which of these two is the more efficient way to solve this problem in C?

Don't worry about it. Your compiler will make this optimization if necessary.
For example, clang 10 unrolls this completely and uses vector instructions to do multiple at once.

As @JeremyRoman stated, the compiler will usually do better than a human at optimizing this code.
But you can make its job easier or harder. In your example, the second version prevents gcc from unrolling the loop.
So keep it simple and do not prematurely micro-optimize your code, as the result might be the opposite of what you expected:
https://godbolt.org/z/jYcLpT

Let's look at better and efficient outside of run-time performance¹.
Bug!
Only 3 or 4 lines of code, and the one with 4 is incorrect. What if arr[] and ar[] both existed? The compiler would not complain, yet the code would certainly be incorrect.
//ar[i+1] += 10;
arr[i+1] += 10;
Coding
The below wins. Short and easy to code. No concern about whether arr[i+1] += 10; accesses arr[100].
for (i = 0; i< 100; i++){
arr[i] += 10;
}
Review
The below wins. Clear, to the point. I had to review the other one longer to be sure of its correctness, which is an inefficient use of review time. Defensibility: I'd have no trouble defending this code.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Maintenance
The below wins. Change i < 100 to i < N and this code is still fine; the other can readily break when N is odd.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Optimization possibilities
The below wins. Compilers do a fine job of optimizing common idioms. The 2nd requires more analysis and carries a greater chance that the compiler will not optimize it well.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Score
Outside of performance:
5 to 0
¹ Notice that OP never explicitly asked to view this only in terms of run-time performance. So let us consider various ideas of "better".

Make this for-loop more efficient?

Edit:
This code will be run with optimizations off.
Full transparency: this is a homework assignment.
I'm having some trouble figuring out how to optimize this code...
My instructor went over unrolling and splitting, but neither seems to greatly reduce the time needed to execute the code. Any help would be appreciated!
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}

Assuming you mean same number of additions to sum at runtime (rather than same number of additions in the source code), unrolling could give you something like:
for (j = 0; j + 5 < ARRAY_SIZE; j += 5) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for (; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Alternatively, since you're adding the same values each time through the outer loop, you don't need to process it N_TIMES times, just do this:
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
sum *= N_TIMES;
break;
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}
This requires that the initial value of sum is zero, which is likely but there's actually nothing in your question that mandates this, so I include it as a pre-condition for this method.

Except by cheating*, this inner loop is essentially non-optimizable, because you must fetch all the array elements and perform all the additions anyway.
The body of the loop performs:
1. a conditional branch on j;
2. a fetch of array[j];
3. the accumulation into a scalar variable;
4. the increment of j.
As said, 2. to 4. are inescapable. All you can then do is reduce the number of conditional branches by loop unrolling (this removes the conditional branch between the unrolled copies of the body, at the expense of the number of elements handled per pass becoming fixed).
It is no surprise that you don't see a big difference. Modern processors are "loop aware", meaning that branch prediction is well tuned to such loops, so the cost of the branches is pretty low.
Cheating:
As others said, you can completely bypass the outer loop. This is just exploiting a flaw in the exercise statement.
As optimizations must be turned off, using inline assembly, pragmas, vector instructions or intrinsics should be banned as well (not mentioning automatic parallelization).
It is possible to pack two ints into a long long. If neither partial sum overflows, you perform two additions at a time. But is this legal?
One might think of an access pattern that favors cache utilization. But here there is no hope as the array is fully traversed on every loop and there is no possibility of reuse of the values fetched.
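The "two ints in a long long" idea from the list above could be sketched as follows. This is an illustration only, under loud assumptions: elements are 32-bit, all values are non-negative, and the sum of each lane stays below 2^32 so that no carry leaks from the low half into the high half. packed_sum is a hypothetical name. Whether it is actually faster with optimizations off is untested, since memcpy may then be a real function call.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sums n non-negative 32-bit values, two per 64-bit addition.
 * Assumes each lane's partial sum fits in 32 bits. */
long long packed_sum(const int32_t *array, size_t n)
{
    uint64_t acc = 0;   /* low 32 bits: one lane, high 32 bits: the other */
    size_t j = 0;

    for (; j + 1 < n; j += 2) {
        uint64_t pair;
        /* memcpy avoids the aliasing problems of a pointer cast. */
        memcpy(&pair, &array[j], sizeof pair);
        acc += pair;    /* adds both lanes with one instruction */
    }

    /* Split the two lanes apart and combine them. Byte order does not
     * matter because both lanes end up in the same total. */
    long long total = (long long)(acc & 0xFFFFFFFFu) + (long long)(acc >> 32);

    if (j < n)          /* odd element count: add the last value */
        total += array[j];
    return total;
}
```

Note that this halves the number of add instructions, so it arguably breaks the exercise's "same number of additions" rule, which is why the answer files it under cheating.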

First of all, unless you are explicitly compiling with -O0, your compiler has likely already optimized this loop much further than you would expect, including unrolling and, on top of that, vectorization and more.
Trying to optimize this by hand is something you should never, absolutely never do. At most you will succeed in making the code harder to read and understand, while most likely not even matching the compiler in terms of performance.
As to why there is no measurable gain: possibly because you have already hit a bottleneck, even with the "non-optimized" version. For an ARRAY_SIZE larger than your processor's cache, even the compiler-optimized version is limited by memory bandwidth.
But for completeness, let's assume you have not hit that bottleneck and that you have optimizations turned almost off (so no more than -O1), and optimize for that.
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    int tmpSum[4] = {0,0,0,0};
    // Four independent accumulators; stop before reading past the end.
    for (j = 0; j + 3 < ARRAY_SIZE; j += 4) {
        tmpSum[0] += array[j+0];
        tmpSum[1] += array[j+1];
        tmpSum[2] += array[j+2];
        tmpSum[3] += array[j+3];
    }
    sum += tmpSum[0] + tmpSum[1] + tmpSum[2] + tmpSum[3];
    // Add the 0 to 3 leftover elements when ARRAY_SIZE is not a multiple of 4.
    for (; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
There is pretty much only one factor left that could still have reduced the performance for a smaller array.
It is not the loop overhead, so plain unrolling would have been pointless on a modern processor. Don't even bother, you won't beat the branch prediction.
But the latency between two dependent instructions, i.e. the delay until a value written by one instruction can be read by the next, still applies. In this case, sum is written and read over and over again, and even if sum is kept in a register, this delay still applies and the processor's pipeline has to wait.
The way around that is to have multiple independent additions going on simultaneously and to combine the results at the end. This is, by the way, also an optimization which most modern compilers know how to perform.
On top of that, you could now also express the first loop with vector instructions, once again something the compiler would have done. At this point you run into instruction latency again, so you will likely have to introduce one more set of temporaries, so that you have two independent addition streams, each using vector instructions.
Why the requirement of at least -O1? Because otherwise the compiler won't even place tmpSum in a register, or it will express e.g. array[j+0] as a sequence of instructions that performs the index addition first rather than using a single instruction. It is hardly possible to optimize in that case without using inline assembly directly.
Or if you just feel like (legit) cheating:
const int N_TIMES = 1000;
const int ARRAY_SIZE = 1024;
const int array[1024] = {1};
int sum = 0;
__attribute__((optimize("O3")))
__attribute__((optimize("unroll-loops")))
int fastSum(const int array[]) {
int j;
int tmpSum = 0;
for (j = 0; j < ARRAY_SIZE; j++) {
tmpSum += array[j];
}
return tmpSum;
}
int main() {
int i;
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
sum += fastSum(array);
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}
return sum;
}
The compiler will then apply pretty much all the optimizations described above.

Loop Optimization in C

I have been tasked with optimizing a particular for loop in C. Here is the loop:
#define ARRAY_SIZE 10000
#define N_TIMES 600000
for (i = 0; i < N_TIMES; i++)
{
int j;
for (j = 0; j < ARRAY_SIZE; j++)
{
sum += array[j];
}
}
I'm supposed to use loop unrolling, loop splitting, and pointers in order to speed it up, but every time I try to implement something, the program doesn't return. Here's what I've tried so far:
for (i = 0; i < N_TIMES; i++)
{
int j,k;
for (j = 0; j < ARRAY_SIZE; j++)
{
for (k = 0; k < 100; k += 2)
{
sum += array[k];
sum += array[k + 1];
}
}
}
I don't understand why the program doesn't even return now. Any help would be appreciated.

That second piece of code is both inefficient and wrong, since it adds more values than the original code does.
Loop unrolling (or rather shortening in this case, since you probably don't want to fully unroll a ten-thousand-iteration loop) would be:
// Ensure ARRAY_SIZE is a multiple of two before trying this.
for (int i = 0; i < N_TIMES; i++)
for (int j = 0; j < ARRAY_SIZE; j += 2)
sum += array[j] + array[j+1];
But, to be honest, the days of dumb compilers have long since gone. You should generally leave this level of micro-optimisation up to your compiler, while you concentrate on the more high-level stuff like data structures, algorithms and human analysis.
That last one is rather important. Since you're adding the same array to an accumulated sum a constant number of times, you only really need the sum of the array once, then you can add that partial sum as many times as you want:
int temp = 0;
for (int i = 0; i < ARRAY_SIZE; i++)
temp += array[i];
sum += temp * N_TIMES;
It's still O(n) but with a much lower multiplier on the n (one rather than six hundred thousand). It may be that gcc's insane optimisation level of -O3 could work that out but I doubt it. The human brain can still outdo computers in a lot of areas.
For now, anyway :-)

Nothing in your program prevents it from returning; it will return. It is only going to take 50 times longer than the first one:
In the first you had 2 for loops: 600,000 * 10,000 = 6,000,000,000 iterations.
In the second you have 3 for loops: 600,000 * 10,000 * 50 = 300,000,000,000 iterations.

Loop unrolling doesn't necessarily speed loops up; it can slow them down. In olden times it gave you a speed bump by reducing the number of conditional evaluations. In modern times it can slow you down by bloating the code and hurting the instruction cache.
There's no obvious use case for loop splitting here. To split a loop you're looking for two or more obvious groupings in the iterations. At a stretch you could multiply array[j] by i rather than doing the outer loop and claim you've split the inner from the outer, then discarded the outer as useless.
C array-indexing syntax is just defined as (a peculiar syntax for) pointer arithmetic. But I guess you'd want something like:
sum += *arrayPointer++;
In place of your use of j, with things initialised suitably. But I doubt you'll gain anything from it.
As per the comments, if this were real life then you'd just let the compiler figure this stuff out.
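For reference, the pointer form mentioned above, "with things initialised suitably", might look like the sketch below. sum_with_pointer is a hypothetical name, and the function form is only for illustration; in the assignment the loop would sit directly inside the outer loop.

```c
#include <stddef.h>

/* Sums n ints by walking a pointer from the first element to one past
 * the last, instead of indexing with a counter. */
long sum_with_pointer(const int *array, size_t n)
{
    const int *arrayPointer = array;
    const int *end = array + n;     /* one past the last element */
    long sum = 0;

    while (arrayPointer < end)
        sum += *arrayPointer++;     /* read the element, then advance */

    return sum;
}
```

As the answer says, an optimizing compiler will typically generate the same code for this and for the indexed version.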
