I have been tasked with optimizing a particular for loop in C. Here is the loop:
#define ARRAY_SIZE 10000
#define N_TIMES 600000
for (i = 0; i < N_TIMES; i++)
{
    int j;
    for (j = 0; j < ARRAY_SIZE; j++)
    {
        sum += array[j];
    }
}
I'm supposed to use loop unrolling, loop splitting, and pointers in order to speed it up, but every time I try to implement something, the program doesn't return. Here's what I've tried so far:
for (i = 0; i < N_TIMES; i++)
{
    int j, k;
    for (j = 0; j < ARRAY_SIZE; j++)
    {
        for (k = 0; k < 100; k += 2)
        {
            sum += array[k];
            sum += array[k + 1];
        }
    }
}
I don't understand why the program doesn't even return now. Any help would be appreciated.
That second piece of code is both inefficient and wrong, since it performs far more additions than the original code does.
The loop unrolling (or lessening in this case since you probably don't want to unroll a ten-thousand-iteration loop) would be:
// Ensure ARRAY_SIZE is a multiple of two before trying this.
for (int i = 0; i < N_TIMES; i++)
    for (int j = 0; j < ARRAY_SIZE; j += 2)
        sum += array[j] + array[j+1];
But, to be honest, the days of dumb compilers have long since gone. You should generally leave this level of micro-optimisation to your compiler, while you concentrate on the higher-level stuff like data structures, algorithms and human analysis.
That last one is rather important. Since you're adding the same array to an accumulated sum a constant number of times, you only really need the sum of the array once, then you can add that partial sum as many times as you want:
int temp = 0;
for (int i = 0; i < ARRAY_SIZE; i++)
    temp += array[i];
sum += temp * N_TIMES;
It's still O(n) but with a much lower multiplier on the n (one rather than six hundred thousand). It may be that gcc's insane optimisation level of -O3 could work that out but I doubt it. The human brain can still outdo computers in a lot of areas.
For now, anyway :-)
There is nothing wrong with your program... it will return. It is just going to take about 50 times longer than the first one...
In the first version you had 2 loops: 600,000 * 10,000 = 6,000,000,000 iterations.
In the second you have 3 loops: 600,000 * 10,000 * 50 = 300,000,000,000 iterations...
Loop unrolling doesn't speed loops up, it slows them down. In olden times it gave you a speed bump by reducing the number of conditional evaluations. In modern times it slows you down by killing the cache.
There's no obvious use case for loop splitting here. To split a loop you're looking for two or more obvious groupings in the iterations. At a stretch you could multiply array[j] by i rather than doing the outer loop and claim you've split the inner from the outer, then discarded the outer as useless.
C array-indexing syntax is just defined as (a peculiar syntax for) pointer arithmetic. But I guess you'd want something like:
sum += *arrayPointer++;
In place of your use of j, with things initialised suitably. But I doubt you'll gain anything from it.
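For concreteness, a minimal sketch of that pointer-based inner loop, reusing the question's array, ARRAY_SIZE and sum (arrayPointer and end are just illustrative names), might look like this:

// Pointer-based inner loop: compute the end pointer once, then walk the array.
const int *arrayPointer = array;
const int *end = array + ARRAY_SIZE;
while (arrayPointer < end)
{
    sum += *arrayPointer++;
}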
As per the comments, if this were real life then you'd just let the compiler figure this stuff out.
These are two examples of incrementing each element of an array by 10.
for (i = 0; i < 100; i++) {
    arr[i] += 10;
}
or
for (i = 0; i < 100; i += 2) {
    arr[i] += 10;
    arr[i+1] += 10;
}
Which is the more efficient way to solve this problem out of these two in the C language?
Don't worry about it. Your compiler will make this optimization if necessary.
For example, clang 10 unrolls this completely and uses vector instructions to do multiple at once.
As #JeremyRoman stated, the compiler will be better than humans at optimizing the code.
But you may make its work easier or tougher. In your example the second way prevents gcc from unrolling the loops.
So keep it simple, and do not try to prematurely micro-optimize your code, as the result might be the opposite of what you expected.
https://godbolt.org/z/jYcLpT
Let's look at "better" and "efficient" outside of run-time performance.1
Bug!
Only 3 or 4 lines of code, and the one with 4 is the easier to get wrong. What if arr[] and ar[] both existed? The compiler would not complain, yet the code would certainly be incorrect:
//ar[i+1] += 10;
arr[i+1] += 10;
Coding
The below wins. Short and easy to code. No concern about whether arr[i+1] += 10; accesses arr[100].
for (i = 0; i < 100; i++) {
    arr[i] += 10;
}
Review
The below wins. Clear, to the point. I had to review the other more carefully to be sure of its correctness - inefficient review time. Defensibility - I'd have no trouble defending this code.
for (i = 0; i < 100; i++) {
    arr[i] += 10;
}
Maintenance
The below wins. Change i < 100 to i < N and this code is fine, the other can readily break.
for (i = 0; i < 100; i++) {
    arr[i] += 10;
}
Optimization possibilities
The below wins. Compilers do a fine job of optimizing common idioms. The 2nd requires more analysis and stands a greater chance that the compiler will not optimize it well.
for (i = 0; i < 100; i++) {
    arr[i] += 10;
}
Score
Outside of performance:
5 to 0
1 Notice OP never explicitly stated to view this only as run-time performance, so let us consider various ideas of "better".
Edit: this code will be run with optimizations off. Full transparency: this is a homework assignment.
I’m having some trouble figuring out how to optimize this code...
My instructor went over unrolling and splitting but neither seems to greatly reduce the time needed to execute the code. Any help would be appreciated!
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
Assuming you mean the same number of additions to sum at runtime (rather than the same number of additions in the source code), unrolling could give you something like:
for (j = 0; j + 5 < ARRAY_SIZE; j += 5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for (; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
Alternatively, since you're adding the same values each time through the outer loop, you don't need to process it N_TIMES times, just do this:
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    sum *= N_TIMES;
    break;
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
This requires that the initial value of sum is zero, which is likely but there's actually nothing in your question that mandates this, so I include it as a pre-condition for this method.
Except by cheating*, this inner loop is essentially non-optimizable, because you must fetch all the array elements and perform all the additions anyway.
The body of the loop performs:
1. a conditional branch on j;
2. a fetch of array[j];
3. the accumulation to a scalar variable;
4. the incrementation of j.
As said, 2. to 4. are inescapable. Then all you can do is reduce the number of conditional branches by loop unrolling (this turns the conditional branch into an unconditional one, at the expense of the number of iterations becoming fixed).
It is no surprise that you don't see a big difference. Modern processors are "loop aware", meaning that branch prediction is well tuned to such loops so that the cost of the branches is pretty low.
Cheating:
As others said, you can completely bypass the outer loop. This is just exploiting a flaw in the exercise statement.
As optimizations must be turned off, using inline assembly, pragmas, vector instructions or intrinsics should be banned as well (not mentioning automatic parallelization).
There is a possibility to pack two ints into a long long. If the sum doesn't overflow, you will perform two additions at a time. But is this legal?
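For illustration only, here is a minimal sketch of that packing idea. It assumes all elements are non-negative and small enough that the running low-half sum never carries into the high half; the function name sum_pairs is made up for the example:

#include <stdint.h>
#include <string.h>

// Sketch: add two ints per 64-bit addition. Only valid if every element is
// non-negative and the running low-half sum never exceeds UINT32_MAX,
// so no carry crosses the 32-bit boundary.
long long sum_pairs(const int *array, int n)   // n assumed even
{
    uint64_t packed = 0;
    for (int j = 0; j < n; j += 2)
    {
        uint64_t pair;
        memcpy(&pair, &array[j], sizeof pair);   // load two ints as one 64-bit word
        packed += pair;                          // two additions in one operation
    }
    // Split the two partial sums back out and combine them.
    return (long long)(uint32_t)packed + (long long)(packed >> 32);
}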
One might think of an access pattern that favors cache utilization. But here there is no hope as the array is fully traversed on every loop and there is no possibility of reuse of the values fetched.
First of all, unless you are explicitly compiling with -O0, your compiler has already likely optimized this loop much further than you could possibly expect.
Including unrolling, and on top of unrolling also vectorization and more. Trying to optimize this by hand is something you should never, absolutely never do. At most you will successfully make the code harder to read and understand, while most likely not even being able to match the compiler in terms of performance.
As to why there is no measurable gain? Possibly because you have already hit a bottleneck, even with the "non-optimized" version. For an ARRAY_SIZE greater than your processor's cache, even the compiler-optimized version is already limited by memory bandwidth.
But for completeness, let's just assume you have not hit that bottleneck, and that you actually had turned optimizations almost off (so no more than -O1), and optimize for that.
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    int tmpSum[4] = {0, 0, 0, 0};
    for (j = 0; j + 3 < ARRAY_SIZE; j += 4) {
        tmpSum[0] += array[j+0];
        tmpSum[1] += array[j+1];
        tmpSum[2] += array[j+2];
        tmpSum[3] += array[j+3];
    }
    sum += tmpSum[0] + tmpSum[1] + tmpSum[2] + tmpSum[3];
    // Handle the remaining elements if ARRAY_SIZE is not a multiple of 4.
    for (; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
There is pretty much only one factor left which could still reduce performance, for a smaller array.
Not the overhead of the loop, so plain unrolling would have been pointless on a modern processor. Don't even bother, you won't beat the branch prediction.
But the latency between two instructions, i.e. the delay until a value written by one instruction can be read again by the next instruction, still applies. In this case, sum is constantly written and read over and over again, and even if sum is cached in a register, this delay still applies and the processor's pipeline has to wait.
The way around that, is to have multiple independent additions going on simultaneously, and finally just combine the results. This is by the way also an optimization which most modern compilers do know how to perform.
On top of that, you could now also express the first loop with vector instructions - once again also something the compiler would have done. At this point you are running into instruction latency again, so you will likely have to introduce one more set of temporaries, so that you now have two independent addition streams each using vector instructions.
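To illustrate what "two independent addition streams each using vector instructions" could look like, here is a hedged sketch using SSE2 intrinsics (x86-specific; it assumes the element count is a multiple of 8, and the function name vector_sum is made up):

#include <emmintrin.h>   // SSE2 intrinsics

// Two independent 4-wide accumulators, only combined at the very end.
int vector_sum(const int *array, int n)   // n assumed to be a multiple of 8
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (int j = 0; j < n; j += 8)
    {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((const __m128i *)&array[j]));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((const __m128i *)&array[j + 4]));
    }
    __m128i acc = _mm_add_epi32(acc0, acc1);
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}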
Why the requirement of at least -O1? Because otherwise the compiler won't even keep tmpSum in a register, or will try to express e.g. array[j+0] as a sequence of instructions for performing the addition first, rather than just using a single instruction for that. It is hardly possible to optimize in that case without resorting to inline assembly directly.
Or if you just feel like (legit) cheating:
const int N_TIMES = 1000;
const int ARRAY_SIZE = 1024;

const int array[1024] = {1};
int sum = 0;

__attribute__((optimize("O3")))
__attribute__((optimize("unroll-loops")))
int fastSum(const int array[]) {
    int j;
    int tmpSum = 0;
    for (j = 0; j < ARRAY_SIZE; j++) {
        tmpSum += array[j];
    }
    return tmpSum;
}

int main() {
    int i;
    for (i = 0; i < N_TIMES; i++) {
        // You can change anything between this comment ...
        sum += fastSum(array);
        // ... and this one. But your inner loop must do the *same
        // number of additions as this one does.
    }
    return sum;
}
The compiler will then apply pretty much all the optimizations described above.
I'm having a bit of trouble figuring out the Big O run time for the two set of code samples where the iterations depend on outside loops. I have a basic understanding of the Big O run times and I can figure out the run times for simpler code samples. I'm not too sure how some lines are affecting the run time.
I would consider this first one O(n^2). However, I'm not certain.
for (i = 1; i < n; i++) {
    for (j = 1000/i; j > 0; j--) {   // <-- Not sure if this is still O(n)
        arr[j]++; /* THIS LINE */
    }
}
I'm a bit more lost with this one. O(n^3) possibly O(n^2)?
for (i = 0; i < n; i++) {
    for (j = i; j < n; j++) {
        while (j < n) {
            arr[i] += arr[j]; /* THIS LINE */
            j++;
        }
    }
}
I found this post and I applied this to the first code sample but I'm still unsure about the second. What is the Big-O of a nested loop, where number of iterations in the inner loop is determined by the current iteration of the outer loop?
Regarding the first one. It is not O(n^2)!!! For the sake of simplicity and readability, let's rewrite it in the form of pseudocode:
for i in [1, 2, ... n]:              # outer loop
    for j in [1, 2, ... 1000/i]:     # inner loop
        do something with time complexity O(1).   # constant-time operation
Now, the number of constant-time operations within the inner loop (which depends on the parameter i of the outer loop) can be expressed as 1000/i.
Now, we can calculate the number of constant-time operations overall: the sum over i = 1 .. n-1 of 1000/i = 1000 * H(n-1).
Here, H(n) is a harmonic number (see wikipedia), and there is a very interesting property of these numbers: H(n) ≈ ln(n) + C,
where C is the Euler–Mascheroni constant. Therefore, the complexity of the first algorithm is O(1000 * ln n) = O(log n).
Regarding the second one. It seems like either the code contains a mistake, or it is a trick test question. The code resolves to
for (i = 1; i < n; i++)
    for (j = i; j < n; j++) {
        arr[j]++;
        j++;
    }
The inner loop takes roughly (n - i)/2 operations, so we can calculate the overall complexity: the sum over i = 1 .. n-1 of (n - i)/2 ≈ n^2/4, which is O(n^2).
For the second loop (which it appears that you still need an answer for), you have sort of a misleading bit of code, where you have 3 nested loops, so at first glance, it makes sense that the runtime is O(n^3).
However, this is incorrect. This is because the innermost while loop modifies j, the same variable that the for loop modifies. This code is actually equivalent to this bit of code below:
for (i = 0; i < n; i++) {
    for (j = i; j < n; j++) {
        arr[i] += arr[j]; /* THIS LINE */
        j++;
    }
}
This is because the while loop on the inside will run, incrementing j until j == n, then it breaks out. At that point, the inner for loop will increment j again and compare it to n, where it will find that j >= n, and exit. You should be familiar with this case already, and recognize it as O(n^2).
Just a note: the second bit of code is not technically safe, as j may overflow when you increment it an additional time after the while loop finishes running. This would cause the for loop to run forever. However, this will only occur when n == INT_MAX.
Actually I have two questions. The first one is: considering the cache, which one of the following pieces of code is faster?
int a[10000][10000];

for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        a[i][j]++;
    }
}
or
int a[10000][10000];

for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        a[j][i]++;
    }
}
I am guessing the first one will be much faster since there are a lot fewer cache misses. And my question is: if you are using OpenMP, what kind of technique would you use to optimise such a nested loop? My strategy is to divide the outer loop into 4 chunks and assign them among 4 cores; is there any better (more cache-friendly) way to do it?
As maxihatop pointed out, the first one performs better because it has better cache locality.
Dividing the outer loop into chunks is a good strategy in the case like this, where the complexity of the task inside the loop is constant.
You might want to take a look at #pragma omp for schedule(static). This will evenly divide the iterations contiguously among threads. So your code should look like:
#pragma omp for schedule(static)
for (i = 0; i < 10000; i++) {
    for (j = 0; j < 10000; j++) {
        a[i][j]++;
    }
}
Lawrence Livermore National Laboratory provides a fantastic tutorial on OpenMP. You can find more information there.
https://computing.llnl.gov/tutorials/openMP/
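As an illustration of the schedule(static) approach described above, here is a minimal self-contained sketch; the parallel directive is added here only so the worksharing loop has a thread team to run on (compile with something like gcc -fopenmp):

#include <stdio.h>

#define N 10000
static int a[N][N];   // static storage, so it does not blow the stack

int main(void)
{
    // Each thread gets a contiguous block of rows; the inner loop walks each
    // row in memory order, which keeps the accesses cache-friendly.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j]++;
        }
    }
    printf("a[0][0] = %d\n", a[0][0]);
    return 0;
}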
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t start;
    double d;
    long int n, i, j;

    scanf("%ld", &n);
    n = 100000;
    j = 2;
    start = clock();
    printf("\n%ld", j);
    for (j = 3; j <= n; j += 2)
    {
        for (i = 3; i * i <= j; i += 2)
            if (j % i == 0)
                break;
        if (i * i > j)
            printf("\n%ld", j);
    }
    d = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("\n%f", d);
}
I got the running time of 0.015 sec when n=100000 for the above program.
I also implemented the Sieve of Eratosthenes algorithm in C and got the running time of 0.046 for n=100000.
How is my above algorithm faster than the sieve algorithm that I have implemented?
What is the time complexity of my above program??
My sieve's implementation
#define LISTSIZE 100000 // Number of integers to sieve

#include <stdio.h>
#include <math.h>
#include <time.h>

int main()
{
    clock_t start;
    double d;
    long int list[LISTSIZE], i, j;
    int listMax = (int)sqrt(LISTSIZE), primeEstimate = (int)(LISTSIZE/log(LISTSIZE));

    for (int i = 0; i < LISTSIZE; i++)
        list[i] = i+2;

    start = clock();
    for (i = 0; i < listMax; i++)
    {
        //If the entry has been set to 0 ('removed'), skip it
        if (list[i] > 0)
        {
            //Remove all multiples of this prime
            //Starting from the next entry in the list
            //And going up in steps of size i
            for (j = i+1; j < LISTSIZE; j++)
            {
                if ((list[j] % list[i]) == 0)
                    list[j] = 0;
            }
        }
    }
    d = (clock() - start) / (double)CLOCKS_PER_SEC;

    //Output the primes
    int primesFound = 0;
    for (int i = 0; i < LISTSIZE; i++)
    {
        if (list[i] > 0)
        {
            primesFound++;
            printf("%ld\n", list[i]);
        }
    }
    printf("\n%f", d);
    return 0;
}
There are a number of things that might influence your result. To be sure, we would need to see the code for your sieve implementation. Also, what is the resolution of the clock function on your computer? If the implementation does not allow for a high degree of accuracy at the millisecond level, then your results could be within the margin of error for your measurement.
I suspect the problem lies here:
//Remove all multiples of this prime
//Starting from the next entry in the list
//And going up in steps of size i
for (j = i+1; j < LISTSIZE; j++)
{
    if ((list[j] % list[i]) == 0)
        list[j] = 0;
}
This is a poor way to remove all of the multiples of the prime. Rather than testing every entry with a modulo, why not step straight through the multiples? This version should be much faster:
//Remove all multiples of this prime
//Starting from its first multiple in the list
//And going up in steps of size list[i]
//(list[k] holds the number k+2, so 2*list[i] sits at index i + list[i])
for (j = i + list[i]; j < LISTSIZE; j += list[i])
{
    list[j] = 0;
}
What is the time complexity of my above program??
To empirically measure the time complexity of your program, you need more than one data point. Run your program for multiple values of N, then make a graph of N vs. time. You can do this using a spreadsheet, GNUplot, or graph paper and pencil. You can also use software and/or plain old mathematics to find a polynomial curve that fits your data.
Non-empirically: much has been written (and lectured in computer science classes) about analyzing computational complexity. The Wikipedia article on computational complexity theory might provide some starting points for further reading.
Your sieve implementation is incorrect; that's the reason why it is so slow:
you shouldn't make it an array of numbers, but an array of flags (you may still use int as the data type, but char would do as well)
you shouldn't be using index shifts for the array, but list[i] should determine whether i is a prime or not (and not whether i+2 is a prime)
you should start the elimination with i=2
with these modifications, you should follow 1800 INFORMATION's advice, and cancel all multiples of i with a loop that goes in steps of i, not steps of 1
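Putting those bullet points together, a corrected sieve might look roughly like this (a sketch only; the flag-array layout and names are illustrative):

#include <stdio.h>
#include <string.h>

#define LISTSIZE 100000

int main(void)
{
    // is_prime[i] says whether the number i itself is prime: no index shift.
    static char is_prime[LISTSIZE];
    memset(is_prime, 1, sizeof is_prime);
    is_prime[0] = is_prime[1] = 0;

    // Start the elimination with i = 2 and cancel multiples in steps of i.
    for (long i = 2; i * i < LISTSIZE; i++) {
        if (is_prime[i]) {
            for (long j = 2 * i; j < LISTSIZE; j += i)
                is_prime[j] = 0;
        }
    }

    long primesFound = 0;
    for (long i = 2; i < LISTSIZE; i++)
        if (is_prime[i])
            primesFound++;
    printf("%ld primes below %d\n", primesFound, LISTSIZE);
    return 0;
}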
Just for your time complexity:
You have an outer loop of ~listMax iterations and an inner loop of at most LISTSIZE iterations. This means your complexity is
O(sqrt(n)*n)
where n = LISTSIZE. It is actually a bit lower, since the inner loop reduces its count each time and only runs for entries that have not been removed. But that's difficult to calculate, and since the O-notation gives an upper bound, O(sqrt(n)*n) should be ok.
The behaviour is difficult to predict, but you should take into account that accessing the memory is not cheap... it's probably faster to just calculate it again for small primes.
Those run times are too small to be meaningful. The system clock resolution is not accurate to that kind of level.
What you should do to get accurate timing information is run your algorithm in a loop. Repeat it a few thousand times to get the run time up to at least a second, then you can divide the time by the number of loops.
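As a rough sketch of that measurement loop (the work function here is a stand-in for your algorithm, not taken from your program):

#include <stdio.h>
#include <time.h>

// Stand-in for the algorithm being measured; replace with the real work.
static void work(void)
{
    volatile long s = 0;
    for (long i = 0; i < 100000; i++)
        s += i;
}

int main(void)
{
    const int repeats = 1000;   // enough repeats to exceed the clock resolution
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        work();
    double total = (clock() - start) / (double)CLOCKS_PER_SEC;
    printf("total %.3f s, per run %.6f s\n", total, total / repeats);
    return 0;
}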