This code will be run with optimizations off.
Full transparency: this is a homework assignment.
I'm having some trouble figuring out how to optimize this code.
My instructor went over unrolling and splitting, but neither seems to greatly reduce the time needed to execute the code. Any help would be appreciated!
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}
Assuming you mean same number of additions to sum at runtime (rather than same number of additions in the source code), unrolling could give you something like:
for (j = 0; j + 4 < ARRAY_SIZE; j += 5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for (; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
Alternatively, since you're adding the same values each time through the outer loop, you don't need to process the array N_TIMES times. Just do this:
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
sum *= N_TIMES;
break;
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}
This requires that the initial value of sum is zero, which is likely but there's actually nothing in your question that mandates this, so I include it as a pre-condition for this method.
Except by cheating*, this inner loop is essentially non-optimizable, because you must fetch all the array elements and perform all the additions anyway.
The body of the loop performs:
1. a conditional branch on j;
2. a fetch of array[j];
3. the accumulation into a scalar variable;
4. the increment of j.
As said, 2. to 4. are inescapable. All you can do, then, is reduce the number of conditional branches by loop unrolling (this turns the conditional branch into an unconditional one, at the expense of the number of iterations becoming fixed).
It is no surprise that you don't see a big difference. Modern processors are "loop aware", meaning that branch prediction is well tuned to such loops so that the cost of the branches is pretty low.
Cheating:
As others said, you can completely bypass the outer loop. This is just exploiting a flaw in the exercise statement.
As optimizations must be turned off, using inline assembly, pragmas, vector instructions or intrinsics should be banned as well (not to mention automatic parallelization).
It is possible to pack two ints into a long long. If the sums don't overflow, you will perform two additions at a time. But is this legal?
One might think of an access pattern that favors cache utilization. But here there is no hope as the array is fully traversed on every loop and there is no possibility of reuse of the values fetched.
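The packing idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not something the exercise necessarily allows: packed_sum is a hypothetical helper, the elements are assumed to be non-negative 32-bit values, and each of the two running sums is assumed to stay below 2^32 so that no carry leaks from one half of the accumulator into the other.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: sums n non-negative 32-bit values two at a time.
   Assumes each lane's partial sum stays below 2^32, so no carry crosses
   from the low half of the accumulator into the high half. */
uint64_t packed_sum(const uint32_t *array, size_t n)
{
    uint64_t acc = 0;   /* two 32-bit running sums packed in one word */
    size_t j;

    for (j = 0; j + 1 < n; j += 2) {
        uint64_t pair;
        memcpy(&pair, &array[j], sizeof pair);  /* load two elements at once */
        acc += pair;                            /* one add updates both lanes */
    }

    /* Combine the two lanes; which lane held the even or odd indices does
       not matter because both are added to the total. */
    uint64_t total = (acc & 0xFFFFFFFFu) + (acc >> 32);

    if (j < n)          /* odd n: one element left over */
        total += array[j];
    return total;
}
```

Note that this halves the number of add instructions in the loop, which is exactly why it may count as cheating under the "same number of additions" rule.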
First of all, unless you are explicitly compiling with -O0, your compiler has likely already optimized this loop much further than you would expect, including unrolling and, on top of unrolling, vectorization and more.
Trying to optimize this by hand is something you should never, absolutely never do. At most you will succeed in making the code harder to read and understand, while most likely not even matching the compiler in terms of performance.
As to why there is no measurable gain: possibly because you have already hit a bottleneck, even with the "non-optimized" version. For an ARRAY_SIZE greater than your processor's cache, even the compiler-optimized version is already limited by memory bandwidth.
But for completeness, let's just assume you have not hit that bottleneck, and that you actually had turned optimizations almost off (so no more than -O1), and optimize for that.
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    int tmpSum[4] = {0,0,0,0};
    // Stop while a full group of four elements is still available.
    for (j = 0; j + 3 < ARRAY_SIZE; j += 4) {
        tmpSum[0] += array[j+0];
        tmpSum[1] += array[j+1];
        tmpSum[2] += array[j+2];
        tmpSum[3] += array[j+3];
    }
    sum += tmpSum[0] + tmpSum[1] + tmpSum[2] + tmpSum[3];
    // Add the leftover elements when ARRAY_SIZE is not a multiple of 4.
    for (; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
For a smaller array, there is pretty much only one factor left which could still have reduced the performance.
It is not the loop overhead, so plain unrolling would have been pointless on a modern processor. Don't even bother, you won't beat the branch prediction.
What still applies is the latency between two dependent instructions: a value written by one instruction cannot be read by the next until it is available. Here, sum is written and read again on every iteration, and even if sum is kept in a register, this delay still applies and the processor's pipeline has to wait.
The way around that is to have multiple independent additions going on simultaneously, and to combine the results at the end. This is, by the way, also an optimization which most modern compilers know how to perform.
On top of that, you could now also express the first loop with vector instructions - once again also something the compiler would have done. At this point you are running into instruction latency again, so you will likely have to introduce one more set of temporaries, so that you now have two independent addition streams each using vector instructions.
Why the requirement of at least -O1? Because otherwise the compiler won't even place tmpSum in a register, or will try to express e.g. array[j+0] as a sequence of instructions for performing the addition first, rather than just using a single instruction for that. Hardly possible to optimize in that case, without using inline assembly directly.
Or if you just feel like (legit) cheating:
const int N_TIMES = 1000;
const int ARRAY_SIZE = 1024;
const int array[1024] = {1};
int sum = 0;
__attribute__((optimize("O3")))
__attribute__((optimize("unroll-loops")))
int fastSum(const int array[]) {
int j;
int tmpSum = 0;
for (j = 0; j < ARRAY_SIZE; j++) {
tmpSum += array[j];
}
return tmpSum;
}
int main() {
int i;
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
sum += fastSum(array);
// ... and this one. But your inner loop must do the *same
// number of additions as this one does.
}
return sum;
}
The compiler will then apply pretty much all the optimizations described above.
These are two examples of incrementing each element of an array by 10.
for (i = 0; i< 100; i++){
arr[i] += 10;
}
or
for (i = 0; i< 100; i+=2){
arr[i] += 10;
arr[i+1] += 10;
}
Which of these two is the more efficient way to solve this problem in C?
Don't worry about it. Your compiler will make this optimization if necessary.
For example, clang 10 unrolls this completely and uses vector instructions to do multiple at once.
As @JeremyRoman stated, the compiler will be better than humans at optimizing the code.
But you can make its work easier or harder. In your example, the second way prevents gcc from unrolling the loop.
So keep it simple; do not try to prematurely micro-optimize your code, as the result might be the opposite of what you expected.
https://godbolt.org/z/jYcLpT
Let's look at "better" and "efficient" outside of run-time performance [1].
Bug!
Only 3 or 4 lines of code, and the one with 4 is incorrect. What if arr[] and ar[] both existed? The compiler would not complain, yet the code would certainly be incorrect.
//ar[i+1] += 10;
arr[i+1] += 10;
Coding
The below wins. Short and easy to code. No concern about whether arr[i+1] += 10; accesses arr[100].
for (i = 0; i< 100; i++){
arr[i] += 10;
}
Review
The below wins. Clear, to the point. I had to study the other one longer to be sure of its correctness, which is an inefficient use of review time. Defensibility: I'd have no trouble defending this code.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Maintenance
The below wins. Change i < 100 to i < N and this code is still fine; the other can readily break.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Optimization possibilities
The below wins. Compilers do a fine job at optimizing common idioms. The second requires more analysis, with a greater chance that the compiler will not optimize it well.
for (i = 0; i< 100; i++) {
arr[i] += 10;
}
Score
Outside of performance:
5 to 0
[1] Notice that OP never explicitly asked to judge this only on run-time performance, so let us consider various ideas of "better".
I'm having a bit of trouble figuring out the Big O run time for two sets of code samples where the number of iterations depends on the outer loops. I have a basic understanding of Big O run times and I can figure out the run times for simpler code samples. I'm not too sure how some lines affect the run time.
I would consider this first one O(n^2). However, I'm not certain.
for(i = 1; i < n; i++){
for(j = 1000/i; j > 0; j--){ /* not sure if this is still O(n) */
arr[j]++; /* THIS LINE */
}
}
I'm a bit more lost with this one. O(n^3) possibly O(n^2)?
for(i = 0; i < n; i++){
for(j = i; j < n; j++){
while( j<n ){
arr[i] += arr[j]; /* THIS LINE */
j++;
}
}
}
I found this post and I applied this to the first code sample but I'm still unsure about the second. What is the Big-O of a nested loop, where number of iterations in the inner loop is determined by the current iteration of the outer loop?
Regarding the first one. It is not O(n^2)!!! For the sake of simplicity and readability, let's rewrite it in the form of pseudocode:
for i in [1, 2, ... n]: # outer loop
for j in [1, 2, ... 1000/i]: # inner loop
do something with time complexity O(1). # constant-time operation
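Now, the number of constant-time operations within the inner loop (which depends on parameter i of the outer loop) can be expressed as:
$$T_{\text{inner}}(i) = \frac{1000}{i}$$
Now, we can calculate the number of constant-time operations overall:
$$T(n) = \sum_{i=1}^{n} \frac{1000}{i} = 1000 \sum_{i=1}^{n} \frac{1}{i} = 1000 \cdot N(n)$$
Here, N(n) is a harmonic number (see wikipedia), and there is a very interesting property of these numbers:
$$N(n) = \ln n + C + O\!\left(\frac{1}{n}\right)$$
Where C is the Euler–Mascheroni constant. Therefore, the complexity of the first algorithm is:
$$T(n) = O(\log n)$$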
Regarding the second one. It seems like either the code contains a mistake, or it is a trick test question. The code resolves to
for (i = 1; i < n; i++)
for(j = i; j < n; j++){
arr[j]++;
j++;
}
The inner loop takes
$$\frac{n - i}{2}$$
operations, so we can calculate overall complexity:
$$\sum_{i=1}^{n-1} \frac{n - i}{2} = \frac{n(n-1)}{4} = O(n^2)$$
For the second loop (which it appears that you still need an answer for), you have sort of a misleading bit of code, where you have 3 nested loops, so at first glance, it makes sense that the runtime is O(n^3).
However, this is incorrect. This is because the innermost while loop modifies j, the same variable that the for loop modifies. This code is actually equivalent to this bit of code below:
for(i = 0; i < n; i++){
for(j = i; j < n; j++){
arr[i] += arr[j]; /* THIS LINE */
j++;
}
}
This is because the while loop on the inside will run, incrementing j until j == n, then it breaks out. At that point, the inner for loop will increment j again and compare it to n, where it will find that j >= n, and exit. You should be familiar with this case already, and recognize it as O(n^2).
Just a note: the second bit of code is technically not safe, as j may overflow when it is incremented one more time after the while loop finishes running. Signed overflow is undefined behavior in C, and it could cause the for loop to run forever. However, this can only occur when n == INT_MAX.
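To check the quadratic claim empirically, here is a minimal sketch that keeps only the loop structure of the second sample and counts how often the marked line runs. The function name count_marked_line is made up for this illustration, and the array accesses are replaced by a counter.

```c
/* Counts how often the marked line runs in the second sample for a given n.
   The array accesses are dropped; only the loop structure is kept. */
long count_marked_line(int n)
{
    long count = 0;
    int i, j;

    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            while (j < n) {
                count++;    /* stands in for arr[i] += arr[j] */
                j++;
            }
        }
    }
    return count;
}
```

For each i the while loop runs n - i times and the for loop then exits, so the total is n + (n-1) + ... + 1 = n(n+1)/2, which grows as n^2 rather than n^3.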
I have been tasked with optimizing a particular for loop in C. Here is the loop:
#define ARRAY_SIZE 10000
#define N_TIMES 600000
for (i = 0; i < N_TIMES; i++)
{
int j;
for (j = 0; j < ARRAY_SIZE; j++)
{
sum += array[j];
}
}
I'm supposed to use loop unrolling, loop splitting, and pointers in order to speed it up, but every time I try to implement something, the program doesn't return. Here's what I've tried so far:
for (i = 0; i < N_TIMES; i++)
{
int j,k;
for (j = 0; j < ARRAY_SIZE; j++)
{
for (k = 0; k < 100; k += 2)
{
sum += array[k];
sum += array[k + 1];
}
}
}
I don't understand why the program doesn't even return now. Any help would be appreciated.
That second piece of code is both inefficient and wrong, since it adds more values than the original code does.
The loop unrolling (or lessening in this case since you probably don't want to unroll a ten-thousand-iteration loop) would be:
// Ensure ARRAY_SIZE is a multiple of two before trying this.
for (int i = 0; i < N_TIMES; i++)
for (int j = 0; j < ARRAY_SIZE; j += 2)
sum += array[j] + array[j+1];
But, to be honest, the days of dumb compilers are long gone. You should generally leave this level of micro-optimisation up to your compiler, while you concentrate on the more high-level stuff like data structures, algorithms and human analysis.
That last one is rather important. Since you're adding the same array to an accumulated sum a constant number of times, you only really need the sum of the array once, then you can add that partial sum as many times as you want:
int temp = 0;
for (int i = 0; i < ARRAY_SIZE; i++)
temp += array[i];
sum += temp * N_TIMES;
It's still O(n) but with a much lower multiplier on the n (one rather than six hundred thousand). It may be that gcc's insane optimisation level of -O3 could work that out but I doubt it. The human brain can still outdo computers in a lot of areas.
For now, anyway :-)
There is nothing wrong with your program... it will return. It is only going to take 50 times longer than the first one.
In the first you had 2 for loops: 600,000 * 10,000 = 6,000,000,000 iterations.
In the second you have 3 for loops: 600,000 * 10,000 * 50 = 300,000,000,000 iterations.
Loop unrolling doesn't necessarily speed loops up; it can slow them down. In olden times it gave you a speed bump by reducing the number of conditional evaluations. On modern processors it can slow you down by bloating the code and putting pressure on the instruction cache.
There's no obvious use case for loop splitting here. To split a loop you're looking for two or more obvious groupings in the iterations. At a stretch you could multiply array[j] by i rather than doing the outer loop and claim you've split the inner from the outer, then discarded the outer as useless.
C array-indexing syntax is just defined as (a peculiar syntax for) pointer arithmetic. But I guess you'd want something like:
sum += *arrayPointer++;
In place of your use of j, with things initialised suitably. But I doubt you'll gain anything from it.
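A minimal sketch of what "initialised suitably" could look like, assuming the array holds int values. The helper name sum_with_pointer and the end pointer are choices made for this illustration, not part of the assignment.

```c
#include <stddef.h>

/* Hypothetical helper: sums n ints by walking a pointer instead of indexing. */
long sum_with_pointer(const int *array, size_t n)
{
    const int *p = array;
    const int *end = array + n;   /* one past the last element */
    long sum = 0;

    while (p < end)
        sum += *p++;              /* read the element, then advance */
    return sum;
}
```

In the assignment this would replace the inner for loop, with p reset to the start of the array on each pass of the outer loop.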
As per the comments, if this were real life then you'd just let the compiler figure this stuff out.
Actually I have two questions. The first one is: considering the cache, which of the following pieces of code is faster?
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[i][j]++;
}
}
or
int a[10000][10000];
for(int i = 0; i < 10000; i++){
for(int j = 0; j < 10000; j++){
a[j][i]++;
}
}
I am guessing the first one will be much faster since there are far fewer cache misses. My second question is: if you are using OpenMP, what kind of technique would you use to optimise such a nested loop? My strategy is to divide the outer loop into 4 chunks and assign them to 4 cores. Is there any better (more cache-friendly) way to do it?
Thanks!
Bob
As maxihatop pointed out, the first one performs better because it has better cache locality.
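If you want to measure the difference yourself, something along these lines should work. It is a rough sketch: the function names are invented for this example, the matrix is stored as a flat row-major array so its size can be chosen at run time, and clock() only gives coarse CPU-time resolution.

```c
#include <stddef.h>
#include <time.h>

/* Increment every element of an n x n matrix stored row-major in a flat
   array. The row-order version walks memory contiguously; the column-order
   version jumps n elements between consecutive accesses. */
void increment_row_order(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j]++;
}

void increment_col_order(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[j * n + i]++;
}

/* Returns the CPU time in seconds taken by one call of f(a, n). */
double time_traversal(void (*f)(int *, size_t), int *a, size_t n)
{
    clock_t start = clock();
    f(a, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Allocate the matrix with malloc (a 10000 x 10000 int array is far too large for the stack) and call time_traversal once with each function; both leave the array in the same state, so only the timings differ.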
Dividing the outer loop into chunks is a good strategy in a case like this, where the cost of the work inside the loop is constant.
You might want to take a look at #pragma omp parallel for schedule(static). This divides the iterations evenly and contiguously among threads. So your code should look like:
#pragma omp parallel for schedule(static)
for (int i = 0; i < 10000; i++) {
    for (int j = 0; j < 10000; j++) {
        a[i][j]++;
    }
}
Lawrence Livermore National Laboratory provides a fantastic tutorial of OpenMP. You can find more information there.
https://computing.llnl.gov/tutorials/openMP/
I'm fairly new to C, not having had much need for anything faster than Python for most of my research. However, it turns out that recent work I've been doing requires the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker Product of selected pairs of these vectors, and then sum these kronecker products.
The question is how to do this efficiently. Is there anything wrong with the following structure of code, using for loops, to obtain this effect?
The function kron described below takes vectors A and B of length vector_size and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest first step would be to take advantage of several cores before thinking of MPI. OpenMP should do quite fine for this.
#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
for (int j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
This is supported by many compilers nowadays.
You could also try to hoist some common expressions out of the inner loop, but decent compilers, e.g. gcc, icc or clang, should do this quite well all by themselves:
#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
int const x = A[i];
int * vec = &C[i*vector_size];
for (int j = 0; j < vector_size; ++j) {
vec[j] = x * B[j];
}
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
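To illustrate the claim that a sum of outer products is a matrix-matrix product, here is a plain-C sketch that does not call BLAS at all. The function names and the storage layout (k vectors of length n stored as the rows of X and Y) are assumptions made for this example. The first function adds the outer products one at a time, which is what repeated DGER-style updates would do; the second computes the same matrix as X^T * Y, which is the kind of product DGEMM is built for.

```c
#include <stddef.h>

/* X and Y each hold k vectors of length n, one per row (row-major).
   C is n x n and must be zeroed by the caller. */

/* Adds the k outer products one at a time: C += x_p * y_p^T. */
void sum_outer_products(const double *X, const double *Y, double *C,
                        size_t k, size_t n)
{
    for (size_t p = 0; p < k; p++)
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += X[p * n + i] * Y[p * n + j];
}

/* Computes the same result as one matrix product: C += X^T * Y. */
void transpose_product(const double *X, const double *Y, double *C,
                       size_t k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t p = 0; p < k; p++)
                acc += X[p * n + i] * Y[p * n + j];
            C[i * n + j] += acc;
        }
}
```

Both functions perform the same multiplications, only in a different loop order, so a tuned matrix-multiply routine can stand in for the whole accumulation.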
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
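Put together, the function could look like the sketch below. The name kron_restrict and the added const qualifiers on the two inputs are choices made for this illustration; the original signature above has neither.

```c
/* Sketch of kron with restrict-qualified parameters (C99). The caller
   promises that C does not overlap A or B. */
void kron_restrict(const int *restrict A, const int *restrict B,
                   int *restrict C, int vector_size)
{
    for (int i = 0; i < vector_size; i++) {
        for (int j = 0; j < vector_size; j++) {
            C[i * vector_size + j] = A[i] * B[j];
        }
    }
}
```

Whether this helps in practice depends on the compiler and optimization level, so it is worth comparing the generated code or timings with and without restrict.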
Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding their products to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] += alpha * A[i] * B[j];
}
}
return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 1 and C being a zero-initialised matrix/vector of the correct dimensionality.
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computational circles, that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that would be easy to implement is that if you know that the inner dimension of your arrays will be divisible by n then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
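As an illustration of the unrolling idea, here is a sketch of the kron inner loop unrolled by four. Instead of requiring the size to be divisible by four or switching on it, this variant uses a cleanup loop for the leftover elements; that substitution, the name kron_unrolled and the const qualifiers are all choices made for this example.

```c
/* Sketch: kron with the inner loop unrolled by four, plus a cleanup loop
   for when vector_size is not a multiple of four. */
void kron_unrolled(const int *A, const int *B, int *C, int vector_size)
{
    for (int i = 0; i < vector_size; i++) {
        const int a = A[i];                 /* read A[i] once per row */
        int *row = &C[i * vector_size];     /* start of output row i */
        int j;

        for (j = 0; j + 3 < vector_size; j += 4) {
            row[j]     = a * B[j];
            row[j + 1] = a * B[j + 1];
            row[j + 2] = a * B[j + 2];
            row[j + 3] = a * B[j + 3];
        }
        for (; j < vector_size; j++)        /* leftover 0 to 3 elements */
            row[j] = a * B[j];
    }
}
```

An optimizing compiler will often generate something similar on its own, so measure before keeping a hand-unrolled version.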
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
int d = *A++;
int *e = B;
for (j = 0; j < vector_size; j++) {
*C++ = *e++ * d;
}
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)
To solve your problem, I think you should try Eigen 3. It's a C++ library that provides the usual matrix functions!
If you have time, take a look at its documentation! =)
Good luck !
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
A[i]=i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
B[i]=i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
C[i]=0;
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
for (uint32_t j=0, allB=rB*cB; j<allB; j++)
C[((i/lda)*rB+j/ldb)*ldc
+ (i%lda)*cB+j%ldb ]=A[i]*B[j];
}