Losing dependency on loop variables - loop optimization - c

I have the following nested loop computation:
int aY = a*Y, aX = a*X;
for (int i = 0; i < aY; i += a)
{
    for (int j = 0; j < aX; j += a)
    {
        xInd = i - j + offX;
        yInd = i + j + offY;
        if ((xInd >= 0) && (xInd < X) &&
            (yInd >= 0) && (yInd < Y))
        {
            z = yInd*X + xInd;
            // use z
        }
    }
}
I want to lose the dependency on i,j,xInd and yInd as much as possible. In other words, I want to "traverse" all of the values z receives while running through the loop, but without involving helping variables i,j,xInd and yInd - or at least have a minimal number of computations involved (most importantly to have no multiplications). How can I do that? Other hints to possible ways to make the loop more efficient would be welcome. Thanks!

If we read the question as how to minimize the number of iterations around the loop, we can take the following approach.
The constraints:
(xInd>=0) && (xInd<X)
(yInd>=0) && (yInd<Y)
allow us to tighten the bounds of the for loops. Expanding xInd and yInd gives:
0 <= i - j + offX < X
0 <= i + j + offY < Y
Fixing i allows us to rewrite the inner loop bounds as:
for (int i = 0; i < aY; i += a) {
    // tightest lower bound on j, clamped to the original range and then
    // rounded up to a multiple of a (assumes max()/min() helpers)
    int lo = max(max(i + offX - X + 1, -i - offY), 0);
    int lower = ((lo + a - 1) / a) * a;
    int upper = min(min(i + offX, Y - 1 - i - offY), aX - 1);
    for (int j = lower; j <= upper; j += a) {
If you know more about the possible values of offX, offY, a, X and Y further reductions may be possible.
Note that in reality you probably wouldn't want to blindly apply this type of optimisation without profiling first (it may prevent the compiler from doing this for you e.g. gcc graphite).
Use as index
If the value z=yInd*X+xInd is being used to index memory, a bigger win is achieved by ensuring that the memory accesses are sequential, for good cache behaviour.
Currently yInd changes on every iteration, so poor cache performance is likely.
A solution would be to first compute and store all the indices, then do all the memory operations in a second pass using those indices.
int indices[Y * X];
int index = 0;
for(...){
    for(...){
        ...
        indices[index++] = z;
    }
}
// sort indices (e.g. with qsort())
for (int idx = 0; idx < index; idx++) {
    z = indices[idx];
    // do stuff with z
}

If we assume that offX and offY are 0, and replace your '<'s with '<='s, we can get rid of i and j by doing this:
for (yInd = 0; yInd <= aX + aY; ++yInd)
    for (xInd = max(-yInd, -aX); xInd <= min(yInd, aY); ++xInd)

Related

Best behavior in C: for loop variables locally or globally defined

Which is considered the better practice in C, and what are the pros and cons of the two following options?
OPTION 1 (global variable):
int x;
for (x = 0; x < 100; x++) {
    // do something
}
OPTION 2 (local variable):
for (int x = 0; x < 100; x++) {
    // do something
}
EDIT: Assume I don't need the variable after looping over it.
The main benefit of option 1 is that you can use x outside the body of the loop; this matters if your loop exits early because of an error condition or something and you want to find which iteration it happened on. It's also valid in all versions of C from K&R onward.
The main benefit of option 2 is that it limits the scope of x to the loop body - this allows you to reuse x in different loops for different purposes (with potentially different types) - whether this counts as good style or not I will leave for others to argue:
for (int x = 0; x < 100; x++)
    // do something
for (size_t x = 0; x < sizeof blah; x++)
    // do something;
for (double x = 0.0; x < 1.0; x += 0.0625)
    // do something
However, this feature was introduced with C99, so it won't work with C89 or K&R implementations.
This is not an answer; it is a third option that is sort of intermediate:
{
    int x;
    for (x = 0; x < 100; x++) {
        // do something
    }
    // use x
}
This is equivalent to the 2nd option if there's no code between the two closing braces.
Always try to declare variables in the minimum scope where they are used.
This makes your code clearer and more readable.
For example, consider the first code snippet with a modification:
int x;
//... some other code
for (x = 0; x < 100; x++) {
    // do something
}
When a reader reads this code snippet and encounters the declaration
int x;
they will not know what this variable means or where it is used.
If you instead rewrite the code snippet the following way
int x;
for (x = 0; x < 100; x++) {
    // do something
}
then it will confuse readers, because they will think that the variable x is used somewhere after the for loop, when actually it is used only inside the loop.
So the best way is to write
for (int x = 0; x < 100; x++) {
    // do something
}
General good practice in C is to always reduce the scope of variables as much as possible. This reduces namespace clutter and naming collisions, plus it is good design to encapsulate logic as much as possible, making it self-contained and readable.
Meaning that option 2, declaring the loop iterator inside the loop, is the preferred practice.
The only time you should use option 1 is when you need to use the value after the loop has ended, or when you are stuck with an old C90 compiler.

Optimization of 3D Direct Convolution Implementation in C

For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
    // calculate mod to get the correct index with periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}

void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;
    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++) {
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++) {
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod(i - xShift, imageDimX);
                            outJ = mod(j - yShift, imageDimY);
                            outK = mod(k - zShift, imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the matrices (image, kernel, result) are stored in column-major fashion, and that's why I loop through them in this order, so that consecutive accesses are closer in memory (I heard this would help).
I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good, but instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (~10^9 multiply-add operations in total), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things to consider:
The problem I'm working with can be big (e.g. an image of size 200 by 200 by 200 with a kernel of size 50 by 50 by 50, or even larger). I understand that one way of optimizing this is to convert the problem into a matrix multiplication and use a BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.
Due to the nature of the problem, I would prefer direct convolution to FFT-based convolution, since my model is developed with direct convolution in mind, and my impression of FFT convolution is that it gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this, so if you have a great implementation based on FFT convolution and/or my impression of FFT convolution is totally biased, I would really appreciate your help.
The input images are assumed to be periodic, so periodic padding is necessary.
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start... I would really appreciate it if you could point me in the right direction if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem.
As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
inline int clamp( int x, int size )
{
    if( x < 0 ) return x + size;
    if( x >= size ) return x - size;
    return x;
}
These branches are very predictable because they yield the same result for long runs of consecutive elements. Integer modulo is relatively slow.
Now, the next step (ordered by cost/benefit) is parallelization. If you have any modern compiler, just enable OpenMP somewhere in the project settings. After that you need two changes:
Decorate your outermost loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from your k, for each iteration (see the sketch after this list).
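A minimal sketch of those two changes, assuming the fixed convolve3D above (the index math is illustrative, not a drop-in replacement):
#pragma omp parallel for schedule(guided)
for (int k = 0; k < imageDimZ; k++) {
    for (int j = 0; j < imageDimY; j++) {
        // imageIndex is derived from k and j instead of being carried
        // across iterations, so each thread can start anywhere
        int imageIndex = (k * imageDimY + j) * imageDimX;
        for (int i = 0; i < imageDimX; i++, imageIndex++) {
            // ... inner kernel loops as before ...
            // note: the scatter form races on result[]; the single-write
            // rework described next removes that race
        }
    }
}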
Next option: rework your code so you only write each output value once. Compute the final value in the innermost 3 loops, reading from random locations in both image and kernel, and write the result only once. With that result[outIndex] += in the inner loop, the CPU stalls waiting for data from memory. When you instead accumulate into a variable that's a register, not memory, there's no access latency.
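A hedged sketch of that gather formulation (the shift sign flips relative to the scatter version, since writing result[i - xShift] += ... is equivalent to reading image[i + xShift] for a fixed output voxel; worth double-checking against the original indexing):
double acc = 0.0;
int kernelIndex = 0;
for (int n = 0; n < kernelDimZ; n++)
    for (int m = 0; m < kernelDimY; m++)
        for (int l = 0; l < kernelDimX; l++) {
            int inI = mod(i + (l - kernelCenterX), imageDimX);
            int inJ = mod(j + (m - kernelCenterY), imageDimY);
            int inK = mod(k + (n - kernelCenterZ), imageDimZ);
            acc += kernel[kernelIndex++]
                 * image[(inK * imageDimY + inJ) * imageDimX + inI];
        }
result[(k * imageDimY + j) * imageDimX + i] = acc;  // single write per voxel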
SIMD is the most complicated optimization of these. In short, you'll need the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, to avoid being limited by latency instead of saturating throughput. Here's my answer to a much easier problem as an example of what I mean.
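To illustrate the multiple-accumulator idea only (this is not the answerer's linked code): a contiguous dot product with two independent FMA chains, assuming AVX2+FMA and n a multiple of 8; the convolution's wrapped indexing would need extra handling.
#include <immintrin.h>

static double dot(const double *a, const double *b, int n)
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 8) {
        // two chains so the loop is not serialized on FMA latency
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                               _mm256_loadu_pd(b + i), acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4),
                               _mm256_loadu_pd(b + i + 4), acc1);
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, _mm256_add_pd(acc0, acc1));
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}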

Make this for-loop more efficient?

edit: this code will be run with optimizations off.
Full transparency: this is a homework assignment.
I’m having some trouble figuring out how to optimize this code...
My instructor went over unrolling and splitting but neither seems to greatly reduce the time needed to execute the code. Any help would be appreciated!
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
Assuming you mean the same number of additions to sum at runtime (rather than the same number of additions in the source code), unrolling could give you something like:
for (j = 0; j + 5 < ARRAY_SIZE; j += 5) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4];
}
for (; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
Alternatively, since you're adding the same values each time through the outer loop, you don't need to process it N_TIMES times, just do this:
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    sum *= N_TIMES;
    break;
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
This requires that the initial value of sum is zero, which is likely but there's actually nothing in your question that mandates this, so I include it as a pre-condition for this method.
Except by cheating*, this inner loop is essentially non-optimizable, because you must fetch all the array elements and perform all the additions anyway.
The body of the loop performs:
a conditional branch on j;
a fetch of array[j];
the accumulation to a scalar variable;
the incrementation of j.
As said, 2 to 4 are inescapable. Then all you can do is reduce the number of conditional branches by loop unrolling (this turns the conditional branch into an unconditional one, at the expense of the number of iterations becoming fixed).
It is no surprise that you don't see a big difference. Modern processors are "loop aware", meaning that branch prediction is well tuned to such loops so that the cost of the branches is pretty low.
Cheating:
As others said, you can completely bypass the outer loop. This is just exploiting a flaw in the exercise statement.
As optimizations must be turned off, using inline assembly, pragmas, vector instructions or intrinsics should be banned as well (not to mention automatic parallelization).
There is a possibility to pack two ints in a long long. If the sums don't overflow, you will perform two additions at a time, as sketched below. But is this legal?
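A sketch of that packing idea, assuming 32-bit int, non-negative elements whose per-lane sums never exceed 32 bits, a little-endian target, and an even ARRAY_SIZE (a carry between lanes would silently corrupt the result, hence the legality caveat):
#include <string.h>

unsigned long long acc = 0;
int j;
for (j = 0; j + 2 <= ARRAY_SIZE; j += 2) {
    unsigned long long packed;
    memcpy(&packed, &array[j], sizeof packed); // two 32-bit ints per load
    acc += packed;                             // two lane-wise additions at once
}
sum += (int)(acc & 0xFFFFFFFFu) + (int)(acc >> 32);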
One might think of an access pattern that favors cache utilization. But here there is no hope as the array is fully traversed on every loop and there is no possibility of reuse of the values fetched.
First of all, unless you are explicitly compiling with -O0, your compiler has likely already optimized this loop much further than you could expect, including unrolling, and on top of that vectorization and more. Trying to optimize this by hand is something you should never do. At most you will make the code harder to read and understand, while most likely not even matching the compiler in terms of performance.
As to why there is no measurable gain? Possibly because you have already hit a bottleneck, even with the "non-optimized" version. For ARRAY_SIZE greater than your processor's cache, even the compiler-optimized version is already limited by memory bandwidth.
But for completeness, let's just assume you have not hit that bottleneck, and that you actually had turned optimizations almost off (so no more than -O1), and optimize for that.
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    int j;
    int tmpSum[4] = {0, 0, 0, 0};
    // stop before running past the end; a plain j < ARRAY_SIZE condition
    // would read out of bounds when ARRAY_SIZE % 4 != 0
    for (j = 0; j + 4 <= ARRAY_SIZE; j += 4) {
        tmpSum[0] += array[j + 0];
        tmpSum[1] += array[j + 1];
        tmpSum[2] += array[j + 2];
        tmpSum[3] += array[j + 3];
    }
    sum += tmpSum[0] + tmpSum[1] + tmpSum[2] + tmpSum[3];
    // handle any remaining elements (nothing is double-counted)
    for (; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
    // ... and this one. But your inner loop must do the *same
    // number of additions as this one does.
}
There is pretty much only one factor left which could still reduce the performance, for a smaller array.
Not the overhead of the loop, so plain unrolling would have been pointless on a modern processor. Don't even bother, you won't beat the branch prediction.
But the latency between two instructions, until a value written by one instruction may be read again by the next, still applies. In this case, sum is constantly written and read all over again, and even if sum is cached in a register, this delay still applies and the processor's pipeline has to wait.
The way around that, is to have multiple independent additions going on simultaneously, and finally just combine the results. This is by the way also an optimization which most modern compilers do know how to perform.
On top of that, you could now also express the first loop with vector instructions - once again also something the compiler would have done. At this point you are running into instruction latency again, so you will likely have to introduce one more set of temporaries, so that you now have two independent addition streams each using vector instructions.
Why the requirement of at least -O1? Because otherwise the compiler won't even place tmpSum in a register, or will try to express e.g. array[j+0] as a sequence of instructions for performing the addition first, rather than just using a single instruction for that. Hardly possible to optimize in that case, without using inline assembly directly.
Or if you just feel like (legit) cheating:
const int N_TIMES = 1000;
const int ARRAY_SIZE = 1024;
const int array[1024] = {1};
int sum = 0;

__attribute__((optimize("O3")))
__attribute__((optimize("unroll-loops")))
int fastSum(const int array[]) {
    int j;
    int tmpSum = 0; // must be initialized; the original left it indeterminate
    for (j = 0; j < ARRAY_SIZE; j++) {
        tmpSum += array[j];
    }
    return tmpSum;
}

int main() {
    int i;
    for (i = 0; i < N_TIMES; i++) {
        // You can change anything between this comment ...
        sum += fastSum(array);
        // ... and this one. But your inner loop must do the *same
        // number of additions as this one does.
    }
    return sum;
}
The compiler will then apply pretty much all the optimizations described above.

multiple if conditions optimisation

I am building a simple C project (for Arduino) and I have come across this question. It's not actually that language-specific; it's more of an algorithm optimisation thing.
So, I need to check a value of X against a sensor reading.
if (X < 5) ...
else if (X < 10) ...
else if (X < 15) ...
else if (X < 20) ...
Then in each clause I have the same for loop, but the iterations change depending on the value of X.
In a general sense, how can these if conditions be replaced by something unified? I remember these "gradation" or "leveling" problems in high school, but we still used if clauses.
In a comment below you've said (in reference to the second solution under the bar using an array):
I actually do not need the second dimension,as the value ranges are defined in the first dimension/column (5 10 15 20 etc)
In that case, it's really much simpler than the solutions below:
int loops = ((X / 5) + 1) * 5;
...assuming X is an int. That uses integer division, which truncates (e.g., 4 / 5 is 0), adds one, then multiplies the result by 5. Here's the same thing in JavaScript just for an on-site example (in JavaScript, since numbers are always floating point, we have to add in a flooring method, but you don't need that in C):
var X;
for (X = 0; X < 25; ++X) {
    var loops = (Math.floor(X / 5) + 1) * 5;
    console.log("X = " + X + ", loops = " + loops);
}
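For reference, a minimal sketch of the same demonstration in C (integer division already truncates, so no flooring call is needed):
#include <stdio.h>

int main(void)
{
    for (int X = 0; X < 25; ++X) {
        int loops = (X / 5 + 1) * 5;
        printf("X = %d, loops = %d\n", X, loops);
    }
    return 0;
}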
Then in each clause I have the same for loop, but the iterations change depending on the value of X.
I'd set a variable to the number of iterations, then put the for loop after the if/else sequence.
int loops;
if (X < 5) {
    loops = /*whatever*/;
} else if (X < 10) {
    loops = /*whatever*/;
} else if (X < 15) {
    loops = /*whatever*/;
// ...and so on...
} else {
    loops = /*whatever for the catch-all case*/;
}
for (int i = 0; i < loops; ++i) {
    // ...
}
If you're trying to avoid the if/else, if there are only a small number of possible sensor values, you could use a switch instead, which in some languages is compiled to a jump table and so fairly efficient.
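For instance, a minimal sketch of the switch form (the loop counts are illustrative values only, matching the array example below):
int loops;
switch (X / 5) {       // 0-4 -> case 0, 5-9 -> case 1, ...
case 0:  loops = 500;  break;
case 1:  loops = 700;  break;
case 2:  loops = 800;  break;
case 3:  loops = 1200; break;
default: loops = 1500; break;
}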
If you want to have the ranges held as data rather than in an if/else sequence, you could use an array of values:
int values[][2] = {
    {5, 500},
    {10, 700},
    {15, 800},
    {20, 1200},
    {0, 1500} // 0 is a flag value
};
(There I'm using a two-dimensional int array, but the rows could be small structs instead.)
Then loop through the array looking for the first entry whose bound exceeds X (or whose bound is 0, flagging the last entry).
int loops = 0; // 0 will never be used, but the compiler doesn't know that
for (size_t n = 0; n < sizeof values / sizeof values[0]; n++) {
    if (values[n][0] == 0 || X < values[n][0]) {
        loops = values[n][1];
        break;
    }
}
...followed by the for loop using loops.
Since your intervals are multiples of 5, it may be possible to just divide X by 5 and use the result as an index into an array.
const size_t loops[] = {someval, anotherval, ..., lastval};
size_t i, nloops = loops[X / 5];
for (i = 0; i < nloops; i++) {
...
}

Loop Optimization in C

I have been tasked with optimizing a particular for loop in C. Here is the loop:
#define ARRAY_SIZE 10000
#define N_TIMES 600000
for (i = 0; i < N_TIMES; i++)
{
    int j;
    for (j = 0; j < ARRAY_SIZE; j++)
    {
        sum += array[j];
    }
}
I'm supposed to use loop unrolling, loop splitting, and pointers in order to speed it up, but every time I try to implement something, the program doesn't return. Here's what I've tried so far:
for (i = 0; i < N_TIMES; i++)
{
    int j, k;
    for (j = 0; j < ARRAY_SIZE; j++)
    {
        for (k = 0; k < 100; k += 2)
        {
            sum += array[k];
            sum += array[k + 1];
        }
    }
}
I don't understand why the program doesn't even return now. Any help would be appreciated.
That second piece of code is both inefficient and wrong, since it adds values more times than the original code does.
The loop unrolling (or partial unrolling in this case, since you probably don't want to fully unroll a ten-thousand-iteration loop) would be:
// Ensure ARRAY_SIZE is a multiple of two before trying this.
for (int i = 0; i < N_TIMES; i++)
    for (int j = 0; j < ARRAY_SIZE; j += 2)
        sum += array[j] + array[j+1];
But, to be honest, the days of dumb compilers have long since gone. You should generally leave this level of micro-optimisation up to your compiler, while you concentrate on the more high-level stuff like data structures, algorithms and human analysis.
That last one is rather important. Since you're adding the same array to an accumulated sum a constant number of times, you only really need the sum of the array once, then you can add that partial sum as many times as you want:
int temp = 0;
for (int i = 0; i < ARRAY_SIZE; i++)
    temp += array[i];
sum += temp * N_TIMES;
It's still O(n) but with a much lower multiplier on the n (one rather than six hundred thousand). It may be that gcc's insane optimisation level of -O3 could work that out but I doubt it. The human brain can still outdo computers in a lot of areas.
For now, anyway :-)
There is nothing wrong with your program... it will return. It is just going to take 50 times longer than the first one...
In the first you had 2 fors: 600,000 * 10,000 = 6,000,000,000 iterations.
In the second you have 3 fors: 600,000 * 10,000 * 50 = 300,000,000,000 iterations...
Loop unrolling doesn't speed loops up, it slows them down. In olden times it gave you a speed bump by reducing the number of conditional evaluations. In modern times it slows you down by bloating the code and pressuring the instruction cache.
There's no obvious use case for loop splitting here. To split a loop you're looking for two or more obvious groupings in the iterations. At a stretch you could multiply array[j] by N_TIMES rather than doing the outer loop, and claim you've split the inner from the outer, then discarded the outer as useless.
C array-indexing syntax is just defined as (a peculiar syntax for) pointer arithmetic. But I guess you'd want something like:
sum += *arrayPointer++;
In place of your use of j, with things initialised suitably. But I doubt you'll gain anything from it.
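A minimal sketch of that, assuming array and ARRAY_SIZE as defined in the question:
// walk the array with a pointer instead of an index
const int *arrayPointer = array;
const int *end = array + ARRAY_SIZE;
while (arrayPointer < end) {
    sum += *arrayPointer++;
}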
As per the comments, if this were real life then you'd just let the compiler figure this stuff out.
