Loop Unrolling a Multi-Dimensional Array - C

I recently tried unrolling the inner i and j loops within this multi-dimensional array, but the filter->get(i,j) always messes up the texture of the image. Can anyone assist me with unrolling the i and j loop? Thanks.
My attempt:
double
applyFilter(struct Filter *filter, cs1300bmp *input, cs1300bmp *output)
{
long long cycStart, cycStop;
cycStart = rdtscll();
output -> width = input -> width;
output -> height = input -> height;
int a = filter -> getDivisor();
int n = filter -> getSize();
for (int plane = 0; plane < 3; plane++){
for(int row = 1; row < (input -> height) - 1 ; row = row + 1) {
for(int col = 1; col < (input -> width) - 1; col = col + 1) {
int value = 0;
int val1, val2;
for (int j = 0; j < n; j++) {
for (int i = 0; i < n; i+=2) {
val1 = val1 + input -> color[plane][row + i - 1][col + j - 1]
* filter -> get(i, j);
val2 = val2 + input -> color[plane][row + i][col + j -1] * filter->get(i+1,j);
}
}
value = (val1 + val2) / a;
if ( value < 0 ) { value = 0; }
if ( value > 255 ) { value = 255; }
output -> color[plane][row][col] = value;
}
}
}
cycStop = rdtscll();
double diff = cycStop - cycStart;
double diffPerPixel = diff / (output -> width * output -> height);
fprintf(stderr, "Took %f cycles to process, or %f cycles per pixel\n",
diff, diff / (output -> width * output -> height));
return diffPerPixel;
}
Original:
int a = filter -> getDivisor();
int n = filter -> getSize();
for (int plane = 0; plane < 3; plane++){
for(int row = 1; row < (input -> height) - 1 ; row = row + 1) {
for(int col = 1; col < (input -> width) - 1; col = col + 1) {
int value = 0;
for (int j = 0; j < n; j++) {
for (int i = 0; i < n; i++) {
value = value + input -> color[plane][row + i - 1][col + j - 1]
* filter -> get(i, j);
}
}
value = value / a;
if ( value < 0 ) { value = 0; }
if ( value > 255 ) { value = 255; }
output -> color[plane][row][col] = value;

Try replacing the inner loop with:
int value = 0;
int val1 = 0, val2 = 0;
for (int j = 0; j < n; j++) {
    int i;
    // Process two rows per pass; stop while a full pair (i, i+1) still fits
    for (i = 0; i + 1 < n; i += 2) {
        val1 += input->color[plane][row+i-1][col+j-1] * filter->get(i,j);
        val2 += input->color[plane][row+i  ][col+j-1] * filter->get(i+1,j);
    }
    // Odd n: one row is left over
    if (i < n)
        val1 += input->color[plane][row+i-1][col+j-1] * filter->get(i,j);
}
value = (val1 + val2) / a;

Your method is only correct if n is a multiple of 2. For odd n, the last pass of your loop calls filter->get(n, j), which is outside the filter.
ADDED:
First of all, I just realized that you forgot to initialize val1 and val2, which is probably the main reason for your problems.
Second, it seems to me that your code was written specifically for filter sizes of 3:
For smaller filters you don't access the borders at all.
For bigger ones, you access positions outside of the picture, e.g.
[row + i - 1] becomes greater than or equal to input->height.
If you only want to use filters of size 3, then I would simply unroll the inner loops completely. Otherwise, check the boundaries for the row and col values.
Now, for loop unrolling, I would recommend doing a Google search, as you can find many examples of how to do that properly. One can be found on the Wikipedia page.
In your case, the simplest solution would be:
int value = 0;
int val1=0, val2=0;
for (int j = 0; j < n; j++) {
    for (int i = 0; i < n-1; i+=2) {
        val1 = val1 + input->color[plane][row+i-1][col+j-1] * filter->get(i  ,j);
        val2 = val2 + input->color[plane][row+i  ][col+j-1] * filter->get(i+1,j);
    }
    if (n%2 !=0) {
        val1 = val1 + input->color[plane][row+n-2][col+j-1] * filter->get(n-1,j);
    }
}
value = (val1 + val2) / a;
In case you want to unroll the loop even more, the more generic way would be (e.g. for 4):
int value = 0;
int val1=0, val2=0, val3=0, val4=0;
for (int j = 0; j < n; j++) {
    for (int i = 0; i < n-3; i+=4) {
        val1 = val1 + input->color[plane][row+i-1][col+j-1] * filter->get(i  ,j);
        val2 = val2 + input->color[plane][row+i  ][col+j-1] * filter->get(i+1,j);
        val3 = val3 + input->color[plane][row+i+1][col+j-1] * filter->get(i+2,j);
        val4 = val4 + input->color[plane][row+i+2][col+j-1] * filter->get(i+3,j);
    }
    // Leftover rows when n is not a multiple of 4; each case falls through to the next
    switch (n % 4) {
        case 3: val1+=input->color[plane][row+n-4][col+j-1] * filter->get(n-3,j);
        case 2: val1+=input->color[plane][row+n-3][col+j-1] * filter->get(n-2,j);
        case 1: val1+=input->color[plane][row+n-2][col+j-1] * filter->get(n-1,j);
    }
}
value = (val1 + val2 + val3 + val4) / a;
NOTE:
Please be aware that, depending on the size of your filter, the compiler and compiler options used, and your system, the solutions above might not speed up your code and may even slow it down. You should also be aware that the compiler can usually do loop unrolling for you (e.g. with the -funroll-loops option in gcc) if it makes sense.
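The unroll-by-two pattern with its remainder step can be checked in isolation. The sketch below is only a stand-in under stated assumptions: it replaces input->color and filter->get with two plain int arrays (pix and w are hypothetical names, not part of the original code) and compares the unrolled sum with the plain loop.

```c
#include <stddef.h>

/* Straightforward sum of products, one element per iteration. */
int dot_plain(const int *pix, const int *w, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += pix[i] * w[i];
    return sum;
}

/* Same sum, unrolled by two, with a remainder step for odd n. */
int dot_unrolled2(const int *pix, const int *w, size_t n) {
    int s1 = 0, s2 = 0;          /* both accumulators must start at zero */
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s1 += pix[i] * w[i];
        s2 += pix[i + 1] * w[i + 1];
    }
    if (i < n)                   /* one leftover element when n is odd */
        s1 += pix[i] * w[i];
    return s1 + s2;
}
```

Calling both functions with the same arrays for every length from 0 up to the array size is a quick way to confirm that odd and even sizes agree before timing anything.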

Related

Array processing gives -nan(ind) as the result

I am writing a program that creates arrays of a given length and manipulates them. No third-party libraries may be used.
First, an array M1 of length N is formed, then an array M2 of length N/2.
In M1, each element is divided by Pi and then raised to the third power.
Then, in M2, each element is added to the previous one, and the absolute value of the tangent of that sum is taken.
After that, the elements of M1 and M2 with the same indexes are combined by exponentiation, and the resulting array is sorted with gnome sort (dwarf_sort in the code).
Finally, the sum of the sines is calculated over those elements of M2 that give an even number when divided by the minimum non-zero element of M2.
The problem is that the result X is -nan(ind). I can't figure out exactly where the error is.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
const int A = 441;
const double PI = 3.1415926535897931159979635;
inline void dwarf_sort(double* array, int size) {
size_t i = 1;
while (i < size) {
if (i == 0) {
i = 1;
}
if (array[i - 1] <= array[i]) {
++i;
}
else
{
long tmp = array[i];
array[i] = array[i - 1];
array[i - 1] = tmp;
--i;
}
}
}
inline double reduce(double* array, int size) {
size_t i;
double min = RAND_MAX, sum = 0;
for (i = 0; i < size; ++i) {
if (array[i] < min && array[i] != 0) {
min = array[i];
}
}
for (i = 0; i < size; ++i) {
if ((int)(array[i] / min) % 2 == 0) {
sum += sin(array[i]);
}
}
return sum;
}
int main(int argc, char* argv[])
{
int i, N, j;
double* M1 = NULL, * M2 = NULL, * M2_copy = NULL;
double X;
unsigned int seed = 0;
N = atoi(argv[1]); /* N is the first command-line argument */
M1 = malloc(N * sizeof(double));
M2 = malloc(N / 2 * sizeof(double));
M2_copy = malloc(N / 2 * sizeof(double));
for (i = 0; i < 100; i++)
{
seed = i;
srand(i);
/*generate*/
for (j = 0; j < N; ++j) {
M1[j] = (rand_r(&seed) % A) + 1;
}
for (j = 0; j < N / 2; ++j) {
M2[j] = (rand_r(&seed) % (10 * A)) + 1;
}
/*map*/
for (j = 0; j < N; ++j)
{
M1[j] = pow(M1[j] / PI, 3);
}
for (j = 0; j < N / 2; ++j) {
M2_copy[j] = M2[j];
}
M2[0] = fabs(tan(M2_copy[0]));
for (j = 0; j < N / 2; ++j) {
M2[j] = fabs(tan(M2[j] + M2_copy[j]));
}
/*merge*/
for (j = 0; j < N / 2; ++j) {
M2[j] = pow(M1[j], M2[j]);
}
/*sort*/
dwarf_sort(M2, N / 2);
/*sort*/
X = reduce(M2, N / 2);
}
printf("\nN=%d.\n", N);
printf("X=%f\n", X);
return 0;
}
Does anyone see where my mistake is? I think I'm using the wrong data types for the variables, but I still can't solve the problem.
Replace the /* merge */ part with this:
/*merge*/
for (j = 0; j < N / 2; ++j) {
printf("%f %f ", M1[j], M2[j]);
M2[j] = pow(M1[j], M2[j]);
printf("%f\n", M2[j]);
}
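This will print the operands and the result of each pow call. You'll see that some of these values are huge, and the largest ones exceed the range of double, so pow returns inf. sin(inf) is NaN, and once a NaN has been added to sum, the final X is NaN as well (MSVC prints it as -nan(ind)).
Something like pow(593419.97, 31.80) is already on the order of 1e183, and a slightly larger base or exponent will not end well.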

Intrinsics load matrix

I'm learning intrinsics. I don't know how to load a matrix correctly. I want to do matrix multiplication.
This is my code:
int i, j, k;
__m128 mat2values = _mm_setzero_ps();
__m128 mat1values = _mm_setzero_ps();
__m128 r = _mm_setzero_ps();
for (i = 0; i < N; ++i)
{
for (j = 0; j < N - 3; j += 4)
{
for (k = 0; k < N - 3; k += 4)
{
mat1values = _mm_load_ps(&mat1[i][k]);
mat2values = _mm_load_ps(&mat2[k][j]);
r = _mm_add_ps(r, _mm_mul_ps(mat1values, mat2values));
}
result[i][j] = r.m128_f32[0] + r.m128_f32[1] + r.m128_f32[2] + r.m128_f32[3];
for (; k < N; k++)
result[i][j] += mat1[i][j] * mat2[k][j];
}
}
When debugging, result still holds all 0 values after the loop.
Are you sure the argument in
_mm_load_ps(&mat1[i][k])
is the correct memory address, of type float*? Note that _mm_load_ps also requires that address to be 16-byte aligned.

Access Violation writing location C++ 0x02D1F000

I'm attempting to initialize, populate and parse through an array in order to determine its "stability." To avoid a stack overflow, I decided to create dynamic arrays. The problem is that when it comes to populating the array, I get an exception regarding an access violation at a random location. I don't know if it's something in the initialization or in the nested for loop that populates the array. Neither I nor my classmates/TAs can find anything wrong. Thanks in advance for your help! I have tried compiling in VS, Xcode, and g++. I have also tried commenting out the dynamic array loops and the delete loops and using "regular arrays" such as float array[x][y], and I still get the same error.
#include <iostream>
#include <array>
#include <iomanip>
#include <cmath>
using namespace std;
int main() {
int check = 0;
int iteration = 0;
int newIteration = 0;
int newNewIteration = 0;
int const DIMENSION = 1024;
//Initializing the dynamic arrays in
//heap to avoid a stack overflow
float** firstGrid = new float*[DIMENSION];
for (int a = 0; a < DIMENSION; ++a) {
firstGrid[a] = new float[DIMENSION];
}
float** secondGrid = new float*[DIMENSION];
for (int b = 0; b < DIMENSION; ++b) {
secondGrid[b] = new float[DIMENSION];
}
float** thirdGrid = new float*[DIMENSION];
for (int c = 0; c < DIMENSION; ++c) {
thirdGrid[c] = new float[DIMENSION];
}
//Populating the arrays
//All points inside first array
for (int i = 0; i < DIMENSION; ++i) {
for (int j = 0; i < DIMENSION; ++j) {
firstGrid[i][j] = 0.0; //exception occurs here
}
}
for (int i = 1; i < DIMENSION - 1; ++i) {
for (int j = 1; i < DIMENSION - 1; ++j) {
firstGrid[i][j] = 50.0;
}
}
//Pre-setting second array
for (int i = 0; i < DIMENSION; ++i) {
for (int j = 0; i < DIMENSION; ++j) {
secondGrid[i][j] = 0.0;
}
}
for (int i = 1; i < DIMENSION - 1; ++i) {
for (int j = 1; i < DIMENSION - 1; ++j) {
secondGrid[i][j] = 50.0;
}
}
//Pre-setting third array
for (int i = 0; i < DIMENSION; ++i) {
for (int j = 0; i < DIMENSION; ++j) {
thirdGrid[i][j] = 0.0;
}
}
for (int i = 1; i < DIMENSION - 1; ++i) {
for (int j = 1; i < DIMENSION - 1; ++j) {
thirdGrid[i][j] = 50.0;
}
}
//Checking and Populating new arrays
for (int p = 1; p < DIMENSION - 1; ++p) {
for (int q = 1; q < DIMENSION - 1; ++p) {
check = abs((firstGrid[p - 1][q] + firstGrid[p][q - 1] + firstGrid[p + 1][q] + firstGrid[p][q + 1]) / 4
- firstGrid[p][q]);
if (check > 0.1) {
secondGrid[p][q] = (firstGrid[p - 1][q] + firstGrid[p][q - 1] + firstGrid[p + 1][q] + firstGrid[p][q + 1]) / 4;
iteration = iteration + 1;
}
}
}
for (int p = 1; p < DIMENSION - 1; ++p) {
for (int q = 1; q < DIMENSION - 1; ++p) {
check = abs((secondGrid[p - 1][q] + secondGrid[p][q - 1] + secondGrid[p + 1][q] + secondGrid[p][q + 1]) / 4
- secondGrid[p][q]);
if (check > 0.1) {
thirdGrid[p][q] = (secondGrid[p - 1][q] + secondGrid[p][q - 1] + secondGrid[p + 1][q] + secondGrid[p][q + 1]) / 4;
newIteration = newIteration + 1;
}
}
}
for (int p = 1; p < DIMENSION - 1; ++p) {
for (int q = 1; q < DIMENSION - 1; ++p) {
check = abs((thirdGrid[p - 1][q] + thirdGrid[p][q - 1] + thirdGrid[p + 1][q] + thirdGrid[p][q + 1]) / 4
- thirdGrid[p][q]);
if (check > 0.1) {
newNewIteration = newNewIteration + 1;
}
}
}
//Deleting arrays and freeing memory
for (int x = 0; x < DIMENSION; ++x) {
delete [] firstGrid[x];
}
delete [] firstGrid;
for (int x = 0; x < DIMENSION; ++x) {
delete [] secondGrid[x];
}
delete [] secondGrid;
for (int x = 0; x < DIMENSION; ++x) {
delete [] thirdGrid[x];
}
delete [] thirdGrid;
//iteration checking
cout << iteration << endl << newIteration << endl << newNewIteration;
if (iteration == 179 || newIteration == 179 || newNewIteration == 179) {
return 0;
}
else {
return 1;
}
}
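The inner loops test i instead of j in their condition (int j = 0; i < DIMENSION; ++j), so j is never bounded and firstGrid[i][j] runs past the end of the row. Use j consistently in the inner for-loop (where the error occurs):
for (int j = 0; j < DIMENSION; ++j)
The same applies to the later loops over p and q, whose inner loop increments p instead of q.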
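The corrected loop structure can be sketched in isolation. The version below is written in C and makes two assumptions that differ from the question: it uses one flat row-major allocation instead of an array of row pointers, and make_grid is a hypothetical helper name.

```c
#include <stdlib.h>

/* Allocates a dim x dim grid (flat, row-major) with a zero border
   and 50.0 in the interior. Returns NULL if the allocation fails. */
float *make_grid(int dim) {
    float *g = malloc((size_t)dim * dim * sizeof *g);
    if (g == NULL)
        return NULL;
    for (int i = 0; i < dim; ++i) {
        for (int j = 0; j < dim; ++j) {          /* the condition tests j, not i */
            g[(size_t)i * dim + j] = 0.0f;
        }
    }
    for (int i = 1; i < dim - 1; ++i) {
        for (int j = 1; j < dim - 1; ++j) {
            g[(size_t)i * dim + j] = 50.0f;
        }
    }
    return g;
}
```

Because each inner loop is bounded by its own counter, every write stays inside the allocation, and a single free(g) releases the grid.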

Duplicates within 2 arrays

What I'm doing is populating two arrays, x_cord and y_cord, with a maximum number of values in each. In this case both arrays can hold at most 5 unique elements, and each element must be between 0 and 2. Once the arrays are completely randomized, I write the values to a file.
It would look something like this:
0 0
1 2
2 1
2 2
0 1
I don't want any row to be a duplicate of another; however, my code still produces duplicate rows. Any help would be appreciated.
Code:
for (j=0; j < num_pt; j++){
(x_cord[j] = rand()%max_x+1);
(y_cord[j] = rand()%max_y);
for(m=j+1; m < num_pt; m++){
if ((x_cord[j]==x_cord[m]) && (y_cord[j]==y_cord[m])){
x_cord[j] = rand()%max_x+1;
}
}
}
for (j=0; j < num_pt;j++){
fprintf(fp, "%d\t%d\n", x_cord[j], y_cord[j]);
}
Rather than repeatedly generating a pair until you find a unique pair, generate all pairs, then shuffle the pairs.
int max_y = 2;
int max_x = 2;
size_t num_eles = (size_t)(max_x+1)*(max_y+1);
size_t desired_num_eles = 6;
if (desired_num_eles > num_eles)
    desired_num_eles = num_eles;
int* y_cord = malloc(sizeof(int) * num_eles);
int* x_cord = malloc(sizeof(int) * num_eles);
// Enumerate every (x, y) pair with 0 <= x <= max_x and 0 <= y <= max_y
for (int y = 0; y <= max_y; ++y) {
    for (int x = 0; x <= max_x; ++x) {
        size_t i = (size_t)y * (max_x+1) + x;
        y_cord[i] = y;
        x_cord[i] = x;
    }
}
// Partial shuffle: only the first desired_num_eles slots need to be randomized
for (size_t i = 0; i<desired_num_eles; ++i) {
    size_t j = rand() % (num_eles - i) + i;
    // Swap i and j (safe even when i == j)
    int tmp_y = y_cord[i]; y_cord[i] = y_cord[j]; y_cord[j] = tmp_y;
    int tmp_x = x_cord[i]; x_cord[i] = x_cord[j]; x_cord[j] = tmp_x;
}
num_eles = desired_num_eles;
y_cord = realloc(y_cord, sizeof(int) * num_eles);
x_cord = realloc(x_cord, sizeof(int) * num_eles);

Optimization of C code

For an assignment in a course called High Performance Computing, I am required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Using some recommendations, I managed to optimize the code (or at least I think so), such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
bitwise shifts instead of multiplication as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
According to my instructor, well-optimized code should have fewer, or less costly, instructions at the assembly level, and should therefore run in less time than the original code, i.e. the calculation is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command: gcc -o code_opt.s -S foobar.c,
the generated code has many more lines than the original despite the optimizations, and the run time is lower, but not by much compared with the original code. What am I doing wrong?
I am not pasting the assembly code, as both listings are very long. I call the function foobar from main and measure the execution time using the time command on Linux:
int main () {
int a,b,N;
scanf ("%d %d %d",&a,&b,&N);
printf ("%d\n",foobar (a,b,N));
return 0;
}
Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them externally (and removing the loop that runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting (fractions such as -1/2 are exact here, not C integer division):
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting - recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Another move to external (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both of the above loop removals use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
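The closed form derived above can be checked against the original nested loops. The sketch below assumes long long arithmetic to keep the intermediate products away from int overflow; the function names are hypothetical.

```c
/* Reference: the nested loops from the question, with y and the
   unused arguments dropped. */
long long foobar_loops(int N) {
    const long long k = 256;
    long long x = 0;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 2 * k);
    return x;
}

/* The four-term closed form: no loops left. Each division is exact
   because the numerators are the standard summation products. */
long long foobar_closed(long long N) {
    const long long k = 256;
    long long x = 0;
    x += (4 * k * (N * N + N)) * (N + 1);
    x += (16 * k * N + 2 * (N * N + N) - 4 * k) * ((N * (N + 1)) / 2);
    x += (8 * N - 20 * k - 2) * ((N * (N + 1) * (2 * N + 1)) / 6);
    x += (-10) * ((N * N * (N + 1) * (N + 1)) / 4);
    return x;
}
```

For N = 1 both functions return 2048 and for N = 2 both return 14352, which matches a hand evaluation of the loops.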
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum of squares and sum of cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise for you. As a result, i and j will also be removed.
a and b should also be removed if possible - since a and b are supplied as argument but never used in your code.
Sum of squares and sum of cubes formula:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Think before coding. Sometimes your brain can optimize code better than any compiler.
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (ie, i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1). Then, when initializing the inner loop, calculate a value j1 that equals (i<<2) * t. On each iteration, first increment j1 by 4 * t (which is constant for the whole inner loop), so that j1 always equals (j<<2) * t. Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
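The stride idea described above can be sketched as follows. This is one reading of the description rather than the answerer's own code: i1, j1 and step are local names chosen for illustration, and foobar_ref is the questioner's optimized version kept for comparison.

```c
/* Reference: the questioner's optimized version. */
int foobar_ref(int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        for (int j = i + 1; j <= N; j++)
            x = x + ((i << 3) + (j << 2)) * t;
    }
    return x;
}

/* Stride version: the inner loop contains only additions. */
int foobar_stride(int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        int i1 = (i << 3) * t;   /* part that does not depend on j */
        int step = 4 * t;        /* how much (j << 2) * t grows per j */
        int j1 = (i << 2) * t;   /* value for j = i, bumped before first use */
        for (int j = i + 1; j <= N; j++) {
            j1 += step;          /* now equals (j << 2) * t */
            x = x + i1 + j1;
        }
    }
    return x;
}
```

Both functions use int as in the question, so they agree only while the result fits in an int; that holds comfortably for N up to about 50.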
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
    int i, j;
    a = 0;
    b = 512;
    for (i = 0; i <= N; i++, b++) {
        for (j = i + 1; j <= N; j++) {
            a = a + ((i<<3) + (j<<2))*b;
        }
    }
    return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple: i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
    int i, j, iLimit, jLimit;
    a = 0;
    b = 512;
    iLimit = N << 3;
    jLimit = N << 2;
    for (i = 0; i <= iLimit; i+=8) {
        // i holds 8x the original i and j holds 4x the original j,
        // so the original start j = i + 1 becomes (i >> 1) + 4
        for (j = (i >> 1) + 4; j <= jLimit; j+=4) {
            a = a + (i + j)*b;
        }
        b++;
    }
    return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// We only iterate up to N on the outer loop, since the
// last iteration doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 256) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total of (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.
int foobar(int N) // unused arguments a and b removed
{
    int i, j, x = 0; // unused variables and the y computation removed
    for (i = N; i--; ) // count down to zero: the decrement doubles as the loop test
        for (j = N+1; --j > i; )
            x += (((i<<1)+j)*(i+512)<<2); // shifts instead of multiplications by 2 and 4
    return x;
}

Resources