OpenMP Segmentation Fault When Parallelizing Simple Loop - c

I have a function that reads from one array and updates another in a for loop, like so:
void compute(double values[], int num_points, double ders[]){
for(int i = 0; i < num_points; ++i)
{
double a = values[i* 3 + 0 ];
double b = values[i* 2 + 1 ];
ders[i*4 + 0] = a * b;
ders[i*4 + 1] = a * a;
ders[i*4 + 2] = b * b;
ders[i*4 + 3] = b * a * a;
}
}
All of this is well and good but then I update the code to try to do things in parallel with OpenMP like
void compute(double values[], int num_points, double ders[]){
omp_set_dynamic(0);
omp_set_num_threads(2);
#pragma omp parallel for
for(int i = 0; i < num_points; ++i)
{
double a = values[i* 3 + 0 ];
double b = values[i* 2 + 1 ];
ders[i*4 + 0] = a * b;
ders[i*4 + 1] = a * a;
ders[i*4 + 2] = b * b;
ders[i*4 + 3] = b * a * a;
}
}
And now I'm getting Segmentation Faults.
I feel like I must be overwriting some value in two threads -- but everything in ders and values is indexed according to 'i' so it feels like it should be trivial to parallelize.
What am I doing wrong here?

Related

Fastest way to compute maximal n s.t. n over k <= x

I'm looking for a fast way to compute the maximal n s.t. n over k <= x for given k and x.
In my context n \leq n' for some known constant n', let's say 1000. k is either 1, 2, or 3, and x is chosen at random from 0 ... n' over k.
My current approach is to compute the binomial coefficient iteratively, starting from a_0 = k over k = 1. The next coefficient a_1 = k+1 over k can be computed as a_1 = a_0 * (k+1) / 1 and so on.
The current C code looks like this
uint32_t max_bc(const uint32_t a, const uint32_t n, const uint32_t k) {
uint32_t tmp = 1;
int ctr = 0;
uint32_t c = k, d = 1;
while(tmp <= a && ctr < n) {
c += 1;
tmp = tmp*c/d;
ctr += 1;
d += 1;
}
return ctr + k - 1;
}
int main() {
const uint32_t n = 10, w = 2;
for (uint32_t a = 0; a < 10 /*bc(n, w)*/; a++) {
const uint32_t b = max_bc(a, n, w);
printf("%d %d\n", a, b);
}
}
which outputs
0 1
1 2
2 2
3 3
4 3
5 3
6 4
7 4
8 4
9 4
So I'm looking for a bit trick or something similar to get rid of the while loop and speed up my application, because the while loop is executed n-k times in the worst case. Precomputation is not an option, because this code is part of a bigger algorithm which uses a lot of memory.
Thanks to @Aleksei
This is my solution:
template<typename T, const uint32_t k>
inline T opt_max_bc(const T a, const uint32_t n) {
if constexpr(k == 1) {
return n - k - a;
}
if constexpr (k == 2) {
const uint32_t t = __builtin_floor((double)(__builtin_sqrt(8 * a + 1) + 1)/2.);
return n - t - 1;
}
if constexpr (k == 3) {
if (a == 1)
return n-k-1;
float x = a;
float t1 = sqrtf(729.f * x * x);
float t2 = cbrtf(3.f * t1 + 81.f * x);
float t3 = t2 / 2.09f;
float ctr2 = t3;
int ctr = int(ctr2);
return n - ctr - k;
}
if constexpr (k == 4) {
const float x = a;
const float t1 = __builtin_floorf(__builtin_sqrtf(24.f * x + 1.f));
const float t2 = __builtin_floorf(__builtin_sqrtf(4.f * t1 + 5.f));
uint32_t ctr = (t2 + 3.f)/ 2.f - 3;
return n - ctr - k;
}
// will never happen
return -1;
}
If k is really limited to just 1, 2 or 3, you can use a different method depending on k:
k == 1: C(n, 1) = n <= x, so the answer is x.
k == 2: C(n, 2) = n * (n - 1) / 2 <= x. Solving n * (n - 1) / 2 = x gives the positive root n = 1/2 (sqrt(8x + 1) + 1), so the answer to the initial question is floor( 1/2 (sqrt(8x + 1) + 1) ).
k == 3: C(n, 3) = n(n-1)(n-2)/6 <= x. There is no nice closed-form solution, but the formula for the number of combinations is straightforward, so you can use a binary search to find the answer.
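The three cases can be sketched in C roughly as follows. This is only a sketch under assumptions: the names max_n_k1, max_n_k2, max_n_k3, isqrt_u64 and the n_max bound parameter are made up for illustration, and an integer square root is used in place of a floating-point sqrt so that the floor in the k == 2 formula is exact.

```c
#include <stdint.h>

/* Integer square root: largest r with r*r <= v (digit-by-digit method). */
static uint64_t isqrt_u64(uint64_t v) {
    uint64_t r = 0;
    uint64_t bit = (uint64_t)1 << 62;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= r + bit) {
            v -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/* C(n, 3) in 64-bit arithmetic; zero for n < 3. */
static uint64_t choose3(uint64_t n) {
    return n < 3 ? 0 : n * (n - 1) * (n - 2) / 6;
}

/* k == 1: largest n with C(n, 1) = n <= x. */
uint32_t max_n_k1(uint32_t x) { return x; }

/* k == 2: floor((sqrt(8x + 1) + 1) / 2), computed with integers only. */
uint32_t max_n_k2(uint32_t x) {
    return (uint32_t)((isqrt_u64(8 * (uint64_t)x + 1) + 1) / 2);
}

/* k == 3: binary search for the largest n in [2, n_max] with C(n, 3) <= x.
 * n_max is the known upper bound n' from the question (an assumed parameter). */
uint32_t max_n_k3(uint32_t x, uint32_t n_max) {
    uint32_t lo = 2, hi = n_max;   /* C(2, 3) = 0 <= x always holds */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo + 1) / 2;
        if (choose3(mid) <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}
```

For k == 2 this reproduces the table printed by the loop version in the question (for example x = 6 gives 4 and x = 9 gives 4). The binary search needs about log2(n') evaluations of C(n, 3), so roughly 10 for n' = 1000.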

Explicitly telling GCC 9.2 to unswitch loop to allow auto-vectorization

I am working on a project that requires automatic vectorization of large loops. It is mandatory to use GCC to compile. A minimum case of the problem could be the following:
#define VLEN 4
#define NTHREADS 4
#define AVX512_ALIGNMENT 64
#define NUM_INTERNAL_ITERS 5
#define real double
typedef struct private_data {
/*
* Alloc enough space for private data and MEM_BLOCK_SIZE bytes of padding.
* Private data must be allocated all at once to squeeze cache performance by only
* padding once per CPU.
*/
real *contiguous_data;
/*
* Pointers to corresponding index in contiguous_data.
*/
real *array_1;
real *array_2;
} private_data_t;
private_data_t private_data[NTHREADS];
int num_iter;
void minimum_case(const int thread) {
// Reference to thread private data.
real *restrict array_1 =
__builtin_assume_aligned(private_data[thread].array_1, AVX512_ALIGNMENT);
real *restrict array_2 =
__builtin_assume_aligned(private_data[thread].array_2, AVX512_ALIGNMENT);
for (int i = 0; i < num_iter; i++) {
for (int k = 0; k < NUM_INTERNAL_ITERS; ++k) {
int array_1_entry =
(k * (NUM_INTERNAL_ITERS) * VLEN) +
i * NUM_INTERNAL_ITERS * NUM_INTERNAL_ITERS * VLEN;
int array_2_entry =
(k * (NUM_INTERNAL_ITERS) * VLEN) +
i * NUM_INTERNAL_ITERS * VLEN;
#pragma GCC unroll 1
#pragma GCC ivdep
for (int j = 0; j < VLEN; j++) {
real pivot;
int a_idx = array_1_entry + VLEN * 0 + j;
int b_idx = array_1_entry + VLEN * 1 + j;
int c_idx = array_1_entry + VLEN * 2 + j;
int d_idx = array_1_entry + VLEN * 3 + j;
int S_idx = array_2_entry + VLEN * 0 + j;
if (k == 0) {
pivot = array_1[a_idx];
// b = b / a
array_1[b_idx] /= pivot;
// c = c / a
array_1[c_idx] /= pivot;
// d = d / a
array_1[d_idx] /= pivot;
// S = S / a
array_2[S_idx] /= pivot;
}
int e_idx = array_1_entry + VLEN * 4 + j;
int f_idx = array_1_entry + VLEN * 5 + j;
int g_idx = array_1_entry + VLEN * 6 + j;
int k_idx = array_1_entry + VLEN * 7 + j;
int T_idx = array_2_entry + VLEN * 1 + j;
pivot = array_1[e_idx];
// f = f - (e * b)
array_1[f_idx] -= array_1[b_idx]
* pivot;
// g = g - (e * c)
array_1[g_idx] -= array_1[c_idx]
* pivot;
// k = k - (e * d)
array_1[k_idx] -= array_1[d_idx]
* pivot;
// T = T - (e * S)
array_2[T_idx] -= array_2[S_idx]
* pivot;
}
}
}
}
For this specific case, GCC is using 16B vectors instead of 32B ones for automatic vectorization. It is fairly easy to see that the control flow depends on a condition that can be checked out of the internal loop, but GCC is not performing any loop-unswitching.
The loop unswitching can be done manually, but please, note that this is a minimum case of the problem, the real loop has hundreds of lines and performing manual loop-unswitching would result in a lot of code redundancy. I am trying to find a way to force GCC to create different loops for different conditions that can be checked out of the internal loop.
Currently I am using GCC 9.2 with the following flags: -Ofast -march=native -std=c11 -fopenmp -ftree-vectorize -ffast-math -mavx -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -fopt-info-vec-optimized
Especially if the real loop has hundreds of lines I strongly recommend factoring that out into a separate function -- which then would make manual unswitching (where necessary) not that bad.
The following should be equivalent to your code (notice, I also factored out some index calculations -- this could actually be simplified even more):
static inline void inner_loop(real *restrict array_1, real *restrict array_2,
int const first) {
#pragma GCC unroll 1
for (int j = 0; j < VLEN; j++) {
int a_idx = VLEN * 0 + j;
int b_idx = VLEN * 1 + j;
int c_idx = VLEN * 2 + j;
int d_idx = VLEN * 3 + j;
int S_idx = VLEN * 0 + j;
if (first) {
real pivot = array_1[a_idx];
array_1[b_idx] /= pivot; // b = b / a
array_1[c_idx] /= pivot; // c = c / a
array_1[d_idx] /= pivot; // d = d / a
array_2[S_idx] /= pivot; // S = S / a
}
int e_idx = VLEN * 4 + j;
int f_idx = VLEN * 5 + j;
int g_idx = VLEN * 6 + j;
int k_idx = VLEN * 7 + j;
int T_idx = VLEN * 1 + j;
real pivot = array_1[e_idx];
array_1[f_idx] -= array_1[b_idx] * pivot; // f = f - (e * b)
array_1[g_idx] -= array_1[c_idx] * pivot; // g = g - (e * c)
array_1[k_idx] -= array_1[d_idx] * pivot; // k = k - (e * d)
array_2[T_idx] -= array_2[S_idx] * pivot; // T = T - (e * S)
}
}
void minimum_case(const int thread) {
// Reference to thread private data.
real *restrict array_1 =
__builtin_assume_aligned(private_data[thread].array_1, AVX512_ALIGNMENT);
real *restrict array_2 =
__builtin_assume_aligned(private_data[thread].array_2, AVX512_ALIGNMENT);
for (int i = 0; i < num_iter; i++) {
real *array_1_i =
array_1 + i * NUM_INTERNAL_ITERS * NUM_INTERNAL_ITERS * VLEN;
real *array_2_i = array_2 + i * NUM_INTERNAL_ITERS * VLEN;
// k == 0 is the only iteration that performs the pivot division
inner_loop(array_1_i, array_2_i, 1);
for (int k = 1; k < NUM_INTERNAL_ITERS; ++k) {
int array_1_entry = (k * (NUM_INTERNAL_ITERS)*VLEN);
int array_2_entry = (k * (NUM_INTERNAL_ITERS)*VLEN);
inner_loop(array_1_i + array_1_entry, array_2_i + array_2_entry, 0);
}
}
}
Full demo on godbolt: https://godbolt.org/z/wMgSnr
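The same pattern can be seen in a smaller sketch. The names row_update and process_rows and the arithmetic are hypothetical, chosen only to illustrate the technique: the loop-invariant condition is passed as a constant argument to a static inline helper, so that after inlining each call site can get its own branch-free copy of the loop. Whether GCC actually specializes the loop this way depends on the optimization level and flags, so treat this as an illustration and check the vectorizer report.

```c
#include <stddef.h>

/* Hypothetical helper: `first` is loop-invariant, so once the call is inlined
 * with a literal 0 or 1 the branch can be resolved at compile time. */
static inline void row_update(double *restrict a, const double *restrict b,
                              size_t n, const int first) {
    for (size_t j = 0; j < n; j++) {
        if (first)
            a[j] /= 2.0;   /* extra work done only for the first row */
        a[j] -= b[j];      /* work done for every row */
    }
}

/* Manually unswitched driver: row 0 is peeled off and called with first = 1,
 * every other row is called with first = 0. */
void process_rows(double *a, const double *b, size_t rows, size_t n) {
    if (rows == 0)
        return;
    row_update(a, b, n, 1);
    for (size_t k = 1; k < rows; k++)
        row_update(a + k * n, b, n, 0);
}
```

The body of the loop is written only once, which avoids the code duplication that manual unswitching would otherwise cause in a loop of several hundred lines.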

OpenMP: Prefix Sum Algorithm

I'm trying to implement a Prefix Sum Algorithm in C using OpenMP, and I'm stuck.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int p = 5;
int X[5] = { 1, 5, 4, 2, 3 };
int* Y = (int*)malloc(p * sizeof(int));
for (int i = 0; i < p; i++)
printf("%d ", X[i]);
printf("\n");
Y[0] = X[0];
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
for (int i = 0; i < p; i++)
printf("%d ", Y[i]);
printf("\n");
system("pause");
return 0;
}
What should this code do?
Input numbers are in X,
output numbers are (prefixes) in Y
and the number count is p.
X = 1, 5, 4, 2, 3
Stage I.
Y[0] = X[0];
Y[0] = 1
Stage II.
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
Example:
Y[1] = X[0] + X[1] = 6
Y[2] = X[1] + X[2] = 9
Y[3] = X[2] + X[3] = 6
Y[4] = X[3] + X[4] = 5
Stage III. (where I am stuck)
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
Example:
k = 2
Y[2] = Y[0] + Y[2] = 1 + 9 = 10
Y[3] = Y[1] + Y[3] = 6 + 6 = 12
Y[4] = Y[2] + Y[4] = 10 + 5 = 15
Above, the 10 + 5 = 15 should be 9 + 5 = 14, but Y[2] was overwritten by another thread. I want to use the value Y[2] had before the for-loop started.
Example:
k = 4
Y[4] = Y[0] + Y[4] = 1 + 15 = 16
Result: 1, 6, 10, 12, 16. Expected good result: 1, 6, 10, 12, 15.
With OpenMP, you always have to consider whether your code is correct for the serial case, with a single thread, because
It might in fact run that way, and
If it's incorrect serially, then it's virtually certain to be incorrect as a parallel program, too.
Your code is not correct serially. It appears you could fix that by running the problem loop backward, from i = p - 1 to k, but in fact that's not sufficient for parallel operation.
Your best bet appears to be to accumulate your partial results into a different array than holds the results of the previous cycle. For example, you might flip between X and Y as data source and result, with a little pointer wrangling to grease the iterative wheels. Or you might do it a little more easily by using a 2D array instead of separate X and Y.
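A minimal sketch of the two-buffer idea, assuming a hypothetical helper named prefix_sum (not part of the question's code). Each round reads only from src and writes only to dst, so no iteration can see a value produced in the same round; the two pointers are swapped between rounds. The doubling of k is taken from the question, starting at k = 1 so that Stage II becomes the first round.

```c
#include <stdlib.h>
#include <string.h>

/* Inclusive prefix sum of x[0..p-1] into y[0..p-1].
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int prefix_sum(const int *x, int *y, int p) {
    if (p <= 0)
        return 0;
    int *buf = malloc((size_t)p * sizeof *buf);
    if (buf == NULL)
        return -1;

    int *src = y;      /* results of the previous round */
    int *dst = buf;    /* results of the current round */
    memmove(src, x, (size_t)p * sizeof *x);

    for (int k = 1; k < p; k += k) {
        /* Iterations are independent: they read src and write dst only. */
        #pragma omp parallel for
        for (int i = 0; i < p; i++)
            dst[i] = (i >= k) ? src[i - k] + src[i] : src[i];

        int *tmp = src;   /* swap roles for the next round */
        src = dst;
        dst = tmp;
    }

    if (src != y)         /* final result may have ended up in the scratch buffer */
        memcpy(y, src, (size_t)p * sizeof *y);
    free(buf);
    return 0;
}
```

For X = 1, 5, 4, 2, 3 the rounds produce 1, 6, 9, 6, 5 (k = 1), then 1, 6, 10, 12, 14 (k = 2), then 1, 6, 10, 12, 15 (k = 4), which is the expected result. The code gives the same output when compiled without OpenMP, which is the serial-correctness check recommended above.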
UPDATE for Stage III.
int num_threads = 8;
int k = 2;
while (k < p)
{
#pragma omp parallel for ordered num_threads(k < num_threads ? 1 : num_threads)
for (i = p - 1; i >= k; i--)
{
Y[i] = Y[i - k] + Y[i];
}
k += k;
}
The code above solved my problem. It now runs in parallel, except for the first few rounds.

Navigating through an array of structs and storing values not working properly

I have this code that's very messy. There are two structs, both defined and initialized the same way. For the tall struct I can store values in tall[radius] without any issues, but when I replicate the process for the xx struct it doesn't work and prints wrong values.
I can't seem to figure out what's wrong with the struct usage.
I need an array of structs to store a different sized array dynamically every time, and other elements later for each struct.
I'm open to new approaches, too.
Why are the two structs performing differently?
#include <stdio.h>
#include <stdlib.h>
#define ar_length 18 //this is the radius max
#define WIDTH 10 //xx width
struct ttall
{
int *pnt;
};
struct xall
{
int *pnt;
};
int main()
{
int i;
int m;
int *x1;
int *x2;
int *x3;
int *ttThis;
int rad_size;
//beginning of the loop
int radius = 1;
int size = (2 * radius + 1);
int size_sqrd = size * size * size;
struct ttall **tall = malloc(sizeof(struct ttall *));
struct ttall *struct_pnt = malloc( 3*sizeof(struct ttall));
struct xall **xx = malloc(sizeof(struct xall *));
struct xall *xpnt = malloc( 3*sizeof(struct xall));
for (int radius = 0; radius < 3; radius++)
{
//need to increment the pointer to the struct everytime
xx[radius] = xpnt+radius;
tall[radius] = struct_pnt+radius;
int z = -1;
int t = -radius-2;
printf("****T*** %d\n",t);
int rad;
int meshCount;
meshCount = (2 * (radius+1) + 1);
rad_size = (2 * (radius+1) + 1) * (2 * (radius+1) + 1) * (2 * (radius+1) + 1);
printf("radius size : %d\n",rad_size);
printf("mesh size : %d\n",meshCount);
tall[radius]->pnt = malloc(rad_size * sizeof(int));
xx[radius]->pnt = malloc(rad_size * WIDTH * sizeof(int));
x1 = malloc(rad_size * sizeof(int));
x2 = malloc(rad_size * sizeof(int));
x3 = malloc(rad_size * sizeof(int));
for (i = 0; i < meshCount; i++)
{
t++;
z++;
for (m = 0; m < meshCount*meshCount; m++)
{
x1[z * (meshCount*meshCount) + m] = t;
}
}
//x2 computations
i = 0;
m = 0;
z = 0;
t = -radius-2;
int x;
for (x = 0; x < meshCount; x++)
{
t++;
for (m = 0; m < meshCount; m++)
{
for(int h = 0;h<meshCount;h++){
int index = meshCount*x+(meshCount*meshCount)*h +m ;
x2[index] = t;
}
}
}
//x3 computations
i = 0;
m = 0;
z = 0;
t = -radius-2;
for (x = 0; x < meshCount; x++)
{
// t++;
t = -radius-2;
for (m = 0; m < meshCount; m++)
{
t++;
for(int h = 0;h<meshCount;h++){
int index = meshCount*x+(meshCount*meshCount)*h +m ;
x3[index] = t;
}
}
}
// structure initializations and memalocation
//works fine with expanding radius
for (m = 0; m < rad_size; m++)
{
tall[radius]->pnt[m] = (x1[m] * x1[m]) + (x2[m] * x2[m]) + (x3[m] * x3[m]);
}
// doesnt work here
m = 0;
for (i = 0; i < rad_size; i++)
{
xx[radius]->pnt[i * WIDTH ] = 1;
xx[radius]->pnt[i * WIDTH + 1] = x1[i];
xx[radius]->pnt[i * WIDTH + 2] = x2[i];
xx[radius]->pnt[i * WIDTH + 3] = x3[i];
xx[radius]->pnt[i * WIDTH + 4] = x1[i] * x1[i];
xx[radius]->pnt[i * WIDTH + 5] = x1[i] * x2[i];
xx[radius]->pnt[i * WIDTH + 6] = x1[i] * x3[i];
xx[radius]->pnt[i * WIDTH + 7] = x2[i] * x2[i];
xx[radius]->pnt[i * WIDTH + 8] = x2[i] * x3[i];
xx[radius]->pnt[i * WIDTH + 9] = x3[i] * x3[i];
}
}
//free(x1);
//free(x2);
//free(x3);
//***testing sum***
// the sum when radius = 1 of xx should be 171
int k = 0;
int sum = 0;
//can replace 27 with rad_size
for (k = 0; k < 27*WIDTH; k++)
{
sum = sum + abs(xx[0]->pnt[k]);
//printf("%d\n",abs(xx[0]->pnt[k]));
}
printf(" sum xx : %d\n", sum);
//******Testing****
for (int c = 0; c < 27; c++)
{
//printf("X2 : %d\n",x3[c]);
printf("tall : %d\n", tall[1]->pnt[c]);
//printf("%d\n",xx[c]);
}
//free(ttThis);
//free(xx);
}
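With struct ttall **tall = malloc(sizeof(struct ttall *)); you allocate room for only one pointer.
Later you index it with tall[radius], where radius runs from 0 to 2, so tall[1] and tall[2] are written outside the allocation. The same applies to xx.
You must allocate room for all three pointers:
struct ttall **tall = malloc(3 * sizeof(struct ttall *));
struct xall **xx = malloc(3 * sizeof(struct xall *));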
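Since the question is open to other approaches, here is a sketch of a simpler layout that avoids the pointer-to-pointer level entirely: one array of structs, each owning its own buffer. The names struct level, levels_create and levels_destroy are hypothetical, and the size rule (2 * (r + 1) + 1) cubed times the row width is taken from the question's rad_size computation.

```c
#include <stdlib.h>

/* One entry per radius: a buffer and its length in ints. */
struct level {
    int *pnt;
    size_t count;
};

/* Free the first n buffers and the array itself. */
void levels_destroy(struct level *lv, size_t n) {
    if (lv == NULL)
        return;
    for (size_t r = 0; r < n; r++)
        free(lv[r].pnt);
    free(lv);
}

/* Allocate n_levels entries; entry r holds (2*(r+1)+1)^3 * width ints,
 * all zero-initialized. Returns NULL if any allocation fails. */
struct level *levels_create(size_t n_levels, size_t width) {
    struct level *lv = calloc(n_levels, sizeof *lv);
    if (lv == NULL)
        return NULL;
    for (size_t r = 0; r < n_levels; r++) {
        size_t side = 2 * (r + 1) + 1;
        lv[r].count = side * side * side * width;
        lv[r].pnt = calloc(lv[r].count, sizeof *lv[r].pnt);
        if (lv[r].pnt == NULL) {
            levels_destroy(lv, r);   /* frees only the buffers allocated so far */
            return NULL;
        }
    }
    return lv;
}
```

With this layout, xx[radius]->pnt[i * WIDTH + c] would become lv[radius].pnt[i * WIDTH + c], and the stored count makes it harder to read past the end of a buffer by accident.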

Optimization of C code

For an assignment in a course called High Performance Computing, I am required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Using some recommendations, I managed to optimize the code (or at least I think so), such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
bitwise shifts instead of multiplication as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
According to my instructor, well-optimized code should have fewer, or less costly, instructions at the assembly-language level, and should therefore run in less time than the original code. The time is estimated as:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command: gcc -o code_opt.s -S foobar.c,
the generated code has many more lines than the original despite the optimizations I made, and the run time is lower than the original's, but not by much. What am I doing wrong?
I am not pasting the assembly code because both listings are very long. I call the function foobar from main and measure the execution time using the time command on Linux:
int main () {
int a,b,N;
scanf ("%d %d %d",&a,&b,&N);
printf ("%d\n",foobar (a,b,N));
return 0;
}
Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them externally (and removing the loop that runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting:
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting - recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Another move to external (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both of the above loop removals use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
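To check the last step of the derivation, the single-loop form and the loop-free form can be compared directly. This is a sketch: the function names foobar_one_loop and foobar_sums are made up, k is fixed at 256 as in the question, and long long is used instead of int so that intermediate products do not overflow for moderate N.

```c
/* Single loop over i: the polynomial in i from the "recalculating" step. */
long long foobar_one_loop(long long N) {
    const long long k = 256;
    long long x = 0;
    for (long long i = 0; i <= N; i++) {
        x += 4 * k * (N * N + N);                             /* multiple of 1   */
        x += i * (16 * k * N + 2 * (N * N + N) - 4 * k);      /* multiple of i   */
        x += i * i * (8 * N - 20 * k - 2);                    /* multiple of i^2 */
        x += i * i * i * (-10);                               /* multiple of i^3 */
    }
    return x;
}

/* Loop-free form: each power of i is replaced by its summation formula. */
long long foobar_sums(long long N) {
    const long long k = 256;
    if (N < 0)
        return 0;                                   /* the loops never run */
    long long s0 = N + 1;                           /* Sum(1)   */
    long long s1 = N * (N + 1) / 2;                 /* Sum(i)   */
    long long s2 = N * (N + 1) * (2 * N + 1) / 6;   /* Sum(i^2) */
    long long s3 = N * N * (N + 1) * (N + 1) / 4;   /* Sum(i^3) */
    return 4 * k * (N * N + N) * s0
         + (16 * k * N + 2 * (N * N + N) - 4 * k) * s1
         + (8 * N - 20 * k - 2) * s2
         - 10 * s3;
}
```

For N = 1 the four terms are 4096 + 3076 - 5114 - 10 = 2048, which matches the single iteration (i = 0, j = 1) of the original nested loop: 4 * (0 + 1) * (0 + 512) = 2048.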
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum-of-squares and sum-of-cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise to you. As a result, i and j will also be removed.
a and b should also be removed if possible - since a and b are supplied as argument but never used in your code.
Sum of squares and sum of cubes formulas:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
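A quick way to gain confidence in the closed form is to compare it against the original nested loop for a range of inputs. The sketch below assumes the hypothetical names foobar_loop and foobar_closed and uses long long rather than int, so the comparison is not disturbed by overflow.

```c
/* Reference: the original doubly nested loop with k = 256 folded in. */
long long foobar_loop(int N) {
    long long x = 0;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 512);
    return x;
}

/* Closed form: N(N+1)(N(7N + 8187) - 2050) / 6. */
long long foobar_closed(int N) {
    if (N <= 0)
        return 0;   /* the nested loop adds nothing for N <= 0 */
    long long n = N;
    return n * (n + 1) * (n * (7 * n + 8187) - 2050) / 6;
}
```

For N = 2 both give 14352: the loop adds 2048 + 4096 + 8208, and the formula evaluates to 2 * 3 * 14352 / 6.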
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (ie, i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
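The stride idea can be sketched as follows. The names i1 and j1 come from the description above; step, foobar_ref and foobar_stride are added for illustration. Here j1 is incremented before it is used, so starting it at (i<<2) * t makes it equal to (j<<2) * t inside the loop body.

```c
/* Reference: the optimized loop from the question. */
int foobar_ref(int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        for (int j = i + 1; j <= N; j++)
            x = x + ((i << 3) + (j << 2)) * t;
    }
    return x;
}

/* Strength-reduced version: no multiplication inside the inner loop. */
int foobar_stride(int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        int i1 = (i << 3) * t;    /* invariant part, hoisted out of the j loop */
        int step = 4 * t;         /* how much (j<<2)*t grows when j grows by 1 */
        int j1 = (i << 2) * t;    /* value for j == i, advanced before each use */
        for (int j = i + 1; j <= N; j++) {
            j1 += step;           /* now j1 == (j << 2) * t */
            x = x + i1 + j1;
        }
    }
    return x;
}
```

The inner loop now contains only additions. An optimizing compiler may well perform this transformation on its own, so it is worth comparing the generated assembly at -O2 before keeping the less readable version.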
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
int i, j;
a = 0;
b = 512;
for (i = 0; i <= N; i++, b++) {
for (j = i + 1; j <= N; j++) {
a = a + ((i<<3) + (j<<2))*b;
}
}
return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are only used in a single multiple each - i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
int i, j, iLimit, jLimit;
a = 0;
b = 512;
iLimit = N << 3;
jLimit = N << 2;
for (i = 0; i <= iLimit; i+=8) {
for (j = (i >> 1) + 4; j <= jLimit; j+=4) {
a = a + (i + j)*b;
}
b++;
}
return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// We only iterate up to N on the outer loop, since the
// last iteration doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 256) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.
int foobar(int N) // Drop the unused arguments
{
int i, j, x=0; // Remove unused variables and operations to save stack space and machine cycles
for (i = N; i--; ) // Count down so the loop test needs no separate comparison
for (j = N+1; --j>i; )
x += (((i<<1)+j)*(i+512)<<2); // Save machine cycles: use a shift instead of a multiply
return x;
}
