I have a number crunching C program which involves a main loop with two conditionals:
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
if (k == i || k == j) continue;
...(calculate a, b, c, d (depending on k))
if (a*a + b*b + c*c < d*d) {break;}
} //k
} //j
} //i
The hardware here is the SPE of the Cell processor, where branching carries a big penalty. To speed up my program I need to remove these two conditionals. Do you know any good strategies for this?
For the first one, you could break it into multiple loops, e.g. change:
for(int i = 0; i < 1000; i++)
for(int j = 0; j < 1000; j++) {
for(int k = 0; k < 1000; k++) {
if(k==i || k == j) continue;
// other code
}
}
to:
for(int i = 0; i < 1000; i++)
for(int j = 0; j < 1000; j++) {
for(int k = 0; k < min(i, j); k++) {
// other code
}
for(int k = min(i, j) + 1; k < max(i, j); k++) {
// other code
}
for(int k = max(i, j) + 1; k < 1000; k++) {
// other code
}
}
To remove the second, you could store both sides of the comparison and test them in the for loop conditions, resetting them for every (i, j) pair so that an early exit only ends the current k scan, i.e.:
int left_side, right_side;
for(int i = 0; i < N; i++)
for(int j = 0; j < N; j++) {
// reset for each (i, j); otherwise the first early exit would skip all later k loops
left_side = 1; right_side = 0;
for(int k = 0; k < min(i, j) && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
for(int k = min(i, j) + 1; k < max(i, j) && left_side >= right_side; k++) {
// same as in previous loop
}
for(int k = max(i, j) + 1; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
Implementing min and max without branching could also be tricky. Maybe this version is better:
int i, j, k,
left_side = 1, right_side = 0;
for(i = 0; i < N; i++) {
// this loop covers the case where j < i
for(j = 0; j < i; j++) {
k = 0;
left_side = 1; right_side = 0; // reset the early-exit test for this (i, j)
for(; k < j && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
k++; // skip k == j
for(; k < i && left_side >= right_side; k++) {
// same as in previous loop
}
k++; // skip k == i
for(; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
j++; // skip j == i (note: the original loop still runs every k != i in this case; add a loop for it if you need that)
// and now, j > i
for(; j < N; j++) {
k = 0;
left_side = 1; right_side = 0; // reset the early-exit test for this (i, j)
for(; k < i && left_side >= right_side; k++) {
// other code (calculate a, b, c, d)
left_side = a * a + b * b + c * c;
right_side = d * d;
}
k++; // skip k == i
for(; k < j && left_side >= right_side; k++) {
// same as in previous loop
}
k++; // skip k == j
for(; k < N && left_side >= right_side; k++) {
// same as in previous loop
}
}
}
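On the remark that min and max are tricky without branching: here is a minimal sketch of one common bit-masking approach for int values. The names min_nb and max_nb are made up for illustration, and it assumes two's complement integers. Whether the comparison itself compiles to a branch-free instruction depends on the compiler and target, so check the generated code.

```c
/* Branch-free min/max for int (a sketch, assuming two's complement).
 * (a < b) evaluates to 0 or 1, so -(a < b) is either 0 or all one bits. */
int min_nb(int a, int b)
{
    /* mask all ones: b ^ (a ^ b) == a; mask zero: result is b */
    return b ^ ((a ^ b) & -(a < b));
}

int max_nb(int a, int b)
{
    /* mask all ones: a ^ (a ^ b) == b; mask zero: result is a */
    return a ^ ((a ^ b) & -(a < b));
}
```

With these, the loop bounds min(i, j) and max(i, j) can be computed once per (i, j) pair, outside the k loops.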
I agree with 'sje397'.
Besides this, you provide too little information about your problem. You say branching is pricey. But how often does it actually happen? Maybe your problem is that compiler-generated code does branching in the common scenario?
Perhaps you could rearrange your ifs. The implementation of an if is actually compiler-dependent, but many compilers treat it in a straightforward way. That is: if - common - else - rare (jump).
Then try the following:
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
if (k != i && k != j)
{
...(calculate a, b, c, d)
if (a*a + b*b + c*c >= d*d)
{
...
} else
break;
}
} //k
} //j
} //i
EDIT:
Of course, you can drop down to assembly level to make sure the right code is generated.
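Another option for the two conditionals in the question is to turn them into arithmetic on 0/1 flags and give up the early exit. The sketch below rests on assumptions that are not in the question: a, b, c and d are taken to be precomputed arrays indexed by k, and first_hit is a made-up name. It scans k downwards so that the last value written is the smallest matching k, which is where the original break would have stopped.

```c
/* Returns the smallest k (with k != i and k != j) for which
 * a*a + b*b + c*c < d*d holds, or n if there is none.
 * The loop body has no if and no break; it always scans all n entries. */
int first_hit(const float *a, const float *b, const float *c,
              const float *d, int n, int i, int j)
{
    int first = n;
    for (int k = n - 1; k >= 0; k--) {
        int valid = (k != i) & (k != j);              /* 0 or 1 */
        int hit = valid & (a[k] * a[k] + b[k] * b[k] + c[k] * c[k]
                           < d[k] * d[k]);            /* 0 or 1 */
        int mask = -hit;                              /* 0 or all one bits */
        first = (k & mask) | (first & ~mask);         /* select k on a hit */
    }
    return first;
}
```

This trades the early exit for a fixed amount of work per (i, j), so it only pays off if the break usually fires late. As above, check the generated code to see whether the comparisons really compile without branches.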
I would look first at your calculate code, because that could swamp all these branching issues. Some sampling would find out for sure.
However, it looks like you're doing, for each i,j, a linear search for the first point inside a sphere. Could you have 3 arrays, one for each of the X, Y, and Z axes, and in each array store indexes of all the original points in ascending order by that axis? That could facilitate a nearest-neighbor search. Also, you might be able to use an in-cube test, rather than an in-sphere test, since you're not hunting for the closest point, but only a nearby point.
Are you sure you actually need the first if statement? Even if it skips one calculation when k equals i or j, checking it on every iteration is very costly. Also, keep in mind that if N is not a constant, the compiler probably won't be able to unroll the for loops.
Although, if it's a Cell processor, the compiler might even try to vectorize the loops.
If the for loops compile to normal iterative loops, it could be an idea to make them compare with zero instead, as the decrement operation will often do the comparison for you when it hits zero.
for (i = 0; i < N; i++) {
...can become...
for (i = N; i != 0; i--) {
Although, if "i" is used as an index or a variable in a calculation, you might get performance degradation as you will get cache misses.
There are many implementations of the Sieve of Eratosthenes online. Through searching Google, I found this implementation in C.
#include <stdio.h>
#include <stdlib.h>
#define limit 100 /*size of integers array*/
int main(){
unsigned long long int i,j;
int *primes;
int z = 1;
primes = malloc(sizeof(int) * limit);
for (i = 2;i < limit; i++)
primes[i] = 1;
for (i = 2;i < limit; i++)
if (primes[i])
for (j = i;i * j < limit; j++)
primes[i * j] = 0;
printf("\nPrime numbers in range 1 to 100 are: \n");
for (i = 2;i < limit; i++)
if (primes[i])
printf("%d\n", i);
return 0;
}
I then attempted to update the existing code so that the C program would follow what is described by Scott Ridgway in Parallel Scientific Computing. In the first chapter, the author describes what is known as the prime number sieve. Instead of finding the primes up to a number k, the modified sieve searches for primes n with k <= n <= k^2. Ridgway provides the pseudocode for this algorithm.
To match the pseudocode provided by the author, I modified the original program above and wrote
#include <stdio.h>
#include <stdlib.h>
#define limit 10 /*size of integers array*/
int main(){
unsigned long long int i,j,k;
int *primes;
int *arr[100];
int z = 1;
primes = malloc(sizeof(int) * limit);
for (i = 2;i < limit; i++)
primes[i] = 1;
for (i = 2;i < limit; i++)
if (primes[i])
for (j = i;i * j < limit; j++)
primes[i * j] = 0;
/* Code which prints out primes for Sieve of Eratosthenes */
/*printf("\nPrime numbers in range 1 to 100 are: \n");
for (i = 2;i < limit; i++)
if (primes[i])
//printf("Element[%d] = %d\n", i, primes[i]);*/
for (k=limit; k < limit*limit; k++)
for (j = primes[0]; j = arr[sizeof(arr)/sizeof(arr[0]) - 1]; j++)
if ((k % j) == 0)
arr[k]=0;
arr[k] = 1;
printf("\nPrime numbers in range k to k^2 are: \n");
for (k=limit; k < limit*limit; k++)
if (arr[k])
printf("Element[%d] = %d\n", k, k);
return 0;
}
which returns
Prime numbers in range k to k^2 are:
Element[10] = 10
Element[14] = 14
Element[15] = 15
Element[16] = 16
Element[17] = 17
Element[18] = 18
Element[19] = 19
.
.
.
This is clearly wrong. I think that my mistake is in my interpretation of the pseudocode as
for (k=limit; k < limit*limit; k++)
for (j = primes[0]; j = arr[sizeof(arr)/sizeof(arr[0]) - 1]; j++)
if ((k % j) == 0)
arr[k]=0;
arr[k] = 1;
As I am new to C, I likely made an elementary mistake. I'm not sure what is wrong with the five lines of code above and have therefore asked a question on Stack Overflow.
There is a problem with your loop statement: the j variable should be used as an index into primes, which points to an array of int holding 0 or 1 values. The primes array plays the role of S(k) in the algorithm.
for (k=limit; k < limit*limit; k++)
for (j = primes[0]; j = arr[sizeof(arr)/sizeof(arr[0]) - 1]; j++)
if ((k % j) == 0)
arr[k]=0;
arr[k] = 1;
So the for loop over j should be
for (j = 2; j < limit; j++)
And the condition in the if statement should be
if (primes[j] && (k % j) == 0)
{
arr[k] = 0;
break;
}
If this condition is true, we exit the inner for loop over j. After that loop, check the value of j to see whether the inner loop ran to completion (j == limit).
if (j == limit) arr[k] = 1;
So here is the entire for loop (outer and inner) that I modified.
for (k = limit; k < limit*limit; k++)
{
for (j = 2; j < limit; j++)
{
if (primes[j] && (k % j) == 0)
{
arr[k] = 0;
break;
}
}
if (j == limit) arr[k] = 1;
}
And here is the entire solution:
#include <stdio.h>
#include <stdlib.h>
#define limit 10 /*size of integers array*/
int main() {
unsigned long long int i, j, k;
int *primes;
int arr[limit*limit];
int z = 1;
primes = (int*)malloc(sizeof(int) * limit);
for (i = 2; i < limit; i++)
primes[i] = 1;
for (i = 2; i < limit; i++)
if (primes[i])
for (j = i; i * j < limit; j++)
primes[i * j] = 0;
/* Code which prints out primes for Sieve of Eratosthenes */
/*printf("\nPrime numbers in range 1 to 100 are: \n");
for (i = 2;i < limit; i++)
if (primes[i])
//printf("Element[%d] = %d\n", i, primes[i]);*/
for (k = limit; k < limit*limit; k++)
{
for (j = 2; j < limit; j++)
{
if (primes[j] && (k % j) == 0)
{
arr[k] = 0;
break;
}
}
if (j == limit) arr[k] = 1;
}
printf("\nPrime numbers in range k to k^2 are: \n");
for (k = limit; k < limit*limit; k++)
if (arr[k] == 1)
printf("Element %llu\n", k); /* k is unsigned long long, so %llu rather than %d */
return 0;
}
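The same idea can be packaged as a function, which makes it easier to check. This is only a sketch under assumptions: primes_between, out and cap are names made up here, limit is assumed to be at least 2, and limit*limit is assumed to fit in an int. It sieves the primes below limit, then tests each k in [limit, limit*limit) for divisibility by those primes, as in the answer above.

```c
#include <stdlib.h>

/* Writes the primes p with limit <= p < limit*limit into out (at most cap
 * of them) and returns how many were written, or -1 if allocation fails. */
int primes_between(int limit, int *out, int cap)
{
    char *is_prime = malloc((size_t)limit);
    if (is_prime == NULL)
        return -1;

    /* ordinary Sieve of Eratosthenes for the numbers below limit */
    for (int i = 0; i < limit; i++)
        is_prime[i] = (i >= 2);
    for (int i = 2; i * i < limit; i++)
        if (is_prime[i])
            for (int m = i * i; m < limit; m += i)
                is_prime[m] = 0;

    /* k is prime if no prime below limit divides it */
    int count = 0;
    for (int k = limit; k < limit * limit && count < cap; k++) {
        int composite = 0;
        for (int j = 2; j < limit && !composite; j++)
            if (is_prime[j] && k % j == 0)
                composite = 1;
        if (!composite)
            out[count++] = k;
    }

    free(is_prime);
    return count;
}
```

Unlike the program above, every entry of the sieve array is initialized (including indices 0 and 1) and the memory is freed before returning.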
i = 1;
while (i <= n) {
j = n - i;
while (j >= 2) {
for (k = 1; k <= j; k++) {
s = s + Arr[k];
}
j = j - 2;
}
i = i + 1;
}
The part that confuses me is where it says
j = n - i;
while(j >= 2){
I'm not really sure how to show my work on that part. I'm pretty sure the algorithm is O(n^3) though.
You can simplify it a bit in order to see things more clearly:
for(i = 1; i <= n; i++)
{
for(j = n - i; j >= 2; j -= 2)
{
for(k = 1; k <= j; k++)
{
s = s + Arr[k];
}
}
}
Now things should be simpler
for(i = 1; i <= n; i++) : O(n) [executes exactly n times, actually]
for(j = n - i; j >= 2; j -= 2) : about (n-1)/2 iterations when i = 1, (n-2)/2 when i = 2, and so on... O(n)
for(k = 1; k <= j; k++) : exactly j iterations, i.e. n-1 for the first j when i = 1, then n-3, and so on... O(n)
s = s + Arr[k]; [simple operation] : O(1)
Multiply the bounds of the nested steps and you get O(n^3).
If you are still having trouble with it, I would suggest you run a few simulations of this code with varying n values and a counter inside the loops. Hopefully you'll be able to see how O(n) is the complexity of each loop.
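The counter experiment could look like the sketch below. The function name count_inner_ops is made up here; it keeps the three loops from the question and replaces the array sum with a counter.

```c
/* Counts how many times the innermost statement (s = s + Arr[k])
 * would run for a given n. */
unsigned long long count_inner_ops(int n)
{
    unsigned long long count = 0;
    for (int i = 1; i <= n; i++)
        for (int j = n - i; j >= 2; j -= 2)
            for (int k = 1; k <= j; k++)
                count++;
    return count;
}
```

Print count_inner_ops(n) for n = 100, 200, 400, ... and compare consecutive values: if the growth is cubic, doubling n should multiply the count by roughly 8.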
#include <stdio.h>
main()
{
int x, y;
scanf("%d", &x);
int a[x][x];
int i, j, low = 0, top = x - 1, n = 1;
for (i = 0; i < x / 2; i++, low++, top--)
{
for (j = low; j <= top; j++, n++)
a[i][j] = n;
for (j = low + 1; j <= top; j++, n++)
a[j][top] = n;
for (j = top - 1; j >= low; j--, n++)
a[top][j] = n;
for (j = top - 1; j > low; j--, n++)
a[j][low] = n;
}
for (i = 0; i < x; i++)
{
for (j = 0; j < x; j++)
{
printf("%d", a[i][j]);
}
printf("\n");
}
}
I want to print a number pattern, and this is my code, but I would like to write it without arrays. How can I rewrite it without using any arrays?
And of course x can be both even and odd.
Thanks for your help!
#include <stdio.h>
int get(int x, int y, int lt, int n)
{
if(x == 0)
return lt+y;
else if(y == 0)
return lt+4*(n-1)-x;
else if(y == n-1)
return lt+n+x-1;
else if(x == n-1)
return lt+3*(n-1)-y;
else
return get(x-1, y-1, lt+4*(n-1), n-2);
}
int main(void)
{
int n, i, j;
scanf("%d", &n);
for(i = 0; i < n; ++i) {
for(j = 0; j < n; ++j)
printf("%2d ", get(i, j, 1, n));
putchar('\n');
}
return 0;
}
I must admit that this code is incredibly difficult for me to understand. I have no clue what is happening here. But I will try to answer it irrespective of the logic/pattern.
So you have a bunch of loops that fill in an array and then another loop that prints it in order. But you don't want to use arrays.
So the value should be calculated when it is to be printed. Let us try it this way -
First I will move your calculating code to a separate function for the sake of clarity -
int value_at(int I, int J, int x) {
int i, j, low = 0, top = x - 1, n = 1;
int a = 0;
for (i = 0; i < x / 2; i++, low++, top--) {
for (j = low; j <= top; j++, n++)
if ( i == I && j == J)
a = n;
for (j = low + 1; j <= top; j++, n++)
if ( I == j && J == top)
a = n;
for (j = top - 1; j >= low; j--, n++)
if ( top == I && j == J)
a = n;
for (j = top - 1; j > low; j--, n++)
if (j == I && low == J)
a = n;
}
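// for odd x the rings above never visit the center cell, which holds the last number
if (x % 2 == 1 && I == x / 2 && J == x / 2)
a = n;
return a;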
}
Now this function calculates the value of a for any i and j.
Now we can just print the values in a loop -
for (i = 0; i < x; i++)
{
for (j = 0; j < x; j++)
{
printf("%d", value_at(i, j, x));
}
printf("\n");
}
This should do the task:
int min(int a, int b)
{
return a<b ? a:b;
}
void printSpiral(int n)
{
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
int x;
x = min(min(i, j), min(n-1-i, n-1-j));
// For upper right half
if (i <= j)
printf("%d ", ((n*n) - ((n-2*x)*(n-2*x) - (i-x)
- (j-x)))+1);
// for lower left half
else
printf("%d ", ((n*n) - ((n-2*x-2)*(n-2*x-2) + (i-x)
+ (j-x)))+1);
}
printf("\n");
}
}
Just call printSpiral() and pass your x as the argument.
I've searched for hours and spent many more trying to figure how to fix this problem. I need to find the inverse of a predefined matrix using
A^-1 = I + (B + B^2 + ... + B^20) where B = I-A.
void invA(double a[][3], double id[][3], double z[][3])
{
int i, j, n, k;
double pb[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double temp[3][3] = {1.,0.,0.,0.,1.,0.,0.,0.,1.};
double b[3][3];
temp[i][j] = 0;
b[i][j] = 0;
for(i = 0; i < 3; i++)
for (j = 0; j < 3; j++)
b[i][j] = id[i][j] - a[i][j];
for (n = 0; n < 20; n++) //run loop n times
{
for (i = 0; i < 3; i++) //find b to the power 20
for (j = 0; j < 3; j++)
for (k = 0; k < 3; k++)
temp[i][j] += pb[i][k] * b[k][j];
for (i = 0; i < 3; i++) //allocate pb from temp
for (j = 0; j < 3; j++)
pb[i][j] = temp[i][j];
for (i = 0; i < 3; i++) //summing b n time
for (j = 0; j < 3; j++) //to find inverse
z[i][j] = z[i][j] + pb[i][j];
}
}
Matrix a is the defined matrix, id is the identity and z is the inverse (result). I can't seem to figure out where I've gone wrong.
You have a few problems.
First, temp[i][j] = 0; and b[i][j] = 0; at the beginning of the function use uninitialized variables i and j. The behaviour is undefined, and who knows how temp is actually initialized.
Then, temp must be reinitialized to a zero matrix at each iteration. I don't know exactly what your code computes, but it is certainly not a power.
Finally (unless z is initialized to I), you are missing the initial term.
All that said, I highly recommend factoring most of the loops out into functions: matAdd() and matMult(). Once they are unit tested, the rest is much simpler.
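A minimal sketch of that refactoring follows. The helper signatures, the DIM macro and the terms parameter are assumptions made for illustration, and invA here builds the identity itself instead of taking id as an argument. Keep in mind that the series I + B + B^2 + ... only approximates the inverse when the powers of B shrink (spectral radius of B below 1).

```c
#define DIM 3

/* out = x * y; out must not be the same array as x or y */
void matMult(double x[DIM][DIM], double y[DIM][DIM], double out[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            out[i][j] = 0.0;            /* every entry starts from zero */
            for (int k = 0; k < DIM; k++)
                out[i][j] += x[i][k] * y[k][j];
        }
}

/* acc = acc + x */
void matAdd(double acc[DIM][DIM], double x[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            acc[i][j] += x[i][j];
}

/* z = I + B + B^2 + ... + B^terms, where B = I - a */
void invA(double a[DIM][DIM], double z[DIM][DIM], int terms)
{
    double b[DIM][DIM], pb[DIM][DIM], temp[DIM][DIM];

    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double id = (i == j) ? 1.0 : 0.0;
            b[i][j] = id - a[i][j];
            pb[i][j] = id;              /* current power, starts at B^0 = I */
            z[i][j] = id;               /* the initial term I */
        }

    for (int n = 0; n < terms; n++) {
        matMult(pb, b, temp);           /* temp = next power of B */
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++)
                pb[i][j] = temp[i][j];
        matAdd(z, pb);                  /* add that power to the sum */
    }
}
```

Calling invA(a, z, 20) corresponds to the formula in the question. Multiplying a by z and comparing the product with the identity is an easy sanity check.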
For each of the following code segments, use OpenMP pragmas to make the loop parallel, or
explain why the code segment is not suitable for parallel execution.
a. for (i = 0; i < sqrt(x); i++) {
a[i] = 2.3 * i;
if (i < 10)
b[i] = a[i];
}
b. flag = 0;
for (i = 0; i < n && !flag; i++) {
a[i] = 2.3 * i;
if (a[i] < b[i])
flag = 1;
}
c. for (i = 0; i < n && !flag; i++)
a[i] = foo(i);
d. for (i = 0; i < n && !flag; i++) {
a[i] = foo(i);
if (a[i] < b[i])
a[i] = b[i];
}
e. for (i = 0; i < n && !flag; i++) {
a[i] = foo(i);
if (a[i] < b[i])
break;
}
f. dotp = 0;
for (i = 0; i < n; i++)
dotp += a[i] * b[i];
g. for (i = k; i < 2 * k; i++)
a[i] = a[i] + a[i - k];
h. for (i = k; i < n; i++) {
a[i] = c * a[i - k];
}
Any help with the above question would be very welcome, even just a line of thinking.
I will not do your HW, but I will give a hint. When playing around with OpenMP for loops, you should pay attention to the scope of the variables. For example:
#pragma omp parallel for
for(int x=0; x < width; x++)
{
for(int y=0; y < height; y++)
{
finalImage[x][y] = RenderPixel(x,y, &sceneData);
}
}
is OK, since x and y are private variables.
What about
int x,y;
#pragma omp parallel for
for(x=0; x < width; x++)
{
for(y=0; y < height; y++)
{
finalImage[x][y] = RenderPixel(x,y, &sceneData);
}
}
?
Here, we have defined x and y outside of the for loop. Now consider y. Every thread will access/write it without any synchronization, thus data races will occur, which are very likely to result in logical errors.
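One way to keep the outer declarations and still avoid the race is the private clause, sketched below. render_pixel, render_all, WIDTH and HEIGHT are stand-ins made up for this example (they replace RenderPixel and the image dimensions above). Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```c
#define WIDTH 4
#define HEIGHT 3

/* Hypothetical stand-in for the RenderPixel() call above. */
static int render_pixel(int x, int y)
{
    return 10 * x + y;
}

void render_all(int image[WIDTH][HEIGHT])
{
    int x, y;
    /* x is the index of the parallel loop, so OpenMP makes it private
     * automatically; y is declared outside the loop and would be shared
     * by default, so it is listed in private(). */
    #pragma omp parallel for private(y)
    for (x = 0; x < WIDTH; x++) {
        for (y = 0; y < HEIGHT; y++) {
            image[x][y] = render_pixel(x, y);
        }
    }
}
```

Each thread then works on its own copy of y, so the inner loops no longer interfere with each other.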
Read more here and good luck with your HW.