OpenMP: Prefix Sum Algorithm

I'm trying to implement a Prefix Sum Algorithm in C using OpenMP, and I'm stuck.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int p = 5;
int X[5] = { 1, 5, 4, 2, 3 };
int* Y = (int*)malloc(p * sizeof(int));
for (int i = 0; i < p; i++)
printf("%d ", X[i]);
printf("\n");
Y[0] = X[0];
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
for (int i = 0; i < p; i++)
printf("%d ", Y[i]);
printf("\n");
system("pause");
return 0;
}
What should this code do?
The input numbers are in X,
the output numbers (the prefix sums) are in Y,
and the number of elements is p.
X = 1, 5, 4, 2, 3
Stage I.
Y[0] = X[0];
Y[0] = 1
Stage II.
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
Example:
Y[1] = X[0] + X[1] = 6
Y[2] = X[1] + X[2] = 9
Y[3] = X[2] + X[3] = 6
Y[4] = X[3] + X[4] = 5
Stage III. (where I am stuck)
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
Example:
k = 2
Y[2] = Y[0] + Y[2] = 1 + 9 = 10
Y[3] = Y[1] + Y[3] = 6 + 6 = 12
Y[4] = Y[2] + Y[4] = 10 + 5 = 15
Above, 10 + 5 = 15 should be 9 + 5 = 14, but Y[2] was overwritten by another thread. I want to use the value Y[2] had before the for loop started.
Example:
k = 4
Y[4] = Y[0] + Y[4] = 1 + 15 = 16
Result: 1, 6, 10, 12, 16. Expected good result: 1, 6, 10, 12, 15.

Above, 10 + 5 = 15 should be 9 + 5 = 14, but Y[2] was overwritten by another thread. I want to use the value Y[2] had before the for loop started.
With OpenMP, you always have to consider whether your code is correct for the serial case, with a single thread, because
1. it might in fact run that way, and
2. if it's incorrect serially, then it's virtually certain to be incorrect as a parallel program, too.
Your code is not correct serially. It appears you could fix that by running the problem loop backward, from i = p - 1 to k, but in fact that's not sufficient for parallel operation.
Your best bet appears to be to accumulate your partial results into a different array than holds the results of the previous cycle. For example, you might flip between X and Y as data source and result, with a little pointer wrangling to grease the iterative wheels. Or you might do it a little more easily by using a 2D array instead of separate X and Y.
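The two-buffer idea can be sketched roughly as follows. This is only an illustration under some assumptions: the function name prefix_sum and the scratch buffers a and b are made up for the example, and the loop starts at k = 1 so that it also covers Stage II from the question. Each pass reads only from src and writes only to dst, so no iteration can see a value written during the same pass.

```c
#include <stdlib.h>
#include <string.h>

/* Inclusive prefix sum of x[0..n-1] into y[0..n-1] using two scratch buffers.
   Hypothetical helper for illustration. Returns 0 on success, -1 if malloc fails. */
int prefix_sum(const int *x, int *y, int n)
{
    if (n <= 0)
        return 0;

    int *a = malloc((size_t)n * sizeof *a);
    int *b = malloc((size_t)n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }
    memcpy(a, x, (size_t)n * sizeof *a);

    int *src = a, *dst = b;
    for (int k = 1; k < n; k += k) {
        /* Every iteration reads src and writes dst, so iterations are independent. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            dst[i] = (i >= k) ? src[i - k] + src[i] : src[i];

        /* Swap roles: the results of this pass become the input of the next. */
        int *tmp = src;
        src = dst;
        dst = tmp;
    }

    memcpy(y, src, (size_t)n * sizeof *y);
    free(a);
    free(b);
    return 0;
}
```

Called with X = 1, 5, 4, 2, 3 and p = 5, the passes for k = 1, 2, 4 leave 1, 6, 10, 12, 15 in Y. The pointer swap costs nothing per element; the price is one extra array of p integers.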

UPDATE for Stage III.
int num_threads = 8;
int k = 2;
while (k < p)
{
#pragma omp parallel for ordered num_threads(k < num_threads ? 1 : num_threads)
for (i = p - 1; i >= k; i--)
{
Y[i] = Y[i - k] + Y[i];
}
k += k;
}
The code above solved my problem. It now runs in parallel, except for the first few rounds.

Related

How to matrix inversion For 1 Dimension with c code?

Hello, I tried Gauss-Jordan elimination on a matrix stored in a 1D array, but it didn't work.
I want to find the inverse of the matrix. I found the determinant, but I don't know how to compute the inverse.
Our matrices:
double A[] = {6, 6 ,2, 4, 9 ,7, 4, 3 ,3};
double B[] = {6, 6 ,2, 4, 9 ,7, 4, 3 ,3};
double Final[9];
Function to calculate determinant:
int Inverse(double *A, double *C, int N){
int n = N;
int i, j, k;
float a[10][10] = { 0.0 };
double C[9] = { 0.0 };
float pivot = 0.0;
float factor = 0.0;
double sum = 0.0;
for (k = 1; k <= n - 1; k++)
{
if (a[k][k] == 0.0)
{
printf("error");
}
else
{
pivot = a[k][k];
for (j = k; j <= n + 1; j++)
a[k][j] = a[k][j] / pivot;
for (i = k + 1; i <= n; i++)
{
factor = a[i][k];
for (j = k; j <= n + 1; j++)
{
a[i][j] = a[i][j] - factor * a[k][j];
}
}
}
if (a[n][n] == 0)
printf("error");
else
{
C[n] = a[n][n + 1] / a[n][n];
for (i = n - 1; i >= 1; i--)
{
sum = 0.0;
for (j = i + 1; j <= n; j++)
sum = sum + a[i][j] * C[j];
C[i] = (a[i][n + 1] - sum) / a[i][i];
}
}
}
for (i = 1; i <= n; i++)
{
printf("\n\tx[%1d]=%10.4f", i, C[i]);
}
system("PAUSE");
return 0;
}
Although I tried very hard, I couldn't compute the inverse in C for a matrix stored in a one-dimensional array. The output is always 0. Can you help me find where I am making a mistake? Thank you.

It appears you are using C as an output parameter (to store the inverse); however, you also declare a local variable of the same name in the function. This causes the local variable to shadow (i.e.: hide) the output parameter; thus, changes you make to the C in the function do not affect the C the calling function sees.
To fix this issue, you need to remove the line double C[9] = {0}; from your function.
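The effect described here, where results written to a function-local array never reach the caller, can be sketched with two hypothetical helpers. The names fill_local and fill_output are invented for this illustration and are not part of the question's code.

```c
/* Writes into a function-local array. The caller's buffer is never touched,
   so the caller still sees whatever it had before the call. */
void fill_local(double *out, int n)
{
    double local[9] = { 0.0 };
    (void)out; /* deliberately unused: this is the mistake being illustrated */
    for (int i = 0; i < n && i < 9; i++)
        local[i] = 1.0;
}

/* Writes through the pointer parameter, so the caller sees the results. */
void fill_output(double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 1.0;
}
```

After fill_local(result, 9) a zero-initialised result array is still all zeros, while fill_output(result, 9) fills it with ones.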

OpenMP Segmentation Fault When Parallelizing Simple Loop

I have a function that takes in an array and updates an array in a for loop like such:
void compute(double values[], int num_points, double ders[]){
for(int i = 0; i < num_points; ++i)
{
double a = values[i* 3 + 0 ];
double b = values[i* 2 + 1 ];
ders[i*4 + 0] = a * b;
ders[i*4 + 1] = a * a;
ders[i*4 + 2] = b * b;
ders[i*4 + 3] = b * a * a;
}
}
All of this is well and good but then I update the code to try to do things in parallel with OpenMP like
void compute(double values[], int num_points, double ders[]){
omp_set_dynamic(0);
omp_set_num_threads(2);
#pragma omp parallel for
for(int i = 0; i < num_points; ++i)
{
double a = values[i* 3 + 0 ];
double b = values[i* 2 + 1 ];
ders[i*4 + 0] = a * b;
ders[i*4 + 1] = a * a;
ders[i*4 + 2] = b * b;
ders[i*4 + 3] = b * a * a;
}
}
And now I'm getting Segmentation Faults.
I feel like I must be overwriting some value in two threads -- but everything in ders and values is indexed according to 'i' so it feels like it should be trivial to parallelize.
What am I doing wrong here?

Making 2 loops run in parallel

I have a task on one of my work sheets asking me to add OpenMP directives to make both of these loops run in parallel.
{
for (i = ; i < N; i += )
{
D[i] = x * A[i] + x * B[i];
}
for (i = 0; i < N; i++)
{
C[i] = c * D[i];
}
}
I made a C file to add the OpenMP directives:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define THREADS 4
#define N 10
int main (int argc, char *argv[])
{
int i;
double A[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, B[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, C[N], D[N];
const double x = 5;
const double z = 5;
#pragma omp parallel for schedule(static) num_threads(THREADS)
for (i = 0; i < N; i += 10)
{
D[i] = z * A[i] + z * B[i];
printf("part 1 Thread %d is doing iteration %d: %f \n", omp_get_thread_num(), i, D[i]);
}
#pragma omp parallel for schedule(static) num_threads(THREADS)
for (i = 0; i < N; i++)
{
C[i] = x * D[i];
printf("part 2 Thread %d is doing iteration %d: %f \n", omp_get_thread_num(), i, C[i]);
}
return 0;
}
Part 1 does only one iteration, and then part 2 does all of its iterations. I'm not sure where I'm going wrong.

Part 1 does only one iteration because there is only one iteration to do:
#pragma omp parallel for schedule(static) num_threads(THREADS)
for (i = 0; i < N; i += 10)
where N expands to 10:
#define N 10
A second iteration never happens: after the first one, i is 10, which no longer satisfies i < N.
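To make the iteration count concrete, here is a small sketch that counts how often such a loop body runs for a given bound and step. The helper name count_iterations is made up for this example; the reduction clause is only there so that the counter stays correct when the loop is actually run by several threads.

```c
/* Counts how many times the body of
       for (i = 0; i < n; i += step)
   executes. Hypothetical helper; step is assumed to be positive. */
int count_iterations(int n, int step)
{
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i += step)
        count++;
    return count;
}
```

With n = 10, a step of 10 gives a count of 1 and a step of 1 gives a count of 10, which matches the behaviour seen in the question.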

Reading code without running it

So I have this code that I'm trying to figure out how to read without running it yet I can't seem to find out the pattern and the way to do so. I was hoping someone could give me an explanation of how to read it.
#include <stdio.h>
void mystery(int z[], int size);
void main()
{
int i;
int z[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
for (i = 0; i < 10; i++)
printf("%d", z[i]);
printf("\n\n");
mystery(z, 10);
for (i = 0; i < 10; i++)
printf("%d", z[i]);
printf("\n\n");
mystery(z, 7);
for (i = 10; i < 7; i++)
printf("%d", z[i]);
printf("\n\n");
}
void mystery(int z[], int n)
{
int i, temp;
for (i = 1; i < n / 2; i = i + 2)
{
temp = z[i];
z[i] = z[n - 1 - i];
z[n - 1 - i] = temp;
}
return;
}
When I run it, the code prints
1 2 3 4 5 6 7 8 9 10
1 9 3 7 5 6 4 8 2 10

The key is that you understand this loop:
for (i = 1; i < n / 2; i = i + 2)
{
temp = z[i];
z[i] = z[n - 1 - i];
z[n - 1 - i] = temp;
}
i = i + 2 is the step (the increment). The step is 2 rather than 1 here, which means that i takes the values 1, 3, 5, 7, ... for as long as i < n / 2. The body of the loop swaps z[i] with its mirror element z[n - 1 - i]; in the first iteration of the first call, for example, the values 2 and 9 (at indices 1 and 8) switch places.
Because the loop starts at i = 1 (and not i = 0), the first element is not affected by the loop. Remember that arrays start at index 0, so the second element has index 1, and that is where the loop starts. And since the step is 2, only every second element is swapped. I hope this answers your question.
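One way to practise reading such a loop is to write down each swap as it would happen. The sketch below does that mechanically; mystery_traced is a hypothetical variant of the function from the question that prints every swap before performing it.

```c
#include <stdio.h>

/* Same loop as mystery() in the question, with a trace line per swap.
   Hypothetical helper for illustration. */
void mystery_traced(int z[], int n)
{
    for (int i = 1; i < n / 2; i += 2) {
        printf("i=%d: swap z[%d]=%d with z[%d]=%d\n",
               i, i, z[i], n - 1 - i, z[n - 1 - i]);
        int temp = z[i];
        z[i] = z[n - 1 - i];
        z[n - 1 - i] = temp;
    }
}
```

For the array 1..10 and n = 10 the trace shows two swaps (i = 1 and i = 3); i = 5 already fails the test i < 5. A following call with n = 7 performs a single swap at i = 1, exchanging z[1] and z[5].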

Optimization of C code

For an assignment in a course called High Performance Computing, I am required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Using some recommendations, I managed to optimize the code (or at least I think so), such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
bitwise shifts instead of multiplication as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
According to my instructor, well-optimized code should have fewer or less costly instructions at the assembly level, and should therefore run in less time than the original code. The calculation used is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command gcc -o code_opt.s -S foobar.c,
the generated code has many more lines than the original despite the optimizations, and the run time is lower, but not by much. What am I doing wrong?
I am not pasting the assembly code because both listings are very long. I call the function foobar from main and measure the execution time using the time command in Linux:
int main () {
int a,b,N;
scanf ("%d %d %d",&a,&b,&N);
printf ("%d\n",foobar (a,b,N));
return 0;
}

Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them externally (and removing the loop that runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting:
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting - recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Another move to external (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both the above loop removals use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i^1, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
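As a rough check on the derivation above, the final four-term expression can be compared against the original nested loops. The function names foobar_loops and foobar_sums are made up for this sketch, and long long is used instead of int only to postpone overflow; that choice is an assumption, not part of the original code.

```c
/* Reference: the nested loops from the question, with the unused y removed. */
long long foobar_loops(int N)
{
    long long x = 0, k = 256;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 2 * k);
    return x;
}

/* Loop-free version built from the four summation formulas. Assumes N >= 0. */
long long foobar_sums(int N)
{
    long long n = N, k = 256;
    long long s0 = n + 1;                         /* Sum(1)   */
    long long s1 = n * (n + 1) / 2;               /* Sum(i)   */
    long long s2 = n * (n + 1) * (2 * n + 1) / 6; /* Sum(i^2) */
    long long s3 = n * n * (n + 1) * (n + 1) / 4; /* Sum(i^3) */

    return 4 * k * (n * n + n) * s0
         + (16 * k * n + 2 * (n * n + n) - 4 * k) * s1
         + (8 * n - 20 * k - 2) * s2
         - 10 * s3;
}
```

Each of the divisions by 2, 6 and 4 is exact for integer n, so the integer arithmetic loses nothing.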

y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum of squares and sum of cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise to you. As a result, i and j will also be removed.
a and b should also be removed if possible - since a and b are supplied as argument but never used in your code.
Sum of squares and sum of cubes formula:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
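The intermediate step, where the inner loop has been pushed into the outer one but the outer loop still exists, might look like the sketch below. The name foobar_single_loop and the temporaries m and s are invented for the illustration, and long long is an assumption made to postpone overflow.

```c
/* One loop over i; the sum over j = i+1..N is replaced by a formula. */
long long foobar_single_loop(int N)
{
    long long x = 0;
    for (long long i = 0; i <= N; i++) {
        long long m = N - i;           /* number of inner iterations */
        long long s = m * (N + i + 1); /* twice the sum of j for j = i+1..N */
        x += 8 * i * i * m + 4096 * i * m + 2 * i * s + 1024 * s;
    }
    return x;
}
```

This already turns the quadratic number of iterations into a linear one; expanding the polynomial and summing over i removes the remaining loop.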

This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Think before coding. Sometimes your brain can optimize code better than any compiler.
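Wrapped in a function, the formula can be spot-checked against values obtained by stepping through the original loops by hand (2048 for N = 1, 14352 for N = 2, 45148 for N = 3). The name foobar_closed_form and the use of long long are assumptions of this sketch.

```c
/* Closed-form replacement for the nested loops. Assumes N >= 0. */
long long foobar_closed_form(long long N)
{
    return N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
}
```

With plain int, as in the original signature, the intermediate product overflows for much smaller N than the final result would, so a wider type is the safer choice here.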

Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (ie, i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
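The stride idea described above can be sketched as follows. This is one possible reading of it, not necessarily what the answer's author had in mind: j1 is initialised to the value it would have for j = i and is bumped by 4 * t before each use, so inside the inner loop it always equals (j<<2) * t. The names foobar_strided, i1, j1 and step are chosen for the example.

```c
/* Inner loop with the multiplications replaced by running additions. */
long long foobar_strided(int N)
{
    long long x = 0;
    for (long long i = 0; i <= N; i++) {
        long long t = i + 512;
        long long i1 = 8 * i * t; /* (i<<3) * t, constant in the inner loop */
        long long step = 4 * t;   /* how much (j<<2) * t grows per iteration */
        long long j1 = 4 * i * t; /* value for j = i, advanced before first use */
        for (long long j = i + 1; j <= N; j++) {
            j1 += step;           /* now j1 == 4 * j * t */
            x += i1 + j1;
        }
    }
    return x;
}
```

The inner loop is left with two additions per iteration and no multiplication.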

A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
int i, j;
a = 0;
b = 512;
for (i = 0; i <= N; i++, b++) {
for (j = i + 1; j <= N; j++) {
a = a + ((i<<3) + (j<<2))*b;
}
}
return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are only used in a single multiple each: i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
int i, j, iLimit, jLimit;
a = 0;
b = 512;
iLimit = N << 3;
jLimit = N << 2;
for (i = 0; i <= iLimit; i+=8) {
for (j = (i >> 1) + 4; j <= jLimit; j+=4) { // parentheses needed: + binds tighter than >>
a = a + (i + j)*b;
}
b++;
}
return a;
}

OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// We only iterate up to N on the outer loop, since the
// last iteration doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 256) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total of (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.

int foobar(int N) // Drop the unused arguments a and b
{
int i, j, x=0; // Remove unused variables and operations to save stack space and machine cycles
for (i = N; i--; ) // Count down so the loop test needs no separate comparison
for (j = N+1; --j>i; )
x += (((i<<1)+j)*(i+512)<<2); // Use shifts instead of multiplications to save machine cycles
return x;
}
