Understanding #pragma acc kernels - c

I'm trying to optimize an n-body algorithm, and when I add #pragma acc kernels around the loop I don't understand the following compiler messages:
#pragma acc kernels
for (i = 0; i < n; i++)
{
    real fx, fy, fz;
    fx = fy = fz = 0;
    real iPosx = in[i].x;
    real iPosy = in[i].y;
    real iPosz = in[i].z;
    for (j = 0; j < n; j++)
    {
        real rx, ry, rz;
        rx = in[j].x - iPosx;
        ry = in[j].y - iPosy;
        rz = in[j].z - iPosz;
        real distSqr = rx*rx + ry*ry + rz*rz;
        distSqr += SOFTENING_SQUARED;
        real s = in[j].w / POW(distSqr, 1.5);
        real3 ff;
        ff.x = rx * s;
        ff.y = ry * s;
        ff.z = rz * s;
        fx += ff.x;
        fy += ff.y;
        fz += ff.z;
    }
    force[i].x = fx;
    force[i].y = fy;
    force[i].z = fz;
}
What do the messages "Generating implicit reduction(+:fx)", "Generating implicit reduction(+:fy)", and "Generating implicit reduction(+:fz)" mean?
Thank you

In order to parallelize the inner "j" loop, the three variables fx, fy, and fz must be in a sum reduction. The compiler has automatically detected this and therefore implicitly added the reduction for you. It's the same as if you had declared it explicitly, for example:
#pragma acc loop reduction(+:fx,fy,fz)
for (j = 0; j < n; j++)
{
real rx, ry, rz;
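For reference, here is a minimal sketch of the full loop nest with the reduction spelled out explicitly; it assumes real, SOFTENING_SQUARED, POW, in, and force are defined as in the original code, and it is equivalent to what the implicit reduction gives you:

#pragma acc kernels
for (i = 0; i < n; i++)
{
    real fx = 0, fy = 0, fz = 0;
    real iPosx = in[i].x;
    real iPosy = in[i].y;
    real iPosz = in[i].z;
    /* Explicit sum reduction over the inner loop's accumulators. */
    #pragma acc loop reduction(+:fx,fy,fz)
    for (j = 0; j < n; j++)
    {
        real rx = in[j].x - iPosx;
        real ry = in[j].y - iPosy;
        real rz = in[j].z - iPosz;
        real distSqr = rx*rx + ry*ry + rz*rz + SOFTENING_SQUARED;
        real s = in[j].w / POW(distSqr, 1.5);
        fx += rx * s;
        fy += ry * s;
        fz += rz * s;
    }
    force[i].x = fx;
    force[i].y = fy;
    force[i].z = fz;
}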

Related

Parallelise 2 for loops with OpenMP

So I have this function that I have to parallelize with OpenMP static scheduling for n threads:
void computeAccelerations(){
    int i,j;
    for(i=0;i<bodies;i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0;j<bodies;j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x,s*sji.y,s*sji.z};
                accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
            }
        }
    }
}
I tried to do something like:
void computeAccelerations_static(int num_of_threads){
    int i,j;
    #pragma omp parallel for num_threads(num_of_threads) schedule(static)
    for(i=0;i<bodies;i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0;j<bodies;j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x,s*sji.y,s*sji.z};
                accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
            }
        }
    }
}
It seems natural to just add #pragma omp parallel for num_threads(num_of_threads) schedule(static), but the result isn't correct.
I think there is some kind of false sharing with accelerations[i], but I don't know how to approach it. I appreciate any kind of help. Thank you.
In your loop nest, only the iterations of the outer loop are parallelized. Because i is the loop-control variable of the parallelized loop, each thread gets its own private copy, though as a matter of style it would be better to declare i in the loop control block.
j is another matter. It is declared outside the parallel region and it is not the control variable of a parallelized loop. As a result, it is shared among the threads. Because each of the threads executing i-loop iterations manipulates the shared variable j, you have a huge problem with data races. This can be resolved (among other alternatives) by moving the declaration of j into the parallel region, preferably into the control block of its associated loop.
Overall, then:
// int i, j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            vector sji = { positions[j].x - positions[i].x,
                           positions[j].y - positions[i].y,
                           positions[j].z - positions[i].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            vector S = { s * sji.x, s * sji.y, s * sji.z };
            accelerations[i].x += S.x;
            accelerations[i].y += S.y;
            accelerations[i].z += S.z;
        }
    }
}
Note also that computing sji appears to be wasteful, as in mathematical terms it is just -sij, and neither sji nor sij is modified. I would probably reduce the above to something more like this:
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            accelerations[i].x -= s * sij.x;
            accelerations[i].y -= s * sij.y;
            accelerations[i].z -= s * sij.z;
        }
    }
}

code execution slower in half-parallel OpenMP

Good day.
I want to implement the inner product in three ways:
1 - sequential
2 - half-parallel
3 - full-parallel
Half-parallel means the multiplication is done in parallel and the summation sequentially.
Here is my code:
int main(int argc, char *argv[]) {
    int *x, *y, *z, *w, xy_p, xy_s, xy_ss, i, N=5000;
    double s, e;
    x = (int *) malloc(sizeof(int)*N);
    y = (int *) malloc(sizeof(int)*N);
    z = (int *) malloc(sizeof(int)*N);
    w = (int *) malloc(sizeof(int)*N);
    for(i=0; i < N; i++) {
        x[i] = rand();
        y[i] = rand();
        z[i] = 0;
    }
    s = omp_get_wtime();
    xy_ss = 0;
    for(i=0; i < N; i++)
    {
        xy_ss += x[i] * y[i];
    }
    e = omp_get_wtime() - s;
    printf ( "[**] Sequential execution time is:\n%15.10f and <A,B> is %d\n", e, xy_ss );
    s = omp_get_wtime();
    xy_s = 0;
    #pragma omp parallel for shared ( N, x, y, z ) private ( i )
    for(i = 0; i < N; i++)
    {
        z[i] = x[i] * y[i];
    }
    for(i=0; i < N; i++)
    {
        xy_s += z[i];
    }
    e = omp_get_wtime() - s;
    printf ( "[**] Half-Parallel execution time is:\n%15.10f and <A,B> is %d\n", e, xy_s );
    s = omp_get_wtime();
    xy_p = 0;
    # pragma omp parallel shared (N, x, y) private(i)
    # pragma omp for reduction ( + : xy_p )
    for(i = 0; i < N; i++)
    {
        xy_p += x[i] * y[i];
    }
    e = omp_get_wtime() - s;
    printf ( "[**] Full-Parallel execution time is:\n%15.10f and <A,B> is %d\n", e, xy_p );
}
So I have some questions:
First, is my code correct?
Second, why is half-parallel slower than sequential?
Third, is 5000 a good size for parallelism?
And finally, why is sequential the fastest? Is it because N is only 5000?
A sample output:
Sequential execution time is:
0.0000196100 and dot is -1081001655
Half-Parallel execution time is:
0.0090819710 and dot is -1081001655
Full-Parallel execution time is:
0.0080959420 and dot is -1081001655
and for N=5000000
Sequential execution time is:
0.0150297650 and is -1629514371
Half-Parallel execution time is:
0.0292110600 and is -1629514371
Full-Parallel execution time is:
0.0072323760 and is -1629514371
Anyway, why is half-parallel the slowest?

C - parallelize recurrence omp

I have a problem: I have to parallelize this piece of code with OpenMP.
There is a data dependency and I don't know how to resolve it.
Any suggestions?
for (n = 2; n < N+1; n++) {
    dz = *(dynamic_d + n-1)*z;
    *(dynamic_A + n) = *(dynamic_A + n-1) + dz * (*(dynamic_A + n-2));
    *(dynamic_B + n) = *(dynamic_B + n-1) + dz * (*(dynamic_B + n-2));
}
You cannot parallelize the loop iterations due to the dependency, but you can split the computation of dynamic_A and dynamic_B across two threads using sections:
#pragma omp parallel sections
{
    #pragma omp section
    {
        // NOTE: Declare n and dz locally so that they are private!
        for (int n = 2; n < N+1; n++) {
            my_type dz = dynamic_d[n-1] * z;
            dynamic_A[n] = dynamic_A[n-1] + dz * dynamic_A[n-2];
        }
    }
    #pragma omp section
    {
        for (int n = 2; n < N+1; n++) {
            my_type dz = dynamic_d[n-1] * z;
            dynamic_B[n] = dynamic_B[n-1] + dz * dynamic_B[n-2];
        }
    }
}
Also, please use array indexing instead of the unholy pointer-arithmetic dereferencing in the original.
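For illustration, the original recurrence written with array indexing only (the same computation as the pointer-arithmetic version in the question, just easier to read):

for (n = 2; n < N + 1; n++) {
    dz = dynamic_d[n-1] * z;
    dynamic_A[n] = dynamic_A[n-1] + dz * dynamic_A[n-2];
    dynamic_B[n] = dynamic_B[n-1] + dz * dynamic_B[n-2];
}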

Code execution slower with OpenMP

I am trying to speed up the execution of the following code with OpenMP. The code calculates the Mandelbrot set and outputs it to a canvas.
The code works fine single-threaded, but I want to use OpenMP to make it faster. I have tried all sorts of combinations of private and shared variables, but nothing seems to work so far. The code always runs a little slower with OpenMP than without it (about 2 s slower at 50,000 iterations).
I am using Ubuntu 16.04 and compiling with GCC.
void calculate_mandelbrot(GLubyte *canvas, GLubyte *color_buffer, uint32_t w, uint32_t h, mandelbrot_f x0, mandelbrot_f x1, mandelbrot_f y0, mandelbrot_f y1, uint32_t max_iter) {
    mandelbrot_f dx = (x1 - x0) / w;
    mandelbrot_f dy = (y1 - y0) / h;
    uint16_t esc_time;
    int i, j;
    mandelbrot_f x, y;
    //timer start
    clock_t begin = clock();
    #pragma omp parallel for private(i,j,x,y,esc_time) shared(canvas, color_buffer)
    for(i = 0; i < w; ++i) {
        x = x0 + i * dx;
        for(j = 0; j < h; ++j) {
            y = y1 - j * dy;
            esc_time = escape_time(x, y, max_iter);
            canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
            canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
            canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
        }
    }
    //time calculation
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
}
The escape_time function that the code uses:
inline uint16_t escape_time(mandelbrot_f x0, mandelbrot_f y0, uint32_t max_iter) {
    mandelbrot_f x = 0.0;
    mandelbrot_f y = 0.0;
    mandelbrot_f xtemp;
    uint16_t iteration = 0;
    while((x*x + y*y < 4) && (iteration < max_iter)) {
        xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iteration++;
    }
    return iteration;
}
The code is from this repository https://github.com/hortont424/mandelbrot
First, as hinted in the comment, use omp_get_wtime() instead of clock() to measure the time (clock() gives you CPU time accumulated across all threads, not wall-clock time). Second, if I recall correctly, this algorithm has load-balancing problems, so try dynamic scheduling:
//timer start
double begin = omp_get_wtime();
#pragma omp parallel for private(j,x,y,esc_time) schedule(dynamic, 1)
for(i = 0; i < w; ++i) {
    x = x0 + i * dx;
    for(j = 0; j < h; ++j) {
        y = y1 - j * dy;
        esc_time = escape_time(x, y, max_iter);
        canvas[ GET_R(i, j, w) ] = color_buffer[esc_time * 3];
        canvas[ GET_G(i, j, w) ] = color_buffer[esc_time * 3 + 1];
        canvas[ GET_B(i, j, w) ] = color_buffer[esc_time * 3 + 2];
    }
}
//time calculation
double end = omp_get_wtime();
double time_spent = end - begin;  // omp_get_wtime() already returns seconds
printf("%f\n", time_spent);
As was suggested, my problem was caused by using the clock() function, which measures CPU time.
Using omp_get_wtime() instead solved my problem.
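To make the difference concrete, here is a small self-contained sketch (not from the original post): clock() accumulates CPU time over all threads, while omp_get_wtime() returns elapsed wall-clock time, so in a parallel region the clock()-based figure can be roughly the number of threads times larger:

/* Build with OpenMP enabled, e.g. gcc -fopenmp timing_demo.c */
#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void) {
    clock_t c0 = clock();
    double  w0 = omp_get_wtime();

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; i++)
        sum += (double)i;

    /* cpu counts CPU time summed over all threads; wall is elapsed time. */
    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    double wall = omp_get_wtime() - w0;
    printf("sum=%g cpu=%.3fs wall=%.3fs\n", sum, cpu, wall);
    return 0;
}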

OpenMP C Normal Random Generator

I was doing a C assignment for parallel computing, where I have to implement a Monte Carlo simulation with an efficient, thread-safe normal random generator using the Box-Muller transform. I generate two vectors of uniform random numbers X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Did anyone encounter something similar?
I'm using the following code:
double* StandardNormalRandom(long int N){
    double *X = NULL, *Y = NULL, *U = NULL;
    X = vUniformRandom_0(N / 2);
    Y = vUniformRandom(N / 2);
    #pragma omp parallel for
    for (i = 0; i < N/2; i++){
        U[2*i] = sqrt(-2 * log(X[i]))*sin(Y[i] * 2 * pi);
        U[2*i + 1] = sqrt(-2 * log(X[i]))*cos(Y[i] * 2 * pi);
    }
    return U;
}

double* NormalRandom(long int N, double mu, double sigma2)
{
    double *U = NULL, stdev = sqrt(sigma2);
    U = StandardNormalRandom(N);
    #pragma omp parallel for
    for (int i = 0; i < N; i++) U[i] = mu + stdev*U[i];
    return U;
}
Here is the relevant part of my UniformRandom function, also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N; j++)
{
    if (i == 0){
        int tn = omp_get_thread_num();
        I[tn] = S[tn];
        i++;
    }
    else
    {
        I[j] = (a*I[j - 1] + c) % m;
    }
}
}
#pragma omp parallel for
for (long int j = 0; j < N; j++)
    U[j] = (double)I[j] / (m+1.0);
In the StandardNormalRandom function, I will assume that the pointer U has been allocated with size N, in which case this function looks fine to me. The same goes for the function NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume some things): if the line I[j] = (a*I[j - 1] + c) % m; is the body of a loop under an omp parallel for, then you will have issues. Since you can't know the order in which the threads execute, the current thread (with a fixed value of j) can't rely on the value of I[j - 1], as this value could be modified at any time (I is shared by default).
Hope it helps!
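One common way to avoid this race, sketched below, is to give every thread its own private generator state so that no iteration depends on a value another thread writes. The function name, seeding scheme, and parameters here are illustrative assumptions, not the poster's code:

#include <omp.h>

/* Fills U[0..N-1] with uniform numbers using one independent LCG per thread. */
void UniformRandomParallel(double *U, long N,
                           unsigned long a, unsigned long c,
                           unsigned long m, unsigned long seed)
{
    #pragma omp parallel
    {
        /* Private LCG state, decorrelated by the thread id; the constant is
           just an arbitrary odd multiplier used for seeding. */
        unsigned long state = seed
            + 0x9E3779B9UL * (unsigned long)(omp_get_thread_num() + 1);
        #pragma omp for
        for (long j = 0; j < N; j++) {
            state = (a * state + c) % m;
            /* Maps to [0,1); shift or reject 0 if (0,1] is required. */
            U[j] = (double)state / (m + 1.0);
        }
    }
}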
