OpenMP parallelization not efficient - c

I'm trying to parallelize this code using OpenMP.
for (t_step = 0; t_step < Ntot; t_step++) {
    // current row
    if (cur_row + 1 < Npt_x) cur_row++;
    else cur_row = 0;
    // get data from file which update only the row "cur_row" of array val
    read_line(f_u, val[cur_row]);
    // computes
    for (i = 0; i < Npt_x; i++) {
        for (j = 0; j < Npt_y; j++) {
            i_corrected = cur_row - i;
            if (i_corrected < 0) i_corrected = Npt_x + i_corrected;
            R[i][j] += val[cur_row][0] * val[i_corrected][j] / Ntot;
        }
    }
}
with
- val and R declared as double **,
- Npt_x and Npt_y about 500,
- Ntot about 10^6.
I've done this
for (t_step = 0; t_step < Ntot; t_step++) {
    // current row
    if (cur_row + 1 < Npt_x) cur_row++;
    else cur_row = 0;
    // get data from file which update only the row "cur_row" of array val
    read_line(f_u, val[cur_row]);
    // computes
    #pragma omp parallel for collapse(2), private(i,j,i_corrected)
    for (i = 0; i < Npt_x; i++) {
        for (j = 0; j < Npt_y; j++) {
            i_corrected = cur_row - i;
            if (i_corrected < 0) i_corrected = Npt_x + i_corrected;
            R[i][j] += val[cur_row][0] * val[i_corrected][j] / Ntot;
        }
    }
}
The problem is that this doesn't seem to be efficient. Is there a way to use OpenMP more effectively in this case?
Many thanks.

Right now, I would try something like this:
for (t_step = 0; t_step < Ntot; t_step++) {
    // current row
    if (cur_row + 1 < Npt_x)
        cur_row++;
    else
        cur_row = 0;
    // get data from file which update only the row "cur_row" of array val
    read_line(f_u, val[cur_row]);
    // computes
    #pragma omp parallel for private(i,j,i_corrected)
    for (i = 0; i < Npt_x; i++) {
        i_corrected = cur_row - i;
        if (i_corrected < 0)
            i_corrected += Npt_x;
        double tmp = val[cur_row][0] / Ntot;
#if defined(_OPENMP) && _OPENMP > 201306
        #pragma omp simd
#endif
        for (j = 0; j < Npt_y; j++) {
            R[i][j] += tmp * val[i_corrected][j];
        }
    }
}
However, since the code is memory bound, it's not certain you'll get much parallel speed-up... Worth a try though.
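Another thing worth trying, sketched here as a suggestion rather than a tested fix: the parallel region above is re-created for every one of the ~10^6 time steps. Hoisting it outside the time loop keeps the thread team alive, and a single construct can do the row update and the file read (read_line() has to stay serialized anyway). Same variables as in the question:
#pragma omp parallel private(t_step, i, j, i_corrected)
{
    for (t_step = 0; t_step < Ntot; t_step++) {
        #pragma omp single
        {
            // one thread advances the row and reads the new line
            if (cur_row + 1 < Npt_x) cur_row++;
            else cur_row = 0;
            read_line(f_u, val[cur_row]);
        }   // implicit barrier: every thread now sees the updated row

        double tmp = val[cur_row][0] / Ntot;
        #pragma omp for
        for (i = 0; i < Npt_x; i++) {
            i_corrected = cur_row - i;
            if (i_corrected < 0) i_corrected += Npt_x;
            for (j = 0; j < Npt_y; j++)
                R[i][j] += tmp * val[i_corrected][j];
        }   // implicit barrier before the next time step
    }
}
The implicit barriers after single and after for keep the row update, the read, and the accumulation ordered across time steps, so cur_row and val stay consistent without any extra synchronization.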

Related

Loop through 2 arrays in one for loop?

Does anyone know how we can loop through two arrays in one for loop?
function setwinner() internal returns (address) {
    for (uint stime = 0 ; stime < squareStartTimeArray.length; stime++ & uint etime = 0; etime = squareEndTimeArray.length etime++) {
        if (winningTime >= stime & winningTime <= etime) {
            winningIndex = stime;
            if (assert(stime == etime) == true) {
                winningAddress = playerArray[stime];
            }
        }
    }
}
To loop through multiple arrays in the same loop, first make sure they both have the same length. Then you can use this:
require(arrayOne.length == arrayTwo.length);
for (uint i = 0; i < arrayOne.length; i++) {
    arrayOne[i] = ....;
    arrayTwo[i] = ....;
}

OpenMP outputs incorrect answers

I have typed this simple code to calculate the number of prime numbers between 2 and 5,000,000.
The algorithm works fine and outputs the correct answer; however, when I try to use OpenMP to speed up the execution, it outputs a different answer every time.
#include "time.h"
#include "stdio.h"
#include "omp.h"
int main()
{
clock_t start = clock();
int count = 1;
int x;
bool flag;
#pragma omp parallel for schedule(static,1) num_threads(2) shared(count) private(x,flag)
for (x = 3; x <= 5000000; x+=2)
{
flag = false;
if (x == 2 || x == 3)
count++;
else if (x % 2 == 0 || x % 3 == 0)
continue;
else
{
for (int i = 5; i * i <= x; i += 6)
{
if (x % i == 0 || x % (i + 2) == 0)
{
flag = true;
break;
}
}
if (!flag)
count++;
}
}
clock_t end = clock();
printf("The execution took %f ms\n", (double)end - start / CLOCKS_PER_SEC);
printf("%d\n", count);
}
The code doesn't work for any number of threads, with dynamic or static scheduling, or with different chunk sizes.
I have tried messing with private and shared variables, but it still didn't work, and declaring x and flag inside the for loop didn't work either.
I am using Visual Studio 2019 and I have OpenMP support enabled.
What's the problem with my code?
You have a race condition on your count variable: multiple threads can try to update it at the same time. The easy fix is to use an OpenMP reduction() clause to give each thread a private copy of the variable and have the copies added up properly at the end:
#include <time.h>
#include <stdio.h>
#include <stdbool.h>

int main(void)
{
    clock_t start = clock();
    int count = 1;
    #pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
    for (int x = 3; x <= 5000000; x += 2)
    {
        bool flag = false;
        if (x == 2 || x == 3)
            count++;
        else if (x % 2 == 0 || x % 3 == 0)
            continue;
        else
        {
            for (int i = 5; i * i <= x; i += 6)
            {
                if (x % i == 0 || x % (i + 2) == 0)
                {
                    flag = true;
                    break;
                }
            }
            if (!flag)
                count++;
        }
    }
    clock_t end = clock();
    printf("The execution took %f ms\n", (double)(end - start) * 1000.0 / CLOCKS_PER_SEC);
    printf("%d\n", count);
}
This outputs 348513 (Verified as the right number through other software).
Also note the cleaned-up headers and the variable declarations moved into the loop to avoid the need for a private() clause (the timing expression is also parenthesized correctly).
You could also make count an atomic int, but that's slower than using reduction() in my testing.
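For reference, a minimal sketch of that atomic variant: drop reduction(+:count) from the pragma, keep count shared, and protect each of the two count++ statements in the loop body:
// replaces a plain increment of the shared counter in the loop above
#pragma omp atomic
count++;
Every atomic update synchronizes the threads on that one memory location, which is why the reduction comes out faster.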
Just to add to the answer provided by @Shawn: besides solving the count race condition with the OpenMP reduction clause, you can also check whether your code has load-balancing issues. Looking at the iterations of the loop you are parallelizing, it is clear that not all iterations have the same amount of work. Since you are assigning work to threads statically, one thread may end up doing much more work than the other. Experiment with the dynamic schedule to see if you notice any difference, for example:
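(A sketch only; the chunk size of 1024 is an arbitrary starting point to tune, and the loop body stays the same as above.)
#pragma omp parallel for schedule(dynamic, 1024) num_threads(2) reduction(+:count)
With schedule(dynamic), chunks of iterations are handed to threads on demand, so a thread that finishes its cheap iterations early simply grabs more work instead of idling.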
Besides that, you can significantly simplify your sequential code by removing the conditional branches that hurt the performance of your parallel version.
First, you do not need (x == 2), since x starts at 3. You do not need (x == 3) either: remove it, set count = 2 (instead of count = 1), and start at int x = 5, since the loop increments in steps of 2 (i.e., x += 2). With this you can also remove:
if (x == 2 || x == 3)
    count++;
Now, because the loop starts at 5 and has a step of 2, it iterates over odd numbers only, so x % 2 == 0 can also be removed. That leaves if (x % 3 == 0) continue; else {..}, which can be simplified to if (x % 3 != 0) {..}.
You can also rewrite the code to remove the break:
#pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
for (int x = 5; x <= 5000000; x += 2) {
    bool flag = false;
    if (x % 3 != 0) {
        for (int i = 5; !flag && i * i <= x; i += 6) {
            flag = (x % i == 0 || x % (i + 2) == 0);
        }
        if (!flag) {
            count++;
        }
    }
}
And because you are using C/C++, you can even remove that last if as well:
int count = 2;
#pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
for (int x = 5; x <= 5000000; x += 2) {
    if (x % 3 != 0) {
        int flag = 1;
        for (int i = 5; flag && i * i <= x; i += 6) {
            flag = x % i != 0 && x % (i + 2) != 0;
        }
        count += flag;
    }
}
printf("%d\n", count);
IMO the code is more readable now; we could further improve it by giving the variable flag a better name.

Increasing n-body program performance using OpenMP

My goal is to increase the performance of a code that simulates the n-body problem.
This is where the time is measured. The two functions that need to be parallelized are calculate_forces() and move_bodies(), but since the loop control variable t is a double I cannot put a #pragma omp parallel for on that loop.
t0 = gettime ();
for (t = 0; t < t_end; t += dt)
{
    // draw bodies
    show_bodies (window);
    // computation
    calculate_forces ();
    move_bodies ();
}
// print out calculation speed every second
t0 = gettime () - t0;
The two functions calculate_forces() and move_bodies() with the respective directives that I used are the following:
static void
calculate_forces ()
{
    double distance, magnitude, factor, r;
    vector_t direction;
    int i, j;
    #pragma omp parallel private(distance,magnitude,factor,direction)
    {
        #pragma omp for private(i,j)
        for (i = 0; i < n_body - 1; i++)
        {
            for (j = i + 1; j < n_body; j++)
            {
                r = SQR (bodies[i].position.x - bodies[j].position.x) + SQR (bodies[i].position.y - bodies[j].position.y);
                // avoid numerical instabilities
                if (r < EPSILON)
                {
                    // this is not how nature works :-)
                    r += EPSILON;
                }
                distance = sqrt (r);
                magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
                factor = magnitude / distance;
                direction.x = bodies[j].position.x - bodies[i].position.x;
                direction.y = bodies[j].position.y - bodies[i].position.y;
                // +force for body i
                #pragma omp critical
                {
                    bodies[i].force.x += factor * direction.x;
                    bodies[i].force.y += factor * direction.y;
                    // -force for body j
                    bodies[j].force.x -= factor * direction.x;
                    bodies[j].force.y -= factor * direction.y;
                }
            }
        }
    }
}
static void
move_bodies ()
{
    vector_t delta_v, delta_p;
    int i;
    #pragma omp parallel private(delta_v,delta_p,i)
    {
        #pragma omp for
        for (i = 0; i < n_body; i++)
        {
            // calculate delta_v
            delta_v.x = bodies[i].force.x / bodies[i].mass * dt;
            delta_v.y = bodies[i].force.y / bodies[i].mass * dt;
            // calculate delta_p
            delta_p.x = (bodies[i].velocity.x + delta_v.x / 2.0) * dt;
            delta_p.y = (bodies[i].velocity.y + delta_v.y / 2.0) * dt;
            // update body velocity and position
            #pragma omp critical
            {
                bodies[i].velocity.x += delta_v.x;
                bodies[i].velocity.y += delta_v.y;
                bodies[i].position.x += delta_p.x;
                bodies[i].position.y += delta_p.y;
            }
            // reset forces
            bodies[i].force.x = bodies[i].force.y = 0.0;
            if (bounce)
            {
                // bounce on boundaries (i.e. it's more like billiards)
                if ((bodies[i].position.x < -body_distance_factor) || (bodies[i].position.x > body_distance_factor))
                    bodies[i].velocity.x = -bodies[i].velocity.x;
                if ((bodies[i].position.y < -body_distance_factor) || (bodies[i].position.y > body_distance_factor))
                    bodies[i].velocity.y = -bodies[i].velocity.y;
            }
        }
    }
}
The values of bodies.velocity and bodies.position are changed in the move_bodies() function, but I couldn't use a reduction there.
There is also a checksum function that checks whether the calculated checksum equals the reference checksum. That function looks like this:
static unsigned long
checksum ()
{
    unsigned long checksum = 0;
    // sum up the rounded body positions
    for (int i = 0; i < n_body; i++)
    {
        checksum += (unsigned long) round (bodies[i].position.x);
        checksum += (unsigned long) round (bodies[i].position.y);
    }
    return checksum;
}
This function uses the values of bodies.position.x and bodies.position.y calculated in move_bodies(), which is why I put a critical block around the updates of those values, but that still doesn't yield the correct answer. Can anyone give me some insight into where I am going wrong? Thank you in advance.
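One observation, offered as a suggestion rather than part of the original post: in move_bodies() every iteration only touches bodies[i], so the critical section there is not needed at all; the real race is in calculate_forces(), where the "-force for body j" update can collide with another thread's "+force" update. A minimal sketch of one standard workaround, reusing the question's bodies, n_body, G, EPSILON and SQR: loop over all ordered pairs so each thread only ever writes to its own bodies[i], at the cost of doing each pair's arithmetic twice.
static void
calculate_forces ()
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_body; i++)
    {
        for (int j = 0; j < n_body; j++)
        {
            if (i == j)
                continue;
            double r = SQR (bodies[i].position.x - bodies[j].position.x)
                     + SQR (bodies[i].position.y - bodies[j].position.y);
            // avoid numerical instabilities
            if (r < EPSILON)
                r += EPSILON;
            double distance = sqrt (r);
            double magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
            double factor = magnitude / distance;
            // only bodies[i] is written here, so no critical section is needed
            bodies[i].force.x += factor * (bodies[j].position.x - bodies[i].position.x);
            bodies[i].force.y += factor * (bodies[j].position.y - bodies[i].position.y);
        }
    }
}
With the force accumulation race gone, the critical block in move_bodies() can simply be removed, and the checksum should become reproducible run to run.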

Using thread-private variables in OpenMP, for __m128i SSE2 variables?

I need help multi-threading one super-simple yet super-nifty etude!
It is given below: the nine commented-out lines are the generic Longest Common SubString loop-in-loop implementation, while the code around them is the branchless SSE2 counterpart. The etude works just fine as it is, but when I try to multi-thread it (I have tried several ways), it randomly reports correct or incorrect results?!
#ifdef KamXMM
    printf("Branchless 128bit Assembly struggling ...\n");
    for (i = 0; i < size_inLINESIXFOUR2; i++) {
        XMMclone = _mm_set1_epi8(workK2[i]);
        //omp_set_num_threads(4);
#ifdef Commence_OpenMP
        //#pragma omp parallel for shared(workK,PADDED32,Matrix_vectorCurr,Matrix_vectorPrev) private(j,ThreadID) // Sometimes reports correctly sometimes NOT?!
#endif
        for (j = 0; j < PADDED32; j += (32/2)) {
            XMMprev = _mm_loadu_si128((__m128i*)(Matrix_vectorPrev+(j-1)));
            XMMcurr = _mm_loadu_si128((__m128i*)&workK[j]);
            XMMcmp  = _mm_cmpeq_epi8(XMMcurr, XMMclone);
            XMMand  = _mm_and_si128(XMMprev, XMMcmp);
            XMMsub  = _mm_sub_epi8(XMMzero, XMMcmp);
            XMMadd  = _mm_add_epi8(XMMand, XMMsub);
            _mm_storeu_si128((__m128i*)(Matrix_vectorCurr+j), XMMadd);
            // This doesn't work, sometimes reports 24 sometimes 23, (for Carlos vs Japan):
            //ThreadID=omp_get_thread_num();
            //if (ThreadID==0) XMMmax0 = _mm_max_epu8(XMMmax0, XMMadd);
            //if (ThreadID==1) XMMmax1 = _mm_max_epu8(XMMmax1, XMMadd);
            //if (ThreadID==2) XMMmax2 = _mm_max_epu8(XMMmax2, XMMadd);
            //if (ThreadID==3) XMMmax3 = _mm_max_epu8(XMMmax3, XMMadd);
            {
                XMMmax = _mm_max_epu8(XMMmax, XMMadd);
            }
            // if(workK[j] == workK2[i]){
            //     if (i==0 || j==0)
            //         *(Matrix_vectorCurr+j) = 1;
            //     else
            //         *(Matrix_vectorCurr+j) = *(Matrix_vectorPrev+(j-1)) + 1;
            //     if(max < *(Matrix_vectorCurr+j)) max = *(Matrix_vectorCurr+j);
            // }
            // else
            //     *(Matrix_vectorCurr+j) = 0;
        }
        // XMMmax = _mm_max_epu8(XMMmax, XMMmax0);
        // XMMmax = _mm_max_epu8(XMMmax, XMMmax1);
        // XMMmax = _mm_max_epu8(XMMmax, XMMmax2);
        // XMMmax = _mm_max_epu8(XMMmax, XMMmax3);
        _mm_storeu_si128((__m128i*)vector, XMMmax); // No need since it was last, yet...
        for (k = 0; k < 32/2; k++)
            if (max < vector[k]) max = vector[k];
        if (max >= 255) { printf("\nWARNING! LCSS >= 255 found, cannot house it within BYTE long cell! Exit.\n"); exit(13); }
        printf("%s; Done %d%% \r", Auberge[Melnitchka++], (int)(((double)i*100/size_inLINESIXFOUR2)));
        Melnitchka = Melnitchka & 3; // 0 1 2 3: 00 01 10 11
        Matrix_vectorSWAP = Matrix_vectorCurr;
        Matrix_vectorCurr = Matrix_vectorPrev;
        Matrix_vectorPrev = Matrix_vectorSWAP;
    }
#endif
My wish is to have it boosted to the point where it saturates the memory bandwidth; on my laptop with an i5-7200U it traverses the rows at 5 GB/s, whereas memcpy() runs at about 12 GB/s.
My comprehension of OpenMP is superficial. I have managed to multi-thread non-vector code (with #pragma omp sections nowait), but vectors are problematic: how do I tell the compiler that XMMmax has to be private?!
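One possible route, an assumption on my part rather than something from the post: with OpenMP 4.0 or later you can declare a user-defined reduction over __m128i, so every thread keeps a private running maximum that the runtime merges with _mm_max_epu8 when the loop finishes. That is exactly the role the per-thread XMMmax0..XMMmax3 copies were meant to play. A self-contained sketch on made-up data, not the original arrays:
#include <emmintrin.h>   // SSE2 intrinsics, __m128i
#include <stdint.h>
#include <stdio.h>

// Combiner and per-thread initial value for the vector byte-wise maximum.
#pragma omp declare reduction(xmm_max : __m128i : \
        omp_out = _mm_max_epu8(omp_out, omp_in)) \
        initializer(omp_priv = _mm_setzero_si128())

int main(void)
{
    uint8_t data[1024];
    for (int i = 0; i < 1024; i++) data[i] = (uint8_t)(i * 37u);

    __m128i vmax = _mm_setzero_si128();
    // Each thread updates its own private vmax; copies are merged at the end.
    #pragma omp parallel for reduction(xmm_max : vmax)
    for (int j = 0; j < 1024; j += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)&data[j]);
        vmax = _mm_max_epu8(vmax, v);
    }

    uint8_t lanes[16];
    _mm_storeu_si128((__m128i *)lanes, vmax);
    uint8_t max = 0;
    for (int k = 0; k < 16; k++)
        if (lanes[k] > max) max = lanes[k];
    printf("max byte = %d\n", max);
    return 0;
}
The same declare reduction could then be applied to the inner j loop of the question, with XMMmax listed in the reduction clause; it needs a compiler with OpenMP 4.0 support (e.g. -fopenmp on GCC or Clang).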

OpenMP parallel for loop

void calc_mean(float *left_mean, float *right_mean, const uint8_t *left, const uint8_t *right,
               int32_t block_width, int32_t block_height, int32_t d,
               uint32_t w, uint32_t h, int32_t i, int32_t j)
{
    *left_mean = 0;
    *right_mean = 0;
    int32_t i_b;
    float local_left = 0, local_right = 0;
    for (i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {
        #pragma omp parallel for reduction(+:local_left,local_right)
        for (int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {
            // Borders checking
            if (!(i+i_b >= 0) || !(i+i_b < h) || !(j+j_b >= 0) || !(j+j_b < w) || !(j+j_b-d >= 0) || !(j+j_b-d < w)) {
                continue;
            }
            // Calculating indices of the block within the whole image
            int32_t ind_l = (i+i_b)*w + (j+j_b);
            int32_t ind_r = (i+i_b)*w + (j+j_b-d);
            // Updating the block means
            //*left_mean += *(left+ind_l);
            //*right_mean += *(right+ind_r);
            local_left += left[ind_l];
            local_right += right[ind_r];
        }
    }
    *left_mean = local_left/(block_height * block_width);
    *right_mean = local_right/(block_height * block_width);
}
This now makes the program execution longer than the non-threaded version. I added private(left,right), but that leads to a bad memory access for ind_l.
I think this should get you closer to what you want, although I'm not quite sure about one final part.
float local_left = 0, local_right = 0;
for (int32_t i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {
    #pragma omp parallel for schedule(static, CORES) reduction(+:local_left, local_right)
    for (int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {
        if (your conditions) continue;
        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);
        local_left += *(left+ind_l);
        local_right += *(right+ind_r);
    }
}
*left_mean = local_left/(block_height * block_width);
*right_mean = local_right/(block_height * block_width);
The part I am unsure of is whether you need the schedule() clause. For summing into two variables, a single reduction clause can list both, as above:
reduction(+:local_left, local_right)
EDIT: some reference for the schedule() http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html#Loopschedules
It looks like you do not need this, but using it could produce a better runtime
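One more idea, offered as an assumption rather than part of this answer: the block is only block_width iterations wide, so the inner loop alone may be too short to amortize the cost of spinning up the threads, which would explain the slowdown. Collapsing both block loops gives the threads more iterations to share; a sketch using the question's variables:
float local_left = 0.0f, local_right = 0.0f;
#pragma omp parallel for collapse(2) reduction(+:local_left, local_right)
for (int32_t i_b = -(block_height-1)/2; i_b < (block_height-1)/2; i_b++) {
    for (int32_t j_b = -(block_width-1)/2; j_b < (block_width-1)/2; j_b++) {
        // same border check as in the question
        if (!(i+i_b >= 0) || !(i+i_b < h) || !(j+j_b >= 0) || !(j+j_b < w) ||
            !(j+j_b-d >= 0) || !(j+j_b-d < w))
            continue;
        int32_t ind_l = (i+i_b)*w + (j+j_b);
        int32_t ind_r = (i+i_b)*w + (j+j_b-d);
        local_left  += left[ind_l];
        local_right += right[ind_r];
    }
}
*left_mean  = local_left  / (block_height * block_width);
*right_mean = local_right / (block_height * block_width);
Since the border check makes some iterations cheaper than others, a dynamic schedule may also be worth trying here.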
