Actual difference between 2 ways of equal parallelism using omp threads - c

I am trying to parallelize my program using OpenMP threads.
What I am doing is the following, and it works perfectly:
#pragma omp parallel num_threads(threadnum) \
    default(none) shared(scoreBoard, nDiag, qlength, dlength) private(nEle, i, si, sj, ai, aj, max)
{
    for (i = 1; i < nDiag; ++i)
    {
        if (i <= qlength && i <= dlength) nEle = i;
        else if (i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
        else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
        calcfirstele(%si, %sj);
        #pragma omp for
        for (j = 1; j <= nEle; ++j)
        {
            ai = si - j + 1;
            aj = sj + j - 1;
            max = searchmax(ai, aj);
            scoreBoard[ai][aj] = max;
        }
    }
}
But isn't it equivalent to:
for (i = 1; i < nDiag; ++i)
{
    if (i <= qlength && i <= dlength) nEle = i;
    else if (i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
    else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj);
    #pragma omp parallel num_threads(threadnum) \
        default(none) shared(scoreBoard) private(nEle, i, si, sj, ai, aj, max)
    #pragma omp for
    for (j = 1; j <= nEle; ++j)
    {
        ai = si - j + 1;
        aj = sj + j - 1;
        max = searchmax(ai, aj);
        scoreBoard[ai][aj] = max;
    }
}
Why does my program take more time than the serial version when I use the second one, whereas the first one runs a lot faster than the serial version? I can't understand the difference between them.

Your second snippet is wrong and has undefined behavior.
The reason is that by declaring nEle, si and sj private, you create local (per-thread) copies of these variables without giving them any value. Therefore nEle in particular, which is the upper bound of your for loop, can hold any value, which likely increases the length of your computation quite dramatically.
To fix your code, the snippet you gave should look like this (with a few simplifications, and obviously not tested):
for (int i = 1; i < nDiag; ++i) {
    if (i <= qlength && i <= dlength)
        nEle = i;
    else if (i <= findmax(qlength, dlength))
        nEle = findmin(qlength, dlength);
    else
        nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj); // not sure what this is supposed to mean...
    // nEle, si and sj stay shared, so every thread sees the values set above
    #pragma omp parallel for num_threads(threadnum) private(ai, aj, max)
    for (int j = 1; j <= nEle; ++j) {
        ai = si - j + 1;
        aj = sj + j - 1;
        max = searchmax(ai, aj);
        scoreBoard[ai][aj] = max;
    }
}
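As a minimal, self-contained sketch of the same idea (not the asker's actual program): the flat board array, the fill_diagonal name and the cell_score stand-in for searchmax are assumptions made purely for illustration. The loop bound and the start cell stay shared, so every thread reads the values computed before the parallel loop, and only the variables declared inside the loop body are per-thread. If a per-thread copy were really needed, firstprivate would copy the value in, whereas private leaves the copy uninitialized.

```c
/* Hypothetical stand-in for searchmax(): any pure function of the cell indices. */
static int cell_score(int ai, int aj)
{
    return ai * 10 + aj;
}

/* Fill one anti-diagonal of a board with `cols` columns, starting at cell (si, sj).
   The board is stored as a flat row-major array (an assumption of this sketch). */
void fill_diagonal(int *board, int cols, int si, int sj, int n_ele)
{
    /* n_ele, si and sj are only read inside the region, so they stay shared
       and keep the values computed by the serial code before the pragma. */
    #pragma omp parallel for
    for (int j = 1; j <= n_ele; ++j) {
        /* Declared inside the loop body, so each thread has its own copy. */
        int ai = si - j + 1;
        int aj = sj + j - 1;
        board[ai * cols + aj] = cell_score(ai, aj);
    }
}
```

For example, calling fill_diagonal(board, 4, 2, 0, 3) on a zeroed 4 x 4 board writes the cells (2,0), (1,1) and (0,2) and leaves the rest untouched.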

Related

Intrinsics load matrix

I'm learning intrinsics. I don't know how to load a matrix correctly. I want to do matrix multiplication.
This is my code:
int i, j, k;
__m128 mat2values = _mm_setzero_ps();
__m128 mat1values = _mm_setzero_ps();
__m128 r = _mm_setzero_ps();
for (i = 0; i < N; ++i)
{
    for (j = 0; j < N - 3; j += 4)
    {
        for (k = 0; k < N - 3; k += 4)
        {
            mat1values = _mm_load_ps(&mat1[i][k]);
            mat2values = _mm_load_ps(&mat2[k][j]);
            r = _mm_add_ps(r, _mm_mul_ps(mat1values, mat2values));
        }
        result[i][j] = r.m128_f32[0] + r.m128_f32[1] + r.m128_f32[2] + r.m128_f32[3];
        for (; k < N; k++)
            result[i][j] += mat1[i][j] * mat2[k][j];
    }
}
When debugging, result still holds all 0 values after the loop.
Are you sure the expression
_mm_load_ps(mat1[i][k])
returns the correct memory address in float*?
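For comparison, here is a minimal sketch of one common way to lay out the loads for row-major matrices; it is an illustration under stated assumptions, not a fix of the exact code above. The matmul_sse name and the flat n x n float arrays are hypothetical. In the question's loop, the mat1 load covers columns k..k+3 while the mat2 load covers columns j..j+3 of a single row, so the element-wise product pairs up mismatched indices. The sketch instead broadcasts one element of the left matrix and multiplies it with four adjacent elements of a row of the right matrix. It uses _mm_loadu_ps and _mm_storeu_ps because _mm_load_ps requires 16-byte aligned addresses.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* c = a * b for row-major n x n float matrices stored as flat arrays. */
void matmul_sse(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i) {
        int j = 0;
        /* Vector part: compute four adjacent output columns at once. */
        for (; j + 4 <= n; j += 4) {
            __m128 acc = _mm_setzero_ps();  /* reset for every output block */
            for (int k = 0; k < n; ++k) {
                __m128 av = _mm_set1_ps(a[i * n + k]);     /* broadcast a[i][k] */
                __m128 bv = _mm_loadu_ps(&b[k * n + j]);   /* b[k][j..j+3] */
                acc = _mm_add_ps(acc, _mm_mul_ps(av, bv));
            }
            _mm_storeu_ps(&c[i * n + j], acc);
        }
        /* Scalar tail for the columns left over when n is not a multiple of 4. */
        for (; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Because each accumulator lane already holds one finished output element, no horizontal sum over the lanes is needed.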

How to parallelise a code inside a while using OpenMP

I am trying to parallelise the heat_plate algorithm but I am stuck at this bit of code inside my while:
while(1)
{
    .....
    .....
    #pragma omp parallel shared(diff, u, w) private(i, j, my_diff)
    {
        my_diff = 0.0;
        #pragma omp for
        for (i = 1; i < M - 1; i++)
        {
            for (j = 1; j < N - 1; j++)
            {
                if ( my_diff < fabs (w[i][j] - u[i][j]))
                {
                    my_diff = fabs (w[i][j] - u[i][j]);
                }
            }
        }
        #pragma omp critical
        {
            if (diff < my_diff)
            {
                diff = my_diff;
            }
        }
    }
    ....
    ....
}
Not only can't I get it to work in parallel, it actually takes longer to finish.
Edit: the program does run in parallel.
Thank you in advance for your help.
In OpenMP this data dependency:
for (i = 1; i < M - 1; i++)
    for (j = 1; j < N - 1; j++)
        if ( my_diff < fabs (w[i][j] - u[i][j]))
            my_diff = fabs (w[i][j] - u[i][j]);
is typically solved using the OpenMP reduction feature, which in your case avoids the critical region (after the parallel for) and, consequently, improves the overall performance of the parallelization. If you apply that feature, your code would look like the following:
#pragma omp parallel shared(u, w) private(i, j)
{
    #pragma omp for reduction(max:diff)
    for (i = 1; i < M - 1; i++)
        for (j = 1; j < N - 1; j++)
            if ( diff < fabs (w[i][j] - u[i][j]))
                diff = fabs (w[i][j] - u[i][j]);
}
In turn, you can merge both pragmas into one:
#pragma omp parallel for reduction(max:diff) shared(u, w) private(i, j)
for (i = 1; i < M - 1; i++)
    for (j = 1; j < N - 1; j++)
        if ( diff < fabs (w[i][j] - u[i][j]))
            diff = fabs (w[i][j] - u[i][j]);
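A self-contained sketch of the same reduction, wrapped in a function so it can be tried in isolation. The max_abs_diff name and the flat row-major arrays are assumptions for illustration; the original code uses 2D arrays u and w. The max reduction operator needs OpenMP 3.1 or later. Each thread keeps its own partial maximum, and OpenMP combines the partial results with the value diff had before the loop, so diff has to be reset before each sweep.

```c
#include <math.h>

/* Largest |w - u| over the interior cells of two m x n grids (flat, row-major). */
double max_abs_diff(int m, int n, const double *w, const double *u)
{
    double diff = 0.0;  /* reset before every sweep */

    /* Every thread works on a private copy of diff; the copies are merged
       with max when the loop ends, so no critical section is needed. */
    #pragma omp parallel for reduction(max:diff)
    for (int i = 1; i < m - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            double d = fabs(w[i * n + j] - u[i * n + j]);
            if (diff < d)
                diff = d;
        }
    }
    return diff;
}
```

Inside the while loop this would replace the whole parallel region, with the returned value compared against the convergence threshold.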

Convolution operation without conditional loop

I am writing a convolution operation for a filter and a signal. The accumulation is only valid when "j - k" is not negative. Is there a way to remove this condition, for example by splitting the loops, so that the conditional clause is avoided?
for (i = 0; i < RBs ; i++) // Over Resource Blocks
{
    for (j = 0; j < (IFFT_Len + Fil_Len -1); j++) // Over Output Length
    {
        acc = 0;
        for (k = 0; k < Fil_Len; k++) // over conv operation
        {
            if (j-k >= 0)
            {
                acc += Filter[k + (i * fil_data)] * IFFT[j - k + (i * ifft_data)];
            }
        }
        x[j] = acc;
    }
    UFMC_sig += x;
}
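One possible way to drop the test from the inner loop is to fold it into the loop bounds: the condition j - k >= 0 is the same as k <= j, so k only has to run up to min(j, Fil_Len - 1). The sketch below handles a single resource block. The conv_full name and its parameters are hypothetical, and the lower bound that keeps j - k inside the signal is an added assumption, since the original loop does not check it.

```c
/* Full convolution of one block: out must hold fil_len + sig_len - 1 elements. */
void conv_full(const float *filter, int fil_len,
               const float *signal, int sig_len,
               float *out)
{
    int out_len = fil_len + sig_len - 1;

    for (int j = 0; j < out_len; j++) {
        /* Lower bound keeps j - k < sig_len (assumption, not in the original). */
        int k_lo = (j >= sig_len) ? j - sig_len + 1 : 0;
        /* Upper bound keeps j - k >= 0, replacing the if inside the loop. */
        int k_hi = (j < fil_len) ? j : fil_len - 1;

        float acc = 0.0f;
        for (int k = k_lo; k <= k_hi; k++)
            acc += filter[k] * signal[j - k];
        out[j] = acc;
    }
}
```

The two bound computations still branch once per output sample, but the inner accumulation loop is branch-free, which is usually the part that matters for vectorization.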

CUDA parallelizing work with arrays

I am new to CUDA. I have just read some NVIDIA tutorials about CUDA and I need some help. Here is the code:
//some includes
#define NUM_OF_ACCOMS 3360
#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256
//...some code...
for (i = 1; i <= SIZE_RING; i++) {
    for (j = 1; j <= SIZE_RING; j++) {
        if (j == i) continue;
        for (k = 1; k <= SIZE_RING; k++) {
            if (k == j || k == i) continue;
            accoms_theta[indOfAccoms][0] = i - 1; accoms_theta[indOfAccoms][1] = j - 1; accoms_theta[indOfAccoms][2] = k - 1;
            accoms_thetaFix[indOfAccoms][0] = i - 1; accoms_thetaFix[indOfAccoms][1] = j - 1; accoms_thetaFix[indOfAccoms][2] = k - 1;
            results[indOfAccoms][0] = results[indOfAccoms][1] = results[indOfAccoms][2] = 0;
            indOfAccoms++;
        }
    }
}
for (i = 0; i < SIZE_RING; i++)
    for (j = 0; j < SIZE_RING; j++) {
        bigramms[indOfBigramms][0] = i; bigramms[indOfBigramms][1] = j;
        indOfBigramms++;
    }
for (i = 0; i < NUM_OF_ACCOMS; i++) {
    thetaArr[0] = accoms_theta[i][0]; thetaArr[1] = accoms_theta[i][1]; thetaArr[2] = accoms_theta[i][2];
    d0 = thetaArr[2] - thetaArr[1]; d1 = thetaArr[2] - thetaArr[0];
    if (d0 < 0)
        d0 += SIZE_RING;
    if (d1 < 0)
        d1 += SIZE_RING;
    for (j = 0; j < NUM_OF_ACCOMS; j++) {
        theta_fixArr[0] = accoms_thetaFix[j][0]; theta_fixArr[1] = accoms_thetaFix[j][1]; theta_fixArr[2] = accoms_thetaFix[j][2];
        d0_fix = theta_fixArr[2] - theta_fixArr[1]; d1_fix = theta_fixArr[2] - theta_fixArr[0];
        count = 0;
        if (d0_fix < 0)
            d0_fix += SIZE_RING;
        if (d1_fix < 0)
            d1_fix += SIZE_RING;
        for (k = 0; k < NUM_OF_BIGRAMMS; k++) {
            diff0 = subst[(d0 + bigramms[k][0]) % SIZE_RING] - subst[bigramms[k][0]];
            diff1 = subst[(d1 + bigramms[k][1]) % SIZE_RING] - subst[bigramms[k][1]];
            if (diff0 < 0)
                diff0 += SIZE_RING;
            if (diff1 < 0)
                diff1 += SIZE_RING;
            if (diff0 == d0_fix && diff1 == d1_fix)
                count++;
        }
        if (max < count) {
            max = count;
            results[indResults][0] = max; results[indResults][1] = i; results[indResults][2] = j;
            count = 0;
            indResults++;
        }
    }
}
As you can see, there are two main loops over the i and j variables. For each array from accoms_theta I need to check the condition against each array from accoms_thetaFix (subst is an int array with SIZE_RING elements). Checking ALL arrays takes 3360 × 3360 × 256 ≈ 2.9 × 10^9 (roughly 2^31) inner-loop iterations. Because I am new to CUDA, I need some help parallelizing my algorithm.
Here is some info about my device:
GeForce GT730M
Compute Capability 3.5
Global Memory 2 GB
Shared Memory Per Block 48 KB
Max Threads Per Block 1024
Number of multiprocessors 2
Max Threads Dim 1024 : 1024 : 64
Max Grid Dim 2*(10 ^ 9) : 65535 : 65535
I will not go into the specific details of whatever it is you're trying to compute, but I will make a suggestion regarding what you might do.
A straightforward approach to parallelizing a serial algorithm in CUDA (or OpenCL, or OpenMP even) is to "parallelize for loops". In the context of CUDA that means instead of having a single thread iterate over values of some index i, you have different GPU threads work on the different values of i (or - one thread for every several values of i).
This can be done with nested loops, e.g. with two indices i and j corresponding to two dimensions of your kernel launch grid.
However - doing this 'naively' is only possible for embarrassingly parallel problems - where there are no dependencies between the data to be computed/written by each of the threads (e.g. for each combination of i and j). Also, if the data that's read for different i and j overlaps, or is interleaved, additional care is required to prevent reading the same data repeatedly, degrading performance.
Try this approach. If it fails, or if you reach the conclusion that it cannot apply, please ask another question - but in that question we will need a Minimal, Complete, Verifiable Example - which you have not provided for this question.
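Since the answer mentions that the same idea applies to OpenMP, here is a rough sketch of it in plain C with OpenMP rather than CUDA. It is not the asker's algorithm: pair_count is a hypothetical stand-in for the inner k loop, and fill_counts and best_pair are illustrative names. The point is the split into two phases. The per-pair counts do not depend on each other, so the i and j loops can be distributed across threads, while the running maximum depends on the visiting order and is kept as a serial scan afterwards.

```c
/* Hypothetical stand-in for the inner k loop: any pure function of (i, j). */
static int pair_count(int i, int j)
{
    return (i * 7 + j * 3) % 11;
}

/* Phase 1: every (i, j) cell is independent, so both loops are parallelized. */
void fill_counts(int n, int *counts)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            counts[i * n + j] = pair_count(i, j);
}

/* Phase 2: the running maximum depends on the visiting order, so it stays serial. */
int best_pair(int n, const int *counts, int *best_i, int *best_j)
{
    int max = 0;
    *best_i = -1;
    *best_j = -1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (max < counts[i * n + j]) {
                max = counts[i * n + j];
                *best_i = i;
                *best_j = j;
            }
    return max;
}
```

In a CUDA version, phase 1 would map i and j to the two dimensions of the launch grid, as the answer describes.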

Unusual behaviour in an OpenMP program

I have a program that I tried to parallelize with OpenMP.
The output is still correct (I tested it over multiple runs), but the times I get are quite odd.
The single-threaded version takes 0.1 seconds.
With 2 threads I get 0.05 seconds, but with 4 I get 0.15 seconds.
How is this possible?
I am just using simple parallel for loops.
#pragma omp parallel for private(i, j)
for(i = 1; i <= total_height; i++){
    for(j = 1; j <= total_width; j++){
        int current_neighbours = neighbours[i][j];
        // if(i == 2 && j == 1)
        //     printf("%d%d\n", current_neighbours, neighbours[2][1]);
        if(current_neighbours == 0 || current_neighbours == 1 || current_neighbours > 3){
            if(map[i][j] == 1){
                update_maps(i, j, 0);
            }
        }
        else if(current_neighbours == 3){
            if(map[i][j] == 0){
                update_maps(i, j, 1);
            }
        }
    }
}
The update_maps function looks like this:
void update_maps(int i, int j, int value){
    map[i][j] = value;
    int k, neighbouri, neighbourj;
    int num_of_thread = omp_get_thread_num();
    if(value == 0)
        value = -1;
    for(k = 0; k < 8 ; k++){
        neighbouri = i + di[k];
        neighbourj = j + dj[k];
        if(in_map(neighbouri, neighbourj)){
            neighbouri--;
            neighbourj--;
            modify[neighbouri * total_height + neighbourj + (total_height * total_width * num_of_thread)] += value;
        }
    }
}
