CUDA parallelizing work with arrays

I am new to CUDA. I have just read some NVIDIA tutorials about CUDA and I need some help. Here is the code:
//some includes
#define NUM_OF_ACCOMS 3360
#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256
//...some code...
for (i = 1; i <= SIZE_RING; i++) {
    for (j = 1; j <= SIZE_RING; j++) {
        if (j == i) continue;
        for (k = 1; k <= SIZE_RING; k++) {
            if (k == j || k == i) continue;
            accoms_theta[indOfAccoms][0] = i - 1; accoms_theta[indOfAccoms][1] = j - 1; accoms_theta[indOfAccoms][2] = k - 1;
            accoms_thetaFix[indOfAccoms][0] = i - 1; accoms_thetaFix[indOfAccoms][1] = j - 1; accoms_thetaFix[indOfAccoms][2] = k - 1;
            results[indOfAccoms][0] = results[indOfAccoms][1] = results[indOfAccoms][2] = 0;
            indOfAccoms++;
        }
    }
}
for (i = 0; i < SIZE_RING; i++)
    for (j = 0; j < SIZE_RING; j++) {
        bigramms[indOfBigramms][0] = i; bigramms[indOfBigramms][1] = j;
        indOfBigramms++;
    }
for (i = 0; i < NUM_OF_ACCOMS; i++) {
    thetaArr[0] = accoms_theta[i][0]; thetaArr[1] = accoms_theta[i][1]; thetaArr[2] = accoms_theta[i][2];
    d0 = thetaArr[2] - thetaArr[1]; d1 = thetaArr[2] - thetaArr[0];
    if (d0 < 0)
        d0 += SIZE_RING;
    if (d1 < 0)
        d1 += SIZE_RING;
    for (j = 0; j < NUM_OF_ACCOMS; j++) {
        theta_fixArr[0] = accoms_thetaFix[j][0]; theta_fixArr[1] = accoms_thetaFix[j][1]; theta_fixArr[2] = accoms_thetaFix[j][2];
        d0_fix = theta_fixArr[2] - theta_fixArr[1]; d1_fix = theta_fixArr[2] - theta_fixArr[0];
        count = 0;
        if (d0_fix < 0)
            d0_fix += SIZE_RING;
        if (d1_fix < 0)
            d1_fix += SIZE_RING;
        for (k = 0; k < NUM_OF_BIGRAMMS; k++) {
            diff0 = subst[(d0 + bigramms[k][0]) % SIZE_RING] - subst[bigramms[k][0]];
            diff1 = subst[(d1 + bigramms[k][1]) % SIZE_RING] - subst[bigramms[k][1]];
            if (diff0 < 0)
                diff0 += SIZE_RING;
            if (diff1 < 0)
                diff1 += SIZE_RING;
            if (diff0 == d0_fix && diff1 == d1_fix)
                count++;
        }
        if (max < count) {
            max = count;
            results[indResults][0] = max; results[indResults][1] = i; results[indResults][2] = j;
            count = 0;
            indResults++;
        }
    }
}
As you can see, there are two main loops, over i and j. For each array in accoms_theta, I need to check the condition against each array in accoms_thetaFix (subst is an int array with SIZE_RING elements). Checking all pairs takes 3360 x 3360 x 256 inner-loop iterations, i.e. about 2.9 x 10^9 (between 2^31 and 2^32). Since I am new to CUDA, I need some help parallelizing my algorithm.
Here is some info about my device
GeForce GT730M
Compute Capability 3.5
Global Memory 2 GB
Shared Memory Per Block 48 KB
Max Threads Per Block 1024
Number of multiprocessors 2
Max Threads Dim 1024 : 1024 : 64
Max Grid Dim 2*(10 ^ 9) : 65535 : 65535

I will not go into the specific details of whatever it is you're trying to compute, but I will make a suggestion regarding what you might do.
A straightforward approach to parallelizing a serial algorithm in CUDA (or OpenCL, or OpenMP even) is to "parallelize for loops". In the context of CUDA that means instead of having a single thread iterate over values of some index i, you have different GPU threads work on the different values of i (or - one thread for every several values of i).
This can be done with nested loops, e.g. with two indices i and j corresponding to two dimensions of your kernel launch grid.
However - doing this 'naively' is only possible for embarrassingly parallel problems - where there are no dependencies between the data to be computed/written by each of the threads (e.g. for each combination of i and j). Also, if the data that's read for different i and j overlaps, or is interleaved, additional care is required to prevent reading the same data repeatedly, degrading performance.
Try this approach. If it fails, or if you reach the conclusion that it cannot apply, please ask another question - but in that question we will need a Minimal, Complete, Verifiable Example - which you have not provided for this question.
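As a rough illustration of "parallelizing the for loops", here is a plain C sketch (not CUDA, and not tested against the asker's data) that isolates the work done for one (i, j) pair. The function names count_matches, fill_counts and index_of_max are hypothetical, and the sketch assumes, as the question's code does, that all values in subst lie in [0, SIZE_RING). Each (i, j) pair writes only its own slot in counts, so the body of fill_counts is the part a single GPU thread could execute; the search for the maximum is the only step with a dependency between pairs, and it is kept as a separate reduction.

```c
#define SIZE_RING 16
#define NUM_OF_BIGRAMMS 256

/* a - b wrapped into [0, SIZE_RING), assuming a and b are already in that range. */
int ring_diff(int a, int b) {
    int d = a - b;
    return d < 0 ? d + SIZE_RING : d;
}

/* Work for ONE (i, j) pair: only reads its inputs, writes nothing shared. */
int count_matches(const int theta[3], const int theta_fix[3],
                  const int subst[SIZE_RING], int bigramms[][2]) {
    int d0 = ring_diff(theta[2], theta[1]);
    int d1 = ring_diff(theta[2], theta[0]);
    int d0_fix = ring_diff(theta_fix[2], theta_fix[1]);
    int d1_fix = ring_diff(theta_fix[2], theta_fix[0]);
    int count = 0;
    for (int k = 0; k < NUM_OF_BIGRAMMS; k++) {
        int diff0 = ring_diff(subst[(d0 + bigramms[k][0]) % SIZE_RING],
                              subst[bigramms[k][0]]);
        int diff1 = ring_diff(subst[(d1 + bigramms[k][1]) % SIZE_RING],
                              subst[bigramms[k][1]]);
        if (diff0 == d0_fix && diff1 == d1_fix)
            count++;
    }
    return count;
}

/* Phase 1: every (i, j) writes only counts[i * n + j], so iterations are independent. */
void fill_counts(int n, int accoms[][3], int accoms_fix[][3],
                 const int subst[SIZE_RING], int bigramms[][2], int *counts) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            counts[i * n + j] = count_matches(accoms[i], accoms_fix[j], subst, bigramms);
}

/* Phase 2: a separate reduction returns the index of the first largest count. */
int index_of_max(const int *counts, int len) {
    int best = 0;
    for (int p = 1; p < len; p++)
        if (counts[p] > counts[best])
            best = p;
    return best;
}
```

In a CUDA version, the two loops in fill_counts would presumably be replaced by the two dimensions of the launch grid, with i and j derived from the block and thread indices. Note that this sketch reports a single best pair, whereas the original loop records every new running maximum in results.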

Related

Convolution operation without conditional loop

I am writing a convolution operation for a filter and a signal. The accumulation should only happen when j - k >= 0. Is there a way to remove this condition, for example by splitting the loops, so that the conditional clause is avoided?
for (i = 0; i < RBs ; i++) // Over Resource Blocks
{
    for (j = 0; j < (IFFT_Len + Fil_Len -1); j++) // Over Output Length
    {
        acc = 0;
        for (k = 0; k < Fil_Len; k++) // over conv operation
        {
            if (j-k >= 0)
            {
                acc += Filter[k + (i * fil_data)] * IFFT[j - k + (i * ifft_data)];
            }
        }
        x[j] = acc;
    }
    UFMC_sig += x;
}
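One possible way to drop the test from the inner loop is to compute the valid range of k once per output sample. The following is a minimal C sketch for a single resource block; the function name convolve_clamped and its parameters are hypothetical, and the per-block offsets (i * fil_data, i * ifft_data) are left out. The upper bound k_hi enforces j - k >= 0. The lower bound k_lo is an extra assumption not present in the question's code: it keeps j - k below the signal length, assuming the signal buffer holds exactly sig_len samples.

```c
/* out[j] = sum over k of filter[k] * signal[j - k],
 * for j in [0, sig_len + fil_len - 1). out must have that many elements. */
void convolve_clamped(const float *filter, int fil_len,
                      const float *signal, int sig_len, float *out) {
    int out_len = sig_len + fil_len - 1;
    for (int j = 0; j < out_len; j++) {
        /* Smallest k such that j - k <= sig_len - 1. */
        int k_lo = (j >= sig_len) ? j - sig_len + 1 : 0;
        /* Largest k such that j - k >= 0 and k <= fil_len - 1. */
        int k_hi = (j < fil_len) ? j : fil_len - 1;
        float acc = 0.0f;
        for (int k = k_lo; k <= k_hi; k++)   /* no test inside the loop */
            acc += filter[k] * signal[j - k];
        out[j] = acc;
    }
}
```

The comparisons have not disappeared; they are evaluated once per j instead of once per (j, k), which leaves the inner loop as a plain multiply-accumulate.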

Actual difference between 2 ways of equal parallelism using omp threads

I am trying to parallelize my program using OpenMP threads.
What I am doing is the following, and it works perfectly:
#pragma omp parallel num_threads(threadnum) \
    default(none) shared(scoreBoard, nDiag, qlength, dlength) private(nEle, i, si, sj, ai, aj, max)
{
    for (i = 1; i < nDiag; ++i)
    {
        if (i <= qlength && i <= dlength) nEle = i;
        else if(i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
        else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
        calcfirstele(%si, %sj);
        #pragma omp for
        for (j = 1; j <= nEle; ++j)
        {
            ai = si - j + 1;
            aj = sj + j - 1;
            max = searchmax(ai,aj);
            scoreBoard[ai][aj] = max;
        }
    }
}
But isn't it equivalent to:
for (i = 1; i < nDiag; ++i)
{
    if (i <= qlength && i <= dlength) nEle = i;
    else if(i <= findmax(qlength, dlength)) nEle = findmin(qlength, dlength);
    else nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj);
    #pragma omp parallel num_threads(threadnum) \
        default(none) shared(scoreBoard) private(nEle, i, si, sj, ai, aj, max)
    #pragma omp for
    for (j = 1; j <= nEle; ++j)
    {
        ai = si - j + 1;
        aj = sj + j - 1;
        max = searchmax(ai,aj);
        scoreBoard[ai][aj] = max;
    }
}
Why does my program take more time than the serial one when I use the second version, whereas the first version runs a lot faster than the serial one? I can't understand the difference between them.
Your second code is wrong and has undefined behavior.
The reason is that by declaring nEle, si and sj private, you create local (per-thread) copies of these variables without giving them any value. Therefore nEle in particular, which is the upper bound of your for loop, can have any value, likely increasing the length of your computation quite dramatically.
In order to fix your code, the snippet you gave should look like this (with a few simplifications, and obviously not tested):
for (int i = 1; i < nDiag; ++i) {
    if (i <= qlength && i <= dlength)
        nEle = i;
    else if(i <= findmax(qlength, dlength))
        nEle = findmin(qlength, dlength);
    else
        nEle = 2*findmin(qlength, dlength) - i + abs(qlength - dlength);
    calcfirstele(%si, %sj); // not sure what this is supposed to mean (perhaps &si, &sj?)
    #pragma omp parallel for num_threads(threadnum) private(ai, aj, max)
    for (int j = 1; j <= nEle; ++j) {
        ai = si - j + 1;
        aj = sj + j - 1;
        max = searchmax(ai,aj);
        scoreBoard[ai][aj] = max;
    }
}

C - Populate matrix with some density

I've got this code to populate a matrix with 0/1 values at density RHO. I need the same for values from 0 to 2. That is, the percentage of zeros should stay the same, but the other values should be in the range 1-2.
for (i = 1; i <= n; i++) {
    for (j = 1; j <= n; j++) {
        grid[cur][i][j] = (((float)rand())/RAND_MAX) < rho;
    }
}
The only thing I've been able to do is something inelegant like this. It leaves the zero/non-zero percentage unchanged and randomly modifies the cells that hold 1:
...
if(grid[cur][i][j] > 0) {
    grid[cur][i][j] += rand()%2;
}
I think this code will produce 0 with density RHO and values in the range 1-2 otherwise.
for (i = 1; i <= n; i++) {
    for (j = 1; j <= n; j++) {
        grid[cur][i][j] = (((float)rand())/RAND_MAX) < rho ? 0 : rand() % 2 + 1;
    }
}
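Note that the question's original line sets a cell to 1 when the random sample is below rho, so zeros occur with probability 1 - rho, whereas the snippet above produces zeros with probability rho. If the goal is to keep the zero fraction of the original 0/1 code, a sketch along the following lines may be closer to what was asked. The names cell_value and fill_row are hypothetical; cell_value is kept free of rand() so that its behavior can be checked deterministically.

```c
#include <stdlib.h>

/* u: uniform sample in [0, 1]; coin: any non-negative integer.
 * Returns 1 or 2 when u < rho (same condition as the original 0/1 code),
 * and 0 otherwise. */
int cell_value(double u, int coin, double rho) {
    if (u < rho)
        return 1 + coin % 2;
    return 0;
}

/* Fills one row of n cells using rand() for both random draws. */
void fill_row(int *row, int n, double rho) {
    for (int j = 0; j < n; j++) {
        double u = (double)rand() / RAND_MAX;
        int coin = rand();
        row[j] = cell_value(u, coin, rho);
    }
}
```

Applied to the grid in the question, the inner statement would become something like grid[cur][i][j] = cell_value(u, coin, rho) with u and coin drawn as in fill_row.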

Can't Count - Number of Comparison Operations

So I have this segment of code that was given to me.
for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++)
    {
        if (arr[j] < arr[i])
        {
            temp = arr[i];
            arr[i] = arr[j];
            arr[j] = temp;
        }
    }
}
I am trying to calculate the number of comparison operations that would occur if the code were run.
The outer loop test runs for i = 0 through i = 100, so there are 101 comparisons for the outer loop. The inner loop test also runs 101 times per pass, but the comparison inside the body only happens 100 times, because it is not reached when j = 100.
I've made a few attempts, but none of them has been the right answer so far.
I tried 101 x (101 + 100) = 20301, which is not the right answer.
I searched on Google and found a question identical to this one, but it asked how many assignment operations occur, which I was able to answer on my own (it is 25201, by the way).
I got 20201.
#include <stdio.h>
int main(void) {
    int i, j;
    unsigned long count;
    count = 0;
    for (i = 0; ++count, i < 100; ++i) {
        for (j = 0; ++count, j < 100; ++j) {
            ++count;
        }
    }
    (void) printf("%lu\n", count);
    return 0;
}
100 comparisons on the outer loop drive 101 + 100 comparisons on the inner loop. There is one more comparison on the outer loop to detect loop termination, so:
100 * (101 + 100) + 101 = 20201.
Instrumenting the program:
outer_cmps=0;
total_inner_cmps=0;
for (int i = 0; i < 100; i++) {
    ++outer_cmps;
    inner_cmps=0;
    for (int j = 0; j < 100; j++)
    {
        ++inner_cmps;
        if (arr[j] < arr[i])
        {
            temp = arr[i];
            arr[i] = arr[j];
            arr[j] = temp;
        }
        ++inner_cmps;
    }
    ++inner_cmps;
    total_inner_cmps += inner_cmps;
}
++outer_cmps;
total_cmps = outer_cmps + total_inner_cmps;
So that would be 100*(1+200+1)+1=20201
(100 iterations of the i loop, each of which passes the i test once, passes the j test 100 times, performs 100 comparisons if (arr[j] < arr[i]), and fails the j test once when j==100; plus one final i test that fails when i==100)
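To make the tally easier to check, the three kinds of comparison can be counted separately for a general loop bound n. This is a small sketch; the struct and function names are hypothetical. For bound n, the outer test runs n + 1 times, the inner test n * (n + 1) times, and the element comparison n * n times, for a total of 2n^2 + 2n + 1.

```c
/* Separate tallies for the three sources of comparisons. */
struct cmp_counts {
    unsigned long outer;    /* evaluations of i < n */
    unsigned long inner;    /* evaluations of j < n */
    unsigned long element;  /* evaluations of arr[j] < arr[i] */
};

struct cmp_counts count_comparisons(int n) {
    struct cmp_counts c = {0, 0, 0};
    int i, j;
    /* The comma operator bumps the counter each time a loop test is evaluated,
     * including the final failing evaluation. */
    for (i = 0; ++c.outer, i < n; ++i)
        for (j = 0; ++c.inner, j < n; ++j)
            ++c.element;  /* stands in for the if (arr[j] < arr[i]) test */
    return c;
}
```

With n = 100 the three counters come to 101, 10100 and 10000, which sum to 20201.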

What is Innermost loop in imperfectly nested loops?

Hello, I'm a bit confused about the definition of an inner loop in the case of imperfectly nested loops. Consider this code:
for (i = 0; i < n; ++i)
{
    for (j = 0; j <= i - 1; ++j)
        /*some statement*/
    p[i] = 1.0 / sqrt (x);
    for (j = i + 1; j < n; ++j)
    {
        x = a[i][j];
        for (k = 0; k <= i - 1; ++k)
            /*some statement*/
        a[j][i] = x * p[i];
    }
}
Here, we have two loops at the same nesting level. But in the second loop, which iterates over "j" starting from i + 1, there is yet another nesting level. Considering the entire loop structure, which is the innermost loop in the code?
Both j loops are nested inside i equally; k is the innermost loop.
I'm not sure how best to explain this, so I'll give it my best shot. I also recommend stepping through the code in a debugger; it can help a lot.
for (i = 0; i < n; ++i)
{
    // Execution enters here first, with i = 0.
    for (j = 0; j <= i - 1; ++j) {
        // Second stop: this loop repeats while j <= i - 1.
        // On the first pass i - 1 = -1, so the body is skipped entirely;
        // in general it runs i times.
        /*some statement*/
        p[i] = 1.0 / sqrt (x);
    }
    for (j = i + 1; j < n; ++j)
    {
        // Third stop: j starts at i + 1 (1 on the first pass)
        // and this loop repeats while j < n, whatever n is.
        x = a[i][j];
        for (k = 0; k <= i - 1; ++k) {
            // Fourth stop: repeats while k <= i - 1.
            // With i = 0 the body is skipped; in general it runs i times.
            /*some statement*/
            a[j][i] = x * p[i];
        }
        // Fifth stop: back to the top of this same j loop.
    }
    // Sixth stop: once the j loop is done, back to the top of the i loop.
}
I'd say that k is the inner-most loop, because if you count the number of loops required to reach it from the outside, it's three loops, and that's the most out of all four of the loops in your code.
