CUDA optimization: nested loops

I am trying to port this code to CUDA:
double square=0;
for(int j=0;j<width; j++) {
    double Up=0,Down=0;
    for(int i=0;i<height; i++) {
        if(array1[i]>0 && array2[i]>0){
            square = source[i*width+j];
            square = square*square;
            Up += square*array2[i]/array1[i];
            Down += square;
        }
    }
    if(Down>0){
        out[j] *= (1.+(Up/Down-1.));
    }
}
In my first attempt I parallelized the outer loop over j, which works well:
int j = blockDim.x * blockIdx.x + threadIdx.x;
double Up=0, Down=0, square=0;
if (j<width) {
    for(int i=0;i<height;i++) {
        if(array1[i]>0 && array2[i]>0){
            square = source[i*width+j];
            square = square*square;
            Up += square*array2[i]/array1[i];
            Down += square;
        }
    }
    if(Down>0){
        out[j] *= (1.+(Up/Down-1.));
    }
}
I would also like to parallelize the inner loop over i. I tried it with a 2D grid, but it does not work.
This is the kernel:
int j = blockDim.x * blockIdx.x + threadIdx.x;
int i = blockDim.y * blockIdx.y + threadIdx.y;
int offset = j + i * blockDim.x * gridDim.x;
double Up[width],Down[width], square[height];
if (j>=width && i>=height) return;
if(array1[i]>0 && array2[i]>0){
square[i] = source[offset]*source[offset];
Up[j] += square[i]*array2[i]/array1[i];
Down[j] += square[i];
}
if(Down[j]>0){
out[j] *= (1.+(Up[j]/Down[j]-1.));
}
and this is the kernel call:
dim3 blocks(32,32);
dim3 grid(width/32,height/32);
kernel <<< grid, blocks >>> (...);
cudaDeviceSynchronize();
What is the error? Are there more efficient solutions? (Could I use dynamic parallelism?)
Thanks a lot!

In your last kernel, it looks like you intended the arrays Up, Down and square to be shared between threads, but those arrays are thread-local, so the data they contain is not visible to other threads. Unfortunately, your approach wouldn't work even if they were shared between threads.
In your inner loop, each iteration adds to the Up and Down totals left by the previous iteration. It is not entirely trivial to parallelize such loops, and sometimes it cannot be done at all. In your case, a simple solution would be to use atomic operations to accumulate into Up and Down, but it wouldn't be efficient, because atomic updates to the same address are serialized.
You should probably look into solving this with existing parallel primitives, such as reductions and prefix sums, that have already been optimized, for instance those in CUB or Thrust.
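As a rough sketch of the reduction idea (hand-written for illustration, not taken from CUB or Thrust), the kernel below keeps the coalesced one-thread-per-column mapping of the working kernel and additionally splits the rows of each column across threadIdx.y. The partial Up and Down sums are then combined with a shared-memory tree reduction. The names scaleColumns, scaleColumnsOnDevice, kCols and kRows are assumptions, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kCols = 32;  // columns per block (threadIdx.x), assumed value
constexpr int kRows = 8;   // row slices per block (threadIdx.y), power of two

__global__ void scaleColumns(const double* source, const double* array1,
                             const double* array2, double* out,
                             int width, int height)
{
    __shared__ double up[kRows][kCols];
    __shared__ double down[kRows][kCols];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int j = blockIdx.x * kCols + tx;  // column handled by this thread

    // Each thread sums rows ty, ty + kRows, ... of column j.
    double myUp = 0.0, myDown = 0.0;
    if (j < width) {
        for (int i = ty; i < height; i += kRows) {
            if (array1[i] > 0 && array2[i] > 0) {
                double sq = source[i * width + j];
                sq *= sq;
                myUp += sq * array2[i] / array1[i];
                myDown += sq;
            }
        }
    }
    up[ty][tx] = myUp;
    down[ty][tx] = myDown;
    __syncthreads();  // reached by every thread: there is no early return

    // Tree reduction over the kRows partial sums of each column.
    for (int s = kRows / 2; s > 0; s >>= 1) {
        if (ty < s) {
            up[ty][tx] += up[ty + s][tx];
            down[ty][tx] += down[ty + s][tx];
        }
        __syncthreads();
    }

    if (ty == 0 && j < width && down[0][tx] > 0)
        out[j] *= up[0][tx] / down[0][tx];  // mathematically 1. + (Up/Down - 1.)
}

// Hypothetical host helper: copies the inputs over, launches, copies out back.
void scaleColumnsOnDevice(const std::vector<double>& source,
                          const std::vector<double>& array1,
                          const std::vector<double>& array2,
                          std::vector<double>& out, int width, int height)
{
    assert(source.size() == static_cast<size_t>(width) * height);
    assert(array1.size() == static_cast<size_t>(height));
    assert(array2.size() == static_cast<size_t>(height));
    assert(out.size() == static_cast<size_t>(width));
    double *dSrc, *dA1, *dA2, *dOut;
    cudaMalloc(&dSrc, source.size() * sizeof(double));
    cudaMalloc(&dA1, height * sizeof(double));
    cudaMalloc(&dA2, height * sizeof(double));
    cudaMalloc(&dOut, width * sizeof(double));
    cudaMemcpy(dSrc, source.data(), source.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dA1, array1.data(), height * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dA2, array2.data(), height * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, out.data(), width * sizeof(double), cudaMemcpyHostToDevice);

    dim3 block(kCols, kRows);
    dim3 grid((width + kCols - 1) / kCols);  // rounds up, so any width works
    scaleColumns<<<grid, block>>>(dSrc, dA1, dA2, dOut, width, height);

    cudaMemcpy(out.data(), dOut, width * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dSrc); cudaFree(dA1); cudaFree(dA2); cudaFree(dOut);
}
```

Whether this beats the one-thread-per-column kernel probably depends on the shape of the data: the extra row slices mainly help when width alone is too small to keep the GPU busy.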

Related

Matrix Transpose (with shared Memory) with arbitrary size on Cuda C

I can't figure out a way to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)
On the website:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
an efficient way to transpose a matrix is shown (Coalesced Transpose Via Shared Memory), but it only works for square matrices.
The code is also provided on GitHub (same as on the blog).
On Stack Overflow there is a similar question, where TILE_DIM = 16 is set. But with that implementation every thread copies just one element of the matrix to the result matrix.
This is my current implementation:
__global__ void transpose(double* matIn, double* matTran, int n, int m){
__shared__ double tile[TILE_DIM][TILE_DIM];
int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
// Load matrix into tile
// Every Thread loads in this case 4 elements into tile.
int i;
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(i_n < n && (i_m+i) < m){
tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
} else {
tile[threadIdx.y+i][threadIdx.x] = -1;
}
}
__syncthreads();
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
if(true){ // <- what should be checked here?
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
} else {
matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
}
}
}
}
Here each thread copies four elements into the tile, and four elements from the tile back into the result matrix.
Here is the kernel configuration <<<a, b>>>:
where a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM)) (-> the operands are cast to double) and
b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8))
I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1) statement to determine which threads should copy to the result matrix (there might be another way). As far as I understand, this behaves as follows: in a block, thread (x, y) copies the data into the tile and thread (y, x) copies the data back into the result matrix.
I inserted another if-statement to determine where to copy the data, as there are 2(?) possible destinations, depending on the thread index. Currently true is inserted there, but I tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i), which transposes a 3x2 matrix successfully.
Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
as mentioned on the blog should be correct, but I can't figure out why it does not work for non-square matrices.
I was overcomplicating the problem. The indices are NOT swapped as I thought. They are recalculated using the Y- and X-coordinates of the thread/block. Here is the snippet:
i_n = blockIdx.y * TILE_DIM + threadIdx.x;
i_m = blockIdx.x * TILE_DIM + threadIdx.y;
Here is the corrected code:
__global__ void transposeGPUcoalescing(double* matIn, int n, int m, double* matTran){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x * TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y * TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
    // Load matrix into tile
    // Every Thread loads in this case 4 elements into tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[(i_m+i)*n + i_n];
        }
    }
    __syncthreads();
    i_n = blockIdx.y * TILE_DIM + threadIdx.x;
    i_m = blockIdx.x * TILE_DIM + threadIdx.y;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < m && (i_m+i) < n){
            matTran[(i_m+i)*m + i_n] = tile[threadIdx.x][threadIdx.y + i]; // <- multiply by m, non-squared!
        }
    }
}
Thanks to this comment for noticing the error :)
If you would like to speed up your kernel even more, you can avoid shared memory bank conflicts, as shown here:
https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
Simply padding the tile declaration by one column like this will help a lot:
__shared__ double tile[TILE_DIM][TILE_DIM+1];
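Putting the corrected index calculation and the padded tile together, a complete version might look like the sketch below. The kernel name transposePadded and the host helper transposeOnDevice are made up for this example, TILE_DIM = 32 and BLOCK_ROWS = 8 are taken from the kernel configuration in the question, and error checking is left out.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE_DIM = 32;
constexpr int BLOCK_ROWS = 8;

// matIn is row-major with m rows and n columns; matTran gets n rows, m columns.
__global__ void transposePadded(const double* matIn, int n, int m, double* matTran)
{
    __shared__ double tile[TILE_DIM][TILE_DIM + 1];  // one padding column

    int col = blockIdx.x * TILE_DIM + threadIdx.x;   // column in matIn
    int row = blockIdx.y * TILE_DIM + threadIdx.y;   // row in matIn
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        if (col < n && row + i < m)
            tile[threadIdx.y + i][threadIdx.x] = matIn[(row + i) * n + col];
    }
    __syncthreads();

    // Recompute the indices with the block coordinates exchanged.
    col = blockIdx.y * TILE_DIM + threadIdx.x;       // column in matTran
    row = blockIdx.x * TILE_DIM + threadIdx.y;       // row in matTran
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        if (col < m && row + i < n)
            matTran[(row + i) * m + col] = tile[threadIdx.x][threadIdx.y + i];
    }
}

// Hypothetical host helper returning the transposed matrix.
std::vector<double> transposeOnDevice(const std::vector<double>& in, int n, int m)
{
    assert(in.size() == static_cast<size_t>(n) * m);
    std::vector<double> out(in.size());
    const size_t bytes = in.size() * sizeof(double);
    double *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_DIM, BLOCK_ROWS);
    // Integer round-up division: x covers the n columns, y covers the m rows.
    dim3 grid((n + TILE_DIM - 1) / TILE_DIM, (m + TILE_DIM - 1) / TILE_DIM);
    transposePadded<<<grid, block>>>(dIn, n, m, dOut);

    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The integer round-up in the grid size avoids the casts to double that ceil() needs, and it sizes the y dimension from m rather than n, which matters as soon as the matrix is not square.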

Loop Tiling Optimisations

I've been attempting to optimise one of the loops in my C code so that it uses the cache more efficiently. I have a few issues. I'm not 100% sure I'm even writing the loop-blocking code correctly, because I am seeing no speed-up in the run time of my programme. Here is the code:
for (int k = 0; k < N; k += b) {
    for (int i = k; i < MIN(N, k+b); ++i) {
        ax[i] = 0.0f;
        ay[i] = 0.0f;
        for (int j = 0; j < N; j++) {
            rx = x[j] - x[i];
            ry = y[j] - y[i];
            r2 = rx*rx + ry*ry + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i] += s * rx;
            ay[i] += s * ry;
        }
    }
}
I also have another issue: how do I go about choosing a correct block size? I understand that you want to load enough data to fill the L1 cache.
Thanks for the help in advance.
What you are doing is rather pointless, because i goes from 0 to N-1 in your code, just in a slightly more complicated way. So you benefit exactly zero from your attempts at tiling.
What is more critical is the inner loop over j, so that is what you should be tiling (if N is large, and if the speed isn't limited by the division and square root): for every value of i, you make one complete pass through the arrays indexed by j. You can also easily save a few floating point operations for each j, and since r6inv is symmetrical between i and j, only half the values need to be calculated.
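One way to tile the j range could look like the sketch below. It is plain host code (it compiles as C++ or inside a .cu file), the function name accelTiled and the use of std::vector are assumptions, and the variable names follow the question even though r6inv is really the inverse cube of the distance. Each tile of x[j], y[j] and m[j] is reused for every i before the next tile is loaded.

```cuda
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: same arithmetic as the loop in the question, but the
// j range is processed in tiles of b elements that are reused for every i.
void accelTiled(const std::vector<float>& x, const std::vector<float>& y,
                const std::vector<float>& m, std::vector<float>& ax,
                std::vector<float>& ay, float eps, int b)
{
    const int N = static_cast<int>(x.size());
    assert(b > 0);
    for (int i = 0; i < N; ++i) {
        ax[i] = 0.0f;
        ay[i] = 0.0f;
    }

    for (int jj = 0; jj < N; jj += b) {              // start of the j tile
        const int jEnd = (jj + b < N) ? jj + b : N;  // end of the j tile
        for (int i = 0; i < N; ++i) {                // every i reuses the tile
            float sx = ax[i], sy = ay[i];            // running sums for this i
            for (int j = jj; j < jEnd; ++j) {
                const float rx = x[j] - x[i];
                const float ry = y[j] - y[i];
                const float r2 = rx * rx + ry * ry + eps;
                const float r2inv = 1.0f / std::sqrt(r2);
                const float r6inv = r2inv * r2inv * r2inv;
                const float s = m[j] * r6inv;
                sx += s * rx;
                sy += s * ry;
            }
            ax[i] = sx;
            ay[i] = sy;
        }
    }
}
```

On the block size: a tile touches three floats per j, so 12 * b bytes. With an assumed 32 KiB L1 data cache that suggests b somewhere in the low thousands at most, but the cache size is a guess about the target machine, so timing a few values of b is the more reliable guide.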

CUDA: Using grid-strided loop with reduction in shared memory

I have the following question concerning the use of grid-strided loops together with optimized shared-memory reduction algorithms in CUDA kernels.
Imagine that you have a 1D array with more elements than there are threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you would write a kernel of this kind:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i;
// N is a total number of elements in the global_1D_array array
for (i = idx; i < N; i += blockDim.x * gridDim.x)
{
// Do smth...
}
}
Now you want to look for the maximum element in global_1D_array using a reduction in shared memory, and the above kernel will look like this:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i;
// Initialize shared memory array for the each block
__shared__ double data[BLOCK_SIZE];
// N is a total number of elements in the global_1D_array array
for (i = idx; i < N; i += blockDim.x * gridDim.x)
{
// Load data from global to shared memory
data[threadIdx.x] = global_1D_array[i];
__syncthreads();
// Do reduction in shared memory ...
}
// Copy MAX value for each block into global memory
}
It is clear that some values in data will be overwritten, i.e. you need a longer shared memory array or have to organize the kernel in another way.
What is the best (most efficient) way to use a reduction in shared memory and a grid-strided loop together?
Thanks in advance.
A reduction using a grid strided loop is documented here. Referring to slide 32, the grid-strided loop looks like this:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n){
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
Note that each iteration of the while-loop increases the index by gridSize, and this while-loop will continue until the index (i) exceeds the (global) data size (n). We call this a grid-strided loop. In this example, the remainder of the threadblock-local reduction operation is not impacted by grid-size looping, thus only the "front-end" is shown. This particular reduction is doing a sum-reduction, but a max-reduction would simply replace the operation with something like:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize + threadIdx.x;
unsigned int gridSize = blockSize*gridDim.x;
sdata[tid] = 0; // assumes non-negative data; otherwise start from the lowest representable value
while (i < n){
    sdata[tid] = (sdata[tid] < g_idata[i]) ? g_idata[i] : sdata[tid];
    i += gridSize;
}
__syncthreads();
And the remainder of the threadblock level reduction would have to be modified in a similar fashion, replacing the summing operation with a max-finding operation.
The full parallel reduction CUDA sample code is available as part of any full cuda samples install, or here.
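For reference, a complete max-reduction along these lines could be sketched as follows. This is not the code from the slides or the CUDA samples: the names blockMax, maxOnDevice and kBlockSize are invented for the example, the per-block results are combined on the host for simplicity, and CUDA error checking is omitted. Each thread first folds its grid-strided elements into a register, so the shared array only needs one slot per thread and nothing is overwritten.

```cuda
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // threads per block, must be a power of two

__global__ void blockMax(const double* in, double* blockResults, int n)
{
    __shared__ double sdata[kBlockSize];
    const int tid = threadIdx.x;

    // Front end: grid-strided loop, one running maximum per thread.
    double best = -DBL_MAX;  // also correct for all-negative input
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        best = fmax(best, in[i]);
    sdata[tid] = best;
    __syncthreads();

    // Back end: ordinary tree reduction in shared memory, with max
    // taking the place of the sum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmax(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        blockResults[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: launches gridSize blocks and finishes on the CPU.
double maxOnDevice(const std::vector<double>& v, int gridSize)
{
    assert(!v.empty() && gridSize > 0);
    const int n = static_cast<int>(v.size());
    double *dIn, *dPartial;
    cudaMalloc(&dIn, n * sizeof(double));
    cudaMalloc(&dPartial, gridSize * sizeof(double));
    cudaMemcpy(dIn, v.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    blockMax<<<gridSize, kBlockSize>>>(dIn, dPartial, n);

    std::vector<double> partial(gridSize);
    cudaMemcpy(partial.data(), dPartial, gridSize * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    return *std::max_element(partial.begin(), partial.end());
}
```

Finishing on the host keeps the sketch short. A second kernel launch over the per-block results, or an atomic update, are the usual alternatives when the result has to stay on the device.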

Summation over one dimension of a three dimensional array using shared memory

I need to do a calculation like: A[x][y] = sum{from z=0 till z=n}{B[x][y][z]+C[x][y][z]}, where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z<n; ++z)
for(y = 0; y<width; ++y)
for(x = 0; x<height; ++x) {
matrix[index] = value;
index++;
}
I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:
Launch configuration:
dim3 block (n, 1, 1);
dim3 grid (height, width, 1);
Kernel:
atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]),
B[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y]
+ C[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] );
I would like to use shared memory for calculating the sum and then copy the result to global memory.
I am not sure how to do the part with shared memory. Each block's shared memory will hold just one number (the sum). How should I copy this number to the right place in matrix A in global memory?
You probably don't need shared memory or atomic memory access to do the summation you are asking about. If I have understood this correctly, your data is in column major order, so the logical operation is to have one thread per matrix entry in the output matrix, and have each thread traverse the z axis of the input matrices, summing as they go. The kernel for this could look something like:
__global__ void kernel(float *A, const float *B, const float *C,
                       const int width, const int height, const int n)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int tidy = threadIdx.y + blockDim.y * blockIdx.y;
    if ( (tidx < height) && (tidy < width) ) {
        int stride = width * height;
        int ipos = tidx + tidy * height;
        float * oval = A + ipos;
        float sum = 0.f;
        for(int z=0; z<n; z++, ipos+=stride) {
            sum += B[ipos] + C[ipos];
        }
        *oval = sum;
    }
}
This approach should be optimal for column-major data with width * height >= n. There are no performance advantages to using shared memory for this, and there is no need to use atomic memory operations either. If you had a problem where width * height << n it might make sense to try a block wise parallel reduction per summation. But you have not indicated what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter, and I can add a reduction based sample kernel to the answer.
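For the width * height << n case, a block-wise reduction might look roughly like the sketch below. It is only an illustration under assumptions, not a tuned kernel: the names sumOverZ, sumOverZOnDevice and kThreads are invented, one block is launched per output element (as in the question's grid), and error checking is omitted. The threads of a block stride along z, which means their loads are width * height elements apart and therefore not coalesced with this memory layout.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // threads per block, must be a power of two

// One block per output element A[x + y * height]; threads split the z range.
__global__ void sumOverZ(float* A, const float* B, const float* C,
                         int width, int height, int n)
{
    __shared__ float partial[kThreads];
    const int ipos = blockIdx.x + blockIdx.y * height;  // x + y * height
    const int stride = width * height;                  // distance between z slices

    float sum = 0.f;
    for (int z = threadIdx.x; z < n; z += blockDim.x)
        sum += B[ipos + z * stride] + C[ipos + z * stride];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 owns the block's result and writes it to the block's slot in A.
    if (threadIdx.x == 0)
        A[ipos] = partial[0];
}

// Hypothetical host helper returning A as a height * width vector.
std::vector<float> sumOverZOnDevice(const std::vector<float>& B,
                                    const std::vector<float>& C,
                                    int width, int height, int n)
{
    const size_t total = static_cast<size_t>(width) * height * n;
    assert(B.size() == total && C.size() == total);
    std::vector<float> A(static_cast<size_t>(width) * height);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, total * sizeof(float));
    cudaMalloc(&dC, total * sizeof(float));
    cudaMemcpy(dB, B.data(), total * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), total * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(height, width);  // blockIdx.x -> x, blockIdx.y -> y
    sumOverZ<<<grid, kThreads>>>(dA, dB, dC, width, height, n);

    cudaMemcpy(A.data(), dA, A.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return A;
}
```

Because exactly one thread per block writes the result, and each block owns a distinct element of A, no atomic operation is needed here.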

Sum 3D matrix cuda

I need to do a calculation like: A[x][y] = sum{from z=0 till z=n}{B[x][y][z]+C[x][y][z]}, where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z<n; ++z)
for(y = 0; y<width; ++y)
for(x = 0; x<height; ++x) {
matrix[index] = value;
index++;
}
Q1: is this Cuda kernel ok?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
for(z=0; z<n; z++){
A[idx*width+idy] += B[idx*width+idy+z*width*height] + C[idx*width+idy+z*width*height];
}
Q2: Is this a faster way to do the calculation?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
idz = blockIdx.z*blockDim.z + threadIdx.z;
int stride_x = blockDim.x * gridDim.x;
int stride_y = blockDim.y * gridDim.y;
int stride_z = blockDim.z * gridDim.z;
while ( idx < height && idy < width && idz < n ) {
atomicAdd( &(A[idx*width+idy]), B[idx*width+idy+idz*width*height] + C[idx*width+idy+idz*width*height] );
idx += stride_x;
idy += stride_y;
idz += stride_z;
}
The first kernel is OK, but its accesses to matrices B and C are not coalesced.
As for the second kernel: you have a data race, because more than one thread is able to write to the address A[idx*width+idy]. You need additional synchronization, such as atomicAdd.
As for the general question:
I think experiments will show whether it is better. It depends on the typical matrix sizes that you have. Remember that the maximum thread block size on Fermi is 1024 threads, so if the matrices are large you get many thread blocks. Usually that is slower.
Real simple in ArrayFire:
array A = randu(nx,ny,nz);
array B = sum(A,2); // sum along 3rd dimension
print(B);
Q1: Test it with matrices where you know the answer
Remark: You might have problems when using very large matrices. Use a while loop with appropriate increments. Cuda by Example is as usual the reference book.
An example for implementing a nested loop can be found here: For nested loops with CUDA. There a while loop is implemented.
marina.k is right about the race condition. That would favor approach one, as atomic operations tend to slow down the code.

Resources