Sum 3D matrix CUDA - C

I need to do a calculation like: A[x][y] = sum{from z=0 to z=n-1}{B[x][y][z] + C[x][y][z]}, where matrix A has dimensions [height][width] and matrices B, C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z < n; ++z)
    for (y = 0; y < width; ++y)
        for (x = 0; x < height; ++x) {
            matrix[index] = value;
            index++;
        }
Q1: Is this CUDA kernel OK?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
for (z = 0; z < n; z++) {
    A[idx*width+idy] += B[idx*width+idy + z*width*height] + C[idx*width+idy + z*width*height];
}
Q2: Is this a faster way to do the calculation?
idx = blockIdx.x*blockDim.x + threadIdx.x;
idy = blockIdx.y*blockDim.y + threadIdx.y;
idz = blockIdx.z*blockDim.z + threadIdx.z;
int stride_x = blockDim.x * gridDim.x;
int stride_y = blockDim.y * gridDim.y;
int stride_z = blockDim.z * gridDim.z;
while (idx < height && idy < width && idz < n) {
    atomicAdd( &(A[idx*width+idy]), B[idx*width+idy + idz*width*height] + C[idx*width+idy + idz*width*height] );
    idx += stride_x;
    idy += stride_y;
    idz += stride_z;
}

The first kernel is OK, but the accesses to matrices B and C are not coalesced.
As for the second kernel function: you have a data race, because more than one thread can write to the same A[idx*width+idy] address. You need additional synchronization, such as atomicAdd.
As for the general question:
I think only experiments will show which is better; it depends on the typical matrix sizes you have. Remember that the maximum thread block size on Fermi is 1024 threads, and if the matrices are large you get many thread blocks. Usually it is slower to have many thread blocks.
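For reference, a minimal sketch of a coalesced variant (my own, not part of the original answer), assuming the column-major layout from the question (x is the fastest-varying index, so element (x, y, z) sits at x + y*height + z*height*width) and float data:
__global__ void sum_over_z(float *A, const float *B, const float *C,
                           int height, int width, int n)
{
    // threadIdx.x runs along x, the fastest-varying dimension in memory,
    // so neighbouring threads touch neighbouring addresses (coalesced).
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. height-1
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. width-1
    if (x < height && y < width) {
        int ipos = x + y * height;      // A[x][y] and the z = 0 slice of B, C
        int stride = width * height;    // distance between consecutive z slices
        float sum = 0.0f;
        for (int z = 0; z < n; ++z, ipos += stride)
            sum += B[ipos] + C[ipos];
        A[x + y * height] = sum;
    }
}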

Really simple in ArrayFire:
array A = randu(nx,ny,nz);
array B = sum(A,2); // sum along 3rd dimension
print(B);

Q1: Test it with matrices where you know the answer.
Remark: You might have problems when using very large matrices. Use a while loop with appropriate increments. CUDA by Example is, as usual, the reference book.
An example of implementing a nested loop can be found in "For nested loops with CUDA"; a while loop is implemented there.
marina.k is right about the race condition. That would favor approach one, as atomic operations tend to slow down the code.

Related

Matrix transpose (with shared memory) with arbitrary size in CUDA C

I can't figure out a way to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)
On the website:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
an efficient way to transpose a matrix is shown (Coalesced Transpose Via Shared Memory), but it only works for square matrices.
The code is also provided on GitHub (the same as on the blog).
On Stack Overflow there is a similar question where TILE_DIM = 16 is set, but with that implementation every thread just copies one element of the matrix to the result matrix.
This is my current implementation:
__global__ void transpose(double* matIn, double* matTran, int n, int m){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7

    // Load matrix into tile
    // Every thread loads 4 elements into the tile in this case.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
        } else {
            tile[threadIdx.y+i][threadIdx.x] = -1;
        }
    }
    __syncthreads();

    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
            if(true){ // <- what should be checked here?
                matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
            } else {
                matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
            }
        }
    }
}
where 4 elements are copied by each thread into the tile, and four elements from the tile are copied back into the result matrix.
Here is the kernel configuration <<<a, b>>>:
a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM)) (-> cast to doubles) and
b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8))
I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1) statement to determine which thread should copy to the result matrix (there might be another way). As far as I understand, this behaves as follows: in a block, thread (x, y) copies the data into the tile and thread (y, x) copies the data back into the result matrix.
I inserted another if statement to determine where to copy the data, as there are 2(?) possible destinations, depending on the thread index. Currently true is inserted there, but I have tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i), which transposes a 3x2 matrix successfully.
Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
as mentioned on the blog should be correct, but I can't figure out why it is not working for non-square matrices.
I was overcomplicating the problem. Here, the indices are NOT swapped as I thought. They are recalculated using the Y- and X-coordinates of the thread/block. Here is the snippet:
i_n = blockIdx.y * TILE_DIM + threadIdx.x;
i_m = blockIdx.x * TILE_DIM + threadIdx.y;
Here is the corrected code:
__global__ void transposeGPUcoalescing(double* matIn, int n, int m, double* matTran){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x * TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y * TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7

    // Load matrix into tile
    // Every thread loads 4 elements into the tile in this case.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[(i_m+i)*n + i_n];
        }
    }
    __syncthreads();

    i_n = blockIdx.y * TILE_DIM + threadIdx.x;
    i_m = blockIdx.x * TILE_DIM + threadIdx.y;

    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < m && (i_m+i) < n){
            matTran[(i_m+i)*m + i_n] = tile[threadIdx.x][threadIdx.y + i]; // <- multiply by m, non-square!
        }
    }
}
Thanks to this comment for noticing the error :)
If you would like to speed up your kernel even more, you can avoid shared memory bank conflicts, as shown here:
https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
Simply changing the tile declaration like this helps a lot:
__shared__ double tile[TILE_DIM][TILE_DIM+1];
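For completeness, a minimal host-side launch sketch (my own; d_in and d_out are assumed device buffers for the n x m input and its transpose). Note that for a non-square input the grid needs ceil(n/TILE_DIM) blocks in x and ceil(m/TILE_DIM) blocks in y:
#define TILE_DIM   32
#define BLOCK_ROWS  8

dim3 block(TILE_DIM, BLOCK_ROWS);                 // (32, 8): each thread handles 4 elements
dim3 grid((n + TILE_DIM - 1) / TILE_DIM,          // tiles along the n direction
          (m + TILE_DIM - 1) / TILE_DIM);         // tiles along the m direction
transposeGPUcoalescing<<<grid, block>>>(d_in, n, m, d_out);
cudaDeviceSynchronize();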

Loop Tiling Optimisations

I've been attempting to optimise one of the loops in my C code so that it uses the cache more efficiently. I have a few issues. I'm not 100% sure whether I'm even writing the loop-blocking code correctly, because I'm seeing no increase in speed in the run time of my programme. Here is the code:
for (int k = 0; k < N; k += b) {
    for (int i = k; i < MIN(N, k + b); ++i) {
        ax[i] = 0.0f;
        ay[i] = 0.0f;
        for (int j = 0; j < N; j++) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float r2 = dx*dx + dy*dy + eps;
            r2inv = 1.0f / sqrt(r2);
            r6inv = r2inv * r2inv * r2inv;
            s = m[j] * r6inv;
            ax[i] += s * dx;
            ay[i] += s * dy;
        }
    }
}
I also have another question: how do I go about choosing a correct block size? I understand that you want to load enough data to fill the L1 cache.
Thanks for the help in advance.
What you are doing is rather pointless, because i still goes from 0 to N-1 in your code, just in a slightly more complicated way. So you benefit exactly zero from your attempt at tiling the i loop.
What is more critical is the inner loop over j (it streams through the arrays x, y and m), so that is what you should be tiling (if N is large, and if the speed isn't limited by the division and square root); see the sketch below. For every value of i, you make one complete pass through those arrays. You can also easily save a few floating-point operations for each j, and since r6inv is symmetric in i and j, only half the values need to be calculated.
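To make that concrete, here is a minimal sketch of blocking the j loop instead (my own, reusing the question's arrays, block size b and MIN macro; the local float declarations are added so the snippet stands on its own):
for (int i = 0; i < N; ++i) {          // clear the accumulators once, up front
    ax[i] = 0.0f;
    ay[i] = 0.0f;
}
for (int jb = 0; jb < N; jb += b) {    // walk j in cache-sized blocks
    int jend = MIN(N, jb + b);
    for (int i = 0; i < N; ++i) {      // the x/y/m block for jb..jend stays hot in cache
        for (int j = jb; j < jend; ++j) {
            float dx = x[j] - x[i];
            float dy = y[j] - y[i];
            float r2 = dx * dx + dy * dy + eps;
            float r2inv = 1.0f / sqrtf(r2);
            float r6inv = r2inv * r2inv * r2inv;
            float s = m[j] * r6inv;
            ax[i] += s * dx;
            ay[i] += s * dy;
        }
    }
}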

CUDA optimization: nested loops

I am trying to port this code to CUDA:
double square = 0;
for (int j = 0; j < width; j++) {
    double Up = 0, Down = 0;
    for (int i = 0; i < height; i++) {
        if (array1[i] > 0 && array2[i] > 0) {
            square = source[i*width + j];
            square = square * square;
            Up += square * array2[i] / array1[i];
            Down += square;
        }
    }
    if (Down > 0) {
        out[j] *= (1. + (Up/Down - 1.));
    }
}
In a first attempt I parallelized away the first for loop (this works well):
int j = blockDim.x * blockIdx.x + threadIdx.x;
double Up = 0, Down = 0, square = 0;
if (j < width) {
    for (int i = 0; i < height; i++) {
        if (array1[i] > 0 && array2[i] > 0) {
            square = source[i*width + j];
            square = square * square;
            Up += square * array2[i] / array1[i];
            Down += square;
        }
    }
    if (Down > 0) {
        out[j] *= (1. + (Up/Down - 1.));
    }
}
I would also like to parallelize away the second for loop. I tried it with a 2D grid, but it does not work.
This is the kernel:
int j = blockDim.x * blockIdx.x + threadIdx.x;
int i = blockDim.y * blockIdx.y + threadIdx.y;
int offset = j + i * blockDim.x * gridDim.x;
double Up[width], Down[width], square[height];
if (j >= width && i >= height) return;
if (array1[i] > 0 && array2[i] > 0) {
    square[i] = source[offset] * source[offset];
    Up[j] += square[i] * array2[i] / array1[i];
    Down[j] += square[i];
}
if (Down[j] > 0) {
    out[j] *= (1. + (Up[j]/Down[j] - 1.));
}
and this is the kernel call:
dim3 blocks(32,32);
dim3 grid(width/32,height/32);
kernel <<< grid, blocks >>> (...);
cudaDeviceSynchronize();
... what is the error? Are there more efficient solutions? (Could I use dynamic parallelism?)
Thanks a lot!
In your last kernel, it looks like you intended the arrays Up, Down and square to persist between threads, but those arrays are thread-local, so the data they contain is not shared between threads. Unfortunately, your approach wouldn't work even if they were shared between threads.
In your inner loop, the current iteration uses data that was calculated in the previous iteration. It is not entirely trivial to parallelize such loops, and sometimes it cannot be done at all. In your case, a simple solution would be to use atomic operations to increase the Up and Down accumulators, but it wouldn't be efficient, because the atomic operations cause implicit serialization of the updates.
You should probably look into solving this with existing parallel primitives, such as prefix-sums, that have already been optimized. For instance, those in CUB or Thrust.
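As an illustration only (my own sketch, not from the answer), the atomic approach mentioned above could be split into an accumulation kernel plus a small finishing kernel. Here Up and Down are assumed to be zero-initialized device arrays of length width, and double-precision atomicAdd requires compute capability 6.0 or later:
__global__ void accumulate(const double *source, const double *array1,
                           const double *array2, double *Up, double *Down,
                           int width, int height)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;   // column
    int i = blockDim.y * blockIdx.y + threadIdx.y;   // row
    if (j >= width || i >= height) return;

    if (array1[i] > 0 && array2[i] > 0) {
        double square = source[i * width + j];
        square = square * square;
        atomicAdd(&Up[j],   square * array2[i] / array1[i]);  // serialized per column
        atomicAdd(&Down[j], square);
    }
}

__global__ void finish(const double *Up, const double *Down, double *out, int width)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < width && Down[j] > 0)
        out[j] *= (1. + (Up[j] / Down[j] - 1.));
}
As the answer notes, the atomics serialize the per-column updates, so the one-thread-per-column version from the first attempt is usually the better choice.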

Summation over one dimension of a three-dimensional array using shared memory

I need to do a calculation like: A[x][y] = sum{from z=0 to z=n-1}{B[x][y][z] + C[x][y][z]}, where matrix A has dimensions [height][width] and matrices B, C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z < n; ++z)
    for (y = 0; y < width; ++y)
        for (x = 0; x < height; ++x) {
            matrix[index] = value;
            index++;
        }
I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:
Launch configuration (host code):
dim3 block (n, 1, 1);
dim3 grid (height, width, 1);
Kernel:
atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]),
           B[blockIdx.x + blockIdx.y*gridDim.y + threadIdx.x*blockDim.x*blockDim.y]
         + C[blockIdx.x + blockIdx.y*gridDim.y + threadIdx.x*blockDim.x*blockDim.y] );
I would like to use shared memory to calculate the sum and then copy this result to global memory.
I am not sure how to do the shared-memory part. Only one number (the sum result) will be stored in each block's shared memory. How should I copy this number to the right place in the A matrix in global memory?
You probably don't need shared memory or atomic memory access to do the summation you are asking about. If I have understood this correctly, your data is in column-major order, so the logical approach is to have one thread per entry of the output matrix and have each thread traverse the z axis of the input matrices, summing as it goes. The kernel for this could look something like:
__global__ void kernel(float *A, const float *B, const float *C,
                       const int width, const int height, const int n)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int tidy = threadIdx.y + blockDim.y * blockIdx.y;

    if ( (tidx < height) && (tidy < width) ) {
        int stride = width * height;
        int ipos = tidx + tidy * height;
        float * oval = A + ipos;
        float sum = 0.f;

        for (int z = 0; z < n; z++, ipos += stride) {
            sum += B[ipos] + C[ipos];
        }
        *oval = sum;
    }
}
This approach should be optimal for column-major data with width * height >= n. There are no performance advantages to using shared memory for this, and there is no need to use atomic memory operations either. If you had a problem where width * height << n, it might make sense to try a block-wise parallel reduction per summation. But you have not indicated what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter, and I can add a reduction-based sample kernel to the answer.
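The answer leaves it at that; purely as an illustration (my own sketch, not the answerer's), the block-wise reduction for the width * height << n case could look roughly like this: one block per output element, with the threads of the block striding over z and combining their partial sums in shared memory (blockDim.x is assumed to be a power of two):
__global__ void kernel_reduce(float *A, const float *B, const float *C,
                              const int width, const int height, const int n)
{
    extern __shared__ float buf[];           // blockDim.x floats, sized at launch
    int x = blockIdx.x;                      // 0 .. height-1
    int y = blockIdx.y;                      // 0 .. width-1
    int base = x + y * height;               // position of A[x][y] and of the z = 0 slice
    int stride = width * height;             // distance between consecutive z slices

    float sum = 0.f;
    for (int z = threadIdx.x; z < n; z += blockDim.x)
        sum += B[base + z * stride] + C[base + z * stride];
    buf[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction of the per-thread partial sums in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        A[base] = buf[0];
}
// launch sketch: kernel_reduce<<<dim3(height, width), 256, 256 * sizeof(float)>>>(A, B, C, width, height, n);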

Fast 2D convolution for DSP

I want to implement some image-processing algorithms which are intended to run on a BeagleBoard. These algorithms use convolutions extensively. I'm trying to find a good C implementation of 2D convolution (probably using the Fast Fourier Transform). I also want the algorithms to be able to run on the BeagleBoard's DSP, because I've heard that the DSP is optimized for these kinds of operations (with its multiply-accumulate instruction).
I have no background in the field, so I think it won't be a good idea to implement the convolution myself (I probably won't do it as well as someone who understands all the math behind it). I believe a good C convolution implementation for a DSP exists somewhere, but I wasn't able to find it.
Could someone help?
EDIT: It turns out the kernel is pretty small. Its dimensions are either 2x2 or 3x3, so I guess I'm not looking for an FFT-based implementation. I was searching the web for the definition of convolution so I could implement it in a straightforward way (I don't really know what convolution is). All I've found is something with multiplied integrals, and I have no idea how to do it with matrices. Could somebody give me a piece of code (or pseudocode) for the 2x2 kernel case?
What are the dimensions of the image and the kernel? If the kernel is large then you can use FFT-based convolution, otherwise for small kernels just use direct convolution.
The DSP might not be the best way to do this, though - just because it has a MAC instruction doesn't mean it will be more efficient. Does the ARM CPU on the BeagleBoard have NEON SIMD? If so, then that might be the way to go (and more fun too).
For a small kernel, you can do direct convolution like this:
// in, out are m x n images (integer data)
// K is the kernel size (KxK) - currently needs to be an odd number, e.g. 3
// coeffs[K][K] is a 2D array of integer coefficients
// scale is a scaling factor to normalise the filter gain

for (i = K / 2; i < m - K / 2; ++i) // iterate through image
{
    for (j = K / 2; j < n - K / 2; ++j)
    {
        int sum = 0; // sum will be the sum of input data * coeff terms

        for (ii = - K / 2; ii <= K / 2; ++ii) // iterate over kernel
        {
            for (jj = - K / 2; jj <= K / 2; ++jj)
            {
                int data = in[i + ii][j + jj];
                int coeff = coeffs[ii + K / 2][jj + K / 2];

                sum += data * coeff;
            }
        }
        out[i][j] = sum / scale; // scale sum of convolution products and store in output
    }
}
You can modify this to support even values of K - it just takes a little care with the upper/lower limits on the two inner loops.
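Since the edit specifically asks for the 2x2 case, here is a minimal sketch of that even-K variant (my own adaptation of the loop above, reusing the same in/out/coeffs/scale names). The 2x2 kernel is taken to cover the current pixel and its right/bottom neighbours, so the loops simply stop one row and one column early:
for (i = 0; i < m - 1; ++i)      // last row has no neighbour below
{
    for (j = 0; j < n - 1; ++j)  // last column has no neighbour to the right
    {
        int sum = in[i][j]         * coeffs[0][0]
                + in[i][j + 1]     * coeffs[0][1]
                + in[i + 1][j]     * coeffs[1][0]
                + in[i + 1][j + 1] * coeffs[1][1];

        out[i][j] = sum / scale;   // normalise and store, as in the KxK version
    }
}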
I know it might be off-topic, but due to the similarity between C and JavaScript I believe it could still be helpful. P.S.: Inspired by Paul R's answer.
A two-dimensional (2D) convolution algorithm in JavaScript using arrays:
function newArray(size){
    var result = new Array(size);
    for (var i = 0; i < size; i++) {
        result[i] = new Array(size);
    }
    return result;
}

function convolveArrays(filter, image){
    var result = newArray(image.length - filter.length + 1);
    for (var i = 0; i < image.length; i++) {
        var imageRow = image[i];
        for (var j = 0; j <= imageRow.length; j++) {
            var sum = 0;
            for (var w = 0; w < filter.length; w++) {
                if(image.length - i < filter.length) break;
                var filterRow = filter[w];
                for (var z = 0; z < filter.length; z++) {
                    if(imageRow.length - j < filterRow.length) break;
                    sum += image[w + i][z + j] * filter[w][z];
                }
            }
            if(i < result.length && j < result.length)
                result[i][j] = sum;
        }
    }
    return result;
}
You can check the full blog post at http://ec2-54-232-84-48.sa-east-1.compute.amazonaws.com/two-dimensional-convolution-algorithm-with-arrays-in-javascript/
