I have a question about using grid-strided loops and optimized reduction algorithms in shared memory together in CUDA kernels.
Imagine that you have a 1D array with more elements than there are threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you will write a kernel of this kind:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    // N is the total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Do smth...
    }
}
Now suppose you want to find the maximum element in global_1D_array using a reduction in shared memory; the above kernel would then look like this one:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    // Initialize shared memory array for each block
    __shared__ double data[BLOCK_SIZE];
    // N is the total number of elements in the global_1D_array array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Load data from global to shared memory
        data[threadIdx.x] = global_1D_array[i];
        __syncthreads();
        // Do reduction in shared memory ...
    }
    // Copy MAX value for each block into global memory
}
It is clear that some values in data will be overwritten on later iterations of the loop, i.e. you need a longer shared memory array or have to organize the kernel in another way.
What is the best (most efficient) way to use a reduction in shared memory and a grid-strided loop together?
Thanks in advance.
A reduction using a grid strided loop is documented here. Referring to slide 32, the grid-strided loop looks like this:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n){
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
Note that each iteration of the while-loop increases the index by gridSize, and this while-loop will continue until the index (i) exceeds the (global) data size (n). We call this a grid-strided loop. In this example, the remainder of the threadblock-local reduction operation is not impacted by grid-size looping, thus only the "front-end" is shown. This particular reduction is doing a sum-reduction, but a max-reduction would simply replace the operation with something like:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize + threadIdx.x;
unsigned int gridSize = blockSize*gridDim.x;
sdata[tid] = 0;
while (i < n){
    sdata[tid] = (sdata[tid] < g_idata[i]) ? g_idata[i] : sdata[tid];
    i += gridSize;
}
__syncthreads();
And the remainder of the threadblock level reduction would have to be modified in a similar fashion, replacing the summing operation with a max-finding operation.
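For completeness, here is a minimal sketch of what that tail might look like (this is my own sketch, not the sample code itself: it assumes blockDim.x is a power of two, g_odata is the per-block output array, and the input values are non-negative so that the 0 initialization above is a valid identity for a max):

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] = (sdata[tid] < sdata[tid + s]) ? sdata[tid + s] : sdata[tid];
    __syncthreads();
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];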
The full parallel reduction CUDA sample code is available as part of any full cuda samples install, or here.
Related
I can't figure out a way to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)
On the website:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
an efficient way to transpose a matrix is shown (Coalesced Transpose Via Shared Memory), but it only works for square matrices.
Code is also provided on GitHub (same as on the blog).
On Stack Overflow there is a similar question, where TILE_DIM = 16 is set. But with that implementation every thread just copies one element of the matrix to the result matrix.
This is my current implementation:
__global__ void transpose(double* matIn, double* matTran, int n, int m){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
    // Load matrix into tile
    // Every Thread loads in this case 4 elements into tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
        } else {
            tile[threadIdx.y+i][threadIdx.x] = -1;
        }
    }
    __syncthreads();
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
            if(true){ // <- what should be checked here?
                matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
            } else {
                matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
            }
        }
    }
}
where each thread copies 4 elements into the tile, and also copies four elements from the tile back into the result matrix.
Here is the kernel configuration <<<a, b>>>:
where a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM)) (-> cast to doubles) and
b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8))
I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1) statement to determine which thread should copy to the result matrix (there might be another way). As far as I understand, this behaves as follows: in a block, thread (x, y) copies the data into the tile and thread (y, x) copies the data back into the result matrix.
I inserted another if-statement to determine where to copy the data, as there are 2(?) possible destinations depending on the thread index. Currently true is inserted there, but I tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i), which transposes a 3x2 matrix successfully.
Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
as mentioned on the blog should be correct, but I can't figure out why it is not working for non-square matrices.
I was overcomplicating the problem. Here, the indices are NOT swapped as I thought. They are recalculated using the y- and x-coordinates of the thread/block. Here is the snippet:
i_n = blockIdx.y * TILE_DIM + threadIdx.x;
i_m = blockIdx.x * TILE_DIM + threadIdx.y;
Here is the corrected code:
__global__ void transposeGPUcoalescing(double* matIn, int n, int m, double* matTran){
    __shared__ double tile[TILE_DIM][TILE_DIM];
    int i_n = blockIdx.x * TILE_DIM + threadIdx.x;
    int i_m = blockIdx.y * TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
    // Load matrix into tile
    // Every Thread loads in this case 4 elements into tile.
    int i;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < n && (i_m+i) < m){
            tile[threadIdx.y+i][threadIdx.x] = matIn[(i_m+i)*n + i_n];
        }
    }
    __syncthreads();
    i_n = blockIdx.y * TILE_DIM + threadIdx.x;
    i_m = blockIdx.x * TILE_DIM + threadIdx.y;
    for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
        if(i_n < m && (i_m+i) < n){
            matTran[(i_m+i)*m + i_n] = tile[threadIdx.x][threadIdx.y + i]; // <- multiply by m, non-squared!
        }
    }
}
Thanks to this comment for noticing the error :)
If you would like to speed up your kernel even more, you can avoid shared memory bank conflicts as shown here:
https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
Simply changing the tile declaration to this will help a lot:
__shared__ double tile[TILE_DIM][TILE_DIM+1];
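For reference, here is a sketch of how the corrected kernel might be launched; the device pointer names d_matIn and d_matTran are my own assumptions, and matIn is assumed to hold m rows of n columns as in the question:

dim3 block(TILE_DIM, BLOCK_ROWS);            // e.g. (32, 8)
dim3 grid((n + TILE_DIM - 1) / TILE_DIM,     // blocks along x cover the n columns
          (m + TILE_DIM - 1) / TILE_DIM);    // blocks along y cover the m rows
transposeGPUcoalescing<<<grid, block>>>(d_matIn, n, m, d_matTran);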
I tried to implement a vector sum reduction in CUDA on my own and encountered an error that I could fix, but I did not understand what the actual problem was.
I implemented the kernel below, which is pretty much the same as the one used in NVIDIA's samples.
__global__
void reduce0(int *input, int *output)
{
    extern __shared__ int s_data[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[tid] = input[i];
    __syncthreads();
    for( int s=1; s < blockDim.x; s *= 2) {
        if((tid % 2*s) == 0) {
            s_data[tid] += s_data[tid + s];
        }
        __syncthreads();
    }
    if(tid == 0) {
        output[blockIdx.x] = s_data[0];
    }
}
Furthermore, I calculated the shared memory size on the host side as below:
int sharedMemSize = numberOfValues * sizeof(int);
If more than 1 block of threads is used, the code runs just fine. Using only 1 block ends in an index-out-of-bounds error. Looking for my error by comparing my host code with the one from the examples, I found the following line:
int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);
Playing a little with my block/grid setup brought me to the following results:
1 block, arbitrary number of threads => code crashes
>2 blocks, arbitrary number of threads => code runs fine
1 block, arbitrary number of threads, shared memory size 2*#threads => code runs fine
Although I have been thinking about this for a few hours, I don't get why there is an out-of-bounds error when using too few threads or blocks.
UPDATE: Host code calling the kernel as requested
int numberOfValues = 1024;
int numberOfThreadsPerBlock = 32;
int numberOfBlocks = numberOfValues / numberOfThreadsPerBlock;
int memSize = sizeof(int) * numberOfValues;
int *values = (int *) malloc(memSize);
int *result = (int *) malloc(memSize);
int *values_device, *result_device;
cudaMalloc((void **) &values_device, memSize);
cudaMalloc((void **) &result_device, memSize);
for(int i=0; i < numberOfValues; i++) {
    values[i] = i+1;
}
cudaMemcpy(values_device, values, memSize, cudaMemcpyHostToDevice);
dim3 dimGrid(numberOfBlocks,1);
dim3 dimBlock(numberOfThreadsPerBlock,1);
int sharedMemSize = numberOfThreadsPerBlock * sizeof(int);
reduce0<<<dimGrid, dimBlock, sharedMemSize>>>(values_device, result_device);
if (cudaSuccess != cudaGetLastError())
    printf("Error!\n");
cudaMemcpy(result, result_device, memSize, cudaMemcpyDeviceToHost);
Could your problem be the precedence order of modulo and multiplication?
tid % 2*s is equal to (tid % 2)*s, but you want tid % (2*s).
The reason why you need to use int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T) for a small number of threads is out-of-bounds indexing. One example of when this happens is when you launch 29 threads: when tid=28 and s=2 the branch will be taken because 28 % (2*2) == 0, and you will index into s_data[28+2], but you have only allocated shared memory for 29 threads.
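In other words, a minimal fix is to add the parentheses in the kernel's loop (this is the same loop as in the question, with only the condition changed):

for( int s=1; s < blockDim.x; s *= 2) {
    if((tid % (2*s)) == 0) {
        s_data[tid] += s_data[tid + s];
    }
    __syncthreads();
}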
I have two arrays, a and b, and I would like to compute the "min convolution" to produce result c. Simple pseudo code looks like the following:
for i = 0 to size(a)+size(b)
    c[i] = inf
    for j = 0 to size(a)
        if (i - j >= 0) and (i - j < size(b))
            c[i] = min(c[i], a[j] + b[i-j])
(edit: changed loops to start at 0 instead of 1)
If the min were instead a sum, we could use a Fast Fourier Transform (FFT), but in the min case, there is no such analog. Instead, I'd like to make this simple algorithm as fast as possible by using a GPU (CUDA). I'd be happy to find existing code that does this (or code that implements the sum case without FFTs, so that I could adapt it for my purposes), but my search so far hasn't turned up any good results. My use case will involve a's and b's that are of size between 1,000 and 100,000.
Questions:
Does code to do this efficiently already exist?
If I am going to implement this myself, structurally, how should the CUDA kernel look so as to maximize efficiency? I've tried a simple solution where each c[i] is computed by a separate thread, but this doesn't seem like the best way. Any tips in terms of how to set up thread block structure and memory access patterns?
An alternative which might be useful for large a and b would be to use a block per output entry in c. Using a block allows for memory coalescing, which will be important in what is a memory bandwidth limited operation, and a fairly efficient shared memory reduction can be used to combine per thread partial results into a final per block result. Probably the best strategy is to launch as many blocks per MP as will run concurrently and have each block emit multiple output points. This eliminates some of the scheduling overheads associated with launching and retiring many blocks with relatively low total instruction counts.
An example of how this might be done:
#include <math.h>

template<int bsz>
__global__ __launch_bounds__(512)
void minconv(const float *a, int sizea, const float *b, int sizeb, float *c)
{
    __shared__ volatile float buff[bsz];

    // One output entry per block per iteration, so each block advances by the number of blocks in the grid.
    for(int i = blockIdx.x; i < (sizea + sizeb); i += gridDim.x) {
        float cval = INFINITY;
        for(int j = threadIdx.x; j < sizea; j += blockDim.x) {
            int t = i - j;
            if ((t >= 0) && (t < sizeb))
                cval = min(cval, a[j] + b[t]);
        }
        buff[threadIdx.x] = cval;
        __syncthreads();

        // Shared memory reduction of the per-thread partial results.
        if (bsz > 256) {
            if (threadIdx.x < 256)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+256]);
            __syncthreads();
        }
        if (bsz > 128) {
            if (threadIdx.x < 128)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+128]);
            __syncthreads();
        }
        if (bsz > 64) {
            if (threadIdx.x < 64)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+64]);
            __syncthreads();
        }
        if (threadIdx.x < 32) {
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+32]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+16]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+8]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+4]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+2]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+1]);
            if (threadIdx.x == 0) c[i] = buff[0];
        }
    }
}

// Instances for all valid block sizes.
template __global__ void minconv<64>(const float *, int, const float *, int, float *);
template __global__ void minconv<128>(const float *, int, const float *, int, float *);
template __global__ void minconv<256>(const float *, int, const float *, int, float *);
template __global__ void minconv<512>(const float *, int, const float *, int, float *);
[disclaimer: not tested or benchmarked, use at own risk]
This is single precision floating point, but the same idea should work for double precision floating point. For integer, you would need to replace the C99 INFINITY macro with something like INT_MAX or LONG_MAX, but the principle remains the same otherwise.
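As a possible host-side dispatch (the block count and the device pointer names d_a, d_b, d_c are my assumptions; the idea, as described above, is simply to launch enough 512-thread blocks to keep every multiprocessor busy):

const int nthreads = 512;   // must match one of the instantiated block sizes
const int nblocks  = 112;   // e.g. a few blocks per multiprocessor; tune for the GPU
minconv<nthreads><<<nblocks, nthreads>>>(d_a, sizea, d_b, sizeb, d_c);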
A faster version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    int i = (threadIdx.x + blockIdx.x * blockDim.x);
    int idT = threadIdx.x;
    int out, j;

    __shared__ double c_local[512];
    c_local[idT] = c[i];

    out = (i > sa) ? sa : i + 1;
    j   = (i > sb) ? i - sb + 1 : 1;

    for(; j < out; j++)
    {
        if(c_local[idT] > a[j] + b[i-j])
            c_local[idT] = a[j] + b[i-j];
    }

    c[i] = c_local[idT];
}
**Benchmark:**
Size A    Size B    Size C    Time (s)
1000      1000      2000      0.0008
10k       10k       20k       0.0051
100k      100k      200k      0.3436
1M        1M        1M        43.327
Old version:
For sizes between 1000 and 100000, I tested with this naive version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
    int size = sa+sb;
    int idT = (threadIdx.x + blockIdx.x * blockDim.x);
    int out, j;

    for(int i = idT; i < size; i += blockDim.x * gridDim.x)
    {
        if(i > sa) out = sa;
        else out = i + 1;

        if(i > sb) j = i - sb + 1;
        else j = 1;

        for(; j < out; j++)
        {
            if(c[i] > a[j] + b[i-j])
                c[i] = a[j] + b[i-j];
        }
    }
}
I populated the arrays a and b with some random double numbers and c with 999999 (just for testing). I validated the c array (on the CPU) using your function (without any modifications).
I also removed the conditionals from inside the inner loop, so they are only tested once.
I am not 100% sure, but I think the following modification makes sense. Since you had i - j >= 0, which is the same as i >= j, this means that as soon as j > i the loop will never enter this block 'X' (since j keeps increasing):
if(c[i] > a[j] + b[i-j])
    c[i] = a[j] + b[i-j];
So I compute the loop bound in the variable out: if i > sa, the loop finishes when j == sa; if i <= sa, it finishes (earlier) at j == i + 1 because of the condition i >= j.
The other condition, i - j < size(b), means that block 'X' only starts executing once j > i - sb; since j always starts at 1, when i > sb we can initialize j directly to the first value that satisfies this, thus
if(i > sb) j = i - sb + 1;
else j = 1;
See if you can test this version with real arrays of data, and give me feedback. Also, any improvements are welcome.
EDIT: A new optimization can be implemented, but this one does not make much of a difference.
if(c[i] > a[j] + b[i-j])
    c[i] = a[j] + b[i-j];
we can eliminate the if, by:
double add;
...
for(; j < out; j++)
{
    add = a[j] + b[i-j];
    c[i] = (c[i] < add) * c[i] + (add <= c[i]) * add;
}
Having:
if(a > b) c = b;
else c = a;
is the same as having c = (a < b) * a + (b <= a) * b.
if a > b then c = 0 * a + 1 * b; => c = b;
if a <= b then c = 1 * a + 0 * b; => c = a;
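For what it's worth, the same update could also be written with the device fmin() function; whether it beats the arithmetic form above would need measuring:

for(; j < out; j++)
    c[i] = fmin(c[i], a[j] + b[i-j]);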
**Benchmark:**
Size A    Size B    Size C    Time (s)
1000      1000      2000      0.0013
10k       10k       20k       0.0051
100k      100k      200k      0.4436
1M        1M        1M        47.327
I am measuring the time of copying from CPU to GPU, running the kernel and copying from GPU to CPU.
GPU Specifications
Device Tesla C2050
CUDA Capability Major/Minor 2.0
Global Memory 2687 MB
Cores 448 CUDA Cores
Warp size 32
I have used your algorithm. I think it will help you.
const int Length = 1000;

__global__ void OneD(float *Ad, float *Bd, float *Cd){
    int i = blockIdx.x;
    int j = threadIdx.x;
    Cd[i] = 99999.99;
    for(int k = 0; k < Length/500; k++){
        while(((i-j) >= 0) && ((i-j) < Length) && Cd[i+k*Length] > Ad[j+k*Length] + Bd[i-j]){
            Cd[i+k*Length] = Ad[j+k*Length] + Bd[i-j];
        }
    }
}
I have taken 500 threads per block and 500 blocks per grid. As the number of threads per block on my device is restricted to 512, I used 500 threads. I have taken the size of all the arrays as Length (=1000).
Working:
i stores the Block Index and j stores the Thread Index.
The for loop is used because the number of threads is less than the size of the arrays.
The while loop is used for iterating Cd[n].
I have not used shared memory because I have taken lots of blocks and threads, so the amount of shared memory required for each block is low.
PS: If your device supports more Threads and Blocks, replace k<Length/500 with k<Length/(supported number of threads)
Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:
For example, the CPU code:
for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos*posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val*x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();

        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1) Store 32 values of in2 in the shared memory variable t1,
(2) For each value of i and in1[tid], calculate t2[i],
(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? While debugging, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are re-written again. How can I get/store the values of out1 for each thread and write them to the output?
I have tried two approaches so far (suggested by @paseolatis on the NVIDIA forums):
(1) defined offset=tid*32; and replaced out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
    float result = t1[i] * inval;
    ......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
    out1[gcount] = t2[i] * in1[tid];
    gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
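For example, something like this on the host should read the count back after the kernel has finished (untested sketch, error checking omitted):

int h_count = 0;
cudaMemcpyFromSymbol(&h_count, gcount, sizeof(int));
printf("GPU: xcount = %d\n", h_count);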
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
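Or, taking the cudaGetSymbolAddress / cudaMemset route also mentioned above (again only a sketch, error checking omitted):

int *d_gcount = NULL;
cudaGetSymbolAddress((void **)&d_gcount, gcount); // device address of the __device__ variable
cudaMemset(d_gcount, 0, sizeof(int));             // zero it before launching the kernel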
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i += 32) {
        t1[i] = in2[i];
    }
}
__syncthreads();

for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
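A rough sketch of the per-thread output idea (the names tval and ival follow the kernel sketch in the other answer; the capacity maxPerThread, the local counter, and the use of ct as a per-thread count array are assumptions of mine, and out1 would then need ntr * maxPerThread elements):

int localCount = 0;
// ... inside the inner loop, instead of sharing one counter:
if (tval > 0.0f && localCount < maxPerThread) {
    out1[tid * maxPerThread + localCount] = tval * ival;
    localCount++;
}
// ... after the loops, record how many values this thread produced:
ct[tid] = localCount;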
I need to do a calculation like A[x][y] = sum over z from 0 to n of (B[x][y][z] + C[x][y][z]), where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z < n; ++z)
    for (y = 0; y < width; ++y)
        for (x = 0; x < height; ++x) {
            matrix[index] = value;
            index++;
        }
I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:
Launch configuration (host code):
dim3 block (n, 1, 1);
dim3 grid (height, width, 1);
Kernel:
atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]),
           B[blockIdx.x + blockIdx.y*gridDim.y + threadIdx.x*blockDim.x*blockDim.y]
           + C[blockIdx.x + blockIdx.y*gridDim.y + threadIdx.x*blockDim.x*blockDim.y] );
I would like to use shared memory for calculating the sum and then copy this result to global memory.
I am not sure how to do the part with shared memory. In each block's shared memory just one number (the sum result) will be stored. How should I copy this number to the right place in the A matrix in global memory?
You probably don't need shared memory or atomic memory access to do the summation you are asking about. If I have understood this correctly, your data is in column major order, so the logical operation is to have one thread per matrix entry in the output matrix, and have each thread traverse the z axis of the input matrices, summing as they go. The kernel for this could look something like:
__global__ void kernel(float *A, const float *B, const float *C,
                       const int width, const int height, const int n)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int tidy = threadIdx.y + blockDim.y * blockIdx.y;

    if ( (tidx < height) && (tidy < width) ) {
        int stride = width * height;
        int ipos = tidx + tidy * height;
        float *oval = A + ipos;
        float sum = 0.f;
        for(int z=0; z<n; z++, ipos+=stride) {
            sum += B[ipos] + C[ipos];
        }
        *oval = sum;
    }
}
This approach should be optimal for column-major data with width * height >= n. There are no performance advantages to using shared memory for this, and there is no need to use atomic memory operations either. If you had a problem where width * height << n, it might make sense to try a block-wise parallel reduction per summation. But you have not indicated what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter, and I can add a reduction-based sample kernel to the answer.
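For completeness, a possible launch for the kernel above might look like this (the 16x16 block shape and the device pointer names d_A, d_B, d_C are assumptions; any 2D block that covers height x width together with the bounds check will do):

dim3 block(16, 16);
dim3 grid((height + block.x - 1) / block.x,   // x direction covers the height
          (width  + block.y - 1) / block.y);  // y direction covers the width
kernel<<<grid, block>>>(d_A, d_B, d_C, width, height, n);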