Accessing 2D data in a CUDA kernel [duplicate] - arrays

This question already has an answer here:
2D array with CUDA and cudaMallocPitch
I'm doing an assignment for my university, and the main idea is to compare CUDA data parallelism with CUDA task parallelism. I decided to parallelize Conway's Game of Life. The problem is that I cannot figure out how to navigate a 2D array in CUDA in multiple directions, i.e. to the cells above, below, left and right of the cell the kernel evaluates, and to its four diagonal neighbours.
So far I came up with following:
The first Kernel Code
//determines the alive cell and save value of each cell into an array
__global__ void numAliveAround(int *oldBoard, int *newBoard, int xSize, int ySize, size_t pitchOld, size_t pitchNew)
{
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if(x < xSize && y < ySize)
{
//cell above
//xMod is to make sure the number wraps when it overflows the board
xMod = ((x + 1) % xSize + xSize) % xSize;
//idx calculation
idx = xMod * xSize + y;
outputNumber += board[idx];
//more of the same code, just for cell under, left, right, and corners
newBoard[x * xSize + y] = outputNumber;
}
}
The second Kernel code
//sets new cell status according to the number of alive cells around
__global__ void determineNextState(int *board, int *newBoard, int xSize, int ySize, size_t pitchOld, size_t pitchNew)
{
//getting threads
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if (x < xSize && y < ySize)
{
int idxNew = x * xSize + y;
int idxOld = x * xSize + y;
int state = board[idxOld];
//ALIVE = 1, DEAD = 0;
int output = DEAD;
//checking if any alive condition is met
if (state == ALIVE)
{
if ((newBoard[idxNew] == 2 || newBoard[idxNew] == 3))
{
output = ALIVE;
}
}
else
{
if (newBoard[idxNew] == 3)
{
output = ALIVE;
}
}
newBoard[idxNew] = output;
}
}
Kernel calling function
void SendToCUDA(int oldBoard[COLUMNS][ROWS], int newBoard[COLUMNS][ROWS])
{
//CUDA pointers
int *d_oldBoard;
int *d_newBoard;
size_t pitchOld;
size_t pitchNew;
cudaMallocPitch(&d_oldBoard, &pitchOld, COLUMNS * sizeof(int), ROWS);
cudaMallocPitch(&d_newBoard, &pitchNew, COLUMNS * sizeof(int), ROWS);
cudaMemcpy2D(d_oldBoard, pitchOld, oldBoard, COLUMNS * sizeof(int), COLUMNS * sizeof(int), ROWS, cudaMemcpyHostToDevice);
dim3 grid(divideAndRound(COLUMNS, BLOCKSIZE_X), divideAndRound(ROWS, BLOCKSIZE_Y));
dim3 block(BLOCKSIZE_Y, BLOCKSIZE_X);
printf("counting \n");
numberAliveAround <<<block, grid>>> (d_oldBoard, d_newBoard, COLUMNS, ROWS, pitchOld, pitchNew);
cudaDeviceSynchronize();
printf("determining \n");
determineNextState <<<block, grid>>> (d_oldBoard, d_newBoard, COLUMNS, ROWS, pitchOld, pitchNew);
cudaDeviceSynchronize();
//using newBoard later (outside the function) to display the Board
cudaMemcpy2D(newBoard, COLUMNS * sizeof(int), d_newBoard, pitchNew, COLUMNS * sizeof(int), ROWS, cudaMemcpyDeviceToHost);
cudaFree(d_oldBoard);
cudaFree(d_newBoard);
}
I found multiple ways of accessing a flattened 2D array, some of which contradict each other, like:
//what is usually used as an explanation
idx = x * width + y;
//sometimes x and y are swapped
idx = y * width + x;
//what works with simple access
int *value = (int *)((char *)(d_matrix + y * pitch)) + x;
//or
idx = x * xDim + y + pitch;
The funny thing is that the last two work when I just access a single point in the array (for example, to increase all the values in it by 1) but do not work at all with more complex navigation. I've been stuck on this problem for quite some time, so any kind of insight would be extremely helpful.

I figured out the answer. The correct way of accessing a 2D array allocated with cudaMallocPitch is:
board[y * (pitch / sizeof(int)) + x]
because pitch is the length of a row in bytes. When indexing the array through the [] operator, the pitch must first be converted to a number of elements:
pitch / sizeof(datatype)
Later I found even more issues with this code, so please don't just copy it.

Related

Matrix Transpose (with shared Memory) with arbitrary size on Cuda C

I can't figure out how to transpose a non-square matrix using shared memory in CUDA C. (I am new to CUDA C and C.)
On the website:
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
an efficient way to transpose a matrix is shown (Coalesced Transpose Via Shared Memory), but it only works for square matrices.
The code is also provided on GitHub (same as on the blog).
There is a similar question on Stack Overflow, where TILE_DIM = 16 is set. But with that implementation every thread copies just one element of the matrix to the result matrix.
This is my current implementation:
__global__ void transpose(double* matIn, double* matTran, int n, int m){
__shared__ double tile[TILE_DIM][TILE_DIM];
int i_n = blockIdx.x*TILE_DIM + threadIdx.x;
int i_m = blockIdx.y*TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
// Load matrix into tile
// Every Thread loads in this case 4 elements into tile.
int i;
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(i_n < n && (i_m+i) < m){
tile[threadIdx.y+i][threadIdx.x] = matIn[n*(i_m+i) + i_n];
} else {
tile[threadIdx.y+i][threadIdx.x] = -1;
}
}
__syncthreads();
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(tile[threadIdx.x][threadIdx.y+i] != -1){ // <- is there a better way?
if(true){ // <- what should be checked here?
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
} else {
matTran[m*i_n + (i_m+i)] = tile[threadIdx.x][threadIdx.y+i];
}
}
}
}
Here each thread copies 4 elements into the tile, and 4 elements from the tile back into the result matrix.
Here is the kernel configuration <<<a, b>>>:
where a: (ceil(n/TILE_DIM), ceil(n/TILE_DIM)) (the operands are cast to double) and
b: (TILE_DIM, BLOCK_ROWS) (-> (32, 8))
I am currently using the if(tile[threadIdx.x][threadIdx.y+i] != -1) statement to determine which threads should copy to the result matrix (there might be another way). As far as I understand, this behaves as follows: in a block, thread (x, y) copies the data into the tile and thread (y, x) copies the data back into the result matrix.
I inserted another if statement to determine where to copy the data, as there are 2(?) possible destinations, depending on the thread index. Currently true is inserted there, but I tried many different things. The best I could come up with was if(threadIdx.x+1 < threadIdx.y+i), which transposes a 3x2 matrix successfully.
Can someone please explain what I am missing when writing back into the result matrix? Obviously only one destination is correct. Using
matTran[n*(i_m+i) + i_n] = tile[threadIdx.x][threadIdx.y+i];
as mentioned on the blog should be correct, but I can't figure out why it is not working for non-square matrices.
I was overcomplicating the problem. The indices are NOT swapped as I thought. They are recalculated before the write-back, with the block's X and Y coordinates exchanged while the thread indices keep their roles. Here is the snippet:
i_n = blockIdx.y * TILE_DIM + threadIdx.x;
i_m = blockIdx.x * TILE_DIM + threadIdx.y;
Here is the corrected code:
__global__ void transposeGPUcoalescing(double* matIn, int n, int m, double* matTran){
__shared__ double tile[TILE_DIM][TILE_DIM];
int i_n = blockIdx.x * TILE_DIM + threadIdx.x;
int i_m = blockIdx.y * TILE_DIM + threadIdx.y; // <- threadIdx.y only between 0 and 7
// Load matrix into tile
// Every Thread loads in this case 4 elements into tile.
int i;
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(i_n < n && (i_m+i) < m){
tile[threadIdx.y+i][threadIdx.x] = matIn[(i_m+i)*n + i_n];
}
}
__syncthreads();
i_n = blockIdx.y * TILE_DIM + threadIdx.x;
i_m = blockIdx.x * TILE_DIM + threadIdx.y;
for (i = 0; i < TILE_DIM; i += BLOCK_ROWS){
if(i_n < m && (i_m+i) < n){
matTran[(i_m+i)*m + i_n] = tile[threadIdx.x][threadIdx.y + i]; // <- multiply by m, non-squared!
}
}
}
Thanks to this comment for noticing the error :)
If you would like to speed up your kernel even more, you can avoid shared memory bank conflicts, as shown here:
https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
Padding the tile declaration by one column helps a lot:
__shared__ double tile[TILE_DIM][TILE_DIM+1];

C: Print horizontal, vertical and oblique values of a point in an array

Excuse me for any grammatical errors.
I'll try to explain my problem as well as I can.
I'm working on a 2-dimensional array. Starting from a point in the array, I should print the two nearest cells on each side of it horizontally, vertically and diagonally.
The yellow cell is the start point; the red and grey cells are the cells that I have to print.
The solution I have found is to write 4 different algorithms: one that prints the horizontal cells, one that prints the vertical cells, one that prints the oblique cells from right to left, and one that prints the oblique cells from left to right.
So I am solving the problem as if I were working with vectors. I think that this is a really bad solution.
Example of the horizontal print:
int startPos=0;
int counter=5; //Five and not four, because it includes the start point that hasn't to be printed
if(column >= 2) startPos = column - 2;
else counter -= (2-column);
for(i=0; i<counter; i++){
if(startPos + i != column){ //the start point hasn't to be printed
printf("m-array[%d][%d] ", row, startPos + i);
}
}
I go back two cells from the start point and print the next four cells, skipping the start point.
If you want the 4 "different" algorithms to be 1, you just need to find what logic is shared between them, and make a function that implements it.
That shared part is that they all print a single line. Each line starts in a different place, and prints in a different direction. I called this function printLine.
Note, that the function I made can work with both statically and dynamically allocated arrays.
You can implement it differently. Specifically, you can combine both of the for loops and add a test to prevent the main cell from being printed.
#include <stdio.h>
int isInBounds(int rows, int cols, int y, int x) {
return (y >= 0) && (y < rows) && (x >= 0) && (x < cols);
}
void printLine(int *array, // pointer to start of the array
int rows, int cols, // amount of rows and columns
int count, // how many steps before and after the given cell
int posY, int posX, // the position of the cell to print around
int dirY, int dirX) { // the direction to advance each time
int y = posY - count * dirY;
int x = posX - count * dirX;
int i = 0;
// loop till we get to the given cell
// increase y and x according to dirY and dirX
for(i = 0; i < count; i++, y += dirY, x += dirX) {
if(isInBounds(rows, cols, y, x)) {
// don't print if this cell doesn't exist
printf("Array[%d][%d] = %d\n", y, x, array[y * cols + x]);
}
}
y = posY + dirY;
x = posX + dirX;
// loop count times, starting 1 step after the given cell
for(i = 0; i < count; i++, y += dirY, x += dirX) {
if(isInBounds(rows, cols, y, x)) {
// don't print if this cell doesn't exist
printf("Array[%d][%d] = %d\n", y, x, array[y * cols + x]);
}
}
}
int main(void) {
int rows = 5;
int cols = 8;
int array[rows][cols]; // array is uninitialized
int count = 2; // you wanted to print 5 without middle, which means 2 from each side
int posY = 2;
int posX = 3;
/*************************
* initialize array here */
int i = 0;
for(; i < rows * cols; i++) {
*(((int *)array) + i) = i;
}
/************************/
printLine((int *)array, rows, cols,
count,
posY, posX,
1, 0); // move 1 row down
printLine((int *)array, rows, cols,
count,
posY, posX,
0, 1); // move 1 column to the right
printLine((int *)array, rows, cols,
count,
posY, posX,
1, 1); // move 1 row and column, so down and right
printLine((int *)array, rows, cols,
count,
posY, posX,
-1, 1); // same as last diagonal but now up and right
}

Scaling a bitmap image getting segfault

Hi guys, I'm learning C and working on a programming assignment where I am to scale a given bitmap image, and I have been stuck on this all day. This is my code so far, but I am getting a segfault and can't figure out why. I've been tracing through the code all day and am just stuck. Here is the code of my scaling function; any help would be appreciated.
int enlarge(PIXEL* original, int rows, int cols, int scale,
PIXEL** new, int* newrows, int* newcols)
{
int ncols, nrows;
ncols = cols * scale;
nrows = rows * scale;
double xratio =(double) rows / nrows;
double yratio =(double) cols / ncols;
int px, py;
int auxw, cnt;
int i, j;
*new = (PIXEL*)malloc(nrows * ncols * sizeof(PIXEL));
for (i = 0; i < nrows; i++){
auxw = 0;
cnt = 0;
int m = i * 3;
for (j = 0; j < ncols; j++){
px = (int)floor( j * xratio);
py = (int)floor( i * yratio);
PIXEL* o1 = original + ((py*rows + px) *3);
PIXEL* n1 = (*new) + m*ncols + j + auxw;
*n1 = *o1;
PIXEL* o2 = original + ((py*rows + px) *3) + 1;
PIXEL* n2 = (*new) + m*ncols + j + 1 + auxw;
*n2 = *o2;
PIXEL* o3 = original + ((py*rows + px) *3) + 2;
PIXEL* n3 = (*new) + m*ncols + j + 2 + auxw;
*n3 = *o3;
auxw += 2;
cnt++;
}
}
return 0;
}
Using GDB, I get the following:
Program received signal SIGSEGV, Segmentation fault.
0x00000000004013ff in enlarge (original=0x7ffff7f1e010, rows=512, cols=512, scale=2, new=0x7fffffffdeb8,
newrows=0x7fffffffdfb0, newcols=0x0) at maind.c:53
53 *n3 = *o3;
However, I can't understand what exactly the problem is.
thanks
EDIT:
Working off code our professor provided for us, a PIXEL is defined as such:
typedef struct {
unsigned char r;
unsigned char g;
unsigned char b;
} PIXEL;
From my understanding i have a 2 dimensional array where each element of that array contains a 3 element PIXEL array.
Also, when tracing my code on paper, I added the auxw logic in order to advance down the array. It works somewhat in the same way as multiplying by 3.
Is your array a cols X rows array of PIXEL objects -- or is it actually an cols X rows X 3 array of PIXEL objects where what you call a pixel is actually really a component channel of a pixel? Your code isn't clear. When accessing the original array, you multiply by 3, suggesting an array of 3 channels:
PIXEL* o1 = original + ((py*rows + px) *3);
But when accessing the (*new) array there is no multiplication by 3, instead there's some logic I cannot follow with auxw:
PIXEL* n1 = (*new) + m*ncols + j + auxw;
auxw += 2;
Anyway, assuming that what you call a pixel is actually a channel, and that there are the standard 3 RGB channels in each pixel, you need to allocate 3 times as much memory for your array:
*new = (PIXEL*)malloc(nrows * ncols * 3*sizeof(PIXEL));
Some additional issues:
int* newrows and int* newcols are never initialized. You probably want to initialize them to the values of nrows and ncols
If PIXEL is really a CHANNEL, then rename it to correctly express its meaning.
Rather than copying logic for multidimensional array pointer arithmetic all over the place, protect yourself from indexing off your pixel/channel/whatever arrays by using a function:
#include <assert.h>
PIXEL *get_channel(PIXEL *pixelArray, int nRows, int nCols, int nChannels, int iRow, int iCol, int iChannel)
{
if (iRow < 0 || iRow >= nRows)
{
assert(!(iRow < 0 || iRow >= nRows));
return NULL;
}
if (iCol < 0 || iCol >= nCols)
{
assert(!(iCol < 0 || iCol >= nCols));
return NULL;
}
if (iChannel < 0 || iChannel >= nChannels)
{
assert(!(iChannel < 0 || iChannel >= nChannels));
return NULL;
}
return pixelArray + (iRow * nCols + iCol) * nChannels + iChannel;
}
Later, once your code is fully debugged, if performance is a problem you can replace the function with a macro in release mode:
#define get_channel(pixelArray, nRows, nCols, nChannels, iRow, iCol, iChannel)\
((pixelArray) + ((iRow) * (nCols) + (iCol)) * (nChannels) + (iChannel))
Another reason to use a standard get_channel() function is that your pointer arithmetic is inconsistent:
PIXEL* o1 = original + ((py*rows + px) *3);
PIXEL* n1 = (*new) + m*ncols + j + auxw;
to access the original pixel, you do array + iCol * nRows + iRow, which looks good. But to access the *new array, you do array + iCol * nCols + iRow, which looks wrong. Make a single function to access any pixel array, debug it, and use it.
Update
Given your definition of the PIXEL struct, there is no need for "adding those +1 and +2 values" to reach the second and third element of the PIXEL struct, as you describe. Since PIXEL is a struct, if you have a pointer to one you access its fields using the -> operator:
PIXEL *p_oldPixel = get_pixel(old, nRowsOld, nColsOld, iRowOld, iColOld);
PIXEL *p_newPixel = get_pixel(*p_new, nRowsNew, nColsNew, iRowNew, iColNew);
p_newPixel->r = p_oldPixel->r;
p_newPixel->g = p_oldPixel->g;
p_newPixel->b = p_oldPixel->b;
Or, in this case you can use the assignment operator to copy the struct:
*p_newPixel = *p_oldPixel;
As for indexing through the PIXEL array, since your pointers are correctly declared as PIXEL *, the C compiler's arithmetic will multiply offsets by the size of the struct.
Also, I'd recommend clarifying your code by using clear and consistent naming conventions:
Use consistent and descriptive names for loop iterators and boundaries. Is i a row or a column? Why use i in one place but py in another? A consistent naming convention helps to ensure you never mix up your rows and columns.
Distinguish pointers from variables or structures by prepending "p_" or appending "_ptr". A naming convention that clearly distinguishes pointers can make instances of pass-by-reference more clear, so (e.g.) you don't forget to initialize output arguments.
Use the same syllable for all variables corresponding to the old and new bitmaps. E.g. if you have arguments named old, nRowsOld and nColsOld you are less likely to accidentally use nColsOld with the new bitmap.
Thus your code becomes:
#include <assert.h>
#include <math.h>
#include <stdlib.h>
typedef struct _pixel {
unsigned char r;
unsigned char g;
unsigned char b;
} PIXEL;
PIXEL *get_pixel(PIXEL *pixelArray, int nRows, int nCols, int iRow, int iCol)
{
if (iRow < 0 || iRow >= nRows)
{
assert(!(iRow < 0 || iRow >= nRows));
return NULL;
}
if (iCol < 0 || iCol >= nCols)
{
assert(!(iCol < 0 || iCol >= nCols));
return NULL;
}
return pixelArray + iRow * nCols + iCol;
}
int enlarge(PIXEL* old, int nRowsOld, int nColsOld, int scale,
PIXEL **p_new, int *p_nRowsNew, int *p_nColsNew)
{
int nColsNew = nColsOld * scale;
int nRowsNew = nRowsOld * scale;
double xratio =(double) nColsOld / nColsNew;
double yratio =(double) nRowsOld / nRowsNew;
int iRowNew, iColNew;
*p_new = malloc(nRowsNew * nColsNew * sizeof(PIXEL));
*p_nRowsNew = nRowsNew;
*p_nColsNew = nColsNew;
for (iRowNew = 0; iRowNew < nRowsNew; iRowNew++){
for (iColNew = 0; iColNew < nColsNew; iColNew++){
int iColOld = (int)floor( iColNew * xratio);
int iRowOld = (int)floor( iRowNew * yratio);
PIXEL *p_oldPixel = get_pixel(old, nRowsOld, nColsOld, iRowOld, iColOld);
PIXEL *p_newPixel = get_pixel(*p_new, nRowsNew, nColsNew, iRowNew, iColNew);
*p_newPixel = *p_oldPixel;
}
}
return 0;
}
I haven't tested this code yet, but by using consistent naming conventions one can clearly see what it is doing and why it should work.

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

I'm trying to familiarize myself with CUDA programming, and having a pretty fun time of it. I'm currently looking at this pdf which deals with matrix multiplication, done with and without shared memory. Full code for both versions can be found here. This code is almost the exact same as what's in the CUDA matrix multiplication samples. Although the non-shared memory version has the capability to run at any matrix size, regardless of block size, the shared memory version must work with matrices that are a multiple of the block size (which I set to 4, default was originally 16).
One of the problems suggested at the end of the pdf is to change it so that the shared memory version can also work with non-multiples of the block size. I thought this would be a simple index check, like in the non-shared version:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if(row > A.height || col > B.width) return;
But this doesn't work. Here's the full code, minus the main method (a bit of a mess, sorry), which has been modified somewhat by me:
void MatMul(const Matrix A, const Matrix B, Matrix C) {
// Load A and B to device memory
Matrix d_A;
d_A.width = d_A.stride = A.width;
d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaError_t err = cudaMalloc(&d_A.elements, size);
printf("CUDA malloc A: %s\n",cudaGetErrorString(err));
err = cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
printf("Copy A to device: %s\n",cudaGetErrorString(err));
Matrix d_B;
d_B.width = d_B.stride = B.width;
d_B.height = B.height;
size = B.width * B.height * sizeof(float);
err = cudaMalloc(&d_B.elements, size);
printf("CUDA malloc B: %s\n",cudaGetErrorString(err));
err = cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
printf("Copy B to device: %s\n",cudaGetErrorString(err));
Matrix d_C;
d_C.width = d_C.stride = C.width;
d_C.height = C.height;
size = C.width * C.height * sizeof(float);
err = cudaMalloc(&d_C.elements, size);
printf("CUDA malloc C: %s\n",cudaGetErrorString(err));
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((B.width + dimBlock.x - 1) / dimBlock.x, (A.height + dimBlock.y-1) / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
err = cudaThreadSynchronize();
printf("Run kernel: %s\n", cudaGetErrorString(err));
// Read C from device memory
err = cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
printf("Copy C off of device: %s\n",cudaGetErrorString(err));
// Free device memory
cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);
}
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col) {
return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value) {
A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col) {
Matrix Asub;
Asub.width = BLOCK_SIZE;
Asub.height = BLOCK_SIZE;
Asub.stride = A.stride;
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row + BLOCK_SIZE * col];
return Asub;
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int rowTest = blockIdx.y * blockDim.y + threadIdx.y;
int colTest = blockIdx.x * blockDim.x + threadIdx.x;
if (rowTest>A.height || colTest>B.width)
return;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0.0;
// Thread row and column within Csub
int row = threadIdx.y;
int col = threadIdx.x;
// Loop over all the sub-matrices of A and B that are
// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results
for (int m = 0; m < (BLOCK_SIZE + A.width - 1)/BLOCK_SIZE; ++m) {
// Get sub-matrix Asub of A
Matrix Asub = GetSubMatrix(A, blockRow, m);
// Get sub-matrix Bsub of B
Matrix Bsub = GetSubMatrix(B, m, blockCol);
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load Asub and Bsub from device memory to shared memory
// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
{
Cvalue += As[row][e] * Bs[e][col];
}
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write Csub to device memory
// Each thread writes one element
SetElement(Csub, row, col, Cvalue);
}
Notable things which I changed: I added a check in MatMulKernel that tests whether the current thread is trying to work on a spot in C that doesn't exist. This doesn't seem to work. Although it does change the result, the changes don't seem to have any pattern, other than that later entries (higher x or y value) seem to be more affected (and I get a lot more non-integer results). I also changed the given dimGrid calculation and the loop condition for m in MatMulKernel (before, it was just width or height divided by block size, which seemed wrong).
Even the solutions guide that I found for this guide seems to suggest it should just be a simple index check, so I think I'm missing something really fundamental.
When the matrix dimensions are not multiples of the tile dimensions, some tiles cover the matrices only partially. The tile elements that fall outside the matrix in those partially overlapping tiles must be set to zero. A thread that returns early does not do that, does not load its share of the tiles, and skips the __syncthreads() barriers that the rest of its block still executes. So extending your code to arbitrarily sized matrices is easy, but it takes more than a simple index check. Below is my version of the tiled matrix-matrix multiplication kernel for arbitrarily sized matrices:
__global__ void MatMul(float* A, float* B, float* C, int ARows, int ACols, int BRows,
int BCols, int CRows, int CCols)
{
float CValue = 0;
int Row = blockIdx.y*TILE_DIM + threadIdx.y;
int Col = blockIdx.x*TILE_DIM + threadIdx.x;
__shared__ float As[TILE_DIM][TILE_DIM];
__shared__ float Bs[TILE_DIM][TILE_DIM];
for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {
if (k*TILE_DIM + threadIdx.x < ACols && Row < ARows)
As[threadIdx.y][threadIdx.x] = A[Row*ACols + k*TILE_DIM + threadIdx.x];
else
As[threadIdx.y][threadIdx.x] = 0.0;
if (k*TILE_DIM + threadIdx.y < BRows && Col < BCols)
Bs[threadIdx.y][threadIdx.x] = B[(k*TILE_DIM + threadIdx.y)*BCols + Col];
else
Bs[threadIdx.y][threadIdx.x] = 0.0;
__syncthreads();
for (int n = 0; n < TILE_DIM; ++n)
CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];
__syncthreads();
}
if (Row < CRows && Col < CCols)
C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols) +
(blockIdx.x * blockDim.x)+ threadIdx.x] = CValue;
}

Summation over one dimension of a three dimensional array using shared memory

I need to do a calculation like: A[x][y] = sum{from z=0 till z=n}{B[x][y][z]+C[x][y][z]}, where matrix A has dimensions [height][width] and matrices B and C have dimensions [height][width][n].
Values are mapped to memory with something like:
index = 0;
for (z = 0; z<n; ++z)
for(y = 0; y<width; ++y)
for(x = 0; x<height; ++x) {
matrix[index] = value;
index++;
}
I would like each block to calculate one sum, since each block has its own shared memory. To avoid data races I use atomicAdd, something like this:
Part of code in global memory:
dim3 block (n, 1, 1);
dim3 grid (height, width, 1);
Kernel:
atomicAdd( &(A[blockIdx.x + blockIdx.y*gridDim.y]),
B[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y]
+ C[blockIdx.x + blockIdx.y*gridDim.y+threadIdx.x*blockDim.x*blockDim.y] );
I would like to use shared memory for calculating the sum and then copy this result to global memory.
I am not sure how to do the part with shared memory. Each block's shared memory will store just one number (the sum result). How should I copy this number to the right place in matrix A in global memory?
You probably don't need shared memory or atomic memory access to do the summation you are asking about. If I have understood this correctly, your data is in column major order, so the logical operation is to have one thread per matrix entry in the output matrix, and have each thread traverse the z axis of the input matrices, summing as they go. The kernel for this could look something like:
__global__ void kernel(float *A, const float *B, const float *C,
const int width, const int height, const int n)
{
int tidx = threadIdx.x + blockDim.x * blockIdx.x;
int tidy = threadIdx.y + blockDim.y * blockIdx.y;
if ( (tidx < height) && (tidy < width) ) {
int stride = width * height;
int ipos = tidx + tidy * height;
float * oval = A + ipos;
float sum = 0.f;
for(int z=0; z<n; z++, ipos+=stride) {
sum += B[ipos] + C[ipos];
}
*oval = sum;
}
}
This approach should be optimal for column-major data with width * height >= n. There are no performance advantages to using shared memory for this, and there is no need to use atomic memory operations either. If you had a problem where width * height << n it might make sense to try a block wise parallel reduction per summation. But you have not indicated what the typical dimensions of the problem are. Leave a comment if your problem is more like the latter, and I can add a reduction based sample kernel to the answer.
