OpenCL - Element-wise operations on 4D array

I am trying to write an OpenCL code to do element-wise operations on multi-dimensional arrays.
I know that OpenCL buffers are flattened, which makes indexing a bit tricky. I succeeded when dealing with 2-dimensional arrays, but for 3+ dimensional arrays, I have either indexing errors or the wrong result.
This is all the more surprising since I use the same indexing principle/formula as in the 2D case.
2D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int height) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    c[i + height * j] = a[i + height * j] + b[i + height * j];
}
Correct.
3D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int idx = i + dim1 * j + dim1 * dim2 * k;
    c[idx] = a[idx] + b[idx];
}
Wrong result (usually an output buffer filled with values very close to 0).
4D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2, const int dim3) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int l = get_global_id(3);
    int idx = i + dim1 * j + dim1 * dim2 * k + l * dim1 * dim2 * dim3;
    c[idx] = a[idx] + b[idx];
}
Here is the indexing error: enqueue_knl_test1 pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_DIMENSION

In the 4D case, you are simply misusing the API. OpenCL does not support an arbitrary number of global/local dimensions, only up to 3.
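One common workaround, sketched below under the same row-major assumptions (the kernel name test1_4d and the combined index kl are mine, not from the original answer), is to keep a 3D NDRange of size (dim1, dim2, dim3 * dim4) and recover the last two indices inside the kernel:
__kernel void test1_4d(__global int* a, __global int* b, __global int* c,
                       const int dim1, const int dim2, const int dim3) {
    int i  = get_global_id(0);
    int j  = get_global_id(1);
    int kl = get_global_id(2);   // combined index over the 3rd and 4th dimensions
    int k  = kl % dim3;
    int l  = kl / dim3;
    int idx = i + dim1 * j + dim1 * dim2 * k + dim1 * dim2 * dim3 * l;
    c[idx] = a[idx] + b[idx];
}
Since the operation is purely element-wise, an even simpler alternative is the 1D formulation shown further below, which sidesteps the dimension limit entirely.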
In the 2D case, your indexing seems wrong. Assuming row-major arrays, it should be i + j * width, not i + j * height.
In the 3D case, the indexing inside the kernel seems OK, assuming a row-major memory layout and that dim1 equals cols (width) and dim2 equals rows (height). But your question lacks context:
Input buffer allocation and initialization.
Kernel invocation code (parameters, work-group and global size).
Result collection and synchronization.
You could also be accessing beyond the allocated buffer size; that should be checked.
Doing any of these steps incorrectly can easily lead to unexpected results, even if your kernel code is OK.
If you wish to debug indexing issues, the easiest thing to do is to write a simple kernel that outputs the calculated index:
__kernel void test1(__global int* c, const int dim1, const int dim2) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int idx = i + dim1 * j + dim1 * dim2 * k;
    c[idx] = idx;
}
You should then expect a result with linearly increasing values. I would start with a single workgroup and then move on to using multiple workgroups.
Also, if you perform a simple element-wise operation between arrays, it is much simpler to use 1D indexing. You could simply use a 1D work-group and a global size that equals the number of elements (rounded up to fit the work-group dimension):
__kernel void test1(__global int* a, __global int* b, __global int* c, const int total) {
    // no need for complex indexing for elementwise operations
    int idx = get_global_id(0);
    if (idx < total)
    {
        c[idx] = a[idx] + b[idx];
    }
}
You would probably set local_work_size to the maximum size the hardware allows (for instance 512 for Nvidia, 256 for AMD) and global_work_size to the total number of elements, rounded up to a multiple of local_work_size. See clEnqueueNDRangeKernel.
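As a rough host-side illustration of that rounding (my sketch, not part of the original answer; enqueue_elementwise is a made-up helper name, and the kernel is assumed to have the 4-argument signature shown above):
#include <CL/cl.h>

/* Round the global size up to a multiple of the work-group size; the
   out-of-range work items are masked by the if (idx < total) check in the kernel. */
cl_int enqueue_elementwise(cl_command_queue queue, cl_kernel kernel,
                           size_t total, size_t local_size)
{
    size_t global_size = ((total + local_size - 1) / local_size) * local_size;
    cl_int total_arg = (cl_int)total;
    clSetKernelArg(kernel, 3, sizeof(cl_int), &total_arg);
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size, 0, NULL, NULL);
}
The same rounding applies if you launch through pyopencl's kernel call interface: pass the rounded-up global size and the chosen local size.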
2D and 3D dimensions are usually used for operations that access adjacent elements in 2D/3D space, such as image convolutions.

Related

Aggregate many small arrays in fewer large arrays by basic function

I have many small 2D arrays (e.g. M x 32 x 40) and fewer larger 2D arrays (e.g. N x 200 x 300).
I would like to 'put' the smaller matrices at indices n,i,j in the larger arrays (i,j being the upper-left index of the small array within the large array at batch index n). These small arrays could overlap and should be aggregated by functions that are associative and commutative, say plus, multiply, etc.
I figure this is a pretty basic scenario that many people should have come across, right? Is there a CUDA implementation that supports this in an efficient way?
Typical values: M = 10^6, N = 10^4.
This is a reduction operation.
In addition to what is expressed in the comments, I'll make the assumption that the distribution of the M matrices, in terms of which of the N matrices they belong to, is relatively uniform, i.e. evenly distributed. For the dimensions given, this means there will be approximately 100 of the M matrices intended to update N matrix 0, 100 for N matrix 1, and so on. Furthermore, if we inspect the n array, we would observe a uniformly random pattern of indices (i.e. no clumping or grouping).
Given that, in what may be a first for me, I'll suggest a lock/critical section algorithm, using the plumbing from here. Each threadblock will take one of the M arrays, and attempt to acquire a lock so that it can update the appropriate N array. When finished, release the lock.
I considered other approaches as well, some of which are evident in the code. In any event, for the stated conditions, the lock based approach had a kernel runtime of about 40ms on my V100 GPU, which was the best I observed.
I would also note that the stated dimensions result in a data working set of ~8GB. Not that that is a problem, just be aware if running this code as-is on your laptop GPU.
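For a rough check of that figure, the device-side allocations in the code below work out to approximately:
d_M:           10^6 * 32 * 40 * 4 B   ~ 5.12 GB
d_N:           10^4 * 200 * 300 * 4 B ~ 2.40 GB
d_n, d_i, d_j: 3 * 10^6 * 4 B         ~ 0.01 GB
which totals roughly 7.5 GB of GPU memory (the host-side copies add several GB more).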
Here's an example:
$ cat t34.cu
#include <iostream>
#include <cstdlib>
const int N = 10000;
const int M = 1000000;
const int Mx = 32;
const int My = 40;
const int Nx = 200;
const int Ny = 300;
const int nTPB = 256;
template <typename T>
__host__ __device__
T reduction_op(T &a, const T &b){ return a+b;}
template <typename T>
__global__ void k(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M){
for (int ii = 0; ii < num_M; ii++){
if (n[ii] == blockIdx.x) {
for (int jj = threadIdx.x; jj < Mx*My; jj += blockDim.x){
int y = jj/Mx;
int x = jj - (y*Mx);
N[blockIdx.x*Nx*Ny + i[ii] + (j[ii]+y)*Nx + x] = reduction_op(
N[blockIdx.x*Nx*Ny + i[ii] + (j[ii]+y)*Nx + x], M[ii*Mx*My + y*Mx + x]);}
}
__syncthreads();}
}
// assumes Ny is whole-number divisible by sl
template <typename T>
__global__ void ki(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M, const int sl){
extern __shared__ T s[];
for (int c = 0; c < Ny; c+=sl){ // process per chunk of N array
// load shared
for (int t = threadIdx.x; t < sl*Nx; t += blockDim.x) s[t] = N[blockIdx.x*Nx*Ny + c*Nx + t];
__syncthreads();
// process chunk stack
for (int ii = 0; ii < num_M; ii++){ // iterate through "stack"
if ((n[ii] == blockIdx.x) && (j[ii] < (c+sl)) && ((j[ii]+My) > c)) {
for (int jj = threadIdx.x; jj < sl*Mx; jj += blockDim.x){
int y = jj/Mx;
int x = jj - (y*Mx);
//y += c;
if ((y+c >= j[ii]) && (y+c < (j[ii]+My)))
s[y*Nx+x+i[ii]] = reduction_op(s[y*Nx+x+i[ii]], M[ii*Mx*My + (y+c-j[ii])*Mx + x]);}
}
__syncthreads();}
// save shared
for (int t = threadIdx.x; t < sl*Nx; t += blockDim.x) N[blockIdx.x*Nx*Ny + c*Nx + t] = s[t];
}
}
template <typename T>
__global__ void ka(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M){
int x = threadIdx.x;
for (int y = threadIdx.y; y < My; y += blockDim.y)
atomicAdd(N+n[blockIdx.x]*Nx*Ny+(j[blockIdx.x]+y)*Nx+i[blockIdx.x]+x, M[blockIdx.x*Mx*My+y*Mx+x]);
}
__device__ void acquire_semaphore(volatile int *lock){
while (atomicCAS((int *)lock, 0, 1) != 0);
}
__device__ void release_semaphore(volatile int *lock){
*lock = 0;
__threadfence();
}
template <typename T>
__global__ void kl(const T * __restrict__ M, T * __restrict__ N, const int * __restrict__ n, const int * __restrict__ i, const int * __restrict__ j, const int num_M, int * __restrict__ locks){
if ((threadIdx.x == 0) && (threadIdx.y == 0))
acquire_semaphore(locks+n[blockIdx.x]);
__syncthreads();
//begin critical section
int x = threadIdx.x;
for (int y = threadIdx.y; y < My; y += blockDim.y){
N[n[blockIdx.x]*Nx*Ny + i[blockIdx.x] + (j[blockIdx.x]+y)*Nx + x] = reduction_op(
N[n[blockIdx.x]*Nx*Ny + i[blockIdx.x] + (j[blockIdx.x]+y)*Nx + x], M[blockIdx.x*Mx*My + y*Mx + x]);}
// end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if ((threadIdx.x == 0) && (threadIdx.y == 0))
release_semaphore(locks+n[blockIdx.x]);
}
typedef float mt;
int main(){
mt *d_M, *h_M, *d_N, *h_N, *r1, *r2;
int *d_n, *h_n, *d_i, *h_i, *d_j, *h_j;
h_M = new mt[M*Mx*My];
h_N = new mt[N*Nx*Ny];
r1 = new mt[N*Nx*Ny];
r2 = new mt[N*Nx*Ny];
h_n = new int[M];
h_i = new int[M];
h_j = new int[M];
cudaMalloc(&d_M, M*Mx*My*sizeof(mt));
cudaMalloc(&d_N, N*Nx*Ny*sizeof(mt));
cudaMalloc(&d_n, M*sizeof(int));
cudaMalloc(&d_i, M*sizeof(int));
cudaMalloc(&d_j, M*sizeof(int));
for (int i = 0; i < M; i++){
h_n[i] = rand()%N;
h_i[i] = rand()%(Nx - Mx);
h_j[i] = rand()%(Ny - My);}
for (int i = 0; i < N*Nx*Ny; i++) h_N[i] = (mt)(i%3);
for (int i = 0; i < M*Mx*My; i++) h_M[i] = (mt)((i%3)+1);
cudaMemcpy(d_M, h_M, M*Mx*My*sizeof(mt), cudaMemcpyHostToDevice);
cudaMemcpy(d_N, h_N, N*Nx*Ny*sizeof(mt), cudaMemcpyHostToDevice);
cudaMemcpy(d_n, h_n, M*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_i, h_i, M*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_j, h_j, M*sizeof(int), cudaMemcpyHostToDevice);
#ifdef USE_SINGLE_N
cudaMemset(d_n, 0, M*sizeof(int));
#endif
#if 0
const int sl = 40;
const int sb = sl * Nx * sizeof(mt);
ki<<<N, nTPB, sb>>>(d_M, d_N, d_n, d_i, d_j, M, sl);
cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
#endif
dim3 block(Mx, 8);
#if 0
ka<<<M, block>>>(d_M, d_N, d_n, d_i, d_j, M);
cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
#endif
int *d_locks;
cudaMalloc(&d_locks, N*sizeof(int));
cudaMemset(d_locks, 0, N*sizeof(int));
kl<<<M, block>>>(d_M, d_N, d_n, d_i, d_j, M, d_locks);
cudaMemcpy(r2, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
cudaMemcpy(d_N, h_N, N*Nx*Ny*sizeof(mt), cudaMemcpyHostToDevice);
k<<<N, nTPB>>>(d_M, d_N, d_n, d_i, d_j, M);
cudaMemcpy(r1, d_N, N*Nx*Ny*sizeof(mt), cudaMemcpyDeviceToHost);
for (int i = 0; i < N*Nx*Ny; i++) if (r1[i] != r2[i]) {std::cout << "mismatch at: " << i << " was: " << r2[i] << " should be: " << r1[i] << std::endl; return 0;}
}
$ nvcc -o t34 t34.cu -O3 -lineinfo
$ nvprof ./t34
==17970== NVPROF is profiling process 17970, command: ./t34
==17970== Profiling application: ./t34
==17970== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 34.57% 3.09036s 2 1.54518s 1.54294s 1.54742s [CUDA memcpy DtoH]
33.18% 2.96615s 1 2.96615s 2.96615s 2.96615s void k<float>(float const *, float*, int const *, int const *, int const *, int)
31.81% 2.84401s 6 474.00ms 1.4255ms 1.27035s [CUDA memcpy HtoD]
0.45% 39.949ms 1 39.949ms 39.949ms 39.949ms void kl<float>(float const *, float*, int const *, int const *, int const *, int, int*)
0.00% 2.1120us 1 2.1120us 2.1120us 2.1120us [CUDA memset]
API calls: 96.13% 8.94558s 8 1.11820s 1.9203ms 4.51030s cudaMemcpy
3.60% 334.59ms 6 55.765ms 277.58us 330.37ms cudaMalloc
0.15% 13.752ms 8 1.7190ms 1.3268ms 2.2025ms cuDeviceTotalMem
0.11% 10.472ms 808 12.959us 172ns 728.50us cuDeviceGetAttribute
0.01% 997.81us 8 124.73us 100.93us 176.73us cuDeviceGetName
0.00% 69.047us 2 34.523us 32.349us 36.698us cudaLaunchKernel
0.00% 68.013us 1 68.013us 68.013us 68.013us cudaMemset
0.00% 46.172us 8 5.7710us 1.8940us 23.025us cuDeviceGetPCIBusId
0.00% 8.5060us 16 531ns 260ns 1.5030us cuDeviceGet
0.00% 3.7870us 8 473ns 229ns 881ns cuDeviceGetUuid
0.00% 3.3980us 3 1.1320us 610ns 2.0780us cuDeviceGetCount
$
Extended discussion:
On performance:
This is a memory-bound algorithm. Therefore, we can estimate optimal kernel performance by determining the minimum number of memory reads and writes needed to perform the operation, then dividing by the available memory bandwidth, to arrive at a lower bound for the kernel duration. Unfortunately, the minimum number of reads and writes depends on the positioning of the M matrices, so it cannot easily be determined in general without inspecting the n, i, and j arrays.
However we can look for another way to estimate. Another approach to estimation would be to observe that each M matrix update will require reading 2 values and writing one value. If we then use that as our estimate, we come up with M*Mx*My*3*sizeof(element_of_M)/GPU_memory_bandwidth. On my V100 (~700GB/s BW) this works out to about 20ms lower bound on kernel duration.
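Spelling that estimate out for the sizes used here (a back-of-the-envelope check, consistent with the ~20 ms figure):
t_lower ~ M * Mx * My * 3 * sizeof(float) / BW
        = 10^6 * 32 * 40 * 3 * 4 B / (700 GB/s)
        ~ 15.4 GB / (700 GB/s) ~ 22 ms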
On approaches considered:
"naive" approach, kernel k: Each threadblock will be responsible for one of the N matrices, and will iterate through the M matrices, inspecting n to determine if the M matrices will update the assigned N matrix. This gives a non-optimal run time of ~3s but seems to be mostly invariant performance-wise based on the distribution of n, and can use an "arbitrary" reduction op.
attempt at "optimal" approach, kernel ki: Each threadblock will be responsible for one of the N matrices, but will only load a chunk of that matrix at a time. It will then proceed through the M matrices updating that chunk, similar the the k kernel. This necessitates more loops through the matrices, but should "almost" only load or save each global memory item the minimum number of times necessary. Nevertheless, the run time is really long, ~40s
atomic approach, kernel ka: Each threadblock will be responsible for one of the M matrices, and will atomically update the relevant N matrix. Simplicity. And the runtime is "fast" at ~40ms. (The atomic approach may be even faster than this is non-uniform n distributions. I witnessed kernel runtimes as low as 8ms!) However this is not readily generalizable to operations that don't have an atomic equivalent, such as multiply.
lock based approach, kernel kl: Like the atomic approach, each threadblock will be responsible for one of the M matrices, and will first acquire a lock on the relevant N matrix. The lock means that atomics are not necessary. For the uniformly distributed n case presented, it has about the same performance as the atomic case. It has the benefit that it can handle other reduction ops, such as multiply, readily. A disadvantage is that in the presence of non-uniformly-random distribution in n the performance can suffer, with a worst case in the ballpark of the naive kernel (3-5s).
Overall, if the requirement for an arbitrary reduction operator can be dropped (e.g. only addition is needed), then the atomic method may be best.

My first 2D arrays CUDA

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#define BLOCK_SIZE 6
#define GRID_SIZE 1
__global__ void test(int A[BLOCK_SIZE][BLOCK_SIZE], int B[BLOCK_SIZE][BLOCK_SIZE], int C[BLOCK_SIZE][BLOCK_SIZE]) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    C[i][j] = A[i][j] + B[i][j];
}
int main(){
    int A[BLOCK_SIZE][BLOCK_SIZE];
    int B[BLOCK_SIZE][BLOCK_SIZE];
    int C[BLOCK_SIZE][BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++){
            A[i][j] = i + j;
            B[i][j] = i + j;
        }
    int dev_A[BLOCK_SIZE][BLOCK_SIZE];
    int dev_B[BLOCK_SIZE][BLOCK_SIZE];
    int dev_C[BLOCK_SIZE][BLOCK_SIZE];
    cudaMalloc((void**)&dev_C, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    cudaMalloc((void**)&dev_A, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    cudaMalloc((void**)&dev_B, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    cudaMemcpy(dev_A, A, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); // so your threads are BLOCK_SIZE*BLOCK_SIZE, 36 in this case
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);    // 1*1 blocks in a grid
    test<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C);
    cudaDeviceSynchronize();
    cudaMemcpy(C, dev_C, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);
}
I tried to copy this code from How to use 2D Arrays in CUDA?.
Some websites tell me to use something like
result[row*WIDTH + col] = array1[row*WIDTH + col] + array2[row*WIDTH + col];
but I don't know how to use it.
My result is always -858993460.
There are two main issues with your code:
Firstly, when you define an array within function scope like this:
int dev_A[BLOCK_SIZE][BLOCK_SIZE];
This creates an array of arrays in host memory which is stored contiguously on the stack. This array can be used straight away from host code without further allocating any memory for it. This is a real C array and not a pointer. While this is fine and correct for A, B and C, this will not suffice for your declarations of dev_A, dev_B and dev_C, as you require memory allocated on the device for these.
There are a couple of ways to correct this. One way is to instead use a pointer to an array of arrays of ints. The syntax for such a declaration is as follows:
int (*dev_A)[BLOCK_SIZE][BLOCK_SIZE];
If you go by this approach, I would recommend changing your cudaMalloc and cudaMemcpy calls as follows:
cudaMalloc((void **) &dev_A, sizeof *dev_A);
// ...
cudaMemcpy(dev_A, &A, sizeof *dev_A, cudaMemcpyHostToDevice);
The difference here is that using sizeof *dev_A is the same as writing sizeof(int [BLOCK_SIZE][BLOCK_SIZE]), which gives the number of bytes taken up by the entire host array, and using &A instead of A, since &A gives a pointer to an array of arrays, while A decays to a pointer to an array. Technically what you already have should evaluate to the exact same values, since the size of an array is equal to the size of its elements multiplied by its length, and also a pointer to an array points to the same address as the first element in that array, however it would be more correct and consistent with how you would use cudaMalloc and cudaMemcpy with any other non-array type, and rightly treats the array of arrays as one single value:
int A, *dev_A;
cudaMalloc((void **) &dev_A, sizeof *dev_A);
cudaMemcpy(dev_A, &A, sizeof *dev_A, cudaMemcpyHostToDevice);
The other approach would be to dynamically allocate memory for multiple contiguous int [BLOCK_SIZE]s rather than a single int [BLOCK_SIZE][BLOCK_SIZE], which could be done as follows:
int (*dev_A)[BLOCK_SIZE];
// ...
cudaMalloc((void **) &dev_A, sizeof *dev_A * BLOCK_SIZE);
// ...
cudaMemcpy(dev_A, A, sizeof *dev_A * BLOCK_SIZE, cudaMemcpyHostToDevice);
This means dev_A now represents a pointer to an array of BLOCK_SIZE ints which is the first element of a sequence of BLOCK_SIZE contiguous arrays in memory. Notice how this time, A is used for cudaMemcpy rather than &A, as A's int [BLOCK_SIZE][BLOCK_SIZE] type decays to int (*)[BLOCK_SIZE] which matches the type of dev_A. Technically speaking, all the approaches mentioned so far do exactly the same thing and pass the same numerical values to the cudaMalloc and cudaMemcpy functions, however, the type of dev_A, dev_B and dev_C is important for how the arrays are used later.
The second issue with your code is in the signature of the test kernel function itself. This function has parameters declared like int A[BLOCK_SIZE][BLOCK_SIZE], however, in C (and C++), when you declare an array parameter in a function, it is instead adjusted to actually be a pointer to the array's element type. So int A[N] as a function parameter actually declares int *A, and the size is ignored. In the case of arrays of arrays, such as int A[N][M], this is converted to int (*A)[M], which means your parameters are int (*)[BLOCK_SIZE] (pointer to an array of BLOCK_SIZE ints) and your function currently has the following effective signature:
__global__
void test(int (*A)[BLOCK_SIZE],
int (*B)[BLOCK_SIZE],
int (*C)[BLOCK_SIZE])
If you stick with this function signature, then if you follow the approach of making dev_A and friends of type int (*)[BLOCK_SIZE], then your code should work as is, as the expression A[i][j] in your function first locates and dereferences the ith array after the address A, and then this array value decays into an int * pointer, and then the jth int after this address is accessed. However if you take the approach of declaring your device pointers as int (*dev_A)[BLOCK_SIZE][BLOCK_SIZE], then you will either have to dereference these pointers when calling your kernel like so (which should be fine as the dereferenced array immediately decays into a pointer so device memory should not be accessed from host code):
test<<<dimGrid, dimBlock>>>(*dev_A, *dev_B, *dev_C);
Or alternatively, the signature of the test function can be changed as follows:
__global__
void test(int (*A)[BLOCK_SIZE][BLOCK_SIZE],
int (*B)[BLOCK_SIZE][BLOCK_SIZE],
int (*C)[BLOCK_SIZE][BLOCK_SIZE])
When doing so however, these pointers-to-arrays must be first dereferenced before accessing their data, so your code within your function will have to be changed as follows:
(*C)[i][j] = (*A)[i][j] + (*B)[i][j];
Using plain C arrays, arrays of arrays, pointers to arrays, and pointers to arrays of arrays can have quite confusing semantics, and also requires your array's size to be fixed at compile time, so instead of any of these approaches you may prefer to use a single linear sequence of ints and index the elements yourself, for example:
__global__
void test(int *A)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    A[row * BLOCK_SIZE + col] = 123;
}
Device memory for this can easily be allocated as follows:
int *dev_A;
cudaMalloc((void **) &dev_A, sizeof *dev_A * BLOCK_SIZE * BLOCK_SIZE);
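Tying the pieces together, a minimal end-to-end sketch of this flattened approach could look like the following (not from the original answer; the kernel name add2d is illustrative, and it simply combines the kernel and allocation shown above with the copy/launch pattern from the question):
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 6

// Each thread adds one element of the flattened BLOCK_SIZE x BLOCK_SIZE arrays.
__global__ void add2d(const int *A, const int *B, int *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    C[row * BLOCK_SIZE + col] = A[row * BLOCK_SIZE + col] + B[row * BLOCK_SIZE + col];
}

int main()
{
    int A[BLOCK_SIZE][BLOCK_SIZE], B[BLOCK_SIZE][BLOCK_SIZE], C[BLOCK_SIZE][BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++)
        for (int j = 0; j < BLOCK_SIZE; j++) {
            A[i][j] = i + j;
            B[i][j] = i + j;
        }

    // Device buffers are plain int* pointers to BLOCK_SIZE*BLOCK_SIZE ints.
    int *dev_A, *dev_B, *dev_C;
    size_t bytes = BLOCK_SIZE * BLOCK_SIZE * sizeof(int);
    cudaMalloc((void **)&dev_A, bytes);
    cudaMalloc((void **)&dev_B, bytes);
    cudaMalloc((void **)&dev_C, bytes);

    // The host arrays are contiguous, so they can be copied as-is.
    cudaMemcpy(dev_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    add2d<<<dim3(1, 1), dimBlock>>>(dev_A, dev_B, dev_C);

    cudaMemcpy(C, dev_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[2][3] = %d (expected %d)\n", C[2][3], A[2][3] + B[2][3]);

    cudaFree(dev_A); cudaFree(dev_B); cudaFree(dev_C);
    return 0;
}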
An important note is that CUDA code is not C and is actually C++, however your code and the code discussed in this answer is both valid C and C++ (ignoring CUDA extensions). This may create some additional obstacles when writing C-like code, for example having to explicitly cast void * values to other pointer types, but also allows you to make use of useful C++ features such as operator overloading, as featured in talonmies's answer, to encapsulate addressing a 2D grid of values within a single linear buffer of data (so you can write A(row, col) instead of A[row * BLOCK_SIZE + col]).
There is a lot wrong with the code you posted, and most of it is probably related to the ambiguous way that C and related languages deal with statically declared multidimensional arrays and the [][] style indexing scheme they support.
Rather than describe all the required fixes I will just leave this here:
#include <stdio.h>
#define BLOCK_SIZE 6
#define GRID_SIZE 1
template<typename T>
struct array2D
{
    T* p;
    int lda;

    __device__ __host__
    array2D(T* _p, int cols) : p(_p), lda(cols) {}

    __device__ __host__
    T& operator()(int i, int j) { return p[i * lda + j]; }

    __device__ __host__
    T& operator()(int i, int j) const { return p[i * lda + j]; }
};
__global__ void test(array2D<int> A, array2D<int> B, array2D<int> C) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    C(i,j) = A(i,j) + B(i,j);
}
int main(){
    int A[BLOCK_SIZE][BLOCK_SIZE];
    int B[BLOCK_SIZE][BLOCK_SIZE];
    int C[BLOCK_SIZE][BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
        for (int j = 0; j < BLOCK_SIZE; j++){
            A[i][j] = i + j;
            B[i][j] = i + j;
        }
    }
    int* dev_A; cudaMalloc((void**)&dev_A, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    int* dev_B; cudaMalloc((void**)&dev_B, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    int* dev_C; cudaMalloc((void**)&dev_C, BLOCK_SIZE * BLOCK_SIZE * sizeof(int));
    cudaMemcpy(dev_A, A, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_B, B, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyHostToDevice);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); // so your threads are BLOCK_SIZE*BLOCK_SIZE, 36 in this case
    dim3 dimGrid(GRID_SIZE, GRID_SIZE);    // 1*1 blocks in a grid
    test<<<dimGrid, dimBlock>>>(array2D<int>(dev_A, BLOCK_SIZE),
                                array2D<int>(dev_B, BLOCK_SIZE),
                                array2D<int>(dev_C, BLOCK_SIZE));
    cudaDeviceSynchronize();
    cudaMemcpy(C, dev_C, BLOCK_SIZE * BLOCK_SIZE * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        for (int j = 0; j < BLOCK_SIZE; j++){
            printf("(%d,%d) = %d {%d}\n", i, j, C[i][j], A[i][j] + B[i][j]);
        }
    }
    return 0;
}
The most important feature of the code is the use of a tiny wrapper class which provides you with the (i,j) style indexing you apparently want without any complexity in the kernel code. At this point you don't even need to understand how it works, just accept that it provides you with the necessary indexing mechanism you want within the kernel and use it.
If you compile and run the code like so:
$ nvcc --std=c++11 myfirstpony.cu -o myfirstpony
$ ./myfirstpony
(0,0) = 0 {0}
(0,1) = 2 {2}
(0,2) = 4 {4}
(0,3) = 6 {6}
(0,4) = 8 {8}
(0,5) = 10 {10}
(1,0) = 2 {2}
(1,1) = 4 {4}
(1,2) = 6 {6}
(1,3) = 8 {8}
(1,4) = 10 {10}
(1,5) = 12 {12}
(2,0) = 4 {4}
(2,1) = 6 {6}
(2,2) = 8 {8}
(2,3) = 10 {10}
(2,4) = 12 {12}
(2,5) = 14 {14}
(3,0) = 6 {6}
(3,1) = 8 {8}
(3,2) = 10 {10}
(3,3) = 12 {12}
(3,4) = 14 {14}
(3,5) = 16 {16}
(4,0) = 8 {8}
(4,1) = 10 {10}
(4,2) = 12 {12}
(4,3) = 14 {14}
(4,4) = 16 {16}
(4,5) = 18 {18}
(5,0) = 10 {10}
(5,1) = 12 {12}
(5,2) = 14 {14}
(5,3) = 16 {16}
(5,4) = 18 {18}
(5,5) = 20 {20}
You can see for yourself the correctness of the result.

OpenCL Nested Kernels

I have an n x n x m ndarray that I would like to reduce along the m-axis. pyopencl has a built-in ReductionKernel, which is the following:
class pyopencl.reduction.ReductionKernel(ctx, dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", options=[], preamble="")
If written the following way, it successfully sums a single vector down to a scalar:
krnl = ReductionKernel(context, numpy.float32, neutral="0",reduce_expr="a+b", map_expr="x[i]", arguments="__global float *x")
sum_A = krnl(d_A).get()
where sum_A is a float and d_A is the vector on device memory.
I'd like to call this kernel and pass an m-length column for each index in the n x n matrix. My strategy was to pass the entire n x n x m ndarray to a parent kernel and then use enqueue_kernel to pass an array to ReductionKernel, but I'm unsure of how the syntax works in terms of receiving the sum. By the way, the matrix is already in row-major order, so the m-length arrays are already contiguous.
__kernel void reduce(global float* input,
                     global float* output,
                     const unsigned int m,
                     const unsigned int n)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    float array[m];  // pseudocode - OpenCL C does not allow variable-length arrays
    // Initialize an array of input[i*m+j*m*n] to input[i*m+j*m*n + m]
    for (int k = 0; k < m; k++)
    {
        array[k] = input[i*m + j*m*n + k];
    }
    // enqueue ReductionKernel with this array
    // Place result in output[i*n+j]
}

wrong in initialize shared memory with global memory in CUDA

I have been writing a simple CUDA program recently; the kernel function is below:
#define BLOCK_SIZE 16
#define RADIOUS 7
#define SM_SIZE BLOCK_SIZE+2*RADIOUS
__global__ static void DarkChannelPriorCUDA(const float* r, size_t ldr, const float* g, size_t ldg, const float* b, size_t ldb, float* d, size_t ldd, int n, int m)
{
    __shared__ float R[SM_SIZE][SM_SIZE];
    __shared__ float G[SM_SIZE][SM_SIZE];
    __shared__ float B[SM_SIZE][SM_SIZE];

    const int tidr = threadIdx.x;
    const int tidc = threadIdx.y;
    const int bidr = blockIdx.x * BLOCK_SIZE;
    const int bidc = blockIdx.y * BLOCK_SIZE;
    int i, j, tr, tc;

    for (i = 0; i < SM_SIZE; i += BLOCK_SIZE)
    {
        tr = bidr - RADIOUS + i + tidr;
        for (j = 0; j < SM_SIZE; j += BLOCK_SIZE)
        {
            tc = bidc - RADIOUS + j + tidc;
            if (tr < 0 || tc < 0 || tr >= n || tc >= m)
            {
                R[i][j] = 1e20;
                G[i][j] = 1e20;
                B[i][j] = 1e20;
            }
            else
            {
                R[i][j] = r[tr*ldr+tc];
                G[i][j] = g[tr*ldg+tc];
                B[i][j] = b[tr*ldb+tc];
            }
        }
    }
    __syncthreads();

    float results = 1e20;
    for (i = tidr; i <= tidr + 2*RADIOUS; i++)
        for (j = tidc; j <= tidc + 2*RADIOUS; j++)
        {
            results = results < R[i][j] ? results : R[i][j];
            results = results < G[i][j] ? results : G[i][j];
            results = results < B[i][j] ? results : B[i][j];
        }
    d[(tidr + bidr) * ldd + tidc + bidc] = results;
}
This function reads three n*m 2D matrices r, g, b as input and outputs an n*m matrix d, where each element d[i][j] is equal to the minimal value among the r, g, b matrices within the (2*RADIOUS+1)*(2*RADIOUS+1) window centered at (i,j).
In order to speed things up, I used shared memory to store a small amount of data for each block. Each block has 16*16 threads, and each single thread calculates the result for one element of matrix d. The shared memory needs to store (BLOCK_SIZE+2*RADIOUS)*(BLOCK_SIZE+2*RADIOUS) elements of r, g, b.
But the result is wrong: the values in the shared memory arrays R, G and B differ from r, g and b in global memory. It seems that the data in global memory is never transferred to shared memory successfully, and I can't understand why.
You should notice that whatever is inside the __global__ kernel is executed by every thread. When you write:
R[i][j]=r[tr*ldr+tc];
G[i][j]=g[tr*ldg+tc];
B[i][j]=b[tr*ldb+tc];
Different threads in each block are overwriting the same [i][j] element of R, G and B, which are shared among all the threads of the block, because i and j are loop counters that do not depend on the thread index.
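A corrected load loop would give each thread its own shared-memory slot by offsetting the indices with the thread indices and guarding against running past the tile. This is my sketch of that idea (a drop-in replacement for the load loop in the kernel above, not code from the original answer):
    for (i = 0; i < SM_SIZE; i += BLOCK_SIZE)
    {
        tr = bidr - RADIOUS + i + tidr;
        for (j = 0; j < SM_SIZE; j += BLOCK_SIZE)
        {
            tc = bidc - RADIOUS + j + tidc;
            // each thread now owns a distinct (i+tidr, j+tidc) slot of the tile
            if (i + tidr < SM_SIZE && j + tidc < SM_SIZE)
            {
                if (tr < 0 || tc < 0 || tr >= n || tc >= m)
                {
                    R[i + tidr][j + tidc] = 1e20;
                    G[i + tidr][j + tidc] = 1e20;
                    B[i + tidr][j + tidc] = 1e20;
                }
                else
                {
                    R[i + tidr][j + tidc] = r[tr*ldr + tc];
                    G[i + tidr][j + tidc] = g[tr*ldg + tc];
                    B[i + tidr][j + tidc] = b[tr*ldb + tc];
                }
            }
        }
    }
With this load, the minimum-search loop further down (which reads R[i][j] for i in [tidr, tidr+2*RADIOUS]) lines up with the intended window.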

Non-square matrix multiplication in CUDA

For my GPU programming class, we've been tasked with completing certain parts of a non-square matrix multiplication program. Specifically, the kernel function and initializing the thread block and kernel grid dimensions.
I've based my code on the CUDA C Programming Guide's matrix multiplication code, but instead of using structs as they do, I have modified mine to use only the parameters given (since we're not allowed to change parameters). We are provided with the 3 matrices A, B, and C, as well as their dimensions: m x k, k x n, and m x n, respectively. Where the struct used A.height, I've used dimension m; where it used B.width, I've used dimension n, etc.
I've run into several problems, the first of which is that my program doesn't pass the included test, which verifies the correctness of the product matrix C. I assume that there is something wrong in my matrix multiplication code, then, and that the issue probably arises from me adapting the struct code.
#include <stdio.h>
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B,
                        float* C) {
    /********************************************************************
     *
     * Compute C = A x B
     *   where A is a (m x k) matrix
     *   where B is a (k x n) matrix
     *   where C is a (m x n) matrix
     *
     ********************************************************************/
    // INSERT KERNEL CODE HERE
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < k; ++e){
        Cvalue += (A[row * k + e]) * (B[e * n + col]);
    }
    C[row * n + col] = Cvalue;
}
My other problem, which I'm even less sure about, involves the code to initialize the thread block and kernel grid dimensions.
// Initialize thread block and kernel grid dimensions ---------------------
const unsigned int BLOCK_SIZE = 16; // Use 16x16 thread blocks
//INSERT CODE HERE
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(n / dimBlock.x, m / dimBlock.y);
// Invoke CUDA kernel -----------------------------------------------------
//INSERT CODE HERE
mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);
I understand dimBlock, but I don't understand dimGrid, and don't have a proper idea of what to use as parameters for it. When I run the code as is, the kernel won't even launch if the matrix I pass in doesn't have a dimension that is a power of 2. And if I do use a power of 2, the test still fails.
I apologize if I've been too wordy. This is my first post and I wanted to give as many details as possible. Hopefully someone can help walk me through these issues.
The following kernel I'm posting below is a variant of the one I posted in
CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size
in that it does not use shared memory.
__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {
    float CValue = 0;
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];
    }
    if (Row < CRows && Col < CCols) C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x*blockDim.x)+threadIdx.x]=CValue;
}
The two if statements in the kernel are the if statements mentioned in the answer by Eric.
For the sake of your convenience, I'm posting the full code below:
#include <stdio.h>
#include <math.h>
#include <conio.h>
#define TILE_DIM 16 // Tile dimension
#define DIMX 373
#define DIMY 242
#define DIMZ 533
__global__ void MatMulNoShared(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {
    float CValue = 0;
    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;
    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {
        for (int n = 0; n < TILE_DIM; ++n)
            if ((k*TILE_DIM + n < ACols && Row < ARows) && (k*TILE_DIM + n < BRows && Col < BCols))
                CValue += A[Row*ACols + k*TILE_DIM + n] * B[(k*TILE_DIM + n)*BCols + Col];
    }
    if (Row < CRows && Col < CCols) C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x*blockDim.x)+threadIdx.x]=CValue;
}
int main() {
    int CCols = DIMZ, CRows = DIMX, ACols = DIMY, ARows = DIMX, BCols = DIMZ, BRows = DIMY;
    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    dim3 dimGrid;
    dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
    dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;
    float *deviceA, *deviceB, *deviceC;
    float* hostA  = (float*)malloc(DIMX*DIMY*sizeof(float));
    float* hostB  = (float*)malloc(DIMY*DIMZ*sizeof(float));
    float* hostC  = (float*)malloc(DIMX*DIMZ*sizeof(float));
    float* hostCp = (float*)malloc(DIMX*DIMZ*sizeof(float));
    for (int x = 0; x < DIMX; x++)
        for (int y = 0; y < DIMY; y++) {
            hostA[x*DIMY+y] = rand()/(float)RAND_MAX;
            hostB[x*DIMY+y] = rand()/(float)RAND_MAX;
        }
    cudaMalloc((void **)&deviceA, DIMX*DIMY*sizeof(float));
    cudaMalloc((void **)&deviceB, DIMY*DIMZ*sizeof(float));
    cudaMalloc((void **)&deviceC, DIMX*DIMZ*sizeof(float));
    cudaMemcpy(deviceA, hostA, DIMX*DIMY*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(deviceB, hostB, DIMY*DIMZ*sizeof(float), cudaMemcpyHostToDevice);
    MatMulNoShared<<<dimGrid, dimBlock>>>(deviceA, deviceB, deviceC, ARows, ACols, BRows, BCols, CRows, CCols);
    cudaMemcpy(hostC, deviceC, DIMX*DIMZ*sizeof(float), cudaMemcpyDeviceToHost);
    return 0;
}
Note that the two instructions
dimGrid.x = (CCols + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (CRows + dimBlock.y - 1)/dimBlock.y;
ensure a full tiled coverage of the matrices, as mentioned at point 1. of Eric's answer.
Your code currently only works when m and n are multiples of 16, which is your block size.
There are two things you can do to make it work for arbitrary sizes.
Make the grid size large enough to cover the whole matrix C. Instead of using the floor of n/blockdim.x as you have done, you could use the ceiling of that value:
(n+blockdim.x-1)/blockdim.x
After you have done step 1, the grid of threads you launch will be a little larger than C because of the ceiling operation. You can then limit the computation to the exact size of the result matrix C by adding an if clause in the kernel.
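Applied to the kernel and launch code from the question, those two changes might look roughly like this (my sketch, not the original poster's final code):
__global__ void mysgemm(int m, int n, int k, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Step 2: threads that fall outside C simply do nothing.
    if (row < m && col < n) {
        float Cvalue = 0;
        for (int e = 0; e < k; ++e)
            Cvalue += A[row * k + e] * B[e * n + col];
        C[row * n + col] = Cvalue;
    }
}

// Step 1: round the grid size up so the whole of C is covered.
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((n + BLOCK_SIZE - 1) / BLOCK_SIZE, (m + BLOCK_SIZE - 1) / BLOCK_SIZE);
mysgemm<<<dimGrid, dimBlock>>>(m, n, k, A, B, C);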
Please refer to CUDA docs for more details, especially the programming guide.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
