OpenCL Nested Kernels - c

I have an nxnxm ndarray that I would like to reduce down the m-axis. pyopencl has a built-in ReductionKernel, which is the following:
class pyopencl.reduction.ReductionKernel(ctx, dtype_out, neutral, reduce_expr, map_expr=None, arguments=None, name="reduce_kernel", options=[], preamble="")
If written the following way, it successfully sums a single vector down to a scalar:
krnl = ReductionKernel(context, numpy.float32, neutral="0",reduce_expr="a+b", map_expr="x[i]", arguments="__global float *x")
sum_A = krnl(d_A).get()
where sum_A is a float and d_A is the vector on device memory.
I'd like to call this kernel and pass an m-length column for each index in the nxn matrix. My strategy was to pass the entire nxnxm ndarray to a parent kernel and then use enqueue_kernel to pass an array to ReductionKernel, but I'm unsure how the syntax works in terms of receiving the sum. By the way, the matrix is already in row order, so the m-length arrays are already contiguous.
__kernel void reduce(__global float* input,
__global float* output,
const unsigned int m,
const unsigned int n)
{
const int i = get_global_id(0);
const int j = get_global_id(1);
float array[m]; // pseudocode: OpenCL C does not allow variable-length private arrays
// Initialize an array from input[i*m+j*m*n] to input[i*m+j*m*n + m]
for(int k = 0; k < m; k++)
{
array[k] = input[i*m+j*m*n+k];
}
// enqueue ReductionKernel with this array
// Place result in output[i*n+j]
}
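For comparison, a minimal sketch of doing the per-(i,j) sum directly in a single kernel (no nested enqueue), assuming the row-order layout and output indexing described above; the kernel name sum_columns is just illustrative:

// Sketch (illustrative kernel, not from the question):
// one work-item per (i, j) cell; each work-item sums its own
// contiguous m-length run starting at input[i*m + j*m*n].
__kernel void sum_columns(__global const float* input,
                          __global float* output,
                          const unsigned int m,
                          const unsigned int n)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    float acc = 0.0f;
    for (unsigned int k = 0; k < m; k++)
        acc += input[i*m + j*m*n + k];
    output[i*n + j] = acc;
}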

Related

OpenCL - Element-wise operations on 4D array

I am trying to write an OpenCL code to do element-wise operations on multi-dimensional arrays.
I know that OpenCL buffers are flattened, which makes indexing a bit tricky. I succeeded when dealing with 2-dimensional arrays, but for 3+ dimensional arrays, I have either indexing errors or the wrong result.
This is all the more surprising because I use the same indexing principle/formula as in the 2D case.
2D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int height) {
int i = get_global_id(0);
int j = get_global_id(1);
c[i + height * j] = a[i + height * j] + b[i + height * j];
}
Correct.
3D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2) {
int i = get_global_id(0);
int j = get_global_id(1);
int k = get_global_id(2);
int idx = i + dim1 * j + dim1 * dim2 * k;
c[idx] = a[idx] + b[idx];
}
Wrong result (usually an output buffer filled with values very close to 0).
4D case:
__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2, const int dim3) {
int i = get_global_id(0);
int j = get_global_id(1);
int k = get_global_id(2);
int l = get_global_id(3);
int idx = i + dim1 * j + dim1 * dim2 * k + l * dim1 * dim2 * dim3;
c[idx] = a[idx] + b[idx];
}
Here is the indexing error: enqueue_knl_test1 pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_DIMENSION
In the 4D case, you are simply using the API incorrectly. OpenCL does not support an arbitrary number of global/local dimensions, only up to 3.
In the 2D case, your indexing seems wrong. Assuming row-major arrays, it should be i + j * width, not i + j * height.
In the 3D case, the indexing inside the kernel seems OK, assuming row-major memory layout and that dim1 equals cols (width) and dim2 equals rows (height). But anyway, your question lacks context:
Input buffers allocation and initialization.
Kernel invocation code (parameters, work group and global size).
Result collection and synchronization.
You could be accessing beyond the allocated buffer size; that should be checked.
Doing these steps incorrectly can easily lead to unexpected results, even if your kernel code is OK.
If you wish to debug indexing issues, the easiest thing to do is to write a simple kernel that output the calculated index.
__kernel void test1(__global int* c, const int dim1, const int dim2) {
int i = get_global_id(0);
int j = get_global_id(1);
int k = get_global_id(2);
int idx = i + dim1 * j + dim1 * dim2 * k;
c[idx] = idx;
}
You should then expect a result with linearly increasing values. I would start with a single workgroup and then move on to using multiple workgroups.
Also, if you perform a simple element-wise operation between arrays, then it is much simpler to use 1D indexing. You could simply use a 1D workgroup and a global size equal to the number of elements (rounded up to fit the workgroup dimension):
__kernel void test1(__global int* a, __global int* b, __global int* c, const int total) {
// no need for complex indexing for elementwise operations
int idx = get_global_id(0);
if (idx < total)
{
c[idx] = a[idx] + b[idx];
}
}
You would probably set local_work_size to the max size the hardware allows (for instance 512 for Nvidia, 256 for AMD) and global_work_size to the total number of elements rounded up to a multiple of local_work_size. See clEnqueueNDRangeKernel.
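As a hedged host-side sketch of that rounding (the local size of 256 and the variable total are assumed here; in practice you would query the device or kernel limits):

// Round the total element count up to a multiple of the work-group size.
size_t local_work_size = 256;  // assumed value; query clGetKernelWorkGroupInfo / device limits in practice
size_t global_work_size = ((total + local_work_size - 1) / local_work_size) * local_work_size;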
2D & 3D dims are usually used for operations that access adjacent elements in 2D / 3D space. Such as image convolutions.

How to operate matrices of different size with one function in C?

I have code from Matlab, where all matrix operations are done with a couple of symbols. Translating it into C, I faced the problem that for every size of matrix I have to create a special function. It's a big piece of code; I will not place it all here but will try to explain how it works.
I also have a big loop where a lot of matrix operations are going on. The functions operating on matrices should take matrices as input and store their results in temporary matrices for upcoming operations. In fact I know the sizes of the matrices, but I also want to make the functions as universal as possible, in order to reduce code size and save time.
For example, matrix transposition operation of 2x4 and 4x4 matrices:
void A_matrix_transposition (float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix);
void B_matrix_transposition (float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix);
int main() {
float transposed_matrix_A[4][2]; //temporary matrices
float transposed_matrix_B[4][4];
float input_matrix_A[2][4], input_matrix_B[4][4]; //input matrices with numbers
A_matrix_transposition (transposed_matrix_A, input_matrix_A, 2, 4);
B_matrix_transposition (transposed_matrix_B, input_matrix_B, 4, 4);
// after calling the functions i want to use temporary matrices again. How do I pass them to other functions if i dont know their size, in general?
}
void A_matrix_transposition (float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix)
{ static int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{ transposed_matrix[j][i] = matrix[i][j];
}
}
}
void B_matrix_transposition (float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix)
{ static int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{ transposed_matrix[j][i] = matrix[i][j];
}
}
}
The operation is simple, but the code is already massive because of the 2 different functions, and it will be a slow disaster if I continue like this.
How do I create one function for transposing that can operate on matrices of different sizes?
I suppose it can be done with pointers, but I don't know how.
I'm looking for a really general answer to understand how to set up the "communication" between functions and temporary matrices, ideally with an example. Thank you all in advance for the information and help.
There are different ways you can achieve this in C, from not-so-good to good solutions.
If you know the maximum size the matrices can have, you can create a matrix big enough to accommodate that size and work on it. If a matrix is smaller than that, no problem: write the operations so they only consider that small sub-matrix rather than the whole one.
Another solution is to create a data structure to hold the matrix. This can be a jagged array whose shape is described by attributes stored in the structure itself; for example, the number of rows and columns is kept in the structure. A jagged array gives you the benefit that you can allocate and de-allocate memory, giving you better control over the form and order of the matrices. This is better in that you can now pass two matrices of different sizes, and the functions all see the structure that contains (wraps, I would say) the actual matrix and work on it.
By Structure I meant something like
struct matrix{
int ** mat;
int row;
int col;
};
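As a hedged illustration of that idea (the helper name matrix_create and its error handling are mine, not a fixed API), allocating such a structure might look like:

#include <stdlib.h>

struct matrix {
    int **mat;
    int row;
    int col;
};

/* Sketch: allocate a row-by-col jagged matrix wrapped in the structure above. */
struct matrix *matrix_create(int row, int col)
{
    struct matrix *m = malloc(sizeof *m);
    if (!m) return NULL;
    m->row = row;
    m->col = col;
    m->mat = malloc(row * sizeof *m->mat);
    if (!m->mat) { free(m); return NULL; }
    for (int i = 0; i < row; i++) {
        m->mat[i] = calloc(col, sizeof *m->mat[i]);
        if (!m->mat[i]) {
            /* roll back the rows allocated so far */
            while (i--) free(m->mat[i]);
            free(m->mat);
            free(m);
            return NULL;
        }
    }
    return m;
}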
If your C implementation supports variable length arrays, then you can accomplish this with:
void matrix_transposition(size_t M, size_t N,
float Destination[M][N], const float Source[N][M])
{
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m][n] = Source[n][m];
}
If your C implementation does not support variable length arrays, but does allow pointers to arrays to be cast to pointers to elements and used to access a two-dimensional array as if it were one-dimensional (this is not standard C but may be supported by a compiler), you can use:
void matrix_transposition(size_t M, size_t N,
float *Destination, const float *Source)
{
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m*N+n] = Source[n*M+m];
}
The above requires the caller to cast the arguments to float *. We can make it more convenient for the caller with:
void matrix_transposition(size_t M, size_t N,
void *DestinationPointer, const void *SourcePointer)
{
float *Destination = DestinationPointer;
const float *Source = SourcePointer;
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
Destination[m*N+n] = Source[n*M+m];
}
(Unfortunately, this prevents the compiler from checking that the argument types match the intended types, but this is a shortcoming of C.)
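For example, a hypothetical caller of this last version, reusing the 2x4 case from the question, could look like:

/* Hypothetical caller: Destination is 4x2 (M = 4, N = 2); Source is the 2x4 input. */
float input_matrix_A[2][4] = { {1, 2, 3, 4}, {5, 6, 7, 8} };
float transposed_matrix_A[4][2];

matrix_transposition(4, 2, transposed_matrix_A, input_matrix_A);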
If you need a solution strictly in standard C without variable length arrays, then, technically, the proper way is to copy the bytes of the objects:
void matrix_transposition(size_t M, size_t N,
void *DestinationPointer, const void *SourcePointer)
{
char *Destination = DestinationPointer;
const char *Source = SourcePointer;
for (size_t m = 0; m < M; ++m)
for (size_t n = 0; n < N; ++n)
{
// Calculate locations of elements in memory.
char *D = Destination + (m*N+n) * sizeof(float);
const char *S = Source + (n*M+m) * sizeof(float);
memcpy(D, S, sizeof(float));
}
}
Notes:
Include <stdlib.h> to declare size_t and, if using the last solution, include <string.h> to declare memcpy.
Variable length arrays were required in C 1999 but made optional in C 2011. Good quality compilers for general purpose systems will support them.
If you are using a C99 compiler, you can make use of variable length arrays (VLAs, optional in C11). You can write a function like this:
void matrix_transposition (int rows_in_matrix, int columnes_in_matrix, float transposed_matrix[columnes_in_matrix][rows_in_matrix], float matrix[rows_in_matrix][columnes_in_matrix])
{
int i,j;
for(i = 0; i < rows_in_matrix; ++i) {
for(j = 0; j < columnes_in_matrix; ++j)
{
transposed_matrix[j][i] = matrix[i][j];
}
}
}
This one function works for different numbers of rows_in_matrix and columnes_in_matrix. Call it like this:
matrix_transposition (2, 4, transposed_matrix_A, input_matrix_A);
matrix_transposition (4, 4, transposed_matrix_B, input_matrix_B);
You probably don't want to be hard-coding array sizes in your program. I suggest a structure that contains a single flat array, which you can then interpret in two dimensions:
typedef struct {
size_t width;
size_t height;
float *elements;
} Matrix;
Initialize it with
int matrix_init(Matrix *m, size_t w, size_t h)
{
m->elements = malloc((sizeof *m->elements) * w * h);
if (!m->elements) {
m->width = m->height = 0;
return 0; /* failed */
}
m->width = w;
m->height = h;
return 1; /* success */
}
Then, to find the element at position (x,y), we can use a simple function:
float *matrix_element(Matrix *m, size_t x, size_t y)
{
/* optional: range checking here */
return m->elements + x + m->width * y;
}
This has better locality than an array of pointers (and is easier and faster to allocate and deallocate correctly), and is more flexible than an array of arrays (where, as you've found, the inner arrays need a compile-time constant size).
You might be able to use an array of arrays wrapped in a Matrix struct - it's possible you'll need a stride that is not necessarily the same as width, if the array of arrays has padding on your platform.
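A quick hypothetical usage of those helpers (my illustration, relying on the definitions above):

/* Hypothetical usage sketch of matrix_init / matrix_element. */
Matrix m;
if (matrix_init(&m, 4, 2)) {                  /* 4 wide, 2 high */
    *matrix_element(&m, 3, 1) = 42.0f;        /* write the element at (x=3, y=1) */
    float value = *matrix_element(&m, 3, 1);  /* read it back: value == 42.0f */
    (void)value;                              /* silence unused-variable warnings in this sketch */
    free(m.elements);                         /* matrix_init used malloc, so free the flat array */
}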

Wrong result when initializing shared memory from global memory in CUDA

I have recently been writing a simple CUDA program; the kernel function is below:
#define BLOCK_SIZE 16
#define RADIOUS 7
#define SM_SIZE BLOCK_SIZE+2*RADIOUS
__global__ static void DarkChannelPriorCUDA(const float* r, size_t ldr, const float* g, size_t ldg, const float* b, size_t ldb, float * d, size_t ldd, int n, int m)
{
__shared__ float R[SM_SIZE][SM_SIZE];
__shared__ float G[SM_SIZE][SM_SIZE];
__shared__ float B[SM_SIZE][SM_SIZE];
const int tidr = threadIdx.x;
const int tidc = threadIdx.y;
const int bidr = blockIdx.x * BLOCK_SIZE;
const int bidc = blockIdx.y * BLOCK_SIZE;
int i, j ,tr, tc;
for( i = 0; i < SM_SIZE; i += BLOCK_SIZE)
{
tr = bidr-RADIOUS+i+tidr;
for( j = 0; j < SM_SIZE; j += BLOCK_SIZE)
{
tc = bidc-RADIOUS+j+tidc;
if(tr <0 || tc<0 || tr>=n || tc>=m)
{
R[i][j]=1e20;
G[i][j]=1e20;
B[i][j]=1e20;
}
else
{
R[i][j]=r[tr*ldr+tc];
G[i][j]=g[tr*ldg+tc];
B[i][j]=b[tr*ldb+tc];
}
}
}
__syncthreads();
float results = 1e20;
for(i = tidr; i <= tidr + 2*RADIOUS; i++)
for(j = tidc; j <= tidc + 2*RADIOUS; j++)
{
results = results < R[i][j] ? results : R[i][j];
results = results < G[i][j] ? results : G[i][j];
results = results < B[i][j] ? results : B[i][j];
}
d[(tidr + bidr) * ldd + tidc + bidc] = results;
}
This function reads three n*m 2D matrices r, g, b as input and outputs an n*m matrix d, where each element d[i][j] is the minimal value among the three matrices r, g, b within the (2*RADIOUS+1)*(2*RADIOUS+1) window centered on (i,j).
In order to speed this up, I used shared memory to store a small amount of data for each block. Each block has 16*16 threads, and each thread calculates the result for one element of matrix d. The shared memory needs to store (BLOCK_SIZE+2*RADIOUS)*(BLOCK_SIZE+2*RADIOUS) elements of r, g, b.
But the result is wrong: the values in the shared memory arrays R, G and B are different from r, g and b in global memory. It seems that the data in global memory is never transferred to shared memory successfully, and I can't understand why.
You should notice that what is inside the __global__ function is performed per thread. When you write:
R[i][j]=r[tr*ldr+tc];
G[i][j]=g[tr*ldg+tc];
B[i][j]=b[tr*ldb+tc];
Different threads in each block are overwriting the same [i][j] components of R, G and B, which are shared among all the threads of the block. The shared-memory indices need to incorporate the thread indices (tidr, tidc) as well, not just the loop counters i and j.
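A hedged sketch of how the load loop could index shared memory consistently (an illustration of the idea, not necessarily the poster's exact fix):

// Sketch: add the per-thread offsets tidr/tidc to the shared-memory indices,
// and guard against stepping past the SM_SIZE x SM_SIZE tile.
for (i = 0; i < SM_SIZE; i += BLOCK_SIZE)
{
    tr = bidr - RADIOUS + i + tidr;
    for (j = 0; j < SM_SIZE; j += BLOCK_SIZE)
    {
        tc = bidc - RADIOUS + j + tidc;
        if (i + tidr < SM_SIZE && j + tidc < SM_SIZE)
        {
            if (tr < 0 || tc < 0 || tr >= n || tc >= m)
            {
                R[i + tidr][j + tidc] = 1e20;
                G[i + tidr][j + tidc] = 1e20;
                B[i + tidr][j + tidc] = 1e20;
            }
            else
            {
                R[i + tidr][j + tidc] = r[tr*ldr + tc];
                G[i + tidr][j + tidc] = g[tr*ldg + tc];
                B[i + tidr][j + tidc] = b[tr*ldb + tc];
            }
        }
    }
}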

How do I use fftw_plan_many_dft on a transposed array of data?

I have a 2D array of data stored in column-major (Fortran-style) format, and I'd like to take the FFT of each row. I would like to avoid transposing the array (it is not square). For example, my array
fftw_complex* data = new fftw_complex[21*256];
contains entries [r0_val0, r1_val0,..., r20_val0, r0_val1,...,r20_val255].
I can use fftw_plan_many_dft to make a plan to solve each of the 21 FFTs in-place in the data array if it is row-major, e.g. [r0_val0, r0_val1,..., r0_val255, r1_val0,...,r20_val255]:
int main() {
int N = 256;
int howmany = 21;
fftw_complex* data = new fftw_complex[N*howmany];
fftw_plan p;
// this plan is OK
p = fftw_plan_many_dft(1,&N,howmany,data,NULL,1,N,data,NULL,1,N,FFTW_FORWARD,FFTW_MEASURE);
// do stuff...
return 0;
}
According to the documentation (section 4.4.1 of the FFTW manual), the signature for the function is
fftw_plan fftw_plan_many_dft(int rank, const int *n, int howmany,
fftw_complex *in, const int *inembed,
int istride, int idist,
fftw_complex *out, const int *onembed,
int ostride, int odist,
int sign, unsigned flags);
and I should be able to use the stride and dist parameters to set the indexing. From what I can understand from the documentation, the entries in the array to be transformed are indexed as in + j*istride + k*idist where j=0..n-1 and k=0..howmany-1. (My arrays are 1D and there are howmany of them). However, the following code results in a seg. fault (edit: the stride length is wrong, see update below):
int main() {
int N = 256;
int howmany = 21;
fftw_complex* data = new fftw_complex[N*howmany];
fftw_plan p;
// this call results in a seg. fault.
p = fftw_plan_many_dft(1,&N,howmany,data,NULL,N,1,data,NULL,N,1,FFTW_FORWARD,FFTW_MEASURE);
return 0;
}
Update:
I made an error choosing the stride length. The correct call is (the correct stride length is howmany, not N):
int main() {
int N = 256;
int howmany = 21;
fftw_complex* data = new fftw_complex[N*howmany];
fftw_plan p;
// OK
p = fftw_plan_many_dft(1,&N,howmany,data,NULL,howmany,1,data,NULL,howmany,1,FFTW_FORWARD,FFTW_MEASURE);
// do stuff
return 0;
}
The function works as documented. I made an error with the stride length, which should actually be howmany in this case. I have updated the question to reflect this.
I find the documentation for FFTW is somewhat difficult to comprehend without examples (I could just be illiterate...), so I'm posting a more detailed example below, comparing the usual use of fftw_plan_dft_1d with fftw_plan_many_dft. To recap, in the case of howmany arrays with length N that are stored in a contiguous block of memory referenced as in, the array elements j for each transform k are
*(in + j*istride + k*idist)
The following two pieces of code are equivalent. In the first, the conversion from some 2D array is done explicitly, and in the second the fftw_plan_many_dft call is used to do everything in-place.
Explicit Copy
int N, howmany;
// ...
fftw_complex* data = (fftw_complex*) fftw_malloc(N*howmany*sizeof(fftw_complex));
// ... load data with howmany arrays of length N
int istride, idist;
// ... if data is column-major, set istride=howmany, idist=1
// if data is row-major, set istride=1, idist=N
fftw_complex* in = (fftw_complex*) fftw_malloc(N*sizeof(fftw_complex));
fftw_complex* out = (fftw_complex*) fftw_malloc(N*sizeof(fftw_complex));
fftw_plan p = fftw_plan_dft_1d(N,in,out,FFTW_FORWARD,FFTW_MEASURE);
for (int k = 0; k < howmany; ++k) {
for (int j = 0; j < N; ++j) {
in[j] = data[j*istride + k*idist];
}
fftw_execute(p);
// do something with out
}
Plan Many
int N, howmany;
// ...
fftw_complex* data = (fftw_complex*) fftw_malloc(N*howmany*sizeof(fftw_complex));
// ... load data with howmany arrays of length N
int istride, idist;
// ... if data is column-major, set istride=howmany, idist=1
// if data is row-major, set istride=1, idist=N
fftw_plan p = fftw_plan_many_dft(1,&N,howmany,data,NULL,howmany,1,data,NULL,howmany,1,FFTW_FORWARD,FFTW_MEASURE);
fftw_execute(p);

Float size, matrix multiplication, OpenCL, sockets. Weird

I'm generating two matrices using the following function (note some code is omitted):
srand(2007);
randomInit(h_A_data, size_A);
void randomInit(float* data, int size)
{
int i;
for (i = 0; i < size; ++i){
data[i] = rand() / (float)RAND_MAX;
}
}
This is called for matrices A and B. It populates the matrices with 0.something values, e.g. 0.748667. I then perform a matrix multiplication on the CPU and compare the result to a GPU implementation via OpenCL. The resulting matrix has values in the range of 20.something, e.g. 23.472757. Both the CPU and the GPU give the same result. The CPU implementation is taken from the CUDA toolkit distribution by NVIDIA:
void computeGold(float* C, const float* A, const float* B, unsigned int hA, unsigned int wA, unsigned int wB)
{
unsigned int i;
unsigned int j;
unsigned int k;
for (i = 0; i < hA; ++i)
for (j = 0; j < wB; ++j) {
double sum = 0;
for (k = 0; k < wA; ++k) {
double a = A[i * wA + k];
double b = B[k * wB + j];
sum += a * b;
}
C[i * wB + j] = (float)sum;
}
}
The weird thing is, all three matrices in memory are of the same size, i.e. sizeof(float)*size_A, or *size_B for matrix B, etc. When I dump them to disk, the file for the result stored in matrix C (the multiplied matrix) is bigger than the files for matrices A and B.
Even more critically, for my application I'm transferring these over a network via a socket. In terms of the raw number of bytes, all matrices are the same, and yet it takes longer to transfer matrix C over the network. The problem gets worse for large matrix sizes. Why is this?
UPDATE/EDIT:
fprintf(matrix_c_file,"\n\nMatrix C\n");
for(i = 0; i < size_C; i++)
{
fprintf(matrix_c_file,"%f ", h_C_data[i]);
}
fprintf(matrix_c_file,"\n");
When matrix A and B contain only zero's, all three (matrix A, B and C) are the same size on disk.
I think that lijie has the correct (albeit terse) answer in the comments. The %f format specifier can result in a string with variable width. Consider the following C code:
printf("%f\n", 0.0);
printf("%f\n", 3.1415926535897932384626433);
printf("%f\n", 20.53);
printf("%f\n", 20.5e38);
which produces:
0.000000
3.141593
20.530000
2050000000000000019963732141023730597888.000000
All of the output has the same number of digits after the decimal point (6 by default), but a variable number to the left of the decimal point. If you need the textual representation of your matrix to be a consistent size and you don't mind sacrificing some precision, you can use the %e format specifier instead to force an exponential representation like 2.345e12.
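For instance (my illustration, continuing the snippet above), printing the same values with %e keeps every field the same width:

printf("%e\n", 0.0);
printf("%e\n", 3.1415926535897932384626433);
printf("%e\n", 20.53);
printf("%e\n", 20.5e38);

which produces:

0.000000e+00
3.141593e+00
2.053000e+01
2.050000e+39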
