CUDA kernel launch parameters explained right? - c

Here I tried to explain the CUDA launch parameter model (or execution configuration model) to myself using some pseudocode, but I don't know whether I made any big mistakes. I hope someone can review it and give me some advice. Thanks in advance.
Here it is:
/*
Normally, we write a kernel function like this.
Note: __global__ means this function is called from host code
and executed on the device, and a __global__ function can only return void.
If any parameter is passed to a __global__ function, it should be stored
in shared memory on the device. So a kernel function is quite different from a *normal*
C/C++ function. If I were the CUDA author, I would make the kernel function look more
different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
/*
After this definition, we can call this kernel function from our normal C/C++ code!
Does something feel weird? Inconsistent?
Normally, when I write C code, I think a lot about the execution process down to
the metal, and this one... it feels like fragile code. It breaks the sequential
thinking process in my mind.
To make things feel normal, I found a way to explain it: I expand the *__global__* function
into some pseudocode:
*/
#define __foreach(var, start, end) for (var = start; var < end; ++var)
__device__ int
__indexing() {
const int blockId = blockIdx.x + blockIdx.y * gridDim.x + blockIdx.z * gridDim.x * gridDim.y;
return
blockId * (blockDim.x * blockDim.y * blockDim.z) +
threadIdx.z * (blockDim.x * blockDim.y) +
threadIdx.y * blockDim.x +
threadIdx.x;
}
global_config =:
{
/*
global configuration.
note the default values are all 1, so in the kernel codes,
we could just ignore those dimensions.
*/
gridDim.x = gridDim.y = gridDim.z = 1;
blockDim.x = blockDim.y = blockDim.z = 1;
};
kernel =:
{
/*
I think CUDA hides some nasty details here.
It's said that CUDA C is an extension of C, but in my mind
CUDA C is more like C++, and the *<<<>>>* part is too tricky.
For example:
kernel<<<10, 32>>>(); means the kernel will execute in 10 blocks, each with 32 threads.
dim3 dimG(10, 1, 1);
dim3 dimB(32, 1, 1);
kernel<<<dimG, dimB>>>(); this is exactly the same as the above.
It's not C style, but is it C++ style? At first, I thought this could be done with
C++ constructors, but I checked the *dim3* structure and there's no proper
constructor for this. This just broke the semantics of both C and C++. I think
forcing the user to write *kernel<<<dim3, dim3>>>* would be better, so I'd like to keep
this rule in my future code.
*/
gridDim = dimG;
blockDim = dimB;
__foreach(blockIdx.z, 0, gridDim.z)
__foreach(blockIdx.y, 0, gridDim.y)
__foreach(blockIdx.x, 0, gridDim.x)
__foreach(threadIdx.z, 0, blockDim.z)
__foreach(threadIdx.y, 0, blockDim.y)
__foreach(threadIdx.x, 0, blockDim.x)
{
const int idx = __indexing();
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
};
/*
So, for me, gridDim and blockDim are like boundaries.
E.g. gridDim.x is the (exclusive) upper bound of blockIdx.x; this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
unsigned int x, y, z;
#if defined(__cplusplus)
__host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
__host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;

CUDA DRIVER API
The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:
cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel
The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.
cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid
Additional information on these functions can be found in cuda.h.
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void **kernelParams,
void **extra);
cuLaunchKernel takes as parameters the entire launch configuration.
See the Execution Control section of the NVIDIA Driver API documentation for more details.
CUDA KERNEL LAUNCH
cuLaunchKernel will
1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
5. push the launch parameters into the command buffer
6. push the launch command into the command buffer
7. submit the command buffer to the device (on WDDM drivers this step may be deferred)
8. on WDDM the kernel driver will page all memory required in device memory
The GPU will
1. verify the command
2. send the commands to the compute work distributor
3. dispatch launch configuration and thread blocks to the SMs
When all thread blocks have completed the work distributor will flush the caches to honor the CUDA memory model and it will mark the kernel as completed so the next item in the stream can make forward progress.
The order that thread blocks are dispatched differs between architectures.
Compute capability 1.x devices store the kernel parameters in shared memory.
Compute capability 2.0-3.5 devices store the kernel parameters in constant memory.
CUDA RUNTIME API
The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:
cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument
See the Execution Control section of the NVIDIA Runtime API documentation.
The <<<>>> CUDA language extension is the most common method used to launch a kernel.
During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.
For example
__global__ void kernel(float* buf, int j)
{
// ...
}
kernel<<<blocks,threads,0,myStream>>>(d_buf,j);
generates
void __device_stub__Z6kernelPfi(float *__par0, int __par1)
{
    __cudaSetupArgSimple(__par0, 0U);
    __cudaSetupArgSimple(__par1, 4U);
    __cudaLaunch(((char *)((void ( *)(float *, int))kernel)));
}
You can inspect the generated files by adding --keep to your nvcc command line.
cudaLaunch calls cuLaunchKernel.
CUDA DYNAMIC PARALLELISM
CUDA Dynamic Parallelism (CDP) works similarly to the CUDA Runtime API described above.

By using <<<...>>>, you launch a number of threads on the GPU. These threads are grouped into blocks, and the blocks form a grid. All the threads execute the invoked kernel function.
In the kernel function, built-in variables like threadIdx and blockIdx let the code know which thread it is running as, so it can do its assigned part of the work.
Edit
Basically, <<<...>>> simplifies the configuration procedure for launching a kernel. Without it, one may have to call 4~5 APIs for a single kernel launch, as in OpenCL, which uses only C99 syntax.
In fact, you could check the CUDA driver API. It provides those functions, so you don't need to use <<<>>>.

Basically, the GPU is divided into separate "device" GPUs (e.g. GeForce 690 has 2) -> multiple SMs (streaming multiprocessors) -> multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical assignment that is independent of the hardware, but the total size of a block (x*y*z) is very important.
Threads in a block HAVE TO be on the same SM, to use its shared memory and synchronization facilities. The hardware executes a block's threads in warps of 32, so a block can hold more threads than the SM has CUDA cores (up to the per-block limit, 1024 on compute capability 2.0 and above); the warps simply take turns on the cores.
If we have a simple scenario with 16 SMs of 32 CUDA cores each, a 31x1x1 block size, and a 20x1x1 grid size, we will forfeit at least 1/32 of the processing power of the card: every time a block's single warp runs, only 31 of its 32 lanes are busy. In the simplest picture, with one block resident per SM, 16 blocks finish at roughly the same time, and as the first 4 SMs free up, they start processing the last 4 blocks (NOT necessarily blocks #17-20). In practice an SM can usually hold several blocks at once, so all 20 blocks may be resident from the start.
Comments and corrections are welcome.
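The cost of the 31-thread block in the example above can be put into numbers. This is a minimal host-only sketch, assuming the warp size of 32 that NVIDIA GPUs have used to date; kWarpSize, warps_per_block, and idle_lanes are illustrative names, not CUDA API.

```cpp
// Host-only arithmetic, not CUDA code.
// Assumption: threads are issued in warps of 32 lanes.
constexpr unsigned kWarpSize = 32;

// Number of warps needed to hold a block of the given size (rounded up).
constexpr unsigned warps_per_block(unsigned threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes that are allocated but have no thread to run.
constexpr unsigned idle_lanes(unsigned threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// A 31-thread block occupies one warp and wastes one lane of it.
static_assert(warps_per_block(31) == 1 && idle_lanes(31) == 1, "31 threads");
// One thread too many is worse: 33 threads need two warps, 31 lanes idle.
static_assert(warps_per_block(33) == 2 && idle_lanes(33) == 31, "33 threads");
```

This is why block sizes are usually chosen as multiples of 32: idle_lanes is then zero.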

Related

It's like OpenCL kernel instance ends abruptly

I'm new to OpenCL and I'm working on converting an existing algorithm to OpenCL.
In this process, I am experiencing a phenomenon that I cannot solve on my own, and I would like to ask some help.
Here's details.
My kernel is applied to images of different size (to be precise, each layer of the Laplacian pyramid).
I get normal results for images of larger size such as 3072 x 3072, 1536 x 1536.
But I get abnormal results for smaller images such as 12 x 12, 6 x 6, 3 x 3, 2 x 2.
At first, I suspected that clEnqueueNDRangeKernel had a bottom limit for dimensions, causing this problem. So, I added printf to the beginning of the kernel as follows. It is confirmed that all necessary kernel instances are executed.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
}
So after wandering for a while, I added the same printf to the end of the kernel. When I did this, it was confirmed that printf works only for some pixel positions. For pixel positions not output by printf, the calculated values in the resulting image are incorrect, and as a result, I concluded that some kernel instances terminate abnormally before completing the calculations.
__kernel void GetValueOfB(/* parameters */)
{
uint xB = get_global_id(0);
uint yB = get_global_id(1);
printf("(%d, %d)\n", xB, yB);
// calculation code is omitted
printf("(%d, %d, %f)\n", xB, yB, result_for_this_position);
}
There seems to be no problem with the kernel's calculation. If I compile the kernel with optimization turned off using the -cl-opt-disable option, I get perfectly correct results for all images regardless of their size. In addition, on an NVIDIA P4000 it works correctly. Of course, in these cases, I confirmed that the printf added at the bottom of the kernel works for all pixels.
Below I put additional information and attach a part of the code I wrote.
Any advice is welcomed and appreciated.
Thank you.
SDK: Intel® SDK For OpenCL™ Applications 2020.3.494
Platform: Intel(R) OpenCL HD Graphics
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2,
NULL, globalSize, NULL, 0, NULL, NULL);
if (CL_SUCCESS != err)
return -1;
// I tried with this but it didn't make any difference
//std::this_thread::sleep_for(std::chrono::seconds(1));
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
And I tried with event, too, but it works the same way.
for all images
{
...
const size_t globalSize[2] = { size_t(vtMatB_GPU_LLP[nLayerIndex].cols), size_t(vtMatB_GPU_LLP[nLayerIndex].rows) };
cl_event event;
err = clEnqueueNDRangeKernel(_pOpenCLManager->GetCommandQueue(), kernel, 2, NULL, globalSize, NULL, 0, NULL, &event);
if (CL_SUCCESS != err)
return -1;
err = clWaitForEvents(1, &event);
if (CL_SUCCESS != err)
return -1;
err = clFinish(_pOpenCLManager->GetCommandQueue());
if (CL_SUCCESS != err)
return -1;
err = clEnqueueReadBuffer(_pOpenCLManager->GetCommandQueue(), memMatB, CL_TRUE,
0, sizeof(float) * vtMatB_GPU_LLP[nLayerIndex].cols *
vtMatB_GPU_LLP[nLayerIndex].rows, vtMatB_GPU_LLP[nLayerIndex].data, 0, nullptr, nullptr);
if (CL_SUCCESS != err)
return -1;
...
}
/////// Added contents ////////////////////////////////////////////
Would you please take a look at this issue with regard to clFinish or clWaitForEvents? Am I missing something there?
Sometimes I get fewer correct values and sometimes more.
To be more specific, let's say I'm applying the kernel to a 12 x 12 image, so there are 144 pixel values.
Sometimes I get correct values for 56 pixels.
Sometimes I get correct values for 89 pixels.
Other times I get correct values for n (fewer than 144) pixels.
If I turn off OpenCL optimization when compiling the kernel by specifying the -cl-opt-disable option, I get correct values for all 144 pixels.
The other thing that makes me think the calculation code is correct is that the same OpenCL code, with no modification other than the device selection code, runs perfectly correctly on an NVIDIA P4000.
At first, I was really suspicious of the calculation code, but the more I inspect it, the more confident I am that there's nothing wrong with it.
I know there's still a chance that the calculation code contains an error that raises an exception somewhere during the calculation.
I have plain C++ code for the same task, and I'm comparing the results of the two.
/////// Another added contents ////////////////////////////////////////////
I made a minimal example (apart from the project template) to reproduce the phenomenon.
What's even odder is that if I install "Intel® Distribution for GDB Target", I get correct results.
https://github.com/heysweetethan/GPUOpenCLProjectforWindows
OpenCL kernels run threads in parallel on a specified global range, which in your case is the image size, with one thread per pixel.
The threads are grouped into workgroups. The workgroup size should be a multiple of 32; ideally 64 to make full use of the hardware, or 8x8 pixels in 2D. These workgroups cannot be split, so the global range must be a multiple of the workgroup size.
What happens if the global range is not evenly divisible by the workgroup size, or is smaller than the workgroup size, like 3x3 pixels? Then the last workgroup is still executed with all 8x8 threads. The first 3x3 work on valid data in memory, but all the other threads read/write unallocated memory. This can cause undefined behavior or even crashes.
If you cannot have global size as a multiple of workgroup size, there is still a solution: a guard clause in the very beginning of the kernel:
if(xB>=xImage||yB>=yImage) return;
This ensures that no threads access unallocated memory.
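On the host side, the usual companion to that guard is to round the global size up to the next multiple of the workgroup size before enqueueing. The sketch below shows only the arithmetic, in host-only C++; round_up and in_image are hypothetical helper names, and the 8x8 workgroup is the example size from above, not a requirement.

```cpp
#include <cstddef>

// Smallest multiple of `local` that is >= `global` (requires local > 0).
constexpr std::size_t round_up(std::size_t global, std::size_t local) {
    return ((global + local - 1) / local) * local;
}

// Host-side mirror of the kernel guard: true for work-items that map to a pixel.
constexpr bool in_image(std::size_t x, std::size_t y,
                        std::size_t width, std::size_t height) {
    return x < width && y < height;
}

// A 3x3 image with 8x8 workgroups is padded to a global size of 8x8 ...
static_assert(round_up(3, 8) == 8, "3 pixels -> one workgroup of 8");
// ... and the 55 padding work-items are rejected by the guard.
static_assert(in_image(2, 2, 3, 3) && !in_image(3, 0, 3, 3), "guard");
```

The rounded values would go into the global size array and the 8x8 values into the local size array passed to clEnqueueNDRangeKernel, with the real image width and height passed to the kernel as arguments for the guard.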
As you don't supply a complete reproducible code sample, here's a loose collection of comments/suggestions/advice:
1. printf in kernel code
Don't rely on large amounts of printf output from kernels. It's necessarily buffered, and some implementations don't guarantee delivery of messages - often there's a fixed size buffer and when that's full, messages are dropped.
Note that your post-calculation printf increases the total amount of output, for example.
The reliable way to check or print kernel output is to write it to a global buffer and print it in host code. For example, if you want to verify each work-item reaches a specific point in the code, consider creating a zero-initialised global buffer where you can set a flag in each work-item.
2. Events
You asked about events, flushing, etc. Your clFinish call should certainly suffice to ensure everything has executed. If anything, it's overkill, but especially while you're debugging other issues it's a good way to rule out queuing issues.
The clWaitForEvents() call preceding it is not a great idea, as you haven't called clFlush() after queueing the kernel whose event you're waiting for. It's fairly minor, but could be a problem on some implementations.
3. Small image sizes
You've not actually posted any of the code that deals with the images themselves, so I can only guess at potential issues there. It looks like you're not using workgroups, so you shouldn't be running into the usual multiple-of-group-size pitfall.
However, are you sure you're loading the source data correctly, and you're correctly indexing into it? There could be all sorts of pitfalls here, from alignment of pixel rows in the source data, enqueueing the kernel before filling the source buffers has completed, creating source buffers with the wrong flags, etc.
So in summary, I'd suggest:
Don't believe in-kernel-printf if something strange is going on. Switch to something more reliable for observing the behaviour of your kernel code.
At minimum, post all your OpenCL API calling host code. Buffer creation, setting arguments, etc. Any fragments of kernel code accessing the buffers are probably not a bad idea either.
Thanks to a person from the Intel community, I was able to understand the phenomenon.
Briefly, if a single kernel execution takes too much time, 'Timeout Detection and Recovery (TDR)' stops it.
For more information, you can refer to the following.
https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/
https://community.intel.com/t5/GPU-Compute-Software/It-s-like-OpenCL-kernel-instance-ends-abruptly/m-p/1386883#M478
I appreciate all the people who gave me advice.

OpenCL - Local Memory

I understand the difference between global and local memory in general.
But I have problems using local memory.
1) What has to be considered when transforming global-memory variables into local-memory variables?
2) How do I use local barriers?
Maybe someone can help me with a little example.
I tried to do a Jacobi computation using local memory, but I only get 0 as the result. Maybe someone can give me some advice.
Working Solution:
#define IDX(_M,_i,_j) (_M)[(_i) * N + (_j)]
#define U(_i, _j) IDX(uL, _i, _j)
__kernel void jacobi(__global VALUE* u, __global VALUE* f, __global VALUE* tmp, VALUE factor) {
int i = get_global_id(0);
int j = get_global_id(1);
int iL = get_local_id(0);
int jL = get_local_id(1);
__local VALUE uL[(N+2)*(N+2)];
__local VALUE fL[(N+2)*(N+2)];
IDX(uL, iL, jL) = IDX(u, i, j);
IDX(fL, iL, jL) = IDX(f, i, j);
barrier(CLK_LOCAL_MEM_FENCE);
IDX(tmp, i, j) = (VALUE)0.25 * ( U(iL-1, jL) + U(iL, jL-1) + U(iL, jL+1) + U(iL+1, jL) - factor * IDX(fL, iL, jL));
}
Thanks.
1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is 16 kB minimum and larger on some hardware. If your local variables fit in this and they are reused many times, you should put them in local memory before use. Even if you don't, the GPU's automatic use of the L2 cache when accessing global memory can still keep the cores well utilized.
If the global-to-local copy takes a significant share of the time, you can do an asynchronous work-group copy while the cores are calculating.
Another important point is that more free local memory means more concurrent threads per core. If a GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used. When there is more space, 128, 192, ... 2560 threads can run at the same time if there are no other limitations.
A profiler can show the bottlenecks, so you can decide whether it is worth a try.
For example, a naive matrix-matrix multiplication using nested loops relies on the L1 and L2 caches, but submatrices can fit in local memory. Maybe 48x48 submatrices of floats can fit in a mid-range graphics card's compute unit and can be used N times in the whole calculation before being replaced by the next submatrix.
Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; if it returns GLOBAL, using local memory is not recommended.
Lastly, the size of any memory allocation (except __private) must be known at compile time (for the device, not the host), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and/or maybe for other compiler optimizations). That is why recursive functions are not allowed by OpenCL 1.2. But you can copy a function and rename it n times to get pseudo-recursion.
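To make the sizing argument concrete, here is a rough host-only sketch of the arithmetic. It assumes 4-byte floats and that a compute unit's local memory is simply divided among the workgroups resident on it; real devices may add allocation granularity and other limits. tile_bytes and groups_that_fit are hypothetical names, and the 16 kB and 32 kB figures are example values for what CL_DEVICE_LOCAL_MEM_SIZE might report.

```cpp
#include <cstddef>

constexpr std::size_t kBytesPerFloat = 4;  // OpenCL float is 32 bits wide

// Local memory needed for one n x n tile of floats.
constexpr std::size_t tile_bytes(std::size_t n) {
    return n * n * kBytesPerFloat;
}

// How many workgroups can be resident on one compute unit if local memory
// is the only limit (requires bytes_per_group > 0).
constexpr std::size_t groups_that_fit(std::size_t local_mem_size,
                                      std::size_t bytes_per_group) {
    return local_mem_size / bytes_per_group;
}

// A 48x48 float tile takes 9216 bytes, so it fits in 16 kB with room to spare.
static_assert(tile_bytes(48) == 9216, "48x48 floats");
// With 16 kB only one such workgroup is resident; with 32 kB, three are.
static_assert(groups_that_fit(16 * 1024, tile_bytes(48)) == 1, "16 kB");
static_assert(groups_that_fit(32 * 1024, tile_bytes(48)) == 3, "32 kB");
```

The fewer workgroups fit, the fewer threads are available to hide memory latency, which is the trade-off described above.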
2) Barriers are a meeting point for all threads in a workgroup. Similar to cyclic barriers, they all stop there and wait for every other thread before continuing. If it is a local barrier, all workgroup threads finish any local memory operations before departing from that point. If you want to write some numbers 1, 2, 3, 4, ... to a local array, you can't be sure whether all threads are still writing these numbers or have already written them until a local barrier is passed; after that, it is certain that the array holds the final values.
All workgroup threads must hit the same barrier. If one cannot reach it, the kernel hangs or you get an error.
__local int localArray[64]; // not each thread. For all threads.
// per compute unit.
if(localThreadId!=0)
localArray[localThreadId]=localThreadId; // 64 values written in O(1)
// not sure if 2nd thread done writing, just like last thread
if(localThreadId==0) // 1st core of each compute unit loads from VRAM
localArray[localThreadId]=globalArray[globalThreadId];
barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait 1st thread
// (maybe even 1st SIMD or
// could be even whole 1st wavefront!)
// here all threads written their own id to local array. safe to read.
// except first element which is a variable from global memory
// lets add that value to all other values
if(localThreadId!=0)
localArray[localThreadId]+=localArray[0];
Working example(local work group size=64):
inputs: 0,1,2,3,4,0,0,0,0,0,0,..
__kernel void vecAdd(__global float* x )
{
int id = get_global_id(0);
int idL = get_local_id(0);
__local float loc[64];
loc[idL]=x[id];
barrier (CLK_LOCAL_MEM_FENCE);
float distance_square_sum=0;
for(int i=0;i<64;i++)
{
float diff=loc[idL]-loc[i];
float diff_squared=diff*diff;
distance_square_sum+=diff_squared;
}
x[id]=distance_square_sum;
}
output: 30, 74, 246, 546, 974, 30, 30, 30...
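The output listed above can be reproduced on the CPU. The following is a host-only C++ emulation of a single 64-item workgroup, not OpenCL: the barrier is modelled by filling the whole local array before any work-item reads it, and distance_square_sum and emulate_work_item are hypothetical names.

```cpp
constexpr int kGroupSize = 64;  // local work group size used in the example

// What work-item idL computes after the barrier: the sum of squared
// differences between its own element and every element of loc[].
constexpr float distance_square_sum(const float (&loc)[kGroupSize], int idL) {
    float sum = 0.0f;
    for (int i = 0; i < kGroupSize; ++i) {
        const float diff = loc[idL] - loc[i];
        sum += diff * diff;
    }
    return sum;
}

// Emulates the first workgroup for the inputs 0,1,2,3,4,0,0,...
constexpr float emulate_work_item(int idL) {
    // "Before the barrier": every element of the local array is loaded.
    float loc[kGroupSize] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};  // rest are zero
    // "After the barrier": it is now safe to read any element.
    return distance_square_sum(loc, idL);
}

static_assert(emulate_work_item(0) == 30.0f, "0: 1+4+9+16");
static_assert(emulate_work_item(1) == 74.0f, "1: 15 + 59*1");
static_assert(emulate_work_item(4) == 974.0f, "4: 30 + 59*16");
```

Without the barrier, a work-item could read loc[i] before the work-item responsible for index i has written it, and the sums would depend on timing.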

CUDA How to access constant memory in device kernel when the constant memory is declared in the host code?

For the record, this is homework, so help as little or as much as you like with that in mind. We are using constant memory to store a "mask matrix" that will be used to perform a convolution on a larger matrix. In the host code I copy the mask to constant memory using cudaMemcpyToSymbol().
My question is: once this is copied over and I launch my device kernel code, how does the device know where to access the constant memory mask matrix? Is there a pointer that I need to pass in on kernel launch? Most of the code that the professor gave us is not supposed to be changed (there is no pointer to the mask passed in), but there is always the possibility that he made a mistake (although it is most likely my understanding of something).
Is the constant memory declaration supposed to be included in the separate kernel.cu file?
I am minimizing the code to show just the things having to do with constant memory. As such, please don't point out if something is not initialized, etc. There is code for that, but that is not of concern at this time.
main.cu:
#include <stdio.h>
#include "kernel.cu"
__constant__ float M_d[FILTER_SIZE * FILTER_SIZE];
int main(int argc, char* argv[])
{
Matrix M_h, N_h, P_h; // M: filter, N: input image, P: output image
/* Allocate host memory */
M_h = allocateMatrix(FILTER_SIZE, FILTER_SIZE);
N_h = allocateMatrix(imageHeight, imageWidth);
P_h = allocateMatrix(imageHeight, imageWidth);
/* Initialize filter and images */
initMatrix(M_h);
initMatrix(N_h);
cudaError_t cudda_ret = cudaMemcpyToSymbol(M_d, M_h.elements, M_h.height * M_h.width * sizeof(float), 0, cudaMemcpyHostToDevice);
//char* cudda_ret_pointer = cudaGetErrorString(cudda_ret);
if( cudda_ret != cudaSuccess){
printf("\n\ncudaMemcpyToSymbol failed\n\n");
printf("%s, \n\n", cudaGetErrorString(cudda_ret));
}
// Launch kernel ----------------------------------------------------------
printf("Launching kernel..."); fflush(stdout);
//INSERT CODE HERE
//block size is 16x16
// \\\\\\\\\\\\\**DONE**
dim_grid = dim3(ceil(N_h.width / (float) BLOCK_SIZE), ceil(N_h.height / (float) BLOCK_SIZE));
dim_block = dim3(BLOCK_SIZE, BLOCK_SIZE);
//KERNEL Launch
convolution<<<dim_grid, dim_block>>>(N_d, P_d);
return 0;
}
kernel.cu: THIS IS WHERE I DO NOT KNOW HOW TO ACCESS THE CONSTANT MEMORY.
//__constant__ float M_c[FILTER_SIZE][FILTER_SIZE];
__global__ void convolution(Matrix N, Matrix P)
{
/********************************************************************
Determine input and output indexes of each thread
Load a tile of the input image to shared memory
Apply the filter on the input image tile
Write the compute values to the output image at the correct indexes
********************************************************************/
//INSERT KERNEL CODE HERE
//__shared__ float N_shared[BLOCK_SIZE][BLOCK_SIZE];
//int row = (blockIdx.y * blockDim.y) + threadIdx.y;
//int col = (blockIdx.x * blockDim.x) + threadIdx.x;
}
In "classic" CUDA compilation you must define all code and symbols (textures, constant memory, device functions) and any host API calls which access them (including kernel launches, binding to textures, copying to symbols) within the same translation unit. This means, effectively, in the same file (or via multiple include statements within the same file). This is because "classic" CUDA compilation doesn't include a device code linker.
Since CUDA 5 was released, there is the possibility of using separate compilation mode and linking different device code objects into a single fatbinary payload on architectures which support it. In that case, you need to declare any __constant__ variables using the extern keyword and define the symbol exactly once.
If you can't use separate compilation, then the usual workaround is to define the __constant__ symbol in the same .cu file as your kernel, and include a small host wrapper function which just calls cudaMemcpyToSymbol to set the __constant__ symbol in question. You would probably do the same with kernel calls and texture operations.
Below is a "minimum-sized" example showing the use of __constant__ symbols. You do not need to pass any pointer to the __global__ function.
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
__constant__ float test_const;
__global__ void test_kernel(float* d_test_array) {
d_test_array[threadIdx.x] = test_const;
}
#include <conio.h>
int main(int argc, char **argv) {
float test = 3.f;
int N = 16;
float* test_array = (float*)malloc(N*sizeof(float));
float* d_test_array;
cudaMalloc((void**)&d_test_array,N*sizeof(float));
cudaMemcpyToSymbol(test_const, &test, sizeof(float));
test_kernel<<<1,N>>>(d_test_array);
cudaMemcpy(test_array,d_test_array,N*sizeof(float),cudaMemcpyDeviceToHost);
for (int i=0; i<N; i++) printf("%i %f\n",i,test_array[i]);
getch();
return 0;
}

CUDA c program strange behaviour - kernel works until I add more operations

I am puzzled by the following program (code below). It works fine and gives correct results when the two lines in the kernel defining specsin and speccos are given by (note the second term, which is sin(t)):
specsin+=sin(pi*t/my_tau)*sin(t)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
speccos+=sin(pi*t/my_tau)*sin(t)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
Once I change this second sin(t) term to sin(t+0.0*my_a0*my_a0), which shouldn't change the result, I get all zeros instead of correct answer.
Can it be that I ran out of kernel memory?
#include <stdio.h>
__global__ void Calculate_Spectrum(float * d_Detector_Data, int numCols, int numRows,
const float omega_min, float dOmega,
const float a0_min, float da0,
const float tau_min, float dtau, float dt)
{
int Global_x = blockIdx.x * blockDim.x + threadIdx.x;
int Global_y = blockIdx.y * blockDim.y + threadIdx.y;
int Position1D = Global_y * numCols + Global_x;
float my_omega=omega_min + Global_x * dOmega;
float my_a0=a0_min + Global_y*da0;
float my_tau=tau_min;
int total_time_steps=int(my_tau/dt);
float specsin=0.0;
float speccos=0.0;
float t=0.0;
float pi=3.14159265359;
for(int n=0; n<total_time_steps; n++)
{
t=n*dt;
specsin+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
speccos+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
}
d_Detector_Data[Position1D]=(specsin*specsin+speccos*speccos)*dt*dt*my_a0*my_a0*my_omega*my_omega/4.0/pi/pi;
}
int main(int argc, char ** argv)
{
const int omega_bins = 1024;
const int a0_bins = 512;
const int tau_bins = 1;
const float omega_min = 0.5;
const float omega_max = 1.1;
const float a0_min = 0.05;
const float a0_max = 1.0;
const float tau_min = 1200;
const float tau_max = 600;
const int steps_per_period=20; // for integrating
float dt=1.0/steps_per_period;
int TotalSize = omega_bins * a0_bins * tau_bins;
float dOmega=(omega_max-omega_min)/(omega_bins-1);
float da0=(a0_max-a0_min)/(a0_bins-1);
float dtau=0.;
float * d_Detector_Data;
int * d_Global_x;
int * d_Global_y;
float h_Detector_Data[TotalSize];
// allocate GPU memory
cudaMalloc((void **) &d_Detector_Data, TotalSize*sizeof(float));
Calculate_Spectrum<<<dim3(1,a0_bins,1), dim3(omega_bins,1,1)>>>(d_Detector_Data, omega_bins, a0_bins, omega_min, dOmega, a0_min, da0, tau_min, dtau, dt);
cudaMemcpy(h_Detector_Data, d_Detector_Data, TotalSize*sizeof(float), cudaMemcpyDeviceToHost);
FILE * SaveFile;
char TempStr[255];
sprintf(TempStr, "result.dat");
SaveFile = fopen(TempStr, "w");
int counter=0;
for(int j=0; j<a0_bins;j++)
{
for(int i=0; i<omega_bins; i++)
{
fprintf(SaveFile,"%e\t", h_Detector_Data[counter]);
counter++;
}
fprintf(SaveFile, "\n");
}
fclose(SaveFile);
// free GPU memory
return 0;
}
I believe this is due to a register limitation.
In order to launch a kernel, the total registers per thread must not exceed the maximum limit (i.e. the Maximum number of 32-bit registers per thread which the compiler should guarantee) and the registers per thread times the number of threads requested must not exceed the maximum limit (the Number of 32-bit registers per multiprocessor).
In the cases where you are getting incorrect results, I believe your kernel is not launching for this reason (too many registers requested in total). You're not doing any cuda error checking, but if you did, I believe you could confirm this.
You can work around this using any method that reduces the total to under the limit. Obviously reducing the threads per block is a direct way to do this. Other things like specifying the -G switch to the compiler also affect code generation and therefore may affect register per thread. Another way to work around this is to instruct the compiler to limit its usage of registers to some maximum amount per thread. This is documented in the nvcc manual, the usage is like this:
nvcc -maxrregcount=xx ... (rest of compile command)
Where xx is the number of registers per thread to limit the usage. If you limit it to let's say, 20 per thread, then even with 1024 threads per block, I will still only be using roughly 20K registers, and this will fit within any device that supports 1024 threads per block (cc 2.0 and above).
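The register budget described above is simple multiplication, sketched here in host-only C++. The 32768 figure is an assumption (the 32 K registers per multiprocessor of compute capability 2.x; query your own device for the real value), the function names are illustrative, and the hardware's allocation granularity is ignored, so treat the result as an estimate.

```cpp
// Total 32-bit registers one block asks for.
constexpr unsigned registers_per_block(unsigned regs_per_thread,
                                       unsigned threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// True if a single block fits in the multiprocessor's register file.
constexpr bool block_fits(unsigned regs_per_thread, unsigned threads_per_block,
                          unsigned regs_per_multiprocessor) {
    return registers_per_block(regs_per_thread, threads_per_block)
           <= regs_per_multiprocessor;
}

// Assumed register file size: 32 K registers (compute capability 2.x).
constexpr unsigned kAssumedRegsPerSM = 32768;

// 20 registers x 1024 threads = 20480 registers: the launch can proceed.
static_assert(registers_per_block(20, 1024) == 20480, "about 20K registers");
static_assert(block_fits(20, 1024, kAssumedRegsPerSM), "fits");
// At 40 registers per thread the same 1024-thread block no longer fits.
static_assert(!block_fits(40, 1024, kAssumedRegsPerSM), "too many registers");
```

The kernel in the question is launched with 1024 threads per block, so under this assumption anything above 32 registers per thread would exceed the budget.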
You can also get register usage statistics by asking the compiler to generate those:
nvcc -Xptxas -v ... (rest of compile command)
which will cause the compiler to generate a variety of statistics about resource usage, including the number of register per thread used/expected.
Note that resource usage, including register usage, may affect occupancy, which has implications for overall application performance. Therefore limiting register usage may not only allow the kernel to run, but may also allow multiple threadblocks to be resident on an SM, which generally suggests your occupancy is improved, which may improve the performance of your application.
Apparently the usage of 0.0 vs. 0.0f has some subtle effect on compiler behavior, which is showing up in code generation. I would also surmise that you may be right on the boundary of what is acceptable, so perhaps a small change in registers used per thread may be affecting what will run. You can investigate this further using the printout of resource usage statistics from the compiler that I referenced above, and/or possibly by inspecting the PTX code (an intermediate code, something like assembly code) generated:
nvcc -ptx ....
If you choose to inspect the PTX, you will want to refer to the PTX manual.

CUDA Array Reduction

I'm aware that there are multiple questions similar to this one already answered but I've been unable to piece together anything very helpful from them other than that I'm probably incorrectly indexing something.
I'm trying to perform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
extern __shared__ int S[];
// Each thread loads one element from global to shared memory
int tid = threadIdx.x;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
S[tid] = A[i];
__syncthreads();
// Reduce in shared memory
for (int t = blockDim.x/2; t > 0; t>>=1) {
if (tid < t) {
S[tid] += S[tid + t];
}
__syncthreads();
}
if (tid == 0) B[blockIdx.x] = S[0];
}
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting an unspecified launch error, which I've read is similar to a segfault. I've been following the NVIDIA reduction documentation closely and tried to keep my kernel within the bounds of numElements, but I seem to be missing something key considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out of bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.
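To see what the kernel expects from its shared array, the reduction for one block can be emulated on the CPU. This is a host-only C++ sketch, not CUDA: the inner loop over tid stands in for the threads that run in parallel between two __syncthreads() calls, and shared_bytes, reduce_block, and sum_one_to_256 are hypothetical names.

```cpp
#include <cstddef>

constexpr int kThreads = 256;  // threadsPerBlock from the launch code

// Bytes of dynamic shared memory the kernel needs: one int per thread.
constexpr std::size_t shared_bytes(unsigned threads_per_block) {
    return threads_per_block * sizeof(int);
}

// Sequential-addressing reduction over one block's worth of data.
constexpr int reduce_block(const int (&A)[kThreads]) {
    int S[kThreads] = {};                  // stands in for the shared array
    for (int tid = 0; tid < kThreads; ++tid) S[tid] = A[tid];
    for (int t = kThreads / 2; t > 0; t >>= 1) {
        // Reads touch indices >= t and writes touch indices < t, so running
        // the "threads" one after another gives the same result as parallel.
        for (int tid = 0; tid < t; ++tid) S[tid] += S[tid + t];
    }
    return S[0];                           // what thread 0 writes to B[blockIdx.x]
}

// Example input: 1, 2, ..., 256.
constexpr int sum_one_to_256() {
    int A[kThreads] = {};
    for (int i = 0; i < kThreads; ++i) A[i] = i + 1;
    return reduce_block(A);
}

static_assert(sum_one_to_256() == 32896, "256 * 257 / 2");
static_assert(shared_bytes(256) == 256 * sizeof(int), "third launch argument");
```

Every index from 0 to 255 of S is touched, which is why the launch must reserve shared_bytes(threadsPerBlock) bytes. Separately, the kernel above calls __syncthreads() inside the if (i < numElements) branch; since all threads of a block are expected to reach a barrier, that may also cause trouble when numElements is not a multiple of the block size.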
