Nested kernels in CUDA

Nested kernels in CUDA - arrays

CUDA currently does not allow nested kernels.
To be specific, I have the following problem:
I have N data points, each M-dimensional. To process each of the N data points, three kernels need to be run in sequence. Since nesting of kernels is not allowed, I cannot create a single kernel that calls the three kernels. Therefore, I have to process the data points serially.
One solution is to write one big kernel containing the functionality of all three kernels, but I think it would be sub-optimal.
Can anyone suggest how streams can be used to process the N data points in parallel while retaining the three smaller kernels?
Thanks.

Well, if you want to use streams... you will want to create N streams:
cudaStream_t *streams;
streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for(i=0; i<N; i++)
{
cudaStreamCreate(&streams[i]);
}
Then for the ith data point, you want to use cudaMemcpyAsync for transfers:
cudaMemcpyAsync(dst, src, count, kind, streams[i]);
and call your kernels with all four configuration parameters (sharedMemory can be 0, of course):
kernel_1 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
kernel_2 <<< nBlocks, nThreads, sharedMemory, streams[i] >>> ( args );
and of course cleanup:
for(i=0; i<N; i++)
{
cudaStreamDestroy(streams[i]);
}
free(streams);

As an update to the selected answer: NVIDIA GPUs with compute capability 3.5 and above now allow nested kernel launches, which NVIDIA calls Dynamic Parallelism.

Nowadays, on Fermi-class GPUs (compute capability 2.x) and later, it is possible to run several kernels concurrently by launching them in different streams.

Related

tasks run in thread takes longer than in serial?

So I'm doing some computation on 4 million nodes.
The very basic serial version just has a for loop that iterates 4 million times and does one computation per iteration. This takes roughly 1.2 seconds.
When I split the for loop into, say, 4 for loops that each do 1/4 of the computation, the total time becomes 1.9 seconds.
I guess there is some overhead in creating the for loops, and maybe it has to do with the CPU preferring to process data in chunks.
What really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish.
I was expecting each of them to take only 1.9/4 seconds instead.
I don't think there is any race condition or synchronization issue, since all I do is use a for loop to create the 4 threads, which takes 200 microseconds, and then a for loop to join them.
The computation reads from a shared array and writes to a different shared array.
I am sure the threads are not writing to the same byte.
Where could the overhead come from?
main: ncores: number of cores. node_size: size of graph (4 million node)
for(i = 0 ; i < ncores ; i++){
int *t = (int*)malloc(sizeof(int));
*t = i;
int iret = pthread_create( &thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
{
pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
void *calculate_rank_p(void *argument) {
int index = *(int*)argument;
for(i = index; i < node_size ; i+=ncores)
current_vector[i] = calc_r(i, vector);
return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
double prank = 0;
int j;
for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
prank += vector[col_ind[j]] * val[j];
}
return prank;
}
Everything that is not declared here is a global variable.

The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated).
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data (i += ncores), give each of them a contiguous range of data to work on.
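To illustrate the suggestion above, here is a minimal sketch of contiguous-range partitioning with pthreads. It is not the asker's code: NODE_SIZE, NTHREADS, work(), range_t, range_for() and run_partitioned() are all hypothetical names, and work() merely stands in for calc_r(). Each thread writes one contiguous block of current_vector, so neighbouring threads can only share a cache line at the block boundaries.

```c
#include <pthread.h>

#define NODE_SIZE 1000   /* hypothetical problem size */
#define NTHREADS  4      /* hypothetical thread count */

static double current_vector[NODE_SIZE];

/* Hypothetical stand-in for calc_r(): any per-element computation. */
static double work(int i) { return 2.0 * i; }

/* Half-open index range [begin, end) owned by one thread. */
typedef struct { int begin; int end; } range_t;

/* Split [0, n) into nthreads contiguous ranges; thread t gets range t. */
static range_t range_for(int t, int nthreads, int n) {
    range_t r;
    r.begin = (int)((long long)n * t / nthreads);
    r.end   = (int)((long long)n * (t + 1) / nthreads);
    return r;
}

/* Each thread fills only its own contiguous range. */
static void *worker(void *arg) {
    const range_t *r = arg;
    for (int i = r->begin; i < r->end; i++)
        current_vector[i] = work(i);
    return NULL;
}

/* Create the threads, hand each its range, and wait for all of them. */
static void run_partitioned(void) {
    pthread_t thread[NTHREADS];
    range_t ranges[NTHREADS];   /* one argument slot per thread */
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t] = range_for(t, NTHREADS, NODE_SIZE);
        pthread_create(&thread[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(thread[t], NULL);
}
```

Calling run_partitioned() from main fills the whole array. Whether this actually removes the slowdown depends on how the arrays are laid out, which the post does not show.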

I saw the same surprise when building and running in Debug mode (with different test code, though).
In Release everything behaves as expected ;)

CUDA kernel launch parameters explained right?

Here I tried to explain the CUDA launch parameters model (or execution configuration model) to myself using some pseudocode, but I don't know whether I made any big mistakes, so I hope someone can review it and give me some advice. Thanks in advance.
Here it is:
/*
normally, we write kernel function like this.
note, __global__ means this function will be called from host codes,
and executed on device. and a __global__ function could only return void.
if there's any parameter passed into __global__ function, it should be stored
in shared memory on device. so, kernel function is so different from the *normal*
C/C++ functions. if I were the CUDA author, I would make the kernel function look more
different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
/*
after this definition, we could call this kernel function in our normal C/C++ codes !!
do you feel something weird? inconsistent?
normally, when I write C codes, I will think a lot about the execution process down to
the metal in my mind, and this one...it's like some fragile codes. break the sequential
thinking process in my mind.
in order to make things normal, I found a way to explain: I expand the *__global__ * function
to some pseudo codes:
*/
#define __foreach(var, start, end) for (var = start; var < end; ++var)
__device__ int
__indexing() {
const int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
return
blockId * (blockDim.x * blockDim.y * blockDim.z) +
threadIdx.z * (blockDim.x * blockDim.y) +
threadIdx.y * blockDim.x +
threadIdx.x;
}
global_config =:
{
/*
global configuration.
note the default values are all 1, so in the kernel codes,
we could just ignore those dimensions.
*/
gridDim.x = gridDim.y = gridDim.z = 1;
blockDim.x = blockDim.y = blockDim.z = 1;
};
kernel =:
{
/*
I thought CUDA did some bad evil-detail-covering things here.
it's said that CUDA C is an extension of C, but in my mind,
CUDA C is more like C++, and the *<<<>>>* part is too tricky.
for example:
kernel<<<10, 32>>>(); means kernel will execute in 10 blocks each have 32 threads.
dim3 dimG(10, 1, 1);
dim3 dimB(32, 1, 1);
kernel<<<dimG, dimB>>>(); this is exactly the same thing with above.
it's not C style, and C++ style ? at first, I thought this could be done by
C++'s constructor stuff, but I checked structure *dim3*, there's no proper
constructor for this. this just broke the semantics of both C and C++. I thought
force user to use *kernel<<<dim3, dim3>>>* would be better. So I'd like to keep
this rule in my future codes.
*/
gridDim = dimG;
blockDim = dimB;
__foreach(blockIdx.z, 0, gridDim.z)
__foreach(blockIdx.y, 0, gridDim.y)
__foreach(blockIdx.x, 0, gridDim.x)
__foreach(threadIdx.z, 0, blockDim.z)
__foreach(threadIdx.y, 0, blockDim.y)
__foreach(threadIdx.x, 0, blockDim.x)
{
const int idx = __indexing();
if (idx < n) {
arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
}
}
};
/*
so, for me, gridDim & blockDim is like some boundaries.
e.g. gridDim.x is the upper bound of blockIdx.x, this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
unsigned int x, y, z;
#if defined(__cplusplus)
__host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
__host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;

CUDA DRIVER API
The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:
cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel
The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.
cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid
Additional information on these functions can be found in cuda.h.
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void **kernelParams,
void **extra);
cuLaunchKernel takes as parameters the entire launch configuration.
See the Execution Control section of the NVIDIA Driver API documentation for more details.
CUDA KERNEL LAUNCH
cuLaunchKernel will
1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
5. push the launch parameters into the command buffer
6. push the launch command into the command buffer
7. submit the command buffer to the device (on WDDM drivers this step may be deferred)
8. on WDDM the kernel driver will page all memory required in device memory
The GPU will
1. verify the command
2. send the commands to the compute work distributor
3. dispatch launch configuration and thread blocks to the SMs
When all thread blocks have completed the work distributor will flush the caches to honor the CUDA memory model and it will mark the kernel as completed so the next item in the stream can make forward progress.
The order that thread blocks are dispatched differs between architectures.
Compute capability 1.x devices store the kernel parameters in shared memory.
Compute capability 2.0-3.5 devices store the kernel parameters in constant memory.
CUDA RUNTIME API
The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:
cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument
See the Execution Control section of the NVIDIA Runtime API documentation.
The <<<>>> CUDA language extension is the most common method used to launch a kernel.
During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.
For example
__global__ void kernel(float* buf, int j)
{
// ...
}
kernel<<<blocks,threads,0,myStream>>>(d_buf,j);
generates
void __device_stub__Z6kernelPfi(float *__par0, int __par1)
{
    __cudaSetupArgSimple(__par0, 0U);
    __cudaSetupArgSimple(__par1, 4U);
    __cudaLaunch(((char *)((void ( *)(float *, int))kernel)));
}
You can inspect the generated files by adding --keep to your nvcc command line.
cudaLaunch calls cuLaunchKernel.
CUDA DYNAMIC PARALLELISM
CUDA Dynamic Parallelism (CDP) works similarly to the CUDA Runtime API described above.

By using <<<...>>>, you launch a number of threads on the GPU. These threads are grouped into blocks, and the blocks form a grid. All the threads execute the invoked kernel function code.
In the kernel function, built-in variables like threadIdx and blockIdx let the code know which thread it is running as, so that it can do its assigned part of the work.
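To make the role of the built-in index variables concrete, here is a rough plain-C emulation of a launch, written as ordinary nested loops. It is only a sketch: dim3_t, linear_index() and emulate_square_kernel() are made-up names, not CUDA APIs, and a real GPU runs these iterations in parallel and in no guaranteed order.

```c
#include <stddef.h>

/* Plain-C stand-in for CUDA's dim3 / index types (illustrative only). */
typedef struct { unsigned x, y, z; } dim3_t;

/* Global linear thread index for a 3D grid of 3D blocks. */
static size_t linear_index(dim3_t gridDim, dim3_t blockDim,
                           dim3_t blockIdx, dim3_t threadIdx) {
    size_t blockId = blockIdx.x
                   + (size_t)blockIdx.y * gridDim.x
                   + (size_t)blockIdx.z * gridDim.x * gridDim.y;
    size_t threadsPerBlock = (size_t)blockDim.x * blockDim.y * blockDim.z;
    size_t threadId = threadIdx.x
                    + (size_t)threadIdx.y * blockDim.x
                    + (size_t)threadIdx.z * blockDim.x * blockDim.y;
    return blockId * threadsPerBlock + threadId;
}

/* Sequential emulation of kernel<<<gridDim, blockDim>>>(arr, n)
   for a kernel that squares each element. */
static void emulate_square_kernel(dim3_t gridDim, dim3_t blockDim,
                                  float *arr, size_t n) {
    dim3_t b, t;
    for (b.z = 0; b.z < gridDim.z; b.z++)
    for (b.y = 0; b.y < gridDim.y; b.y++)
    for (b.x = 0; b.x < gridDim.x; b.x++)
    for (t.z = 0; t.z < blockDim.z; t.z++)
    for (t.y = 0; t.y < blockDim.y; t.y++)
    for (t.x = 0; t.x < blockDim.x; t.x++) {
        size_t idx = linear_index(gridDim, blockDim, b, t);
        if (idx < n)                       /* same bounds guard as a kernel */
            arr[idx] = arr[idx] * arr[idx];
    }
}
```

With a grid of 3 blocks and 32 threads per block, 96 "threads" run over a 70-element array, and the idx < n guard keeps the last 26 from touching memory past the end.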
edit
Basically, <<<...>>> simplifies the configuration procedure for launching a kernel. Without it, one may have to call 4-5 APIs for a single kernel launch, as in OpenCL, which uses only C99 syntax.
In fact, you could check the CUDA driver API. It should provide all those functions, so you don't need to use <<<>>>.

Basically, a graphics card contains one or more "device" GPUs (e.g. the GeForce GTX 690 has 2); each GPU contains multiple SMs (streaming multiprocessors), and each SM contains multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical arrangement independent of the hardware, but the total size of a block (x*y*z) is very important.
Threads in a block HAVE TO be on the same SM, to use its facilities of shared memory and synchronization. A block may still contain more threads than the SM has CUDA cores (up to 1024 threads per block on compute capability 2.0 and later); the SM schedules the block's threads in warps of 32.
Take a simple scenario with 16 SMs of 32 CUDA cores each, a 31x1x1 block size, and a 20x1x1 grid size. Each 31-thread block occupies one 32-wide warp with one lane idle, so we forfeit at least 1/32 of the processing power of the card. In a simplified picture where each SM runs one block at a time, 16 blocks start first and finish at roughly the same time, and as the first 4 SMs free up, they start processing the remaining 4 blocks (NOT necessarily blocks #17-20). In practice an SM can hold several resident blocks at once, so the actual schedule may differ.
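The arithmetic behind the "1/32 forfeited" figure can be sketched in a few lines of C. The helper names warps_per_block() and lane_utilization() are made up for this illustration, and it assumes a warp width of 32 and a block size of at least 1.

```c
#define WARP_SIZE 32  /* warp width assumed for NVIDIA GPUs */

/* Number of warps needed to hold a block of the given size (size >= 1). */
static unsigned warps_per_block(unsigned threads_per_block) {
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Fraction of warp lanes that carry a real thread (1.0 means none idle). */
static double lane_utilization(unsigned threads_per_block) {
    unsigned lanes = warps_per_block(threads_per_block) * WARP_SIZE;
    return (double)threads_per_block / lanes;
}
```

For a 31-thread block this gives one warp at 31/32 utilization. A 33-thread block is worse: it needs two warps and uses only 33 of 64 lanes, which is why block sizes are usually chosen as multiples of 32.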
Comments and corrections are welcome.

MPI wrapper that imitates OpenMP's for-loop pragma

I am thinking about implementing a wrapper for MPI that imitates OpenMP's way
of parallelizing for loops.
begin_parallel_region( chunk_size=100 , num_proc=10 );
for( int i=0 ; i<1000 ; i++ )
{
//some computation
}
end_parallel_region();
The code above distributes the computation inside the for loop to 10 slave MPI processes.
Upon entering the parallel region, the chunk size and the number of slave processes are provided.
Upon leaving the parallel region, the MPI processes are synchronized and put idle.
EDITED in response to High Performance Mark.
I have no intention to simulate the OpenMP's shared memory model.
I propose this because I need it.
I am developing a library that is required to build graphs from mathematical functions.
In these mathematical functions, there often exist for loops like the one below.
for( int i=0 ; i<n ; i++ )
{
s = s + sin(x[i]);
}
So I want to first be able to distribute sin(x[i]) to the slave processes and at the end reduce to the single variable, just like in OpenMP.
I was wondering if there is such a wrapper out there so that I don't have to reinvent the wheel.
Thanks.

There is no such wrapper out there which has escaped from the research labs into widespread use. What you propose is not so much re-inventing the wheel as inventing the flying car.
I can see how you propose to write MPI code which simulates OpenMP's approach to sharing the burden of loops; what is much less clear is how you propose to have MPI simulate OpenMP's shared memory model.
In a simple OpenMP program one might have, as you suggest, 10 threads each perform 10% of the iterations of a large loop, perhaps updating the values of a large (shared) data structure. To simulate that inside your cunning wrapper in MPI you'll either have to (i) persuade one-sided communications to behave like shared memory (this might be doable and will certainly be difficult) or (ii) distribute the data to all processes, have each process independently compute 10% of the results, then broadcast the results all-to-all so that at the end of execution each process has all the data that the others have.
Simulating shared memory computing on distributed memory hardware is a hot topic in parallel computing, always has been, always will be. Google for distributed shared memory computing and join the fun.
EDIT
Well, if you've distributed x across processes then individual processes can compute sin(x[i]) and you can reduce the sum on to one process using MPI_Reduce.
I must be missing something about your requirements because I just can't see why you want to build any superstructure on top of what MPI already provides. Nevertheless, my answer to your original question remains No, there is no such wrapper as you seek and all the rest of my answer is mere commentary.

Yes, you could do this, for specific tasks. But you shouldn't.
Consider how you might implement this; the begin part would distribute the data, and the end part would bring the answer back:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
typedef struct state_t {
int globaln;
int localn;
int *locals;
int *offsets;
double *localin;
double *localout;
double (*map)(double);
} state;
state *begin_parallel_mapandsum(double *in, int n, double (*map)(double)) {
state *s = malloc(sizeof(state));
s->globaln = n;
s->map = map;
/* figure out decomposition */
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
s->locals = malloc(size * sizeof(int));
s->offsets = malloc(size * sizeof(int));
s->offsets[0] = 0;
for (int i=0; i<size; i++) {
s->locals[i] = (n+i)/size;
if (i < size-1) s->offsets[i+1] = s->offsets[i] + s->locals[i];
}
/* allocate local arrays */
s->localn = s->locals[rank];
s->localin = malloc(s->localn*sizeof(double));
s->localout = malloc(s->localn*sizeof(double));
/* distribute */
MPI_Scatterv( in, s->locals, s->offsets, MPI_DOUBLE,
s->localin, s->locals[rank], MPI_DOUBLE,
0, MPI_COMM_WORLD);
return s;
}
double end_parallel_mapandsum(state **s) {
double localanswer=0., answer;
/* sum up local answers */
for (int i=0; i<((*s)->localn); i++) {
localanswer += ((*s)->localout)[i];
}
/* and get global result. Everyone gets answer */
MPI_Allreduce(&localanswer, &answer, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
free( (*s)->localin );
free( (*s)->localout );
free( (*s)->locals );
free( (*s)->offsets );
free( (*s) );
return answer;
}
int main(int argc, char **argv) {
int rank;
double *inputs = NULL; /* only allocated and read on rank 0 */
double result;
int n=100;
const double pi=4.*atan(1.);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
inputs = malloc(n * sizeof(double));
for (int i=0; i<n; i++) {
inputs[i] = 2.*pi/n*i;
}
}
state *s=begin_parallel_mapandsum(inputs, n, sin);
for (int i=0; i<s->localn; i++) {
s->localout[i] = (s->map)(s->localin[i]);
}
result = end_parallel_mapandsum(&s);
if (rank == 0) {
printf("Calculated result: %lf\n", result);
double trueresult = 0.;
for (int i=0; i<n; i++) trueresult += sin(inputs[i]);
printf("True result: %lf\n", trueresult);
}
MPI_Finalize();
}
That constant distribute/gather is a terrible communications burden to sum up a few numbers, and is antithetical to the entire distributed-memory computing model.
To a first approximation, shared memory approaches - OpenMP, pthreads, IPP, what have you - are about scaling computations faster; about throwing more processors at the same chunk of memory. On the other hand, distributed-memory computing is about scaling a computation bigger; about using more resources, particularly memory, than can be found on a single computer. The big win of using MPI is when you're dealing with problem sets which can't fit in any one node's memory, ever. So when doing distributed-memory computing, you avoid having all the data in any one place.
It's important to keep that basic approach in mind even when you are just using MPI on-node to use all the processors. The above scatter/gather approach will just kill performance. The more idiomatic distributed-memory computing approach is for the logic of the program to already have distributed the data - that is, your begin_parallel_region and end_parallel_region above would have already been built into the code above your loop at the very beginning. Then, every loop is just
for( int i=0 ; i<localn ; i++ )
{
s = s + sin(x[i]);
}
and when you need to exchange data between tasks (or reduce a result, or what have you) then you call the MPI functions to do those specific tasks.
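As a side note on the example code above, the counts/offsets decomposition it builds for MPI_Scatterv can be checked on its own, without MPI. The following is a small sketch that assumes the same (n + i) / size rule; decompose() is a hypothetical helper name, not part of MPI.

```c
/* Fill counts[] and offsets[] for splitting n items over `size` ranks,
   using the same (n + i) / size rule as the example above.
   The counts differ by at most one and always sum to n. */
static void decompose(int n, int size, int *counts, int *offsets) {
    int offset = 0;
    for (int i = 0; i < size; i++) {
        counts[i]  = (n + i) / size;  /* later ranks absorb the remainder */
        offsets[i] = offset;          /* start of rank i's slice */
        offset    += counts[i];
    }
}
```

For n = 10 and size = 4 this yields counts 2, 2, 3, 3 and offsets 0, 2, 4, 7. In the idiomatic approach described above, a decomposition like this would be computed once when the data is first created, rather than around every loop.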

Is MPI a must, or are you just trying to run your OpenMP-like code on a cluster? In the latter case, I suggest you take a look at Intel's Cluster OpenMP:
http://www.hpcwire.com/hpcwire/2006-05-19/openmp_on_clusters-1.html

problems when creating many plans and executing plans

I am a little confused about creating a plan with fftwf_plan_many_dft_r2c() and executing it with OpenMP. What I am trying to find out is whether explicitly using OpenMP and organizing the FFTW data myself can work together. (I know I "should" use the multithreaded version of FFTW, but I failed to get the expected speedup from it.)
My code looks like this:
/* I ignore some helper APIs */
#define N (1024*1024) //N is the total size of 1d fft
fftwf_plan p;
float * in;
fftwf_complex *out;
omp_set_num_threads(threadNum); // Suppose threadNum is 2 here
in = fftwf_alloc_real(2*(N/2+1));
std::fill(in,in+2*(N/2+1),1.1f); // just try with a random real floating numbers
out = (fftwf_complex *)&in[0]; // for in-place transformation
/* Problems start from here */
int n[] = {N/threadNum}; // according to the manual, n is the size of each "howmany" transformation
p = fftwf_plan_many_dft_r2c(1, n, threadNum, in, NULL,1 ,1, out, NULL, 1, 1, FFTW_ESTIMATE);
#pragma omp parallel for
for (int i = 0; i < threadNum; i ++)
{
fftwf_execute(p);
// fftwf_execute_dft_r2c(p,in+i*N/threadNum,out+i*N/threadNum);
}
What I got is like this:
If I use fftwf_execute(p), the program executes successfully, but the result does not seem correct. (I compare the result with a version that uses neither the many-plan nor OpenMP.)
If I use fftwf_execute_dft_r2c(), I get a segmentation fault.
Can somebody help me here? How should I partition the data across multiple threads? Or is this approach not correct in the first place?
Thank you in advance.
flyree

Do you properly allocate memory for out? Does this:
out = (fftwf_complex *)&in[0]; // for in-place transformation
do the same as this:
out = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex)*numberOfOutputColumns);
You are accessing 'p' inside your parallel block without explicitly telling OpenMP how to treat it. Variables declared outside the parallel region are shared by default, but you can state it explicitly:
#pragma omp parallel for shared(p)
If you are going to split the work up for n threads, I would think you'd explicitly want to tell OpenMP to use n threads:
#pragma omp parallel for shared(p) num_threads(n)
Does this code work without multithreading? If you remove the for loop and the OpenMP pragma and execute fftwf_execute(p) just once, does it work?
I don't know much about FFTW's "many" plans, but it seems like p really describes many transforms, not a single one. So when you execute p, you are executing all of them at once, right? You shouldn't need to execute p repeatedly in a loop.
I'm still learning about OpenMP + FFTW, so I could be wrong on these.

parallel programming in C

I'm trying to parallelize a ray tracer in C, but the execution time is not dropping as the number of threads increases. The code I have so far is:
main2(thread function):
float **result=malloc(width * sizeof(float*));
int count=0;
for (int px=0; px<width; ++px)
{
...
for (int py=0; py<height; ++py)
{
...
float *scaled_color=malloc(3*sizeof(float));
scaled_color[0]=...
scaled_color[1]=...
scaled_color[2]=...
result[count]=scaled_color;
count++;
...
}
}
...
return (void *) result;
main:
pthread_t threads[nthreads];
for (i=0;i<nthreads;i++)
{
pthread_create(&threads[i], NULL, main2, &i);
}
float** result_handler;
for (i=0; i<nthreads; i++)
{
pthread_join(threads[i], (void *) &result_handler);
int count=0;
for(j=0; j<width;j++)
{
for(k=0;k<height;k++)
{
float* scaled_color=result_handler[count];
count ++;
printf...
}
printf("\n");
}
}
main2 returns a float ** so that the picture can be printed in order in the main function. Does anyone know why the execution time is not dropping (e.g. it runs longer with 8 threads than with 4 threads, when it's supposed to be the other way around)?

It's not enough to add threads, you need to actually split the task as well. Looks like you're doing the same job in every thread, so you get n copies of the result with n threads.
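As a rough illustration of splitting the task, each thread can be handed its own range of image columns through a per-thread argument struct. This is a sketch, not the asker's ray tracer: WIDTH, HEIGHT, NTHREADS, shade(), slice_t, render_slice() and render_parallel() are hypothetical, and shade() is a placeholder for the real per-pixel work.

```c
#include <pthread.h>

#define WIDTH    64  /* hypothetical image width  */
#define HEIGHT   48  /* hypothetical image height */
#define NTHREADS 4   /* hypothetical thread count */

static float image[WIDTH][HEIGHT][3];

/* Hypothetical stand-in for the per-pixel ray-tracing work. */
static void shade(int px, int py, float *rgb) {
    rgb[0] = (float)px;
    rgb[1] = (float)py;
    rgb[2] = 0.0f;
}

/* Columns [first_col, last_col) rendered by one thread. */
typedef struct { int first_col; int last_col; } slice_t;

static void *render_slice(void *arg) {
    const slice_t *s = arg;
    for (int px = s->first_col; px < s->last_col; ++px)
        for (int py = 0; py < HEIGHT; ++py)
            shade(px, py, image[px][py]);
    return NULL;
}

static void render_parallel(void) {
    pthread_t threads[NTHREADS];
    slice_t slices[NTHREADS];  /* one slot per thread, not a shared &i */
    for (int t = 0; t < NTHREADS; t++) {
        slices[t].first_col = WIDTH * t / NTHREADS;
        slices[t].last_col  = WIDTH * (t + 1) / NTHREADS;
        pthread_create(&threads[t], NULL, render_slice, &slices[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
}
```

Every pixel is written exactly once into a shared image buffer, so main can print the picture in order after all the joins instead of collecting a separate full-size result from each thread.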

Parallelism of programs and algorithms is usually non trivial to achieve and doesn't come without some investment.
I don't think that working directly with threads is the right tool for you. Try looking into OpenMP; it is much more high-level.

Two things are working against you here. (1) Unless you can allocate threads to more than one core, you couldn't expect a speedup in the first place; using a single core, that core has the same amount of work to do whether you parallelize the code or not. (2) Even with multiple cores, parallel performance is exquisitely sensitive to the ratio of computation done on-core to the amount of communication necessary between cores. With pthread_join() inside the loop, you're incurring a lot of these 'stop and wait for the other guy' performance hits.
