OpenCL - Local Memory - c

I understand the difference between global and local memory in general, but I have problems actually using local memory.
1) What has to be considered when transforming global-memory variables into local-memory variables?
2) How do I use local barriers?
Maybe someone can help me with a little example.
I tried to do a Jacobi computation using local memory, but I only get 0 as the result. Maybe someone can give me some advice.
Working Solution:
#define IDX(_M,_i,_j) (_M)[(_i) * N + (_j)]
#define U(_i, _j) IDX(uL, _i, _j)

__kernel void jacobi(__global VALUE* u, __global VALUE* f, __global VALUE* tmp, VALUE factor) {
    int i  = get_global_id(0);
    int j  = get_global_id(1);
    int iL = get_local_id(0);
    int jL = get_local_id(1);

    __local VALUE uL[(N+2)*(N+2)];
    __local VALUE fL[(N+2)*(N+2)];

    IDX(uL, iL, jL) = IDX(u, i, j);
    IDX(fL, iL, jL) = IDX(f, i, j);

    barrier(CLK_LOCAL_MEM_FENCE);

    IDX(tmp, i, j) = (VALUE)0.25 * ( U(iL-1, jL) + U(iL, jL-1) + U(iL, jL+1) + U(iL+1, jL) - factor * IDX(fL, iL, jL));
}
Thanks.

1) Query the CL_DEVICE_LOCAL_MEM_SIZE value; it is at least 16 kB and increases on different hardware. If your local variables fit in it and are re-used many times, you should put them in local memory before usage. Even if you don't, the automatic use of the L2 cache when accessing a GPU's global memory can still keep the cores well utilized.
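For reference, a minimal host-side sketch of such a query might look like this (it assumes a valid cl_device_id named device obtained elsewhere, and also covers the CL_DEVICE_LOCAL_MEM_TYPE check mentioned further down):

#include <stdio.h>
#include <CL/cl.h>

/* device is assumed to be a valid cl_device_id from clGetDeviceIDs() */
void print_local_mem_info(cl_device_id device)
{
    cl_ulong local_mem_size = 0;
    cl_device_local_mem_type local_mem_type;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem_size), &local_mem_size, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(local_mem_type), &local_mem_type, NULL);
    printf("local mem: %llu bytes, dedicated: %s\n",
           (unsigned long long)local_mem_size,
           local_mem_type == CL_LOCAL ? "yes" : "no (mapped to global memory)");
}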
If the global-to-local copy takes an important slice of the time, you can do an async work-group copy while the cores are calculating other things.
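A rough kernel-side sketch of that pattern (TILE_SIZE and the buffer names are made up for illustration; TILE_SIZE would have to equal the work-group size and be defined at build time, e.g. with -D TILE_SIZE=64):

__kernel void process(__global const float* src, __global float* dst)
{
    __local float tile[TILE_SIZE];
    /* start copying one tile from global to local memory; the copy can overlap
       with independent work that does not touch 'tile' */
    event_t evt = async_work_group_copy(tile, src + get_group_id(0) * TILE_SIZE, TILE_SIZE, 0);

    /* ... independent computation could go here ... */

    /* all work-items wait here until the copy has completed */
    wait_group_events(1, &evt);

    dst[get_global_id(0)] = tile[get_local_id(0)] * 2.0f;
}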
Another important point: more free local memory means more concurrent threads per compute unit. If the GPU has 64 cores per compute unit, only 64 threads can run when all local memory is used. When more is free, 128, 192, ..., 2560 threads can be in flight at the same time, if there are no other limitations.
A profiler can show the bottlenecks, so you can decide whether it is worth a try.
For example, a naive matrix-matrix multiplication using nested loops relies on the L1/L2 caches, but sub-matrices can fit in local memory. Maybe 48x48 sub-matrices of floats fit in the compute unit of a mid-range graphics card and can be reused N times over the whole calculation before being replaced by the next sub-matrix.
Querying CL_DEVICE_LOCAL_MEM_TYPE can return LOCAL or GLOBAL; if it returns GLOBAL, using local memory is not recommended.
Lastly, the size of any memory allocation (except __private) must be known at compile time (of the device code, not the host), because the compiler must know how many wavefronts can be issued to achieve maximum performance (and/or for other optimizations). That is why no recursive function is allowed by OpenCL 1.2. But you can copy a function and rename it N times to get pseudo-recursion.
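A tiny sketch of that workaround, unrolling a fixed recursion depth by hand (the function names are made up; each level is just a renamed copy of the one below it):

/* depth-3 "recursion" unrolled into three copies of the same function */
int sum_level_0(int n) { return (n <= 0) ? 0 : n; }
int sum_level_1(int n) { return (n <= 0) ? 0 : n + sum_level_0(n - 1); }
int sum_level_2(int n) { return (n <= 0) ? 0 : n + sum_level_1(n - 1); }

__kernel void pseudo_recursion_demo(__global int* out)
{
    /* behaves like a recursive sum limited to depth 3: 3 + 2 + 1 = 6 */
    out[get_global_id(0)] = sum_level_2(3);
}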
2) Barriers are a meeting point for all the threads in a workgroup. Similar to cyclic barriers, they all stop there and wait for each other before continuing. If it is a local barrier, all workgroup threads finish any local memory operations before leaving that point. If you want to write some numbers 1,2,3,4,... to a local array, you can't be sure whether all threads have finished writing them until a local barrier is passed; after that, it is certain that the array already holds its final values.
All workgroup threads must hit the same barrier. If one cannot reach it, the kernel hangs or you get an error.
__local int localArray[64];   // shared by all threads of the workgroup,
                              // one instance per compute unit (not per thread)

if(localThreadId != 0)
    localArray[localThreadId] = localThreadId;   // 63 values written in parallel
// we can't be sure yet whether the 2nd thread (or the last one) has finished writing

if(localThreadId == 0)        // 1st thread of each workgroup loads from VRAM
    localArray[localThreadId] = globalArray[globalThreadId];

barrier(CLK_LOCAL_MEM_FENCE); // probably all threads wait for the 1st thread
                              // (maybe even the 1st SIMD lane, or
                              //  even the whole 1st wavefront!)

// here all threads have written their own id to the local array: safe to read,
// except the first element, which is a value from global memory.
// let's add that value to all the other values
if(localThreadId != 0)
    localArray[localThreadId] += localArray[0];
Working example (local work-group size = 64):
inputs: 0,1,2,3,4,0,0,0,0,0,0,..
__kernel void vecAdd(__global float* x)
{
    int id  = get_global_id(0);
    int idL = get_local_id(0);
    __local float loc[64];
    loc[idL] = x[id];
    barrier(CLK_LOCAL_MEM_FENCE);
    float distance_square_sum = 0;
    for(int i = 0; i < 64; i++)
    {
        float diff = loc[idL] - loc[i];
        float diff_squared = diff * diff;
        distance_square_sum += diff_squared;
    }
    x[id] = distance_square_sum;
}
output: 30, 74, 246, 546, 974, 30, 30, 30...
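A quick check of where the first values come from: element 0 is compared against 0,1,2,3,4 and 59 zeros, so its sum is 0 + 1 + 4 + 9 + 16 = 30; element 1 gives 1 + 0 + 1 + 4 + 9 + 59*1 = 74, and so on.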

Related

Manual synchronization in OpenMP while loop

I recently started working with OpenMP to do some 'research' for a project at university. I have a rectangular, evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme. So I basically have two for-loops (one each in the x- and y-direction of the grid) wrapped by a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was a spatial parallelization of the for-loops. That works fine, too.
The approach I have problems with is a trickier idea: each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished, the thread goes on with the next row (y=1), and so on. At the same time thread #2 can already start at y=0, because all the necessary information is already available. I just need a kind of manual synchronization between the threads so they can't overtake each other.
Therefore I used an array called check. It contains the thread id that is currently allowed to work on each grid row. When the upcoming row is not 'ready' (the value in check[j] is not correct), the thread spins in an empty while-loop until it is.
Things will get clearer with an MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main()
{
    // initialize variables
    int iter = 0;             // iteration step counter
    int check[100] = { 0 };   // initialize all rows for thread #0

    #pragma omp parallel num_threads(2)
    {
        int ID, num_threads, nextID;
        double u[100 * 300] = { 0 };

        // get parallelization info
        ID = omp_get_thread_num();
        num_threads = omp_get_num_threads();

        // determine next valid id
        if (ID == num_threads - 1) nextID = 0;
        else nextID = ID + 1;

        // iteration loop until abort criteria (HERE: SIMPLIFIED) are valid
        while (iter < 1000)
        {
            // rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
            for (int j = 1; j < (100 - 1); j++)
            {
                // manual synchronization: wait until previous thread completed enough rows
                while (check[j + 1] != ID)
                {
                    //printf("Thread #%d is waiting!\n", ID);
                }

                // gridpoints in row j
                for (int i = 1; i < (300 - 1); i++)
                {
                    // solve PDE on gridpoint
                    // replaced by random operation to consume time
                    double ignore = pow(8.39804, 10.02938) - pow(12.72036, 5.00983);
                }

                // update of check array is atomic to avoid a race condition
                #pragma omp atomic write
                check[j] = nextID;
            }// for j

            #pragma omp atomic write
            check[100 - 1] = nextID;

            #pragma omp atomic
            iter++;

            #pragma omp single
            {
                printf("Iteration step: %d\n\n", iter);
            }
        }//while
    }// omp parallel
}//main
The thing is, this MWE actually works on my machine. But when I copy it into my project, it doesn't. Additionally, the outcome is always different: it stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes of the comment in the inner while-loop, it works! The output contains some
"Thread #1 is waiting!"
but that's reasonable. To me it looks like I have somehow created a race condition, but I don't know where.
Does somebody have an idea what the problem could be? Or a hint on how to realize this kind of synchronization?
I think you are mixing up atomicity and memory consistency. The OpenMP standard actually describes it very nicely in
1.4 Memory Model (emphasis mine):
The OpenMP API provides a relaxed-consistency, shared-memory model.
All OpenMP threads have access to a place to store and to retrieve
variables, called the memory. In addition, each thread is allowed to
have its own temporary view of the memory. The temporary view of
memory for each thread is not a required part of the OpenMP memory
model, but can represent any kind of intervening structure, such as
machine registers, cache, or other local storage, between the thread
and the memory. The temporary view of memory allows the thread to
cache variables and thereby to avoid going to memory for every
reference to a variable.
1.4.3 The Flush Operation
The memory model has relaxed-consistency because a thread’s temporary
view of memory is not required to be consistent with memory at all
times. A value written to a variable can remain in the thread’s
temporary view until it is forced to memory at a later time. Likewise,
a read from a variable may retrieve the value from the thread’s
temporary view, unless it is forced to read from memory. The OpenMP
flush operation enforces consistency between the temporary view and
memory.
To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause on your atomic constructs. This clause forces an implicit flush with the operation. (It is called a sequentially consistent atomic construct.)
int c;
// manual synchronization: wait until previous thread completed enough rows
do
{
    #pragma omp atomic read
    c = check[j + 1];
} while (c != ID);
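For completeness, a sketch of how both sides could look with the seq_cst clause (OpenMP 4.0 syntax; adapt the variable names to your code):

// reader side: sequentially consistent atomic read, implies a flush
int c;
do
{
    #pragma omp atomic read seq_cst
    c = check[j + 1];
} while (c != ID);

// writer side: sequentially consistent atomic write, also flushed
#pragma omp atomic write seq_cst
check[j] = nextID;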
Disclaimer: I can't really try the code right now.
Further Notes:
I think the iter stop criterion is bogus the way you use it, but I guess that's irrelevant given that it is not your actual criterion.
I assume this variant will perform worse than the spatial decomposition. You lose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0".

Tasks run in threads take longer than in serial?

So I'm doing some computation on 4 million nodes.
The very basic serial version just has a for loop that loops 4 million times and does 4 million computations. This takes roughly 1.2 sec.
When I split the for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time becomes 1.9 sec.
I guess there is some overhead in creating the for loops, and maybe it has to do with the CPU preferring to compute data in chunks.
The thing that really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish.
I was expecting each of them to take only 1.9/4 seconds instead.
I don't think there is any race condition or synchronization issue, since all I do is use one for loop to create the 4 threads (which takes 200 microseconds) and another for loop to join them.
The computation reads from a shared array and writes to a different shared array.
I am sure they are not writing to the same byte.
Where could the overhead come from?
main: ncores: number of cores. node_size: size of graph (4 million nodes)
for(i = 0; i < ncores; i++){
    int *t = (int*)malloc(sizeof(int));
    *t = i;
    int iret = pthread_create(&thread[i], NULL, calculate_rank_p, (void*)(t));
}
for (i = 0; i < ncores; i++)
{
    pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
void *calculate_rank_pthread(void *argument) {
    int index = *(int*)argument;
    for(i = index; i < node_size; i += ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector){
    double prank = 0;
    int j;
    for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}
Everything that is not declared here is a global variable.
The computation read from a shared array and write to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing the relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
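The usual mitigation from that article is to pad per-thread data so that each slot occupies its own cache line; a minimal sketch (64 bytes is an assumption about the cache-line size):

#define CACHE_LINE 64                          /* assumed cache-line size in bytes */

/* each thread writes only its own slot; the padding keeps neighbouring slots
   on separate cache lines, so one thread's writes don't invalidate another's line */
struct padded_slot {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
};

struct padded_slot partial[8];                 /* e.g. one slot per thread */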
UPDATE
This looks like it could very well trigger false sharing, depending on the size of vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated):
for(i = index; i < node_size ; i+=ncores)
Instead of interleaving which core works on which data with i += ncores, give each thread a contiguous range of data to work on.
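A minimal sketch of that change, giving each thread a contiguous block (names follow the question; the boundary handling is my own assumption):

void *calculate_rank_pthread(void *argument) {
    int index = *(int*)argument;

    /* contiguous block per thread instead of interleaving by ncores */
    int chunk = (node_size + ncores - 1) / ncores;   /* rounded up */
    int start = index * chunk;
    int end   = start + chunk;
    if (end > node_size) end = node_size;

    for (int i = start; i < end; i++)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}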
I had the same surprise when building and running in Debug (with other test code, though).
In Release everything is as expected ;)

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't bother too much about what this algorithm does; it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well-conditioned the system is.
I run it with an ill-conditioned 2097152-row sparse matrix (that never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be fewer.
There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, a read can get either the previous or the current value at that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
typedef struct {
    size_t size;
    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;

int work_line;

// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;

    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;

            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];

                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];

                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }

                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }

                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];

                #pragma omp atomic write
                sol[i] = new_val;
            }

            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);

    return count;
}
As you can see, there is no lock inside the parallel region, so, from what they always teach us, it is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, hyper-threading enabled, summing up to 40 logical cores.
In my first set of runs, work_line was fixed at 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up the virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size, which told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Let's just see...", and scheduled runs with every work_line from 8 to 2048, in steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffer badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well-sold lie, I am asking you here first:
How can I make this code scale linearly with the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Update
Following suggestions, I tested with both static and dynamic scheduling, and also with the atomic read/write on the array sol removed. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. From what I can see: static makes a significant difference for low work_line, but after 96 the benefit of dynamic outweighs its overhead, making it faster. The atomic operations make no difference at all.
The sparse matrix-vector multiplication is memory bound (see here) and that can be shown with a simple roofline model. Memory-bound problems benefit from the higher memory bandwidth of multi-socket NUMA systems, but only if the data initialisation is done in such a way that the data is distributed between the two NUMA domains. I have some reasons to believe that you are loading the matrix serially, and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system, and it really doesn't matter whether you use schedule(dynamic) or schedule(static). What you could do is enable the memory-interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access, instead of having all threads on the second CPU hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
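An alternative to interleaving, following the remark about data initialisation above, is to rely on first-touch placement: initialise the arrays in a parallel loop so that each page is first written by the thread that will later work on it. A rough sketch (array names follow the solver; this works best when the compute loop uses a matching static schedule):

// first-touch initialisation: each page ends up on the NUMA node of the
// thread that writes it first, so distribute the writes like the solver does
#pragma omp parallel for schedule(static)
for (int i = 0; i < size; ++i) {
    sol[i] = 0.0;
    rhs[i] = 0.0;
    /* the matrix arrays (row_jumper, col_buffer, elements) would need
       the same treatment when they are filled in */
}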
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;
    // ...
    #pragma omp atomic update
    done &= tdone;
}
is probably a not-very-efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
    if(residual > tolerance) {
        done = false;
    }
    // ...
}
There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but it is still not a good idea to reimplement existing OpenMP primitives, for the sake of portability and readability.
Try running the IPCM (Intel Performance Counter Monitor). You can watch the memory bandwidth and see if it maxes out with more cores. My gut feeling is that you are memory-bandwidth limited.
As a quick back-of-the-envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32-bit word per clock cycle. Your inner loop is basically just a multiply-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads you don't get any performance gain.
Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of those reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't an explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP implementations aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that many codes scale much closer to linearly than yours does.
I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the caches of other CPUs that hold that same cache line. This forces the caches to be updated, which in turn leads to the CPUs stalling.
Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you get no more performance gain from additional CPU cores, which only serve to worsen the situation.

Why does CILK_NWORKERS affect program with only one cilk_spawn?

I am trying to parallelize a matrix-processing program. After using OpenMP, I decided to also check out Cilk Plus, and I noticed the following:
In my C code, I only apply parallelism in one part, i.e.:
//(test_function declarations)

cilk_spawn highPrep(d, x, half);

d = temp_0;
r = malloc(sizeof(int)*(half));
temp_1 = r;
x = x_alloc + F_EXTPAD;
lowPrep(r, d, x, half);

cilk_sync;

//test_function return
According to the documentation I have read so far, cilk_spawn is expected to - maybe - take the highPrep() function and execute it on a different hardware thread, should one be available (Cilk Plus does not enforce parallelism), and then continue with the execution of the rest of the code, including the function lowPrep(). The threads should then synchronize at cilk_sync before execution proceeds with the rest of the code.
I am running this on an 8-core/16-thread Xeon E5-2680 that does not execute anything else apart from my experiments at any given time. My question is that I notice that when I change the environment variable CILK_NWORKERS and try values such as 2, 4, 8, 16, the time the test_function needs to execute varies a lot. In particular, the higher CILK_NWORKERS is set (above 2), the slower the function becomes. This seems counter-intuitive to me, since I would expect the available number of threads not to change the operation of cilk_spawn. I would expect that if 2 threads are available then the function highPrep is executed on another thread, and anything more than 2 threads I would expect to remain idle.
The highPrep and lowPrep functions are:
void lowPrep(int *dest, int *src1, int *src2, int size)
{
    double temp;
    int i;
    for(i = 0; i < size; i++)
    {
        temp = -.25 * (src1[i] + src1[i + 1]) + .5;
        if (temp > 0)
            temp = (int)temp;
        else
        {
            if (temp != (int)temp)
                temp = (int)(temp - 1);
        }
        dest[i] = src2[2*i] - (int)(temp);
    }
}

void highPrep(int *dest, int *src, int size)
{
    double temp;
    int i;
    for(i = 0; i < size + 1; i++)
    {
        temp = (-1.0/16 * (src[-4 + 2*i] + src[2 + 2*i]) + 9.0/16 * (src[-2 + 2*i] + src[0 + 2*i]) + 0.5);
        if (temp > 0)
            temp = (int)temp;
        else
        {
            if (temp != (int)temp)
                temp = (int)(temp - 1);
        }
        dest[i] = src[-1 + 2*i] - (int)temp;
    }
}
There must be a reasonable explanation for this; is it reasonable to expect different execution times for a program like this?
Clarification: Cilk does "continuation stealing", not "child stealing", so highPrep always runs on the same hardware thread as its caller. It's the "rest of the code" that might end up running on a different thread. See this primer for a fuller explanation.
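A hedged way to see this for yourself is to print the worker number on both sides of the spawn, using __cilkrts_get_worker_number() from the Cilk Plus runtime (the function names in the sketch are mine):

#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

void spawned_work(void)
{
    /* the spawned call runs on the same worker that executed the spawn */
    printf("spawned_work on worker %d\n", __cilkrts_get_worker_number());
}

int main(void)
{
    printf("before spawn on worker %d\n", __cilkrts_get_worker_number());
    cilk_spawn spawned_work();
    /* the continuation may have been stolen and run on a different worker */
    printf("continuation on worker %d\n", __cilkrts_get_worker_number());
    cilk_sync;
    return 0;
}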
As to the slowdown, it's probably an artifact of the implementation being biased towards high degrees of parallelism that can consume all the threads. The extra threads are looking for work, and in the process of doing so they eat up some memory bandwidth and, on hyperthreaded processors, some core cycles. The Linux "Completely Fair Scheduler" has given us some grief in this area because sleep(0) no longer gives up the timeslice. It's also possible that the extra threads cause the OS to map software threads onto the machine less efficiently.
The root of the problem is a tricky tradeoff: running thieves aggressively enables them to pick up work faster if it appears, but also causes them to consume resources unnecessarily if no work is available. Putting thieves to sleep when there is no work available saves the resources, but adds significant overhead to spawn, since a spawning thread now has to check whether there are sleeping threads to be woken up. TBB pays this overhead, but it's not much for TBB because TBB's spawn overhead is much higher anyway. The current Cilk implementation does pay this tax: it only sleeps workers during sequential execution.
The best (but imperfect) advice I can give is to find more parallelism so that no worker threads loiter for long and cause trouble.

CUDA kernel launch parameters explained right?

Here I have tried to explain the CUDA launch parameter model (or execution configuration model) to myself using some pseudo-code, but I don't know if there are some big mistakes in it, so I hope someone can help review it and give me some advice. Thanks in advance.
Here it is:
/*
   Normally, we write a kernel function like this.
   Note, __global__ means this function will be called from host code
   and executed on the device, and a __global__ function can only return void.
   If any parameter is passed into a __global__ function, it is stored
   in shared memory on the device. So a kernel function is quite different from
   *normal* C/C++ functions. If I were the CUDA author, I would make the kernel
   function look even more different from a normal C function.
*/
__global__ void
kernel(float *arr_on_device, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
    }
}
/*
   After this definition, we can call this kernel function from our normal C/C++ code!!
   Doesn't that feel weird? Inconsistent?
   Normally, when I write C code, I think a lot about how the execution proceeds, down to
   the metal, and this one... it's like fragile code; it breaks the sequential
   thinking process in my mind.
   In order to make things feel normal again, I found a way to explain it: I expand the
   *__global__* function into some pseudo-code:
*/
#define __foreach(var, start, end) for (var = start; var < end; ++var)
__device__ int
__indexing() {
    const int blockId = blockIdx.x * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;

    return
        blockId * (blockDim.x * blockDim.y * blockDim.z) +
        threadIdx.z * (blockDim.x * blockDim.y) +
        threadIdx.x;
}
global_config =:
{
    /*
       Global configuration.
       Note the default values are all 1, so in the kernel code
       we can just ignore those dimensions.
    */
    gridDim.x = gridDim.y = gridDim.z = 1;
    blockDim.x = blockDim.y = blockDim.z = 1;
};

kernel =:
{
    /*
       I think CUDA does some bad evil-detail-covering things here.
       It's said that CUDA C is an extension of C, but in my mind
       CUDA C is more like C++, and the *<<<>>>* part is too tricky.
       For example:
       kernel<<<10, 32>>>(); means kernel will execute in 10 blocks, each having 32 threads.

       dim3 dimG(10, 1, 1);
       dim3 dimB(32, 1, 1);
       kernel<<<dimG, dimB>>>(); this is exactly the same thing as above.

       It's not C style, so is it C++ style? At first I thought this could be done with
       C++'s constructor mechanism, but when I checked the structure *dim3* I found no proper
       constructor for this. This just broke the semantics of both C and C++. I think
       forcing the user to write *kernel<<<dim3, dim3>>>* would be better, so I'd like to keep
       this rule in my future code.
    */
    gridDim  = dimG;
    blockDim = dimB;

    __foreach(blockIdx.z,  0, gridDim.z)
    __foreach(blockIdx.y,  0, gridDim.y)
    __foreach(blockIdx.x,  0, gridDim.x)
    __foreach(threadIdx.z, 0, blockDim.z)
    __foreach(threadIdx.y, 0, blockDim.y)
    __foreach(threadIdx.x, 0, blockDim.x)
    {
        const int idx = __indexing();
        if (idx < n) {
            arr_on_device[idx] = arr_on_device[idx] * arr_on_device[idx];
        }
    }
};
/*
   So, for me, gridDim & blockDim are like boundaries,
   e.g. gridDim.x is the upper bound of blockIdx.x; this is not that obvious for people like me.
*/
/* the declaration of dim3 from vector_types.h of CUDA/include */
struct __device_builtin__ dim3
{
    unsigned int x, y, z;
#if defined(__cplusplus)
    __host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
    __host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
    __host__ __device__ operator uint3(void) { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif /* __cplusplus */
};
typedef __device_builtin__ struct dim3 dim3;
CUDA DRIVER API
The CUDA Driver API v4.0 and above uses the following functions to control a kernel launch:
cuFuncSetCacheConfig
cuFuncSetSharedMemConfig
cuLaunchKernel
The following CUDA Driver API functions were used prior to the introduction of cuLaunchKernel in v4.0.
cuFuncSetBlockShape()
cuFuncSetSharedSize()
cuParamSet{Size,i,fv}()
cuLaunch
cuLaunchGrid
Additional information on these functions can be found in cuda.h.
CUresult CUDAAPI cuLaunchKernel(CUfunction f,
                                unsigned int gridDimX,
                                unsigned int gridDimY,
                                unsigned int gridDimZ,
                                unsigned int blockDimX,
                                unsigned int blockDimY,
                                unsigned int blockDimZ,
                                unsigned int sharedMemBytes,
                                CUstream hStream,
                                void **kernelParams,
                                void **extra);
cuLaunchKernel takes as parameters the entire launch configuration.
See NVIDIA Driver API [Execution Control] for more details.
CUDA KERNEL LAUNCH
cuLaunchKernel will
1. verify the launch parameters
2. change the shared memory configuration
3. change the local memory allocation
4. push a stream synchronization token into the command buffer to make sure two commands in the stream do not overlap
5. push the launch parameters into the command buffer
6. push the launch command into the command buffer
7. submit the command buffer to the device (on WDDM drivers this step may be deferred)
8. on WDDM the kernel driver will page all memory required into device memory
The GPU will
1. verify the command
2. send the commands to the compute work distributor
3. dispatch launch configuration and thread blocks to the SMs
When all thread blocks have completed, the work distributor will flush the caches to honor the CUDA memory model, and it will mark the kernel as completed so the next item in the stream can make forward progress.
The order in which thread blocks are dispatched differs between architectures.
Compute capability 1.x devices store the kernel parameters in shared memory.
Compute capability 2.0-3.5 devices store the kernel parameters in constant memory.
CUDA RUNTIME API
The CUDA Runtime is a C++ software library and build tool chain on top of the CUDA Driver API. The CUDA Runtime uses the following functions to control a kernel launch:
cudaConfigureCall
cudaFuncSetCacheConfig
cudaFuncSetSharedMemConfig
cudaLaunch
cudaSetupArgument
See NVIDIA Runtime API [Execution Control].
The <<<>>> CUDA language extension is the most common method used to launch a kernel.
During compilation nvcc will create a new CPU stub function for each kernel function called using <<<>>> and it will replace the <<<>>> with a call to the stub function.
For example
__global__ void kernel(float* buf, int j)
{
// ...
}
kernel<<<blocks,threads,0,myStream>>>(d_buf,j);
generates
void __device_stub__Z6kernelPfi(float *__par0, int __par1){__cudaSetupArgSimple(__par0, 0U);__cudaSetupArgSimple(__par1, 4U);__cudaLaunch(((char *)((void ( *)(float *, int))kernel)));}
You can inspect the generated files by adding --keep to your nvcc command line.
cudaLaunch calls cuLaunchKernel.
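As a sketch of what that chain ultimately boils down to, a roughly equivalent launch expressed directly with the driver API could look like this (error handling omitted; f is assumed to be the CUfunction handle for kernel obtained via cuModuleGetFunction, and blocks/threads are assumed to be plain integers):

// roughly what kernel<<<blocks, threads, 0, myStream>>>(d_buf, j) amounts to
void *kernelParams[] = { &d_buf, &j };   // addresses of the kernel arguments

cuLaunchKernel(f,
               blocks,  1, 1,            // gridDimX, gridDimY, gridDimZ
               threads, 1, 1,            // blockDimX, blockDimY, blockDimZ
               0,                        // sharedMemBytes (dynamic shared memory)
               myStream,                 // stream
               kernelParams,             // kernel arguments
               NULL);                    // extra (not used here)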
CUDA DYNAMIC PARALLELISM
CUDA CDP works similarly to the CUDA Runtime API described above.
By using <<<...>>>, you are launching a number of threads on the GPU. These threads are grouped into blocks and form a large grid. All the threads will execute the invoked kernel function code.
In the kernel function, built-in variables like threadIdx and blockIdx let the code know which thread it is running as and do its scheduled part of the work.
edit
Basically, <<<...>>> simplifies the configuration procedure for launching a kernel. Without it, one may have to call 4~5 APIs for a single kernel launch, just as in the OpenCL way, which uses only C99 syntax.
In fact you can check the CUDA driver APIs. They provide all those APIs, so you don't need to use <<<>>>.
Basically, the GPU is divided into separate "device" GPUs (e.g. the GeForce 690 has 2) -> multiple SMs (streaming multiprocessors) -> multiple CUDA cores. As far as I know, the dimensionality of a block or grid is just a logical assignment irrelevant to the hardware, but the total size of a block (x*y*z) is very important.
Threads in a block HAVE TO be on the same SM, to use its facilities for shared memory and synchronization. So you cannot have blocks containing more threads than there are CUDA cores in an SM.
If we have a simple scenario with 16 SMs of 32 CUDA cores each, a 31x1x1 block size and a 20x1x1 grid size, we will forfeit at least 1/32 of the processing power of the card. Every time a block runs, an SM will have only 31 of its 32 cores busy. Blocks will be loaded to fill up the SMs, the first 16 blocks will finish at roughly the same time, and as the first 4 SMs free up they will start processing the last 4 blocks (NOT necessarily blocks #17-20).
Comments and corrections are welcome.
