Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:
For example CPU code:
for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos * posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val * x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();

        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1) Store 32 values of in2 in the shared memory variable t1,
(2) For each value of i and in1[tid], calculate t2[i],
(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten. How can I get/store the values of out1 for each thread and write them to the output?
I have tried two approaches so far (suggested by #paseolatis on the NVIDIA forums):
(1) defined offset = tid*32; and replaced out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
float result = t1[i] * inval;
......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
out1[gcount] = t2[i] * in1[tid];
gcount = gcount + 1;
}
This is a memory race. gcount is a thread-local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. For this code to work correctly as written, gcount must be a global memory variable, and atomic memory updates must be used to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked in a comment how many output points there are per kernel launch).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
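For illustration, reading the count back on the host might look something like this (a minimal sketch, assuming the __device__ variable gcount declared above; with newer toolkits the symbol is passed directly rather than as a string):

int h_count = 0;
cudaMemcpyFromSymbol(&h_count, gcount, sizeof(int), 0, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_count);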
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol("gcount", &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double for loop can run in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they all overwrite it.
You should also move the code that copies into shared memory into a separate loop, with a __syncthreads() after it. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there is a better way to deal with it at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i += 32) {
        t1[i] = in2[i];
    }
}
__syncthreads();

for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
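A minimal sketch of that pattern, assuming each thread produces at most 32 hits and a global counter named gcount that is zeroed before the launch (both assumptions are mine, not from the original code; the shared-memory staging described above is omitted for brevity):

__device__ int gcount; // zeroed before the kernel launch

__global__ void test_g_perthread(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    float local_out[32]; // per-thread staging buffer (assumed capacity)
    int count = 0;
    float ival = in1[tid];

    for(int j = 0; j < posdir*pos; j++) {
        float val = in2[j] * ival;
        if (val > 0.0f && count < 32)
            local_out[count++] = val * ival;
    }

    // one atomic per thread reserves a contiguous slice of the output
    int base = atomicAdd(&gcount, count);
    for(int k = 0; k < count; k++)
        out1[base + k] = local_out[k];
}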
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
!!! HOMEWORK - ASSIGNMENT !!!
Please do not post code, as I would like to complete it myself; rather, if possible, point me in the right direction with general information, by pointing out mistakes in my thinking, or with other useful and relevant resources.
I have a method that creates my square npages * npages matrix hat of doubles for use in my PageRank algorithm.
I have implemented it with pthreads, with SIMD, and with both pthreads and SIMD. I have used the Xcode Instruments time profiler and found that the pthreads-only version is the fastest, next is the SIMD-only version, and the slowest is the version with both SIMD and pthreads.
As it is homework it can be run on multiple different machines; however, we were given the header #include <immintrin.h>, so it is to be assumed we can use up to AVX at least. We are given the number of threads the program will use as an argument to the program, and we store it in the global variable g_nthreads.
In my tests I have been running it on my machine, which is an Ivy Bridge with 4 hardware cores and 8 logical cores, and I have been testing with 4 threads and with 8 threads as the argument.
RUNNING TIMES:
SIMD ONLY:
331ms - for the construct_matrix_hat function
PTHREADS ONLY (8 threads):
70ms - each thread concurrently
SIMD & PTHREADS (8 threads):
110ms - each thread concurrently
What am I doing that is slowing it down more when using both forms of optimisation?
I will post each implementation:
All versions share these macros:
#define BIG_CHUNK (g_n2/g_nthreads)
#define SMALL_CHUNK (g_npages/g_nthreads)
#define MOD BIG_CHUNK - (BIG_CHUNK % 4)
#define IDX(a, b) ((a * g_npages) + b)
Pthreads:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been initialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
SIMD:
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
__m256d b = _mm256_loadu_pd(dampeners);
// Use simd to subtract values from each other
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix + i);
__m256d res = _mm256_mul_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Subtract values from each other that weren't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i];
}
double hats[4] = {HAT, HAT, HAT, HAT};
b = _mm256_loadu_pd(hats);
// Use simd to raise each value to the power 2
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix_hat + i);
__m256d res = _mm256_add_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Raise each value to the power 2 that wasn't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] += HAT;
}
return matrix_hat;
}
Pthreads & SIMD:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
size_t leftovers = start + MOD;
__m256d b1 = _mm256_loadu_pd(dampeners);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a1 = _mm256_loadu_pd(t_arg->m + i);
__m256d r1 = _mm256_mul_pd(a1, b1);
_mm256_storeu_pd(&t_arg->m_hat[i], r1);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
}
__m256d b2 = _mm256_loadu_pd(hats);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
__m256d r2 = _mm256_add_pd(a2, b2);
_mm256_storeu_pd(&t_arg->m_hat[i], r2);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] += hats[0];
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been initialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
I think it's because your SIMD code is horribly inefficient: it loops over the memory twice, instead of doing the add together with the multiply before storing. You didn't test SIMD vs. a scalar baseline, but if you had, you'd probably find that your SIMD code wasn't a speedup with a single thread either.
STOP READING HERE if you want to solve the rest of your homework yourself.
If you used gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorized into something like what you should have done with intrinsics. You even used restrict, so it might realize that the pointers can't overlap with each other, or with g_dampener.
// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener; // in case the compiler thinks one of the pointers might point at the global
double *restrict hat = t_arg->m_hat;
const double *restrict mat = t_arg->m;
... same loop, but using these locals instead of the struct members
It's probably not a problem for an FP loop, since double definitely can't alias with double *.
The coding style is also pretty nasty. You should give meaningful names to your __m256d variables whenever possible.
Also, you use malloc, which doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (have to free with _mm_free, not free(3)), or other options.
double* construct_matrix_hat(const double* matrix) {
// double* matrix_hat = malloc(sizeof(double) * g_n2);
double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
// double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener}; // This idiom is terrible, and might actually compile to code that stores it 4 times on the stack and then loads.
__m256d vdamp = _mm256_set1_pd(g_dampener); // will compile to a broadcast-load (vbroadcastsd)
__m256d vhat = _mm256_set1_pd(HAT);
size_t last_full_vector = g_n2 & ~3ULL; // don't load this from a global.
// it's better for the compiler to see how it's calculated from g_n2
// ??? Use simd to subtract values from each other // huh? this is a multiply, not a subtract. Also, everyone can see it's using SIMD, that part adds no new information
// if you really want to manually vectorize this, instead of using an OpenMP pragma or -O3 on the scalar loop, then:
for (size_t i = 0; i < last_full_vector; i += 4) {
__m256d vmat = _mm256_loadu_pd(matrix + i);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_store_pd(&matrix_hat[i], vres); // aligned store. Doesn't matter for performance.
}
#if 0
// Scalar cleanup
for (size_t i = last_full_vector; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i] + HAT;
}
#else
// assume that g_n2 >= 4, and do a potentially-overlapping unaligned vector
if (last_full_vector != g_n2) {
// Or have this always run, and have the main loop stop one element sooner (so this overlaps by 0..3 instead of by 1..3 with a conditional)
assert(g_n2 >= 4);
__m256d vmat = _mm256_loadu_pd(matrix + g_n2 - 4);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_storeu_pd(&matrix_hat[g_n2-4], vres);
}
#endif
return matrix_hat;
}
This version compiles (after defining a couple globals) to the asm we expect. BTW, normal people pass sizes around as function arguments. This is another way of avoiding optimization-failure due to C aliasing rules.
Anyway, really your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organization, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations concern).
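For example, a hedged sketch of the OpenMP route (my addition, not from the original answer; it assumes g_n2, g_dampener and HAT as above, and compilation with -O3 -fopenmp -march=ivybridge):

#include <stdlib.h>   // aligned_alloc (C11)

double* construct_matrix_hat(const double* matrix) {
    double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
    const double dampener = g_dampener; // local copy helps the compiler's aliasing analysis
    #pragma omp parallel for simd
    for (size_t i = 0; i < g_n2; i++) {
        matrix_hat[i] = dampener * matrix[i] + HAT;
    }
    return matrix_hat;
}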
I have the following piece of code
#include "stdio.h"
#include "stdlib.h"
#include <string.h>
#define MAXBINS 8
void swap_long(unsigned long int **x, unsigned long int **y){
unsigned long int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void swap(unsigned int **x, unsigned int **y){
unsigned int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void truncated_radix_sort(unsigned long int *morton_codes,
unsigned long int *sorted_morton_codes,
unsigned int *permutation_vector,
unsigned int *index,
int *level_record,
int N,
int population_threshold,
int sft, int lv){
int BinSizes[MAXBINS] = {0};
unsigned int *tmp_ptr;
unsigned long int *tmp_code;
level_record[0] = lv; // record the level of the node
if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the permutation vector
memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes
return;
}
else{
// Find which child each point belongs to
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
// scan prefix
int offset = 0, i = 0;
for(i=0; i<MAXBINS; i++){
int ss = BinSizes[i];
BinSizes[i] = offset;
offset += ss;
}
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
permutation_vector[BinSizes[ii]] = index[j];
sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
BinSizes[ii]++;
}
//swap the index pointers
swap(&index, &permutation_vector);
//swap the code pointers
swap_long(&morton_codes, &sorted_morton_codes);
/* Call the function recursively to split the lower levels */
offset = 0;
for(i=0; i<MAXBINS; i++){
int size = BinSizes[i] - offset;
truncated_radix_sort(&morton_codes[offset],
&sorted_morton_codes[offset],
&permutation_vector[offset],
&index[offset], &level_record[offset],
size,
population_threshold,
sft-3, lv+1);
offset += size;
}
}
}
I tried to make this block
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
parallel by substituting it with the following
int rc,j;
pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
pthread_mutex_init(&bin_mtx, NULL);
for (j = 0; j < NTHREADS; j++){
belongs[j].n = NTHREADS;
belongs[j].N = N;
belongs[j].tid = j;
belongs[j].sft = sft;
belongs[j].BinSizes = BinSizes;
belongs[j].mcodes = morton_codes;
rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
}
for (j = 0; j < NTHREADS; j++){
rc = pthread_join(thread[j], NULL);
}
and defining these outside the recursive function
typedef struct{
int n, N, tid, sft;
int *BinSizes;
unsigned long int *mcodes;
}belong;
pthread_mutex_t bin_mtx;
void * belong_wrapper(void *arg){
int n, N, tid, sft, j;
int *BinSizes;
unsigned int ii;
unsigned long int *mcodes;
n = ((belong *)arg)->n;
N = ((belong *)arg)->N;
tid = ((belong *)arg)->tid;
sft = ((belong *)arg)->sft;
BinSizes = ((belong *)arg)->BinSizes;
mcodes = ((belong *)arg)->mcodes;
for (j = tid; j<N; j+=n){
ii = (mcodes[j] >> sft) & 0x07;
pthread_mutex_lock(&bin_mtx);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx);
}
return NULL;
}
However it takes a lot more time than the serial one to execute... Why is this happening? What should I change?
Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can call BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence but incurring the extra overhead of creating and destroying threads.
There are several options I can think of for you (there are probably more):
(1) Do as #Chris suggests and make each thread update one portion of BinSizes. This might not be viable depending on the properties of the calculation you're using to compute ii.
(2) Create multiple mutexes representing different partitions of BinSizes. For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4, and another for elements 5-9, then use them in your thread something like so:
if (ii < 5) {
mtx_index = 0;
} else {
mtx_index = 1;
}
pthread_mutex_lock(&bin_mtx[mtx_index]);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx[mtx_index]);
You could generalize this idea to any size of BinSizes and any range. Potentially you could have a different mutex for each array element; of course, then you're opening yourself up to the overhead of creating each of these mutexes, and the possibility of deadlock if someone tries to lock several of them at once, etc.
(3) Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned, using threads this way is subject to some level of diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".
tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallelizable, and this code has hardly any actual concurrency.
You spin a mutex for every (cheap) integer operation on BinSizes. This will crush any parallelism, because all your threads are serialized on this.
The few instructions you can run concurrently (the for loop and a couple of operations on the morton code array) are much cheaper than (un)locking a mutex: even using an atomic increment (if available) would be more expensive than the un-synchronized part.
One fix would be to give each thread its own output array, and combine them after all tasks are complete.
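For illustration, a hedged sketch of that fix (not the original code; the struct and names are invented, and it assumes the per-thread bins are zero-initialised when the structs are set up):

typedef struct {
    int tid, nthreads, N, sft;
    unsigned long int *mcodes;
    int local_bins[MAXBINS]; // private histogram, no locking needed
} belong_local;

void *belong_local_wrapper(void *arg) {
    belong_local *b = (belong_local *)arg;
    for (int j = b->tid; j < b->N; j += b->nthreads) {
        unsigned int ii = (b->mcodes[j] >> b->sft) & 0x07;
        b->local_bins[ii]++;
    }
    return NULL;
}

// After pthread_join, merge the private histograms serially:
// for (t = 0; t < NTHREADS; t++)
//     for (i = 0; i < MAXBINS; i++)
//         BinSizes[i] += belongs[t].local_bins[i];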
Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.
Even if you do this, you need to tune the number of threads according to how many (free) cores you have. If you do this in a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.
Oh, and you're leaking memory.
I would like to know how to avoid branch divergence in string searching with CUDA, and whether there is a good way to do it.
At the moment I have tried to adapt Knuth-Morris-Pratt to GPUs, but I believe there is a lot of divergence, since each thread is looking at N letters and comparing each time whether the letter corresponds to the first letter of the word I'm searching for.
int tid = blockDim.x * blockIdx.x + threadIdx.x;
int startId = tid * 64;
int x = 0;
for(int i = 0; i < 64; i++){
if(array[startId + i] == 'C'){
x++;
}
}
I use this dummy code to find the letter 'C', but I could also make a second pass to search for more letters as well.
You could try adding the results of the comparisons directly into the value like so:
x+= (array[startId + i] == 'C');
But I believe this may still branch. My solution would be to store the array values for a block in shared memory, then assign each thread in the block a desired character, place the results into their own shared memory slot, and then reduce.
__shared__ char l_array[BLOCK_SIZE];
__shared__ char l_results[BLOCK_SIZE];
int bid = blockDim.x * blockIdx.x;
int lid = threadIdx.x;
int tid = bid + lid;
int x=0;
char desired_char = get_character(lid);
l_array[lid] = -1;
//Store global values in shared memory
if(tid < array_size){
l_array[lid] = array[tid];
}
__syncthreads();
//Check local memory for desired character
for(int i = 0; i < BLOCK_SIZE; i++)
x+=(l_array[i] == desired_char);
//Store results into shared memory
l_results[lid] = x;
__syncthreads();
//Then reduce (poorly)
if(lid==0){
for(int i = 0; i < BLOCK_SIZE; i++)
x+= l_results[i];
}
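The final loop above reduces l_results in a single thread; a tree reduction in shared memory keeps more of the block busy. A hedged sketch, assuming BLOCK_SIZE is a power of two and that l_results is widened to int so the per-block counts cannot overflow a char:

//Tree reduction: halve the number of active threads each step
for(int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1){
    if(lid < stride)
        l_results[lid] += l_results[lid + stride];
    __syncthreads();
}
//l_results[0] now holds the block-wide count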
Although I don't know the algorithm per se, I'm just guessing, but something along these lines might help you figure this out.
I have two arrays, a and b, and I would like to compute the "min convolution" to produce result c. Simple pseudo code looks like the following:
for i = 0 to size(a)+size(b)
c[i] = inf
for j = 0 to size(a)
if (i - j >= 0) and (i - j < size(b))
c[i] = min(c[i], a[j] + b[i-j])
(edit: changed loops to start at 0 instead of 1)
If the min were instead a sum, we could use a Fast Fourier Transform (FFT), but in the min case, there is no such analog. Instead, I'd like to make this simple algorithm as fast as possible by using a GPU (CUDA). I'd be happy to find existing code that does this (or code that implements the sum case without FFTs, so that I could adapt it for my purposes), but my search so far hasn't turned up any good results. My use case will involve a's and b's that are of size between 1,000 and 100,000.
Questions:
Does code to do this efficiently already exist?
If I am going to implement this myself, structurally, how should the CUDA kernel look so as to maximize efficiency? I've tried a simple solution where each c[i] is computed by a separate thread, but this doesn't seem like the best way. Any tips in terms of how to set up thread block structure and memory access patterns?
An alternative which might be useful for large a and b would be to use a block per output entry in c. Using a block allows for memory coalescing, which will be important in what is a memory bandwidth limited operation, and a fairly efficient shared memory reduction can be used to combine per thread partial results into a final per block result. Probably the best strategy is to launch as many blocks per MP as will run concurrently and have each block emit multiple output points. This eliminates some of the scheduling overheads associated with launching and retiring many blocks with relatively low total instruction counts.
An example of how this might be done:
#include <math.h>
template<int bsz>
__global__ __launch_bounds__(512)
void minconv(const float *a, int sizea, const float *b, int sizeb, float *c)
{
__shared__ volatile float buff[bsz];
for(int i = blockIdx.x; i<(sizea + sizeb); i+=gridDim.x) {
float cval = INFINITY;
for(int j=threadIdx.x; j<sizea; j+= blockDim.x) {
int t = i - j;
if ((t>=0) && (t<sizeb))
cval = min(cval, a[j] + b[t]);
}
buff[threadIdx.x] = cval; __syncthreads();
if (bsz > 256) {
if (threadIdx.x < 256)
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+256]);
__syncthreads();
}
if (bsz > 128) {
if (threadIdx.x < 128)
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+128]);
__syncthreads();
}
if (bsz > 64) {
if (threadIdx.x < 64)
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+64]);
__syncthreads();
}
if (threadIdx.x < 32) {
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+32]);
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+16]);
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+8]);
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+4]);
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+2]);
buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+1]);
if (threadIdx.x == 0) c[i] = buff[0];
}
}
}
// Instances for all valid block sizes.
template __global__ void minconv<64>(const float *, int, const float *, int, float *);
template __global__ void minconv<128>(const float *, int, const float *, int, float *);
template __global__ void minconv<256>(const float *, int, const float *, int, float *);
template __global__ void minconv<512>(const float *, int, const float *, int, float *);
[disclaimer: not tested or benchmarked, use at own risk]
This is single precision floating point, but the same idea should work for double precision floating point. For integer, you would need to replace the C99 INFINITY macro with something like INT_MAX or LONG_MAX, but the principle remains the same otherwise.
A faster version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
int i = (threadIdx.x + blockIdx.x * blockDim.x);
int idT = threadIdx.x;
int out,j;
__shared__ double c_local [512];
c_local[idT] = c[i];
out = (i > sa) ? sa : i + 1;
j = (i > sb) ? i - sb + 1 : 1;
for(; j < out; j++)
{
if(c_local[idT] > a[j] + b[i-j])
c_local[idT] = a[j] + b[i-j];
}
c[i] = c_local[idT];
}
Benchmark:
Size A Size B Size C Time (s)
1000 1000 2000 0.0008
10k 10k 20k 0.0051
100k 100k 200k 0.3436
1M 1M 1M 43,327
Old Version,
For sizes between 1000 and 100000, I tested with this naive version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
int size = sa+sb;
int idT = (threadIdx.x + blockIdx.x * blockDim.x);
int out,j;
for(int i = idT; i < size; i += blockDim.x * gridDim.x)
{
if(i > sa) out = sa;
else out = i + 1;
if(i > sb) j = i - sb + 1;
else j = 1;
for(; j < out; j++)
{
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
}
}
}
I populated the arrays a and b with some random double numbers, and c with 999999 (just for testing). I validated the c array (on the CPU) using your function (without any modifications).
I also removed the conditionals from inside of the inner loop, so it will only test them once.
I am not 100% sure, but I think the following modification makes sense. Since you had i - j >= 0, which is the same as i >= j, this means that as soon as j > i it will never enter this block 'X' (since j only increases):
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
So I calculate, in the variable out, the loop bound: if i > sa, the loop will finish when j == sa; if i <= sa, the loop will finish (earlier) at i + 1 because of the condition i >= j.
The other condition, i - j < size(b), means that you will start executing block 'X' when i > size(b) + 1, since j always starts at 1. So we can initialise j with the value at which it should begin, thus:
if(i > sb) j = i - sb + 1;
else j = 1;
See if you can test this version with real arrays of data, and give me feedback. Also, any improvements are welcome.
EDIT : A new optimization can be implemented, but this one does not make much of a difference.
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
we can eliminate the if, by:
double add;
...
for(; j < out; j++)
{
add = a[j] + b[i-j];
c[i] = (c[i] < add) * c[i] + (add <= c[i]) * add;
}
Having:
if(a > b) c = b;
else c = a;
is the same as having c = (a < b) * a + (b <= a) * b:
if a > b, then c = 0 * a + 1 * b => c = b;
if a <= b, then c = 1 * a + 0 * b => c = a.
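As a side note (my addition, not from the original answer): device code also provides fmin() for doubles, which typically compiles to a branchless select, so the same effect can be written more simply using the same variables:

add = a[j] + b[i-j];
c[i] = fmin(c[i], add);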
Benchmark:
Size A Size B Size C Time (s)
1000 1000 2000 0.0013
10k 10k 20k 0.0051
100k 100k 200k 0.4436
1M 1M 1M 47,327
I am measuring the time of copying from CPU to GPU, running the kernel and copying from GPU to CPU.
GPU Specifications
Device Tesla C2050
CUDA Capability Major/Minor 2.0
Global Memory 2687 MB
Cores 448 CUDA Cores
Warp size 32
I have used your algorithm. I think it'll help you.
const int Length=1000;
__global__ void OneD(float *Ad,float *Bd,float *Cd){
int i=blockIdx.x;
int j=threadIdx.x;
Cd[i]=99999.99;
for(int k=0;k<Length/500;k++){
while(((i-j)>=0)&&(i-j<Length)&&Cd[i+k*Length]>Ad[j+k*Length]+Bd[i-j]){
Cd[i+k*Length]=Ad[j+k*Length]+Bd[i-j];
}}}
I have used 500 threads per block and 500 blocks per grid, as the number of threads per block on my device is restricted to 512. I have taken the size of all the arrays as Length (= 1000).
Working:
i stores the Block Index and j stores the Thread Index.
The for loop is used because the number of threads is less than the size of the arrays.
The while loop is used for iterating Cd[n].
I have not used shared memory because I have used lots of blocks and threads, so the amount of shared memory required for each block is low.
PS: If your device supports more Threads and Blocks, replace k<Length/500 with k<Length/(supported number of threads)
Below is the skeleton of my setup. Executed like this it doesn't give the correct result. This is most likely due to the async data transfers, which haven't finished when the kernel uses them. I implemented a "failsafe" version with a preprocessor if-else statement. When the else branch is compiled, the program runs fine. I don't get it - why is that?
The in1, out1, ... are just placeholders. Of course they point to different containers in each iteration of the for loop, so that async transfer can take place. But within one iteration, the out1 used by the transfer and the one used by the kernel are the same.
cudaStream_t streams[2];
cudaEvent_t evCopied;
cudaStreamCreate(&streams[0]); // TRANSFER
cudaStreamCreate(&streams[1]); // KERNEL
cudaEventCreate(&evCopied);
// many iterations
for () {
// Here I want overlapping of transfers with previous kernel
cudaMemcpyAsync( out1, in1, size1, cudaMemcpyDefault, streams[0] );
cudaMemcpyAsync( out2, in2, size2, cudaMemcpyDefault, streams[0] );
cudaMemcpyAsync( out3, in3, size3, cudaMemcpyDefault, streams[0] );
#if 1
// make sure host thread doesn't "run away"
cudaStreamSynchronize( streams[1] );
cudaEventRecord( evCopied , streams[0] );
cudaStreamWaitEvent( streams[1] , evCopied , 0);
#else
// this gives the correct results
cudaStreamSynchronize( streams[0] );
cudaStreamSynchronize( streams[1] );
#endif
kernel<<< grid , sh_mem , streams[1] >>>(out1,out2,out3);
}
Please don't post answers suggesting a rearrangement of the setup, something like dividing the kernel into several ones and issuing them in separate streams.
What you are doing -- or at least the use of an event to synchronize two streams -- should work. It is basically impossible to say why your actual code doesn't work because you have chosen not to post it, and the devil is always in the detail.
However, here is a complete, runnable example which I think is using the streams API in a fashion similar to what you are trying to do and which works correctly:
#include <cstdio>
typedef unsigned int uint;
template<uint bsz>
__global__ void kernel(uint * a, uint * b, uint * c, const uint N)
{
__shared__ volatile uint buf[bsz];
uint tid = threadIdx.x + blockIdx.x * blockDim.x;
uint stride = blockDim.x * gridDim.x;
uint val = 0;
for(uint i=tid; i<N; i+=stride) {
val += a[i] + b[i];
}
buf[threadIdx.x] = val; __syncthreads();
#pragma unroll
for(uint i=(threadIdx.x+warpSize); (threadIdx.x<warpSize)&&(i<bsz); i+=warpSize)
buf[threadIdx.x] += buf[i];
if (threadIdx.x < 16) buf[threadIdx.x] += buf[threadIdx.x+16];
if (threadIdx.x < 8) buf[threadIdx.x] += buf[threadIdx.x+8];
if (threadIdx.x < 4) buf[threadIdx.x] += buf[threadIdx.x+4];
if (threadIdx.x < 2) buf[threadIdx.x] += buf[threadIdx.x+2];
if (threadIdx.x == 0) c[blockIdx.x] += buf[0] + buf[1];
}
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main(void)
{
const int nruns = 16, ntransfers = 3;
const int Nb = 32, Nt = 192, Nr = 3000, N = Nr * Nb * Nt;
const size_t szNb = Nb * sizeof(uint), szN = size_t(N) * sizeof(uint);
size_t sz[4] = { szN, szN, szNb, szNb };
uint * d[ntransfers+1];
for(int i=0; i<ntransfers+1; i++)
gpuErrchk(cudaMallocHost((void **)&d[i], sz[i]));
uint * a = d[0], * b = d[1], * c = d[2], * out = d[3];
for(uint i=0; i<N; i++) {
a[i] = b[i] = 1;
if (i<Nb) c[i] = 0;
}
uint * _d[3];
for(int i=0; i<ntransfers; i++)
gpuErrchk(cudaMalloc((void **)&_d[i], sz[i]));
uint * _a = _d[0], * _b = _d[1], * _c = _d[2];
cudaStream_t stream[2];
for (int i = 0; i < 2; i++)
gpuErrchk(cudaStreamCreate(&stream[i]));
cudaEvent_t sync_event;
gpuErrchk(cudaEventCreate(&sync_event));
uint results[nruns];
for(int j=0; j<nruns; j++) {
for(int i=0; i<ntransfers; i++)
gpuErrchk(cudaMemcpyAsync(_d[i], d[i], sz[i], cudaMemcpyHostToDevice, stream[0]));
gpuErrchk(cudaEventRecord(sync_event, stream[0]));
gpuErrchk(cudaStreamWaitEvent(stream[1], sync_event, 0));
kernel<Nt><<<Nb, Nt, 0, stream[1]>>>(_a, _b, _c, N);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaMemcpyAsync(out, _c, szNb, cudaMemcpyDeviceToHost, stream[1]));
gpuErrchk(cudaStreamSynchronize(stream[1]));
results[j] = uint(0);
for(int i=0; i<Nb; i++) results[j]+= out[i];
}
for(int j=0; j<nruns; j++)
fprintf(stdout, "%3d: ans = %u\n", j, results[j]);
gpuErrchk(cudaDeviceReset());
return 0;
}
The kernel is a "fused vector addition/reduction", just nonsense, but it relies on the last of the three inputs being zeroed prior to kernel execution to produce the correct answer, which should simply be twice the number of input data points. As in your example, the kernel execution and asynchronous input array copying are in different streams, so the copying and the execution can potentially overlap. There is no sane reason to copy the first two large inputs at every iteration in this case, other than to introduce delay before the last copy (which is the critical one) is done and increase the chance it will incorrectly overlap with the kernel. This might be where you are going wrong, because I don't believe the CUDA memory model guarantees that it is safe to asynchronously modify memory being accessed by a running kernel. If that is what you are trying to do, then expect it to fail. But without seeing real code, it is impossible to say more.
With that out of the way, you can see for yourself that the kernel won't produce the correct result without the cudaStreamWaitEvent to synchronize the two streams prior to the kernel launch. The only difference between your pseudocode and this example is the location of the cudaStreamSynchronize on the execution stream. Here I placed it after the kernel launch, in order to make sure the kernel finishes before the transfer that gathers the results back to the host. That could be the critical difference, but again, no real code equals no real code analysis...
All I can suggest is that you play with this example to get a feel for how it works. I understand that very recent versions of Nsight for Windows can profile asynchronous code without the profiling artificially serializing the execution streams. That might help you diagnose your problem if you can't work it out from this example or from your own code.