I have two arrays, a and b, and I would like to compute the "min convolution" to produce result c. Simple pseudo code looks like the following:
for i = 0 to size(a)+size(b)
    c[i] = inf
    for j = 0 to size(a)
        if (i - j >= 0) and (i - j < size(b))
            c[i] = min(c[i], a[j] + b[i-j])
(edit: changed loops to start at 0 instead of 1)
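For checking any GPU version, a plain C baseline of the pseudocode is handy. The following is a minimal sketch, assuming 0-based arrays and an output of size(a) + size(b) - 1 entries (the last index in the pseudocode has no valid pair and would stay at inf). The name min_conv_ref is only illustrative.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Reference min convolution: c[i] = min over j of a[j] + b[i - j].
 * c must have room for sa + sb - 1 entries (the number of valid outputs). */
void min_conv_ref(const float *a, size_t sa,
                  const float *b, size_t sb, float *c)
{
    assert(sa > 0 && sb > 0);
    for (size_t i = 0; i + 1 < sa + sb; i++) {
        float best = INFINITY;
        for (size_t j = 0; j < sa; j++) {
            /* j <= i keeps i - j non-negative; i - j < sb keeps b in range */
            if (j <= i && i - j < sb) {
                float s = a[j] + b[i - j];
                if (s < best)
                    best = s;
            }
        }
        c[i] = best;
    }
}
```

For a = {1, 2} and b = {3, 4, 10} this produces c = {4, 5, 6, 12}.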
If the min were instead a sum, we could use a Fast Fourier Transform (FFT), but in the min case, there is no such analog. Instead, I'd like to make this simple algorithm as fast as possible by using a GPU (CUDA). I'd be happy to find existing code that does this (or code that implements the sum case without FFTs, so that I could adapt it for my purposes), but my search so far hasn't turned up any good results. My use case will involve a's and b's that are of size between 1,000 and 100,000.
Questions:
Does code to do this efficiently already exist?
If I am going to implement this myself, structurally, how should the CUDA kernel look so as to maximize efficiency? I've tried a simple solution where each c[i] is computed by a separate thread, but this doesn't seem like the best way. Any tips in terms of how to set up thread block structure and memory access patterns?
An alternative which might be useful for large a and b would be to use a block per output entry in c. Using a block allows for memory coalescing, which will be important in what is a memory bandwidth limited operation, and a fairly efficient shared memory reduction can be used to combine per thread partial results into a final per block result. Probably the best strategy is to launch as many blocks per MP as will run concurrently and have each block emit multiple output points. This eliminates some of the scheduling overheads associated with launching and retiring many blocks with relatively low total instruction counts.
An example of how this might be done:
#include <math.h>

template<int bsz>
__global__ __launch_bounds__(512)
void minconv(const float *a, int sizea, const float *b, int sizeb, float *c)
{
    __shared__ volatile float buff[bsz];

    // One block per output point; the grid strides over the output array.
    for (int i = blockIdx.x; i < (sizea + sizeb); i += gridDim.x) {
        float cval = INFINITY;

        // Each thread takes a strided slice of a, so the loads of a coalesce.
        for (int j = threadIdx.x; j < sizea; j += blockDim.x) {
            int t = i - j;
            if ((t >= 0) && (t < sizeb))
                cval = min(cval, a[j] + b[t]);
        }

        buff[threadIdx.x] = cval; __syncthreads();

        // Shared memory tree reduction of the per-thread minima.
        if (bsz > 256) {
            if (threadIdx.x < 256)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+256]);
            __syncthreads();
        }
        if (bsz > 128) {
            if (threadIdx.x < 128)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+128]);
            __syncthreads();
        }
        if (bsz > 64) {
            if (threadIdx.x < 64)
                buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+64]);
            __syncthreads();
        }

        // Final warp: relies on warp-synchronous execution and the volatile
        // buffer. On architectures with independent thread scheduling
        // (Volta and later), add __syncwarp() between the steps.
        if (threadIdx.x < 32) {
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+32]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+16]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+8]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+4]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+2]);
            buff[threadIdx.x] = min(buff[threadIdx.x], buff[threadIdx.x+1]);
            if (threadIdx.x == 0) c[i] = buff[0];
        }
        __syncthreads(); // buff is rewritten for the next output point
    }
}
// Instances for all valid block sizes.
template __global__ void minconv<64>(const float *, int, const float *, int, float *);
template __global__ void minconv<128>(const float *, int, const float *, int, float *);
template __global__ void minconv<256>(const float *, int, const float *, int, float *);
template __global__ void minconv<512>(const float *, int, const float *, int, float *);
[disclaimer: not tested or benchmarked, use at own risk]
This is single precision floating point, but the same idea should work for double precision floating point. For integer, you would need to replace the C99 INFINITY macro with something like INT_MAX or LONG_MAX, but the principle remains the same otherwise.
A faster version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
int i = (threadIdx.x + blockIdx.x * blockDim.x);
int idT = threadIdx.x;
int out,j;
__shared__ double c_local [512];
c_local[idT] = c[i];
out = (i > sa) ? sa : i + 1;
j = (i > sb) ? i - sb + 1 : 1;
for(; j < out; j++)
{
if(c_local[idT] > a[j] + b[i-j])
c_local[idT] = a[j] + b[i-j];
}
c[i] = c_local[idT];
}
**Benchmark:**
Size A   Size B   Size C   Time (s)
1000     1000     2000     0.0008
10k      10k      20k      0.0051
100k     100k     200k     0.3436
1M       1M       1M       43,327
Old version:
For sizes between 1000 and 100000, I tested with this naive version:
__global__ void convAgB(double *a, double *b, double *c, int sa, int sb)
{
int size = sa+sb;
int idT = (threadIdx.x + blockIdx.x * blockDim.x);
int out,j;
for(int i = idT; i < size; i += blockDim.x * gridDim.x)
{
if(i > sa) out = sa;
else out = i + 1;
if(i > sb) j = i - sb + 1;
else j = 1;
for(; j < out; j++)
{
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
}
}
}
I populated the arrays a and b with random double values and c with 999999 (just for testing). I validated the c array on the CPU using your function (without any modifications).
I also moved the conditionals out of the inner loop, so they are evaluated only once per output element.
I am not 100% sure, but I think the following modification makes sense. You had the condition i - j >= 0, which is the same as i >= j. Since j only increases, once j > i the loop never enters this block 'X' again:
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
So I store the loop's upper bound in the variable out: if i > sa, the loop finishes when j == sa; otherwise it finishes earlier, at j == i + 1, because of the condition i >= j.
The other condition, i - j < size(b), means that block 'X' only starts executing once j > i - size(b). Since j would otherwise always start at 1, we can start j directly at the first value that satisfies the condition:
if(i > sb) j = i - sb + 1;
else j = 1;
See if you can test this version with real arrays of data, and give me feedback. Also, any improvements are welcome.
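The bounds derived above can be written down for the 0-based indexing of the edited question. This is only a sketch under that assumption (the kernels in this answer start j at 1, following the original 1-based pseudocode); min_conv_at, j_lo and j_hi are illustrative names.

```c
#include <assert.h>
#include <math.h>

/* Min convolution for one output index i, with the range test hoisted
 * out of the inner loop (0-based indexing, as in the edited question). */
float min_conv_at(const float *a, int sa, const float *b, int sb, int i)
{
    assert(i >= 0 && i < sa + sb - 1);
    int j_lo = (i >= sb) ? i - sb + 1 : 0;  /* smallest j with i - j < sb */
    int j_hi = (i < sa) ? i : sa - 1;       /* largest j with j <= i and j < sa */
    float best = INFINITY;
    for (int j = j_lo; j <= j_hi; j++) {
        float s = a[j] + b[i - j];
        if (s < best)
            best = s;
    }
    return best;
}
```

Every iteration of the inner loop now does useful work, and the loop body contains no range checks.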
EDIT : A new optimization can be implemented, but this one does not make much of a difference.
if(c[i] > a[j] + b[i-j])
c[i] = a[j] + b[i-j];
we can eliminate the if, by:
double add;
...
for(; j < out; j++)
{
add = a[j] + b[i-j];
c[i] = (c[i] < add) * c[i] + (add <= c[i]) * add;
}
Having:
if(a > b) c = b;
else c = a;
is the same as having c = (a < b) * a + (b <= a) * b:
if a >= b then c = 0 * a + 1 * b; => c = b (which equals a when a == b);
if a < b then c = 1 * a + 0 * b; => c = a;
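As a small host-side check of the identity, here is a sketch in plain C. The name select_min is hypothetical. One caveat worth knowing: if either operand is infinite, the product 0 * inf is NaN, so this trick is not safe when c is initialised to INFINITY (the test above used 999999 instead).

```c
#include <assert.h>
#include <math.h>

/* Branch-free minimum built from comparison results (each is 0 or 1 in C). */
double select_min(double x, double y)
{
    return (x < y) * x + (y <= x) * y;
}
```

For finite inputs this returns the smaller value, for example select_min(2.0, 1.0) is 1.0. The standard fmin from math.h is a simpler alternative that also handles infinities.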
**Benchmark:**
Size A   Size B   Size C   Time (s)
1000     1000     2000     0.0013
10k      10k      20k      0.0051
100k     100k     200k     0.4436
1M       1M       1M       47,327
I am measuring the time of copying from CPU to GPU, running the kernel and copying from GPU to CPU.
GPU Specifications
Device Tesla C2050
CUDA Capability Major/Minor 2.0
Global Memory 2687 MB
Cores 448 CUDA Cores
Warp size 32
I have used your algorithm. I think this will help you.
const int Length=1000;
__global__ void OneD(float *Ad,float *Bd,float *Cd){
int i=blockIdx.x;
int j=threadIdx.x;
Cd[i]=99999.99;
for(int k=0;k<Length/500;k++){
while(((i-j)>=0)&&(i-j<Length)&&Cd[i+k*Length]>Ad[j+k*Length]+Bd[i-j]){
Cd[i+k*Length]=Ad[j+k*Length]+Bd[i-j];
}}}
I used 500 threads per block and 500 blocks per grid. Since the number of threads per block on my device is limited to 512, I used 500 threads. All arrays have size Length (= 1000).
Working:
i stores the block index and j stores the thread index.
The for loop is needed because the number of threads is smaller than the size of the arrays.
The while loop is used for iterating Cd[n].
I have not used shared memory because I launch many blocks and threads, so the amount of shared memory required per block is low.
PS: If your device supports more threads and blocks, replace k<Length/500 with k<Length/(supported number of threads)
Related
I've written a simple benchmark to test and measure the single-precision fused multiply add performance of both processors, and OpenCL devices.
I recently added SMP support using pthreads. The CPU side is simple: it generates a couple of random matrices for inputs to ensure that the work can't be optimized out by the compiler.
The function cpu_result_matrix() creates the threads, and blocks until every thread returns using pthread_join(). It's this function that's timed to determine the performance of the device.
static float *cpu_result_matrix(struct bench_buf *in)
{
const unsigned tc = nthreads();
struct cpu_res_arg targ[tc];
float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));
for (unsigned i = 0; i < tc; i++) {
targ[i].tid = i;
targ[i].tc = tc;
targ[i].in = in;
targ[i].ret = res;
}
pthread_t cpu_res_t[tc];
for (unsigned i = 0; i < tc; i++)
pthread_create(&cpu_res_t[i], NULL,
cpu_result_matrix_mt, (void *)&targ[i]);
for (unsigned i = 0; i < tc; i++)
pthread_join(cpu_res_t[i], NULL);
return res;
}
The actual kernel is in cpu_result_matrix_mt():
static void *cpu_result_matrix_mt(void *v_arg)
{
struct cpu_res_arg *arg = (struct cpu_res_arg *)v_arg;
const unsigned buff_size = BUFFER_SIZE;
const unsigned work_size = buff_size / arg->tc;
const unsigned work_start = arg->tid * work_size;
const unsigned work_end = work_start + work_size;
const unsigned round_cnt = ROUNDS_PER_ITERATION;
float lres;
for (unsigned i = work_start; i < work_end; i++) {
lres = 0;
float a = arg->in->a[i], b = arg->in->b[i], c = arg->in->c[i];
for (unsigned j = 0; j < round_cnt; j++) {
lres += a * ((b * c) + b);
lres += b * ((c * a) + c);
lres += c * ((a * b) + a);
}
arg->ret[i] = lres;
}
return NULL;
}
I noticed that the reported time taken for the kernel was roughly the same, regardless of how much I unrolled the inner loop.
To investigate, I made the kernel much larger by manually unrolling the inner loop until I could easily measure the wall time of the program running.
In the process, I observed that (it appears) the threads are returning before the kernel does the work it actually should do, which causes pthread_join() to stop blocking the main thread, and the execution time to appear to be much lower than it really is. I don't understand how this is possible, or how the program could continue to run and output correct results under these conditions.
Htop shows that the threads are still very much alive and working. I checked the return value of pthread_join(), and it was successful after every run. I got curious, and put a print statement in at the end of the kernel, before the return statement, and sure enough, each thread printed that it finished much sooner than it should have.
I watched ps while running the program, and it showed one thread, followed by three more, another five, then it dropped down to four.
I'm baffled, I've never seen threads act like this before.
The full source for my modified test branch is here: https://github.com/jakogut/clperf/tree/test
> In the process, I observed that (it appears) the threads are returning before the kernel does the work it actually should do, which causes pthread_join() to stop blocking the main thread, and the execution time to appear to be much lower than it really is.
I'm not sure how you determine this. But looking at the assembly with -Ofast shows that
res[i] += a * ((b * c) + b);
res[i] += b * ((c * a) + c);
res[i] += c * ((a * b) + a);
is calculated before the inner loop. The inner loop is effectively
float t = a * ((b * c) + b) + b * ((c * a) + c) + c * ((a * b) + a);
float sum = 0;
for (unsigned j = 0; j < ROUNDS_PER_ITERATION; j++) {
sum += t;
}
res[i] = sum;
If in your timing you're expecting your inner loop to do sum += a * ((b * c) + b) + b * ((c * a) + c) + c * ((a * b) + a) each iteration when in fact it only does sum += t then your timing estimate will be much larger than what you observe.
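To make the transformation concrete, the two forms can be put side by side. This is a sketch with hypothetical function names. The rewrite reassociates floating point additions, which is why a compiler only performs it under flags such as -Ofast or -ffast-math; the two versions agree exactly only when every intermediate value is exactly representable.

```c
#include <assert.h>

/* Inner loop as written in the question: three multiply-add groups per round. */
float rounds_naive(float a, float b, float c, unsigned rounds)
{
    float sum = 0.0f;
    for (unsigned j = 0; j < rounds; j++) {
        sum += a * ((b * c) + b);
        sum += b * ((c * a) + c);
        sum += c * ((a * b) + a);
    }
    return sum;
}

/* What the optimizer may reduce it to: the invariant term is computed once. */
float rounds_hoisted(float a, float b, float c, unsigned rounds)
{
    float t = a * ((b * c) + b) + b * ((c * a) + c) + c * ((a * b) + a);
    float sum = 0.0f;
    for (unsigned j = 0; j < rounds; j++)
        sum += t;
    return sum;
}
```

With a = 1, b = 2, c = 3 the invariant term is 8 + 12 + 9 = 29, so 100 rounds give 2900 in both versions, but the hoisted loop performs a single addition per round.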
OpenMP seems to be a much better solution. It requires far less setup and complexity with problems like this that can exploit data parallelism.
static float *cpu_result_matrix(struct bench_buf *in)
{
    float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));

    #pragma omp parallel for
    for (unsigned i = 0; i < BUFFER_SIZE; i++) {
        float a = in->a[i], b = in->b[i], c = in->c[i];
        float lres = 0; /* aligned_alloc does not zero res, so accumulate locally */
        for (unsigned j = 0; j < ROUNDS_PER_ITERATION; j++) {
            lres += a * ((b * c) + b);
            lres += b * ((c * a) + c);
            lres += c * ((a * b) + a);
        }
        res[i] = lres;
    }
    return res;
}
However, that doesn't answer why pthreads were behaving like they were in the question.
I thought memory access would be faster than the multiplication and division (although compiler-optimized) done in alpha blending, but it wasn't as fast as expected.
The 16 megabytes used for the table are not an issue in this case. It is a problem, though, if the table lookup can be even slower than doing all the calculations on the CPU.
Can anyone explain why, and what is happening? Would the table lookup win on a slower CPU?
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>
#define COLOR_MAX UCHAR_MAX
typedef unsigned char color;
color (*blending_table)[COLOR_MAX + 1][COLOR_MAX + 1];
static color blend(unsigned int destination, unsigned int source, unsigned int a) {
return (source * a + destination * (COLOR_MAX - a)) / COLOR_MAX;
}
void initialize_blending_table(void) {
int destination, source, a;
blending_table = malloc((COLOR_MAX + 1) * sizeof *blending_table);
for (destination = 0; destination <= COLOR_MAX; ++destination) {
for (source = 0; source <= COLOR_MAX; ++source) {
for (a = 0; a <= COLOR_MAX; ++a) {
blending_table[destination][source][a] = blend(destination, source, a);
}
}
}
}
struct timer {
double start;
double end;
};
void timer_start(struct timer *self) {
self->start = clock();
}
void timer_end(struct timer *self) {
self->end = clock();
}
double timer_measure_in_seconds(struct timer *self) {
return (self->end - self->start) / CLOCKS_PER_SEC;
}
#define n 300
int main(void) {
struct timer timer;
volatile int i, j, k, l, m;
timer_start(&timer);
initialize_blending_table();
timer_end(&timer);
printf("init %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blending_table[j][k][l];
}
}
}
}
timer_end(&timer);
printf("table %f\n", timer_measure_in_seconds(&timer));
timer_start(&timer);
for (i = 0; i <= n; ++i) {
for (j = 0; j <= COLOR_MAX; ++j) {
for (k = 0; k <= COLOR_MAX; ++k) {
for (l = 0; l <= COLOR_MAX; ++l) {
m = blend(j, k, l);
}
}
}
}
timer_end(&timer);
printf("function %f\n", timer_measure_in_seconds(&timer));
return EXIT_SUCCESS;
}
result
$ gcc test.c -O3
$ ./a.out
init 0.034328
table 14.176643
function 14.183924
Table lookup is not a panacea. It helps when the table is small enough, but in your case the table is very big. You write
> 16 megabytes used for the table is not an issue in this case
which I think is very wrong, and is possibly the source of the problem you experience. 16 megabytes is too big for L1 cache, so reading data from random indices in the table will involve the slower caches (L2, L3, etc). The penalty for cache misses is typically large; your blending algorithm must be very complex if you want your LUT solution to be faster.
Read the Wikipedia article for more info.
Your benchmark is hopelessly broken: it makes the LUT look a lot better than it actually is because it reads the table in order.
If your performance results show that the LUT is worse than direct calculation, then when you start with real-world random access patterns and cache misses, the LUT is going to be much worse.
Focus on improving the computation, and enabling vectorization. It's likely to pay off far better than a table-based approach.
(source * a + destination * (COLOR_MAX - a)) / COLOR_MAX
with rearrangement becomes
(source * a + destination * COLOR_MAX - destination * a) / COLOR_MAX
which simplifies to
destination + (source - destination) * a / COLOR_MAX
which has one multiply and one division by a constant, both of which are very efficient. And it is easily vectorized.
You should also mark your helper function as inline, although a good optimizing compiler is probably inlining it anyway.
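A quick way to check the rearranged formula is to compare it against the original over all 8-bit inputs. The sketch below assumes COLOR_MAX is 255 and uses hypothetical names blend_ref and blend_fast. Note that the rearranged form needs signed arithmetic, because source - destination can be negative, and that C integer division truncates toward zero, so the result can be one higher than the original when source < destination.

```c
#include <assert.h>

#define COLOR_MAX 255  /* assumption: 8-bit colour channels */

/* Blend as written in the question. */
static inline unsigned blend_ref(unsigned dst, unsigned src, unsigned a)
{
    return (src * a + dst * (COLOR_MAX - a)) / COLOR_MAX;
}

/* Rearranged form: one multiply and one division by a constant.
 * Signed arithmetic is required because src - dst may be negative. */
static inline int blend_fast(int dst, int src, int a)
{
    assert(a >= 0 && a <= COLOR_MAX);
    return dst + (src - dst) * a / COLOR_MAX;
}
```

If exact agreement with the original matters, the rounding of the negative case has to be handled explicitly; for most blending purposes an off-by-one in a single channel is not visible.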
I'm trying to implement a kernel which does parallel reduction. The code below only works on occasion, and I have not been able to pin down why it goes wrong when it does.
__kernel void summation(__global float* input, __global float* partialSum, __local float *localSum){
int local_id = get_local_id(0);
int workgroup_size = get_local_size(0);
localSum[local_id] = input[get_global_id(0)];
for(int step = workgroup_size/2; step>0; step/=2){
barrier(CLK_LOCAL_MEM_FENCE);
if(local_id < step){
localSum[local_id] += localSum[local_id + step];
}
}
if(local_id == 0){
partialSum[get_group_id(0)] = localSum[0];
}}
Essentially I'm summing the values per work group and storing each work group's total into partialSum; the final summation is done on the host. Below is the code which sets up the values for the summation.
size_t global[1];
size_t local[1];
const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;
local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));
int i;
for (i = 0; i < count; i++){
input[i] = (float)i+1;
}
I'm thinking it has something to do when the size of the input is not a power of two? I noticed it begins to go off for numbers around 8000 and beyond. Any assistance is welcome. Thanks.
> I'm thinking it has something to do when the size of the input is not a power of two?
Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch 1 work-group of 9 work-items:
for (int step = workgroup_size / 2; step > 0; step /= 2){
// At iteration 0: step = 9 / 2 = 4
barrier(CLK_LOCAL_MEM_FENCE);
if (local_id < step) {
// Branch taken by threads 0 to 3
// Only 8 numbers added up together!
localSum[local_id] += localSum[local_id + step];
}
}
You're never summing the 9th element, hence the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to make the work-group size the immediate next power-of-two.
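The effect is easy to reproduce on the host with a sequential model of the same loop. This is a sketch, not OpenCL code, and tree_reduce is a hypothetical name: reducing the values 1 to 9 directly loses the ninth element, while padding the buffer with zeroes up to 16 elements gives the correct sum.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of the kernel's loop: at each step, work-items
 * 0..step-1 add the element `step` positions ahead. Modifies buf in place. */
float tree_reduce(float *buf, size_t n)
{
    assert(n > 0);
    for (size_t step = n / 2; step > 0; step /= 2)
        for (size_t id = 0; id < step; id++)
            buf[id] += buf[id + step];
    return buf[0];
}
```

With nine elements the steps are 4, 2, 1, so the result is 36 rather than 45. With the zero-padded 16-element buffer the steps are 8, 4, 2, 1 and every element is included.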
I tried to implement vector sum reduction using CUDA on my own and ran into an error that I could fix, but I do not understand what the actual problem was.
I implemented the kernel below, which is pretty much the same as the one used in NVIDIA's samples.
__global__
void reduce0(int *input, int *output)
{
extern __shared__ int s_data[];
int tid = threadIdx.x;
int i = blockIdx.x * blockDim.x + threadIdx.x;
s_data[tid] = input[i];
__syncthreads();
for( int s=1; s < blockDim.x; s *= 2) {
if((tid % 2*s) == 0) {
s_data[tid] += s_data[tid + s];
}
__syncthreads();
}
if(tid == 0) {
output[blockIdx.x] = s_data[0];
}
}
Furthermore, I calculated shared memory space as below on the host side
int sharedMemSize = numberOfValues * sizeof(int);
If more than 1 block of threads is used, the code runs fine. Using only 1 block ends in the index out of bounds error mentioned above. While looking for my error by comparing my host code with that of the examples, I found the following line:
int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);
Playing a little with my block/grid setup brought me to the following results:
1 block, arbitrary number of threads => code crashes
>2 blocks, arbitrary number of threads => code runs fine
1 block, arbitrary number of threads, shared memory size 2*#threads => code runs fine
Despite thinking about this for a few hours, I don't get why there is an out of bounds error when using too few threads or blocks.
UPDATE: Host code calling the kernel as requested
int numberOfValues = 1024 ;
int numberOfThreadsPerBlock = 32;
int numberOfBlocks = numberOfValues / numberOfThreadsPerBlock;
int memSize = sizeof(int) * numberOfValues;
int *values = (int *) malloc(memSize);
int *result = (int *) malloc(memSize);
int *values_device, *result_device;
cudaMalloc((void **) &values_device, memSize);
cudaMalloc((void **) &result_device, memSize);
for(int i=0; i < numberOfValues ; i++) {
values[i] = i+1;
}
cudaMemcpy(values_device, values, memSize, cudaMemcpyHostToDevice);
dim3 dimGrid(numberOfBlocks,1);
dim3 dimBlock(numberOfThreadsPerBlock,1);
int sharedMemSize = numberOfThreadsPerBlock * sizeof(int);
reduce0 <<< dimGrid, dimBlock, sharedMemSize >>>(values_device, result_device);
if (cudaSuccess != cudaGetLastError())
printf( "Error!\n" );
cudaMemcpy(result, result_device, memSize, cudaMemcpyDeviceToHost);
Could your problem be the precedence of modulo and multiplication?
tid % 2*s is equal to (tid % 2)*s, but you want tid % (s*2).
The reason you need to use int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T) for a small number of threads is out of bounds indexing. One example where this happens is when you launch 29 threads. When tid=28 and s=2 the branch will be taken because 28 % (2*2) == 0, and you will index into s_data[28+2], but you have only allocated shared memory for 29 threads.
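The precedence issue can be checked on the host in plain C, where % and * have the same precedence and associate left to right, exactly as in CUDA C++. The function names below are hypothetical.

```c
#include <assert.h>

/* Branch test as written in the kernel: parsed as (tid % 2) * s. */
int takes_branch_as_written(int tid, int s)
{
    return (tid % 2*s) == 0;
}

/* Intended test: tid is a multiple of 2*s. */
int takes_branch_intended(int tid, int s)
{
    return (tid % (2 * s)) == 0;
}
```

As written, every even tid takes the branch at every step. With 32 threads, tid = 30 and s = 16 then reads s_data[46], which is past the end of a 32-element buffer but still inside a 64-element one. That would be consistent with the observation that doubling the shared memory size hides the problem.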
Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without reduction operation:
For example CPU code:
for(int i = 0; i < ntr; i++)
{
for(int j = 0; j < pos* posdir; j++)
{
val = x[i] * arr[j];
if(val > 0.0)
{
out[xcount] = val*x[i];
xcount += 1;
}
}
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;
__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
__shared__ float t1[threads];
__shared__ float t2[threads];
int gcount = 0;
for(int i = 0; i < posdir*pos; i += 32) {
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i%posdir];
}
__syncthreads();
for(int i = 0; i < 32; i++)
{
t2[i] = t1[i] * in1[tid];
if(t2[i] > 0){
out1[gcount] = t2[i] * in1[tid];
gcount = gcount + 1;
}
}
}
ct[0] = gcount;
}
What I am trying to do here is the following:
(1)Store 32 values of in2 in shared memory variable t1,
(2)For each value of i and in1[tid], calculate t2[i],
(3)if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid ?? As I debug, I find that for block (0,0,0) and thread(0,0,0) I can sequentially see the values of t2 updated. After the CUDA kernel switches focus to block(0,0,0) and thread(32,0,0), the values of out1[0] are re-written again. How can I get/store the values of out1 for each thread and write it to the output?
I tried two approaches so far: (suggested by #paseolatis on NVIDIA forums)
(1) defined offset=tid*32; and replace out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
> Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i+threadIdx.x];
}
> For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
float result = t1[i] * inval;
......
> if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
out1[gcount] = t2[i] * in1[tid];
gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    // Assumes posdir*pos is a multiple of 32, so in2[i+threadIdx.x] stays in bounds.
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();
        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                // atomicAdd returns the old value, giving this thread a unique slot
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
        __syncthreads(); // do not overwrite t1 while other threads still read it
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
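The key point is that the index comes from the return value of the atomic add, not from a separate read of the counter afterwards. As a host-side analogue, here is a sketch using C11 atomics in place of CUDA's atomicAdd; out_count and emit_if_positive are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared output cursor; the host-side stand-in for the __device__ counter. */
static atomic_int out_count;

/* Append val * x to out if val is positive. atomic_fetch_add returns the
 * previous counter value, so each caller gets a distinct slot. */
void emit_if_positive(float *out, float val, float x)
{
    if (val > 0.0f) {
        int idx = atomic_fetch_add(&out_count, 1);
        out[idx] = val * x;
    }
}
```

After all writers finish, the counter holds the number of values written. The order of the values in out is not deterministic when several threads run concurrently.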
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice); // pass the symbol itself; string names were removed in CUDA 5.0
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite on it.
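The per-thread output idea can be sketched on the host as follows. This is only an illustration in plain C, assuming each logical thread owns a fixed-size slice of the output; fill_slice and the slice capacity cap are hypothetical, and the real kernel would run the body once per CUDA thread.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one logical thread `tid`: it writes only into its own
 * slice out[tid * cap .. tid * cap + cap - 1] and returns how many
 * values it produced. `cap` (slice capacity) is a hypothetical parameter. */
size_t fill_slice(float x, const float *arr, size_t n,
                  float *out, size_t tid, size_t cap)
{
    size_t count = 0;
    for (size_t j = 0; j < n; j++) {
        float val = x * arr[j];
        if (val > 0.0f) {
            assert(count < cap);
            out[tid * cap + count] = val * x;
            count++;
        }
    }
    return count;
}
```

Because no two threads touch the same slice, no atomics are needed while producing values. The per-thread counts can then be combined afterwards to pack the slices into one contiguous array.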
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
for(int i = threadIdx.x; i < posdir*pos; i+=32) {
t1[i] = in2[i];
}
}
__syncthreads();
for(int i = threadIdx.x; i < posdir*pos; i += 32) {
for(int j = 0; j < 32; j++)
{
...
}
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.