Dividing processes evenly among threads in C

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below does divide all the work evenly, except for the last case, where I throw in what is left over.
Is there a way to fix this so that the work is spread more evenly?
#include <stdio.h>

void main(void)
{
    int processBegin[100];
    int processEnd[100];
    int activeProcessCount = 6243;
    int threadCount = 24;

    int processsInBundle = (int) (activeProcessCount / threadCount);
    int processBalance = activeProcessCount - (processsInBundle * threadCount);

    for (int i = 0; i < threadCount; ++i)
    {
        processBegin[ i ] = i * processsInBundle;
        processEnd[ i ] = (processBegin[ i ] + processsInBundle) - 1;
    }
    processEnd[ threadCount - 1 ] += processBalance;

    FILE *debug = fopen("s:\\data\\testdump\\debug.csv", "w");
    for (int i = 0; i < threadCount; ++i)
    {
        int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
        fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
    }
    fclose(debug);
}

Give the first activeProcessCount % threadCount threads processInBundle + 1 processes, and give the others processInBundle processes.
int processInBundle = (int) (activeProcessCount / threadCount);
int processSoFar = 0;

for (int i = 0; i < activeProcessCount % threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle + 1;
    processEnd[i] = processSoFar - 1;
}

for (int i = activeProcessCount % threadCount; i < threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle;
    processEnd[i] = processSoFar - 1;
}
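For reference (not part of the original answer), the same at-most-one-difference split can also be written in closed form, which avoids the separate remainder loop:

// Closed-form alternative (a sketch, not from the answer): consecutive ranges
// differ in size by at most one process and together cover all of them.
for (int i = 0; i < threadCount; i++) {
    processBegin[i] = (int)((long long)i * activeProcessCount / threadCount);
    processEnd[i]   = (int)((long long)(i + 1) * activeProcessCount / threadCount) - 1;
}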

That's the same problem as trying to divide 5 pennies among 3 people. It's just impossible unless you can saw the pennies in half.
Also, even if all processes need an equal amount of theoretical runtime, that doesn't mean they will execute in the same amount of time, due to kernel scheduling, cache performance and various other hardware-related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (they can be of size 1) and have your threads take one batch at a time, run it, then take the next one. This way the threads will always be working until all batches are gone.
A more advanced variant is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool. OpenMP refers to this as guided scheduling.
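To make the dynamic-scheduling idea concrete, here is a minimal sketch (not from the answer) using C11 atomics; process_range(), BATCH and total_processes are hypothetical placeholders. Each worker thread claims the next batch of process indices from a shared counter until nothing is left:

#include <pthread.h>
#include <stdatomic.h>

#define BATCH 64                      /* batch size; tune for your workload */

static atomic_int next_index;         /* next unclaimed process index */
static int total_processes;           /* set before starting the threads */

static void process_range(int begin, int end) { /* do the real work here */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int begin = atomic_fetch_add(&next_index, BATCH);
        if (begin >= total_processes)
            break;                    /* all batches have been claimed */
        int end = begin + BATCH;
        if (end > total_processes)
            end = total_processes;
        process_range(begin, end);
    }
    return NULL;
}

Each worker started with pthread_create(&tid, NULL, worker, NULL) then keeps itself busy until the counter runs past total_processes, so slow and fast threads balance out automatically.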

Related

Heap corruption with pthreads in C

I've been writing a C program to simulate the motion of n bodies under the influence of gravity. I have a working version that uses a single thread, and I'm attempting to write a version that uses multi-threading with the POSIX pthreads library. Essentially, the program initializes a specified number n of bodies, and stores their randomly selected initial positions as well as masses and radii in an array 'data', built using the pointer-to-pointer method. data is a pointer at global scope, and it is allocated the correct amount of memory in the 'populate()' function called from main(). Then, I spawn twelve threads (I am using a 6-core processor, so I thought 12 would be a good starting point), and each thread is assigned a set of objects in the simulation. The thread function below calculates the interaction between all objects in 'data' and the object currently being operated on. See the function below:
void* calculate_step(void *index_val) {
    int index = *(int *)index_val;
    long double x_dist;
    long double y_dist;
    long double distance;
    long double force;
    for (int i = 0; i < (rows/nthreads); ++i) { //iterate over every object assigned to this thread
        data[i+index][X_FORCE] = 0; //reset all forces to 0
        data[i+index][Y_FORCE] = 0;
        data[i+index][X_ACCEL] = 0;
        data[i+index][X_ACCEL] = 0;
        for (int j = 0; j < rows; ++j) { //iterate over every possible pair with this object i
            if (i != j && data[j][DELETED] != 1 && data[i+index][DELETED] != 1) { //continue if not comparing an object with itself and if the other object has not been deleted previously.
                x_dist = data[j][X_POS] - data[i+index][X_POS];
                y_dist = data[j][Y_POS] - data[i+index][X_POS];
                distance = sqrtl(powl(x_dist, 2) + powl(y_dist, 2));
                if (distance > data[i+index][RAD] + data[j][RAD]) {
                    force = G * data[i+index][MASS] * data[j][MASS] /
                            powl(distance, 2); //calculate accel, vel, pos, data for pair of non-colliding objects
                    data[i+index][X_FORCE] += force * (x_dist / distance);
                    data[i+index][Y_FORCE] += force * (y_dist / distance);
                    data[i+index][X_ACCEL] = data[i+index][X_FORCE]/data[i+index][MASS];
                    data[i+index][X_VEL] += data[i+index][X_ACCEL]*dt;
                    data[i+index][X_POS] += data[i+index][X_VEL]*dt;
                    data[i+index][Y_ACCEL] = data[i+index][Y_FORCE]/data[i+index][MASS];
                    data[i+index][Y_VEL] += data[i+index][Y_ACCEL]*dt;
                    data[i+index][Y_POS] += data[i+index][Y_VEL]*dt;
                }
                else {
                    if (data[i+index][MASS] < data[j][MASS]) {
                        int temp;
                        temp = i;
                        i = j;
                        j = temp;
                    } //conserve momentum
                    data[i+index][X_VEL] = (data[i+index][X_VEL] * data[i+index][MASS] + data[j][X_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_VEL] = (data[i+index][Y_VEL] * data[i+index][MASS] + data[j][Y_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve center of mass position
                    data[i+index][X_POS] = (data[i+index][X_POS] * data[i+index][MASS] + data[j][X_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_POS] = (data[i+index][Y_POS] * data[i+index][MASS] + data[j][Y_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve mass
                    data[i+index][MASS] += data[j][MASS];
                    //increase radius proportionally to dM
                    data[i+index][RAD] = powl(powl(data[i+index][RAD], 3) + powl(data[j][RAD], 3), ((long double) 1 / (long double) 3));
                    data[j][DELETED] = 1;
                    data[j][MASS] = 0;
                    data[j][RAD] = 0;
                }
            }
        }
    }
    return NULL;
}
This calculates values for velocity, acceleration, etc. and writes them to the array. Each thread does this once for each object assigned to it (i.e. with 36 objects, each thread calculates values for 3 objects). The threads then return, the main loop jumps to the next time step (usually increments of 0.01 seconds), and the process repeats. If two balls collide, their masses, momenta and centers of mass are combined, and the 'DELETED' entry in one object's row of the array is set to 1; that object is then ignored in all future iterations. See the main loop below:
int main() {
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;
    short num_obj;
    short sim_time;
    printf("Number of objects to simulate: \n");
    scanf("%hd", &num_obj);
    num_obj = num_obj - num_obj%12;
    printf("Timespan of the simulation: \n");
    scanf("%hd", &sim_time);
    printf("Length of time steps: \n");
    scanf("%f", &dt);
    printf("Relative complexity score: %.2f\n", (((float)sim_time/dt)*((float)(num_obj^2)))/1000);
    thread_array = malloc(nthreads*sizeof(pthread_t));
    thread_ids = malloc(nthreads*sizeof(long));
    populate(num_obj);
    int index;
    for (int i = 0; i < nthreads; ++i) { //initialize all threads
    }
    time_t start = time(NULL);
    print_data();
    for (int i = 0; i < (int)((float)sim_time/dt); ++i) { //main loop of simulation
        for (int j = 0; j < nthreads; ++j) {
            index = j*(rows/nthreads);
            thread_ids[j] = j;
            pthread_create(&thread_array[j], NULL, calculate_step, &index);
        }
        for (int j = 0; j < nthreads; ++j) {
            pthread_join(thread_array[j], NULL);
            //pthread_exit(NULL);
        }
    }
    time_t end = time(NULL) - start;
    printf("\n");
    print_data();
    printf("Took %zu seconds to simulate %d frames with %d objects initially, now %d objects.\n", end, (int)((float)sim_time/dt), num_obj, rows);
}
Every time the program runs, I get the following message:
Number of objects to simulate:
36
Timespan of the simulation:
10
Length of time steps:
0.01
Relative complexity score: 38.00
Process finished with exit code -1073740940 (0xC0000374)
which seems to indicate that the heap is getting corrupted. I am guessing this has to do with the data array pointer being a global variable, but that was my workaround for only being allowed to pass one argument to the pthread function.
I have tried stepping through the program with the debugger, and it seems to work when I run it in debug mode (I am using CLion), but not in a regular build. Furthermore, when I debug the program and it outputs the values of the data array for the last simulation 'frame', the first chunk of values, which was supposed to be handled by the first thread that spawns, is unchanged. When I step through it with the debugger, however, I can see that thread being created in the thread-generation loop. What are some issues with this code structure, and what could be causing the heap corruption and the first thread doing nothing?
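For reference on the "only one arg" limitation mentioned above: the usual pattern is to pack everything a thread needs into a per-thread struct and pass its address, rather than sharing one variable between all pthread_create calls. The sketch below is hypothetical (the names thread_arg, first_row, row_count and spawn are not from the post) and is not meant as a diagnosis of the crash:

#include <pthread.h>

struct thread_arg {
    int first_row;   /* first object this thread owns */
    int row_count;   /* how many objects it owns */
};

static void *step_worker(void *p)
{
    struct thread_arg *arg = p;
    for (int i = arg->first_row; i < arg->first_row + arg->row_count; ++i) {
        /* per-object work goes here */
    }
    return NULL;
}

static int spawn(int nthreads, int rows)
{
    pthread_t tid[nthreads];
    struct thread_arg args[nthreads];          /* one argument block per thread */
    for (int j = 0; j < nthreads; ++j) {
        args[j].first_row = j * (rows / nthreads);
        args[j].row_count = rows / nthreads;
        if (pthread_create(&tid[j], NULL, step_worker, &args[j]) != 0)
            return -1;
    }
    for (int j = 0; j < nthreads; ++j)
        pthread_join(tid[j], NULL);
    return 0;
}

Because each thread gets its own args[j], no thread ever reads an argument that another create call has already overwritten.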

MPI Program Does Not "Speed Up" After Implementing Parallel Computing Techniques

I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code works without any compilation errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (speed up) according to Amdahl’s law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size, start_val, end_val, upperLimit;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    upperLimit = 33;
    start_val = rank * (upperLimit / size) + 1;
    int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
    double start, end;
    // begin timing
    start = MPI_Wtime();
    // arbitrarily inflate the number of computations
    // to make the program take longer to compute
    // change to i < 1 for only 1 computation
    for (int i = 0; i < 10000000; i++) {
        // generate an algorithm that defines the range of
        // each process to handle for the Fibonacci sequence problem
        if (rank == (size - 1)) {
            end_val = upperLimit;
        }
        else {
            end_val = start_val + (upperLimit / size) - 1;
        }
        /*
        calculations before this code indicate that it will take exactly 32 separate computations
        to get to the largest number before exceeding 4,000,000 in the Fibonacci sequence. This can be done
        with a simple computation, but that calculation is not listed in this code.
        */
        long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;
        // calculate Fibonacci sequence positions using the closed-form formula
        double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
        double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
        position11 = floor(position11);
        position12 = floor(position12);
        // dynamically assign values to each process to generate a solution quickly
        if (rank == 0) {
            for (int i = start_val; i < end_val; i++) {
                SumNum = num1 + num2;
                num1 = num2;
                num2 = SumNum;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process 0 reports %d \n \n", SumNum);
                    //fflush(stdout);
                }
            }
        }
        else {
            for (int i = start_val; i < end_val; i++) {
                SumNum = position12 + position11;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process %d reports %d \n \n", rank, SumNum);
                    //fflush(stdout);
                }
                position11 = position12;
                position12 = SumNum;
            }
        }
        int recieve_buf = 0;
        MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            //printf("This is the final solution: %d \n \n", recieve_buf);
            //fflush(stdout);
        }
    }
    // end timing
    end = MPI_Wtime();
    double elapsed_time = end - start;
    printf("I am rank %d, and I report a walltime of %f seconds.", rank, elapsed_time);
    // end the MPI code
    MPI_Finalize();
    return 0;
}
Note that I utilize 10000000 computations in a for loop to intentionally increase the computation time.
I have attempted to investigate this by using time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, it seems as if the computation time increases as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again.
This is a case of the granularity being too fine. Granularity here means the amount of work between synchronization points, relative to the cost of the synchronization itself.
Let's say your MPI_Reduce takes one, or a couple of, microseconds. (A figure that has stayed fairly constant over the past few decades!) That's enough time to do a few thousand operations. So for speedup to occur, you need many thousands of operations between the reductions. You don't have that, so the time of your code is completely dominated by the cost of the MPI calls, and that does not go down with the number of processes.
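To illustrate the granularity point (a sketch following the structure of the question's code, not part of the original answer): if each rank accumulates its partial result locally and the reduction is performed once after the loop, the synchronization cost is paid a single time instead of ten million times.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long long local = 0, total = 0;
    double start = MPI_Wtime();
    for (int i = 0; i < 10000000; i++)
        local += i % 7;                   /* stand-in for the real per-rank work */

    /* one reduction per run, not one per iteration */
    MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("total = %lld, %f seconds\n", total, elapsed);
    MPI_Finalize();
    return 0;
}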

Multiply large numbers using multithreading

I need to multiply large numbers using multithreading. The two numbers to be multiplied can have up to 10000 digits. I have written multiplication code using a single thread. But I am not sure how to multiply when I am assigning multiple threads to different digits.
For example, if the two numbers are 254678 and 378929 and there are 3 threads, I am assigning two digits to each thread (2,5 -> Thread 1), (4,6 -> Thread 2), (7,8 -> Thread 3), and each of those digits should multiply the digits of the second number, 378929.
When the threads will run in parallel I don't know how to manage the carry variable when multiple threads will update the variable at the same time.
input: array contains both the numbers
index: i1 contains the last digit of 1st number
index: i2 contains the last digit of 2nd number
for (int i = i1-1; i >= 1; i--){
    int carry = 0;
    int n1 = input[i];
    t2 = 0;
    for (int j = i2-1; j > i1-1; j--){
        int n2 = input[j];
        int sum = n1*n2 + output[t1+t2] + carry;
        carry = sum/10;
        output[t1+t2] = sum % 10;
        t2++;
    }
    if (carry > 0)
        output[t1 + t2] += carry;
    t1++;
}
int main() {
    pthread_t threads[MAX_THREAD];
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, &multiply, (void*)NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
}
When the threads will run in parallel I don't know how to manage the carry variable when multiple threads will update the variable at the same time.
Don't allow multiple threads to update anything at the same time.
Specifically (assuming each digit is a digit in "base 1<<32"), each thread can be like:
my_accumulator = table_of_per_thread_accumulators[thread_number];
for (int src1_digit_number = startDigit; src1_digit_number >= 0; src1_digit_number -= NUM_THREADS) {
    for (int src2_digit_number = src2_digits-1; src2_digit_number >= 0; src2_digit_number--) {
        int dest_digit_number = src1_digit_number + src2_digit_number;
        uint32_t n1 = input1[src1_digit_number];
        uint32_t n2 = input2[src2_digit_number];
        uint64_t r = (uint64_t)n1 * n2;
        while (r != 0) {
            uint64_t temp = my_accumulator[dest_digit_number] + (r & 0xFFFFFFFFUL);
            my_accumulator[dest_digit_number++] = temp;
            temp >>= 32;
            r = (r >> 32) + temp;
        }
    }
}
Then, when all threads are finished (after some "pthread_join()" calls?) you'd add the numbers from each thread's separate my_accumulator to get the actual result.
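To make that final combine step concrete, here is a hypothetical sketch (not part of the answer) of summing the per-thread accumulators once the threads have been joined. It assumes each accumulator is an array of uint32_t digits, and that num_digits covers the full width of the product so the last carry is zero:

#include <stdint.h>

void combine(uint32_t *result, uint32_t **acc, int num_threads, int num_digits)
{
    for (int d = 0; d < num_digits; d++)
        result[d] = 0;
    for (int t = 0; t < num_threads; t++) {
        uint64_t carry = 0;
        for (int d = 0; d < num_digits; d++) {
            uint64_t sum = (uint64_t)result[d] + acc[t][d] + carry;
            result[d] = (uint32_t)sum;   /* low 32 bits stay in this digit */
            carry = sum >> 32;           /* high bits carry into the next digit */
        }
    }
}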
Note 1: In theory you can do some "binary merging of accumulators", such that an odd-numbered thread waits for the next lower-numbered thread to finish and adds that thread's accumulator to its own; then threads that satisfy "my_thread_number % 4 == 3" wait for the thread numbered "my_thread_number - 2" to finish and add its accumulator to their own; then... This is likely to be too complicated and messy to bother with.
Note 2: The alternative (if you use a single output[] that's modified by multiple threads) is to have a mutex to ensure only one thread can modify output[] at the same time (or multiple mutexes to ensure only one thread can modify a piece of output[] at a time); and this will destroy performance so badly that it will be faster to use a single thread.
Note 3: The alternative alternative is to use atomics somehow, or use inline assembly (e.g. "lock add [output+rdi],eax;") so that no mutex is needed. This is still very bad because the CPUs will be fighting for exclusive access of the same cache line (and you'll ruin performance by having CPUs spend most of their time trying to get exclusive access to the cache line).

Improving a simple function using threading

I have written a simple function with the following code that calculates the minimum number from a one-dimensional array:
uint32_t get_minimum(const uint32_t* matrix) {
    int min = 0;
    min = matrix[0];
    for (ssize_t i = 0; i < g_elements; i++){
        if (min > matrix[i]){
            min = matrix[i];
        }
    }
    return min;
}
However, I wanted to improve the performance of this function and was advised using threads so I have modified it to the following:
struct minargument{
    const uint32_t* matrix;
    ssize_t tid;
    long long results;
};

static void *minworker(void *arg){
    struct minargument *argument = (struct minargument *)arg;
    const ssize_t start = argument -> tid * CHUNK;
    const ssize_t end = argument -> tid == THREADS - 1 ? g_elements : (argument -> tid + 1) * CHUNK;
    long long result = argument -> matrix[0];
    for(ssize_t i = start; i < end; i++){
        for(ssize_t x = 0; x < g_elements; x++){
            if(result > argument->matrix[i]){
                result = argument->matrix[i];
            }
        }
    }
    argument -> results = result;
    return NULL;
}
uint32_t get_minimum(const uint32_t* matrix) {
    struct minargument *args = malloc(sizeof(struct minargument) * THREADS);
    long long min = 0;
    for(ssize_t i = 0; i < THREADS; i++){
        args[i] = (struct minargument){
            .matrix = matrix,
            .tid = i,
            .results = min,
        };
    }
    pthread_t thread_ids[THREADS];
    for(ssize_t i = 0; i < THREADS; i++){
        if(pthread_create(thread_ids + i, NULL, minworker, args + i) != 0){
            perror("pthread_create failed");
            return 1;
        }
    }
    for(ssize_t i = 0; i < THREADS; i++){
        if(pthread_join(thread_ids[i], NULL) != 0){
            perror("pthread_join failed");
            return 1;
        }
    }
    for(ssize_t i = 0; i < THREADS; i++){
        min = args[i].results;
    }
    free(args);
    return min;
}
However this seems to be slower than the first function.
Am I correct in using threads to make the first function run faster? And if so, how do I modify the second function so that it is faster than the first function?
Having more threads than cores available to run them on is always going to be slower than a single thread due to the overhead of creating them, scheduling them and waiting for them all to finish.
The example you provide is unlikely to benefit from any optimisation beyond that which the compiler will do for you, as it is a short and simple operation. If you were doing something more complicated on a multi-core system, such as multiplying two huge matrices, or running a correlation algorithm on high-speed real-time data, then multi-threading may be the solution.
A more abstract answer to your question is another question: do you really need to be optimising it at all? Unless you know for a fact that there are performance issues, then your time would be better spent adding more functionality to your program than fixing a problem that doesn't really exist.
Edit - Comparison
I just ran (a representative version of) the OP's code on a 16 bit ARM microcontroller running with a 40 MHz instruction clock. Code compiled using GCC with no optimisation.
Finding the minimum of 20,000 32 bit integers took a little over 25 milliseconds.
With a 40 kByte page size (to hold half of a 20,000-element array of 4-byte values) and threads running on different cores of a dual Intel 5150 processor clocked at 2.67 GHz, it takes nearly 50 ms just to do the context switch and paging operation!
A simple, single-threaded microcontroller implementation takes half as long in real time terms as a multi-threaded desktop implementation.
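For completeness, a minimal corrected sketch of the posted worker and combine step (assuming the question's THREADS, CHUNK = g_elements / THREADS and the global g_elements; not part of the answer above): the worker scans its chunk in a single pass instead of repeating it g_elements times, and the caller keeps the smallest per-thread result rather than overwriting min.

static void *minworker(void *arg)
{
    struct minargument *a = arg;
    const ssize_t start = a->tid * CHUNK;
    const ssize_t end = a->tid == THREADS - 1 ? g_elements : start + CHUNK;
    uint32_t result = a->matrix[start];
    for (ssize_t i = start + 1; i < end; i++)   /* single pass, no inner loop */
        if (a->matrix[i] < result)
            result = a->matrix[i];
    a->results = result;
    return NULL;
}

/* after joining all threads: */
uint32_t min = args[0].results;
for (ssize_t i = 1; i < THREADS; i++)
    if (args[i].results < min)                  /* keep the smallest, don't overwrite */
        min = args[i].results;

Even with these fixes, the answer's point stands: for an operation this cheap, thread creation and joining can easily cost more than the scan itself.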

Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without reduction operation:
For example CPU code:
for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos*posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val*x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;
__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();
        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1)Store 32 values of in2 in shared memory variable t1,
(2)For each value of i and in1[tid], calculate t2[i],
(3)if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are re-written again. How can I get/store the values of out1 for each thread and write them to the output?
I tried two approaches so far: (suggested by #paseolatis on NVIDIA forums)
(1) defined offset=tid*32; and replace out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
    float result = t1[i] * inval;
    ......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
    out1[gcount] = t2[i] * in1[tid];
    gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();
        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol("gcount", &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite on it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i += 32) {
        t1[i] = in2[i];
    }
}
__syncthreads();
for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
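As a concrete illustration of the "one output slice per thread" idea described above, here is a hypothetical sketch (not from either answer): each thread writes into its own fixed-size slice of out1 and keeps a private count, so no atomics or races are involved. max_per_thread is an assumed upper bound on how many values one thread can produce, and the shared-memory tiling of in2 is omitted for brevity; the host compacts the slices afterwards using counts.

__global__ void test_g_per_thread(const float *in1, const float *in2, float *out1,
                                  int *counts, int posdir, int pos, int max_per_thread)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float ival = in1[tid];
    int count = 0;
    for (int i = 0; i < posdir * pos; i++) {
        float val = in2[i] * ival;
        if (val > 0.0f && count < max_per_thread)
            out1[tid * max_per_thread + count++] = val * ival;  // this thread's private slice
    }
    counts[tid] = count;   // host reads this to compact the slices afterwards
}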
