Heap corruption with pthreads in C

I've been writing a C program to simulate the motion of n bodies under the influence of gravity. I have a working version that uses a single thread, and I'm attempting to write a version that uses multi-threading with the POSIX pthreads library. Essentially, the program initializes a specified number n of bodies and stores their randomly selected initial positions, as well as their masses and radii, in an array 'data', built using the pointer-to-pointer method. 'data' is a pointer in the global scope, and it is allocated the correct amount of memory in the populate() function called from main(). Then I spawn twelve threads (I am using a 6-core processor, so I thought 12 would be a good starting point), and each thread is assigned a set of objects in the simulation. The thread function below calculates the interaction between the object currently being operated on and every object in 'data'. See the function below:
void* calculate_step(void *index_val) {
    int index = *(int *)index_val;
    long double x_dist;
    long double y_dist;
    long double distance;
    long double force;
    for (int i = 0; i < (rows/nthreads); ++i) { //iterate over every object assigned to this thread
        data[i+index][X_FORCE] = 0; //reset all forces to 0
        data[i+index][Y_FORCE] = 0;
        data[i+index][X_ACCEL] = 0;
        data[i+index][X_ACCEL] = 0;
        for (int j = 0; j < rows; ++j) { //iterate over every possible pair with this object i
            //continue if not comparing an object with itself and if the other object has not been deleted previously
            if (i != j && data[j][DELETED] != 1 && data[i+index][DELETED] != 1) {
                x_dist = data[j][X_POS] - data[i+index][X_POS];
                y_dist = data[j][Y_POS] - data[i+index][X_POS];
                distance = sqrtl(powl(x_dist, 2) + powl(y_dist, 2));
                if (distance > data[i+index][RAD] + data[j][RAD]) {
                    //calculate accel, vel, pos data for a pair of non-colliding objects
                    force = G * data[i+index][MASS] * data[j][MASS] / powl(distance, 2);
                    data[i+index][X_FORCE] += force * (x_dist / distance);
                    data[i+index][Y_FORCE] += force * (y_dist / distance);
                    data[i+index][X_ACCEL] = data[i+index][X_FORCE]/data[i+index][MASS];
                    data[i+index][X_VEL] += data[i+index][X_ACCEL]*dt;
                    data[i+index][X_POS] += data[i+index][X_VEL]*dt;
                    data[i+index][Y_ACCEL] = data[i+index][Y_FORCE]/data[i+index][MASS];
                    data[i+index][Y_VEL] += data[i+index][Y_ACCEL]*dt;
                    data[i+index][Y_POS] += data[i+index][Y_VEL]*dt;
                }
                else {
                    if (data[i+index][MASS] < data[j][MASS]) {
                        int temp;
                        temp = i;
                        i = j;
                        j = temp;
                    }
                    //conserve momentum
                    data[i+index][X_VEL] = (data[i+index][X_VEL] * data[i+index][MASS] + data[j][X_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_VEL] = (data[i+index][Y_VEL] * data[i+index][MASS] + data[j][Y_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve center of mass position
                    data[i+index][X_POS] = (data[i+index][X_POS] * data[i+index][MASS] + data[j][X_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_POS] = (data[i+index][Y_POS] * data[i+index][MASS] + data[j][Y_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve mass
                    data[i+index][MASS] += data[j][MASS];
                    //increase radius proportionally to dM
                    data[i+index][RAD] = powl(powl(data[i+index][RAD], 3) + powl(data[j][RAD], 3), ((long double) 1 / (long double) 3));
                    data[j][DELETED] = 1;
                    data[j][MASS] = 0;
                    data[j][RAD] = 0;
                }
            }
        }
    }
    return NULL;
}
This calculates values for velocity, acceleration, etc. and writes them to the array. Each thread does this once for each object assigned to it (i.e. with 36 objects, each thread calculates values for 3 objects). The threads then return, the main loop jumps to the next time step (usually increments of 0.01 seconds), and the process repeats. If two balls collide, their masses, momenta and centers of mass are combined, and the 'DELETED' index in one object's row in the array is marked with a 1. That object is then ignored in all future iterations. See the main loop below:
int main() {
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;
    short num_obj;
    short sim_time;
    printf("Number of objects to simulate: \n");
    scanf("%hd", &num_obj);
    num_obj = num_obj - num_obj%12;
    printf("Timespan of the simulation: \n");
    scanf("%hd", &sim_time);
    printf("Length of time steps: \n");
    scanf("%f", &dt);
    printf("Relative complexity score: %.2f\n", (((float)sim_time/dt)*((float)(num_obj^2)))/1000);
    thread_array = malloc(nthreads*sizeof(pthread_t));
    thread_ids = malloc(nthreads*sizeof(long));
    populate(num_obj);
    int index;
    for (int i = 0; i < nthreads; ++i) { //initialize all threads
    }
    time_t start = time(NULL);
    print_data();
    for (int i = 0; i < (int)((float)sim_time/dt); ++i) { //main loop of simulation
        for (int j = 0; j < nthreads; ++j) {
            index = j*(rows/nthreads);
            thread_ids[j] = j;
            pthread_create(&thread_array[j], NULL, calculate_step, &index);
        }
        for (int j = 0; j < nthreads; ++j) {
            pthread_join(thread_array[j], NULL);
            //pthread_exit(NULL);
        }
    }
    time_t end = time(NULL) - start;
    printf("\n");
    print_data();
    printf("Took %zu seconds to simulate %d frames with %d objects initially, now %d objects.\n", end, (int)((float)sim_time/dt), num_obj, rows);
}
Every time the program runs, I get the following message:
Number of objects to simulate:
36
Timespan of the simulation:
10
Length of time steps:
0.01
Relative complexity score: 38.00
Process finished with exit code -1073740940 (0xC0000374)
which seems to indicate the heap is getting corrupted. I am guessing this has to do with the data array pointer being a global variable, but that was my workaround for only being allowed to pass one argument to the pthread function.
I have tried stepping through the program with the debugger, and it seems to work when I run it in debug mode (I am using CLion), but not in a regular build. Furthermore, when I debug the program and it outputs the values of the data array for the last simulation 'frame', the first chunk of values, which was supposed to be handled by the first thread that spawns, is unchanged, even though I can see that thread being created in the thread-generation loop. What are some issues with this code structure, and what could be causing the heap corruption and the first thread doing nothing?
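For reference, the usual way around pthread_create's single-argument limit is to give each thread a pointer to its own argument block instead of sharing one variable. A minimal sketch (the struct and its fields are illustrative, not from the original code, and calculate_step would need to read its slice from the struct rather than from a bare int):

struct step_arg {
    int start; // first object index this thread owns
    int count; // how many objects it owns
};

// in main(), before the simulation loop:
struct step_arg *args = malloc(nthreads * sizeof(struct step_arg));

// in the thread-spawning loop:
for (int j = 0; j < nthreads; ++j) {
    args[j].start = j * (rows / nthreads);
    args[j].count = rows / nthreads;
    // each thread receives its own stable argument; with the code as posted,
    // every thread instead receives &index, which main() keeps overwriting
    // while the threads are still reading it
    pthread_create(&thread_array[j], NULL, calculate_step, &args[j]);
}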

Related

Minimum and Maximum of an array using pthreads in C

I'm having an issue with my code. Disclaimer, btw: I'm new to C and trying to learn it on my own. Anyways, I'm trying to get the minimum and maximum of an array. I broke the array into 4 parts to make 4 separate arrays, and then used those 4 to pass in one of the parameters of each thread. With that being said, I'm only able to get the maximum for each part of the array, not the minimum, and I don't understand why.
I think we can simplify your code, avoid all these unnecessary malloc calls, and simplify your algorithm for finding a min/max pair in an array.
Start by having a thread function that takes as input the following: an array (represented by a pointer), an index at which to start searching, and an index at which to stop. Further, this function will need two output parameters: the smallest and largest integers found in the array subset scanned.
Start with the parameter declaration. Similar to your MaxMin, but has both input and output parameters:
struct ThreadParameters
{
    // input
    int* array;
    int start;
    int end;

    // output
    int smallest;
    int largest;
};
And then a thread function that scans from array[start] all the way up to (but not including) array[end]. And it puts the results of its scan into the smallest and largest member of the above struct:
void* find_min_max(void* args)
{
    struct ThreadParameters* params = (struct ThreadParameters*)args;
    int *array = params->array;
    int start = params->start;
    int end = params->end;
    int smallest = array[start];
    int largest = array[start];

    for (int i = start; i < end; i++)
    {
        if (array[i] < smallest)
        {
            smallest = array[i];
        }
        if (array[i] > largest)
        {
            largest = array[i];
        }
    }

    // write the result back to the parameter structure
    params->smallest = smallest;
    params->largest = largest;
    return NULL;
}
And while we are at it, use capital letters for your macros:
#define THREAD_COUNT 4
You could keep your "4 separate arrays" design, but there's no reason to, since the thread function can scan any range of any array. So let's declare a single global array as follows:
#define ARRAY_SIZE 400
int arr[ARRAY_SIZE];
The capital-letter syntax is preferred for macros.
fillArray becomes simpler:
void fillArray()
{
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        arr[i] = rand() % 1000 + 1;
    }
}
Now main becomes a whole lot simpler with these techniques:
- We'll leverage the stack to allocate our thread parameter structures (no malloc and free).
- We'll simply start 4 threads, passing each thread a pointer to a ThreadParameters struct. Since the threads won't outlive main, this is safe.
- After starting each thread, we just wait for each thread to finish.
- Then we scan the list of thread parameters to get the final smallest and largest.
main becomes much easier to manage:
int main()
{
    int smallest;
    int largest;

    // declare an array of threads and associated parameter instances
    pthread_t threads[THREAD_COUNT] = {0};
    struct ThreadParameters thread_parameters[THREAD_COUNT] = {0};

    // initialize the array
    fillArray();

    // smallest and largest need to be set to something
    smallest = arr[0];
    largest = arr[0];

    // start all the threads
    for (int i = 0; i < THREAD_COUNT; i++)
    {
        thread_parameters[i].array = arr;
        thread_parameters[i].start = i * (ARRAY_SIZE / THREAD_COUNT);
        thread_parameters[i].end = (i+1) * (ARRAY_SIZE / THREAD_COUNT);
        thread_parameters[i].largest = 0;
        pthread_create(&threads[i], NULL, find_min_max, &thread_parameters[i]);
    }

    // wait for all the threads to complete
    for (int i = 0; i < THREAD_COUNT; i++)
    {
        pthread_join(threads[i], NULL);
    }

    // now aggregate the "smallest" and "largest" results from all thread runs
    for (int i = 0; i < THREAD_COUNT; i++)
    {
        if (thread_parameters[i].smallest < smallest)
        {
            smallest = thread_parameters[i].smallest;
        }
        if (thread_parameters[i].largest > largest)
        {
            largest = thread_parameters[i].largest;
        }
    }

    printf("Smallest is %d\n", smallest);
    printf("Largest is %d\n", largest);
}
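One practical note on building this: pthread programs need to be compiled and linked with pthread support, e.g. gcc minmax.c -pthread (the file name here is only an assumption).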

How to pass specific value to one thread?

I'm working through a thread exercise in C; it's the typical thread-scheduling code many schools teach. A basic version can be seen here, and my code is basically the same except for my altered runner method:
http://webhome.csc.uvic.ca/~wkui/Courses/CSC360/pthreadScheduling.c
What I'm doing is basically altering the runner part so my code prints an array with random numbers within a certain range, instead of just printing some words. My runner code is here:
void *runner(void *param) {
    int i, j, total;
    int threadarray[100];
    for (i = 0; i < 100; i++)
        threadarray[i] = rand() % ((199 + modifier*100) + 1 - (100 + modifier*100)) + (100 + modifier*100);
    /* prints array and adds to total */
    for (j = 0; j < 100; j += 10) {
        printf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n", threadarray[j], threadarray[j+1], threadarray[j+2], threadarray[j+3], threadarray[j+4], threadarray[j+5], threadarray[j+6], threadarray[j+7], threadarray[j+8], threadarray[j+9]);
        total = total + threadarray[j] + threadarray[j+1] + threadarray[j+2] + threadarray[j+3] + threadarray[j+4] + threadarray[j+5] + threadarray[j+6] + threadarray[j+7] + threadarray[j+8] + threadarray[j+9];
    }
    printf("Thread %d finished running, total is: %d\n", pthread_self(), total);
    pthread_exit(0);
}
My question lies in the first for loop, where I'm assigning random numbers to my array. I want the modifier to change based on which thread is running: for example, in the first thread the range would be 100-199, in the second 200-299, and so on. I tried assigning a value to a variable before pthread_create and copying it into an int in runner to use as the modifier, but since there are 5 concurrent threads they all end up reading the same variable, so they all get the same modifier.
So I'm looking for an approach where each individual thread gets its own value instead of all of them sharing one. I tried changing the parameters to something like (void *param, int modifier), but then I have no idea how to reference runner, since by default it's referenced like pthread_create(&tid[i], &attr, runner, NULL);.
You want to make param point to a data structure or variable whose lifetime will last longer than the thread's lifetime. Then you cast the void* parameter to the actual data type it was allocated as.
Easy example:
struct thread_data
{
    int thread_index;
    int start;
    int end;
};

struct thread_info
{
    struct thread_data data;
    pthread_t thread;
};
struct thread_info threads[10];

for (int x = 0; x < 10; x++)
{
    // never pass a stack variable as a thread parameter; allocate it from the heap instead
    struct thread_data* pData = (struct thread_data*)malloc(sizeof(struct thread_data));
    pData->thread_index = x;
    pData->start = 100 * x + 1;
    pData->end = 100 * (x + 1) - 1;
    pthread_create(&(threads[x].thread), NULL, runner, pData);
}
Then your runner:
void *runner(void *param)
{
    struct thread_data* data = (struct thread_data*)param;
    int modifier = data->thread_index;
    int i, j, total;
    int threadarray[100];
    for (i = 0; i < 100; i++)
    {
        threadarray[i] = ...
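One detail worth adding to this sketch: each pData block comes from malloc, so something should eventually free it, and the simplest owner is the thread itself once it is done with the data. A sketch of that cleanup (not part of the original answer):

void *runner(void *param)
{
    struct thread_data* data = (struct thread_data*)param;
    int modifier = data->thread_index;
    /* ... do the work using modifier, data->start, and data->end ... */
    free(param); // the thread owns its parameter block, so it releases it
    return NULL;
}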
}

Ising 1-Dimensional C program

I am trying to simulate the 1-D Ising model. The model consists of a chain of spins (100 spins), and uses Monte Carlo-Metropolis to accept the flip of a spin if the energy of the system (in units of J) goes down, or otherwise with a probability compared against a random number.
In a correct program, both the energy and the magnetization go to zero, and the results come out as a Gaussian (plots of the energy or the magnetization against the number of Monte Carlo steps).
I have done some work, but I think my random number generator isn't correct for this, and I don't know how/where to implement the boundary condition: the last spin of the chain couples back to the first one.
I need help to finish it. Any help will be welcome. Thank you.
My C program is below:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h> //necessary for function time()

#define LENGTH 100  //size of the chain of spins
#define TEMP 2      //temperature in units of J
#define WARM 200    //thermalization (warm-up) steps
#define MCS 20000   //Monte Carlo steps

void start(int spin[])
{
    /* starts with all the spins 1 */
    int i;
    for (i = 0; i < 100; i++)
    {
        spin[i] = 1;
    }
}

double energy(int spin[]) //energy of the chain, with J=1
{
    int i;
    double energyX = 0; // because the beginning energy = -J*sum (over 100 spins) = -100
    for (i = 0; i < 100; i++)
        energyX = energyX - spin[i]*spin[i+1];
    return (energyX);
}

int randnum()
{
    int num;
    srand(time(NULL));
    /* srand(time(NULL)) seeds the random number generator with the value
       of time(NULL), i.e. the total number of seconds elapsed since
       January 1st, 1970. This way the "seed" will be different for each
       execution. */
    srand(time(NULL));
    //picking one spin randomly, zero to 100
    num = rand() % 100;
    printf("num = %d ", num);
    return num;
}

void montcarlo(int spin[])
{
    int i, j, num;
    double prob;
    double energyA, energyB; // A -> the old energy and B -> the new energy
    int rnum1, rnum2;
    prob = exp(-(energyB-energyA)/TEMP);
    energyA = 0;
    energyB = 0;
    for (i = 0; i < 100; i++)
    {
        for (j = 0; j < 100; j++)
        {
            energyA = energy(spin);
            rnum1 = randnum();
            rnum2 = randnum(); // i think they will give me different numbers
            spin[rnum1] = -spin[rnum1]; //flip of the randomly selected spin
            energyB = energyB - spin[j]*spin[j+1];
            if ((energyB-energyA < 0) || ((energyB-energyA > 0) && (rnum2 > prob))) { // using rnum2, not rnum1, to avoid correlation
                spin[rnum1] = spin[rnum1]; // keep the flip
            }
            else if ((energyB-energyA > 0) && (rnum2 < prob))
                spin[rnum1] = -spin[rnum1]; // unflip
        }
    }
}

int Mag_Moment(int spin[]) // this is the magnetic moment
{
    int i;
    int mag;
    for (i = 0; i < 100; i++)
    {
        mag = mag + spin[i];
    }
    return (mag);
}

int main()
{
    // starting the spin chain
    int spin[100]; //the vector goes up to LENGTH=100
    int i, num, j;
    int itime;
    double mag_moment;

    start(spin);
    double energy_chain = 0;
    energy_chain = energy(spin); // this will give -100 at the beginning
    printf("energy_chain starts with %f", energy_chain); // initially it gives -100

    /* warming: it makes the spins less ordered */
    for (i = 1; i <= WARM; i++)
    {
        itime = i;
        montcarlo(spin);
    }
    printf("Configuration after warming %d \n", itime);
    for (j = 0; j < LENGTH; j++)
    {
        printf("%d", spin[j]);
    }
    printf("\n");
    energy_chain = energy(spin); // new energy after the warming

    /* opening files to save the energy and magnetic moment of the chain */
    FILE *fp;  // file for the energy
    FILE *fp2; // file for the magnetic moment
    fp = fopen("energy_chain.txt", "w");
    fp2 = fopen("mag_moment.txt", "w");
    int pures; // net value of i
    int a;

    /* using Monte Carlo Metropolis for the whole chain */
    for (i = (WARM + 1); i <= MCS; i++)
    {
        itime = i; //saving the step i for the final printf
        pures = i - (WARM + 1);
        montcarlo(spin);
        energy_chain = energy_chain + energy(spin); // the spin chain is modified by montcarlo()
        mag_moment = mag_moment + Mag_Moment(spin);
        a = pures % 10000; // save to the txt files every 10000 steps, to produce graphs
        if (a == 0) {
            fprintf(fp, "%.12f\n", energy_chain); // %.12f just for high precision
            fprintf(fp2, "%.12f\n", mag_moment);
        }
    }
    fclose(fp); // closing the files
    fclose(fp2);

    /* finishing -- printing */
    printf("energy_chain = %.12f\n", energy_chain);
    printf("mag_moment = %.12f \n", mag_moment);
    printf("Temperature = %d,\n Size of the system = 100 \n", TEMP);
    printf("Warm steps = %d, Montcarlo steps = %d \n", WARM, MCS);
    printf("Configuration in time %d \n", itime);
    for (j = 0; j < 100; j++)
    {
        printf("%d", spin[j]);
    }
    printf("\n");
    return 0;
}
You should call srand(time(NULL)); only once in your program. Every call within the same second seeds the generator with the same value, so you get the same sequence of random numbers; it is therefore very likely that both calls in randnum() give you the same number.
Just add srand(time(NULL)); at the beginning of main and remove it everywhere else.
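With that change, randnum() reduces to something like the following sketch (the single srand(time(NULL)) call now lives at the top of main):

int randnum()
{
    // no seeding here; main() calls srand(time(NULL)) exactly once
    int num = rand() % 100; // pick one spin index from 0 to 99
    printf("num = %d ", num);
    return num;
}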
I see a number of bugs in this code, I think. The first one is the re-seeding via srand() on each call, which has already been addressed. Many of the loops go beyond the array bounds, such as:
for (ii = 0; ii < 100; ii++)
{
    energyX = energyX - spin[ii]*spin[ii+1];
}
This will compute spin[99]*spin[100] on the last iteration, which is out of bounds. That pattern is peppered throughout the code. Also, I noticed the probability rnum2 is an int but is compared as if it were a double. I think dividing rnum2 by 100 will give a reasonable probability:
rnum2 = (randnum()/100.0); // i think they will give me different numbers
The initial probability used for the flip decision is prob = exp(-(energyB-energyA)/TEMP);, but both energy values are uninitialized at that point. Maybe this is intentional, but I think it would be better to just use rand(). The Mag_Moment() function never initializes mag, so you wind up with a garbage return value. Can you point me to the algorithm you are trying to reproduce? I'm just curious.
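On the boundary-condition question (the last spin of the chain couples back to the first), one common fix is to wrap the neighbor index with a modulo so that spin[99] pairs with spin[0]. A sketch of energy() with that change (not taken from either answer):

double energy(int spin[])
{
    double energyX = 0;
    for (int i = 0; i < LENGTH; i++)
    {
        // periodic boundary: the neighbor of spin[LENGTH-1] is spin[0]
        energyX = energyX - spin[i] * spin[(i + 1) % LENGTH];
    }
    return energyX;
}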

Dividing processes evenly among threads

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below does divide all the work evenly, except for the last case, where I throw in what is left over.
Is there a way to fix this so that the work is spread more evenly?
void main(void)
{
    int processBegin[100];
    int processEnd[100];
    int activeProcessCount = 6243;
    int threadCount = 24;
    int processsInBundle = (int)(activeProcessCount / threadCount);
    int processBalance = activeProcessCount - (processsInBundle * threadCount);

    for (int i = 0; i < threadCount; ++i)
    {
        processBegin[i] = i * processsInBundle;
        processEnd[i] = (processBegin[i] + processsInBundle) - 1;
    }
    processEnd[threadCount - 1] += processBalance;

    FILE *debug = fopen("s:\\data\\testdump\\debug.csv", WRITE);
    for (int i = 0; i < threadCount; ++i)
    {
        int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
        fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
    }
    fclose(debug);
}
Give the first activeProcessCount % threadCount threads processInBundle + 1 processes each, and give the others processInBundle each.
int processInBundle = (int)(activeProcessCount / threadCount);
int processSoFar = 0;

for (int i = 0; i < activeProcessCount % threadCount; i++) {
    processBegin[i] = processSoFar;
    processSoFar += processInBundle + 1;
    processEnd[i] = processSoFar - 1;
}

for (int i = activeProcessCount % threadCount; i < threadCount; i++) {
    processBegin[i] = processSoFar;
    processSoFar += processInBundle;
    processEnd[i] = processSoFar - 1;
}
That's the same problem as trying to divide 5 pennies among 3 people. It's just impossible unless you can saw the pennies in half.
Also, even if all processes need an equal amount of theoretical runtime, it doesn't mean that they will be executed in the same amount of time, due to kernel scheduling, cache performance and various other hardware-related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (a batch can be size 1) and have your threads take one batch at a time, run it, then take the next one. This way the threads will always be working until all batches are gone.
More advanced is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool. OpenMP refers to this as guided scheduling.
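A minimal sketch of that dynamic-scheduling idea with pthreads (the item count, thread count, batch size, and do_process() are all placeholders):

#include <pthread.h>

#define ITEM_COUNT   6243
#define THREAD_COUNT 24
#define BATCH_SIZE   16

static int next_item = 0; // index of the next unclaimed item
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void do_process(int item) { (void)item; /* placeholder for one unit of work */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;)
    {
        // claim the next batch under the lock
        pthread_mutex_lock(&pool_lock);
        int start = next_item;
        next_item += BATCH_SIZE;
        pthread_mutex_unlock(&pool_lock);

        if (start >= ITEM_COUNT)
            break; // the pool is empty
        int end = start + BATCH_SIZE;
        if (end > ITEM_COUNT)
            end = ITEM_COUNT;
        for (int i = start; i < end; i++)
            do_process(i);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[THREAD_COUNT];
    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_join(threads[i], NULL);
    return 0;
}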

Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:
For example CPU code:
for (int i = 0; i < ntr; i++)
{
    for (int j = 0; j < pos*posdir; j++)
    {
        val = x[i] * arr[j];
        if (val > 0.0)
        {
            out[xcount] = val * x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;

    for (int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();
        for (int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if (t2[i] > 0) {
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1) store 32 values of in2 in the shared memory variable t1,
(2) for each value of i and in1[tid], calculate t2[i],
(3) if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten again. How can I get/store the values of out1 for each thread and write them to the output?
I have tried two approaches so far (suggested by @paseolatis on the NVIDIA forums):

(1) defining offset = tid*32; and replacing out1[gcount] with out1[offset+gcount],

(2) defining

__device__ int totgcount = 0; // this line before main()

atomicAdd(&totgcount, 1);
out1[totgcount] = t2[i] * in1[tid];

int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // output looks like this: GPU: xcount = 1928669800

Any suggestions? Thanks in advance!
OK, let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every element of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for (int i = 0; i < 32; i++)
{
    float result = t1[i] * inval;
    ......
if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if (t2[i] > 0) {
    out1[gcount] = t2[i] * in1[tid];
    gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    for (int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for (int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if (tval > 0) {
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
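The cudaGetSymbolAddress/cudaMemset route mentioned above would look roughly like this (same written-in-browser disclaimer applies):

int *gcount_ptr = NULL;
cudaGetSymbolAddress((void **)&gcount_ptr, gcount); // device address of the __device__ variable
cudaMemset(gcount_ptr, 0, sizeof(int));             // zero it before each kernel launch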
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
for(int i = threadIdx.x; i < posdir*pos; i+=32) {
t1[i] = in2[i];
}
}
__syncthreads();
for(int i = threadIdx.x; i < posdir*pos; i += 32) {
for(int j = 0; j < 32; j++)
{
...
}
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
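A rough sketch of that per-thread-output layout (illustrative only; out1 must be allocated to hold the worst case of posdir*pos hits per thread, and compaction happens on the host afterwards):

__global__ void test_g_local(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float ival = in1[tid];
    int count = 0;
    int base = tid * posdir * pos; // each thread writes only to its own slice of out1
    for (int i = 0; i < posdir * pos; i++)
    {
        float tval = in2[i] * ival;
        if (tval > 0.0f)
        {
            out1[base + count] = tval * ival;
            count++;
        }
    }
    ct[tid] = count; // per-thread hit count
}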
