Multiply large numbers using multithreading in C

I need to multiply large numbers using multithreading. The two numbers can each have up to 10000 digits. I have written multiplication code using a single thread, but I am not sure how to multiply when multiple threads are assigned to different digits.
For example, if the two numbers are 254678 and 378929 and there are 3 threads, I assign two digits of the first number to each thread (2,5 -> Thread 1), (4,6 -> Thread 2), (7,8 -> Thread 3), and each of those digits should be multiplied by the digits of the second number, 378929.
When the threads run in parallel, I don't know how to manage the carry variable, since multiple threads will update it at the same time.
input: array containing the digits of both numbers
i1: index of the last digit of the 1st number
i2: index of the last digit of the 2nd number
for (int i = i1 - 1; i >= 1; i--) {         // one digit of the 1st number at a time
    int carry = 0;
    int n1 = input[i];
    t2 = 0;
    for (int j = i2 - 1; j > i1 - 1; j--) { // multiply it by every digit of the 2nd number
        int n2 = input[j];
        int sum = n1 * n2 + output[t1 + t2] + carry;
        carry = sum / 10;
        output[t1 + t2] = sum % 10;
        t2++;
    }
    if (carry > 0)
        output[t1 + t2] += carry;
    t1++;
}
int main() {
    pthread_t threads[MAX_THREAD];
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, &multiply, (void*)NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
}

When the threads run in parallel, I don't know how to manage the carry variable when multiple threads update it at the same time.
Don't allow multiple threads to update anything at the same time.
Specifically (assuming each "digit" is a digit in base 1<<32), each thread can do something like this:
my_accumulator = table_of_per_thread_accumulators[thread_number];

for (int src1_digit_number = startDigit; src1_digit_number >= 0; src1_digit_number -= NUM_THREADS) {
    for (int src2_digit_number = src2_digits - 1; src2_digit_number >= 0; src2_digit_number--) {
        int dest_digit_number = src1_digit_number + src2_digit_number;
        uint32_t n1 = input1[src1_digit_number];
        uint32_t n2 = input2[src2_digit_number];
        uint64_t r = (uint64_t)n1 * n2;

        // add the partial product into this thread's private accumulator,
        // rippling any carry into the higher digits
        while (r != 0) {
            uint64_t temp = my_accumulator[dest_digit_number] + (r & 0xFFFFFFFFUL);
            my_accumulator[dest_digit_number++] = temp;   // keeps the low 32 bits
            temp >>= 32;
            r = (r >> 32) + temp;
        }
    }
}
Then, when all threads are finished (after some "pthread_join()" calls?) you'd add the numbers from each thread's separate my_accumulator to get the actual result.
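A minimal sketch of that final merge, assuming uint32_t digit arrays and a result buffer already sized to hold the full product (the names below are illustrative, not from the answer's code):

#include <stdint.h>

/* Add num_threads per-thread accumulators into result[], propagating carries.
   table_of_per_thread_accumulators and result_digits are assumed to be set up
   elsewhere; this is a single-threaded pass done after all the pthread_join()s. */
void merge_accumulators(uint32_t **table_of_per_thread_accumulators,
                        int num_threads, uint32_t *result, int result_digits)
{
    uint64_t carry = 0;
    for (int d = 0; d < result_digits; d++) {
        uint64_t sum = carry;
        for (int t = 0; t < num_threads; t++)
            sum += table_of_per_thread_accumulators[t][d];
        result[d] = (uint32_t)sum;  /* low 32 bits become this digit */
        carry = sum >> 32;          /* the rest carries into the next digit */
    }
}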
Note 1: In theory you can do some "binary merging of accumulators", such that an odd-numbered thread waits for the next lower-numbered thread to finish and adds that thread's accumulator to its own; then threads that satisfy "my_thread_number % 4 == 3" wait for the thread numbered "my_thread_number - 2" to finish and add its accumulator to their own; then... This is likely to be too complicated and messy to bother with.
Note 2: The alternative (if you use a single output[] that's modified by multiple threads) is to have a mutex to ensure only one thread can modify output[] at the same time (or multiple mutexes to ensure only one thread can modify a piece of output[] at a time); and this will destroy performance so badly that it will be faster to use a single thread.
Note 3: The alternative alternative is to use atomics somehow, or use inline assembly (e.g. "lock add [output+rdi],eax;") so that no mutex is needed. This is still very bad because the CPUs will be fighting for exclusive access of the same cache line (and you'll ruin performance by having CPUs spend most of their time trying to get exclusive access to the cache line).
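For completeness, here is a sketch of the launch/join wiring; the question's main() passes NULL to every thread, but each thread needs to know its own thread_number. multiply_worker and NUM_THREADS are illustrative names, and the worker body would be the per-thread loop shown above:

#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 3

/* hypothetical wrapper around the per-thread loop shown above */
void *multiply_worker(void *arg)
{
    int thread_number = (int)(intptr_t)arg;
    /* ... run the per-thread loop, choosing startDigit from thread_number ... */
    (void)thread_number;
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, multiply_worker, (void *)(intptr_t)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    /* after the joins, merge the per-thread accumulators as described above */
    return 0;
}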

Related

Heap corruption with pthreads in C

I've been writing a C program to simulate the motion of n bodies under the influence of gravity. I have a working version that uses a single thread, and I'm attempting to write a version that uses multi-threading with the POSIX pthreads library.

Essentially, the program initializes a specified number n of bodies and stores their randomly selected initial positions, as well as masses and radii, in an array 'data', made using the pointer-to-pointer method. data is a pointer in the global scope, and it is allocated the correct amount of memory in the 'populate()' function in main(). Then I spawn twelve threads (I am using a 6-core processor, so I thought 12 would be a good starting point), and each thread is assigned a set of objects in the simulation.

The thread function below calculates the interaction between all objects in 'data' and the object currently being operated on. See the function below:
void* calculate_step(void *index_val) {
    int index = *(int *)index_val;
    long double x_dist;
    long double y_dist;
    long double distance;
    long double force;
    for (int i = 0; i < (rows/nthreads); ++i) { //iterate over every object assigned to this thread
        data[i+index][X_FORCE] = 0; //reset all forces to 0
        data[i+index][Y_FORCE] = 0;
        data[i+index][X_ACCEL] = 0;
        data[i+index][X_ACCEL] = 0;
        for (int j = 0; j < rows; ++j) { //iterate over every possible pair with this object i
            if (i != j && data[j][DELETED] != 1 && data[i+index][DELETED] != 1) { //continue if not comparing an object with itself and if the other object has not been deleted previously.
                x_dist = data[j][X_POS] - data[i+index][X_POS];
                y_dist = data[j][Y_POS] - data[i+index][X_POS];
                distance = sqrtl(powl(x_dist, 2) + powl(y_dist, 2));
                if (distance > data[i+index][RAD] + data[j][RAD]) {
                    force = G * data[i+index][MASS] * data[j][MASS] /
                            powl(distance, 2); //calculate accel, vel, pos data for pair of non-colliding objects
                    data[i+index][X_FORCE] += force * (x_dist / distance);
                    data[i+index][Y_FORCE] += force * (y_dist / distance);
                    data[i+index][X_ACCEL] = data[i+index][X_FORCE]/data[i+index][MASS];
                    data[i+index][X_VEL] += data[i+index][X_ACCEL]*dt;
                    data[i+index][X_POS] += data[i+index][X_VEL]*dt;
                    data[i+index][Y_ACCEL] = data[i+index][Y_FORCE]/data[i+index][MASS];
                    data[i+index][Y_VEL] += data[i+index][Y_ACCEL]*dt;
                    data[i+index][Y_POS] += data[i+index][Y_VEL]*dt;
                }
                else {
                    if (data[i+index][MASS] < data[j][MASS]) {
                        int temp;
                        temp = i;
                        i = j;
                        j = temp;
                    }
                    //conserve momentum
                    data[i+index][X_VEL] = (data[i+index][X_VEL] * data[i+index][MASS] + data[j][X_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_VEL] = (data[i+index][Y_VEL] * data[i+index][MASS] + data[j][Y_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve center of mass position
                    data[i+index][X_POS] = (data[i+index][X_POS] * data[i+index][MASS] + data[j][X_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_POS] = (data[i+index][Y_POS] * data[i+index][MASS] + data[j][Y_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve mass
                    data[i+index][MASS] += data[j][MASS];
                    //increase radius proportionally to dM
                    data[i+index][RAD] = powl(powl(data[i+index][RAD], 3) + powl(data[j][RAD], 3), ((long double) 1 / (long double) 3));
                    data[j][DELETED] = 1;
                    data[j][MASS] = 0;
                    data[j][RAD] = 0;
                }
            }
        }
    }
    return NULL;
}
This calculates values for velocity, acceleration, etc. and writes them to the array. Each thread does this once for each object assigned to it (i.e. 36 objects means each thread calculates values for 3 objects). The thread then returns, the main loop jumps to the next time step (usually increments of 0.01 seconds), and the process repeats. If two balls collide, their masses, momenta and centers of mass are added, and one of the objects' 'DELETED' index in its row in the array is marked with a 1. This object is then ignored in all future iterations. See the main loop below:
int main() {
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;
    short num_obj;
    short sim_time;
    printf("Number of objects to simulate: \n");
    scanf("%hd", &num_obj);
    num_obj = num_obj - num_obj%12;
    printf("Timespan of the simulation: \n");
    scanf("%hd", &sim_time);
    printf("Length of time steps: \n");
    scanf("%f", &dt);
    printf("Relative complexity score: %.2f\n", (((float)sim_time/dt)*((float)(num_obj^2)))/1000);
    thread_array = malloc(nthreads*sizeof(pthread_t));
    thread_ids = malloc(nthreads*sizeof(long));
    populate(num_obj);
    int index;
    for (int i = 0; i < nthreads; ++i) { //initialize all threads
    }
    time_t start = time(NULL);
    print_data();
    for (int i = 0; i < (int)((float)sim_time/dt); ++i) { //main loop of simulation
        for (int j = 0; j < nthreads; ++j) {
            index = j*(rows/nthreads);
            thread_ids[j] = j;
            pthread_create(&thread_array[j], NULL, calculate_step, &index);
        }
        for (int j = 0; j < nthreads; ++j) {
            pthread_join(thread_array[j], NULL);
            //pthread_exit(NULL);
        }
    }
    time_t end = time(NULL) - start;
    printf("\n");
    print_data();
    printf("Took %zu seconds to simulate %d frames with %d objects initially, now %d objects.\n", end, (int)((float)sim_time/dt), num_obj, rows);
}
Every time the program runs, I get the following message:
Number of objects to simulate:
36
Timespan of the simulation:
10
Length of time steps:
0.01
Relative complexity score: 38.00
Process finished with exit code -1073740940 (0xC0000374)
which seems to indicate the heap is getting corrupted. I am guessing this has to do with the data array pointer being a global variable, but that was my workaround for only being allowed to pass one arg to the pthreads function.
I have tried stepping through the program with the debugger, and it seems to work when I run it in debug mode (I am using CLion), but not in a regular build. Furthermore, when I debug the program and it outputs the values of the data array for the last simulation 'frame', the first chunk of values, which was supposed to be handled by the first thread that spawns, is unchanged. When I go through it with the debugger, however, I can see that thread being created in the thread generation loop. What are some issues with this code structure, and what could be causing the heap corruption and the first thread doing nothing?
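For reference, a common pattern for the "only one arg" limitation mentioned above is to give every thread the address of its own, distinct argument instead of the address of a single shared variable. A minimal sketch of that pattern (illustrative names, independent of this program's other issues):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 12

static void *worker(void *arg)
{
    int index = *(int *)arg;   /* each thread reads its own slot */
    printf("thread got start index %d\n", index);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int start_index[NTHREADS]; /* one argument slot per thread */

    for (int j = 0; j < NTHREADS; ++j) {
        start_index[j] = j * 3;   /* e.g. 3 objects per thread */
        pthread_create(&threads[j], NULL, worker, &start_index[j]);
    }
    for (int j = 0; j < NTHREADS; ++j)
        pthread_join(threads[j], NULL);
    return 0;
}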

MPI Program Does Not "Speed Up" After Implementing Parallel Computing Techniques

I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code works without any compilation errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (speed up) according to Amdahl’s law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size, start_val, end_val, upperLimit;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    upperLimit = 33;
    start_val = rank * (upperLimit / size) + 1;
    int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
    double start, end;

    // begin timing
    start = MPI_Wtime();

    // arbitrarily inflate the number of computations
    // to make the program take longer to compute
    // change to i < 1 for only 1 computation
    for (int i = 0; i < 10000000; i++) {
        // generate an algorithm that defines the range of
        // each process to handle for the fibb_sequence problem.
        if (rank == (size - 1)) {
            end_val = upperLimit;
        }
        else {
            end_val = start_val + (upperLimit / size) - 1;
        }
        /*
        Calculations before this code indicate that it will take exactly 32 separate algorithm computations
        to get to the largest number before exceeding 4,000,000 in the fibb sequence. This can be done with a simple
        computation, but this calculation will not be listed in this code.
        */
        long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;
        // calculate fibb sequence positions for the sequence using a formula
        double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
        double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
        position11 = floor(position11);
        position12 = floor(position12);
        // dynamically assign values to each process to generate a solution quickly
        if (rank == 0) {
            for (int i = start_val; i < end_val; i++) {
                SumNum = num1 + num2;
                num1 = num2;
                num2 = SumNum;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process 0 reports %d \n \n", SumNum);
                    //fflush(stdout);
                }
            }
        }
        else {
            for (int i = start_val; i < end_val; i++) {
                SumNum = position12 + position11;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process %d reports %d \n \n", rank, SumNum);
                    //fflush(stdout);
                }
                position11 = position12;
                position12 = SumNum;
            }
        }
        int recieve_buf = 0;
        MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            //printf("This is the final solution: %d \n \n", recieve_buf);
            //fflush(stdout);
        }
    }

    // end timing
    end = MPI_Wtime();
    // timer goes here
    double elapsed_time = end - start;
    printf("I am rank %d, and I report a walltime of %f seconds.", rank, elapsed_time);

    // end the MPI code
    MPI_Finalize();
    return 0;
}
Note that I utilize 10000000 computations in a for loop to intentionally increase the computation time.
I have attempted to solve this problem by utilizing time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, it seems as if the computation time increases as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again.
This is a case of the granularity being too fine. Granularity is defined as the amount of work between synchronization points vs the cost of synchronization.
Let's say your MPI_Reduce takes one, or a couple of, microseconds. (A figure that has stayed fairly constant over the past few decades!) That's enough time to do a few thousand operations. So for speedup to occur, you need many thousands of operations between the reductions. You don't have that, so the time of your code is completely dominated by the cost of the MPI calls, and that does not go down with the number of processes.
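As an illustration of coarser granularity (a sketch, not a drop-in fix for the code above): keep all the per-rank work free of MPI calls and perform a single MPI_Reduce at the end, so one synchronization is amortized over millions of operations. local_work() here is a hypothetical stand-in for the per-rank computation.

#include <mpi.h>
#include <stdio.h>

/* hypothetical per-rank computation standing in for the Fibonacci work above */
static long long local_work(int i, int rank) {
    return (long long)(i % (rank + 1));
}

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double start = MPI_Wtime();

    long long x = 0;
    for (int i = 0; i < 10000000; i++)
        x += local_work(i, rank);   /* plenty of work, no MPI calls inside */

    long long total = 0;
    MPI_Reduce(&x, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD); /* one reduction */

    double end = MPI_Wtime();
    printf("rank %d of %d: %f seconds\n", rank, size, end - start);
    MPI_Finalize();
    return 0;
}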

How does OpenMP COLLAPSE work internally?

I am trying out OpenMP parallelism to multiply 2 matrices using 2 threads.
I understand how the outer-loop parallelism works (i.e. without the "collapse(2)").
Now, using collapse:
#pragma omp parallel for collapse(2) num_threads(2)
for( i = 0; i < m; i++)
    for( j = 0; j < n; j++)
    {
        s = 0;
        for( k = 0; k < p; k++)
            s += A[i][k] * B[k][j];
        C[i][j] = s;
    }
From what I gather, collapse "collapses" the loops into a single big loop, and then uses threads in the big loop. So, for the previous code, I think it would be equivalent to something like this:
#pragma omp parallel for num_threads(2)
for (ij = 0; ij < n*m; ij++)
{
    i = ij/n;
    j = mod(ij,n);
    s = 0;
    for( k = 0; k < p; k++)
        s += A[i][k] * B[k][j];
    C[i][j] = s;
}
My questions are:
1. Is that how it works? I have not found any explanation of how it "collapses" the loops.
2. If yes, what is the benefit of using it? Doesn't it divide the jobs between the 2 threads EXACTLY like the parallelism without collapsing? If not, how does it work?
PS: Now that I am thinking about it a little bit more: in case n is an odd number, say 3, without the collapse one thread would have 2 iterations and the other just one. That results in uneven jobs for the threads, and it is a bit less efficient.
If we were to use my collapse equivalent (if that is indeed how collapse works), each thread would have "1.5" iterations. If n were very large, that would not really matter, would it? Not to mention that doing i = ij/n; j = mod(ij,n); every time decreases performance, doesn't it?
The OpenMP specification says just (page 58 of Version 4.5):
If a collapse clause is specified with a parameter value greater than 1, then the iterations of the associated loops to which the clause applies are collapsed into one larger iteration space that is then divided according to the schedule clause. The sequential execution of the iterations in these associated loops determines the order of the iterations in the collapsed iteration space.
So, basically your logic is correct, except that your code is equivalent to the schedule(static,1) collapse(2) case, i.e. iteration chunk size of 1. In the general case, most OpenMP runtimes have default schedule of schedule(static), which means that the chunk size will be (approximately) equal to the number of iterations divided by the number of threads. The compiler may then use some optimisation to implement it by e.g. running a partial inner loop for a fixed value for the outer loop, then an integer number of outer iterations with complete inner loops, then a partial inner loop again.
For example, the following code:
#pragma omp parallel for collapse(2)
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
        a[100*i+j] = i+j;
gets transformed by the OpenMP engine of GCC into:
<bb 3>:
i = 0;
j = 0;
D.1626 = __builtin_GOMP_loop_static_start (0, 10000, 1, 0, &.istart0.3, &.iend0.4);
if (D.1626 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 8>:
.iter.1 = .istart0.3;
.iend0.5 = .iend0.4;
.tem.6 = .iter.1;
D.1630 = .tem.6 % 100;
j = (int) D.1630;
.tem.6 = .tem.6 / 100;
D.1631 = .tem.6 % 100;
i = (int) D.1631;
<bb 4>:
D.1632 = i * 100;
D.1633 = D.1632 + j;
D.1634 = (long unsigned int) D.1633;
D.1635 = D.1634 * 4;
D.1636 = .omp_data_i->a;
D.1637 = D.1636 + D.1635;
D.1638 = i + j;
*D.1637 = D.1638;
.iter.1 = .iter.1 + 1;
if (.iter.1 < .iend0.5)
goto <bb 10>;
else
goto <bb 9>;
<bb 9>:
D.1639 = __builtin_GOMP_loop_static_next (&.istart0.3, &.iend0.4);
if (D.1639 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 10>:
j = j + 1;
if (j <= 99)
goto <bb 4>;
else
goto <bb 11>;
<bb 11>:
j = 0;
i = i + 1;
goto <bb 4>;
<bb 5>:
__builtin_GOMP_loop_end_nowait ();
<bb 6>:
This is a C-like representation of the program's abstract syntax tree, which is probably a bit hard to read, but what it does is: it uses modulo arithmetic only once to compute the initial values of i and j, based on the start of the iteration block (.istart0.3) determined by the call to GOMP_loop_static_start(). Then it simply increases i and j as one would expect a loop nest to be implemented, i.e. increase j until it hits 100, then reset j to 0 and increase i. At the same time, it also keeps the current iteration number from the collapsed iteration space in .iter.1, basically iterating over both the single collapsed loop and the two nested loops at the same time.
As to case when the number of threads does not divide the number of iterations, the OpenMP standard says:
When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.
The GCC implementation leaves the threads with the highest IDs doing one iteration less. Other possible distribution strategies are outlined in the note on page 61. The list is by no means exhaustive.
The exact behavior is not specified by the standard itself. However, the standard requires that the inner loop has exactly the same iterations for each iteration of the outer loop. This allows the following transformation:
#pragma omp parallel
{
    int iter_total = m * n;
    int iter_per_thread = 1 + (iter_total - 1) / omp_get_num_threads(); // ceil
    int iter_start = iter_per_thread * omp_get_thread_num();
    int iter_end = min(iter_start + iter_per_thread, iter_total);

    int ij = iter_start;
    int j_start = iter_start % n;   // first inner loop may start partway through
    for (int i = iter_start / n; ; i++) {
        for (int j = j_start; j < n; j++) {
            // normal loop body
            ij++;
            if (ij == iter_end) {
                goto end;
            }
        }
        j_start = 0;                // later inner loops run in full
    }
end: ;
}
From skimming the disassembly, I believe this is similar to what GCC does. It does avoid the per-iteration division/modulo, but it costs one register and one addition per inner iteration. Of course it will vary for different scheduling strategies.
Collapsing loops does increase the number of loop iterations that can be assigned to threads, thus helping with load balance or even exposing enough parallel work in the first place.
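As a toy illustration of that last point (not taken from the answer above): with only 3 outer iterations and 8 threads, parallelizing just the outer loop leaves 5 threads idle, while collapse(2) spreads all 3*N combined iterations across the team.

#include <stdio.h>

#define M 3
#define N 1000

int main(void) {
    static double C[M][N];

    // Only 3 outer iterations: without collapse, at most 3 threads get work.
    // With collapse(2), the 3*1000 combined iterations are shared by all 8 threads.
    #pragma omp parallel for collapse(2) num_threads(8)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = i + j;

    printf("C[2][999] = %g\n", C[2][999]);
    return 0;
}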

Dividing processes evenly among threads

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below does divide all the work evenly, except for the last case, where I throw in what is left over.
Is there a way to fix this so that the work is spread more evenly?
void main(void)
{
    int processBegin[100];
    int processEnd[100];
    int activeProcessCount = 6243;
    int threadCount = 24;

    int processsInBundle = (int) (activeProcessCount / threadCount);
    int processBalance = activeProcessCount - (processsInBundle * threadCount);

    for (int i = 0; i < threadCount; ++i)
    {
        processBegin[ i ] = i * processsInBundle;
        processEnd[ i ] = (processBegin[ i ] + processsInBundle) - 1;
    }
    processEnd[ threadCount - 1 ] += processBalance;

    FILE *debug = fopen("s:\\data\\testdump\\debug.csv", WRITE);
    for (int i = 0; i < threadCount; ++i)
    {
        int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
        fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
    }
    fclose(debug);
}
Give the first activeProcessCount % threadCount threads processInBundle + 1 processes and give the others processsInBundle ones.
int processInBundle = (int) (activeProcessCount / threadCount);
int processSoFar = 0;

for (int i = 0; i < activeProcessCount % threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle + 1;
    processEnd[i] = processSoFar - 1;
}
for (int i = activeProcessCount % threadCount; i < threadCount; i++){
    processBegin[i] = processSoFar;
    processSoFar += processInBundle;
    processEnd[i] = processSoFar - 1;
}
That's the same problem as trying to divide 5 pennies among 3 people. It's just impossible unless you can saw the pennies in half.
Also even if all processes need an equal amount of theoretical runtime it doesn't mean that they will be executed in the same amount of time due to kernel scheduling, cache performance and various other hardware related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (these can be of size 1) and have your threads take one batch at a time, run it, then take the next one. This way the threads will always be working until all batches are gone (see the sketch after these suggestions).
More advanced is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool. OpenMP refers to this as guided scheduling.
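A minimal sketch of the dynamic-scheduling idea using pthreads and a shared atomic counter (illustrative names; batch size 1 for simplicity):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NWORK    6243

static atomic_int next_item = 0;   /* index of the next unclaimed work item */

static void do_work(int item) { (void)item; /* placeholder for the real per-item work */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int item = atomic_fetch_add(&next_item, 1);  /* claim one batch (size 1) */
        if (item >= NWORK)
            break;                                   /* no work left */
        do_work(item);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("all %d items processed\n", NWORK);
    return 0;
}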

Using shared memory in CUDA without reducing threads

Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:
For example CPU code:
for(int i = 0; i < ntr; i++)
{
    for(int j = 0; j < pos*posdir; j++)
    {
        val = x[i] * arr[j];
        if(val > 0.0)
        {
            out[xcount] = val*x[i];
            xcount += 1;
        }
    }
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;

__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();
        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1) Store 32 values of in2 in the shared memory variable t1,
(2) For each value of i and in1[tid], calculate t2[i],
(3) If t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount].
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? As I debug, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten again. How can I get/store the values of out1 for each thread and write them to the output?
I tried two approaches so far: (suggested by #paseolatis on NVIDIA forums)
(1) defined offset=tid*32; and replace out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
    t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
    float result = t1[i] * inval;
    ......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
    out1[gcount] = t2[i] * in1[tid];
    gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol("gcount", &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
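Reading the count back on the host after the kernel might then look something like this (again only a sketch; the device buffers, num_blocks and threads are assumed to be set up as in the question, and the symbol is passed directly rather than as a string, which is what current CUDA toolkits expect):

int zero = 0, host_count = 0;
cudaMemcpyToSymbol(gcount, &zero, sizeof(int));          // reset before launch
test_g<<<num_blocks, threads>>>(d_in1, d_in2, d_out1, posdir, pos);
cudaDeviceSynchronize();
cudaMemcpyFromSymbol(&host_count, gcount, sizeof(int));  // how many values were written to out1
printf("GPU: xcount = %d\n", host_count);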
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite on it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i+=32) {
        t1[i] = in2[i];
    }
}
__syncthreads();

for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
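A sketch of that per-thread-output idea (illustrative only, along the lines of approach (1) the asker already tried): each thread owns a fixed slice of out1 and its own count, so no element is ever written by two threads. Shared-memory staging of in2 is omitted here for brevity, and max_per_thread is an assumed upper bound on outputs per thread.

// hypothetical kernel: out1 must be sized ntr * max_per_thread, counts sized ntr
__global__ void test_per_thread(const float *in1, const float *in2,
                                float *out1, int *counts,
                                int posdir, int pos, int max_per_thread)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float ival = in1[tid];
    int mycount = 0;

    for (int i = 0; i < posdir * pos; i++) {
        float tval = in2[i] * ival;
        if (tval > 0.0f && mycount < max_per_thread) {
            out1[tid * max_per_thread + mycount] = tval * ival; // private slice
            mycount++;
        }
    }
    counts[tid] = mycount;   // per-thread count, compacted on the host later
}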
