OpenCL kernel segmentation fault in pi calculation - C

Good evening all,
I am trying to design an OpenCL kernel to calculate pi. This is a school assignment and we were told to use this equation:
Pi/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...
Here is my kernel design that is currently generating a segfault and I am not sure why:
__kernel void calculatePi(int numIterations, __global float *outputPi, __local float* local_result, int numWorkers)
{
    // Get global ID for worker
    const uint gid = get_global_id(0);
    const uint lid = get_local_id(0);
    const uint offset = numIterations*gid*2;
    float sum = 0.0f;

    for (int i = 0; i < numWorkers; i++)
    {
        local_result[i] = 0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < numIterations; i++)
    {
        if (i % 2 == 0)
            sum += 1 / (1 + 2*i + offset);
        else
            sum -= 1 / (1 + 2*i + offset);
    }
    local_result[gid] = sum;
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (lid == 0)
    {
        outputPi[0] = 0;
        for (int i = 0; i < numWorkers; i++)
        {
            outputPi[0] += local_result[i];
        }
        outputPi[0] *= 4;
    }
}
Basically, my thought process was to have 16 workers in parallel. Each worker takes numIterations terms and computes a partial sum for pi. In this case, I'm also using 16 for numIterations. The terms alternate, so for every odd term I subtract and for every even term I add. The first worker is responsible for the first 16 terms, the next worker for the next 16 terms, and so on, giving 16 partial sums of 16 terms each. Once each worker has calculated its partial sum, I have the first worker add up all of the partial sums and write the result out. I also multiply by 4 to complete the equation.
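For reference, here is a minimal sketch of how the host side might pass those four arguments and launch 16 work-items in one work-group. The names kernel, command_queue and result_buffer are assumptions based on the snippets in this question, and error checking is omitted:
cl_int numIterations = 16;
cl_int numWorkers = 16;
size_t global_size = 16;   /* one work-item per partial sum */
size_t local_size  = 16;   /* a single work-group, so local_result covers every worker */
cl_int err;

err  = clSetKernelArg(kernel, 0, sizeof(cl_int), &numIterations);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &result_buffer);
err |= clSetKernelArg(kernel, 2, numWorkers * sizeof(float), NULL); /* __local arg: size only, no host pointer */
err |= clSetKernelArg(kernel, 3, sizeof(cl_int), &numWorkers);

err |= clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                              &global_size, &local_size, 0, NULL, NULL);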
My issue is that I keep getting a segmentation fault within my main program at the following line:
ret = clEnqueueReadBuffer(command_queue, result_buffer, CL_TRUE, 0, sizeof(result), &result, 0, NULL, NULL);
Here are the other uses of "result" that could be causing this issue:
float result[1] = {0}; // Initialized at top of main
/* Create buffers to hold the text characters and count */
cl_mem result_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(result), result, NULL);
printf("Final calculated value: %f \n", result[0]);
Can anyone give me insight as to why I'm getting the segfault when trying to read the result buffer back into result?
The full code can be seen within my github: https://github.com/TreverWagenhals/TreverWagenhals/tree/master/School/Heterogeneous%20Computing/Lab2
Thanks.
EDIT: I've found the issue in my code. I was creating a variable called numWorkers and passing it in as one of the kernel arguments, which apparently wasn't correct. In the process of simplifying my code I was able to remove it and use the global_size variable directly, which resolves the segfault and shows the data on each call.
Now I'm having an issue within my kernel where 4 is being returned instead of the value for pi. I will debug further and create a new question if I can't see the issue.
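One likely explanation for that constant 4 is integer division: in the loop above, 1 / (1 + 2*i + offset) is evaluated entirely in integer arithmetic, so it yields 1 for the very first term on worker 0 and 0 for every other term, leaving a total of 1, which times 4 gives exactly 4. A minimal sketch of the same loop with float division (not the author's final fix, just the arithmetic change):
// Sketch only: force float division so every term contributes its fractional value.
for (int i = 0; i < numIterations; i++)
{
    float term = 1.0f / (float)(1 + 2*i + offset);
    if (i % 2 == 0)
        sum += term;
    else
        sum -= term;
}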

Related

Heap corruption with pthreads in C

I've been writing a C program to simulate the motion of n bodies under the influence of gravity. I have a working version that uses a single thread, and I'm attempting to write a version that uses multi-threading with the POSIX pthreads library. Essentially, the program initializes a specified number n of bodies, and stores their randomly selected initial positions as well as masses and radii in an array 'data', made using the pointer-to-pointer method. Data is a pointer in the global scope, and it is allocated the correct amount of memory in the 'populate()' function in main(). Then, I spawn twelve threads (I am using a 6 core processor, so I thought 12 would be a good starting point), and each thread is assigned a set of objects in the simulation. The thread function below calculates the interaction between all objects in 'data' and the object currently being operated on. See the function below:
void* calculate_step(void *index_val) {
    int index = *(int *)index_val;
    long double x_dist;
    long double y_dist;
    long double distance;
    long double force;
    for (int i = 0; i < (rows/nthreads); ++i) { //iterate over every object assigned to this thread
        data[i+index][X_FORCE] = 0; //reset all forces to 0
        data[i+index][Y_FORCE] = 0;
        data[i+index][X_ACCEL] = 0;
        data[i+index][X_ACCEL] = 0;
        for (int j = 0; j < rows; ++j) { //iterate over every possible pair with this object i
            if (i != j && data[j][DELETED] != 1 && data[i+index][DELETED] != 1) { //continue if not comparing an object with itself and if the other object has not been deleted previously.
                x_dist = data[j][X_POS] - data[i+index][X_POS];
                y_dist = data[j][Y_POS] - data[i+index][X_POS];
                distance = sqrtl(powl(x_dist, 2) + powl(y_dist, 2));
                if (distance > data[i+index][RAD] + data[j][RAD]) {
                    force = G * data[i+index][MASS] * data[j][MASS] /
                            powl(distance, 2); //calculate accel, vel, pos, data for pair of non-colliding objects
                    data[i+index][X_FORCE] += force * (x_dist / distance);
                    data[i+index][Y_FORCE] += force * (y_dist / distance);
                    data[i+index][X_ACCEL] = data[i+index][X_FORCE]/data[i+index][MASS];
                    data[i+index][X_VEL] += data[i+index][X_ACCEL]*dt;
                    data[i+index][X_POS] += data[i+index][X_VEL]*dt;
                    data[i+index][Y_ACCEL] = data[i+index][Y_FORCE]/data[i+index][MASS];
                    data[i+index][Y_VEL] += data[i+index][Y_ACCEL]*dt;
                    data[i+index][Y_POS] += data[i+index][Y_VEL]*dt;
                }
                else {
                    if (data[i+index][MASS] < data[j][MASS]) {
                        int temp;
                        temp = i;
                        i = j;
                        j = temp;
                    }
                    //conserve momentum
                    data[i+index][X_VEL] = (data[i+index][X_VEL] * data[i+index][MASS] + data[j][X_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_VEL] = (data[i+index][Y_VEL] * data[i+index][MASS] + data[j][Y_VEL] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve center of mass position
                    data[i+index][X_POS] = (data[i+index][X_POS] * data[i+index][MASS] + data[j][X_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    data[i+index][Y_POS] = (data[i+index][Y_POS] * data[i+index][MASS] + data[j][Y_POS] * data[j][MASS])/(data[i+index][MASS] + data[i+index][MASS]);
                    //conserve mass
                    data[i+index][MASS] += data[j][MASS];
                    //increase radius proportionally to dM
                    data[i+index][RAD] = powl(powl(data[i+index][RAD], 3) + powl(data[j][RAD], 3), ((long double) 1 / (long double) 3));
                    data[j][DELETED] = 1;
                    data[j][MASS] = 0;
                    data[j][RAD] = 0;
                }
            }
        }
    }
    return NULL;
}
This calculates values for velocity, acceleration, etc. and writes them to the array. Each thread does this once for each object assigned to it (i.e. 36 objects means each thread calculates values for 3 objects). The thread then returns, the main loop jumps to the next time step (usually increments of 0.01 seconds), and the process repeats. If two balls collide, their masses, momenta and centers of mass are added, and one of the objects' 'DELETED' entry in its row of the array is set to 1. This object is then ignored in all future iterations. See the main loop below:
int main() {
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;
    short num_obj;
    short sim_time;
    printf("Number of objects to simulate: \n");
    scanf("%hd", &num_obj);
    num_obj = num_obj - num_obj%12;
    printf("Timespan of the simulation: \n");
    scanf("%hd", &sim_time);
    printf("Length of time steps: \n");
    scanf("%f", &dt);
    printf("Relative complexity score: %.2f\n", (((float)sim_time/dt)*((float)(num_obj^2)))/1000);
    thread_array = malloc(nthreads*sizeof(pthread_t));
    thread_ids = malloc(nthreads*sizeof(long));
    populate(num_obj);
    int index;
    for (int i = 0; i < nthreads; ++i) { //initialize all threads
    }
    time_t start = time(NULL);
    print_data();
    for (int i = 0; i < (int)((float)sim_time/dt); ++i) { //main loop of simulation
        for (int j = 0; j < nthreads; ++j) {
            index = j*(rows/nthreads);
            thread_ids[j] = j;
            pthread_create(&thread_array[j], NULL, calculate_step, &index);
        }
        for (int j = 0; j < nthreads; ++j) {
            pthread_join(thread_array[j], NULL);
            //pthread_exit(NULL);
        }
    }
    time_t end = time(NULL) - start;
    printf("\n");
    print_data();
    printf("Took %zu seconds to simulate %d frames with %d objects initially, now %d objects.\n", end, (int)((float)sim_time/dt), num_obj, rows);
}
Every time the program runs, I get the following message:
Number of objects to simulate:
36
Timespan of the simulation:
10
Length of time steps:
0.01
Relative complexity score: 38.00
Process finished with exit code -1073740940 (0xC0000374)
which seems to indicate the heap is getting corrupted. I am guessing this has to do with the data array pointer being a global variable, but that was my workaround for only being allowed to pass one argument to the pthread function.
I have tried stepping through the program with the debugger, and it seems to work when I run it in debug mode (I am using CLion), but not in a regular build. Furthermore, when I debug the program and it outputs the values of the data array for the last simulation 'frame', the first chunk of values, which was supposed to be handled by the first thread spawned, is unchanged. When I step through with the debugger, however, I can see that thread being created in the thread-creation loop. What are some issues with this code structure, and what could be causing the heap corruption and the first thread doing nothing?
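One detail worth noting in the launch loop above: every pthread_create call passes &index, so all threads read the same variable, which the loop keeps overwriting; a thread that starts late can easily observe another iteration's value. A common way around the one-argument limit is to give each thread its own argument struct. This is only a sketch under the assumption of a hypothetical thread_arg type; the field names are illustrative, not from the original code:
#include <pthread.h>

typedef struct {
    int start;          /* first object this thread owns */
    int count;          /* number of objects it owns */
    long double **data; /* simulation array, passed explicitly instead of via a global */
} thread_arg;

void *calculate_step_sketch(void *p) {
    thread_arg *arg = p;
    for (int i = arg->start; i < arg->start + arg->count; ++i) {
        /* ... per-object work on arg->data[i] ... */
    }
    return NULL;
}

/* Each thread gets its own argument object, so no thread ever reads an
   index variable that a later loop iteration has already overwritten. */
void launch_step(pthread_t *threads, thread_arg *args, int nthreads, int rows, long double **data) {
    for (int j = 0; j < nthreads; ++j) {
        args[j].start = j * (rows / nthreads);
        args[j].count = rows / nthreads;
        args[j].data  = data;
        pthread_create(&threads[j], NULL, calculate_step_sketch, &args[j]);
    }
    for (int j = 0; j < nthreads; ++j)
        pthread_join(threads[j], NULL);
}
This also removes the need for data to be a global, since it travels inside the per-thread struct.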

MPI Program Does Not "Speed Up" After Implementing Parallel Computing Techniques

I am developing an MPI parallel program designed specifically to solve problem 2 on Project Euler. The original problem statement can be found here. My code works without any compilation errors, and the correct answer is returned consistently (which can be verified on the website).
However, I thought it would be worthwhile to use MPI_Wtime() to gather data on how long it takes to execute the MPI program using 1, 2, 3, and 4 processes. To my surprise, I found that my program takes longer to execute as more processes are included. This is contrary to my expectations, as I thought increasing the number of processes would reduce the computation time (speed up) according to Amdahl’s law. I included my code for anyone who may be interested in testing this for themselves.
#include <mpi.h>
#include <stdio.h>
#include <tgmath.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size, start_val, end_val, upperLimit;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    upperLimit = 33;
    start_val = rank * (upperLimit / size) + 1;
    int num1 = 1; int num2 = 1; int SumNum = num1 + num2; int x = 0;
    double start, end;
    // begin timing
    start = MPI_Wtime();
    // arbitrarily inflate the number of computations
    // to make the program take longer to compute
    // change to i < 1 for only 1 computation
    for (int i = 0; i < 10000000; i++) {
        // generate an algorithm that defines the range of
        // each process to handle for the fibb_sequence problem.
        if (rank == (size - 1)) {
            end_val = upperLimit;
        }
        else {
            end_val = start_val + (upperLimit / size) - 1;
        }
        /*
        calculations before this code indicate that it will take exactly 32 separate algorithm computations
        to get to the largest number before exceeding 4,000,000 in the fibb sequence. This can be done with a simple
        computation, but this calculation will not be listed in this code.
        */
        long double fibb_const = (1 + sqrt(5)) / 2; int j = start_val - 1; long double fibb_const1 = (1 - sqrt(5)) / 2;
        // calculate fibb sequence positions for the sequence using a formula
        double position11 = (pow(fibb_const, j) - pow(fibb_const1, j)) / (sqrt(5));
        double position12 = (pow(fibb_const, j + 1) - pow(fibb_const1, (j + 1))) / (sqrt(5));
        position11 = floor(position11);
        position12 = floor(position12);
        // dynamically assign values to each process to generate a solution quickly
        if (rank == 0) {
            for (int i = start_val; i < end_val; i++) {
                SumNum = num1 + num2;
                num1 = num2;
                num2 = SumNum;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process 0 reports %d \n \n", SumNum);
                    //fflush(stdout);
                }
            }
        }
        else {
            for (int i = start_val; i < end_val; i++) {
                SumNum = position12 + position11;
                if (SumNum % 2 == 0) {
                    x = x + SumNum;
                    //printf("Process %d reports %d \n \n", rank, SumNum);
                    //fflush(stdout);
                }
                position11 = position12;
                position12 = SumNum;
            }
        }
        int recieve_buf = 0;
        MPI_Reduce(&x, &recieve_buf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            //printf("This is the final solution: %d \n \n", recieve_buf);
            //fflush(stdout);
        }
    }
    // end timing
    end = MPI_Wtime();
    // timer goes here
    double elapsed_time = end - start;
    printf("I am rank %d, and I report a walltime of %f seconds.", rank, elapsed_time);
    // end the MPI code
    MPI_Finalize();
    return 0;
}
Note that I utilize 10000000 computations in a for loop to intentionally increase the computation time.
I have attempted to investigate this by using time.h and chrono in alternate versions of this code to cross-reference my results. Consistently, the computation time seems to increase as more processes are included. I saw a similar SO post here, but I could use an additional explanation.
How I Run my Code
I use mpiexec -n <process_count> <my_file_name>.exe to run my code from the Visual Studio 2022 command prompt. Additionally, I have tested this code on macOS by running mpicc foo.c followed by mpiexec -n <process_count> ./a.out. All my best efforts seem to produce data contrary to my expectations.
Hopefully this question isn't too vague. I will provide more information if needed.
System Info
I am currently using an x64-based Lenovo PC running Windows 11. Thanks again.
This is a case of the granularity being too fine. Granularity is the amount of work between synchronization points relative to the cost of a synchronization.
Let's say your MPI_Reduce takes one, or a couple of, microseconds. (A figure that has stayed fairly constant over the past few decades!) That's enough time to do a few thousand arithmetic operations. So for speedup to occur, you need many thousands of operations between the reductions. You don't have that, so the runtime of your code is completely dominated by the cost of the MPI calls, and that cost does not go down with the number of processes.
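As a rough illustration of that point (this is a sketch of the pattern, not the assignment code): do all of the rank-local work first, then synchronize exactly once, so the work between synchronization points dwarfs the microseconds a reduction costs.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double start = MPI_Wtime();

    /* Coarse granularity: a large block of purely local work per rank... */
    long local_sum = 0;
    for (long i = rank; i < 100000000; i += size)
        local_sum += i % 7;

    /* ...followed by a single synchronization point. */
    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("total = %ld, elapsed = %f s\n", total, elapsed);

    MPI_Finalize();
    return 0;
}
With the reduction hoisted out of the hot loop, adding ranks shrinks each rank's local loop while the synchronization cost stays roughly constant, which is when speedup becomes visible.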

Multiply large numbers using multithreading

I need to multiply large numbers using multithreading. The two numbers to be multiplied can have up to 10000 digits. I have written multiplication code using a single thread. But I am not sure how to multiply when I am assigning multiple threads to different digits.
For example, if the two numbers are 254678 and 378929 and there are 3 threads, I am assigning two digits to each thread: (2,5 -> Thread 1), (4,6 -> Thread 2), (7,8 -> Thread 3), and each of those digits should multiply the digits of the 2nd number, 378929.
When the threads run in parallel, I don't know how to manage the carry variable, since multiple threads will be updating it at the same time.
input: array contains both the numbers
index: i1 contains the last digit of 1st number
index: i2 contains the last digit of 2nd number
for (int i = i1-1; i >= 1; i--) {
    int carry = 0;
    int n1 = input[i];
    t2 = 0;
    for (int j = i2-1; j > i1-1; j--) {
        int n2 = input[j];
        int sum = n1*n2 + output[t1+t2] + carry;
        carry = sum/10;
        output[t1+t2] = sum % 10;
        t2++;
    }
    if (carry > 0)
        output[t1 + t2] += carry;
    t1++;
}
int main() {
    pthread_t threads[MAX_THREAD];
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, &multiply, (void*)NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
}
When the threads run in parallel, I don't know how to manage the carry variable, since multiple threads will be updating it at the same time.
Don't allow multiple threads to update anything at the same time.
Specifically (assuming each digit is a digit in "base 1<<32"), each thread can do something like this:
my_accumulator = table_of_per_thread_accumulators[thread_number];

for (int src1_digit_number = startDigit; src1_digit_number >= 0; src1_digit_number -= NUM_THREADS) {
    for (int src2_digit_number = src2_digits-1; src2_digit_number >= 0; src2_digit_number--) {
        int dest_digit_number = src1_digit_number + src2_digit_number;
        uint32_t n1 = input1[src1_digit_number];
        uint32_t n2 = input2[src2_digit_number];
        uint64_t r = (uint64_t)n1 * n2;
        while (r != 0) {
            uint64_t temp = my_accumulator[dest_digit_number] + (r & 0xFFFFFFFFUL);
            my_accumulator[dest_digit_number++] = temp;
            temp >>= 32;
            r = (r >> 32) + temp;
        }
    }
}
Then, when all threads have finished (after the pthread_join() calls), you add up each thread's separate my_accumulator to get the actual result, as sketched below.
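A minimal sketch of that final merge, assuming each per-thread accumulator is an array of base-2^32 digits of the same length (the function and parameter names are illustrative):
#include <stdint.h>
#include <stddef.h>

/* Sketch: add each thread's base-2^32 accumulator into result[], digit by
   digit, carrying as needed.  num_digits must cover the full product width. */
static void merge_accumulators(uint32_t *result, size_t num_digits,
                               uint32_t *const *accumulators, int num_threads)
{
    for (int t = 0; t < num_threads; t++) {
        uint64_t carry = 0;
        for (size_t d = 0; d < num_digits; d++) {
            uint64_t sum = (uint64_t)result[d] + accumulators[t][d] + carry;
            result[d] = (uint32_t)sum;   /* low 32 bits stay in this digit */
            carry = sum >> 32;           /* high bits carry into the next digit */
        }
        /* carry is 0 here if num_digits is large enough for the full product */
    }
}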
Note 1: In theory you can do some "binary merging of accumulators", such that an odd-numbered thread waits for the next lower-numbered thread to finish and adds that thread's accumulator to its own; then threads satisfying "my_thread_number % 4 == 3" wait for the thread numbered "my_thread_number - 2" to finish and add its accumulator to their own; then... This is likely to be too complicated and messy to bother with.
Note 2: The alternative (if you use a single output[] that's modified by multiple threads) is to have a mutex to ensure only one thread can modify output[] at a time (or multiple mutexes to ensure only one thread can modify a piece of output[] at a time); and this will destroy performance so badly that it would be faster to use a single thread.
Note 3: The alternative alternative is to use atomics somehow, or use inline assembly (e.g. "lock add [output+rdi],eax;") so that no mutex is needed. This is still very bad because the CPUs will be fighting for exclusive access to the same cache line (and you'll ruin performance by having CPUs spend most of their time trying to get that exclusive access).

How to implement summation using parallel reduction in OpenCL?

I'm trying to implement a kernel which does parallel reduction. The code below works on occasion; I have not been able to pin down why it goes wrong when it does.
__kernel void summation(__global float* input, __global float* partialSum, __local float *localSum) {
    int local_id = get_local_id(0);
    int workgroup_size = get_local_size(0);
    localSum[local_id] = input[get_global_id(0)];

    for (int step = workgroup_size/2; step > 0; step /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (local_id < step) {
            localSum[local_id] += localSum[local_id + step];
        }
    }

    if (local_id == 0) {
        partialSum[get_group_id(0)] = localSum[0];
    }
}
Essentially I'm summing the values per work group and storing each work group's total into partialSum, the final summation is done on the host. Below is the code which sets up the values for the summation.
size_t global[1];
size_t local[1];
const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;
local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));
int i;
for (i = 0; i < count; i++) {
    input[i] = (float)i + 1;
}
I'm thinking it has something to do with the size of the input not being a power of two? I noticed it begins to go wrong for input sizes around 8000 and beyond. Any assistance is welcome. Thanks.
I'm thinking it has something to do with the size of the input not being a power of two?
Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch 1 work-group of 9 work-items:
for (int step = workgroup_size / 2; step > 0; step /= 2) {
    // At iteration 0: step = 9 / 2 = 4
    barrier(CLK_LOCAL_MEM_FENCE);
    if (local_id < step) {
        // Branch taken by threads 0 to 3
        // Only 8 numbers added up together!
        localSum[local_id] += localSum[local_id + step];
    }
}
You're never summing the 9th element, so the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to bring each work-group up to the next power-of-two size.
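A minimal host-side sketch of that padding idea, reusing DATA_SIZE, input, global and local from the snippet in the question (calloc and memcpy need <stdlib.h> and <string.h>); the zero padding contributes nothing to the sums:
/* Grow the element count to the next power of two and zero-fill the tail. */
size_t padded_count = 1;
while (padded_count < (size_t)DATA_SIZE)
    padded_count *= 2;                       /* e.g. 15000 -> 16384 */

float *padded_input = (float *)calloc(padded_count, sizeof(float));
memcpy(padded_input, input, DATA_SIZE * sizeof(float));

global[0] = padded_count;                    /* now a power of two */
local[0]  = 2;                               /* any power of two that divides global[0] and fits the device limit */
/* create the input buffer from padded_input, and size partialSum as global[0]/local[0] floats */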

Dividing processes evenly among threads

I am trying to come up with an algorithm to divide a number of processes as evenly as possible over a number of threads. Each process takes the same amount of time.
The number of processes can vary, from 1 to 1 million. The threadCount is fixed, and can be anywhere from 4 to 48.
The code below does divide all the work evenly, except for the last case, where I throw in what is left over.
Is there a way to fix this so that the work is spread more evenly?
void main(void)
{
    int processBegin[100];
    int processEnd[100];
    int activeProcessCount = 6243;
    int threadCount = 24;
    int processsInBundle = (int) (activeProcessCount / threadCount);
    int processBalance = activeProcessCount - (processsInBundle * threadCount);

    for (int i = 0; i < threadCount; ++i)
    {
        processBegin[ i ] = i * processsInBundle;
        processEnd[ i ] = (processBegin[ i ] + processsInBundle) - 1;
    }
    processEnd[ threadCount - 1 ] += processBalance;

    FILE *debug = fopen("s:\\data\\testdump\\debug.csv", WRITE);
    for (int i = 0; i < threadCount; ++i)
    {
        int processsInBucket = (i == threadCount - 1) ? processsInBundle + processBalance : processBegin[i+1] - processBegin[i];
        fprintf(debug, "%d,start,%d,stop,%d,processsInBucket,%d\n", activeProcessCount, processBegin[i], processEnd[i], processsInBucket);
    }
    fclose(debug);
}
Give the first activeProcessCount % threadCount threads processInBundle + 1 processes each, and give the others processInBundle.
int processInBundle = (int) (activeProcessCount / threadCount);
int processSoFar = 0;
for (int i = 0; i < activeProcessCount % threadCount; i++) {
    processBegin[i] = processSoFar;
    processSoFar += processInBundle + 1;
    processEnd[i] = processSoFar - 1;
}
for (int i = activeProcessCount % threadCount; i < threadCount; i++) {
    processBegin[i] = processSoFar;
    processSoFar += processInBundle;
    processEnd[i] = processSoFar - 1;
}
That's the same problem as trying to divide 5 pennies among 3 people. It's just impossible unless you can saw the pennies in half.
Also, even if all processes need the same theoretical runtime, they won't necessarily execute in the same amount of time, due to kernel scheduling, cache performance and various other hardware-related factors.
To suggest some performance optimisations:
Use dynamic scheduling, i.e. split your work into batches (which can be as small as size 1) and have your threads take one batch at a time, run it, then take the next one. This way the threads will always be working until all batches are gone.
A more advanced variant is to start with a big batch size (commonly numwork/numthreads) and decrease it each time a thread takes work out of the pool; OpenMP refers to this as guided scheduling. A minimal sketch of the basic dynamic version is below.
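Here is a sketch of the dynamic-scheduling idea in C11: a shared atomic counter hands out fixed-size batches, and each thread keeps claiming batches until the work runs out. The batch size, thread count and do_process are illustrative assumptions, not part of the original code:
#include <stdatomic.h>
#include <pthread.h>

#define BATCH 64

static atomic_int next_process;      /* index of the first unclaimed process */
static int total_processes;

static void do_process(int p) { (void)p; /* ... the actual per-process work ... */ }

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        /* Claim the next batch; whichever thread gets here first takes it. */
        int start = atomic_fetch_add(&next_process, BATCH);
        if (start >= total_processes)
            break;
        int end = start + BATCH;
        if (end > total_processes)
            end = total_processes;
        for (int p = start; p < end; ++p)
            do_process(p);
    }
    return NULL;
}

int main(void)
{
    total_processes = 6243;           /* example value from the question */
    atomic_init(&next_process, 0);

    enum { NTHREADS = 24 };
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
Faster threads simply claim more batches, so uneven hardware behaviour no longer leaves some threads idle while others still have a fixed share left to do.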
