Trying to make this C code multithreaded

I have an app that can spawn any given number of threads, and I'd like to make this code multithreaded:
void *some_thread_function(void *param)
{
    struct some_struct *obj = (struct some_struct *)param;
    int m = obj->m;
    int n = ...
    int i, j;
    double t[m+2][n+2]; /* a VLA cannot take an initializer; the loops below fill it */
    for (i = 0; i <= m+1; i++) {
        for (j = 0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }
    for (i = 1; i <= m; i++) {
        t[i][0] = 40.0;
        t[i][n+1] = 90.0;
    }
    for (j = 1; j <= n; j++) {
        t[0][j] = 30.0;
        t[m+1][j] = 50.0;
    }
    memcpy(global_t, t, ...);
    return NULL;
}
My reasoning for making this multithreaded is simple: if I have 5 threads (the thread count is taken as a program parameter at startup) and n = 20, m = 20 (also fed as parameters at startup), then the first loop's rows could be split so one thread works on rows 0-4, a second thread on rows 5-8, and so on, until the last thread handles 16-20 (just an example, since m, n, and the thread count can be any values the user supplies).
But more importantly, I am having a tough time seeing how to dissect the three for loops so the work is distributed across multiple threads and all the loops run to completion. The code itself is simple; it's just a real-world example of a scenario I don't understand how to express as a threaded program.

Move this piece of code into a function:
for (i = 0; i <= m+1; i++) {
    for (j = 0; j <= n+1; j++) {
        t[i][j] = 30.0;
    }
}
as follows:
void initialize(int n, double t[][n+2], int start_i, int end_i) {
    int i, j;
    for (i = start_i; i <= end_i; i++) {
        for (j = 0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }
}
You can then split the interval [0, m+1] into 5 subintervals and have each thread call the initialize function on its own subinterval.
That said, there must be more efficient ways to achieve the same thing using some copy instructions.

Parallel for is less efficient than serial?

I've been working on a small college project in C and OpenMP. To make it short: when I try to parallelize a for loop using the #pragma omp parallel for construct, it ends up way slower than the serial version, just from adding that. It is a parallel version of odd-even sort that works on an array of integers.
I found it has something to do with the threads each accessing the memory of the whole array every time they compare numbers, updating their own cached copy of it. But I don't know how to fix it so that, rather than pulling in the whole array, each thread only touches the exact locations of the integers it is comparing. I'm fairly new to OpenMP, so I don't know if there's a clause or construct for this kind of situation.
// version without parallel for
void bubbleSortParalelo(int array[], int size) {
    int i, j, first;
    for (i = 0; i < size; i++){
        first = i % 2;
        for (j = first; j < size-1; j += 2){
            if (array[j] > array[j+1]){
                int temp = array[j+1];
                array[j+1] = array[j];
                array[j] = temp;
            }
        }
    }
}
// Version with parallel for; somehow takes longer
void bubbleSortParalelo2(int array[], int size) {
    int i, j, first;
    for (i = 0; i < size; i++){
        first = i % 2;
        #pragma omp parallel for
        for (j = first; j < size-1; j += 2){
            if (array[j] > array[j+1]){
                int temp = array[j+1];
                array[j+1] = array[j];
                array[j] = temp;
            }
        }
    }
}
I want to make the parallel version at least as efficient as the serial one, because right now it takes about 10 times longer, and it gets worse the more threads I use.

managing memory in C, 2D float array

I'm very inexperienced with C and have to use it for a piece of coursework exploring heat transfer.
I'm getting all sorts of errors with the 0xC0000005 return code. I'm aware that this indicates an attempt to access memory out of bounds and that I'm probably failing to allocate memory properly somewhere, but I haven't been able to figure it out.
Any idea what needs changing?
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(){
    /* constant alpha squared, the area of the domain */
    const float alphSqrd = 0.01;
    const float deltX, deltY = 0.0002;
    const float timeStep = 0.0000009;
    const int maxTimeSteps = 1000;
    int h, i, j;

    /* 2D Array of Temperature Values */
    float T [500][500];
    float dTdt;

    /* initialise temperature values */
    printf("Initialising 2D array...\n");
    for (i=0; i<500; i++){
        for (j=0; j<500; j++){
            if (150<=i && i<350 && 150<=j && j<350){
                T[i][j] = 50;
            } else {
                T[i][j] = 20;
            }
        }
    }

    printf("Updating values...\n");
    for (h=0; h<maxTimeSteps; h++){
        for (i=0; i<500; i++){
            for (j=0; j<500; j++){
                dTdt = alphSqrd*(((T[i+1][j]-2*T[i][j]+T[i-1][j])/(deltX*deltX))+((T[i][j+1]-2*T[i][j]+T[i][j-1])/(deltY*deltY)));
                T[i][j] = T[i][j] + dTdt * timeStep;
                printf("%f ",T[i][j]);
            }
            printf("\n");
        }
    }
    return 0;
}
Although that's not the problem you're describing, one issue is that you're not initializing deltX. Instead of this
const float deltX, deltY = 0.0002;
you want
const float deltX = 0.0002, deltY = 0.0002;
Aside from that, you have an out of range issue. If you're going to access index i - 1 and i + 1, you can't loop from 0 to 499 on an array of 500 elements.
It works for me if I adjust the loops like this:
for (i = 1; i < 499; i++) {
for (j = 1; j < 499; j++) {
During your dTdt calculation you use T[i-1][j] and T[i+1][j]. If i is at its minimum (0) or maximum (499), this exceeds the array bounds. The same holds for j.
Thus, you read uninitialised memory. You need to loop from 1 to 498 and treat the boundary separately, depending on the problem you are trying to solve.
First, adjust the loop minimum and maximum values:
for (i=1; i<499; i++){
for (j=1; j<499; j++){
You should also comment out the printf() calls; writing to stdout dominates the time spent in the loop.
for (h=0; h<maxTimeSteps; h++){
    for (i=1; i<499; i++){
        for (j=1; j<499; j++){
            dTdt = alphSqrd*(((T[i+1][j]-2*T[i][j]+T[i-1][j])/(deltX*deltX))+((T[i][j+1]-2*T[i][j]+T[i][j-1])/(deltY*deltY)));
            T[i][j] = T[i][j] + dTdt * timeStep;
            //printf("%f ",T[i][j]);
        }
        //printf("\n");
    }
}
I compiled this code on a Linux VM and it ran in about 3 seconds.

Multithreading for backpropagation algorithm

To speed up some neural network learning, I tried some multi-threading, since for a particular layer the calculations for each neuron are independent of one another.
The original function I used is some basic backpropagation algorithm, for the evaluation of the deltas in the net :
δ = derivative * Σ (weight * previous δ)
void backpropagation(Autoencoder* AE)
{
    int i, j, k;
    for(i = AE->numLayer-2; i >= 0; i--)
    {
        for(j = 0; j < AE->layer[i].size; j++)
        {
            register double sum = 0.0;
            for(k = 0; k < AE->layer[i+1].size; k++)
            {
                sum += AE->layer[i+1].neuron[k].weight[j] * AE->layer[i+1].neuron[k].delta;
            }
            AE->layer[i].neuron[j].delta = AE->layer[i].neuron[j].derivative * sum;
        }
    }
}
Autoencoder is the structure containing the neural net. It worked well enough, if a bit slow, and it seemed like a good function to try first.
The modified functions are the following:
void backpropagationmultithread(Autoencoder* AE, unsigned int ncore, pthread_t* pth)
{
    int i, j;
    unsigned int neuronpercore, extra;
    sem_t semaphore;
    argThread* args[ncore];
    for(i = AE->numLayer-2; i >= 0; i--)
    {
        neuronpercore = AE->layer[i].size / ncore;
        extra = neuronpercore + (AE->layer[i].size % ncore);
        sem_init(&semaphore, 0, -ncore);
        for(j = 0; j < ncore; j++)
        {
            args[j] = malloc(sizeof(argThread));
            args[j]->layer = i;
            args[j]->AE = AE;
            args[j]->sem = &semaphore;
            args[j]->startat = neuronpercore * j;
            args[j]->nneurons = (j!=ncore-1)?neuronpercore:extra;
            pthread_create(&pth[j], NULL, backpropagationthread, (void*)args[j]);
        }
        sem_wait(&semaphore);
        for(j = 0; j < ncore; j++)
        {
            pthread_cancel(pth[j]);
        }
    }
}
And the function for the new threads :
void* backpropagationthread(void* arg)
{
    argThread* args = (argThread*) arg;
    unsigned int j, k, layer = args->layer, start = args->startat, end = args->startat + args->nneurons;
    Autoencoder* AE = args->AE;
    for(j = start; j < end; j++)
    {
        register double sum = 0.0;
        for(k = 0; k < AE->layer[layer+1].size; k++)
        {
            sum += AE->layer[layer+1].neuron[k].weight[j] * AE->layer[layer+1].neuron[k].delta;
        }
        AE->layer[layer].neuron[j].delta = AE->layer[layer].neuron[j].derivative * sum;
    }
    sem_post(args->sem);
    free(arg);
    return NULL;
}
argThread is just a small structure containing the arguments passed to each thread, and ncore is the number of CPU cores. The idea was to split each layer into roughly equal groups of neurons, each handled by its own thread (the last thread takes any extra neurons when the layer size is not a multiple of the core count).
The new function does work to some degree, and much faster, but beyond a certain threshold it no longer converges where the old function did, and I cannot find why its behaviour would change. Am I missing some neurons or weights?
I implemented a multi-threaded backpropagation algorithm for Encog some time ago. I wrote it in Java, but for the C implementation I used OpenMP rather than pthreads. You can see my C implementation here:
https://github.com/encog/encog-c
I also wrote an article about my approach to multi-threaded backpropagation. You can see it here:
http://www.heatonresearch.com/encog/mprop/compare.html
There are a few other questions about this on Stack Overflow, as well. Most seem to reference my algorithm.
Multithreaded backpropagation
How can I apply multithreading to the backpropagation neural network training?

Why is my program not sorting the struct?

I am trying to create a program that takes an int from the user, then generates that many random pairs of numbers inside an array of structs. It then sorts this array by the sum of each randomized pair.
However, my program won't sort the array of structs. It doesn't sort properly and I'm not sure why. Here is the code.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX 10

struct NumPair{
    int n, m;
};

int main()
{
    int i, j, amount=0;
    struct NumPair NumPair[MAX];
    srand(time(NULL));

    printf("How many pairs of numbers? (max 10): ");
    scanf("%d", &amount);

    for (i=0; i<amount; i++)
    {
        NumPair[i].n = rand() % 11;
        NumPair[i].m = rand() % 11;
    }

    for (i=0; i<amount; i++)
    {
        for(j=1; j<amount; j++)
        {
            if( (NumPair[i].n+NumPair[i].m) > (NumPair[j].n+NumPair[j].m) )
            {
                int tmp;
                tmp = NumPair[i].n;
                NumPair[i].n = NumPair[j].n;
                NumPair[j].n = tmp;
                tmp = NumPair[i].m;
                NumPair[i].m = NumPair[j].m;
                NumPair[j].m = tmp;
            }
        }
    }

    for (i=0; i<amount; i++)
    {
        printf(" NumPair %d: (%d,%d)\n", i+1, NumPair[i].n, NumPair[i].m);
    }

    return 0;
}
What am I missing? It's probably something very silly.
Thanks in advance.
Your algorithm is incorrect. This little snippet:
for (i=0; i<amount; i++) {
    for(j=1; j<amount; j++) {
will result in situations where i is greater than j, and then your comparison/swap operation is faulty (it swaps if the i-th element is greater than the j-th one, which, when i > j, is the wrong comparison).
I should mention that (unless this is homework or some other education) C has a perfectly adequate qsort() function that will do the heavy lifting for you. You'd be well advised to learn that.
If it is homework/education, I think I've given you enough to nut it out. You should find the particular algorithm you're trying to implement and revisit your code for it.
You are comparing the elements at indices i and j. Bubble sort should compare each element with the next one:
for (i=0; i<amount; i++)  //pseudo code
{
    for(j=0; j<amount-1; j++)
    {
        if( NumPair[j] > NumPair[j+1] )  //compare your elements
        {
            //swap
        }
    }
}
Note that the second loop will go only until amount-1 since you don't want to step out of bounds of the array.
Change to:
for (i=0; i<amount-1; i++){
    for(j=i+1; j<amount; j++){

Parallelizing giving wrong output

I'm having some problems trying to parallelize an algorithm. The intention is to make some modifications to a 100x100 matrix. When I run the algorithm without OpenMP, everything runs smoothly in about 34-35 seconds. When I parallelize over 2 threads (it needs to be exactly 2 threads), it gets down to about 22 seconds, but the output is wrong, and I think it's a synchronization problem that I cannot fix.
Here's the code :
for (p = 0; p < sapt; p++){
    memset(count, 0, Nc*sizeof(int));
    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++){
            for (m = 0; m < Nc; m++)
                dist[m] = N+1;
            omp_set_num_threads(2);
            #pragma omp parallel for shared(configurationMatrix, dist) private(k,m) schedule(static,chunk)
            for (k = 0; k < N; k++){
                for (m = 0; m < N; m++){
                    if (i == k && j == m)
                        continue;
                    if (MAX(abs(i-k),abs(j-m)) < dist[configurationMatrix[k][m]])
                        dist[configurationMatrix[k][m]] = MAX(abs(i-k),abs(j-m));
                }
            }
            int max = -1;
            for (m = 0; m < Nc; m++){
                if (dist[m] == N+1)
                    continue;
                if (dist[m] > max){
                    max = dist[m];
                    configurationMatrix2[i][j] = m;
                }
            }
        }
    }
    memcpy(configurationMatrix, configurationMatrix2, N*N*sizeof(int));
    #pragma omp parallel for shared(count, configurationMatrix) private(i,j)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            count[configurationMatrix[i][j]]++;
    for (i = 0; i < Nc; i++)
        fprintf(out, "%i ", count[i]);
    fprintf(out, "\n");
}
where sapt = 100;
count is an array that records how many of each element the matrix holds at each step
(e.g. count[1] = 60 means the element '1' appears 60 times in the matrix, and so on);
dist is an array that holds, for the element at (i,j) with some value K, the maximum distance to another element (k,m) of the same value K
(e.g. dist[1] = 10 means the distance from an element of value 1 to the furthest element of value 1 is 10).
Then I write the results to an output file, but again, the output is wrong.
If I understand your code correctly this line
count[configurationMatrix[i][j]] ++;
increments count at the element whose index is at configurationMatrix[i][j]. I don't see that your code takes any steps to ensure that threads are not simultaneously trying to increment the same element of count. It's entirely feasible that two different elements of configurationMatrix provide the same index into count and that those two elements are handled by different threads. Since ++ is not an atomic operation your code has a data race; multiple threads can contend for update access to the same variable and you lose any guarantees of correctness, or determinism, in the result.
I think you may have other examples of the same problem in other parts of your code too. You are silent on the errors you observe in the results of the parallel program compared with the results from the serial program yet those errors are often very useful in diagnosing a problem. For example, if the results of the parallel program are not the same every time you run it, that is very suggestive of a data race somewhere in your code.
How to fix this? Since you only have 2 threads, the easiest fix would be not to parallelise this part of the program. You could wrap the racy update in an OpenMP critical section, but that is really just another way of serialising your code. Finally, you could modify your algorithm and data structures to avoid this problem entirely.
