Matrix Multiply with Threads (each thread does single multiply) - c

I'm looking to do a matrix multiply using threads where each thread does a single multiplication and then the main thread will add up all of the results and place them in the appropriate spot in the final matrix (after the other threads have exited).
The way I am trying to do it is to create a single row array that holds the results of each thread. Then I would go through the array and add + place the results in the final matrix.
Ex: If you have the matrices:
A = [{1,4}, {2,5}, {3,6}]
B = [{8,7,6}, {5,4,3}]
Then I want an array holding [8, 20, 7, 16, 6, 12, 16 etc]
I would then loop through the array adding up every 2 numbers and placing them in my final array.
This is a HW assignment so I am not looking for exact code, but some logic on how to store the results in the array properly. I'm struggling with how to keep track of where I am in each matrix so that I don't miss any numbers.
Thanks.
EDIT2: Forgot to mention that there must be a single thread for every single multiplication to be done. Meaning for the example above, there will be 18 threads each doing its own calculation.
EDIT: I'm currently using this code as a base to work off of.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10
int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];
struct v {
int i; /* row */
int j; /* column */
};
void *runner(void *param); /* the thread */
int main(int argc, char *argv[]) {
int i,j, count = 0;
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
//Assign a row and column for each thread
struct v *data = (struct v *) malloc(sizeof(struct v));
data->i = i;
data->j = j;
/* Now create the thread passing it data as a parameter */
pthread_t tid; //Thread ID
pthread_attr_t attr; //Set of thread attributes
//Get the default attributes
pthread_attr_init(&attr);
//Create the thread
pthread_create(&tid,&attr,runner,data);
//Make sure the parent waits for all thread to complete
pthread_join(tid, NULL);
count++;
}
}
//Print out the resulting matrix
for(i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
printf("%d ", C[i][j]);
}
printf("\n");
}
}
//The thread will begin control in this function
void *runner(void *param) {
struct v *data = param; // the structure that holds our data
int n, sum = 0; //the counter and sum
//Row multiplied by column
for(n = 0; n< K; n++){
sum += A[data->i][n] * B[n][data->j];
}
//assign the sum to its coordinate
C[data->i][data->j] = sum;
//Exit the thread
pthread_exit(0);
}
Source: http://macboypro.wordpress.com/2009/05/20/matrix-multiplication-in-c-using-pthreads-on-linux/

You need to store M * K * N element-wise products. The idea is presumably that the threads will all run in parallel, or at least will be able to do so, so each thread needs its own distinct storage location of appropriate type. A straightforward way to do that would be to create an array with that many elements ... but of what element type?
Each thread will need to know not only where to store its result, but also which multiplication to perform. All of that information needs to be conveyed via a single argument of type void *. One would typically, then, create a structure type suitable for holding all the data needed by one thread, create an instance of that structure type for each thread, and pass pointers to those structures. Sounds like you want an array of structures, then.
The details could be worked a variety of ways, but the one that seems most natural to me is to give the structure members for the two factors, and a member in which to store the product. I would then have the main thread declare a 3D array of such structures (if the needed total number is smallish) or else dynamically allocate one. For example,
struct multiplication {
// written by the main thread; read by the compute thread:
int factor1;
int factor2;
// written by the compute thread; read by the main thread:
int product;
} partial_result[M][K][N];
How to write code around that is left as the exercise it is intended to be.

Not sure how many threads you would need to dispatch, and I am also not sure whether you would use join later to pick them up. I am guessing you are in C here, so I would use the thread ID as a way to track which row to process. Something like:
#define NUM_THREADS 64
/*
* struct to pass parameters to a dispatched thread
*/
typedef struct {
int value; /* thread number */
char somechar[128]; /* char data passed to thread */
unsigned long ret;
struct foo *row;
} thread_parm_t;
Where I am guessing that each thread will pick up its row data in the pointer *row which has some defined type foo. A bunch of integers or floats or even complex types. Whatever you need to pass to the thread.
/*
* the thread to actually crunch the row data
*/
void *thr_rowcrunch( void *parm );
pthread_t tid[NUM_THREADS]; /* POSIX array of thread IDs */
Then in your main code segment something like :
thread_parm_t *parm=NULL;
Then dispatch the threads with something like :
for ( i = 0; i < NUM_THREADS; i++) {
parm = malloc(sizeof(thread_parm_t));
parm->value = i;
strcpy(parm->somechar, char_data_to_pass);
fill_in_row ( parm->row, my_row_data );
pthread_create(&tid[i], NULL, thr_rowcrunch, (void *)parm);
}
Then later on :
for ( i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
However the real work needs to be done in thr_rowcrunch( void *parm ) which receives the row data and then each thread just knows its own thread number. The guts of what you do in that dispatched thread however I can only guess at.
Just trying to help here, not sure if this is clear.

Related

Synchronization problem between threads using pthread_mutex_t

Basically I need three threads, B, C and D, to work simultaneously. Thread B sums the even indexes of a global array X, C sums the odd indexes of X, and D sums both results while B and C are still summing. I used two mutexes to do so, but it's not working properly.
For the array X given in the code below, the results should be: evenSum = 47, oddSum = 127, bothSum = 174.
Any help is greatly appreciated!
Thanks!!
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#define SIZE 20
int X[SIZE] = {5,4,5,3,7,9,3,3,1,2,9,0,3,43,3,56,7,3,4,4};
int evenSum = 0;
int oddSum = 0;
int bothSum = 0;
//Initialize two mutex semaphores
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER;
void* sum_even_indexes(void* args){
int i;
for(i=0 ; i<20 ; i+=2){
pthread_mutex_lock(&mutex1);
evenSum+=X[i];
pthread_mutex_unlock(&mutex1);
}
pthread_exit(NULL);
}
void* sum_odd_indexes(void* args){
int i;
for(i=1 ; i<20 ; i+=2){
pthread_mutex_lock(&mutex2);
oddSum+=X[i];
pthread_mutex_unlock(&mutex2);
}
pthread_exit(NULL);
}
void* sum_of_both(void* args){
int i;
for(i=0 ; i<SIZE ; i++){
pthread_mutex_lock(&mutex1);
pthread_mutex_lock(&mutex2);
bothSum += (evenSum+oddSum);
pthread_mutex_unlock(&mutex2);
pthread_mutex_unlock(&mutex1);
}
pthread_exit(NULL);
}
int main(int argc, char const *argv[]){
pthread_t B,C,D;
/***
* Create three threads:
* thread B : Sums up the even indexes in the array X
* thread C : Sums up the odd indexes in the array X
* thread D : Sums up both B and C results
*
* Note:
* All threads must work simultaneously
*/
pthread_create(&B,NULL,sum_even_indexes,NULL);
pthread_create(&C,NULL,sum_odd_indexes,NULL);
pthread_create(&D,NULL,sum_of_both,NULL);
//Wait for all threads to finish
pthread_join(B,NULL);
pthread_join(C,NULL);
pthread_join(D,NULL);
pthread_mutex_destroy(&mutex1);
pthread_mutex_destroy(&mutex2);
//Testing Print
printf("Odds Sum = %d\n",oddSum);
printf("Evens Sum = %d\n",evenSum);
printf("Both Sum = %d\n",bothSum);
return 0;
}
The mutexes will not enforce any ordering between D and B or C.
To get such ordering, you need to either run D later (after B and C have been joined) or make D wait on a condition that is set when B and C are done.
mutex1 and mutex2 are actually not needed if you implement proper waiting. You need no mutexes if you start D later, and only one if you wait on a condition variable.
Also, it is not clear why you need a loop in D. You just have a sum of two variables.
In a real application, parallel array processing is usually done by partitioning the array into ranges, not by index modulo. Threading does not make sense at all for small sizes like 20. You'd also prefer thread-pool threads, to avoid thread startup overhead and to control the number of threads. And of course you wouldn't need a thread just for evenSum+oddSum.

why is this c code causing a race condition?

I'm trying to count the number of prime numbers up to 10 million, and I have to do it with multiple POSIX threads (so that each thread computes a subset of the 10 million). However, my code is not checking the condition isPrime. I'm thinking this is due to a race condition. If it is, what can I do to ameliorate this issue?
I've tried using a global integer array with k elements but since k is not defined it won't let me declare that at the file scope.
I'm running my code using gcc -pthread:
/*
Program that spawns off "k" threads
k is read in at command line each thread will compute
a subset of the problem domain(check if the number is prime)
to compile: gcc -pthread lab5_part2.c -o lab5_part2
*/
#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
typedef int bool;
#define FALSE 0
#define TRUE 1
#define N 10000000 // 10 Million
int k; // global variable k will hold the number of threads
int primeCount = 0; //it will hold the number of primes.
//returns whether num is prime
bool isPrime(long num) {
long limit = sqrt(num);
for(long i=2; i<=limit; i++) {
if(num % i == 0) {
return FALSE;
}
}
return TRUE;
}
//function to use with threads
void* getPrime(void* input){
//get the thread id
long id = (long) input;
printf("The thread id is: %ld \n", id);
//how many iterations each thread will have to do
int numOfIterations = N/k;
//check the last thread. to make sure is a whole number.
if(id == k-1){
numOfIterations = N - (numOfIterations * id);
}
long startingPoint = (id * numOfIterations);
long endPoint = (id + 1) * numOfIterations;
for(long i = startingPoint; i < endPoint; i +=2){
if(isPrime(i)){
primeCount ++;
}
}
//terminate calling thread.
pthread_exit(NULL);
}
int main(int argc, char** args) {
//get the num of threads from command line
k = atoi(args[1]);
//make sure is working
printf("Number of threads is: %d\n",k );
struct timespec start,end;
//start clock
clock_gettime(CLOCK_REALTIME,&start);
//create an array of threads to run
pthread_t* threads = malloc(k * sizeof(pthread_t));
for(int i = 0; i < k; i++){
pthread_create(&threads[i],NULL,getPrime,(void*)(long)i);
}
//wait for each thread to finish
int retval;
for(int i=0; i < k; i++){
int * result = NULL;
retval = pthread_join(threads[i],(void**)(&result));
}
//get the time time_spent
clock_gettime(CLOCK_REALTIME,&end);
double time_spent = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec)/1000000000.0f;
printf("Time tasken: %f seconds\n", time_spent);
printf("%d primes found.\n", primeCount);
}
The current output I am getting (using 2 threads):
Number of threads is: 2
Time tasken: 0.038641 seconds
2 primes found.
The counter primeCount is modified by multiple threads, and therefore must be atomic. To fix this using the standard library (which is now supported by POSIX as well), you should #include <stdatomic.h>, declare primeCount as an atomic_int, and increment it with an atomic_fetch_add() or atomic_fetch_add_explicit().
Better yet, if you don’t care about the result until the end, each thread can store its own count in a separate variable, and the main thread can add all the counts together once the threads finish. You will need to create, in the main thread, an atomic counter per thread (so that updates don’t clobber other data in the same cache line), pass each thread a pointer to its output parameter, and then return the partial tally to the main thread through that pointer.
This looks like an exercise that you want to solve yourself, so I won’t write the code for you, but the approach to use would be to declare an array of counters like the array of thread IDs, and pass &counters[i] as the arg parameter of pthread_create() similarly to how you pass &threads[i]. Each thread would need its own counter. At the end of the thread procedure, you would write something like, atomic_store_explicit( (atomic_int*)arg, localTally, memory_order_relaxed );. This should be completely wait-free on all modern architectures.
You might also decide that it’s not worth going to that trouble to avoid a single atomic update per thread, declare primeCount as an atomic_int, and then atomic_fetch_add_explicit( &primeCount, localTally, memory_order_relaxed ); once before the thread procedure terminates.

How to return each thread's output into an array using OpenMP?

I would like to write a multithreaded program in which each thread outputs an array with an unknown number of elements.
For example, select all numbers that are < 10 from an int array and put them into a new array.
Pseudo code (8 threads):
int *hugeList = malloc(10000000);
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
long *subList[8];//to fill each thread's result
#pragma omp parallel
for (long i = 0; i < 1000000; ++i)
{
long n = 0;
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
subList[threadNo][n] = hugeList[i];
n++;
}
}
Array "subList" should collect the elements of "hugeList" that satisfy the condition (< 10), sequentially and ordered by thread number.
How should I write the code? It is OK if there is a better way using OpenMP.
There are several problems in your code.
1/ The omp pragma should be parallel for if you want the for loop to be parallelized. Otherwise, the code will be duplicated in every thread.
2/ The code is inconsistent with the comment:
//do something to fill "subList" properly
hugeList[i] = subList[threadNo][n];
3/ How do you know the number of elements in your sublists? It must be returned to the main thread. You could use an array, but beware of false sharing. Better to use a local variable and write it out at the end of the parallel section.
4/ subList is not allocated. The difficulty is that you do not know the number of threads. You can ask OpenMP for the maximum number of threads (omp_get_max_threads) and allocate dynamically. If you want some static allocation, maybe the best is to allocate one large table and compute the actual address in every thread.
5/ OpenMP code must also work without an OpenMP compiler. Use #ifdef _OPENMP for that.
Here is an (untested) way your code can be written:
#define HUGE 10000000
int *hugeList = (int *) malloc(HUGE * sizeof(int)); // room for HUGE ints, not HUGE bytes
#ifdef _OPENMP
int thread_nbr=omp_get_max_threads();
#else
int thread_nbr=1; // to ensure proper behavior in a sequential context
#endif
struct thread_results { // to hold per thread results
int nbr; // nbr of generated results
int *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe; erand48 with per-thread state would be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
int *subList=(int *)malloc(HUGE*sizeof(int)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results* threadres=(struct thread_results *)malloc(thread_nbr*sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num() ; // hold thread id
int thread_nbr = omp_get_num_threads() ; // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res=threadres+thread_id;
res->nbr=0;
// compute address in subList table
res->results=subList+(HUGE/thread_nbr)*thread_id;
int * res_ptr=res->results; // local pointer. Each thread points to independent part of subList table
int n=0; // number of results. We want one per thread to only have local updates.
#pragma omp for
for (long i = 0; i < 1000000; ++i)
{
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
res_ptr[n]=hugeList[i];
n++;
}
}
res->nbr=n;
}
Updated complete code based on @Alain Merigot's answer
I tested the following code. Its output is reproducible (with or without the #pragma directives).
However, only the first elements of subList are correct, while the rest are empty.
(filename.c)
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <math.h>
#define HUGE 10000000
#define DELAY 1000 //depends on your CPU power
//use global variables to store desired results, otherwise can't be obtain outside "pragma"
int n = 0;// number of results. We want one per thread to only have local updates.
double *subList;// table to hold thread results
int main()
{
double *hugeList = (double *)malloc(HUGE);
#ifdef _OPENMP
int thread_nbr = omp_get_max_threads();
#else
int thread_nbr = 1; // to ensure proper behavior in a sequential context
#endif
struct thread_results
{ // to hold per thread results
int nbr; // nbr of generated results
double *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe. drand48 should be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = sin(i); //fixed array content to test reproducibility
}
subList = (double *)malloc(HUGE * sizeof(double)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results *threadres = (struct thread_results *)malloc(thread_nbr * sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num(); // hold thread id
int thread_nbr = omp_get_num_threads(); // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res = threadres + thread_id;
res->nbr = 0;
// compute address in subList table
res->results = subList + (HUGE / thread_nbr) * thread_id;
double *res_ptr = res->results; // local pointer. Each thread points to independent part of subList table
#pragma omp for reduction(+ \
: n)
for (long i = 0; i < 1000000; ++i)
{
for (int i = 0; i < DELAY; ++i){}//do nothing, just waste time
if (hugeList[i] < 0)
{
//do something to fill "subList" properly
res_ptr[n] = hugeList[i];
n++;
}
}
res->nbr = n;
}
for (int i = 0; i < 10; ++i)
{
printf("sublist %d: %lf\n", i, subList[i]);//show some elements of subList to check reproducibility
}
printf("n = %d\n", n);
}
Linux compile: gcc -o filename filename.c -fopenmp -lm
I hope there can be more discussion of the mechanism of this code.

compare array with multiple thread using pThread in C

I have some difficulty understanding multithreading. Here is the situation:
I am going to select some integers from an array and store them in another array, subject to some conditions. The conditions are quite complicated; basically it's a huge set of comparisons between array[i] and all the other elements array[not i]. Let's call it checkCondition().
First, I create the pthreads. Here is my code; note that dataPackage is a struct containing the array.
for(int i = 0; i < thread_number; i++){
if(pthread_create(&tid, NULL, checkPermissible, &dataPackage) != 0){
perror("Error occurs when creating thread!!!\n");
exit(EXIT_FAILURE);
}
}
for(int i = 0; i < thread_number; i++){
pthread_join(tid, NULL);
}
Here is the content of checkPermissible()
void* checkPermissible(void* content){
readThread *dataPackage = content;
for(int i = 0; i < (*dataPackage).arrayLength; i++){
if(checkCondition()){
pthread_mutex_lock(&mutex);
insert(array[i], (*dataPackage).result);
pthread_mutex_unlock(&mutex);
//if condition true, insert it into result(a pointer)
//mutex avoid different thread insert the value at the same time
}
}
pthread_exit(NULL);
}
However, this makes no difference compared with not using pthreads at all. How do I implement checkPermissible() to get the benefit of multiple threads? I'm quite confused about this.
My idea is to divide the array among the threads. For example, with an array[20] and 4 threads:
Thread 1: compute checkCondition with array[0] to array[4]
Thread 2: compute checkCondition with array[5] to array[9]
Thread 3: compute checkCondition with array[10] to array[14]
Thread 4: compute checkCondition with array[15] to array[19]
Something like that, which I don't know how to achieve.
First, you can pass lower and upper bound or addresses to a thread in your structure as follows:
struct readThread {
int low;
int hi;
int * myarray;
};
for (int i=low;i<hi;++i)
//or
struct readThread {
int * start;
int * end;
};
for (int* i=start; i<end; ++i)
The first one is simpler and easier to understand. In this way, your array will be split among the threads.
There are other ways, like creating split copies of your array for each thread.

How to pass specific value to one thread?

I'm working through a thread exercise in C. It's a typical thread scheduling program that many schools teach; a basic version can be seen at the link below. My code is basically the same except for my altered runner method.
http://webhome.csc.uvic.ca/~wkui/Courses/CSC360/pthreadScheduling.c
What I'm doing is basically altering the runner part so that my code prints an array of random numbers within a certain range, instead of just printing some words. My runner code is here:
void *runner(void *param) {
int i, j, total;
int threadarray[100];
for (i = 0; i < 100; i++)
threadarray[i] = rand() % ((199 + modifier*100) + 1 - (100 + modifier*100)) + (100 + modifier*100);
/* prints array and add to total */
for (j = 0; j < 100; j += 10) {
printf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d\n", threadarray[j], threadarray[j+1], threadarray[j+2], threadarray[j+3], threadarray[j+4], threadarray[j+5], threadarray[j+6], threadarray[j+7], threadarray[j+8], threadarray[j+9]);
total = total + threadarray[j] + threadarray[j+1] + threadarray[j+2] + threadarray[j+3] + threadarray[j+4] + threadarray[j+5] + threadarray[j+6] + threadarray[j+7] + threadarray[j+8] + threadarray[j+9];
}
printf("Thread %d finished running, total is: %d\n", pthread_self(), total);
pthread_exit(0);
}
My question concerns the first for loop, where I'm assigning random numbers to my array. I want this modifier to change based on which thread it is, but I can't figure out how to do it. For example, the first thread's range will be 100-199, the second's 200-299, and so on. I have tried assigning i to a value before calling pthread_create and assigning that value to an int in runner to use as the modifier, but since there are 5 concurrent threads it ends up assigning this number to all 5 threads, and they end up having the same modifier.
So I'm looking for an approach that works for each individual thread instead of assigning the same value to all of them. I have tried changing the parameters to something like (void *param, int modifier), but then I have no idea how to reference runner, since by default it's referenced like pthread_create(&tid[i],&attr,runner,NULL);
You want to make param point to a data structure or variable whose lifetime is longer than the thread's. Then cast the void* parameter back to the actual data type it was allocated as.
Easy example:
struct thread_data
{
int thread_index;
int start;
int end;
};
struct thread_info
{
struct thread_data data;
pthread_t thread;
};
struct thread_info threads[10];
for (int x = 0; x < 10; x++)
{
struct thread_data* pData = (struct thread_data*)malloc(sizeof(struct thread_data)); // never pass a stack variable as a thread parameter. Always allocate it from the heap.
pData->thread_index = x;
pData->start = 100 * x + 1;
pData->end = 100*(x+1) - 1;
pthread_create(&(threads[x].thread), NULL, runner, pData);
}
Then your runner:
void *runner(void *param)
{
struct thread_data* data = (struct thread_data*)param;
int modifier = data->thread_index;
int i, j, total;
int threadarray[100];
for (i = 0; i < 100; i++)
{
threadarray[i] = ...
}

Resources