Using multiple threads to do matrix Multiplication in C

So, I was trying to write a program to do matrix multiplication using multiple threads and then plot a graph between the time taken and the number of threads used.
I used the following approach:
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
pthread_mutex_t lock;
#define M 200
#define N 300
#define P 400
#define X 2 // Number of Threads
#define RED "\x1b[31m"
#define GREEN "\x1b[32m"
int A[M][N], B[N][P], C[M][P], D[M][P];
int row = 0;
void *matrixMulti(void *arg)
{
pthread_mutex_lock(&lock);
int i = row++;
for (int j = 0; j < P; j++)
{
C[i][j] = 0;
for (int k = 0; k < N; k++)
{
C[i][j] += A[i][k] * B[k][j];
}
}
pthread_exit(NULL);
pthread_mutex_unlock(&lock);
}
void matrixMultiplicationWithoutThreading();
void matrixMultiplicationWithThreading();
void verifyIfBothMatrixAreSame();
int main()
{
int m, n, p;
// A: m*n Matrix, B: n*p Matrix
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
A[i][j] = rand() % 10;
// scanf("%d", &A[i][j]);
for (int i = 0; i < N; i++)
for (int j = 0; j < P; j++)
B[i][j] = rand() % 10;
// scanf("%d", &B[i][j]);
struct timeval start, end;
gettimeofday(&start, NULL);
matrixMultiplicationWithoutThreading();
gettimeofday(&end, NULL);
double time = (end.tv_sec - start.tv_sec) * 1e6;
time = (time + end.tv_usec - start.tv_usec) * 1e-6;
printf("The time taken by simple matrix calculation without threding is %0.6f\n", time);
struct timeval start_th, end_th;
gettimeofday(&start_th, NULL);
matrixMultiplicationWithThreading();
gettimeofday(&end_th, NULL);
time = (end_th.tv_sec - start_th.tv_sec) * 1e6;
time = (time + end_th.tv_usec - start_th.tv_usec) * 1e-6;
printf("The time taken by using the Threading Method with %d threads is %0.6f\n", X, time);
verifyIfBothMatrixAreSame();
}
void matrixMultiplicationWithThreading()
{
pthread_t threads[X];
for (int i = 0; i < X; i++)
{
threads[i] = (pthread_t)-1;
}
// Computation Started:
for (int i = 0; i < M; i++)
{
// At any moment only X threads at max are working
if (threads[i] == (pthread_t)-1)
pthread_create(&threads[i % X], NULL, matrixMulti, NULL);
else
{
pthread_join(threads[i % X], NULL);
pthread_create(&threads[i % X], NULL, matrixMulti, NULL);
}
}
for (int i = 0; i < X; i++)
pthread_join(threads[i], NULL);
// Computation Done:
}
void matrixMultiplicationWithoutThreading()
{
// Computation Started:
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
D[i][j] = 0;
for (int k = 0; k < N; k++)
D[i][j] += A[i][k] * B[k][j];
}
// Computation Done:
}
void verifyIfBothMatrixAreSame()
{
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
if (C[i][j] != D[i][j])
{
printf(RED "\nMatrix's are not equal something wrong with the computation\n");
return;
}
}
printf(GREEN "\nBoth Matrixes are equal thus verifying the computation\n");
}
This code works only some of the time: on some runs the threaded result does not match the reference result, and on one of my Linux virtual machines it fails with a segmentation fault. Even when the result is correct, the timing graph does not decrease as threads are added; the time stays almost constant, with arbitrary variations across thread counts.
Can someone explain why this is happening? I found several solutions to this problem on the internet (some of which occasionally fail too), but none of them use my approach, so I suspect the approach itself is the issue. In particular, why is calling pthread_create(&threads[i % X], NULL, matrixMulti, NULL) like this not a good idea?
EDITED:
I have tried to follow the suggestions and restructure the code. I have not switched to a more efficient matrix multiplication method, because we were asked to use the O(n^3) method, but I have tried to do the threading correctly. Is this correct?
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <math.h>
#define M 2
#define N 2
#define P 2
#define X 40 // Number of Threads
#define RED "\x1b[31m"
#define GREEN "\x1b[32m"
int t = 0; // Computation done by the first usedXFullthreads
int usedXFull = 0;
int A[M][N], B[N][P], C[M][P], D[M][P];
int row = 0;
void *matrixMulti(void *arg)
{
int* l = (int *)arg;
int n = *l;
int i = 0, j = 0, k = 0, comp = 0;
if (n <= usedXFull)
{
i = n * t / (N * P);
j = (n * t - N * P * i) / N;
k = n * t - N * P * i - N * j;
if (n == usedXFull)
comp = M * N * P - usedXFull * t;
else
comp = t;
}
while (comp)
{
if (i == M)
printf(RED "Some fault in the code\n\n");
C[i][j] += A[i][k] * B[k][j];
comp--;
k++;
if (k == N)
{
j++;
if (j == P)
{
i++;
j = 0;
}
k = 0;
}
}
return NULL;
}
void matrixMultiplicationWithoutThreading();
void matrixMultiplicationWithThreading();
void verifyIfBothMatrixAreSame();
int main()
{
int m, n, p;
// A: m*n Matrix, B: n*p Matrix
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
A[i][j] = rand() % 10;
// scanf("%d", &A[i][j]);
for (int i = 0; i < N; i++)
for (int j = 0; j < P; j++)
B[i][j] = rand() % 10;
// scanf("%d", &B[i][j]);
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
C[i][j] = 0;
struct timeval start, end;
gettimeofday(&start, NULL);
matrixMultiplicationWithoutThreading();
gettimeofday(&end, NULL);
double time = (end.tv_sec - start.tv_sec) * 1e6;
time = (time + end.tv_usec - start.tv_usec) * 1e-6;
printf("The time taken by simple matrix calculation without threding is %0.6f\n", time);
struct timeval start_th, end_th;
gettimeofday(&start_th, NULL);
matrixMultiplicationWithThreading();
gettimeofday(&end_th, NULL);
time = (end_th.tv_sec - start_th.tv_sec) * 1e6;
time = (time + end_th.tv_usec - start_th.tv_usec) * 1e-6;
printf("The time taken by using the Threading Method with %d threads is %0.6f\n", X, time);
verifyIfBothMatrixAreSame();
}
void matrixMultiplicationWithThreading()
{
int totalComp = M * N * P; // Total Computation
t = ceil((double)totalComp / (double)X);
usedXFull = totalComp / t;
int computationByLastUsedThread = totalComp - t * usedXFull;
int computationIndex[X];
pthread_t threads[X];
// Computation Started:
for (int i = 0; i < X; i++)
{
computationIndex[i] = i;
int rc = pthread_create(&threads[i], NULL, matrixMulti, (void *)&computationIndex[i]);
if (rc)
{
printf(RED "ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
for (int i = 0; i < X; i++)
pthread_join(threads[i], NULL);
// Computation Done:
}
void matrixMultiplicationWithoutThreading()
{
// Computation Started:
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
D[i][j] = 0;
for (int k = 0; k < N; k++)
D[i][j] += A[i][k] * B[k][j];
}
// Computation Done:
}
void verifyIfBothMatrixAreSame()
{
for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++)
{
if (C[i][j] != D[i][j])
{
printf(RED "\nMatrix's are not equal something wrong with the computation\n");
return;
}
}
printf(GREEN "\nBoth Matrixes are equal thus verifying the computation\n");
}

There are many issues in the code. Here are some points:
lock is not initialized with pthread_mutex_init (or PTHREAD_MUTEX_INITIALIZER), which is required, and it is never destroyed.
There is no need for locks in a matrix multiplication: work sharing should be preferred instead (especially since the current lock makes your code run fully serially).
Using pthread_exit is generally a rather bad idea, and it certainly is here: the pthread_mutex_unlock call placed after it is never reached. Consider just returning NULL. Besides, returning something is mandatory in matrixMulti. Please enable compiler warnings to detect such things.
threads[i] is accessed out of bounds in the loop over 0..M (the array only has X elements).
There is no need to create M threads. You can create 2 threads and divide the work into 2 even parts along the M-based dimension. Creating M threads while allowing only 2 of them to run simultaneously just adds overhead for no reason (it takes time for a thread to be created and scheduled by the OS).
It is generally better to dynamically allocate large arrays than to use static global C arrays.
It is better to avoid global variables and to use the arg parameter to pass thread-specific data.
To design a fast matrix multiplication, please consider reading this article. For example, the ijk loop nest is very inefficient and should really not be used when performance matters (it is not cache-efficient). Besides, note that you can use a BLAS library for that (they are highly optimized and easy to use), though I guess this is homework. Additionally, note that you can use OpenMP instead of pthreads to make the code shorter and easier to read.
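As an illustration of the work-sharing points above, here is a minimal sketch, not a definitive implementation. It assumes flat row-major arrays; the names slice_t, multiply_slice and matmul_threads are hypothetical and do not appear in the question. Each thread receives its own row range through the arg parameter, so no mutex and no global row counter are needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread work description (not from the original post). */
typedef struct {
    const int *a;   /* m x n matrix, row-major */
    const int *b;   /* n x p matrix, row-major */
    int *c;         /* m x p result, row-major */
    int n, p;
    int row_begin;  /* first row owned by this thread */
    int row_end;    /* one past the last row owned by this thread */
} slice_t;

/* Thread body: computes only the rows in [row_begin, row_end). */
static void *multiply_slice(void *arg)
{
    const slice_t *s = arg;
    for (int i = s->row_begin; i < s->row_end; i++)
        for (int j = 0; j < s->p; j++) {
            int sum = 0;
            for (int k = 0; k < s->n; k++)
                sum += s->a[i * s->n + k] * s->b[k * s->p + j];
            s->c[i * s->p + j] = sum;  /* no other thread writes this cell */
        }
    return NULL;
}

/* Multiplies a (m x n) by b (n x p) into c with nthreads workers.
   Returns 0 on success, -1 if memory allocation fails.
   Sketch only: thread creation failure is checked with assert. */
int matmul_threads(const int *a, const int *b, int *c,
                   int m, int n, int p, int nthreads)
{
    assert(nthreads > 0);
    pthread_t *tid = malloc(nthreads * sizeof *tid);
    slice_t *work = malloc(nthreads * sizeof *work);
    if (!tid || !work) { free(tid); free(work); return -1; }

    for (int t = 0; t < nthreads; t++) {
        /* Rows [m*t/nthreads, m*(t+1)/nthreads) cover 0..m exactly once. */
        work[t] = (slice_t){ a, b, c, n, p,
                             (int)((long)m * t / nthreads),
                             (int)((long)m * (t + 1) / nthreads) };
        int rc = pthread_create(&tid[t], NULL, multiply_slice, &work[t]);
        assert(rc == 0);
        (void)rc;
    }
    /* Create exactly nthreads threads once, then wait for all of them. */
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);

    free(tid);
    free(work);
    return 0;
}
```

With the sizes from the question, a call could look like matmul_threads(&A[0][0], &B[0][0], &C[0][0], M, N, P, X). Because every thread writes a disjoint set of rows, the threads never need to synchronise until the final join.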

Related

Going over a matrix while multi-threading

I need to write a program that takes a dynamically allocated matrix and flattens it into a one-dimensional array; for example, a 4x4 matrix gives an array of length 16, where each index holds an odd or even number matching the parity of the index itself. The threads need to go over the matrix at the same time and copy the odd and even numbers to the correct places in the array. The main thread needs to wait for the others to finish before printing the array, with each value shown together with the thread that handled it. It should come out like this
We managed to fix the segmentation fault that kept happening, but now we need the threads to take turns, each running right after the other; instead, each thread runs 4 times before switching to a different one. How can I change it so that it runs as required?
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <pthread.h>
#include <math.h>
#define CORE 4
int N;
int odd = 1;
int even = 0;
typedef struct my_thread {
int** matrix;
int* resArray;
int threadId;
int strart_raw;
int strart_cal;
int end_raw;
int end_cal;
int counter;
} my_thread;
void* createArray(struct my_thread* thread);
void main() {
pthread_t th[CORE];
int s_r = 0, s_c, e_r, e_c;
int i, j, lines, columns, * intMatrix;
printf("Type the N for the N*N matrix:\t");
scanf("%d", &N);
int size = N * N;
int result_Array[N * N];
int retcode;
int interval = size / CORE;
int matrix_build_counter = 1;
intMatrix = (int*)malloc(N * N * sizeof(int));
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
intMatrix[i * N + j] = matrix_build_counter;
matrix_build_counter++;
}
}
printf("The matrix:\n");
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
printf("%d ", intMatrix[i * N + j]);
}
printf("\n");
}
struct my_thread thred_obj_array[CORE];
for (int i = 0; i < CORE; i++) {
thred_obj_array[i].matrix = &intMatrix;
thred_obj_array[i].resArray = result_Array;
thred_obj_array[i].threadId = i;
thred_obj_array[i].strart_raw = (int)((i * N) / CORE);
thred_obj_array[i].end_raw = (int)(((interval * (i + 1)) / N));
thred_obj_array[i].strart_cal = ((interval * i)) % N;
thred_obj_array[i].end_cal = ((interval) * (i + 1));
thred_obj_array[i].counter = (int)floor((interval)*i);
}
for (int i = 0; i < CORE; i++) {
retcode = pthread_create(&th[i], NULL, createArray, &thred_obj_array[i]);
if (retcode != 0) {
printf("Create thread failed with error %d\n", retcode);
}
}
printf("done");
for (int i = 0; i < CORE; i++) {
pthread_join(th[i], NULL);
}
printf("the result array is: ");
for (int i = 0; i < N * N; i++) {
printf("%d ", result_Array[i]);
}
}
void* createArray(struct my_thread* thread) {
int j;
for (int i = thread->strart_raw; i < N; i = i * sizeof(int) * N) {
for (j = thread->strart_cal; j < N; j++) {
printf("I am thread: %d And my value is: %d , (%d,%d)\n", thread->threadId, (*thread->matrix + i * N)[j], i, j);
if (((*thread->matrix + i * N)[j]) % 2 == 0) {
thread->resArray[even] = (*thread->matrix + i * N)[j];
even += 2;
printf("-----%d ---even---\n", even);
}
else {
thread->resArray[odd] = (*thread->matrix + i * N)[j];
odd += 2;
printf("---%d ---odd--\n", odd);
}
(thread->counter)++;
if (thread->counter == thread->end_cal) {
return;
}
}
thread->strart_cal = 0;
}
}

I am using pthreads to speed up my matrix multiplication but not getting correct values

I have to measure the speed of the operation for different numbers of threads, and the matrix sizes have to be 1000x1 and 1x1000. I have to measure this using 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512 threads.
My program returns the same value for every element of the resultant matrix. Where do I need to make changes?
I am using a random number generator to fill the matrices, and I allocate and free the matrices dynamically.
I first set N = 512, but I was getting a "segmentation fault (core dumped)" error, so I increased N. How can I use different numbers of threads to calculate the matrix?
For 1000x1 and 1x1000 matrices, every element of the resultant matrix is 18 and the program ends with a segmentation fault (core dumped).
For 100x1 and 1x100, all values are 3 and the same thing happens.
To compile it you have to use -lpthread and -fopenmp
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
// Each thread computes single element in the resultant matrix
void *mult(void *arg)
{
int *data = (int *)arg;
int k = 0, i = 0;
int x = data[0];
for (i = 1; i <= x; i++)
k += data[i] * data[i + x];
int *p = (int *)malloc(sizeof(int));
*p = k;
// Used to terminate a thread and the return value is passed as a pointer
pthread_exit(p);
}
// Driver code
int main()
{
int i, j, k, row1, col1, row2, col2, r, sum, no_of_threads;
printf("Enter the number of rows for matrix 1\n");
scanf("%d", &row1);
printf("Enter the number of columns for matrix 1 \n");
scanf("%d", &col1);
printf("Enter the number of rows for matrix 2 \n");
scanf("%d", &row2);
printf("Enter the number of columns for matrix 2\n");
scanf("%d", &col2);
int **a = (int **)malloc(row1 * sizeof(int *));
for (i = 0; i < row1; i++)
a[i] = (int *)malloc(col1 * sizeof(int));
int **b = (int **)malloc(row2 * sizeof(int *));
for (i = 0; i < row2; i++)
b[i] = (int *)malloc(col2 * sizeof(int));
for (i = 0; i < row1; i++)
{
for (j = 0; j < col1; j++)
{
a[i][j] = (rand() % 9) + 1;
}
}
for (i = 0; i < row2; i++)
{
for (j = 0; j < col2; j++)
{
b[i][j] = (rand() % 9) + 1;
}
}
int N = 1000;
// declaring array of threads of size row1*col2
pthread_t *threads;
threads = (pthread_t *)malloc(N * sizeof(pthread_t));
int count = 0;
int *data = NULL;
for (i = 0; i < row1; i++)
for (j = 0; j < col2; j++)
{
// storing row and column elements in data
data = (int *)malloc((N) * sizeof(int));
data[0] = col1;
for (k = 0; k < col1; k++)
data[k + 1] = a[i][k];
for (k = 0; k < row2; k++)
data[k + col1 + 1] = b[k][j];
}
// creating threads
for (int i = 0; i < 512; i++)
{
std:
pthread_create(&threads[i], NULL,
mult, (void *)(data));
}
printf("RESULTANT MATRIX IS :- \n");
for (i = 0; i < N; i++)
{
void *k;
// Joining all threads and collecting return value
pthread_join(threads[i], &k);
int *p = (int *)k;
printf("%d ", *p);
if ((i + 1) % col2 == 0)
printf("\n");
}
for (int i = 0; i < row1; i++)
{
free(a[i]);
}
for (int i = 0; i < row2; i++)
{
free(b[i]);
}
free(a);
free(b);
free(data);
return 0;
}
EDIT:
How do I pass multiple arguments to the threads? If I read the matrix in my main function, I am not able to pass the row and column values to my mult function. If I read the matrix in the mult function, the threads do not work. How do I get it to multiply?

Your approach is flawed:
the destination matrix is not allocated properly
you pass the same argument to all threads
the threads do not receive any information regarding where to store the result of the scalar product.
You should allocate an argument structure for each thread and pass the source arrays and destination location.
With the current approach, there are row1 * col2 scalar products to compute: if you want to use a fixed number of threads, you should construct a list of tasks for each thread to process in order to distribute the work among the threads. It is rather easy to do this statically, and since all single computations are equivalent in terms of complexity, dynamic distribution does not seem necessary.
Note however that you must wait for all threads to complete before examining and freeing the results. Freeing the arrays as the threads are potentially still accessing the data is among the many causes for undefined behavior in the posted code.
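The points above can be sketched as follows. This is only an illustration under stated assumptions: the matrices are stored as flat row-major arrays rather than the int ** used in the question, and the names task_args, dot_products and multiply_with_tasks are hypothetical. Each thread gets its own argument structure holding the sources, the destination and a static list of tasks (here, every stride-th scalar product).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical argument block: one per thread, never shared. */
typedef struct {
    const int *a;    /* row1 x inner matrix, row-major */
    const int *b;    /* inner x col2 matrix, row-major */
    int *result;     /* row1 x col2 matrix, allocated by the caller */
    int row1, inner, col2;
    int first_task;  /* index of the first scalar product of this thread */
    int stride;      /* total number of threads */
} task_args;

/* Thread body: handles tasks first_task, first_task + stride, ... */
static void *dot_products(void *arg)
{
    const task_args *w = arg;
    int total = w->row1 * w->col2;
    for (int t = w->first_task; t < total; t += w->stride) {
        int i = t / w->col2, j = t % w->col2, sum = 0;
        for (int k = 0; k < w->inner; k++)
            sum += w->a[i * w->inner + k] * w->b[k * w->col2 + j];
        w->result[t] = sum;  /* each task owns exactly one destination cell */
    }
    return NULL;
}

/* Computes result = a * b with a fixed number of threads.
   Returns 0 on success, -1 if memory allocation fails.
   Sketch only: thread creation failure is checked with assert. */
int multiply_with_tasks(const int *a, const int *b, int *result,
                        int row1, int inner, int col2, int nthreads)
{
    assert(nthreads > 0);
    pthread_t *tid = malloc(nthreads * sizeof *tid);
    task_args *args = malloc(nthreads * sizeof *args);
    if (!tid || !args) { free(tid); free(args); return -1; }

    for (int t = 0; t < nthreads; t++) {
        args[t] = (task_args){ a, b, result, row1, inner, col2, t, nthreads };
        int rc = pthread_create(&tid[t], NULL, dot_products, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    /* Join every thread before the caller reads or frees anything. */
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);

    free(tid);
    free(args);
    return 0;
}
```

The thread count is independent of the matrix size: with more threads than scalar products, the surplus threads find no task and return immediately, so the same function can be timed with 1, 2, 4, ... 512 threads.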

Display the prime numbers using multi-threading in C

In the function printprime, each of the four threads iterates over every element, which is almost equivalent to a single-threaded program. I want to increment i by MAX_THREADS instead (i = i + MAX_THREADS). I am using four threads because my laptop has four processors, so that it is fully used. Can someone tell me how to tweak printprime so that each thread checks its own subset of numbers? For example, thread 1 checks 2, 6, 10, ...; thread 2 checks 3, 7, 11, ...; and so on.
#include <stdio.h>
#include <pthread.h>
#define N 30
#define MAX_THREADS 4
int prime_arr[N] = { 0 };
void *printprime(void *ptr) {
int j, flag;
int i = (int)(long long int)ptr;
for (i = 2; i < N; i++) {
flag = 0;
for (j = 2; j <= i / 2; j++) {
if (i % j == 0) {
flag = 1;
break;
}
}
if (flag == 0) {
prime_arr[i] = 1;
}
}
}
int main() {
pthread_t tid[MAX_THREADS] = {{ 0 }};
int count = 0;
for (count = 0; count < MAX_THREADS; count++) {
printf("\r\n CREATING THREADS %d", count);
pthread_create(&tid[count], NULL, printprime, (void *)count);
}
printf("\n");
for (count = 0; count < MAX_THREADS; count++) {
pthread_join(tid[count], NULL);
}
int c = 0;
for (count = 0; count < N; count++)
if (prime_arr[count] == 1)
printf("%d ", count);
return 0;
}

To get the desired behavior, increment the variable i in void *printprime(void *ptr) by MAX_THREADS (4 in your case) instead of by 1.
Note: the line printf("Thread id[%d] checking [%d]\n",pthread_self(),i); is only there to show which thread is checking which value.
The following code may be helpful:
#include<stdio.h>
#include<pthread.h>
#define N 30
#define MAX_THREADS 4
int prime_arr[N]={0};
void *printprime(void *ptr)
{
int j,flag;
int i=(int)(long long int)ptr;
while(i<N)
{
printf("Thread id[%d] checking [%d]\n",pthread_self(),i);
flag=0;
for(j=2;j<=i/2;j++)
{
if(i%j==0)
{
flag=1;
break;
}
}
if(flag==0 && (i>1))
{
prime_arr[i]=1;
}
i+=MAX_THREADS;
}
return NULL;
}
int main()
{
pthread_t tid[MAX_THREADS]={{0}};
int count=0;
for(count=0;count<MAX_THREADS;count++)
{
printf("\r\n CREATING THREADS %d",count);
pthread_create(&tid[count],NULL,printprime,(void*)count);
}
printf("\n");
for(count=0;count<MAX_THREADS;count++)
{
pthread_join(tid[count],NULL);
}
int c=0;
for(count=0;count<N;count++)
if(prime_arr[count]==1)
printf("%d ",count);
return 0;
}

There are multiple problems in your code:
all threads use for (i = 2; i < N; i++) so they perform exactly the same scan, testing the same numbers... You get no advantage from using multiple threads.
the name printprime is very confusing for a function that scans for prime numbers but does not print them.
you modify the same array in multiple threads without synchronisation: this has undefined behavior if the same element is accessed from different threads and if the element size is smaller than the atomic size.
even if the code were modified so that each thread tests the subset you describe in the question, this would be very inefficient, as every other thread would end up testing only even numbers.
the loop for (j = 2; j <= i / 2; j++) iterates far too long for prime numbers. You should stop when j * j > i, which can be tested without overflow as for (j = 2; i / j >= j; j++).
even with this optimisation, trial division is very inefficient to populate the prime_arr array. Implementing a Sieve of Eratosthenes is far superior and much more appropriate for a multithreading approach.
Here is an example:
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#define N 10000000
#define MAX_THREADS 4
unsigned char prime_arr[N];
void *scanprime(void *ptr) {
int n, i, j, flag, start, stop;
n = (int)(intptr_t)ptr;
start = N / MAX_THREADS * n;
stop = N / MAX_THREADS * (n + 1);
if (start < 2)
start = 2;
if (n == MAX_THREADS - 1)
stop = N;
for (i = start; i < stop; i++) {
flag = 1;
for (j = 2; i / j >= j; j++) {
if (i % j == 0) {
flag = 0;
break;
}
}
prime_arr[i] = flag;
}
return NULL;
}
void *sieveprimes(void *ptr) {
int n, i, j, start, stop;
n = (int)(intptr_t)ptr;
/* compute slice boundaries */
start = N / MAX_THREADS * n;
stop = N / MAX_THREADS * (n + 1);
/* special case 0, 1 and 2 */
if (n == 0) {
prime_arr[0] = prime_arr[1] = 0;
prime_arr[2] = 1;
start = 3;
}
if (n == MAX_THREADS - 1) {
stop = N;
}
/* initialize array slice: only odd numbers may be prime */
for (i = start; i < stop; i++) {
prime_arr[i] = i & 1;
}
/* set all multiples of odd numbers as composite */
for (j = 3; j * j < N; j += 2) {
/* start at first multiple of j inside the slice */
i = (start + j - 1) / j * j;
/* all multiples below j * j have been cleared already */
if (i < j * j)
i = j * j;
/* only handle odd multiples */
if ((i & 1) == 0)
i += j;
for (; i < stop; i += j + j) {
prime_arr[i] = 0;
}
}
return NULL;
}
int main() {
pthread_t tid[MAX_THREADS] = { 0 };
int i;
for (i = 0; i < MAX_THREADS; i++) {
printf("Creating thread %d\n", i);
pthread_create(&tid[i], NULL, sieveprimes, (void *)(intptr_t)i);
}
for (i = 0; i < MAX_THREADS; i++) {
pthread_join(tid[i], NULL);
}
int count = 0;
for (i = 0; i < N; i++) {
count += prime_arr[i];
//if (prime_arr[i] == 1)
// printf("%d\n", i);
}
printf("%d\n", count);
return 0;
}

Why is my program generating random results when I nest it?

I made this parallel matrix multiplication program using nested for loops in OpenMP. When I run the program, it displays the results in a (mostly) random order, with varying indices of the resultant matrix. Here is the relevant snippet of the code:
#pragma omp parallel for
for(i=0;i<N;i++){
#pragma omp parallel for
for(j=0;j<N;j++){
C[i][j]=0;
#pragma omp parallel for
for(m=0;m<N;m++){
C[i][j]=A[i][m]*B[m][j]+C[i][j];
}
printf("C:i=%d j=%d %f \n",i,j,C[i][j]);
}
}

These are the symptoms of a so-called "race condition", as the commenters already stated.
The threads OpenMP uses run independently of each other, but the partial results of the individual loops of the matrix multiplication are not independent: one thread may be at a different position than another, and you are in trouble as soon as you depend on the order of the results.
You can only parallelize the outermost loop:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
int main(int argc, char **argv)
{
int n;
double **A, **B, **C, **D, t;
int i, j, k;
struct timeval start, stop;
if (argc != 2) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
n = atoi(argv[1]);
if (n < 2 || n >= 1000000) {
fprintf(stderr, "Usage: %s a positive integer >= 2 and < 1 mio\n", argv[0]);
exit(EXIT_FAILURE);
}
// make it repeatable
srand(0xdeadbeef);
// allocate memory for and initialize A
A = malloc(sizeof(*A) * n);
for (i = 0; i < n; i++) {
A[i] = malloc(sizeof(**A) * n);
for (j = 0; j < n; j++) {
A[i][j] = (double) ((rand() % 100) / 99.);
}
}
// do the same for B
B = malloc(sizeof(*B) * n);
for (i = 0; i < n; i++) {
B[i] = malloc(sizeof(**B) * n);
for (j = 0; j < n; j++) {
B[i][j] = (double) ((rand() % 100) / 99.);
}
}
// and C but initialize with zero
C = malloc(sizeof(*C) * n);
for (i = 0; i < n; i++) {
C[i] = malloc(sizeof(**C) * n);
for (j = 0; j < n; j++) {
C[i][j] = 0.0;
}
}
// ditto with D
D = malloc(sizeof(*D) * n);
for (i = 0; i < n; i++) {
D[i] = malloc(sizeof(**D) * n);
for (j = 0; j < n; j++) {
D[i][j] = 0.0;
}
}
// some coarse timing
gettimeofday(&start, NULL);
// naive matrix multiplication
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
C[i][j] = C[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for naive run = %.10g\n", t);
gettimeofday(&start, NULL);
#pragma omp parallel shared(A, B, C) private(i, j, k)
#pragma omp for
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
D[i][j] = D[i][j] + A[i][k] * B[k][j];
}
}
}
gettimeofday(&stop, NULL);
t = ((stop.tv_sec - start.tv_sec) * 1000000u +
stop.tv_usec - start.tv_usec) / 1.e6;
printf("Timing for parallel run = %.10g\n", t);
// check result
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
if (D[i][j] != C[i][j]) {
printf("Cell %d,%d differs with delta(D_ij-C_ij) = %.20g\n", i, j,
D[i][j] - C[i][j]);
}
}
}
// clean up
for (i = 0; i < n; i++) {
free(A[i]);
free(B[i]);
free(C[i]);
free(D[i]);
}
free(A);
free(B);
free(C);
free(D);
puts("All ok? Bye");
exit(EXIT_SUCCESS);
}
(n>2000 might need some patience to get the result)
But that is not entirely true. You could (but shouldn't) also parallelize the innermost loop with something like
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < n; k++) {
sum += A[i][k] * B[k][j];
}
D[i][j] = sum;
It does not seem to be faster, and it is even slower with small n.
With the original code and n = 2500 (only one run):
Timing for naive run = 124.466307
Timing for parallel run = 44.154538
About the same with the reduction:
Timing for naive run = 119.586365
Timing for parallel run = 43.288371
With a smaller n = 500
Timing for naive run = 0.444061
Timing for parallel run = 0.150842
It is already slower with reduction at that size:
Timing for naive run = 0.447894
Timing for parallel run = 0.245481
It might win for very large n but I lack the necessary patience.
Nevertheless, a last one with n = 4000 (OpenMP part only):
Normal:
Timing for parallel run = 174.647404
With reduction:
Timing for parallel run = 179.062463
That difference is still fully inside the error-bars.
A better way to multiply large matrices (from about n > 100) would be the Strassen algorithm.
Oh: I just used square matrices for convenience not because they must be of that form! But if you have rectangular matrices with a large length-ratio you might try to change the way the loops run; column-first or row-first can make a significant difference here.
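To illustrate the remark about changing the way the loops run, here is a minimal sketch of the i-k-j loop order. It assumes flat row-major n x n arrays instead of the double ** used above, and the name matmul_ikj is hypothetical. With this order the innermost loop walks along rows of both b and c, which is usually friendlier to the cache than the i-j-k order, where b is read column by column.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n row-major matrices, using the i-k-j loop order. */
void matmul_ikj(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* The result is accumulated in place, so it must start at zero. */
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];      /* constant in the inner loop */
            for (size_t j = 0; j < n; j++)  /* contiguous in b and in c */
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

Each iteration of the outermost loop writes only row i of c, so that loop can still be parallelized in the same way as in the program above. How much the reordering gains depends on the matrix size and the machine, so it is worth timing both variants.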

How can I calculate the running time of a pthread matrix multiplication program?

I have created two matrix multiplication programs, one serial and one using pthreads, and I need to compare their running times. My serial code takes about 16 seconds to multiply two 1000x1000 matrices; I checked this with a stopwatch and the reported time matches. The pthreads program, on the other hand, reports a running time of around 22-23 seconds, even though the result appears on the terminal much sooner. Measured with a stopwatch, it takes around 6 seconds before the running time is printed, yet the program prints that it took around 23 seconds. I guess there is some other way to measure the running time of a pthreads program. Below you can find my pthreads code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#include <assert.h>
int SIZE, NTHREADS;
int **A, **B, **C;
void init()
{
int i, j;
A = (int**)malloc(SIZE * sizeof(int *));
for(i = 0; i < SIZE; i++)
A[i] = malloc(SIZE * sizeof(int));
B = (int**)malloc(SIZE * sizeof(int *));
for(i = 0; i < SIZE; i++)
B[i] = malloc(SIZE * sizeof(int));
C = (int**)malloc(SIZE * sizeof(int *));
for(i = 0; i < SIZE; i++)
C[i] = malloc(SIZE * sizeof(int));
srand(time(NULL));
for(i = 0; i < SIZE; i++) {
for(j = 0; j < SIZE; j++) {
A[i][j] = rand()%100;
B[i][j] = rand()%100;
}
}
}
void mm(int tid)
{
int i, j, k;
int start = tid * SIZE/NTHREADS;
int end = (tid+1) * (SIZE/NTHREADS) - 1;
for(i = start; i <= end; i++) {
for(j = 0; j < SIZE; j++) {
C[i][j] = 0;
for(k = 0; k < SIZE; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
void *worker(void *arg)
{
int tid = (int)arg;
mm(tid);
}
int main(int argc, char* argv[])
{
pthread_t* threads;
int rc, i;
if(argc != 3)
{
printf("Usage: %s <size_of_square_matrix> <number_of_threads>\n", argv[0]);
exit(1);
}
SIZE = atoi(argv[1]);
NTHREADS = atoi(argv[2]);
init();
threads = (pthread_t*)malloc(NTHREADS * sizeof(pthread_t));
clock_t begin, end;
double time_spent;
begin = clock();
for(i = 0; i < NTHREADS; i++) {
rc = pthread_create(&threads[i], NULL, worker, (void *)i);
assert(rc == 0);
}
for(i = 0; i < NTHREADS; i++) {
rc = pthread_join(threads[i], NULL);
assert(rc == 0);
}
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Elapsed time: %.2lf seconds.\n", time_spent);
for(i = 0; i < SIZE; i++)
free((void *)A[i]);
free((void *)A);
for(i = 0; i < SIZE; i++)
free((void *)B[i]);
free((void *)B);
for(i = 0; i < SIZE; i++)
free((void *)C[i]);
free((void *)C);
free(threads);
return 0;
}

clock() returns the CPU time consumed by the process, summed over all of its threads, not the wall-clock time that has elapsed. For wall-clock time, use either time (which only has one-second granularity) or, preferably, clock_gettime with the CLOCK_MONOTONIC clock. With glibc versions older than 2.17 you need to link against the POSIX realtime library (-lrt) for this; newer versions provide clock_gettime in libc itself.
struct timespec begin, end;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &begin);
// spawn threads to do work here
clock_gettime(CLOCK_MONOTONIC, &end);
elapsed = end.tv_sec - begin.tv_sec;
elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
In your example, I'm guessing you used around 4 threads? The CPU time would then be (time used in CPU 1 + time used in CPU 2 + time used in CPU 3 + time used in CPU 4) which should be roughly 4 times the absolute time (6 vs. 23 seconds).
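The difference between the two clocks can be made explicit with a pair of small helpers. This is a sketch only; the names timespec_diff_seconds, wall_now and cpu_now are hypothetical and not part of any library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* asks for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Seconds elapsed between two timespec samples (end - begin). */
double timespec_diff_seconds(struct timespec begin, struct timespec end)
{
    return (double)(end.tv_sec - begin.tv_sec)
         + (double)(end.tv_nsec - begin.tv_nsec) / 1e9;
}

/* Wall-clock seconds since an arbitrary fixed point (monotonic clock). */
double wall_now(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* CPU seconds consumed so far by the whole process (all threads). */
double cpu_now(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Sampling both helpers before the pthread_create loop and after the pthread_join loop gives the two figures side by side: the wall_now difference should match the stopwatch, while the cpu_now difference grows with the number of busy threads.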

The easiest way I know of is with OpenMP. Compile and link with -fopenmp.
#include <omp.h>
int main() {
double dtime = omp_get_wtime(); //value in seconds
//run some code
dtime = omp_get_wtime() - dtime;
}
Note that 16 seconds for 1000x1000 matrix multiplication is incredibly slow. My code does 1056x1056 in 0.03 seconds on my i7-2600k at 4.3 GHz and even that is less than 30% of the max theoretical speed.