I am trying to speed up matrix multiplication on a multicore architecture. To this end, I use threads and SIMD at the same time, but my results are not good. I measure the speedup over sequential matrix multiplication:
void sequentialMatMul(void* params)
{
cout << "SequentialMatMul started.";
int i, j, k;
for (i = 0; i < N; i++)
{
for (k = 0; k < N; k++)
{
for (j = 0; j < N; j++)
{
X[i][j] += A[i][k] * B[k][j];
}
}
}
cout << "\nSequentialMatMul finished.";
}
I tried to add threading and SIMD to matrix multiplication as follows:
void threadedSIMDMatMul(void* params)
{
bounds *args = (bounds*)params;
int lowerBound = args->lowerBound;
int upperBound = args->upperBound;
int idx = args->idx;
int i, j, k;
for (i = lowerBound; i < upperBound; i++)
{
for (k = 0; k < N; k++)
{
for (j = 0; j < N; j+=4)
{
mmx1 = _mm_loadu_ps(&X[i][j]);
mmx2 = _mm_load_ps1(&A[i][k]);
mmx3 = _mm_loadu_ps(&B[k][j]);
mmx4 = _mm_mul_ps(mmx2, mmx3);
mmx0 = _mm_add_ps(mmx1, mmx4);
_mm_storeu_ps(&X[i][j], mmx0);
}
}
}
_endthread();
}
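(The bounds type is not shown in the post; presumably it is something along these lines, a plain struct carrying the fields used above:)
typedef struct
{
int idx; // thread index
int lowerBound; // first row (inclusive)
int upperBound; // last row (exclusive)
} bounds;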
And the following section is used for calculating the lower and upper bound for each thread:
bounds arg[CORES];
for (int part = 0; part < CORES; part++)
{
arg[part].idx = part;
arg[part].lowerBound = (N / CORES)*part;
arg[part].upperBound = (N / CORES)*(part + 1);
}
And finally, the threaded SIMD version is called like this:
HANDLE handle[CORES];
for (int part = 0; part < CORES; part++)
{
handle[part] = (HANDLE)_beginthread(threadedSIMDMatMul, 0, (void*)&arg[part]);
}
for (int part = 0; part < CORES; part++)
{
WaitForSingleObject(handle[part], INFINITE);
}
The result is as follows:
Test 1:
// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=1 // just one thread
Sequential time: 11129ms
Threaded SIMD matmul time: 14650ms
Speed up=0.75x
Test 2:
// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
A[k] = (float*)malloc(N * sizeof(float));
B[k] = (float*)malloc(N * sizeof(float));
X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=1 // just one thread
Sequential time: 15907ms
Threaded SIMD matmul time: 18578ms
Speed up=0.85x
Test 3:
// arrays are defined as follows
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=2
Sequential time: 10855ms
Threaded SIMD matmul time: 27967ms
Speed up=0.38x
Test 4:
// arrays are defined as follows
float **A = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **B = (float**)_aligned_malloc(N * sizeof(float*), 16);
float **X = (float**)_aligned_malloc(N * sizeof(float*), 16);
for (int k = 0; k < N; k++)
{
A[k] = (float*)malloc(N * sizeof(float));
B[k] = (float*)malloc(N * sizeof(float));
X[k] = (float*)malloc(N * sizeof(float));
}
N=2048
Core=2
Sequential time: 16579ms
Threaded SIMD matmul time: 30160ms
Speed up=0.51x
My question: why don't I get a speedup?
Here are the times I get, building on your algorithm, on my four-core i7 IVB processor:
sequential: 3.42 s
4 threads: 0.97 s
4 threads + SSE: 0.86 s
Here are the times on a 2-core P9600 @2.53 GHz, which is similar to the OP's E2200 @2.2 GHz:
sequential: time 6.52 s
2 threads: time 3.66 s
2 threads + SSE: 3.75 s
I used OpenMP because it makes this easy. Each thread in OpenMP effectively runs over
lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;
(Note that this is slightly different from your definition. Yours can give the wrong result due to rounding for some values of N, since you divide by CORES first.)
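A concrete illustration (the numbers are mine, not from the post): take N = 2048 and CORES = 3.
// dividing first loses the remainder:
// last thread's upperBound = (2048 / 3) * 3 = 682 * 3 = 2046 -> rows 2046..2047 belong to no thread
// multiplying first keeps it:
// last thread's upperBound = 2048 * 3 / 3 = 2048 -> every row is covered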
As to the SIMD version: it's not much faster, probably because the computation is memory-bandwidth bound. It's probably also not really faster because GCC already vectorizes the loop.
The optimal solution is much more complicated: you need to use loop tiling and reorder the elements within tiles to get the best performance. I don't have time to do that today.
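For a flavor of what tiling alone looks like, here is a rough, untuned sketch (my addition, not benchmarked; TILE is a guessed block size, n is assumed to be a multiple of TILE, and the within-tile reordering mentioned above is omitted):
#define TILE 64
void gemm_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    for(int i0=0; i0<n; i0+=TILE)
    for(int k0=0; k0<n; k0+=TILE)
    for(int j0=0; j0<n; j0+=TILE)
        for(int i=i0; i<i0+TILE; i++)
        for(int k=k0; k<k0+TILE; k++) {
            float aik = a[i*n+k]; // reuse a[i][k] across the whole j-tile
            for(int j=j0; j<j0+TILE; j++)
                c[i*n+j] += aik*b[k*n+j];
        }
}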
Here is the code I used:
//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>
void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
for(int j=0; j<n; j++) {
c[i*n+j] += a[i*n+k]*b[k*n+j];
}
}
}
}
void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
#pragma omp parallel for
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
for(int j=0; j<n; j++) {
c[i*n+j] += a[i*n+k]*b[k*n+j];
}
}
}
}
void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
#pragma omp parallel for
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
__m128 a4 = _mm_set1_ps(a[i*n+k]);
for(int j=0; j<n; j+=4) {
__m128 c4 = _mm_load_ps(&c[i*n+j]);
__m128 b4 = _mm_load_ps(&b[k*n+j]);
c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
_mm_store_ps(&c[i*n+j], c4);
}
}
}
}
int main(void) {
int n = 2048;
float *a = _mm_malloc(n*n * sizeof *a, 64);
float *b = _mm_malloc(n*n * sizeof *b, 64);
float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
float *c3 = _mm_malloc(n*n * sizeof *c3, 64);
for(int i=0; i<n*n; i++) a[i] = 1.0*i;
for(int i=0; i<n*n; i++) b[i] = 1.0*i;
memset(c1, 0, n*n * sizeof *c1);
memset(c2, 0, n*n * sizeof *c2);
memset(c3, 0, n*n * sizeof *c3);
double dtime;
dtime = -omp_get_wtime();
gemm(a,b,c1,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
gemm_tlp(a,b,c2,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
gemm_tlp_simd(a,b,c3,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}
It looks to me like the threads are sharing the __m128 mmx* variables; you probably defined them as global/static. You must be getting wrong results in your X array too. Define the __m128 mmx* variables inside the threadedSIMDMatMul function scope and it will run much faster.
void threadedSIMDMatMul(void* params)
{
__m128 mmx0, mmx1, mmx2, mmx3, mmx4;
// rest of the code here
}
Related
I'm trying to create an OpenMP program that randomizes double arrays and runs the values through the formula: y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
If I run the program multiple times, the y[] values come out the same, even though they are supposed to be randomized when the arrays are initialized in the first #pragma omp for. Any ideas as to why this might be happening?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define ARRAY_SIZE 10
double randfrom(double min, double max);
double randfrom(double min, double max)
{
double range = (max - min);
double div = RAND_MAX / range;
return min + (rand() / div);
}
int main() {
int i;
double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
double min, max;
int imin, imax;
/*A[10] consists of random number in between 1 and 100
B[10] consists of random number in between 10 and 50
C[10] consists of random number in between 1 and 10
D[10] consists of random number in between 1 and 50
E[10] consists of random number in between 1 and 5
F[10] consists of random number in between 10 and 80*/
srand(time(NULL));
#pragma omp parallel
{
#pragma omp parallel for
for (i = 0; i < ARRAY_SIZE; i++) {
a[i] = randfrom(1, 100);
b[i] = randfrom(10, 50);
c[i] = randfrom(1, 50);
d[i] = randfrom(1, 50);
e[i] = randfrom(1, 5);
f[i] = randfrom(10, 80);
}
}
printf("This is the parallel Print\n\n\n");
#pragma omp parallel shared(a,b,c,d,e,f,y) private(i)
{
//Y=(A*B)+C+(D*E)+(F/2)
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
/*printf("A[%d]%.2f",i, a[i]);
printf("\n\n");
printf("B[%d]%.2f", i, b[i]);
printf("\n\n");
printf("C[%d]%.2f", i, c[i]);
printf("\n\n");
printf("D[%d]%.2f", i, d[i]);
printf("\n\n");
printf("E[%d]%.2f", i, e[i]);
printf("\n\n");
printf("F[%d]%.2f", i, f[i]);
printf("\n\n");*/
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
printf("Y[%d]=%.2f\n", i, y[i]);
}
}
#pragma omp parallel shared(y, min,imin,max,imax) private(i)
{
//min
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
if (i == 0) {
min = y[i];
imin = i;
}
else {
if (y[i] < min) {
min = y[i];
imin = i;
}
}
}
//max
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
if (i == 0) {
max = y[i];
imax = i;
}
else {
if (y[i] > max) {
max = y[i];
imax = i;
}
}
}
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
return 0;
}
First of all, I would like to emphasize that OpenMP has significant overheads, so you need a reasonable amount of work in your code, otherwise the overhead will be bigger than the gain from parallelization. In your code this is the case (ARRAY_SIZE is only 10), so the fastest solution is to use serial code. However, you mentioned that your goal is to learn OpenMP, so I will show you how to do it.
In your previous post's comments, @paleonix linked a post (How to generate random numbers in parallel?) which answers your question about random numbers. One of the solutions is to use rand_r.
Your code has a data race when searching for the minimum and maximum values of array y. If you only need to find the minimum/maximum value, it is very easy, because you can use a reduction like this:
double max=y[0];
#pragma omp parallel for default(none) shared(y) reduction(max:max)
for (int i = 1; i < ARRAY_SIZE; i++) {
if (y[i] > max) {
max = y[i];
}
}
But in your case you also need the indices of the minimum and maximum values, so it is a bit more complicated. You have to use a critical section to be sure that other threads cannot change the max, min, imax and imin values while you are updating them. It can be done the following way (e.g. for finding the minimum value):
#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++) {
if (y[i] < min) {
#pragma omp critical
if (y[i] < min) {
min = y[i];
imin = i;
}
}
}
Note that the if (y[i] < min) appears twice: after the first comparison, other threads may change the value of min, so inside the critical section you have to check it again before updating min and imin. Finding the maximum value works exactly the same way.
Always use your variables at their minimum required scope.
It is also recommended to use the default(none) clause in your OpenMP parallel region, so you have to explicitly define the sharing attributes of all your variables.
You can fill the array and find its minimum/maximum values in a single loop and print their values in a different serial loop.
If you set min and max before the loop, you can get rid of the extra comparison if (i == 0) used inside the loop.
Putting it together:
double threadsafe_rand(unsigned int* seed, double min, double max)
{
double range = (max - min);
double div = RAND_MAX / range;
return min + (rand_r(seed) / div);
}
In main:
double min=DBL_MAX;
double max=-DBL_MAX;
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y,imin,imax,min,max)
{
unsigned int seed=omp_get_thread_num();
#pragma omp for
for (int i = 0; i < ARRAY_SIZE; i++) {
a[i] = threadsafe_rand(&seed, 1,100);
b[i] = threadsafe_rand(&seed,10, 50);
c[i] = threadsafe_rand(&seed,1, 10);
d[i] = threadsafe_rand(&seed,1, 50);
e[i] = threadsafe_rand(&seed,1, 5);
f[i] = threadsafe_rand(&seed,10, 80);
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
if (y[i] < min) {
#pragma omp critical
if (y[i] < min) {
min = y[i];
imin = i;
}
}
if (y[i] > max) {
#pragma omp critical
if (y[i] > max) {
max = y[i];
imax = i;
}
}
}
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
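(The snippets above assume compilation with OpenMP enabled, e.g. something like gcc -O2 -fopenmp prog.c; note also that rand_r is POSIX, not standard C.)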
Update:
I have updated the code according to @Qubit's and @JérômeRichard's suggestions:
I used the 'Really minimal PCG32 code' / (c) 2014 M.E. O'Neill / from https://www.pcg-random.org/download.html. Note that I do not intend to properly handle the seeding of this simple random number generator. If you would like to do so, please use a complete random number generator library.
I have changed the code to use user-defined reductions. Indeed, this makes the code much more efficient, but it is not really beginner-friendly. It would require a very long post to explain, so if you are interested in the details, please read a book about OpenMP.
I have reduced the number of divisions in threadsafe_rand.
The updated code:
#include<stdio.h>
#include<stdint.h>
#include<time.h>
#include<float.h>
#include<limits.h>
#include<omp.h>
#define ARRAY_SIZE 10
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;
static inline uint32_t pcg32_random_r(pcg32_random_t* rng)
{
uint64_t oldstate = rng->state;
// Advance internal state
rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
// Calculate output function (XSH RR), uses old state for max ILP
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
uint32_t rot = oldstate >> 59u;
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
static inline double threadsafe_rand(pcg32_random_t* seed, double min, double max)
{
const double tmp=1.0/UINT32_MAX;
return min + tmp*(max - min)*pcg32_random_r(seed);
}
struct v{
double value;
int i;
};
#pragma omp declare reduction(custom_min: struct v: \
omp_out = omp_in.value < omp_out.value ? omp_in : omp_out )\
initializer(omp_priv={DBL_MAX,0} )
#pragma omp declare reduction(custom_max: struct v: \
omp_out = omp_in.value > omp_out.value ? omp_in : omp_out )\
initializer(omp_priv={-DBL_MAX,0} )
int main() {
double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
struct v max={-DBL_MAX,0};
struct v min={DBL_MAX,0};
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y) reduction(custom_min:min) reduction(custom_max:max)
{
pcg32_random_t seed={omp_get_thread_num()*7842 + time(NULL)%2299, 1234+omp_get_thread_num()};
#pragma omp for
for (int i=0 ; i < ARRAY_SIZE; i++) {
a[i] = threadsafe_rand(&seed, 1,100);
b[i] = threadsafe_rand(&seed,10, 50);
c[i] = threadsafe_rand(&seed,1, 10);
d[i] = threadsafe_rand(&seed,1, 50);
e[i] = threadsafe_rand(&seed,1, 5);
f[i] = threadsafe_rand(&seed,10, 80);
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
if (y[i] < min.value) {
min.value = y[i];
min.i = i;
}
if (y[i] > max.value) {
max.value = y[i];
max.i = i;
}
}
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", min.i, min.value, max.i, max.value);
return 0;
}
I need to calculate and compare the execution time of multiplying 2 matrices at 3 different sizes (100 * 100, 1000 * 1000 and 10000 * 10000) in the C programming language. I wrote the following simple code to do that for 1000 * 1000, and I got the execution time:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
int r1 = 1000, c1 = 1000, r2 = 1000, c2 = 1000, i, j, k;
// Dynamic allocation.
double(*a)[r1][c1] = malloc(sizeof *a);
double(*b)[r2][c2] = malloc(sizeof *b);
double(*result)[r1][c2] = malloc(sizeof *result);
// Storing elements of first matrix.
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c1; ++j)
{
(*a)[i][j] = rand() / RAND_MAX;
}
}
// Storing elements of second matrix.
for (i = 0; i < r2; ++i)
{
for (j = 0; j < c2; ++j)
{
(*b)[i][j] = rand() / RAND_MAX;
}
}
// Initializing all elements of result matrix to 0
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c2; ++j)
{
(*result)[i][j] = 0;
}
}
clock_t begin1 = clock();
// Multiplying matrices a and b and
// storing result in result matrix
for (i = 0; i < r1; ++i)
for (j = 0; j < c2; ++j)
for (k = 0; k < c1; ++k)
{
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
}
clock_t end1 = clock();
double time_taken = (double)(end1 - begin1) / CLOCKS_PER_SEC;
printf("\n function took %f seconds to execute \n", time_taken);
return 0;
}
And now I want to repeat this part for two other sizes and get the result like this at the end of my program with one run:
the execution time for 100 * 100 is 1 second
the execution time for 1000 * 1000 is 2 seconds
the execution time for 10000 * 10000 is 3 seconds
What is the best solution for that? When I repeat this part for 10000 * 10000 after 1000 * 1000, I get a segmentation fault.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main()
{
int r1 = 1000, c1 = 1000, r2 = 1000, c2 = 1000, i, j, k;
// Dynamic allocation.
double(*a)[r1][c1] = malloc(sizeof *a);
double(*b)[r2][c2] = malloc(sizeof *b);
double(*result)[r1][c2] = malloc(sizeof *result);
// Storing elements of first matrix.
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c1; ++j)
{
(*a)[i][j] = rand() / RAND_MAX;
}
}
// Storing elements of second matrix.
for (i = 0; i < r2; ++i)
{
for (j = 0; j < c2; ++j)
{
(*b)[i][j] = rand() / RAND_MAX;
}
}
// Initializing all elements of result matrix to 0
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c2; ++j)
{
(*result)[i][j] = 0;
}
}
clock_t begin1 = clock();
// Multiplying matrices a and b and
// storing result in result matrix
for (i = 0; i < r1; ++i)
for (j = 0; j < c2; ++j)
for (k = 0; k < c1; ++k)
{
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
}
clock_t end1 = clock();
double time_taken = (double)(end1 - begin1) / CLOCKS_PER_SEC;
printf("\n \nfunction took %f seconds to execute \n",
time_taken);
free(a);
free(b);
free(result);
r1 = 10000, c1 = 10000, r2 = 10000, c2 = 10000;
printf("\n run second one for %d \n",r1);
// Storing elements of first matrix.
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c1; ++j)
{
(*a)[i][j] = rand() / RAND_MAX;
}
}
// Storing elements of second matrix.
for (i = 0; i < r2; ++i)
{
for (j = 0; j < c2; ++j)
{
(*b)[i][j] = rand() / RAND_MAX;
}
}
// Initializing all elements of result matrix to 0
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c2; ++j)
{
(*result)[i][j] = 0;
}
}
begin1 = clock();
// Multiplying matrices a and b and
// storing result in result matrix
for (i = 0; i < r1; ++i)
for (j = 0; j < c2; ++j)
for (k = 0; k < c1; ++k)
{
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
}
end1 = clock();
time_taken = (double)(end1 - begin1) / CLOCKS_PER_SEC;
printf("\n second function took %f seconds to execute \n",
time_taken);
free(a);
free(b);
free(result);
return 0;
}
A simplified version of your program:
...
int main()
{
int r1 = 1000, c1 = 1000, r2 = 1000, c2 = 1000, i, j, k;
// Dynamic allocation.
double(*a)[r1][c1] = malloc(sizeof *a);
double(*b)[r2][c2] = malloc(sizeof *b);
double(*result)[r1][c2] = malloc(sizeof *result);
...
free(a);
free(b);
free(result);
r1 = 10000, c1 = 10000, r2 = 10000, c2 = 10000;
for (i = 0; i < r1; ++i)
for (j = 0; j < c1; ++j)
(*a)[i][j] = rand() / RAND_MAX; // KABOOM!
...
}
A quick but crucial note about VLA arrays: the "variable" in "variable-length array" means that the size is stored in a variable, not that the size can vary. This variable is hidden and can only be read via the sizeof operator.
The size of an array is bound to its type, not to its value. Therefore the dimensions of a VLA type (and object) cannot change, whether the object is dynamic or automatic.
The line:
double(*a)[r1][c1] = malloc(sizeof *a);
is interpreted as:
typedef double __hidden_type[r1][c1];
__hidden_type* a = malloc(sizeof *a);
... changes of r1 or c1 do not affect sizeof(__hidden_type)
The sizes are bound to the types when the types are defined. After that the types are immutable.
Therefore changing r1 does not change the size of *a. You need to create a new a (or rather a new type for it) and allocate memory for this new *a.
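A tiny demonstration (the sizes are just examples):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int r1 = 1000, c1 = 1000;
    double (*a)[r1][c1] = malloc(sizeof *a);
    printf("%zu\n", sizeof *a); // 8000000: captured when the type was defined
    r1 = 10000; c1 = 10000;
    printf("%zu\n", sizeof *a); // still 8000000: changing r1/c1 has no effect
    free(a);
    return 0;
}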
I suggest moving the whole test to a function that takes r1, r2, c1 and c2 as parameters. The arrays would be local to the function.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void bench(int r1, int c1, int r2, int c2) {
int i, j, k;
// Dynamic allocation.
double(*a)[r1][c1] = malloc(sizeof *a);
double(*b)[r2][c2] = malloc(sizeof *b);
double(*result)[r1][c2] = malloc(sizeof *result);
// Storing elements of first matrix.
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c1; ++j)
{
(*a)[i][j] = rand() /RAND_MAX;
}
}
// Storing elements of second matrix.
for (i = 0; i < r2; ++i)
{
for (j = 0; j < c2; ++j)
{
(*b)[i][j] = rand()/ RAND_MAX;
}
}
// Initializing all elements of result matrix to 0
for (i = 0; i < r1; ++i)
{
for (j = 0; j < c2; ++j)
{
(*result)[i][j] = 0;
}
}
clock_t begin1 = clock();
// Multiplying matrices a and b and
// storing result in result matrix
for (i = 0; i < r1; ++i)
for (j = 0; j < c2; ++j)
for (k = 0; k < c1; ++k)
{
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
}
clock_t end1 = clock();
double time_taken = (double)(end1 - begin1) / CLOCKS_PER_SEC;
printf("\n \nfunction took %f seconds to execute \n", time_taken);
free(a);
free(b);
free(result);
}
int main()
{
bench(1000, 1000, 1000, 1000);
bench(2000, 2000, 2000, 2000);
}
I've reduced the size from 10000 to 2000 to get results in reasonable time.
On my machine I got:
function took 1.966788 seconds to execute
function took 37.370633 seconds to execute
Note that the function is very cache-unfriendly.
for (i = 0; i < r1; ++i)
for (j = 0; j < c2; ++j)
for (k = 0; k < c1; ++k)
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
On every iteration of k you get a cache miss when accessing (*b)[k][j]. Try swapping the j and k loops:
for (i = 0; i < r1; ++i)
for (k = 0; k < c1; ++k)
for (j = 0; j < c2; ++j)
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
Now, when j increases, (*result)[i][j] and (*b)[k][j] are likely already in cache.
On my machine this trivial change gave 10x speedup:
function took 0.319594 seconds to execute
function took 3.829459 seconds to execute
There are multiple problems in your code:
you free the matrices and then perform a new benchmark, storing data through invalid pointers... this is undefined behavior, in your case a segmentation fault.
the allocation code is specific to the initial matrix size; you cannot reuse the matrices for a different size in the main() function. You should move the code to a separate function taking the matrix sizes as arguments and call this function multiple times.
the initialization values rand() / RAND_MAX are almost always zero because integer arithmetic is used for this division. You should use (*a)[i][j] = rand() / (double)RAND_MAX; (a toy illustration follows this list).
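A toy illustration of that pitfall (my snippet, not from the original code):
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int r = rand(); // 0 <= r <= RAND_MAX
    printf("%d\n", r / RAND_MAX); // integer division: 0 unless r == RAND_MAX
    printf("%f\n", r / (double)RAND_MAX); // what was intended: a value in [0, 1]
    return 0;
}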
Here is a modified version (similar to tstanisl's):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void test(int r1, int c1, int r2, int c2) {
int i, j, k;
// Dynamic allocation.
double(*a)[r1][c1] = malloc(sizeof *a);
double(*b)[r2][c2] = malloc(sizeof *b);
double(*result)[r1][c2] = malloc(sizeof *result);
// Storing elements of first matrix.
for (i = 0; i < r1; ++i) {
for (j = 0; j < c1; ++j) {
(*a)[i][j] = rand() / (double)RAND_MAX;
}
}
// Storing elements of second matrix.
for (i = 0; i < r2; ++i) {
for (j = 0; j < c2; ++j) {
(*b)[i][j] = rand() / (double)RAND_MAX;
}
}
// Initializing all elements of result matrix to 0
for (i = 0; i < r1; ++i) {
for (j = 0; j < c2; ++j) {
(*result)[i][j] = 0;
}
}
clock_t begin1 = clock();
// Multiplying matrices a and b and
// storing result in result matrix
// using cache friendly index order
for (i = 0; i < r1; ++i) {
for (k = 0; k < c1; ++k) {
for (j = 0; j < c2; ++j) {
(*result)[i][j] += (*a)[i][k] * (*b)[k][j];
}
}
}
clock_t end1 = clock();
double time_taken = (double)(end1 - begin1) / CLOCKS_PER_SEC;
printf("M(%d,%d) x M(%d,%d) took %f seconds to execute\n",
r1, c1, r2, c2, time_taken);
free(a);
free(b);
free(result);
}
int main() {
test(100, 100, 100, 100);
test(1000, 1000, 1000, 1000);
test(2000, 2000, 2000, 2000);
test(3000, 3000, 3000, 3000);
test(4000, 4000, 4000, 4000);
return 0;
}
Output:
M(100,100) x M(100,100) took 0.000347 seconds to execute
M(1000,1000) x M(1000,1000) took 0.616177 seconds to execute
M(2000,2000) x M(2000,2000) took 5.017987 seconds to execute
M(3000,3000) x M(3000,3000) took 17.703356 seconds to execute
M(4000,4000) x M(4000,4000) took 43.825951 seconds to execute
The time complexity of this simplistic implementation is O(N³), which is consistent with the above timings: scaling the 4000 case gives 43.8 s × (10000/4000)³ ≈ 685 s. Given enough RAM (2.4 GB), multiplying matrices with 10000 rows and columns would take a bit more than 10 minutes.
Achieving the multiplication of two 10k by 10k double matrices in 3 seconds requires specialized hardware and tailor-made software, well beyond the simple approach in this answer.
And now I want to repeat this part for two other sizes and get the result like this at the end of my program with one run: the execution time for 100 * 100 is 1 second the execution time for 1000 * 1000 is 2 seconds the execution time for 10000 * 10000 is 3 seconds
I simply cannot believe that you have multiplied two 10,000 x 10,000 matrices in 3 seconds. What computer are you running that experiment on? Not with that algorithm and only one core. Probably you compiled with optimization (the default flag -O2), and the whole algorithm was evicted from the compiler output: since you never use the matrix after the calculation, the compiler sees no reason to waste time in the loop. Print one element of the result matrix after the computation so the compiler cannot evict the calculation. But don't say your algorithm multiplies two matrices of 10,000 rows and columns in 3 seconds.
Allocating a matrix of 10,000 x 10,000 doubles requires a lot of memory: 100,000,000 double entries, which is 800 MB (10,000 × 10,000 × 8 bytes) in a single malloc. It's very possible that your laptop can handle this once... but don't do many of these allocations, as you will probably go over the limits of malloc(3) (or of your system).
Even more so when you need at least two of these allocations, or more, as you said you want to repeat the calculations.
Have you tried to scale your problem to 100,000 by 100,000 matrices?
When you repeat, there is no guarantee that the heap maintained by malloc() is free of fragmentation, so as you are requesting almost a gigabyte of contiguous memory per matrix, it is probable that malloc(3) runs out of memory in the last calls. I'd suggest you do the three tests in separate programs, and don't start any other program (like a browser or your favourite desktop environment) while running a matrix multiplication involving 200,000,000 numbers.
Anyway, your program will probably start swapping if you don't carefully control your execution environment, trashing all the efficiency measurements you are trying to make.
Another thing is that you could run into a process limit (if your administrator has established a maximum memory limit for your processes), and you don't check the result of malloc(3) for a successful allocation.
Note:
I have not been able to multiply the two 10,000 by 10,000 matrices on my machine (a Pentium Core Duo with 6 GB RAM) because the program starts swapping and becomes incredibly slow. BTW, be careful not to compile your program with optimization: even -O1 eliminates the matrix multiplication completely, as the product is never used, so the whole algorithm is removed from the output by the compiler's optimizer. I'll leave it running and update my answer if I get some result.
(edit)
$ matrix_mult
function took 9364.303443 seconds to execute
$
It took 2h 36m 4.303443s to execute on my system. This is more in line with the complexity of the problem (your 3-second figure would imply barely linear scaling). You need to compile without optimization, or at least print one element of the resulting matrix, if you want the product to actually be calculated.
Below is the code I'm testing, in case you want to try it. I have modified it a bit to make it more readable, to use better timestamps, and to skip the result-matrix initialization, as the accumulation can be done on the fly (read the comments in the code):
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 10000
int main()
{
// Dynamic allocation.
#define A (*a)
#define B (*b)
#define R (*result)
double A[N][N] = malloc(sizeof *a);
double B[N][N] = malloc(sizeof *b);
double R[N][N] = malloc(sizeof *result);
printf("Initializing both matrices\n");
// Storing elements of first (and second) matrices.
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
A[row][col] = rand() / (double)RAND_MAX; // matrix A
B[row][col] = rand() / (double)RAND_MAX; // matrix B
}
}
// Storing elements of second matrix. (done above)
// Initializing all elements of result matrix to 0
// (not needed, see below)
printf("Starting multiplication\n");
struct timespec start_ts; // start timestamp (secs & nsecs).
int res = clock_gettime(
CLOCK_PROCESS_CPUTIME_ID,
&start_ts); // cpu time only.
if (res < 0) {
perror("clock_gettime");
exit(EXIT_FAILURE);
}
// Multiplying matrices a and b and
// storing result in result matrix
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
double aux = 0.0;
for (int k = 0; k < N; ++k) {
aux += A[row][k] * B[k][col];
}
// why recompute the address of the affected
// cell at every inner loop iteration???
R[row][col] = aux;
}
}
struct timespec end_ts;
res = clock_gettime(
CLOCK_PROCESS_CPUTIME_ID,
&end_ts);
if (res < 0) {
perror("clock_gettime");
exit(EXIT_FAILURE);
}
bool carry = start_ts.tv_nsec > end_ts.tv_nsec;
struct timespec diff_time;
diff_time.tv_sec = end_ts.tv_sec - start_ts.tv_sec;
diff_time.tv_nsec = end_ts.tv_nsec - start_ts.tv_nsec;
if (carry) {
diff_time.tv_sec--;
diff_time.tv_nsec += 1000000000;
}
printf("\n function took %ld.%06ld seconds to execute \n",
diff_time.tv_sec, diff_time.tv_nsec / 1000);
return 0;
}
2nd edit
I have tested multiplication times on a modified version of your program (but using the same algorithm) with the program below:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX 10000
int dim[] = {
89, 144, 233, 377, 610,
987, 1597, 2584, 4181, 6765,
10000
}; /* several sizes taken from the Fibonacci sequence */
size_t dim_cnt = sizeof dim / sizeof dim[0];
double A[MAX][MAX];
double B[MAX][MAX];
double R[MAX][MAX];
int main()
{
for (int rep = 0; rep < dim_cnt; rep++) {
size_t N = dim[rep];
printf("It %d: Initializing both %zd x %zd matrices\n",
rep, N, N);
// Storing elements of first (and second) matrices.
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
A[row][col] = rand() /(double)RAND_MAX; // matrix A
B[row][col] = rand() /(double)RAND_MAX; // matrix B
}
}
// Storing elements of second matrix. (done above)
// Initializing all elements of result matrix to 0
// (not needed, see below)
printf("It %d: Starting multiplication\n", rep);
struct timespec start_ts; // start timestamp (secs & nsecs).
int res = clock_gettime(
CLOCK_PROCESS_CPUTIME_ID,
&start_ts); // cpu time only.
if (res < 0) {
perror("clock_gettime");
exit(EXIT_FAILURE);
}
// Multiplying matrices a and b and
// storing result in result matrix
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
double aux = 0.0;
for (int k = 0; k < N; ++k) {
aux += A[row][k] * B[k][col];
}
// why recompute the address of the affected
// cell at every inner loop iteration???
R[row][col] = aux;
}
}
struct timespec end_ts;
res = clock_gettime(
CLOCK_PROCESS_CPUTIME_ID,
&end_ts);
if (res < 0) {
perror("clock_gettime");
exit(EXIT_FAILURE);
}
bool carry = start_ts.tv_nsec > end_ts.tv_nsec;
struct timespec diff_time;
diff_time.tv_sec = end_ts.tv_sec - start_ts.tv_sec;
diff_time.tv_nsec = end_ts.tv_nsec - start_ts.tv_nsec;
if (carry) {
diff_time.tv_sec--;
diff_time.tv_nsec += 1000000000;
}
printf("It %d: R[0][0] = %g\n", rep, R[0][0]);
printf("%7zd %ld.%06ld\n",
N,
diff_time.tv_sec,
diff_time.tv_nsec / 1000);
}
return 0;
}
It uses a single, statically allocated block sized for the biggest of the matrices, and uses a subset of it to model the smaller ones. The execution times are far more plausible than the values you showed in your code, and I print one result-matrix cell value to force the optimizer to keep the matrix calculation. The execution you get should show values proportional to the ones here.
$ time matrix_mult
It 0: Initializing both 89 x 89 matrices
It 0: Starting multiplication
It 0: R[0][0] = 23.6756
89 0.005026
It 1: Initializing both 144 x 144 matrices
It 1: Starting multiplication
It 1: R[0][0] = 40.2614
144 0.019682
It 2: Initializing both 233 x 233 matrices
It 2: Starting multiplication
It 2: R[0][0] = 59.5599
233 0.095213
It 3: Initializing both 377 x 377 matrices
It 3: Starting multiplication
It 3: R[0][0] = 93.4422
377 0.392914
It 4: Initializing both 610 x 610 matrices
It 4: Starting multiplication
It 4: R[0][0] = 153.068
610 1.671904
It 5: Initializing both 987 x 987 matrices
It 5: Starting multiplication
It 5: R[0][0] = 252.981
987 8.816252
It 6: Initializing both 1597 x 1597 matrices
It 6: Starting multiplication
It 6: R[0][0] = 403.61
1597 37.807920
It 7: Initializing both 2584 x 2584 matrices
It 7: Starting multiplication
It 7: R[0][0] = 629.521
2584 157.371367
It 8: Initializing both 4181 x 4181 matrices
It 8: Starting multiplication
It 8: R[0][0] = 1036.47
4181 667.084346
It 9: Initializing both 6765 x 6765 matrices
It 9: Starting multiplication
It 9: R[0][0] = 1653.59
6765 2831.117818
It 10: Initializing both 10000 x 10000 matrices
It 10: Starting multiplication
It 10: R[0][0] = 2521.68
10000 9211.738007
real 216m46,129s
user 215m16,041s
sys 0m4,899s
$ _
On my system, the program shows the following size:
$ size matrix_mult
text data bss dec hex filename
2882 528 2400000016 2400003426 0x8f0d2562 matrix_mult
$ _
with a 2.4 GB bss segment, corresponding to three variables of around 800 MB each.
One last point: using VLAs and making your program dynamically allocate things that will be used for the whole program's life will not make it faster or slower. It's the algorithm that makes programs faster or slower. But I fear you have not calculated a 10,000 by 10,000 matrix product in 3 s.
Good day.
I want to implement the inner product in 3 methods:
1 - sequential
2 - half-parallel
3 - full-parallel
Half-parallel means the multiplication is done in parallel and the summation sequentially.
Here is my code:
int main(int argc, char *argv[]) {
int *x, *y, *z, *w, xy_p, xy_s, xy_ss, i, N=5000;
double s, e;
x = (int *) malloc(sizeof(int)*N);
y = (int *) malloc(sizeof(int)*N);
z = (int *) malloc(sizeof(int)*N);
w = (int *) malloc(sizeof(int)*N);
for(i=0; i < N; i++) {
x[i] = rand();
y[i] = rand();
z[i] = 0;
}
s = omp_get_wtime();
xy_ss = 0;
for(i=0; i < N; i++)
{
xy_ss += x[i] * y[i];
}
e = omp_get_wtime() - s;
printf ( "[**] Sequential execution time is:\n%15.10f and <A,B> is %d\n", e, xy_ss );
s = omp_get_wtime();
xy_s = 0;
#pragma omp parallel for shared ( N, x, y, z ) private ( i )
for(i = 0; i < N; i++)
{
z[i] = x[i] * y[i];
}
for(i=0; i < N; i++)
{
xy_s += z[i];
}
e = omp_get_wtime() - s;
printf ( "[**] Half-Parallel execution time is:\n%15.10f and <A,B> is %d\n", e, xy_s );
s = omp_get_wtime();
xy_p = 0;
# pragma omp parallel shared (N, x, y) private(i)
# pragma omp for reduction ( + : xy_p )
for(i = 0; i < N; i++)
{
xy_p += x[i] * y[i];
}
e = omp_get_wtime() - s;
printf ( "[**] Full-Parallel execution time is:\n%15.10f and <A,B> is %d\n", e, xy_p );
}
So I have some questions:
first, I want to know: is my code correct?
second: why isn't half-parallel faster than sequential?
third: is 5000 a good size for parallelism?
and finally: why is sequential the fastest? Because of N=5000?
A sample output:
Sequential execution time is:
0.0000196100 and dot is -1081001655
Half-Parallel execution time is:
0.0090819710 and dot is -1081001655
Full-Parallel execution time is:
0.0080959420 and dot is -1081001655
And for N=5000000:
Sequential execution time is:
0.0150297650 and is -1629514371
Half-Parallel execution time is:
0.0292110600 and is -1629514371
Full-Parallel execution time is:
0.0072323760 and is -1629514371
Anyway, why is half-parallel the slowest?
I have some trouble vectorizing some C code using SSE vector instructions. The code which I have to vectorize is
#define N 1000
void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
for (k = 0; k < N; ++k)
{
result[i][k] += mat1[i][j] * mat2[j][k];
}
}
}
}
Here is what I got so far:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k; int* l;
__m128i v1, v2, v3;
v3 = _mm_setzero_si128();
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j += 4)
{
for (k = 0; k < N; k += 4)
{
v1 = _mm_set1_epi32(mat1[i][j]);
v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
_mm_storeu_si128((__m128i*)&result[i][k], v3);
v3 = _mm_setzero_si128();
}
}
}
}
After execution I got the wrong result. I know that the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0]..., but what is actually loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3]..., because mat2 is stored in memory in row-major order. I tried to fix this problem but without any improvement.
Can anyone help me, please?
Below is your implementation, fixed:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k;
__m128i v1, v2, v3, v4;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j) // 'j' must be incremented by 1
{
// read mat1 here because it does not use 'k' index
v1 = _mm_set1_epi32(mat1[i][j]);
for (k = 0; k < N; k += 4)
{
v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);
// read what's in the result array first as we will need to add it later to our calculations
v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);
// use _mm_mullo_epi32 here instead _mm_mul_epi32 and add it to the previous result
v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));
// store the result
_mm_storeu_si128((__m128i*)&result[i][k], v4);
}
}
}
}
In short, _mm_mullo_epi32 (requires SSE4.1) produces 4 x int32 results, as opposed to _mm_mul_epi32, which produces 2 x int64 results. If you cannot use SSE4.1, then have a look at the answer here for an alternative SSE2 solution.
Full description by Intel Intrinsic Guide:
_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store
the low 32 bits of the intermediate integers in dst.
_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the
signed 64-bit results in dst.
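To make the difference concrete, a small illustration with invented lane values:
// with a = {1, 2, 3, 4} and b = {10, 20, 30, 40} as 32-bit lanes:
// _mm_mullo_epi32(a, b) -> {10, 40, 90, 160} (four 32-bit products)
// _mm_mul_epi32(a, b) -> {10, 90} (two 64-bit products, taken from lanes 0 and 2 only)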
I kinda changed your code around to make the addressing explicit [it helps in this case].
#define N 100
This is a stub for the vector unit's multiply & accumulate operation; you should be able to replace NV with whatever width your vector unit has, and put the relevant opcodes in here.
#define NV 8
int Vmacc(int *A, int *B) {
int i = 0;
int x = 0;
for (i = 0; i < NV; i++) {
x += *A++ * *B++;
}
return x;
}
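As one concrete possibility, here is a hypothetical SSE4.1 instantiation of that stub (my sketch; it assumes NV is a multiple of 4 and makes no alignment assumptions):
#include <smmintrin.h> // SSE4.1, for _mm_mullo_epi32
int Vmacc_sse41(int *A, int *B) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < NV; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i*)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(B + i));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(a, b)); // 4 x int32 products
    }
    // horizontal sum of the four 32-bit lanes
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}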
This multiply has two notable variations from the norm:
1. It caches the columnar vector into a contiguous one.
2. It attempts to push slices of the multiply-accumulate into a vector-like function.
Even without using the vector unit, this takes half the time of the naive version, just because of better cache/prefetch utilization.
void mm2(int *A, int *B, int n, int *C) {
int c, r;
int stride = 0;
int cache[N];
for (c = 0; c < n; c++) {
/* cache column c: */
for (r = 0; r < n; r++) {
cache[r] = B[c + r*n];
}
for (r = 0; r < n; r++) {
int k = 0;
int x = 0;
int *Av = A + r*n;
for (k = 0; k+NV-1 < n; k += NV) {
x += Vmacc(Av+k, cache+k);
}
while (k < n) {
x += Av[k] * cache[k];
k++;
}
C[r*n + c] = x;
}
}
}
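A possible driver (my addition, not part of the answer); it prints one element so the compiler cannot optimize the whole multiplication away:
#include <stdio.h>
int main(void) {
    static int A[N*N], B[N*N], C[N*N];
    for (int i = 0; i < N*N; i++) { A[i] = i % 7; B[i] = i % 5; }
    mm2(A, B, N, C);
    printf("C[0] = %d\n", C[0]);
    return 0;
}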
I want to implement the following equation in C:
C[l,q,m] = A[m,q,k] * B[k,l]
where the repeated index k is being summed over.
I implemented this in three ways:
Naive implementation with loops
Using the BLAS routine DGEMV (matrix-vector multiplication)
Using the BLAS routine DGEMM (matrix-matrix multiplication)
This is the minimal non-working code:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <cblas.h>
void main()
{
const size_t n = 3;
const size_t n2 = n*n;
const size_t n3 = n*n*n;
/* Fill rank 3 tensor with random numbers */
double a[n3];
for (size_t i = 0; i < n3; i++) {
a[i] = (double) rand() / RAND_MAX;
}
/* Fill matrix with random numbers */
double b[n2];
for (size_t i = 0; i < n2; i++) {
b[i] = (double) rand() / RAND_MAX;
}
/* All loops */
double c_exact[n3];
memset(c_exact, 0, n3 * sizeof(double));
for (size_t l = 0; l < n; l++) {
for (size_t q = 0; q < n; q++) {
for (size_t m = 0; m < n; m++) {
for (size_t k = 0; k < n; k++) {
c_exact[l*n2+q*n+m] += a[m*n2+q*n+k] * b[k*n+l];
}
}
}
}
/* Matrix-vector */
double c_mv[n3];
memset(c_mv, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
for (size_t l = 0; l < n; l++) {
cblas_dgemv(
CblasRowMajor, CblasNoTrans, n, n, 1.0, &a[m*n2],
n, &b[l], n, 0.0, &c_mv[l*n2+m], n);
}
}
/* Matrix-matrix */
double c_mm[n3];
memset(c_mm, 0, n3 * sizeof(double));
for (size_t m = 0; m < n; m++) {
cblas_dgemm(
CblasRowMajor, CblasTrans, CblasTrans, n, n, n, 1.0, b, n,
&a[m*n2], n, 0.0, &c_mm[m], n2);
}
/* Compute difference */
double diff_mv = 0.0;
double diff_mm = 0.0;
for (size_t idx = 0; idx < n3; idx++) {
diff_mv += c_mv[idx] - c_exact[idx];
diff_mm += c_mm[idx] - c_exact[idx];
}
printf("Difference matrix-vector: %e\n", diff_mv);
printf("Difference matrix-matrix: %e\n", diff_mm);
}
And this is the output:
Difference matrix-vector: 0.000000e+00
Difference matrix-matrix: -1.188678e+01
i.e. the DGEMV implementation is correct, the DGEMM one is not. I really don't understand this. I switched the order of the multiplication (matrix-matrix multiplication is non-commutative) and transposed both factors to get the right order C[l,q,m] instead of C[q,l,m], but I also tried it without switching/transposing and it does not work.
Can anyone please help?
Thanks.
Edit: I thought about it a bit and feel like I'm trying to do something that DGEMM does not support. Namely, I try to insert a submatrix into C[:,:,m], which means that both the leading and the trailing index are not contiguous in memory. DGEMM lets me set the parameter LDC, which in this case needs to be n^2, but it does not know that the second index is also non-contiguous, with a stride of n (and there is no parameter to tell it). So why does DGEMM not support a second parameter for the stride of the trailing dimension?