OpenMP Matrix Multiplication Issues - c

I am trying to multiply two matrices.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

#define N 2048
#define FactorIntToDouble 1.1
#define THREAD_NUM 4

double firstMatrix[N][N] = {0.0};
double secondMatrix[N][N] = {0.0};
double matrixMultiResult[N][N] = {0.0};
// Sync
void matrixMulti() {
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            double resultValue = 0;
            for (int transNumber = 0; transNumber < N; transNumber++) {
                resultValue += firstMatrix[row][transNumber] * secondMatrix[transNumber][col];
            }
            matrixMultiResult[row][col] = resultValue;
        }
    }
}

void matrixInit() {
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            srand(row + col);
            firstMatrix[row][col] = (rand() % 10) * FactorIntToDouble;
            secondMatrix[row][col] = (rand() % 10) * FactorIntToDouble;
        }
    }
}

// Parallel
void matrixMulti2(int start, int end) {
    printf("Op: %d - %d\n", start, end);
    for (int row = start; row < end; row++) {
        for (int col = 0; col < N; col++) {
            double resultValue = 0;
            for (int transNumber = 0; transNumber < N; transNumber++) {
                resultValue += firstMatrix[row][transNumber] * secondMatrix[transNumber][col];
            }
            matrixMultiResult[row][col] = resultValue;
        }
    }
}

void process1() {
    clock_t t1 = clock();
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int thread_multi = N / 4;
        int start = thread * thread_multi;
        int end = 0;
        if (thread == (THREAD_NUM - 1)) {
            end = start + thread_multi;
        } else {
            end = (start + thread_multi) - 1;
        }
        matrixMulti2(start, end);
    }
    clock_t t2 = clock();
    printf("time 2: %ld\n", t2 - t1);
}

int main() {
    matrixInit();
    clock_t t1 = clock();
    matrixMulti();
    clock_t t2 = clock();
    printf("time: %ld", t2 - t1);
    process1();
    return 0;
}
I have both a serial and a parallel version, but the parallel version is slower than the serial one.
Currently the serial version takes around 90 seconds and the parallel version over 100, which makes no sense to me.
My logic was to split the matrix into 4 parts, one per thread, which I believe is reasonable.
After I finish this part, I would like to figure out how to speed up the parallel version even more, possibly using Strassen's matrix multiplication. I just don't know where to start or how to get to that point.
I've already spent around 5 hours trying to figure this out.

The simplest fix is to let OpenMP distribute the rows itself instead of computing the thread ranges by hand. Here it is:
// Parallel
void matrixMulti() {
    #pragma omp parallel for collapse(2)
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            double resultValue = 0;
            for (int transNumber = 0; transNumber < N; transNumber++) {
                resultValue += firstMatrix[row][transNumber] * secondMatrix[transNumber][col];
            }
            matrixMultiResult[row][col] = resultValue;
        }
    }
}
Update: Here is what I got on an 8-core system using gcc 10.3 with the -O3 -fopenmp flags (I show the program's output and the result of the Linux time command).
main() was changed to measure the time with omp_get_wtime(), because on Linux clock() measures processor time:
double t1 = omp_get_wtime();
matrixMulti();
double t2 = omp_get_wtime();
printf("time: %f", t2-t1);
Serial program:
time: 25.895234
real 0m33.296s
user 0m33.139s
sys 0m0.152s
using: #pragma omp parallel for
time: 3.573521
real 0m11.120s
user 0m32.205s
sys 0m0.136s
using: #pragma omp parallel for collapse(2)
time: 5.466674
real 0m12.786s
user 0m49.978s
sys 0m0.248s
The results suggest that initialization of the matrices takes ca. 8 s, so it may also be worth parallelizing (a sketch follows below). Without collapse(2) the program runs faster, so do not use the collapse(2) clause.
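A minimal sketch of a parallelized initialization, assuming the globals from the question; note that rand() is not thread-safe, so this swaps in POSIX rand_r() with a per-row seed (which also changes the generated numbers relative to the serial version):
#include <stdlib.h>  // rand_r (POSIX)

void matrixInitParallel() {
    #pragma omp parallel for
    for (int row = 0; row < N; row++) {
        unsigned int seed = row;  // private per-row seed replaces srand()/rand()
        for (int col = 0; col < N; col++) {
            firstMatrix[row][col]  = (rand_r(&seed) % 10) * FactorIntToDouble;
            secondMatrix[row][col] = (rand_r(&seed) % 10) * FactorIntToDouble;
        }
    }
}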
Note that on your system you may get a different speed improvement, or even a decrease, depending on your hardware. The speed of matrix multiplication strongly depends on the speed of memory reads/writes. Shared-memory multicore systems (i.e. most PCs and laptops) may not show much speed increase upon parallelization of this program, but distributed-memory multicore systems (i.e. high-end servers) definitely show a performance increase.
Update 2: On a Ryzen 7 5800X I got 41.6 s vs 1.68 s, which is a bigger speedup than the number of cores. This is because more cache memory is available when all the cores are used.

Related

How to optimise my code that computes the sum of all primes less than 2 million

I've tried this problem from Project Euler, where I need to calculate the sum of all primes below two million.
This is the solution I've come up with -
#include <stdio.h>

int main() {
    long sum = 5; // Already counting 2 and 3 in my sum.
    int i = 5;    // Checking from 5
    int count = 0;
    while (i <= 2000000) {
        count = 0;
        // Checking if i (starting from 5) is divisible by any odd j
        // from 3 to i/2
        for (int j = 3; j <= i / 2; j += 2) {
            if (i % j == 0) {
                count = 1;
            }
        }
        if (count == 0) {
            sum += i;
        }
        i += 2;
    }
    printf("%ld ", sum);
}
It takes around 480 secs to run, and I was wondering if there is a better solution or tips to improve my program.
________________________________________________________
Executed in 480.95 secs fish external
usr time 478.54 secs 0.23 millis 478.54 secs
sys time 1.28 secs 6.78 millis 1.28 secs
With two little modifications your code becomes orders of magnitude faster:
#include <stdio.h>
#include <math.h>

int main() {
    long long sum = 5;  // we need long long; long might not be enough
                        // depending on your platform
    int i = 5;
    int count = 0;
    while (i <= 2000000) {
        count = 0;
        int limit = sqrt(i);                   // determine the upper limit once per i
        for (int j = 3; j <= limit; j += 2) {  // use upper limit sqrt(i) instead of i/2
            if (i % j == 0) {
                count = 1;
                break;                         // break out of the loop as soon
                                               // as the number is not prime
            }
        }
        if (count == 0) {
            sum += i;
        }
        i += 2;
    }
    printf("%lld ", sum);  // we need %lld for long long
}
All explanations are in the comments.
But there are certainly better and even faster ways to do this.
I ran this on my 10-year-old MacPro, and for a limit of 20 million it took around 30 seconds.
This program computes the sum for 2 million near-instantly (even in Debug...) and needs just one second for 20 million (Windows 10, 10-year-old i7 @ 3.4 GHz, MSVC 2019).
Note: I didn't have time to set up my C compiler, which is why there is a cast on the malloc.
The "big" optimization is to store the square values AND the prime numbers, so absolutely no impossible divisor is tested. Since primes make up less than 1/10th of a given integer interval (a heuristic; robust code should test for this and realloc the primes array when needed), the time is drastically cut.
#include <stdio.h>
#include <stdlib.h>

#define LIMIT 2000000ul // Computation limit.

typedef struct {
    unsigned long int p;  // Store a prime number.
    unsigned long int sq; // and its square.
} prime;

int main() {
    // Store found primes. Can quite safely use 1/10th of the whole computation limit.
    prime* primes = (prime*)malloc((LIMIT / 10) * sizeof(*primes));
    unsigned long int primes_count = 1;
    unsigned long int i = 3;
    unsigned long long int sum = 0;
    unsigned long int j = 0;
    int is_prime = 1;
    // Feed the first prime, 2.
    primes[0].p = 2;
    primes[0].sq = 4;
    sum = 2;
    // Parse all numbers up to LIMIT, ignoring even numbers.
    // Also reset the "is_prime" flag at each loop.
    for (i = 3; i <= LIMIT; i += 2, is_prime = 1) {
        // Parse all previously found primes.
        for (j = 0; j < primes_count; j++) {
            // Above sqrt(i)? Break, i is a prime.
            if (i < primes[j].sq)
                break;
            // Found a divisor? Not a prime (and break).
            if (i % primes[j].p == 0) {
                is_prime = 0;
                break;
            }
        }
        // Add the prime and its square to the array "primes".
        if (is_prime) {
            primes[primes_count].p = i;
            primes[primes_count++].sq = i * i;
            // Compute the sum on-the-fly.
            sum += i;
        }
    }
    printf("Sum of all %lu primes: %llu\n", primes_count, sum);
    free(primes);
}
Your program can easily be improved by stopping the inner loop earlier:
when j exceeds sqrt(i).
when a divisor has been found.
Also note that type long might not be large enough for the sum on all architectures. long long is recommended.
Here is a modified version:
#include <stdio.h>

int main() {
    long long sum = 5;  // Already counting 2 and 3 in my sum.
    long i = 5;         // Checking from 5
    while (i <= 2000000) {
        int count = 0;
        // Check odd divisors j up to sqrt(i)
        for (int j = 3; j * j <= i; j += 2) {
            if (i % j == 0) {
                count = 1;
                break;
            }
        }
        if (count == 0) {
            sum += i;
        }
        i += 2;
    }
    printf("%lld\n", sum);
}
This simple change drastically reduces the runtime! It is more than 1000 times faster for 2000000:
chqrlie> time ./primesum
142913828922
real 0m0.288s
user 0m0.264s
sys 0m0.004s
Note however that trial division is much less efficient than the classic sieve of Eratosthenes.
Here is a simplistic version:
#include <stdio.h>
#include <stdlib.h>

int main() {
    long max = 2000000;
    long long sum = 0;
    // Allocate an array of indicators initialized to 0
    unsigned char *composite = calloc(1, max + 1);
    // For all numbers up to sqrt(max)
    for (long i = 2; i * i <= max; i++) {
        // If the number is a prime
        if (composite[i] == 0) {
            // Set all multiples as composite. Multiples below the
            // square of i are skipped because they have already been
            // set as multiples of a smaller prime.
            for (long j = i * i; j <= max; j += i) {
                composite[j] = 1;
            }
        }
    }
    for (long i = 2; i <= max; i++) {
        if (composite[i] == 0)
            sum += i;
    }
    printf("%lld\n", sum);
    free(composite);
    return 0;
}
This code is another 20 times faster for 2000000:
chqrlie> time ./primesum-sieve
142913828922
real 0m0.014s
user 0m0.007s
sys 0m0.002s
The sieve approach can be further improved in many ways for larger boundaries.
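For example, a minimal sketch of one common refinement, an odds-only sieve, which halves both the memory footprint and the marking work; index i represents the odd number 2*i + 1:
#include <stdio.h>
#include <stdlib.h>

int main() {
    long max = 2000000;
    long half = max / 2;        // composite[i] stands for the odd number 2*i + 1
    unsigned char *composite = calloc(half, 1);
    long long sum = 2;          // 2 is the only even prime
    for (long i = 1; 2 * i + 1 <= max; i++) {
        if (composite[i])
            continue;
        long p = 2 * i + 1;
        sum += p;
        // mark odd multiples of p starting at p*p, whose index is (p*p - 1) / 2;
        // the guard avoids overflow of p*p where long is 32 bits
        if (p <= max / p) {
            for (long j = (p * p - 1) / 2; j < half; j += p)
                composite[j] = 1;
        }
    }
    printf("%lld\n", sum);
    free(composite);
    return 0;
}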

Increasing n-body program performance using OpenMP

My goal is to increase the performance of a program that simulates the n-body problem.
This is where the time is calculated. The two functions that need to be parallelized are calculate_forces() and move_bodies(), but since the loop control variable t is a double I cannot have a #pragma omp parallel for statement there.
t0 = gettime ();
for (t = 0; t < t_end; t += dt)
{
    // draw bodies
    show_bodies (window);
    // computation
    calculate_forces ();
    move_bodies ();
}
// print out calculation speed every second
t0 = gettime () - t0;
The two functions calculate_forces() and move_bodies() with the respective directives that I used are the following:
static void
calculate_forces ()
{
    double distance, magnitude, factor, r;
    vector_t direction;
    int i, j;

    #pragma omp parallel private(distance, magnitude, factor, direction)
    {
        #pragma omp for private(i, j)
        for (i = 0; i < n_body - 1; i++)
        {
            for (j = i + 1; j < n_body; j++)
            {
                r = SQR (bodies[i].position.x - bodies[j].position.x) + SQR (bodies[i].position.y - bodies[j].position.y);
                // avoid numerical instabilities
                if (r < EPSILON)
                {
                    // this is not how nature works :-)
                    r += EPSILON;
                }
                distance = sqrt (r);
                magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
                factor = magnitude / distance;
                direction.x = bodies[j].position.x - bodies[i].position.x;
                direction.y = bodies[j].position.y - bodies[i].position.y;
                // +force for body i
                #pragma omp critical
                {
                    bodies[i].force.x += factor * direction.x;
                    bodies[i].force.y += factor * direction.y;
                    // -force for body j
                    bodies[j].force.x -= factor * direction.x;
                    bodies[j].force.y -= factor * direction.y;
                }
            }
        }
    }
}
static void
move_bodies ()
{
    vector_t delta_v, delta_p;
    int i;

    #pragma omp parallel private(delta_v, delta_p, i)
    {
        #pragma omp for
        for (i = 0; i < n_body; i++)
        {
            // calculate delta_v
            delta_v.x = bodies[i].force.x / bodies[i].mass * dt;
            delta_v.y = bodies[i].force.y / bodies[i].mass * dt;
            // calculate delta_p
            delta_p.x = (bodies[i].velocity.x + delta_v.x / 2.0) * dt;
            delta_p.y = (bodies[i].velocity.y + delta_v.y / 2.0) * dt;
            // update body velocity and position
            #pragma omp critical
            {
                bodies[i].velocity.x += delta_v.x;
                bodies[i].velocity.y += delta_v.y;
                bodies[i].position.x += delta_p.x;
                bodies[i].position.y += delta_p.y;
            }
            // reset forces
            bodies[i].force.x = bodies[i].force.y = 0.0;
            if (bounce)
            {
                // bounce on boundaries (i.e. it's more like billiard)
                if ((bodies[i].position.x < -body_distance_factor) || (bodies[i].position.x > body_distance_factor))
                    bodies[i].velocity.x = -bodies[i].velocity.x;
                if ((bodies[i].position.y < -body_distance_factor) || (bodies[i].position.y > body_distance_factor))
                    bodies[i].velocity.y = -bodies[i].velocity.y;
            }
        }
    }
}
The values of bodies.velocity and bodies.position are changed in the move_bodies() function, but I couldn't use a reduction.
There is also a checksum function used to check whether the calculated checksum equals the reference checksum. That function looks like this:
static unsigned long
checksum ()
{
    unsigned long checksum = 0;

    // accumulate the rounded positions of all bodies
    for (int i = 0; i < n_body; i++)
    {
        checksum += (unsigned long) round (bodies[i].position.x);
        checksum += (unsigned long) round (bodies[i].position.y);
    }
    return checksum;
}
This function uses the previously calculated values of bodies.position.x and bodies.position.y from move_bodies(), which is why I used a critical block while calculating those values, but that didn't seem to yield a correct answer. Can anyone give me some insight on where I am going wrong? Thank you in advance.
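One concrete thing to check, sketched below under the question's own declarations (bodies, vector_t, SQR, G, EPSILON, n_body): in calculate_forces() the variable r is not listed in any private clause, so all threads share it, which is a data race that corrupts the computed forces. The sketch makes r loop-local and also accumulates forces into per-thread buffers (needs <stdlib.h> and <math.h>), merging them once per thread instead of entering a critical section on every interaction:
static void
calculate_forces ()
{
    #pragma omp parallel
    {
        // per-thread force accumulators, zero-initialized
        double *fx = calloc (n_body, sizeof *fx);
        double *fy = calloc (n_body, sizeof *fy);

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < n_body - 1; i++)
        {
            for (int j = i + 1; j < n_body; j++)
            {
                // r is local to the loop body, hence private to each thread
                double r = SQR (bodies[i].position.x - bodies[j].position.x) + SQR (bodies[i].position.y - bodies[j].position.y);
                if (r < EPSILON)
                    r += EPSILON;
                double distance = sqrt (r);
                double magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
                double factor = magnitude / distance;
                double dx = bodies[j].position.x - bodies[i].position.x;
                double dy = bodies[j].position.y - bodies[i].position.y;
                fx[i] += factor * dx;
                fy[i] += factor * dy;
                fx[j] -= factor * dx;
                fy[j] -= factor * dy;
            }
        }

        // merge the private buffers: one critical section per thread
        // instead of one per interaction
        #pragma omp critical
        for (int i = 0; i < n_body; i++)
        {
            bodies[i].force.x += fx[i];
            bodies[i].force.y += fy[i];
        }
        free (fx);
        free (fy);
    }
}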

How to optimize Matrix initialization and transposition to run faster using C

The dimensions of this matrix are 40000 x 40000. I was supposed to consider spatial and temporal locality in the program, but I have no idea how to optimize this code. It takes about 50+ seconds on my computer, which is not acceptable for our group. The block size is currently 500. Could someone help me improve this code?
void InitializeMatrixRowwise()
{
    int i, j, ii, jj;
    double x;
    x = 0.0;
    for (i = 0; i < DIMENSION; i += BLOCKSIZE)
    {
        for (j = 0; j < DIMENSION; j += BLOCKSIZE)
        {
            for (ii = i; ii < i + BLOCKSIZE && ii < DIMENSION; ii++)
            {
                for (jj = j; jj < j + BLOCKSIZE && jj < DIMENSION; jj++)
                {
                    if (ii >= jj)
                    {
                        Matrix[ii][jj] = x++;
                    }
                    else
                        Matrix[ii][jj] = 1.0;
                }
            }
        }
    }
}

void TransposeMatrixRowwise()
{
    int column, row, i, j;
    double temp;
    for (row = 0; row < DIMENSION; row += BLOCKSIZE)
    {
        for (column = 0; column < DIMENSION; column += BLOCKSIZE)
        {
            for (i = row; i < row + BLOCKSIZE && i < DIMENSION; i++)
            {
                for (j = column; j < column + BLOCKSIZE && j < DIMENSION; j++)
                {
                    if (i > j)
                    {
                        temp = Matrix[i][j];
                        Matrix[i][j] = Matrix[j][i];
                        Matrix[j][i] = temp;
                    }
                }
            }
        }
    }
}
Your transpose function seems like it might be more complex, and therefore slower, than necessary. However, I created two versions of the code with timing inserted on the 'full size' problem (40k x 40k array, with 500 x 500 blocks), one using your transpose function and one using this much simpler algorithm:
static void TransposeMatrixRowwise(void)
{
    for (int row = 0; row < DIMENSION; row++)
    {
        for (int col = row + 1; col < DIMENSION; col++)
        {
            double temp = Matrix[row][col];
            Matrix[row][col] = Matrix[col][row];
            Matrix[col][row] = temp;
        }
    }
}
This looks much simpler; it has only two nested loops instead of four, but the timing turns out to be dramatically worse — 31.5s vs 14.7s.
# Simple transpose
# Count = 7
# Sum(x1) = 220.87
# Sum(x2) = 6979.00
# Mean = 31.55
# Std Dev = 1.27 (sample)
# Variance = 1.61 (sample)
# Min = 30.41
# Max = 33.54
# Complex transpose
# Count = 7
# Sum(x1) = 102.81
# Sum(x2) = 1514.00
# Mean = 14.69
# Std Dev = 0.82 (sample)
# Variance = 0.68 (sample)
# Min = 13.59
# Max = 16.21
The reason for the performance difference is almost certainly due to locality of reference. The more complex algorithm is working with two separate blocks of memory at a time, whereas the simpler algorithm is ranging over far more memory, leading to many more page misses, and the slower performance.
Thus, while you might be able to tune the transpose algorithm using different block sizes (it needn't be the same block size as was used to generate the matrices; see the sketch below), there is little doubt based on these measurements that the more complex algorithm is more efficient.
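For such tuning experiments, a sketch of the blocked transpose with the tile size as a run-time parameter (Matrix and DIMENSION assumed from the question); it walks only the blocks on or below the diagonal, so every i > j pair is swapped exactly once:
static void TransposeBlocked(int bs)
{
    // visit only blocks on or below the diagonal
    for (int row = 0; row < DIMENSION; row += bs)
        for (int column = 0; column <= row; column += bs)
            for (int i = row; i < row + bs && i < DIMENSION; i++)
                // j < i restricts diagonal blocks to their lower triangle
                for (int j = column; j < column + bs && j < i; j++)
                {
                    double temp = Matrix[i][j];
                    Matrix[i][j] = Matrix[j][i];
                    Matrix[j][i] = temp;
                }
}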
I also did a check at 1/10th scale (4k x 4k matrix, 50 x 50 block size) to ensure that the output from the transposition was the same (about 152 MiB of text); I didn't save the data at full scale, which would have been more than 100 times as much. The times at 1/10th scale were dramatically better, less than 1/100th of the full-scale times, for both versions:
< Initialization: 0.068667
< Transposition: 0.063927
---
> Initialization: 0.081022
> Transposition: 0.039169
4005c4005
< Print transposition: 3.901960
---
> Print transposition: 4.040136
JFTR: Testing on a 2016 MacBook Pro running macOS High Sierra 10.13.1 with 2.7 GHz Intel Core i7 CPU and 16 GB 2133 MHz LPDDR3 RAM. The compiler was GCC 7.2.0 (home-built). There was a browser running (but mostly inactive) and music playing in the background, so the machine wasn't idle, but I don't think those will dramatically affect the numbers.

Poor maths performance in C vs Python/numpy

Near-duplicate / related:
How does BLAS get such extreme performance? (If you want fast matmul in C, seriously just use a good BLAS library unless you want to hand-tune your own asm version.) But that doesn't mean it's not interesting to see what happens when you compile less-optimized matrix code.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
Matrix Multiplication with blocks
Out of interest, I decided to compare the performance of (inexpertly) handwritten C vs. Python/numpy performing a simple matrix multiplication of two, large, square matrices filled with random numbers from 0 to 1.
I found that Python/numpy outperformed my C code by over 10,000x. This is clearly not right, so what is wrong with my C code that is causing it to perform so poorly? (Even compiled with -O3 or -Ofast.)
The python:
import time
import numpy as np
t0 = time.time()
m1 = np.random.rand(2000, 2000)
m2 = np.random.rand(2000, 2000)
t1 = time.time()
m3 = m1 @ m2
t2 = time.time()
print('creation time: ', t1 - t0, ' \n multiplication time: ', t2 - t1)
The C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    clock_t t0 = clock(), t1, t2;
    // create matrices and allocate memory
    int m_size = 2000;
    int i, j, k;
    double running_sum;
    double *m1[m_size], *m2[m_size], *m3[m_size];
    double f_rand_max = (double)RAND_MAX;
    for (i = 0; i < m_size; i++) {
        m1[i] = (double *)malloc(sizeof(double) * m_size);
        m2[i] = (double *)malloc(sizeof(double) * m_size);
        m3[i] = (double *)malloc(sizeof(double) * m_size);
    }
    // populate with random numbers 0 - 1
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++) {
            m1[i][j] = (double)rand() / f_rand_max;
            m2[i][j] = (double)rand() / f_rand_max;
        }
    t1 = clock();
    // multiply together
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++) {
            running_sum = 0;
            for (k = 0; k < m_size; k++)
                running_sum += m1[i][k] * m2[k][j];
            m3[i][j] = running_sum;
        }
    t2 = clock();
    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC);
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC);
    printf("creation time: %f", t01);
    printf("\nmultiplication time: %f", t12);
    return 0;
}
EDIT: I have corrected the Python to do a proper dot product, which closes the gap a little, and the C to time with microsecond resolution and to use the comparable double data type rather than the float originally posted.
Outputs:
$ gcc -O3 -march=native bench.c
$ ./a.out
creation time: 0.092651
multiplication time: 139.945068
$ python3 bench.py
creation time: 0.1473407745361328
multiplication time: 0.329038143157959
It has been pointed out that the naive algorithm implemented here in C could be improved in ways that lend themselves to make better use of compiler optimisations and the cache.
EDIT: Having modified the C code to transpose the second matrix in order to achieve a more efficient access pattern, the gap closes further.
The modified multiplication code:
// transpose m2 in order to capitalise on cache efficiencies
// store transposed matrix in m3 for now
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++)
        m3[j][i] = m2[i][j];
// swap the row pointers so m2 refers to the transposed data
for (i = 0; i < m_size; i++) {
    double *mtemp = m3[i];
    m3[i] = m2[i];
    m2[i] = mtemp;
}
// multiply together
for (i = 0; i < m_size; i++)
    for (j = 0; j < m_size; j++) {
        running_sum = 0;
        for (k = 0; k < m_size; k++)
            running_sum += m1[i][k] * m2[j][k];
        m3[i][j] = running_sum;
    }
The results:
$ gcc -O3 -march=native bench2.c
$ ./a.out
creation time: 0.107767
multiplication time: 10.843431
$ python3 bench.py
creation time: 0.1488208770751953
multiplication time: 0.3335080146789551
EDIT: compiling with -Ofast, which I am reassured is a fair comparison, brings the difference down to just over an order of magnitude (in numpy's favour).
$ gcc -Ofast -march=native bench2.c
$ ./a.out
creation time: 0.098201
multiplication time: 4.766985
$ python3 bench.py
creation time: 0.13812589645385742
multiplication time: 0.3441300392150879
EDIT: It was suggested to change the indexing from arr[i][j] to arr[i*m_size + j]; this yielded a small performance increase:
for m_size = 10000
$ gcc -Ofast -march=native bench3.c # indexed by arr[ i * m_size + j ]
$ ./a.out
creation time: 1.280863
multiplication time: 626.327820
$ gcc -Ofast -march=native bench2.c # indexed by arr[i][j]
$ ./a.out
creation time: 2.410230
multiplication time: 708.979980
$ python3 bench.py
creation time: 3.8284950256347656
multiplication time: 39.06089973449707
The up-to-date code, bench3.c:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    clock_t t0, t1, t2;
    t0 = clock();
    // create matrices and allocate memory
    int m_size = 10000;
    int i, j, k, x, y;
    double running_sum;
    double *m1 = (double *)malloc(sizeof(double) * m_size * m_size),
           *m2 = (double *)malloc(sizeof(double) * m_size * m_size),
           *m3 = (double *)malloc(sizeof(double) * m_size * m_size);
    double f_rand_max = (double)RAND_MAX;
    // populate with random numbers 0 - 1
    for (i = 0; i < m_size; i++) {
        x = i * m_size;
        for (j = 0; j < m_size; j++) {
            m1[x + j] = ((double)rand()) / f_rand_max;
            m2[x + j] = ((double)rand()) / f_rand_max;
            m3[x + j] = ((double)rand()) / f_rand_max;
        }
    }
    t1 = clock();
    // transpose m2 in order to capitalise on cache efficiencies
    // store transposed matrix in m3 for now
    for (i = 0; i < m_size; i++)
        for (j = 0; j < m_size; j++)
            m3[j * m_size + i] = m2[i * m_size + j];
    // swap the pointers
    double *mtemp = m3;
    m3 = m2;
    m2 = mtemp;
    // multiply together
    for (i = 0; i < m_size; i++) {
        x = i * m_size;
        for (j = 0; j < m_size; j++) {
            running_sum = 0;
            y = j * m_size;
            for (k = 0; k < m_size; k++)
                running_sum += m1[x + k] * m2[y + k];
            m3[x + j] = running_sum;
        }
    }
    t2 = clock();
    float t01 = ((float)(t1 - t0) / CLOCKS_PER_SEC);
    float t12 = ((float)(t2 - t1) / CLOCKS_PER_SEC);
    printf("creation time: %f", t01);
    printf("\nmultiplication time: %f", t12);
    return 0;
}
CONCLUSION: So the original absurd factor of 10,000x was largely due to mistakenly comparing element-wise multiplication in Python/numpy against C code that was neither compiled with all of the available optimisations nor written with an efficient memory access pattern, so it likely didn't utilise the cache.
A 'fair' comparison (i.e. a correct but highly inefficient single-threaded algorithm, compiled with -Ofast) yields a performance difference of roughly 350x.
A number of simple edits to improve the memory access pattern brought the comparison down to a factor of 16x (in numpy's favour) for large matrix (10000 x 10000) multiplication. Furthermore, numpy automatically utilises all four virtual cores on my machine whereas this C code does not, so the real per-core difference could be a factor of 4x to 8x (depending on how well this program runs with hyperthreading). I consider a factor of 4x to 8x fairly sensible, given that I don't really know what I'm doing and just knocked a bit of code together, whereas numpy is based on BLAS, which I understand has been extensively optimised over the years by experts from all over the place, so I consider the question answered/solved.
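For reference, loop tiling (the "Matrix Multiplication with blocks" approach linked above) is one standard next step before reaching for BLAS. A minimal sketch on the flat row-major arrays used in bench3.c; BS is a tile size that would need tuning for the target cache (64 is only a starting guess):
#define BS 64

static void matmul_blocked(const double *m1, const double *m2, double *m3, int n)
{
    for (int i = 0; i < n * n; i++)
        m3[i] = 0.0;
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                // work on one BS x BS tile at a time so all three
                // operands stay resident in cache
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = m1[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            m3[i * n + j] += a * m2[k * n + j];
                    }
}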

Why the average speed of n threads is not as fast as one single thread in C?

I wrote a program with 2 threads doing the same thing, but I found the throughput of each thread was lower than when I spawned only one thread. Then I wrote this simple test to see if that's my problem or if it's because of the system.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add(void *ptr) {
    clock_t t1, t2;
    t1 = clock();
    int sum = 0;
    int i = 0, j = 0, k = 0;
    int iteration = 1000;
    long total = iteration * iteration * iteration;
    for (i = 0; i < iteration; i++) {
        for (j = 0; j < iteration; j++) {
            for (k = 0; k < iteration; k++) {
                sum++;
            }
        }
    }
    t2 = clock();
    float diff = ((float)(t2 - t1) / 1000000.0F);
    printf("thread id = %d\n", (int)(pthread_self()));
    printf("Total addtions: %ld\n", total);
    printf("Total time: %f second\n", diff);
    printf("Addition per second: %f\n", total / diff);
    printf("\n");
    return NULL;
}

void run_test(int num_thread) {
    pthread_t pth_arr[num_thread];
    int i = 0;
    for (i = 0; i < num_thread; i++) {
        pthread_create(&pth_arr[i], NULL, run_add, NULL);
    }
    for (i = 0; i < num_thread; i++) {
        pthread_join(pth_arr[i], NULL);
    }
}

int main() {
    int num_thread = 5;
    int i = 0;
    for (i = 1; i < num_thread; i++) {
        printf("Running SUM with %d threads. \n\n", i);
        run_test(i);
    }
    return 0;
}
The result still shows the average speed of n threads is slower than one single thread. The more threads I have, the slower each one is.
Here's the result:
Running SUM with 1 threads.
thread id = 528384,
Total addtions: 1000000000,
Total time: 1.441257 second,
Addition per second: 693838784.000000
Running SUM with 2 threads.
thread id = 528384,
Total addtions: 1000000000,
Total time: 2.970870 second,
Addition per second: 336601728.000000
thread id = 1064960,
Total addtions: 1000000000,
Total time: 2.972992 second,
Addition per second: 336361504.000000
Running SUM with 3 threads.
thread id = 1064960,
Total addtions: 1000000000,
Total time: 4.434701 second,
Addition per second: 225494352.000000
thread id = 1601536,
Total addtions: 1000000000,
Total time: 4.449250 second,
Addition per second: 224756976.000000
thread id = 528384,
Total addtions: 1000000000,
Total time: 4.454826 second,
Addition per second: 224475664.000000
Running SUM with 4 threads.
thread id = 528384,
Total addtions: 1000000000,
Total time: 6.261967 second,
Addition per second: 159694224.000000
thread id = 1064960,
Total addtions: 1000000000,
Total time: 6.293107 second,
Addition per second: 158904016.000000
thread id = 2138112,
Total addtions: 1000000000,
Total time: 6.295047 second,
Addition per second: 158855056.000000
thread id = 1601536,
Total addtions: 1000000000,
Total time: 6.306261 second,
Addition per second: 158572560.000000
I have a 4-core CPU, and my system monitor shows that each time I run n threads, n CPU cores are 100% utilized. Is it true that n threads (<= my CPU cores) are supposed to run n times as fast as one thread? Why is that not the case here?
clock() measures CPU time, not "wall" time. It also measures the total time of all threads.
CPU time is time when the processor was executing your code; wall time is real-world elapsed time (like a clock on the wall would show).
Time your program using /usr/bin/time to see what's really happening, or use a wall-time function like time(), gettimeofday() or clock_gettime().
clock_gettime() can measure CPU time for this thread, for this process, or wall time. It's probably the best way to do this type of experiment; a sketch follows.
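A minimal sketch of that experiment with clock_gettime(), contrasting wall time (CLOCK_MONOTONIC) against process CPU time (CLOCK_PROCESS_CPUTIME_ID, summed over all threads); on older glibc this needs linking with -lrt, and the busy loop is just a stand-in:
#include <stdio.h>
#include <time.h>

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec w1, w2, c1, c2;
    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

    volatile unsigned long sum = 0;   /* volatile: keep the loop from being optimized out */
    for (unsigned long i = 0; i < 100000000UL; i++)
        sum++;

    clock_gettime(CLOCK_MONOTONIC, &w2);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c2);

    printf("wall time: %f s\n", elapsed(w1, w2));
    printf("CPU time:  %f s\n", elapsed(c1, c2));
    return 0;
}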
While you have your answer regarding why the multi-threaded performance seemed worse than single-threaded, there are several things you can do to clean up the logic of your program and make it work as it appears you intended.
First, if you were keeping track of the relative wall time that passed and the time reported by your diff of the clock() times, you would have noticed the time reported was approximately an (n-processor-core) multiple of the actual wall time. That was explained in the other answer.
For relative per-core performance timing, the use of clock() is fine. You are getting only an approximation of wall time, but for looking at relative additions per second, that provides a clean per-core look at performance.
While you have correctly used a divisor of 1000000 for diff, time.h provides a convenient define for you. POSIX requires that CLOCKS_PER_SEC equal 1000000 independent of the actual resolution. That constant is provided in time.h.
Next, you should also notice that your per-core output wasn't reported until all threads were joined, making reporting totals in run_add somewhat pointless. You can output thread_id, etc. from the individual threads for convenience, but the timing information should be computed back in the calling function after all threads have been joined. That will clean up the logic of run_add significantly. Further, if you want to be able to vary the number of iterations, you should consider passing that value through ptr, e.g.:
/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add (void *ptr)
{
    int i = 0, j = 0, k = 0, iteration = *(int *)ptr;
    unsigned long sum = 0;

    for (i = 0; i < iteration; i++)
        for (j = 0; j < iteration; j++)
            for (k = 0; k < iteration; k++)
                sum++;

    printf (" thread id = %lu\n", (long unsigned) (pthread_self ()));
    printf (" iterations = %lu\n\n", sum);

    return NULL;
}
run_test is relatively unchanged; the bulk of the calculation changes are moved from run_add to main and scaled to account for the number of cores utilized. The following is a rewrite of main allowing the user to specify the number of cores to use as the first argument (all cores by default) and the base for your cubed number of iterations as the second argument (1000 by default):
int main (int argc, char **argv) {

    int nproc = sysconf (_SC_NPROCESSORS_ONLN),  /* number of cores available */
        num_thread = argc > 1 ? atoi (argv[1]) : nproc,
        iter = argc > 2 ? atoi (argv[2]) : 1000;
    unsigned long subtotal = iter * iter * iter,
        total = subtotal * num_thread;
    double diff = 0.0, t1 = 0.0, t2 = 0.0;

    if (num_thread > nproc) num_thread = nproc;

    printf ("\nrunning sum with %d threads.\n\n", num_thread);

    t1 = clock ();
    run_test (num_thread, &iter);
    t2 = clock ();

    diff = (double)((t2 - t1) / CLOCKS_PER_SEC / num_thread);

    printf ("----------------\nTotal time: %lf second\n", diff);
    printf ("Total addtions: %lu\n", total);
    printf ("Additions per-second: %lf\n\n", total / diff);

    return 0;
}
Putting all the pieces together, you could write a working example as follows. Make sure you disable optimizations to prevent your compiler from optimizing out your loops for sum, etc...
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

/*
 * Function: run_add
 * -----------------------
 * Do addition operation for iteration ^ 3 times
 *
 * returns: void
 */
void *run_add (void *ptr)
{
    int i = 0, j = 0, k = 0, iteration = *(int *)ptr;
    unsigned long sum = 0;

    for (i = 0; i < iteration; i++)
        for (j = 0; j < iteration; j++)
            for (k = 0; k < iteration; k++)
                sum++;

    printf (" thread id = %lu\n", (long unsigned) (pthread_self ()));
    printf (" iterations = %lu\n\n", sum);

    return NULL;
}

void run_test (int num_thread, int *it)
{
    pthread_t pth_arr[num_thread];
    int i = 0;

    for (i = 0; i < num_thread; i++)
        pthread_create (&pth_arr[i], NULL, run_add, it);

    for (i = 0; i < num_thread; i++)
        pthread_join (pth_arr[i], NULL);
}

int main (int argc, char **argv) {

    int nproc = sysconf (_SC_NPROCESSORS_ONLN),
        num_thread = argc > 1 ? atoi (argv[1]) : nproc,
        iter = argc > 2 ? atoi (argv[2]) : 1000;
    unsigned long subtotal = iter * iter * iter,
        total = subtotal * num_thread;
    double diff = 0.0, t1 = 0.0, t2 = 0.0;

    if (num_thread > nproc) num_thread = nproc;

    printf ("\nrunning sum with %d threads.\n\n", num_thread);

    t1 = clock ();
    run_test (num_thread, &iter);
    t2 = clock ();

    diff = (double)((t2 - t1) / CLOCKS_PER_SEC / num_thread);

    printf ("----------------\nTotal time: %lf second\n", diff);
    printf ("Total addtions: %lu\n", total);
    printf ("Additions per-second: %lf\n\n", total / diff);

    return 0;
}
Example Use/Output
Now you can measure the relative number of additions per-second performed based on the number of cores utilized -- and have it return a Total time that is roughly what wall-time would be. For example, measuring the additions per-second using a single core results in:
$ ./bin/pthread_one_per_core 1
running sum with 1 threads.
thread id = 140380000397056
iterations = 1000000000
----------------
Total time: 2.149662 second
Total addtions: 1000000000
Additions per-second: 465189411.172547
Approximately 465M additions per-sec. Using two cores should double that rate:
$ ./bin/pthread_one_per_core 2
running sum with 2 threads.
thread id = 140437156796160
iterations = 1000000000
thread id = 140437165188864
iterations = 1000000000
----------------
Total time: 2.152436 second
Total addtions: 2000000000
Additions per-second: 929179560.000957
Exactly twice the additions per-sec at 929M/s. Using 4 cores:
$ ./bin/pthread_one_per_core 4
running sum with 4 threads.
thread id = 139867841853184
iterations = 1000000000
thread id = 139867858638592
iterations = 1000000000
thread id = 139867867031296
iterations = 1000000000
thread id = 139867850245888
iterations = 1000000000
----------------
Total time: 2.202021 second
Total addtions: 4000000000
Additions per-second: 1816513309.422720
Doubled again to 1.81G/s, and using 8 cores gives the expected results:
$ ./bin/pthread_one_per_core
running sum with 8 threads.
thread id = 140617712838400
iterations = 1000000000
thread id = 140617654089472
iterations = 1000000000
thread id = 140617687660288
iterations = 1000000000
thread id = 140617704445696
iterations = 1000000000
thread id = 140617662482176
iterations = 1000000000
thread id = 140617696052992
iterations = 1000000000
thread id = 140617670874880
iterations = 1000000000
thread id = 140617679267584
iterations = 1000000000
----------------
Total time: 2.250243 second
Total addtions: 8000000000
Additions per-second: 3555171004.558562
3.55G/s. Look over both answers (currently posted) and let us know if you have any questions.
note: there are a number of additional clean-ups and validations that could be applied, but for the purposes of your example, updating the types to suitably-sized unsigned types prevents strange results with thread_id and the addition counts.
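As one example of the kind of validation meant here, a sketch against the run_test() shown earlier: check pthread_create's return value and join only the threads that actually started.
void run_test (int num_thread, int *it)
{
    pthread_t pth_arr[num_thread];
    int i = 0;

    for (i = 0; i < num_thread; i++)
        if (pthread_create (&pth_arr[i], NULL, run_add, it)) {
            fprintf (stderr, "error: pthread_create failed for thread %d\n", i);
            num_thread = i;   /* join only the threads actually started */
            break;
        }

    for (i = 0; i < num_thread; i++)
        pthread_join (pth_arr[i], NULL);
}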
