C - pthreads appear to only be utilizing one core

Let me first of all say that this is for school, but I don't really need help; I'm just confused by some results I'm getting.
I have a simple program that approximates pi using Simpson's rule. In one assignment we had to do this by spawning 4 child processes, and in this assignment we have to use 4 kernel-level threads. I've done this, but when I time the programs, the one using child processes seems to run faster (I get the impression I should be seeing the opposite result).
Here is the program using pthreads:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
struct func_range {
double start;
double end;
};
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
void *partial_sum(void *r)
{
double *ret = (double *)malloc(sizeof(double));
struct func_range *range = r;
#if SPEED_TEST
int k;
double begin = range->start;
for (k = 0; k < 25000; k++)
{
range->start = begin;
*ret = 0;
#endif
for (; range->start <= range->end; ++range->start)
*ret += TERN_STMT(range->start);
#if SPEED_TEST
}
#endif
return ret;
}
int main()
{
// An array for our threads.
pthread_t threads[4];
double total_sum = func(0);
void *temp;
struct func_range our_range;
int i;
for (i = 0; i < 4; i++)
{
our_range.start = (i == 0) ? 1 : (i == 1) ? 8000 : (i == 2) ? 16000 : 24000;
our_range.end = (i == 0) ? 7999 : (i == 1) ? 15999 : (i == 2) ? 23999 : 31999;
pthread_create(&threads[i], NULL, &partial_sum, &our_range);
pthread_join(threads[i], &temp);
total_sum += *(double *)temp;
free(temp);
}
total_sum += func(1);
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is the version using child processes:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
int main()
{
// An array for our subprocesses.
pid_t pids[4];
// The pipe to pass-through information
int mypipe[2];
// Counter for subprocess loops
double j;
// Counter for outer loop
int i;
// Number of PIDs
int n = 4;
// The final sum
double total_sum = 0;
// Temporary variable holding the result from a subprocess
double temp;
// The partial sum tallied by a subprocess.
double sum = 0;
int k;
if (pipe(mypipe))
{
perror("pipe");
return EXIT_FAILURE;
}
// Create the PIDs
for (i = 0; i < 4; i++)
{
// Abort if something went wrong
if ((pids[i] = fork()) < 0)
{
perror("fork");
abort();
}
else if (pids[i] == 0)
// Depending on what PID number we are we'll only calculate
// 1/4 the total.
#if SPEED_TEST
for (k = 0; k < 25000; ++k)
{
sum = 0;
#endif
switch (i)
{
case 0:
sum += func(0);
for (j = 1; j <= 7999; ++j)
sum += TERN_STMT(j);
break;
case 1:
for (j = 8000; j <= 15999; ++j)
sum += TERN_STMT(j);
break;
case 2:
for (j = 16000; j <= 23999; ++j)
sum += TERN_STMT(j);
break;
case 3:
for (j = 24000; j < 32000; ++j)
sum += TERN_STMT(j);
sum += func(1);
break;
}
#if SPEED_TEST
}
#endif
// Write the data to the pipe
write(mypipe[1], &sum, sizeof(sum));
exit(0);
}
}
int status;
pid_t pid;
while (n > 0)
{
// Wait for the calculations to finish
pid = wait(&status);
// Read from the pipe
read(mypipe[0], &temp, sizeof(total_sum));
// Add to the total
total_sum += temp;
n--;
}
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is a time result from the pthreads version running 100,000 times:
real 11.15
user 11.15
sys 0.00
And here is the child process version:
real 5.99
user 23.81
sys 0.00
Having a user time of 23.81 implies that that is the sum of the time each core spent executing the code. In the pthreads run the real and user times are the same, implying that only one core is being used. Why isn't it using all 4 cores? I thought threads would, if anything, do this better than child processes.
Hopefully this question makes sense; this is my first time programming with pthreads, and I'm pretty new to OS-level programming in general.
Thanks for taking the time to read this lengthy question.

When you call pthread_join immediately after pthread_create, you effectively serialize all the threads: each one must finish before the next is even created. Don't join the threads until after you've created all of them and done any other work that doesn't need the results of the threaded computations.
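For reference, a minimal sketch of the restructured main loop. Note that once the joins are deferred, each thread needs its own struct func_range; the ranges[4] array here is an assumption of this sketch, since reusing the single our_range for all four threads would itself be a race:

for (i = 0; i < 4; i++)
{
    // ranges[i] holds this thread's start/end, filled in as before.
    pthread_create(&threads[i], NULL, &partial_sum, &ranges[i]);
}
for (i = 0; i < 4; i++)
{
    pthread_join(threads[i], &temp);
    total_sum += *(double *)temp;
    free(temp);
}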

Related

How to compute the sum of n/m Gregory-Leibniz terms in C

Get the two values m and n from the command-line arguments and convert them into integers. Then create m threads, where each thread computes the sum of n/m terms in the Gregory-Leibniz series.
pi = 4 * (1 - 1/3 + 1/5 - 1/7 + 1/9 - ...)
When a thread finishes its computation, it should print its partial sum and atomically add it to a shared global variable.
And how do I check that all of the m computational threads have done the atomic additions?
Here is the source code of what I tried:
#include<stdio.h>
#include<pthread.h>
#include <stdlib.h>
#include<math.h>
pthread_barrier_t barrier;
int count;
long int term;
// int* int_arr;
double total;
void *thread_function(void *vargp)
{
int thread_rank = *(int *)vargp;
// printf("waiting for barrier... \n");
pthread_barrier_wait(&barrier);
// printf("we passed the barrier... \n");
double sum = 0.0;
int n = count * term;
int start = n - term;
// printf("start %d & end %d \n\n", start, n);
for(int i = start; i < n; i++)
{
sum += pow(-1, i) / (2*i+1);
// v += 1 / i - 1 / (i + 2);
}
total += sum;
// int_arr[count] = sum;
count++;
printf("thr %d : %lf \n", thread_rank, sum);
return NULL;
}
int main(int argc,char *argv[])
{
if (argc <= 2) {
printf("missing arguments. please pass two num. in arguments\n");
exit(-1);
}
int m = atoi(argv[1]); // get value of first argument
int n = atoi(argv[2]); // get value of second argument
// int_arr = (int*) calloc(m, sizeof(int));
count = 1;
term = n / m;
pthread_t thread_id[m];
int i, ret;
double pi;
/* Initialize the barrier. */
pthread_barrier_init(&barrier, NULL, m);
for(i = 0; i < m; i++)
{
ret = pthread_create(&thread_id[i], NULL , &thread_function, (void *)&i);
if (ret) {
printf("unable to create thread! \n");
exit(-1);
}
}
for(i = 0; i < m; i++)
{
if(pthread_join(thread_id[i], NULL) != 0) {
perror("Failed to join thread");
}
}
pi = 4 * total;
printf("%lf ", pi);
pthread_barrier_destroy(&barrier);
return 0;
}
What I need:
Create m threads, where each thread computes the sum of n/m terms in the Gregory-Leibniz series.
The first thread computes the sum of terms 1 to n/m, the second thread computes the sum of the terms from (n/m + 1) to 2n/m, and so on.
When all the threads finish their computations, print each partial sum and the value of pi.
I tried a lot, but I can't achieve exactly what I want; I get a wrong output value of pi.
For example, with m = 16 and n = 1024, it sometimes returns 3.125969, sometimes 12.503874, 15.629843, or 6.251937 as the value of pi.
Please help me.
Edited Source Code :
#include <inttypes.h>
#include <math.h>
#include <pthread.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
struct args {
uint64_t thread_id;
struct {
uint64_t start;
uint64_t end;
} range;
};
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;
long double total = 0;
uint64_t total_iterations = 0;
void *partial_sum(void *arg)
{
struct args *args = arg;
long double sum = 0;
printf("waiting for barrier in thread -> %" PRId64 "\n", args->thread_id);
pthread_barrier_wait(&barrier);
// printf("we passed the barrier... \n");
for (uint64_t n = args->range.start; n < args->range.end; n++)
sum += pow(-1.0, n) / (1 + n * 2);
if (pthread_mutex_lock(&mutex)) {
perror("pthread_mutex_lock");
exit(EXIT_FAILURE);
}
total += sum;
total_iterations += args->range.end - args->range.start;
if (pthread_mutex_unlock(&mutex)) {
perror("pthread_mutex_unlock");
exit(EXIT_FAILURE);
}
printf("thr %" PRId64 " : %.20Lf\n", args->thread_id, sum);
return NULL;
}
int main(int argc,char *argv[])
{
if (argc <= 2) {
fprintf(stderr, "usage: %s THREADS TERMS.\tPlease pass two num. in arguments\n", *argv);
return EXIT_FAILURE;
}
int m = atoi(argv[1]); // get value of first argument & converted into int
int n = atoi(argv[2]); // get value of second argument & converted into int
if (!m || !n) {
fprintf(stderr, "Argument is zero.\n");
return EXIT_FAILURE;
}
uint64_t threads = m;
uint64_t terms = n;
uint64_t range = terms / threads;
uint64_t excess = terms - range * threads;
pthread_t thread_id[threads];
struct args arguments[threads];
int ret;
/* Initialize the barrier. */
ret = pthread_barrier_init(&barrier, NULL, m);
if (ret) {
perror("pthread_barrier_init");
return EXIT_FAILURE;
}
for (uint64_t i = 0; i < threads; i++) {
arguments[i].thread_id = i;
arguments[i].range.start = i * range;
arguments[i].range.end = arguments[i].range.start + range;
if (threads - 1 == i)
arguments[i].range.end += excess;
printf("In main: creating thread %ld\n", i);
ret = pthread_create(thread_id + i, NULL, partial_sum, arguments + i);
if (ret) {
perror("pthread_create");
return EXIT_FAILURE;
}
}
for (uint64_t i = 0; i < threads; i++)
if (pthread_join(thread_id[i], NULL))
perror("pthread_join");
pthread_barrier_destroy(&barrier);
printf("Pi value is : %.10Lf\n", 4 * total);
printf("COMPLETE? (%s)\n", total_iterations == terms ? "YES" : "NO");
return 0;
}
In each thread, the count variable is expected to have a steadily increasing value in this expression,
int n = count * term;
being one larger than it was in the "previous" thread, but count is only incremented later on in each thread.
Even if you were to increment count immediately, there is nothing that guards against two or more threads attempting to read from and write to the variable at the same time.
The same issue exists for total.
The unpredictability of these reads and writes will lead to indeterminate results.
When sharing resources between threads, you must take care to avoid these race conditions. The POSIX threads library does not contain any atomics for fundamental integral operations.
You should protect your critical data against a read/write race condition by using a lock to restrict access to a single thread at a time.
The POSIX threads library includes a pthread_mutex_t type for this purpose. See:
pthread_mutex_init / pthread_mutex_destroy
pthread_mutex_lock / pthread_mutex_unlock
Additionally, as pointed out by @Craig Estey, using (void *) &i as the argument to the thread functions introduces a race condition where the value of i may change before any given thread executes *(int *) vargp.
The suggestion is to pass the value of i itself, smuggled through the pointer argument; for that you should use the appropriate type, intptr_t or uintptr_t, which are well defined for this purpose:
pthread_create(&thread_id[i], NULL, thread_function, (void *)(intptr_t) i);
int thread_rank = (intptr_t) vargp;
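As a minimal, self-contained sketch of that by-value pattern (the function name worker and the thread count of 4 are made up for illustration):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static void *worker(void *vargp)
{
    // Recover the index that was smuggled through the pointer argument.
    int thread_rank = (intptr_t) vargp;
    printf("hello from thread %d\n", thread_rank);
    return NULL;
}

int main(void)
{
    pthread_t tid[4];
    for (intptr_t i = 0; i < 4; i++)
        pthread_create(&tid[i], NULL, worker, (void *) i);
    for (int i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
}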
How to check that all of the m computational threads have done the atomic additions?
Sum up the number of terms processed by each thread, and ensure it is equal to the expected number of terms. This can also naturally be assumed to be the case if all possible errors are accounted for (ensuring all threads run to completion and assuming the algorithm used is correct).
A moderately complete example program:
#define _POSIX_C_SOURCE 200809L
#include <inttypes.h>
#include <math.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
struct args {
uint64_t thread_id;
struct {
uint64_t start;
uint64_t end;
} range;
};
pthread_mutex_t mutex;
long double total = 0;
uint64_t total_iterations = 0;
void *partial_sum(void *arg)
{
struct args *args = arg;
long double sum = 0;
for (uint64_t n = args->range.start; n < args->range.end; n++)
sum += pow(-1.0, n) / (1 + n * 2);
if (pthread_mutex_lock(&mutex)) {
perror("pthread_mutex_lock");
exit(EXIT_FAILURE);
}
total += sum;
total_iterations += args->range.end - args->range.start;
if (pthread_mutex_unlock(&mutex)) {
perror("pthread_mutex_unlock");
exit(EXIT_FAILURE);
}
printf("thread(%" PRId64 ") Partial sum: %.20Lf\n", args->thread_id, sum);
return NULL;
}
int main(int argc,char **argv)
{
if (argc < 3) {
fprintf(stderr, "usage: %s THREADS TERMS\n", *argv);
return EXIT_FAILURE;
}
uint64_t threads = strtoull(argv[1], NULL, 10);
uint64_t terms = strtoull(argv[2], NULL, 10);
if (!threads || !terms) {
fprintf(stderr, "Argument is zero.\n");
return EXIT_FAILURE;
}
uint64_t range = terms / threads;
uint64_t excess = terms - range * threads;
pthread_t thread_id[threads];
struct args arguments[threads];
if (pthread_mutex_init(&mutex, NULL)) {
perror("pthread_mutex_init");
return EXIT_FAILURE;
}
for (uint64_t i = 0; i < threads; i++) {
arguments[i].thread_id = i;
arguments[i].range.start = i * range;
arguments[i].range.end = arguments[i].range.start + range;
if (threads - 1 == i)
arguments[i].range.end += excess;
int ret = pthread_create(thread_id + i, NULL , partial_sum, arguments + i);
if (ret) {
perror("pthread_create");
return EXIT_FAILURE;
}
}
for (uint64_t i = 0; i < threads; i++)
if (pthread_join(thread_id[i], NULL))
perror("pthread_join");
pthread_mutex_destroy(&mutex);
printf("%.10Lf\n", 4 * total);
printf("COMPLETE? (%s)\n", total_iterations == terms ? "YES" : "NO");
}
Using 16 threads to process 10 billion terms:
$ ./a.out 16 10000000000
thread(14) Partial sum: 0.00000000000190476190
thread(10) Partial sum: 0.00000000000363636364
thread(2) Partial sum: 0.00000000006666666667
thread(1) Partial sum: 0.00000000020000000000
thread(8) Partial sum: 0.00000000000555555556
thread(15) Partial sum: 0.00000000000166666667
thread(0) Partial sum: 0.78539816299744868408
thread(3) Partial sum: 0.00000000003333333333
thread(13) Partial sum: 0.00000000000219780220
thread(11) Partial sum: 0.00000000000303030303
thread(4) Partial sum: 0.00000000002000000000
thread(5) Partial sum: 0.00000000001333333333
thread(7) Partial sum: 0.00000000000714285714
thread(6) Partial sum: 0.00000000000952380952
thread(12) Partial sum: 0.00000000000256410256
thread(9) Partial sum: 0.00000000000444444444
3.1415926535
COMPLETE? (YES)
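As an aside: while the POSIX threads library itself provides no atomics, a compiler with C11 <stdatomic.h> support could maintain the integral iteration counter without the mutex. A hedged sketch of just that counter (the helper name add_iterations is illustrative; the long double total would still need the mutex, since there is no portable atomic addition for it):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t total_iterations;

static void add_iterations(uint64_t n)
{
    // Atomically add; safe to call from any thread without locking.
    atomic_fetch_add(&total_iterations, n);
}

int main(void)
{
    add_iterations(64);
    printf("%llu\n", (unsigned long long) atomic_load(&total_iterations));
    return 0;
}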

Printing in order with C threads

Given this code:
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
void *findPrimes(void *arg)
{
int val = *(int *)arg;
for (int i = val * 1000; i < val * 1000 + 1000; i++)
{
int isPrime = 1;
for (int j = 2; j < i; j++)
{
if (i % j == 0)
{
isPrime = 0;
break;
}
}
if (isPrime)
{
printf("%d\n", i);
}
}
pthread_exit(NULL);
}
int main()
{
pthread_t p[3];
int val[3] = {0, 1, 2};
for (int i = 0; i < 3; i++)
{
pthread_create(&p[i], NULL, findPrimes, &val[i]);
}
for (int i = 0; i < 3; i++)
{
pthread_join(p[i], NULL);
}
return 0;
}
which prints, across 3 threads, all the prime numbers between 0 and 3000.
I want to print them in order; how can I do it?
My professor suggests using an array of semaphores.
In order to synchronize the actions of all the threads I suggest using a pthread_mutex_t and a pthread_cond_t (a condition variable). You also need a way to share data between threads, so I'd create a struct for that:
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
typedef struct {
unsigned whos_turn;
pthread_mutex_t mtx;
pthread_cond_t cv;
} shared_data;
whos_turn will here be used to tell the threads whose turn it is to print the primes found.
Each thread also needs some thread-unique information. You called it val so I'll call it val here too. We can compare val with whos_turn to decide which thread it is that should print its result. In order to pass both the shared data and val to a thread, you can package that in a struct too:
typedef struct {
unsigned val;
shared_data *sd; // will point to the one and only instance of `shared_data`
} work_order;
Now, findPrimes needs somewhere to store the primes it calculates before it's time to print them. Since the range to search is hardcoded, I'd just add an array for that:
#define SEARCH_RANGE (1000ULL)
void *findPrimes(void *arg) {
work_order *wo = arg;
uintmax_t primes[SEARCH_RANGE]; // to store the found primes
int found_count = 0;
for (uintmax_t i = wo->val*SEARCH_RANGE+1; i <= (wo->val+1)*SEARCH_RANGE; i += 2) {
bool isPrime = true;
for (uintmax_t j = 3; j < i; j += 2) {
if (i % j == 0) { // note: both i and j are odd
isPrime = false;
break;
}
}
if (isPrime) {
primes[found_count++] = i;
}
}
if(wo->val == 0) { // special case for the first range
primes[0] = 2; // 1 is not a prime, but 2 is.
}
// ... to be continued below ...
So far, nothing spectacular. The thread has now found all primes in its range and has come to the synchronizing part. The thread must
lock the mutex
wait for its turn (called "the predicate")
let other threads do the same
Here's one common pattern:
// ... continued from above ...
// synchronize
pthread_mutex_lock(&wo->sd->mtx); // lock the mutex
// only 1 thread at a time reaches here
// check the predicate: that it's this thread's turn to print
while(wo->val != wo->sd->whos_turn) { // <- the predicate
// if control enters here, it was not this thread's turn
// cond_wait internally "unlocks" the mutex to let other threads
// reach here and wait for the condition variable to get signalled
pthread_cond_wait(&wo->sd->cv, &wo->sd->mtx);
// and here the lock is only held by one thread at a time again
}
// only the thread whose turn it is reaches here
Now, the thread has reached the point where it is its turn to print. It holds the mutex lock, so no other threads can reach this point at the same time.
// print the collected primes
for(int i = 0; i < found_count; ++i)
printf("%ju\n", primes[i]);
And hand over to the next thread in line to print the primes it has found:
// step the "whos_turn" indicator
wo->sd->whos_turn++;
pthread_mutex_unlock(&wo->sd->mtx); // release the mutex
pthread_cond_broadcast(&wo->sd->cv); // signal all threads to check the predicate
pthread_exit(NULL);
}
And it can be tied together quite neatly in main:
#define Size(x) (sizeof (x) / sizeof *(x))
int main() {
shared_data sd = {.whos_turn = 0,
.mtx = PTHREAD_MUTEX_INITIALIZER,
.cv = PTHREAD_COND_INITIALIZER};
pthread_t p[3];
work_order wos[Size(p)];
for (unsigned i = 0; i < Size(p); i++) {
wos[i].val = i; // the thread-unique information
wos[i].sd = &sd; // all threads will point at the same `shared_data`
pthread_create(&p[i], NULL, findPrimes, &wos[i]);
}
for (unsigned i = 0; i < Size(p); i++) {
pthread_join(p[i], NULL);
}
}
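For completeness, since the professor hinted at an array of semaphores: the same ordering can be achieved by giving each thread its own semaphore, where thread i waits on its semaphore before printing and posts the next thread's semaphore when finished. A hedged sketch of just that synchronization (the prime collection itself is elided; it would work as in the version above):

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 3
sem_t sems[NTHREADS]; // sems[i] gates thread i's turn to print

void *print_in_turn(void *arg)
{
    int val = *(int *)arg;
    // ... find and collect this range's primes into a local array here ...
    sem_wait(&sems[val]);         // block until it is our turn
    printf("thread %d prints its primes here\n", val);
    if (val + 1 < NTHREADS)
        sem_post(&sems[val + 1]); // hand the turn to the next thread
    return NULL;
}

int main(void)
{
    pthread_t p[NTHREADS];
    int val[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        sem_init(&sems[i], 0, i == 0 ? 1 : 0); // only thread 0 may print at first
    for (int i = 0; i < NTHREADS; i++) {
        val[i] = i;
        pthread_create(&p[i], NULL, print_in_turn, &val[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(p[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        sem_destroy(&sems[i]);
    return 0;
}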

How do I parallelize the following code in C so it iterates through all rows and columns? (linear regression program)

I'm doing a parallelization assignment in C on a linear regression calculation program, but I am only supposed to parallelize the part that calculates all the additions, right before the linearity calculation.
Original code. Arguments: number of elements
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#define N 50000
int nn;
int *X[N+1],*apX, *Y;
long long *sumaX, *sumaX2, sumaY, *sumaXY;
double *A, *B;
int main(int np, char*p[])
{
int i,j;
double sA,sB;
clock_t ta,t;
assert(np==2);
nn = atoi(p[1]);
assert(nn<=N);
srand(1);
printf("Dimensio dades =~ %g Mbytes\n",((double)(nn*(nn+11))*4)/(1024*1024));
apX = calloc(nn*nn,sizeof(int)); assert (apX);
Y = calloc(nn,sizeof(int)); assert (Y);
sumaX = calloc(nn,sizeof(long long)); assert (sumaX);
sumaX2 = calloc(nn,sizeof(long long)); assert (sumaX2);
sumaXY = calloc(nn,sizeof(long long)); assert (sumaXY);
A = calloc(nn,sizeof(double)); assert (A);
B = calloc(nn,sizeof(double)); assert (B);
// Initialization
X[0] = apX;
/*for (i=0;i<nn;i++) {
for (j=0;j<nn;j+=8)
X[i][j]=rand()%100+1;
Y[i]=rand()%100 - 49;
X[i+1] = X[i] + nn;
}*/
for (i=0;i<nn;i++) {
for (j=0;j<nn;j+=8)
X[i][j]=90;
Y[i]=40;
X[i+1] = X[i] + nn;
}
// add (parallelization part)
sumaY = 0;
for (i=0;i<nn;i++) {
sumaX[i] = sumaX2[i] = sumaXY[i] = 0;
for (j=0;j<nn;j++) {
sumaX[i] += X[i][j];
sumaX2[i] += X[i][j] * X[i][j];
sumaXY[i] += X[i][j] * Y[j];
}
sumaY += Y[i];
}
// linearity calculation
for (i=0;i<nn;i++) {
B[i] = sumaXY[i] - (sumaX[i] * sumaY)/nn;
B[i] = B[i] / (sumaX2[i] - (sumaX[i] * sumaX[i])/nn);
A[i] = (sumaY -B[i]*sumaX[i])/nn;
}
// check
sA = sB = 0;
for (i=0;i<nn;i++) {
//printf("%lg, %lg\n",A[i],B[i]);
sA += A[i];
sB += B[i];
}
printf("Suma elements de A: %lg B:%lg\n",sA,sB);
exit(0);
}
Parallelization. Arguments: number of elements and threads
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#include <pthread.h>
#define N 50000
#define MAX_THREADS 256
int nn, numThreads;
int *X[N+1],*apX, *Y;
long *sumaX, *sumaX2, sumaY, *sumaXY;
double *A, *B;
int range[MAX_THREADS];
pthread_mutex_t mutex= PTHREAD_MUTEX_INITIALIZER;
int ret;
void * parallel_code(void * id){
int index = (intptr_t) id;
int i, ini, row, col;
int rowAux = -5;
if(index == 0)
ini = 0;
else
ini = range[index-1];
for(i=ini; i<range[index]; i++){
row = i/nn;
col = i%nn;
sumaX[row] += X[row][col];
sumaX2[row] += X[row][col] * X[row][col];
sumaXY[row] += X[row][col] * Y[col];
pthread_mutex_lock(&mutex);
if(rowAux != row){
sumaY += Y[row];
rowAux = row;
pthread_mutex_unlock(&mutex);
}else{
pthread_mutex_unlock(&mutex);
}
}
pthread_exit(0);
}
int main(int np, char*p[])
{
int i,j,index;
double sA,sB;
clock_t ta,t;
pthread_t threads[MAX_THREADS];
assert(np==3);
nn = atoi(p[1]);
assert(nn<=N);
srand(1);
numThreads = atoi(p[2]);
assert(numThreads >= 2 && numThreads <= MAX_THREADS);
printf("Dimensio dades =~ %g Mbytes\n",((double)(nn*(nn+11))*4)/(1024*1024));
memset(range,0,numThreads*sizeof(int));
apX = calloc(nn*nn,sizeof(int)); assert (apX);
Y = calloc(nn,sizeof(int)); assert (Y);
sumaX = calloc(nn,sizeof(long long)); assert (sumaX);
sumaX2 = calloc(nn,sizeof(long long)); assert (sumaX2);
sumaXY = calloc(nn,sizeof(long long)); assert (sumaXY);
A = calloc(nn,sizeof(double)); assert (A);
B = calloc(nn,sizeof(double)); assert (B);
// Initialization
/*X[0] = apX;
for (i=0;i<nn;i++) {
for (j=0;j<nn;j+=8)
X[i][j]=rand()%100+1;
Y[i]=rand()%100 - 49;
X[i+1] = X[i] + nn;
}*/
X[0] = apX;
for (i=0;i<nn;i++) {
for (j=0;j<nn;j+=8)
X[i][j]=90;
Y[i]=40;
X[i+1] = X[i] + nn;
}
int portion = nn*nn/numThreads;
int mod = nn*nn % numThreads;
if(mod != 0.00){
mod = mod*numThreads;
for(i=0; i<mod; i++){
range[i] = range[i] + 1;
}
}
range[0] = range[0] + portion;
for(i=1; i<numThreads; i++){
range[i] += range[i-1] + portion;
}
sumaY = 0;
pthread_mutex_init(&mutex, NULL);
for (index = 0; index < numThreads; index++)
{
assert(!pthread_create(&threads[index], NULL, parallel_code, (void *) (intptr_t)index));
}
for(index = 0; index < numThreads; index++)
{
assert(!pthread_join(threads[index], NULL ));
}
pthread_mutex_destroy(&mutex);
for (i=0;i<nn;i++) {
B[i] = sumaXY[i] - (sumaX[i] * sumaY)/nn;
B[i] = B[i] / (sumaX2[i] - (sumaX[i] * sumaX[i])/nn);
A[i] = (sumaY -B[i]*sumaX[i])/nn;
}
// check
sA = 0;
sB = 0;
for (i=0;i<nn;i++) {
//printf("%f, %f\n",sA,sB);
sA += A[i];
sB += B[i];
}
printf("Suma elements de A: %lg B:%lg\n",sA,sB);
exit(0);
}
So far I've done some parallelization, as you can see in the code above: I calculated how much data every thread has to work with (that's what the variables portion and mod are for), created threads for every portion, and created a mutex to control access to sumaY. The thing is, the current program only works with small values (2000 elements, for example), and I don't really know why. It might be that each thread doesn't read all the necessary columns and/or rows: every time the program displays a wrong value, that value is lower than the correct one, which might indicate the program is missing some of the data. To be fair, I think I'm pretty close to the right solution, so I came here as a last resort. Also, take into account that splitting the task into multiple portions for every thread is a must for the assignment.
Thanks a lot in advance.
27/01/2021 EDIT (working code, fast and slow versions):
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#include <pthread.h>
#define N 50000
#define MAX_THREADS 256
int nn, numThreads;
int *X[N+1],*apX, *Y;
long long *sumaX, *sumaX2, sumaY, *sumaXY;
double *A, *B;
int range[MAX_THREADS];
pthread_mutex_t mutex= PTHREAD_MUTEX_INITIALIZER;
// Slow version
//int visitedRows[N];
void * parallel_code(void * args){
int index = (*(int*)args);
int i, j, ini;
// Slow version
// int row, col;
if(index == 0)
ini = 0;
else
ini = range[index-1];
for(i=ini; i<range[index]; i++){
// Fast version
for (j=0;j<nn;j++) {
sumaX[i] += X[i][j];
sumaX2[i] += X[i][j] * X[i][j];
sumaXY[i] += X[i][j] * Y[j];
}
pthread_mutex_lock(&mutex);
sumaY += Y[i];
pthread_mutex_unlock(&mutex);
// Slow version
/*row = i/nn;
col = i%nn;
sumaX[row] += X[row][col];
sumaX2[row] += X[row][col] * X[row][col];
sumaXY[row] += X[row][col] * Y[col];
pthread_mutex_lock(&mutex);
if(visitedRows[row] == 0){
visitedRows[row] = 1;
sumaY += Y[row];
pthread_mutex_unlock(&mutex);
}else{
pthread_mutex_unlock(&mutex);
}*/
}
pthread_exit(0);
}
int main(int np, char*p[])
{
int i,j,index;
double sA,sB;
unsigned int thread_args[MAX_THREADS];
pthread_t threads[MAX_THREADS];
assert(np==3);
nn = atoi(p[1]);
assert(nn<=N);
srand(1);
numThreads = atoi(p[2]);
assert(numThreads >= 2 && numThreads <= MAX_THREADS);
printf("Dimensio dades =~ %g Mbytes\n",((double)(nn*(nn+11))*4)/(1024*1024));
memset(range,0,numThreads*sizeof(int));
// Slow version
//memset(visitedRows,0,nn*sizeof(int));
apX = calloc(nn*nn,sizeof(int)); assert (apX);
Y = calloc(nn,sizeof(int)); assert (Y);
sumaX = calloc(nn,sizeof(long long)); assert (sumaX);
sumaX2 = calloc(nn,sizeof(long long)); assert (sumaX2);
sumaXY = calloc(nn,sizeof(long long)); assert (sumaXY);
A = calloc(nn,sizeof(double)); assert (A);
B = calloc(nn,sizeof(double)); assert (B);
// Initialization
X[0] = apX;
for (i=0;i<nn;i++) {
for (j=0;j<nn;j+=8)
X[i][j]=rand()%100+1;
Y[i]=rand()%100 - 49;
X[i+1] = X[i] + nn;
}
// Fast version
int portion = nn/numThreads;
int mod = nn % numThreads;
// Slow version
//int portion = nn*nn/numThreads;
//int mod = nn*nn % numThreads;
for(i=0; i<numThreads; i++){
range[i] = portion;
if (i != 0) range[i] += range[i-1];
if (i < mod) range[i]++;
}
sumaY = 0;
pthread_mutex_init(&mutex, NULL);
for (index = 0; index < numThreads; index++)
{
thread_args[index] = index;
assert(!pthread_create(&threads[index], NULL, parallel_code, &thread_args[index]));
}
for(index = 0; index < numThreads; index++)
{
assert(!pthread_join(threads[index], NULL ));
}
pthread_mutex_destroy(&mutex);
for (i=0;i<nn;i++) {
B[i] = sumaXY[i] - (sumaX[i] * sumaY)/nn;
B[i] = B[i] / (sumaX2[i] - (sumaX[i] * sumaX[i])/nn);
A[i] = (sumaY -B[i]*sumaX[i])/nn;
}
// check
sA = sB = 0;
for (i=0;i<nn;i++) {
//printf("%f, %f\n",sA,sB);
sA += A[i];
sB += B[i];
}
printf("Suma elements de A: %lg B:%lg\n",sA,sB);
exit(0);
}
Mutex initialization and destruction
Having initialized mutex via the static initializer, it is erroneous to initialize it again via pthread_mutex_init() (unless you first tear it down with pthread_mutex_destroy()).
Additionally, it is unnecessary, albeit not wrong, to tear down the mutex after joining all the threads.
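In other words, pick one initialization style and stick with it. A minimal sketch of the two styles (the names mutex_a, mutex_b, setup, and teardown are illustrative):

#include <pthread.h>

// Style 1: static initialization; no init/destroy calls are required.
pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;

// Style 2: dynamic initialization, paired with a destroy when done.
pthread_mutex_t mutex_b;

void setup(void)    { pthread_mutex_init(&mutex_b, NULL); }
void teardown(void) { pthread_mutex_destroy(&mutex_b); }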
Work distribution, part 1
This is wrong:
int mod = nn*nn % numThreads;
if(mod != 0.00){
mod = mod*numThreads;
for(i=0; i<mod; i++){
range[i] = range[i] + 1;
}
}
I think you are trying to distribute the excess of the nn * nn data among the threads, but there are only mod excess elements, not mod * numThreads of them. I think you meant:
int mod = nn*nn % numThreads;
// No need to pre-test whether mod is nonzero.
// mod is used as originally computed, not multiplied by numThreads.
for (i = 0; i < mod; i++) {
range[i] = range[i] + 1;
}
The original version not only sets the ranges incorrectly, but it also overruns the bounds of array range when nn*nn % numThreads is greater than 1.
But probably this is all moot. See below.
Work distribution, part 2
I suspect that the main problem is that these lines ...
sumaX[row] += X[row][col];
sumaX2[row] += X[row][col] * X[row][col];
sumaXY[row] += X[row][col] * Y[col];
... are executed by the thread function without locking the mutex. sumaX, sumaX2, and sumaXY point to data shared among the threads, and as the work has been split among them, it is entirely possible for more than one thread to contribute to the same elements. In that event, you have data races, and the resulting behavior is undefined.
Naively, you could solve that problem by moving those computations inside the critical section, after the pthread_mutex_lock(), since right now you do lock and unlock the mutex on every iteration of the loop. But there are several problems with that, especially:
you would squeeze out most of the already constrained opportunity for thread concurrency; and
mutex operations are comparatively expensive, and with otherwise only a handful of arithmetic operations per loop iteration, locking and unlocking the mutex as frequently as you are doing is likely to dominate the performance.
I would be surprised if the parallel version weren't slower than the serial version if you approached it that way.
What you should do instead is limit the number of threads actually used to at most the number of rows of data, and assign data to threads on a whole-row basis. No row should be split across two or more threads. That will eliminate the aforementioned data races without defeating the purpose of parallelizing.
I would also modify the thread function so that it locks the mutex only when it adds per-row results to the global sum. That will get you a lot more concurrency than you have now.
This will give you a less even division of data among threads, and perhaps fewer threads overall, but it doesn't help you anyway to have more threads than you have execution units to run them on. When the number of rows is large relative to the number of threads, the effect of the unevenness will not be very pronounced, whereas in small cases, the overall runtime isn't so much of an issue in the first place. More importantly, the computations should produce correct results, and the reduction in locking should substantially improve performance.
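A hedged sketch of the thread function restructured along those lines, assuming the globals from the program above and assuming range[] now holds row indices rather than element indices:

void *parallel_code(void *id)
{
    int index = (intptr_t) id;
    int ini = (index == 0) ? 0 : range[index - 1];
    long long localSumaY = 0;
    for (int i = ini; i < range[index]; i++) {  // whole rows only
        for (int j = 0; j < nn; j++) {
            sumaX[i]  += X[i][j];               // row i belongs to this thread alone: no lock needed
            sumaX2[i] += X[i][j] * X[i][j];
            sumaXY[i] += X[i][j] * Y[j];
        }
        localSumaY += Y[i];
    }
    pthread_mutex_lock(&mutex);                 // one locked update per thread
    sumaY += localSumaY;
    pthread_mutex_unlock(&mutex);
    pthread_exit(0);
}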

No real acceleration with pthreads

I am trying to implement the multithreaded version of the Monte-Carlo algorithm. Here is my code:
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include <semaphore.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#define MAX_THREADS 12
#define MAX_DOTS 10000000
double sum = 0.0;
sem_t sem;
void reset() {
sum = 0.0;
}
void* check_dot(void* _iterations) {
int* iterations = (int*)_iterations;
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) {
sem_wait(&sem);
sum += x * y;
sem_post(&sem);
}
}
return NULL;
}
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
if(!advanced) sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) {
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
if(clock_gettime(CLOCK_REALTIME, &end) == -1) {
perror("Unable to get time");
exit(-1);
}
if(!advanced) sem_destroy(&sem);
elapsed = end.tv_sec - begin.tv_sec;
elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
return elapsed;
}
int main(int argc, char** argv) {
bool advanced = false;
char* filename = NULL;
for(int i = 1; i < argc; ++i) {
if(strcmp(argv[i], "-o") == 0 && argc > i + 1) {
filename = argv[i + 1];
++i;
}
else if(strcmp(argv[i], "-a") == 0 || strcmp(argv[i], "--advanced") == 0) {
advanced = true;
}
}
if(!filename) {
fprintf(stderr, "You should provide the name of the output file.\n");
exit(-1);
}
FILE* fd = fopen(filename, "w");
if(!fd) {
perror("Unable to open file");
exit(-1);
}
srand(time(NULL));
double worst_time = run(1, advanced);
double result = (3.14 / MAX_DOTS) * sum;
reset();
fprintf(fd, "Result: %f\n", result);
for(int i = 2; i <= MAX_THREADS; ++i) {
double time = run(i, advanced);
double accel = time / worst_time;
fprintf(fd, "%d:%f\n", i, accel);
reset();
}
fclose(fd);
return 0;
}
However, I can't see any real acceleration when increasing the number of threads (and it does not matter which check_dot() function I am using). I have tried executing this code on my laptop with an Intel Core i7-3517U (lscpu says that it has 4 independent CPUs), and it looks like the number of threads does not really influence the execution time of my program:
Number of threads: 1, working time: 0.847277 s
Number of threads: 2, working time: 3.133838 s
Number of threads: 3, working time: 2.331216 s
Number of threads: 4, working time: 3.011819 s
Number of threads: 5, working time: 3.086003 s
Number of threads: 6, working time: 3.118296 s
Number of threads: 7, working time: 3.058180 s
Number of threads: 8, working time: 3.114867 s
Number of threads: 9, working time: 3.179515 s
Number of threads: 10, working time: 3.025266 s
Number of threads: 11, working time: 3.142141 s
Number of threads: 12, working time: 3.064318 s
I expected some kind of linear dependence between the execution time and the number of working threads, at least for the first four values (the more threads working, the shorter the execution time), but here I get pretty much equal time values. Is it a real problem in my code, or am I being too demanding?
The problem you are experiencing is that the internal state of rand() is a shared resource between all threads, so the threads are going to serialise on access to rand().
You need to use a pseudo-random number generator with per-thread state - the rand_r() function (although marked obsolete in the latest version of POSIX) can be used as such. For serious work you would be best off importing the implementation of some specific PRNG algorithm such as Mersenne Twister.
I was able to collect the timing / scaling measurements that you would desire with two changes to your code.
First, rand() is not thread-safe. Replacing the calls with calls to rand_r(&seed) in check_dots_advanced showed continual scaling as threads increased. I think rand() might have an internal lock that is serializing execution and preventing any speedup. This change alone shows some scaling, from 1.23 s -> 0.55 s (5 threads).
Second, I introduced barriers around the core execution region so that the cost of serially creating/joining threads and of the malloc calls is not included in the measurement. The core execution region shows good scaling, from 1.23 s -> 0.18 s (8 threads).
Code was compiled with gcc -O3 -pthread mcp.c -std=c11 -lm, run on Intel E3-1240 v5 (4 cores, HT), Linux 3.19.0-68-generic. Single measurements reported.
pthread_barrier_t bar;
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
sem_wait(&sem);
unsigned int seed = rand();
sem_post(&sem);
pthread_barrier_wait(&bar);
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand_r(&seed) % 314) / 100;
double y = (double)(rand_r(&seed) % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_barrier_wait(&bar);
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
pthread_barrier_init(&bar, NULL, threads_num + 1); // barrier init
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
pthread_barrier_wait(&bar); // wait until threads are ready
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) { // begin time
perror("Unable to get time");
exit(-1);
}
pthread_barrier_wait(&bar); // wait until threads finish
if(clock_gettime(CLOCK_REALTIME, &end) == -1) { // end time
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
pthread_barrier_destroy(&bar);

Why is my parallel code slower than serial?

Issue
Hello everyone. I have a program (from the net) that I intend to speed up by converting it into a parallel version using pthreads. Surprisingly, though, it runs slower than the serial version. Below is the program:
# include <stdio.h>
//fast square root algorithm
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//number generator iterated from 0 to n
int main()
{
int n = 1000000; //maximum number
int k = 0, j;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
First attempt for parallelization
I let pthreads manage the for loop:
# include <stdio.h>
.
.
int main()
{
.
.
//----->pthread code here<----
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
Well, it runs slower than the serial one.
Second attempt
I divided the for loop into two threads and ran them in parallel using pthreads.
However, it still runs slower. I was expecting it to run about twice as fast, but it doesn't!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct arg_struct
{
int initialPrime;
int nextPrime;
};
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
void *parallel_launcher(void *arguments)
{
struct arg_struct *args = (struct arg_struct *)arguments;
int j = args -> initialPrime;
int n = args -> nextPrime - 1;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1)
{
printf("This is prime: %d\n",j);
pthread_mutex_lock( &mutex1 );
k++;
pthread_mutex_unlock( &mutex1 );
}
if(j == n) printf("Count: %d\n",k);
}
pthread_exit(NULL);
}
int main()
{
int f = 100000000;
int m;
pthread_t thread_id[NTHREADS];
struct arg_struct args;
int rem = (f+1)%NTHREADS;
int n = floor((f+1)/NTHREADS);
for(int h = 0; h < NTHREADS; h++)
{
if(rem > 0)
{
m = n + 1;
rem-= 1;
}
else if(rem == 0)
{
m = n;
}
args.initialPrime = args.nextPrime;
args.nextPrime = args.initialPrime + m;
pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
pthread_join(thread_id[h], NULL);
}
// printf("Count: %d\n",k);
return 0;
}
Note:
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical cores, 4 logical),
Mem: 6 GB,
HDD: 340 GB
You need to split the range you are examining for primes up into n parts, where n is the number of threads.
The code that each thread runs becomes:
typedef struct start_end {
int start;
int end;
} start_end_t;
void *find_primes_in_range(void *in) {
start_end_t *start_end = (start_end_t *) in;
int num_primes = 0;
for (int j = start_end->start; j <= start_end->end; j++) {
if (isPrime(j) == 1)
num_primes++;
}
pthread_exit((void *)(intptr_t) num_primes);
}
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
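One common way to address that unevenness, sketched here against the same isPrime() as above (the struct and function names are illustrative), is to deal the candidates out in a strided, round-robin fashion so that every thread gets a mix of small and large numbers:

typedef struct stride_arg {
    int id;    // this thread's offset into the sequence
    int step;  // total number of threads
    int limit; // highest number to test
} stride_arg_t;

void *find_primes_strided(void *in) {
    stride_arg_t *a = (stride_arg_t *) in;
    int num_primes = 0;
    // Thread id tests id, id+step, id+2*step, ..., so cheap small numbers
    // and expensive large ones are spread across all the threads.
    for (int j = a->id; j <= a->limit; j += a->step)
        if (isPrime(j))
            num_primes++;
    pthread_exit((void *)(intptr_t) num_primes);
}

main() would start one of these per thread with step equal to the thread count and sum the counts recovered from pthread_join, exactly as before.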
The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise the threads will spend far more time waiting on and handling that mutex than they will on the actual calculation. You are essentially forcing the threads to execute serially.
Instead, sum everything up in a private counter variable, and once a thread is done with its work, return the counter and add the returned values together in main().
Also, you should not call printf() from inside the threads. If there is a context switch in the middle of printing, you'll end up with garbled output such as This is This is prime: 2, in which case you must synchronize the printf() calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work the thread is doing, so some sort of redesign of who does the printing might be a good idea, depending on what you want to do with the results.
Summary
Indeed, the use of pthreads sped up my code. My programming flaws were placing pthread_join right after the first pthread_create and the shared counter I had put in the arguments. After fixing this, I tested my parallel code by determining the primality of 100 million numbers, then compared its processing time with that of a serial version. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have enough reputation yet; instead, I am including a link)
I conducted three trials for each to account for the variations caused by other OS activity. We got a speedup from using parallel programming with pthreads. What is surprising is that the pthread code running in ONE thread was a bit faster than the purely serial code. I cannot explain this one; nevertheless, using pthreads is surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct start_end_f
{
int start;
int end;
};
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn = asmSqrt(n);
for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
int k = 0;
struct start_end_f *start_end_h = (struct start_end_f *)in;
for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
{
if(isPrime(j) == 1) k++;
}
int *t = new int;
*t = k;
pthread_exit(t);
}
int main()
{
int f = 100000000; //maximum number to be tested for prime
pthread_t thread_id[NTHREADS];
struct start_end_f start_end[NTHREADS];
int rem = (f+1)%NTHREADS;
int n = (f+1)/NTHREADS;
int rem_change = rem;
int m;
if(rem>0) m = n+1;
else if(rem == 0) m = n;
//distributes task 'evenly' to the number of parallel threads requested
for(int h = 0; h < NTHREADS; h++)
{
if(rem_change > 0)
{
start_end[h].start = m*h;
start_end[h].end = start_end[h].start+m-1;
rem_change -= 1;
}
else if(rem_change<= 0)
{
start_end[h].start = m*(h+rem_change)-rem_change*n;
start_end[h].end = start_end[h].start+n-1;
rem_change -= 1;
}
pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
}
//retrieving returned values
void *t;
int c = 0;
for(int h = 0; h < NTHREADS; h++)
{
pthread_join(thread_id[h], &t);
c += *((int *)t);
delete (int *)t; // free the counter allocated with new in the thread
}
printf("\nNumber of Primes: %d\n",c);
return 0;
}
