I am trying to implement a multithreaded version of a Monte Carlo algorithm. Here is my code:
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include <semaphore.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#define MAX_THREADS 12
#define MAX_DOTS 10000000
double sum = 0.0;
sem_t sem;
void reset() {
sum = 0.0;
}
void* check_dot(void* _iterations) {
int* iterations = (int*)_iterations;
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) {
sem_wait(&sem);
sum += x * y;
sem_post(&sem);
}
}
return NULL;
}
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
if(!advanced) sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) {
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
if(clock_gettime(CLOCK_REALTIME, &end) == -1) {
perror("Unable to get time");
exit(-1);
}
if(!advanced) sem_destroy(&sem);
elapsed = end.tv_sec - begin.tv_sec;
elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
return elapsed;
}
int main(int argc, char** argv) {
bool advanced = false;
char* filename = NULL;
for(int i = 1; i < argc; ++i) {
if(strcmp(argv[i], "-o") == 0 && argc > i + 1) {
filename = argv[i + 1];
++i;
}
else if(strcmp(argv[i], "-a") == 0 || strcmp(argv[i], "--advanced") == 0) {
advanced = true;
}
}
if(!filename) {
fprintf(stderr, "You should provide the name of the output file.\n");
exit(-1);
}
FILE* fd = fopen(filename, "w");
if(!fd) {
perror("Unable to open file");
exit(-1);
}
srand(time(NULL));
double worst_time = run(1, advanced);
double result = (3.14 / MAX_DOTS) * sum;
reset();
fprintf(fd, "Result: %f\n", result);
for(int i = 2; i <= MAX_THREADS; ++i) {
double time = run(i, advanced);
double accel = time / worst_time;
fprintf(fd, "%d:%f\n", i, accel);
reset();
}
fclose(fd);
return 0;
}
However, I can't see any real acceleration when increasing the number of threads (and it does not matter which check_dot() function I use). I have tried to execute this code on my laptop with an Intel Core i7-3517U (lscpu says it has 4 independent CPUs), and it looks like the number of threads does not really influence the execution time of my program:
Number of threads: 1, working time: 0.847277 s
Number of threads: 2, working time: 3.133838 s
Number of threads: 3, working time: 2.331216 s
Number of threads: 4, working time: 3.011819 s
Number of threads: 5, working time: 3.086003 s
Number of threads: 6, working time: 3.118296 s
Number of threads: 7, working time: 3.058180 s
Number of threads: 8, working time: 3.114867 s
Number of threads: 9, working time: 3.179515 s
Number of threads: 10, working time: 3.025266 s
Number of threads: 11, working time: 3.142141 s
Number of threads: 12, working time: 3.064318 s
I supposed there should be some kind of linear dependence between the execution time and the number of working threads, at least for the first four values (the more threads are working, the shorter the execution time), but here I get pretty much equal times. Is this a real problem in my code, or am I being too demanding?
The problem you are experiencing is that the internal state of rand() is a shared resource between all threads, so the threads are going to serialise on access to rand().
You need to use a pseudo-random number generator with per-thread state - the rand_r() function (although marked obsolete in the latest version of POSIX) can be used as such. For serious work you would be best off importing the implementation of some specific PRNG algorithm such as Mersenne Twister.
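For illustration, here is a minimal sketch of what that could look like in the question's dot-checking loop. The function name, the local accumulator, and seeding each thread from rand() under the existing semaphore are assumptions added for this example, not part of the original code:
/* Sketch only: each thread keeps its own PRNG state, so rand_r() never
 * touches the shared state hidden inside rand(). */
void* check_dot_private_rng(void* _iterations) {
    int* iterations = (int*)_iterations;
    unsigned int seed;
    sem_wait(&sem);                 /* rand() itself is not thread-safe, so seed under the lock */
    seed = (unsigned int)rand();
    sem_post(&sem);
    double local_sum = 0.0;         /* accumulate privately, touch the shared sum only once */
    for (int i = 0; i < *iterations; ++i) {
        double x = (double)(rand_r(&seed) % 314) / 100;
        double y = (double)(rand_r(&seed) % 100) / 100;
        if (y <= sin(x)) local_sum += x * y;
    }
    sem_wait(&sem);
    sum += local_sum;
    sem_post(&sem);
    return NULL;
}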
I was able to collect the timing / scaling measurements that you would desire with two changes to your code.
First, rand() is not thread-safe. Replacing the calls with calls to rand_r(&seed) in the advanced check_dots_advanced() showed continued scaling as the thread count increased. I think rand() might hold an internal lock that is serializing execution and preventing any speedup. This change alone shows some scaling, from 1.23 s -> 0.55 s (5 threads).
Second, I introduced barriers around the core execution region so that the cost of serially creating/joining threads and of the malloc calls is not included. The core execution region shows good scaling, from 1.23 s -> 0.18 s (8 threads).
Code was compiled with gcc -O3 -pthread mcp.c -std=c11 -lm and run on an Intel E3-1240 v5 (4 cores, HT), Linux 3.19.0-68-generic. Single measurements reported.
pthread_barrier_t bar;
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
sem_wait(&sem);
unsigned int seed = rand();
sem_post(&sem);
pthread_barrier_wait(&bar);
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand_r(&seed) % 314) / 100;
double y = (double)(rand_r(&seed) % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_barrier_wait(&bar);
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
pthread_barrier_init(&bar, NULL, threads_num + 1); // barrier init
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
pthread_barrier_wait(&bar); // wait until threads are ready
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) { // begin time
perror("Unable to get time");
exit(-1);
}
pthread_barrier_wait(&bar); // wait until threads finish
if(clock_gettime(CLOCK_REALTIME, &end) == -1) { // end time
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
pthread_barrier_destroy(&bar);
sem_destroy(&sem);
// remainder unchanged from the question's run()
elapsed = end.tv_sec - begin.tv_sec;
elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
return elapsed;
}
Related
Get the two values m and n from the command-line arguments and convert them into integers. After that, create m threads, and each thread computes the sum of n/m terms of the Gregory-Leibniz series.
pi = 4 * (1 - 1/3 + 1/5 - 1/7 + 1/9 - ...)
When a thread finishes its computation, it should print its partial sum and atomically add it to a shared global variable.
And how do I check that all of the m computational threads have done the atomic additions?
Here is the source code of what I tried:
#include<stdio.h>
#include<pthread.h>
#include <stdlib.h>
#include<math.h>
pthread_barrier_t barrier;
int count;
long int term;
// int* int_arr;
double total;
void *thread_function(void *vargp)
{
int thread_rank = *(int *)vargp;
// printf("waiting for barrier... \n");
pthread_barrier_wait(&barrier);
// printf("we passed the barrier... \n");
double sum = 0.0;
int n = count * term;
int start = n - term;
// printf("start %d & end %d \n\n", start, n);
for(int i = start; i < n; i++)
{
sum += pow(-1, i) / (2*i+1);
// v += 1 / i - 1 / (i + 2);
}
total += sum;
// int_arr[count] = sum;
count++;
printf("thr %d : %lf \n", thread_rank, sum);
return NULL;
}
int main(int argc,char *argv[])
{
if (argc <= 2) {
printf("missing arguments. please pass two num. in arguments\n");
exit(-1);
}
int m = atoi(argv[1]); // get value of first argument
int n = atoi(argv[2]); // get value of second argument
// int_arr = (int*) calloc(m, sizeof(int));
count = 1;
term = n / m;
pthread_t thread_id[m];
int i, ret;
double pi;
/* Initialize the barrier. */
pthread_barrier_init(&barrier, NULL, m);
for(i = 0; i < m; i++)
{
ret = pthread_create(&thread_id[i], NULL , &thread_function, (void *)&i);
if (ret) {
printf("unable to create thread! \n");
exit(-1);
}
}
for(i = 0; i < m; i++)
{
if(pthread_join(thread_id[i], NULL) != 0) {
perror("Failed to join thread");
}
}
pi = 4 * total;
printf("%lf ", pi);
pthread_barrier_destroy(&barrier);
return 0;
}
What I need:
Create m threads, and each thread computes the sum of n/m terms of the Gregory-Leibniz series.
The first thread computes the sum of terms 1 to n/m, the second thread computes the sum of terms (n/m + 1) to 2n/m, and so on.
When all the threads finish their computation, print each partial sum and the value of pi.
I have tried a lot, but I can't achieve exactly what I want: I get a wrong output value of pi.
For example, with m = 16 and n = 1024 it sometimes returns 3.125969, sometimes 12.503874, sometimes 15.629843, and sometimes 6.251937 as the value of pi.
Please help me.
Edited source code:
#include <inttypes.h>
#include <math.h>
#include <pthread.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
struct args {
uint64_t thread_id;
struct {
uint64_t start;
uint64_t end;
} range;
};
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;
long double total = 0;
uint64_t total_iterations = 0;
void *partial_sum(void *arg)
{
struct args *args = arg;
long double sum = 0;
printf("waiting for barrier in thread -> %" PRId64 "\n", args->thread_id);
pthread_barrier_wait(&barrier);
// printf("we passed the barrier... \n");
for (uint64_t n = args->range.start; n < args->range.end; n++)
sum += pow(-1.0, n) / (1 + n * 2);
if (pthread_mutex_lock(&mutex)) {
perror("pthread_mutex_lock");
exit(EXIT_FAILURE);
}
total += sum;
total_iterations += args->range.end - args->range.start;
if (pthread_mutex_unlock(&mutex)) {
perror("pthread_mutex_unlock");
exit(EXIT_FAILURE);
}
printf("thr %" PRId64 " : %.20Lf\n", args->thread_id, sum);
return NULL;
}
int main(int argc,char *argv[])
{
if (argc <= 2) {
fprintf(stderr, "usage: %s THREADS TERMS.\tPlease pass two num. in arguments\n", *argv);
return EXIT_FAILURE;
}
int m = atoi(argv[1]); // get value of first argument & converted into int
int n = atoi(argv[2]); // get value of second argument & converted into int
if (!m || !n) {
fprintf(stderr, "Argument is zero.\n");
return EXIT_FAILURE;
}
uint64_t threads = m;
uint64_t terms = n;
uint64_t range = terms / threads;
uint64_t excess = terms - range * threads;
pthread_t thread_id[threads];
struct args arguments[threads];
int ret;
/* Initialize the barrier. */
ret = pthread_barrier_init(&barrier, NULL, m);
if (ret) {
perror("pthread_barrier_init");
return EXIT_FAILURE;
}
for (uint64_t i = 0; i < threads; i++) {
arguments[i].thread_id = i;
arguments[i].range.start = i * range;
arguments[i].range.end = arguments[i].range.start + range;
if (threads - 1 == i)
arguments[i].range.end += excess;
printf("In main: creating thread %ld\n", i);
ret = pthread_create(thread_id + i, NULL, partial_sum, arguments + i);
if (ret) {
perror("pthread_create");
return EXIT_FAILURE;
}
}
for (uint64_t i = 0; i < threads; i++)
if (pthread_join(thread_id[i], NULL))
perror("pthread_join");
pthread_barrier_destroy(&barrier);
printf("Pi value is : %.10Lf\n", 4 * total);
printf("COMPLETE? (%s)\n", total_iterations == terms ? "YES" : "NO");
return 0;
}
In each thread, the count variable is expected to have a steadily increasing value in this expression,
int n = count * term;
being one larger than it was in the "previous" thread, but count is only incremented later on in each thread.
Even if you were to "immediately" increase count, there is nothing that guards against two or more threads attempting to read from and write to the variable at the same time.
The same issue exists for total.
The unpredictability of these reads and writes will lead to indeterminate results.
When sharing resources between threads, you must take care to avoid these race conditions. The POSIX threads library does not contain any atomics for fundamental integral operations.
You should protect your critical data against a read/write race condition by using a lock to restrict access to a single thread at a time.
The POSIX threads library includes a pthread_mutex_t type for this purpose. See:
pthread_mutex_init / pthread_mutex_destroy
pthread_mutex_lock / pthread_mutex_unlock
Additionally, as pointed out by @Craig Estey, using (void *) &i as the argument to the thread functions introduces a race condition where the value of i may change before any given thread executes *(int *) vargp;.
The suggestion is to pass the value of i itself, stored temporarily in the pointer argument; you should go through intptr_t or uintptr_t, which are well defined for this purpose:
pthread_create(&thread_id[i], NULL, thread_function, (void *)(intptr_t) i);
int thread_rank = (intptr_t) vargp;
How to check that all of the m computational threads have done the atomic additions?
Sum up the number of terms processed by each thread, and ensure it is equal to the expected number of terms. This can also naturally be assumed to be the case if all possible errors are accounted for (ensuring all threads run to completion and assuming the algorithm used is correct).
A moderately complete example program:
#define _POSIX_C_SOURCE 200809L
#include <inttypes.h>
#include <math.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
struct args {
uint64_t thread_id;
struct {
uint64_t start;
uint64_t end;
} range;
};
pthread_mutex_t mutex;
long double total = 0;
uint64_t total_iterations = 0;
void *partial_sum(void *arg)
{
struct args *args = arg;
long double sum = 0;
for (uint64_t n = args->range.start; n < args->range.end; n++)
sum += pow(-1.0, n) / (1 + n * 2);
if (pthread_mutex_lock(&mutex)) {
perror("pthread_mutex_lock");
exit(EXIT_FAILURE);
}
total += sum;
total_iterations += args->range.end - args->range.start;
if (pthread_mutex_unlock(&mutex)) {
perror("pthread_mutex_unlock");
exit(EXIT_FAILURE);
}
printf("thread(%" PRId64 ") Partial sum: %.20Lf\n", args->thread_id, sum);
return NULL;
}
int main(int argc,char **argv)
{
if (argc < 3) {
fprintf(stderr, "usage: %s THREADS TERMS\n", *argv);
return EXIT_FAILURE;
}
uint64_t threads = strtoull(argv[1], NULL, 10);
uint64_t terms = strtoull(argv[2], NULL, 10);
if (!threads || !terms) {
fprintf(stderr, "Argument is zero.\n");
return EXIT_FAILURE;
}
uint64_t range = terms / threads;
uint64_t excess = terms - range * threads;
pthread_t thread_id[threads];
struct args arguments[threads];
if (pthread_mutex_init(&mutex, NULL)) {
perror("pthread_mutex_init");
return EXIT_FAILURE;
}
for (uint64_t i = 0; i < threads; i++) {
arguments[i].thread_id = i;
arguments[i].range.start = i * range;
arguments[i].range.end = arguments[i].range.start + range;
if (threads - 1 == i)
arguments[i].range.end += excess;
int ret = pthread_create(thread_id + i, NULL , partial_sum, arguments + i);
if (ret) {
perror("pthread_create");
return EXIT_FAILURE;
}
}
for (uint64_t i = 0; i < threads; i++)
if (pthread_join(thread_id[i], NULL))
perror("pthread_join");
pthread_mutex_destroy(&mutex);
printf("%.10Lf\n", 4 * total);
printf("COMPLETE? (%s)\n", total_iterations == terms ? "YES" : "NO");
}
Using 16 threads to process 10 billion terms:
$ ./a.out 16 10000000000
thread(14) Partial sum: 0.00000000000190476190
thread(10) Partial sum: 0.00000000000363636364
thread(2) Partial sum: 0.00000000006666666667
thread(1) Partial sum: 0.00000000020000000000
thread(8) Partial sum: 0.00000000000555555556
thread(15) Partial sum: 0.00000000000166666667
thread(0) Partial sum: 0.78539816299744868408
thread(3) Partial sum: 0.00000000003333333333
thread(13) Partial sum: 0.00000000000219780220
thread(11) Partial sum: 0.00000000000303030303
thread(4) Partial sum: 0.00000000002000000000
thread(5) Partial sum: 0.00000000001333333333
thread(7) Partial sum: 0.00000000000714285714
thread(6) Partial sum: 0.00000000000952380952
thread(12) Partial sum: 0.00000000000256410256
thread(9) Partial sum: 0.00000000000444444444
3.1415926535
COMPLETE? (YES)
I'm trying to learn about multithreaded algorithms so I've implemented a simple find max number function of an array.
I've made a baseline program (findMax1.c) which loads about 268 million int numbers (1 GiB) from a file into memory.
Then I simply use a for loop to find the max number. Then I made another program (findMax2.c) which uses 4 threads. I chose 4 threads because the CPU (Intel i5 4460) I'm using has 4 cores and 1 thread per core. So my guess is that if I assign each core a chunk of the array to process, it would be more efficient, because that way I'll have fewer cache misses. Now, each thread finds the max number in its chunk, and then I join all threads to finally find the max number among those chunk results. The baseline program findMax1.c takes about 660 ms to complete the task, so my initial thought was that findMax2.c (which uses 4 threads) would take about 165 ms (660 ms / 4) to complete, since now I have 4 threads running in parallel to do the same task, but findMax2.c takes about 610 ms, only 50 ms less than findMax1.c.
What am I missing? Is there something wrong with the implementation of the threaded program?
findMax1.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
int main(void)
{
int i, *array, max = 0, position;
size_t array_size_in_bytes = 1024*1024*1024, elements_read, array_size;
FILE *f;
clock_t t;
double time;
array = (int*) malloc(array_size_in_bytes);
assert(array != NULL); // abort if the condition is false
printf("Loading array...");
t = clock();
f = fopen("numbers.bin", "rb");
assert(f != NULL);
elements_read = fread(array, array_size_in_bytes, 1, f);
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
assert(elements_read == 1);
printf("done!\n");
printf("File load time: %f [s]\n", time);
fclose(f);
array_size = array_size_in_bytes / sizeof(int);
printf("Finding max...");
t = clock();
for(i = 0; i < array_size; i++)
if(array[i] > max)
{
max = array[i];
position = i;
}
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
return 0;
}
findMax2.c:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#define NUM_THREADS 4
int max_chunk[NUM_THREADS], pos_chunk[NUM_THREADS];
int *array;
pthread_t tid[NUM_THREADS];
void *thread(void *arg)
{
size_t array_size_in_bytes = 1024*1024*1024;
int i, rc, offset, chunk_size, array_size, *core_id = (int*) arg, num_cores = sysconf(_SC_NPROCESSORS_ONLN);
pthread_t id = pthread_self();
cpu_set_t cpuset;
if (*core_id < 0 || *core_id >= num_cores)
return NULL;
CPU_ZERO(&cpuset);
CPU_SET(*core_id, &cpuset);
rc = pthread_setaffinity_np(id, sizeof(cpu_set_t), &cpuset);
if(rc != 0)
{
printf("pthread_setaffinity_np() failed! - rc %d\n", rc);
return NULL;
}
printf("Thread running on CPU %d\n", sched_getcpu());
array_size = (int) (array_size_in_bytes / sizeof(int));
chunk_size = (int) (array_size / NUM_THREADS);
offset = chunk_size * (*core_id);
// Find max number in the array chunk
for(i = offset; i < (offset + chunk_size); i++)
{
if(array[i] > max_chunk[*core_id])
{
max_chunk[*core_id] = array[i];
pos_chunk[*core_id] = i;
}
}
return NULL;
}
void load_array(void)
{
FILE *f;
size_t array_size_in_bytes = 1024*1024*1024, elements_read;
array = (int*) malloc(array_size_in_bytes);
assert(array != NULL); // assert if condition is false
printf("Loading array...");
f = fopen("numbers.bin", "rb");
assert(f != NULL);
elements_read = fread(array, array_size_in_bytes, 1, f);
assert(elements_read == 1);
printf("done!\n");
fclose(f);
}
int main(void)
{
int i, max = 0, position, id[NUM_THREADS], rc;
clock_t t;
double time;
load_array();
printf("Finding max...");
t = clock();
// Create threads
for(i = 0; i < NUM_THREADS; i++)
{
id[i] = i; // use id to pass a distinct pointer to each thread
rc = pthread_create(&(tid[i]), NULL, &thread, (void*)(id + i));
if (rc != 0)
printf("Can't create thread! rc = %d\n", rc);
else
printf("Thread %lu created\n", tid[i]);
}
// Join threads
for(i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
// Find max number from all chunks
for(i = 0; i < NUM_THREADS; i++)
if(max_chunk[i] > max)
{
max = max_chunk[i];
position = pos_chunk[i];
}
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
free(array);
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
pthread_exit(NULL);
return 0;
}
First of all, you're measuring your time wrong.
clock() measures process CPU time, i.e., the time used by all threads combined. The real elapsed time will be a fraction of that. clock_gettime(CLOCK_MONOTONIC, ...) should yield better measurements.
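For reference, a minimal sketch of wall-clock timing around the parallel region (variable names are illustrative):
/* Measure elapsed wall-clock time around thread creation and joining. */
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
/* ... create and join the worker threads here ... */
clock_gettime(CLOCK_MONOTONIC, &t1);
double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
printf("Wall time: %f s\n", wall);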
Second, your core loops aren't at all comparable.
In the multithreaded program, each loop iteration writes to global variables that sit very close to each other, and that is horrible for cache contention (false sharing between the threads' cache lines).
You could space that global memory apart (make each array item a cache-aligned struct (_Alignas(64))) and that'll help the time, but a better and fairer approach would be to use local variables (which should go into registers), copying the approach of the first loop, and then write out the chunk result to memory at the end of the loop:
int l_max_chunk=0, l_pos_chunk=0, *a;
for(i = 0,a=array+offset; i < chunk_size; i++)
if(a[i] > l_max_chunk) l_max_chunk=a[i], l_pos_chunk=i;
max_chunk[*core_id] = l_max_chunk;
pos_chunk[*core_id] = offset + l_pos_chunk; /* convert the chunk-relative index back to a global position */
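For completeness, a rough sketch of the cache-line-alignment alternative mentioned above; it needs C11, and the struct and field names are invented for this example:
/* Pad each thread's result slot to a full cache line so neighbouring threads
 * never write to the same line. */
struct chunk_result {
    _Alignas(64) int max;   /* forces 64-byte alignment, so each array element occupies its own line */
    int pos;
};
struct chunk_result results[NUM_THREADS];   /* would replace max_chunk[] and pos_chunk[] */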
Here's your modified test program with expected speedups (I'm getting approx. a 2x speedup on my two-core processor).
(I've also taken the liberty of replacing the file load with in-memory initialization, to make it simpler to test.)
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <stdint.h>
struct timespec ts0,ts1;
uint64_t sc_timespec_diff(struct timespec Ts1, struct timespec Ts0) { return (Ts1.tv_sec - Ts0.tv_sec)*1000000000+(Ts1.tv_nsec - Ts0.tv_nsec); }
#define NUM_THREADS 4
int max_chunk[NUM_THREADS], pos_chunk[NUM_THREADS];
int *array;
pthread_t tid[NUM_THREADS];
void *thread(void *arg)
{
size_t array_size_in_bytes = 1024*1024*1024;
int i, rc, offset, chunk_size, array_size, *core_id = (int*) arg, num_cores = sysconf(_SC_NPROCESSORS_ONLN);
#if 1 //shouldn't make much difference
pthread_t id = pthread_self();
cpu_set_t cpuset;
if (*core_id < 0 || *core_id >= num_cores)
return NULL;
CPU_ZERO(&cpuset);
CPU_SET(*core_id, &cpuset);
rc = pthread_setaffinity_np(id, sizeof(cpu_set_t), &cpuset);
if(rc != 0)
{
printf("pthread_setaffinity_np() failed! - rc %d\n", rc);
return NULL;
}
printf("Thread running on CPU %d\n", sched_getcpu());
#endif
array_size = (int) (array_size_in_bytes / sizeof(int));
chunk_size = (int) (array_size / NUM_THREADS);
offset = chunk_size * (*core_id);
// Find max number in the array chunk
#if 0 //horrible for caches
for(i = offset; i < (offset + chunk_size); i++)
{
if(array[i] > max_chunk[*core_id])
{
max_chunk[*core_id] = array[i];
pos_chunk[*core_id] = i;
}
}
#else
int l_max_chunk=0, l_pos_chunk=0, *a;
for(i = 0,a=array+offset; i < chunk_size; i++)
if(a[i] > l_max_chunk) l_max_chunk=a[i], l_pos_chunk=i;
max_chunk[*core_id] = l_max_chunk;
pos_chunk[*core_id] = offset + l_pos_chunk; /* convert the chunk-relative index back to a global position */
#endif
return NULL;
}
void load_array(void)
{
FILE *f;
size_t array_size_in_bytes = 1024*1024*1024, array_size=array_size_in_bytes/sizeof(int);
array = (int*) malloc(array_size_in_bytes);
if(array == NULL) abort(); // assert if condition is false
for(size_t i=0; i<array_size; i++) array[i]=i;
}
int main(void)
{
int i, max = 0, position, id[NUM_THREADS], rc;
clock_t t;
double time;
load_array();
printf("Finding max...");
t = clock();
clock_gettime(CLOCK_MONOTONIC,&ts0);
// Create threads
for(i = 0; i < NUM_THREADS; i++)
{
id[i] = i; // use id to pass a distinct pointer to each thread
rc = pthread_create(&(tid[i]), NULL, &thread, (void*)(id + i));
if (rc != 0)
printf("Can't create thread! rc = %d\n", rc);
else
printf("Thread %lu created\n", tid[i]);
}
// Join threads
for(i = 0; i < NUM_THREADS; i++)
pthread_join(tid[i], NULL);
// Find max number from all chunks
for(i = 0; i < NUM_THREADS; i++)
if(max_chunk[i] > max)
{
max = max_chunk[i];
position = pos_chunk[i];
}
clock_gettime(CLOCK_MONOTONIC,&ts1);
printf("Time2 %.6LF\n", sc_timespec_diff(ts1,ts0)/1E9L);
t = clock() - t;
time = ((double) t) / CLOCKS_PER_SEC;
printf("done!\n");
free(array);
printf("----------- Program results -------------\nMax number: %d position %d\n", max, position);
printf("Time %f [s]\n", time);
pthread_exit(NULL);
return 0;
}
My timings:
0.188917 for the single-threaded version
2.511590 for the original multithreaded version (measured with clock_gettime(CLOCK_MONOTONIC, ...))
0.099802 with the modified threaded version (measured with clock_gettime(CLOCK_MONOTONIC, ...))
run on a Linux machine with an Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz.
I'm trying to do some multithreaded high-performance matrix multiplication in C. The code below is the program I wrote; it works fine when the number of cores is 12 (my PC has 12 hardware threads, or when I manually fix it to 12), but when I switch it to a lower value (like 10, for example) it gives me strange results. Does anyone have an idea what the problem could be?
Tested and working perfectly with 12 cores (or threads, call them whatever you want); with a lower number of cores it doesn't work anymore (it looks like it ends execution almost immediately).
I tried different values, but it looks like there is an error in the code that I can't figure out.
The error shows up with big matrices but sometimes also with small ones.
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
int ops = 0;
//define matrix size (in this case we'll use a square matrix)
#define DIM 200 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int row);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int row) {
for (int i = row; i < DIM; i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
//multiplication loops
for (int i = *row_offset; i < (*row_offset + (DIM / cores)); i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
return NULL;
}
This is what the result looks like (the number of ops in the result should also be equal to size x size):
------------- MATRIX MULTIPLICATION -------------
--- multi-thread (vectorization enabled) v1.0 ---
Currently using 4 cores
Matrix size: 200 x 200
Rows por Thread: 50
Rest: 0
created thread n 0 with argument: 0
created thread n 1 with argument: 50
created thread n 2 with argument: 100
created thread n 3 with argument: 150
... working ...
Waiting to join thread n: 0
Waiting to join thread n: 1
Waiting to join thread n: 2
Waiting to join thread n: 3
All threads joined correctly
Time elapsed is 0 seconds and 804 micros
Total multiply ops: 2200
Process finished with exit code 0
This pthread_join here looks extremely fishy -- observe how the loop variable is c, but you index the array on i:
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
I doubt it's doing the right thing.
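Presumably the join loop should index on its own loop variable, something like:
for (int c = 0; c < cores; c++) {
    pthread_join(threads[c], NULL);
    printf("\n Waiting to join thread n: %i", c);
}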
In thread_multiply, the unadorned line:
ops++;
looks a bit suspicious. Did you not say you were running multiple instances of these threads?
As a general comment, you should look to have your functions a bit better defined; for example if you changed your single_multiply to be:
int single_multiply(int RowStart, int RowEnd) {
int ops = 0;
....
return ops;
}
then
void *thread_multiply(void *p) {
int *rows = p;
int ops;
ops = single_multiply(rows[0], rows[1]);
return (void *)(intptr_t) ops;
}
you have:
reduced the bit of code that cares about things like 'cores' to the only place where it matters,
removed contention on the counter (you can collect the per-thread counts in pthread_join), and
removed the redundant, nearly identical code.
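As a sketch, the per-thread counts could then be gathered on the join side roughly like this (assuming thread_multiply returns its count cast through intptr_t as in the snippet above, with <stdint.h> included):
/* Collect each thread's multiply count when joining. */
long total_ops = 0;
for (int c = 0; c < cores; c++) {
    void *ret;
    pthread_join(threads[c], &ret);
    total_ops += (long)(intptr_t) ret;
}
printf("\n Total multiply ops: %ld", total_ops);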
Thank you all, guys. This is what it looks like now. I would honestly have expected better performance, but at least it looks like it's working. Does anyone have ideas for performance improvements I could make?
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
//define matrix size (in this case we'll use a square matrix)
#define DIM 4000 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int rowStart, int rowEnd);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0, DIM);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %ld seconds and %ld micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
error = pthread_join(threads[c], NULL);
if (error != 0) {
printf("\n Error in thread %i join", c);
}
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM - rest; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[c], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows, DIM);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int rowStart, int rowEnd) {
for (int i = rowStart; i < rowEnd; i++) {
//printf("\n %i", i);
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
printf(" Starting at line %i ending at line %i \n ", *row_offset, *row_offset + (DIM / cores));
single_multiply(*row_offset, *row_offset + (DIM / cores));
printf("\n ended at line %i", *row_offset + (DIM / cores));
return NULL;
}
Issue
Hello everyone. I have a program (from the net) that I intend to speed up by converting it into a parallel version using pthreads. Surprisingly, though, it runs slower than the serial version. Below is the program:
# include <stdio.h>
//fast square root algorithm
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//number generator iterated from 0 to n
int main()
{
int n = 1000000; //maximum number
int k = 0, j;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
First attempt at parallelization
I let pthreads manage the for loop
# include <stdio.h>
.
.
int main()
{
.
.
//----->pthread code here<----
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
Well, it runs slower than the serial one
Second attempt
I divided the for loop between two threads and ran them in parallel using pthreads.
However, it still runs slower. I was expecting it to run about twice as fast, or at least faster, but it's not!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct arg_struct
{
int initialPrime;
int nextPrime;
};
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
void *parallel_launcher(void *arguments)
{
struct arg_struct *args = (struct arg_struct *)arguments;
int j = args -> initialPrime;
int n = args -> nextPrime - 1;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1)
{
printf("This is prime: %d\n",j);
pthread_mutex_lock( &mutex1 );
k++;
pthread_mutex_unlock( &mutex1 );
}
if(j == n) printf("Count: %d\n",k);
}
pthread_exit(NULL);
}
int main()
{
int f = 100000000;
int m;
pthread_t thread_id[NTHREADS];
struct arg_struct args;
int rem = (f+1)%NTHREADS;
int n = floor((f+1)/NTHREADS);
for(int h = 0; h < NTHREADS; h++)
{
if(rem > 0)
{
m = n + 1;
rem-= 1;
}
else if(rem == 0)
{
m = n;
}
args.initialPrime = args.nextPrime;
args.nextPrime = args.initialPrime + m;
pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
pthread_join(thread_id[h], NULL);
}
// printf("Count: %d\n",k);
return 0;
}
Note:
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical cores, 4 logical),
Mem: 6 Gb,
HDD: 340 Gb,
You need to split the range you are examining for primes into n parts, where n is the number of threads.
The code that each thread runs becomes:
typedef struct start_end {
int start;
int end;
} start_end_t;
void *find_primes_in_range(void *in) {
start_end_t *start_end = (start_end_t *) in;
int num_primes = 0;
for (int j = start_end->start; j <= start_end->end; j++) {
if (isPrime(j) == 1)
num_primes++;
}
pthread_exit((void *)(intptr_t) num_primes);
}
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
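A minimal sketch of such a main loop, assuming a fixed thread count, an even split of the range, and the void *-returning find_primes_in_range shown above (the other names are illustrative):
/* Create all the threads first, join them afterwards, and sum the counts
 * they return; no shared counter and no mutex are needed. */
enum { NTHREADS = 4 };
pthread_t tid[NTHREADS];
start_end_t ranges[NTHREADS];
int limit = 1000000, chunk = (limit + 1) / NTHREADS, total = 0;
for (int t = 0; t < NTHREADS; t++) {
    ranges[t].start = t * chunk;
    ranges[t].end = (t == NTHREADS - 1) ? limit : (t + 1) * chunk - 1;
    pthread_create(&tid[t], NULL, find_primes_in_range, &ranges[t]);
}
for (int t = 0; t < NTHREADS; t++) {
    void *ret;
    pthread_join(tid[t], &ret);
    total += (int)(intptr_t) ret;   /* undo the cast made in pthread_exit() */
}
printf("Count: %d\n", total);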
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
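One common way to even out the work, not shown here, is to give each thread a strided subset of the candidates instead of a contiguous block, so the cheap small numbers and the expensive large ones are spread across all threads; a rough sketch:
/* Strided assignment: thread 0 tests 0, nthreads, 2*nthreads, ...,
 * thread 1 tests 1, nthreads+1, ..., and so on. */
int count_primes_strided(int thread_idx, int nthreads, int limit) {
    int count = 0;
    for (int j = thread_idx; j <= limit; j += nthreads)
        if (isPrime(j))
            count++;
    return count;
}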
The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise the threads will spend far more time waiting on and handling that mutex than they will on the actual calculation. You are essentially forcing the threads to execute serially.
Instead, sum everything up with a private counter variable and once a thread is done with its work, return the counter variable and sum them up in main().
Also, you should not call printf() from inside the threads. If there is a context switch in the middle of a printf call, you'll end up with crappy output such as This is This is prime: 2. In which case you must synchronize the printf calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work that the thread is doing. So some sort of re-design of who does the printing might be a good idea, depending on what you want to do with the results.
Summary
Indeed, the use of pthreads sped up my code. My programming flaws were placing pthread_join right after the first pthread_create and the shared counter I had set in the arguments. After fixing these, I tested my parallel code by determining the primality of 100 million numbers and compared its processing time with a serial version. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have much reputation yet, instead, I am including a link)
I conducted three trials for each to account for the variation caused by other OS activity. We got a speedup by using parallel programming with pthreads. What is surprising is that the pthread code running in ONE thread was a bit faster than the purely serial code. I could not explain this one; nevertheless, using pthreads is surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct start_end_f
{
int start;
int end;
};
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn = asmSqrt(n);
for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
int k = 0;
struct start_end_f *start_end_h = (struct start_end_f *)in;
for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
{
if(isPrime(j) == 1) k++;
}
int *t = new int;
*t = k;
pthread_exit(t);
}
int main()
{
int f = 100000000; //maximum number to be tested for prime
pthread_t thread_id[NTHREADS];
struct start_end_f start_end[NTHREADS];
int rem = (f+1)%NTHREADS;
int n = (f+1)/NTHREADS;
int rem_change = rem;
int m;
if(rem>0) m = n+1;
else if(rem == 0) m = n;
//distributes task 'evenly' to the number of parallel threads requested
for(int h = 0; h < NTHREADS; h++)
{
if(rem_change > 0)
{
start_end[h].start = m*h;
start_end[h].end = start_end[h].start+m-1;
rem_change -= 1;
}
else if(rem_change<= 0)
{
start_end[h].start = m*(h+rem_change)-rem_change*n;
start_end[h].end = start_end[h].start+n-1;
rem_change -= 1;
}
pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
}
//retrieving returned values
int *t;
int c = 0;
for(int h = 0; h < NTHREADS; h++)
{
pthread_join(thread_id[h], (void **)&t);
int b = *t;
c += b;
delete t; // free the int allocated with new in the thread
}
printf("\nNumber of Primes: %d\n",c);
return 0;
}
Let me first of all say that this is for school, but I don't really need help; I'm just confused by some results I'm getting.
I have a simple program that approximates pi using Simpson's rule, in one assignment we had to do this by spawning 4 child processes and now in this assignment we have to use 4 kernel-level threads. I've done this, but when I time the programs the one using child processes seems to run faster (I get the impression I should be seeing the opposite result).
Here is the program using pthreads:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
struct func_range {
double start;
double end;
};
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
void *partial_sum(void *r)
{
double *ret = (double *)malloc(sizeof(double));
struct func_range *range = r;
#if SPEED_TEST
int k;
double begin = range->start;
for (k = 0; k < 25000; k++)
{
range->start = begin;
*ret = 0;
#endif
for (; range->start <= range->end; ++range->start)
*ret += TERN_STMT(range->start);
#if SPEED_TEST
}
#endif
return ret;
}
int main()
{
// An array for our threads.
pthread_t threads[4];
double total_sum = func(0);
void *temp;
struct func_range our_range;
int i;
for (i = 0; i < 4; i++)
{
our_range.start = (i == 0) ? 1 : (i == 1) ? 8000 : (i == 2) ? 16000 : 24000;
our_range.end = (i == 0) ? 7999 : (i == 1) ? 15999 : (i == 2) ? 23999 : 31999;
pthread_create(&threads[i], NULL, &partial_sum, &our_range);
pthread_join(threads[i], &temp);
total_sum += *(double *)temp;
free(temp);
}
total_sum += func(1);
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is using child processes:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
int main()
{
// An array for our subprocesses.
pid_t pids[4];
// The pipe to pass-through information
int mypipe[2];
// Counter for subproccess loops
double j;
// Counter for outer loop
int i;
// Number of PIDs
int n = 4;
// The final sum
double total_sum = 0;
// Temporary variable holding the result from a subproccess
double temp;
// The partial sum tallied by a subproccess.
double sum = 0;
int k;
if (pipe(mypipe))
{
perror("pipe");
return EXIT_FAILURE;
}
// Create the PIDs
for (i = 0; i < 4; i++)
{
// Abort if something went wrong
if ((pids[i] = fork()) < 0)
{
perror("fork");
abort();
}
else if (pids[i] == 0)
// Depending on what PID number we are we'll only calculate
// 1/4 the total.
#if SPEED_TEST
for (k = 0; k < 25000; ++k)
{
sum = 0;
#endif
switch (i)
{
case 0:
sum += func(0);
for (j = 1; j <= 7999; ++j)
sum += TERN_STMT(j);
break;
case 1:
for (j = 8000; j <= 15999; ++j)
sum += TERN_STMT(j);
break;
case 2:
for (j = 16000; j <= 23999; ++j)
sum += TERN_STMT(j);
break;
case 3:
for (j = 24000; j < 32000; ++j)
sum += TERN_STMT(j);
sum += func(1);
break;
}
#if SPEED_TEST
}
#endif
// Write the data to the pipe
write(mypipe[1], &sum, sizeof(sum));
exit(0);
}
}
int status;
pid_t pid;
while (n > 0)
{
// Wait for the calculations to finish
pid = wait(&status);
// Read from the pipe
read(mypipe[0], &temp, sizeof(total_sum));
// Add to the total
total_sum += temp;
n--;
}
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is a time result from the pthreads version running 100,000 times:
real 11.15
user 11.15
sys 0.00
And here is the child process version:
real 5.99
user 23.81
sys 0.00
Having a user time of 23.81 implies that it is the sum of the time each core took to execute the code. In the pthread run, the real and user times are the same, implying that only one core is being used. Why isn't it using all 4 cores? I thought that by default it would do at least as well as child processes.
Hopefully this question makes sense; this is my first time programming with pthreads, and I'm pretty new to OS-level programming in general.
Thanks for taking the time to read this lengthy question.
When you say pthread_join immediately after pthread_create, you're effectively serializing all the threads. Don't join threads until after you've created all the threads and done all the other work that doesn't need the result from the threaded computations.
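A minimal sketch of the restructured loops, assuming one func_range per thread so a range is not overwritten while another thread is still reading it (the ranges array is an addition the original code would also need; everything else mirrors it):
struct func_range ranges[4];
/* First loop: give every thread its own range and start it. */
for (i = 0; i < 4; i++) {
    ranges[i].start = (i == 0) ? 1 : i * 8000;
    ranges[i].end = (i + 1) * 8000 - 1;
    pthread_create(&threads[i], NULL, &partial_sum, &ranges[i]);
}
/* Second loop: only now wait for the results, so the four threads actually overlap. */
for (i = 0; i < 4; i++) {
    pthread_join(threads[i], &temp);
    total_sum += *(double *)temp;
    free(temp);
}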