Counting a number in a list using pthreads in C

I am having trouble using pthreads to count the number of 3s in a list. The serial version of my code works fine, but the versions that use pthread_create do not. Currently the problem is that count3s_thread_2(int id) does not return the same value as the serial version.
What do I need to change?
P.S. Sorry for the mess. I am new to programming in C.
// Declares some global variables we will use throughout the
// program with all versions.
#define NUM_THREADS 4
int Length = 1000;
int array[1000];
int count;
long i;
pthread_mutex_t m;
pthread_t threads[NUM_THREADS];
void create_list(int *array)
{
srand(time(NULL));
for (i = 0; i < Length; i++)
{
int r = rand();
r = (r % 10) + 1;
array[i] = r;
}
}
void* count3s(void* threadid)
{
// This is the function that counts the number of threes for
// the first threaded version.
//int i = (intptr_t)threadid;
int i = (intptr_t)threadid;
long tid = (long)threadid;
int length_per_thread = Length / NUM_THREADS;
long start = tid * (long)length_per_thread;
for (i = start; i < start + length_per_thread; i++)
{
if (array[i] == 3)
{
count++;
}
}
pthread_exit(NULL);
}
void* count3s_v2(void* threadid)
{
// This is the function that counts the number of threes for
// the second threaded version.
//int serial = count3s_serial();
//printf("Number of threes: %d\n", serial);
int i = (intptr_t)threadid;
long tid = (long)threadid;
int length_per_thread = Length / NUM_THREADS;
long start = tid * (long)length_per_thread;
for (i = start; i < start + length_per_thread; i++)
{
if (array[i] == 3)
{
pthread_mutex_lock(&m);
count++;
pthread_mutex_unlock(&m);
}
}
pthread_exit(NULL);
}
int count3s_serial()
{
// This is the serial version of count3s. No threads are
// created and run separately from other threads.
count = 0;
for (i = 0; i < Length; i++)
{
if (array[i] == 3)
{
count++;
}
}
return count;
}
int count3s_thread(int id)
{
clock_t begin, end;
double time_spent;
begin = clock();
//pthread_attr_init();
for (i = 0; i < NUM_THREADS; i++)
{
pthread_create(&threads[i], NULL, count3s, (void *)i);
}
//pthread_attr_destroy();
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
return count;
}
int count3s_thread_2(int id)
{
clock_t begin, end;
double time_spent;
begin = clock();
pthread_attr_init(&something);
for (i = 0; i < NUM_THREADS; i++)
{
pthread_create(&threads[i], NULL, count3s_v2, (void *)i);
}
pthread_attr_destroy(&something);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
return count;
//printf("Thread Version 2: Number of threes = %d\nThread Version 2: Time Spent = %f\n", count, time_spent);
}
int main()
{
create_list(array);
clock_t begin, end;
double time_spent;
for (i = 0; i < Length; i++)
{
printf("%d\n", array[i]);
}
// Beginning of serial version. Timer begins, serial version
// is run and after it's done, the timer stops.
begin = clock();
int serial = count3s_serial();
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Serial Version: Number of threes = %d\nSerial Version: Time Spent = %f\n", serial, time_spent);
// End of serial version.
/*
*********************************************************************
*/
// Beginning of first threaded version. Timer begins, first
// threaded version is run and after it's done, the timer stops.
int the_thing = 0;
count = 0;
the_thing = count3s_thread(i);
printf("Thread Version 1: Number of threes = %d\nThread Version 1: Time Spent = %f\n", the_thing, time_spent);
// End of first threaded version.
/*
*********************************************************************
*/
// Beginning of second threaded version. Timer begins, second
// threaded version is run and after it's done, the timer stops.
int the_other_thing = 0;
count = 0;
the_other_thing = count3s_thread_2(i);
printf("Thread Version 2: Number of threes = %d\nThread Version 2: Time Spent = %f\n", the_other_thing, time_spent);
pthread_exit(NULL);
}

The problem is that you spawn the threads but don't wait for them to finish before printing the result. Both threaded versions have the same problem. Use pthread_join to wait for the threads to exit, or implement some other synchronisation that lets the parent know when the threads have completed their work.
For example, add the following loop to the end of both count3s_thread and count3s_thread_2, just before the return statement. It waits for the threads to complete before the count is returned. Note: you must add it to both functions (even if you accept that the first version, which increments count without a lock, may report the wrong count). Otherwise, when you run the second threaded version, the first set of threads is likely to still be executing and will corrupt the global count.
for (i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}

I tinkered with your code and got it to compile. You aren't waiting for your threads to finish before you report the result. You need to pthread_join() each of your threads in count3s_thread() before it returns.
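To make the join advice concrete, here is a minimal sketch of one way to structure the count so that neither the lost updates nor the missing join can occur. It is not the asker's code: the struct slice type and the function names count3s_slice and count3s_parallel are hypothetical, introduced only for illustration. Each thread counts into a private variable and the parent adds the partial counts after pthread_join, so no mutex is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

/* Hypothetical per-thread work description (not part of the original code). */
struct slice {
    const int *data;  /* first element of this thread's part of the array */
    size_t     len;   /* number of elements in this part */
    int        count; /* number of 3s found, written by the thread */
};

static void *count3s_slice(void *arg)
{
    struct slice *s = arg;
    int local = 0;                 /* private counter, so no lock is needed */
    for (size_t i = 0; i < s->len; i++)
        if (s->data[i] == 3)
            local++;
    s->count = local;              /* published to the parent by pthread_join */
    return NULL;
}

/* Counts the 3s in data[0..len) with NUM_THREADS threads.
 * Returns -1 if a thread could not be created. */
int count3s_parallel(const int *data, size_t len)
{
    assert(data != NULL || len == 0);

    pthread_t threads[NUM_THREADS];
    struct slice slices[NUM_THREADS];
    size_t chunk = len / NUM_THREADS;
    int created = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        size_t begin = (size_t)t * chunk;
        slices[t].data = data + begin;
        /* the last thread also takes the remainder of the division */
        slices[t].len = (t == NUM_THREADS - 1) ? len - begin : chunk;
        slices[t].count = 0;
        if (pthread_create(&threads[t], NULL, count3s_slice, &slices[t]) != 0)
            break;
        created++;
    }

    /* Wait for every thread that was started before reading its result. */
    int total = 0;
    for (int t = 0; t < created; t++) {
        pthread_join(threads[t], NULL);
        total += slices[t].count;
    }
    return created == NUM_THREADS ? total : -1;
}
```

Calling count3s_parallel(array, Length) from main would then replace both threaded versions. The design choice here is to avoid the shared counter entirely rather than protect it, which also removes the lock traffic of the second version.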

Related

Monte Carlo with threading

This is what I am trying to accomplish: write a multithreaded program in C (or C++/C#) that creates 5 threads. Each thread should generate 1,000 random points and count the number of points that occur within the circle. The main thread should wait for the five threads to terminate one after another. Once a thread is terminated, the main thread updates the value of PI using the total number of points in the circle and the total number of points generated by the terminated thread. For example, the main thread waits for the first thread to terminate. When the first thread is terminated, the main thread incorporates the total number of points in the circle obtained by the first thread to update the value of PI. Next, the main thread waits for the second thread to terminate. When the second thread is terminated, the main thread updates the value of PI using the total number of points in the circle obtained by the second thread, and so on.
I keep getting an error saying that a non-void function doesn't return a value. Is there any way I can get around that, or what are some alternatives?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
long incircle = 0;
long ppt; /* points per thread*/
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
void *runner() {
long incircle_thread = 0;
unsigned int rand_state = rand();
long i;
for (i = 0; i < ppt; i++) {
double x = rand_r(&rand_state) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
double y = rand_r(&rand_state) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
if (x * x + y * y < 1) {
incircle_thread++;
}
}
pthread_mutex_lock(&mutex);
incircle += incircle_thread;
pthread_mutex_unlock(&mutex);
}
int main(int argc, const char *argv[])
{
if (argc != 3) {
fprintf(stderr, "usage: ./pi <total points> <threads>\n");
exit(1);
}
long totalpoints = atol(argv[1]);
int thread_count = atoi(argv[2]);
ppt = totalpoints / thread_count;
time_t start = time(NULL);
srand((unsigned)time(NULL));
pthread_t *threads = malloc(thread_count * sizeof(pthread_t));
pthread_attr_t attr;
pthread_attr_init(&attr);
int i;
for (i = 0; i < thread_count; i++) {
pthread_create(&threads[i], &attr, runner, (void *) NULL);
}
for (i = 0; i < thread_count; i++) {
pthread_join(threads[i], NULL);
}
pthread_mutex_destroy(&mutex);
free(threads);
double points_per_thread = 0.0;
printf("Pi: %f\n", (4. * (double)incircle) / ((double)points_per_thread * thread_count));
printf("Time: %d sec\n", (unsigned int)(time(NULL) - start));
return 0;
}
The return type of 'runner' is void *, so that is what it needs to return.
In your case, it looks like you just want to add the following as the last statement of the function:
return NULL;
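Since the assignment asks the main thread to update PI after each thread terminates, the return value can also carry the per-thread count instead of NULL. The following is a rough sketch under assumptions of my own: the name estimate_pi, the base_seed parameter and the way seeds are derived are all hypothetical, and the count is smuggled through the void * with an intptr_t cast.

```c
#define _POSIX_C_SOURCE 200809L   /* for rand_r() */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define POINTS_PER_THREAD 1000

/* Counts random points that fall inside the unit circle and hands the
 * count back through the thread's return value. */
static void *runner(void *arg)
{
    unsigned int seed = (unsigned int)(uintptr_t)arg;  /* per-thread PRNG state */
    long hits = 0;
    for (long i = 0; i < POINTS_PER_THREAD; i++) {
        double x = rand_r(&seed) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
        double y = rand_r(&seed) / ((double)RAND_MAX + 1) * 2.0 - 1.0;
        if (x * x + y * y < 1)
            hits++;
    }
    return (void *)(intptr_t)hits;   /* a void * function must return a value */
}

/* Hypothetical driver: joins the threads one after another and refines the
 * estimate after each join. Returns the final estimate, or -1.0 on failure. */
double estimate_pi(int thread_count, unsigned int base_seed)
{
    assert(thread_count > 0);

    pthread_t *threads = malloc((size_t)thread_count * sizeof *threads);
    if (threads == NULL)
        return -1.0;

    int created = 0;
    for (int i = 0; i < thread_count; i++) {
        void *seed_arg = (void *)(uintptr_t)(base_seed + (unsigned int)i);
        if (pthread_create(&threads[i], NULL, runner, seed_arg) != 0)
            break;
        created++;
    }

    long in_circle = 0, generated = 0;
    double pi = -1.0;
    for (int i = 0; i < created; i++) {
        void *ret;
        pthread_join(threads[i], &ret);        /* wait for thread i only */
        in_circle += (long)(intptr_t)ret;
        generated += POINTS_PER_THREAD;
        pi = 4.0 * (double)in_circle / (double)generated;  /* running update */
    }
    free(threads);
    return pi;
}
```

With this layout the mutex and the global incircle are no longer needed, because only the main thread ever touches the totals.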

Multithreaded matrix multiplication in C

I'm trying to write a multithreaded, high-performance matrix multiplication in C. The program below works fine when the number of cores is 12 (my PC has 12 hardware threads; it also works when I manually fix the value to 12). When I switch to a lower value (10, for example), it gives me strange results. Does anyone have an idea what the problem could be?
Tested and working perfectly with 12 cores (or threads, whichever you prefer to call them); with a lower number it no longer works (it looks like it ends execution almost immediately).
I tried different values, but there seems to be an error in the code that I can't find.
The error shows up with large matrices, and sometimes with small ones too.
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
int ops = 0;
//define matrix size (in this case we'll use a square matrix)
#define DIM 200 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int row);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int row) {
for (int i = row; i < DIM; i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
//multiplication loops
for (int i = *row_offset; i < (*row_offset + (DIM / cores)); i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
return NULL;
}
This is what the output looks like (the number of ops in the result should equal size x size):
------------- MATRIX MULTIPLICATION -------------
--- multi-thread (vectorization enabled) v1.0 ---
Currently using 4 cores
Matrix size: 200 x 200
Rows por Thread: 50
Rest: 0
created thread n 0 with argument: 0
created thread n 1 with argument: 50
created thread n 2 with argument: 100
created thread n 3 with argument: 150
... working ...
Waiting to join thread n: 0
Waiting to join thread n: 1
Waiting to join thread n: 2
Waiting to join thread n: 3
All threads joined correctly
Time elapsed is 0 seconds and 804 micros
Total multiply ops: 2200
Process finished with exit code 0
This pthread_join here looks extremely fishy -- observe how the loop variable is c, but you index the array on i:
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
I doubt it's doing the right thing.
In thread_multiply, the unguarded line:
ops++;
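looks a bit suspicious. Did you not say you were running multiple instances of these threads? ops is a shared global, so unsynchronised increments from several threads can be lost, which would explain a total below size x size.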
As a general comment, you should look to have your functions a bit better defined; for example if you changed your single_multiply to be:
int single_multiply(int RowStart, int RowEnd) {
int ops = 0;
....
return ops;
}
then
void *thread_multiply(void *p) {
int *rows = p;
int ops;
ops = single_multiply(rows[0], rows[1]);
return (void *)(intptr_t)ops; /* intptr_t is from <stdint.h>; avoids an int-to-pointer size mismatch */
}
With that change you have:
reduced the code that cares about things like 'cores' to the one place where it matters;
removed contention on the counter (you can collect the per-thread counts in pthread_join);
removed the redundant, nearly identical code.
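Filling in the outline from this answer, a compact version might look like the sketch below. It is only an illustration under my own assumptions: DIM is shrunk to 8, MAX_THREADS and multiply_parallel are hypothetical names, and the leftover rows are spread over the first threads instead of being handled by a separate single-threaded pass.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DIM 8            /* small size, for illustration only */
#define MAX_THREADS 8    /* hypothetical upper bound on the thread count */

static float matrix[DIM][DIM];
static float result_matrix[DIM][DIM];

/* Computes rows [row_start, row_end) of matrix * matrix.
 * Returns the number of result cells written. */
static int single_multiply(int row_start, int row_end)
{
    int ops = 0;                       /* local, so there is no contention */
    for (int i = row_start; i < row_end; i++) {
        for (int j = 0; j < DIM; j++) {
            float acc = 0.0f;
            for (int k = 0; k < DIM; k++)
                acc += matrix[i][k] * matrix[k][j];
            result_matrix[i][j] = acc;
            ops++;
        }
    }
    return ops;
}

static void *thread_multiply(void *p)
{
    const int *rows = p;               /* rows[0] = first row, rows[1] = end */
    int ops = single_multiply(rows[0], rows[1]);
    return (void *)(intptr_t)ops;      /* collected by pthread_join */
}

/* Returns the total number of cells computed, or -1 if a thread failed to start. */
int multiply_parallel(int n_threads)
{
    assert(n_threads >= 1 && n_threads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    int ranges[MAX_THREADS][2];
    int base = DIM / n_threads, rest = DIM % n_threads;
    int next = 0, created = 0, total = 0;

    for (int t = 0; t < n_threads; t++) {
        ranges[t][0] = next;
        next += base + (t < rest ? 1 : 0);  /* first `rest` threads get one extra row */
        ranges[t][1] = next;
    }
    for (int t = 0; t < n_threads; t++) {
        if (pthread_create(&threads[t], NULL, thread_multiply, ranges[t]) != 0)
            break;
        created++;
    }
    for (int t = 0; t < created; t++) {     /* index with t, the loop variable */
        void *ret;
        pthread_join(threads[t], &ret);
        total += (int)(intptr_t)ret;
    }
    return created == n_threads ? total : -1;
}
```

Because every thread gets an explicit start and end row, the "rest == 0" and "rest != 0" branches of the original collapse into one code path, and the ops total is exact without any locking.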
Thank you all. This is what it looks like now. Honestly, I would have expected better performance, but at least it appears to be working. Does anyone have ideas for performance improvements I could make?
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
//define matrix size (in this case we'll use a square matrix)
#define DIM 4000 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int rowStart, int rowEnd);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0, DIM);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %ld seconds and %ld micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
error = pthread_join(threads[c], NULL);
if (error != 0) {
printf("\n Error in thread %i join", c);
}
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM - rest; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[c], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows, DIM);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %ld seconds and %ld micros\n", seconds, micros);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int rowStart, int rowEnd) {
for (int i = rowStart; i < rowEnd; i++) {
//printf("\n %i", i);
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
printf(" Starting at line %i ending at line %i \n ", *row_offset, *row_offset + (DIM / cores));
single_multiply(*row_offset, *row_offset + (DIM / cores));
printf("\n ended at line %i", *row_offset + (DIM / cores));
return NULL;
}

No real acceleration with pthreads

I am trying to implement the multithreaded version of the Monte-Carlo algorithm. Here is my code:
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include <semaphore.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#define MAX_THREADS 12
#define MAX_DOTS 10000000
double sum = 0.0;
sem_t sem;
void reset() {
sum = 0.0;
}
void* check_dot(void* _iterations) {
int* iterations = (int*)_iterations;
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) {
sem_wait(&sem);
sum += x * y;
sem_post(&sem);
}
}
return NULL;
}
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand() % 314) / 100;
double y = (double)(rand() % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
if(!advanced) sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) {
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
if(clock_gettime(CLOCK_REALTIME, &end) == -1) {
perror("Unable to get time");
exit(-1);
}
if(!advanced) sem_destroy(&sem);
elapsed = end.tv_sec - begin.tv_sec;
elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
return elapsed;
}
int main(int argc, char** argv) {
bool advanced = false;
char* filename = NULL;
for(int i = 1; i < argc; ++i) {
if(strcmp(argv[i], "-o") == 0 && argc > i + 1) {
filename = argv[i + 1];
++i;
}
else if(strcmp(argv[i], "-a") == 0 || strcmp(argv[i], "--advanced") == 0) {
advanced = true;
}
}
if(!filename) {
fprintf(stderr, "You should provide the name of the output file.\n");
exit(-1);
}
FILE* fd = fopen(filename, "w");
if(!fd) {
perror("Unable to open file");
exit(-1);
}
srand(time(NULL));
double worst_time = run(1, advanced);
double result = (3.14 / MAX_DOTS) * sum;
reset();
fprintf(fd, "Result: %f\n", result);
for(int i = 2; i <= MAX_THREADS; ++i) {
double time = run(i, advanced);
double accel = time / worst_time;
fprintf(fd, "%d:%f\n", i, accel);
reset();
}
fclose(fd);
return 0;
}
However, I can't see any real acceleration as the number of threads increases (regardless of which check_dot() function I use). I ran this code on my laptop with an Intel Core i7-3517U (lscpu says that it has 4 independent CPUs), and the number of threads does not seem to reduce the execution time of my program:
Number of threads: 1, working time: 0.847277 s
Number of threads: 2, working time: 3.133838 s
Number of threads: 3, working time: 2.331216 s
Number of threads: 4, working time: 3.011819 s
Number of threads: 5, working time: 3.086003 s
Number of threads: 6, working time: 3.118296 s
Number of threads: 7, working time: 3.058180 s
Number of threads: 8, working time: 3.114867 s
Number of threads: 9, working time: 3.179515 s
Number of threads: 10, working time: 3.025266 s
Number of threads: 11, working time: 3.142141 s
Number of threads: 12, working time: 3.064318 s
I expected a roughly linear relationship between the execution time and the number of working threads, at least for the first four values (the more threads, the shorter the execution time), but the times are nearly equal. Is there a real problem in my code, or am I expecting too much?
The problem you are experiencing is that the internal state of rand() is a shared resource between all threads, so the threads are going to serialise on access to rand().
You need to use a pseudo-random number generator with per-thread state - the rand_r() function (although marked obsolete in the latest version of POSIX) can be used as such. For serious work you would be best off importing the implementation of some specific PRNG algorithm such as Mersenne Twister.
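As an illustration of per-thread generator state, here is a minimal sketch using rand_r(). Several things in it are my own assumptions rather than the asker's code: the struct worker type, the names count_hits and quarter_circle_fraction, the seed spacing, and the hit test itself, which is swapped from y <= sin(x) to a quarter-circle test so that the sketch does not depend on the math library.

```c
#define _POSIX_C_SOURCE 200809L   /* for rand_r() */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 12

/* Hypothetical per-thread record. */
struct worker {
    unsigned int seed;     /* initial PRNG state for this thread */
    long iterations;       /* number of points to generate */
    long hits;             /* result, written once at the end */
};

static void *count_hits(void *arg)
{
    struct worker *w = arg;
    unsigned int seed = w->seed;   /* local copy: state no other thread touches */
    long hits = 0;
    for (long i = 0; i < w->iterations; ++i) {
        double x = rand_r(&seed) / ((double)RAND_MAX + 1.0);
        double y = rand_r(&seed) / ((double)RAND_MAX + 1.0);
        if (x * x + y * y < 1.0)   /* stand-in for the y <= sin(x) test */
            ++hits;
    }
    w->hits = hits;
    return NULL;
}

/* Returns the fraction of points inside the quarter circle (about pi/4),
 * or -1.0 if no thread could be started. */
double quarter_circle_fraction(int threads_num, long total_dots,
                               unsigned int base_seed)
{
    assert(threads_num >= 1 && threads_num <= MAX_THREADS);
    assert(total_dots >= threads_num);

    pthread_t threads[MAX_THREADS];
    struct worker workers[MAX_THREADS];
    long per_thread = total_dots / threads_num;
    int created = 0;

    for (int t = 0; t < threads_num; ++t) {
        /* arbitrary seed spacing, chosen only for this sketch */
        workers[t].seed = base_seed + (unsigned int)t * 7919u;
        workers[t].iterations = per_thread;
        workers[t].hits = 0;
        if (pthread_create(&threads[t], NULL, count_hits, &workers[t]) != 0)
            break;
        ++created;
    }

    long hits = 0;
    for (int t = 0; t < created; ++t) {
        pthread_join(threads[t], NULL);
        hits += workers[t].hits;
    }
    if (created == 0)
        return -1.0;
    return (double)hits / ((double)per_thread * created);
}
```

The inner loop no longer touches any shared state, so the threads have nothing to serialise on. Whether that translates into linear scaling still depends on the machine and on how the timing is taken.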
I was able to get the kind of timing and scaling measurements you were hoping for with two changes to your code.
First, rand() is not thread safe. Replacing the calls with rand_r(&seed) in the advanced check_dots showed continued scaling as the thread count increased. I think rand() might have an internal lock that serializes execution and prevents any speedup. This change alone shows some scaling, from 1.23 s down to 0.55 s (5 threads).
Second, I introduced barriers around the core execution region so that the cost of serially creating and joining threads, and of the malloc calls, is not included. The core execution region shows good scaling, from 1.23 s down to 0.18 s (8 threads).
Code was compiled with gcc -O3 -pthread mcp.c -std=c11 -lm, run on Intel E3-1240 v5 (4 cores, HT), Linux 3.19.0-68-generic. Single measurements reported.
pthread_barrier_t bar;
void* check_dots_advanced(void* _iterations) {
int* iterations = (int*)_iterations;
double* res = (double*)malloc(sizeof(double));
*res = 0.0; // malloc does not zero the memory, so initialise the accumulator
sem_wait(&sem);
unsigned int seed = rand();
sem_post(&sem);
pthread_barrier_wait(&bar);
for(int i = 0; i < *iterations; ++i) {
double x = (double)(rand_r(&seed) % 314) / 100;
double y = (double)(rand_r(&seed) % 100) / 100;
if(y <= sin(x)) *res += x * y;
}
pthread_barrier_wait(&bar);
pthread_exit((void*)res);
}
double run(int threads_num, bool advanced) {
sem_init(&sem, 0, 1);
struct timespec begin, end;
double elapsed;
pthread_t threads[threads_num];
int iters = MAX_DOTS / threads_num;
pthread_barrier_init(&bar, NULL, threads_num + 1); // barrier init
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
}
pthread_barrier_wait(&bar); // wait until threads are ready
if(clock_gettime(CLOCK_REALTIME, &begin) == -1) { // begin time
perror("Unable to get time");
exit(-1);
}
pthread_barrier_wait(&bar); // wait until threads finish
if(clock_gettime(CLOCK_REALTIME, &end) == -1) { // end time
perror("Unable to get time");
exit(-1);
}
for(int i = 0; i < threads_num; ++i) {
if(!advanced) pthread_join(threads[i], NULL);
else {
void* tmp;
pthread_join(threads[i], &tmp);
sum += *((double*)tmp);
free(tmp);
}
}
pthread_barrier_destroy(&bar);

Multiple threads increment shared variable without locks but return "unexpected" output

I'm invoking 100 threads, and each thread should increment a shared variable 1000 times. Since I'm doing this without using mutex locks, the expected output should NOT be 100000. However, I'm still getting 100000 every time.
This is what I have:
volatile unsigned int count = 0;
void *increment(void *vargp);
int main() {
fprintf(stdout, "Before: count = %d\n", count);
int j;
// run 10 times to test output
for (j = 0; j < 10; j++) {
// reset count every time
count = 0;
int i;
// create 100 threads
for (i = 0; i < 100; i++) {
pthread_t thread;
Pthread_create(&thread, NULL, increment, NULL);
Pthread_join(thread, NULL);
}
fprintf(stdout, "After: count = %d\n", count);
}
return 0;
}
void *increment(void *vargp) {
int c;
// increment count 1000 times
for (c = 0; c < 1000; c++) {
count++;
}
return NULL;
}
The pthread_join() function suspends execution of the calling thread until the target thread terminates. Because you join each thread immediately after creating it, each one runs to completion before the next is created. The threads never execute concurrently, so there is no race condition.
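To actually let the threads overlap, the creation and the joining have to happen in separate loops. A minimal sketch, assuming plain pthread_create/pthread_join instead of the capitalised wrappers in the question (run_concurrently and the two macros are hypothetical names):

```c
#include <assert.h>
#include <pthread.h>

#define N_THREADS 100
#define N_INCREMENTS 1000

static volatile unsigned int count = 0;

static void *increment(void *vargp)
{
    (void)vargp;
    for (int c = 0; c < N_INCREMENTS; c++)
        count++;   /* unsynchronised read-modify-write: this is the data race */
    return NULL;
}

/* Starts every thread before joining any of them, then returns the final count. */
unsigned int run_concurrently(void)
{
    pthread_t threads[N_THREADS];
    int created = 0;

    count = 0;
    for (int i = 0; i < N_THREADS; i++) {      /* first loop: create all */
        if (pthread_create(&threads[i], NULL, increment, NULL) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)          /* second loop: join all */
        pthread_join(threads[i], NULL);

    assert(created == N_THREADS);              /* this sketch treats failure as a bug */
    return count;
}
```

Even with this change the result may often still be 100000, because 1000 increments can finish before the next thread has even started. Raising N_INCREMENTS by a few orders of magnitude usually makes the lost updates visible, although a race by its nature gives no guarantee either way.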

Non recursive factorial in C

I have a simple question for you. I made this code to calculate the factorial of a number without recursion.
int fact2(int n){
int aux=1, total = 1;
int i;
int limit = n - 1;
for (i=1; i<=limit; i+=2){
aux = i*(i+1);
total = total*aux;
}
for (;i<=n;i++){
total = total*i;
}
return total;
}
As you can see, my code uses loop unrolling to optimize clock cycles in the execution. Now I'm asked to add two-way parallelism to the same code, any idea how?
You can use the pthreads library to create two separate threads. Each thread should do half of the multiplications. I put together the following solution.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct {
int id;
int num;
int *result;
} thread_arg_t;
void* thread_func(void *arg) {
int i;
thread_arg_t *th_arg = (thread_arg_t *)arg;
int start, end;
if(th_arg->id == 0) {
start = 1;
end = th_arg->num/2;
} else if (th_arg->id == 1) {
start = th_arg->num / 2;
if (start < 1) start = 1; /* for n < 2, avoid multiplying by 0 */
end = th_arg->num + 1;
} else {
return NULL;
}
for(i=start; i < end; i++) {
th_arg->result[th_arg->id] *= i;
}
return NULL;
}
int factorial2(int n) {
pthread_t threads[2];
int rc;
int i;
int result[2] = {1, 1}; /* each partial product starts at 1 */
thread_arg_t th_arg[2];
for(i=0; i<2; i++) {
th_arg[i].id = i;
th_arg[i].num = n;
th_arg[i].result = result;
rc = pthread_create(&threads[i], NULL, thread_func, (void *)&th_arg[i]);
if (rc){
printf("pthread_create() failed, rc = %d\n", rc);
exit(1);
}
}
/* wait for threads to finish */
for(i=0; i<2; i++) {
pthread_join(threads[i], NULL);
}
/* compute final one multiplication */
return (result[0] * result[1]);
}
The two threads can then run in parallel when more than one core is available. Also, this example can be generalized to N threads with minor modifications.
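One possible shape of that N-thread generalization is sketched below. The names struct fact_part, multiply_range, factorial_parallel and MAX_THREADS are hypothetical, and the result type is widened to unsigned long long, which limits n to 20 before the product overflows 64 bits.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16   /* hypothetical upper bound */

/* One contiguous slice [start, end) of the factors 1..n. */
struct fact_part {
    int start;
    int end;
    unsigned long long product;   /* written by the thread */
};

static void *multiply_range(void *arg)
{
    struct fact_part *p = arg;
    unsigned long long prod = 1;              /* empty slice yields 1 */
    for (int i = p->start; i < p->end; i++)
        prod *= (unsigned long long)i;
    p->product = prod;
    return NULL;
}

/* Computes n! with n_threads threads. Returns 0 if a thread could not be
 * created (0 is never a valid factorial). */
unsigned long long factorial_parallel(int n, int n_threads)
{
    assert(n >= 0 && n <= 20);                /* 21! overflows 64 bits */
    assert(n_threads >= 1 && n_threads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    struct fact_part parts[MAX_THREADS];
    int base = n / n_threads, rest = n % n_threads;
    int next = 1, created = 0;

    for (int t = 0; t < n_threads; t++) {
        parts[t].start = next;
        next += base + (t < rest ? 1 : 0);    /* spread the remainder */
        parts[t].end = next;
        parts[t].product = 1;
    }
    for (int t = 0; t < n_threads; t++) {
        if (pthread_create(&threads[t], NULL, multiply_range, &parts[t]) != 0)
            break;
        created++;
    }

    unsigned long long result = 1;
    for (int t = 0; t < created; t++) {
        pthread_join(threads[t], NULL);
        result *= parts[t].product;           /* combine the partial products */
    }
    return created == n_threads ? result : 0;
}
```

For example, factorial_parallel(5, 2) splits the factors into 1..3 and 4..5. For inputs this small the cost of creating threads far exceeds the multiplications themselves, so this is useful as an exercise in partitioning work rather than as an optimization.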

Resources