Taylor series using OpenMP

e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + x^5/5! + ...
I have converted the Taylor series of e^x (above) into an OpenMP program.
All the code is below.
When I run it on Ubuntu inside an Oracle virtual machine, it works.
It gives me e^0=1, e^1=2.718, e^2=7.389056.
But when I run it on Ubuntu natively (not virtualized), it doesn't work correctly.
It gives me e^0=nan, e^1=0.40.., e^2=4.780.
The output is essentially random; it is not exact like the values above.
I need help.
#include <math.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
long double x, fact[150], pwr[150], s[1];
int i, term;
void *Power(void *temp) {
int k;
for (k = 0; k < 150; k++) {
pwr[k] = pow(x, k);
//printf("%.2Lf\n", pwr[k]);
}
return pwr;
}
void *Fact(void *temp) {
long double f;
int j;
fact[0] = 1.0;
for (term = 1; term < 150; term++) {
f = 1.0;
for (j = term; j > 0; j--)
f = f * j;
fact[term] = f;
//printf("%.2Lf\n", fact[term]);
}
return fact;
}
void *Exp(void *temp) {
int t;
s[0] = 0;
for (t = 0; t < 150; t++)
s[0] = s[0] + (pwr[t] / fact[t]);
return s;
}
int main(void) {
pthread_t thread1, thread2, thread3;
long double **sum;
printf("Exponential [PROMPT] Enter the value of x (between 0 to 100) (for calculating exp(x)):");
scanf("%Lf", &x);
printf("\nExponential [INFO] Threads creating.....\n");
pthread_create(&thread1, NULL, Power, NULL); //calling power function
pthread_create(&thread2, NULL, Fact, NULL); //calling factorial function
printf("Exponential [INFO] Threads created\n");
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
printf("Exponential [INFO] Master thread and terminated threads are joining\n");
printf("Exponential [INFO] Result collected in Master thread\n");
pthread_create(&thread3, NULL, Exp, NULL);
pthread_join(thread3, sum);
printf("\neXPONENTIAL [INFO] Value of exp(%.2Lf) is : %Lf\n\n", x, s[0]);
exit(1);
}
The code above is the original pthreads version for e^x, which works.
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
long double x, f, fact[150], pwr[150], s[1];
int i, term, k, j, t;
long double sum;
printf("Exponential [PROMPT] Enter the value of x (between 0 to 100) (for calculating exp(x)):");
scanf("%Lf", &x);
#pragma omp parallel num_threads(10)
#pragma omp for
for (k = 0; k < 150; k++) {
for (int h = 0; h <= k; h++) {
if (h == 0)
x = 1;
else
pwr[k] = pow(x, k);
}
#pragma omp for
for (term = 1; term < 150; term++) {
f = 1.0;
for (j = term; j > 0; j--)
f = f * j;
fact[term] = f;
}
#pragma omp for
for (t = 0; t < 150; t++)
s[0] = s[0] + (pwr[t] / fact[t]);
printf("\neXPONENTIAL [INFO] Value of exp(%.2Lf) is : %Lf\n\n", x, s[0]);
exit(1);
}
The code above is my conversion of the previous code to OpenMP.

for (t = 0; t < 150; t++)
s[0] = s[0] + (pwr[t] / fact[t]);
When this loop is parallelized, every thread updates the same variable, s[0], concurrently with its partial results. That can only work if the threads are coordinated somehow. Fortunately, OpenMP has a dedicated clause, reduction, for computing sums, so you can fix this easily.
In the pthread version of the code, a single thread does this calculation, so there is no problem there.
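The reduction fix described above could look roughly like the sketch below. The function name taylor_exp and the MAX_TERMS constant are hypothetical, and the terms are built from running products instead of pow() and a nested factorial loop, which is a substitution made here for brevity rather than the asker's approach. The accumulator is a plain scalar (sum) instead of s[0], on the assumption that a scalar is the simplest thing to name in a reduction clause. With GCC, compile with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

#define MAX_TERMS 150  /* hypothetical cap, matching the array size in the question */

/* Approximates e^x with the first `terms` terms of its Taylor series. */
long double taylor_exp(long double x, int terms)
{
    long double pwr[MAX_TERMS], fact[MAX_TERMS];
    long double sum = 0.0L;   /* explicitly initialised accumulator */

    assert(terms >= 1 && terms <= MAX_TERMS);

    /* Serial setup: x^k and k! as running products. */
    pwr[0] = 1.0L;
    fact[0] = 1.0L;
    for (int k = 1; k < terms; k++) {
        pwr[k] = pwr[k - 1] * x;
        fact[k] = fact[k - 1] * k;
    }

    /* Each thread accumulates a private partial sum; OpenMP adds the
       partial sums together when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int t = 0; t < terms; t++)
        sum += pwr[t] / fact[t];

    return sum;
}
```

A call such as taylor_exp(1.0L, 150) can then be printed with the %Lf format, as in the question. Note that the sketch also initialises the accumulator and fact[0], which the OpenMP code in the question never does.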

Related

Multithreaded matrix multiplication in C

I'm trying to write a multithreaded, high-performance matrix multiplication in C. The program below works fine when the number of cores is 12 (my PC has 12 hardware threads; it also works when I set the value to 12 manually). When I switch to a lower value (10, for example), it gives strange results. Does anyone have an idea what the problem could be?
Tested and working perfectly with 12 cores (or threads, call them whatever you want); with a lower number of cores it no longer works (it looks like execution ends almost immediately).
I tried different values, but there seems to be an error in the code that I can't find.
The error shows up with large matrices, and sometimes with small ones too.
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
int ops = 0;
//define matrix size (in this case we'll use a square matrix)
#define DIM 200 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int row);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %d seconds and %d micros\n", seconds, micros);
//this value may not be correct if matrix size exceeds 80x80 due to thread lock problems
printf("\n Total multiply ops: %i", ops);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int row) {
for (int i = row; i < DIM; i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
//multiplication loops
for (int i = *row_offset; i < (*row_offset + (DIM / cores)); i++) {
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
ops++;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
return NULL;
}
This is what the output looks like (the number of ops reported should also equal size x size):
------------- MATRIX MULTIPLICATION -------------
--- multi-thread (vectorization enabled) v1.0 ---
Currently using 4 cores
Matrix size: 200 x 200
Rows por Thread: 50
Rest: 0
created thread n 0 with argument: 0
created thread n 1 with argument: 50
created thread n 2 with argument: 100
created thread n 3 with argument: 150
... working ...
Waiting to join thread n: 0
Waiting to join thread n: 1
Waiting to join thread n: 2
Waiting to join thread n: 3
All threads joined correctly
Time elapsed is 0 seconds and 804 micros
Total multiply ops: 2200
Process finished with exit code 0

This pthread_join here looks extremely fishy -- observe how the loop variable is c, but you index the array on i:
for (int c = 0; c < cores; c++) {
pthread_join(threads[i], NULL);
printf("\n Waiting to join thread n: %i", c);
}
I doubt it's doing the right thing.

In thread_multiply, the unadorned line:
ops++;
looks a bit suspicious. Did you not say you were running multiple instances of these threads?
As a general comment, you should look to have your functions a bit better defined; for example if you changed your single_multiply to be:
int single_multiply(int RowStart, int RowEnd) {
int ops = 0;
....
return ops;
}
then
void *thread_multiply(void *p) {
int *rows = p;
int ops;
ops = single_multiply(rows[0], rows[1]);
return (void *)(intptr_t)ops; /* needs <stdint.h>; avoids an int-to-pointer size warning */
}
With that change you have:
reduced the code that cares about things like 'cores' to the only part where they matter;
removed contention on the counter (you can collect the per-thread counts in pthread_join);
removed the redundant, nearly identical code.
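The restructuring suggested above, together with joining on the loop variable, might be sketched as follows. This is an illustration under assumptions, not the asker's final code: DIM is shrunk to 8, and MAX_THREADS and multiply_with_threads are hypothetical names. Each thread receives a {row_start, row_end} pair, counts its own operations privately, and hands the count back through pthread_join. The range arithmetic also covers the case where the thread count does not divide DIM, so no separate remainder branch is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define DIM 8            /* small hypothetical size for illustration */
#define MAX_THREADS 64   /* hypothetical upper bound */

static float matrix[DIM][DIM];
static float result_matrix[DIM][DIM];

/* Multiplies rows [row_start, row_end) of matrix by matrix and returns
   the number of result cells written. */
static int single_multiply(int row_start, int row_end)
{
    int ops = 0;   /* private to this call, so no contention */
    for (int i = row_start; i < row_end; i++) {
        for (int j = 0; j < DIM; j++) {
            float acc = 0.0f;
            for (int k = 0; k < DIM; k++)
                acc += matrix[i][k] * matrix[k][j];
            result_matrix[i][j] = acc;
            ops++;
        }
    }
    return ops;
}

/* Thread entry point: p points at two ints, {row_start, row_end}. */
static void *thread_multiply(void *p)
{
    const int *rows = p;
    int ops = single_multiply(rows[0], rows[1]);
    return (void *)(intptr_t)ops;   /* small count carried in the pointer value */
}

/* Splits the rows over `threads` workers and returns the total op count. */
int multiply_with_threads(int threads)
{
    pthread_t tid[MAX_THREADS];
    int ranges[MAX_THREADS][2];
    int started[MAX_THREADS];
    int total = 0;

    assert(threads >= 1 && threads <= MAX_THREADS);

    for (int c = 0; c < threads; c++) {
        /* Integer arithmetic spreads any remainder across the ranges. */
        ranges[c][0] = (int)((long long)DIM * c / threads);
        ranges[c][1] = (int)((long long)DIM * (c + 1) / threads);
        started[c] = (pthread_create(&tid[c], NULL, thread_multiply, ranges[c]) == 0);
        if (!started[c])   /* fall back to the calling thread */
            total += single_multiply(ranges[c][0], ranges[c][1]);
    }
    for (int c = 0; c < threads; c++) {
        void *ret;
        if (started[c] && pthread_join(tid[c], &ret) == 0)   /* index with c */
            total += (int)(intptr_t)ret;
    }
    return total;
}
```

Whatever the thread count, the returned total should equal DIM x DIM, which is the check the question was unable to pass with a shared ops counter.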

Thank you all. This is what it looks like now. Honestly, I would have expected better performance, but at least it appears to be working. Does anyone have ideas for performance improvements I could make?
//
// Created by christian on 06/09/2019.
//
#pragma GCC optimize("O3", "unroll-loops", "omit-frame-pointer", "inline") //Optimization flags
#pragma GCC option("arch=native", "tune=native", "no-zero-upper") //Enable AVX
#pragma GCC target("avx") //Enable AVX
#include <time.h> // for clock_t, clock(), CLOCKS_PER_SEC
#include <sys/time.h>
#include <stdio.h> //AVX/SSE Extensions are included in stdio.h
#include <unistd.h>
#include <stdlib.h>
#include <pthread.h>
//define matrix size (in this case we'll use a square matrix)
#define DIM 4000 //DO NOT EXCEED 10000 (modification to the stack size needed)
float matrix[DIM][DIM];
float result_matrix[DIM][DIM];
float *matrix_ptr = (float *) &matrix;
float *result_ptr = (float *) &result_matrix;
// set the number of logical cores to 1 (just in case the auto-detection doesn't work properly)
int cores = 1;
//functions prototypes
void single_multiply(int rowStart, int rowEnd);
void *thread_multiply(void *offset);
int detect_number_of_cores();
void fill_matrix();
int main() {
//two instructions needed for pseudo-random float numbers
srand((unsigned int) time(NULL));
//detect the number of active cores
cores = detect_number_of_cores();
//matrix filling with random float values
fill_matrix();
printf("------------- MATRIX MULTIPLICATION -------------\n");
printf("--- multi-thread (vectorization enabled) v1.0 ---\n");
// printf("\n ORIGINAL MATRIX");
// for(int c=0; c<DIM; c++){
// printf("\n");
// for(int k=0; k<DIM; k++){
// printf("%f \t", matrix[c][k]);
// }
// }
//uncomment and modify this value to force a particular number of threads (not recommended)
//cores = 4;
printf("\n Currently using %i cores", cores);
printf("\n Matrix size: %i x %i", DIM, DIM);
//time detection struct declaration
struct timeval start, end;
gettimeofday(&start, NULL);
//decisional tree for the number of threads to be used
if (cores == 0 || cores == 1 || cores > DIM) {
//passing 0 because it has to start from the first row
single_multiply(0, DIM);
gettimeofday(&end, NULL);
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %ld seconds and %ld micros\n", seconds, micros);
return 0;
} else {
//split the matrix in more parts (as much as the number of active cores)
int rows_por_thread = DIM / cores;
printf("\n Rows por Thread: %i", rows_por_thread);
//calculate the rest of the division (if there is one obviously)
int rest = DIM % cores;
printf("\n Rest: %i \n", rest);
if (rest == 0) {
//execute just the multi-thread function n times
int times = rows_por_thread;
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0;
int error;
for (int c = 0; c < DIM; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...");
for (int c = 0; c < cores; c++) {
error = pthread_join(threads[c], NULL);
if (error != 0) {
printf("\n Error in thread %i join", c);
}
printf("\n Waiting to join thread n: %i", c);
}
} else {
//THE PROBLEM MUST BE INSIDE THIS ELSE STATEMENT
//execute the multi-thread function n times and the single function th rest remaining times
printf("\n The number of cores is NOT a divisor of the size of the matrix. \n");
//create an array of thread-like objects
pthread_t threads[cores];
//create an array with the arguments for each thread
int thread_args[cores];
//launching the threads according to the available cores
int i = 0; //counter for the thread ID
int entrypoint_residual_rows = 0; //first unprocessed residual row
//launching the threads according to the available coreS
for (int c = 0; c < DIM - rest; c += rows_por_thread) {
thread_args[i] = c;
i++;
}
entrypoint_residual_rows = cores * rows_por_thread;
int error;
//launch the threads
for (int c = 0; c < cores; c++) {
error = pthread_create(&threads[c], NULL, thread_multiply, (void *) &thread_args[c]);
if (error != 0) {
printf("\n Error in thread %i creation, exiting...", c);
}
printf("created thread n %i with argument: %i \n", c, thread_args[c]);
}
printf("\n ... working ...\n");
//join all the previous generated threads
for (int c = 0; c < cores; c++) {
pthread_join(threads[c], NULL);
printf("\n Waiting to join thread n: %i", c);
}
printf("\n entry-point index for the single function %i ", entrypoint_residual_rows);
single_multiply(entrypoint_residual_rows, DIM);
}
}
// printf("\n MULTIPLIED MATRIX");
// for (int c = 0; c < DIM; c++) {
// printf("\n");
// for (int k = 0; k < DIM; k++) {
// printf("%f \t", result_matrix[c][k]);
// }
// }
gettimeofday(&end, NULL);
printf("\n All threads joined correctly");
long seconds = (end.tv_sec - start.tv_sec);
long micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("\n\n Time elapsed is %ld seconds and %ld micros\n", seconds, micros);
return 0;
}
//detect number of cores of the CPU (logical cores)
int detect_number_of_cores() {
return (int) sysconf(_SC_NPROCESSORS_ONLN); // Get the number of logical CPUs.
}
//matrix filling function
void fill_matrix() {
float a = 5.0;
for (int c = 0; c < DIM; c++)
for (int d = 0; d < DIM; d++) {
matrix[c][d] = (float) rand() / (float) (RAND_MAX) * a;
}
}
//row by row multiplication algorithm (mono-thread version)
void single_multiply(int rowStart, int rowEnd) {
for (int i = rowStart; i < rowEnd; i++) {
//printf("\n %i", i);
for (int j = 0; j < DIM; j++) {
*(result_ptr + i * DIM + j) = 0;
for (int k = 0; k < DIM; k++) {
*(result_ptr + i * DIM + j) += *(matrix_ptr + i * DIM + k) * *(matrix_ptr + k * DIM + j);
}
}
}
}
//thread for the multiplication algorithm
void *thread_multiply(void *offset) {
//de-reference the parameter passed by the main-thread
int *row_offset = (int *) offset;
printf(" Starting at line %i ending at line %i \n ", *row_offset, *row_offset + (DIM / cores));
single_multiply(*row_offset, *row_offset + (DIM / cores));
printf("\n ended at line %i", *row_offset + (DIM / cores));
return NULL;
}

Adding more threads to program resulted in longer execution time for calculating trapezoidal integration

I am working on a multi-threaded numerical integration program using the trapezoidal rule.
I have a struct which contains six items:
typedef struct trapezoidalIntegrationThread{
float a;
float b;
int n;
float h;
double res;
float elTime;
}threadParams;
a is the left end point, b is the right end point, n is the number of trapezoids, h is the height, res is the result calculated within compute_using_pthreads, and finally, elTime is the elapsed time of compute_using_pthreads for benchmarking.
Here is my code in main:
int n = NUM_TRAPEZOIDS;
float a = LEFT_ENDPOINT;
float b = RIGHT_ENDPOINT;
pthread_t masterThread;
pthread_t slaveThread[NUM_THREADs];
threadParams *trapThread;
for(i = 0; i < NUM_THREADs; i++) {
trapThread = (threadParams *) malloc(sizeof(threadParams));
trapThread->a = a;
trapThread->b = b;
trapThread->n = n;
trapThread->h = (b - a) / (float) n;
if (pthread_create(&slaveThread[i], NULL, compute_using_pthreads, (void *) trapThread) != 0) {
printf("Looks like something went wrong..\n");
return -1;
}
}
for(i = 0; i < NUM_THREADs; i++) {
pthread_join(slaveThread[i], NULL);
}
pthread_exit((void *) masterThread);
I am basically creating the number of threads defined in NUM_THREADs (let's assume this value is 4). For each thread I allocate the memory the struct needs and fill it in from the pre-defined values:
#define LEFT_ENDPOINT 5
#define RIGHT_ENDPOINT 1000
#define NUM_TRAPEZOIDS 100000000
#define NUM_THREADs 8 /* Number of threads to run. */
Next, I create my pthreads, and call the compute_using_pthreads function:
void *compute_using_pthreads(void *inputs)
{
double integral;
int k;
threadParams *args = (threadParams *) inputs;
unsigned long p_micros = 0;
float p_millis = 0.0;
clock_t p_start, p_end;
float a = args->a;
float b = args->b;
int n = args->n;
float h = args->h;
p_start = clock();
integral = (f(a) + f(b))/2.0;
for (k = 1; k <= n-1; k++) {
integral += f(a+k*h);
}
integral = integral*h;
p_end = clock();
p_micros = p_end - p_start;
p_millis = p_micros / 1000;
args->res = integral;
args->elTime = p_millis;
return NULL; /* a void * function must return a value */
}
I ran this program and compared it against a non-multithreaded function:
double compute_gold(float a, float b, int n, float h)
{
double integral;
int k;
integral = (f(a) + f(b))/2.0;
for (k = 1; k <= n-1; k++) {
integral += f(a+k*h);
}
integral = integral*h;
return integral;
}
So here are the results:
Run-time of compute_gold:
~3000 ms
Run-time of compute_using_pthreads:
Using 1 thread: ~3000 ms
Using 2 threads: ~6000 ms
Using 4 threads: ~12000 ms
....
So for some reason, with n threads the execution takes about n times as long. I can't for the life of me figure out why this is happening, as I am quite new to C programming =/
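No answer is recorded here for this question, so what follows is only a likely explanation, offered as an assumption. In the code as posted, every thread receives the same a, b, and n, so each thread integrates the whole interval rather than a share of it. In addition, clock() reports CPU time used by the process, which on Linux adds up across all running threads, so the measured figure grows with the thread count even when wall-clock time does not. A minimal sketch of splitting the work is shown below; the integrand f, the slice_t struct, MAX_THREADS, and trapezoid_parallel are all hypothetical names, since the question does not show its f().

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 64   /* hypothetical upper bound */

/* Hypothetical integrand; the question does not show its f(). */
static double f(double x) { return x * x; }

typedef struct {
    double a;          /* left end point of the whole interval */
    double h;          /* width of one trapezoid */
    long long first;   /* first interior point index for this thread */
    long long last;    /* one past the last interior point index */
    double res;        /* partial sum, written by the thread */
} slice_t;

/* Sums f at the interior points assigned to this thread. */
static void *slice_worker(void *arg)
{
    slice_t *s = arg;
    double sum = 0.0;
    for (long long k = s->first; k < s->last; k++)
        sum += f(s->a + (double)k * s->h);
    s->res = sum;
    return NULL;
}

/* Trapezoidal rule on [a, b] with n trapezoids, split over `threads`. */
double trapezoid_parallel(double a, double b, long long n, int threads)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int started[MAX_THREADS];

    assert(n >= 1 && threads >= 1 && threads <= MAX_THREADS);

    double h = (b - a) / (double)n;
    double sum = (f(a) + f(b)) / 2.0;   /* end points, weighted by 1/2 */
    long long interior = n - 1;         /* interior points k = 1 .. n-1 */

    for (int t = 0; t < threads; t++) {
        slice[t].a = a;
        slice[t].h = h;
        slice[t].first = 1 + interior * t / threads;
        slice[t].last = 1 + interior * (t + 1) / threads;
        slice[t].res = 0.0;
        started[t] = (pthread_create(&tid[t], NULL, slice_worker, &slice[t]) == 0);
        if (!started[t])
            slice_worker(&slice[t]);    /* fall back to the calling thread */
    }
    for (int t = 0; t < threads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        sum += slice[t].res;            /* combine after the thread is done */
    }
    return sum * h;
}
```

For benchmarking, a wall-clock source such as clock_gettime with CLOCK_MONOTONIC, read once before creating the threads and once after joining them, is probably a better fit than per-thread clock() calls.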

Why is my parallel code slower than serial?

Issue
Hello everyone, I have a program (from the net) that I intend to speed up by converting it into a parallel version using pthreads. Surprisingly, though, it runs slower than the serial version. Below is the program:
# include <stdio.h>
//fast square root algorithm
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//number generator iterated from 0 to n
int main()
{
n = 1000000; //maximum number
int k,j;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
First attempt for parallelization
I let the pthread manage the for loop
# include <stdio.h>
.
.
int main()
{
.
.
//----->pthread code here<----
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
Well, it runs slower than the serial one.
Second attempt
I divided the for loop between two threads and ran them in parallel using pthreads.
However, it still runs slower. I expected it to run about twice as fast, or at least noticeably faster, but it doesn't!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct arg_struct
{
int initialPrime;
int nextPrime;
};
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
void *parallel_launcher(void *arguments)
{
struct arg_struct *args = (struct arg_struct *)arguments;
int j = args -> initialPrime;
int n = args -> nextPrime - 1;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1)
{
printf("This is prime: %d\n",j);
pthread_mutex_lock( &mutex1 );
k++;
pthread_mutex_unlock( &mutex1 );
}
if(j == n) printf("Count: %d\n",k);
}
pthread_exit(NULL);
}
int main()
{
int f = 100000000;
int m;
pthread_t thread_id[NTHREADS];
struct arg_struct args;
int rem = (f+1)%NTHREADS;
int n = floor((f+1)/NTHREADS);
for(int h = 0; h < NTHREADS; h++)
{
if(rem > 0)
{
m = n + 1;
rem-= 1;
}
else if(rem == 0)
{
m = n;
}
args.initialPrime = args.nextPrime;
args.nextPrime = args.initialPrime + m;
pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
pthread_join(thread_id[h], NULL);
}
// printf("Count: %d\n",k);
return 0;
}
Note:
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical core, 4 logical),
Mem: 6 Gb,
HDD: 340 Gb,

You need to split the range you are examining for primes up into n parts, where n is the number of threads.
The code that each thread runs becomes:
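#include <stdint.h> /* intptr_t */
typedef struct start_end {
int start;
int end;
} start_end_t;
/* thread entry points must have the signature void *(*)(void *) */
void *find_primes_in_range(void *in) {
start_end_t *start_end = (start_end_t *) in;
int num_primes = 0;
for (int j = start_end->start; j <= start_end->end; j++) {
if (isPrime(j) == 1)
num_primes++;
}
/* pass the count back through the thread's exit value */
pthread_exit((void *) (intptr_t) num_primes);
}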
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
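One way to even out the work per thread, sketched here under assumptions, is to hand out small chunks of the range on demand instead of one fixed slice per thread: a thread that draws cheap chunks simply claims more of them. The names CHUNK, MAX_THREADS, chunk_worker, and count_primes are hypothetical, the shared chunk cursor uses C11 atomics (so this is C rather than the asker's C++), and is_prime uses an i * i <= n test in place of the inline-assembly square root.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK 1000       /* hypothetical chunk size; tune for the workload */
#define MAX_THREADS 64   /* hypothetical upper bound */

static atomic_llong next_start;  /* first number of the next unclaimed chunk */
static int range_end;            /* last number to test, inclusive */

/* Trial division by odd numbers up to sqrt(n). */
static bool is_prime(int n)
{
    if (n <= 1) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;
    for (long long i = 3; i * i <= n; i += 2)
        if (n % i == 0) return false;
    return true;
}

/* Each worker repeatedly claims the next chunk until none are left. */
static void *chunk_worker(void *unused)
{
    int count = 0;   /* private counter, so no mutex is needed */
    (void)unused;
    for (;;) {
        long long start = atomic_fetch_add(&next_start, CHUNK);
        if (start > range_end)
            break;
        long long end = start + CHUNK - 1;
        if (end > range_end)
            end = range_end;
        for (long long j = start; j <= end; j++)
            if (is_prime((int)j))
                count++;
    }
    return (void *)(intptr_t)count;
}

/* Counts the primes in [0, limit] using `threads` extra workers. */
int count_primes(int limit, int threads)
{
    pthread_t tid[MAX_THREADS];
    int started[MAX_THREADS];
    int total = 0;

    assert(limit >= 0 && threads >= 1 && threads <= MAX_THREADS);
    atomic_store(&next_start, 0);
    range_end = limit;

    for (int t = 0; t < threads; t++)
        started[t] = (pthread_create(&tid[t], NULL, chunk_worker, NULL) == 0);

    /* The calling thread helps too, which also covers failed creations. */
    total += (int)(intptr_t)chunk_worker(NULL);

    for (int t = 0; t < threads; t++) {
        void *ret;
        if (started[t] && pthread_join(tid[t], &ret) == 0)
            total += (int)(intptr_t)ret;
    }
    return total;
}
```

The only shared state touched in the hot path is one atomic add per chunk, so contention stays low as long as CHUNK is not tiny.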

The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise they will spend far more time waiting on and handling that mutex, than they will do on the actual calculation. You are essentially forcing the threads to execute in serial.
Instead, sum everything up with a private counter variable and once a thread is done with its work, return the counter variable and sum them up in main().
Also, you should not call printf() from inside the threads. If there is a context switch in the middle of a printf call, you'll end up with crappy output such as This is This is prime: 2. In which case you must synchronize the printf calls between threads, which will slow the program down again. Also, the printf() calls themselves are likely 90% of the work that the thread is doing. So some sort of re-design of who does the printing might be a good idea, depending on what you want to do with the results.

Summary
Indeed, using pthreads sped up my code. My programming mistakes were placing pthread_join right after the first pthread_create and the common counter I had set up in the arguments. After fixing these, I tested my parallel code on the primality of 100 million numbers and compared its processing time with the serial code. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have much reputation yet, instead, I am including a link)
I conducted three trials of each to account for variation caused by other OS activity. We got a speedup from parallel programming with pthreads. What is surprising is that the pthreads code running on ONE thread was a bit faster than the purely serial code. I cannot explain that; nevertheless, pthreads is surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct start_end_f
{
int start;
int end;
};
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn = asmSqrt(n);
for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
int k = 0;
struct start_end_f *start_end_h = (struct start_end_f *)in;
for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
{
if(isPrime(j) == 1) k++;
}
int *t = new int;
*t = k;
pthread_exit(t);
}
int main()
{
int f = 100000000; //maximum number to be tested for prime
pthread_t thread_id[NTHREADS];
struct start_end_f start_end[NTHREADS];
int rem = (f+1)%NTHREADS;
int n = (f+1)/NTHREADS;
int rem_change = rem;
int m;
if(rem>0) m = n+1;
else if(rem == 0) m = n;
//distributes task 'evenly' to the number of parallel threads requested
for(int h = 0; h < NTHREADS; h++)
{
if(rem_change > 0)
{
start_end[h].start = m*h;
start_end[h].end = start_end[h].start+m-1;
rem_change -= 1;
}
else if(rem_change<= 0)
{
start_end[h].start = m*(h+rem_change)-rem_change*n;
start_end[h].end = start_end[h].start+n-1;
rem_change -= 1;
}
pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
}
//retrieving returned values
int *t;
int c = 0;
for(int h = 0; h < NTHREADS; h++)
{
pthread_join(thread_id[h], (void **)&t);
c += *t;
delete t; //release the counter allocated by the thread
}
printf("\nNumber of Primes: %d\n",c);
return 0;
}

Multi threaded C program freezes when I change number of threads

I am writing a multithreaded C program to multiply two matrices and find the row norm using pthreads and BLAS. I thought I had it working when I set the dimension of the matrices to 4 and the number of threads to 2. I then changed the number of threads, and it no longer works. It does not compute wrong answers; it gets stuck when I try to join the threads.
void *matrix_norm(void *arg){
mat_norm_t *thread_mat_norm_data = arg;
int n = thread_mat_norm_data->n;
int i, j;
double norm = 0.;
for(i=0;i<thread_mat_norm_data->sub_n;i++){
double row_sum = 0.;
for(j=0;j<n;j++){
row_sum += *(thread_mat_norm_data->z+i*n+j);
}
if(row_sum>norm){
norm = row_sum;
}
}
pthread_mutex_lock(thread_mat_norm_data->mutex);
if (norm > *(thread_mat_norm_data->global_norm)){
*(thread_mat_norm_data->global_norm)=norm;
}
pthread_mutex_unlock(thread_mat_norm_data->mutex);
pthread_exit(NULL);
}
int main() {
pthread_t *working_thread;
mat_mult_t *thread_mat_mult_data;
mat_norm_t *thread_mat_norm_data;
pthread_mutex_t *mutex;
double *x, *y, *z, norm;
int i, rows_per_thread;
int n = 8;
int num_of_thrds = 4;// Works when this is 2, not when 4
if(n<=num_of_thrds && num_of_thrds < MAXTHRDS){
printf("Matrix dim must be greater than num of thrds\nand num of thrds less than 124.\n");
return (-1);
}
x = malloc(n*n*sizeof(double));
y = malloc(n*n*sizeof(double));
z = malloc(n*n*sizeof(double));
initMat(n, x);
initMat(n, y);
working_thread = malloc(num_of_thrds * sizeof(pthread_t));
thread_mat_mult_data = malloc(num_of_thrds * sizeof(mat_mult_t));
rows_per_thread = n/num_of_thrds;
for(i=0;i<num_of_thrds;i++){
thread_mat_mult_data[i].x = x + i * rows_per_thread * n;
thread_mat_mult_data[i].y = y;
thread_mat_mult_data[i].z = z + i * rows_per_thread * n;
thread_mat_mult_data[i].n = n;
thread_mat_mult_data[i].sub_n =
(i == num_of_thrds-1) ? n-(num_of_thrds-1)*rows_per_thread : rows_per_thread;
pthread_create(&working_thread[i], NULL, matrix_mult, (void *)&thread_mat_mult_data[i]);
}
for(i=0;i<num_of_thrds;i++){
pthread_join(working_thread[i], NULL);
}
free(working_thread);
working_thread = malloc(num_of_thrds * sizeof(pthread_t));
thread_mat_norm_data = malloc(num_of_thrds * sizeof(mat_norm_t));
mutex = malloc(sizeof(pthread_mutex_t));
for(i=0;i<num_of_thrds;i++){
thread_mat_norm_data[i].z = z + i * rows_per_thread * n;
thread_mat_norm_data[i].n = n;
thread_mat_norm_data[i].global_norm = &norm;
thread_mat_norm_data[i].sub_n =
(i == num_of_thrds-1) ? n-(num_of_thrds-1)*rows_per_thread : rows_per_thread;
thread_mat_norm_data[i].mutex = mutex;
pthread_create(&working_thread[i], NULL, matrix_norm, (void *)&thread_mat_norm_data[i]);
}
//Stuck running here
for(i=0;i<num_of_thrds;i++){
pthread_join(working_thread[i], NULL);
}
printMat(n, z , "z");
printf("\nRow Sum Norm = %f\n", norm);
free(x);
free(y);
free(z);
free(working_thread);
free(thread_mat_mult_data);
free(thread_mat_norm_data);
pthread_mutex_destroy(mutex);
free(mutex);
return(0);
}
I am unsure why it works under certain circumstances and not others; any explanation would be great!

I forgot to initialize the mutex with pthread_mutex_init(mutex, NULL). I am still unsure why it would work without this for two threads but not for more.
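On the open question of why two threads appeared to work: malloc does not clear the memory it returns, and locking a mutex that was never initialized is undefined behavior, so whether it seems to work likely depends on whatever bytes happened to be in that allocation, which can change with the allocation pattern. The sketch below shows the initialization in context, assuming a simplified task (finding a maximum) in place of the row norm; max_task_t, max_worker, parallel_max, and MAX_THREADS are hypothetical names. It also gives the shared result a defined starting value, which the norm variable in the question does not have.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 64   /* hypothetical upper bound */

typedef struct {
    const double *values;     /* start of this thread's slice */
    int count;                /* number of elements in the slice */
    double *global_max;       /* shared result */
    pthread_mutex_t *mutex;   /* protects *global_max */
} max_task_t;

/* Finds the slice maximum privately, then merges it under the lock. */
static void *max_worker(void *arg)
{
    max_task_t *task = arg;
    if (task->count <= 0)
        return NULL;
    double local = task->values[0];
    for (int i = 1; i < task->count; i++)
        if (task->values[i] > local)
            local = task->values[i];

    pthread_mutex_lock(task->mutex);
    if (local > *task->global_max)
        *task->global_max = local;
    pthread_mutex_unlock(task->mutex);
    return NULL;
}

/* Returns the largest of the n values, using `threads` workers. */
double parallel_max(const double *values, int n, int threads)
{
    pthread_t tid[MAX_THREADS];
    max_task_t task[MAX_THREADS];
    int started[MAX_THREADS];
    double global_max;
    pthread_mutex_t *mutex;

    assert(values != NULL && n >= 1);
    assert(threads >= 1 && threads <= MAX_THREADS);

    global_max = values[0];   /* the shared value needs a defined start too */
    mutex = malloc(sizeof *mutex);
    /* malloc returns indeterminate bytes: initialise before any lock call */
    if (mutex == NULL || pthread_mutex_init(mutex, NULL) != 0)
        abort();

    for (int t = 0; t < threads; t++) {
        int first = (int)((long long)n * t / threads);
        int last = (int)((long long)n * (t + 1) / threads);
        task[t].values = values + first;
        task[t].count = last - first;
        task[t].global_max = &global_max;
        task[t].mutex = mutex;
        started[t] = (pthread_create(&tid[t], NULL, max_worker, &task[t]) == 0);
        if (!started[t])
            max_worker(&task[t]);   /* fall back to the calling thread */
    }
    for (int t = 0; t < threads; t++)
        if (started[t])
            pthread_join(tid[t], NULL);

    pthread_mutex_destroy(mutex);
    free(mutex);
    return global_max;
}
```

For a mutex with static storage, PTHREAD_MUTEX_INITIALIZER is an alternative to calling pthread_mutex_init.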

How to fix my Pthreads code about Mandelbrot set?

I have the following pthreads code that calculates and creates a picture of the Mandelbrot set. My code in C works just fine and prints the resulting picture nicely. The problem is that with the code below, I can compile and execute it, but when I then try to view the resulting .ppm file in GIMP, it simply cannot open it. I guess I'm doing something wrong in my code. I would be glad if someone could help me.
// mandpthread.c
// to compile: gcc mandpthread.c -o mandpthread -lm -lrt -lpthread
// usage: ./mandpthread <no_of_iterations> <no_of_threads> > output.ppm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <assert.h>
#include <pthread.h>
typedef struct {
int r, g, b;
} rgb;
int NITERATIONS, NTHREADS;
rgb **m;
void color(rgb **m, int x, int y, int red, int green, int blue)
{
m[y][x].r = red;
m[y][x].g = green;
m[y][x].b = blue;
}
void mandelbrot(int tid)
{
int w = 600, h = 400, x, y;
// each iteration, it calculates: newz = oldz*oldz + p,
// where p is the current pixel, and oldz starts at the origin
double pr, pi; // real and imaginary part of the pixel p
double newRe, newIm, oldRe, oldIm; // real and imaginary parts of new and old z
double zoom = 1, moveX = -0.5, moveY = 0; // you can change these to zoom and change position
int start = tid * NITERATIONS/NTHREADS;
int end = (tid+1) * (NITERATIONS/NTHREADS) - 1;
//loop through every pixel
for(y = 0; y < h; y++) {
for(x = 0; x < w; x++) {
// calculate the initial real and imaginary part of z,
// based on the pixel location and zoom and position values
pr = 1.5 * (x - w / 2) / (0.5 * zoom * w) + moveX;
pi = (y - h / 2) / (0.5 * zoom * h) + moveY;
newRe = newIm = oldRe = oldIm = 0; //these should start at 0,0
// i will represent the number of iterations
int i;
// start the iteration process
for(i = start; i <= end; i++) {
// remember value of previous iteration
oldRe = newRe;
oldIm = newIm;
// the actual iteration, the real and imaginary part are calculated
newRe = oldRe * oldRe - oldIm * oldIm + pr;
newIm = 2 * oldRe * oldIm + pi;
// if the point is outside the circle with radius 2: stop
if((newRe * newRe + newIm * newIm) > 4) break;
}
if(i == NITERATIONS)
color(m, x, y, 0, 0, 0); // black
else
{
// normalized iteration count method for proper coloring
double z = sqrt(newRe * newRe + newIm * newIm);
int brightness = 256. * log2(1.75 + i - log2(log2(z))) / log2((double)NITERATIONS);
color(m, x, y, brightness, brightness, 255);
}
}
}
}
// worker function which will be passed to pthread_create function
void *worker(void *arg)
{
int tid = (int)arg;
mandelbrot(tid);
return NULL; // a void * thread function should return a value
}
int main(int argc, char *argv[])
{
pthread_t* threads;
int i, j, rc;
if(argc != 3)
{
printf("Usage: %s <no_of_iterations> <no_of_threads> > output.ppm\n", argv[0]);
exit(1);
}
NITERATIONS = atoi(argv[1]);
NTHREADS = atoi(argv[2]);
threads = (pthread_t*)malloc(NTHREADS * sizeof(pthread_t));
m = malloc(400 * sizeof(rgb *));
for(i = 0; i < 400; i++)
m[i] = malloc(600 * sizeof(rgb));
// declaring the needed variables for calculating the running time
struct timespec begin, end;
double time_spent;
// starting the run time
clock_gettime(CLOCK_MONOTONIC, &begin);
printf("P6\n# AUTHOR: ET\n");
printf("%d %d\n255\n",600,400);
for(i = 0; i < NTHREADS; i++) {
rc = pthread_create(&threads[i], NULL, worker, (void *)i);
assert(rc == 0); // checking whether thread creation was successful
}
for(i = 0; i < NTHREADS; i++) {
rc = pthread_join(threads[i], NULL);
assert(rc == 0); // checking whether thread join was successful
}
// printing to file
for(i = 0; i < 400; i++) {
for(j = 0; j < 600; j++) {
fputc((char)m[i][j].r, stdout);
fputc((char)m[i][j].g, stdout);
fputc((char)m[i][j].b, stdout);
}
}
// ending the run time
clock_gettime(CLOCK_MONOTONIC, &end);
// calculating time spent during the calculation and printing it
time_spent = end.tv_sec - begin.tv_sec;
time_spent += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
fprintf(stderr, "Elapsed time: %.2lf seconds.\n", time_spent);
for(i = 0; i < 400; i++)
free(m[i]);
free(m);
free(threads);
return 0;
}

The newest version of your code works for me with 100 iterations and 1 thread.
With two threads it fails, because the .ppm file ends up with two headers, one from each thread.
If I delete one of the headers, the image loads, but the colours are off and there's a glitch in the image.
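The glitch and the odd colours are consistent with how the posted code divides the work: `start` and `end` slice the iteration count across threads, so every thread still visits every pixel, runs only part of the iteration loop, and overwrites what the other threads stored. The `i == NITERATIONS` test is then rarely true, and the brightness formula can produce values outside 0..255 or NaN. One common alternative, sketched here as an assumption and not as the original author's fix, is to split the image rows among the threads, let each thread run the full iteration loop for its own pixels, and have `main` print the header once after all threads are joined. The helper names `row_range`, `escape_iterations` and `clamp_byte` are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: thread tid handles rows [*first, *last).
   The last thread also takes the remainder rows. */
void row_range(int tid, int nthreads, int height, int *first, int *last)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);
    int rows = height / nthreads;
    *first = tid * rows;
    *last = (tid == nthreads - 1) ? height : *first + rows;
}

/* Runs the full iteration z = z*z + p from z = 0 for the point p = (pr, pi).
   Returns the iteration index at which |z| exceeded 2,
   or max_iter if the point never escaped. */
int escape_iterations(double pr, double pi, int max_iter)
{
    double re = 0.0, im = 0.0;
    int i;
    for (i = 0; i < max_iter; i++) {
        double new_re = re * re - im * im + pr;
        im = 2.0 * re * im + pi;
        re = new_re;
        if (re * re + im * im > 4.0)
            break;
    }
    return i;
}

/* Clamp a computed brightness into 0..255 before writing it with fputc. */
unsigned char clamp_byte(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (unsigned char)v;
}
```

In the worker, the outer loop would then run `for (y = first; y < last; y++)` with the bounds from `row_range(tid, NTHREADS, h, &first, &last)`, and the black/coloured decision would compare the result of `escape_iterations` against `NITERATIONS`. Because each thread writes only its own rows of `m`, no mutex is needed for the pixel array.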
