Why is my parallel code slower than serial?

Issue
Hello everyone, I have a program (from the net) that I intend to speed up by converting it into a parallel version using pthreads. Surprisingly, it runs slower than the serial version. Below is the program:
# include <stdio.h>
//fast square root algorithm
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//number generator iterated from 0 to n
int main()
{
int n = 1000000; //maximum number
int k = 0, j; //k counts the primes found, so it must start at 0
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
First attempt for parallelization
I let the pthread manage the for loop
# include <stdio.h>
.
.
int main()
{
.
.
//----->pthread code here<----
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1) k++;
if(j == n) printf("Count: %d\n",k);
}
return 0;
}
Well, it runs slower than the serial one
Second attempt
I divided the for loop between two threads and ran them in parallel using pthreads.
However, it still runs slower. I expected it to run about twice as fast, or at least faster, but it does not!
This is my parallel code, by the way:
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 2
pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int k = 0;
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct arg_struct
{
int initialPrime;
int nextPrime;
};
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn,i;
sqrtn = asmSqrt(n);
for (i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
void *parallel_launcher(void *arguments)
{
struct arg_struct *args = (struct arg_struct *)arguments;
int j = args -> initialPrime;
int n = args -> nextPrime - 1;
for (j = 0; j<= n; j++)
{
if(isPrime(j) == 1)
{
printf("This is prime: %d\n",j);
pthread_mutex_lock( &mutex1 );
k++;
pthread_mutex_unlock( &mutex1 );
}
if(j == n) printf("Count: %d\n",k);
}
pthread_exit(NULL);
}
int main()
{
int f = 100000000;
int m;
pthread_t thread_id[NTHREADS];
struct arg_struct args;
int rem = (f+1)%NTHREADS;
int n = floor((f+1)/NTHREADS);
for(int h = 0; h < NTHREADS; h++)
{
if(rem > 0)
{
m = n + 1;
rem-= 1;
}
else if(rem == 0)
{
m = n;
}
args.initialPrime = args.nextPrime;
args.nextPrime = args.initialPrime + m;
pthread_create(&thread_id[h], NULL, &parallel_launcher, (void *)&args);
pthread_join(thread_id[h], NULL);
}
// printf("Count: %d\n",k);
return 0;
}
Note:
OS: Fedora 21 x86_64,
Compiler: gcc-4.4,
Processor: Intel Core i5 (2 physical core, 4 logical),
Mem: 6 Gb,
HDD: 340 Gb,

You need to split the range you are examining for primes up into n parts, where n is the number of threads.
The code that each thread runs becomes:
#include <stdint.h>
typedef struct start_end {
    int start;
    int end;
} start_end_t;
void *find_primes_in_range(void *in) {
    start_end_t *start_end = (start_end_t *) in;
    intptr_t num_primes = 0; /* private counter, no locking needed */
    for (int j = start_end->start; j <= start_end->end; j++) {
        if (isPrime(j) == 1)
            num_primes++;
    }
    /* hand the count back through the thread's exit value */
    pthread_exit((void *) num_primes);
}
The main routine first starts all the threads which call find_primes_in_range, then calls pthread_join for each thread. It sums all the values returned by find_primes_in_range. This avoids locking and unlocking a shared count variable.
This will parallelize the work, but the amount of work per thread will not be equal. This can be addressed but is more complicated.
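A minimal sketch of the main routine described above, under a few assumptions: `NUM_THREADS` and `count_primes_parallel` are hypothetical names, the `isPrime` here is plain trial division standing in for the one in the question, and the last thread simply takes the remainder of the range. All threads are started before any of them is joined, and every thread gets its own range struct.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4   /* hypothetical thread count */

typedef struct start_end {
    int start;
    int end;            /* inclusive */
} start_end_t;

/* Plain trial division; stands in for isPrime() from the question. */
static int isPrime(int n)
{
    if (n <= 1) return 0;
    if (n == 2) return 1;
    if (n % 2 == 0) return 0;
    for (int i = 3; (long long)i * i <= n; i += 2)
        if (n % i == 0) return 0;
    return 1;
}

/* Each thread counts the primes in its own range with a private counter. */
static void *find_primes_in_range(void *in)
{
    start_end_t *range = in;
    intptr_t num_primes = 0;
    for (int j = range->start; j <= range->end; j++)
        if (isPrime(j))
            num_primes++;
    return (void *)num_primes;   /* the count travels back in the pointer value */
}

/* Counts the primes in [0, max]; returns -1 if a thread could not be created. */
int count_primes_parallel(int max)
{
    assert(max >= 0);
    pthread_t tid[NUM_THREADS];
    start_end_t ranges[NUM_THREADS];   /* one struct per thread, never shared */
    int chunk = (max + 1) / NUM_THREADS;
    int created = 0, total = 0;

    /* Start every thread first ... */
    for (int h = 0; h < NUM_THREADS; h++) {
        ranges[h].start = h * chunk;
        /* the last thread also takes the remainder */
        ranges[h].end = (h == NUM_THREADS - 1) ? max : (h + 1) * chunk - 1;
        if (pthread_create(&tid[h], NULL, find_primes_in_range, &ranges[h]) != 0)
            break;
        created++;
    }
    /* ... and only then join them and add up the private counts. */
    for (int h = 0; h < created; h++) {
        void *result = NULL;
        pthread_join(tid[h], &result);
        total += (int)(intptr_t)result;
    }
    return (created == NUM_THREADS) ? total : -1;
}
```

Calling `count_primes_parallel(1000000)` from `main()` and printing the result replaces the shared counter and the mutex entirely.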

The main design flaw: you must let each thread have its own private counter variable instead of using the shared one. Otherwise they will spend far more time waiting on and handling that mutex, than they will do on the actual calculation. You are essentially forcing the threads to execute in serial.
Instead, sum everything up with a private counter variable and once a thread is done with its work, return the counter variable and sum them up in main().
Also, you should not call printf() from inside the threads. On implementations where stdio calls are not atomic, a context switch in the middle of a printf call gives you crappy output such as This is This is prime: 2, and you must then synchronize the printf calls between threads yourself. Where stdio locks the stream internally (as POSIX requires), the threads queue up on that lock instead. Either way the program slows down again. Also, the printf() calls themselves are likely a large share of the work that the thread is doing. So some sort of re-design of who does the printing might be a good idea, depending on what you want to do with the results.
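One possible re-design, sketched under assumptions: each thread stores the primes it finds in a private slice of an output array, and only main() prints after all threads are joined. The names `prime_job_t`, `collect_primes`, `primes_up_to`, `print_primes` and `NUM_WORKERS` are hypothetical, and `is_prime` is a plain trial-division stand-in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NUM_WORKERS 2   /* hypothetical thread count */

typedef struct {
    int start, end;     /* inclusive range to test */
    int *found;         /* private slice of the output array */
    int count;          /* private counter */
} prime_job_t;

static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int i = 2; (long long)i * i <= n; i++)
        if (n % i == 0) return 0;
    return 1;
}

static void *collect_primes(void *arg)
{
    prime_job_t *job = arg;
    for (int j = job->start; j <= job->end; j++)
        if (is_prime(j))
            job->found[job->count++] = j;   /* no lock and no printf here */
    return NULL;
}

/* Fills out[] (room for max + 1 ints) with the primes in [0, max] in
 * ascending order. Returns how many were found, or -1 on error. */
int primes_up_to(int max, int *out)
{
    assert(max >= 0 && out != NULL);
    pthread_t tid[NUM_WORKERS];
    prime_job_t jobs[NUM_WORKERS];
    int chunk = (max + 1) / NUM_WORKERS;
    int created = 0, total = 0;

    for (int h = 0; h < NUM_WORKERS; h++) {
        jobs[h].start = h * chunk;
        jobs[h].end = (h == NUM_WORKERS - 1) ? max : (h + 1) * chunk - 1;
        jobs[h].found = out + jobs[h].start;   /* slices do not overlap */
        jobs[h].count = 0;
        if (pthread_create(&tid[h], NULL, collect_primes, &jobs[h]) != 0)
            break;
        created++;
    }
    for (int h = 0; h < created; h++)
        pthread_join(tid[h], NULL);
    if (created != NUM_WORKERS)
        return -1;

    /* Pack the per-thread slices together, lowest range first. */
    for (int h = 0; h < NUM_WORKERS; h++) {
        memmove(out + total, jobs[h].found, (size_t)jobs[h].count * sizeof *out);
        total += jobs[h].count;
    }
    return total;
}

/* Called from main() only, after all threads have finished. */
void print_primes(const int *primes, int n)
{
    for (int i = 0; i < n; i++)
        printf("This is prime: %d\n", primes[i]);
}
```

The trade-off is memory: the output array is sized for the worst case. If only the count matters, dropping the array and keeping the private counter is cheaper still.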

Summary
Indeed, using pthreads sped up my code. The flaws were in my own programming: I placed pthread_join right after the first pthread_create, and I used the common counter I had set on the arguments. After fixing this, I tested my parallel code by determining the primality of 100 million numbers, then compared its processing time with that of the serial code. Below are the results.
http://i.stack.imgur.com/gXFyk.jpg (I could not attach the image as I don't have much reputation yet, instead, I am including a link)
I conducted three trials for each to account for variations caused by other OS activity. Parallel programming with pthreads gave a speedup. Surprisingly, the pthread code running in ONE thread was a bit faster than the purely serial code. I cannot explain this; nevertheless, using pthreads is surely worth a try.
Here is the corrected parallel version of the code (gcc-c++):
# include <stdio.h>
# include <pthread.h>
# include <cmath>
# define NTHREADS 4
double asmSqrt(double x)
{
__asm__ ("fsqrt" : "+t" (x));
return x;
}
struct start_end_f
{
int start;
int end;
};
//test if a number is prime
bool isPrime(int n)
{
if (n <= 1) return false;
if (n == 2) return true;
if (n%2 == 0) return false;
int sqrtn = asmSqrt(n);
for (int i = 3; i <= sqrtn; i+=2) if (n%i == 0) return false;
return true;
}
//executes the tests for prime in a certain range, other threads will test the next range and so on..
void *find_primes_in_range(void *in)
{
int k = 0;
struct start_end_f *start_end_h = (struct start_end_f *)in;
for (int j = start_end_h->start; j < (start_end_h->end +1); j++)
{
if(isPrime(j) == 1) k++;
}
int *t = new int;
*t = k;
pthread_exit(t);
}
int main()
{
int f = 100000000; //maximum number to be tested for prime
pthread_t thread_id[NTHREADS];
struct start_end_f start_end[NTHREADS];
int rem = (f+1)%NTHREADS;
int n = (f+1)/NTHREADS;
int rem_change = rem;
int m;
if(rem>0) m = n+1;
else if(rem == 0) m = n;
//distributes task 'evenly' to the number of parallel threads requested
for(int h = 0; h < NTHREADS; h++)
{
if(rem_change > 0)
{
start_end[h].start = m*h;
start_end[h].end = start_end[h].start+m-1;
rem_change -= 1;
}
else if(rem_change<= 0)
{
start_end[h].start = m*(h+rem_change)-rem_change*n;
start_end[h].end = start_end[h].start+n-1;
rem_change -= 1;
}
pthread_create(&thread_id[h], NULL, find_primes_in_range, &start_end[h]);
}
//retrieving returned values
int *t;
int c = 0;
for(int h = 0; h < NTHREADS; h++)
{
pthread_join(thread_id[h], (void **)&t);
c += *t;
delete t; //free the counter allocated by the thread
}
printf("\nNumber of Primes: %d\n",c);
return 0;
}

Related

Cast to Pointer Error Multithread Program

This is a multi-threaded program that outputs prime numbers. The user runs the program and enters a number into the command line. It creates a separate thread that outputs all the prime numbers less than or equal to the number entered by the user.
I have an error: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] I'm so close but I've been staring at this for awhile now. I thought I would get some feedback.
How can I fix this? It is referring to the void here:
(void *)count);
Here is all the code:
#include <stdio.h>
#include <pthread.h>
int N = 100; //number of primes to be generated
int prime_arr[100000] = {0}; //prime array
void *printprime(void *ptr) //thread function
{
int j, flag;
int i = (int)(long long int)ptr; //getting thread number
//for thread 0, we check for all primes 0,4,8,12
//for thread 1, we check for all primes 1,5,9,13
while (i < N) { //while number in range
flag = 0; //check if i has factor
for (j = 2; j <= i / 2; j++) //factor can be at max i/2 value
{
if (i % j == 0) //factor found
{
flag = 1;
break;
}
}
if (flag == 0 && (i > 1)) //prime found, no factor
{
prime_arr[i] = 1;
}
i += 4; //increase by interval of 4
}
}
int main()
{
printf("Enter N: ");
scanf("%d", &N); //input N
pthread_t tid[4] = {0}; //create an array of 4 threads
int count = 0;
for (count = 0; count < 4; count++) //initialize threads and start
{
printf("\r\n CREATING THREADS %d", count);
pthread_create(&tid[count], NULL, printprime,(void *)count); //count is passed as argument, target = printprime
}
printf("\n");
for (count = 0; count < 4; count++)
{
pthread_join(tid[count], NULL); //while all thread havent finished
}
int c = 0;
for (count = 0; count < N; count++) //print primes
if (prime_arr[count] == 1)
printf("%d ", count);
printf("\n");
return 0;
}

Here you cast count to a void* which isn't a compatible type.
pthread_create(&tid[count], NULL, printprime, (void*) count);
And here you try to convert it back to an int improperly:
int i = (int)(long long int)ptr;
I suggest creating work packages (tasks) instead; a pointer to a task converts properly to void* and back.
Example:
#include <pthread.h>
#include <stdio.h>
typedef struct {
pthread_t tid;
int count;
} task_t;
void *printprime(void *ptr) {
task_t *task = ptr;
task->count += 10; // do some work
return NULL;
}
#define TASKS (4)
int main() {
task_t tasks[TASKS] = {0}; // an array of tasks
for (int count = 0; count < TASKS; ++count) {
tasks[count].count = count; // fill the task with some job
pthread_create(&tasks[count].tid, NULL, printprime, &tasks[count]);
}
// join and take care of result from all threads
for (int count = 0; count < TASKS; ++count) {
pthread_join(tasks[count].tid, NULL);
printf("task %d value = %d\n", count, tasks[count].count);
}
}

Use a uintptr_t or an intptr_t instead of an int.
Technically, that's for storing a pointer in an integer, not for storing an integer in a pointer. So it's not exactly kosher. But it's still a common practice.
To do it properly, you would need to (statically or dynamically) allocate a variable for each thread, and pass the address of that variable to the thread.
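A short sketch of the intptr_t approach, assuming the thread only needs a small index. The names `worker`, `run_workers`, `seen` and `NUM_THREADS` are hypothetical. Going through intptr_t in both directions keeps each cast between types of the same size, which is what silences the warning.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 4   /* hypothetical thread count */

static int seen[NUM_THREADS];   /* each thread writes only its own slot */

static void *worker(void *ptr)
{
    /* pointer -> intptr_t -> int: no "different size" warning */
    int id = (int)(intptr_t)ptr;
    assert(id >= 0 && id < NUM_THREADS);
    seen[id] = id + 1;
    return NULL;
}

/* Starts the workers, waits for them, and returns how many were created. */
int run_workers(void)
{
    pthread_t tid[NUM_THREADS];
    int created = 0;

    for (int count = 0; count < NUM_THREADS; count++) {
        /* int -> intptr_t -> pointer: the value itself is the argument */
        if (pthread_create(&tid[count], NULL, worker,
                           (void *)(intptr_t)count) != 0)
            break;
        created++;
    }
    for (int count = 0; count < created; count++)
        pthread_join(tid[count], NULL);
    return created;
}
```

Nothing here points at the loop variable, so the value each thread sees cannot change while the loop keeps running.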

Why does adding a critical section cause a segmentation fault?

I am trying to play around with the following quicksort algorithm in parallel:
#include <stdio.h>
#include <stdlib.h>
#include<omp.h>
#define MAX_UNFINISHED 1000 /* Maximum number of unsorted sub-arrays */
struct {
int first; /* Low index of unsorted sub-array */
int last; /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack */
int unfinished_index; /* Index of top of stack */
float *A; /* Array of elements to be sorted */
int n; /* Number of elements in A */
void swap (float *x, float *y)
{
float tmp;
tmp = *x;
*x = *y;
*y = tmp;
}
int partition (int first, int last)
{
int i, j;
float x;
x = A[last];
i = first - 1;
for (j = first; j < last; j++)
if (A[j] <= x) {
i++;
swap (&A[i], &A[j]);
}
swap (&A[i+1], &A[last]);
return (i+1);
}
void quicksort (void)
{
int first;
int last;
int my_index;
int q;
while (unfinished_index >= 0) {
#pragma omp critical
{
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first;
last = unfinished[my_index].last;
}
while (first < last) {
q = partition (first, last);
if ((unfinished_index+1) >= MAX_UNFINISHED) {
printf ("Stack overflow\n");
exit (-1);
}
#pragma omp critical
{
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
last = q-1;
}
}
}
}
int verify_sorted (float *A, int n)
{
int i;
for (i = 0; i < n-1; i++)
if (A[i] > A[i+1])
return 0;
return 1;
}
int main (int argc, char *argv[])
{
int i;
int seed; /* Seed component input by user */
unsigned short xi[3]; /* Random number seed */
if (argc != 3) {
printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
exit (-1);
}
seed = atoi (argv[2]);
xi[0] = xi[1] = xi[2] = seed;
n = atoi (argv[1]);
A = (float *) malloc (n * sizeof(float));
for (i = 0; i < n; i++)
A[i] = erand48(xi);
unfinished[0].first = 0;
unfinished[0].last = n-1;
unfinished_index = 0; //
#pragma omp parallel
quicksort();
if (verify_sorted (A, n)) printf ("Elements are sorted\n");
else printf ("ERROR: Elements are NOT sorted\n");
return 0;
}
Adding the critical sections in the quicksort() function causes a segmentation fault 11, why is that? From my basic understanding, such an error occurs when the system tries to access memory it doesn't have access to or is non-existent, I can't see where that would happen. Putting a critical section over the entire while() loop fixes it but it would be slow.

Your strategy to parallelize the QuickSort looks overcomplicated and prone to race conditions; the typical way to parallelize that algorithm is to use OpenMP tasks. You can have a look at the following SO threads:
1. QuickSort;
2. MergeSort.
Adding the critical sections in the quicksort() function causes a segmentation fault 11, why is that?
Your code has several issues, namely:
There is a race-condition between the read of unfinished_index in while (unfinished_index >= 0) and the updates of that variable by other threads;
The segmentation fault 11 happens because threads can access positions out of bounds of the array unfinished, including negative positions.
Even with the critical region multiple threads can execute:
unfinished_index--;
which eventually leads to unfinished_index < 0 and consequently:
my_index = unfinished_index;
unfinished_index--;
first = unfinished[my_index].first; <-- problem
last = unfinished[my_index].last; <-- problem
accessing a negative position of the unfinished array. The same applies to the upper bound as well. All threads may pass this check:
if ((unfinished_index+1) >= MAX_UNFINISHED) {
printf ("Stack overflow\n");
exit (-1);
}
and then simply
#pragma omp critical
{
unfinished_index++;
unfinished[unfinished_index].first = q+1;
unfinished[unfinished_index].last = last;
last = q-1;
}
increment unfinished_index so much that it can access positions outside of the array boundaries.
To solve those problems you can do the following:
void quicksort (void)
{
int first;
int last;
int my_index;
int q;
int keep_working = 1;
while (keep_working) {
#pragma omp critical
{
my_index = unfinished_index;
if(my_index >= 0 && my_index < MAX_UNFINISHED)
unfinished_index--;
else
keep_working = 0;
}
if(keep_working){
first = unfinished[my_index].first;
last = unfinished[my_index].last;
while (first < last && keep_working)
{
q = partition (first, last);
#pragma omp critical
{
unfinished_index++;
my_index = unfinished_index;
}
if (my_index < MAX_UNFINISHED){
unfinished[my_index].first = q+1;
unfinished[my_index].last = last;
last = q-1;
}
else
keep_working = 0;
}
}
}
}
Bear in mind, however, that the code above works for 2 threads. With more than that you might sometimes end up with an array that is not sorted. I will leave that up to you to fix. However, I would suggest using the task approach instead, because it is faster, simpler, and relies on more modern OpenMP constructs.
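For comparison, a minimal sketch of the task approach mentioned above. It is not the code from the linked threads: `quicksort_tasks`, `parallel_quicksort`, `partition_range` and the `TASK_CUTOFF` value are hypothetical, and the partition scheme is the same Lomuto scheme used in the question. It assumes a compiler with OpenMP enabled (for example `-fopenmp` with gcc); without it the pragmas are ignored and the sort runs serially.

```c
#include <assert.h>
#include <stddef.h>

#define TASK_CUTOFF 1000   /* hypothetical: smaller ranges run without a deferred task */

static void swap_f(float *x, float *y)
{
    float tmp = *x;
    *x = *y;
    *y = tmp;
}

/* Lomuto partition around a[last], as in partition() from the question. */
static int partition_range(float *a, int first, int last)
{
    float x = a[last];
    int i = first - 1;
    for (int j = first; j < last; j++)
        if (a[j] <= x) {
            i++;
            swap_f(&a[i], &a[j]);
        }
    swap_f(&a[i + 1], &a[last]);
    return i + 1;
}

static void quicksort_tasks(float *a, int first, int last)
{
    if (first >= last)
        return;
    int q = partition_range(a, first, last);
    /* The two halves are disjoint, so no critical section is needed. */
    #pragma omp task if(last - first > TASK_CUTOFF)
    quicksort_tasks(a, first, q - 1);
    #pragma omp task if(last - first > TASK_CUTOFF)
    quicksort_tasks(a, q + 1, last);
    #pragma omp taskwait
}

void parallel_quicksort(float *a, int n)
{
    assert(a != NULL || n == 0);
    #pragma omp parallel
    {
        /* One thread seeds the recursion; the others pick up the tasks. */
        #pragma omp single nowait
        quicksort_tasks(a, 0, n - 1);
    }
}
```

The OpenMP runtime keeps track of the pending sub-arrays, so the hand-written stack, its index, and the races on them disappear.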

Peterson's Algorithm in C for Thread concurrency (segment fault)

Hey guys, I'm implementing Peterson's algorithm in C. I have two functions that will be executed by the threads created: one that adds 1 to a variable and another that subtracts 1 from that same variable.
The program receives an argument of type int; that integer is the square root of the number of threads I want to create. For example, if I execute it in the terminal by typing
./algorithm 10, there will be 10*10 (100) threads created.
The program runs OK if I type less than 170 as an argument (there would be 28900 threads created), but if I want to create more than that I get a segmentation fault. I tried using "long long int" variables, but that wasn't it.
There is a counter named "cont"; the variable will be printed each time cont reaches 10 000.
There is another print for the final value of the variable, which should always be 0 because n threads added 1 and n threads subtracted 1.
I want to know why I'm getting a segmentation fault: is there a limit on the number of threads that can be created, or is it something in my code?
I'm running it with the following command to use only one processor, because Peterson's algorithm only works perfectly on mono-processor systems:
taskset -c 0 ./alg3 100
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
long int n;
long int var = 0;
long int cont = 1;
int flag[] = {0, 0};
int turn = 0;
void* sum(void* data) {
//int n = *((int*)data);
int i;
turn = 2;
flag[0] = 1;
while (turn == 2 && flag[1]);
cont++;
var += 1;
if (cont == 10000) {
printf("varsum=%ld\n", var);
cont = 1;
}
flag[0] = 0;
}
void* rest(void* data) {
//int n = *((int*)data);
int i;
turn = 1;
flag[1] = 1;
while (turn == 1 && flag[0]);
cont++;
var -= 1;
if (cont == 10000) {
printf("varres=%ld\n", var);
cont = 1;
}
flag[1] = 0;
}
main(int argc, char *argv[]) {
long int i;
n = atoi(argv[1]);
n *= n; //n*n is the actual number of threads that will be created
pthread_t tid[n];
for (i = 0; i < n; i++) {
pthread_create(&tid[i], NULL, sum, NULL);
//cont++;
}
for (i = 0; i < n; i++)
pthread_join(tid[i], NULL);
for (i = 0; i < n; i++) {
pthread_create(&tid[i], NULL, rest, NULL);
//cont++;
}
for (i = 0; i < n; i++)
pthread_join(tid[i], NULL);
printf("main() reporting that all %ld threads have terminated\n", i);
printf("variable=%ld\n", var);
} /* main */

First of all, of course there is a limit on the number of threads you can create. It is determined by the stack size of each thread and by the hardware; for the details I suggest searching the web.
Reason for the segmentation fault:
You didn't check the return value of pthread_create. When n is large enough, pthread_create will fail, and pthread_join may then be given a nonexistent pthread_t as its first parameter. The following code (changed from your example) can test how many threads you can create.
int rc = 0, thread_num = 0;
for (i = 0; i < n; i++) {
    rc = pthread_create(&tid[i], NULL, sum, NULL);
    if (rc)
    {
        printf("pthread_create failed, thread number: %d, error code: %d\n", thread_num, rc);
        break; /* stop at the first failure; tid[i] is not a valid thread */
    }
    thread_num++;
}
printf("created %d threads.\n", thread_num);
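Since the stack size of each thread is named above as one of the limits, here is a small sketch of how it can be inspected and lowered with the standard attribute calls. The helper names `default_stack_size` and `run_with_stack_size` are hypothetical, and whether a smaller stack actually raises the thread limit on a given system is an assumption to verify there.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static void *tiny_worker(void *arg)
{
    (void)arg;   /* needs almost no stack */
    return NULL;
}

/* Returns the default stack size for new threads, or 0 on error. */
size_t default_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

/* Creates and joins one thread with the requested stack size.
 * Returns 0 on success or the pthread error code. */
int run_with_stack_size(size_t stack_bytes)
{
    assert(stack_bytes > 0);
    pthread_attr_t attr;
    pthread_t tid;

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    /* fails with an error code if the size is below the system minimum */
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, tiny_worker, NULL);
    pthread_attr_destroy(&attr);
    if (rc == 0)
        rc = pthread_join(tid, NULL);
    return rc;
}
```

Every call returns its error code directly rather than setting errno, which is why the result of each step is checked before the next one runs.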

Add error checking at least to pthread_create() to avoid passing an invalid pthread_t variable to pthread_join():
int main(int argc, char ** argv)
{
...
pthread_t tid[n];
int result[n];
for (i = 0; i < n; i++) {
result[i] = errno = pthread_create(&tid[i], NULL, sum, NULL);
if (0 != errno) {
perror("pthread_create() failed");
}
}
for (i = 0; i < n; i++) {
if (0 == result[i]) {
errno = pthread_join(tid[i], NULL);
if (0 != errno) {
perror("pthread_join() failed");
}
}
}
...
Also, always protect concurrent access to variables which are written to, cont here. To do so, use a pthread_mutex_t variable.
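A minimal sketch of that advice, assuming a counter like cont from the question. The names `cont_lock`, `add_many`, `run_counter`, `NUM_THREADS` and `INCREMENTS` are hypothetical. Every read-modify-write of the counter happens between lock and unlock, so no increment is lost.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 8      /* hypothetical thread count */
#define INCREMENTS 10000   /* increments per thread */

static long cont = 0;
static pthread_mutex_t cont_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_many(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        int rc = pthread_mutex_lock(&cont_lock);
        assert(rc == 0);
        (void)rc;
        cont++;                            /* protected update */
        pthread_mutex_unlock(&cont_lock);
    }
    return NULL;
}

/* Runs the threads and returns the final counter value. */
long run_counter(void)
{
    pthread_t tid[NUM_THREADS];
    int created = 0;

    cont = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&tid[i], NULL, add_many, NULL) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    return cont;   /* safe to read: all writers have been joined */
}
```

With the mutex in place the result no longer depends on pinning the process to one core.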

C - pthreads appear to only be utilizing one core

Let me first of all say that this is for school but I don't really need help, I'm just confused by some results I'm getting.
I have a simple program that approximates pi using Simpson's rule, in one assignment we had to do this by spawning 4 child processes and now in this assignment we have to use 4 kernel-level threads. I've done this, but when I time the programs the one using child processes seems to run faster (I get the impression I should be seeing the opposite result).
Here is the program using pthreads:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
struct func_range {
double start;
double end;
};
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
void *partial_sum(void *r)
{
double *ret = (double *)malloc(sizeof(double));
struct func_range *range = r;
#if SPEED_TEST
int k;
double begin = range->start;
for (k = 0; k < 25000; k++)
{
range->start = begin;
*ret = 0;
#endif
for (; range->start <= range->end; ++range->start)
*ret += TERN_STMT(range->start);
#if SPEED_TEST
}
#endif
return ret;
}
int main()
{
// An array for our threads.
pthread_t threads[4];
double total_sum = func(0);
void *temp;
struct func_range our_range;
int i;
for (i = 0; i < 4; i++)
{
our_range.start = (i == 0) ? 1 : (i == 1) ? 8000 : (i == 2) ? 16000 : 24000;
our_range.end = (i == 0) ? 7999 : (i == 1) ? 15999 : (i == 2) ? 23999 : 31999;
pthread_create(&threads[i], NULL, &partial_sum, &our_range);
pthread_join(threads[i], &temp);
total_sum += *(double *)temp;
free(temp);
}
total_sum += func(1);
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is using child processes:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
// This complicated ternary statement does the bulk of our work.
// Basically depending on whether or not we're at an even number in our
// sequence we'll call the function with x/32000 multiplied by 2 or 4.
#define TERN_STMT(x) (((int)x%2==0)?2*func(x/32000):4*func(x/32000))
// Set to 0 for no 100,000 runs
#define SPEED_TEST 1
// The function defined in the assignment
double func(double x)
{
return 4 / (1 + x*x);
}
int main()
{
// An array for our subprocesses.
pid_t pids[4];
// The pipe to pass-through information
int mypipe[2];
// Counter for subproccess loops
double j;
// Counter for outer loop
int i;
// Number of PIDs
int n = 4;
// The final sum
double total_sum = 0;
// Temporary variable holding the result from a subproccess
double temp;
// The partial sum tallied by a subproccess.
double sum = 0;
int k;
if (pipe(mypipe))
{
perror("pipe");
return EXIT_FAILURE;
}
// Create the PIDs
for (i = 0; i < 4; i++)
{
// Abort if something went wrong
if ((pids[i] = fork()) < 0)
{
perror("fork");
abort();
}
else if (pids[i] == 0)
// Depending on what PID number we are we'll only calculate
// 1/4 the total.
#if SPEED_TEST
for (k = 0; k < 25000; ++k)
{
sum = 0;
#endif
switch (i)
{
case 0:
sum += func(0);
for (j = 1; j <= 7999; ++j)
sum += TERN_STMT(j);
break;
case 1:
for (j = 8000; j <= 15999; ++j)
sum += TERN_STMT(j);
break;
case 2:
for (j = 16000; j <= 23999; ++j)
sum += TERN_STMT(j);
break;
case 3:
for (j = 24000; j < 32000; ++j)
sum += TERN_STMT(j);
sum += func(1);
break;
}
#if SPEED_TEST
}
#endif
// Write the data to the pipe
write(mypipe[1], &sum, sizeof(sum));
exit(0);
}
}
int status;
pid_t pid;
while (n > 0)
{
// Wait for the calculations to finish
pid = wait(&status);
// Read from the pipe
read(mypipe[0], &temp, sizeof(total_sum));
// Add to the total
total_sum += temp;
n--;
}
// Final calculations
total_sum /= 3.0;
total_sum *= (1.0/32000.0);
// Print our result
printf("%f\n", total_sum);
return EXIT_SUCCESS;
}
Here is a time result from the pthreads version running 100,000 times:
real 11.15
user 11.15
sys 0.00
And here is the child process version:
real 5.99
user 23.81
sys 0.00
Having a user time of 23.81 implies that that is the sum of the time each core took to execute the code. In the pthread analysis the real/user time is the same implying that only one core is being used. Why isn't it using all 4 cores? I thought by default it might do it better than child processes.
Hopefully this question makes sense, this is my first time programming with pthreads and I'm pretty new to OS-level programming in general.
Thanks for taking the time to read this lengthy question.

When you say pthread_join immediately after pthread_create, you're effectively serializing all the threads. Don't join threads until after you've created all the threads and done all the other work that doesn't need the result from the threaded computations.
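A sketch of that restructuring for the Simpson's rule program, with assumptions labeled: `simpson_pi`, `NUM_THREADS` and `STEPS` are hypothetical names, the speed-test loop is left out, and each thread gets its own `func_range` with a `sum` slot added for the result. The separate structs matter once the joins are moved out of the creation loop, because a single shared `our_range` would be overwritten while earlier threads are still reading it.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define STEPS 32000   /* number of Simpson intervals, as in the question */

struct func_range {
    int start;
    int end;          /* inclusive */
    double sum;       /* result slot, written only by the owning thread */
};

static double func(double x)
{
    return 4.0 / (1.0 + x * x);
}

static void *partial_sum(void *r)
{
    struct func_range *range = r;
    double sum = 0.0;
    for (int i = range->start; i <= range->end; i++)
        sum += ((i % 2 == 0) ? 2.0 : 4.0) * func((double)i / STEPS);
    range->sum = sum;
    return NULL;
}

/* Approximates pi; returns -1.0 if a thread could not be created. */
double simpson_pi(void)
{
    assert(STEPS % 2 == 0);   /* Simpson's rule needs an even interval count */
    pthread_t threads[NUM_THREADS];
    struct func_range ranges[NUM_THREADS];   /* one per thread */
    int per_thread = STEPS / NUM_THREADS;
    double total = func(0.0) + func(1.0);
    int created = 0;

    /* First loop: start all threads so they run concurrently. */
    for (int i = 0; i < NUM_THREADS; i++) {
        ranges[i].start = (i == 0) ? 1 : i * per_thread;
        ranges[i].end = (i + 1) * per_thread - 1;
        ranges[i].sum = 0.0;
        if (pthread_create(&threads[i], NULL, partial_sum, &ranges[i]) != 0)
            break;
        created++;
    }
    /* Second loop: wait for each one and collect its partial sum. */
    for (int i = 0; i < created; i++) {
        pthread_join(threads[i], NULL);
        total += ranges[i].sum;
    }
    if (created != NUM_THREADS)
        return -1.0;
    return total / 3.0 / STEPS;
}
```

Storing the result in the struct also removes the malloc/free pair per thread used in the question.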

Non recursive factorial in C

I have a simple question for you. I made this code to calculate the factorial of a number without recursion.
int fact2(int n){
int aux=1, total = 1;
int i;
int limit = n - 1;
for (i=1; i<=limit; i+=2){
aux = i*(i+1);
total = total*aux;
}
for (;i<=n;i++){
total = total*i;
}
return total;
}
As you can see, my code uses loop unrolling to optimize clock cycles in the execution. Now I'm asked to add two-way parallelism to the same code, any idea how?

You can use the pthreads library to create two separate threads. Each thread should do half of the multiplications. I put together the following solution.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct {
    int id;
    int num;
    int *result;
} thread_arg_t;
void* thread_func(void *arg) {
    int i;
    thread_arg_t *th_arg = (thread_arg_t *)arg;
    int start, end;
    if(th_arg->id == 0) {
        /* first thread multiplies 1 .. num/2 */
        start = 1;
        end = th_arg->num/2 + 1;
    } else if (th_arg->id == 1) {
        /* second thread multiplies num/2 + 1 .. num */
        start = th_arg->num/2 + 1;
        end = th_arg->num + 1;
    } else {
        return NULL;
    }
    for(i=start; i < end; i++) {
        th_arg->result[th_arg->id] *= i;
    }
    return NULL;
}
int factorial2(int n) {
    pthread_t threads[2];
    int rc;
    int i;
    int result[2] = {1, 1}; /* each partial product starts at 1 */
    thread_arg_t th_arg[2];
    for(i=0; i<2; i++) {
        th_arg[i].id = i;
        th_arg[i].num = n;
        th_arg[i].result = result;
        rc = pthread_create(&threads[i], NULL, thread_func, (void *)&th_arg[i]);
        if (rc){
            printf("pthread_create() failed, rc = %d\n", rc);
            exit(1);
        }
    }
    /* wait for threads to finish */
    for(i=0; i<2; i++) {
        pthread_join(threads[i], NULL);
    }
    /* compute final one multiplication */
    return (result[0] * result[1]);
}
The pthread library implementation should take care of parallelizing the work of two threads for you. Also, this example can be generalized for N threads with minor modifications.
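One way the generalization to N threads might look, as a sketch rather than the answerer's own code. The names `fact_job_t`, `multiply_range`, `factorial_parallel` and `MAX_THREADS` are hypothetical. It uses a 64-bit product and limits n to 20, since 21! no longer fits in 64 bits, and if a thread cannot be created the calling thread computes that range itself.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8   /* hypothetical upper bound on the thread count */

typedef struct {
    int start, end;               /* multiplies start..end, inclusive */
    unsigned long long product;   /* private partial product */
} fact_job_t;

static void *multiply_range(void *arg)
{
    fact_job_t *job = arg;
    unsigned long long p = 1;
    for (int i = job->start; i <= job->end; i++)
        p *= (unsigned long long)i;
    job->product = p;
    return NULL;
}

unsigned long long factorial_parallel(int n, int nthreads)
{
    assert(n >= 0 && n <= 20);   /* 20! is the largest factorial that fits in 64 bits */
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    fact_job_t jobs[MAX_THREADS];
    int started[MAX_THREADS];
    int per_thread = n / nthreads;
    unsigned long long total = 1;

    for (int t = 0; t < nthreads; t++) {
        jobs[t].start = t * per_thread + 1;
        /* the last thread also takes the remainder */
        jobs[t].end = (t == nthreads - 1) ? n : (t + 1) * per_thread;
        started[t] = (pthread_create(&tid[t], NULL, multiply_range, &jobs[t]) == 0);
        if (!started[t])
            multiply_range(&jobs[t]);   /* fall back to the calling thread */
    }
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total *= jobs[t].product;   /* combine the partial products */
    }
    return total;
}
```

For inputs this small the cost of creating threads far outweighs the multiplications, so this only illustrates the splitting pattern.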