I have a problem with a #pragma omp parallel section in my code.
I have a program which should sort a given array of integers with quicksort using multiple threads. In every step, each thread gets assigned a portion of the array, partitions it, and returns how many elements are smaller than a given global pivot. The code executes without errors, but the more threads I tell OpenMP to use, the slower it runs. I added logging for the execution times, and it looks like a huge part of the runtime is spent on OpenMP overhead. The overhead seems to be consistent per call, so the speed difference is proportional to the size of the array being sorted.
Here is the code which is executed in parallel:
void create_count_elems_lower(int lower, int upper, int global_pivot_position, int *block_sizes, int *data,
                              int *count_elems_lower) {
    assert(lower >= 0);
    times_function_called++;
    int pivot = data[global_pivot_position];
    double start = omp_get_wtime();
    double wait_start = 0;
    double wait_time = 0;
    #pragma omp parallel for
    for (int i = 0; i < omp_get_max_threads(); ++i) {
        double start_thread = omp_get_wtime();
        int lower_p = lower;
        lower_p += i == 0 ? 0 : block_sizes[i - 1] * i;
        count_elems_lower[i] = partition_fixed_pivot(lower_p, lower_p + block_sizes[i], pivot, data) -
                               lower_p; // - lower_p since it needs to be relative
        assert(count_elems_lower[i] >= 0);
        double end_thread = omp_get_wtime();
        double time_spent = end_thread - start_thread;
        time_spent_per_thread_sum += time_spent;
        if (max_time_spent_per_thread[i] < time_spent) {
            max_time_spent_per_thread[i] = time_spent;
        }
        if (wait_start == 0) {
            wait_start = end_thread;
        } else {
            double time_waiting = end_thread - wait_start;
            if (time_waiting > wait_time) {
                wait_time = time_waiting;
            }
        }
    }
    double end = omp_get_wtime();
    time_spent_in_function += end - start;
    time_spent_idling += wait_time;
}
And here is the testing function:
printf("Num threads: %d\n", num_threads);
double start = omp_get_wtime();
test_sort_big();
double end = omp_get_wtime();
printf("total: %f\n", end - start);
printf("times function called: %f\n", times_function_called);
printf("time spent in create_count_elems_lower: %f\n", time_spent_in_function);
printf("time spent per thread approx: %f\n", time_spent_per_thread_sum / num_threads);
printf("time spent idling: %f\n", time_spent_idling);
for (int i = 0; i < num_threads; ++i) {
    printf("max time spent by thread %d: %f \t", i, max_time_spent_per_thread[i]);
}
The program gets compiled and linked with:
gcc -fopenmp -O3 -c -o tests/tests.o tests/tests.c
gcc -fopenmp -o build_test tests/tests.o array_utils.o datagenerator.o quicksort.o
And the results are:
Num threads: 1
Testing sorting of 10000000 Elements
total: 9.204632
times function called: 10000000.000000
time spent in create_count_elems_lower: 5.914602
time spent per thread approx: 1.610363
time spent idling: 0.000000
max time spent by thread 0: 0.041889
Num threads: 4
Testing sorting of 10000000 Elements
total: 16.955334
times function called: 10000000.000000
time spent in create_count_elems_lower: 12.598185
time spent per thread approx: 0.874607
time spent idling: 2.130419
max time spent by thread 0: 0.016055 max time spent by thread 1: 0.013543 max time spent by thread 2: 0.013532 max time spent by thread 3: 0.018599
I run Fedora 27 64-bit with an Intel® Core™ i7-2760QM CPU @ 2.40GHz × 8.
Edit:
As it turned out, the overhead was indeed the problem: the method gets called a huge number of times with only one thread's worth of work, so switching to a simple sequential sort whenever only one thread is available improved the runtime a lot.
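Roughly what that fix looks like, as a sketch only; insertion_sort and MIN_PARALLEL_SIZE are placeholder names, not the actual helpers in my project:

/* Fall back to a cheap serial sort for small ranges instead of paying the
 * OpenMP fork/join overhead on every recursion step. */
#define MIN_PARALLEL_SIZE 4096

void sort_range(int *data, int lower, int upper) {
    int n = upper - lower;
    if (n <= MIN_PARALLEL_SIZE || omp_get_max_threads() == 1) {
        insertion_sort(data, lower, upper);   /* serial path: no parallel region */
        return;
    }
    /* ...otherwise do the parallel partition step as before... */
}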
Related
Suppose I have a nested for loop and an if check, as shown below, and I want to see how many clock cycles (ultimately, how many seconds) a particular for loop or if check takes to finish executing.
Should the sum of the clock cycles (seconds) taken by the inner for loop and the if check be equal (or approximately equal) to the clock cycles (seconds) taken by the outermost for loop?
Or am I doing it wrong? How should I time the loops if there is a better way of doing it?
Note: I have three functions doing pretty much the same thing. I declared three separate functions so I can measure each for loop or if check individually, because if I try to time all the sub-components in the same piece of code, the clock cycles (seconds) reported for the outer for loop would also include the extra instructions that compute the clock counts for the inner for loop and the if check, I guess.
void fun1(){
    int i=0,j=0,k=0;
    clock_t t=0,t_start=0,t_end=0;
    //time the outermost forloop
    t_start = clock();
    for(i=0;i<100000;i++){
        for(j=0;j<1000;j++){
            //some code
        }
        if(k==0){
            //some code
        }
    }
    t_end = clock();
    t=t_end-t_start;
    double time_taken = ((double)t)/CLOCKS_PER_SEC;
    printf("outer for-loop took %f seconds to execute \n", time_taken);
}

void fun2(){
    int i=0,j=0,k=0;
    clock_t t2=0,t2_start=0,t2_end=0;
    for(i=0;i<100000;i++){
        //time the inner for loop
        t2_start=clock();
        for(j=0;j<1000;j++){
            //some code
        }
        t2_end=clock();
        t2+=(t2_end-t2_start);
        if(k==0){
            //some code
        }
    }
    double time_taken = ((double)t2)/CLOCKS_PER_SEC;
    printf("inner for-loop took %f seconds to execute \n", time_taken);
}

void fun3(){
    int i=0,j=0,k=0;
    clock_t t3=0,t3_start=0,t3_end=0;
    for(i=0;i<100000;i++){
        for(j=0;j<1000;j++){
            //some code
        }
        //time the if check
        t3_start=clock();
        if(k==0){
            //some code
        }
        t3_end=clock();
        t3+=(t3_end-t3_start);
    }
    double time_taken = ((double)t3)/CLOCKS_PER_SEC;
    printf("if-check took %f seconds to execute \n", time_taken);
}
The expected answer is that t in fun1 will likely be slightly more than t2 + t3 from fun2 and fun3 respectively, the difference being the additional time to evaluate the outer loop itself.
Less obvious, however, is the time added by the measurement itself, which is the cost of invoking clock() for each measurement. When measuring the inner loop and the if check, that cost is effectively multiplied by 100,000 because of the iterations of the outer loop.
Here's a program that measures the measurement itself and, for good measure, also measures the time to evaluate an empty outer loop.
#include <time.h>
#include <stdio.h>

int main () {
    clock_t t = 0;
    clock_t t_start, t_end;
    for (int i = 0; i < 100000; i++) {
        t_start = clock();
        t_end = clock();
        t += (t_end - t_start);
    }
    double time_taken = ((double) t) / CLOCKS_PER_SEC;
    printf ("Time imposed by measurement itself: %fsec\n", time_taken);

    t_start = clock();
    for (int i = 0; i < 100000; i++) {
    }
    t_end = clock();
    t = (t_end - t_start);
    time_taken = ((double) t) / CLOCKS_PER_SEC;
    printf ("Time to evaluate the loop: %fsec\n", time_taken);
}
Which, at least on my system, suggests the measurement may skew the results some:
Time imposed by measurement itself: 0.056949sec
Time to evaluate the loop: 0.000200sec
To get the amount of time your inner loops "really" take, you'll need to subtract the time added by the act of measuring them.
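One way to apply that correction (my illustration, not part of the original answer) is to calibrate the cost of a paired clock() call once and subtract it from the accumulated measurements; calibrate_clock_overhead is a hypothetical helper:

#include <time.h>

/* Estimate the average cost of one start/stop pair of clock() calls, so it
 * can be subtracted from per-iteration measurements. */
static clock_t calibrate_clock_overhead(int samples) {
    clock_t total = 0;
    for (int i = 0; i < samples; i++) {
        clock_t s = clock();
        clock_t e = clock();
        total += (e - s);
    }
    return total / samples;
}

/* Usage sketch: corrected_t2 = t2 - 100000 * calibrate_clock_overhead(100000); */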
I have two unsorted arrays and two copies of these arrays. I am using two different threads to sort the first two arrays, and then I sort the other two copies one by one. I thought the threaded version would be faster, but it's not. How can the threads take more time?
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>

struct thread_data
{
    int count;
    unsigned int *arr;
};

struct thread_data thread_data_array[2];

void insertionSort(unsigned int arr[], int n)
{
    int i, key, j;
    for (i = 1; i < n; i++)
    {
        key = arr[i];
        j = i-1;
        while (j >= 0 && arr[j] > key)
        {
            arr[j+1] = arr[j];
            j = j-1;
        }
        arr[j+1] = key;
    }
}

void *sortAndMergeArrays(void *threadarg)
{
    int count;
    unsigned int *arr;
    struct thread_data *my_data;
    my_data = (struct thread_data *) threadarg;
    count = my_data->count;
    arr = my_data->arr;
    insertionSort(arr, count);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int count, i, rc;
    clock_t start, end, total_t;
    pthread_t threads[2];

    //get the loop count. If loop count is not provided take 10000 as default loop count.
    if(argc == 2){
        count = atoi(argv[1]);
    }
    else{
        count = 10000;
    }

    unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
    srand(time(0));
    for(i = 0; i<count; i++){
        arr1[i] = rand();
        arr2[i] = rand();
        copyArr1[i] = arr1[i];
        copyArr2[i] = arr2[i];
    }

    start = clock();
    for(int t=0; t<2; t++) {
        thread_data_array[t].count = count;
        if(t==0)
            thread_data_array[t].arr = arr1;
        else
            thread_data_array[t].arr = arr2;
        rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *) &thread_data_array[t]);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    end = clock();
    total_t = (double)(end - start);
    printf("Total time taken by CPU to sort using threads: %d\n", total_t);

    start = clock();
    insertionSort(copyArr1, count);
    insertionSort(copyArr2, count);
    end = clock();
    total_t = (double)(end - start);
    printf("Total time taken by CPU to sort sequentially: %d\n", total_t);

    pthread_exit(NULL);
}
I am running the code on a Linux server. First I randomly populate the arrays and copy them into two separate arrays. For the first two arrays I create two threads with pthreads and pass one array to each; each thread sorts its array with insertion sort. The other two arrays I simply sort one after the other.
I expected that using threads would reduce the execution time, but it actually takes more time.
Diagnosis
The reason you get practically the same time (and slightly more time from the threaded code than from the sequential code) is that clock() measures CPU time, and the two ways of sorting take almost the same amount of CPU time because they're doing the same job; the threaded number is probably slightly bigger because of the time to set up and tear down the threads.
The POSIX specification of clock() says:
The clock() function shall return the implementation's best approximation to the processor time used by the process since the beginning of an implementation-defined era related only to the process invocation.
BSD (macOS) man page:
The clock() function determines the amount of processor time used since the invocation of the calling process, measured in CLOCKS_PER_SECs of a second.
The amount of CPU time it takes to sort the two arrays is basically the same; the difference is the overhead of threading (more or less).
Revised code
I have a set of functions that can use clock_gettime() instead (code in timer.c and timer.h at GitHub). These measure wall clock time — elapsed time, not CPU time.
Here's a mildly tweaked version of your code. The substantive changes were changing the type of key in the sort function from int to unsigned int to match the data in the array, and fixing the conversion specification from %d to %ld to match the type GCC identifies for clock_t. I also mildly tweaked the argument handling and the timing messages so that they're consistent in length, and added the elapsed-time measurement code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include "timer.h"

struct thread_data
{
    int count;
    unsigned int *arr;
};

struct thread_data thread_data_array[2];

static
void insertionSort(unsigned int arr[], int n)
{
    for (int i = 1; i < n; i++)
    {
        unsigned int key = arr[i];
        int j = i - 1;
        while (j >= 0 && arr[j] > key)
        {
            arr[j + 1] = arr[j];
            j = j - 1;
        }
        arr[j + 1] = key;
    }
}

static
void *sortAndMergeArrays(void *threadarg)
{
    int count;
    unsigned int *arr;
    struct thread_data *my_data;
    my_data = (struct thread_data *)threadarg;
    count = my_data->count;
    arr = my_data->arr;
    insertionSort(arr, count);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    int count = 10000;
    int i, rc;
    clock_t start, end, total_t;
    pthread_t threads[2];

    // get the loop count. If loop count is not provided take 10000 as default loop count.
    if (argc == 2)
        count = atoi(argv[1]);

    unsigned int arr1[count], arr2[count], copyArr1[count], copyArr2[count];
    srand(time(0));
    for (i = 0; i < count; i++)
    {
        arr1[i] = rand();
        arr2[i] = rand();
        copyArr1[i] = arr1[i];
        copyArr2[i] = arr2[i];
    }

    Clock clk;
    clk_init(&clk);
    start = clock();
    clk_start(&clk);
    for (int t = 0; t < 2; t++)
    {
        thread_data_array[t].count = count;
        if (t == 0)
            thread_data_array[t].arr = arr1;
        else
            thread_data_array[t].arr = arr2;
        rc = pthread_create(&threads[t], NULL, sortAndMergeArrays, (void *)&thread_data_array[t]);
        if (rc)
        {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    clk_stop(&clk);
    end = clock();

    char buffer[32];
    printf("Elapsed using threads: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    total_t = (double)(end - start);
    printf("CPU time using threads: %ld\n", total_t);

    start = clock();
    clk_start(&clk);
    insertionSort(copyArr1, count);
    insertionSort(copyArr2, count);
    clk_stop(&clk);
    end = clock();
    printf("Elapsed sequentially: %s s\n", clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    total_t = (double)(end - start);
    printf("CPU time sequentially: %ld\n", total_t);

    return 0;
}
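The timer.c/timer.h functions come from the GitHub code linked above; purely as a self-contained stand-in (my sketch of the interface as it is used here, not the actual implementation), something like this built on clock_gettime(CLOCK_MONOTONIC) would let the program compile and report elapsed wall-clock time:

/* Minimal stand-in for timer.h; the real timer.c/timer.h differ. */
#include <stdio.h>
#include <time.h>

typedef struct Clock { struct timespec start, stop; } Clock;

static void clk_init(Clock *clk)  { clk->start = clk->stop = (struct timespec){0, 0}; }
static void clk_start(Clock *clk) { clock_gettime(CLOCK_MONOTONIC, &clk->start); }
static void clk_stop(Clock *clk)  { clock_gettime(CLOCK_MONOTONIC, &clk->stop); }

/* Format the elapsed wall-clock time in seconds with microsecond precision. */
static char *clk_elapsed_us(Clock *clk, char *buffer, size_t buflen)
{
    double elapsed = (clk->stop.tv_sec - clk->start.tv_sec)
                   + (clk->stop.tv_nsec - clk->start.tv_nsec) / 1e9;
    snprintf(buffer, buflen, "%.6f", elapsed);
    return buffer;
}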
Results
Example runs (program inssortthread23) — run on a MacBook Pro (15" 2016) with 16 GiB RAM and 2.7 GHz Intel Core i7 CPU, running macOS High Sierra 10.13, using GCC 7.2.0 for compilation.
I had the routine background programs running, but nothing demanding: the browser was not being actively used, no music or videos playing, no downloads in progress, etc. (These things matter for benchmarking.)
$ inssortthread23 100000
Elapsed using threads: 1.060299 s
CPU time using threads: 2099441
Elapsed sequentially: 2.146059 s
CPU time sequentially: 2138465
$ inssortthread23 200000
Elapsed using threads: 4.332935 s
CPU time using threads: 8616953
Elapsed sequentially: 8.496348 s
CPU time sequentially: 8469327
$ inssortthread23 300000
Elapsed using threads: 9.984021 s
CPU time using threads: 19880539
Elapsed sequentially: 20.000900 s
CPU time sequentially: 19959341
$
Conclusions
Here, you can clearly see that:
The elapsed time is approximately twice as long for the non-threaded code as for the threaded code.
The CPU time for the threaded and non-threaded code is almost the same.
The overall time is quadratic in the number of elements sorted.
All of which are very much in line with expectations — once you realize that clock() is measuring CPU time, not elapsed time.
Minor puzzle
You can also see that I'm getting the threaded CPU time as slightly smaller than the CPU time for sequential sorts, some of the time. I don't have an explanation for that — I deem it 'lost in the noise', though the effect is persistent:
$ inssortthread23 100000
Elapsed using threads: 1.051229 s
CPU time using threads: 2081847
Elapsed sequentially: 2.138538 s
CPU time sequentially: 2132083
$ inssortthread23 100000
Elapsed using threads: 1.053656 s
CPU time using threads: 2089886
Elapsed sequentially: 2.128908 s
CPU time sequentially: 2122983
$ inssortthread23 100000
Elapsed using threads: 1.058283 s
CPU time using threads: 2093644
Elapsed sequentially: 2.126402 s
CPU time sequentially: 2120625
$
$ inssortthread23 200000
Elapsed using threads: 4.259660 s
CPU time using threads: 8479978
Elapsed sequentially: 8.872929 s
CPU time sequentially: 8843207
$ inssortthread23 200000
Elapsed using threads: 4.463954 s
CPU time using threads: 8883267
Elapsed sequentially: 8.603401 s
CPU time sequentially: 8580240
$ inssortthread23 200000
Elapsed using threads: 4.227154 s
CPU time using threads: 8411582
Elapsed sequentially: 8.816412 s
CPU time sequentially: 8797965
$
I have the following code that uses OpenMP to parallelize a Monte Carlo method. My question is: why does the serial version of the code (monte_carlo_serial) run a lot faster than the parallel version (monte_carlo_parallel)? I am running the code on a machine with 32 cores and get the following result printed to the console:
-bash-4.1$ gcc -fopenmp hello.c ;
-bash-4.1$ ./a.out
Pi (Serial): 3.140856
Time taken 0 seconds 50 milliseconds
Pi (Parallel): 3.132103
Time taken 127 seconds 990 milliseconds
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <time.h>

int niter = 1000000; //number of iterations per FOR loop

int monte_carlo_parallel() {
    double x,y; //x,y value for the random coordinate
    int i; //loop counter
    int count=0; //Count holds all the number of how many good coordinates
    double z; //Used to check if x^2+y^2<=1
    double pi; //holds approx value of pi
    int numthreads = 32;

    #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
    {
        srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
        for (i=0; i<niter; ++i) //main loop
        {
            x = (double)drand48(); //gets a random x coordinate
            y = (double)drand48(); //gets a random y coordinate
            z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
            if (z<=1)
            {
                ++count; //if it is, consider it a valid random point
            }
        }
    }
    pi = ((double)count/(double)(niter*numthreads))*4.0;
    printf("Pi (Parallel): %f\n", pi);
    return 0;
}

int monte_carlo_serial(){
    double x,y; //x,y value for the random coordinate
    int i; //loop counter
    int count=0; //Count holds all the number of how many good coordinates
    double z; //Used to check if x^2+y^2<=1
    double pi; //holds approx value of pi

    srand48((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value
    for (i=0; i<niter; ++i) //main loop
    {
        x = (double)drand48(); //gets a random x coordinate
        y = (double)drand48(); //gets a random y coordinate
        z = ((x*x)+(y*y)); //Checks to see if number is inside unit circle
        if (z<=1)
        {
            ++count; //if it is, consider it a valid random point
        }
    }
    pi = ((double)count/(double)(niter))*4.0;
    printf("Pi (Serial): %f\n", pi);
    return 0;
}

void main(){
    clock_t start = clock(), diff;

    monte_carlo_serial();
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);

    start = clock(), diff;
    monte_carlo_parallel();
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds \n", msec/1000, msec%1000);
}
The variable count is shared across all of your spawned threads. Each of them has to lock count to increment it. In addition, if the threads are running on separate CPUs (and there's no possible win if they're not), you have the cost of sending the value of count from one core to another and back again.
This is the textbook cache-line ping-pong problem, the same mechanism that is behind false sharing. In your serial version count will be in a register and cost one cycle to increment. In the parallel version it will usually not be in cache; you have to tell the other cores to invalidate that cache line, fetch it (L3 is going to take 66 cycles at best), increment it, and store it back. Every time count migrates from one CPU core to another you pay a minimum of roughly 125 cycles, which is a lot worse than 1. The threads will never be able to run truly in parallel because they all depend on count.
Try to modify your code so that each thread has its own count, then sum all the threads' counts at the end, and you might see a speedup.
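A rough sketch of that suggestion, reusing niter, numthreads, and the includes from the question's code. I've also swapped the global-state drand48() for erand48() with per-thread state, which is my own addition rather than part of this answer, since drand48() shares hidden global state across threads:

int count = 0;
#pragma omp parallel num_threads(numthreads)
{
    int local_count = 0;                       /* private to this thread */
    unsigned short state[3] = { 1, 2, (unsigned short)omp_get_thread_num() };
    for (int i = 0; i < niter; ++i) {
        double x = erand48(state);             /* thread-local RNG state */
        double y = erand48(state);
        if (x * x + y * y <= 1.0)
            ++local_count;
    }
    #pragma omp atomic
    count += local_count;                      /* one shared update per thread */
}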
I wrote a program based on the idea of Riemann sums to find the value of an integral. It uses several threads, but its performance, compared to a sequential program I wrote later, is subpar. Algorithm-wise the two are identical except for the threading, so the question is: what's wrong with it? pthread_join is not the problem, I assume, because if one thread finishes sooner than the thread currently being joined, its own join will simply return immediately when it is reached later. Is that correct? The free call is probably wrong, and there is no error check on thread creation; I'm aware of that, I deleted it along the way while testing various things. Sorry for the bad English, and thanks in advance.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include <time.h>

int counter = 0;
float sum = 0;
pthread_mutex_t mutx;

float function_res(float);

struct range {
    float left_border;
    int steps;
    float step_range;
};

void *calcRespectiveRange(void *ranges) {
    struct range *rangs = ranges;
    float left_border = rangs->left_border;
    int steps = rangs->steps;
    float step_range = rangs->step_range;
    free(rangs);
    //printf("left: %f steps: %d step range: %f\n", left_border, steps, step_range);
    int i;
    float temp_sum = 0;
    for(i = 0; i < steps; i++) {
        temp_sum += step_range * function_res(left_border);
        left_border += step_range;
    }
    sum += temp_sum;
    pthread_exit(NULL);
}

int main() {
    clock_t begin, end;
    if(pthread_mutex_init(&mutx, NULL) != 0) {
        printf("mutex error\n");
    }
    printf("enter range, amount of steps and threads: \n");
    float left_border, right_border;
    int steps_count;
    int threads_amnt;
    scanf("%f %f %d %d", &left_border, &right_border, &steps_count, &threads_amnt);
    float step_range = (right_border - left_border) / steps_count;
    int i;
    pthread_t tid[threads_amnt];
    float chunk = (right_border - left_border) / threads_amnt;
    int steps_per_thread = steps_count / threads_amnt;

    begin = clock();
    for(i = 0; i < threads_amnt; i++) {
        struct range *ranges;
        ranges = malloc(sizeof(ranges));
        ranges->left_border = i * chunk + left_border;
        ranges->steps = steps_per_thread;
        ranges->step_range = step_range;
        pthread_create(&tid[i], NULL, calcRespectiveRange, (void*) ranges);
    }
    for(i = 0; i < threads_amnt; i++) {
        pthread_join(tid[i], NULL);
    }
    end = clock();

    pthread_mutex_destroy(&mutx);
    printf("\n%f\n", sum);
    double time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf("Time spent: %lf\n", time_spent);
    return(0);
}

float function_res(float lb) {
    return(lb * lb + 4 * lb + 3);
}
Edit: in short - can it be improved to reduce execution time (with mutexes, for example)?
The execution time will be shortened, provided you have multiple hardware threads available.
The problem is in how you measure time: clock returns the processor time used by the program. That means it sums the time taken by all the threads. If your program uses 2 threads and its elapsed execution time is 1 second, then each thread has used 1 second of CPU time, and clock will return the equivalent of 2 seconds.
To get the actual time used (on Linux), use gettimeofday. I modified your code by adding
#include <sys/time.h>
and capturing the start time before the loop:
struct timeval tv_start;
gettimeofday( &tv_start, NULL );
and after:
struct timeval tv_end;
gettimeofday( &tv_end, NULL );
and calculating the difference in seconds:
printf("CPU Time: %lf\nTime passed: %lf\n",
time_spent,
((tv_end.tv_sec * 1000*1000.0 + tv_end.tv_usec) -
(tv_start.tv_sec * 1000*1000.0 + tv_start.tv_usec)) / 1000/1000
);
(I also fixed the malloc: malloc(sizeof(ranges)) allocates only the size of a pointer (4 or 8 bytes on a 32/64-bit CPU), so I changed it to malloc(sizeof(struct range)), which is 12 bytes here.)
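An aside from me (not part of the original fix): sizing the allocation from the pointer being assigned makes this particular mistake hard to repeat:

/* sizeof *ranges stays correct even if the type of 'ranges' changes later;
 * equivalent to malloc(sizeof(struct range)) here. */
struct range *ranges = malloc(sizeof *ranges);
if (ranges == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
}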
When running with the input parameters 0 1000000000 1000000000 1, that is, 1 billion iterations in 1 thread, the output on my machine is:
CPU Time: 4.352000
Time passed: 4.400006
When running with 0 1000000000 1000000000 2, that is, 1 billion iterations spread over 2 threads (500 million iterations each), the output is:
CPU Time: 4.976000
Time passed: 2.500003
For completeness sake, I tested it with the input 0 1000000000 1000000000 4:
CPU Time: 8.236000
Time passed: 2.180114
It is a little faster, but not twice as fast as with 2 threads, and it uses double the CPU time. This is because my CPU is a Core i3, a dual core with hyperthreading, and hyperthreads aren't true hardware threads.
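For completeness, the same elapsed-time measurement can be done with clock_gettime(CLOCK_MONOTONIC), which isn't affected by wall-clock adjustments. This is a sketch of an alternative dropped into the existing main, not what I used above:

#include <time.h>

struct timespec ts_start, ts_end;
clock_gettime(CLOCK_MONOTONIC, &ts_start);
/* ... create and join the threads ... */
clock_gettime(CLOCK_MONOTONIC, &ts_end);
double elapsed = (ts_end.tv_sec - ts_start.tv_sec)
               + (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;
printf("Time passed: %lf\n", elapsed);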
I have a piece of code that I wrote a while ago. Its only purpose was an experiment with OpenMP. I recently switched from a MacBook Pro running Lion (early 2011) to a MacBook Pro running Mountain Lion (early 2013). If more hardware or other info would help, I would be happy to provide it.
The code worked fine on the old machine, meaning 8 threads put a 100% (98% minimum) load on my processor. The identical code, recompiled on my new machine, gets only a 62% maximum processor load, even if I raise the number of threads. Both processor loads are measured with iStat Pro.
My question is: what can cause this to happen?
EDIT: The problem seems to be solved if I delete the for in #pragma omp parallel for shared(largest_factor, largest), so I get #pragma omp parallel shared(largest_factor, largest).
But I still don't understand why it works.
The code in question:
#include <stdio.h>
#include <omp.h>

double fib(double n);

int main()
{
    int data[] = {124847,194747,194747,194747,194747,
                  194747,194747,194747,194747,194747,194747};
    int largest, largest_factor = 0;
    omp_set_num_threads(8);
    /* "omp parallel for" turns the for loop multithreaded by making each thread
     * iterating only a part of the loop variable, in this case i; variables declared
     * as "shared" will be implicitly locked on access
     */
    #pragma omp parallel for shared(largest_factor, largest)
    for (int i = 0; i < 10; i++) {
        int p, n = data[i];
        for (p = 3; p * p <= n && n % p; p += 2);
        printf("\n%f\n\n",fib(i+40));
        if (p * p > n) p = n;
        if (p > largest_factor) {
            largest_factor = p;
            largest = n;
            printf("thread %d: found larger: %d of %d\n",
                   omp_get_thread_num(), p, n);
        }
        else
        {
            printf("thread %d: not larger: %d of %d\n",
                   omp_get_thread_num(), p, n);
        }
    }
    printf("Largest factor: %d of %d\n", largest_factor, largest);
    return 0;
}

double fib(double n)
{
    if (n<=1)
    {
        return 1;
    }
    else
    {
        return fib(n-1)+fib(n-2);
    }
}
The main reason you don't see all threads being used is that each thread takes a different amount of time (due to the recursive function or the inner loop) and you only have 10 iterations. The fast threads finish quickly, and then there are only a few threads left running. When you first run your code it starts off at 100% and falls off as the fast threads finish while the last few slow threads are still running. If you change your iteration count to 100 (and enlarge the data array) you will see the CPU usage stay at 100% for much longer. I added some timing printouts to your code.
Also, I think you have a race condition on your shared variables, so I put in a critical section.
To answer your question about the code without the "for" clause: what that does is run the same code on eight different threads! Instead of each thread handling a subset of the iterations, every thread runs all 10 iterations. That's going to be no faster than running a single thread, and perhaps even slower.
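Schematically (my illustration, not part of the original code), the difference between the two pragmas is:

/* Every thread executes the whole loop: 8 threads x 10 iterations = 80 executions of the body. */
#pragma omp parallel shared(largest_factor, largest)
for (int i = 0; i < 10; i++) { /* ... loop body ... */ }

/* The 10 iterations are divided among the 8 threads: 10 executions of the body in total. */
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) { /* ... loop body ... */ }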
Lastly, since each iteration takes a different amount of time, in general you should use schedule(dynamic), like this:
#pragma omp parallel for shared(largest_factor, largest) schedule(dynamic)
However, since you only have 10 iterations I don't think it will make much difference in this case. Here is what I did to your code to understand what is going on:
#include <stdio.h>
#include <omp.h>

double fib(double n);

int main()
{
    int data[] = {124847,194747,194747,194747,194747,
                  194747,194747,194747,194747,194747,194747};
    int largest, largest_factor = 0;
    omp_set_num_threads(8);
    /* "omp parallel for" turns the for loop multithreaded by making each thread
     * iterating only a part of the loop variable, in this case i; variables declared
     * as "shared" will be implicitly locked on access
     */
    #pragma omp parallel for shared(largest_factor, largest)
    for (int i = 0; i < 10; i++) {
        int p, n = data[i];
        double time = omp_get_wtime();
        for (p = 3; p * p <= n && n % p; p += 2);
        printf("\n iteration %d, fib %f\n\n",i, fib(i+40));
        time = omp_get_wtime() - time;
        printf("time %f\n", time);
        if (p * p > n) p = n;
        #pragma omp critical
        {
            if (p > largest_factor) {
                largest_factor = p;
                largest = n;
                printf("thread %d: found larger: %d of %d\n",
                       omp_get_thread_num(), p, n);
            }
            else {
                printf("thread %d: not larger: %d of %d\n",
                       omp_get_thread_num(), p, n);
            }
        }
    }
    printf("Largest factor: %d of %d\n", largest_factor, largest);
    return 0;
}

double fib(double n) {
    if (n<=1) {
        return 1;
    }
    else {
        return fib(n-1)+fib(n-2);
    }
}