Why is multithreading slower than sequential programming in my case? - c

I'm new to multithreading and am trying to learn it through a simple program that adds the integers from 1 to n and returns the sum. In the sequential case, main calls the sumFrom1 function twice, for n = 1e4 and n = 2e4; in the multithreaded case, two threads are created with pthread_create and the two sums are calculated in separate threads. The multithreaded version is much slower than the sequential version (see results below). I run this on a 12-CPU platform and there is no communication between the threads.
Multithreaded:
Thread 1 returns: 0
Thread 2 returns: 0
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 156 seconds
Sequential:
sum of 1..10000: 50005000
sum of 1..20000: 200010000
time: 56 seconds
When I add -O2 at compile time, the multithreaded version (9s) is faster than the sequential version (11s), but not by as much as I expected. I could always leave -O2 on, but I'm curious why multithreading is so slow in the unoptimized case. Should it be slower than the sequential version? If not, what can I do to make it faster?
The code:
#include <stdio.h>
#include <pthread.h>
#include <time.h>
typedef struct my_struct
{
int n;
int sum;
}my_struct_t;
void *sumFrom1(void* sit)
{
my_struct_t* local_sit = (my_struct_t*) sit;
int i;
int nsim = 500000; // Loops for consuming time
int j;
for(j = 0; j < nsim; j++)
{
local_sit->sum = 0;
for(i = 0; i <= local_sit->n; i++)
local_sit->sum += i;
}
}
int main(int argc, char *argv[])
{
pthread_t thread1;
pthread_t thread2;
my_struct_t si1;
my_struct_t si2;
int iret1;
int iret2;
time_t t1;
time_t t2;
si1.n = 10000;
si2.n = 20000;
if(argc == 2 && atoi(argv[1]) == 1) // Use "./prog 1" to test the time of multithreaded version
{
t1 = time(0);
iret1 = pthread_create(&thread1, NULL, sumFrom1, (void*)&si1);
iret2 = pthread_create(&thread2, NULL, sumFrom1, (void*)&si2);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
t2 = time(0);
printf("Thread 1 returns: %d\n",iret1);
printf("Thread 2 returns: %d\n",iret2);
printf("sum of 1..%d: %d\n", si1.n, si1.sum);
printf("sum of 1..%d: %d\n", si2.n, si2.sum);
printf("time: %d seconds", t2 - t1);
}
else // Use "./prog" to test the time of sequential version
{
t1 = time(0);
sumFrom1((void*)&si1);
sumFrom1((void*)&si2);
t2 = time(0);
printf("sum of 1..%d: %d\n", si1.n, si1.sum);
printf("sum of 1..%d: %d\n", si2.n, si2.sum);
printf("time: %d seconds", t2 - t1);
}
return 0;
}
UPDATE1:
After a little googling on "false sharing" (thanks, @Martin James!), I think it is the main cause. There are (at least) two ways to fix it:
The first way is to insert a buffer zone between the two structs (thanks, @dasblinkenlight):
my_struct_t si1;
char memHolder[4096];
my_struct_t si2;
Without -O2, the running time decreases from ~156s to ~38s.
The second way is to avoid frequently updating sit->sum, which can be done with a temporary variable in sumFrom1 (as @Jens Gustedt replied):
int sum = 0; // local accumulator, declared outside the loop so it is still in scope afterwards
for(j = 0; j < nsim; j++)
{
sum = 0;
for(i = 0; i <= local_sit->n; i++)
sum += i;
}
local_sit->sum = sum;
Without -O2, the running time decreases from ~156s to either ~35s or ~109s (the timings cluster around two values; I don't know why). With -O2, the running time stays at ~8s.

By modifying your code to
typedef struct my_struct
{
size_t n;
size_t sum;
}my_struct_t;
void *sumFrom1(void* sit)
{
my_struct_t* local_sit = sit;
size_t nsim = 500000; // Loops for consuming time
size_t n = local_sit->n;
size_t sum = 0;
for(size_t j = 0; j < nsim; j++)
{
for(size_t i = 0; i <= n; i++)
sum += i;
}
local_sit->sum = sum;
return 0;
}
the phenomenon disappears. The problems you had:
Using int as a datatype is completely wrong for such a test. Your figures were such that the sum overflowed. Overflow of signed types is undefined behavior. You are lucky that it didn't eat your lunch.
Having bounds and summation variables behind an indirection buys you additional loads and stores, which at -O0 are really performed as such, with all the implications of false sharing and the like.
Your code also had other errors:
a missing include (<stdlib.h>) for atoi
superfluous casts to and from void*
printing of time_t as int
Please compile your code with -Wall before posting.
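The padding fix from the question's update can also be expressed in the type itself, so that it does not depend on how the compiler happens to lay out unrelated locals. The following is only a sketch and not part of the original answer: it assumes a C11 compiler and a 64-byte cache line (CACHE_LINE is a guess that holds on most current x86 parts; check your hardware), and the names padded_sum_t and sum_from_1 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Adjust for your CPU. */
#define CACHE_LINE 64

/* Aligning the first member aligns the whole struct to CACHE_LINE, and
 * sizeof() is rounded up to a multiple of the alignment, so two
 * neighbouring array elements never share a cache line. */
typedef struct padded_sum {
    _Alignas(CACHE_LINE) uint64_t n;
    uint64_t sum;
} padded_sum_t;

/* Has the signature pthread_create expects for a start routine. */
void *sum_from_1(void *arg)
{
    assert(arg != NULL);
    padded_sum_t *slot = arg;
    uint64_t n = slot->n;   /* read the bound once */
    uint64_t sum = 0;       /* accumulate in a local variable */

    for (uint64_t i = 0; i <= n; i++)
        sum += i;

    slot->sum = sum;        /* single store to shared memory */
    return NULL;
}
```

Declaring `padded_sum_t slots[2]` and passing `&slots[0]` and `&slots[1]` to pthread_create, as the question does with si1 and si2, keeps the two results on separate cache lines while the local accumulator removes the repeated stores.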

Related

Why MultiThread is slower than Single?? (Linux,C,using pthread)

I have a question about multithreading.
This code is a simple example comparing a single thread with multiple threads
(summing the integers up to 400,000,000 with a single thread vs. 4 threads).
//Single
#include<pthread.h>
#include<unistd.h>
#include<stdio.h>
#include<stdlib.h>
#define NUM_THREAD 4
#define MY_NUM 100000000
void* calcThread(void* param);
double total = 0;
double sum[NUM_THREAD] = { 0, };
int main() {
long p[NUM_THREAD] = {MY_NUM, MY_NUM * 2,MY_NUM * 3,MY_NUM * 4 };
int i;
long total_nstime;
struct timespec begin, end;
pthread_t tid[NUM_THREAD];
pthread_attr_t attr[NUM_THREAD];
clock_gettime(CLOCK_MONOTONIC, &begin);
for (i = 0; i < NUM_THREAD; i++) {
calcThread((void*)p[i]);
}
for (i = 0; i < NUM_THREAD; i++) {
total += sum[i];
}
clock_gettime(CLOCK_MONOTONIC, &end);
printf("total = %lf\n", total);
total_nstime = (end.tv_sec - begin.tv_sec) * 1000000000 + (end.tv_nsec - begin.tv_nsec);
printf("%.3fs\n", (float)total_nstime / 1000000000);
return 0;
}
void* calcThread(void* param) {
int i;
long to = (long)(param);
int from = to - MY_NUM + 1;
int th_num = from / MY_NUM;
for (i = from; i <= to; i++)
sum[th_num] += i;
}
I wanted to convert this to use 4 threads, so I changed main to run the calculation function in separate threads.
...
int main() {
...
//createThread
for (i = 0; i < NUM_THREAD; i++) {
pthread_attr_init(&attr[i]);
pthread_create(&tid[i],&attr[i],calcThread,(void *)p[i]);
}
//wait
for(i=0;i<NUM_THREAD;i++){
pthread_join(tid[i],NULL);
}
for (i = 0; i < NUM_THREAD; i++) {
total += sum[i];
}
clock_gettime(CLOCK_MONOTONIC, &end);
...
}
Result(in Ubuntu)
But it's slower than the single-threaded code. I know MultiThread is faster.
I have no idea what is causing this :( What's wrong?
Could you give me some advice? Thanks a lot!
"I know MultiThread is faster"
This isn't always the case: generally you are CPU bound in some way, whether because of the core count, how threads are scheduled by the OS, or the hardware itself.
There is a balance in how many threads are worth giving to a process; with too many, you may run into the old Linux problem of spending more time scheduling threads than actually running them.
As this is very hardware and OS dependent, it is difficult to say exactly what the issue may be. Make sure you have the appropriate microcode for your CPU installed (it is generally installed by default in Ubuntu); just in case, try:
sudo apt-get install intel-microcode
Otherwise, look at what other processes are running; it may be that a lot of other work is running on the cores allocated to your process.
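One more thing that may be worth ruling out, although the answer above does not mention it: in the question's code all four threads repeatedly update adjacent elements of the global sum[] array, and four doubles fit within one cache line on typical hardware, which is the false-sharing pattern discussed in the first question on this page. The sketch below is an assumption about what might help, not a confirmed diagnosis. Each thread accumulates into a local variable and writes its result once; chunk_t, sum_chunk, parallel_sum and MAX_THREADS are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16   /* hypothetical upper bound for this sketch */

typedef struct {
    long from, to;    /* inclusive range handled by this thread */
    double partial;   /* written once, when the thread finishes */
} chunk_t;

static void *sum_chunk(void *arg)
{
    chunk_t *c = arg;
    double local = 0.0;   /* private accumulator, not shared memory */

    for (long i = c->from; i <= c->to; i++)
        local += (double)i;

    c->partial = local;   /* one shared write per thread */
    return NULL;
}

/* Sums 1 .. chunk*nthreads, giving each thread one contiguous chunk. */
double parallel_sum(long chunk, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    chunk_t work[MAX_THREADS];
    int started[MAX_THREADS];
    double total = 0.0;

    assert(chunk > 0 && nthreads > 0 && nthreads <= MAX_THREADS);

    for (int t = 0; t < nthreads; t++) {
        work[t].from = (long)t * chunk + 1;
        work[t].to = (long)(t + 1) * chunk;
        work[t].partial = 0.0;
        started[t] = (pthread_create(&tid[t], NULL, sum_chunk, &work[t]) == 0);
        if (!started[t])
            sum_chunk(&work[t]);   /* fall back to running inline */
    }
    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += work[t].partial;
    }
    return total;
}
```

Calling `parallel_sum(100000000, 4)` covers the range from the question; the clock_gettime(CLOCK_MONOTONIC, ...) calls from the original main can be placed around it unchanged.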

Weird results: matrix multiplication using pthreads

I made a program that multiplies matrices of the same dimension using (p)threads. The program accepts command line flags -N n -M m where n is the size of the matrix arrays and m is the number of threads (computing threshold). The program compiles and runs but I get strange times for elapsed time, USR time, SYS time, and USR+SYS time. I am testing sizes n = {1000,2000,4000} with each threshold m = {1,2,4}.
I should be seeing the elapsed time reduced and a fairly constant USR+SYS time for each value of n but that is not the case. The output will fluctuate but the problem is that a higher threshold doesn't result in a reduction of elapsed time. Am I implementing threads wrong or is there an issue with my timing?
compile with: -pthread
./* -N n -M m
main
#include<stdio.h>
#include<stdlib.h>
#include<unistd.h>
#include<pthread.h>
#include<sys/time.h>
#include<sys/resource.h>
struct matrix {
double **Matrix_A;
double **Matrix_B;
int begin;
int end;
int n;
};
void *calculon(void *mtrx) {
struct matrix *f_mat = (struct matrix *)mtrx;
// transfer data
int f_begin = f_mat->begin;
int f_end = f_mat->end;
int f_n = f_mat->n;
// definition of temp matrix
double ** Matrix_C;
Matrix_C = (double**)malloc(sizeof(double)*f_n);
int f_pholder;
for(f_pholder=0; f_pholder < f_n; f_pholder++)
Matrix_C[f_pholder] = (double*)malloc(sizeof(double)*f_n);
int x, y, z;
for(x = f_begin; x < f_end; x++)
for(y = 0; y < f_n; y++)
for(z = 0; z < f_n; z++)
Matrix_C[x][y] += f_mat->Matrix_A[x][z]*f_mat->Matrix_B[z][y];
for(f_pholder = 0; f_pholder < f_n; f_pholder++)
free(Matrix_C[f_pholder]);
free(Matrix_C);
}
int main(int argc, char **argv) {
char *p;
int c, i, j, x, y, n, m, pholder, n_m, make_thread;
int m_begin = 0;
int m_end = 0;
while((c=getopt(argc, argv, "NM")) != -1) {
switch(c) {
case 'N':
n = strtol(argv[optind], &p, 10);
break;
case 'M':
m = strtol(argv[optind], &p, 10);
break;
default:
printf("\n**WARNING**\nUsage: -N n -M m");
break;
}
}
if(m > n)
printf("\n**WARNING**\nUsage: -N n -M m\n=> m > n");
else if(n%m != 0)
printf("\n**WARNING**\nUsage: -N n -M m\n=> n % m = 0");
else {
n_m = n/m;
// initialize input matrices
double ** thread_matrixA;
double ** thread_matrixB;
// allocate rows onto heap
thread_matrixA=(double**)malloc(sizeof(double)*n);
thread_matrixB=(double**)malloc(sizeof(double)*n);
// allocate columns onto heap
for(pholder = 0; pholder < n; pholder++) {
thread_matrixA[pholder]=(double*)malloc(sizeof(double)*n);
thread_matrixB[pholder]=(double*)malloc(sizeof(double)*n);
}
// populate input matrices with random numbers
for(i = 0; i < n; i++)
for(j = 0; j < n; j++)
thread_matrixA[i][j] = (double)rand()/RAND_MAX+1;
for(x = 0; x < n; x++)
for(y = 0; y < n; y++)
thread_matrixB[x][y] = (double)rand()/RAND_MAX+1;
printf("\n*** Matrix will be of size %d x %d *** \n", n, n);
printf("*** Creating matrix with %d thread(s) ***\n", m);
struct rusage r_usage;
struct timeval usage;
struct timeval time1, time2;
struct timeval cpu_time1, cpu_time2;
struct timeval sys_time1, sys_time2;
struct matrix mat;
pthread_t thread_lord[m];
// begin timing
getrusage(RUSAGE_SELF, &r_usage);
cpu_time1 = r_usage.ru_utime;
sys_time1 = r_usage.ru_stime;
gettimeofday(&time1, NULL);
for(make_thread = 0; make_thread < m; make_thread++) {
m_begin += n_m;
// assign values to struct
mat.Matrix_A = thread_matrixA;
mat.Matrix_B = thread_matrixB;
mat.n = n;
mat.begin = m_begin;
mat.end = m_end;
// create threads
pthread_create(&thread_lord[make_thread], NULL, calculon, (void *)&mat);
m_begin = (m_end + 1);
}
// wait for thread to finish before joining
for(i = 0; i < m; i++)
pthread_join(thread_lord[i], NULL);
// end timing
getrusage(RUSAGE_SELF, &r_usage);
cpu_time2 = r_usage.ru_utime;
sys_time2 = r_usage.ru_stime;
gettimeofday(&time2, NULL);
printf("\nUser time: %f seconds\n", ((cpu_time2.tv_sec * 1000000 + cpu_time2.tv_usec) - (cpu_time1.tv_sec * 1000000 + cpu_time1.tv_usec))/1e6);
printf("System time: %f seconds\n", ((sys_time2.tv_sec * 1000000 + sys_time2.tv_usec) - (sys_time1.tv_sec * 1000000 + sys_time1.tv_usec))/1e6);
printf("Wallclock time: %f seconds\n\n", ((time2.tv_sec * 1000000 + time2.tv_usec) - (time1.tv_sec * 1000000 + time1.tv_usec))/1e6);
// deallocate matrices
for(pholder = 0; pholder < n; pholder++) {
free(thread_matrixA[pholder]);
free(thread_matrixB[pholder]);
}
free(thread_matrixA);
free(thread_matrixB);
}
return 0;
}
Timing
My guess is that all those malloc()s you're using in the individual threads take a lot more time than you save by splitting the calculation among the threads. Math is fast; malloc() is slow. (To oversimplify a bit)
Sometimes weird timing behavior with threads happens when you have multiple threads trying to access a shared resource which is protected by some kind of exclusive lock. (Example, from something I did a long time ago) But I don't think that's the case here because, first of all, you don't seem to be using any locks, and second, the timing pattern that results typically has the runtime increasing by a little bit as you increase the number of threads. In this case, your runtime increases as you increase the number of threads, by a lot (specifically: it seems to be related to the thread number), so I suspect per-thread resource usage to be the culprit.
That being said, I'm having a hard time confirming my guess, so I can't be sure about this.
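If per-thread allocation is the suspect, one way to test that guess is to take allocation out of the threads entirely. The following is a minimal sketch, not the asker's code: it assumes flat row-major arrays instead of double**, an output matrix allocated once by the caller, and one argument struct per thread carrying a half-open row range. The names mm_task_t, mm_rows and mm_parallel are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *a, *b;     /* n*n inputs, row-major */
    double *c;               /* n*n output; each thread writes disjoint rows */
    int n;
    int row_begin, row_end;  /* half-open range [row_begin, row_end) */
} mm_task_t;

static void *mm_rows(void *arg)
{
    const mm_task_t *t = arg;
    int n = t->n;

    for (int i = t->row_begin; i < t->row_end; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;   /* accumulate locally, store once */
            for (int k = 0; k < n; k++)
                acc += t->a[i * n + k] * t->b[k * n + j];
            t->c[i * n + j] = acc;
        }
    return NULL;
}

/* Computes c = a * b with m threads.
 * Returns 0 on success, -1 on bad arguments or allocation/thread failure. */
int mm_parallel(const double *a, const double *b, double *c, int n, int m)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (n <= 0 || m <= 0 || m > n || n % m != 0)
        return -1;

    pthread_t *tid = malloc((size_t)m * sizeof *tid);
    mm_task_t *task = malloc((size_t)m * sizeof *task);
    if (!tid || !task) {
        free(tid);
        free(task);
        return -1;
    }

    int rows = n / m, started = 0, status = 0;
    for (int t = 0; t < m; t++) {
        /* One struct per thread: nothing a running thread reads is reused. */
        task[t] = (mm_task_t){ a, b, c, n, t * rows, (t + 1) * rows };
        if (pthread_create(&tid[t], NULL, mm_rows, &task[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    free(tid);
    free(task);
    return status;
}
```

Each thread gets its own task struct so that the values it reads cannot change while the next thread is being set up. The only allocations left inside the timed region are the two small bookkeeping arrays.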

Recursive function using pthreads in C

I have the following piece of code
#include "stdio.h"
#include "stdlib.h"
#include <string.h>
#define MAXBINS 8
void swap_long(unsigned long int **x, unsigned long int **y){
unsigned long int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void swap(unsigned int **x, unsigned int **y){
unsigned int *tmp;
tmp = x[0];
x[0] = y[0];
y[0] = tmp;
}
void truncated_radix_sort(unsigned long int *morton_codes,
unsigned long int *sorted_morton_codes,
unsigned int *permutation_vector,
unsigned int *index,
int *level_record,
int N,
int population_threshold,
int sft, int lv){
int BinSizes[MAXBINS] = {0};
unsigned int *tmp_ptr;
unsigned long int *tmp_code;
level_record[0] = lv; // record the level of the node
if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the permutation vector
memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes
return;
}
else{
// Find which child each point belongs to
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
// scan prefix
int offset = 0, i = 0;
for(i=0; i<MAXBINS; i++){
int ss = BinSizes[i];
BinSizes[i] = offset;
offset += ss;
}
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
permutation_vector[BinSizes[ii]] = index[j];
sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
BinSizes[ii]++;
}
//swap the index pointers
swap(&index, &permutation_vector);
//swap the code pointers
swap_long(&morton_codes, &sorted_morton_codes);
/* Call the function recursively to split the lower levels */
offset = 0;
for(i=0; i<MAXBINS; i++){
int size = BinSizes[i] - offset;
truncated_radix_sort(&morton_codes[offset],
&sorted_morton_codes[offset],
&permutation_vector[offset],
&index[offset], &level_record[offset],
size,
population_threshold,
sft-3, lv+1);
offset += size;
}
}
}
I tried to make this block
int j = 0;
for(j=0; j<N; j++){
unsigned int ii = (morton_codes[j]>>sft) & 0x07;
BinSizes[ii]++;
}
parallel by substituting it with the following
int rc,j;
pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
pthread_mutex_init(&bin_mtx, NULL);
for (j = 0; j < NTHREADS; j++){
belongs[j].n = NTHREADS;
belongs[j].N = N;
belongs[j].tid = j;
belongs[j].sft = sft;
belongs[j].BinSizes = BinSizes;
belongs[j].mcodes = morton_codes;
rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
}
for (j = 0; j < NTHREADS; j++){
rc = pthread_join(thread[j], NULL);
}
and defining these outside the recursive function
typedef struct{
int n, N, tid, sft;
int *BinSizes;
unsigned long int *mcodes;
}belong;
pthread_mutex_t bin_mtx;
void * belong_wrapper(void *arg){
int n, N, tid, sft, j;
int *BinSizes;
unsigned int ii;
unsigned long int *mcodes;
n = ((belong *)arg)->n;
N = ((belong *)arg)->N;
tid = ((belong *)arg)->tid;
sft = ((belong *)arg)->sft;
BinSizes = ((belong *)arg)->BinSizes;
mcodes = ((belong *)arg)->mcodes;
for (j = tid; j<N; j+=n){
ii = (mcodes[j] >> sft) & 0x07;
pthread_mutex_lock(&bin_mtx);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx);
}
}
However it takes a lot more time than the serial one to execute... Why is this happening? What should I change?
Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can call BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence but incurring the extra overhead of creating and destroying threads.
There are several options I can think of for you (there are probably more):
Do as @Chris suggests and make each thread update one portion of BinSizes. This might not be viable depending on the properties of the calculation you're using to compute ii.
Create multiple mutexes representing different partitions of BinSizes. For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4 and another for elements 5-9, then use them in your thread like so:
if (ii < 5) {
mtx_index = 0;
} else {
mtx_index = 1;
}
pthread_mutex_lock(&bin_mtx[mtx_index]);
BinSizes[ii]++;
pthread_mutex_unlock(&bin_mtx[mtx_index]);
You could generalize this idea to any size of BinSizes and any range: potentially you could have a different mutex for each array element. Of course, then you're opening yourself up to the overhead of creating each of these mutexes, and to the possibility of deadlock if someone tries to lock several of them at once, etc.
Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned using threads this way is subject to some level of diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".
tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallel, and this code has hardly any actual concurrency.
You lock and unlock a mutex for every (cheap) integer operation on BinSizes. This will crush any parallelism, because all your threads are serialized on it.
The few instructions you can run concurrently (the for loop and a couple of operations on the morton code array) are much cheaper than (un)locking a mutex: even using an atomic increment (if available) would be more expensive than the un-synchronized part.
One fix would be to give each thread its own output array, and combine them after all tasks are complete.
Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.
Even if you do this, you need to tune the number of threads according to how many (free) cores you have. If you do this in a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.
Oh, and you're leaking memory.
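The "own output array per thread, combined afterwards" suggestion above can be sketched as follows. This is an illustration under assumptions, not a drop-in replacement: the thread count is fixed at a hypothetical NTHREADS of 4, each thread gets a contiguous slice rather than a strided one, and hist_task_t, hist_worker and count_bins are made-up names. Threads are still created on every call, so for small N the serial loop will probably remain faster.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAXBINS 8
#define NTHREADS 4   /* assumption: tune to the number of free cores */

typedef struct {
    const unsigned long *codes;
    int begin, end;      /* half-open slice [begin, end) */
    int sft;
    int bins[MAXBINS];   /* private to this thread until it is joined */
} hist_task_t;

static void *hist_worker(void *arg)
{
    hist_task_t *t = arg;
    int local[MAXBINS] = {0};

    for (int j = t->begin; j < t->end; j++)
        local[(t->codes[j] >> t->sft) & 0x07]++;   /* no lock needed */

    memcpy(t->bins, local, sizeof local);
    return NULL;
}

/* Fills BinSizes[] with the number of codes that fall into each bin. */
void count_bins(const unsigned long *codes, int N, int sft, int BinSizes[MAXBINS])
{
    pthread_t tid[NTHREADS];
    hist_task_t task[NTHREADS];
    int started[NTHREADS];

    assert(N >= 0 && sft >= 0 && sft < (int)(sizeof(unsigned long) * 8));
    memset(BinSizes, 0, MAXBINS * sizeof BinSizes[0]);

    for (int t = 0; t < NTHREADS; t++) {
        task[t].codes = codes;
        task[t].begin = (int)((long long)N * t / NTHREADS);
        task[t].end = (int)((long long)N * (t + 1) / NTHREADS);
        task[t].sft = sft;
        started[t] = (pthread_create(&tid[t], NULL, hist_worker, &task[t]) == 0);
        if (!started[t])
            hist_worker(&task[t]);   /* fall back to running inline */
    }
    for (int t = 0; t < NTHREADS; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        for (int b = 0; b < MAXBINS; b++)   /* combine after the join */
            BinSizes[b] += task[t].bins[b];
    }
}
```

The merge step costs NTHREADS * MAXBINS additions, which is negligible next to one mutex lock per element.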

clock() returns 0 even with roughly a 5 second pause when using C

given the code below
#include<time.h>
#include <stdio.h>
#include <stdlib.h>
void firstSequence()
{
int columns = 999999;
int rows = 400000;
int **matrix;
int j;
int counter = 0;
matrix = (int **)malloc(columns*sizeof(int*));
for(j=0;j<columns;j++)
{
matrix[j]=(int*)malloc(rows*sizeof(int));
}
for(counter = 1;counter < columns; counter ++)
{
free(matrix[counter]);
}
}
void secondSequence()
{
int columns = 111;
int rows = 600000;
int **matrix;
int j;
matrix = (int **)malloc(columns*sizeof(int*));
for(j=0;j<columns;j++)
{
matrix[j]=(int*)malloc(rows*sizeof(int));
}
}
int main()
{
long t1;
long t2;
long diff;
t1 = clock();
firstSequence();
t2 = clock();
diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
printf("%f",t2);
t1 = clock();
secondSequence();
t2 = clock();
diff = (t2-t1) * 1000.0 / CLOCKS_PER_SEC;
printf("%f",diff);
return(0);
}
I need to be able to see how long it takes for both sequence one and sequence two to run. However, both times I get 0 as the elapsed time. From looking online I have seen that this can be an issue, but I do not know how to fix it.
You display the time incorrectly, so even if your functions take more than 0 ms, the call to printf() invokes undefined behaviour.
printf("%f",diff);
%f is used to display doubles. diff is a long, so you probably want to use %ld. (Note that the first printf also passes t2 rather than diff.)
If your functions really do take 0 ms to execute, then a simple way to measure the time for one call is to call the function multiple times, enough to be measurable, and then take the average of the total time elapsed.
clock() reports the processor time consumed by the program rather than elapsed wall-clock time, so it is not well suited to measuring how long something takes to run.
You should use clock_gettime instead.
Simple usage:
struct timespec start, end;
clock_gettime(CLOCK_REALTIME, &start);
for(int i = 0; i < 10000; i++) {
f1();
}
clock_gettime(CLOCK_REALTIME, &end);
printf("time elapsed = %f us\n", (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3);
PS: when compiling on Linux with an older glibc (before 2.17), remember to link with -lrt.
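The two suggestions above (use clock_gettime, and average over many calls) can be combined into a small helper. This is a sketch under assumptions: it uses CLOCK_MONOTONIC rather than CLOCK_REALTIME so that adjustments to the system clock do not distort the interval, and elapsed_seconds and time_per_call are hypothetical names.

```c
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime under strict -std=c99/c11 */
#include <assert.h>
#include <time.h>

/* Seconds between two timespec readings, as a double. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Calls fn() reps times and returns the average wall-clock seconds per call. */
double time_per_call(void (*fn)(void), int reps)
{
    struct timespec start, end;

    assert(fn != NULL && reps > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < reps; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_seconds(start, end) / reps;
}
```

With the question's functions this would be called as `time_per_call(firstSequence, 1)` and printed with `%f`, since the result is a double.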

Differences in OpenMP performance with different versions of OS

I have a piece of code that I wrote some time ago. Its only purpose was to experiment with OpenMP. I recently switched from a MacBook Pro running Lion (early 2011) to a MacBook Pro running Mountain Lion (early 2013). If more hardware or other information would help, I would be happy to provide it.
The code worked fine on the old machine, meaning 8 threads produced a 100% (98% min) load on my processor. Now the identical code, recompiled on my new machine, reaches only a 62% maximum processor load, even if I raise the number of threads. Both loads were measured with "istat pro".
My question is what can cause this to happen?
EDIT: The problem seems to be solved if I delete the for in #pragma omp parallel for shared(largest_factor, largest), so that I get #pragma omp parallel shared(largest_factor, largest).
But I still don't understand why it works.
The code in question:
#include <stdio.h>
#include <omp.h>
double fib(double n);
int main()
{
int data[] = {124847,194747,194747,194747,194747,
194747,194747,194747,194747,194747,194747};
int largest, largest_factor = 0;
omp_set_num_threads(8);
/* "omp parallel for" turns the for loop multithreaded by making each thread
* iterating only a part of the loop variable, in this case i; variables declared
* as "shared" will be implicitly locked on access
*/
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) {
int p, n = data[i];
for (p = 3; p * p <= n && n % p; p += 2);
printf("\n%f\n\n",fib(i+40));
if (p * p > n) p = n;
if (p > largest_factor) {
largest_factor = p;
largest = n;
printf("thread %d: found larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
else
{
printf("thread %d: not larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
}
printf("Largest factor: %d of %d\n", largest_factor, largest);
return 0;
}
double fib(double n)
{
if (n<=1)
{
return 1;
}
else
{
return fib(n-1)+fib(n-2);
}
}
The main reason you don't see all threads being used is that each iteration takes a different amount of time (due to the recursive function and the inner loop) and you only have 10 iterations. The fast iterations finish quickly, and then only a few threads are left running. When you first run your code it starts off at 100% and falls off as the fast threads finish while the last few slow threads are still running. If you change your iterations to 100 (and enlarge the data array) you will see the CPU usage at 100% for much longer. I added some timing printouts to your code.
Also, I think you have a race condition with your shared variables, so I put in a critical section.
To answer your question about the code without the "for": what that does is run the same code on eight different threads! Instead of each thread running particular iterations, every thread runs all 10 iterations. That's going to be no faster than running a single thread, and perhaps even slower.
Lastly, since each iteration generally takes a different amount of time, you should use schedule(dynamic), like this:
#pragma omp parallel for shared(largest_factor, largest) schedule(dynamic)
However, since you only have 10 iterations I don't think it will make much difference in this case. Here is what I did to your code to understand what is going on:
#include <stdio.h>
#include <omp.h>
double fib(double n);
int main()
{
int data[] = {124847,194747,194747,194747,194747,
194747,194747,194747,194747,194747,194747};
int largest, largest_factor = 0;
omp_set_num_threads(8);
/* "omp parallel for" makes the for loop multithreaded by having each thread
* iterate over only part of the range of the loop variable, in this case i;
* variables declared as "shared" are visible to all threads but are not
* locked on access, which is why the update below sits in a critical section
*/
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) {
int p, n = data[i];
double time = omp_get_wtime();
for (p = 3; p * p <= n && n % p; p += 2);
printf("\n iteration %d, fib %f\n\n",i, fib(i+40));
time = omp_get_wtime() - time;
printf("time %f\n", time);
if (p * p > n) p = n;
#pragma omp critical
{
if (p > largest_factor) {
largest_factor = p;
largest = n;
printf("thread %d: found larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
else {
printf("thread %d: not larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
}
}
printf("Largest factor: %d of %d\n", largest_factor, largest);
return 0;
}
double fib(double n) {
if (n<=1) {
return 1;
}
else {
return fib(n-1)+fib(n-2);
}
}
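A variation on the critical section used above, offered only as a sketch: each thread can track its own best candidate and enter the critical section once at the end, instead of once per iteration. The function names first_factor and largest_factor are hypothetical, the fib() call and the printouts are left out, and the code assumes compilation with -fopenmp (without it the pragmas are ignored and the loop simply runs serially).

```c
#include <assert.h>
#include <stddef.h>

/* Smallest odd divisor p >= 3 of n with p*p <= n, or n itself if there
 * is none: the same trial division as the loop in the question. */
static int first_factor(int n)
{
    int p;
    for (p = 3; p * p <= n && n % p; p += 2)
        ;
    return (p * p > n) ? n : p;
}

/* Finds the entry of data[] whose first_factor() is largest. */
void largest_factor(const int *data, int count, int *out_factor, int *out_value)
{
    int best_factor = 0, best_value = 0;

    assert(data != NULL && count > 0);

    #pragma omp parallel
    {
        int my_factor = 0, my_value = 0;   /* private to each thread */

        /* dynamic scheduling hands out iterations as threads become free */
        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < count; i++) {
            int p = first_factor(data[i]);
            if (p > my_factor) {
                my_factor = p;
                my_value = data[i];
            }
        }

        /* one short critical section per thread, not per iteration */
        #pragma omp critical
        {
            if (my_factor > best_factor) {
                best_factor = my_factor;
                best_value = my_value;
            }
        }
    }

    *out_factor = best_factor;
    *out_value = best_value;
}
```

If two entries tie for the largest factor, which value is reported depends on the order in which threads reach the critical section.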
