Does a shared field in a multithreaded C program use a lot of CPU?

I'm writing a small program that uses a certain percentage of CPU. The basic strategy is to continuously check the CPU usage and make the process sleep if the usage is higher than the given value.
Moreover, since I'm using macOS (no /proc/stat like Linux, no PerformanceCounter like C#), I have to execute the top command in another thread and parse the CPU usage from its output.
The problem is that I keep getting very high CPU usage even when I pass a small value as the argument. After several experiments, it seems to be caused by the field shared between the threads.
Here are my code (code 1) and experiments:
(code 2) Initially I thought it was the shell commands making the usage high, so I commented out the infinite loop in run(), leaving only getCpuUsage() running. However, the CPU usage was nearly zero.
(code 3) Then I wrote another run() function, independent of cpuUsage, intended to use 50% of the CPU. It works well! I think the only difference between code 1 and code 3 is the use of cpuUsage. So I'm wondering: does sharing a field between threads use a lot of CPU?
code 1
const char CPU_COMMAND[] = "top -stats cpu -l 1 -n 0| grep CPU\\ usage | cut -c 12-15";
int cpuUsage; // shared field that stores the CPU usage

// thread that continuously checks the CPU usage
// and stores it in cpuUsage
void getCpuUsage() {
    char usage[3];
    FILE *out;
    while (1) {
        out = popen(CPU_COMMAND, "r");
        if (fgets(usage, 3, out) != NULL) {
            cpuUsage = atof(usage);
        } else {
            cpuUsage = 0;
        }
        pclose(out);
    }
}
// keep the CPU usage under ratio
void run(int ratio) {
    pthread_t id;
    int ret = pthread_create(&id, NULL, (void *)getCpuUsage, NULL);
    if (ret != 0) printf("thread error!");
    while (1) {
        // if the current CPU usage is higher than ratio, sleep
        if (cpuUsage > ratio) {
            usleep(10);
        }
    }
    pthread_join(id, NULL);
}
code 2
// keep the CPU usage under ratio
void run(int ratio) {
    pthread_t id;
    int ret = pthread_create(&id, NULL, (void *)getCpuUsage, NULL);
    if (ret != 0) printf("thread error!");
    /*while (1) {
        // if the current CPU usage is higher than ratio, sleep
        if (cpuUsage > ratio) {
            usleep(10);
        }
    }*/
    pthread_join(id, NULL);
}
code 3
void run() {
    const clock_t busyTime = 10;
    const clock_t idleTime = busyTime;
    while (1) {
        clock_t startTime = clock();
        while (clock() - startTime <= busyTime);
        usleep(idleTime);
    }
}

Does a shared field in a multithreaded C program use a lot of CPU?
Yes. Constant reads and writes to shared memory locations by multiple threads on multiple CPUs cause a cache line to constantly move between CPUs ("cache bouncing"). In my opinion, it's the single most important reason for poor scalability in naive "parallel" applications.
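As an illustration of the cache-line point (my addition, not from the original answer): a minimal C11 sketch of keeping two hot fields on separate cache lines, where the 64-byte line size is an assumption.

#include <stdalign.h>

/* Two counters updated by two different threads. Packed together they
 * share one cache line, so every write by one core invalidates the
 * other core's copy (the "bounce"). Padding each counter out to its
 * own (assumed 64-byte) line avoids that. */
struct counters_bouncing {
    long a;    /* written by thread 1 */
    long b;    /* written by thread 2: same cache line as a */
};

struct counters_padded {
    alignas(64) long a;    /* each counter now owns a full line */
    alignas(64) long b;
};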

OK. Code 1 creates a thread which does a popen as fast as possible, so this thread uses up all the CPU time. The other thread (the main thread) does usleep, but the popen-ing thread never sleeps.
Code 2 also starts this CPU-hungry thread and then waits for it to finish (join), which will never happen.
Code 3 runs for a while, then sleeps for the same amount of time, so it should use up around 50%.
So basically, if you really want to use top for this purpose, you should call it, then sleep for, say, 1 second or 100 ms, and see whether the main loop in code 1 adjusts:
while (1) {
    usleep(100 * 1000);
    out = popen(CPU_COMMAND, "r");
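Fleshing that out (a sketch only; the 100 ms period and the 8-byte buffer are arbitrary choices, and CPU_COMMAND and cpuUsage come from the question's code), the sampler thread might become:

// Sampler thread with the suggested sleep folded in, using the proper
// pthread signature. Sampling ~10x per second keeps the popen/top cost
// negligible instead of burning a full core.
void *getCpuUsage(void *unused) {
    char usage[8];                    /* room for values like "100.0" */
    while (1) {
        usleep(100 * 1000);           /* sleep 100 ms between samples */
        FILE *out = popen(CPU_COMMAND, "r");
        if (out == NULL)
            continue;
        if (fgets(usage, sizeof usage, out) != NULL)
            cpuUsage = atof(usage);
        else
            cpuUsage = 0;
        pclose(out);
    }
    return NULL;
}

Note that the main loop should also sleep on the not-over-ratio branch; otherwise it still spins at 100%.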

Related

How to implement a timer for every second with zero nanosecond with liburing?

I noticed the io_uring kernel side uses CLOCK_MONOTONIC, so for the first timer I get the time with both CLOCK_REALTIME and CLOCK_MONOTONIC, adjust the nanoseconds as below, and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout. iorn/clock.c at master · hnakamur/iorn
const long sec_in_nsec = 1000000000;

static int queue_timeout(iorn_queue_t *queue) {
    iorn_timeout_op_t *op = calloc(1, sizeof(*op));
    if (op == NULL) {
        return -ENOMEM;
    }

    struct timespec rts;
    int ret = clock_gettime(CLOCK_REALTIME, &rts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_REALTIME error: %s\n", strerror(errno));
        return -errno;
    }
    long nsec_diff = sec_in_nsec - rts.tv_nsec;

    ret = clock_gettime(CLOCK_MONOTONIC, &op->ts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_MONOTONIC error: %s\n", strerror(errno));
        return -errno;
    }

    op->handler = on_timeout;
    op->ts.tv_sec++;
    op->ts.tv_nsec += nsec_diff;
    if (op->ts.tv_nsec > sec_in_nsec) {
        op->ts.tv_sec++;
        op->ts.tv_nsec -= sec_in_nsec;
    }
    op->count = 1;
    op->flags = IORING_TIMEOUT_ABS;
    ret = iorn_prep_timeout(queue, op);
    if (ret < 0) {
        return ret;
    }
    return iorn_submit(queue);
}
From the second timer on, I just increment the seconds part tv_sec and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout.
Here is the output from my example program. The sub-second part shows zero milliseconds, but it is about 400 microseconds past the whole second.
on_timeout time=2020-05-10T14:49:42.000442
on_timeout time=2020-05-10T14:49:43.000371
on_timeout time=2020-05-10T14:49:44.000368
on_timeout time=2020-05-10T14:49:45.000372
on_timeout time=2020-05-10T14:49:46.000372
on_timeout time=2020-05-10T14:49:47.000373
on_timeout time=2020-05-10T14:49:48.000373
Could you tell me a better way than this?
Thanks for your comments! I'd like to update the current time for logging like ngx_time_update(). I modified my example to use just CLOCK_REALTIME, but it is still about 400 microseconds late. github.com/hnakamur/iorn/commit/… Does it mean clock_gettime takes about 400 nanoseconds on my machine?
Yes, that sounds about right, sort of. But, if you're on an x86 PC under Linux, 400 ns for clock_gettime overhead may be a bit high (an order of magnitude higher than what I measure below). If you're on an ARM CPU (e.g. Raspberry Pi, nvidia Jetson), it might be okay.
I don't know how you're getting 400 microseconds. But, I've had to do a lot of realtime stuff under linux, and 400 us is similar to what I've measured as the overhead to do a context switch and/or wakeup a process/thread after a syscall suspends it.
I never use gettimeofday anymore. I now just use clock_gettime(CLOCK_REALTIME,...) because it's the same except you get nanoseconds instead of microseconds.
Just so you know, although clock_gettime is a syscall, nowadays, on most systems, it uses the VDSO layer. The kernel injects special code into the userspace app, so that it is able to access the time directly without the overhead of a syscall.
If you're interested, you could run under gdb and disassemble the code to see that it just accesses some special memory locations instead of doing a syscall.
I don't think you need to worry about this too much. Just use clock_gettime(CLOCK_MONOTONIC,...) and set flags to 0. The overhead doesn't factor in here, given the way your iorn layer uses the io_uring call.
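For what it's worth, here is the relative-timeout variant I'm suggesting, sketched with plain liburing rather than the iorn wrapper (ring setup and completion handling are omitted, and error handling is trimmed):

#include <errno.h>
#include <liburing.h>

/* Arm a plain 1-second relative timeout (flags = 0, i.e. not
 * IORING_TIMEOUT_ABS). The ts struct must stay alive until the
 * completion arrives, hence static here. */
static int queue_relative_timeout(struct io_uring *ring)
{
    static struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL)
        return -EBUSY;

    io_uring_prep_timeout(sqe, &ts, 0, 0);  /* count = 0: pure timeout */
    return io_uring_submit(ring);
}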
When I do this sort of thing, and I want/need to calculate the overhead of clock_gettime itself, I call clock_gettime in a loop (e.g. 1000 times), and try to keep the total time below a [possible] timeslice. I use the minimum diff between times in each iteration. That compensates for any [possible] timeslicing.
The minimum is the overhead of the call itself [on average].
There are additional tricks that you can do to minimize latency in userspace (e.g. raising process priority, clamping CPU affinity and I/O interrupt affinity), but they can involve a few more things, and, if you're not very careful, they can produce worse results.
Before you start taking extraordinary measures, you should have a solid methodology to measure timing/benchmarking to prove that your results can not meet your timing/throughput/latency requirements. Otherwise, you're doing complicated things for no real/measurable/necessary benefit.
Below is some code I just created, simplified, but based on code I already have/use to calibrate the overhead:
#include <stdio.h>
#include <time.h>

#define ITERMAX 10000

typedef long long tsc_t;

// tscget -- get time in nanoseconds
static inline tsc_t
tscget(void)
{
    struct timespec ts;
    tsc_t tsc;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    tsc = ts.tv_sec;
    tsc *= 1000000000;
    tsc += ts.tv_nsec;

    return tsc;
}

// tscsec -- convert nanoseconds to fractional seconds
double
tscsec(tsc_t tsc)
{
    double sec;

    sec = tsc;
    sec /= 1e9;

    return sec;
}

tsc_t
calibrate(void)
{
    tsc_t tscbeg;
    tsc_t tscold;
    tsc_t tscnow;
    tsc_t tscdif;
    tsc_t tscmin;
    int iter;

    tscmin = 1LL << 62;
    tscbeg = tscget();
    tscold = tscbeg;

    for (iter = ITERMAX; iter > 0; --iter) {
        tscnow = tscget();

        tscdif = tscnow - tscold;
        if (tscdif < tscmin)
            tscmin = tscdif;

        tscold = tscnow;
    }

    tscdif = tscnow - tscbeg;

    printf("MIN:%.9f TOT:%.9f AVG:%.9f\n",
        tscsec(tscmin), tscsec(tscdif), tscsec(tscnow - tscbeg) / ITERMAX);

    return tscmin;
}

int
main(void)
{
    calibrate();
    return 0;
}
On my system, a 2.67GHz Core i7, the output is:
MIN:0.000000019 TOT:0.000254999 AVG:0.000000025
So, I'm getting 25 ns overhead [and not 400 ns]. But, again, each system can be different to some extent.
UPDATE:
Note that x86 processors have "speed step". The OS can adjust the CPU frequency up or down semi-automatically. Lower speeds conserve power. Higher speeds are maximum performance.
This is done with a heuristic (e.g. if the OS detects that the process is a heavy CPU user, it will up the speed).
To force maximum speed, linux has this directory:
/sys/devices/system/cpu/cpuN/cpufreq
Where N is the cpu number (e.g. 0-7)
Under this directory, there are a number of files of interest. They should be self explanatory.
In particular, look at scaling_governor. It has either ondemand [kernel will adjust as needed] or performance [kernel will force maximum CPU speed].
To force maximum speed, as root, set this [once] to performance (e.g.):
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Do this for all cpus.
However, I just did this on my system, and it had little effect. So, the kernel's heuristic may have improved.
As to the 400 us: when a process that has been waiting on something is "woken up", this is a two-step process.
The process is marked "runnable".
At some point, the system/CPU does a reschedule. The process will be run, based upon the scheduling policy and the process priority in effect.
For many syscalls, the reschedule [only] occurs on the next system timer/clock tick/interrupt. So, for some, there can be a delay of up to a full clock tick; for an HZ value of 1000, this can be up to 1 ms (1000 us) later.
On average, this is one half of HZ or 500 us.
For some syscalls, when the process is marked runnable, a reschedule is done immediately. If the process has a higher priority, it will be run immediately.
When I first looked at this [circa 2004], I looked at all code paths in the kernel, and the only syscall that did the immediate reschedule was SysV IPC, for msgsnd/msgrcv. That is, when process A did msgsnd, any process B waiting for the given message would be run.
But, others did not (e.g. futex). They would wait for the timer tick. A lot has changed since then, and now, more syscalls will do the immediate reschedule. For example, I recently measured futex [invoked via pthread_mutex_*], and it seemed to do the quick reschedule.
Also, the kernel scheduler has changed. The newer scheduler can wakeup/run some things on a fraction of a clock tick.
So, for you, the 400 us, is [possibly] the alignment to the next clock tick.
But, it could just be the overhead of doing the syscall. To test that, I modified my test program to open /dev/null [and/or /dev/zero], and added read(fd,buf,1) to the test loop.
I got a MIN: value of 529 us. So, the delay you're getting could just be the amount of time it takes to do the task switch.
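The modified loop looked roughly like this (a reconstruction, reusing tsc_t, tscget, tscsec, and ITERMAX from the program above):

#include <fcntl.h>
#include <unistd.h>

// Same min-diff measurement as calibrate(), but with a 1-byte read of
// /dev/null added, so each iteration now pays for one real syscall.
tsc_t
calibrate_read(void)
{
    char buf[1];
    int fd = open("/dev/null", O_RDONLY);
    tsc_t tscold = tscget();
    tsc_t tscmin = 1LL << 62;

    for (int iter = ITERMAX; iter > 0; --iter) {
        read(fd, buf, 1);              // the added syscall under test
        tsc_t tscnow = tscget();
        if (tscnow - tscold < tscmin)
            tscmin = tscnow - tscold;
        tscold = tscnow;
    }

    close(fd);
    printf("MIN:%.9f\n", tscsec(tscmin));
    return tscmin;
}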
This is what I would call "good enough for now".
To get "razor's edge" response, you'd probably have to write a custom kernel driver and have the driver do this. This is what embedded systems would do if (e.g.) they had to toggle a GPIO pin on every interval.
But, if all you're doing is printf, the overhead of printf and the underlying write(1,...) tends to swamp the actual delay.
Also, note that when you do printf, it builds the output buffer and when the buffer in FILE *stdout is full, it flushes via write.
For best performance, it's better to do int len = sprintf(buf,"current time is ..."); write(1,buf,len);
Also, when you do this, if the kernel buffers for TTY I/O get filled [which is quite possible given the high frequency of messages you're doing], the process will be suspended until the I/O has been sent to the TTY device.
To do this well, you'd have to watch how much space is available and skip some messages if there isn't enough space to [wholly] contain them.
You'd need to do ioctl(1,TIOCOUTQ,...) to get the available space and skip some messages if it is less than the size of the message you want to output (e.g. the len value above).
For your usage, you're probably more interested in the latest time message rather than outputting all messages [which would eventually produce a lag].
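A sketch of that skip logic (my reconstruction; note that TIOCOUTQ reports how many bytes are still queued in the kernel's TTY output buffer, so this version skips when the backlog exceeds an arbitrary threshold rather than computing exact free space):

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Drop a log message instead of blocking when the TTY is backed up.
 * The 4096-byte backlog threshold is an arbitrary choice. */
static void log_latest(const char *msg)
{
    int queued = 0;
    if (ioctl(1, TIOCOUTQ, &queued) == 0 && queued > 4096)
        return;                              /* backed up: skip message */

    char buf[256];
    int len = snprintf(buf, sizeof buf, "%s\n", msg);
    if (len > (int)sizeof buf)
        len = sizeof buf;                    /* clamp if truncated */
    write(1, buf, len);                      /* single unbuffered write */
}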

Thread segmentation fault using FFTW

We have C software on an i.MX53 (OS based on Linux) that has to compute an FFT; for this purpose we have adopted the FFTW library. For performance reasons we decided to separate the FFT from the main application, using a separate thread. The overall application seems to work, but after a while we get a segmentation fault at fftwf_execute. I am sure of this because without this single instruction we have no segmentation faults. We have made several attempts but the problem persists. Here is part of the thread function:
void* vGestDiag_ThreadFFT( void* unused )
{
    Int32U idx = 0, idxI = 0, idxJ = 0, idxZ = 0, idxK = 0;
    Flo32 lfBufferAccm_chn01[LEN_BUFFER_SAMPLES];
    Flo32 lfBufferAccm_chn02[LEN_BUFFER_SAMPLES];
    Flo64 dblBufferFFT[LEN_BUFFER_SAMPLES];
    Int32U ulCntUtilSample = 0;
    float *in;
    fftwf_complex *out;
    fftwf_plan plan;
    /* other variables.... */

    /* Init */
    memset(lfBufferAccm_chn01, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo32));
    memset(lfBufferAccm_chn02, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo32));
    memset(dblBufferFFT, 0x00, LEN_BUFFER_SAMPLES*sizeof(Flo64));
    /* other local memsets .... */

    /* Inputs */
    pthread_mutex_lock(&lockIN);
    ulCntUtilSample = wulCntUtilSample;
    /* other inputs.... */
    for (idxJ = 0; idxJ < ulCntUtilSample; idxJ++)
    {
        boBuffCirc_ReadBuffer(&wulBufferAcc01, &ulTmpValue);
        lfBufferAccm_chn01[idxJ] = (Flo32)((((Flo32)ulTmpValue - ACC_Q)/ACC_M) * ACC_U) * wlfSensAcc;
        boBuffCirc_ReadBuffer(&wulBufferAcc02, &ulTmpValue);
        lfBufferAccm_chn02[idxJ] = (Flo32)((((Flo32)ulTmpValue - ACC_Q)/ACC_M) * ACC_U) * wlfSensAcc;
    }
    pthread_mutex_unlock(&lockIN);

    /* --------- Plan FFT ------------------------- */
    in = (float*) fftwf_malloc(sizeof(float) * ulCntUtilSample);
    out = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * ulCntUtilSample);
    fftwf_plan_with_nthreads(1);
    plan = fftwf_plan_dft_r2c_1d(ulCntUtilSample, in, out, FFTW_ESTIMATE);

    for (idxI = 0; idxI <= 1; idxI++)
    {
        switch (idxI)
        {
            case 0:
                memcpy(in, lfBufferAccm_chn01, ulCntUtilSample*sizeof(float));
                break;
            case 1:
                memcpy(in, lfBufferAccm_chn02, ulCntUtilSample*sizeof(float));
                break;
            default:
                break;
        }

        /* --------- FFT ------------------------- */
        /* EXEC */
        fftwf_execute(plan);

        /* Complex -> Real */
        for (idxZ = 0; idxZ < ulCntUtilSample; idxZ++)
        {
            dblBufferFFT[idxZ] = cabs(out[idxZ]);
        }
        /* --------- End FFT ------------------------- */

        /* Post-Processing FFT */
        /* post-processing and outputs in module static variables, within mutex */
    }

    /* DEL plan */
    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);

    /* exit */
    pthread_exit(NULL);
}
Variables starting with 'w' are module static variables, and LEN_BUFFER is oversized with respect to the number of samples.
Thanks everyone for helping!!
This is my first project involving threads, and my coworkers don't have much experience with them either, so we failed to consider several issues. The first was the difference between DETACHED and JOINABLE: our application does not have to wait for the thread to complete; indeed, that was what led us to threads in the first place (the main thread waited a long time for the FFT to complete, although this was not necessary). The previous version of my software used the default, joinable, so the resources allocated to the thread were never freed. The second point we had not considered is that creating a thread requires additional resources. In the previous version of my software, a thread was created each time the FFT had to be computed; in our application this happens about every 10 seconds or more, so it seemed harmless; moreover, the earlier the FFT starts, the less data is used for the FFT (it's a long story, but that's the summary); finally, the mutexes and the logic of the algorithm seemed to protect the shared resources (it was quite improbable that a shared resource was used by the thread and the main at the same time). Unfortunately, the repeated thread creation saturated the memory after several cycles (we are working not on a PC but on a simple microprocessor....), and this, I believe, was the origin of the segmentation faults.
I have solved the problem in this way: the FFT thread is created at the start of the application, so only a single thread is ever created, alongside the main one. Inside the thread, an infinite loop periodically (at the moment, 4 times faster than the main loop) checks a shared variable that indicates a request for an FFT: if it is set, the thread reads and copies locally the shared memory that contains the FFT inputs and other required parameters. At the end of the processing, the outputs are saved in shared memory and another flag is set to indicate to the main loop that an FFT result is available. Each shared datum is accessed (read/write) under a mutex, both in the thread and in main, and only data used by the thread are guarded by mutexes. The thread is created detached, without any kills or exits, because it has to live "forever" alongside the main thread. Yesterday this software ran for several hours without problems (I stopped it manually), including under stress conditions.
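For anyone landing here, a condensed sketch of the design described above (all names and sizes are made up, and the FFT body itself is elided):

#include <pthread.h>
#include <string.h>
#include <unistd.h>

/* One long-lived worker polls a request flag, copies the shared inputs
 * locally under a mutex, computes, then publishes results under another
 * mutex, exactly as described above. */
static pthread_mutex_t lockIn  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lockOut = PTHREAD_MUTEX_INITIALIZER;
static int   fftRequested, fftReady;
static float sharedIn[4096], sharedOut[4096];

static void *fft_worker(void *unused)
{
    (void)unused;
    float in[4096], out[4096];

    for (;;) {
        usleep(250 * 1000);                  /* poll ~4x per main cycle */

        pthread_mutex_lock(&lockIn);
        int go = fftRequested;
        if (go) {
            memcpy(in, sharedIn, sizeof in); /* local copy of the inputs */
            fftRequested = 0;
        }
        pthread_mutex_unlock(&lockIn);
        if (!go)
            continue;

        /* ... run the FFT on `in`, producing `out` ... */

        pthread_mutex_lock(&lockOut);
        memcpy(sharedOut, out, sizeof out);
        fftReady = 1;                        /* tell main a result is ready */
        pthread_mutex_unlock(&lockOut);
    }
    return NULL;
}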

Need some help in C code for optimization (Poll + delay/sleep)

Currently I'm polling a register to get an expected value, and now I want to reduce the CPU usage and increase the performance.
So my idea is: poll for a short time (say 10 ms), and if we don't get the expected value, wait for some time (udelay(10*1000) or usleep(10*1000), i.e. a delay/sleep in ms), then continue polling for a longer stretch (say 100 ms); if we still don't get the expected value, delay/sleep for 100 ms, and so on, until we reach the maximum timeout value.
Please let me know if anything is unclear.
This is the old code:
#include <stdio.h>    /* for printf, perror */
#include <stdlib.h>   /* for exit */
#include <sys/time.h> /* for setitimer */
#include <unistd.h>   /* for pause */
#include <signal.h>   /* for signal */

#define INTERVAL 500 // timeout in ms

static int timedout = 0;
struct itimerval it_val; /* for setting itimer */
char temp_reg[2];

extern char read_reg(void); /* reads the hardware register */
void DoStuff(void);

int main(void)
{
    /* Upon SIGALRM, call DoStuff().
     * Set interval timer. We want frequency in ms,
     * but the setitimer call needs seconds and useconds. */
    if (signal(SIGALRM, (void (*)(int)) DoStuff) == SIG_ERR)
    {
        perror("Unable to catch SIGALRM");
        exit(1);
    }
    it_val.it_value.tv_sec = INTERVAL / 1000;
    it_val.it_value.tv_usec = (INTERVAL * 1000) % 1000000;
    it_val.it_interval = it_val.it_value;
    if (setitimer(ITIMER_REAL, &it_val, NULL) == -1)
    {
        perror("error calling setitimer()");
        exit(1);
    }

    do
    {
        temp_reg[0] = read_reg();
        // Read the register here and copy the value into the char array (temp_reg)
        if (timedout == 1)
            return -1; // Timed out
    } while (temp_reg[0] != 0); // Check the value and if nonzero read the register again (poll)
}

/*
 * DoStuff
 */
void DoStuff(void)
{
    timedout = 1;
    printf("Timer went off.\n");
}
Now I want to optimize this to reduce the CPU usage and improve the performance.
Can anyone help me with this issue?
Thanks for your help.
Currently I'm polling the register to get the expected value [...]
Wow wow wow, hold on a moment here, there is a huge story hidden behind this sentence. What is "the register"? What is "the expected value"? What does read_reg() do? Are you polling some external hardware? Well then, it all depends on how your hardware behaves.
There are two possibilities:
Your hardware buffers the values that it produces. This means that the hardware will keep each value available until you read it; it will detect when you have read the value, and then it will provide the next value.
Your hardware does not buffer values. This means that values are being made available in real time, for an unknown length of time each, and they are replaced by new values at a rate that only your hardware knows.
If your hardware is buffering, then you do not need to be afraid that some values might be lost, so there is no need to poll at all: just try reading the next value once and only once, and if it is not what you expect, sleep for a while. Each value will be there when you get around to reading it.
If your hardware is not buffering, then there is no strategy of polling and sleeping that will work for you. Your hardware must provide an interrupt, and you must write an interrupt-handling routine that will read every single new value as quickly as possible from the moment that it has been made available.
Here is some pseudo code that might help:
do
{
    // Pseudo code
    start_time = get_current_time();
    do
    {
        temp_reg[0] = read_reg();
        // Read the register here and copy the value into the char array (temp_reg)
        if (timedout == 1)
            return -1; // Timed out

        // Pseudo code
        stop_time = get_current_time();
        if (stop_time - start_time > some_limit) break;
    } while (temp_reg[0] != 0);

    if (temp_reg[0] != 0)
    {
        usleep(some_time);
        start_time = get_current_time();
    }
} while (temp_reg[0] != 0);
To turn the pseudo code into real code, see https://stackoverflow.com/a/2150334/4386427
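Filling in the pseudo code with CLOCK_MONOTONIC (per the linked answer) could look like the sketch below; read_reg() is the question's function, and the 10 ms burst, 10x escalation, and 500 ms overall timeout are assumptions:

#include <stdint.h>
#include <time.h>
#include <unistd.h>

extern char read_reg(void);               /* from the question's code */

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Poll hard for a short burst, then sleep, escalating both by 10x each
 * round until an overall deadline. Returns 0 on success, -1 on timeout. */
static int poll_reg_with_backoff(void)
{
    int64_t burst = 10 * 1000000LL;       /* first burst: 10 ms of polling */
    int64_t sleep_us = 10 * 1000;         /* first sleep: 10 ms */
    const int64_t deadline = now_ns() + 500 * 1000000LL;  /* 500 ms total */

    while (now_ns() < deadline) {
        int64_t burst_end = now_ns() + burst;
        if (burst_end > deadline)
            burst_end = deadline;
        do {
            if (read_reg() == 0)          /* expected value seen */
                return 0;
        } while (now_ns() < burst_end);

        if (now_ns() + sleep_us * 1000 >= deadline)
            break;                        /* no time left for another round */
        usleep((useconds_t)sleep_us);
        burst *= 10;                      /* escalate: 10 ms, 100 ms, ... */
        sleep_us *= 10;
    }
    return -1;                            /* timed out */
}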

Why is the multithreaded version of this program slower?

I am trying to learn pthreads and I have been experimenting with a program that tries to detect changes in an array. The function array_modifier() picks a random element and toggles its value (1 to 0 and vice versa) and then sleeps for some time (big enough that race conditions do not appear; I know this is bad practice). change_detector() scans the array, and when an element doesn't match its prior value and is equal to 1, the change is detected and the diff array is updated with the detection delay.
When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.
Here is the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define TIME_INTERVAL 100
#define CHANGES 5000
#define UNUSED(x) ((void) x)

typedef struct {
    unsigned int tid;
} parm;

static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;
static unsigned long int diff[NTHREADS] = {0};

void* array_modifier(void* args);
void* change_detector(void* arg);

int main(int argc, char** argv) {
    if (argc < 2) {
        exit(1);
    }
    N = (unsigned int)strtoul(argv[1], NULL, 0);
    my_array = calloc(N, sizeof(int));
    time_array = malloc(N * sizeof(struct timeval));
    old_value = calloc(N, sizeof(int));
    parm* p = malloc(NTHREADS * sizeof(parm));
    pthread_t generator_thread;
    pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));
    for (unsigned int i = 0; i < NTHREADS; i++) {
        p[i].tid = i;
        pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
    }
    pthread_create(&generator_thread, NULL, array_modifier, NULL);
    pthread_join(generator_thread, NULL);
    usleep(500);
    for (unsigned int i = 0; i < NTHREADS; i++) {
        pthread_cancel(detector_thread[i]);
    }
    for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
    fprintf(stderr, "\n");
    _exit(0);
}

void* array_modifier(void* arg) {
    UNUSED(arg);
    srand(time(NULL));
    unsigned int changing_signals = CHANGES;
    while (changing_signals--) {
        usleep(TIME_INTERVAL);
        const unsigned int r = rand() % N;
        gettimeofday(&time_array[r], NULL);
        my_array[r] ^= 1;
    }
    pthread_exit(NULL);
}

void* change_detector(void* arg) {
    const parm* p = (parm*) arg;
    const unsigned int tid = p->tid;
    const unsigned int start = tid * (N / NTHREADS) +
                               (tid < N % NTHREADS ? tid : N % NTHREADS);
    const unsigned int end = start + (N / NTHREADS) +
                             (tid < N % NTHREADS);
    unsigned int r = start;
    while (1) {
        unsigned int tmp;
        while ((tmp = my_array[r]) == old_value[r]) {
            r = (r < end - 1) ? r + 1 : start;
        }
        old_value[r] = tmp;
        if (tmp) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            // detection time in usec
            diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
        }
    }
}
when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100
I get:
665
but when I compile & run like this:
gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100
I get:
152 190 164 242
(this sums up to 748).
So, the delay for the multithreaded program is larger.
My cpu has 6 cores.
Short Answer
You are sharing memory between threads, and sharing memory between threads is slow.
Long Answer
Your program is using one thread to write to my_array and a number of other threads to read from my_array. Effectively my_array is shared by a number of threads.
Now let's assume you are benchmarking on a multicore machine; you are probably hoping that the OS will assign a different core to each thread.
Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance, CPUs have multi-level caches. The fastest cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20-30 cycles.
Now in lots of CPU architectures each core has its own L1 cache, but the L2 cache is shared. This means any data that is shared between threads (cores) has to go through the L2 cache, which is much slower than the L1 cache. So shared memory access tends to be quite slow.
Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.
Aside
Never rely on volatile to do the correct thing when sharing memory between threads; either use your library's atomic operations or use mutexes. This is because some CPUs allow out-of-order reads and writes that may do strange things if you do not know what you are doing.
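A minimal sketch of the atomics alternative with C11 <stdatomic.h>, reusing the question's names (the release/acquire pairing also orders the time_array[r] write before the flag flip):

#include <stdatomic.h>

/* The shared array becomes atomic; the modifier publishes with a
 * release store and the detector reads with an acquire load, instead
 * of relying on volatile for visibility and ordering. */
static _Atomic unsigned int *my_array;    /* allocated as in the question */

/* modifier side: gettimeofday(&time_array[r], NULL) happens before this */
static void publish_toggle(unsigned int r, unsigned int new_value) {
    atomic_store_explicit(&my_array[r], new_value, memory_order_release);
}

/* detector side: a nonzero result here makes time_array[r] safe to read */
static unsigned int observe(unsigned int r) {
    return atomic_load_explicit(&my_array[r], memory_order_acquire);
}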
It is rare that a multithreaded program scales perfectly with the number of threads. In your case, you measured a speed-up factor of about 0.9 (665/748) with 4 threads, i.e. a slowdown. That is not so good.
Here are some factors to consider:
The overhead of starting threads and dividing the work. For small jobs the cost of starting additional threads can be considerably larger than the actual work. Not applicable to this case, since the overhead isn't included in the time measurements.
"Random" variations. Your threads varied between 152 and 242. You should run the test multiple times and use either the mean or the median values.
The size of the test. Generally you get more reliable measurements on larger tests (more data). However, you need to consider how having more data affects the caching in the L1/L2/L3 caches. And if the data is too large to fit into RAM you need to factor in disk I/O. Multithreaded implementations are usually hit harder by this, because they touch more data at a time, but in rare instances they can be faster because the combined caches of several cores hold more of the working set, a phenomenon called super-linear speedup.
Overhead caused by inter-thread communication. Maybe not a factor in your case, since you don't have much of that.
Overhead caused by resource locking. Usually has a low impact on cpu utilization but may have a large impact on the total real time used.
Hardware optimizations. Some CPUs change the clock frequency depending on how many cores you use.
The cost of the measurement itself. In your case a change will be detected within 25 (100/4) iterations of the scanning loop. Each iteration takes but a few clock cycles. Then you call gettimeofday, which probably costs thousands of clock cycles. So what you are actually measuring is more or less the cost of calling gettimeofday.
I would increase the number of values to check and the cost to check each value. I would also consider turning off compiler optimizations, since these can cause the program to do unexpected things (or skip some things entirely).

Recursive Fib with Threads, Segmentation Fault?

Any ideas why it works fine for values like 0, 1, 2, 3, 4... and seg faults for values like >15?
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
void *fib(void *fibToFind);
main(){
    pthread_t mainthread;
    long fibToFind = 15;
    long finalFib;

    pthread_create(&mainthread, NULL, fib, (void*) fibToFind);
    pthread_join(mainthread, (void*)&finalFib);
    printf("The number is: %d\n", finalFib);
}

void *fib(void *fibToFind){
    long retval;
    long newFibToFind = ((long)fibToFind);
    long returnMinusOne;
    long returnMinustwo;
    pthread_t minusone;
    pthread_t minustwo;

    if(newFibToFind == 0 || newFibToFind == 1)
        return newFibToFind;
    else{
        long newFibToFind1 = ((long)fibToFind) - 1;
        long newFibToFind2 = ((long)fibToFind) - 2;
        pthread_create(&minusone, NULL, fib, (void*) newFibToFind1);
        pthread_create(&minustwo, NULL, fib, (void*) newFibToFind2);
        pthread_join(minusone, (void*)&returnMinusOne);
        pthread_join(minustwo, (void*)&returnMinustwo);
        return returnMinusOne + returnMinustwo;
    }
}
It runs out of memory (out of space for stacks), or out of valid thread handles.
You're asking for an awful lot of threads, which require lots of stack/context.
Windows (and Linux) have a stupid "big [contiguous] stack" idea.
From the documentation on pthread_create:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes."
If you manufacture 10,000 threads, you need 20 GB of RAM. I built a version of OP's program, and it bombed with some 3500 (p)threads on Windows XP64.
See this SO thread for more details on why big stacks are a really bad idea:
Why are stack overflows still a problem?
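(If you are stuck with pthreads, one partial mitigation is to shrink the per-thread stack before creating each thread; a sketch follows, where the 64 KB figure is an arbitrary pick. It only stretches the limit; it doesn't fix the exponential thread count.)

#include <limits.h>
#include <pthread.h>

/* Create a thread with a small explicit stack instead of the 2 MB
 * default. PTHREAD_STACK_MIN is the hard floor. */
static int create_small_stack_thread(pthread_t *tid,
                                     void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    size_t size = 64 * 1024;              /* arbitrary small stack */

    pthread_attr_init(&attr);
    if (size < PTHREAD_STACK_MIN)
        size = PTHREAD_STACK_MIN;
    pthread_attr_setstacksize(&attr, size);

    int rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}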
If you give up on big stacks and implement a parallel language with heap allocation for activation records (our PARLANSE is one of these), the problem goes away.
Here's the first (sequential) program we wrote in PARLANSE:
(define fibonacci_argument 45)

(define fibonacci
   (lambda(function natural natural )function
      `Given n, computes nth fibonacci number'
      (ifthenelse (<= ? 1)
         ?
         (+ (fibonacci (-- ?))
            (fibonacci (- ? 2))
         )+
      )ifthenelse
   )lambda
)define
Here's an execution run on an i7:
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
Result: 1134903170
Here's the second, which is parallel:
(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work

(define parallel_fibonacci
   (lambda (function natural natural )function
      `Given n, computes nth fibonacci number'
      (ifthenelse (<= ? coarse_grain_threshold)
         (fibonacci ?)
         (let (;; [n natural ] [m natural ] )
            (value (|| (= m (parallel_fibonacci (-- ?)) )=
                       (= n (parallel_fibonacci (- ? 2)) )=
                   )||
               (+ m n)
            )value
         )let
      )ifthenelse
   )lambda
)define
Making the parallelism explicit makes the programs a lot easier to write, too.
The parallel version we test by calling (parallel_fibonacci 45). Here is the execution run on the same i7 (which arguably has 8 processors, but it is really 4 processors hyperthreaded, so it really isn't quite 8 equivalent CPUs):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170
A speedup near 6+, not bad for not-quite-8 processors. One of the other answers to this question ran the pthreads version; it took "a few seconds" (to blow up) computing Fib(18), and this is 5.5 seconds for Fib(45). This tells you pthreads is a fundamentally bad way to do lots of fine-grain parallelism, because it has really, really high forking overhead. (PARLANSE is designed to minimize that forking overhead.)
Here's what happens if you set the technology constant to zero (forks on every call to fib):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170
You can see that amortizing fork overhead is a good idea, even if you have fast forks. Fib(45) produces a lot of grains. Heap allocation of activation records solves the OP's first-order problem (thousands of pthreads, each with 1 MB of stack, burn gigabytes of RAM).
But there's a second-order problem: 2^45 PARLANSE "grains" will burn all your memory too, just keeping track of the grains, even if your grain control block is tiny. So it helps to have a scheduler that throttles forks once you have "a lot" (for some definition of "a lot" significantly less than 2^45) of grains, to prevent the explosion of parallelism from swamping the machine with "grain"-tracking data structures. It has to unthrottle forks when the number of grains falls below a threshold too, to make sure there is always plenty of logical parallel work for the physical CPUs to do.
You are not checking for errors - in particular, from pthread_create(). When pthread_create() fails, the pthread_t variable is left undefined, and the subsequent pthread_join() may crash.
If you do check for errors, you will find that pthread_create() is failing. This is because you are trying to generate almost 2000 threads - with default settings, this would require 16GB of thread stacks to be allocated alone.
You should revise your algorithm so that it does not generate so many threads.
I tried to run your code, and came across several surprises:
printf("The number is: %d\n", finalFib);
This line has a small error: %d means printf expects an int, but is passed a long int. On most platforms this is the same, or will have the same behavior anyways, but pedantically speaking (or if you just want to stop the warning from coming up, which is a very noble ideal too), you should use %ld instead, which will expect a long int.
Your fib function, on the other hand, seems non-functional. Testing it on my machine, it doesn't crash, but it yields 1047, which is not a Fibonacci number. Looking closer, it seems your program is incorrect on several aspects:
void *fib(void *fibToFind)
{
    long retval; // retval is never used
    long newFibToFind = ((long)fibToFind);
    long returnMinusOne; // variable is read but never initialized
    long returnMinustwo; // variable is read but never initialized
    pthread_t minusone; // variable is never used (?)
    pthread_t minustwo; // variable is never used

    if (newFibToFind == 0 || newFibToFind == 1)
        // you miss a cast here (but you really shouldn't do it this way)
        return newFibToFind;
    else {
        long newFibToFind1 = ((long)fibToFind) - 1; // variable is never used
        long newFibToFind2 = ((long)fibToFind) - 2; // variable is never used
        // reading undefined variables (and missing a cast)
        return returnMinusOne + returnMinustwo;
    }
}
Always take care of compiler warnings: when you get one, usually, you really are doing something fishy.
Maybe you should revise the algorithm a little: right now, all your function does is return the sum of two undefined values, hence the 1047 I got earlier.
Implementing the Fibonacci sequence using a recursive algorithm means you need to call the function again. As others noted, it's quite an inefficient way of doing it, but it's easy, so I guess all computer science teachers use it as an example.
The regular recursive algorithm looks like this:
int fibonacci(int iteration)
{
    if (iteration == 0 || iteration == 1)
        return 1;
    return fibonacci(iteration - 1) + fibonacci(iteration - 2);
}
I don't know to what extent you were supposed to use threads: just run the algorithm on a secondary thread, or create new threads for each call? Let's assume the first for now, since it's a lot more straightforward.
Casting integers to pointers and vice-versa is a bad practice because if you try to look at things at a higher level, they should be widely different. Integers do maths, and pointers resolve memory addresses. It happens to work because they're represented the same way, but really, you shouldn't do this. Instead, you might notice that the function called to run your new thread accepts a void* argument: we can use it to convey both where the input is, and where the output will be.
So building upon my previous fibonacci function, you could use this code as the thread main routine:
void* fibonacci_offshored(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;
    *pointer_to_number = fibonacci(input);
    return NULL;
}
It expects a pointer to an integer, takes its input from it, then writes its output there.¹ You would then create the thread like that:
int main()
{
    int value = 15;
    pthread_t thread;

    // on input, value should contain the number of iterations;
    // after the end of the function, it will contain the result of
    // the fibonacci function
    int result = pthread_create(&thread, NULL, fibonacci_offshored, &value);
    // error checking is important! try to crash gracefully at the very least
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }
    if (pthread_join(thread, NULL) != 0)
    {
        perror("pthread_join");
        return 1;
    }

    // now, value contains the output of the fibonacci function
    // (note that value is an int, so just %d is fine)
    printf("The value is %d\n", value);
    return 0;
}
If you need to call the Fibonacci function from new distinct threads (please note: that's not what I'd advise, and others seem to agree with me; it will just blow up for a sufficiently large number of iterations), you'll first need to merge the fibonacci function with the fibonacci_offshored function. It will considerably bulk it up, because dealing with threads is heavier than dealing with regular functions.
void* threaded_fibonacci(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;

    if (input == 0 || input == 1)
    {
        *pointer_to_number = 1;
        return NULL;
    }

    // we need one argument per thread
    int minus_one_number = input - 1;
    int minus_two_number = input - 2;

    pthread_t minus_one;
    pthread_t minus_two;

    // don't forget to check! especially that in a recursive function where the
    // recursion set actually grows instead of shrinking, you're bound to fail
    // at some point
    if (pthread_create(&minus_one, NULL, threaded_fibonacci, &minus_one_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }
    if (pthread_create(&minus_two, NULL, threaded_fibonacci, &minus_two_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }

    if (pthread_join(minus_one, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }
    if (pthread_join(minus_two, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }

    *pointer_to_number = minus_one_number + minus_two_number;
    return NULL;
}
Now that you have this bulky function, adjustments to your main function are going to be quite easy: just change the reference to fibonacci_offshored to threaded_fibonacci.
int main()
{
    int value = 15;
    pthread_t thread;

    int result = pthread_create(&thread, NULL, threaded_fibonacci, &value);
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }

    pthread_join(thread, NULL);
    printf("The value is %d\n", value);
    return 0;
}
You might have been told that threads speed up parallel processes, but there's a limit somewhere where it's more expensive to set up the thread than run its contents. This is a very good example of such a situation: the threaded version of the program runs much, much slower than the non-threaded one.
For educational purposes, this program runs out of threads on my machine when the number of desired iterations is 18, and takes a few seconds to run. By comparison, using an iterative implementation, we never run out of threads, and we have our answer in a matter of milliseconds. It's also considerably simpler. This would be a great example of how using a better algorithm fixes many problems.
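For reference, the iterative implementation alluded to here can be written as follows, keeping the same fib(0) == fib(1) == 1 convention as the recursive version:

// Iterative Fibonacci: linear time, constant space, no threads needed.
int fibonacci_iterative(int iteration)
{
    int previous = 1, current = 1;
    for (int i = 2; i <= iteration; i++) {
        int next = previous + current;
        previous = current;
        current = next;
    }
    return current;
}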
Also, out of curiosity, it would be interesting to see if it crashes on your machine, and where/how.
1. Usually, you should try to avoid to change the meaning of a variable between its value on input and its value after the return of the function. For instance, here, on input, the variable is the number of iterations we want; on output, it's the result of the function. Those are two very different meanings, and that's not really a good practice. I didn't feel like using dynamic allocations to return a value through the void* return value.
