I am writing a file system for one of my classes. This function is killing my performance by a LARGE margin and I can't figure out why. I've been staring at this code for way too long and I am probably missing something very obvious. Does anyone see why this function is so slow?
int getFreeDataBlock(struct disk *d, unsigned int dataBlockNumber)
{
    if (d == NULL)
    {
        fprintf(stderr, "Invalid disk pointer to getFreeDataBlock()\n");
        errorCheck();
        return -1;
    }

    // Allocate a buffer
    char *buffer = (char *) malloc(d->blockSize * sizeof(char));
    if (buffer == NULL)
    {
        fprintf(stderr, "Out of memory.\n");
        errorCheck();
        return -1;
    }

    do {
        // Read a block from the disk
        diskread(d, buffer, dataBlockNumber);

        // Cast to appropriate struct
        struct listDataBlock *block = (struct listDataBlock *) buffer;

        unsigned int i;
        for (i = 0; i < DATABLOCK_FREE_SLOT_LENGTH; ++i)
        {
            // We are in the last datalisting block...and out of slots...break
            if (block->listOfFreeBlocks[i] == -2)
            {
                break;
            }

            if (block->listOfFreeBlocks[i] != -1)
            {
                int returnValue = block->listOfFreeBlocks[i];
                // MARK THIS AS USED NOW
                block->listOfFreeBlocks[i] = -1;
                diskwriteNoSync(d, buffer, dataBlockNumber);

                // No memory leaks
                free(buffer);
                return returnValue;
            }
        }

        // Ok, nothing in this data block, move to next
        dataBlockNumber = block->nextDataBlock;
    } while (dataBlockNumber != -1);

    // Nope, didn't find any...disk must be full
    free(buffer);
    fprintf(stderr, "DISK IS FULL\n");
    errorCheck();
    return -1;
}
As you can see from the gprof output, neither diskread() nor diskwriteNoSync() is taking an extensive amount of time?
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
99.45      12.25    12.25     2051     5.97     5.99  getFreeDataBlock
 0.24      12.28     0.03  2220903     0.00     0.00  diskread
 0.24      12.31     0.03                             threadFunc
 0.08      12.32     0.01     2048     0.00     6.00  writeHelper
 0.00      12.32     0.00     6154     0.00     0.00  diskwriteNoSync
 0.00      12.32     0.00     2053     0.00     0.00  validatePath
or am I not understanding the output properly?
Thanks for any help.
The fact that you've been staring at this code and puzzling over the gprof output puts you in good company, because gprof and the concepts that are taught with it only work well with small, academic-scale programs that do no I/O. Here's the method I use.
Some excerpts from a useful post that got deleted, listing some myths about profiling:
that program counter sampling is useful.
It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it because now the hotspot is in string-compare. On the other hand if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed. In fact, gprof was an attempt to remedy the limitations of pc-only sampling.
that samples need not be taken when blocked
The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1) the program may very well be waiting for something that it asked for, such as file I/O, which you need to know, and which stack samples reveal. (Obviously you want to exclude samples while waiting for user input.) For (2) if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running.
So while the program may be taking longer, that will not have a large effect on the statistic that matters, the percentage of time that statements are on the stack.
that counting of statement or function invocations is useful.
Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.
There are more where those came from.
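To make that concrete, a rough sketch of one way to take stack samples on Linux is a SIGPROF timer plus backtrace(); whatever call chain shows up on a large fraction of the samples is where the time is going. (Note the limitation: a CPU-time timer will not fire while the program is blocked in I/O, which is exactly why pausing the program in a debugger and reading the stack by hand is often more revealing.)
#include <execinfo.h>
#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

// Dump the current call stack on every profiling tick.  (Not strictly
// async-signal-safe -- backtrace() may allocate on first use -- but fine
// for a quick look at a misbehaving program.)
static void on_sigprof(int sig)
{
    void *frames[64];
    int depth = backtrace(frames, 64);

    (void)sig;
    backtrace_symbols_fd(frames, depth, STDERR_FILENO);
    write(STDERR_FILENO, "----\n", 5);
}

// Arrange for SIGPROF every 10 ms of consumed CPU time.
static void start_stack_sampling(void)
{
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };

    signal(SIGPROF, on_sigprof);
    setitimer(ITIMER_PROF, &it, NULL);
}
Call start_stack_sampling() near the top of main() and build with -g (and -rdynamic, so backtrace_symbols_fd can print function names); then count which routines appear on most of the dumped stacks.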
Related
I noticed the io_uring kernel side uses CLOCK_MONOTONIC for its timeouts, so for the first timer I get the time with both CLOCK_REALTIME and CLOCK_MONOTONIC, adjust the nanoseconds as shown below, and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout (see iorn/clock.c at master · hnakamur/iorn).
const long sec_in_nsec = 1000000000;

static int queue_timeout(iorn_queue_t *queue) {
    iorn_timeout_op_t *op = calloc(1, sizeof(*op));
    if (op == NULL) {
        return -ENOMEM;
    }

    struct timespec rts;
    int ret = clock_gettime(CLOCK_REALTIME, &rts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_REALTIME error: %s\n", strerror(errno));
        return -errno;
    }
    long nsec_diff = sec_in_nsec - rts.tv_nsec;

    ret = clock_gettime(CLOCK_MONOTONIC, &op->ts);
    if (ret < 0) {
        fprintf(stderr, "clock_gettime CLOCK_MONOTONIC error: %s\n", strerror(errno));
        return -errno;
    }

    op->handler = on_timeout;
    op->ts.tv_sec++;
    op->ts.tv_nsec += nsec_diff;
    if (op->ts.tv_nsec > sec_in_nsec) {
        op->ts.tv_sec++;
        op->ts.tv_nsec -= sec_in_nsec;
    }
    op->count = 1;
    op->flags = IORING_TIMEOUT_ABS;

    ret = iorn_prep_timeout(queue, op);
    if (ret < 0) {
        return ret;
    }

    return iorn_submit(queue);
}
From the second time on, I just increment the tv_sec part and use the IORING_TIMEOUT_ABS flag for io_uring_prep_timeout.
Here is the output from my example program. The millisecond part is zero, but each timeout fires about 400 microseconds after the whole second.
on_timeout time=2020-05-10T14:49:42.000442
on_timeout time=2020-05-10T14:49:43.000371
on_timeout time=2020-05-10T14:49:44.000368
on_timeout time=2020-05-10T14:49:45.000372
on_timeout time=2020-05-10T14:49:46.000372
on_timeout time=2020-05-10T14:49:47.000373
on_timeout time=2020-05-10T14:49:48.000373
Could you tell me a better way than this?
Thanks for your comments! I'd like to update the current time for logging like ngx_time_update(). I modified my example to use just CLOCK_REALTIME, but still about 400 microseconds late. github.com/hnakamur/iorn/commit/… Does it mean clock_gettime takes about 400 nanoseconds on my machine?
Yes, that sounds about right, sort of. But, if you're on an x86 PC under linux, 400 ns for clock_gettime overhead may be a bit high (order of magnitude higher--see below). If you're on an arm CPU (e.g. Raspberry Pi, nvidia Jetson), it might be okay.
I don't know how you're getting 400 microseconds. But, I've had to do a lot of realtime stuff under linux, and 400 us is similar to what I've measured as the overhead to do a context switch and/or wakeup a process/thread after a syscall suspends it.
I never use gettimeofday anymore. I now just use clock_gettime(CLOCK_REALTIME,...) because it's the same except you get nanoseconds instead of microseconds.
Just so you know, although clock_gettime is a syscall, nowadays, on most systems, it uses the VDSO layer. The kernel injects special code into the userspace app, so that it is able to access the time directly without the overhead of a syscall.
If you're interested, you could run under gdb and disassemble the code to see that it just accesses some special memory locations instead of doing a syscall.
I don't think you need to worry about this too much. Just use clock_gettime(CLOCK_MONOTONIC,...) and set flags to 0. The overhead doesn't factor into this, for the purposes of the ioring call as your iorn layer is using it.
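For illustration, with plain liburing (rather than the iorn wrapper) the relative, flags = 0 version would look something like this, assuming the standard io_uring_prep_timeout(sqe, ts, count, flags) helper:
#include <errno.h>
#include <liburing.h>

// Queue a plain relative 1-second timeout (flags = 0).  The kernel measures
// relative timeouts against CLOCK_MONOTONIC internally.  The timespec must
// stay valid until the SQE is consumed, hence static here.
static int queue_one_second_timeout(struct io_uring *ring)
{
    static struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    if (sqe == NULL)
        return -EBUSY;                      /* submission queue is full */

    io_uring_prep_timeout(sqe, &ts, 0, 0);  /* count = 0, flags = 0 (relative) */
    return io_uring_submit(ring);
}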
When I do this sort of thing, and I want/need to calculate the overhead of clock_gettime itself, I call clock_gettime in a loop (e.g. 1000 times), and try to keep the total time below a [possible] timeslice. I use the minimum diff between times in each iteration. That compensates for any [possible] timeslicing.
The minimum is the overhead of the call itself [on average].
There are additional tricks that you can do to minimize latency in userspace (e.g. raising process priority, clamping CPU affinity and I/O interrupt affinity), but they can involve a few more things, and, if you're not very careful, they can produce worse results.
Before you start taking extraordinary measures, you should have a solid methodology to measure timing/benchmarking to prove that your results can not meet your timing/throughput/latency requirements. Otherwise, you're doing complicated things for no real/measurable/necessary benefit.
Below is some code I just created, simplified, but based on code I already have/use to calibrate the overhead:
#include <stdio.h>
#include <time.h>

#define ITERMAX 10000

typedef long long tsc_t;

// tscget -- get time in nanoseconds
static inline tsc_t
tscget(void)
{
    struct timespec ts;
    tsc_t tsc;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    tsc = ts.tv_sec;
    tsc *= 1000000000;
    tsc += ts.tv_nsec;

    return tsc;
}

// tscsec -- convert nanoseconds to fractional seconds
double
tscsec(tsc_t tsc)
{
    double sec;

    sec = tsc;
    sec /= 1e9;

    return sec;
}

tsc_t
calibrate(void)
{
    tsc_t tscbeg;
    tsc_t tscold;
    tsc_t tscnow;
    tsc_t tscdif;
    tsc_t tscmin;
    int iter;

    tscmin = 1LL << 62;
    tscbeg = tscget();
    tscold = tscbeg;

    for (iter = ITERMAX; iter > 0; --iter) {
        tscnow = tscget();

        tscdif = tscnow - tscold;
        if (tscdif < tscmin)
            tscmin = tscdif;

        tscold = tscnow;
    }

    tscdif = tscnow - tscbeg;

    printf("MIN:%.9f TOT:%.9f AVG:%.9f\n",
        tscsec(tscmin), tscsec(tscdif), tscsec(tscnow - tscbeg) / ITERMAX);

    return tscmin;
}

int
main(void)
{
    calibrate();
    return 0;
}
On my system, a 2.67GHz Core i7, the output is:
MIN:0.000000019 TOT:0.000254999 AVG:0.000000025
So, I'm getting 25 ns overhead [and not 400 ns]. But, again, each system can be different to some extent.
UPDATE:
Note that x86 processors have "speed step". The OS can adjust the CPU frequency up or down semi-automatically. Lower speeds conserve power. Higher speeds are maximum performance.
This is done with a heuristic (e.g. if the OS detects that the process is a heavy CPU user, it will up the speed).
To force maximum speed, linux has this directory:
/sys/devices/system/cpu/cpuN/cpufreq
Where N is the cpu number (e.g. 0-7)
Under this directory, there are a number of files of interest. They should be self explanatory.
In particular, look at scaling_governor. It has either ondemand [kernel will adjust as needed] or performance [kernel will force maximum CPU speed].
To force maximum speed, as root, set this [once] to performance (e.g.):
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Do this for all cpus.
However, I just did this on my system, and it had little effect. So, the kernel's heuristic may have improved.
As to the 400us, when a process has been waiting on something, when it is "woken up", this is a two step process.
The process is marked "runnable".
At some point, the system/CPU does a reschedule. The process will be run, based upon the scheduling policy and the process priority in effect.
For many syscalls, the reschedule [only] occurs on the next system timer/clock tick/interrupt. So, for some, there can be a delay of up to a full clock tick (i.e., for an HZ value of 1000, this can be up to 1 ms (1000 us) later).
On average, this is one half of HZ or 500 us.
For some syscalls, when the process is marked runnable, a reschedule is done immediately. If the process has a higher priority, it will be run immediately.
When I first looked at this [circa 2004], I looked at all code paths in the kernel, and the only syscall that did the immediate reschedule was SysV IPC, for msgsnd/msgrcv. That is, when process A did msgsnd, any process B waiting for the given message would be run.
But, others did not (e.g. futex). They would wait for the timer tick. A lot has changed since then, and now, more syscalls will do the immediate reschedule. For example, I recently measured futex [invoked via pthread_mutex_*], and it seemed to do the quick reschedule.
Also, the kernel scheduler has changed. The newer scheduler can wakeup/run some things on a fraction of a clock tick.
So, for you, the 400 us, is [possibly] the alignment to the next clock tick.
But, it could just be the overhead of doing the syscall. To test that, I modified my test program to open /dev/null [and/or /dev/zero], and added read(fd,buf,1) to the test loop.
I got a MIN: value of 529 us. So, the delay you're getting could just be the amount of time it takes to do the task switch.
This is what I would call "good enough for now".
To get "razor's edge" response, you'd probably have to write a custom kernel driver and have the driver do this. This is what embedded systems would do if (e.g.) they had to toggle a GPIO pin on every interval.
But, if all you're doing is printf, the overhead of printf and the underlying write(1,...) tends to swamp the actual delay.
Also, note that when you do printf, it builds the output buffer and when the buffer in FILE *stdout is full, it flushes via write.
For best performance, it's better to do int len = sprintf(buf,"current time is ..."); write(1,buf,len);
Also, when you do this, if the kernel buffers for TTY I/O get filled [which is quite possible given the high frequency of messages you're doing], the process will be suspended until the I/O has been sent to the TTY device.
To do this well, you'd have to watch how much space is available, and skip some messages if there isn't enough space to [wholly] contain them.
You'd need to do: ioctl(1,TIOCOUTQ,...) to get the available space and skip some messages if it is less than the size of the message you want to output (e.g. the len value above).
For your usage, you're probably more interested in the latest time message, rather than outputting all messages [which would eventually produce a lag].
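Putting those two ideas together, a minimal sketch might look like this (OUTQ_LIMIT is a made-up threshold; TIOCOUTQ reports how many bytes are still queued for the TTY, so "enough space" is judged against whatever limit you pick):
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define OUTQ_LIMIT 4096   /* hypothetical cap on queued TTY output */

// Format with snprintf, then write() directly, skipping the message if the
// TTY output queue is already too full to take it without blocking.
static void log_message(const char *text)
{
    char buf[256];
    int queued = 0;
    int len = snprintf(buf, sizeof(buf), "current time is %s\n", text);

    if (len < 0)
        return;
    if (len > (int) sizeof(buf))
        len = (int) sizeof(buf);

    /* TIOCOUTQ: bytes not yet sent to the TTY device. */
    if (ioctl(STDOUT_FILENO, TIOCOUTQ, &queued) == 0 && queued + len > OUTQ_LIMIT)
        return;   /* would risk blocking -- drop this message */

    write(STDOUT_FILENO, buf, len);
}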
My CPU has four cores (macOS). I use 4 threads to calculate an array, but the calculation time is not reduced. If I don't use multithreading, the calculation takes about 52 seconds. But even when I use 4 threads, or 2 threads, the time doesn't change.
(I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() returns the CPU time summed across all threads rather than the elapsed wall-clock time. When I use time() to measure the time, the result is correct.)
The output of using 2 threads:
id 1 use time = 43 sec to finish
id 0 use time = 51 sec to finish
time for round 1 = 51 sec
id 1 use time = 44 sec to finish
id 0 use time = 52 sec to finish
time for round 2 = 52 sec
id 1 and id 0 are thread 1 and thread 0. "time for round" is the time for both threads to finish. If I don't use multithreading, the time for a round is also about 52 seconds.
This is the part that launches the threads:
for (i = 1; i <= round; i++)
{
    time_round_start = clock();
    for (j = 0; j < THREAD_NUM; j++)
    {
        cal_arg[j].roundth = i;
        pthread_create(&thread_t_id[j], NULL, Multi_Calculate, &cal_arg[j]);
    }
    for (j = 0; j < THREAD_NUM; j++)
    {
        pthread_join(thread_t_id[j], NULL);
    }
    time_round_end = clock();
    int round_time = (int)((time_round_end - time_round_start) / CLOCKS_PER_SEC);
    printf("time for round %d = %d sec\n", i, round_time);
}
This is the code inside the thread function:
void *Multi_Calculate(void *arg)
{
    struct multi_cal_data cal = *((struct multi_cal_data *)arg);
    int p_id = cal.thread_id;
    int i = 0;
    int root_level = 0;
    int leaf_addr = 0;
    int neighbor_root_level = 0;
    int neighbor_leaf_addr = 0;
    Neighbor *locate_neighbor = (Neighbor *)malloc(sizeof(Neighbor));

    printf("id:%d, start:%d end:%d,round:%d\n", p_id, cal.start_num, cal.end_num, cal.roundth);
    for (i = cal.start_num; i <= cal.end_num; i++)
    {
        root_level = i / NUM_OF_EACH_LEVEL;
        leaf_addr = i % NUM_OF_EACH_LEVEL;
        if (root_addr[root_level][leaf_addr].node_value != i)
        {
            //ignore, because this is a gap, no this node
        }
        else
        {
            int k = 0;
            locate_neighbor = root_addr[root_level][leaf_addr].head;
            double tmp_credit = 0;
            for (k = 0; k < root_addr[root_level][leaf_addr].degree; k++)
            {
                neighbor_root_level = locate_neighbor->neighbor_value / NUM_OF_EACH_LEVEL;
                neighbor_leaf_addr = locate_neighbor->neighbor_value % NUM_OF_EACH_LEVEL;
                tmp_credit += root_addr[neighbor_root_level][neighbor_leaf_addr].g_credit[cal.roundth - 1] / root_addr[neighbor_root_level][neighbor_leaf_addr].degree;
                locate_neighbor = locate_neighbor->next;
            }
            root_addr[root_level][leaf_addr].g_credit[cal.roundth] = tmp_credit;
        }
    }
    return 0;
}
The array is very large, each thread calculate part of the array.
Is there something wrong with my code?
It could be a bug, but if you feel the code is correct, then the overhead of parallelization (thread creation, synchronization, and so on) might mean the overall runtime is about the same as the non-parallelized code for this problem size.
It might be an interesting study to run both the single-threaded loop and the threaded code against very large arrays (100k elements?) and see whether the parallel/threaded version starts to pull ahead.
Amdahl's law, also known as Amdahl's argument,[1] is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
https://en.wikipedia.org/wiki/Amdahl%27s_law
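In formula form: if a fraction P of the work can be parallelized across N processors, the maximum speedup is 1 / ((1 - P) + P/N). For example, with P = 0.95 and N = 4 cores, the best possible speedup is 1 / (0.05 + 0.2375) ≈ 3.5x, no matter how well the threads are written.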
You don't always gain speed by multi-threading a program. There is a certain amount of overhead that comes with threading. Unless there are enough inefficiencies in the non-threaded code to make up for the overhead, you'll not see an improvement. A lot can be learned about how multi-threading works even if the program you write ends up running slower.
I know why this happens now. The problem is that I used clock() to measure the time. That is wrong in a multithreaded program, because clock() returns the CPU time summed across all threads rather than the elapsed wall-clock time. When I use time() to measure the time, the result is correct.
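To make the fix concrete, a minimal sketch of timing a round with clock_gettime(CLOCK_MONOTONIC), which measures elapsed wall-clock time rather than the summed per-thread CPU time that clock() reports:
#include <stdio.h>
#include <time.h>

// Elapsed wall-clock seconds between two timestamps.  Unlike clock(), this
// does not add up the CPU time of every thread, so it is suitable for timing
// a round in which several threads run in parallel.
static double elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) + (end->tv_nsec - start->tv_nsec) / 1e9;
}

// usage sketch:
//   struct timespec t0, t1;
//   clock_gettime(CLOCK_MONOTONIC, &t0);
//   /* ... pthread_create / pthread_join the worker threads ... */
//   clock_gettime(CLOCK_MONOTONIC, &t1);
//   printf("time for round = %.3f sec\n", elapsed_seconds(&t0, &t1));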
I have a C program which for the following number of inputs executes in the following amount of time:
inputs    time
    23    0.001s
   100    0.001s
I tried to find a formula for this but was unsuccessful. As far as I can see, the time sometimes doubles and sometimes it doesn't, which is why I couldn't find a formula.
Any thoughts?
NOTES
1) I am measuring this in CPU time (user+sys).
2) My program uses quicksort.
3) The asymptotic run-time analysis/complexity of my program is O(NlogN).
Plotting this, it looks very much like you are hitting a "cache cliff": your data is big enough that it doesn't all fit in one CPU cache level and so must be moved between faster and slower levels.
The differences at smaller sizes are likely due to timer precision; if you increase the number of loops inside the timed block, this should smooth out.
Generally, when hitting a cache cliff the performance looks like a plot of O*Q,
where O is the asymptotic runtime -- N log(N) in your case --
and Q is a step function that increases as you pass the size of each cache level,
so Q = 1 below the L1 size, 10 above the L1 size, 100 above the L2 size, etc. (numbers only as an example).
In fact, some people verify their cache sizes by plotting an O(1) function and looking at the memory sizes where the performance cliffs appear:
                     ________
           _________/
__________/
    L1    |    L2    |    L3
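A rough sketch of that kind of O(1) measurement: do the same number of strided accesses over working sets of increasing size and watch where the nanoseconds per access jump.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// For each working-set size, do the same number of strided accesses and
// report nanoseconds per access; the jumps roughly mark the L1/L2/L3 sizes.
int main(void)
{
    const long accesses = 1L << 26;              /* fixed amount of work per size */

    for (size_t size = 1 << 12; size <= (1 << 26); size <<= 1) {
        char *raw = malloc(size);
        if (raw == NULL)
            return 1;
        memset(raw, 0, size);
        volatile char *buf = raw;                /* volatile so the loop is not optimized away */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        size_t idx = 0;
        for (long i = 0; i < accesses; i++) {
            buf[idx] += 1;                       /* O(1) work, varying footprint */
            idx = (idx + 64) & (size - 1);       /* stride one cache line, wrap */
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / accesses;
        printf("%8zu KiB: %.2f ns/access\n", size / 1024, ns);
        free(raw);
    }
    return 0;
}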
I always use this to get a precise run time:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

clock_t startm, stopm;
#define START if ( (startm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define STOP if ( (stopm = clock()) == -1) {printf("Error calling clock");exit(1);}
#define PRINTTIME printf( "%6.3f seconds used by the processor.", ((double)stopm-startm)/CLOCKS_PER_SEC);

int main() {
    int i, x;
    START;
    scanf("%d", &x);
    for (i = 0; i < 10000; i++) {
        printf("%d\n", i);
    }
    STOP;
    PRINTTIME;
}
I have an application on Linux that needs to change some parameters each hour, e.g. at 11:00, 12:00, etc., and the system's date can be changed by the user at any time.
Is there any signal or POSIX function that would notify me when the hour changes from xx:59 to (xx+1):00?
Normally, I use localtime(3) to fetch the current time every second and then check whether the minute part is equal to 0. However, that does not look like a good way to do it: in order to detect a change, I need to call the same function every second for an hour. I also run the code on an embedded board, so it would be good to use fewer resources.
Here is an example code how I do it:
static char *fetch_time() { // I use this fcn for some other purpose to fetch the time info
    char *p;
    time_t rawtime;
    struct tm *timeinfo;
    char buffer[13];

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    strftime(buffer, 13, "%04Y%02m%02d%02k%02M", timeinfo);
    p = (char *)malloc(sizeof(buffer));
    strcpy(p, buffer);

    return p;
}

static int hour_change_check() {
    char *p;
    p = fetch_time();
    char current_minute[3] = {'\0'};
    current_minute[0] = p[10];
    current_minute[1] = p[11];
    int current_minute_as_int = atoi(current_minute);

    if (current_minute_as_int == 0) {
        printf("current_min: %d\n", current_minute_as_int);
        free(p);
        return 1;
    }

    free(p);
    return 0;
}

int main(void) {
    while (1) {
        int x = hour_change_check();
        printf("x:%d\n", x);
        sleep(1);
    }
    return 0;
}
There is no such signal, but traditionally the method of waiting until some target time is to compute how long it is between "now" and "then", and then call sleep():
now = time(NULL);
when = (some calculation);
if (when > now)
    sleep(when - now);
If you need to be very precise about the transition from, e.g., 3:59:59 to 4:00:00, you may want to sleep for a slightly shorter time in case of time adjustments due to leap seconds. (If you are running in a portable device in which time zones can change, you also need to worry about picking up the new location, and if it runs on a half-hour offset, redo all computations. There's even Solar Time in Saudi Arabia....)
Edit: per the suggestion from R.., if clock_nanosleep() is available, calculate a timespec value for the absolute wakeup time and call it with the TIMER_ABSTIME flag. See http://pubs.opengroup.org/onlinepubs/009695399/functions/clock_nanosleep.html for the definition for clock_nanosleep(). However, if time is allowed to step backwards (e.g., localtime with zone shifts), you may still have to do some maintenance checking.
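A minimal sketch of that approach, assuming the local UTC offset is a whole number of hours (half-hour zones and backward clock steps still need the maintenance checks mentioned above):
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

// Sleep until the next top of the hour using an absolute CLOCK_REALTIME
// deadline.  If the user steps the clock forward past the deadline,
// clock_nanosleep() returns immediately.
static void sleep_until_next_hour(void)
{
    struct timespec deadline;
    int rc;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 3600 - (deadline.tv_sec % 3600);   /* next xx:00:00 */
    deadline.tv_nsec = 0;

    do {
        rc = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &deadline, NULL);
    } while (rc == EINTR);                                 /* retry if interrupted */
}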
Have you actually measured the overhead of your solution of polling the time once per second (or even twice a second, given some of your other comments)?
The number of instructions invoked is minimal, and there is no looping. So at worst the CPU might use 100 microseconds (0.1 ms, or 0.0001 s). This estimate is very dependent on the processor used in your embedded system and its clock speed, but the idea is that the polling logic might use something like 1/1000 of the total time available.
Also, you could optimize your hour_change_check code to do all of the time calculations inline rather than call another function that issues a malloc which then has to be immediately freed! Also, if this is an embedded *nix system, you could run this polling logic in its own thread so that when it issues sleep() it does not interfere with or delay other units of work.
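For example, a malloc-free version of hour_change_check could use localtime_r, the reentrant variant that needs no heap allocation or string formatting (a sketch, not the original code):
#include <time.h>

// Return 1 when the current local minute is 0 (i.e., the top of the hour).
static int hour_change_check(void)
{
    time_t rawtime = time(NULL);
    struct tm timeinfo;

    localtime_r(&rawtime, &timeinfo);
    return timeinfo.tm_min == 0;
}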
Hence, measure the problem and see if it is significant. The polling's cost must be balanced against the requirement that the hour change MUST be detected even when a user changes the time. That is, I think polling every second will catch the hour rollover even if the user changes the time, but is the overhead worth it? Well, how much overhead is there, exactly?
Any ideas why it works fine for values like 0, 1, 2, 3, 4... and seg faults for values like >15?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *fib(void *fibToFind);
main(){
    pthread_t mainthread;
    long fibToFind = 15;
    long finalFib;

    pthread_create(&mainthread, NULL, fib, (void*) fibToFind);
    pthread_join(mainthread, (void*)&finalFib);
    printf("The number is: %d\n", finalFib);
}

void *fib(void *fibToFind){
    long retval;
    long newFibToFind = ((long)fibToFind);
    long returnMinusOne;
    long returnMinustwo;
    pthread_t minusone;
    pthread_t minustwo;

    if(newFibToFind == 0 || newFibToFind == 1)
        return newFibToFind;
    else{
        long newFibToFind1 = ((long)fibToFind) - 1;
        long newFibToFind2 = ((long)fibToFind) - 2;

        pthread_create(&minusone, NULL, fib, (void*) newFibToFind1);
        pthread_create(&minustwo, NULL, fib, (void*) newFibToFind2);
        pthread_join(minusone, (void*)&returnMinusOne);
        pthread_join(minustwo, (void*)&returnMinustwo);

        return returnMinusOne + returnMinustwo;
    }
}
Runs out of memory (out of space for stacks), or valid thread handles?
You're asking for an awful lot of threads, which require lots of stack/context.
Windows (and Linux) have a stupid "big [contiguous] stack" idea.
From the documentation on pthread_create:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes."
If you manufacture 10,000 threads, you need 20 GB of RAM.
I built a version of OP's program, and it bombed with some 3500 (p)threads on Windows XP64.
See this SO thread for more details on why big stacks are a really bad idea:
Why are stack overflows still a problem?
If you give up on big stacks, and implement a parallel language with heap allocation for activation records (our PARLANSE is one of these), the problem goes away.
Here's the first (sequential) program we wrote in PARLANSE:
(define fibonacci_argument 45)
(define fibonacci
(lambda(function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? 1)
?
(+ (fibonacci (-- ?))
(fibonacci (- ? 2))
)+
)ifthenelse
)lambda
)define
Here's an execution run on an i7:
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
Result: 1134903170
Here's the second, which is parallel:
(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work
(define parallel_fibonacci
(lambda (function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? coarse_grain_threshold)
(fibonacci ?)
(let (;; [n natural ] [m natural ] )
(value (|| (= m (parallel_fibonacci (-- ?)) )=
(= n (parallel_fibonacci (- ? 2)) )=
)||
(+ m n)
)value
)let
)ifthenelse
)lambda
)define
Making the parallelism explicit makes the programs a lot easier to write, too.
We test the parallel version by calling (parallel_fibonacci 45). Here is the execution run on the same i7 (which arguably has 8 processors, but it is really 4 processors hyperthreaded, so it really isn't quite 8 equivalent CPUs):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170
A speedup near 6+, not bad for not-quite-8 processors. One of the other answers to this question ran the pthreads version; it took "a few seconds" (to blow up) computing Fib(18), whereas this takes 5.5 seconds for Fib(45). This tells you pthreads is a fundamentally bad way to do lots of fine-grain parallelism, because it has really, really high forking overhead. (PARLANSE is designed to minimize that forking overhead.)
Here's what happens if you set the technology constant to zero (forks on every call to fib):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170
You can see that amortizing fork overhead is a good idea, even if you have fast forks. Fib(45) produces a lot of grains. Heap allocation of activation records solves the OP's first-order problem (thousands of pthreads, each with 1 MB of stack, burn gigabytes of RAM).
But there's a second-order problem: 2^45 PARLANSE "grains" will burn all your memory too, just keeping track of the grains, even if each grain control block is tiny. So it helps to have a scheduler that throttles forks once you have "a lot" (for some definition of "a lot" significantly less than 2^45) of grains, to prevent the explosion of parallelism from swamping the machine with "grain" tracking data structures. It also has to unthrottle forks when the number of grains falls below a threshold, to make sure there is always plenty of logical, parallel work for the physical CPUs to do.
You are not checking for errors - in particular, from pthread_create(). When pthread_create() fails, the pthread_t variable is left undefined, and the subsequent pthread_join() may crash.
If you do check for errors, you will find that pthread_create() is failing. This is because you are trying to generate almost 2000 threads - with default settings, this would require 16GB of thread stacks to be allocated alone.
You should revise your algorithm so that it does not generate so many threads.
I tried to run your code, and came across several surprises:
printf("The number is: %d\n", finalFib);
This line has a small error: %d means printf expects an int, but is passed a long int. On most platforms this is the same, or will have the same behavior anyways, but pedantically speaking (or if you just want to stop the warning from coming up, which is a very noble ideal too), you should use %ld instead, which will expect a long int.
Your fib function, on the other hand, seems non-functional. Testing it on my machine, it doesn't crash, but it yields 1047, which is not a Fibonacci number. Looking closer, it seems your program is incorrect on several aspects:
void *fib(void *fibToFind)
{
    long retval;                   // retval is never used

    long newFibToFind = ((long)fibToFind);

    long returnMinusOne;           // variable is read but never initialized
    long returnMinustwo;           // variable is read but never initialized

    pthread_t minusone;            // variable is never used (?)
    pthread_t minustwo;            // variable is never used

    if (newFibToFind == 0 || newFibToFind == 1)
        // you miss a cast here (but you really shouldn't do it this way)
        return newFibToFind;
    else {
        long newFibToFind1 = ((long)fibToFind) - 1; // variable is never used
        long newFibToFind2 = ((long)fibToFind) - 2; // variable is never used

        // reading undefined variables (and missing a cast)
        return returnMinusOne + returnMinustwo;
    }
}
Always take care of compiler warnings: when you get one, usually, you really are doing something fishy.
Maybe you should revise the algorithm a little: right now, all your function does is returning the sum of two undefined values, hence the 1047 I got earlier.
Implementing the Fibonacci sequence using a recursive algorithm means you need to call the function again. As others noted, it's quite an inefficient way of doing it, but it's easy, so I guess all computer science teachers use it as an example.
The regular recursive algorithm looks like this:
int fibonacci(int iteration)
{
    if (iteration == 0 || iteration == 1)
        return 1;
    return fibonacci(iteration - 1) + fibonacci(iteration - 2);
}
I don't know to which extent you were supposed to use threads—just run the algorithm on a secondary thread, or create new threads for each call? Let's assume the first for now, since it's a lot more straightforward.
Casting integers to pointers and vice-versa is a bad practice because if you try to look at things at a higher level, they should be widely different. Integers do maths, and pointers resolve memory addresses. It happens to work because they're represented the same way, but really, you shouldn't do this. Instead, you might notice that the function called to run your new thread accepts a void* argument: we can use it to convey both where the input is, and where the output will be.
So building upon my previous fibonacci function, you could use this code as the thread main routine:
void* fibonacci_offshored(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;
    *pointer_to_number = fibonacci(input);
    return NULL;
}
It expects a pointer to an integer, takes its input from it, and then writes its output there.[1] You would then create the thread like this:
int main()
{
    int value = 15;
    pthread_t thread;

    // on input, value should contain the number of iterations;
    // after the end of the function, it will contain the result of
    // the fibonacci function
    int result = pthread_create(&thread, NULL, fibonacci_offshored, &value);
    // error checking is important! try to crash gracefully at the very least
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }

    if (pthread_join(thread, NULL))
    {
        perror("pthread_join");
        return 1;
    }

    // now, value contains the output of the fibonacci function
    // (note that value is an int, so just %d is fine)
    printf("The value is %d\n", value);
    return 0;
}
If you need to call the Fibonacci function from new distinct threads (please note: that's not what I'd advise, and others seem to agree with me; it will just blow up for a sufficiently large amount of iterations), you'll first need to merge the fibonacci function with the fibonacci_offshored function. It will considerably bulk it up, because dealing with threads is heavier than dealing with regular functions.
void* threaded_fibonacci(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;

    if (input == 0 || input == 1)
    {
        *pointer_to_number = 1;
        return NULL;
    }

    // we need one argument per thread
    int minus_one_number = input - 1;
    int minus_two_number = input - 2;

    pthread_t minus_one;
    pthread_t minus_two;

    // don't forget to check! especially that in a recursive function where the
    // recursion set actually grows instead of shrinking, you're bound to fail
    // at some point
    if (pthread_create(&minus_one, NULL, threaded_fibonacci, &minus_one_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }

    if (pthread_create(&minus_two, NULL, threaded_fibonacci, &minus_two_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }

    if (pthread_join(minus_one, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }

    if (pthread_join(minus_two, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }

    *pointer_to_number = minus_one_number + minus_two_number;
    return NULL;
}
Now that you have this bulky function, adjustments to your main function are going to be quite easy: just change the reference to fibonacci_offshored to threaded_fibonacci.
int main()
{
    int value = 15;
    pthread_t thread;

    int result = pthread_create(&thread, NULL, threaded_fibonacci, &value);
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }

    pthread_join(thread, NULL);

    printf("The value is %d\n", value);
    return 0;
}
You might have been told that threads speed up parallel processes, but there's a limit somewhere where it's more expensive to set up the thread than run its contents. This is a very good example of such a situation: the threaded version of the program runs much, much slower than the non-threaded one.
For educational purposes, this program runs out of threads on my machine when the number of desired iterations is 18, and takes a few seconds to run. By comparison, using an iterative implementation, we never run out of threads, and we have our answer in a matter of milliseconds. It's also considerably simpler. This would be a great example of how using a better algorithm fixes many problems.
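For reference, an iterative version is just a loop; a minimal sketch, using the same fib(0) = fib(1) = 1 convention as the recursive code above:
// Iterative Fibonacci: O(n) time, O(1) space, and no threads at all.
int fibonacci_iterative(int iteration)
{
    int previous = 1;
    int current = 1;

    for (int i = 2; i <= iteration; i++)
    {
        int next = previous + current;
        previous = current;
        current = next;
    }
    return current;
}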
Also, out of curiosity, it would be interesting to see if it crashes on your machine, and where/how.
1. Usually, you should try to avoid changing the meaning of a variable between its value on input and its value after the function returns. For instance, here, on input, the variable is the number of iterations we want; on output, it's the result of the function. Those are two very different meanings, and that's not really a good practice. I didn't feel like using dynamic allocation to return a value through the void* return value.