Getting reliable performance measurements for short bits of code - c

I'm trying to profile a set of functions which implement different versions of the same algorithm. I've increased the number of times each function is run so that the total time spent in a single function is roughly one minute (to reveal performance differences).
However, running the test several times produces baffling results. There is huge variability (±50%) between executions of the same function, which makes it nearly impossible to determine which function is the fastest (the whole goal of the test).
Is there something special I should take care of before running the tests so that I get smoother measurements? Failing that, is running the test several times and computing the average for each function the way to go?

There are lots of things to check!
First, make sure your functions are actually CPU-bound. If so, make sure you have all CPU throttling, turbo modes, and power-saving modes disabled (in BIOS) for the test. If you still have trouble, try pinning your process to a single core. Disable hyper-threading too perhaps.
The goal of all this is to make sure you get your code running hot on a single core without much interruption. If you're on Linux, you can remove a single core from the OS list of available cores and use that (so there is no chance of interference on that core).
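If you go the pinning route on Linux, a minimal sketch using sched_setaffinity might look like the following (core 3 is just an example; in practice you'd pick the core you isolated, e.g. via isolcpus):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);   /* example: pin to core 3, e.g. one isolated with isolcpus=3 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* ... run the benchmark loop here ... */
        return 0;
    }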
Running the test several times is a good idea, but using the average (arithmetic mean) is not. Instead, use the median or minimum or some other measurement which won't be influenced by outliers. Usually, the occasional long test run can be thrown out entirely (unless you're building a real-time system!).
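As a rough sketch of the repeated-runs idea (function_under_test is a placeholder for whatever you're measuring, and CLOCK_MONOTONIC is assumed to be available), collect one sample per run, sort them, and report the minimum and median instead of the mean:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define RUNS 21

    /* placeholder for the code being benchmarked */
    static void function_under_test(void) { /* ... */ }

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        double samples[RUNS];
        for (int r = 0; r < RUNS; r++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            function_under_test();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            samples[r] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }
        qsort(samples, RUNS, sizeof samples[0], cmp_double);
        printf("min %.6f s, median %.6f s\n", samples[0], samples[RUNS / 2]);
        return 0;
    }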

Related

About Dijkstra omp

Recently I downloaded source code for an OpenMP implementation of Dijkstra's algorithm from the internet.
But I found that the parallel time is always larger than the single-threaded time (whether I use two, four, or eight threads).
Since I'm new to OpenMP, I really want to figure out what is happening.
This is due to the overhead of setting up the threads. The execution time of the work itself is theoretically the same, but the system has to set up the threads that manage the work (even if there's only one). For little work, or for only one thread, this overhead makes your time-to-solution slower than the serial time-to-solution.
Alternatively, if you see the time increasing dramatically as you increase the thread count, you may only be using one core and oversubscribing it with 2, 4, 8, etc. threads.
Finally, it's possible that the way you're implementing Dijkstra's algorithm is largely serial. But without looking at your code it's hard to say.
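To convince yourself that thread setup overhead is the culprit, you can time a deliberately tiny loop serially and under #pragma omp parallel for using omp_get_wtime() (compile with -fopenmp); this is only an illustration, not the poster's Dijkstra code:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 1000 };                             /* deliberately tiny workload */
        static double a[N];

        double t0 = omp_get_wtime();
        for (int i = 0; i < N; i++) a[i] = i * 0.5;    /* serial */
        double t1 = omp_get_wtime();

        #pragma omp parallel for
        for (int i = 0; i < N; i++) a[i] = i * 0.5;    /* parallel: setup cost dominates */
        double t2 = omp_get_wtime();

        printf("serial: %.3e s, parallel: %.3e s (a[0] = %g)\n", t1 - t0, t2 - t1, a[0]);
        return 0;
    }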

Are two successive calls to getrusage guaranteed to produce increasing results?

In a program that calls getrusage() twice in order to obtain the time of a task by subtraction, I once saw an assertion fail which says that the time of the task should be nonnegative. This, of course, cannot easily be reproduced, although I could write a specialized program that might reproduce it more easily.
I have tried to find a guarantee that getrusage() increases monotonically during execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description say so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?
On many systems, rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100 Hz and sometimes even slower.
The primary reason is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting on every kernel entry (you'd have to read the clock twice for every system call). You could easily end up spending more time reading clocks than doing anything else in programs that make many system calls.
The counters should never go backwards, though. I've seen that happen many years ago on a system where the total running time of the process was tracked on context switches (which was relatively cheap), and getrusage calculated utime by taking the sampled stime and subtracting it from the total running time. The clock used in that case was the wall clock instead of a monotonic clock, so when you changed the time on the machine, the running time of processes could go backwards. But that was of course a bug.
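For reference, a minimal sketch of the subtract-two-getrusage-calls pattern under discussion (the busy loop is just a stand-in for the real task); with tick-based accounting the difference can legitimately be zero for very short tasks, but it should never be negative:

    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>

    static double to_seconds(struct timeval tv) {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void) {
        struct rusage before, after;
        getrusage(RUSAGE_SELF, &before);

        volatile double x = 0.0;
        for (long i = 0; i < 50000000L; i++) x += i;   /* stand-in for the task being timed */

        getrusage(RUSAGE_SELF, &after);
        double user = to_seconds(after.ru_utime) - to_seconds(before.ru_utime);
        printf("user CPU time: %f s\n", user);         /* may be 0 for tiny tasks, never < 0 */
        return 0;
    }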

time() and context switching

I am more or less wondering how time() is implemented in the C standard library and what would happen in the situation described below. Although this time is most likely negligible, consider a situation where you have a hard limit on time and no control over the CPU scheduler (assume that it is a "good" scheduler for a general-purpose CPU).
Now, if I use time() to calculate the execution time of a particular section of code, and subtract this time from some maximum bound to determine some other time-dependent variable, how would this variable be skewed by context switches? I know we could use nice and other tools (e.g. a custom scheduler) to be sure we get full CPU usage when we need it; however, I am wondering how this works in general for situations like this, and what side effects exist due to the system's choices.
time is supposed to measure wall time; that is, it gives the current time, regardless of how much or how little your process has run.
If you want to measure CPU time, you should use clock instead (though some vendors such as MS implement it incorrectly, so it reports wall time as well).
Of course, there are also other tools to retrieve CPU usage, such as times on Unix-like systems or GetProcessTimes on Windows. Most people find these more useful despite the reduced portability.
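A small sketch that makes the distinction concrete: time() keeps ticking while the process sleeps, whereas clock() (on conforming implementations) only advances while the process is actually using the CPU. The sleep and busy loop are just placeholders:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        time_t  wall_start = time(NULL);
        clock_t cpu_start  = clock();

        sleep(2);                                      /* uses no CPU while sleeping */
        for (volatile long i = 0; i < 100000000L; i++) /* burns CPU */
            ;

        time_t  wall_end = time(NULL);
        clock_t cpu_end  = clock();

        printf("wall time: %ld s\n", (long)(wall_end - wall_start));
        printf("CPU time : %f s\n", (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC);
        return 0;
    }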

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
For example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
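To see those same buckets from C rather than from Ruby's Benchmark output, the POSIX times() call reports elapsed, user, and system time separately; a rough sketch (the marked section is whatever you are benchmarking):

    #include <sys/times.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        struct tms start, end;
        long ticks_per_sec = sysconf(_SC_CLK_TCK);

        clock_t wall_start = times(&start);
        /* ... run the code under test here ... */
        clock_t wall_end = times(&end);

        printf("elapsed: %f s\n", (double)(wall_end - wall_start) / ticks_per_sec);
        printf("user   : %f s\n", (double)(end.tms_utime - start.tms_utime) / ticks_per_sec);
        printf("system : %f s\n", (double)(end.tms_stime - start.tms_stime) / ticks_per_sec);
        return 0;
    }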
A multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively running.
You don't want to totally discount wall-clock time, though. Time spent waiting, with no other thread ready to use the CPU cycles, may make one piece of code less desirable than another. One piece of code may take somewhat more CPU time but use multi-threading to beat the other in real-world elapsed time. It depends on your requirements and the specifics. My point is: use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that some of the CPU time used while your code is executing is not counted at all.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).

Are high resolution calls to get the system time wrong by the time the function returns?

Given a C process that runs at the highest priority and requests the current time, is the time returned adjusted for the amount of time the call takes to return to user space? In other words, is it already out of date when you get it? As a measurement, timing the execution of a known number of assembly instructions in a loop and asking for the time before and after could give you an approximation of the error. I know this must be an issue in scientific applications, though I don't plan to write software involving any super colliders in the near future. I have read a few articles on the subject, but they do not indicate that any correction is made to make the time given to you be slightly ahead of the time the system actually read. Should I lose sleep over other things?
Yes, they are almost definitely "wrong".
On Windows, the timing functions do not take into account the time it takes to transition back to user mode. Even if this were taken into account, it can't correct for the case where the function returns and your code hits a page fault, gets swapped out, etc., before capturing the return value.
In general, when timing things you should snap a start and an end time around a large number of iterations to weed out these sorts of uncertainties.
No, you should not lose sleep over this. No amount of adjustment or other software trickery will yield perfect results on a system with a pipelined processor with multi-layered memory access running a multi-tasking operating system with memory management, devices, interrupt handlers... Not even if your process has the highest priority.
Plus, taking the difference of two such times will cancel out the constant overhead, anyway.
Edit: I mean yes, you should lose sleep over other things :).
Yes, the answer you get will be off by a certain (smallish) amount; I have never heard of a timer function compensating for the average return time, because such a thing is nearly impossible to predict well. Such things are usually implemented by simply reading a register in the hardware and returning the value, or a version of it scaled to the appropriate timescale.
That said, I wouldn't lose sleep over this. The accepted way of keeping this overhead from affecting your measurements in any significant way is not to use these timers for short events. Usually, you will time several hundred, thousand, or million executions of the same thing, and divide by the number of executions to estimate the average time. Such a thing is usually more useful than timing a single instance, as it takes into account average cache behavior, OS effects, and so forth.
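As a sketch of that batching idea (work() is a placeholder, and CLOCK_MONOTONIC is assumed): the fixed cost of reading the clock is paid once per batch instead of once per call, so it mostly vanishes from the per-call average:

    #include <stdio.h>
    #include <time.h>

    /* placeholder for the short operation being measured */
    static void work(void) { /* ... */ }

    int main(void) {
        enum { ITERS = 1000000 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            work();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("average per call: %.1f ns\n", total / ITERS * 1e9);
        return 0;
    }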
Most real-world uses of high resolution timers are for profiling, in which the time is read once at START and once more at FINISH. So most of the time, almost the same amount of delay is involved in both the START and FINISH readings, and hence it works fine.
Now, for nuclear reactors, Windows (or many other general-purpose operating systems with generic timing functions) may not be suitable. I guess they use real-time operating systems, which might give more accurate time values than desktop operating systems.
