I am trying to time how long an execution takes. (I'm comparing execution times depending on the number of processes spawned.) Anyway, timer:tc is returning times rounded to the nearest 1000 ms. I have seen people get better accuracy than that, and I am wondering what could cause this.
This is the case on Windows (at least XP and 7), but it is rounded to 1000 µs (1 ms), not 1000 ms.
Except for very short functions, this is not a big problem, since the execution time varies from one run to the next anyway.
Erlang rounds down to 1ms on Windows.
A common way to work around this is to run your code many times (let's say 1000) and then to divide the time you get by 1000. This will give you a more accurate mean result.
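Outside Erlang, the same repeat-and-divide idea can be sketched in C using the POSIX monotonic clock. This is a minimal sketch assuming a POSIX system; `mean_run_time_ns` and `work` are hypothetical names for illustration, not an existing API.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds on the monotonic clock, which wall-clock adjustments do not affect. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Runs fn `runs` times and returns the mean duration of one run in nanoseconds.
   Timing the whole batch spreads the clock's granularity over all runs. */
double mean_run_time_ns(void (*fn)(void), int runs)
{
    int64_t start = now_ns();
    for (int i = 0; i < runs; i++)
        fn();
    return (double)(now_ns() - start) / runs;
}

/* Hypothetical workload used only for illustration. */
static volatile unsigned long sink;
void work(void)
{
    for (unsigned long i = 0; i < 1000; i++)
        sink += i;
}
```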
I work for a company that produces automatic machines, and I help maintain the software that controls them. The software runs on a real-time operating system and consists of multiple threads running concurrently. The code bases are legacy and carry substantial technical debt. Among all the issues these code bases exhibit, one stands out as rather bizarre to me: most of the timing algorithms that compute elapsed time to implement common timed features, such as timeouts, delays and recording the time spent in a particular state, take the following form:
unsigned int shouldContinue = 1;
unsigned int blockDuration = 1; // Let's say 1 millisecond.
unsigned int loopCount = 0;
unsigned int elapsedTime = 0;
while (shouldContinue)
{
    // ... a bunch of statements, selections and function calls ...
    blockingSystemCall(blockDuration);
    // ... a bunch of statements, selections and function calls ...
    loopCount++;
    elapsedTime = loopCount * blockDuration;
}
The blockingSystemCall function can be any operating system's API that suspends the current thread for the specified blockDuration. The elapsedTime variable is subsequently computed by basically multiplying loopCount by blockDuration or by any equivalent algorithm.
To me, this kind of timing algorithm is wrong and is not acceptable under most circumstances. All the instructions in the loop, including the loop condition, are executed sequentially, and each instruction requires measurable CPU time to execute. Therefore, once the loop has started, the actual elapsed time is always strictly greater than the value of elapsedTime. More precisely, suppose the CPU time required to execute all the statements in the loop other than the blocking call, denoted by d, is constant. Then elapsedTime lags behind the actual elapsed time by loopCount × d for any loopCount > 0; that is, the deviation grows linearly, as an arithmetic progression. This is only a lower bound on the deviation, because in reality there will be additional delays caused by thread scheduling and time slicing, depending on other factors.
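The arithmetic-progression argument can be checked with a small simulation. This sketch assumes a constant per-iteration overhead d on top of each blocking call and ignores scheduling delays entirely; `simulate_drift_us` is a hypothetical helper, and all times are integer microseconds so the result is exact.

```c
#include <stdint.h>

/* Simulates `iterations` passes of the loop from the question.
   block_us:    time spent inside blockingSystemCall per pass.
   overhead_us: the constant CPU time d spent on everything else per pass.
   Returns how far the naive elapsedTime (loopCount * blockDuration)
   lags behind the true elapsed time, in microseconds. */
uint64_t simulate_drift_us(uint64_t iterations, uint64_t block_us, uint64_t overhead_us)
{
    uint64_t actual_us = 0;
    uint64_t naive_us = 0;
    for (uint64_t loopCount = 1; loopCount <= iterations; loopCount++) {
        actual_us += block_us + overhead_us;  /* what a reference clock would see */
        naive_us = loopCount * block_us;      /* elapsedTime = loopCount * blockDuration */
    }
    return actual_us - naive_us;              /* equals iterations * overhead_us */
}
```

With d = 50 µs, for example, the lag after 1000 passes is 50 ms and doubles to 100 ms after 2000 passes: it never settles, it just keeps growing.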
In fact, not too long ago, while testing a new data-driven predictive maintenance feature which relies on the operation time of a machine, we discovered that the operation time reported by the software lagged behind that of a standard reference clock by a whopping three hours after the machine was in continuous operation for just over two days. It was through this test that I discovered the algorithm outlined above, which I swiftly determined to be the root cause.
Coming from a background where I used to implement timing algorithms on bare-metal systems using timer interrupts, which allow the CPU to carry on executing the business logic while the timer runs in parallel, I was shocked to discover that the algorithm outlined above is used in the industry to compute elapsed time. This is all the more surprising because a typical operating system already wraps the timer functions in various easy-to-use public APIs, freeing the programmer from the hassle of configuring a timer via hardware registers, raising events via interrupt service routines, etc.
The kind of timing algorithm illustrated in the skeleton code above is found in at least two code bases, independently developed by two distinct software engineering teams at two subsidiary companies located in two different cities, albeit within the same state. This makes me wonder whether this is how things are normally done in the industry, or whether it is just an isolated case.
So, the question is, is the algorithm shown above common or acceptable in calculating elapsed time, given that the underlying operating system already provides highly optimized time-management system calls that can be used right out of the box to accurately measure elapsed time or even used as basic building blocks for creating higher-level timing facilities that provide more intuitive methods similar to, e.g., the Timer class in C#?
You're right that calculating elapsed time that way is inaccurate: it assumes that the blocking call takes exactly the amount of time indicated, and that everything outside the blocking system call takes no time at all, which would only be true on an infinitely fast machine. Since actual machines are not infinitely fast, the elapsed time calculated this way will always be somewhat less than the actual elapsed time.
As to whether that's acceptable, it's going to depend on how much timing accuracy your program needs. If it's just doing a rough estimate to make sure a function doesn't run for "too long", this might be okay. OTOH if it is trying for accuracy (and in particular accuracy over a long period of time), then this approach won't provide that.
FWIW the more common (and more accurate) way to measure elapsed time would be something like this:
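const unsigned int startTime = current_clock_time();
while (shouldContinue)
{
    // ... loop body as before, including blockingSystemCall(blockDuration) ...
    loopCount++;
    elapsedTime = current_clock_time() - startTime; // measured, not accumulated
}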
This has the advantage of not "drifting away" from the accurate value over time, but it does assume that you have a current_clock_time() type of function available, and that it's acceptable to call it within the loop. (If current_clock_time() is very expensive, or doesn't provide some real-time performance guarantees that the calling routine requires, that might be a reason not to do it this way.)
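On a POSIX system, current_clock_time() could be clock_gettime(CLOCK_MONOTONIC, ...). The sketch below assumes POSIX; nanosleep stands in for blockingSystemCall, and `run_loop_ms` is a hypothetical name. Elapsed time is read from the clock rather than accumulated, so per-iteration overhead cannot make it drift.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

static int64_t monotonic_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Runs `iterations` passes, each blocking for about 1 ms, and returns the
   elapsed time in milliseconds as measured by the monotonic clock. */
int64_t run_loop_ms(unsigned iterations)
{
    const int64_t startTime = monotonic_ms();
    const struct timespec blockDuration = { 0, 1000000 };  /* 1 ms */
    int64_t elapsedTime = 0;

    for (unsigned loopCount = 0; loopCount < iterations; loopCount++) {
        /* ... loop body ... */
        nanosleep(&blockDuration, NULL);            /* stands in for blockingSystemCall */
        elapsedTime = monotonic_ms() - startTime;   /* measured, not loopCount * 1 ms */
    }
    return elapsedTime;
}
```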
I don't think these loops do what you think they do.
In an RTOS, the purpose of a loop like this is usually to perform a task at regular intervals.
blockingSystemCall(N) probably does not just sleep for N milliseconds like you think it does. It probably sleeps until N milliseconds after the last time your thread woke up.
More accurately, all the sleeps your thread has performed since starting are added to the thread start time to get the time at which the OS will try to wake the thread up. If your thread woke up due to an I/O event, then the last one of those times could be used instead of the thread start time. The point is that the inaccuracies in all these start times are corrected, so your thread wakes up at regular intervals and the elapsed time measurement is perfectly accurate according to the RTOS master clock.
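The "sleep until N ms after the last wake-up" behaviour can be reproduced on POSIX systems with clock_nanosleep and TIMER_ABSTIME. This is a minimal sketch assuming a POSIX clock rather than any particular RTOS API; `run_periodic` is a hypothetical name. Each deadline is computed from the previous deadline, not from the moment the thread actually woke, so lateness in one period is not carried into the next.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

static int64_t to_ns(const struct timespec *t)
{
    return (int64_t)t->tv_sec * 1000000000LL + t->tv_nsec;
}

/* Advances *t by ns nanoseconds (assumes 0 <= ns < 1e9), keeping tv_nsec in [0, 1e9). */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec += 1;
    }
}

/* Wakes up `periods` times, period_ns apart, and returns the measured elapsed
   time in nanoseconds. Time spent on work between sleeps shortens the next
   sleep instead of pushing back every later wake-up. */
int64_t run_periodic(unsigned periods, long period_ns)
{
    struct timespec start, deadline, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    deadline = start;
    for (unsigned i = 0; i < periods; i++) {
        /* ... periodic work goes here ... */
        timespec_add_ns(&deadline, period_ns);
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) == EINTR)
            ;  /* interrupted by a signal: sleep again until the same deadline */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return to_ns(&end) - to_ns(&start);
}
```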
Besides simplicity, there could also be very good reasons for measuring elapsed time by the RTOS master clock instead of a more accurate wall-clock time. All of the guarantees that an RTOS provides (which are the reason you are using an RTOS in the first place) are expressed in that time scale. The amount of time taken by one task can affect the amount of time you are guaranteed to have available for other tasks, as measured by this clock.
It may or may not be a problem that your RTOS master clock runs slow by 3 hours every 2 days...
Say I have a target of x requests/sec that I want to generate continuously. My goal is to start these requests at roughly equal intervals, rather than generating x requests, waiting until 1 second has elapsed, and repeating the whole thing over and over again. I'm not making any assumptions about these requests; some might take much longer than others, which is why my scheduler thread will not perform the requests (or wait for them to finish) but hand them over to a sufficiently sized thread pool.
Now, if x is in the range of hundreds or less, I might get by with .NET's timers or Thread.Sleep, checking the actual elapsed time using Stopwatch.
But if I want to go into the thousands or tens of thousands, I could try a high-resolution timer to keep my roughly-equal-interval approach. This would, however (in most programming environments on a general-purpose OS), imply some amount of hand-coding with spin waiting and so forth, and I'm not sure this route is worthwhile.
Extending the initial approach, I could instead use a Timer to sleep and issue y requests on each Timer event, monitor the actual requests per second achieved, and fine-tune y at runtime. The effect is somewhere in between "fire all x requests and wait until 1 second has elapsed since the start", which I'm trying to avoid, and "wait more or less exactly 1/x seconds before starting the next request".
This seems like a good compromise, but is there anything easier that still spreads the requests somewhat evenly over time? This must have been implemented hundreds of times by different people, but I can't find good references on the issue.
So what's the easiest way to implement this?
One way to do it:
First find (good luck on Windows) or implement a usleep or nanosleep function. As a first step, this could be (on .NET) a simple Thread.SpinWait() / Stopwatch.Elapsed > x combination. If you want to get fancier, use Thread.Sleep() when the time span is large enough and only do the fine-tuning with Thread.SpinWait().
With that done, just take the inverse of the rate and you have the time interval you need to sleep between events. Your basic loop, which runs on one dedicated thread, then goes:
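The sleep-then-spin idea translates to C on POSIX roughly as follows. This is a sketch, not the .NET code: nanosleep plays the role of Thread.Sleep, a loop on clock_gettime plays the role of Thread.SpinWait/Stopwatch, `precise_wait_until` is a hypothetical name, and the 2 ms spin margin is an arbitrary assumption you would tune per system.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Assumed margin: closer than this to the deadline, the coarse sleep is not trusted. */
#define SPIN_MARGIN_NS 2000000LL  /* 2 ms */

/* Waits until the monotonic clock reaches deadline_ns: sleeps while the
   deadline is far away, then busy-waits for the last stretch. */
void precise_wait_until(int64_t deadline_ns)
{
    int64_t remaining = deadline_ns - now_ns();
    if (remaining > SPIN_MARGIN_NS) {
        int64_t coarse = remaining - SPIN_MARGIN_NS;
        struct timespec ts = { (time_t)(coarse / 1000000000LL), (long)(coarse % 1000000000LL) };
        nanosleep(&ts, NULL);  /* may wake early on a signal; the spin absorbs that */
    }
    while (now_ns() < deadline_ns)
        ;  /* spin: keeps this core fully busy until the deadline */
}
```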
Fire event
Sleep(sleepTime)
Then every, say, 250 ms (or more for faster rates), check the actually achieved rate and adjust the sleepTime interval, perhaps with some smoothing to dampen wild temporary swings, like this:
newSleepTime = max(1, sleepTime / targetRate * actualRate)
sleepTime = 0.3 * sleepTime + 0.7 * newSleepTime
This adjusts to what is actually going on in your program and on your system, and makes up for the time spent to invoke the event callback, and whatever the callback is doing on that same thread etc. Without this, you will probably not be able to get high accuracy.
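The adjustment step is small enough to write out as a function. This is a sketch in C, assuming the 0.3/0.7 smoothing weights and the lower clamp of 1 from the formula above; `adjust_sleep_time` is a hypothetical name, and the units of `sleep_time` are whatever your sleep call takes.

```c
/* Returns the smoothed sleep interval for the next period.
   If more events were achieved than targeted (actual_rate > target_rate),
   the interval grows in proportion; if fewer, it shrinks. */
double adjust_sleep_time(double sleep_time, double target_rate, double actual_rate)
{
    double new_sleep = sleep_time / target_rate * actual_rate;
    if (new_sleep < 1.0)
        new_sleep = 1.0;                          /* max(1, ...) */
    return 0.3 * sleep_time + 0.7 * new_sleep;    /* damp temporary swings */
}
```

For example, with sleep_time = 100 and the achieved rate at twice the target, the next interval is 0.3 * 100 + 0.7 * 200 = 170.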
Needless to say, if your rate is so high that you cannot use Sleep but always have to spin, one core will be spinning continuously. The good news: We get ever more cores on our machines, so one core matters less and less :) More serious though, as you mentioned in the comment, if your program does actual work, your event generator will have less time (and need) to waste cycles.
Check out https://github.com/EugenDueck/EventCannon for a proof-of-concept implementation in .NET. It's implemented roughly as described above and packaged as a library, so you can embed it in your program if you use .NET.
I know QueryPerformanceCounter() can be used for timing functions. I want to know:
1. Can I increase the resolution of the timer by overclocking the CPU (so it ticks faster)?
2. Basically, what makes some timers more precise than others (e.g., QueryPerformanceCounter() is more precise than GetTickCount())? If there is a single crystal oscillator on the motherboard, why are some timers slower than others?
QueryPerformanceCounter has very high resolution, often well under a microsecond. I don't see why you'd want to increase it. Overclocking might increase it on some systems, but it seems like a very weak reason for overclocking.
QueryPerformanceCounter is very accurate, but somewhat expensive and not very convenient.
a. It's expensive because it has to read a hardware counter (via the rdtsc instruction or a chipset timer, depending on the system). Faster timers can just read an integer from memory. That integer needs to be updated, and we don't want to do it too often (1000 times a second is reasonable), so we get a very cheap timer with low precision. That's basically GetTickCount.
b. It's inconvenient because its units change between computers. Sometimes a tick will be a nanosecond, sometimes half a nanosecond, or some other value (QueryPerformanceFrequency tells you which). That makes it harder to calculate with.
c. Another source of inconvenience is that it returns very large numbers, which may overflow when you do math with them, so you need to be careful.
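One way to avoid the overflow when converting counter ticks to microseconds is to split the count into whole seconds and a remainder before scaling. The sketch is plain C with the frequency passed in as a parameter (on Windows it would come from QueryPerformanceFrequency); `ticks_to_us` is a hypothetical helper.

```c
#include <stdint.h>

/* Converts a tick count to microseconds, given the counter frequency in
   ticks per second. Multiplying ticks by 1,000,000 first could overflow
   64 bits for large counts; scaling only the sub-second remainder avoids
   that as long as freq is below about 1.8e13 ticks per second. */
uint64_t ticks_to_us(uint64_t ticks, uint64_t freq)
{
    uint64_t whole_seconds = ticks / freq;
    uint64_t remainder = ticks % freq;            /* always < freq */
    return whole_seconds * 1000000u + remainder * 1000000u / freq;
}
```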
The timing source for QPC is machine-dependent. It is typically derived from a frequency available somewhere in the chipset. Whether overclocking the CPU affects it depends heavily on your motherboard design. The simplest way to find out is to just try it, using QueryPerformanceFrequency to see the effect.
GetTickCount is driven by an entirely different timer source, the signal that also generates the clock interrupt. It is not very precise (normally 1/64 of a second), but it is highly accurate. Your machine contacts a time server from time to time to recalibrate the clock and adjust the clock correction factor, which makes it accurate to about a second over an entire year. QPC is very precise, but not nearly as accurate, so use it only to time short intervals.
1. Yes. Internally, one of the better timers is rdtsc, which gives you the CPU clock (cycle) count. Combining this with information from the cpuid instruction gives you time.
2. The other timers rely on various timing sources, such as the 8253 timer.
QPC is a wrapper added by Microsoft on top of what rdtsc provides. Read this article for more info:
http://www.strchr.com/performance_measurements_with_rdtsc
I wish to write a C program which obtains the system time and then uses this time to print out its NTP equivalent.
Am I right in saying that the following is correct for the seconds part of the NTP time?
unsigned long ntp_seconds = (unsigned long)time(NULL) + 2208988800UL; // seconds from 1900 to 1970
How might I calculate the fraction part?
The fractional part to add is obviously 0 ps ... ;-)
So the question about the fraction reduces to how accurate the system clock is.
gettimeofday() gets you microseconds; clock_gettime() can get you nanoseconds.
Anyhow, I doubt you'll reach the theoretically possible resolution that the 32-bit fraction allows (at least on a standard PC).
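A sketch of the full conversion, assuming a POSIX system with clock_gettime(CLOCK_REALTIME, ...): the seconds get the 2208988800 offset from the question, and the nanoseconds are scaled into the 32-bit fraction, i.e. fraction = nanoseconds × 2^32 / 10^9. `struct ntp_time`, `ntp_from_timespec` and `print_ntp_now` are illustrative names, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTP_UNIX_OFFSET 2208988800ULL  /* seconds from 1900-01-01 to 1970-01-01 */

struct ntp_time {
    uint32_t seconds;   /* wraps in 2036 (end of NTP era 0) */
    uint32_t fraction;  /* units of 2^-32 s, roughly 233 ps */
};

struct ntp_time ntp_from_timespec(struct timespec ts)
{
    struct ntp_time t;
    t.seconds = (uint32_t)((uint64_t)ts.tv_sec + NTP_UNIX_OFFSET);
    /* nanoseconds * 2^32 / 1e9; fits in 64 bits because tv_nsec < 1e9 < 2^30 */
    t.fraction = (uint32_t)(((uint64_t)ts.tv_nsec << 32) / 1000000000ULL);
    return t;
}

void print_ntp_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    struct ntp_time t = ntp_from_timespec(ts);
    printf("%08" PRIx32 ".%08" PRIx32 "\n", t.seconds, t.fraction);
}
```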
I need to find out the time taken by a function in my application. The application is an MS Visual Studio 2005 solution, all C code.
I used the Windows API GetLocalTime(SYSTEMTIME *) to get the current system time before and after the call to the function I want to measure.
But this has the shortcoming that its resolution is only 1 ms, nothing below that, so I cannot get microsecond granularity.
I know that time(), which gives the time elapsed since the epoch, has an even coarser resolution (1 second).
Is there any other Windows API that gives time in microseconds which I can use to measure the time consumed by my function?
-AD
There are some other possibilities.
QueryPerformanceCounter and QueryPerformanceFrequency
QueryPerformanceCounter returns a "performance counter" value: a hardware-managed 64-bit counter that increments from 0, starting at computer power-on. The frequency of this counter is returned by QueryPerformanceFrequency. To get the time in seconds, divide the performance counter by the performance frequency. In Delphi:
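function QueryPerfCounterAsUS: int64;
var
  perfFreq: int64;
begin
  // QueryPerformanceCounter/QueryPerformanceFrequency come from the Windows unit
  if QueryPerformanceCounter(Result) and
     QueryPerformanceFrequency(perfFreq)
  then
    Result := Round(Result / perfFreq * 1000000) // counter ticks -> microseconds
  else
    Result := 0;
end;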
On multiprocessor platforms, QueryPerformanceCounter should return consistent results regardless of the CPU the thread is currently running on. There are occasional problems, though, usually caused by bugs in hardware chips or BIOSes; patches are usually provided by motherboard manufacturers. Two examples from MSDN:
Programs that use the QueryPerformanceCounter function may perform poorly in Windows Server 2003 and in Windows XP
Performance counter value may unexpectedly leap forward
Another problem with QueryPerformanceCounter is that it is quite slow.
RDTSC instruction
If you can limit your code to one CPU (SetThreadAffinityMask), you can use the RDTSC assembler instruction to read the processor's time-stamp counter directly.
function CPUGetTick: int64;
asm
dw 310Fh // rdtsc
end;
The RDTSC result increments at the processor's time-stamp-counter rate. That matches QueryPerformanceFrequency only on systems where QueryPerformanceCounter is itself based on the TSC; elsewhere you need to calibrate the TSC against a known clock before converting it to seconds.
QueryPerformanceCounter is much slower than RDTSC because it must take into account multiple CPUs and CPUs with variable frequency. From Raymond Chen's blog:
(QueryPerformanceCounter) counts elapsed time. It has to, since its value is governed by the QueryPerformanceFrequency function, which returns a number specifying the number of units per second, and the frequency is spec'd as not changing while the system is running.
For CPUs that can run at variable speed, this means that the HAL cannot use an instruction like RDTSC, since that does not correlate with elapsed time.
timeGetTime
timeGetTime belongs to the Win32 multimedia timer functions. It returns time in milliseconds with 1 ms resolution, at least on modern hardware. It doesn't hurt to call timeBeginPeriod(1) before you start measuring time and timeEndPeriod(1) when you're done.
GetLocalTime and GetSystemTime
Before Vista, both GetLocalTime and GetSystemTime return current time with millisecond precision, but they are not accurate to a millisecond. Their accuracy is typically in the range of 10 to 55 milliseconds. (Precision is not the same as accuracy)
On Vista, GetLocalTime and GetSystemTime both work with 1 ms resolution.
You can try clock(), which returns a count of "clock ticks" (standard C defines it as the processor time used by the program); divide a tick difference by CLOCKS_PER_SEC to get seconds. With MSVC, CLOCKS_PER_SEC is 1000, so this won't get you below a millisecond.
As a side note, you can't use clock() to determine the actual time of day, only the number of ticks between two points in your program.
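A minimal sketch of timing with clock() in standard C; `measured_function` and `time_with_clock` are hypothetical names. The tick difference is converted with CLOCKS_PER_SEC, which is the only portable way to interpret it.

```c
#include <time.h>

/* Hypothetical function being timed. */
static volatile unsigned long sink;
static void measured_function(void)
{
    for (unsigned long i = 0; i < 10000000UL; i++)
        sink += i;
}

/* Returns the clock() ticks spent in measured_function, converted to seconds. */
double time_with_clock(void)
{
    clock_t start = clock();
    measured_function();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```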
One caution on multiprocessor systems:
from http://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
On a multiprocessor computer, it should not matter which processor is called. However, you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL). To specify processor affinity for a thread, use the SetThreadAffinityMask function.
Al Weiner
On Windows you can use the 'high performance counter API'. Check out QueryPerformanceCounter and QueryPerformanceFrequency for the details.