Say I have a target of x requests/sec that I want to generate continuously. My goal is to start these requests at roughly the same interval, rather than just generating x requests and then waiting until 1 second has elapsed and repeating the whole thing over and over again. I'm not making any assumptions about these requests, some might take much longer than others, which is why my scheduler thread will not perform the requests (or wait for them to finish), but hand them over to a sufficiently sized Thread Pool.
Now if x is in the range of hundreds or less, I might get by with .net's Timers or Thread.Sleep and checking actually elapsed time using Stopwatch.
But if I want to go into the thousands or tens of thousands, I could try using a high-resolution timer to keep the intervals roughly equal. In most programming environments on a general-purpose OS, though, this would imply some amount of hand-coding with spin waiting and so forth, and I'm not sure it's worth taking this route.
Extending the initial approach, I could instead use a Timer to sleep and do y requests on each Timer event, monitor the actual requests per second achieved doing this and fine-tune y at runtime. The effect is somewhere in between "put all x requests and wait until 1 second elapsed since start", which I'm trying not to do, and "wait more or less exactly 1/x seconds before starting the next request".
The latter seems like a good compromise, but is there anything that's easier while still spreading the requests somewhat evenly over time? This must have been implemented hundreds of times by different people, but I can't find good references on the issue.
So what's the easiest way to implement this?
One way to do it:
First find (good luck on Windows) or implement a usleep or nanosleep function. As a first step, this could be (on .net) a simple Thread.SpinWait() / Stopwatch.Elapsed > x combo. If you want to get fancier, do Thread.Sleep() if the time span is large enough and only do the fine-tuning using Thread.SpinWait().
That done, just take the inverse of the rate and you have the time interval you need to sleep between each event. Your basic loop, which you do on one dedicated thread, then goes
Fire event
Sleep(sleepTime)
Then every, say, 250ms (or more for faster rates), check the actually achieved rate and adjust the sleepTime interval, perhaps with some smoothing to dampen wild temporary swings, like this
newSleepTime = max(1, sleepTime / targetRate * actualRate)
sleepTime = 0.3 * sleepTime + 0.7 * newSleepTime
This adjusts to what is actually going on in your program and on your system, and makes up for the time spent to invoke the event callback, and whatever the callback is doing on that same thread etc. Without this, you will probably not be able to get high accuracy.
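The adjustment step can be written as a small pure function, which makes it easy to check in isolation. This is only a sketch in C rather than .net; the function name and the choice of microseconds as the unit are assumptions for illustration, and the 0.3/0.7 weights are the ones from the formula above.

```c
/* Hypothetical helper: returns the smoothed sleep interval (assumed to be in
   microseconds) to use during the next measurement window. */
double adjust_sleep_time(double sleep_time_us, double target_rate, double actual_rate)
{
    /* Firing too fast (actual > target) lengthens the sleep,
       firing too slowly shortens it. */
    double corrected = sleep_time_us * actual_rate / target_rate;

    /* Same floor as the max(1, ...) in the formula above. */
    if (corrected < 1.0)
        corrected = 1.0;

    /* Blend the old and the corrected value to dampen temporary swings. */
    return 0.3 * sleep_time_us + 0.7 * corrected;
}
```

For example, with a 1000 microsecond sleep, a target of 1000 events/sec and a measured 1100 events/sec, the corrected interval is 1100 and the smoothed result is 1070.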
Needless to say, if your rate is so high that you cannot use Sleep but always have to spin, one core will be spinning continuously. The good news: We get ever more cores on our machines, so one core matters less and less :) More serious though, as you mentioned in the comment, if your program does actual work, your event generator will have less time (and need) to waste cycles.
Check out https://github.com/EugenDueck/EventCannon for a proof of concept implementation in .net. It's implemented roughly as described above and done as a library, so you can embed that in your program if you use .net.
Related
I work for a company that produces automatic machines, and I help maintain the software that controls them. The software runs on a real-time operating system and consists of multiple threads running concurrently. The code bases are legacy and carry substantial technical debt. Among all the issues they exhibit, one stands out as rather bizarre to me: most of the timing algorithms that compute elapsed time to realize common timed features such as timeouts, delays, recording time spent in a particular state, etc., basically take the following form:
unsigned int shouldContinue = 1;
unsigned int blockDuration = 1; // Let's say 1 millisecond.
unsigned int loopCount = 0;
unsigned int elapsedTime = 0;
while (shouldContinue)
{
    .
    . // a bunch of statements, selections and function calls
    .
    blockingSystemCall(blockDuration);
    .
    . // a bunch of statements, selections and function calls
    .
    loopCount++;
    elapsedTime = loopCount * blockDuration;
}
The blockingSystemCall function can be any operating system's API that suspends the current thread for the specified blockDuration. The elapsedTime variable is subsequently computed by basically multiplying loopCount by blockDuration or by any equivalent algorithm.
To me, this kind of timing algorithm is wrong, and is not acceptable under most circumstances. All the instructions in the loop, including the condition of the loop, are executed sequentially, and each instruction requires measurable CPU time to execute. Therefore, the actual time elapsed is strictly greater than the value of elapsedTime in any given instance after the loop starts. Consequently, suppose the CPU time required to execute all the statements in the loop, denoted by d, is constant. Then, elapsedTime lags behind the actual time elapsed by loopCount • d for any loopCount > 0; that is, the deviation grows according to an arithmetic progression. This sets the lower bound of the deviation because, in reality, there will be additional delays caused by thread scheduling and time slicing, depending on other factors.
In fact, not too long ago, while testing a new data-driven predictive maintenance feature which relies on the operation time of a machine, we discovered that the operation time reported by the software lagged behind that of a standard reference clock by a whopping three hours after the machine was in continuous operation for just over two days. It was through this test that I discovered the algorithm outlined above, which I swiftly determined to be the root cause.
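For a rough sense of scale, the reported lag can be turned into an average unaccounted time per loop iteration. The sketch below assumes exactly 48 hours of operation, a lag of exactly 3 hours, and the 1 ms blockDuration from the skeleton above (itself only an example value); the function name is made up for illustration.

```c
/* Average time per loop iteration that the counting scheme fails to account
   for, in microseconds. All inputs are in seconds. */
double overhead_per_iteration_us(double actual_s, double lag_s, double block_s)
{
    double reported_s = actual_s - lag_s;      /* what loopCount * blockDuration shows */
    double iterations = reported_s / block_s;  /* this is loopCount */
    return lag_s / iterations * 1e6;
}
```

Calling `overhead_per_iteration_us(172800.0, 10800.0, 0.001)` gives roughly 67 microseconds per iteration under those assumptions. That figure lumps together statement execution time and any scheduling delay.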
Coming from a background where I used to implement timing algorithms on bare-metal systems using timer interrupts, which allows the CPU to carry on with the execution of the business logic while the timer process runs in parallel, it was shocking for me to have discovered that the algorithm outlined in the introduction is used in the industry to compute elapsed time, even more so when a typical operating system already encapsulates the timer functions in the form of various easy-to-use public APIs, liberating the programmer from the hassle of configuring a timer via hardware registers, raising events via interrupt service routines, etc.
The kind of timing algorithm as illustrated in the skeleton code above is found in at least two code bases independently developed by two distinct software engineering teams from two subsidiary companies located in two different cities, albeit within the same state. This makes me wonder whether it is how things are normally done in the industry or it is just an isolated case and is not widespread.
So, the question is, is the algorithm shown above common or acceptable in calculating elapsed time, given that the underlying operating system already provides highly optimized time-management system calls that can be used right out of the box to accurately measure elapsed time or even used as basic building blocks for creating higher-level timing facilities that provide more intuitive methods similar to, e.g., the Timer class in C#?
You're right that calculating elapsed time that way is inaccurate -- since it assumes that the blocking call will take exactly the amount of time indicated, and that everything that happens outside of the blocking system call will take no time at all, which would only be true on an infinitely-fast machine. Since actual machines are not infinitely fast, the elapsed-time calculated this way will always be somewhat less than the actual elapsed time.
As to whether that's acceptable, it's going to depend on how much timing accuracy your program needs. If it's just doing a rough estimate to make sure a function doesn't run for "too long", this might be okay. OTOH if it is trying for accuracy (and in particular accuracy over a long period of time), then this approach won't provide that.
FWIW the more common (and more accurate) way to measure elapsed time would be something like this:
const unsigned int startTime = current_clock_time();
while (shouldContinue)
{
    loopCount++;
    elapsedTime = current_clock_time() - startTime;
}
This has the advantage of not "drifting away" from the accurate value over time, but it does assume that you have a current_clock_time() type of function available, and that it's acceptable to call it within the loop. (If current_clock_time() is very expensive, or doesn't provide some real-time performance guarantees that the calling routine requires, that might be a reason not to do it this way)
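On a POSIX system, one possible stand-in for current_clock_time() is clock_gettime() with CLOCK_MONOTONIC. The sketch below assumes such a system; the helper names are illustrative, and the RTOS in the question may expose a different call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current reading of the monotonic clock in nanoseconds. */
uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Whole milliseconds elapsed since a reading taken with monotonic_ns().
   The value comes from the clock, not from counting loop iterations,
   so it does not drift as the loop body gets slower or faster. */
uint64_t elapsed_ms_since(uint64_t start_ns)
{
    return (monotonic_ns() - start_ns) / 1000000ull;
}
```

In the loop above, `startTime` would be set once from `monotonic_ns()` and `elapsedTime` would be `elapsed_ms_since(startTime)` on each pass.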
I don't think these loops do what you think they do.
In a RTOS, the purpose of a loop like this is usually to perform a task at regular intervals.
blockingSystemCall(N) probably does not just sleep for N milliseconds like you think it does. It probably sleeps until N milliseconds after the last time your thread woke up.
More accurately, all the sleeps your thread has performed since starting are added to the thread start time to get the time at which the OS will try to wake the thread up. If your thread woke up due to an I/O event, then the last one of those times could be used instead of the thread start time. The point is that the inaccuracies in all these start times are corrected, so your thread wakes up at regular intervals and the elapsed time measurement is perfectly accurate according to the RTOS master clock.
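Whether blockingSystemCall behaves this way depends on the RTOS, which the question does not name. As an illustration of the idea only, here is how a drift-free periodic loop can look on POSIX, using clock_nanosleep() with TIMER_ABSTIME so that each wake-up is scheduled relative to the previous deadline rather than to the moment the sleep was requested. The function names are assumptions, and period_ns is assumed to be below one second.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000L

/* Advance an absolute deadline by period_ns, keeping tv_nsec normalized. */
void advance_deadline(struct timespec *t, long period_ns)
{
    t->tv_nsec += period_ns;
    while (t->tv_nsec >= NS_PER_SEC) {
        t->tv_nsec -= NS_PER_SEC;
        t->tv_sec += 1;
    }
}

/* Run `cycles` periods and return the total time in nanoseconds,
   measured on the same clock that schedules the wake-ups. */
int64_t run_periodic(long period_ns, int cycles)
{
    struct timespec start, next, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    next = start;
    for (int i = 0; i < cycles; i++) {
        /* ... the periodic work would go here ... */
        advance_deadline(&next, period_ns);
        /* Sleep until the absolute deadline, so time spent on the work
           above is absorbed instead of accumulating as drift. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (int64_t)(end.tv_sec - start.tv_sec) * NS_PER_SEC
         + (end.tv_nsec - start.tv_nsec);
}
```

With this scheme, multiplying the loop count by the period does match the scheduling clock, as long as the work in each cycle finishes before the next deadline.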
There could also be very good reasons for measuring elapsed time by the RTOS master clock instead of a more accurate wall clock time, in addition to simplicity. This is because all of the guarantees that an RTOS provides (which is the reason you are using a RTOS in the first place) are provided in that time scale. The amount of time taken by one task can affect the amount of time you are guaranteed to have available for other tasks, as measured by this clock.
It may or may not be a problem that your RTOS master clock runs slow by 3 hours every 2 days...
System is an embedded Linux/Busybox core on a small embedded board with a web server (Boa) running.
We are seeing some high latency in responses from the web server - sometimes >500ms for no good reason, so I've been digging...
On liberally scattering debug prints throughout the code it seems to come down to the entire process just... stopping for a bit, in a way which I can only assume must be the process/thread being interrupted by another process.
Using print statements and clock_gettime() to calculate time taken to process a request, I can see the code reach the bottom of a while() loop (parsing input), print something like "Time so far: 5ms" and then the next line at the top of the loop will print "Time so far: 350ms" - and all that the code does between the bottom of the loop and the 1st print back at the top is a basic check along the lines of while(position < end), it has nothing complicated that could hold it up.
There's no IO blocking, the data it's parsing has all arrived already, and it's not making any external calls or wandering off into complex functions.
I then looked into whether the kernel scheduler (CFS in our case) might be holding things up by adding calls to clock() (processor time rather than wall-clock time) and comparing wall-clock time differences with processor time used. The wall-clock delay may run beyond 300ms from one loop iteration to the next, but the reported processor time taken (which seems to have a ~10ms resolution) is more like 50ms.
So, that suggests the task scheduler is holding the process up for hundreds of milliseconds at a time. I've checked the scheduler granularity and max delay and it's nowhere near 100ms, scheduler latency is set at 6ms for example.
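The wall-clock versus processor-time comparison described above can be packaged so that it is cheap to drop around any suspicious region. This is a sketch assuming a POSIX clock_gettime() with CLOCK_PROCESS_CPUTIME_ID, which typically offers finer resolution than clock(); the type and function names are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read the given clock in nanoseconds. */
int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* A paired reading of wall-clock and process CPU time. */
typedef struct {
    int64_t wall_ns;
    int64_t cpu_ns;
} time_mark;

time_mark mark_now(void)
{
    time_mark m;
    m.wall_ns = clock_ns(CLOCK_MONOTONIC);
    m.cpu_ns = clock_ns(CLOCK_PROCESS_CPUTIME_ID);
    return m;
}

/* Time between two marks during which the process was NOT running on a CPU
   (blocked, or runnable but not scheduled). */
int64_t off_cpu_ns(time_mark before, time_mark after)
{
    return (after.wall_ns - before.wall_ns) - (after.cpu_ns - before.cpu_ns);
}
```

A large off-CPU value across a region that makes no blocking calls points at the process being descheduled or stalled (for example by a page fault) rather than at the code itself.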
Any advice on what I can do now to try and track down the problem - identifying processes which could hog the CPU for >100ms, measuring/tracking what the scheduler is doing, etc.?
First you should try and run your program using strace to see if there are any system calls holding things up.
If that is ambiguous or does not help, I would suggest you try to profile the kernel. You could try OProfile.
This will create a call graph that you can analyze and see what is happening.
Can someone please tell me how this function works? I'm using it in code and have an idea how it works, but I'm not 100% sure exactly. I understand the concept of an input variable N counting down, but how the heck does it work? Also, if I am using it repeatedly in my main() for different delays (different inputs for N), then do I have to "zero" the function if I used it somewhere else?
Reference: MILLISEC is a constant defined by Fcy/10000, or system clock/10000.
Thanks in advance.
// DelayNmSec() gives a 1mS to 65.5 Seconds delay
/* Note that FCY is used in the computation. Please make the necessary
Changes(PLLx4 or PLLx8 etc) to compute the right FCY as in the define
statement above. */
void DelayNmSec(unsigned int N)
{
    unsigned int j;
    while(N--)
        for(j=0;j < MILLISEC;j++);
}
This is referred to as busy waiting, a concept that just burns some CPU cycles thus "waiting" by keeping the CPU "busy" doing empty loops. You don't need to reset the function, it will do the same if called repeatedly.
If you call it with N=3, it will repeat the while loop 3 times, every time counting with j from 0 to MILLISEC, which is supposedly a constant that depends on the CPU clock.
The original author of the code has presumably timed the loop and looked at the generated assembler to get the exact number of instructions executed per millisecond, and has configured the constant MILLISEC to match, so that one pass of the for loop busy-waits for about one millisecond.
The input parameter N is then simply the number of milliseconds the caller wants to wait, which is the number of times the for loop is executed.
The code will break if
used on a different or faster micro controller (depending on how Fcy is maintained), or
the optimization level on the C compiler is changed, or
c-compiler version is changed (as it may generate different code)
so, if the guy who wrote it is clever, there may be a calibration program which defines and configures the MILLISEC constant.
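To make the loop structure concrete, here is an instrumented variant that counts inner-loop passes instead of relying on timing. The MILLISEC value of 1000 and the counter are assumptions for illustration only; the real constant is derived from FCY for the target device.

```c
#define MILLISEC 1000u  /* hypothetical value; the real one comes from FCY */

/* volatile keeps the compiler from optimizing the counting loop away,
   which is one of the ways the original empty loop can break. */
volatile unsigned long inner_iterations = 0;

void DelayNmSecCounted(unsigned int N)
{
    unsigned int j;
    while (N--)                         /* once per requested millisecond */
        for (j = 0; j < MILLISEC; j++)  /* calibrated to take about 1 ms */
            inner_iterations++;
}
```

Calling `DelayNmSecCounted(3)` performs 3 * MILLISEC inner passes. Since N and j are local to each call, nothing carries over between calls, so there is nothing to "zero" in the original function.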
This is what is known as a busy wait in which the time taken for a particular computation is used as a counter to cause a delay.
This approach does have problems in that on different processors with different speeds, the computation needs to be adjusted. Old games used this approach and I remember a simulation using this busy wait approach that targeted an old 8086 type of processor to cause an animation to move smoothly. When the game was used on a Pentium processor PC, instead of the rocket majestically rising up the screen over several seconds, the entire animation flashed before your eyes so fast that it was difficult to see what the animation was.
This sort of busy wait means that in the thread running, the thread is sitting in a computation loop counting down for the number of milliseconds. The result is that the thread does not do anything else other than counting down.
If the operating system is not a preemptive multi-tasking OS, then nothing else will run until the count down completes which may cause problems in other threads and tasks.
If the operating system is preemptive multi-tasking the resulting delays will have a variability as control is switched to some other thread for some period of time before switching back.
This approach is normally used for small pieces of software on dedicated processors where a computation has a known amount of time and where having the processor dedicated to the countdown does not impact other parts of the software. An example might be a small sensor that performs a reading to collect a data sample then does this kind of busy loop before doing the next read to collect the next data sample.
I am writing a Gif animator in C.
I have two threads running in parallel. The first allows the user to alter the speed of the animation. The second draws the current frame, and then calls Sleep(Constant * 100 / CurrentSpeed), where CurrentSpeed is a percentage amount, ranging from 1 to 200.
The problem is that if you quickly change the speed from 100%, to 1%, and then back to the first, the second thread will execute the following:
Sleep(Constant * 100)
This will draw frame A, wait many seconds (even though the user has already changed the speed back), and only then draw B and the following frames at the default speed.
It seems to me that Sleep is a poor choice of mine in this case. What can I do to solve this problem?
EDIT:
The code I currently have (Simplified):
while (1) {
    InvalidateRect(Handle, &ImageRect, FALSE);
    if (shouldDispose) {
        break;
    }
    if (DelayTime)
        Sleep(DelayTime * 100 / CurrentSpeed);
    SelectNextImage();
}
Instead of calling Sleep() with the desired frame rate, why don't you call it with a constant interval of 1 ms, for example, and use a variable as a counter?
For example, let C be a global variable (counter) which is loaded with a number of 'ticks' of 1ms. Then, write the loop:
while(1) { //Main loop of the player thread
    if (C > 0) C--;
    if (C == 0) nextframe(); //if counter reaches 0, load next frame.
    Sleep(1);
}
The control thread would load C with a number of 1ms ticks (i.e. the frame delay), and the player thread will never be stopped for more than 1 ms. The use of 1ms as the base rate is arbitrary. Use the longest tick that still allows your maximum frame rate, in order to load the CPU as little as possible.
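A variation on the same polling idea is to track how long the current frame has been shown and compare that against the delay implied by whatever speed is in effect at that tick. The sketch below shows only that decision logic in portable C; the function names are made up, and current_speed is assumed to stay within 1 to 200 as in the question.

```c
/* Display time in ms for one frame at the given speed percentage (1..200). */
unsigned frame_interval_ms(unsigned delay_time, unsigned current_speed)
{
    return delay_time * 100u / current_speed;
}

/* Polled once per tick: has the current frame been shown long enough,
   judged by the speed that is in effect right now? */
int frame_due(unsigned elapsed_ms, unsigned delay_time, unsigned current_speed)
{
    return elapsed_ms >= frame_interval_ms(delay_time, current_speed);
}
```

Because the comparison is redone on every tick, a switch from 1% back to 100% takes effect within one tick instead of after the long sleep has run out.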
EDIT
After some heated comments (arguing is good after all), I'd like to point out that this solution is sub-optimal, i.e., it doesn't use any OS mechanism for signaling threads or any other API for preventing the thread from wasting CPU time. The solution shown here is generic: it may be used in any system (even in embedded systems without any running OS). But above all, it is based on the original code posted by the user who asked the question: using Sleep(), how can I achieve my purpose? I gave him my humble answer. Anyway, I encourage other people to write sample code using the appropriate API for achieving the same goal. With no hard feelings, special thanks to Martin James.
Find a synchro API on your OS that allows a wait with a timeout, eg. WaitForSingleObject() on Windows. If you want to change the delay, change the timeout and signal the event upon which the WFSO is waiting to make it return 'early' and restart the wait with the new timeout.
Polling with Sleep(1) loops is rarely justifiable.
Create a waitable timer. When you set the timer, you can specify a callback function that will run in the setting thread's context. This means you can do it with two threads, but it actually works just fine with only a single thread as well.
The main advantage of a waitable timer is, however, that it is more accurate and more reliable than Sleep. A timer is conceptually much different from Sleep insofar as Sleep only gives up control and the scheduler marks the thread as ready to run when the time is up and when the scheduler runs anyway. It doesn't do anything beyond that. Which means that the thread will eventually be scheduled to run again, like any other thread that is ready.
A thread that is waiting on a timer (or other waitable object) causes the scheduler to run when the timer is up and has its priority temporarily boosted. It therefore runs not only more reliably and more closely to the desired time, but also earlier than all other threads with the same base priority. Which does not give a realtime guarantee but at least gives a sort of "soft guarantee".
If you still want to use Sleep, use SleepEx instead which you can alert, either by queueing an APC, or by calling the undocumented NtAlertThread function.
In any case, Sleep is troublesome not only because it is unreliable, but also because it depends on the granularity of the system-wide timer. You can, of course, set that to as low as 1ms (or less on some systems), but doing so will cause a lot of unnecessary interrupts.
I'm trying to determine the granularity I can accurately schedule tasks to occur in C/C++. At the moment I can reliably schedule tasks to occur every 5 microseconds, but I'm trying to see if I can lower this further.
Any advice on how to achieve this / if it is possible would be greatly appreciated.
Since I know timer granularity can often be OS dependent: I am currently running on Linux, but would use Windows if the timing granularity is better (although I don't believe it is, based on what I've found for the QueryPerformanceCounter)
I execute all measurements on bare-metal (no VM). /proc/timer_list confirms nanosecond timer resolution for my CPU (but I know that doesn't translate to nanosecond alarm resolution)
Current
My current code can be found as a Gist here
At the moment, I'm able to execute a request every 5 microseconds (5000 nanoseconds) with less than 1% late arrivals. When late arrivals do occur, they are typically only one cycle (5000 nanoseconds) behind.
I'm doing 3 things at the moment
Setting the process to real-time priority (as pointed out by #Spudd86 here)
struct sched_param schedparm;
memset(&schedparm, 0, sizeof(schedparm));
schedparm.sched_priority = 99; // highest rt priority
sched_setscheduler(0, SCHED_FIFO, &schedparm);
Minimizing the timer slack
prctl(PR_SET_TIMERSLACK, 1);
Using timerfds (part of the 2.6 Linux kernel)
int timerfd = timerfd_create(CLOCK_MONOTONIC,0);
struct itimerspec timspec;
bzero(&timspec, sizeof(timspec));
timspec.it_interval.tv_sec = 0;
timspec.it_interval.tv_nsec = nanosecondInterval;
timspec.it_value.tv_sec = 0;
timspec.it_value.tv_nsec = 1;
timerfd_settime(timerfd, 0, &timspec, 0);
Possible improvements
Dedicate a processor to this process?
Use a nonblocking timerfd so that I can create a tight loop, instead of blocking (tight loop will waste more CPU, but may also be quicker to respond to an alarm)
Using an external embedded device for triggering (can't imagine why this would be better)
Why
I'm currently working on creating a workload generator for a benchmarking engine. The workload generator simulates an arrival rate (X requests / second, etc.) using a Poisson process. From the Poisson process, I can determine the relative times at which requests must be made from the benchmarking engine.
So for instance, at 10 requests a second, we may have requests made at:
t = 0.02, 0.04, 0.05, 0.056, 0.09 seconds
These requests need to be scheduled in advance and then executed. As the number of requests per second increases, the granularity required for scheduling these requests increases (thousands of requests per second requires sub-millisecond accuracy). As a result, I'm trying to figure out how to scale this system further.
You're very close to the limits of what vanilla Linux will offer you, and it's way past what it can guarantee. Adding the real-time patches to your kernel and tuning for full pre-emption will help give you better guarantees under load. I would also remove any dynamic memory allocation from your time critical code, malloc and friends can (and will) stall for a not-inconsequential (in a real-time sense) period of time if it has to reclaim the memory from the i/o cache. I would also be considering removing swap from that machine to help guarantee performance. Dedicating a processor to your task will help to prevent context switch times but, again, it's no guarantee.
I would also suggest that you be careful with that level of sched_priority, you're above various important bits of Linux there, which can lead to very strange effects.
What you gain from building a realtime kernel is more reliable guarantees (ie lower maximum latency) of the time between an IO/timer event handled by the kernel, and control being passed to your app in response. This comes at the price of lower throughput, and you might notice an increase in your best-case latency times.
However, the only reason for using OS timers to schedule events with high precision is if you're afraid of burning CPU cycles in a loop while you wait for your next due event. OS timers (especially in MS Windows) are not reliable for high granularity timing events, and are very dependent on the sort of timing/HPET hardware available in your system.
When I require highly accurate event scheduling, I use a hybrid method. First, I measure the worst case latency - that is, the biggest difference between the time I requested to sleep, and the actual clock time after sleeping. Let's call this difference "D". (You can actually do this on-the-fly during normal running, by tracking "D" every time you sleep, with something like "D = (D*7 + lastD) / 8" to produce a temporal average).
Then never request to sleep beyond "N - D*2", where "N" is the time of the next event. When within "D*2" time of the next event, enter a spin loop and wait for "N" to occur.
This eats a lot more CPU cycles, but depending on the accuracy you require, you might be able to get away with a "sched_yield()" in your spin loop, which is more kind to your system.
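The hybrid method described above could be sketched as follows, assuming a POSIX system with clock_gettime(), nanosleep() and sched_yield(). The function names are made up for illustration, and D is passed in by the caller rather than tracked inside the wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Running estimate of sleep overshoot: D = (D*7 + lastD) / 8. */
int64_t update_latency(int64_t d, int64_t last_d)
{
    return (d * 7 + last_d) / 8;
}

/* Wait for absolute time deadline_ns: sleep while more than 2*D away,
   then spin (yielding) for the remainder. Returns how late we woke, in ns. */
int64_t wait_until(int64_t deadline_ns, int64_t d_ns)
{
    int64_t remaining = deadline_ns - now_ns();
    if (remaining > 2 * d_ns) {
        int64_t nap = remaining - 2 * d_ns;
        struct timespec ts;
        ts.tv_sec = (time_t)(nap / 1000000000LL);
        ts.tv_nsec = (long)(nap % 1000000000LL);
        nanosleep(&ts, NULL);
    }
    while (now_ns() < deadline_ns)
        sched_yield();  /* a plain empty loop is tighter but less kind */
    return now_ns() - deadline_ns;
}
```

A caller would measure each sleep's overshoot, feed it through `update_latency`, and pass the current estimate as `d_ns` on the next wait.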