System is an embedded Linux/Busybox core on a small embedded board with a web server (Boa) running.
We are seeing some high latency in responses from the web server - sometimes >500ms for no good reason, so I've been digging...
On liberally scattering debug prints throughout the code it seems to come down to the entire process just... stopping for a bit, in a way which I can only assume must be the process/thread being interrupted by another process.
Using print statements and clock_gettime() to calculate time taken to process a request, I can see the code reach the bottom of a while() loop (parsing input), print something like "Time so far: 5ms" and then the next line at the top of the loop will print "Time so far: 350ms" - and all that the code does between the bottom of the loop and the 1st print back at the top is a basic check along the lines of while(position < end), it has nothing complicated that could hold it up.
There's no IO blocking, the data it's parsing has all arrived already, and it's not making any external calls or wandering off into complex functions.
I then looked into whether the kernel scheduler (CFS in our case) might be holding things up, adding calls to clock() (processor time rather than wall-clock) and again calculating time differences Vs processor time used I can see that the wall-clock time delay may run beyond 300ms from one loop to the next, but the reported processor time taken (which seems to have a ~10ms resolution) is more like 50ms.
So, that suggests the task scheduler is holding the process up for hundreds of milliseconds at a time. I've checked the scheduler granularity and max delay and it's nowhere near 100ms, scheduler latency is set at 6ms for example.
Any advice on what I can do now to try and track down the problem - identifying processes which could hog the CPU for >100ms, measuring/tracking what the scheduler is doing, etc.?
First you should try and run your program using strace to see if there are any system calls holding things up.
If that is ambiguous or does not help I would suggest you try and profile the kernel. You could try OProfile
This will create a call graph that you can analyze and see what is happening.
Related
Come someone please tell me how this function works? I'm using it in code and have an idea how it works, but I'm not 100% sure exactly. I understand the concept of an input variable N incrementing down, but how the heck does it work? Also, if I am using it repeatedly in my main() for different delays (different iputs for N), then do I have to "zero" the function if I used it somewhere else?
Reference: MILLISEC is a constant defined by Fcy/10000, or system clock/10000.
Thanks in advance.
// DelayNmSec() gives a 1mS to 65.5 Seconds delay
/* Note that FCY is used in the computation. Please make the necessary
Changes(PLLx4 or PLLx8 etc) to compute the right FCY as in the define
statement above. */
void DelayNmSec(unsigned int N)
{
unsigned int j;
while(N--)
for(j=0;j < MILLISEC;j++);
}
This is referred to as busy waiting, a concept that just burns some CPU cycles thus "waiting" by keeping the CPU "busy" doing empty loops. You don't need to reset the function, it will do the same if called repeatedly.
If you call it with N=3, it will repeat the while loop 3 times, every time counting with j from 0 to MILLISEC, which is supposedly a constant that depends on the CPU clock.
The original author of the code have timed and looked at the assembler generated to get the exact number of instructions executed per Millisecond, and have configured a constant MILLISEC to match that for the for loop as a busy-wait.
The input parameter N is then simply the number of milliseconds the caller want to wait and the number of times the for-loop is executed.
The code will break if
used on a different or faster micro controller (depending on how Fcy is maintained), or
the optimization level on the C compiler is changed, or
c-compiler version is changed (as it may generate different code)
so, if the guy who wrote it is clever, there may be a calibration program which defines and configures the MILLISEC constant.
This is what is known as a busy wait in which the time taken for a particular computation is used as a counter to cause a delay.
This approach does have problems in that on different processors with different speeds, the computation needs to be adjusted. Old games used this approach and I remember a simulation using this busy wait approach that targeted an old 8086 type of processor to cause an animation to move smoothly. When the game was used on a Pentium processor PC, instead of the rocket majestically rising up the screen over several seconds, the entire animation flashed before your eyes so fast that it was difficult to see what the animation was.
This sort of busy wait means that in the thread running, the thread is sitting in a computation loop counting down for the number of milliseconds. The result is that the thread does not do anything else other than counting down.
If the operating system is not a preemptive multi-tasking OS, then nothing else will run until the count down completes which may cause problems in other threads and tasks.
If the operating system is preemptive multi-tasking the resulting delays will have a variability as control is switched to some other thread for some period of time before switching back.
This approach is normally used for small pieces of software on dedicated processors where a computation has a known amount of time and where having the processor dedicated to the countdown does not impact other parts of the software. An example might be a small sensor that performs a reading to collect a data sample then does this kind of busy loop before doing the next read to collect the next data sample.
I have written a thread hang detector which prints some debugging output (backtraces, etc.) when some thread hangs unexpectedly. Each thread which wants to be watched registers itself in the hang-detector system, specifies some timeout (in my case 5 secs) and calls some IAmAlife() function frequently (in my case about every 1-10ms).
It works great. However, in some cases, I get false positives. E.g., when I SIGSTOP the process and resume it later, it gets triggered (for example, when attaching with a debugger like GDB/LLDB). And also rarely, when the process is just not doing much, just idling, I guess MacOSX' App Nap kicks in and it also triggers the hang detector.
How could I detect such system hangs? Looking at the processor time (clock()) doesn't help that much because if my app is in a deadlock, it will probably also not consume much (if any) processor time.
So, the implementation might be a little complex, but the basic idea is simple: track the wall clock time and use that to figure a "grace period" to be added on to the amount of time that each thread can be late phoning home.
Use something like gettimeofday() (http://linux.die.net/man/2/gettimeofday) to record the time when your watcher thread first starts. Each time it re-awakes, call gettimeofday() again, take the difference, and then subtract from that the amount of "processor time" elapsed. That gives you the rough "grace period" that you should grant, since it's time that your process was not running.
The only minor complexification arises because the grace period needs to be maintained separately for each of the threads you are watching. Since you clearly know enough to write threads, I'll assume that part is well within your ability :-).
I am writing a Gif animator in C.
I have two threads running in parallel, both . The first allows the user to alter the speed of the animation. The second draws the current frame, and then calls Sleep(Constant * 100 / CurrentSpeed), where CurrentSpeed is a percentage amount, ranging from 1 to 200.
The problem is that if you quickly change the speed from 100%, to 1%, and then back to the first, the second thread will execute the following:
Sleep(Constant * 100)
This will draw frame A, wait many seconds (although the speed was changed by the user), and only then draw B and the following frames in the default speed.
It seems to me that Sleep is a poor choice of mine in this case. What can I do to solve this problem?
EDIT:
The code I currently have (Simplified):
while (1) {
InvalidateRect(Handle, &ImageRect, FALSE);
if (shouldDispose) {
break;
}
if (DelayTime)
Sleep(DelayTime * 100 / CurrentSpeed);
SelectNextImage();
}
Instead of calling Sleep() with the desired frame rate, why don't you call it with a constant interval of 1 ms, for example, and use a variable as a counter?
For example, let C be a global variable (counter) which is loaded with a number of 'ticks' of 1ms. Then, write the loop:
while(1) { //Main loop of the player thread
if (C > 0) C--;
if (C == 0) nextframe(); //if counter reaches 0, load next frame.
Sleep(1);
}
The control thread would load C with a number of 1ms ticks (i.e. frame rate), and the player thread will never be stopped beyond 1 ms. The use of 1ms as the base rate is arbitrary. Use the minimum time that allows you the maximum frame rate, in order to load CPU the less as possible.
EDIT
After some hot comments (arguing is good after all), I'd like to point out that this solution is sub-optimal, i.e., it doesn't use any OS mechanism for signaling threads or any other API for preventing the thread from wasting CPU time. The solution shown here is generic: it may be used in any system (even in embedded systems without any running OS. But above all, it is based on the original code posted by the user that asked the question: using Sleep(), how can I achieve my purpose. I give him my humble answer. Anyway, I encourage other people to write sample code using the appropriate API for achieving the same goal. With no hard feelings, special thanks to Martin James.
Find a synchro API on your OS that allows a wait with a timeout, eg. WaitForSingleObject() on Windows. If you want to change the delay, change the timeout and signal the event upon which the WFSO is waiting to make it return 'early' and restart the wait with the new timeout.
Polling with Sleep(1) loops is rarely justifiable.
Create a waitable timer. When you set the timer, you can specify a callback function that will run in the setting thread's context. This means you can do it with two threads, but it actually works just fine with only a single thread as well.
The main advantage of a waitable timer is, however, that it is more accurate and more reliable than Sleep. A timer is conceptually much different from Sleep insofar as Sleep only gives up control and the scheduler marks the thread as ready to run when the time is up and when the scheduler runs anyway. It doesn't do anything beyond that. Which means that the thread will eventually be scheduled to run again, like any other thread that is ready.
A thread that is waiting on a timer (or other waitable object) causes the scheduler to run when the timer is up and has its priority temporarily boosted. It therefore runs not only more reliably and more closely to the desired time, but also earlier than all other threads with the same base priority. Which does not give a realtime guarantee but at least gives a sort of "soft guarantee".
If you still want to use Sleep, use SleepEx instead which you can alert, either by queueing an APC, or by calling the undocumented NtAlertThread function.
In any case, Sleep is troublesome not only because of being unreliable, but also because it bases on the granularity of the system-wide timer. Which you can, of course, set to as low as 1ms (or less on some systems), but that will cause a lot of unnecessary interrupts.
How would be the correct way to prevent a soft lockup/unresponsiveness in a long running while loop in a C program?
(dmesg is reporting a soft lockup)
Pseudo code is like this:
while( worktodo ) {
worktodo = doWork();
}
My code is of course way more complex, and also includes a printf statement which gets executed once a second to report progress, but the problem is, the program ceases to respond to ctrl+c at this point.
Things I've tried which do work (but I want an alternative):
doing printf every loop iteration (don't know why, but the program becomes responsive again that way (???)) - wastes a lot of performance due to unneeded printf calls (each doWork() call does not take very long)
using sleep/usleep/... - also seems like a waste of (processing-)time to me, as the whole program will already be running several hours at full speed
What I'm thinking about is some kind of process_waiting_events() function or the like, and normal signals seem to be working fine as I can use kill on a different shell to stop the program.
Additional background info: I'm using GWAN and my code is running inside the main.c "maintenance script", which seems to be running in the main thread as far as I can tell.
Thank you very much.
P.S.: Yes I did check all other threads I found regarding soft lockups, but they all seem to ask about why soft lockups occur, while I know the why and want to have a way of preventing them.
P.P.S.: Optimizing the program (making it run shorter) is not really a solution, as I'm processing a 29GB bz2 file which extracts to about 400GB xml, at the speed of about 10-40MB per second on a single thread, so even at max speed I would be bound by I/O and still have it running for several hours.
While the posed answer using threads might possibly be an option, it would in reality just shift the problem to a different thread. My solution after all was using
sleep(0)
Also tested sched_yield / pthread_yield, both of which didn't really help. Unfortunately I've been unable to find a good resource which documents sleep(0) in linux, but for windows the documentation states that using a value of 0 lets the thread yield it's remaining part of the current cpu slice.
It turns out that sleep(0) is most probably relying on what is called timer slack in linux - an article about this can be found here: http://lwn.net/Articles/463357/
Another possibility is using nanosleep(&(struct timespec){0}, NULL) which seems to not necessarily rely on timer slack - linux man pages for nanosleep state that if the requested interval is below clock granularity, it will be rounded up to clock granularity, which on linux depends on CLOCK_MONOTONIC according to the man pages. Thus, a value of 0 nanoseconds is perfectly valid and should always work, as clock granularity can never be 0.
Hope this helps someone else as well ;)
Your scenario is not really a soft lock up, it is a process is busy doing something.
How about this pseudo code:
void workerThread()
{
while(workToDo)
{
if(threadSignalled)
break;
workToDo = DoWork()
}
}
void sighandler()
{
signal worker thread to finish
waitForWorkerThreadFinished;
}
void main()
{
InstallSignalHandler;
CreateSemaphore
StartThread;
waitForWorkerThreadFinished;
}
Clearly a timing issue. Using a signalling mechanism should remove the problem.
The use of printf solves the problem because printf accesses the console which is an expensive and time consuming process which in your case gives enough time for the worker to complete its work.
If I understand correctly, when you launch a CUDA kernel asynchronously, it may begin execution immediately or it may wait for previous asynchronous calls (transfers, kernels, etc) to complete first. (I also understand that kernels can run concurrently in some cases, but I want to ignore that for now).
How can I find out the time between launching a kernel ("queuing") and when it actually begins execution. In fact, I really just want to know the average "queued time" for all launches in a single run of my program (generally in the tens or hundreds of thousands of kernel launches.)
I can easily calculate the average execution time per kernel with events (~500us). I tried to simulate - I dropped the results of CLOCK() every time a kernel is launched, with the idea that I could then determine how long the launch queue was when each kernel was launched. But CLOCK() does not have high enough precision (0.01s) - sometimes as many as 60 kernels appear to be launched at a single time, when of course in reality many are not.
Rather than clock use the QueryPerformanceTimer which counts based on machine clock cycles.
Code for QueryPerformanceTimer
Secondly, the profiling tool (Visual Profiler) only measures serial launches [see page 24] and [see post number 3].
Thus the best option is (1) use QueryPerformanceTimer (or the Visual Profiler) such that you get an accurate measurement of a single launch and (2) use QueryPerformanceTimer to get the timing of multiple launches and observe whether the timing results suggest that asynchronous launching took place.