How can I detect hung processes in Linux using C? [duplicate] - c

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Linux API to list running processes?
How can I detect hung processes in Linux using C?

Under Linux, the way to do this is by examining the contents of /proc/[PID]/*. A good one-stop location is /proc/[PID]/status. Its first two lines are:
Name: [program name]
State: R (running)
Of course, detecting hung processes is an entirely separate issue.
/proc/[PID]/stat is a more machine-readable format of the same info as /proc/[PID]/status, and is, in fact, what the ps(1) command reads to produce its output.
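As a rough illustration of reading that file from C, here is a minimal sketch. The function name read_proc_state is made up for this example, and it assumes a Linux /proc mounted in the usual place. The only subtle part is that the second field, the command name in parentheses, can itself contain spaces or parentheses, so the sketch searches for the last ')' rather than splitting on whitespace.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: store the one-letter state of `pid`
 * (R, S, D, Z, T, ...) in *state.
 * Returns 0 on success, -1 if the process is gone or the line is malformed. */
int read_proc_state(pid_t pid, char *state)
{
    char path[64], line[1024];
    FILE *fp;
    char *p;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Field 2 is "(comm)" and may contain spaces or ')',
     * so locate the LAST ')' and take the character after the space. */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return -1;
    *state = p[2];
    return 0;
}
```

A state of D or Z that persists across several samples is a hint that something is wrong, but on its own the state letter does not prove a process is hung.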

Monitoring and/or killing a process is just a matter of system calls. I'd think the toughest part of your question would really be reliably determining that a process is "hung", rather than merely very busy (or waiting for a temporary condition).
In the general case, I'd think this would be rather difficult. Even Windows asks for a decision from the user when it thinks a program might be "hung" (on my system it is often wrong about that, too).
However, if you have a specific program that likes to hang in a specific way, I'd think you ought to be able to reliably detect that.

Seeing as the question has changed:
http://procps.sourceforge.net/
is the source of ps and other process tools. They do indeed use /proc (indicating it is probably the conventional and best way to read process information). Their source is quite readable. The file to look at is
/procps-3.2.8/proc/readproc.c
You can also link your program to libproc, which should be available in your repo (or already installed, I would say), but you will need the "-dev" variation for the headers and what-not. Using this API you can read process information and status.
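You can use the psState() function through libproc to check for states like the following (note: these constants match the libproc.h found on Solaris-derived systems, so check whether the libproc headers on your Linux system actually provide them):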
#define PS_RUN 1 /* process is running */
#define PS_STOP 2 /* process is stopped */
#define PS_LOST 3 /* process is lost to control (EAGAIN) */
#define PS_UNDEAD 4 /* process is terminated (zombie) */
#define PS_DEAD 5 /* process is terminated (core file) */
#define PS_IDLE 6 /* process has not been run */
In response to comment
IIRC, unless your program is on the CPU and you can prod it from within the kernel with signals, you can't really tell how responsive it is. Even then, after the trap a signal handler is called, and that handler may run fine even while the rest of the program is stuck.
Best bet is to schedule another process on another core that can poke the process in some way while it is running (or in a loop, or non-responsive). But I could be wrong here, and it would be tricky.
Good Luck

You may be able to use whatever mechanism strace uses (on Linux, ptrace(2)) to determine what system calls the process is making. Then, you could determine what system calls you end up in for things like pthread_mutex deadlocks, or whatever. You could then use a heuristic approach and just decide that if a process is hung on a lock system call for more than 30 seconds, it's deadlocked.

You can run 'strace -p [PID]' on a process to determine what (if any) system calls it is making. If a process is not making any system calls but is using CPU time, then it is either hung or running in a tight calculation loop inside userspace. You'd really need to know the expected behaviour of the individual program to know for sure. If it is not making system calls and is not using CPU, it could also just be idle or deadlocked.
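The "is it using CPU time?" half of that check can be approximated from C without strace. The sketch below is only an illustration: read_cpu_ticks is a made-up name, and it assumes the field layout documented in proc(5), where utime and stime are fields 14 and 15 of /proc/[PID]/stat.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: return utime + stime (in clock ticks) for `pid`,
 * or -1 if the process does not exist or the line cannot be parsed. */
long read_cpu_ticks(pid_t pid)
{
    char path[64], line[1024];
    unsigned long utime, stime;
    FILE *fp;
    char *p;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Skip "pid (comm)" by finding the last ')'. What follows is:
     * state, five signed fields (ppid .. tpgid), five unsigned fields
     * (flags .. cmajflt), then utime and stime. */
    p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &utime, &stime) != 2)
        return -1;
    return (long)(utime + stime);
}
```

Sampling this twice, a few seconds apart, tells you whether the process consumed CPU in between; dividing by sysconf(_SC_CLK_TCK) converts ticks to seconds. Whether "no CPU used" means idle or deadlocked still depends on what the program is supposed to be doing.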
The only bulletproof way to do this is to modify the program being monitored to either send a 'ping' every so often to a 'watchdog' process, or to respond to a ping request when asked, e.g. a socket connection where you can ask it "Are you Alive?" and get back "Yes". The program can be coded in such a way that it is unlikely to do the ping if it has gone off into the weeds somewhere and is not executing properly. I'm pretty sure this is how Windows knows a process is hung, because every Windows program has some sort of event queue where it processes a known set of APIs from the operating system.
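A minimal sketch of the ping idea, assuming the watchdog and the monitored program share a pipe (for instance, one created before fork()). The names send_heartbeat and wait_heartbeat are invented for this example; a socket would work the same way.

```c
#include <poll.h>
#include <unistd.h>

/* Monitored side: send one heartbeat byte. Call this from the main loop,
 * at a point that is only reached when real work is progressing. */
int send_heartbeat(int fd)
{
    char beat = '.';
    return write(fd, &beat, 1) == 1 ? 0 : -1;
}

/* Watchdog side: wait up to timeout_ms for a heartbeat.
 * Returns 1 if one arrived, 0 on timeout (the peer may be hung),
 * -1 on error or if the writer has exited and closed the pipe. */
int wait_heartbeat(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    char beat;
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;              /* 0: timed out, -1: poll() failed */
    if (read(fd, &beat, 1) != 1)
        return -1;              /* EOF or read error: writer is gone */
    return 1;
}
```

The watchdog would call wait_heartbeat in a loop and treat a return of 0 as "possibly hung". What to do next (log, kill, restart) is a policy decision this sketch leaves out.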
Not necessarily a programmatic way, but one way to tell if a program is 'hung' is to break into it with gdb and pull a backtrace and see if it is stuck somewhere.

Related

How to get the list of all pthread Ids from main thread [duplicate]

I have a multi-threaded application in a POSIX/Linux environment - I have no control over the code that creates the pthreads. At some point the process - owner of the pthreads - receives a signal.
The handler of that signal should abort, cancel or stop all the pthreads and log how many pthreads were running.
My problem is that I could not find how to list all the pthreads running in process.
There doesn't seem to be any portable way to enumerate the threads in a process.
Linux has pthread_kill_other_threads_np, which looks like a leftover from the original purely-userland pthreads implementation that may or may not work as documented today. It doesn't tell you how many threads there were.
You can get a lot of information about your process by looking in /proc/self (or, for other processes, /proc/123). Although many unices have a file or directory with that name, the layout is completely different, so any code using /proc will be Linux-specific. The documentation of /proc is in Documentation/filesystems/proc.txt in the kernel source. In particular, /proc/self/task has a subdirectory for each thread. The name of the subdirectory is the LWP id; unfortunately, there doesn't seem to be a way to associate LWP ids with pthread ids (but you can get your own thread id with gettid(2) if you work for it). Of course, reading /proc/self/task is not atomic; the number of threads is available atomically through /proc/self/status (but of course it might change before you act on it).
If you can't achieve what you want with the limited support you get from Linux pthreads, another tactic is to play dynamic linking tricks to provide your own version of pthread_create that logs to a data structure you can inspect afterwards.
You could wrap ps -eLF (or another command that more closely reads just the process you're interested in) and read the NLWP column to find out how many threads are running.
Given that the threads are in your process, they should be under your control. You can record all of them in a data structure and keep track.
However, doing this won't be race-condition free unless it's appropriately managed (or you only ever create and join threads from one thread).
Any threads created by libraries you use are their business and you should not be messing with them directly, or the library may break.
If you are planning to exit the process of course, you can just leave the threads running anyway, as calling exit() terminates them all.
Remember that a robust application should be crash-safe anyway, so you should not depend upon shutdown behaviour to avoid data loss etc.

How to avoid the binary to be launched more than one time in linux? [duplicate]

This question already has answers here:
How to create a single instance application in C or C++
(15 answers)
Closed 9 years ago.
I have a binary and it's a daemon and it's developed in C. I want to add a check at the beginning of my program to guarantee that the binary is launched only one time. My binary runs on Linux.
Any suggestions?
A common method is to put a PID file in /var/run. After your daemon starts successfully, you flock the file and write the daemon's PID to it. At startup, you check the value of the PID in this file, if it exists. If there is no process currently running with that PID, it's safe for the application to start up. If the PID exists, perform a check to see if that PID is an instance of your executable. If it's not, it is also safe to start up. You should delete the file on exit, but it's not strictly necessary.
The best way to do this, in my opinion, is not to do it. Let your initialization scheme serialize instances of the daemon: systemd, runit, supervise, upstart, launchd, and so on can make sure there are no double invocations.
If you need to invoke your daemon "by hand," try the linux utility flock(1) or a 3rd-party utility like setlock. Both of these will run the daemon under the protection of a (perhaps inherited) lockfile which remains locked for the life of the program.
If you insist upon adding this functionality to the daemon itself (which, in my opinion, is complication that most daemons don't need), choose a lockfile and keep it exclusively flock(2)d. Unlike most pidfile/process table approaches, this approach is not race-prone. Unlike POSIX system semaphores, this mechanism will correctly handle the case of a crashed daemon (the lock vanishes when the process does).
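If you do go the flock(2) route inside the daemon, a sketch might look like this. The function name acquire_instance_lock and the idea of returning the descriptor are choices made for this example, not anything prescribed above.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: try to become the single running instance.
 * Returns a descriptor that holds the lock (keep it open for the life of
 * the daemon), or -1 if the file can't be opened or is already locked. */
int acquire_instance_lock(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    /* LOCK_NB makes flock() fail at once instead of waiting
     * for the other instance to exit. */
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The kernel drops the lock when the last descriptor referring to it is closed, which includes the case where the daemon crashes, so no stale-lock cleanup is needed. The file itself can stay on disk between runs.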
There may be other easy serializations, too. If your daemon binds to a socket, you know that EADDRINUSE probably means that another instance is running...
Fork and execute this:
pidof nameOfProgram
If it returns a value, you know your program is running!
The other classic method is to have a lock file - the program creates a file, but only if that file does not already exist. If the file does exist, it presumes there's another copy of the program running. Because the program could crash after creating the file, smarter versions of this have ways to detect that situation.
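The "create only if it does not exist" step maps onto open(2) with O_CREAT | O_EXCL, which the kernel performs atomically. Here is a small sketch; create_lock_file is an invented name, and writing the PID into the file is one assumed way of giving a later run something to check for staleness.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: create the lock file only if it does not exist yet
 * and record our PID in it. Returns an open descriptor on success, or -1;
 * errno == EEXIST means the file was already there. */
int create_lock_file(const char *path)
{
    char buf[32];
    int len;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd < 0)
        return -1;
    len = snprintf(buf, sizeof buf, "%ld\n", (long)getpid());
    if (write(fd, buf, (size_t)len) != len) {
        close(fd);
        unlink(path);       /* don't leave a half-written lock behind */
        return -1;
    }
    return fd;
}
```

On EEXIST, a smarter version could read the stored PID and probe it with kill(pid, 0) to guess whether the owner is still alive, keeping in mind that PIDs get reused, so that check is only a heuristic.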

Uninterruptable process in Windows(or Linux)?

Is there any way to make a program that cannot be interrupted (an uninterruptible program)? By that, I mean a process that can't be terminated by any signal, kill command, or any other key combination on any system: Linux, Windows, etc.
First, I am interested to know whether it's possible or not. And if yes, to what extent is it possible?
I mostly write code in C, C++, and Python, but I don't know of any such command(s) available in these programming languages.
Is it possible with assembly language, and how? Or in a high-level language like C with embedded assembly code (inline assembly)?
I know some signals are catchable and some are not, like SIGKILL and SIGSTOP.
I remember, when I used to work on Windows XP, some viruses couldn't be terminated even from Task Manager. So I guess some solution is possible in low-level languages, maybe by overriding the Interrupt Vector Table.
Can we write an uninterruptible program using TSRs (hooking)? Because a TSR can only be removed when the computer is rebooted or if the TSR is explicitly removed from memory. Am I correct?
I couldn't find anything on Google.
Well, one can write a program that doesn't respond to most signals, like SIGQUIT, SIGHUP, etc. Each kind of "kill" is actually a signal sent to the program by the kernel, and some signals tell the kernel that the program is stuck and should be killed.
Actually, the only unkillable program is the kernel itself; even init (PID 1) can be "killed" with HUP (which means reload).
Learn more about signal handling, starting with the kill -l (list signals) command.
Regarding Windows (based on the "antivirus" tag), and this applies to Linux too: if you just need to run some antivirus that the user is unable to skip or close, it's a permission problem. A program started by the system can't be closed by a non-administrative user who has no permission to kill it. I guess lameusers on Windows all over the world would start "solving" any problems they have by trying to close the antivirus first, if that were possible :)
On Linux, it is possible to avoid being killed in one of two ways:
1. Become init (PID 1). init ignores all signals that it does not catch, even normally unblockable ones like SIGSTOP and SIGKILL.
2. Trigger a kernel bug, and get your program stuck in D (uninterruptible wait) state.
For 2., one common way to end up in D state is to attempt to access some hardware that is not responding. Particularly on older versions of Linux, the process would become stuck in kernel mode, and not respond to any signals until the kernel gave up on the hardware (which can take quite some time!). Of course, your program can't do anything else while it's stuck like this, so it's more annoying than useful, and newer versions of Linux are starting to rectify this problem by dividing D state into a killable state (where SIGKILL works) and an unkillable state (where all signals are blocked).
Or, of course, you could simply load your code as a kernel module. Kernel modules can't be 'killed', only unloaded - and only if they allow themselves to be unloaded.
You can catch pretty much any signal or input and stay alive through it, the main exceptions being SIGKILL and SIGSTOP. It is possible to prevent SIGKILL from killing you, but you'd have to replace init (and reboot to become the new init). PID 1 is special on most Unixes, in that it's the only thing that can't be KILL'd.

C functions invoked as threads - Linux userland program

I'm writing a linux daemon in C which gets values from an ADC by SPI interface (ioctl). The SPI (spidev - userland) seems to be a bit unstable and freezes the daemon at random times.
I need to have some better control of the calls to the functions getting the values, and I was thinking of making it as a thread which I could wait for to finish and get the return value and if it times out assume that it froze and kill it without this new thread taking down the daemon itself. Then I could apply measures like resetting the ADC before restarting. Is this possible?
Pseudo example of what I want to achieve:
(function int get_adc_value(int adc_channel, float *value) )
pid = thread( get_adc_value(1, &value) ); //makes thread calling the function
wait_until_finish(pid, timeout); //waits until function finishes/times out
if(timeout) kill pid, start over //if thread does not return in given time, kill it (it is frozen)
else if return value sane, continue //if successful, handle return variable value and continue
Thanks for any input on the matter, examples highly appreciated!
I would try looking at using the pthreads library. I have used it for some of my c projects with good success and it gives you pretty good control over what is running and when.
A pretty good tutorial can be found here:
http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
glib also has a way to check on threads, using GCond (look for it in the glib help).
In summary, you should periodically signal a GCond in the child thread and wait on it in the main thread with g_cond_timed_wait. It's the same with glib or pthreads.
Here is an example with the pthread:
http://koders.com/c/fidA03D565734AE2AD9F5B42AFC740B9C17D75A33E3.aspx?s=%22pthread_cond_timedwait%22#L46
I'd recommend a different approach.
Write a program that takes samples and writes them to standard output. It simply need have alarm(TIMEOUT); before every sample collection, and should it hang the program will exit automatically.
Write another program that runs that first program. If it exits, it runs it again. It looks something like this:
main(){for(;;){system("sampler");sleep(1);}}
Then in your other program, use FILE*fp=popen("supervise_sampler","r"); and read the samples from fp. Better still: Have the program simply read the samples from stdin and insist users start your program like this:
(while true;do sampler;sleep 1; done)|program
Splitting up the task like this makes it easier to develop and easier to test, for example, you can collect samples and save them to a file and then run your program on that file:
sampler > data
program < data
Then, as you make changes to program, you can simply run it again on the same data over and over again.
It's also trivial to enable data logging- so should you find a serious issue you can run all your data through your program again to find the bugs.
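The alarm(TIMEOUT) trick in the sampler could look roughly like this. SAMPLE_TIMEOUT_SECONDS, read_sample and sample_once are all made up for the sketch, and read_sample is a stub standing in for the real SPI/ioctl read.

```c
#include <stdio.h>
#include <unistd.h>

#define SAMPLE_TIMEOUT_SECONDS 5    /* assumed value, tune for your ADC */

/* Hypothetical stand-in for the real, possibly blocking, ADC read. */
static int read_sample(double *out)
{
    *out = 0.0;
    return 0;
}

/* Take one sample under a timeout and print it to `out`.
 * If read_sample() blocks for longer than the timeout, SIGALRM arrives and,
 * with no handler installed, its default action terminates the process.
 * The supervising loop is then expected to start a fresh sampler. */
int sample_once(FILE *out)
{
    double value;
    int rc;

    alarm(SAMPLE_TIMEOUT_SECONDS);  /* arm (or re-arm) the timer */
    rc = read_sample(&value);
    alarm(0);                       /* sample done: disarm */
    if (rc == 0) {
        fprintf(out, "%f\n", value);
        fflush(out);                /* keep the pipe reader up to date */
    }
    return rc;
}
```

One caveat: if the driver leaves the process in uninterruptible sleep, the signal is not acted on until the call returns, so the alarm only helps with hangs the kernel considers interruptible.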
Something very interesting can happen to a thread when it executes an ioctl(): if the driver blocks inside the kernel, the thread goes into a special kind of sleep known as disk sleep (state D), where it cannot be interrupted or killed until the call returns. This is by design and prevents the kernel from rotting from the inside out.
If your daemon is getting stuck in ioctl(), it's conceivable that it may stay that way forever (at least until the ADC is reset).
I'd advise dropping something, like a file with a timestamp prior to calling ioctl() on a known buggy interface. If your thread does not unlink that file in xx amount of seconds, something else needs to re-start the ADC.
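A sketch of that marker-file idea follows. The three helper names are invented, and the marker path and the "xx seconds" threshold are left to the caller.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Worker side: create the marker right before the risky ioctl(). */
int drop_marker(const char *path)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    fprintf(fp, "%ld\n", (long)time(NULL));
    return fclose(fp) == 0 ? 0 : -1;
}

/* Worker side: remove the marker as soon as the call has returned. */
int clear_marker(const char *path)
{
    return unlink(path);
}

/* Supervisor side: how long has the marker been there?
 * Returns its age in seconds, or -1 if there is no marker,
 * meaning no call is currently in flight. */
long marker_age(const char *path)
{
    struct stat st;
    time_t now = time(NULL);

    if (stat(path, &st) != 0)
        return -1;
    return now > st.st_mtime ? (long)(now - st.st_mtime) : 0;
}
```

The supervisor (a separate process, or at least a separate thread) would poll marker_age and reset the ADC once the age exceeds its threshold.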
I also agree with the use of pthreads, if you need example code, just update your question.

Resources