Linux automatically restarting application on crash - Daemons - c

I have an system running embedded linux and it is critical that it runs continuously. Basically it is a process for communicating to sensors and relaying that data to database and web client.
If a crash occurs, how do I restart the application automatically?
Also, there are several threads doing polling(eg sockets & uart communications). How do I ensure none of the threads get hung up or exit unexpectedly? Is there an easy to use watchdog that is threading friendly?

You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
Which leaves only the problem of detecting a hung process. You can use any of the solutions pointed out by Michael Aaron Safyan for this, but a yet easier solution would be to use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is running) it will keep running. Once you don't, the signal will fire.
That way, no extra programs needed, and only portable POSIX stuff used.

The gist of it is:
You need to detect if the program is still running and not hung.
You need to (re)start the program if the program is not running or is hung.
There are a number of different ways to do #1, but two that come to mind are:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look a the timestamp for the file, and if it is stale, then it can assume that the appliation is dead or deadlocked.
With respect to #2, killing the previous PID and using fork+exec to launch a new process is typical. You might also consider making your application that runs "continuously", into an application that runs once, but then use "cron" or some other application to continuously rerun that single-run application.
Unfortunately, watchdog timers and getting out of deadlock are non-trivial issues. I don't know of any generic way to do it, and the few that I've seen are pretty ugly and not 100% bug-free. However, tsan can help detect potential deadlock scenarios and other threading issues with static analysis.

You could create a CRON job to check if the process is running with start-stop-daemon from time to time.

use this script for running your application
#!/bin/bash
while ! /path/to/program #This will wait for the program to exit successfully.
do
echo “restarting” # Else it will restart.
done
you can also put this script on your /etc/init.d/ in other to start as daemon

Related

Erlang spawning large amounts of C processes

I've been looking into how I could embed languages (let's use Lua as an example) in Erlang. This of course isn't a new idea and there are many libraries out there that can do this. However I was wondering if it was possible to start a Genserver with state which is modified by Lua. This means that once you start the Genserver, it will start a (long running) Lua process to manipulate the Genserver's state. I know this is possible as well, but I was wondering if I could spawn 1,000 10,000 or even 100,000 of these processes.
I'm not really familiar with this topic but I have done some research.
(Please correct me if I'm wrong on any of these options).
TLDR; Skip to the last paragraph.
First option: NIFs:
This doesn't seem like an option since it will block the Erlang Scheduler of the current process. If I want to spawn a large amount of these it will freeze the entire runtime.
Second option: Port Driver:
It's like a NIF but communicates by sending data to a specified port, which can also send data back to Erlang. This is nice although this also seems to block the scheduler. I've tried a library which does the boiler plat for you as well, but that seemed to block the scheduler after spawning 10 processes. I've also looked into the postgresql example on the Erlang Documentation which is said to be async but I couldn't get the example code to work (R13?). Is it even possible to run as many Port Driver processes without blocking the runtime?
Third option: C Nodes:
I thought this was very interesting and wanted to try it out, but apparently the project "erlang-lua" already does this. It's nice because it won't crash your Erlang VM if something goes wrong and the processes are isolated. But in order to actually spawn a single process you need to spawn an entire node. I have no idea how expensive this is. Nor am I sure what the limit is for connecting nodes in a cluster, but I don't see myself spawning 100,000 C nodes.
Fourth option: Ports:
At first I thought this was the same as a Port Driver but it's actually different. You spawn a process which executes an application and communicates through STDIN and STDOUT. This would work well for spawning a large amount of processes, and (I think?) they aren't a threat to the Erlang VM. But if I'm going to communicate through STDIN / STDOUT, why even bother with an embeddable language to begin with? Might as well use any other scripting language.
And so after much research in a field I'm not familiar with I've come to this. You could a Genserver as an "entity" where the AI is written in Lua. Which is why I'd like to have a processes for each entity. My question is how do I achieve spawning many Genservers which communicate with long running Lua processes? Is this even possible? Should I be tackling my problem differently?
If you can make the Lua code — or more accurately, its underlying native code — cooperate with the Erlang VM, you have a few choices.
Consider one of the most important functions of the Erlang VM: managing the execution of a (potentially large number of) Erlang's lightweight processes across a relatively small set of scheduler threads. It uses several techniques to know when a process has used up its timeslice or is waiting and so should be scheduled out to give another process a chance to run.
You seem to be asking how you can get native code to run however it likes within the VM, but as you've already hinted, the reason native code can cause problems for the VM is that it has no practical way to stop the native code from completely taking over a scheduler thread and thus preventing regular Erlang processes from executing. Because of this, native code has to cooperatively yield the scheduler thread back to the VM.
For older NIFs, the choices for such cooperation were:
Keep the amount of time NIF calls ran on a scheduler thread to 1ms or less.
Create one or more private threads. Transition each long-running NIF call from its scheduler thread over to a private thread for execution, then return the scheduler thread to the VM.
The problems here are that not all calls can complete in 1ms or less, and that managing private threads can be error-prone. To get around the first problem, some developers would break the work down into chunks and use an Erlang function as a wrapper to manage a series of short NIF calls, each of which completed one chunk of work. As for the second problem, well, sometimes you just can't avoid it, despite its inherent difficulty.
NIFs running on Erlang 17.3 or later can also cooperatively yield the scheduler thread using the enif_schedule_nif function. To use this feature, the native code has to be able to do its work in chunks such that each chunk can complete within the usual 1ms NIF execution window, similar to the approach mentioned earlier but without the need to artificially return to an Erlang wrapper. My bitwise example code provides many details about this.
Erlang 17 also brought an experimental feature, off by default, called dirty schedulers. This is a set of VM schedulers that do not have the same native code execution time constraints as the regular schedulers; work there can block for essentially infinite periods without disrupting normal VM operation.
Dirty schedulers come in two flavors: CPU schedulers for CPU-bound work, and I/O schedulers for I/O-bound work. In a VM compiled to enable dirty schedulers, there are by default as many dirty CPU schedulers as there are regular schedulers, and there are 10 I/O schedulers. These numbers can be altered using command-line switches, but note that to try to prevent regular scheduler starvation, you can never have more dirty CPU schedulers than regular schedulers. Applications use the same enif_schedule_nif function mentioned earlier to execute NIFs on dirty schedulers. My bitwise example code provides many details about this too. Dirty schedulers will remain an experimental feature for Erlang 18 as well.
Native code in linked-in port drivers is subject to the same on-scheduler execution time constraints as NIFs, but drivers have two features NIFs don't:
Driver code can register file descriptors into the VM polling subsystem and be notified when any of those file descriptors becomes I/O-ready.
The driver API supports access to a non-scheduler async thread pool, the size of which is configurable but by default has 10 threads.
The first feature allows native driver code to avoid blocking a thread for I/O. For example, instead of performing a blocking recv call, driver code can register the socket file descriptor so the VM can poll it and call the driver back when the file descriptor becomes readable.
The second feature provides a separate thread pool useful for driver tasks that can't conform to the scheduler thread native code execution time constraints. You can achieve the same in a NIF but you have to set up your own thread pool and write your own native code to manage and access it. But regardless of whether you use the driver async thread pool, your own NIF thread pool, or dirty schedulers, note that they are all regular operating system threads, and so trying to start a huge number of them simply isn't practical.
Native driver code does not yet have dirty scheduler access, but this work is on-going and it might become available as an experimental feature in an 18.x release.
If your Lua code can make use of one or more of these features to cooperate with the Erlang VM, then what you're attempting may be possible.

Polling a database versus triggering program from database?

I have a process wherein a program running in an application server must access a table in an Oracle database server whenever at least one row exists in this table. Each row of data relates to a client requesting some number crunching performed by the program. The program can only perform this number crunching serially (that is, for one client at a time rather than multiple clients in parallel).
Thus, the program needs to be informed of when data is available in the database for it to process. I could either
have the program poll the database, or
have the database trigger the program.
QUESTION 1: Is there any conventional wisdom why one approach might be better than the other?
QUESTION 2: I wonder if programs have any issues "running" for months at a time (would any processes in the server stop or disrupt the program from running? -- if so I don't know how I'd learn there was a problem unless from angry customers). Anyone have experience running programs on a server for a long time without issues? Or, if the server does crash, is there a way to auto-start a (i.e. C language executable) program on it after the server re-boots, thus not requiring a human to start it specifically?
Any advice appreciated.
UPDATE 1: Client is waiting for results, but a couple seconds additional delay (from polling) isn't a deal breaker.
I would like to give a more generic answer...
There is no right answer that applies every time. Some times you need a trigger, and some times is better to poll.
But… 9 out of 10 times, polling is much more efficient, safe and fast than triggering.
It's really simple. A trigger needs to instantiate a single program, of whatever nature, for every shot. That is just not efficient most of the time. Some people will argue that that is required when response time is a factor, but even then, half of the times polling is better because:
1) Resources: With triggers, and say 100 messages, you will need resources for 100 threads, with 1 thread processing a packet of 100 messages you need resources for 1 program.
2) Monitoring: A thread processing packets can report time consumed constantly on a defined packet size, clearly indicating how it is performing and when and how is performance being affected. Try that with a billion triggers jumping around…
3) Speed: Instantiating threads and allocating their resources is very expensive. And don’t get me started if you are opening a transaction for each trigger. A simple program processing a say 100 meessage packet will always be much faster that initiating 100 triggers…
3) Reaction time: With polling you can not react to things on line. So, the only exception allowed to use polling is when a user is waiting for the message to be processed. But then you need to be very careful, because if you have lots of clients doing the same thing at the same time, triggering might respond LATER, than if you where doing fast polling.
My 2cts. This has been learned the hard way ..
1) have the program poll the database, since you don't want your database to be able to start host programs (because you'd have to make sure that only "your" program can be started this way).
The classic (and most convenient IMO) way for doing this in Oracle would be through the DBMS_ALERT package.
The first program would signal an alert with a certain name, passing an optional message. A second program which registered for the alert would wait and receive it immediatly after the first program commits. A rollback of the first program would cancel the alert.
Of cause you can have many sessions signaling and waiting for alerts. However, an alert is a serialization device, so if one program signaled an alert, other programs signaling the same alert name will be blocked until the first one commits or rolls back.
Table DBMS_ALERT_INFO contains all the sessions which have registered for an alert. You can use this to check if the alert-processing is alive.
2) autostarting or background execution depends on your host platform and OS. In Windows you can use SRVANY.EXE to run any executable as a service.
I recommend using a C program to poll the database and a utility such as monit to restart the C program if there are any problems. Your C program can touch a file once in a while to indicate that it is still functioning properly, and monit can monitor the file. Monit can also check the process directly and make sure it isn't using too much memory.
For more information you could see my answer of this other question:
When a new row in database is added, an external command line program must be invoked
Alternatively, if people aren't sitting around waiting for the computation to finish, you could use a cron job to run the C program on a regular basis (e.g. every minute). Then monit would be less needed because your C program will start and stop all the time.
You might want to look into Oracle's "Change Notification":
http://docs.oracle.com/cd/E11882_01/appdev.112/e25518/adfns_cqn.htm
I don't know how well this integrates with a "regular" C program though.
It's also available through .Net and Java/JDBC
http://docs.oracle.com/cd/E11882_01/win.112/e23174/featChange.htm
http://docs.oracle.com/cd/E11882_01/java.112/e16548/dbchgnf.htm
There are simple job managers like gearman that you can use to send a job message from the database to a worker. Gearman has among others a MySQL user defined function interface, so it is probably easy to build one for oracle as well.

Any possible solution to capture process entry/exit?

I Would like to capture the process entry, exit and maintain a log for the entire system (probably a daemon process).
One approach was to read /proc file system periodically and maintain the list, as I do not see the possibility to register inotify for /proc. Also, for desktop applications, I could get the help of dbus, and whenever client registers to desktop, I can capture.
But for non-desktop applications, I don't know how to go ahead apart from reading /proc periodically.
Kindly provide suggestions.
You mentioned /proc, so I'm going to assume you've got a linux system there.
Install the acct package. The lastcomm command shows all processes executed and their run duration, which is what you're asking for. Have your program "tail" /var/log/account/pacct (you'll find its structure described in acct(5)) and voila. It's just notification on termination, though. To detect start-ups, you'll need to dig through the system process table periodically, if that's what you really need.
Maybe the safer way to move is to create a SuperProcess that acts as a parent and forks children. Everytime a child process stops the father can find it. That is just a thought in case that architecture fits your needs.
Of course, if the parent process is not doable then you must go to the kernel.
If you want to log really all process entry and exits, you'll need to hook into kernel. Which means modifying the kernel or at least writing a kernel module. The "linux security modules" will certainly allow hooking into entry, but I am not sure whether it's possible to hook into exit.
If you can live with occasional exit slipping past (if the binary is linked statically or somehow avoids your environment setting), there is a simple option by preloading a library.
Linux dynamic linker has a feature, that if environment variable LD_PRELOAD (see this question) names a shared library, it will force-load that library into the starting process. So you can create a library, that will in it's static initialization tell the daemon that a process has started and do it so that the process will find out when the process exits.
Static initialization is easiest done by creating a global object with constructor in C++. The dynamic linker will ensure the static constructor will run when the library is loaded.
It will also try to make the corresponding destructor to run when the process exits, so you could simply log the process in the constructor and destructor. But it won't work if the process dies of signal 9 (KILL) and I am not sure what other signals will do.
So instead you should have a daemon and in the constructor tell the daemon about process start and make sure it will notice when the process exits on it's own. One option that comes to mind is opening a unix-domain socket to the daemon and leave it open. Kernel will close it when the process dies and the daemon will notice. You should take some precautions to use high descriptor number for the socket, since some processes may assume the low descriptor numbers (3, 4, 5) are free and dup2 to them. And don't forget to allow more filedescriptors for the daemon and for the system in general.
Note that just polling the /proc filesystem you would probably miss the great number of processes that only live for split second. There are really many of them on unix.
Here is an outline of the solution that we came up with.
We created a program that read a configuration file of all possible applications that the system is able to monitor. This program read the configuration file and through a command line interface you was able to start or stop programs. The program itself stored a table in shared memory that marked applications as running or not. A interface that anybody could access could get the status of these programs. This program also had an alarm system that could either email/page or set off an alarm.
This solution does not require any changes to the kernel and is therefore a less painful solution.
Hope this helps.

Is a server an infinite loop running as a background process?

Is a server essentially a background process running an infinite loop listening on a port? For example:
while(1){
command = read(127.0.0.1:xxxx);
if(command){
execute(command);
}
}
When I say server, I obviously am not referring to a physical server (computer). I am referring to a MySQL server, or Apache, etc.
Full disclosure - I haven't had time to poke through any source code. Actual code examples would be great!
That's more or less what server software generally does.
Usually it gets more complicated because the infinite loop "only" accepts the connection and each connection can often handle multiple "commands" (or whatever they are called in the used protocol), but the basic idea is roughly this.
There are three kinds of 'servers' - forking, threading and single threaded (non-blocking). All of them generally loop the way you show, the difference is what happens when there is something to be serviced.
A forking service is just that. For every request, fork() is invoked creating a new child process that handles the request, then exits (or remains alive, to handle subsequent requests, depending on the design).
A threading service is like a forking service, but instead of a whole new process, a new thread is created to serve the request. Like forks, sometimes threads stay around to handle subsequent requests. The difference in performance and footprint is simply the difference of threads vs forks. Depending on the memory usage that is not servicing a client (and prone to changing), its usually better to not clone the entire address space. The only added complexity here is synchronization.
A single process (aka single threaded) server will fork only once to daemonize. It will not spawn new threads, it will not spawn child processes. It will continue to poll() the socket to find out when the file descriptor is ready to receive data, or has data available to be processed. Data for each connection is kept in its own structure, identified by various states (writing, waiting for ACK, reading, closing, etc). This can be an extremely efficient design, if done properly. Instead of having multiple children or threads blocking while waiting to do work, you have a single process and event loop servicing requests as they are ready.
There are instances where single threaded services spawn multiple threads, however the additional threads aren't working on servicing incoming requests, one might (for instance) set up a local socket in a thread that allows an administrator to obtain a status of all connections.
A little googling for non blocking http server will yield some interesting hand rolled web servers written as code golf challenges.
In short, the difference is what happens once the endless loop is entered, not just the endless loop :)
In a matter of speaking, yes. A server is simply something that "loops forever" and serves. However, typically you'll find that "daemons" do things like open STDOUT and STDERR onto file handles or /dev/null along with double forks among other things. Your code is a very simplistic "server" in a sense.

Cleanest way to stop a process on Win32?

While implementing an applicative server and its client-side libraries in C++, I am having trouble finding a clean and reliable way to stop client processes on server shutdown on Windows.
Assuming the server and its clients run under the same user, the requirements are:
the solution should work in the following cases:
clients may each feature either a console or a gui.
user may be unprivileged.
clients may be or become unresponsive (infinite loop, deadlock).
clients may or may not be children of the server (direct or indirect).
unless prevented by a client-side defect, clients shall be allowed the opportunity to exit cleanly (free their ressources, sync some data to disk...) and some reasonable time to do so.
all client return codes shall be made available (if possible) to the server during the shutdown procedure.
server shall wait until all clients are gone.
As of this edit, the majority of the answers below advocate the use of a shared memory (or another IPC mechanism) between the server and its clients to convey shutdown orders and client status. These solutions would work, but require that clients successfully initialize the library.
What I did not say, is that the server is also used to start the clients and in some cases other programs/scripts which don't use the client library at all. A solution that did not rely on a graceful communication between server and clients would be nicer (if possible).
Some time ago, I stumbled upon a C snippet (in the MSDN I believe) that did the following:
start a thread via CreateRemoteThread in the process to shutdown.
had that thread directly call ExitProcess.
Unfortunately now that I'm looking for it, I'm unable to find it and the search results seem to imply that this trick does not work anymore on Vista. Any expert input on this ?
If you use thread, a simple solution is to use a named system event, the thread sleeps on the event waiting for it to be signaled, the control application can signal the event when it wants the client applications to quit.
For the UI application it (the thread) can post a message to the main window, WM_ CLOSE or QUIT I forget which, in the console application it can issue a CTRL-C or if the main console code loops it can check some exit condition set by the thread.
Either way rather than finding the client applications an telling them to quit, use the OS to signal they should quit. The sleeping thread will use virtually no CPU footprint provided it uses WaitForSingleObject to sleep on.
You want some sort of IPC between clients and servers. If all clients were children, I think pipes would have been easiest; since they're not, I guess a server-operated shared-memory segment can be used to register clients, issue the shutdown command, and collect return codes posted there by clients successfully shutting down.
In this shared-memory area, clients put their process IDs, so that the server can forcefully kill any unresponsive clients (modulo server privileges), using TerminateProcess().
If you are willing to go the IPC route, make the normal communication between client and server bi-directional to let the server ask the clients to shut down. Or, failing that, have the clients poll. Or as the last resort, the clients should be instructed to exit when the make a request to server. You can let the library user register an exit callback, but the best way I know of is to simply call "exit" in the client library when the client is told to shut down. If the client gets stuck in shutdown code, the server needs to be able to work around it by ignoring that client's data structures and connection.
Use PostMessage or a named event.
Re: PostMessage -- applications other than GUIs, as well as threads other than the GUI thread, can have message loops and it's very useful for stuff like this. (In fact COM uses message loops under the hood.) I've done it before with ATL but am a little rusty with that.
If you want to be robust to malicious attacks from "bad" processes, include a private key shared by client/server as one of the parameters in the message.
The named event approach is probably simpler; use CreateEvent with a name that is a secret shared by the client/server, and have the appropriate app check the status of the event (e.g. WaitForSingleObject with a timeout of 0) within its main loop to determine whether to shut down.
That's a very general question, and there are some inconsistencies.
While it is a not 100% rule, most console applications run to completion, whereas GUI applications run until the user terminates them (And services run until stopped via the SCM). Hence, it's easier to request a GUI to close. You send them the equivalent of Alt-F4. But for a console program, you have to send them the equivalent of Ctrl-C and hope they handle it. In both cases, you simply wait. If the process sticks around, you then shoot it down (TerminateProcess) and pray that the damage is limited. But your HDD can fill up with temporary files.
GUI application in general do not have exit codes - where would they go? And a console process that is forcefully terminated by definition does not exit, so it has no exit code. So, in a server shutdown scenario, don't expect exit codes.
If you've got a debugger attached, you generally can't shutdown the process from another application. That would make it impossible for debuggers to debug exit code!

Resources