I have a C program which has multiple worker threads. There is a main thread which periodically (every 0.2s) does some basic checks (e.g. has a thread finished, has a signal been received, etc.). At each check, I would like to write any data that any of the threads may have in their log buffers to a single log file.
My initial idea was to simply open the log file, write the data from all the threads, and then close it again. I am worried that this might be too much overhead, seeing as these checks occur every 0.2s.
So my question is - is this scenario inefficient?
If so, can anyone suggest a better solution?
I thought of leaving the file descriptor open and just writing new data on every check, but then there is the problem that, if the physical file somehow gets deleted, the program would never know (without re-checking, in which case we might as well just reopen the file) and logging data would be lost.
(This program is designed to run for very long periods of time, so the log file being deleted at some point is basically guaranteed due to log rotation.)
The standard solution on UNIX is to add a signal handler for SIGHUP which closes and re-opens the log file. Many UNIX daemons do this for precisely this purpose, to support log rotation. Call kill -HUP <pid> in your log rotation script and you're good to go.
(Some programs will also treat SIGHUP as a cue to re-read their configuration files, so you can make configuration changes on the fly without having to restart processes.)
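A minimal sketch of that pattern, assuming a single global log stream (the name logfp and the path are made up): the handler only sets a flag, because fopen() is not async-signal-safe, and the main 0.2s check loop does the actual close/reopen.
#include <signal.h>
#include <stdio.h>
#include <string.h>

static volatile sig_atomic_t reopen_log = 0;   /* set by the SIGHUP handler */
static FILE *logfp;                            /* hypothetical global log stream */

static void on_sighup(int sig)
{
    (void)sig;
    reopen_log = 1;                            /* defer the real work to the main loop */
}

/* Called from the periodic check loop before writing the thread buffers. */
static void maybe_reopen_log(void)
{
    if (reopen_log) {
        reopen_log = 0;
        if (logfp)
            fclose(logfp);
        logfp = fopen("/var/log/myapp.log", "a");   /* path is an assumption */
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, NULL);

    logfp = fopen("/var/log/myapp.log", "a");
    /* ... every 0.2s: maybe_reopen_log(), then flush the thread buffers to logfp ... */
    return 0;
}
The log rotation script renames the file and then sends kill -HUP <pid>; on the next check the program starts writing to a fresh file.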
Currently, there isn't much of a good solution. I would suggest writing a timer that runs separately from your main 0.2s check, checks the log buffers, and writes them to disk.
I am working on something network-based that could solve this (I have had the same problem) with excellent performance; send me a message on GitHub for details.
Related
I am running a process and I realized it will take longer than I thought to finish. It has been running for quite some time now and I would like to end the process without losing the data it has generated. The program writes its output to a text file using C. How can I close the file mid-process without losing data?
I ended up canceling the computation and taking Steve Summit's advice, which was to catch SIGINT and SIGTERM as well as to call fflush() periodically. I was only able to recover some data from that run, but at least now I can recover all the data that has been processed.
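A rough sketch of that approach, with made-up file and loop details: the handlers just set a flag, and the computation loop flushes periodically and exits cleanly when the flag is seen.
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig)
{
    (void)sig;
    stop_requested = 1;                         /* ask the main loop to wind down */
}

int main(void)
{
    signal(SIGINT, on_signal);
    signal(SIGTERM, on_signal);

    FILE *out = fopen("results.txt", "w");      /* hypothetical output file */
    if (!out)
        return 1;

    for (long i = 0; i < 1000000 && !stop_requested; i++) {
        fprintf(out, "result %ld\n", i);        /* stand-in for the real computation */
        if (i % 1000 == 0)
            fflush(out);                        /* periodic flush bounds the data loss */
    }

    fclose(out);                                /* flushes whatever is still buffered */
    return 0;
}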
I have an executable file which performs data acquisition from an interfaced FPGA and stores it in a specific format. Once in a while, at random, the acquisition code stops, citing a receive error. However, re-running the executable works.
Hence one temporary arrangement is to run the executable in a shell script. The corresponding process needs to be monitored. If the acquisition stops (and the process ends), the script should re-run the executable.
Any hints on how to go about it?
By your description, it sounds like you simply want an endless loop which calls the collector again and again.
while true; do
collector
done >output
Redirecting the output outside of the loop is more efficient (you only open the file for writing once) as well as simpler (you don't have to figure out within the loop whether to overwrite or append). If your collector doesn't produce data on standard output, then of course, this detail is moot.
Grep for the executable's process in the shell script wrapper. If the PID is not found, restart the executable. You would then schedule the shell wrapper as a cron job.
I have a listening socket on a TCP port. The process itself is using setrlimit(RLIMIT_NOFILE,&...); to configure how many sockets are allowed for the process.
For tests RLIMIT_NOFILE is set to 20 and of course for production it will be set to a sanely bigger number. 20 is good for easily reaching the limit in a test environment.
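For reference, lowering the soft limit for such a test could look roughly like this (the value 20 matches the test setup described above; the hard limit is left untouched):
#include <stdio.h>
#include <sys/resource.h>

int set_fd_limit(rlim_t n)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    rl.rlim_cur = n;                     /* soft limit; must not exceed rl.rlim_max */
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}

/* e.g. set_fd_limit(20); early in main() for the test environment */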
The server itself has no issues like a descriptor leak or similar, but trying to solve the problem by increasing RLIMIT_NOFILE obviously will not do, because in real life there is no guarantee that the limit will not be reached, no matter how high it is set.
The problem is that, after reaching the limit, accept fails with "Too many open files" (EMFILE), and unless a file or socket is closed, the event loop starts spinning without delay, eating 100% of one core. Even if the client closes the connection (e.g. because of a timeout), the server will loop until a file descriptor is available to process and close the already-dead incoming connection. EDIT: On the other hand, the client stalls and there is no good way for it to know that the server is overloaded.
My question: is there some standard way to handle this situation by closing the incoming connection after accept returns "Too many open files"?
Several dirty approaches come to mind:
To close and reopen the listening socket in the hope that all pending connections will be closed (this is quite dirty because in a threaded server some other thread may get the freed file descriptor)
To track the open file descriptor count (this cannot be done properly with external libraries that have some untracked file descriptors of their own)
To check if the file descriptor number is near the limit and start closing incoming connections before the situation happens (this is rather implementation-specific, and while it will work on Linux, there is no guarantee that other OSes will handle file descriptors in the same way)
EDIT: One more dirty and ugly approach:
To keep one spare fd (e.g. dup(STDIN_FILENO) or open("/dev/null",...)) that will be used in case accept fails. The sequence will be:
// ... accept() failed with "Too many open files"
// stop the other threads from accepting
close(sparefd);                  // free the reserved descriptor
newconnection = accept(...);     // now succeeds using the freed descriptor
close(newconnection);            // immediately drop the pending connection
sparefd = open("/dev/null",...); // re-reserve the spare descriptor
// release the threads
The main drawback of this approach is the thread synchronization needed to prevent other threads from grabbing the just-freed spare fd.
You shouldn't use setrlimit to control how many simultaneous connections your process can handle. Your tiny little bit of socket code is saying to the whole rest of the application, "I only want to have N connections open at a time, and this is the only way I know how to do it, so... nothing else in the process can have any files!". What would happen if everybody did that?
The proper way to do what you want is easy -- keep track of how many connections you have open, and just don't call accept until you can handle another one.
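A sketch of that idea (nconn, MAX_CONN and listenfd are hypothetical names): the counter is incremented on accept and decremented wherever a connection is closed, and accept is simply not called while the counter is at the cap.
/* inside the event loop */
if (nconn < MAX_CONN) {
    int c = accept(listenfd, NULL, NULL);
    if (c >= 0)
        nconn++;        /* decrement again in the code path that closes the connection */
} /* else: leave the request in the listen backlog until a slot frees up */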
I understand that your code is in a library. The library encounters a resource limit event. I would distinguish, generally, between events which are catastrophic (memory exhaustion, can't open listening socket) and those which are probably temporary. Catastrophic events are hard to handle: without memory, even logging or an orderly shutdown may be impossible.
Too many open files, by contrast, is a condition which is probably temporary, not least because we are the resource hog. Temporary error conditions are luckily trivial to handle: by waiting. This is what your code currently does not do: you should wait for a spell after accept returns "Too many open files" before you call accept again. That will solve the 100% CPU load problem. (I assume that our server performs some work on each connection which at some point finishes, so that the file descriptors of the client connections which our library holds are eventually closed.)
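A minimal illustration of that pause (the 100 ms figure is arbitrary and should be tuned to the application):
#include <errno.h>
#include <time.h>
#include <sys/socket.h>

/* accept() wrapper that backs off on descriptor exhaustion instead of spinning. */
int accept_with_backoff(int listenfd)
{
    int c = accept(listenfd, NULL, NULL);
    if (c < 0 && (errno == EMFILE || errno == ENFILE)) {
        struct timespec ts = { 0, 100 * 1000 * 1000 };   /* 100 ms */
        nanosleep(&ts, NULL);          /* wait instead of eating 100% of a core */
    }
    return c;                          /* the caller still sees the error and errno */
}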
There remains the problem that the library cannot know the requirements of the user code. (How long should the pause between accepts be?[1] Is it acceptable (sic) to let connection requests wait at all? Do we give up at some point?) It is imperative to report errors back to the user code, so that the user code has a chance to see and fix the error.
If the user code gets the file descriptor back, that's easy: Return accept's error code (and make sure to document that possibility). So I assume that the user code never sees gritty details like file descriptors but instead gets some data, for example. It may even be that the library performs just side effects, possibly concurrently, so that user code never sees any return value which would be usable to communicate errors. Then the library must provide some other way to signal the error condition to the user code. This may impose restrictions on how the user code can use the library: Perhaps before or after certain function calls, or simply periodically, an error status must be actively checked.
[1] By the way, it is not clear to me, even after reading the accept man page, whether the client's connect fails (because the connection request has been de-queued on the server side but cannot be handled), or whether the request simply stays in the queue, so that the client is oblivious to the server's problems apart from a delay.
Notice that multiplexing syscalls such as poll(2) can wait (without busy-spin looping) on accepting sockets (and also on connected sockets, or any other kind of stream file descriptor).
So just have your event loop handle them (probably with other readable & writable file descriptors). And don't call accept(2) when you don't want to.
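A sketch of that, with hypothetical names (pfd, have_capacity, handle_new_connection, MAX_CLIENTS): while the process is at its descriptor limit, the listening socket is simply left out of the poll set, so poll(2) blocks instead of spinning.
/* needs <poll.h> and <sys/socket.h>; one pass through the event loop */
struct pollfd pfd[1 + MAX_CLIENTS];
pfd[0].fd = have_capacity ? listenfd : -1;       /* poll(2) ignores negative fds */
pfd[0].events = POLLIN;
/* ... fill in the connected sockets at pfd[1..] ... */

int n = poll(pfd, nfds, -1);                     /* blocks until something is ready */
if (n > 0 && (pfd[0].revents & POLLIN)) {
    int c = accept(listenfd, NULL, NULL);
    if (c >= 0)
        handle_new_connection(c);
}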
A program Foo periodically updates a file and calls my C program Bar to process the file.
The issue is that Foo might update the file, call Bar to process it, and, while Bar is still reading the file, update it again.
Is it possible for Bar to read the file in an inconsistent state, e.g. the first half of the file as written by the first Foo update and the other half as written by the second? If so, how would I prevent that, assuming I can modify only Bar's code?
Typically, Foo should not simply rewrite the contents of the file again and again, but create a new temporary file, and replace the old file with the temporary file when it is done (using rename(), which is atomic). In this case, simply opening the file (at any point in time) will give the reader a consistent snapshot of the contents, because of how typical POSIX filesystems work. (After opening the file, the file descriptor will refer to the same inode/contents, even if the file gets deleted or replaced; the disk space will be released only after the last open file descriptor of a deleted/replaced file is closed.)
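A sketch of that write-to-temporary-then-rename pattern on the writer's side (file names are made up, error handling trimmed):
#include <stdio.h>

/* Write the new contents to a temporary file, then atomically replace data.txt. */
int publish_file(const char *contents)
{
    FILE *f = fopen("data.txt.tmp", "w");
    if (!f)
        return -1;
    fputs(contents, f);
    fflush(f);
    /* fsync(fileno(f)) here if the new contents must survive a crash */
    if (fclose(f) != 0)
        return -1;
    /* rename() atomically replaces the target: a reader that open()s data.txt
       sees either the complete old contents or the complete new contents. */
    return rename("data.txt.tmp", "data.txt");
}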
If Foo does rewrite the same file (without a temporary file) over and over, the recommended solution would be for both Foo and Bar to use fcntl()-based advisory locking. (However, using a temporary file and renaming it over the actual file when complete would be even better.)
(While flock()-based locking might seem easier, it is actually a bit of a guessing game whether it works on NFS mounts or not. fcntl() works, unless the NFS server is configured not to support locking. Which is a bit of an issue on some commercial web hosts, actually.)
If you cannot modify the behaviour of Foo, and it does not use advisory locking, there are still some options in Linux.
If Foo closes the file -- i.e., Bar is the only one to open the file --, then taking an exclusive file lease (using fcntl(descriptor, F_SETLEASE, F_WRLCK)) is a workable solution. You can only get an exclusive file lease if descriptor is the only open descriptor on the file, and the owner user of the file is the same as the process UID (or the process has the CAP_LEASE capability). If any other process tries to open or truncate the file, the lease owner gets signaled (SIGIO by default), and has up to /proc/sys/fs/lease-break-time seconds to downgrade or release the lease. The opener is blocked for the duration, which allows Bar to either cancel the processing, or copy the file for later processing.
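A rough, Linux-specific sketch of taking such a lease (the SIGIO handling is reduced to a flag; the file name is made up):
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t lease_broken = 0;

static void on_sigio(int sig) { (void)sig; lease_broken = 1; }

int main(void)
{
    signal(SIGIO, on_sigio);

    int fd = open("data.txt", O_RDONLY);          /* must be the only open descriptor */
    if (fd < 0)
        return 1;

    if (fcntl(fd, F_SETLEASE, F_WRLCK) < 0) {     /* exclusive lease */
        perror("F_SETLEASE");                     /* e.g. someone else has it open */
        return 1;
    }

    /* ... read or copy the file; if lease_broken becomes 1, Foo is trying to
       open or truncate it, so finish (or bail out) within the lease-break time ... */

    fcntl(fd, F_SETLEASE, F_UNLCK);               /* release the lease */
    close(fd);
    return 0;
}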
The other option for Bar is rather violent. It can monitor the file, say once per second, and when the file is old enough -- say, a few seconds --, pause Foo by sending it a SIGSTOP signal, check /proc/FOOPID/stat until Foo is stopped, recheck the file statistics to verify the file is still old, and then make a temporary copy of it (either in memory or on disk) for processing. After the file is read/copied, Bar can let Foo continue by sending it a SIGCONT signal.
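The stop-and-copy part of that scheme could look roughly like this (foopid is assumed to be known, for example from a pid file; the /proc parsing is simplified):
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Return the process state letter from /proc/<pid>/stat, or '?' on error. */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');          /* the state field follows the last ')' */
    return (p && p[1] && p[2]) ? p[2] : '?';
}

void copy_file_while_foo_is_stopped(pid_t foopid)
{
    struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms poll interval */

    kill(foopid, SIGSTOP);
    while (proc_state(foopid) != 'T')     /* 'T' means stopped */
        nanosleep(&ts, NULL);

    /* ... re-verify the file is still old enough, then read or copy it ... */

    kill(foopid, SIGCONT);
}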
Some filesystems may support file snapshots, but in my opinion, either of the above is much saner than relying on nonstandard filesystem support to function correctly. If Foo cannot be modified to co-operate, it is time to refactor it out of the picture. You do not want to be held hostage by a black box outside your control, so the sooner you replace it with something more user/administrator-friendly, the better off you'll be in the long term.
This is difficult to do robustly without Foo's cooperation.
Unixes have two main kinds of file locking:
range locking with fcntl(2)
always-whole-file locking with flock(2)
Ideally, you use either of these in cooperative mode (advisory locking), where all participants attempt to acquire the lock and only one will get it at a time.
Without the other program's cooperation, your only recourse, as far as I know, is mandatory locking, which you can get with fcntl if you enable it on the filesystem, but the man page mentions that the Linux implementation is unreliable.
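For completeness, the whole-file advisory variant with flock(2) is short, but remember it only helps if both programs use it:
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Advisory whole-file lock: only affects processes that also call flock(). */
int read_whole_file_locked(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    flock(fd, LOCK_SH);          /* shared lock; blocks while a writer holds LOCK_EX */
    /* ... read the file ... */
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}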
In all UN*X systems, what is guaranteed to happen atomically is an individual write(2) or read(2) system call. The kernel even locks the file inode in memory, so while you are read(2)ing or write(2)ing it, it will not change.
For atomicity over a larger span, you have to lock the whole file. You can use the file locking tools available to lock different regions of a file. Some locks are advisory (you can ignore them and skip over them) and others are mandatory (you are blocked until the other side unlocks the file region).
See fcntl(2) and the commands F_GETLK, F_SETLK and F_SETLKW to query a lock, set a lock without blocking, and set a lock with blocking, respectively.
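A sketch of the advisory fcntl(2) interface from the reader's side (for this to protect anything, the writer would need a matching F_WRLCK around its writes):
#include <fcntl.h>
#include <unistd.h>

/* Take a blocking read lock on the whole file, read, then release it. */
int read_with_fcntl_lock(int fd)
{
    struct flock fl = { 0 };
    fl.l_type   = F_RDLCK;       /* read lock; use F_WRLCK for a write lock */
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 0;             /* length 0 means "to end of file", i.e. everything */

    if (fcntl(fd, F_SETLKW, &fl) < 0)   /* F_SETLKW waits; F_SETLK would fail instead */
        return -1;

    /* ... read the file ... */

    fl.l_type = F_UNLCK;
    return fcntl(fd, F_SETLK, &fl);     /* release the lock */
}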
What could go wrong if the reader of a pipe forgets to close fd[1] or if the writer of a pipe forgets to close fd[0]?
You'll have a file descriptor leak (for as long as the process holding the descriptor keeps running). The worst thing that can happen is that you run out of file descriptors if you have a lot of pipes.
There's usually a soft and a hard limit (see ulimit) per user, and also a system-wide limit (although you're unlikely to hit that if your system has a sensible per-user limit). Once you run out of file descriptors, strange things happen: you won't be able to start new processes, and other running processes might stop working correctly.
Most of the time this isn't something to worry about, as there are usually just two processes and one pipe, so the leak won't be a big deal. Still, you usually want to close any file descriptor you no longer need, to free up resources.
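For reference, the usual pattern after fork() is for each side to close the pipe end it does not use, which avoids exactly this kind of leak (a minimal sketch):
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {                    /* child: the reader */
        close(fd[1]);                  /* unused write end; keeping it open would both
                                          leak the descriptor and keep read() from
                                          ever returning 0 (end-of-file) */
        char buf[64];
        ssize_t n = read(fd[0], buf, sizeof buf);
        if (n > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(fd[0]);
    } else {                           /* parent: the writer */
        close(fd[0]);                  /* unused read end */
        const char *msg = "hello\n";
        write(fd[1], msg, strlen(msg));
        close(fd[1]);                  /* lets the reader see end-of-file */
    }
    return 0;
}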
No resource is infinite for a given process, and that includes the number of files and sockets a process can create. Failing to close FDs after use causes something akin to a memory leak if the process keeps requesting new FDs.
Check ulimit for the number of open files allowed. If you try creating new descriptors without closing them, you will soon run out.
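A tiny demonstration of that (it deliberately never closes anything and reports where it hit the wall):
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int count = 0;
    while (open("/dev/null", O_RDONLY) >= 0)    /* leak on purpose */
        count++;
    printf("opened %d descriptors before failing: %s\n", count, strerror(errno));
    return 0;
}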