I am running a process and I realized it will take longer than I thought to finish. It has been running for quite some time now, and I would like to end the process without losing the data it has generated. The program writes its output to a text file using C file I/O. How can I close the file mid-process without losing data?
I ended up canceling the computation and took Steve Summit's advice, which was to catch SIGINT and SIGTERM and to call fflush() periodically. I wasn't actually able to recover the lost data, but at least now I can recover all the data that gets processed.
I have a simple C program (on Linux). The steps in the program are as follows:
1. Within a while loop, it calls a query that returns exactly one record. It is essentially a view that looks for a column called "processed" with a value of "0" and uses "limit 1".
2. I read the record in the result set, perform some calculations, and upload the results back to the database. I also set the processed column to "1".
3. If the query does not return any records, I exit the while loop.
4. Once the while loop is exited, the program exits.
However, I do not want the program to exit once it completes, because the database might get more qualifying records in the next 30 minutes. I want this to be a long-running program that checks for any new records and starts the while loop again to process them.
I am not doing any multithreading or anything fancy. I did some googling and found posts talking about semaphores.
Is this the right way to go about it? Are there any simple examples of semaphores, with explanations?
First, I hope you're using a transaction. Otherwise there can be a race condition between steps 1 and 2.
I think your question is really "How does your program know when there is more information to be processed in a SQL table?" There are several ways to do this.
The simplest is polling. Your program just checks every so often whether there's any work; if there isn't, it sleeps for a while. If checking is cheap, or you don't have to check very often, polling is fine. It's pretty robust; no coordination is necessary between the worker and the supplier. The worker just checks for work.
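For instance, a minimal polling loop might look like this (process_one_record() is a hypothetical stand-in for your query-and-update step):

    #include <stdbool.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real work: run the query, and if a
     * row with processed = 0 came back, handle it and return true;
     * return false when the queue is empty. */
    static bool process_one_record(void)
    {
        return false;  /* pretend the queue is always empty in this sketch */
    }

    int main(void)
    {
        for (;;) {
            while (process_one_record())
                ;            /* drain all pending work */
            sleep(60);       /* queue empty: wait a minute, then poll again */
        }
    }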
Another is to make the program block on some sort of I/O, like waiting for a lock on a file. That's what semaphores are about. It goes like this:
1. The queue is empty.
2. The producer gets an exclusive lock on the semaphore file.
3. Your worker tries to get a lock on the semaphore file; it blocks.
4. The producer adds to the queue and releases its lock.
5. The worker immediately unblocks.
6. It checks the queue.
7. It does its work.
...but resetting the system is a problem. The producer doesn't know when the queue is empty again without polling, and the scheme requires that everything adding to the SQL table know about this procedure and be located on the same machine. Even if you get it working, it's very vulnerable to deadlocks and race conditions.
Another way is via signals. The producer process sends a signal to the worker process to say "I added some work". As above, this requires coordination between the things adding to the SQL table and the workers.
A better solution is to not use SQL for a work queue; it's inherently something you have to poll. Instead, use a named or network pipe. Pipes automatically act as a queue: producers write to the pipe when they add work, and the worker connects to the pipe and reads from it to get more work. If there's no work, it quietly blocks, waiting for work. The pipe can contain all the information necessary to do the work, or it can just contain an indication that there is work elsewhere (like an ID for a row).
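A rough sketch of the worker side, assuming a FIFO at a hypothetical path /tmp/work.fifo where producers write one row ID per line:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    #define FIFO_PATH "/tmp/work.fifo"   /* hypothetical queue location */

    int main(void)
    {
        char line[128];

        mkfifo(FIFO_PATH, 0666);   /* fails harmlessly with EEXIST on restart */

        for (;;) {
            /* fopen() blocks until a producer opens the FIFO for writing,
             * and fgets() blocks while the pipe is empty, so the worker
             * sleeps quietly whenever there is no work. */
            FILE *fifo = fopen(FIFO_PATH, "r");
            if (!fifo)
                return EXIT_FAILURE;

            while (fgets(line, sizeof line, fifo) != NULL) {
                long row_id = strtol(line, NULL, 10);
                printf("processing row %ld\n", row_id);  /* real work goes here */
            }

            /* EOF means every producer closed its end; loop and reopen. */
            fclose(fifo);
        }
    }

A producer can then enqueue work with nothing more than echo 42 > /tmp/work.fifo.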
Finally, depending on how much processing needs to be done, you could try doing all that processing in a stored procedure triggered by a table update.
What could go wrong if the reader of a pipe forgets to close fd[1] or if the writer of a pipe forgets to close fd[0]?
You'll have a file descriptor leak (for as long as the process that holds the descriptor is running). The worst thing that can happen is that you run out of file descriptors if you have a lot of pipes.
There's usually a soft and a hard limit (see ulimit) per user, and also a system-wide limit (although you're unlikely to hit that if your system has a sensible per-user limit). Once you run out of file descriptors, strange things happen: you won't be able to start new processes, or other running processes might stop working correctly.
Most of the time this isn't something to worry about, as usually there are just two processes and one pipe, so the leak won't be a big deal. Still, you generally want to close any file handle you don't need any more, to free up resources.
No resource is infinite for a given process, and that includes the number of files and sockets a process can create. Failing to close FDs after use causes something akin to a memory leak as your process keeps requesting new FDs.
Check ulimit for the number of open files allowed. If you try creating new descriptors without closing them, you will soon run out.
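A quick way to see this for yourself (a deliberately leaky sketch):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        long count = 0;

        /* Create pipes and never close them; each pipe() consumes two
         * descriptors, so the per-process limit (ulimit -n) is hit fast. */
        while (pipe(fds) == 0)
            count += 2;

        perror("pipe");   /* typically: "Too many open files" (EMFILE) */
        printf("leaked %ld descriptors before failing\n", count);
        return 0;
    }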
I have a C program which has multiple worker threads. There is a main thread which periodically (every 0.2 s) does some basic checks (i.e. has a thread finished, has a signal been received, etc.). At each check, I would like to write any data that the threads may have in their log buffers out to a single log file.
My initial idea was to simply open the log file, write the data from all the threads, and then close it again. I am worried that this might be too much overhead, seeing as these checks occur every 0.2 s.
So my question is - is this scenario inefficient?
If so, can anyone suggest a better solution?
I thought of leaving the file descriptor open and just writing new data on every check, but then there is the problem that if the physical file somehow gets deleted, the program would never know (without rechecking, and in that case we might as well just open the file again) and logging data would be lost.
(This program is designed to run for very long periods of time, so the fact that log file will be deleted at some point is basically guaranteed due to log rotation.)
The standard solution on UNIX is to add a signal handler for SIGHUP which closes and re-opens the log file. Many UNIX daemons do this for precisely this purpose, to support log rotation. Call kill -HUP <pid> in your log rotation script and you're good to go.
(Some programs will also treat SIGHUP as a cue to re-read their configuration files, so you can make configuration changes on the fly without having to restart processes.)
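A minimal sketch of the reopen-on-SIGHUP pattern (the log path is a placeholder, and the 0.2 s interval comes from the question; note the handler only sets a flag, because fopen() and fclose() are not async-signal-safe):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define LOG_PATH "myapp.log"            /* hypothetical log file name */

    static FILE *logfp;
    static volatile sig_atomic_t reopen_requested;

    static void on_sighup(int sig)
    {
        (void)sig;
        reopen_requested = 1;   /* do the real work outside the handler */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sighup;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGHUP, &sa, NULL);

        logfp = fopen(LOG_PATH, "a");
        if (!logfp)
            return 1;

        for (;;) {
            if (reopen_requested) {           /* log rotation happened */
                reopen_requested = 0;
                fclose(logfp);
                logfp = fopen(LOG_PATH, "a"); /* re-create/re-open the new file */
            }

            /* ... collect and write the threads' buffered log data here ... */
            fprintf(logfp, "tick\n");
            fflush(logfp);

            struct timespec ts = { 0, 200000000L };  /* 0.2 s check interval */
            nanosleep(&ts, NULL);
        }
    }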
Currently there isn't much of a good solution. I would suggest writing a timer that runs separately from your main 0.2 s check and checks the log-file buffers, writing them to disk.
I am working on something network-based that could solve this (I have had the same problem) with excellent performance; fire me a message on GitHub for details.
I'm trying to understand how asynchronous file operations are emulated using threads. I've found next to nothing to read on the subject.
Is it possible that:
a process uses a thread to open a regular file (on an HDD).
the parent gets the file descriptor from the thread; it may now close the thread.
the parent uses the file descriptor with a new thread, reading X bytes from the file.
the parent gets the file descriptor back, with the seek position reflecting the current file state.
the parent may repeat these operations, without the need to open, or seek, every time it wishes to "continue" reading a new chunk of the file?
This is just a wild guess of mine; I would appreciate it if anybody could shed more light on how it's emulated efficiently.
UPDATE:
By efficient I actually mean that I don't want the thread to "wait" from the moment the file is opened. Think of a non-blocking HTTP daemon serving a client a huge file: you want to use a thread to read chunks of the file without blocking the daemon, but you don't want to keep the thread busy while "waiting" for the actual transfer to take place; you want to use the thread for the blocking operations of other clients.
To understand asynchronous I/O better, it may be helpful to think in terms of overlapping operations. That is, the number of pending operations (operations that have been started but not yet completed) can be greater than one at any given time.
A diagram that explains asynchronous I/O might look like this: http://msdn.microsoft.com/en-us/library/aa365683(VS.85).aspx
If you are using the asynchronous I/O capabilities provided by the underlying operating system, then it is possible to asynchronously read from multiple files without spawning an equal number of threads.
If your underlying operating system does not provide asynchronous I/O, or if you decide not to use it (in other words, you wish to emulate asynchronous operation using only blocking I/O, the regular read/write provided by the operating system), then it is necessary to spawn as many threads as there are simultaneous I/O operations. This is because when a thread makes a blocking I/O call, it cannot continue executing until the operation finishes. In order to start another blocking I/O operation, that operation has to be issued from another thread that is not already occupied.
When you open/create a file, fire up a thread and store that thread's id/pointer as your file handle.
The thread does nothing except sit in a loop waiting for an "event"; a semaphore works well here. When you want to do a read, you add the read command to a queue (remember to protect the queue with a critical section), return a unique id, and then increment the semaphore. If the thread is asleep, it now wakes up, grabs the first message off the queue, and processes it. When it has completed, you remove the command from the queue.
To poll whether a file read has completed, you can simply check whether it is still in the command queue. If it's not there, the command has completed.
Furthermore, if you want to allow synchronous reads as well, you can wait, after sending the message, for a completion "event" to be triggered. You then check whether the unique id is still in the queue; if it isn't, you return control. If it still is, you go back to a wait state until the relevant unique id has been processed.
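A condensed sketch of that scheme using POSIX threads (the queue layout, field names, and the polling of a done flag are all illustrative; a production version would use a condition variable or atomics for completion signalling):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <unistd.h>

    struct read_req {                 /* one queued read command */
        struct read_req *next;
        long offset;
        size_t len;
        char *buf;
        volatile int done;            /* set by the worker on completion */
    };

    static struct read_req *head, *tail;   /* FIFO command queue */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static sem_t pending;
    static FILE *file;

    static void *io_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            sem_wait(&pending);                  /* sleep until work arrives */
            pthread_mutex_lock(&lock);           /* the "critical section" */
            struct read_req *req = head;
            head = req->next;
            if (!head)
                tail = NULL;
            pthread_mutex_unlock(&lock);

            fseek(file, req->offset, SEEK_SET);  /* blocking I/O happens here */
            fread(req->buf, 1, req->len, file);
            req->done = 1;
        }
        return NULL;
    }

    static void async_read(struct read_req *req) /* returns immediately */
    {
        req->next = NULL;
        req->done = 0;
        pthread_mutex_lock(&lock);
        if (tail)
            tail->next = req;
        else
            head = req;
        tail = req;
        pthread_mutex_unlock(&lock);
        sem_post(&pending);                      /* wake the worker */
    }

    int main(void)
    {
        sem_init(&pending, 0, 0);
        file = fopen("data.bin", "rb");          /* hypothetical input file */
        if (!file)
            return 1;

        pthread_t tid;
        pthread_create(&tid, NULL, io_worker, NULL);

        char buf[16];
        struct read_req req = { .offset = 0, .len = sizeof buf, .buf = buf };
        async_read(&req);

        while (!req.done)                        /* poll for completion */
            usleep(1000);

        printf("read completed asynchronously\n");
        fclose(file);
        return 0;
    }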
I have an application that monitors a high-speed communication link and writes logs to a file (via standard C file IO). The response time to messages that arrive on the link is important, so I knowingly don't fflush the file at each message, because this slows down my response time.
However, in some circumstances my application is terminated "violently" (e.g. by killing the process), and in these cases the last few log messages are not written (even if the communication link has been quiet for some time).
What techniques/strategies can I use to make sure most of my data is flushed, but without giving up speed of response?
Edit: The application runs on Windows
Using a thread is the standard solution to this. Have your data collection code write data to a thread-safe queue and use a semaphore to signal the writing thread.
However, before you go there, double-check your assertion that fflush() would be slow. Most operating systems have a file system cache, which makes writes very fast: a simple memory-to-memory block copy. The data gets written to disk lazily, so your process crashing won't affect it.
If you are on Unix or Linux, your process will receive some termination signal, which you can catch (except for SIGKILL) in order to call fflush() in your signal handler.
For signal catching see man sigaction.
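A minimal sketch of that approach (the file name is hypothetical; note that fflush() is not formally async-signal-safe, but as a last-ditch effort before dying it usually gets the buffered data out):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static FILE *logfp;

    static void flush_and_die(int sig)
    {
        (void)sig;
        if (logfp)
            fflush(logfp);   /* push buffered messages to the OS */
        _exit(1);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = flush_and_die;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);
        sigaction(SIGTERM, &sa, NULL);

        logfp = fopen("messages.log", "a");      /* hypothetical log file */
        if (!logfp)
            return 1;
        fprintf(logfp, "message received\n");    /* buffered, not yet on disk */

        pause();   /* wait here; kill the process to see the flush happen */
        return 0;
    }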
EDIT: No idea about Windows.
I would suggest an asynchronous write-through. That way you don't need to wait for the write IOP to happen, nor will the OS delay the IOP. See the CreateFile() flags FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED.
You don't need FILE_FLAG_NO_BUFFERING. That's only to skip the OS cache. You would only need it if you are worried about the entire OS dying violently.
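A sketch of that combination (the file name is hypothetical, and a real program would typically put an event handle in the OVERLAPPED structure rather than waiting on the file handle):

    #include <windows.h>

    int main(void)
    {
        HANDLE h = CreateFileA("messages.log",   /* hypothetical log file */
                               GENERIC_WRITE, FILE_SHARE_READ, NULL,
                               OPEN_ALWAYS,
                               FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
                               NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* Offset/OffsetHigh of 0xFFFFFFFF means "append to end of file". */
        OVERLAPPED ov = {0};
        ov.Offset = 0xFFFFFFFF;
        ov.OffsetHigh = 0xFFFFFFFF;

        const char msg[] = "log entry\n";
        if (!WriteFile(h, msg, sizeof msg - 1, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;   /* a real error, not just "still in flight" */

        /* ... keep responding to the link here; the OS completes the write ... */

        DWORD written;
        GetOverlappedResult(h, &ov, &written, TRUE);  /* wait for completion */
        CloseHandle(h);
        return 0;
    }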
If your program terminates by calling exit() or returning from main(), the C standard guarantees that open streams are flushed and closed, so no special handling is needed. It sounds from your description like your program is dying due to a signal or forced termination, in which case you wouldn't see the flush.
I'm having trouble understanding what the problem is exactly.
If it's just that you're trying to find a happy medium between flushing often and the default fully buffered output, then maybe line buffering is what you want:
setvbuf(stream, NULL, _IOLBF, 0);  /* line buffered: flush on every newline */