What are the reasons to check for error on close()? - c

Note: Please read to the end before marking this as duplicate. While it's similar, the scope of what I'm looking for in an answer extends beyond what the previous question was asking for.
Widespread practice, which I tend to agree with, is to treat close purely as a resource-deallocation function for file descriptors rather than as a potential I/O operation with meaningful failure cases. And indeed, prior to the resolution of issue 529, POSIX left the state of the file descriptor (i.e., whether it was still allocated or not) unspecified after errors, making it impossible to respond portably to errors in any meaningful way.
However, a lot of GNU software goes to great lengths to check for errors from close, and the Linux man page for close calls failure to do so "a common but nevertheless serious programming error". NFS and quotas are cited as circumstances under which close might produce an error, but no details are given.
What are the situations under which close might fail, on real-world systems, and are they relevant today? I'm particularly interested in knowing whether there are any modern systems where close fails for any non-NFS, non-device-node-specific reasons, and as for NFS or device-related failures, under what conditions (e.g. configurations) they might be seen.

Once upon a time (24 March 2007), Eric Sosman had the following tale to share in the comp.lang.c newsgroup:
(Let me begin by confessing to a little white lie: It wasn't fclose() whose failure went undetected, but the POSIX close() function; this part of the application used POSIX I/O. The lie is harmless, though, because the C I/O facilities would have failed in exactly the same way, and an undetected failure would have had the same consequences. I'll describe what happened in terms of C's I/O to avoid dwelling on POSIX too much.)
The situation was very much as Richard Tobin described. The application was a document management system that loaded a document file into memory, applied the user's edits to the in-memory copy, and then wrote everything to a new file when told to save the edits. It also maintained a one-level "old version" backup for safety's sake: the Save operation wrote to a temp file, and then if that was successful it deleted the old backup, renamed the old document file to the backup name, and renamed the temp file to the document. bak -> trash, doc -> bak, tmp -> doc.
The write-to-temp-file step checked almost everything. The fopen(), obviously, but also all the fwrite()s and even a final fflush() were checked for error indications -- but the fclose() was not. And on one system it happened that the last few disk blocks weren't actually allocated until fclose() -- the I/O system sat atop VMS' lower-level file access machinery, and a little bit of asynchrony was inherent in the arrangement.
The customer's system had disk quotas enabled, and the victim was right up close to his limit. He opened a document, edited for a while, saved his work thus far, and exceeded his quota -- which went undetected because the error didn't appear until the unchecked fclose(). Thinking that the save succeeded, the application discarded the old backup, renamed the original document to become the backup, and renamed the truncated temp file to be the new document. The user worked a little longer and saved again -- same thing, except you'll note that this time the only surviving complete file got deleted, and both the backup and the master document file are truncated. Result: the whole document file became trash, not just the latest session of work but everything that had gone before.
As Murphy would have it, the victim was the boss of the department that had purchased several hundred licenses for our software, and I got the privilege of flying to St. Louis to be thrown to the lions.
[...]
In this case, the failure of fclose() would (if detected) have stopped the delete-and-rename sequence. The user would have been told "Hey, there was a problem saving the document; do something about it and try again. Meanwhile, nothing has changed on disk." Even if he'd been unable to save his latest batch of work, he would at least not have lost everything that went before.

Consider the inverse of your question: "Under what situations can we guarantee that close will succeed?" The answer is:
when you call it correctly, and
when you know that the file system the file is on does not return errors from close on this OS and kernel version
If you are convinced that your program doesn't have any logic errors and you have complete control over the kernel and file system, then you don't need to check the return value of close.
Otherwise, you have to ask yourself how much you care about diagnosing problems with close. I think there is value in checking and logging the error for diagnostic purposes:
If a coder makes a logic error and passes an invalid fd to close, then you'll be able to quickly track it down. This may help to catch a bug early before it causes problems.
If a user runs the program in an environment where close does return an error when (for example) data was not flushed, then you'll be able to quickly diagnose why the data got corrupted. It's an easy red flag because you know the error should not occur.
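To illustrate, here is a minimal sketch of that check-and-log approach (the wrapper name and message format are my own, not from the answer):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper: close fd and log any failure for diagnostics,
   leaving the caller to decide whether the error is fatal. */
static int close_logged(int fd, const char *what)
{
    if (close(fd) == -1) {
        /* EBADF almost always means a logic error (double close or an
           uninitialized fd); EIO/ENOSPC/EDQUOT suggest lost writes. */
        fprintf(stderr, "close(%s) failed: %s\n", what, strerror(errno));
        return -1;
    }
    return 0;
}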

Related

Logging Events---Looking For a Good Way

I created a program that monitors for events.
I want to log these events "in the right way".
Currently I have a string array, log[500][100].
Each line is a string of characters (up to 100) that report something about the event.
I have it set up so that only the last 500 events are saved in the array.
After that, new events overwrite the oldest events.
Currently I just keep revolving through the array until the program terminates, then I write the array to a file.
Going forward I would like to view the log in real time, any time I wish, without disturbing the event processing and logging process.
I considered opening the file for "appending" but here are my concerns:
(1) The program is running on a Raspberry Pi which has a flash memory as a "disk drive". I believe flash memories have a limited number of write cycles before problems can occur. This program runs 24/7 "forever" so I am afraid the "disk drive" will "wear out".
(2) I am using pretty much all the CPU capacity of the RPi so I don't want to add a lot of overhead/CPU cycles.
How would experienced programmers attack this problem?
Please go easy on me, this is my first C program.
[EDIT]
I began reviewing all the information and I became intrigued by Mark A's suggestion for tmpfs. I looked into it more and I am sure this answers my question. It allows the creation of files in RAM, not on the SD card. They are lost on power-down but I don't care.
In order to keep the files from growing too large I created a double-buffer approach. First I write 500 events to file A, then switch to file B. When 500 events have been written to file B, I close and reopen file A (to delete the contents and start at 0 events) and switch to writing to file A. I found I needed to fflush(file...) after each write or else the file was empty until fclose.
Normally that would be OK, but right now I am fighting a nasty segmentation fault, so I want as much insight as possible into what is going on. When I hit the fault, I never get to my fclose statements.
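A minimal sketch of that double-buffer scheme (the /dev/shm paths and the 500-event threshold are illustrative; a tmpfs is assumed to be mounted at /dev/shm, as is typical on Linux):
#include <stdio.h>

#define EVENTS_PER_FILE 500

static FILE *logf;          /* currently active log file */
static int events_in_file;  /* events written to it so far */
static int using_a;         /* which buffer file is active */

static void log_event(const char *msg)
{
    if (logf == NULL || events_in_file == EVENTS_PER_FILE) {
        if (logf != NULL)
            fclose(logf);
        using_a = !using_a;
        /* mode "w" truncates, so the older file restarts empty */
        logf = fopen(using_a ? "/dev/shm/logA" : "/dev/shm/logB", "w");
        events_in_file = 0;
    }
    if (logf != NULL) {
        fprintf(logf, "%s\n", msg);
        fflush(logf);  /* without this the file looks empty until fclose() */
        events_in_file++;
    }
}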
Welcome to Stack Overflow and to C programming! A wonderful world of possibilities awaits you.
I have several thoughts in response to your situation.
The very short summary is to use stdout and delegate the output-file management to the shell.
The longer, rambling answer full of my personal musing is as follows:
1 : A very typical thing for C programs to do is not be in charge of how outputs are kept. You might have heard of the "built in" file handles, stdin, stdout, and stderr. These file handles are (under normal circumstances) always available to your program for input (from stdin) and output (stdout and stderr). As you might guess from their names stdout is customarily used for regular output and stderr is customarily used for error / exception output. It is exceedingly typical for a C program to simply read from stdin and output to stdout and stderr, and let something else (e.g., the shell) take care of what those actually are.
For example, reading from stdin means that your program can be used for keyboard entry and for file reading, without needing to change your program's code. The same goes for stdout and stderr; simply output to those file handles, and let the user decide whether those should go to the screen or be redirected to a file. And, because stdout and stderr are separate file handles, the user can have them go to separate 'destinations'.
In your case, to implement this, drop the array entirely, and simply
fprintf(stdout, "event notice : %s\n", eventdetailstring);
(or similar) every time your program has something to say. Take a look at fflush(), too, because of potential output buffering.
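For example (a two-line sketch; the message format is just illustrative):
fprintf(stdout, "event notice : %s\n", eventdetailstring);
fflush(stdout);  /* stdout may be fully buffered when redirected to a file */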
2a : This gets you continuous output. This itself can help with your concern about memory wear on the Pi's flash disk. If you do something like:
eventmonitor > logfile
then logfile will be appended to during the lifetime of your program, which will tend to write to new parts of the flash disk. Of course, if you only ever append, you will eventually run out of space on the disk, so you might set up a cron job to kill the currently running eventmonitor and restart it every day at midnight. Done with the above command, that would cause it to overwrite logfile once per day. This prevents endless growth, and it might even use a new physical area of the flash drive for the new file (even though it's the same name; underneath, it's a different file, with a different inode, etc.). But even if it reuses the exact same area of the flash drive, now you are down to worrying about whether this will last more than 10,000 days, instead of 10,000 writes. I'm betting that within 10,000 days, new options will be available -- worst case, you buy a new Pi every 27 years or so!
There are other possible variations on this theme, as well. E.g., you could have a sophisticated script kicked off by cron every day at midnight that kills any currently running eventmonitor, deletes output files older than a week, and starts a new eventmonitor outputting to a file whose name is based partly on the date, so that past days' files aren't overwritten. But all of this is in the realm of using your program. You can make your program easier to use by writing it to use stdin, stdout, and stderr.
2b : Or, you can just have stdout go to the screen, which is typically how it already is when a program is started from an interactive shell / terminal window. I imagine you could have the Pi running headless most of the time, and when you want to see what your program is outputting, hook up a monitor. Generally, things will stay running between disconnecting and reconnecting your monitor. This avoids affecting the flash drive at all.
3 : Another approach is to have your event monitoring program send its output somewhere off-system. This is getting into more advanced programming territory, so you might want to save this for a later enhancement, after you've mastered more of the basics. But, your program could establish a network connection to, say, a JSON API and send event information there. This would let you separate the functions of event monitoring from event reporting.
You will discover as you learn more programming that this idea of separation of concerns is an important concept, and applies at various levels of a program or a system of interoperating programs. In this case, the Pi is a good fit for the data monitoring aspect because it is a lightweight solution, and some other system with more capacity and more stable storage can cover the data collection aspect.

Should we error check every call in C?

When we write C programs we make calls to malloc or printf. But do we need to check every call? What guidelines do you use?
e.g.
char error_msg[BUFFER_SIZE];
if (fclose(file) == EOF) {
    snprintf(error_msg, sizeof error_msg, "Error closing %s", filename);
    perror(error_msg);  /* perror appends ": <cause>" and its own newline */
}
The answer to your question is: do whatever you want; there is no written rule. But the right question is: what do users want in case of failure?
Let me explain: if you are a student writing a test program, for example, there is no absolute need to check for errors; it may be a waste of time.
Now, if your code may be distributed or used by other people, that's quite different: put yourself in the shoes of future users. Which message do you prefer when something goes wrong with an application:
Core was generated by `./cut --output-d=: -b1,1234567890- /dev/fd/63'.
Program terminated with signal SIGSEGV, Segmentation fault.
or
MySuperApp failed to start MySuperModule because there is not enough space on the disk.
Try to free space on disk, then relaunch the app.
If this error persists contact us at support#mysuperapp.com
As has already been addressed in the comments, you have to consider two types of error:
A fatal error is one that kills your program (app / server / site / whatever it is). It renders it unusable, either by crashing or by putting it in some state whereby it can't do its useful work, e.g. a failed memory allocation or no disk space.
A non-fatal error is one where something messes up, but the program can continue to do what it's supposed to do, e.g. a file is not found, or the program can keep serving users who aren't requesting the thing that caused the error.
Source : https://www.quora.com/What-is-the-difference-between-an-error-and-a-fatal-error
Only do error checking if your program has to behave differently when an error is detected. Let me illustrate this with an example: assume you have used a temporary file in your program and you use the unlink(2) system call to erase that temporary file at the end of the program. Do you have to check whether the file has been successfully erased? Let's analyse the problem with some common sense: if you check for errors, will you be able (inside the program) to do some alternate thing to cope with the failure? This is uncommon (if you created the file, it's rare that you will not be able to erase it, but something can happen in between --- for example, a change in directory permissions that forbids you to write to the directory anymore). But what can you do in that case? Is it possible to use a different approach to erase the temporary file? Probably not... so checking a possible error from the unlink(2) system call will be almost useless.
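If you do check it anyway, logging the failure and moving on is about all that is left; a minimal sketch (the helper function is my own invention, not from the answer):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Best-effort cleanup of a temporary file: there is no sensible
   fallback if unlink() fails, so record the fact and move on. */
static void remove_temp(const char *path)
{
    if (unlink(path) == -1 && errno != ENOENT)
        fprintf(stderr, "warning: could not remove %s: %s\n",
                path, strerror(errno));
}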
Of course, this doesn't always apply; you have to use common sense while programming. Errors about writing to files should always be considered, as they usually come down to access permissions or, most often, full filesystems (in that case, even trying to generate a log message can be useless, since you may have filled the disk --- or not; that depends). You don't always know the precise environment details needed to decide whether a full-filesystem error can be ignored. Or suppose you have to connect to a server in your program: should a connect(2) system call failure be acted upon? Probably, most of the time; at least a message with the protocol error (or the cause of the failure) should be given to the user. Assuming everything goes OK can save you time in a prototype, but in production programs you have to cope with whatever can happen.
When you want to use the return value of a function, it is best to check it before using it.
For example, a function returning a pointer may return NULL, so add a NULL check before dereferencing it.

Writing programs to cope with I/O errors causing lost writes on Linux

TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?
I know you have to fsync() the file (and its parent directory) for durability. The question is if the kernel loses dirty buffers that are pending write due to an I/O error, how can the application detect this and recover or abort?
Think database applications, etc, where order of writes and write durability can be crucial.
Lost writes? How?
The Linux kernel's block layer can under some circumstances lose buffered I/O requests that have been submitted successfully by write(), pwrite() etc, with an error like:
Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0
(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c).
On newer kernels the error will instead contain "lost async page write", like:
Buffer I/O error on dev dm-0, logical block 12345, lost async page write
Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.
Detecting them?
I'm not that familiar with the kernel sources, but I think that it sets AS_EIO on the buffer that failed to be written out if it's doing an async write:
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.
It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.
If, as I'm guessing, this propagates to fsync()'s result, then if the app panics and bails out when it gets an I/O error from fsync(), and knows how to re-do its work when restarted, that should be a sufficient safeguard?
There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages so it can rewrite them if it knows how, but if the app repeats all its pending work since the last successful fsync() of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync() to complete - right?
Are there then any other, harmless, circumstances where fsync() may return -EIO where bailing out and redoing work would be too drastic?
Why?
Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance where they can happen - I've also seen reports of it from thin-provisioned LVM, for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.
If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.
The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that ended up causing database corruption, because the DBMS didn't know its writes had failed. Not fun.
fsync() returns -EIO if the kernel lost a write
(Note: early part references older kernels; updated below to reflect modern kernels)
It looks like failures during async buffer write-out in end_buffer_async_write(...) set an -EIO flag on the failed dirty buffer page for the file:
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().
But only once!
This comment on sys_sync_file_range:
 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.
suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.
Sure enough, wait_on_page_writeback_range(...) clears the error bits when it tests them:
/* Check for outstanding write errors */
if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
if (test_and_clear_bit(AS_EIO, &mapping->flags))
    ret = -EIO;
So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.
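To make that failure mode concrete, this is the retry pattern that looks safe but is not on the kernels described here (a sketch, not code from any real DBMS):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* BROKEN: on the kernels discussed above, the first failing fsync()
   clears AS_EIO, so the retry reports success even though the dirty
   pages were dropped and never reached the disk. */
static void fsync_retry_broken(int fd)
{
    while (fsync(fd) == -1)
        fprintf(stderr, "fsync: %s, retrying\n", strerror(errno));
}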
I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.
Is this allowed?
The POSIX/SuS docs on fsync() don't really specify this either way:
If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.
Linux's man-page for fsync() just doesn't say anything about what happens on failure.
So it seems that the meaning of fsync() errors is "I don't know what happened to your writes, might've worked or not, better try again to be sure".
Newer kernels
On 4.9 end_buffer_async_write sets -EIO on the page, just via mapping_set_error.
buffer_io_error(bh, ", lost async page write");
mapping_set_error(page->mapping, -EIO);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
On the sync side I think it's similar, though the structure is now pretty complex to follow. Error checks seem to all go through filemap_check_errors in mm/filemap.c, which does a test-and-clear with much the same effect:
if (test_bit(AS_EIO, &mapping->flags) &&
    test_and_clear_bit(AS_EIO, &mapping->flags))
        ret = -EIO;
return ret;
I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up perf probes on it:
sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp
sudo perf probe filemap_check_errors
sudo perf probe end_buffer_async_write
sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync
I find the following call stack in perf report -T:
---__GI___libc_fsync
entry_SYSCALL_64_fastpath
sys_fsync
do_fsync
vfs_fsync_range
ext4_sync_file
filemap_write_and_wait_range
filemap_check_errors
A read-through suggests that yeah, modern kernels behave the same.
This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.
Test
I've implemented a test case to demonstrate this behaviour.
Implications
A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.
Bug reports
https://bugzilla.kernel.org/show_bug.cgi?id=194755
https://bugzilla.kernel.org/show_bug.cgi?id=194757
Further reading
lwn.net touched on this in the article "Improved block-layer error handling".
postgresql.org mailing list thread.
Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.
I do not agree. write can return without error if the write is simply queued, but the error will be reported on the next operation that requires the actual writing to disk: that means on the next fsync, possibly on a following write if the system decides to flush the cache, and at the latest on the final close of the file.
That is the reason why it is essential for an application to test the return value of close to detect possible write errors.
If you really need to be able to do clever error processing you must assume that everything that was written since the last successful fsync may have failed and that in all that at least something has failed.
write(2) provides less than you expect. The man page is very open about the semantics of a successful write() call:
A successful return from write() does not make any guarantee that
data has been committed to disk. In fact, on some buggy implementations,
it does not even guarantee that space has successfully been reserved
for the data. The only way to be sure is to call fsync(2) after you
are done writing all your data.
We can conclude that a successful write() merely means that the data has reached the kernel's buffering facilities. If persisting the buffer fails, a subsequent access to the file descriptor will return the error code. As a last resort, that may be close(). The man page of the close(2) system call contains the following sentence:
It is quite possible that errors on a previous write(2) operation are
first reported at the final close().
If your application needs to persist data right away, it has to use fsync/fdatasync on a regular basis:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified
buffer cache pages for) the file referred to by the file descriptor fd
to the disk device (or other permanent storage device) so
that all changed information can be retrieved even after the
system crashed or was rebooted. This includes writing through or
flushing a disk cache if present. The call blocks until the
device reports that the transfer has completed.
Alternatively, use the O_SYNC flag when you open the file. It ensures the data is written through to the disk on each write.
If that doesn't satisfy you, nothing will.
Check the return value of close. close can fail whilst buffered writes appear to succeed.
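Tying these answers together, here is a minimal sketch of a save path that checks every step where a deferred write error can surface (the interface and error handling are illustrative, not prescriptive):
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, -1 if the data may not be on stable storage. */
static int save_all(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) == -1) {   /* deferred write errors usually surface here */
        close(fd);
        return -1;
    }

    return close(fd);        /* ...but close() can still report them (NFS, quotas) */
}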

AIO in C on Unix - aio_fsync usage

I can't understand what this function aio_fsync does. I've read man pages and even googled but can't find an understandable definition. Can you explain it in a simple way, preferably with an example?
aio_fsync is just the asynchronous version of fsync; when either has completed, all data is written back to the physical drive media.
Note 1: aio_fsync() simply starts the request; the fsync()-like operation is not finished until the request is completed, similar to the other aio_* calls.
Note 2: only the aio_* operations already queued when aio_fsync() is called are included.
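Since the question asks for an example, here is a minimal sketch of aio_write() followed by aio_fsync() (the file name and polling loop are illustrative choices, not the only way to use the API; on glibc, link with -lrt):
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("example.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    static char msg[] = "hello, aio\n";

    struct aiocb wr;                 /* the asynchronous write request */
    memset(&wr, 0, sizeof wr);
    wr.aio_fildes = fd;
    wr.aio_buf = msg;
    wr.aio_nbytes = sizeof msg - 1;
    wr.aio_offset = 0;
    if (aio_write(&wr) == -1) { perror("aio_write"); return 1; }

    /* Queue a sync covering all AIO operations already queued on fd. */
    struct aiocb sync;
    memset(&sync, 0, sizeof sync);
    sync.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &sync) == -1) { perror("aio_fsync"); return 1; }

    /* Poll until the sync request completes; a real program would more
       likely use aio_suspend() or a sigevent completion notification. */
    while (aio_error(&sync) == EINPROGRESS)
        usleep(1000);

    int err = aio_error(&sync);
    if (err != 0) {
        fprintf(stderr, "aio_fsync failed: %s\n", strerror(err));
    } else if ((err = aio_error(&wr)) != 0) {
        fprintf(stderr, "aio_write failed: %s\n", strerror(err));
    } else {
        printf("wrote %zd bytes to stable storage\n", aio_return(&wr));
    }

    if (close(fd) == -1)
        perror("close");             /* still worth checking, as discussed above */
    return err ? 1 : 0;
}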
As your comment mentioned, even if you don't use fsync or aio_fsync, the data will still appear in the file after your program ends. However, if the machine were abruptly powered off, it would very likely not be there.
This is because when you write to a file, the OS actually writes to the page cache, which is a copy of disk sectors kept in RAM, not to the disk itself. Of course, even before it is written back to the disk, you can still see the data in RAM. When you call fsync() or aio_fsync(), it ensures that write()s, aio_write()s, etc. to all parts of that file are written back to the physical disk, not just RAM.
If you never call fsync(), etc., the OS will eventually write the data back to the drive whenever it has spare time to do it. An orderly OS shutdown should do it as well.
I would say you should usually not worry about manually calling these unless you need to ensure that your data, say a log record, is flushed to the physical disk and needs to be more likely to survive an abrupt system crash. Clearly database engines do this for transactions and journals.
However, there are other reasons the data may not survive a crash, and it is very complex to ensure absolute consistency in the face of failures. So if your application does not absolutely need it, it is perfectly reasonable to let the OS manage this for you. For example, if the compiler's .o output ended up incomplete/corrupt because you power-cycled the machine in the middle of a compile or shortly after, it would not surprise anyone - you would just restart the build.

stat() system call is being blocked

The stat() system call is taking a long time when I try to stat a file which is corrupted (its magic number is corrupted).
I have a print statement after this call in my source code, and it only gets printed after some delay.
I am not sure whether stat() retries the call. If any documentation is available, please share it; that would be a great help.
It returned an input/output error, errno 5 (EIO). So I am not sure whether the file or the filesystem is corrupted.
This can be caused by bad blocks on an aging or damaged spinning disk. There are two other symptoms that will likely occur concurrently:
Copious explicit I/O errors reported by the kernel in the system logs.
A sudden spike in load average. This happens because processes which are stuck waiting on I/O are in uninterruptible sleep while the kernel busy-loops in an attempt to interact with the hardware, causing the system to become sluggish temporarily. You cannot stop this from happening, or kill processes in uninterruptible sleep. It's a sort of OS Achilles' heel.
If this is the case, unmount the filesystems involved and run e2fsck -c -y on them. If it is the root filesystem, you will need to, e.g., boot the system with a live CD and do it from there. From man e2fsck:
-c
This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad blocks. If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to a file or directory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
Note that -cc takes a long time; -c should be sufficient. -y answers yes automatically to all questions, which you might as well do since there may be a lot of those.
You will probably lose some data (have a look in /lost+found afterward); hopefully the system still boots. At the very least, the filesystems are now safe to mount. The disk itself may or may not last a while longer. I've done this and had them remain fine for months more, but don't count on it.
If this is a SMART drive, there are apparently some other tools you can use to diagnose and deal with the same problem, although what I've outlined here is probably good enough.
