stat() system call is being blocked (C)

The stat() system call takes a long time when I stat a file whose magic number is corrupted.
A print statement right after the call in my source code is only reached after a noticeable delay.
I am not sure whether stat() retries the operation internally. If any documentation on this is available, please share it; it would be a great help.
The call returned an input/output error (errno 5, EIO), so I am not sure whether the file or the filesystem is corrupted.
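A minimal sketch of how the delay and the errno can be observed (the path is illustrative):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
    struct stat sb;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = stat("/mnt/data/corrupted.file", &sb);  /* hypothetical path */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (rc == -1)
        fprintf(stderr, "stat failed after %.3f s: %s (errno %d)\n",
                secs, strerror(errno), errno);
    else
        printf("stat succeeded in %.3f s\n", secs);
    return 0;
}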

This can be caused by bad blocks on an aging or damaged spinning disk. There are two other symptoms that will likely occur concurrently:
Copious explicit I/O errors reported by the kernel in the system logs.
A sudden spike in load average. This happens because processes stuck waiting on I/O are in uninterruptible sleep while the kernel busy-loops in an attempt to interact with the hardware, causing the system to become sluggish temporarily. You cannot stop this from happening, or kill processes in uninterruptible sleep. It's a sort of OS Achilles' heel.
If this is the case, unmount the filesystems involved and run e2fsck -c -y on them. If it is the root filesystem, you will need to, e.g., boot the system with a live CD and do it from there. From man e2fsck:
-c     This option causes e2fsck to use badblocks(8) program to do a read-only scan of the device in order to find any bad blocks. If any bad blocks are found, they are added to the bad block inode to prevent them from being allocated to a file or directory. If this option is specified twice, then the bad block scan will be done using a non-destructive read-write test.
Note that -cc takes a long time; -c should be sufficient. -y answers yes automatically to all questions, which you might as well do since there may be a lot of those.
You will probably lose some data (have a look in /lost+found afterward); hopefully the system still boots. At the very least, the filesystems are now safe to mount. The disk itself may or may not last a while longer. I've done this and had them remain fine for months more, but don't count on it.
If this is a SMART drive, there are apparently some other tools you can use to diagnose and deal with the same problem, although what I've outlined here is probably good enough.


Writing programs to cope with I/O errors causing lost writes on Linux

TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?
I know you have to fsync() the file (and its parent directory) for durability. The question is if the kernel loses dirty buffers that are pending write due to an I/O error, how can the application detect this and recover or abort?
Think database applications, etc, where order of writes and write durability can be crucial.
Lost writes? How?
The Linux kernel's block layer can under some circumstances lose buffered I/O requests that have been submitted successfully by write(), pwrite() etc, with an error like:
Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0
(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c).
On newer kernels the error will instead contain "lost async page write", like:
Buffer I/O error on dev dm-0, logical block 12345, lost async page write
Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.
Detecting them?
I'm not that familiar with the kernel sources, but I think that it sets AS_EIO on the buffer that failed to be written-out if it's doing an async write:
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.
It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.
If, as I'm guessing, this propagates to fsync()'s result, then an app that panics and bails out when it gets an I/O error from fsync(), and knows how to re-do its work when restarted, should have a sufficient safeguard?
There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages, so it can't simply rewrite them. But if the app repeats all its pending work since the last successful fsync() of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync() to complete - right?
Are there then any other, harmless, circumstances where fsync() may return -EIO where bailing out and redoing work would be too drastic?
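To make that recovery idea concrete, here is a rough sketch of the redo loop under discussion; replay_from() is a hypothetical application hook that re-issues all writes since the last known-good point:

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: re-issue every write between 'from' and 'to'. */
int replay_from(int fd, off_t from, off_t to);

/* Returns 0 once everything up to 'end' is durable, -1 on an error
   that cannot be recovered by redoing work. */
int durable_up_to(int fd, off_t last_good_fsync, off_t end)
{
    for (;;) {
        if (fsync(fd) == 0)
            return 0;               /* durable up to 'end' */
        if (errno != EIO)
            return -1;
        /* The failed fsync() may have cleared the kernel's error flag,
           so a bare retry could falsely succeed. Rewrite everything
           since the last successful fsync() before trying again. */
        if (replay_from(fd, last_good_fsync, end) == -1)
            return -1;
    }
}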
Why?
Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance where they can happen - I've also seen reports of it from thin-provisioned LVM, for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.
If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.
The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that ended up causing database corruption, because the DBMS didn't know its writes had failed. Not fun.
fsync() returns -EIO if the kernel lost a write
(Note: early part references older kernels; updated below to reflect modern kernels)
It looks like async buffer write-out in end_buffer_async_write(...) failures set an -EIO flag on the failed dirty buffer page for the file:
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().
But only once!
This comment on sys_sync_file_range:
 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.
suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.
Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:
/* Check for outstanding write errors */
if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
if (test_and_clear_bit(AS_EIO, &mapping->flags))
    ret = -EIO;
So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.
I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.
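In sketch form, this is the dangerous pattern: the first failing fsync() clears AS_EIO, so the retry's "success" proves nothing about the lost pages.

#include <unistd.h>

void naive_fsync_retry(int fd)     /* DO NOT do this */
{
    while (fsync(fd) != 0) {
        /* After the first -EIO the kernel has already cleared the error
           flag, so the eventual "success" here does not mean the data
           ever reached disk. */
    }
}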
Is this allowed?
The POSIX/SuS docs on fsync() don't really specify this either way:
If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.
Linux's man-page for fsync() just doesn't say anything about what happens on failure.
So it seems that the meaning of fsync() errors is "I don't know what happened to your writes, might've worked or not, better try again to be sure".
Newer kernels
On 4.9 end_buffer_async_write sets -EIO on the page, just via mapping_set_error.
buffer_io_error(bh, ", lost async page write");
mapping_set_error(page->mapping, -EIO);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
On the sync side I think it's similar, though the structure is now pretty complex to follow. filemap_check_errors in mm/filemap.c now does:
if (test_bit(AS_EIO, &mapping->flags) &&
test_and_clear_bit(AS_EIO, &mapping->flags))
ret = -EIO;
which has much the same effect: error checks all seem to funnel through filemap_check_errors, which does that test-and-clear before returning.
I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up a perf probe on it:
sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp
sudo perf probe end_buffer_async_write
sudo perf probe filemap_check_errors
sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync
I find the following call stack in perf report -T:
---__GI___libc_fsync
entry_SYSCALL_64_fastpath
sys_fsync
do_fsync
vfs_fsync_range
ext4_sync_file
filemap_write_and_wait_range
filemap_check_errors
A read-through suggests that yeah, modern kernels behave the same.
This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.
Test
I've implemented a test case to demonstrate this behaviour.
Implications
A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.
Bug reports
https://bugzilla.kernel.org/show_bug.cgi?id=194755
https://bugzilla.kernel.org/show_bug.cgi?id=194757
Further reading
lwn.net touched on this in the article "Improved block-layer error handling".
postgresql.org mailing list thread.
Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.
I do not agree. write() can return without error if the write is simply queued, but the error will be reported on the next operation that requires the actual write to disk: that means on the next fsync(), possibly on a following write() if the system decides to flush the cache, and at the latest on the final close() of the file.
That is the reason why it is essential for an application to test the return value of close() to detect possible write errors.
If you really need to do clever error processing, you must assume that everything written since the last successful fsync() may have failed, and that at least something in it has failed.
write(2) provides less than you might expect. The man page is very open about the semantics of a successful write() call:
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
We can conclude that a successful write() merely means that the data has reached the kernel's buffering facilities. If persisting the buffer fails, a subsequent access to the file descriptor will return the error code. As a last resort, that may be close(). The man page of the close(2) system call contains the following sentence:
It is quite possible that errors on a previous write(2) operation are first reported at the final close().
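A minimal sketch of acting on that advice:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int checked_close(int fd)
{
    if (close(fd) == -1) {
        /* A deferred error from an earlier write() may surface here. */
        fprintf(stderr, "close: %s (errno %d)\n", strerror(errno), errno);
        return -1;
    }
    return 0;
}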
If your application needs to persist its data right away, it has to use fsync()/fdatasync() on a regular basis:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
Use the O_SYNC flag when you open the file. It ensures the data is written to the disk.
If that does not satisfy you, nothing will.
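A minimal sketch, with a hypothetical file name; with O_SYNC each write() blocks until the data (and required metadata) reach stable storage, so I/O errors surface at the write itself rather than later:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd == -1) { perror("open"); return 1; }
    if (write(fd, "rec\n", 4) != 4)
        perror("write");            /* errors are reported here, not deferred */
    if (close(fd) == -1)
        perror("close");
    return 0;
}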
Check the return value of close. close can fail whilst buffered writes appear to succeed.

AIO in C on Unix - aio_fsync usage

I can't understand what this function aio_fsync does. I've read man pages and even googled but can't find an understandable definition. Can you explain it in a simple way, preferably with an example?
aio_fsync is just the asynchronous version of fsync; when either has completed, all data is written back to the physical drive media.
Note 1: aio_fsync() simply starts the request; the fsync()-like operation is not finished until the request is completed, similar to the other aio_* calls.
Note 2: only the aio_* operations already queued when aio_fsync() is called are included.
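A minimal usage sketch, assuming POSIX AIO as provided by glibc (link with -lrt on older systems); the busy-wait loops are only for brevity, real code would use aio_suspend() or a completion notification:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return 1; }

    static char msg[] = "log record\n";
    struct aiocb wr;
    memset(&wr, 0, sizeof wr);
    wr.aio_fildes = fd;
    wr.aio_buf    = msg;
    wr.aio_nbytes = sizeof msg - 1;
    if (aio_write(&wr) == -1) { perror("aio_write"); return 1; }
    while (aio_error(&wr) == EINPROGRESS)
        ;                                   /* busy-wait for brevity only */
    if (aio_return(&wr) == -1) { perror("aio_write result"); return 1; }

    /* Queue a sync of the operations already queued on fd (Note 2). */
    struct aiocb sy;
    memset(&sy, 0, sizeof sy);
    sy.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &sy) == -1) { perror("aio_fsync"); return 1; }
    while (aio_error(&sy) == EINPROGRESS)
        ;                                   /* the sync completes later (Note 1) */
    if (aio_return(&sy) == -1) perror("aio_fsync result");

    close(fd);
    return 0;
}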
As your comment mentioned, if you don't use fsync or aio_fsync, the data will still appear in the file after your program ends. However, if the machine is abruptly powered off, it will very likely not be there.
This is because when you write to a file, the OS actually writes to the page cache, which is a copy of disk sectors kept in RAM, not to the disk itself. Of course, even before it is written back to the disk, you can still see the data in RAM. When you call fsync() or aio_fsync(), it ensures that write()s, aio_write()s, etc. to all parts of that file are written back to the physical disk, not just RAM.
If you never call fsync(), etc., the OS will eventually write the data back to the drive whenever it has spare time to do it. An orderly OS shutdown should do it as well.
I would say you should usually not worry about calling these manually unless you need to ensure that your data, say a log record, is flushed to the physical disk and needs to be more likely to survive an abrupt system crash. Clearly database engines do this for transactions and journals.
However, there are other reasons the data may not survive a crash, and it is very complex to ensure absolute consistency in the face of failures. So if your application does not absolutely need it, it is perfectly reasonable to let the OS manage this for you. For example, if the compiler's output .o file ended up incomplete or corrupt because you power-cycled the machine in the middle of a compile or shortly after, it would not surprise anyone; you would just restart the build.

C BZ2_bzDecompress way slower than bzip2 command

I'm using mmap/read + BZ2_bzDecompress to sequentially decompress a large file (29 GB). This is done because I need to parse the uncompressed XML data but only need small bits of it, and it seemed much more efficient to do this sequentially than to uncompress the whole file (400 GB uncompressed) and then parse it. Interestingly, the decompression part alone is already extremely slow: while the shell command bzip2 manages a bit more than 52 MB per second (measured with several runs of timeout 10 bzip2 -c -k -d input.bz2 > output, dividing the produced file size by 10), my program manages not even 2 MB/s, slowing down after a few seconds to 1.2 MB/s.
The file I'm trying to process uses multiple bz2 streams, so I check the return value of BZ2_bzDecompress for BZ_STREAM_END, and when it occurs I call BZ2_bzDecompressEnd( strm ) and BZ2_bzDecompressInit( strm, 0, 0 ) to restart with the next stream, in case the file hasn't been completely processed yet. I also tried it without BZ2_bzDecompressEnd, but that didn't change anything (and I can't really see in the documentation how multiple streams should be handled correctly).
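For reference, a rough sketch of that multi-stream loop with caller-managed buffers and abbreviated error handling; a real version would drain the output buffer and refill the input buffer instead of stopping:

#include <bzlib.h>
#include <string.h>

/* Decompress possibly-concatenated bz2 streams from 'in' into 'out'.
   Returns 0 on success, -1 on error. */
int decompress_streams(char *in, unsigned in_len, char *out, unsigned out_cap)
{
    bz_stream strm;
    memset(&strm, 0, sizeof strm);
    if (BZ2_bzDecompressInit(&strm, 0, 0) != BZ_OK)
        return -1;
    strm.next_in  = in;   strm.avail_in  = in_len;
    strm.next_out = out;  strm.avail_out = out_cap;

    for (;;) {
        int rc = BZ2_bzDecompress(&strm);
        if (rc == BZ_STREAM_END) {
            /* Save cursors before End() tears down the stream state. */
            char *ni = strm.next_in;   unsigned ai = strm.avail_in;
            char *no = strm.next_out;  unsigned ao = strm.avail_out;
            BZ2_bzDecompressEnd(&strm);
            if (ai == 0)
                return 0;                   /* input fully consumed */
            memset(&strm, 0, sizeof strm);  /* next concatenated stream */
            if (BZ2_bzDecompressInit(&strm, 0, 0) != BZ_OK)
                return -1;
            strm.next_in  = ni;  strm.avail_in  = ai;
            strm.next_out = no;  strm.avail_out = ao;
        } else if (rc != BZ_OK) {
            BZ2_bzDecompressEnd(&strm);
            return -1;
        } else if (strm.avail_out == 0 || strm.avail_in == 0) {
            /* A full version would drain 'out' or refill 'in' here. */
            BZ2_bzDecompressEnd(&strm);
            return 0;
        }
    }
}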
The file is mmap'ed beforehand, and I tried different combinations of flags, currently MAP_RDONLY, MAP_PRIVATE with madvise set to MADV_SEQUENTIAL | MADV_WILLNEED | MADV_HUGEPAGE (I'm checking the return value, and madvise does not report any problems; I'm on a Linux kernel 3.2.x Debian setup which has hugepage support).
When profiling I made sure that, other than some counters for measuring speed and a printf limited to once every n iterations, nothing else ran. Also, this is on a modern multi-core server processor where all other cores were idle, and it's bare metal, not virtualized.
Any ideas on what I could be doing wrong / do to improve performance?
Update: Thanks to James Chong's suggestion I tried "swapping" mmap() with read(), and the speed is still the same. So it seems mmap() is not the problem (either that, or mmap() and read() share an underlying problem)
Update 2: Thinking that maybe the malloc/free calls done in bzDecompressInit/bzDecompressEnd would be the cause, I set bzalloc/bzfree of the bz_stream struct to a custom implementation which only allocates memory the first time and does not free it unless a flag is set (passed by the opaque parameter = strm.opaque). It works perfectly fine, but again the speed did not increase.
Update 3: I also tried fread() instead of read() now, and still the speed stays the same. Also tried different amount of read bytes and decompressed-data-buffer sizes - no change.
Update 4: Reading speed is definitely not an issue, as I've been able to achieve speeds close to about 120MB/s in sequential reading using just mmap().
Swapping and the mmap flags have little to do with it. If bzip2 is slow, it is not because of file I/O.
I think your libbz2 wasn't fully optimized. Recompile it with the most aggressive gcc optimization flags you can imagine.
My second idea is some ELF linking overhead. In that case the problem will disappear if you link bz2 in statically. (After that you can think about how to make it fast with a dynamically loaded libbz2.)
Important extension from the future:
Libbz2 must be reentrant, thread-safe and position-independent. That means it is compiled with various C flags, and these flags do not do performance any favors (although they produce much safer code). In an extreme case, I could even imagine a 5-10x slowdown compared to a single-threaded, non-PIC, non-reentrant version.

How to avoid caching effects in read benchmarks

I have a read benchmark and between consecutive runs, I have to make sure that the data does not reside in memory to avoid effects seen due to caching. So far what I used to do is: run a program that writes a large file between consecutive runs of the read benchmark. Something like
./read_benchmark
./write --size 64G --path /tmp/test.out
./read_benchmark
The write program simply writes an array of size 1G 64 times to file. Since the size of the main memory is 64G, I write a file that is approx. the same size. The problem is that writing takes a long time and I was wondering if there are better ways to do this, i.e. avoid effects seen when data is cached.
Also, what happens if I write data to /dev/null?
./write --size 64G --path /dev/null
This way, the write program exits very fast, no I/O is actually performed, but I am not sure if it overwrites 64G of main memory, which is what I ultimately want.
Your input is greatly appreciated.
You can drop all caches using a special file in /proc like this:
echo 3 > /proc/sys/vm/drop_caches
That should make sure cache does not affect the benchmark.
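If the benchmark harness is itself in C, the same poke can be done programmatically (needs root); the kernel documentation recommends running sync first so clean pages can actually be dropped:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();                                   /* flush dirty pages first */
    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd == -1) { perror("open"); return 1; }
    if (write(fd, "3", 1) != 1)               /* 3 = pagecache + dentries/inodes */
        perror("write");
    close(fd);
    return 0;
}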
You can just unmount the filesystem and mount it back. Unmounting flushes and drops the cache for the filesystem.
Use echo 3 > /proc/sys/vm/drop_caches to flush the pagecache, directory entries cache and inodes cache.
You can use the posix_fadvise() call with POSIX_FADV_DONTNEED to tell the kernel to evict certain files from the page cache. You can also use mincore() to verify that the file is not cached. While the drop_caches solution is clearly simpler, this might be better than wiping out the entire cache, as that affects all processes on the box. I don't think you need elevated privileges to use fadvise, while I bet you do for writing to /proc. Here is a good example of how to use fadvise calls for this purpose: http://insights.oetiker.ch/linux/fadvise/
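A sketch of that approach; evict_file() is a hypothetical helper combining posix_fadvise() and mincore():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: drop one file's clean pages from the page cache
   and report how many are still resident afterwards. */
int evict_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    /* Note: POSIX_FADV_DONTNEED skips dirty pages; sync the file first
       if it may have been written recently. Length 0 means whole file. */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0) {
        close(fd);
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        long psz = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + psz - 1) / psz;
        unsigned char *vec = malloc(npages);
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED && vec && mincore(p, st.st_size, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;       /* low bit = page in core */
            fprintf(stderr, "%zu of %zu pages still resident\n",
                    resident, npages);
        }
        if (p != MAP_FAILED) munmap(p, st.st_size);
        free(vec);
    }
    close(fd);
    return 0;
}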
One (crude) way that almost never fails is to simply occupy all that excess memory with another program.
Make a trivial program that allocates nearly all the free memory (while leaving enough for your benchmark app). Then memset() the memory to something to ensure that the OS will commit it to physical memory. Finally, do a scanf() to halt the program without terminating it.
By "hogging" all the excess memory, the OS won't be able to use it as cache. And this works in both Linux and Windows. Now you can proceed to do your I/O benchmark.
(Though this might not go well if you're sharing the machine with other users...)
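A minimal sketch of such a hog; the size to allocate is whatever leaves just enough room for the benchmark:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    size_t gb = (argc > 1) ? strtoul(argv[1], NULL, 10) : 32;
    size_t len = gb << 30;
    char *p = malloc(len);
    if (!p) { perror("malloc"); return 1; }
    memset(p, 0xA5, len);   /* touch every page so the OS commits real memory */
    printf("Holding %zu GiB; press Enter to release...\n", gb);
    getchar();              /* park here while the benchmark runs */
    free(p);
    return 0;
}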

Creating unflushed file output buffers

I am trying to clear up an issue that occurs with unflushed file I/O buffers in a couple of programs, in different languages, running on Linux. The solution of flushing buffers is easy enough, but this issue of unflushed buffers happens quite randomly. Rather than seek help on what may cause it, I am interested in how to create (reproduce) and diagnose this kind of situation.
This leads to a two-part question:
Is it feasible to artificially and easily construct instances where, for a given period of time, one can have output buffers that are known to be unflushed? My searches are turning up empty. A trivial baseline is to hammer the hard drive (e.g. swapping) in one process while trying to write a large amount of data from another process. While this "works", it makes the system practically unusable: I can't poke around and see what's going on.
Are there commands from within Linux that can identify that a given process has unflushed file output buffers? Is this something that can be run at the command line, or is it necessary to query the kernel directly? I have been looking at fsync, sync, ioctl, flush, bdflush, and others. However, lacking a method for creating unflushed buffers, it's not clear what these may reveal.
In order to reproduce for others, an example for #1 in C would be excellent, but the question is truly language agnostic - just knowing an approach to create this situation would help in the other languages I'm working in.
Update 1: My apologies for any confusion. As several people have pointed out, buffers can be in the kernel space or the user space. This helped pinpoint the problems: we're creating big dirty kernel buffers. This distinction and the answers completely resolve #1: it now seems clear how to re-create unflushed buffers in either user space or kernel space. Identifying which process ID has dirty kernel buffers is not yet clear, though.
If you are interested in the kernel-buffered data, then you can tune the VM writeback through the sysctls in /proc/sys/vm/dirty_*. In particular, dirty_expire_centisecs is the age, in hundredths of a second, at which dirty data becomes eligible for writeback. Increasing this value will give you a larger window of time in which to do your investigation. You can also increase dirty_ratio and dirty_background_ratio (which are percentages of system memory, defining the point at which synchronous and asynchronous writeback start respectively).
Actually creating dirty pages is easy - just write(2) to a file and exit without syncing, or dirty some pages in a MAP_SHARED mapping of a file.
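A minimal sketch of the first variant: write some data and exit without syncing, leaving dirty pages behind for the writeback window configured above (the file name is arbitrary):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("dirty.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return 1;

    char buf[4096] = { 42 };
    for (int i = 0; i < 1024; i++)            /* ~4 MiB of dirty pages */
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            return 1;

    /* No fsync()/fdatasync(): the pages stay dirty until writeback,
       which the dirty_* sysctls above can postpone. */
    close(fd);                                /* close() does not imply sync */
    return 0;
}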
A simple program that would have an unflushed buffer would be:
main()
{
printf("moo");
pause();
}
By default, stdio flushes stdout on newlines only when it is connected to a terminal (line-buffered); when stdout is a file or a pipe, it is fully buffered instead.
It is very easy to cause unflushed buffers by controlling the receiving side. The beauty of *nix systems is that everything looks like a file, so you can use special files to do what you want. The easiest option is a pipe. If you just want to control stdout, this is the simplest option: unflushed_program | slow_consumer. Otherwise, you can use named pipes:
mkfifo pipe_file
unflushed_program --output pipe_file
slow_consumer --input pipe_file
slow_consumer is most likely a program you design to read data slowly, or just read X bytes and stop.
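For completeness, a slow_consumer can be as simple as this sketch, throttling to roughly ten bytes per second:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        putchar(c);
        usleep(100000);   /* ~10 bytes/s keeps the writer's buffers backed up */
    }
    return 0;
}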
