Is the data written via write (or fwrite) guaranteed to be persisted to the disk in a sequence manner? In particular in relation to fault tolerance. If the system should fail during the write, will it behave as though first bytes were written first and writing stopped mid-stream (as opposed to random blocks written).
Also, are sequential calls to write/fwrite guaranteed to be sequential? According to POSIX I find only that a call to read is guaranteed to consider a previous write.
I'm asking as I'm creating a fault tolerant data store that persists to disks. My logical order of writing is such that faults won't ruin the data, but if the logical order isn't being obeyed I have a problem.
Note: I'm not asking if persistence is guaranteed. Only that if my calls to write do eventually persist they obey the order in which I actually write.
The POSIX docs for write() state that "If the O_DSYNC bit has been set, write I/O operations on the file descriptor shall complete as defined by synchronized I/O data integrity completion". Presumably, if the O_DSYNC bit isn't set, then the synchronization of I/O data integrity completion is unspecified. POSIX also says that "This volume of POSIX.1-2008 is also silent about any effects of application-level caching (such as that done by stdio)", so I think there is no guarantee for fwrite().
I am not an expert, but I might know enough to point you in the right direction:
The most disastrous case is if you lose power, so that is the only one worth considering.
Start with a file with X bytes of meaningful content, and a header that indicates it.
Write Y bytes of meaningful content somewhere that won't invalidate X.
Call fsync (slow!).
Update the header (probably has to be less than your disk's block size).
I don't know if changing the length of a file is safe. I don't know how much depends on the filesystem mount mode, except that any "safe" mode is probably completely unusable for systems need to have even a slight level of performance.
Keep in mind that on some systems the fsync call lies and just returns without doing anything safely. You can tell because it returns quickly. For this reason, you need to make pretty large transactions (i.e. much larger than application-level transactions).
Keep in mind that the kind of people who solve this problem in the real world get paid high 6 figures at least. The best answer for the rest of us is either "just send the data to postgres and let it deal with it." or "accept that we might have to lose data and revert to an hourly backup."
No, in general as far as POSIX and reality are concerned, filesystems do not provide such guarantee. Order of persistence (in which disk makes them permanent on the platter) is not dictated by order in which syscalls were made, or position within the file, or order of sectors on disk. Filesystems retain the data to be written in memory for several seconds, hoarding as much as possible, and later send them to disk in batches, in whatever order they seem fit. And regardless of how kernel sends it to disk, the disk itself can reorder writes on its own due to NCQ.
Filesystems have means of ensuring some ordering for safety. Barriers were used in the past, and explicit flushes and FUA requests nowadays. There is a good article on LWN about that. But those are used by filesystems, not applications.
I strongly suggest reading the article about Application Level Consistency. Not sure how relevant to you but it shows many behaviors that developers wrongly assumed in the past.
Answer from o11c is a good way to do it.
Yes, so long as we are not talking about adding in the complexity of multithreading. It will be in the same order on disk, for what makes it to disk. It buffers to memory and dumps that memory to disk when it fills up, or you close the file.
Related
The Linux man-page for aio_write says
The buffer area being written out must not be accessed during the operation or undefined results may occur.
My emphasis on "accessed", which strictly interpreted is not only stores to the buffer, but also loads from the buffer.
The man-page on Mac OS X says
Modifications of the Asynchronous I/O Control Block structure or the buffer contents after the request has been enqueued, but before the request has completed, are not allowed.
This sounds slightly more reasonable; the buffer can be read from, but not modified. The consequences of a violation are still left vague, though.
Thinking about how this might be implemented in the OS, I can't why a read access would ever be a problem, and the only problem I can imagine from a concurrent write would be that the actual data written could be some arbitrary mix of the initial buffer contents and the concurrent stores to the buffer.
Undefined behaviour opens up a lot of possibilities, however, and with that in mind we could get SIGSEGV on access (the underlying page was locked to prevent concurrent access?), or reads could return garbage data (the file system does in-place encryption or compression?), or the file could be left with permanently unreadable blocks (block checksummed, then modified concurrently, then written out?). Undefined behaviour does not even exclude crashing the storage device firmware, or the OS.
My question is, what could actually, reasonably happen, given the systems and hardware we have? I assume the language is left intentionally vague to not constrain future implementations.
Linux, BSD (MacOS is a BSD flavour), POSIX say different things.
POSIX says:
For any system action that changes the process memory space while an
asynchronous I/O is outstanding to the address range being changed,
the result of that action is undefined.
Linux manual seems more restrictive, two possibilites:
It's matter of interpretation. Author may thought about write accesses but simply wrote accesses,
It may be any access because implementation is free to use any mechanism that may forbid any access (severe locking or protection during IO).
BSD also says:
If the request is successfully enqueued, the value of iocb-_aio_offset
can be modified during the request as context, so this value must not be
referenced after the request is enqueued
thus explicitly forbids some read accesses (to the control structure).
As Martin said in comment: I don't know why anyone would want to access the structs/buffers/s before I/O completion is notified. But it is also too restrictive: ok, that is clear for write access, but one can imagine (while not common) a scenario where you want read access to the buffer during the IO (writing a framebuffer content while displaying it - or alike).
Whatever, if you violate the restrictions anything bad may happen, so don't violate them.
My question is, what could actually, reasonably happen, given the systems and hardware we have? I assume the language is left intentionally vague to not constrain future implementations.
Take a close look at the API:
int aio_write(struct aiocb *aiocbp);
Notice it does not take pointer to const? The warning is quite clear, once you pass the aiocbp parameter to aio_write(), the data belongs to the AIO code until the operation is complete. You could read the data, but what can you reasonably expect its state to be? According the API and the spec, you can't expect anything at all. Even observed behavior could appear to be totally random. In addition, AIO may lock the cache lines for that block for performance (consistency) reasons, any reads from another core could interfere with the performance of the entire system.
In the absence of lock/unlock semantics, anytime you pass non-const data off to another thread of execution, you cannot reasonably expect to consistently read anything from that data block, until whatever API you are using has signaled completion of whatever work you expect it to perform. This is true whether their documentation says so or not.
I'm trying to fine tune mmap() to perform fast writes or reads (generally not both) of a potentially very large file. The writes and reads will be mostly sequential on one pass and then likely very sparse on future passes. No region of memory needs to be accessed more than once.
In other words, think of it as a file transfer with some lossiness that gets fixed asynchronously.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files. Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
I was hoping to offset the cost of these faults by concurrently prefaulting pages while performing the almost-sequential access and paging out pages that I won't need again. But I have three main questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the MAP_POPULATE flag, but this seems to attempt loading the entire file into memory, which is intolerable in many cases. Also, this seems to cause the mmap() call to block until prefaulting is complete, which is also intolerable. My idea for a manual alternative was to spawn a thread simply to try reading the next N pages in memory to force a prefetch. But it might be that madvise with MADV_SEQUENTIAL already does this, in effect.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might be useful if the program is frequently in an "Idle" state of disk IO and can afford to squeeze in some disk writebacks. Then again, the kernel might very well be handling this itself better than the ever application could.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by different threads or processes; if I am wrong about this, then manual prefaulting would not be useful at all. Similarly, if an msync() call blocks all disk IO, both to the filesystem cache and to the raw filesystem, then there also isn't as much of an incentive to use it over flushing the entire disk cache at the program's termination.
It appears, as expected, that the main limitation of mmap()'s performance seems to be the number of minor page faults it generates on large files.
That's not particularly surprising, I agree. But this is a cost that cannot be avoided, at least for the pages corresponding to regions of the mapped file that you actually access.
Furthermore, I suspect the laziness of the Linux kernel's page-to-disk is causing some performance issues. Namely, any test programs that end up performing huge writes to mmaped memory seem to take a long time after performing all writes to terminate/munmap memory.
That's plausible. Again, this is an unavoidable cost, at least for dirty pages, but you can exercise some influence over when those costs are incurred.
I was hoping to offset the cost of these faults by concurrently
prefaulting pages while performing the almost-sequential access and
paging out pages that I won't need again. But I have three main
questions regarding this approach and my understanding of the problem:
Is there a straightforward (preferably POSIX [or at least OSX] compatible) way of performing a partial prefault? I am aware of the
MAP_POPULATE flag, but this seems to attempt loading the entire file
into memory,
Yes, that's consistent with its documentation.
which is intolerable in many cases. Also, this seems to
cause the mmap() call to block until prefaulting is complete,
That's also as documented.
which
is also intolerable. My idea for a manual alternative was to spawn a
thread simply to try reading the next N pages in memory to force a
prefetch.
Unless there's a delay between when you initially mmap() the file and when you want to start accessing the mapping, it's not clear to me why you would expect that to provide any improvement.
But it might be that madvise with MADV_SEQUENTIAL already
does this, in effect.
If you want POSIX compatibility, then you're looking for posix_madvise(). I would indeed recommend using this function instead of trying to roll your own userspace alternative. In particular, if you use posix_madvise() to assert POSIX_MADV_SEQUENTIAL on some or all of the mapped region, then it is reasonable to hope that the kernel will read ahead to load pages before they are needed. Additionally, if you advise with POSIX_MADV_DONTNEED then you might, at the kernel's discretion, get earlier sync to disk and overall less memory use. There is other advice you can pass by this mechanism, too, if it is useful.
msync() can be used to flush changes to the disk. However, is it actually useful to do this periodically? My idea is that it might
be useful if the program is frequently in an "Idle" state of disk IO
and can afford to squeeze in some disk writebacks. Then again, the
kernel might very well be handling this itself better than the ever
application could.
This is something to test. Note that msync() supports asynchronous syncing, however, so you don't need I/O idleness. Thus, when you're sure you're done with a given page you could consider msync()ing it with flag MS_ASYNC to request that the kernel schedule a sync. This might reduce the delay incurred when you unmap the file. You'll have to experiment with combining it with posix_madvise(..., ..., POSIX_MADV_DONTNEED); they might or might not complement each other.
Is my understanding of disk IO accurate? My assumption is that prefaulting and reading/writing pages can be done concurrently by
different threads or processes; if I am wrong about this, then manual
prefaulting would not be useful at all.
It should be possible for one thread to prefault pages (by accessing them), while another reads or writes others that have already been faulted in, but it's unclear to me why you expect such a prefaulting thread to be able to run ahead of the one(s) doing the reads and writes. If it has any effect at all (i.e. if the kernel does not prefault on its own) then I would expect prefaulting a page to be more expensive than reading or writing each byte in it once.
Similarly, if an msync()
call blocks all disk IO, both to the filesystem cache and to the raw
filesystem, then there also isn't as much of an incentive to use it
over flushing the entire disk cache at the program's termination.
There is a minimum number of disk reads and writes that will need to be performed on behalf of your program. For any given mmapped file, they will all be performed on the same I/O device, and therefore they will all be serialized with respect to one another. If you are I/O bound then to a first approximation, the order in which those I/O operations are performed does not matter for overall runtime.
Thus, if the runtime is what you're concerned with, then probably neither posix_madvise() nor msync() will be of much help unless your program spends a significant fraction of its runtime on tasks that are independent of accessing the mmapped file. If you do find yourself not wholly I/O bound then my suggestion would be to see first what posix_madvise() can do for you, and to try asynchronous msync() if you need more. I'm inclined to doubt that userspace prefaulting or synchronous msync() would provide a win, but in optimization, it's always better to test than to (only) predict.
I was implementing an efficient text file loader and found some good advice from the author of GNU grep in this post:
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
One of things he suggests is to do read() calls of page aligned blocks of data into page aligned buffers. Apparently this allows the kernel to avoid some extra buffering.
I've been searching and I haven't heard anyone else back up this claim. Is it true that calling read() into a page aligned buffer (perhaps allocated with mmap/posix_memalign etc..) is actually more efficient? If its not true, is it something that used to be true? Does it heavily depend on the underlying file system or other factors like that?
Thanks!
Normally, read() will read into a kernel buffer, then copy it to user space. This extra copy is what is being discussed.
Linux supports "direct I/O" via the O_DIRECT flag to open(). This will skip kernel buffering and read directly into the userspace buffer. However, this direct I/O requires aligned accesses and buffers. So I don't think the author of that post meant that magic happens when you're aligned, but rather that if you align carefully, you can use "closer-to-the-metal" techniques to extract more performance.
mmap() is a much easier way to get the same effect. When the mapping is first set up, no I/O happens. When the user first accesses a page in the mapping, a page fault is triggered, which the kernel handles by allocating the user's page and performing the I/O to fill it. No copy. But again, the I/O happens in page-sized chunks, on page-aligned boundaries.
Whether this is a big deal or not depends on how fast memory copies happen relative to the I/O, and what proportion of CPU time is spent copying rather than doing real work. A web server, for instance, often doesn't even have to look at what it's reading: it just writes it out again out a socket (which incurs another copy). That's why a bunch of work has gone into "zerocopy" techniques like system calls sendfile() and splice(). These are specialized workloads. Normally, the buffering is too small an effect to worry about.
As I understand it, if I want to synchronise data to the storage device I can use fsync() to supposedly flush all the OS output caches... but apparently it doesn't guarantee this at all, unlike the documentation tries to deceive you, and the data may not be written to the disk!
This is not very good for many purposes because it can lead to data corruption. How do I use the POSIX libraries (In a portable way if possible) to guarantee that the data has been written (as far as possible) and prevent data corruption?
There is fdatasync() but it is not implemented on OSX, so is there a better and more portable way, or does one have to implement different code on different systems? I'm also not sure if fdatasync() is good enough.
Of-course, in the worst case scenario I could forget about this and use a redundant database library that uses ACID to store the data. I don't want that.
Also I'm interested in how to ensure truncate and rename operations have definitely completed.
Thanks!
You are looking for sync. There is both a program called sync and a system call called sync (man 1 sync and man 2 sync respectively):
#include <unistd.h>
void sync(void);
DESCRIPTION
sync() first commits inodes to buffers, and then buffers to disk.
So it will ensure that all pending operations (truncates, renames etc) are in fact written to the disk.
fsync does not claim to flush all output caches, but instead claims to flush all changes to a particular file descriptor to disk. It explicitly does not ensure that the directory entry is updated (in which case a call to fsync on a filedescriptor for the directory is needed).
fsyncdata is even more useless as it will not flush file metadata and instead will just ensure that the data in the file is flushed.
It is a good idea to trust the manpages. I won't say there are not mistakes, but they tend to be extremely accurate.
I recently came across a bit of not-well-tested legacy code for writing data that's distributed across multiple processes (these are part of an MPI-based parallel computation) into the same file. Is this actually guaranteed to work?
It goes like this:
All processes open the same file for writing.
Each process calls fseek to seek to a different location within the file. This position may be past the end of the file.
Each process then writes a block of data into the file with fwrite. The seek locations
and block sizes are such that these writes completely tile a
section of the file -- no gaps, no overlaps.
Is this guaranteed to work, or will it sometimes fail horribly? There is no locking to serialize the writes, and in fact they are likely to be starting from a synchronization point. On the other hand, we can guarantee that they are writing to different file positions, unlike other questions which have had issues with trying to write to the "end of the file" from multiple processes.
It occurs to me that the processes may be on different machines that mount the file via NFS, which I suspect probably answers my question -- but, would it work if the file is local?
I believe this will typically work but there is no guarantee that I can find. The Posix specifications for fwrite(3) defer to ISO C and neither standard mentions concurrency.
So I suspect it will typically work, but fseek(3) and fwrite(3) are buffered I/O functions, so success will depend on internal details of the library implementation. So, absolutely no guarantee but various reasons to expect that it will work.
Now, should the program use lseek(2) and write(2) then I believe you could consider the results guaranteed, but now it's restricted to Posix operating systems.
One thing seems ... odd ... why would an MPI program decide to share its data via NFS and not the message API? It would seem slower, less portable, more prone to trouble, and generally just a waste of the MPI feature set. It certainly is no more distributed given the reliance on a single NFS server.