Safely Persisting to Disk - database

A few years ago MongoDB caught some heat for having an unsafe default relating to disk persistence (see this question for instance). What measures must a database implementation go through to ensure that writes to disk are safe? Is it sufficient to call fsync() after a write, or must other precautions be taken such as journaling or particular ways of using the disk?

Calling fsync() flushes the file's dirty pages from the buffer cache to the disk. How expensive that is depends on the load on your server: with a large number of dirty pages in the cache, initiating a flush can cause the system to hang or become unresponsive. It is therefore recommended to tune kernel tunables such as vm.dirty_expire_centisecs and vm.dirty_background_ratio so that writes are safe and quick and are not kept in the cache for a long time. Setting the values too low, however, can slow average I/O speed, as constantly trying to write dirty pages out will just trigger the I/O congestion code more frequently.
Alternatively, some databases use Direct I/O, a file system feature whereby file reads and writes go directly between the application and the storage device, bypassing the operating system's caches. Direct I/O, requested with the O_DIRECT flag, is mostly used by applications (such as databases) that manage their own caches.
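As a minimal sketch of the "call fsync() after a write" approach discussed above, the following C function writes a buffer and only reports success once fsync() has returned. The name write_durably is hypothetical, and the sketch assumes a POSIX system; it does not retry on EINTR and it says nothing about caches inside the drive itself.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to path and fsync() before reporting success.
 * Returns 0 on success, -1 on any error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Ask the kernel to push the file's dirty pages and metadata to the device. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller would treat a return value of -1 as "the data may not be on disk" and retry or report the failure.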

Related

fflush, fsync and sync vs memory layers

I know there are already similar questions, and I looked at them, but I couldn't find an explicit, unambiguous answer to my question. I was investigating these functions and their relationship with memory layers. In particular, I found this beautiful article that gave me good insight into memory layers.
It seems that fflush() moves data from the application to kernel filesystem buffer and it's ok, everyone seems to agree on this point. The only thing that left me puzzled was that in the same article they assumed a write-back cache saying that with fsync() "the data is saved to the stable storage layer" and after they added that "the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage"
Reading here and there it seems like the truth is that fsync() and sync() let the data enter the storage device but if this one has caching layers it is just moved here, not at once to permanent storage and data may even be lost if there is a power failure. Unless we have a filesystem with barriers enabled and then "sync()/fsync() and some other operations will cause the appropriate CACHE FLUSH (ATA) or SYNCHRONIZE CACHE (SCSI) commands to be sent to the device" [from your website answer]
Questions:
if the data to be updated are already in the kernel buffers and my device has a volatile cache layer in write-back mode is it true, like said by the article, that operations like fsync() [and sync() I suppose] synchronize data to the stable memory layer skipping the volatile one? I think this is what happens with a write-through cache, not a write-back one. From what I read I understood that with a write-back cache on fsync() can just send data to the device that will put them in the volatile cache and they will enter the permanent memory only after
I read that fsync() works with a file descriptor, and therefore with a single file, while sync() flushes all buffers, so it applies to all data waiting to be written. I also read on this page that fsync() waits for the write to the disk to finish, while sync() does not. Are there other differences between the two in how data moves between memory layers?
Thanks to those who will try to help
1. As you correctly concluded from your research fflush synchronizes the user-space buffered data to kernel-level cache (since it's working with FILE objects that reside at user-level and are invisible to kernel), whereas fsync or sync (working directly with file descriptors) synchronize kernel cached data with device. However, the latter comes without a guarantee that the data has been actually written to the storage device — as these usually come with their own caches as well. I would expect the same holds for msync called with MS_SYNC flag as well.
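The two layers described in this point can be sketched in C as follows. The helper name flush_stream_to_device is hypothetical; fflush(), fileno() and fsync() are the standard POSIX calls. As the answer notes, a successful return still does not prove the data has left the drive's own cache.

```c
#include <stdio.h>
#include <unistd.h>

/* Push buffered stdio data down to the device:
 * fflush() moves the user-space FILE buffer into the kernel cache,
 * fsync() then asks the kernel to write that cache to the device. */
int flush_stream_to_device(FILE *fp)
{
    if (fflush(fp) != 0)
        return -1;
    return fsync(fileno(fp));
}
```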
Relatedly, I find the distinction between synchronized and synchronous operations very useful when talking about the topic. Here's how Robert Love puts it succinctly:
A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. [...] A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers.
With that in mind you can call open with O_SYNC flag (together with some other flag that opens the file with a write permission) to enforce synchronized write operations. Again, as you correctly assumed this will work only with WRITE THROUGH disk caching policy, which effectively amounts to disabling disk caching.
You can read this answer about how to disable disk caching on Linux. Be sure to also check this website which also covers SCSI-based in addition to ATA-based devices (to read about different types of disks see this page on Microsoft SQL Server 2005, last updated: Apr 19, 2018).
Speaking of which, it is very informative to read about how the issue is dealt with on Windows machines:
To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
Apparently, this is how Microsoft SQL Server 2005 family ensures data integrity:
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server. [...]
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
I'm saying this is informative in particular because of this blog post from 2012 showing that some SATA disks ignore the FILE_FLAG_WRITE_THROUGH! I don't know what the current state of affairs is, but it seems that in order to ensure that writing to a disk is truly synchronized, you need to:
Disable disk caching using your device drivers.
Make sure that the specific device you're using supports write-through/no-caching policy.
However, if you're looking for a guarantee of data integrity you could just buy a disk with its own battery-based power supply that goes beyond capacitors (which is usually only enough for completing the on-going write processes). As put in the conclusion in the blog article mentioned above:
Bottom-line, use Enterprise-Class disks for your data and transaction log files. [...] Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement.
2. To (partially) answer the second question, this is from the man pages SYNC(2):
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
This would imply that fsync and sync work differently; note, though, that both are declared in unistd.h, which suggests some consistency between them. In any case, I would follow Robert Love, who does not recommend using the sync syscall when writing your own code:
The only real use for sync() is in the implementation of the sync utility. Applications should use fsync() and fdatasync() to commit to disk the data of only the requisite file descriptors. Note that sync() may take several minutes or longer to complete on a busy system.
"I don't have any solution, but certainly admire the problem."
From all I read in your good references, there is no standard. The standard ends somewhere in the kernel. The kernel controls the device driver, and the device driver (possibly supplied by the disk manufacturer) controls the disk through an API (the device has a small computer on board). The manufacturer may or may not have added capacitors or a battery with just enough power to flush the device buffers in case of power failure. The device may provide a sync function, but whether this truly syncs (flushes) the device buffers is not known (it is device dependent). So unless you select and install a device according to your specifications (and verify those specs), you are never sure.
This is a fair problem. Even after handling error conditions, you cannot be sure that the data is actually present in your storage.
man page of fsync explains this issue clearly!! :)
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
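One way to use this from portable C, sketched under the assumption that F_FULLFSYNC is only defined on macOS, is to try the fcntl where it exists and fall back to fsync() elsewhere. The function name full_sync is made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Flush fd as far down as the platform allows.
 * Where F_FULLFSYNC is defined (macOS), ask the drive to flush its cache;
 * elsewhere, or if that request fails, fall back to plain fsync(). */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    return fsync(fd);
}
```

The fallback is a design choice: some file systems reject F_FULLFSYNC, and a plain fsync() is then the best remaining option.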
Yes, fflush() ensures the data leaves the process memory space, but it may be in dirty pages of RAM awaiting write back. This is proof against app abort, but not system crash or power failure. Even if the power is backed up, the system could crash due to some software vulnerability! As mentioned in other answers/comments, getting the data from dirty pages written to disk magnetically (or however SSDs do it), not stuck in some volatile buffer in the disk controller or drive, takes a combination of the right calls or open options and the right controllers and devices! Calls give you more control over the overhead, letting you write more in bulk at the end of a transaction.
RDBMS, for instance, need to worry not only about the database holding files but even more about the log files that allow recovery, both after disk loss and on any RDBMS restart after a crash. In fact, some may be more sync'd in the log than the database, to preserve speed, since recovery is not a frequent process and not usually a long one. Things written to the log by transactions are guaranteed to be recoverable if the log is intact.

Embedded File System and power-off

I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
1. Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
2. Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
3. Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read-only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail-safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change, mount it as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
4. If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4d. Application Data
5. Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
6. Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
7. Develop automated corruption detection and recovery methods. - All of the above techniques will not help you if the application simply hangs because of a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
8. Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
9. If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
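The primary/backup idea for configuration settings in the list above could look roughly like the C sketch below. Everything here is an assumption made for illustration: the function names, the file layout (payload followed by a single checksum byte), and the very weak additive checksum, which a real product would likely replace with a CRC.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: payload bytes followed by one checksum byte
 * (sum of all payload bytes modulo 256). */
static uint8_t checksum(const unsigned char *p, size_t n)
{
    uint8_t sum = 0;
    while (n--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/* Read one config file into buf. Returns the payload length, or -1 if the
 * file is missing, empty, larger than cap, or fails its checksum. */
static long read_checked(const char *path, unsigned char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, cap, fp);
    int too_big = (n == cap) && (fgetc(fp) != EOF);
    fclose(fp);
    if (n < 1 || too_big)
        return -1;
    if (checksum(buf, n - 1) != buf[n - 1])
        return -1;
    return (long)(n - 1);
}

/* Try the primary copy first and fall back to the backup copy. */
long load_config(const char *primary, const char *backup,
                 unsigned char *buf, size_t cap)
{
    long n = read_checked(primary, buf, cap);
    if (n < 0)
        n = read_checked(backup, buf, cap);
    return n;
}
```

If both copies fail, the caller gets -1 and can fall back to built-in defaults and signal the fault to the user.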
Addition to msemack's response, unfortunately my rating is too low to post a comment to his answer vs. a separate answer.
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See the previous answers for file system options that help mitigate this. However, if ATA flush/sleep aren't properly implemented on your device, you may run into the scenario we did. In our scenario the device was corrupted beyond the file system, and fdisk/format would not recover the device.
Instead, an ATA security-erase was required to recover the device once corruption occurred. To avoid this, we issued an ATA sleep command prior to power loss. This required a hold-up time of 400 ms to cover the 160 ms that the ATA sleep took and to leave some headroom for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm used to issue ATA security-erase, as only method of recovery for some corruption scenarios.
For a non-journalling filesystem, an unexpected power-off can mean corruption of certain data, including the directory structure. This happens if there is unsaved data in the cache, or if the FS is in the middle of a multi-block update and the interruption happens when only some blocks have been written.
Journalling mostly addresses this problem: if there is an interruption in the middle, a recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem to a consistent state. However, this state is not always the latest one, i.e. if there was some data in the memory cache, it can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces possibility of the data loss but doesn't solve the problem completely, as journalling will work as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and on whether it is acceptable to lose some data at power-off based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the application side, if you don't have critical data that absolutely needs to be written before shutting down, there is no need for a specific power-off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical application data to save before power-off, I think you have answered the question yourself. The only way to secure unattended power-off is to have brown-out detection that alerts your embedded device, coupled with some hardware circuitry that keeps delivering enough power to the device to perform the shutdown procedure.
The FAT file system is particularly prone to corruption if a write is in progress or a file is open on shutdown, specifically if there is a buffered operation that has not been flushed. On one project I worked on, the solution was to run a file system integrity check and repair (essentially chkdsk/scandisk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt an open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.
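A minimal sketch of the open-write-close approach, using only standard C stdio so it should port to most embedded file system libraries that provide a stdio layer. The function name append_line is hypothetical.

```c
#include <stdio.h>

/* Append one line to a file using the open-write-close pattern, so the
 * file is never left open between records.
 * Returns 0 on success, -1 on error. */
int append_line(const char *path, const char *line)
{
    FILE *fp = fopen(path, "a");
    if (fp == NULL)
        return -1;
    int ok = fputs(line, fp) >= 0 && fputc('\n', fp) != EOF;
    /* fclose() flushes the stdio buffer; check it, since a failed flush
     * means the data never reached the file system. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For the open-write-flush variant, the file would stay open and each write would be followed by fflush(fp) instead of fclose(fp).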

After how many seconds are file system write buffers typically flushed?

Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reason why I don't want to rely on fsync() is that most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centisecs. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
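On Linux, the value in effect can be read from the running system rather than assumed. The sketch below is only illustrative: the function name is hypothetical, and it takes the path as a parameter so it can be pointed at /proc/sys/vm/dirty_expire_centisecs or at any other tunable expressed in centiseconds.

```c
#include <stdio.h>

/* Read a tunable stored in centiseconds, for example
 * /proc/sys/vm/dirty_expire_centisecs, and return it in seconds.
 * Returns -1.0 if the file cannot be read or parsed. */
double read_centisecs_as_seconds(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1.0;
    long cs;
    int ok = fscanf(fp, "%ld", &cs) == 1;
    fclose(fp);
    return ok ? cs / 100.0 : -1.0;
}
```

This only reports when the kernel considers a dirty page old enough to write back; it says nothing about how long the drive then holds the data in its own cache.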
Let me restate your problem in only slightly uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS-specific syncing mechanism. As PostgreSQL's reliability docs put it:
When the operating system sends a write request to the storage hardware, there is little it can do to make sure the data has arrived at a truly non-volatile storage area. Rather, it is the administrator's responsibility to make certain that all storage components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control over when the OS writes to the hard drive, so use that possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think you should tell the user/admin that they must take care to choose the right hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe it's up to you to run a series of tests with different hard drives and Brad Fitzpatrick's tool to get a good estimate of when hard drives will have written all data. But of course, if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far as a software solution goes, I think the answer is to trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question. (In my case, I have an Android app which lazy-writes data back to disk, and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps. It may hurt performance, but losing data kills performance a whole lot more if you consider the hours it takes to recover it.)

Guarantees in write ahead logging implementation

If one were to issue a sequential series of write(2) calls in Linux/Unix separated by fdatasync(2) or fsync(2) or sync(2), is it guaranteed that the first write() will be committed to disk before the second write()? The following SO post seems to say that such guarantees cannot be given, since there are multiple caching layers involved. For database systems which guarantee consistency this seems to be important, since in WAL (Write Ahead Logging) recovery, you'd need your logs to be persisted on disk before actually changing your data, so that in the event of an application/system failure you can revert to your last known consistent state. How is this ensured/implemented in an actual database system?
The sync() system call is practically no help whatsoever; it promises to schedule the write-to-disk operations, but that's about all.
The normal technique used is to set the correct options when you open() the file descriptor for the disk file: O_DSYNC, O_RSYNC, O_SYNC. However, fsync() and fdatasync() get pretty close to the same effects. You can also look at direct I/O (O_DIRECT on Linux; the name and mechanism vary between systems), which is often supported, though it is not standardized at all by POSIX.
Ultimately, the DBMS relies on the O/S to guarantee that data written and synchronized to disk is secure. As long as the device will always return what the DBMS last wrote, even if it is not on actual disk yet because of caching (because it is backed up in non-volatile cache, or something like that), then it isn't critical. If, on the other hand, you have NAS (network attached storage) that doesn't guarantee that what you last wrote (and were told was safe on disk) is returned when you read it, then your DBMS can suffer if it has to do recovery. So, you choose where you store your DBMS with care, making sure the storage works sensibly. If the storage does not work sufficiently like the hypothetical disk, you can end up losing data.
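The write-ahead ordering the question asks about can be sketched in C as below. This is a simplified illustration under stated assumptions, not how any particular DBMS does it: the function names are hypothetical, the log record and the data update are the same bytes, short writes are treated as errors, and fdatasync() is assumed to be available (as it is on Linux).

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open (or create) the log for appending and the data file for in-place updates. */
int open_log(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

int open_data(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Apply one update with write-ahead ordering:
 * 1. append the record to the log and fdatasync() the log;
 * 2. only then overwrite the data file in place.
 * Returns 0 on success, -1 on error. */
int wal_update(int log_fd, int data_fd, off_t data_off,
               const void *rec, size_t len)
{
    if (write(log_fd, rec, len) != (ssize_t)len)
        return -1;
    if (fdatasync(log_fd) != 0)      /* log must be durable before data changes */
        return -1;
    if (pwrite(data_fd, rec, len, data_off) != (ssize_t)len)
        return -1;
    return 0;
}
```

The data file itself does not need to be synced on every update in this scheme, because after a crash the log can be replayed; it only needs to be synced before the corresponding log records are discarded.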
Yes, fsync in modern versions of the kernel does both flush memory (buffer cache) to disk and disk hardware buffer to platter. Man page says older kernels used to only do the first thing.
DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).
The fsync() implementations in older kernels and lesser used filesystems does not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.
This refers to what applications can request. fsync is an interface that filesystems provide to applications; filesystems themselves use something else underneath. To commit the journal, they use barriers, or rather explicit flushes and FUA requests. See the LWN post.

implementing high performance distributed filesystem/database

I need to implement the fastest possible way to store a key/value pair in a distributed system on Linux. Records of the database are tiny, 256 bytes on average.
I am thinking to use open(), write() and read() system calls and write the key-value pairs directly at some offset in the file. I can omit fdatasync() system call since I will be using SSD disk with battery, so I don't have to worry about ACID compliance if an unexpected shutdown of the system happens.
Linux already provides disk cache implementation, so no reads/writes will happen on sectors that were already loaded in memory. This (i think) would be the fastest way to store data, much faster than any other cache capable database engine like for example GT.M or Intersystem's Globals.
However the data is not replicated, and to achieve replication, I can mount a filesystem of another Linux server with NFS and copy the data there, so for example, if I have 2 data servers (1 local and 1 remote), I would issue 2 open(), 2 write() and 2 close() calls. If a transaction fails on remote server, I would mark it as "out of sync" and simply copy the good file again when the remote server comes back.
What do you think of this approach? Will it be fast? I can use NFS over UDP so I will avoid the TCP Stack overhead.
Advantage list so far goes like this:
Linux disk cache reused
Few lines of code
High performance
I will be coding this in C. To locate the record in the file I will keep a btree in memory with a pointer to physical location.
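The storage layer described in the question might be sketched as follows. The fixed 256-byte slot size is an assumption taken from the stated average record size, and the function names are hypothetical; the in-memory btree that maps keys to slot numbers is left out.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SLOT_SIZE 256   /* assumed fixed record size */

/* Open (or create) the store file for reading and writing. */
int open_store(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Write one record at the offset implied by its slot number. */
int put_record(int fd, unsigned long slot, const char rec[SLOT_SIZE])
{
    off_t off = (off_t)slot * SLOT_SIZE;
    return pwrite(fd, rec, SLOT_SIZE, off) == SLOT_SIZE ? 0 : -1;
}

/* Read one record back; fails if the slot lies beyond the end of the file. */
int get_record(int fd, unsigned long slot, char rec[SLOT_SIZE])
{
    off_t off = (off_t)slot * SLOT_SIZE;
    return pread(fd, rec, SLOT_SIZE, off) == SLOT_SIZE ? 0 : -1;
}
```

Using pwrite() and pread() avoids a separate lseek() per operation and keeps the file open across transactions, so each access costs a single system call.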
A few suggestions come to mind.
is it necessary to open()/write()/close() for every transaction? the system call overhead of open() in particular is probably non-trivial
could you use mmap() instead of explicit write()s?
if you're doing 2 write() calls (1 local, 1 NFS) for each transaction, it seems like any kind of network problem (latency, dropped packets, etc.) has the potential to bring your application to a screeching halt if you're waiting for the NFS write() call to succeed. And if you're not waiting, for example by doing the NFS writes from a separate thread, your complexity will rapidly grow (I don't think "Few lines of code" will remain true.)
In general, I would suggest that you really prove to yourself that the available tools don't meet your performance requirements before choosing to re-invent this particular wheel.
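The mmap() suggestion above can be illustrated with a short C sketch. The function name mmap_store is hypothetical, and mapping the file on every call is done only to keep the example self-contained; a real program would presumably map once and keep the mapping.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Store len bytes at offset off of a file through a shared mapping, then
 * msync(MS_SYNC) so the call returns only after the kernel has written the
 * pages back. The file is sized to exactly map_len bytes.
 * Returns 0 on success, -1 on error. */
int mmap_store(const char *path, size_t map_len, size_t off,
               const void *rec, size_t len)
{
    if (len > map_len || off > map_len - len)
        return -1;                       /* record would not fit in the mapping */
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)map_len) != 0) {
        close(fd);
        return -1;
    }
    char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (base == MAP_FAILED)
        return -1;
    memcpy(base + off, rec, len);
    int rc = msync(base, map_len, MS_SYNC);
    munmap(base, map_len);
    return rc;
}
```

With a long-lived mapping, updates become plain memory copies, and msync() can be called once per batch instead of once per record.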
You might look into a real distributed filesystem rather than using NFS, which as you point out, still provides a single point of failure and no replication.
The Andrew File System (AFS) originally developed by CMU may be a solution for you. It's a commercial product, but you might check out OpenAFS which works on linux (and other systems).
Warning though: AFS has a learning curve.
