implementing high performance distributed filesystem/database - c

I need to implement the fastest possible way to store a key/value pair in a distributed system on Linux. Records of the database are tiny, 256 bytes on average.
I am thinking to use open(), write() and read() system calls and write the key-value pairs directly at some offset in the file. I can omit fdatasync() system call since I will be using SSD disk with battery, so I don't have to worry about ACID compliance if an unexpected shutdown of the system happens.
Linux already provides disk cache implementation, so no reads/writes will happen on sectors that were already loaded in memory. This (i think) would be the fastest way to store data, much faster than any other cache capable database engine like for example GT.M or Intersystem's Globals.
However the data is not replicated, and to achieve replication, I can mount a filesystem of another Linux server with NFS and copy the data there, so for example, if I have 2 data servers (1 local and 1 remote), I would issue 2 open(), 2 write() and 2 close() calls. If a transaction fails on remote server, I would mark it as "out of sync" and simply copy the good file again when the remote server comes back.
What do you think of this approach? Will it be fast? I can use NFS over UDP so I will avoid the TCP Stack overhead.
Advantage list so far goes like this:
Linux disk cache reused
Few lines of code
High performance
I will be coding this in C. To locate the record in the file I will keep a btree in memory with a pointer to physical location.

A few suggestions come to mind.
is it necessary to open()/write()/close() for every transaction? the system call overhead of open() in particular is probably non-trivial
could you use mmap() instead of explicit write()s?
if you're doing 2 write() calls (1 local, 1 NFS) for each transaction, it seems like any kind of network problem (latency, dropped packets, etc.) has the potential to bring your application to a screeching halt if you're waiting for the NFS write() call to succeed. And if you're not waiting, for example by doing the NFS writes from a separate thread, your complexity will rapidly grow (I don't think "Few lines of code" will remain true.)
In general, I would suggest that you really prove to yourself that the available tools don't meet your performance requirements before choosing to re-invent this particular wheel.

You might look into a real distributed filesystem rather than using NFS, which as you point out, still provides a single point of failure and no replication.
The Andrew File System (AFS) originally developed by CMU may be a solution for you. It's a commercial product, but you might check out OpenAFS which works on linux (and other systems).
Warning though: AFS has a learning curve.

Related

fflush, fsync and sync vs memory layers

I know there are already similar questions and I gave them a look but I couldn't find an explicit univocal answer to my question. I was just investigating online about these functions and their relationship with memory layers. In particular I found this beautiful article that gave me a good insight about memory layers
It seems that fflush() moves data from the application to kernel filesystem buffer and it's ok, everyone seems to agree on this point. The only thing that left me puzzled was that in the same article they assumed a write-back cache saying that with fsync() "the data is saved to the stable storage layer" and after they added that "the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage"
Reading here and there it seems like the truth is that fsync() and sync() let the data enter the storage device but if this one has caching layers it is just moved here, not at once to permanent storage and data may even be lost if there is a power failure. Unless we have a filesystem with barriers enabled and then "sync()/fsync() and some other operations will cause the appropriate CACHE FLUSH (ATA) or SYNCHRONIZE CACHE (SCSI) commands to be sent to the device" [from your website answer]
Questions:
if the data to be updated are already in the kernel buffers and my device has a volatile cache layer in write-back mode is it true, like said by the article, that operations like fsync() [and sync() I suppose] synchronize data to the stable memory layer skipping the volatile one? I think this is what happens with a write-through cache, not a write-back one. From what I read I understood that with a write-back cache on fsync() can just send data to the device that will put them in the volatile cache and they will enter the permanent memory only after
I read that fsync() works with a file descriptor and then with a single file while sync() causes a total deployment for the buffers so it applies to every data to be updated. And from this page also that fsync() waits for the end of the writing to the disk while sync() doesn't wait for the end of the actual writing to the disk. Are there other differences connected to memory data transfers between the two?
Thanks to those who will try to help
1. As you correctly concluded from your research fflush synchronizes the user-space buffered data to kernel-level cache (since it's working with FILE objects that reside at user-level and are invisible to kernel), whereas fsync or sync (working directly with file descriptors) synchronize kernel cached data with device. However, the latter comes without a guarantee that the data has been actually written to the storage device — as these usually come with their own caches as well. I would expect the same holds for msync called with MS_SYNC flag as well.
Relatedly, I find the distinction between synchronized and synchronous operations very useful when talking about the topic. Here's how Robert Love puts it succinctly:
A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. [...] A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers.
With that in mind you can call open with O_SYNC flag (together with some other flag that opens the file with a write permission) to enforce synchronized write operations. Again, as you correctly assumed this will work only with WRITE THROUGH disk caching policy, which effectively amounts to disabling disk caching.
You can read this answer about how to disable disk caching on Linux. Be sure to also check this website which also covers SCSI-based in addition to ATA-based devices (to read about different types of disks see this page on Microsoft SQL Server 2005, last updated: Apr 19, 2018).
Speaking of which, it is very informative to read about how the issue is dealt with on Windows machines:
To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
Apparently, this is how Microsoft SQL Server 2005 family ensures data integrity:
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server. [...]
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
I'm saying this is informative in particular because of this blog post from 2012 showing that some SATA disks ignore the FILE_FLAG_WRITE_THROUGH! I don't know what the current state of affairs is, but it seems that in order to ensure that writing to a disk is truly synchronized, you need to:
Disable disk caching using your device drivers.
Make sure that the specific device you're using supports write-through/no-caching policy.
However, if you're looking for a guarantee of data integrity you could just buy a disk with its own battery-based power supply that goes beyond capacitors (which is usually only enough for completing the on-going write processes). As put in the conclusion in the blog article mentioned above:
Bottom-line, use Enterprise-Class disks for your data and transaction log files. [...] Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement.
2. To (partially) answer the second question, this is from the man pages SYNC(2):
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
This would imply that fsync and sync work differently, however, note they're both implemented in unistd.h which suggests some consistency between them. However, I would follow Robert Love who does not recommend using sync syscall when writing your own code.
The only real use for sync() is in the implementation of the sync utility. Applications should use fsync() and fdatasync() to commit to disk the data of only the requisite file descriptors. Note that sync() may take several minutes or longer to complete on a busy system.
"I don't have any solution, but certainly admire the problem."
From all I read from your good references, is that there is no standard. The standard ends somewhere in the kernel. The kernel controls the device driver and the device driver (possibly supplied by the disk manufacturer) controls the disk through an API (device has small computer on board). The manufacturer may have added capacitors/battery with just enough power to flush its device buffers in case of power failure, or he may have not. The device may provide a sync function but whether this truely syncs (flushes) the device buffers is not known (device dependent). So unless you select and install a device according to your specifications (and verify those specs), you are never sure.
This is a fair problem. Even after handling error conditions, you are not safe of the data presence in your storage.
man page of fsync explains this issue clearly!! :)
For applications that require tighter guarantees about the integrity of
their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of writes
should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
Yes, fflush() ensures the data leaves the process memory space, but it may be in dirty pages of RAM awaiting write back. This is proof against app abort, but not system crash or power failure. Even if the power is backed up, the system could crash due to some software vulnerability! As mentioned in other answers/comments, getting the data from dirty pages written to disk magnetically or whatever SSD do, not stuck in some volatile buffer in the disk controller or drive, is a combination of the right calls or open options and the right controllers and devices! Calls give you more control over the overhead, writing more in bulk at the end of a transaction.
RDBMS, for instance, need to worry not only about the database holding files but even more about the log files that allow recovery, both after disk loss and on any RDBMS restart after a crash. In fact, some may be more sync'd in the log than the database, to preserve speed, since recovery is not a frequent process and not usually a long one. Things written to the log by transactions are guaranteed to be recoverable if the log is intact.

Safely Persisting to Disk

A few years ago MongoDB caught some heat for having an unsafe default relating to disk persistence (see this question for instance). What measures must a database implementation go through to ensure that writes to disk are safe? Is it sufficient to call fsync() after a write, or must other precautions be taken such as journaling or particular ways of using the disk?
Calling fsync() would flush the dirty pages in the buffer cache to the disk. This depends on the load on your server, as having a large number of dirty pages in the cache and initiating a flush could causes the system to hung or get to an unresponsive state. However its recommended tune some of the kernel turntables with optimal values for vm.dirty_expire_centisecs, vm.dirty_background_ratio to make sure all writes a safe and quick and not kept in the cache for a long time. Having lower values could slow average I/O speed as constantly trying to write dirty pages out will just trigger the I/O congestion code more frequently.
Alternatively, some of the databases provide Direct I/O as a feature of the file system whereby file reads and writes go directly from the applications to the storage device, bypassing caches. Direct I/O is mostly used in applications (databases) that manage their own caches with the O_DIRECT flag.

Embedded File System and power-off

I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change mount is as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4e. Application Data
Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
Develop automated corruption detection and recovery methods. - All of the above techniques will not help you if the application simply hangs because a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
Addition to msemack's response, unfortunately my rating is too low to post a comment to his answer vs. a separate answer.
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See previous answers for file system options to help mitigate. However if ATA flush/sleep aren't properly implemented on your device you may run into the scenario we did. In our scenario the device was corrupt beyond the file system, and fdisk/format would not recover the device.
Instead an ATA security-erase was required to recover the device once corruption occurs. In order to avoid this, we implemented an ATA sleep command prior to power loss. This required hold-up of 400ms to support the 160ms ATA sleep took, and leave some head room for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm used to issue ATA security-erase, as only method of recovery for some corruption scenarios.
For non-journalling filesystem unexpected turn-off can mean corruption of certain data including directory structure. This happens if there's unsaved data in the cache or if the FS is in the process of writing multi-block update and interruption happens when only some blocks are written.
Journalling addresses this problem mostly - if there's interruption in the middle, recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem to consistent state. However this state is not always the latest - i.e. if there were some data in the memory cache, they can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces possibility of the data loss but doesn't solve the problem completely, as journalling will work as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and if it is acceptable to loose some data at power off based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the applicative side, if you don't have critic data that absolutely needs to be written before shuting down, there is no need for a specific power off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical applicative data to save before power-off, i think you have answered the question yourself. The only way to secure unattended power-off is to have a brown-out detection that alerts your embedded device coupled with some hardware circuitry that allows keeping delivering enought power to the device to perform the shutdown procedure.
The FAT file-system is particularly prone to corruption if a write is in progress or a file is open on shutdown - specifically if ther is a buffered operation that is not flushed . On one project I worked on the solution was to run a file system integrity check and repair (essentially chkdsk/scandsk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt a open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.

After how many seconds are file system write buffers typically flushed?

Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reasons why I don't want to rely on fsync() is: most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centiseconds. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
Let me restate this your problem in only slightly-uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS specific syncing mechanism, and as per PostgreSQL's Reliability Docs.
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control about when the OS is writing to the hard drive, so use the possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think, you should tell the user/admin that he must take care when choosing the right hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe, it's up to you to start a row of tests with different hard drives and Brad Fitzgerald's tool to get a good estimation of when hard drives will have written all data. But of course - if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far software solution goes, I think the answer is, trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned after every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question (in my case, I have an Android apps which lazy-writes data back to disk and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps...it may hurt performance but losing data kills performance a whole lot more if you consider the hours it takes to recover it).

Need data on disk drive management by OS: getting base I/O unit size, “sync” option, Direct Memory Access

I want to ensure I have done all I can to configure a system's disks for serious database use. The three areas I know of (any others?) to be concerned about are:
I/O size: the database engine and disk's native size should either match, or the database's native I/O size should be a multiple of the disk's native I/O size.
Disks that are capable of Direct Memory Access (eg. IDE) should be configured for it.
When a disk says it has written data persistently, it must be so! No keeping it in cache and lying about it.
I have been looking for information on how to ensure these are so for CENTOS and Ubuntu, but can't seem to find anything at all!
I want to be able to check these things and change them if needed.
Any and all input appreciated.
PLEASE NOTE: The actual hardware involved is VERY modest. The point is to get the most out of what hardware we do have, even though it's "not very serious hardware" from a broader perspective.
MORE:
I appreciate the time taken to read and reply, but I'm hoping to get "answers" that aren't just good database / hardware advice but answers that actually address the specific things I asked about. Namely:
1) What's a good easy way to tell what the I/O unit size is that the OS wants to do? How can I change it? (IOW: If this exclusively a file-system-format issue, how can I tell what was used on an already-created file system? I know /etc/fstab will tell me the file system format... In this case, it's ext3.
2) How can I tell if a disk drive has DMA? If so, how can I turn it on? (I've been told that some drives have this capability, but now I want to follow up and ensure that if these drives have it, it's turned on.)
And, finally;
3) How can I tell if a drive is merely telling the writer that their material is written when it's actually still in cache? And, more importantly, how can I set the system to NOT use such features if / when they exist?
Thank you for your insights.
RT
1) Check /sys/block/sdX/queue/{max_hw_sectors_kb,max_sectors_kb}. The first is that max transfer size the hw allows, the other is the current maximum which can be set to any value <= max_hw_sectors_kb
2) hdparm -i /dev/sdX
3) Turn off write-back caching (hdparm can do it), or make sure that the filesystem issues barriers when synchronizing (as in fsync(), or journal commit).
"serious database use" and you mention IDE in the same sentence?
SSDs or 15k SCSI in a many spindle RAID 1+0 array with separate arrays for data, log and backup. Consider a separate array for tempdb too.
You'd also switch the controller cache to 100% read too to avoid caching issues
Of course, if it's "serious" then you'd consider clustering etc: so a SAN comes in useful here but you may not be as quick as local spindles
You didn't include any info on filesystem or database, so here are some misc pointers.
It is inevitable that you will lose a disk eventually, so its equally important to put a good backup and recovery strategy in place, and mirror your transaction logs, so you can handle a disk failure or even full datafile loss.
1) If possible, put at least one copy of your transaction log on a fixed disk. Don't put your sole transaction log to an external storage subsystem. (Assuming you use a db that supports log mirroring).
2) I agree with gbn, in practice, don't use write caching. I've lost databases on RAID arrays with battery backup. Configure the storage controller card for write-through.
3) Raw devices provide guaranteed writes, but its not worth the hassle. Some filesystems provide synchronous write options too, use one if possible. I am partial to VxFS, but I'm from the Sun world. On Linux, btrfs is eminent at least, but for now, Ext3 works fine if you setup your db properly.

Resources