Delayed Write errors - c

For the past few months, we've been losing data to a Delayed Write errors. I've experienced the error with both custom code and shrink-wrap applications. For example, the error message below came from Visual Studio 2008 on building a solution
Windows - Delayed Write Failed : Windows was unable
to save all the data for the file
\Vital\Source\Other\OCHSHP\Done07\LHFTInstaller\Release\LHFAI.CAB. The
data has been lost. This error may be caused by a failure of your
computer hardware or network connection. Please try to save this file
elsewhere.
When it occurs in Adobe, Visual Studio, or Word, for example, no harm is done. The major problem is when it occurs to our custom applications (straight C apps that writes data in dBase files to a network share.)
From the program's perspective, the write succeeds. It deletes the source data, and goes on to the next record. A few minutes later, Windows pops up an error message saying that a delayed write occurred and the data was lost.
My question is, what can we do to help our networking/server teams isolate and correct the problem (read, convince them the problem is real. Simply telling them many, many times hasn't convinced them as of yet) and do you have any suggestions of how we can write to avoid the data loss?

Writes on Windows, like any modern operating system, are not actually sent to the disk until the OS gets around to it. This is a big performance win, but the problem (as you have found) is that you cannot detect errors at the time of the write.
Every operating system that does asynchronous writes also provides mechanisms for forcing data to disk. On Windows, the FlushFileBuffers or _commit function will do the trick. (One is for HANDLEs, the other for file descriptors.)
Note that you must check the return value of every disk write, and the return value of these synchronizing functions, in order to be certain the data made it to disk. Also note that these functions block and wait for the data to reach disk -- even if you are writing to a network server -- so they can be slow. Do not call them until you really need to push the data to stable storage.
For more, see fsync() Across Platforms.

You have a corrupted file system or a hard disk that is failing. The networking/server team should scan the disk to fix the former and detect the latter. Also check the error log to see if it tells you anything. If the error log indicates that failure to write to the hardware then you need to replace the disk.

Related

fflush, fsync and sync vs memory layers

I know there are already similar questions and I gave them a look but I couldn't find an explicit univocal answer to my question. I was just investigating online about these functions and their relationship with memory layers. In particular I found this beautiful article that gave me a good insight about memory layers
It seems that fflush() moves data from the application to kernel filesystem buffer and it's ok, everyone seems to agree on this point. The only thing that left me puzzled was that in the same article they assumed a write-back cache saying that with fsync() "the data is saved to the stable storage layer" and after they added that "the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage"
Reading here and there it seems like the truth is that fsync() and sync() let the data enter the storage device but if this one has caching layers it is just moved here, not at once to permanent storage and data may even be lost if there is a power failure. Unless we have a filesystem with barriers enabled and then "sync()/fsync() and some other operations will cause the appropriate CACHE FLUSH (ATA) or SYNCHRONIZE CACHE (SCSI) commands to be sent to the device" [from your website answer]
Questions:
if the data to be updated are already in the kernel buffers and my device has a volatile cache layer in write-back mode is it true, like said by the article, that operations like fsync() [and sync() I suppose] synchronize data to the stable memory layer skipping the volatile one? I think this is what happens with a write-through cache, not a write-back one. From what I read I understood that with a write-back cache on fsync() can just send data to the device that will put them in the volatile cache and they will enter the permanent memory only after
I read that fsync() works with a file descriptor and then with a single file while sync() causes a total deployment for the buffers so it applies to every data to be updated. And from this page also that fsync() waits for the end of the writing to the disk while sync() doesn't wait for the end of the actual writing to the disk. Are there other differences connected to memory data transfers between the two?
Thanks to those who will try to help
1. As you correctly concluded from your research fflush synchronizes the user-space buffered data to kernel-level cache (since it's working with FILE objects that reside at user-level and are invisible to kernel), whereas fsync or sync (working directly with file descriptors) synchronize kernel cached data with device. However, the latter comes without a guarantee that the data has been actually written to the storage device — as these usually come with their own caches as well. I would expect the same holds for msync called with MS_SYNC flag as well.
Relatedly, I find the distinction between synchronized and synchronous operations very useful when talking about the topic. Here's how Robert Love puts it succinctly:
A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. [...] A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers.
With that in mind you can call open with O_SYNC flag (together with some other flag that opens the file with a write permission) to enforce synchronized write operations. Again, as you correctly assumed this will work only with WRITE THROUGH disk caching policy, which effectively amounts to disabling disk caching.
You can read this answer about how to disable disk caching on Linux. Be sure to also check this website which also covers SCSI-based in addition to ATA-based devices (to read about different types of disks see this page on Microsoft SQL Server 2005, last updated: Apr 19, 2018).
Speaking of which, it is very informative to read about how the issue is dealt with on Windows machines:
To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
Apparently, this is how Microsoft SQL Server 2005 family ensures data integrity:
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server. [...]
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
I'm saying this is informative in particular because of this blog post from 2012 showing that some SATA disks ignore the FILE_FLAG_WRITE_THROUGH! I don't know what the current state of affairs is, but it seems that in order to ensure that writing to a disk is truly synchronized, you need to:
Disable disk caching using your device drivers.
Make sure that the specific device you're using supports write-through/no-caching policy.
However, if you're looking for a guarantee of data integrity you could just buy a disk with its own battery-based power supply that goes beyond capacitors (which is usually only enough for completing the on-going write processes). As put in the conclusion in the blog article mentioned above:
Bottom-line, use Enterprise-Class disks for your data and transaction log files. [...] Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement.
2. To (partially) answer the second question, this is from the man pages SYNC(2):
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
This would imply that fsync and sync work differently, however, note they're both implemented in unistd.h which suggests some consistency between them. However, I would follow Robert Love who does not recommend using sync syscall when writing your own code.
The only real use for sync() is in the implementation of the sync utility. Applications should use fsync() and fdatasync() to commit to disk the data of only the requisite file descriptors. Note that sync() may take several minutes or longer to complete on a busy system.
"I don't have any solution, but certainly admire the problem."
From all I read from your good references, is that there is no standard. The standard ends somewhere in the kernel. The kernel controls the device driver and the device driver (possibly supplied by the disk manufacturer) controls the disk through an API (device has small computer on board). The manufacturer may have added capacitors/battery with just enough power to flush its device buffers in case of power failure, or he may have not. The device may provide a sync function but whether this truely syncs (flushes) the device buffers is not known (device dependent). So unless you select and install a device according to your specifications (and verify those specs), you are never sure.
This is a fair problem. Even after handling error conditions, you are not safe of the data presence in your storage.
man page of fsync explains this issue clearly!! :)
For applications that require tighter guarantees about the integrity of
their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of writes
should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
Yes, fflush() ensures the data leaves the process memory space, but it may be in dirty pages of RAM awaiting write back. This is proof against app abort, but not system crash or power failure. Even if the power is backed up, the system could crash due to some software vulnerability! As mentioned in other answers/comments, getting the data from dirty pages written to disk magnetically or whatever SSD do, not stuck in some volatile buffer in the disk controller or drive, is a combination of the right calls or open options and the right controllers and devices! Calls give you more control over the overhead, writing more in bulk at the end of a transaction.
RDBMS, for instance, need to worry not only about the database holding files but even more about the log files that allow recovery, both after disk loss and on any RDBMS restart after a crash. In fact, some may be more sync'd in the log than the database, to preserve speed, since recovery is not a frequent process and not usually a long one. Things written to the log by transactions are guaranteed to be recoverable if the log is intact.

Embedded File System and power-off

I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change mount is as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4e. Application Data
Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
Develop automated corruption detection and recovery methods. - All of the above techniques will not help you if the application simply hangs because a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
Addition to msemack's response, unfortunately my rating is too low to post a comment to his answer vs. a separate answer.
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See previous answers for file system options to help mitigate. However if ATA flush/sleep aren't properly implemented on your device you may run into the scenario we did. In our scenario the device was corrupt beyond the file system, and fdisk/format would not recover the device.
Instead an ATA security-erase was required to recover the device once corruption occurs. In order to avoid this, we implemented an ATA sleep command prior to power loss. This required hold-up of 400ms to support the 160ms ATA sleep took, and leave some head room for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm used to issue ATA security-erase, as only method of recovery for some corruption scenarios.
For non-journalling filesystem unexpected turn-off can mean corruption of certain data including directory structure. This happens if there's unsaved data in the cache or if the FS is in the process of writing multi-block update and interruption happens when only some blocks are written.
Journalling addresses this problem mostly - if there's interruption in the middle, recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem to consistent state. However this state is not always the latest - i.e. if there were some data in the memory cache, they can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces possibility of the data loss but doesn't solve the problem completely, as journalling will work as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and if it is acceptable to loose some data at power off based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the applicative side, if you don't have critic data that absolutely needs to be written before shuting down, there is no need for a specific power off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical applicative data to save before power-off, i think you have answered the question yourself. The only way to secure unattended power-off is to have a brown-out detection that alerts your embedded device coupled with some hardware circuitry that allows keeping delivering enought power to the device to perform the shutdown procedure.
The FAT file-system is particularly prone to corruption if a write is in progress or a file is open on shutdown - specifically if ther is a buffered operation that is not flushed . On one project I worked on the solution was to run a file system integrity check and repair (essentially chkdsk/scandsk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt a open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.

After how many seconds are file system write buffers typically flushed?

Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reasons why I don't want to rely on fsync() is: most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centiseconds. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
Let me restate this your problem in only slightly-uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS specific syncing mechanism, and as per PostgreSQL's Reliability Docs.
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control about when the OS is writing to the hard drive, so use the possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think, you should tell the user/admin that he must take care when choosing the right hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe, it's up to you to start a row of tests with different hard drives and Brad Fitzgerald's tool to get a good estimation of when hard drives will have written all data. But of course - if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far software solution goes, I think the answer is, trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned after every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question (in my case, I have an Android apps which lazy-writes data back to disk and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps...it may hurt performance but losing data kills performance a whole lot more if you consider the hours it takes to recover it).

error no set to EIO while using pwrite() API

I have remote disks mounted on my system using NFS and I am trying to write to the files on the mounted remote disks using pwrite() API.
It doesn't happen every time but in some cases while doing I/O pwrite() fails and set the error number to EIO(Input/Output error).
Can some one please explain why this error occur on the first place and is there any way I can correct it?
Thanks
From (bad) experiences with reading and writing to NFS based files I learned that you have a good chance to work around this EIO by simply retrying the failed I/O operation (read(), write()).
Also on NFS one can not assume that read()/write() do transfer the amount of data specified, so it is good idea to always check the return value of the function in question on how many byte were transfered.
I see the issue in the underlying NFS functionality or in the way the NFS driver's results are handled by the kernel, so I strongly assume pread()/pwrite() show the same effects as I witnessed when using read()/write().

What's the most efficient file logging in a server written in C?

I wonder what the most efficient file logging strategy would be in a server written in C?
I can see the following options:
fopen() append and then fwrite() the data for a time frame of say 1 hour, then fclose()?
Caching the data and then occasionally open() append write() and close()?
Using a thread is usually a good solution, we adopted it with interesting results.
The main thread that needs to log prepare the log string and passes it to a second thread. To feed the second thread we use a lockless queue + a circular memory in order to minimize amount of alloc/free and wait time.
The secon thread waits for the lockless queue to be available. When it finds there's some job to do, a new slot of the lockless queue is consumed and the data logged.
Using a separate thread you can save a great amount of time.
After we decided to use a secon thread we had to face another problem. Many istances of the same program (a full text serach engine) must log all together on the same file so the resource shoud be regularly shared among every instance of the server.
We could decide to use a semaphore or another syncornizing methiod but we found another solution: the second thread sends an UDP packet to a local log server that listen on a known port. This server reads each message and logs it on the file (the server is actually the only one that owns he file while it's written). The UDP socket itself grants serialization of logs.
I've been using this solution for more than 10 years and never loose a single line of my logs file, using the second thread I also saved a great percentage of time for every operation (we use to log a lot of information for any single command the server receives).
HTH
Why don't you directly log your data when the events occur?
If your server crashes, you want to retrieve those data at the time it crashed. If you only flush your buffered logs once an hour, you'll miss interesting logs.
File streams are usually buffered by the OS.
If you believe it makes your server slow, due to hard drive writing, you might consider to log into a separate thread. But I wonder if it is the problem. Premature optimizations?
Unless you've benchmarked and found that it's a bottleneck, use fopen and fprintf. There's no reason to put your own complex buffering layer on top unless stdio is too slow for you (and if it is too slow, you might consider whether to rethink the OS/C library your server is running).
The slowest part of writing a system log is the output operation to the physical disks.
Buffering and checksumming the log records are necessary to ensure that you don't lose any log data and that the log data can't be tampered with after the fact, respectively.

Resources