I am writing a C program for Linux that reads and writes to files on an NFS server. The share is mounted hard; attempts to access it will block indefinitely until they work. Having my program block indefinitely is bad; it is still capable of doing useful work even if the files are unavailable. Remounting the share soft is not an option.
Having two processes, one of which does work and won't block, and another that handles file IO and might block is an option, but would constitute a Major Change. I'd like to avoid that. Really, I want to say, "I know that you're hard mounted so that naive programs can pretend you're a highly reliable local disk. But I know better and I am prepared to cope with any access failing, similar to the behavior if you were soft mounted." So:
In C, how can I access files on a hard-mounted NFS share, getting errors if the server is unavailable instead of blocking indefinitely?
I can run as root if necessary, but would prefer not. Using root to remount the share is right out. I can potentially rely on new features, but the further back support goes the better.
My research suggests the answer is that it's just not possible, but perhaps I've missed something.
You've not missed anything, you will never receive server unavailable errors because the kernel will never deliver them on hard mounted nfs mount points.
Because the hard option is a property of the mount point, you can't have applications that pick and choose because the kernel isn't set out to behave in that manner.
However, you do mention that you can run the application as root. Why not mount the file system somewhere else soft, and then get your anticipated behaviour?
Related
I have some code that umounts a file system on a device and then immediately removes the device from device-mapper using the DM_DEV_REMOVE ioctl command.
Sometimes, as part of a stress test, I run this code in a tight loop of:
create the device
mount the file system on the device
unmount the file system
remove the device
Often, when running this test over thousands of iterations, I will eventually get the errno EBUSY when trying to remove the device. The umount is always successful.
I have tried searching on this issue, but mostly what I find is people having issues with getting EBUSY when umounting, which is not the problem I am having.
The closest thing to being helpful that I could find is that in the man page for dmsetup it talks about using the --retry option as a workaround for udev rules opening up devices when you are trying to remove them. Unfortunately for me though, I have been able to confirm that udev does not have my device open when I am trying to remove it.
I have used the DM_DEV_STATUS command to check the open_count for my device, and what I see is that the open_count is always 1 before the umount and when my test succeeds it was 0 after the umount and when it fails it was 1 after the umount.
Now, what I am trying to find out to root-cause my issue is, "Could my resource busy failure be caused by umount asynchronously releasing my device, thus creating a race condition?". I know that umount is supposed to be synchronous when it comes to the actual unmounting, but I couldn't find any documentation for whether releasing/closing the underlying device could occur asynchronously or not.
And, if it isn't umount holding a open handle to my device, are there any other likely candidates?
My test is running on a 3.10 kernel.
Historically, system calls blocked the process involved until all the task is done (being write(2) to a block device the first major exception for obvious reasons) The reason was that you need one process to do the job and the syscall involved process was there for that reason (and you could charge the cpu processing to that user's account)
Nowadays, there are plenty of kernel threads involved in solving non-process related issues, and the umount(2) syscall can be one of the syscalls demanding some background (I think it isn't as umount(2) is not frequently issued to justify a change in the code)
But linux is not a unix descendant, so umount(2) could be implemented this way. I don't believe that, anyway.
umount(2) syscall normally succeeds, except when inodes on the filesystem are in use. That's not the case. But the kernel can be involved in some heavy duty process that makes it to alloc some kernel memory (not swappable) and fail in the request. This can lead to the error (note that this is only a guess, I have not checked this in the code, you had better to look at the umount(2) syscall implementation) you get anyway.
There's another issue, that could block your umount process (or fail) in case you have touched someway the filesystem. There's some references dependency code that makes filesystems capable of resist power failures in a consistent status (in linux, this is calles ordered data, in BSD systems it is called software updates, that makes erased files to not be freed immediately after unlink(2). This could block umount(2) (or make it fail) if some data has to be updated on the filesystem, previous to make the actual umount(2) call. But again, this should not be your case, as you say, you don't modify the mounted filesystem.
I am trying to do some testing of several thousand NFSv3 fileserver exports across hundreds of servers. Lots of things can go wrong, from configurations on the server to network connectivity. The most complete test I can do is to actually try to mount it on a client.
I can do that, but actually mounting everything is more than I need, takes state and resources beyond the program's execution, and tends to stress client a bit. I've more than once seen problems that seem to indicate something on the client was unhappy and preventing a mount from happening. (With no changes other than a client reboot, mounts worked again).
I was hoping to instead code something lighter weight that would simply act as a NFS client and see if the NFS MOUNT call successfully returned a filehandle. If it did, my server is running and my client is authorized. But I've not found any simple code to do so.
When I look at the Linux Source, it looks like at least some of the code is involved with it being a linux module, which is confusing.
Is there some user-space code that just requests a NFS filehandle via a mount call that I might be able to strip down? (Or is there any reason that my idea wouldn't work)? This is all AUTH_SYS, so I don't have to get kerberos tickets or anything.
Without knowing more, I will speculate a little based on my knowledge of NFS/Linux file systems.
I assume your client is linux (but the same logic could apply with windows if it had an nfs client present).
It sounds like when you do the mounts you are reaching a point where resources are being consumed to the point that the client cannot mount any more nfs mounts. Makes sense as when you reboot it starts working again, and a reboot will drop the nfs mounts (assuming that you are explicitly/programmatically mounting) thus allowing mounts to occur again. I bet you are just mounting the nfs mounts and are never unmounting them. So I would suggest the following:
mount
access a file or directory in the just mounted nfs filesystem
unmount the nfs file system
well, then just get the rpc protocol definition files for the version of nfs you are running, probably version 3 or 4, run them through the rpcget xdr protocol compiler you will get client functions in c that you can compile that you can make calls to the server with. But they will execute several system system calls, no way to do network communication linux without that happening and it will go throuth the tcp/ip stack (you will probably use udp) in the linux kernel. U can probably find the nfs protocol definition files on the SUN/Oracle website, or you can get find them in the source for a linux distribution -- you will be making application layer calls but the client calls rpc library functions which in turn will call linux system calls which go to the kernel
I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change mount is as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4e. Application Data
Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
Develop automated corruption detection and recovery methods. - All of the above techniques will not help you if the application simply hangs because a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
Addition to msemack's response, unfortunately my rating is too low to post a comment to his answer vs. a separate answer.
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See previous answers for file system options to help mitigate. However if ATA flush/sleep aren't properly implemented on your device you may run into the scenario we did. In our scenario the device was corrupt beyond the file system, and fdisk/format would not recover the device.
Instead an ATA security-erase was required to recover the device once corruption occurs. In order to avoid this, we implemented an ATA sleep command prior to power loss. This required hold-up of 400ms to support the 160ms ATA sleep took, and leave some head room for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm used to issue ATA security-erase, as only method of recovery for some corruption scenarios.
For non-journalling filesystem unexpected turn-off can mean corruption of certain data including directory structure. This happens if there's unsaved data in the cache or if the FS is in the process of writing multi-block update and interruption happens when only some blocks are written.
Journalling addresses this problem mostly - if there's interruption in the middle, recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem to consistent state. However this state is not always the latest - i.e. if there were some data in the memory cache, they can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces possibility of the data loss but doesn't solve the problem completely, as journalling will work as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and if it is acceptable to loose some data at power off based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the applicative side, if you don't have critic data that absolutely needs to be written before shuting down, there is no need for a specific power off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical applicative data to save before power-off, i think you have answered the question yourself. The only way to secure unattended power-off is to have a brown-out detection that alerts your embedded device coupled with some hardware circuitry that allows keeping delivering enought power to the device to perform the shutdown procedure.
The FAT file-system is particularly prone to corruption if a write is in progress or a file is open on shutdown - specifically if ther is a buffered operation that is not flushed . On one project I worked on the solution was to run a file system integrity check and repair (essentially chkdsk/scandsk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt a open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.
Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reasons why I don't want to rely on fsync() is: most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centiseconds. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
Let me restate this your problem in only slightly-uncharitable terms: You're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee, rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS specific syncing mechanism, and as per PostgreSQL's Reliability Docs.
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control about when the OS is writing to the hard drive, so use the possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think, you should tell the user/admin that he must take care when choosing the right hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe, it's up to you to start a row of tests with different hard drives and Brad Fitzgerald's tool to get a good estimation of when hard drives will have written all data. But of course - if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far software solution goes, I think the answer is, trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned after every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question (in my case, I have an Android apps which lazy-writes data back to disk and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps...it may hurt performance but losing data kills performance a whole lot more if you consider the hours it takes to recover it).
I'm trying to find proper way to handle stale data on NFS client. Consider following scenario:
Two servers mount same NFS shared storage with number of files
Client application on 1 server deletes some files
Client application on 2 server tries to access deleted files and fails with: Stale NFS file handle (nothing strange, error is expected)
(Also it may be useful to know, that cache mount options are pretty high on both servers for performance reasons).
What I'm trying to understand is:
Is there reliable method to check, that file is present? In the scenario given above lstat on the file returns success and application fails only after trying to move file.
How can I manually sync contents of directory on the client with server?
Some general advise on how to write reliable file management code in case of NFS?
Thanks.
Is there reliable method to check, that file is present? In the scenario given above lstat on the file returns success and application fails only after trying to move file.
That's it normal NFS behavior.
How can I manually sync contents of directory on the client with server?
That is impossible to do manually, since NFS pretends to be a normal POSIX-compliant file system.
I have tried once to code close()/open() in an attempt to somehow mitigate the effects of the NFS client-side caching. In my case I needed to read the info written to the file on other server. But even the reopen trick had close to zero effect. And I can't add fdatasync() to the writing side, since that slows whole application down.
My experience with NFS to date is that nothing you can do. In critical code paths I simply coded to retry the file operations which return ESTALE.
Some general advise on how to write reliable file management code in case of NFS?
Mod me down all you want, but if your customers want reliability then they shouldn't use NFS.
My company for example advertises use of proper distributed file system (I intentionally omit the brand) if customer wants reliability. Our core software is not guaranteed to run on NFS and we do not support such configurations. But in our case we really need the guarantees that as soon as the data are written to FS they become accessible on all other nodes.
Coherency in NFS can be achieved, but at the cost of performance, making NFS barely usable. (Check its mount options.) NFS is caching like crazy to hide the fact that it is a server file system. To make all operations coherent, NFS client would have to go to the NFS server synchronously for every little operation, bypassing the local cache. And that would never be fast.
But since we are talking Linux here, one can advise customers of the software to evaluate available cluster file systems. E.g. RedHat now officially support GFS. I have heard about people using CodaFS, but have no hard info on it.
i have had success with doing ls -l on the directory which contains the file.
You could try the ''noac'' mount option
from man nfs:
In addition to preventing the client
from caching file attributes, the noac
option forces application writes to
become synchronous so that local
changes to a file become visible on
the server immediately. That way,
other clients can quickly detect
recent writes when they check the
file's attributes.
Using the noac option provides
greater cache coherence among NFS
clients accessing the same files, but
it extracts a significant performance
penalty. As such, judicious use of
file locking is encouraged instead.
You could have two mounts, one for critical fast changing data that you need synchronized and another mount for other data.
Also, look into NFS locking and its limitations.
As for general advice:
One way to truncate a file that is concurrently read from multiple hosts is to write the content into a temporary file and then rename that file to the final location.
On the same filesystem this operation should be atomic.