error no set to EIO while using pwrite() API - c

I have remote disks mounted on my system using NFS and I am trying to write to the files on the mounted remote disks using pwrite() API.
It doesn't happen every time but in some cases while doing I/O pwrite() fails and set the error number to EIO(Input/Output error).
Can some one please explain why this error occur on the first place and is there any way I can correct it?
Thanks

From (bad) experiences with reading and writing to NFS based files I learned that you have a good chance to work around this EIO by simply retrying the failed I/O operation (read(), write()).
Also on NFS one can not assume that read()/write() do transfer the amount of data specified, so it is good idea to always check the return value of the function in question on how many byte were transfered.
I see the issue in the underlying NFS functionality or in the way the NFS driver's results are handled by the kernel, so I strongly assume pread()/pwrite() show the same effects as I witnessed when using read()/write().

Related

Does umount asynchronously release the underlying device?

I have some code that umounts a file system on a device and then immediately removes the device from device-mapper using the DM_DEV_REMOVE ioctl command.
Sometimes, as part of a stress test, I run this code in a tight loop of:
create the device
mount the file system on the device
unmount the file system
remove the device
Often, when running this test over thousands of iterations, I will eventually get the errno EBUSY when trying to remove the device. The umount is always successful.
I have tried searching on this issue, but mostly what I find is people having issues with getting EBUSY when umounting, which is not the problem I am having.
The closest thing to being helpful that I could find is that in the man page for dmsetup it talks about using the --retry option as a workaround for udev rules opening up devices when you are trying to remove them. Unfortunately for me though, I have been able to confirm that udev does not have my device open when I am trying to remove it.
I have used the DM_DEV_STATUS command to check the open_count for my device, and what I see is that the open_count is always 1 before the umount and when my test succeeds it was 0 after the umount and when it fails it was 1 after the umount.
Now, what I am trying to find out to root-cause my issue is, "Could my resource busy failure be caused by umount asynchronously releasing my device, thus creating a race condition?". I know that umount is supposed to be synchronous when it comes to the actual unmounting, but I couldn't find any documentation for whether releasing/closing the underlying device could occur asynchronously or not.
And, if it isn't umount holding a open handle to my device, are there any other likely candidates?
My test is running on a 3.10 kernel.
Historically, system calls blocked the process involved until all the task is done (being write(2) to a block device the first major exception for obvious reasons) The reason was that you need one process to do the job and the syscall involved process was there for that reason (and you could charge the cpu processing to that user's account)
Nowadays, there are plenty of kernel threads involved in solving non-process related issues, and the umount(2) syscall can be one of the syscalls demanding some background (I think it isn't as umount(2) is not frequently issued to justify a change in the code)
But linux is not a unix descendant, so umount(2) could be implemented this way. I don't believe that, anyway.
umount(2) syscall normally succeeds, except when inodes on the filesystem are in use. That's not the case. But the kernel can be involved in some heavy duty process that makes it to alloc some kernel memory (not swappable) and fail in the request. This can lead to the error (note that this is only a guess, I have not checked this in the code, you had better to look at the umount(2) syscall implementation) you get anyway.
There's another issue, that could block your umount process (or fail) in case you have touched someway the filesystem. There's some references dependency code that makes filesystems capable of resist power failures in a consistent status (in linux, this is calles ordered data, in BSD systems it is called software updates, that makes erased files to not be freed immediately after unlink(2). This could block umount(2) (or make it fail) if some data has to be updated on the filesystem, previous to make the actual umount(2) call. But again, this should not be your case, as you say, you don't modify the mounted filesystem.

fsync on mapped crypted device with dm-crypt?

I have a question about dm-crypt.
Here is my situation. I have an encrypted partition mapped (encrypted in virtual device) using the cryptsetup command in Linux. I am opening the mapped virtual device in a c-program using the open() function.
Can i be sure that when i use the fsync() function all information will be written to the encrypted partition or is there some buffer in the dm-crypt driver?
I could not find much reference on this. Maybe someone can shed more light on this, as I have not grokked the source, but it seems as though a sync writes to disk.
One point is the questions trim-with-lvm-and-dm-crypt where a sync changes the disk content reliably, yet the cached content is only updated after a echo 1 > /proc/sys/vm/drop_caches.
Another is the issue that sync hangs on a suspended device, which indicates that the sync goes directly to the device.
A third is this Gentoo discussion where luksClose is possible reliably after a sync.
A fourth is this UL answer, which says
the rest of the stuff [dm-crypt] is in kernel and pretty heavily used, so it's
probably fine
It may still be that all these are wrong, and it can happen that sync does not write directly to the encrypted disk, but it seems unlikely.

select() equivalence in I/O Completion Ports

I am developing a proxy server using WinSock 2.0 in Windows. If I wanted to develop it in blocking model, select() was the way to wait for client or remote server to receive data from. Is there any applicable way to do this so using I/O Completion Ports?
I used to have two Contexts for two directions of data using I/O Completion Ports. But having a WSARecv pending couldn't receive any data from remote server! I coudn't find the problem.
Thanks in advance.
EDIT. Here's the WorkerThread Code on currently developed I/O Completion Ports. But I am asking about how to implement select() equivalence.
I/O Completion Ports provide an indication of when an I/O operation completes, they do not indicate when it is possible to initiate an operation. In many situations this doesn't actually matter. Most of the time the overlapped I/O model will work perfectly well if you assume it is always possible to initiate an operation. The underlying operating system will, in most cases, simply do the right thing and queue the data for you until it is possible to complete the operation.
However, there are some situations when this is less than ideal. For example you can always send to a socket using overlapped I/O. You can do this even when the remote peer is not reading and the TCP stack has started to use flow control and has filled the TCP window... This simply uses resources on your local machine in a completely uncontrolled manner (not entirely uncontrolled, but controlled by the peer, which is not ideal). I write about this here and in many situations you DO need to actively manage this kind of thing by tracking how many outstanding I/O write requests you have and using that as an indication of 'readiness to send'.
Likewise if you want a 'readiness to recv' indication you could issue a 'zero byte' read on the socket. This is a read which is issued with a zero length buffer. The read returns when there is data to read but no data is returned. This would give you the indication that there is data to be read on the connection but is, IMHO, pointless unless you are suffering from the very unlikely situation of hitting the I/O page lock limit, as you may as well read the data when it becomes available rather than forcing multiple kernel to user mode transitions.
In summary, you don't really need an answer to your question. You need to look at how the API works and write your code to work with it rather than trying to force the API to work in a way that other APIs that you are familiar with work.

Delayed Write errors

For the past few months, we've been losing data to a Delayed Write errors. I've experienced the error with both custom code and shrink-wrap applications. For example, the error message below came from Visual Studio 2008 on building a solution
Windows - Delayed Write Failed : Windows was unable
to save all the data for the file
\Vital\Source\Other\OCHSHP\Done07\LHFTInstaller\Release\LHFAI.CAB. The
data has been lost. This error may be caused by a failure of your
computer hardware or network connection. Please try to save this file
elsewhere.
When it occurs in Adobe, Visual Studio, or Word, for example, no harm is done. The major problem is when it occurs to our custom applications (straight C apps that writes data in dBase files to a network share.)
From the program's perspective, the write succeeds. It deletes the source data, and goes on to the next record. A few minutes later, Windows pops up an error message saying that a delayed write occurred and the data was lost.
My question is, what can we do to help our networking/server teams isolate and correct the problem (read, convince them the problem is real. Simply telling them many, many times hasn't convinced them as of yet) and do you have any suggestions of how we can write to avoid the data loss?
Writes on Windows, like any modern operating system, are not actually sent to the disk until the OS gets around to it. This is a big performance win, but the problem (as you have found) is that you cannot detect errors at the time of the write.
Every operating system that does asynchronous writes also provides mechanisms for forcing data to disk. On Windows, the FlushFileBuffers or _commit function will do the trick. (One is for HANDLEs, the other for file descriptors.)
Note that you must check the return value of every disk write, and the return value of these synchronizing functions, in order to be certain the data made it to disk. Also note that these functions block and wait for the data to reach disk -- even if you are writing to a network server -- so they can be slow. Do not call them until you really need to push the data to stable storage.
For more, see fsync() Across Platforms.
You have a corrupted file system or a hard disk that is failing. The networking/server team should scan the disk to fix the former and detect the latter. Also check the error log to see if it tells you anything. If the error log indicates that failure to write to the hardware then you need to replace the disk.

Ensure that file state on the client is in sync with NFS server

I'm trying to find proper way to handle stale data on NFS client. Consider following scenario:
Two servers mount same NFS shared storage with number of files
Client application on 1 server deletes some files
Client application on 2 server tries to access deleted files and fails with: Stale NFS file handle (nothing strange, error is expected)
(Also it may be useful to know, that cache mount options are pretty high on both servers for performance reasons).
What I'm trying to understand is:
Is there reliable method to check, that file is present? In the scenario given above lstat on the file returns success and application fails only after trying to move file.
How can I manually sync contents of directory on the client with server?
Some general advise on how to write reliable file management code in case of NFS?
Thanks.
Is there reliable method to check, that file is present? In the scenario given above lstat on the file returns success and application fails only after trying to move file.
That's it normal NFS behavior.
How can I manually sync contents of directory on the client with server?
That is impossible to do manually, since NFS pretends to be a normal POSIX-compliant file system.
I have tried once to code close()/open() in an attempt to somehow mitigate the effects of the NFS client-side caching. In my case I needed to read the info written to the file on other server. But even the reopen trick had close to zero effect. And I can't add fdatasync() to the writing side, since that slows whole application down.
My experience with NFS to date is that nothing you can do. In critical code paths I simply coded to retry the file operations which return ESTALE.
Some general advise on how to write reliable file management code in case of NFS?
Mod me down all you want, but if your customers want reliability then they shouldn't use NFS.
My company for example advertises use of proper distributed file system (I intentionally omit the brand) if customer wants reliability. Our core software is not guaranteed to run on NFS and we do not support such configurations. But in our case we really need the guarantees that as soon as the data are written to FS they become accessible on all other nodes.
Coherency in NFS can be achieved, but at the cost of performance, making NFS barely usable. (Check its mount options.) NFS is caching like crazy to hide the fact that it is a server file system. To make all operations coherent, NFS client would have to go to the NFS server synchronously for every little operation, bypassing the local cache. And that would never be fast.
But since we are talking Linux here, one can advise customers of the software to evaluate available cluster file systems. E.g. RedHat now officially support GFS. I have heard about people using CodaFS, but have no hard info on it.
i have had success with doing ls -l on the directory which contains the file.
You could try the ''noac'' mount option
from man nfs:
In addition to preventing the client
from caching file attributes, the noac
option forces application writes to
become synchronous so that local
changes to a file become visible on
the server immediately. That way,
other clients can quickly detect
recent writes when they check the
file's attributes.
Using the noac option provides
greater cache coherence among NFS
clients accessing the same files, but
it extracts a significant performance
penalty. As such, judicious use of
file locking is encouraged instead.
You could have two mounts, one for critical fast changing data that you need synchronized and another mount for other data.
Also, look into NFS locking and its limitations.
As for general advice:
One way to truncate a file that is concurrently read from multiple hosts is to write the content into a temporary file and then rename that file to the final location.
On the same filesystem this operation should be atomic.

Resources