Is there a POSIX way to ensure two files are flushed in sequence without blocking?

In my program, I hold two files open for writing: a content-file, containing chunks of data, and an index-file, containing a map of which chunks of data have been written so far.
I would like to flush them both to disk as efficiently as possible, with the only constraint being that the blocks in the content-file must be written before the corresponding blocks in the index-file (naturally).
The catch is that I would like to avoid blocking, i.e. calling fsync, for both latency and throughput reasons.
Any ideas?

I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible. This can be non-blocking. The second thread would then write any new data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and whether you need to report the completed write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to keep the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
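The write, fsync, write sequence that the second thread would run for each chunk might look roughly like the sketch below. The function name commit_chunk, the fixed chunk size, and the one-byte-per-chunk index layout are assumptions made for illustration; the question does not describe the real file formats.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf at the given offset, retrying on short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n <= 0)
            return -1;
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper for the writer thread: persist one chunk, then
 * record it in the index. Assumes fixed-size chunks and an index that
 * holds one "present" byte per chunk. */
int commit_chunk(int data_fd, int index_fd, uint32_t chunk_no,
                 const void *chunk, size_t chunk_len)
{
    off_t data_off = (off_t)chunk_no * (off_t)chunk_len;
    uint8_t present = 1;

    if (write_all(data_fd, chunk, chunk_len, data_off) != 0)
        return -1;
    if (fsync(data_fd) != 0)      /* barrier: data reaches disk first */
        return -1;
    /* Only now is it safe to let the index claim the chunk exists. */
    if (write_all(index_fd, &present, 1, (off_t)chunk_no) != 0)
        return -1;
    return 0;
}
```

The blocking fsync still happens, but only in the writer thread. The index write itself is not synced here; whether it needs its own fsync depends on how soon the index must be durable.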
It could be worth looking at the source of one of the transactional databases to see whether this logic is encapsulated well enough to be useful to you. You could also investigate the sync mount option for the file system that holds the content-file.

Related

How does the fio benchmark tool perform sequential disk reads?

I use fio to test read/write bandwidth of my disks.
Even for the sequential read test, I can let it run multiple threads.
What does it mean to run multiple threads in a sequential read test?
Does it perform multiple sequential reads? (each thread is assigned a file offset to start the sequential scanning from)
Do the multiple threads share a file offset? (Each thread invokes sequential reads using a single file offset that is shared by the multiple threads)
I tried to read fio's source code, but I couldn't really figure it out.
Can anyone give me an idea?
Sadly you didn't include a jobfile with your question and didn't say what platform you're running on. Here's a stab at answers:
Yes, it does multiple sequential reads, though wouldn't it have to do this even with a single thread?
No, each thread has its own offset, but unless you set the offset and size options they will all work inside the same "region".
On Linux fio actually defaults to using separate processes per job and each process has its own file descriptor (for ioengines that use files) for each file used. Further, some ioengines (e.g. libaio, pvsync but there are many others) use syscalls that take the offset you want to do the I/O at with the request itself so even if they do share a descriptor their offset is not impacted by others using the same descriptor.
There may be problems if you use the sync ioengine, ask fio to use threads rather than processes and have those threads work on the same file. That ioengine has to use lseek prior to doing its I/O so perhaps there's a chance for another thread's lseek to sneak in before the I/O is submitted. Note that the sync I/O engine is not the default one used with recent fio versions.
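The difference between the two styles of I/O call can be sketched in C. This is not fio's code, only an illustration of the underlying syscalls; the function names are made up for the example.

```c
#include <sys/types.h>
#include <unistd.h>

/* Offset travels with the request: pread() neither uses nor changes
 * the descriptor's file position, so threads sharing fd cannot
 * disturb each other. */
ssize_t read_region(int fd, void *buf, size_t len, off_t region_start)
{
    return pread(fd, buf, len, region_start);
}

/* lseek() followed by read(): two separate calls that both act on the
 * shared file position. Another thread using the same descriptor can
 * move the position between them. */
ssize_t read_region_racy(int fd, void *buf, size_t len, off_t region_start)
{
    if (lseek(fd, region_start, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len);
}
```

In a single thread both functions return the same data; only the second one leaves the descriptor's position changed afterwards.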
Perhaps the fio mailing list can say more?

How to prevent C read() from reading from cache

I have a program that is used to exercise several disk units in a raid configuration. One process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a second process is waiting for the queue to have entries so that it can read the data back into memory using read().
The problem that I can't seem to overcome is that when the second process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether the data read back in is equal to the data that was written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My thought is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file work, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER: I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.

Atomically write 64kB

I need to write something like 64 kB of data atomically in the middle of an existing file. That is, either all of it or none of it should be written. How can I achieve that in Linux/C?
I don't think it's possible, or at least there's not any interface that guarantees as part of its contract that the write would be atomic. In other words, if there is a way that's atomic right now, that's an implementation detail, and it's not safe to rely on it remaining that way. You probably need to find another solution to your problem.
If however you only have one writing process, and your goal is that other processes either see the full write or no write at all, you can just make the changes in a temporary copy of the file and then use rename to atomically replace it. Any reader that already had a file descriptor open to the old file will see the old contents; any reader opening it newly by name will see the new contents. Partial updates will never be seen by any reader.
There are a few approaches to modify file contents "atomically". While technically the modification itself is never truly atomic, there are ways to make it seem atomic to all other processes.
My favourite method in Linux is to take a write lease using fcntl(fd, F_SETLEASE, F_WRLCK). It will only succeed if fd is the only open descriptor to the file; that is, nobody else (not even this process) has the file open. Also, the file must be owned by the user running the process, or the process must run as root, or the process must have the CAP_LEASE capability, for the kernel to grant the lease.
When successful, the lease owner process gets a signal (SIGIO by default) whenever another process is opening or truncating the file. The opener will be blocked by the kernel for up to /proc/sys/fs/lease-break-time seconds (45 by default), or until the lease owner releases or downgrades the lease or closes the file, whichever is shorter. Thus, the lease owner has dozens of seconds to complete the "atomic" operation, without any other process being able to see the file contents.
There are a couple of wrinkles one needs to be aware of. One is the privileges or ownership required for the kernel to allow the lease. Another is the fact that the other party opening or truncating the file will only be delayed; the lease owner cannot replace (hardlink or rename) the file. (Well, it can, but the opener will always open the original file.) Also, renaming, hardlinking, and unlinking/deleting the file does not affect the file contents, and therefore are not affected at all by file leases.
Remember also that you need to handle the signal generated. You can use fcntl(fd, F_SETSIG, signum) to change the signal. I personally use a trivial signal handler -- one with an empty body -- to catch the signal, but there are other ways too.
A portable method to achieve semi-atomicity is to use a memory map using mmap(). The idea is to use memmove() or similar to replace the contents as quickly as possible, then use msync() to flush the changes to the actual storage medium.
If the memory map offset in the file is a multiple of the page size, the mapped pages reflect the page cache. That is, any other process reading the file, in any way -- mmap() or read() or their derivatives -- will immediately see the changes made by the memmove(). The msync() is only needed to make sure the changes are also stored on disk, in case of a system crash -- it is basically equivalent to fsync().
To avoid preemption (kernel interrupting the action due to the current timeslice being up) and page faults, I'd first read the mapped data to make sure the pages are in memory, and then call sched_yield(), before the memmove(). Reading the mapped data should fault the pages into page cache, and sched_yield() releases the rest of the timeslice, making it extremely likely that the memmove() is not interrupted by the kernel in any way. (If you do not make sure the pages are already faulted in, the kernel will likely interrupt the memmove() for each page separately. You won't see that in the process, but other processes see the modifications to occur in page-sized chunks.)
This is not exactly atomic, but it is practical: it does not give you any guarantees, only makes the race window very very short; therefore I call this semi-atomic.
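The mmap() steps described above could be put together roughly as follows. This is a sketch under stated assumptions, not a tested recipe for atomicity: the function name overwrite_mapped is hypothetical, the offset must be page-aligned, and the target range must already exist in the file.

```c
#include <sched.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite len bytes at page-aligned offset off through a shared
 * mapping. Returns 0 on success, -1 on error. */
int overwrite_mapped(int fd, off_t off, const void *src, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || off % page != 0)
        return -1;

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, off);
    if (map == MAP_FAILED)
        return -1;

    /* Read one byte per page so the pages are resident before the copy. */
    volatile unsigned char sink = 0;
    for (size_t i = 0; i < len; i += (size_t)page)
        sink ^= map[i];
    (void)sink;

    sched_yield();              /* give up the rest of this timeslice */
    memmove(map, src, len);     /* the short race window */

    int rc = msync(map, len, MS_SYNC);   /* push the change to storage */
    if (munmap(map, len) != 0)
        rc = -1;
    return rc;
}
```

As the answer says, nothing here is guaranteed; the pre-read and sched_yield() only make an interruption in the middle of the copy less likely.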
Note that this method is compatible with file leases. One could try to take a write lease on the file, but fall back to leaseless memory mapping if the lease is not granted within some acceptable time period, say a second or two. I'd use timer_create() and timer_settime() to create the timeout timer, and the same empty-body signal handler to catch the SIGALRM signal; that way the fcntl() is interrupted (returns -1 with errno == EINTR) when the timeout occurs -- with the timer interval set to some small value (say 25000000 nanoseconds, or 0.025 seconds) so it repeats very often after that, interrupting syscalls if the initial interrupt is missed for any reason.
Most userspace applications create a copy of the original file, modify the contents of the copy, then replace the original file with the copy.
Each process that opens the file will only see complete changes, never a mix of old and new contents. However, anyone keeping the file open, will only see their original contents, and not be aware of any changes (unless they check themselves). Most text editors do check, but daemons and other processes do not bother.
Remember that in Linux, the file name and its contents are two separate things. You can open a file, unlink/remove it, and still keep reading and modifying the contents for as long as you have the file open.
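A small illustration of that separation between name and contents, with a made-up function name:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open path, remove its name, then read through the descriptor that
 * is still open. Returns the number of bytes read, or -1 on error. */
ssize_t read_after_unlink(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {        /* the name is gone from here on */
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, len); /* the contents are still reachable */
    close(fd);                      /* last reference dropped */
    return n;
}
```

This is why a reader that keeps the old file open after a rename-based replacement continues to see the old data.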
There are other approaches, too. I do not want to suggest any specific approach, because the optimal one depends heavily on the circumstances: Do the other processes keep the file open, or do they always (re)open it before reading the contents? Is atomicity preferred or absolutely required? Is the data plain text, structured like XML, or binary?
EDITED TO ADD:
Please note that there are no ways to guarantee beforehand that the file will be successfully modified atomically. Not in theory, and not in practice.
You might encounter a write error with the disk full, for example. Or the drive might hiccup at just the wrong moment. I'm only listing three practical ways to make it seem atomic in typical use cases.
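The reason write leases are my favourite is that I can always use fcntl(fd, F_GETLEASE), which returns the type of lease currently held on the descriptor (F_WRLCK while the write lease is intact), to check whether the lease is still valid or not. If not, then the write was not atomic.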
High system load is unlikely to cause the lease to be broken for a 64k write, if the same data has been read just prior (so that it will likely be in page cache). If the process has superuser privileges, you can use setpriority(PRIO_PROCESS,getpid(),-20) to temporarily raise the process priority to maximum while taking the file lease and modifying the file. If the data to be overwritten has just been read, it is extremely unlikely to be moved to swap; thus swapping should not occur, either.
In other words, while it is quite possible for the lease method to fail, in practice it is almost always successful -- even without the extra tricks mentioned in this addendum.
Personally, I simply check if the modification was not atomic, using the fcntl() call after the modification, prior to msync()/fsync() (making sure the data hits the disk in case a power outage occurs); that gives me an absolutely reliable, trivial method to check whether the modification was atomic or not.
For configuration files and other sensitive data, I too recommend the rename method. (Actually, I prefer the hardlink approach used for NFS-safe file locking, which amounts to the same thing but uses a temporary name to detect naming races.) However, it has the problem that any process keeping the file open will have to check and reopen the file, voluntarily, to see the changed contents.
Disk writes cannot be atomic without a layer of abstraction. You should keep a journal and revert if a write is interrupted.
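One possible shape for such a journal is an undo log: save the bytes that are about to be overwritten, make that copy durable, and only then touch the data file. The sketch below is an assumption about what the answer has in mind; the record layout and all function names are invented for illustration, and a real journal would add a checksum rather than trusting the file size to detect a torn record.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

struct undo_header { off_t off; size_t len; };  /* hypothetical layout */

/* Before the update: save the old bytes and make the copy durable. */
int journal_begin(int fd, int jfd, off_t off, size_t len)
{
    struct undo_header h = { off, len };
    char *old = malloc(len);
    int rc = -1;
    if (old == NULL)
        return -1;
    if (pread(fd, old, len, off) == (ssize_t)len &&
        pwrite(jfd, &h, sizeof h, 0) == (ssize_t)sizeof h &&
        pwrite(jfd, old, len, (off_t)sizeof h) == (ssize_t)len &&
        fsync(jfd) == 0)
        rc = 0;
    free(old);
    return rc;
}

/* After the new data has been written and fsync'ed: drop the record. */
int journal_commit(int jfd)
{
    if (ftruncate(jfd, 0) != 0)
        return -1;
    return fsync(jfd);
}

/* At startup: a complete record means the update was interrupted, so
 * the old bytes are put back. Returns 1 after a rollback, 0 if there
 * was nothing to undo, -1 on error. */
int journal_recover(int fd, int jfd)
{
    struct undo_header h;
    struct stat st;
    if (fstat(jfd, &st) != 0)
        return -1;
    if (st.st_size < (off_t)sizeof h)
        return journal_commit(jfd);      /* empty or torn header */
    if (pread(jfd, &h, sizeof h, 0) != (ssize_t)sizeof h)
        return -1;
    if (st.st_size < (off_t)(sizeof h + h.len))
        return journal_commit(jfd);      /* torn record: data untouched */

    char *old = malloc(h.len);
    int rc = -1;
    if (old == NULL)
        return -1;
    if (pread(jfd, old, h.len, (off_t)sizeof h) == (ssize_t)h.len &&
        pwrite(fd, old, h.len, h.off) == (ssize_t)h.len &&
        fsync(fd) == 0 &&
        journal_commit(jfd) == 0)
        rc = 1;
    free(old);
    return rc;
}
```

The intended call order is journal_begin(), then the normal pwrite() and fsync() on the data file, then journal_commit(). After a crash, journal_recover() runs before anything else reads the data file.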
As far as I know a write below the size of PIPE_BUF is atomic, but POSIX only makes that guarantee for pipes and FIFOs, not for regular files, so I never rely on it. If the programs that access the file are written by you, you can use flock() to achieve exclusive access. This system call places an advisory lock on the file: other processes that also call flock() are made to wait or are refused, while processes that ignore the lock are not stopped.
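A minimal sketch of cooperating readers and writers using flock(); the wrapper names are made up, and the scheme only works if every program touching the file goes through such wrappers.

```c
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

/* Write under an exclusive advisory lock. Returns what pwrite()
 * returned (possibly a short count), or -1 on error. */
ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    if (flock(fd, LOCK_EX) != 0)    /* blocks until the lock is granted */
        return -1;
    ssize_t n = pwrite(fd, buf, len, off);
    flock(fd, LOCK_UN);
    return n;
}

/* Read under a shared lock: several readers may hold it at once, but
 * not while a writer holds the exclusive lock. */
ssize_t locked_pread(int fd, void *buf, size_t len, off_t off)
{
    if (flock(fd, LOCK_SH) != 0)
        return -1;
    ssize_t n = pread(fd, buf, len, off);
    flock(fd, LOCK_UN);
    return n;
}
```

flock() locks belong to the open file description, so each process (or each independent open() of the file) takes its own lock.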

Simulating file system access

I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools as my requirements are different. So to test the file system I wish to simulate file access operations. To do this, I first use the ftw() function to walk through one of my existing (experimental) file systems and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. The simulator randomly starts a process, i.e. it spawns a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename, etc.) and selects arguments to this operation from the list generated by ftw(). The thread does a number of such file operations and then exits, marking the end of a process. The simulator continues to spawn threads; thread execution can overlap just as real processes do. Now, as operations are performed by threads, files get inserted, deleted, and renamed, and this is updated in the list of files.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator: how will it spawn threads over a period of time? Should I be using some random delay to do this?
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
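Imposing a distribution on the operations can be as simple as sampling from cumulative weights. In the sketch below the operation set and the 60/30/5/5 mix are placeholders, not measured figures; substitute numbers that match the expected workload.

```c
#include <stdlib.h>

enum fs_op { OP_READ, OP_WRITE, OP_RENAME, OP_DELETE, OP_COUNT };

/* Hypothetical workload mix; the weights should sum to 1. */
static const double op_weight[OP_COUNT] = { 0.60, 0.30, 0.05, 0.05 };

/* Map a uniform value u in [0, 1) to an operation by walking the
 * cumulative weights. */
enum fs_op pick_op(double u)
{
    double cum = 0.0;
    for (int i = 0; i < OP_COUNT; i++) {
        cum += op_weight[i];
        if (u < cum)
            return (enum fs_op)i;
    }
    return (enum fs_op)(OP_COUNT - 1);  /* guards against rounding */
}

/* Draw one operation. Each simulator thread passes its own seed so
 * that threads do not share generator state. */
enum fs_op random_op(unsigned int *seed)
{
    return pick_op(rand_r(seed) / ((double)RAND_MAX + 1.0));
}
```

The same cumulative-weight walk can be reused to bias which files from the ftw() list get picked, for example to make a few files much hotter than the rest.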
That sounds about right for a decent test case just to make sure it's working. You could use sleep() to wait between spawning threads, or just spawn them all at once and have each one do an operation, wait a bit, then do another operation, and so on. IMO if you hit it hard with a lot of requests and it works, then there's a good chance your filesystem will do just fine. Take an example from PostMark, which does nothing but append like crazy to different files, and from other benchmarks that do random-access reads/writes in different locations to make sure that pages have to be read from disk.

How can storage system handle different write streams at the same place?

Normally, if two applications send two write requests to the same place (LBA) on the disk, the applications or the file system will take a lock on it, so only one request is handled at a time.
But now there is a difficult problem. There may be multiple write requests that should be written to the same place, but the applications don't do any locking. There is no file system, because the data are written directly to the raw disk. What I can do is modify the code of the storage system. Things are very complicated now. Suppose there are two requests, A and B. Then the data finally stored at the corresponding LBA may be one of three results:
1. All data are from A.
2. All data are from B.
3. Parts of the data are from A; parts of the data are from B.
In my opinion, results 1 and 2 are acceptable, but result 3 is not. But not everyone agrees. What is your opinion?
I agree that it should be all of one or all of the other, never a mix. This can be done quite easily by using a form of storage system manager, and writing to the manager in large enough chunks. The manager can do appropriate locking internally so that only one block from one request is written at a time, and you don't get overlaps.
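A very small version of such a manager might serialise block writes with a mutex, as sketched here. The structure, the function names, and the 4096-byte block size are all assumptions for the example; a single global lock is the simplest choice, and a real storage system would more likely lock per block or per range.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define MGR_BLOCK_SIZE 4096         /* assumed block size */

struct storage_manager {
    int fd;                         /* raw device or backing file */
    pthread_mutex_t lock;           /* serialises all block writes */
};

int manager_init(struct storage_manager *m, int fd)
{
    m->fd = fd;
    return pthread_mutex_init(&m->lock, NULL) == 0 ? 0 : -1;
}

/* Write one whole block at logical block address lba. The mutex is
 * held across the entire write loop, so two requests for the same
 * block cannot interleave their data. Returns 0 or -1. */
int manager_write_block(struct storage_manager *m, off_t lba,
                        const void *block)
{
    const char *p = block;
    size_t left = MGR_BLOCK_SIZE;
    off_t off = lba * MGR_BLOCK_SIZE;
    int rc = 0;

    pthread_mutex_lock(&m->lock);
    while (left > 0) {
        ssize_t n = pwrite(m->fd, p, left, off);
        if (n <= 0) {
            rc = -1;                /* block may be partly written */
            break;
        }
        p += n;
        off += n;
        left -= (size_t)n;
    }
    pthread_mutex_unlock(&m->lock);
    return rc;
}
```

With every writer going through manager_write_block(), the block ends up holding either A's data or B's data, whichever request took the lock last. Readers that must never observe a half-written block would need to take the same lock.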
