The semantics of Linux's Asynchronous file IO (AIO) is well described in the man page of io_setup(2), io_submit(2) and io_getevents(2).
However, without diving in the block IO subsystem, the operational side of the implementation is a little less clear.
An aio_context allocates a queue for sending back io_events to a specific client in user-space. But is there more to it ?
Let be a file read sequentially chunks by chunks. Can requests, especially in Direct IO (DIO), be collated ? What if requests for two files are interleaved into one aio_context ? What if requests for one file are sent to two different aio_contexts ?
How requests are prioritized and scheduled in the above cases, with one or multiple aio_contexts ?
Is it possible that requests from two aio_contexts get interleaved at some point ? (Occasioning more seek latencies than intended.)
Does the thread or the CPU calling io_submit influence how it is scheduled ? Is the NUMA node containing the target buffer taken into consideration ?
More broadly, to which hardware resources (NUMA nodes, CPU cores, physical drives, file-systems and files) aio_contexts should be assigned, and at which level of granularity ?
Maybe it doesn't really matter and aio_contexts are no more than an abstraction for user-space programs.
I'm asking since I have observed a performance decrease when concurrently reading multiples files, each with it's own aio_context, compared to a manual Round-robin serialization of chunks requests into a single aio_context.
You can mix requests freely in a single context and I would do so. Otherwise you have to poll two separate contexts doubling the number of syscalls.
Requests to a context are passed to the kernels async IO VFS layer. Multiple files, multiple contexts, multiple processes or users doing the requests it all ends up in the same layer. The VFS layer then sends the requests to the relevant filesystems or block devices and all the usual collation and such happens naturally.
Requests to the same file to one or more context at the same time I think are undefined behavior if they overlap. They could be ordered one way or the other. The later request could be processed first for example. So you need to write your own synchronization if strict ordering is required. Same as one or more threads doing read/write calls in parallel.
Prioritization and scheduling will depend on the lower layers. Afaik block devices will reorder requests so they happen in increasing block numbers (elevator code) to minimize seek times on rotating disks.
Yes, requests from different contexts and normal read/write calls will get interleaved.
I think the requesting process and NUMA and such is completely ignored.
Note: When dealing with files make sure the filesystem supports the linux async IO hooks and you might need to use O_DIRECT on open() with all it's consequences.
A way to simply test this I found is to make lots of requests to a file in one io_submit() call and then check if the all finish simultaneously. If the filesystem falls back to sync IO then everything submitted will finish at the same time.
Related
If I call a rescale() operation in Flink, I assume that there is NO serialization/deserialization (since the data is not crossing nodes), right? Further, is it correct to assume that objects are not copied/deep copied when rescale() is called?
I ask because I'm passing some large objects, 99% of which are common between multiple threads, so it would be a tremendous RAM waste if the objects were recopied in each thread after a rescale(). Instead, all the different threads should point to the same single object in the java heap for that node.
(Of course, if I call a rebalance, I would expect that there would be ONE serialization of the common objects to the other nodes, even if there are dozens of threads on each of the other nodes? That is, on the other nodes, there should only be 1 copy of a common object that all the threads for that node can share, right?)
Based on the rescale() documentation, there will be network traffic (and thus serialization/deserialization), just not as much as a rebalance(). But as several Flink committers have noted, data skew can make the reduction in network traffic insignificant compared to the cost of unbalanced data, which is why rebalance() is the default action when the stream topology changes.
Also, if you're passing around a lot of common data, then maybe look at using a broadcast stream to more efficiently share that across nodes.
Finally, it's conceptually easier to think about sub-tasks vs. threads. Each operator runs as a sub-task, which (on one Task Manager) is indeed being threaded, but the operator instances are separate, which means you don't have to worry about multi-threading at the operator level (unless you use class variables, which is usually a Bad Idea).
When doing multiple aio_writes to file is it necessary to wait (e.g. aio_suspend or other) before starting the next one? From the documentation it says that writes are enqueued so does that mean they are written in order? Also, I can track the offset and make sure nothing is ever overwritten (I'm assuming that a failed write could leave a gap in this case).
You've asked two questions:
Can I issue new aios before previous aio finishes?
If so, are aios finished in-order(i.e. the same as issue order)?
Essentially aio works asynchronously only for O_DIRECT opened files, so I assume it following.
The answer is:
Yes. That's one essential usage of asynchronous I/O
No. You can almost assume nothing on their completion order
aio(different from POSIX aio, see also here) is a linux native support for asynchronous I/O. You submit async I/O to kernel, kernel records it and performs it in background asynchronously. Also, multiple "outstanding" async I/O is permitted and will be maintained by kernel.
As for order, more specifically, write order here, there are many potentials to reorder them inside the kernel.
A typical reorder comes from block layer, which accepts disk write requests from upper filesystem layer and delivers them to lower device driver layer. Within block layer there are many request schedulers, which will schedule the I/O into a "suitable" order. Well, how the "suitable" here is defined depends on the scheduler type. For example, HDD may try to merge more requests and cache a single sector/page waiting for merge potential, which incurs reorder. For more info here is an introduction.
For strong order requirement, you may control it manually by waiting for aios to finish before issuing more.
I have a multi threaded process, where each thread runs on one core. I am reading the same set of files from each of the threads and processing them. Will reading the same set of files by multiple threads affect the performance of the process?
Not necessarily, but there are a few factors to be taken on account.
When you open a file for READING you don't need to put a read lock on it.
That means multiple threads can be reading from the same file.
In fact all threads from a process share the process memory, so you can use that for your benefit by caching the whole set (or part of it, depending on the size) on the process memory. That will reduce access time.
Otherwise if we assume all files are in the same device, the problem is that reading multiple files simultaneously from the same device at the same time is slow and, depending on the number of threads and the storage type it can be noticeably slower
Reading the same set of files from each different thread may reduce the performance of the process, because the IO request are normally costly and slow, in addition to being repeating the same read operation for each difference thread.
One possible solution to deal with this is having one thread dealing with the IO reads/writes and the rest processing the data, for example as a producer consumer.
You may consider Memory-Mapped Files for concurrent read access.
It will avoid overhead of copying data into every process address space.
I have a program that is used to exercise several disk units in a raid configuration. 1 process synchronously (O_SYNC) writes random data to a file using write(). It then puts the name of the directory into a shared-memory queue, where a 2nd process is waiting for the queue to have entries to read the data back into memory using read().
The problem that I can't seem to overcome is that when the 2nd process attempts to read the data back into memory, none of the disk units show read accesses. The program has code to check whether or not the data read back in is equal to the code that is written to disk, and the data always matches.
My question is, how can I make the OS (IBM i) not buffer the data when it is written to disk so that the read() system call accesses the data on the disk rather than in cache? I am doing simple throughput calculations and the read() operations are always 10+ times faster than the write operations.
I have tried using the O_DIRECT flag, but cannot seem to get the data to write to the file. It could have to do with setting up the correct aligned buffers. I have also tried the posix_fadvise(fd, offset,len, POSIX_FADV_DONTNEED) system call.
I have read through this similar question but haven't found a solution. I can provide code if it would be helpful.
My though is that if you write ENOUGH data, then there simply won't be enough memory to cache it, and thus SOME data must be written to disk.
You can also, if you want to make sure that small writes to your file works, try writing ANOTHER large file (either from the same process or a different one - for example, you could start a process like dd if=/dev/zero of=myfile.dat bs=4k count=some_large_number) to force other data to fill the cache.
Another "trick" may be to "chew up" some (more like most) of the RAM in the system - just allocate a large lump of memory, then write to some small part of it at a time - for example, an array of integers, where you write to every 256th entry of the array in a loop, moving to one step forward each time - that way, you walk through ALL of the memory quickly, and since you are writing continuously to all of it, the memory will have to be resident. [I used this technique to simulate a "busy" virtual machine when running VM tests].
The other option is of course to nobble the caching system itself in OS/filesystem driver, but I would be very worried about doing that - it will almost certainly slow the system down to a slow crawl, and unless there is an existing option to disable it, you may find it hard to do accurately/correctly/reliably.
...exercise several disk units in a raid configuration... How? IBM i doesn't allow a program access to the hardware. How are you directing I/O to any specific physical disks?
ANSWER: The write/read operations are done in parallel against IFS so the stream file manager is selecting which disks to target. By having enough threads reading/writing, the busyness of SYSBASE or an IASP can be driven up.
...none of the disk units show read accesses. None of them? Unless you are running the sole job on a system in restricted state, there is going to be read activity on the disks from other tasks. Is the system divided into multiple LPARs? Multiple ASPs? I'm suggesting that you may be monitoring disks that this program isn't writing to, because IBM i handles physical I/O, not programs.
ANSWER I guess none of them is a slight exaggeration - I know which disks belong to SYSBASE and those disks are not being targeted with many read requests. I was just trying to generalize for an audience not familiar w/IBM i. In the picture below, you will see that the write reqs are driving the % busyness up, but the read reqs are not even though they are targeting the same files.
...how can I make the OS (IBM i) not buffer the data when it is written to disk... Use a memory starved main storage pool to maximise paging, write immense blocks of data so as to guarantee that the system and disk controller caches overflow and use a busy machine so that other tasks are demanding disk I/O as well.
In my program, I hold two files open for writing, a content-file, containing chunks of data, and an index-file, containing a map over which chunks of data has been written so far.
I would like to flush them both to disc, as performant as possible, with the only constraint that the blocks in the data-file must be written before the corresponding blocks in the map-file (naturally).
The catch is that I would like to avoid blocking I.E. doing an fsync, both for latency and throughput-reasons.
Any ideas?
I don't think you can do this easily in a single execution path. You need fsync to have the write to disk guaranteed - and this is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible. This can be non-blocking. The second thread would then write any new data to the data to your content-file, then fsync, then write the index-file, then check for new data again. Key design decisions relate to how you separate the two execution paths, how you communicate between them, and if you need to report the write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to have the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
It could be worth looking to see if this is well encapsulated so as to be useful to you in the source of any of the transactional databases. You could also investigate the sync option when you mount the file system for the content-file.