When doing multiple aio_writes to a file, is it necessary to wait (e.g. with aio_suspend or similar) before starting the next one? The documentation says that writes are enqueued, so does that mean they are written in order? Also, I can track the offset and make sure nothing is ever overwritten (I'm assuming that a failed write could leave a gap in this case).
You've asked two questions:
Can I issue new aios before a previous aio finishes?
If so, are aios finished in order (i.e. the same as the issue order)?
Essentially, aio works asynchronously only for files opened with O_DIRECT, so I assume that in what follows.
The answer is:
Yes. That's one essential usage of asynchronous I/O
No. You can assume almost nothing about their completion order
aio (different from POSIX aio; see also here) is Linux's native support for asynchronous I/O. You submit async I/O to the kernel, and the kernel records it and performs it in the background asynchronously. Multiple "outstanding" async I/Os are permitted and will be maintained by the kernel.
As for order (more specifically, write order here), there are many opportunities for the kernel to reorder requests.
A typical reorder comes from the block layer, which accepts disk write requests from the upper filesystem layer and delivers them to the lower device driver layer. Within the block layer there are several request schedulers, which reorder the I/O into a "suitable" order. What counts as "suitable" depends on the scheduler type; for example, a scheduler targeting HDDs may hold back a request for a single sector/page while waiting for a chance to merge it with neighboring requests, which reorders the stream. For more info, here is an introduction.
If you have a strong ordering requirement, you can enforce it manually by waiting for outstanding aios to finish before issuing more, as in the sketch below.
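As a minimal sketch using the POSIX aio interface the question mentions (ordered_aio_write and its error handling are illustrative, not a fixed API), each write is issued only after the previous one has completed:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>

    /* Issue one aio_write and block until it completes, so the caller
       can safely issue the next write knowing this one has finished. */
    ssize_t ordered_aio_write(int fd, const void *buf, size_t len, off_t off)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = (volatile void *)buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;

        if (aio_write(&cb) == -1)
            return -1;

        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);   /* wait for this request only */

        return aio_return(&cb);           /* bytes written, or -1 on error */
    }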
I am a newbie to PCIe, so this might be a dumb question. This seems like fairly basic information to ask about PCIe interfaces, but I am having trouble finding the answer so I am guessing that I am missing some information which makes the answer obvious.
I have a system in which I have an ARM processor (host) communicating to a Xilinx SoC via PCIe (device). The endpoint within the SoC is an ARM processor as well.
The external ARM processor (host) is going to be writing to the register space of the SoC's ARM processor (device) via PCIe. This will command the SoC to do various things. That register space will be read-only with respect to the SoC (device). The external ARM processor (host) will make a write to this register space, and then signal an interrupt to indicate to the SoC that new parameters have been written and it should process them.
My question is: are the writes made by the external ARM (host) guaranteed to be atomic with respect to the reads by the SoC (device)? In conventional shared-memory situations, a write to a single byte is guaranteed to be an atomic operation (i.e. you can never be in a situation where the reader has read the first 2 bits of the byte, but before it reads the last 6 bits the writer replaces them with a new value, leading to garbage data). Is this the case in PCIe as well? And if so, what is the "unit" of atomicity? Are all bytes in a single transaction atomic with respect to the entire transaction, or is each byte atomic only in relation to itself?
Does this question make sense?
Basically I want to know to what extent memory protection is necessary in my situation. If at all possible, I would like to avoid locking memory regions as both processors are running RTOSes and avoiding memory locks would make design simpler.
On the question of atomicity, the PCIe 3.0 specification (the only one I have) mentions it a few times.
First there is Section 6.5, Locked Transactions. This is likely not what you need, but I want to document it anyway; basically it's the worst-case version of what you were describing earlier.
Locked Transaction support is required to prevent deadlock in systems that use legacy software
which causes the accesses to I/O devices
But if you do use this, you still need to check that the lock actually held, as the spec notes:
If any read associated with a locked sequence is completed unsuccessfully, the Requester must
assume that the atomicity of the lock is no longer assured, and that the path between the
Requester and Completer is no longer locked
With that said, Section 6.15, Atomic Operations (AtomicOps), is much closer to what you are interested in. There are three types of operations you can perform with AtomicOps (a C model of their semantics follows the list):
FetchAdd (Fetch and Add): Request contains a single operand, the “add” value
Swap (Unconditional Swap): Request contains a single operand, the “swap” value
CAS (Compare and Swap): Request contains two operands, a “compare” value and a “swap” value
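To make these semantics concrete, here is a rough C11 model of what each operation computes; this only illustrates the semantics on ordinary memory and is not a way to issue PCIe AtomicOps:

    #include <stdatomic.h>
    #include <stdint.h>

    /* FetchAdd: return the old value, store old + add. */
    uint32_t model_fetch_add(_Atomic uint32_t *t, uint32_t add)
    {
        return atomic_fetch_add(t, add);
    }

    /* Swap: return the old value, store the new one unconditionally. */
    uint32_t model_swap(_Atomic uint32_t *t, uint32_t swap)
    {
        return atomic_exchange(t, swap);
    }

    /* CAS: store "swap" only if the current value equals "compare";
       either way, return the original value (on failure, "compare" is
       updated to the value actually found). */
    uint32_t model_cas(_Atomic uint32_t *t, uint32_t compare, uint32_t swap)
    {
        atomic_compare_exchange_strong(t, &compare, swap);
        return compare;
    }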
Reading Section 6.15.1, we see that these operations are largely intended for cases where multiple producers/consumers exist on a single bus.
AtomicOps enable advanced synchronization mechanisms that are particularly useful when there are
multiple producers and/or multiple consumers that need to be synchronized in a non-blocking fashion. For example, multiple producers can safely enqueue to a common queue without any explicit locking.
Searching the rest of the specification, I find little mention of atomicity outside of the sections pertaining to these AtomicOps. That would imply to me that the spec only ensures such behavior when these operations are used. However, the context around why AtomicOps were introduced suggests the authors only expected such concerns in a multi-producer/multi-consumer environment, which yours clearly is not.
The last place I would suggest looking to answer your question is Section 2.4, Transaction Ordering. Note that I am fairly sure the idea of transactions "passing" others only makes sense with switches in the middle, as switches can make such decisions; once you put bits on the bus, in your case there is no going back. So this likely only applies if you place a switch in there.
Your concern is whether a write can bypass a read, the write being posted and the read being non-posted.
A3, A4 A Posted Request must be able to pass Non-Posted Requests to avoid deadlocks.
So in general the write is allowed to bypass the read to avoid deadlocks.
With that concern raised, I do not believe it is possible for the write to bypass the read on your system, since there is no device on the bus to do this transaction reordering. Since you have RTOSes, I highly doubt they are enqueuing and reordering the PCIe transactions before sending them, although I have not looked into that personally.
atomic_compare_exchange_strong_explicit(mem, old, new, <mem_order>, <mem_order>);
ftruncate(fd, <size>);
All I want is that these two lines of code always occur without any interference (WITHOUT USING LOCKS). Immediately after the CAS, ftruncate(2) should be called. I read a short description of memory orders, and although I don't understand them much, they seemed to make this possible. Is there any way to do this?
Your title asks for the things to occur in order. That's easy, and C basically does that automatically with memory_order_seq_cst; all visible side-effects of the CAS will appear before any from ftruncate.
(Not strictly required by the ISO C standard, but in practice real implementations implement seq-cst with a full barrier, except AArch64 where STLR doesn't stall to drain the store buffer unless/until there's a LDAR while the seq-cst store is still in the store buffer. But a system call is definitely going to also include a full barrier.)
Within the thread doing the operation, the atomic is Sequenced Before the system call.
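As a sketch of that same-thread ordering (mem, fd and new_size are placeholders, and nothing here stops another thread from acting between the two calls):

    #include <stdatomic.h>
    #include <unistd.h>

    void cas_then_truncate(_Atomic int *mem, int expected, int desired,
                           int fd, off_t new_size)
    {
        /* seq_cst CAS: sequenced before the system call within this thread. */
        atomic_compare_exchange_strong_explicit(mem, &expected, desired,
                                                memory_order_seq_cst,
                                                memory_order_seq_cst);
        /* Another thread may still run between these two statements;
           the pair is not one atomic transaction. */
        ftruncate(fd, new_size);
    }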
What kind of interference are you worried about? Some other thread changing the size of the file? You can't prevent that race condition.
There's no way to combine some operation on memory + a system call into a single atomic transaction. You would need to use a hypothetical system call that atomically does what you want. (Presumably it would have to do locking inside the kernel to make a file operation and a memory modification appear as one atomic transaction.) e.g. the Linux futex system call atomically does a couple things, but of course there's nothing like this for any other operations.
Or you need locking. (Or to suspend all other threads of your process somehow.)
The semantics of Linux's Asynchronous file IO (AIO) are well described in the man pages of io_setup(2), io_submit(2) and io_getevents(2).
However, without diving into the block IO subsystem, the operational side of the implementation is a little less clear.
An aio_context allocates a queue for sending io_events back to a specific client in user-space. But is there more to it?
Consider a file read sequentially, chunk by chunk. Can requests, especially in Direct IO (DIO), be collated? What if requests for two files are interleaved into one aio_context? What if requests for one file are sent to two different aio_contexts?
How are requests prioritized and scheduled in the above cases, with one or multiple aio_contexts?
Is it possible that requests from two aio_contexts get interleaved at some point? (Occasioning more seek latency than intended.)
Does the thread or the CPU calling io_submit influence how it is scheduled? Is the NUMA node containing the target buffer taken into consideration?
More broadly, to which hardware resources (NUMA nodes, CPU cores, physical drives, file-systems and files) should aio_contexts be assigned, and at which level of granularity?
Maybe it doesn't really matter and aio_contexts are no more than an abstraction for user-space programs.
I'm asking since I have observed a performance decrease when concurrently reading multiple files, each with its own aio_context, compared to a manual round-robin serialization of chunk requests into a single aio_context.
You can mix requests freely in a single context, and I would do so. Otherwise you have to poll two separate contexts, doubling the number of syscalls.
Requests to a context are passed to the kernel's async IO VFS layer. Multiple files, multiple contexts, multiple processes or users making the requests: it all ends up in the same layer. The VFS layer then sends the requests to the relevant filesystems or block devices, and all the usual collation and such happens naturally. (A sketch of a single shared context follows.)
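As a sketch of mixing two files in one context, using the libaio wrappers over io_setup/io_submit/io_getevents (link with -laio; the fds, sizes and alignment here are placeholders):

    #include <libaio.h>
    #include <stdlib.h>

    int read_two_files_one_ctx(int fd_a, int fd_b)
    {
        io_context_t ctx = 0;
        if (io_setup(2, &ctx) < 0)
            return -1;

        void *buf_a, *buf_b;
        posix_memalign(&buf_a, 4096, 4096);   /* O_DIRECT wants aligned buffers */
        posix_memalign(&buf_b, 4096, 4096);

        struct iocb cb_a, cb_b;
        struct iocb *cbs[2] = { &cb_a, &cb_b };
        io_prep_pread(&cb_a, fd_a, buf_a, 4096, 0);
        io_prep_pread(&cb_b, fd_b, buf_b, 4096, 0);

        if (io_submit(ctx, 2, cbs) != 2)
            return -1;

        /* One io_getevents call services both files; with two contexts
           this polling would have to be done twice. */
        struct io_event ev[2];
        int got = io_getevents(ctx, 2, 2, ev, NULL);

        io_destroy(ctx);
        free(buf_a);
        free(buf_b);
        return got;
    }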
Overlapping requests to the same file, through one or more contexts at the same time, are I think undefined behavior: they could be ordered one way or the other, and the later request could be processed first, for example. So you need to write your own synchronization if strict ordering is required, the same as with one or more threads doing read/write calls in parallel.
Prioritization and scheduling will depend on the lower layers. Afaik block devices will reorder requests so they happen in increasing block numbers (elevator code) to minimize seek times on rotating disks.
Yes, requests from different contexts and normal read/write calls will get interleaved.
I think the requesting process and NUMA and such are completely ignored.
Note: when dealing with files, make sure the filesystem supports the Linux async IO hooks, and you might need to use O_DIRECT on open(), with all its consequences.
A simple way to test this, I found, is to make lots of requests to a file in one io_submit() call and then check whether they all finish simultaneously. If the filesystem falls back to sync IO, then everything submitted will finish at the same time; a sketch of this probe follows.
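A hedged sketch of that probe, using libaio again (the counts and sizes are arbitrary): submit many reads in one io_submit() and time the call itself. If the filesystem fell back to sync IO, io_submit does all the work and every event is already complete afterwards.

    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    void probe_async(int fd)
    {
        enum { N = 64, SZ = 4096 };
        io_context_t ctx = 0;
        io_setup(N, &ctx);

        struct iocb cbs[N], *ptrs[N];
        for (int i = 0; i < N; i++) {
            void *buf;
            posix_memalign(&buf, SZ, SZ);
            io_prep_pread(&cbs[i], fd, buf, SZ, (long long)i * SZ);
            ptrs[i] = &cbs[i];
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        io_submit(ctx, N, ptrs);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* A truly async submit returns quickly and completions trickle in;
           a sync fallback makes submit take as long as all the reads. */
        printf("io_submit took %.3f ms\n",
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);

        struct io_event ev[N];
        io_getevents(ctx, N, N, ev, NULL);
        io_destroy(ctx);
    }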
Assume that a large file is saved on disk and I want to run a computation on every chunk of data contained in the file.
The C/C++ code that I would write to do so would load part of the file, then do the processing, then load the next part, then do the processing of this next part, and so on.
If I am, however, interested in doing this in the shortest possible time, I could instead do the following: first, tell the DMA controller to load the first part of the file. When this part is loaded, tell the DMA controller to load the second part (into some other part of memory) and then immediately start processing the first part.
If I get an interrupt from the DMA while processing the first part, I finish the first part and afterwards tell the DMA to overwrite it with the third part of the file; then I process the second part.
If I do not get an interrupt from the DMA while processing the first part, I finish the first part and wait for the interrupt from the DMA.
Depending on how long the processing takes relative to the disk read, this should be up to twice as fast. In reality, of course, one would have to measure. But that is not the question I am asking.
The question is: is it possible to do this (a) in C using some non-standard extension, or (b) in assembly? Or do operating systems not allow such things in general? The question is meant primarily in a single-thread context, although I would also be interested to know how to do it with two threads. Also, I am not interested in specific code; this is more of a theoretical question.
You're right that you will not get the benefit of this by default, because a blocking read stops your thread from doing any processing. Hans is right that modern OSes already take care of all the little details of DMA and interrupt completion routines.
You need to use the architecture you've described, issuing a request in advance of when you will use the data. Issue asynchronous I/O requests (on Windows these are called OVERLAPPED). Then the flow will go exactly as you envision, but the DMA and interrupts are handled in the drivers.
On Windows, take a look at FILE_FLAG_OVERLAPPED (to CreateFile) and ReadFile (if you like events) or ReadFileEx (if you like callbacks). If you don't have to process the data in any particular order, then add a completion port to the mix, which queues the completion responses.
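Here is a minimal overlapped-read sketch (the path is assumed to exist, and error handling is trimmed for brevity):

    #include <windows.h>
    #include <stdio.h>

    void overlapped_read(const char *path)
    {
        HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        char buf[4096];
        OVERLAPPED ov = {0};
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        /* ReadFile returns immediately with ERROR_IO_PENDING; the driver
           performs the transfer and signals the event when done. */
        if (!ReadFile(h, buf, sizeof buf, NULL, &ov) &&
            GetLastError() == ERROR_IO_PENDING) {
            /* ... process other data here while the read is in flight ... */
            DWORD got;
            GetOverlappedResult(h, &ov, &got, TRUE);  /* wait for completion */
            printf("read %lu bytes\n", (unsigned long)got);
        }
        CloseHandle(ov.hEvent);
        CloseHandle(h);
    }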
On Linux, OSX, and many other Unix-like OSes, look at aio_read. Or fadvise. Or use mmap with madvise.
And you can get these benefits without even writing native code. .NET recently added the ReadAsync method to its FileStream, which can be used with continuation-passing style in the form of Task objects, with async/await syntactic sugar in the C# compiler.
Typically, in a multi-mode (user/system) operating system, you do not have access to direct dma or to interrupts. In systems that extend those features from kernel(system) mode down to user mode, the overhead eliminates the benefit of using them.
Ignoring that what you're asking to do requires a very specialized environment to support it, the idea is sound and common: declaring two (or more) buffers to enable DMA to the next while you process the first. When two buffers are used they're sometimes referred to as ping-pong buffers.
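As a user-space approximation of the ping-pong scheme, here is a sketch using POSIX aio_read, with process() standing in for the computation (the chunk size is arbitrary):

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>

    #define CHUNK 65536

    /* Placeholder for the per-chunk computation. */
    extern void process(const char *data, ssize_t len);

    void pingpong(int fd)
    {
        static char bufs[2][CHUNK];
        struct aiocb cb[2];
        off_t off = 0;
        int cur = 0;

        /* Prime the first read. */
        memset(&cb[0], 0, sizeof cb[0]);
        cb[0].aio_fildes = fd;
        cb[0].aio_buf    = bufs[0];
        cb[0].aio_nbytes = CHUNK;
        cb[0].aio_offset = off;
        aio_read(&cb[0]);

        for (;;) {
            /* Wait for the in-flight read into the current buffer. */
            const struct aiocb *list[1] = { &cb[cur] };
            while (aio_error(&cb[cur]) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb[cur]);
            if (n <= 0)
                break;                        /* EOF or error */
            off += n;

            /* Kick off the read into the other buffer... */
            int nxt = 1 - cur;
            memset(&cb[nxt], 0, sizeof cb[nxt]);
            cb[nxt].aio_fildes = fd;
            cb[nxt].aio_buf    = bufs[nxt];
            cb[nxt].aio_nbytes = CHUNK;
            cb[nxt].aio_offset = off;
            aio_read(&cb[nxt]);

            /* ...and overlap the computation with that read. */
            process(bufs[cur], n);
            cur = nxt;
        }
    }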
In my program, I hold two files open for writing: a content-file, containing chunks of data, and an index-file, containing a map of which chunks of data have been written so far.
I would like to flush them both to disk as performantly as possible, with the only constraint that the blocks in the content-file must be written before the corresponding blocks in the index-file (naturally).
The catch is that I would like to avoid blocking, i.e. doing an fsync, for both latency and throughput reasons.
Any ideas?
I don't think you can do this easily in a single execution path. You need fsync to guarantee the write reaches disk, and fsync is going to have to wait for the write.
I suspect it is possible (but not easy) to do this by delegating the writing task to a separate thread or process. Generate the data in your existing program and 'write' it to the second thread/process using any method that looks sensible; this can be non-blocking. The second thread would then write any new data to your content-file, then fsync, then write the index-file, then check for new data again (sketched below). Key design decisions relate to how you separate the two execution paths, how you communicate between them, and whether you need to report the write back to the main program. This could still have latency and throughput issues, but that's part of the cost of choosing to keep the index-file and content-file in sync. At least there would be a chance of getting work done while waiting on the disk.
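A rough sketch of that worker under a few assumptions: queue_pop is a hypothetical blocking queue filled by the main program, and the index layout is simplified to a single pwrite per entry.

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct block {
        const void *data;
        size_t      len;
        off_t       off;        /* offset in the content-file */
        off_t       index_off;  /* offset of this entry in the index-file */
    };

    /* Hypothetical blocking queue; pop waits until an item is available. */
    extern struct block *queue_pop(void);

    extern int content_fd, index_fd;

    void *writer_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            struct block *b = queue_pop();   /* producer never waits on disk */
            pwrite(content_fd, b->data, b->len, b->off);
            fsync(content_fd);               /* content durable before index */
            pwrite(index_fd, &b->off, sizeof b->off, b->index_off);
            fsync(index_fd);
        }
        return NULL;
    }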
It could be worth looking at the source of any of the transactional databases to see whether this logic is encapsulated well enough to be reusable. You could also investigate the sync option when mounting the file system for the content-file.