Using interrupts during reading a file from disk - c

Assume that a large file is saved on disk and I want to run a computation on every chunk of data contained in the file.
The C/C++ code that I would write to do so would load part of the file, then do the processing, then load the next part, then do the processing of this next part, and so on.
If, however, I want to do this in the shortest possible time, I could do the following: first, tell the DMA controller to load the first part of the file. When this part is loaded, tell the DMA controller to load the second part (into some other region of memory) and immediately start processing the first part.
If I get an interrupt from the DMA while processing the first part, I finish the first part and afterwards tell the DMA to overwrite it with the third part of the file; then I process the second part.
If I do not get an interrupt from the DMA while processing the first part, I finish the first part and wait for the DMA's interrupt.
Depending on how long the processing takes relative to the disk read, this should be up to twice as fast. In reality, of course, one would have to measure. But that is not the question I am asking.
The question is: Is it possible to do this a) in C using some non-standard extension or b) in assembly? Or do operating systems not allow such things in general? The question is meant primarily in a single-thread context, although I would also be interested to know how to do it with two threads. Also, I am not interested in specific code; this is more of a theoretical question.

You're right that you will not get the benefit of this by default, because a blocking read stops your thread from doing any processing. Hans is right that modern OSes already take care of all the little details of DMA and interrupt completion routines.
You need to use the architecture you've described: issue a request in advance of when you will use the data. Issue asynchronous I/O requests (on Windows these are called OVERLAPPED). Then the flow will go exactly as you envision, but the DMA and interrupts are handled in the drivers.
On Windows, take a look at FILE_FLAG_OVERLAPPED (to CreateFile) and ReadFile (if you like events) or ReadFileEx (if you like callbacks). If you don't have to process the data in any particular order, then add a completion port to the mix, which queues the completion responses.
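For concreteness, here is a minimal double-buffered sketch of that pattern with events, not the author's code: one read is always in flight while the other buffer is being processed. The file name, chunk size and process() are placeholders, and error handling is stripped to the bare minimum.

    #include <windows.h>

    #define CHUNK (1 << 20)

    static void process(const char *data, DWORD len)    /* placeholder computation */
    {
        (void)data; (void)len;
    }

    static int start_read(HANDLE h, char *buf, LONGLONG off, OVERLAPPED *ov)
    {
        ResetEvent(ov->hEvent);
        ov->Offset = (DWORD)off;
        ov->OffsetHigh = (DWORD)(off >> 32);
        /* returns immediately; the driver completes the I/O in the background */
        if (!ReadFile(h, buf, CHUNK, NULL, ov) && GetLastError() != ERROR_IO_PENDING)
            return 0;                                    /* immediate failure, e.g. EOF */
        return 1;
    }

    int main(void)
    {
        static char buf[2][CHUNK];
        OVERLAPPED ov[2] = {0};
        ov[0].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[1].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

        HANDLE h = CreateFileA("bigfile.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        LONGLONG offset = 0;
        int cur = 0;
        int pending = start_read(h, buf[cur], offset, &ov[cur]);

        while (pending) {
            DWORD got = 0;
            if (!GetOverlappedResult(h, &ov[cur], &got, TRUE) || got == 0)
                break;                                   /* error or end of file */
            offset += got;

            int nxt = 1 - cur;
            pending = start_read(h, buf[nxt], offset, &ov[nxt]);  /* prefetch next chunk... */

            process(buf[cur], got);                      /* ...while crunching this one */
            cur = nxt;
        }
        CloseHandle(h);
        return 0;
    }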
On Linux, OS X, and many other Unix-like OSes, look at aio_read. Or posix_fadvise. Or use mmap with madvise.
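A roughly equivalent sketch with POSIX aio_read(), again with placeholder names and minimal error handling (on older glibc you may need to link with -lrt):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)

    static char buf[2][CHUNK];

    static void process(const char *data, ssize_t len)   /* placeholder computation */
    {
        (void)data; (void)len;
    }

    static void start_read(struct aiocb *cb, int fd, char *b, off_t off)
    {
        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf    = b;
        cb->aio_nbytes = CHUNK;
        cb->aio_offset = off;
        aio_read(cb);                         /* queue the read and return at once */
    }

    int main(void)
    {
        int fd = open("bigfile.bin", O_RDONLY);
        if (fd < 0) return 1;

        struct aiocb cb[2];
        off_t offset = 0;
        int cur = 0;

        start_read(&cb[cur], fd, buf[cur], offset);
        for (;;) {
            const struct aiocb *list[1] = { &cb[cur] };
            while (aio_error(&cb[cur]) == EINPROGRESS)
                aio_suspend(list, 1, NULL);   /* wait for the current chunk */
            ssize_t got = aio_return(&cb[cur]);
            if (got <= 0) break;              /* EOF or error */
            offset += got;

            int nxt = 1 - cur;
            start_read(&cb[nxt], fd, buf[nxt], offset);   /* prefetch next chunk... */

            process(buf[cur], got);           /* ...while the next read runs */
            cur = nxt;
        }
        close(fd);
        return 0;
    }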
And you can get these benefits without even writing native code. .NET recently added the ReadAsync method to its FileStream, which can be used with continuation-passing style in the form of Task objects, with async/await syntactic sugar in the C# compiler.

Typically, in a multi-mode (user/system) operating system, you do not have access to direct DMA or to interrupts. In systems that extend those features from kernel (system) mode down to user mode, the overhead eliminates the benefit of using them.
Setting aside the fact that what you're asking to do requires a very specialized environment to support it, the idea is sound and common: declare two (or more) buffers so the DMA can fill the next one while you process the current one. When two buffers are used they're sometimes referred to as ping-pong buffers.
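For illustration only, here is a bare-metal flavoured sketch of the ping-pong idea. dma_start_read(), process() and DMA_ISR() are hypothetical hooks for whatever DMA controller and interrupt vector a given platform provides, so treat this as structure rather than working code; it also assumes the total size is a multiple of the chunk size.

    #include <stdbool.h>
    #include <stdint.h>

    #define CHUNK 4096
    static uint8_t buf[2][CHUNK];
    static volatile bool dma_done;                    /* set by the DMA-complete interrupt */

    /* hypothetical platform hooks */
    extern void dma_start_read(uint8_t *dst, uint32_t nbytes, uint32_t disk_off);
    extern void process(const uint8_t *data, uint32_t nbytes);

    void DMA_ISR(void)                                /* hypothetical interrupt handler */
    {
        dma_done = true;
    }

    void read_and_process(uint32_t total)             /* total assumed multiple of CHUNK */
    {
        uint32_t off = 0;
        int cur = 0;

        dma_done = false;
        dma_start_read(buf[cur], CHUNK, off);         /* fill the first buffer */

        while (off < total) {
            while (!dma_done) { /* spin, or sleep/WFI */ }
            dma_done = false;
            off += CHUNK;

            int nxt = 1 - cur;
            if (off < total)
                dma_start_read(buf[nxt], CHUNK, off); /* fetch the next buffer... */

            process(buf[cur], CHUNK);                 /* ...while we crunch this one */
            cur = nxt;
        }
    }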

Related

How to do multiple aio_writes to file

When doing multiple aio_writes to a file, is it necessary to wait (e.g. with aio_suspend or similar) before starting the next one? The documentation says that writes are enqueued, so does that mean they are written in order? Also, I can track the offset and make sure nothing is ever overwritten (I'm assuming that a failed write could leave a gap in this case).
You've asked two questions:
Can I issue new aios before the previous aio finishes?
If so, are aios finished in order (i.e. the same as the issue order)?
Essentially, aio works asynchronously only for files opened with O_DIRECT, so I assume that in what follows.
The answers are:
Yes. That's one essential use of asynchronous I/O.
No. You can assume almost nothing about their completion order.
aio (different from POSIX aio, see also here) is Linux's native support for asynchronous I/O. You submit async I/O to the kernel; the kernel records it and performs it asynchronously in the background. Multiple "outstanding" async I/Os are permitted and are tracked by the kernel.
As for ordering, write ordering in this case, there are many places inside the kernel where requests can be reordered.
A typical reordering happens in the block layer, which accepts disk write requests from the filesystem layer above and delivers them to the device driver layer below. The block layer contains several request schedulers, which rearrange the I/O into a "suitable" order; what counts as suitable depends on the scheduler type. For example, a scheduler tuned for HDDs may hold a request back in the hope of merging it with neighbouring ones, which reorders the stream. For more info, here is an introduction.
If you need strict ordering, you can enforce it manually by waiting for outstanding aios to finish before issuing more, as in the sketch below.
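A sketch of the pattern from the question, using POSIX aio_write() (the answer above is about Linux native aio, but the "assume nothing about completion order" rule is the same): several writes are queued at tracked, non-overlapping offsets and the code waits for all of them before trusting the file contents. The file name and sizes are placeholders.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define NREQ  4
    #define BLOCK 4096

    int main(void)
    {
        int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return 1;

        static char data[NREQ][BLOCK];
        struct aiocb cb[NREQ];
        const struct aiocb *list[NREQ];

        /* queue all writes up front, each at its own offset so nothing overlaps */
        for (int i = 0; i < NREQ; ++i) {
            memset(&cb[i], 0, sizeof cb[i]);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = data[i];
            cb[i].aio_nbytes = BLOCK;
            cb[i].aio_offset = (off_t)i * BLOCK;      /* tracked offset */
            aio_write(&cb[i]);
            list[i] = &cb[i];
        }

        /* completion order is not guaranteed, so wait until every request is done */
        for (int i = 0; i < NREQ; ++i) {
            while (aio_error(&cb[i]) == EINPROGRESS)
                aio_suspend(list, NREQ, NULL);
            if (aio_return(&cb[i]) != BLOCK)
                return 1;       /* a failed or short write would leave a gap at i*BLOCK */
        }
        close(fd);
        return 0;
    }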

Are writes on the PCIe bus atomic?

I am a newbie to PCIe, so this might be a dumb question. This seems like fairly basic information to ask about PCIe interfaces, but I am having trouble finding the answer so I am guessing that I am missing some information which makes the answer obvious.
I have a system in which I have an ARM processor (host) communicating to a Xilinx SoC via PCIe (device). The endpoint within the SoC is an ARM processor as well.
The external ARM processor (host) is going to be writing to the register space of the SoC's ARM processor (device) via PCIe. This will command the SoC to do various things. That register space will be read-only with respect to the SoC (device). The external ARM processor (host) will make a write to this register space, and then signal an interrupt to indicate to the SoC that new parameters have been written and it should process them.
My question is: are the writes made by the external ARM (host) guaranteed to be atomic with respect to the reads by the SoC (device)? In conventional shared memory situations, a write to a single byte is guaranteed to be an atomic operation (i.e. you can never be in a situation where the reader has read the first 2 bits of the byte, but before it reads the last 6 bits the writer replaces them with a new value, leading to garbage data). Is this the case in PCIe as well? And if so, what is the "unit" of atomicity? Are all bytes in a single transaction atomic with respect to the entire transaction, or is each byte atomic only in relation to itself?
Does this question make sense?
Basically I want to know to what extent memory protection is necessary in my situation. If at all possible, I would like to avoid locking memory regions as both processors are running RTOSes and avoiding memory locks would make design simpler.
On the question of atomicity, the PCIe 3.0 specification (the only one I have) mentions it a few times.
First, there is Section 6.5, Locked Transactions. This is likely not what you need, but I want to document it anyway; basically it is the worst-case version of what you were describing earlier.
Locked Transaction support is required to prevent deadlock in systems that use legacy software which causes the accesses to I/O devices
But if you do use it, you need to check the result properly anyway, as the spec notes:
If any read associated with a locked sequence is completed unsuccessfully, the Requester must assume that the atomicity of the lock is no longer assured, and that the path between the Requester and Completer is no longer locked
With that said, Section 6.15, Atomic Operations (AtomicOps), is much closer to what you are interested in. There are three types of operations you can perform with AtomicOps:
FetchAdd (Fetch and Add): Request contains a single operand, the “add” value
Swap (Unconditional Swap): Request contains a single operand, the “swap” value
CAS (Compare and Swap): Request contains two operands, a “compare” value and a “swap” value
Reading Section 6.15.1, we see that these operations are largely intended for cases where multiple producers/consumers exist on a single bus.
AtomicOps enable advanced synchronization mechanisms that are particularly useful when there are multiple producers and/or multiple consumers that need to be synchronized in a non-blocking fashion. For example, multiple producers can safely enqueue to a common queue without any explicit locking.
Searching the rest of the specification, I find little mention of atomicity outside of the sections about these AtomicOps. That implies to me that the spec only guarantees such behavior when these operations are used; however, the context around why they were introduced suggests the authors only expected such concerns in a multi-producer/consumer environment, which yours clearly is not.
The last place I would suggest looking to answer your question is Section 2.4, Transaction Ordering. Note that I am fairly sure the idea of transactions "passing" others only makes sense with switches in the middle, since switches can make such decisions; once you put bits on the bus, in your case, there is no going back. So this likely only applies if you place a switch in between.
Your concern is whether a write can pass a read, the write being posted and the read being non-posted.
A3, A4 A Posted Request must be able to pass Non-Posted Requests to avoid deadlocks.
So in general the write is allowed to bypass the read to avoid deadlocks.
With that concern raised, I do not believe it is possible for the write to pass the read on your system, since there is no device on the bus to do this transaction reordering. Since you are running RTOSes, I highly doubt they are enqueuing the PCIe transactions and reordering them before sending, although I have not looked into that personally.

how to jump out of and resume at arbitrary locations in c-code without refactoring

BACKGROUND
I'm integrating micropython into my custom cooperative multitasking OS (no, my company won't change to preemptive).
Micropython uses garbage collection, and this takes much more time than my allotted time slice, even when there's nothing to collect; i.e. I called it twice in a row, timed it, and it still takes A LOT of time.
OBVIOUS SOLUTION
Yes, I could refactor the micropython source, but then whenever there's a change . . .
IDEAL SOLUTION
The ideal solution would involve calling some function void pause(&func_in_call_stack) that would jump out, leaving the stack intact, all the way to the function that is at the top of the call stack, say main. And resume would . . . resume.
QUESTION
Is it possible, using C and assembly, to implement pause?
UPDATE
As I wrote this, I realize that the C-based exception handling code nlr_push()/nlr_pop() already does most of what I need.
Your question is about implementing context switching. As we've covered fairly exhaustively in comments, support for context switching is among the key characteristics of any multitasking system, and of a multitasking OS in particular. Inasmuch as you posit no OS support for context switching, you are talking about implementing multitasking for a single-tasking OS.
That you describe the OS as providing some kind of task queue ("to relinquish control, a thread must simply exit its run loop") does not change this, though to some extent we could consider it a question of semantics. I imagine that a typical task for such a system would operate by creating and executing a series of microtasks (the work of the "run loop"), providing a shared, mutable memory context to each. Such a run loop could safely exit and later be reentered, to resume generating microtasks from where it left off.
Dividing tasks into microtasks at boundaries defined by affirmative application action (i.e. your pause()) would depend on capabilities beyond those provided by ISO C. Very likely, however, it could be done with the help of some assembly, plus some kind of framework support. You need at least these things:
A mechanism for recording a task's current execution context -- stack, register contents, and maybe other details. This is inherently system-specific.
A task-associated place to store recorded execution context. There are various ways in which such a thing could be established. Promising alternatives include (i) provided by the OS; (ii) provided by some kind of userland multi-tasking system running on top of the OS; (iii) built into the task by the compiler.
A mechanism for restoring recorded execution context -- this, too, will be system-specific.
If the OS does not provide such features, then you could consider the (now removed) POSIX context system as a model interface for recording and restoring execution context. (See makecontext(), swapcontext(), getcontext(), and setcontext().) You would need to implement those yourself, however, and you might want to wrap them to present a simpler interface to applications. Details will be highly dependent on hardware and underlying OS.
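Under those caveats, here is a sketch of what a cooperative pause()/resume() can look like when built on the obsolescent <ucontext.h> interface (glibc still ships it, but current POSIX no longer guarantees it); task_pause(), long_running_work() and the stack size are made-up names for illustration only.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx;              /* context of the run loop / scheduler */
    static ucontext_t task_ctx;              /* context of the long-running task    */
    static char task_stack[64 * 1024];

    static void task_pause(void)             /* called from deep inside the task */
    {
        swapcontext(&task_ctx, &main_ctx);   /* save task state, jump back to main */
    }

    static void long_running_work(void)
    {
        for (int step = 0; step < 3; ++step) {
            printf("work step %d\n", step);
            task_pause();                    /* yield to the run loop */
        }
    }

    int main(void)
    {
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp   = task_stack;
        task_ctx.uc_stack.ss_size = sizeof task_stack;
        task_ctx.uc_link = &main_ctx;        /* return here when the task finishes */
        makecontext(&task_ctx, long_running_work, 0);

        /* the "run loop": resume the task once per slice until it has finished */
        for (int slice = 0; slice < 4; ++slice) {
            printf("resuming task (slice %d)\n", slice);
            swapcontext(&main_ctx, &task_ctx);
        }
        return 0;
    }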
As an alternative, you might implement transparent multitasking support for such a system by providing compilers that emit specially instrumented code (i.e. even more specially instrumented than you otherwise need). For example, consider compilers that emit bytecode for a VM of your own design. The VMs in which the resulting programs run would naturally track the state of the program running within, and could yield after each sequence of a certain number of opcodes.

When does OpenCL data transfer occur?

I've seen a few questions here on Stack Overflow dealing with the same issues, but no definite answer. I thought I'd ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from the host to the OpenCL device occurs? Can you tell me the exact memory-transfer behavior of the functions below, that is, what data is transferred or created, if any, when each of them is invoked:
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as part of a kernel arg) stay on the device for further manipulation after a kernel returns? If not, is there a way to do just that?
Buffer copies are tied to command queues. The easiest way to sync a command queue with the host is clFinish().
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set blocking parameter to false to queue everything quickly)
(set blocking to true if you want a synchronous write here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (to quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before this
Now you can query that event's profiling data to check when the transfer started and when it ended.
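For example, a small sketch of that query (it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and that the queue, buffer and host data already exist):

    #include <stdio.h>
    #include <CL/cl.h>

    /* Times one host-to-device transfer via event profiling. */
    static void timed_write(cl_command_queue queue, cl_mem buffer,
                            const void *host_data, size_t nbytes)
    {
        cl_event ev;
        clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, nbytes, host_data,
                             0, NULL, &ev);
        clFinish(queue);                               /* make sure it has executed */

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof start, &start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof end, &end, NULL);
        printf("transfer took %llu ns\n", (unsigned long long)(end - start));
        clReleaseEvent(ev);
    }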
To let a buffer stay on the device, create it on the device in the first place and don't migrate it to another device. Using only the CL_MEM_READ_WRITE flag in clCreateBuffer() is enough to make it a real device-side buffer until you release it.
CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR use host memory; the device maps it to its cores. This is faster for streaming data in and out because it avoids extra data movement on the host side. If you always need to use device memory, such as fast GDDR5 or HBM, then you should not use these flags.
Copy to the device once, then use it as much as you want, if the device has its own memory, of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM, so it is much faster to use the CL_MEM_..._HOST_PTR flags, and especially USE_HOST_PTR.
To check whether a device shares RAM with the CPU, query its CL_DEVICE_HOST_UNIFIED_MEMORY property.
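For example, a one-call sketch of that query:

    #include <CL/cl.h>

    /* Returns non-zero if the device shares memory with the host. */
    static int shares_host_memory(cl_device_id dev)
    {
        cl_bool unified = CL_FALSE;
        clGetDeviceInfo(dev, CL_DEVICE_HOST_UNIFIED_MEMORY,
                        sizeof unified, &unified, NULL);
        return unified == CL_TRUE;
    }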
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device
Even without map/unmap commands prior to kernel execution, my computer behaves the same, but I use map/unmap just to be safe, and it doesn't cost many cycles.
Edit: if you want to make sure a command doesn't start before you want it to, you can add a user event to the event-list input parameter of the buffer-write command. Then you can fire the user event to let the write start, because commands wait for all events in that list to be completed before continuing (if any are specified in the event-list parameter).
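A sketch of that gating trick (the context, queue, buffer and host data are assumed to exist already):

    #include <CL/cl.h>

    /* Queues a buffer write that cannot start until we explicitly release it. */
    static void gated_write(cl_context context, cl_command_queue queue,
                            cl_mem buffer, const void *host_data, size_t nbytes)
    {
        cl_int err;
        cl_event gate = clCreateUserEvent(context, &err);

        /* queued now, but held back until 'gate' is marked complete */
        clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, nbytes, host_data,
                             1, &gate, NULL);

        /* ...do other preparation work here... */

        clSetUserEventStatus(gate, CL_COMPLETE);       /* let the write begin */
        clFinish(queue);
        clReleaseEvent(gate);
    }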

Pushing code towards kernel or user space, for performance reasons?

Originally I thought that to make code faster it would be better to reduce the transitions between kernel and user space by pushing more of the code into the kernel. However, I have read in a few forums, like SO, that the opposite is actually done: more of the code is pushed into user space. Why is this? It seems counter-intuitive: putting more of the code into user space still requires kernel-user transitions, whereas putting the code in the kernel doesn't require kernel-user transitions.
In case anyone asks: I am thinking about an application processing packet data.
EDIT
So, more details: I am thinking about what happens when packet data arrives. I want to rewrite the network stack, cut out code which isn't applicable to my packet processing, and have zero copy: putting the packet data somewhere the user program can access it as quickly as possible.
The kernel is a time-sensitive area; it's where your ISRs, timer-tick routines, and hardware-critical sections reside. Because of this, the objective is to keep kernel code small and tight: get in, get your work done, and get out.
In your case you're getting packets from the network; that's a hardware-dependent task (you need to get data from the lower network layers), so get your data, clear the buffers, and send it via a DMA transfer to user space; then do your processing in user space.
From my experience: the performance gained by executing your code in the kernel will not outweigh the performance lost overall by executing more code in the kernel.
If you expect your code to go into the official kernel release, "shuffling user mode parts of it into the kernel" is probably a bad idea as a rule.
Of course, if you can prove that doing so is the BEST (subjective, I know) way to achieve better performance, and the cost is acceptable (in terms of extra code in the kernel -> more maintenance burden on the kernel, bigger kernel -> more complaints about the kernel being "too big", etc.), then by all means follow that route.
But in general, it's probably better to approach this by doing more work in user mode and making the kernel-mode task smaller, if that is at all an alternative. Without knowing exactly what you are doing in the kernel and what you are doing in user mode, it's hard to say for sure what you should or shouldn't do. But, for example, batching up a dozen "items" into a block that is ONE request for the kernel to do something is a better option than calling the kernel a dozen times; see the sketch below.
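As one concrete illustration of that batching idea (not necessarily the right tool for your packet path), Linux's recvmmsg() hands the kernel a whole array of receive slots in a single call; the port number and buffer sizes here are placeholders.

    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 16
    #define PKT   2048

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);
        bind(s, (struct sockaddr *)&addr, sizeof addr);

        static char bufs[BATCH][PKT];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof msgs);
        for (int i = 0; i < BATCH; ++i) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len  = PKT;
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        /* one user/kernel transition can return up to BATCH datagrams */
        int n = recvmmsg(s, msgs, BATCH, 0, NULL);
        for (int i = 0; i < n; ++i)
            printf("packet %d: %u bytes\n", i, msgs[i].msg_len);
        return 0;
    }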
In response to your edit describing what you are doing:
Would it not be better to pass a user-mode memory region to receive the data, and then just copy into that when the packet arrives? Assuming "all memory is equal" [if it isn't, you have problems with "in place use" anyway], this should work just as well, with less time spent in the kernel.
Transitions from user-mode to kernel-mode take some time and resources, so keeping the code in only one of the modes may increase performance.
As mentioned, in your case the best option is probably to fetch the data as fast as possible, make it available in userland right away, and do the processing in userland. Moving all the processing to kernel level seems unnecessary unless you have a good reason to do so; with no further information, there is no reason to believe you would do it faster in kernel mode than in user mode. All you could spare is a mode transition now and then, which shouldn't be significant.
