Cache coherence issues in a DMA context

Suppose the CPU modifies the value at location x+50 and, because the cache is write-back, does not flush it to main memory.
Meanwhile, a device issues a DMA read request for the range x to x+100.
In that case, how is the CPU informed that it must flush the dirty cache line back?

The DMA circuitry often works directly with the main memory without involving the CPU (and that's the main idea, to free the CPU from doing I/O that can be done elsewhere in the hardware and thus save CPU cycles). So, you may indeed run into cache coherency problems. Microsoft recommends flushing I/O buffers when using DMA.
But some systems do support cache coherency protocols between CPUs and DMA circuits much like between CPUs in multiprocessor systems. The ultimate answer depends on the actual hardware.
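For illustration, on Linux the streaming DMA-API is the usual way a driver deals with this: mapping the buffer flushes (or invalidates) the relevant cache lines on architectures that are not DMA-coherent. A minimal sketch, with a hypothetical device and buffer and abbreviated error handling:

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Sketch: prepare a kernel buffer for a device-initiated DMA read.
 * 'dev' and 'buf' are hypothetical; error handling is abbreviated. */
static int start_dma_read(struct device *dev, void *buf, size_t len)
{
	dma_addr_t bus_addr;

	/* dma_map_single() flushes any dirty cache lines covering 'buf' on
	 * non-coherent architectures, so the device sees the CPU's writes. */
	bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, bus_addr))
		return -ENOMEM;

	/* ... program the device with 'bus_addr' and start the transfer ... */

	/* When the transfer is done, hand the buffer back to the CPU. */
	dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
	return 0;
}
```

On hardware-coherent systems the cache-maintenance part of these calls effectively degenerates to a no-op, so the same driver code works either way.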

There are three approaches I can think of:
1. The memory is marked as un-cacheable.
2. The DMA controller coordinates with the cache controller.
3. The OS guarantees the conflict never happens, e.g. by ensuring the CPU side of the process isn't running.
It depends on the hardware, and the capabilities of the OS.
Ensuring the process is not running isn't too weird on a multi-tasking OS, as DMA on memory owned by a process is likely triggered by the process doing a system call, e.g. a write. The process can be de-scheduled, and other processes run, until the DMA completes.
It may be too much of a constraint to wait for an I/O device to complete, so the DMA controller might instead copy from the process's address space to a secondary buffer.
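A rough sketch of that secondary-buffer idea in kernel-style C; all names are hypothetical and error handling is abbreviated:

```c
#include <linux/dma-mapping.h>
#include <linux/uaccess.h>
#include <linux/gfp.h>

/* Sketch: copy the data out of the process's address space into a
 * kernel-owned, DMA-safe buffer, so the process can be rescheduled
 * while the device works on the copy. */
static int queue_write(struct device *dev, const void __user *user_buf,
		       size_t len)
{
	void *bounce;
	dma_addr_t bus_addr;

	/* dma_alloc_coherent() returns memory the device and CPU agree on,
	 * so no explicit cache flushing is needed for this buffer. */
	bounce = dma_alloc_coherent(dev, len, &bus_addr, GFP_KERNEL);
	if (!bounce)
		return -ENOMEM;

	if (copy_from_user(bounce, user_buf, len)) {
		dma_free_coherent(dev, len, bounce, bus_addr);
		return -EFAULT;
	}

	/* ... start the DMA from 'bus_addr'; free the buffer on completion ... */
	return 0;
}
```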
So if you have a case where this has happened, please outline the example, and the tests you've run.

Related

ISR vs main: what are the trade offs of running in one or the other?

I know it has to do with time and efficiency, and with how ISRs take time away from other processes, but I am unclear why. I am always told to keep ISRs very short, and I am a bit confused as to why that matters so much.
Normally, ISRs come into play when a hardware device needs to interact with the CPU. The device sends an interrupt signal that makes the CPU leave whatever it was doing in order to service the interrupt, and that servicing is what the ISR takes care of.
Now, this depends on many factors, the hardware environment and the nature of the interrupt perhaps being the most relevant, but it usually happens that, in order to service an interrupt properly, ISRs run with interrupts disabled so they cannot themselves be interrupted. This means that the CPU cannot be shared among other processes while it is running ISR code, because the system timer interrupt that drives the scheduler (the part of the kernel that creates the illusion that the CPU can do several tasks at the same time) won't fire.
So, if your ISR takes too long to perform a given operation on the device, the system is affected as a whole, because the percentage of time the CPU is available to the rest of the processes drops below normal. This was especially noticeable on old systems with PIO hard disks, which interrupt the CPU for every disk sector they want to transfer and rely on the ISR to do the actual copy. With heavy disk traffic you may notice things like the mouse moving jerkily, because the interrupt the mouse sends to the CPU is not attended to promptly.
OSes like Linux allow ISRs to defer time-consuming operations with hardware devices to tasklets: deferred kernel work that runs with interrupts enabled, so the CPU can still be shared, while keeping the atomic nature of hardware device operations (the OS ensures that no more than one instance of the tasklet function associated with the ISR runs in the system at the same time). The PIO transfer from disk to kernel buffers is an example of such an operation.
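A minimal sketch of that split, using the classic Linux tasklet interface (the tasklet prototype changed around kernel 5.9, and the device-specific calls here are hypothetical):

```c
#include <linux/interrupt.h>

/* Deferred part: the time-consuming work (e.g. the PIO copy) runs here,
 * outside hard-interrupt context, with interrupts enabled. */
static void my_tasklet_fn(unsigned long data)
{
	/* ... heavy work with the device ... */
}

/* Classic (pre-5.9) declaration form: (name, function, data). */
DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* The ISR itself, registered elsewhere with request_irq(): it only
 * acknowledges the device and schedules the tasklet. */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	/* ... quickly silence the device ... */
	tasklet_schedule(&my_tasklet);
	return IRQ_HANDLED;
}
```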
Some clarifications with respect to the accepted answer.
Interrupts are not necessarily disabled while an ISR is running, and that is not necessarily the reason why the kernel processes all interrupts before returning to threads.
There is the concept of interrupt priorities. An interrupt of higher priority will preempt a running ISR: if the timer interrupt is of higher priority than the running ISR, it will run. However, a kernel will not handle context switches at this time, but rather defer them until all queued/pending ISRs have run.
Also, on some processors (e.g. the ARM Cortex-M3), handling an interrupt is a mode of operation of the processor itself. The processor cannot go back to running threads until it leaves interrupt mode, and by the time it does, all pending interrupts have been fully serviced: there is no half-finished ISR to return to.
But the main reason why all ISRs must finish before going back to threads is that kernels do not have the concept of a thread-like running context for ISRs. An ISR thus cannot pend: it must run to completion. An ISR therefore hogs the CPU, except for preemption by higher-priority interrupts, until it finishes its purpose.
Usually, the main thread has lower priority than the ISRs. Depending on the scheduler, often the main code will be executed after all pending ISRs have been run.
Having a lot of computation-intensive code in one or many ISRs is generally not advisable, since it may cause delays or even CPU starvation of lower-priority ISRs or threads, which may be detrimental if time-critical code needs to be executed.
However, when action needs to be taken immediately at an interrupt event, the fastest way is to execute code from the associated ISR (and possibly assign it a high priority).
If you plan on using several interrupt sources that execute time-consuming code, the way to go is by using an RTOS to allow safe and efficient interleaving of several threads to service each of the interrupts.
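As an illustration of that last point, here is a rough sketch of the common pattern under an RTOS, assuming FreeRTOS (names, stack size and priorities are arbitrary): the ISR only signals a semaphore, and a dedicated thread does the slow work.

```c
#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

static SemaphoreHandle_t data_ready;

void my_device_isr(void)                  /* kept as short as possible */
{
	BaseType_t woken = pdFALSE;

	/* ... acknowledge the interrupt in the device registers ... */
	xSemaphoreGiveFromISR(data_ready, &woken);
	portYIELD_FROM_ISR(woken);        /* switch to the worker right away if it is now runnable */
}

static void worker_task(void *arg)
{
	(void)arg;
	for (;;) {
		xSemaphoreTake(data_ready, portMAX_DELAY);
		/* ... time-consuming processing runs here, at thread priority ... */
	}
}

void start_worker(void)
{
	data_ready = xSemaphoreCreateBinary();
	xTaskCreate(worker_task, "worker", 512, NULL, tskIDLE_PRIORITY + 2, NULL);
}
```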

How can I use DMA in linux kernel? [duplicate]

I am using memcpy() in my program. As I increase the number of variables, unfortunately the CPU usage increases; it is as if memcpy were run as a for-loop iteration. Is there a fast memcpy function in Linux too? Shall I apply a patch and recompile the kernel?
There are architectures where the bus between the CPU and memory is rather weak; some of those architectures add a DMA engine to allow big blocks of memory to be copied without having a loop running on the CPU.
In Linux, you would be able to access the DMA engine with the dmaengine subsystem, but it is very hardware-dependent whether such an engine is actually available.
X86 CPUs have a good memory subsystem, and also have special hardware support for copying large blocks, so using a DMA engine would be very unlikely to actually help.
(Intel added a DMA engine called I/OAT to some server boards, but the overall results were not much better than plain CPU copies.)
DMA forces the data out of the CPU caches, so doing DMA copies for your program's variables would be utterly pointless because the first CPU access afterwards would have to read them back into the cache.
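For completeness, if your platform does have a general-purpose copy engine and you want to drive it from a kernel driver, the flow through the dmaengine subsystem looks roughly like the sketch below; the source and destination are assumed to be already-mapped DMA addresses, and error handling is abbreviated.

```c
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

/* Sketch of a memory-to-memory copy through the dmaengine subsystem. */
static int dma_copy(dma_addr_t dst_phys, dma_addr_t src_phys, size_t len)
{
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *desc;
	dma_cookie_t cookie;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	chan = dma_request_chan_by_mask(&mask);   /* any memcpy-capable channel */
	if (IS_ERR(chan))
		return PTR_ERR(chan);

	desc = dmaengine_prep_dma_memcpy(chan, dst_phys, src_phys, len, 0);
	if (!desc) {
		dma_release_channel(chan);
		return -EINVAL;
	}

	cookie = dmaengine_submit(desc);
	dma_async_issue_pending(chan);
	dma_sync_wait(chan, cookie);              /* busy-wait; a completion callback is nicer */

	dma_release_channel(chan);
	return 0;
}
```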

pthread on-wakeup execution

How can I make my pthreads execute a function each time they are rescheduled by the kernel?
I need to identify which physical CPU/socket (not logical core) my thread is being scheduled on, and I cannot afford to do this lookup all the time.
Can the wakeup routine be hooked somehow to make the necessary updates to TLS only when the thread is actually being rescheduled?
As to why I need this: I have code which executes AMOs approximately every 70 ns per thread, which is fine as long as the address is not cached on another socket; deploying the same code on two sockets gives a 15x performance hit because of frequent cache invalidations. I intend to allocate memory especially for this which is only shared among threads running on the same L3 cache. So I need to identify which socket I am running on and address the correct memory block. I could obviously call sched_getcpu and compare the result to the physical CPU ID in /proc/cpuinfo, but this is rather big overhead. I cannot afford to allocate thread-private memory for each thread though, too expensive.
From what I have read in Linux Kernel Development, Third Edition, there is no service or interface provided by the kernel for what you want. Using pthread_setaffinity (as suggested above by @osgx; in more recent Linux implementations, pthread_setaffinity_np) or caching a TLS key per CPU socket at startup (as suggested above by @caf) are perhaps the best methods to use in that direction.
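A sketch of how those two suggestions can be combined: pin the thread to one CPU with pthread_setaffinity_np so it can never migrate to another socket, then resolve the socket once and cache it in thread-local storage. The cpu_to_socket() lookup is hypothetical (e.g. a table parsed once from /proc/cpuinfo or sysfs at startup).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Cached socket id, per thread. */
static __thread int my_socket = -1;

/* Hypothetical helper: maps a CPU number to its physical socket. */
extern int cpu_to_socket(int cpu);

static int pin_and_cache_socket(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
		return -1;

	/* Safe to cache: the thread can no longer migrate off this CPU. */
	my_socket = cpu_to_socket(cpu);
	return 0;
}
```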

What is the minimum guaranteed time for a process in windows?

I have a process that feeds a piece of hardware (data transmission device) with a specific buffer size. What can I reasonably expect from the Windows scheduler to ensure I do not get a buffer underflow?
My buffer is 32K in size and gets consumed at ~800k bytes per second.
If I fill it in 16k-byte batches, that is one batch every 20 ms. However, what is my lower limit for filling it? If, say, I call Sleep(0) in my filling loop, what is my reasonable worst-case scheduling interval?
OS = Windows XP SP3
Dual-core, 2.2 GHz
Note, I am making an API call to check the buffer fill level and a call to the driver API to pass it the data. I am assuming these are scheduling points that Windows could make use of in addition to the sleep(0).
I would like to (as a process) play nice and still meet my realtime deadline. The machine is dedicated to this task but needs to receive the data over the network and send it to the IO device.
What can I expect for scheduler performance?
What else do I need to take into account?
There is no guaranteed worst case. Losing the CPU for hundreds of milliseconds is quite possible. You are subject to whatever kernel threads are doing; they always run at a higher priority than you can ever get. Running into a misbehaving NIC, USB or audio driver is a problem you'll constantly be fighting, unless you can control the hardware.
If you can survive occasional under-runs, then make sure that the I/O request you use to get the device data is a waitable event. Windows favors scheduling a thread whose pending I/O request has just completed ahead of all other threads. Polling with Sleep() is not a good strategy: it burns CPU cycles needlessly and the scheduler won't favor the thread at all.
If you can't survive the under-runs then you need to consider a device driver.
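To illustrate the waitable-event suggestion above, a rough sketch using overlapped I/O; hDevice is hypothetical and must have been opened with FILE_FLAG_OVERLAPPED, and error handling is abbreviated.

```c
#include <windows.h>

/* Sketch: block on completion of an overlapped write instead of
 * polling the fill level with Sleep(0). */
static BOOL write_block(HANDLE hDevice, const void *buf, DWORD len)
{
	OVERLAPPED ov = {0};
	DWORD written = 0;

	ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);   /* manual-reset event */
	if (!ov.hEvent)
		return FALSE;

	if (!WriteFile(hDevice, buf, len, NULL, &ov) &&
	    GetLastError() != ERROR_IO_PENDING) {
		CloseHandle(ov.hEvent);
		return FALSE;
	}

	/* The thread sleeps here and gets a scheduling boost when the I/O completes. */
	WaitForSingleObject(ov.hEvent, INFINITE);
	GetOverlappedResult(hDevice, &ov, &written, FALSE);

	CloseHandle(ov.hEvent);
	return written == len;
}
```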
What is the minimum guaranteed time for a process in windows?
There is no guarantee: Windows is not a real-time O/S.
What else do I need to take into account
What else is running on the machine (something high priority might preempt you)
How much RAM you have (system performance changes a lot when RAM is in short supply)
Whether you're doing I/O (because you might e.g. stall while waiting for disk or network access)
I would like to (as a process) play nice and still meet my realtime deadline. The machine is dedicated to this task but needs to receive the data over the network and send it to the IO device.
Consider setting the priority of your process and/or thread to "real-time priority".
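A minimal sketch of that suggestion using the Win32 priority APIs; note that REALTIME_PRIORITY_CLASS can starve the rest of the system and requires the appropriate privilege, so many applications settle for HIGH_PRIORITY_CLASS instead.

```c
#include <windows.h>

/* Raise the priority of the current process and of the feeder thread. */
static void raise_priority(void)
{
	SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
	SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}
```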
