Making DMA memory temporarily cachable

Making DMA memory temporarily cachable - c

I have an arm cortex-a9 quad core device, and I'm programming a multi-process application.
These processes share the same source of input - a DMA buffer which they all access using a mmap() call.
I noticed that the time it takes for the processes to access the DMA memory, is significantly longer than it takes if I change the source of input to be a normal allocated buffer (i.e. allocated using malloc).
I understand why a DMA buffer must be non-cacheable, however, since I have the ability to determine when the buffer is stable (unchanged by the hardware, which is the case most of the time) or dirty (data has changed) I thought I might get a significant speed improvement if I'll make the memory region temporarily cacheable.
Is there a way to do that?
I'm currently using this line to map the memory:
void *buf = mmap(0, size, PROT_READ | PROT_WRITE,MAP_SHARED, fd, phy_addr);
Thanks!

Most modern CPUs use snooping to determine if/when cache lines must be flushed to memory or marked invalid. On such CPUs a "DMA buffer" is identical to a kmalloc() buffer. This, of course, assumes the snoop feature works correctly and that the OS takes advantage of the snoop feature. If you are seeing differences in accesses to DMA and non-DMA memory regions then I can only assume your CPU either does not have cache snooping capabilities (check CPU docs) or the capability is not used because it doesn't work (check CPU errata).
Problems with your proposed approach:
Do you know when it is time to change the memory region back to non-cacheable?
Changing MMU settings for a memory region is not always trivial (is CPU dependent) and I'm not sure an API even exists within your OS for changing such settings.
Changing MMU settings for a memory region is risky even when it is possible and such changes must be carefully synchronized with your DMA operation or data corruption is virtually guaranteed.
Given all of these significant problems, I suggest a better approach is to copy the data from the DMA buffer to the kmalloc() buffer when you detect the DMA buffer has been updated.

Related

why mmap is faster than traditional file io [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
mmap() vs. reading blocks
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
mmap() is not reading sequentially.
mmap() has to fetch from the disk itself same as read() does
The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?

I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If yes then why it is faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
is wrong... mmap() assigns a region of virtual address space corresponding to file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do access means the disk I/O is not done sequentially and may therefore result in longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read (allow for the granularity of disk and memory pages, and that even when using memory mapping many OSes allow you to specify some performance-enhancing / memory-efficiency tips about your planned access patterns so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it).
absolutely true
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly length)
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
The are also reasons why mmap may be slower - do read Linus Torvald's post here which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4kb) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.

mmap() can share between process.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with kernel block cache if possible. So there is lessor copying.
Memory for mmap is allocated by kernel, it is always aligned.

"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved, the "read" takes place there where the OS kernel sees fit and in chunks that can be optimized. This may be an advantage in speed, but first of all this is just an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague.. but here is what I hope to learn: the MCU directs some part of memory address to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data in and out of PCI Express devices are packaged/serialized/transmitted in lanes, which means each read/write incurs significant overhead, such as packaging (add headers) and un-packaging. So that means it is not ideal for user/kernel to read device memory a byte at a time, instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into main memory address space - DMA is about letting device access main memory, and my question is the other way, letting user/kernel access device memory. So I am guessing it is not related to the question above, is that correct?

Yes, accessing memory-mapped I/O (MMIO) is slow.
The primary reason that it is slow is that it is typically uncacheable,
so every access has to go all the way to the device.
In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks,
even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks.
If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory are critical to performance and are highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)

Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using "alloc_bootmem_low_pages" call, because it is not possible to allocate more than 4 MB buffer using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If there is anyone who knows how to perform this, please let me know?
Is this a good idea or will this make the system unstable?

The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.

It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter which can be made as large as 32Megs. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it's seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32Meg and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
Otherwise, I think I'm out of ideas.

What is the difference in cache memory and tightly coupled memory

Due to being embedded inside the CPU The TCM has a
Harvard-architecture, so there is an ITCM (instruction TCM)
and a DTCM (data TCM). The DTCM can not contain any
instructions, but the ITCM can actually contain data.
The size of DTCM or ITCM is minimum 4KiB so the typical
minimum configuration is 4KiB ITCM and 4KiB DTCM.
It looks like tcm have same purpose as cache memory.
No. They didn't used the word cache in explanation

A cache uses access patterns to populate data within the cache. It has extra hardware to track the backing address and may have communication with other system entities (SMP) to track when a cache line is dirty (someone else has written something to primary memory).
The 'TCM' (tightly coupled memory) is fast, probably SRAM multi-transistor memory, like the cache. Both have a fast dedicated connection to the CPU. However, the overhead to implement the TCM is far less than a cache. Typically TCM is found on lower-end (deeply embedded probably Cortex-M) ARM devices.
Most CPU caches have a lock down feature which enables them to behave like the TCM. However, the TCM does not have on the fly capabilities to buffer high use code and data. Because of this, the TCM (and locked cache) is probably more deterministic which may help hard real time applications.

This is what I found that I feel is more concise and to the point.
Cache memory is implemented with on-chip memory and control logic. Tightly coupled memory is implemented with on-chip memory and a dedicated connection.
Tightly coupled memory has a fixed span in the address map. Cache does not live in the address map (.... well it kinda does.... just don't think of it as a physical memory) but instead serves as an intermediate between the processor and the memory to (hopefully) provide more efficient memory accesses.
Tightly coupled memory has deterministic access time. Accesses through the cache are not deterministic since the data will either live in the cache (hit) or the data must be fetched from main memory (miss).
Another
While both are very fast accessed memories, cache stores dynamically data/code which has been lately used in order to improve access speed, compared to standard memory connected to the global Avalon matrix. Every time a memory access is required, the processor checks if the required data is already present in the cache or must be newly fetched from memory; in the meantime, old unused cache data is being continously replaced with new data.
Tightly coupled memory is also a fast access memory, since it exploits a dedicated port, but it has static content: you decide what you need there and you specify it in the linker script.

TCM has allocated address space so you can find it in the memory map. You can control the data that will be stored there already at link time. Just think of it as a normal system memory that has access times similar with cache. Usually the data from TCM is uncacheable.

If we ignore the dual-configurable cache-cum-TCM, cache memory should be connected to Bus Interface (BIU) to connect external memory, whereas TCM aren't. Reason is TCM has original data by itself. Whereas cache is temporary storage (for speed) of external memory content.

What is the difference between vmalloc and kmalloc?

I've googled around and found most people advocating the use of kmalloc, as you're guaranteed to get contiguous physical blocks of memory. However, it also seems as though kmalloc can fail if a contiguous physical block that you want can't be found.
What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
Finally, if I were to allocate memory during the handling of a system call, should I specify GFP_ATOMIC? Is a system call executed in an atomic context?
GFP_ATOMIC
The allocation is high-priority and
does not sleep. This is the flag to
use in interrupt handlers, bottom
halves and other situations where you
cannot sleep.
GFP_KERNEL
This is a normal allocation and might block. This is the flag to use
in process context code when it is safe to sleep.

You only need to worry about using physically contiguous memory if the buffer will be accessed by a DMA device on a physically addressed bus (like PCI). The trouble is that many system calls have no way to know whether their buffer will eventually be passed to a DMA device: once you pass the buffer to another kernel subsystem, you really cannot know where it is going to go. Even if the kernel does not use the buffer for DMA today, a future development might do so.
vmalloc is often slower than kmalloc, because it may have to remap the buffer space into a virtually contiguous range. kmalloc never remaps, though if not called with GFP_ATOMIC kmalloc can block.
kmalloc is limited in the size of buffer it can provide: 128 KBytes*). If you need a really big buffer, you have to use vmalloc or some other mechanism like reserving high memory at boot.
*) This was true of earlier kernels. On recent kernels (I tested this on 2.6.33.2), max size of a single kmalloc is up to 4 MB! (I wrote a fairly detailed post on this.) — kaiwan
For a system call you don't need to pass GFP_ATOMIC to kmalloc(), you can use GFP_KERNEL. You're not an interrupt handler: the application code enters the kernel context by means of a trap, it is not an interrupt.

Short answer: download Linux Device Drivers and read the chapter on memory management.
Seriously, there are a lot of subtle issues related to kernel memory management that you need to understand - I spend a lot of my time debugging problems with it.
vmalloc() is very rarely used, because the kernel rarely uses virtual memory. kmalloc() is what is typically used, but you have to know what the consequences of the different flags are and you need a strategy for dealing with what happens when it fails - particularly if you're in an interrupt handler, like you suggested.

Linux Kernel Development by Robert Love (Chapter 12, page 244 in 3rd edition) answers this very clearly.
Yes, physically contiguous memory is not required in many of the cases. Main reason for kmalloc being used more than vmalloc in kernel is performance. The book explains, when big memory chunks are allocated using vmalloc, kernel has to map the physically non-contiguous chunks (pages) into a single contiguous virtual memory region. Since the memory is virtually contiguous and physically non-contiguous, several virtual-to-physical address mappings will have to be added to the page table. And in the worst case, there will be (size of buffer/page size) number of mappings added to the page table.
This also adds pressure on TLB (the cache entries storing recent virtual to physical address mappings) when accessing this buffer. This can lead to thrashing.

The kmalloc() & vmalloc() functions are a simple interface for obtaining kernel memory in byte-sized chunks.
The kmalloc() function guarantees that the pages are physically contiguous (and virtually contiguous).
The vmalloc() function works in a similar fashion to kmalloc(), except it allocates memory that is only virtually contiguous and not necessarily physically contiguous.

What are the advantages of having a contiguous block of memory? Specifically, why would I need to have a contiguous physical block of memory in a system call? Is there any reason I couldn't just use vmalloc?
From Google's "I'm Feeling Lucky" on vmalloc:
kmalloc is the preferred way, as long as you don't need very big areas. The trouble is, if you want to do DMA from/to some hardware device, you'll need to use kmalloc, and you'll probably need bigger chunk. The solution is to allocate memory as soon as possible, before
memory gets fragmented.

On a 32-bit system, kmalloc() returns the kernel logical address (its a virtual address though) which has the direct mapping (actually with constant offset) to physical address.
This direct mapping ensures that we get a contiguous physical chunk of RAM. Suited for DMA where we give only the initial pointer and expect a contiguous physical mapping thereafter for our operation.
vmalloc() returns the kernel virtual address which in turn might not be having a contiguous mapping on physical RAM.
Useful for large memory allocation and in cases where we don't care about that the memory allocated to our process is continuous also in Physical RAM.

One of other differences is kmalloc will return logical address (else you specify GPF_HIGHMEM). Logical addresses are placed in "low memory" (in the first gigabyte of physical memory) and are mapped directly to physical addresses (use __pa macro to convert it). This property implies kmalloced memory is continuous memory.
In other hand, Vmalloc is able to return virtual addresses from "high memory". These addresses cannot be converted in physical addresses in a direct fashion (you have to use virt_to_page function).

In short, vmalloc and kmalloc both could fix fragmentation. vmalloc use memory mappings to fix external fragmentation; kmalloc use slab to fix internal frgamentation. Fot what it's worth, kmalloc also has many other advantages.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight