Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the alloc_bootmem_low_pages call, because it is not possible to allocate a buffer larger than 4 MB using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?

The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.

It looks like my first attempt at an answer was a little too generic. For the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter, which can be made as large as 32 MB. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it seems safe to modify, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32 MB and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
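For reference, once that menu entry is enabled the option lands in the kernel configuration file. A fragment like the following should appear in .config (whether the value is interpreted as megabytes or bytes is tree-specific to this vendor BSP; check the Kconfig help in your source tree before relying on it):

```
CONFIG_DMA_ZONE_SIZE=32
```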
Otherwise, I think I'm out of ideas.

Related

How to allocate block of physical memory? [duplicate]

Is there a way to allocate contiguous physical memory from userspace in Linux? At least a few guaranteed contiguous memory pages. One huge page isn't the answer.
No, there is not. You need to do this from kernel space.
If you say "we need to do this from user space" with nothing going on in kernel space, it makes little sense, because a user-space program has no way of controlling, or even knowing, whether the underlying memory is contiguous.
The only situation in which you would need to do this is if you were working in conjunction with a piece of hardware, or some other low-level (i.e. kernel) service, that has this requirement. So again, you would have to deal with it at that level.
So the answer isn't just "you can't", but "you should never need to".
I have written memory managers that do allow this, but it was always because of some underlying issue at the kernel level, which had to be addressed at the kernel level, generally because some other agent on the bus (a PCI card, the BIOS, or even another computer over an RDMA interface) had the physically-contiguous memory requirement. Again, all of this had to be addressed in kernel space.
When you talk about "cache lines", you don't need to worry. You can be assured that each page of your user-space memory is contiguous, and each page is much larger than a cache line (no matter what architecture you're talking about).
Yes, if all you need is a few pages, this may indeed be possible.
The file /proc/[pid]/pagemap now allows programs to inspect the mapping of their virtual memory to physical memory.
While you cannot explicitly modify the mapping, you can allocate a virtual page, lock it into memory with a call to mlock, record its physical address via a lookup in /proc/self/pagemap, and repeat until you happen to get enough blocks touching each other to form a large enough contiguous block. Then unlock and free the excess blocks.
It's hackish, clunky and potentially slow, but it's worth a try. On the other hand, there's a decently large chance that this isn't actually what you really need.
The DPDK library's memory allocator uses the approach @Wallacoloo described; see eal_memory.c. The code is BSD-licensed.
If a specific device driver exports a DMA buffer that is physically contiguous, user space can access it through the dma-buf APIs. So a user task can access such a buffer, but not allocate one directly. That is because the physical-contiguity constraint comes not from user applications but only from the device, so only device drivers should care.

Identify DMA memory in /proc/mtrr and /proc/iomem?

I wonder if there is a way to identify memory used for DMA mapping in some proc files, such as mtrr and iomem, or via lspci -vv.
In my /proc/mtrr, there is only one uncachable region, and it seems to be pointing at the 'PCI hole' at 3.5-4GB, almost.
base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
By cross verifying with /proc/iomem, of this 512MB region, only the last 21 MB before 4GB is NOT consumed by PCI Bus, and that 21MB sliver is occupied by things like pnp/IOAPIC/Reserved.
So my questions are:
What is the signature of a DMA region in /proc/mtrr and /proc/iomem?
Are there other places, such as other proc files and commands, that I can use to see DMA regions?
It seems that by adding rows to /proc/mtrr, a privileged user can change the caching mechanism of any memory region at runtime. So besides the fact that DMA buffers have to be in the lower 32 bits (assuming no DAC), are there other special requirements for DMA memory allocation? If there are none, then maybe the only hint I can use to identify DMA memory is /proc/mtrr?
DMA (Direct Memory Access) is just where a device accesses memory itself (without asking the CPU to feed the data to the device). For a (simplified) example of DMA: imagine a random process does a write(), and this bubbles its way up (through VFS, through the file system, through any RAID layer, etc.) until it reaches some kind of disk controller driver; then the disk controller driver tells its disk controller "transfer N bytes from this physical address to that place on the disk and let me know when the transfer has been done". Most devices (disk controllers, network cards, video cards, sound cards, USB controllers, ...) use DMA in some way. Under load, all the devices in your computer may be doing thousands of transfers (via DMA) per second, potentially scattered across all usable RAM.
As far as I know; there are no files in /proc/ that would help (most likely because it changes too fast and too often to bother providing any, and there'd be very little reason for anyone to ever want to look at it).
The MTRRs are mostly irrelevant: they only control the CPU's caches and have no effect on DMA requests from devices.
The /proc/iomem is also irrelevant. It only shows which areas devices are using for their own registers and has nothing to do with RAM (and therefore has nothing to do with DMA).
Note 1: DMA doesn't have to be in the lower 32-bit (e.g. most PCI devices have supported 64-bit DMA/bus mastering for a decade or more); and for the rare devices that don't support 64-bit it's possible for Linux to use an IOMMU to remap their requests (so the device thinks it's using 32-bit addresses when it actually isn't).
Note 2: Once upon a time (a long time ago) there were "ISA DMA controller chips". Like the ISA bus itself; these were restricted to the first 16 MiB of the physical address space (and had other restrictions - e.g. not supporting transfers that cross a 64 KiB boundary). These chips haven't really had a reason to exist since floppy disk controllers became obsolete. You might have a /proc/dma describing these (but if you do it probably only says "cascade" to indicate how the chips connect, with no devices using them).

Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague, but here is what I hope to learn: the chipset maps some part of the memory address space to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data in and out of PCI Express devices is packaged, serialized, and transmitted in lanes, which means each read/write incurs significant overhead, such as packaging (adding headers) and un-packaging. So it is not ideal for user/kernel code to read device memory a byte at a time; instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into main memory address space - DMA is about letting device access main memory, and my question is the other way, letting user/kernel access device memory. So I am guessing it is not related to the question above, is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow. The primary reason it is slow is that it is typically uncacheable, so every access has to go all the way to the device. In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks, even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks. If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory is critical to performance and is highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)

get_user_pages_fast() for DMA?

I have a Linux driver that does DMA transfers to/from a device. For sending data to the device (to prevent copy operations) the driver maps the userspace buffer and uses it for DMA directly via get_user_pages_fast(). The user pages are then added to a scatter-gather list and used for DMA.
This works rather well, but the one issue is that this forces the userspace buffer to have various alignment requirements to the cache line of the CPU. My system returns 128 when you call dma_get_cache_alignment(), which means that in userspace I have to ensure that the start address is aligned to this value. Also, I have to check that the buffer is sized to a multiple of 128.
I see two options for handling this:
Deal with it. That is, in userspace ensure that the buffer is properly aligned. This sounds reasonable, but I have run into some issues since my device has to be integrated into a larger project, and I don't have control over the buffers that get passed to me. As a result, I have to allocate a properly aligned buffer in userspace to sit between the driver and the application and use that buffer in the event the caller's buffer is not aligned. This adds a copy operation and isn't the end of the world, but the resulting code is rather messy.
Rework the driver to use a kernel space buffer. That is, change the code such that the driver uses copy_from_user() to move the data into a properly aligned kernel space buffer. I'm not too concerned about the performance here, so this is an option, but would require a good amount of rework.
Is there anything that I'm missing? I'm hoping that there might be some magic flag or something that I overlooked to remove the alignment requirement altogether.
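For option 1, the userspace side boils down to two small checks plus an aligned bounce buffer as the fallback. A minimal sketch, assuming the 128-byte figure reported by dma_get_cache_alignment() on the asker's system (the helper names here are made up for illustration):

```c
/* Sketch: ensure a buffer meets a DMA cache-line alignment requirement,
 * falling back to an aligned bounce buffer when it doesn't. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DMA_ALIGN 128   /* value reported by dma_get_cache_alignment() */

/* Both the start address and the length must be multiples of DMA_ALIGN. */
static int is_dma_aligned(const void *p, size_t len)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0 && (len % DMA_ALIGN) == 0;
}

static size_t round_up_dma(size_t len)
{
    return (len + DMA_ALIGN - 1) & ~((size_t)DMA_ALIGN - 1);
}

/* Return the caller's buffer if usable, else a freshly aligned copy
 * (*bounced is set to 1 so the caller knows to free it afterwards). */
static void *dma_ready_buffer(void *user_buf, size_t len, int *bounced)
{
    void *aligned;
    *bounced = 0;
    if (is_dma_aligned(user_buf, len))
        return user_buf;
    if (posix_memalign(&aligned, DMA_ALIGN, round_up_dma(len)) != 0)
        return NULL;
    memcpy(aligned, user_buf, len);   /* the extra copy option 1 mentions */
    *bounced = 1;
    return aligned;
}
```

This is exactly the "messy" intermediate buffer the question describes; the sketch just confines the mess to one function.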

Making DMA memory temporarily cachable

I have an arm cortex-a9 quad core device, and I'm programming a multi-process application.
These processes share the same source of input - a DMA buffer which they all access using a mmap() call.
I noticed that the time it takes for the processes to access the DMA memory is significantly longer than if I change the source of input to a normally allocated buffer (i.e. allocated using malloc).
I understand why a DMA buffer must be non-cacheable; however, since I have the ability to determine when the buffer is stable (unchanged by the hardware, which is the case most of the time) or dirty (data has changed), I thought I might get a significant speed improvement if I make the memory region temporarily cacheable.
Is there a way to do that?
I'm currently using this line to map the memory:
void *buf = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phy_addr);
Thanks!
Most modern CPUs use snooping to determine if/when cache lines must be flushed to memory or marked invalid. On such CPUs a "DMA buffer" is identical to a kmalloc() buffer. This, of course, assumes the snoop feature works correctly and that the OS takes advantage of the snoop feature. If you are seeing differences in accesses to DMA and non-DMA memory regions then I can only assume your CPU either does not have cache snooping capabilities (check CPU docs) or the capability is not used because it doesn't work (check CPU errata).
Problems with your proposed approach:
Do you know when it is time to change the memory region back to non-cacheable?
Changing MMU settings for a memory region is not always trivial (is CPU dependent) and I'm not sure an API even exists within your OS for changing such settings.
Changing MMU settings for a memory region is risky even when it is possible and such changes must be carefully synchronized with your DMA operation or data corruption is virtually guaranteed.
Given all of these significant problems, I suggest a better approach is to copy the data from the DMA buffer to the kmalloc() buffer when you detect the DMA buffer has been updated.
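The suggested copy-on-update scheme is simple to sketch in userspace. This assumes the driver exposes some notification that the buffer is stable; the `gen` update counter below is a stand-in for whatever mechanism your driver actually provides:

```c
/* Sketch: keep a cacheable (malloc'd) shadow of an uncached DMA buffer,
 * refreshing it only when the hardware has produced new data. */
#include <stdlib.h>
#include <string.h>

struct shadow_buf {
    const volatile void *dma;   /* uncached mmap()ed DMA region */
    void *cached;               /* ordinary cacheable copy */
    size_t size;
    unsigned last_gen;          /* generation we last copied */
};

static int shadow_init(struct shadow_buf *s, const volatile void *dma, size_t size)
{
    s->dma = dma;
    s->size = size;
    s->last_gen = 0;
    s->cached = malloc(size);
    return s->cached ? 0 : -1;
}

/* Returns the cacheable view, copying from the slow DMA buffer only
 * when 'gen' (a driver-provided update counter) has advanced. */
static const void *shadow_read(struct shadow_buf *s, unsigned gen)
{
    if (gen != s->last_gen) {
        memcpy(s->cached, (const void *)s->dma, s->size);
        s->last_gen = gen;
    }
    return s->cached;
}
```

All subsequent reads hit the cacheable copy at full speed, and the single memcpy per update is the only access that pays the uncached-memory penalty.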
