Identify DMA memory in /proc/mtrr and /proc/iomem?

I wonder if there is a way to identify memory used for DMA mappings in proc files such as /proc/mtrr and /proc/iomem, or via lspci -vv.
In my /proc/mtrr there is only one uncachable region, and it seems to cover (almost exactly) the 'PCI hole' at 3.5-4GB:
base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
Cross-checking with /proc/iomem, of this 512MB region only the last 21MB before 4GB is NOT consumed by the PCI bus, and that 21MB sliver is occupied by things like pnp, the IOAPIC, and Reserved entries.
So my questions are:
What is the signature of a DMA region in /proc/mtrr and /proc/iomem?
Are there other places, such as other proc files or commands, that I can use to see DMA regions?
It seems that by adding rows to /proc/mtrr, a privileged user can change the caching type of any memory region at runtime. So besides the fact that DMA buffers have to sit below 4GB (assuming no DAC), are there other special requirements for DMA memory allocation? If there are none, then maybe the only hint I can use to identify DMA memory is /proc/mtrr?
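For example, the kernel's /proc/mtrr write interface seems to accept entries like the one in this sketch (run as root; the base and size are placeholders I made up, not values from my machine):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/mtrr", "w");
        if (!f) {
            perror("fopen /proc/mtrr");
            return 1;
        }
        /* base/size are placeholders; size must be a power of two and base aligned to it */
        fprintf(f, "base=0xd0000000 size=0x10000000 type=uncachable\n");
        if (fclose(f) != 0) {
            perror("write to /proc/mtrr");
            return 1;
        }
        return 0;
    }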

DMA (Direct Memory Access) is just where a device accesses memory itself (without asking the CPU to feed it the data). For a (simplified) example of DMA: imagine a random process does a write(), and this works its way through the VFS, through the file system, through any RAID layer, etc. until it reaches some kind of disk controller driver; then the disk controller driver tells its disk controller "transfer N bytes from this physical address to that place on the disk and let me know when the transfer has been done". Most devices (disk controllers, network cards, video cards, sound cards, USB controllers, ...) use DMA in some way. Under load, all the devices in your computer may be doing thousands of transfers (via DMA) per second, potentially scattered across all usable RAM.
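To make that concrete, here is a minimal sketch (not taken from any real driver) of the Linux streaming-DMA API a driver would typically use before handing a buffer to the controller; the "program the controller" step is just a placeholder comment:

    #include <linux/dma-mapping.h>
    #include <linux/device.h>
    #include <linux/errno.h>

    /* Sketch only: map a kernel buffer for a device-read DMA transfer, hand the
     * resulting bus address to the hardware, then unmap when the transfer is done. */
    static int start_dma_write(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t bus_addr;

        bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, bus_addr))
            return -ENOMEM;

        /* ... program the controller with bus_addr and len, wait for its
         * completion interrupt ... */

        dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
        return 0;
    }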
As far as I know, there are no files in /proc/ that would help (most likely because it changes too fast and too often to be worth reporting, and there'd be very little reason for anyone to ever want to look at it).
The MTRRs are mostly irrelevant: they only control the CPU's caches and have no effect on DMA requests from devices.
/proc/iomem is also irrelevant: it only shows which areas devices are using for their own registers and has nothing to do with RAM (and therefore has nothing to do with DMA).
Note 1: DMA doesn't have to be within the lower 32-bit (4 GiB) address range (e.g. most PCI devices have supported 64-bit DMA/bus mastering for a decade or more); and for the rare devices that don't support 64-bit, it's possible for Linux to use an IOMMU to remap their requests (so the device thinks it's using 32-bit addresses when it actually isn't).
Note 2: Once upon a time (a long time ago) there were "ISA DMA controller chips". Like the ISA bus itself; these were restricted to the first 16 MiB of the physical address space (and had other restrictions - e.g. not supporting transfers that cross a 64 KiB boundary). These chips haven't really had a reason to exist since floppy disk controllers became obsolete. You might have a /proc/dma describing these (but if you do it probably only says "cascade" to indicate how the chips connect, with no devices using them).

Related

Are memory mapped registers separate registers on the bus?

I will use the TM4C123 Arm Microcontroller/Board as an example.
Most of its I/O registers are memory mapped, so you can get/set their values using regular memory load/store instructions.
My question is: is there some register outside the CPU, somewhere on the bus, that is mapped to memory so that you read/write it through a memory region and essentially have duplicate values (one in the register and one in memory), or IS the memory the register itself?
There are many buses even in an MCU. Bus after bus after bus splitting off like branches in a tree. (sometimes even merging unlike a tree).
It may predate the Intel/Motorola battle, but certainly in that time frame you had segmented vs. flat addressing, and you had I/O-mapped I/O vs. memory-mapped I/O, since Motorola (and others) did not have a separate I/O bus (well, one extra...address...signal).
Look at the ARM architecture documents and the chip documentation (ARM makes IP, not chips). You have load and store instructions that operate on addresses. The documentation for the chip (and, to some extent, ARM's rules for the Cortex-M address space) provides a long list of addresses for things. As a programmer you simply line up the right instructions with the addresses you want to load from or store to.
Someone's marketing may still care about terms like memory-mapped I/O, and because Intel x86 still exists (how????), some folks will continue to carry those terms around. As a programmer, though, they are first and foremost just bits that go into registers, and for particular instructions here and there those bits are addresses. If you want to add adjectives to that, go for it.
If the address you are using, based on the chip and core documentation, points at an SRAM, then that is a read or write of memory. If it is a flash address, then it is the flash. If it is the UART, then the UART; if it is timer 5, then timer 5's control and status registers. And so on.
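For example, on the TM4C123 the UART0 registers sit at fixed addresses (verify against the data sheet; the values here are a sketch), and writing a byte is just a volatile store to that address; there is no separate copy living in RAM:

    #include <stdint.h>

    /* TM4C123 UART0 registers (addresses as listed in the data sheet; check your part). */
    #define UART0_DR     (*(volatile uint32_t *)0x4000C000u)  /* data register */
    #define UART0_FR     (*(volatile uint32_t *)0x4000C018u)  /* flag register */
    #define UART_FR_TXFF (1u << 5)                            /* TX FIFO full  */

    static void uart0_putc(char c)
    {
        while (UART0_FR & UART_FR_TXFF)  /* read the peripheral register directly */
            ;
        UART0_DR = (uint32_t)c;          /* the store goes straight to the peripheral */
    }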
There are addresses in these MCUs that point at two or three things, but not at the same time (address 0x00000000 and some number of kilobytes after it, for example). And this overlap, at least in many of these Cortex-M MCUs, does not mix "memory" and peripherals (I/O); instead, these special address spaces are the places where you can boot the chip and run some code. With these Cortex-Ms I do not think you can even use the MPU (their sort-of MMU) to mix these spaces. But yes, in full-sized ARMs and other architectures you can use a full-blown MMU to mix up the address spaces and have a virtual address space that lands on a physical address space of a different type.

Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague, but here is what I hope to learn: the MCU maps part of the memory address space to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data going in and out of PCI Express devices is packetized, serialized and transmitted over lanes, which means each read/write incurs significant overhead, such as adding and stripping packet headers. So it seems it is not ideal for user/kernel code to read device memory a byte at a time; instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into the main memory address space: DMA is about letting the device access main memory, whereas my question is the other way around, letting user/kernel code access device memory. So I am guessing DMA is not related to the question above; is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow. The primary reason it is slow is that it is typically uncacheable, so every access has to go all the way to the device.
In x86 systems, with which I am most familiar, cacheable memory is accessed in 64-byte chunks (cache lines), even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks. If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory is critical to performance and is highly optimized. The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)
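To illustrate the CPU-side cost, here is a rough sketch (hypothetical device, made-up register offsets) of how a Linux driver touches a PCI BAR; every readl()/writel() is an uncached round trip to the device, which is why bulk data is better moved by the device's own DMA engine:

    #include <linux/io.h>
    #include <linux/pci.h>

    /* Hypothetical device: BAR 0 holds registers; the offsets below are made up. */
    static int demo_read_status(struct pci_dev *pdev, void *dst, size_t len)
    {
        void __iomem *regs;
        u32 status;

        regs = pci_iomap(pdev, 0, 0);        /* map BAR 0 into kernel virtual space */
        if (!regs)
            return -ENOMEM;

        status = readl(regs + 0x10);         /* one full round trip to the device */
        writel(0x1, regs + 0x14);            /* another full round trip */
        pr_info("device status: %08x\n", status);

        /* If the CPU really must copy a block of device memory, memcpy_fromio()
         * avoids per-word call overhead, but every bus transaction is still
         * uncached; the device's DMA engine is the faster path. */
        memcpy_fromio(dst, regs + 0x1000, len);

        pci_iounmap(pdev, regs);
        return 0;
    }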

Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the alloc_bootmem_low_pages call, because it is not possible to allocate a buffer larger than 4MB using other methods such as kmalloc.
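(For reference, the driver's mmap handler is basically the usual remap_pfn_range() pattern, simplified here with no bounds checking; names are placeholders:)

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <asm/io.h>

    extern void *capture_buf;   /* the bootmem-allocated buffer */

    static int capture_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        unsigned long pfn  = virt_to_phys(capture_buf) >> PAGE_SHIFT;

        /* Map the physically contiguous buffer straight into the caller's
         * address space. */
        return remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot);
    }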
However, due to a recent upgrade we need to increase the buffer size to 22MB+192kB. As I've read, the Linux kernel sets aside only 16MB of DMA-capable memory (ZONE_DMA), so theoretically this is not possible unless there is a way to tweak that setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
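For illustration, a minimal sketch of that swap (buffer size and function names are placeholders; the allocation has to happen early in boot, from machine/board init code):

    #include <linux/bootmem.h>
    #include <linux/init.h>

    #define CAPTURE_BUF_SIZE ((22 * 1024 * 1024) + (192 * 1024))  /* 22MB + 192kB */

    void *capture_buf;

    /* Called before the normal page allocator is up.  alloc_bootmem_pages()
     * returns page-aligned memory and, unlike alloc_bootmem_low_pages(), does
     * not force the allocation into the low/DMA region.  (The bootmem
     * allocators of this kernel era panic on failure, hence no NULL check.) */
    void __init capture_reserve_buffer(void)
    {
        capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
    }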
It looks like my first attempt at an answer was a little too generic. For the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter, which can be made as large as 32MB. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture it seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86) but just determining how to lay out memory.
Try setting it to 32MB and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver; perhaps one of those will work.
Otherwise, I think I'm out of ideas.

Significance of Reset Vector in Modern Processors

I am trying to understand, in detail, how a computer boots up.
I came across two things which made me more curious:
1. RAM is placed below the ROM, to avoid memory holes, as in the Z80 processor.
2. A reset vector is used, which takes the processor to a memory location in ROM whose contents point to the actual location (again in ROM) from where the processor actually starts executing instructions (the POST code). Why so?
If this is still unclear, this link explains it briefly:
http://lateblt.tripod.com/bit68.txt
The processor logic is generally rigid and fixed, thus the term hardware. Software is something that can be changed, molded, etc. thus the term software.
The hardware needs to start somehow; there are two basic methods:
1) an address, hardcoded in the logic, in the processor's memory space is read, and that value is the address at which the processor starts executing code
2) an address, hardcoded in the logic, is where the processor starts executing code
When the processor itself is integrated with other hardware, anything can be mapped into any address space. You can put RAM at address 0x1000 or 0x40000000 or both. You can map a peripheral to 0x1000 or 0x4000 or 0xF0000000 or all of the above. It is the choice of the system designers, or a combination of the teams of engineers, where things will go. One important factor is how the system will boot once reset is released. The booting of the processor is well known due to its architecture. The designers often choose one of two paths:
1) put a ROM in the memory space that contains the reset vector or the entry point, depending on the boot method of the processor (no matter what the architecture, there is a first address or first block of addresses that is read, and its contents drive the booting of the processor). The software places code or a vector table or both in this ROM so that the processor will boot and run.
2) put RAM in the memory space, in such a way that some host can download a program into that RAM, then release reset on the processor. The processor then follows its hardcoded boot procedure and the software is executed.
The first one is most common; the second is found in some peripherals, mice and network cards and things like that (some of the firmware in /usr/lib/firmware/ is used for this, for example).
The bottom line, though, is that the processor is usually designed with one boot method, a fixed method, so that all software written for that processor can conform to it and not have to keep changing. Also, the processor, when designed, doesn't know its target application, so it needs a generic solution. The target application often defines the memory map (what is where in the processor's memory space), and one of the tasks in that assignment is deciding how that product will boot. From there the software is compiled and placed such that it conforms to the processor's rules and the product's hardware rules.
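As a concrete illustration of the first path, here is a minimal Cortex-M-style vector table sketch: the core fetches the initial stack pointer from offset 0 and the reset vector from offset 4 of the boot memory, so the linker script (not shown) must place this section at the boot address. The section and symbol names are assumptions:

    #include <stdint.h>

    extern uint32_t _stack_top;     /* top-of-stack symbol from the linker script */
    void reset_handler(void);       /* first code to run after reset */

    /* The two entries the core reads at reset: initial SP, then the reset vector. */
    __attribute__((section(".vectors"), used))
    const uint32_t vector_table[] = {
        (uint32_t)&_stack_top,
        (uint32_t)reset_handler,
    };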
It completely varies by architecture. There are a few reasons why cores might want to do this, though. Embedded cores (think along the lines of ARM and MicroBlaze) tend to be used within system-on-chip machines with a single address space. Such architectures can have multiple memories all over the place and tend to only dictate that the bottom area of memory (i.e. 0x00) contains the interrupt vectors. This then allows the programmer to easily specify where to boot from. On MicroBlaze, you can attach memory wherever the hell you like in XPS.
In addition, it can be used to easily support bootloaders. These are typically small programs that do a bit of initialization and then fetch a larger program from a medium that can't be accessed simply (e.g. USB or Ethernet). In such cases, the bootloader typically copies itself to high memory, fetches the larger program into memory below itself, and then jumps there. The reset vector simply allows the programmer to bypass the first step.

Memory alignment

I understand why memory should be aligned to 4 bytes or 8 bytes based on the data width of the bus. But the following statement confuses me:
"IoDrive requires that all I/O performed on a device using O_DIRECT must be 512-byte aligned and a multiple of 512 bytes in size."
What is the need for aligning the address to 512 bytes?
Blanket statements blaming DMA for large buffer alignment restrictions are wrong.
Hardware DMA transfers are usually aligned on 4 or 8 byte boundaries since the PCI bus can physically transfer 32 or 64bits at a time. Beyond this basic alignment, hardware DMA transfers are designed to work with any address provided.
However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (which is a protected mode construct in the x86 cpu). This means that a contiguous buffer in process space may not be contiguous in physical ram. Unless care is taken to create physically contiguous buffers, the DMA transfer needs to be broken up at VM page boundaries (typically 4K, possibly 2M).
As for buffers needing to be aligned to disk sector size, this is completely untrue; the DMA hardware is completely oblivious to the physical sector size on a hard drive.
Under Linux 2.4, O_DIRECT required 4K alignment; under 2.6 it has been relaxed to 512B. In either case, it was probably a design decision to prevent single-sector updates from crossing VM page boundaries and therefore requiring split DMA transfers. (An arbitrarily aligned 512B buffer has roughly a 1-in-8 chance of crossing a 4K page boundary.)
So, while the OS is to blame rather than the hardware, we can see why page aligned buffers are more efficient.
Edit: Of course, if we're writing large buffers anyway (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not.
So the main case being optimized by 512B alignment is single sector transfers.
Usually large alignment requirements like that are due to underlying DMA hardware. Large block transfers can sometimes be made much faster by requiring much stronger alignment restrictions than what you have here.
On several ARM processors, the first level translation table has to be aligned on a 16 KB boundary!
If you don't know what you're doing, don't use O_DIRECT.
O_DIRECT means "direct device access". This means it bypasses all OS caches, hitting the disk (or possibly RAID controller, etc) directly. Disk accesses are on a per-sector basis.
EDIT: The alignment requirement is for the IO offset/size; it's not usually a memory-alignment requirement.
EDIT: If you're looking at this page (it appears to be the only hit), it also says that the memory must be page-aligned.
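For completeness, here is a small user-space sketch of the alignment rules in practice: buffer, offset and length are all aligned to 4096, which also satisfies the 512-byte rule quoted in the question; "/dev/sdX" is a placeholder device:

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)   /* page-aligned buffer */
            return 1;

        int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = pread(fd, buf, 4096, 0);         /* aligned length and offset */
        if (n < 0)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }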

Resources