Memory alignment - C

I understand why memory should be aligned to 4 or 8 bytes based on the data width of the bus, but the following statement confuses me:
"IoDrive requires that all I/O performed on a device using O_DIRECT must be 512-byte aligned and a multiple of 512 bytes in size."
What is the need for aligning the address to 512 bytes?

Blanket statements blaming DMA for large buffer alignment restrictions are wrong.
Hardware DMA transfers are usually aligned on 4 or 8 byte boundaries since the PCI bus can physically transfer 32 or 64 bits at a time. Beyond this basic alignment, hardware DMA transfers are designed to work with any address provided.
However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (a protected-mode construct on the x86 CPU). This means that a contiguous buffer in process space may not be contiguous in physical RAM. Unless care is taken to create physically contiguous buffers, the DMA transfer needs to be broken up at VM page boundaries (typically 4K, possibly 2M).
As for buffers needing to be aligned to disk sector size, this is completely untrue; the DMA hardware is completely oblivious to the physical sector size on a hard drive.
Under Linux 2.4, O_DIRECT required 4K alignment; under 2.6 it has been relaxed to 512B. In either case, it was probably a design decision to prevent single-sector updates from crossing VM page boundaries and therefore requiring split DMA transfers. (An arbitrarily placed 512B buffer has roughly a one-in-eight chance of crossing a 4K page boundary.)
So, while the OS is to blame rather than the hardware, we can see why page-aligned buffers are more efficient.
Edit: Of course, if we're writing large buffers anyway (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not.
So the main case being optimized by 512B alignment is single-sector transfers.
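To put a number on the page-crossing argument, here is a small sketch (mine, not the answer's) that counts how many of the 4096 possible starting offsets within a page make a 512B buffer straddle a 4K boundary; it prints 511, i.e. roughly one in eight:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define SECTOR    512u

    /* True if a SECTOR-sized buffer starting at `start` spans two 4K pages. */
    static bool crosses_page(uintptr_t start)
    {
        return (start / PAGE_SIZE) != ((start + SECTOR - 1) / PAGE_SIZE);
    }

    int main(void)
    {
        /* Count how many of the 4096 possible byte offsets cross a boundary. */
        unsigned crossings = 0;
        for (uintptr_t off = 0; off < PAGE_SIZE; off++)
            crossings += crosses_page(off);
        printf("%u of %u offsets cross a page (~1 in 8)\n", crossings, PAGE_SIZE);
        return 0;
    }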

Usually large alignment requirements like that are due to underlying DMA hardware. Large block transfers can sometimes be made much faster by requiring much stronger alignment restrictions than what you have here.
On several ARM processors, the first level translation table has to be aligned on a 16 KB boundary!

If you don't know what you're doing, don't use O_DIRECT.
O_DIRECT means "direct device access". It bypasses all OS caches and hits the disk (or possibly a RAID controller, etc.) directly. Disk accesses are on a per-sector basis.
EDIT: The alignment requirement is for the I/O offset/size; it is not usually a memory-alignment requirement.
EDIT: If you're looking at this page (it appears to be the only hit), it also says that the memory must be page-aligned.
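For concreteness, here is a minimal sketch of an O_DIRECT read that satisfies both requirements (not taken from that page; /dev/sdX and the 512-byte sector size are placeholders, and page-aligned memory is used as the safe choice):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t sector = 512;      /* assumed logical sector size */
        void *buf;

        /* Memory alignment: page-aligned is the safe choice for O_DIRECT. */
        if (posix_memalign(&buf, 4096, sector) != 0)
            return 1;

        int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);  /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        /* I/O offset and length must both be multiples of the sector size. */
        ssize_t n = pread(fd, buf, sector, 0);
        if (n < 0)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }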

Related

Does an "aligned memory accesses" in CUDA need to address an "even multiple" of the cache granularity?

I am reading the book, Professional CUDA C Programming. On page 159, it says:
Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity being used to service the transaction (either 32 bytes for L2 cache or 128 bytes for L1 cache).
I am wondering why aligned memory accesses in CUDA need even multiples of the cache granularity rather than just multiples of the cache granularity.
So, I checked the cuda-c-programming-guide from NVIDIA. It says:
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.
It seems that an even multiple of the cache granularity is unnecessary for an aligned memory access, isn't it?
The quoted sentence from the book seems to be incorrect in two senses:
A memory access has an alignment of N if it is an access to an address that is a multiple of N. That's irrespective of CUDA. What seems to be discussed here is memory access coalescence.
As you suggest, and AFAIK, coalescence requires "multiples of" the cache granularity, not "even multiples of".
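To make the distinction concrete, the guide's "naturally aligned" condition is just a modulus test; a small sketch (the 128-byte segment size comes from the quote above):

    #include <stdint.h>
    #include <stdio.h>

    /* "Naturally aligned": the first address is a multiple of the segment size. */
    static int naturally_aligned(uintptr_t addr, uintptr_t segment)
    {
        return addr % segment == 0;     /* multiple of, not "even multiple of" */
    }

    int main(void)
    {
        /* 0x80 (= 1 * 128) is an odd multiple of 128, yet still aligned. */
        printf("%d\n", naturally_aligned(0x80, 128));   /* prints 1 */
        /* 0xC0 is not a multiple of 128, so it is not naturally aligned. */
        printf("%d\n", naturally_aligned(0xC0, 128));   /* prints 0 */
        return 0;
    }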

Identify DMA memory in /proc/mtrr and /proc/iomem?

I wonder if there is a way to identify memory used for DMA mapping in some proc files, such as mtrr and iomem, or via lspci -vv.
In my /proc/mtrr, there is only one uncachable region, and it seems to be pointing, more or less, at the 'PCI hole' at 3.5-4GB:
base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
Cross-checking with /proc/iomem, of this 512MB region only the last 21MB before 4GB is NOT consumed by the PCI bus, and that 21MB sliver is occupied by things like pnp/IOAPIC/Reserved.
So my questions are:
What is the signature of a DMA region in /proc/mtrr and /proc/iomem?
Are there other places, such as other proc files or commands, that I can use to see the DMA region?
It seems that by adding rows to /proc/mtrr, a privileged user can change the caching behaviour of any memory region at runtime. So besides the fact that DMA has to be in the lower 32 bits (assuming no DAC), are there other special requirements for DMA memory allocation? If there are no further requirements, then maybe the only hint I can use to identify DMA memory would be /proc/mtrr?
DMA (Direct Memory Access) is just where a device accesses memory itself (without asking the CPU to feed the data to the device). For a (simplified) example of DMA: imagine a random process does a write(), and this bubbles its way up (through VFS, through the file system, through any RAID layer, etc.) until it reaches some kind of disk controller driver; then the disk controller driver tells its disk controller "transfer N bytes from this physical address to that place on the disk and let me know when the transfer has been done". Most devices (disk controllers, network cards, video cards, sound cards, USB controllers, ...) use DMA in some way. Under load, all the devices in your computer may be doing thousands of transfers (via DMA) per second, potentially scattered across all usable RAM.
As far as I know; there are no files in /proc/ that would help (most likely because it changes too fast and too often to bother providing any, and there'd be very little reason for anyone to ever want to look at it).
The MTRRs are mostly irrelevant - they only control the CPU's caches and have no effect on DMA requests from devices.
The /proc/iomem is also irrelevant. It only shows which areas devices are using for their own registers and has nothing to do with RAM (and therefore has nothing to do with DMA).
Note 1: DMA doesn't have to be in the lower 32-bit (e.g. most PCI devices have supported 64-bit DMA/bus mastering for a decade or more); and for the rare devices that don't support 64-bit it's possible for Linux to use an IOMMU to remap their requests (so the device thinks it's using 32-bit addresses when it actually isn't).
Note 2: Once upon a time (a long time ago) there were "ISA DMA controller chips". Like the ISA bus itself; these were restricted to the first 16 MiB of the physical address space (and had other restrictions - e.g. not supporting transfers that cross a 64 KiB boundary). These chips haven't really had a reason to exist since floppy disk controllers became obsolete. You might have a /proc/dma describing these (but if you do it probably only says "cascade" to indicate how the chips connect, with no devices using them).
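If you want to see whether those legacy channels are even present, a trivial sketch that just dumps /proc/dma (assuming your kernel provides the file):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/dma", "r");
        char line[128];

        if (!f) { perror("/proc/dma"); return 1; }
        /* On most modern machines this prints little more than " 4: cascade". */
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }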

Is accessing mapped device memory slow (in terms of latency)?

I know the question is vague, but here is what I hope to learn: the MCU maps some part of the memory address space to devices on the PCI bus, hence in theory user/kernel code can directly read/write device memory as if it were main memory. But data going in and out of PCI Express devices is packaged/serialized/transmitted over lanes, which means each read/write incurs significant overhead, such as packaging (adding headers) and un-packaging. So it is not ideal for user/kernel code to read device memory a byte at a time; instead it should do some sort of bulk transfer. If so, what is the preferred mechanism and API?
BTW, I know there is DMA, but it seems to me that DMA does not require device memory to be directly mapped into the main memory address space - DMA is about letting the device access main memory, and my question is the other way around: letting user/kernel code access device memory. So I am guessing it is not related to the question above; is that correct?
Yes, accessing memory-mapped I/O (MMIO) is slow. The primary reason is that it is typically uncacheable, so every access has to go all the way to the device.
In x86 systems, which I am most familiar with, cacheable memory is accessed in 64-byte chunks, even though processor instructions typically access memory in 1, 2, 4, or 8 byte chunks. If multiple processor instructions access adjacent cacheable memory locations, all but the first access are satisfied from the cache. For similar accesses to device memory, every access incurs the full latency to the device and back.
The second reason is that the path from the processors to memory is critical to performance and is highly optimized.
The path to devices has always been slow, so software is designed to compensate for that, and optimizing the performance of MMIO isn't a priority.
Another related reason is that PCI has ordering rules that require accesses to be buffered and processed in a strict order.
The memory system can handle ordering in a much more flexible way. For example, a dirty cache line may be written to memory at any convenient time. MMIO accesses must be performed precisely in the order that they are executed by the CPU.
The best way to do bulk transfer of data to a device is to have the device itself perform DMA, "pulling" the data from memory into the device, rather than "pushing" it from the CPU to the device. (This also reduces the load on the CPU, freeing it to do other useful work.)
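As for user-space code touching device memory directly (the other direction the question asks about), one common route on Linux is mmap()ing a BAR exposed through sysfs; a hedged sketch, where the 0000:01:00.0 device and the 4 KiB mapping size are placeholders, root privileges are typically needed, and not every BAR is mappable:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical device; pick a real bus address from lspci. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each of these reads goes all the way to the device (uncached). */
        uint32_t v = bar[0];
        printf("register 0 = 0x%08x\n", v);

        munmap((void *)bar, 4096);
        close(fd);
        return 0;
    }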

Increasing Linux DMA_ZONE memory on ARM i.MX287

I am working in an Embedded Linux system which has the 2.6.35.3 kernel.
Within the device we require a 4MB+192kB contiguous DMA capable buffer for one of our data capture drivers. The driver uses SPI transfers to copy data into this buffer.
The user space application issues a mmap system call to map the buffer into user space and after that, it directly reads the available data.
The buffer is allocated using the "alloc_bootmem_low_pages" call, because it is not possible to allocate a buffer larger than 4 MB using other methods, such as kmalloc.
However, due to a recent upgrade, we need to increase the buffer space to 22MB+192kB. As I've read, the Linux kernel has only 16MB of DMA capable memory. Therefore, theoretically this is not possible unless there is a way to tweak this setting.
If anyone knows how to do this, please let me know.
Is this a good idea or will this make the system unstable?
The ZONE_DMA 16MB limit is imposed by a hardware limitation of certain devices. Specifically, on the PC architecture in the olden days, ISA cards performing DMA needed buffers allocated in the first 16MB of the physical address space because the ISA interface had 24 physical address lines which were only capable of addressing the first 2^24=16MB of physical memory. Therefore, device drivers for these cards would allocate DMA buffers in the ZONE_DMA area to accommodate this hardware limitation.
Depending on your embedded system and device hardware, your device either is or isn't subject to this limitation. If it is subject to this limitation, there is no software fix you can apply to allow your device to address a 22MB block of memory, and if you modify the kernel to extend the DMA address space beyond 16MB, then of course the system will become unstable.
On the other hand, if your device is not subject to this limitation (which is the only way it could possibly write to a 22MB buffer), then there is no reason to allocate memory in ZONE_DMA. In this case, I think if you simply replace your alloc_bootmem_low_pages call with an alloc_bootmem_pages call, it should work fine to allocate your 22MB buffer. If the system becomes unstable, then it's probably because your device is subject to a hardware limitation, and you cannot use a 22MB buffer.
It looks like my first attempt at an answer was a little too generic. I think that for the specific i.MX287 architecture you mention in the comments, the DMA zone size is configurable through the CONFIG_DMA_ZONE_SIZE parameter which can be made as large as 32Megs. The relevant configuration option should be under "System Type -> Freescale i.MXS implementations -> DMA memory zone size".
On this architecture, it seems safe to modify it, as it looks like it's not addressing a hardware limitation (the way it was on x86 architectures) but just determining how to lay out memory.
If you try setting it to 32Meg and testing both alloc_bootmem_pages and alloc_bootmem_low_pages in your own driver, perhaps one of those will work.
Otherwise, I think I'm out of ideas.
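For illustration only (this is a sketch under the answer's assumptions, not the original driver's code; the names capture_buf and CAPTURE_BUF_SIZE are invented), the suggested swap would look roughly like this in a 2.6.35-era board init file:

    /*
     * Hedged sketch: boot-time reservation of the capture buffer.
     * alloc_bootmem_*() can only be called during early boot, e.g. from
     * board/machine init code, not from a loadable module.
     */
    #include <linux/bootmem.h>
    #include <linux/init.h>

    #define CAPTURE_BUF_SIZE ((22 * 1024 + 192) * 1024)   /* 22MB + 192kB */

    static void *capture_buf;

    static void __init reserve_capture_buffer(void)
    {
        /* Per the answer above: alloc_bootmem_low_pages() keeps the
         * allocation in the low/DMA region, while alloc_bootmem_pages()
         * does not; both return page-aligned, physically contiguous memory. */
        capture_buf = alloc_bootmem_pages(CAPTURE_BUF_SIZE);
    }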

Means to allocate contiguous physical memory

I am aware that with C malloc and posix_memalign one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether somehow one can allocate a buffer of physically contiguous memory? I am investigating side channel attacks that exploit the L2 cache, so I want to be sure that I can access the right cache lines.
Your best and easiest bet for contiguous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64, 2MB huge pages are usually available and some CPUs can also do 1GB pages; other architectures can be more flexible than this). Check the Hugepagesize field in /proc/meminfo for the huge page size on your setup.
Those can be accessed in two ways:
By means of a MAP_HUGETLB flag passed to mmap() (a minimal sketch follows after these two options). This way you can be sure that the "huge" virtual page corresponds to a contiguous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (the current layout of memory utilization, kernel options, etc. - also see the hugepages kernel boot parameter).
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With HugeTLB file system you can configure the number of huge pages available in advance for some assurance that the necessary amount of huge pages will be available.
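For the first option, a minimal sketch of an anonymous huge-page mapping (the 2MB size is an assumption; the mmap will fail unless huge pages have been reserved, e.g. via /proc/sys/vm/nr_hugepages):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;   /* assumed Hugepagesize */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {          /* no huge pages reserved, etc. */
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }

        /* Touch the page; the whole 2MB range is physically contiguous. */
        ((char *)p)[0] = 1;
        munmap(p, len);
        return 0;
    }
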
The other approach is to write a kernel module which allocates contiguous physical memory on the kernel side and then maps it into your process' address space on request. This approach is sometimes employed on special-purpose hardware in embedded systems. Of course, there is still no guarantee that the kernel-side memory allocator will be able to come up with an appropriately sized contiguous physical address range, so on some occasions such ranges are pre-reserved at boot (one dumb approach is to pass the max_addr parameter to the kernel on boot to leave some of the RAM out of the kernel's reach).
On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since L2_CACHE_SIZE will generally also be a power of 2 and smaller than the page size, any single aligned allocation of size L2_CACHE_SIZE will necessarily fit within a single page, so the bytes in the allocation will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will occupy a single cache line (at least, on standard machine architectures).
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.
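A tiny sketch of that observation (the 64-byte line size and 4K page size are assumptions for the machine at hand):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t line = 64;                    /* assumed L2 cache-line size    */
        void *p = aligned_alloc(line, line); /* C11; posix_memalign works too */

        if (!p)
            return 1;

        /* The 64-byte block starts at this offset within its 4K page, so it
         * ends well before the page boundary and cannot straddle two pages. */
        printf("offset in page: %zu\n", (size_t)((uintptr_t)p % 4096));

        free(p);
        return 0;
    }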
