Until recently, I thought CPU caches (at least, some levels) were 4kb in size, so grouping my allocations into 4kb slabs seemed like a good idea, to improve cache locality.
However, reading Line size of L1 and L2 caches I see that all levels of CPU cache lines are 64 bytes, not 4kb. Does this mean my 4kb slabs are useless, and I should just go back to using regular malloc?
Thanks!
4KB does have a significance: it is a common page size for your operating system, and thus for entries in your TLB.
Memory-mapped files, swap space, kernel I/Os (write to file, or socket) will operate on this page size even if you're using only 64 bytes of the page.
For this reason, it can still be somewhat advantageous to use page-sized pools.
Related
Recently I am trying to do some forwarding test with DPDK "testpmd" application. And I find something interesting.
When 512 descriptors are used for TX and RX, the performance is better than using 4096 descriptors. After checking the counters with perf command, I find that a huge number of "dTLB-load-misses" is observed. And it is about more than 100 times of that with 512 descriptors. But the page-faults are always zero. With the ":u" and ":k" arguments, it seems that most of the TLB misses are in the user space. All the buffers are in one huge page for storing the data of network payloads, and the hugepage is 512MB size. Each buffer is less than 3KB. The buffer and the descriptors are one-to-one map.
So is there any clue that I can find the huge number of TLB misses? And will it have some effect to the performance (degradation)?
In general, CPU TLB cache capacity depends on page size. This means that for 4KB pages and for 512MB pages there are might be different number L1/L2 TLB cache entries.
For example, for ARM Cortex-A75:
The data micro TLB is a 48-entry fully associative TLB that is used by load and store operations. The cache entries have 4KB, 16KB, 64KB, and 1MB granularity of VA to PA mappings only.
Source: ARM Info Center
For ARM Cortex-A55:
The Cortex-A55 L1 data TLB supports 4KB pages only.
Any other page sizes are fractured after the L2 TLB and the appropriate page size sent to the L1 TLB.
Source: ARM Info Center
Basically, this means that the 512MB huge page mappings will be fractured to some smaller size (down to 4K) and only those small pieces will be cached in L1 dTLB.
So even if your application is fit into a single 512MB page, still the performance will depend greatly on actual memory footprint.
I realize that most CPUs are better at reading data at an aligned memory address, that is at memory address that is a multiple of CPU word. However, in many places I read about allocating a page-aligned memory. Why might someone want to get a page-aligned memory address? Is it only for even bigger performance?
The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you hit a page boundary, there will be a page fault and you get mapped a new page. There are two consequences of this strategy:
pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses. (the typical effect of heap fragmentation).
larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).
So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.
For bigger chunks, it's better to use mmap() which maps you new pages somewhere directly, so you get "page aligned memory". Using this, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc()implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the desired allocation.
Alignment restrictions are usually associated with direct IO - which bypasses the page cache, copying data to/from disk directly into or from the address space of a process. This can provide significant performance improvements in cases where the page cache is not needed - such as streaming multiple gigabytes of data, especially when doing IO to/from extremely fast disk systems.
Note that only some file systems support direct IO.
On Linux, RedHat's documentation is, in part:
Direct I/O best practices
Users must always take care to use properly aligned and sized IO.
This is especially important for Direct I/O
access. Direct I/O should be aligned on a 'logical_block_size'
boundary and in multiples of the 'logical_block_size'. With native 4K
devices (logical_block_size is 4K) it is now critical that
applications perform Direct I/O that is a multiple of the device's
'logical_block_size'. This means that applications that do not
perform 4K aligned I/O, but 512-byte aligned I/O, will break with
native 4K devices. Applications may consult a device's "I/O Limits"
to ensure they are using properly aligned and sized I/O. The "I/O
Limits" are exposed through both sysfs and block device ioctl
interfaces (also see: libblkid).
sysfs interface
/sys/block//alignment_offset
/sys/block///alignment_offset
/sys/block//queue/physical_block_size
/sys/block//queue/logical_block_size
/sys/block//queue/minimum_io_size
/sys/block//queue/optimal_io_size
Note that the use of direct IO can be limited by actual hardware, as well as software. As noted in the RedHat documentation, physical device limitations matter.
To use direct IO, on Linux the file needs to be opened with the O_DIRECT flag:
int fd = open( filename, O_RDONLY | O_DIRECT );
In my experience, direct IO can result in 20-30% gains in IO performance under certain circumstances. Those circumstances usually involve streaming large amounts of data to/from a file on a very fast file system with the application performing no or very few seek() calls.
Alignment is something that always makes some performance issues. when you write(2) or read(2) a file, it's best if you can adjust the limits of your reading to block aligments, because you make kernel do two block reads instead of one. The worst case being just reading two bytes on a block boundary. Suppose you have a block size of 1024bytes, this code:
char var[2];
int fd;
fd = open("/etc/passwd", O_RDONLY);
lseek(fd, 1023UL, SEEK_SET);
read(fd, &var, sizeof var);
Will make the kernel to force two block reads (at most, as the blocks could be already cached before) for only a two bytes read(2) call.
In the case of memory, all this stuff is normally managed by malloc(3), and, as you don't fail on page faults, you don't get any performance penalties (that's the reason you don't have any standar library function to get aligned memory, even in demand paged virtual systems) as far as you consume memory, the kernel allocs it in pages for you. The processor virtual memory system makes page alignment almost transparent. Only in case you have an unaligned memory access (suppose you access a 32bit integer access misaligned ---unprobable--- to two pages, and those two pages have been swapped out by the kernel, you'll have to wait for the kernel to swap in two pages of memory instead of one ---but that's far improbable thing to occur, the compiler normally forces inner loops to not fail between page boundaries to minimize the probability of this to happen, and you have also the instruction cache to cope with these things)
Said this, there are some places where you do get performance improvements if you somewhat align memory. I'll try to show you a scenario of this:
Suppose you need to dynamically manage a lot of small structures (let's say 16bytes long) and you plan to manage them with malloc(). malloc(3) manages memory including a header in each chunk of memory allocated (let's say this header is 8 bytes long) making the overhead of memory 50% percent more than the ideal. If you arrange to get memory in chunks of (let's say) 64 structures you'll get just one of those headers (8bytes) for each 64*16 = 1024 bytes (amounting for just roughly an 8%)
To manage this, you have to consider knowing to which chunk all of this structures belong (so you can free(3) the chunk when not in use), and you can do this in two ways: 1.- Using a pointer (adding 4 bytes to each structure size --this is pointless as you'll add 4 bytes to each structure, lossing a 25% of memory again) to point to the chunck, or 2.- *forcing the chunck to be aligned, so the chunk address can be easily calculated from the struct address (you only need to substract the rest of the division mod chunksize to the struct address) to get the chunk address. This last method doesn't impose any overhead to locate the chunck, but imposes the practice of all chunks to be chunk aligned (not page aligned).
In this way, you improve performance too much, as you reduce considerably the amount of malloc(3) calls and the waste of memory imposed by allocating small amounts of memory.
By the way, malloc doesn't ask the operating system for the memory you ask it at each call. It allocates memory in chunks, in a manner similar as has been explained here, and normal implementations don't even manage to return the allocated memory to the system again (reusing the freed memory before allocating new one) It controls the calls to sbrk(2) system call, what means that you are going to interfere with malloc in case you use this system call.
Linux/unix will give you page aligned memory when you use shmat(2) system call. Try reading this and related documents.
I am aware that with C malloc and posix_memaligh one can allocate contiguous memory from the virtual address space of a process. However, I was wondering whether somehow one can allocate a buffer of physically contiguous memory? I am investigating side channel attacks that exploit L2 cache so I want to be sure that I can access the right cache lines..
Your best and easiest take at continuous memory is to request a single "huge" page from the system. The availability of those depends on your CPU and kernel options (on x86_64 the 2MB huge pages are usually available and some CPUs can also do 1GB pages; other architectures can be more flexible than this). Check out Hugepagesize field in /proc/meminfo for the size of huge pages on your setup.
Those can be accessed in two ways:
By means of a MAP_HUGETLB flag passed to mmap(). This way you can be sure that the "huge" virtual page corresponds to a continuous physical memory range. Unfortunately, whether the kernel can supply you with a "huge" page depends on many factors (current layout of memory utilization, kernel options, etc - also see the hugepages kernel boot parameter).
By means of mapping a file from a dedicated HugeTLB filesystem (see here: http://lwn.net/Articles/375096/). With HugeTLB file system you can configure the number of huge pages available in advance for some assurance that the necessary amount of huge pages will be available.
The other approach is to write a kernel module which will allocate continuous physical memory on the kernel side and then map it into your process' address space on request. This approach is sometimes employed on special purpose hardware in embedded systems. Of course, there's still no guarantee that the kernel side memory allocator will be able to come with an appropriately sized continuous physical address range, so on some occasions such address ranges are pre-reserved on boot (one dumb approach is to pass max_addr parameter to kernel on boot to leave some of the RAM out of kernel's reach).
On (almost [Note 1]) all virtual memory architectures, virtual memory is mapped to physical memory in units of a "page". The size of a page is (almost) always a power of 2, and pages are aligned by that size, because the mapping is done by only using the high-order bits of the address. It's common to see a page size of 4K (12 bits of address), although modern CPUs have an option to map much larger pages in order to reduce the size of mapping tables.
Since L2_CACHE_SIZE will generally also be a power of 2 and will be smaller than the page size, any single aligned allocation of size L2_CACHE_SIZE will necessarily be in a single page, so the bytes in the alignment will be physically contiguous as well.
So in this particular case, you can be assured that your allocated memory will be a single cache-line (at least, on standard machine architectures).
Note 1: Undoubtedly there are machines -- possibly imaginary -- which do not function this way. But the one you are playing with is not one of them.
In a 32-bit machine each process gets a 4GB virtual space. In this case one can worry that we might face trouble due to fragmentation. But in the case of a 64-bit machine we theoretically have a huge addressable virtual memory, so why is memory fragmentation still an issue (if it is) in a 64-bit machine?
Each virtual address that you try to access is mapped by the operating system to physical memory. Physical memory is allocated in pages (e.g. 4K in size). If you manage to allocate a byte at offset 1000000*n and do it for n from 1 to 1000000 (I think you could do that with mmap), then the OS will have to back that with a million pages of physical memory, which is something like 4G. That physical memory will not be available for anything else. If you had allocated the bytes contiguously, you'd only need about 1M of physical memory (256 pages) for your million bytes.
You can get in a similar bad situation if you allocate 4G for legitimate reasons, and then deallocate parts of it, keeping a bit of every page allocated. The OS will not be able to actually reuse the freed memory for anything else because there are no physical pages that are fully free. So that's a fragmentation problem.
In theory, you could imagine that virtual addresses 1000000 and 2000000 would map to the same page of physical memory, avoiding the fragmentation. But in practice, and for good reasons, the virtual memory mapping is done on a page by page basis. You can read more about it here: http://en.wikipedia.org/wiki/Page_table.
Because all that memory is "wasted" consider an application where you have a lot of internal fragmentation. That process requires more pages in memory because the working set is now scattered in memory and that means its memory footprint is much higher. If this application is contending for physical slots in RAM (machines still really only have about 4 - 8 GB of RAM for a typical home setup) then it causes more page swapping. Generally you want to reduce your applications memory footprint to avoid memory pressure and contention with other applications.
There are cases though where it doesn't really matter, it won't kill you to use an extra megabyte here or there but it all adds up in larger applications. It depends on the situation as to whether or not it is important to have as little fragmentation as possible depending on what you're coding or what the aim of your project is.
I have understood why memory should be aligned to 4 byte and 8 byte based on data width of the bus. But following statement confuses me
"IoDrive requires that all I/O performed on a device using O_DIRECT must be 512-byte
alligned and a multiple of 512 bytes in size."
What is the need for aligning address to 512 bytes.
Blanket statements blaming DMA for large buffer alignment restrictions are wrong.
Hardware DMA transfers are usually aligned on 4 or 8 byte boundaries since the PCI bus can physically transfer 32 or 64bits at a time. Beyond this basic alignment, hardware DMA transfers are designed to work with any address provided.
However, the hardware deals with physical addresses, while the OS deals with virtual memory addresses (which is a protected mode construct in the x86 cpu). This means that a contiguous buffer in process space may not be contiguous in physical ram. Unless care is taken to create physically contiguous buffers, the DMA transfer needs to be broken up at VM page boundaries (typically 4K, possibly 2M).
As for buffers needing to be aligned to disk sector size, this is completely untrue; the DMA hardware is completely oblivious to the physical sector size on a hard drive.
Under Linux 2.4 O_DIRECT required 4K alignment, under 2.6 it's been relaxed to 512B. In either case, it was probably a design decision to prevent single sector updates from crossing VM page boundaries and therefor requiring split DMA transfers. (An arbitrary 512B buffer has a 1/4 chance of crossing a 4K page).
So, while the OS is to blame rather than the hardware, we can see why page aligned buffers are more efficient.
Edit: Of course, if we're writing large buffers anyways (100KB), then the number of VM page boundaries crossed will be practically the same whether we've aligned to 512B or not.
So the main case being optimized by 512B alignment is single sector transfers.
Usually large alignment requirements like that are due to underlying DMA hardware. Large block transfers can sometimes be made much faster by requiring much stronger alignment restrictions than what you have here.
On several ARM processors, the first level translation table has to be aligned on a 16 KB boundary!
If you don't know what you're doing, don't use O_DIRECT.
O_DIRECT means "direct device access". This means it bypasses all OS caches, hitting the disk (or possibly RAID controller, etc) directly. Disk accesses are on a per-sector basis.
EDIT: The alignment requirement is for the IO offset/size; it's not usually a memory-alignment requirement.
EDIT: If you're looking at this page (it appears to be the only hit), it also says that the memory must be page-aligned.