How to use Intel Westmere 1GB pages on Linux? - c

Edit: I updated my question with the details of my benchmark.
For benchmarking purposes, I am trying to set up 1GB pages on a Linux 3.13 system running on top of two Intel Xeon 56xx ("Westmere") processors. To that end I modified my boot parameters to reserve 1GB pages (10 of them); the boot parameters request only 1GB pages and no 2MB ones. Running hugeadm --pool-list gives:
      Size  Minimum  Current  Maximum  Default
1073741824       10       10       10        *
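For reference, such a reservation corresponds to kernel boot parameters of the form (the same parameters shown in the answer below):
hugepagesz=1G hugepages=10 default_hugepagesz=1G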
My kernel boot parameters are taken into account. In my benchmark I am allocating 1GiB of memory that I want to be backed by a 1GiB huge page using:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define PROTECTION (PROT_READ | PROT_WRITE)
#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)

uint64_t size = 1UL*1024*1024*1024;
void *memory = mmap(NULL, size, PROTECTION, FLAGS, -1, 0); /* fd is -1 for MAP_ANONYMOUS */
if (memory == MAP_FAILED) {
    perror("mmap");
    exit(1);
}
sleep(200);
Looking at /proc/meminfo while the bench is sleeping (the sleep call above), we can see that one huge page has been allocated:
AnonHugePages: 4096 kB
HugePages_Total: 10
HugePages_Free: 9
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Note: I disabled THP (through the /sys file system) before running the bench, so I guess the AnonHugePages field reported by /proc/meminfo represents huge pages that THP allocated before it was disabled.
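As a side note, this check can be automated. Here is a minimal sketch (the helper name is illustrative) that parses HugePages_Free out of /proc/meminfo, so the bench can assert that the count drops by one once the mapping is touched:

#include <stdio.h>

/* Return the current HugePages_Free count, or -1 on error. */
long hugepages_free(void)
{
    char line[256];
    long n = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "HugePages_Free: %ld", &n) == 1)
            break;
    }
    fclose(f);
    return n;
}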
At this point everything looks fine, but unfortunately my bench leads me to believe that many 2MiB pages are being used rather than one 1GiB page. Here is why:
This bench randomly accesses the allocated memory through pointer chasing: a first step fills the memory to enable pointer chasing (each cell points to another cell), and in a second step the bench navigates through the memory using
pointer = *pointer;
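The fill step is not shown in the question; a minimal sketch of one common way to do it (Sattolo's algorithm, which links all cells into a single random cycle; names are illustrative, not the benchmark's actual code) could look like this:

#include <stdlib.h>

/* Link the `ncells` cells of `mem` into one random cycle, so that
 * `pointer = *pointer` visits every cell in a random order. */
void fill_chase(void **mem, size_t ncells)
{
    size_t *order = malloc(ncells * sizeof(*order));
    for (size_t i = 0; i < ncells; i++)
        order[i] = i;
    /* Sattolo's shuffle: yields a single cycle over all cells. */
    for (size_t i = ncells - 1; i > 0; i--) {
        size_t j = rand() % i;          /* j in [0, i) */
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i + 1 < ncells; i++)
        mem[order[i]] = &mem[order[i + 1]];
    mem[order[ncells - 1]] = &mem[order[0]];  /* close the cycle */
    free(order);
}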
Using the perf_event_open system call, I am counting data TLB read misses for the second step of the bench only. When the allocated memory size is 64MiB, I count a very small number of data TLB read misses: 0.01% of my 6400000 memory accesses. In other words, all the accesses hit in the TLB; 64MiB of memory fits in it. As soon as the allocated memory size is greater than 64MiB, I see data TLB read misses: for a memory size equal to 128MiB, 50% of my 6400000 memory accesses miss in the TLB. So 64MiB appears to be the size that fits in the TLB, and 64MiB = 32 entries (as reported below) * 2MiB pages. I conclude that I am not using 1GiB pages but 2MiB ones.
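For reference, a minimal sketch of counting dTLB read misses around such a chase loop with perf_event_open (this uses the generic hw-cache event encoding; whether it maps onto a hardware counter is CPU-specific, and the helper names are illustrative):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_dtlb_read_miss_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_DTLB
                | (PERF_COUNT_HW_CACHE_OP_READ << 8)
                | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* measure the calling thread, on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Chase `iters` pointers starting at `p`; return dTLB read misses. */
static uint64_t chase_and_count(void **p, uint64_t iters)
{
    uint64_t misses = 0;
    int fd = open_dtlb_read_miss_counter();
    if (fd < 0)
        return 0;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (uint64_t i = 0; i < iters; i++)
        p = (void **)*p;                /* pointer = *pointer; */
    __asm__ __volatile__("" : "+r"(p)); /* keep the loop from being optimized away */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof(misses));
    close(fd);
    return misses;
}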
Can you see any explanation for this behavior?
Moreover, the cpuid tool reports the following about the TLB on my system:
cache and TLB information (2):
0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x55: instruction TLB: 2M/4M pages, fully, 7 entries
0xb0: instruction TLB: 4K, 4-way, 128 entries
0xca: L2 TLB: 4K, 4-way, 512 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
As you can see, there is no information about 1GiB pages. How many such pages can be cached in the TLB?

TL;DR
You (specifically, your processor) cannot benefit from 1GB pages in this scenario, but your code is correct without modifications on systems that can.
Long version
I followed these steps to attempt to reproduce your problem.
My System: Intel Core i7-4700MQ, 32GB RAM 1600MHz, Chipset H87
svn co https://github.com/ManuelSelva/c4fun.git
cd c4fun.git/trunk
Ran make. Discovered a few dependencies were needed and installed them. The build failed, but mem_load did build and link, so I did not pursue the rest further.
Rebooted the system, appending at GRUB time to the boot arguments the following:
hugepagesz=1G hugepages=10 default_hugepagesz=1G
which reserves 10 1GB pages.
cd c4fun.git/trunk/mem_load
Ran several tests using mem_load, in random-access pattern mode, pinning it to core 3 (i.e., any core other than 0, the bootstrap processor):
./mem_load -a rand -c 3 -m 1073741824 -i 1048576
This resulted in approximately nil TLB misses.
./mem_load -a rand -c 3 -m 10737418240 -i 1048576
This resulted in approximately 60% TLB misses. On a hunch I did
./mem_load -a rand -c 3 -m 4294967296 -i 1048576
This resulted in approximately nil TLB misses. On a hunch I did
./mem_load -a rand -c 3 -m 5368709120 -i 1048576
This resulted in approximately 20% TLB misses.
At this point I downloaded the cpuid utility. It gave me this for cpuid -1 | grep -i tlb:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xc1: L2 TLB: 4K/2M pages, 8-way, 1024 entries
L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
As you can see, my TLB has 4 entries for 1GB pages. This explains my results well: for 1GB and 4GB arenas, the 4 slots of the TLB are entirely sufficient to satisfy all accesses. For 5GB arenas in random-access pattern mode, only 4 of the 5 pages can be mapped by the TLB, so chasing a pointer into the remaining one causes a miss. The probability of chasing a pointer into the unmapped page is 1/5, so we expect a miss rate of 1/5 = 20%, and we get that. For 10GB, 4/10 pages are mapped and 6/10 aren't, so the miss rate is 6/10 = 60%, and we got that.
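The arithmetic generalizes. Here is a tiny model making the expectation explicit (my own simplification for illustration, not code from the benchmark):

/* Expected miss rate for uniform random pointer chasing across `pages`
 * 1GB pages with `entries` 1GB TLB slots: a hit requires landing on
 * one of the `entries` currently-mapped pages. */
double expected_miss_rate(unsigned pages, unsigned entries)
{
    if (pages <= entries)
        return 0.0;
    return (double)(pages - entries) / pages;
}
/* expected_miss_rate(5, 4)  = 0.20 -> ~20% observed
 * expected_miss_rate(10, 4) = 0.60 -> ~60% observed */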
So your code works without modification, at least on my system; the code itself does not appear to be the problem.
I then did some research on CPU-World, and while not all CPUs are listed with TLB geometry data, some are. The only one I saw that matched your cpuid printout exactly (there could be more) is the Xeon Westmere-EP X5650; CPU-World does not explicitly say that the Data TLB0 has entries for 1GB pages, but does say the processor has "1 GB large page support".
I then did more research and finally nailed it. An author at RealWorldTech makes an off-hand comment (admittedly, one for which I have yet to find a corroborating source) in a discussion of the memory subsystem of Sandy Bridge. It reads as follows:
After address generation, uops will access the DTLB to translate from a virtual to a physical address, in parallel with the start of the cache access. The DTLB was mostly kept the same, but the support for 1GB pages has improved. Previously, Westmere added support for 1GB pages, but fragmented 1GB pages into many 2MB pages since the TLB did not have any 1GB page entries. Sandy Bridge adds 4 dedicated entries for 1GB pages in the DTLB.
(Emphasis added)
Conclusion
Whatever nebulous concept "CPU supports 1GB pages" represents, Intel thinks it does not imply "TLB supports 1GB page entries". I'm afraid that you will not be able to use 1GB pages on an Intel Westmere processor to reduce the number of TLB misses.
That, or Intel is hoodwinking us by distinguishing huge pages (in the TLB) from large pages.

Related

TLB, CPUID and Hugepages?

If I mount 64MB of 2MB hugepages to /mnt/huge2mb, which TLB entries are those pages using? I mmap()-ed them in my C program.
cpuid's output:
cache and TLB information (2):
0x63: data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xff: cache data is in CPUID 4
0xb6: instruction TLB: 4K, 8-way, 128 entries
0xf0: 64 byte prefetching
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
I believe those mounted 2MB hugepages belong to data, and so they use data TLB entries.
However, the data TLB entries are for 1G and 4K pages.
Then, what TLB entries are used for those 2MB hugepages? L2 TLB entries? If so, what is the L2 TLB? Is it shared by data and instructions? If so, then there is overlap for 4K data pages between the data TLB and the L2 TLB. What's the purpose of the extra 64 entries for 4K pages then?
Thanks!
First, I wouldn't necessarily assume that the data from CPUID itself is correct (over the years there have been various pieces of errata), and even if the data from CPUID is correct I wouldn't necessarily assume that the code in Linux interprets it correctly (over the years, determining cache characteristics has become a horrible mess).
Without having any clue what the CPU is (and without being able to check whether the information was reported correctly by CPUID and Linux), based on the information shown I'd be tempted to suspect that the L2 TLB (0xc3: 4K/2M pages, 6-way, 1536 entries) is used for both instructions and data and will hold your 2MiB pages; and that when those 2MiB pages are accessed, the CPU will also split the 2MiB TLB entries into pieces and use the data TLB (0x03: 4K pages, 4-way, 64 entries) for (4KiB pieces of) the 2MiB pages.
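For completeness, a minimal sketch of how such 2MB pages are typically obtained from C: mapping a file on the hugetlbfs mount from the question (the file name is illustrative; the huge page size comes from the mount itself):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64UL * 1024 * 1024;  /* 64MB = 32 x 2MB pages */
    int fd = open("/mnt/huge2mb/buf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    /* ... touch/use p; each fault consumes one 2MB huge page ... */
    munmap(p, len);
    close(fd);
    return 0;
}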

Huge number of "dTLB-load-misses" in a DPDK forwarding test

Recently I have been doing some forwarding tests with the DPDK "testpmd" application, and I found something interesting.
When 512 descriptors are used for TX and RX, the performance is better than with 4096 descriptors. Checking the counters with the perf command, I see a huge number of "dTLB-load-misses": with 4096 descriptors the count is more than 100 times what it is with 512 descriptors. Page faults, however, are always zero. Judging by the ":u" and ":k" modifiers, most of the TLB misses occur in user space. All the buffers used to store network payload data live in one huge page, and that huge page is 512MB in size. Each buffer is less than 3KB, and buffers and descriptors map one-to-one.
So is there any clue as to where this huge number of TLB misses comes from? And can it degrade performance?
In general, CPU TLB capacity depends on the page size. This means that for 4KB pages and for 512MB pages there may be a different number of L1/L2 TLB entries.
For example, for ARM Cortex-A75:
The data micro TLB is a 48-entry fully associative TLB that is used by load and store operations. The cache entries have 4KB, 16KB, 64KB, and 1MB granularity of VA to PA mappings only.
Source: ARM Info Center
For ARM Cortex-A55:
The Cortex-A55 L1 data TLB supports 4KB pages only.
Any other page sizes are fractured after the L2 TLB and the appropriate page size sent to the L1 TLB.
Source: ARM Info Center
Basically, this means that the 512MB huge page mappings will be fractured into some smaller size (down to 4K), and only those small pieces will be cached in the L1 dTLB.
So even if your application fits into a single 512MB page, performance will still depend greatly on the actual memory footprint.

Is there any benefit to using 4kb allocation pools?

Until recently, I thought CPU caches (at least, some levels) were 4KB in size, so grouping my allocations into 4KB slabs seemed like a good idea to improve cache locality.
However, reading Line size of L1 and L2 caches, I see that cache lines at all levels are 64 bytes, not 4KB. Does this mean my 4KB slabs are useless, and I should just go back to using regular malloc?
Thanks!
4KB does have a significance: it is a common page size for your operating system, and thus for entries in your TLB.
Memory-mapped files, swap space, kernel I/Os (write to file, or socket) will operate on this page size even if you're using only 64 bytes of the page.
For this reason, it can still be somewhat advantageous to use page-sized pools.
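A minimal sketch of what a page-sized pool allocation looks like in practice (using C11's aligned_alloc, and assuming a 4KB page size):

#include <stdlib.h>

#define SLAB_SIZE 4096  /* one OS page, hence at most one TLB entry */

/* Allocate a page-aligned, page-sized slab. aligned_alloc requires
 * the size to be a multiple of the alignment (C11). */
void *alloc_slab(void)
{
    return aligned_alloc(SLAB_SIZE, SLAB_SIZE);
}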

Cassandra hardware configuration

I have a question regarding hardware related performance issues.
Our Cassandra nodes had 4 cores with 2GB RAM, and we suffered from unreasonable response times (1.5 seconds average per read at 200 calls/sec).
We then upgraded the machines to 8 cores with 8GB RAM and immediately saw an improvement (around 300ms now).
However, server analytics don't show any peak or extra use of CPU power.
How can this be explained? Does an upgrade from 4 cores to 8 cores explain such a performance boost even though the server's CPU usage seems unaffected?
Thanks
Cassandra needs more memory to hold data in memtables and for faster read response times.
8GB to 16GB is what we assign to the Cassandra process, with JVM parameters tweaked, on nodes with 4 quad-core CPUs (16 cores per node) and SATA drives.
Make sure the commit log and the data directory are on separate disks.
If you left the configuration parameters at their defaults, you saw an improvement in performance partly because your key cache is now larger:
key_cache_size_in_mb = (min(5% of Heap (in MB), 100MB))
cassandra-env.sh automatically decides how much heap to give to the JVM (if you did not change it). It sets the max heap size as follows:
max(min(1/2 ram, 1024MB), min(1/4 ram, 8192MB))
That is: calculate 1/2 of RAM and cap it at 1024MB; calculate 1/4 of RAM and cap it at 8192MB; pick the larger of the two.
Since your max heap size was increased, your key cache was increased as well.
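A quick sketch of that calculation with the asker's before/after RAM sizes plugged in (my transcription of the formula above, not Cassandra's actual script):

#include <stdio.h>

static long min(long a, long b) { return a < b ? a : b; }
static long max(long a, long b) { return a > b ? a : b; }

/* Default max heap, in MB, per the cassandra-env.sh formula above. */
static long default_heap_mb(long ram_mb)
{
    return max(min(ram_mb / 2, 1024), min(ram_mb / 4, 8192));
}

int main(void)
{
    printf("%ld\n", default_heap_mb(2048)); /* 2GB RAM -> 1024MB heap */
    printf("%ld\n", default_heap_mb(8192)); /* 8GB RAM -> 2048MB heap */
    /* key cache: min(5% of heap, 100MB) -> ~51MB before, 100MB after */
    return 0;
}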

How many bytes does the cache controller fetch at a time from main memory into the L2 cache?

I just read two articles on this topic that provide inconsistent information, so I want to know which one is correct. Perhaps both are correct, but in what context?
The first one states that we fetch a page at a time:
The cache controller is always observing the memory positions being loaded and loading data from several memory positions after the memory position that has just been read.
To give you a real example, if the CPU loaded data stored in the address 1,000, the cache controller will load data from "n" addresses after the address 1,000. This number "n" is called a page; if a given processor is working with 4KB pages (which is a typical value), it will load data from the 4,096 addresses below the current memory position being loaded (address 1,000 in our example).
The second one states that we fetch a cache line plus a prefetcher-sized chunk at a time:
So we can summarize how the memory cache works as:
The CPU asks for an instruction/data stored in address "a".
Since the contents of address "a" aren't inside the memory cache, the CPU has to fetch it directly from RAM.
The cache controller loads a line (typically 64 bytes) starting at address "a" into the memory cache. This is more data than the CPU requested, so if the program continues to run sequentially (i.e. asks for address a+1), the next instruction/data the CPU asks for will already be loaded in the memory cache.
A circuit called the prefetcher loads more data located after this line, i.e. starts loading the contents from address a+64 onward into the cache. To give you a real example, Pentium 4 CPUs have a 256-byte prefetcher, so they load the next 256 bytes after the line already loaded into the cache.
Completely hardware implementation dependent. Some implementations load a single line from main memory at a time — and cache line sizes vary a lot between different processors. I've seen line sizes from 64 bytes all the way up to 256 bytes. Basically what the size of a "cache line" means is that when the CPU requests memory from main RAM, it does so n bytes at a time. So if n is 64 bytes, and you load a 4-byte integer at 0x1004, the MMU will actually send 64 bytes across the bus, all the addresses from 0x1000 to 0x1040. This entire chunk of data will be stored in the data cache as one line.
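That chunking is just address masking. A tiny sketch of the line computation from the example above (a 64-byte line size is assumed for illustration):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64UL

int main(void)
{
    uintptr_t addr = 0x1004;                   /* 4-byte load here */
    uintptr_t line = addr & ~(LINE_SIZE - 1);  /* -> 0x1000 */
    printf("load at %#lx fills line %#lx..%#lx\n",
           (unsigned long)addr,
           (unsigned long)line,
           (unsigned long)(line + LINE_SIZE)); /* 0x1000..0x1040 */
    return 0;
}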
Some MMUs can fetch multiple cache lines across the bus per request, so that making a request at address 0x1000 on a machine with 64-byte cache lines actually loads four lines, from 0x1000 to 0x1100. Some systems let you do this explicitly with special cache-prefetch or DMA opcodes.
The article at your first link, however, is completely wrong. It confuses the size of an OS memory page with a hardware cache line. These are totally different concepts. The first is the minimum size of virtual address space the OS will allocate at once. The latter is a detail of how the CPU talks to main RAM.
They resemble each other only in the sense that when the OS runs low on physical memory, it will page some not-recently-used virtual memory to disk; then later on, when you use that memory again, the OS loads that whole page from disk back into physical RAM. This is analogous (but not related) to the way that the CPU loads bytes from RAM, which is why the author of "Hardware Secrets" was confused.
A good place to learn all about computer memory and why caches work the way they do is Ulrich Drepper's paper, What Every Programmer Should Know About Memory.
