Profiling cache coherence latency (C)

Is there a tool that makes it possible to monitor the time spent on managing cache coherence by the MESIF protocol on Skylake servers or their successors under Linux? I am also interested in programmatic ways of doing this in C, if possible.
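There is no single counter that reports "time spent in MESIF" directly; in practice it is approximated with coherence-related performance events (snoop responses, HITM loads and so on), for example through perf c2c or Intel VTune. Below is a minimal C sketch using the Linux perf_event_open(2) interface; it counts generic last-level-cache read misses around a region of code, and you would substitute a raw snoop/HITM event code taken from the Intel SDM or `perf list` for your exact CPU (the workload in the middle is just a placeholder).

    /* Minimal sketch: count LL-cache read misses around a code region with
     * perf_event_open(2). Coherence cost itself is not directly exposed; a
     * raw snoop/HITM event code for your CPU would replace the generic
     * cache event configured here. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static int perf_open(struct perf_event_attr *attr)
    {
        /* measure the calling thread, on any CPU */
        return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_open(&attr);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... workload to be measured goes here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("LL-cache read misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }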

Related

Prevent a CPU core from using the LL cache

I have the following problem: I have a low-latency application running on core 0, and a regular application running on core 1. I want to make sure the core 0 app gets as much cache as possible; therefore, I want to make core 1 bypass the L3 cache (not use it at all) and go directly to memory for its data.
Are there any other ways I can ensure that the core 0 app gets priority in using the L3 cache?
Some Intel CPUs support partitioning the L3 cache between different workloads or VMs: Cache Allocation Technology (CAT). It's been supported since Haswell Xeon (v3), and apparently on 11th-gen desktop/laptop CPUs.
Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.
More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux and somewhat for FreeBSD) for managing that and other parts of what Intel's now calling "Resource Director Technology (RDT)" for monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
What you describe would be literally impossible on a desktop Intel CPU (or a Xeon before Skylake), as they use an inclusive L3 cache: a line can only be in L2/L1 if it's in L3 (at least the tags, not the data if a core has it in Modified or Exclusive state). Skylake-X and later Xeons have a non-inclusive L3, so it would be possible in theory; IDK if CAT lets you give one set of cores zero L3.
I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.
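If you want to experiment without the intel-cmt-cat tooling, the Linux kernel exposes CAT through the resctrl filesystem, and the same steps can be driven from C. The sketch below is only illustrative: it assumes resctrl is already mounted at /sys/fs/resctrl, that it runs as root, and that a 2-way mask (0x3) on cache domain 0 and core 1 are the right values for your machine.

    /* Sketch: restrict core 1 (the "regular" app's core) to a small slice
     * of L3 via the Linux resctrl interface (kernel-side Intel CAT/RDT).
     * Group name, way mask and core ID are illustrative assumptions. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static int write_str(const char *path, const char *s)
    {
        FILE *f = fopen(path, "w");
        if (!f) { fprintf(stderr, "%s: %s\n", path, strerror(errno)); return -1; }
        int ok = (fputs(s, f) >= 0);
        if (fclose(f) != 0)
            ok = 0;
        return ok ? 0 : -1;
    }

    int main(void)
    {
        /* Create a resource group for the background workload. */
        if (mkdir("/sys/fs/resctrl/background", 0755) != 0 && errno != EEXIST) {
            perror("mkdir");
            return 1;
        }
        /* Give the group only 2 ways of L3 on cache domain 0 (mask 0x3). */
        if (write_str("/sys/fs/resctrl/background/schemata", "L3:0=3\n") != 0)
            return 1;
        /* Move core 1 into the restricted group. */
        if (write_str("/sys/fs/resctrl/background/cpus_list", "1\n") != 0)
            return 1;

        puts("core 1 now uses only a small share of L3");
        return 0;
    }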

Do efficiency cores support the same instructions as performance cores?

When writing a program that requires high computational performance, it is often necessary to use multiple threads, SIMD vectorization, or other extensions. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume all threads will support the same instruction set? If not, then does this break programs that assume they do all support the same instructions or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only searched the internet so far, but found nothing at a sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, but also for big.LITTLE ARM).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel nerfed support for extensions that only the performance cores had (primarily AVX-512) down to the feature set of the efficiency cores.
Sadly, supporting heterogeneous CPUs isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case. Failing to provide a way to re-enable the disabled extensions (if an OS does support heterogeneous CPUs) prevents an OS from even trying to support them in future, essentially turning a temporary solution into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to (even though it's not a strict need), but then you will also need to pin the thread to a specific CPU or group of CPUs.
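As a concrete illustration, a single CPUID query at startup (here with GCC/Clang's <cpuid.h>) is enough to drive feature dispatch on a hybrid part, since all cores report the same features. A production-grade AVX2 check would also verify OS state-saving support via OSXSAVE/XGETBV, which is omitted here to keep the sketch short.

    /* Sketch: query CPUID once at startup and remember the result.
     * AVX2 is CPUID leaf 7, sub-leaf 0, EBX bit 5. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        int have_avx2 = 0;

        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            have_avx2 = (ebx >> 5) & 1;

        /* On Alder Lake every core reports the same flags, so one query
         * at program start is sufficient for choosing code paths. */
        printf("AVX2 reported: %s\n", have_avx2 ? "yes" : "no");
        return 0;
    }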
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. Typically it just makes things worse, though (it's better to run on an E-core than not to run at all because the P-cores are already busy).
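A hedged sketch of the affinity approach on Linux: restrict the calling thread to a core set with pthread_setaffinity_np. Which core IDs are P-cores is machine-specific (recent kernels expose the P-core list under /sys/devices/cpu_core/cpus on hybrid Intel parts); the IDs 0-7 below are only an assumption for illustration.

    /* Sketch: pin the calling thread to an assumed set of P-cores. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 8; cpu++)   /* assumed P-core IDs 0..7 */
            CPU_SET(cpu, &set);

        int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0) {
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
            return 1;
        }
        printf("thread restricted to the assumed P-core set\n");
        return 0;
    }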
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock, shorter pipeline, less aggressive speculative execution, ...

Can I take advantage of multiple cores in a multi-threaded application that I develop?

If I am writing a multi-threaded C application on Linux (using pthreads), can I take advantage of a multi-core processor?
I mean, what should an application programmer do to take advantage of a multi-core processor? Or does the OS alone take care of it with its various scheduling algorithms?
You don't need to do anything. Create as many threads as you want and the OS will schedule them, together with the threads from all the other processes, across all available cores.
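A minimal sketch of that point: the program only creates the threads, and the kernel spreads them across whatever cores are available without any further help from the application (the thread count and the printed CPU IDs are just for illustration; build with -pthread).

    /* Sketch: plain pthreads, no affinity or scheduling hints needed. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* ... per-thread work would go here ... */
        printf("thread %ld is running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }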
"Take advantage of multi-core" could be understood to mean "utilize multi-core."
Or it could mean "gaining a qualitative advantage from the utilization of multi-core."
Anyone can do the former. They often end up with software that runs slower than if it were single-threaded.
The latter is an entirely different proposition. It requires writing the software so that use of the computing resources shared by all cores (bus locking, RAM, and the L3 cache) is economized, and so that as much computing as possible happens within the individual cores and their L1 caches. On CPUs where the L2 cache is shared by a pair of cores it falls somewhere in between: it is a shared resource, but it is shared by just two cores and is much faster than the resources shared by all cores.
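One concrete instance of economizing on shared resources at this level is avoiding false sharing. In the sketch below (thread count and iteration count are arbitrary), each thread's counter is padded out to a full cache line, assumed to be the usual 64 bytes on x86, so two cores incrementing their own counters do not bounce the same line between their caches.

    /* Sketch: per-thread counters padded to a cache line to avoid false
     * sharing between cores. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define ITERATIONS 10000000L
    #define CACHE_LINE 64            /* assumed x86 line size */

    struct padded_counter {
        unsigned long value;
        char pad[CACHE_LINE - sizeof(unsigned long)];
    };

    static struct padded_counter counters[NTHREADS];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (long i = 0; i < ITERATIONS; i++)
            c->value++;              /* each thread touches only its own line */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, &counters[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("counters[0] = %lu\n", counters[0].value);
        return 0;
    }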
This is at the implementation level, writing and testing the code.
The decisions made at earlier stages - specifically the system's software architecture phase - are usually much more important to the system's long-term quality and performance.
Some posts: 1 2 3. There are many more.

Is the shared L2 cache in multicore processors multiported? [duplicate]

The Intel Core i7 has per-core L1 and L2 caches, and a large shared L3 cache. I need to know what kind of an interconnect connects the multiple L2s to the single L3. I am a student, and need to write a rough behavioral model of the cache subsystem.
Is it a crossbar? A single bus? A ring? The references I came across mention structural details of the caches, but none of them mention what kind of on-chip interconnect exists.
Modern i7s use a ring. From Tom's Hardware:
"Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who explained that he'd seen sustained bandwidth close to 300 GB/s from the Xeon 7500-series' LLC, enabled by the ring bus. Additionally, Intel confirmed at IDF that every one of its products currently in development employs the ring bus."
Your model will be very rough, but you may be able to glean more information from public information on i7 performance counters pertaining to the L3.

How does the Linux OS schedule threads when there are multiple sockets?

For example, in a dual-socket system with two quad-core processors, does the thread scheduler try to keep the threads from the same process on the same processor? Interleaving the threads of different processes across different processors would slow down performance in the case where the threads of a process have a lot of shared memory accesses.
It depends.
On current Intel platforms the BIOS default seems to be that memory is interleaved between the sockets in the system, page by page. Allocate 1Mbyte and half will be on one socket, half on the other. That means that wherever your threads are they have equal access to the data.
This makes it very simple for OSes - anywhere will do.
This can work against you. The SMP hardware environment presented to the OS is synthesised by the CPUs cooperating over QPI. If there are a lot of threads all accessing the same data then those links can get really busy. If they're too busy then that limits the performance, and you're I/O bound. That's where I am; Z80 cores with Intel's memory subsystem design would be just as quick as the Nehalem cores I've actually got (OK, I might be exaggerating...).
At the end of the day the real problem is that memory just isn't quick enough. Intel and AMD have both done some impressive things with memory recently, but we're still hampered by its slowness. Ideally memory would be quick enough so that all cores had clock rate access times to it. The Cell processor sort of did this - each SPE has a bit of SRAM instead of a cache, and once you get your head round them you can make them really sing.
===EDIT===
There is more to it. As Basile Starynkevitch hints, the alternative approach is to embrace NUMA.
NUMA is what modern CPUs actually embody, the memory access being non-uniform because the memory on the other CPU sockets is not accessible directly by addressing a bus. The CPUs instead make a request for data over the QPI link (or Hypertransport in AMD's case) to ask the other CPU to fetch data out of its memory and send it back. Because the CPU is doing all this for you in hardware it ends up looking like a conventional SMP environment. And QPI / Hypertransport are very fast, so most of the time it's plenty quick enough.
If you write your code so as to mirror the architecture of the hardware, you can in theory make improvements. This might involve (for example) keeping two copies of your data in the system, one on each socket. There are memory-affinity routines in Linux to specifically allocate memory that way instead of interleaving it across all sockets. There are also CPU-affinity routines that let you control which CPU core a thread runs on, the idea being that you run it on a core close to the data buffer it will be processing.
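A sketch of what those routines look like in practice, using libnuma (link with -lnuma); node 0 and the buffer size are illustrative choices, the point being that the thread and the data it processes end up on the same socket.

    /* Sketch: allocate a buffer on one NUMA node and keep the calling
     * thread on that node so its accesses stay local. */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        int    node = 0;                 /* assumed target socket/node */
        size_t size = 1 << 20;           /* 1 MiB buffer */

        char *buf = numa_alloc_onnode(size, node);   /* memory on node 0 */
        if (!buf) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }
        numa_run_on_node(node);          /* run this thread on node 0 too */

        /* ... process buf: the thread and its data now share a socket,
         * avoiding QPI / HyperTransport round trips ... */

        numa_free(buf, size);
        return 0;
    }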
OK, so that might mean a lot of investment in the source code to make it work for you (especially if that data duplication doesn't fit well with the program's flow), but if QPI has become a problematic bottleneck it's the only thing you can do.
I've fiddled with this to some extent. In a way it's a right faff. The whole mindset of Intel and AMD (and thus the OSes and libraries too) is to give you an SMP environment which, most of the time, is pretty good. However they let you play with NUMA by having a load of library functions you have to call to get the deployment of threads and memory that you want.
However, for the edge cases where you want that little bit of extra speed, it would be easier if the architecture and OS were rigidly NUMA, with no SMP at all - just like the Cell processor, in fact. Easier, not because it would be simple to write (in fact it would be harder), but because if you got it running at all you'd know for sure that it was as quick as the hardware could ever possibly achieve. With the faked SMP we have right now you can experiment with NUMA, but you're mostly left wondering whether it's as fast as it possibly could be. It's not as if the libraries tell you when you're accessing memory that is actually resident on another socket; they just let you do it, with no hint that there's room for improvement.
