Is the shared L2 cache in multicore processors multiported? [duplicate] - c

The Intel Core i7 has per-core L1 and L2 caches and a large shared L3 cache. I need to know what kind of interconnect connects the multiple L2s to the single L3. I am a student and need to write a rough behavioral model of the cache subsystem.
Is it a crossbar? A single bus? A ring? The references I came across describe structural details of the caches, but none of them say what kind of on-chip interconnect exists.
Thanks,
-neha

Modern i7s use a ring bus. From Tom's Hardware:
Earlier this year, I had the chance to talk to Sailesh Kottapalli, a
senior principal engineer at Intel, who explained that he’d seen
sustained bandwidth close to 300 GB/s from the Xeon 7500-series’ LLC,
enabled by the ring bus. Additionally, Intel confirmed at IDF that
every one of its products currently in development employs the ring
bus.
Your model will be very rough, but you may be able to glean more information from public information on i7 performance counters pertaining to the L3.
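For a rough behavioral model, one possible starting point is to treat the ring as a set of stops and charge a fixed cost per hop. The sketch below assumes a bidirectional ring; the stop count, the cycle costs, and the address-to-slice mapping are all placeholders chosen for illustration, not Intel's figures (the real slice hash is not publicly documented).

```c
#include <stdint.h>

/* All numbers here are placeholders for a toy model, not Intel's figures. */
enum { RING_STOPS = 8, HOP_CYCLES = 1, SLICE_ACCESS_CYCLES = 20 };

/* Shortest hop count between two stops on a bidirectional ring. */
static unsigned ring_hops(unsigned from, unsigned to, unsigned stops)
{
    unsigned cw = (to + stops - from) % stops;   /* clockwise distance */
    unsigned ccw = stops - cw;                   /* counter-clockwise distance */
    return cw < ccw ? cw : ccw;
}

/* Hypothetical slice selection: low bits of the cache-line address.
   This is only a stand-in for whatever hash the hardware uses. */
static unsigned l3_slice_for(uint64_t paddr, unsigned stops)
{
    return (unsigned)((paddr >> 6) % stops);     /* 64-byte lines assumed */
}

/* Modeled round-trip cycles for a core's L2 miss that hits in L3:
   request travels to the slice, the slice is accessed, data travels back. */
static unsigned l3_hit_cycles(unsigned core, uint64_t paddr)
{
    unsigned slice = l3_slice_for(paddr, RING_STOPS);
    return 2 * ring_hops(core, slice, RING_STOPS) * HOP_CYCLES
           + SLICE_ACCESS_CYCLES;
}
```

With these placeholder constants, a request from core 0 to a line that maps to slice 1 costs two hops plus the slice access. Contention on the ring is not modeled at all; adding per-stop queues would be the next refinement.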

Related

Prevent a CPU core from using the LL cache

I have the following problem: I have a low-latency application running on core 0 and a regular application running on core 1. I want to make sure that the core 0 app gets as much cache as possible, so I want to make core 1 bypass the L3 cache (not use it at all) and go directly to memory for data.
Are there other ways to make sure the core 0 app gets priority in using the L3 cache?
Some Intel CPUs support partitioning the L3 cache between different workloads or VMs through Cache Allocation Technology (CAT). It has been supported since Haswell Xeon (v3), and apparently on 11th-gen desktop/laptop CPUs.
Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.
More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux and somewhat for FreeBSD) for managing that and other parts of what Intel's now calling "Resource Director Technology (RDT)" for monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
What you describe would be literally impossible on a desktop Intel CPU (or a Xeon before Skylake), as they use an inclusive L3 cache: a line can only be in L2/L1 if it is also in L3 (at least the tags; not necessarily the data if a core has the line in Modified or Exclusive state). Skylake-X and later Xeons have a non-inclusive L3, so it would be possible in theory; I don't know whether CAT lets you give one set of cores zero L3.
I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.
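To make the CAT partitioning idea concrete: a share of the L3 is expressed as a capacity bitmask over cache ways, and on Intel the set bits must be contiguous. The sketch below builds such masks and formats them in the style of the Linux resctrl `schemata` file. The helper names and the 12-way cache are assumptions for illustration; the actual number of ways and the minimum mask width depend on the CPU, so check what your system reports before relying on this.

```c
#include <stdint.h>
#include <stdio.h>

/* Build a mask of `ways` contiguous bits starting at bit `first`.
   Returns 0 if the request is empty or does not fit in `total_ways`. */
static uint32_t cat_mask(unsigned first, unsigned ways, unsigned total_ways)
{
    if (ways == 0 || total_ways > 32 || first + ways > total_ways)
        return 0;
    uint32_t ones = (ways == 32) ? UINT32_MAX : ((UINT32_C(1) << ways) - 1);
    return ones << first;
}

/* Format one line in the style of a resctrl schemata entry, e.g. "L3:0=ffc".
   `cache_id` is the L3 domain the mask applies to. Returns the number of
   characters written (as snprintf does). */
static int cat_schemata_line(char *buf, size_t len, unsigned cache_id,
                             uint32_t mask)
{
    return snprintf(buf, len, "L3:%u=%x", cache_id, (unsigned)mask);
}
```

For the scenario in the question, assuming a 12-way L3, `cat_mask(0, 2, 12)` gives the regular app the two lowest ways (0x3) and `cat_mask(2, 10, 12)` gives the low-latency app the remaining ten (0xffc). The masks do not overlap, so neither workload can evict the other's lines from L3.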

Profiling cache coherence latency

Is there a tool that makes it possible to monitor the time spent on managing cache coherence by MESIF protocol on Skylake servers or its successors for Linux OS? I am also interested in programmatic ways in C if possible.

Do efficiency cores support the same instructions as performance cores?

When writing a program that needs high computational performance, one often relies on multiple threads, SIMD vectorization, or other instruction-set extensions. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume all threads will support the same instruction set? If not, then does this break programs that assume they do all support the same instructions or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only tried to find information on the internet, but found nothing of sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, and also for ARM big.LITTLE).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel disabled extensions that the performance cores already supported (primarily AVX-512) to match the features present in the efficiency cores.
Sadly, supporting heterogeneous CPU isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case; and failing to provide a way to re-enable disabled extensions (if an OS supports heterogeneous CPUs) prevents an OS from trying to support heterogeneous CPUs in future; essentially turning a temporary solution into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to (even though it's not a strict need); but will also need to pin the thread to a specific CPU or group of CPUs.
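If you do want per-core tuning, a thread can ask which kind of core it is currently on. The sketch below is one way to do that with the `__get_cpuid_count` helper from GCC/Clang's `<cpuid.h>`. It relies on my reading of Intel's CPUID documentation (leaf 7 EDX bit 15 as the hybrid flag, leaf 0x1A EAX[31:24] as the core type), so treat those constants as assumptions to verify. The `core_kind` enum and function name are made up for this example.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

enum core_kind { CORE_UNKNOWN = 0, CORE_EFFICIENCY, CORE_PERFORMANCE };

/* Report the kind of core the calling thread is running on right now.
   The answer is only stable if the thread is pinned; otherwise the
   scheduler may migrate it right after the query.
   CORE_UNKNOWN covers non-hybrid parts, unrecognized types, and non-x86. */
static enum core_kind current_core_kind(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Leaf 7, subleaf 0: EDX bit 15 flags a hybrid part. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
        !(edx & (1u << 15)))
        return CORE_UNKNOWN;

    /* Leaf 0x1A: EAX[31:24] is the core type of this logical CPU. */
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx))
        return CORE_UNKNOWN;

    switch (eax >> 24) {
    case 0x20: return CORE_EFFICIENCY;   /* Atom-class core */
    case 0x40: return CORE_PERFORMANCE;  /* Core-class core */
    default:   return CORE_UNKNOWN;
    }
#else
    return CORE_UNKNOWN;                 /* no CPUID on this architecture */
#endif
}
```

A thread would call `current_core_kind()` once after being pinned and then select the code path tuned for that core type. Feature detection itself can still be done once per process, as stated above.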
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. It typically just makes things worse, though (it is better to run on an E-core than to not run at all because the P-cores are already busy).
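As a minimal sketch of the affinity approach on Linux, the helper below restricts the calling thread to a caller-supplied list of CPU numbers using `sched_setaffinity`. Which CPU numbers correspond to P-cores is not determined here; the list is an input you would have to obtain yourself, and the function names are made up for this example.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling thread to the CPUs listed in `cpus`.
   Returns 0 on success, -1 on failure (for example, if none of the
   listed CPUs is available to this process). */
static int restrict_thread_to(const int *cpus, int count)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < count; i++)
        if (cpus[i] >= 0 && cpus[i] < CPU_SETSIZE)
            CPU_SET(cpus[i], &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Threads created afterwards inherit the restricted mask, so calling this early in `main` affects the whole program. Keep the caveat above in mind: a thread restricted to busy P-cores waits instead of running on an idle E-core.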
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock, shorter pipeline, less aggressive speculative execution, ...

Can I take advantage of multi-core in a multi-threaded application that I develop

If I am writing a multi-threaded C application on Linux (using pthreads), can I take advantage of a multi-core processor?
I mean, what should an application programmer do to take advantage of a multi-core processor? Or does the OS alone do so with its scheduling algorithms?
You don't need to do anything. Create as many threads as you want and the OS will schedule them, together with the threads from all other processes, across all available cores.
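A minimal sketch of that with pthreads: the program only creates and joins threads, and placement on cores is left entirely to the scheduler. The `slice` struct, the function names, and the cap of 64 workers are choices made for this example.

```c
#include <pthread.h>
#include <stddef.h>

enum { MAX_WORKERS = 64 };

struct slice {
    const int *data;
    size_t count;
    long sum;          /* written only by the worker that owns this slice */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long total = 0;
    for (size_t i = 0; i < s->count; i++)
        total += s->data[i];
    s->sum = total;
    return NULL;
}

/* Sum `data` using up to `workers` threads; the OS decides which core
   runs each one. Returns 0 on success. Returns -1 if a thread could not
   be started, in which case *out holds only a partial sum. */
static int parallel_sum(const int *data, size_t n, int workers, long *out)
{
    pthread_t tid[MAX_WORKERS];
    struct slice part[MAX_WORKERS];
    if (workers < 1) workers = 1;
    if (workers > MAX_WORKERS) workers = MAX_WORKERS;

    size_t chunk = n / (size_t)workers, start = 0;
    int started = 0, rc = 0;
    for (int i = 0; i < workers; i++) {
        /* The last worker also takes the remainder. */
        size_t len = (i == workers - 1) ? n - start : chunk;
        part[i] = (struct slice){ data + start, len, 0 };
        start += len;
        if (pthread_create(&tid[i], NULL, sum_slice, &part[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += part[i].sum;
    }
    *out = total;
    return rc;
}
```

Build with `-pthread`. Note that nothing in the code names a core: on a multi-core machine the workers can run in parallel, and on a single core they are simply time-sliced.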
"Take advantage of multi-core" could be understood to mean "utilize multi-core."
Or it could mean "gaining a qualitative advantage from the utilization of multi-core."
Anyone can do the former. They often end up with software that runs slower than if it were single-threaded.
The latter is an entirely different proposition. It requires writing the software so that it economizes on resources shared by all cores (bus locking, RAM, and the L3 cache) and does as much computing as possible within the individual cores and their L1 caches. The L2 cache is usually shared by two cores, so it falls somewhere in between: it is a shared resource, but it is shared by just two cores and is much faster than the resources shared by all cores.
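One common way to apply this at the code level is to let each thread accumulate into a local variable and publish its result once, into a slot that sits on its own cache line. The sketch below assumes 64-byte cache lines and a fixed count of four threads; both are illustrative choices, as are the names.

```c
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>

enum { CACHE_LINE = 64, NTHREADS = 4 };   /* 64-byte lines assumed */

/* One slot per thread, aligned so each occupies its own cache line and
   a write by one thread does not invalidate a neighbour's line. */
struct padded_count {
    alignas(CACHE_LINE) unsigned long value;
};

struct job {
    const unsigned char *data;
    size_t count;
    struct padded_count *slot;
};

static void *count_nonzero(void *arg)
{
    struct job *j = arg;
    unsigned long local = 0;               /* stays in a register or L1 */
    for (size_t i = 0; i < j->count; i++)
        local += (j->data[i] != 0);
    j->slot->value = local;                /* single write to shared memory */
    return NULL;
}

static unsigned long count_nonzero_parallel(const unsigned char *data, size_t n)
{
    struct padded_count slots[NTHREADS];
    struct job jobs[NTHREADS];
    pthread_t tid[NTHREADS];
    int spawned[NTHREADS];
    size_t chunk = n / NTHREADS, start = 0;

    for (int i = 0; i < NTHREADS; i++) {
        size_t len = (i == NTHREADS - 1) ? n - start : chunk;
        slots[i].value = 0;
        jobs[i] = (struct job){ data + start, len, &slots[i] };
        start += len;
        spawned[i] = pthread_create(&tid[i], NULL, count_nonzero, &jobs[i]) == 0;
        if (!spawned[i])
            count_nonzero(&jobs[i]);       /* fall back to running inline */
    }

    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        if (spawned[i])
            pthread_join(tid[i], NULL);
        total += slots[i].value;
    }
    return total;
}
```

The contrast is with a design where every thread increments one shared counter inside the loop: that version also uses all cores, but each increment bounces a cache line between them, which is the kind of shared-resource traffic described above.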
This is at the implementation level, writing and testing the code.
The decisions made at earlier stages - specifically the system's software architecture phase - are usually much more important to the system's long-term quality and performance.

Hyperthreading intel processors and C

If I don't use multithreading when designing my code, will hyperthreading split the load automagically over the logical cores, or would my code have to be specifically written to take advantage of the other cores, as it would for physical cores?
At the suggestion of @us2012, I am posting this here from my comment...
There is no such magic. Superscalar CPUs, especially OOO (Out Of Order execution) processors do magic - but that is inside one core.
On the contrary, Hyperthreading can be thought of as (very simplified) two pipelines in front of one complete core.
AMD Bulldozer CPUs do something similar but go a step further: the integer core is also split in two, and the two pipelines and integer cores share one floating-point unit. The whole unit is called a "module" and runs two threads.
TL;DR
Superscalar (from the Wiki)
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
Out of order execution (from the Wiki)
In computer engineering, out-of-order execution (OoOE or OOE) is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, processing instead the next instructions which are able to run immediately.
Hyperthreading (from... you know where...)
Hyper-threading (officially Hyper-Threading Technology or HT Technology, abbreviated HTT or HT) is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on PC microprocessors. It first appeared in February 2002 on Xeon server processors and in November 2002 on Pentium 4 desktop CPUs. Later, Intel included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.
Bulldozer (not from the wiki)
Bulldozer is the first major redesign of AMD’s processor architecture since 2003, when the firm launched its K8 processors, and also features two 128-bit FMA-capable FPUs which can be combined into one 256-bit FPU. This design is accompanied by two integer clusters, each with 4 pipelines (the fetch/decode stage is shared). Bulldozer will also introduce shared L2 cache in the new architecture. AMD's marketing service calls this design a "Module". A 16-thread processor design would feature eight of these "modules", but the operating system will recognize each "module" as two logical cores.
