Prevent a CPU core from using the LLC (last-level cache) - ARM

I have the following problem: I have a low-latency application running on core 0 and a regular application running on core 1. I want to make sure the core 0 app gets as much cache as possible; therefore, I want core 1 to bypass the L3 cache (not use it at all) and go directly to memory for its data.
Are there any other ways I can ensure that the core 0 app gets priority in using the L3 cache?

Some Intel CPUs support partitioning the L3 cache between different workloads or VMs: Cache Allocation Technology (CAT). It's been supported since Haswell Xeon (v3), and apparently on 11th-gen desktop/laptop CPUs as well.
Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.
More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux and somewhat for FreeBSD) for managing that and other parts of what Intel's now calling "Resource Director Technology (RDT)" for monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
What you describe would be literally impossible on a desktop Intel CPU (or a Xeon before Skylake), as they use an inclusive L3 cache: a line can only be in L2/L1 if it's in L3 (at least the tags, not the data if a core has it in Modified or Exclusive state). Skylake-X and later Xeons have a non-inclusive L3, so it would be possible in theory; IDK if CAT lets you give one set of cores zero L3.
I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.
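On Linux, Intel CAT is exposed through the kernel's resctrl filesystem, so you can also drive it without the intel-cmt-cat tools. A minimal C sketch of the idea (assuming a CAT-capable CPU, root privileges, resctrl already mounted at /sys/fs/resctrl, and a single L3 cache domain; the bitmasks are made-up examples, check info/L3/cbm_mask for what your hardware accepts):

```c
/* Hedged sketch: give a low-latency process most of the L3 via Linux resctrl.
 * Assumes CAT-capable hardware, root, resctrl mounted at /sys/fs/resctrl, and
 * a single L3 domain (id 0). The masks are examples; bits must be contiguous
 * and the valid width is in /sys/fs/resctrl/info/L3/cbm_mask. Error handling
 * is deliberately minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    if (fputs(text, f) == EOF || fclose(f) != 0) { perror(path); exit(1); }
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid-of-low-latency-app>\n", argv[0]);
        return 1;
    }

    /* New resource group for the latency-sensitive task (EEXIST is fine). */
    mkdir("/sys/fs/resctrl/lowlat", 0755);

    /* Give the group most of the L3 ways ... */
    write_file("/sys/fs/resctrl/lowlat/schemata", "L3:0=0ff0\n");
    /* ... and shrink the default group (everything else) to a small share. */
    write_file("/sys/fs/resctrl/schemata", "L3:0=000f\n");

    /* Move the low-latency process into the new group. */
    write_file("/sys/fs/resctrl/lowlat/tasks", argv[1]);
    return 0;
}
```

ARM's rough equivalent is MPAM (memory system resource partitioning, introduced with Armv8.4-A server parts), but whether you can actually use it depends on your silicon and kernel.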


Do efficiency cores support the same instructions as performance cores?

When writing a program that requires high computational performance, it is often necessary to use multiple threads, SIMD vectorization, or other extensions. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume all threads will support the same instruction set? If not, then does this break programs that assume they do all support the same instructions or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only tried to find information on the internet, but found nothing of sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, but also for big.LITTLE ARM).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel nerfed the extensions that already existed in the performance cores (primarily AVX-512) down to the feature set of the efficiency cores.
Sadly, supporting heterogeneous CPUs isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case; yet failing to provide a way to re-enable the disabled extensions (for an OS that does support heterogeneous CPUs) prevents an OS from even trying to support them in the future, essentially turning a temporary solution into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to (even though it's not a strict need), but then you will also need to pin the thread to a specific CPU or group of CPUs.
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. Typically it just makes things worse though (better to run on an E-core than to not run at all because the P-cores are already busy).
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock, shorter pipeline, less aggressive speculative execution, ...
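For example (Linux, GCC/Clang on x86): do the feature check once at startup and, if you really want, pin a worker thread to a chosen set of cores. Which CPU numbers correspond to P-cores is machine-specific, so the 0-7 used below is purely an assumption; check lscpu or sysfs on your own machine.

```c
/* Sketch: detect a feature once, then optionally pin a worker thread.
 * The CPU numbers 0-7 are assumed to be P-cores; verify on your machine. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... hot loop using the code path chosen at startup ... */
    return NULL;
}

int main(void)
{
    /* One feature query at startup is enough: all cores expose the same ISA. */
    int have_avx2 = __builtin_cpu_supports("avx2");
    printf("AVX2 available: %d\n", have_avx2);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* Optionally restrict the thread to CPUs 0-7 (assumed P-cores here). */
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(t, sizeof set, &set) != 0)
        fprintf(stderr, "setaffinity failed (fewer CPUs than assumed?)\n");

    pthread_join(t, NULL);
    return 0;
}
```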

Can I take advantage of multi-core in a multi-threaded application that I develop?

If I am writing a multi-threaded C application on Linux (using pthreads), can I take advantage of a multi-core processor?
I mean, what should an application programmer do to take advantage of a multi-core processor? Or does the OS alone do so with its various scheduling algorithms?
You don't need to do anything. Create as many threads as you want and the OS will schedule them, together with the threads from all the other processes, across all available cores.
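For example, a minimal pthreads program needs nothing core-specific at all; the kernel's scheduler spreads the runnable threads across the available cores on its own:

```c
/* Minimal pthreads example: the kernel schedules these threads over the
 * available cores by itself; no core-specific code is needed. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static void *work(void *arg)
{
    long id = (long)arg;
    /* ... CPU-bound work for this thread ... */
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```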
"Take advantage of multi-core" could be understood to mean "utilize multi-core."
Or it could mean "gaining a qualitative advantage from the utilization of multi-core."
Anyone can do the former. They often end up with software that runs slower than if it were single-threaded.
The latter is an entirely different proposition. It requires writing the software so that use of the computing resources shared by all cores (bus locking, RAM, and the L3 cache) is economized, and so that as much of the computation as possible happens within the individual cores and their L1 caches. The L2 cache is usually shared by two cores, so it falls somewhere in between the two categories: yes, it is a shared resource, but it is shared by just two cores and it is much faster than the resources shared by all cores.
This is at the implementation level, writing and testing the code.
The decisions made at earlier stages - specifically the system's software architecture phase - are usually much more important to the system's long-term quality and performance.
Some posts: 1 2 3. There are many more.

Is the shared L2 cache in multicore processors multiported? [duplicate]

The Intel Core i7 has per-core L1 and L2 caches, and a large shared L3 cache. I need to know what kind of interconnect connects the multiple L2s to the single L3. I am a student, and need to write a rough behavioral model of the cache subsystem.
Is it a crossbar? A single bus? A ring? The references I came across mention structural details of the caches, but none of them mention what kind of on-chip interconnect exists.
Thanks,
-neha
Modern i7s use a ring. From Tom's Hardware:
Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who explained that he'd seen sustained bandwidth close to 300 GB/s from the Xeon 7500-series' LLC, enabled by the ring bus. Additionally, Intel confirmed at IDF that every one of its products currently in development employs the ring bus.
Your model will be very rough, but you may be able to glean more detail from the public documentation of the i7 performance counters pertaining to the L3.

How to selectively store a variable in memory segments using C?

We know that there are levels in the memory hierarchy: cache, primary storage, and secondary storage.
Can we use a C program to selectively store a variable in a specific level of the memory hierarchy?
Reading your comments in the other answers, I would like to add a few things.
Running under an operating system, you cannot choose which level of the memory hierarchy your variables will be stored in, since the operating system is the one controlling the metal and it forces you to play by its rules.
Despite this, you can do something that MAY get you close to measuring access times in the cache (mostly L1, depending on your test algorithm) and in RAM.
To test cache access: warm up by accessing a variable a few times, then access the same variable a great many times (the cache is very fast, so you need a lot of accesses to measure the access time).
To test main memory (aka RAM): disable the cache in the BIOS and run your code.
To test secondary memory (aka disk): disable the disk cache for a given file (you can ask your operating system to do this; search for how) and read some data from the disk, always from the same position. This may or may not work depending on how far your OS lets you disable the disk cache.
To test other levels of memory, you would have to implement your own "test operating system", and even then it may not be possible to disable some caching mechanisms due to hardware limitations (well, not actually limitations...).
Hope I helped.
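For what it's worth, here is a rough sketch of the "warm up, then time a lot of accesses" idea using clock_gettime. The numbers are only indicative, and the volatile is there so the compiler doesn't optimise the accesses away:

```c
/* Rough sketch of timing repeated, cache-resident accesses.
 * Results are indicative only; they include loop overhead and depend on the
 * compiler and CPU. */
#include <stdio.h>
#include <time.h>

#define N 100000000UL

int main(void)
{
    volatile long x = 0;
    struct timespec t0, t1;

    /* Warm-up: make sure x is resident in L1. */
    for (long i = 0; i < 1000; i++)
        x += i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < N; i++)
        x += 1;                          /* repeated cached access */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.2f ns per access (cache-resident)\n", ns / N);
    return 0;
}
```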
Not really. Cache is designed to work transparently. Almost anything you do will end up in cache, because it's being operated on at the moment.
As for secondary storage, I assume you mean HDD, file, cloud, and so on. Nothing really ever gets stored there unless you do so explicitly, or set up a memory mapped region, or something gets paged to disk.
No. A normal computer program only has access to main memory. Even secondary storage (disk) is usually only available via operating system services, typically used via the stdio part of the library. There's basically no direct way to inspect or control the hierarchical caches closer to the CPU.
That said, there are cache profilers (like Valgrind's cachegrind tool) which give you an idea about how well your program uses a given type of cache architecture (and, to some extent, branch prediction), and which can be very useful to help you spot code paths with poor cache performance. They do this essentially by emulating the hardware.
There may also be architecture-specific instructions that give you some control over caching (such as "nontemporal hints" on x86, or "prefetch" instructions), but those are rare and peculiar, and not commonly exposed to C program code.
It depends on the specific architecture and compiler that you're using.
For example, on x86/x64, most compilers provide various levels of prefetch instructions which hint to the CPU that a cache line should be moved to a specific level in the cache from DRAM (or from higher-order caches, e.g. from L3 to L2).
On some CPUs, non-temporal prefetch instructions are available that, when combined with non-temporal read instructions, allow you to bypass the caches and read directly into a register. (Some CPUs implement non-temporals by forcing the data into a specific way of the cache, though, so you have to read the docs.)
In general though, between L1 and L2 (or L3, or L4, or DRAM) it's a bit of a black box. You can't specifically store a value in one of these - some caches are inclusive of each other (so if a value is in L1, it's also in L2 and L3), some are not. And the caches are designed to periodically drain - so if a write goes to L1, eventually it works its way out to L2, then L3, then DRAM - especially in multi-core architectures with strong memory models.
You can bypass them entirely on write (use streaming stores, or mark the memory as write-combining).
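A hedged sketch of those two hints on x86 with GCC/Clang (SSE2 intrinsics; whether a prefetched line really ends up in a particular level, and whether the streaming store really bypasses the caches, is ultimately up to the CPU):

```c
/* Sketch of prefetch and non-temporal (streaming) store hints on x86.
 * The prefetch distance (16 ints, one cache line ahead) is a made-up number. */
#include <stddef.h>
#include <immintrin.h>

void copy_with_hints(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&src[i + 16], _MM_HINT_T0); /* hint: pull toward the core */
        _mm_stream_si32(&dst[i], src[i]);   /* non-temporal store: write around the caches */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```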
You can measure the different access times by:
Using memory-mapped files as a backing store for your data (this measures the time for a first access to reach the CPU - just wrap it in a timer call, such as QueryPerformanceCounter or __rdtscp). Be sure to unmap and close the file between each test, and turn off any caching. It'll take a while.
Flushing the cache between accesses to get the time-to-access from DRAM (a rough sketch follows below).
It's harder to measure the difference between different cache levels, but if your architecture supports it, you can prefetch into a cache level, spin in a loop for some period of time, and then attempt the access.
It's going to be hard to do this on a system using a commercial OS, because they tend to do a LOT of work all the time, which will disturb your measurements.
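If you still want to try, here is a rough sketch of the "flush, then time one access" approach on x86 with GCC/Clang; the numbers will be noisy for exactly the reasons above:

```c
/* Sketch: _mm_clflush evicts the line, so the next load pays roughly the DRAM
 * latency; without the flush the load is (usually) an L1 hit. Expect noise. */
#include <stdio.h>
#include <x86intrin.h>

static unsigned long long time_one_load(volatile int *p, int flush)
{
    unsigned int aux;
    if (flush) {
        _mm_clflush((const void *)p);   /* evict the line from all cache levels */
        _mm_mfence();                   /* make sure the flush has completed */
    }
    unsigned long long t0 = __rdtscp(&aux);
    (void)*p;                           /* the timed load */
    unsigned long long t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    static volatile int x = 42;
    (void)time_one_load(&x, 0);         /* warm up */
    printf("cached:  %llu cycles\n", time_one_load(&x, 0));
    printf("flushed: %llu cycles\n", time_one_load(&x, 1));
    return 0;
}
```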

Is it possible to bypass L1 cache in a multicore processor

In modern multicore processors, we normally have a local L1 cache but a shared L2 cache. Is it possible to bypass the L1 cache for some portion of the memory while still using the L2 cache for it? I want to do this to improve timing predictability, even if it comes at the cost of performance.
As far as I know, there is no way to bypass the L1 cache on mainstream CPUs.
However, to achieve your goal (i.e. avoid cache misses that may cause variation in timing measurements), you may try to ask your compiler to prefetch the data into the cache.
If you use GCC or LLVM, see __builtin_prefetch.
However, your question is quite vague, and I am unsure that your proposal will suit your needs.
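For what it's worth, a small sketch of what using __builtin_prefetch looks like; the prefetch distance of 64 elements is a made-up number that you would have to tune for your access pattern and hardware:

```c
/* Sketch of __builtin_prefetch (GCC/Clang). The second argument is read (0)
 * vs write (1), the third is a temporal-locality hint from 0 to 3; both are
 * only hints and the hardware is free to ignore them. */
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], 0, 3);  /* pull a future element toward the core */
        s += a[i];
    }
    return s;
}
```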
Caches
I strongly suspect that you have misunderstood what a cache does and what it is for.
Caches are transparent from the point of view of memory contents. If one core writes to a memory location, then every other core whose caches (L1, L2, L3, etc.), shared or not, happen to be caching that location will get updated too.
Note that this does not mean the cores can't race for the value. You can still have a race condition whereby one core, reading a location fractionally before another writes it, 'gets the wrong value'. Furthermore, that will happen whether or not your CPU has caches of any sort. To solve that 'ordering' problem you have to use semaphores or other IPC primitives in your source code.
Some cache systems do allow you to 'drop hints' to them. Matthieu Rouget gave an example of that with __builtin_prefetch. These sorts of things allow the programmer to tell the cache system that it might well be worth getting some data in advance. Some systems (e.g. PowerPC 7450) sort of allowed the programmer to use part of the cache as memory instead of cache, kind of the ultimate in programmer cache control.
However, none of these things make any difference to the view of memory that all the caches have. If one cache's contents get updated, the rest are also updated.
Caches and Performance Programming
The very best programmers are able to extract peak performance from a CPU by coding around the behaviour of the cache. In that realm one generally finds oneself wishing that the cache wasn't there at all. The ultimate embodiment of this is the Cell processor in the PS3. The maths cores on that have no cache at all. Instead you have to in effect do all your own data fetching and write back yourself in your source code, rather than leave it up to some cache to second guess what data your program is going to ask for. Get it right and the performance is still blisteringly good.
Bus Snooping
Some CPUs don't have cache bus snooping, which can be a particular problem when writing device drivers. Bus snooping is a mechanism whereby the CPU caches spot the content of memory being updated by something other than the CPU cores (e.g. by a DMA controller reading data from a device). And the same the other way round - DMAs from memory get values currently stuck in cache. AFAIK almost all CPUs these days do bus snooping, so that is not likely to be a problem.
On systems with an I/O address space as well as a memory address space (e.g. Intel), I don't think the I/O address space is cached anyway. For systems with memory-mapped devices, the device memory is generally not cached either, and the OS sets up the CPU that way (see this).
Timing Predictability
To return to the reason for your question - timing predictability. You may be using the wrong technology. If your system has timing constraints where the problem is variation in main-memory write times, then frankly using a multicore CPU sounds like the wrong thing in the first place. @Griwes is quite right on that point (and indeed the entire comment). You'll more likely need to resort to a pure hardware design, something along the lines of an FPGA (no comments about whether firmware is really software, please!).
If, as I suspect, you're actually trying to avoid using semaphores and other IPC primitives to synchronise two threads in your system then you're not going to succeed, shared caches or not. You need to use semaphores and such to make your code work properly.
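For example, a minimal sketch of ordering a write between two threads with a POSIX semaphore, rather than relying on any assumption about cache behaviour:

```c
/* Tiny illustration: use a synchronisation primitive (here a POSIX semaphore)
 * to order accesses between threads; the caches take care of themselves. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static int shared_value;
static sem_t ready;

static void *producer(void *arg)
{
    (void)arg;
    shared_value = 42;      /* write ... */
    sem_post(&ready);       /* ... then publish */
    return NULL;
}

int main(void)
{
    sem_init(&ready, 0, 0);
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    sem_wait(&ready);       /* consumer waits; no assumptions about caches */
    printf("%d\n", shared_value);

    pthread_join(t, NULL);
    sem_destroy(&ready);
    return 0;
}
```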
