How to disable L3 cache on guest machine in QEMU - arm

On an arm64 host, a difference in L3 cache size between machines causes migration to fail for me. So how can I disable the L3 cache on the guest machine?

There's no way to do that. For migration to work in an AArch64 KVM setup, both machines must be exactly identical, including things like the cache size ID registers. If your two machines have different L3 cache sizes, then migration between them will not work.
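If you just want to confirm what each host will expose to the guest, the Linux cacheinfo sysfs interface shows the cache geometry per level. This is only a host-side sanity check (a rough sketch, assuming a Linux host with /sys/devices/system/cpu/cpu0/cache/ populated); it doesn't change anything about the guest:

/* Print the cache hierarchy Linux reports for CPU 0 (sysfs "cacheinfo").
 * Run it on both hosts and compare; a mismatch in the L3 line is the kind
 * of difference that shows up in the guest's cache ID registers.
 */
#include <stdio.h>

int main(void)
{
    for (int idx = 0; idx < 16; idx++) {
        char path[128], level[16] = "", type[32] = "", size[32] = "";
        FILE *f;

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
        f = fopen(path, "r");
        if (!f)
            break;                      /* no more cache levels */
        fscanf(f, "%15s", level);
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/type", idx);
        if ((f = fopen(path, "r"))) { fscanf(f, "%31s", type); fclose(f); }

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", idx);
        if ((f = fopen(path, "r"))) { fscanf(f, "%31s", size); fclose(f); }

        printf("L%-2s %-12s %s\n", level, type, size);
    }
    return 0;
}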

Related

Prevent a CPU core from using the LL cache

I have the following problem: I have a low-latency application running on core 0, and a regular application running on core 1. I want to make sure that the core 0 app gets as much cache as possible; therefore, I want to make core 1 bypass the L3 cache (not use it at all) and go directly to memory for its data.
Are there any other ways I can ensure that the core 0 app gets priority in using the L3 cache?
Some Intel CPUs support partitioning the L3 cache between different workloads or VMs with Cache Allocation Technology (CAT). It has been supported since Haswell Xeon (v3), and apparently also on 11th-gen desktop/laptop CPUs.
Presumably you need to let each workload have some L3, probably even on Skylake-Xeon and later where L3 is non-inclusive, but you might be able to give it a pretty small share and still achieve your goal.
More generally, https://github.com/intel/intel-cmt-cat has tools (for Linux, and to some extent FreeBSD) for managing that and other parts of what Intel now calls "Resource Director Technology (RDT)": monitoring, CAT, and Memory Bandwidth Allocation. It also has a table of features by CPU.
What you describe would be literally impossible on a desktop Intel CPU (or a Xeon before Skylake), as they use an inclusive L3 cache: a line can only be in L2/L1 if it's in L3 (at least the tags, not the data if a core has it in Modified or Exclusive state). Skylake-X and later Xeons have a non-inclusive L3, so it would be possible in theory; IDK if CAT lets you give one set of cores zero L3.
I don't know if any AMD or ARM CPUs have something similar. I just happen to know of the existence of Intel's hardware support for this, not something I've ever gone looking for or used myself.
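If your CPU does have CAT, the usual Linux interface is the resctrl filesystem rather than raw MSRs. The following is only a hedged sketch: the group name "lowprio" and the mask 0x3 are made up, the schemata line assumes a single L3 domain (a multi-socket box needs every domain listed), and it requires resctrl to be mounted at /sys/fs/resctrl:

/* Sketch: restrict core 1 to a small slice of L3 via Linux resctrl (Intel CAT).
 * Assumes a kernel with resctrl support, mounted at /sys/fs/resctrl, and an
 * L3-CAT-capable CPU.  The group name "lowprio" and the mask 0x3 are
 * illustrative; check /sys/fs/resctrl/info/L3/cbm_mask for what your CPU allows.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f || fputs(text, f) == EOF) {
        perror(path);
        exit(1);
    }
    fclose(f);
}

int main(void)
{
    /* Create a resource group for the cache-unfriendly workload. */
    if (mkdir("/sys/fs/resctrl/lowprio", 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* On L3 cache domain 0, allow only the two ways covered by mask 0x3. */
    write_file("/sys/fs/resctrl/lowprio/schemata", "L3:0=3\n");

    /* Anything scheduled on core 1 now allocates only into those ways. */
    write_file("/sys/fs/resctrl/lowprio/cpus_list", "1\n");

    return 0;
}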

Store read/write data in assembly similar to filesystem

I am creating a custom operating system. Is there any way to store data (almost like a filesystem) in assembly, so that if the computer shuts off and turns back on the data will still be there?
You can write device drivers for SATA hard drives, USB mass storage, floppy disks, NVMe flash, or whatever else in asm. You might also be able to use BIOS functions to access them (especially if you're on x86). But then you have to manage writes in chunks of 512B or 4096B, because those kinds of storage are block-based.
A more literal interpretation of the question has an interesting answer: can a store instruction like mov [mem], eax put data into persistent storage where a load instruction can get it later (after a power cycle)?
Yes, if your hardware has some memory-mapped non-volatile RAM. (Physically memory-mapped NVRAM like an NVDIMM, not mmap() logically mapping a file into a process's virtual address space.) See this answer on Superuser about Intel Optane DC Persistent Memory.
x86, for example, has recently gained more instructions to support NVRAM, like clwb to write back a cache line (all the way to memory) without necessarily evicting it. Early implementations may just run clwb like clflushopt, though: @Ana reports that Skylake-X does evict.
Also, clflushopt is a more efficient way to force multiple cache lines back to memory. Use a memory barrier like sfence after the weakly-ordered clflushopt flushes to make sure the data is in non-volatile RAM before any further writes appear.
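As a rough illustration of the clwb + sfence pattern (a sketch, assuming a compiler that provides the _mm_clwb intrinsic and hardware where the mapping really is persistent; on plain DRAM it compiles but persists nothing):

/* Sketch: make a single store durable on memory-mapped NVRAM (x86 only).
 * `p` is assumed to point into real persistent memory (e.g. an NVDIMM mapped
 * with DAX); on ordinary DRAM this runs but persists nothing.
 * Build with something like: gcc -O2 -mclwb -mclflushopt
 */
#include <immintrin.h>
#include <stdint.h>

void persist_store(uint64_t *p, uint64_t value)
{
    *p = value;        /* ordinary store: lands in the cache first            */
    _mm_clwb(p);       /* write the line back without (necessarily) evicting,
                          so a following re-read can still hit in cache       */
    _mm_sfence();      /* order the write-back before any later stores that
                          must not become persistent ahead of this one        */
}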
For a while Intel was going to require pcommit as part of making sure data had hit non-volatile storage, but decided against it. With that in mind, see Why Intel added the CLWB and PCOMMIT instructions for more details about using persistent RAM.
IDK what the situation is on architectures other than x86, but presumably NV RAM is / will be usable with ARM and other CPUs, too.

Is the shared L2 cache in multicore processors multiported? [duplicate]

The Intel Core i7 has per-core L1 and L2 caches, and a large shared L3 cache. I need to know what kind of interconnect connects the multiple L2s to the single L3. I am a student, and need to write a rough behavioral model of the cache subsystem.
Is it a crossbar? A single bus? A ring? The references I came across mention structural details of the caches, but none of them mention what kind of on-chip interconnect exists.
Thanks,
-neha
Modern i7s use a ring. From Tom's Hardware:
Earlier this year, I had the chance to talk to Sailesh Kottapalli, a senior principal engineer at Intel, who explained that he’d seen sustained bandwidth close to 300 GB/s from the Xeon 7500-series’ LLC, enabled by the ring bus. Additionally, Intel confirmed at IDF that every one of its products currently in development employs the ring bus.
Your model will be very rough, but you may be able to glean more information from public information on i7 performance counters pertaining to the L3.

How to selectively store a variable in memory segments using C?

We know that there are levels in the memory hierarchy:
cache, primary storage, and secondary storage.
Can we use a C program to selectively store a variable in a specific level of the memory hierarchy?
Reading your comments in the other answers, I would like to add a few things.
Running under an operating system, you cannot control in which level of the memory hierarchy your variables will be stored: the operating system controls the hardware and forces you to play by its rules.
Despite this, you can do something that MAY get you close to measuring access times in the cache (mostly L1, depending on your test algorithm) and in RAM.
To test cache access: warm up by accessing a variable a few times, then access that same variable a huge number of times (the cache is so fast that you need a lot of accesses to measure the access time); a rough timing sketch follows after these tests.
To test the main memory (aka RAM): disable the cache memory in the BIOS and run your code.
To test secondary memory (aka disk): disable the disk cache for a given file (your operating system provides a way to do this), and start reading some data from the disk, always from the same position. Whether this works depends on how much of the disk caching your OS actually lets you disable.
To test other levels of memory, you must implement your own "Test Operating System", and even with that it may not be possible to disable some caching mechanisms due to hardware limitations (well, not actually limitations...).
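A rough sketch of the first two ideas (timer resolution, compiler optimizations and the scheduler all add noise, so treat the numbers as indicative only; instead of disabling the cache in the BIOS, the second loop simply walks a buffer much larger than the last-level cache so that most accesses miss anyway):

/* Rough sketch: time many accesses to a single variable (which stays in L1)
 * versus a pseudo-random walk over a buffer far larger than the last-level
 * cache (mostly served from RAM).  Sizes and iteration counts are arbitrary.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BIG        (64UL * 1024 * 1024)   /* 64 MiB: bigger than a typical LLC */
#define HOT_ITERS  100000000L
#define COLD_ITERS 1000000L

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    volatile long hot = 0;
    unsigned char *buf = malloc(BIG);
    double t0, t1;

    if (!buf)
        return 1;
    memset(buf, 1, BIG);                    /* fault the pages in first */

    /* Warm up, then hammer the same variable: should stay in L1. */
    for (long i = 0; i < HOT_ITERS; i++) hot += i;
    t0 = now_ns();
    for (long i = 0; i < HOT_ITERS; i++) hot += i;
    t1 = now_ns();
    printf("hot variable : %6.2f ns/access\n", (t1 - t0) / HOT_ITERS);

    /* Pseudo-random walk: defeats the prefetcher, working set >> LLC. */
    size_t idx = 1;
    long sink = 0;
    t0 = now_ns();
    for (long i = 0; i < COLD_ITERS; i++) {
        idx = (idx * 1103515245u + 12345u) % BIG;
        sink += buf[idx];
    }
    t1 = now_ns();
    printf("big buffer   : %6.2f ns/access (sink=%ld)\n",
           (t1 - t0) / COLD_ITERS, sink);

    free(buf);
    return 0;
}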
Hope I helped.
Not really. Cache is designed to work transparently. Almost anything you do will end up in cache, because it's being operated on at the moment.
As for secondary storage, I assume you mean HDD, file, cloud, and so on. Nothing really ever gets stored there unless you do so explicitly, or set up a memory mapped region, or something gets paged to disk.
No. A normal computer program only has access to main memory. Even secondary storage (disk) is usually only available via operating system services, typically used via the stdio part of the library. There's basically no direct way to inspect or control the hierarchical caches closer to the CPU.
That said, there are cache profilers (like Valgrind's cachegrind tool) which give you an idea of how well your program uses a given cache architecture (and, optionally, its branch predictors), and which can be very useful for spotting code paths with poor cache performance. They do this essentially by simulating the hardware.
There may also be architecture-specific instructions that give you some control over caching (such as "nontemporal hints" on x86, or "prefetch" instructions), but those are rare and peculiar, and not commonly exposed to C program code.
It depends on the specific architecture and compiler that you're using.
For example, on x86/x64, most compilers provide prefetch instructions with a variety of hint levels, which suggest to the CPU that a cache line should be moved into a specific cache level from DRAM (or from a higher-order cache, e.g. from L3 into L2).
On some CPUs, non-temporal prefetch instructions are available which, combined with non-temporal read instructions, let you bypass the caches and read directly into a register. (Some CPUs implement non-temporal accesses by forcing the data into one specific way of the cache, though, so read the docs.)
In general, though, between L1 and L2 (or L3, or L4, or DRAM) it's a bit of a black hole. You can't specifically store a value in one of these levels - some caches are inclusive of each other (so if a value is in L1, it's also in L2 and L3), some are not. And the caches are designed to drain over time - if a write goes into L1, it eventually works its way out to L2, then L3, then DRAM - especially in multi-core architectures with strong memory models.
You can bypass them entirely on writes (use streaming stores, or mark the memory as write-combining).
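To make those hints concrete, here is a small sketch using the common x86 intrinsics (GCC/Clang/MSVC, assuming SSE2 support; the function and its arguments are purely illustrative):

/* Sketch of the cache hints discussed above, using the common x86 intrinsics.
 * These are hints and bypasses, not guarantees: the CPU is free to ignore
 * or reinterpret them.
 */
#include <immintrin.h>

void hint_demo(const char *src, int *dst, int value)
{
    _mm_prefetch(src, _MM_HINT_T0);    /* pull the line towards L1         */
    _mm_prefetch(src, _MM_HINT_T1);    /* ... towards L2                   */
    _mm_prefetch(src, _MM_HINT_T2);    /* ... towards L3                   */
    _mm_prefetch(src, _MM_HINT_NTA);   /* non-temporal: minimize pollution */

    /* Streaming (non-temporal) store: goes out through write-combining
     * buffers instead of being allocated into the cache hierarchy.
     * Follow a batch of these with sfence before other threads rely on
     * seeing the data in order. */
    _mm_stream_si32(dst, value);
    _mm_sfence();
}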
You can measure the different access times by:
Using memory-mapped files as a backing store for your data (this measures the time for a first access to reach the CPU - just wrap it in a timer call, such as QueryPerformanceCounter or __rdtscp). Be sure to unmap and close the file between tests, and turn off any caching. It'll take a while.
Flushing the cache between accesses to get the time-to-access from DRAM.
It's harder to measure the difference between different cache levels, but if your architecture supports it, you can prefetch into a cache level, spin in a loop for some period of time, and then attempt it.
It's going to be hard to do this on a system using a commercial OS, because they tend to do a LOT of work all the time, which will disturb your measurements.
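Here is a rough sketch of the flush-between-accesses idea from the list above, x86-specific and deliberately simplistic: __rdtscp is only partially serializing, so a real measurement would add fences and average over many repetitions:

/* Sketch: time one access served from DRAM vs. one served from cache.
 * x86 only; results are in TSC ticks and very noisy (frequency scaling,
 * interrupts and the OS all interfere).
 */
#include <stdio.h>
#include <x86intrin.h>

static volatile int probe;

static unsigned long long time_one_access(void)
{
    unsigned int aux;
    unsigned long long start, stop;
    int tmp;

    start = __rdtscp(&aux);   /* rdtscp waits for earlier instructions...   */
    tmp = probe;              /* the load being measured                    */
    stop = __rdtscp(&aux);    /* ...but an lfence here would be stricter    */
    (void)tmp;
    return stop - start;
}

int main(void)
{
    /* Cold: flush the line so the next load has to go out to DRAM. */
    _mm_clflush((const void *)&probe);
    _mm_mfence();
    printf("flushed line : %llu ticks\n", time_one_access());

    /* Hot: the previous access pulled the line back into L1. */
    printf("cached line  : %llu ticks\n", time_one_access());
    return 0;
}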

Testing cache invalidation and flush

So, after implementing the basic L1 and L2 cache routines in the Linux kernel (arch/arm/mm/cache-X.S), say specifically for an ARM11 processor, is there a test utility/program available to check that the cache is working properly - that invalidation and flushing happen correctly? How can we ensure this instead of just relying on our own programs?
You can use the perfcounters subsystem. It's basically an abstraction over CPU performance counters, which are hardware registers recording events like cache misses, instructions executed, branch mispredictions etc. It also provides abstraction for software events (sic) such as minor/major page faults, task migrations, task context-switches and tracepoints. The perf tool can be used to monitor and verify correct cache behaviour - for instance, you can check cache flushing works correctly by filling the cache, flushing it, measuring cache misses on subsequent memory accesses and comparing it to your expected result.
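For instance, a self-test could count LLC read misses around the region it exercises, using perf_event_open(2) directly. This is a hedged sketch (Linux-only; the event encoding follows the man page, and the commented middle section is where your own fill/flush/touch sequence would go):

/* Sketch: count LLC read misses around a test region with perf_event_open(2).
 * The PERF_COUNT_HW_CACHE_LL event may be unavailable on some ARM cores,
 * in which case the syscall simply fails.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_llc_miss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* measure this process, on any CPU */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    uint64_t misses = 0;
    int fd = open_llc_miss_counter();

    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... fill the cache, call the flush/invalidate routine under test,  */
    /* ... then touch the same buffer again: the miss count should jump.  */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &misses, sizeof misses) != (ssize_t)sizeof misses)
        perror("read");
    printf("LLC read misses: %llu\n", (unsigned long long)misses);

    close(fd);
    return 0;
}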
You can have a look at LMbench, a thorough benchmark suite that can be run on almost every Linux platform (I have already used it successfully on x86, ARM9 and Cortex-A8 architectures). You'll be able to measure cache performance with it.
If the data you care about is cached in RAM (the Linux page cache), you can drop it:
sync
echo "1" > /proc/sys/vm/drop_caches
free -m
