Programming to fill our system's L1 or L2 cache - c

How can we systematically write code to load data into the L1 or L2 cache?
I am specifically trying to target filling the L1 I-cache of my system for some further analysis.
Any suggestions will do, whether in assembly or plain C. Related articles on this topic would be even more helpful.

A cache stores recently-accessed data. To fill the cache, just access the data. Or in this case, instructions. Fill a block of memory with no-op instructions (and a looping branch instruction at the end) and jump to it.
The tricky part is keeping data in there once it's loaded. You can't access anything outside the 32K (or whatever) data set as long as your benchmark is running.
I can't imagine what you get from artificially filling a cache and then keeping it filled with the same data set, but there you go.
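As an illustration of that suggestion (my sketch, not part of the original answer), the block below fills an executable mapping with NOPs and a final return, then calls it in a loop; it assumes Linux on x86-64 (0x90 = NOP, 0xC3 = RET) and a system that permits writable and executable anonymous mappings.

```c
/* Hypothetical sketch: fill a block of executable memory with NOPs ending
 * in a RET, then call it repeatedly so the I-cache is populated with it.
 * Assumes Linux on x86-64 and that W+X anonymous mappings are allowed. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BLOCK_SIZE (32 * 1024)   /* roughly the size of a typical L1 I-cache */

int main(void)
{
    unsigned char *block = mmap(NULL, BLOCK_SIZE,
                                PROT_READ | PROT_WRITE | PROT_EXEC,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(block, 0x90, BLOCK_SIZE);   /* NOP sled */
    block[BLOCK_SIZE - 1] = 0xC3;      /* RET at the very end */

    void (*run)(void) = (void (*)(void))block;
    for (int i = 0; i < 1000; i++)     /* call it in a loop to keep it resident */
        run();

    munmap(block, BLOCK_SIZE);
    return 0;
}
```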

You will need to find out the cache associativity of your CPU and the replacement policy. I can't think of a general solution to this problem that would work on all the CPUs I've worked with. Even caches advertised as fully associative with an LRU replacement policy aren't exactly that in reality and it can be very hard to figure out a pattern of memory access that completely fills the cache.
If you want this for some very specific benchmark (which is a bad idea for other reasons), I'd recommend trying to figure out how to flush the cache instead. That is actually doable.
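On x86 that can be done per buffer with CLFLUSH; a minimal sketch, assuming a 64-byte cache line (other architectures need their own cache-maintenance instructions or an eviction trick such as streaming through a buffer larger than the cache):

```c
/* Hypothetical sketch: flush a buffer out of the data cache on x86 using
 * CLFLUSH, one cache line at a time. The 64-byte line size is an assumption. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
#include <stddef.h>

#define CACHE_LINE 64    /* assumption; query the CPU if you need to be exact */

static void flush_buffer(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t off = 0; off < len; off += CACHE_LINE)
        _mm_clflush(p + off);
    _mm_mfence();        /* order the flushes before whatever comes next */
}
```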

Just last week I performed this task, filling an L1 and L2 cache for ECC purposes.
Basically, if you have (for example) a 64 KB cache in total (x number of ways, y number of cache lines, etc.), then for the data side just access that much data linearly through the cache (you might need the MMU on to enable caching): start on a 64 KB boundary and read 64 KB worth of data, ideally in cache-line-sized reads (or multiples) if possible. For the I-cache you need that many bytes' worth of instructions (NOPs, or add reg,1, or something). Remember there is probably a prefetch at the end, so you might have to back off the final return by a few instructions so that the prefetch takes you all the way to the end. That might take some practice, and if you don't have visibility into the logic (a chip sim) you might not figure it out.
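A minimal sketch of that linear read, assuming a 64 KB data cache with 64-byte lines (adjust both for your part):

```c
/* Hypothetical sketch: touch one word per cache line across a region the
 * size of the data cache, aligned to the cache size, which should pull the
 * whole region into the cache on a simple design. The volatile sink keeps
 * the compiler from optimizing the loop away. */
#include <stdint.h>

#define CACHE_SIZE (64 * 1024)
#define LINE_SIZE  64

static uint8_t region[CACHE_SIZE] __attribute__((aligned(CACHE_SIZE)));

void fill_dcache(void)
{
    volatile uint8_t sink;
    for (uint32_t off = 0; off < CACHE_SIZE; off += LINE_SIZE)
        sink = region[off];          /* one read per cache line */
    (void)sink;
}
```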
You can use the MMU, or whatever other games your logic allows, to reduce the amount of memory required. For example, if you have an MMU whose entry size covers 4 KB, you could fill 4 KB of real memory with data, set up 16 different MMU entries (with 16 different virtual addresses), and read through the 4 KB once for each of the 16. Of course, that only works if your cache is on the virtual-address side of the MMU.
Overall it is kind of an ugly thing to do. If your MMU can mark a region as non-cacheable for instructions, you could put the code performing the test in that non-cached space so that it doesn't mess with the I-cache, and only the instructions used to fill the cache sit in a cached address space.
Good luck...

Related

How to manage devices that cannot access d-cache in ARM

I'm using an SPI device with DMA enabled in an STM32H7 SoC. The DMA peripheral cannot access d-cache, so in order to make it work I have disabled d-cache entirely (for more info about this, see this explanation). However, I would like to avoid disabling d-cache globally for a problem that only affects a small region of memory.
I have read this post about the meaning of the clean and invalidate cache operations in the ARM domain. My understanding is that by cleaning a cache area you force it to be written to actual memory, and by invalidating a cache area you force the actual memory to be re-read into the cache on the next access. Is this correct?
My intention with this is to follow these steps to transmit something over SPI (with DMA):
Write the value you want on the buffer that DMA will read from.
Clean d-cache for that area to force it to go to actual memory, so DMA can see it.
Launch the operation: DMA will read the value from the area above and write it to the SPI's Tx buffer.
SPI reads data at the same time it writes, so there will be data in the SPI's Rx buffer, which DMA will read and then write to the receive buffer provided by the user. An observer of that buffer may go through d-cache, which might not yet hold the new value received over SPI, so invalidate the receive buffer area to force d-cache to get updated.
Does the above make sense?
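A minimal sketch of that sequence, assuming the CMSIS-Core cache-maintenance helpers and the ST HAL SPI DMA call; hspi1, BUF_LEN and the 32-byte alignment are placeholders, not taken from the question:

```c
/* Sketch only: clean-before-TX / invalidate-after-RX around a DMA SPI
 * transfer, assuming CMSIS-Core (SCB_CleanDCache_by_Addr,
 * SCB_InvalidateDCache_by_Addr) and the ST HAL. hspi1, BUF_LEN and the
 * 32-byte alignment are placeholders/assumptions. */
#include "stm32h7xx_hal.h"

#define BUF_LEN 64                     /* multiple of the 32-byte cache line */

extern SPI_HandleTypeDef hspi1;        /* configured elsewhere for DMA */

static uint8_t tx_buf[BUF_LEN] __attribute__((aligned(32)));
static uint8_t rx_buf[BUF_LEN] __attribute__((aligned(32)));

void spi_xfer(void)
{
    /* 1. Write the values you want into the buffer DMA will read from. */
    for (int i = 0; i < BUF_LEN; i++)
        tx_buf[i] = (uint8_t)i;

    /* 2. Clean: push the CPU's cached copy of tx_buf out to SRAM so the
     *    DMA engine sees the data just written. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, BUF_LEN);

    /* 3. Launch the operation. */
    HAL_SPI_TransmitReceive_DMA(&hspi1, tx_buf, rx_buf, BUF_LEN);

    /* ... wait here for the transfer-complete callback or flag ... */

    /* 4. Invalidate: drop any stale cached copy of rx_buf so the next CPU
     *    read fetches what DMA actually wrote into SRAM. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, BUF_LEN);
}
```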
EDIT
Adding some more sources/examples of the problem I'm facing:
Example from the ST github: https://github.com/STMicroelectronics/STM32CubeH7/issues/153
Post in ST forums answering and explaining the d-cache problem: https://community.st.com/s/question/0D53W00000m2fjHSAQ/confused-about-dma-and-cache-on-stm32-h7-devices
Here is the interconnection between memory and DMA:
As you can see, DMA1 can access sram1, 2 and 3. I'm using sram2.
Here are the cache attributes of SRAM2:
As you can see, it is write-back, write-allocate, but not write-through. I'm not familiar with these attributes, so I read the definitions from here. However, that article seems to talk about the CPU's physical cache (L1, L2, etc.), and I'm not sure whether the ARM i-cache and d-cache refer to this physical cache. In any case, I'm assuming the definitions of write-through and the other terms are valid for d-cache as well.
I forget offhand how the data cache works on the cortex-m7/armv7-m. I want to remember that it does not have an MMU and that caching is based on address. ARM and ST would be smart enough to provide both cached and non-cached access to SRAM from the processor core.
If you want to send or receive data using DMA, you do not go through the cache.
You linked a question from before to which I had provided an answer.
Caches contain some amount of SRAM, the part we tend to see in a spec (this many KBytes or MBytes), but there are also tag RAMs and other infrastructure. How does the cache know if there is a hit or a miss? Not from the data, but from other bits of information taken from the address of the transaction. Some number of bits of that address are used to index into however many "ways" you have, so with 8 ways, for example, there are 8 small memories; think of them as arrays of structures in C. In that structure is some information: is this cache line valid? If valid, what is the tag, the bits of address it is tied to? Is it clean or dirty?...
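Purely as an illustration of that "array of structures" idea (a made-up model, not any particular CPU's tag layout):

```c
/* Illustration only: a made-up model of one "way" of a cache as an array
 * of tag entries, roughly matching the description above. Real tag RAMs
 * are hardware structures; field widths and policies vary per design. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS  256        /* e.g. a 16 KB way with 64-byte lines */
#define LINE_SIZE 64

struct tag_entry {
    bool     valid;          /* does this line hold anything at all? */
    bool     dirty;          /* written by the CPU but not yet in system memory */
    uint32_t tag;            /* upper address bits this line is tied to */
};

struct cache_way {
    struct tag_entry tags[NUM_SETS];
    uint8_t          data[NUM_SETS][LINE_SIZE];   /* the actual cached bytes */
};

/* A lookup takes the set index and tag from the address and compares
 * against every way; a match on a valid entry is a hit. */
static bool is_hit(const struct cache_way *ways, int num_ways, uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % NUM_SETS;
    uint32_t tag   = addr / (LINE_SIZE * NUM_SETS);
    for (int w = 0; w < num_ways; w++)
        if (ways[w].tags[index].valid && ways[w].tags[index].tag == tag)
            return true;
    return false;
}
```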
Clean or dirty: the whole point of the caching infrastructure is to hold information in a faster SRAM (SRAM in MCUs is already very fast, so why a cache in the first place?), which means write transactions, if they go through the cache (in some form they should), get written to the cache and then, based on the design/policy, get written out into system memory, or at least out the memory side of the cache, at some later time. While the cache contains information from a write that is not also in system memory, that line is dirty. When you clean the cache (ARM's term; flush is another term), you go through all of the cache looking for lines that are valid and dirty and initiate writes to system memory to clean them. This is how you force things out of the cache into system memory for coherency reasons, if you need to do that.
Invalidating a cache simply means you go through the tag RAMs and change the valid bit to indicate invalid for that cache line. That basically "loses" all information about the cache line; it is now available for reuse. It will not produce any hits, and it will not cause a write to the system as a clean/flush would. The actual cache line data does not have to be zeroed or put into any other state; technically it is just the valid/invalid bit or bits.
Things generally get into a cache from reads. Depending on the design and settings, if a read is cacheable, the cache first looks to see whether it has a valid tag for that item; if so, it simply returns the information already in the cache. If there is a miss, meaning that data has no copy in the cache, then it initiates one or more cache-line reads from the system side. So a single byte read can/will cause a larger, sometimes much larger, read on the system side; the transaction is held until that data returns, then it is put in the cache and the requested item is returned to the processor.
Depending on the architecture and settings, writes may or may not create an entry in the cache. If a (cacheable) write happens and there is no hit in the cache, it may just go straight to the system side as a write of that size and shape, as if the cache were not there. If there is a cache hit, the data goes into the cache and that/those cache lines are marked dirty. Then, depending on the design, it may also be written to system memory as a side effect: the processor is freed to continue execution while the cache and other logic (a write buffer) keep processing the transaction, moving the new data to the system side, essentially cleaning/flushing automatically. One normally does not expect this, as it takes away the performance the cache was there to provide in the first place.
In any case, if a transaction misses and is to be cached, the ways have already been examined (based on the tag) to determine there was no hit, and one of the ways is chosen to hold the new cache line. How that is determined depends on the design and, in some cases, programmable settings. Hopefully, if any way is invalid for that index, it goes to one of those; otherwise round robin, random, oldest first, etc. are schemes you may see. And if there is dirty data in that slot, it has to be written out first to make room for the new information. So, absolutely, a single byte or single word read (they have the same performance in a system like this) can require a cache-line write-back, then a read from the system, and only then is the result returned, which is more clock cycles than if the cache were not there. Nature of the beast. Caches are not perfect; with the right information and experience you can easily write code that makes the cache degrade the performance of the application.
Clean means if a cache line is valid and dirty then write it out to system memory and mark it as clean.
Invalidate means if the cache line is valid then mark it as invalid. If it was valid and dirty, that information is lost.
In your case you do not want to deal with the cache at all for these transactions. The cache in question is inside the ARM core, so nobody but the ARM core has access to it; nobody else is behind the cache, they are all on the system side.
Taking a quick look at the ARM ARM for armv7-m, they do use the address space to determine write-through and cached-or-not. One then needs to look at the cortex-m7 TRM for further information and then, particularly in this case, since this is a chip thing and not an ARM thing, at the whole system. The ARM processor is just a bit of IP that ST bought to glue into a chip with a bunch of other IP, and IP of their own, all glued together. Like the engine in a car: the engine manufacturer can't answer questions about the rear differential or the transmission; that is the car company, not the engine company.
arm knows what they are doing
st knows what they are doing
If a chip company makes a chip with DMA, but the only path between the processor and the memory shared with the DMA engine goes through the processor's cache when the cache is enabled, so that clean/flush and invalidate of address ranges are constantly required to use that DMA engine... then you need to immediately discard that chip, blacklist that company's products (if this product is that poorly designed, assume all of their products are), and find a better company to buy products from.
I can't imagine that is the case here, so:
1. Initialize the peripheral, choosing to use DMA, and configure the peripheral or DMA engine or both (for each direction).
2. Start the peripheral (this might be part of 4).
3. Write the TX data to the configured address space for DMA.
4. Tell the peripheral to start the transfer.
5. Monitor for completion of the transfer.
6. Read the received data from the configured address space for DMA.
That is generic, but it is what you are looking for; caches are not involved. For a part/family like this there should be countless examples, including the (choose your own adjective for the quality) one or more library solutions and examples that come from the chip vendor. Look at how others are using the part, compare that to the documentation, determine your risk tolerance for their solution, and use it, modify it, or at least learn from it.
I know that ST products do not have an instruction cache in the ARM sense; they do their own thing, or at least that is what I remember (some trademarked name for a flash cache, which on most of them you cannot turn off). Does that mean they have not implemented a data cache on these products either? Possible. Just because the architecture for an IP product has a feature (FPU, caches, ...) does not automatically mean that the chip vendor has enabled/implemented it. Depending on the IP there are various ways to do that, as some IP does not have a compile-time option for the chip vendor to leave a feature out; if nothing else, the chip vendor could simply stub out the cache memory interfaces and write a few lines in the docs saying there is no cache, so you could write the control registers and appear to enable the feature but it simply would not work. One expects that ARM provides compile-time options, not in the public documentation we can see, but available to the chip vendor in some form. Sometimes when you buy the IP you are given a menu, like ordering a custom burger at a fancy burger shop: a list of checkboxes, mayo, mustard, pickle, ... FPU, cache, 16-bit fetch, 32-bit fetch, one-cycle multiply, x-cycle multiply, divide, etc., and the chip vendor then produces your custom burger. Or with some vendors you get the whole burger and have to pick off the pickles and onions yourself.
So again, it is not our job to read the docs for you. First off, does this part even have a d-cache? Look at the ARM ARM, the ARM TRM, and the documentation for the chip's address spaces (as well as the countless examples) and determine what address space or what settings, etc. are needed to access portions of SRAM in a non-cached way, if it has a data cache feature at all.
I have investigated a bit more:
With regard to the clean and invalidate question, the answer is yes: clean forces the cached data to be written out to memory, and invalidate discards the cached copy so that memory is re-read into the cache on the next access.
With regard to the steps I proposed, again yes, they make sense.
Here is a sequence of 4 videos that explain this exact situation (DMA and memory coherency). As can be seen, the 'software' solution proposed by the videos (the one that doesn't involve the MPU), and by the other resources provided above, is exactly the sequence of steps I posted.
https://youtu.be/5xVKIGCPy2s
https://youtu.be/2q8IvCxSjaY
https://youtu.be/6IEtoG7m0jI
https://youtu.be/0DhYTqPCRiA
The other proposed solution is to configure the cortex-m7 MPU to change the attributes of a particular memory region to keep memory coherency.
All this apart from the easiest solution, which is to globally disable d-cache, although, naturally, this is not desirable.
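For reference, a sketch of that MPU-based approach using the ST HAL; the region number, base address (SRAM2 assumed at 0x30020000 here), size and attribute combination are assumptions to check against your device and reference manual:

```c
/* Sketch only: mark one SRAM region as non-cacheable via the MPU so DMA
 * buffers placed there need no clean/invalidate calls. Region number,
 * base address, size and attribute combination are assumptions. */
#include "stm32h7xx_hal.h"

void mpu_make_dma_region_noncacheable(void)
{
    MPU_Region_InitTypeDef cfg = {0};

    HAL_MPU_Disable();

    cfg.Enable           = MPU_REGION_ENABLE;
    cfg.Number           = MPU_REGION_NUMBER0;
    cfg.BaseAddress      = 0x30020000;                 /* placeholder: SRAM2 */
    cfg.Size             = MPU_REGION_SIZE_16KB;       /* placeholder size */
    cfg.AccessPermission = MPU_REGION_FULL_ACCESS;
    cfg.TypeExtField     = MPU_TEX_LEVEL1;             /* normal memory ... */
    cfg.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;   /* ... non-cacheable */
    cfg.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
    cfg.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
    cfg.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    HAL_MPU_ConfigRegion(&cfg);

    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```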

Cache pre-fetching while traversing an array: what if some memory pages have been swapped out?

The biggest advantage of an array (of int, for example) is that, if you read it sequentially, it can be fully preloaded into cache, because the CPU detects the memory access pattern and pre-fetches the locations about to be read, so the "next" element of a vector is always in cache.
To what extent is that statement just "theory"? Thinking about timing, for it to be true the pre-fetcher must know how long it will take to bring the next cache line into cache (which implies knowing how "slow" the RAM is) and how much time is left before that data is the next to be read by the CPU (which implies knowing how time-consuming the remaining instructions are), so that the first sequence of actions takes no longer than the second.
The specific case I have in mind: let's assume that the first 5 pages of a 10-page-long array are in RAM and the last 5 are in swap. If the next address the pre-fetcher wants to load is the first address of the sixth page, that pre-loading time will be unpredictably long and the pre-fetcher will fail in its mission.
I know CPUs do a lot of guessing and speculating about the future of a process, like cache pre-fetching, branch prediction and probably a lot of other techniques I'm not aware of, some of them perhaps cooperating with the OS (and I'm eager to know more about this; it surprises me every time).
So, do CPUs and/or OSes try to solve that kind of guessing-timing problem, for example by trying to answer the pre-fetcher's question: how much time do I need in advance for my speculative pre-fetching to cause zero delay?
So, do CPUs and/or OSes try to solve that kind of guessing-timing problem?
No.
This is handled solely by the CPU hardware. If you read some memory location, the CPU simply brings a memory block containing that location into the cache as a whole cache line. The hardware prefetcher never triggers a page fault: if a page is swapped out, it is the demand access itself that faults and waits for the OS to bring the page back in.

Does the distance between read and write locations have an effect on cache performance?

I have a buffer of size n that is full, and a successor buffer of size n that is empty. I want to insert a value into the first buffer at position i, but I would need to move a range of memory forward in order to do that, since the buffer is full (i.e., a sequential insert). I have two options here:
Prefer write close to read (adjacent):
Push the last value of the first buffer into the second.
Move the elements between i and n - 1 in the first buffer forward by one.
Insert at i.
Prefer fewer steps:
Copy the range i to n - 1 from the first into the second buffer.
Insert at i.
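The two options might look roughly like this in C (my sketch, assuming int buffers and the moves described above):

```c
/* Sketch of the two options described above, assuming int buffers of size n
 * and that the pushed/copied values land at the start of the second buffer. */
#include <stddef.h>
#include <string.h>

/* Option 1: keep the write close to the read (adjacent). */
void insert_adjacent(int *first, int *second, size_t n, size_t i, int value)
{
    second[0] = first[n - 1];                         /* push last value over */
    memmove(&first[i + 1], &first[i],
            (n - 1 - i) * sizeof(int));               /* shift i..n-2 forward */
    first[i] = value;
}

/* Option 2: fewer steps, but the write lands far from the read. */
void insert_fewer_steps(int *first, int *second, size_t n, size_t i, int value)
{
    memcpy(second, &first[i], (n - i) * sizeof(int)); /* copy i..n-1 across */
    first[i] = value;
}
```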
Most of what I can find only talks about locality in a read context, and I am wondering whether the distance between the read and the write memory should be considered.
Does the distance between read and write locations have an effect on cache performance?
Yes. Normally (not including rare situations where the CPU can write an entire cache line with new data) the CPU has to fetch the most recent version of a cache line into its cache before doing the write. If the cache line is already in the cache (e.g. due to a previous read of some other data that happened to be in the same cache line), then the CPU won't need to fetch the cache line before doing the write.
Note that there are also various other quirks (cache aliasing, TLB misses, etc.); and all of it depends on the specific situation and which CPU (e.g. if all of the process's data fits in the CPU's cache, there's no shared memory involved, and there are no task switches or other processes using the CPU, then you can assume everything will always be in the cache anyway).
I want to insert a value within the first buffer at position i, but I would need to move a range of memory forward in order to do that, since the buffer is full (ie. sequential insert).
Without more information (how often this happens, how much data is involved, etc) I can't really make any suggestions. However (at first glance, without much information), the entire idea seems bad. More specifically, it sounds like you're adding a bunch of hassle to make two smaller arrays behave exactly the same as one larger array would have (and then worrying about the cost of insertion because arrays aren't good for insertion in general).
this is a component deep within a data structure at the lowest level where n is small and constant
By small I assume you mean smaller than the CPU's L1 cache (somewhere under 1 MB) or L2 cache (up to 10-20 MB, depending on your CPU); in that case, no.
I am wondering whether the distance between the read and the write memory should be considered.
Sometimes. If all the data can fit into the L1, L2, or L3 cache of the CPU the process is running on, then what you think of as random access applies: it would all be the same latency. You can get into the nitty gritty and delve into the differences between L1, L2, and L3 cache, but for the sake of brevity (and I simply take it for granted), anywhere within a given memory boundary it's all the same latency to access. So in your case, where n is small and everything fits into CPU cache (the first of many boundaries), it is the manner and efficiency in which you choose to move/change values, and the number of times you end up doing that, which affects performance (time to complete).
Now if n were big, for example in a system with 2 or more sockets (connected over Intel QPI or UPI), and the data resided in DDR RAM attached to the memory controller of the other CPU, then yes, there is definitely a big performance hit (relatively speaking), because now a boundary has been crossed: whatever could not fit into the cache of the CPU the process is running on (and was initially fetched from DIMMs local to that CPU's memory controller) now incurs the overhead of talking to the other CPU over the QPI or UPI path (still very fast compared to previous architectures); that other CPU then fetches the data from its set of DIMMs and sends it back over QPI or UPI to the CPU your process is running on.
So when you exceed the L1 cache limit and spill into L2 there is a performance hit, likewise into L3, all within one CPU. When a process has to repeatedly fetch from its local set of DIMMs more data than it can fit into cache: performance hit. When that data is not on DIMMs local to that CPU: slower. When that data is not on the same motherboard and goes across some kind of high-speed fiber RDMA: slower. When it's across Ethernet: even slower... and so on.
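To see those boundaries yourself, a rough sketch (not a rigorous benchmark) is to time a fixed number of strided accesses over working sets of increasing size; the time per access steps up as the set outgrows L1, then L2, then L3. Sizes, stride and iteration count below are arbitrary assumptions:

```c
/* Rough sketch: measure time per access for working sets of increasing
 * size. Assumes a POSIX clock and 64-byte cache lines; sequential striding
 * means prefetching hides some latency, but the staircase still shows up. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE   64
#define ACCESSES (1 << 24)

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    for (size_t size = 16 * 1024; size <= 64 * 1024 * 1024; size *= 2) {
        volatile unsigned char *buf = malloc(size);
        for (size_t i = 0; i < size; i++)
            buf[i] = (unsigned char)i;               /* touch every page */

        unsigned char sink = 0;
        size_t pos = 0;
        double t0 = now_sec();
        for (long i = 0; i < ACCESSES; i++) {
            sink += buf[pos];
            pos = (pos + STRIDE) % size;             /* wrap inside the set */
        }
        double t1 = now_sec();

        printf("working set %8zu bytes: %.2f ns/access (sink=%u)\n",
               size, (t1 - t0) * 1e9 / ACCESSES, (unsigned)sink);
        free((void *)buf);
    }
    return 0;
}
```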

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program will temporarily have to wait for the RAM while dereferencing a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait on a response.
I'm trying to decide whether reordering an array of items would truly be worth it when I need to access items in the array in certain groups repeatedly (so, sorting it based on those groups). However, those groups will change frequently, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
As for ordinary writes, they will either always cost a memory operation (a write-through cache) or only sometimes cause one, when a dirty line is eventually evicted (a write-back cache).
You can read more details about the possible behaviors of caches on Wikipedia.
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiled code turns that into appropriate machine language that is close to the source code but can vary in the presence of optimization, the volatile keyword, etc. Finally, the hardware optimizes across the 3 main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program's variables can actually end up stored on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).

Is there a way to avoid cache misses _completely_?

I read the very basics of how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code?, but neither of these posts answered my question: is there a way to execute some code entirely within the cache, i.e., without any access to RAM (beyond, perhaps, the initial process of reading the file from the HDD)? As far as I understand, the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you stay within the CPU, you are just fine".
Is there a way to load a program into the cache and keep it there until it terminates? So let's say I have a 1 MB compiled C program that does some scientific computation with a memory requirement of another 1 MB, and runs for 5 days. Is there a way to flag this code so that it does not drop out of the cache during execution? I am thinking of giving this code higher priority, or something similar, during execution.
In other words: how much cache is used by an idling computer, which loads its OS (say Ubuntu) and then does nothing? Is there excessive cache use during idling? Should I expect my small program to stay in the cache if the OS does not do anything besides executing it? Let's say the screensaver starts after 5 minutes. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for cache space? My experience is that running several non-demanding programs (like the screensaver, or a simple audio player, PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to go in and out of the cache all the time. The question is: why is its speed not affected? Would it make sense to use an absolutely minimalistic OS (if so, which one?) to improve, or rather maintain, the speed of the computation?
Just for clarity, we can assume the code is something very simple, say a bunch of nested for loops where the innermost part sums up all the loop variables modulo 97. The point is that it is small enough to be placed and executed entirely in the cache.
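For concreteness, a toy version of that kind of kernel might look like this (my sketch, not the poster's code):

```c
/* Toy version of the kernel described above: nested loops whose innermost
 * body sums the loop counters modulo 97. Code and data easily fit in L1,
 * so after the first pass it should run with essentially no misses beyond
 * the compulsory ones. */
#include <stdio.h>

int main(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            for (int k = 0; k < 1000; k++)
                sum += (unsigned long)(i + j + k) % 97;
    printf("%lu\n", sum);
    return 0;
}
```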
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32 KB/64 KB per core, and L2 cache sizes are 256 KB per core. So 1 MB of data would also create either capacity or conflict misses, depending on the cache's associativity.
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: no. The cache is maintained by the OS/CPU, and it is a bad idea to let a program force itself to stay in cache. Say you have 2 programs running at the same time and both try to force themselves to stay in the cache; chaos would ensue, wouldn't it?
Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
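On Linux, CAT is exposed through the resctrl filesystem; the sketch below is an assumption-laden illustration (it assumes resctrl is mounted at /sys/fs/resctrl, runs as root, and uses a placeholder group name and cache mask; check the kernel's resctrl documentation for the exact schemata syntax on your platform):

```c
/* Sketch only: create a resctrl group, give it a slice of L3 cache ID 0,
 * and move the current process into it. Group name and mask value are
 * placeholders; resctrl must be mounted and this must run as root. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Create a new resource group (a directory under /sys/fs/resctrl). */
    if (mkdir("/sys/fs/resctrl/mygroup", 0755) != 0)
        perror("mkdir");

    /* Reserve part of L3 cache ID 0 for this group (placeholder mask). */
    FILE *f = fopen("/sys/fs/resctrl/mygroup/schemata", "w");
    if (f) { fprintf(f, "L3:0=f\n"); fclose(f); }

    /* Assign the current process to the group. */
    f = fopen("/sys/fs/resctrl/mygroup/tasks", "w");
    if (f) { fprintf(f, "%d\n", (int)getpid()); fclose(f); }

    return 0;
}
```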
I am not sure whether it will satisfy your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use fully associative caches or tightly coupled memories (TCMs), which have single-cycle access times (this is realistic only in very small embedded systems). It is general practice to use TCMs in embedded systems for time-critical code, as they provide predictability.
With partially (set-) associative caches it is possible to lock down cache lines or ways (e.g. using CP15 on ARM), so that the eviction algorithm does not consider them as victims for a cache fill.
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards, when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

Resources