Is there a way to avoid cache misses _completely_? - c

I read the very basics on how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code?, but none of those posts answered my question: is there a way to execute some code entirely within the cache, i.e., without any access to RAM (beyond perhaps the initial process of reading the file from the HDD)? As far as I understand, the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you stay within the CPU, you are just fine".
Is there a way to load a program into the cache, and keep it there until it terminates? So let's say I have a 1MB compiled C program, which does some scientific computation with a memory requirement of another 1MB, and runs for 5 days. Is there a way to flag this code so that it does not get evicted from the cache during execution? I am thinking of giving this code higher priority, or something similar, while it runs.
In other words, how much cache does an idling computer use, one which loads its OS (say Ubuntu) and then does nothing? Is there significant cache use during idling? Should I expect my small program to stay in the cache if the OS does not do anything besides executing it? Let's say that after 5 minutes the screensaver starts. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for the cache space? My experience is that running several non-demanding programs (like the screensaver, a simple audio player, a PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to be pushed in and out of the cache all the time. The question is: why is its speed not affected? Would it make sense to use an absolutely minimalistic OS (and if so, which one?) to improve, or rather maintain, the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say a bunch of nested for loops where the innermost part sums up all the loop counters modulo 97. The point is that it is small enough to be kept and executed entirely in the cache.
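For concreteness, a toy kernel of that kind might look roughly like the sketch below (the loop bounds are arbitrary placeholders); both the code and its data footprint are tiny compared with any cache.

```c
#include <stdio.h>

int main(void)
{
    unsigned long sum = 0;

    /* Tiny, cache-resident workload: the innermost body just accumulates
       the loop counters modulo 97, touching no memory beyond a few bytes. */
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            for (int k = 0; k < 1000; k++)
                sum = (sum + (unsigned long)(i + j + k)) % 97;

    printf("%lu\n", sum);
    return 0;
}
```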

There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32 KB or 64 KB per core, and L2 caches are around 256 KB per core. So 1 MB of data would also cause either capacity or conflict misses, depending on the cache's associativity.
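A rough way to see the capacity effect yourself is to time the same number of accesses over working sets of different sizes; once the buffer no longer fits in a given cache level, the time per access jumps. The following is only a sketch (sizes and iteration counts are arbitrary, runs should be repeated and averaged, and the sequential stride lets the hardware prefetcher hide part of the effect; a randomized pointer chase shows it more sharply):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Touch `size` bytes over and over and report nanoseconds per access. */
static double walk(size_t size, size_t accesses)
{
    volatile char *buf = malloc(size);
    for (size_t i = 0; i < size; i++)
        buf[i] = (char)i;                      /* warm up the buffer */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (size_t i = 0; i < accesses; i++) {
        buf[idx] += 1;
        idx = (idx + 64) % size;               /* one cache line per step */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free((void *)buf);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)accesses;
}

int main(void)
{
    for (size_t kb = 16; kb <= 16 * 1024; kb *= 2)
        printf("%6zu KB: %.2f ns/access\n", kb, walk(kb * 1024, 1u << 25));
    return 0;
}
```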

No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.

Short answer: NO. The cache is maintained by the OS/CPU, and it is a bad idea to let programs force themselves to stay in the cache. Let's say you have two programs running at the same time, and both try to force themselves to stay in the cache; chaos would ensue, wouldn't it?

Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
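On Linux, CAT is exposed through the resctrl filesystem. The following is only an illustrative sketch, not a recipe: it assumes a kernel with resctrl support and CAT-capable hardware, that resctrl is already mounted, and that the group name, cache-ways mask, and PID are placeholders to be adapted (the exact schemata syntax depends on the platform).

```c
/* Illustrative sketch: reserve part of the L3 for one process via resctrl.
   The group name, mask, and PID below are made-up placeholders. */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* Assumes: mount -t resctrl resctrl /sys/fs/resctrl has already been done. */
    mkdir("/sys/fs/resctrl/my_hpc_job", 0755);

    /* Claim a few L3 ways on cache domain 0 for this group (the mask is an
       example; multi-socket systems may require listing every cache domain). */
    FILE *f = fopen("/sys/fs/resctrl/my_hpc_job/schemata", "w");
    if (f) { fprintf(f, "L3:0=f\n"); fclose(f); }

    /* Move the compute process (PID 12345 as a placeholder) into the group. */
    f = fopen("/sys/fs/resctrl/my_hpc_job/tasks", "w");
    if (f) { fprintf(f, "12345\n"); fclose(f); }

    return 0;
}
```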

I am not sure whether this will answer your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use fully associative caches (or, for example, tightly coupled memories, TCMs), which have single-cycle access times (this is realistic only in very small embedded systems). It is general practice to use TCMs in embedded systems for time-critical code, as they provide predictability.
With set-associative caches it is possible to lock down cache lines or ways (for example using CP15 on ARM), so that the eviction algorithm does not consider them as victims for a cache fill.
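As a very rough illustration of way-based lockdown: the CP15 c9 register and the L-bit layout below follow ARM11-class cores, the code must run in a privileged mode, and the exact encoding differs from core to core, so treat this purely as a sketch of the sequence rather than working code for any particular chip.

```c
/* Sketch of ARM11-style way-based D-cache lockdown (core-specific,
   privileged-mode only; the register format is an assumption here). */
static inline void write_dcache_lockdown(unsigned int val)
{
    __asm__ volatile("mcr p15, 0, %0, c9, c0, 0" :: "r"(val));
}

void lock_buffer_into_way0(const char *buf, unsigned int size)
{
    /* 1. Block allocation into ways 1..3 so new fills land in way 0 only. */
    write_dcache_lockdown(0xE);                      /* L bits: 1110 */

    /* 2. Touch the critical data so it is fetched into way 0
          (a 32-byte cache line is assumed). */
    for (unsigned int i = 0; i < size; i += 32)
        (void)*(volatile const char *)(buf + i);

    /* 3. Lock way 0 and re-enable allocation in the other ways. */
    write_dcache_lockdown(0x1);                      /* L bits: 0001 */
}
```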
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

Related

Cache pre-fetching while traversing an array: what if some memory pages have been swapped out?

The biggest advantage of an array of, for example, int is that, if you read it sequentially, it can be fully preloaded into the cache, because the CPU detects the memory access pattern and prefetches the locations about to be read, so the "next" element of a vector is always in cache.
To what extent is that statement just "theory"? Thinking about timing, for it to be true the prefetcher must know how long it will take to bring the next cache line into the cache (which implies knowing how "slow" the RAM is) and how much time is left before that data is next read by the CPU (which implies knowing how time-consuming the remaining instructions are), so that the first sequence of actions takes no longer than the second.
The specific case I have in mind: let's assume that the first 5 pages of a 10-page-long array are in RAM and the last 5 are in swap. If the next address the prefetcher wants to load belongs to the first swapped-out page, that preloading time will be unpredictably long and the prefetcher will fail in its mission.
I know CPUs try to do a lot of guessing and speculation about the future of a process, such as this cache prefetching, branch prediction, and probably a lot of other techniques I'm not aware of, some of them probably cooperating with the OS (and I'm eager to know more about this; it surprises me every time).
So, do CPUs and/or OSes try to solve that kind of guessing-timing problem, for example by trying to answer the prefetcher's question: how far in advance do I need to start my speculative prefetch for it to cause zero delay?
So, do CPUs and/or OSes try to solve that kind of guessing-timing problem
No.
This is handled solely by the CPU hardware. If you read some memory location, the CPU simply brings the memory block containing that location into the cache as a whole cache line. On x86, at least, the hardware prefetchers also do not cross page boundaries and never trigger a page fault, so a swapped-out page simply ends the prefetch stream rather than creating a timing problem.
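You can see the prefetcher at work by timing the same set of loads in sequential order versus a shuffled order. The sketch below is only illustrative (the array size, iteration count, and use of rand() are arbitrary choices); on typical desktop hardware the sequential pass usually runs several times faster once the array is much larger than the cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M ints, well beyond typical caches */

/* Sum the array in the order given by `idx` and return the elapsed seconds. */
static double sum_by_index(const int *a, const size_t *idx)
{
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        sum += a[idx[i]];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("(sum=%lld) ", sum);                 /* keep the compiler honest */
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    int *a      = malloc(N * sizeof *a);
    size_t *seq = malloc(N * sizeof *seq);
    size_t *rnd = malloc(N * sizeof *rnd);
    for (size_t i = 0; i < N; i++) { a[i] = (int)i; seq[i] = i; rnd[i] = i; }

    /* Fisher-Yates shuffle to destroy the sequential access pattern. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = rnd[i]; rnd[i] = rnd[j]; rnd[j] = t;
    }

    printf("sequential: %.3f s\n", sum_by_index(a, seq));
    printf("shuffled:   %.3f s\n", sum_by_index(a, rnd));
    return 0;
}
```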

Random Memory Reads vs Random Memory Writes

In low-level languages like C, I know you should try to use the CPU cache to your benefit as much as possible, since a cache miss means your program temporarily has to wait for RAM in order to dereference a pointer. However, are writes to memory also affected by this? If you write to memory, it would seem that the CPU does not need to wait for a response.
I'm trying to decide whether reordering an array of items would truly be worth it when I need to repeatedly access the items in certain groups (so, sorting it based on those groups). However, those groups change frequently, so I would need to keep reordering the array if I do this.
Depending on your architecture, random memory writes can be expensive for at least two reasons.
On today's multi-core machines, almost all writes will require some kind of cache coherence protocol to be run so that the corresponding cache lines on other caches will be invalidated.
Ordinary writes will either always cost a memory operation (with a write-through cache, where every store is propagated to memory) or only sometimes cause one (with a write-back cache, where a store reaches memory only when the dirty line is eventually evicted).
You can read more details about the possible behaviors of caches on Wikipedia.
This is a very broad question, so my answer is nearly as broad.
The source code, the compiled code, and the underlying hardware are not necessarily all in sync when it comes to reading and writing memory. Your C/C++ code simply references variables. The compiled code will turn that into appropriate machine language which is close to the source code but can vary in the case of optimization, volatile keyword, etc. Finally the hardware will optimize the 3 main levels of storage: CPU cache (fastest), RAM, and hard disk (yes, your program variables can actually be stored on the hard disk, in the case of swapping).
Whether the CPU waits or not depends partially on what's going on at the hardware layer combined with the machine code (again for example consider data specified as volatile).
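Coming back to the reordering idea in the question: whether re-sorting pays off has to be measured against how often the groups change, but the pattern itself is simple. Here is a hedged sketch (the item struct and group key are made-up placeholders) of sorting by group so that each group is then walked sequentially through contiguous cache lines.

```c
#include <stdlib.h>

/* Hypothetical item: `group` is the key we repeatedly process by. */
struct item {
    int   group;
    float payload;
};

static int by_group(const void *a, const void *b)
{
    const struct item *x = a, *y = b;
    return (x->group > y->group) - (x->group < y->group);
}

/* After sorting, all items of one group occupy one contiguous run, so
   processing a group touches consecutive cache lines instead of jumping
   around the array.  Whether this beats the cost of re-sorting every time
   the groups change is something to benchmark on the real workload. */
void regroup(struct item *items, size_t n)
{
    qsort(items, n, sizeof *items, by_group);
}
```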

Programming to fill our system's L1 or L2 cache

How can we systematically write code that loads data into the L1 or L2 cache?
I am specifically trying to fill the L1 instruction cache of my system for some further analysis.
Any suggestions will do - with respect to writing assembly code or simple C programming. Related articles on this topic will be even more helpful.
A cache stores recently-accessed data. To fill the cache, just access the data. Or in this case, instructions. Fill a block of memory with no-op instructions (and a looping branch instruction at the end) and jump to it.
The tricky part is keeping data in there once it's loaded. You can't access anything outside the 32K (or whatever) data set as long as your benchmark is running.
I can't imagine what you get from artificially filling a cache and then keeping it filled with the same data set, but there you go.
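On x86-64 under Linux you can do this quite literally: map an executable buffer, fill it with NOPs, end it with a return, and call it in a loop from the outside (a slight variant of the looping-branch idea above). This is only a sketch; it assumes the 0x90/0xC3 opcodes of x86, a 32 KB I-cache, and that the system allows writable-and-executable mappings.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 32 * 1024;                 /* assumed L1 I-cache size */

    /* Executable buffer: a NOP sled terminated by RET. */
    unsigned char *code = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED) { perror("mmap"); return 1; }

    memset(code, 0x90, size);                /* 0x90 = NOP */
    code[size - 1] = 0xC3;                   /* 0xC3 = RET */

    void (*fn)(void) = (void (*)(void))code;
    for (int i = 0; i < 1000; i++)           /* keep the sled hot in the I-cache */
        fn();

    puts("done");
    return 0;
}
```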
You will need to find out the cache associativity of your CPU and the replacement policy. I can't think of a general solution to this problem that would work on all the CPUs I've worked with. Even caches advertised as fully associative with an LRU replacement policy aren't exactly that in reality and it can be very hard to figure out a pattern of memory access that completely fills the cache.
If you want this for some very specific benchmark (which is a bad idea for other reasons), I'd recommend trying to figure out how to flush the cache instead. That is actually doable.
Just last week I performed this task, ECC-filling an L1 and L2 cache.
Basically, if you have, for example, a 64 KB cache in total (x ways, y cache lines, etc.), then for the data side just access that much data linearly through the cache (you might need the MMU on to enable caching): start on a 64 KB boundary and read 64 KB worth of data, ideally in cache-line-sized reads (or multiples thereof) if possible. For the I-cache you need that many bytes' worth of instructions (NOPs, or add reg, 1, or something similar); remember there is probably a prefetch at the end, so you might have to pull the final return back a few instructions so that the prefetch takes you all the way to the end (this might take some practice, and if you don't have visibility into the logic, e.g. a chip sim, you might not figure it out).
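For the data side, that linear read is just a strided loop over a suitably aligned buffer. A minimal sketch follows; the 64 KB total size and 64-byte line size are assumptions to be replaced by your cache geometry, and it presumes caching is enabled for the buffer.

```c
#include <stddef.h>

#define CACHE_SIZE (64 * 1024)   /* assumed total cache size */
#define LINE_SIZE  64            /* assumed cache line size  */

/* Read one byte per cache line across CACHE_SIZE bytes, so every line of
   the cache is filled from this buffer (the buffer is aligned to the
   cache size so its lines spread across all sets). */
void fill_dcache(void)
{
    static unsigned char buf[CACHE_SIZE] __attribute__((aligned(CACHE_SIZE)));

    for (size_t i = 0; i < CACHE_SIZE; i += LINE_SIZE)
        (void)*(volatile unsigned char *)&buf[i];   /* force the actual load */
}
```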
You can use the MMU, or other tricks your logic might have, to reduce the amount of memory required. For example, if you have an MMU whose entry size covers, say, 4 KB, you could fill 4 KB of real memory with data, then set up 16 different MMU entries (with 16 different virtual addresses) and read through the 4 KB once for each of the 16. Of course, that only works if your cache is on the virtual-address side of the MMU.
Overall it is kind of an ugly thing to do. If your MMU lets you prevent instruction caching for a region, you could put the code performing the test in a non-cached space so that it doesn't disturb the I-cache, and only the instructions used to fill the cache live in a cached address space.
Good luck...

Is it possible to assign parts of the shared L2 caches to different cores

Let's say 4 threads are running on 4 separate cores of a multicore x86 processor, and they do not share any data. Is it possible to programmatically make the 4 cores use separate, predefined portions of the shared L2 cache?
Let's use two terms, exclusive and shared caches, instead of L1, L2, L3, L4 caches. Different CPU families start to share the cache at different levels. In these terms the original question is: is it possible to split a shared cache into parts, each of which is used exclusively by one of the CPUs/cores? There is no clear answer. In fact, there are two answers, opposite to each other.
1) First and general answer: NO.
The cache is by design managed in hardware. There are only a few cache controls accessible to software, such as enabling/disabling the cache for the whole memory or for a defined memory region, or selecting the flushing policy (write-through/write-back). The answer is NO basically because the cache was designed to be managed in hardware, so there is no useful interface that would let you manage it gracefully in software.
2) Second answer: Yes.
In fact, a cache is designed in such a way that each cache line can hold data only from a specific set of memory lines. Because of this, if the memory manager guarantees that one CPU/core exclusively owns and uses all of the memory lines assigned to a given cache line, then that cache line is guaranteed to be used by that CPU exclusively (this is essentially page coloring). It is a very tricky workaround with very limited benefits and serious drawbacks: the memory layout becomes very fragmented, cache usage is unbalanced, memory management gets complicated, and it is very hardware-dependent (details can be found in the paper provided by "MetallicPriest").
To sum up: it is possible in theory and almost impossible in practice.
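To make the "specific set of memory lines" point concrete, the trick rests on simple arithmetic: the cache geometry fixes which set (and hence which page color) a physical address maps to, so an allocator that hands each core only pages of "its" colors effectively partitions the cache. The geometry below is a placeholder, not any particular CPU.

```c
#include <stdio.h>
#include <stdint.h>

/* Example geometry: 2 MB shared cache, 16-way, 64-byte lines, 4 KB pages. */
#define CACHE_BYTES (2u * 1024 * 1024)
#define WAYS        16u
#define LINE        64u
#define PAGE        4096u

int main(void)
{
    uint32_t sets   = CACHE_BYTES / (WAYS * LINE);   /* 2048 sets */
    uint32_t colors = (sets * LINE) / PAGE;          /* 32 colors */

    uint64_t phys  = 0x12345678;                     /* example physical address */
    uint32_t set   = (uint32_t)(phys / LINE) % sets;
    uint32_t color = (uint32_t)(phys / PAGE) % colors;

    printf("sets=%u colors=%u  addr 0x%llx -> set %u, color %u\n",
           sets, colors, (unsigned long long)phys, set, color);
    return 0;
}
```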

Can we optimize code to reduce power consumption?

Are there any techniques to optimize code in order to reduce power consumption? The architecture is ARM; the language is C.
From the ARM technical reference site:
The features of the ARM11 MPCore processor that improve energy efficiency include:
- accurate branch and sub-routine return prediction, reducing the number of incorrect instruction fetch and decode operations
- use of physically addressed caches, which reduces the number of cache flushes and refills, saving energy in the system
- the use of MicroTLBs, which reduces the power consumed in translation and protection lookups each cycle
- the caches use sequential access information to reduce the number of accesses to the tag RAMs and to unwanted data RAMs.

In the ARM11 MPCore processor extensive use is also made of gated clocks and gates to disable inputs to unused functional blocks. Only the logic actively in use to perform a calculation consumes any dynamic power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, political, or the like, fewer CPU cycles = less energy used. What I'm trying to say is that I think you can probably replace "power consumption" with "execution time", as they would essentially be directly proportional, and you may therefore have more success when not "scaring" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see what routines are using most of the CPU. On ARM you can use some JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor, when in idle, in a low-consumption state (deep sleep). If you cannot reduce the CPU percentage used by much (for example from 80% to 50%) it won't make a big difference. Depending on what operating systems you are running the options may vary.
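On a bare-metal or RTOS ARM target, the usual way to reach that low-power idle state from C is to execute WFI (wait for interrupt) in the idle loop so the core gates itself off until the next interrupt. A minimal sketch, assuming an ARMv6K/ARMv7-class core and that all idle-time work is interrupt-driven:

```c
/* Idle loop for a bare-metal ARM system: halt the core until the next
   interrupt instead of spinning.  WFI is architecturally a hint, so the
   worst case is that it behaves like a NOP and saves nothing. */
static inline void cpu_idle(void)
{
    __asm__ volatile("wfi" ::: "memory");
}

void idle_loop(void)
{
    for (;;) {
        /* ... drain any pending work, then ... */
        cpu_idle();
    }
}
```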
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on-chip memory (cache) for idle loops, keep I/O to a minimum, and keep bit flipping to a minimum on buses. Non-volatile memory like PROMs and flash consumes more power storing zeros than ones (which is why they erase to ones; it is actually a zero, but the transistor(s) invert the bit before you see it, so zeros are stored as ones and ones as zeros, which is also why they degrade to ones when they fail). I don't know about volatile memories; DRAM uses far fewer transistors per bit than SRAM, but has to be refreshed.
For any of this to matter, though, you need to start with a low-power system; otherwise the differences above may not be noticeable. Don't use anything from Intel, for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs less cycles, then simply making your code run more efficiently is the solution. Else, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.
