Is it possible the to lock the ISR instructions to L1 cache? - arm

I am running a bare metal application on one of the cores of ARM cortex A9 processor. My ISR is quite small an I am wondering whether it would be possible to lock my ISR instructions in the L1 cache? Is it possible? Is there any one who would explain some drawbacks of doing it?
Regards,
N

The Cortex-A9 does not support L1 cache lockdown (neither instructions nor data).
The drawback is that taking large chunks of the cache away (lockdown is usually done on a granularity of entire cache ways) decreases performance for everything else in the system.
Not to mention the fact that if your ISR is indeed small, and it is called frequently, it is somewhat likely to be in the cache anyway.
What is the benefit you were expecting to gain from doing this?

Your condition is the perfect fit for fast interrupt. (FIQ)
You only have to assign the last interrupt number for that particular ISR.
While other interrupt numbers are just vectors, the last number branches directly to the code area, thus saving one memory load plus interlock. You save about three cycles or so.
Besides, i-cache lockdown isn't as efficient as d-cache lockdown.
CA9 doesn't support L1 cache lockdown anyway (for some good reasons), so don't bother.
Just make sure the ISR is cache line aligned for maximum efficiency. (typically 32 or 64byte)

Related

Is L2 Cache Miss equivalent to "L2 Data Cache Refill" on ARMv7 A15?

I am trying to determine which hardware counters available on the ARM Cortex A15 processor are the correct ones to use for determining system-wide L2 cache misses.
My application here is a kernel-level voltage-frequency governor (i.e. it could substitute for the ondemand governor). Because I need access to performance counters at the system level and not attached to a particular program runtime, I am not using existing utilities such as PAPI or Linux's perf tool. From my past experiences with both I understand that these are better used to monitor the performance stats for a particular program or instrumented binary.
I have implemented a kernel module that periodically updates several hardware counter values to sysfs endpoints. The resources I have used include:
Performance Counter Sampling on the Exynos 5422
Arm Technical Reference Manual (Specifically Ch. 11 on the Performance Monitoring Unit)
Searches in perf and PAPI code & documentation to see if L2 misses is a derived counter rather than a native one.
The hardware counter I am currently using to measure L2 misses is event 0x17: "L2 data cache refill". Printing this value consistently gives 0, even when running data-heavy benchmarks. Is there a different event or set of events I should be using to determine L2 cache misses? Perhaps 0x13, "Data memory accesses", or some composite of events?
It is very possible that the root of my question is a misunderstanding of "L2 data cache refills", but I have not been able to find a clarification on this through documentation and stack overflow searches.
EDIT: I have found that L2 refills was reading 0 because the 5th hardware counter for some reason is not working as expected; re-assigning L2 refills to a different counter has resolved this particular issue.
EDIT 2: That 5th hardware counter was not function because I had not enabled that many. Silly me.
Quote from a different ARM manual implies refills are not just misses:
L1 data cache refill:
"This event counts all allocations into the L1 cache. This includes read linefills, store linefills, and prefetch linefills."

Instruction Cache Accesses (PAPI)

I am currently benchmarking some code using PAPI.
One of the preset values I obtain is the PAPI_L1_ICA i.e. the amount of instruction cache accesses performed to the L1 instruction cache.
As far as I can see the code is dominated by this, since the running time of the algorithm and the PAPI_L1_ICA seems to be more or less equivalent, while other metrics such as branch mispredictions, cache misses, tlb misses and CPU instructions, generally does not explain the behaviour of the running time.
My question is, what defines an action that triggers an L1 instruction cache access? From my measurements the accesses are in the order of 150.000 while for example the amount of completed instructions PAPI_TOT_INS are only approximately 10.000. Should they not be somewhat equal?

Is there a way to avoid cache misses _completely_?

I read the very basics on how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code? , but none of these posts answered my question: is there a way to execute some code entirely within the cache, i.e., without using any access to RAM (beyond perhaps during the initial process of reading the file from the HDD)? As far as I understand the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you are within the CPU, you are just fine".
Is there a way to load a program into the cache, and keep it there until it terminates? So let's say I have a 1MB compiled C program, which does some scientific computation with a memory requirement of another 1MB, and runs for 5 days. Is there a way to flag this code, so that it does not get out from the cache during evaluation? I am thinking of giving this code higher priority, or alike during execution.
In other words, how much cache is used by an idling computer, which loads its OS (say Ubuntu), and then does nothing? Is there excessive cache use during idling? Should I expect my small program to be always in the cache if the OS does not do anything besides executing it? Let's say after 5 minutes the screensaver starts. Does this lead to massive cache misses (and hence, drastic reduction in performance), since now it competes with my program for the cache space? My experience says that running several non-demanding programs (like the screensaver, or a simple audio player, pdf reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect that it would go in-and-out from the cache all the time. The question is: why does not it get its speed affected? Would it make sense to use an absolute minimalistic OS (if so, then which one?) to improve (or rather: maintain) the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say it is a bunch of nested for loops where the innermost part sums up all the increment variables modulo 97. The point is that it is small enough to be put and executed in the cache.
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32KB/64KB per core, and L2 cache sizes are 256KB per core. So 1MB of data would also create either capacity or conflict misses, depending on cache's associativity.
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: NO. Cache is being maintained by the OS/CPU and it is a bad idea to allow programs to force itself to stay in cache. Lets say you got 2 programs running at the same time, and both are trying to force to stay in the cache, chaos would happen isn't it?
Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
I am not sure whether it will satisfy your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use fully associative cache( for eg Tightly coupled memories), which has single cycle access times.(This is realistic only in very small embedded systems).it is a general practise to use TCM's in embedded systems for time critical code as it provides predictability.
In case of partially associative caches it is possible to lock up cache lines or ways (for eg using CP15 in ARM ), so that the eviction algorithm doesn't consider them as a victim for cache fill.
as a side note it is also useful sometimes to use Cache as Ram for Bringup of non booting boards when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple, just use memcpy(). Copy several times the amount that fits in the last cache and warm it up by running it multiple times so there are no page faults to slow the code down.
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE2 movntps instructions to stream data, to avoid cache conflicts for the other thread in the same core. Maybe unroll that loop 16 times to minimize number of instructions per memory transfer. While DMA idea sounds good, the linked manual is old and for 32bit linux and your processor model makes me think you probably have 64bit os, which makes me wonder how much of it is correct still. And bug in your test code may screw your hard drive in worst case.

Can we optimize code to reduce power consumption?

Are there any techniques to optimize code in order to ensure lesser power consumption.Architecture is ARM.language is C
From the ARM technical reference site:
The features of the ARM11 MPCore
processor that improve energy
efficiency include:
accurate branch and sub-routine return prediction, reducing the number
of incorrect instruction fetch and
decode operations
use of physically addressed caches, which reduces the number of cache
flushes and refills, saving energy in
the system
the use of MicroTLBs reduces the power consumed in translation and
protection lookups each cycle
the caches use sequential access information to reduce the number of
accesses to the tag RAMs and to
unwanted data RAMs.
In the ARM11 MPCore processor
extensive use is also made of gated
clocks and gates to disable inputs to
unused functional blocks. Only the
logic actively in use to perform a
calculation consumes any dynamic
power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, politital or the like, fewer CPU cycles = less energy used. What I'm trying to say is I think you can probably replace "power consumption" with "execution time", as they would, essentially, be directly proportional - and you therefore may have more success when not "scaring" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see what routines are using most of the CPU. On ARM you can use some JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor, when in idle, in a low-consumption state (deep sleep). If you cannot reduce the CPU percentage used by much (for example from 80% to 50%) it won't make a big difference. Depending on what operating systems you are running the options may vary.
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on chip memory (cache) for idle loops, keep I/O to a minimum, keep bit flipping to a minimum on busses. NV memory like proms and flash consume more power to store zeros than ones (which is why they erase to ones, it is actually a zero but the transitor(s) invert the bit before you see it, zeros stored as ones, ones stored as zeros, this is also why they degrade to ones when they fail), I dont know about volatile memories, dram uses half as many transistors as sram, but has to be refreshed.
For all of this to matter though you need to start with a lower power system as the above may not be noticeable. dont use anything from intel for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs less cycles, then simply making your code run more efficiently is the solution. Else, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.

Resources