I am currently benchmarking some code using PAPI.
One of the preset values I obtain is PAPI_L1_ICA, i.e. the number of instruction cache accesses performed on the L1 instruction cache.
As far as I can see the code is dominated by this, since the running time of the algorithm and PAPI_L1_ICA seem to be more or less equivalent, while other metrics such as branch mispredictions, cache misses, TLB misses and CPU instructions generally do not explain the behaviour of the running time.
My question is: what defines an action that triggers an L1 instruction cache access? From my measurements the accesses are on the order of 150,000, while for example the number of completed instructions (PAPI_TOT_INS) is only approximately 10,000. Should they not be roughly equal?
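For context, here is roughly how such counts can be collected; a minimal sketch using PAPI's low-level event-set API, assuming both presets can be mapped onto hardware counters at the same time (error handling trimmed):

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        int eventset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;

        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_L1_ICA);   /* L1 instruction cache accesses */
        PAPI_add_event(eventset, PAPI_TOT_INS);  /* completed instructions */

        PAPI_start(eventset);
        /* ... code under test ... */
        PAPI_stop(eventset, counts);

        printf("PAPI_L1_ICA : %lld\n", counts[0]);
        printf("PAPI_TOT_INS: %lld\n", counts[1]);
        return 0;
    }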
ARM allows reordering loads with subsequent stores, so that the following pseudocode:
// CPU 0 | // CPU 1
temp0 = x; | temp1 = y;
y = 1; | x = 1;
can result in temp0 == temp1 == 1 (and this is observable in practice as well). I'm having trouble understanding how this occurs; it seems like in-order commit would prevent it (which, as I understood it, is present in pretty much all OoO processors). My reasoning goes: "the load must have its value before it commits, it commits before the store, and the store's value can't become visible to other processors until it commits."
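For reference, the same "load buffering" shape expressed with C11 atomics; this is only an illustrative sketch (names follow the pseudocode above), and the relaxed orderings deliberately leave the hardware free to reorder:

    #include <stdatomic.h>

    atomic_int x, y;

    void cpu0(int *temp0) {
        *temp0 = atomic_load_explicit(&x, memory_order_relaxed); // temp0 = x
        atomic_store_explicit(&y, 1, memory_order_relaxed);      // y = 1
    }

    void cpu1(int *temp1) {
        *temp1 = atomic_load_explicit(&y, memory_order_relaxed); // temp1 = y
        atomic_store_explicit(&x, 1, memory_order_relaxed);      // x = 1
    }

Run those two functions on separate threads in a loop and, on ARM, the outcome temp0 == temp1 == 1 can occasionally be observed.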
I'm guessing that one of my assumptions must be wrong, and something like one of the following must hold:
Instructions don't need to commit all the way in-order. A later store could safely commit and become visible before an earlier load, so long as at the time the store commits the core can guarantee that the previous load (and all intermediate instructions) won't trigger an exception, and that the load's address is guaranteed to be distinct from the store's.
The load can commit before its value is known. I don't have a guess as to how this would be implemented.
Stores can become visible before they are committed. Maybe a memory buffer somewhere is allowed to forward stores to loads to a different thread, even if the load was enqueued earlier?
Something else entirely?
There are a lot of hypothetical microarchitectural features that would explain this behavior, but I'm most curious about the ones that are actually present in modern weakly ordered CPUs.
Your bullet points of assumptions all look correct to me, except that you could build a uarch where loads can retire from the OoO core after merely checking permissions (TLB) on a load to make sure it can definitely happen. There could be OoO exec CPUs that do that (update: apparently there are).
I think x86 CPUs require loads to actually have the data arrive before they can retire, but their strong memory model doesn't allow LoadStore reordering anyway. So ARM certainly could be different.
You're right that stores can't be made visible to any other cores before retirement. That way lies madness. Even on an SMT core (multiple logical threads on one physical core), it would link speculation on two logical threads together, requiring them both to roll back if either one detected mis-speculation. That would defeat the purpose of SMT of having one logical thread take advantage of stalls in others.
(Related: Making retired but not yet committed (to L1d) stores visible to other logical threads on the same core is how some real PowerPC implementations make it possible for threads to disagree on the global order of stores. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
CPUs with in-order execution can start a load (check the TLB and write a load-buffer entry) and only stall if an instruction tries to use the result before it's ready. Then later instructions, including stores, can run normally. This is basically required for non-terrible performance in an in-order pipeline; stalling on every cache miss (or even just L1d latency) would be unacceptable. Memory parallelism is a thing even on in-order CPUs; they can have multiple load buffers that track multiple outstanding cache misses. High(ish) performance in-order ARM cores like Cortex-A53 are still widely used in modern smartphones, and scheduling loads well ahead of when the result register is used is a well-known important optimization for looping over an array. (Unrolling or even software pipelining.)
So if the load misses in cache but the store hits (and commits to L1d before earlier cache-miss loads get their data), you can get LoadStore reordering. (Jeff Preshing's intro to memory reordering uses that example for LoadStore, but doesn't get into uarch details at all.)
A load can't fault after you've checked the TLB and / or whatever memory-region stuff for it. That part has to be complete before it retires, or before it reaches the end of an in-order pipeline. Just like a retired store sitting in the store buffer waiting to commit, a retired load sitting in a load buffer is definitely happening at some point.
So the sequence on an in-order pipeline is:
lw r0, [r1] TLB hit, but misses in L1d cache. Load execution unit writes the address (r1) into a load buffer. Any later instruction that tries to read r0 will stall, but we know for sure that the load didn't fault.
With r0 tied to waiting for that load buffer to be ready, the lw instruction itself can leave the pipeline (retire), and so can later instructions.
any number of other instructions that don't read r0 can execute without stalling. (An instruction that did read r0 would stall an in-order pipeline.)
sw r2, [r3] store execution unit writes address + data to the store buffer / queue. Then this instruction can retire.
Probing the load buffers finds that this store doesn't overlap with the pending load, so it can commit to L1d. (If it had overlapped, you couldn't commit it until a MESI RFO completed anyway, and fast restart would forward the incoming data to the load buffer. So it might not be too complicated to handle that case without even probing on every store, but let's only look at the separate-cache-line case where we can get LoadStore reordering)
Committing to L1d = becoming globally visible. This can happen while the earlier load is still waiting for the cache line to arrive.
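As an aside (not part of the walk-through above), a hedged C11 sketch of how this particular reordering is normally suppressed in portable code: giving the load acquire semantics prevents the later store from becoming globally visible before it.

    #include <stdatomic.h>

    atomic_int x, y;

    void cpu0_ordered(int *temp0) {
        /* acquire: later memory operations, including the store to y,
           cannot be reordered before this load */
        *temp0 = atomic_load_explicit(&x, memory_order_acquire);
        atomic_store_explicit(&y, 1, memory_order_relaxed);
    }

On ARMv7 this typically compiles to a plain load followed by a barrier (DMB); on ARMv8, to a load-acquire (LDAR). Either way the store is forced to be ordered after the load.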
For OoO CPUs, you'd need some way to tie load completion back into the OoO core for instructions waiting on the load result. I guess that's possible, but it means that the architectural/retirement value of a register might not be stored anywhere in the core. Pipeline flushes and other rollbacks from mis-speculation would have to hang on to that association between an incoming load and a physical and architectural register. (Not flushing store buffers on pipeline rollbacks is already a thing that CPUs have to do, though. Retired but not yet committed stores sitting in the store buffer have no way to be rolled back.)
That could be a good design idea for uarches with a small OoO window that's too small to come close to hiding a cache miss. (Which to be fair, is every high-performance OoO exec CPU: memory latency is usually too high to fully hide.)
We have experimental evidence of LoadStore reordering on an OoO ARM: section 7.1 of https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf shows non-zero counts for "load buffering" on Tegra 2, which is based on the out-of-order Cortex-A9 uarch. I didn't look up all the others, but I did rewrite the answer to suggest that this is the likely mechanism for out-of-order CPUs, too. I don't know for sure if that's the case, though.
I am trying to determine which hardware counters available on the ARM Cortex A15 processor are the correct ones to use for determining system-wide L2 cache misses.
My application here is a kernel-level voltage-frequency governor (i.e. it could substitute for the ondemand governor). Because I need access to performance counters at the system level and not attached to a particular program runtime, I am not using existing utilities such as PAPI or Linux's perf tool. From my past experiences with both I understand that these are better used to monitor the performance stats for a particular program or instrumented binary.
I have implemented a kernel module that periodically updates several hardware counter values to sysfs endpoints. The resources I have used include:
Performance Counter Sampling on the Exynos 5422
Arm Technical Reference Manual (Specifically Ch. 11 on the Performance Monitoring Unit)
Searches in perf and PAPI code & documentation to see if L2 misses is a derived counter rather than a native one.
The hardware counter I am currently using to measure L2 misses is event 0x17: "L2 data cache refill". Printing this value consistently gives 0, even when running data-heavy benchmarks. Is there a different event or set of events I should be using to determine L2 cache misses? Perhaps 0x13, "Data memory accesses", or some composite of events?
It is very possible that the root of my question is a misunderstanding of "L2 data cache refills", but I have not been able to find a clarification on this through documentation and stack overflow searches.
EDIT: I have found that L2 refills was reading 0 because the 5th hardware counter, for some reason, was not working as expected; re-assigning L2 refills to a different counter has resolved this particular issue.
EDIT 2: That 5th hardware counter was not functioning because I had not enabled that many counters. Silly me.
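For anyone hitting the same thing, a hedged sketch of the per-counter set-up on an ARMv7-A PMU (Cortex-A15); the function names are mine, only the cp15 encodings come from the TRM, and PMCR.E must also be set globally:

    #include <stdint.h>

    /* Globally enable the PMU (PMCR.E). */
    static inline void pmu_global_enable(void)
    {
        uint32_t pmcr;
        asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
        pmcr |= 1u;
        asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    }

    /* Point counter `idx` at event `evt` (e.g. 0x17, L2 data cache refill)
       and enable it -- the PMCNTENSET write is the step I had missed. */
    static inline void pmu_setup_counter(uint32_t idx, uint32_t evt)
    {
        asm volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(idx));       /* PMSELR */
        asm volatile("mcr p15, 0, %0, c9, c13, 1" :: "r"(evt));       /* PMXEVTYPER */
        asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << idx)); /* PMCNTENSET */
    }

    /* Read back the selected counter. */
    static inline uint32_t pmu_read_counter(uint32_t idx)
    {
        uint32_t val;
        asm volatile("mcr p15, 0, %0, c9, c12, 5" :: "r"(idx));       /* PMSELR */
        asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(val));       /* PMXEVCNTR */
        return val;
    }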
A quote from a different ARM manual implies refills are not just misses:
L1 data cache refill:
"This event counts all allocations into the L1 cache. This includes read linefills, store linefills, and prefetch linefills."
I'm trying to get an idea of how the instruction cache works.
How many extra cache lines get prefetched when a block of code is being executed? Does it take branch prediction into account?
If a block of code contains a function call, is the function code body loaded sequentially or in a different part of the cache?
For example, are the following code fragments the same?
if (condition) {
    // block of code that handles condition
}
and
if (condition) {
    handle_condition(); // function that handles the condition
}
Which one reduces "holes" in the instruction sequence if the condition is rarely true?
If my first example is part of a code that is frequently run and the if condition is never true, will the body of the if condition be eventually evicted leaving the rest of the code body as-is?
I'm assuming these questions do not have answers that depend on specific micro-architectures. But in case they do, I have an x86-64 Intel Sandy Bridge.
Actually, the answer very much depends on the micro-architecture. This is not something defined by x86 or any other architecture, but rather left for the designers to implement and improve over the various generations.
For Sandy Bridge, you can find an interesting description here
The most relevant part is -
The instruction fetch for Sandy Bridge is shown above in Figure 2. Branch predictions are queued slightly ahead of instruction fetch so that the stalls for a taken branch are usually hidden, a feature earlier used in Merom and Nehalem. Predictions occur for 32B of instructions, while instructions are fetched 16B at a time from the L1 instruction cache.
Once the next address is known, Sandy Bridge will probe both the uop cache (which we will discuss in the next page) and the L1 instruction cache. The L1 instruction cache is 32KB with 64B lines, and the associativity has increased to 8-way, meaning that it is virtually indexed and physically tagged. The L1 ITLB is partitioned between threads for small pages, with dedicated large pages per thread. Sandy Bridge added 2 entries for large pages, bringing the total to 128 entries for 4KB pages (for both threads) and 16 fully associative entries for large pages (for each thread).
In other words, and as also shown in the diagram, branch prediction is the first step of the pipe and precedes the instruction cache access. Therefore, the cache would hold the "trace" of addresses as predicted by the branch predictor. If a certain code snippet is hardly ever accessed, the predictor will avoid it and it will age out of the I-cache over time. Since the branch predictor should be able to handle function calls, there shouldn't be a fundamental difference between your code snippets.
This of course breaks down due to alignment issues (the I-cache has 64B lines, so you can't have partial data there, meaning inlined code may actually cause more useless overhead than a function call, although both are bounded) and due to branch mispredictions.
It's also possible that other HW-prefetchers are working and may fetch the data to other levels, but it's not something that was officially disclosed (The guides only mention some L2 cache prefetching that may help reduce the latency without thrashing your L1 cache). Also note that Sandy bridge has a uop cache that may add further caching (but also more complexity).
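As a practical aside (compiler-level, not micro-architectural, and not something the guides above discuss): if you know the condition is rarely true, you can hint the compiler so the cold path is laid out away from the hot instruction bytes, which reduces the "holes" the question asks about. A hedged GCC/Clang-specific sketch:

    /* __builtin_expect and __attribute__((cold)) are GCC/Clang extensions;
       they bias block layout so the rare path does not occupy the hot
       fetch lines. */
    static void __attribute__((cold, noinline)) handle_condition(void)
    {
        /* rarely executed work */
    }

    void hot_path(int condition)
    {
        if (__builtin_expect(condition, 0))  /* hint: usually false */
            handle_condition();
        /* hot code continues contiguously here */
    }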
I am running a bare metal application on one of the cores of an ARM Cortex-A9 processor. My ISR is quite small and I am wondering whether it would be possible to lock my ISR instructions in the L1 cache. Is it possible? Is there anyone who could explain some drawbacks of doing it?
The Cortex-A9 does not support L1 cache lockdown (neither instructions nor data).
The drawback is that taking large chunks of the cache away (lockdown is usually done on a granularity of entire cache ways) decreases performance for everything else in the system.
Not to mention the fact that if your ISR is indeed small, and it is called frequently, it is somewhat likely to be in the cache anyway.
What is the benefit you were expecting to gain from doing this?
Your situation is a perfect fit for a fast interrupt (FIQ).
You only have to assign the last interrupt number for that particular ISR.
While other interrupt numbers are just vectors, the last number branches directly to the code area, thus saving one memory load plus interlock. You save about three cycles or so.
Besides, i-cache lockdown isn't as efficient as d-cache lockdown.
CA9 doesn't support L1 cache lockdown anyway (for some good reasons), so don't bother.
Just make sure the ISR is cache-line aligned for maximum efficiency (typically 32 or 64 bytes).
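A hedged sketch of what that looks like with GCC on ARM (the attribute names are GCC extensions; 32 bytes matches the Cortex-A9 L1 line size):

    /* Declare a FIQ handler and request cache-line alignment. */
    void __attribute__((interrupt("FIQ"), aligned(32))) my_fiq_handler(void)
    {
        /* small, hot ISR body */
    }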
I know that modern CPUs can execute out of order; however, they always retire the results in order, as described by Wikipedia:
"Out-of-order processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."
Now memory fences are said to be required when using multicore platforms, because owing to Out of Order execution, wrong value of x can be printed here.
Processor #1:
    while (f == 0)
        ;
    print x; // x might not be 42 here

Processor #2:
    x = 42;
    // Memory fence required here
    f = 1;
Now my question is: since out-of-order processors (cores, in the case of multicore processors, I assume) always retire the results in order, what is the necessity of memory fences? Do the cores of a multicore processor see only results retired from other cores, or do they also see results which are in flight?
I mean, in the example I gave above, when Processor #2 eventually retires the results, the result of x should come before f, right? I know that during out-of-order execution it might have modified f before x, but it must not have retired it before x, right?
Now with In-Order retiring of results and cache coherence mechanism in place, why would you ever need memory fences in x86?
This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".
That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence
    store x
    load y

then on the processor bus this may be seen as

    load y
    store x
The reason for this behavior is the afore-mentioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".
See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf
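To make that concrete, a hedged sketch of the classic store-buffer litmus test using C11 atomics (not from the manual; the variable and function names are illustrative). Without the fences, x86 may effectively execute "load y; store x" in each thread, so both threads can read 0; a sequentially consistent fence (an MFENCE on x86) rules that outcome out:

    #include <stdatomic.h>

    atomic_int x, y;
    int r0, r1;

    void thread0(void) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst); /* orders the later load after the buffered store */
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void thread1(void) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
    }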
The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up to date view of the data.
If you don't put a memory fence, the cores might be working with wrong data; this can be seen especially in scenarios where multiple cores are working on the same data sets. In this case you can ensure that when CPU 0 has done some action, all changes made to the dataset are visible to all other cores, which can then work with up-to-date information.
Some architectures, including the ubiquitous x86/x64, provide several
memory barrier instructions including an instruction sometimes called
"full fence". A full fence ensures that all load and store operations
prior to the fence will have been committed prior to any loads and
stores issued following the fence.
If a core were to start working with outdated data on the dataset, how could it ever get the correct results? It couldn't, even if the end result were presented as if everything had been done in the right order.
The key is in the store buffer, which sits between the cache and the CPU, and does this:
Store buffer invisible to remote CPUs
Store buffer allows writes to memory and/or caches to be saved to
optimize interconnect accesses
That means that things will be written to this buffer, and then at some point the buffer will be written to the cache. So the cache could contain a view of data that is not the most recent, and therefore another CPU, through cache coherency, will also not have the latest data. A store buffer flush is necessary for the latest data to be visible; this, I think, is essentially what the memory fence causes to happen at the hardware level.
EDIT:
For the code you used as an example, Wikipedia says this:
A memory barrier can be inserted before processor #2's assignment to f
to ensure that the new value of x is visible to other processors at or
prior to the change in the value of f.
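A hedged sketch of that barrier in portable C11 atomics (release/acquire shown here; the names follow the example in the question):

    #include <stdio.h>
    #include <stdatomic.h>

    int x;
    atomic_int f;

    void processor2(void) {
        x = 42;
        /* release store: all earlier writes (x = 42) are visible to any
           thread that later observes f == 1 with an acquire load */
        atomic_store_explicit(&f, 1, memory_order_release);
    }

    void processor1(void) {
        while (atomic_load_explicit(&f, memory_order_acquire) == 0)
            ;
        printf("%d\n", x); /* guaranteed to print 42 */
    }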
Just to make explicit what is implicit in the previous answers, this is correct, but is distinct from memory accesses:
CPUs can execute out of order, However they always retire the results in-order
Retirement of the instruction is separate from performing the memory access; the memory access may complete at a different time from instruction retirement.
Each core will act as if its own memory accesses occur at retirement, but other cores may see those accesses at different times.
(On x86 and ARM, I think only stores are observably subject to this, but e.g. Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarantees than normal x86 behaviour.)
PS. From memory, the abandoned Sparc ROCK could in fact retire out of order; it spent power and transistors determining when this was harmless. It was abandoned because of power consumption and transistor count... I don't believe any general-purpose CPU has been brought to market with out-of-order retirement.