Typical L1 and L2 access latency for SoCs built around the ARM Cortex-A9

I am looking for the L1 and L2 access latencies of SoCs built around the ARM Cortex-A9, such as the Nvidia Tegra 2 and Tegra 3, which contain multiple A9 cores.
I could find some information about the L1 and L2 sizes of those SoCs, but not much about their access latencies. The only reliable information I found is that "L2 cache latency is 2 cycles faster on Tegra 3 than 2, while L1 cache latencies haven't changed."
One source mentions that the L2 on Tegra 2 has a latency of 25 cycles, and another that the L1 has a latency of 4 cycles and the L2 a latency of 31 to 55 cycles, but none of these references are fully reliable. I was hoping to find more information on the Nvidia, TI, and Qualcomm websites and in their technical documents, but had no success.
EDIT: information on similar SoCs like OMAP4460 and OMAP4470 would be great too.

For an authoritative answer, you can try running lmbench (see the lmbench HowTo) on the target of your choice.
A set of results for the AM37x (a variant of the TI OMAP3 family) is available here for reference.
Also check out this presentation, which describes the latency and bandwidth of various cache configurations on an ARM Cortex-A9 MP system.
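If you only need a ballpark figure and can run code on the target, a pointer-chasing microbenchmark (the same technique lmbench's lat_mem_rd uses) will estimate load-to-use latency directly. Below is a minimal sketch, not a calibrated tool; the suggested buffer sizes (e.g. 16 KB for L1, 256 KB to 1 MB for L2) are assumptions you should adjust to the SoC's actual cache sizes:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    /* Buffer size in bytes; pass e.g. 16384 to stay in L1, 262144 for L2. */
    size_t bytes = (argc > 1) ? strtoul(argv[1], NULL, 0) : 16 * 1024;
    size_t n = bytes / sizeof(void *);

    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    /* Link the buffer into one random cycle so the hardware prefetcher
       cannot guess the next address (Fisher-Yates shuffle). */
    for (size_t i = 0; i < n; i++) idx[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    /* Chase the chain: each load depends on the previous one, so the
       average time per iteration approximates load-to-use latency. */
    const long iters = 20 * 1000 * 1000;
    void **p = (void **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler cannot optimize the loop away. */
    printf("%zu bytes: %.2f ns/load (p=%p)\n", bytes, ns / iters, (void *)p);
    return 0;
}
```

Divide the ns/load figure by the core clock period (1 ns at 1 GHz, for example) to convert to cycles; you should see a clear step in the result as the buffer size crosses each cache capacity.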

Related

Which clock source is used for clocking instructions in cortex-M4

In my Cortex-M4 system, I am using an 8 MHz oscillator as the HSE, which gets multiplied to 72 MHz by the PLL, which in turn drives SYSCLK. This got me thinking: which clock is the one used to execute instructions? In other words, if our CPI is 1 (an ideal value, of course), does that mean we would execute 8 million instructions per second or 72 million instructions per second?
I also found the DWT, which can be used to measure clock cycles, and hence CPI. So I am guessing whichever clock is used to execute instructions would be the same one used by the DWT?
Instruction execution is driven from HCLK (not SYSCLK, which clocks the system timer and does not have to be equal to HCLK). The source of HCLK is settable by the programmer.
if our CPI is 1 (an ideal value, of course), does that mean we would execute 8 million instructions per second or 72 million instructions per second?
You can see how many cycles every instruction takes: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
The real speed depends on many factors, but mainly on where your code and data reside and on the microcontroller's advanced features.
If you execute your code from internal TCM SRAM and place data in SRAM (or, even better on some microcontrollers, in ITCM and DTCM SRAM), you can achieve the theoretical execution efficiency, as those memories run at the core clock frequency with no wait states or bus wait states. This is ideal when the microcontroller has TCM memory and instructions and data are fetched over separate buses.
If your code resides in flash memory, that memory may introduce wait states. STM32 microcontrollers (with the ART accelerator) read the flash in larger chunks and fetch instructions ahead of execution, which allows them to perform at almost maximum speed. The problem is branch instructions, which require the pipeline to be flushed and instructions to be fetched again.
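On the DWT question: the DWT cycle counter (CYCCNT) is clocked at the core clock, so it counts the same ticks that drive instruction execution. Here is a minimal sketch for enabling and reading it on a Cortex-M4, using the architecturally defined register addresses (CMSIS exposes these same registers as DWT->CYCCNT and friends):

```c
#include <stdint.h>

/* Cortex-M debug registers (architecturally defined addresses). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu) /* Debug Exception & Monitor Ctrl */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static void cyccnt_init(void) {
    DEMCR      |= (1u << 24);  /* TRCENA: enable the DWT unit            */
    DWT_CYCCNT  = 0;           /* reset the counter                      */
    DWT_CTRL   |= 1u;          /* CYCCNTENA: start counting core clocks  */
}

/* Example: count the cycles a code fragment takes. */
uint32_t measure(void) {
    cyccnt_init();
    uint32_t start = DWT_CYCCNT;
    /* ... code under test ... */
    return DWT_CYCCNT - start;
}
```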

L2 cache lines miss count

I want to count the total number of L2 cache misses while running one particular program, A. Is there any way to measure misses in the L2 cache?
I learned that the Core i7 performance counter event "L2_LINES_OUT" counts L2 cache lines evicted, but I don't know how to use it.
I am using Linux on an Intel i7 Ivy Bridge machine.
Any pointer or link would be highly appreciated.
According to this summary, you can use the l2_rqsts subevents:
0x01: (name=demand_data_rd_hit) Demand Data Read requests that hit L2 cache
0x03: (name=all_demand_data_rd) Demand Data Read requests
0x04: (name=rfo_hit) RFO requests that hit L2 cache
0x08: (name=rfo_miss) RFO requests that miss L2 cache
0x0c: (name=all_rfo) RFO requests to L2 cache
0x10: (name=code_rd_hit) L2 cache hits when fetching instructions, code reads.
0x20: (name=code_rd_miss) L2 cache misses when fetching instructions
0x30: (name=all_code_rd) L2 code requests
0x40: (name=pf_hit) Requests from the L2 hardware prefetchers that hit L2 cache
0x80: (name=pf_miss) Requests from the L2 hardware prefetchers that miss L2 cache
0xc0: (name=all_pf) Requests from L2 hardware prefetchers
You can compute (all_demand_data_rd - demand_data_rd_hit) to count the demand data read misses.
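Here is a minimal sketch of counting these subevents from C with the Linux perf_event_open syscall. The raw config values assume the Sandy Bridge/Ivy Bridge encoding, where L2_RQSTS is event 0x24 and the umask occupies bits 8-15; verify them against the SDM tables for your exact model:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Raw encoding is (umask << 8) | event_code.  L2_RQSTS is assumed to be
 * event 0x24 here; the umasks match the list above. */
#define ALL_DEMAND_DATA_RD 0x0324   /* umask 0x03 */
#define DEMAND_DATA_RD_HIT 0x0124   /* umask 0x01 */

static int open_raw(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = 0: this process, cpu = -1: any CPU */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd_all = open_raw(ALL_DEMAND_DATA_RD);
    int fd_hit = open_raw(DEMAND_DATA_RD_HIT);
    if (fd_all < 0 || fd_hit < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_all, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_hit, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_all, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_hit, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the code you want to profile here ... */

    uint64_t all = 0, hit = 0;
    read(fd_all, &all, sizeof(all));
    read(fd_hit, &hit, sizeof(hit));
    printf("L2 demand data read misses: %llu\n",
           (unsigned long long)(all - hit));
    return 0;
}
```

To measure a separate program A in its entirety, it is usually easier to wrap it with perf stat, e.g. perf stat -e r0324,r0124 ./A, which programs the same raw events.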
The IBM documentation here might be helpful.
It describes various expressions for many different metrics on Ivy Bridge. The one you want appears to be
*Instruction fetch from L2 cache miss rate*
100.0 * X_L2_RQSTS_IFETCH_MISS / X_L2_RQSTS_IFETCHES
Look into PAPI, a tool you can use to read the PMU before and after the code segment for which you want to collect L2 cache misses.
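For reference, here is a minimal PAPI sketch using the preset event PAPI_L2_TCM (total L2 cache misses); whether that preset is available depends on how your PAPI build maps events on Ivy Bridge:

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long misses;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L2_TCM);   /* preset: total L2 cache misses */

    PAPI_start(evset);
    /* ... code segment to measure ... */
    PAPI_stop(evset, &misses);

    printf("L2 misses: %lld\n", misses);
    return 0;
}
```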

Count the number of instruction cycles of each instruction in the assembly generated for ARM

How can I count the number of instruction cycles for each instruction in the assembly generated from a C/C++ program for ARM? Is there any tool for it? I don't want to look into the ARM instruction set each time.
Modern ARM cores don't specify cycles per instruction.
This is from the Cortex-A9 TRM:
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However, you can also look into the cycle counter on the Cortex-A8.

On a system with an ARM A9 processor, L2 cache, and SRAM, is it possible to have a C program get the following performance data?

On a system with an ARM A9 processor, L2 cache, and SRAM, is it possible to have a C program get the following performance data:
Avg. SRAM data fetch delay.
Avg. instruction fetch delay.
If you have hardware targets to run and measure on, you could create test code that takes cycle counts between different points of its execution using the Cortex-A9 PMU (see the A9 TRM, chapter 11). Your test code would need to initialize and read the PMU registers. The PMU will then measure cycle counts and give other interesting data, e.g. the number of cache misses. That much is doable in software.
However, the resulting performance data may not be as low-level as you want.
Consider a loop over a block of NOP instructions, with the loop counter kept in a register. The L1 instruction cache would fill on the first iteration. The PMU can give you a measurement of instruction cycles and total time, and that measurement would relate to the L1 instruction fetch delay (unless you use a really big block, in which case you might shed light on L2).
Similarly, you could construct test code whose execution time also includes the effect of data fetch delay.
There is ARM example code that shows how the PMU can be used.
You may find the processor internals to be complicated. If L2 is your primary interest, the L2 cache controller (e.g. the L2C-310) may have its own event counters, although I haven't used those.
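As a concrete illustration of the approach (not ARM's official example code), here is a sketch that reads the PMU cycle counter (PMCCNTR) from C via CP15. It assumes privileged code (the kernel or a small driver) has already enabled user-space access (PMUSERENR) and started the counter (PMCR.E plus PMCNTENSET bit 31):

```c
#include <stdint.h>
#include <stdio.h>

/* Read the ARMv7 PMU cycle counter (PMCCNTR) via CP15.  This traps
   unless privileged code has enabled user access via PMUSERENR. */
static inline uint32_t pmccntr_read(void) {
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v));
    return v;
}

int main(void) {
    uint32_t start = pmccntr_read();

    /* Code under test: here just a simple busy loop.  On the first
       pass the loop body fills the L1 instruction cache; subsequent
       passes execute from L1. */
    for (volatile int i = 0; i < 1000000; i++)
        ;

    uint32_t cycles = pmccntr_read() - start;
    printf("elapsed cycles: %u\n", cycles);
    return 0;
}
```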

Hardware prefetching in corei3

Does the Core i3 support hardware prefetching via a hardware prefetcher? If yes, how do I enable/disable it?
Intel Core i3 processors definitely support hardware prefetching, though Intel's documentation tends to be very weak on details. The brand name "Core i3" refers to both "Nehalem" based and "Sandy Bridge" based processors, so you have to check the specific model number to know which one you are dealing with.
To make things more complicated, newer Intel processors (Nehalem/Westmere/Sandy Bridge) have several different hardware prefetchers -- at least three are mentioned in the Intel Architecture Software Developer's Manual, Volume 3B (publication 253669). Table 30-25 "MSR_OFFCORE_RSP_x Request Type Field Definition" mentions "DCU prefetch" and "L2 prefetchers". These are also mentioned in Appendix A-2, Table A-2, which describes the performance counter events for Core i7, i5, and i3 processors. Event 4EH in Table A-2 mentions that there are both "L1 streamer and IP-Based (IPP) HW prefetchers". There are a few more words on this topic in the corresponding entry (for event 4EH) in Appendix A.4, Table A-6, which describes the performance counters for Westmere processors.
Appendix B-2, Table B-3 in the same document discusses the MSRs (Model Specific Registers) for the Intel Core Microarchitecture, but it looks like many of these carry over into newer versions. Register 1A0h shows that 4 bits control prefetching behavior:
Bit 9: Hardware Prefetcher Disable
Bit 19: Adjacent Cache Line Prefetch Disable
Bit 37: DCU Prefetcher Disable
Bit 39: IP Prefetcher Disable
Tools to enable and disable prefetchers are discussed in:
How do I programmatically disable hardware prefetching?
Yes, hardware prefetchers do exist in Core i3/i7 machines, but you cannot disable them on i3/i7. There are two ways to disable prefetching: (1) by changing an MSR bit, and (2) through the BIOS. Intel stopped supporting both ways on i3/i7.
Link from comment: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors Disclosure of H/W prefetcher control on some Intel processors - Vish Viswanathan (Intel), September 24, 2014
This article discloses the MSR setting that can be used to control the various h/w prefetchers that are available on Intel processors based on the following microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
The above-mentioned processors support 4 types of h/w prefetchers for prefetching data. There are 2 prefetchers associated with the L1 data cache (also known as the DCU): the DCU prefetcher and the DCU IP prefetcher. There are 2 prefetchers associated with the L2 cache: the L2 hardware prefetcher and the L2 adjacent cache line prefetcher.
There is a Model Specific Register (MSR) on every core, at address 0x1A4, that can be used to control these 4 prefetchers. Bits 0-3 in this register can be used to enable or disable them. The other bits of this MSR are reserved.
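On Linux, one way to manipulate these bits from user space is the msr driver (modprobe msr, root required). Here is a minimal sketch that reads MSR 0x1A4 on CPU 0 and sets bits 0-3 to disable all four prefetchers, per the disclosure above; note the MSR is per-core, so a real tool would repeat this for every /dev/cpu/N/msr:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* MSR 0x1A4 controls the four prefetchers on Nehalem through Broadwell,
   per the Intel disclosure above.  Requires the msr kernel module
   (modprobe msr) and root privileges. */
#define MSR_PREFETCH_CTRL 0x1A4

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);  /* CPU 0 only in this sketch */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    pread(fd, &val, sizeof(val), MSR_PREFETCH_CTRL);
    printf("MSR 0x1A4 before: %#llx\n", (unsigned long long)val);

    val |= 0xF;   /* set bits 0-3: disable all four prefetchers */
    pwrite(fd, &val, sizeof(val), MSR_PREFETCH_CTRL);

    close(fd);
    return 0;
}
```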
