Hardware prefetching in Core i3

Does the Core i3 support hardware prefetching via a hardware prefetcher? If yes, how do I enable/disable it?

Intel Core i3 processors definitely support hardware prefetching, though Intel's documentation tends to be very weak on details. The brand name "Core i3" refers to both "Nehalem" based and "Sandy Bridge" based processors, so you have to check the specific model number to know which one you are dealing with.
To make things more complicated, newer Intel processors (Nehalem/Westmere/Sandy Bridge) have several different hardware prefetchers -- at least three are mentioned in the Intel Architecture Software Developer's Manual, Volume 3B (publication 253669). Table 30-25 "MSR_OFFCORE_RSP_x Request Type Field Definition" mentions "DCU prefetch" and "L2 prefetchers". These are also mentioned in Appendix A-2, Table A-2, which describes the performance counter events for Core i7, i5, and i3 processors. Event 4EH in Table A-2 mentions that there are both "L1 streamer and IP-Based (IPP) HW prefetchers". There are a few more words on this topic in the corresponding entry (for event 4EH) in Appendix A.4, Table A-6, which describes the performance counters for Westmere processors.
Appendix B-2, Table B-3 in the same document discusses the MSRs (Model Specific Registers) for the Intel Core Microarchitecture, but it looks like many of these carry over into newer versions. Register 1A0h shows that 4 bits control prefetching behavior:
Bit 9: Hardware Prefetcher Disable
Bit 19: Adjacent Cache Line Prefetch Disable
Bit 37: DCU Prefetcher Disable
Bit 39: IP Prefetcher Disable
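If you want those bit positions as code, here is a minimal sketch of the corresponding masks (MSR 0x1A0 is IA32_MISC_ENABLE; the helper name is just for illustration, and setting a bit disables the corresponding prefetcher):

/* Sketch only: bit positions for the prefetch-control bits in MSR 0x1A0
 * (IA32_MISC_ENABLE) as listed in Table B-3 for the Core microarchitecture. */
#include <stdint.h>

#define MSR_IA32_MISC_ENABLE        0x1A0
#define HW_PREFETCHER_DISABLE       (1ULL << 9)
#define ADJ_CACHE_LINE_PREF_DISABLE (1ULL << 19)
#define DCU_PREFETCHER_DISABLE      (1ULL << 37)
#define IP_PREFETCHER_DISABLE       (1ULL << 39)

/* Given a value read from the MSR, return the value to write back
 * in order to disable all four prefetchers. */
static inline uint64_t disable_all_prefetchers(uint64_t misc_enable)
{
    return misc_enable | HW_PREFETCHER_DISABLE
                       | ADJ_CACHE_LINE_PREF_DISABLE
                       | DCU_PREFETCHER_DISABLE
                       | IP_PREFETCHER_DISABLE;
}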
Tools to enable and disable prefetchers are discussed in:
How do I programmatically disable hardware prefetching?

Yes, hardware prefetchers do exist on Core i3/i7 machines, but you CAN'T disable them on i3/i7. There are two ways to disable prefetching: (1) by changing an MSR bit, and (2) through the BIOS. Intel stopped supporting both ways on i3/i7.
Link from comment: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors Disclosure of H/W prefetcher control on some Intel processors - Vish Viswanathan (Intel), September 24, 2014
This article discloses the MSR setting that can be used to control the various h/w prefetchers that are available on Intel processors based on the following microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
The above mentioned processors support 4 types of h/w prefetchers for prefetching data. There are 2 prefetchers associated with the L1 data cache (also known as the DCU): the DCU prefetcher and the DCU IP prefetcher; and 2 prefetchers associated with the L2 cache: the L2 hardware prefetcher and the L2 adjacent cache line prefetcher.
There is a Model Specific Register (MSR) on every core, at address 0x1A4, that can be used to control these 4 prefetchers. Bits 0-3 in this register can be used to enable or disable these prefetchers. The other bits of this MSR are reserved.
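As a rough illustration of how that MSR can be driven from Linux, here is a minimal sketch that sets the four low bits of MSR 0x1A4 on one core through the msr kernel module (/dev/cpu/N/msr). It assumes the msr module is loaded and that you run as root; treat it as a sketch rather than a supported tool, since Intel's msr-tools (rdmsr/wrmsr) do the same job:

/* Sketch: read/modify MSR 0x1A4 on core 0 via the Linux msr driver.
 * Bits 0-3 disable (when set) the L2 HW prefetcher, L2 adjacent line
 * prefetcher, DCU prefetcher and DCU IP prefetcher respectively.
 * Requires root and the msr kernel module (modprobe msr). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PREFETCH_MSR 0x1A4

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    if (pread(fd, &val, sizeof(val), PREFETCH_MSR) != sizeof(val)) {
        perror("pread"); return 1;
    }
    printf("MSR 0x1A4 before: 0x%llx\n", (unsigned long long)val);

    val |= 0xF;               /* set bits 0-3: disable all four prefetchers  */
    /* val &= ~0xFULL;           clear bits 0-3 to re-enable them instead    */

    if (pwrite(fd, &val, sizeof(val), PREFETCH_MSR) != sizeof(val)) {
        perror("pwrite"); return 1;
    }
    close(fd);
    return 0;
}

Note that the register is per core, so the same write has to be repeated for every /dev/cpu/N/msr node (msr-tools' wrmsr with -a does that across all cores).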

Related

Number of Performance Monitoring Units in ARM Cortex-A53

How many performance monitoring units (PMU) are in ARM Cortex-A53? Is there one PMU for each core or one PMU for the whole processor?
There is one per core. You can kind of infer this from seeing one entry for each core in the ROM table.
In the Cortex-A53 TRM, Figure 2-1 alludes to debug being located per core, and section 2.1.9 lists:
• ARM v8 debug features in each core.
I don't see anything explicit that there is one PMU instance per core (architectural or not), but it's possible that I missed this since there are a few places where it might be specified.
Section 11.10.1 describes the debug memory map, consisting of a ROM table (index of other components in this group), then CPU debug, CTI, PMU and Trace for each core. To check what is present, you need to read DBGDRAR to find the base of the ROM table, and check bit[0] of the entries listed in Table 11-28. In a 4-core A53, you should find that all 16 devices are present.
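If you want to automate that check, here is a rough sketch of the present-bit walk. It assumes you are running at a privilege level that can read DBGDRAR (MDRAR_EL1 in the AArch64 view) and that you have already mapped the ROM table it points to; the 16-device expectation is the 4-core layout from Table 11-28.

/* Sketch: count the debug components marked present in an A53 ROM table.
 * rom_table must point at a mapping of the ROM table whose physical base
 * comes from DBGDRAR/MDRAR_EL1; bit[0] of each 32-bit entry is the
 * "present" bit and a zero entry terminates the table. On a 4-core A53
 * (CPU debug + CTI + PMU + Trace per core) you would expect 16. */
#include <stdint.h>

static int count_present_components(const volatile uint32_t *rom_table,
                                     int max_entries)
{
    int present = 0;
    for (int i = 0; i < max_entries; i++) {
        uint32_t entry = rom_table[i];
        if (entry == 0)            /* 0x00000000 marks the end of the table */
            break;
        if (entry & 0x1)           /* bit[0]: component present             */
            present++;
    }
    return present;
}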
One further point: although the PMU is per core and generally counts the events for that core, there can be some processor-level events which are visible equally to all PMUs.

Why is Generic Interrupt Controller in ARM Cortex A15 Multi-Core processor frequency less than main core frequency?

I was studying Generic Interrupt Controllers (GIC) in ARM, and when I read about the clock frequency for the GIC, it stated that a GIC has a clock frequency that is lower than the main core clock frequency by an integer factor. Can somebody explain why that is?
High-speed circuits are a LOT harder to design, manufacture and control. So, if speed is not that important (as is the case with your GIC), the peripherals of the core will usually run a lot slower than the core itself. Even the L2 cache usually does not run at full core speed.
Also, the gain from a faster-clocked GIC is probably negligible, so there is no reason for the designers to do a new generation, which in this business is always an expensive and risky adventure.

Typical L1 and L2 access latency for SoCs made of ARM Cortex-A9

I am looking for L1 access latency and L2 access latency for SoCs made from ARM Cortex-A9 processors such as Nvidia Tegra 2 and Tegra 3 which have multiple ARM A9 processors.
I could find some information about the L1 and L2 sizes of those architectures, but I could not find much information about the L1 and L2 access latency. The only reliable information I found is that "L2 cache latency is 2 cycles faster on Tegra 3 than 2, while L1 cache latencies haven't changed."
Here it is mentioned that the L2 on Tegra 2 has a latency of 25 cycles, and here that the L1 has a latency of 4 cycles and the L2 a latency of 31 to 55 cycles. Neither of these references is fully reliable. I was hoping to find more information on the Nvidia, TI, and Qualcomm websites and technical documents, but had no success.
EDIT: information on similar SoCs like OMAP4460 and OMAP4470 would be great too.
For an authoritative answer, you can try running lmbench (HowTo?) on the target of your choice.
A set of results for AM37x (variant of TI OMAP3 family) is available here for reference.
Also check out this presentation, which describes the latency and bandwidth of various cache configurations on an ARM Cortex-A9 MP system.
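If lmbench is too heavyweight, a crude pointer-chasing loop gives a first-order estimate of load-to-use latency: chase a random cyclic permutation and divide the elapsed time by the number of loads. The sketch below is only that, a sketch (no CPU pinning, no warm-up discipline), and the 256 KB working-set size is an assumption you would tune: smaller than L1 (32 KB on the A9) to measure L1, between L1 and L2 to measure L2.

/* Sketch: estimate average load-to-use latency by chasing a random
 * cyclic permutation of pointers. Compile with optimisation, e.g.
 * gcc -O2 (older glibc may also need -lrt for clock_gettime). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define WORKING_SET (256 * 1024)           /* assumption: > 32 KB L1, < 1 MB L2 */
#define N           (WORKING_SET / sizeof(void *))
#define ITERS       (50 * 1000 * 1000)

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t i, j;

    /* Build a random single-cycle permutation (Sattolo's algorithm). */
    for (i = 0; i < N; i++) buf[i] = &buf[i];
    for (i = N - 1; i > 0; i--) {
        j = rand() % i;
        void *tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }

    void **p = (void **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long k = 0; k < ITERS; k++)
        p = (void **)*p;                    /* serialised dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler cannot discard the chase. */
    printf("%p  avg latency: %.1f ns per load\n", (void *)p, ns / ITERS);
    free(buf);
    return 0;
}

Divide the result by the core clock period to convert nanoseconds into cycles for comparison with the figures quoted above.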

CPU TSC fetch operation especially in multicore-multi-processor environment

In the Linux world, to get nanosecond-precision timers/clock ticks one can use:
#include <time.h>   /* clock_gettime() is declared in <time.h>, not <sys/time.h> */

int foo()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);   /* seconds in ts.tv_sec, nanoseconds in ts.tv_nsec */
    //--snip--
    return 0;
}
This answer suggests an asm approach to directly query for the cpu clock with the RDTSC instruction.
In a multi-core, multi-processor architecture, how is this clock-tick/timer value synchronized across multiple cores/processors? My understanding is that there is inherent fencing being done. Is this understanding correct?
Can you suggest some documentation that would explain this in detail? I am interested in Intel Nehalem and Sandy Bridge microarchitectures.
EDIT
Limiting the process to a single core or CPU is not an option, as the process is really huge (in terms of resources consumed) and I would like to optimally utilize all the resources in the machine, including all the cores and processors.
Edit
Thanks for the confirmation that the TSC is synced across cores and processors. But my original question is: how is this synchronization done? Is it with some kind of fencing? Do you know of any public documentation?
Conclusion
Thanks for all the inputs. Here's the conclusion of this discussion: the TSCs are synchronized at initialization by a RESET that reaches all cores and processors in a multi-processor/multi-core system, and after that every core is on its own. The TSCs are kept invariant with a Phase Locked Loop that normalizes the frequency variations, and thus the clock variations, within a given core, and that is how the TSCs remain in sync across cores and processors.
Straight from Intel, here's an explanation of how recent processors maintain a TSC that ticks at a constant rate, is synchronous between cores and packages on a multi-socket motherboard, and may even continue ticking when the processor goes into a deep sleep C-state; in particular, see the explanation by Vipin Kumar E K (Intel):
http://software.intel.com/en-us/articles/best-timing-function-for-measuring-ipp-api-timing/
Here's another reference from Intel discussing the synchronization of the TSC across cores. In this case they mention the fact that rdtscp allows you to read both the TSC and the processor id atomically, which is important in tracing applications: suppose you want to trace the execution of a thread that might migrate from one core to another; if you do that in two separate (non-atomic) instructions, then you have no certainty of which core the thread was on at the time it read the clock.
http://software.intel.com/en-us/articles/intel-gpa-tip-cannot-sychronize-cpu-timestamps/
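To make that concrete, here is a minimal sketch of the atomic TSC + processor-id read using GCC's __rdtscp intrinsic. Note that the processor id comes from IA32_TSC_AUX, which the instruction merely returns; the fact that Linux initialises it so the low bits encode the CPU number is an OS convention, not part of the instruction itself.

/* Sketch: read the TSC and the contents of IA32_TSC_AUX in one
 * instruction with rdtscp, so both values are guaranteed to come
 * from the same core even if the thread migrates right afterwards. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    unsigned int aux;
    uint64_t tsc = __rdtscp(&aux);   /* aux receives IA32_TSC_AUX */
    printf("tsc=%llu aux=0x%x (on Linux the low bits typically encode the CPU number)\n",
           (unsigned long long)tsc, aux);
    return 0;
}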
All sockets/packages on a motherboard receive two external common signals:
RESET
Reference CLOCK
All sockets see RESET at the same time when you power the motherboard, and all processor packages receive a reference clock signal from an external crystal oscillator; the internal clocks in the processor are kept in phase with it (although usually at a high multiplier, like 25x) by circuitry called a Phase Locked Loop (PLL). Recent processors will clock the TSC at the highest frequency (multiplier) that the processor is rated for (so-called constant TSC), regardless of the multiplier that any individual core may be using due to temperature or power-management throttling (so-called invariant TSC). Nehalem processors like the X5570 released in 2008 (and newer Intel processors) support a "Non-stop TSC" that will continue ticking even when conserving power in a deep power-down C-state (C6). See this link for more information on the different power-down states:
http://www.anandtech.com/show/2199
Upon further research I came across a patent that Intel filed on 12/22/2009, published on 6/23/2011, entitled "Controlling Time Stamp Counter (TSC) Offsets For Multiple Cores And Threads":
http://www.freepatentsonline.com/y2011/0154090.html
Google's page for this patent application (with link to USPTO page)
http://www.google.com/patents/US20110154090
From what I gather there is one TSC in the uncore (the logic in a package surrounding the cores but not part of any core), which is incremented on every external bus clock by the value in the field of the model specific register specified by Vipin Kumar in the link above (MSR_PLATFORM_INFO[15:8]). The external bus clock runs at 133.33 MHz. In addition, each core has its own TSC register, clocked by a clock domain that is shared by all cores and may be different from the clock for any one core - therefore there must be some kind of buffer when the core TSC is read by the RDTSC (or RDTSCP) instruction running in a core. For example, MSR_PLATFORM_INFO[15:8] may be set to 25 on a package; on every bus clock the uncore TSC increments by 25, and there is a PLL that multiplies the bus clock by 25 and provides this clock to each of the cores to clock their local TSC register, thereby keeping all TSC registers in sync. So, to map the terminology to actual hardware:
Constant TSC is implemented by using the external bus clock running at 133.33 MHz which is multiplied by a constant multiplier specified in MSR_PLATFORM_INFO[15:8]
Invariant TSC is implemented by keeping the TSC in each core on a separate clock domain
Non-stop TSC is implemented by having an uncore TSC that is incremented by MSR_PLATFORM_INFO[15:8] ticks on every bus clock; that way a multi-core package can go into deep power down (C6 state) and can shut down the PLL, since there is no need to keep a clock at the higher multiplier. When a core is resumed from the C6 state, its internal TSC will get initialized to the value of the uncore TSC (the one that didn't go to sleep), with an offset adjustment in case software has written a value to the TSC; the details are in the patent. If software does write to the TSC then the TSC for that core will be out of phase with the other cores, but at a constant offset (the frequencies of the TSC clocks are all tied to the bus reference clock by a constant multiplier).
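As a quick worked example of the arithmetic above (the ratio of 25 and the 133.33 MHz reference clock are taken from the paragraph; the raw MSR value shown is hypothetical, read your own MSR_PLATFORM_INFO at address 0xCE to get the real one):

/* Sketch: derive the TSC frequency from MSR_PLATFORM_INFO[15:8],
 * assuming a 133.33 MHz bus reference clock and an example ratio of 25. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t platform_info = 0x1900ULL;           /* hypothetical raw MSR value  */
    unsigned ratio = (platform_info >> 8) & 0xFF; /* bits 15:8 = 0x19 = 25       */
    double tsc_hz = ratio * 133.33e6;             /* 25 * 133.33 MHz ~ 3.33 GHz  */
    printf("ratio=%u  TSC ~ %.2f GHz\n", ratio, tsc_hz / 1e9);
    return 0;
}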
On newer CPUs (i7 Nehalem+, IIRC) the TSC is synchronized across all cores and runs at a constant rate.
So for a single processor, or more than one processor on a single package or mainboard(!), you can rely on a synchronized TSC.
From the Intel System Manual 16.12.1
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward.
On older processors you cannot rely on either a constant rate or synchronization.
Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.
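For completeness, a minimal sketch of the CPUID.80000007H:EDX[8] check quoted above, using GCC's <cpuid.h> helper:

/* Sketch: test CPUID.80000007H:EDX[8] (invariant TSC) as described in
 * the Intel manual quote above. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        printf("Invariant TSC supported\n");
    else
        printf("Invariant TSC NOT reported\n");
    return 0;
}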
RDTSC is not synchronized across CPUs. Thus, you cannot rely on it in a multi-processor system. The only workaround I can think of for Linux would be to actually restrict the process to run on a single CPU by setting its affinity. This can be done externally using the taskset utility, or "internally" using the sched_setaffinity or pthread_setaffinity_np functions.
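A minimal sketch of the "internal" option mentioned above, pinning the calling process to CPU 0 with sched_setaffinity (CPU 0 is an arbitrary choice for illustration):

/* Sketch: restrict the calling process to CPU 0 so that all rdtsc
 * reads come from the same core's TSC. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                     /* pin to CPU 0        */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... timing code using rdtsc can now assume a single core ... */
    return 0;
}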
This manual, chapter 17.12, describes the invariant TSC used in the newest processors. Available with Nehalem, this time stamp, along with the rdtscp instruction, allows one to read a timestamp (not affected by wait states, etc.) and a processor signature in one atomic operation.
It is said to be suitable for calculating wall-clock time, but it obviously doesn't expect the value to be the same across processors. The stated idea is that you can see whether successive reads are from the same CPU's clock, or adjust for reads from multiple CPUs. "It can also be used to adjust for per-CPU differences in TSC values in a NUMA system."
See also rdtsc accuracy across CPU cores
However, I'm not sure that the final consistency conclusion in the accepted answer follows from the statement that the TSC can be used for wall-clock time. If it were consistent, what reason would there be for atomically determining the CPU source of the time?
N.B. The TSC information has moved from chapter 11 to chapter 17 in that Intel manual.

ARM modes and why are there so many?

I'm currently reading/learning about ARM architecture ...
and I was wondering why there are so many modes
(FIQ, User, System, Supervisor, IRQ, ...).
My question is why do we need so many modes? Wouldn't just User and System be enough?
Thanks in advance.
It's just an architectural decision. The big advantage of the multiple modes is that they have some banked registers. Those extra registers allow you to write much less complicated exception routines.
If you were to pick only two, just USR and SYS are probably as good a choice as any, but what would happen when you took an exception? The normal ARM model is to go to an exception mode, set the banked link register for that exception mode to point to the instruction you want to return to after you resolve the exception, save the processor state in the exception mode's SPSR register, and then jump to the exception vector. USR and SYS share all their registers - using this model, you'd blow away your function return address (in LR) every time you took an interrupt!
The FIQ mode in particular has even more banked registers than the other exception modes. Those extra registers are in keeping with the "F" part of FIQ - it stands for "Fast". Not having to save and restore more processor context in software will speed up your interrupt handler.
Not too much to add to Carl's answer. Not sure what family / architecture of ARM processors you're talking about, so I'll just assume based on your question (FIQ, IRQ, etc.) that you're talking about ARM7/9/11. I won't enumerate every difference between every mode in every ARM architecture variant.
In addition to what Carl said, a few other advantages of having different modes for different circumstances:
for example, in the FIQ handler you don't have to branch off right away, you can just keep on executing (the FIQ vector is the last entry in the vector table, so the handler code can sit directly at the vector); with the other exceptions you have to branch right away
with different modes, you have natural support for separate stacks. If you're multitasking (e.g., an RTOS) and you don't have a separate stack when you're in an interrupt mode, you have to build in extra space on each task stack for the worst-case interrupt situation
with different modes, certain registers (e.g. CPSR, MMU regs, etc. - depends on architecture) are off-limits. Same thing with certain instructions. You don't want to let user code modify privileged registers, now do you?
