I'm using a Zedboard with a Zynq chip. I want to know the number of cores per CPU (on the board there are two, CPU0 and CPU1, but there is no indication of the number of cores!). Thanks.
Well, first check Zynq's documentation. Then look up what is inside an ARM Cortex-A9. With both in mind you should be able to see that what you have is two cores (CPU0 and CPU1). You are probably confusing your device with modern processors that have many more cores per package. Try searching for "What is a core/CPU".
I hope this answer gives you the pointers to better understand what a core is and what your device's cores provide.
I am planning to measure PMU counters for L1/L2/L3 misses and branch prediction misses. I have read the related Intel documents, but I am unsure about the scenarios below. Could someone please clarify?
// assume PMU reset and PERFEVTSELx configuration done above
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start);  // start PMU counters
my_program();
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop);   // stop PMU counters
// now read the PMU counters
1. What will happen if my process is scheduled out while my_program() is running, and scheduled back onto another core?
2. What will happen if the process is scheduled out and scheduled back onto the same core, but in the meantime some other process has reset the PMU counters?
How can I make sure that I am reading the correct values from the PMU counters?
Machine details: CentOS with Linux kernel 3.10.0-327.22.2.el7.x86_64, running on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Thanks
Summary of the Intel forum thread started by the OP:
The Linux perf subsystem virtualizes the performance counters, but this means you have to read them with a system call, instead of rdpmc, to get the full virtualized 64-bit value instead of whatever is currently in the architectural performance counter register.
If you want to use rdpmc inside your own code so it can measure itself, pin each thread to a core because context switches don't save/restore PMCs. There's no easy way to avoid measuring everything that happens on the core, including interrupt handlers and other processes that get a timeslice. This can be a good thing, since you need to take the impact of kernel overhead into account.
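As a concrete illustration of the system-call route, here is a minimal sketch (my example, not code from the thread) that uses perf_event_open() and read() to count retired instructions around a region of code; the event choice and error handling are kept deliberately simple, and the region to measure is only indicated by a placeholder comment.

    // Minimal sketch: count retired instructions around a code region via perf_event_open().
    // Assumes Linux with the perf subsystem enabled; error handling is minimal.
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;   // count user-space only

        int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* my_program(); */        // region of interest goes here

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));   // full virtualized 64-bit value
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

Because the kernel saves and restores the event on context switch, this read returns a value that follows your process even if it migrates between cores.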
More useful quotes from John D. McCalpin, PhD ("Dr. Bandwidth"):
For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal. Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html
You can use "pread()" on the /dev/cpu/*/msr device files to read the MSRs -- this may be a bit easier to read than IOCTL-based code. The codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent examples.
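A minimal sketch of that pread() approach (my illustration, not code from the thread), assuming the msr kernel module is loaded and the program has enough privilege to open /dev/cpu/0/msr:

    // Sketch: read one MSR via pread() on /dev/cpu/<n>/msr (requires the msr module and root privilege).
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>

    static int read_msr(int cpu, uint32_t msr, uint64_t *value)
    {
        char path[64];
        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        // The MSR address is passed as the file offset; each MSR is 8 bytes wide.
        ssize_t n = pread(fd, value, sizeof(*value), msr);
        close(fd);
        return (n == sizeof(*value)) ? 0 : -1;
    }

    int main(void)
    {
        uint64_t tsc;
        if (read_msr(0, 0x10 /* IA32_TIME_STAMP_COUNTER */, &tsc) == 0)
            printf("TSC on CPU0: %llu\n", (unsigned long long)tsc);
        return 0;
    }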
There have been a number of approaches to reserving and sharing performance counters, including both software-only and combined hardware+software approaches, but at this point there is not a "standard" approach. (It looks like Intel has a hardware-based approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what platforms support it.)
Your questions:
What will happen if my process is scheduled out while my_program() is running, and scheduled onto another core?
You'll see random garbage; the same applies if another process resets the PMCs between timeslices of your process.
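In line with the pinning advice quoted above, one common mitigation (my sketch, not from the thread) is to pin the measuring thread to a single core before touching the counters, so it at least cannot migrate mid-measurement; note this does nothing about other processes clobbering the PMCs.

    // Sketch: pin the calling thread to one core with sched_setaffinity() before reading PMCs.
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // pid 0 means the calling thread
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }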
I got the answers from an Intel forum; the link is below.
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/673602
I was studying Generic Interrupt Controllers (GIC) in ARM, and when I read about the clock frequency for the GIC, it stated that a GIC runs at a clock frequency that is an integer fraction of (i.e. lower than) the main core clock frequency. Can somebody explain why that is?
High-speed circuits are a LOT harder to design, manufacture and control. So, if speed is not that important (as is the case with your GIC), the peripherals of the core will usually run a lot slower than the core itself. Even the L2 cache usually does not run at full core speed.
Also, the gain from a faster-clocked GIC is probably negligible, so there is no reason for the designers to do a new generation, which in this business is always an expensive and risky adventure.
As far as I understand, to measure the actual operating CPU frequency I need access to the model-specific registers (MSRs) IA32_APERF and IA32_MPERF (Assembly CPU frequency measuring algorithm).
However, access to the MSR registers is privileged (through the rdmsr instruction). Is there another way this can be done? I mean, for example, through a device driver/library which I could call in my code. It seems strange to me that reading the registers is privileged. I would think only writing to them would be privileged.
Note: the rdtsc instruction does not account for turbo boost and thus cannot report the actual operating frequency
Edit:
I'm interested in solutions for Linux and/or Windows.
You are right; the proper way to find the average CPU frequency is described in the 2nd answer in your link.
To read MSRs on Linux you can use the rdmsr tool (from msr-tools).
The only thing that may be misleading in that answer is "max frequency". It should not be the max frequency, but the nominal frequency (max non-turbo frequency), as the MPERF counter counts at the max non-turbo frequency. You can get this frequency from MSR 0xCE, bits 15:8 (ref).
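As an illustration (my sketch, not part of the original answer), the average frequency over an interval is nominal_freq * delta(APERF) / delta(MPERF), with the nominal ratio taken from MSR_PLATFORM_INFO (0xCE) bits 15:8; the 100 MHz bus-clock multiplier below is an assumption that holds for recent Intel cores such as the i7-3770.

    // Sketch: average operating frequency from IA32_MPERF (0xE7) and IA32_APERF (0xE8).
    // Assumes Linux, the msr module loaded, root privilege, and a 100 MHz bus clock.
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t rdmsr_cpu0(uint32_t msr)
    {
        uint64_t v = 0;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd >= 0) { pread(fd, &v, sizeof(v), msr); close(fd); }
        return v;
    }

    int main(void)
    {
        // Max non-turbo ratio lives in MSR_PLATFORM_INFO (0xCE), bits 15:8.
        uint64_t ratio = (rdmsr_cpu0(0xCE) >> 8) & 0xFF;
        double nominal_mhz = ratio * 100.0;            // assumed 100 MHz bus clock

        uint64_t mperf0 = rdmsr_cpu0(0xE7), aperf0 = rdmsr_cpu0(0xE8);
        sleep(1);                                      // measurement interval
        uint64_t mperf1 = rdmsr_cpu0(0xE7), aperf1 = rdmsr_cpu0(0xE8);

        double avg_mhz = nominal_mhz * (double)(aperf1 - aperf0) / (double)(mperf1 - mperf0);
        printf("average frequency on CPU0: %.0f MHz\n", avg_mhz);
        return 0;
    }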
I have an application with two threads. Thread1 receives multicast packets from network card eth1. Suppose I use sched_setaffinity to set the CPU affinity of thread1 to CPU core 1, and thread2 uses those packets (received by thread1 and stored in heap-allocated global variables) to do some operations. I set the CPU affinity of thread2 to core 7. Suppose core 1 and core 7 are the same physical core with hyper-threading; I think the performance would be good, since core 1 and core 7 can share the L1 cache.
I have watched /proc/interrupts and I see that eth1 gets interrupts on several CPU cores. So in my case, I set the CPU affinity of thread1 to core 1, but the interrupts happen on many cores. Will that affect performance? Do the packets received from eth1 go directly to main memory no matter which core handles the interrupt?
I don't know much about networking in the Linux kernel. Can anyone suggest books or websites that cover this topic? Thanks for any comments.
Edit: according to "What Every Programmer Should Know About Memory", section 6.3.5 "Direct Cache Access", I think DCA is what I would like to know about...
The interrupt will (quite likely) happen on a different core than the one receiving the packet. Depending on how the driver deals with packets, that may or may not matter. If the driver reads the packet (e.g. to make a copy), then it's not ideal, as the cache gets filled on a different CPU. But if the packet is just loaded into memory somewhere using DMA, and left there for the software to pick up later, then it doesn't matter [in fact, it's better to have it happen on a different CPU, as "your" CPU gets more time to do other things].
As to using hyperthreading, my experience (and that of many others) is that hyperthreading SOMETIMES gives a benefit, but often ends up being similar to not having hyperthreading, because the two threads use the same execution units of the same core. You may want to compare the throughput with both threads set to affinity on the same core as well, to see if that makes it "better" or "worse" - like most things, it's often details that make a difference, so your code may be slightly different from someone else's, meaning it works better in one or the other of the cases.
Edit: if your system has multiple sockets, you may also want to ensure that your threads run on the socket "nearest" (in the number of QPI/PCI bridge hops) to the network card.
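For reference, pinning the two threads looks roughly like the sketch below (my illustration, not from the answer). The logical CPU numbers 1 and 7 are taken from the question and are only assumed to be SMT siblings; the real sibling pairs can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

    // Sketch: pin two pthreads to (assumed) sibling hyperthreads, e.g. logical CPUs 1 and 7.
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stddef.h>

    static void *receiver(void *arg) { (void)arg; /* read packets from eth1 ... */ return NULL; }
    static void *consumer(void *arg) { (void)arg; /* process packets ... */ return NULL; }

    static void pin(pthread_t t, int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, receiver, NULL);
        pthread_create(&t2, NULL, consumer, NULL);
        pin(t1, 1);   // assumed: logical CPU 1
        pin(t2, 7);   // assumed: logical CPU 7, SMT sibling of CPU 1 on this machine
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }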
I'm using an ARM926EJ-S and am trying to figure out whether the ARM can provide (e.g. in a readable register) the CPU's cycle counter. I am looking for a number that represents the number of cycles since the CPU was powered up.
In my system I have only low-resolution external RTC/timers. I would like to be able to achieve a high-resolution timer.
Many thanks in advance!
You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an overhead in servicing the timer on either side of the period being timed, but it can be used to give execution time over larger or longer-running sections of code, which may be of more practical use than a cycle count. See the sketch after this answer for what reading such a timer looks like.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.
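As a rough illustration of the peripheral-timer option (my sketch, not from the answer), reading a memory-mapped free-running timer looks something like the following; the register address, register layout, and clock rate are hypothetical placeholders and must be taken from your SoC's datasheet.

    // Sketch: time a code section with a free-running down-counting hardware timer on a bare-metal ARM926EJ-S system.
    // TIMER_BASE, the register offset, and TIMER_CLOCK_HZ are hypothetical; use the values from your SoC datasheet.
    #include <stdint.h>

    #define TIMER_BASE      0x101E2000u                 /* hypothetical timer peripheral base */
    #define TIMER_VALUE     (*(volatile uint32_t *)(TIMER_BASE + 0x04))
    #define TIMER_CLOCK_HZ  1000000u                    /* hypothetical 1 MHz timer clock */

    static inline uint32_t elapsed_us(uint32_t start, uint32_t now)
    {
        // Assumes a down-counting 32-bit timer; wrap-around is handled by unsigned arithmetic.
        uint32_t ticks = start - now;
        return ticks / (TIMER_CLOCK_HZ / 1000000u);
    }

    void timed_section(void)
    {
        uint32_t start = TIMER_VALUE;
        /* ... code section to time ... */
        uint32_t now = TIMER_VALUE;
        uint32_t us = elapsed_us(start, now);
        (void)us;   // report or log as appropriate
    }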
The ARM9 is not equipped with a PMU (Performance Monitoring Unit) as included in the Cortex family. The PMU is described here. The Linux kernel comes with support for using the PMU for benchmarking performance. See here for documentation of the perf tool-set.
I'm a bit unsure about the ARM9; I need to dig a bit more...