CPU TSC fetch operation especially in multicore-multi-processor environment

CPU TSC fetch operation especially in multicore-multi-processor environment - c

In Linux world, to get nano seconds precision timer/clockticks one can use :
#include <sys/time.h>
int foo()
{
timespec ts;
clock_gettime(CLOCK_REALTIME, &ts);
//--snip--
}
This answer suggests an asm approach to directly query for the cpu clock with the RDTSC instruction.
In a multi-core, multi-processor architecture, how is this clock ticks/timer value synchronized across multiple cores/processors? My understanding is that there in inherent fencing being done. Is this understanding correct?
Can you suggest some documentation that would explain this in detail? I am interested in Intel Nehalem and Sandy Bridge microarchitectures.
EDIT
Limiting the process to a single core or cpu is not an option as the process is really huge(in terms of resources consumed) and would like to optimally utilize all the resources in the machine that includes all the cores and processors.
Edit
Thanks for the confirmation that the TSC is synced across cores and processors. But my original question is how is this synchronization done ? is it with some kind of fencing ? do you know of any public documentation ?
Conclusion
Thanks for all the inputs: Here's the conclusion for this discussion: The TSCs are synchronized at the initialization using a RESET that happens across the cores and processors in a multi processor/multi core system. And after that every Core is on their own. The TSCs are kept invariant with a Phase Locked Loop that would normalize the frequency variations and thus the clock variations within a given Core and that is how the TSC remain in sync across cores and processors.

Straight from Intel, here's an explanation of how recent processors maintain a TSC that ticks at a constant rate, is synchronous between cores and packages on a multi-socket motherboard, and may even continue ticking when the processor goes into a deep sleep C-state, in particular see the explanation by Vipin Kumar E K (Intel):
http://software.intel.com/en-us/articles/best-timing-function-for-measuring-ipp-api-timing/
Here's another reference from Intel discussing the synchronization of the TSC across cores, in this case they mention the fact that rdtscp allows you to read both the TSC and the processor id atomically, this is important in tracing applications... suppose you want to trace the execution of a thread that might migrate from one core to another, if you do that in two separate instructions (non-atomic) then you don't have certainty of which core the thread was in at the time it read the clock.
http://software.intel.com/en-us/articles/intel-gpa-tip-cannot-sychronize-cpu-timestamps/
All sockets/packages on a motherboard receive two external common signals:
RESET
Reference CLOCK
All sockets see RESET at the same time when you power the motherboard, all processor packages receive a reference clock signal from an external crystal oscillator and the internal clocks in the processor are kept in phase (although usually with a high multiplier, like 25x) with circuitry called a Phase Locked Loop (PLL). Recent processors will clock the TSC at the highest frequency (multiplier) that the processor is rated (so called constant TSC), regardless of the multiplier that any individual core may be using due to temperature or power management throttling (so called invariant TSC). Nehalem processors like the X5570 released in 2008 (and newer Intel processors) support a "Non-stop TSC" that will continue ticking even when conserving power in a deep power down C-state (C6). See this link for more information on the different power down states:
http://www.anandtech.com/show/2199
Upon further research I came across a patent Intel filed on 12/22/2009 and was published on 6/23/2011 entitled "Controlling Time Stamp Counter (TSC) Offsets For Mulitple Cores And Threads"
http://www.freepatentsonline.com/y2011/0154090.html
Google's page for this patent application (with link to USPTO page)
http://www.google.com/patents/US20110154090
From what I gather there is one TSC in the uncore (the logic in a package surrounding the cores but not part of any core) which is incremented on every external bus clock by the value in the field of the machine specific register specified by Vipin Kumar in the link above (MSR_PLATFORM_INFO[15:8]). The external bus clock runs at 133.33MHz. In addition each core has it's own TSC register, clocked by a clock domain that is shared by all cores and may be different from the clock for any one core - therefore there must be some kind of buffer when the core TSC is read by the RDTSC (or RDTSCP) instruction running in a core. For example, MSR_PLATFORM_INFO[15:8] may be set to 25 on a package, every bus clock the uncore TSC increments by 25, there is a PLL that multiplies the bus clock by 25 and provides this clock to each of the cores to clock their local TSC register, thereby keeping all TSC registers in synch. So to map the terminology to actual hardware
Constant TSC is implemented by using the external bus clock running at 133.33 MHz which is multiplied by a constant multiplier specified in MSR_PLATFORM_INFO[15:8]
Invariant TSC is implemented by keeping the TSC in each core on a separate clock domain
Non-stop TSC is implemented by having an uncore TSC that is incremented by MSR_PLATFORM_INFO[15:8] ticks on every bus clock, that way a multi-core package can go into deep power down (C6 state) and can shutdown the PLL... there is no need to keep a clock at the higher multiplier. When a core is resumed from C6 state its internal TSC will get initialized to the value of the uncore TSC (the one that didn't go to sleep) with an offset adjustment in case software has written a value to the TSC, the details of which are in the patent. If software does write to the TSC then the TSC for that core will be out of phase with other cores, but at a constant offset (the frequency of the TSC clocks are all tied to the bus reference clock by a constant multiplier).

On newer CPUs (i7 Nehalem+ IIRC) the TSC is synchronzied across all cores and runs a constant rate.
So for a single processor, or more than one processor on a single package or mainboard(!) you can rely on a synchronzied TSC.
From the Intel System Manual 16.12.1
The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processors support for invariant TSC is
indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a
constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward.
On older processors you can not rely on either constant rate or synchronziation.
Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.

RTDSC is not synchronized across CPUs. Thus, you cannot rely on it in a multi-processor systems. The only workaround I can think of for Linux would be to actually restricting the process to run on a single CPU by settings its affinity. This can be done externally using using taskset utility or "internally" using sched_setaffinity or pthread_setaffinity_np functions.

This manual, chapter 17.12, describes the invariant TSC used in the newest processors. Available with Nehalem this time stamp, along with the rtscp instruction, allows one to read a timestamp (not affected by wait-states, etc) and a processor signature in one atomic operation.
It is said to be suitable for calculating wall-clock time, but it obviously doesn't expect the value to be the same across processors. The stated idea is that you can see if successive reads are to the same CPU's clock, or to adjust for multiple CPU reads. "It can also be used to adjust for per-CPU differences in TSC values in a NUMA system."
See also rdtsc accuracy across CPU cores
However, I'm not sure that the final consistency conclusion in the accepted answer follows from the statement that the tsc can be used for wall clock time. If it was consistent, what reason would there be for atomically determining the CPU source of the time.
N.B. The TSC information has moved from chapter 11 to chapter 17 in that Intel manual.

Related

Which clock source is used for clocking instructions in cortex-M4

In my Cortex-M4, I have am using a 8Mhz oscillator as HSE, which then gets multiplied to 72Mhz using PLL which then drives SYSCLK. This got me thinking, which clock is the one being used to execute instructions? In other words, if our CPI is 1 (an ideal value, of course), does that mean we would execute 8 million instructions per second or 72 million instructions per second?
I also found this DWT which can be used to measure clock cycles, and hence CPI. So I am guessing which ever clock that is used to execute instructions would be the same one used by DWT?

It is driven from HCLK (not SYSCLK which clocks system timer and it does not have to be equal to HCLK). Thew source of HCLK is settable by the programmer.
if our CPI is 1 (an ideal value, of course), does that mean we would
execute 8 million instructions per second or 72 million instructions
per second?
You can see how many cycles every instruction takes: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
The real speed depends on many factors but mainly depends on the place where your code and data reside and the advanced uC features.
If you execute your code fro the internal TCM SRAM and place data in the SRAM (or even better on some uC in TCI and TCD SRAM)you can archive the theoretical execution efficiency as those memories work at the core clock frequency with no wait states or bus waitstates. Ideally if the uC has TC memory and both instructions and data are fetched using separate buses.
If your code resides in the FLASH memory - this memory may introduce some wait states. STM uC (ART accelerator) read the flash in larger a chunks and fetch the instructions ahead. It allows those uCs to perform almost at the max speed. The problem are branch instructions which require pipeline to be flushed and instructions fetched again.

Performance Counting for Cycles Inconsistent and not Reflecting CPU Frequency

Intro:
I have written a Linux kernel module for performance counter monitoring on an ARM v7 platform with cortex A-15 and A-7 processors (Odroid XU3). One counter I am trying to use in my research is cycle counts, which from the ARM technical reference manuals has its own dedicated counter. I have checked my code against other implementations and ARM references found online; here is a snippet of the part that enables the CPU counters:
Resources Used:
How to measure program execution time in ARM Cortex-A8 processor?
http://neocontra.blogspot.se/2013/05/user-mode-performance-counters-for.html
https://pietrotech.wordpress.com/2016/09/28/sample-performance-counters-on-little-and-big-cluster-on-odroid-xu3-processor-exynos-5422/
ARM Reference Manual (Ch. 11, PMU)
Problem:
When I print the cycles elapsed over a fixed sampling period (100ms) for a fixed CPU frequency (1.4GHz in the case of core 0), I see a huge amount of variance in the values returned by the module. See the chart below for an example of this. Not only does the variance seem very high, but the number of cycles measured does not reflect the number of cycles I would expect to see recorded given the sample time and fixed frequency (for the given scenario I expected 1.4e8 cycles on each sample). What could be causing such divergence from the expected number of cycles?
Variability of measured cycles for kernel module running across all cores and across just core 0.

After further though and discussions with colleagues, I believe the discrepancy between measured and expected cycles is cpuidle: it is a subsystem in the Linux kernel that places a CPU core into a lower-power state when the core is not doing anything. Some of the lowest states shut down the clock, which likely causes the cycle counter to stop incrementing. This article gives a nice description of cpuidle and how it works: https://lwn.net/Articles/384146/

How to implement time counting in a new operating system?

i have to implement in a operating system, the function sleep().
Which is at the moment, not exisiting in the previous mentioned system.
The problem is, I have to count the elapsed time to wake the sleeping thread up.
How should i relize this? Do i have to count the CPU Ticks or is there another way?
Are CPU Ticks not dependend on the CPU frequency which is different for each CPU?
I have to implement the function in the language C.
the time function doesn't exist either
Thank you in advance!

Typically, such functionality is provided by a hardware timer interrupt, (and its associated driver), that manages a 'tick count' and a delta-queue of 'Thread Control Block' pointers, (pTCB). The pTCP's for sleeping threads are stored in the queue ordered by interval expiry tick count. The timer interrupt increments the tick count and checks it agains the expiry count of the item at the head of the queue.
When a thread requests a sleep, the thread pTCB is taken out of the set of ready threads, the expiry-count calculated and the pTCB inserted into the timer queue. When the pTCB reaches the end of the queue, and it's expiry tick has arrived, it is popped and added back to the set of ready threads so that it may be set running.

It totally depends on your platform/OS. It has to provide you some time-like information, e.g. ticks. Otherwise it is just impossible.
Converting ticks to seconds of course requires additional information. Again, this can be supplied by your platform. Or you have to find it out by other means (manual, configure it yourself, ...).

The easiest and most common way to do that in operating systems is to set up a timer interrupt at a static frequency, then build a timer framework on top of that, then use that timer framework to fire off wakeups for your sleeping threads.
A good paper that discusses various data structures for how to do it efficiently is here. I recommend from my own experience scheme 7. It's quite easy to implement and performs wonderfully.
You can find a fast implementation with a good API here. But I'm biased, because I wrote it.
If you don't want a timer interrupt with a static frequency it becomes much harder to implement a nice timer facility with good performance. I've done a few experiments, but I'd recommend you to start with simple timer interrupt with a static frequency. Once you start doing dynamic timers you need to exactly understand the tradeoffs you are prepared to make.

you can use time() :
time_t t = time();
while(time() < t + sleepDuration);

You may use the CPUs Time Stamp Counter (TSC) to get counter values for time keeping. See Chapter 16.12.1 of "Intel® 64 and IA-32 Architectures Software Developer’s Manual".
The TSC is a low level counter which may provide counter values independent of CPU speed:
"The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C--, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."
However, for the implementation of sleep() alike functionality you should look into timer
hardware like HPET, ACPI, and alike. See "Intel 64® and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2" and "IA-PC HPET (High Precision Event Timers) Specification" for details.

How can I tell if every core on my machine uses the same timer?

I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand it is possible for each core to produce independent results but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.

To do measurements this precisely; you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially for larger NUMA systems) - a write to memory (cache) may be visible on the CPU that made the write immediately, and be visible by a CPU on a different physical chip (in a different NUMA domain) later. To avoid this you may need to find the average of initiating the event on all CPUs. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, then results are combined to cancel out any "event propagation latency".
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.

Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.

You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)

You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.

Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.

I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight