How does the clock function work in operating systems? - c

I don't understand how the clock function works within an operating system.
First, the documentation for the clock function on www.cplusplus.com says:
Returns the processor time consumed by the program.
The value returned is expressed in clock ticks[...]
As far as I understand, the clock function must directly access some register within the CPU that acts as a counter for CPU cycles. How is this possible? I mean, a 32-bit register would overflow very soon if it were incremented once per cycle at the CPU frequency. Is the OS handling this overflow in some way?

"Clock ticks" are implementation-defined, not in units of the cpu clock. Historically they were fixed-length, coarse-grained scheduling timeslices. There is no need for any special hardware TSC to implement clock. It suffices to simply count up the timeslices the process is scheduled for (which are ultimately enforced by a timer interrupt or similar).

Related

ARM embedded delay hardware timer vs CPU cycle counter

I'm working on an embedded project that's running on an ARM Cortex-M3 based microcontroller. Some code provided by our vendor uses a delay function that sets up a built-in hardware timer and then spins until the timer expires. Typically this is used to wait between 1 and a couple hundred microseconds. These delays are almost always because they are waiting on some register, chip, or bus to complete an action and need to wait at least the given number of microseconds. The hardware timer also appears to cost at least 6 microseconds in overhead to set up.
In a multithreaded environment this is a problem because there are N threads but only 1 hardware timer. I could disable interrupts while the timer is being used to prevent context switches and thus race conditions, but that seems a bit ugly. I am thinking of replacing the function that uses the hardware timer with one that uses the ARM CPU Cycle Counter (CCNT). Are there any pitfalls I am missing, or other alternatives? Obviously the cycle-counter function requires it to be tuned to the proper CPU frequency, which will never change for our system, but I suppose it could be detected programmatically at boot using the hardware timer.
Set up the timer once at startup and let the counter run continuously. When you want to start a delay, read the counter value and remember this start value. Then in the delay loop read the counter value again and loop until the counter value minus the start value is greater than or equal to the requested delay ticks. (If you do the subtraction correctly then rollovers will wash out and you don't need special handling to check for them.)
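A minimal sketch of that pattern, assuming a memory-mapped free-running 32-bit counter (the register address here is hypothetical and board-specific):

#include <stdint.h>

/* hypothetical address of a free-running 32-bit hardware counter */
#define TIMER_COUNT (*(volatile uint32_t *)0x40001000u)

void delay_ticks(uint32_t ticks)
{
    uint32_t start = TIMER_COUNT;
    /* unsigned subtraction wraps modulo 2^32, so a counter rollover
       between the two reads still yields the correct elapsed count */
    while ((uint32_t)(TIMER_COUNT - start) < ticks)
        ;                             /* spin */
}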
You could multiplex your timer such that you have a table of when each thread wants to fire off and a function pointer / vector for execution. When the timer interrupt occurs, fire off that thread's interrupt and then set the timer to the next one in the list, minus elapsed time. This is what I see many *nix operating systems do in their kernel code, so there should be code to pull from as example.
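A rough sketch of that multiplexing scheme, with hypothetical hw_timer_* hooks standing in for the real timer driver:

#include <stdint.h>

#define MAX_TIMERS 8

struct sw_timer {
    uint32_t expires;                 /* absolute tick of next firing */
    void (*handler)(void);            /* run when it fires */
    int active;
};

static struct sw_timer timers[MAX_TIMERS];

extern uint32_t hw_timer_now(void);              /* hypothetical */
extern void hw_timer_arm(uint32_t ticks_ahead);  /* hypothetical */

/* arm the single hardware timer for the soonest pending deadline */
static void rearm(void)
{
    uint32_t now = hw_timer_now(), best = UINT32_MAX;
    for (int i = 0; i < MAX_TIMERS; i++)
        if (timers[i].active && timers[i].expires - now < best)
            best = timers[i].expires - now;
    if (best != UINT32_MAX)
        hw_timer_arm(best);
}

/* hardware timer ISR: fire whatever is due, then re-arm for the next */
void hw_timer_isr(void)
{
    uint32_t now = hw_timer_now();
    for (int i = 0; i < MAX_TIMERS; i++)
        if (timers[i].active && (int32_t)(now - timers[i].expires) >= 0) {
            timers[i].active = 0;
            timers[i].handler();
        }
    rearm();
}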
A bigger concern is the fact that you are spin locking the thread waiting for the timer. Besides CPU usage, and depending on what OS you have (or if you have an OS) you could easily introduce thread inversion issues or even full on lock ups. It might be better to use thread primitives instead so that any OS can actually sleep your threads and wake them when needed.

How to implement time counting in a new operating system?

I have to implement the function sleep() in an operating system
in which it does not currently exist.
The problem is that I have to count the elapsed time in order to wake the sleeping thread up.
How should I realize this? Do I have to count CPU ticks, or is there another way?
Are CPU ticks not dependent on the CPU frequency, which is different for each CPU?
I have to implement the function in C.
The time function doesn't exist either.
Thank you in advance!
Typically, such functionality is provided by a hardware timer interrupt (and its associated driver) that manages a 'tick count' and a delta-queue of 'Thread Control Block' pointers (pTCB). The pTCB's for sleeping threads are stored in the queue, ordered by interval-expiry tick count. The timer interrupt increments the tick count and checks it against the expiry count of the item at the head of the queue.
When a thread requests a sleep, its pTCB is taken out of the set of ready threads, the expiry count is calculated, and the pTCB is inserted into the timer queue. When the pTCB reaches the head of the queue and its expiry tick has arrived, it is popped and added back to the set of ready threads so that it may be set running.
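A minimal sketch of such a delta queue (the field names and the make_ready hook are illustrative, not any particular OS's API):

struct tcb {
    struct tcb *next;
    unsigned long delta;              /* ticks to wait after predecessor */
    /* ... rest of the thread control block ... */
};

static struct tcb *sleep_queue;       /* head = next thread to wake */

extern void make_ready(struct tcb *t);   /* hypothetical scheduler hook */

/* insert t so it sleeps for 'ticks'; storing deltas keeps the per-tick
   work O(1) */
void sleep_insert(struct tcb *t, unsigned long ticks)
{
    struct tcb **pp = &sleep_queue;
    while (*pp && (*pp)->delta <= ticks) {
        ticks -= (*pp)->delta;
        pp = &(*pp)->next;
    }
    t->delta = ticks;
    t->next = *pp;
    if (t->next)
        t->next->delta -= ticks;      /* successor now waits relative to t */
    *pp = t;
}

/* called from the timer interrupt, once per tick */
void sleep_tick(void)
{
    if (sleep_queue && sleep_queue->delta)
        sleep_queue->delta--;
    while (sleep_queue && sleep_queue->delta == 0) {
        struct tcb *t = sleep_queue;
        sleep_queue = t->next;
        make_ready(t);                /* expiry reached: set it running */
    }
}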
It totally depends on your platform/OS. It has to provide you with some time-like information, e.g. ticks. Otherwise it is simply impossible.
Converting ticks to seconds of course requires additional information. Again, this can be supplied by your platform. Or you have to find it out by other means (manual, configure it yourself, ...).
The easiest and most common way to do that in operating systems is to set up a timer interrupt at a static frequency, then build a timer framework on top of that, then use that timer framework to fire off wakeups for your sleeping threads.
A good paper that discusses various data structures for how to do it efficiently is here. I recommend from my own experience scheme 7. It's quite easy to implement and performs wonderfully.
You can find a fast implementation with a good API here. But I'm biased, because I wrote it.
If you don't want a timer interrupt with a static frequency, it becomes much harder to implement a nice timer facility with good performance. I've done a few experiments, but I'd recommend you start with a simple timer interrupt at a static frequency. Once you start doing dynamic timers, you need to understand exactly the tradeoffs you are prepared to make.
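As a sketch of the static-frequency approach, here is a simple hashed timing wheel in the spirit of the schemes in that paper (sizes and names are illustrative, not the linked implementation):

#define WHEEL_SIZE 256u               /* must be a power of two */

struct timer {
    struct timer *next;
    unsigned long expires;            /* absolute tick at which to fire */
    void (*callback)(void *arg);
    void *arg;
};

static struct timer *wheel[WHEEL_SIZE];
static unsigned long now_ticks;

/* hash each timer into the slot its expiry tick maps to */
void timer_add(struct timer *t)
{
    unsigned slot = t->expires & (WHEEL_SIZE - 1);
    t->next = wheel[slot];
    wheel[slot] = t;
}

/* called from the static-frequency timer interrupt, once per tick */
void timer_tick(void)
{
    now_ticks++;
    struct timer **pp = &wheel[now_ticks & (WHEEL_SIZE - 1)];
    while (*pp) {
        struct timer *t = *pp;
        if (t->expires == now_ticks) {    /* due now, not a later lap */
            *pp = t->next;
            t->callback(t->arg);
        } else {
            pp = &t->next;
        }
    }
}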
You can use time():
time_t t = time(NULL);
while (time(NULL) < t + sleepDuration)
    ;   /* busy-wait until sleepDuration seconds have elapsed */
You may use the CPU's Time Stamp Counter (TSC) to get counter values for timekeeping. See Chapter 16.12.1 of the "Intel® 64 and IA-32 Architectures Software Developer's Manual".
The TSC is a low-level counter which may provide counter values independent of CPU speed:
"The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."
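Reading the TSC from user code usually goes through a compiler intrinsic rather than hand-written asm; a sketch assuming GCC/Clang on x86:

#include <stdint.h>
#include <x86intrin.h>                /* __rdtsc() */

/* raw cycle count; only meaningful as a clock on parts that report
   invariant TSC (CPUID.80000007H:EDX[8]) */
static inline uint64_t tsc_now(void)
{
    return __rdtsc();
}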
However, for the implementation of sleep()-like functionality you should look into timer hardware like the HPET, ACPI timers, and the like. See the "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2" and the "IA-PC HPET (High Precision Event Timers) Specification" for details.

How can I tell if every core on my machine uses the same timer?

I'm trying to write some code to determine if clock_gettime used with CLOCK_MONOTONIC_RAW will give me results coming from the same hardware on different cores.
From what I understand, it is possible for each core to produce independent results, but not always. I was given the task of obtaining timings on all cores with a precision of 40 nanoseconds.
The reason I'm not using CLOCK_REALTIME is that my program absolutely must not be affected by NTP adjustments.
Edit:
I have found the unsynchronized_tsc function which tries to test whether the TSC is the same on all cores. I am now attempting to find if CLOCK_MONOTONIC_RAW is based on the TSC.
Final edit:
It turns out that CLOCK_MONOTONIC_RAW is always usable on multi-core systems and does not rely on the TSC even on Intel machines.
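(For reference, querying this clock from user space looks like the following sketch; note that CLOCK_MONOTONIC_RAW is Linux-specific.)

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0) {
        perror("clock_gettime");
        return 1;
    }
    /* raw hardware-based time, not slewed by NTP adjustments */
    printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}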
To do measurements this precisely, you'd need:
code that's executed on all CPUs, that reads the CPU's time stamp counter and stores it as soon as "an event" occurs
some way to create "an event" that is noticed at the same time by all CPUs
some way to prevent timing problems caused by IRQs, task switches, etc.
Various possibilities for the event include:
polling a memory location in a loop, where one CPU writes a new value and other CPUs stop polling when they see the new value
using the local APIC to broadcast an IPI (inter-processor interrupt) to all CPUs
For both of these methods there are delays between the CPUs (especially on larger NUMA systems): a write to memory (cache) may be visible immediately on the CPU that made the write, but become visible to a CPU on a different physical chip (in a different NUMA domain) later. To cancel this out you may need to average over initiating the event on each CPU in turn. E.g. (for 2 CPUs) one CPU initiates and both measure, then the other CPU initiates and both measure, and the results are combined to cancel out any "event propagation latency" (a sketch of the memory-polling variant follows).
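The sketch below assumes C11 atomics, the __rdtsc intrinsic, and threads already pinned to the CPUs being compared:

#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>

#define NCPUS 2                       /* per the two-CPU example above */

static atomic_int go;                 /* the shared "event" */
static uint64_t stamp[NCPUS];

/* runs pinned on the initiating CPU: stamp, then publish the event */
void initiator(int cpu)
{
    stamp[cpu] = __rdtsc();
    atomic_store_explicit(&go, 1, memory_order_release);
}

/* runs pinned on each observing CPU: spin until the event is visible */
void observer(int cpu)
{
    while (!atomic_load_explicit(&go, memory_order_acquire))
        ;                             /* poll */
    stamp[cpu] = __rdtsc();
}

Running this once in each direction and averaging the differences cancels the propagation latency, as described above.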
To fix other timing problems (IRQs, task switches, etc) I'd want to be doing these tests during boot where nothing else can mess things up. Otherwise you either need to prevent the problems (ensure all CPUs are running at the same speed, disable IRQs, disable thread switches, stop any PCI device bus mastering, etc) or cope with problems (e.g. run the same test many times and see if you get similar results most of the time).
Also note that all of the above can only ensure that the time stamp counters were in sync at the time the test was done, and don't guarantee that they won't become out of sync after the test is done. To ensure the CPUs remain in sync you'd need to rely on the CPU's "monotonic clock" guarantees (but older CPUs don't make that guarantee).
Finally; if you're attempting to do this in user-space (and not in kernel code); then my advice is to design code in a way that isn't so fragile to begin with. Even if the TSCs on different CPUs are guaranteed to be perfectly in sync at all times, you can't prevent an IRQ from interrupting immediately before or immediately after reading the TSC (and there's no way to atomically do something and read TSC at the same time); and therefore if your code requires such precisely synchronised timing then your code's design is probably flawed.

Measuring Time in Linux Kernel Space With Sub-Microsecond Precision

I am currently using the do_gettimeofday() function to measure time in the kernel, which gives me microsecond precision. Is there anything available that is more precise than this (maybe on the order of nanoseconds)?
The ktime_get() function returns ktime_t, which has nanosecond resolution.
As far as I know, the most precise timer is the processor-specific counter register (such as the TSC on x86). The Linux kernel provides the rdtsc, rdtscl, and rdtscll macros in ./arch/x86/include/asm/msr.h to read this register's value. On ARM, it is the cycle counter register.
These registers differ from CPU to CPU. The common interface to access them is the get_cycles function, declared in the <linux/timex.h> file.
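A minimal kernel-side sketch combining the two (header locations vary a little between kernel versions):

#include <linux/kernel.h>
#include <linux/ktime.h>              /* ktime_get() */
#include <linux/timex.h>              /* get_cycles() */

/* time a short section both in cycles and in nanoseconds */
static void measure_section(void)
{
    ktime_t t0, t1;
    cycles_t c0, c1;

    t0 = ktime_get();
    c0 = get_cycles();
    /* ... code under test ... */
    c1 = get_cycles();
    t1 = ktime_get();

    pr_info("elapsed: %llu cycles, %lld ns\n",
            (unsigned long long)(c1 - c0),
            (long long)ktime_to_ns(ktime_sub(t1, t0)));
}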
This document may also be helpful.

CPU TSC fetch operation especially in multicore-multi-processor environment

In the Linux world, to get nanosecond-precision timers/clock ticks one can use:
#include <time.h>

int foo()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    //--snip--
}
This answer suggests an asm approach to directly query for the cpu clock with the RDTSC instruction.
In a multi-core, multi-processor architecture, how is this clock tick/timer value synchronized across multiple cores/processors? My understanding is that there is inherent fencing being done. Is this understanding correct?
Can you suggest some documentation that would explain this in detail? I am interested in Intel Nehalem and Sandy Bridge microarchitectures.
EDIT
Limiting the process to a single core or CPU is not an option, as the process is really huge (in terms of resources consumed) and I would like to optimally utilize all the resources in the machine, including all the cores and processors.
Edit
Thanks for the confirmation that the TSC is synced across cores and processors. But my original question is: how is this synchronization done? Is it with some kind of fencing? Do you know of any public documentation?
Conclusion
Thanks for all the inputs. Here's the conclusion of this discussion: the TSCs are synchronized at initialization by a RESET that is applied across the cores and processors in a multi-processor/multi-core system. After that, every core is on its own. The TSCs are kept invariant with a phase-locked loop that normalizes the frequency variations, and thus the clock variations, within a given core; that is how the TSCs remain in sync across cores and processors.
Straight from Intel, here's an explanation of how recent processors maintain a TSC that ticks at a constant rate, is synchronous between cores and packages on a multi-socket motherboard, and may even continue ticking when the processor goes into a deep-sleep C-state; in particular, see the explanation by Vipin Kumar E K (Intel):
http://software.intel.com/en-us/articles/best-timing-function-for-measuring-ipp-api-timing/
Here's another reference from Intel discussing the synchronization of the TSC across cores. In this case they mention the fact that rdtscp allows you to read both the TSC and the processor id atomically. This is important in tracing applications: suppose you want to trace the execution of a thread that might migrate from one core to another. If you do that in two separate instructions (non-atomically), you have no certainty of which core the thread was on at the time it read the clock.
http://software.intel.com/en-us/articles/intel-gpa-tip-cannot-sychronize-cpu-timestamps/
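A sketch of such an atomic read, assuming GCC/Clang on x86 (on Linux, IA32_TSC_AUX is initialized so that it encodes the CPU number):

#include <stdint.h>
#include <x86intrin.h>                /* __rdtscp() */

/* read the TSC and the core's IA32_TSC_AUX value in one instruction,
   so both come from the same core even if the thread migrates */
static inline uint64_t tsc_with_cpu(unsigned *aux)
{
    return __rdtscp(aux);
}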
All sockets/packages on a motherboard receive two external common signals:
RESET
Reference CLOCK
All sockets see RESET at the same time when you power the motherboard, and all processor packages receive a reference clock signal from an external crystal oscillator; the internal clocks in the processor are kept in phase (although usually with a high multiplier, like 25x) with circuitry called a Phase-Locked Loop (PLL). Recent processors will clock the TSC at the highest frequency (multiplier) that the processor is rated for (so-called constant TSC), regardless of the multiplier that any individual core may be using due to temperature or power-management throttling (so-called invariant TSC). Nehalem processors like the X5570 released in 2008 (and newer Intel processors) support a "Non-stop TSC" that will continue ticking even when conserving power in a deep power-down C-state (C6). See this link for more information on the different power-down states:
http://www.anandtech.com/show/2199
Upon further research I came across a patent Intel filed on 12/22/2009, published on 6/23/2011, entitled "Controlling Time Stamp Counter (TSC) Offsets For Multiple Cores And Threads":
http://www.freepatentsonline.com/y2011/0154090.html
Google's page for this patent application (with link to USPTO page)
http://www.google.com/patents/US20110154090
From what I gather there is one TSC in the uncore (the logic in a package surrounding the cores but not part of any core) which is incremented on every external bus clock by the value in the field of the machine-specific register specified by Vipin Kumar in the link above (MSR_PLATFORM_INFO[15:8]). The external bus clock runs at 133.33 MHz. In addition, each core has its own TSC register, clocked by a clock domain that is shared by all cores and may be different from the clock for any one core; therefore there must be some kind of buffering when the core TSC is read by the RDTSC (or RDTSCP) instruction running in a core. For example, MSR_PLATFORM_INFO[15:8] may be set to 25 on a package; every bus clock, the uncore TSC increments by 25, and a PLL multiplies the bus clock by 25 and provides this clock to each of the cores to clock their local TSC registers, thereby keeping all TSC registers in sync. So, to map the terminology to actual hardware:
Constant TSC is implemented by using the external bus clock running at 133.33 MHz which is multiplied by a constant multiplier specified in MSR_PLATFORM_INFO[15:8]
Invariant TSC is implemented by clocking each core's TSC from a clock domain separate from the core's own clock, so per-core frequency throttling does not affect it
Non-stop TSC is implemented by having an uncore TSC that is incremented by MSR_PLATFORM_INFO[15:8] ticks on every bus clock; that way a multi-core package can go into deep power-down (C6 state) and can shut down the PLL, since there is no need to keep a clock running at the higher multiplier. When a core resumes from the C6 state, its internal TSC is initialized to the value of the uncore TSC (the one that didn't go to sleep), with an offset adjustment in case software has written a value to the TSC; the details are in the patent. If software does write to the TSC, then the TSC for that core will be out of phase with the other cores, but at a constant offset (the frequencies of the TSC clocks are all tied to the bus reference clock by a constant multiplier).
On newer CPUs (i7 Nehalem+, IIRC) the TSC is synchronized across all cores and runs at a constant rate.
So for a single processor, or more than one processor on a single package or mainboard(!), you can rely on a synchronized TSC.
From the Intel System Manual, Section 16.12.1:
The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward.
On older processors you cannot rely on either a constant rate or synchronization.
Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.
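To test that CPUID bit at run time, here is a sketch using the GCC/Clang cpuid.h helper:

#include <cpuid.h>
#include <stdbool.h>

/* CPUID.80000007H:EDX[8] advertises invariant TSC */
bool has_invariant_tsc(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;                 /* leaf not supported */
    return (edx >> 8) & 1;
}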
RDTSC is not synchronized across CPUs, so you cannot rely on it in multi-processor systems. The only workaround I can think of for Linux would be to actually restrict the process to run on a single CPU by setting its affinity. This can be done externally using the taskset utility or "internally" using the sched_setaffinity or pthread_setaffinity_np functions.
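A minimal sketch of the "internal" variant, pinning the calling process to CPU 0 so every RDTSC reads the same TSC:

#define _GNU_SOURCE
#include <sched.h>

int pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = self */
}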
This manual, chapter 17.12, describes the invariant TSC used in the newest processors. Available with Nehalem, this timestamp, along with the rdtscp instruction, allows one to read a timestamp (not affected by wait states, etc.) and a processor signature in one atomic operation.
It is said to be suitable for calculating wall-clock time, but it obviously doesn't expect the value to be the same across processors. The stated idea is that you can see if successive reads are to the same CPU's clock, or to adjust for multiple CPU reads. "It can also be used to adjust for per-CPU differences in TSC values in a NUMA system."
See also rdtsc accuracy across CPU cores
However, I'm not sure that the final consistency conclusion in the accepted answer follows from the statement that the TSC can be used for wall-clock time. If it were consistent, what reason would there be for atomically determining the CPU source of the time?
N.B. The TSC information has moved from chapter 11 to chapter 17 in that Intel manual.
