I am currently using the do_gettimeofday() function to measure time in the kernel, which gives me microsecond precision. Is there anything available that is more precise than this (maybe on the order of nanoseconds)?
The ktime_get() function returns ktime_t, which has nanosecond resolution.
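For example, a minimal kernel-side sketch of timing a code section with ktime_get() might look like this (the measured section is just a placeholder):

```c
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/timekeeping.h>

static void time_something(void)
{
	ktime_t start, end;
	s64 delta_ns;

	start = ktime_get();            /* monotonic clock, nanosecond resolution */
	/* ... code to be timed ... */
	end = ktime_get();

	delta_ns = ktime_to_ns(ktime_sub(end, start));
	pr_info("elapsed: %lld ns\n", delta_ns);
}
```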
As far as I know, the most precise timer is the processor-specific counter register (such as the TSC on x86). The Linux kernel provides the rdtsc, rdtscl, and rdtscll macros in "./arch/x86/include/asm/msr.h" to read this register. On ARM, the equivalent is the cycle counter register.
These registers differ from CPU to CPU. The common interface for accessing them is the get_cycles() function, declared in <linux/timex.h>.
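A rough kernel-side sketch using get_cycles() (note that on architectures without a usable cycle counter it may simply return 0):

```c
#include <linux/kernel.h>
#include <linux/timex.h>        /* get_cycles() */

static void count_cycles(void)
{
	cycles_t c1, c2;

	c1 = get_cycles();      /* TSC on x86, cycle counter on other arches */
	/* ... code to be timed ... */
	c2 = get_cycles();

	pr_info("cycles elapsed: %llu\n", (unsigned long long)(c2 - c1));
}
```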
Maybe this document can be helpful.
I am working on a project that first requires profiling the target applications.
What I want to know is the exact time consumed by a loop body/function. The platform is BeagleBone Black board with Debian OS and installed perf_4.9.
gettimeofday() can provide microsecond resolution, but I still want more accurate results. It looks like perf can give cycle statistics and thus be a good fit for my purposes. However, perf can only analyze the whole application rather than individual loops/functions.
After trying the instructions posted in Using perf probe to monitor performance stats during a particular function, I could not get it to work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of a comment than an answer, but it's a bit too long for just a comment.
Thanks a lot for suggesting a new function. I tried it, but I'm a little unsure about its accuracy. Yes, it can offer nanosecond resolution, but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body; the approximate elapsed time measured this way is around 1.4 µs. Then I put GPIO instructions, pulling the pin high and low, at the beginning and end of the loop body respectively, and measure the signal frequency on this GPIO with an oscilloscope.
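(For reference, the clock_gettime() part of that measurement looks roughly like this; CLOCK_MONOTONIC and the loop body are placeholders:)

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* ... loop body under test ... */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	               (t1.tv_nsec - t0.tv_nsec);
	printf("elapsed: %lld ns\n", ns);
	return 0;
}
```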
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers.
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative ones. That is, clock_gettime is ultimately based on the number of high-resolution clock ticks (at 1 ns/tick or better, from the system's free-running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get duration X, then change some code and get X+n, that tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although clock_gettime is technically a syscall, Linux will map the code directly into the app's address space via the VDSO mechanism, so there is no syscall overhead. But even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On the other system, the overhead is 1800 ns. Both of these systems run at 2 GHz or more.
For your BeagleBone/ARM system, the base clock rate may be lower, so an overhead of 500 ns may be in the ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
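One way to do that is to time a burst of back-to-back calls and take the average; a rough sketch (the iteration count and clock id are arbitrary choices):

```c
#include <stdio.h>
#include <time.h>

/* Estimate the cost of clock_gettime() itself by calling it back-to-back
 * many times, then subtract that from later measurements. */
static long long overhead_ns(void)
{
	struct timespec a, b;
	const int iters = 100000;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < iters; i++) {
		struct timespec t;
		clock_gettime(CLOCK_MONOTONIC, &t);
	}
	clock_gettime(CLOCK_MONOTONIC, &b);

	long long total = (b.tv_sec - a.tv_sec) * 1000000000LL +
	                  (b.tv_nsec - a.tv_nsec);
	return total / iters;   /* average cost of one call */
}

int main(void)
{
	printf("clock_gettime overhead: ~%lld ns per call\n", overhead_ns());
	return 0;
}
```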
And, on x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For ARM, there is a similar H/W register, but it requires special care to enable userspace access to it (via a coprocessor instruction, IIRC).
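On x86 in userspace, reading the TSC is a one-liner with the __rdtsc() intrinsic; a minimal sketch (x86-only, and on modern CPUs the TSC ticks at a constant rate rather than the current core frequency):

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
	uint64_t c0 = __rdtsc();
	/* ... code under test ... */
	uint64_t c1 = __rdtsc();

	printf("TSC delta: %llu cycles\n", (unsigned long long)(c1 - c0));
	return 0;
}
```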
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust the TSC [or software in general ;-)] and was trying to use a scope and an LED [controlled by a GPIO pin]: when the LED flash/pulse showed up inside the video frame, the coordinates of the white dot in the frame were [effectively] a time measurement.
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.
I don't understand how the clock function works within an operating system.
First, the documentation for the clock function on www.cplusplus.com says:
Returns the processor time consumed by the program.
The value returned is expressed in clock ticks[...]
As far as I understand, the clock function must directly access some register within the CPU that is a counter of CPU cycles. How is this possible? I mean, any 32-bit register would overflow very soon if it were being incremented by one at the CPU frequency. Is the OS handling this overflow in some way?
"Clock ticks" are implementation-defined, not in units of the cpu clock. Historically they were fixed-length, coarse-grained scheduling timeslices. There is no need for any special hardware TSC to implement clock. It suffices to simply count up the timeslices the process is scheduled for (which are ultimately enforced by a timer interrupt or similar).
As far as I understand, to measure the actual operating CPU frequency I need access to the model-specific registers (MSRs) IA32_APERF and IA32_MPERF (Assembly CPU frequency measuring algorithm).
However, access to the MSR registers is privileged (through the rdmsr instruction). Is there another way this can be done? I mean, for example, through a device driver/library which I could call in my code. It seems strange to me that reading the registers is privileged. I would think only writing to them would be privileged.
Note: the rdtsc instruction does not account for turbo boost and thus cannot report the actual operating frequency
Edit:
I'm interested in solutions for Linux and/or Windows.
You are right, the proper way to find the average CPU frequency is described in the 2nd answer at your link.
To read MSRs on Linux you can use the rdmsr tool.
The only thing that may be misleading in that answer is the max frequency. It should not be the max frequency but the nominal frequency (max non-turbo frequency), as the MPERF counter counts at the max non-turbo frequency. You can get this frequency from MSR 0xCE, bits 15:8 (ref).
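A rough Linux-only sketch of that APERF/MPERF calculation using the msr driver (/dev/cpu/N/msr, which needs root and `modprobe msr`; the 100 MHz bus-clock multiplier for the 0xCE ratio is an assumption that holds for most recent Intel CPUs):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR from /dev/cpu/0/msr; the MSR address is the file offset. */
static int read_msr(int fd, uint32_t reg, uint64_t *val)
{
	return pread(fd, val, sizeof(*val), reg) == sizeof(*val) ? 0 : -1;
}

int main(void)
{
	int fd = open("/dev/cpu/0/msr", O_RDONLY);
	if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

	uint64_t mperf0, aperf0, mperf1, aperf1, plat;
	read_msr(fd, 0xE7, &mperf0);       /* IA32_MPERF */
	read_msr(fd, 0xE8, &aperf0);       /* IA32_APERF */
	sleep(1);                          /* measurement interval */
	read_msr(fd, 0xE7, &mperf1);
	read_msr(fd, 0xE8, &aperf1);

	read_msr(fd, 0xCE, &plat);         /* MSR_PLATFORM_INFO */
	double base_mhz = ((plat >> 8) & 0xFF) * 100.0;  /* ratio * 100 MHz bus clock */

	double avg_mhz = base_mhz * (double)(aperf1 - aperf0) /
	                            (double)(mperf1 - mperf0);
	printf("average frequency: %.0f MHz\n", avg_mhz);
	close(fd);
	return 0;
}
```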
I have an embedded system, and from a kernel module/driver I would like to share the most accurate clock available with a userspace application that is very sensitive to timing.
The standard time functions require a syscall, with a context switch and a lot of overhead, so the clock won't be accurate by the time the call returns.
I thought of incrementing a shared integer every jiffy from the kernel module, assuming the userspace application can access it directly. The problem is that I can't share an integer/long that isn't aligned to a page boundary, and dedicating a whole new page to just one long variable is a lot of overhead.
What should I do?
You want to use clock_gettime() for obtaining the current time (since startup) with either CLOCK_MONOTONIC (monotonic but not steady as it is influenced by NTP) or CLOCK_MONOTONIC_RAW (monotonic and steady, but Linux specific and requires a kernel >= 2.6.28).
For waking up at exact intervals, use clock_nanosleep() and specify TIMER_ABSTIME. Unfortunately clock_nanosleep() only supports CLOCK_MONOTONIC and not CLOCK_MONOTONIC_RAW, so you cannot pass a wakeup time obtained with CLOCK_MONOTONIC_RAW because these clocks may differ. Don't forget to check the return code for EINTR.
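A minimal sketch of such a periodic loop (the 10 ms period is arbitrary):

```c
#include <stdio.h>
#include <time.h>
#include <errno.h>

/* Wake up at fixed 10 ms intervals using an absolute deadline so drift
 * does not accumulate. CLOCK_MONOTONIC is used because clock_nanosleep()
 * does not accept CLOCK_MONOTONIC_RAW. */
int main(void)
{
	struct timespec next;
	clock_gettime(CLOCK_MONOTONIC, &next);

	for (int i = 0; i < 100; i++) {
		next.tv_nsec += 10 * 1000 * 1000;        /* +10 ms */
		if (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		/* clock_nanosleep() returns the error number directly;
		 * retry with the same absolute deadline on EINTR. */
		while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
		                       &next, NULL) == EINTR)
			;
		/* ... periodic work ... */
	}
	return 0;
}
```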
What are the clock sources for the Linux time functions?
Do all of them (time, gettimeofday, ...) get updated clock values by reading the same hardware component? Or do they all just retrieve some current time value maintained by the kernel?
Do any of these functions read directly from the BIOS?
It varies depending on a large number of factors including what hardware is available, whether time synchronization is in use, and a number of other factors. On typical modern hardware, the TSC or HPET is read and scaled according to factors maintained by the kernel's timekeeping system.
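On Linux you can see which clocksource the kernel's timekeeping is currently using via sysfs, for example:

```c
#include <stdio.h>

/* Print the clocksource the kernel timekeeping code is currently using
 * (e.g. "tsc" or "hpet"). */
int main(void)
{
	char buf[64];
	FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
	if (!f) { perror("fopen"); return 1; }
	if (fgets(buf, sizeof(buf), f))
		printf("current clocksource: %s", buf);
	fclose(f);
	return 0;
}
```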