CUDA: Difference between CPU timer and CUDA timer event?

What is the difference between using a CPU timer and the CUDA timer event to measure the time taken for the execution of some CUDA code?
Which of these should a CUDA programmer use?
And why?
What I know:
CPU timer usage would involve calling cudaDeviceSynchronize (formerly cudaThreadSynchronize) before any time is noted.
For noting the time, one of these could be used:
clock()
high-resolution performance counter like QueryPerformanceCounter (on Windows)
CUDA timer event would involve recording before and after by using cudaEventRecord. At a later time, the elapsed time would be obtained by calling cudaEventSynchronize on the events, followed by cudaEventElapsedTime to obtain the elapsed time.
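For reference, a minimal sketch of both approaches on a Linux host (plain C against the CUDA runtime API; the timed operation is just a host-to-device copy, and error checking is omitted):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>           /* clock_gettime, CLOCK_MONOTONIC */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    void *h_buf = malloc(bytes), *d_buf;
    cudaMalloc(&d_buf, bytes);

    /* CPU timer: synchronize the device before each clock reading */
    struct timespec t0, t1;
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double cpu_ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;

    /* CUDA events: record before and after, synchronize, then query elapsed time */
    cudaEvent_t start, stop;
    float gpu_ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("CPU timer: %.3f ms, CUDA events: %.3f ms\n", cpu_ms, gpu_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return 0;
}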

The answer to the first part of the question is that cudaEvents timers are based on high-resolution counters on board the GPU, and they have lower latency and better resolution than a host timer because they come "off the metal". You should expect sub-microsecond resolution from the cudaEvents timers, and you should prefer them for timing GPU operations for precisely that reason. The per-stream nature of cudaEvents is also useful for instrumenting asynchronous operations like simultaneous kernel execution and overlapped copy and kernel execution. That sort of time measurement is just about impossible using host timers.

The main advantage of using CUDA events for timing is that they're less subject to perturbations due to other system events, like paging or interrupts from the disk or network controller. Also, because the cu(da)EventRecord is asynchronous, there is less of a Heisenberg effect when timing short, GPU-intensive operations.
Another advantage of CUDA events is that they have a clean cross-platform API - no need to wrap gettimeofday() or QueryPerformanceCounter().
One final note: use caution when using streamed CUDA events for timing - if you do not specify the NULL stream, you may wind up timing operations that you did not intend to. There is a good analogy between CUDA events and reading the CPU's time stamp counter, which on modern superscalar processors has to be paired with a serializing instruction so that out-of-order execution does not make the timing ambiguous. Also like RDTSC, you should always bracket the events you want to time with enough work that the timing is meaningful (just as you can't use RDTSC to meaningfully time a single machine instruction).
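As a sketch of that per-stream usage (host-side C with the CUDA runtime API; error checking omitted, and the host buffer would need to be pinned with cudaHostAlloc for the copy to be truly asynchronous), recording both events into the same non-NULL stream times the work issued into that stream between the two records:

#include <stddef.h>
#include <cuda_runtime.h>

/* Time an asynchronous host-to-device copy issued into a specific stream. */
float time_async_copy(void *dst, const void *src, size_t bytes)
{
    cudaStream_t stream;
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);                 /* bracket the work in this stream */
    cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);                     /* wait until 'stop' has been reached */
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return ms;
}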

Related

How to implement time counting in a new operating system?

I have to implement the function sleep() in an operating system where it does not currently exist.
The problem is, I have to count the elapsed time in order to wake the sleeping thread up.
How should I realize this? Do I have to count CPU ticks, or is there another way?
Aren't CPU ticks dependent on the CPU frequency, which differs from CPU to CPU?
I have to implement the function in C.
The time() function doesn't exist either.
Thank you in advance!
Typically, such functionality is provided by a hardware timer interrupt (and its associated driver) that manages a 'tick count' and a delta-queue of 'Thread Control Block' pointers (pTCB). The pTCBs for sleeping threads are stored in the queue ordered by interval expiry tick count. The timer interrupt increments the tick count and checks it against the expiry count of the item at the head of the queue.
When a thread requests a sleep, its pTCB is taken out of the set of ready threads, the expiry count is calculated, and the pTCB is inserted into the timer queue. When the pTCB reaches the head of the queue and its expiry tick has arrived, it is popped and added back to the set of ready threads so that it may be set running.
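A hedged sketch of that scheme in C (the TCB layout, the ready-set calls and the reschedule hook are hypothetical placeholders for whatever your kernel already provides; locking/interrupt masking around the queue is omitted):

#include <stdint.h>

typedef struct tcb {
    struct tcb *next;
    uint32_t    wake_tick;          /* absolute tick at which to wake */
    /* ... rest of the Thread Control Block ... */
} tcb_t;

static volatile uint32_t tick_count;    /* incremented by the timer ISR */
static tcb_t *sleep_queue;              /* sorted by wake_tick, earliest first */

/* Hypothetical hooks provided elsewhere by the kernel. */
extern tcb_t *current_thread(void);
extern void   remove_from_ready_set(tcb_t *t);
extern void   add_to_ready_set(tcb_t *t);
extern void   reschedule(void);

void sleep_ticks(uint32_t ticks)
{
    tcb_t *t = current_thread();
    t->wake_tick = tick_count + ticks;

    remove_from_ready_set(t);

    /* Insert into the sleep queue, keeping it sorted by expiry (wrap-safe compare). */
    tcb_t **pp = &sleep_queue;
    while (*pp && (int32_t)((*pp)->wake_tick - t->wake_tick) <= 0)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;

    reschedule();                       /* switch to another ready thread */
}

/* Called from the periodic timer interrupt. */
void timer_isr(void)
{
    ++tick_count;
    while (sleep_queue &&
           (int32_t)(tick_count - sleep_queue->wake_tick) >= 0) {
        tcb_t *t = sleep_queue;
        sleep_queue = t->next;
        add_to_ready_set(t);            /* thread is runnable again */
    }
}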
It totally depends on your platform/OS. It has to provide you with some time-like information, e.g. ticks; otherwise it is just impossible.
Converting ticks to seconds of course requires additional information. Again, this can be supplied by your platform, or you have to find it out by other means (manual, configuring it yourself, ...).
The easiest and most common way to do that in operating systems is to set up a timer interrupt at a static frequency, then build a timer framework on top of that, then use that timer framework to fire off wakeups for your sleeping threads.
A good paper that discusses various data structures for doing this efficiently is linked here. From my own experience I recommend scheme 7; it's quite easy to implement and performs wonderfully.
You can find a fast implementation with a good API here. But I'm biased, because I wrote it.
If you don't want a timer interrupt with a static frequency, it becomes much harder to implement a nice timer facility with good performance. I've done a few experiments, but I'd recommend starting with a simple timer interrupt at a static frequency. Once you start doing dynamic timers, you need to understand exactly which trade-offs you are prepared to make.
You can use time():
time_t t = time(NULL);
while (time(NULL) < t + sleepDuration)
    ;   /* busy-wait; time() only has one-second resolution */
You may use the CPU's Time Stamp Counter (TSC) to get counter values for timekeeping. See Chapter 16.12.1 of the "Intel® 64 and IA-32 Architectures Software Developer's Manual".
The TSC is a low-level counter which may provide counter values independent of CPU speed:
"The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8].
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."
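For illustration, a sketch of reading the TSC from C with GCC/Clang (the __rdtsc intrinsic plus the CPUID.80000007H:EDX[8] check quoted above); note that converting TSC ticks to seconds still requires calibrating the TSC frequency by other means:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */
#include <cpuid.h>       /* __get_cpuid (GCC/Clang) */

int main(void)
{
    unsigned eax, ebx, ecx, edx;

    /* CPUID.80000007H:EDX[8] indicates invariant TSC support. */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC supported");

    uint64_t t0 = __rdtsc();
    /* ... code being timed ... */
    uint64_t t1 = __rdtsc();

    printf("elapsed: %llu TSC ticks\n", (unsigned long long)(t1 - t0));
    return 0;
}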
However, for implementing sleep()-like functionality you should look into timer hardware such as the HPET or ACPI timers. See "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2" and the "IA-PC HPET (High Precision Event Timers) Specification" for details.

Scheduling events at microsecond granularity in POSIX

I'm trying to determine the granularity I can accurately schedule tasks to occur in C/C++. At the moment I can reliably schedule tasks to occur every 5 microseconds, but I'm trying to see if I can lower this further.
Any advice on how to achieve this / if it is possible would be greatly appreciated.
Since I know timer granularity can often be OS dependent: I am currently running on Linux, but would use Windows if the timing granularity were better (although I don't believe it is, based on what I've found for QueryPerformanceCounter).
I execute all measurements on bare-metal (no VM). /proc/timer_info confirms nanosecond timer resolution for my CPU (but I know that doesn't translate to nanosecond alarm resolution)
Current
My current code can be found as a Gist here
At the moment, I'm able to execute a request every 5 microseconds (5000 nanoseconds) with less than 1% late arrivals. When late arrivals do occur, they are typically only one cycle (5000 nanoseconds) behind.
I'm doing 3 things at the moment
Setting the process to real-time priority (as pointed out by @Spudd86 here)
struct sched_param schedparm;
memset(&schedparm, 0, sizeof(schedparm));
schedparm.sched_priority = 99; // highest rt priority
sched_setscheduler(0, SCHED_FIFO, &schedparm);
Minimizing the timer slack
prctl(PR_SET_TIMERSLACK, 1);
Using timerfds (part of the 2.6 Linux kernel)
int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
struct itimerspec timspec;
bzero(&timspec, sizeof(timspec));
timspec.it_interval.tv_sec = 0;
timspec.it_interval.tv_nsec = nanosecondInterval;  // period between expirations
timspec.it_value.tv_sec = 0;
timspec.it_value.tv_nsec = 1;                      // arm the timer: first expiry almost immediately
timerfd_settime(timerfd, 0, &timspec, 0);
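The waiting side of this is then just a blocking read() of an 8-byte expiration count from the timerfd; a minimal sketch (error handling omitted, run_loop is just an illustrative name):

#include <stdint.h>
#include <unistd.h>

/* Block until the timer expires, then handle each tick. */
void run_loop(int timerfd)
{
    uint64_t expirations;
    for (;;) {
        read(timerfd, &expirations, sizeof(expirations));  /* blocks until next expiry */
        /* ... issue the next request(s) here; expirations > 1 means we fell behind ... */
    }
}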
Possible improvements
Dedicate a processor to this process?
Use a nonblocking timerfd so that I can create a tight loop, instead of blocking (tight loop will waste more CPU, but may also be quicker to respond to an alarm)
Using an external embedded device for triggering (can't imagine why this would be better)
Why
I'm currently working on creating a workload generator for a benchmarking engine. The workload generator simulates an arrival rate (X requests / second, etc.) using a Poisson process. From the Poisson process, I can determine the relative times at which requests must be made from the benchmarking engine.
So for instance, at 10 requests a second, we may have requests made at:
t = 0.02, 0.04, 0.05, 0.056, 0.09 seconds
These requests need to be scheduled in advance and then executed. As the number of requests per second increases, the granularity required for scheduling these requests increases (thousands of requests per second requires sub-millisecond accuracy). As a result, I'm trying to figure out how to scale this system further.
You're very close to the limits of what vanilla Linux will offer you, and it's way past what it can guarantee. Adding the real-time patches to your kernel and tuning for full pre-emption will help give you better guarantees under load. I would also remove any dynamic memory allocation from your time-critical code; malloc and friends can (and will) stall for a not-inconsequential (in a real-time sense) period of time if memory has to be reclaimed from the I/O cache. I would also consider removing swap from that machine to help guarantee performance. Dedicating a processor to your task will help to avoid context-switch delays, but, again, it's no guarantee.
I would also suggest that you be careful with that level of sched_priority; you're above various important bits of Linux there, which can lead to very strange effects.
What you gain from building a realtime kernel is more reliable guarantees (i.e. lower maximum latency) on the time between an IO/timer event handled by the kernel and control being passed to your app in response. This comes at the price of lower throughput, and you might notice an increase in your best-case latency times.
However, the only reason to use OS timers to schedule events with high precision is if you're afraid of burning CPU cycles in a loop while you wait for your next due event. OS timers (especially in MS Windows) are not reliable for high-granularity timing events, and are very dependent on the sort of timing/HPET hardware available in your system.
When I require highly accurate event scheduling, I use a hybrid method. First, I measure the worst case latency - that is, the biggest difference between the time I requested to sleep, and the actual clock time after sleeping. Let's call this difference "D". (You can actually do this on-the-fly during normal running, by tracking "D" every time you sleep, with something like "D = (D*7 + lastD) / 8" to produce a temporal average).
Then never request to sleep beyond "N - D*2", where "N" is the time of the next event. When within "D*2" time of the next event, enter a spin loop and wait for "N" to occur.
This eats a lot more CPU cycles, but depending on the accuracy you require, you might be able to get away with a "sched_yield()" in your spin loop, which is more kind to your system.
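A sketch of that hybrid scheme on Linux, using CLOCK_MONOTONIC with absolute deadlines (the 2*D margin and the running-average update are the ones described above; the initial value of D is just a guess):

#include <stdint.h>
#include <time.h>

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static uint64_t D = 50000;              /* running estimate of oversleep, ns (initial guess) */

/* Wait until absolute deadline N (nanoseconds on CLOCK_MONOTONIC). */
void wait_until(uint64_t N)
{
    uint64_t t = now_ns();

    if (N > t + 2 * D) {                /* sleep most of the way ... */
        uint64_t target = N - 2 * D;
        struct timespec until;
        until.tv_sec  = target / 1000000000ull;
        until.tv_nsec = target % 1000000000ull;
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &until, NULL);

        uint64_t late = now_ns() - target;          /* how much we overslept */
        D = (D * 7 + late) / 8;                     /* temporal average, as above */
    }

    while (now_ns() < N)                /* ... then spin (optionally call sched_yield() here) */
        ;
}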

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds, 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of time in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?
Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
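For example (a sketch, not a recommendation): declaring the counter volatile keeps the optimizer from deleting the loop, but the cycles-per-iteration figure still has to be calibrated for your compiler settings, which is why the hardware approaches discussed below are preferable:

/* Crude, calibration-dependent busy wait: 'volatile' stops the optimizer
   from deleting the loop, but each iteration costs several cycles (not one),
   and the exact count must be measured for your compiler and clock setup. */
void delay_iterations_approx(unsigned long iterations)
{
    volatile unsigned long i;
    for (i = 0; i < iterations; i++)
        ;
}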
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.
The chip you refer to has PWM (pulse width modulation) hardware advertised as one of its major features. You should rely on this; please refer to the appropriate application guide. Generally you cannot guarantee 12.5 us periods from the application layer (and should not try to do so). Even if you managed to do it directly from the application layer, it would be a bad idea: any change in your firmware code could break it.
If you use a timer peripheral with PWM output capability as suggested by @RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However, if you process interrupts at an interval of 12.5 us, you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This keeps the timing consistent regardless of the compiler's code generation or the processor speed, and lets you do other work inside the loop without extending the total loop time. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer event will probably be faster than the latency of an interrupt handler.
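A hedged sketch of that polling pattern; TIMER_FLAG_SET, TIMER_CLEAR_FLAG and GPIO_TOGGLE are hypothetical helpers standing in for the real C2000 register accesses, and the timer is assumed to be already configured for a 12.5 us period (1000 cycles at 80 MHz):

/* Hypothetical helpers: map these to the real device-header register accesses. */
extern int  TIMER_FLAG_SET(void);    /* nonzero when the timer period has elapsed */
extern void TIMER_CLEAR_FLAG(void);  /* acknowledge/clear the overflow flag       */
extern void GPIO_TOGGLE(void);       /* toggle the output pin                     */

/* Toggle the pin 5 times, once per timer period, by polling the timer's
   reload/overflow flag instead of taking an interrupt. */
void toggle_five_times(void)
{
    int n;
    for (n = 0; n < 5; n++) {
        while (!TIMER_FLAG_SET())
            ;                        /* wait for the period to elapse */
        TIMER_CLEAR_FLAG();
        GPIO_TOGGLE();
    }
}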

Internal working of timer

I know QueryPerformanceCounter() can be used for timing functions. I want to know:
1 - Can I increase the resolution of the timer by overclocking the CPU (so that it ticks faster)?
2 - Basically, what makes some timers more precise than others (e.g., QueryPerformanceCounter() is more precise than GetTickCount())? If there is a single crystal oscillator on the motherboard, why are some timers slower than others?
QueryPerformanceCounter has very high resolution - normally less than one nanosecond. I don't see why you'd like to increase it. Overclocking will increase it, but it seems like a very weak reason for overclocking.
QueryPerformanceCounter is very accurate, but somewhat expensive and not very convenient.
a. It's expensive because it uses the expensive rdtsc instruction. Faster timers can just read an integer from memory. This integer needs to be updated, and we don't want to do that too often (1000 times a second is reasonable), so we get a very cheap timer with low precision. That's basically GetTickCount.
b. It's inconvenient because it uses units which change between computers. Sometimes it will be nanoseconds, sometimes half a nanosecond, or other values. That makes it harder to calculate with (a conversion sketch follows below).
c. Another source of inconvenience is that it returns very large numbers, which may overflow when you try to do math with them, so you need to be careful.
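A sketch of the usual way around the unit problem: read QueryPerformanceFrequency once and divide the tick delta by it (64-bit math throughout):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);   /* counts per second, machine dependent */
    QueryPerformanceCounter(&t0);

    /* ... code being timed ... */

    QueryPerformanceCounter(&t1);

    double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %.9f s\n", seconds);
    return 0;
}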
The timing source for QPC is machine dependent. It is typically picked up from a frequency available somewhere in the chipset. Whether overclocking the cpu is going to affect it is highly dependent on your motherboard design. The simplest way is to just try it, use QueryPerformanceFrequency to see the effect.
GetTickCount is driven from an entirely different timer source, the signal that also generates the clock interrupt. It is not very precise, normally 1/64 of a second, but it is highly accurate. Your machine contacts a time server from time to time to recalibrate the clock and adjust the clock correction factor, which makes it accurate to about a second over an entire year. QPC is very precise, but not nearly as accurate. Use it only to time short intervals.
1 - Yes. Internally, one of the better timers is rdtsc, which gives you the CPU's clock (cycle) count. Combining this with information from the cpuid instruction gives you time.
2 - The other timers rely upon various timing sources, such as the 8253 timer.
QPC is a wrapper added by Microsoft on top of what rdtsc provides. Read this article for more info:
http://www.strchr.com/performance_measurements_with_rdtsc

ARM926EJ-S cycle-counter

I'm using an ARM926EJ-S and am trying to figure out whether the ARM can expose (e.g. via a readable register) the CPU's cycle counter, i.e. a number representing the number of cycles elapsed since the CPU was powered on.
In my system I only have low-resolution external RTCs/timers. I would like to be able to achieve a high-resolution timer.
Many thanks in advance!
You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an overhead in servicing the timer on either side of the period being timed, but it can be used to give execution time over larger or longer-running sections of code, which may be of more practical use than a cycle count (a sketch of this approach follows below).
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.
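If you take the peripheral-timer route, the usual pattern is to let one hardware timer free-run and subtract successive readings of its count register; a hedged sketch, where TIMER_COUNT_REG and TIMER_TICKS_PER_SEC are hypothetical placeholders for your SoC's actual timer register address and programmed rate:

#include <stdint.h>

/* Hypothetical: address of a free-running 32-bit timer count register and the
   rate it was programmed to run at; both come from your SoC's reference manual. */
#define TIMER_COUNT_REG     (*(volatile uint32_t *)0xFFFE0010u)
#define TIMER_TICKS_PER_SEC 12000000u

static inline uint32_t timer_now(void)
{
    return TIMER_COUNT_REG;
}

/* Elapsed ticks between two readings; unsigned arithmetic handles wrap-around
   as long as the interval is shorter than one full timer period. */
static inline uint32_t timer_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}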
The ARM9 is not equipped with a PMU (Performance Monitoring Unit) like the one included in the Cortex family. The PMU is described here. The Linux kernel comes with support for using the PMU for benchmarking performance; see here for documentation of the perf tool-set.
A bit unsure about the ARM9, need to dig a bit more...
