Reliable to use cudaEvent_t to measure CPU time? - c

I have a simple kernel that does not use multiple events, and I want to measure the difference between it and a CPU version of it that I have already written. I don't know whether events are meant strictly for CUDA, but I guess my example is simple enough, and contains nothing unusual, for this to be OK. Opinions?

If you are measuring time on the CPU, nothing is better than high-performance counters; in Java, for example, you can measure time in the nanosecond range. Events are generally used for the GPU because the start and stop events are added to the GPU queue, not the CPU one.

Related

Timings differ while measuring a CUDA kernel

I am trying to measure the time taken by a CUDA kernel function. I measure both CPU and GPU timings of it, but I'm getting a huge difference between the two.
When I profile it using the NVIDIA profiler, the kernel takes around 6ms, which is what I want. But when I used gettimeofday() around the kernel call to get the CPU timings, the measurement was 15ms. I do not have any memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code :
gettimeofday(start);
cudaEventRecord(startGPU);
Kernel <<<abc, xyz,stream>>>();
cudaDeviceSynchronize();
cudaEventRecord(stopGPU);
printf("Elapsed GPU time = ");
gettimeofday(stop);
printf("Elapsed CPU time = ");
Results I'm getting for the above code :
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
It is weird because there is only kernel execution line present. The kernel params are however pointers. Is the extra time being taken by mem copies? But I do not find mem copies anywhere in the profile too. Any leads would be appreciated.
Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but, rather, the total system time elapsed (which potentially includes processing time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the CPU time you report is still much larger compared to the GPU time than I would normally expect. But I'm not sure that that up there really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyways, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the time between the start and end events will cover kernel execution, the CPU waiting for kernel execution to finish, and whatever time it then takes to record the second event and have it reach the GPU, just so you can then ask the GPU at what time it got it. Think of events like markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, stream>>>();
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::chrono::steady_clock or std::chrono::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
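For plain C on a POSIX system, a rough counterpart of a steady C++ clock is clock_gettime() with CLOCK_MONOTONIC. The sketch below is only an illustration: monotonic_seconds() and time_spin() are hypothetical helper names, and the spin loop is a stand-in workload, not anything from the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds on a monotonic clock; not affected by wall-clock adjustments. */
double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Stand-in workload (assumption): spin for `iterations` loop passes and
   return the elapsed wall-clock seconds measured around the loop. */
double time_spin(unsigned long iterations)
{
    volatile unsigned long sink = 0; /* volatile keeps the loop from being optimized away */
    double start = monotonic_seconds();
    for (unsigned long i = 0; i < iterations; i++)
        sink += i;
    return monotonic_seconds() - start;
}
```

To time a kernel from the host in this way, the launch would have to be followed by a synchronization call before the second clock reading, because kernel launches are asynchronous.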
After debugging the same issue, I found that CUDA usually takes extra time before the first kernel launch, as discussed in the forum here: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel included 6ms of cudaMalloc and 14ms of cudaLaunch, which was the reason for the extra delay. The subsequent kernels, however, ran normally. cudaLaunch usually takes microseconds, so if it takes much longer than that, something needs fixing.
NOTE: If you are running CUDA kernels in a while(1) loop, the allocation must be done only once, outside the loop. Otherwise you will end up with delays just like this.

Set CPU usage or manipulate other system resource in C

I have a specific application to write in C. Is there any way to programmatically set the CPU usage of a process? I want to set the CPU usage of a specific process (mine) to, e.g., 20% for a few seconds and then go back to regular usage. while(1) takes 100% CPU, so it's not the best idea for me. Are there any other ideas for manipulating system resources, and functions that can provide this? I have already done memory allocation manipulation, but I need other ideas about manipulating system resources.
Thanks!
What I know is that you may be able to control your application's priority depending on the operating system.
Also, a function equivalent to Sleep() reduces CPU load as it causes your application to relinquish CPU cycles to other running programs.
Have you ever tried to answer a question that became more and more complicated once you dug into it?
What you do depends upon what are you trying to accomplish. Do you want to utilize "20% by specific (mine) process for few seconds and then back to regular usage"? Or do you want to utilize 20% of all the CPU usage of the entire processor? Over what interval do you want to use 20%? Averaged over 5 sec? 500 msec? 10 msec?
20% of your process is pretty easy as long as you don't need to do any real work and want 20% of the average over a reasonably long interval, say 1 sec.
for( i=0; i<INTERVAL_CNT; i++ ) //untested sketch
{
    for( j=0; j<INTERVAL_CNT*PERCENT/100; j++ ) //multiply first: integer PERCENT/100 would be 0
    {
        //some work the compiler won't optimize away
    }
    sleep( INTERVAL_CNT*(100-PERCENT)/100 ); //idle for the rest of the interval
}
Adjusting this for doing real work is more difficult. Note the comment about the compiler doing optimization. Optimizing compilers are pretty smart and will identify and remove code that does nothing useful. For example, if you use myVar++, declare it local to a certain scope, and never use it, the compiler will remove it to make your app run faster.
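As a slightly more concrete version of the duty-cycle idea above, here is a minimal sketch for a POSIX system. It is an illustration under assumptions, not a tested load generator: run_duty_cycle() and now_seconds() are hypothetical names, the busy phase is timed with the wall clock (so a preempted process will use less CPU than requested), and the volatile counter is only there to stop the compiler from removing the loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Seconds on a monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: for `periods` periods of `period_ms` milliseconds,
   spin for `percent` of each period and sleep for the rest.
   Returns the total wall-clock seconds spent spinning. */
double run_duty_cycle(int percent, long period_ms, int periods)
{
    assert(percent >= 0 && percent <= 100 && period_ms > 0 && periods >= 0);

    double busy_s = (double)period_ms * percent / 100.0 / 1000.0;
    long long idle_ns = (long long)period_ms * 1000000LL * (100 - percent) / 100;

    struct timespec idle;
    idle.tv_sec = (time_t)(idle_ns / 1000000000LL);
    idle.tv_nsec = (long)(idle_ns % 1000000000LL);

    volatile unsigned long sink = 0; /* work the compiler cannot discard */
    double total_busy = 0.0;

    for (int p = 0; p < periods; p++) {
        double start = now_seconds();
        while (now_seconds() - start < busy_s)
            sink++;                  /* busy phase */
        total_busy += now_seconds() - start;
        nanosleep(&idle, NULL);      /* idle phase */
    }
    return total_busy;
}
```

For example, run_duty_cycle(20, 1000, 5) would aim for roughly 20% load on one core, averaged over each one-second period, for about five seconds.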
If you want a more continuous load (read that as a load of 20% at any sampling point vs a square wave with a certain duty cycle), it's going to be complicated. You might be able to do this with some experimentation by launching multiple CPU-consuming threads. Having multiple threads with offset duty cycles should give you a smoother load.
20% of the entire processor is even more complicated since you need to account for multiple factors such as other processes executing, process priority, and multiple CPUs in the processor. I'm not going to get into any detail, but you might be able to do this using multiple heavyweight processes executing simultaneously with offset duty cycles, along with a master thread that samples the processor load and dynamically adjusts the heavyweight processes through a set of shared variables.
Let me know if you want me to confuse the matter even further.

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sort to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the runtime of a specific sorting algorithm, or decides that some other background application should be given more CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it is possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus kernel boot parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimum kernel version is.
If you use it, you'll need to use sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there.
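A minimal, Linux-specific sketch of that pinning step might look like the following. pin_to_cpu() is a hypothetical helper name; which CPU number is actually isolated depends on how isolcpus was set on your machine, so the index passed in is an assumption you have to supply yourself.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to a single CPU.
   Returns 0 on success and -1 on failure (bad index or rejected by the kernel). */
int pin_to_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return 0;
}
```

With a kernel booted with, say, isolcpus=3, calling pin_to_cpu(3) before starting the timed section would move the thread onto the isolated CPU.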
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time counting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID if you make sure the process doesn't migrate between CPUs (you can set its affinity for example).
The difference between the times returned by clock_gettime with those arguments is the exact time the process/thread spent executing. The only pitfall, as I mentioned, is process migration; as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
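As an illustration of the clock_gettime approach, here is a small sketch that measures the CPU time of a sort. The names process_cpu_seconds() and time_qsort_cpu() are made up for this example, and qsort() on pseudo-random ints merely stands in for whatever sorting algorithm you are benchmarking.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* CPU seconds consumed so far by this process. */
double process_cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical benchmark: CPU seconds qsort() needs to sort n
   pseudo-random ints. Returns -1.0 if the allocation fails. */
double time_qsort_cpu(size_t n, unsigned seed)
{
    int *data = malloc(n * sizeof *data);
    if (data == NULL)
        return -1.0;

    srand(seed);
    for (size_t i = 0; i < n; i++)
        data[i] = rand();

    /* Only the sort itself sits between the two clock readings. */
    double start = process_cpu_seconds();
    qsort(data, n, sizeof *data, compare_ints);
    double elapsed = process_cpu_seconds() - start;

    free(data);
    return elapsed;
}
```

Time spent while the process is descheduled is not charged to this clock, which is why other processes running in the meantime matter much less than with a wall-clock measurement.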

time() and context switching

I am more or less wondering how time() is implemented in the C standard library and what would happen in the situation described below. Although this time is most-likely negligible, consider a situation where you have a hard-limit on time and no control over the CPU scheduler (assume that it is a "good" scheduler for a general-purpose CPU).
Now, if I use time() to calculate my execution time of a particular section of code and use this time subtracted from some maximum bound to determine some other time-dependent variable, how would this variable be skewed based on context-switches? I know we could use nice and other tools (i.e. custom scheduler, etc.) to be certain we get full CPU usage when we need it, however, I am wondering how this works in general for similar situations as this and what side-effects exist due to the system's choices.
time is supposed to measure wall-time. I.e., it gives the current time, regardless of how much or little your process has run.
If you want to measure cpu time, you should use clock instead (though some vendors such as MS implement it wrong, so it does wall time also).
Of course, there are also other tools to retrieve CPU usage, such as times on Unix-like systems or GetProcessTimes on Windows. Most people find these more useful despite the reduced portability.
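To make the wall-time versus CPU-time distinction concrete, the sketch below sleeps for a while and reports both. It assumes a POSIX system where clock() counts CPU time (as noted above, not every vendor does this); struct elapsed and measure_sleep() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

struct elapsed {
    double wall_seconds; /* from time(), one-second resolution */
    double cpu_seconds;  /* from clock() */
};

/* Hypothetical helper: sleep for `seconds` and report how much wall
   time and how much CPU time passed in the meantime. */
struct elapsed measure_sleep(unsigned seconds)
{
    time_t wall_start = time(NULL);
    clock_t cpu_start = clock();

    struct timespec pause;
    pause.tv_sec = (time_t)seconds;
    pause.tv_nsec = 0;
    nanosleep(&pause, NULL); /* the process uses almost no CPU while asleep */

    struct elapsed result;
    result.cpu_seconds = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;
    result.wall_seconds = difftime(time(NULL), wall_start);
    return result;
}
```

On such a system, the wall-clock figure should track the sleep duration while the CPU figure stays close to zero, which is the same gap a context switch would open between the two.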

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
An example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
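The question is about a Ruby benchmarking module, but the same user/system split is available at the OS level. As a rough sketch in C for a POSIX system, getrusage() reports both figures for the current process; struct cpu_times and read_cpu_times() are names invented for this example.

```c
#define _XOPEN_SOURCE 700
#include <sys/resource.h>
#include <sys/time.h>

struct cpu_times {
    double user_seconds;   /* CPU time spent running the program's own code */
    double system_seconds; /* CPU time the kernel spent on the program's behalf */
};

static double timeval_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: read the user and system CPU time consumed so far
   by this process. Returns 0 on success and -1 on failure. */
int read_cpu_times(struct cpu_times *out)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;

    out->user_seconds = timeval_seconds(usage.ru_utime);
    out->system_seconds = timeval_seconds(usage.ru_stime);
    return 0;
}
```

Calling read_cpu_times() before and after the code under test and subtracting gives the user and system portions separately; their sum corresponds to the total CPU time in the example above.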
A multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively working.
You don't want to totally discount clock-on-the-wall time. Time used to wait w/o another thread ready to utilize CPU cycles may make one piece of code less desirable than another. One set of code may take some more CPU time, but, employ multi-threading to dominate over the other code in the real world. Depends on requirements and specifics. My point is ... use all metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time when your code is executing is not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).
