I am trying to measure the time taken by a CUDA kernel function. I measure both the CPU and GPU timings of it, but I'm getting a huge difference between the two.
When I profile it using the NVIDIA profiler, the kernel takes around 6ms, which is what I want. But when I used gettimeofday() around the kernel call to get the CPU timings, the measurement was 15ms. I do not have any memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code :
Kernel <<<abc, xyz,stream>>>();
printf("Elapsed GPU time = ");
printf("Elapsed CPU time = ");
Results I'm getting for the above code :
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
It is weird because only the kernel execution line is present. The kernel parameters are pointers, however. Is the extra time being taken by memory copies? But I do not find any memory copies in the profile either. Any leads would be appreciated.

Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but, rather, the total system time elapsed (which potentially includes processing time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the CPU time you report is still much larger compared to the GPU time than I would normally expect. But I'm not sure that that up there really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyways, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the time between the start and end events will cover the kernel execution, the CPU waiting for kernel execution to finish, and whatever time it may take to then record the second event and have that one get to the GPU just so you can then ask the GPU at what time it got it. Think of events as markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>(); // third launch parameter is the dynamic shared memory size; the stream is the fourth
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::chrono::steady_clock, or std::chrono::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
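For host code written in C rather than C++, a rough analogue of the steady clock mentioned above is clock_gettime() with CLOCK_MONOTONIC. The following is only a minimal sketch, assuming a POSIX system; the helper names monotonic_seconds, elapsed_seconds and busy_work are made up for illustration and are not part of any library.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and CLOCK_MONOTONIC */
#include <time.h>

/* Seconds on the monotonic clock; unaffected by wall-clock adjustments.
   Returns -1.0 if the clock is unavailable. */
double monotonic_seconds(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: elapsed real time, in seconds, spent inside work(arg). */
double elapsed_seconds(void (*work)(void *), void *arg)
{
    double start = monotonic_seconds();
    work(arg);
    return monotonic_seconds() - start;
}

/* Hypothetical example workload: spin for *(long *)arg iterations. */
void busy_work(void *arg)
{
    volatile long sink = 0;
    for (long i = 0; i < *(long *)arg; ++i)
        sink += i;
}
```

Like gettimeofday(), this still measures elapsed real time on the host, so wrapped around an asynchronous kernel launch it would time the launch call, not the kernel itself.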

After debugging the same issue, I found that CUDA usually takes extra time before the first kernel launch, as discussed in this forum thread: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel included 6ms of cudaMalloc and 14ms of cudaLaunch, which was the reason for the extra delay. The subsequent kernels, however, ran normally. cudaLaunch usually takes a time on the order of microseconds, so if it takes much longer than that, something needs fixing.
NOTE: If you launch CUDA kernels inside a while(1) loop, the allocation must be done only once, outside the loop. Otherwise you will end up with delays just like this.


Measuring time of the process/thread in user mode and in kernel on behalf of the process/thread

Imagine process/thread is running from point A to point B.
I can find out how long the code execution took by calling gettimeofday() twice and calculating the difference (wall clock time). However, it may happen that during the 'route' from A to B the CPU switched to other processes, to drivers, the kernel, and other work it must perform to keep the system running.
Is it possible to somehow identify how much time A to B took in terms of actual process/thread execution, and the kernel time related to that execution?
The goal of this exercise is to identify how much time the CPU was NOT executing the process/thread or its system calls because it was executing something else.
I am using C.
Searching the man-pages from time (1) backwards I found this:
You can use getrusage:
getrusage() returns resource usage measures.
It can be used for the calling process (self), its child processes, or the calling thread.
Amongst other values it will give you the user CPU time used and the system CPU time used.
Please see man 2 getrusage for the details. (Or use an online replacement like https://linux.die.net/man/2/getrusage)
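As a minimal sketch of how this could look in C (the helper names tv_seconds and process_cpu_times are made up for illustration), the user and system CPU times can be read at point A and again at point B and then subtracted:

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* CPU time consumed by the calling process so far, split into the time
   spent in user mode and the time spent in the kernel on its behalf.
   Returns 0 on success, -1 if getrusage() fails. */
int process_cpu_times(double *user_s, double *system_s)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user_s = tv_seconds(ru.ru_utime);
    *system_s = tv_seconds(ru.ru_stime);
    return 0;
}
```

Subtracting the user and system differences from the wall clock difference between A and B then gives an estimate of how long the process was not running at all. For a single-threaded process this is straightforward; with several threads, RUSAGE_SELF sums the CPU time of all of them.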

How to measure total execution time of a function in Linux multithreaded environment

I want to measure the total time spent in a C function on Linux. The function may be called at the same time from different threads, and the times spent should be summed together. How can I do this measurement on Linux? I have looked at the clock() function and at calculating the difference between the start and the end of the function.
I found one solution using clock() in this thread on Stack Overflow:
How to measure total time spent in a function?
But from what I understand, this will also include the CPU time used by other threads executing some other function during the measurement. Is that a correct assumption?
Is there some other way to do this measurement within Linux?
Your question states that you are using Linux.
You can use the getrusage(2) system call with the RUSAGE_THREAD parameter, which will give you the accumulated statistics for the currently running thread.
By comparing what's in ru_utime, and perhaps ru_stime also, before and after your function runs, you should be able to determine how much time the function has accumulated in CPU time, for the currently running thread.
Lather, rinse, repeat, for all threads, and add them up.
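A minimal sketch of that approach, assuming Linux with glibc and a C11 compiler: each thread measures its own CPU time around the call and adds the difference to a shared atomic total. The names measured_call, total_measured_cpu_us and spin_work are hypothetical, and the fallback definition of RUSAGE_THREAD is only there in case the headers were already included without _GNU_SOURCE.

```c
#define _GNU_SOURCE              /* RUSAGE_THREAD is a Linux-specific extension */
#include <stdatomic.h>
#include <sys/resource.h>
#include <sys/time.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1          /* value used by Linux; assumption if not exposed */
#endif

/* CPU microseconds spent in measured calls, summed over all threads. */
static atomic_llong total_cpu_us;

/* User + system CPU time of the calling thread in microseconds (0 on failure). */
static long long thread_cpu_us(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return 0;
    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
         + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}

/* Hypothetical wrapper: run fn(arg) and add its CPU time to the shared total. */
void measured_call(void (*fn)(void *), void *arg)
{
    long long before = thread_cpu_us();
    fn(arg);
    atomic_fetch_add(&total_cpu_us, thread_cpu_us() - before);
}

/* Total accumulated so far, across every thread that used measured_call(). */
long long total_measured_cpu_us(void)
{
    return atomic_load(&total_cpu_us);
}

/* Hypothetical example workload: spin for *(long *)arg iterations. */
void spin_work(void *arg)
{
    volatile long sink = 0;
    for (long i = 0; i < *(long *)arg; ++i)
        sink += i;
}
```

Because the before and after samples are taken by the same thread, time consumed by other threads in the meantime does not leak into the total.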
A very good tool for performance analysis is perf (available with recent Linux kernels):
Record performance data with
perf record <command>
and then analyze it with
perf report
Compile your program with debug symbols for useful results.
Getting the time from clock() and the gettimeofday() family of functions is good for obtaining a precise time difference between two consecutive calls, but not for obtaining the time spent in a function. Because of thread and process rescheduling by the operating system and I/O blocking, there is no guarantee that your thread/process keeps the CPU until it finishes its operation, so you can't rely on the time difference.
You have two choices here:
Use profiling software such as Intel VTune and Intel Inspector, which utilizes the hardware performance counters.
Use a real-time Linux kernel and schedule your process with the FIFO scheduler, then use the time difference. Under the FIFO scheduler, other normally scheduled tasks do not preempt your program, so you can safely use the time difference as the time spent in a function, using clock(), gettimeofday(), or the even more precise rdtsc.

Program stops working when calling kernel too many times

I am running particle simulations of self-propelled particles. My CUDA kernel updates each particle's location at every time step, so I launch the CUDA kernel from a for loop. Schematically it looks like this:
for(int i=0;i<NumberOfTimeSteps;i++)
Calculate<<<1,N,sharedsize>>>(float *data, other parameters)
So, at each time step, new data is calculated based on the previously calculated data. It works fine when NumberOfTimeSteps is small. But once I set NumberOfTimeSteps > 500 (approximate critical value), the program stops working.
I know that there is a limitation on kernel execution: the driver can stop GPU calculations if the kernel execution time is too long. However, in my code, the time of a single kernel execution doesn't change with NumberOfTimeSteps.
Are there any limitations on the number of kernel calls?
EDIT: There was another issue: I didn't close the mat files (where I put the results) and kept opening new files at each step. That eventually caused the error. I voted to close the question, since it has nothing to do with CUDA. Robert has already answered the part about CUDA kernels.
Are there any limitations on the number of kernel calls?
There is no real limit to the number of kernel calls. There is a limit to how many can be accepted asynchronously, but after this limit, additional kernel calls will simply block the CPU thread from proceeding until some queue slots open up (i.e. until some previously issued kernels complete).
If your program is failing after ~500 kernel calls, it is due to some other issue, which is impossible to diagnose based on what you have shown in your question.
If by "program stops working" you mean that you hit a WDDM timeout, then it is possible based on batched kernel calls within WDDM, that even though a single kernel call is not longer than the timeout period, back-to-back kernel calls may exceed the watchdog timeout. This really should not be happening in your case, because cudaMemcpy as you have shown it is not an asynchronous operation; it blocks the CPU thread. Therefore, you should at most have one kernel call outstanding at a time.

Are two successive calls to getrusage guaranteed to produce increasing results?

In a program that calls getrusage() twice in order to obtain the time of a task by subtraction, I have once seen an assertion, saying that the time of the task should be nonnegative, fail. This, of course, cannot easily be reproduced, although I could write a specialized program that might reproduce it more easily.
I have tried to find a guarantee that the values returned by getrusage() increase over the course of execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description says so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?
On many systems rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100Hz and sometimes even slower.
The primary reason for that is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting precisely (you'd have to read the clock twice for every system call). You could easily end up spending more time reading clocks than doing anything else in programs that make many system calls.
The counters should never go backwards, though. I did see that happen many years ago on a system where the total running time of the process was tracked on context switches (which was relatively cheap, and getrusage could calculate utime by using samples for stime and subtracting that from the total running time). The clock used in that case was the wall clock instead of a monotonic clock, and when you changed the time on the machine, the running time of processes could go back. But that was of course a bug.
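If a program has to tolerate such a glitch rather than abort on it, one option is to clamp the difference between two samples at zero. This is only a sketch of a defensive workaround, not a fix for the underlying accounting; the helper name tv_diff_us_clamped is made up for illustration.

```c
#include <sys/time.h>

/* Microseconds from `start` to `end`, clamped to zero if the counter
   appears to have gone backwards between the two samples. */
long long tv_diff_us_clamped(struct timeval start, struct timeval end)
{
    long long d = ((long long)end.tv_sec - start.tv_sec) * 1000000LL
                + ((long long)end.tv_usec - start.tv_usec);
    return d < 0 ? 0 : d;
}
```

It would be applied to the ru_utime (or ru_stime) fields of the two struct rusage samples taken before and after the task.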

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
An example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
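The gap between the two figures can be reproduced with a small experiment. The sketch below is in C rather than Ruby and assumes a POSIX system; the function sleep_then_spin and the struct timing are hypothetical names. It sleeps (which uses no CPU, much like being swapped out) and then spins, and reports both elapsed real time and process CPU time.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#include <time.h>

struct timing { double wall_s; double cpu_s; };

/* Current reading of the given clock, in seconds. */
static double clock_seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical experiment: sleep for sleep_ms, then spin for spin_iters
   iterations, and report elapsed real time and process CPU time. */
struct timing sleep_then_spin(long sleep_ms, long spin_iters)
{
    double wall0 = clock_seconds(CLOCK_MONOTONIC);
    double cpu0 = clock_seconds(CLOCK_PROCESS_CPUTIME_ID);

    struct timespec nap = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
    nanosleep(&nap, NULL);        /* off the CPU: adds to elapsed time only */

    volatile long sink = 0;
    for (long i = 0; i < spin_iters; ++i)
        sink += i;                /* on the CPU: adds to both figures */

    struct timing t;
    t.wall_s = clock_seconds(CLOCK_MONOTONIC) - wall0;
    t.cpu_s = clock_seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
    return t;
}
```

Calling sleep_then_spin(100, 1000000), for instance, should report an elapsed time of at least roughly 0.1 seconds while the CPU time covers little more than the loop.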
The lag comes from the multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively running.
You don't want to totally discount clock-on-the-wall time. Time spent waiting without another thread ready to utilize the CPU cycles may make one piece of code less desirable than another. One piece of code may take some more CPU time but employ multi-threading to dominate over the other code in the real world. It depends on the requirements and specifics. My point is ... use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time when your code is executing is not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).
