CPU load of my program - c

How can I find out how much load my program puts on the CPU?
I tried htop, but htop does not give the CPU load; it only gives the CPU utilization percentage of my program (by PID).
I am using C, in a Linux environment.

The function you are probably looking for is getrusage. It fills struct rusage. There are two members of the struct you are interested in:
ru_utime - user CPU time used
ru_stime - system CPU time used
You can call the function at regular intervals and, based on the results, estimate the CPU load (e.g. as a percentage) of your own process.
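For example, a minimal sketch of that approach (no error handling; in a real program the sampling would run alongside the actual work, e.g. in a monitoring thread):
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/resource.h>

/* Convert a struct timeval (as used in struct rusage) to seconds. */
static double tv_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct rusage prev, cur;
    struct timespec t0, t1;

    getrusage(RUSAGE_SELF, &prev);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (;;) {
        /* ... do the work whose load you want to measure ... */
        sleep(1);                               /* sampling interval */

        getrusage(RUSAGE_SELF, &cur);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* CPU time (user + system) consumed during this interval */
        double cpu  = (tv_sec(cur.ru_utime) - tv_sec(prev.ru_utime))
                    + (tv_sec(cur.ru_stime) - tv_sec(prev.ru_stime));
        /* wall-clock length of the interval */
        double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        printf("CPU load of this process: %.1f%%\n", 100.0 * cpu / wall);

        prev = cur;
        t0 = t1;
    }
}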
If you want it at the system level, you need to read (and parse) the /proc/stat file (also at regular intervals); see here.

Related

Measuring time of the process/thread in user mode and in kernel on behalf of the process/thread

Imagine process/thread is running from point A to point B.
I can measure how long the code execution took by taking two gettimeofday() readings and calculating the difference (wall-clock time). However, during the 'route' from A to B the CPU may have switched to other processes, drivers, the kernel, and other work it must perform to keep the system running.
Is it possible to identify how much of the A-to-B time was actual process/thread execution, plus the kernel time spent on behalf of the process/thread?
The goal of this exercise is to identify how much time the CPU was NOT executing the process/thread or its system calls, because it was executing something else.
I am using C.
Searching the man pages, starting from time(1) and working backwards, I found this:
You can use getrusage:
getrusage() returns resource usage measures
It can be used for the calling process itself (RUSAGE_SELF), its child processes (RUSAGE_CHILDREN), or the calling thread (RUSAGE_THREAD).
Amongst other values it will give you the user CPU time used and the system CPU time used.
Please see man 2 getrusage for the details. (Or use an online replacement like https://linux.die.net/man/2/getrusage)
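A rough sketch of how that could look between two points A and B (RUSAGE_THREAD is Linux-specific and needs _GNU_SOURCE; use RUSAGE_SELF if you want the whole process; error handling omitted):
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

static double tv_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    struct rusage ru_a, ru_b;
    struct timespec wa, wb;

    clock_gettime(CLOCK_MONOTONIC, &wa);
    getrusage(RUSAGE_THREAD, &ru_a);            /* point A */

    /* ... code between A and B ... */

    getrusage(RUSAGE_THREAD, &ru_b);            /* point B */
    clock_gettime(CLOCK_MONOTONIC, &wb);

    double wall = (wb.tv_sec - wa.tv_sec) + (wb.tv_nsec - wa.tv_nsec) / 1e9;
    double user = tv_sec(ru_b.ru_utime) - tv_sec(ru_a.ru_utime);
    double sys  = tv_sec(ru_b.ru_stime) - tv_sec(ru_a.ru_stime);

    printf("wall: %.6f s, user: %.6f s, sys: %.6f s\n", wall, user, sys);
    printf("time NOT spent running this thread: %.6f s\n", wall - user - sys);
    return 0;
}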

Timings differ while measuring a CUDA kernel

I am trying to measure the time taken by a CUDA kernel. I measure both the CPU and the GPU timing of it, but I'm getting a huge difference between the two.
When I profile it with the NVIDIA profiler, the kernel takes around 6 ms, which is what I expect. But when I use gettimeofday() around the kernel call to get the CPU timing, the measurement is 15 ms. I do not have any memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code :
gettimeofday(start);
cudaEventRecord(startGPU);
Kernel <<<abc, xyz,stream>>>();
cudaDeviceSynchronize();
cudaEventRecord(stopGPU);
printf("Elapsed GPU time = ");
gettimeofday(stop);
printf("Elapsed CPU time = ");
Results I'm getting for the above code :
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
It is weird because only the kernel launch line is present. The kernel parameters are, however, pointers. Is the extra time being taken by memory copies? But I do not find any memory copies in the profile either. Any leads would be appreciated.
Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but rather the total elapsed wall-clock time (which potentially includes processing time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the CPU time you report is still much larger relative to the GPU time than I would normally expect. But I'm not sure that what's up there really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyway, most likely neither of the two measurements you take is actually measuring what you really want to measure. If you're interested in the time it takes for the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the span between the start and end events will cover the kernel execution, plus the time the CPU spends waiting for kernel execution to finish, plus whatever time it then takes to record the second event and have it reach the GPU just so you can ask the GPU at what time it got it. Think of events as markers that mark a specific point in the command stream that is sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, stream>>>();
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
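For example (a minimal sketch; it assumes startGPU and stopGPU were created earlier with cudaEventCreate(), and msGPU is just an illustrative variable name):
float msGPU = 0.0f;
cudaEventElapsedTime(&msGPU, startGPU, stopGPU);   /* elapsed time between the two events, in ms */
printf("Elapsed GPU time = %f ms\n", msGPU);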
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::steady_clock, or std::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
After debugging the same issue, I found that CUDA usually takes extra time before the first kernel launch, as discussed in the forum thread here: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel took 6 ms for cudaMalloc and 14 ms for cudaLaunch, which was the reason for the extra delay. The subsequent kernels, however, ran normally. cudaLaunch usually takes time on the order of microseconds, so anything well beyond that deserves investigation.
NOTE: If you are running CUDA kernels inside a while(1) loop, the allocations must be done outside the loop, only once. Otherwise you will end up with delays just like this.

Time to run instructions of a for loop

I need to measure a duration of 125 μs, with an accuracy of ±5 μs, to implement a TDM (Time Division Multiplexing) scheme, but I am not able to achieve that accuracy on Linux. I am using DPDK, which runs on Ubuntu on Intel hardware. If I take the time from the system with clock_gettime(CLOCK_REALTIME), the call into the kernel to get the time adds overhead, which makes the measured duration inaccurate.
Therefore, I dedicated a CPU core to keeping time without asking the kernel for it. For this, I run a for loop with a fixed maximum number of iterations (8000000), measure how long that takes, and then compute the number of iterations needed for a 125 μs duration (i.e. (125*8000000)/timespent).
However, the problem is that the results are inaccurate: they always differ from run to run, by about 1000 iterations.
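Simplified, the calibration loop looks roughly like this (CLOCK_MONOTONIC is used here for illustration instead of CLOCK_REALTIME; the volatile counter keeps the compiler from optimizing the loop away):
#include <stdio.h>
#include <time.h>

#define MAX_ITERS 8000000ULL

int main(void)
{
    struct timespec t0, t1;
    volatile unsigned long long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < MAX_ITERS; i++)
        ;                                   /* busy loop used for calibration */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us_spent = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    unsigned long long iters_per_125us = (unsigned long long)(125.0 * MAX_ITERS / us_spent);

    printf("iterations per 125 us: %llu\n", iters_per_125us);
    return 0;
}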
Does anybody know why I am getting inaccurate results even if I am dedicating a CPU for this?
Does anyone know a method to measure a very short duration (around 125 μs) without making a call to the kernel? Thanks!
You are getting inaccurate results because you are on a multitasking operating system. You cannot do this on a modern general-purpose computer; you can only do it on an embedded microcontroller where you control 100% of the CPU time. The operating system still needs to manage your process, even if you have a dedicated CPU, and handling things like the mouse and keyboard also takes time. You would have to run the process on bare metal.

Measure time complexity of a program in any programming language

I am searching for a standard way to identify running time complexity of a program.
As described here, I am not looking for a way to analyze the complexity by looking at the code, but rather to estimate it from measurements taken while the program runs.
Consider a program which requires the user to convert a binary string to its decimal equivalent. The time complexity for such a program should be O(n) at worst, when each binary digit is processed one at a time. With some cleverness, the running time can be reduced to O(n/4) (process 4 digits from the binary string at a time, assuming the binary string has 4k digits for k=1,2,3,...).
I wrote this program in C and measured its running time using the time command and a function based on gettimeofday (both), on a Linux box with a 64-bit quad-core processor (each core at 800 MHz), under two categories:
When system is under normal load (core usage 5-10%)
When system is under heavy load (core usage 80-90%)
Following are the readings for O(n) algorithm, length of binary string is 100000, under normal load:
Time spent in User (ms) - 216
Time Spent in Kernel (ms) - 8
Timed using gettimeofday (ms) - 97
Following are the readings for O(n) algorithm, length of binary string is 200000, under high load:
Time spent in User (ms) - 400
Time Spent in Kernel (ms) - 48
Timed using gettimeofday (ms) - 190
What I am looking for:
If I am using time command, which output should I consider? real, user or sys?
Is there a standard method to calculate the running time of a program?
Every time I execute these commands, I get a different reading. How many times should I sample so that the average will always be the same, given the code does not change.
What if I want to use multiple threads and measure the time in each thread by calling execve on such programs?
From the research I have done, I have not come across any standard approach. Also, whatever command/method I use seems to give me a different output each time (I understand this is because of context switches and CPU cycles). We can assume here that a machine-dependent solution is acceptable.
To answer your questions:
Depending on what your code is doing, each component of the output of time may be significant. This question deals with what those components mean. If the code you're timing doesn't use system calls, calculating the "user" time is probably sufficient. I'd probably just use the "real" time.
What's wrong with time? If you need better granularity (i.e. you just want to time a section of code instead of the entire program), you can always get the start time before the block of code you are profiling, run the code, then get the end time, and calculate the difference to get the runtime. NEVER use gettimeofday for this, because that time does not increase monotonically: the system time can be changed by the administrator or an NTP process. You should use clock_gettime instead (see the sketch at the end of this answer).
To minimise the runtime differences from run to run, I would check that cpu frequency scaling is OFF especially if you're getting very wildly differing results. This has caught me out before.
Once you start getting into multiple threads, you might want to start looking at a profiler. gprof is a good place to start.
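As an example of the clock_gettime approach mentioned above, here is a minimal sketch timing just a block of code (CLOCK_MONOTONIC is not affected by system time changes; error handling omitted):
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* ... block of code being profiled ... */

    clock_gettime(CLOCK_MONOTONIC, &end);

    double ms = (end.tv_sec - start.tv_sec) * 1e3
              + (end.tv_nsec - start.tv_nsec) / 1e6;
    printf("block took %.3f ms\n", ms);
    return 0;
}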

How do filesystem benchmark tools measure time?

I want to measure the running time of a specific system call. For example, I want to know how much time a pread takes, on both the CPU and the I/O side.
Which function should I use?
Right now I use times, and it works.
gettimeofday gets the current wall-clock time, so it may not measure just the running time of a specific process, right?
clock returns the CPU time this program has used so far; does this include the I/O time? If there are other programs running, will they influence the result of this function? I mean something like the running process being switched out.
getrusage seems like an ideal one, but it, too, only returns the CPU time of a specific process.
Does anyone know how benchmark tools like iozone measure the time of system calls? I've read its code and still have no idea.
You're looking for times(2).
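A rough sketch of how times(2) could be used around a single pread (the file path is made up; error handling omitted):
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/times.h>

int main(void)
{
    char buf[4096];
    struct tms t0, t1;
    clock_t r0, r1;
    long ticks = sysconf(_SC_CLK_TCK);          /* clock ticks per second */

    int fd = open("/tmp/testfile", O_RDONLY);   /* hypothetical test file */

    r0 = times(&t0);                            /* returns elapsed real time, in ticks */
    ssize_t n = pread(fd, buf, sizeof buf, 0);  /* the call being measured */
    r1 = times(&t1);

    printf("read %zd bytes: real %.3f ms, user %.3f ms, sys %.3f ms\n", n,
           (double)(r1 - r0) * 1000.0 / ticks,
           (double)(t1.tms_utime - t0.tms_utime) * 1000.0 / ticks,
           (double)(t1.tms_stime - t0.tms_stime) * 1000.0 / ticks);

    close(fd);
    return 0;
}
Note that the tick resolution (typically 100 ticks per second) is far too coarse for a single pread, so in practice you would time many calls in a loop and divide; the difference real - (user + sys) then approximates the time spent blocked on I/O.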
