Google benchmark state.PauseTiming() and state.ResumeTiming() take a long time - benchmarking

I am running some performance tests using the Google Benchmark API. I use state.PauseTiming() and state.ResumeTiming() to keep unnecessary code segments out of the measured path. I have attached sample code below:
while (state.KeepRunning()) {
    state.PauseTiming();
    state.ResumeTiming();
    state.PauseTiming();
    state.ResumeTiming();
}
These calls alone took 323 ns for the two pause/resume pairs.
Run on (16 X 3196.36 MHz CPU s)
2019-06-19 11:21:06
---------------------------------------------------------------
Benchmark            Time           CPU      Iterations
---------------------------------------------------------------
Benchmark_Test1      323 ns         324 ns      2158319
Is this a bug in the Google Benchmark API, or are there workarounds for it?

I had a similar use case and noticed that pause and resume have a lot of overhead; this is mentioned in their docs as well.
The workaround is another feature of Google Benchmark: manually measure the part of the code you want timed inside the loop, using manual timers (https://github.com/google/benchmark#cpu-timers).
You take chrono timestamps before and after the code segment you want timed, then report the difference to the state on each iteration using SetIterationTime.
When you register the benchmark you also need to append ->UseManualTime().
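A minimal sketch of that approach (WorkUnderTest is a placeholder for whatever you actually want to measure; only the interval you report via SetIterationTime counts towards the result):

#include <benchmark/benchmark.h>
#include <chrono>

// Placeholder for the code you actually want to time.
static void WorkUnderTest() {
    benchmark::DoNotOptimize(0);
}

static void BM_ManualTiming(benchmark::State& state) {
    for (auto _ : state) {
        // Untimed setup can go here; it is excluded from the reported time.
        auto start = std::chrono::high_resolution_clock::now();
        WorkUnderTest();
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = end - start;
        // Report only the measured interval for this iteration.
        state.SetIterationTime(elapsed.count());
    }
}
// UseManualTime() tells the library to use the manually reported times.
BENCHMARK(BM_ManualTiming)->UseManualTime();

BENCHMARK_MAIN();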

Related

How to use oprofile to calculate execution time of a part of C program?

I want to profile a portion of my C code (user_defined_function()) using oprofile and calculate the time taken to execute it. Any pointers on how to do this would be very helpful. Thanks in advance!
#include <stdio.h>

int main()
{
    // some statements;

    // Begin Profiling
    user_defined_function();
    // End Profiling

    // some statements;
    return 0;
}
I don't see turn on / turn off markers in the http://oprofile.sourceforge.net/doc/index.html and http://oprofile.sourceforge.net/faq/ documentation. Probably calling (fork+exec) opcontrol with --start and --stop will help if the code to be profiled is long enough.
With the perf tool in profiling (sampling) mode, perf record (and/or operf, which is based on the same perf_event_open syscall), you can profile the full program and add markers at the Begin Profiling and End Profiling points (using a custom tracing event). Then you can dump the entire perf.data with perf script, find your marker events, and cut out only the part of the profile between the markers (every event in perf.data has a timestamp, and they are ordered or can be sorted by time).
With direct use of the perf_event_open syscall you can enable and disable profiling from the same process by issuing the ioctl calls described in "man 2 perf_event_open" on the perf fd, using the PERF_EVENT_IOC_ENABLE / PERF_EVENT_IOC_DISABLE actions. The man page also describes using prctl to temporarily disable and re-enable profiling for the whole program (this may even work with oprofile: disable at the start of main, enable at Begin, disable at End):
Using prctl(2)
A process can enable or disable all the event groups that are attached to it using the prctl(2) PR_TASK_PERF_EVENTS_ENABLE and PR_TASK_PERF_EVENTS_DISABLE operations.
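A minimal sketch of the prctl approach applied to the question's program (user_defined_function is a stub here; whether anything is actually counted still depends on the profiler that attached the events to the task):

#include <sys/prctl.h>

// Stub standing in for the region the question wants to profile.
static void user_defined_function() { /* work under test */ }

int main()
{
    // Start with per-task performance events disabled so the setup code
    // before the region of interest is not counted.
    prctl(PR_TASK_PERF_EVENTS_DISABLE);

    // some statements;

    prctl(PR_TASK_PERF_EVENTS_ENABLE);   // Begin Profiling
    user_defined_function();
    prctl(PR_TASK_PERF_EVENTS_DISABLE);  // End Profiling

    // some statements;
    return 0;
}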
Another way of using performance counters is not sampling profiling but counting (perf stat ./your_program or perf stat -d ./your_program does this). This mode will not give you a list of 'hot' functions; it will just tell you that your code executed 100 million instructions in 130 million cycles, with 10 million L1 cache hits and 5 million L1 cache misses. There are wrappers to enable counting on parts of a program, for example: PAPI http://icl.cs.utk.edu/papi/ (PAPI_start_counters), perfmon2 (libpfm3, libpfm4), https://github.com/RRZE-HPC/likwid (pdf, likwid_markerStartRegion), http://halobates.de/jevents.html & http://halobates.de/simple-pmu, etc.
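For the counting mode, here is a self-contained sketch using perf_event_open directly (the counter choice and the measured function are arbitrary; instructions are counted only between the enable and disable ioctls):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>
#include <cstdint>

// Thin wrapper: glibc does not provide a perf_event_open() function.
static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void user_defined_function() { /* work under test */ }

int main()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // count retired instructions
    attr.disabled = 1;                         // created in the disabled state
    attr.exclude_kernel = 1;

    // pid 0 = this process, cpu -1 = any CPU it runs on.
    int fd = static_cast<int>(perf_event_open(&attr, 0, -1, -1, 0));
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   // Begin Profiling
    user_defined_function();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);  // End Profiling

    uint64_t count = 0;
    read(fd, &count, sizeof(count));       // plain u64 since no read_format flags are set
    std::printf("instructions in region: %llu\n",
                static_cast<unsigned long long>(count));
    close(fd);
    return 0;
}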

The MPI_Sendrecv_replace function is excessively time consuming.

I call MPI_Sendrecv_replace several times in my code, but the function behaves very strangely. In the first loop (5000 iterations), MPI_Sendrecv_replace costs about 30 seconds in total; in the second loop the same call takes only 1 second. I am curious about the reason because I want to optimize the code. The code runs with 32 processes on one node with 4 Intel(R) Xeon(R) E5-4620 v2 CPUs.
Part of my code is here:
for (ishot = 1; ishot <= nshots; ishot += SHOTINC) {
    ...
    for (nt = 1; nt <= NT; nt++) {
        ...
        MPI_Sendrecv_replace(&bufferlef_to_rig[1][1], NY*fdo3, MPI_FLOAT,
                             INDEX[1], TAG1, INDEX[2], TAG1,
                             MPI_COMM_WORLD, &status);
        ...
    }
    for (nt = 1; nt <= NT; nt++) {
        ...
        MPI_Sendrecv_replace(&bufferlef_to_rig[1][1], NY*fdo3, MPI_FLOAT,
                             INDEX[1], TAG1, INDEX[2], TAG1,
                             MPI_COMM_WORLD, &status);
        ...
    }
}
I used MPI_Wtime to measure the time spent in MPI_Sendrecv_replace.
It seems the problem is a performance bug in MPI_Sendrecv_replace. When I replace MPI_Sendrecv_replace with MPI_Sendrecv, the program behaves normally and the slowdown disappears. I found the bug reports at https://github.com/open-mpi/ompi/issues/93 and http://www.open-mpi.org/community/lists/users/2010/01/11683.php
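A minimal, self-contained sketch of that workaround (buffer, count, and neighbour ranks are placeholders, not the question's actual NY*fdo3 / INDEX values): exchange into a separate receive buffer with MPI_Sendrecv and copy it back, instead of replacing in place.

#include <mpi.h>
#include <vector>
#include <algorithm>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;                      // stands in for NY*fdo3
    std::vector<float> buf(count, float(rank));  // data previously exchanged in place
    std::vector<float> tmp(count);               // separate receive buffer

    const int left  = (rank - 1 + size) % size;  // stands in for INDEX[1]
    const int right = (rank + 1) % size;         // stands in for INDEX[2]
    MPI_Status status;

    // Send buf to the left neighbour and receive from the right into tmp ...
    MPI_Sendrecv(buf.data(), count, MPI_FLOAT, left,  1,
                 tmp.data(), count, MPI_FLOAT, right, 1,
                 MPI_COMM_WORLD, &status);
    // ... then copy the received data back over the original buffer,
    // reproducing the net effect of MPI_Sendrecv_replace.
    std::copy(tmp.begin(), tmp.end(), buf.begin());

    MPI_Finalize();
    return 0;
}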

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while (1) {
    i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
You have a multicore processor and a single-threaded program, so you will drive only one core at full throttle. Why would you expect the overall processor usage to reach 100% in that situation?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%.
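The same effect can be shown inside a single process with two threads (a sketch, not the questioner's code; each thread runs the question's endless multiplication):

#include <thread>

// Each thread keeps one core busy; unsigned avoids signed-overflow UB.
static void busy()
{
    volatile unsigned long i = 10;
    while (1) {
        i = i * 5;
    }
}

int main()
{
    std::thread t1(busy);  // saturates one core
    std::thread t2(busy);  // saturates the other core of a dual-core CPU
    t1.join();             // never returns; overall CPU usage stays near 100%
    t2.join();
    return 0;
}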
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf sends some characters to the terminal/screen. Sending, displaying, and updating that output is handled by code outside your executable, and it likely executes on another thread. But displaying a few characters does not need 100% of that thread. That is why you see 100% for core 1 and 76% for core 2, which averages out to the overall 88% CPU usage you observe.

Using CLOCK_PROCESS_CPUTIME_ID in clock_gettime

I read http://linux.die.net/man/3/clock_gettime and http://www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime/comment-page-1/#comment-681578
They say to use
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop_time);
to measure how long a function takes to run.
I tried that in my program. When I run it, it reports that the code took 15 seconds, but timing the same run with a stopwatch gives 30 seconds.
Can you please tell me why clock_gettime returns half of the actual elapsed time?
Thank you.
In a multi-process environment, processes are constantly migrating between the CPU(s) and the 'run queue(s)'.
When performance testing an application, it is often convenient to know the amount of time a process has been running on a CPU, while excluding time that the process was waiting for a CPU resource on a 'run queue'.
In the case of this question, where CPU-time is about half of REAL-time, it is likely that other processes were actively competing for CPU time while your process was also running. It appears that your process was fairly successful in acquiring roughly half the CPU resources during its run.
Instead of using CLOCK_PROCESS_CPUTIME_ID, you might consider using CLOCK_REALTIME?
For additional details, see: Understanding the different clocks of clock_gettime()
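A small sketch contrasting the two clocks (the busy loop is an arbitrary workload; on a loaded machine the CPU-time figure will typically be smaller than the wall-clock figure, which is the effect described in the question):

#include <time.h>
#include <cstdio>

// Arbitrary busy work standing in for the function being measured.
static void work()
{
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; ++i) x += i;
}

int main()
{
    struct timespec cpu_start, cpu_stop, wall_start, wall_stop;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_start);  // CPU time used by this process
    clock_gettime(CLOCK_REALTIME, &wall_start);           // wall-clock time

    work();

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_stop);
    clock_gettime(CLOCK_REALTIME, &wall_stop);

    double cpu  = (cpu_stop.tv_sec  - cpu_start.tv_sec)  + (cpu_stop.tv_nsec  - cpu_start.tv_nsec)  / 1e9;
    double wall = (wall_stop.tv_sec - wall_start.tv_sec) + (wall_stop.tv_nsec - wall_start.tv_nsec) / 1e9;
    std::printf("cpu  time: %.3f s\n", cpu);
    std::printf("wall time: %.3f s\n", wall);
    return 0;
}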

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code started reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
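For reference, a minimal sketch of that conversion (Sleep(100) stands in for the code being timed, so the printed value should be roughly 0.1 s):

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // ticks per second (the QPF value)
    QueryPerformanceCounter(&start);

    Sleep(100);                         // stands in for the code being timed

    QueryPerformanceCounter(&stop);
    double seconds = static_cast<double>(stop.QuadPart - start.QuadPart) /
                     static_cast<double>(freq.QuadPart);
    std::printf("elapsed: %.6f s\n", seconds);
    return 0;
}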
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think the actual hardware frequency changed, since the wall-clock time of the running program did not change, although the time reported via QPC did (it dropped by a factor of 2.7/4.1).
MyComputer->Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz; 4.11 GHz; 1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies on its own. From what I have read, QPC and QPF are based on a timer in the PCI bus and are not affected by changes in the CPU clock. The strange thing in my situation is that the frequency reported by QPF changed to an incorrect value, and this changed frequency was also reported in MyComputer -> Properties, which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMD CPUs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports a technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected because it uses the RTC. You should probably stop using the performance counters unless you can tolerate occasional phase adjustments of a few (5-50) milliseconds and are willing to do some math to apply them, continuously or periodically re-normalizing your performance counter values against the reported performance counter frequency and the low-resolution RTC time (you can do this on demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs).
You can try to use the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sits idle. Choose randomly from the set of available CPUs and you have at least a fighting chance of not being locked to an overloaded CPU.
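A sketch of that idea on Windows (the helper name and the pseudo-random choice are mine, not from the answer): pin the timing thread to one of the CPUs the process is actually allowed to use, rather than hard-coding CPU 0.

#include <windows.h>
#include <cstdlib>

// Pin the current thread to one of the CPUs this process may use,
// chosen arbitrarily rather than always CPU 0.
static void pin_timing_thread()
{
    DWORD_PTR processMask = 0, systemMask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
        return;

    int allowed[64];
    int n = 0;
    for (int cpu = 0; cpu < 64; ++cpu)
        if (processMask & (static_cast<DWORD_PTR>(1) << cpu))
            allowed[n++] = cpu;
    if (n == 0)
        return;

    const int chosen = allowed[std::rand() % n];  // arbitrary, not always CPU 0
    SetThreadAffinityMask(GetCurrentThread(),
                          static_cast<DWORD_PTR>(1) << chosen);
}

int main()
{
    pin_timing_thread();
    // ... perform QueryPerformanceCounter-based timing on this thread ...
    return 0;
}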
