Are two successive calls to getrusage guaranteed to produce increasing results? - c

In a program that calls getrusage() twice in order to obtain the time of a task by subtraction, I once saw an assertion fail which said that the time of the task should be nonnegative. This, of course, cannot easily be reproduced, although I could write a specialized program that might reproduce it more easily.
I have tried to find a guarantee that getrusage() increases over the course of execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description says so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?

On many systems rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100 Hz and sometimes even slower.
The primary reason for that is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting on every kernel entry (you'd have to read the clock twice for every system call). You could easily end up spending more time reading clocks than doing anything else in programs that make many system calls.
The counters should never go backwards, though. I did see that many years ago on a system where the total running time of the process was tracked on context switches (which was relatively cheap), and getrusage calculated utime by taking the sampled stime and subtracting it from that total running time. The clock used in that case was the wall clock instead of a monotonic clock, so when you changed the time on the machine, the running time of processes could go backwards. But that was, of course, a bug.
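For reference, a minimal sketch of the pattern the question describes, assuming Linux/glibc (timersub() is a BSD/glibc extension, and the measured task is just a placeholder):

#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main(void)
{
    struct rusage before, after;
    struct timeval diff;

    getrusage(RUSAGE_SELF, &before);

    /* ... the task being measured ... */

    getrusage(RUSAGE_SELF, &after);

    /* user CPU time consumed by the task (timersub is a BSD/glibc macro) */
    timersub(&after.ru_utime, &before.ru_utime, &diff);
    assert(diff.tv_sec >= 0);   /* the assertion the question describes */
    printf("user time: %ld.%06ld s\n", (long)diff.tv_sec, (long)diff.tv_usec);
    return 0;
}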

Timings differ while measuring a CUDA kernel

I am trying to measure the time taken by a CUDA kernel. I measure both CPU and GPU timings for it, but I'm getting a huge difference between the two.
When I profile it with the NVIDIA profiler, the kernel takes around 6 ms, which is what I expect. But when I use gettimeofday() around the kernel call to get the CPU timing, the measurement is 15 ms. There are no memcpy calls there either. The kernel runs in a separate stream, and similar kernels are running in concurrent streams.
Sample code :
gettimeofday(&start, NULL);
cudaEventRecord(startGPU);
Kernel<<<abc, xyz, 0, stream>>>();
cudaDeviceSynchronize();
cudaEventRecord(stopGPU);
printf("Elapsed GPU time = ");
gettimeofday(&stop, NULL);
printf("Elapsed CPU time = ");
Results I'm getting for the above code :
Elapsed GPU time = 6ms
Elapsed CPU time = 15ms
It is weird because only the kernel execution line is present. The kernel parameters are, however, pointers. Is the extra time being taken by memory copies? I do not find any memory copies anywhere in the profile either. Any leads would be appreciated.
Basically, what you're measuring as your CPU time is the time it takes to
record the first event,
set up the kernel launch with the respective parameters,
send the necessary commands to the GPU,
launch the kernel on the GPU,
execute the kernel on the GPU,
wait for the notification that GPU execution finished to get back to the CPU, and
record the second event.
Also, note that your method of measuring CPU time does not measure just the processing time spent by your process/thread, but, rather, the total system time elapsed (which potentially includes processing time spent by other processes/threads while your process/thread was not necessarily even running). I have to admit that, even in light of all that, the CPU time you report is still much larger compared to the GPU time than I would normally expect. But I'm not sure that that up there really is your entire code. In fact, I rather doubt it, given that, e.g., the printf()s don't really print anything. So there may be some additional factors we're not aware of that would still have to be considered to fully explain your timings.
Anyway, most likely neither of the two measurements you take is actually measuring what you really wanted to measure. If you're interested in the time it takes the kernel to run, then use CUDA events. However, if you synchronize first and only then record the end event, the span between the start and end events will include the kernel execution, the CPU waiting for kernel execution to finish, and whatever time it then takes to record the second event and have it reach the GPU just so you can ask the GPU at what time it got there. Think of events as markers that mark a specific point in the command stream sent to the GPU. Most likely, you actually wanted to write this:
cudaEventRecord(startGPU, stream); // mark start of kernel execution
Kernel<<<abc, xyz, 0, stream>>>(); // third launch parameter is shared memory size; the stream goes fourth
cudaEventRecord(stopGPU, stream); // mark end of kernel execution
cudaEventSynchronize(stopGPU); // wait for results to be available
and then use cudaEventElapsedTime() to get the time between the two events.
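For completeness, a hedged sketch of the surrounding event management (the names follow the snippet above; error checking is omitted):

cudaEvent_t startGPU, stopGPU;
cudaEventCreate(&startGPU);
cudaEventCreate(&stopGPU);

/* ... record startGPU, launch the kernel, record stopGPU as shown above ... */

cudaEventSynchronize(stopGPU);                       // wait until stopGPU has been reached
float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, startGPU, stopGPU);  // milliseconds between the two events
printf("Kernel time = %f ms\n", kernelMs);

cudaEventDestroy(startGPU);
cudaEventDestroy(stopGPU);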
Also, note that gettimeofday() is not necessarily a reliable way of obtaining high-resolution timings. In C++, you could use, e.g., std::steady_clock, or std::high_resolution_clock (I would resort to the latter only if it cannot be avoided, since it is not guaranteed to be steady; and make sure that the clock period is actually sufficient for what you're trying to measure).
After debugging the same issue, I found that CUDA usually takes some time before the first kernel launch, as discussed in this forum thread: https://devtalk.nvidia.com/default/topic/1042733/extremely-slow-cuda-api-calls-/?offset=3.
The CUDA runtime API calls before the kernel showed 6 ms of cudaMalloc and 14 ms of cudaLaunch, which was the reason for the extra delay. Subsequent kernels, however, ran normally. cudaLaunch usually takes on the order of microseconds, so anything much beyond that definitely needs attention.
NOTE: If you are running CUDA kernels inside a while(1) loop, the allocations must be done only once, outside the loop. Otherwise you will end up with delays just like this.

c - multithreaded image processing performance issue [duplicate]

There is a really interesting note here: http://en.cppreference.com/w/cpp/chrono/c/clock
"Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program. std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock."
Why does the clock speed up with multithreading? I'm checking the performance of a C++ program with threading versus without it, and I'm noticing that the times reported are similar for the threaded version (not better), yet it feels faster (for example, 8 seconds reported in 3 seconds of wall-clock runtime).
If more than one core is available, and you are running multiple threads, then potentially multiple threads are executing at the same time on different cores. Since clock() measures processor time, it may advance faster than wallclock time, because multiple threads are advancing it simultaneously.
Just as in the example given in the documentation: it creates two threads, and the clock() value reported is almost double the reported wall-clock time.
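A minimal sketch that reproduces the effect in plain C (assuming POSIX threads and a clock() that, as on Linux, accounts CPU time for the whole process):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 400000000UL; i++)
        x += i;                              /* busy work on one core */
    return NULL;
}

int main(void)
{
    struct timespec w0, w1;
    pthread_t t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_t c0 = clock();

    pthread_create(&t1, NULL, spin, NULL);   /* two threads spinning ...  */
    pthread_create(&t2, NULL, spin, NULL);   /* ... ideally on two cores  */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    clock_t c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    printf("CPU time:  %.2f s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("Wall time: %.2f s\n",
           (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9);
    return 0;                                /* CPU time comes out near 2x wall time */
}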

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the run of a specific sorting algorithm, or decides that some other background application should be given more CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it is possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
On Linux, you can use the isolcpus kernel parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimum kernel version is.
If you use it, you'll need to call sched_setaffinity to put your thread on an isolated CPU yourself, because the scheduler won't put it there.
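A hedged sketch of the pinning step (CPU 3 is just an assumed member of the isolcpus set; Linux-specific):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)               /* illustrative helper, not a standard API */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = the calling thread */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

/* e.g. call pin_to_cpu(3) before starting the sort being measured */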
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time accounting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID if you make sure the process doesn't migrate between CPUs (you can set its affinity for example).
The difference between the times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
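A minimal sketch of that approach, timing a qsort() call with the per-process CPU clock (the array size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    enum { N = 1000000 };
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++)
        v[i] = rand();

    struct timespec t0, t1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    qsort(v, N, sizeof *v, cmp_int);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    double cpu_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sort CPU time: %.6f s\n", cpu_s);
    free(v);
    return 0;
}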

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
An example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
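For illustration (in C rather than the Ruby benchmarking module), a minimal sketch that reads the same three quantities around a workload, using times() and a monotonic clock; the workload itself is a placeholder:

#include <stdio.h>
#include <sys/times.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct tms t0, t1;
    struct timespec w0, w1;
    long ticks = sysconf(_SC_CLK_TCK);       /* clock ticks per second */

    clock_gettime(CLOCK_MONOTONIC, &w0);
    times(&t0);

    /* ... the code being benchmarked ... */

    times(&t1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    printf("user    %.3f s\n", (double)(t1.tms_utime - t0.tms_utime) / ticks);
    printf("system  %.3f s\n", (double)(t1.tms_stime - t0.tms_stime) / ticks);
    printf("elapsed %.3f s\n",
           (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9);
    return 0;
}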
The multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively working.
You don't want to totally discount wall-clock time, though. Time spent waiting with no other thread ready to use the CPU may make one piece of code less desirable than another. One piece of code may take somewhat more CPU time but, by employing multithreading, beat the other code in real-world elapsed time. It depends on your requirements and the specifics. My point is: use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time spent while your code is executing is simply not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).

Are high resolution calls to get the system time wrong by the time the function returns?

Given a C process that runs at the highest priority and requests the current time, is the time returned adjusted for the amount of time the code takes to return to user process space? Is it out of date by the time you get it? Taking the execution time of a known number of assembly instructions in a loop, and asking for the time before and after it, could give you an approximation of the error. I know this must be an issue in scientific applications, although I don't plan to write software involving any super colliders any time in the near future. I have read a few articles on the subject, but they do not indicate that any correction is made to make the time given to you slightly ahead of the time the system actually read. Should I lose sleep over other things?
Yes, they are almost definitely "wrong".
For Windows, the timing functions do not take into account the time it takes to transition back to user mode. Even if this were taken into account, they can't correct for the case where the function returns and your code hits a page fault, gets swapped out, etc., before capturing the return value.
In general, when timing things you should snap a start and an end time around a large number of iterations to weed out these sort of uncertainties.
No, you should not lose sleep over this. No amount of adjustment or other software trickery will yield perfect results on a system with a pipelined processor with multi-layered memory access running a multi-tasking operating system with memory management, devices, interrupt handlers... Not even if your process has the highest priority.
Plus, taking the difference of two such times will cancel out the constant overhead, anyway.
Edit: I mean yes, you should lose sleep over other things :).
Yes, the answer you get will be off by a certain (smallish) amount; I have never heard of a timer function compensating for the average return time, because such a thing is nearly impossible to predict well. Such things are usually implemented by simply reading a register in the hardware and returning the value, or a version of it scaled to the appropriate timescale.
That said, I wouldn't lose sleep over this. The accepted way of keeping this overhead from affecting your measurements in any significant way is not to use these timers for short events. Usually, you will time several hundred, thousand, or million executions of the same thing, and divide by the number of executions to estimate the average time. Such a thing is usually more useful than timing a single instance, as it takes into account average cache behavior, OS effects, and so forth.
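A minimal sketch of that amortizing pattern (CLOCK_MONOTONIC is assumed available; the measured operation is a placeholder, and a real benchmark would keep the compiler from optimizing it away):

#include <stdio.h>
#include <time.h>

static void operation_under_test(void)
{
    /* ... the short operation being measured ... */
}

int main(void)
{
    enum { ITERATIONS = 1000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        operation_under_test();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average per call: %.1f ns\n", total_ns / ITERATIONS);
    return 0;
}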
Most real-world uses of high-resolution timers are for profiling, where the time is read once at START and once more at FINISH. Almost the same amount of delay is involved at both START and FINISH, so the two largely cancel out and it works fine.
Now, for something like a nuclear reactor, Windows or many other operating systems with generic timing functions may not be suitable. I guess they use real-time operating systems, which might give more accurate time values than desktop operating systems.
