c - multithread image processing performance issue [duplicate]

There is a really interesting note here: http://en.cppreference.com/w/cpp/chrono/c/clock
"Only the difference between two values returned by different calls to std::clock is meaningful, as the beginning of the std::clock era does not have to coincide with the start of the program. std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock."
Why does the clock speed up with multithreading? I'm checking the performance of a C++ program with threading versus without it, and I'm noticing that the measured times are similar for threading (not better), but the program feels faster (for example, it reports 8 seconds when only about 3 seconds of wall-clock time have passed).

If more than one core is available, and you are running multiple threads, then potentially multiple threads are executing at the same time on different cores. Since clock() measures processor time, it may advance faster than wallclock time, because multiple threads are advancing it simultaneously.
This is exactly what the example in the documentation shows: two threads are created, and the clock() value reported is almost double the reported wall-clock time.
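To make the effect concrete, here is a minimal sketch (not taken from the question; the workload is an arbitrary busy loop) that spins two CPU-bound threads and prints both clock() time and wall-clock time. Built with g++ -pthread on a multi-core Linux machine, the clock() figure comes out close to twice the wall-clock figure.

// Sketch: std::clock() accumulates CPU time across all of the process's
// threads, so two busy threads make it advance about twice as fast as
// the wall clock on a multi-core machine.
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

static void burn_cpu() {
    volatile double sink = 0;                      // arbitrary busy work
    for (long i = 0; i < 200000000L; ++i) sink = sink + i * 0.5;
}

int main() {
    std::clock_t c0 = std::clock();
    auto w0 = std::chrono::steady_clock::now();

    std::thread t1(burn_cpu);
    std::thread t2(burn_cpu);
    t1.join();
    t2.join();

    std::clock_t c1 = std::clock();
    auto w1 = std::chrono::steady_clock::now();

    std::printf("clock() time: %.2f s\n", double(c1 - c0) / CLOCKS_PER_SEC);
    std::printf("wall time:    %.2f s\n",
                std::chrono::duration<double>(w1 - w0).count());
}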

Related

How to measure total execution time of a function in Linux multithreaded environment

I want to measure the total time spent in a C function within Linux. The function may be called at the same time from different threads, and the times spent should be summed together. How can I do this measurement from Linux? I have looked at using the clock() function and calculating the difference between its values at the start and end of the function.
I found one solution using clock() in this thread within Stackoverflow:
How to measure total time spent in a function?
But from what I understand, this will also include the CPU time used by threads executing some other function during the measurement. Is that a correct assumption?
Is there some other way to do this measurement within Linux?
Your question states that you are using Linux.
You can use the getrusage(2) system call with the RUSAGE_THREAD parameter, which will give you the accumulated statistics for the currently running thread.
By comparing what's in ru_utime, and perhaps ru_stime also, before and after your function runs, you should be able to determine how much time the function has accumulated in CPU time, for the currently running thread.
Lather, rinse, repeat, for all threads, and add them up.
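As a rough illustration, here is a hedged sketch of that approach: bracket the function with getrusage(RUSAGE_THREAD) calls and subtract. measured_work() is only a placeholder for the function under test, and RUSAGE_THREAD is Linux-specific.

// Sketch: per-thread CPU time accumulated by a function, via RUSAGE_THREAD.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1                              // for RUSAGE_THREAD
#endif
#include <sys/resource.h>
#include <cstdio>

static double cpu_seconds(const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void measured_work() {                      // placeholder workload
    volatile double sink = 0;
    for (long i = 0; i < 50000000L; ++i) sink = sink + i * 0.5;
}

int main() {
    rusage before{}, after{};
    getrusage(RUSAGE_THREAD, &before);
    measured_work();
    getrusage(RUSAGE_THREAD, &after);

    double user = cpu_seconds(after.ru_utime) - cpu_seconds(before.ru_utime);
    double sys  = cpu_seconds(after.ru_stime) - cpu_seconds(before.ru_stime);
    std::printf("user: %.3f s  sys: %.3f s\n", user, sys);
}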
A very good tool for performance analysis is perf (available with recent linux kernels):
Record performance data with
perf record <command>
and then analyze it with
perf report
Compile your program with debug symbols (e.g. gcc -g) for useful results.
Getting time from the clock() and gettimeofday() family of functions is good for obtaining a precise time difference between two consecutive calls, but not for obtaining the time spent in a function. Because of thread and process rescheduling by the operating system and blocking on IO, there is no guarantee that your thread/process keeps the CPU until it finishes its work, so you can't rely on the raw time difference.
You have two choices for this:
Use profiling software such as Intel VTune and Intel Inspector, which utilize the hardware performance counters.
Use a real-time Linux kernel and schedule your process with the FIFO scheduler, then use the time difference. Under the FIFO scheduler nothing interrupts your program, so you can safely treat the time difference as the time spent in the function, using clock(), gettimeofday(), or the even more precise rdtsc (see the sketch below).
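For the second option, a minimal sketch of switching the current process to the FIFO scheduler (this needs root or CAP_SYS_NICE; picking the maximum priority is just an illustration):

// Sketch: request SCHED_FIFO so the thread is not time-sliced while it runs.
#include <sched.h>
#include <cstdio>

int main() {
    sched_param sp{};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        std::perror("sched_setscheduler");
        return 1;
    }
    std::puts("running under SCHED_FIFO");
    // ... time the function of interest here with clock()/gettimeofday() ...
}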

Are two successive calls to getrusage guaranteed to produce increasing results?

In a program that calls getrusage() twice in order to obtain the time of a task by subtraction, I have once seen an assertion, saying that the time of the task should be nonnegative, fail. This, of course, cannot easily be reproduced, although I could write a specialized program that might reproduce it more easily.
I have tried to find a guarantee that getrusage() increases along execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description say so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?
On many systems rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100 Hz and sometimes even slower.
The primary reason is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting (you'd have to read the clock twice for every system call). In programs that make many system calls, you could easily end up spending more time reading clocks than doing anything else.
The counters should never go backwards, though. I did see that happen many years ago, on a system where the total running time of the process was tracked at context switches (which was relatively cheap; getrusage could then calculate utime by taking the sampled stime and subtracting it from the total running time). The clock used in that case was the wall clock instead of a monotonic clock, so when you changed the time on the machine, the running time of processes could go backwards. But that was, of course, a bug.
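For reference, a hedged sketch of the kind of check the question describes: repeatedly sample getrusage(RUSAGE_SELF) in a busy loop and flag any decrease in accumulated CPU time. The workload and sampling interval are arbitrary.

// Sketch: verify that successive getrusage() CPU-time readings never decrease.
#include <sys/resource.h>
#include <cstdio>

static double cpu_seconds(const rusage& ru) {
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
           ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main() {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    double prev = cpu_seconds(ru);

    volatile double sink = 0;
    for (long i = 1; i <= 10000000L; ++i) {
        sink = sink + i * 0.5;                     // burn a little CPU
        if (i % 100000 == 0) {
            getrusage(RUSAGE_SELF, &ru);
            double now = cpu_seconds(ru);
            if (now < prev)                        // the suspect case
                std::printf("went backwards: %.6f -> %.6f\n", prev, now);
            prev = now;
        }
    }
}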

Memory Sharing C - Performance

I'm playing around with process creation/scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function over a shared in-memory buffer. Each thread is created using clone; I'm playing around with the various flags, stack sizes, etc. to measure process creation time (hence the use of clone).
My experiments are run on a 2 core i7 with hyperthreading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hash functions doubles when I run 4 processes (i.e. one per logical CPU) compared to when I run 2 processes. My understanding is that hyperthreading helps when a process is waiting on IO, so for a CPU-bound process it has almost no effect. Is this correct?
The second observation is that I see pretty high variance (up to 2 seconds) when computing these hash functions (I compute a hash 1,000,000 times). No other process is running on the system (though there are some background threads). I'm struggling to understand why there is so much variance. Is it strictly due to how the scheduler happens to schedule the processes? I understand that without using sched_affinity there is no guarantee that they will be placed on different CPUs, so can the variance just be explained by them landing on the same CPU?
Are there any other ways to guarantee improved reliability without relying on sched_affinity?
The third observation is that, even when I run with just 2 threads (so each should be scheduled on a different CPU), I find that performance goes down (not by much, but a little). I'm struggling to understand why that is the case. It's the same read-only buffer, and it fits in the cache. Is there some contention in accessing the page table? Would it then be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it as read-only?
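For reference, a minimal sketch of the kind of clone() setup described above; the FNV-1a hash, the 1 MiB buffer, and the stack size are placeholders, not the asker's actual code.

// Sketch: a clone()d child sharing the parent's address space hashes a
// shared read-only buffer, then the parent waits for it.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1                              // for clone()
#endif
#include <sched.h>
#include <sys/wait.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <vector>

static std::vector<unsigned char> buffer(1 << 20, 0x5a);   // shared input

static int worker(void*) {
    unsigned h = 2166136261u;                      // FNV-1a as a stand-in hash
    for (unsigned char b : buffer) { h ^= b; h *= 16777619u; }
    return int(h & 0x7f);
}

int main() {
    const int kStackSize = 1 << 20;
    char* stack = static_cast<char*>(std::malloc(kStackSize));
    int flags = CLONE_VM | CLONE_SIGHAND | CLONE_FILES | CLONE_FS | SIGCHLD;
    // clone() takes a pointer to the *top* of the child's stack on x86-64
    pid_t pid = clone(worker, stack + kStackSize, flags, nullptr);
    if (pid < 0) { std::perror("clone"); return 1; }
    waitpid(pid, nullptr, 0);
    std::free(stack);
}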
Different threads still run in the context of one process so they should run on the same CPU the process is run on (usually one process is run on one CPU but that is not guaranteed).
When you run two threads instead of processes you have the overhead of switching between threads; the more calculations you do, the more of this switching is needed, so it will be slower than the same calculations done in one thread.
Furthermore, if you run the same calculations in different processes there is an even bigger overhead of switching between processes, but there is a better chance you will run on different CPUs, so in the long run this will probably be faster, though not by much for short calculations.
Even if you don't think you have other processes running, the OS has a lot to do all the time and switches to its own processes that you aren't always aware of.
All of this emanates from the randomness of switching. Hope I helped a bit.

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the run of a specific sorting algorithm, or decides that some other background application should be given more CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it is possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus kernel parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimal kernel version is.
If you use it, you'll need to use sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there on its own.
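A minimal sketch of that combination, assuming CPU 3 was isolated with isolcpus=3 (the CPU number is only an example):

// Sketch: pin the current thread to an isolated CPU before benchmarking.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1                              // for cpu_set_t macros
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);                              // the isolated CPU
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::puts("pinned to CPU 3");
    // ... run the sorting benchmark here ...
}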
It is not possible, not in user space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time counting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID if you make sure the process doesn't migrate between CPUs (you can set its affinity for example).
The difference between two times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
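A hedged sketch of that approach, timing a placeholder workload with CLOCK_THREAD_CPUTIME_ID (swap in CLOCK_PROCESS_CPUTIME_ID to account for all threads of the process):

// Sketch: per-thread CPU time around a workload, via clock_gettime.
#include <time.h>
#include <cstdio>

static double to_seconds(const timespec& ts) {
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main() {
    timespec start{}, finish{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);

    volatile double sink = 0;                      // placeholder workload
    for (long i = 0; i < 50000000L; ++i) sink = sink + i * 0.5;

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &finish);
    std::printf("thread CPU time: %.3f s\n",
                to_seconds(finish) - to_seconds(start));
}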

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
An example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
A multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively running on a CPU.
You don't want to totally discount wall-clock time, though. Time spent waiting without another thread ready to use the CPU cycles may make one piece of code less desirable than another. One piece of code may take somewhat more CPU time but use multi-threading to beat the other code in real-world elapsed time. It depends on your requirements and specifics. My point is: use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time when your code is executing is not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).
