Strange "disappearing" CPU utilization - c

It's an odd title, but bear with me. Some time ago, I finished writing a program in C with Visual Studio Community 2017 that made heavy use of OpenSSL's secp256k1 implementation. It was 10x faster than an equivalent program in Java, so I was happy. However, today I decided to upgrade it to use the Bitcoin project's optimized libsecp256k1 library. It worked out great and I got a further 7x performance boost! Changing which library does the EC multiplications is the ONLY thing I changed about the software: it still reads the same input files, computes the same things, and outputs the same results.
Input files consist of 5 million initial values, which I break up into chunks of 50k. I then use pthreads with 6 threads to compute each 50k chunk, save the results, and move on to the next 50k until all 5 million are done (I've also tried OpenMP with 6 threads). For some reason, when running this program on my Windows 10 4-core laptop, after exactly 16 chunks the CPU utilization drops from 75% down to 65%, after another 10 chunks down to 55%, and so on until it's only using about 25% of my CPU by the time all 5 million inputs are calculated.
The thread count (7: 1 main thread, 6 worker threads) remains the same, and memory usage never goes over 1.5GB (the laptop has 16GB), yet the CPU utilization drops as if I were losing threads. My temperatures never go over 83C, and the all-core turbo stays at the maximum 3.4GHz (base 2.8GHz), so there is no thermal throttling happening here. The laptop is always plugged in and the power plan is set to maximum performance. There are no other CPU- or memory-intensive programs running besides this one.
Even stranger, this problem doesn't happen on either of my two Windows 7 desktops: they both hold full CPU utilization through all 5 million calculations. The old OpenSSL implementation always held full CPU utilization on every computer, so something is different, yet it only affects the Windows 10 laptop.
I'm sorry I don't have code to demonstrate this, and maybe another forum would be more appropriate, but since it's my code I thought I'd ask here. Anyone have any ideas what might be causing this or how to fix it?
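I can at least sketch the overall structure (everything below is a placeholder, not the real program; buffer handling and the actual EC call are omitted):

```c
#include <pthread.h>
#include <stdio.h>

#define TOTAL_INPUTS 5000000
#define CHUNK_SIZE   50000
#define NUM_THREADS  6

/* Hypothetical per-thread slice of the current chunk. */
typedef struct {
    const unsigned char *inputs;   /* would point into the current chunk  */
    unsigned char       *outputs;  /* would hold results for this slice   */
    size_t               count;    /* inputs in this slice                */
} work_t;

/* Each worker does the EC multiplications for its slice. */
static void *worker(void *arg)
{
    work_t *w = (work_t *)arg;
    for (size_t i = 0; i < w->count; i++) {
        /* ec_multiply(...) would stand in for the libsecp256k1 call. */
    }
    return NULL;
}

int main(void)
{
    for (size_t chunk = 0; chunk < TOTAL_INPUTS / CHUNK_SIZE; chunk++) {
        pthread_t tid[NUM_THREADS];
        work_t    work[NUM_THREADS];

        /* read_chunk(chunk, ...): load the next 50k inputs */

        for (int t = 0; t < NUM_THREADS; t++) {
            /* Real code points these into the chunk's buffers. */
            work[t].inputs  = NULL;
            work[t].outputs = NULL;
            work[t].count   = CHUNK_SIZE / NUM_THREADS;
            pthread_create(&tid[t], NULL, worker, &work[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);

        /* write_results(chunk, ...): save this chunk's output */
    }
    return 0;
}
```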

Related

Fork() vs number of cores

I am looking to do some data processing on some 6700 files and am using fork() to handle different sets of calculations on the data. Will getting a CPU with a higher core count allow me to run more forks? Currently I am using a quad-core with 8 threads and forking 8 times, which takes me about an hour per file. If I had a 64-core processor and forked 64 times (splitting up the calculations), would that decrease the time by about a factor of 8?
Theoretically no, according to Amdahl's law. Probably not in practice either, because many resources are shared (the caches, the operating system calls, the disk, etc.), but this really depends on your algorithm. For example, if your algorithm is embarrassingly parallel and CPU-bound, then you may see a great improvement when increasing the core count to 64.
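For reference, Amdahl's law bounds the speedup on N cores by the serial fraction s of the work:

```latex
S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s}
```

With a hypothetical serial fraction of s = 0.1, that gives S(8) ≈ 4.7 and S(64) ≈ 8.8, so 8x the cores buys less than 2x the speed in that case.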
A note after reading the comments on the question: if your algorithm has a complexity of O(n!), it may simply be impossible to execute in a realistic amount of time. For example, if your input is n=42, and let's say your machine can do 1 billion operations per second, then the time required to run your algorithm is greater than the age of the universe.
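To put numbers on that claim (42! ≈ 1.4 × 10^51):

```latex
\frac{42!}{10^{9}\ \text{ops/s}} \approx \frac{1.4 \times 10^{51}}{10^{9}}\ \text{s}
  = 1.4 \times 10^{42}\ \text{s},
\qquad \text{age of the universe} \approx 4.3 \times 10^{17}\ \text{s}
```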

Time to run instructions of a for loop

I am interested in producing a duration of 125 μs for implementing a TDM (Time Division Multiplexing) based scheme. However, I am not able to get this duration with an accuracy of ±5 μs on Linux. I am using DPDK, which runs on Ubuntu and Intel hardware. If I take the time from the system using clock_gettime(CLOCK_REALTIME), the call into the kernel to read the time adds overhead, which makes the measured duration inaccurate.
Therefore, I dedicated a CPU core to keeping time without asking the kernel for it. For this, I run a for loop for a fixed number of iterations (8000000), measure the time spent, and work out how many iterations need to be executed for the 125 μs duration (i.e. (125*8000000)/timespent).
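A simplified sketch of that calibration (not the DPDK code itself; I've used CLOCK_MONOTONIC here, and the names are placeholders):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define CAL_ITERS 8000000ULL   /* iterations used for calibration */

int main(void)
{
    struct timespec t0, t1;
    volatile uint64_t sink = 0;  /* volatile so the loop is not optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < CAL_ITERS; i++)
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e3;

    /* Scale: how many loop iterations correspond to 125 us? */
    uint64_t iters_for_125us = (uint64_t)(125.0 * CAL_ITERS / elapsed_us);

    printf("calibration: %.1f us for %llu iters -> %llu iters per 125 us\n",
           elapsed_us, (unsigned long long)CAL_ITERS,
           (unsigned long long)iters_for_125us);
    return 0;
}
```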
However, the problem is that this also gives inaccurate results (the result is always different, with a spread of about 1000 iterations between runs).
Does anybody know why I am getting inaccurate results even though I am dedicating a CPU core to this?
Do you know of a method to measure a very short duration (around 125 μs) without making a call to the kernel? Thanks!
You are getting inaccurate results because you are on a multitasking operating system. You cannot do this on a modern computer; you can only do it on an embedded microcontroller where you control 100% of the CPU time. The operating system still needs to manage your process, even if you have a dedicated CPU, and the mouse and keyboard take time as well. You would have to run the process on 'bare metal'.

Are two successive calls to getrusage guaranteed to produce increasing results?

In a program that calls getrusage() twice in order to obtain the time of a task by subtraction, I once saw an assertion, which checks that the time of the task is nonnegative, fail. This, of course, cannot easily be reproduced, although I could write a specialized program that might reproduce it more easily.
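The pattern in question is essentially this (a minimal sketch, not the actual program):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to microseconds. */
static long long tv_to_us(struct timeval tv)
{
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main(void)
{
    struct rusage before, after;

    getrusage(RUSAGE_SELF, &before);
    /* ... run the task being measured ... */
    getrusage(RUSAGE_SELF, &after);

    long long user_us = tv_to_us(after.ru_utime) - tv_to_us(before.ru_utime);

    /* This is the kind of assertion that was observed to fail once. */
    assert(user_us >= 0);
    printf("user time: %lld us\n", user_us);
    return 0;
}
```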
I have tried to find a guarantee that getrusage() increases over the course of execution, but neither the man page on my system (Linux on x86-64) nor this system-independent description say so explicitly.
The behavior was observed on a physical computer, with several cores, and NTP running.
Should I report a bug against the OS I am using? Am I asking too much when I expect getrusage() to increase with time?
On many systems rusage (I presume you mean ru_utime and ru_stime) is not calculated accurately; it's just sampled once per clock tick, which is usually as slow as 100Hz and sometimes even slower.
The primary reason is that many machines have clocks that are incredibly expensive to read, and you don't want to do this accounting precisely (you'd have to read the clock twice for every system call). You could easily end up spending more time reading clocks than doing anything else in programs that make many system calls.
The counters should never go backwards, though. I did see that many years ago on a system where the total running time of the process was tracked on context switches (which was relatively cheap), and getrusage calculated utime by taking the sampled stime and subtracting it from the total running time. The clock used in that case was the wall clock instead of a monotonic clock, so when you changed the time on the machine, the running time of processes could go backwards. But that was, of course, a bug.

Getting reliable performance measurements for short bits of code

I'm trying to profile a set of functions which implement different versions of the same algorithm in different ways. I've increased the number of times each function is run so that the total time spent in a single function is roughly one minute (to reveal the performance differences).
Now, running the test several times produces baffling results. There is huge variability (±50%) between executions of the same function, and determining which function is fastest (which is the goal of the test) is nearly impossible because of that.
Is there something special I should take care of before running the tests, so that I get smoother measurements? Failing that, is running the test several times and computing the average for each function the way to go?
There are lots of things to check!
First, make sure your functions are actually CPU-bound. If so, make sure you have all CPU throttling, turbo modes, and power-saving modes disabled (in the BIOS) for the test. If you still have trouble, try pinning your process to a single core. Perhaps disable hyper-threading too.
The goal of all this is to get your code running hot on a single core without much interruption. If you're on Linux, you can remove a core from the OS's list of generally schedulable cores and run on that core, so there is no chance of interference on it.
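For example, on Linux the pinning itself is a couple of lines with sched_setaffinity (combine it with the isolcpus= kernel parameter if you want that core removed from the general scheduling pool); the core number below is just an example:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);               /* pin to core 3 (pick an isolated core) */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* ... run the timed functions here; they will stay on core 3 ... */
    return EXIT_SUCCESS;
}
```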
Running the test several times is a good idea, but using the average (arithmetic mean) is not. Instead, use the median or the minimum or some other measurement that won't be influenced by outliers. Usually the occasional long test run can be thrown out entirely (unless you're building a real-time system!).
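For the aggregation step, taking the median of the per-run timings is straightforward; a small sketch with made-up numbers:

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n timing samples; robust against the occasional slow outlier.
   Note: sorts the array in place. */
static double median(double *samples, size_t n)
{
    qsort(samples, n, sizeof(double), cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

int main(void)
{
    double runs[] = { 61.2, 59.8, 60.4, 93.7, 60.1 };  /* made-up timings, seconds */
    printf("median: %.1f s\n", median(runs, sizeof runs / sizeof runs[0]));
    return 0;
}
```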

CUDA - limiting number of SMs being used

Is there any way I can EXPLICITLY limit the number of GPU multiprocessors being used during the runtime of my program? I would like to measure how my algorithm scales with a growing number of multiprocessors.
If it helps: I am using CUDA 4.0 and a device with compute capability 2.0.
Aaahhh... I know the problem. I played with it myself when writing a paper.
There is no explicit way to do it; however, you can try "hacking" it by having some of the blocks do nothing.
If you never launch more blocks than there are multiprocessors, then your job is easy: just launch even fewer blocks. Some SMs are then guaranteed to have no work, because a block cannot be split across multiple SMs.
If you launch many more blocks and just rely on the driver to schedule them, use a different approach: launch only as many blocks as your GPU can hold resident at once, and when a block finishes its piece of work, instead of terminating, have it loop back and fetch another piece of data to work on. Most likely the performance of your program will not fall; it may even improve if you schedule your work carefully :)
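A rough sketch of that looping approach, with each resident block pulling chunks off a global counter until the work runs out (illustrative only: the counter, chunk layout, and the doubling computation are made-up stand-ins):

```cuda
__device__ unsigned int next_chunk = 0;   // global work counter

__global__ void persistent_kernel(const float *in, float *out,
                                  unsigned int n_chunks, unsigned int chunk_size)
{
    while (true) {
        // One thread per block grabs the next chunk index for the whole block.
        __shared__ unsigned int chunk;
        if (threadIdx.x == 0)
            chunk = atomicAdd(&next_chunk, 1u);
        __syncthreads();

        if (chunk >= n_chunks)
            return;                       // no work left: the block retires

        // Process this chunk cooperatively.
        for (unsigned int i = threadIdx.x; i < chunk_size; i += blockDim.x) {
            unsigned int idx = chunk * chunk_size + i;
            out[idx] = in[idx] * 2.0f;    // stand-in for the real computation
        }
        __syncthreads();                  // finish before fetching the next chunk
    }
}

// Launch only as many blocks as you want SMs kept busy, e.g.:
//   persistent_kernel<<<num_blocks, 256>>>(d_in, d_out, n_chunks, chunk_size);
```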
The biggest problem is when all your blocks are running on the GPU at once but you have more than one block per SM. Then you need to launch normally, but manually "disable" some of the blocks and have other blocks do their work for them. The problem is deciding which blocks to disable so that you can guarantee one SM is working and another is not.
From my own experiments, the compute capability 1.3 devices (I had a GTX 285) scheduled blocks in sequence. So, if I launched 60 blocks onto 30 SMs, blocks 1-30 were scheduled onto SMs 1-30, and then blocks 31-60 again onto SMs 1 through 30. So, by disabling blocks 5 and 35, SM number 5 ends up doing practically nothing.
Note, however, that this is a private, experimental observation of mine from two years ago. It is in no way confirmed, supported, maintained or anything else by NVIDIA, and it may change (or may already have changed) with newer GPUs and/or drivers.
My suggestion: try playing with some simple kernels that do a lot of pointless work and see how long they take on various "enabled"/"disabled" configurations. If you are lucky, you will catch a performance drop, indicating that two blocks are in fact being executed by a single SM.