linux high kernel cpu usage on memory initialization - c

I have a problem with high CPU consumption by the Linux kernel while bootstrapping my Java applications on a server. This problem occurs only in production; on the dev servers everything is lightning fast.
upd9: There were two questions about this issue:
How to fix it? Nominal Animal suggested syncing and dropping everything, and it really helps: sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches' works. upd12: But in fact sync alone is enough.
Why is this happening? It is still open for me. I do understand that flushing dirty pages to disk consumes kernel CPU and IO time, which is normal. But what is strange: why does even a single-threaded application written in C load ALL cores to 100% in kernel space?
Based on upd10 and upd11 I suspect that echo 3 > /proc/sys/vm/drop_caches is not required to fix my problem with slow memory allocation.
It should be enough to run sync before starting the memory-consuming application.
I will probably try this tomorrow in production and post the results here.
upd10: Lots of FS cache pages case:
I executed cat 10GB.file > /dev/null, then
ran sync to be sure there were no dirty pages (cat /proc/meminfo | grep ^Dirty displayed 184 kB).
Checking cat /proc/meminfo | grep ^Cached I got 4 GB cached.
Running int main(char**) I got normal performance (around 50 ms to initialize 32 MB of allocated data).
Cached memory dropped to 900 MB.
Test summary: I think it is no problem for Linux to reclaim pages used for the FS cache into allocated memory (a small C equivalent of the meminfo check above is sketched below).
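The same check can also be done from a tiny C helper instead of the shell one-liners above; this is just my own sketch of an equivalent of the greps, not part of the original test:
/* Print the Dirty and Cached lines from /proc/meminfo, like the greps above. */
#include <stdio.h>
#include <string.h>

int main(void){
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Dirty:", 6) == 0 || strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}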
upd11: LOTS of dirty pages case.
I ran my HowMongoDdWorks example with the read part commented out, and after some time
/proc/meminfo said that 2.8 GB was Dirty and 3.6 GB was Cached.
I stopped HowMongoDdWorks and ran my int main(char**).
Here is part of the result:
init 15, time 0.00s
x 0 [try 1/part 0] time 1.11s
x 1 [try 2/part 0] time 0.04s
x 0 [try 1/part 1] time 1.04s
x 1 [try 2/part 1] time 0.05s
x 0 [try 1/part 2] time 0.42s
x 1 [try 2/part 2] time 0.04s
Test summary: lots of dirty pages significantly slow down the first access to allocated memory (to be fair, this starts to happen only when total application memory becomes comparable to whole OS memory; i.e. if you have 8 of 16 GB free, there is no problem allocating 1 GB, but the slowdown starts from 3 GB or so).
Now I have managed to reproduce this situation in my dev environment, so here are new details.
Dev machine configuration:
Linux 2.6.32-220.13.1.el6.x86_64 - Scientific Linux release 6.1 (Carbon)
RAM: 15.55 GB
CPU: 1 X Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz (4 threads) (physical)
It's 99.9% certain the problem is caused by a large amount of dirty pages in the FS cache. Here is the application which creates lots of dirty pages:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/**
 * @author dmitry.mamonov
 * Created: 10/2/12 2:53 PM
 */
public class HowMongoDdWorks {
    public static void main(String[] args) throws IOException {
        final long length = 10L*1024L*1024L*1024L;
        final int pageSize = 4*1024;
        final int lengthPages = (int) (length/pageSize);
        final byte[] buffer = new byte[pageSize];
        final Random random = new Random();
        System.out.println("Init file");
        final RandomAccessFile raf = new RandomAccessFile("random.file", "rw");
        raf.setLength(length);
        int written = 0;
        int readed = 0;
        System.out.println("Test started");
        while (true) {
            { //write.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.write(buffer);
                written++;
            }
            { //read.
                random.nextBytes(buffer);
                final long randomPageLocation = (long)random.nextInt(lengthPages)*(long)pageSize;
                raf.seek(randomPageLocation);
                raf.read(buffer);
                readed++;
            }
            if (written % 1024 == 0 || readed % 1024 == 0) {
                System.out.printf("W %10d R %10d pages\n", written, readed);
            }
        }
    }
}
And here is the test application, which causes high (up to 100% on all cores) CPU load in kernel space (same as below, but I will copy it once again).
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(char** argv){
    int last = clock(); //remember the time
    for(int i=0;i<16;i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4 = size/4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k=0;k<2;k++){ //initialize allocated memory twice
            for(int j=0;j<size4;j++){
                //memory initialization (if I skip this step my test ends in 0.000s)
                buffer[j]=k;
            }
            //print stats
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
While the previous HowMongoDdWorks program is running, int main(char** argv) shows results like this:
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
I keep everything below this line only for historical purposes.
upd1: both dev and production systems are big enough for this test.
upd7: it is not paging; at least I have not seen any storage IO activity during the problem time.
dev ~ 4 cores, 16 GB RAM, ~8 GB free
production ~ 12 cores, 24 GB RAM, ~16 GB free (8 to 10 GB is under FS cache, but it makes no difference; same results even if all 16 GB are totally free). This machine is also loaded by CPU, but not too high, ~10%.
upd8 (ref): New test case and a potential explanation, see the tail.
Here is my test case (I also tested Java and Python, but C should be the most clear):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(char** argv){
    int last = clock(); //remember the time
    for(int i=0;i<16;i++){ //repeat test several times
        int size = 256 * 1024 * 1024;
        int size4 = size/4;
        int* buffer = malloc(size); //allocate 256MB of memory
        for(int k=0;k<2;k++){ //initialize allocated memory twice
            for(int j=0;j<size4;j++){
                //memory initialization (if I skip this step my test ends in 0.000s)
                buffer[j]=k;
            }
            //print stats
            printf("x [%d] %.2f\n", k+1, (clock()-last)/(double)CLOCKS_PER_SEC);
            last = clock();
        }
    }
    return 0;
}
The output on the dev machine (partial):
x [1] 0.13 -- first initialization takes a bit longer
x [2] 0.12 -- than the second one, but the difference is not significant.
x [1] 0.13
x [2] 0.12
x [1] 0.15
x [2] 0.11
x [1] 0.14
x [2] 0.12
x [1] 0.14
x [2] 0.12
x [1] 0.13
x [2] 0.12
x [1] 0.14
x [2] 0.11
x [1] 0.14
x [2] 0.12 -- and the results are quite stable
...
The output on the production machine (partial):
x [1] 0.23
x [2] 0.19
x [1] 0.24
x [2] 0.19
x [1] 1.30 -- first initialization takes significantly longer
x [2] 0.19 -- than the second one (6x slower)
x [1] 10.94 -- and sometimes it is 50x slower!!!
x [2] 0.19
x [1] 1.10
x [2] 0.21
x [1] 1.52
x [2] 0.19
x [1] 0.94
x [2] 0.21
x [1] 2.36
x [2] 0.20
x [1] 3.20
x [2] 0.20 -- and the results are totally unstable
...
While running this test on the development machine the CPU usage barely rises from the ground; all cores show less than 5% usage in htop.
But running this test on the production machine I see up to 100% CPU usage on ALL cores (on average the load rises to 50% on the 12-core machine), and it is all kernel time.
upd2: all machines have the same CentOS Linux 2.6 installed; I work with them over ssh.
upd3: It's unlikely to be swapping; I haven't seen any disk activity during my test, and plenty of RAM is also free (also, the description is updated). – Dmitry
upd4: htop shows high CPU utilisation by the kernel, up to 100% utilisation on all cores (on prod).
upd5: does CPU utilization settle down after initialization completes? In my simple test, yes. For the real application, the only thing that helps is to stop everything else in order to start a new program (which is nonsense).
I have two questions:
Why is this happening?
How to fix it?
upd8: Improved test and explanation.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(char** argv){
    const int partition = 8;
    int last = clock();
    for(int i=0;i<16;i++){
        int size = 256 * 1024 * 1024;
        int size4 = size/4;
        int* buffer = malloc(size);
        buffer[0]=123;
        printf("init %d, time %.2fs\n", i, (clock()-last)/(double)CLOCKS_PER_SEC);
        last = clock();
        for(int p=0;p<partition;p++){
            for(int k=0;k<2;k++){
                for(int j=p*size4/partition;j<(p+1)*size4/partition;j++){
                    buffer[j]=k;
                }
                printf("x [try %d/part %d] time %.2fs\n", k+1, p, (clock()-last)/(double)CLOCKS_PER_SEC);
                last = clock();
            }
        }
    }
    return 0;
}
And the result looks like this:
init 15, time 0.00s -- malloc call takes nothing.
x [try 1/part 0] time 0.07s -- usually first try to fill buffer part with values is fast enough.
x [try 2/part 0] time 0.04s -- second try to fill buffer part with values is always fast.
x [try 1/part 1] time 0.17s
x [try 2/part 1] time 0.05s -- second try...
x [try 1/part 2] time 0.07s
x [try 2/part 2] time 0.05s -- second try...
x [try 1/part 3] time 0.07s
x [try 2/part 3] time 0.04s -- second try...
x [try 1/part 4] time 0.08s
x [try 2/part 4] time 0.04s -- second try...
x [try 1/part 5] time 0.39s -- BUT sometimes it takes significantly longer than average to fill part of the allocated buffer with values.
x [try 2/part 5] time 0.05s -- second try...
x [try 1/part 6] time 0.35s
x [try 2/part 6] time 0.05s -- second try...
x [try 1/part 7] time 0.16s
x [try 2/part 7] time 0.04s -- second try...
Facts I learned from this test:
Memory allocation itself is fast.
First access to allocated memory is fast (so it is not a lazy buffer allocation problem).
I split the allocated buffer into parts (8 in the test).
I fill each buffer part with value 0, and then with value 1, printing the consumed time.
The second fill of each buffer part is always fast.
BUT the first fill of a buffer part is always a bit slower than the second fill (I believe some extra work is done by the kernel on first page access).
Sometimes it takes SIGNIFICANTLY longer to fill a buffer part with values the first time.
I tried the suggested answer and it seems it helped. I will recheck and post results again later.
Looks like Linux maps allocated pages onto dirty file system cache pages, and it takes a lot of time to flush those pages to disk one by one. But a full sync works fast and eliminates the problem.

Run
sudo sh -c 'sync ; echo 3 > /proc/sys/vm/drop_caches ; sync'
on your dev machine. It is a safe, nondestructive way to make sure your caches are empty. (You will not lose any data by running the above command, even if you happen to save or write to disk at the exact same time. It really is safe.)
Then, make sure you don't have any Java stuff running, and re-run the above command just to be sure. You can check whether you have any Java running, for example with
ps axu | sed -ne '/ sed -ne /d; /java/p'
It should output nothing. If it does, close your Java stuff first.
Now, re-run your application test. Does the same slowdown now occur on your dev machine, too?
If you care to leave the comment either way, Dmitry, I'd be happy to explore the issue further.
Edited to add: I suspect that the slowdown does occur, and is due to the large startup latency incurred by Java itself. It is a very common issue, and basically built-in to Java, a result of its architecture. For larger applications, startup latency is often a significant fraction of a second, no matter how fast the machine, simply because Java has to load and prepare the classes (mostly serially, too, so adding cores will not help).
In other words, I believe the blame should fall on Java, not Linux; quite the opposite, since Linux manages to mitigate the latency on your development machine through kernel-level caching -- and that only because you keep running those Java components practically all the time, so the kernel knows to cache them.
Edit 2: It would be very useful to see which files your Java environment accesses when your application is started. You can do this with strace:
strace -f -o trace.log -q -tt -T -e trace=open COMMAND...
which creates file trace.log containing the open() syscalls done by any of the processes started by COMMAND.... To save the output to trace.PID for each process the COMMAND... starts, use
strace -f -o trace -ff -q -tt -T -e trace=open COMMAND...
Comparing the outputs on your dev and prod installations will tell you if they are truly equivalent. One of them may have extra or missing libraries, affecting the startup time.
If an installation is old and the system partition is reasonably full, it is possible that those files have become fragmented, causing the kernel to spend more time waiting for I/O to complete. (Note that the amount of I/O stays the same; only the time it takes to complete increases if the files are fragmented.) You can use the command
LANG=C LC_ALL=C sed -ne 's|^[^"]* open("\(.*\)", O[^"]*$|\1|p' trace.* \
| LANG=C LC_ALL=C xargs -r -d '\n' filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
to check how fragmented the files used by your application are; it reports how many files use only one, or more than one extents. Note that it does not include the original executable (COMMAND...), only the files it accesses.
If you just want to get the fragmentation statistics for files accessed by a single command, you can use
LANG=C LC_ALL=C strace -f -q -tt -T -e trace=open COMMAND... 2>&1 \
| LANG=C LC_ALL=C sed -ne 's|^[0-9:.]* open("\(.*\)", O[^"]*$|\1|p' \
| LANG=C LC_ALL=C xargs -r filefrag \
| LANG=C LC_ALL=C awk '(NF > 3 && $NF == "found") { n[$(NF-2)]++ }
END { for (i in n) printf "%d extents %d files\n", i, n[i] }' \
| sort -g
If the problem is not due to caching, then I think it is most likely that the two installations are not truly equivalent. If they are, then I'd check the fragmentation. After that, I'd do a full trace (omitting the -e trace=open) on both environments to see where exactly the differences are.
I do believe I now understand your problem/situation.
On your prod environment, kernel page cache is mostly dirty, i.e. most cached stuff is stuff that is going to be written to disk.
When your application allocates new pages, the kernel only sets up the page mappings, it does not actually give physical RAM immediately. That only happens on the first access to each page.
On the first access, the kernel first locates a free page -- typically, a page that contains "clean" cached data, i.e. something read from the disk but not modified. Then, it clears it to zeros, to avoid information leaks between processes. (When using the C library allocation facilities like malloc() etc., instead of the direct mmap() family of functions, the library may use/reuse parts of the mapping. Although the kernel does clear the pages to zeros, the library may "dirty" them. Using mmap() to get anonymous pages you do get them zeroed out.)
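As a small aside (my own sketch, not part of the answer), you can see the difference between kernel-zeroed anonymous pages and allocator-recycled memory with something like this:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void){
    size_t len = 1 << 20;  /* 1 MB */
    /* Anonymous mmap(): the kernel guarantees these pages read as zero. */
    unsigned char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (anon == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mmap first byte  : %d (always 0)\n", anon[0]);
    munmap(anon, len);

    /* malloc(): a small allocation may be recycled from the allocator's free
     * lists, so it can still contain whatever this process wrote earlier.
     * (Reading it here is for demonstration only.) */
    unsigned char *p = malloc(4096);
    memset(p, 0xAA, 4096);
    free(p);
    unsigned char *q = malloc(4096); /* may reuse p's memory */
    printf("malloc first byte: 0x%02x (not guaranteed to be 0)\n", q[0]);
    free(q);
    return 0;
}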
If the kernel does not have suitable clean pages at hand, it must flush some of the oldest dirty pages to disk first. (There are processes inside the kernel that flush pages to disk, and mark them clean, but if the server load is such that pages are continuously dirtied, it is usually desirable to have mostly dirty pages instead of mostly clean pages -- the server gets more work done that way. Unfortunately, it also means an increase in the first page access latency, which you have now encountered.)
Each page is sysconf(_SC_PAGESIZE) bytes long, aligned. In other words, pointer p points to the start of a page if and only if ((long)p % sysconf(_SC_PAGESIZE)) == 0. Most kernels, I believe, do actually populate groups of pages in most cases instead of individual pages, thus increasing the latency of the first access (to each group of pages).
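A quick way to see the page size and the page alignment of an allocation (again my own illustration, not from the answer):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void){
    long page = sysconf(_SC_PAGESIZE);  /* page size in bytes, typically 4096 */
    char *p = malloc(4 * page);
    printf("page size    : %ld bytes\n", page);
    printf("buffer start : %p\n", (void *)p);
    printf("page aligned : %s\n", ((long)p % page == 0) ? "yes" : "no");
    /* Touching one byte per page is enough to fault in every page. */
    for (long off = 0; off < 4 * page; off += page)
        p[off] = 1;
    free(p);
    return 0;
}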
Finally, there may be some compiler optimization that plays havoc with your benchmarking. I recommend you write a separate source file for a benchmarking main(), and the actual work done on each iteration in a separate file. Compile them separately, and just link them together, to make sure the compiler does not rearrange the time functions wrt. the actual work done. Basically, in benchmark.c:
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <stdio.h>

/* in work.c, adjust as needed */
void work_init(void);      /* Optional, allocations etc. */
void work(long iteration); /* Completely up to you, including parameters */
void work_done(void);      /* Optional, deallocations etc. */

#define PRIMING 0
#define REPEATS 100

int main(void)
{
    double wall_seconds[REPEATS];
    struct timespec wall_start, wall_stop;
    long iteration;

    work_init();

    /* Priming: do you want caches hot? */
    for (iteration = 0L; iteration < PRIMING; iteration++)
        work(iteration);

    /* Timed iterations */
    for (iteration = 0L; iteration < REPEATS; iteration++) {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        work(iteration);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
        wall_seconds[iteration] = (double)(wall_stop.tv_sec - wall_start.tv_sec)
                                + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    }

    work_done();

    /* TODO: wall_seconds[0] is the first iteration.
     *       Comparing to successive iterations (assuming REPEATS > 0)
     *       tells you about the initial latency.
     */

    /* TODO: Sort wall_seconds, for easier statistics.
     *       Most reliable value is the median, with half of the
     *       values larger and half smaller.
     *       Personally, I like to discard first and last 15.85%
     *       of the results, to get "one-sigma confidence" interval.
     */

    return 0;
}
with the actual array allocation, deallocation, and filling (per repeat loop) done in the work() functions defined in work.c.
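For completeness, a minimal work.c that plugs into the harness above could look like the following; the 256 MB buffer size simply mirrors the earlier test and is my assumption, not something the answer prescribes:
/* work.c -- example workload for the benchmark.c harness above (a sketch). */
#include <stdlib.h>
#include <string.h>

#define WORK_BYTES (256UL * 1024UL * 1024UL)  /* mirrors the 256 MB test */

void work_init(void) { /* nothing to set up in this sketch */ }

void work(long iteration)
{
    /* Allocate, touch every byte, and free again, so each timed iteration
     * includes the page-fault cost of populating a fresh mapping. */
    char *buf = malloc(WORK_BYTES);
    if (!buf)
        exit(1);
    memset(buf, (int)(iteration & 0xFF), WORK_BYTES);
    free(buf);
}

void work_done(void) { /* nothing to release in this sketch */ }
Compile the two files together, e.g. gcc -O2 benchmark.c work.c (older glibc may additionally need -lrt for clock_gettime).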

When the kernel runs out of available clean pages, it must flush dirty pages to disk. Flushing a lot of dirty pages to disk looks like a high CPU load, because most kernel-side stuff requires one or more pages (temporarily) to work. Essentially, the kernel is waiting for I/O to complete, even when the userspace application called a non-I/O-related kernel function.
If you run a microbenchmark in parallel, say a program that just continuously dirties a very large mapping over and over, and measures the CPU time (__builtin_ia32_rdtsc() if using GCC on x86 or x86-64) without calling any syscalls, you should see that this one gets plenty of CPU time even when the kernel is seemingly eating "all" of CPU time. Only when a process calls a kernel function (syscall) that internally requires some memory, will that call "block", stuck waiting in the kernel for the page flushing to yield new pages.
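A rough sketch of such a syscall-free microbenchmark (my own example; the 512 MB mapping size and the pass count are arbitrary):
/* Continuously dirty a large anonymous mapping and measure elapsed TSC
 * cycles per pass, with no syscalls inside the loop (GCC, x86/x86-64). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void){
    size_t len = 512UL * 1024 * 1024;  /* 512 MB, arbitrary */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long long cycles[32];
    for (int pass = 0; pass < 32; pass++) {
        unsigned long long start = __builtin_ia32_rdtsc();
        memset(buf, pass, len);                      /* dirty every page */
        cycles[pass] = __builtin_ia32_rdtsc() - start;
    }
    for (int pass = 0; pass < 32; pass++)            /* print only at the end */
        printf("pass %2d: %llu cycles\n", pass, cycles[pass]);
    munmap(buf, len);
    return 0;
}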
When running benchmarks, it is usually enough to just run sudo sh -c 'sync ; echo 3 >/proc/sys/vm/drop_caches ; sync' a couple of times before running the benchmark, to make sure there should be no undue memory pressure during the benchmark. I never use it in a production environment. (While it is safe to run, i.e. does not lose data, it is like killing mosquitoes using a sledgehammer: the wrong tool for the job.)
When you find in a production environment that your latencies start to grow too large due to kernel flushing dirty pages -- which it does at maximum device speed I believe, possibly causing a hiccup in application I/O speeds too --, you can tune the kernel dirty page flushing mechanism. Basically, you can tell the kernel to flush dirty pages much sooner to disk, and make sure there won't be that many dirty pages at any point in time (if possible).
Gregory Smith has written about the theory and tuning of the flushing mechanism here. In short, /proc/sys/vm/ contains kernel tunables you can modify. They are reset to defaults at boot, but you can easily write a simple init script to echo the desired values to the files at boot. If the processes running on the production machine do heavy I/O, you might also look at the filesystem tunables. At minimum, you should mount your filesystems (see /etc/fstab) using the relatime flag, so that file access times are only updated for the first access after the file has been modified or its status changed.
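To see where those knobs currently stand on a given machine, something like this small reader can help (my sketch; the listed files are the usual dirty-writeback tunables, but check your kernel's documentation for the full set):
/* Print the current values of the kernel's dirty-page writeback tunables. */
#include <stdio.h>

int main(void){
    const char *tunables[] = {
        "dirty_background_ratio",   /* % of RAM dirty before background flushing starts */
        "dirty_ratio",              /* % of RAM dirty before writers are throttled      */
        "dirty_expire_centisecs",   /* age after which dirty data must be written out   */
        "dirty_writeback_centisecs" /* how often the flusher threads wake up            */
    };
    char path[128], value[64];
    for (unsigned i = 0; i < sizeof tunables / sizeof tunables[0]; i++) {
        snprintf(path, sizeof path, "/proc/sys/vm/%s", tunables[i]);
        FILE *f = fopen(path, "r");
        if (f && fgets(value, sizeof value, f))
            printf("%-26s = %s", tunables[i], value);
        if (f) fclose(f);
    }
    return 0;
}
Lowering dirty_background_ratio (or using dirty_background_bytes) makes the kernel start background flushing earlier, which is the direction suggested above.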
Personally, I also use a low-latency pre-emptible kernel with 1000 Hz timer for multimedia workstations (and for multimedia servers if I had any right now). Such kernels run user processes in shorter slices, and usually provide much better latencies, although maximum computation capability is slightly less. If your production services are latency-sensitive, I do recommend switching your production servers to such kernels.
Many distributions do provide such kernels already, but I find it much simpler to recompile distro kernels instead, or even switch to kernel.org kernels. The procedure is simple: you need kernel development and tools installed (on Debian variants, make-kpkg is VERY useful). To upgrade a kernel, you obtain new sources, configure the kernel (usually using current configuration as the basis -- make oldconfig), build the new kernel, and install the packages, before rebooting. Most people do find that just upgrading the hardware is more cost effective than recompiling distro kernels, but I find recompiling kernels very effortless myself. I don't automatically reboot for kernel upgrades anyway, so adding a simple step (triggered by running a single script) prior to rebooting is not too much effort for me.

Related

Accelerate framework uses only one core on Mac M1

The following C program (dgesv_ex.c)
#include <stdlib.h>
#include <stdio.h>

/* DGESV prototype */
extern void dgesv( int* n, int* nrhs, double* a, int* lda, int* ipiv,
                   double* b, int* ldb, int* info );

/* Main program */
int main() {
    /* Locals */
    int n = 10000, info;
    /* Local arrays */
    /* Initialization */
    double *a = malloc(n*n*sizeof(double));
    double *b = malloc(n*n*sizeof(double));
    int *ipiv = malloc(n*sizeof(int));
    for (int i = 0; i < n*n; i++)
    {
        a[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    for (int i = 0; i < n*n; i++)
    {
        b[i] = ((double) rand()) / ((double) RAND_MAX) - 0.5;
    }
    /* Solve the equations A*X = B */
    dgesv( &n, &n, a, &n, ipiv, b, &n, &info );
    free(a);
    free(b);
    free(ipiv);
    exit( 0 );
} /* End of DGESV Example */
compiled on a Mac mini M1 with the command
clang -o dgesv_ex dgesv_ex.c -framework accelerate
uses only one core of the processor (as also shown by the activity monitor)
me@macmini-M1 ~ % time ./dgesv_ex
./dgesv_ex 35,54s user 0,27s system 100% cpu 35,758 total
I checked that the binary is of the right type:
me@macmini-M1 ~ % lipo -info dgesv
Non-fat file: dgesv is architecture: arm64
As a comparison, on my Intel MacBook Pro I get the following output:
me@macbook-intel ~ % time ./dgesv_ex
./dgesv_ex 142.69s user 0,51s system 718% cpu 19.925 total
Is it a known problem? Maybe a compilation flag or something else?
Accelerate uses the M1's AMX coprocessor to perform its matrix operations, it is not using the typical paths in the processor. As such, the accounting of CPU utilization doesn't make much sense; it appears to me that when a CPU core submits instructions to the AMX coprocessor, it is accounted as being held at 100% utilization while it waits for the coprocessor to finish its work.
We can see evidence of this by running multiple instances of your dgesv benchmark in parallel, and watching as the runtime increases by a factor of two, but the CPU monitor simply shows two processes using 100% of one core:
clang -o dgesv_accelerate dgesv_ex.c -framework Accelerate
$ time ./dgesv_accelerate
real 0m36.563s
user 0m36.357s
sys 0m0.251s
$ ./dgesv_accelerate & ./dgesv_accelerate & time wait
[1] 6333
[2] 6334
[1]- Done ./dgesv_accelerate
[2]+ Done ./dgesv_accelerate
real 0m59.435s
user 1m57.821s
sys 0m0.638s
This implies that there is a shared resource that each dgesv_accelerate process is consuming; one that we don't have much visibility into. I was curious as to whether these dgesv_accelerate processes are actually consuming computational resources at all while waiting for the AMX coprocessor to finish its task, so I linked another version of your example against OpenBLAS, which is what we use as the default BLAS backend in the Julia language. I am using the code hosted in this gist which has a convenient Makefile for downloading OpenBLAS (and its attendant compiler support libraries such as libgfortran and libgcc) and compiling everything and running timing tests.
Note that because the M1 is a big.LITTLE architecture, we generally want to avoid creating so many threads that we schedule large BLAS operations on the "efficiency" cores; we mostly want to stick to only using the "performance" cores. You can get a rough outline of what is being used by opening the "CPU History" graph of Activity Monitor. Here is an example showcasing normal system load, followed by running OPENBLAS_NUM_THREADS=4 ./dgesv_openblas, and then OPENBLAS_NUM_THREADS=8 ./dgesv_openblas. Notice how in the four threads example, the work is properly scheduled onto the performance cores and the efficiency cores are free to continue doing things such as rendering this StackOverflow webpage as I am typing this paragraph, and playing music in the background. Once I run with 8 threads however, the music starts to skip, the webpage begins to lag, and the efficiency cores are swamped by a workload they're not designed to do. All that, and the timing doesn't even improve much at all:
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.76 real 69.67 user 0.73 sys
$ OPENBLAS_NUM_THREADS=8 time ./dgesv_openblas
17.49 real 100.89 user 5.63 sys
Now that we have two different ways of consuming computational resources on the M1, we can compare and see if they interfere with each other; e.g. if I launch an "Accelerate"-powered instance of your example, will it slow down the OpenBLAS-powered instances?
$ OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
18.86 real 70.87 user 0.58 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=4 time ./dgesv_openblas
24.28 real 89.84 user 0.71 sys
So, sadly, it does appear that the CPU usage is real, and that it consumes resources that the OpenBLAS version wants to use. The Accelerate version also gets a little slower, but not by much.
In conclusion, the CPU usage numbers for an Accelerate-heavy process are misleading, but not totally so. There do appear to be CPU resources that Accelerate is using, but there is a hidden shared resource that multiple Accelerate processes must fight over. Using a non-AMX library such as OpenBLAS results in more familiar performance (and better runtime, in this case, although that is not always the case). The truly "optimal" usage of the processor would likely be to have something like OpenBLAS running on 3 Firestorm cores, and one Accelerate process:
$ OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
23.77 real 68.25 user 0.32 sys
$ ./dgesv_accelerate & OPENBLAS_NUM_THREADS=3 time ./dgesv_openblas
28.53 real 81.63 user 0.40 sys
This solves two problems at once, one taking 28.5s and one taking 42.5s (I simply moved the time to measure the dgesv_accelerate). This slowed the 3-core OpenBLAS down by ~20% and the Accelerate by ~13%, so assuming that you have an application with a very long queue of these problems to solve, you could feed them to these two engines and solve them in parallel with a modest amount of overhead.
I am not claiming that these configurations are actually optimal, just exploring what the relative overheads are for this particular workload because I am curious. :) There may be ways to improve this, and this all could change dramatically with a new Apple Silicon processor.
The original poster and the commenter are both somewhat unclear on exactly how AMX operates. That's OK, it's not obvious!
For pre-A15 designs the setup is:
(a) Each cluster (P or E) has ONE AMX unit. You can think of it as being more an attachment of the L2 than of a particular core.
(b) This unit has four sets of registers, one for each core.
(c) An AMX unit gets its instructions from the CPU (sent down the Load/Store pipeline, but converted at some point to a transaction that is sent to the L2 and so the AMX unit).
Consequences of this include that
AMX instructions execute out of order on the core just like other instructions, interleaved with other instructions, and the CPU will do all the other sorts of overhead you might expect (counter increments, maybe walking and dereferencing sparse vectors/matrices) in parallel with AMX.
A core that is running a stream of AMX instructions will look like a 100% utilized core. Because it is! (100% doesn't mean every cycle the CPU is executing at full width; it means the CPU never gives up any time to the OS for whatever reason).
Ideally, data for AMX is present in L2. If it is present in L1, you lose a cycle or three in the transfer to L2 before AMX can access it.
(Most important for this question) there is no value in having multiple cores running AMX code to solve a single problem. They will all end up fighting over the same single AMX unit anyway! So why complicate the code with Altivec by trying to achieve that. It will work (because of the abstraction of 4 sets of registers) but that's there to help "un-co-ordinated" code from different apps to work without forcing some sort of synchronization/allocation of the resource.
the AMX unit on the E-cluster does work, so why not use it? Well, it runs at a lower frequency and a different design with much less parallelization. So it can be used by code that, for whatever reason, both runs on the E-core and wants AMX. But trying to use that AMX unit along with the P AMX-unit is probably more trouble than it's worth. The speed differences are large enough to make it very difficult to ensure synchronization and appropriate balancing between the much faster P and the much slower E. I can't blame Apple for considering pursuing this a waste of time.
More details can be found here:
https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f
It is certainly possible that Apple could change various aspects of this at any time, for example adding two AMX units to the P-cluster. Presumably when this happens, Accelerate will be updated appropriately.

Looking for cause of unexpected preemption in linux kernel module

I have a small linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end with a duration that is a few microseconds. I am trying to measure whether this is possible with the intel rdtscp instruction using an ndelay() call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it has a very large delay that appears as if something else is preempting the code despite running inside a spinlock which should be disabling interrupts. This is run using a stock Ubuntu 64 bit kernel (4.4.0-112) with no extra realtime or low latency patches.
Here is some example code that replicates this behavior. This is written as a handler for a /proc filesystem entry for easy testing, but I have only shown the function that actually computes the delays:
#define ITERATIONS 50000
#define SKIPITER 10

DEFINE_SPINLOCK(timer_lock);

static int timing_test_show(struct seq_file *m, void *v)
{
    uint64_t i;
    uint64_t first, start, stop, delta, max=0, min=1000000;
    uint64_t avg_ticks;
    uint32_t a, d, c;
    unsigned long flags;
    int above30k=0;

    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    first = a | (((uint64_t)d)<<32);
    for (i=0; i<ITERATIONS; i++) {
        spin_lock_irqsave(&timer_lock, flags);
        __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
        start = a | (((uint64_t)d)<<32);
        ndelay(1000);
        __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
        stop = a | (((uint64_t)d)<<32);
        spin_unlock_irqrestore(&timer_lock, flags);
        if (i < SKIPITER) continue;
        delta = stop-start;
        if (delta < min) min = delta;
        if (delta > max) max = delta;
        if (delta > 30000) above30k++;
    }
    seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k);
    avg_ticks = (stop - first) / ITERATIONS;
    seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks);
    return 0;
}
Then if I run:
# cat /proc/timing_test
min: 4176 max: 58248 above30k: 56
Average total ticks/iteration: 4365
This is on a 3.4 GHz Sandy Bridge generation Core i7. The ~4200 ticks of the TSC is about right for a little over 1 microsecond of delay. About 0.1% of the time I see delays about 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks.
These delays appear too long to be a single cache miss, even to DRAM. So I think it either has to be several cache misses, or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes of this to see if they are something we can eliminate or if we have to move to a custom processor/FPGA solution.
Things I have tried:
I considered if this could be caused by cache misses. I don't think that could be the case since I ignore the first few iterations which should load the cache. I have verified by examining disassembly that there are no memory operations between the two calls to rdtscp, so I think the only possible cache misses are for the instruction cache.
Just in case, I moved the spin_lock calls around the outer loop. Then it shouldn't be possible to have any cache misses after the first iteration. However, this made the problem worse.
I had heard that the SMM interrupt is unmaskable and mostly transparent and could cause unwanted preemption. However, you can read the SMI interrupt count with rdmsr on MSR_SMI_COUNT. I tried adding that before and after, and there are no SMM interrupts happening while my code is executing (a user-space sketch of this kind of check is shown after this list).
I understand there are also inter-processor interrupts in SMP systems that may interrupt, but I looked at /proc/interrupts before and after and don't see enough of them to explain this behavior.
I don't know if ndelay() takes into account variable clock speed, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.
I booted with nopti to disable page table isolation in case that is causing problems.
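For reference, the MSR_SMI_COUNT check mentioned above can also be done from user space via the msr driver (modprobe msr, needs root); this is my own sketch, not the code used inside the module:
/* Read MSR_SMI_COUNT (0x34) on CPU 0 through the msr driver. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    uint64_t smi_count;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    /* For the msr device, the read offset selects the MSR number. */
    if (pread(fd, &smi_count, sizeof smi_count, 0x34) != sizeof smi_count) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("SMI count on CPU 0: %" PRIu64 "\n", smi_count);
    close(fd);
    return 0;
}
Reading /dev/cpu/N/msr for the other cores lets you check whether SMIs are being delivered elsewhere, as one of the answers below suggests.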
Another thing that I have just noticed is that it is unclear what ndelay() does. Maybe you should show it, as non-trivial problems may be lurking inside it.
For example, I've observed once that my piece of a kernel driver code was still preempted when it had a memory leak inside it, so as soon as it hit some watermark limit, it was put aside even if it disabled interrupts.
120,000 ticks that you observed in extreme cases sounds a lot like an SMM handler. Lesser values might have been caused by an assortment of microarchitectural events (by the way, have you checked all the performance counters available to you?), but this is something that must be caused by a subroutine written by someone who was not writing his/her code to achieve minimal latency.
However you stated that you've checked that no SMIs are observed. This leads me to think that either something is wrong with kernel facilities to count/report them, or with your method to look after them. Hunting after SMI without a hardware debugger may be a frustrating endeavor.
Was SMI_COUNT not changing during your experiment, or was it exactly zero all the time? The latter might indicate that it does not count anything, unless your system is completely free of SMIs, which I doubt in the case of a regular Sandy Bridge.
It may be that SMIs are delivered to another core in your system, and an SMM handler is synchronizing other cores through some sort of mechanism that does not show up on SMI_COUNT. Have you checked other cores?
In general I would recommend starting to downsize your system under test to exclude as much stuff as possible. Have you tried booting it with a single core and no hyperthreading enabled in the BIOS? Have you tried to run the same code on a system that is known to not have SMIs? The same goes for disabling Turbo Boost and frequency scaling in the BIOS. As much as possible of the timing-related machinery must go.
FYI, in my system:
timingtest % uname -a
Linux xxxxxx 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Replicating your example (with ndelay(1000);) I get:
timingtest % sudo cat /proc/timing_test
min: 3783 max: 66883 above30k: 20
Average total ticks/iteration: 4005
timingtest % sudo cat /proc/timing_test
min: 3783 max: 64282 above30k: 19
Average total ticks/iteration: 4010
Replicating your example (with udelay(1);) I get:
timingtest % sudo cat /proc/timing_test
min: 3308 max: 43301 above30k: 2
Average total ticks/iteration: 3611
timingtest % sudo cat /proc/timing_test
min: 3303 max: 44244 above30k: 2
Average total ticks/iteration: 3600
ndelay(),udelay(),mdelay() are for use in atomic context as stated here:
https://www.kernel.org/doc/Documentation/timers/timers-howto.txt
They all rely on the __const_udelay() function, which is a vmlinux exported symbol (using LFENCE/RDTSC instructions).
Anyway, I replaced the delay with:
for (delta=0,c=0; delta<500; delta++) {c++; c|=(c<<24); c&=~(c<<16);}
for a trivial busy loop, with the same results.
I also tried with _cli()/_sti(), local_bh_disable()/local_bh_enable() and preempt_disable()/preempt_enable(), without success.
Examining SMM interrupts (before and after the delay) with:
__asm__ volatile ("rdmsr" : "=a" (a), "=d" (d) : "c"(0x34) : );
smi_after = (a | (((uint64_t)d)<<32));
I always obtain the same number (no SMI or register not updated).
Executing the cat command under trace-cmd to explore what's happening, I get results surprisingly not so scattered in time. (!?)
timingtest % sudo trace-cmd record -o trace.dat -p function_graph cat /proc/timing_test
plugin 'function_graph'
min: 3559 max: 4161 above30k: 0
Average total ticks/iteration: 5863
...
In my system, the problem can be solved by making use of Power Management Quality of Service, see (https://access.redhat.com/articles/65410). Hope this helps.

Sequential part of the program takes more time as process count increases

I'm writing a parallel C program using MPICH. The program naturally has a sequential part and a parallel part. The parallel part seems to be working fine, however I'm having trouble with the sequential part.
In the sequential part, the program reads some values from a file and then proceeds to distribute them among the other processes. It is written as follows:
if (rank == 0)
{
    gettimeofday(&sequentialStartTime, NULL);
    // Read document id's and their weights into the corresponding variables
    documentCount = readDocuments(&weights, &documentIds, documentsFileName, dictionarySize);
    readQuery(&query, queryFileName, dictionarySize);
    gettimeofday(&endTime, NULL);
    timersub(&endTime, &sequentialStartTime, &resultTime);
    // convert seconds + microseconds to milliseconds
    printf("Sequential part: %.2f ms\n", (double) resultTime.tv_sec * 1000.0 + resultTime.tv_usec / 1000.0);
    //distribute the data to other processes
} else {
    //wait for the data and then start working
}
Here, readQuery and readDocuments read values from files, and the elapsed time is printed after they complete. This piece of code actually works just fine. The problem arises when I try to run this with a different number of processors.
I run the program with the following command
mpirun -np p ./main
where p is the processor count. I expect the sequential part to run in a certain amount of time no matter how many processors I use. For p values from 1 to 4 this holds; however, when I use 5 to 8 as p, the sequential part takes more time.
The processor I'm using is Intel® Core™ i7-4790 CPU @ 3.60GHz × 8 and the operating system I have is Windows 8.1 64-bit. I'm running this program on Ubuntu 14.04, which runs on a virtual machine that has full access to my processor and 8 GB of RAM.
The only reason that came to my mind is that maybe when the process count is higher than 4, the main process may share the physical core it's running on with another process, since I know that this CPU has 4 physical cores but functions as if it had 8, using the hyperthreading technology. However, when I increase p from 5 to 6 to 7 and so on, the execution time increases linearly, so that cannot be the case.
Any help or idea on this would be highly appreciated. Thanks in advance.
Edit: I realized that increasing p increases the run time no matter the value. I'm getting a linear increase in time as p increases.

Precise Linux Timing - What Determines the Resolution of clock_gettime()?

I need to do precision timing to the 1 µs level to time a change in duty cycle of a PWM wave.
Background
I am using a Gumstix Overo Water COM (https://www.gumstix.com/store/app.php/products/265/) that has a single-core ARM Cortex-A8 processor running at 499.92 BogoMIPS (the Gumstix page claims up to 1 GHz, with 800 MHz recommended) according to /proc/cpuinfo. The OS is an Angstrom image version of Linux based on kernel version 2.6.34, and it is stock on the Gumstix Water COM.
The Problem
I have done a fair amount of reading about precise timing in Linux (and have tried most of it), and the consensus seems to be that using clock_gettime() and referencing CLOCK_MONOTONIC is the best way to do it. (I would have liked to use the RDTSC register for timing since I have one core with minimal power-saving abilities, but this is not an Intel processor.) So here is the odd part: while clock_getres() returns 1, suggesting resolution at 1 ns, actual timing tests suggest a minimum resolution of 30517 ns or (it can't be coincidence) exactly the time between 32.768 kHz clock ticks. Here's what I mean:
// Stackoverflow example
#include <stdio.h>
#include <time.h>

#define SEC2NANOSEC 1000000000

int main( int argc, const char* argv[] )
{
    // //////////////// Min resolution test //////////////////////
    struct timespec resStart, resEnd, ts;
    ts.tv_sec = 0;  // s
    ts.tv_nsec = 1; // ns
    int iters = 100;
    double resTime, sum = 0;
    int i;
    for (i = 0; i < iters; i++)
    {
        clock_gettime(CLOCK_MONOTONIC, &resStart);      // start timer
        // clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, &ts);
        clock_gettime(CLOCK_MONOTONIC, &resEnd);        // end timer
        resTime = ((double)resEnd.tv_sec*SEC2NANOSEC + (double)resEnd.tv_nsec)
                - ((double)resStart.tv_sec*SEC2NANOSEC + (double)resStart.tv_nsec);
        sum = sum + resTime;
        printf("resTime = %f\n", resTime);
    }
    printf("Average = %f\n", sum/(double)iters);
    return 0;
}
(Don't fret over the double casting; tv_sec is a time_t and tv_nsec is a long.)
Compile with:
gcc soExample.c -o runSOExample -lrt
Run with:
./runSOExample
With the nanosleep commented out as shown, the result is either 0ns or 30517ns with the majority being 0ns. This leads me to believe that CLOCK_MONOTONIC is updated at 32.768kHz and most of the time the clock has not been updated before the second clock_gettime() call is made and in cases where the result is 30517ns the clock has been updated between calls.
When I do the same thing on my development computer (AMD FX(tm)-6100 Six-Core Processor running at 1.4 GHz) the minimum delay is a more constant 149-151ns with no zeros.
So, let's compare those results to the CPU speeds. For the Gumstix, that 30517ns (32.768kHz) equates to 15298 cycles of the 499.93MHz cpu. For my dev computer that 150ns equates to 210 cycles of the 1.4Ghz CPU.
With the clock_nanosleep() call uncommented the average results are these:
Gumstix: Avg value = 213623 and the result varies, up and down, by multiples of that min resolution of 30517ns
Dev computer: 57710-68065 ns with no clear trend. In the case of the dev computer I expect the resolution to actually be at the 1 ns level and the measured ~150ns truly is the time elapsed between the two clock_gettime() calls.
So, my questions are these:
What determines that minimum resolution?
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
Thanks! If you need more info or clarification just ask.
As I understand it, the difference between the two environments (Gumstix and your dev computer) might be the underlying timer hardware they are using.
Commented nanosleep() case:
You are using clock_gettime() twice. To give you a rough idea of what this clock_gettime() will ultimately get mapped to (in the kernel):
clock_gettime -->clock_get() -->posix_ktime_get_ts -->ktime_get_ts() -->timekeeping_get_ns()
-->clock->read()
clock->read() basically reads the value of the counter provided by the underlying timer driver and the corresponding hardware. A simple difference between the stored past counter value and the current counter value, followed by the nanosecond conversion arithmetic, yields the nanoseconds elapsed and updates the timekeeping data structures in the kernel.
For example, if you have an HPET timer which gives you a 10 MHz clock, the hardware counter gets updated at 100 ns intervals.
Let's say on the first clock->read() you get a counter value of X.
The Linux timekeeping data structures read this value X, get the difference 'D' compared to some old stored counter value, do the counter-difference 'D' to nanoseconds 'n' conversion arithmetic, and update the data structures by 'n'.
This new time value is yielded to user space.
When the second clock->read() is issued, it again reads the counter and updates the time.
Now, for an HPET timer, this counter gets updated every 100 ns, and hence you will see this difference being reported to user space.
Now, let's replace this HPET timer with a slow 32.768 kHz clock. The clock->read() counter is now updated only every 30517 ns (1 / 32768 Hz ≈ 30.5 µs), so if your second call to clock_gettime() falls within that period you will get 0 (which is the majority of cases); in some cases your second call lands after the counter has incremented by 1, i.e. 30517 ns have elapsed. Hence the value of 30517 ns sometimes.
Uncommented nanosleep() case:
Let's trace the clock_nanosleep() for monotonic clocks:
clock_nanosleep() -->nsleep --> common_nsleep() -->hrtimer_nanosleep() -->do_nanosleep()
do_nanosleep() will simply put the current task in the INTERRUPTIBLE state, wait for the timer to expire (which is 1 ns) and then set the current task to the RUNNING state again. You see, there are a lot of factors involved now, mainly when your kernel thread (and hence the user-space process) will be scheduled again. Depending on your OS, you will always face some latency when doing a context switch, and this is what we observe in the average values.
Now your questions:
What determines that minimum resolution?
I think the resolution/precision of your system will depend on the underlying timer hardware being used (assuming your OS is able to provide that precision to the user-space process).
Why is the resolution of the dev computer 30000X better than the Gumstix when the processor is only running ~2.6X faster?
Sorry, I missed you here. How is it 30000x faster? To me it looks like something around 200x faster (30517 ns / 150 ns ≈ 200x?). But anyway, as I understand it, CPU speed may or may not have anything to do with the timer resolution/precision. So this assumption may be right on some architectures (when you are using TSC hardware), though it might fail on others (using HPET, PIT, etc.).
Is there a way to change how often CLOCK_MONOTONIC is updated and where? In the kernel?
You can always look into the kernel code for details (that's how I looked into it).
In the Linux kernel code, look at these source files and documentation:
kernel/posix-timers.c
kernel/hrtimer.c
Documentation/timers/hrtimers.txt
I do not have a Gumstix on hand, but it looks like your clocksource is slow.
run:
$ dmesg | grep clocksource
If you get back
[ 0.560455] Switching to clocksource 32k_counter
This might explain why your clock is so slow.
In recent kernels there is a directory /sys/devices/system/clocksource/clocksource0 with two files: available_clocksource and current_clocksource. If you have this directory, try switching to a different source by echoing its name into the second file.
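If you prefer to check this from a program rather than the shell, a small C sketch along these lines prints both files (assuming the sysfs paths above exist on your kernel):
/* Print the available and currently selected clocksources via sysfs. */
#include <stdio.h>

static void dump(const char *label, const char *path){
    char line[256];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        printf("%s: (not available: %s)\n", label, path);
        return;
    }
    if (fgets(line, sizeof line, f))
        printf("%s: %s", label, line);
    fclose(f);
}

int main(void){
    dump("available", "/sys/devices/system/clocksource/clocksource0/available_clocksource");
    dump("current  ", "/sys/devices/system/clocksource/clocksource0/current_clocksource");
    return 0;
}
Writing one of the available names into current_clocksource (as root) performs the switch described above.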

Increasing CPU Utilization and keep it at a certain level using C code

I am writing C code (on Linux) that needs to consume a certain amount of CPU while it's running. I am carrying out an experiment in which I trigger certain actions upon reaching a certain CPU threshold. So, once the utilization reaches a certain threshold, I need to keep it at that level for, say, 30 seconds until I complete my experiments. I am monitoring the CPU utilization using the top command.
So my questions are -
1. How do I increase the CPU Utilization to a given value (in a deterministic way if possible)?
2. Once I get to the threshold, is there a way to keep it at that level for a pre-defined time?
Sample output of top command (the 9th column is CPU used by the 'top' process) -
19304 abcde 16 0 5448 1212 808 R 0.2 0.0 0:00.06 top
Similar to above, I will look at the line in top which shows the utilization of my binary.
Any help would be appreciated. Also, let me know if you need more details.
Thanks!
Edit:
The following lines of code allowed me to control CPU utilization quite well. In this case I have two options: keep it above 50% or keep it below 50%. After some trial and error, I settled on the given usleep values.
endwait = clock() + ( seconds * CLOCKS_PER_SEC );
while( clock() < endwait ) {}      /* busy-wait to burn CPU for 'seconds' */
if (cpu_utilization > 50)
    usleep(250000);
else
    usleep(700000);
Hope this helps!
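Building on the busy-wait/usleep idea above, here is a self-contained sketch of the same duty-cycle approach; the 30% target and the 10 ms period are arbitrary values picked for the example, not anything from the question:
/* Hold one core at roughly a target utilization using a busy/sleep duty cycle. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void){
    const double target   = 0.30;   /* aim for ~30% of one core (arbitrary) */
    const long period_us  = 10000;  /* 10 ms duty-cycle period (arbitrary)  */
    const long busy_us    = (long)(target * period_us);
    const int  seconds    = 30;     /* hold the load for 30 seconds         */

    time_t end = time(NULL) + seconds;
    while (time(NULL) < end) {
        clock_t spin_until = clock() + (clock_t)((double)busy_us * CLOCKS_PER_SEC / 1e6);
        while (clock() < spin_until)
            ;                          /* burn CPU for the busy part of the cycle */
        usleep(period_us - busy_us);   /* idle for the rest of the period */
    }
    return 0;
}
The busy fraction of each period roughly sets the utilization top reports for the process; adjust target (or the usleep values, as in the snippet above) until top shows the level you need.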
cpuburn is known to make CPU utilization so high that it raises the CPU temperature to its max level.
It seems there is no longer an official website for it, but you can still access the source code via the Debian package or Google Code.
It's implemented in asm, so you'll have to write some glue in order to interact with it from C.
Something of this sort should have a constant CPU utilization, in my opinion:
md5sum < /dev/urandom
