Perf fails at showing LLC-loads

Perf fails at showing LLC-loads - c

Although I have LLC-loadsand LLC-load-misses listed in perf list, if I run:
perf stat --repeat 2 -e cycles:u -e instructions:u -e LLC-loads -e LLC-load-misses binary arguments
on my Linux machine, I get 0 for both LLC-loadsand LLC-load-misses. I also tried perf record + perf report. Does anybody know why? The CPU is E5-2670 Sandy Bridge.

Related

What kind of program can benefit much from LTO?

When using dhrystone to get DMIPS, I found that LTO greatly impacted the results. LTO-dhrystone is nearly 4x LTO-less-dhrystone:
$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src
without LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 5234421.7
VAX MIPS rating = 2979.181
Performance counter stats for './dhrystone':
19,158.53 msec task-clock:u # 0.969 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
547 page-faults:u # 28.551 /sec
81,470,643,102 cycles:u # 4.252 GHz (50.01%)
3,046,747 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.02%)
37,208,106,969 stalled-cycles-backend:u # 45.67% backend cycles idle (50.00%)
319,848,969,156 instructions:u # 3.93 insn per cycle
# 0.12 stalled cycles per insn (49.99%)
49,311,879,609 branches:u # 2.574 G/sec (49.98%)
317,518 branch-misses:u # 0.00% of all branches (50.00%)
19.762244278 seconds time elapsed
19.118127000 seconds user
0.004017000 seconds sys
With LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 19539623.0
VAX MIPS rating = 11121.015
Performance counter stats for './dhrystone':
5,146.69 msec task-clock:u # 0.908 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
553 page-faults:u # 107.448 /sec
21,453,263,692 cycles:u # 4.168 GHz (50.00%)
1,574,543 stalled-cycles-frontend:u # 0.01% frontend cycles idle (50.03%)
12,575,396,819 stalled-cycles-backend:u # 58.62% backend cycles idle (50.04%)
89,186,371,586 instructions:u # 4.16 insn per cycle
# 0.14 stalled cycles per insn (50.00%)
7,717,732,872 branches:u # 1.500 G/sec (49.97%)
353,303 branch-misses:u # 0.00% of all branches (49.96%)
5.666446006 seconds time elapsed
5.133037000 seconds user
0.003322000 seconds sys
As you can see
LTO dhrystone DMIPS is 1953,9623.0 and LTO-less dhrystone is 523,4421.7
LTO dhrystone executes 89,186,371,586 instructions and LTO-less dhrystone executes 319,848,969,156
I think the root cause is that LTO reduces many instructions, so it can run much faster.
But When I run benchmarks like coremark/coremark-pro, LTO doesn't have notable improvement compared with non-LTO.
Qeustion
What kind of programs are more easily affected by LTO optimization? Why LTO has a big impact on dhrystone, but not on coremark/coremark-pro.
How does LTO reduce runtime instructions?

LTO allows cross-file inlining, so if you have tiny helper functions (like C++ get/set functions in classes) that aren't visible in a .h for inlining normally, LTO can greatly simplify code that does a lot of calling such functions.
A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), but a call/ret would need to pass an arg in a register, not to mention executing the actual bl and ret instructions. And would have to respect the calling convention, so the call-site might need to mov some values to call-preserved registers. But when inlining, the compiler has full control over all the registers.
For benchmarks, putting the work in a separate file from a repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations. (e.g. hoisting work out of loops instead of re-computing something every time.)
Unless you use LTO so it can break your benchmarks. (Or maybe there's another reason with dhrystone, IDK.)

How can I check integrity of a extracted zImage?

$ binwalk -e linux_image.img
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 Android bootimg, kernel size: 6897653 bytes, kernel addr: 0x81C08000, ramdisk size: 5959520 bytes, ramdisk addr: 0x81C08000, product name: ""
2048 0x800 Linux kernel ARM boot executable zImage (little-endian)
18479 0x482F gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00 (null date)
6761720 0x672CF8 device tree image (dtb)
6883304 0x6907E8 Unix path: /dev/block/platform/soc/7824900.sdhci/by-name/vendor
6899712 0x694800 gzip compressed data, maximum compression, has original file name: "rootfs.cpio", from Unix, last modified: 2019-04-06 00:42:26
9706949 0x941DC5 MySQL ISAM compressed data file Version 11
$ dd if=linux_image.img of=vmlinuz bs=1 skip=2048 count=6897653
$ file vmlinuz
vmlinuz: Linux kernel ARM boot executable zImage (little-endian)
$ dd if=vmlinuz bs=1 skip=$(LC_ALL=C grep -a -b -o $'\x1f\x8b\x08\x00\x00\x00\x00\x00' vmlinuz-3.18.66-perf | head -n 1 | cut -d ':' -f 1) | zcat | grep -a 'Linux version'
Linux version 3.18.66 (build#test) (gcc version 4.9.3 (GCC) ) #1 SMP PREEMPT Fri Apr 1 13:16:33 PDT 2018
Running 'qemu-system-arm.exe -machine vexpress-a9 -cpu cortex-a7 -smp 4 -kernel vmlinuz' blank screen

If you pull a random Arm Linux kernel (including Android) from somewhere and try to run it on anything other than the hardware that it is intended to boot on, the expected result is that it crashes very early in bootup without being able to output anything to screen or serial port, ie you get a black screen and nothing happens. The most likely situation here is that your image is fine and not corrupt, it's just not built to run on the vexpress-a9 board you're running it on.
In the unlikely event that this really is a kernel built for the vexpress-a9, the next problem you have is that you haven't passed QEMU a device tree blob via the -dtb option. Modern Linux kernels don't hardcode all the information about the boards they can run on, but instead expect the bootloader (which is QEMU in this case) to pass them a data file which provides information about where all the devices are for the board. If you don't do that, then the result is the same as above: kernel crashes very early in bootup without being able to output any information, so black screen.

Why is lua on host system slower than in the linux vm?

Comparing executing time of this Lua Script on a Macbook Air (Mac OS 10.9.4, i5-4250U (1.3GHz), 8GB RAM) to a VM (virtualbox) running Arch Linux.
Compiling Lua 5.2.3 in a Arch Linux virtualbox
First I've compiled lua by myself using clang, to compare it with the Mac OS X clang binary.
using tcc, gcc and clang
$ tcc *[^ca].c lgc.c lfunc.c lua.c -lm -o luatcc
$ gcc -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luagcc
/tmp/ccxAEYH8.o: In function `os_tmpname':
loslib.c:(.text+0x29c): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclang
/tmp/loslib-bd4ef4.o:loslib.c:function os_tmpname: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
clang version in VM
$ clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
compare the file size
$ ls -lh |grep lua
-rwxr-xr-x 1 markus markus 210K 20. Aug 18:21 luaclang
-rwxr-xr-x 1 markus markus 251K 20. Aug 18:22 luagcc
-rwxr-xr-x 1 markus markus 287K 20. Aug 18:22 luatcc
VM benchmarking
clang binary ~3.1 sec
$ time ./luaclang sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.124s
user 0m3.100s
sys 0m0.020s
gcc binary ~3.09 sec
$ time ./luagcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.090s
user 0m3.080s
sys 0m0.007s
tcc binary ~7.0 sec - no surprise here :)
$ time ./luatcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m7.071s
user 0m7.053s
sys 0m0.010s
Compiling on Mac OS X
Now compiling lua with the same clang command/options like in the VM.
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclangmac
loslib.c:108:3: warning: 'tmpnam' is deprecated: This function is provided for
compatibility reasons only. Due to security concerns inherent in the design of tmpnam(3),
it is highly recommended that you use mkstemp(3)
instead. [-Wdeprecated-declarations]
lua_tmpnam(buff, err);
^
loslib.c:57:33: note: expanded from macro 'lua_tmpnam'
#define lua_tmpnam(b,e) { e = (tmpnam(b) == NULL); }
^
/usr/include/stdio.h:274:7: note: 'tmpnam' declared here
char *tmpnam(char *);
^
1 warning generated.
clang version Mac OS X
I've tried two version. 3.4.2 and the one which is provided by xcode. The version 3.4.2 is a bit slower.
Markuss-MacBook-Air:bin markus$ ./clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-rc1)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
Markuss-MacBook-Air:bin markus$ clang --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
file size
$ ls -lh|grep lua
-rwxr-xr-x 1 markus staff 194K 20 Aug 18:26 luaclangmac
HOST benchmarking
clang binary ~4.3 sec
$ time ./luaclangmac sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m4.338s
user 0m4.264s
sys 0m0.062s
Why?
I would have expected that the host system is a little faster than the virtualization (or roughly the same speed). But not that the host system is reproducible slower.
So, any ideas or explanations?
Update 2014.10.30
Meanwhile I've installed Arch Linux nativly on my MBA. The benchmarks are as fast as in the Arch Linux VM.

Can you try to run 'perf stat' instead of 'time'. It provides you much more details and the time measurement is more correct, avoiding timing differences inside the VM.
Here is an example:
$ perf stat ls > /dev/null
Performance counter stats for 'ls':
23.348076 task-clock (msec) # 0.989 CPUs utilized
2 context-switches # 0.086 K/sec
0 cpu-migrations # 0.000 K/sec
93 page-faults # 0.004 M/sec
74,628,308 cycles # 3.196 GHz [65.75%]
740,755 stalled-cycles-frontend # 0.99% frontend cycles idle [48.66%]
29,200,738 stalled-cycles-backend # 39.13% backend cycles idle [60.02%]
80,592,001 instructions # 1.08 insns per cycle
# 0.36 stalled cycles per insn
17,746,633 branches # 760.090 M/sec [60.00%]
642,360 branch-misses # 3.62% of all branches [48.64%]
0.023609439 seconds time elapsed

My guess is that the HFS+ journaling feature is adding latency. This would be easy enough to test: If TimeMachine is running on the Macbook Air, you could try disabling it, and disable journaling on the filesystem (obviously you should back up first). As root:
diskutil disableJournal YourDiskVolume
I'd see if that's the cause of the problem. Then i would immediately re-enable journaling.
diskutil enableJournal YourDiskVolume
OS X 10.9.2 had a journaling-related bug that would hang the filesystem... this page explores this bug further, and even though the bug (#15821723) hasn't been reported as fixed, journaling reportedly no longer crashes the disk controller.

to test the speed of lua, instead of reading a file hard-code some sample data into the test script and loop over the lines over and over as necessary. Like others mentioned, the filesystem effects are going to outweigh any compiler differences.

Profiling sleep times with perf

I was looking for a way to find out where my program spends time. I read the perf tutorial and tried to profile sleep times as it is described there. I wrote the simplest possible program to profile:
#include <unistd.h>
int main() {
sleep(10);
return 0;
}
then I executed it with perf:
$ sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB /home/pablo/perf.data.raw (~578 samples) ]
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
$ sudo perf report --stdio --show-total-period -i ~/perf.data
Error:
The /home/pablo/perf.data file has no samples!
Does anybody know how to avoid these errors? What do they mean? failed to write feature 2 doesn't look too user-friendly...
Update:
$ uname -a
Linux debian 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux

There is a error message from your second perf command from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times - perf inject -s
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
failed to write feature 2 doesn't look too user-friendly...
... but it was added to perf to made errors more user-friendly: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian , 22 Sep 2011:
+static int do_write_feat(int fd, struct perf_header *h, int type, ....
+ pr_debug("failed to write feature %d\n", type);
All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13
15 HEADER_TRACING_DATA = 1,
16 HEADER_BUILD_ID,
So, it sounds like perf inject was not able to write information about build ids (error from function write_build_id() from util/header.c) if I'm not wrong. There are two cases which can lead to error: unsuccessful call to perf_session__read_build_ids() or failing in writing buildid table dsos__write_buildid_table (this is not our case because there is no "failed to write buildid table" error message; check write_build_id)
You may check, do you have all buildids needed for the session. Also it may be useful to clear your buildid cache (rm -rf ~/.debug), and check that you have up-to-date vmlinux with debugging info or kallsyms enabled in your kernel.
UPDATE: in comments Pavel says that his pref record had no any sched:sched_stat_sleep events written to perf.data:
sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
As he explains in his answer, his default debian kernel have CONFIG_SCHEDSTATS option disabled with vendor's patch. The redhat did the same thing with the option in release kernels since 3.11, and this is explained in Redhat Bug 1013225 (Josh Boyer 2013-10-28, comment 4):
We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.
We can turn this on in debug kernels again, but I'm not sure it's worthwhile.
Josh Poimboeuf 2013-11-04 in comment 8 says that performance impact is detectable:
In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.
Fedora disabled CONFIG_SCHEDSTAT in non-debug kernels at 12 July 2013 "[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones. First kernel with disabled option: 3.11.0-0.rc0.git6.4.
In order to use any perf software tracepoint event with name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait) we must recompile kernel with CONFIG_SCHEDSTATS option enabled and replace default Debian, RedHat or Fedora kernels which have no this option.
Thank you, Pavel Davydov.

I finally found out how to make it work. The problem was that the default debian kernel is built without some config options, that perf needs to be able to monitor sleep times. It looks like CONFIG_SCHEDSTATS should be enabled to make kernel collect scheduler statistics. This is told to have some runtime overhead. Also I enabled CONFIG_SCHED_TRACER and some lock tracing options, but I'm not sure if they matter in my case. Anyway, no statistic data is collected in scheduler without CONFIG_SCHEDSTATS (see kernel/sched/ directory of kernel source).
Also, there is a very good article about perf written by Brendan Gregg, with a lot of usefull examples and some kernel options that are needed to make perf work properly.
Update: I checked the history of CONFIG_SCHEDSTATS in debian. I've checked out debian kernel patches and build scripts repo:
svn checkout svn://svn.debian.org/svn/kernel/dists/trunk/linux/debian
And then found CONFIG_SCHEDSTATS option there
$ grep -R CONFIG_SCHEDSTAT config/
config/config:# CONFIG_SCHEDSTATS is not set
This string was added to the repo in commit 10837, on 2008-03-14, with comment "debian/config: Do complete reorganization". Also, in this and this (thanks to osgx) bug reports it is told that CONFIG_LATENCYTOP, CONFIG_SCHEDSTATS options are not enabled because they can affect kernel perfomance. So, I think it just was never switched on in default debian kernels. I haven't found the discussion about scheduler stats option, though. If I do, I will write back here.

This works for me for "perf version 3.11.1" on an "openSUSE 13.1 (x86_64)" box.
Here is the output if you care:
# ========
# captured on: Sun Feb 16 09:49:38 2014
# hostname : *****************
# os release : 3.11.10-7-desktop
# perf version : 3.11.1
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-3840QM CPU # 2.80GHz
# cpuid : GenuineIntel,6,58,9
# total memory : 32945368 kB
# cmdline : /usr/bin/perf inject -v -s -i perf.data.raw -o perf.data
# event : name = sched:sched_stat_sleep, type = 2, config = 0x48, config1 = 0x0, config2 = 0x
# event : name = sched:sched_switch, type = 2, config = 0x51, config1 = 0x0, config2 = 0x0, e
# event : name = sched:sched_process_exit, type = 2, config = 0x4e, config1 = 0x0, config2 =
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7,
# ========
#
# Samples: 0 of event 'sched:sched_stat_sleep'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
# Samples: 8 of event 'sched:sched_switch'
# Event count (approx.): 80099958776
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ................. .................
#
100.00% 80099958776 bla [kernel.kallsyms] [k] thread_return
|
--- thread_return
thread_return
do_nanosleep
hrtimer_nanosleep
SyS_nanosleep
system_call_fastpath
0x7fbc0dec6570
__GI___libc_nanosleep
(nil)
# Samples: 0 of event 'sched:sched_process_exit'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
}

Getting user-space stack information from perf

I'm currently trying to track down some phantom I/O in a PostgreSQL build I'm testing. It's a multi-process server and it isn't simple to associate disk I/O back to a particular back-end and query.
I thought Linux's perf tool would be ideal for this, but I'm struggling to capture block I/O performance counter metrics and associate them with user-space activity.
It's easy to record block I/O requests and completions with, eg:
sudo perf record -g -T -u postgres -e 'block:block_rq_*'
and the user-space pid is recorded, but there's no kernel or user-space stack captured, or ability to snapshot bits of the user-space process's heap (say, query text) etc. So while you have the pid, you don't know what the process was doing at that point. Just perf script output like:
postgres 7462 [002] 301125.113632: block:block_rq_issue: 8,0 W 0 () 208078848 + 1024 [postgres]
If I add the -g flag to perf record it'll take snapshots of the kernel stack, but doesn't capture user-space state for perf events captured in the kernel. The user-space stack only goes up to the entry-point from userspace, like LWLockRelease, LWLockAcquire, memcpy (mmap'd IO), __GI___libc_write, etc.
So. Any tips? Being able to capture a snapshot of the user-space stack in response to kernel events would be ideal.
I'm on Fedora 19, 3.11.3-201.fc19.x86_64, Schrödinger’s Cat, with perf version 3.10.9-200.fc19.x86_64.

OK, looks like there are several parts to this:
I'm on x86_64, where most distros build with -fomit-frame-pointer by default, and perf can't follow the stack without frame pointers;
.... unless it's a newer version built with libunwind support, in which case it supports perf record -g dwarf.
See:
the patch adding libunwind support to Perf
Debian bug 725075.
linux perf: how to interpret and find hotspots
I'm on Fedora 18, but the same issue applies. So if you're profiling code you're working on (as is likely on Stack Overflow), rebuild with -fno-omit-frame-pointer and -ggdb.
I landed up rebuilding perf because I wanted to be able to compare to the stock RPMs:
sudo yum build-dep perf
sudo yum install yum-utils rpmdevtools libunwind-devel
yumdownloader --source perf or download the appropriate kernel-.....src.rpm srpm
rpmdev-setuptree
rpm -Uvh kernel-*.src.rpm
cd $HOME/rpmbuild/SPECS
rpmbuild -bp --target=$(uname -m) kernel.spec
At this point you can just build a new perf if you want:
cd $HOME/rpmbuild/BUILD/kernel-*/linux-*/tools/perf
make
... which I did and tested that the updated perf does in fact capture a useful stack if built with libunwind available.
You can also build a new rpm:
edit kernel.spec, uncomment the line %define buildid ..., change buildid to something like .perfunwind. Note it's %define not % define.
In the same spec file, find:
%global perf_make \
make %{?_smp_mflags} -C tools/perf -s V=1 WERROR=0 NO_LIBUNWIND=1 HAVE_CPLUS_DEMANGLE=1 NO_GTK2=1 NO_LIBNUMA=1 NO_STRLCPY=1 prefix=%{_prefix}
and delete NO_LIBUNWIND=1
rpmbuild -bb --without up --without mp --without pae --without debug --without doc --without headers --without debuginfo --without bootwrapper --without with_vdso_install --with perf kernel.spec to produce new perf RPMs without building the whole kernel. Or if you want, omit the --without for the kernel flavour you want, in which case you'll also want to build headers, debuginfo, etc.
sudo rpm -Uvh $HOME/rpmbuild/RPMS/x86_64/perf-*.fc19.x86_64.rpm
See the fedora project guide on building a custom kernel.
I've reported the issue to Fedora; they shouldn't be using NO_LIBUNWIND=1. See bug 1025603.
Once you have a rebuilt perf you can use perf record -g dwarf to get full stacks.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight