What kind of program can benefit much from LTO? - linker

When using dhrystone to get DMIPS, I found that LTO greatly affects the results. The LTO build is nearly 4x faster than the build without LTO:
$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src
Without LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 5234421.7
VAX MIPS rating = 2979.181
Performance counter stats for './dhrystone':
19,158.53 msec task-clock:u # 0.969 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
547 page-faults:u # 28.551 /sec
81,470,643,102 cycles:u # 4.252 GHz (50.01%)
3,046,747 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.02%)
37,208,106,969 stalled-cycles-backend:u # 45.67% backend cycles idle (50.00%)
319,848,969,156 instructions:u # 3.93 insn per cycle
# 0.12 stalled cycles per insn (49.99%)
49,311,879,609 branches:u # 2.574 G/sec (49.98%)
317,518 branch-misses:u # 0.00% of all branches (50.00%)
19.762244278 seconds time elapsed
19.118127000 seconds user
0.004017000 seconds sys
With LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 19539623.0
VAX MIPS rating = 11121.015
Performance counter stats for './dhrystone':
5,146.69 msec task-clock:u # 0.908 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
553 page-faults:u # 107.448 /sec
21,453,263,692 cycles:u # 4.168 GHz (50.00%)
1,574,543 stalled-cycles-frontend:u # 0.01% frontend cycles idle (50.03%)
12,575,396,819 stalled-cycles-backend:u # 58.62% backend cycles idle (50.04%)
89,186,371,586 instructions:u # 4.16 insn per cycle
# 0.14 stalled cycles per insn (50.00%)
7,717,732,872 branches:u # 1.500 G/sec (49.97%)
353,303 branch-misses:u # 0.00% of all branches (49.96%)
5.666446006 seconds time elapsed
5.133037000 seconds user
0.003322000 seconds sys
As you can see
With LTO, dhrystone reports 19,539,623.0 Dhrystones per second; without LTO it reports 5,234,421.7.
With LTO, dhrystone executes 89,186,371,586 instructions; without LTO it executes 319,848,969,156.
I think the root cause is that LTO eliminates many instructions, so the program runs much faster.
But when I run benchmarks like coremark/coremark-pro, LTO gives no notable improvement over non-LTO.
Question
What kinds of programs benefit most from LTO? Why does LTO have a big impact on dhrystone, but not on coremark/coremark-pro?
How does LTO reduce the number of instructions executed at run time?

LTO allows cross-file inlining, so if you have tiny helper functions (like C++ get/set functions in classes) that aren't visible in a .h for normal inlining, LTO can greatly simplify code that calls such functions a lot.
A simple get or set wrapper can inline to zero instructions (with the object data just living in registers), whereas a real call has to pass an argument in a register and execute the actual bl and ret instructions. The call also has to respect the calling convention, so the call site might need to mov some values to call-preserved registers. When inlining, the compiler has full control over all the registers.
For benchmarks, putting the work in a separate file from the repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations (e.g. hoisting work out of the loop instead of re-computing something every time).
Unless you use LTO, in which case the compiler can see across the file boundary and break your benchmark anyway. (Or maybe there's another reason with dhrystone, IDK.)

Related

Report shows "no time accumulated" for gprof using Eclipse CDT

After compiling with the flags -O0 -p -pg -Wall -c on GCC and -p -pg on the MinGW linker, the Eclipse gprof plugin shows no results. I then ran gprof my.exe gmon.out > prof.txt from cmd, which produced a report with only the number of calls to each function.
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00    16000     0.00     0.00  vector_norm
  0.00      0.00     0.00       16     0.00     0.00  rbf_kernel
  0.00      0.00     0.00        8     0.00     0.00  lubksb
I came across this topic: gprof reports no time accumulated. But my program terminates cleanly. I also found: gprof view show no data on MingW/Windows, but I am using 32-bit GCC. I previously tried Cygwin, with the same result.
I am using eclipse Kepler with CDT version 8.3.0.201402142303 and MinGW with GCC 5.4.0.
Any help is appreciated, thank you in advance.
Sorry for the question; it seems that the code runs faster than gprof can measure.
As my application involves training a neural network over several iterations and then testing kernels, I didn't suspect that fast code could be causing the problem. I inserted a long loop in the main body and the gprof time was printed.

Error when running DPDK app on valgrind

When I run my DPDK-based app under valgrind, it fails to start and prints this error:
ERROR: This system does not support "RDRAND". Please check that
RTE_MACHINE is set correctly.
My CPU supports RDRAND, still it is throwing the same error. For valgrind to support hugepages which are being used by my app, I'm using the following patched version of valgrind.
https://github.com/bisdn/valgrind-hugepages.git
I had this same problem on a Haswell CPU and was able to fix it by modifying one of the makefiles to remove a handful of options: specifically AVX/AVX2, RDRAND, FSGSBASE, and F16C. You might need to remove other options that valgrind balks at and recompile DPDK; it was an iterative process for me. There's probably a more elegant way to do this using the .config file, but I didn't find it. See this patch:
diff -u dpdk-2.2.0-orig/mk/rte.cpuflags.mk dpdk-2.2.0/mk/rte.cpuflags.mk
--- dpdk-2.2.0-orig/mk/rte.cpuflags.mk	2015-12-15 12:06:58.000000000 -0500
+++ dpdk-2.2.0/mk/rte.cpuflags.mk	2016-08-24 08:53:22.911258783 -0400
@@ -69,26 +69,6 @@
CPUFLAGS += PCLMULQDQ
endif
-ifneq ($(filter $(AUTO_CPUFLAGS),__AVX__),)
-CPUFLAGS += AVX
-endif
-
-ifneq ($(filter $(AUTO_CPUFLAGS),__RDRND__),)
-CPUFLAGS += RDRAND
-endif
-
-ifneq ($(filter $(AUTO_CPUFLAGS),__FSGSBASE__),)
-CPUFLAGS += FSGSBASE
-endif
-
-ifneq ($(filter $(AUTO_CPUFLAGS),__F16C__),)
-CPUFLAGS += F16C
-endif
-
-ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
-CPUFLAGS += AVX2
-endif
-
# IBM Power CPU flags
ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
CPUFLAGS += PPC64
RDRAND was introduced on Ivy Bridge. You can build DPDK for a specific subset of instructions using CONFIG_RTE_MACHINE; for this case you can use Sandy Bridge as the machine.
Modify $RTE_SDK/$RTE_TARGET/.config, set CONFIG_RTE_MACHINE="snb", and rebuild the DPDK library (make -C $RTE_SDK/$RTE_TARGET).
I found another solution to this problem. DPDK supports an EXTRA_CFLAGS variable, which you can use to pass your own flags to GCC. The initial makefile runs gcc with the options -dN -E to check what the platform supports. If you want to disable some instruction sets, e.g. RDRAND, you can specify the option
export EXTRA_CFLAGS=-mno-rdrnd
and this will disable RDRAND in built DPDK library binaries.

Perf fails at showing LLC-loads

Although I have LLC-loads and LLC-load-misses listed in perf list, if I run:
perf stat --repeat 2 -e cycles:u -e instructions:u -e LLC-loads -e LLC-load-misses binary arguments
on my Linux machine, I get 0 for both LLC-loads and LLC-load-misses. I also tried perf record + perf report. Does anybody know why? The CPU is an E5-2670 (Sandy Bridge).

Why is lua on host system slower than in the linux vm?

I am comparing the execution time of this Lua script on a MacBook Air (Mac OS 10.9.4, i5-4250U (1.3GHz), 8GB RAM) with a VM (VirtualBox) running Arch Linux.
Compiling Lua 5.2.3 in an Arch Linux VirtualBox VM
First I compiled Lua myself using clang, to compare it with the Mac OS X clang binary.
Using tcc, gcc and clang
$ tcc *[^ca].c lgc.c lfunc.c lua.c -lm -o luatcc
$ gcc -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luagcc
/tmp/ccxAEYH8.o: In function `os_tmpname':
loslib.c:(.text+0x29c): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclang
/tmp/loslib-bd4ef4.o:loslib.c:function os_tmpname: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
clang version in VM
$ clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
compare the file size
$ ls -lh |grep lua
-rwxr-xr-x 1 markus markus 210K 20. Aug 18:21 luaclang
-rwxr-xr-x 1 markus markus 251K 20. Aug 18:22 luagcc
-rwxr-xr-x 1 markus markus 287K 20. Aug 18:22 luatcc
VM benchmarking
clang binary ~3.1 sec
$ time ./luaclang sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.124s
user 0m3.100s
sys 0m0.020s
gcc binary ~3.09 sec
$ time ./luagcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.090s
user 0m3.080s
sys 0m0.007s
tcc binary ~7.0 sec - no surprise here :)
$ time ./luatcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m7.071s
user 0m7.053s
sys 0m0.010s
Compiling on Mac OS X
Now compiling Lua with the same clang command and options as in the VM.
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclangmac
loslib.c:108:3: warning: 'tmpnam' is deprecated: This function is provided for
compatibility reasons only. Due to security concerns inherent in the design of tmpnam(3),
it is highly recommended that you use mkstemp(3)
instead. [-Wdeprecated-declarations]
lua_tmpnam(buff, err);
^
loslib.c:57:33: note: expanded from macro 'lua_tmpnam'
#define lua_tmpnam(b,e) { e = (tmpnam(b) == NULL); }
^
/usr/include/stdio.h:274:7: note: 'tmpnam' declared here
char *tmpnam(char *);
^
1 warning generated.
clang version Mac OS X
I've tried two versions: 3.4.2 and the one provided by Xcode. Version 3.4.2 is a bit slower.
Markuss-MacBook-Air:bin markus$ ./clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-rc1)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
Markuss-MacBook-Air:bin markus$ clang --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
file size
$ ls -lh|grep lua
-rwxr-xr-x 1 markus staff 194K 20 Aug 18:26 luaclangmac
HOST benchmarking
clang binary ~4.3 sec
$ time ./luaclangmac sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m4.338s
user 0m4.264s
sys 0m0.062s
Why?
I would have expected the host system to be a little faster than the virtualized one (or roughly the same speed), not reproducibly slower.
So, any ideas or explanations?
Update 2014.10.30
Meanwhile I've installed Arch Linux natively on my MBA. The benchmarks are as fast as in the Arch Linux VM.
Can you try running 'perf stat' instead of 'time'? It gives you much more detail, and its time measurement is more accurate, avoiding timing differences inside the VM.
Here is an example:
$ perf stat ls > /dev/null
Performance counter stats for 'ls':
23.348076 task-clock (msec) # 0.989 CPUs utilized
2 context-switches # 0.086 K/sec
0 cpu-migrations # 0.000 K/sec
93 page-faults # 0.004 M/sec
74,628,308 cycles # 3.196 GHz [65.75%]
740,755 stalled-cycles-frontend # 0.99% frontend cycles idle [48.66%]
29,200,738 stalled-cycles-backend # 39.13% backend cycles idle [60.02%]
80,592,001 instructions # 1.08 insns per cycle
# 0.36 stalled cycles per insn
17,746,633 branches # 760.090 M/sec [60.00%]
642,360 branch-misses # 3.62% of all branches [48.64%]
0.023609439 seconds time elapsed
My guess is that the HFS+ journaling feature is adding latency. This is easy enough to test: if Time Machine is running on the MacBook Air, you could try disabling it, and disable journaling on the filesystem (obviously you should back up first). As root:
diskutil disableJournal YourDiskVolume
I'd check whether that's the cause of the problem, then immediately re-enable journaling:
diskutil enableJournal YourDiskVolume
OS X 10.9.2 had a journaling-related bug that would hang the filesystem... this page explores this bug further, and even though the bug (#15821723) hasn't been reported as fixed, journaling reportedly no longer crashes the disk controller.
To test the speed of Lua itself, hard-code some sample data into the test script instead of reading a file, and loop over the lines as many times as necessary. As others mentioned, filesystem effects are going to outweigh any compiler differences.

Profiling sleep times with perf

I was looking for a way to find out where my program spends time. I read the perf tutorial and tried to profile sleep times as it is described there. I wrote the simplest possible program to profile:
#include <unistd.h>
int main() {
    sleep(10);
    return 0;
}
then I executed it with perf:
$ sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB /home/pablo/perf.data.raw (~578 samples) ]
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
$ sudo perf report --stdio --show-total-period -i ~/perf.data
Error:
The /home/pablo/perf.data file has no samples!
Does anybody know how to avoid these errors? What do they mean? failed to write feature 2 doesn't look too user-friendly...
Update:
$ uname -a
Linux debian 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux
There is an error message from your second perf command (perf inject -s) from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times:
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
failed to write feature 2 doesn't look too user-friendly...
... but it was added to perf to make errors more user-friendly: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian, 22 Sep 2011:
+static int do_write_feat(int fd, struct perf_header *h, int type, ....
+ pr_debug("failed to write feature %d\n", type);
All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13
15 HEADER_TRACING_DATA = 1,
16 HEADER_BUILD_ID,
So it sounds like perf inject was not able to write the build-id information (the error comes from write_build_id() in util/header.c), if I'm not wrong. Two cases can lead to this error: an unsuccessful call to perf_session__read_build_ids(), or a failure to write the build-id table in dsos__write_buildid_table (not our case, because there is no "failed to write buildid table" error message; check write_build_id).
You may want to check whether you have all the build ids needed for the session. It may also help to clear your buildid cache (rm -rf ~/.debug), and to check that you have an up-to-date vmlinux with debugging info, or kallsyms enabled in your kernel.
UPDATE: in the comments Pavel says that his perf record had not written any sched:sched_stat_sleep events to perf.data:
sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
As he explains in his answer, his default Debian kernel has the CONFIG_SCHEDSTATS option disabled by a vendor patch. Red Hat has done the same thing with this option in release kernels since 3.11, as explained in Red Hat Bug 1013225 (Josh Boyer 2013-10-28, comment 4):
We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.
We can turn this on in debug kernels again, but I'm not sure it's worthwhile.
Josh Poimboeuf (2013-11-04, comment 8) says that the performance impact is detectable:
In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.
Fedora disabled CONFIG_SCHEDSTATS in non-debug kernels on 12 July 2013 ("[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones). The first kernel with the option disabled: 3.11.0-0.rc0.git6.4.
To use any perf software tracepoint event with a name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait), we must recompile the kernel with the CONFIG_SCHEDSTATS option enabled and replace the default Debian, Red Hat or Fedora kernel, which lacks this option.
Thank you, Pavel Davydov.
I finally found out how to make it work. The problem was that the default Debian kernel is built without some config options that perf needs in order to monitor sleep times. It looks like CONFIG_SCHEDSTATS must be enabled for the kernel to collect scheduler statistics. This is said to have some runtime overhead. I also enabled CONFIG_SCHED_TRACER and some lock tracing options, but I'm not sure whether they matter in my case. Anyway, no statistics are collected in the scheduler without CONFIG_SCHEDSTATS (see the kernel/sched/ directory of the kernel source).
Also, there is a very good article about perf written by Brendan Gregg, with a lot of useful examples and some kernel options that are needed to make perf work properly.
Update: I checked the history of CONFIG_SCHEDSTATS in debian. I've checked out debian kernel patches and build scripts repo:
svn checkout svn://svn.debian.org/svn/kernel/dists/trunk/linux/debian
And then found CONFIG_SCHEDSTATS option there
$ grep -R CONFIG_SCHEDSTAT config/
config/config:# CONFIG_SCHEDSTATS is not set
This string was added to the repo in commit 10837, on 2008-03-14, with the comment "debian/config: Do complete reorganization". Also, this and this bug report (thanks to osgx) state that the CONFIG_LATENCYTOP and CONFIG_SCHEDSTATS options are not enabled because they can affect kernel performance. So I think it was simply never switched on in default Debian kernels. I haven't found the discussion about the scheduler stats option, though. If I do, I will write back here.
This works for me for "perf version 3.11.1" on an "openSUSE 13.1 (x86_64)" box.
Here is the output if you care:
# ========
# captured on: Sun Feb 16 09:49:38 2014
# hostname : *****************
# os release : 3.11.10-7-desktop
# perf version : 3.11.1
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
# cpuid : GenuineIntel,6,58,9
# total memory : 32945368 kB
# cmdline : /usr/bin/perf inject -v -s -i perf.data.raw -o perf.data
# event : name = sched:sched_stat_sleep, type = 2, config = 0x48, config1 = 0x0, config2 = 0x
# event : name = sched:sched_switch, type = 2, config = 0x51, config1 = 0x0, config2 = 0x0, e
# event : name = sched:sched_process_exit, type = 2, config = 0x4e, config1 = 0x0, config2 =
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7,
# ========
#
# Samples: 0 of event 'sched:sched_stat_sleep'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
# Samples: 8 of event 'sched:sched_switch'
# Event count (approx.): 80099958776
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ................. .................
#
100.00% 80099958776 bla [kernel.kallsyms] [k] thread_return
|
--- thread_return
thread_return
do_nanosleep
hrtimer_nanosleep
SyS_nanosleep
system_call_fastpath
0x7fbc0dec6570
__GI___libc_nanosleep
(nil)
# Samples: 0 of event 'sched:sched_process_exit'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
