Why is Lua on the host system slower than in the Linux VM?

Comparing the execution time of this Lua script on a MacBook Air (Mac OS 10.9.4, i5-4250U (1.3 GHz), 8 GB RAM) and in a VM (VirtualBox) running Arch Linux.
Compiling Lua 5.2.3 in an Arch Linux VirtualBox VM
First I compiled Lua myself using clang, to compare it with the Mac OS X clang binary.
Using tcc, gcc and clang
$ tcc *[^ca].c lgc.c lfunc.c lua.c -lm -o luatcc
$ gcc -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luagcc
/tmp/ccxAEYH8.o: In function `os_tmpname':
loslib.c:(.text+0x29c): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclang
/tmp/loslib-bd4ef4.o:loslib.c:function os_tmpname: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
clang version in VM
$ clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
compare the file size
$ ls -lh |grep lua
-rwxr-xr-x 1 markus markus 210K 20. Aug 18:21 luaclang
-rwxr-xr-x 1 markus markus 251K 20. Aug 18:22 luagcc
-rwxr-xr-x 1 markus markus 287K 20. Aug 18:22 luatcc
VM benchmarking
clang binary ~3.1 sec
$ time ./luaclang sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.124s
user 0m3.100s
sys 0m0.020s
gcc binary ~3.09 sec
$ time ./luagcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.090s
user 0m3.080s
sys 0m0.007s
tcc binary ~7.0 sec - no surprise here :)
$ time ./luatcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m7.071s
user 0m7.053s
sys 0m0.010s
Compiling on Mac OS X
Now compiling Lua with the same clang command/options as in the VM.
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclangmac
loslib.c:108:3: warning: 'tmpnam' is deprecated: This function is provided for
compatibility reasons only. Due to security concerns inherent in the design of tmpnam(3),
it is highly recommended that you use mkstemp(3)
instead. [-Wdeprecated-declarations]
lua_tmpnam(buff, err);
^
loslib.c:57:33: note: expanded from macro 'lua_tmpnam'
#define lua_tmpnam(b,e) { e = (tmpnam(b) == NULL); }
^
/usr/include/stdio.h:274:7: note: 'tmpnam' declared here
char *tmpnam(char *);
^
1 warning generated.
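(Aside: the tmpnam warning is harmless for this benchmark. For reference, a hedged sketch of what a mkstemp-based replacement for the lua_tmpnam macro shown above could look like; this is an illustration only, not the upstream fix, and the buffer size and /tmp template are assumptions:)
/* needs <stdlib.h>, <string.h> and <unistd.h> */
#define LUA_TMPNAMBUFSIZE 32
#define lua_tmpnam(b,e) { \
    strcpy(b, "/tmp/lua_XXXXXX"); \
    e = mkstemp(b);               /* creates the file and returns an fd */ \
    if (e != -1) close(e);        /* only the unique name is needed here */ \
    e = (e == -1); }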
clang version on Mac OS X
I've tried two versions: 3.4.2 and the one provided by Xcode. Version 3.4.2 is a bit slower.
Markuss-MacBook-Air:bin markus$ ./clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-rc1)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
Markuss-MacBook-Air:bin markus$ clang --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
file size
$ ls -lh|grep lua
-rwxr-xr-x 1 markus staff 194K 20 Aug 18:26 luaclangmac
HOST benchmarking
clang binary ~4.3 sec
$ time ./luaclangmac sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m4.338s
user 0m4.264s
sys 0m0.062s
Why?
I would have expected the host system to be a little faster than the virtualized one (or roughly the same speed), but not that the host system would be reproducibly slower.
So, any ideas or explanations?
Update 2014.10.30
Meanwhile I've installed Arch Linux natively on my MBA. The benchmarks are as fast as in the Arch Linux VM.

Can you try running 'perf stat' instead of 'time'? It gives you much more detail, and the time measurement is more accurate, avoiding timing distortions inside the VM.
Here is an example:
$ perf stat ls > /dev/null
Performance counter stats for 'ls':
23.348076 task-clock (msec) # 0.989 CPUs utilized
2 context-switches # 0.086 K/sec
0 cpu-migrations # 0.000 K/sec
93 page-faults # 0.004 M/sec
74,628,308 cycles # 3.196 GHz [65.75%]
740,755 stalled-cycles-frontend # 0.99% frontend cycles idle [48.66%]
29,200,738 stalled-cycles-backend # 39.13% backend cycles idle [60.02%]
80,592,001 instructions # 1.08 insns per cycle
# 0.36 stalled cycles per insn
17,746,633 branches # 760.090 M/sec [60.00%]
642,360 branch-misses # 3.62% of all branches [48.64%]
0.023609439 seconds time elapsed

My guess is that the HFS+ journaling feature is adding latency. This would be easy enough to test: if Time Machine is running on the MacBook Air, you could try disabling it, and then disable journaling on the filesystem (obviously you should back up first). As root:
diskutil disableJournal YourDiskVolume
I'd see if that's the cause of the problem, then immediately re-enable journaling:
diskutil enableJournal YourDiskVolume
OS X 10.9.2 had a journaling-related bug that would hang the filesystem... this page explores the bug further, and even though the bug (#15821723) hasn't been reported as fixed, journaling reportedly no longer crashes the disk controller.

To test the speed of Lua itself, hard-code some sample data into the test script instead of reading a file, and loop over those lines as many times as necessary. As others mentioned, filesystem effects are going to outweigh any compiler differences.

Related

What kind of program can benefit the most from LTO?

When using dhrystone to measure DMIPS, I found that LTO greatly impacted the results: the LTO build of dhrystone is nearly 4x faster than the non-LTO build:
$ wget http://www.xanthos.se/~joachim/dhrystone-src.tar.gz
$ cd dhrystone-src
Without LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone # use qemu-user to execute
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 5234421.7
VAX MIPS rating = 2979.181
Performance counter stats for './dhrystone':
19,158.53 msec task-clock:u # 0.969 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
547 page-faults:u # 28.551 /sec
81,470,643,102 cycles:u # 4.252 GHz (50.01%)
3,046,747 stalled-cycles-frontend:u # 0.00% frontend cycles idle (50.02%)
37,208,106,969 stalled-cycles-backend:u # 45.67% backend cycles idle (50.00%)
319,848,969,156 instructions:u # 3.93 insn per cycle
# 0.12 stalled cycles per insn (49.99%)
49,311,879,609 branches:u # 2.574 G/sec (49.98%)
317,518 branch-misses:u # 0.00% of all branches (50.00%)
19.762244278 seconds time elapsed
19.118127000 seconds user
0.004017000 seconds sys
With LTO
$ aarch64-linux-gnu-gcc -O3 -funroll-all-loops --param max-inline-insns-auto=550 -static dhry21a.c dhry21b.c timers.c -o dhrystone -flto
$ perf stat ./dhrystone # input 100000000
...
Register option selected? YES
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 19539623.0
VAX MIPS rating = 11121.015
Performance counter stats for './dhrystone':
5,146.69 msec task-clock:u # 0.908 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
553 page-faults:u # 107.448 /sec
21,453,263,692 cycles:u # 4.168 GHz (50.00%)
1,574,543 stalled-cycles-frontend:u # 0.01% frontend cycles idle (50.03%)
12,575,396,819 stalled-cycles-backend:u # 58.62% backend cycles idle (50.04%)
89,186,371,586 instructions:u # 4.16 insn per cycle
# 0.14 stalled cycles per insn (50.00%)
7,717,732,872 branches:u # 1.500 G/sec (49.97%)
353,303 branch-misses:u # 0.00% of all branches (49.96%)
5.666446006 seconds time elapsed
5.133037000 seconds user
0.003322000 seconds sys
As you can see:
The LTO dhrystone scores 19,539,623.0 Dhrystones per second (VAX MIPS rating 11121) versus 5,234,421.7 (VAX MIPS rating 2979) without LTO.
The LTO dhrystone executes 89,186,371,586 instructions, while the LTO-less dhrystone executes 319,848,969,156.
I think the root cause is that LTO eliminates many instructions (about 3.6x fewer here, which matches the ~3.7x speedup), so it can run much faster.
But when I run benchmarks like coremark/coremark-pro, LTO shows no notable improvement over non-LTO.
Question
What kinds of programs are most affected by LTO optimization? Why does LTO have a big impact on dhrystone, but not on coremark/coremark-pro?
How does LTO reduce the number of instructions executed at runtime?
LTO allows cross-file inlining, so if you have tiny helper functions (like C++ getters/setters in classes) that aren't visible in a .h for normal inlining, LTO can greatly simplify code that calls such functions a lot.
A simple get or set wrapper can inline to zero instructions (with the object's data just living in registers), whereas a call would need to pass an argument in a register, not to mention executing the actual bl and ret instructions. The caller would also have to respect the calling convention, so the call site might need to mov some values into call-preserved registers. When inlining, the compiler has full control over all the registers.
For benchmarks, putting the work in a separate file from the repeat loop is a good way of stopping compilers from defeating the benchmark by optimizing across repeat-loop iterations (e.g. hoisting work out of loops instead of re-computing something every time).
Unless you use LTO, in which case it can break your benchmark that way. (Or maybe there's another reason with dhrystone, I don't know.)
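A minimal sketch of the effect (hypothetical file and function names; with gcc, build once as gcc -O3 main.c get.c -o nolto and once with -flto added):
/* get.c -- tiny helper in its own translation unit */
static int counter;                      /* file-scope state */
void bump(void)        { counter++; }
int  get_counter(void) { return counter; }

/* main.c -- the hot loop lives in a different file */
#include <stdio.h>
void bump(void);
int  get_counter(void);
int main(void) {
    for (long i = 0; i < 100000000; i++)
        bump();                  /* without LTO: a real call per iteration */
    printf("%d\n", get_counter());
    return 0;                    /* with -flto the compiler can inline bump()
                                    and may collapse the loop to a single store */
}
If a benchmark's hot code is already in one translation unit, or already visible for inlining, LTO has much less to win, which may be one reason coremark/coremark-pro show little difference.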

How can I check the integrity of an extracted zImage?

$ binwalk -e linux_image.img
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 Android bootimg, kernel size: 6897653 bytes, kernel addr: 0x81C08000, ramdisk size: 5959520 bytes, ramdisk addr: 0x81C08000, product name: ""
2048 0x800 Linux kernel ARM boot executable zImage (little-endian)
18479 0x482F gzip compressed data, maximum compression, from Unix, last modified: 1970-01-01 00:00:00 (null date)
6761720 0x672CF8 device tree image (dtb)
6883304 0x6907E8 Unix path: /dev/block/platform/soc/7824900.sdhci/by-name/vendor
6899712 0x694800 gzip compressed data, maximum compression, has original file name: "rootfs.cpio", from Unix, last modified: 2019-04-06 00:42:26
9706949 0x941DC5 MySQL ISAM compressed data file Version 11
$ dd if=linux_image.img of=vmlinuz bs=1 skip=2048 count=6897653
$ file vmlinuz
vmlinuz: Linux kernel ARM boot executable zImage (little-endian)
$ dd if=vmlinuz bs=1 skip=$(LC_ALL=C grep -a -b -o $'\x1f\x8b\x08\x00\x00\x00\x00\x00' vmlinuz-3.18.66-perf | head -n 1 | cut -d ':' -f 1) | zcat | grep -a 'Linux version'
Linux version 3.18.66 (build#test) (gcc version 4.9.3 (GCC) ) #1 SMP PREEMPT Fri Apr 1 13:16:33 PDT 2018
Running 'qemu-system-arm.exe -machine vexpress-a9 -cpu cortex-a7 -smp 4 -kernel vmlinuz' gives a blank screen.
If you pull a random Arm Linux kernel (including Android) from somewhere and try to run it on anything other than the hardware it is intended to boot on, the expected result is that it crashes very early in bootup without being able to output anything to the screen or serial port, i.e. you get a black screen and nothing happens. The most likely situation here is that your image is fine and not corrupt; it's just not built to run on the vexpress-a9 board you're running it on.
In the unlikely event that this really is a kernel built for the vexpress-a9, the next problem you have is that you haven't passed QEMU a device tree blob via the -dtb option. Modern Linux kernels don't hardcode all the information about the boards they can run on, but instead expect the bootloader (which is QEMU in this case) to pass them a data file which provides information about where all the devices are for the board. If you don't do that, then the result is the same as above: kernel crashes very early in bootup without being able to output any information, so black screen.
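If you also want a structural sanity check of the extracted file itself (independent of whether it boots), the ARM zImage header conventionally carries a magic word at offset 0x24 followed by the image start and end offsets. A rough sketch, assuming a little-endian zImage and host (file name and details are illustrative):
/* zimage-check.c -- hedged sanity check of an extracted ARM zImage header */
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s zImage\n", argv[0]); return 2; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 2; }
    uint32_t hdr[3];                          /* magic, start, end */
    if (fseek(f, 0x24, SEEK_SET) || fread(hdr, 4, 3, f) != 3) {
        fprintf(stderr, "file too short to be a zImage\n"); return 1;
    }
    fseek(f, 0, SEEK_END);
    long filesize = ftell(f);
    fclose(f);
    if (hdr[0] != 0x016f2818) {               /* conventional zImage magic */
        fprintf(stderr, "bad magic 0x%08x\n", (unsigned)hdr[0]); return 1;
    }
    printf("header says %u bytes, file is %ld bytes\n",
           (unsigned)(hdr[2] - hdr[1]), filesize);
    /* the file may be slightly larger (e.g. appended DTB), but never smaller */
    return (long)(hdr[2] - hdr[1]) > filesize;
}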

Perf stack backtraces on armv7l

I'm trying to use perf to get information about stack backtraces on my system.
I compile an application where main calls f, f calls g1, g1 calls g3, g3 calls g4, and g4 calls g2.
I expect my backtraces to be something like
g2
g4
g3
g1
f
main
But instead, I have cropped backtraces in perf script, like
a.out 2869 [000] 19414.348571: 225426 cycles:ppp:
7ac f (/opt/usr/home/owner/a.out)
beb3dd2c [unknown] ([unknown])
a.out 2869 [000] 19414.348754: 235721 cycles:ppp:
72c g1 (/opt/usr/home/owner/a.out)
beb3dd24 [unknown] ([unknown])
a.out 2869 [000] 19414.348937: 246486 cycles:ppp:
670 g3 (/opt/usr/home/owner/a.out)
beb3dd14 [unknown] ([unknown])
a.out 2869 [000] 19414.349121: 232929 cycles:ppp:
60c g4 (/opt/usr/home/owner/a.out)
beb3dd04 [unknown] ([unknown])
How can I get more info about my backtraces?
Compile: arm-linux-gnueabi-gcc -O0 -g3 -marm -fno-omit-frame-pointer -funwind-tables main.c
Perf record: perf record -g -a
Perf script: perf script
Target is running on Linux 3.10.65.
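For reference, a hypothetical reconstruction of the test program described above (the real source isn't shown in the question), with enough work in the leaf function that samples land there:
#include <stdio.h>

volatile unsigned long sink;    /* volatile so the busy loop isn't optimized away */

void g2(void) { for (unsigned long i = 0; i < 400000000UL; i++) sink += i; }
void g4(void) { g2(); }
void g3(void) { g4(); }
void g1(void) { g3(); }
void f(void)  { g1(); }

int main(void) { f(); printf("%lu\n", sink); return 0; }
Built with the flags from the question (-O0 -g3 -marm -fno-omit-frame-pointer -funwind-tables), every frame from main down to g2 should appear in the call chain if the unwinder works.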
There were problems with the stack unwinding implementation in perf for ARM: it simply wasn't implemented for some time.
Try a recent kernel and/or a recent version of perf (a newer perf tool will work on an older kernel, but part of the backtrace reading is done in the kernel).
https://wiki.linaro.org/LEG/Engineering/TOOLS/perf-callstack-unwinding (also mentioned there: How does linux's perf utility understand stack traces?)
Linux perf has architecture-specific support code. x86 has some DWARF stack frame unwinding support, whilst arm and arm64 do not. It should be implemented on ARM32/64.
The work for ARMv7 is done under the LEG-760 blueprint. .. The expected result is the backtrace of the user and kernel call chain in the perf output statistics.
Support was committed after September 2013: http://www.spinics.net/lists/kernel/msg1608919.html
and 3.10 is from June 2013.
Try the kernel and perf from 3.11, or any newer kernel version.

Verification of poly1305-donna-16.h code

The RFC from May 2015 by Y. Nir et al., ChaCha20 and Poly1305 for IETF Protocols
(https://www.rfc-editor.org/rfc/rfc7539), contains a reference to the MIT/public-domain C library https://github.com/floodyberry/poly1305-donna.
I am porting the C code to Pascal. The 8-bit code works OK (self-test, example, and RFC test vectors).
The port of poly1305-donna-16.h, which uses 16->32 bit multiplies and 32-bit additions, failed. After some testing I compiled the original source with DJGPP GCC 4.7.3, MS VC 6.0 and BC 3.1, and all three failed too (poly1305 self-test).
Questions: Does this C version (built with -DPOLY1305_16BIT) fail for other compilers too? Is there a known fix available? (The blog of the author Andrew Moon at https://floodyberry.wordpress.com/ has been inactive for 6 years.)
I can confirm the failure (the self-test fails) on a pretty vanilla Fedora 22 system:
% gcc poly1305-donna.c -c -DPOLY1305_16BIT
% gcc example-poly1305.c -o ex poly1305-donna.o -DPOLY1305_16BIT
% ./ex
poly1305 self test: failed
Notice that the test succeeds when I omit -DPOLY1305_16BIT.
Also notice:
% uname -rmp
4.0.8-300.fc22.x86_64 x86_64 x86_64
% gcc --version
gcc (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
I suggest you submit a bug report. Andrew has been responsive in the past.
EDIT:
Compiling with clang version 3.5.0 yields the same results as the above gcc test.
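For anyone reproducing this, a small standalone check against the RFC 7539 (section 2.5.2) test vector is handy, independent of the bundled self-test. This sketch assumes the poly1305_auth prototype from poly1305-donna.h; the file name and build line are illustrative:
/* check16.c -- build e.g.: gcc -DPOLY1305_16BIT check16.c poly1305-donna.c -o check16 */
#include <stdio.h>
#include <string.h>
#include "poly1305-donna.h"

int main(void) {
    /* key and expected tag taken from RFC 7539, section 2.5.2 */
    static const unsigned char key[32] = {
        0x85,0xd6,0xbe,0x78,0x57,0x55,0x6d,0x33,0x7f,0x44,0x52,0xfe,0x42,0xd5,0x06,0xa8,
        0x01,0x03,0x80,0x8a,0xfb,0x0d,0xb2,0xfd,0x4a,0xbf,0xf6,0xaf,0x41,0x49,0xf5,0x1b };
    static const unsigned char expected[16] = {
        0xa8,0x06,0x1d,0xc1,0x30,0x51,0x36,0xc6,0xc2,0x2b,0x8b,0xaf,0x0c,0x01,0x27,0xa9 };
    const char *msg = "Cryptographic Forum Research Group";   /* 34 bytes */
    unsigned char mac[16];

    poly1305_auth(mac, (const unsigned char *)msg, strlen(msg), key);
    printf("rfc7539 vector: %s\n", memcmp(mac, expected, 16) ? "FAILED" : "ok");
    return memcmp(mac, expected, 16) != 0;
}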

Profiling sleep times with perf

I was looking for a way to find out where my program spends its time. I read the perf tutorial and tried to profile sleep times as described there. I wrote the simplest possible program to profile:
#include <unistd.h>
int main() {
    sleep(10);
    return 0;
}
then I executed it with perf:
$ sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB /home/pablo/perf.data.raw (~578 samples) ]
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
$ sudo perf report --stdio --show-total-period -i ~/perf.data
Error:
The /home/pablo/perf.data file has no samples!
Does anybody know how to avoid these errors? What do they mean? failed to write feature 2 doesn't look too user-friendly...
Update:
$ uname -a
Linux debian 3.12-1-amd64 #1 SMP Debian 3.12.9-1 (2014-02-01) x86_64 GNU/Linux
There is an error message from your second perf command (from https://perf.wiki.kernel.org/index.php/Tutorial#Profiling_sleep_times - perf inject -s):
$ sudo perf inject -v -s -i ~/perf.data.raw -o ~/perf.data
build id event received for [kernel.kallsyms]: d62870685909222126e7070d2bafdf029f7ed3b6
failed to write feature 2
failed to write feature 2 doesn't look too user-friendly...
... but it was added to perf to make errors more user-friendly: http://lwn.net/Articles/460520/ "perf: make perf.data more self-descriptive (v5)" by Stephane Eranian, 22 Sep 2011:
+static int do_write_feat(int fd, struct perf_header *h, int type, ....
+ pr_debug("failed to write feature %d\n", type);
All features are listed here http://lxr.free-electrons.com/source/tools/perf/util/header.h#L13
HEADER_TRACING_DATA = 1,
HEADER_BUILD_ID,
So, if I'm not mistaken, it sounds like perf inject was not able to write information about build IDs (the error comes from the function write_build_id() in util/header.c). There are two cases which can lead to the error: an unsuccessful call to perf_session__read_build_ids(), or a failure while writing the buildid table in dsos__write_buildid_table (the latter is not our case, because there is no "failed to write buildid table" error message; check write_build_id).
You may check whether you have all the buildids needed for the session. It may also be useful to clear your buildid cache (rm -rf ~/.debug) and check that you have an up-to-date vmlinux with debugging info, or kallsyms enabled in your kernel.
UPDATE: in the comments Pavel says that his perf record wrote no sched:sched_stat_sleep events at all to perf.data:
sudo perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ./a.out
As he explains in his answer, his default Debian kernel has the CONFIG_SCHEDSTATS option disabled by a vendor patch. Red Hat did the same thing with this option in release kernels since 3.11, as explained in Red Hat Bug 1013225 (Josh Boyer, 2013-10-28, comment 4):
We switched to enabling that only on debug builds a while ago. It seems that was turned off entirely with the final 3.11.0 build and has remained off since. Internal testing shows the option has a non-trivial performance impact for context switches.
We can turn this on in debug kernels again, but I'm not sure it's worthwhile.
Josh Poimboeuf says in comment 8 (2013-11-04) that the performance impact is detectable:
In my tests I did a lot of context switches under various CPU loads. I saw a ~5-10% drop in average context switch speed when CONFIG_SCHEDSTATS was enabled. ...The performance hit only seemed to happen on post-CFS kernels (>= 2.6.23). The previous O(1) scheduler didn't seem to have this issue.
Fedora disabled CONFIG_SCHEDSTATS in non-debug kernels on 12 July 2013 with "[kernel] Disable LATENCYTOP/SCHEDSTATS in non-debug builds." by Dave Jones. The first kernel with the option disabled: 3.11.0-0.rc0.git6.4.
In order to use any perf tracepoint event with a name like sched:sched_stat_* (sched:sched_stat_wait, sched:sched_stat_sleep, sched:sched_stat_iowait), we must recompile the kernel with the CONFIG_SCHEDSTATS option enabled, replacing the default Debian, Red Hat or Fedora kernels, which don't have this option.
Thank you, Pavel Davydov.
I finally found out how to make it work. The problem was that the default Debian kernel is built without some config options that perf needs in order to monitor sleep times. It looks like CONFIG_SCHEDSTATS has to be enabled to make the kernel collect scheduler statistics. This is said to have some runtime overhead. I also enabled CONFIG_SCHED_TRACER and some lock tracing options, but I'm not sure whether they matter in my case. Anyway, no statistics are collected in the scheduler without CONFIG_SCHEDSTATS (see the kernel/sched/ directory of the kernel source).
Also, there is a very good article about perf written by Brendan Gregg, with a lot of useful examples and some kernel options that are needed to make perf work properly.
Update: I checked the history of CONFIG_SCHEDSTATS in Debian. I checked out the Debian kernel patches and build scripts repo:
svn checkout svn://svn.debian.org/svn/kernel/dists/trunk/linux/debian
and then found the CONFIG_SCHEDSTATS option there:
$ grep -R CONFIG_SCHEDSTAT config/
config/config:# CONFIG_SCHEDSTATS is not set
This string was added to the repo in commit 10837, on 2008-03-14, with the comment "debian/config: Do complete reorganization". Also, in this and this (thanks to osgx) bug report it is said that the CONFIG_LATENCYTOP and CONFIG_SCHEDSTATS options are not enabled because they can affect kernel performance. So, I think it was just never switched on in default Debian kernels. I haven't found the discussion about the scheduler stats option, though. If I do, I will write back here.
This works for me for "perf version 3.11.1" on an "openSUSE 13.1 (x86_64)" box.
Here is the output if you care:
# ========
# captured on: Sun Feb 16 09:49:38 2014
# hostname : *****************
# os release : 3.11.10-7-desktop
# perf version : 3.11.1
# arch : x86_64
# nrcpus online : 8
# nrcpus avail : 8
# cpudesc : Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
# cpuid : GenuineIntel,6,58,9
# total memory : 32945368 kB
# cmdline : /usr/bin/perf inject -v -s -i perf.data.raw -o perf.data
# event : name = sched:sched_stat_sleep, type = 2, config = 0x48, config1 = 0x0, config2 = 0x
# event : name = sched:sched_switch, type = 2, config = 0x51, config1 = 0x0, config2 = 0x0, e
# event : name = sched:sched_process_exit, type = 2, config = 0x4e, config1 = 0x0, config2 =
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7,
# ========
#
# Samples: 0 of event 'sched:sched_stat_sleep'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
# Samples: 8 of event 'sched:sched_switch'
# Event count (approx.): 80099958776
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ................. .................
#
100.00% 80099958776 bla [kernel.kallsyms] [k] thread_return
|
--- thread_return
thread_return
do_nanosleep
hrtimer_nanosleep
SyS_nanosleep
system_call_fastpath
0x7fbc0dec6570
__GI___libc_nanosleep
(nil)
# Samples: 0 of event 'sched:sched_process_exit'
# Event count (approx.): 0
#
# Overhead Period Command Shared Object Symbol
# ........ ............ ....... ............. ......
#
#
# (For a higher level overview, try: perf report --sort comm,dso)
#
