Interpreting the Dhrystone benchmark results on a modern PC

First, I'll give a rundown of what I did.
I downloaded dhry.h, dhry_1.c and dhry_2.c from here:
http://giga.cps.unizar.es/~spd/src/other/dhry/
Then I made some corrections (so that it would compile) according to this:
https://github.com/maximeh/buildroot/blob/master/package/dhrystone/dhrystone-2-HZ.patch
and this:
Errors while compiling dhrystone in unix
I've compiled the files with the following command line:
gcc dhry_1.c dhry_2.c -O2 -o run
Finally, I entered the number of runs as 1000000000 and waited.
I compiled with four different optimization levels and got these DMIPS values (according to http://en.wikipedia.org/wiki/Dhrystone, DMIPS is Dhrystones per second divided by 1757):
O0: 8112, O1: 16823.9, O2: 22977.5, O3: 23164.5 (these correspond to the compiler optimization flags; -O2 is optimization level two and -O0 is no optimization).
This would give the following DMIPS/MHz (the base frequency of my processor is 3.4 GHz):
2.3859, 4.9482, 6.7581, 6.8131
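For reference, the conversion I applied is just this arithmetic, shown here as a minimal C sketch (the Dhrystones-per-second value is back-computed from my -O2 result of 22977.5 DMIPS; 1757 is the VAX 11/780 reference score and 3400 MHz is my base clock):

#include <stdio.h>

int main(void)
{
    /* Dhrystones per second as printed by the benchmark; this value is
       back-computed from the 22977.5 DMIPS figure above, for illustration. */
    double dhrystones_per_second = 22977.5 * 1757.0;
    double cpu_mhz = 3400.0;                        /* base clock of my CPU */

    double dmips = dhrystones_per_second / 1757.0;  /* 1757 = VAX 11/780 score */
    double dmips_per_mhz = dmips / cpu_mhz;

    printf("DMIPS     = %.1f\n", dmips);            /* 22977.5 */
    printf("DMIPS/MHz = %.4f\n", dmips_per_mhz);    /* 6.7581  */
    return 0;
}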
However, I get the feeling that 6.7 is way too low. From what I've read, a Cortex-A15 has between 3.5 and 4 DMIPS/MHz, and a third-generation i7 only has double that? Shouldn't it be a lot higher?
Can anyone see from my procedure whether I did something wrong? Or am I interpreting the results incorrectly?

Except with a broad brush treatment, you cannot compare benchmark results produced by different compilers. As the design authority of the first standard benchmark (Whetstone), I can advise that it is even less safe to include comparisons with results from a computer manufacturer's in-house compiler. In minicomputer days, manufacturers found that sections of the Whetstone benchmark could be optimised out, doubling the score. I arranged for changes, and for more detailed results, first to avoid and later to highlight over-optimisation.
Below are example results on PCs from my original (1990s) Dhrystone Benchmarks. For details, more results and (free) execution and source files, see:
http://www.roylongbottom.org.uk/dhrystone%20results.htm
Also included, and compiled from the same source code, are results from a later MS compiler, some via Linux, and some on Android via ARM CPUs, plus one for an Intel Atom via the Houdini compatibility layer. I prefer the term VAX MIPS to DMIPS, as the 1757 divisor is the result on a DEC VAX 11/780. Anyway, MIPS/MHz calculations are also shown. Note the differences due to compilers and the particularly low ratios on Android ARM CPUs.
                        Dhry1    Dhry1    Dhry2    Dhry2   Dhry2
                          Opt    NoOpt      Opt    NoOpt     Opt
                          VAX      VAX      VAX      VAX   MIPS/
CPU               MHz    MIPS     MIPS     MIPS     MIPS     MHz

AMD 80386          40     17.5     4.32     13.7     4.53     0.3
80486 DX2          66     45.1     12       35.3     12.4     0.5
Pentium           100      169     31.8      122     32.2     1.2
Pentium Pro       200      373     92.4      312     91.9     1.6
Pentium II        300      544      132      477      136     1.6
Pentium III       450      846      197      722      203     1.6
Pentium 4        1900     2593      261     2003      269     1.1
Atom             1666     2600      772     1948      780     1.2
Athlon 64        2211     5798     1348     4462     1312     2.0
Core 2 Duo 1 CP  2400     7145     1198     6446     1251     2.7
Phenom II 1 CP   3000     9462     2250     7615     2253     2.5
Core i7 4820K    3900    14776     2006    11978     2014     3.1

Later Intel Compiler
Pentium 4        1900     2613              1795              0.9
Athlon 64        2211     6104              3720              1.7
Core 2 Duo       2400     8094              5476              2.3
Phenom II        3000     9768              6006              2.0
Core i7 4820K    3900    15587             10347              2.7

Linux Ubuntu GCC Compiler
Atom             1666     5485     1198     2055     1194     1.2
Athlon 64        2211     9034     2286     4580     2347     2.1
Core 2 Duo       2400    13599     3428     5852     3348     2.4
Phenom II        3000    13406     3368     6676     3470     2.2
Core i7 4820K    3900    29277     7108    16356     7478     4.2

ARM Android NDK
926EJ             800      356               196              0.4
v7-A9            1500     1650               786              1.1
v7-A15           1700     3189              1504              1.9
Atom Houdini     1866     1840              1310              1.0

Related

Postgres SQL Update JDBC Stuck

I'm running the below query through jdbcTemplate.update()
"UPDATE <tblnme> as aa set _r=b._r from(SELECT id,parent,LEAD(id,1) OVER (partition by parent ORDER BY id) _r from <tblnme>) as b where aa.id=b.id"
This update (one table, 28 million records) completes in around 15 minutes on this machine config:
SSD, 32 GB RAM, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
but gets stuck on this machine config:
HDD, 32 GB RAM, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
There is no message in the Postgres server log during the stuck phase. I had to terminate it after waiting 4 hours.
I have set the following in postgresql.conf on both machines:
max_connections = 375
shared_buffers = 2500MB
max_wal_size = 16GB
min_wal_size = 4096MB
What could be done to resolve this issue?
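One way to tell whether the backend is blocked on a lock or simply grinding through slow HDD I/O is to query the statistics views while the update appears stuck. A minimal sketch, assuming PostgreSQL 9.6 or later (the wait_event columns do not exist in older versions):

-- What is the UPDATE's backend doing right now?
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'update%';

-- Are there any ungranted locks (i.e. something is blocking)?
SELECT locktype, relation::regclass AS relname, mode, granted, pid
FROM pg_locks
WHERE NOT granted;

If nothing is blocked and the wait events are I/O related, the slower machine is most likely just bound by the HDD writing roughly 28 million updated row versions plus WAL.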

How to restrict MongoDB's disk (ROM) usage on a 32-bit system? I am using MongoDB version 2.4

I am using MongoDB version 2.4 due to a 32-bit system limitation (NanoPi M1 Plus). I have a Debian Jessie (Debian 8) OS image with 4.2 GB of space available (eMMC). About 2.2 GB remains after loading my application files. However, the flash fills up to 100% soon after I start running my application.
Then I get the error "Unable to get database instance", MongoDB stops working, and my application stops working too.
Can someone please help me with this problem. Thanks in advance!
Memory status of my device when it stopped working:
df -h:
Filesystem  Size  Used  Avail  Use%  Mounted on
overlay     4.2G  4.2G      0  100%  /
du -shx /var/lib/mongodb/* | sort -rh | head -n 20
512M /var/lib/mongodb/xyz.6
512M /var/lib/mongodb/xyz.5
257M /var/lib/mongodb/xyz.4
128M /var/lib/mongodb/xyz.3
64M /var/lib/mongodb/xyz.2
32M /var/lib/mongodb/xyz.1
17M /var/lib/mongodb/xyz.ns
17M /var/lib/mongodb/xyz.0
16M /var/lib/mongodb/local.ns
16M /var/lib/mongodb/local.0
4.0K /var/lib/mongodb/journal
0 /var/lib/mongodb/mongodb.lock
du -shx /var/lib/mongodb/journal/* | sort -rh | head -n 20
257M /var/lib/mongodb/journal/prealloc.2
257M /var/lib/mongodb/journal/prealloc.1
257M /var/lib/mongodb/journal/prealloc.0
du -shx /var/log/mongodb/journal/* | sort -rh | head -n 20
399M /var/lib/mongodb/mongodb.log
353M /var/lib/mongodb/mongodb.log.1
3.3M /var/lib/mongodb/mongodb.log.2.gz
752K /var/lib/mongodb/mongodb.log.1.gz
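For context, most of what is eating the flash above is MongoDB's preallocated data and journal files plus the uncapped log. On 2.4 the relevant knobs live in the INI-style config file; a minimal sketch, assuming the Debian default paths (nojournal trades crash safety for the ~770 MB of prealloc files, and smallfiles shrinks newly allocated data files):

# /etc/mongodb.conf  (MongoDB 2.4, INI-style options)
dbpath = /var/lib/mongodb
logpath = /var/log/mongodb/mongodb.log
logappend = true

# Allocate smaller data files and a smaller journal
smallfiles = true

# Optionally disable journaling entirely, removing the journal/prealloc.* files
# (trade-off: reduced crash safety)
nojournal = true

# Log less, so mongodb.log does not grow to hundreds of MB
quiet = true

Rotating the existing logs (via logrotate, or db.adminCommand({ logRotate: 1 }) from the mongo shell) would also recover the roughly 750 MB currently taken by mongodb.log and mongodb.log.1.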

Solr / Lucene best file system

I've done some log-indexing benchmarks with Solr on Red Hat 7.3.
The machine had two 7200 RPM disks in software RAID 1, 64 GB of memory and an E3-1240 v6 CPU.
I was really surprised to find a huge difference in I/O performance between ext4 and xfs (see details below).
Indexing with xfs gave 20% more indexing throughput than ext4 (I/O wait is a tenth of the ext4 figure).
I'm looking for some insights on choosing the appropriate file system for a Solr machine.
ext4:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.09  62.43     1.84    14.51    0.00  18.12

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
sdb        0.02  169.38  13.95  182.97   0.36  26.28    277.04     40.91  207.66    18.96   222.05   3.82  75.18
sda        0.04  169.34  20.55  183.01   0.61  26.28    270.51     47.18  231.71    27.84   254.60   3.76  76.51

xfs:
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.18  81.72     2.19     1.48    0.00  11.42

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
sda        0.00   17.51   0.00  123.70   0.00  29.13    482.35     34.03  274.97    56.12   274.97   5.39  66.63
sdb        0.00   17.53   0.09  123.69   0.00  29.13    482.05     34.84  281.29    25.58   281.48   5.29  65.52
As you have done the testing yourself (hopefully under conditions similar to your intended production usage), nobody else will have better advice regarding the file system. Of course, if you could swap the spinning disks for SSDs, that would be much, much better, especially for indexing throughput.

Report shows "no time accumulated" for gprof using Eclipse CDT

After compiling with the flags -O0 -p -pg -Wall -c on GCC and -p -pg on the MinGW linker, the Eclipse gprof plugin shows no results. After that I ran gprof my.exe gmon.out > prof.txt from cmd, which produced a report with only the number of calls to each function.
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00    16000     0.00     0.00  vector_norm
 0.00      0.00      0.00       16     0.00     0.00  rbf_kernel
 0.00      0.00      0.00        8     0.00     0.00  lubksb
I've come across this topic: gprof reports no time accumulated. But my program terminates cleanly. I also found gprof view shows no data on MinGW/Windows, but I am using 32-bit GCC. I previously tried Cygwin, with the same result.
I am using eclipse Kepler with CDT version 8.3.0.201402142303 and MinGW with GCC 5.4.0.
Any help is appreciated, thank you in advance.
Sorry for the question; it seems the code is faster than gprof can measure.
As my application involves training a neural network over several iterations and then testing kernels, I didn't suspect that fast code could be causing the problem. I inserted a long loop in the main body and gprof then printed timing data.
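To illustrate the point, here is a minimal sketch of the kind of long-running work that gives gprof's roughly 0.01 s sampler something to accumulate (the function name and iteration count are invented for the example):

#include <stdio.h>

/* Hypothetical hot function: enough iterations that gprof's ~10 ms
   sampling interval hits it many times. */
static double busy_work(long iterations)
{
    volatile double acc = 0.0;   /* volatile keeps the loop from being optimised away */
    for (long i = 1; i <= iterations; ++i)
        acc += 1.0 / (double)i;
    return acc;
}

int main(void)
{
    /* A handful of microsecond-scale calls gives "no time accumulated";
       a few hundred million iterations of real work makes samples appear. */
    printf("%f\n", busy_work(300000000L));
    return 0;
}

Built with gcc -O0 -pg busy.c -o busy and run once, gprof busy gmon.out should then show non-zero self seconds for busy_work.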

Why is Lua on the host system slower than in the Linux VM?

Comparing the execution time of this Lua script on a MacBook Air (Mac OS 10.9.4, i5-4250U (1.3 GHz), 8 GB RAM) with a VM (VirtualBox) running Arch Linux.
Compiling Lua 5.2.3 in an Arch Linux VirtualBox VM
First I compiled Lua myself using clang, to compare it with the Mac OS X clang binary.
using tcc, gcc and clang
$ tcc *[^ca].c lgc.c lfunc.c lua.c -lm -o luatcc
$ gcc -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luagcc
/tmp/ccxAEYH8.o: In function `os_tmpname':
loslib.c:(.text+0x29c): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclang
/tmp/loslib-bd4ef4.o:loslib.c:function os_tmpname: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
clang version in VM
$ clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
compare the file size
$ ls -lh |grep lua
-rwxr-xr-x 1 markus markus 210K 20. Aug 18:21 luaclang
-rwxr-xr-x 1 markus markus 251K 20. Aug 18:22 luagcc
-rwxr-xr-x 1 markus markus 287K 20. Aug 18:22 luatcc
VM benchmarking
clang binary ~3.1 sec
$ time ./luaclang sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.124s
user 0m3.100s
sys 0m0.020s
gcc binary ~3.09 sec
$ time ./luagcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m3.090s
user 0m3.080s
sys 0m0.007s
tcc binary ~7.0 sec - no surprise here :)
$ time ./luatcc sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m7.071s
user 0m7.053s
sys 0m0.010s
Compiling on Mac OS X
Now compiling Lua with the same clang command/options as in the VM.
$ clang -O3 *[^ca].c lgc.c lfunc.c lua.c -lm -o luaclangmac
loslib.c:108:3: warning: 'tmpnam' is deprecated: This function is provided for
compatibility reasons only. Due to security concerns inherent in the design of tmpnam(3),
it is highly recommended that you use mkstemp(3)
instead. [-Wdeprecated-declarations]
lua_tmpnam(buff, err);
^
loslib.c:57:33: note: expanded from macro 'lua_tmpnam'
#define lua_tmpnam(b,e) { e = (tmpnam(b) == NULL); }
^
/usr/include/stdio.h:274:7: note: 'tmpnam' declared here
char *tmpnam(char *);
^
1 warning generated.
clang version Mac OS X
I've tried two versions: 3.4.2 and the one provided by Xcode. Version 3.4.2 is a bit slower.
Markuss-MacBook-Air:bin markus$ ./clang --version
clang version 3.4.2 (tags/RELEASE_34/dot2-rc1)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
Markuss-MacBook-Air:bin markus$ clang --version
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
file size
$ ls -lh|grep lua
-rwxr-xr-x 1 markus staff 194K 20 Aug 18:26 luaclangmac
HOST benchmarking
clang binary ~4.3 sec
$ time ./luaclangmac sumdata.lua data.log
Original Size: 117261680 kb
Compressed Size: 96727557 kb
real 0m4.338s
user 0m4.264s
sys 0m0.062s
Why?
I would have expected the host system to be a little faster than the virtualized one (or roughly the same speed), not reproducibly slower.
So, any ideas or explanations?
Update 2014.10.30
Meanwhile I've installed Arch Linux natively on my MBA. The benchmarks are as fast as in the Arch Linux VM.
Can you try running 'perf stat' instead of 'time'? It gives you much more detail, and the time measurement is more accurate, avoiding timing differences inside the VM.
Here is an example:
$ perf stat ls > /dev/null

 Performance counter stats for 'ls':

         23.348076  task-clock (msec)        #    0.989 CPUs utilized
                 2  context-switches         #    0.086 K/sec
                 0  cpu-migrations           #    0.000 K/sec
                93  page-faults              #    0.004 M/sec
        74,628,308  cycles                   #    3.196 GHz                   [65.75%]
           740,755  stalled-cycles-frontend  #    0.99% frontend cycles idle  [48.66%]
        29,200,738  stalled-cycles-backend   #   39.13% backend cycles idle   [60.02%]
        80,592,001  instructions             #    1.08  insns per cycle
                                             #    0.36  stalled cycles per insn
        17,746,633  branches                 #  760.090 M/sec                 [60.00%]
           642,360  branch-misses            #    3.62% of all branches       [48.64%]

       0.023609439 seconds time elapsed
My guess is that the HFS+ journaling feature is adding latency. This would be easy enough to test: if Time Machine is running on the MacBook Air, you could try disabling it, and then disable journaling on the filesystem (obviously you should back up first). As root:
diskutil disableJournal YourDiskVolume
I'd see if that's the cause of the problem, then immediately re-enable journaling:
diskutil enableJournal YourDiskVolume
OS X 10.9.2 had a journaling-related bug that would hang the filesystem... this page explores the bug further, and even though the bug (#15821723) hasn't been reported as fixed, journaling reportedly no longer crashes the disk controller.
To test the speed of Lua itself, hard-code some sample data into the test script instead of reading a file, and loop over the lines as many times as necessary; a sketch follows below. Like others mentioned, the filesystem effects are going to outweigh any compiler differences.
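A minimal sketch of that idea (the sample lines and the summing logic are invented here, since the original sumdata.lua is not shown; adjust the repeat count to taste):

-- sumdata_mem.lua: hard-coded sample lines instead of reading data.log from disk
local sample = {
  "1024 512",
  "2048 1024",
  "4096 2048",
}

local original, compressed = 0, 0
for _ = 1, 10 * 1000 * 1000 do            -- repeat enough times for a measurable runtime
  for _, line in ipairs(sample) do
    local a, b = line:match("(%d+)%s+(%d+)")
    original   = original   + tonumber(a)
    compressed = compressed + tonumber(b)
  end
end

print("Original Size: "   .. original   .. " kb")
print("Compressed Size: " .. compressed .. " kb")

This exercises the CPU-bound part without touching the filesystem, so any remaining host-vs-VM difference points at the compiler or OS rather than disk I/O.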
