I've run a log-indexing benchmark with Solr on Red Hat 7.3.
The machine has two 7200 RPM disks in software RAID 1, 64 GB of memory and an E3-1240 v6 CPU.
I was really surprised to find a huge difference in I/O performance between ext4 and xfs (see details below).
Indexing on xfs gave about 20% more indexing throughput than ext4 (I/O wait with xfs is roughly a tenth of what it is with ext4).
I'm looking for some insight into choosing the appropriate file system for a Solr machine.
ext4:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.09   62.43    1.84   14.51    0.00   18.12

Device:  rrqm/s   wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb        0.02   169.38   13.95  182.97    0.36   26.28   277.04    40.91  207.66   18.96  222.05   3.82  75.18
sda        0.04   169.34   20.55  183.01    0.61   26.28   270.51    47.18  231.71   27.84  254.60   3.76  76.51

xfs:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.18   81.72    2.19    1.48    0.00   11.42

Device:  rrqm/s   wrqm/s     r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.00    17.51    0.00  123.70    0.00   29.13   482.35    34.03  274.97   56.12  274.97   5.39  66.63
sdb        0.00    17.53    0.09  123.69    0.00   29.13   482.05    34.84  281.29   25.58  281.48   5.29  65.52
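For reference, this column layout is what sysstat's extended, megabyte-based report prints, i.e. an invocation along these lines (the interval is just an example):

iostat -xm 5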
As you have done the testing yourself (hopefully under conditions similar to your intended production usage), nobody else will have better advice regarding the FS. Of course, if you could swap the spinning disks for SSDs, that would be much, much better, especially for indexing throughput.
After compiling with the flags -O0 -p -pg -Wall -c in GCC and linking with -p -pg in the MinGW linker, the Eclipse gprof plugin shows no results. I then ran gprof my.exe gmon.out > prof.txt from cmd, which produced a report with only the number of calls to each function.
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00      0.00    16000     0.00     0.00  vector_norm
 0.00      0.00      0.00       16     0.00     0.00  rbf_kernel
 0.00      0.00      0.00        8     0.00     0.00  lubksb
I've come across this topic: gprof reports no time accumulated. But my program terminates cleanly. I also found gprof view show no data on MingW/Windows, but I am using 32-bit GCC. I previously tried Cygwin, with the same result.
I am using Eclipse Kepler with CDT version 8.3.0.201402142303 and MinGW with GCC 5.4.0.
Any help is appreciated, thank you in advance.
Sorry for the question; it seems the code runs faster than gprof can measure.
Since my application involves training a neural network over several iterations and then testing kernels, I didn't suspect that fast code could be causing the problem. I inserted a long loop in the main body and gprof then reported timing data.
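For anyone who hits the same thing, here is a minimal sketch (not my real code; the file name and iteration count are arbitrary) of the kind of artificial workload that gives gprof enough runtime to sample:

/* busy.c -- minimal sketch: give gprof enough runtime to sample.
   Build:  gcc -O0 -pg -Wall busy.c -o busy
   Run:    busy                          (writes gmon.out)
   Report: gprof busy gmon.out > prof.txt */
#include <stdio.h>

static double burn(long iterations)
{
    double acc = 0.0;
    for (long i = 1; i <= iterations; i++)
        acc += 1.0 / (double)i;   /* cheap work that gprof can attribute */
    return acc;
}

int main(void)
{
    /* a few hundred million iterations keeps the process alive well past
       gprof's 0.01 s sampling interval */
    printf("%f\n", burn(300000000L));
    return 0;
}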
I have a server running HP-UX 11 and I'm trying to get I/O statistics per file system (not per disk).
For example, I have 50 disks attached to the server; when I type iostat I get per-disk output (below is the iostat output for 3 of the disks):
disk9 508 31.4 1.0
disk10 53 1.5 1.0
disk11 0 0.0 1.0
And I have these file systems (df output):
/c101 (/dev/VGAPPLI/c101_lv): 66426400 blocks 1045252 i-nodes
/c102 (/dev/VGAPPLI/c102_lv): 360190864 blocks 5672045 i-nodes
/c103 (/dev/VGAPPLI/c103_lv): 150639024 blocks 2367835 i-nodes
/c104 (/dev/VGAPPLI/c104d_lv): 75852825 blocks 11944597 i-nodes
Is it possible to get I/O statistics specifically for these file systems?
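(For reference, the only workaround I know of, which is not really per-file-system I/O, is to map each logical volume back to its physical disks with the LVM tools, e.g. something like the commands below, and then correlate the per-disk iostat numbers by hand.)

vgdisplay -v /dev/VGAPPLI
lvdisplay -v /dev/VGAPPLI/c101_lv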
Thanks.
First I'll give a rundown of what I did.
I downloaded dhry.h dhry_1.c and dhry_2.c from here:
http://giga.cps.unizar.es/~spd/src/other/dhry/
Then I made some corrections (so that it would compile) according to this:
https://github.com/maximeh/buildroot/blob/master/package/dhrystone/dhrystone-2-HZ.patch
And this
Errors while compiling dhrystone in unix
I've compiled the files with the following command line:
gcc dhry_1.c dhry_2.c -O2 -o run
Finally, I entered 1000000000 as the number of runs.
And waited. I compiled with four different optimization levels and got these DMIPS values (according to http://en.wikipedia.org/wiki/Dhrystone, DMIPS is Dhrystones per second divided by 1757):
O0: 8112, O1: 16823.9, O2: 22977.5, O3: 23164.5 (these labels refer to the compiler flags; e.g. -O2 is optimization level two and -O0 is no optimization).
This would give the following DMIPS/MHz (base frequency for my processor is 3.4 GHz):
2.3859 4.9482 6.7581 6.8131
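To make the arithmetic explicit, here is a small sketch (my own helper, not part of dhry_1.c; the raw score is simply back-computed from the O2 figure above) of the conversion I used:

/* dmips.c -- sketch of the DMIPS and DMIPS/MHz conversion */
#include <stdio.h>

int main(void)
{
    double dhry_per_sec = 22977.5 * 1757.0;  /* raw Dhrystones/s implied by the O2 result */
    double cpu_mhz      = 3400.0;            /* base clock of the CPU under test */

    double dmips = dhry_per_sec / 1757.0;    /* normalise to the DEC VAX 11/780 */
    printf("DMIPS     = %.1f\n", dmips);             /* 22977.5 */
    printf("DMIPS/MHz = %.4f\n", dmips / cpu_mhz);   /* 6.7581  */
    return 0;
}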
However, I get the feeling that 6.7 is way too low. According to what I've read, an A15 gets between 3.5 and 4 DMIPS/MHz, and a third-generation i7 is only double that? Shouldn't it be a lot higher?
Can anyone tell me from my procedure whether I might have done something wrong? Or maybe I'm interpreting the results incorrectly?
Except with a broad brush treatment, you cannot compare benchmark results produced by different compilers. As the design authority of the first standard benchmark (Whetstone), I can advise that it is even less safe to include comparisons with results from a computer manufacturer's in-house compiler. In minicomputer days, manufacturers found that sections of the Whetstone benchmark could be optimised out to double the score. I arranged for changes and more detailed results to avoid, and later highlight, over-optimisation.
Below are example results on PCs from my original (1990’s) Dhrystone Benchmarks. For details, more results and (free) execution and source files see:
http://www.roylongbottom.org.uk/dhrystone%20results.htm
Also included, and compiled from the same source code, are results from a later MS compiler, some via Linux and on Android via ARM CPUs, plus one for an Intel Atom via the Houdini compatibility layer. I prefer the term VAX MIPS instead of DMIPS, as the 1757 divisor is the result on the DEC VAX 11/780. Anyway, MIPS/MHz calculations are also shown. Note the differences due to compilers and the particularly low ratios on Android ARM CPUs.
                          Dhry1   Dhry1   Dhry2   Dhry2  Dhry2
                           Opt    NoOpt    Opt    NoOpt   Opt
                           VAX     VAX     VAX     VAX   MIPS/
CPU                MHz    MIPS    MIPS    MIPS    MIPS    MHz

AMD 80386            40    17.5    4.32    13.7    4.53   0.3
80486 DX2            66    45.1   12       35.3   12.4    0.5
Pentium             100   169     31.8    122     32.2    1.2
Pentium Pro         200   373     92.4    312     91.9    1.6
Pentium II          300   544    132      477    136      1.6
Pentium III         450   846    197      722    203      1.6
Pentium 4          1900  2593    261     2003    269      1.1
Atom               1666  2600    772     1948    780      1.2
Athlon 64          2211  5798   1348     4462   1312      2.0
Core 2 Duo 1 CP    2400  7145   1198     6446   1251      2.7
Phenom II 1 CP     3000  9462   2250     7615   2253      2.5
Core i7 4820K      3900 14776   2006    11978   2014      3.1

Later Intel Compiler
Pentium 4          1900   2613    1795   0.9
Athlon 64          2211   6104    3720   1.7
Core 2 Duo         2400   8094    5476   2.3
Phenom II          3000   9768    6006   2.0
Core i7 4820K      3900  15587   10347   2.7

Linux Ubuntu GCC Compiler
Atom               1666  5485   1198     2055   1194      1.2
Athlon 64          2211  9034   2286     4580   2347      2.1
Core 2 Duo         2400 13599   3428     5852   3348      2.4
Phenom II          3000 13406   3368     6676   3470      2.2
Core i7 4820K      3900 29277   7108    16356   7478      4.2

ARM Android NDK
926EJ               800    356     196   0.4
v7-A9              1500   1650     786   1.1
v7-A15             1700   3189    1504   1.9
Atom Houdini       1866   1840    1310   1.0
I have come across a .prof file (at least the extension tells me so) which I think was used to analyze the performance of loaders written in Pro*C.
I am writing a new, similar loader and I want to analyze the performance of my program.
I have pasted the first few lines of the .prof file here:
%Time  Seconds  Cumsecs   #Calls  msec/call  Name
 90.8   235.13   235.13        0     0.0000  strlen
  3.2     8.17   243.30        0     0.0000  _read
  1.3     3.33   246.63   897580     0.0037  Search
  1.0     2.56   249.19        0     0.0000  _lseek
  0.6     1.43   250.62        0     0.0000  _kill
  0.5     1.39   252.01        0     0.0000  _write
  0.3     0.83   252.84   864734     0.0010  _doprnt
  0.3     0.75   253.59        0     0.0000  _mcount0
I am interested in two points:
what kind of file this is
how can I generate such a file in a Unix environment (which command?)
That looks like an (outdated?) gprof flat profile.
These are generated with gcc by adding -pg to the command line options, and running the program.
The profile tells us that the tested code spends a very long time running strlen().
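For a C program, the usual sequence is roughly the following (file names here are placeholders, not taken from the original loader):

gcc -pg loader.c -o loader            # compile and link with profiling enabled
./loader                              # running it writes gmon.out
gprof loader gmon.out > loader.prof   # format the flat profile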
My program produces this profile:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
29.79      0.14      0.14   227764     0.00     0.00  standardize_state
21.28      0.24      0.10                             __udivdi3
14.89      0.31      0.07                             __umoddi3
 8.51      0.35      0.04   170971     0.00     0.00  make_negative_child
 6.38      0.38      0.03  8266194     0.00     0.00  mypow
...
My cursory googling tells me that __udivdi3 and __umoddi3 are generated by / and %. And yes, there is a lot of / and % in my program.
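For example (a minimal sketch, not my actual code), 64-bit division and modulo like this get lowered to libgcc helper calls on a 32-bit build:

/* 64-bit / 64-bit division is not a single instruction on 32-bit x86,
   so GCC emits calls into libgcc instead */
unsigned long long split(unsigned long long x, unsigned long long base,
                         unsigned long long *rem)
{
    *rem = x % base;   /* becomes a call to __umoddi3 */
    return x / base;   /* becomes a call to __udivdi3 */
}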
However, this is not very useful in the profile; I would prefer the time in those calls to be attributed to the calling functions.
Is there any way I can do this, or otherwise make gprof more applicable to this scenario?