perf reporting huge numbers for simple program - c

I'm trying to get into C optimization. When running perf, the reports don't really make sense to me.
I created a test program:
int main()
{
    return 0;
}
Compiled it: gcc test.c -o test -std=c99 -O2 -lm
And ran perf:
perf stat -B -r 20 -e "cycles,instructions,cache-references,cache-misses,branches,branch-misses,cpu-clock,task-clock,faults,cs,migrations,alignment-faults" test
This is the output:
Performance counter stats for 'test' (20 runs):
918.130 cycles # 1,640 GHz ( +- 0,75% )
871.395 instructions # 0,95 insn per cycle ( +- 0,31% )
35.793 cache-references # 63,926 M/sec ( +- 0,90% )
7.897 cache-misses # 22,062 % of all cache refs ( +- 3,81% )
176.129 branches # 314,562 M/sec ( +- 0,26% )
7.300 branch-misses # 4,14% of all branches ( +- 1,04% )
0,56 msec cpu-clock # 0,648 CPUs utilized ( +- 3,87% )
0,56 msec task-clock # 0,648 CPUs utilized ( +- 3,87% )
59 faults # 0,106 M/sec ( +- 0,35% )
0 cs # 0,000 K/sec
0 migrations # 0,000 K/sec
0 alignment-faults # 0,000 K/sec
0,0008638 +- 0,0000357 seconds time elapsed ( +- 4,13% )
I'm not sure if I'm missing something, but I can't find any reason it would make sense for a program that returns 0 to have 871 thousand instructions, 7 thousand cache misses and 176 thousand branches.
Am I doing something wrong when running perf? Or just completely misunderstanding what the output is supposed to mean?
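For scale: perf stat counts the whole process, including the dynamic linker and the libc/CRT startup code that run before main, and for a dynamically linked executable that fixed overhead is already on the order of several hundred thousand instructions. A minimal sketch of a variant whose own work dwarfs that startup cost (arbitrary loop bound, same compile command; the volatile sink is only there so -O2 can't delete the loop):
/* test2.c - hypothetical variant for comparison */
int main(void)
{
    volatile unsigned long sink = 0;            /* volatile keeps -O2 from removing the loop */
    for (unsigned long i = 0; i < 100000000UL; ++i)
        sink += i;                              /* a few instructions per iteration, so the loop dominates startup */
    return 0;
}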

Related

How to decrease the time spent on one instruction?

I am trying to optimize a code in C, and it seems that one instruction is taking about 22% of the time.
The code is compiled with gcc 8.2.0. Flags are -O3 -DNDEBUG -g, and -Wall -Wextra -Weffc++ -pthread -lrt.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
VTune Amplifier gives me a hint to a function: https://pasteboard.co/IagrLaF.png
The instruction cmpq seems to take 22% of the whole time. On the other hand, the other instructions take negligible time.
perf gives me a somewhat different picture, yet I think that results are consistent:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
The function is
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}
state_t is defined as
struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};
preprocessedInterval_t is:
struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};
There are only a few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.
The first if is true 15 billion times out of 19 billion.
Finding mispredicted branches with perf record -e branches,branch-misses mytool on the function gives me:
Available samples
2M branches
1M branch-misses
Can I assume that branch misprediction is responsible for this slow down?
What would be the next step to optimize my code?
The code is available on GitHub
Edit 1
valgrind --tool=cachegrind gives me:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
Edit 2
I compiled with -O3 -DNDEBUG -march=native -fprofile-use, and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
Edit 3
I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mentioned my function:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(Using perf record --period 10000 triggers Workload failed: No such file or directory.)
Was the sample-rate the same for branches and branch-misses? A 50% mispredict rate would be extremely bad.
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events. But you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used that.
Use perf stat to get hard numbers. Update: yup, your perf stat results confirm your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.
Also check the cache miss rate for L1d, and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see if it's really that load that's missing, e.g.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss ./a.out
You can use any of these events with perf record to get some statistical samples on which instructions are causing cache misses. Those are precise events (using PEBS), so should accurately map to the correct instruction (not like "cycles" where counts get attributed to some nearby instruction, often the one that stalls waiting for an input with the ROB full, instead of the one that was slow to produce it.)
And without the skew of non-PEBS events, which should "blame" a single instruction but don't always interrupt at exactly the right place.
If you're optimizing for your local machine and don't need it to run anywhere else, you might use -O3 -march=native. Not that that will help with cache misses.
GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)
Can I assume that branch misprediction is responsible for this slow down?
No, cache misses might be more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.
What would be the next step to optimize my code?
Compact your data structures if possible so more of them fit in cache, e.g. 32-bit pointers (Linux x32 ABI: gcc -mx32) if you don't need more than 4GiB of virtual address space. Or maybe try using a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple cycles on Sandybridge-family.)
And/or improve your access pattern so you're mostly accessing them in sequential order, so hardware prefetch can bring them into cache before you need to read them.
I'm not familiar enough with the Burrows-Wheeler transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) or its application in sequence alignment to know if it's possible to make it more efficient, but data compression is inherently problematic because you very often need data-dependent branching and access to scattered data. It's often worth the tradeoff vs. even more cache misses, though.
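A sketch of the 32-bit-index idea above, with made-up names and a guessed two-64-bit-word layout for bwtinterval_t (the real definition isn't shown here); note that with 8-byte-aligned interval members the 4 bytes saved may come back as padding, so the size win might only materialize together with narrower interval fields or the x32 ABI:
#include <stdint.h>

typedef struct { uint64_t k, l; } bwtinterval_t;     /* assumed layout */

typedef struct {
    bwtinterval_t interval;
    uint32_t firstChild;       /* index into the node pool instead of a pointer;
                                  UINT32_MAX could serve as the "no child" sentinel */
} packedInterval_t;

extern packedInterval_t nodePool[];                  /* the single big allocation (~350k nodes) */

static inline packedInterval_t *childOf(const packedInterval_t *p, unsigned short nucleotide)
{
    /* the pointer chase becomes base + scaled index: one extra add versus a raw
       pointer, but half the bytes per link and denser nodes in the cache */
    return &nodePool[p->firstChild + nucleotide];
}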

NFS v4 with fast network and average IOPS disk. Load increase high on large file transfer

NFS v4 with fast network and average IOPS disk. Load increase high on large file transfer.
The problem seems to be IOPS.
The test case:
/etc/exports
server# /mnt/exports 192.168.6.0/24(rw,sync,no_subtree_check,no_root_squash,fsid=0)
server# /mnt/exports/nfs 192.168.6.0/24(rw,sync,no_subtree_check,no_root_squash)
client# mount -t nfs 192.168.6.131:/nfs /mnt/nfstest -vvv
(or client# mount -t nfs 192.168.6.131:/nfs /mnt/nfstest -o nfsvers=4,tcp,port=2049,async -vvv)
It works well, but with the 'sync' flag the transfer drops from 50MB/s to 500kB/s.
http://ubuntuforums.org/archive/index.php/t-1478413.html
That thread seems to be solved by reducing wsize to wsize=300 - a small improvement, but not the solution.
Simple test with dd:
client# dd if=/dev/zero bs=1M count=6000 |pv | dd of=/mnt/nfstest/delete_me
server# iotop
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1863 be/4 root 0.00 B/s 14.17 M/s 0.00 % 21.14 % [nfsd]
1864 be/4 root 0.00 B/s 7.42 M/s 0.00 % 17.39 % [nfsd]
1858 be/4 root 0.00 B/s 6.32 M/s 0.00 % 13.09 % [nfsd]
1861 be/4 root 0.00 B/s 13.26 M/s 0.00 % 12.03 % [nfsd]
server# dstat -r --top-io-adv --top-io --top-bio --aio -l -n -m
--io/total- -------most-expensive-i/o-process------- ----most-expensive---- ----most-expensive---- async ---load-avg--- -NET/total- ------memory-usage-----
read writ|process pid read write cpu| i/o process | block i/o process | #aio| 1m 5m 15m | recv send| used buff cach free
10.9 81.4 |init [2] 1 5526B 20k 0.0%|init [2] 5526B 20k|nfsd 10B 407k| 0 |2.92 1.01 0.54| 0 0 |29.3M 78.9M 212M 4184k
1.00 1196 |sshd: root@pts/0 1943 1227B 1264B 0%|sshd: root@ 1227B 1264B|nfsd 0 15M| 0 |2.92 1.01 0.54| 44M 319k|29.1M 78.9M 212M 4444k
0 1365 |sshd: root@pts/0 1943 485B 528B 0%|sshd: root@ 485B 528B|nfsd 0 16M| 0 |2.92 1.01 0.54| 51M 318k|29.5M 78.9M 212M 4708k
Do you know any way of limiting the load without big changes in the configuration?
I am considering limiting the network speed with wondershaper or iptables, though that is not nice, since other traffic would be harmed as well.
Someone suggested cgroups - that may be worth trying - but it is still not my 'feng shui'. I would hope to find a solution in the NFS config: since that is where the problem is, it would be nice to have an in-one-place solution.
If it were possible to increase the 'sync' speed to 10-20MB/s, that would be enough for me.
I think I nailed it:
On the server, change the disk scheduler:
for i in /sys/block/sd*/queue/scheduler ; do echo deadline > $i ; done
Additionally (a small improvement - find the best value for you):
/etc/default/nfs-kernel-server
# Number of servers to start up
-RPCNFSDCOUNT=8
+RPCNFSDCOUNT=2
restart services
/etc/init.d/rpcbind restart
/etc/init.d/nfs-kernel-server restart
ps:
My current configs
server:
/etc/exports
/mnt/exports 192.168.6.0/24(rw,no_subtree_check,no_root_squash,fsid=0)
/mnt/exports/nfs 192.168.6.0/24(rw,no_subtree_check,no_root_squash)
client:
/etc/fstab
192.168.6.131:/nfs /mnt/nfstest nfs rsize=32768,wsize=32768,tcp,port=2049 0 0

What is the performance difference between gawk and ....? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
This question has been discussed here on Meta, and my answer gives links to a test system for answering it.
The question often comes up about whether to use gawk or mawk or C or some other language due to performance so let's create a canonical question/answer for a trivial and typical awk program.
The result of this will be an answer that provides a comparison of the performance of different tools performing the basic text processing tasks of regexp matching and field splitting on a simple input file. If tool X is twice as fast as every other tool for this task then that is useful information. If all the tools take about the same amount of time then that is useful information too.
The way this will work is that over the next couple of days many people will contribute "answers" which are the programs to be tested and then one person (volunteers?) will test all of them on one platform (or a few people will test some subset on their platform so we can compare) and then all of the results will be collected into a single answer.
Given a 10 Million line input file created by this script:
$ awk 'BEGIN{for (i=1;i<=10000000;i++) print (i%5?"miss":"hit"),i," third\t \tfourth"}' > file
$ wc -l file
10000000 file
$ head -10 file
miss 1 third fourth
miss 2 third fourth
miss 3 third fourth
miss 4 third fourth
hit 5 third fourth
miss 6 third fourth
miss 7 third fourth
miss 8 third fourth
miss 9 third fourth
hit 10 third fourth
and given this awk script which prints the 4th then 1st then 3rd field of every line that starts with "hit" followed by an even number:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
Here are the first 5 lines of expected output:
$ awk -f tst.awk file | head -5
fourth hit third
fourth hit third
fourth hit third
fourth hit third
fourth hit third
and here is the result when piped to a 2nd awk script to verify that the main script above is actually functioning exactly as intended:
$ awk -f tst.awk file |
awk '!seen[$0]++{unq++;r=$0} END{print ((unq==1) && (seen[r]==1000000) && (r=="fourth hit third")) ? "PASS" : "FAIL"}'
PASS
Here are the timing results of the 3rd execution of gawk 4.1.1 running in bash 4.3.33 on cygwin64:
$ time awk -f tst.awk file > /dev/null
real 0m4.711s
user 0m4.555s
sys 0m0.108s
Note the above is the 3rd execution to remove caching differences.
Can anyone provide the equivalent C, perl, python, whatever code to this:
$ cat tst.awk
/hit [[:digit:]]*0 / { print $4, $1, $3 }
i.e. find THAT REGEXP on a line (we're not looking for some other solution that works around the need for a regexp), split the line at each series of contiguous white space and print the 4th, then 1st, then 3rd fields separated by a single blank char?
If so we can test them all on one platform to see/record the performance differences.
The code contributed so far:
AWK (can be tested against gawk, etc. but mawk, nawk and perhaps others will require [0-9] instead of [:digit:])
awk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file
PHP
php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2];}' < file
shell
egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}'
grep --mmap -E "^hit [[:digit:]]*0 " file | awk '{print $4, $1, $3 }'
Ruby
$ cat tst.rb
File.open("file").readlines.each do |line|
line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
end
$ ruby tst.rb
Perl
$ cat tst.pl
#!/usr/bin/perl -nl
# A solution much like the Ruby one but with atomic grouping
print "$4 $1 $3" if /^(hit)(?>\s+)(\d*0)(?>\s+)((?>[^\s]+))(?>\s+)(?>([^\s]+))$/
$ perl tst.pl file
Python
none yet
C
none yet
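For anyone wanting a starting point, here is a minimal C sketch of the same filter built on POSIX regex.h; it hard-codes the input file name, assumes at most 8 whitespace-separated fields per line, and has not been run through the PASS/FAIL check above:
#include <stdio.h>
#include <string.h>
#include <regex.h>

int main(void)
{
    FILE *fp = fopen("file", "r");
    if (!fp) { perror("file"); return 1; }

    regex_t re;
    /* the same regexp as the awk script */
    regcomp(&re, "hit [[:digit:]]*0 ", REG_EXTENDED | REG_NOSUB);

    char line[4096];
    while (fgets(line, sizeof line, fp)) {
        if (regexec(&re, line, 0, NULL, 0) != 0)
            continue;
        /* split on runs of spaces/tabs, like awk's default field splitting */
        char *fields[8] = {0};
        int n = 0;
        for (char *tok = strtok(line, " \t\n"); tok && n < 8; tok = strtok(NULL, " \t\n"))
            fields[n++] = tok;
        if (n >= 4)
            printf("%s %s %s\n", fields[3], fields[0], fields[2]);
    }
    regfree(&re);
    fclose(fp);
    return 0;
}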
Applying egrep before awk gives a great speedup:
paul@home ~ % wc -l file
10000000 file
paul@home ~ % for i in {1..5}; do time egrep 'hit [[:digit:]]*0 ' file | awk '{print $4, $1, $3}' | wc -l ; done
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.63s user 0.02s system 85% cpu 0.759 total
awk '{print $4, $1, $3}' 0.70s user 0.01s system 93% cpu 0.760 total
wc -l 0.00s user 0.02s system 2% cpu 0.760 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.65s user 0.01s system 85% cpu 0.770 total
awk '{print $4, $1, $3}' 0.71s user 0.01s system 93% cpu 0.771 total
wc -l 0.00s user 0.02s system 2% cpu 0.771 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.64s user 0.02s system 82% cpu 0.806 total
awk '{print $4, $1, $3}' 0.73s user 0.01s system 91% cpu 0.807 total
wc -l 0.02s user 0.00s system 2% cpu 0.807 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.63s user 0.02s system 86% cpu 0.745 total
awk '{print $4, $1, $3}' 0.69s user 0.01s system 92% cpu 0.746 total
wc -l 0.00s user 0.02s system 2% cpu 0.746 total
1000000
egrep --color=auto 'hit [[:digit:]]*0 ' file 0.62s user 0.02s system 88% cpu 0.727 total
awk '{print $4, $1, $3}' 0.67s user 0.01s system 93% cpu 0.728 total
wc -l 0.00s user 0.02s system 2% cpu 0.728 total
versus:
paul@home ~ % for i in {1..5}; do time gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.46s user 0.04s system 97% cpu 2.548 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.43s user 0.03s system 98% cpu 2.508 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.40s user 0.04s system 98% cpu 2.489 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.38s user 0.04s system 98% cpu 2.463 total
gawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 2.39s user 0.03s system 98% cpu 2.465 total
'nawk' is even slower!
paul@home ~ % for i in {1..5}; do time nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null; done
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.05s user 0.06s system 92% cpu 6.606 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.11s user 0.05s system 96% cpu 6.401 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 5.78s user 0.04s system 97% cpu 5.975 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 5.71s user 0.04s system 98% cpu 5.857 total
nawk '/hit [[:digit:]]*0 / { print $4, $1, $3 }' file > /dev/null 6.34s user 0.05s system 93% cpu 6.855 total
On OSX Yosemite
time bash -c 'grep --mmap -E "^hit [[:digit:]]*0 " file | awk '\''{print $4, $1, $3 }'\''' >/dev/null
real 0m5.741s
user 0m6.668s
sys 0m0.112s
First idea
File.open("file").readlines.each do |line|
line.gsub(/(hit)\s[0-9]*0\s+(.*?)\s+(.*)/) { puts "#{$3} #{$1} #{$2}" }
end
Second idea
File.read("file").scan(/(hit)\s[[:digit:]]*0\s+(.*?)\s+(.*)/) { |f,s,t| puts "#{t} #{f} #{s}" }
Trying to get something that can compare the answers, I ended up creating a GitHub repo here. Each push to this repo triggers a build on Travis CI, which composes a markdown file that is in turn pushed to the gh-pages branch, updating a web page with a view of the build results.
Anyone wishing to participate can fork the GitHub repo, add tests and open a pull request, which I'll merge ASAP if it does not break the other tests.
mawk is slightly faster than gawk.
$ time bash -c 'mawk '\''/hit [[:digit:]]*0 / { print $4, $1, $3 }'\'' file | wc -l'
0
real 0m1.160s
user 0m0.484s
sys 0m0.052s
$ time bash -c 'gawk '\''/hit [[:digit:]]*0 / { print $4, $1, $3 }'\'' file | wc -l'
100000
real 0m1.648s
user 0m0.996s
sys 0m0.060s
(Only 1,000,000 lines in my input file. Best results of many displayed, though they were quite consistent.)
Here comes an equivalent in PHP:
$ time php -R 'if(preg_match("/hit \d*0 /", $argn)){$f=preg_split("/\s+/", $argn); echo $f[3]." ".$f[0]." ".$f[2];}' < file > /dev/null
real 2m42.407s
user 2m41.934s
sys 0m0.355s
compared to your awk:
$ time awk -f tst.awk file > /dev/null
real 0m3.271s
user 0m3.165s
sys 0m0.104s
I tried a different approach in PHP where I iterate through the file manually; this makes things a lot faster, but I'm still not impressed:
tst.php
<?php
$fd = fopen('file', 'r');
while ($line = fgets($fd)) {
    if (preg_match("/hit \d*0 /", $line)) {
        $f = preg_split("/\s+/", $line);
        echo $f[3]." ".$f[0]." ".$f[2]."\n";
    }
}
fclose($fd);
Results:
$ time php tst.php > /dev/null
real 0m27.354s
user 0m27.042s
sys 0m0.296s

sched_setaffinity cpu affinity in linux

I have done a sched_setaffinity test on Linux, on a server with 1 socket and 4 cores.
The following /proc/cpuinfo shows the CPU information:
processor : 0
model name : Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 1
model name : Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 2
model name : Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
processor : 3
model name : Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz
cache size : 2048 KB
physical id : 0
siblings : 4
cpu cores : 4
I have a simple test application:
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>

struct foo {
    int x;
    int y;
};

// global var
volatile struct foo fvar;

pid_t gettid( void )
{
    return syscall( __NR_gettid );
}

void *test_func0(void *arg)
{
    int proc_num = (int)(long)arg;
    cpu_set_t set;

    CPU_ZERO( &set );
    CPU_SET( proc_num, &set );
    printf("proc_num=(%d)\n", proc_num);

    if (sched_setaffinity( gettid(), sizeof( cpu_set_t ), &set ))
    {
        perror( "sched_setaffinity" );
        return NULL;
    }

    int i = 0;
    for (i = 0; i < 1000000000; ++i) {
        __sync_fetch_and_add(&fvar.x, 1);
    }
    return NULL;
} // test_func0
compiled :
gcc testsync.c -D_GNU_SOURCE -lpthread -o testsync.exe
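The driver that starts the two threads isn't shown; presumably it looks something like this sketch (appended to testsync.c; taking the two core numbers from the command line is an assumption):
#include <pthread.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    long core_a = argc > 1 ? atol(argv[1]) : 0;
    long core_b = argc > 2 ? atol(argv[2]) : 1;
    pthread_t t1, t2;

    /* each thread pins itself to its core inside test_func0 */
    pthread_create(&t1, NULL, test_func0, (void *)core_a);
    pthread_create(&t2, NULL, test_func0, (void *)core_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("fvar.x = %d\n", fvar.x);
    return 0;
}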
The following are the test results:
2 threads running test_func0 on cores 0,1 take 35 secs;
2 threads running test_func0 on cores 0,2 take 55 secs;
2 threads running test_func0 on cores 0,3 take 55 secs;
2 threads running test_func0 on cores 1,2 take 55 secs;
2 threads running test_func0 on cores 1,3 take 55 secs;
2 threads running test_func0 on cores 2,3 take 35 secs;
I wonder why 2 threads running on cores (0,1) or on cores (2,3) are so much faster than the other combinations? If I run both threads on the same core, like (1,1), (2,2) or (3,3), it takes 28 secs; I am also confused why that happens.
Cores 0 and 1 share an L2 cache, and so do cores 2 and 3. Running on two cores that share the cache makes the shared variable stay in the L2 cache, which makes things faster.
This is not true in today's Intel processors, where L2 is per core. But on the CPU you're using, this is how it works (it's actually a quad-core CPU made by gluing together two dual-core CPUs).
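If you want to confirm the topology on the machine itself, the sharing is visible in sysfs; here is a small sketch that prints which CPUs share each core's index2 cache (index2 is usually the L2 on these parts - worth double-checking against the neighbouring 'level' file):
#include <stdio.h>

int main(void)
{
    char path[128], buf[128];
    for (int cpu = 0; cpu < 4; ++cpu) {
        /* shared_cpu_list names the CPUs that share this cache instance */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); continue; }
        if (fgets(buf, sizeof buf, f))
            printf("cpu%d shares its L2 with CPUs %s", cpu, buf);
        fclose(f);
    }
    return 0;
}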

How do I multiply each member of array by a scalar in perl?

Here is the code...
use strict;
use warnings;
my @array = (1,2,3,4,5);
my $scalar = 5;
@array = $scalar*@array;
print @array;
I need something that can perform a similar function with little code. Thanks!
Use foreach.
foreach my $x (@array) { $x = $x * $scalar; }
You can try this:
@array = map { $_ * $scalar } @array;
or more simply:
map { $_ *= $scalar } @array;
How about this:
foreach (@array)
{ $_ *= $scalar }
As you see, you can modify the array in-place as it's traversed.
I don't know the scope of your need. IFF you are doing numerical data manipulation, the Perl Data Language (PDL) takes an array of numerical data, creates a "piddle" object from it and overloads mathematical operations to "vectorize" their operation. This is a very efficient system for doing numerical processing. Anyway here is an example:
#!/usr/bin/perl
use strict;
use warnings;
use PDL;
my $pdl_array = pdl([1,1,2,3,5,8]);
print 2*$pdl_array;
__END__
gives:
[2 2 4 6 10 16]
This comment is for SoloBold.
Here is a test of the map approach:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my @array = ();
push(@array, (1) x 1000000);
my $scalar = 5;
my $startTime = new Benchmark();
@array = map { $_ * $scalar } @array;
my $stopTime = new Benchmark();
print STDOUT "runtime: ".timestr(timediff($stopTime, $startTime), 'all')." sec\n";
Here is a test of the foreach approach:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my @array = ();
push(@array, (1) x 1000000);
my $scalar = 5;
my $startTime = new Benchmark();
foreach my $x (@array) { $x = $x * $scalar; }
my $stopTime = new Benchmark();
print STDOUT "runtime: ".timestr(timediff($stopTime, $startTime), 'all')." sec\n";
Here is the system I'm running on:
bash-3.2$ perl --version
This is perl, v5.8.8 built for darwin-2level
...
bash-3.2$ uname -a
Darwin Sounder.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
Here were results from one test:
bash-3.2$ ./test.map.pl
runtime: 4 wallclock secs ( 0.41 usr 0.70 sys + 0.00 cusr 0.00 csys = 1.11 CPU) sec
bash-3.2$ ./test.foreach.pl
runtime: 0 wallclock secs ( 0.13 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.13 CPU) sec
These times are fairly reproducible on the same machine, and the results are somewhat repeatable on a dual-core Linux box:
[areynolds@fiddlehead ~]$ perl --version
This is perl, v5.8.8 built for x86_64-linux-thread-multi
...
[areynolds@fiddlehead ~]$ uname -a
Linux fiddlehead.example.com 2.6.18-194.17.1.el5 #1 SMP Mon Sep 20 07:12:06 EDT 2010 x86_64 GNU/Linux
[areynolds@fiddlehead ~]$ ./test.map.pl
runtime: 0 wallclock secs ( 0.28 usr 0.05 sys + 0.00 cusr 0.00 csys = 0.33 CPU) sec
[areynolds@fiddlehead ~]$ ./test.foreach.pl
runtime: 0 wallclock secs ( 0.09 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.09 CPU) sec
On the OS X box, map is 8.53x slower than foreach. On the Linux box, it is 3.67x slower.
My Linux box is dual-core and has slightly faster cores than my single-core OS X laptop.
EDIT
I updated Perl from v5.8.8 to v5.12.3 on my OS X box and got a considerable speed boost, but map still performed worse than foreach:
sounder:~ alexreynolds$ perl --version
This is perl 5, version 12, subversion 3 (v5.12.3) built for darwin-multi-2level
...
sounder:~ alexreynolds$ ./test.map.pl
runtime: 0 wallclock secs ( 0.45 usr 0.08 sys + 0.00 cusr 0.00 csys = 0.53 CPU) sec
sounder:~ alexreynolds$ ./test.foreach.pl
runtime: 1 wallclock secs ( 0.18 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.18 CPU) sec
This goes from 8.53x worse to 2.94x worse. A fairly substantial improvement.
The Linux box performed slightly worse with upgrading its Perl installation to v5.12.2:
[areynolds@basquiat bin]$ perl --version
This is perl 5, version 12, subversion 2 (v5.12.2) built for x86_64-linux-thread-multi
...
[areynolds@basquiat bin]$ /home/areynolds/test.map.pl
runtime: 1 wallclock secs ( 0.29 usr 0.07 sys + 0.00 cusr 0.00 csys = 0.36 CPU) sec
[areynolds@basquiat bin]$ /home/areynolds/test.foreach.pl
runtime: 0 wallclock secs ( 0.08 usr 0.00 sys + 0.00 cusr 0.00 csys = 0.08 CPU) sec
This goes from 3.67x worse to 4.5x worse — not so good! It might not always pay to upgrade, just for the heck of it.
It seems unfortunate to me that Larry didn't allow
$scalar operator (list)
or
(list) operator $scalar
Sure, map or loops can do it, but the syntax above is so much cleaner.
Also, (list) operator (list)
makes sense too if the two are of equal length.
Surprised Larry didn't allow these, just saying... I guess in this case there were (n-1) ways to do it.
Like
my @a = 'n' . (1..5);
my @a = 2 * (1..5);
or even
my @a = 2 * @b;
