Count field 1 and print at END{} — awk equivalent in Perl (arrays)

I have the following AWK script that counts occurrences of the elements in field 1 and, when it finishes reading the entire file, prints each element and its repetition count.
awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' file
I'm very new to Perl and I don't know what the equivalent would be. What I have so far is below, but its syntax is incorrect. Thanks in advance.
perl -lane '$a{$F[1]}++ END{foreach $a {print $a} }' file
UPDATE
Hi, thanks to you both for your answers. The real input file has 34 million lines, and awk runs 3 or more times faster than Perl on it. Is awk faster than Perl?
awk '{a[$1]++}END{for(i in a){print i"-->"a[i]}}' file # --> 2:45 approx.
perl -lane '$a{$F[0]}++;END{foreach my $k (keys %a){ print "$k --> $a{$k}" } }' file # --> 7 min approx.
perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file # --> 9 min approx.

Okay, Ger, one more time :-)
I upgraded my Perl to the latest version available to me and made a file like the one you described (34.5 million lines, each with a 16-digit integer in the first and only column):
schumack#linux2 52> wc -l listbig
34521909 listbig
schumack#linux2 53> head -3 listbig
1111111111111111
3333333333333333
4444444444444444
I then ran a specialized Perl line (works for this file but is not the same as the awk line). As before I timed the runs using /usr/bin/time:
schumack#linux2 54> /usr/bin/time -f '%E %P' /usr/local/bin/perl -lne 'chomp; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' listbig
5555555555555555-->4547796
1111111111111111-->9715747
9999999999999999-->826872
3333333333333333-->9922465
1212121212121212-->826872
4444444444444444-->5374669
2222222222222222-->1653744
8888888888888888-->826872
7777777777777777-->826872
0:12.20 99%
schumack#linux2 55> /usr/bin/time -f '%E %P' awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' listbig
1111111111111111-->9715747
2222222222222222-->1653744
3333333333333333-->9922465
4444444444444444-->5374669
5555555555555555-->4547796
1212121212121212-->826872
7777777777777777-->826872
8888888888888888-->826872
9999999999999999-->826872
0:12.61 99%
Both perl and awk ran very fast on the 34.5 million line file and were within a half second of each other.
Curious as to what type of machine / OS / Perl version you are currently using. I tested on an ASUS laptop that is about 4 years old with an Intel i7, running Ubuntu 16.04 and Perl v5.26.1.
Anyways, thanks for the reason to play around with Perl!
Have fun,
Ken

Input file would obviously make a difference, but Perl 5.22.1 was slightly faster than Awk 4.1.3 below on my 33.5 million line test file (12.23 vs 12.52 seconds).
schumack#daddyo2 10-02T18:25:17 54> wc -l listbig
33521910 listbig
schumack#daddyo2 10-02T18:25:58 55> /usr/bin/time -f '%E %P' awk '{a[$1]++} END{ for(i in a){print i"-->"a[i]} }' listbig
1-->9434310
2-->1605840
3-->9635040
4-->5218980
5-->4416060
7-->802920
8-->802920
9-->802920
12-->802920
0:12.52 99%
schumack#daddyo2 10-02T18:26:17 56> /usr/bin/time -f '%E %P' perl -lne '$_=~s/^(\S+) .*/$1/; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' listbig
1-->9434310
5-->4416060
2-->1605840
3-->9635040
12-->802920
8-->802920
9-->802920
4-->5218980
7-->802920
0:12.23 99%

Equivalent to your awk line
perl -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }' file
With -a the line is broken into fields in @F, so you want $F[0] as a key in the hash %a, with the counter for its value handled by ++. In the END block the hash is iterated over its keys and printed.
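If you want to see exactly what the -l, -a and -n switches turn this one-liner into, the core B::Deparse module can print the implicit read/split loop (the output varies by Perl version, so it isn't reproduced here):
perl -MO=Deparse -lanE'$a{$F[0]}++; END { say "$_ => $a{$_}" for keys %a }'  # compile and deparse only; nothing is executed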
However, the question of efficiency comes up. One way to improve this is not to fetch all fields on the line (which is what -a does), since only the first one is needed. Between two ways that come to mind
perl -nE'$a{(/(\S+)/)[0]}++; END { ... }'
and
perl -nE'$a{(split " ", $_, 2)[0]}++; END { ... }'
the split is significantly faster, at 3.63s vs 4.41s for the regex, on an 8M-line file.
This is still behind 1.99s for your awk line. So it seems that awk is faster for this task.
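For reference, here is the split variant written out in full, with the same END block as the -a version substituted for the '...':
perl -nE'$a{(split " ", $_, 2)[0]}++; END { say "$_ => $a{$_}" for keys %a }' file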
Summary of my timings for an 8-million line file (average of a few runs)
awk (question) 1.99s
perl (split) 3.63s
perl (regex) 4.41s
perl (like awk) 5.61s
These timings vary from run to run by a few tens of milliseconds (a few 0.01s).

This destructive method is the fastest I came up with:
perl -lne '$_=~s/\s.*//; $a{$_}++; END{foreach $i (keys %a){print "$i-->$a{$i}"}}' file
However, it still is not quite as fast as awk.

You can go through a2p (the awk-to-Perl translator).
$ cat file
1
1
2
3
3
3
$ perl -lane '$a{$F[0]}++;END{foreach my $k (keys %a){ print "$k --> $a{$k}" } }' file
1 --> 2
2 --> 1
3 --> 3
$ awk '{a[$1]++} END{ for(i in a){print i" --> "a[i]} }' file
1 --> 2
2 --> 1
3 --> 3
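If you want to actually run a2p (newer Perls no longer bundle it, but it should be available from CPAN as App::a2p), the usual pattern is to translate the awk program into a Perl script and run that. A sketch, with count.awk and count.pl as placeholder names:
$ a2p count.awk > count.pl   # count.awk holds the awk program from the question
$ perl count.pl file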

Related

Split a string directly into an array

Suppose I want to pass a string to awk so that once I split it (on a pattern) the substrings become the indexes (not the values) of an associative array.
Like so:
$ awk -v s="A:B:F:G" 'BEGIN{ # easy, but can these steps be combined?
split(s,temp,":") # temp[1]="A",temp[2]="B"...
for (e in temp) arr[temp[e]] #arr["A"], arr["B"]...
for (e in arr) print e
}'
A
B
F
G
Is there an awkism or gawkism that would allow the string s to be split directly into its components, with those components becoming the index entries in arr?
The reason (bigger picture) is that I want something like this (pseudo awk):
awk -v s="1,4,55" 'BEGIN{ turn s into arr["1"], arr["4"], arr["55"] } $3 in arr {action}'
No, there is no better way to map separated substrings to array indices than:
split(str,tmp); for (i in tmp) arr[tmp[i]]
FWIW if you don't like that approach for doing what your final pseudo-code does:
awk -v s="1,4,55" 'BEGIN{split(s,tmp,/,/); for (i in tmp) arr[tmp[i]]} $3 in arr{action}'
then another way to get the same behavior is
awk -v s=",1,4,55," 'index(s,","$3","){action}'
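As a quick illustration of the index() trick, adapted to match on $1 because the sample below only has two fields (it is the same five-line file used in the answer further down):
printf '1 hi\n2 hello\n3 bonjour\n4 hola\n5 konichiwa\n' > file
awk -v s=",1,2,4," 'index(s, ","$1","){print}' file
# should print the lines starting with 1, 2 and 4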
Probably useless and unnecessarily complex but I'll open the game with while, match and substr:
$ awk -v s="A:B:F:G" '
BEGIN {
while(match(s,/[^:]+/)) {
a[substr(s,RSTART,RLENGTH)]
s=substr(s,RSTART+RLENGTH)
}
for(i in a)
print i
}'
A
B
F
G
I'm eager to see some useful solutions (if there are any). I tried playing around with asort and such.
Another way, a kind of awkism:
cat file
1 hi
2 hello
3 bonjour
4 hola
5 konichiwa
Run it,
awk 'NR==FNR{d[$1]; next}$1 in d' RS="," <(echo "1,2,4") RS="\n" file
you get,
1 hi
2 hello
4 hola

How to multiply small numbers in Bash

I want to multiply all entries in an array by numbers like 3.17 * 10^-7, but Bash can't do that. I tried with awk and bc, but it didn't work. I would be obliged if someone could help me.
Input data example (about 4000 data files overall):
TecN210500-0100.plt
TecN210500-0200.plt
TecN210500-0300.plt
TecN210500-0400.plt
......
Here is my code:
#!/bin/bash
ZS=($(find . -name "*.plt"))
i=1
Variable=$(awk "BEGIN{print 10 ** -7}")
Solutiontime=$(awk "BEGIN{print 3.17 * $Variable}")
for Dataname in ${ZS[@]}
do
Cut=${Dataname:13}
Timesteps=${Cut:0:${#Cut}-4}
Array[i]=$Timesteps
i=$((i++))
p=$((i++))
done
Amount=$p
for ((i=1;i<10;i++))
do
Array[i]=${i}00
done
for (($i=1;i<$Amount+1;i++))
do
Array[i]=$(awk "BEGIN{print ${Array[i]} * $Solutiontime}")
done
Array[0]=Solutiontime
First loop:
Extract, e.g., the "0100".
Second loop:
"Delete" the leading zero -> e.g. "100"
Last loop:
Multiply by the time step -> e.g. "100 * 3.17*10^-7"
Do a little parameter expansion trimming on the filename, and then let awk do the math for you.
#!/bin/bash
for f in *.plt; do
num=${f##*-} # remove the stuff before the final -
num=${num%.*} # remove the stuff before the last .
num=${num#0} # remove the left-hand zero
awk "BEGIN {print $num * 3.17 * 10**-7}"
done
Or, done entirely with awk:
#!/bin/bash
for f in *.plt; do
awk -v f="$f" 'BEGIN {gsub(/^TecN[[:digit:]]+-0?|.plt$/, "", f); print f * 3.17 * 10**-7}'
done
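To see what the three parameter expansions in the first version do to one of the sample filenames from the question, step by step:
f=TecN210500-0100.plt
num=${f##*-};  echo "$num"   # 0100.plt
num=${num%.*}; echo "$num"   # 0100
num=${num#0};  echo "$num"   # 100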
awk to the rescue!
awk 'BEGIN{print 3.17 * 10^-7 }'
3.17e-07
iteration 1
awk -F'[-.]' '{printf "%s %e\n",substr($1,5),$2*3.17*10^-7}' file
210500 3.170000e-05
210500 6.340000e-05
210500 9.510000e-05
210500 1.268000e-04
for the posted file names used as input.
iteration 2
If you need just the computed numbers, simply drop the first field:
awk -F'[-.]' '{printf "%e\n",$2*3.17*10^-7}' file
3.170000e-05
6.340000e-05
9.510000e-05
1.268000e-04
This will be the output of the script. I strongly suggest moving whatever logic you have into the awk script rather than working at the shell level with the array.
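Following that suggestion, one way to push everything into a single awk call might look like this (a sketch: it assumes GNU find's -printf for printing bare filenames and the TecN<digits>-<step>.plt naming from the question):
find . -name '*.plt' -printf '%f\n' |
awk -F'[-.]' '{printf "%s %e\n", substr($1,5), $2*3.17*10^-7}'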

Bash sum of array

I'm writing a script in Bash and I have a problem summing the elements of an array. I add the results of du for two paths to an array. In the end I want to get the sum of the array's elements.
use=()
i=0
for d in '$PATH1' '$PATH2'
do
usagebck=$(du $d | awk '{print awk $1}')
use[i]=$usagebck
sum=0
for j in $use
do
sum=$($sum + ${use[$i]})
done
i=$((i+1))
done
echo ${use[*]}
If your du has option -s:
use=()
sum=0
for d in "$PATH1" "$PATH2"
do
usagebck="$(du -s "$d" | awk 'END{print $1}')"
use+=($usagebck)
((sum+=$usagebck))
done
echo ${use[*]}
echo $sum
First, take a look at the parameters in du. On BSD based systems, there's -c which will give you a grand total. On GNU and BSD, there's the -a parameter which will report on all files for a directory.
Since you're already using awk, why not do everything in awk?
$ du -ms $PATH1 $PATH2 |
awk 'BEGIN {sum = 0}
END {print "Total: " sum }
{
sum+=$1
print $0
}'
du -ms specifies that I want the total sums of each file specified
BEGIN is executed before the main awk program. Here I'm initializing sum. This isn't necessary because variables are assumed to equal zero when created.
END is executed after the main awk program. Here, I'm specifying that I want sum printed.
Between the { ... } is the main Awk program. Two lines. The first line adds Column 1 (the size of the file) to sum. The second line prints out the entire line.
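If all you actually need is the grand total, the -c flag mentioned above already produces one; its last output line is the total, so something like this would do (a sketch):
du -msc "$PATH1" "$PATH2" | tail -n 1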

awk using an index key over a range

I have an awk script that I normally run in parallel using an outside variable $a.
awk -v a=$a '$4>a-5 && $4<a+5 {print $10,$4}' INFILE
It would of course run much faster using an array, so I tried something like this to get it to do the same thing ($2 in LISTFILE being the search value for $4 in INFILE):
awk 'FNR==NR{a[$2]=($2-5); next} $4 in a{if ($4>a[$4] && $4<a[$4]+10) print}' LISTFILE INFILE
This of course did not work, because awk scanned until it reached the key and only then started testing the if statement, so only the downstream range was found. Unfortunately this isn't a continuous list, so often there is no $2-5 value; otherwise I would use that as the key for the array.
Obviously I know how to do this using a combo of awk and bash, but I was wondering if there was an awk-only solution for this.
My first answer addresses the actual question asked and fixes the awk script. But perhaps I have missed the point. If you want speed, and don't mind making more use of your multi-core processor, you can use GNU parallel. Here's an implementation that will launch 4 jobs at a time:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
parallel -j 4 "awk -v var={} '$awk_cmd' INFILE" :::: LISTFILE
As you can see, this will read INFILE up to four times concurrently. This answer, after adjustment of the number of jobs, should provide very similar performance to the parallel implementation that you describe running with your shell. Therefore, you may like to split up your LISTFILE into smaller chunks and set awk_cmd to the command posted in my previous answer. There may be an optimal way to process your input, but that would largely depend on the size of INFILE and the number of elements in LISTFILE. HTH.
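For what it's worth, a sketch of that chunking idea (split -n l/4 is GNU split syntax for four line-based chunks, chunk_ is just a placeholder prefix, and awk_cmd is the array-based command from my other answer):
awk_cmd='FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }'
split -n l/4 LISTFILE chunk_
parallel -j 4 "awk '$awk_cmd' {} INFILE" ::: chunk_*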
TESTING:
Create LISTFILE:
paste - - < <(seq 16) > LISTFILE
Create INFILE:
awk 'BEGIN { for (i=1; i<=9999999; i++) { print i, i, i, int(i * rand()), i, i, i, i, i, i } }' > INFILE
RESULTS:
TEST1:
time awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE >/dev/null
real 0m45.198s
user 0m45.090s
sys 0m0.160s
TEST2:
time for i in $(seq 1 2 16); do awk -v var="$i" '$4 > var - 5 && $4 < var + 5 { print $10, $4 }' INFILE; done >/dev/null
real 0m55.335s
user 0m54.433s
sys 0m0.953s
TEST3:
awk_cmd='$4 > var - 5 && $4 < var + 5 { print $10, $4 }'
time parallel --colsep "\t" -j 4 "awk -v var={2} '$awk_cmd' INFILE" :::: LISTFILE >/dev/null
real 0m28.190s
user 1m42.750s
sys 0m1.757s
My reply to THIS answer:
1:
The awk1 script does not run much faster than the awk script.
A 15% time saving is pretty significant in my opinion.
I suspect because it scans the LISTFILE for every line in the INFILE.
Yes, essentially. The awk1 script loops through INFILE just once.
So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE).
Close. But don't forget that by using an array, we actually remove any duplicate values in LISTFILE.
This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
This statement is therefore only true when LISTFILE contains no duplicates. Even if LISTFILE never contains any dups, having to read a single file multiple times is best avoided.
2:
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here); not sure what the difference is because they are next door in the parent directory.
What four minute result? When benchmarking this sort of thing, you should stop writing the output to disk. If your machine has some background process going on when you're running your tests, you will only end up biasing your results with the write speed of your disk. Use /dev/null instead.
3:
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
If you remove the pipe to sort and uniq you will get a better idea of where the time difference is. You will find that doing $4 > i - 5 && $4 < i + 5 is grossly different to doing $4 < i + 5 && $4 > i - 5. If awkout.txt is the same as awk2out.txt, you are spending time processing duplicates.
4:
The second command you posted here avoids this test: $4 > i - 5 && $4 < i + 5. I wouldn't think that that alone would warrant a 90% improvement in runtime. Something smells wrong. Would you mind re-running your tests writing to /dev/null and posting the contents of LISTFILE and INFILE? If those two files are confidential, could you provide some example files with the amount of content equal to the originals?
Other thoughts:
To me, it looks like something along these lines would also work:
awk 'FNR==NR { for (i=$2-4;i<$2+5;i++) a[i]; next } $4 in a { b[$10,$4] } END { print length(b) }' LISTFILE INFILE
It looks like you just need to add the keys of LISTFILE to an array, then, as you process INFILE (line by line), test each key in your array with your 'if' statement. You can do this using the following construct or similar:
for (i in a) { print i, a[i] }
Here's some untested code that may help get you started. Notice how I have not assigned any values to my keys:
awk 'FNR==NR { a[$2]; next } { for (i in a) { if ($4 > i - 5 && $4 < i + 5) { print $10, $4 } } }' LISTFILE INFILE
Steve's answer above is the correct answer to the question. Below is a comparison of array and non-array ways to handle the problem.
I created a test program to look at two different scenarios and the results from each. The test program's code is here:
echo time for bash
time for line in `awk '{print $2}' $1` ; do awk -v a=$line '$4>a-5&&$4<a+5{print $4,$10}' $2 ; done | sort | uniq -c > bashout.txt
echo time for awk
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4>i-5&&$4<i+5) print $10,$4}}' $1 $2 |sort | uniq -c > awkout.txt
echo time for awk2
time awk 'FNR==NR{a[$2]; next}{for (i in a) {if ($4<i+5&&$4>i-5) print $10,$4}}' $1 $2 |sort | uniq -c > awk2out.txt
echo time for awk3
time awk '{a=$2;b=$1;for (i=a-4;i<a+5;i++) print b,i}' $1 > LIST2;time awk 'FNR==NR{a[$2];next}$4 in a{print $10,$4}' LIST2 $2 | sort | uniq -c > awk3out.txt
Here is the output:
time for bash
real 2m22.394s
user 2m15.938s
sys 0m6.409s
time for awk
real 2m1.719s
user 2m0.919s
sys 0m0.782s
time for awk2
real 1m49.146s
user 1m47.607s
sys 0m1.524s
time for awk3
real 0m0.006s
user 0m0.000s
sys 0m0.001s
real 0m12.788s
user 0m12.096s
sys 0m0.695s
4 observations/questions
The awk1 script does not run much faster than the awk script. I suspect because it scans the LISTFILE for every line in the INFILE. So number of lines scanned using the array with for (i in a) = NR(INFILE)*NR(LISTFILE). This is the same number of lines you would scan by going through the INFILE repeatedly with the bash script.
Running awk and awk2 in a different folder produced different results (where my 4 min result came from versus the ~2 min result here); not sure what the difference is because they are next door in the parent directory.
Awk and Awk2 are essentially the same. Any idea why awk2 runs faster?
Making an expanded LIST2 from the LISTFILE and using that as the array makes the program run significantly faster, at the cost of increasing the memory footprint. Considering how small the list I'm looking at is (only 200-300 long), that seems to be the way to go, even over doing this in parallel.

getting line numbers to be deleted from an array

I am trying to remove certain lines from a huge file, getting the line numbers to be deleted from an array. The file is at least 2GB in size and my array can be large as well. Can I do this without a for loop? What is the fastest way?
Example:
input:
>1
>2
>3
>4
>5
declare -a A=(2 3 5);
output:
>1
>4
... getting line numbers to be deleted from an array.
If I understand it correctly, your array A contains the line numbers to be deleted from the input.
You could use sed:
sed $(printf "%dd;" "${A[@]}") inputfile
Use the -i option to modify the file in-place.
If the array is too large, consider using process substitution instead:
sed -f <(printf "%dd;" "${A[@]}") inputfile
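Checking it against the sample from the question:
declare -a A=(2 3 5)
printf '>%s\n' 1 2 3 4 5 > inputfile
sed -f <(printf "%dd;" "${A[@]}") inputfile
# >1
# >4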
I wouldn't do this in plain shell code. sed is the tool for editing/transforming files.
Create a sed program on the fly from your array and edit the INPUTFILE in place (-i):
for line in ${A[@]}; do
echo ${line}d
done| sed -i -f /dev/stdin $INPUTFILE
You can use grep -vf to get this array differential:
declare -a O=(1 2 3 4 5)
declare -a A=(2 3 5)
B=( $(grep -vf <(printf "%s\n" "${A[@]}") <(printf "%s\n" "${O[@]}")) )
OUTPUT:
declare -p B
declare -a B='([0]="1" [1]="4")'
printf "%s\n" "${B[@]}"
1
4
awk -v n=2,3,5 'BEGIN{split(n,t,","); for (i in t) nn[t[i]]} !(NR in nn) {print}' input >output
In the above, the list of lines to be deleted is provided as the variable n. (I have it shown in a comma-separated format, but other formats are possible.) In the BEGIN block, this list is split and its values become the keys of an awk array called nn. The remainder of the awk program simply prints all lines whose line number, NR, is not among the keys of nn, the array of lines to be excluded.
If awk implements its membership testing in a properly hashed fashion, the way python does it, then the above should be fast. If not, not.
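A quick check against the sample from the question:
printf '>%s\n' 1 2 3 4 5 > input
awk -v n=2,3,5 'BEGIN{split(n,t,","); for (i in t) nn[t[i]]} !(NR in nn) {print}' input
# >1
# >4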
