How to initialize a 2D array in awk - arrays

I am using a 2D array to save the number of recurrences of certain patterns. For instance:
$4 == "Water" {s[$5]["w"]++}
$4 == "Fire" {s[$5]["f"]++}
$4 == "Air" {s[$5]["a"]++}
where $5 can be attack1, attack2 or attack3. In the END block I print these values. However, some combinations never occur, so where a count such as s["attack1"]["a"] should be 0, my code prints an empty string instead. Is there a way to initialize the whole array in one line in the BEGIN block, instead of initializing each element I need?
awk -f script.awk data
This is the command I am using to run my script. I am not allowed to use any other flags.
EDIT 1:
Here's the current output
Water Air Fire
attack1 554 12
attack2 14 24
attack3 6 3
Here's the output I desire:
Water Air Fire
attack1 554 0 12
attack2 14 24 0
attack3 6 0 3

You don't need to initialise the array in this case. An uninitialised awk value acts as both the empty string and the number 0, so you only have to change the way you print it: force a numeric context.
Observe:
awk 'BEGIN {print "Blank:", a[1];
print "Zero: ", a[1] + 0;
printf("Blank: %s\n", a[1]);
printf("Zero: %i\n", a[1])}'
Output:
Blank:
Zero: 0
Blank:
Zero: 0
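Applied to the counting script from the question, the same idea might look like the sketch below. It assumes made-up sample input in which field 4 is the element and field 5 the attack name, and it uses the portable s[key, "w"] subscript form instead of gawk's s[key]["w"] so that it runs in any POSIX awk. The %d conversion is what turns a missing count into 0.

```shell
# Hypothetical sample rows: field 4 = element, field 5 = attack name.
out=$(printf '%s\n' \
    'x x x Water attack1' \
    'x x x Fire attack1' \
    'x x x Water attack1' \
    'x x x Air attack2' |
awk '
$4 == "Water" { s[$5, "w"]++ }
$4 == "Fire"  { s[$5, "f"]++ }
$4 == "Air"   { s[$5, "a"]++ }
END {
    print "Water Air Fire"
    # attack1..attack3 are the only names the question mentions.
    for (n = 1; n <= 3; n++) {
        k = "attack" n
        # %d forces a numeric context, so unset counts print as 0.
        printf "%s %d %d %d\n", k, s[k, "w"], s[k, "a"], s[k, "f"]
    }
}')
printf '%s\n' "$out"
```

With this sample input, attack3 never appears, so its row prints three zeros rather than blanks.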

Related

Computing sum of specific field from array entries

I have an array trf. Would like to compute the sum of the second element in each array entry.
Example of array contents
trf=( "2 13 144" "3 21 256" "5 34 389" )
Here is the current implementation, but I do not find it robust enough. For instance, it fails with arbitrary number of elements (but considered constant from one array element to another) in each array entry.
cnt=0
m=${#trf[@]}
while (( cnt < m )); do
while read -r one two three
do
sum+="$two"+
done <<< $(echo ${array[$count]})
let count=$count+1
done
sum+=0
result=`echo "$sum" | /usr/bin/bc -l`
You're making it way too complicated. Something like
#!/usr/bin/env bash
trf=( "2 13 144" "3 21 256" "5 34 389" )
declare -i sum=0 # Integer attribute; arithmetic evaluation happens when assigned
for (( n = 0; n < ${#trf[@]}; n++)); do
read -r _ val _ <<<"${trf[n]}"
sum+=$val
done
printf "%d\n" "$sum"
in pure bash, or just use awk (This is handy if you have floating point numbers in your real data):
printf "%s\n" "${trf[@]}" | awk '{ sum += $2 } END { print sum }'
You can use printf to print the entire array, one entry per line. On such an input, one loop (while read) would be sufficient. You can even skip the loop entirely using cut and tr to build the bc command. The echo 0 is there so that bc can handle empty arrays and the trailing + inserted by tr.
{ printf %s\\n "${trf[@]}" | cut -d' ' -f2 | tr \\n +; echo 0; } | bc -l
For your example this prints 68 (= 13+21+34+0).
Try this printf + awk combo:
$ printf '%s\n' "${trf[@]}" | awk '{print $2}{a+=$2}END{print "sum:", a}'
13
21
34
sum: 68
Oh, it's already suggested by Shawn. Then with loop:
$ for item in "${trf[@]}"; do
echo $item
done | awk '{print $2}{a+=$2}END{print "sum:", a}'
13
21
34
sum: 68
For relatively small arrays a for/while double loop should be ok re: performance; placing the final sum in the $result variable (as in OP's code):
result=0
for element in "${trf[@]}"
do
while read -r a b c
do
((result+=b))
done <<< "${element}"
done
echo "${result}"
This generates:
68
For larger data sets I'd probably opt for one of the awk-only solutions (for performance reasons).
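To cope with entries that have a varying number of fields (the robustness concern in the question), one option is to split each entry into an indexed array with read -a and pick the second element. This is only a sketch for bash; the sample data with uneven field counts is made up for illustration.

```shell
# Made-up entries with 3, 4 and 2 fields; only the second field is summed.
trf=( "2 13 144" "3 21 256 9" "5 34" )
sum=0
for entry in "${trf[@]}"; do
    # Split the entry on whitespace into the array "fields".
    read -ra fields <<< "$entry"
    # fields[1] is the second field (bash arrays are zero-based).
    (( sum += fields[1] ))
done
echo "$sum"
```

Because the fields land in an array, the loop does not need to know how many of them each entry holds.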

How to grep ranges of numeric sequences from a column that contain several sequences

I'm new to writing bash scripts and have the following question: how can I extract ranges (first and last value) from a column that contains several incrementing and decrementing numeric sequences? Within a sequence the values increase or decrease by 3, and a new sequence starts once the step is larger than 3, e.g.:
1
4
7
20
23
26
100
97
94
It is required to receive as an output:
1,7
20,26
100,94
Using awk:
$ awk 'NR==1||sqrt(($0-p)*($0-p))>3{print p; printf "%s", $0 ", "} {p=$0} END{print $0}' file
1, 7
20, 26
100, 94
Explained:
NR==1 || sqrt(($0-p)*($0-p))>3 { # if the abs($0-previous) > 3
print p # print previous to end a sequence and
printf "%s", $0 ", " # start a new sequence
}
{ p=$0 }
END { print $0 }
This awk script gives the expected output (it treats a step of exactly 3, up or down, as continuing the current sequence):
awk '{v=$NF}
NR==1{printf "%s,",v;p=v;next}
(p-v)*(p-v)==9{p=v;next}
{printf "%s\n%s,",p,v;p=v}
END{print v}' file
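For comparison, here is a sketch that tracks the first value of each run explicitly and prints a range only once the run has ended. The abs helper and the variable names first and prev are introduced for illustration; the input is the sample column from the question.

```shell
ranges=$(printf '%s\n' 1 4 7 20 23 26 100 97 94 |
awk '
function abs(x) { return x < 0 ? -x : x }
# The first line opens the first run.
NR == 1            { first = $1; prev = $1; next }
# A step larger than 3 closes the current run and opens a new one.
abs($1 - prev) > 3 { print first "," prev; first = $1 }
                   { prev = $1 }
# Close the last run, unless the input was empty.
END                { if (NR) print first "," prev }
')
printf '%s\n' "$ranges"
```

This version writes each range on one line in a single print, so it does not emit a leading blank line for the first record.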

nested for loops in awk to count number of fields matching values

I have a file with two columns (1.4 million rows) that looks like:
CLM MXL
0 0
0 1
1 1
1 1
0 0
29 42
0 0
30 15
I would like to count the instances of each possible combination of values; for example if there are x number of lines where column CLM equals 0 and column MXL matches 1, I would like to print:
0 1 x
Since the maximum value of column CLM is 188 and the maximum value of column MXL is 128, I am trying to use a nested for loop in awk that looks something like:
awk '{for (i=0; i<=188; i++) {for (j=0; j<=128; j++) {if($9==i && $10==j) {print$0}}}}' 1000Genomes.ALL.new.txt > test
But this only prints out the original file, which makes sense, I just don't know how to correctly write a for loop that prints out one file for each combination of values, which I can then wc, or print out one file with counts of each combination. Any solution in awk, bash script, perl script would be great.
1. A Pure awk Solution
$ awk 'NR>1{c[$0]++} END{for (k in c)print k,c[k]}' file | sort -n
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
The code uses a single variable c. c is an associative array whose keys are lines in the file and whose values are the number of occurrences.
NR>1{c[$0]++}
For every line except the first (which has the headings), this increments the count for the combination in that line.
END{for (k in c)print k,c[k]}
This prints out the final counts.
sort -n
This is just for aesthetics: it puts the output lines in a predictable order.
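A self-contained run of the same idea on the sample rows from the question is sketched below. As a small variation (an assumption, not part of the answer above), it keys the array on $1 " " $2 rather than $0, so that lines differing only in spacing are counted together, and it sorts numerically on both columns.

```shell
counts=$(printf '%s\n' 'CLM MXL' '0 0' '0 1' '1 1' '1 1' '0 0' '29 42' '0 0' '30 15' |
awk '
# Skip the header line, then count each (CLM, MXL) pair.
NR > 1 { c[$1 " " $2]++ }
END    { for (k in c) print k, c[k] }
' | sort -n -k1,1 -k2,2)
printf '%s\n' "$counts"
```

Only combinations that actually occur are stored, which is why no nested loop over all 189 x 129 possible pairs is needed.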
2. Alternative using uniq -c
$ tail -n+2 file | sort -n | uniq -c | awk '{print $2,$3,$1}'
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
tail -n+2 file
This prints all but the first line of the file. The purpose of this is to remove the column headings.
sort -n | uniq -c
This sorts the lines and then counts the duplicates.
awk '{print $2,$3,$1}'
uniq -c puts the counts first and you wanted the counts to be the last on the line. This just rearranges the columns to the format that you wanted.

awk, declare array embracing FNR and field, output

I would like to declare an array of a certain number of lines, that means from line 10 to line 78, as an example. Could be other number, this is just an example.
My sample gives me that range of lines on stdout but sets "1" in between that lines. Can anybody tell me how to get rid of that "1"?
Sample as follows should go to stdout and embraces the named lines.
awk '
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
{print myarr["range-one"]};' /home/$USER/uplog.txt;
That is giving me this output:
0
12:33:49 up 3:57, 2 users, load average: 0,61, 0,37, 0,22 21.06.2014
1
12:42:02 up 4:06, 2 users, load average: 0,14, 0,18, 0,19 21.06.2014
1
12:42:29 up 4:06, 2 users, load average: 0,09, 0,17, 0,19 21.06.2014
1
12:43:09 up 4:07, 2 users, load average: 0,09, 0,16, 0,19 21.06.2014
1
Second question: how to set in that array one field of FNR or line?
When I do it this way there comes up the field that I wanted
awk ' NR~/^1$/ , NR~/^7$/ {print $3, $11; next} ; ' /home/$USER/uplog.txt;
But I need an array, thats why I'm asking. Any hints? Thanks in advance.
What the example script does
awk '
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
{print myarr["range-one"]};'
Your script is one of the more convoluted and decidedly less-than-obvious pieces of awk that I've ever seen. Let's take a simple input file:
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Line 12
The output from that is:
0
Line 2
1
Line 3
1
Line 4
1
Line 5
1
Line 6
1
Line 7
1
Line 8
1
0
0
0
0
Dissecting your script, it appears that the first line:
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
is equivalent to:
myarr["range-one"] = (NR ~ /^2$/, NR ~ /^8$/) { print }
That is, the value assigned to myarr["range-one"] is 1 inside the range that starts on the line where NR equals 2 and ends on the line where NR equals 8, and 0 outside that range; further, when the value is 1, the line is printed.
The second line:
{print myarr["range-one"]};
prints the value in myarr["range-one"] for each line of input. Thus, on the first line, the value 0 is printed. For lines 2 to 8, the line is printed followed by the value 1; for each line after that, the value 0 is printed once more.
What the question asks for
The question is not clear. It appears that lines 10 to 78 should be printed. In awk, there are essentially no variable declarations (we can debate about function parameters, but functions don't seem to figure in this). Therefore, declaring an array is not an option.
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { print }'
This would print the lines between line 10 and line 78. It would be feasible to save the values in an array (a in the examples below). Said array could be indexed by NR or with a separate index starting at 0 or 1:
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[NR] = $0 }' # Indexed by line number
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[i++] = $0 }' # Indexed from 0
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[++i] = $0 }' # Indexed from 1
Presumably, you'd also have an END block to do something with the data.
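Such an END block could look like the sketch below, which stores a small range of lines indexed from 1 and then prints them with their array index. The input lines and the lo/hi values are made up for illustration; storing $3 instead of $0 would keep just one field per line, which is what the second part of the question asks about.

```shell
out=$(printf '%s\n' 'Line 1' 'Line 2' 'Line 3' 'Line 4' 'Line 5' |
awk -v lo=2 -v hi=4 '
# Save every line in the range lo..hi, indexed from 1.
NR >= lo && NR <= hi { a[++n] = $0 }
# Walk the array in insertion order once all input has been read.
END { for (i = 1; i <= n; i++) print i ": " a[i] }
')
printf '%s\n' "$out"
```

Iterating with a counter rather than for (i in a) keeps the output in line order, since awk does not guarantee the traversal order of for-in.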
The semicolons in the original are both unnecessary. The blank line is ignored, of course.

How can I find the sum of the elements of an array in Bash?

I am trying to add the elements of an array that is defined by user input from the read -a command. How can I do that?
read -a array
tot=0
for i in ${array[@]}; do
let tot+=$i
done
echo "Total: $tot"
Given an array (of integers), here's a funny way to add its elements (in bash):
sum=$(IFS=+; echo "$((${array[*]}))")
echo "Sum=$sum"
e.g.,
$ array=( 1337 -13 -666 -208 -408 )
$ sum=$(IFS=+; echo "$((${array[*]}))")
$ echo "$sum"
42
Pro: No loop and no external command (the IFS change stays inside the command substitution)
Con: Only works with integers
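The same trick can be wrapped in a function so that the IFS change is confined by local instead of a command substitution. This is a sketch for bash; sum_ints is a hypothetical helper name, not something from the answer above.

```shell
# sum_ints: hypothetical helper that adds its integer arguments.
sum_ints() {
    # With IFS set to "+", $* joins the arguments as 1+2+3...
    local IFS=+
    echo "$(( $* ))"
}

array=( 1337 -13 -666 -208 -408 )
total=$(sum_ints "${array[@]}")
echo "$total"
```

Calling it as sum_ints "${array[@]}" leaves the caller's IFS untouched, because the local assignment ends when the function returns.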
Edit (2012/12/26).
As this post got bumped up, I wanted to share with you another funny way, using dc, which is then not restricted to just integers:
$ dc <<< '[+]sa[z2!>az2!>b]sb1 2 3 4 5 6 6 5 4 3 2 1lbxp'
42
This wonderful line adds all the numbers. Neat, eh?
If your numbers are in an array array:
$ array=( 1 2 3 4 5 6 6 5 4 3 2 1 )
$ dc <<< '[+]sa[z2!>az2!>b]sb'"${array[*]}lbxp"
42
In fact there's a catch with negative numbers. The number '-42' should be given to dc as _42, so:
$ array=( -1.75 -2.75 -3.75 -4.75 -5.75 -6.75 -7.75 -8.75 )
$ dc <<< '[+]sa[z2!>az2!>b]sb'"${array[*]//-/_}lbxp"
-42.00
will do.
Pro: Works with floating points.
Con: Uses an external process (but there's no choice if you want to do non-integer arithmetic — but dc is probably the lightest for this task).
My code (which I actually utilize) is inspired by answer of gniourf_gniourf. I personally consider this more clear to read/comprehend, and to modify. Accepts also floating points, not just integers.
Sum values in array:
arr=( 1 2 3 4 5 6 7 8 9 10 )
IFS='+' sum=$(echo "scale=1;${arr[*]}"|bc)
echo $sum # 55
With small change, you can get the average of values:
arr=( 1 2 3 4 5 6 7 8 9 10 )
IFS='+' avg=$(echo "scale=1;(${arr[*]})/${#arr[@]}"|bc)
echo $avg # 5.5
gniourf_gniourf's answer is excellent since it doesn't require a loop or bc. For anyone interested in a real-world example, here's a function that totals all of the CPU cores reading from /proc/cpuinfo without messing with IFS:
# Insert each processor core count integer into array
cpuarray=($(grep cores /proc/cpuinfo | awk '{print $4}'))
# Read from the array and replace the delimiter with "+"
# also insert 0 on the end of the array so the syntax is correct and not ending on a "+"
read <<< "${cpuarray[@]/%/+}0"
# Add the integers together and assign output to $corecount variable
corecount="$((REPLY))"
# Echo total core count
echo "Total cores: $corecount"
I also found the arithmetic expansion works properly when calling the array from inside the double parentheses, removing the need for the read command:
cpuarray=($(grep cores /proc/cpuinfo | awk '{print $4}'))
corecount="$((${cpuarray[@]/%/+}0))"
echo "Total cores: $corecount"
Generic:
array=( 1 2 3 4 5 )
sum="$((${array[@]/%/+}0))"
echo "Total: $sum"
I'm a fan of brevity, so this is what I tend to use:
IFS="+";bc<<<"${array[*]}"
It simply joins the elements of the array and passes the result to bc, which evaluates it. IFS is the internal field separator; its first character is what "${array[*]}" puts between the elements. Setting it to a plus sign means bc receives the numbers separated by plus signs, so it adds them together. Note that IFS stays changed for the rest of the shell session unless you restore it.
Another dc & bash method:
arr=(1 3.88 7.1 -1)
dc -e "0 ${arr[*]/-/_} ${arr[*]/*/+} p"
Output:
10.98
The above runs the expression 0 1 3.88 7.1 _1 + + + + p with dc. Note the dummy value 0 because there's one too many +s, and also note the usual negative number prefix - must be changed to _ in dc.
arr=(1 2 3) # or use `read` to fill the array
echo "Sum of array elements: $(( ${arr[@]/%/ +} 0 ))"
Sum of array elements: 6
Explanation:
${arr[@]/%/ +} appends " +" to every element, giving 1 + 2 + 3 +
Adding a zero at the end gives 1 + 2 + 3 + 0
Wrapping this string in Bash's arithmetic expansion, like this: $(( ${arr[@]/%/ +} 0 )), returns the sum instead
This could be used for other math operations.
For subtracting just use - instead
For multiplication use * and 1 instead of 0
Can be used with logic operators too.
BOOL AND EXAMPLE - check if all items are true (1)
arr=(1 0 1)
if [[ $((${arr[@]/%/ &} 1)) -eq 1 ]]; then echo "yes"; else echo "no"; fi
This will print: no
BOOL OR EXAMPLE - check if any item is true (1)
arr=(1 0 0)
if [[ $((${arr[@]/%/ |} 0)) -eq 1 ]]; then echo "yes"; else echo "no"; fi
This will print: yes
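The multiplication variant mentioned above can be sketched the same way: append " *" to every element and close the expression with the neutral element 1. The array contents are made up for illustration, and this assumes bash.

```shell
arr=(2 3 4)
# Expands to 2 * 3 * 4 * 1 inside the arithmetic expansion.
product=$(( ${arr[@]/%/ *} 1 ))
echo "Product of array elements: $product"
```

An empty array reduces the expression to just the closing 1, which is the usual convention for an empty product.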
A simple way
arraySum() {
    # Sum the arguments passed to the function rather than a global array
    local sum=0 i
    for i in "$@"; do
        sum=$((sum + i))
    done
    echo "$sum"
}
a=(7 2 3 9)
echo -n "Sum is = "
arraySum "${a[@]}"
I find this very simple using an increasing variable:
result2=0
for i in ${lineCoffset[@]};
do
result2=$((result2+i))
done
echo $result2

Resources