Use bash variable as array in awk and filter input file by comparing with array

I have a bash variable like this:
val="abc jkl pqr"
And I have a file that looks something like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to throw away the rows of the file whose first field isn't present in val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My solution in awk doesn't work at all and I don't have any idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file

Just turn the words of the variable into array indexes:
awk -v var="${val}" 'BEGIN{
    split(var, arr)
    for (i in arr)
        names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() the words become array values, while what the in operator tests are array indexes. The trick is to generate another array whose indexes are those values.
As you can see, the condition $1 in names suffices: you don't have to spell out the action {print $0}, since printing the record is the default.
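To see why the second array is needed, here is what split() actually produces (numeric indexes mapping to the words):
$ awk 'BEGIN{n=split("abc jkl pqr", arr); for (i=1; i<=n; i++) print i, arr[i]}'
1 abc
2 jkl
3 pqr
So "abc" in arr is false (the indexes are 1, 2 and 3), while after the loop "abc" in names is true.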
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1

grep -E "$( echo "${val}"| sed 's/ /|/g' )" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|",val)} $1 ~ val' YourFile
Grep:
It uses an extended regex (option -E) that filters all the lines containing one of the values. The regex is built on the fly in a subshell, with sed replacing each space separator by a | (meaning OR).
Awk:
It uses the same principle as the grep, but everything is done inside awk (so no subshell).
The awk variable val is assigned from the shell variable of the same name.
At the start of the script (before the first line is read), BEGIN{gsub(/ /, "|",val)} changes the spaces in val to |.
Then, every line whose first field matches is printed (the default action of the filter $1 ~ val; the default field separator in awk is space/blank, so the first field is the letter group).
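Note that both versions match substrings: grep filters on the whole line, and $1 ~ val is an unanchored match, so a first field of abcd would also pass for abc. A minimal sketch (an editorial addition, not part of the original answers) that anchors the alternation to force exact field matches:
awk -v val="${val}" 'BEGIN{gsub(/ /, "|", val); val = "^(" val ")$"} $1 ~ val' YourFile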


Picking input record fields with AWK

Let's say we have a shell variable $x containing a space-separated list of numbers from 1 to 30:
$ x=$(for i in {1..30}; do echo -n "$i "; done)
$ echo $x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
We can print the first three input record fields with AWK like this:
$ echo $x | awk '{print $1 " " $2 " " $3}'
1 2 3
How can we print all the fields starting from the Nth field with AWK? E.g.
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
EDIT: I can use cut, sed etc. to do the same but in this case I'd like to know how to do this with AWK.
Converting my comment to an answer so that the solution is easy to find for future visitors.
You may use this awk:
awk '{for (i=3; i<=NF; ++i) printf "%s", $i (i<NF?OFS:ORS)}' file
or pass the start position as an argument:
awk -v n=3 '{for (i=n; i<=NF; ++i) printf "%s", $i (i<NF?OFS:ORS)}' file
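With n=4 it reproduces the expected output from the question:
$ echo $x | awk -v n=4 '{for (i=n; i<=NF; ++i) printf "%s", $i (i<NF?OFS:ORS)}'
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30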
Version 4: The shortest is probably using sub to cut off the first three fields and their separators:
$ echo $x | awk 'sub(/^ *([^ ]+ +){3}/,"")'
Output:
4 5 6 7 8 9 ...
This will, however, preserve all space after $4:
$ echo "1 2 3 4 5" | awk 'sub(/^ *([^ ]+ +){3}/,"")'
4 5
so if you wanted the space squeezed, you'd need to, for example:
$ echo "1 2 3 4 5" | awk 'sub(/^ *([^ ]+ +){3}/,"") && $1=$1'
4 5
with the exception that if the 4th field happens to be a 0, the assignment $1=$1 evaluates to false and nothing is printed:
$ echo "1 2 3 0" | awk 'sub(/^ *([^ ]+ +){3}/,"")&&$1=$1'
$ [no output]
in which case you'd need to:
$ echo "1 2 3 0" | awk 'sub(/^ *([^ ]+ +){3}/,"") && ($1=$1) || 1'
0
Version 1: cut is better suited for the job:
$ cut -d ' ' -f 4- <<< "$x"
Version 2: Using awk you could:
$ echo -n $x | awk -v RS=' ' -v ORS=' ' 'NR>=4; END{printf "\n"}'
Version 3: If you want to preserve those varying amounts of space, using GNU awk you could use split's fourth parameter seps:
$ echo "1 2 3 4 5 6 7" |
gawk '{
n=split($0,a,FS,seps) # actual separators goes to seps
for(i=4;i<=n;i++) # loop from 4th
printf "%s%s",a[i],(i==n?RS:seps[i]) # get fields from arrays
}'
Adding one more approach: append each field's value to a variable and, once all fields have been read, print the variable. Change the value of n= depending on the field from which onwards you want the data.
echo "$x" |
awk -v n=3 '{val=""; for(i=n; i<=NF; i++){val=(val?val OFS:"")$i}; print val}'
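With n=4 against the question's $x:
$ echo "$x" | awk -v n=4 '{val=""; for(i=n; i<=NF; i++){val=(val?val OFS:"")$i}; print val}'
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30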
With GNU awk, you can use the join function, which has been a bundled include since gawk 4.1:
x=$(seq 30 | tr '\n' ' ')
echo "$x" | gawk '@include "join"
{split($0, arr)
 print join(arr, 4, length(arr), "|")}
'
4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30
(Shown here with a '|' instead of a ' ' for clarity...)
Alternative way of including join:
echo "$x" | gawk -i join '{split($0, arr); print join(arr, 4, length(arr), "|")}'
Using GNU awk and gensub:
echo $x | awk '{ print gensub(/^([[:digit:]]+[[:space:]]){3}(.*$)/, "\\2", 1) }'
Using gensub, capture two sections of the string based on the regular expression and print the second section only; the third argument, 1, replaces the first match, and the target string defaults to $0.
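For example, against a short input:
$ echo "1 2 3 4 5" | awk '{ print gensub(/^([[:digit:]]+[[:space:]]){3}(.*$)/, "\\2", 1) }'
4 5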

Edit a string in shell script and display it as an array

Input:
1234-A1;1235-A2;2345-B1;5678-C2;2346-D5
Expected Output:
1234
1235
2345
5678
2346
The input shown is user input. I want to store it in an array and do some operations to display it as shown in 'Expected Output'.
I have done it in perl, but want to achieve it in shell script. Please help in achieving this.
To split an input text into an array you can follow this technique (note that IFS is a plain set of characters, not a regex: "[;-]" makes read split on each of [, ;, - and ], the brackets simply never occurring in the input):
IFS="[;-]" read -r -a arr <<< "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
printf '%s\n' "${arr[@]}"
1234
A1
1235
A2
2345
B1
5678
C2
2346
D5
If you want to keep only 1234, 1235, etc. as per your expected output, you can either use the corresponding array elements (indexes 0, 2, 4, etc.) or do something like this:
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-[A-Z][0-9]/}" #or more generally <<< "${a//-??/}"
declare -p arr #This asks bash to print the array for us
#Output
declare -a arr='([0]="1234" [1]="1235" [2]="2345" [3]="5678" [4]="2346")'
# Array can now be printed or used elsewhere in your script. Array counting starts from zero
@Yash: try:
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5" | awk '{gsub(/-[[:alnum:]]+/,"");gsub(/;/,RS);print}'
This substitutes each dash followed by alphanumerics with nothing, then substitutes each semicolon with RS (the record separator, a newline by default).
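Run against the sample input, this prints the expected output:
$ echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5" | awk '{gsub(/-[[:alnum:]]+/,"");gsub(/;/,RS);print}'
1234
1235
2345
5678
2346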
Thanks @George and @Vipin.
Based on your inputs, the solution which best suits my environment is as under:
i=0
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-??/}"
#declare -p arr
for var in "${arr[@]}"
do
    echo " var $((i++)) is : $var"
done
Output:
var 0 is : 1234
var 1 is : 1235
var 2 is : 2345
var 3 is : 5678
var 4 is : 2346
Try this -
awk -F'[-;]' '{for(i=1;i<=NF;i++) if(i%2!=0) {print $i}}' f
1234
1235
2345
5678
2346
OR
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"|tr ';' '\n'|cut -d'-' -f1
OR
As @George Vasiliou suggested -
awk -F'[-;]' '{for(i=1;i<=NF;i+=2) {print $i}}' f
If the data needs to be stored in an array and you are using gawk, try below -
awk -F'[;-]' -v k=1 '{for(i=1;i<=NF;i++) if($i !~ /[[:alpha:]]/) {a[k++]=$i}} END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for(k in a) print k,a[k]}' f
1 1234
2 1235
3 2345
4 5678
5 2346
PROCINFO["sorted_in"] = "@ind_str_asc" is used to print the data in sorted index order.
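For completeness, a minimal pure-bash sketch (an editorial addition, not from the answers above) that needs no external tools: split on ; and trim each suffix with parameter expansion:
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS=';' read -r -a arr <<< "$a"
for item in "${arr[@]}"; do
    printf '%s\n' "${item%%-*}"   # strip everything from the first '-' onward
done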

Multiply array elements

So I'm learning bash and need to write a simple script to multiply array elements by calling a function.
My code so far is below, but it isn't working at all. I believe there is a much simpler way than this (incrementing the pos variable so as to move to the next array element feels simply wrong).
array=(1 2 3 4 5 100)
sum=0
pos=1
function multiplicate {
    for i in ${array[*]}; do
        sum=$(($i * $array[pos]))
        let pos++
    done
}
multiplicate
echo $sum
I did my best to google the solution but was unable to find any relevant information; I found how to sum using bc, but it simply wouldn't work when replacing + with *.
Use this script:
#!/bin/bash
array=(1 2 3 4 5 100)
function multiplicate {
    local mul=1
    for i in "${array[@]}"; do
        ((mul *= i))
    done
    echo "$mul"
}
multiplicate
$ ./script
12000
Or better yet:
#!/bin/bash
multiplicate() {
    local mul=1
    for i
    do ((mul *= i))
    done
    echo "$mul"
}
multiplicate 1 2 3 4 5 100
And if you like to play with string variables use this:
multiplicate() { local IFS='*' ; echo $(( $* )); }
multiplicate 1 2 3 4 5 100
Here is a method using bc:
multiply ()
{
    printf '%s\n' "$@" | paste -s -d '*' | bc
}
Used as follows:
$ multiply 1 2 3 4 5 100
12000
The first command in the pipeline prints each argument on a separate line:
$ printf '%s\n' 1 2 3 4 5 100
1
2
3
4
5
100
The paste -s ("serial") then turns the output into a single line again, but with the elements now separated by *:
$ printf '%s\n' 1 2 3 4 5 100 | paste -s -d '*'
1*2*3*4*5*100
And bc finally evaluates the expression.
Alternatively, we can save a subshell and skip bc:
multiply () {
    echo $(( $(printf '%s\n' "$@" | paste -s -d '*') ))
}
This uses an arithmetic expression to evaluate the output of printf and paste (which now is in a command substitution), but readability suffers a bit.
Alternatively, in pure Bash (hat tip sorontar):
multiply () {
local IFS='*'
echo "$(( $* ))"
}
This sets the field separator IFS to * so the arguments, $*, expand to a string separated by *, which is then evaluated in the arithmetic expression $(()).
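To see the expansion at work, the same trick instrumented to show the intermediate string:
multiplicate() { local IFS='*'; echo "$* = $(( $* ))"; }
multiplicate 1 2 3 4 5 100
Output:
1*2*3*4*5*100 = 12000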

Print duplicate entries in a file using linux commands

I have a file called foo.txt, which consists of:
abc
zaa
asd
dess
zaa
abc
aaa
zaa
I want the output to be stored in another file as:
this text abc appears 2 times
this text zaa appears 3 times
I have tried the following command, but this just writes duplicate entries and their number.
sort foo.txt | uniq --count --repeated > sample.txt
Example of output of above command:
abc 2
zaa 3
How do I add the line "this text ... appears x times"?
Awk is your friend:
sort foo.txt | uniq --count --repeated | awk '{print "this text "$2" appears "$1" times"}' > sample.txt
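Alternatively, a one-pass sketch (an editorial variant, not from the answer) that counts in awk and skips sort and uniq; note that the output order is unspecified:
awk '{c[$0]++} END{for (w in c) if (c[w] > 1) print "this text " w " appears " c[w] " times"}' foo.txt > sample.txt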

nested for loops in awk to count number of fields matching values

I have a file with two columns (1.4 million rows) that looks like:
CLM MXL
0 0
0 1
1 1
1 1
0 0
29 42
0 0
30 15
I would like to count the instances of each possible combination of values; for example if there are x number of lines where column CLM equals 0 and column MXL matches 1, I would like to print:
0 1 x
Since the maximum value of column CLM is 188 and the maximum value of column MXL is 128, I am trying to use a nested for loop in awk that looks something like:
awk '{for (i=0; i<=188; i++) {for (j=0; j<=128; j++) {if($9==i && $10==j) {print$0}}}}' 1000Genomes.ALL.new.txt > test
But this only prints out the original file, which makes sense; I just don't know how to correctly write a for loop that prints out one file for each combination of values (which I could then wc), or one file with counts of each combination. Any solution in awk, bash script, or perl script would be great.
1. A Pure awk Solution
$ awk 'NR>1{c[$0]++} END{for (k in c)print k,c[k]}' file | sort -n
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
The code uses a single variable c. c is an associative array whose keys are lines in the file and whose values are the number of occurrences.
NR>1{c[$0]++}
For every line except the first (which has the headings), this increments the count for the combination in that line.
END{for (k in c)print k,c[k]}
This prints out the final counts.
sort -n
This is just for aesthetics: it puts the output lines in a predictable order.
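If, as in the question's own attempt, the two columns of interest are fields 9 and 10 of a wider file, the same counting idiom applies by keying on just those fields (a sketch; the field numbers and filename are taken from the question's awk attempt):
awk 'NR>1{c[$9 OFS $10]++} END{for (k in c) print k, c[k]}' 1000Genomes.ALL.new.txt | sort -n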
2. Alternative using uniq -c
$ tail -n+2 file | sort -n | uniq -c | awk '{print $2,$3,$1}'
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
tail -n+2 file
This prints all but the first line of the file. The purpose of this is to remove the column headings.
sort -n | uniq -c
This sorts the lines and then counts the duplicates.
awk '{print $2,$3,$1}'
uniq -c puts the counts first and you wanted the counts to be the last on the line. This just rearranges the columns to the format that you wanted.
