How to print lines with multiple associative arrays and conditions using awk - arrays

I want to print all lines from file 1 where the values of $1 and $4 are found in $1 and $4 of file 2 AND where the value in file 1 $2 is greater than or equal to the value in file 2 $2 AND where the value in file 1 $3 is less than or equal to the value in file 2 $3.
file 1
1 110201809 117658766 a
1 168095261 182305990 b
1 215456074 233436403 c
2 9465687 12905490 d
2 28765309 35235120 e
2 48958595 64702082 f
file 2
1 245371026 249210707 a
2 937388 46504962 h
2 937388 162731186 b
2 2954974 6777829 c
2 9465687 12996275 d
2 14539477 44757554 d
2 14766820 30080818 m
2 16531332 23584565 n
2 17340076 26206255 o
2 18535880 24452180 p
2 28830071 35289330 q
2 36206662 47273732 r
2 48958495 64703082 f
Desired output only prints the lines from file 1 that meet the condition.
desired output
2 9465687 12905490 d
2 48958595 64702082 f
I've tried the following which gave an empty file:
awk 'NR==FNR{ a[$1,$4]= $0; b[$2] = $2 ; c[$3] = $3; next } ($1 $4 in a) && ($2 >= b[$2]) && ($3 <= c[$3])' file2 file1>desired output

I would do this by collecting the second and third columns of the first file in separate hashes, keyed on columns 1 and 4, e.g.:
parse.awk
NR==FNR {
g[$1,$4] = $2
h[$1,$4] = $3
next
}
($1 SUBSEP $4 in g) && g[$1,$4] >= $2 && h[$1,$4] <= $3
Run it like this:
awk -f parse.awk file1 file2
Output (the matching lines of file2):
2 9465687 12996275 d
2 48958495 64703082 f
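With file1 given first, the script above prints the matching rows of file 2. A minimal sketch that prints the file 1 rows instead, as the question asks, might look like the following. It assumes the sample data from the question (recreated here) and that a key may occur more than once in file 2; the array names n, lo and hi are illustrative only.

```shell
# Recreate the sample data from the question.
cat > file1 <<'EOF'
1 110201809 117658766 a
1 168095261 182305990 b
1 215456074 233436403 c
2 9465687 12905490 d
2 28765309 35235120 e
2 48958595 64702082 f
EOF
cat > file2 <<'EOF'
1 245371026 249210707 a
2 937388 46504962 h
2 937388 162731186 b
2 2954974 6777829 c
2 9465687 12996275 d
2 14539477 44757554 d
2 14766820 30080818 m
2 16531332 23584565 n
2 17340076 26206255 o
2 18535880 24452180 p
2 28830071 35289330 q
2 36206662 47273732 r
2 48958495 64703082 f
EOF

# First file on the command line (file2): store every range seen for a ($1,$4) key.
# Second file (file1): print the line if any stored range encloses $2..$3.
result=$(awk '
NR == FNR {
    key = $1 SUBSEP $4
    n[key]++
    lo[key, n[key]] = $2 + 0
    hi[key, n[key]] = $3 + 0
    next
}
{
    key = $1 SUBSEP $4
    for (i = 1; i <= n[key]; i++)
        if ($2 >= lo[key, i] && $3 <= hi[key, i]) { print; break }
}' file2 file1)
printf '%s\n' "$result"
```

Storing the ranges under a per-key counter means the second "2 ... d" row in file 2 does not overwrite the first one.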

Related

Linux shell passing column position dynamically

question is: how to pass column position (e.g., $2) dynamically through a loop.
Example file temp1
a 1 2
a 2 3
b 1 1
b 3 2
c 1 5
c 2 6
code so far (does not work :-))
#!/bin/bash
twopq () {
awk -v c1="$1" -v c2="$2" '{ if ($1==c1 && c2 == 1) {print}}' temp1 > temp2
}
twopq a $2
twopq b $3
Desired output in temp2 from 1st loop (1st col = 'a' and 2nd col = 1)
a 1 2
desired output in temp2 from 2nd loop (1st col= b and 3rd col = 1)
b 1 1
My problem is passing the "$" through my loop to say that I'm looking for col2 in the first loop and col3 in the second loop.
thanks for the help!
Assumptions:
1st argument is a value we're looking for in the 1st column of the input file
2nd argument is the column number we're looking for that has a value of 1
print all lines that match the search criteria
Adding a couple lines to demonstrate multiple matches:
$ cat temp1
a 1 2
a 2 3
b 1 1
b 3 2
c 1 5
c 2 6
d 1 5 # new line
d 1 9 # new line
A few tweaks to OP's current code:
twopq () { awk -v val="$1" -v colnum="$2" '$1==val && $(colnum)==1' temp1; }
Taking for a test drive:
$ twopq a 2
a 1 2
$ twopq b 3
b 1 1
$ twopq d 2
d 1 5
d 1 9
NOTES:
once the output is verified OP can update the function as needed to capture the output to temp2 (eg, > temp2 to overwrite on each function call; >> temp2 to append with each function call)
alternatively, route the output from the function call to the output file (eg, twopq a 2 > temp2, twopq b 3 >> temp2)
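The second note could be sketched as follows, assuming the original six-line temp1 from the question (recreated here); the function is the same as above.

```shell
# Sample data from the question.
cat > temp1 <<'EOF'
a 1 2
a 2 3
b 1 1
b 3 2
c 1 5
c 2 6
EOF

# val:    value to look for in column 1
# colnum: number of the column that must equal 1
# $(colnum) makes awk read the field whose number is stored in colnum.
twopq () { awk -v val="$1" -v colnum="$2" '$1 == val && $(colnum) == 1' temp1; }

twopq a 2 >  temp2   # first call overwrites temp2
twopq b 3 >> temp2   # later calls append to it
cat temp2
```

Keeping the redirection outside the function leaves the choice between overwriting and appending to the caller.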
Like this:
#!/bin/bash
twopq () {
awk -v c1="$1" '($1==c1) {
for (i=2; i<=NF; i++)
if ($i == 1) {print;exit}
}' temp1 | tee -a temp2
}
twopq a 2
twopq b 3
Output
a 1 2
b 1 1
Did you try using $c2 inside the if condition?
awk -v c1="$1" -v c2="$2" '{ if ($1==c1 && $c2 == 1) {print}}' temp1 > temp2

Use Perl to print only if the value of column A appears with every different value of Column B

So in Perl how can I go through a sample file like so:
1 D Z
1 E F
1 G L
2 D I
2 E L
3 D P
3 G L
So here I want to print only the lines whose first-column value appears with every different value of the second column.
The output would look like this:
1 D Z
1 E F
1 G L
cat test
1 D Z
1 E F
1 G L
2 D I
2 E L
3 D P
3 G L
perl -a -lne 'unless ( $h{ $F[1] } ) { print }; $h{ $F[1] } = 1; ' test
1 D Z
1 E F
1 G L
Okay, this isn't as easy as it seems. I've read the file into memory so that I can take three passes over it:
Count the number of different values in column 2
Record each combination of values in column 1 and column 2
Print those lines in the file whose first column has as many occurrences as there are different values of column 2
This could be improved with more information about the input file, but it works fine as it is and I see no reason to optimise it
use strict;
use warnings 'all';
use List::MoreUtils qw/ uniq /;
my @lines = <>;
my @col2 = uniq map { (split)[1] } @lines;
my %data;
for ( @lines ) {
    my ($c1, $c2) = split;
    $data{$c1}{$c2} = 1;
}
for ( @lines ) {
    my ($c1) = split;
    print if keys %{ $data{$c1} } == @col2;
}
output
1 D Z
1 E F
1 G L
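For comparison, the same three steps can be sketched in awk by reading the file twice instead of holding it in memory. This is only an illustration, not part of the Perl answer; the file name test.txt and the array names are assumptions.

```shell
# Sample data from the question.
cat > test.txt <<'EOF'
1 D Z
1 E F
1 G L
2 D I
2 E L
3 D P
3 G L
EOF

# First read: count distinct column-2 values, and distinct column-2 values per column-1 value.
# Second read: print lines whose column-1 value was seen with every column-2 value.
matches=$(awk '
NR == FNR {
    if (!seen2[$2]++) ncol2++
    if (!pair[$1, $2]++) per_col1[$1]++
    next
}
per_col1[$1] == ncol2
' test.txt test.txt)
printf '%s\n' "$matches"
```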

Array Bash printing elements in a loop of multiple arrays

I have multiple arrays (limited to three here), and this is my first time using arrays.
The arrays all have the same length, and elements at the same index belong to the same record.
So array a, b and c values are listed below:
array a = 1 2 3 4 5
array b = a b c d e
array c = v w x y z
I need to print the contents so that each output line looks like this:
1 a v
2 b w
3 c x
4 d y
5 e z
Can you help?
Thanks
Here's a more bash-ful version (if you will):
#!/usr/bin/env bash
# initialize arrays
a=(1 2 3 4 5)
b=(a b c d e)
c=(v w x y z)
# count elements (assuming all arrays are the same size)
numElems=${#a[@]}
# loop over all elements
for (( i = 0; i < numElems; i++ )); do
# -e ensures that escape sequences such as \t are recognized
echo -e "${a[i]}\t${b[i]}\t${c[i]}"
done
This is how I worked it out; hopefully there is a better way. The three sample arrays listed above each hold a list of values and are of equal length, so a counter can step through the indices and print the matching element of each array. The $'\t' puts a tab in between.
s=${#a[@]}
counter=0
echo $counter
while [[ $counter -lt $s ]];
do
echo ${a[$counter]} $'\t' ${b[$counter]} $'\t' ${c[$counter]}
counter=$(( $counter + 1 ))
done
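A further variant, sketched under the same assumption that all three arrays have equal length: iterate over the indices of the first array and let printf place the tabs, which avoids the extra spaces that echo puts around each $'\t'. The variable rows is illustrative only.

```shell
#!/usr/bin/env bash
a=(1 2 3 4 5)
b=(a b c d e)
c=(v w x y z)

# "${!a[@]}" expands to the indices of a; printf writes one tab-separated row per index.
rows=$(for i in "${!a[@]}"; do
    printf '%s\t%s\t%s\n' "${a[i]}" "${b[i]}" "${c[i]}"
done)
printf '%s\n' "$rows"
```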

How to output counts for list of active/inactive inputs?

I have this input file (1=active, 0=inactive)
a 1
a 0
b 1
b 1
b 0
c 0
c 0
c 0
c 0
.
.
.
And want output like this:
X repeats active count inactive count
a 2 times 1 1
b 3 times 2 1
c 4 times 0 4
I tried:
awk -F "," '{if ($2==1) a[$1]++; } END { for (i in a); print i, a[i] }'file name
But that did not work.
How can I get the output?
Just to give you an idea, this awk command should work:
awk '$2{a[$1]++; next} {b[$1]++; if (!($1 in a)) a[$1]=0} END{for (i in a) print i, a[i], b[i], (a[i]+b[i])}' file
a 1 1 2
b 2 1 3
c 0 4 4
You can format the output any way you want.
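As one possible formatting, here is a minimal sketch that uses printf to approximate the layout asked for in the question. It assumes the nine sample rows shown above in a file called input.txt; the array names are illustrative, and the rows are piped through sort because awk's for-in order is unspecified.

```shell
# Sample data from the question.
cat > input.txt <<'EOF'
a 1
a 0
b 1
b 1
b 0
c 0
c 0
c 0
c 0
EOF

report=$(awk '
{
    seen[$1]                      # remember every key, even if it is only ever inactive
    if ($2) active[$1]++
    else    inactive[$1]++
}
END {
    for (k in seen)
        printf "%s\t%d times\t%d\t%d\n", k, active[k] + inactive[k], active[k], inactive[k]
}' input.txt | sort)
printf 'X\trepeats\tactive count\tinactive count\n%s\n' "$report"
```

The %d conversion turns a counter that was never incremented into 0, so no explicit initialisation is needed.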
You can try
awk -f r.awk input.txt
where input.txt is your data file, and r.awk is
{
X[$1]++
if ($2) a[$1]++
else ia[$1]++
}
END {
printf "X\tRepeat\tActive\tInactive\n"
for (i in X) {
printf "%s\t%d\t%d\t%d\n", i, X[i], a[i], ia[i]
}
}
This is GNU awk
awk '{a[$1]++; if ($2!=0) {b[$1]++;c[$1]+=0} else {c[$1]++;b[$1]+=0}}END {for (i in a) print i, a[i], b[i], c[i]}' file
Here is another simple way to do it with awk
awk '{a[$1]++;b[$1]+=$2} END { for (i in a) print i,a[i],b[i],a[i]-b[i]}' file
a 2 1 1
b 3 2 1
c 4 0 4
No test is needed: summing column $2 gives the number of active entries, and subtracting that sum from the row count gives the inactive ones.
awk '
{ repeats[$1]++; counts[$1,$2]++ }
END {
for (key in repeats)
print key, repeats[key], counts[key,1]+0, counts[key,0]+0
}
' file

Identify overlapping ranges in AWK

I have a file with rows of 3 columns (tab separated) eg:
2 45 100
And a second file with rows of 3 columns (tab separated) eg:
2 10 200
I want an awk command that matches lines if $1 is the same in both files and the range $2-$3 in file one intersects at all with the range $2-$3 in file 2. It can be within the range of values in file 2, or the range in file 2 can be within the range in file 1, or there can just be a partial overlap. Any kind of intersection between the ranges counts as a match, and the matching row should then be printed to a third file.
My current code only matches if $1 and either $2 or $3 match, but doesn't work for when the ranges are within each other as in these cases the precise numbers don't match.
awk '
BEGIN {
FS = "\t";
}
FILENAME == ARGV[1] {
pair[ $1, $2, $3 ] = 1;
next;
}
{
if ( pair[ $1, $2, $3 ] == 1 ) {
print $1 $2 $3;
}
}
Example Input:
File1:
1 10 23
2 30 50
6 100 110
8 20 25
File2:
1 5 15
10 30 50
2 10 100
8 22 24
Here line 1(file1) matches line 1(file2) because the first column matches AND range 10-15 overlaps between both ranges
Line 2 (file1) matches line 3(file2) because first column matches and range of 30-50 is within range 10-100.
Line 4(file1) matches line 4(file2) because first column matches and the range 22-24 overlaps in both.
Therefore output would be lines 1,2 and 4 from file2 printed in a new output file.
Hope these examples help.
Your help is really appreciated.
Thank you in advance!
It is quite easy if you use the join command to merge both files on their first field ($1).
If you only want the file2 lines as output:
join --nocheck-order <(sort -n file1) <(sort -n file2) | awk '{if ($2 >= $4 && $2 <= $5 || $3 >= $4 && $3 <= $5 || $4 >= $2 && $4 <= $3 || $5 >= $2 && $5 <= $3) {print $1" "$4" "$5;}}' -
Using your input files I got this output:
1 5 15
2 10 100
8 22 24
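The four-part condition above can be reduced to a single test: two closed ranges [s1,e1] and [s2,e2] intersect exactly when s1 <= e2 and s2 <= e1. Here is a minimal sketch of that test in awk alone, without join, assuming the tab-separated sample files from the question (recreated here) and allowing several file1 ranges per key; the array names n, s and e are illustrative.

```shell
# Tab-separated sample data from the question.
printf '1\t10\t23\n2\t30\t50\n6\t100\t110\n8\t20\t25\n' > file1
printf '1\t5\t15\n10\t30\t50\n2\t10\t100\n8\t22\t24\n' > file2

overlaps=$(awk -F '\t' '
NR == FNR {                       # first file: store every range per key
    n[$1]++
    s[$1, n[$1]] = $2 + 0
    e[$1, n[$1]] = $3 + 0
    next
}
{                                 # second file: print the row on the first overlap
    for (i = 1; i <= n[$1]; i++)
        if (s[$1, i] <= $3 && $2 <= e[$1, i]) { print; break }
}' file1 file2)
printf '%s\n' "$overlaps"
```

Because no sorting is involved, the file2 rows come out in their original order.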
