Identify overlapping ranges in AWK - file

I have a file with rows of 3 columns (tab separated) eg:
2 45 100
And a second file with rows of 3 columns (tab separated) eg:
2 10 200
I want an awk command that matches lines when $1 in both files matches and the range $2-$3 in file 1 intersects at all with the range $2-$3 in file 2. The range can lie within the range of values in file 2, the range in file 2 can lie within the range in file 1, or there can just be a partial overlap. Any kind of intersection between the ranges counts as a match, and the matching row should then be printed to a third file.
My current code only matches when $1, $2 and $3 all match exactly, but it doesn't work when one range falls within the other, because in those cases the precise numbers don't match.
awk '
BEGIN {
    FS = "\t";
}
FILENAME == ARGV[1] {
    pair[ $1, $2, $3 ] = 1;
    next;
}
{
    if ( pair[ $1, $2, $3 ] == 1 ) {
        print $1 $2 $3;
    }
}
' File1 File2
Example Input:
File1:
1 10 23
2 30 50
6 100 110
8 20 25
File2:
1 5 15
10 30 50
2 10 100
8 22 24
Here line 1 (file1) matches line 1 (file2) because the first column matches AND the two ranges overlap between 10 and 15.
Line 2 (file1) matches line 3 (file2) because the first column matches and the range 30-50 lies within the range 10-100.
Line 4 (file1) matches line 4 (file2) because the first column matches and the range 22-24 overlaps both.
Therefore the output would be lines 1, 3 and 4 from file2 printed to a new output file.
Hope these examples help.
Your help is really appreciated.
Thank you in advance!

It is quite easy if you use the join command to merge both files on their first field ($1):
If you only want the file2 lines as output:
join --nocheck-order <(sort -n file1) <(sort -n file2) | awk '{if (($2 >= $4 && $2 <= $5) || ($3 >= $4 && $3 <= $5) || ($4 >= $2 && $4 <= $3) || ($5 >= $2 && $5 <= $3)) {print $1" "$4" "$5;}}' -
Using your input files I got this output:
1 5 15
2 10 100
8 22 24
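The four comparisons can also be collapsed into the standard interval test: two ranges [a,b] and [c,d] intersect exactly when a <= d and c <= b. A pure-awk sketch of the same job using that test (it assumes file1 fits in memory; ranges are stored per key since a key can occur more than once, and the FS/OFS line can be dropped if the files are actually space-separated):
awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                        # file1: remember every range for each key
    n = ++cnt[$1]
    lo[$1, n] = $2
    hi[$1, n] = $3
    next
}
$1 in cnt {                        # file2: test against every stored range for this key
    for (i = 1; i <= cnt[$1]; i++)
        if ($2 <= hi[$1, i] && lo[$1, i] <= $3) {   # the ranges intersect
            print
            break
        }
}' file1 file2 > file3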

Related

Complete file2 with data from file1

I have two files with fields separated with tabs:
File1 has 13 columns and 90 million lines (~5 GB). The number of lines in file1 is always smaller than the number of lines in file2.
1 1 27 0 2 0 0 1 0 0 0 1 false
1 2 33 0 3 0 0 0 0 0 0 1 false
1 5 84 3 0 0 0 0 0 0 0 2 false
1 6 41 0 1 0 0 0 0 0 0 1 false
1 7 8 4 0 0 0 0 0 0 0 1 false
File2 has 2 columns and 100 million lines (~1.3 GB):
1 1
1 2
1 3
1 4
1 5
What I want to achieve:
When the pair of columns $1/$2 in file2 is identical to the pair $1/$2 in file1, I would like to print $1 and $2 from file2 and $3 from file1 into an output file. In addition, if the pair $1/$2 of file2 has no match in file1, print $1/$2 in the output and leave the 3rd column empty. Thus, the output keeps the same structure (number of lines) as file2.
If relevant: the pairs $1/$2 are unique in both file1 and file2, and both files are sorted by $1 first and then by $2.
Output file:
1 1 27
1 2 33
1 3 45
1 4
1 5 84
What I have done so far:
awk -F"\t" 'NR == FNR {a[$1 "\t" $2] = $3; next } { print $0 "\t" a[$1 "\t" $2] }' file1 file2 > output
The command runs for a few minutes and then stops unexpectedly without any further information. When I open the output file, the first 5 to 6 million lines have been processed correctly (I can see the 3rd column that was correctly added), but the rest of the output file does not have a 3rd column. I am running this command on a 3.2 GHz Intel Core i5 with 32 GB of 1600 MHz DDR3. Any idea why the command stops? Thanks for your help.
You are close.
I would do something like this:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
key in seen {print key, seen[key]}
' file1 file2
Or, since file1 is bigger, reverse which file is held in memory:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]; next}
key in seen {print key, $3}
' file2 file1
You could also use join which will likely handle files much larger than memory. This is BSD join that can use multiple fields for the join:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2
join requires the files be sorted, as your example is. If not sorted, you could do:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 <(sort -n file1) <(sort -n file2)
Or, if your join can only use a single field, you can temporarily use ' ' as the separator between fields 2 and 3 of file1 and set join to use that as its delimiter:
join -1 1 -2 1 -t $' ' -o 1.1,2.2 <(sort -k1n -k2n file2) <(awk '{printf("%s\t%s %s\n",$1,$2,$3)}' file1 | sort -k1n -k2n) | sed 's/[ ]/\t/'
Either awk or join prints:
1 1 27
1 2 33
1 3 45
1 4 7
1 5 84
Your comment:
After additional investigation, the suggested solutions did not work because my question was not properly asked (my mistake). The suggested solutions printed lines only when matches between pairs ($1/$2) were found in both file1 and file2. Thus, the resulting output file always has the number of lines of file1 (which is always smaller than file2). I want the output file to keep the same structure as file2, that is, the same number of lines (for further comparison). The question was refined accordingly.
If your computer can handle the file sizes:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' file1 file2
Otherwise you can filter file1 so that only the matches are fed to awk from file1, and then file2 dictates the final output structure:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' <(join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2) file2
If you still need something more memory-efficient, I would break out ruby for a line-by-line solution:
ruby -e 'f1=File.open(ARGV[0]); f2=File.open(ARGV[1])
l1=f1.gets                                    # prime the first line of file1
f2.each { |l2|                                # walk file2 line by line
  l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)  # current file1 key plus the 3rd column
  l2a=l2.chomp.split(/\t/).map(&:to_i)        # current file2 key
  while((tst=l1a[0..1]<=>l2a)<0 && !f1.eof?)  # advance file1 while its key sorts before the file2 key
    l1=f1.gets
    l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)
  end
  if tst==0                                   # keys match: append the 3rd column from file1
    l2a << l1a[2]
  end
  puts l2a.join("\t")
}
' file1 file2
Issues with OP's current awk code:
testing shows loading file1 into memory (a[$1 "\t" $2] = $3) requires ~290 bytes per entry; for 90 million rows this works out to ~26 GBytes; this amount of memory usage should not be an issue in OP's system (max of 32 GBytes) ... assuming all other processes are not consuming 6+ GBytes; having said that ...
in the 2nd half of OP's script (ie, file2 processing) the print/a[$1 "\t" $2] will actually create a new array entry if one doesn't already exist (ie, if file2 key not found in file1 then create a new array entry); since we know this situation can occur we have to take into consideration the amount of memory required to store an entry from file2 in the a[] array ...
testing shows loading file2 into memory (a[$1 "\t" $2] = $3) requires ~190 bytes per entry; for 100 million rows this works out to ~19 GBytes; 'course we won't be loading all of file2 into the a[] array, so the total additional memory will be less than 19 GBytes; then again, it only takes about 26 million rows from file2 added to the a[] array to use up the remaining ~5 GBytes (26 million * 190 bytes) and run out of memory
OP has mentioned that processing 'stops' after generating 5-6 million rows of output; this symptom (a 'stopped' process) can occur when the system runs out of memory and/or goes into heavy swapping (also a memory issue); with 100 million rows in file2 and only 5-6 million rows in the output, that leaves 94-95 million rows from file2 unaccounted for, which in turn is considerably more than the 26 million rows it would take to use up the remaining ~5 GBytes of memory ...
net result: OP's current awk script is likely hanging/stopped due to running out of memory
net result: we need to look at solutions that keep us from running out of memory; better would be solutions that use considerably less memory than the current awk code; even better would be solutions that use little (effectively 'no') memory at all ...
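For anyone wanting to reproduce those per-entry figures, a rough measurement is possible with GNU awk on Linux (this is only a sketch: it reads the process's own VmRSS from /proc, and the bytes-per-entry number varies with the awk build and platform):
seq 1 1000000 |
awk '
{ a[$1 "\t" $1] = $1 }                              # build one million array entries
END {
    status = "/proc/" PROCINFO["pid"] "/status"     # gawk exposes its own PID in PROCINFO
    while ((getline line < status) > 0)
        if (line ~ /^VmRSS/) print line             # resident memory after loading the array
}'
# divide the reported kB figure by 1,000,000 for a rough per-entry estimate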
Assumptions/understandings:
both input files have already been sorted by the 1st and 2nd columns
within a given file the combination of the 1st and 2nd columns represents a unique key (ie, there are no duplicate keys in a file)
all lines from file2 are to be written to stdout while an optional 3rd column will be derived from file1 (when the key exists in both files)
General approach for a 'merge join' operation:
read from both files in parallel
compare the 1st/2nd columns from both files to determine what to print
memory usage should be minimal since we never have more than 2 lines (one from each file) loaded in memory
One awk idea for implementing a 'merge join' operation:
awk -v lookup="file1" ' # assign awk variable "lookup" the name of the file where we will obtain the optional 3rd column from
function get_lookup() { # function to read a new line from the "lookup" file
rc=getline line < lookup # read next line from "lookup" file into variable "line"; rc==1 if successful; rc==0 if reached end of file
if (rc) split(line,f) # if successful (rc==1) then split "line" into array f[]
}
BEGIN { FS=OFS="\t" # define input/output field delimiters
get_lookup() # read first line from "lookup" file
}
{ c3="" # set optional 3rd column to blank
while (rc) { # while we have a valid line from the "lookup" file look for a matching line ...
if ($1< f[1] ) { break }
else if ($1==f[1] && $2< f[2]) { break }
else if ($1==f[1] && $2==f[2]) { c3= OFS f[3]; get_lookup(); break }
# else if ($1==f[1] && $2> f[2]) { get_lookup() }
# else if ($1> f[1] ) { get_lookup() }
else { get_lookup() }
}
print $0 c3 # print current line plus optional 3rd column
}
' file2
This generates:
1 1 27
1 2 33
1 3
1 4
1 5 84

How to print lines with multiple associative arrays and conditions using awk

I want to print all lines from file 1 where the values of $1 and $4 are found in $1 and $4 of file 2 AND where the value in file 1 $2 is greater than or equal to the value in file 2 $2 AND where the value in file 1 $3 is less than or equal to the value in file 2 $3.
file 1
1 110201809 117658766 a
1 168095261 182305990 b
1 215456074 233436403 c
2 9465687 12905490 d
2 28765309 35235120 e
2 48958595 64702082 f
file 2
1 245371026 249210707 a
2 937388 46504962 h
2 937388 162731186 b
2 2954974 6777829 c
2 9465687 12996275 d
2 14539477 44757554 d
2 14766820 30080818 m
2 16531332 23584565 n
2 17340076 26206255 o
2 18535880 24452180 p
2 28830071 35289330 q
2 36206662 47273732 r
2 48958495 64703082 f
Desired output only prints the lines from file 1 that meet the condition.
desired output
2 9465687 12905490 d
2 48958595 64702082 f
I've tried the following which gave an empty file:
awk 'NR==FNR{ a[$1,$4]= $0; b[$2] = $2 ; c[$3] = $3; next } ($1 $4 in a) && ($2 >= b[$2]) && ($3 <= c[$3])' file2 file1>desired output
I would do this by collecting the second and third columns in separate hashes, e.g.:
parse.awk
NR==FNR {
g[$1,$4] = $2
h[$1,$4] = $3
next
}
($1 SUBSEP $4 in g) && g[$1,$4] >= $2 && h[$1,$4] <= $3
Run it like this:
awk -f parse.awk file1 file2
Output:
2 9465687 12996275 d
2 48958495 64703082 f
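For reference, $1 SUBSEP $4 in g is just the explicit spelling of awk's multi-subscript membership test: comma-separated subscripts are joined with SUBSEP (by default "\034"). A standalone one-liner, unrelated to these files, confirms the equivalence:
awk 'BEGIN { a[1,"x"] = 1; print ((1,"x") in a), ((1 SUBSEP "x") in a) }'
This prints 1 1, showing that both forms test the same key.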

How to grep ranges of numeric sequences from a column that contains several sequences

I'm new to writing bash scripts and have the following question: how can I extract the ranges (first and last value) from a column that contains several incremental and decremental numeric sequences? Each sequence increases or decreases by 3, and a jump to the next sequence occurs once the step is >3, e.g.:
1
4
7
20
23
26
100
97
94
The required output is:
1,7
20,26
100,94
Using awk:
$ awk 'NR==1||sqrt(($0-p)*($0-p))>3{print p; printf "%s", $0 ", "} {p=$0} END{print $0}' file
1, 7
20, 26
100, 94
Explained:
NR==1 || sqrt(($0-p)*($0-p))>3 { # if the abs($0-previous) > 3
print p # print previous to end a sequence and
printf "%s", $0 ", " # start a new sequence
}
{ p=$0 }
END { print $0 }
This awk script gives you the expected output:
awk '{v=$NF}                     # v is the value in the last (only) field
NR==1{printf "%s,",v;p=v;next}   # the first line opens the first range
(p-v)*(p-v)==9{p=v;next}         # a step of exactly 3 (up or down) continues the sequence
{printf "%s\n%s,",p,v;p=v}       # a bigger step closes the old range and opens a new one
END{print v}' file               # the last value closes the final range
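To try either script on the sample column, the values can first be dropped into a scratch file (the name file is just an example here):
printf '%s\n' 1 4 7 20 23 26 100 97 94 > file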

nested for loops in awk to count number of fields matching values

I have a file with two columns (1.4 million rows) that looks like:
CLM MXL
0 0
0 1
1 1
1 1
0 0
29 42
0 0
30 15
I would like to count the instances of each possible combination of values; for example, if there are x lines where column CLM equals 0 and column MXL equals 1, I would like to print:
0 1 x
Since the maximum value of column CLM is 188 and the maximum value of column MXL is 128, I am trying to use a nested for loop in awk that looks something like:
awk '{for (i=0; i<=188; i++) {for (j=0; j<=128; j++) {if($9==i && $10==j) {print$0}}}}' 1000Genomes.ALL.new.txt > test
But this only prints out the original file, which makes sense; I just don't know how to correctly write a loop that prints one file per combination of values (which I could then wc), or one file with the counts of each combination. Any solution in awk, a bash script, or a perl script would be great.
1. A Pure awk Solution
$ awk 'NR>1{c[$0]++} END{for (k in c)print k,c[k]}' file | sort -n
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
The code uses a single variable c. c is an associative array whose keys are lines in the file and whose values are the number of occurrences.
NR>1{c[$0]++}
For every line except the first (which has the headings), this increments the count for the combination in that line.
END{for (k in c)print k,c[k]}
This prints out the final counts.
sort -n
This is just for aesthetics: it puts the output lines in a predictable order.
2. Alternative using uniq -c
$ tail -n+2 file | sort -n | uniq -c | awk '{print $2,$3,$1}'
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
tail -n+2 file
This prints all but the first line of the file. The purpose of this is to remove the column headings.
sort -n | uniq -c
This sorts the lines and then counts the duplicates.
awk '{print $2,$3,$1}'
uniq -c puts the counts first and you wanted the counts to be the last on the line. This just rearranges the columns to the format that you wanted.
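For reference, on the sample data the intermediate output of the first two stages (before the final awk rearranges the columns) looks like this; the width of the count column depends on your uniq:
$ tail -n+2 file | sort -n | uniq -c
      3 0 0
      1 0 1
      2 1 1
      1 29 42
      1 30 15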

awk, declare array embracing FNR and field, output

I would like to declare an array covering a certain number of lines, say from line 10 to line 78, as an example. It could be any other numbers; this is just an example.
My sample gives me that range of lines on stdout but puts a "1" in between those lines. Can anybody tell me how to get rid of that "1"?
The sample below writes to stdout and covers the named lines.
awk '
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
{print myarr["range-one"]};' /home/$USER/uplog.txt;
That is giving me this output:
0
12:33:49 up 3:57, 2 users, load average: 0,61, 0,37, 0,22 21.06.2014
1
12:42:02 up 4:06, 2 users, load average: 0,14, 0,18, 0,19 21.06.2014
1
12:42:29 up 4:06, 2 users, load average: 0,09, 0,17, 0,19 21.06.2014
1
12:43:09 up 4:07, 2 users, load average: 0,09, 0,16, 0,19 21.06.2014
1
Second question: how do I store one field of each line (FNR) in that array?
When I do it this way, the field I wanted does show up:
awk ' NR~/^1$/ , NR~/^7$/ {print $3, $11; next} ; ' /home/$USER/uplog.txt;
But I need an array, that's why I'm asking. Any hints? Thanks in advance.
What the example script does
awk '
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
{print myarr["range-one"]};'
Your script is one of the more convoluted and decidedly less-than-obvious pieces of awk that I've ever seen. Let's take a simple input file:
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Line 12
The output from that is:
0
Line 2
1
Line 3
1
Line 4
1
Line 5
1
Line 6
1
Line 7
1
Line 8
1
0
0
0
0
Dissecting your script, it appears that the first line:
myarr["range-one"]=NR~/^2$/ , NR~/^8$/;
is equivalent to:
myarr["range-one"] = (NR ~ /^#$/, NR ~ /^8$/) { print }
That is, the value assigned to myarr["range-one"] is 1 inside the range of line numbers where NR is equal to 2 and is equal to 8, and 0 outside that range; further, when the value is 1, the line is printed.
The second line:
{print myarr["range-one"]};
prints the value in myarr["range-one"] for each line of input. Thus, on the first line, the value 0 is printed. For lines 2 to 8, the line itself is printed followed by the value 1; for lines after that, the value 0 is printed once more.
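Written as a plain range pattern, without the stray assignment, selecting and printing lines 2 through 8 is simply:
awk 'NR == 2, NR == 8' /home/$USER/uplog.txt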
What the question asks for
The question is not clear. It appears that lines 10 to 78 should be printed. In awk, there are essentially no variable declarations (we can debate about function parameters, but functions don't seem to figure in this). Therefore, declaring an array is not an option.
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { print }'
This would print the lines between line 10 and line 78. It would be feasible to save the values in an array (a in the examples below). Said array could be indexed by NR or with a separate index starting at 0 or 1:
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[NR] = $0 }' # Indexed by line number
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[i++] = $0 }' # Indexed from 0
awk -v lo=10 -v hi=78 'NR >= lo && NR <= hi { a[++i] = $0 }' # Indexed from 1
Presumably, you'd also have an END block to do something with the data.
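A sketch of what such an END block might look like (purely illustrative; it just prints the saved lines back out, prefixed with their line numbers):
awk -v lo=10 -v hi=78 '
NR >= lo && NR <= hi { a[NR] = $0 }     # save the wanted lines, keyed by line number
END {
    for (nr = lo; nr <= hi; nr++)       # replay them in order
        if (nr in a)
            print nr ": " a[nr]
}' /home/$USER/uplog.txt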
The semicolons in the original are both unnecessary. The blank line is ignored, of course.
