Merge multiple files by common field - Unix - file

I have hundreds of files, each with two columns.
For example:
file1.txt
ID Value1
1 40
2 30
3 70
file2.txt
ID Value2
1 50
2 70
3 20
And so on, till
file150.txt
ID Value150
1 98
2 52
3 71
How do I merge these files based on the first column (which is common)? My output should be:
ID Value1 Value2...........Value150
1 40 50 98
2 30 70 52
3 70 20 71
Thank you.

One approach is a cut and paste combination, which works for three or more files. cd to a folder that contains only file1, file2, file3, ... file150:
i=0
cut -f 1 file1 > delim        # keep the first column (the IDs) as the leading column
for file in file*
do
    i=$(($i+1))               # counter distinguishes the temp files from the originals
    cut -f 2 "$file" > "${file}__${i}.temp"
done
paste -d'\t' delim file*__*.temp > output
Note that both globs expand in the same lexical order, so each value column stays aligned with its file, but the columns come out in lexical order (file1, file10, file100, ...), not numeric order.
Another solution is to use join, merging two files at a time. Keep the piped result (-) as the first input so the columns stay in their original order:
join -j 1 file1 file2 | join -j 1 - file3 | join -j 1 - file4
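A single awk pass can also build the merged table. This is only a sketch: it assumes tab-separated columns (as in the cut/paste answer above) and that every file lists the same IDs in the same order, as in the example; adjust FS/OFS for space-separated data.
awk 'BEGIN { FS = OFS = "\t" }
FNR == 1 { nf++ }                                 # starting a new input file
{ row[FNR] = (nf == 1 ? $0 : row[FNR] OFS $2) }   # keep file1 whole, append column 2 of later files
END { for (i = 1; i <= FNR; i++) print row[i] }   # FNR here is the line count of the last file
' file{1..150}.txt > merged.txt
The brace expansion file{1..150}.txt keeps the value columns in numeric file order; a plain file* glob would sort them lexically (file1, file10, file100, ...).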

Related

Brace expansion in zsh - how to concatenate two lists for file selection?

In zsh, one can write an expression of the form {n..m}, for example to select files n to m inside a folder.
For example, {1..50} selects items, files, etc. from 1 to 50.
How can I concatenate two brace expansions into one?
Example: I would like to select {1..50} and {60..100} in one and the same expression.
You can nest brace expansions, so this will work:
> print {{1..50},{60..100}}
1 2 3 (lots of numbers) 49 50 60 61 (more numbers) 99 100
Brace expansions support lists as well as sequences, and can be included in strings:
> print -l file{A,B,WORD,{R..T}}.txt
fileA.txt
fileB.txt
fileWORD.txt
fileR.txt
fileS.txt
fileT.txt
Note that brace expansions are not glob patterns. The {n..m} expansion will include every value between the start and end values, regardless of whether a file exists by that name. For finding files in folders, the <-> glob expression will usually work better:
> touch 2 3 55 89
> ls -l <1-50> <60-100>
-rw-r--r-- 1 me grp 0 Feb 18 06:52 2
-rw-r--r-- 1 me grp 0 Feb 18 06:52 3
-rw-r--r-- 1 me grp 0 Feb 18 06:52 89
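If the two ranges should behave as a single glob, zsh's exclusion operator can express that as well; a small sketch, assuming EXTENDED_GLOB is set:
setopt extendedglob
print -l <1-100>~<51-59>    # existing files named 1-100, excluding 51-59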

Complete file2 with data from file1

I have two files with fields separated with tabs:
File1 has 13 columns and 90 million lines (~5 GB). The number of lines of file1 is always smaller than the number of lines of file2.
1 1 27 0 2 0 0 1 0 0 0 1 false
1 2 33 0 3 0 0 0 0 0 0 1 false
1 5 84 3 0 0 0 0 0 0 0 2 false
1 6 41 0 1 0 0 0 0 0 0 1 false
1 7 8 4 0 0 0 0 0 0 0 1 false
File2 has 2 columns and 100 million lines (1.3 GB):
1 1
1 2
1 3
1 4
1 5
What I want to achieve:
When the pair of columns $1/$2 in file2 is identical to the pair $1/$2 in file1, I would like to print $1 and $2 from file2 and $3 from file1 into an output file. In addition, if the pair $1/$2 of file2 does not have a match in file1, print $1/$2 in the output and leave the 3rd column empty. Thus, the output keeps the same structure (number of lines) as file2.
If relevant: the pairs $1/$2 are unique in both file1 and file2, and both files are sorted by $1 first and then by $2.
Output file:
1 1 27
1 2 33
1 3 45
1 4
1 5 84
What I have done so far:
awk -F"\t" 'NR == FNR {a[$1 "\t" $2] = $3; next } { print $0 "\t" a[$1 "\t" $2] }' file1 file2 > output
The command runs for a few minutes and then unexpectedly stops without any additional information. When I open the output file, the first 5 to 6 million lines have been processed correctly (I can see the 3rd column that was correctly added), but the rest of the output file does not have a 3rd column. I am running this command on a 3.2 GHz Intel Core i5 with 32 GB of 1600 MHz DDR3. Any ideas why the command stops? Thanks for your help.
You are close.
I would do something like this:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
key in seen {print key, seen[key]}
' file1 file2
Or, since file1 is bigger, reverse which file is held in memory:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]; next}
key in seen {print key, $3}
' file2 file1
You could also use join, which will likely handle files much larger than memory. This is BSD join, which can use multiple fields for the join:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2
join requires the files be sorted, as your example is. If not sorted, you could do:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 <(sort -n file1) <(sort -n file2)
Or, if your join can only use a single field, you can temporarily use ' ' as the field separator between fields 2 and 3 and tell join to use that as the delimiter:
join -1 1 -2 1 -t $' ' -o 1.1,2.2 <(sort -k1n -k2n file2) <(awk '{printf("%s\t%s %s\n",$1,$2,$3)}' file1 | sort -k1n -k2n) | sed 's/[ ]/\t/'
Either awk or join prints:
1 1 27
1 2 33
1 3 45
1 4 7
1 5 84
Your comment:
After additional investigation, the suggested solutions did not work because my question was not properly asked (my mistake). The suggested solutions printed lines only when matches between pairs ($1/$2) were found between file1 and file2. Thus, the resulting output file always has the number of lines of file1 (which is always smaller than file2). I want the output file to keep the same structure as file2, that is, the same number of lines (for further comparison). The question was refined accordingly.
If your computer can handle the file sizes:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' file1 file2
Otherwise you can filter file1 so that only the matches are fed to awk from file1, and then file2 dictates the final output structure:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' <(join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2) file2
If you still need something more memory-efficient, I would break out Ruby for a line-by-line solution:
ruby -e 'f1 = File.open(ARGV[0]); f2 = File.open(ARGV[1])
l1 = f1.gets
f2.each { |l2|
  l1a = l1.chomp.split(/\t/)[0..2].map(&:to_i)   # key columns + value from file1
  l2a = l2.chomp.split(/\t/).map(&:to_i)         # key columns from file2
  while ((tst = l1a[0..1] <=> l2a) < 0 && !f1.eof?)
    l1 = f1.gets                                 # advance file1 until its key >= file2 key
    l1a = l1.chomp.split(/\t/)[0..2].map(&:to_i)
  end
  l2a << l1a[2] if tst == 0                      # append file1 value on an exact key match
  puts l2a.join("\t")
}
' file1 file2
Issues with OP's current awk code:
Testing shows that loading file1 into memory (a[$1 "\t" $2] = $3) requires ~290 bytes per entry; for 90 million rows this works out to ~26 GB. That amount of memory usage should not, by itself, be an issue on OP's system (32 GB max), assuming all other processes are not consuming 6+ GB. Having said that ...
In the 2nd half of OP's script (ie, the file2 processing), the reference a[$1 "\t" $2] in the print statement will actually create a new (empty) array entry if one doesn't already exist (ie, if a file2 key is not found in file1, a new array entry is created). Since we know this situation can occur, we also have to account for the memory required to store file2 entries in the a[] array ...
Testing shows that loading file2 entries into memory requires ~190 bytes per entry; for 100 million rows this would work out to ~19 GB. Of course not all of file2 ends up in the a[] array, so the total additional memory will be less than 19 GB; then again, it only takes about 26 million rows from file2 (26 million * 190 bytes ≈ 5 GB) to use up the remaining memory.
OP has mentioned that processing 'stops' after generating 5-6 million rows of output; this symptom (a 'stopped' process) can occur when the system runs out of memory and/or goes into heavy swapping (also a memory issue). With 100 million rows in file2 and only 5-6 million rows in the output, that leaves 94-95 million rows from file2 unaccounted for, which is considerably more than the 26 million rows it would take to exhaust the remaining ~5 GB of memory ...
net result: OP's current awk script is likely hanging/stopped due to running out of memory
net result: we need to look at solutions that keep us from running out of memory; better would be solutions that use considerably less memory than the current awk code; even better would be solutions that use little (effectively 'no') memory at all ...
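One rough way to reproduce the per-entry measurements above (a sketch; assumes GNU time installed as /usr/bin/time and GNU awk, where length() works on arrays):
head -n 1000000 file1 > sample                      # 1M-row sample
/usr/bin/time -v awk -F'\t' '{a[$1 "\t" $2] = $3} END{print length(a)}' sample
# divide the reported "Maximum resident set size" by the printed entry count
# to estimate bytes per entry (awk's own baseline memory adds a small offset)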
Assumptions/understandings:
both input files have already been sorted by the 1st and 2nd columns
within a given file the combination of the 1st and 2nd columns represents a unique key (ie, there are no duplicate keys in a file)
all lines from file2 are to be written to stdout while an optional 3rd column will be derived from file1 (when the key exists in both files)
General approach for a 'merge join' operation:
read from both files in parallel
compare the 1st/2nd columns from both files to determine what to print
memory usage should be minimal since we never have more than 2 lines (one from each file) loaded in memory
One awk idea for implementing a 'merge join' operation:
awk -v lookup="file1" ' # assign awk variable "lookup" the name of the file where we will obtain the optional 3rd column from
function get_lookup() { # function to read a new line from the "lookup" file
rc=getline line < lookup # read next line from "lookup" file into variable "line"; rc==1 if successful; rc==0 if reached end of file
if (rc) split(line,f) # if successful (rc==1) then split "line" into array f[]
}
BEGIN { FS=OFS="\t" # define input/output field delimiters
get_lookup() # read first line from "lookup" file
}
{ c3="" # set optional 3rd column to blank
while (rc) { # while we have a valid line from the "lookup" file look for a matching line ...
if ($1< f[1] ) { break }
else if ($1==f[1] && $2< f[2]) { break }
else if ($1==f[1] && $2==f[2]) { c3= OFS f[3]; get_lookup(); break }
# else if ($1==f[1] && $2> f[2]) { get_lookup() }
# else if ($1> f[1] ) { get_lookup() }
else { get_lookup() }
}
print $0 c3 # print current line plus optional 3rd column
}
' file2
This generates:
1 1 27
1 2 33
1 3
1 4
1 5 84
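If an external-sort approach is acceptable, a similar low-memory merge can be sketched with standard tools on a composite key. This is only a sketch: it assumes GNU join/sort, bash or zsh for the <( ) process substitutions, tab-separated input, and that ':' does not otherwise appear in the data.
join -t $'\t' -a 2 -e '' -o 0,1.2 \
    <(awk -F'\t' '{print $1 ":" $2 "\t" $3}' file1 | sort -t $'\t' -k1,1) \
    <(awk -F'\t' '{print $1 ":" $2}' file2 | sort -t $'\t' -k1,1) |
tr ':' '\t'       # split the composite key back into two columns
Here -a 2 keeps the unpaired file2 lines and -e '' leaves their 3rd column empty; the output comes back in lexicographic key order, so append a final sort -t $'\t' -k1,1n -k2,2n if the original numeric order matters.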

awk lookup table, blank column replacement

I'm trying to use a lookup table to do a search and replace involving two specific columns, and I keep getting a blank column as output. I've followed the syntax of several lookup-table examples that I've found on Stack Overflow, but no joy. Here is a snippet from each of the files.
Sample lookup table -- want to search for instances of column 1 in my data file and replace them with the corresponding value in column 2 (first row is a header):
#xyz type
N 400
C13 401
13A 402
13B 402
13C 402
C14 405
The source file to be substituted has the following format:
1 N 0.293000 2.545000 16.605000 0 2 6 10 14
2 C13 0.197000 2.816000 15.141000 0 1
3 13A 1.173000 2.887000 14.676000 0
4 13B -0.319000 3.756000 14.937000 0
5 13C -0.351000 1.998000 14.678000 0
6 C14 0.749000 3.776000 17.277000 0 1
The corresponding values in column 2 of the lookup table will replace the values in column 6 of my source file (currently all zeroes). Here's the awk one-liner that I thought should work:
awk -v OFS='\t' 'NR==1 { next } FNR==NR { a[$1]=$2; next } $2 in a { $6=a[$1] }1' lookup.txt source.txt
But my output essentially deletes the entire entry for column 6:
1 N 0.293000 2.545000 16.605000 2 6 10 14
2 C13 0.197000 2.816000 15.141000 1
3 13A 1.173000 2.887000 14.676000
4 13B -0.319000 3.756000 14.937000
5 13C -0.351000 1.998000 14.678000
6 C14 0.749000 3.776000 17.277000 1
(The sixth column should be 400 to 405.) I considered using sed, but I have duplicate values in the source and output columns of my lookup table, so that won't work in this case. What's frustrating is that I had this one-liner working on almost the exact same source file the other week, but now I can only get this behavior. I'd love to be able to modify my awk call to do lookups on two different columns simultaneously, but wanted to start simple for now. Thanks!
You have $6=a[$1] instead of $6=a[$2] in your script.
$ awk -v OFS='\t' 'NR==FNR{map[$1]=$2; next} {$6=map[$2]} 1' file1 file2
1 N 0.293000 2.545000 16.605000 400 2 6 10 14
2 C13 0.197000 2.816000 15.141000 401 1
3 13A 1.173000 2.887000 14.676000 402
4 13B -0.319000 3.756000 14.937000 402
5 13C -0.351000 1.998000 14.678000 402
6 C14 0.749000 3.776000 17.277000 405 1
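To extend this to two simultaneous lookups, as mentioned at the end of the question, the same map can be applied to a second column. A hedged sketch, where the choice of column 7 as the second key and column 8 as its target is purely hypothetical:
awk -v OFS='\t' '
NR == FNR { map[$1] = $2; next }        # load lookup: key -> replacement
{
    if ($2 in map) $6 = map[$2]         # first lookup, as in the answer above
    if ($7 in map) $8 = map[$7]         # hypothetical second lookup column
}
1' lookup.txt source.txt
The in tests also leave a column unchanged when its key is missing from the lookup table, instead of blanking it.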

Print all lines of file and matching lines from other file

I have file1 and file2. I want to print all lines of file1 and, if columns 1 and 2 of file1 match columns 1 and 2 of file2, append that line from file2 to the line of file1.
File1:
1 30 40 name info
1 3 2 desc info
1 3 2 id info
10 35 45 name info
File2:
20 30 40 numbers desc
1 3 2 desc name
Result:
1 30 40 name info -
1 3 2 desc info desc name
1 3 2 id info desc name
10 35 45 name info -
I did this code:
awk 'NR==FNR {h[$1,$2]=$0;next}{print h[$1,$2],$0}' file1.txt file2.txt > result.txt
But it only prints lines that match and I want all lines.
This awk one-liner should help:
awk '{k = $1 FS $2}
NR==FNR {a[k] = $4 FS $5; next}
{printf "%s %s\n", $0, (k in a ? a[k] : "-")}' file2 file1
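A small variant, if the whole matching file2 line (rather than only its 4th and 5th columns) should be appended; a sketch under the same assumptions as above:
awk '{k = $1 FS $2}
NR==FNR {a[k] = $0; next}
{print $0, (k in a ? a[k] : "-")}' file2 file1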

How to split a large file into small ones by line number

I am trying to split my large file into small pieces using line numbers. For example, my file has 30,000,000 lines and I would like to divide it into small files, each of which has 10,000 lines (equivalent to 3,000 small files).
I used 'split' in Unix, but it seems to be limited to only 100 files.
Is there a way of overcoming this limitation of 100 files?
If there is another way of doing this, please advise as well.
Thanks.
Using GNU awk:
gawk '
BEGIN {
    i=1
}
{
    print $0 > "small"i".txt"
}
NR%10==0 {                        # every 10 lines for this demo; use NR%10000==0 for 10,000-line pieces
    close("small"i".txt"); i++    # close the file actually being written, then bump the counter
}' bigfile.txt
Test:
[jaypal:~/temp] seq 100 > bigfile.txt
[jaypal:~/temp] gawk 'BEGIN {i=1} {print $0 > "small"i".txt" } NR%10==0 { close("small"i".txt"); i++ }' bigfile.txt
[jaypal:~/temp] ls small*
small1.txt small10.txt small2.txt small3.txt small4.txt small5.txt small6.txt small7.txt small8.txt small9.txt
[jaypal:~/temp] cat small1.txt
1
2
3
4
5
6
7
8
9
10
[jaypal:~/temp] cat small10.txt
91
92
93
94
95
96
97
98
99
100
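For completeness, split itself can also do this once the suffix length is widened; a sketch assuming a reasonably recent GNU split (the default two-character suffix is what caps the number of output files):
split -l 10000 -d -a 4 --additional-suffix=.txt bigfile.txt small
# -l 10000: 10,000 lines per piece; -d -a 4: numeric suffixes, giving small0000.txt ... small2999.txt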
Not an answer, just adding a way to do the renaming part, as requested in a comment:
$ touch 000{1..5}.txt
$ ls
0001.txt 0002.txt 0003.txt 0004.txt 0005.txt
$ rename 's/^0*//' *.txt
$ ls
1.txt 2.txt 3.txt 4.txt 5.txt
I also tried the above with 3000 files without any problems.
