Grep rows from reference file while keeping the source column?

I have two tables. Table 1 has multiple columns and table 2 has one column. My question is: how can I extract rows from table 1 based on the values in table 2? I guess a simple grep should work, but how can I grep on each row? I would like the output to retain the table 2 identifier that matched.
Thanks!
Desired Output:
IPI00004233 IPI00514755;IPI00004233;IPI00106646; Q9BRK5-1;Q9BRK5-2;
IPI00001849 IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
...
......
Table 1:
IPI00436567; Q6VEP3;
IPI00169105;IPI01010102; Q8NH21;
IPI00465263; Q6IEY1;
IPI00465263; Q6IEY1;
IPI00478224; A6NHI5;
IPI00853584;IPI00000733;IPI00166122; Q96NU1-1;Q96NU1-2;
IPI00411886;IPI00921079;IPI00385785; Q9Y3T9;
IPI01010975;IPI00418437;IPI01013997;IPI00329191; Q6TDP4;
IPI00644132;IPI00844469;IPI00030240; Q494U1-1;Q494U1-2;
IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
IPI00966381;IPI00917954;IPI00028151; Q9HCC6;
IPI00375631; P05161;
IPI00374563;IPI00514026;IPI00976820; O00468;
IPI00908418; E7ERA6;
IPI00062955;IPI00002821;IPI00909677; Q96HA4-1;Q96HA4-2;
IPI00641937;IPI00790556;IPI00889194; Q6ZVT0-1;Q6ZVT0-2;Q6ZVT0-3;
IPI00001796;IPI00375404;IPI00217555; Q9Y5U5-1;Q9Y5U5-2;Q9Y5U5-3;
IPI00515079;IPI00018859; P43489;
IPI00514755;IPI00004233;IPI00106646; Q9BRK5-1;Q9BRK5-2;
IPI00064848; Q96L58;
IPI00373976; Q5T7M4;
IPI00375728;IPI86;IPI00383350; Q8N2K1-1;Q8N2K1-2;
IPI01022053;IPI00514605;IPI00514599; P51172-1;P51172-2;
Table 2:
IPI00000207
IPI00000728
IPI00000733
IPI00000846
IPI00000893
IPI00001849
IPI00002214
IPI00002335
IPI00002349
IPI00002821
IPI00003362
IPI00003419
IPI00003865
IPI00004233
IPI00004399
IPI00004795
IPI00004977

grep cannot prepend the matching needle to its output, so there is no way to simply use -f file2.
Use a loop and prepend manually:
while read -r token; do grep "$token" file1 | xargs -I{} echo "$token" {}; done < file2
Alternatively, you could store the results of both grep and grep -o and paste them together:
grep -f 2.txt 1.txt >a
grep -of 2.txt 1.txt >b
paste b a
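On the sample data, b holds the matched identifiers and a the full matching lines, so paste b a yields tab-separated pairs such as:
IPI00000733	IPI00853584;IPI00000733;IPI00166122; Q96NU1-1;Q96NU1-2;
IPI00001849	IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
(The two files line up as long as each table 1 line matches at most one needle.)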
If you're also fine with using awk, try this:
awk 'FNR==NR { a[$0];next } { for (x in a) if ($0 ~ x) print x, $0 }' 2.txt 1.txt
Explanation: for the first file (i.e. while FNR==NR), store every needle in array a ({ a[$0]; next }). Then (implicitly) loop over all lines of the second file, loop over all needles, and print the needle and the line whenever the line contains the needle.
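On the sample tables this should print, in table 1 order:
IPI00000733 IPI00853584;IPI00000733;IPI00166122; Q96NU1-1;Q96NU1-2;
IPI00001849 IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
IPI00002821 IPI00062955;IPI00002821;IPI00909677; Q96HA4-1;Q96HA4-2;
IPI00004233 IPI00514755;IPI00004233;IPI00106646; Q9BRK5-1;Q9BRK5-2;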

Related

Output results from cat into different files with names specified into an array

I would like to run cat on several files whose names are stored in an array:
cat $input | grep -v "#" | cut -f 1,2,3
Here the content of the array:
echo $input
1.blastp 2.blastp 3.blastp 4.blastp 5.blastp 6.blastp 7.blastp 8.blastp 9.blastp 10.blastp 11.blastp 12.blastp 13.blastp 14.blastp 15.blastp 16.blastp 17.blastp 18.blastp 19.blastp 20.blastp
This works just nicely. Now I am struggling to store the results in proper output files. So I also want to store the output in files whose names are stored in another array:
echo $out_in
1_pairs.tab 2_pairs.tab 3_pairs.tab 4_pairs.tab 5_pairs.tab 6_pairs.tab 7_pairs.tab 8_pairs.tab 9_pairs.tab 10_pairs.tab 11_pairs.tab 12_pairs.tab 13_pairs.tab 14_pairs.tab 15_pairs.tab 16_pairs.tab 17_pairs.tab 18_pairs.tab 19_pairs.tab 20_pairs.tab
cat $input | grep -v "#" | cut -f 1,2,3 > "$out_in"
My problem is:
When I don't use the quotes, I get an 'ambiguous redirect' error.
When I use them, a single file is created whose name is:
1_pairs.tab?2_pairs.tab?3_pairs.tab?4_pairs.tab?5_pairs.tab?6_pairs.tab?7_pairs.tab?8_pairs.tab?9_pairs.tab?10_pairs.tab?11_pairs.tab?12_pairs.tab?13_pairs.tab?14_pairs.tab?15_pairs.tab?16_pairs.tab?17_pairs.tab?18_pairs.tab?19_pairs.tab?20_pairs.tab
I don't get why the input array is read with no problem, but that's not the case for the output array...
any ideas?
Thanks a lot!
D.
You cannot redirect output that way: the output is a stream of characters, and the redirection cannot know when to switch to the next file. You need a loop over the input files.
Assuming that the file names do not contain spaces:
for fn in $input; do
    grep -v "#" "$fn" | cut -f 1,2,3 > "${fn%%.*}_pairs.tab"
done
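If you would rather keep the explicit names from $out_in, here is a minimal sketch, assuming input and out_in are real bash arrays of equal length rather than space-separated strings:
input=(1.blastp 2.blastp 3.blastp)            # ... through 20.blastp
out_in=(1_pairs.tab 2_pairs.tab 3_pairs.tab)  # ... in matching order
for i in "${!input[@]}"; do
    grep -v "#" "${input[i]}" | cut -f 1,2,3 > "${out_in[i]}"
done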

How can I use sed to make thousands of substitutions in a file using a reference file?

I have a big file with two columns like this:
tiago@tiago:~/$ head Ids.txt
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126906_c0_g1_i1 ENSMUST00000126770.1
TRINITY_DN126907_c0_g1_i1 ENSMUST00000192613.1
TRINITY_DN126988_c0_g1_i1 ENSMUST00000032372.6
.....
and I have another file with data, like this:
"baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj" "super" "sub" "threshold"
"TRINITY_DN41319_c0_g1" 178.721774751278 2.1974294626636 0.342621318593487 6.41358066008381 1.4214085388179e-10 5.54686423073089e-08 TRUE FALSE "TRUE"
"TRINITY_DN87368_c0_g1" 4172.76139849472 2.45766387851112 0.404014016558211 6.08311538160958 1.17869459181235e-09 4.02673069375893e-07 TRUE FALSE "TRUE"
"TRINITY_DN34622_c0_g1" 39.1949851245197 3.28758092748061 0.54255370348027 6.05945716781964 1.3658169042862e-09 4.62597265729593e-07 TRUE FALSE "TRUE"
.....
I was thinking of using sed to perform a translation of the values in the first column of the data file, using the first file as a dictionary.
That is, considering each line of the data file in turn: if the value in the first column matches a value in the first column of the dictionary file, a substitution would be made; otherwise, the line would simply be printed.
Any suggestions would be appreciated.
You can turn your first file Ids.txt into a sed script:
$ sed -r 's| *(\S+) (\S+)|s/^"\1/"\2/|' Ids.txt > repl.sed
$ cat repl.sed
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126906_c0_g1_i1/"ENSMUST00000126770.1/
s/^"TRINITY_DN126907_c0_g1_i1/"ENSMUST00000192613.1/
s/^"TRINITY_DN126988_c0_g1_i1/"ENSMUST00000032372.6/
This removes leading spaces and makes each line into a substitution command.
Then you can use this script to do the replacements in your data file:
sed -f repl.sed datafile
... with redirection to another file, or in-place with sed -i.
If you don't have GNU sed, you can use this POSIX conformant version of the first command:
sed 's| *\([^ ]*\) \([^ ]*\)|s/^"\1/"\2/|' Ids.txt
This uses basic instead of extended regular expressions and uses [^ ] for "not space" instead of \S.
Since the first file (the dictionary file) is large, using sed may be very slow; a much faster and not much more complex approach would be to use awk as follows:
awk -v col=1 -v dict=Ids.txt '
BEGIN {while(getline<dict){a["\""$1"\""]="\""$2"\""} }
$col in a {$col=a[$col]}; {print}'
(Here, "Ids.txt" is the dictionary file, and "col" is the column number of the field of interest in the data file.)
This approach also has the advantage of not requiring any modification to the dictionary file.
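For example, to translate the data file and write the result elsewhere (a sketch; data.txt matches the script invocation further below):
awk -v col=1 -v dict=Ids.txt '
BEGIN {while(getline<dict){a["\""$1"\""]="\""$2"\""} }
$col in a {$col=a[$col]}; {print}' data.txt > translated.txt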
Another option is a plain bash script that loads the dictionary into an associative array:
#!/bin/bash
# Declare hash table
declare -A Ids
# Go through the first input file and add key-value pairs to the hash table
while read -r Id; do
    key=$(echo "$Id" | cut -d " " -f1)
    value=$(echo "$Id" | cut -d " " -f2)
    Ids+=([$key]=$value)
done < "$1"
# Go through the second input file and replace every first column with
# the corresponding value in the hash table, if it exists
while read -r line; do
    first_col=$(echo "$line" | cut -d '"' -f2)
    new_id=${Ids[$first_col]}
    if [ -n "$new_id" ]; then
        sed -i "s/$first_col/$new_id/g" "$2"
    fi
done < "$2"
I would call the script as
./script.sh Ids.txt data.txt

Store grep output in an array

I need to search a pattern in a directory and save the names of the files which contain it in an array.
Searching for pattern:
grep -HR "pattern" . | cut -d: -f1
This prints me all filenames that contain "pattern".
If I try:
targets=$(grep -HR "pattern" . | cut -d: -f1)
length=${#targets[@]}
for ((i = 0; i != length; i++)); do
echo "target $i: '${targets[i]}'"
done
This prints only one element containing a single string with all the filenames.
output: target 0: 'file0 file1 .. fileN'
But I need:
output: target 0: 'file0'
output: target 1: 'file1'
.....
output: target N: 'fileN'
How can I achieve the result without doing a boring split operation on targets?
You can use:
targets=($(grep -HRl "pattern" .))
Note use of (...) for array creation in BASH.
Also you can use grep -l to get only file names in grep's output (as shown in my command).
The above answer (written 7 years ago) assumed that the output filenames won't contain special characters like whitespace or globs. Here is a safe way to read such filenames into an array (it also works with older bash versions):
while IFS= read -rd ''; do
targets+=("$REPLY")
done < <(grep --null -HRl "pattern" .)
# check content of array
declare -p targets
On bash 4.4+ (which added the -d option) you can use readarray instead of a loop:
readarray -d '' -t targets < <(grep --null -HRl "pattern" .)
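Either way, you can then loop over the array with proper quoting, for example:
for i in "${!targets[@]}"; do
    echo "target $i: '${targets[i]}'"
done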

Printing duplicate rows as many times it is duplicate in the input file using UNIX

Suppose I have a sorted file:
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
RAM,124455,DUETCHE BANK,$240
And I want output as:
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
With uniq -d file I am able to find duplicate records, but it prints each record only once even if it is repeated. I want each record printed as many times as it is duplicated.
Thanks in advance.
The following should do what you want, assuming your file is called Input.txt.
uniq -d Input.txt | xargs -I {} grep {} Input.txt
xargs -I {} basically tells xargs to substitute the input that is being piped in whenever it sees {} in a later command.
grep {} Input.txt will be called with each line of input from the pipe, where the line of input will get substituted where {} is.
Why does this work? We are using uniq -d to find the duplicate entries, and then using them as input patterns to grep to match all the lines which contain those entries. Thus, only duplicate entries are printed, and they are printed exactly as many times as they appear in the file.
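On the sample file this prints every occurrence, including the first:
$ uniq -d Input.txt | xargs -I {} grep {} Input.txt
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145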
Update: printing only the duplicate occurrences, not the first occurrence, in a way that is compatible with ksh, since the OP apparently does not have bash on their system.
uniq -d Input.txt | xargs -L 1 | while read line
do
    grep "$line" Input.txt | tail -n +2
done
Note that in the above scripts, we are assuming that no line is a substring of another line.
This should give you the output that you want. It repeats each duplicate line N-1 times. Unfortunately the output isn't sorted, so you'd have to pipe it through sort again.
Assuming the input file is input.txt:
awk -F '\n' '{ a[$1]++ } END { for (b in a) { while(--a[b]) { print b } } }' input.txt | sort
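For example, with the sample file:
$ awk -F '\n' '{ a[$1]++ } END { for (b in a) { while(--a[b]) { print b } } }' input.txt | sort
AARAV,12345,BANK OF AMERICA,$145
AARAV,12345,BANK OF AMERICA,$145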

Comparing arrays in shell

I am writing a shell script, shown below, that gets a list of files provided by the user in a file, ftp's to a server, and then compares that list of files to what is on the server. The issue I am having is that when I call my diff function, the list being returned contains the files that are unique to either array. I want only those that are in unique_array1 but not in unique_array2: in short, a list showing which files in the user-provided list are not on the ftp server. Please note that in the list of files provided by the user, each line is a file name, separated by a newline character. My script is below:
#!/bin/bash
SERVER=ftp://localhost
USER=anonymous
PASS=password
EXT=txt
FILELISTTOCHECK="ftpFileList.txt"
#create a list of files that is on the ftp server
listOfFiles=$(curl $SERVER --user $USER:$PASS 2> /dev/null | awk '{ print $9 }' | grep -E "*.$EXT$")
#read the list of files from the list provided##
#Eg:
# 1.txt
# 2.txt
# 3.txt
#################################################
listOfFilesToCheck=`cat $FILELISTTOCHECK`
unique_array1=$(echo $listOfFiles | sort -u)
unique_array2=$(echo $listOfFilesToCheck | sort -u)
diff(){
awk 'BEGIN{RS=ORS=" "}
{NR==FNR?a[$0]++:a[$0]--}
END{for(k in a)if(a[k])print k}' <(echo -n "${!1}") <(echo -n "${!2}")
}
#Call the diff function above
Array3=($(diff unique_array1[@] unique_array2[@]))
#get what files are in listOfFiles but not in listOfFilesToCheck
echo ${Array3[@]}
Based on this, you may try the comm command:
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
A test program:
#!/bin/bash
declare -a arr1
declare -a arr2
arr1[0]="this"
arr1[1]="is"
arr1[2]="a"
arr1[3]="test"
arr2[0]="test"
arr2[1]="is"
unique_array1=$(printf "%s\n" "${arr1[@]}" | sort -u)
unique_array2=$(printf "%s\n" "${arr2[@]}" | sort -u)
comm -23 <(printf "%s\n" "${unique_array1[@]}") <(printf "%s\n" "${unique_array2[@]}")
Output:
a
this
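Applied to the original script, the same idea can replace the diff function entirely; a sketch, assuming listOfFiles and listOfFilesToCheck hold newline-separated names as in the question:
comm -23 <(echo "$listOfFiles" | sort -u) <(echo "$listOfFilesToCheck" | sort -u)
comm -23 keeps the names found only in the first input; use -13 instead to get the names found only in the second.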
