I have 2 files. The first file contains a list of row IDs of tuples of a table in the database.
The second file contains SQL queries with these row IDs in the "where" clause of the query.
For example:
File 1
1610657303
1610658464
1610659169
1610668135
1610668350
1610670407
1610671066
File 2
update TABLE_X set ATTRIBUTE_A=87 where ri=1610668350;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610672154;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610668135;
update TABLE_X set ATTRIBUTE_A=87 where ri=1610672153;
I have to read File 1 and search File 2 for all the SQL commands that match the row IDs from File 1, then dump those SQL queries into a third file.
File 1 has 100,000 entries and File 2 contains 10 times as many, i.e. 1,000,000.
I used grep -f File_1 File_2 > File_3, but this is extremely slow and the rate is about 1000 entries per hour.
Is there any faster way to do this?
You don't need regexps, so use grep -F -f file1 file2
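If plain substring matching is too loose or still slow, one option (the patterns.txt name is just an example) is to turn each ID into the exact fixed string it appears as in File 2 before matching:
sed 's/^/ri=/; s/$/;/' File_1 > patterns.txt
grep -F -f patterns.txt File_2 > File_3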
One way with awk:
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
This should be pretty quick. On my machine, it took under 2 seconds to create a lookup of 1 million entries and compare it against 3 million lines.
Machine Specs:
Intel(R) Xeon(R) CPU E5-2670 0 # 2.60GHz (8 cores)
98 GB RAM
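Roughly how the one-liner works, as an annotated sketch of the same logic:
awk -v FS="[ =]" '
    NR == FNR { rows[$1]++; next }           # File1: remember every row ID
    {
        id = substr($NF, 1, length($NF)-1)   # File2: last field minus the trailing ";"
        if (id in rows) print                # print matching statements
    }' File1 File2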
I suggest using a programming language such as Perl, Ruby or Python.
In Ruby, a solution reading both files (f1 and f2) just once, and using a Set for fast lookups, could be:
require 'set'

idxes = File.readlines('f1').map(&:chomp).to_set
File.foreach('f2') do |line|
  next unless line =~ /where ri=(\d+);$/
  puts line if idxes.include? $1
end
or with Perl
open $file, '<', 'f1';
while (<$file>) { chomp; $idxs{$_} = 1; }
close($file);
open $file, '<', 'f2';
while (<$file>) {
    next unless $_ =~ /where ri=(\d+);$/;
    print $_ if $idxs{$1};
}
close $file;
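Either script can then be redirected to produce the third file (the script names here are hypothetical):
ruby filter_queries.rb > File_3
perl filter_queries.pl > File_3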
The awk/grep solutions mentioned above were slow or memory-hungry on my machine (file1: 10^6 rows, file2: 10^7 rows), so I came up with an SQL solution using sqlite3.
Turn file2 into a CSV-formatted file where the first field is the value after ri=:
gawk -F= '{ print $3","$0 }' file2.txt | sed 's/;,/,/' > file2_with_ids.txt
Create two tables:
sqlite> CREATE TABLE file1(rowId char(10));
sqlite> CREATE TABLE file2(rowId char(10), statement varchar(200));
Import the row IDs from file1:
sqlite> .import file1.txt file1
Import the statements from file2, using the prepared CSV version:
sqlite> .separator ,
sqlite> .import file2_with_ids.txt file2
Select all and only the statements in table file2 with a matching rowId in table file1:
sqlite> SELECT statement FROM file2 WHERE file2.rowId IN (SELECT file1.rowId FROM file1);
File 3 can easily be created by redirecting the output to a file before issuing the select statement:
sqlite> .output file3.txt
Test data:
sqlite> select count(*) from file1;
1000000
sqlite> select count(*) from file2;
10000000
sqlite> select * from file1 limit 4;
1610666927
1610661782
1610659837
1610664855
sqlite> select * from file2 limit 4;
1610665680|update TABLE_X set ATTRIBUTE_A=87 where ri=1610665680;
1610661907|update TABLE_X set ATTRIBUTE_A=87 where ri=1610661907;
1610659801|update TABLE_X set ATTRIBUTE_A=87 where ri=1610659801;
1610670610|update TABLE_X set ATTRIBUTE_A=87 where ri=1610670610;
Without creating any indexes, the select statement took about 15 seconds on an AMD A8 1.8 GHz 64-bit Ubuntu 12.04 machine.
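The same steps can also be batched into one non-interactive run; this is just a sketch of the session above, with an arbitrary database name match.db:
sqlite3 match.db <<'EOF'
CREATE TABLE file1(rowId char(10));
CREATE TABLE file2(rowId char(10), statement varchar(200));
.import file1.txt file1
.separator ,
.import file2_with_ids.txt file2
.output file3.txt
SELECT statement FROM file2 WHERE file2.rowId IN (SELECT file1.rowId FROM file1);
EOF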
Most of the previous answers are correct, but the only thing that worked for me was this command:
grep -oi -f a.txt b.txt
Maybe try awk, using the numbers from File 1 as keys. For example, with a simple two-step approach, a first awk script generates a second one.
script1.awk:
{
    print "$0 ~ /" $0 "/ { print $0 }" > "script2.awk"
}
Produce script2.awk from the list of IDs:
awk -f script1.awk File_1
and then invoke script2.awk against the file of SQL statements:
awk -f script2.awk File_2
I may be missing something, but wouldn't it be sufficient to just iterate the IDs in file1 and for each ID, grep file2 and store the matches in a third file? I.e.
for ID in `cat file1`; do grep $ID file2; done > file3
This is not terribly efficient (since file2 will be read over and over again), but it may be good enough for you. If you want more speed, I'd suggest using a more powerful scripting language which lets you read file2 into a map, which quickly allows identifying the lines for a given ID.
Here's a Python version of this idea:
queryByID = {}
for line in open('file2'):
    lastEquals = line.rfind('=')
    semicolon = line.find(';', lastEquals)
    id = line[lastEquals + 1:semicolon]
    queryByID[id] = line.rstrip()
for line in open('file1'):
    id = line.rstrip()
    if id in queryByID:
        print(queryByID[id])
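To get the third file, redirect the script's output (the script name is hypothetical):
python filter_queries.py > file3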
# reports any lines contained in <file 1> that are missing from <file 2>
IFS=$(echo -en "\n\b") && for a in $(cat <file 1>);
do ((\!$(grep -F -c -- "$a" <file 2>))) && echo $a;
done && unset IFS
or, to do what the asker wants, take off the negation and redirect:
(IFS=$(echo -en "\n\b") && for a in $(cat <file 1>);
do (($(grep -F -c -- "$a" <file 2>))) && echo $a;
done && unset IFS) >> <file 3>
Related
I have multiple files to process within a single directory.
They share the same extension (.dat) but their names could be anything.
Each file has a 1st line made of random text, in which the first numeric value encountered has to be caught and then appended to the end of the 2nd line.
Many other lines follow.
The number of fields in the 1st and 2nd lines is unknown, as is the position of the numeric value in the 1st row. The 1st row can also include several numeric values.
This currently looks like the example below, with '850' caught in 'xxx.dat':
typical input:
field11 field21 ... 850 ... 520 ... blabla ... 1100 ... fieldi1
field12 field22 ... fieldj2
field13 field23 ... fieldk3
...
field1n field2n ... fieldzn
desired output:
field11 field21 ... 850 ... 520 ... blabla ... 1100 ... fieldi1
field12 field22 ... fieldj2 850
field13 field23 ... fieldk3
...
field1n field2n ... fieldzn
Ideally a unique command or loop would process all the .dat files.
I am a beginner with sed and awk and unfortunately far from being able to solve this.
Could I please have any advice or a solution for this?
Thanks.
You can use this shell script:
#!/bin/sh
# make a temp file
if command -v mktemp >/dev/null 2>&1; then
    tmp=$(mktemp)
else
    tmp=edit.tmp
    [ -e "$tmp" ] && exit 1
fi
trap 'rm -f "$tmp"; exit' INT TERM
# edit .dat files
for i in *.dat; do
    awk '
        NR==1 {while ($(++i) ~ /[^0-9]/); num=$i}
        NR==2 {print $0,num}
        NR!=2' "$i" > "$tmp" &&
    mv "$tmp" "$i"
done
rm -f "$tmp"
It grabs the first digits-only field in line 1 and appends it to line 2.
Run it in a directory containing only the .dat files you wish to edit.
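For instance, on the sample xxx.dat shown in the question, the awk core alone (a sketch without the temp-file handling) prints the desired output:
awk 'NR==1 {while ($(++i) ~ /[^0-9]/); num=$i}
     NR==2 {print $0,num}
     NR!=2' xxx.dat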
It helps a lot to say which OS platform you're targeting.
I want to grep the second and third columns out of this output:
1 db1 ADM_DAT 300 yes 95.09
2 db2 SYSAUX 400 yes 94.52
and convert them into an array, for example:
outputres=("db1 ADM_DAT" "db2 SYSAUX")
and after that be able to read those values in a loop, for example:
for i in "${outputres[@]}"; do read -r a b <<< "$i"; unix_command $(cat file | grep $a | awk '{print $1}') $a $b; done
file:
10.1.1.1 db1
10.1.1.2 db2
Final expectation:
unix_command 10.1.1.1 db1 ADM_DAT
unix_command 10.1.1.2 db2 SYSAUX
This is only a theoretical example; I am not sure whether it works.
I would use a simple bash while read and keep adding elements into the array with the += syntax:
outputres=()
while read -r _ a b _; do
    outputres+=("$a $b")
done < file
Doing so, with your input file, I got:
$ echo "${outputres[@]}" #print all elements
db1 ADM_DAT db2 SYSAUX
$ echo "${outputres[0]}" #print first one
db1 ADM_DAT
$ echo "${outputres[1]}" #print second one
db2 SYSAUX
Since you want to use both values separately, it may be better to use an associative array:
$ declare -A array=()
$ while read -r _ a b _; do array[$a]=$b; done < file
And then you can loop through the values with:
$ for key in "${!array[@]}"; do echo "array[$key] = ${array[$key]}"; done
array[db2] = SYSAUX
array[db1] = ADM_DAT
See a basic example of utilization of these arrays:
#!/bin/bash
declare -A array=([key1]='value1' [key2]='value2')
for key in "${!array[@]}"; do
echo "array[$key] = ${array[$key]}"
done
echo ${array[key1]}
echo ${array[key2]}
So maybe this can solve your problem: loop through the file with columns, fetch the 2nd and 3rd, and use them twice: first $a to perform a grep in file, and then both as parameters to cmd_command:
while read -r _ a b _
do
    echo "cmd_command $(awk -v patt="$a" '$0~patt {print $1}' file) $a, $b"
done < columns_file
For a sample file file:
$ cat file
hello this is db1
and this is another db2
I got this output (note I am just echoing):
$ while read -r _ a b _; do echo "cmd_command $(awk -v patt="$a" '$0~patt {print $1}' file) $a, $b"; done < a
cmd_command hello db1, ADM_DAT
cmd_command and db2, SYSAUX
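Putting the pieces together for the original goal (unix_command and file are the names from the question, columns_file is a hypothetical file holding the two-column output; just a sketch, assuming file has the two-column ip/db format shown in the question), an associative array avoids grepping file once per row:
declare -A ip_by_db=()
while read -r ip db; do ip_by_db[$db]=$ip; done < file
while read -r _ a b _; do
    unix_command "${ip_by_db[$a]}" "$a" "$b"
done < columns_file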
I am writing a shell script, shown below, that takes a list of files provided by the user in a file, connects to an FTP server, and then compares that list of files to what is on the server. The issue I am having is that when I call my diff function, the list being returned contains the files that are unique to either array. I want only those that are in unique_array1 but not in unique_array2; in short, a list showing which files within the list the user provided are not on the FTP server. Please note that in the list of files provided by the user, each line is a file name, separated by a newline character. My script is as below:
#!/bin/bash
SERVER=ftp://localhost
USER=anonymous
PASS=password
EXT=txt
FILELISTTOCHECK="ftpFileList.txt"
#create a list of files that is on the ftp server
listOfFiles=$(curl $SERVER --user $USER:$PASS 2> /dev/null | awk '{ print $9 }' | grep -E "*.$EXT$")
#read the list of files from the list provided##
#Eg:
# 1.txt
# 2.txt
# 3.txt
#################################################
listOfFilesToCheck=`cat $FILELISTTOCHECK`
unique_array1=$(echo $listOfFiles | sort -u)
unique_array2=$(echo $listOfFilesToCheck | sort -u)
diff(){
awk 'BEGIN{RS=ORS=" "}
{NR==FNR?a[$0]++:a[$0]--}
END{for(k in a)if(a[k])print k}' <(echo -n "${!1}") <(echo -n "${!2}")
}
#Call the diff function above
Array3=($(diff unique_array1[@] unique_array2[@]))
#get what files are in listOfFiles but not in listOfFilesToCheck
echo ${Array3[@]}
Based on this, you may try the comm command:
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
A test program:
#!/bin/bash
declare -a arr1
declare -a arr2
arr1[0]="this"
arr1[1]="is"
arr1[2]="a"
arr1[3]="test"
arr2[0]="test"
arr2[1]="is"
unique_array1=$(printf "%s\n" "${arr1[@]}" | sort -u)
unique_array2=$(printf "%s\n" "${arr2[@]}" | sort -u)
comm -23 <(printf "%s\n" "${unique_array1[@]}") <(printf "%s\n" "${unique_array2[@]}")
Output:
a
this
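Applied to the variables in the question's script (a sketch reusing those names; quoting $listOfFiles keeps its newlines), the two directions of the comparison would be:
# files on the server but not in the user's list
comm -23 <(echo "$listOfFiles" | sort) <(sort "$FILELISTTOCHECK")
# files in the user's list but not on the server
comm -13 <(echo "$listOfFiles" | sort) <(sort "$FILELISTTOCHECK")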
I have two tables. Table 1 (sample below) has multiple columns and table 2 has one column. My question is: how can I extract rows from table 1 based on the values in table 2? I guess a simple grep should work, but how can I do a grep on each row? I would like the output to retain the table 2 identifier that matched.
Thanks!
Desired Output:
IPI00004233 IPI00514755;IPI00004233;IPI00106646; Q9BRK5-1;Q9BRK5-2;
IPI00001849 IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
...
......
Table 1:
IPI00436567; Q6VEP3;
IPI00169105;IPI01010102; Q8NH21;
IPI00465263; Q6IEY1;
IPI00465263; Q6IEY1;
IPI00478224; A6NHI5;
IPI00853584;IPI00000733;IPI00166122; Q96NU1-1;Q96NU1-2;
IPI00411886;IPI00921079;IPI00385785; Q9Y3T9;
IPI01010975;IPI00418437;IPI01013997;IPI00329191; Q6TDP4;
IPI00644132;IPI00844469;IPI00030240; Q494U1-1;Q494U1-2;
IPI00420049;IPI00001849; Q5SV97-1;Q5SV97-2;
IPI00966381;IPI00917954;IPI00028151; Q9HCC6;
IPI00375631; P05161;
IPI00374563;IPI00514026;IPI00976820; O00468;
IPI00908418; E7ERA6;
IPI00062955;IPI00002821;IPI00909677; Q96HA4-1;Q96HA4-2;
IPI00641937;IPI00790556;IPI00889194; Q6ZVT0-1;Q6ZVT0-2;Q6ZVT0-3;
IPI00001796;IPI00375404;IPI00217555; Q9Y5U5-1;Q9Y5U5-2;Q9Y5U5-3;
IPI00515079;IPI00018859; P43489;
IPI00514755;IPI00004233;IPI00106646; Q9BRK5-1;Q9BRK5-2;
IPI00064848; Q96L58;
IPI00373976; Q5T7M4;
IPI00375728;IPI86;IPI00383350; Q8N2K1-1;Q8N2K1-2;
IPI01022053;IPI00514605;IPI00514599; P51172-1;P51172-2;
Table 2:
IPI00000207
IPI00000728
IPI00000733
IPI00000846
IPI00000893
IPI00001849
IPI00002214
IPI00002335
IPI00002349
IPI00002821
IPI00003362
IPI00003419
IPI00003865
IPI00004233
IPI00004399
IPI00004795
IPI00004977
You cannot get grep to prepend the needle, so there is no way to do this with just -f file2.
Use a loop and prepend manually:
while read token; do grep "$token" file1 | xargs -I{} echo "$token" {}; done < file2
Alternatively, you could store both the results of grep and grep -o and paste them:
grep -f 2.txt 1.txt >a
grep -of 2.txt 1.txt >b
paste b a
If you're also fine with using awk, try this:
awk 'FNR==NR { a[$0];next } { for (x in a) if ($0 ~ x) print x, $0 }' 2.txt 1.txt
Explanation: For the first file (as long as FNR==NR), store all needles into array a ({ a[$0];next }). Then (implicitly) loop over all lines of the second file, loop again over all needles and print needle and line if found.
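Since the identifiers in the first column of Table 1 are exact ';'-separated tokens, a faster variant (a sketch; it assumes every needle appears as a complete field in that first column) can skip the per-needle regex loop:
awk 'FNR==NR { a[$0]; next }
     {
         n = split($1, ids, ";")
         for (i = 1; i <= n; i++)
             if (ids[i] in a) print ids[i], $0
     }' 2.txt 1.txt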
I have a large text file in the following format:
1 2327544589
1 3554547564
1 2323444333
2 3235434544
2 3534532222
2 4645644333
3 3424324322
3 5323243333
...
The output should be text files whose names carry a suffix taken from the number in the first column of the original file, each keeping the numbers from the second column in the corresponding output file, as follows:
file1.txt:
2327544589
3554547564
2323444333
file2.txt:
3235434544
3534532222
4645644333
file3.txt:
3424324322
5323243333
...
The script should run on Solaris, but I'm also having trouble with awk and with options of other commands, such as -c with cut; the toolset is very limited, so I am looking for commands commonly available on Solaris. I am not allowed to change or install anything on the system. Using a loop is not very efficient because the script takes too long with large files. So, aside from the awk command and loops, any suggestions?
Something like this perhaps:
$ awk 'NF>1{print $2 > ("file" $1 ".txt")}' input
$ cat file1.txt
2327544589
3554547564
2323444333
or if you have bash available, try this:
#!/bin/bash
while read a b
do
    [ -z "$a" ] && continue
    echo "$b" >> "file$a.txt"
done < input
output:
$ paste file{1..3}.txt
2327544589 3235434544 3424324322
3554547564 3534532222 5323243333
2323444333 4645644333
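If the default Solaris awk rejects the inline redirection or runs out of open file descriptors when the first column has many distinct values, a variant that builds the filename first and closes each file may help (a sketch; the availability of nawk or /usr/xpg4/bin/awk is an assumption):
nawk 'NF>1 { f = "file" $1 ".txt"; print $2 >> f; close(f) }' input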