Eliminate duplicate rows based on a separate field using an awk array?

I am trying to eliminate a set of duplicate rows based on a separate field.
cat file.txt
1 345 a blue
1 345 b blue
3 452 c blue
3 342 d green
3 342 e green
1 345 f green
I would like to remove duplicate rows based on fields 1 and 2, but separately for each colour. Desired output:
1 345 a blue
3 452 c blue
3 342 d green
1 345 f green
I can achieve this output using a for loop that iterates over the colours:
for i in $(awk '{ print $4 }' file.txt | sort -u); do
  grep -w "${i}" file.txt |
  awk '!x[$1,$2]++' >> output.txt
done
But this is slow. Is there any way to get this output without the use of a loop?
Thank you.

At least for the example, it is as simple as:
$ awk 'arr[$1,$2,$4]++{next} 1' file
1 345 a blue
3 452 c blue
3 342 d green
1 345 f green
Or, you can negate that:
$ awk '!arr[$1,$2,$4]++' file
You can also use GNU sort for the same task, which may be faster (note that it sorts the output rather than preserving the input order):
$ sort -k4,4 -k2,2 -k1,1 -u file

Could you please try this too:
awk '!a[$1,$2,$4]++' Input_file
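For anyone new to the idiom, here is the same one-liner spelled out with comments (a sketch equivalent to !arr[$1,$2,$4]++; seen is just a clearer array name):
awk '{
    key = $1 SUBSEP $2 SUBSEP $4    # same composite key as arr[$1,$2,$4]
    if (!(key in seen))             # first row with this key: keep the line
        print                       # (printing is the default action being made explicit)
    seen[key]++                     # remember the key so later duplicates are skipped
}' file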

Related

How to compare 2 files and return matching values with awk

I want to keep only the lines in results.txt that match the IDs in uniq.txt, based on column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not restrict the match to column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try the following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating array with the current line as its index.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case the field number is not fixed in results.txt and you want to search for the values anywhere in the whole line, try the following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
You can use grep in combination with sed to manipulate the input patterns and achieve what you're looking for:
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/\\s*/' uniq.txt) results.txt
If you want to match the nth column, replace 2 in the above command with n-1.
This outputs:
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
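To see what grep receives, the sed turns each ID into a pattern anchored at the start of the line that skips the first two whitespace-separated fields; for the sample uniq.txt it generates:
^(\S+\s+){2}9606\s*
^(\S+\s+){2}234831\s*
^(\S+\s+){2}131\s*
^(\S+\s+){2}31313\s*
One caveat worth checking against your IDs: the trailing \s* can match zero characters, so an ID that is a prefix of a longer ID (say 9606 versus a hypothetical 96060) would still match; with GNU grep, ending the pattern with (\s|$) instead tightens this.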

Accurate AWK array searching

Can anybody offer some help getting this AWK to search correctly?
I need to search inside the "sample.txt" file for all 6 array elements in the "combinations" file. However, I need a match to be attempted from every single character position, not the way an ordinary text editor search box works, which resumes only after the end of each occurrence. In other words, I need the most tightly packed count possible, reporting exactly every time a combination occurs. For example, the search must find the combination "AAA" 3 times inside the string "AAAAA", not 1 time. See my previous post about this: BASH: Search a string and exactly display the exact number of times a substring happens inside it
The sample.txt file is:
AAAAAHHHAAHH
The combinations file is:
AA
HH
AAA
HHH
AAH
HHA
How do I get the script
#!/bin/bash
awk 'NR==FNR {data=$0; next} {printf "%s %d \n",$1,gsub($1,$1,data)}' 'sample.txt' combinations > searchoutput
to output the desired output:
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
instead of what it is currently outputting:
AA 3
HH 2
AAA 1
HHH 1
AAH 2
HHA 1
?
As we can see, the script only finds non-overlapping matches, just like a text editor: gsub() replaces (and counts) non-overlapping occurrences, so it undercounts. I need it to attempt a match from every character position instead, so that the desired output is produced.
How do I have the AWK output the desired output instead? Can't thank you enough.
There may be a faster way to find the first match and carry forward from that index, but this might be simpler:
$ awk 'NR==1{content=$0; next}                     # first line read is the sample string
       {c=0; len1=length($1)
        for(i=1; i<=length(content)-len1+1; i++)   # try a match at every start position
            c += substr(content,i,len1)==$1        # add 1 when the slice equals the combination
        print $1, c}' file combs
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
You might try this (note that it hardcodes the sample string instead of reading sample.txt):
$ awk '{x="AAAAAHHHAAHH"; n=0}{
while(t=index(x,$0)){n++; x=substr(x,t+1) }
print $0,n
}' combinations.txt
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
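If you'd rather not hardcode the string, the two answers combine naturally; here is a sketch that reads it from sample.txt first, using the same index()-and-advance technique:
awk 'NR==FNR { str=$0; next }      # first file: remember the sample string
     {
       x=str; n=0
       while (t = index(x, $0)) {  # position of the next occurrence, 0 if none
           n++
           x = substr(x, t+1)      # advance one character so overlaps are counted
       }
       print $0, n
     }' sample.txt combinations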

Use bash variable as array in awk and filter input file by comparing with array

I have bash variable like this:
val="abc jkl pqr"
And I have a file that looks something like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to throw away the rows in the file whose first field isn't present in val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My solution in awk doesn't work at all and I don't have any idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file
Just slice the variable into array indexes:
awk -v var="${val}" 'BEGIN{split(var, arr)
for (i in arr)
names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() the pieces become the array's values, while what $1 in arr checks are its indexes: after split(var, arr) you have arr[1]="abc", arr[2]="jkl", arr[3]="pqr", so the test compares against 1, 2, 3. The trick is to build a second array whose indexes are those values.
As you can see, $1 in names suffices; you don't have to add the action {print $0}, since printing the line is the default.
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
grep -E "$( echo "${val}"| sed 's/ /|/g' )" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|",val)} $1 ~ val' YourFile
Grep:
it uses a regex (extended version, with option -E) that filters all the lines containing one of the values. The regex is built on the fly in a subshell, with a sed that replaces each space separator with a |, meaning OR. Note that the regex is unanchored, so it matches anywhere in the line, not only in the first field.
Awk:
uses the same principle as the grep, but everything is done inside awk (so no subshell)
uses the variable val assigned from the shell variable of the same name
at the start of the script (before the first line is read), it changes the spaces in val to | with BEGIN{gsub(/ /, "|",val)}
then, for every line whose first field matches (the default field separator in awk is space/blank, so the first field is the letter group), it prints the line (the default action of the filter $1 ~ val).
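One caveat with both of these, worth checking against your real data: the match is unanchored, so a first field such as abcd would also pass for abc. If that matters, anchoring the alternation fixes it; a minimal sketch:
awk -v val="${val}" '
BEGIN { gsub(/ /, "|", val)      # abc jkl pqr  ->  abc|jkl|pqr
        val = "^(" val ")$" }    # anchor so the whole first field must match
$1 ~ val' YourFile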

Print duplicate entries in a file using linux commands

I have a file called foo.txt, which consists of:
abc
zaa
asd
dess
zaa
abc
aaa
zaa
I want the output to be stored in another file as:
this text abc appears 2 times
this text zaa appears 3 times
I have tried the following command, but this just writes duplicate entries and their number.
sort foo.txt | uniq --count --repeated > sample.txt
Example of output of above command:
abc 2
zaa 3
How do I add the "this text ... appears x times" phrasing?
Awk is your friend:
sort foo.txt | uniq --count --repeated | awk '{print "this text "$2" appears "$1" times"}'
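Against the sample foo.txt this produces exactly the requested wording (add > sample.txt to store it in a file, as in the question):
this text abc appears 2 times
this text zaa appears 3 times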

How to split a large file into small ones by line number

I am trying to split my large file into small pieces using line numbers. For example, my file has 30,000,000 lines and I would like to divide it into small files, each of which has 10,000 lines (equivalent to 3,000 small files).
I used 'split' in unix, but it seems to be limited to only 100 files.
Is there a way of overcoming this limitation of 100 files?
If there is another way of doing this, please advise as well.
Thanks.
Using GNU awk
gawk '
BEGIN {
  i=1
}
{
  print $0 > "small"i".txt"
}
NR%10==0 {
  close("small"i".txt"); i++   # close each finished chunk to avoid running out of file descriptors
}' bigfile.txt
Test:
[jaypal:~/temp] seq 100 > bigfile.txt
[jaypal:~/temp] gawk 'BEGIN {i=1} {print $0 > "small"i".txt" } NR%10==0 { close("small"i".txt"); i++ }' bigfile.txt
[jaypal:~/temp] ls small*
small1.txt small10.txt small2.txt small3.txt small4.txt small5.txt small6.txt small7.txt small8.txt small9.txt
[jaypal:~/temp] cat small1.txt
1
2
3
4
5
6
7
8
9
10
[jaypal:~/temp] cat small10.txt
91
92
93
94
95
96
97
98
99
100
Not an answer, just a way to do the renaming part, as requested in a comment:
$ touch 000{1..5}.txt
$ ls
0001.txt 0002.txt 0003.txt 0004.txt 0005.txt
$ rename 's/^0*//' *.txt
$ ls
1.txt 2.txt 3.txt 4.txt 5.txt
I also tried the above with 3000 files without any problems.
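As for the original split limitation: GNU split lets you set the suffix length, so (assuming a reasonably recent GNU coreutils) the whole job can also be done without awk; a sketch:
# -l 10000 : 10,000 lines per output file
# -d       : numeric suffixes (00, 01, ...) instead of aa, ab, ...
# -a 4     : four suffix characters, enough for 3,000 files
split -l 10000 -d -a 4 bigfile.txt small
# produces small0000, small0001, ..., small2999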
