Print duplicate entries in a file using linux commands

Print duplicate entries in a file using linux commands - file

I have a file called foo.txt, which consists of:
abc
zaa
asd
dess
zaa
abc
aaa
zaa
I want the output to be stored in another file as:
this text abc appears 2 times
this text zaa appears 3 times
I have tried the following command, but this just writes duplicate entries and their number.
sort foo.txt | uniq --count --repeated > sample.txt
Example of output of above command:
abc 2
zaa 3
How do I add the line "this text appears x times" ?

Awk is your friend:
sort foo.txt | uniq --count --repeated | awk '{print($2" appears "$1" times")}'

Related

Accurate AWK array searching

Can anybody offer some help getting this AWK to search correctly?
I need to search inside the "sample.txt" file for all the 6 array elements in the "combinations" file. However, I need the search to happen from every single character instead of like an ordinary text editor search box type search, which searches by blocks after each occurrence. I need to search in the most squeezed in way so as to display exactly every times it happens. For example I need the type of search that finds inside the string "AAAAA" the combination "AAA" happening 3 times, not 1 time. See my previous post about this: BASH: Search a string and exactly display the exact number of times a substring happens inside it
The sample.txt file is:
AAAAAHHHAAHH
The combinations file is:
AA
HH
AAA
HHH
AAH
HHA
How do I get the script
#!/bin/bash
awk 'NR==FNR {data=$0; next} {printf "%s %d \n",$1,gsub($1,$1,data)}' 'sample.txt' combinations > searchoutput
to output the desired output:
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
instead of what it is currently outputing:
AA 3
HH 2
AAA 1
HHH 1
AAH 2
HHA 1
?
As we can see, the script is only finding the combinations just like a text editor. I need it to search for the combinations from the start of every character instead so that the desired output happens.
How do I have the AWK output the desired output instead? Can't thank you enough.

there may be a faster way to find the first match and carry forward from that index, but this might be simpler
$ awk 'NR==1{content=$0;next}
{c=0; len1=length($1);
for(i=1;i<=length(content)-len1+1;i++)
c+=substr(content,i,len1)==$1;
print $1,c}' file combs
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1

you might try this:
$ awk '{x="AAAAAHHHAAHH"; n=0}{
while(t=index(x,$0)){n++; x=substr(x,t+1) }
print $0,n
}' combinations.txt
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1

Edit a string in shell script and display it as an array

Input:
1234-A1;1235-A2;2345-B1;5678-C2;2346-D5
Expected Output:
1234
1235
2345
5678
2346
Input shown is a user input. I want to store it in an array and do some operations to display as shown in 'Expected Output'
I have done it in perl, but want to achieve it in shell script. Please help in achieving this.

To split an input text to an array you can follow this technique:
IFS="[;-]" read -r -a arr <<< "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
printf '%s\n' "${arr[#]}"
1234
A1
1235
A2
2345
B1
5678
C2
2346
D5
If you want to keep only 1234,1234, etc as per your expected output you can either to use the corresponding array elements (0-2-4-etc) or to do something like this:
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-[A-Z][0-9]/}" #or more generally <<< "${a//-??/}"
declare -p arr #This asks bash to print the array for us
#Output
declare -a arr='([0]="1234" [1]="1235" [2]="2345" [3]="5678" [4]="2346")'
# Array can now be printed or used elsewhere in your script. Array counting starts from zero

#Yash:#try:
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5" | awk '{gsub(/-[[:alnum:]]+/,"");gsub(/;/,RS);print}'
Substituting all alpha bate, numbers with NULL, then substituting all semi colons to RS(record separator) which is a new line by default.

Thanks #George and #Vipin.
Based on your inputs the solution which best suites my environment is as under:
i=0
a="1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"
IFS="[;]" read -r -a arr <<< "${a//-??/}"
#declare -p arr
for var in "${arr[#]}"
do
echo " var $((i++)) is : $var"
done
Output:
var 0 is : 1234
var 1 is : 1235
var 2 is : 2345
var 3 is : 5678
var 4 is : 2346

Try this -
awk -F'[-;]' '{for(i=1;i<=NF;i++) if(i%2!=0) {print $i}}' f
1234
1235
2345
5678
2346
OR
echo "1234-A1;1235-A2;2345-B1;5678-C2;2346-D5"|tr ';' '\n'|cut -d'-' -f1
OR
As #George Vasiliou Suggested -
awk -F'[-;]' '{for(i=1;i<=NF;i+=2) {print $i}}'f
If Data needs to store in Array and you are using gawk, try below -
awk -F'[;-]' -v k=1 '{for(i=1;i<=NF;i++) if($i !~ /[[:alpha:]]/) {a[k++]=$i}} END {
> PROCINFO["sorted_in"] = "#ind_str_asc"
> for(k in a) print k,a[k]}' f
1 1234
2 1235
3 2345
4 5678
5 2346
PROCINFO["sorted_in"] = "#ind_str_asc" used to print the data in
sorted order.

Eliminate duplicate columns based on separate field using awk array?

I am trying to eliminate a set of duplicate rows based on a separate field.
cat file.txt
1 345 a blue
1 345 b blue
3 452 c blue
3 342 d green
3 342 e green
1 345 f green
I would like to remove duplicates rows based on field 1 and 2, but separately for each colour. Desired output:
1 345 a blue
3 452 c blue
3 342 d green
1 345 f green
I can achieve this output using a for loop that iterates over the colours:
for i in $(awk '{ print $4 }' file.txt | sort -u); do
grep -w ${i} |
awk '!x[$1,$2]++' >> output.txt
done
But this is slow. Is there any way to get this output without use of a loop?
Thank you.

At least for the example, it is simple as:
$ awk 'arr[$1,$2,$4]++{next} 1' file
1 345 a blue
3 452 c blue
3 342 d green
1 345 f green
Or, you can negate that:
$ awk '!arr[$1,$2,$4]++' file
You can also use GNU sort for the same which may be faster:
$ sort -k4,4 -k2,2 -k1,1 -u file

Could you please try this too:
awk '!a[$1,$2,$4]++' Input_file

Use bash variable as array in awk and filter input file by comparing with array

I have bash variable like this:
val="abc jkl pqr"
And I have a file that looks smth like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to throw away rows from file which first field isn't present in the val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My solution in awk doesn't work at all and I don't have any idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file

Just slice the variable into array indexes:
awk -v var="${val}" 'BEGIN{split(var, arr)
for (i in arr)
names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() you get values for the array, while what you want to set are indexes. The trick is to generate another array with this content.
As you see $1 in names suffices, you don't have to call for the action {print $0} when this happens, since it is the default.
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1

grep -E "$( echo "${val}"| sed 's/ /|/g' )" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|",val)} $1 ~ val' YourFile
Grep:
it use a regex (extended version with option -E) that filter all the lines that contains the value. The regex is build OnTheMove in a subshell with a sed that replace the space separator by a | meaning OR
Awk:
use the same princip as the grep but everything is made inside (so no subshell)
use the variable val assigned to the shell variable of the same name
At start of the script (before first line read) change the space, (in val) by | with BEGIN{gsub(/ /, "|",val)}
than, for every line where first field (default field separator is space/blank in awk, so first is the letter group) matching, print it (defaut action of a filter with $1 ~ val.

Comparing two files using awk or sed

I have two files...
Lookup is 1285 lines long:
cat Lookup.txt
abc
def
ghi
jkl
main is 4,838,869 lines long:
cat main.txt
abc, USA
pqr, UK
xyz, SA
I need to compare lookup and main and then output the matching lines in main to final.txt

You don't need awk or sed here but grep, assuming I am reading your requirements correctly:
% grep -f lookup.txt main.txt > final.txt
% cat final.txt
abc, USA

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Print duplicate entries in a file using linux commands - file

Awk is your friend: sort foo.txt | uniq --count --repeated | awk '{print($2" appears "$1" times")}'

Related

Accurate AWK array searching

Edit a string in shell script and display it as an array

Eliminate duplicate columns based on separate field using awk array?

Use bash variable as array in awk and filter input file by comparing with array

Comparing two files using awk or sed

Categories

Resources