Count the occurrence of a pattern from a column of one file in another file

I created a file with a single column containing a list of patterns (2,196 in total) that I want to find in another text file, which has approximately 400 million lines.
For example:
file1
abc1
abc2
abc3
abc4
abc5
file2
abc1
abc1
abc1
abc1
abc1
abc2
abc2
abc2
abc2
The desired output:
file3
abc1 5
abc2 4
I can do it one pattern at a time with awk or grep:
awk '/abc1/{++c} END{print c}' file2 > file3
or
grep 'abc1' file2 | wc -l > file3
However, when I try:
cat file1 | xargs -L 1 grep file2 | wc -l > file3
I get an error message:
grep: abc1: No such file or directory
grep: abc2: No such file or directory
etc
I tried:
cat file1 | xargs -L 1 grep '' file2 | wc -l > file3
That also does not work! So what am I doing wrong?
Thank you!

Your cat file1 | xargs -L 1 grep file2… is trying to grep for the pattern file2 in the non-existent files abc1, abc2, and so on. You could start with something like
<file1 xargs -I{} grep "{}" file2
and extend this to
$ <file1 xargs -I{} sh -c 'printf "%s\t%s\n" "{}" $(grep -c "{}" file2)'
abc1 5
abc2 4
abc3 0
abc4 0
abc5 0
but that's not very efficient for a large pattern file.
Using grep, sort and uniq:
$ grep -F -x -f file1 file2 | sort | uniq -c > file3
Output file3:
5 abc1
4 abc2
If you need to reverse the number of matches and the pattern:
grep -F -x -f file1 file2 | sort | uniq -c | awk '{ print $2"\t"$1 }' > file3
Output file3:
abc1 5
abc2 4
Using awk:
awk '
NR==FNR{ a[$0] }
NR!=FNR && $0 in a{ a[$0]++ }
END{ for (i in a){ if (a[i])print i"\t"a[i] }}
' file1 file2 > file3
Output file3:
abc1 5
abc2 4
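If you also want the zero counts that the xargs version above prints (abc3 0, abc4 0, abc5 0), a small variant of the same awk should work (a sketch, not benchmarked against a 400-million-line file):
awk '
NR==FNR{ a[$0]=0; next }              # file1: remember every pattern with count 0
$0 in a{ a[$0]++ }                    # file2: count exact whole-line matches
END{ for (i in a) print i"\t"a[i] }   # print all patterns, including zero counts
' file1 file2 > file3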

The simplest solution would be as follows, IMHO.
awk 'FNR==NR{a[$0]++;next} ($1 in a){print $1,a[$1]}' Input_file2 Input_file1
Explanation: here is an explanation of the above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Condition FNR==NR is TRUE only while the first file on the command line (Input_file2) is being read.
a[$0]++ ##Creating an array named a whose index is the whole line ($0) and incrementing it by 1 each time that line is seen.
next ##next skips all further statements for this line.
}
($1 in a){ ##If $1 is present as an index in array a, then do the following.
print $1,a[$1] ##Printing the first field, then the value stored in array a at index $1.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
Output will be as follows.
abc1 5
abc2 4
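One caveat (not from the answer above): this reads the whole 400-million-line Input_file2 into an array, which can use a lot of memory when most of its lines are distinct. Reading the small pattern file first keeps the array at 2,196 entries; a sketch with the file roles swapped:
awk 'FNR==NR{pat[$0]; next} $0 in pat{cnt[$0]++} END{for (p in cnt) print p, cnt[p]}' Input_file1 Input_file2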

Related

Ubuntu, extract duplicate values from 2 files

How to extract duplicate and unique values from 2 files in Ubuntu and save them in separate files?
For example:
file1.txt
abc
123
321
file2.txt
abc
123
321
456
How do I extract the duplicates and uniques?
output for duplicates between 2 files
duplicates.txt
abc
123
321
output for unique values between 2 files
unique.txt
456
I tried this
awk 'NR==FNR{a[$1];next}$1 in a' file1.txt RS="" file2.txt
but I did not get only the duplicates and uniques; I got all values.
OK, I have found solutions.
To get duplicates:
sort *.txt | awk '{print $1}' | uniq -d
To get unique, non-corresponding values:
awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1.txt file2.txt
or
sort *.txt | awk '{print $1}' | uniq -u
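Note that the sort *.txt | uniq -d / uniq -u approach assumes that neither file contains internal duplicates. If that assumption may not hold, a comm-based sketch on deduplicated inputs is safer (file names taken from the example above):
# lines present in both files
comm -12 <(sort -u file1.txt) <(sort -u file2.txt) > duplicates.txt
# lines present in only one of the two files (column 2 of comm is tab-indented, so strip the tab)
comm -3 <(sort -u file1.txt) <(sort -u file2.txt) | tr -d '\t' > unique.txt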

How to subtract two stdout lists in Linux bash

Need help.
I have one list "A" from
netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+"
it looks like
80
443
8080
22
25
I have another list "B" from
ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+"
it looks like
80
443
22
So I want to know which ports are listening but not open in ufw, i.e. subtract "A" - "B",
and expect to see
8080
25
with some command like
netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+" | SELECT ALL NOT IN `ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+"`
How to do this?
Typically it's a job for comm:
netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+" |
sort | comm -23 - <(ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+" | sort)
You may use grep:
grep -vxFf <(cmd2) <(cmd1)
Here replace cmd1 with netstat ... command and replace cmd2 with ufw ... command.
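Assembled with the two pipelines from the question substituted in, that would look something like:
# ports listed by netstat (cmd1) that do not appear in the ufw list (cmd2)
grep -vxFf <(ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+") \
           <(netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+")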
This solution requires pre-sorting of the outputs:
$ netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+" | sort > A
^^^^^^
$ ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+" | sort > B
^^^^^^
Items unique to A:
$ comm -23 A B
25
8080
$
... but also, in case you require, items unique to B:
$ comm -13 A B
$
... and items common to A and B:
$ comm -12 A B
22
443
80
$
See man comm for details.
You can check the uniq -u command:
http://man7.org/linux/man-pages/man1/uniq.1.html
You pass a group of lines to uniq -u and redirect to an output. It will print only the lines that are not repeated.
So you just need to aggregate both results from list A and list B into a text:
List A:
netstat -ntlp | grep -oP ":[:1]?[:1]?(.*)+" | grep -oP "\d\d+" >> output.txt
List B:
ufw status numbered | grep -oP "\] \d+" | grep -oP "\d+" >> output.txt
(NOTE: You use '>>' over '>' to append the content to end of the file. So make sure to clean it on each iteration!)
Then:
sort output.txt | uniq -u
You can redirect the uniq -u output too, if needed:
sort output.txt | uniq -u > gotuniques.txt
Edit: formatting
Edit2: I was confused by -d when the answer requires -u.

If Statement With 2 Arrays To Perform Relative Converging Task

The data is fictional to keep it simple.
Here's the problem
Content Of Processed Data
cat rawdata
10 0-9{3}
4 0-9{3}
7 0-9{3}
noc=$(cat ipConn.txt | awk '{print $1}')
rct=$(cat ipConn.txt | awk '{print $2}')
Intended Solution:
for i in ${noc[]}
if $i -ge 50 then
command -options ${rct[]}
done
Is the code comprehensible??
but the item in ${noc[]} must match the item in ${rct[]}
so that only items on the same line are affected.
Try a while read loop:
echo '10 0-9{3}
4 0-9{3}
7 0-9{3}' |
while IFS=' ' read -r num item; do
if (( num >= 50 )); then
some_action with "$item"
fi
done
Note that such a loop is typically very slow in bash. A faster solution would be to first filter the rows whose first column is greater than or equal to 50, then remove the first column, and then run some_action using xargs (or even pass -P0 to xargs to run in parallel; see the sketch after the example below):
echo '10 0-9{3}
4 0-9{3}
7 0-9{3}' |
awk '$1 >= 50' |
cut -d' ' -f2- |
xargs -n1 some_action with
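The parallel variant mentioned above would just add -P0 (GNU xargs). For example, reading the rawdata file from the question, with some_action still a placeholder:
# same filter as above, but xargs runs the commands in parallel
awk '$1 >= 50' rawdata |
cut -d' ' -f2- |
xargs -n1 -P0 some_action with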

Counting strings from array in bash

I am writing the output of awk to an array in bash like so:
ARR=( $(awk '{print $2}' file.txt) )
Imagine the content of file.txt is:
A B
A B
A C
A D
A C
A B
What I want is the number of repetitions of each string in the second column, like:
B: 3
C: 2
D: 1
Any other solution besides arrays and awk is welcome.
Using awk you can do:
awk '{c[$2]++} END{for (i in c) print i ":", c[i]}' file
B: 3
C: 2
D: 1
Another solution I found:
awk '{print $2}' file.txt | sort | uniq -c | sort -nr | while read count name
do
if [ ${count} -gt 1 ]
then
echo "${name} ${count}"
fi
done
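Since the question starts from a bash array, a bash-only sketch with an associative array is also possible (requires bash 4+; file.txt is the file from the question):
declare -A count                      # associative array of counters
while read -r _ second; do            # ignore the first column, keep the second
  (( count[$second]++ ))
done < file.txt
for key in "${!count[@]}"; do         # note: output order is unspecified
  echo "$key: ${count[$key]}"
done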

Find duplicate lines in a file and count how many times each line was duplicated?

Suppose I have a file similar to the following:
123
123
234
234
123
345
I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc.
So ideally, the output would be like:
123 3
234 2
345 1
Assuming there is one number per line:
sort <file> | uniq -c
You can use the more verbose --count flag too with the GNU version, e.g., on Linux:
sort <file> | uniq --count
This will print duplicate lines only, with counts:
sort FILE | uniq -cd
or, with GNU long options (on Linux):
sort FILE | uniq --count --repeated
on BSD and OSX you have to use grep to filter out unique lines:
sort FILE | uniq -c | grep -v '^ *1 '
For the given example, the result would be:
3 123
2 234
If you want to print counts for all lines including those that appear only once:
sort FILE | uniq -c
or, with GNU long options (on Linux):
sort FILE | uniq --count
For the given input, the output is:
3 123
2 234
1 345
In order to sort the output with the most frequent lines on top, you can do the following (to get all results):
sort FILE | uniq -c | sort -nr
or, to get only duplicate lines, most frequent first:
sort FILE | uniq -cd | sort -nr
on OSX and BSD the final one becomes:
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr
To find and count duplicate lines in multiple files, you can try the following command:
sort <files> | uniq -c | sort -nr
or:
cat <files> | sort | uniq -c | sort -nr
Via awk:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data
In the awk 'dups[$1]++' command, the variable $1 holds the entire contents of column 1, and the square brackets are array access. So, for the first column of each line in the data file, the element of the array named dups is incremented.
At the end, we loop over the dups array with num as the variable and print the saved numbers first, then their number of duplications, dups[num].
Note that your input file has trailing spaces at the end of some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
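If you want the output ordered with the most frequent value first, as in the other answers, you could pipe it through sort (a small addition, not part of the original command):
awk '{dups[$1]++} END{for (num in dups) print num, dups[num]}' data | sort -k2,2nr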
In Windows, using "Windows PowerShell", I used the command mentioned below to achieve this
Get-Content .\file.txt | Group-Object | Select Name, Count
Also, we can use the where-object Cmdlet to filter the result
Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count
To find duplicate counts, use this command:
sort filename | uniq -c | awk '{print $2, $1}'
Assuming you've got access to a standard Unix shell and/or cygwin environment:
tr -s ' ' '\n' < yourfile | sort | uniq -d -c
^--space char
Basically: convert all space characters to line breaks, then sort the translated output and feed that to uniq to count the duplicate lines.
