Accurate AWK array searching - arrays

Can anybody offer some help getting this AWK to search correctly?
I need to search inside the "sample.txt" file for all 6 array elements in the "combinations" file. However, the search has to start from every single character, unlike an ordinary text editor search box, which resumes searching after the end of each occurrence. I need overlapping matches to be counted so that every occurrence is reported. For example, the search should find the combination "AAA" 3 times inside the string "AAAAA", not 1 time. See my previous post about this: BASH: Search a string and exactly display the exact number of times a substring happens inside it
The sample.txt file is:
AAAAAHHHAAHH
The combinations file is:
AA
HH
AAA
HHH
AAH
HHA
How do I get the script
#!/bin/bash
awk 'NR==FNR {data=$0; next} {printf "%s %d \n",$1,gsub($1,$1,data)}' 'sample.txt' combinations > searchoutput
to output the desired output:
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
instead of what it is currently outputting:
AA 3
HH 2
AAA 1
HHH 1
AAH 2
HHA 1
?
As we can see, the script is only finding the combinations just like a text editor. I need it to search for the combinations from the start of every character instead so that the desired output happens.
How do I have the AWK output the desired output instead? Can't thank you enough.

There may be a faster way that finds the first match and carries forward from that index, but this might be simpler:
$ awk 'NR==1{content=$0;next}
{c=0; len1=length($1);
for(i=1;i<=length(content)-len1+1;i++)
c+=substr(content,i,len1)==$1;
print $1,c}' file combs
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1

you might try this:
$ awk '{x="AAAAAHHHAAHH"; n=0}{
while(t=index(x,$0)){n++; x=substr(x,t+1) }
print $0,n
}' combinations.txt
AA 5
HH 3
AAA 3
HHH 1
AAH 2
HHA 1
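The answer above hard-codes the searched string. A minimal sketch of the same index()/substr() idea that reads the string from the first file instead, assuming that only the first line of that file matters; the function name count_overlaps is hypothetical:

```shell
#!/bin/bash
# count_overlaps TEXT_FILE PATTERN_FILE   (hypothetical helper name)
# Prints each pattern followed by its number of overlapping occurrences
# in the first line of TEXT_FILE.
count_overlaps() {
    awk '
        NR == FNR { if (FNR == 1) text = $0; next }   # first file: keep line 1 only
        $0 != "" {                                    # skip blank pattern lines
            rest = text; n = 0
            # index() returns the 1-based position of the first match, or 0.
            # Restarting one character after the match START lets matches overlap.
            while ((t = index(rest, $0)) > 0) {
                n++
                rest = substr(rest, t + 1)
            }
            print $0, n
        }
    ' "$1" "$2"
}

# Usage: count_overlaps sample.txt combinations > searchoutput
```

gsub() counts fewer matches because it continues after the end of each match, whereas this loop continues one character after the start of each match.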

Related

Use bash variable as array in awk and filter input file by comparing with array

I have bash variable like this:
val="abc jkl pqr"
And I have a file that looks something like this:
abc 4 5
abc 8 8
def 43 4
def 7 51
jkl 4 0
mno 32 2
mno 9 2
pqr 12 1
I want to discard the rows of the file whose first field isn't present in val:
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
My solution in awk doesn't work at all and I don't have any idea why:
awk -v var="${val}" 'BEGIN{split(var, arr)}$1 in arr{print $0}' file
Just slice the variable into array indexes:
awk -v var="${val}" 'BEGIN{split(var, arr)
for (i in arr)
names[arr[i]]
}
$1 in names' file
As commented in the linked question, when you call split() you get values for the array, while what you want to set are indexes. The trick is to generate another array with this content.
As you can see, the pattern $1 in names suffices: you don't have to spell out the action {print $0}, since printing the line is the default action.
As a one-liner:
$ awk -v var="${val}" 'BEGIN{split(var, arr); for (i in arr) names[arr[i]]} $1 in names' file
abc 4 5
abc 8 8
jkl 4 0
pqr 12 1
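The difference between values and indexes can be checked directly. A small sketch, with the hypothetical helper name show_split_indexes, that prints the number of elements followed by three 0/1 flags:

```shell
#!/bin/bash
# show_split_indexes is a hypothetical helper name used only for this demo.
show_split_indexes() {
    awk -v var="$1" 'BEGIN {
        n = split(var, arr)                 # arr[1]="abc", arr[2]="jkl", ...
        value_found = ("abc" in arr)        # 0: "abc" is a value, not an index
        index_found = (1 in arr)            # 1: split() created index 1
        for (i = 1; i <= n; i++) names[arr[i]]   # referencing an element creates its index
        name_found = ("abc" in names)       # 1: "abc" is now an index of names
        print n, value_found, index_found, name_found
    }'
}

show_split_indexes "abc jkl pqr"    # prints: 3 0 1 1
```

The in operator only looks at indexes, which is why the original $1 in arr never matched any line.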
grep -E "^($(echo "${val}" | sed 's/ /|/g'))([[:space:]]|\$)" YourFile
# or
awk -v val="${val}" 'BEGIN{gsub(/ /, "|", val); val = "^(" val ")$"} $1 ~ val' YourFile
Grep:
It uses an extended regex (option -E) to keep only the lines whose first field is one of the values. The regex is built on the fly in a subshell by a sed that replaces each space separator with |, meaning OR. The ^( ... ) prefix and the ([[:space:]]|$) suffix anchor the match to the whole first field, so a value such as abc does not also match abcd or a later column.
Awk:
It uses the same principle as the grep, but everything is done inside awk (so no subshell).
It assigns the shell variable val to an awk variable of the same name.
At the start of the script (before the first line is read), the BEGIN block replaces each space in val with | and wraps the result in ^( ... )$ so that the whole first field must match.
Then, for every line whose first field matches (the default field separator in awk is blank space, so the first field is the letter group), the line is printed (the default action of a pattern such as $1 ~ val).

In AWK: Count number of occurrences in a column in a tab separated file and write data into a new tsv file

I have data stored in a large (20Gb) tab separated text file, as the sample below (input.txt):
1234 567 T 0
1267 890 Z 1
1269 908 T 1
3142 789 T 0
7896 678 Z 0
I would like to count the occurrences of each entry in Column 4, and write this automatically into a new tab separated file.
I would like to see the following in output.txt:
0 3
1 2
Can anybody suggest a fast way to do this with AWK?
awk '{ count[$4]++ } END { for (i in count) printf "%s\t%d\n", i, count[i] }' \
big.file.txt
For each value in column 4, increment the counter for that value. At the end, print each value found and its count. This prints the values in an indeterminate order. If you want it in some order, either post-process the output with sort or sort the keys inside awk and print in the sorted key order.
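A minimal sketch of the post-processing option, assuming the input really is tab separated (hence the explicit -F) and that the values in column 4 are numeric; the function name count_column4 is hypothetical:

```shell
#!/bin/bash
# count_column4 [FILE...]   (hypothetical helper name; reads stdin if no file is given)
# Counts each distinct value in column 4 and prints "value<TAB>count",
# sorted numerically by value.
count_column4() {
    awk -F '\t' '
        { count[$4]++ }
        END { for (k in count) printf "%s\t%d\n", k, count[k] }
    ' "$@" | sort -t "$(printf '\t')" -k1,1n
}

# Usage: count_column4 input.txt > output.txt
```

Sorting only the summary is cheap even for a 20 GB input, because sort sees one line per distinct value rather than one per input row.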

Print duplicate entries in a file using linux commands

I have a file called foo.txt, which consists of:
abc
zaa
asd
dess
zaa
abc
aaa
zaa
I want the output to be stored in another file as:
this text abc appears 2 times
this text zaa appears 3 times
I have tried the following command, but this just writes duplicate entries and their number.
sort foo.txt | uniq --count --repeated > sample.txt
Example of output of above command:
      2 abc
      3 zaa
How do I add the line "this text appears x times" ?
Awk is your friend:
sort foo.txt | uniq --count --repeated | awk '{print "this text " $2 " appears " $1 " times"}'
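The counting can also be done in awk itself. A sketch under the assumption that each line of the file is one entry to count; the function name report_duplicates is hypothetical, and the trailing sort is only there to make the output order predictable:

```shell
#!/bin/bash
# report_duplicates [FILE...]   (hypothetical helper name; reads stdin if no file is given)
# Prints one sentence for every line that occurs more than once.
report_duplicates() {
    awk '
        { seen[$0]++ }                       # count every distinct line
        END {
            for (line in seen) {
                if (seen[line] > 1) {        # keep only repeated lines
                    printf "this text %s appears %d times\n", line, seen[line]
                }
            }
        }
    ' "$@" | sort
}

# Usage: report_duplicates foo.txt > sample.txt
```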

nested for loops in awk to count number of fields matching values

I have a file with two columns (1.4 million rows) that looks like:
CLM MXL
0 0
0 1
1 1
1 1
0 0
29 42
0 0
30 15
I would like to count the instances of each possible combination of values; for example if there are x number of lines where column CLM equals 0 and column MXL matches 1, I would like to print:
0 1 x
Since the maximum value of column CLM is 188 and the maximum value of column MXL is 128, I am trying to use a nested for loop in awk that looks something like:
awk '{for (i=0; i<=188; i++) {for (j=0; j<=128; j++) {if($9==i && $10==j) {print$0}}}}' 1000Genomes.ALL.new.txt > test
But this only prints out the original file, which makes sense. I just don't know how to correctly write a for loop that prints out one file for each combination of values, which I can then wc, or that prints out one file with counts of each combination. Any solution in awk, bash script, or perl script would be great.
1. A Pure awk Solution
$ awk 'NR>1{c[$0]++} END{for (k in c)print k,c[k]}' file | sort -n
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
The code uses a single variable c. c is an associative array whose keys are lines in the file and whose values are the number of occurrences.
NR>1{c[$0]++}
For every line except the first (which has the headings), this increments the count for the combination in that line.
END{for (k in c)print k,c[k]}
This prints out the final counts.
sort -n
This is just for aesthetics: it puts the output lines in a predictable order.
2. Alternative using uniq -c
$ tail -n+2 file | sort -n | uniq -c | awk '{print $2,$3,$1}'
0 0 3
0 1 1
1 1 2
29 42 1
30 15 1
How it works
tail -n+2 file
This prints all but the first line of the file. The purpose of this is to remove the column headings.
sort -n | uniq -c
This sorts the lines and then counts the duplicates.
awk '{print $2,$3,$1}'
uniq -c puts the counts first and you wanted the counts to be the last on the line. This just rearranges the columns to the format that you wanted.
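Both solutions count whole lines. The attempt in the question reads $9 and $10, so the two columns may sit inside a wider file. A sketch for that case, assuming a single header line and numeric values; the function name count_pairs and its column arguments are hypothetical:

```shell
#!/bin/bash
# count_pairs COL_A COL_B [FILE...]   (hypothetical helper name)
# Counts each combination of the two given 1-based columns, skipping the
# header line, and prints "valueA valueB count" in numeric order.
count_pairs() {
    local a=$1 b=$2
    shift 2
    awk -v a="$a" -v b="$b" '
        NR > 1 { c[$a " " $b]++ }            # key on the two chosen fields only
        END { for (k in c) print k, c[k] }
    ' "$@" | sort -k1,1n -k2,2n
}

# Usage: count_pairs 9 10 1000Genomes.ALL.new.txt > test
```

No nested loop over all 189 x 129 possible pairs is needed: the array only ever holds the combinations that actually occur.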

How can I merge (concatenate) two columns in an array/matrix?

This is probably a pretty newb question, but….
In perl, I'm trying to read in a table (into an array) and combine the values of the first two columns. So for an input file with:
1 7 ABC DEF GHI
2 8 ABC DEF GHI
3 1 ZYX MNO PLQ
I'd like to get out:
17 ABC DEF GHI
28 ABC DEF GHI
31 ZYX MNO PLQ
What's the easiest way of doing this?
This is so short, so I have to add extra text to the answer:
while(<>){ s/^(\S+)\s+/$1/; print}
The easiest way I can think of:
while (my $line = <>) {            # read the input line by line
    chomp $line;                   # remove the trailing newline
    my @f = split ' ', $line;      # split the line into an array
    $f[1] = $f[0] . $f[1];         # make index 1 equal index 0 . index 1
    shift @f;                      # remove the first element from the array
    print "@f\n";                  # print the elements followed by \n (to STDOUT or a file)
}
