AWK looping through columns to count matches

I have a tab-delimited file that looks like this:
SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C
The number of samples, and thus of columns from $5 onward, is theoretically unlimited.
For each sample column ($5, $6, $7, etc.), I want to count the rows in which its letter matches the letter in $3, and then divide each count by the total number of rows (excluding the header).
So far, I can do it separately for each pair, e.g. for $3 and $5, like this:
awk 'BEGIN {
FS = OFS = "\t"
}
$3 == $5 {
++count
}
END {
print count/(NR-1)
}
' infile
I would like to do it in a loop and get an output like this:
SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0
Can someone help, please?

Here is one potential solution:
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i++){
printf "Sample%s\t", (i-4)
}
}
NR>1{
sample[$1]++
!sampleID[$1]++
for(i=5;i<=NF;i++){
if($3 == $i){
count[$1, i]++
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i++){
printf "%s\t", count[j, i] / sample[j]
}
}
}' inputfile
SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0
Rather than dividing by (NR-1), the values are divided by the number of rows for each SampleID. So, if you have other SampleIDs in the file:
cat test.txt
SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C
312 rs1490413 A Homo G A G
312 rs730123 G Homo A G A
312 rs7532151 A Homo A C C
312 rs1434369 G Homo T G T
312 rs1563172 G Homo T C C
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i++){
printf "Sample%s\t", (i-4)
}
}
NR>1{
sample[$1]++
!sampleID[$1]++
for(i=5;i<=NF;i++){
if($3 == $i){
count[$1, i]++
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i++){
printf "%s\t", count[j, i] / sample[j]
}
}
}' test.txt
SampleID Sample1 Sample2 Sample3
311 0.4 0.6 0
312 0.2 0.6 0
Depending on the size of the file(s), it might be worth looking at other languages for this task; e.g. this is relatively trivial in R:
library(dplyr)
df <- read.table(text = "SampleID dbSNP Min.alle M.zygo Sample1 Sample2 Sample3
311 rs1490413 A Homo G A G
311 rs730123 G Homo A G A
311 rs7532151 A Homo A C C
311 rs1434369 G Homo T G T
311 rs1563172 T Homo T C C", header = TRUE)
df %>%
group_by(SampleID) %>%
summarise(across(starts_with("Sample"), ~mean(.x == Min.alle)))
#> # A tibble: 1 × 4
#> SampleID Sample1 Sample2 Sample3
#> <int> <dbl> <dbl> <dbl>
#> 1 311 0.4 0.6 0
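For comparison, the same per-SampleID calculation can be sketched in plain Python with only the standard library. This is a minimal sketch, not part of the original answer: the function name match_fractions and the inline TEXT sample are assumptions made for illustration, and the input is assumed to be tab-delimited with the sample columns starting at the fifth field.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; tabs separate the columns.
TEXT = (
    "SampleID\tdbSNP\tMin.alle\tM.zygo\tSample1\tSample2\tSample3\n"
    "311\trs1490413\tA\tHomo\tG\tA\tG\n"
    "311\trs730123\tG\tHomo\tA\tG\tA\n"
    "311\trs7532151\tA\tHomo\tA\tC\tC\n"
    "311\trs1434369\tG\tHomo\tT\tG\tT\n"
    "311\trs1563172\tT\tHomo\tT\tC\tC\n"
)


def match_fractions(handle):
    """Return (sample column names, {SampleID: [match fraction per column]})."""
    reader = csv.reader(handle, delimiter="\t")
    names = next(reader)[4:]      # sample columns start at the fifth field
    rows = defaultdict(int)       # number of rows seen per SampleID
    hits = {}                     # match counts per SampleID, one per column
    for rec in reader:
        if not rec:
            continue              # skip blank lines
        sample_id, minor = rec[0], rec[2]
        rows[sample_id] += 1
        counts = hits.setdefault(sample_id, [0] * len(names))
        for k, allele in enumerate(rec[4:]):
            if allele == minor:
                counts[k] += 1
    # Divide by the rows of each SampleID, not by the total row count.
    return names, {sid: [c / rows[sid] for c in hits[sid]] for sid in rows}


names, fractions = match_fractions(io.StringIO(TEXT))
print("SampleID", *names, sep="\t")
for sid, values in fractions.items():
    print(sid, *values, sep="\t")
```

For a real file, io.StringIO(TEXT) would be replaced by an open file handle; like the awk version, the division is by the number of rows per SampleID.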
Edit
To print the column names (instead of "Sample_n") you could use:
awk '
BEGIN{
OFS="\t"; printf "%s\t", "SampleID"
}
NR==1{
for(i=5;i<=NF;i++){
printf "%s\t", $i
}
}
NR>1{
sample[$1]++
!sampleID[$1]++
for(i=5;i<=NF;i++){
if($3 == $i){
count[$1, i]++
}
}
}
END{
for (j in sampleID) {
print ""
printf "%s\t", j
for(i=5;i<=NF;i++){
printf "%s\t", count[j, i] / sample[j]
}
}
}' inputfile


Print all elements in an array in AWK

I want to loop through all elements in an array in awk and print. The values are sourced from the file below:
Ala A Alanine
Arg R Arginine
Asn N Asparagine
Asp D Aspartic acid
Cys C Cysteine
Gln Q Glutamine
Glu E Glutamic acid
Gly G Glycine
His H Histidine
Ile I Isoleucine
Leu L Leucine
Lys K Lysine
Met M Methionine
Phe F Phenylalanine
Pro P Proline
Pyl O Pyrrolysine
Ser S Serine
Sec U Selenocysteine
Thr T Threonine
Trp W Tryptophan
Tyr Y Tyrosine
Val V Valine
Asx B Aspartic acid or Asparagine
Glx Z Glutamic acid or Glutamine
Xaa X Any amino acid
Xle J Leucine or Isoleucine
TERM TERM termination codon
I have tried this:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes);{print key,codes[key],length(codes)}}' $input1 $input2
The output is always Cys C 27, and when I replace codes[$1]=$2 with codes[$2]=$1 I get M Met 27.
How can I make my code print out all the values sequentially? I don't understand why my code selectively prints out just one element when I can tell the array length is 27 as expected. (To keep my code minimal I have excluded code within else{next} - Otherwise I just want to print all elements from array codes while retaining the else{***} command)
According to "How to view all the content in an awk array?", the syntax above should work. I tried it here: echo -e "1 2\n3 4\n5 6" | awk '{my_dict[$1] = $2};END {for(key in my_dict) print key " : " my_dict[key],": "length(my_dict)}' and that worked well.
With your shown samples and attempts, please try the following, written and tested in GNU awk.
awk '
BEGIN{
FS=OFS="\t"
}
{
codes[$1]=$2
}
END{
for(key in codes){
print key,codes[key],length(codes)
}
}' Input_file
Explanation: the same program, with a comment on each line.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
}
{
codes[$1]=$2 ##Creating array codes with index of 1st field and value of 2nd field
}
END{ ##Starting END block of this program from here.
for(key in codes){ ##Traversing through codes array here.
print key,codes[key],length(codes) ##Printing index and value of current item along with total length of codes.
}
}' Input_file ##Mentioning Input_file name here.
I'm a bit confused about what you are after, but to print the codes in input order together with a sequence number (ignoring the name), you can do:
awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
This uses two arrays: seq maps the sequence number to the single letter, and codes maps the letter to the three-letter code.
Example Use/Output
$ awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
Ala A 1
Arg R 2
Asn N 3
Asp D 4
Cys C 5
Gln Q 6
Glu E 7
Gly G 8
His H 9
Ile I 10
Leu L 11
Lys K 12
Met M 13
Phe F 14
Pro P 15
Pyl O 16
Ser S 17
Sec U 18
Thr T 19
Trp W 20
Tyr Y 21
Val V 22
Asx B 23
Glx Z 24
Xaa X 25
Xle J 26
TERM TERM 27
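Resolved: The error was caused by the stray ; after the for header here: END{for (key in codes);{print key,codes[key],length(codes)}}. The semicolon is an empty statement that becomes the entire loop body, so the braced block runs only once, after the loop has finished, with key still holding the last index visited.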
Solution:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes){print key,codes[key],length(codes)}}' $input1 $input2
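To make the effect of the stray semicolon concrete, here is a small Python sketch of what the two awk END blocks effectively do. It is only an illustration under stated assumptions: the function names and the three-entry codes table are hypothetical, and a Python dict iterates in insertion order, whereas the order of awk's for (key in array) is unspecified (which is why the original run ended on Cys rather than the last input line).

```python
codes = {"Ala": "A", "Arg": "R", "Asn": "N"}


def with_stray_semicolon(table):
    """Mimic `for (key in codes);{print ...}`: empty loop, then one print."""
    out = []
    for key in table:
        pass  # the `;` is the whole loop body, so nothing happens per element
    # The braced block runs once, after the loop, with the last key visited.
    out.append((key, table[key], len(table)))
    return out


def without_semicolon(table):
    """Mimic `for (key in codes){print ...}`: the block runs per element."""
    out = []
    for key in table:
        out.append((key, table[key], len(table)))
    return out


print(with_stray_semicolon(codes))
print(without_semicolon(codes))
```

The first call yields a single tuple for the last key, the second yields one tuple per entry, mirroring the one-line versus 27-line output seen with awk.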

Convert a multidimensional array to a data frame Python

I have a data set as below:
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

data = {'StoreID': ['a', 'b', 'c', 'd'],
        'Sales': [1000, 200, 500, 800],
        'Profit': [600, 100, 300, 500]}
data = pd.DataFrame(data)
data.set_index(['StoreID'], inplace=True, drop=True)
X = data.values
# Pairwise Euclidean distances between the stores (rows of X)
dist = euclidean_distances(X)
Now I get an array as below:
array([[0. ,943,583,223],
[943, 0.,360,721],
[583,360,0., 360],
[223,721,360, 0.]])
My goal is to get the unique combinations of stores and their corresponding distances. I would like the end result as a data frame like the one below:
Store NextStore Dist
a b 943
a c 583
a d 223
b c 360
b d 721
c d 360
Thank you for your help!
You probably want pandas.melt which will "unpivot" the distance matrix into tall-and-skinny format.
m = pd.DataFrame(dist)
m.columns = list('abcd')
m['Store'] = list('abcd')
...which produces:
a b c d Store
0 0.000000 943.398113 583.095189 223.606798 a
1 943.398113 0.000000 360.555128 721.110255 b
2 583.095189 360.555128 0.000000 360.555128 c
3 223.606798 721.110255 360.555128 0.000000 d
Melt data into tall-and-skinny format:
pd.melt(m, id_vars=['Store'], var_name='nextStore')
Store nextStore value
0 a a 0.000000
1 b a 943.398113
2 c a 583.095189
3 d a 223.606798
4 a b 943.398113
5 b b 0.000000
6 c b 360.555128
7 d b 721.110255
8 a c 583.095189
9 b c 360.555128
10 c c 0.000000
11 d c 360.555128
12 a d 223.606798
13 b d 721.110255
14 c d 360.555128
15 d d 0.000000
Remove redundant rows, convert dist to int, and sort:
df2 = pd.melt(m, id_vars=['Store'],
var_name='NextStore',
value_name='Dist')
df3 = df2[df2.Store < df2.NextStore].copy()
df3.Dist = df3.Dist.astype('int')
df3.sort_values(by=['Store', 'NextStore'])
Store NextStore Dist
4 a b 943
8 a c 583
12 a d 223
9 b c 360
13 b d 721
14 c d 360
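The filter Store < NextStore keeps each unordered pair of stores exactly once. The same idea can be sketched without pandas or scikit-learn, using itertools.combinations and math.dist (Python 3.8+) from the standard library. This is an alternative sketch rather than the answer's method; the stores dictionary and the unique_pairs function are names introduced here for illustration.

```python
from itertools import combinations
from math import dist

# (Sales, Profit) per store, copied from the question's data.
stores = {
    "a": (1000, 600),
    "b": (200, 100),
    "c": (500, 300),
    "d": (800, 500),
}


def unique_pairs(points):
    """Return (Store, NextStore, Dist) for every unordered pair of stores."""
    rows = []
    # combinations() over the sorted names yields each pair once, in order.
    for first, second in combinations(sorted(points), 2):
        # int() truncates, like astype('int') in the pandas version.
        rows.append((first, second, int(dist(points[first], points[second]))))
    return rows


for row in unique_pairs(stores):
    print(*row)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=['Store', 'NextStore', 'Dist']) if a data frame is needed.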
I have a 2D array with shape (14576, 24): 14576 elements (rows) and 24 features (columns) that will make up my dataframe.
data_from_2Darray = {}
for i in range(name_2Darray.shape[1]):
    data_from_2Darray["df_column_name{}".format(i)] = name_2Darray[:, i]
# visualize the result
pd.DataFrame(data=data_from_2Darray).plot()
# the converted array as a dataframe
df = pd.DataFrame(data=data_from_2Darray)

Why does printing a 2D array of characters give garbage values?

I am implementing a function that prints a 2D array of characters only using a double-pointer and pointer notation. When I run the code, it prints a bunch of garbage values in the format I want instead of the correct characters.
My professor instructed me not to use arr[row][col]; instead, I must access it using *(*(arr+i)+j) or similar.
This is a project for a class and I can't change any of the code outside of this function. The characters are meant to be formatted like a word search puzzle. The arguments passed to my function are char** arr, int size.
This is my function:
for(int i = 0; i < size; i++){
printf("%c %c %c %c %c %c %c %c %c %c %c %c %c %c %c\n",
*(arr+i), *(arr+i)+1, *(arr+i)+2, *(arr+i)+3, *(arr+i)+4,
*(arr+i)+5, *(arr+i)+6, *(arr+i)+7, *(arr+i)+8, *(arr+i)+9,
*(arr+i)+10, *(arr+i)+11, *(arr+i)+12, *(arr+i)+13, *(arr+i)+14);
}
Expected output:
W D B M J Q D B C J N Q P T I
I R Z U X U Z E A O I O R T N
M N Z P L R N H L Y L X H M D
M Y E K A I D P I U L Y O W I
A O A B A R K U F V I H L A A
L O N M R X K I O J N A V R N
A E P T A A R A R T O W A I A
S U C Z A U S I N A I A L Z V
K O T A O N R K I S S I A O N
A H X S V K A I A E A I B N E
U D S X N X C C D W G S A A V
O I S D W L E J N J T X M H A
M O X W T N H Q D X O Q A Q D
R U U V G E O R G I A Q V D A
V F L O R I D A L G L W O X N
Actual output:
░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛
≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■
0 1 2 3 4 5 6 7 8 9 : ; < = >
P Q R S T U V W X Y Z [ \ ] ^
p q r s t u v w x y z { | } ~
É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧
░ ▒ ▓ │ ┤ ╡ ╢ ╖ ╕ ╣ ║ ╗ ╝ ╜ ╛
╨ ╤ ╥ ╙ ╘ ╒ ╓ ╫ ╪ ┘ ┌ █ ▄ ▌ ▐
≡ ± ≥ ≤ ⌠ ⌡ ÷ ≈ ° ∙ · √ ⁿ ² ■
0 1 2 3 4 5 6 7 8 9 : ; < = >
P Q R S T U V W X Y Z [ \ ] ^
└ ┴ ┬ ├ ─ ┼ ╞ ╟ ╚ ╔ ╩ ╦ ╠ ═ ╬
` a b c d e f g h i j k l m n
If arr is really a char**, then you need to dereference twice to get a char.
So, in your statement, arr+i is another char**, pointing at a char* i steps further along from the one arr points at. Hopefully arr points at the beginning of an array of char* at least size long.
Now *(arr+i) dereferences it, fetching the char* pointed to by arr+i, giving you a char*.
Now *(arr+i)+7, for example, is another char*, pointing at a char 7 steps further along from the one *(arr+i) points at. Hopefully *(arr+i) points at the beginning of an array of char at least 15 long.
But you don't dereference it, so you're attempting to print the value of the pointer (i.e. the address it holds), not the value it points to (the char).
Try *(*(arr+i)+7).

Use Perl to print only if the value of column A appears with every different value of column B

So in Perl how can I go through a sample file like so:
1 D Z
1 E F
1 G L
2 D I
2 E L
3 D P
3 G L
So here I want to be able to print out only the values that have a value in the first column that appears with every different value of the second column.
The output would look like this:
1 D Z
1 E F
1 G L
cat test
1 D Z
1 E F
1 G L
2 D I
2 E L
3 D P
3 G L
perl -a -lne 'unless ( $h{ $F[1] } ) { print }; $h{ $F[1] } = 1; ' test
1 D Z
1 E F
1 G L
Okay, this isn't as easy as it seems. I've read the file into memory so that I can take three passes over it:
1. Count the number of different values in column 2
2. Record each combination of values in column 1 and column 2
3. Print those lines in the file whose first-column value is paired with as many different column 2 values as there are in the whole file
This could be improved with more information about the input file, but it works fine as it is and I see no reason to optimise it.
use strict;
use warnings 'all';
use List::MoreUtils qw/ uniq /;
my @lines = <>;
my @col2 = uniq map { (split)[1] } @lines;
my %data;
for ( @lines ) {
    my ($c1, $c2) = split;
    $data{$c1}{$c2} = 1;
}
for ( @lines ) {
    my ($c1) = split;
    print if keys %{ $data{$c1} } == @col2;
}
output
1 D Z
1 E F
1 G L
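The three passes described above can also be sketched in Python for readers who want to check the logic independently of Perl. This is a minimal sketch under the assumption of whitespace-separated columns; the function name rows_with_every_col2 and the inline sample list are introduced here for illustration only.

```python
sample = [
    "1 D Z",
    "1 E F",
    "1 G L",
    "2 D I",
    "2 E L",
    "3 D P",
    "3 G L",
]


def rows_with_every_col2(lines):
    """Keep lines whose column 1 value occurs with every column 2 value."""
    rows = [line.split() for line in lines if line.strip()]
    # Pass 1: the set of different values in column 2.
    all_col2 = {row[1] for row in rows}
    # Pass 2: the column 2 values seen with each column 1 value.
    seen = {}
    for row in rows:
        seen.setdefault(row[0], set()).add(row[1])
    # Pass 3: keep rows whose column 1 value was seen with all of them.
    return [" ".join(row) for row in rows if seen[row[0]] == all_col2]


for line in rows_with_every_col2(sample):
    print(line)
```

Comparing sets rather than counts makes the intent explicit, at the cost of holding one set per column 1 value in memory.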

Grepping Columns and then Pasting to another file

Hey guys so I have a grep problem. I want to grep lines from a file that contain a certain number, and then I want to paste certain columns from that line to a file.
For example, if I have the number 1068
File A has
1094 A B C
1068 D E F
1044 G H I
File B has
1092 L M N
1068 X Y Z
1045 Q R S
File C has
1093 A B C
1062 D E F
1041 G H I
I want to grep the line that has 1068 from all files, only paste certain columns, and paste them side by side. Note that File C does not have 1068, but I would like to paste NA instead. So that the final output looks like this:
1068 FileA A C FileB X Z FileC NA NA
Any help would be appreciated! I don't know how you would grep columns, or even check whether the number exists. For example, in File C grep would just come out with nothing, but I want to add in NA NA instead. How would I do that?
I don't think this is a job for grep. More of an awk job.
awk -v num=1068 '
BEGIN { printf "%d", num }
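# If file has changed and num has not been found in the previous file...
# (FILENAME already names the new file here, so use the saved name.)
FNR==1 && NR!=1 && !found_num { printf " %s NA NA", prev_file }
# If at the beginning of a file (needs to be after previous line).
FNR==1 { found_num = 0; prev_file = FILENAME }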
# If we find num, print the data and set found_num flag.
$1 == num { printf " %s %s %s", FILENAME, $2, $4; found_num = 1 }
END { if (!found_num) printf " %s NA NA", FILENAME; print"" }
' FileA FileB FileC
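The same "look up a key in each file, fall back to NA" logic can be sketched in Python. This is an illustrative sketch, not a replacement for the awk answer: the collect function, its cols parameter, and the in-memory files dictionary (standing in for FileA, FileB and FileC) are assumptions, and unlike the awk program it stops at the first matching line in each file.

```python
# In-memory stand-ins for the three files shown in the question.
files = {
    "FileA": ["1094 A B C", "1068 D E F", "1044 G H I"],
    "FileB": ["1092 L M N", "1068 X Y Z", "1045 Q R S"],
    "FileC": ["1093 A B C", "1062 D E F", "1041 G H I"],
}


def collect(num, sources, cols=(1, 3)):
    """Build one output line: num, then name and chosen columns per source."""
    parts = [str(num)]
    for name, lines in sources.items():
        picked = ["NA"] * len(cols)      # default when num is not present
        for line in lines:
            fields = line.split()
            if fields and fields[0] == str(num):
                picked = [fields[c] for c in cols]   # 0-based: fields 2 and 4
                break                    # first match only (assumption)
        parts.append(name)
        parts.extend(picked)
    return " ".join(parts)


print(collect(1068, files))
```

With real files, each list of lines would come from reading the file, for example with open(name) as a context manager.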
