Print all elements in an array in AWK - arrays

I want to loop through all elements in an array in awk and print. The values are sourced from the file below:
Ala A Alanine
Arg R Arginine
Asn N Asparagine
Asp D Aspartic acid
Cys C Cysteine
Gln Q Glutamine
Glu E Glutamic acid
Gly G Glycine
His H Histidine
Ile I Isoleucine
Leu L Leucine
Lys K Lysine
Met M Methionine
Phe F Phenylalanine
Pro P Proline
Pyl O Pyrrolysine
Ser S Serine
Sec U Selenocysteine
Thr T Threonine
Trp W Tryptophan
Tyr Y Tyrosine
Val V Valine
Asx B Aspartic acid or Asparagine
Glx Z Glutamic acid or Glutamine
Xaa X Any amino acid
Xle J Leucine or Isoleucine
TERM TERM termination codon
I have tried this:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes);{print key,codes[key],length(codes)}}' $input1 $input2
And the output is always Cys C 27 and when I replace codes[$1]=$2 for codes[$2]=$1 I get M Met 27.
How can I make my code print out all the values sequentially? I don't understand why my code selectively prints out just one element when I can tell the array length is 27 as expected. (To keep my code minimal I have excluded code within else{next} - Otherwise I just want to print all elements from array codes while retaining the else{***} command)
According to How to view all the content in an awk array?, The syntax above should work. I tried it here echo -e "1 2\n3 4\n5 6" | awk '{my_dict[$1] = $2};END {for(key in my_dict) print key " : " my_dict[key],": "length(my_dict)}' and that worked well.

With your shown samples and attempts, please try the following, written and tested in GNU awk.
awk '
BEGIN{
FS=OFS="\t"
}
{
codes[$1]=$2
}
END{
for(key in codes){
print key,codes[key],length(codes)
}
}' Input_file
Explanation: the same program with a comment on each line.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
}
{
codes[$1]=$2 ##Creating array codes with index of 1st field and value of 2nd field
}
END{ ##Starting END block of this program from here.
for(key in codes){ ##Traversing through codes array here.
print key,codes[key],length(codes) ##Printing index and value of current item along with total length of codes.
}
}' Input_file ##Mentioning Input_file name here.

I'm a bit confused about what you are after, but to print the codes in input order together with a sequence number (ignoring the name), you can do:
awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
This uses two arrays: seq maps each sequence number to the single-letter code, and codes maps that letter back to the three-letter code.
Example Use/Output
$ awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
Ala A 1
Arg R 2
Asn N 3
Asp D 4
Cys C 5
Gln Q 6
Glu E 7
Gly G 8
His H 9
Ile I 10
Leu L 11
Lys K 12
Met M 13
Phe F 14
Pro P 15
Pyl O 16
Ser S 17
Sec U 18
Thr T 19
Trp W 20
Tyr Y 21
Val V 22
Asx B 23
Glx Z 24
Xaa X 25
Xle J 26
TERM TERM 27

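Resolved: the error was caused by the stray ; after the for (...) header here: END{for (key in codes);{print key,codes[key],length(codes)}}. The semicolon acts as an empty loop body, so the braces that follow run only once, after the loop has finished, with key still holding the last index visited.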
Solution:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes){print key,codes[key],length(codes)}}' $input1 $input2
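To see the effect of the stray semicolon in isolation, here is a minimal sketch. It assumes a made-up three-line, tab-separated sample rather than the real codes file, and only counts how many lines each variant prints.

```shell
# Hypothetical three-line sample standing in for the real tab-separated file.
sample=$(printf 'Ala\tA\nArg\tR\nAsn\tN\n')

# Stray ';' after the for header: the loop body is empty, so the
# following block runs once, after the loop has finished.
buggy=$(printf '%s\n' "$sample" |
    awk -F'\t' '{codes[$1]=$2} END{for (key in codes);{print key, codes[key]}}' |
    grep -c .)

# No ';': the block is the loop body and runs once per element.
fixed=$(printf '%s\n' "$sample" |
    awk -F'\t' '{codes[$1]=$2} END{for (key in codes){print key, codes[key]}}' |
    grep -c .)

echo "buggy: $buggy line(s), fixed: $fixed line(s)"
```

The first variant prints a single line no matter how many elements the array holds, which matches the "always one element" symptom in the question.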

Related

awk PROCINFO["sorted_in"] Multidimensional array sorting problem

[root@rocky ~]# cat c
a b 1
a c 4
a r 6
a t 2
b a 89
b c 76
a d 45
b z 9
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a)print i,a[i]}' c
b 1
t 2
r 6
z 9
d 45
c 76
a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a)print i,a[i]}' c
a 89
c 76
d 45
z 9
r 6
t 2
b 1
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a d 45
a r 6
a c 4
a t 2
a b 1
b a 89
b c 76
b z 9
[root@rocky ~]# awk --version
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
Sorting a multidimensional array with PROCINFO["sorted_in"]="@val_num_asc" or PROCINFO["sorted_in"]="@val_num_desc" does not sort the output as a whole, while one-dimensional arrays sort without any problem. What is going on? Is it that multidimensional arrays are not supported?
awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
This is not a bug; this is how it is supposed to work. Look closely: you are using a nested loop here.
for(i in a)
This is the outer loop, which iterates over the first-level indices a and b in two iterations.
for(x in a[i])
This is the inner loop, which iterates over the sub-array a["a"] on the first outer iteration and over a["b"] on the second.
@val_num_asc sorts numerically in ascending order by element value, which here is $3, and the ordering is applied to each for loop separately. If you look closely, the printed values 1,2,4,6,45 for $1=a are sorted numerically, and so are 9,76,89 for $1=b.
If you want sorted output using awk then use this suggested workaround:
awk '{a[$1 OFS $2]=$3} END {PROCINFO["sorted_in"]="@val_num_asc"; for(x in a) print x, a[x]}' c
a b 1
a t 2
a c 4
a r 6
b z 9
a d 45
b c 76
b a 89
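If gawk is not a requirement, a portable alternative (not part of the answer above, just a sketch) is to let sort order the rows numerically on the third column. The temporary file below is an assumption used to make the example self-contained; the data is copied from the question's file c.

```shell
# Recreate the question's sample data in a temporary file.
tmp=$(mktemp)
printf 'a b 1\na c 4\na r 6\na t 2\nb a 89\nb c 76\na d 45\nb z 9\n' > "$tmp"

# -k3,3n: use only the third field as the key and compare it numerically.
sorted=$(sort -k3,3n "$tmp")
printf '%s\n' "$sorted"

rm -f "$tmp"
```

This gives one global ordering across both first-level keys, the same result as the flattened-index workaround.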

Gnuplot: Plotting several datasets with titles from the pipe

As a follow up of: Gnuplot: Plotting several datasets with titles from one file, I have a test.dat file:
"p = 0.1"
1 1
3 3
4 1
"p = 0.2"
1 3
2 2
5 2
and I can plot it with no issues from within gnuplot using:
> plot for [IDX=0:1] 'test.dat' i IDX u 1:2 w lines title columnheader(1)
however I cannot pipe the data.
Here is the single line example:
$ cat test.dat | gnuplot --persist -e "plot for [IDX=0:1] '-' i IDX u 1:2 w lines title columnheader(1)"
line 10: warning: Skipping data file with no valid points
I get the warning message and only the first set is plotted. I tried to add an e at the end of the data file, but no luck... This should be trivial, am I making a silly mistake?
I've been messing around a bit more. These work:
gnuplot --persist -e "plot for [IDX=0:1] 'test.dat' i IDX u 1:2 w lines title columnheader(1)"
gnuplot --persist -e "plot for [IDX=0:1] '< cat test.dat' i IDX u 1:2 w lines title columnheader(1)"
These don't:
cat test.dat | gnuplot --persist -e "plot for [IDX=0:1] '-' i IDX u 1:2 w lines title columnheader(1)"
cat test.dat | gnuplot --persist -e "plot for [IDX=0:1] '< cat' i IDX u 1:2 w lines title columnheader(1)"
It looks like a bug to me. I tried a few Gnuplot versions (4.6.6, 5.0.0, 5.0.3), but they all show the same behaviour.
OK, I finally found it by browsing the documentation. When the data is piped, each index selection requires the whole data to be repeated:
plot '-' index 0, '-' index 1
2
4
6


10
12
14
e
2
4
6


10
12
14
e
or, as a much simpler alternative, one can just do:
plot '-', '-'
2
4
6
e
10
12
14
e
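The "repeat the whole data once per index" rule can be scripted. The sketch below is only an illustration: emit_plot is a hypothetical helper, the sample data is made up, and the columnheader titles from the question are left out. It writes the plot command followed by one copy of the data, each terminated by e, per dataset; the final pipe into gnuplot is shown as a comment so the script runs without gnuplot installed.

```shell
# Hypothetical helper: print a gnuplot script that plots COUNT datasets of
# FILE via inline '-' data, repeating the whole file once per index.
emit_plot() {
    file=$1
    count=$2
    cmd="plot"
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ "$i" -gt 0 ]; then cmd="$cmd,"; fi
        cmd="$cmd '-' index $i using 1:2 with lines"
        i=$((i + 1))
    done
    printf '%s\n' "$cmd"
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$file"      # full copy of the data for this '-'
        printf 'e\n'     # end-of-data marker for this '-'
        i=$((i + 1))
    done
}

# Made-up sample: two datasets separated by two blank lines.
data=$(mktemp)
printf '1 1\n3 3\n4 1\n\n\n1 3\n2 2\n5 2\n' > "$data"
emit_plot "$data" 2
# To plot for real: emit_plot "$data" 2 | gnuplot --persist
rm -f "$data"
```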

cmp command returning EOF on my output despite exact match as far as i can tell

So I will start by saying this is for a course and I assume the professor won't really care that they are the same if cmp returns something weird. I am attempting to compare the output of my code, named uout, to the correct output, in the file correct0. The problem however is that it returns "cmp: EOF on uout". From a little bit of digging I found that EOF indicates they are the same up to the end of the shorter file with the shorter file being the one named after EOF, so what I gather from this is that they are the same until uout ends short. Problem is however, that it absolutely does NOT end short. When opening both in a text editor and manually checking spaces, line and column numbers, etc. everything was an EXACT match.
To illustrate my point here are the files copied directly using ctrl-a + ctrl-v:
correct0 http://pastebin.com/Bx7SM7rA
uout http://pastebin.com/epMFtFpM
If anyone knows what is going wrong and can explain it simply I would appreciate it. I have checked multiple times and can't find anything wrong with it. Maybe it is something simple and I just can't see it, but everything I have seen so far seems to suggest that the files are the same up until the "shorter one" ends, and oddly even if I switch my execution from
cmp correct0 uout
to
cmp uout correct0
both instances end up returning
cmp: EOF on uout
The files you uploaded are the same. It may be a line-ending problem: DOS/Windows uses "\r\n" as a line ending, but Unix/Linux uses just "\n".
The best utility on a Linux machine for finding out what the problem is, is "od" (octal dump), or any other command that shows a file's raw bytes. For example:
$ od -c uout.txt
0000000 E n t e r t h e n u m b e r
0000020 s f r o m 1 t o 1 6 i
0000040 n a n y o r d e r , s e p
0000060 a r a t e d b y s p a c e s
0000100 : \r \n \r \n 1 6 3 2 1
0000120 3 \r \n 5 1 0 1 1 8 \r
0000140 \n 9 6 7 1 2 \r \n
0000160 4 1 5 1 4 1 \r \n \r \n R
0000200 o w s u m s : 3 4 3 4 3
0000220 4 3 4 \r \n C o l u m n s u m
0000240 s : 3 4 3 4 3 4 3 4 \r \n
0000260 D i a g o n a l s u m s : 3
0000300 4 3 4 \r \n \r \n T h e m a t r
0000320 i x i s a m a g i c s q
0000340 u a r e
0000344
As you can see, the line endings here are \r\n. Since you opened the files and copy-pasted them, this reflects your machine's preferences and not the actual files' line endings. You can also try the dos2unix utility to convert line endings.
If the files are human-readable I would use the diff tool instead. It has options to ignore line-ending differences (see --ignore-space-change, --strip-trailing-cr and --ignore-blank-lines).
diff -u --ignore-space-change --strip-trailing-cr --ignore-blank-lines test_cases/correct0 test_cases/uout0
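As a small illustration of the line-ending explanation, the sketch below builds two made-up files that differ only in their line endings (the names unix.txt and dos.txt are assumptions, not the real correct0 and uout) and compares them before and after stripping the carriage returns with tr.

```shell
dir=$(mktemp -d)

# Same text, once with Unix (\n) and once with DOS (\r\n) line endings.
printf 'Row sums: 34 34\nColumn sums: 34 34\n'     > "$dir/unix.txt"
printf 'Row sums: 34 34\r\nColumn sums: 34 34\r\n' > "$dir/dos.txt"

# cmp -s is silent and reports the result through its exit status.
if cmp -s "$dir/unix.txt" "$dir/dos.txt"; then raw=same; else raw=differ; fi

# Delete every carriage return, then compare again.
tr -d '\r' < "$dir/dos.txt" > "$dir/converted.txt"
if cmp -s "$dir/unix.txt" "$dir/converted.txt"; then stripped=same; else stripped=differ; fi

echo "raw: $raw, after stripping CR: $stripped"
rm -rf "$dir"
```

The two files look identical in most editors, yet cmp only treats them as equal once the \r bytes are gone.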

Grepping Columns and then Pasting to another file

Hey guys so I have a grep problem. I want to grep lines from a file that contain a certain number, and then I want to paste certain columns from that line to a file.
For example, if I have the number 1068
File A has
1094 A B C
1068 D E F
1044 G H I
File B has
1092 L M N
1068 X Y Z
1045 Q R S
File C has
1093 A B C
1062 D E F
1041 G H I
I want to grep the line that has 1068 from all files, only paste certain columns, and paste them side by side. Note that File C does not have 1068, but I would like to paste NA instead. So that the final output looks like this:
1068 FileA A C FileB X Z FileC NA NA
Any help would be appreciated! I don't know how you would grep columns, or even check whether the number exists. For example in File C, grep would just come out with nothing, but I want to add in NA NA instead. How would I do that?
I don't think this is a job for grep. More of an awk job.
awk -v num=1068 '
BEGIN { printf "%d", num }
# If the file has changed and num was not found in the previous file...
# (FILENAME already names the new file here, so use the saved name.)
FNR==1 && NR!=1 && !found_num { printf " %s NA NA", prev_file }
# At the beginning of each file (needs to be after the previous rule).
FNR==1 { found_num = 0; prev_file = FILENAME }
# If we find num, print the data and set found_num flag.
$1 == num { printf " %s %s %s", FILENAME, $2, $4; found_num = 1 }
# Handle the last file, then end the output line.
END { if (!found_num) printf " %s NA NA", FILENAME; print "" }
' FileA FileB FileC

Octave read each line of file to vector

I have a file data.txt
1 22 34 -2
3 34 -3
2
3 43 -3 2 3
And I want to load this file onto Octave as separate matrices
matrix1 = [1; 22; 34 ;-2]
matrix2 = [3; 34; -3]
.
.
.
How do I do this? I've tried fopen and fgetl, but it seems as if each character is given its own spot in the matrix. I want to separate the values, not the characters (it's space delimited).
quick and dirty:
A = dlmread("file");
matrix = 1; # just for nice varname generation
for i = 1:size(A,1)
name = genvarname("matrix",who());
eval([name " = A(i,:);"]);
eval(["[_,__," name "] = find(" name ");"]);
end
clear _ __ A matrix i
The format needs to be as you specified.

Resources