awk PROCINFO["sorted_in"] Multidimensional array sorting problem - arrays

[root@rocky ~]# cat c
a b 1
a c 4
a r 6
a t 2
b a 89
b c 76
a d 45
b z 9
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a)print i,a[i]}' c
b 1
t 2
r 6
z 9
d 45
c 76
a 89
[root@rocky ~]# awk '{a[$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a)print i,a[i]}' c
a 89
c 76
d 45
z 9
r 6
t 2
b 1
[root@rocky ~]# awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_desc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a d 45
a r 6
a c 4
a t 2
a b 1
b a 89
b c 76
b z 9
[root@rocky ~]# awk --version
GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
Sorting a multidimensional array with PROCINFO["sorted_in"]="@val_num_asc" or PROCINFO["sorted_in"]="@val_num_desc" does not sort the output as a whole, while one-dimensional arrays sort correctly. What is the problem? Is it that multidimensional arrays are not supported?

awk '{a[$1][$2]=$3}END{PROCINFO["sorted_in"]="@val_num_asc";for(i in a){for(x in a[i])print i,x,a[i][x]}}' c
a b 1
a t 2
a c 4
a r 6
a d 45
b z 9
b c 76
b a 89
This is not a bug; it is how it is supposed to work. Look closely: you are using nested loops here.
for(i in a)
This is the outer loop. It runs twice, once for each first-level index, a and b.
for(x in a[i])
This is the inner loop. It iterates over the subarray a["a"] on the first pass and over a["b"] on the second.
@val_num_asc sorts elements numerically in ascending order by value, which here is $3. The setting applies to each for loop separately, so every subarray is sorted on its own. If you look closely, the printed values 1, 2, 4, 6, 45 for $1=a are in ascending numeric order, and so are 9, 76, 89 for $1=b.
If you want the output sorted across all rows, flatten the array to one dimension by joining both keys into a single index:
awk '{a[$1 OFS $2]=$3} END {PROCINFO["sorted_in"]="@val_num_asc"; for(x in a) print x, a[x]}' c
a b 1
a t 2
a c 4
a r 6
b z 9
a d 45
b c 76
b a 89
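The difference between the nested loops and the flattened array can also be modelled outside awk. The sketch below is a Python analogy, not gawk itself: a dict of dicts stands in for a[$1][$2], a flat dict stands in for a[$1 OFS $2], and sorted() by value stands in for the value-based sort order. The names records, nested, flat, per_group and overall are made up for illustration, and the outer keys are simply visited in insertion order, which is an assumption of the model.

```python
# Rows of file c: (first key, second key, value)
records = [
    ("a", "b", 1), ("a", "c", 4), ("a", "r", 6), ("a", "t", 2),
    ("b", "a", 89), ("b", "c", 76), ("a", "d", 45), ("b", "z", 9),
]

# Model of a[$1][$2]=$3: one inner dict per first key.
nested = {}
for k1, k2, value in records:
    nested.setdefault(k1, {})[k2] = value

# Nested loops: each inner dict is sorted on its own, so the
# ordering restarts for every outer key.
per_group = [
    (k1, k2, value)
    for k1, inner in nested.items()
    for k2, value in sorted(inner.items(), key=lambda kv: kv[1])
]

# Model of a[$1 OFS $2]=$3: one flat dict, sorted once over all values.
flat = {f"{k1} {k2}": value for k1, k2, value in records}
overall = sorted(flat.items(), key=lambda kv: kv[1])

for k1, k2, value in per_group:
    print(k1, k2, value)
print("--")
for key, value in overall:
    print(key, value)
```

In the first listing the value 45 precedes 9 because the two groups are sorted independently; in the second, "b z 9" moves ahead of "a d 45".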

Related

Print all elements in an array in AWK

I want to loop through all elements in an array in awk and print. The values are sourced from the file below:
Ala A Alanine
Arg R Arginine
Asn N Asparagine
Asp D Aspartic acid
Cys C Cysteine
Gln Q Glutamine
Glu E Glutamic acid
Gly G Glycine
His H Histidine
Ile I Isoleucine
Leu L Leucine
Lys K Lysine
Met M Methionine
Phe F Phenylalanine
Pro P Proline
Pyl O Pyrrolysine
Ser S Serine
Sec U Selenocysteine
Thr T Threonine
Trp W Tryptophan
Tyr Y Tyrosine
Val V Valine
Asx B Aspartic acid or Asparagine
Glx Z Glutamic acid or Glutamine
Xaa X Any amino acid
Xle J Leucine or Isoleucine
TERM TERM termination codon
I have tried this:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes);{print key,codes[key],length(codes)}}' $input1 $input2
And the output is always Cys C 27 and when I replace codes[$1]=$2 for codes[$2]=$1 I get M Met 27.
How can I make my code print out all the values sequentially? I don't understand why my code selectively prints out just one element when I can tell the array length is 27 as expected. (To keep my code minimal I have excluded code within else{next} - Otherwise I just want to print all elements from array codes while retaining the else{***} command)
According to How to view all the content in an awk array?, The syntax above should work. I tried it here echo -e "1 2\n3 4\n5 6" | awk '{my_dict[$1] = $2};END {for(key in my_dict) print key " : " my_dict[key],": "length(my_dict)}' and that worked well.
With your shown samples and attempts, please try the following, written and tested in GNU awk.
awk '
BEGIN{
FS=OFS="\t"
}
{
codes[$1]=$2
}
END{
for(key in codes){
print key,codes[key],length(codes)
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="\t" ##Setting FS and OFS as TAB here.
}
{
codes[$1]=$2 ##Creating array codes with index of 1st field and value of 2nd field
}
END{ ##Starting END block of this program from here.
for(key in codes){ ##Traversing through codes array here.
print key,codes[key],length(codes) ##Printing index and value of current item along with total length of codes.
}
}' Input_file ##Mentioning Input_file name here.
I'm a bit confused about what you are after, but to print the codes in their original order, each with its sequence number (ignoring the name), you can do:
awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
This uses two arrays: seq maps each sequence number to the single-letter code, and codes maps that letter to the three-letter code.
Example Use/Output
$ awk '{seq[++n]=$2; codes[$2]=$1}
END{for (i=1;i<=n;i++) printf "%s\t%s\t%d\n", codes[seq[i]], seq[i], i}' file
Ala A 1
Arg R 2
Asn N 3
Asp D 4
Cys C 5
Gln Q 6
Glu E 7
Gly G 8
His H 9
Ile I 10
Leu L 11
Lys K 12
Met M 13
Phe F 14
Pro P 15
Pyl O 16
Ser S 17
Sec U 18
Thr T 19
Trp W 20
Tyr Y 21
Val V 22
Asx B 23
Glx Z 24
Xaa X 25
Xle J 26
TERM TERM 27
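Resolved: the error was caused by the stray ; after for (key in codes) in END{for (key in codes);{print key,codes[key],length(codes)}}. The semicolon ends the for statement with an empty body, so the loop visits every key without printing anything, and the block that follows runs only once, with key still holding the last index visited.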
Solution:
awk 'BEGIN{FS="\t";OFS="\t"}{if (FNR==NR) {codes[$1]=$2;} else{next}}END{for (key in codes){print key,codes[key],length(codes)}}' $input1 $input2
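The effect of the stray semicolon can be mimicked in Python, where a loop with an empty body has to be written out explicitly. This is only an analogy under stated assumptions: the three-entry codes dict and the names buggy_lines and fixed_lines are illustrative, and Python dicts keep insertion order whereas awk's for (key in array) visits indices in an unspecified order, which is why the original command printed Cys rather than the last line of the file.

```python
codes = {"Ala": "A", "Arg": "R", "Asn": "N"}

# Buggy shape: `for (key in codes);` is a loop with an empty body,
# so the print that follows runs exactly once, after the loop ends.
buggy_lines = []
for key in codes:
    pass
buggy_lines.append((key, codes[key], len(codes)))

# Fixed shape: the print is the loop body and runs once per key.
fixed_lines = []
for key in codes:
    fixed_lines.append((key, codes[key], len(codes)))

print(buggy_lines)
print(fixed_lines)
```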

Convert a multidimensional array to a data frame Python

I have a data set as below:
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

data={ 'StoreID':['a','b','c','d'],
       'Sales':[1000,200,500,800],
       'Profit':[600,100,300,500]
}
data=pd.DataFrame(data)
data.set_index(['StoreID'],inplace=True,drop=True)
X=data.values
dist=euclidean_distances(X)
Now I get an array as below:
array([[0. ,943,583,223],
[943, 0.,360,721],
[583,360,0., 360],
[223,721,360, 0.]])
My goal is to get the unique combinations of stores and their corresponding distances. I would like the end result to be a data frame like the one below:
Store NextStore Dist
a b 943
a c 583
a d 223
b c 360
b d 721
c d 360
Thank you for your help!
You probably want pandas.melt which will "unpivot" the distance matrix into tall-and-skinny format.
m = pd.DataFrame(dist)
m.columns = list('abcd')
m['Store'] = list('abcd')
...which produces:
a b c d Store
0 0.000000 943.398113 583.095189 223.606798 a
1 943.398113 0.000000 360.555128 721.110255 b
2 583.095189 360.555128 0.000000 360.555128 c
3 223.606798 721.110255 360.555128 0.000000 d
Melt data into tall-and-skinny format:
pd.melt(m, id_vars=['Store'], var_name='nextStore')
Store nextStore value
0 a a 0.000000
1 b a 943.398113
2 c a 583.095189
3 d a 223.606798
4 a b 943.398113
5 b b 0.000000
6 c b 360.555128
7 d b 721.110255
8 a c 583.095189
9 b c 360.555128
10 c c 0.000000
11 d c 360.555128
12 a d 223.606798
13 b d 721.110255
14 c d 360.555128
15 d d 0.000000
Remove redundant rows, convert dist to int, and sort:
df2 = pd.melt(m, id_vars=['Store'],
var_name='NextStore',
value_name='Dist')
df3 = df2[df2.Store < df2.NextStore].copy()
df3.Dist = df3.Dist.astype('int')
df3.sort_values(by=['Store', 'NextStore'])
Store NextStore Dist
4 a b 943
8 a c 583
12 a d 223
9 b c 360
13 b d 721
14 c d 360
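For comparison, the same six pairs can be produced without pandas or scikit-learn. The sketch below is a different technique from the melt approach above, using only the standard library: itertools.combinations to enumerate each unordered pair once and math.dist (Python 3.8+) for the Euclidean distance. The stores dict and the pairs list are illustrative names, and the figures are copied from the question.

```python
from itertools import combinations
from math import dist

# Same figures as the question: store -> (Sales, Profit)
stores = {
    "a": (1000, 600),
    "b": (200, 100),
    "c": (500, 300),
    "d": (800, 500),
}

# combinations() yields each unordered pair once, in sorted key order,
# which plays the role of the Store < NextStore filter.
pairs = [
    (s1, s2, int(dist(stores[s1], stores[s2])))
    for s1, s2 in combinations(sorted(stores), 2)
]

print("Store NextStore Dist")
for s1, s2, d in pairs:
    print(s1, s2, d)
```

As in the melt version, int() truncates the distance, so 943.398113 becomes 943.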
I have a 2D array with shape (14576, 24), that is, 14576 elements (rows) and 24 features (columns) that will make up my DataFrame.
data_from_2Darray = {}
for i in range(name_2Darray.shape[1]):
    data_from_2Darray["df_column_name{}".format(i)] = name_2Darray[:,i]
# visualize the result
pd.DataFrame(data=data_from_2Darray).plot()
# the array converted to a DataFrame
df = pd.DataFrame(data=data_from_2Darray)

Python 3.x - Append data to Pandas dataframe using for loop

I have an empty pandas DataFrame:
aqi_df = pd.DataFrame(columns = ["IMEI","Date","pm10conc_24hrs","pm25conc_24hrs","sdPm10","sdPm25","aqi","windspeed","winddirection","severity","health_impact"] )
I want to add elements one by one to each column -
for i in range(1,10):
    aqi_df.IMEI.append("a")
    aqi_df.Date.append("b")
    aqi_df.pm10conc_24hrs.append("c")
    .
    .
    .
But append throws an error
TypeError: cannot concatenate a non-NDFrame object
How can I append elements to pandas dataframe one by one?
IIUC you can use:
aqi_df = pd.DataFrame(columns = ["IMEI","Date","pm10conc_24hrs"] )
print (aqi_df)
for i in range(1,10):
    aqi_df.loc[i] = ['a','b','c']
print (aqi_df)
IMEI Date pm10conc_24hrs
1 a b c
2 a b c
3 a b c
4 a b c
5 a b c
6 a b c
7 a b c
8 a b c
9 a b c
But it is better to create the DataFrame from Series or from a dict:
IMEI = pd.Series(['aa','bb','cc'])
Date = pd.Series(['2016-01-03','2016-01-06','2016-01-08'])
pm10conc_24hrs = pd.Series(['w','e','h'])
aqi_df = pd.DataFrame({'a':IMEI,'Date':Date,'pm10conc_24hrs':pm10conc_24hrs})
print (aqi_df)
Date a pm10conc_24hrs
0 2016-01-03 aa w
1 2016-01-06 bb e
2 2016-01-08 cc h
aqi_df = pd.DataFrame({'a':['aa','bb','cc'],
'Date':['2016-01-03','2016-01-06','2016-01-08'],
'pm10conc_24hrs':['w','e','h']})
print (aqi_df)
Date a pm10conc_24hrs
0 2016-01-03 aa w
1 2016-01-06 bb e
2 2016-01-08 cc h
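When the rows really do arrive one at a time, a commonly used pattern is to collect them in a plain list and build the DataFrame once at the end, rather than enlarging it inside the loop. The sketch below assumes the same three columns and placeholder values as the answer above; rows and columns are illustrative names. The pandas import is guarded only so that the row-collection step can be run on its own.

```python
columns = ["IMEI", "Date", "pm10conc_24hrs"]

# Collect one dict per row; the placeholder values mirror the answer above.
rows = []
for i in range(1, 10):
    rows.append({"IMEI": "a", "Date": "b", "pm10conc_24hrs": "c"})

try:
    import pandas as pd
except ImportError:
    pd = None  # the row-collection step itself needs only the standard library

if pd is not None:
    # Build the frame once from the collected rows.
    aqi_df = pd.DataFrame(rows, columns=columns)
    print(aqi_df)
else:
    print(rows)
```

Unlike the loc version, the resulting index starts at 0 unless an index is passed explicitly.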

Get all values in one vector in Matlab

I have two arrays, A (500 x 128 integer values) and B (500 x 64 real values). I want to concatenate both to get C. The problem is that Matlab ignores all values in B as they are small values. Is there any way to get all values without neglect?
Thanks.
I think this can simulate your problem:
A = int8(randi(4,4)*10);
B = rand(4,4)*10;
C = [A B]
C =
10 20 20 30 3 0 8 3
40 10 40 40 2 6 1 2
30 20 10 30 2 1 6 6
40 20 40 30 9 9 5 5
To get the result you want, convert A to double before concatenating:
C = [double(A) B]
C =
Columns 1 through 7:
10.00000 20.00000 20.00000 30.00000 2.92979 0.31162 7.73694
40.00000 10.00000 40.00000 40.00000 1.71392 5.82900 1.08936
30.00000 20.00000 10.00000 30.00000 1.83903 0.84160 5.75773
40.00000 20.00000 40.00000 30.00000 8.81039 9.31400 4.60636
Column 8:
3.10192
1.75853
5.75013
5.39383
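When an integer array and a double array are concatenated, MATLAB converts the result to the integer type, which is why the values of B were rounded in the first output. Converting A to double first makes the whole result a double matrix, so B keeps its original values. The MATLAB documentation lists the other numeric types.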
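The rounding seen in the first output can be reproduced with a small model. The sketch below is Python, not MATLAB: to_int8 is a hypothetical helper that imitates conversion of a real value to int8 (round to the nearest integer, ties away from zero, then saturate to the range -128 to 127), and the two rows are copied from the first row of the outputs above.

```python
import math

def to_int8(x):
    """Model of converting a real x to int8: round to nearest, then saturate."""
    rounded = math.floor(abs(x) + 0.5)        # ties go away from zero
    rounded = int(math.copysign(rounded, x))
    return max(-128, min(127, rounded))

a_row = [10, 20, 20, 30]                      # first row of A (int8)
b_row = [2.92979, 0.31162, 7.73694, 3.10192]  # first row of B (double)

# Like [A B]: the doubles are converted to the integer type.
mixed = a_row + [to_int8(x) for x in b_row]

# Like [double(A) B]: everything stays floating point.
as_double = [float(x) for x in a_row] + b_row

print(mixed)
print(as_double)
```

The first list matches the first row of the rounded output, 10 20 20 30 3 0 8 3, while the second keeps the fractional parts of B.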

Filtering Arrays/Vectors in Matlab

I have two arrays in Matlab
say
A = [1 4 89 2 67 247 2]
B = [0 1 1 1 0 0 1]
I want an array C, which contains elements from array A, if there is 1 in B at the corresponding index. In this case, C = [4 89 2 2].
How to do this?
Use logical indexing:
>> C = A(logical(B))
C =
4 89 2 2
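For readers who know Python better than MATLAB, a rough standard-library analogue of logical indexing is itertools.compress, which keeps the elements of the first sequence whose corresponding selector is truthy. This is only a comparison sketch; the arrays are copied from the question.

```python
from itertools import compress

A = [1, 4, 89, 2, 67, 247, 2]
B = [0, 1, 1, 1, 0, 0, 1]

# Keep A[i] wherever B[i] is nonzero.
C = list(compress(A, B))
print(C)
```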
