Consider the following code:
EmbedFeatures <- function(x, w) {
  # embed() returns the w lags in decreasing time order,
  # so reverse the columns to get increasing time order
  c_rev <- seq(from = w, to = 1, by = -1)
  em <- embed(x, w)
  em <- em[, c_rev]
  return(em)
}
library(abind)

m <- matrix(1:1400, 100, 14)  # 100 timesteps, 14 features
X.tr <- c()
F <- dim(m)[2]  # number of features
W <- 16         # window length
for (i in 1:F) {
  X.tr <- abind(list(X.tr, EmbedFeatures(m[, i], W)), along = 3)
}
This builds an array of features; each row has W = 16 timesteps.
The dimensions are:
> dim(X.tr)
[1] 85 16 14
The following are the first samples:
> X.tr[1,,1]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
> X.tr[1,,2]
[1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
> X.tr[1,,3]
[1] 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
I would like to use apply to build this array, but the following code does not work:
X.tr <- apply(m,2,EmbedFeatures, w=W)
since it gives me the following dimensions:
> dim(X.tr)
[1] 1360 14
How can I do it?
Firstly, thanks for providing a great reproducible example!
Now, as far as I know, you can't do this with apply. You can, however, do it with a combination of plyr::aaply, which allows you to return multidimensional arrays, and base::aperm, which allows you to transpose multidimensional arrays.
See the aaply documentation and the aperm documentation for details.
After running your code above, you can do:
library(plyr)
Y.tr <- plyr::aaply(m, 2, EmbedFeatures, w=W)
Z.tr <- aperm(Y.tr, c(2,3,1))
> dim(Y.tr)
[1] 14 85 16
> dim(Z.tr)
[1] 85 16 14
I turned those two lines of code into a function.
using_aaply <- function(m) {
  # aaply returns features x samples x timesteps,
  # so transpose to samples x timesteps x features
  Y.tr <- aaply(m, 2, EmbedFeatures, w = W)
  Z.tr <- aperm(Y.tr, c(2, 3, 1))
  return(Z.tr)
}
Then I did some microbenchmarking.
library(microbenchmark)
microbenchmark(for(i in 1:F){ X.tr<-abind(list(X.tr,EmbedFeatures(m[,i],W)),along=3)}, times=100)
Unit: milliseconds
expr
for (i in 1:F) { X.tr <- abind(list(X.tr, EmbedFeatures(m[, i], W)), along = 3) }
min lq mean median uq max neval
405.0095 574.9824 706.0845 684.8531 802.4413 1189.845 100
microbenchmark(using_aaply(m=m), times=100)
Unit: milliseconds
               expr      min       lq     mean   median       uq      max neval
 using_aaply(m = m) 4.873627 5.670474 7.797129 7.083925 9.674041 19.74449   100
It seems like aaply plus aperm is loads faster than abind in a for-loop. (One caveat: the loop benchmark overstates the gap somewhat, since X.tr is never reset and keeps growing across the 100 benchmark iterations.)
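For completeness, a base-R alternative (my addition, not part of the original answer): vapply simplifies to a 3-D array when FUN.VALUE is a matrix, and the result already comes out as samples x timesteps x features, so no aperm is needed.
# Minimal sketch: vapply returns an array of dim c(dim(FUN.VALUE), length(X)),
# i.e. 85 x 16 x 14 here
n_samples <- nrow(m) - W + 1
X.tr2 <- vapply(seq_len(ncol(m)),
                function(i) EmbedFeatures(m[, i], W),
                FUN.VALUE = matrix(0, n_samples, W))
dim(X.tr2)  # [1] 85 16 14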
Some of the values in the columns Molecular.Weight and m.z are quite similar, often differing by 1.0 or less, but in some instances the difference is greater than 1.0. I would like to generate a new dataset that only includes the rows with a difference less than or equal to 1.0. However, either column can hold the higher number, so I am struggling to write a condition that works.
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I had to do it in two parts because I couldn't figure out how to combine them, but it's still not giving me the right result.
The first section is supposed to filter out the rows where Molecular.Weight is greater than m.z by more than 1, and the second then filters out the rows where m.z is greater than Molecular.Weight by more than 1. The first part seems to work and gives me a new dataset with around half the number of rows, but when I run the second part on it, I get just 1 row (and it's not even correct, because that one compound does fall within the 1.0 difference). Any help is super appreciated, thanks!
rawdata <- read.csv("Analysis negative + positive minus QC.csv")

filtered_data <- c()
for (i in 1:nrow(rawdata)) {
  if (rawdata$m.z[i] - rawdata$Molecular.Weight[i] < 1)
    filtered_data <- rbind(filtered_data, rawdata[i,])
}

newdata <- c()
for (i in 1:row(filtered_data)) {
  if ((filtered_data$Molecular.Weight[i] - filtered_data$m.z[i]) > 1)
    newdata <- rbind(newdata, filtered_data[i,])
}
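For what it's worth, a single vectorized condition on the absolute difference should do what both loops attempt (a sketch against the column names in the str() output above, not tested on the real file). Incidentally, the second loop runs only once because 1:row(filtered_data) should be 1:nrow(filtered_data); row() returns a matrix and only its first element (1) is used, which is why exactly one row comes back.
# Keep rows where the two columns differ by at most 1.0, in either direction
filtered_data <- rawdata[abs(rawdata$Molecular.Weight - rawdata$m.z) <= 1, ]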
I have created a list of pandas Series, with each series indexed by numbers between 1 and 100, e.g.
Index Value
1 62.99
4 64.39
37 75.225
65 88.12
74 89.89
79 93.30
88 94.30
92 95.83
100 100.00
What I want to do, either while it is a Series or as an array after calling .to_numpy() on it, is to fill it out so that the series has 100 values (1 to 100), with any new entries taking the previous existing value, i.e.
Index Value
1 62.99
2 62.99
3 62.99
4 64.39
5 64.39
6 64.39
...
...
36 64.39
37 75.225
38 75.225
and so on.
I can do this programmatically the long-winded way by iterating through each series and checking for a change in value; my question is, is there a version of Series.repeat() which could do this in one hit, or a numpy function which can 'pad out' my array in this manner with my 100 values?
Thanks in advance for reading, and for any suggestions. This isn't homework; it's a genuine question so please don't attack me if my style of asking isn't as you expect.
What you need to do is forward-fill the values in a series:
This code
import pandas as pd

series = pd.Series([33.2, 36, 39, 55], index=[3, 6, 12, 14], name='series')
indices = range(100)
df = pd.DataFrame(indices)
series = df.join(series).ffill()['series']
produces
0 NaN
1 NaN
2 NaN
3 33.2
4 33.2
...
95 55.0
96 55.0
97 55.0
98 55.0
99 55.0
The first values are NaN because there are no earlier values in the series to fill them from.
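A more direct pandas idiom (my suggestion, not part of the original answer) is reindex with forward fill, which skips the intermediate DataFrame:
import pandas as pd

series = pd.Series([33.2, 36, 39, 55], index=[3, 6, 12, 14], name='series')
# Reindex to the full target range, forward-filling the gaps;
# positions before the first known index remain NaN
full = series.reindex(range(1, 101), method='ffill')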
So here's the solution I went with: an ffill() with fillna(0), joined to range(1, 101). I had to iterate through a larger dataset which needed grouping by ID first, taking the maximum 'Pct' per 'Bucket':
# Maximum Pct per (ID, Bucket)
j = df[['ID', 'Bucket', 'Pct']].groupby(['ID', 'Bucket']).max()

for i in df['ID'].unique():
    # Full 1..100 bucket range to join against
    index = pd.DataFrame(range(1, 101))
    index.columns = ['Bucket']
    # Left-join, forward-fill, then zero-fill the leading gaps
    k = pd.merge(index, j.loc[i], how='left', on='Bucket').ffill().fillna(0)
In:
Bucket Pct
3 0.03
3 0.1
3 0.26
3 0.42
3 0.45
3 0.59
3 0.69
3 0.83
3 0.86
3 0.91
3 0.94
3 0.98
4 1.1
... ...
91 98.89
93 99.08
94 99.17
94 99.26
94 99.43
94 99.48
94 99.63
100 100.0
Out:
Bucket Pct
1 0.00
2 0.00
3 0.98
4 1.83
5 22.83
... ...
91 98.89
92 98.89
93 99.08
94 99.63
95 99.63
96 99.63
97 99.63
98 99.63
99 99.63
100 100.00
Many, many thanks once again to you both!
I currently have a list of 7 dataframes, each of which has 24 columns but not the same number of rows. I would like to convert my list to a 3-dimensional array, but I can't because the components of my list do not all have the same dimensions. I have one dataframe with 60 rows, 4 dataframes with 59 rows, and 2 with 58 rows.
When I try laply(mylist, unlist), I get the following message: Error: Results must have the same dimensions.
Is there any way to put those dataframes into an array? How could I add NAs at the end of the 6 other dataframes in order to bring them to 60 rows?
I'm not sure of the OP's real purpose in creating a 3-D array, which requires all data frames in the list to contain the same number of rows.
But whatever the reason, one can achieve it using lapply. Note that the lengths function doesn't help on a list of data frames: it simply returns the number of columns of each data frame, since a data frame's length is its column count.
Hence the approach is to first find the maximum number of rows over the data frames in mylist, and then iterate over each data frame to extend it to that many rows.
# Find the maximum row count across all data frames in mylist
maxrow <- max(sapply(mylist, nrow))

# Iterate over each data frame and pad its rows up to maxrow
mylist_mod <- lapply(mylist, function(x, nRow) {
  if (nrow(x) < nRow) {
    # Assigning past the current last row extends the data frame,
    # filling the new rows with NA
    x[(nrow(x) + 1):nRow, ] <- NA
  }
  x
}, nRow = maxrow)
mylist_mod
# $df1
# one two three
# 1 101 111 131
# 2 102 112 132
# 3 103 113 133
# 4 NA NA NA
# 5 NA NA NA
#
# $df2
# one two three
# 1 201 211 231
# 2 202 212 232
# 3 NA NA NA
# 4 NA NA NA
# 5 NA NA NA
#
# $df3
# one two three
# 1 301 311 331
# 2 302 312 332
# 3 303 313 333
# 4 304 314 334
# 5 305 315 335
Sample Data:
df1 <- data.frame(one = 101:103, two = 111:113, three = 131:133)
df2 <- data.frame(one = 201:202, two = 211:212, three = 231:232)
df3 <- data.frame(one = 301:305, two = 311:315, three = 331:335)
mylist <- list(df1 = df1, df2 = df2, df3 = df3)
mylist
# $df1
# one two three
# 1 101 111 131
# 2 102 112 132
# 3 103 113 133
#
# $df2
# one two three
# 1 201 211 231
# 2 202 212 232
#
# $df3
# one two three
# 1 301 311 331
# 2 302 312 332
# 3 303 313 333
# 4 304 314 334
# 5 305 315 335
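From here, to get the 3-D array the OP is ultimately after, something like simplify2array should work (my addition, not part of the original answer):
# Stack the now equal-sized data frames into a 5 x 3 x 3 array
arr <- simplify2array(lapply(mylist_mod, as.matrix))
dim(arr)  # [1] 5 3 3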
I would like to create an array or vector of musical note frequencies using a for loop. Every musical note, A, A#, B, C, etc., is a ratio of 2^(1/12) from the previous/next note. E.g. the note A is 440 Hz, and A# is 440 * 2^(1/12) Hz ≈ 466.16 Hz.
Starting from 27.5 Hz (A0), I want a loop that iterates 88 times to create an array of each note's frequency up to 4186 Hz, so that will look like
f= [27.5 29.14 30.87 ... 4186.01]
So far, I've understood this much:
f = [];
for i=1:87,
%what goes here
% f = [27.5 * 2^(i/12)]; ?
end
return;
There is no need for a loop here in MATLAB; you can simply do:
f = 27.5 * 2.^((0:87)/12)
The answer:
f =
Columns 1 through 13
27.5 29.135 30.868 32.703 34.648 36.708 38.891 41.203 43.654 46.249 48.999 51.913 55
Columns 14 through 26
58.27 61.735 65.406 69.296 73.416 77.782 82.407 87.307 92.499 97.999 103.83 110 116.54
Columns 27 through 39
123.47 130.81 138.59 146.83 155.56 164.81 174.61 185 196 207.65 220 233.08 246.94
Columns 40 through 52
261.63 277.18 293.66 311.13 329.63 349.23 369.99 392 415.3 440 466.16 493.88 523.25
Columns 53 through 65
554.37 587.33 622.25 659.26 698.46 739.99 783.99 830.61 880 932.33 987.77 1046.5 1108.7
Columns 66 through 78
1174.7 1244.5 1318.5 1396.9 1480 1568 1661.2 1760 1864.7 1975.5 2093 2217.5 2349.3
Columns 79 through 88
2489 2637 2793.8 2960 3136 3322.4 3520 3729.3 3951.1 4186
maxind = 88;                       % 88 notes in total, A0 up to C8
f = zeros(1, maxind);              % preallocate: better performance and avoids mlint warnings
for ii = 1:maxind
    f(ii) = 27.5 * 2^((ii-1)/12);  % (ii-1) so that f(1) is A0 = 27.5 Hz
end
The reason the loop variable is named ii is that i is the name of a builtin function (the imaginary unit), so it's considered bad practice to use it as a variable name.
Also, in your description you said you want to iterate 88 times, so maxind is set to 88 above; the exponent uses (ii-1)/12 rather than ii/12 so that f(1) is 27.5 Hz itself and f(88) is 4186 Hz, matching the vectorized version.
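As a quick sanity check (my addition), the loop result can be compared against the vectorized construction:
% Maximum absolute difference should be 0, or at worst on the order of rounding error
max(abs(f - 27.5 * 2.^((0:87)/12)))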
I have the following data called gg and yy.
> str(gg)
num [1:1992] 128 130 132 185 186 187 188 189 190 191 ...
> str(yy)
'data.frame': 2103 obs. of 2 variables:
$ grp : num 128 130 132 185 186 187 188 189 190 191 ...
$ predd: num -0.963 -1.518 1.712 -11.286 -8.195 ...
>
You'll notice that the first several values of gg match the first several from yy.
I would like to select rows from yy if the value yy$grp matches any value in gg. The issue is that gg and yy are of unequal length. Further, there are some values of gg that are not present in yy$grp and also some values of yy$grp not present in gg.
I can't seem to get this to work. It is basically an intersection of the two data sets based upon the index value I mentioned (gg, or yy$grp).
I've tried:
inters <- intersect(gg, yy$grp)
yyint <- yy[yy$grp == inters, ]
but I get the following:
Warning message:
In yy$grp == inters :
longer object length is not a multiple of shorter object length
> str(yya)
'data.frame': 28 obs. of 2 variables:
$ grp : num 128 130 132 185 186 187 188 189 190 191 ...
$ predd: num -0.963 -1.518 1.712 -11.286 -8.195 ...
yya should be much longer, according to my plans at least.
Thanks.
As I mentioned, I think this is what you want:
yy[yy$grp %in% gg,]
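To see why the original == attempt misbehaves while %in% does not, here is a toy sketch (made-up vectors, not the OP's data):
gg <- c(128, 130, 132)
yy <- data.frame(grp = c(128, 129, 130, 131, 132), predd = rnorm(5))

# == recycles the shorter vector and compares position by position,
# so it tests alignment rather than membership (hence the warning)
yy$grp == intersect(gg, yy$grp)

# %in% tests membership of each element of yy$grp,
# returning one logical per row of yy
yy[yy$grp %in% gg, ]  # keeps rows 1, 3 and 5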