Deleting rows where the difference between column 1 and column 2 have a greater difference than 1 - loops

Some of the the values in columns Molecular.Weight and m.z are quite similar, often differing only by 1.0 or less. But there are some instances where its greater than 1.0. I would like to generate a new dataset that only includes the rows with a difference less than or equal to 1.0. However, it can be either column that has the higher number, so I am struggling to make an equation that works.
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I had to do it in 2 parts because I couldn't figure out how to combine them but its still not giving me the right result.
The first section is supposed to filter out the values where Molecular.Weight might be greater than m.z by 1, and the second then filters out when m.z might be greater than Molecular.Weight. The first part seems to work and gives me a new dataset with around half the number of rows, but then when I do the second part on it, it gives me 1 row (and its not even correct because that one compound does fall within the 1.0 difference). Any help is super appreciated, thanks!
rawdata <- read.csv("Analysis negative + positive minus QC.csv")
filtered_data <-c()
for (i in 1:nrow(rawdata)) {
if (rawdata$m.z[i]-rawdata$Molecular.Weight[i]<1)
filtered_data <- rbind(filtered_data, rawdata[i,])
}
newdata <- c()
for (i in 1:row(filtered_data)) {
if ((filtered_data$Molecular.Weight[i] - filtered_data$m.z[i])>1)
newdata <- rbind(newdata, filtered_data[i,])
}

Related

Repeat certain pandas series values, so that it has an entry for all index values between 1 and 100

I have created a list of pandas series, with each series indexed by numbers between 1 and 100 eg
Index Value
1 62.99
4 64.39
37 75.225
65 88.12
74 89.89
79 93.30
88 94.30
92 95.83
100 100.00
What I want to do, either while it is a Series, or as an array after calling .to_numpy() on it, is to fill it out so that my series has 100 values (1 to 100), with any new entries having the previous existing value ie
Index Value
1 62.99
2 62.99
3 62.99
4 64.39
5 64.39
6 64.39
...
...
36 64.39
37 75.225
38 75.225
and so on.
I can do this programmatically the long-winded way by iterating through each series and checking for a change in value; my question is, is there a version of Series.repeat() which could do this in one hit, or a numpy function which can 'pad out' my array in this manner with my 100 values?
Thanks in advance for reading, and for any suggestions. This isn't homework; it's a genuine question so please don't attack me if my style of asking isn't as you expect.
What you need yo do is to frontfill the values in a series:
This code
series = pd.Series([33.2, 36, 39, 55], index=[3, 6, 12, 14], name='series')
indices = range(100)
df = pd.DataFrame(indices)
series = df.join(series).ffill()['series']
produces
0 NaN
1 NaN
2 NaN
3 33.2
4 33.2
...
95 55.0
96 55.0
97 55.0
98 55.0
99 55.0
First values ar NaN because there are no values to fill them in the series
So here's the solution I went with - an ffill() with fillna(0), joining to range(1,101). I had to iterate through a larger dataset which needed grouping by ID first / taking the maximum 'Pct' per 'Bucket' :-
j=df[['ID','Bucket','Pct']].groupby(['ID','Bucket']).max()
for i in df['ID'].unique():
index=pd.DataFrame(range(1,101))
index.columns=['Bucket']
k=pd.merge(index,j.loc[i],how='left',on='Bucket').ffill().fillna(0)
In:
Bucket Pct
3 0.03
3 0.1
3 0.26
3 0.42
3 0.45
3 0.59
3 0.69
3 0.83
3 0.86
3 0.91
3 0.94
3 0.98
4 1.1
... ...
91 98.89
93 99.08
94 99.17
94 99.26
94 99.43
94 99.48
94 99.63
100 100.0
Out:
Bucket Pct
1 0.00
2 0.00
3 0.98
4 1.83
5 22.83
... ...
91 98.89
92 98.89
93 99.08
94 99.63
95 99.63
96 99.63
97 99.63
98 99.63
99 99.63
100 100.00
Many, many thanks once again to you both!

Using apply in building a tensor of features from a matrix

Consider the following code:
EmbedFeatures <- function(x,w) {
c_rev <- seq(from=w,to=1,by=-1)
em <- embed(x,w)
em <- em[,c_rev]
return(em)
}
m=matrix(1:1400,100,14)
X.tr<-c()
F<-dim(m)[2]
W=16
for(i in 1:F){ X.tr<-abind(list(X.tr,EmbedFeatures(m[,i],W)),along=3)}
this builds an array of features, each row has W=16 timesteps.
The dimensions are:
> dim(X.tr)
[1] 85 16 14
The following are the first samples:
> X.tr[1,,1]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
> X.tr[1,,2]
[1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
> X.tr[1,,3]
[1] 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
I would like to use apply to build this array, but the following code does not work:
X.tr <- apply(m,2,EmbedFeatures, w=W)
since it gives me the following dimensions:
> dim(X.tr)
[1] 1360 14
How can I do it?
Firstly, thanks for providing a great reproducible example!
Now, as far as I know, you can't do this with apply. You can, however, do it with a combination of plyr::aaply, which allows you to return multidimensional arrays, and base::aperm, which allows you to transpose multidimensional arrays.
See here for aaply function details and here for aperm function details.
After running your code above, you can do:
library(plyr)
Y.tr <- plyr::aaply(m, 2, EmbedFeatures, w=W)
Z.tr <- aperm(Y.tr, c(2,3,1))
dim(Y.tr)
[1] 14 85 16
dim(Z.tr)
[1] 85 16 14
I turned those two lines of code into a function.
using_aaply <- function(m = m) {
Y.tr <- aaply(m, 2, EmbedFeatures, w=W)
Z.tr <- aperm(Y.tr, c(2,3,1))
return(Z.tr)
}
Then I did some microbenchmarking.
library(microbenchmark)
microbenchmark(for(i in 1:F){ X.tr<-abind(list(X.tr,EmbedFeatures(m[,i],W)),along=3)}, times=100)
Unit: milliseconds
expr
for (i in 1:F) { X.tr <- abind(list(X.tr, EmbedFeatures(m[, i], W)), along = 3) }
min lq mean median uq max neval
405.0095 574.9824 706.0845 684.8531 802.4413 1189.845 100
microbenchmark(using_aaply(m=m), times=100)
Unit: milliseconds
expr min lq mean median uq max
using_aaply(m = m) 4.873627 5.670474 7.797129 7.083925 9.674041 19.74449
neval
100
It seems like it's loads faster using aaply and aperm compared to abind in a for-loop.

Awk multiply column by a specific value

I'm quite new to awk, that I am using more and more to process the output files from a model I am running. Right now, I am stuck with a multiplication issue.
I would like to calculate relative change in percentage.
Example:
A B
1 150 0
2 210 10
3 380 1000
...
I would like to calculate Ax = (Ax-A1)/A1 * 100.
Output:
New_A B
1 0 0
2 10 40
3 1000 153.33
...
I can multiply columns together but don't know how to fix a value to a position in the text file (ie. Row 1 Column 1).
Thank you.
Assuming your actual file does not have the "A B" header and the row numbers in it:
$ cat file
150 0
210 10
380 1000
$ awk 'NR==1 {a1=$1} {printf "%s %.1f\n", $2, ($1-a1)/a1*100}' file | column -t
0 0.0
10 40.0
1000 153.3

combine row and columns in matlab for series

everyone
I have problem when make iteration for two variable but, it's combine just in one vector or array
At first i write my input for iteration w(0) as w
w=[1 50];
for number 1, I use array
e=0:1:(n-1);
f=0:2:(2*n-2); %for 50 in column 2.
I try to use this code
w=[1 50];
ww=kron(ones((n),1),w)
e=0:1:(n-1);
f=0:2:(2*n-2);
r=[e',f']
x=ww+r
and the output is
ww =
1 50
1 50
1 50
1 50
1 50
1 50
r =
0 0
1 2
2 4
3 6
4 8
5 10
x =
1 50
2 52
3 54
4 56
5 58
6 60
I want x is output just in one array, in example
x =
1
50
2
52
3
54
4
56
5
58
6
60
where w=[1 50] can be use difference addition for iteration
Apply this to your x matrix:
x = reshape(x.',[],1);
See reshape doc for details.
Here is a simple method to create your vector from scratch:
x = [1:6;50:2:60];
x(:)
Or with your variables:
x = [e; f];
x(:)

How can I create an array of ratios inside a for loop in MATLAB?

I would like to create an array or vector of musical notes using a for loop. Every musical note, A, A#, B, C...etc is a 2^(1/12) ratio of the previous/next. E.G the note A is 440Hz, and A# is 440 * 2^(1/12) Hz = 446.16Hz.
Starting from 27.5Hz (A0), I want a loop that iterates 88 times to create an array of each notes frequency up to 4186Hz, so that will look like
f= [27.5 29.14 30.87 ... 4186.01]
So far, I've understood this much:
f = [];
for i=1:87,
%what goes here
% f = [27.5 * 2^(i/12)]; ?
end
return;
There is no need to do a loop for this in matlab, you can simply do:
f = 27.5 * 2.^((0:87)/12)
The answer:
f =
Columns 1 through 13
27.5 29.135 30.868 32.703 34.648 36.708 38.891 41.203 43.654 46.249 48.999 51.913 55
Columns 14 through 26
58.27 61.735 65.406 69.296 73.416 77.782 82.407 87.307 92.499 97.999 103.83 110 116.54
Columns 27 through 39
123.47 130.81 138.59 146.83 155.56 164.81 174.61 185 196 207.65 220 233.08 246.94
Columns 40 through 52
261.63 277.18 293.66 311.13 329.63 349.23 369.99 392 415.3 440 466.16 493.88 523.25
Columns 53 through 65
554.37 587.33 622.25 659.26 698.46 739.99 783.99 830.61 880 932.33 987.77 1046.5 1108.7
Columns 66 through 78
1174.7 1244.5 1318.5 1396.9 1480 1568 1661.2 1760 1864.7 1975.5 2093 2217.5 2349.3
Columns 79 through 88
2489 2637 2793.8 2960 3136 3322.4 3520 3729.3 3951.1 4186
maxind = 87;
f = zeros(1, maxind); % preallocate, better performance and avoids mlint warnings
for ii=1:maxind
f(ii) = 27.5 * 2^(ii/12);
end
The reason I named the loop variable ii is because i is the name of a builtin function. So it's considered bad practice to use that as a variable name.
Also, in your description you said you want to iterate 88 times, but the above loop only iterates 1 through 87 (both inclusive). If you want to iterate 88 times change maxind to 88.

Resources