How to split an array into separate arrays (R)? [duplicate]

This question already has answers here:
Split data.frame based on levels of a factor into new data.frames
(3 answers)
Closed 7 years ago.
I have the array:
>cent
b e r f
A19 60.46 0.77 -0.12 1
A15 16.50 0.53 0.08 2
A17 2.66 0.51 0.20 3
A11 36.66 0.40 -0.25 4
A12 38.96 0.91 0.23 1
A05 0.00 0.29 0.01 2
A09 3.40 0.35 0.03 3
A04 0.00 0.25 -0.03 4
Could someone please tell me how to split this array into 4 separate arrays, using the last column «f» as the flag? As a result I would like to see:
>cent1
b e r f
A19 60.46 0.77 -0.12 1
A12 38.96 0.91 0.23 1
>cent2
b e r f
A15 16.50 0.53 0.08 2
A05 0.00 0.29 0.01 2
….
Should I use a for-loop and check the flag "f", or is there a built-in function? Thanks.

We can use split to create a list of data.frames.
lst <- split(cent, cent$f)
NOTE: Here I assumed that 'cent' is a data.frame. If it is a matrix:
lst <- split(as.data.frame(cent), cent[,"f"])
Usually, the list is enough for most of the analysis. But if we need to create multiple objects in the global environment, we can name the list elements and use list2env (not recommended):
list2env(setNames(lst, paste0("cent", seq_along(lst))), envir = .GlobalEnv)

Related

How do I correctly add a chain ID to my pdb file?

I am trying to conduct some analysis with my single-chain PDB file (766 residues long), but it requires a chain ID. Currently, there isn't one.
Here is a snippet of the pdb file:
ATOM 1 N MET 1 -69.269 78.953 -91.441 1.00 0.00 N
ATOM 2 CA MET 1 -69.264 78.650 -92.891 1.00 0.00 C
ATOM 4 C MET 1 -69.371 79.939 -93.633 1.00 0.00 C
ATOM 5 O MET 1 -68.379 80.649 -93.799 1.00 0.00 O
ATOM 3 CB MET 1 -70.475 77.774 -93.251 1.00 0.00 C
ATOM 6 CG MET 1 -70.505 76.455 -92.477 1.00 0.00 C
ATOM 7 SD MET 1 -69.115 75.332 -92.806 1.00 0.00 S
ATOM 8 CE MET 1 -69.473 74.270 -91.377 1.00 0.00 C
ATOM 9 N ASP 2 -70.583 80.284 -94.111 1.00 0.00 N
ATOM 10 CA ASP 2 -70.688 81.539 -94.789 1.00 0.00 C
ATOM 12 C ASP 2 -70.661 82.602 -93.737 1.00 0.00 C
ATOM 13 O ASP 2 -71.088 82.377 -92.606 1.00 0.00 O
ATOM 11 CB ASP 2 -71.963 81.733 -95.626 1.00 0.00 C
ATOM 14 CG ASP 2 -71.691 82.908 -96.557 1.00 0.00 C
ATOM 15 OD1 ASP 2 -70.569 82.953 -97.130 1.00 0.00 O
ATOM 16 OD2 ASP 2 -72.598 83.768 -96.717 1.00 0.00 O1-
ATOM 17 N HIS 3 -70.129 83.791 -94.077 1.00 0.00 N
ATOM 18 CA HIS 3 -70.045 84.846 -93.110 1.00 0.00 C
ATOM 20 C HIS 3 -71.342 85.581 -93.094 1.00 0.00 C
ATOM 21 O HIS 3 -72.113 85.574 -94.052 1.00 0.00 O
ATOM 19 CB HIS 3 -68.925 85.865 -93.404 1.00 0.00 C
ATOM 23 CG HIS 3 -68.749 86.908 -92.336 1.00 0.00 C
ATOM 25 CD2 HIS 3 -67.998 86.879 -91.200 1.00 0.00 C
ATOM 22 ND1 HIS 3 -69.357 88.144 -92.351 1.00 0.00 N
ATOM 26 CE1 HIS 3 -68.947 88.797 -91.234 1.00 0.00 C
ATOM 24 NE2 HIS 3 -68.121 88.068 -90.504 1.00 0.00 N
What's the best way for me to label the chain as chain A?
Here's the answer.
We need to read the file line by line and put a chain ID into column 22 of each line that begins with ATOM. Assuming the file is called myfile.pdb, the goal is to replace the blank in column 22 (the character that follows ATOM and the next 17 characters) with the letter A. This can be accomplished with a relatively simple sed command:
sed 's/^\(ATOM.\{17\}\) /\1A/' myfile.pdb > newfile.pdb
Hope this is helpful!
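For anyone who would rather not use sed, the same column-22 replacement can be sketched in Python. This is only a minimal sketch: the function names add_chain_id and relabel_file are made up for illustration, and it assumes the file keeps the fixed-width PDB layout (the snippet in the question appears here with its spacing collapsed).

```python
def add_chain_id(lines, chain="A"):
    """Return a copy of `lines` with `chain` written into column 22 of ATOM records.

    Hypothetical helper; assumes fixed-width PDB records.
    """
    out = []
    for line in lines:
        # Column 22 is string index 21. Like the sed command, only touch
        # ATOM records whose chain field is currently a blank.
        if line.startswith("ATOM") and len(line) > 21 and line[21] == " ":
            line = line[:21] + chain + line[22:]
        out.append(line)
    return out


def relabel_file(src, dst, chain="A"):
    """Read `src`, add the chain ID, and write the result to `dst`."""
    with open(src) as fh:
        lines = fh.readlines()
    with open(dst, "w") as fh:
        fh.writelines(add_chain_id(lines, chain))
```

Calling relabel_file("myfile.pdb", "newfile.pdb") would mirror the sed one-liner. Lines that already carry a chain ID, and any non-ATOM records, are left untouched.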

How to modify data of one column based on data present in another column of same row using pandas

I have 5x6 arrays.
The data are like below
indx sv-01 sv-02 status-1 status-2 valu-1 valu-2
0 8 16 B B 0.1 -0.02
1 8 16 B A 0.03 0.210
2 8 16 A B 0.23 0.34
3 8 16 B B 0.29 0.67
4 8 16 A A 0.23 0.67
.. .. .. .. .. ... ...
My aim is to iterate so that, for either SV (8 or 16), if the status column is
A, its corresponding value is set to 0 (I need this for further calculation). I have some homegrown approaches but could
not get the desired result. How can I achieve this with a minimum of for loops and
if conditions? Is it possible with a pandas DataFrame? Also, whenever the status
is A, that SV should not be counted, so the total SV count would be based on B only.
At this point I am confused by too many if and else conditions.
Thanks
Use where on each value column, i.e.
df['valu-1'] = df['valu-1'].where(df['status-1']!='A',0)
df['valu-2'] = df['valu-2'].where(df['status-2']!='A',0)
Output :
indx sv-01 sv-02 status-1 status-2 valu-1 valu-2
0 0 8 16 B B 0.10 -0.02
1 1 8 16 B A 0.03 0.00
2 2 8 16 A B 0.00 0.34
3 3 8 16 B B 0.29 0.67
4 4 8 16 A A 0.00 0.00
To select the rows of df where sv-01 is 8 and sv-02 is 16, you can use boolean indexing like
ndf = df[(df['sv-01']==8) & (df['sv-02']==16)]
Then use ndf.where for replacement
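The two where calls can be folded into a loop over the column pairs, and the "count only B" part of the question can be handled by comparing the status columns to "B" and summing per row. The following is a rough sketch using the sample rows from the question. The column name n_sv is made up for illustration, and treating "total sv" as the per-row number of B statuses is an assumption about what the asker wants.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sv-01": [8, 8, 8, 8, 8],
    "sv-02": [16, 16, 16, 16, 16],
    "status-1": ["B", "B", "A", "B", "A"],
    "status-2": ["B", "A", "B", "B", "A"],
    "valu-1": [0.1, 0.03, 0.23, 0.29, 0.23],
    "valu-2": [-0.02, 0.210, 0.34, 0.67, 0.67],
})

# Keep each value where its status is not "A"; otherwise replace it with 0
for i in (1, 2):
    df[f"valu-{i}"] = df[f"valu-{i}"].where(df[f"status-{i}"] != "A", 0)

# Assumed interpretation: per-row count of SVs whose status is "B"
# ("n_sv" is a hypothetical column name)
df["n_sv"] = (df[["status-1", "status-2"]] == "B").sum(axis=1)
```

With more SV columns, only the tuple in the loop and the list of status columns would need to grow.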

Spotfire Line chart with min max bars

I am trying to make a chart that has a line graph showing the change in value in the Count column for each month, and then two points showing the min and max value in that month. The table is below.
Date Min Max Count
1/1/2015 0.28 6.02 13
2/1/2015 0.2 7.72 8
3/1/2015 1 1 1
4/1/2015 0.4 6.87 7
5/1/2015 0.36 3.05 8
6/1/2015 0.17 1.26 13
7/1/2015 0.31 1.59 15
8/1/2015 0.39 3.35 13
9/1/2015 0.22 0.86 10
10/1/2015 0.3 2.48 13
11/1/2015 0.16 0.82 9
12/1/2015 0.33 2.18 5
1/1/2016 0.23 1.16 14
2/1/2016 0.38 1.74 7
3/1/2016 0.1 8.87 9
4/1/2016 0.28 0.68 3
5/1/2016 0.13 3.23 11
6/1/2016 0.33 1 5
7/1/2016 0.28 1.26 4
8/1/2016 0.08 0.41 2
9/1/2016 0.43 0.61 2
10/1/2016 0.49 1.39 4
11/1/2016 0.89 0.89 1
I tried doing a scatter plot but when I try to Add a Line from Column value I get an error saying that the line cannot work on categorical data.
Any suggestions on how I can prepare this visualization?
Thanks!
I would do this in a combination chart.
Insert a combination chart (Line & Bar Graph)
On your X-Axis put your date as <BinByDateTime([Date],"Year.Month",1)>
On your Y-Axis put your aggregations: Sum([Count]), Max([Max]), Min([Min])
Right click > Properties > Series > set the Min and Max to Line Type
(Optional) Change the Y-Axis scale

SAS: Using a Loop for Creating Many Data Sets and renaming the variables in them

I have a dataset in a long format as e.g.:
time subject var1 var2 var3
1 1 0.41 0.48 0.85
2 1 0.58 0.38 0.15
3 1 0.08 0.39 0.96
4 1 0.58 0.87 0.15
5 1 0.55 0.40 0.67
1 2 0.76 0.49 0.03
2 2 0.36 0.26 0.93
3 2 0.83 0.88 0.63
4 2 0.19 0.65 0.99
5 2 0.89 0.91 0.47
I would like to get a dataset in a wide format as
time var1_sub1 var2_sub1 var3_sub1 var1_sub2 var2_sub2 var3_sub2
1 0.41 0.48 0.85 0.76 0.49 0.03
2 0.58 0.38 0.15 0.36 0.26 0.93
3 0.08 0.39 0.96 0.83 0.88 0.63
4 0.58 0.87 0.15 0.19 0.65 0.99
5 0.55 0.40 0.67 0.89 0.91 0.47
So far, I came up with an idea to do it in the following way:
data data_sub1;
set data;
if subject=1;
var1_sub1=var1;
var2_sub1=var2;
var3_sub1=var3;
run;
data data_sub2;
set data;
if subject=2;
var1_sub2=var1;
var2_sub2=var2;
var3_sub2=var3;
run;
proc sort data=data_sub1;
by time;
run;
proc sort data=data_sub2;
by time;
run;
data datamerged;
merge data_sub1 data_sub2;
by time;
run;
It works and everything is fine, but I would like to learn how to code it more elegantly, since in practice I have many more subjects and variables.
This is a PROC TRANSPOSE problem. To solve most PROC TRANSPOSE problems, make it totally vertical (one value-one variable name per row) and then transpose using the ID statement.
data have;
input time subject var1 var2 var3;
datalines;
1 1 0.41 0.48 0.85
2 1 0.58 0.38 0.15
3 1 0.08 0.39 0.96
4 1 0.58 0.87 0.15
5 1 0.55 0.40 0.67
1 2 0.76 0.49 0.03
2 2 0.36 0.26 0.93
3 2 0.83 0.88 0.63
4 2 0.19 0.65 0.99
5 2 0.89 0.91 0.47
;;;;
run;
data have_vert;
set have;
array vars var:;
do _t = 1 to dim(vars);
id=cats(vname(vars[_t]),'_','sub',subject); *this is our future variable name;
value = vars[_t]; *this is our future variable value;
output;
end;
keep time id value subject;
run;
proc sort data=have_vert;
by time subject id;
run;
proc transpose data=have_vert out=want;
by time;
var value;
id id;
run;

R extension in C, setting matrix row/column names

I'm writing an R package that manipulates Matrices in C. Currently, the matrices returned to R have numbers for the row/column names. I would rather assign my own row/column names when modifying the object in C.
I've googled around for about an hour, but haven't found a good solution yet. The closest I've found is dimnames, but I want to name each column, not just the two dimensions. The matrices get larger than 4x4, below is just a small example of what I want to do.
The number of rows is 4^x where X is the length of the row name
Current
[,1] [,2] [,3] [,4]
[1,] 0.20 0.00 0.00 0.80
[2,] 0.25 0.25 0.25 0.25
[3,] 0.25 0.25 0.25 0.25
[4,] 1.00 0.00 0.00 0.00
[5,] 0.20 0.00 0.00 0.80
[6,] 0.25 0.25 0.25 0.25
[7,] 0.25 0.25 0.25 0.25
[8,] 1.00 0.00 0.00 0.00
[9,] 0.20 0.00 0.00 0.80
[10,] 0.25 0.25 0.25 0.25
[11,] 0.25 0.25 0.25 0.25
[12,] 1.00 0.00 0.00 0.00
[13,] 0.20 0.00 0.00 0.80
[14,] 0.25 0.25 0.25 0.25
[15,] 0.25 0.25 0.25 0.25
[16,] 1.00 0.00 0.00 0.00
Desired
[A] [C] [G] [T]
[AA] 0.20 0.00 0.00 0.80
[AC] 0.25 0.25 0.25 0.25
[AG] 0.25 0.25 0.25 0.25
[AT] 1.00 0.00 0.00 0.00
[CA] 0.20 0.00 0.00 0.80
[CC] 0.25 0.25 0.25 0.25
[CG] 0.25 0.25 0.25 0.25
[CT] 1.00 0.00 0.00 0.00
[GA] 0.20 0.00 0.00 0.80
[GC] 0.25 0.25 0.25 0.25
[GG] 0.25 0.25 0.25 0.25
[GT] 1.00 0.00 0.00 0.00
[TA] 0.20 0.00 0.00 0.80
[TC] 0.25 0.25 0.25 0.25
[TG] 0.25 0.25 0.25 0.25
[TT] 1.00 0.00 0.00 0.00
If you are open to C++ instead of C, then Rcpp can make this a little easier. We just create a list object with rows and column names as we would in R, and assign that to the dimnames attribute of the matrix object:
R> library(inline) # to compile, link, load the code here
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ // C++0x can assign a set of values to a vector, but we use older standard
+ Rcpp::CharacterVector rows(2); rows[0] = "aa"; rows[1] = "bb";
+ Rcpp::CharacterVector cols(2); cols[0] = "AA"; cols[1] = "BB";
+ // now create an object "dimnms" as a list with rows and cols
+ Rcpp::List dimnms = Rcpp::List::create(rows, cols);
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
AA BB
aa 42 42
bb 42 42
R>
The actual assignment of the column and row names is so manual ... because the current C++ standard does not allow direct assignment of vectors at initialization, but that will change.
Edit: I just realized that I can of course use static create() method on the row and colnames too, which makes this a little easier and shorter still
R> src <- '
+ Rcpp::NumericMatrix x(2,2);
+ x.fill(42); // or more interesting values
+ Rcpp::List dimnms = // two vec. with static names
+ Rcpp::List::create(Rcpp::CharacterVector::create("cc", "dd"),
+ Rcpp::CharacterVector::create("ee", "ff"));
+ // and assign it
+ x.attr("dimnames") = dimnms;
+ return(x);
+ '
R> fun <- cxxfunction(signature(), body=src, plugin="Rcpp")
R> fun()
ee ff
cc 42 42
dd 42 42
R>
So we are down to three or four statements, no monkeying with PROTECT / UNPROTECT and no memory management.
As Jim said, this is much easier to do in R. I'm passing the names into the C function via the nam argument.
#include <Rinternals.h>
SEXP myMat(SEXP nam) {
/*PrintValue(nam);*/
SEXP ans, dimnames;
PROTECT(ans = allocMatrix(REALSXP, length(nam), length(nam)));
PROTECT(dimnames = allocVector(VECSXP, 2));
SET_VECTOR_ELT(dimnames, 0, nam);
SET_VECTOR_ELT(dimnames, 1, nam);
setAttrib(ans, R_DimNamesSymbol, dimnames);
UNPROTECT(2);
return(ans);
}
If you put that code in a file called myMat.c, you can test it via the line below. I'm using Ubuntu, so you will have to change myMat.so to myMat.dll if you're on Windows.
R CMD SHLIB myMat.c
Rscript -e 'dyn.load("myMat.so"); .Call("myMat", c("A","C","G","T"))'
The note above is instructive. The dimnames attribute is a list with one element per dimension of the object, where each element is a character vector with one name per entry along that dimension, i.e., list(c('a','c','g','t'), c('a','c','g','t')).
To set that in C, I would recommend:
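PROTECT(dimnames = allocVector(VECSXP, 2));
PROTECT(rownames = allocVector(STRSXP, 4));
PROTECT(colnames = allocVector(STRSXP, 4));
/* fill in each name, e.g. SET_STRING_ELT(rownames, 0, mkChar("A")); */
SET_VECTOR_ELT(dimnames, 0, rownames);
SET_VECTOR_ELT(dimnames, 1, colnames);
setAttrib( ? , R_DimNamesSymbol, dimnames);
UNPROTECT(3);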
You'll have to then set the relevant rowname and colname elements. In general, this stuff is much easier to do in R.
jim
