Array row calculations - arrays

I have the following table:
DATA:
Lines <- " ID MeasureX MeasureY x1 x2 x3 x4 x5
1 1 1 1 1 1 1 1
2 1 1 0 1 1 1 1
3 1 1 1 2 3 3 3"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
What I would like to achieve is:
Create 5 columns (r1-r5),
each being the corresponding column x1-x5 divided by MeasureX (for example x1/MeasureX, x2/MeasureX, etc.).
Create 5 columns (p1-p5),
each being the corresponding column x1-x5 divided by its position 1-5 among the x columns (for example x1/1, x2/2, etc.).
MeasureY is irrelevant for now. The end product would be the ID and the columns r1-r5 and p1-p5. Is this feasible?
In SAS I would go with something like this:
data test6;
set test5;
array x {5} x1- x5;
array r{5} r1 - r5;
array p{5} p1 - p5;
do i=1 to 5;
r{i} = x{i}/MeasureX;
p{i} = x{i}/(i);
end;
The reason is to keep this dynamic, because the number of columns could change in the future.

Argument recycling allows you to do element-wise division with a constant vector. The tricky part was extracting the digits from the column names. I then repeated each of the digits by the number of rows to do the second division task.
DF[ ,paste0("r", 1:5)] <- DF[ , grep("x", names(DF) )]/ DF$MeasureX
DF[ ,paste0("p", 1:5)] <- DF[ , grep("x", names(DF) )]/ # element-wise division
rep( as.numeric( sub("\\D","",names(DF)[ # remove non-digits
grep("x", names(DF))] #returns only 'x'-cols
) ), each=nrow(DF) ) # make them as long as needed
#-------------
> DF
  ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1  p2        p3   p4  p5
1  1        1        1  1  1  1  1  1  1  1  1  1  1  1 0.5 0.3333333 0.25 0.2
2  2        1        1  0  1  1  1  1  0  1  1  1  1  0 0.5 0.3333333 0.25 0.2
3  3        1        1  1  2  3  3  3  1  2  3  3  3  1 1.0 1.0000000 0.75 0.6
This could be greatly simplified if you already know the sequence vector for the second division task would be 1-5, but this was designed to allow "gaps" in the sequence for column names and still use the digit information in the names as the divisor. (You were not entirely clear about what situations this code would be used in.) The construct of r{1-5} in SAS is mimicked by [ , paste0('r', 1:5)]. SAS is a macro language and sometimes experienced users have trouble figuring out how to make R behave like one. Generally it takes a while to lose the for-loop mentality and begin using R as a functional language.
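For readers who think in loops first, here is a minimal plain-Python sketch of the same idea: find the x columns by pattern, take the divisor from the digits in each name, and build the r and p columns from that. The column-wise dict `table` and the helper name `x_cols` are assumptions made for illustration; they are not part of the R answer.

```python
import re

# Toy copy of the question's table, stored column-wise (names as in the question).
table = {
    "ID": [1, 2, 3],
    "MeasureX": [1, 1, 1],
    "x1": [1, 0, 1],
    "x2": [1, 1, 2],
    "x3": [1, 1, 3],
    "x4": [1, 1, 3],
    "x5": [1, 1, 3],
}

# Find the x columns by pattern, so adding x6, x7, ... needs no code change.
x_cols = [name for name in table if re.fullmatch(r"x\d+", name)]

for name in x_cols:
    k = int(name[1:])  # divisor taken from the digits in the column name
    # r<k>: element-wise division by MeasureX
    table[f"r{k}"] = [x / m for x, m in zip(table[name], table["MeasureX"])]
    # p<k>: division by the number encoded in the column name
    table[f"p{k}"] = [x / k for x in table[name]]

print(table["r2"], table["p4"])
```

Because the divisor comes from the name rather than the position, a gap in the sequence (say x1, x2, x5) would still divide x5 by 5, which is the behaviour the R code above was designed for.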

An alternative with the data.table package:
library(data.table)
cols <- names(DF)[4:8]   # x1-x5
# by = ID processes one row at a time, which assumes ID is unique per row
setDT(DF)[, (paste0("r", 1:5)) := .SD / MeasureX, by = ID, .SDcols = cols
][, (paste0("p", 1:5)) := .SD / 1:5, by = ID, .SDcols = cols]
which results in:
> DF
   ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1  p2        p3   p4  p5
1:  1        1        1  1  1  1  1  1  1  1  1  1  1  1 0.5 0.3333333 0.25 0.2
2:  2        1        1  0  1  1  1  1  0  1  1  1  1  0 0.5 0.3333333 0.25 0.2
3:  3        1        1  1  2  3  3  3  1  2  3  3  3  1 1.0 1.0000000 0.75 0.6

You could put together a nifty loop or apply to do this, but here it is explicitly:
# Handling the "r" columns.
DF$r1 <- DF$x1 / DF$MeasureX
DF$r2 <- DF$x2 / DF$MeasureX
DF$r3 <- DF$x3 / DF$MeasureX
DF$r4 <- DF$x4 / DF$MeasureX
DF$r5 <- DF$x5 / DF$MeasureX
# Handling the "p" columns.
DF$p1 <- DF$x1 / 1
DF$p2 <- DF$x2 / 2
DF$p3 <- DF$x3 / 3
DF$p4 <- DF$x4 / 4
DF$p5 <- DF$x5 / 5
# Taking only the columns we want.
FinalDF <- DF[, c("ID", "r1", "r2", "r3", "r4", "r5", "p1", "p2", "p3", "p4", "p5")]
Just noting that this is pretty straightforward matrix manipulation that you definitely could have found elsewhere. Perhaps you're new to R, but still put a little more effort in next time. If you are new to R, it's definitely worth the time to look up some basic R coding tutorial or video.

Related

How to make Groupby dataframe using list?

I have xyz dataframe like below.
x y z
1 2 1
1 2 2
3 3 1
3 1 2
4 1 2
'''''
9 3 4
and I have to make dataframes by x.
df1(x=1)
x y z
1 2 3
1 3 3
df2(x=2)
x y z
2 3 3
2 4 5
dfx(x=n)
x y z
n y z
- - -
I know pandas df.groupby("x") groups the dataframe by "x",
but there are so many "x" values in my data that I can't define them all.
Is there any function that makes dataframes from a list, like groupby(list)?
Thanks in advance.
In your case, save the groups into a dict keyed by the value of x:
d = {x : y for x , y in df.groupby('x')}
d[1]
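A self-contained version of the same idea, as a sketch. The frame below is a small stand-in with illustrative values (not the asker's real data), and the name `groups` is an assumption; it needs pandas installed.

```python
import pandas as pd

# Small stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({
    "x": [1, 1, 3, 3, 4],
    "y": [2, 2, 3, 1, 1],
    "z": [1, 2, 1, 2, 2],
})

# One sub-frame per distinct x, keyed by the x value.
# The distinct values are discovered by groupby, so none has to be listed by hand.
groups = {key: sub for key, sub in df.groupby("x")}

print(sorted(groups))  # the x values found in the data
print(groups[3])       # the rows where x == 3
```

Each value in the dict is an ordinary DataFrame, so `groups[3]` can be used wherever a hand-made `df3` would have been.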

Relate indices where two dataframes are equal with elements in another array

I have an array cluster_true and a dataframe data containing in each row a 2D coordinate. I want to save in another dataframe information regarding how many times for a given 2D coordinate each element in cluster_true appeared. So, for instance, for the coordinate (1,1), I want to check all the rows in data whose first two columns have the value of 1, and then check the values of cluster_true at those indices. Here is an example to make it clearer (it gives the desired result):
# Example variables
cluster_true = c(1,2,1,1,2,2,1,2,2,2,2,1,1)
x = 3
y = 3
data = data.frame(X = c(1,1,0,0,2,1,1,0,0,0,1,1,1),
Y = c(1,1,2,1,2,2,1,0,0,0,0,2,0))
# Names of the columns
plot_colnames = c('X', 'Y', paste('cluster',unique(cluster_true),sep='_'))
# Empty dataframe with the right column names
plot_df = data.frame(matrix(vector(), x*y, length(plot_colnames),
dimnames=list(c(), plot_colnames)),
stringsAsFactors=F)
# Each row belongs to a certain 2D coordinate
plot_df$X = rep(1:x, y)-1
plot_df$Y = rep(1:x, each = y)-1
# This is what I don't know how to improve
for(i in 1:nrow(plot_df)){
idx = which(apply(data[,1:2], 1, function(x) all(x == plot_df[i,1:2])))
plot_df[i,3] = sum(cluster_true[idx] == 1)
plot_df[i,4] = sum(cluster_true[idx] == 2)
}
print(plot_df)
Things I need to change and I don't know how to:
I think the loop could be avoided in order to get a more elegant solution, but I don't know how. The dataframe data could have a very large amount of rows, so efficient code would be awesome.
Inside the loop, I've hardcoded the clusters to check (the last two lines inside the loop assume that I know which numbers are present in cluster_true and to which column of plot_df they correspond to). In fact, the elements in cluster_true could be anything, even non-consecutive numbers (i.e. cluster_true = c(1,5,5,5,56,10,19,10)).
So basically, I want to know if this could be done without the loop and as generic as possible.
If I understand correctly, the OP wants to
find the row indices for all unique combinations of X, Y coordinates in data,
look up the value in the corresponding rows of cluster_true,
count the number of occurrences of each value for the given X, Y combination, and
print the results in wide format.
This can be solved by joining and reshaping:
library(data.table) # version 1.11.4 used
library(magrittr) # use piping to improve readability
# unique coordinate pairs
uni_coords <- unique(setDT(data)[, .(X, Y)])[order(X, Y)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length)
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 2 0 1
5: 3 1 1 0
6: 3 2 1 0
7: 3 3 0 3
This is identical to OP's expected result, except that coordinate combinations which do not appear in data are missing:
setDT(plot_df)[order(X, Y)]
X Y cluster_1 cluster_2
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3
Reshaping has the benefit that it can handle arbitrary values in cluster_true as requested by the OP.
Edit
The OP has requested that all possible combinations of X, Y coordinates should be included in the final result. This can be achieved by using a cross join CJ() to compute uni_coords:
# all possible coordinate pairs
uni_coords <- setDT(data)[, CJ(X = X, Y = Y, unique = TRUE)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI][
uni_coords, on = .(X, Y)] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length) %>%
# remove NA column from reshaped result
.[, cluster_NA := NULL] %>%
print()
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3
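The counting step itself does not depend on data.table. As a rough illustration only, here is the same logic in plain Python, using the question's data exactly as posted (coordinates 0 to 2). The names `counts`, `clusters` and `wide` are assumptions for this sketch.

```python
from collections import Counter
from itertools import product

# Data copied from the question.
cluster_true = [1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1]
X = [1, 1, 0, 0, 2, 1, 1, 0, 0, 0, 1, 1, 1]
Y = [1, 1, 2, 1, 2, 2, 1, 0, 0, 0, 0, 2, 0]

# One pass over the rows: count each (x, y, cluster) triple.
counts = Counter(zip(X, Y, cluster_true))

# Cluster labels are taken from the data, so arbitrary labels work.
clusters = sorted(set(cluster_true))

# All possible coordinate pairs, similar in spirit to the cross join above.
coords = product(sorted(set(X)), sorted(set(Y)))

# Wide table: one entry per coordinate, one count per cluster label.
# A Counter returns 0 for triples that never occurred.
wide = {(x, y): [counts[(x, y, c)] for c in clusters] for x, y in coords}

for (x, y), row in wide.items():
    print(x, y, *row)
```

No loop over coordinates scans the data again, so the work is a single pass over the rows plus one lookup per output cell.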

How can I create two other matrices from a single mx3 matrix?

I have an mx3 matrix A containing both integer and non-integers.
A = [1.5 1 1
1 1.5 1
2 1.5 1
1.5 2 1
1 1 1.5
2 1 1.5
1 2 1.5
2 2 1.5
1.5 1 2
1 1.5 2
2 1.5 2
1.5 2 2];
What I want is to create two new matrices A1 and A2 by scanning through each row of A, such that:
A1 = subtract 0.5 from any non-integer found in any column, and leave the integers as they are.
A2 = add 0.5 to any non-integer found in any column, and leave the integers as they are.
I would expect my final arrays to be:
A1 = [1 1 1
1 1 1
2 1 1
1 2 1
1 1 1
2 1 1
1 2 1
2 2 1
1 1 2
1 1 2
2 1 2
1 2 2];
A2 = [2 1 1
1 2 1
2 2 1
2 2 1
1 1 2
2 1 2
1 2 2
2 2 2
2 1 2
1 2 2
2 2 2
2 2 2];
If your non-integer numbers are all of the form x.5, you can simply use floor and ceil:
A1 = floor(A);
A2 = ceil(A);
If that is not the case, use logical indexing:
A1 = A;
A1(round(A1) ~= A1) = A1(round(A1) ~= A1) - 0.5;
A2 = A;
A2(round(A2) ~= A2) = A2(round(A2) ~= A2) + 0.5;
You can also make a condition, and depending on how you satisfy that condition either add or subtract 0.5:
cond = (rem(A,1) ~= 0); % generates a logical matrix, true where A is non-integer
A1 = A; A2 = A;
%subtract and add 0.5 only to the elements which satisfy the condition:
A1(cond) = A1(cond) -0.5;
A2(cond) = A2(cond) +0.5;
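The mask-then-shift approach carries over to other languages. A minimal plain-Python sketch, using the first three rows of the question's A; the helper name `shift_non_integers` is hypothetical.

```python
# First three rows of the question's matrix A.
A = [[1.5, 1, 1],
     [1, 1.5, 1],
     [2, 1.5, 1]]

def shift_non_integers(matrix, delta):
    """Add delta to entries with a fractional part; leave integers untouched."""
    return [[v + delta if v % 1 != 0 else v for v in row] for row in matrix]

A1 = shift_non_integers(A, -0.5)  # non-integers moved down by 0.5
A2 = shift_non_integers(A, +0.5)  # non-integers moved up by 0.5

print(A1)
print(A2)
```

Unlike floor and ceil, this shifts by exactly 0.5, so a value such as 1.25 would become 0.75 or 1.75 rather than 1 or 2.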

Find unique pairs in a matrix

Let's assume I have the following matrix:
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
Where the first column is an index and the next two an interaction and the last one a logic saying yes or no.
So now I would like to generate the following heat map based on the interactions. The "X" axis represents interactions and the "Y" axis represents the index.
1-2 1-3 2-2
1 1 NaN 1
2 NaN 0 0
3 1 NaN NaN
My current approach:
B = sortrows(A,[2,3]);
Afterwards I apply find for each row and column individually.
Is there a function similar to unique which can check for two columns at the same time?
Here's a way, using unique(...,'rows'):
A = [1 1 2 1; 1 2 2 1; 2 1 3 0; 2 2 2 0; 3 1 2 1]; % data
[~, ~, jj] = unique(A(:,[2 3]),'rows'); % get interaction identifiers
B = accumarray([A(:,1) jj], A(:,4), [], @sum, NaN); % build result, with NaN as fill value
This gives
B =
1 NaN 1
NaN 0 0
1 NaN NaN
>> A
A =
1 1 2 1
1 2 2 1
2 1 3 0
2 2 2 0
3 1 2 1
>> [C, IA, IC] = unique(A(:, [2, 3]), 'rows')
C =
1 2
1 3
2 2
IA =
1
3
2
IC =
1
3
2
3
1
C is a set of unique pairs. IA is the corresponding index of C (i.e., C == A(IA, [2, 3])). IC is the corresponding index of each row (i.e., A(:, [2, 3]) == C(IC, :)).
This is a possible solution with the aid of @Jeon's answer (updated):
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
[~,IA,idx] = unique(A(:, [2, 3]), 'rows');
r = max(A(:,1));
c = numel(IA);
out= NaN(r,c );
out(sub2ind([r ,c], A(:,1),idx)) = A(:,4)
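To make the mechanics explicit, here is a small plain-Python sketch of the same two steps: number the unique interaction pairs, then scatter the flags into a NaN-filled grid. The names `pairs`, `col` and `out` are assumptions for illustration.

```python
import math

# Rows from the question: index, interaction (two columns), flag.
A = [[1, 1, 2, 1], [1, 2, 2, 1], [2, 1, 3, 0], [2, 2, 2, 0], [3, 1, 2, 1]]

# Sorted unique interaction pairs, each mapped to a column position.
pairs = sorted({(a, b) for _, a, b, _ in A})
col = {pair: j for j, pair in enumerate(pairs)}

# NaN-filled grid: one row per index, one column per unique pair.
n_rows = max(row[0] for row in A)
out = [[math.nan] * len(pairs) for _ in range(n_rows)]

# Scatter the flags; indices in A are 1-based, Python lists are 0-based.
for idx, a, b, flag in A:
    out[idx - 1][col[(a, b)]] = flag

print(pairs)
for row in out:
    print(row)
```

Like the sub2ind version, this overwrites rather than sums if the same index and pair occur twice, so it matches the accumarray answer only when each combination appears at most once.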

R generate matrix from linear table with column and row numbers

I am new to R and struggle with arrays. My question is very simple, but I didn't find an easy answer on the web or in the R documentation.
I have a table with column and row number that I want to use to generate a new matrix
Original table:
V1 V2 pval
1 1 2 5.914384e-13
2 1 3 8.143390e-01
3 1 4 7.587818e-01
4 1 5 9.734698e-12
5 1 6 7.812521e-19
I want to use:
V1 as the column number for the new matrix;
V2 as the row number
pvals as the value
Targeted matrix:
      1    2    3    4
1     0 5e-1 8e-1 7e-1
2 5e-13    0
3  8e-1         0
4  7e-1              0
#some data
set.seed(42)
df <- data.frame(V1=rep(1:6,each=3),V2=rep(1:3,6),pval=runif(18,0,1))
df <- df[df$V1!=df$V2,]
# V1 V2 pval
#2 1 2 0.560332746
#3 1 3 0.904031387
#4 2 1 0.138710168
#6 2 3 0.946668233
#7 3 1 0.082437558
#8 3 2 0.514211784
# ...
#use dcast to change to wide format
library(reshape2)
df2 <- dcast(df, V2 ~ V1, value.var = "pval", fill = 0)   # name the value column explicitly instead of letting dcast guess
# V2 1 2 3 4 5 6
#1 1 0.0000000 0.1387102 0.08243756 0.9057381 0.7375956 0.685169729
#2 2 0.5603327 0.0000000 0.51421178 0.4469696 0.8110551 0.003948339
#3 3 0.9040314 0.9466682 0.00000000 0.8360043 0.3881083 0.832916080
#in case you really want a matrix object
m <- as.matrix(df2[,-1])
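What the reshape does can also be written as a direct fill, shown here as a plain-Python sketch. The three records are the first rows of the question's table; the names `records` and `m` are assumptions, and the matrix is assumed square with 1-based row and column numbers.

```python
# (V1, V2, pval) triples: column number, row number, value.
records = [
    (1, 2, 5.914384e-13),
    (1, 3, 8.143390e-01),
    (1, 4, 7.587818e-01),
]

# Size the matrix from the largest row or column number seen.
n = max(max(c, r) for c, r, _ in records)

# Start from zeros so cells without a record stay 0, like fill = 0.
m = [[0.0] * n for _ in range(n)]

for c, r, value in records:
    m[r - 1][c - 1] = value  # V2 is the row, V1 the column (both 1-based)

for row in m:
    print(row)
```

This fills only the cells that have a record. If the target should be symmetric, as the question's sketch suggests, the mirrored cell `m[c - 1][r - 1]` would need to be set as well.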

Resources