I have a dataframe with columns x, y and z, like the one below.
x y z
1 2 1
1 2 2
3 3 1
3 1 2
4 1 2
...
9 3 4
I have to split it into separate dataframes, one per value of x.
df1(x=1)
x y z
1 2 3
1 3 3
df2(x=2)
x y z
2 3 3
2 4 5
dfx(x=n)
x y z
n y z
- - -
I know that pandas df.groupby("x") groups the dataframe by "x",
but there are so many distinct "x" values in my data that I can't define them all by hand.
Is there a function that makes the dataframes from a list, something like groupby(list)?
Thanks in advance.
In your case, save the groups into a dict keyed by the value of x:
d = {x : y for x , y in df.groupby('x')}
d[1]
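As a self-contained sketch of the dict approach above: the frame below reuses the first rows of the question's example, and the names `df` and `frames` are only illustrative. Iterating over `df.groupby("x")` yields `(key, sub-frame)` pairs, so no list of x values has to be written out.

```python
import pandas as pd

# First rows of the example frame from the question
df = pd.DataFrame({
    "x": [1, 1, 3, 3, 4],
    "y": [2, 2, 3, 1, 1],
    "z": [1, 2, 1, 2, 2],
})

# One sub-frame per distinct x, keyed by the x value itself
frames = {key: group for key, group in df.groupby("x")}

print(sorted(frames))  # the x values that were found
print(frames[3])       # the rows where x == 3
```

Looking up `frames[n]` then plays the role of the `dfx(x=n)` frames in the question.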
I have an array cluster_true and a dataframe data containing in each row a 2D coordinate. I want to save in another dataframe information regarding how many times for a given 2D coordinate each element in cluster_true appeared. So, for instance, for the coordinate (1,1), I want to check all the rows in data whose first two columns have the value of 1, and then check the values of cluster_true at those indices. Here is an example to make it clearer (it gives the desired result):
# Example variables
cluster_true = c(1,2,1,1,2,2,1,2,2,2,2,1,1)
x = 3
y = 3
data = data.frame(X = c(1,1,0,0,2,1,1,0,0,0,1,1,1),
Y = c(1,1,2,1,2,2,1,0,0,0,0,2,0))
# Names of the columns
plot_colnames = c('X', 'Y', paste('cluster',unique(cluster_true),sep='_'))
# Empty dataframe with the right column names
plot_df = data.frame(matrix(vector(), x*y, length(plot_colnames),
dimnames=list(c(), plot_colnames)),
stringsAsFactors=F)
# Each row belongs to a certain 2D coordinate
plot_df$X = rep(1:x, y)-1
plot_df$Y = rep(1:x, each = y)-1
# This is what I don't know how to improve
for(i in 1:nrow(plot_df)){
idx = which(apply(data[,1:2], 1, function(x) all(x == plot_df[i,1:2])))
plot_df[i,3] = sum(cluster_true[idx] == 1)
plot_df[i,4] = sum(cluster_true[idx] == 2)
}
print(plot_df)
Things I need to change and I don't know how to:
I think the loop could be avoided in order to get a more elegant solution, but I don't know how. The dataframe data could have a very large amount of rows, so efficient code would be awesome.
Inside the loop, I've hardcoded the clusters to check (the last two lines inside the loop assume that I know which numbers are present in cluster_true and which column of plot_df they correspond to). In fact, the elements in cluster_true could be anything, even non-consecutive numbers (e.g. cluster_true = c(1,5,5,5,56,10,19,10)).
So basically, I want to know if this could be done without the loop and as generic as possible.
If I understand correctly, the OP wants to
find the row indices for all unique combinations of X, Y coordinates in data,
look up the value in the corresponding rows of cluster_true,
count the number of occurrences of each value for the given X, Y combination, and
print the results in wide format.
This can be solved by joining and reshaping:
library(data.table) # version 1.11.4 used
library(magrittr) # use piping to improve readability
# unique coordinate pairs
uni_coords <- unique(setDT(data)[, .(X, Y)])[order(X, Y)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length)
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 2 0 1
5: 3 1 1 0
6: 3 2 1 0
7: 3 3 0 3
This is identical to the OP's expected result except for the coordinate combinations which do not appear in data.
setDT(plot_df)[order(X, Y)]
X Y cluster_1 cluster_2
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3
Reshaping has the benefit that it can handle arbitrary values in cluster_true as requested by the OP.
Edit
The OP has requested that all possible combinations of X, Y coordinates should be included in the final result. This can be achieved by using a cross join CJ() to compute uni_coords:
# all possible coordinate pairs
uni_coords <- setDT(data)[, CJ(X = X, Y = Y, unique = TRUE)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI][
uni_coords, on = .(X, Y)] %>%
# reshape from long to wide format, thereby counting occurrences
dcast(X + Y ~ sprintf("cluster_%02i", V1), length) %>%
# remove NA column from reshaped result
.[, cluster_NA := NULL] %>%
print()
X Y cluster_01 cluster_02
1: 1 1 2 1
2: 1 2 1 1
3: 1 3 1 1
4: 2 1 0 0
5: 2 2 0 1
6: 2 3 0 0
7: 3 1 1 0
8: 3 2 1 0
9: 3 3 0 3
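For readers working in Python, the same count-then-fill idea can be sketched with pandas. This is not part of the data.table answer, only an illustration: it uses the question's example values as given (coordinates 0 to 2), counts with `pd.crosstab`, and reindexes over the full grid so that absent coordinate pairs get zeros. The names `counts` and `grid` are made up for this sketch.

```python
import pandas as pd

cluster_true = [1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1]
data = pd.DataFrame({
    "X": [1, 1, 0, 0, 2, 1, 1, 0, 0, 0, 1, 1, 1],
    "Y": [1, 1, 2, 1, 2, 2, 1, 0, 0, 0, 0, 2, 0],
})

# Count how often each cluster label occurs for each (X, Y) pair;
# the labels found in cluster_true become the columns, so nothing is hardcoded
counts = pd.crosstab(
    [data["X"], data["Y"]],
    pd.Series(cluster_true, name="cluster"),
)

# Reindex over every X/Y combination so missing pairs appear as rows of zeros
grid = pd.MultiIndex.from_product(
    [sorted(data["X"].unique()), sorted(data["Y"].unique())],
    names=["X", "Y"],
)
counts = counts.reindex(grid, fill_value=0).add_prefix("cluster_").reset_index()
print(counts)
```

Because the columns come from the labels themselves, non-consecutive values in `cluster_true` need no special handling.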
I have an m-by-3 matrix A containing both integers and non-integers.
A = [1.5 1 1
1 1.5 1
2 1.5 1
1.5 2 1
1 1 1.5
2 1 1.5
1 2 1.5
2 2 1.5
1.5 1 2
1 1.5 2
2 1.5 2
1.5 2 2];
I want to create two new matrices, A1 and A2, by scanning through each row of A:
A1 = subtract 0.5 from any non-integer found in any column, and leave the integers as they are.
A2 = add 0.5 to any non-integer found in any column, and leave the integers as they are.
I would expect my final arrays to be:
A1 = [1 1 1
1 1 1
2 1 1
1 2 1
1 1 1
2 1 1
1 2 1
2 2 1
1 1 2
1 1 2
2 1 2
1 2 2];
A2 = [2 1 1
1 2 1
2 2 1
2 2 1
1 1 2
2 1 2
1 2 2
2 2 2
2 1 2
1 2 2
2 2 2
2 2 2];
If your "non-integer" numbers are only of the form x.5, you can simply use floor and ceil:
A1 = floor(A);
A2 = ceil(A);
If that's not the case, use logical indexing:
A1 = A;
A1(round(A1) ~= A1) = A1(round(A1) ~= A1) - 0.5;
A2 = A;
A2(round(A2) ~= A2) = A2(round(A2) ~= A2) + 0.5;
You can also make a condition, and depending on how you satisfy that condition either add or subtract 0.5:
cond = (rem(A,1) ~= 0); % generates a logical matrix, true for non-integers
A1 = A; A2 = A;
%subtract and add 0.5 only to the elements which satisfy the condition:
A1(cond) = A1(cond) -0.5;
A2(cond) = A2(cond) +0.5;
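The same masking idea, sketched in NumPy rather than MATLAB for comparison (an illustration, not part of the answers above). It uses the first four rows of A from the question; `np.mod(A, 1) != 0` stands in for `rem(A,1) ~= 0`.

```python
import numpy as np

# First four rows of A from the question
A = np.array([
    [1.5, 1.0, 1.0],
    [1.0, 1.5, 1.0],
    [2.0, 1.5, 1.0],
    [1.5, 2.0, 1.0],
])

# Boolean mask: True where the entry has a fractional part
cond = np.mod(A, 1) != 0

# Work on copies so A itself is left untouched
A1 = A.copy()
A2 = A.copy()
A1[cond] -= 0.5  # subtract 0.5 from the non-integers only
A2[cond] += 0.5  # add 0.5 to the non-integers only

print(A1)
print(A2)
```

Computing the mask once and reusing it avoids evaluating the same test several times, which is the point of the `cond` variant above.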
Let's assume I have the following matrix:
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
The first column is an index, the next two columns describe an interaction, and the last one is a logical flag (yes or no).
Now I would like to generate the following heat map based on the interactions. The "X" axis represents interactions and the "Y" axis represents the index.
1-2 1-3 2-2
1 1 NaN 1
2 NaN 0 0
3 1 NaN NaN
My current approach:
B = sortrows(A,[2,3]);
Afterwards I apply find for each row and column individually.
Is there a function similar to unique which can check for two columns at the same time?
Here's a way, using unique(...,'rows'):
A = [1 1 2 1; 1 2 2 1; 2 1 3 0; 2 2 2 0; 3 1 2 1]; % data
[~, ~, jj] = unique(A(:,[2 3]),'rows'); % get interaction identifiers
B = accumarray([A(:,1) jj], A(:,4), [], @sum, NaN); % build result, with NaN as fill value
This gives
B =
1 NaN 1
NaN 0 0
1 NaN NaN
>> A
A =
1 1 2 1
1 2 2 1
2 1 3 0
2 2 2 0
3 1 2 1
>> [C, IA, IC] = unique(A(:, [2, 3]), 'rows')
C =
1 2
1 3
2 2
IA =
1
3
2
IC =
1
3
2
3
1
C is a set of unique pairs. IA is the corresponding index of C (i.e., C == A(IA, [2, 3])). IC is the corresponding index of each row (i.e., A(:, [2, 3]) == C(IC, :)).
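The roles of C, IA and IC can also be checked with a NumPy sketch (an illustration in Python, not MATLAB; note that NumPy indices are 0-based, so they are one less than the MATLAB values shown above). `np.unique(..., axis=0)` is used here as the counterpart of `unique(..., 'rows')`, and the names `pairs`, `first_idx`, `inverse` and `heat` are made up for the example.

```python
import numpy as np

A = np.array([
    [1, 1, 2, 1],
    [1, 2, 2, 1],
    [2, 1, 3, 0],
    [2, 2, 2, 0],
    [3, 1, 2, 1],
])

# Unique interaction pairs (C), first occurrence of each (IA), group id per row (IC)
pairs, first_idx, inverse = np.unique(
    A[:, 1:3], axis=0, return_index=True, return_inverse=True
)
inverse = np.ravel(inverse)  # flatten defensively in case the inverse is not 1-D

# Heat map: one row per index value, one column per unique pair, NaN where absent
heat = np.full((A[:, 0].max(), len(pairs)), np.nan)
heat[A[:, 0] - 1, inverse] = A[:, 3]

print(pairs)
print(heat)
```

The two identities stated above hold here as well: `A[first_idx, 1:3]` equals `pairs`, and `pairs[inverse]` equals `A[:, 1:3]`.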
This is a possible solution with the aid of @Jeon's answer (updated):
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
[~,IA,idx] = unique(A(:, [2, 3]), 'rows');
r = max(A(:,1));
c = numel(IA);
out= NaN(r,c );
out(sub2ind([r ,c], A(:,1),idx)) = A(:,4)
I am new to R and struggle with arrays. My question is very simple, but I didn't find an easy answer on the web or in the R documentation.
I have a table with column and row numbers that I want to use to generate a new matrix.
Original table:
V1 V2 pval
1 1 2 5.914384e-13
2 1 3 8.143390e-01
3 1 4 7.587818e-01
4 1 5 9.734698e-12
5 1 6 7.812521e-19
I want to use:
V1 as the column number for the new matrix;
V2 as the row number
pval as the value
Targeted matrix:
1 2 3 4
1 0 5e-1 8e-1 7e-1
2 5e-13 0
3 8e-1 0
4 7e-1 0
#some data
set.seed(42)
df <- data.frame(V1=rep(1:6,each=3),V2=rep(1:3,6),pval=runif(18,0,1))
df <- df[df$V1!=df$V2,]
# V1 V2 pval
#2 1 2 0.560332746
#3 1 3 0.904031387
#4 2 1 0.138710168
#6 2 3 0.946668233
#7 3 1 0.082437558
#8 3 2 0.514211784
# ...
#use dcast to change to wide format
library(reshape2)
df2 <- dcast(df,V2~V1,value.var="pval",fill=0)
# V2 1 2 3 4 5 6
#1 1 0.0000000 0.1387102 0.08243756 0.9057381 0.7375956 0.685169729
#2 2 0.5603327 0.0000000 0.51421178 0.4469696 0.8110551 0.003948339
#3 3 0.9040314 0.9466682 0.00000000 0.8360043 0.3881083 0.832916080
#in case you really want a matrix object
m <- as.matrix(df2[,-1])
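For comparison, the same long-to-wide step can be sketched in Python with pandas (an illustration only, not part of the R answer). It uses the first three rows of the question's table; `DataFrame.pivot` plays the role of `dcast`, and reindexing over all labels is an extra step assumed here to get a square matrix.

```python
import pandas as pd

# First three rows of the table from the question
df = pd.DataFrame({
    "V1": [1, 1, 1],
    "V2": [2, 3, 4],
    "pval": [5.914384e-13, 8.143390e-01, 7.587818e-01],
})

# Long to wide: V2 gives the row, V1 the column, pval the cell value
wide = df.pivot(index="V2", columns="V1", values="pval")

# Use every label on both axes so the result is square, then fill gaps with 0
labels = sorted(set(df["V1"]).union(df["V2"]))
wide = wide.reindex(index=labels, columns=labels).fillna(0)

# Plain 2-D array, in case a matrix object is really needed
m = wide.to_numpy()
print(wide)
```

Only the cells present in the table are filled; a symmetric matrix like the targeted one would also need the mirrored (V2, V1) entries.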