How to make groupby dataframes using a list?

I have a dataframe with columns x, y, z like the one below.
x y z
1 2 1
1 2 2
3 3 1
3 1 2
4 1 2
...
9 3 4
and I have to make one dataframe per value of x.
df1(x=1)
x y z
1 2 3
1 3 3
df2(x=2)
x y z
2 3 3
2 4 5
dfx(x=n)
x y z
n y z
- - -
I know that pandas df.groupby("x") groups the dataframe by "x",
but there are so many distinct "x" values in my data that I can't define them all by hand.
Is there a function that makes the dataframes from a list of values, like groupby(list)?
Thanks in advance.

In your case, save the groups into a dict keyed by the value of x:
d = {x : y for x , y in df.groupby('x')}
d[1]
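As a minimal sketch of the dict approach above, assuming a small frame built from the first rows shown in the question: the name `wanted` is a hypothetical list of x values, not something from the original post, and keys that do not occur in the data are simply skipped.

```python
import pandas as pd

# Sample frame built from the first rows shown in the question.
df = pd.DataFrame({
    "x": [1, 1, 3, 3, 4],
    "y": [2, 2, 3, 1, 1],
    "z": [1, 2, 1, 2, 2],
})

# One sub-dataframe per distinct value of x, keyed by that value.
groups = {key: sub for key, sub in df.groupby("x")}

# Hypothetical list of x values to pull out; 7 is not in the data and is skipped.
wanted = [1, 4, 7]
selected = {key: groups[key] for key in wanted if key in groups}

print(selected[1])
```

Because the dict is built once from groupby, there is no need to name every x value in advance; the list only decides which of the existing groups you look up.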

Related

Sort a matrix according to ordering in another matrix

I am trying to sort an array based on another array. I tried the sort method with index return, but it is somehow behaving strangely.
y = [1 2 3; 2 3 4]
x = [1 3 4; 2 2 3]
[yy, ii] = sort(y,'descend');
yy =
2 3 4
1 2 3
ii =
2 2 2
1 1 1
But my x(ii) is not the matrix sorted based on y.
x(ii) =
2 2 2
1 1 1
I am expecting the matrix to be:
x(ii) =
2 2 3
1 3 4
How can I sort the matrix x according to another matrix y?
ii contains row subscripts, but you are passing them to x as linear indices.
You need to convert them to the corresponding linear indices first, i.e.
>> szx = size(x);
>> x(sub2ind(szx, ii, repmat(1:szx(2),szx(1),1)))
ans =
2 2 3
1 3 4
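For comparison, here is the same idea as a rough plain-Python sketch (not the MATLAB code from the answer): each row index returned by the sort has to be paired with its own column before reading from x. The names `order` and `x_sorted` are made up for this illustration.

```python
y = [[1, 2, 3], [2, 3, 4]]
x = [[1, 3, 4], [2, 2, 3]]

n_rows, n_cols = len(y), len(y[0])

# For each column c, the row order that sorts y[:, c] in descending order
# (this plays the role of ii in the MATLAB code).
order = [
    sorted(range(n_rows), key=lambda r, c=c: y[r][c], reverse=True)
    for c in range(n_cols)
]

# Pair every row index with its own column when reading from x,
# which is what sub2ind does with the repmat'ed column numbers.
x_sorted = [[x[order[c][i]][c] for c in range(n_cols)] for i in range(n_rows)]

print(x_sorted)  # [[2, 2, 3], [1, 3, 4]]
```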

How do I obtain a single array from all possible combinations of the elements of two vectors?

I have two vectors:
x = [1 2 3]; y = [4 5]
I need a single array yx that gives me all pairwise combinations of the elements of both vectors. This is the code I have tried so far, using one of the examples from Stack Overflow.
sets = {y, x};
[y x] = ndgrid(sets{:});
yx = [y x]'
This gives me the result:
yx =
4 5
4 5
4 5
1 1
2 2
3 3
Whereas, I am expecting the following result:
yx =
4 1
4 2
4 3
5 1
5 2
5 3
What am I doing wrong here? Any help or suggestions are greatly appreciated. Thanks!
What you're trying to obtain is the Cartesian product of the two vectors.
Here's a solution:
>> x = [1 2 3]; y = [4 5];
>> [X,Y] = meshgrid(y,x);
>> result = [X(:) Y(:)]
result =
4 1
4 2
4 3
5 1
5 2
5 3
(this works also in Octave and does not require extra libraries)
Your final concatenation is wrong. You expect x and y to be column vectors, but after ndgrid they are 2x3 matrices. To get a 2-column matrix of all pairs, you need to linearize them first:
yx = [y(:) x(:)]
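This outputs the pairs in a different order. If you want the order from your question, transpose x and y before vectorizing and concatenating.
You are looking for combvec(x, y) (part of the Deep Learning Toolbox, formerly the Neural Network Toolbox). Note that it returns one combination per column: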
>> x = [1 2 3]
x =
1 2 3
>> y = [4 5]
y =
4 5
>> combvec(x, y)
ans =
1 2 3 1 2 3
4 4 4 5 5 5
Here is a way to do it with no complicated functions.
x = [1 2 3];
y = [4 5];
nx = numel(x);
ny = numel(y);
xy = [reshape(repmat(y,nx,1), 1, [])', repmat(x',ny,1)];
% xy = [4 1
% 4 2
% 4 3
% 5 1
% 5 2
% 5 3
Explanation:
The output has x repeated once for each element of y, i.e. ny times.
The output has each element of y repeated once for each element of x, i.e. nx times.
repmat(x',ny,1) simply stacks x ny times to form the second column.
repmat combined with reshape repeats each element of y nx times in a row, which gives the first column.
You could condense the code by not using nx and ny.
xy = [reshape(repmat(y,numel(x),1), 1, [])', repmat(x',numel(y),1)];
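For readers coming from Python, the same Cartesian product can be sketched with the standard library. This is only an illustration of the ordering discussed above, not a translation of any of the MATLAB answers.

```python
from itertools import product

x = [1, 2, 3]
y = [4, 5]

# product varies its last argument fastest, so passing y first
# yields (4, 1), (4, 2), (4, 3), (5, 1), (5, 2), (5, 3).
yx = [list(pair) for pair in product(y, x)]

print(yx)  # [[4, 1], [4, 2], [4, 3], [5, 1], [5, 2], [5, 3]]
```

Swapping the arguments to product(x, y) gives the other ordering, with x varying slowest.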

Find unique pairs in a matrix

Let's assume I have the following matrix:
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
The first column is an index, the next two columns form an interaction, and the last one is a logical flag (yes or no).
Now I would like to generate the following heat map based on the interactions. The "X" axis represents interactions and the "Y" axis represents the index.
     1-2   1-3   2-2
1    1     NaN   1
2    NaN   0     0
3    1     NaN   NaN
My current approach:
B = sortrows(A,[2,3]);
Afterwards I apply find for each row and column individually.
Is there a function similar to unique which can check for two columns at the same time?
Here's a way, using unique(...,'rows'):
A = [1 1 2 1; 1 2 2 1; 2 1 3 0; 2 2 2 0; 3 1 2 1]; % data
[~, ~, jj] = unique(A(:,[2 3]),'rows'); % get interaction identifiers
B = accumarray([A(:,1) jj], A(:,4), [], @sum, NaN); % build result, with NaN as fill value
This gives
B =
1 NaN 1
NaN 0 0
1 NaN NaN
>> A
A =
1 1 2 1
1 2 2 1
2 1 3 0
2 2 2 0
3 1 2 1
>> [C, IA, IC] = unique(A(:, [2, 3]), 'rows')
C =
1 2
1 3
2 2
IA =
1
3
2
IC =
1
3
2
3
1
C is a set of unique pairs. IA is the corresponding index of C (i.e., C == A(IA, [2, 3])). IC is the corresponding index of each row (i.e., A(:, [2, 3]) == C(IC, :)).
This is a possible solution, building on @Jeon's answer (updated):
A = [1 1 2 1;1 2 2 1;2 1 3 0;2 2 2 0;3 1 2 1]
[~,IA,idx] = unique(A(:, [2, 3]), 'rows');
r = max(A(:,1));
c = numel(IA);
out= NaN(r,c );
out(sub2ind([r ,c], A(:,1),idx)) = A(:,4)
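The steps used in these answers (find the unique pairs, give each a column number, then fill a NaN matrix) can be sketched in plain Python as follows. The names `pairs`, `col_of` and `out` are made up for this illustration, and the sketch assumes each index/pair combination occurs at most once, as in the sample data.

```python
import math

A = [
    [1, 1, 2, 1],
    [1, 2, 2, 1],
    [2, 1, 3, 0],
    [2, 2, 2, 0],
    [3, 1, 2, 1],
]

# Sorted unique (col2, col3) pairs, like unique(A(:,[2 3]),'rows').
pairs = sorted({(row[1], row[2]) for row in A})

# Column number of each pair in the result (0-based here).
col_of = {pair: j for j, pair in enumerate(pairs)}

# Start from an all-NaN grid: one row per index, one column per pair.
n_rows = max(row[0] for row in A)
out = [[math.nan] * len(pairs) for _ in range(n_rows)]

# Fill in the flag for every (index, pair) that is present.
for idx, a, b, flag in A:
    out[idx - 1][col_of[(a, b)]] = flag

print(pairs)  # [(1, 2), (1, 3), (2, 2)]
```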

Rowwise 2 dimensional matrix intersection in Matlab

I will try to explain what I need through an example.
Suppose you have a matrix x as follows:
1 2 3
4 5 6
And another matrix y as follows:
1 4 5
7 4 8
What I need is (without looping over the rows) to perform an intersection between each 2 corresponding rows in x & y. So I wish to get a matrix z as follows:
1
4
The 1st rows in x and y only have 1 as the common value. The 2nd rows have 4 as the common value.
EDIT:
I forgot to add that in my case, it is guaranteed that the intersection results will have the same length and the length is always 1 actually.
I am thinking bsxfun -
y(squeeze(any(bsxfun(@eq,x,permute(y,[1 3 2])),2)))
Sample runs -
Run #1:
>> x
x =
1 2 3
4 5 6
>> y
y =
1 4 5
7 4 8
>> y(squeeze(any(bsxfun(@eq,x,permute(y,[1 3 2])),2)))
ans =
1
4
Run #2:
>> x
x =
3 5 7 9
2 7 9 0
>> y
y =
6 4 3
6 0 2
>> y(squeeze(any(bsxfun(@eq,x,permute(y,[1 3 2])),2)))
ans =
0
3
2
The idea is to put the matrices together and to look for duplicates in the rows. One idea to find duplicated numeric values is to diff them; the duplicates will be marked by the value 0 in result.
Which leads to:
%'Initial data'
A = [1 2 3; 8 5 6];
B = [1 4 5; 7 4 8];
%'Look in merged data'
V = sort([A,B],2); %'Sort matrix values in rows'
R = V(diff(V,1,2)==0); %'Find duplicates in rows'
This should work with any number of matrices that can be concatenated horizontally. It will detect all the duplicates, but it will return a column the same size as the number of rows only if there is one and only one duplicate per row in the matrices.
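As a point of comparison, a row-wise intersection can be sketched in plain Python with sets. Unlike the MATLAB answers it does iterate over the rows (through a comprehension), which is the natural form in Python; it uses the matrices from the question and the name `z` from there.

```python
x = [[1, 2, 3], [4, 5, 6]]
y = [[1, 4, 5], [7, 4, 8]]

# Intersect corresponding rows; sorted() makes the order deterministic.
z = [sorted(set(row_x) & set(row_y)) for row_x, row_y in zip(x, y)]

print(z)  # [[1], [4]]
```

Rows with several common values simply produce longer inner lists, so the result does not have to be rectangular.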

Array row calculations

I have the following table:
DATA:
Lines <- " ID MeasureX MeasureY x1 x2 x3 x4 x5
1 1 1 1 1 1 1 1
2 1 1 0 1 1 1 1
3 1 1 1 2 3 3 3"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
What I would like to achieve is:
Create 5 columns (r1-r5)
which are each column x1-x5 divided by MeasureX (for example x1/MeasureX, x2/MeasureX, etc.)
Create 5 columns (p1-p5)
which are each column x1-x5 divided by its position 1-5 among the x columns (for example x1/1, x2/2, etc.)
MeasureY is irrelevant for now. The end product would be the ID and the columns r1-r5 and p1-p5. Is this feasible?
In SAS I would go with something like this:
data test6;
set test5;
array x {5} x1- x5;
array r{5} r1 - r5;
array p{5} p1 - p5;
do i=1 to 5;
r{i} = x{i}/MeasureX;
p{i} = x{i}/(i);
end;
The reason is that I want the code to be dynamic, because the number of columns could change in the future.
Argument recycling allows you to do element-wise division with a constant vector. The tricky part was extracting the digits from the column names. I then repeated each of the digits by the number of rows to do the second division task.
DF[ ,paste0("r", 1:5)] <- DF[ , grep("x", names(DF) )]/ DF$MeasureX
DF[ ,paste0("p", 1:5)] <- DF[ , grep("x", names(DF) )]/ # element-wise division
rep( as.numeric( sub("\\D","",names(DF)[ # remove non-digits
grep("x", names(DF))] #returns only 'x'-cols
) ), each=nrow(DF) ) # make them as long as needed
#-------------
> DF
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
This could be greatly simplified if you already know the sequence vector for the second division task would be 1-5, but this was designed to allow "gaps" in the sequence for column names and still use the digit information in the names as the divisor. (You were not entirely clear about what situations this code would be used in.) The construct of r{1-5} in SAS is mimicked by [ , paste0('r', 1:5)]. SAS is a macro language and sometimes experienced users have trouble figuring out how to make R behave like one. Generally it takes a while to lose the for-loop mentality and begin using R as a functional language.
An alternative with the data.table package:
cols <- names(df[c(4:8)])
library(data.table)
setDT(df)[, (paste0("r",1:5)) := .SD / MeasureX, by = ID, .SDcols = cols
][, (paste0("p",1:5)) := .SD / 1:5, by = ID, .SDcols = cols]
which results in:
> df
ID MeasureX MeasureY x1 x2 x3 x4 x5 r1 r2 r3 r4 r5 p1 p2 p3 p4 p5
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.5 0.3333333 0.25 0.2
2: 2 1 1 0 1 1 1 1 0 1 1 1 1 0 0.5 0.3333333 0.25 0.2
3: 3 1 1 1 2 3 3 3 1 2 3 3 3 1 1.0 1.0000000 0.75 0.6
You could put together a nifty loop or apply to do this, but here it is explicitly:
# Handling the "r" columns.
DF$r1 <- DF$x1 / DF$MeasureX
DF$r2 <- DF$x2 / DF$MeasureX
DF$r3 <- DF$x3 / DF$MeasureX
DF$r4 <- DF$x4 / DF$MeasureX
DF$r5 <- DF$x5 / DF$MeasureX
# Handling the "p" columns.
DF$p1 <- DF$x1 / 1
DF$p2 <- DF$x2 / 2
DF$p3 <- DF$x3 / 3
DF$p4 <- DF$x4 / 4
DF$p5 <- DF$x5 / 5
# Taking only the columns we want.
FinalDF <- DF[, c("ID", "r1", "r2", "r3", "r4", "r5", "p1", "p2", "p3", "p4", "p5")]
Just noting that this is pretty straightforward matrix manipulation that you definitely could have found elsewhere. Perhaps you're new to R, but still put a little more effort in next time. If you are new to R, it's definitely worth the time to look up some basic R coding tutorial or video.
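The "nifty loop" mentioned above could look roughly like the following plain-Python sketch. It finds the x columns by name, so it keeps working if their number changes, and it uses the digit in each name as the divisor for the p columns. The names `records`, `x_cols` and `result` are made up for this illustration.

```python
import re

# The sample table from the question, one dict per row.
records = [
    {"ID": 1, "MeasureX": 1, "MeasureY": 1, "x1": 1, "x2": 1, "x3": 1, "x4": 1, "x5": 1},
    {"ID": 2, "MeasureX": 1, "MeasureY": 1, "x1": 0, "x2": 1, "x3": 1, "x4": 1, "x5": 1},
    {"ID": 3, "MeasureX": 1, "MeasureY": 1, "x1": 1, "x2": 2, "x3": 3, "x4": 3, "x5": 3},
]

# Columns named "x" followed by digits; "MeasureX" does not match.
x_cols = [name for name in records[0] if re.fullmatch(r"x\d+", name)]

result = []
for rec in records:
    out = {"ID": rec["ID"]}
    for name in x_cols:
        k = int(name[1:])                           # digit suffix, e.g. 3 for "x3"
        out[f"r{k}"] = rec[name] / rec["MeasureX"]  # x_k / MeasureX
    for name in x_cols:
        k = int(name[1:])
        out[f"p{k}"] = rec[name] / k                # x_k / k
    result.append(out)

print(result[2]["p4"])  # 0.75
```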

Resources