Split Pandas Dataframe into separate pieces based on column values - arrays

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np
columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)
df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52
df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3
df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
1. Within a given s_id, produce a separate dataframe for each c_id value. For example, for s_id = 144 there will be 3 dataframes, while for s_id = 105 there will be 2 dataframes.
2. Inner Join the separate dataframes produced in step 1 on the elements column (c_col1). This is a little difficult to explain, so here is the dataframe that I would like to get from this step:
index s_id c_id c_col1
0 144 1 O
1 144 2 O
2 144 3 O
3 105 1 O
4 105 2 F
5 52 1 N
6 52 2 N
7 52 3 N
As you can see, what I am looking for in part 2.) is the following: Within each s_id, I am looking for those c_col1 values that occur for all the c_id values. ex. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2, 3. I have pointed to these entries, with "<--", in the raw data. So, I would like to have the dataframe show O 3 times in the c_col1 column and the corresponding c_id entries would be 1, 2, 3.
Conditions:
The number of unique c_ids is not known ahead of time, i.e. for one
particular s_id, I do not know if there will be 1, 2 and 3 or just 1
and 2. This means that if 1, 2 and 3 occur, there will be one Inner
Join; if only 1 and 2 occur, then there will be only one Inner Join.
How can this be done with Pandas?

Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id and the inner keys are the c_id and the inner values are the data. That you can do with a fairly long but straightforward dict comprehension:
DF_dict = {s_id :
{c_id : df[(df.s_id == s_id) & (df.c_id == c_id)] for c_id in df[df.s_id == s_id]['c_id'].unique()}
for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. Do you then want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id, you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question...
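As a rough sketch of the groupby route mentioned above, the same nested dict can be built in a single pass over the groups instead of filtering the frame once per key. The small frame and the name df_dict below are illustrative stand-ins, not the question's full data.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe (a small subset).
df = pd.DataFrame({
    "s_id":   [144, 144, 144, 105, 105],
    "c_id":   [1, 1, 2, 1, 2],
    "c_col1": ["H", "O", "O", "C", "N"],
})

# Outer key: s_id; inner key: c_id; inner value: the matching rows.
df_dict = {}
for (s_id, c_id), piece in df.groupby(["s_id", "c_id"]):
    df_dict.setdefault(s_id, {})[c_id] = piece

print(df_dict[144][1])
```

Grouping on both columns at once means the number of c_id values per s_id never has to be known in advance.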
##################EDIT
It seems to me that you should go straight to problem 2 if problem 1 is simply a step you believe to be necessary to get to a problem 2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and filter each group according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear under every value of c_id.
You might write a function like this:
def c_id_overlap(df):
    common_vals = [] #container for values of c_col1 that are in every c_id subgroup
    c_ids = df.c_id.unique() #get unique values of c_id
    c_col1_values = set(df.c_col1) # get a set of c_col1 values
    #create nested list of values. Each inner list contains the c_col1 values for each c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    #Iterate through the c_col1_values and see if they are in every nested list
    for val in c_col1_values:
        if all([True if val in elem else False for elem in nested_c_col_vals]):
            common_vals.append(val)
    #return a slice of the dataframe that only contains values of c_col1 that are in every
    #c_id
    return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index = False).apply(c_id_overlap)
which gives me the following output:
s_id c_id c_col1
0 31 52 1 N
34 52 2 N
40 52 3 N
1 16 105 1 O
17 105 1 F
18 105 1 Zn
20 105 2 O
21 105 2 F
23 105 2 Zn
2 3 144 1 O
6 144 2 O
11 144 3 O
Which seems to be what you are looking for.
###########EDIT: Additional Explanation:
So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.
So think about the first group passed where s_id == 105. The first line of the function creates an empty list common_vals which will contain those periodic elements that appear in every subgroup of the data (i.e. relative to each of the values of c_id).
The second line gets the unique values of 'c_id', in this case [1, 2], and stores them in an array called c_ids.
The third line creates a set of the values of c_col1 which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure nested_c_col_vals where every inner list is a list of the unique values associated with each of the elements in the c_ids array. In this case this looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
Now each of the elements in the c_col1_values set is iterated over, and for each element the program determines whether it appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is truthy, that is, True or non-zero. So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there. If it is, then True; if it is not, then False. So in this case this resolves to:
all([True, False])
which is of course False. Now when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete the values inside common_vals are:
['O', 'F', 'Zn']
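The membership loop described above amounts to a set intersection across the inner lists. A minimal sketch using the s_id == 105 values from this explanation (it assumes there is at least one inner list, since set.intersection needs at least one argument):

```python
# Inner lists for s_id == 105, copied from the explanation above.
nested_c_col_vals = [
    ["C", "O", "F", "Zn"],
    ["N", "O", "F", "Fe", "Zn", "Gd", "Hg", "Pb"],
]

# Values present in every inner list.
common = set.intersection(*(set(vals) for vals in nested_c_col_vals))

print(sorted(common))  # ['F', 'O', 'Zn']
```

A set has no defined order, which is why the result is sorted before printing.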
The return statement simply slices the data chunk according to whether the values of c_col1 are in the list common_vals, as per above.
This is then repeated for each of the remaining groups and the data are glued back together.
Hope this helps
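For larger frames, the same rule can probably be expressed without a Python-level loop by comparing two group-wise counts. This is only a sketch of an alternative, not the approach used above; keep_common is a hypothetical name, and the frame holds just the s_id == 105 rows from the question.

```python
import pandas as pd

# The s_id == 105 rows from the question.
df = pd.DataFrame({
    "s_id":   [105] * 12,
    "c_id":   [1] * 4 + [2] * 8,
    "c_col1": ["C", "O", "F", "Zn",
               "N", "O", "F", "Fe", "Zn", "Gd", "Hg", "Pb"],
})

def keep_common(frame):
    # Number of distinct c_id values in each s_id group, per row.
    n_c_ids = frame.groupby("s_id")["c_id"].transform("nunique")
    # Number of distinct c_id values each (s_id, c_col1) pair occurs under.
    n_seen = frame.groupby(["s_id", "c_col1"])["c_id"].transform("nunique")
    # Keep a row only if its element occurs under every c_id of its s_id.
    return frame[n_seen == n_c_ids]

result = keep_common(df)
print(result)
```

The original row index is preserved, so the kept rows can still be traced back to the raw data.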

Related

Matlab: extract values from vector A, based on values in vector B

A = [5 10 16 22 28 32 36 44 49 56]
B = [2 1 1 2 1 2 1 2 2 2]
How to get this?
C1 = [10 16 28 36]
C2 = [5 22 32 44 49 56]
C1 needs to get the values from A, only in the positions in which B is 1
C2 needs to get the values from A, only in the positions in which B is 2
You can do it this way:
C1 = A(B==1);
C2 = A(B==2);
B==1 gives a logical array : [ 0 1 1 0 1 0 1 0 0 0 ].
A(logicalArray) returns elements for which the value of logicalArray is true (it is termed logical indexing).
A and logicalArray must of course have the same size.
It is probably the fastest way of doing this operation in matlab.
For more information on indexing, see matlab documentation.
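For readers working in Python, NumPy supports the same idea under the name boolean mask indexing. A minimal sketch with the vectors from the question:

```python
import numpy as np

A = np.array([5, 10, 16, 22, 28, 32, 36, 44, 49, 56])
B = np.array([2, 1, 1, 2, 1, 2, 1, 2, 2, 2])

# B == 1 is a boolean array; indexing A with it keeps the True positions.
C1 = A[B == 1]
C2 = A[B == 2]

print(C1)
print(C2)
```

As in MATLAB, the mask and the indexed array must have the same length.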
To achieve this with an arbitrary number of groups (not just two as in your example), use accumarray with an anonymous function to collect the values in each group into a cell. To preserve order, B needs to be sorted first (and the same order needs to be applied to A):
[B_sort, ind_sort] = sort(B);
C = accumarray(B_sort.', A(ind_sort).', [], @(x){x.'});
This gives the result in a cell array:
>> C{1}
ans =
10 16 28 36
>> C{2}
ans =
5 22 32 44 49 56
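The sort-then-collect pattern carries over to NumPy as well. The sketch below is an assumed analogue rather than a translation of accumarray: it uses a stable sort so that values keep their original order within each group, then splits at the points where the label changes.

```python
import numpy as np

A = np.array([5, 10, 16, 22, 28, 32, 36, 44, 49, 56])
B = np.array([2, 1, 1, 2, 1, 2, 1, 2, 2, 2])

# Stable sort keeps the original order of A within each label.
order = np.argsort(B, kind="stable")
labels, first = np.unique(B[order], return_index=True)

# Split the reordered values where a new label starts.
pieces = np.split(A[order], first[1:])
C = dict(zip(labels.tolist(), pieces))

print(C[1])
print(C[2])
```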

Find median position points of duration events

I have the following vector A:
A = [34 35 36 5 6 7 78 79 7 9 10 80 81 82 84 85 86 102 3 4 6 103 104 105 106 8 11 107 201 12 202 203 204];
For n = 2, I counted the elements larger or equal to 15 within A:
D = cellfun(@numel, regexp(char((A>=15)+'0'), [repmat('0',1,n) '+'], 'split'));
The above expression gives the following output as duration values:
D = [3 2 7 4 6] = [A(1:3) **stop** A(7:8) **stop** A(12:18) **stop** A(22:25) **stop** A(28:33)];
The above algorithm computes the duration values by counting the elements larger or equal to 15. The counting also allows less than 2 consecutive elements smaller than 15 (n = 2). The counter stops when there are 2 or more consecutive elements smaller than 15 and starts over at the next substring within A.
Eventually, I want a way to find the median position points of the duration events A(1:3), A(7:8), A(12:18), A(22:25) and A(28:33), which are correctly computed. The result should look like this:
a1 = round(median(A(1:3))) = 2;
a2 = round(median(A(7:8))) = 8;
a3 = round(median(A(12:18))) = 15;
a4 = round(median(A(22:25))) = 24;
a5 = round(median(A(28:33))) = 31;
I edited the question to make it clearer. The solution that was provided here assigns the last number within the run of 2 or more consecutive numbers smaller than 15 (3 in this case) after A(1:3) to the next substring A(7:8), and does the same with the other substrings. It therefore generates wrong duration values and, in consequence, wrong median position points of the duration events when n = 2, or for any even n.
Does anyone have any idea how to achieve this?
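The rule stated in the question can be sketched as a single pass that tracks where an event starts, the last position at or above the threshold, and the length of the current gap. This is written in Python rather than MATLAB, under the assumption that an event runs from its first to its last qualifying element; duration_events is a hypothetical name, and positions are 1-based to match the question.

```python
import math

def duration_events(values, threshold=15, n=2):
    """Return (start, end) positions, 1-based and inclusive, of each event."""
    events = []
    start = last_hit = None
    gap = 0
    for pos, v in enumerate(values, start=1):
        if v >= threshold:
            if start is None:
                start = pos
            last_hit = pos
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= n:  # n consecutive small values close the event
                events.append((start, last_hit))
                start = None
    if start is not None:
        events.append((start, last_hit))
    return events

A = [34, 35, 36, 5, 6, 7, 78, 79, 7, 9, 10, 80, 81, 82, 84, 85, 86, 102,
     3, 4, 6, 103, 104, 105, 106, 8, 11, 107, 201, 12, 202, 203, 204]

events = duration_events(A)
durations = [end - start + 1 for start, end in events]
# Median position of start..end, with halves rounded up as MATLAB's round does.
medians = [math.floor((start + end) / 2 + 0.5) for start, end in events]

print(durations)  # [3, 2, 7, 4, 6]
print(medians)    # [2, 8, 15, 24, 31]
```

Python's built-in round is avoided here because it rounds halves to the nearest even number, which would give 30 instead of 31 for the last event.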

check if ALL elements of a vector are in another vector

I need to loop through column 1 of a matrix and return (i) when I have come across ALL of the elements of another vector, which I can predefine.
check_vector = [1:43] %% I don't actually need to predefine this - I know I am looking for the numbers 1 to 43.
Column 1 of matrix_a (which is the only column I am interested in) looks like this, for example:
1
4
3
5
6
7
8
9
10
11
12
13
14
16
15
18
17
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
1
3
4
2
6
7
8
We want to loop through matrix_a and return the value of (i) when we have hit all of the numbers in the range 1 to 43.
In the above example we are looking for all the numbers from 1 to 43, and the iteration will end round about position 47 in matrix_a, because it is at this point that we hit the number '2', which is the last number needed to complete the sequence 1 to 43.
It doesn't matter if we hit a number several times on the way; we count all of those. We just want to know when we have reached all the numbers from the check vector, or in this example the sequence 1 to 43.
I've tried something like:
completed = []
for i = 1:43
complete(i) = find(matrix_a(:,1) == i,1,'first')
end
but it is not working.
Assuming A as the input column vector, two approaches could be suggested here.
Approach #1
With arrayfun -
check_vector = [1:43]
idx = find(arrayfun(@(n) all(ismember(check_vector,A(1:n))),1:numel(A)),1)+1
gives -
idx =
47
Approach #2
With customary bsxfun -
check_vector = [1:43]
idx = find(all(cumsum(bsxfun(@eq,A(:),check_vector),1)~=0,2),1)+1
To find the first entry at which all unique values of matrix_a have already appeared (that is, if check_vector consists of all unique values of matrix_a): the unique function almost gives the answer:
[~, ind] = unique(matrix_a, 'first');
result = max(ind);
Someone might have a more compact answer, but is this what you're after?
maxIndex = 0;
for ii=1:length(a)
[f,index] = ismember(ii,a);
maxIndex=max(maxIndex,max(index));
end
maxIndex
Here is one solution without a loop and without any conditions on the vectors to be compared. Given two vectors a and b, this code will find the smallest index idx where a(1:idx) contains all elements of b. idx will be 0 when b is not contained in a.
a = [ 1 4 3 5 6 7 8 9 10 11 12 13 14 16 15 18 17 19 20 21 22 23 24 25 26 ...
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 1 3 4 2 6 7 8 50];
b = 1:43;
[~, Loca] = ismember(b,a);
idx = max(Loca) * all(Loca);
Some details:
ismember(b,a) checks if all elements of b can be found in a and the output Loca lists the indices of these elements within a. The index will be 0, if the element cannot be found in a.
idx = max(Loca) then is the highest index in this list of indices, so the smallest one where all elements of b are found within a(1:idx).
all(Loca) finally checks if all indices in Loca are nonzero, i.e. if all elements of b have been found in a.
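The same idea, first occurrence of each required value followed by a maximum, can be sketched in Python for comparison. The function name first_complete_position is hypothetical; positions are 1-based and 0 means that some element of b never appears, mirroring the conventions above.

```python
def first_complete_position(a, b):
    """Smallest 1-based idx such that a[:idx] contains every element of b."""
    first_seen = {}
    for pos, value in enumerate(a, start=1):
        first_seen.setdefault(value, pos)  # keep only the first occurrence
    try:
        return max((first_seen[value] for value in b), default=0)
    except KeyError:
        return 0  # some element of b is missing from a

a = [1, 4, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 15, 18, 17, 19, 20,
     21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
     38, 39, 40, 41, 42, 43, 1, 3, 4, 2, 6, 7, 8, 50]
b = range(1, 44)

idx = first_complete_position(a, b)
print(idx)  # 46
```

With this data the value 2 first appears at position 46, which is the last of the required values to show up.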

Passing as a variable the individual columns of individual matrices in a list of matrices

I want to pass columns in various matrices to a for loop.
If my two matrices had the same number of columns, I might do something like this:
mat1 = matrix(rep(1:25), 5,5)
mat2 = matrix(rep(26:50), 5,5)
array.mat = array(c(mat1,mat2), dim=c(5,5,2))
mat1.ncol = ncol(mat1)
mat2.ncol = ncol(mat2)
mat.ncol = c(mat1.ncol, mat2.ncol)
mat.ncol
array.mat
for (dimi in 1:2){
dim.col = mat.ncol[dimi]
for (coli in 1:dim.col){
st = shapiro.test(array.mat[,coli,dimi])$p.value
if(st > .001){
array.mat[,coli,dimi] = log(array.mat[,coli,dimi])
}}}
But, my data don't have the same number of columns, so I'd like to use a list of matrices instead.
mat1 = matrix(rep(1:10), 5,2)
mat2 = matrix(rep(26:50), 5,5)
list.mat=list(a=mat1, b=mat2)
list.mat
But I can't figure out how I'd pass the columns of the matrices?
list.mat$a[1:5]
gives the first column of the first matrix, but how would you pass $a and [startindex:endindex] in a loop? All the other answers I see tend to pass the ith element (e.g., column) of both matrices. I need to keep the two matrices (a and b) separate for later computations, but I want them together (the list of the two matrices) for these types of loops.
Once again, I'm probably just thinking about this incorrectly. Thanks for any thoughts.
Can you use numeric indices? e.g., matrix 1, column 1: list.mat[[1]][,1]
In a loop:
for (m in 1:2) {
for (i in 1:ncol(list.mat[[m]])) {
cat('Here is matrix', m, ', columnn', i, '\n')
print(list.mat[[m]][,i])
}
}
result:
Here is matrix 1 , columnn 1
[1] 1 2 3 4 5
Here is matrix 1 , columnn 2
[1] 6 7 8 9 10
Here is matrix 2 , columnn 1
[1] 26 27 28 29 30
Here is matrix 2 , columnn 2
[1] 31 32 33 34 35
Here is matrix 2 , columnn 3
[1] 36 37 38 39 40
Here is matrix 2 , columnn 4
[1] 41 42 43 44 45
Here is matrix 2 , columnn 5
[1] 46 47 48 49 50

R - how to shift a multidimensional array

I am working with a multi-dimensional array:
> dim(Sales)
[1] 35 71 5
which I use to perform operations like comparing sales year over year:
Sales_Increase_Y2_to_Y1 = Sales[,,2]-Sales[,,1]
Now I would like to be able to shift one dimension to calculate Sales increase across all years in one line:
Sales-Sales[,,how to call previous year here?]
Example to build sample multi-dim array:
x = structure(list(Store = c(35L, 35L, 35L, 35L, 35L), Dept = c(71L,
71L, 71L, 71L, 71L), Year = c(1, 2, 3, 4, 5), Sales = c(10908.04,
12279.99, 11061.82, 12288.1, 9950.55)), .Names = c("Store", "Dept",
"Year", "Sales"), row.names = c(NA, -5L), class = "data.frame")
> x
Store Dept Year Sales
1 35 71 1 10908.04
2 35 71 2 12279.99
3 35 71 3 11061.82
4 35 71 4 12288.10
5 35 71 5 9950.55
Sales <- array(NA, c(max(x$Store), max(x$Dept), max(x$Year)))
for (i in 1:nrow(x))
Sales[x[i,"Store"], x[i,"Dept"], x[i,"Year"]] <- x[i, "Sales"]
Sales[35,71,1]
Bonus tip
When assigning or extracting parts of an array (or matrix), you can either use a number of vectors like you do in your example, or a matrix of array coordinates
Sales[as.matrix(x[1:3])] <- x$Sales
The actual problem
You can then calculate the difference between the years with apply. Since we want to work over dimension 3 (the years), but keep the other dimensions 1 and 2 intact we set MARGIN=1:2 (the second argument)
Sales.diff <- apply(Sales, 1:2, diff)
However, notice that the dimensions have been shifted now, putting the differences first
> dim(Sales.diff)
[1] 4 35 71
but you can get the order back with aperm
> Sales.diff <- aperm(Sales.diff, c(2,3,1))
> dim(Sales.diff)
[1] 35 71 4
Alternative solution
This will keep the order of the dimensions too.
Sales[,,-1] - Sales[,,-dim(Sales)[3]]
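The slicing trick in the alternative solution is not specific to R. As an illustration in NumPy, with a small made-up array standing in for Sales, subtracting two shifted views along the last axis gives the same result as np.diff and keeps the axis order unchanged.

```python
import numpy as np

# Made-up stand-in for Sales: 2 stores, 3 departments, 5 years.
sales = np.arange(2 * 3 * 5, dtype=float).reshape(2, 3, 5) ** 2

# Drop the first year from one view and the last year from the other.
by_slicing = sales[:, :, 1:] - sales[:, :, :-1]

# np.diff along the year axis computes the same differences.
by_diff = np.diff(sales, axis=2)

print(by_slicing.shape)  # (2, 3, 4)
```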
