How do I subset a data set with a specific condition?

I have a dataset containing children's ages. I want to subset the rows whose age falls within a certain bracket.
Basically, if I have this dataset:
data =
names  age  marks  attendance
A      5    1      90
B      12   9      87
C      16   7      98
D      3    0      70
E      7    4      77
I want this:
df =
names  age(2-10)  marks  attendance
A      5          1      90
D      3          0      70
E      7          4      77
And similarly for the age bracket 11-16
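The question does not say which tool is being used; assuming the data sits in a pandas DataFrame, a minimal sketch of the bracket-based subset (reusing the column names from the example above) might look like this:
import pandas as pd

# Reconstruction of the example data above
data = pd.DataFrame({
    'names': ['A', 'B', 'C', 'D', 'E'],
    'age': [5, 12, 16, 3, 7],
    'marks': [1, 9, 7, 0, 4],
    'attendance': [90, 87, 98, 70, 77],
})

# Keep only the rows whose age falls in the 2-10 bracket (inclusive)
df_2_10 = data[data['age'].between(2, 10)]

# Same idea for the 11-16 bracket
df_11_16 = data[data['age'].between(11, 16)]

print(df_2_10)
For many brackets at once, pandas.cut can assign each row to a bin instead.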

Related

What is the correct name for an operation that turns a 3-column long table into a compact "2D" table with a variable number of columns?

For example, from this table:
row  col  val
0    A    32
0    B    31
0    C    35
1    A    30
1    B    29
1    C    29
2    A    15
2    B    14
2    D    18
3    A    34
3    B    39
3    C    34
3    D    35
it should produce this table:
      A    B    C    D
0    32   31   35
1    30   29   29
2    15   14        18
3    34   39   34   35
Is there some official, canonical (or at least popular, specific, unambiguous) term for such an operation (or its reverse)?
I am trying to find (or implement and publish) a tool that transforms CSV files this way, but I am unsure what to search for (or how to name it).
The term is pivot.
Some databases have native support for pivoting, e.g. SQL Server's PIVOT (and even UNPIVOT) keywords.
For most databases you must craft a query that does the job.
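Outside of SQL, most dataframe tools expose the same operation under the same name. As an illustration (my own sketch, not part of the original answer), this is how pandas pivots the long table from the question into the wide form, with missing combinations becoming NaN:
import pandas as pd

# The long "row, col, val" table from the question
long_table = pd.DataFrame({
    'row': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'col': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'D', 'A', 'B', 'C', 'D'],
    'val': [32, 31, 35, 30, 29, 29, 15, 14, 18, 34, 39, 34, 35],
})

# Unique values of 'col' become columns; absent (row, col) pairs become NaN
wide_table = long_table.pivot(index='row', columns='col', values='val')
print(wide_table)
The reverse operation is usually called unpivot or melt (pandas.melt, SQL Server's UNPIVOT).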

Is there any way to automatically paste text vertically into Google Sheets?

I have about 10 years' worth of data from a local animal shelter, and I need to copy all kittens under 3 months old over from Excel spreadsheets onto a Google Sheet: their ID, age in days, month of intake, intake year, sex, and intake type. I am doing this for a research project, so it has to be just the data points and nothing else. A sample of the aggregate is linked below.
I look through the Excel spreadsheets and type the relevant data into Notepad. All the data points I need are separated by date, with all the kittens taken in on that date listed together; their ages, sexes and ID numbers appear in corresponding order so I can find discrepancies easily and input the data quickly. It usually ends up looking like this:
(3/14) 74 M A25164093
(3/18) 34 27 34 M F M A25192748 A25192760 A25192795
(3/19) 70 F A25202463
(3/20) 80 2 1 1 F F F F A25208017 A25210624 A25210625 A25210626
My main bottleneck is putting it into the data aggregate. I have to go cell by cell entering this data into the columns, and I feel like there has to be a way to copy the data I need vertically into the columns it belongs in.
I've tried the TRANSPOSE function and it seems to match exactly what I need, but it requires the data being transposed to already be somewhere on the Google Sheet, and if I delete the cell containing the formula so that only the transposed data remains, all the transposed data disappears as well. It seems to only display the data. This is not ideal; I need something that simply copies and pastes the data into the columns it belongs in so that it is usable.
This is an old sample of the data aggregate (output):
https://docs.google.com/spreadsheets/d/1pUaU4ChFHwL9suqAGRUegBQ72F1-rusp/edit?usp=sharing&ouid=115130932279005924436&rtpof=true&sd=true
And this is a sample of the input:
(5/2) 10 10 10 M F F A27729208 A27729210 A27729212 (5/3) 26 30 26 30 9 9 9 9 2 9 9 M M F F M M M F F F F A27732515 A27732516 A27732518 A27732520 A27732521 A27732541 A27732542 A27732543 A27732545 A27732546 A27732547 A27732549 (5/4) 29 54 54 40 61 M M F M A27735990 A27735996 A27736005 A27740796 (5/4) 29 54 54 40 61 M M F M A27735990 A27735996 A27736005 A27740796 (5/6) 26 26 16 26 14 33 21 21 F F F F M M U U A27760967 A27760970 A27760973 A27760977 A27761539 A27761720 A27762164 A27762169 (5/7) 21 21 F M A27767085 A27767091 (5/8) 47 51 52 51 F F M M A27779894 A27782370 A27782371 A27782373 (5/9) 19 60 60 36 67 67 M F M F F F A27787353 A27788037 A27788043 A27788687 A27788911 A27788923 (5/11) 47 47 47 28 F M M M A27798156 A27798165 A27798168 A27801540 (5/12) 20 10 27 27 52 F M M M F A27810744 A27811484 A27812447 A27812449 A27818279 (5/14) 21 21 21 U U U A27829238 A27829239 A27829241
I don't know exactly how your input looks, but if you need to paste data transposed, Sheets has this built in: copy the range, right-click the destination cell, and choose Paste special.
Select "Transposed" and the copied row will be pasted as a column.

Split Pandas Dataframe into separate pieces based on column values

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np
columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)
df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52
df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3
df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
1. Within a given s_id, produce separate dataframes for each c_id value, e.g. for s_id = 144 there will be 3 dataframes, while for s_id = 105 there will be 2 dataframes.
2. Inner Join the separate dataframes produced in step 1 on the elements column (c_col1). This is a little difficult to explain, so here is the dataframe I would like to get from this step:
index s_id c_id c_col1
0 144 1 O
1 144 2 O
2 144 3 O
3 105 1 O
4 105 2 F
5 52 1 N
6 52 2 N
7 52 3 N
As you can see, what I am looking for in part 2 is the following: within each s_id, I am looking for those c_col1 values that occur for all of the c_id values. E.g. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2 and 3. I have pointed to these entries with "<--" in the raw data. So I would like the dataframe to show O 3 times in the c_col1 column, with the corresponding c_id entries being 1, 2, 3.
Conditions: the number of unique c_id values is not known ahead of time, i.e. for one particular s_id I do not know whether there will be 1, 2 and 3 or just 1 and 2. This means that if 1, 2 and 3 occur there will be two Inner Joins; if only 1 and 2 occur there will be just one Inner Join.
How can this be done with Pandas?
Producing the separate dataframes is easy enough. How would you want to store them? One way would be a nested dict where the outer keys are the s_id values, the inner keys are the c_id values, and the inner values are the data. You can do that with a fairly long but straightforward dict comprehension:
DF_dict = {s_id: {c_id: df[(df.s_id == s_id) & (df.c_id == c_id)]
                  for c_id in df[df.s_id == s_id]['c_id'].unique()}
           for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. You want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question.
##################EDIT
It seems to me that you should just go straight to problem 2, if problem 1 is simply a step you believe to be necessary to reach a solution to problem 2. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform each group according to your requirements. To sum up your requirement as I see it, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear for every value of c_id in that group.
You might write a function like this:
def c_id_overlap(df):
    common_vals = []  # container for values of c_col1 that are in every c_id subgroup
    c_ids = df.c_id.unique()  # get unique values of c_id
    c_col1_values = set(df.c_col1)  # get a set of c_col1 values
    # create a nested list of values; each inner list contains the c_col1 values for one c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    # iterate through c_col1_values and see if each value is in every nested list
    for val in c_col1_values:
        if all([True if val in elem else False for elem in nested_c_col_vals]):
            common_vals.append(val)
    # return a slice of the dataframe that only contains values of c_col1
    # that appear in every c_id
    return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index = False).apply(c_id_overlap)
which gives me the following output:
s_id c_id c_col1
0 31 52 1 N
34 52 2 N
40 52 3 N
1 16 105 1 O
17 105 1 F
18 105 1 Zn
20 105 2 O
21 105 2 F
23 105 2 Zn
2 3 144 1 O
6 144 2 O
11 144 3 O
Which seems to be what you are looking for.
###########EDIT: Additional Explanation:
So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.
Think about the group where s_id == 105. The first line of the function creates an empty list common_vals, which will contain those periodic elements that appear in every subgroup of the data (i.e. under each of the values of c_id).
The second line gets the unique values of c_id, in this case [1, 2], and stores them in an array called c_ids.
The third line creates a set of the values of c_col1 which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure nested_c_col_vals where every inner list is a list of the unique values associated with each of the elements in the c_ids array. In this case this looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
Now each of the elements in c_col1_values is iterated over, and for each of those elements the program determines whether that element appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is truthy, i.e. True or non-zero. So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there; if it is, yield True, and if it is not, yield False. So in this case this resolves to:
all([True, False])
which is of course False. Now when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete the values inside common_vals are:
['O', 'F', 'Zn']
The return statement simply slices the data chunk according to whether the values of c_col1 are in the list common_vals, as per above.
This is then repeated for each of the remaining groups and the data are glued back together.
Hope this helps
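As a side note that is not part of the original answer: the same rule can also be written with groupby plus transform('nunique'), which skips the nested lists entirely. A minimal sketch (the function name is mine), run on the df built in the question:
def common_across_c_ids(group):
    # number of distinct c_id values within this s_id
    n_c_ids = group['c_id'].nunique()
    # for each row, count how many distinct c_ids its c_col1 value appears under
    per_value = group.groupby('c_col1')['c_id'].transform('nunique')
    # keep only the rows whose c_col1 value appears under every c_id
    return group[per_value == n_c_ids]

result = df.groupby('s_id', group_keys=False).apply(common_across_c_ids)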

Assigning a single value to all cells within a specified time period, matrix format

I have the following example dataset, which consists of the number of fish caught per check of a net. The nets are not checked at uniform intervals. The day of the check is denoted in Julian days, along with the number of days the net had been fishing since it was last checked (or since its deployment, in the case of the first check).
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions, but I have had much difficulty populating the fields that fall within a time period but do not necessarily have a row of data of their own.
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C
Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
# d is assumed to hold the raw data shown in the question
with(d, {
  ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
  cov <- coverage(split(ir, Site_Number),
                  weight=split(Fish_Caught, Site_Number),
                  width=max(end(ir)))
  do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
  ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
  site <- factor(Site_Number, unique(Site_Number))
  m <- matrix(0, length(levels(site)), max(end(ir)))
  ind <- cbind(rep(site, width(ir)), as.integer(ir))
  m[ind] <- rep(Fish_Caught, width(ir))
  m
})
I don't see a super obvious matrix transformation here. This is all I've got, assuming the raw data is in a data.frame called dd:
dd$Site_Number <- factor(dd$Site_Number)
mm <- matrix(0, nrow=nlevels(dd$Site_Number), ncol=18)
for (i in 1:nrow(dd)) {
  mm[as.numeric(dd[i, 1]), (dd[i, 2] - dd[i, 3]):dd[i, 2]] <- dd[i, 4]
}
mm

Combine rows and columns in MATLAB for a series

Hi everyone,
I have a problem when iterating over two variables: I want the result combined into just one vector (array).
At first I write my input for the iteration, w(0), as w:
w=[1 50];
For the 1 in column 1, I use the array
e=0:1:(n-1);
f=0:2:(2*n-2); % for the 50 in column 2
I tried to use this code:
w=[1 50];
ww=kron(ones((n),1),w)
e=0:1:(n-1);
f=0:2:(2*n-2);
r=[e',f']
x=ww+r
and the output is
ww =
1 50
1 50
1 50
1 50
1 50
1 50
r =
0 0
1 2
2 4
3 6
4 8
5 10
x =
1 50
2 52
3 54
4 56
5 58
6 60
I want x to be output as a single column vector, for example:
x =
1
50
2
52
3
54
4
56
5
58
6
60
where each element of w=[1 50] can use a different increment in the iteration.
Apply this to your x matrix:
x = reshape(x.',[],1);
See reshape doc for details.
Here is a simple method to create your vector from scratch:
x = [1:6;50:2:60];
x(:)
Or with your variables:
x = [e; f];
x(:)
