How to sample a DataFrame? - sampling

The goal is to subsample a data frame.
Code:
# 1 convert the date column to datetime
dg.Yr_Mo_Dy = pd.to_datetime(dg.Yr_Mo_Dy, format='%Y%m%d')
# 2 set the date as the index
dg = dg.set_index(dg.Yr_Mo_Dy, drop=True)
# 3 resample (here by year) and average
dg.resample('1AS').mean().mean()
That gives:
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
The code groups the values and replaces each group with the average of its intermediate values.
Similarly, it is also possible to sum these values by replacing mean() with sum().
However, what I want to do is not an average but a sampling. That is, keep only one value per group, without averaging or summing the intermediate values.
For example, the data 1,2,3,4,5,6,... downsampled by a factor of 2 gives 2,4,6,... and not 1.5,3.5,5.5,...
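A minimal sketch of that kind of decimation in pandas, assuming dg is date-indexed as in the snippet above:
# Keep one value per group instead of aggregating (illustrative only)
dg.iloc[::10]               # plain decimation: every 10th row
dg.resample('1AS').first()  # first observation of each year
dg.resample('1AS').last()   # or the last observation of each year
dg.asfreq('AS')             # the value exactly at each year start (NaN if absent)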

Related

How can I identify the three highest values in a column by ID, take their squares, and then add them in SAS?

I am working on injury severity scores (ISS) and my dataset has these four columns: ID, High_AIS, Dxcode (diagnosis code), and ISS_bodyregion. Each ID/case has several values for "dxcode" with their respective High_AIS and ISS_bodyregion, which means each ID/case has multiple injuries in different body regions. The rule for calculating ISS specifies that we have to select the AIS values of three different ISS body regions.
For some IDs, we have only one value (of course, when a person has only a single injury and one associated dxcode and AIS). My goal is to calculate ISS (which ranges from 0-75), and in order to do this, I want to tell SAS the following things:
Select the three largest AIS values by ID (when an ID has more than 3 values for AIS), take their squares, and add them to get ISS.
If an ID has only one injury and that has AIS = 6, the ISS will automatically equal 75 (regardless of the injuries elsewhere).
If an ID has fewer than 3 AIS values (for example, the 5th ID has only two AIS values: 0 and 1), then consider only those two, square them, and add them, as we do not have a third injured ISS body region for this ID.
If an ID has exactly 3 AIS values (for example, 1,0,0), then consider all three, square them, and add them, even if the result is ISS=1.
If an ID's injuries all have AIS values equal to 0 (for example: 0,0), then ISS will equal 0.
If an ID has multiple injuries with AIS values 2,2,1,1,1 and ISS_bodyregion = 5,5,6,6,6, then ISS_bodyregion repeats itself. The instructions say to select the highest AIS value from each ISS body region only once, because the values have to come from DIFFERENT ISS body regions. So in such a situation, I want to tell SAS that if ISS_bodyregion repeats itself, only select the entry with the highest AIS value and leave the rest.
I cannot seem to put all of these considerations into a single piece of code. I have already sorted my data by ID and descending High_AIS. Thank you so much in advance.
So if you are trying to implement this algorithm (https://aci.health.nsw.gov.au/networks/institute-of-trauma-and-injury-management/data/injury-scoring/injury_severity_score), then you need data like this:
data have;
input id region :$20. ais ;
cards;
1 HEAD/NECK 4
1 HEAD/NECK 3
1 FACE 1
1 CHEST 2
1 ABDOMEN 2
1 EXTREMITIES 3
1 EXTERNAL 1
2 ABDOMEN 3
3 FACE 1
3 CHEST 2
4 HEAD/NECK 6
;
So first find the max per id per region, for example by using PROC SUMMARY:
proc summary data=have nway;
  class id region;
  var ais;
  output out=bodysys max=ais;
run;
Now order by ID and AIS:
proc sort data=bodysys;
  by id ais;
run;
Now you can process by ID and accumulate the AIS scores into an array. You can use the MOD() function to cycle through the array so that the last three observations per ID are the values left in the array (this skips the need to first subset to three observations per ID).
data want;
  /* DOW loop: read one entire ID group per data step iteration */
  do count=0 by 1 until(last.id);
    set bodysys;
    by id;
    array x[3] ais1-ais3;
    /* cycle through the array; the last three values per ID remain */
    x[1+mod(count,3)] = ais;
  end;
  iss = 0;
  /* AIS is sorted ascending, so AIS here is the maximum for this ID */
  if ais > 5 then iss = 75;
  else do count=1 to 3;
    iss + x[count]**2; /* sum of squares; missing entries add nothing */
  end;
  keep id ais1-ais3 iss;
run;
Result:
Obs  id  ais1  ais2  ais3  iss
  1   1     2     3     4   29
  2   2     3     .     .    9
  3   3     1     2     .    5
  4   4     6     .     .   75
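For comparison, here is the same scoring logic sketched in Python/pandas; the data frame and the iss helper are illustrative stand-ins, not part of the original SAS answer:
import pandas as pd

have = pd.DataFrame({
    "id":     [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 4],
    "region": ["HEAD/NECK", "HEAD/NECK", "FACE", "CHEST", "ABDOMEN",
               "EXTREMITIES", "EXTERNAL", "ABDOMEN", "FACE", "CHEST",
               "HEAD/NECK"],
    "ais":    [4, 3, 1, 2, 2, 3, 1, 3, 1, 2, 6],
})

def iss(group):
    # Max AIS per body region, so each region counts only once.
    per_region = group.groupby("region")["ais"].max()
    if per_region.max() >= 6:      # any AIS of 6 forces ISS = 75
        return 75
    top3 = per_region.nlargest(3)  # three highest regions (or fewer)
    return int((top3 ** 2).sum())

print(have.groupby("id").apply(iss))  # 29, 9, 5, 75, as in the SAS result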

How to extract multiple 5-minute averages from a data frame based on specified start time?

I have second-by-second data for channels A, B, and C as shown below (this just shows the first 6 rows):
date A B C
1 2020-03-06 09:55:42 224.3763 222.3763 226.3763
2 2020-03-06 09:55:43 224.2221 222.2221 226.2221
3 2020-03-06 09:55:44 224.2239 222.2239 226.2239
4 2020-03-06 09:55:45 224.2044 222.2044 226.2044
5 2020-03-06 09:55:46 224.2397 222.2397 226.2397
6 2020-03-06 09:55:47 224.3690 222.3690 226.3690
I would like to be able to extract multiple 5-minute averages for columns A, B, and C based on time. Is there a way to do this where I would only need to type in the start time, rather than having to type the start AND end times for each period I want to extract? Essentially, I want to be able to type the start time and have my code calculate and extract the average for the following 5 minutes.
I was previously using the 'timeAverage' function from the 'openair' package to obtain 1-minute averages for the entire data set. I then created a vector with the start times and used the 'subset' function to extract the 1-minute averages I was interested in.
library(openair)
df.avg <- timeAverage(df, avg.time = "min", statistic = "mean")
cond.1.time <- c(
  '2020-03-06 10:09:00',
  '2020-03-06 10:13:00',
  '2020-03-06 10:18:00'
) # enter start times
library(dplyr)
df.cond.1.avg <- subset(df.avg,
                        date %in% cond.1.time) # filter data based on the vector
df.cond.1.avg <- as.data.frame(df.cond.1.avg) # tibble to df
However, this approach will not work for 5-minute averages, since not all of the time frames I am interested in start at 5-minute offsets from each other. Also, my previous approach restricted me to 1-minute averages that start at the top of the minute.
I need to be able to extract 5-minute averages scattered randomly throughout the day. These are not rolling averages. I will need to extract approximately thirty 5-minute averages per day, so being able to type in only the start time would be key.
Thank you!
Using the dplyr and tidyr libraries, you can select the interval to be averaged by filtering on the dates and then take the average.
It may not be the most efficient approach, but it can help you.
library(dplyr)
library(tidyr)
data <- data.frame(date = seq(as.POSIXct("2020-02-01 01:01:01"),
                              as.POSIXct("2020-02-01 20:01:10"),
                              by = "sec"),
                   A = rnorm(68410),
                   B = rnorm(68410),
                   C = rnorm(68410))
meanMinutes <- function(data, start, interval){
  # interval is in minutes
  start <- as.POSIXct(start)
  end <- start + 60*interval
  filterData <- dplyr::filter(data, date <= end, date >= start)
  date_start <- filterData$date[1]
  meanData <- filterData %>%
    tidyr::gather(key = "param", value = "value", A:C) %>%
    dplyr::group_by(param) %>%
    dplyr::summarise(value = mean(value, na.rm = T)) %>%
    tidyr::spread(key = "param", value = "value")
  return(cbind(date_start, meanData))
}
For one date:
meanMinutes(data, "2020-02-01 07:03:11", 5)
Result:
date_start A B C
1 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.1304691
For multiple dates:
dates <- c("2020-02-01 02:53:41", "2020-02-01 05:23:14",
"2020-02-01 07:03:11", "2020-02-01 19:10:45")
do.call(rbind, lapply(dates, function(x) meanMinutes(data, x, 5)))
Result:
date_start A B C
1 2020-02-01 02:53:41 -0.001929374 -0.03807152 0.06072332
2 2020-02-01 05:23:14 0.009494321 -0.05911055 -0.02698245
3 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.13046909
4 2020-02-01 19:10:45 -0.123574816 -0.02373881 0.05997007
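A rough pandas equivalent of the meanMinutes idea, in case it is useful; the data frame and the mean_minutes helper are illustrative stand-ins, not part of the original R answer:
import numpy as np
import pandas as pd

# Hypothetical one-second data shaped like the example above.
idx = pd.date_range("2020-02-01 01:01:01", "2020-02-01 20:01:10", freq="s")
df = pd.DataFrame({"date": idx,
                   "A": np.random.randn(len(idx)),
                   "B": np.random.randn(len(idx)),
                   "C": np.random.randn(len(idx))})

def mean_minutes(df, start, interval=5):
    # Average A, B, C over [start, start + interval minutes].
    start = pd.Timestamp(start)
    end = start + pd.Timedelta(minutes=interval)
    window = df[(df["date"] >= start) & (df["date"] <= end)]
    return window[["A", "B", "C"]].mean().rename(start)

starts = ["2020-02-01 02:53:41", "2020-02-01 07:03:11"]
print(pd.DataFrame([mean_minutes(df, s) for s in starts]))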

Update table with random numbers in kdb+q

When I run the following script:
tbl: update prob: 1?100 from tbl;
I was expecting that I get a new column created with each row having a random number. However, I get back a column containing the same number for all the rows in the table.
How do I resolve this? I need to update my existing table and not create a table from scratch.
When you use 1?100, you are only requesting 1 random value within the range 0-100. If you use 10?100, you will get back a list of 10 random values between 0-100.
So to do this in an update, you want to use something like this:
tbl:([]time:5?.z.p;sym:5?`3;price:5?10f;qty:5?10)
time sym price qty
-----------------------------------------------
2012.02.19D18:34:27.148501760 gkn 8.376952 9
2008.07.29D20:23:13.601434560 odo 7.041609 3
2007.02.07D08:17:59.482332864 pbl 0.955069 9
2001.04.27D03:36:44.475531384 aph 1.127308 2
2010.03.03D03:35:55.253069888 mgi 0.7663449 6
update r:abs count[i]?0h from tbl
time sym price qty r
-----------------------------------------------------
2012.02.19D18:34:27.148501760 gkn 8.376952 9 23885
2008.07.29D20:23:13.601434560 odo 7.041609 3 19312
2007.02.07D08:17:59.482332864 pbl 0.955069 9 10372
2001.04.27D03:36:44.475531384 aph 1.127308 2 25281
2010.03.03D03:35:55.253069888 mgi 0.7663449 6 27503
Note that I am using type short and abs to return positive values.
You need to seed your random number generator, using something like rand(time); otherwise it will use the same seed, and thus give the same sequence of random numbers.
EDIT: Per https://code.kx.com/wiki/Reference/SystemCommands, use \S n, where n is any integer.
EDIT2: Check out https://code.kx.com/wiki/Reference/SystemCommands#.5CS_.5Bn.5D_-_random_seed for how to use random numbers.
Just generate as many random numbers as you have rows using count tbl:
First create your table tbl:
tbl:([]date:reverse .z.d-til 100;price:sums 100?1f)
date price
--------------------
2018.04.26 0.2426471
2018.04.27 0.6163571
2018.04.28 1.179559
..
Then add a column of random numbers between 0 and 100:
update rdn:(count tbl)?100 from tbl
date price rdn
------------------------
2018.04.26 0.2426471 25
2018.04.27 0.6163571 33
2018.04.28 1.179559 13
..
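The same "one draw per row" point carries over to pandas, for what it's worth (an illustrative aside, not from the kdb+ answers):
import numpy as np
import pandas as pd

tbl = pd.DataFrame({"price": np.random.rand(5) * 10})
# One random integer in [0, 100) per row, analogous to (count tbl)?100
tbl["rdn"] = np.random.randint(0, 100, size=len(tbl))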

Split array into chunks based on timestamp in Haskell

I have an array of records (a custom data type) in Haskell which I want to aggregate based on each record's timestamp. In very general terms, each record looks like this:
data Record = Record { event :: String,
                       time :: Double,
                       from :: Int,
                       to :: Int
                     } deriving (Show, Eq)
I used a Double for the timestamp since that is the same format used in the tracefile.
And I parse them from a CSV file into an array of records: [Record]
Now I'm looking to get an approximation of instantaneous events per unit time. So I want to split the array into several smaller arrays based on the timestamp (say, every 1 second) and then fold across each smaller array.
The problem is I can't figure out how to split an array based on the value of a record. Looking on Hoogle I found several functions like splitEvery and splitWhen, but I'm lost. I considered using splitWhen to break up the list when, say, (mod time 0.1) == 0, but even if that worked it would remove the elements it's splitting on (which I don't want to do).
I should note that the records are NOT evenly spaced in time. E.g. the timestamp on sequential records is not going to differ by a fixed amount.
I am more than willing to store the data in a different format if you can suggest one that would make this sort of work easier.
A quick sample of the data I'm parsing (from a ns2 simulation):
r 0.114 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.240 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.914 2 1 tcp 1000 ________ 2 5.0 1.0 0 3
If you have [Record] and you want to group them by a specific condition, you can use Data.List.groupBy. I'm assuming that for your time :: Double, 1 second is the base unit, so time = 1 is 1 second, time = 100 is 100 seconds, etc.; adjust this to whatever scale you're actually using:
import Data.List
import Data.Function (on)
isInSameClockSecond :: Record -> Record -> Bool
isInSameClockSecond = (==) `on` (floor . time :: Record -> Integer)
-- The type signature is given for floor . time to remove any ambiguity
-- due to floor's polymorphic type signature.
groupBySameClockSecond :: [Record] -> [[Record]]
groupBySameClockSecond = groupBy isInSameClockSecond
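An equivalent sketch in Python with itertools.groupby, in case it clarifies the idea (illustrative only). Like Haskell's groupBy, it only merges adjacent elements, so the records must already be sorted by time:
import math
from itertools import groupby

# Minimal stand-in records as (event, time, from, to) tuples.
records = [("r", 0.114, 1, 2), ("r", 0.240, 1, 2), ("r", 0.914, 2, 1),
           ("r", 1.302, 1, 2)]

# Bucket records by the clock second their timestamp falls in.
buckets = [list(g) for _, g in groupby(records, key=lambda r: math.floor(r[1]))]
# -> two buckets: the three records in second 0, and the one in second 1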

Fill secondly data from Q KDB+

I have a CSV file with some high-frequency stock price data, and I'd like to derive per-second price data from the table.
In each file, there are columns named date, time, symbol, price, volume, etc.
There are some seconds with no trading, so data is missing for those seconds.
I'm wondering how I could fill in the missing data in q to get complete per-second data from 9:30 to 16:00. If a price is missing, just use the most recent price as that second's price.
I'm considering writing a loop, but I don't know exactly how to do that.
Simplifying a little, I'll assume you have some random timestamps in your dataset like this:
time price
--------------------------------------
2015.01.20D22:42:34.776607000 7
2015.01.20D22:42:34.886607000 3
2015.01.20D22:42:36.776607000 4
2015.01.20D22:42:37.776607000 8
2015.01.20D22:42:37.886607000 7
2015.01.20D22:42:39.776607000 9
2015.01.20D22:42:40.776607000 4
2015.01.20D22:42:41.776607000 9
so there are some missing seconds there. I'm going to call this table t. So if you do a by-second type of query, obviously the seconds that are missing are still missing:
q)select max price by time.second from t
second | price
--------| -----
22:42:34| 7
22:42:36| 4
22:42:37| 8
22:42:39| 9
22:42:40| 4
22:42:41| 9
To get the missing seconds, you have to join against a list of nulls. In this case we know the data goes from 22:42:34 to 22:42:41, but in reality you'll have to find the min/max time and use that to create a temporary "null" table to join against:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N)
second price
--------------
22:42:34
22:42:35
22:42:36
22:42:37
22:42:38
22:42:39
22:42:40
22:42:41
Then left join:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35
22:42:36 4
22:42:37 8
22:42:38
22:42:39 9
22:42:40 4
22:42:41 9
You can use fills or whatever your favourite filling heuristic is after that.
q)fills `second xasc asc ([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35 7
22:42:36 4
22:42:37 8
22:42:38 8
22:42:39 9
22:42:40 4
22:42:41 9
(Note the sort on second before fills!)
By the way, for larger tables this will be much faster than a loop. Loops in q are generally a bad idea.
EDIT
You could use a comma join too; both tables need to be keyed on the second column:
t1,t
(where t1 is the null-filled table keyed on second; the comma join upserts the rows of the right-hand table into the left)
I haven't tested it, but I suspect it would be slightly faster than the lj version.
Using aj, which is one of the most powerful features of KDB:
q)data
sym time price size
----------------------------
MS 10:24:04 93.35974 8
MS 10:10:47 4.586986 1
APPL 10:50:23 0.7831685 1
GOOG 10:19:52 49.17305 0
The in-memory table needs to be sorted by sym, time, with the g# attribute applied to the sym column:
q)data:update `g#sym from `sym`time xasc data
q)meta data
c    | t f a
-----| -----
sym  | s   g
time | v
price| f
size | j
Create a rack table with one row per second per sym:
q)rack: `sym`time xasc (select distinct sym from data) cross ([] time:{x[0]+til `int$x[1]-x[0]}(min;max)@\:data`time)
Using aj to join the data:
q)aj[`sym`time; rack; data]
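For reference, the analogous fill in pandas (an illustrative sketch, not from the answers above): resample to one-second bars, then forward-fill the gaps.
import pandas as pd

t = pd.DataFrame({"time": pd.to_datetime(["2015-01-20 22:42:34.776607",
                                          "2015-01-20 22:42:36.776607",
                                          "2015-01-20 22:42:37.776607"]),
                  "price": [7, 4, 8]})

filled = (t.set_index("time")["price"]
           .resample("1s").max()  # max price per second; NaN where nothing traded
           .ffill())              # carry the last known price forward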
