Sampling rows with specific criteria - permutation

I am trying to create some sort of a schedule for me and my flatmates, but I am not sure how to do it, hell I am not even sure if it is feasible.
A simplified version of what I am trying to do is: let's assume that we have 5 individuals (John, Mike, Jack, Collin and Anna) and 3 chores (1, 2, 3). I want to spread these chores among the individuals, while a couple of criteria are being fulfilled. For example, each individual must do chore #1 two times per month.
I am really shooting in the dark here, and I am not sure how to approach this problem. So far I have used permutation to find all the possible combinations between the chores, and then I apply the criteria with a couple of loops. Here is the code:
####################
#install.packages("gtools")
library(gtools)
maat=permutations(3, 4, repeats.allowed = T)#Finding all the possible combinations of the chores for 4 weeks and 3 chores.
#The following loop applies two criteria: #
#1) in a 4-week period (1 month) chore #1 must be done two times
#2) in a 4-week period all of the chores must be done at least once
o=integer(0); k=1 #start k at 1: R vectors are 1-indexed, so o[0]=i would silently drop the first match
for (i in 1:nrow(maat)) {
if ((sum(maat[i,] %in% 1) == 2) && all(1:3 %in% maat[i,])) {
o[k]=i
k = k + 1}
}
newmaat<-maat[o,] #the filtered matrix
cleaningRot<-matrix(ncol=4, nrow=5
, data=newmaat[sample(nrow(newmaat), size=5, replace=F),]
, dimnames = list(c("John", "Mike", "Jack", "Collin", "Anna")
, c(paste0("Week",1:4)))) # This is the schedule for each individual
##The following loop applies 2 more criteria:
#1) all three chores must be done per week
#2) each week chores #1 and #2 must be done twice
while(!all(1:3 %in% cleaningRot[,1])| #All three chores must be in the 1st week
!all(1:3 %in% cleaningRot[,2])| #All three chores must be in the 2nd week
!all(1:3 %in% cleaningRot[,3])| #All three chores must be in the 3rd week
!all(1:3 %in% cleaningRot[,4]) #All three chores must be in the 4th week
|
!(sum(cleaningRot[,1] %in% 1) == 2) && !(sum(cleaningRot[,1] %in% 2) == 2)| #Chores #1 and #2 must be done 2 times in week 1
!(sum(cleaningRot[,2] %in% 1) == 2) && !(sum(cleaningRot[,2] %in% 2) == 2)|#Chores #1 and #2 must be done 2 times in week 2
!(sum(cleaningRot[,3] %in% 1) == 2) && !(sum(cleaningRot[,3] %in% 2) == 2)|#Chores #1 and #2 must be done 2 times in week 3
!(sum(cleaningRot[,4] %in% 1) == 2) && !(sum(cleaningRot[,4] %in% 2) == 2)#Chores #1 and #2 must be done 2 times in week 4
){
cleaningRot<-matrix(ncol=4, nrow=5, data=newmaat[sample(nrow(newmaat), size=5, replace=F),],
dimnames = list(c("John", "Mike", "Jack", "Collin", "Anna"), c(paste0("Week", 1:4))))
}
######################################
So, basically, I am sampling from the permutations until the result fulfils the given criteria; a highly random method, but my programming knowledge doesn't allow me to do more.
Another problem with this approach is that if the number of chores/individuals/weeks is bigger, it becomes almost impossible to randomly find a suitable schedule that meets the criteria.
So my question is whether you know a more methodical way to do that, rather than just sampling randomly from a pool of numbers?
Thank you and apologies for my messy code.
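The enumerate-filter-sample approach described in the question can be sketched in Python as follows. This is only an illustration under stated assumptions, not a recommended solver: the names valid_rows and sample_schedule are hypothetical, the sampling is capped so it can report failure, and only the "all three chores every week" rule is checked per week. The stricter weekly rule (chore 1 exactly twice per week) is left out because it cannot hold together with the per-person rule: five people doing chore 1 twice each fills 10 slots, while twice per week over four weeks allows only 8.

```python
import itertools
import random

CHORES = (1, 2, 3)
WEEKS = 4
PEOPLE = ["John", "Mike", "Jack", "Collin", "Anna"]

def valid_rows():
    """All 4-week chore sequences where chore 1 appears exactly twice
    and every chore appears at least once."""
    return [row for row in itertools.product(CHORES, repeat=WEEKS)
            if row.count(1) == 2 and set(row) == set(CHORES)]

def sample_schedule(rows, max_attempts=10_000, rng=random):
    """Rejection-sample one distinct row per person until every week
    contains all chores. Returns {name: row}, or None if the attempt
    cap is reached without success."""
    for _ in range(max_attempts):
        picked = rng.sample(rows, len(PEOPLE))
        weeks = zip(*picked)  # columns of the schedule, one tuple per week
        if all(set(week) == set(CHORES) for week in weeks):
            return dict(zip(PEOPLE, picked))
    return None

rows = valid_rows()
schedule = sample_schedule(rows)
```

Filtering the rows first keeps the pool small (12 rows here), which is what makes blind sampling tolerable at this size; the cap on attempts is there because the method gives no guarantee of success as the problem grows.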

Related

How can I identify three highest values in a column by ID and take their squares and then add them in SAS?

I am working on injury severity scores (ISS) and my dataset has these four columns: ID, High_AIS, Dxcode (diagnosis code), ISS_bodyregion. Each ID/case has several values for "dxcode" and the respective High_AIS and ISS_bodyregion, which means each ID/case has multiple injuries in different body regions. The rule to calculate ISS specifies that we have to select the AIS values of three different ISS body regions.
For some IDs, we have only one value (of course when a person only has single injury and one associated dxcode and AIS). My goal is to calculate ISS (ranges from 0-75) and in order to do this, I want to tell SAS the following things:
Select three largest AIS values by ID (of course when ID has more than 3 values for AIS), take their squares and add them to get ISS.
If ID has only one injury and that has the AIS = 6, the ISS will automatically be equal to 75 (regardless of the injuries elsewhere).
If ID has fewer than 3 AIS values (for example, the 5th ID has only two AIS values: 0 and 1), then consider only those two, square them and add them, as we do not have a third ISS body region for this ID.
If ID has only 3 AIS values (for example, 1,0,0), then consider only those three, square them and add them, even if the result is ISS=1.
If ID has AIS values that are all equal to 0 (for example: 0,0), then ISS will be equal to 0.
If ID has multiple injuries, and AIS values are: 2,2,1,1,1 and ISS_bodyregion = 5,5,6,6,6. Then we see that ISS_bodyregion repeats itself, the instructions suggest that we only select highest AIS value of ISS body region only once, because it has to be from DIFFERENT ISS body regions. So, in such situation, I want to tell SAS that if ISS_bodyregion repeats itself, only select the one with highest AIS value and leave the rest.
I am so confused as I am telling SAS to keep account of all these aforementioned considerations and I cannot seem to put them all in a single code. Thank you so much in advance. I have already sorted my data by ID descending high_AIS.
So if you are trying to implement this algorithm https://aci.health.nsw.gov.au/networks/institute-of-trauma-and-injury-management/data/injury-scoring/injury_severity_score then you need data like this:
data have;
input id region :$20. ais ;
cards;
1 HEAD/NECK 4
1 HEAD/NECK 3
1 FACE 1
1 CHEST 2
1 ABDOMEN 2
1 EXTREMITIES 3
1 EXTERNAL 1
2 ABDOMEN 3
3 FACE 1
3 CHEST 2
4 HEAD/NECK 6
;
So first find the max per id per region, for example by using PROC SUMMARY.
proc summary data=have nway;
class id region;
var ais;
output out=bodysys max=ais;
run;
Now order by ID and AIS.
proc sort data=bodysys ;
by id ais ;
run;
Now you can process by ID and accumulate the AIS scores into an array. You can use MOD() function to cycle through the array so that the last three observations per ID will be the values left in the array (skips the need to first subset to three observations per ID).
data want;
do count=0 by 1 until(last.id);
set bodysys;
by id;
array x[3] ais1-ais3 ;
x[1+mod(count,3)] = ais;
end;
iss=0;
if ais>5 then iss=75;
else do count=1 to 3 ;
iss + x[count]**2;
end;
keep id ais1-ais3 iss ;
run;
Result:
Obs  id  ais1  ais2  ais3  iss
1    1   2     3     4     29
2    2   3     .     .     9
3    3   1     2     .     5
4    4   6     .     .     75
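For readers who want to check the logic outside SAS, the same steps (highest AIS per body region, then the three largest values squared and summed, with an AIS of 6 forcing ISS to 75) can be sketched in Python. This is a minimal sketch on the sample data above; the function name injury_severity_scores is hypothetical.

```python
from collections import defaultdict

def injury_severity_scores(records):
    """records: iterable of (id, region, ais) tuples. Returns {id: iss}."""
    worst = defaultdict(dict)  # id -> region -> highest AIS seen in that region
    for pid, region, ais in records:
        worst[pid][region] = max(ais, worst[pid].get(region, 0))
    scores = {}
    for pid, regions in worst.items():
        # keep at most the three most severe regions for this id
        top3 = sorted(regions.values(), reverse=True)[:3]
        # an AIS of 6 in any region sets ISS to 75
        scores[pid] = 75 if top3[0] >= 6 else sum(a * a for a in top3)
    return scores

records = [
    (1, "HEAD/NECK", 4), (1, "HEAD/NECK", 3), (1, "FACE", 1),
    (1, "CHEST", 2), (1, "ABDOMEN", 2), (1, "EXTREMITIES", 3),
    (1, "EXTERNAL", 1), (2, "ABDOMEN", 3), (3, "FACE", 1),
    (3, "CHEST", 2), (4, "HEAD/NECK", 6),
]
scores = injury_severity_scores(records)
print(scores)  # {1: 29, 2: 9, 3: 5, 4: 75}
```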

How to extract multiple 5-minute averages from a data frame based on specified start time?

I have second-by-second data for channels A, B, and C as shown below (this just shows the first 6 rows):
date A B C
1 2020-03-06 09:55:42 224.3763 222.3763 226.3763
2 2020-03-06 09:55:43 224.2221 222.2221 226.2221
3 2020-03-06 09:55:44 224.2239 222.2239 226.2239
4 2020-03-06 09:55:45 224.2044 222.2044 226.2044
5 2020-03-06 09:55:46 224.2397 222.2397 226.2397
6 2020-03-06 09:55:47 224.3690 222.3690 226.3690
I would like to be able to extract multiple 5-minute averages for columns A, B and C based off time. Is there a way to do this where I would only need to type in the starting time period, rather than having to type the start AND end times for each time period I want to extract? Essentially, I want to be able to type the start time and have my code calculate and extract the average for the successive 5 minutes.
I was previously using the 'timeAverage' function from the 'openair' package to obtain 1-minute averages for the entire data set. I then created a vector with the start times and used the 'subset' function to extract the 1-minute averages I was interested in.
library(openair)
df.avg <- timeAverage(df, avg.time = "min", statistic = "mean")
cond.1.time <- c(
'2020-03-06 10:09:00',
'2020-03-06 10:13:00',
'2020-03-06 10:18:00'
) #enter start times
library(dplyr)
df.cond.1.avg <- subset(df.avg,
date %in% cond.1.time) #filter data based off vector
df.cond.1.avg <- as.data.frame(df.cond.1.avg) #tibble to df
However, this approach will not work for 5-minute averages since not all of the time frames I am interested in begin in 5 minute increments of each other. Also, my previous approach forced me to only use 1 minute averages that start at the top of the minute.
I need to be able to extract 5-minute averages scattered randomly throughout the day. These are not rolling averages. I will need to extract approximately thirty 5-minute averages per day so being able to only type in the start date would be key.
Thank you!
Using the dplyr and tidyr libraries, the interval to be averaged can be selected by filtering on the dates and then averaged.
It may not be the most efficient approach, but it should help you.
library(dplyr)
library(tidyr)
data <- data.frame(date = seq(as.POSIXct("2020-02-01 01:01:01"),
as.POSIXct("2020-02-01 20:01:10"),
by = "sec"),
A = rnorm(68410),
B = rnorm(68410),
C = rnorm(68410))
meanMinutes <- function(data, start, interval){
# Interval in minutes
start <- as.POSIXct(start)
end <- start + 60*interval
filterData <- dplyr::filter(data, date <= end, date >= start)
date_start <- filterData$date[1]
meanData <- filterData %>%
tidyr::gather(key = "param", value = "value", A:C) %>%
dplyr::group_by(param) %>%
dplyr::summarise(value = mean(value, na.rm = T)) %>%
tidyr::spread(key = "param", value = "value")
return(cbind(date_start, meanData))
}
For one date
meanMinutes(data, "2020-02-01 07:03:11", 5)
Result:
date_start A B C
1 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.1304691
For multiple dates:
dates <- c("2020-02-01 02:53:41", "2020-02-01 05:23:14",
"2020-02-01 07:03:11", "2020-02-01 19:10:45")
do.call(rbind, lapply(dates, function(x) meanMinutes(data, x, 5)))
Result:
date_start A B C
1 2020-02-01 02:53:41 -0.001929374 -0.03807152 0.06072332
2 2020-02-01 05:23:14 0.009494321 -0.05911055 -0.02698245
3 2020-02-01 07:03:11 0.004083064 -0.06067075 -0.13046909
4 2020-02-01 19:10:45 -0.123574816 -0.02373881 0.05997007
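The same idea (give only a start time, average everything in the following 5 minutes) can be sketched with the Python standard library. This is an illustration on made-up data, and the name window_mean is hypothetical. One difference from the R function above is an assumption made here: the window is half-open, [start, start + 5 min), so it holds exactly 300 one-second samples rather than 301.

```python
from datetime import datetime, timedelta

def window_mean(rows, start, minutes=5):
    """rows: list of (datetime, a, b, c) tuples; start: 'YYYY-MM-DD HH:MM:SS'.
    Returns (start_datetime, mean_a, mean_b, mean_c) over
    [start, start + minutes), or None if no rows fall in the window."""
    t0 = datetime.strptime(start, "%Y-%m-%d %H:%M:%S")
    t1 = t0 + timedelta(minutes=minutes)
    window = [r[1:] for r in rows if t0 <= r[0] < t1]
    if not window:
        return None
    means = tuple(sum(col) / len(col) for col in zip(*window))
    return (t0,) + means

# hypothetical second-by-second data: one hour, channel A equals the second index
base = datetime(2020, 3, 6, 10, 0, 0)
rows = [(base + timedelta(seconds=i), float(i), float(i) - 2, float(i) + 2)
        for i in range(3600)]

starts = ["2020-03-06 10:09:00", "2020-03-06 10:13:30"]
averages = [window_mean(rows, s) for s in starts]
```

Only the list of start times needs to be typed in; the windows do not have to begin on a minute boundary or be evenly spaced.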

randomly select 5 rows in c

I want to be able to randomly select 5 rows in C
Thanks.
Let's think aloud a bit.
If you just need to select 5 arbitrary numbers that happen to sum up to a number below a given N, you can cheat and select just the 5 smallest numbers; if they sum up to a number larger than N, choosing any other numbers won't help either, and you register an error.
If you want your numbers to sum up quite close to N (the user asked for 20 minutes, you try to offer something like 19 minutes and not 5), it becomes a knapsack problem, which is hard, but maybe various approximate ways to solve it could help.
If you just want to choose 5 random numbers that sum up to N, you can keep choosing 5 numbers (songs) randomly and check. You'll have to limit the number of tries done and/or time spent, and be ready to report a failure.
A somewhat more efficient algorithm would keep a list of songs chosen so far, and the sum of their lengths s. It would try to add to it a random song with length ≤ N - s. If it failed after a few attempts, it would remove the longest song from the list and repeat. It must be ready to admit failure, too, based on the total number of attempts made and/or time spent.
I don't think a simple SQL query could efficiently solve this problem. You can approximately encode the algorithm above as a very complex SQL query, though. I'd rather encode it in Python, because local SQLite lookups are pretty fast, provided that your songs are indexed by length.
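The incremental algorithm described above (keep a running list, add a random song that still fits, drop the longest one when stuck, give up after a fixed number of attempts) could look roughly like this in Python. It is a sketch under assumptions: songs are held in an in-memory list of (name, length) tuples instead of a database, the goal is taken to be an exact total of target, and the name pick_songs and its parameters are hypothetical.

```python
import random

def pick_songs(songs, target, max_songs=5, max_attempts=1000, rng=random):
    """Try to pick up to max_songs distinct songs whose lengths sum to target.
    songs: list of (name, length) tuples.
    Returns the chosen list, or None if no exact fit was found in time."""
    chosen = []
    total = 0
    for _ in range(max_attempts):
        if total == target:
            return chosen
        remaining = target - total
        candidates = [s for s in songs
                      if s not in chosen and s[1] <= remaining]
        if candidates and len(chosen) < max_songs:
            # add a random song that still fits in the remaining time
            song = rng.choice(candidates)
            chosen.append(song)
            total += song[1]
        elif chosen:
            # stuck: remove the longest chosen song and try again
            longest = max(chosen, key=lambda s: s[1])
            chosen.remove(longest)
            total -= longest[1]
        else:
            return None  # not a single song fits
    return chosen if total == target else None

songs = [("Song 1", 100), ("Song 2", 150), ("Song 3", 200),
         ("Song 4", 350), ("Song 5", 300)]
playlist = pick_songs(songs, 500)
```

Because removing the longest song can lead back to the same dead end, the attempt cap is what guarantees termination; a None result means "no playlist found in time", not "no playlist exists".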
A possible solution is to select only songs whose individual length is at most the requested time (500 in the example below). Then you keep as many of them as you can. If you have fewer than 5 songs and the total time is still below the requested time, you iterate or recurse to find more songs for the unused time.
def createtimeplay(timee, tot=None):
    if tot is None:
        tot = []  # at first call initialize the result list
    # exclude previously selected songs from the search
    qry = "SELECT * FROM songs WHERE length <= ?"
    if len(tot) > 0:
        qry += " AND name NOT IN (" + ','.join(['?'] * len(tot)) + ')'
    qry += " ORDER BY RANDOM() LIMIT 5"
    # c is an open sqlite3 cursor or connection; column 0 is the name, column 1 the length
    cur = c.execute(qry, [timee] + [song[0] for song in tot]).fetchall()
    if len(cur) == 0:
        return tot  # no songs were found: we can return
    # keep songs that fit in the allowed time
    cur_time = 0
    for song in cur:
        if cur_time + song[1] <= timee:
            cur_time += song[1]
            tot.append(song)
            if len(tot) == 5:
                return tot  # never more than 5 songs
    if cur_time != timee:  # time left over: recurse on the remaining time only
        createtimeplay(timee - cur_time, tot)
    return tot
The trick is that we pass a list, which is a mutable object, so all recursive calls add songs to the same list.
You can then use:
>>> print(createtimeplay(500))
[('Song 18', 350, 'Country', 'z'), ('Song 4', 150, 'Pop', 'x')]
>>> print(createtimeplay(500))
[('Song 12', 200, 'Country', 'z'), ('Song 3', 100, 'Country', 'z'), ('Song 14', 200, 'Rap', 'y')]
>>> print(createtimeplay(500))
[('Song 5', 300, 'Rap', 'y'), ('Song 7', 200, 'Pop', 'x')]
>>>
But the previous solution is very inefficient: it requires more than one query, each query is a full table scan because of the ORDER BY RANDOM(), and it uses recursion where it could easily be avoided. It would be both simpler and more efficient to do a single full table scan at the SQLite level, shuffle the result in SQLite or Python, and then scan the randomized list of songs once, keeping at most 5 songs subject to the total-length constraint.
The code is now much simpler:
def createtimeplay(tim, n, con):
    # con is an open sqlite3 connection (or cursor); one full table scan, shuffled by SQLite
    songs = con.execute("""SELECT name, length, genre, artist
                           FROM songs
                           WHERE length <= ? ORDER BY RANDOM()""", (tim,)).fetchall()
    result = []
    for song in songs:
        if song[1] <= tim:  # the song fits in the remaining time
            tim -= song[1]
            result.append(song)
            if len(result) == n or tim == 0:
                break
    return result
In this code, I chose to pass the maximum number of songs and a cursor or a connection to the SQLite database as parameters.

Link two tables based on conditions in matlab

I am using matlab to prepare my dataset in order to run it in certain data mining models and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values in a certain timestamps and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same amount of rows (A has more measurements) but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are going to be added in table C when all the following conditions stand:
The time difference between A and B is more than 3 sec but less than 5 sec
The recorded value of A is the 40% - 50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val]=find((A(:,1)-B(:,1))>3sec & (A(:,1)-B(:,1))<5sec), where you need to use datenum or an equivalent to transform your timestamps. The second condition works the same way: use [row,col,val]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2))
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
You might need to check the documentation on datenum regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
[row1,col1,val1]=find((A(:,1)-B(:,1))>time1(1) & (A(:,1)-B(:,1))<time1(2));
[row2,col2,val2]=find(A(:,2)>0.4*B(:,2) & A(:,2)<0.5*B(:,2));
You might not need the row and col variables if you only want the matching rows. row1 contains the row indices that satisfy condition 1, row2 those that satisfy condition 2. If you want both conditions to hold at the same time, use both in the find command:
[row3,col3,val3]=find((A(:,1)-B(:,1))>time1(1) & ...
(A(:,1)-B(:,1))<time1(2) & A(:,2)>0.4*B(:,2) ...
& A(:,2)<0.5*B(:,2));
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However for the time I followed a different approach by converting hh:mm:ss to seconds that will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF ');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?
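Since A and B have different numbers of rows, one straightforward option is to compare every row of A with every row of B, which is exactly a loop with an if inside. The sketch below shows that idea in Python rather than MATLAB, under assumptions not stated in the question: times are "hh:mm:ss" strings on the same day, the time difference is taken as an absolute value, and the 40%-50% bounds are inclusive. The names to_seconds and link_tables are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to seconds since midnight."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def link_tables(a, b, min_gap=3, max_gap=5, low=0.4, high=0.5):
    """a, b: lists of ('hh:mm:ss', value) rows, possibly of different lengths.
    Returns rows (time_a, value_a, time_b, value_b) for every pair where the
    time difference is strictly between min_gap and max_gap seconds and
    value_a lies between low and high times value_b."""
    c = []
    for ta, va in a:
        for tb, vb in b:
            gap = abs(to_seconds(ta) - to_seconds(tb))
            if min_gap < gap < max_gap and low * vb <= va <= high * vb:
                c.append((ta, va, tb, vb))
    return c

a = [("10:00:00", 45.0), ("10:00:10", 20.0)]
b = [("10:00:04", 100.0), ("10:00:14", 100.0)]
c = link_tables(a, b)
print(c)  # [('10:00:00', 45.0, '10:00:04', 100.0)]
```

The nested loop costs len(a) * len(b) comparisons, which is fine for modest tables; for long recordings sorted by time, the inner loop could be limited to the rows of B near the current time.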

summing & matching cell arrays of different sizes

I have a 4016 x 4 cell, called 'totalSalesCell'. The first two columns contain text; the remaining two are numeric.
1st field CompanyName
2nd field UniqueID
3rd field NumberItems
4th field TotalValue
In my code I have a loop which goes over the last month in weekly steps - i.e. 4 loops.
At each loop my code returns a cell of the same structure as totalSalesCell, called weeklySalesCell which generally contains a different number of rows to totalSalesCell.
There are two things I need to do. First if weeklySalesCell contains a company that is not in totalSalesCell it needs to be added to totalSalesCell, which I believe the code below will do for me.
co_list = unique([totalSalesCell(:, 1); weeklySalesCell(:, 1)]);
index = ismember(co_list, totalSalesCell(:, 1));
new_co = co_list(index==0, :);
totalSalesCell = [totalSalesCell; new_co];
The second thing I need to do and am unsure of the best way of going about it is to then add the weeklySalesCell numeric fields to the totalSalesCell. As mentioned the cells will 90% of the time have different row numbers so cannot apply a simple addition. Below is an example of what I wish to achieve.
totalSalesCell          weeklySalesCell         Result
co_id   sales_value     co_id   sales_value     co_id   sales_value
23DFG   5               DGH84   3               23DFG   5
DGH84   6               ABC33   1               DGH84   9
12345   7               PLM78   4               ABC33   1
PLM78   4               12345   3               12345   10
KLH11   11                                      PLM78   8
                                                KLH11   11
I believe the following code should take care of both of your tasks -
[x1,x2] = ismember(totalSalesCell(:,1),weeklySalesCell(:,1))
corr_c2 = nonzeros(x1.*x2)
newval = cell2mat(totalSalesCell(x1,2)) + cell2mat(weeklySalesCell(corr_c2,2))
totalSalesCell(x1,2) = num2cell(newval)
excl_c2 = ~ismember(weeklySalesCell(:,1),totalSalesCell(:,1))
out = vertcat(totalSalesCell,weeklySalesCell(excl_c2,:)) %// desired output
Output -
out =
'23DFG' [ 5]
'DGH84' [ 9]
'12345' [10]
'PLM78' [ 8]
'KLH11' [11]
'ABC33' [ 1]
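The merge itself (sum the values of matching ids, append ids that are new) is easy to cross-check with a dictionary. The following Python sketch reproduces the two-column example above; it is an illustration only, and the name add_weekly is hypothetical.

```python
def add_weekly(total, weekly):
    """total, weekly: lists of (co_id, sales_value) pairs.
    Returns a new list where matching ids are summed and ids that appear
    only in weekly are appended in their weekly order."""
    merged = dict(total)  # dicts preserve insertion order
    for co_id, value in weekly:
        merged[co_id] = merged.get(co_id, 0) + value
    return list(merged.items())

total_sales = [("23DFG", 5), ("DGH84", 6), ("12345", 7),
               ("PLM78", 4), ("KLH11", 11)]
weekly_sales = [("DGH84", 3), ("ABC33", 1), ("PLM78", 4), ("12345", 3)]

out = add_weekly(total_sales, weekly_sales)
print(out)
# [('23DFG', 5), ('DGH84', 9), ('12345', 10), ('PLM78', 8), ('KLH11', 11), ('ABC33', 1)]
```

Calling add_weekly once per weekly loop iteration, feeding the previous result back in as total, accumulates the month.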

Resources