Rolling Arrays as a possible solution to a rolling window in SAS - arrays

I am trying to calculate the dosage of a particular drug for a population to see if any member in this population is over a certain threshold for any 90 consecutive days. So to do this I am thinking that I am going to need to make a arraty that looks at the strengh of this drug over 90 days from an index and if they are all '1' then they get a 'pass', and somehow put this into a do loop to look at all of the potential 90 day windows for a year i=1...i=275 to see if a member at any point during the year has met the criteria. Thoughts?

Related

Getting an average from a nested calculation with averageifs in excel

i am trying to pull performance data togther from a csv. We have a design team with a list of tasks which arrive on their list, sit there until dealt with, and then get completed. For any given day i want to calculate:
1/ The number of tasks on the list (so i'm running countifs, with citeria being 1: the design team, 2: the task had arrived on the list by the given day, and 3: had not been completed by the given day. This works ok.
2/ I also want to know what the average amonth of days those tasks had been on the list on that given day. IN the data set i have the day they arrive, and the day they got completed, and i know the given day i want to calculate for. So i think i ned to nest a formula to calculate (given date - arrival date) into an averageifs formula with the same conditions as the task count.
I have set this up in a sheet using a helper column to give me the days elapsed, but in practice, i want to run this for a number of given days so cannot add a helper column for each query.
So i have =AVERAGEIFS($G:$G,B:B,"Design",$C:$C, "<"&I$1, $E:$E, ">="&I$1)
but where $G:$G is a calculation of the days elapsed =IF(C2<>"",SUM(I$1-C2),"")
enter image description here

Count down a column until value in column meets condition

I have a simple daily rainfall data set and would like to calculate the antecedent dry period for each day. Here, I'm defining a dry day to be "<10". I'm fairly unfamiliar with INDEX(), MATCH(), and other fancy array functions but feel like I'll need to use them.
For example, in the image, for 1/17/2020, the values in cells C3:C9=0, C10=1, C11:C13=0. I've tried various versions of COUNTIF(), COUNTIFS(), and IF() functions but I cannot get the step-wise + re-set functionality necessary when extended "dry spells" or brief rain periods occur with gaps. Thanks!
You are right, you need to use Match. Basically you need to search for the next antecedent wet day (of which there are many here in Manchester England at the moment) and subtract 1 (Formula 1):
=MATCH(TRUE,INDEX(B15:B$1000>=10,0),0)-1
where B$1000 may need changing to include all of your data. The use of Index here is just a bit of a hack to avoid having to enter the formula as an array formula.
As you can see there is an issue when you come to the end of the range which I will come to in a minute.
In this case, we want to count the number of antecedent dry days to the end of the range (Formula 2):
=IFERROR(MATCH(TRUE,INDEX(B4:B$1000>=10,0),0)-1,COUNTIF(B4:B$1000,"<10"))
If the range ended with a dry spell, you would get this:

How many trucks came empty but bought something

I think this is the hardest to date I have had to crack - so hard I had a hard time finding a good headline.
So we have a site where trucks come and buy say Gravel, or sand or other building materials.
Sometimes they also unload demolition waste first.
I need to find out a couple of things
how many trucks (and from what companys) came empty
if they came empty what did they buy from us.
what companys are sending full trucks and what are sending empty trucks.
a tope 10 of materials they will drive to us from to buy even when coming empty to our facility.
a list of all the order numbers that they drove to us til fill and came with empty trucks. ( I have distances linked to order numbers, so now I can estimate the value of our products)
The data I have available:
I have a full data set of when what customer buys what and / or pay to deliver.
E.G.:
I can see the parts I need to split the data into I think it should be something like this
find all unique licence plates
somehow map if they bought materials within 30 minutes of
offloading demolition waste (most trucks will come between 2 and 10
times per day)
Present all this data (on a normal day we have about 800 trucks = 2000 lines since they weigh in, weigh out, and then some buy something = 2 more weigh lines)
I can easily find unique licence plates per day (either by formula or by Excel function Data/delete doublets,
but after that I have no clue where to start.
I think I need some sheets in between, where I somehow mark if a material was bought from an "empty truck" and I need a counter for that .. somehow...
Any help on how to get started is appreciated.
It seems like the best way to start is with a helper column (in the following exampes, I have chosen "Column M") to flag whether the truck arrived empty.
In the helper column, you can use something similar to the following formula.
{=IF(ISBLANK(B2),0,IF(C2="In",0,IF(B2=$B$2:$B$13,IF($C$2:$C$13="In",IF($A$2:$A$13>(A2-TIME(0,30,0)),0,1),1),1)))}
This is an array formula, which means you have to press ctrl+shift+enter after pasting it in the cell. Then you can copy that cell down the column.
Just to explain, the first if statement knows the truck is not arriving empty if Column C is 'In'. The second if statement creates an array and tests to see if other the same truck appears in other rows. The third if statement checks to see if the same truck checked 'In' in the matching rows, and the fourth if statement verifies if the time they checked in was less than thirty minutes ago. You can adjust the length by editing the TIME(0,30,0) function. The format is TIME(hours,minuites,seconds). Unless the truck matches all three of the second, third and fourth if statements, it is marked as coming empty.
Once you have this helper column, just about all of your tasks are quite simple.
1a: How many trucks came empty? Sum Column M
1b: How many trucks from what company? Create a unique list of companies. Then create a COUNTIFS formula based on Column M = 1 and Column K = Company. For example, if C32 had Company B then the formula =COUNTIFS($M$2:$M$13,1,$K$2:$K$13,C32) would return 2
1c: How many times did a truck come empty? Similar to 1b, create a unique list of License Plates, then use a COUNTIFS based on Column M = 1 and Column B = License Plate.
2: Similar to 1b, just use a unique list of products tested against Column F
3: Similar to 1b, just create a second column, next to the first that uses =COUNTIFS($M$2:$M$13,0,$K$2:$K$13,C53,$C$2:$C$13,"In") Which tests that Column M reports the truck did not come empty, that matches the company in Column K and that the truck came 'In' so you don't double count the same truck when it goes 'out'
4: Just sort list created by number 2. You can highlight the range, right-click and select "Sort" > "Custom Sort", then select the column you want to sort on and largest to smallest.
5: There are a couple of different ways, you could do this. The formula
{=TEXTJOIN(", ",TRUE,IF($M$2:$M$13=1,$J$2:$J$13,""))}
(again, entered as an array formula)
would create a comma separated list of order numbers. An alternative if you want a column of order numbers (but would only work if they are actually numbers), is to paste the formula {=MAX(IF($M$2:$M$13=1,$J$2:$J$13,))} in the first row of the column (in my example, its O2) and then {=MAX(IF($M$2:$M$13=1,IF($J$2:$J$13<O2,$J$2:$J$13,)))} in the row below (change the reference to O2 if you pasted it in a different spot)(again, note that both of these are array formulas). Then copy and paste the second formula down the column. When order numbers of trucks that came in empty are exhausted, the formula will report 0.

Find all key-unique permutations of an array

I have an array that looks something like this:
[["Sunday", [user1, user2]], ["Sunday", [user1, user4]], ["Monday", [user3, user2]]]
The array essentially has all permutations of a given day with a unique pair of users. I obtained it by running
%w[Su Mo Tu We Th Fr Sa].product(User.all_pairs)
where User.all_pairs is every unique pair of users.
My goal now is to compose this set of nested arrays into schedules, meaning I want to find every permutation of length 7 with unique days. In other words, I want every potential week. I already have every potential day, and I have every potential pair of users, now I just need to compose them.
I have a hunch that the Array.permutation method is what I need, but I'm not sure how I'd use it in this case. Or perhaps I should use Array.product?
If I understand you correctly, you want all possible weeks where there is one pair of users assigned to each day. You can do it like this:
User.all_pairs.combination(7)
This will give you all possible ways of how you can pick 7 pairs and assign them to the days of the week. But if you are asking for every possible week, then it also matters into which day is which pair assigned, and you also have to take every permutation of those 7 pairs:
User.all_pairs.combination(7).map{|week| week.permutation().to_a}.flatten(1)
Now this will give you all possible weeks, where every week is represented as array containing 7 pairs. For example one of the weeks may look like this:
[(user1, user2), (user1, user3), (user2, user3), (user3, user4), (user1, user4), (user2, user4), (user3, user4)]
However the amount of the weeks will be huge! If you have n users, you will have k = n!/2 pairs, there is p = k! / (7! * (k - 7)!) ways of selecting 7 pairs and p * 7! possible weeks. If you have just 5 users, you get 1946482876800 possible weeks! No matter what you are planning to do with it, it won't be possible.
If you are trying to find the best schedule for a week, you can try to make some greedy algorithm.

Web stats: Calculating/estimating unique visitors for arbitary time intervals

I am writing an application which is recording some 'basic' stats -- page views, and unique visitors. I don't like the idea of storing every single view, so have thought about storing totals with a hour/day resolution. For example, like this:
Tuesday 500 views 200 unique visitors
Wednesday 400 views 210 unique visitors
Thursday 800 views 420 unique visitors
Now, I want to be able to query this data set on chosen time periods -- ie, for a week. Calculating views is easy enough: just addition. However, adding unique visitors will not give the correct answer, since a visitor may have visited on multiple days.
So my question is how do I determine or estimate unique visitors for any time period without storing each individual hit. Is this even possible? Google Analytics reports these values -- surely they don't store every single hit and query the data set for every time period!?
I can't seem to find any useful information on the net about this. My initial instinct is that I would need to store 2 sets of values with different resolutions (ie day and half-day), and somehow interpolate these for all possible time ranges. I've been playing with the maths, but can't get anything to work. Do you think I may be on to something, or on the wrong track?
Thanks,
Brendon.
If you are OK with approximations, I think tom10 is onto something, but his notion of random subsample is not the right one or needs a clarification. If I have a visitor that comes on day1 and day2, but is sampled only on day2, that is going to introduce a bias in the estimation. What I would do is to store full information for a random subsample of users (let's say, all users whose hash(id)%100 == 1). Then you do the full calculations on the sampled data and multiply by 100. Yes tom10 said about just that, but there are two differences: he said "for example" sample based on the ID and I say that's the only way you should sample because you are interested in unique visitors. If you were interested in unique IPs or unique ZIP codes or whatever you would sample accordingly. The quality of the estimation can be assessed using the normal approximation to the binomial if your sample is big enough. Beyond this, you can try and use a model of user loyalty, like you observe that over 2 days 10% of visitors visit on both days, over three days 11% of visitors visit twice and 5% visit once and so forth up to a maximum number of day. These numbers unfortunately can depend on time of the week, season and even modeling those, loyalty changes over time as the user base matures, changes in composition and the service changes as well, so any model needs to be re-estimated. My guess is that in 99% of practical situations you'd be better served by the sampling technique.
You could store a random subsample of the data, for example, 10% of the visitor IDs, then compare these between days.
The easiest way to do this is to store a random subsample of each day for future comparisons, but then, for the current day, temporarily store all your IDs and compare them to the subsampled historical data and determine the fraction of repeats. (That is, you're comparing the subsampled data to a full dataset for a given day and not comparing two subsamples -- it's possible to compare two subsamples and get an estimate for the total but the math would be a bit trickier.)
You don't need to store every single view, just each unique session ID per hour or day depending on the resolution you need in your stats.
You can keep these log files containing session IDs sorted to count unique visitors quickly, by merging multiple hours/days. One file per hour/day, one unique session ID per line.
In *nix, a simple one-liner like this one will do the job:
$ sort -m sorted_sid_logs/2010-09-0[123]-??.log | uniq | wc -l
It counts the number of unique visitors during the first three days of September.
You can calculate the uniqueness factor (UF) on each day and use it to calculate the composite (week by example) UF.
Let's say that you counted:
100 visits and 75 unique session id's on monday (you have to store the sessions ID's at least for a day, or the period you use as unit).
200 visits and 100 unique session id's on tuesday.
If you want to estimate the UF for the period Mon+Tue you can do:
UV = UVmonday + UVtuesday = TVmonday*UFmonday + TVtuesday*UFtuesday
being:
UV = Unique Visitors
TV = Total Visits
UF = Uniqueness Factor
So...
UV = (Sum(TVi*UFi))
UF = UV / TV
TV = Sum(TVi)
I hope it helps...
This math counts two visits of the same person as two unique visitors. I think it's ok if the only way you have to identify somebody is via the session ID.

Resources