Stata: Number of overlapping days within multiple date ranges - loops

I want to calculate the number of overlapping days within multiple date ranges. For example, in the sample data below there are 167 overlapping days: the first overlap runs from 07jan to 04apr and the second from 30may to 15aug.
start end
01jan2000 04apr2000
30may2000 15aug2000
07jan2000 31dec2000

This is fairly crude but gets the job done. Essentially, you:
1. Reshape the data to long format, which is usually a good idea when working with panel data in Stata
2. Fill in the gaps between the start and end of each spell
3. Keep the dates that occur more than once
4. Count the distinct values of date
clear
/* Fake Data */
input str9(start end)
"01jan2000" "04apr2000"
"30may2000" "15aug2000"
"07jan2000" "31dec2000"
end
foreach var of varlist start end {
gen d = date(`var', "DMY")
drop `var'
gen `var' = d
format %td `var'
drop d
}
/* Count Overlapping Days */
rename (start end) date=
gen spell = _n
reshape long date, i(spell) j(range) string
drop range
xtset spell date, delta(1 day)
tsfill
bys date: keep if _N>1
/* distinct is community-contributed; if it is missing, run: ssc install distinct */
distinct date
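For readers who want to check the 167-day figure outside Stata, the same four steps can be sketched in Python. This is only an illustration of the idea (expand each spell into days, keep dates seen more than once, count them); the names spells and count_overlapping_days are made up for this example.

```python
from collections import Counter
from datetime import date, timedelta

# The three sample spells from the question, as inclusive (start, end) pairs.
spells = [
    (date(2000, 1, 1), date(2000, 4, 4)),
    (date(2000, 5, 30), date(2000, 8, 15)),
    (date(2000, 1, 7), date(2000, 12, 31)),
]

def count_overlapping_days(ranges):
    """Count calendar days that fall inside more than one range."""
    days_seen = Counter()
    for start, end in ranges:
        # Fill in every day of the spell, endpoints included (the tsfill step).
        for offset in range((end - start).days + 1):
            days_seen[start + timedelta(days=offset)] += 1
    # Keep only dates that occur more than once, then count them.
    return sum(1 for n in days_seen.values() if n > 1)

overlap = count_overlapping_days(spells)
```

With the sample data, overlap is 167: 89 days from 07jan to 04apr (2000 is a leap year) plus 78 days from 30may to 15aug.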

Related

Speed up Nested Loop

I am researching a topic on inflation and currently have the problem that part of my code takes over 20 hours to complete because I have a nested loop.
The code looks like this:
avg_timeseries <- function(start_date, end_date, k){
start_date = microdata_sce$survey_date[i]
end_date = microdata_sce$prev_survey[i]
mean(subset(topictimeseries, survey_date <= start_date & survey_date >= end_date)[[k]])
}
start.time <- Sys.time()
for (i in 1:nrow(microdata_sce)){
for(k in 2:ncol(topictimeseries)){
microdata_sce[[i, paste(k-1, 'topic', sep="_")]] <- avg_timeseries(start_date, end_date, k)
}
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
What do I want/what do I have:
I have a panel dataset (microdata_sce) with a total of 130,000 observations. Each userid has a start date and an end date, which are the survey date and the last survey date respectively. I want to take both dates and calculate the average between the two dates separately for each column in the timeseries dataframe (70 columns). Afterwards, the average between the two dates is to be attached to the microdata_sce data set as a column.
My idea was to loop through each row, take the survey data, create a subset from which the average is calculated. I loop through the k different time series.
Unfortunately, the code takes an eternity. Is there a way to speed this up using apply?
Thank you!
Kind regards
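One possible direction, sketched in Python rather than R and not taken from the thread: if the time series is sorted by date, each window mean can be read off a running total instead of re-subsetting the data frame for every row and column. The technique here is prefix sums plus binary search, not apply; the names make_window_mean, ts_dates and ts_topic are hypothetical stand-ins for one column of topictimeseries.

```python
from bisect import bisect_left, bisect_right
from datetime import date
from itertools import accumulate

def make_window_mean(dates, values):
    """Return a function giving the mean of `values` over an inclusive date window.

    `dates` must be sorted ascending and aligned with `values`.
    """
    prefix = [0.0] + list(accumulate(values))  # prefix[j] == sum(values[:j])

    def window_mean(first, last):
        lo = bisect_left(dates, first)   # index of the first date >= first
        hi = bisect_right(dates, last)   # one past the last date <= last
        if hi <= lo:
            return None                  # no observations in the window
        return (prefix[hi] - prefix[lo]) / (hi - lo)

    return window_mean

# Hypothetical stand-in for one column of topictimeseries.
ts_dates = [date(2020, 1, d) for d in range(1, 6)]
ts_topic = [10.0, 20.0, 30.0, 40.0, 50.0]
mean_between = make_window_mean(ts_dates, ts_topic)

# Window from prev_survey (2 Jan) to survey_date (4 Jan), both inclusive.
example = mean_between(date(2020, 1, 2), date(2020, 1, 4))
```

The running total is built once per column, so each of the 130,000 rows then costs two binary searches and a subtraction. Whether this is fast enough for the real data is untested here.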

How to pull specific indices out of a character array in a loop?

I have an array that contains multiple dates in the format yyyymmdd, stored as a 50x1 double. I am trying to pull out the year, month, and day so I can use datenum to assign each date a serial number.
Indexing an individual date, converting it using str2num, then indexing and pulling the appropriate values works fine, but when I try to loop through the list of dates it doesn't work: only variations of the number 2 are returned.
dates = [20180910; 20180920; 20181012; 20181027; 20181103; 20181130; 20181225];
% version1
datesnums=num2str(dates); % dates is a list of dates stored as integers
for i=1:length(datesnums)
pullyy=str2num(datesnums(1:4));
pullmm=str2num(datesnums(5:6));
pulldd=str2num(datesnums(7:8));
end
As well as
%version2
datesnums=num2str(dates,'%d')
for i = 1:length(datesnums)
dd=datenum(str2num(datesnums(i(1:4))),str2num(datesnums(i(5:6))),str2num(datesnums(i(7:8))));
end
I'm trying to generate a new array that is just the serial numbers of the input dates. In the examples shown, I am only getting single integer values, which I know is because the loop is incorrect, and I get errors that say "Index exceeds the number of array elements (1)." for version 1. When I've gotten it to successfully loop through everything, the outputs are just '2222', '22', '22' for every single date, which is incorrect. What am I doing wrong? Do I need to incorporate a cell array?
To get all the years, month, and days in a loop:
datesnums=num2str(dates);
for i=1:size(datesnums, 1)
pullyy(i) = str2num(datesnums(i,1:4));
pullmm(i) = str2num(datesnums(i,5:6));
pulldd(i) = str2num(datesnums(i,7:8));
end
Actually, you can do this without a loop:
pullyy = str2num(datesnums(:,1:4));
pullmm = str2num(datesnums(:,5:6));
pulldd = str2num(datesnums(:,7:8));
Explanation:
If for example the dates vector is a [6x1] array:
dates =[...
20190901
20170124
20191215
20130609
20141104
20190328];
Then datesnums=num2str(dates); creates a char matrix of size [6x8] where each row corresponds to one element in dates:
datesnums =
6×8 char array
'20190901'
'20170124'
'20191215'
'20130609'
'20141104'
'20190328'
So in the loop you need to refer to the row index for each date and the column indices to extract the year, month, and day.
The easiest solution I can think of is:
SN = datenum(num2str(dates),'yyyymmdd')
You only have to specify the date format which is 'yyyymmdd'
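The same two ideas (slice each date's text form by position, or hand a format string to a parser) can be sketched in Python for comparison. This is an illustration only: split_yyyymmdd is a made-up helper, and Python's toordinal() is its own day count, not MATLAB's datenum value.

```python
from datetime import datetime

dates = [20180910, 20180920, 20181012, 20181027, 20181103, 20181130, 20181225]

def split_yyyymmdd(value):
    """Slice the 8-character text form into (year, month, day) integers."""
    text = str(value)
    return int(text[0:4]), int(text[4:6]), int(text[6:8])

# One (year, month, day) tuple per input date: the "row index" is the list position.
parts = [split_yyyymmdd(d) for d in dates]

# Or let the parser do the slicing, analogous to passing 'yyyymmdd' to datenum.
parsed = [datetime.strptime(str(d), "%Y%m%d").date() for d in dates]

# toordinal() is Python's own day count; it is NOT MATLAB's datenum value.
day_numbers = [p.toordinal() for p in parsed]
```

Differences between day numbers are still meaningful: the first two dates are 10 days apart in either system.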

How to change from annual year end to annual mid year in SAS

I currently work in SAS and utilise arrays in this way:
Data Test;
input Payment2018-Payment2021;
datalines;
10 10 10 10
20 20 20 20
30 30 30 30
;
run;
In my opinion this automatically assumes a limit, either the start of the year or the end of the year (please correct me if I'm wrong).
So, if I wanted to say that this is June data and payments are set to increase every 9 months by 50%, I'm looking for a way for my code to recognise that my years run from the end of June to the end of the next June.
For example, if I wanted to say
Data Payment_Pct;
set test;
lastpayrise = "31Jul2018";
array payment:
array Pay_Inc(2018:2021) Pay_Inc: ;
Pay_Inc2018 = 0;
Pay_Inc2019 = 2; /*2 because there are two increments in 2019*/
Pay_Inc2020 = 1;
Pay_Inc2021 = 1;
do I = 2018 to 2021;
if i = year(pay_inc) then payrise(i) * 50% * Pay_Inc(i);
end;
run;
It's all well and good for me to manually do this for one entry but for my uni project, I'll need the algorithm to work these out for themselves and I am currently reading into intck but any help would be appreciated!
P.s. It would be great to have an algorithm that creates the following
Pay_Inc2019 Pay_Inc2020 Pay_Inc2021
1 2 1
OR, it would be great to know how SAS works when setting the array for 2018:2021: does it assume end of year, or can you set it to mid year?
Regarding input Payment2018-Payment2021; there is no automatic assumption of yearness or calendaring. The numbers 2018 and 2021 are the bounds of a numbered range list:
In a numbered range list, you can begin with any number and end with any number as long as you do not violate the rules for user-supplied names and the numbers are consecutive.
The meaning of the numbers 2018 to 2021 is up to the programmer. You state the variables correspond to the June payment in the numbered year.
You would have to iterate a date using 9-month steps and increment a counter based on the year in which the date falls.
Sample code
Dynamically adapts to the variable names that are arrayed.
data _null_;
array payments payment2018-payment2021;
array Pay_Incs pay_inc2018-pay_inc2021; * must be same range numbers as payments;
* obtain variable names of first and last element in the payments array;
lower_varname = vname(payments(1));
upper_varname = vname(payments(dim(payments)));
* determine position of the range name numbers in those variable names;
lower_year_position = prxmatch('/\d+\s*$/', lower_varname);
upper_year_position = prxmatch('/\d+\s*$/', upper_varname);
* extract range name numbers from the variable names;
lower_year = input(substr(lower_varname,lower_year_position),12.);
upper_year = input(substr(upper_varname,upper_year_position),12.);
* prepare iteration of a date over the years that should be the name range numbers;
date = mdy(06,01,lower_year); * june 1 of year corresponding to first variable in array;
format date yymmdd10.;
do _n_ = 1 by 1; * repurpose _n_ for an infinite do loop with interior leave;
* increment by 9-months;
date = intnx('month', date, 9);
year = year(date);
if year > upper_year then leave;
* increment counter for year in which iterating date falls within;
Pay_Incs( year - lower_year + 1 ) + 1;
end;
put Pay_Incs(*)=;
run;
Increment counter notes
There is a lot to unpack in this statement
Pay_Incs( year - lower_year + 1 ) + 1;
+ 1 at the end of the statement increments the addressed array element by 1, and is the syntax for the SUM Statement
variable + expression The sum statement is equivalent to using the SUM function and the RETAIN statement, as shown here:
retain variable 0;
variable=sum(variable,expression);
year - lower_year + 1 computes the array base-1 index, 1..N, that addresses the corresponding variable in the named range list pay_inc<lower_year>-pay_inc<upper_year>
Pay_Incs( <computed index> ) selects the variable of the SUM statement
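As a cross-check of the counting logic only, the 9-month iteration described in this answer can be sketched in Python. The helpers add_months and count_increments are made up for this example; the start date of 1 June and the year range are taken from the SAS step above.

```python
from datetime import date

def add_months(d, months):
    """Move a first-of-month date forward by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def count_increments(lower_year, upper_year, step_months=9):
    """Count, per calendar year, how many stepped dates land in that year."""
    counts = {year: 0 for year in range(lower_year, upper_year + 1)}
    current = date(lower_year, 6, 1)  # 1 June of the first year, as in the SAS step
    while True:
        current = add_months(current, step_months)
        if current.year > upper_year:
            break
        counts[current.year] += 1
    return counts

pay_incs = count_increments(2018, 2021)
```

The stepped dates are March 2019, December 2019, September 2020 and June 2021, which gives two increments in 2019 and one each in 2020 and 2021, matching the values the question sets by hand.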
This is a wonderful use case of the intnx() function. intnx() will be your best friend when it comes to aligning dates.
In the traditional calendar, the year starts on 01JAN. In your calendar, the year starts on 01JUN, five months later. The 'year.6' interval tells intnx() to use years that begin in the sixth month, June. We want to shift our date so that the year starts on 01JUN. This will allow you to take the year part of the date and determine what year you are on in the new calendar.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 1, 'B');
run;
Note that we shifted current_new_year by one year. To illustrate why, let's see what happens if we don't shift it by one year.
data want;
format current_cal_year
current_new_year year4.
;
current_cal_year = intnx('year', '01JUN2018'd, 0, 'B');
current_new_year = intnx('year.6', '01JUN2018'd, 0, 'B');
run;
current_new_year shows 2018, but we really are in 2019. For 5 months out of the year, this value will be correct. From June-December, the year value will be incorrect. By shifting it one year, we will always have the correct year associated with this date value. Look at it with different months of the year and you will see that the year part remains correct throughout time.
data want;
format cal_month date9.
cal_year
new_year year4.
;
do i = 0 to 24;
cal_month = intnx('month', '01JAN2016'd, i, 'B');
cal_year = intnx('year', cal_month, 0, 'B');
new_year = intnx('year.6', cal_month, 1, 'B');
year_not_same = (year(cal_year) NE year(new_year) );
output;
end;
drop i;
run;
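The labelling rule this answer arrives at (a year that starts on 1 June and is named after the calendar year in which it ends) can be sketched in Python as a sanity check. The function fiscal_year and the constant FISCAL_START_MONTH are made up for this example and assume the 1 June start used above.

```python
from datetime import date

FISCAL_START_MONTH = 6  # assumption: the year begins on 1 June, as in the answer

def fiscal_year(d):
    """Label a date by the calendar year in which its June-start year ends."""
    return d.year + 1 if d.month >= FISCAL_START_MONTH else d.year

# Label the first day of every month of 2018.
labels = {m: fiscal_year(date(2018, m, 1)) for m in range(1, 13)}
```

Here January to May 2018 are labelled 2018 and June to December 2018 are labelled 2019, which is the result the one-year shift in intnx('year.6', ..., 1, 'B') is meant to produce.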

sas loop over month from variable

I am trying to loop over a series of dates in order to create the dates in between. This is to be done in monthly steps, always displaying the last day of the respective month. The start and end dates are given (first_date and last_date), while the last date should always refer to the end of the previous month.
The original dataset looks like the following:
customer id first_date last_date
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
...
My goal is to have a dataset which looks like this:
customer id date
xy 135 31.01.2000
xy 135 29.02.2000
...
xy 135 28.02.2005
xy 247 31.03.2003
xy 247 30.04.2003
...
xy 247 28.02.2005
I found the solution to iterate over days quite straightforward (see below), but I am struggling to implement the monthly steps and the end of month dates.
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Thanks for your help!!
First, let's get some data:
data have;
attrib customer length=$10 informat=$10.
id informat=best.
first_date informat=ddmmyy10. format=ddmmyy10.
last_date informat=ddmmyy10. format=ddmmyy10.
;
input customer $
id
first_date
last_date
;
datalines;
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
;
run;
The intnx() function will come to the rescue here. We are going to create a new variable called date, and then use the intnx function to return the end of the month for that date. As long as that date is less than the end date, we will continue to output it to a dataset and then increment to the end of the following month.
data want;
format date ddmmyy10.;
set have;
date = intnx('month',first_date,0,'end');
do while (date le last_date);
output;
date = intnx('month',date,1,'end');
end;
drop first_date last_date;
run;
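The loop above (start at the end of the first month, output, jump to the end of the next month while still within last_date) can be sketched in Python to make the stepping explicit. The helpers month_end and month_ends_between are made up for this example.

```python
import calendar
from datetime import date

def month_end(d):
    """Last calendar day of the month containing d."""
    return date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])

def month_ends_between(first_date, last_date):
    """Month-end dates from first_date's month up to last_date, inclusive."""
    out = []
    current = month_end(first_date)
    while current <= last_date:
        out.append(current)
        # Step to the first day of the next month, then take its month end.
        if current.month == 12:
            next_first = date(current.year + 1, 1, 1)
        else:
            next_first = date(current.year, current.month + 1, 1)
        current = month_end(next_first)
    return out

rows = month_ends_between(date(2003, 3, 19), date(2005, 3, 25))
```

For customer xy, id 247, this yields 24 rows from 31.03.2003 to 28.02.2005. Note that a last_date that is itself a month end would be included by this rule.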
While I think Rob's answer is the right way to do this, it's probably helpful to see how to do it the way you were trying to.
Starting with this:
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
This gives you too many rows, right? So what you need to do is identify where in the month you are. There are a bunch of ways to do this. Several date functions (like INTNX and INTCK) could be used to tell you where you are; but the easiest is just to compare month(date) with month(date+1). When they're different, you're on the last day of a month!
data want;
set have;
by customer id notsorted;
do day = first_date to last_date;
if month(day) ne month(day+1) then output;
end;
format day date9.;
run;
(I added notsorted since Rob's example data was not sorted, and I'm lazy. Probably not needed in your real case.)
I would note that this probably isn't your ideal solution in terms of speed; among data step approaches, Rob's probably is. This one iterates through every day rather than just once per month.
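The month(day) versus month(day+1) test translates directly to other languages. A minimal Python sketch, with made-up helper names, shows both the test and the day-by-day scan it is used in:

```python
from datetime import date, timedelta

def is_month_end(d):
    """True when the next day falls in a different month."""
    return (d + timedelta(days=1)).month != d.month

def month_ends_by_day_scan(first_date, last_date):
    """Walk every day in the inclusive range and keep only the month ends."""
    n_days = (last_date - first_date).days + 1
    days = (first_date + timedelta(days=k) for k in range(n_days))
    return [d for d in days if is_month_end(d)]

scan = month_ends_by_day_scan(date(2010, 5, 1), date(2011, 2, 28))
```

For customer ab, id 128, the scan visits 304 days to keep 10 of them, which is the speed trade-off mentioned above.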
Another option if you have the dataset you created above - with one row per day - is to use PROC EXPAND, if you have the ETS module. It's very handy for things like this.
data intermediate;
set have;
by customer id notsorted;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Here's your day-level data. Then below is the PROC EXPAND statement, asking for monthly data, aligned at the end. id day; identifies the time series variable, and by customer id notsorted; is the normal by statement (what variables identify the observations), with notsorted so they don't have to be in order relative to each other.
proc expand data=intermediate out=want from=day to=month align=end;
id day;
by customer id notsorted;
run;
This gives a slightly different solution than Rob's and my other solution, because it does give you the final row for each if it's not at the end of a month (and does set that final row to the end of the month). If that's desired, great, and our solutions can easily be adapted to give that; if it's not desired, you'll have to remove it afterwards.
You can do this with a simple iterative DO loop by using the date interval functions. Subtract one from the number of intervals to make it end at the last day of the previous month.
data want ;
set have ;
do offset=0 to intck('month',first_date,last_date)-1;
date=intnx('month',first_date,offset,'e');
output;
end;
format date yymmdd10.;
run;
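A Python sketch of the offset-based loop, for comparison. It assumes intck('month', ...) counts month boundaries crossed (its default behaviour), which months_between imitates; both helper names are made up for this example.

```python
import calendar
from datetime import date

def months_between(first_date, last_date):
    """Number of month boundaries crossed between the two dates."""
    return (last_date.year - first_date.year) * 12 + (last_date.month - first_date.month)

def month_ends_by_offset(first_date, last_date):
    """Month ends for offsets 0 .. months_between - 1 from first_date's month."""
    out = []
    for offset in range(months_between(first_date, last_date)):
        index = first_date.year * 12 + (first_date.month - 1) + offset
        year, month = index // 12, index % 12 + 1
        out.append(date(year, month, calendar.monthrange(year, month)[1]))
    return out

by_offset = month_ends_by_offset(date(2010, 5, 1), date(2011, 2, 28))
```

For customer ab, id 128, this stops at 31.01.2011, the end of the month before last_date, even though last_date (28.02.2011) is itself a month end. A loop that tests date le last_date would include that final row, so the two approaches differ in this edge case.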

Report Builder 3.0 - grouping rows by time of day

I am trying to create a table within a report that appears as follows:
The data set is based on this query:
SELECT
DATENAME(dw, CurrentReadTime) AS 'DAY',
DATEPART(dw, CurrentReadTime) AS 'DOW',
CAST(datename(HH, CurrentReadTime) as int) AS 'HOD',
AVG([Difference]) AS 'AVG'
FROM
Consumption
INNER JOIN Readings ON Readings.[RadioID-Hex] = Consumption.[RadioID-Hex]
WHERE
CONCAT([Building], ' ', [Apt]) = @ServiceLocation
GROUP BY
CurrentReadTime
ORDER BY
DATEPART(DW, CurrentReadTime),
CAST(DATENAME(HH, CurrentReadTime) AS INT)
The data from this table returns as follows:
In report builder, I have added this code to the report properties:
Function GetRangeValueByHour(ByVal Hour As Integer) As String
Select Case Hour
Case 6 To 12
GetRangeValueByHour = "Morning"
Case 12 to 17
GetRangeValueByHour = "Afternoon"
Case 17 to 22
GetRangeValueByHour = "Evening"
Case Else
GetRangeValueByHour = "Overnight"
End Select
Return GetRangeValueByHour
End Function
And this code to the "row group":
=Code.GetRangeValueByHour(Fields!HOD.Value)
When I execute the report, selecting the parameter for the target service location, I get this result:
As you will notice, the "Time of Day" is displaying the first result that meets the CASE expression in the Report Properties code; however, I confirmed that ALL "HOD" (stored as an integer) are being grouped together by doing a SUM on this result.
Furthermore, the actual table values (.05, .08, etc) are only returning the results for the HOD that first meets the requirements of the CASE statement in the VB code.
These are the things I need resolved, but can't figure out:
Why isn't the Report Properties VB code displaying "Morning", "Afternoon", "Evening", and "Overnight" in the Time of Day column?
How do I group together the values in the table? So that the AVG would actually be the sum of each AVG for all hours within the designated range and day of week (6-12, 12-18, etc on Monday, Tuesday etc).
To those still reading, thanks for your assistance! Please let me know if you need additional information.
I'm still not sure if I have a clear picture of your table design, but I'm imagining this as a single row group that's grouped on this expression: =Code.GetRangeValueByHour(Fields!HOD.Value). Based on this design and the dataset above, here's how I would solve your two questions:
1. Use the grouping expression as the value of the Time of Day cell, like: =Code.GetRangeValueByHour(Fields!HOD.Value)
2. Add a SUM with a conditional for the values on each day of the week. Example: the expression for Sunday would be =SUM(IIF(Fields!DOW.Value = 1, Fields!AVG.Value, CDec(0))). This uses CDec(0) instead of 0 because the AVG values are decimals, and SSRS will otherwise throw an aggregate of mixed data types error by interpreting 0 as an int.
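To make the intended grouping concrete, here is a small Python sketch of the same logic outside the report: bucket each hour, then sum AVG per bucket and day of week. It mirrors the first-match behaviour of the VB Select Case (so hour 12 lands in "Morning" and hour 17 in "Afternoon"). The sample rows are invented for illustration and are not the asker's data.

```python
from collections import defaultdict
from decimal import Decimal

def time_of_day(hour):
    """Bucket an hour; the first matching range wins, as in VB's Select Case."""
    if 6 <= hour <= 12:
        return "Morning"
    if 12 <= hour <= 17:   # 12 was already taken by "Morning" above
        return "Afternoon"
    if 17 <= hour <= 22:   # 17 was already taken by "Afternoon" above
        return "Evening"
    return "Overnight"

# Hypothetical (DOW, HOD, AVG) rows standing in for the dataset; not real data.
rows = [
    (1, 7, Decimal("0.05")),
    (1, 9, Decimal("0.08")),
    (1, 14, Decimal("0.10")),
    (2, 23, Decimal("0.02")),
]

# One total per (time-of-day group, day of week), like SUM(IIF(DOW = n, AVG, 0)).
totals = defaultdict(lambda: Decimal(0))
for dow, hod, avg in rows:
    totals[(time_of_day(hod), dow)] += avg
```

Each (group, day) key plays the role of one table cell: the row group supplies the time-of-day label and the conditional SUM supplies the per-day column.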
