sas loop over month from variable - loops

I am tryinng to loop over a series of dates in order to create the dates inbetween. This is to be done in steps of month, always displaying the last day of the respective month. The start and end dates are given (first_date and last_date), while the last_date should always refer to the end of the previous month.
The original dataset looks like the following:
customer id first_date last_date
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
...
My goal is to have a dataset which looks like this:
customer id date
xy 135 31.01.2000
xy 135 28.02.2000
...
xy 135 28.02.2005
xy 247 31.03.2003
xy 247 30.04.2003
...
xy 247 28.02.2005
I found the solution to iterate over days quite straightforward (see below), but I am struggling to implement the monthly steps and the end of month dates.
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Thanks for your help!!

First, lets get some data:
data have;
attrib customer length=$10 informat=$10.
id informat=best.
first_date informat=ddmmyy10. format=ddmmyy10.
last_date informat=ddmmyy10. format=ddmmyy10.
;
input customer $
id
first_date
last_date
;
datalines;
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
;
run;
The intnx() function will come to the rescue here. We are going to create a new variable called date, and then use the intnx function to return the end of the month for that date. As long as that date is less than the end date, we will continue to output it to a dataset and then increment to the end of the following month.
data want;
format date ddmmyy10.;
set have;
date = intnx('month',first_date,0,'end');
do while (date le last_date);
output;
date = intnx('month',date,1,'end');
end;
drop first_date last_date;
run;

While I think Rob's answer is the right way to do this, it's probably helpful to see how to do it the way you were trying to.
Starting with this:
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
This gives you too many rows, right? So what you need to do is identify where in the month you are. There are a bunch of ways to do this. Several date functions (like INTNX and INTCK) could be used to tell you where you are; but the easiest is just to compare month(date) with month(date+1). When they're different, you're on the last day of a month!
data want;
set have;
by customer id notsorted;
do day = first_date to last_date;
if month(day) ne month(day+1) then output;
end;
format day date9.;
run;
(I added notsorted since Rob's example data was not sorted, and I'm lazy. Probably not needed in your real case.)
I would note that this probably isn't your ideal solution - Rob's is probably that, in terms of data steps - in terms of speed. This of course will iterate through every day rather than just once per month.

Another option if you have the dataset you created above - with one row per day - is to use PROC EXPAND, if you have the ETS module. It's very handy for things like this.
data intermediate;
set have;
by customer id notsorted;
do day = first_date to last_date;
output;
end;
format day date9.;
run;;;
Here's your day-level data. Then below is the PROC EXPAND statement, asking for monthly data, aligned at the end. id day; identifies the time series variable, and by customer id notsorted; is the normal by statement (what variables identify the observations), with notsorted so they don't have to be in order relative to each other.
proc expand data=intermediate out=want from=day to=month align=end;
id day;
by customer id notsorted;
run;
This gives a slightly different solution than Rob's and my other solution, because it does give you the final row for each if it's not at the end of a month (and does set that final row to the end of the month). If that's desired, great, and our solutions can easily be adapted to give that; if it's not desired, you'll have to remove it afterwards.

You can do this with a simple iterative DO loop by using the date interval functions. Subtract one from the number of intervals to make it end at the last day of the previous month.
data want ;
set have ;
do offset=0 to intck('month',first_date,last_date)-1;
date=intnx('month',first_date,offset,'e');
output;
end;
format date yymmdd10.;
run;

Related

Need to persist values from one record to the subsequent record - lag and retain both work but not giving desired results

I am editing my original question to simplify the problem statement:
I need to create a dataset that contains the principal paydown schedule of a security, which is split into 3 tranches. For each period for the security, I need to calculate the ending balances of principal owed for each tranche. For period 0 (i.e. starting period), I already have the balances owed. For subsequent periods, I need to take the balances from the previous periods and subtract the principal paid down in the current period. The same logic should continue through the last period.
In my SAS code, I am able to get period 1 to do the calculations correctly, but the balances from period 1 don't correctly make it into period 2, causing the calculation to break from that point onwards. I know lag or its placement is what is not working correctly. I am not able to figure out where to place it, or how to use retain (if not lag), such that my balances go from one row to the next.
%let n_t=3;
data xyz;
INFILE DATALINES DLM='#';
input ID $6. period PrincipalPaid best12.2;
datalines;
ABC123#00#0.0
ABC123#01#4.0
ABC123#02#3.92
ABC123#03#3.84
ABC123#04#3.76
ABC123#05#3.69
ABC123#06#3.62
ABC123#07#3.54
;run;
data xyz2;
set xyz;
by id;
if period=0 then do;
Bal1= 120;
Bal2= 8;
Bal3= 2;
end;
/*Code to push all starting balances from period 0 to 1*/
array prev_bal{&N_t.} prev_bal1-prev_bal&n_t.;
array bal{&N_t.} bal1-bal&n_t.;
do i=1 to &N_t.;
prev_bal{i}=lag(bal{i});
end;
/*code to calculate balances for periods >=1*/
if period>=1 then do;
array PrincipalPayDown{&N_t.} PrincipalPayDown1-PrincipalPayDown&N_t.;
do i = 1 to &N_t. ;
PrincipalPayDown{i}=round(PrincipalPaid*prev_bal{i}/sum(of prev_bal:),0.01);
bal{i}=max(prev_bal{i}-PrincipalPayDown{i},0);
end;
end;
drop i ;
run;
proc sql;
create table final as
select
id,period,PrincipalPaid,prev_bal1,prev_bal2,prev_bal3,
PrincipalPayDown1,PrincipalPayDown2,PrincipalPayDown3,Bal1,Bal2,Bal3
from xyz2;
quit;
I am also adding a picture of the final dataset with the correct output calculated in Excel. I want SAS to give me the same output for periods >=2.
Screenshot showing correct output in Excel

Stata: Number of overlapping days within multiple date ranges

I want to calculate the number of overlapping days within multiple date ranges. For example, in the sample data below, there are 167 overlapping days: first from 07jan to 04apr and second from 30may to 15aug.
start end
01jan2000 04apr2000
30may2000 15aug2000
07jan2000 31dec2000
This is fairly crude but gets the job done. Essentially, you
Reshape the data to be in long format, which is usually a good idea when working with panel data in Stata
Fill in gaps between the start and end of each spell
Keep dates that occur more than once
Count the distinct values of dates
clear
/* Fake Data */
input str9(start end)
"01jan2000" "04apr2000"
"30may2000" "15aug2000"
"07jan2000" "31dec2000"
end
foreach var of varlist start end {
gen d = date(`var', "DMY")
drop `var'
gen `var' = d
format %td `var'
drop d
}
/* Count Overlapping Days */
rename (start end) date=
gen spell = _n
reshape long date, i(spell) j(range) string
drop range
xtset spell date, delta(1 day)
tsfill
bys date: keep if _N>1
distinct date

SAS Array Calculations Row Operations

I have a dataset that has a list of contributions of members of a sales organization by day. What I want to ultimately end up with is the following information:
For each day:
How much the entire team sold. ($200 for day one, $350 for day two..)
How much a designated subset ("Joe"...for example) of that team sold (Joe sold $100 day one, $200 day two...)
the difference in the above two calculations ($200-$100 for day one, $350-$200 for day two....)
how many total people contributed that day (2 in day 1, 3 in day two, 5 in day 3)
how many of my designated subset contributed that day (1 every day in this case, since Joe was there every day)
In the example below, Joe is my designated subset. The problem I am having is directing SAS to only sum up Joe's contributions. The method I have below works, but only if Joe is the only contributor AND if he contributes every day. I basically force him to be the first entry, then point to him. This fails if he is not there one day, or if my subset has multiple people.
Below is my attempt I've been working on, but I think I'm going down the wrong path, since this will not be dynamic enough when I add more people. For example, if the subset now becomes Joe and Sue....the calculation will still just point to Joe. If I point it two first two obs, it may select hal accidentally from day one. Is there a way to specify by rom "Only add the Amount column if the name next to it is either Joe or Sue? Help!
*declare team;
/*%let team=('joe','sue');*/
%let team=('joe');
*input data;
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*getting my team to float to top of order list;
data have;
set have;
if name in &team. then order=1;
else order=2;
run;
*order;
proc sort data=have;
by day order name;
run;
*add running count by day;
data have;
set have;
by day;
x+1;
if first.day then x=1;
run;
*get number of people on team;
proc sql noprint;
select count(distinct name) into :count
from have
where name in &team.;
quit;
*get max of people per day;
proc sql noprint;
select max(x) into :max_freq from have;
quit;
*pre transpose...set labels;
data have;
set have;
varname=cats('Name_',x);
value=name;
output;
varname=cats('Amount_',x);
value=amount;
output;
keep day value varname;
run;
*transpose;
proc transpose data=have out=have_transp(drop=_NAME_);
by day;
id varname;
var value;
run;
data want;
set have_transp;
array Amount {*} Amount:;
TOT_Amount=0;
NUM_TOTAL_PEOPLE=0;
do i=1 to dim(Amount);
if Amount[i]>0
then
do;
TOT_Amount+Amount[i];
NUM_TOTAL_PEOPLE+1;
end;
end;
TEAM_CONTRIB=Amount_1;
NON_TEAM_CONTRIB=TOT_Amount-TEAM_CONTRIB;
run;
A few other things:
Every member of the team will not always be present every day
There are very many possibilities for how many people might be on the total team and/or subset
Here's a way using proc means that doesn't use arrays. Proc means will calculate data at different levels by default when using the CLASS and TYPES statements. The data can then be merged into the appropriate level. In this solution it doesn't matter how many people are in the group/subset or that everyone is present for every day.
/*Subset group*/
data subteam;
input name $;
cards;
joe
sue
;
run;
/*Sample data*/
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*Set group variable for subset team;
data have;
set have;
group=0;
run;
*Set group variable=1 to subset;
proc sql;
update have
set group=1
where name in (select name from subteam);
quit;
*Calculate sums;
proc means data=have;
class day group;
types day day*group;
var amount;
output out=want1 sum=total n=count;
run;
*Reformat into desired format;
data want2;
merge want1 (where=(group=.) rename=(total=total_overall count=count_overall))
want1 (where=(group=1) rename=(total=total_group count=count_group));
by day;
run;

SAS: How can I filter for (multiple) entries which are closest to the last day of month (for each month)

I have a large Dataset and want to filter it for all rows with date entry closest to the last day of the month, for each month. So there could be multiple entries for the day closest to the last day of month.
So for instance:
original Dataset
date price name
05-01-1995 1,2 abc
06-01-1995 1,5 def
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
04-02-1995 1,9 jkl
27-02-1995 2,1 mno
goal:
date price name
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
27-02-1995 2,1 mno
I had 2 ideas, but I am failing with implementing it within a loop (for traversing the months) in SAS.
1.idea: create new column wich indicates last day of the current month (intnx() function); then filter for all entries that are closest to the last day of its month:
date price name last_day_of_month
05-01-1995 1,2 abc 31-01-1995
06-01-1995 1,5 def 31-01-1995
07-01-1995 1,8 ghi 31-01-1995
04-02-1995 1,9 jkl 28-02-1995
27-02-1995 2,1 mno 28-02-1995
2.idea: simply filter for each month the entries with highest date (using maybe max function?!)
I would be very glad if you were able to help me, as I am used to ordinary programming languages and just started with SAS for research purposes.
proc sql is one way to solve this kind of situation. I'll break down your original requirements with explanations in how to interpret them in sql.
Since you want to group your observations on date, you can use the having clause to filter on the max date per month.
data work.have;
input date DDMMYY10. price name $;
format date date9.;
datalines;
05-01-1995 1.2 abc
07-01-1995 1.8 ghi
06-01-1995 1.5 def
07-01-1995 1.7 mmm
04-02-1995 1.9 jkl
27-02-1995 2.1 mno
;
data work.want;
input date DDMMYY10. price name $;
format date date9.;
datalines;
07-01-1995 1.8 ghi
07-01-1995 1.7 mmm
27-02-1995 2.1 mno
;
proc sql ;
create table work.want as
select *
/*, max(date) as max_date format=date9.*/
/*, intnx('month',date,0,'end') as monthend format=date9.*/
from work.have
group by intnx('month',date,0,'end')
having max(date) = date
order by date, name
;
If you uncomment the comments, the actual filters used are shown in the output table.
Comparing the the requirements against the solution:
proc compare base=work.want compare=work.solution;
results in
NOTE: No unequal values were found. All values compared are exactly equal.
1) create a new variable periode = put(date,yymmn6.) /* gives you yyyymm*/
2) sort the table on periode and date
3) now a periode.last logic will select the record you need per periode.
Something like...
data tab2;
set your_table;
periode = put(date,yymmn6.);
run;
proc sort data= tab2;
by periode date;
run;
data tab3;
set tab2;
by periode;
if last.periode then output;
run;
You can use two SAS functions called intnx and intck to do this with proc sql:
proc sql ;
create table want as
select *, put(date,yymmn6.) as month, intck('days',date,intnx('month',date,0,'end')) as DaysToEnd
from have
group by month
having (DaysToEnd=min(DaysToEnd))
;quit ;
Intnx() adjusts dates by intervals. In the above case, the four parameters used are:
What size 'step' you want to add/subrate the intervals in.
The date that is being referenced
How many interval steps to make
How to 'round' the step (eg round it to the start/end/middle of the resultant day/week/year)
Intck() simply counts interval steps between two dates
This will give you all records which fall on the day closest to the end of the month
Another approach is by using proc rank;
data mid;
retain yrmth date;
set have;
format date yymmddn8.;
yrmth = put(date,yymmn6.);
run;
proc sort data = mid;
by yrmth descending date;
run;
proc rank data = mid out = want descending ties=low;
by yrmth;
var date;
ranks rankdt;
run;
data want1;
set want;
where rankdt = 1;
run;
HTH

SAS: sum all values except one

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!
You can do it in a single proc sql instead, using key word calculated:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;
Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
from orig_data a
left join temp b;
quit;
Haven't tested it but the above should work. It creates a new dataset temp that has the sum of all grades and merges that back to create a new table with the sum of all grades less the current students grade as sum_other_grades.
This solution performs takes each observation of your starting dataset, and then loops through the same dataset summing up grade values for any records with different names, so beginning with 'Sam', we only add the oth_g variable when we find names that are NOT 'Sam':
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;
This is a slight modification to the answer #Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (like I have done above by refactoring #reese 's query.
Personally I think the refactored version is also easier to understand.

Resources