SAS: How can I filter for (multiple) entries which are closest to the last day of month (for each month) - loops

I have a large Dataset and want to filter it for all rows with date entry closest to the last day of the month, for each month. So there could be multiple entries for the day closest to the last day of month.
So for instance:
original Dataset
date price name
05-01-1995 1,2 abc
06-01-1995 1,5 def
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
04-02-1995 1,9 jkl
27-02-1995 2,1 mno
goal:
date price name
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
27-02-1995 2,1 mno
I had 2 ideas, but I am failing with implementing it within a loop (for traversing the months) in SAS.
1.idea: create a new column which indicates the last day of the current month (intnx() function); then filter for all entries that are closest to the last day of their month:
date price name last_day_of_month
05-01-1995 1,2 abc 31-01-1995
06-01-1995 1,5 def 31-01-1995
07-01-1995 1,8 ghi 31-01-1995
04-02-1995 1,9 jkl 28-02-1995
27-02-1995 2,1 mno 28-02-1995
2.idea: simply filter for each month the entries with highest date (using maybe max function?!)
I would be very glad if you were able to help me, as I am used to ordinary programming languages and just started with SAS for research purposes.

proc sql is one way to solve this kind of situation. I'll break down your original requirements and explain how to interpret them in sql.
Since you want to group your observations on date, you can use the having clause to filter on the max date per month.
data work.have;
input date DDMMYY10. price name $;
format date date9.;
datalines;
05-01-1995 1.2 abc
07-01-1995 1.8 ghi
06-01-1995 1.5 def
07-01-1995 1.7 mmm
04-02-1995 1.9 jkl
27-02-1995 2.1 mno
;
data work.want;
input date DDMMYY10. price name $;
format date date9.;
datalines;
07-01-1995 1.8 ghi
07-01-1995 1.7 mmm
27-02-1995 2.1 mno
;
proc sql ;
create table work.solution as
select *
/*, max(date) as max_date format=date9.*/
/*, intnx('month',date,0,'end') as monthend format=date9.*/
from work.have
group by intnx('month',date,0,'end')
having max(date) = date
order by date, name
;
quit;
If you uncomment the comments, the actual filters used are shown in the output table.
Comparing the requirements against the solution:
proc compare base=work.want compare=work.solution;
run;
results in
NOTE: No unequal values were found. All values compared are exactly equal.
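As a cross-check of the logic (not SAS), the group-by-month / having max(date) = date idea can be sketched in Python. The function name latest_rows_per_month and the tuple layout are assumptions made for this illustration only; the rows are the sample data from work.have.

```python
from datetime import date

def latest_rows_per_month(rows):
    """Keep every row whose date equals the latest date seen in its (year, month)."""
    latest = {}
    for d, _price, _name in rows:
        key = (d.year, d.month)
        if key not in latest or d > latest[key]:
            latest[key] = d
    # Ties on the latest date are all kept, like "having max(date) = date".
    return [r for r in rows if r[0] == latest[(r[0].year, r[0].month)]]

rows = [
    (date(1995, 1, 5), 1.2, "abc"),
    (date(1995, 1, 7), 1.8, "ghi"),
    (date(1995, 1, 6), 1.5, "def"),
    (date(1995, 1, 7), 1.7, "mmm"),
    (date(1995, 2, 4), 1.9, "jkl"),
    (date(1995, 2, 27), 2.1, "mno"),
]
kept_names = [name for _, _, name in latest_rows_per_month(rows)]
print(kept_names)
```

Both rows dated 07-01-1995 survive, which is the behaviour the question asks for.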

1) create a new variable periode = put(date,yymmn6.) /* gives you yyyymm*/
2) sort the table on periode and date
3) now a last.periode logic will select the record you need per periode.
Something like...
data tab2;
set your_table;
periode = put(date,yymmn6.);
run;
proc sort data= tab2;
by periode date;
run;
data tab3;
set tab2;
by periode;
if last.periode then output;
run;
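One thing worth checking with this approach: a last.-style filter keeps exactly one row per period, so if several rows share the latest date only one of them survives. The Python sketch below (not SAS; the variable names are made up for the illustration) contrasts "last row per period" with a variant that keeps all ties.

```python
from datetime import date

rows = [
    (date(1995, 1, 5), "abc"),
    (date(1995, 1, 6), "def"),
    (date(1995, 1, 7), "ghi"),
    (date(1995, 1, 7), "mmm"),
    (date(1995, 2, 4), "jkl"),
    (date(1995, 2, 27), "mno"),
]

def period(d):
    # Same idea as put(date, yymmn6.): a yyyymm key.
    return d.strftime("%Y%m")

rows_sorted = sorted(rows, key=lambda r: (period(r[0]), r[0]))

# "last.periode" analogue: later rows overwrite earlier ones in the same period.
last_per_period = {}
for r in rows_sorted:
    last_per_period[period(r[0])] = r
one_per_period = [name for _, name in last_per_period.values()]

# Tie-preserving variant: keep every row dated on the period's last date.
with_ties = [name for d, name in rows_sorted
             if d == last_per_period[period(d)][0]]

print(one_per_period)
print(with_ties)
```

If multiple entries on the closest day must all be kept, the tie-preserving comparison against the period's last date is the safer pattern.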

You can use two SAS functions called intnx and intck to do this with proc sql:
proc sql ;
create table want as
select *, put(date,yymmn6.) as month, intck('day',date,intnx('month',date,0,'end')) as DaysToEnd
from have
group by month
having (DaysToEnd=min(DaysToEnd))
;quit ;
intnx() shifts a date by a number of intervals. The four arguments used above are:
1) the interval type, i.e. the size of the step (here 'month')
2) the date being referenced
3) how many interval steps to move (0 stays in the same month)
4) the alignment, i.e. whether to return the start, middle or end of the resulting interval (here 'end')
intck() simply counts the interval steps between two dates.
The query therefore returns all records that fall on the day closest to the end of each month.
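To make the two quantities concrete, here is a small Python sketch (not SAS) of what the month-end date and the DaysToEnd distance look like for the sample dates. The helper names month_end and days_to_end are hypothetical and only mirror the role of the intnx()/intck() calls above.

```python
import calendar
from datetime import date

def month_end(d):
    """Last calendar day of d's month (the role played by intnx('month', d, 0, 'end'))."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=last_day)

def days_to_end(d):
    """Number of days from d to the end of its month."""
    return (month_end(d) - d).days

print(month_end(date(1995, 2, 27)), days_to_end(date(1995, 2, 27)))
print(month_end(date(1995, 1, 7)), days_to_end(date(1995, 1, 7)))
```

Within each month, the rows with the smallest days_to_end value are the ones the having clause keeps.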

Another approach is by using proc rank;
data mid;
retain yrmth date;
set have;
format date yymmddn8.;
yrmth = put(date,yymmn6.);
run;
proc sort data = mid;
by yrmth descending date;
run;
proc rank data = mid out = want descending ties=low;
by yrmth;
var date;
ranks rankdt;
run;
data want1;
set want;
where rankdt = 1;
run;
HTH

Related

Data Studio | How to write case for >= or <= to filter data for Date dimension

Requirement: My data table has 2 fields i.e. Name and Date of joining (DOJ). I want to count users who joined on or before 30-Jan-21.
Solution tried: I created a calculated field using CASE i.e.
CASE WHEN CAST(FORMAT_DATETIME("%Y%m%d", DOJ) AS NUMBER ) <= 20210130 THEN DOJ END.
Issue: After creating the field, I aggregated it by count and used the field as a metric, but it's not giving the count of users who joined on or before 30-Jan-21.
Data Table preview
Name DOJ
John Smith 04/01/2021
Dexter Morgan 13/01/2021
Debra Morgan 18/01/2021
Kyle Butler 21/01/2021
Rita Benett 25/01/2021
Angel Batista 31/01/2021
Maria LaGuerta 01/02/2021
Vince Masuka 17/02/2021
Joey Quinn 26/03/2021
Arthur Mitchell 05/04/2021
Thomas Matthews 25/05/2021
Solution
I created a field to convert Date into Number and created another calculated field where I used both the conditions i.e. >= and <= and got the desired result.
Formula for converting Date to number:
CAST(FORMAT_DATETIME("%Y%m%d", DOJ) AS NUMBER)
Formula for counting DOJ's in Month of Jan'21 i.e. on and before 30-Jan-21
CASE WHEN DOJ <= 20210130 AND DOJ >=20210101 then DOJ END
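The idea behind the formulas (turn each date into a YYYYMMDD integer, then compare it against numeric bounds) can be sketched in Python. This is only an illustration of the comparison logic, not Data Studio syntax; the helper yyyymmdd is a hypothetical name, and the dates are the DOJ values from the table above read as day/month/year.

```python
from datetime import date

def yyyymmdd(d):
    """Date as an integer such as 20210130, so ordinary numeric comparison works."""
    return int(d.strftime("%Y%m%d"))

doj = [
    date(2021, 1, 4), date(2021, 1, 13), date(2021, 1, 18), date(2021, 1, 21),
    date(2021, 1, 25), date(2021, 1, 31), date(2021, 2, 1), date(2021, 2, 17),
    date(2021, 3, 26), date(2021, 4, 5), date(2021, 5, 25),
]

# Count joiners between 1-Jan-21 and 30-Jan-21 inclusive.
joined_in_window = sum(1 for d in doj if 20210101 <= yyyymmdd(d) <= 20210130)
print(joined_in_window)
```

Because the integer keeps year, month and day in descending significance, numeric order matches date order.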

Merging time series with different number of observations where variables have the same name (SAS)

I have a bunch of time series data (sas-files) which I like to merge / combine up to a larger table (I am fairly new to SAS).
Filenames:
cq_ts_SYMBOL, where SYMBOL is the respective symbol for each file
with the following structure:
cq_ts_AAA.sas7bdat: file1
SYMBOL DATE TIME BID ASK MID
AAA 20100101 9:30:00 10.375 10.4 .
AAA 20100101 9:31:00 10.38 10.4 .
.
.
AAA 20150101 15:59:00 15 15.1 .
cq_ts_BBB.sas7bdat: file2
SYMBOL DATE TIME BID ASK MID
BBB 20120101 9:30:00 12.375 12.4 .
BBB 20120102 9:31:00 12.38 12.4 .
.
.
BBB 20170101 15:59:00 20 20.1 .
Key characteristics:
- They have the same variable name
- They have different number of observations
- They are all saved in the same folder
So what I want to do is:
- Create 3 tables: BID-table, ASK-table, Mid-table with the following structure, ie for bid-table, cq_ts_bid.sas7bdat:
DATE TIME AAA BBB ...
20100101 9:30:00 10.375 .
20100102 9:31:00 10.38 .
.
.
20120101 9:30:00 9.375 12.375
20120102 9:31:00 9.38 12.38
.
.
20150101 15:59:00 15 17
.
.
20170101 15:59:00 . 20
It is not all too difficult to do it for 2 stock time series; however, I was wondering whether there is the possibility to do the following:
From data set cq_ts_AAA take DATE TIME BID and rename BID to AAA (either from the values in symbol? does this make sense? or get the name from the filename).
Do the same for cq_ts_BBB.
In fact, loop through the folder to get the number of files and filenames (this part I got more or less, see below).
Merge cq_ts_AAA and cq_ts_BBB having DATE TIME AAA (former bid price of AAA) BBB (former bid price of BBB), for all the files in the folder.
Do this for BID, then for ASK and finally MID (actually I couldn't get the midpoint variable from bid and ask (i.e. mid= (bid + ask) / 2;) just gives me the "." in the previous data steps when creating the files).
I think a macro to first get each single file then rename (when should this step take place?) it and merge them together - like a double loop.
Here the renaming and merging part:
data ALDW_short (rename=(iprice = ALDW));
set output.cq_ts_aldw;
retain date time ALJ;
run;
data ALJ_short (rename= (iprice = ALJ));
set output.cq_ts_alj;
retain date time datetime ALJ;
run;
data ALDW_ALJ_merged (keep= date itime ALDW ALJ);
merge ALDW_short ALJ_short;
by datetime;
run;
This is the part to loop through the folder and get a list of names:
proc contents data = output._all_ out = outputcont(keep = memname) noprint;
run;
proc sort data = outputcont nodupkey;
by memname;
run;
data _null_;
set outputcont end = last;
by memname;
i+1;
call symputx('name'||trim(left(put(i,8.))),memname);
if last then call symputx('count',i);
run;
Would it make sense to extract the symbol (and how? they have different length) from the filename or just to take them from the variable SYMBOL (and how can I get the one value to rename my column?)?
Somehow I have difficulty changing the order of columns, ie. I tried with retain and format.
Looks like you could do this easily with PROC TRANSPOSE. Combine your datasets into a single dataset.
data all ;
set output.cq_ts_: ;
by date time;
run;
Then use PROC TRANSPOSE for each of your source variables/target tables.
proc transpose data=all out=bid ;
by date time ;
id symbol;
var bid;
run;
Given your example data a formula for MID of
mid = (bid + ask)/2 ;
Should work. Most likely, if you got all missing values, you probably put the assignment statement before the SET or INPUT statement. In other words, you were trying to calculate using values that had not been read in yet.
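The stack-then-transpose idea can be illustrated outside SAS with a short Python sketch: stack all rows in long form, then pivot one value column so that each symbol becomes its own column keyed by (date, time). The function name to_wide, the tuple layout, and the zero-padded time strings are assumptions made for this example.

```python
def to_wide(rows, value_index):
    """Pivot long rows (symbol, date, time, bid, ask) into {(date, time): {symbol: value}}."""
    wide = {}
    for row in rows:
        symbol, day, time = row[0], row[1], row[2]
        wide.setdefault((day, time), {})[symbol] = row[value_index]
    return dict(sorted(wide.items()))

rows = [
    ("AAA", "20120101", "09:30:00", 9.375, 9.4),
    ("BBB", "20120101", "09:30:00", 12.375, 12.4),
    ("AAA", "20100101", "09:30:00", 10.375, 10.4),
]

bid = to_wide(rows, 3)
ask = to_wide(rows, 4)
# MID is computed from values that have already been read, one row at a time.
mid = to_wide([r + ((r[3] + r[4]) / 2,) for r in rows], 5)

for key, by_symbol in bid.items():
    # A symbol with no quote at this timestamp shows up as None (SAS would show ".").
    print(key, by_symbol.get("AAA"), by_symbol.get("BBB"))
```

The same pivot is run once per value column, which corresponds to one PROC TRANSPOSE call per target table.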

sas loop over month from variable

I am trying to loop over a series of dates in order to create the dates in between. This is to be done in monthly steps, always displaying the last day of the respective month. The start and end dates are given (first_date and last_date), while the last_date should always refer to the end of the previous month.
The original dataset looks like the following:
customer id first_date last_date
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
...
My goal is to have a dataset which looks like this:
customer id date
xy 135 31.01.2000
xy 135 28.02.2000
...
xy 135 28.02.2005
xy 247 31.03.2003
xy 247 30.04.2003
...
xy 247 28.02.2005
I found the solution to iterate over days quite straightforward (see below), but I am struggling to implement the monthly steps and the end of month dates.
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Thanks for your help!!
First, let's get some data:
data have;
attrib customer length=$10 informat=$10.
id informat=best.
first_date informat=ddmmyy10. format=ddmmyy10.
last_date informat=ddmmyy10. format=ddmmyy10.
;
input customer $
id
first_date
last_date
;
datalines;
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
;
run;
The intnx() function will come to the rescue here. We are going to create a new variable called date, and then use the intnx function to return the end of the month for that date. As long as that date is less than the end date, we will continue to output it to a dataset and then increment to the end of the following month.
data want;
format date ddmmyy10.;
set have;
date = intnx('month',first_date,0,'end');
do while (date le last_date);
output;
date = intnx('month',date,1,'end');
end;
drop first_date last_date;
run;
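The same loop can be traced in Python (a sketch, not SAS): start at the month end of first_date and keep stepping to the next month end while the date does not exceed last_date. The names month_end and month_ends_between are hypothetical helpers for this illustration.

```python
import calendar
from datetime import date, timedelta

def month_end(d):
    """Last calendar day of d's month."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

def month_ends_between(first_date, last_date):
    """Month-end dates from first_date's month onward that are <= last_date."""
    out = []
    current = month_end(first_date)
    while current <= last_date:
        out.append(current)
        # The day after a month end is the 1st of the next month.
        current = month_end(current + timedelta(days=1))
    return out

print(month_ends_between(date(2010, 12, 15), date(2011, 2, 28)))
```

Note that a last_date that is itself a month end (such as 28.02.2011) is included by the "le" comparison; if that row is not wanted, the comparison would need to be strict.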
While I think Rob's answer is the right way to do this, it's probably helpful to see how to do it the way you were trying to.
Starting with this:
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
This gives you too many rows, right? So what you need to do is identify where in the month you are. There are a bunch of ways to do this. Several date functions (like INTNX and INTCK) could be used to tell you where you are; but the easiest is just to compare month(date) with month(date+1). When they're different, you're on the last day of a month!
data want;
set have;
by customer id notsorted;
do day = first_date to last_date;
if month(day) ne month(day+1) then output;
end;
format day date9.;
run;
(I added notsorted since Rob's example data was not sorted, and I'm lazy. Probably not needed in your real case.)
I would note that this probably isn't your ideal solution in terms of speed; among data step approaches, Rob's probably is. This one iterates through every day rather than stepping once per month.
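The month(day) ne month(day+1) test is easy to verify in isolation. A minimal Python sketch of the same check (is_month_end is a name chosen for this example):

```python
from datetime import date, timedelta

def is_month_end(d):
    """True when the next calendar day falls in a different month."""
    return d.month != (d + timedelta(days=1)).month

def month_ends_by_day_scan(first_date, last_date):
    """Walk every day in the range and keep only the month ends."""
    n_days = (last_date - first_date).days
    days = (first_date + timedelta(days=k) for k in range(n_days + 1))
    return [d for d in days if is_month_end(d)]

print(month_ends_by_day_scan(date(2000, 1, 1), date(2000, 3, 25)))
```

The check also handles leap years and the December to January rollover, since it only compares month numbers of adjacent days.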
Another option if you have the dataset you created above - with one row per day - is to use PROC EXPAND, if you have the ETS module. It's very handy for things like this.
data intermediate;
set have;
by customer id notsorted;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Here's your day-level data. Then below is the PROC EXPAND statement, asking for monthly data, aligned at the end. id day; identifies the time series variable, and by customer id notsorted; is the normal by statement (what variables identify the observations), with notsorted so they don't have to be in order relative to each other.
proc expand data=intermediate out=want from=day to=month align=end;
id day;
by customer id notsorted;
run;
This gives a slightly different solution than Rob's and my other solution, because it does give you the final row for each if it's not at the end of a month (and does set that final row to the end of the month). If that's desired, great, and our solutions can easily be adapted to give that; if it's not desired, you'll have to remove it afterwards.
You can do this with a simple iterative DO loop by using the date interval functions. Subtract one from the number of intervals to make it end at the last day of the previous month.
data want ;
set have ;
do offset=0 to intck('month',first_date,last_date)-1;
date=intnx('month',first_date,offset,'e');
output;
end;
format date yymmdd10.;
run;
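A Python sketch of this offset-based loop (not SAS) may help to see which dates it produces. It assumes that intck('month', ...) counts month boundaries crossed, which is modelled here by months_between; both helper names are made up for the example.

```python
import calendar
from datetime import date

def months_between(a, b):
    """Number of month boundaries crossed going from a to b."""
    return (b.year - a.year) * 12 + (b.month - a.month)

def month_end_offset(d, offset):
    """Last day of the month that lies `offset` months after d's month."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + offset, 12)
    return date(year, month0 + 1, calendar.monthrange(year, month0 + 1)[1])

def month_ends_before_last(first_date, last_date):
    # range(n) yields offsets 0 .. n-1, like "do offset=0 to intck(...)-1".
    n = months_between(first_date, last_date)
    return [month_end_offset(first_date, k) for k in range(n)]

result = month_ends_before_last(date(2003, 3, 19), date(2005, 3, 25))
print(result[0], result[-1], len(result))
```

Unlike a loop that compares each month end with last_date, this version always stops at the month before last_date's month, so a last_date of 28.02.2011 ends the series at 31.01.2011.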

SAS Array Calculations Row Operations

I have a dataset that has a list of contributions of members of a sales organization by day. What I want to ultimately end up with is the following information:
For each day:
How much the entire team sold. ($200 for day one, $350 for day two..)
How much a designated subset ("Joe"...for example) of that team sold (Joe sold $100 day one, $200 day two...)
the difference in the above two calculations ($200-$100 for day one, $350-$200 for day two....)
how many total people contributed that day (2 in day 1, 3 in day two, 5 in day 3)
how many of my designated subset contributed that day (1 every day in this case, since Joe was there every day)
In the example below, Joe is my designated subset. The problem I am having is directing SAS to only sum up Joe's contributions. The method I have below works, but only if Joe is the only contributor AND if he contributes every day. I basically force him to be the first entry, then point to him. This fails if he is not there one day, or if my subset has multiple people.
Below is the attempt I've been working on, but I think I'm going down the wrong path, since it will not be dynamic enough when I add more people. For example, if the subset becomes Joe and Sue, the calculation will still just point to Joe. If I point it to the first two obs, it may accidentally select Hal from day one. Is there a way to specify by row "only add the Amount column if the name next to it is either Joe or Sue"? Help!
*declare team;
/*%let team=('joe','sue');*/
%let team=('joe');
*input data;
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*getting my team to float to top of order list;
data have;
set have;
if name in &team. then order=1;
else order=2;
run;
*order;
proc sort data=have;
by day order name;
run;
*add running count by day;
data have;
set have;
by day;
x+1;
if first.day then x=1;
run;
*get number of people on team;
proc sql noprint;
select count(distinct name) into :count
from have
where name in &team.;
quit;
*get max of people per day;
proc sql noprint;
select max(x) into :max_freq from have;
quit;
*pre transpose...set labels;
data have;
set have;
varname=cats('Name_',x);
value=name;
output;
varname=cats('Amount_',x);
value=amount;
output;
keep day value varname;
run;
*transpose;
proc transpose data=have out=have_transp(drop=_NAME_);
by day;
id varname;
var value;
run;
data want;
set have_transp;
array Amount {*} Amount:;
TOT_Amount=0;
NUM_TOTAL_PEOPLE=0;
do i=1 to dim(Amount);
if Amount[i]>0
then
do;
TOT_Amount+Amount[i];
NUM_TOTAL_PEOPLE+1;
end;
end;
TEAM_CONTRIB=Amount_1;
NON_TEAM_CONTRIB=TOT_Amount-TEAM_CONTRIB;
run;
A few other things:
Every member of the team will not always be present every day
There are very many possibilities for how many people might be on the total team and/or subset
Here's a way using proc means that doesn't use arrays. Proc means will calculate data at different levels by default when using the CLASS and TYPES statements. The data can then be merged into the appropriate level. In this solution it doesn't matter how many people are in the group/subset or that everyone is present for every day.
/*Subset group*/
data subteam;
input name $;
cards;
joe
sue
;
run;
/*Sample data*/
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*Set group variable for subset team;
data have;
set have;
group=0;
run;
*Set group variable=1 to subset;
proc sql;
update have
set group=1
where name in (select name from subteam);
quit;
*Calculate sums;
proc means data=have;
class day group;
types day day*group;
var amount;
output out=want1 sum=total n=count;
run;
*Reformat into desired format;
data want2;
merge want1 (where=(group=.) rename=(total=total_overall count=count_overall))
want1 (where=(group=1) rename=(total=total_group count=count_group));
by day;
run;
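For readers who want to check the expected numbers, here is a Python sketch (not SAS) of the same per-day aggregation: totals and counts overall and for the subset, plus their difference. The function name summarize and the output keys are assumptions for this illustration. One difference from the merge above: a day without any subset member gets 0 here, whereas the SAS merge would leave the group columns missing.

```python
def summarize(rows, team):
    """Per-day totals and head counts, overall and for the members of `team`."""
    out = {}
    for day, name, amount in rows:
        s = out.setdefault(day, {"total_overall": 0, "count_overall": 0,
                                 "total_group": 0, "count_group": 0})
        s["total_overall"] += amount
        s["count_overall"] += 1
        if name in team:
            s["total_group"] += amount
            s["count_group"] += 1
    for s in out.values():
        s["non_group"] = s["total_overall"] - s["total_group"]
    return out

rows = [
    (1, "hal", 100), (1, "joe", 100),
    (2, "joe", 80), (2, "sue", 70), (2, "jim", 200),
    (3, "joe", 50), (3, "sue", 100), (3, "ted", 200), (3, "tim", 100), (3, "wen", 5000),
]
summary = summarize(rows, {"joe", "sue"})
for day, s in summary.items():
    print(day, s)
```

Membership is tested per row, so it does not matter how many people are in the subset or whether they appear every day.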

Merging Data to Run Specific Individual Analysis

I have two data sets. FIRST is a list of products and their daily prices from a supplier and SECOND is a list of start and end dates (as well as other important data for analysis). How can I tell Stata to pull the price at the beginning date and then the price at the end date from FIRST into SECOND for the given dates. Please note, if there is no exact matching date I would like it to grab the last date available. For example, if SECOND has the date 1/1/2013 and FIRST has prices on ... 12/30/2012, 12/31/2012, 1/2/2013, ... it would grab the 12/31/2012 price.
I would usually do this with Excel, but I have millions of observations, and it is not feasible.
I have put an example of FIRST and SECOND as well as what the optimal solution would give as an output POST_SECOND
FIRST
Product Price Date
1 3 1/1/2010
1 3 1/3/2010
1 4 1/4/2010
1 2 1/8/2010
2 1 1/1/2010
2 5 2/5/2010
3 7 12/26/2009
3 2 1/1/2010
3 6 4/3/2010
SECOND
Product Start Date End Date
1 1/3/2010 1/4/2010
2 1/1/2010 1/1/2010
3 12/26/2009 4/3/2010
POST_SECOND
Product Start Date End Date Price_Start Price_End
1 1/3/2010 1/4/2010 3 4
2 1/1/2010 1/1/2010 1 1
3 12/26/2009 4/3/2010 7 6
Here's a merge/keep/sort/collapse* solution that relies on using the last date. I altered your example data slightly.
/* Make Fake Data & Convert Dates to Date Format */
clear
input byte Product byte Price str12 str_date
1 3 "1/1/2010"
1 3 "1/3/2010"
1 4 "1/4/2010"
1 2 "1/8/2010"
2 1 "1/1/2010"
2 5 "2/5/2010"
3 7 "12/26/2009"
3 7 "12/28/2009"
3 2 "1/1/2010"
3 6 "4/3/2010"
4 8 "12/30/2012"
4 9 "12/31/2012"
4 10 "1/2/2013"
4 10 "1/3/2013"
end
gen Date = date(str_date,"MDY")
format Date %td
drop str_date
save "First.dta", replace
clear
input byte Product str12 str_Start_Date str12 str_End_Date
1 "1/3/2010" "1/4/2010"
2 "1/1/2010" "1/1/2010"
3 "12/27/2009" "4/3/2010"
4 "1/1/2013" "1/2/2013"
end
gen Start_Date = date(str_Start_Date,"MDY")
gen End_Date = date(str_End_Date,"MDY")
format Start_Date End_Date %td
drop str_*
save "Second.dta", replace
/* Data Transformation */
use "First.dta", clear
merge m:1 Product using "Second.dta", nogen
bys Product: egen ads = min(abs(Start_Date-Date))
bys Product: egen ade = min(abs(End_Date - Date))
keep if (ads==abs(Date - Start_Date) & Date <= Start_Date) | (ade==abs(Date - End_Date) & Date <= End_Date)
sort Product Date
collapse (first) Price_Start = Price (last) Price_End = Price, by(Product Start_Date End_Date)
list, clean noobs
*Some people are reshapers. Others are collapsers. Often both can get the job done, but I think collapse is easier in this case.
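The underlying rule ("take the price on the given date, or on the last earlier date if there is no exact match") can be stated compactly with a binary search. The Python sketch below is only an illustration of that rule, not Stata; price_on_or_before is a hypothetical helper and the history is Product 4 from the fake data above.

```python
from bisect import bisect_right
from datetime import date

def price_on_or_before(history, when):
    """history: list of (date, price) sorted by date. Returns None if nothing is early enough."""
    dates = [d for d, _ in history]
    i = bisect_right(dates, when)  # index of the first date strictly after `when`
    return history[i - 1][1] if i else None

product_4 = [
    (date(2012, 12, 30), 8),
    (date(2012, 12, 31), 9),
    (date(2013, 1, 2), 10),
    (date(2013, 1, 3), 10),
]

price_start = price_on_or_before(product_4, date(2013, 1, 1))
price_end = price_on_or_before(product_4, date(2013, 1, 2))
print(price_start, price_end)
```

A start date of 1/1/2013 picks up the 12/31/2012 price, matching the behaviour described in the question.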
In Stata, I've never been able to get something like this to work nicely in one step (something you can do in SAS via a SQL call). In any case, I think you'd be better off creating an intermediate file from FIRST.dta and then merging that 2x on each of your StartDate and EndDate variables in SECOND.dta.
Say you have data for price adjustments from Jan 1, 2010 to Dec 31, 2013 (specified with varied intervals as you have shown above). I assume all the date variables are already in date format in FIRST.dta & SECOND.dta, and that variable names in SECOND do not have spaces in them.
tempfile prod prices
use FIRST.dta, clear
keep Product
duplicates drop
save `prod'
clear
set obs 1461 /* one row per day from Jan 1, 2010 through Dec 31, 2013 */
g Date=date("12-31-2009","MDY")+_n
format Date %td
cross using `prod'
merge 1:1 Product Date using FIRST.dta, assert(1 3) nogen
gsort +Product +Date /*this ensures the data are sorted properly for the next step */
replace Price=Price[_n-1] if Price==. & Product==Product[_n-1]
save `prices'
use SECOND.dta, clear
foreach i in Start End {
rename `i'Date Date
merge 1:1 Product Date using `prices', assert(2 3) keep(3) nogen
rename Price Price_`i'
rename Date `i'Date
}
This should work if I understand your data structures correctly, and it should address the issue being discussed in the comments to @Dimitriy's answer. I'm open to critiques on how to make this nicer, as it's something I've had to do a few times and this is how I usually go about it.
