sum values across any 365 day period - arrays

I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me even close to what I need - not sure how to structure code to make this correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks

If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days with gaps will have a missing value for claim. Now we can calculate a moving sum on the expanded data, ignoring those missing days:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
We use two proc expand steps because, within a single step, the rolling sum is calculated before the days are expanded, and we need it to be calculated after the expansion. You can test this by running the above code all in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;

Use a SQL self join that matches each row to the rows for the same ID dated within the following 365 days. This is time/resource intensive if you have a very large data set.
Assuming you have a date variable, intnx is probably a better way to calculate the date interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the positional group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+364 /* BETWEEN is inclusive, so this is a 365-day window */
/* with real dates: b.day between a.day and intnx('year', a.day, 1, 's')-1 */
group by 1, 2, 3;
quit;
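To get from the joined totals to the output the question asks for (one row per window that exceeds the limit), a HAVING clause can filter the grouped result. This is only a sketch using the mock numbers from the question (a 7-day window and a limit of 75); the table name over_limit is made up for illustration, and it assumes at most one claim per ppt per day, otherwise grouping by ppt and day would double count.

```sas
proc sql;
  create table over_limit as
  select a.ppt,
         a.day as window_start,
         sum(b.claim) as total_claim
  from have as a
  inner join have as b
    on  a.ppt = b.ppt
    and b.day between a.day and a.day + 6  /* 7-day window, both ends inclusive */
  group by a.ppt, a.day
  having sum(b.claim) > 75;                /* keep only windows over the limit */
quit;
```

For real dates, the window width and the limit would need to be swapped for the 365-day equivalents.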

Assuming that you have only one claim per day you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
OBS ppt day claim running_sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14

Working in a known domain of date integers, you can use a single large array to store the claims at each date and slice out the 365 days to be summed. This avoids the bookkeeping required by the modular (circular array) approach.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233
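If the peek/poke calls feel too low level, the same large-array idea can be sketched with an ordinary DO loop over the window. This is an illustrative variant, not the code above: the names want_loop and d are made up, it reads the same have data set, and it assumes the window is the current date plus the 364 days before it (the slice above appears to start at date-365 instead, so the two sums are not expected to match exactly).

```sas
data want_loop;
  set have;
  by id;
  array claims(100000) _temporary_;   /* one slot per SAS date value */
  if first.id then call missing(of claims(*));
  claims(date) = claim;

  /* add up the current date and the 364 days before it */
  rolling_sum_365 = 0;
  do d = max(1, date - 364) to date;
    rolling_sum_365 = sum(rolling_sum_365, claims(d));  /* SUM() skips missing slots */
  end;
  drop d;
run;
```

The loop touches 365 array elements per row, so it is likely slower than the memory copy, but it is easier to adjust if the window definition changes.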

Related

SAS do loop with if statement

I am trying to group my dataset into three month groups, or quarters, but as I'm starting from an arbitrary date, I cannot use the quarter function in SAS.
Example data below of what I have and quarter is the column I need to create in SAS.
The start date is always the same, so my initial quarter will be 3rd Sep 2018 - 3rd Dec 2018 and any active date falling in that quarter will be 1, then quarter 2 will be 3rd Dec 2018 - 3rd Mar 2019 and so on. This cannot be coded manually as the start date will change depending on the data, and the number of quarters could be up to 20+.
The code I have attempted so far is below
data test_Data_op;
set test_data end=eof;
%let j = 0;
%let start_date = start_Date;
if &start_Date. <= effective_dt < (&start_date. + 90) then quarter = &j.+1;
run;
This works and gives the first quarter correctly, but I can't figure out how to loop this for every following quarter? Any help will be greatly appreciated!
No need for a DO loop if you already have the start_date and actual event dates. Just count the number of months and divide by three. Use the continuous method of the INTCK() function to handle start dates that are not the first day of a month.
month_number=intck('month',&start_date,mydate,'cont')+1;
qtr_number=floor((month_number-1)/3)+1;
Based on the comment by @Lee. Edited to match the data from the screenshot.
The example shows that May 11 would be in the 3rd quarter since the seed date is September 3.
data have;
input mydate :yymmdd10.;
format mydate yymmddd10.;
datalines;
2018-09-13
2018-12-12
2019-05-11
;
run;
%let start_date='03sep2018'd;
data want;
set have;
quarter=floor(mod((yrdif(&start_date,mydate)*4),4))+1;
run;
If you want the number of quarters to extend beyond 4 (e.g. September 4, 2019 would be in quarter 5 rather than cycle back to 1), then remove the "mod" from the function:
quarter=floor(yrdif(&start_date,mydate)*4)+1;
The traditional use of quarter means a 3 month time period relative to Jan 1. Make sure your audience understands the phrase quarter in your data presentation actually means 3 months relative to some arbitrary starting point.
This nonstandard quarter can be computed directly from the number of months between the two dates: use INTCK for the baseline month count, then subtract a logical expression that adjusts for the day of the month relative to the start date. No loops required.
For example:
data have;
do startDate = '11feb2019'd ;
do effectiveDate = startDate to startDate + 21*90;
output;
end;
end;
format startDate effectiveDate yymmdd10.;
run;
data want;
set have;
qtr = 1
+ floor(
( intck ('month', startDate, effectiveDate)
-
(day(effectiveDate) < day(startDate))
)
/ 3
);
format qtr 4.;
run;
Extra
Comparing my method (qtr) to @Tom (qtr_number) for a range of startDates:
data have;
retain seq 0;
do startDate = '01jan1999'd to '15jan2001'd;
seq + 1;
do effectiveDate = startDate to startDate + 21*90;
output;
end;
end;
format startDate effectiveDate yymmdd10.;
run;
data want;
set have;
qtr = 1
+ floor( ( intck ('month', startDate, effectiveDate)
- (day(effectiveDate) < day(startDate))
) / 3 );
month_number=intck('month',startDate,effectiveDate,'cont')+1;
qtr_number=floor((month_number-1)/3)+1;
format qtr: month: 4.;
run;
options nocenter nodate nonumber;title;
ods listing;
proc print data=want;
where qtr ne qtr_number;
run;
dm 'output';
-------- OUTPUT ---------
Obs seq startDate effectiveDate qtr month_number qtr_number
56820 31 1999-01-31 1999-04-30 1 4 2
57186 31 1999-01-31 2000-04-30 5 16 6
57551 31 1999-01-31 2001-04-30 9 28 10
57916 31 1999-01-31 2002-04-30 13 40 14
58281 31 1999-01-31 2003-04-30 17 52 18
168391 90 1999-03-31 1999-06-30 1 4 2
168483 90 1999-03-31 1999-09-30 2 7 3
168757 90 1999-03-31 2000-06-30 5 16 6
168849 90 1999-03-31 2000-09-30 6 19 7
169122 90 1999-03-31 2001-06-30 9 28 10
169214 90 1999-03-31 2001-09-30 10 31 11
169487 90 1999-03-31 2002-06-30 13 40 14
169579 90 1999-03-31 2002-09-30 14 43 15
169852 90 1999-03-31 2003-06-30 17 52 18
169944 90 1999-03-31 2003-09-30 18 55 19
280510 149 1999-05-29 2001-02-28 7 22 8
280875 149 1999-05-29 2002-02-28 11 34 12
281240 149 1999-05-29 2003-02-28 15 46 16
282035 150 1999-05-30 2000-02-29 3 10 4
282400 150 1999-05-30 2001-02-28 7 22 8
282765 150 1999-05-30 2002-02-28 11 34 12

Counting the ID and assigning a Year

I have a dataset that looks like this:
data have;
input ID P1 P2 P3 P4;
datalines;
12 10 15 20 30
12 . 20 5 3
12 . . 25 33
12 . . . 30
19 10 15 20 30
19 . 10 17 30
19 . . 5 30
19 . . . 30
;
run;
I am trying to add a variable called Year so that, for each ID, each row of P1-P4 represents one year. The dataset would then look like this:
ID P1 P2 P3 P4 Year
12 10 15 20 30 2017
12 . 20 5 3 2018
12 . . 25 33 2019
12 . . . 30 2020
19 10 15 20 30 2017
19 . 10 17 30 2018
19 . . 5 30 2019
19 . . . 30 2020
I originally used to use this code:
Data Year;
do ID = 1 to 8;
do Year = 2017 to 2020;
output;
end;
end;
run;
data Final;
set have;
Merge Year;
run;
But now that I am working with a different dataset each time and I don't know the structure of the ID, I can't keep changing ID=1 to 8 to suit the dataset each time.
My question: Is there a way to do this through the dataset, possibly a count?
Count ID = 2017;
Year = count + 1;
There is no need to create a second data set that will be merged with the first.
You do need to make assumptions about the grouping in the have data set: that the data is already sorted, or arranged so that a monotonically increasing year value can be assigned to each sequential row in each group.
data want;
set have;
by id;
if first.id
then year = 2017; %* initial year for a group;
else year + 1; %* increment year for subsequent rows of a group;
run;
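A plain BY statement expects the rows in ascending ID order. If the rows for each ID are contiguous but the IDs themselves are not in sorted order, the NOTSORTED option lets the same logic run without sorting first. A minimal sketch, assuming that layout and the 2017 starting year from the question:

```sas
data want;
  set have;
  by id notsorted;                 /* groups only need to be contiguous */
  if first.id then year = 2017;    /* first row of each ID block */
  else year + 1;                   /* sum statement: year is retained across rows */
run;
```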

Looping a proc transpose through multiple data ranges SAS

I am trying to transpose a sequence of ranges from an excel file into SAS. The excel file looks something like this:
31 Dec 01Jan 02Jan 03Jan 04Jan
Book id1 23 24 35 43 98
Book id2 3 4 5 4 1
(few blank rows in between)
05Jan 06Jan 07Jan 08Jan 09Jan
Book id1 14 100 30 23 58
Book id2 2 7 3 8 6
(and it repeats..)
My final output should have a first column for the date and then two additional columns for the book Ids:
Date Book id1 Book id2
31 Dec 23 3
01Jan 24 4
02Jan 35 5
03Jan 43 4
04Jan 98 1
05Jan 14 2
06Jan 100 7
07Jan 30 3
08Jan 23 8
09Jan 58 6
In this particular case I am asking for a simpler method to:
Either import and transpose each range of data separately, replacing the data range with macro variables,
Or import the whole data file first and then create a loop that transposes each range of data
Code I used for a simple import and transpose of a specific data range:
proc import datafile="&input./have.xlsx"
out=want
dbms=xlsx replace;
range="Data$A3:F5" ;
run;
proc transpose data=want
out=want_transposed
name=date;
id A;
run;
Thanks!
A data row that is split over several segments or blocks of rows in an Excel file can be imported raw into SAS and then processed into a categorical form using a DATA Step.
In this example sample data is put into a text file and imported such that the column names are generic VAR-1 ... VAR-n. The generic import is then processed across each row, outputting one SAS data set row per import cell.
The column names in each segment are retained within a temporary array and updated whenever a blank book id is encountered.
* mock data;
filename demo "%sysfunc(pathname(WORK))\demo.txt";
data _null_;
input;
file demo;
put _infile_;
datalines;
., 31Dec, 01Jan, 02Jan, 03Jan, 04Jan
Book_id1, 23 , 24 , 35 , 43 , 98
Book_id2, 3 , 4 , 5 , 4 , 1
., 05Jan, 06Jan, 07Jan, 08Jan, 09Jan
Book_id1, 14 , 100 , 30 , 23 , 58
Book_id2, 2 , 7 , 3 , 8 , 6
run;
* mock import;
proc import replace out=work.haveraw file=demo dbms=csv;
getnames = no;
datarow = 1;
run;
ods listing;
proc print data=haveraw;
run;
When the Excel import is made to look like this:
Obs VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
1 31Dec 01Jan 02Jan 03Jan 04Jan
2 Book_id1 23 24 35 43 98
3 Book_id2 3 4 5 4 1
4
5 05Jan 06Jan 07Jan 08Jan 09Jan
6 Book_id1 14 100 30 23 58
7 Book_id2 2 7 3 8 6
It can be processed in a transposing way, outputting only the name value pairs corresponding to an original cell.
data have (keep=bookid date value);
set haveraw;
array dates(1000) $12 _temporary_ ;
array vars var:;
if missing(var1) then do;
do index = 2 by 1 while (index <= dim(vars));
if not missing(vars(index)) then
dates(index) = put(index-1,z3.) || '_' || vars(index); * adjust as you see fit;
else
dates(index) = '';
end;
end;
else do;
bookid = var1;
do index = 2 by 1 while (index <= dim(vars));
date = dates(index);
value = input(vars(index),??best12.);
output;
end;
end;
run;
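The DATA step above produces one row per book id and date. To reach the layout the question asks for (one row per date, one column per book), a sort followed by PROC TRANSPOSE should be enough. This is a sketch; the data set names have_sorted and book_wide are made up for illustration, and it relies on the z3. sequence prefix in date to keep the dates in their original order.

```sas
proc sort data=have out=have_sorted;
  by date;                          /* the 001_, 002_ ... prefix preserves column order */
run;

proc transpose data=have_sorted out=book_wide(drop=_name_);
  by date;                          /* one output row per date */
  id bookid;                        /* Book_id1, Book_id2 become column names */
  var value;
run;
```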

Using group by in Proc SQL for SAS

I am trying to summarize my data set using proc sql, but I have repeated values in the output. A simple version of my code is:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc,fill_mon,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc,fill_mon;
QUIT;
Some sample output is:
Obs Patid Ndc FILL_mon n_dea DEDUCT
3815 33003605204 00054465029 2000-05 2 0
3816 33003605204 00054465029 2000-05 2 0
12257 33004361450 00406035701 2000-06 2 0
16564 33004744098 00603128458 2000-05 2 0
16565 33004744098 00603128458 2000-05 2 0
16566 33004744098 00603128458 2000-06 2 0
16567 33004744098 00603128458 2000-06 2 0
46380 33008165116 00406035705 2000-06 2 0
85179 33013674758 00406035801 2000-05 2 0
89248 33014228307 00054465029 2000-05 2 0
107514 33016949900 00406035805 2000-06 2 0
135047 33056226897 63481062370 2000-05 2 0
213691 33065594501 00472141916 2000-05 2 0
215192 33065657835 63481062370 2000-06 2 0
242848 33066899581 60432024516 2000-06 2 0
As you can see there is repeated output, for example obs 3815 and 3816. I have seen that some people had a similar problem, but the answers didn't work for me.
The content of the dataset is this:
The SAS System 5
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 210
First Data Page 1
Max Obs per Page 1360
Obs in First Data Page 1310
Number of Data Set Repairs 0
Filename /home/zahram/optum/rx_4.sas7bdat
Release Created 9.0401M2
Host Created Linux
Inode Number 424673574
Access Permission rw-r-----
Owner Name zahram
File Size (bytes) 13828096
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
3 FILL_mon Num 8 YYMMD. Fill month
2 Ndc Char 11 $11. $20. Ndc
1 Patid Num 8 19. Patid
4 n_dea Num 8
5 tot_DEDUCT Num 8
Sort Information
Sortedby Patid Ndc FILL_mon
Validated YES
Character Set ASCII
Sort Option NODUPKEY
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
I'll guess that you have a format on a variable, most likely the date. Proc SQL does not aggregate over formatted values: it groups on the underlying values but still displays them formatted, so distinct days in the same month appear as duplicates. Your proc contents confirms this. You can get around this by converting the variable to a character variable.
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc, put(fill_mon, yymmd.) as fill_month,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc, calculated fill_month;
QUIT;
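If the month should stay a numeric SAS date rather than a character string, another option is to snap every date to the first day of its month before grouping. A sketch along the same lines; the output name work.rx_by_month is made up for illustration, and the column names are taken from the question:

```sas
proc sql;
  create table work.rx_by_month as
  select patid,
         ndc,
         intnx('month', fill_mon, 0, 'b') as fill_month format=yymmd.,  /* first day of the month */
         count(dea)  as n_dea,
         sum(deduct) as tot_deduct
  from perm.rx
  group by patid, ndc, calculated fill_month;
quit;
```

Because the underlying values are now identical within a month, the grouping and the displayed value agree.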

SAS: Calculate an average excluding the current observation

I am searching for an elegant way (or, failing that, an inelegant way) to calculate an average which does not include the current record. So, if I have 30 observations I would end up with 30 different averages. Each would be the average of the other 29 values.
From this made-up data, I would want to create 5 new observations with the averages of A, B, and C not including their own data.
A B C
Albert 12 4 6
Bob 14 7 12
Clyde 6 7 11
Dennis 9 11 7
Earl 8 8 6
I have a vague idea that this will involve proc sql inside a loop. Other ideas or approaches are appreciated.
No loop needed. Use SQL to get the totals for each variable. The average without the current observation is (total sum - value)/(n-1).
data test;
input NAME $ A B C;
datalines;
Albert 12 4 6
Bob 14 7 12
Clyde 6 7 11
Dennis 9 11 7
Earl 8 8 6
;
run;
proc sql noprint;
select count(*),
sum(A),
sum(B),
sum(C)
into :n,
:a,
:b,
:c
from test;
quit;
data test2;
set test;
Ave_A = (&a - a)/(&n-1);
Ave_B = (&b - b)/(&n-1);
Ave_C = (&c - c)/(&n-1);
run;
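The same formula can also be written in a single PROC SQL step, since PROC SQL remerges summary statistics back onto the detail rows (it writes a NOTE about the remerge to the log). A sketch, with test3 as a made-up output name, assuming no missing values in A, B or C:

```sas
proc sql;
  create table test3 as
  select name, a, b, c,
         (sum(a) - a) / (count(a) - 1) as ave_a,  /* mean of the other rows */
         (sum(b) - b) / (count(b) - 1) as ave_b,
         (sum(c) - c) / (count(c) - 1) as ave_c
  from test;
quit;
```

Row order is not guaranteed after a remerge, so add an ORDER BY clause if the original order matters.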
