I have the following example data set with an ID and the contract status in six months (01/2017 - 06/2017).
Example data:
ID Month1 Month2 Month3 Month4 Month5 Month6
12 5 5 5 5 5 5
34 5 5 6 6 5 5
56 6 6 6 -7 -7 -7
78 6 6 5 5 5 5
12 5 5 5 5 6 -7
If the status is 5 the ID is active, if 6 it's canceled and -7 is "not able to reactivate".
I want to check two kinds of changes:
1) IDs which change from status 5 to 6
2) IDs which change from 6 to 5
When the status changes from 5 to 6 I want a new variable "churn" containing the month in which the status changes to 6.
For the second group, I want a new variable "reactivation" containing the month in which the status changes to 5.
If an ID is in both groups (from 5 to 6 to 5) both variables should be filled.
What I have so far is an array, which shows me how many status matches occur in one row, but I do not get the next step. Here is the code:
data want (drop= i j);
set have (obs=100);
array stat_check {*} month1-month6;
sum=0;
do i=1 to dim(stat_check)-1;
do j=i+1 to dim(stat_check);
sum=sum(sum,stat_check(i) eq stat_check(j));
end;
end;
run;
Thanks in advance!
For an array approach, sounds like you need to compare each variable in the array to the variable immediately before it. You don't need two passes through the array, only one. You want to compare month2 to month1, month3 to month2 ... month6 to month5.
I would try something like (untested):
data want (drop= month);
set have (obs=100);
array stat_check {*} month1-month6;
do month=2 to dim(stat_check);
/* 5 to 6 is a difference of +1 (churn); 6 to 5 is -1 (reactivation) */
if stat_check{month}-stat_check{month-1} = 1 then Churn=month;
else if stat_check{month}-stat_check{month-1} = -1 then Reactivation=month;
end;
run;
If you could have multiple churns or multiple reactivations for the same ID, that would capture the latest churn or reactivation.
But honestly, I would transpose the data to have one row per ID-month. That would avoid the need for an array, and would allow you to capture multiple churns/reactivations. Generally it is easier to work with tall skinny data rather than short wide data. For example, it would be easy to count the number of months each ID was active.
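A minimal, untested sketch of the tall approach, assuming have holds ID and month1-month6 with one row per ID. The dataset names tall and events and the variables status, prev and event are made up for illustration:

```sas
/* Reshape: one row per ID-month */
data tall (keep=id month status);
  set have;
  array m {*} month1-month6;
  do month = 1 to dim(m);
    status = m{month};
    output;
  end;
run;

/* Flag every 5-to-6 and 6-to-5 transition; an ID can yield several rows */
data events (keep=id month event);
  set tall;
  by id notsorted;
  length event $12;
  prev = lag(status);            /* call LAG on every row, never conditionally */
  if first.id then prev = .;     /* do not compare across IDs */
  if prev = 5 and status = 6 then event = 'churn';
  else if prev = 6 and status = 5 then event = 'reactivation';
  if not missing(event) then output;
run;
```

Because each transition becomes its own row, multiple churns or reactivations per ID are all kept, with month giving the month in which the status changed.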
You can try this one. The vname function is used to get the name of the month variable in which the change happens. Each month is compared only with the month immediately before it.
data two (drop= i diff);
set one;
array stat_check {*} m1-m6;
length churn reactivation $32;
do i=2 to dim(stat_check);
/* 5 to 6 gives +1 (churn); 6 to 5 gives -1 (reactivation) */
diff=stat_check(i)-stat_check(i-1);
if diff=1 then churn=vname(stat_check(i));
else if diff=-1 then reactivation=vname(stat_check(i));
end;
run;
I am trying to merge two datasets (df1, df2) in SAS. One of them, df2, has only one observation, and I want to assign its value to every row of df1.
I am aware that I could add the value manually, but I want an automated way, as this is just one step in a long program that runs on big data.
Here is a reproducible example and datasets:
data df1;
input a b c;
datalines;
1 2 3
6 7 8
5 6 9
;
run;
data df2;
input d ;
datalines;
4
;
run;
data df3;
merge df1 df2;
run;
/*I need the resulting df3 to be */;
a b c d
1 2 3 4
6 7 8 4
5 6 9 4
Any help will be greatly appreciated.
Then you don't want to MERGE the dataset, since there are no common variables that the merge could actually use.
Instead just SET both datasets, but take care not to read past the end of the single-observation dataset.
data want;
set long_dataset;
if _n_=1 then set short_dataset;
run;
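Applied to the datasets from the question (df1 with three rows, df2 with one), that would look something like the sketch below. Variables read by a SET statement are retained across iterations of the data step, so d keeps the value read on the first iteration.

```sas
data df3;
  set df1;
  /* read the single row of df2 once; d is then retained for every row of df1 */
  if _n_ = 1 then set df2;
run;
```

Without the _n_ = 1 condition, the second SET would hit the end of df2 on the second iteration and stop the step after one output row.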
I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me even close to what I need - not sure how to structure code to make this correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days that were gaps will have a missing value for claim. Now we can calculate a moving sum on the expanded data and ignore those missing days when performing the moving sum:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two proc expand steps is that, within a single step, the rolling sum is calculated before the days are expanded. We need the rolling sum to occur after the expansion. You can test this by running the above code all in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use a SQL self join with the dates being within 365 days of itself. This is time/resource intensive if you have a very large data set.
Assuming you have a date variable, intnx is probably a better way to calculate the end of the interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
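To get only the windows that exceed a limit, a HAVING clause can be added to the same self join. The sketch below is untested and uses the mock numbers from the question (a limit of 75 over a seven-day period); the table name over_limit and the column window_start are made up for illustration, and it assumes at most one claim per ppt per day.

```sas
proc sql;
  create table over_limit as
  select a.ppt,
         a.day as window_start,
         sum(b.claim) as total_claim
  from have as a
  inner join have as b
    on a.ppt = b.ppt
   and b.day between a.day and a.day + 6   /* 7-day window, both ends inclusive */
  group by a.ppt, a.day
  having sum(b.claim) > 75;                /* mock limit from the question */
quit;
```

Each output row is one window, starting on a claim day, whose total exceeds the limit, so an ID can appear on several rows and the same claim can count toward more than one window.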
Assuming that you have only one claim per day you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
running_
OBS ppt day claim sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claims at each date and slice out the 365 days to be summed. The bookkeeping needed for the modular approach is not needed.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233
I have the following data set that contains dates, the visit number, and a specific variable of interest. I would like to keep the last five available visits for each person in SAS. I am familiar with keeping the first and last visits. The data for a single subject is listed below:
Person Date VisitNumber VariableOfInterest
001 10/10/2001 1 6
001 11/12/2001 3 8
001 01/05/2002 5 12
001 03/10/2002 6 5
001 05/03/2002 8 3
001 07/29/2002 10 11
Any insight would be appreciated.
A double DOW loop will let you measure the group in the first loop and select from the group based on your desired per-group criteria in the second loop. This is useful when have is large and pre-sorted, and you want to avoid additional sorting.
data want;
* measure the group size;
do _n_ = 1 by 1 until (last.person);
set have;
by person visitnumber; * visitnumber in by only to enforce expectation of orderness;
end;
_i_ = _n_;
* apply the criteria "last 5 rows in group";
do _n_ = 1 to _n_;
set have;
if _i_ - _n_ < 5 then output;
end;
run;
It is easier if you sort by descending VisitNumber so that the problem becomes take the first 5 observations for a person. Then just generate a counter of which observation this is for the person and subset on that.
data want;
set have ;
by person descending visitnumber;
if first.person then rowno=0;
rowno+1;
if rowno <= 5;
run;
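If have is not already in that order, it needs sorting first. A sketch of the whole sequence, where have_desc and last5 are illustrative names only:

```sas
/* put the most recent visits first within each person */
proc sort data=have out=have_desc;
  by person descending visitnumber;
run;

/* keep the first five rows per person, i.e. the last five visits */
data last5 (drop=rowno);
  set have_desc;
  by person descending visitnumber;
  if first.person then rowno = 0;
  rowno + 1;
  if rowno <= 5;
run;

/* restore chronological order */
proc sort data=last5;
  by person visitnumber;
run;
```

A person with fewer than five visits simply keeps all of them.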
I have a dataset as follows:
data have;
input;
ID Base Adverse Fixed$ Date RepricingFrequency
1 38 50 FIXED 2016 2
2 40 60 FLOATING 2017 3
3 20 20 FIXED 2016 2
4 ...
5
6
I am looking to build an array such that each ID has four vintage years, 2017-2020, where the subsequent years are to be filled in by a piece of array code I already have that works, like so:
ID Vintage Base Adverse Fixed$ Date RepricingFrequency
1 2017 38 50 FIXED 2016 2
1 2018
1 2019
1 2020
To begin with, I just need to duplicate the dataset with the blanks.
The code I've tried so far is
data want;
set have;
do I=1 to 4;
output;
drop I;
run;
but of course that keeps the repeats of all the observations. So I tried an array.
data want;
set have;
array Base(2017:2020) Base2017-Base2020
array Vintage(2017:2020) Vintage2017-Vintage2020
But I'm not sure where to go from here with either approach.
The question is how do I extrapolate my data set for ID1-8 to a dataset where I have ID 1111-8888 where each ID is repeated 4 times with blanks.
Make a dummy dataset with all of the observations
data frame ;
set have(keep=id);
by id ;
if first.id then do date=2017 to 2020 ;
output;
end;
run;
and merge it back with the original.
data want ;
merge have frame ;
by id date ;
run;
I have sales and a time indicator, as follows:
time sales
1 6
2 7
1 5
3 4
2 4
5 7
4 3
3 2
5 1
5 4
3 1
4 9
1 8
I want the mean, stdev, and N of the above saved in a t (each time period has a row) X 4 (time period, mean, stdev, N) matrix.
For time = 5 the matrix would be:
time mean stdev N
... ... ... ...
5 4 3 3
... ... ... ...
Just for the mean I tried:
mat t1=J(5,1,0)
forval i = 1/5 {
summ sales if time == `i'
mat t1[`i']=r(mean)
}
However, I kept getting an error. Even if it worked I was unsure how to get the other (stdev and N) variables of interest.
You were probably aiming for something like
matrix t1 = J(5, 1, .)
forvalues i = 1/5 {
summarize sales if time == `i'
matrix t1[`i', 1] = r(mean)
}
matrix list t1
[U] 14.9 Subscripting specifies that you need matname[r,c]. You were leaving out the second subscript. In Mata you are allowed to subscript vectors with a single index, but you never enter Mata.
An alternative is
forval i = 1/5 {
summarize sales if time == `i'
matrix t1 = (nullmat(t1) \ r(mean))
}
With the latter, you have no need of declaring the matrix beforehand. See help nullmat().
But it's probably easiest to use collapse and get all information in one step:
clear all
set more off
input ///
time sales
1 6
2 7
1 5
3 4
2 4
5 7
4 3
3 2
5 1
5 4
3 1
4 9
1 8
end
collapse (mean) msales=sales (sd) sdsales=sales ///
(count) csales=sales, by(time)
list
Note that count counts nonmissing observations only.
If you want a matrix then convert the variables using mkmat, after the collapse:
mkmat time msales sdsales csales, matrix(summatrix)
matrix list summatrix