Retain last 5 visits by Person in SAS - arrays

I have the following dataset, which contains dates, the visit number, and a specific variable of interest. I would like to retain the last five available visits for each person in SAS. I am already familiar with retaining the first and last visits. The data for a single subject are listed below:
Person Date VisitNumber VariableOfInterest
001 10/10/2001 1 6
001 11/12/2001 3 8
001 01/05/2002 5 12
001 03/10/2002 6 5
001 05/03/2002 8 3
001 07/29/2002 10 11
Any insight would be appreciated.

A double DOW loop lets you measure the group in the first loop and select from the group, based on your per-group criteria, in the second loop. This is useful when the dataset have is large and already sorted, and you want to avoid additional sorting.
data want;
  * first pass: count the rows in the group;
  do _n_ = 1 by 1 until (last.person);
    set have;
    by person visitnumber; * visitnumber is in the BY only to enforce the expected sort order;
  end;
  _i_ = _n_; * group size;
  * second pass: re-read the group and keep its last 5 rows;
  do _n_ = 1 to _n_;
    set have;
    if _i_ - _n_ < 5 then output;
  end;
  drop _i_;
run;

It is easier if you sort by descending VisitNumber, so that the problem becomes "take the first 5 observations for each person". Then generate a counter of each observation's position within the person and subset on that.
proc sort data=have out=have_desc;
  by person descending visitnumber;
run;
data want;
  set have_desc;
  by person descending visitnumber;
  if first.person then rowno = 0;
  rowno + 1;
  if rowno <= 5;
run;

Related

SAS macro to print out change of baseline scores

I'm looking for a way to print out change of tests scores for each subject with a SAS macro. Here is a sample of the data:
Subject Visit Date Test Score
001 Baseline 01/01/99 Jump 5
001 Baseline 01/01/99 Reach 3
001 Week 6 02/12/99 Jump 7
001 Week 6 02/12/99 Reach 6
002 Baseline 03/01/99 Jump 2
002 Baseline 03/01/99 Reach 4
002 Week 6 04/12/99 Jump 5
002 Week 6 04/12/99 Reach 9
I would like to create a macro that generates the following for each subject:
Subject Visit Date (Days from Baseline) Test Score Change from Baseline Score
001 Baseline 01/01/99 Jump 5
01/01/99 Reach 3
001 Week 6 02/12/99 (42) Jump 7 +2
02/12/99 (42) Reach 6 +3
002 Baseline 03/01/99 Jump 2
03/01/99 Reach 4
002 Week 6 04/12/99 (42) Jump 5 +3
04/12/99 (42) Reach 9 +5
I believe I can just use the INTCK function for the Days from Baseline, but I'm not sure how to print out each test without retaining the 'Subject' and 'Visit' values in each row. Any help would be much appreciated.
You can sort by subject, test, and date, then retain the baseline date and score to compute the deltas. The printout can be produced with PROC REPORT, with formats applied to the delta values.
Example:
data have;
  * the & modifier lets Visit contain a single blank, so each Visit value must be followed by two blanks;
  input Subject Visit & $8. Date :mmddyy8. Test $ Score;
  format date mmddyy8.;
  datalines;
001 Baseline  01/01/99 Jump 5
001 Baseline  01/01/99 Reach 3
001 Week 6  02/12/99 Jump 7
001 Week 6  02/12/99 Reach 6
002 Baseline  03/01/99 Jump 2
002 Baseline  03/01/99 Reach 4
002 Week 6  04/12/99 Jump 5
002 Week 6  04/12/99 Reach 9
;
run;
proc sort data=have;
by subject test date;
run;
data for_report;
  set have;
  by subject test;
  retain base_date base_score;
  * reset per subject/test so a baseline never carries over to another test;
  if first.test then do;
    base_date = .;
    base_score = .;
  end;
  if first.test and visit = 'Baseline' then do;
    base_date = date;
    base_score = score;
  end;
  if not first.test then do;
    * this argument order yields a negative count, which negparen. prints in parentheses;
    delta_days = intck('day', date, base_date);
    delta_score = score - base_score;
  end;
run;
proc format;
picture plus low-0 = [best12.] other = '000000009' (prefix='+');
options missing=' ';
proc report data=for_report;
columns subject visit date delta_days test score delta_score;
define subject / order;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
run;
options missing='.';
An alternate report can be more subject-centric:
proc report data=for_report
style(lines) = [just=left fontweight=bold]
;
columns subject visit date delta_days test score delta_score;
define subject / order noprint;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
compute before subject;
subj = catx(' ', "Subject:", subject);
line subj $200.;
endcomp;
run;
Here is one way of doing it. The SQL-step calculates changes from baseline. The case-when-construct is only there to suppress zeroes in the output.
Printing using group-variables in proc report means Subject- and Visit-values are not retained on every line (but note that subject is not repeated each week).
I put the code in a macro, as that was the question. It doesn't really do much, however.
/* Creating test data*/
data testdata;
  input Subject $3. @5 Visit $8. @17 Date mmddyy10. @28 Test $5. Score;
  format date mmddyy10.;
  datalines;
001 Baseline    01/01/99   Jump  5
001 Baseline    01/01/99   Reach 3
001 Week 6      02/12/99   Jump  7
001 Week 6      02/12/99   Reach 6
002 Baseline    03/01/99   Jump  2
002 Baseline    03/01/99   Reach 4
002 Week 6      04/12/99   Jump  5
002 Week 6      04/12/99   Reach 9
;
%macro baselines(dataset=);
/* Adding days from baseline and change from baseline. Please note that the first visit
must be denoted as exactly "Baseline" */
proc sql;
create table changes as
select t1.*,
  case when t1.date-t2.date>0 then t1.date-t2.date else . end as days "Days from baseline",
  case when t1.score-t2.score ne 0 then t1.score-t2.score else . end as score_change "Change from Baseline"
from &dataset as t1 left join (select * from &dataset where visit="Baseline") as t2
on t1.subject=t2.subject and t1.test=t2.test
order by subject, visit, test;
quit;
/* Printing the dataset. The use of subject and visit as group variables keeps SAS from repeating the values*/
title "Changes based on the dataset &dataset";
proc report data=changes;
column subject visit days test score score_change;
define subject / group;
define visit / group;
run;
%mend;
%baselines(dataset=testdata)

expanding a dataset with blanks

I have a dataset as follows:
data have;
input;
ID Base Adverse Fixed$ Date RepricingFrequency
1 38 50 FIXED 2016 2
2 40 60 FLOATING 2017 3
3 20 20 FIXED 2016 2
4 ...
5
6
I am looking to build an array such that each ID has four vintage years 2017-2020, where the subsequent years are to be filled out with a piece of array code I have that works
like such
ID Vintage Base Adverse Fixed$ Date RepricingFrequency
1 2017 38 50 FIXED 2016 2
1 2018
1 2019
1 2020
In the beginning I just need to duplicate the dataset with the blanks,
The code I've tried so far is
data want;
  set have;
  do I=1 to 4;
    output;
  end;
  drop I;
run;
but of course that keeps the repeats of all the observations. So I tried an array.
data want;
set have;
array Base(2017:2020) Base2017-Base2020
array Vintage(2017:2020) Vintage2017-Vintage2020
But I'm not sure where to go from here on either accord.
The question is how do I extrapolate my data set for ID1-8 to a dataset where I have ID 1111-8888 where each ID is repeated 4 times with blanks.
Make a dummy dataset with all of the observations
data frame ;
set have(keep=id);
by id ;
if first.id then do date=2017 to 2020 ;
output;
end;
run;
and merge it back with the original (both datasets must be sorted by id and date).
data want ;
merge have frame ;
by id date ;
run;
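If the goal is exactly the layout shown in the question, where the original values sit on the first vintage row and the later rows are blank, a single DATA step may be enough. This is a minimal sketch, not a tested solution: it assumes a new variable named Vintage (taken from the desired output) and that every variable other than ID and Vintage should be blanked after the first row.

```sas
data want;
  set have;
  /* the first vintage row keeps the original values */
  Vintage = 2017;
  output;
  /* blank everything except ID, then write the remaining vintages */
  call missing(Base, Adverse, Fixed, Date, RepricingFrequency);
  do Vintage = 2018 to 2020;
    output;
  end;
run;
```

Because the SET statement re-reads every variable on the next iteration, the blanking does not leak into the following ID.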

SAS Compare values across multiple variables

I have the following example data set with an ID and the contract status in six months (01/2017 - 06/2017).
Example data:
ID Month1 Month2 Month3 Month4 Month5 Month6
12 5 5 5 5 5 5
34 5 5 6 6 5 5
56 6 6 6 -7 -7 -7
78 6 6 5 5 5 5
12 5 5 5 5 6 -7
If the status is 5 the ID is active, if 6 it's canceled and -7 is "not able to reactivate".
I want to check two kinds of changes:
1) IDs which change from status 5 to 6
2) IDs which change from 6 to 5
When the status changes from 5 to 6 I want a new variable "churn" containing the month in which the status changes to 6.
For the second group, I want a new variable "reactivation" containing the month in which the status changes to 5.
If an ID is in both groups (from 5 to 6 to 5) both variables should be filled.
What I have so far is an array, which shows me how many status matches occur in one row, but I do not get the next step. Here is the code:
data want (drop= i j);
set have (obs=100);
array stat_check {*} month1-month6;
sum=0;
do i=1 to dim(stat_check)-1;
do j=i+1 to dim(stat_check);
sum=sum(sum,stat_check(i) eq stat_check(j));
end;
end;
run;
Thanks in advance!
For an array approach, it sounds like you need to compare each variable in the array to the variable immediately before it. You don't need two passes through the array, only one. You want to compare month2 to month1, month3 to month2, and so on up to month6 to month5.
I would try something like (untested):
data want (drop= month);
  set have (obs=100);
  array stat_check {*} month1-month6;
  * a difference of +1 means 5 -> 6 (churn), -1 means 6 -> 5 (reactivation);
  do month=2 to dim(stat_check);
    if stat_check{month}-stat_check{month-1} = 1 then Churn=month;
    else if stat_check{month}-stat_check{month-1} = -1 then Reactivation =month;
  end;
run;
If you could have multiple churns or multiple reactivations for the same ID, that would capture the latest churn or reactivation.
But honestly, I would transpose the data to have one row per ID-month. That would avoid the need for an array, and would allow you to capture multiple churns/reactivations. Generally it is easier to work with tall skinny data rather than short wide data. For example, it would be easy to count the number of months each ID was active.
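The transposed approach could look roughly like the sketch below. It is untested, and the names rowid, have_keyed, long, monthvar, event and changes are made up for illustration. It also assumes each row of have is a separate contract (ID 12 appears twice in the sample), which is why a row counter is added before transposing.

```sas
/* add a row key, because ID alone is not unique in the sample data */
data have_keyed;
  set have;
  rowid = _n_;
run;

/* one row per contract-month: monthvar holds 'month1'..'month6' */
proc transpose data=have_keyed out=long(rename=(col1=status)) name=monthvar;
  by rowid id;
  var month1-month6;
run;

/* keep one row per status change of interest */
data changes;
  set long;
  by rowid;
  length event $12;
  month = input(compress(monthvar, , 'kd'), 8.); /* keep only the digits */
  prev_status = lag(status);                      /* status of the previous row */
  if first.rowid then prev_status = .;            /* do not compare across contracts */
  if prev_status = 5 and status = 6 then event = 'churn';
  else if prev_status = 6 and status = 5 then event = 'reactivation';
  if not missing(event);
  keep rowid id month event;
run;
```

Each contract can contribute several rows to changes, so repeated churns and reactivations are all captured rather than only the latest one.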
You can try this one. The VNAME function returns the name of the variable (the month) in which the change occurs.
data two (drop= i diff);
  set one;
  array stat_check {*} m1-m6;
  length churn reactivation $32;
  * compare each month with the month immediately before it;
  do i=2 to dim(stat_check);
    diff=stat_check(i)-stat_check(i-1);
    if diff=1 then churn=vname(stat_check(i));
    if diff=-1 then reactivation=vname(stat_check(i));
  end;
run;

Running counts of records and sum of max() records within date range based specified intervals in t-sql

Sample data: (assume year_month_record is the first day of the month and is datetime data type)
location item year_month_record type visits1 visits2
ABC111 11JF445553 2014-01 sales 3 5
ABC111 11JF445553 2014-02 sales 3 6
ABC111 11JF445553 2014-03 sales 2 8
ABC111 11JF445553 2014-04 sales 2 4
ABC111 22WZ777814 2014-02 sales 3 5
ABC111 55RR342013 2014-01 nsales 1 2
For the given sample data, I need to count how many times records with the same location and item appear within specified intervals. In addition, I need to grab the maximum value for specified interval / time frame and sum it up based on location, item_number and type.
The output should look something like this:
location year_month_record length_months type count_unique_visits sum_max_visits1 sum_max_visits2
ABC111 2014-01 3 sales 4 6 13
ABC111 2014-02 3 sales 4 6 12
ABC111 2014-03 3 sales 2 4 12
ABC111 2014-04 3 sales 1 2 4
ABC111 2014-01 3 nsales 1 1 2
notes for calculating visits1 / visits2 above
Example for output record 1: max(visits1) of item 11JF445553 = 3, plus max(visits1) of item 22WZ777814 = 3, so the sum is 6 (item 55RR342013 has a different type). Note 2: all records whose maxima are summed fall within the specified length_months of 3 months, 2014-01 through 2014-03.
new "type" will cause new grouping to start
Additional notes:
count_unique_visits is the count for each record within date range
length_months is defined prior to execution and can be hardcoded
current year_month_record + length_months (i.e. 2014-01 year_month_record with length_months = 3) is 01/2014 through 03/2014
I've tried creating a recursive CTE to select the count and max, but I'm doing something wrong.
Basically, I need to be able to recursively grab a count and the max of visits1/visits2 for a given interval.
Starting with 01/2014, it would need to look for the max(visits1/2) for the next three months (basically, 01/2014 - 04/2014) and return those. In 02/2014, it would use the range of 02/2014 through 05/2014 and return the max there as well. It would continue this throughout the recordset. The interval would be 3 months, but then I could copy the query and replace with 6 months and so on and so forth.
Closing this topic to ask a more targeted/specific question.
Any help would be appreciated.
You can use a combination of a grouping subquery followed by a cross apply subquery:
DECLARE @len int = 3
SELECT grp.*, SUM(ca.cuv) count_unique_visits, SUM(ca.visits1) sum_max_visits1, SUM(ca.visits2) sum_max_visits2
FROM
    (SELECT v.location, v.year_month_record, v.type
     FROM Visits v
     GROUP BY v.location, v.year_month_record, v.type) grp
CROSS APPLY
    (SELECT COUNT(*) cuv, MAX(visits1) visits1, MAX(visits2) visits2
     FROM Visits ca_v
     WHERE ca_v.location = grp.location AND grp.type = ca_v.type
       AND ca_v.year_month_record >= grp.year_month_record
       AND ca_v.year_month_record < DATEADD(month, @len, grp.year_month_record)
     GROUP BY ca_v.item
    ) ca
GROUP BY grp.location, grp.year_month_record, grp.type
ORDER BY grp.type DESC, grp.year_month_record
You can see the results in this SQLFiddle.
NOTE: As I wrote in the comment to the original question, I suspect you have a mistake in the requested output, if not, please explain...

create index in SAS using do loop

Say I have a set of data in this format:
ID Product account open date
1 A 20100101
1 B 20100103
2 C 20100104
2 A 20100205
2 D 20100605
3 A 20100101
And I want to create a column to capture the sequence of the products opened so the table will look like this:
ID First Second third
1 A B
2 C A D
3 A
I know I need to create an index for each ID so I can transpose the data afterwards:
ID Product account open date sequence
1 A 20100101 1
1 B 20100103 2
2 C 20100104 1
2 A 20100205 2
2 D 20100605 3
3 A 20100101 1
From my limited knowledge in do loop, I think I need to write something like this:
if first.ID and not last.ID then n=1 do while ID not last n+1
Something like that. Can anyone help me with the exact syntax? I have tried googling for similar codes and haven't had much luck.
Thanks!
I'd sort by ID and then date and use proc transpose for simplicity. Here's an example:
data prod;
input ID Product $ Open_DT :yymmdd8.;
format open_dt date9.;
datalines;
1 A 20100101
1 B 20100103
2 C 20100104
2 A 20100205
2 D 20100605
3 A 20100101
;
run;
proc sort data=prod;
by ID Open_DT;
run;
proc transpose data=prod
out=prod_trans(drop=_name_)
prefix=ITEM;
by id;
var Product;
run;
proc print data=prod_trans noobs;
run;
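For the sequence column the question asks about, an explicit DO loop should not be needed: BY-group processing with FIRST.ID and a sum statement is the usual idiom. This is a minimal sketch, assuming the prod dataset above; the output name prod_seq is hypothetical.

```sas
proc sort data=prod;
  by ID Open_DT;
run;

data prod_seq;
  set prod;
  by ID;
  if first.ID then sequence = 0; /* restart the counter for each ID */
  sequence + 1;                  /* sum statement: value is retained across rows */
run;
```

With sequence in place, adding an `id sequence;` statement to the PROC TRANSPOSE step should name the output columns from the counter instead of relying on row order.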
