SAS macro to print out change of baseline scores - arrays

I'm looking for a way to print out the change in test scores for each subject with a SAS macro. Here is a sample of the data:
Subject Visit Date Test Score
001 Baseline 01/01/99 Jump 5
001 Baseline 01/01/99 Reach 3
001 Week 6 02/12/99 Jump 7
001 Week 6 02/12/99 Reach 6
002 Baseline 03/01/99 Jump 2
002 Baseline 03/01/99 Reach 4
002 Week 6 04/12/99 Jump 5
002 Week 6 04/12/99 Reach 9
I would like to create a macro that generates the following for each subject:
Subject Visit Date (Days from Baseline) Test Score Change from Baseline Score
001 Baseline 01/01/99 Jump 5
01/01/99 Reach 3
001 Week 6 02/12/99 (42) Jump 7 +2
02/12/99 (42) Reach 6 +3
002 Baseline 03/01/99 Jump 2
03/01/99 Reach 4
002 Week 6 04/12/99 (42) Jump 5 +3
04/12/99 (42) Reach 9 +5
I believe I can just use the INTCK function for the Days from Baseline, but I'm not sure how to print out each test without retaining the 'Subject' and 'Visit' values in each row. Any help would be much appreciated.

You can sort by test and process using a retain for date and score for computing deltas. The print out can be done with Proc REPORT, formatting delta values appropriately.
Example:
data have; input
Subject Visit& $8. Date& mmddyy8. Test $ Score; format date mmddyy8.; datalines;
001 Baseline 01/01/99 Jump 5
001 Baseline 01/01/99 Reach 3
001 Week 6 02/12/99 Jump 7
001 Week 6 02/12/99 Reach 6
002 Baseline 03/01/99 Jump 2
002 Baseline 03/01/99 Reach 4
002 Week 6 04/12/99 Jump 5
002 Week 6 04/12/99 Reach 9
run;
proc sort data=have;
by subject test date;
run;
data for_report;
set have;
by subject test;
retain base_date base_score;
if first.subject then do;
base_date = .;
base_score = .;
end;
if first.test and visit='Baseline' then do;
base_date = date;
base_score = score;
end;
if not first.test then do;
delta_days = intck('days', date, base_date);
delta_score = score - base_score;
end;
run;
proc format;
picture plus low-0 = [best12.] other = '000000009' (prefix='+');
options missing=' ';
proc report data=for_report;
columns subject visit date delta_days test score delta_score;
define subject / order;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
run;
options missing='.';
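The argument order in the intck call above looks backwards at first glance, so here is a minimal sketch of what it does, using the dates for subject 001 from the question. INTCK counts interval boundaries from its second argument to its third, so putting the visit date first yields a negative number, and the negparen. format then prints that negative value in parentheses, which gives the "(42)" style the question asks for. The variable names days_fwd and days_back are only for illustration.

```sas
data _null_;
  base_date = '01JAN1999'd;
  date      = '12FEB1999'd;
  /* from baseline to visit: positive 42 */
  days_fwd  = intck('day', base_date, date);
  /* from visit back to baseline: negative 42 */
  days_back = intck('day', date, base_date);
  /* negparen. shows negative numbers in parentheses */
  put days_fwd= days_back= negparen8.;
run;
```

If you would rather store a positive day count, swap the arguments and use a picture format to add the parentheses instead.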
An alternate report can be more subject-centric:
proc report data=for_report
style(lines) = [just=left fontweight=bold]
;
columns subject visit date delta_days test score delta_score;
define subject / order noprint;
define visit / order order=data;
format delta_days negparen.;
format delta_score plus.;
compute before subject;
subj = catx(' ', "Subject:", subject);
line subj $200.;
endcomp;
run;

Here is one way of doing it. The SQL-step calculates changes from baseline. The case-when-construct is only there to suppress zeroes in the output.
Printing using group-variables in proc report means Subject- and Visit-values are not retained on every line (but note that subject is not repeated each week).
I put the code in a macro, as that was the question. It doesn't really do much, however.
/* Creating test data*/
data testdata;
input Subject $3. @5 Visit $8. @17 Date mmddyy10. @28 Test $5. Score;
format date mmddyy10.;
datalines;
001 Baseline 01/01/99 Jump 5
001 Baseline 01/01/99 Reach 3
001 Week 6 02/12/99 Jump 7
001 Week 6 02/12/99 Reach 6
002 Baseline 03/01/99 Jump 2
002 Baseline 03/01/99 Reach 4
002 Week 6 04/12/99 Jump 5
002 Week 6 04/12/99 Reach 9
;
%macro baselines(dataset=);
/* Adding days from baseline and change from baseline. Please note that the first visit
must be denoted exactly as "Baseline" */
proc sql;
create table changes as
select t1.*, case when t1.date-t2.date>0 then t1.date-t2.date else . end as days
"Days from baseline", case when t1.score-t2.score>0 then t1.score-t2.score else .
end as score_change "Change from Baseline"
from &dataset as t1 left join (select * from &dataset where visit="Baseline") as t2
on t1.subject=t2.subject and t1.test=t2.test
order by subject, visit, test;
/* Printing the dataset. The use of subject and visit as group variables keeps SAS from repeating the values*/
title "Changes based on the dataset &dataset";
proc report data=changes;
column subject visit days test score score_change;
define subject / group;
define visit / group;
run;
%mend;
%baselines(dataset=testdata)

sum values across any 365 day period

I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset, and so far I've converted a thin file to a fat file with one column for each claim value and each claim date. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates, as I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me even close to what I need - not sure how to structure code to make this correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days that fell in a gap will now have a missing value for claim. Now we can calculate a moving sum on the expanded data, ignoring those missing days when performing the moving sum:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two proc expand steps is that, within a single step, the rolling sum is calculated before the days are expanded. We need the rolling sum to occur after the expansion. You can test this by running the above code in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use a SQL self join that matches each row to the rows for the same ID dated within the following 365 days. This is time/resource intensive if you have a very large data set.
Assuming you have a date variable, intnx is probably a better way to calculate the date interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
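To get closer to the output the question asks for (one row per window that goes over the limit), the same self join can be filtered with a HAVING clause. This is only a sketch against the mock data: the macro variables window and limit and the table name over_limit are made up for illustration, with 7 days and 75 taken from the mock example in the question. It also assumes at most one claim per ppt per day, since duplicate days would be double counted by the join.

```sas
%let window = 7;    /* hypothetical window length in days */
%let limit  = 75;   /* hypothetical claim limit */

proc sql;
  create table over_limit as
  select a.ppt,
         a.day as window_start,
         sum(b.claim) as total_claim
  from have as a
  left join have as b
    on  a.ppt = b.ppt
    and b.day between a.day and a.day + &window - 1
  group by a.ppt, a.day
  /* keep only windows whose total exceeds the limit */
  having sum(b.claim) > &limit;
quit;
```

Each row of over_limit is a window that starts on a claim day, so the same claim can contribute to more than one row, which matches what the question describes.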
Assuming that you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
running_
OBS ppt day claim sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claims at each date and slice out the 365 days to be summed. The bookkeeping needed for the modular approach is not needed.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233

Modifying final column value in SAS by group

I have the following data set:
Student TestDayStart TestDayEnd
001 1 5
001 6 10
001 11 15
002 1 4
002 5 9
002 10 14
I would like to make the last 'TestDayEnd' the final value for 'TestDayStart' for each Student.
So the data should look like this:
Student TestDayStart TestDayEnd
001 1 5
001 6 10
001 11 15
001 15 15
002 1 4
002 5 9
002 10 14
002 14 14
I'm not quite sure how I can do this in SAS. Any insight would be appreciated.
After sorting the dataset you can do this within a data step.
proc sort data=have;
by student testdaystart testdayend;
run;
Now you can use the by and retain statements in the data step. The by statement allows you to find the last student, and the retain statement lets you keep the previous value in the dataset.
data want;
set have;
retain last_testdayend;
by student testdaystart testdayend;
output;
last_testdayend = testdayend;
if last.student then do;
if testdaystart ne testdayend then do;
testdaystart = last_testdayend;
testdayend = last_testdayend;
output; * this second output statement creates a new record in the dataset;
end;
end;
drop last_testdayend;
run;

Retain last 5 visits by Person in SAS

I have the following data set, which contains dates, the visit number, and a specific variable of interest. I would like to keep the last five available visits for each person in SAS. I am familiar with keeping the first and last visits. The data for a single subject is listed below:
Person Date VisitNumber VariableOfInterest
001 10/10/2001 1 6
001 11/12/2001 3 8
001 01/05/2002 5 12
001 03/10/2002 6 5
001 05/03/2002 8 3
001 07/29/2002 10 11
Any insight would be appreciated.
A double DOW loop will let you measure the group in the first loop and select from the group based on your desired per-group criteria in the second loop. This is useful when have is large and pre-sorted, and you want to avoid additional sorting.
data want;
* measure the group size;
do _n_ = 1 by 1 until (last.person);
set have;
by person visitnumber; * visitnumber in by only to enforce expectation of orderness;
end;
_i_ = _n_;
* apply the criteria "last 5 rows in group";
do _n_ = 1 to _n_;
set have;
if _i_ - _n_ < 5 then output;
end;
run;
It is easier if you sort by descending VisitNumber so that the problem becomes take the first 5 observations for a person. Then just generate a counter of which observation this is for the person and subset on that.
data want;
set have ;
by person descending visitnumber;
if first.person then rowno=0;
rowno+1;
if rowno <= 5;
run;
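The data step above only works if have is already in that descending order. A minimal sketch of the full sequence, assuming have starts out sorted by ascending visitnumber, might look like this (have_desc and last5 are made-up data set names):

```sas
/* Put the newest visit first within each person */
proc sort data=have out=have_desc;
  by person descending visitnumber;
run;

/* Keep the first five rows per person, i.e. the last five visits */
data last5;
  set have_desc;
  by person descending visitnumber;
  if first.person then rowno = 0;
  rowno + 1;
  if rowno <= 5;
  drop rowno;
run;

/* Optional: restore chronological order */
proc sort data=last5;
  by person visitnumber;
run;
```

The two extra sorts are the price of the simpler data step, which is why the DOW loop version can be preferable on large, pre-sorted data.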

Using group by in Proc SQL for SAS

I am trying to summarize my data set using the proc sql, but I have repeated values in the output, a simple version of my code is:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc,fill_mon,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc,fill_mon;
QUIT;
Some sample output is:
Obs Patid Ndc FILL_mon n_dea DEDUCT
3815 33003605204 00054465029 2000-05 2 0
3816 33003605204 00054465029 2000-05 2 0
12257 33004361450 00406035701 2000-06 2 0
16564 33004744098 00603128458 2000-05 2 0
16565 33004744098 00603128458 2000-05 2 0
16566 33004744098 00603128458 2000-06 2 0
16567 33004744098 00603128458 2000-06 2 0
46380 33008165116 00406035705 2000-06 2 0
85179 33013674758 00406035801 2000-05 2 0
89248 33014228307 00054465029 2000-05 2 0
107514 33016949900 00406035805 2000-06 2 0
135047 33056226897 63481062370 2000-05 2 0
213691 33065594501 00472141916 2000-05 2 0
215192 33065657835 63481062370 2000-06 2 0
242848 33066899581 60432024516 2000-06 2 0
As you can see, there is repeated output, for example obs 3815 and 3816. I have seen that some people had a similar problem, but the answers didn't work for me.
The content of the dataset is this:
The SAS System 5
17:01 Thursday, December 3, 2015
The CONTENTS Procedure
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 210
First Data Page 1
Max Obs per Page 1360
Obs in First Data Page 1310
Number of Data Set Repairs 0
Filename /home/zahram/optum/rx_4.sas7bdat
Release Created 9.0401M2
Host Created Linux
Inode Number 424673574
Access Permission rw-r-----
Owner Name zahram
File Size (bytes) 13828096
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
3 FILL_mon Num 8 YYMMD. Fill month
2 Ndc Char 11 $11. $20. Ndc
1 Patid Num 8 19. Patid
4 n_dea Num 8
5 tot_DEDUCT Num 8
Sort Information
Sortedby Patid Ndc FILL_mon
Validated YES
Character Set ASCII
Sort Option NODUPKEY
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
I'll guess that you have a format on a variable, most likely the date. Proc SQL groups by the underlying values, not the formatted values, but still displays them with the format, so different dates within the same month appear as duplicate rows. Your proc contents confirms this: FILL_mon is numeric with the YYMMD. format. You can get around this by converting the variable to a character variable.
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc, put(fill_mon, yymmd.) as fill_month,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc, calculated fill_month;
QUIT;
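If you would rather keep the month as a numeric date instead of a character string, one alternative (a sketch, assuming fill_mon holds ordinary SAS date values; work.rx_monthly is a made-up output name) is to collapse every date to the first day of its month with INTNX and group on that:

```sas
proc sql;
  create table work.rx_monthly as
  select patid, ndc,
         /* shift each date to the beginning of its own month */
         intnx('month', fill_mon, 0, 'b') as fill_month format=yymmd.,
         count(dea)  as n_dea,
         sum(deduct) as tot_deduct
  from perm.rx
  group by patid, ndc, calculated fill_month;
quit;
```

Because all dates in a month now share one underlying value, the group by produces one row per month, and the result still sorts and compares as a date.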

parsing a text file in sas

So I have a rather messy text file I'm trying to convert to a sas data set. It looks something like this (though much bigger):
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
I'm trying to set 5 separate variables (ID, name, job, time, hours) but clearly only 3 of the variables appear after the first line. I tried this:
infile "C:\Users\Desktop\jobs.txt" dlm = ' ' dsd missover;
input ID $ name $ job $ time hours;
and didn't get the right output, then I tried to parse it
infile "C:\Users\Desktop\jobs.txt" dlm = ' ' dsd missover; input
allData $; id = substr(allData, find(allData,"305")-2, 7);
but I'm still not getting the right output. Any ideas?
EDIT: I'm now trying to use scan() and substr() to pull apart the larger data set. How do I subset a single line from the data?
Your data might not be all that messy; it just might be in a hierarchical format where the first row contains all five variables and subsequent rows contain values for variables 3-5. In other words, ID and NAME should be retained as you read through the file.
If that is correct (it's a hierarchical layout), here is a possible solution:
data have;
retain ID NAME;
informat ID 7. JOB $6. TIME 3. HOURS 1.;
input @1 test_string $7. @;
if notdigit(test_string) = 0
then input @1 ID NAME $12. JOB time hours;
else input @1 JOB time hours;
drop test_string;
datalines;
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
0305680 JONES, MARY ARCH06 002 4
ARCH06 005 3
ARCH07 001 7
run;
The key thing is to really understand how your raw file is organized. Once you know the rules, using SAS to read it is a snap!
A list input solution could be the following:
data have;
array all(6) $20. ID LNAME FNAME JOB TIME HOURS;
retain Id Lname Fname;
drop i;
input @;
nitems = countw(_infile_,', ');
if notdigit(scan(_infile_,1)) = 0 then
do i = 1 to nitems;
all(i) = Scan(_infile_,i);
end;
else
do i = 1 to 3;
all(i+3) = Scan(_infile_,i);
if i = 6 then all(i) = all(i)*1;
end;
datalines;
0305679 SMITH, JOHN ARCH05 001 2
ARCH05 005 3
ARCH05 001 7
0305680 JONES, MARY ARCH06 002 4
ARCH06 005 3
ARCH07 001 7
run;
proc print; run;
