SAS combining datasets, binary search, indices - loops

In SAS, for the two test datasets below, for every value of "amount" that falls between "y" and "z", I need to extract the corresponding "x". There could be multiple values of "x" that fit the criteria.
The final result should look something like this:
/*
4 banana eggs
15 .
31 .
7 banana
22 fig
1 eggs
11 coconut
17 date
41 apple
*/
I realize this relies on using indices or binary searches but I can't figure out a workable solution! Any help would be appreciated! Thanks!
data test1;
input x $ y z;
datalines;
apple 29 43
banana 2 7
coconut 9 13
date 17 20
eggs 1 5
fig 18 26
;
run;
data test2;
input amount;
datalines;
4
15
31
7
22
1
11
17
41
;
run;

Join the two datasets so amount falls between y and z.
proc sql;
create table join as
select a.amount
,b.*
from test2 a
left join
test1 b
on a.amount between b.y and b.z;
quit;
Sort the result by amount for transpose.
proc sort data=join; by amount; run;
Transpose it.
proc transpose data=join out=trans;
by amount;
var x;
run;
Now you have your fruits each in its own variable named col1, col2, ....
If you want them all in one variable separated by a blank, just concatenate them.
data trans2(keep= amount text);
set trans(drop=_name_);
array v{*} _character_;
text = catx(' ', of v{*});
run;

Here is a possible solution using "old-fashioned" data step code plus PROC TRANSPOSE:
data test1;
input x $ y z;
datalines;
apple 29 43
banana 2 7
coconut 9 13
date 17 20
eggs 1 5
fig 18 26
;
run;
data test2;
input amount;
datalines;
4
15
31
7
22
1
11
17
41
;
run;
data want(keep=amount x);
set test2;
found = 0;
do _i_=1 to nobs;
set test1 point=_i_ nobs=nobs;
if y <= amount <= z then do;
found = 1;
output;
end;
end;
if not found then do;
x = ' ';
output;
end;
run;
proc transpose data=want out=want2(drop=_name_);
by amount notsorted;
var x;
run;
Note that my results do not match those in your example; amount 31 is an "apple".

Related

sum values across any 365 day period

I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then do:
i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me anything close to what I need, and I'm not sure how to structure the code to make it correct.
What I need to end up with is one row per time an ID exceeds the claim-value limit over the period (say, in the mock data, claims exceeding 75 across a seven-day period), including the sum of those claims. So it's likely that there may be multiple lines per ID, and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any days with gaps will have a missing value of claim. Now we can calculate the moving sum on the expanded data, then drop the filled-in days from the output:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two proc expand steps is that, within a single step, the rolling sum is calculated before the days are expanded; we need the rolling sum to occur after the expansion. You can test this by running it all in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use an SQL self join where the dates are within 365 days of each other. This is time/resource intensive if you have a very large data set.
Assuming you have a real date variable, INTNX is probably a better way to calculate the interval than adding 365, depending on how you want to account for leap years; a sketch using INTNX follows the code below.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
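As mentioned above, INTNX handles leap years when you have a real SAS date variable. Here is a sketch of the same join assuming a hypothetical date variable claim_date in place of the integer day (the variable name and the one-year window are my assumptions, not part of the original data):
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt = b.ppt and
b.claim_date between a.claim_date and intnx('year', a.claim_date, 1, 'same') - 1
group by 1, 2, 3;
quit;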
Assuming that you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
OBS ppt day claim running_sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claims at each date and slice out the 365 days to be summed. The bookkeeping required by the modular approach is then unnecessary.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
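/* Copy the 8*365 bytes starting at claims(date-365) - i.e. the 365 claim slots preceding the current date - directly into the slice array so they can be summed */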
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233

Replacing null value in SAS with next available value by group

I am trying to replace missing values that occur before the first non-null entry in SAS. I have the following data:
StudentID Day TestScore
Student001 0 .
Student001 1 78
Student001 2 89
Student002 3 .
Student002 4 .
Student002 5 .
Student002 6 95
I'd like to modify the data so the null values are replaced with the next available non-null entry:
StudentID Day TestScore
Student001 0 78
Student001 1 78
Student001 2 89
Student002 3 95
Student002 4 95
Student002 5 95
Student002 6 95
data scores;
length StudentID $ 10;
input StudentID $ Day TestScore;
datalines;
Student001 0 .
Student001 1 78
Student001 2 89
Student002 3 .
Student002 4 .
Student002 5 .
Student002 6 95
;
run;
proc sort data = scores;
by StudentID descending day;
run;
data scores;
drop addscore;
retain addscore;
set scores;
by StudentID;
if first.StudentID then addscore = .;
if testscore ne . then addscore = testscore;
if testscore eq . then testscore = addscore;
run;
proc sort data = scores;
by StudentID day;
run;
proc sort data = have;
by id descending day ;
run;
data want;
set have;
by id;
retain last_score;
if first.id then call missing(last_score);
if not missing(score) then last_score = score;
else score = last_score;
run;
proc sort data=want;
by id day;
run;
FYI, this will NOT set the missing values if there are any after the last known score for a given ID, i.e. if you had something like:
Student002 5 95
Student002 6 .
then day 6 for id 002 would stay missing; only records up to day 5 get a value. Is that a possible condition for you? If yes, this solution will require a slight modification.
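A sketch of that modification (my addition, using the want/id/day/score names from the answer above, and assuming you want trailing gaps filled with the last known score): since want has already been re-sorted by id and day, a second pass can carry the last known score forward to cover days like day 6 in the example.
data want2;
set want;
by id;
retain fwd_score;
if first.id then call missing(fwd_score);
if not missing(score) then fwd_score = score;
else score = fwd_score;
drop fwd_score;
run;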
You can use a DOW loop to identify the next non-missing score, and a subsequent DOW loop to apply that score. The DOW approach does not require re-sorting and maintains the original row order (the data only needs to be grouped by id for the BY statement).
data want;
do _n_ = 1 by 1 until (last.id or not missing(score));
set have;
by id;
end;
_score = score;
do _n_ = 1 to _n_;
set have;
score = _score;
output;
end;
drop _score;
run;
In SQL, presuming day ordering, the imputed value can be looked up in a correlated sub-query.
proc sql;
create table want as
select
id, day,
case
when not missing(score) then score
else (select score from have as inner
where inner.id = outer.id
and inner.day > outer.day
and not missing(score)
having inner.day = min(inner.day)
)
end as score
from have as outer;
quit;

Looping a proc transpose through multiple data ranges SAS

I am trying to transpose a sequence of ranges from an excel file into SAS. The excel file looks something like this:
31 Dec 01Jan 02Jan 03Jan 04Jan
Book id1 23 24 35 43 98
Book id2 3 4 5 4 1
(few blank rows in between)
05Jan 06Jan 07Jan 08Jan 09Jan
Book id1 14 100 30 23 58
Book id2 2 7 3 8 6
(and it repeats..)
My final output should have a first column for the date and then two additional columns for the book Ids:
Date Book id1 Book id2
31 Dec 23 3
01Jan 24 4
02Jan 35 5
03Jan 43 4
04Jan 98 1
05Jan 14 2
06Jan 100 7
07Jan 30 3
08Jan 23 8
09Jan 58 6
In this particular case I am asking for a simpler method to either:
import and transpose each range of data, replacing the data range with macro variables so each individual range is imported and transposed separately, or
import the whole data file first and then create a loop that transposes each range of data.
Code I used for a simple import and transpose of a specific data range:
proc import datafile="&input./have.xlsx"
out=want
dbms=xlsx replace;
range="Data$A3:F5" ;
run;
proc transpose data=want
out=want_transposed
name=date;
id A;
run;
Thanks!
A data row that is split over several segments or blocks of rows in an Excel file can be imported raw into SAS and then processed into a categorical form using a DATA Step.
In this example, sample data is put into a text file and imported so that the column names are the generic VAR1 ... VARn. The generic import is then processed across each row, outputting one SAS data set row per imported cell.
The column names in each segment are retained within a temporary array and updated whenever a blank book id is encountered.
* mock data;
filename demo "%sysfunc(pathname(WORK))\demo.txt";
data _null_;
input;
file demo;
put _infile_;
datalines;
., 31Dec, 01Jan, 02Jan, 03Jan, 04Jan
Book_id1, 23 , 24 , 35 , 43 , 98
Book_id2, 3 , 4 , 5 , 4 , 1
., 05Jan, 06Jan, 07Jan, 08Jan, 09Jan
Book_id1, 14 , 100 , 30 , 23 , 58
Book_id2, 2 , 7 , 3 , 8 , 6
run;
* mock import;
proc import replace out=work.haveraw file=demo dbms=csv;
getnames = no;
datarow = 1;
run;
ods listing;
proc print data=haveraw;
run;
The Excel import is thus made to look like this:
Obs VAR1 VAR2 VAR3 VAR4 VAR5 VAR6
1 31Dec 01Jan 02Jan 03Jan 04Jan
2 Book_id1 23 24 35 43 98
3 Book_id2 3 4 5 4 1
4
5 05Jan 06Jan 07Jan 08Jan 09Jan
6 Book_id1 14 100 30 23 58
7 Book_id2 2 7 3 8 6
It can then be processed in a transposing way, outputting only the name/value pairs corresponding to an original cell.
data have (keep=bookid date value);
set haveraw;
array dates(1000) $12 _temporary_ ;
array vars var:;
if missing(var1) then do;
do index = 2 by 1 while (index <= dim(vars));
if not missing(vars(index)) then
dates(index) = put(index-1,z3.) || '_' || vars(index); * adjust as you see fit;
else
dates(index) = '';
end;
end;
else do;
bookid = var1;
do index = 2 by 1 while (index <= dim(vars));
date = dates(index);
value = input(vars(index),??best12.);
output;
end;
end;
run;
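From the long have data set created above, here is a sketch of one way to get back to the wide layout requested in the question (one row per date, one column per book id). The proc transpose step is my addition and assumes the bookid/date/value names from the step above:
proc sort data=have;
by date bookid;
run;
proc transpose data=have out=want(drop=_name_);
by date;
id bookid;
var value;
run;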

SAS: Calculate an average excluding the current observation

I am searching for an elegant way (or, failing that, an inelegant way) to calculate an average which does not include the current record. So, if I have 30 observations I would end up with 30 different averages. Each would be the average of the other 29 values.
From this made-up data, I would want to create 5 new observations with the averages of A, B, and C not including their own data.
A B C
Albert 12 4 6
Bob 14 7 12
Clyde 6 7 11
Dennis 9 11 7
Earl 8 8 6
I have a vague idea that this will involve proc sql inside a loop. Other ideas or approaches are appreciated.
No loop needed. Use SQL to get the totals for each variable. The average without the current observation is (total sum - current value) / (n - 1).
data test;
input NAME $ A B C;
datalines;
Albert 12 4 6
Bob 14 7 12
Clyde 6 7 11
Dennis 9 11 7
Earl 8 8 6
;
run;
proc sql noprint;
select count(*),
sum(A),
sum(B),
sum(C)
into :n,
:a,
:b,
:c
from test;
quit;
data test2;
set test;
Ave_A = (&a - a)/(&n-1);
Ave_B = (&b - b)/(&n-1);
Ave_C = (&c - c)/(&n-1);
run;
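As a variation on the same formula, PROC SQL can also do this in one step by remerging the summary statistics back onto each row (SAS writes a note to the log about the remerge). This is just a sketch against the same test data set:
proc sql;
create table test3 as
select *,
(sum(A) - A) / (count(*) - 1) as Ave_A,
(sum(B) - B) / (count(*) - 1) as Ave_B,
(sum(C) - C) / (count(*) - 1) as Ave_C
from test;
quit;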

Calculating moving average using do loop in SAS

I am trying to find a way to calculate a moving average using SAS do loops. I am having difficulty. I essentially want to calculate a 4 unit moving average.
DATA data;
INPUT a b;
CARDS;
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
;
run;
data test(drop = i);
set data;
retain c 0;
do i = 1 to _n_-4;
c = (c+a)/4;
end;
run;
proc print data = test;
run;
One option is to use the merge-ahead:
DATA have;
INPUT a b;
CARDS;
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
17 18
;
run;
data want;
merge have have(firstobs=2 rename=a=a_1) have(firstobs=3 rename=a=a_2) have(firstobs=4 rename=a=a_3);
c = mean(of a:);
run;
Merge the data with itself, each merged copy advancing by one observation - so the second copy starts at observation 2, the third at observation 3, etc. That puts all four values of 'a' on one line.
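Note that the means near the end of the data set are taken over fewer than four values, because the later copies run out of observations. If you only want complete four-observation windows, a possible tweak (my assumption, not part of the original answer) is to require four non-missing values:
data want_complete;
merge have have(firstobs=2 rename=(a=a_1)) have(firstobs=3 rename=(a=a_2)) have(firstobs=4 rename=(a=a_3));
if n(a, a_1, a_2, a_3) = 4 then c = mean(a, a_1, a_2, a_3);
run;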
SAS has a lag() function. What this does is create the lag of the variable it is applied to. So, for example, if your data looked like this:
DATA data;
INPUT a ;
CARDS;
1
2
3
4
5
;
Then the following would create lag-one, lag-two, and lag-three variables:
data data2;
set data;
a_1=lag(a);
a_2=lag2(a);
a_3=lag3(a);
run;
would create the following dataset
a a_1 a_2 a_3
1 . . .
2 1 . .
3 2 1 .
4 3 2 1
etc.
Moving averages can be easily calculated from these.
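For instance, a minimal sketch using the lag variables created above (my addition; MEAN() ignores missing values, so the first rows will average fewer than four values unless you require a full window):
data data3;
set data2;
mov_avg_4 = mean(a, a_1, a_2, a_3);
run;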
Check out http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm
(Please note, I did not get a chance to run the codes, so they may have errors.)
Straight from Cody's Collection of Popular Programming Tasks and How to Tackle them.
*Presenting a macro to compute a moving average;
%macro Moving_ave(In_dsn=, /*Input data set name */
Out_dsn=, /*Output data set name */
Var=, /*Variable on which to compute
the average */
Moving=, /* Variable for moving average */
n= /* Number of observations on which
to compute the average */);
data &Out_dsn;
set &In_dsn;
***compute the lags;
_x1 = &Var;
%do i = 1 %to &n - 1;
%let Num = %eval(&i + 1);
_x&Num = lag&i(&Var);
%end;
***if the observation number is greater than or equal to the
number of values needed for the moving average, output;
if _n_ ge &n then do;
&Moving = mean (of _x1 - _x&n);
output;
end;
drop _x:;
run;
%mend Moving_ave;
*Testing the macro;
%moving_Ave(In_dsn=data,
Out_dsn=test,
Var=a,
Moving=Average,
n=4)
