Defining variables in SAS DATA step with a loop - loops

Does anybody know how to compress this long SAS code with some sort of looping technique?
DATA CDS; SET CDS;
retain find131 find132 find133 find134 find135 find136 find137 find138 find139 find140;
if _n_=1
THEN DO;
find131 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find132 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find133 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find134 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find135 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find136 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find137 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find138 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find139 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
find140 = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
END;
Thank you very much
Marco

Replace each series of find# variables with a loop. Also, you forgot a run statement in your original code block.
%macro simplify;
DATA CDS;
SET CDS;
retain %do i = 131 %to 140; find&i. %end;;
if _n_=1 THEN DO;
%do i = 131 %to 140;
find&i. = prxPARSE('/\d\d\d\d\d\d\d\.\d\d/');
%end;
END;
RUN;
%mend simplify;
%simplify;
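A non-macro alternative, as a rough sketch: a DATA step array can loop over the ten variables directly. The array name rx is my own choice (not from the question), and \d{7}\.\d{2} is shorthand for the same pattern written out digit by digit above.

```sas
data cds;
  set cds;
  /* rx is a hypothetical array name; the bounds 131:140 mirror the variable suffixes */
  array rx[131:140] find131-find140;
  retain find131-find140;
  /* compile the pattern once, on the first observation only */
  if _n_ = 1 then do i = 131 to 140;
    rx[i] = prxparse('/\d{7}\.\d{2}/');
  end;
  drop i;
run;
```

Since all ten variables hold the same compiled pattern here, a single retained pattern id may be enough, depending on how the ids are used later in the step.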

Related

Modifying SAS table counts

I am attempting to use the following code from Jack Shostak's book 'SAS Programming in the Pharmaceutical Industry' for a medications table in SAS:
PROC SQL NOPRINT;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n1
FROM ADSL
WHERE TRTPN = 1;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n2
FROM ADSL
WHERE TRTPN = 0;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n3
FROM ADSL
WHERE TRTPN NE .;
QUIT;
PROC SQL NOPRINT;
CREATE TABLE CMTOSUM AS
SELECT UNIQUE(C.CMDECOD) AS CMDECOD, C.USUBJID, T.TRTPN
FROM CM AS C, ADSL AS T
WHERE C.USUBJID = T.USUBJID
ORDER BY USUBJID, CMDECOD;
QUIT;
ODS LISTING CLOSE;
ODS OUTPUT CROSSTABFREQS = COUNTS;
PROC FREQ DATA = CMTOSUM;
TABLES TRTPN * CMDECOD;
RUN;
ODS OUTPUT CLOSE;
ODS LISTING;
PROC SORT DATA = COUNTS;
BY CMDECOD;
RUN;
DATA CM;
MERGE COUNTS(WHERE = (TRTPN = 1) RENAME = (FREQUENCY = COUNT1))
COUNTS(WHERE = (TRTPN = 0) RENAME = (FREQUENCY = COUNT2))
COUNTS(WHERE = (TRTPN = .) RENAME = (FREQUENCY = COUNT3))
END = EOF;
BY CMDECOD;
KEEP CMDECOD ROWLABEL COL1-COL3 SECTION;
LENGTH ROWLABEL $25 COL1-COL3 $10;
IF CMDECOD = '' THEN
DO;
ROWLABEL = 'ANY MEDICATION';
SECTION = 1;
END;
ELSE
DO;
ROWLABEL = CMDECOD;
SECTION = 2;
END;
PCT1 = (COUNT1/ &n1) *100;
PCT2 = (COUNT2/ &n2) *100;
PCT3 = (COUNT3/ &n3) *100;
COL1 = PUT(COUNT1, 3.) || " (" || PUT(PCT1, 3.) || "%)";
COL2 = PUT(COUNT2, 3.) || " (" || PUT(PCT2, 3.) || "%)";
COL3 = PUT(COUNT3, 3.) || " (" || PUT(PCT3, 3.) || "%)";
RUN;
This code correctly tabulates the number of subjects within each treatment arm on specific medications. However, when I run this code it generates a count based on the number of medications in the 'ANY MEDICATION' row rather than the total number of subjects. Currently the percentage exceeds 100; I would like to modify the count so that it stops once it hits the total number of subjects in each treatment arm. Any insight would be appreciated.
I was able to resolve the issue by adding the following lines of code:
IF COUNT1 GE &N1 THEN COUNT1 = &n1;
IF COUNT2 GE &N2 THEN COUNT2 = &n2;
IF COUNT3 GE &N3 THEN COUNT3 = &n3;
This restricts the counts to the total number of subjects within each group.
Below is the updated code for reference.
PROC SQL NOPRINT;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n1
FROM ADSL
WHERE TRTPN = 1;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n2
FROM ADSL
WHERE TRTPN = 0;
SELECT COUNT(DISTINCT USUBJID) FORMAT = 3.
INTO :n3
FROM ADSL
WHERE TRTPN NE .;
QUIT;
PROC SQL NOPRINT;
CREATE TABLE CMTOSUM AS
SELECT UNIQUE(C.CMDECOD) AS CMDECOD, C.USUBJID, T.TRTPN
FROM CM AS C, ADSL AS T
WHERE C.USUBJID = T.USUBJID
ORDER BY USUBJID, CMDECOD;
QUIT;
ODS LISTING CLOSE;
ODS OUTPUT CROSSTABFREQS = COUNTS;
PROC FREQ DATA = CMTOSUM;
TABLES TRTPN * CMDECOD;
RUN;
ODS OUTPUT CLOSE;
ODS LISTING;
PROC SORT DATA = COUNTS;
BY CMDECOD;
RUN;
DATA CM;
MERGE COUNTS(WHERE = (TRTPN = 1) RENAME = (FREQUENCY = COUNT1))
COUNTS(WHERE = (TRTPN = 0) RENAME = (FREQUENCY = COUNT2))
COUNTS(WHERE = (TRTPN = .) RENAME = (FREQUENCY = COUNT3))
END = EOF;
BY CMDECOD;
KEEP CMDECOD ROWLABEL COL1-COL3 SECTION;
LENGTH ROWLABEL $25 COL1-COL3 $10;
IF COUNT1 GE &N1 THEN COUNT1 = &n1;
IF COUNT2 GE &N2 THEN COUNT2 = &n2;
IF COUNT3 GE &N3 THEN COUNT3 = &n3;
IF CMDECOD = '' THEN
DO;
ROWLABEL = 'ANY MEDICATION';
SECTION = 1;
END;
ELSE
DO;
ROWLABEL = CMDECOD;
SECTION = 2;
END;
PCT1 = (COUNT1/ &n1) *100;
PCT2 = (COUNT2/ &n2) *100;
PCT3 = (COUNT3/ &n3) *100;
COL1 = PUT(COUNT1, 3.) || " (" || PUT(PCT1, 3.) || "%)";
COL2 = PUT(COUNT2, 3.) || " (" || PUT(PCT2, 3.) || "%)";
COL3 = PUT(COUNT3, 3.) || " (" || PUT(PCT3, 3.) || "%)";
RUN;
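Capping the count works when every subject took at least one medication, but it could overstate the 'ANY MEDICATION' row otherwise. A possible alternative, sketched here under the assumption that CM and ADSL are joined on USUBJID as above, is to count distinct subjects directly. The names ANYMED, NSUBJ and ANYMED_ALL are hypothetical, and the step would have to run before DATA CM overwrites the CM input.

```sas
proc sql noprint;
  /* one row per treatment arm: subjects with at least one medication */
  create table anymed as
    select t.trtpn, count(distinct c.usubjid) as nsubj
    from cm as c, adsl as t
    where c.usubjid = t.usubjid
    group by t.trtpn;

  /* the same count across all arms, stored in a macro variable */
  select count(distinct c.usubjid)
    into :anymed_all
    from cm as c, adsl as t
    where c.usubjid = t.usubjid and t.trtpn ne .;
quit;
```

These values could then replace COUNT1-COUNT3 on the row where CMDECOD is blank.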

How to read in a column of data in an IF-THEN statement or in a PROC SQL statement, SAS

I have a SAS data step statement –
Data work.CABGothers2;
set work.CABGothers1;
IF proc_p in (a HUGE LIST OF ICD10 CODES) and PDDCABG = 1
and TypeofCABG_PDDTemp = . then TypeofCABG_PDDTemp = 4;
IF proc2 in (a HUGE LIST OF ICD10 CODES) and PDDCABG = 1
and TypeofCABG_PDDTemp = . then TypeofCABG_PDDTemp = 4;
IF proc3 in (a HUGE LIST OF ICD10 CODES) and PDDCABG = 1
and TypeofCABG_PDDTemp = . then TypeofCABG_PDDTemp = 4;
...
run;
This IF-THEN section goes on 21 times, so you can imagine how HUGE and cumbersome this sas code file gets, especially when it comes to any modifications to the ICD10 code list. It would have to be changed individually in all the proc1,proc2... columns.
Also, the ICD10 lists are very large, with over 7000 codes. I was wondering if someone could show me better SAS code that takes a column of data (ICD10 codes) from a file as input.
I would like a PROC SQL or DATA step solution, whichever is more efficient.
Current code-
Data work.CABGothers2;
set work.CABGothers1;
IF proc_p in (a HUGE LIST OF ICD10 CODES) and PDDCABG = 1
and TypeofCABG_PDDTemp = . then TypeofCABG_PDDTemp = 4;
run;
UPDATE--
I got this to work if the list is small...however I have a column with 8000 unique ICD10 codes. So I get an error message as shown below.
proc sql;
select quote(icd10) into :cabgvalexcl separated by ','
from newlink.cabgvalexcl2019;
quit;
Data work.test1;
set WORK.cabgpddcol;
IF proc_p in (&cabgvalexcl.) and PDDCABG = 1 then CABGVAL_Excl = 1;
IF oproc1 in (&cabgvalexcl.) and PDDCABG = 1 then CABGVAL_Excl = 1 ;
IF oproc2 in (&cabgvalexcl.) and PDDCABG = 1 then CABGVAL_Excl = 1;
IF oproc3 in (&cabgvalexcl.) and PDDCABG = 1 then CABGVAL_Excl = 1 ;
IF oproc4 in (&cabgvalexcl.) and PDDCABG = 1 then CABGVAL_Excl = 1;
run;
ERROR message-
ERROR: The length of the value of the macro variable CABGVALEXCL (65540) exceeds the maximum length (65534). The value has been truncated to 65534 characters.
UPDATE --
Example (just a few rows) of the ONLY column containing ICD10 codes, and the data file in which I have to tag rows that have any of the ICD10 codes. (I do not have multiple columns. I did that in the macro example because the macro variable was running out of space.)
OUTPUT table-
Logic - If any of the ICD10 codes listed in cabgvalexcl2019 (shown here in RED) is found in the table CABGOTHERS1, create a column called EXCLUDE and put a value of 1 for that record.
Here's a hash-based example. It doesn't use macro variables, so it should work for any number of ICD10 codes:
data cabgvalexcl2019;
input (icd1-icd3) (:$2.);
datalines;
1 2 3
4 5 6
7 8 9
;
run;
/*Generate some dummy data*/
data cabgpddcol;
array keys[*] $2 proc_p oproc1-oproc20;
call streaminit(1); /*Set random number seed*/
do i = 1 to 20;
do j = 1 to dim(keys);
keys[j] = put(int(rand('uniform') * 11 + 9), 2. -l); /*Left-aligned so '9' matches the lookup key; chosen so we get a few rows with no exclusion codes*/
end;
PDDCABG = rand('uniform') < 0.75;
output;
end;
drop i j;
run;
/* CABGval_Excl = Identify CABG+VALVE exclusions which are "CABG OTHERS". This is the 2019 CABG+VALVE exclusion list. */
/* If the RECORD IN following table has CABGVAL_Excl = 1 then it is a CABG+valve WITH EXCLUSION*/
Data work.CABGval_Excl; /* CABG OTHERS prior to refinement into non-iso CABG WITH Valve and non-iso CABG WITHOUT Valve */
/*Create hash object to hold list of ICD codes*/
length icd $ 2;
if _n_ = 1 then do;
declare hash h();
rc = h.definekey('icd');
rc = h.definedone();
do until(eof);
set cabgvalexcl2019 end = eof;
/*Consider using an array here if you have lots of ICD columns*/
do icd = icd1, icd2, icd3;
rc = h.add();
end;
end;
end;
set cabgpddcol;
/*Loop through all the keys and stop if we find one in the hash*/
array keys[*] proc_p oproc1-oproc20;
rc = -1;
do i = 1 to dim(keys) until(rc = 0);
rc = h.find(key:keys[i]); /*This sets rc = 0 if a match is found*/
end;
drop i rc icd:;
CABGVAL_Excl = (rc = 0 and PDDCABG = 1); /*rc = 0 means an exclusion code was found*/
run;
Constructing the hash object is a little bit fiddly if you have multiple columns holding all the distinct ICD10 codes you care about - if they're all in one column then there's a simpler way of doing this:
declare hash h(dataset:'cabgvalexcl2019');
rc = h.definekey('icd');
rc = h.definedone();
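For completeness, here is a rough sketch of a full step built around the single-column form. The table names icdlist, records and flagged, the three procedure columns, and the placeholder codes are all hypothetical; only the EXCLUDE flag and the PDDCABG condition come from the question.

```sas
/* Hypothetical lookup table: one code per row in a single column named icd */
data icdlist;
  input icd :$7.;
  datalines;
AAA1
BBB2
;
run;

/* Hypothetical records to tag; a lone period reads as a blank code */
data records;
  length proc_p oproc1 oproc2 $7;
  input proc_p oproc1 oproc2 PDDCABG;
  datalines;
AAA1 ZZZ9 . 1
ZZZ9 BBB2 . 0
ZZZ9 ZZZ8 . 1
;
run;

data flagged;
  if 0 then set icdlist;   /* puts icd in the PDV so the hash knows its type and length */
  if _n_ = 1 then do;
    declare hash h(dataset: 'icdlist');
    rc = h.definekey('icd');
    rc = h.definedone();
  end;
  set records;
  array procs[*] proc_p oproc1 oproc2;
  EXCLUDE = 0;
  /* check() returns 0 when the key exists; stop at the first match */
  if PDDCABG = 1 then do i = 1 to dim(procs) while (EXCLUDE = 0);
    if h.check(key: procs[i]) = 0 then EXCLUDE = 1;
  end;
  drop i rc icd;
run;
```

Using check() rather than find() avoids copying anything back into the PDV, which seems sufficient when only a yes/no flag is needed.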

how to do IN with an array in SAS

I have defined four macro variables here, each holding a different number of ICD10 codes:
%LET DX_27800_CODE = 'E6609', 'E661', 'E668', 'E669';
%LET DX_27801_CODE = 'E6601';
%LET DX_2859_CODE = 'D649';
%LET DX_6202_CODE = 'N8320', 'N8329';
Now I want to create an array that maps those macro variables to my ICD10 table columns, so that I can assign flag variables from them.
The regular way would be:
data test; set input;
if (dx1 in ( &DX_27800_CODE) or dx2 in (&DX_27800_CODE) or dx3 in (&DX_27800_CODE))
then dx_27800 = 1; else dx_27800 =0;
run;
Done the regular way, I would need to repeat this four times to get all four flag variables, so I'm wondering if it could be done with an array:
data test; set input;
array dx_code10 [4] &DX_27800_CODE &DX_27801_CODE &DX_2859_CODE &DX_6202_CODE;
ARRAY DX_VARIABLE[4] DX_27800 DX_27801 DX_2859 DX_6202;
DO I = 1 TO DIM(dx_code10);
IF (DX1 IN (DX_CODE10[I]) OR DX2 IN (DX_CODE10[I]) OR DX3 IN (DX_CODE10[I]))
THEN DX_VARIABLE[I] = 1;
ELSE DX_VARIABLE[I] = 0;
END;
END;
RUN;
But it seems it can't be done this way. Please help me solve this problem. Thanks.
I think a better approach is to use formats. I'd rather have those DX codes in a spreadsheet or a text file or something, and then input that to make the formats, but even with the not-best-practice %LETs, you can still use a format solution.
The approach is to make a format that maps each DX code to its group value (27800, 27801, etc.), then use that value to drive how you assign the flags in the follow-up array.
%LET DX_27800_CODE = 'E6609', 'E661', 'E668', 'E669';
%LET DX_27801_CODE = 'E6601';
%LET DX_2859_CODE = 'D649';
%LET DX_6202_CODE = 'N8320', 'N8329';
proc format;
value $dxcode
&dx_27800_code = '27800'
&dx_27801_code = '27801'
&dx_2859_code = '2859'
&dx_6202_code = '6202'
other=' '
;
quit;
data input;
input dx1 $;
datalines;
E6601
E6609
E6608
E661
E668
D649
D650
N8320
E669
N8329
;;;;
run;
data want;
set input;
array dx_codes[4] dx_27800 dx_27801 dx_2859 dx_6202;
dx_code_val = put(dx1,$dxcode5.);
do _i = 1 to dim(dx_codes);
if dx_code_val = scan(vname(dx_codes[_i]),2,'_') then dx_codes[_i]=1;
else dx_codes[_i]=0;
end;
run;
For your specific example you could use FINDW() function instead of the IN operator. Turn your code lists into delimited strings instead.
%LET DX_27800_CODE = E6609,E661,E668,E669;
%LET DX_27801_CODE = E6601 ;
%LET DX_2859_CODE = D649 ;
%LET DX_6202_CODE = N8320,N8329;
data test;
set input;
array dx_code_list (4) $200 _temporary_ ("&dx_27800_code" "&dx_27801_code" "&dx_2859_code" "&dx_6202_code");
array dx_variable (4) dx_27800 dx_27801 dx_2859 dx_6202;
array dx dx1-dx3 ;
do i = 1 to dim(dx_variable);
dx_variable(i)=0;
do j=1 to dim(dx) while (dx_variable(i)=0);
if findw(dx_code_list(i),dx(j),',','it') then dx_variable(i)=1;
end;
end;
drop i j;
run;
So if I make some sample data.
data input ;
length dx1-dx3 $7 ;
input dx1 - dx3 ;
cards;
E6609 E661 .
E668 E669 .
E6601 . .
D649 N8320 N8329
. . .
;
I get this result:

How to create multiple datasets in SAS using loops

proc iml;
use rdata3;
read all var _all_ into pp;
close rdata3;
do i = 1 to 1050;
perms = allperm(pp[i, ]);
create pp&i from perms[colname= {"Best" "NA1" "NA2" "Worst"}];
append from perms;
close pp&i;
end;
I would like to create multiple datasets in SAS using the above code through a do loop. However, I can't seem to change the name of each dataset using the &i indicator. Can anyone help me change my code so that it creates multiple datasets? Or are there other ways to create multiple datasets from a matrix through loops? Thanks in advance.
You don't want to use macro variables; you want to use the features of IML. However, you will be creating an awful lot of data sets.
data rdata3;
x = 1;
y = 2;
a = 4;
b = 5;
output;
output;
run;
proc iml;
use rdata3;
read all var _all_ into pp;
close rdata3;
do i = 1 to nrow(pp);
outname = cats('pp',putn(i,'z5.'));
perms = allperm(pp[i, ]);
create (outname) from perms[colname= {"Best" "NA1" "NA2" "Worst"}];
append from perms;
close (outname);
end;
quit;
You can add an ID variable to PERMS and append all versions of PERMS into one data set. I'm not sure I used the best IML technique; I know just enough IML to be dangerous.
proc iml;
use rdata3;
read all var _all_ into pp;
close rdata3;
perms = j(1,5,0);
create PP_out from perms[colname= {'ID' "Best" "NA1" "NA2" "Worst"}];
do i = 1 to nrow(pp);
perms = allperm(pp[i, ]);
perms = j(nrow(perms),1,i)||perms;
append from perms;
end;
close PP_out;
quit;

Loop through list

I have the following macro:
rsubmit;
data indexsecid;
input secid 1-6;
datalines;
108105
109764
102456
102480
101499
102434
107880
run;
%let endyear = 2014;
%macro getvols1;
* First I extract the secids for all the options given a date and
an expiry date;
%do yearnum = 1996 %to &endyear;
proc sql;
create table volsurface1&yearnum as
select a.secid, a.date, a.days, a.delta, a.impl_volatility,
a.impl_strike, a.cp_flag
from optionm.vsurfd&yearnum as a, indexsecid as b
where a.secid=b
and a.impl_strike NE -99.99
order by a.date, a.secid, a.impl_strike;
quit;
%if &yearnum > 1996 %then %do;
proc append base= volsurface11996 data=volsurface1&yearnum;
run;
%end;
%end;
%mend;
%getvols1;
proc download data=volsurface11996;
run;
endrsubmit;
data _null_;
set work.volsurface11996;
length fv $ 200;
fv = "C:\Users\user\Desktop\" || TRIM(put(indexsecid,4.)) || ".csv";
file write filevar=fv dsd dlm=',' lrecl=32000 ;
put (_all_) (:);
run;
In the code above I had: where a.secid=108105. Now I have a list of several secids and need the equivalent of running the macro once for each secid. I would like to run it once and generate a new dataset for each secid.
How can I do that? Thanks
Here is an approach that uses:
- A single data step set statement to combine all the input datasets
- A data set list so you don't have to call each input by name
- A hash table to limit the output to your list of secids
- proc sort to order the output
- Rezza/DWal's approach to output separate csvs with file filevar =
%let startyear = 1996;
%let endyear = 2014;
data volsurface1;
/* Read in all the input tables */
set optionm.vsurfd&startyear.-optionm.vsurfd&endyear.;
where impl_strike ~= -99.99;
/* Set up a hash table containing all the wanted secids */
if _N_ = 1 then do;
declare hash h(dataset: "indexsecid");
_rc = h.defineKey("secid");
_rc = h.defineDone();
end;
/* Only keep observations where secid is matched in the hash table */
if not h.find();
/* Select which variables to output */
keep secid date days delta impl_volatility impl_strike cp_flag;
run;
/* Sort the data */
proc sort data = volsurface1;
by secid date impl_strike;
run;
/* Write out a CSV for each secid */
data _null_;
set volsurface1;
length fv $200;
fv = "\path\to\output\" || trim(put(secid, 6.)) || ".csv";
file write filevar = fv dsd dlm = ',' lrecl = 32000;
put (_all_) (:);
run;
As I don't have your data, this is untested. The only constraint I can see is that the contents of indexsecid must fit in memory. If you were not concerned with the order, this could all be done in one data step.
SRSwift, thank you for your comprehensive answer. It ran smoothly with no errors. The only issue is that I am running it on a remote server (wharton) using:
%let wrds=wrds.wharton.upenn.edu 4016;
options comamid=TCP remote=wrds;
signon username=_prompt_;
rsubmit;
and the log says it wrote the file to my folder on the server, but I can't see any file on the server. The log says:
NOTE: The file WRITE is:
Filename=/home/uni/user/108505.csv,
Owner Name=user,Group Name=uni,
Access Permission=rw-r--r--,
Last Modified=Wed Apr 1 20:11:20 2015
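A DATA step submitted inside rsubmit runs on the remote host, so its FILE statement writes to the server's file system. If the goal is to have the CSVs on the local machine, one option (a sketch only, reusing the proc download step from the question) is to download the dataset first and write the files locally. The output folder below is a hypothetical path.

```sas
rsubmit;
  /* copy the sorted remote dataset into the local WORK library */
  proc download data=volsurface1 out=work.volsurface1;
  run;
endrsubmit;

/* runs locally, so the CSVs land on the local disk */
data _null_;
  set work.volsurface1;   /* assumed still sorted by secid, so each file is written in one pass */
  length fv $200;
  fv = cats("C:\path\to\output\", secid, ".csv");   /* hypothetical folder */
  file write filevar=fv dsd dlm=',' lrecl=32000;
  put secid date days delta impl_volatility impl_strike cp_flag;
run;
```

Listing the variables in the PUT statement keeps the file name variable fv out of the output rows.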
