Using arrays in SAS

I am a SAS beginner. I have an array-like piece of data in my code, which needs to be passed to a different data step much lower in the code to do computations with it. My code does something like this (computation simplified for this example):
data _null_;
call symput('numRuns', 10000);
run;
/* this is the pre-computation step, building CompressArray for later use */
data _null_;
do i = 1 to &numRuns;
value = exp(rand('NORMAL', 0.1, 0.5));
call symput(compress('CompressArray'||i), value);
end;
run;
data reportData;
set veryLargeDataSet; /* 100K+ observations on 30+ vars */
array outputValues[10000];
do i = 1 to &numRuns;
precomputedValue = symget(compress('CompressArray'||i));
outputValues[i] = /* calculation using precomputedValue */
end;
run;
I am trying to redo this using arrays, is that possible? E.g. to store it in some global array and access it later...

Arrays in SAS only exist for the duration of the data step in which they are created. You would need to save the contents of your array in a dataset or, as you have done, in a series of macro variables.
Alternatively, you might be able to rewrite some of your code to do all of the work that uses the array within one data step. DOW-loops are quite good in this regard.
Based on the updates to your question, it sounds as though you could use a temporary array to do what you want:
data reportData;
set veryLargeDataSet; /* 100K+ observations on 30+ vars */
array outputValues[&numruns];
array precomputed[&numruns] _temporary_;
if _n_ = 1 then do i = 1 to &numruns;
if i = 1 then call streaminit(1);
precomputed[i] = exp(rand('NORMAL', &meanNorm, &stDevNorm));
end;
do i = 1 to &numRuns;
outputValues[i] = /* calculation using precomputed[i] */
end;
run;
Defining an array as _temporary_ causes the values of the array elements to be retained across iterations of the data step, so you only have to populate it once and then you can use it for the rest of the data step.
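As a minimal sketch of that retention behaviour (the dataset names demo_in and demo_out, the array lookup, and the variables obs, k and picked are made up for illustration), the array is filled only when _n_ = 1 but is still readable on later iterations:

```sas
/* Hypothetical demo data: three observations */
data demo_in;
  do obs = 1 to 3;
    output;
  end;
run;

data demo_out;
  set demo_in;
  /* _temporary_ elements are retained and are not written to demo_out */
  array lookup[3] _temporary_;
  if _n_ = 1 then do k = 1 to 3;
    lookup[k] = k * 10;        /* filled once, on the first iteration only */
  end;
  picked = lookup[obs];        /* still populated on iterations 2 and 3 */
  drop k;
run;
```

If the array were built from ordinary (non-retained) variables instead, the elements would be reset to missing at the top of each iteration and picked would be missing from the second observation onward.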

There are a lot of ways to do this, but the hash table lookup is one of the most straightforward.
%let meannorm=5;
%let stDevNorm=1;
%let numRuns=10000;
/* this is the pre-computation step, building CompressArray for later use */
data my_values;
call streaminit(7);
do i = 1 to &numRuns;
Value= rand('Normal',&meannorm., &stDevNorm.);
output;
end;
run;
data reportData;
if _n_=1 then do;
declare hash h(dataset:'my_values');
h.defineKey('i'); *the "key" you are looking up from;
h.defineData('value'); *what you want back;
h.defineDone();
call missing(of i value);
end;
set sashelp.class; /* 100K+ observations on 30+ vars */
array outputValues[10000];
do i = 1 to &numRuns;
rc=h.find();
outputValues[i] = value;
end;
run;
Basically, you need to 'load' the table in some fashion and do [something] with it. Here's one easy way.
In your particular example there's another pretty simple way: bring it in as an array.
In this case we don't put 10k rows out, but 10k variables - then declare it as an array (again) in the new data step. (Arrays are, as noted by user667489, transient; they're not stored on the dataset in any way, except as the underlying variables, so they have to be re-declared each data step.)
%let meannorm=5;
%let stDevNorm=1;
%let numRuns=10000;
/* this is the pre-computation step, building CompressArray for later use */
data my_values;
call streaminit(7);
array values[&numruns.];
do i = 1 to &numRuns;
Values[i]= rand('Normal',&meannorm., &stDevNorm.);
end;
run;
data reportData;
if _n_=1 then set my_values(drop=i);
set sashelp.class; /* 100K+ observations on 30+ vars */
array outputValues[&numruns.];
array values[&numruns.]; *this comes from my_values;
do i = 1 to &numRuns;
outputValues[i] = values[i];
end;
drop values:;
run;
Here note that I have the set in if _n_=1 still - otherwise it would terminate the data step after the first iteration.
You could also use a format, as Reeza notes, or several other options - but I think these are the simplest.

Related

How exactly does call symput work - trying to create an iterator with help of call symput

I am writing the code that is modifying an array declared in previous data step. Since it is a new datastep old indexes won't work. I thought I could use an iterator with help of call symput function.
I was trying to assign a value of 0 to each MID_(i) array element where month < "i", so I came up with this code:
data want;
set summary;
do i=1 to &MAX_MONTH.;
call symputx('iterator',i);
if MONTH < &iterator. then MID_&iterator. = 0;
end;run;
And it doesn't work. I was experimenting with the code to debug it and inserted a constant value instead of "i":
data want;
set summary;
do i=1 to &MAX_MONTH.;
call symputx('iterator',7);
if MONTH < &iterator. then MID_&iterator. = 0;
end;run;
To confuse me even more, this code only works once. When I change '7' for other number the result stays the same until I reset SAS and after that it will work with changed value, but still - only once.
What happens here? What am I not understanding? How do I create a working iterator?
The macro processor does its work converting the macro expressions into text first. So &MAX_MONTH and &iterator have already been replaced by their values before SAS even begins to compile the data step, and definitely before it has a chance to run either the CALL SYMPUTX() or the IF statement.
So if MAX_MONTH had a value of 12 and ITERATOR had a value of 7 then you ran this data step:
data want;
set summary;
do i=1 to 12;
call symputx('iterator',i);
if MONTH < 7 then MID_7 = 0;
end;
run;
Which is the same as just running:
data want;
set summary;
if MONTH < 7 then MID_7 = 0;
i=13;
run;
%let iterator=12;
The ARRAY statement is the data step method to use to reference a variable by its position in a list. So if you want to reference variables with names like MID_1, MID_2, etc then define an array and use an index into the array. You could still use your MAX_MONTH macro variable to define the set of variables to include in the array.
So perhaps you meant to run something like this:
data want;
set summary;
array mid_ [&max_month] ;
do index=month+1 to dim(mid_);
MID_[index] = 0;
end;
drop index;
run;
I would recommend sticking entirely with arrays and if your variables have a naming convention you don't need anything else.
I don't have your data but I wonder if a simplification like this could work as well.
data want;
set summary;
array mid_[*] mid_:;
do i=month+1 to dim(mid_); /* zero the elements where month < i, as the question asks */
MID_[i] = 0;
end;
run;
symput and symputx create macro variables after the data step ends. The macro variables being created cannot be accessed within the same data step. Each time symput is called, the macro variable that will be output at the end is updated.
Per the call symput documentation:
You cannot use a macro variable reference to retrieve the value of a
macro variable in the same program (or step) in which SYMPUT creates
that macro variable and assigns it a value.
You must specify a step
boundary statement to force the DATA step to execute before
referencing a value in a global statement following the program (for
example, a TITLE statement). The boundary could be a RUN statement or
another DATA or PROC statement.
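A small sketch of the step-boundary point in that quote, assuming a made-up macro variable name mvRowTotal and value 42: the macro reference is only used after the RUN statement has forced the data step to execute.

```sas
data _null_;
  call symputx('mvRowTotal', 42);   /* hypothetical name and value */
run;                                /* step boundary: the data step executes here */

/* Both references are resolved after the step has run */
%put NOTE: mvRowTotal resolves to &mvRowTotal;
title "Total rows: &mvRowTotal";
```

Moving the %put above the run; statement would make the macro processor resolve &mvRowTotal before the data step has executed.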
You do not need to use symput to achieve your goal. i is already iterating and you can use it if you create a new array of your mid_ variables.
data want;
set summary;
array mid_[&MAX_MONTH.];
do i=1 to dim(mid_);
if MONTH < i then MID_[i] = 0;
end;
run;
Stu, Tom, and Reeza have all answered the question with the way you should do this.
However, for completeness, here is how you could do this using macro variables: by using macro %do as well. This is not the right way to do your exact problem, but there are problems that might require this method.
%let max_month=12;
data summary;
do month = 1 to 12;
output;
end;
run;
%macro do_months(max_month);
data want;
set summary;
%do i=1 %to &MAX_MONTH.;
if MONTH < &i. then MID_&i. = 0;
%end;
run;
%mend do_months;
%do_months(max_month=12);
Here if you turn on options mprint; you can see what SAS is doing: it's making 12 if statements for you, each of which has a different value for the iterator and for the mid_ variable. All 12 are executed every time through the data step. It's not nearly as efficient as the array solution, and it's much harder to debug, so don't do it unless you need to.
Tom gave good advice on how to solve your problem. However, no one has explained how call symputx() actually works, and the answer by Stu Sztukowski is partially incorrect because it misreads the SAS documentation.
There are two languages: SAS Base (data step, proc sql, etc.) and SAS Macro (%let, %put, &var, etc.). So there are two worlds: SAS Base world and SAS Macro world. You can exchange data between these two worlds using specific functions.
1. Access SAS Base world from SAS Macro.
You can execute most SAS Base functions from macro code using the macro function %sysfunc(). Example:
%let mvSrc = there   are   multiple   blanks;
%let mvVar = %qsysfunc(compbl(%nrbquote(&mvSrc)));
%put &=mvVar; /* prints MVVAR=there are multiple blanks */
It is a good practice to prefix Macro Variables with mv.
I also used the macro function %nrbquote() so that strings containing characters that are special to SAS Macro (brackets, commas, quotes, ampersands, etc.) can be processed. It is not required here, but it is good practice too.
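Two more hedged illustrations of the same idea, using made-up macro variable names (mvToday, mvUpper): %sysfunc() takes an optional format as its second argument, and it can wrap ordinary character functions.

```sas
/* Call a data step function from macro code and format the result */
%let mvToday = %sysfunc(today(), date9.);

/* Wrap an ordinary character function */
%let mvUpper = %sysfunc(upcase(hello world));

%put &=mvToday &=mvUpper;
```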
2. Access SAS Macro world from SAS Base code.
data _null_;
call symputx('mvVar', 'hey!', 'g'); /* 1 */
var = symget('mvVar'); /* 2 */
put var=; /* 3 */
call symputx('mvVar', 'what''s up?', 'g');
var = symget('mvVar');
put var=;
run;
The output is as follows:
var=hey!
var=what's up?
How it works:
1. Create the macro variable mvVar in global scope with the value hey!.
2. Read the closest macro variable named mvVar (local, then each calling macro, ..., then global) into the data step variable var.
3. Print the value.
Should you use these functions? For call symputx() the answer is yes: it is how you put data into the macro world. For symget(), it depends: sometimes you should use it, and sometimes simple macro variable substitution, i.e. var = "&mvVar";, is easier. The problem with substitution is that it breaks when the macro variable contains double quotes. For example, the following code
%let mvVar = hey, "Mr X"!;
data _null_;
var = "&mvVar";
run;
turns into erroneous code:
data _null_;
var = "hey, "Mr X"!"; /* error! */
run;
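By contrast, a minimal sketch of the symget() route for the same value (the length of 40 is an arbitrary choice for this example): the value is fetched while the step runs, so its double quotes never become part of the program text.

```sas
%let mvVar = hey, "Mr X"!;

data _null_;
  length var $40;             /* arbitrary length, enough for this value */
  var = symget('mvVar');      /* read at run time, no text substitution */
  put var=;
run;
```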
Also remember that substitution occurs only once, before data step is compiled and hence before it is executed. The following code
%let mvVar = hello;
data _null_;
var = "&mvVar";
put var=;
call symputx('mvVar', 'yes?', 'g');
var = "&mvVar";
put var=;
run;
turns into
data _null_;
var = "hello";
put var=;
call symputx('mvVar', 'yes?', 'g');
var = "hello";
put var=;
run;
and prints
var=hello
var=hello
SAS also includes a mechanism to execute macro code inside a data step. The mechanism uses the dosubl() function or the call execute() routine. The mechanism is duct tape: it is hard to understand, it works in a non-intuitive way, and it must never be used. Absolutely never. Ever.

Problem: Referencing Array Value, but Returning Zero

I have been working on randomly selecting items within an array. Below, I have outlined my process. I have successfully made it to step 6 (with many data checks), but for some reason, when I reference the array, I receive a value of zero. This is confusing because even when I check the raw sorted data and note a certain value, the value retrieved is zero. Additionally, I ran VNAME to see which variable it was pulling, and it corresponded to the correct place within the array. Does anyone know why I am getting a zero value from the array?
*STEP 1: Set all non-codes to zero;
ARRAY CEREAL [337] ha_DTQ02_1-ha_DTQ02_337;
DO i=1 to 337;
if CEREAL[i]=88888.00 THEN CEREAL[i]=0;
END;
*STEP 2: Sort so that all zero values come first and food codes come last;
call SORTN(ha_DTQ02_1-ha_DTQ02_337);
*STEP 3: Rename array in reverse order so that zeros come last and codes are first. Sort function above only works in ascending order;
RENAME ha_DTQ02_1- ha_DTQ02_337=ha_DTQ02_337-ha_DTQ02_1;
*STEP 4: Count number of cereals selected;
ARRAY CEREALS[337]ha_DTQ02_1-ha_DTQ02_337;
NUMCEREALS=0;
DO i=1 to 337;
IF CEREALS[i] NOT IN (.,0) THEN NUMCEREALS+1;
END;
*STEP 5: get a random number between those two numbers- this works just fine;
IF NUMCEREALS NE 0 THEN rand1 = rand('integer', 1, numCereals);
*ensure that your second random number isn't the same as the first random number;
if NUMCEREALS ge 2 then do until(rand2 ne rand1);
rand2 = rand('integer', 1, numCereals);
end;
*STEP 6: Pull value from array using random number.;
Note: This is where I am stuck. I have tried alternative code where I recreated a new array and tried to pull the values from that new array. I have also tried placing the code directly below before closing the do loop. When the code does run, the value for these variables is zero. After many data checks, steps 1-5 work well and achieve their goals.
dtd020Af = CEREALS (rand1);
dtd020Bf = CEREALS (rand2);
OPTIONS NOFMTERR;
run;
The SORTN call routine needs the OF operator in order to utilize a name list.
call SORTN(of ha_DTQ02_1-ha_DTQ02_337);
A keen eye on the LOG window should have shown you the WARNING
3214 call SORTN(ha_DTQ02_1-ha_DTQ02_337);
-----
134
WARNING 134-185: Argument #1 is an expression, which cannot be updated by the SORTN subroutine
call.
You can't rename variables during run-time and reference the value with the new names.
You have declared an ARRAY listing the variables in 1..337 order. Check, that's good.
You CAN declare a second ARRAY listing the variables in reverse 337..1 order!
You also do not want to use a variable that might be missing, rand2, as an index value.
Suggested code:
data have;
call streaminit(123);
do id = 1 to 100;
array X X1-X337;
do over X;
if rand('uniform') < 0.75 then X = 88888;
else
X = rand('integer',1,10);
if id=50 then if _I_ ne 10 then X=88888; else X=5;
end;
OUTPUT;
end;
run;
data want;
set have;
ARRAY CEREAL X1-X337;
DO i=1 to DIM(CEREAL);
if CEREAL[i]=88888.00 THEN CEREAL[i]=0;
END;
* sort the variables that comprise the CEREAL array;
call SORTN(of CEREAL(*));
* second array to reference variables in reverse order;
array CEREAL_REVERSE x337-x1;
* count how many non-missing/non-zero values at the end of the sorted variables;
DO i=1 to DIM(CEREAL);
IF CEREAL_REVERSE[i] IN (.,0) then leave;
NUMCEREALS = i;
END;
IF NUMCEREALS > 0 THEN rand1 = rand('integer', 1, numCereals); /* > 0 also skips the case where NUMCEREALS is missing */
if NUMCEREALS ge 2 then
do until(rand2 ne rand1);
rand2 = rand('integer', 1, numCereals);
end;
* assign random selection if warranted;
if NUMCEREALS > 0 then dtd020Af = CEREAL_REVERSE (rand1);
if NUMCEREALS > 1 then dtd020Bf = CEREAL_REVERSE (rand2);
run;
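A smaller sketch of the same two ideas, with made-up variables v1-v5 and array names v and v_desc: CALL SORTN needs the OF operator to accept a variable list, and a second array that lists the same variables in reverse gives a descending view without renaming anything. The reversed variables are listed explicitly here; the x337-x1 shorthand in the answer expresses the same thing.

```sas
data _null_;
  array v[5] v1-v5 (3 0 7 0 5);     /* initial values for the demo */
  call sortn(of v[*]);              /* sorts the variables ascending in place */
  array v_desc[5] v5 v4 v3 v2 v1;   /* same variables, reverse order */
  do k = 1 to dim(v_desc);
    put v_desc[k]=;                 /* largest value first, zeros last */
  end;
run;
```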

set length of a 1d array with a variable

I would like to set the length of an array depending on the value I obtain from reading a dataset, number, which has one variable, num, with one numeric value. But I am getting an error message saying that I cannot initiate the probs array. Can I get any suggestions on how to solve this issue? (I really don't want to hardcode the length of the probs array.)
data test;
if _n_=1 then do;
set work.number;
i = num +1;
end;
array probs{i} _temporary_ .....
SAS Data step arrays can not be dynamically sized during step run-time.
One common approach is to place the computed number of rows of the data set into a macro variable before the data step.
I'm not sure what you are doing with probs.
What values will be going into the array elements ?
Do you need all prob data while iterating through each row of the data set ?
Is a single result computed from the probs data ?
Example - Compute the row count in a DATA _NULL_ step using the NOBS= option of the SET statement:
data _null_;
if 0 then set work.number nobs=row_count;
call symputx ('ROW_COUNT', row_count);
stop;
run;
data test;
array probs (&ROW_COUNT.) _temporary_;
* something with probs[index] ;
* maybe fill it ?;
do index = 1 by 1 until (last_row);
set work.number end=last_row;
probs[index] = prob; * prob from ith row;
end;
* do some computation that a procedure isn't able to;
…
result_over_all_data = my_magic; * one result row from processing all prob;
output;
stop;
run;
Of course your actual use of the array will vary.
The many other ways to get row_count include the DICTIONARY.TABLES view, PROC SQL select count(*) into, and a variety of ATTRN invocations.
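A hedged sketch of two of those variations. The one-row work.number dataset with num = 4 is made up to match the question's description, and the macro variable names ROW_COUNT and PROBS_SIZE are illustrative. The first step gets the row count with PROC SQL; the second sizes the array from the stored value num + 1, which is what the question literally asks for.

```sas
/* Hypothetical input: one row, one numeric variable num */
data work.number;
  num = 4;
run;

/* Row count via PROC SQL */
proc sql noprint;
  select count(*) into :ROW_COUNT trimmed
  from work.number;
quit;

/* Size taken from the stored value rather than the row count */
data _null_;
  set work.number(obs=1);
  call symputx('PROBS_SIZE', num + 1);
run;

data test;
  array probs[&PROBS_SIZE] _temporary_;
  do k = 1 to dim(probs);
    probs[k] = 1 / dim(probs);   /* placeholder fill */
  end;
  n_probs = dim(probs);          /* 5 for the demo data above */
  drop k;
run;
```

In both cases the macro variable is set in an earlier step, so the array dimension is already plain text by the time the second data step is compiled.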

SAS use an array to search through variables and then populate new variables

I have a data set with diagnosis codes, and each observation has multiple diagnosis codes, up to 95 (variables dx1-dx95), some of the dx codes are numeric but some are e codes (they have an E before the number, and then they become character variables). I need to write code that will look in all 95 dx code variables and pull out each time there’s an e code and make new variables called ecode1-ecode# (however many ecodes there are in that observation).
For example, one observation might have dx1=999 dx2=E100 dx3=878 and dx4=E202, and I need to make new variables ecode1=E100 ecode2=E202. The code I wrote yesterday got me close, but what I wrote turns the above example into ecode2=E100 ecode4=E202. The ecode variable # ends up being the same as the dx # instead of starting at 1 and counting up.
Here’s what I wrote yesterday:
**//array to pull out ecodes from dx1-dx95//**;
data ecodes;
set injurycodes;
*array to create new ecode variables;
array ecode{95}$ ecode1-ecode95;
*array to pull out ecodes;
array dxcode{95} dx1-dx95;
do i=1 to 95;
if 'E0000' le dxcode{i} le 'E9999' then ecode{i}=dxcode{i};
end;
drop i;
run;
I know the problem right now is the ecode{i}=dxcode{i} piece. This is pulling out the Ecodes, but they aren't starting with ecode1, ecode2, etc.
Updated code:
data ecodes;
set injurycodes;
array ecode{95}$ ecode1-ecode95;
array dxcode{95} dx1-dx95;
j=0;
DO i=1 TO 95;
IF SUBSTR(CATT(dxcode{i}),1,1)="E" THEN DO;
ecode{j}=dxcode{i};
j=j+1;
END;
END;
run;
Now I'm getting "invalid second argument to function SUBSTR"
Just check the first character of dxcode with SUBSTR, and use j to loop ecode.
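j=0;
DO i=1 TO 95;
/* the dx values are already character, so SUBSTR can be applied directly */
IF SUBSTR(dxcode{i},1,1)="E" THEN DO;
j=j+1; /* advance the counter first so the first E code lands in ecode{1} */
ecode{j}=dxcode{i};
END;
END;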
Your main problem is you need to keep a separate counter variable to use to index into the output array.
data ecodes;
set injurycodes;
array ecode(95) $5;
array dx (95) ;
j=1;
do i=1 to dim(dx);
if dx(i)=:'E' then do;
ecode(j) = dx(i);
j=j+1;
end;
end;
drop i j;
run;
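To check the separate-counter idea against the example in the question, here is a cut-down sketch with four made-up dx variables instead of 95; the n_ecodes variable is an addition for illustration only.

```sas
/* Hypothetical one-row input matching the question's example */
data injurycodes;
  length dx1-dx4 $5;
  dx1 = '999'; dx2 = 'E100'; dx3 = '878'; dx4 = 'E202';
run;

data ecodes;
  set injurycodes;
  array dx[4] dx1-dx4;
  array ecode[4] $5;
  j = 0;
  do i = 1 to dim(dx);
    if dx[i] =: 'E' then do;   /* =: compares only the leading character */
      j = j + 1;               /* output position moves only on a match */
      ecode[j] = dx[i];
    end;
  end;
  n_ecodes = j;                /* how many E codes this observation had */
  drop i j;
run;
```

For this row the result should be ecode1=E100 and ecode2=E202, with ecode3 and ecode4 left blank.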

Compute inner product without IML

I am trying to create a macro that computes the inner (dot) product of a vector and a matrix.
Y*X*t(Y) ## equivalent to the Sum(yi*Xij*yj)
I don't have IML, so I try to do it using array manipulation.
How to create a multidimensional array from the data to avoid index translation within a single array.
How to debug my loop, or at least print some variable to control my program?
How to delete temporary variables?
I am a SAS newbie, but this is what I have tried so far:
%macro dot_product(X = ,y=, value= );
/* read number of rows */
%let X_id=%sysfunc(open(&X));
%let nrows=%sysfunc(attrn(&X_id,nobs));
%let rc=%sysfunc(close(&X_id));
data &X.;
set &X.;
array arr_X{*} _numeric_;
set &y.;
array arr_y{*} _numeric_;
do i = 1 to &nrows;
do j = 1 to &nrows;
value + arr_y[i]*arr_X[j + &nrows*(i-1)]*arr_y[j];
end;
end;
run;
%mend;
When I run this :
%dot_product(X=X,y=y,value=val);
I get this error :
ERROR: Array subscript out of range at line 314 column 158.
I am using this to generate data :
data X;
array myCols{*} col1-col5;
do i = 1 to 5;
do j = 1 to dim(myCols);
myCols{j}=ranuni(123);
end;
output;
end;
drop i j;
run;
/* create a vector y */
data y;
array myCols{*} col1-col5;
do j = 1 to dim(myCols);
myCols{j}=ranuni(123);
end;
output;
drop j;
run;
Thanks in advance for your help or any idea to debug my data.
Edit: The following relates to the description of the question, how to evaluate a quadratic form using dot, inner or scalar products. The actual code is nearly fine. end edit
If you want to reduce it to dot products, then your value is the dot product of the linearization of X_ij and the same linearization applied to Z_ij=Y_i*Y_j.
The other way is to partition X_ij into its rows or columns, depending on the linearization of the matrix, and compute separate dot products of Y with, say, each row. You then compute the dot product of the resulting vector with Y again.
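Written out, and treating y as a row vector of length n (an assumption about orientation that matches Y*X*t(Y) in the question), the two routes described above are:

```latex
% Route 1: one dot product of the linearized X with the linearized Z
y X y^{\top} \;=\; \sum_{i=1}^{n} \sum_{j=1}^{n} y_i X_{ij} y_j
           \;=\; \sum_{i,j} X_{ij} Z_{ij},
\qquad Z_{ij} = y_i y_j .

% Route 2: one dot product per row of X, then a final dot product with y
w_i \;=\; \sum_{j=1}^{n} X_{ij} y_j ,
\qquad
y X y^{\top} \;=\; \sum_{i=1}^{n} y_i w_i .
```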
Edit added: The length nrows of the nested loops in the code should be determined from the length of the vector y, perhaps with a check that the length of x is nrows*nrows.
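One possible way to do this in a single data step, as a sketch and not the asker's macro: load X into a two-dimensional _temporary_ array and y into a one-dimensional one, then run the double loop. The 3x3 values, the macro variable n, the dataset name result and the array names xm, yv and col are all made up for this example.

```sas
%let n = 3;   /* hypothetical dimension; X is n x n, y has n columns */

data X;
  input col1-col3;
  datalines;
1 2 3
4 5 6
7 8 9
;
run;

data y;
  input col1-col3;
  datalines;
1 0 2
;
run;

data result;
  array xm[&n, &n] _temporary_;   /* matrix, indexed [row, column] */
  array yv[&n] _temporary_;       /* vector */
  array col[&n] col1-col&n;       /* columns shared by X and y */

  set y;                          /* y has a single row */
  do j = 1 to &n;
    yv[j] = col[j];
  end;

  do i = 1 to &n;                 /* read X one row per pass */
    set X;
    do j = 1 to &n;
      xm[i, j] = col[j];
    end;
  end;

  value = 0;
  do i = 1 to &n;
    do j = 1 to &n;
      value = value + yv[i] * xm[i, j] * yv[j];
    end;
  end;

  put value=;                     /* writes the result to the log */
  keep value;                     /* only value reaches the output dataset */
  output;
  stop;
run;
```

For these numbers the log should show value=57. The two-dimensional array removes the j + nrows*(i-1) index translation, PUT statements are a simple way to inspect variables while debugging, and KEEP (or DROP) together with _temporary_ arrays keeps helper variables out of the result.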
