SAS: Dynamically Setting a Variable Name - arrays

In this SAS data step I read in a table called TEST_Table, which was produced by an SQL query. The table contains multiple columns, including a block of columns named PREFIX_1 to PREFIX_20: each name starts with PREFIX_ followed by an incrementing number from 1 to 20.
What I would like to do is iteratively cycle through each column and analyze the value of that column.
Below is an example of what I am trying to go for. As you can see I would like to create a variable that increases on each iteration and then I use that count value as a part of the variable name I am checking.
data TEST_Data;
set TEST_Table;
retain changing_number;
changing_number=1;
do while(changing_number<=20);
if PREFIX_changing_number='BAD_IDENTIFIER' then do;
PREFIX_changing_number='This is a bad part';
end;
end;
run;
What would be the best way to do this in SAS? I know I can do it by simply checking each value individually from 1 to 20.
if PREFIX_1 = 'BAD_IDENTIFIER' then do;
PREFIX_1 = 'This is a bad part';
end;
if PREFIX_2 = ...
But that would be really obnoxious as later I will be doing the same thing with a set of over 40 columns.
Ideas?
SOLUTION
data TEST_Data;
set TEST_Table;
array SC $ SC1-SC20;
do i=1 to dim(SC);
if SC{i}='xxx' then do;
SC{i}="bad part";
end;
end;
run;
Thank you for suggesting Arrays :)

You need to look up Array processing in SAS. Simply put, you can do something like this:
data TEST_Data;
set TEST_Table;
*retain changing_number - remove this, even in your code it does nothing useful;
array prefixes prefix:; *one of a number of ways to do this;
changing_number=1;
do while(changing_number<=20);
if prefixes[changing_number]='BAD_IDENTIFIER' then do;
prefixes[changing_number]='This is a bad part';
end;
changing_number=changing_number+1; *without this increment the loop never ends;
end;
run;
A slightly better loop is:
do changing_number = 1 to dim(prefixes);
... loop ...
end;
That handles the initialization, the test, and the increment in one statement, and it adapts to the number of array elements (dim returns the number of elements in the array).
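Putting the pieces together, a minimal self-contained sketch might look like the following. The sample table with only three PREFIX_ columns and its values are made up for illustration; the array and loop follow the approach described above.

```sas
/* Hypothetical sample data: three PREFIX_ columns instead of twenty */
data TEST_Table;
  length PREFIX_1-PREFIX_3 $ 20;
  input PREFIX_1 $ PREFIX_2 $ PREFIX_3 $;
  datalines;
OK BAD_IDENTIFIER OK
BAD_IDENTIFIER OK OK
;
run;

data TEST_Data;
  set TEST_Table;
  /* the colon picks up every variable whose name starts with PREFIX_ */
  array prefixes {*} PREFIX_:;
  do i = 1 to dim(prefixes);
    if prefixes{i} = 'BAD_IDENTIFIER' then prefixes{i} = 'This is a bad part';
  end;
  drop i;
run;
```

Because the loop runs to dim(prefixes), the same step should work unchanged when the table has 40 or more matching columns, provided the variables are long enough to hold the replacement text.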

Related

SAS: How to maintain memory from a do loop within another do loop so that the partial output is appended together vertically using an output statement

So I have this:
Initial database:
Variable1, variable2, value, percentvalue
Keyword1, a, 234, 0.7
Keyword1, a, 64, 0.18
Keyword1, a, 4, 0.05
Keyword1, a, 2, 0.025
Keyword1, a, 300, 0.84
Keyword2
Keyword2
Keyword3
Keyword4
Keyword4
and so on.
When I run this for a single keyword, it works:
data Filename1;
set filename0;
if variable1 = 'Keyword1' then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
This returns the first 3 rows of Keyword1, which is what I want.
The problem comes when I try to do it for the entire table, which has about 600 keywords.
I'm currently running the test with only one keyword to make sure it works the same way.
But when I run:
data Filename1;
set filename0;
array MyArrayVariable1{1} $ Keyword1;
do i=1 to dim(MyArrayVariable1);
if variable1 = MyArrayVariable1[i] then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
end;
run;
it just produces an empty table instead of the selected rows.
And if I get rid of the output; statement, it returns the entire table without filtering anything.
Looks like you just want to use BY group processing.
data Filename1;
set filename0;
by variable1 ;
if first.variable1 then sumcol=0;
sumCol + percentvalue;
if sumCol<=0.95 then output;
run;
Note that using a SUM statement
sumCol + percentvalue;
is a simplified way to code these two statements in your original code.
retain sumCol;
sumCol = sum(sumCol, percentvalue);
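As a self-contained sketch of the BY-group approach: the Keyword1 rows below come from the question, while the Keyword2 rows and the variable lengths are invented for illustration. BY-group processing expects the input to be sorted (or indexed) by the BY variable, so a PROC SORT is included.

```sas
/* Sample input; the Keyword2 rows are hypothetical */
data filename0;
  length variable1 $ 8 variable2 $ 1;
  input variable1 $ variable2 $ value percentvalue;
  datalines;
Keyword1 a 234 0.7
Keyword1 a 64 0.18
Keyword1 a 4 0.05
Keyword1 a 2 0.025
Keyword2 b 10 0.6
Keyword2 b 5 0.3
Keyword2 b 1 0.1
;
run;

/* first.variable1 is only available when the data are ordered by variable1 */
proc sort data=filename0;
  by variable1;
run;

data Filename1;
  set filename0;
  by variable1;
  if first.variable1 then sumCol = 0;  /* restart the running total per keyword */
  sumCol + percentvalue;               /* sum statement: retains and accumulates */
  if sumCol <= 0.95 then output;
run;
```

With this input, the step keeps the first three Keyword1 rows and the first two Keyword2 rows, since the running total passes 0.95 on the next row of each group.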
BY group processing with an I/O criterion based on a groupwise computation can also be succinctly coded in what is commonly called a DOW loop in the SAS community. One hallmark of the technique is to place the SET statement inside a DO loop.
Example:
data want;
do until (last.variable1);
SET have;
by variable1;
pctsum = sum(pctsum,percentvalue);
if pctsum <= 0.95 then OUTPUT;
end;
run;
NOTE:
I'm not sure of the role of your Variable2. Should it be part of a hierarchy wherein the pctsum is reset if the Variable2 value changes within a Variable1 group?

Putting individual customers transactions into an array

I have a dataset consisting of some transactions done by customers
I want to put these transactions into an array with elements 1 to 50. One customer can have 50 transactions or more. I want my output dataset to have 1 row per customer, with the value of each transaction put into its own column.
Ultimately, I want these transaction values in an array, with the array reset to 0 upon first.cust_id. Any idea how I can go about doing this? This is what I have so far, but it is giving me errors. Let's assume the initial dataset has just the cust_id and transaction_amount fields.
The code represents just array initiation. I am doing a few calculations with the arrays afterwards.
data check;
set transactions;
by cust_id;
array trans[*] trans1-trans50;
retain array_counter;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter=sum(array_counter,1);
if last.cust_id;
run;
First off, you need a retain. Without that, your values of the trans array will get cleared out every data step loop.
Second, if you say "50 or more" transactions; well, your array bounds only allow 50, so what does it do for 51?
This works, and I set it to 30 per customer so you can see the 0's. If you set that initial 1 to 30 loop to 1 to 60, you'll get out of bounds errors.
data transactions;
call streaminit(7);
do cust_id = 1 to 10;
do transaction = 1 to 30;
transaction_amount = rand('uniform');
output;
end;
end;
run;
data check;
set transactions;
by cust_id;
retain trans1-trans50;
array trans[*] trans1-trans50;
retain array_counter;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter=sum(array_counter,1);
if last.cust_id;
run;
A couple things are wrong in your code. For starters, in order to use first/last processing, you need a by statement. Then, in order to get only one row per customer, you need to also retain your array variables and output only on last.cust_id:
data check;
set transactions;
by cust_id;
array trans[*] trans1-trans50;
retain array_counter trans1-trans50;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter+1;
if last.cust_id then output;
run;
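On the "50 or more" point raised above, one possible way to avoid the out-of-bounds error is to store a value only while there is room in the array. This is a sketch, not the only option; it assumes transactions is sorted by cust_id and that silently ignoring transaction 51 onward is acceptable, which may not suit your calculations.

```sas
data check;
  set transactions;
  by cust_id;
  array trans[*] trans1-trans50;
  retain array_counter trans1-trans50;
  if first.cust_id then do;
    do i = 1 to dim(trans);
      trans[i] = 0;
    end;
    array_counter = 0;
  end;
  array_counter + 1;
  /* store only while there is room; later transactions are counted but not stored */
  if array_counter <= dim(trans) then trans[array_counter] = transaction_amount;
  if last.cust_id then output;
  drop i transaction_amount;
run;
```

Here array_counter ends up holding the customer's total number of transactions, so a value above 50 flags customers whose later transactions were not stored.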

reading a data set multiple times in SAS

I am new here. I am trying to read in a data set multiple times. For example, assume I have 3 observations in a data set (called tempfile) for a variable called temp. The three observations are 4, 6, and 5. I want to read in the set x number of times, so that the 4th observation would be 4, the 5th would be 6, the 6th would be 5, the 7th would be 4, and so on. I have tried this literally a few dozen ways, by doing something like
data new;
do i=1 to 100;
set tempfile;
end;
output;
run;
I have tried this by moving the do statement, moving the output statement, omitting the output statement..... every which way, trying macros also. can somebody help? thanks John
followup....
Hello:
Thanks for response. That did work. I would like to now do several things involving some “if then” statements inside the loop (more than just reading in the data set).
I want to read in a data set n number of times, and each time, there will be two if then statements
So, assume I read in 3 numbers any number of times; 7, 15, and 12
As each number is read, it will ask if it is less than 10. And each time it will create a random number.
If less than 10, then
If rand(uniform) < .4 then 1 is added to counter1, else 1 is added to counter2
And if >= 10,
Then
If rand(uniform) < .2 then 1 is added to counter1, else 1 is added to counter2
Any help is much appreciated.
Thanks
John
The way that most data steps actually stop is when SAS reads past the end of the input. So you need a method that prevents SAS from doing that.
The easiest way to replicate the data is to just execute multiple output statements. So the first record is repeated three times, then the second record is repeated three times, etc.
data want;
set tempfile ;
do i=1 to 3;
output;
end;
run;
Another method is to just list the dataset multiple times on the SET statement. So to read it in 3 times just use
data want;
set tempfile tempfile tempfile;
run;
You could probably use macro logic or even just a macro variable to make the number of repetitions variable.
data _null_; call symputx('list',repeat('tempfile ',3-1)); run;
data want; set &list; run;
Another method is to use the POINT= and NOBS= options on the SET statement so that SAS never reads past the end and you can jump back to the beginning. But since it never reads past the end of the input data, you will need to tell it manually when to stop.
data want ;
do i=1 to 3;
do p=1 to nobs ;
set tempfile point=p nobs=nobs;
output;
end;
end;
stop;
run;
Or more in the spirit of your original post you might want to use the MOD() function to figure out which observation to read next.
data want;
if _n_ > 100 then stop;
p=1+mod(_n_-1,nobs);
set tempfile point=p nobs=nobs;
run;
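For the follow-up about counting inside the loop, the POINT= technique could be extended along these lines. This is only a sketch: the seed, the number of passes (100), and the output data set name counts are arbitrary choices, and rand('uniform') stands in for the rand(uniform) written in the question.

```sas
/* Example input from the follow-up */
data tempfile;
  do temp = 7, 15, 12;
    output;
  end;
run;

data counts;
  call streaminit(12345);         /* arbitrary seed so results are repeatable */
  counter1 = 0;
  counter2 = 0;
  do rep = 1 to 100;              /* hypothetical number of passes over the data */
    do p = 1 to nobs;
      set tempfile point=p nobs=nobs;
      /* the threshold depends on whether the value is below 10 */
      if temp < 10 then threshold = 0.4;
      else threshold = 0.2;
      if rand('uniform') < threshold then counter1 + 1;
      else counter2 + 1;
    end;
  end;
  output;                         /* write one row holding the final counts */
  stop;                           /* required: POINT= never hits end of file */
  keep counter1 counter2;
run;
```

The two counters always add up to the number of passes times the number of observations, which is a quick sanity check on the result.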
If you have SAS/STAT software, you can use PROC SURVEYSELECT.
data have;
do temp=4,6,5;
output;
end;
run;
proc surveyselect reps=10 rate=1 out=temp2 noprint;
run;
The data step is designed for serial processing. In this case, you need to "remember" previous observations. You can do it using only the data step, but for that use case, there are other solutions in the SAS environment that are simpler. The one I suggest is a macro that appends the original file n times:
%macro replicate( data=, out=, n=)/des='&out is &data repeated &n times.';
data &out;
set
%do i=1 %to &n;
&data
%end;
; /* This ; ends the data step `set` statement */
run;
%mend;
You could test your example with this helper:
%macro test;
data have; /* create the example data set */
temp = 4; output;
temp = 6; output;
temp = 5; output;
run;
%replicate( data=have, out=want, n=4 );
proc print data=want; run;
%mend;
Here is a portion of the SAS doc that adds lots of detail with many examples.

SAS Put Certain Row as New Variable Names After Manipulation

After importing my CSV data with GETNAMES=NO, I have 59 columns with variable names VAR1, VAR2, . . . VAR59. My first row contains the names I need for the new variables, but they first needed to be manipulated by removing special characters and turning spaces into underscores, since SAS doesn't like spaces in variable names. This is the array I used for that piece:
DATA DATA1; SET DATA (FIRSTOBS=7);
ARRAY VAR(59) VAR1-VAR59;
IF _N_ = 1 THEN DO;
DO I = 1 TO 59;
VAR[I] = COMPRESS(TRANSLATE(TRIM(VAR[I]),'_',' '),'?()');
PUT VAR[I]=;
END;
END;
DROP I;
RUN;
This worked perfectly, but now I need to get this first row up to the new variable names. I tried a similar array to perform this:
DATA DATA2; SET DATA1;
ARRAY V(59) VAR1-VAR59;
DO I = 1 TO 59;
IF _N_ = 1 AND V[I] NE "" THEN CALL SYMPUT("NEWNAME",V[I]);
RENAME VAR[I] = &NEWNAME;
END;
DROP I;
RUN;
This only puts the name of VAR59, since there is no [i] connected to the &NEWNAME, and it still isn't working quite right. Any suggestions for moving a row up to variable names AFTER manipulation?
Your primary problem is you are trying to use a macro variable in the data step it's created in. You can't. You're also trying to create rename statements in the data step; rename, as with other similar statements (keep, drop), must be defined before the data step is compiled.
You need to write code somewhere - either in a text file, a macro variable, whatever - with this information. For example:
filename renamef temp;
data _null_;
set myfile (obs=1);
file renamef;
array var[59];
do _i = 1 to dim(Var);
[your code to clean it out];
strput = catx(' ','rename',vname(var[_i]),'=',var[_i],';'); *catx trims the pieces and separates them with blanks;
put strput;
end;
run;
data want;
set myfile (firstobs=2);
%include renamef;
run;
There are lots of other examples of this on the site and on the web; "list processing" is the term for it.
Joe -- using your suggestions and another one of your posts, the following worked flawlessly:
Step 1: Put the row of needed variables into long format (in my case, the first row, so n = 1).
DATA NEWVARS; SET DATA;
IF _N_ = 1 THEN OUTPUT NEWVARS;
RUN;
PROC TRANSPOSE DATA = NEWVARS OUT=NEWVARS1;
VAR _ALL_;
RUN;
Step 2: Create a list of rename macro calls.
PROC SQL;
SELECT CATS('%RENAME(VAR=',_NAME_,',NEWVAR=',COL1,')')
INTO :RENAMELIST SEPARATED BY ' '
FROM NEWVARS1;
QUIT;
%MACRO RENAME(VAR=,NEWVAR=);
RENAME &VAR.=&NEWVAR.;
%MEND RENAME;
Step 3: Call in the list created in Step 2 to rename all variables.
PROC DATASETS LIB=WORK NOLIST;
MODIFY DATA;
&RENAMELIST.;
QUIT;
I had to perform a few additional checks to make sure that the variable names were not longer than 32 characters, and this was easy to check once the data was in long format after transposing. If certain words make the names too long, the TRANWRD function can easily replace them with abbreviations.

SAS use value from one observation to overwrite different one

I have a data set with two main variables of interest right now: Major and Major_Code. These should match up 1 to 1, but there are some errors I need to fix. What I've found is that for 14 Major_Code values, there are two different Majors. This is only due to a change in spelling or punctuation, like "ed." and "education". They are supposed to have the same value here but don't.
So I have a table with 7 pairs. Each pair has the same Major_Code and a different Major. How can I select one of the Major values to use for each code? My only idea was an if-then statement, but that seems horribly inefficient.
I found the doubled values like this:
proc freq data=majorslist;
tables Major_Code/out=majorcodedups;
run;
proc print data=majorcodedups;
where COUNT > 1;
run;
So I can easily find these observations but can't extract certain values to overwrite onto another observation. I've looked into arrays, macros, sql and transpose but it's all a bit over my head right now.
Logically it would work like this:
from obs i to n, find value for variable x at obs i, output value onto variable y at obs i, go to obs(i+1) and repeat.
Assuming you have some rule for determining which MAJOR is correct for a MAJOR_CODE, you should do this:
This assumes majorslist is a dataset of every major/major_code pair whether unique or not - but only one per major/major_code pair.
proc sort data=majorslist;
by major_code major;
run;
data majorslist_unique;
set majorslist;
by major_code major;
if first.major_code and last.major_code then output;
else do;
*rule to determine whether to output it or not;
end;
run;
So, you now have the major_code/major relationship. Let's say you picked if first.major_code then output; as your rule (i.e., take the alphabetically first major value for each major_code).
Now you need to apply this to your larger dataset. There are a lot of ways to do that: merging this on is one, a format is another, for starters. The format approach works like this:
Create a dataset with FMTNAME, START, LABEL defined. For each value of MAJOR_CODE, construct one row like that, where START is MAJOR_CODE and LABEL is MAJOR. We'll also add an extra line that says what to do with non-matches (in case you get new values of major_code).
data for_fmt;
set majorslist_unique;
fmtname='MAJORF'; *add a $ if MAJOR_CODE is a character variable;
start=major_code;
label=major;
output;
if _n_=1 then do;
hlo='o';
call missing(start);
label='NONMATCHED';
output;
end;
keep fmtname start label hlo;
run;
proc format cntlin=for_fmt;
quit;
Now you have a format, MAJORF. (or $MAJORF. if MAJOR_CODE is character), that you can use with the PUT function.
data my_bigdata2;
set my_bigdata;
major = put(major_code,MAJORF.);
run;
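The merge alternative mentioned earlier could be sketched as follows. It assumes my_bigdata already contains both major_code and major, and that majorslist_unique holds exactly one row per major_code in sorted order; the names my_bigdata_sorted and in_big are made up for this example.

```sas
/* MERGE with BY needs both inputs ordered by major_code */
proc sort data=my_bigdata out=my_bigdata_sorted;
  by major_code;
run;

data my_bigdata2;
  /* drop the inconsistent major so the cleaned value from the lookup is used */
  merge my_bigdata_sorted (in=in_big drop=major)
        majorslist_unique (keep=major_code major);
  by major_code;
  if in_big;   /* keep only rows that exist in the large dataset */
run;
```

Unlike the format approach, a major_code with no match in the lookup simply gets a missing major here rather than a 'NONMATCHED' label, and the large dataset has to be sorted first.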

Resources