SAS how to create variable with corresponding values efficiently - arrays

I am trying to complete the following.
Variable Letter has three values (a, b, c). I would like to create a variable Letter_2 with values corresponding to the values of Letter, namely (1, 2, 3).
I know I can do this using three IF Then statements.
if Letter='a' then Letter_2='1';
if Letter='b' then Letter_2='2';
if Letter='c' then Letter_2='3';
Suppose I have 15 values for the variable Letter, and 15 corresponding values for the replacement. Is there a way to do it efficiently without typing the same If Then statement 15 times?
I am new to SAS. Any clue will be appreciated.
Lisa

Looks like an application for a FORMAT.
First define the format.
proc format ;
value $lookup 'a'='1' 'b'='2' 'c'='3' ;
run;
Then use it to re-code your variable.
data want;
set have;
letter_2 = put(letter,$lookup.);
run;
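A minimal, self-contained sketch of the same idea is shown below. The sample data set and the OTHER= range are assumptions added for illustration; they are not part of the original answer. Without an OTHER range, a value that is not listed in the format is returned unchanged by PUT().

```sas
/* Format with a catch-all for values that are not listed (assumption: '?' marks them). */
proc format;
  value $lookup 'a'='1' 'b'='2' 'c'='3' other='?';
run;

/* Hypothetical sample data; 'z' has no mapping. */
data have;
  input letter $;
cards;
a
c
z
;

data want;
  set have;
  length letter_2 $1;
  letter_2 = put(letter, $lookup.);  /* '1', '3', '?' for the three rows */
run;
```

With 15 values, only the VALUE statement grows; the DATA step stays the same.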
Or perhaps you could use two temporary arrays and the WHICHC() function?
data have;
input letter $10. ;
cards;
car
apple
box
;;;;
data want ;
set have ;
array from (3) $10 _temporary_ ('apple','box','car');
array to (3) $10 _temporary_ ('first','second','third');
if whichc(letter,of from(*)) then
letter_2 = to(whichc(letter,of from(*)))
;
run;

Related

Find corresponding variable to a certain value through array

I have identified the maximum test result (the Highest variable below), which occurred on one of the three test dates (the testtime variables below). I want to create a new variable, Highesttime, that holds the date on which that highest result was recorded.
However, I am stuck on the array loop. SAS reports "ERROR: Array subscript out of range at line x", so I guess something is wrong with the logic? See the code below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testtime);
corr_time{i}=testtime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close; you just had the assignment backwards.
Note that the array loop assigns the last date when the highest result occurs more than once, whereas WHICHN reports the first, so the two answers are not identical unless you modify the loop to exit after the first maximum is found.
With the changes suggested in the answer applied:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the first occurrence of the maximum;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop i;
format date1_a date2_b date3_c dob mmddyy8. ;
run;
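The difference between the loop and WHICHN on tied maxima can be checked with a small sketch. The one-row data set and the variable names first_time, last_time and whichn_time are made up for illustration; the LEAVE statement is one way to make the loop stop at the first match.

```sas
/* Hypothetical row where the highest result (7) occurs twice. */
data have;
  input ID (time_a time_b time_c) (:mmddyy10.) result_a result_b result_c;
  format time_a time_b time_c yymmdd10.;
cards;
3 12/22/21 1/6/22 2/2/22 7 5 7
;

data want;
  set have;
  array times   time_a time_b time_c;
  array results result_a result_b result_c;
  highest = max(of results[*]);

  /* Loop that exits at the first maximum: agrees with WHICHN. */
  do i = 1 to dim(results);
    if results[i] = highest then do;
      first_time = times[i];
      leave;
    end;
  end;

  /* Loop without an exit: keeps overwriting, so the last maximum wins. */
  do i = 1 to dim(results);
    if results[i] = highest then last_time = times[i];
  end;

  /* WHICHN returns the index of the first match. */
  whichn_time = times[whichn(highest, of results[*])];

  format first_time last_time whichn_time yymmdd10.;
  drop i;
run;
```

Here first_time and whichn_time both equal time_a, while last_time equals time_c.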

SAS Help: Using Index function to compare 2 columns

I want to compare the string values of columns A and B using the INDEX function, to check whether A contains B. The only way I know to do this is INDEX, but the problem is that INDEX doesn't seem to accept a column name as its second argument; you have to enter a string value.
Tried this: index(Address, HouseNumber)>0 but it doesn't work.
Example:
Address HouseNumber
123 Road Road
So I want to see whether the Address column contains the HouseNumber value. It won't be an exact match; I just want to check whether A contains the string. I think a macro variable or an array is the solution, but I do not know how to do it.
You need to account for the padding that SAS does since all variables are fixed length.
data have ;
length Address HouseNumber $50;
infile cards dsd dlm='|';
input address housenumber ;
cards;
123 Road|Road
;;;;
data want ;
set have ;
if index(address,strip(HouseNumber));
run;
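To see why the padding matters, here is a small sketch; the data set name check and the three result variables are made up for illustration. INDEX treats trailing blanks in its second argument as part of the search string, so an untrimmed $50 variable almost never matches. The FIND function with the 't' modifier is an alternative that trims both arguments.

```sas
data check;
  length Address HouseNumber $50;
  Address     = '123 Road';
  HouseNumber = 'Road';
  padded  = index(Address, HouseNumber);         /* 0: searches for 'Road' plus 46 trailing blanks */
  trimmed = index(Address, strip(HouseNumber));  /* 5: 'Road' starts at position 5 */
  found_t = find(Address, HouseNumber, 't');     /* 5: 't' trims trailing blanks from both arguments */
run;
```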
This works - is it what you're trying to do??
data _null_;
a = '52 Festive Rd';
b = 'Festive';
if index(a,b) then put 'yes';
else put 'no';
run;

SAS / Using an array to convert multiple character variables to numeric

I am a SAS novice. I am trying to convert character variables to numeric. The code below works for one variable, but I need to convert more than 50 variables, hopefully simultaneously. Would an array solve this problem? If so, how would I write the syntax?
DATA conversion_subset;
SET have;
new_var = input(oldvar,4.);
drop oldvar;
rename new_var=oldvar;
RUN;
@Reeza
DATA conversion_subset;
SET have;
Array old_var(*) $ a_20040102--a_20040303 a_302000--a_302202;
* The first list contains 8 variables. The second list contains 7 variables;
Array new_var(15) var1-var15;
Do i=1 to dim(old_var);
new_var(i) = input(old_var(i),4.);
End;
*drop a_20040102--a_20040303 a_302000--a_302202;
*rename var1-var15 = a_20040102--a_20040303 a_302000--a_302202;
RUN;
NOTE: Invalid argument to function INPUT at line 64 column 19
(new_var(i) = input(old_var(i),4.))
@Reeza
I am still stuck on this array. Your help would be greatly appreciated. My code:
DATA conversion_subset (DROP= _20040101 _20040201 _20040301);
SET replace_nulls;
Array _char(*) $ _200100--_601600;
Array _num(*) var1-var90;
Do i=1 to dim(_char);
_num(i) = input(_char(i),4.);
End;
RUN;
I am receiving the following error: ERROR: Array subscript out of range at line 64 column 6. Line 64 contains the input statement.
Yes, an array solves this issue. You will want a simple way to list the variables, so look into SAS variable lists as well. For example, if you're converting all character variables between first_var and last_var, you can list them as first_var-character-last_var.
The rename/drop are illustrated in other questions across SO.
DATA conversion_subset;
SET have;
Array old_var(50) $ first-character-last;
Array new_var(50) var1-var50;
Do i=1 to 50;
new_var(i) = input(old_var(i),4.);
End;
RUN;
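For completeness, here is a small sketch of the full pattern including the drop and rename steps. The data set, the variable names a, b, c and the _n suffix are assumptions for illustration. The ?? modifier suppresses invalid-data notes and returns missing for values that cannot be converted.

```sas
/* Hypothetical input: three character variables holding digits. */
data have;
  input a $ b $ c $;
cards;
1 20 300
4 . 6
;

data want;
  set have;
  array old_var(*) $ a b c;        /* existing character variables */
  array new_var(*) a_n b_n c_n;    /* new numeric variables */
  do i = 1 to dim(old_var);
    new_var(i) = input(old_var(i), ?? 4.);
  end;
  drop i a b c;                    /* DROP is applied before RENAME */
  rename a_n=a b_n=b c_n=c;        /* numeric versions take over the old names */
run;
```

Using dim(old_var) instead of a hard-coded count keeps the loop within the array bounds, which helps avoid the "Array subscript out of range" error when the two arrays are sized consistently.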
As @Parfait suggests, it would be best to fix the variable types when you first read the data in, rather than after it is already in a SAS data set. However, if you're given the data set and have to convert that, that's what you have to do. You can add a WHERE clause to the PROC SQL to exclude variables that should not be converted. If you do so, they won't be in the final data set unless you add them in the CREATE TABLE's SELECT clause.
PROC CONTENTS DATA=have OUT=havelist NOPRINT ;
RUN ; %* get variable names ;
PROC SQL ;
SELECT 'INPUT(' || name || ',4.) AS ' || name
INTO :convert SEPARATED BY ','
FROM havelist
; %* create the select statement ;
CREATE TABLE conversion_subset AS
SELECT &convert
FROM have
;
QUIT ;
If excluding variables is an issue and/or you want to use a DATA step, then use the PROC CONTENTS above and follow with:
PROC SQL ;
SELECT COMPRESS(name || '_n=INPUT(' || name || ',4.)'),
COMPRESS(name || '_n=' || name),
COMPRESS(name)
INTO :convertlst SEPARATED BY ';',
:renamelst SEPARATED BY ' ',
:droplst SEPARATED BY ' '
FROM havelist
;
QUIT ;
DATA conversion_subset (RENAME=(&renamelst)) ;
SET have ;
&convertlst ;
DROP &droplst ;
RUN ;
Again, add a where clause to exclude variables that should not be converted. This will automatically preserve any variables that you exclude from conversion with a WHERE in the PROC SQL SELECT.
If you have too many variables, or their names are very long, or adding _n to the end causes a name collision, things can go badly (too much data for a macro variable, illegal field name, one field overwriting another, respectively).

How can I "define" SAS data sets using macro variable and write to them using an array

My source data contains 200,000+ observations, one of the many variables in the data set is "county." My goal is to write a macro that will take this one data set as an input, and split them into 58 different temporary data sets for each of the California counties.
First question is if it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.
Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?
I can get the comparison to work properly, but cannot seem to use a array reference to specify a output data set. This is most likely because I need more experience with the macro environment!
Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!
data splitraw;
length county_name $15;
infile "&path/random.csv" dsd firstobs=2;
input county_name $ number;
run;
%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
set &dxtosplit;
do i=1 to 58;
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
end;
run;
%mend _58countysplit;
%_58countysplit(splitraw,county_name);
The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit differently.
First I create a sample dataset with a variable "county" this will contain ten different values:
data large;
attrib county length=$12;
do i=1 to 10000;
county=put(mod(i,10)+1,ROMAN.);
output;
end;
run;
Next, I find all the unique values and construct the names of all the different tables I would like to create:
proc sql noprint;
select distinct compbl("large_"!!county) into :counties separated by " "
from large;
quit;
Now I have a macro variable "counties" that contains the names of all the different datasets I want to create.
Here I am writing the IF-statements to a file:
filename x temp;
data _null_;
attrib county length=$12 ds length=$18;
file x;
i=1;
do while(scan("&counties",i," ") ne "");
ds=scan("&counties",i," ");
county=scan(ds,-1,"_");
put "if county=""" county +(-1) """ then output " ds ";";
i+1;
end;
run;
Now I have what I need to create the small datasets:
data &counties;
set large;
%inc x;
run;
I agree with user667489: there is almost always a better way than splitting one large data set into many small data sets. However, if you want to proceed along these lines, there is a table in sashelp called vcolumn which lists all your libraries, their tables, and each column (in each table) that should help you. Also if you want
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
to resolve you might mean:
if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:
%macro split_by(data=, splitvar=);
%local dslist iflist;
proc sql noprint;
select distinct cats("&splitvar._", &splitvar)
into :dslist separated by ' '
from &data;
select distinct
catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x)
into :iflist separated by "else "
from &data;
quit;
data &dslist;
set &data;
&iflist
run;
%mend split_by;
Here is some test data to illustrate:
options mprint;
data test;
length county $1 val $2;
input county val;
infile cards;
datalines;
A 2
B 3
A 5
C 8
C 9
D 10
run;
%split_by(data=test, splitvar=county)
And you can view the log to see how the macro generates the DATA step you want:
MPRINT(SPLIT_BY): proc sql noprint;
MPRINT(SPLIT_BY): select distinct cats("county_", county) into :dslist separated by ' ' from test;
MPRINT(SPLIT_BY): select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated
by "else " from test;
MPRINT(SPLIT_BY): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(SPLIT_BY): data county_A county_B county_C county_D;
MPRINT(SPLIT_BY): set test;
MPRINT(SPLIT_BY): if county='A' then output county_A;
MPRINT(SPLIT_BY): else if county='B' then output county_B;
MPRINT(SPLIT_BY): else if county='C' then output county_C;
MPRINT(SPLIT_BY): else if county='D' then output county_D;
MPRINT(SPLIT_BY): run;
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.05 seconds

SAS use value from one observation to overwrite different one

I have a data set with two main variables of interest now - Major and Major_Code. These should match up 1 to 1 but there are some errors I need to fix and what I've found is that for 14 Major_Code values, there are two different Majors. This is only due to a change in spelling or punctuation, like "ed." and "education". They are supposed to have the same value here but don't.
So I have a table with 7 pairs. Each pair has the same Major_Code and a different Major. How can I select one of the Major values to use for each code? My only idea was through an if-then statement but that seems horribly inefficient.
I found the doubled values like this:
proc freq data=majorslist;
tables Major_Code/out=majorcodedups;
run;
proc print data=majorcodedups;
where COUNT > 1;
run;
So I can easily find these observations but can't extract certain values to overwrite onto another observation. I've looked into arrays, macros, sql and transpose but it's all a bit over my head right now.
Logically it would work like this:
from obs i to n, find value for variable x at obs i, output value onto variable y at obs i, go to obs(i+1) and repeat.
Assuming you have some rule for determining which MAJOR is correct for a MAJOR_CODE, you should do this:
This assumes majorslist is a dataset of every major/major_code pair whether unique or not - but only one per major/major_code pair.
proc sort data=majorslist;
by major_code major;
run;
data majorslist_unique;
set majorslist;
by major_code major;
if first.major_code and last.major_code then output;
else do;
*rule to determine whether to output it or not;
end;
run;
So, you now have the major_code/major relationship. Let's say you picked if first.major_code then output; as your rule (i.e., for each major_code, take the alphabetically first major value).
Now, you need to apply this to your larger dataset. There are a lot of ways to do that - merge this on is one, format is another, for starters. Format works like this:
Create a dataset with FMTNAME, START, LABEL defined. For each value of MAJOR_CODE, construct one row like that, where START is MAJOR_CODE and LABEL is MAJOR. We'll also add an extra line that says what to do with non-matches (in case you get new values of major_code).
data for_fmt;
set majorslist_unique;
fmtname='MAJORF'; *add a $ if MAJOR_CODE is a character variable;
start=major_code;
label=major;
output;
if _n_=1 then do;
hlo='o';
call missing(start);
label='NONMATCHED';
output;
end;
keep fmtname start label hlo;
run;
proc format cntlin=for_fmt;
quit;
Now you have a format, MAJORF. (or $MAJORF. if MAJOR_CODE is character), that you can use with the PUT function.
data my_bigdata2;
set my_bigdata;
major = put(major_code,MAJORF.);
run;
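The merge approach mentioned above could look roughly like this. The sample rows, the id variable and the intermediate name big_sorted are made up for illustration; the sketch assumes majorslist_unique is already sorted by major_code and holds one row per code.

```sas
/* Hypothetical lookup: one corrected MAJOR per MAJOR_CODE. */
data majorslist_unique;
  length major_code 8 major $20;
  input major_code major :$20.;
cards;
101 Education
202 Biology
;

/* Hypothetical large data set with inconsistent spellings. */
data my_bigdata;
  length major $20;
  input id major_code major :$20.;
cards;
1 101 Ed.
2 101 Education
3 202 Biology
;

proc sort data=my_bigdata out=big_sorted;
  by major_code;
run;

data my_bigdata2;
  /* Drop the old MAJOR so the lookup value replaces it. */
  merge big_sorted (in=in_big drop=major)
        majorslist_unique (keep=major_code major);
  by major_code;
  if in_big;   /* keep only rows that exist in the large data set */
run;
```

Unlike the format, a code with no match in the lookup ends up with a blank MAJOR here rather than 'NONMATCHED'.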
