SAS Help: Using Index function to compare 2 columns - arrays

I want to compare the string values of columns A and B using the INDEX function, that is, check whether A contains B. The only way I know how to do it is INDEX, but the problem is that INDEX doesn't seem to allow a column name in its parameters. It looks like you have to enter a literal string value.
Tried this: index(Address, HouseNumber)>0 but it doesn't work.
Example:
Address HouseNumber
123 Road Road
So I want to see if the Address column contains the HouseNumber value. It won't be a direct match; I just want to check whether A contains the string. I think using a macro variable or an array is the solution, but I do not know how to do it.

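You need to account for the padding that SAS does, since all character variables are fixed length. INDEX treats the trailing blanks in its second argument as part of the search string, so strip them first.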
data have ;
length Address HouseNumber $50;
infile cards dsd dlm='|';
input address housenumber ;
cards;
123 Road|Road
;;;;
data want ;
set have ;
if index(address,strip(HouseNumber));
run;
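To see why the padding matters, here is a minimal sketch that compares the padded and stripped calls on the same values. The variable names padded, trimmed and with_find are only for illustration, and the FIND call with the 't' modifier is an alternative not mentioned in the original answer.

```sas
data _null_;
  length address housenumber $50;
  address     = '123 Road';
  housenumber = 'Road';   /* stored as 'Road' followed by 46 blanks */

  /* The search string is all 50 characters, blanks included, so no match */
  padded    = index(address, housenumber);

  /* STRIP removes the padding before the search */
  trimmed   = index(address, strip(housenumber));

  /* FIND with the 't' modifier trims trailing blanks from both arguments */
  with_find = find(address, housenumber, 't');

  put padded= trimmed= with_find=;   /* log: padded=0 trimmed=5 with_find=5 */
run;
```

The column name was never the problem; the trailing blanks carried along with the column value were.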

This works - is it what you're trying to do?
data _null_;
a = '52 Festive Rd';
b = 'Festive';
if index(a,b) then put 'yes';
else put 'no';
run;

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck on the array loop. SAS reports "ERROR: Array subscript out of range at line x", so I guess there's something wrong with the logic? See the code below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testtime);
corr_time{i}=testtime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close, you just had the assignment backwards.
Note that an array method will assign the last date in the case of duplicate high results and WHICHN will report the first date so the answers are not identical unless you modify the loop to exit after the first maximum value is found.
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the first occurrence of the maximum;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop i;
format date1_a date2_b date3_c dob mmddyy8. ;
run;
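The note above about duplicate high results can be checked with a small sketch. The dataset ties and the variables last_hit, first_hit and which are made up for the illustration; the point is that a plain loop keeps overwriting, while adding LEAVE makes it agree with WHICHN.

```sas
data ties;
  input result_a result_b result_c;
  cards;
5 7 7
;

data _null_;
  set ties;
  array results result_a result_b result_c;
  highest = max(of results[*]);

  /* Plain loop: every match overwrites, so the last one wins */
  do i = 1 to dim(results);
    if results[i] = highest then last_hit = i;
  end;

  /* Loop with LEAVE: stops at the first match, like WHICHN */
  do i = 1 to dim(results);
    if results[i] = highest then do;
      first_hit = i;
      leave;
    end;
  end;

  which = whichn(highest, of results[*]);
  put last_hit= first_hit= which=;   /* log: last_hit=3 first_hit=2 which=2 */
run;
```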

SAS / Using an array to convert multiple character variables to numeric

I am a SAS novice. I am trying to convert character variables to numeric. The code below works for one variable, but I need to convert more than 50 variables, hopefully simultaneously. Would an array solve this problem? If so, how would I write the syntax?
DATA conversion_subset;
SET have;
new_var = input(oldvar,4.);
drop oldvar;
rename new_var=oldvar;
RUN;
@Reeza
DATA conversion_subset;
SET have;
Array old_var(*) $ a_20040102--a_20040303 a_302000--a_302202;
* The first list contains 8 variables. The second list contains 7 variables;
Array new_var(15) var1-var15;
Do i=1 to dim(old_var);
new_var(i) = input(old_var(i),4.);
End;
*drop a_20040102--a_20040303 a_302000--a_302202;
*rename var1-var15 = a_20040102--a_20040303 a_302000--a_302202;
RUN;
NOTE: Invalid argument to function INPUT at line 64 column 19
(new_var(i) = input(old_var(i),4.)
@Reeza
I am still stuck on this array. Your help would be greatly appreciated. My code:
DATA conversion_subset (DROP= _20040101 _20040201 _20040301);
SET replace_nulls;
Array _char(*) $ _200100--_601600;
Array _num(*) var1-var90;
Do i=1 to dim(_char);
_num(i) = input(_char(i),4.);
End;
RUN;
I am receiving the following error: ERROR: Array subscript out of range at line 64 column 6. Line 64 contains the input statement.
Yes, an array solves this issue. You will want a simple way to list the variables, so look into SAS variable lists as well. For example, if you're converting all character variables between first and last, you could list them as first_var-character-last_var.
The rename/drop are illustrated in other questions across SO.
DATA conversion_subset;
SET have;
Array old_var(50) $ first-character-last;
Array new_var(50) var1-var50;
Do i=1 to 50;
new_var(i) = input(old_var(i),4.);
End;
RUN;
As @Parfait suggests, it would be best to adjust it when you are getting it, rather than after it is already in a SAS data set. However, if you're given the data set and have to convert that, that's what you have to do. You can add a WHERE clause to the PROC SQL to exclude variables that should not be converted. If you do so, they won't be in the final data set unless you add them in the CREATE TABLE's SELECT clause.
PROC CONTENTS DATA=have OUT=havelist NOPRINT ;
RUN ; %* get variable names ;
PROC SQL ;
SELECT 'INPUT(' || name || ',4.) AS ' || name
INTO :convert SEPARATED BY ','
FROM havelist
; %* create the select statement ;
CREATE TABLE conversion_subset AS
SELECT &convert
FROM have
;
QUIT ;
If excluding variables is an issue and/or you want to use a DATA step, then use the PROC CONTENTS above and follow with:
PROC SQL ;
SELECT COMPRESS(name || '_n=INPUT(' || name || ',4.)'),
COMPRESS(name || '_n=' || name),
COMPRESS(name)
INTO :convertlst SEPARATED BY ';',
:renamelst SEPARATED BY ' ',
:droplst SEPARATED BY ' '
FROM havelist
;
QUIT ;
DATA conversion_subset (RENAME=(&renamelst)) ;
SET have ;
&convertlst ;
DROP &droplst ;
RUN ;
Again, add a where clause to exclude variables that should not be converted. This will automatically preserve any variables that you exclude from conversion with a WHERE in the PROC SQL SELECT.
If you have too many variables, or their names are very long, or adding _n to the end causes a name collision, things can go badly (too much data for a macro variable, illegal field name, one field overwriting another, respectively).

SAS how to create variable with corresponding values efficiently

I am trying to complete the following.
Variable Letter has three values (a, b, c). I would like to create a variable Letter_2 with values corresponding to the values of Letter, namely (1, 2, 3).
I know I can do this using three IF Then statements.
if Letter='a' then Letter_2='1';
if Letter='b' then Letter_2='2';
if Letter='c' then Letter_2='3';
Suppose I have 15 values for the variable Letter, and 15 corresponding values for the replacement. Is there a way to do it efficiently without typing the same If Then statement 15 times?
I am new to SAS. Any clue will be appreciated.
Lisa
Looks like an application for a FORMAT.
First define the format.
proc format ;
value $lookup 'a'='1' 'b'='2' 'c'='3' ;
run;
Then use it to re-code your variable.
data want;
set have;
letter_2 = put(letter,$lookup.);
run;
Or perhaps you could use two temporary arrays and the WHICHC() function?
data have;
input letter $10. ;
cards;
car
apple
box
;;;;
data want ;
set have ;
array from (3) $10 _temporary_ ('apple','box','car');
array to (3) $10 _temporary_ ('first','second','third');
if whichc(letter,of from(*)) then
letter_2 = to(whichc(letter,of from(*)))
;
run;

SAS Put Certain Row as New Variable Names After Manipulation

After importing my CSV data with GETNAMES=NO, I have 59 columns with variable names VAR1, VAR2, . . . VAR59. My first row contains the names I need for the new variables, but they first needed to be cleaned up by removing special characters and turning spaces into underscores, since SAS doesn't like spaces in variable names. This is the array I used for that piece:
DATA DATA1; SET DATA (FIRSTOBS=7);
ARRAY VAR(59) VAR1-VAR59;
IF _N_ = 1 THEN DO;
DO I = 1 TO 59;
VAR[I] = COMPRESS(TRANSLATE(TRIM(VAR[I]),'_',' '),'?()');
PUT VAR[I]=;
END;
END;
DROP I;
RUN;
This worked perfectly, but now I need to get this first row up to the new variable names. I tried a similar array to perform this:
DATA DATA2; SET DATA1;
ARRAY V(59) VAR1-VAR59;
DO I = 1 TO 59;
IF _N_ = 1 AND V[I] NE "" THEN CALL SYMPUT("NEWNAME",V[I]);
RENAME VAR[I] = &NEWNAME;
END;
DROP I;
RUN;
This only puts the name of VAR59, since there is no [i] connected to the &NEWNAME, and it still isn't working quite right. Any suggestions for moving a row up to variable names AFTER manipulation?
Your primary problem is you are trying to use a macro variable in the data step it's created in. You can't. You're also trying to create rename statements in the data step; rename, as with other similar statements (keep, drop), must be defined before the data step is compiled.
You need to write code somewhere - either in a text file, a macro variable, whatever - with this information. For example:
filename renamef temp;
data _null_;
set myfile (obs=1);
file renamef;
array var[59];
do _i = 1 to dim(Var);
[your code to clean it out];
strput = catx(' ', 'rename', cats(vname(var[_i]), '=', var[_i]), ';');
put strput;
end;
run;
data want;
set myfile (firstobs=2);
%include renamef;
run;
There are lots of other examples to this on the site and on the web, "list processing" is the term for this.
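The timing problem described above (a macro variable can't be used in the data step that creates it) can be seen in a small sketch. The macro variable newname and the values before and after are placeholders for the illustration.

```sas
%let newname = before;

data _null_;
  /* Runs while the step executes, after the step has been compiled */
  call symputx('newname', 'after');

  /* The reference was resolved at compile time, so this is the old value */
  compile_time = "&newname";

  /* SYMGET looks the value up at execution time, so this is the new value */
  run_time = symget('newname');

  put compile_time= run_time=;   /* log: compile_time=before run_time=after */
run;

%put After the step: &newname;   /* log: After the step: after */
```

SYMGET helps with values, but not with RENAME, which is fixed when the step is compiled. That is why the generated code has to be written out first and then included in a later step.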
Joe -- using your suggestions and another one of your posts, the following worked flawlessly:
Step 1: Put the row of needed variable names into long format (in my case the first row, so _N_ = 1).
DATA NEWVARS; SET DATA;
IF _N_ = 1 THEN OUTPUT NEWVARS;
RUN;
PROC TRANSPOSE DATA = NEWVARS OUT=NEWVARS1;
VAR _ALL_;
RUN;
Step 2: Create a list of rename macro calls.
PROC SQL;
SELECT CATS('%RENAME(VAR=',_NAME_,',NEWVAR=',COL1,')')
INTO :RENAMELIST SEPARATED BY ' '
FROM NEWVARS1;
QUIT;
%MACRO RENAME(VAR=,NEWVAR=);
RENAME &VAR.=&NEWVAR.;
%MEND RENAME;
Step 3: Call in the list created in Step 2 to rename all variables.
PROC DATASETS LIB=WORK NOLIST;
MODIFY DATA;
&RENAMELIST.;
QUIT;
I had to perform a few additional checks to make sure that the variable names were not longer than 32 characters, and this was easy to check when the data was in long format after transposing. If certain words make the names too long, the TRANWRD function can easily replace them with abbreviations.
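One way such a check might look, as a sketch only: it assumes the transposed data set NEWVARS1 from above, with the proposed names in COL1, and uses the NVALID function with the 'v7' rule, which is not part of the original post.

```sas
proc sql;
  /* List proposed names that are too long or otherwise not valid V7 names */
  select _name_,
         col1,
         length(strip(col1)) as name_length
    from newvars1
    where length(strip(col1)) > 32
       or nvalid(strip(col1), 'v7') = 0;
quit;
```

Any rows this prints would need shortening or cleaning before the rename list is built.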

SAS use value from one observation to overwrite different one

I have a data set with two main variables of interest now - Major and Major_Code. These should match up 1 to 1 but there are some errors I need to fix and what I've found is that for 14 Major_Code values, there are two different Majors. This is only due to a change in spelling or punctuation, like "ed." and "education". They are supposed to have the same value here but don't.
So I have a table with 7 pairs. Each pair has the same Major_Code and a different Major. How can I select one of the Major values to use for each code? My only idea was an if-then statement, but that seems horribly inefficient.
I found the doubled values like this:
proc freq data=majorslist;
tables Major_Code/out=majorcodedups;
run;
proc print data=majorcodedups;
where COUNT > 1;
run;
So I can easily find these observations but can't extract certain values to overwrite onto another observation. I've looked into arrays, macros, sql and transpose but it's all a bit over my head right now.
Logically it would work like this:
from obs i to n, find value for variable x at obs i, output value onto variable y at obs i, go to obs(i+1) and repeat.
Assuming you have some rule for determining which MAJOR is correct for a MAJOR_CODE, you should do this:
This assumes majorslist is a dataset of every major/major_code pair whether unique or not - but only one per major/major_code pair.
proc sort data=majorslist;
by major_code major;
run;
data majorslist_unique;
set majorslist;
by major_code major;
if first.major_code and last.major_code then output;
else do;
*rule to determine whether to output it or not;
end;
run;
So, you now have the major_code/major relationship. Let's say you picked if first.major_code then output; as your rule (i.e., for each major_code, take the alphabetically first major value).
Now, you need to apply this to your larger dataset. There are a lot of ways to do that - merge this on is one, format is another, for starters. Format works like this:
Create a dataset with FMTNAME, START, LABEL defined. For each value of MAJOR_CODE, construct one row like that, where START is MAJOR_CODE and LABEL is MAJOR. We'll also add an extra line that says what to do with non-matches (in case you get new values of major_code).
data for_fmt;
set majorslist_unique;
fmtname='MAJORF'; *add a $ if MAJOR_CODE is a character variable;
start=major_code;
label=major;
output;
if _n_=1 then do;
hlo='o';
call missing(start);
label='NONMATCHED';
output;
end;
keep fmtname start label hlo;
run;
proc format cntlin=for_fmt;
quit;
Now you have a format, MAJORF. (or $MAJORF. if MAJOR_CODE is character), that you can use in a PUT function call.
data my_bigdata2;
set my_bigdata;
major = put(major_code,MAJORF.);
run;
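The merge route mentioned earlier could look roughly like this. It is a sketch that assumes my_bigdata already contains both major_code and major, and that majorslist_unique has exactly one row per major_code and is still sorted by it. The names my_bigdata_sorted, my_bigdata3 and in_big are made up here.

```sas
proc sort data=my_bigdata out=my_bigdata_sorted;
  by major_code;
run;

data my_bigdata3;
  /* Drop the old spelling so the value from the lookup table is used */
  merge my_bigdata_sorted (in=in_big drop=major)
        majorslist_unique (keep=major_code major);
  by major_code;
  if in_big;   /* keep only rows that exist in the big dataset */
run;
```

Rows whose major_code is not in the lookup table end up with a missing major, where the format version returns NONMATCHED.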
