what is the difference between SAS ARRAY and SAS IF-THEN - arrays

I have a table with students exams scores;
veriables: name, score1, score2, score3 and gender
wherever there is a missing value in one of the scores,
the score is set to 999.
I want to transform all 999's to missing (.) values.
I realized there are 2 main ways and I would like to know the MAIN difference between them.
As written above, both give the same output:
first:
data try ;
set mis_999 ;
if score1 = 999 then score1 = . ;
if score2 = 999 then score2 = . ;
if score3 = 999 then score3 = . ;
run ;
second (with array):
data array_try ;
set mis_999 ;
array try2{*} score1-score3 ;
do i=1 to dim(try2) ;
if try2(i) = 999 then try2(i) = . ;
end ;
run ;

For that example the main difference is that the code using an array is easier to expand to more variables.
In your first example you have what is referred to as wallpaper code, a lot of code that repeats the same pattern. If you have 500 variables instead of 3 you would need to write 500 statements. But with the array method you would just need to change the list of variables in the array definition. The DO loop would be the same.

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck in an array looping. SAS informs that "ERROR: Array subscript out of range at line x", guess there's something working regarding the logic? See codes below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testttime);
corr_time{i}=testttime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close, you just had the assignment backwards.
Note that an array method will assign the last date in the case of duplicate high results and WHICHN will report the first date so the answers are not identical unless you modify the loop to exit after the first maximum value is found.
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the maximum occurence;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop k;
format date1_a date2_b date3_c dob mmddyy8. ;
run;

array processing with different indices and missing values in SAS

have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:
id p1_activities p2_activities p3_activities
A 1,2,3,4 1,3
B 1,3 1,2,3 1,2,3
C 1,2,3 1,2,3
D 1,2,3
E 1,2,3 1
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
I do this now by initializing all of the variable names in a length statement, then writing a ton if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
My "array subscript is out of range" because of the different lengths of the player and activity arrays, but nesting in different do-loops would result in me writing many more lines of code than a wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?
Not sure why you are processing 4 activities for flags when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer, but perhaps more robust ?
This code examines each item in the activities list as opposed to seeking presence of a specific items (1..4):
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activites);
if missing(activities[i]) then continue; %* skip;
do j = 1 by 1;
item = scan ( activities[i], j, ',' );
if missing(item) then leave; %* no more items in csv list;
item_num = input (item,?1.);
if missing(item_num) then continue; %* skip, csv item is not a number;
if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
put 'WARNING:' item_num 'is invalid for flagging';
continue; %* skip, csv item is missing, 0, negative or exceeds 4;
end;
flags (i, item_num) = 1;
end;
* backfill zeroes where flag not assigned;
do j = 1 to hbound(flags,2);
flags (i, item_num) = sum (0, flags (i, item_num)); %* sum() handles missing values;
end;
end;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on ?
flags variables are reset to missing at top of step
hbound return 4 as upper limit of second dimension
findw(trim(activities[i]),cats(j),',') find position of j in csv string
trim needed to remove trailing spaces which are not part of findw word delimiter list
cats converts j number to character representation
findw returns position of j in csv string.
might want to also compress out spaces and other junk if activity data values are not reliable.
first > 0 evaluates position to 0 j not present and 1 present
second > 0 is a another logic evaluation that ensures j present flag remains 0 or 1. Otherwise flags would be a frequency count (imagine activity data 1,1,2,3)
flags(i,j) covers the 3 x 4 slots available for flagging.
Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step like could be done, but it is somewhat easy to work with.
data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when an
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;
Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;
share_flags
result

Assign an array's value as a dimension to another array in SAS

I've been working on a complicated code and am stuck in the end, where I need to assign one array's value as a dimension parameter to another array in the code. A snapshot from my code :
For example:
array temp_match_fl(3) temp_match_fl1 - temp_match_fl3;
ARRAY buracc_repay(3) buracc_repay1 - buracc_repay3;
ARRAY ocs_repay(3) ocs_repay1 - ocs_repay3;
jj = 0;
do until (jj>=3);
jj=jj+1;
If length(strip(match_flag(jj))) = 1 then do;
temp_match_fl(jj) = match_flag(jj);
end;
Else If length(strip(match_flag(jj))) > 1 then do;
j1 = 0;
min_diff = 99999999;
do until (j1>=length(strip(match_class(jj))));
j1=j1+1;
retain min_diff;
n=substr(strip(match_flag(jj)),j1,1);
If (min_diff > abs(buracc_repay(jj)-ocs_repay(n))) then do;
min_diff = abs(buracc_repay(jj)-ocs_repay(n));
temp_match_fl(jj) = n;
end;
end;
end;
kk=temp_match_fl(jj);
/* buracc_repay(jj) = ocs_repay(kk);*/
buracc_repay(jj) = ocs_repay(temp_match_fl(jj));
end;
run;
Now, I need to be able to assign the value stored in temp_match_fl(jj) array as dimension parameter to another array, how can I achieve that?? None of the last two statements work:
buracc_repay(jj) = ocs_repay(kk);
buracc_repay(jj) = ocs_repay(temp_match_fl(jj));
Can someone please suggest.
Thanks!
Actually your last two statements as written do work. Are you getting an error, or unexpected results? Can you make a simple example like below that shows the problem?
Note that for this to work, it's essential that the value of temp_match_fl(jj) is 1, 2, or 3, because your OCS_REPAY array has three elements. From the code you've shown, it's not clear if that is always true. You don't show the match_flag array.
data want ;
array temp_match_fl(3) temp_match_fl1 - temp_match_fl3 (1 2 3) ;
array buracc_repay(3) buracc_repay1 - buracc_repay3 (10 20 30) ;
array ocs_repay(3) ocs_repay1 - ocs_repay3 (100 200 300) ;
jj=1 ;
kk=2 ;
*buracc_repay(jj) = ocs_repay(kk); *this works ;
put temp_match_fl(jj)= ; *debug to confirm value is 1 2 or 3 ;
buracc_repay(jj) = ocs_repay(temp_match_fl(jj)); *this also works;
put (buracc_repay:)(=) temp_match_fl1=; *check output ;
run ;

SAS how to create variable with corresponding values efficiently

I am trying to complete the following.
Variable Letter has three values (a, b, c). I would like to create a variable Letter_2 with values corresponding to the values of Letter, namely (1, 2, 3).
I know I can do this using three IF Then statements.
if Letter='a' then Letter_2='1';
if Letter='b' then Letter_2='2';
if Letter='c' then Letter_2='3';
Suppose I have 15 values for the variable Letter, and 15 corresponding values for the replacement. Is there a way to do it efficiently without typing the same If Then statement 15 times?
I am new to SAS. Any clue will be appreciated.
Lisa
Looks like an application for a FORMAT.
First define the format.
proc format ;
value $lookup 'a'='1' 'b'='2' 'c'='3' ;
run;
Then use it to re-code your variable.
data want;
set have;
letter2 = put(letter,$lookup.);
run;
Or perhaps you could use two temporary arrays and the WHICHC() function?
data have;
input letter $10. ;
cards;
car
apple
box
;;;;
data want ;
set have ;
array from (3) $10 _temporary_ ('apple','box','car');
array to (3) $10 _temporary_ ('first','second','third');
if whichc(letter,of from(*)) then
letter_2 = to(whichc(letter,of from(*)))
;
run;

dynamic variable name in sas

with similar issue to this, My situation is a bit difference that Variable name are Var12, Var 24, Var36 instead of Var1 Var2 and Var3.
It gives Array Subscript out of range error.
data have;
input Index Var12 Var2 Var3;
cards;
12 78.3 54.7 79.8
36 67.2 56.2 12.3
24 65.3 45.2 98.1
12 56.2 49.7 11.3
12 67.2 98.2 98.6
;
run;
data want;
set have;
array vars(*) var: ;
var_index=vars(Index);
run;
Look into vvaluex function instead. It allows you to specify a string defining the variable, versus vvalue which takes a variable name (not a string).
Var_index=vvaluex('var'||put(index, 2. -l));
I think you have a typo in your input statement...
Assuming it should be
input Index Var12 Var24 Var36 ;
Then this code works if the input var fields have any numeric suffix and in any order :
data want ;
set have ;
array vars{*} var: ;
var_index = . ;
do i = 1 to dim(vars) ;
/* Get variable name of vars{i}, keep only digits, compare to var_index */
/* If they match, store the value from vars{i} */
if input(compress(vname(vars{i}),,'kd'),8.) = index then var_index = vars{i} ;
end ;
drop i ;
run ;
Since you have 3 variables with names starting with var, the array of 3 numeric variables will be created and therefore index value should be between 1 to 3.
Any value above 3 will give out of range error. You can use dim function to find out number of elements in the array declared.
code statement:
num_val = dim(vars);

Resources