array processing with different indices and missing values in SAS - arrays

have is a SAS data set with 4 variables: an id and three variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers that populate the p#_activities vars for each player (p1 to p3). Below are the first 5 obs:
id  p1_activities  p2_activities  p3_activities
A   1,2,3,4        1,3
B   1,3            1,2,3          1,2,3
C                  1,2,3          1,2,3
D                  1,2,3
E   1,2,3                         1
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag  p1_act2_flag  p1_act3_flag  p1_act4_flag  p2_act1_flag  p2_act2_flag  …
1             1             1             1             1             0             …
1             0             1             0             1             1             …
.             .             .             .             1             1             …
.             .             .             .             1             1             …
1             1             1             0             .             .             …
I do this now by initializing all of the variable names in a length statement, then writing a ton of if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
I get an "array subscript out of range" error because of the different lengths of the player and activity arrays, but nesting additional do-loops would have me writing many more lines of code than a wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?

Not sure why you are declaring flags for 4 players when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer, but perhaps more robust ?
This code examines each item in the activities list, as opposed to seeking the presence of specific items (1..4):
data want;
  set have;
  array activities
    activities_p1-activities_p3
  ;
  array flags(3,4)
    flag_p1_1-flag_p1_4
    flag_p2_1-flag_p2_4
    flag_p3_1-flag_p3_4
  ;
  do i = 1 to dim(activities);
    if missing(activities[i]) then continue; %* skip;
    do j = 1 by 1;
      item = scan ( activities[i], j, ',' );
      if missing(item) then leave; %* no more items in csv list;
      item_num = input (item,?1.);
      if missing(item_num) then continue; %* skip, csv item is not a number;
      if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
        put 'WARNING:' item_num 'is invalid for flagging';
        continue; %* skip, csv item is 0, negative or exceeds 4;
      end;
      flags (i, item_num) = 1;
    end;
    * backfill zeroes where flag not assigned;
    do j = 1 to hbound(flags,2);
      flags (i, j) = sum (0, flags (i, j)); %* sum() handles missing values;
    end;
  end;
run;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on ?
flags variables are reset to missing at the top of each step iteration
hbound returns 4, the upper limit of the second dimension
findw(trim(activities[i]),cats(j),',') returns the position of j in the csv string, or 0 when j is not present
trim is needed to remove trailing spaces, which are not part of the findw word delimiter list
cats converts the number j to its character representation
you might also want to compress out spaces and other junk if the activity data values are not reliable
the first > 0 turns the position into 0 when j is not present and 1 when it is present
the second > 0 is another logical evaluation that keeps the flag at 0 or 1; it guards against the flag turning into a frequency count if a slot were ever summed more than once (imagine activity data 1,1,2,3)
flags(i,j) covers the 3 x 4 slots available for flagging.
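If renaming the columns is not an option, the same two-dimensional idea can be applied to the variable names used in the question. This is a minimal sketch, assuming the flags are listed player by player, so that SAS's row-by-row fill of a two-dimensional array lines row i up with player i. The sample have data set is retyped here from the question, with empty fields standing for missing lists.

```sas
/* sample data retyped from the question (empty field = no activities listed) */
data have;
  infile datalines dsd dlm='|' truncover;
  input id $ p1_activities $ p2_activities $ p3_activities $;
  datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3
C||1,2,3|1,2,3
D||1,2,3|
E|1,2,3||1
;

data want;
  set have;
  array plracts {3} p1_activities p2_activities p3_activities;
  /* rows = players, columns = activities */
  array flg {3,4}
    p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
    p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
    p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
  do i = 1 to dim(flg, 1);
    do j = 1 to dim(flg, 2);
      /* a missing list leaves the flag missing, otherwise 1 if j is listed and 0 if not */
      if missing(plracts{i}) then flg{i,j} = .;
      else flg{i,j} = (findw(trim(plracts{i}), cats(j), ',') > 0);
    end;
  end;
  drop i j;
run;
```

Adding a player or an activity then only means extending the two array statements; the loop bounds follow from DIM.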

Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step could be, but it is fairly easy to work with.
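data have;
    /* truncover: a short line sets the trailing variables to missing instead of reading on into the next line */
    infile datalines dlm='|' truncover;
    input id$ p1_activities$ p2_activities$ p3_activities$;
    datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;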
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;

Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;

Related

SAS - Invalid numeric data while searching through an Array

I am trying to create an array of strings and want to insert a value in it, if it does not exist already in the array.
I read somewhere that we can use 'IN' operator with Array. So, coded it as follows:
DATA WANT;
SET HAVE;
BY ID;
ARRAY R_PROS_SCRN_ID {2} $4. R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
RETAIN R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
IF NOT PROS_SCRN_ID IN R_PROS_SCRN_ID THEN DO;
DO I=1 to 2 ;
IF MISSING( R_PROS_SCRN_ID{i}) THEN DO;
R_PROS_SCRN_ID{i} = PROS_SCRN_ID;
LEAVE;
END;
END;
END;
IF LAST.ID THEN OUTPUT;
RUN;
In Array R_PROS_SCRN_ID, I want only the unique values from field PROS_SCRN_ID.
It is throwing error:
NOTE: Invalid numeric data, PROS_SCRN_ID='MED' , at line 17352 column 201.
I think it is because I did not initialize the Array before comparing and hence it is considering it as Numeric Array. But, I have specified the format as $4. Why is it throwing error?
Also, I am not sure if this is the best way get unique values in an Array. Is there any better way to implement this?
Your code appears to be collecting unique values by group, pivoting from a tall data structure to a wide data structure.
One of the clearest DATA step ways is to use what we call DOW loop in which SET is within the loop. This sample code presumes no more than 10 unique satellite values per group. (The by variables can be thought of as key variables, and all other variables would be satellites)
data have;
input user_id screen_id ;
datalines;
1 1
1 2
1 1
1 1
1 1
1 3
2 1
2 1
2 1
3 0
4 1
4 2
4 3
5 11
5 11
5 11
5 5
5 1
5 5
5 6
5 1
run;
data want;
_index = 0;
do until (last.user_id);
set have;
by user_id;
array ids screen_id1-screen_id10;
if screen_id not in ids then do;
_index + 1;
ids(_index) = screen_id;
end;
end;
drop _index screen_id;
run;
One of the clearest procedural ways is to select the unique values and transpose them.
proc sql;
create view uniqueScreenByUser as
select distinct user_id, screen_id
from have
order by user_id
;
proc transpose data=uniqueScreenByUser prefix=screen_id out=wantWide(drop=_name_);
by user_id;
var screen_id;
run;
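As for the error in the question itself: the note most likely comes from operator precedence rather than from the array definition. NOT is evaluated before IN, so NOT PROS_SCRN_ID IN R_PROS_SCRN_ID is read as (NOT PROS_SCRN_ID) IN R_PROS_SCRN_ID, which pushes the character value 'MED' through a numeric conversion. Below is a minimal sketch of the question's step with the test written as NOT IN. The sample rows and the reset at FIRST.ID are assumptions added for illustration.

```sas
/* hypothetical sample rows, not taken from the question */
data have;
  input id pros_scrn_id $;
  datalines;
1 MED
1 DEN
1 MED
2 VIS
;

data want;
  set have;
  by id;
  array r_pros_scrn_id {2} $4 r_pros_scrn_id_1 - r_pros_scrn_id_2;
  retain r_pros_scrn_id_1 - r_pros_scrn_id_2;
  /* assumption: each id should start with an empty list */
  if first.id then call missing(of r_pros_scrn_id{*});
  /* NOT IN is a single comparison, so no numeric conversion is attempted */
  if pros_scrn_id not in r_pros_scrn_id then do i = 1 to dim(r_pros_scrn_id);
    if missing(r_pros_scrn_id{i}) then do;
      r_pros_scrn_id{i} = pros_scrn_id;
      leave;
    end;
  end;
  if last.id then output;
  drop i pros_scrn_id;
run;
```

Writing the condition as NOT (PROS_SCRN_ID IN R_PROS_SCRN_ID) should behave the same way; the parentheses make the intended grouping explicit.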

Setting conditions in arrays in SAS

I have a set of data that looks like this:
ID   Status31Jan2007  Status28Feb2007  Status31Mar2007
001  .                0                0
002  1                0                0
003  1                1                0
I have Statusddmmyyyy fields of either '0' or '1' for 118 months. (here, I only have three months as a sample)
I want to get results like this:
ID Flag1 Flag2 Flag3
001 N N N
002 Y N N
003 Y Y N
The logic is: if Status31Jan2007 = 1 and at least one of the Status fields for the following two months is 0, then flag it as 'Y'. Else, 'N'.
Meaning,
If my ID is 001 and the value of Status31Jan2007 is missing, I flag it as 'N' under Flag1.
Moving on to the next month, Status28Feb2007, the value is 0, so I automatically flag it as 'N' under Flag2 as well. The same applies to the next month.
Looking at ID 002, Status31Jan2007 is 1, and the following two months hold two 0 values. The count of 0 values is > 0, so I flag it as 'Y' under Flag1.
But Status28Feb2007 is 0. It doesn't fit the criteria, so I flag it as 'N' under Flag2.
In other words, the status for the month itself must be 1 before I look at the following two months.
After getting the results, how do I count the number of flags N and Y under each fields?
Count1 Count2 Count3
N 1 2 3
Y 2 1 0
Would appreciate the help as I am new to SAS. Thanks.
This will only work if the column names across are in calendar order.
Use an ARRAY statement to organize and then access variables by index and thus easily process the [index+1] and [index+2] checks your logic indicates. You can also use temporary arrays to maintain a count as you assign the flag values; at the last row the counts are output to a separate table.
Note: for status variables taking on either 0 or 1 the count of 1's can be computed using SUM. The sum of two status variables will be < 2 when either of them is 0.
* simulate some data;
data prelim;
do id = 1 to 20;
do date = '01jan07'd by 1 until(intck('month', '01jan07'd, date) >= 117);
date = intnx('month', date, 1) - 1;
status = ranuni(123) < 0.45;
if date = '31jan07'd and mod(id,5) = 1 then status = .;
output;
end;
end;
format date date9.;
run;
* change the shape of simulated data to match the question;
proc transpose data=prelim prefix=Status out=have(drop=_name_);
by id;
var status;
id date;
run;
* process the problem shaped data;
data
want (keep=id status: flag:)
want_count (keep=flag_value count:)
;
set have end=lastid;
retain sentinel1 sentinel2 0;
array status status: sentinel1 sentinel2; * map all the Status* variables to an array named status;
array flag [118] $1 ; * automatically creates 118 new variables flag1 to flag118;
array yfreq [118] _temporary_ (118*0); * temporary arrays initialized to 0;
array nfreq [118] _temporary_ (118*0);
* process each month status, -2 because of the sentinels ;
do i = 1 to dim(status)-2;
* assign flag according to the logic, some cases require a 2-month look ahead;
select;
when ( status(i) = . ) flag(i) = 'N';
when ( status(i) = 0 ) flag(i) = 'N';
when ( status(i) = 1
and sum(status(i+1),status(i+2)) < 2 ) flag(i) = 'Y'; * SUM trick;
otherwise
flag(i) = 'N';
end;
* track frequencies of flags assigned;
if flag(i) = 'N'
then nfreq(i)+1;
else yfreq(i)+1;
end;
output want;
if lastid then do;
* all flags for all ids have been binned for frequency;
* output the freqs to a count data set;
length flag_value $1;
array freq count1-count118;
flag_value = 'N'; do i = 1 to dim(nfreq); freq(i) = nfreq(i); end; output want_count;
flag_value = 'Y'; do i = 1 to dim(yfreq); freq(i) = yfreq(i); end; output want_count;
end;
run;

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
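One thing to watch in the second step: class_ranks has already lost rows to the WHERE= filter, so NOBS= there counts the filtered data set, while the ranks still run up to the original number of observations. Here is a sketch that sidesteps this by ranking without a filter and trimming both ends in one data step, assuming the analysis variables have no missing values (true for sashelp.class). The names class_ranks_all and class_trimmed are made up for the example.

```sas
/* rank every observation; no filter yet, so NOBS= below is the full count */
proc rank data=sashelp.class out=class_ranks_all;
  var height weight;
  ranks height_r weight_r;
run;

data class_trimmed;
  set class_ranks_all nobs=nobs;
  /* keep rows whose rank is above the lowest 4 and below the highest 4 */
  if 4 < height_r < nobs - 3 and 4 < weight_r < nobs - 3;
run;
```

With tied values PROC RANK assigns averaged ranks by default, so how ties at the cut-off are treated depends on the TIES= option.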
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (null values are counted as a distinct value).
In your data step you can use first.foo and the macro var to selectively output only those that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
run;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Count the number of times a value occurs

I have 7 variables, 489 observations with variable values of 0-4.
What I need is the count percentage of use.
Answers 0,1 stand for non usage, and answers 2,3,4 stand for usage.
I created 7 additional vars and turned all the values above to:
1=usage - 0=non-usage.
Now, I don't know how to count and present how many "1" I have for each var and divide it by 489.
data LAB7;
  set LAB3;
  array v{*} v21-v27;
  array VU{7};
  DO i=1 to dim(v);
    /* v[i] = 1|0 would be evaluated as (v[i] = 1) | 0, so test membership instead */
    if v[i] in (0, 1) THEN VU[i]=0;
    else VU[i]=1;
  END;
run;
You can do this:
data usage;
set lab3 end=eof;
array v{*} v21-v27;
array n{7};
retain n: 0;
do i = 1 to dim(v);
if v[i] in (2, 3, 4) then n[i] + 1;
end;
if eof then do j = 1 to dim(v);
variable = vname(v[j]);
pct_usage = 100 * n[j] / _n_;
output;
end;
keep variable pct_usage;
run;
This creates an array of counters, one per variable, that are incremented by one whenever the corresponding variable is equal to 2, 3, or 4.
At the end of the data step, we output a record for each variable and record the percentage as the counter divided by the number of observations (_n_ when eof is true).
An alternative would be to use proc freq.
data indicators;
set lab3;
array v{*} v21-v27;
array ind{7};
do i = 1 to dim(v);
ind[i] = (v[i] in (2, 3, 4));
end;
run;
proc freq data = indicators;
tables ind: / out = usage;
run;
This creates binary indicator variables, one for each of the input variables, that are 1 when the input is 2, 3, or 4, and 0 otherwise. Counts and percentages are then obtained using proc freq.
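Since the mean of a 0/1 indicator is the share of ones, PROC MEANS on the indicator variables is another route to the percentages. The sketch below rebuilds the indicators on a tiny stand-in for lab3 so that it runs on its own. The two data rows and the names usage_share, usage_pct and pct1-pct7 are assumptions made for the example.

```sas
/* hypothetical two-row stand-in for lab3 */
data lab3;
  input v21-v27;
  datalines;
0 1 2 3 4 0 2
4 4 0 1 2 3 1
;

/* same indicator logic as above: 1 for values 2, 3 or 4, else 0 */
data indicators;
  set lab3;
  array v{*} v21-v27;
  array ind{7};
  do i = 1 to dim(v);
    ind[i] = (v[i] in (2, 3, 4));
  end;
  drop i;
run;

/* the mean of each indicator is the proportion of usage */
proc means data=indicators noprint;
  var ind1-ind7;
  output out=usage_share(drop=_type_ _freq_) mean=pct1-pct7;
run;

/* rescale the proportions to percentages */
data usage_pct;
  set usage_share;
  array pct{*} pct1-pct7;
  do i = 1 to dim(pct);
    pct[i] = 100 * pct[i];
  end;
  drop i;
run;
```

This gives one row with seven percentages, which may be easier to report than seven separate frequency tables.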

Checking correct order of values

I have a data set that looks similar to the one below. Basically, I have current prices for three different sizes of an item type. If the sizes are priced correctly (ie small<medium<large) I want to flag them with a “Y” and continue to use the current price. If they are not priced correctly, I want to flag them with a “N” and use the recommended price. I know that this is probably a time to use array programming, but my array skills are, admittedly, a bit weak. There are hundreds of locations, but only one item type. I currently have the unique locations loaded in a macro variable list.
data have;
  input type $ location $ size $ cur_price rec_price;
  cards;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;
proc sql;
  select distinct location into :loc_list separated by ' ' from have;
quit;
Any help would be greatly appreciated.
Thanks.
Not sure why you'd want to use an array here...proc transpose and some data step logic
can easily solve this problem. Arrays are very useful (gotta admit, I'm not entirely
comfortable with them either), but in a situation where you have that many locations,
I think transpose is better.
Does the code below accomplish your goal?
/*sorts to get ready for transpose*/
proc sort data=have;
by location;
run;
/*transpose current price*/
proc transpose data=have out=cur_tran prefix=cur_price;
by location;
id size;
var cur_price;
run;
/*transpose recommended price*/
proc transpose data=have out=rec_tran prefix=rec_price;
by location;
id size;
var rec_price;
run;
/*merge back together*/
data merged;
merge cur_tran rec_tran;
by location;
run;
/*creates flags and new field for final price*/
data want;
set merged;
if cur_priceS<cur_priceM<cur_priceL then
do;
FLAG='Y';
priceS=cur_priceS;
priceM=cur_priceM;
priceL=cur_priceL;
end;
else do;
FLAG='N';
priceS=rec_priceS;
priceM=rec_priceM;
priceL=rec_priceL;
end;
run;
I don't see how arrays would help here. How about just checking using dif to queue the last record's price and verify it (could also retain the last price if you prefer). Make sure the dataset's properly sorted by type location descending size, then:
data want;
  set have;
  by type location descending size; *S > M > L alphabetically;
  retain price_check;
  price_diff = dif(cur_price); *call dif on every row so its queue always holds the previous price;
  if first.location then price_check=0; *reset it;
  else if price_diff le 0 then price_check=1; *current size is not priced above the previous smaller size;
  if last.location;
  keep type location price_check;
run;
Then merge that back to your original dataset by type location, and use the other price if price_check=1.
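The whole check-and-merge sequence could look like the following. It is a sketch on the question's sample data, assuming type is read as a character variable and that equal prices count as incorrectly ordered. The data set names check and final and the variables price_diff and price are introduced here for illustration.

```sas
/* sample data from the question, with type read as character */
data have;
  input type $ location $ size $ cur_price rec_price;
  datalines;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;

/* descending size orders each location as S, M, L */
proc sort data=have;
  by type location descending size;
run;

/* one row per location: price_check = 1 when prices do not strictly increase with size */
data check;
  set have;
  by type location descending size;
  retain price_check;
  price_diff = dif(cur_price);
  if first.location then price_check = 0;
  else if price_diff le 0 then price_check = 1;
  if last.location;
  keep type location price_check;
run;

/* attach the check to every row and pick the price to use */
data final;
  merge have check;
  by type location;
  if price_check = 1 then price = rec_price;
  else price = cur_price;
run;
```

On this data NY should keep its current prices, while LA and DC should fall back to the recommended ones.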
Alternatively you could do it in a single query which is almost a re-statement of your requirements.
Proc sql;
create table want as
select *
/* Basically, I have current prices for three different sizes of an item type.
If the sizes are priced correctly (ie small<medium<large) */
, case
when max ( case when size eq 'S' then cur_price end)
lt max ( case when size eq 'M' then cur_price end)
and max ( case when size eq 'M' then cur_price end)
lt max ( case when size eq 'L' then cur_price end)
/* I want to flag them with a “Y” and continue to use the current price */
then 'Y'
/* If they are not priced correctly,
I want to flag them with a “N” and use the recommended price. */
else 'N'
end as Cur_Price_Sizes_Correct
, case
when calculated Cur_Price_Sizes_Correct eq 'Y'
then cur_price
else rec_price
end as Price
From have
Group by Type
, Location
;
Quit;
