Setting conditions in arrays in SAS - arrays

I have a set of data that looks like this:
ID Status31Jan2007 Status28Feb2007 Status31Mar2007
001 . 0 0
002 1 0 0
003 1 1 0
I have Statusddmmyyyy fields of either '0' or '1' for 118 months. (here, I only have three months as a sample)
I want to get results like this:
ID Flag1 Flag2 Flag3
001 N N N
002 Y N N
003 Y Y N
The logic: if the status as at a given month (say Status31Jan2007) is 1 and at least one of the following two months has a status of 0, flag it as 'Y'. Otherwise, flag it as 'N'.
For example:
If my ID is 001 and the value as at Status31Jan2007 is missing, I flag it as 'N' under Flag1.
Moving on to the next month, Status28Feb2007, the value is 0, so I automatically flag it as 'N' under Flag2 as well. The same applies to the following month.
Looking at ID 002, Status31Jan2007 is 1, and the following two months have two 0 values. The count of '0' values is > 0, so I flag it as 'Y' under Flag1.
But as at Status28Feb2007 it is 0. It doesn't fit the criteria, so I flag it as 'N' under Flag2.
In short, only when the status for the month itself is 1 do I go on to look at the following two months.
After getting the results, how do I count the number of N and Y flags under each field?
Count1 Count2 Count3
N 1 2 3
Y 2 1 0
Would appreciate the help as I am new to SAS. Thanks.

This will only work if the column names across are in calendar order.
Use an ARRAY statement to organize and then access variables by index and thus easily process the [index+1] and [index+2] checks your logic indicates. You can also use temporary arrays to maintain a count as you assign the flag values; at the last row the counts are output to a separate table.
Note: for status variables taking on either 0 or 1 the count of 1's can be computed using SUM. The sum of two status variables will be < 2 when either of them is 0.
* simulate some data;
data prelim;
do id = 1 to 20;
do date = '01jan07'd by 1 until(intck('month', '01jan07'd, date) >= 117);
date = intnx('month', date, 1) - 1;
status = ranuni(123) < 0.45;
if date = '31jan07'd and mod(id,5) = 1 then status = .;
output;
end;
end;
format date date9.;
run;
* change the shape of simulated data to match the question;
proc transpose data=prelim prefix=Status out=have(drop=_name_);
by id;
var status;
id date;
run;
* process the problem shaped data;
data
want (keep=id status: flag:)
want_count (keep=flag_value count:);
;
set have end=lastid;
retain sentinel1 sentinel2 0;
array status status: sentinel1 sentinel2; * map all the Status* variables to an array named status;
array flag [118] $1 ; * automatically creates 118 new variables flag1 to flag118;
array yfreq [118] _temporary_ (118*0); * temporary arrays initialized to 0;
array nfreq [118] _temporary_ (118*0);
* process each month status, -2 because of the sentinels ;
do i = 1 to dim(status)-2;
* assign flag according to the logic, some cases require a 2-month look ahead;
select;
when ( status(i) = . ) flag(i) = 'N';
when ( status(i) = 0 ) flag(i) = 'N';
when ( status(i) = 1
and sum(status(i+1),status(i+2)) < 2 ) flag(i) = 'Y'; * SUM trick;
otherwise
flag(i) = 'N';
end;
* track frequencies of flags assigned;
if flag(i) = 'N'
then nfreq(i)+1;
else yfreq(i)+1;
end;
output want;
if lastid then do;
* all flags for all ids have been binned for frequency;
* output the freqs to a count data set;
length flag_value $1;
array freq count1-count118;
flag_value = 'N'; do i = 1 to dim(nfreq); freq(i) = nfreq(i); end; output want_count;
flag_value = 'Y'; do i = 1 to dim(yfreq); freq(i) = yfreq(i); end; output want_count;
end;
run;
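As a quick sanity check, here is a minimal sketch of the same look-ahead logic applied to just the three sample rows from the question. The data set names have3 and want3 and the array name st are made up for this illustration, and the missing January value for ID 001 is assumed to be a true missing value. PROC FREQ is used for the counts here in place of the temporary-array bookkeeping above.

```sas
* sample rows from the question, with a period for the missing January value;
data have3;
  input id $ Status31Jan2007 Status28Feb2007 Status31Mar2007;
  datalines;
001 . 0 0
002 1 0 0
003 1 1 0
;
run;

data want3 (drop=i sentinel1 sentinel2);
  set have3;
  retain sentinel1 sentinel2 0;              * zero padding for the look-ahead past the last month;
  array st [*] Status: sentinel1 sentinel2;  * the three Status variables followed by the sentinels;
  array flag [3] $1;                         * creates flag1 to flag3;
  do i = 1 to dim(flag);
    * Y only when the month itself is 1 and the next two months are not both 1;
    if st[i] = 1 and sum(st[i+1], st[i+2]) < 2 then flag[i] = 'Y';
    else flag[i] = 'N';
  end;
run;

* one frequency table of N and Y per flag variable;
proc freq data=want3;
  tables flag1-flag3;
run;
```

With these three rows the flags come out as N N N, Y N N and Y Y N, matching the table in the question. Note that the zero sentinels mean a 1 in the last month is always flagged 'Y'.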


array processing with different indices and missing values in SAS

have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:
id p1_activities p2_activities p3_activities
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
(A period marks a missing value.)
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
I do this now by initializing all of the variable names in a length statement, then writing a ton of if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
I get an "array subscript out of range" error because the player and activity arrays have different lengths, but nesting separate do-loops for each array would mean writing many more lines of code than the wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?
Not sure why you are declaring flags for 4 players (the p4_ variables) when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer, but perhaps more robust ?
This code examines each item in the activities list as opposed to seeking presence of a specific items (1..4):
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if missing(activities[i]) then continue; %* skip;
do j = 1 by 1;
item = scan ( activities[i], j, ',' );
if missing(item) then leave; %* no more items in csv list;
item_num = input (item,?1.);
if missing(item_num) then continue; %* skip, csv item is not a number;
if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
put 'WARNING:' item_num 'is invalid for flagging';
continue; %* skip, csv item is missing, 0, negative or exceeds 4;
end;
flags (i, item_num) = 1;
end;
* backfill zeroes where flag not assigned;
do j = 1 to hbound(flags,2);
flags (i, j) = sum (0, flags (i, j)); %* sum() handles missing values;
end;
end;
run;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on?
The flag variables are reset to missing at the top of each data step iteration.
hbound(flags,2) returns 4, the upper limit of the second dimension.
findw(trim(activities[i]),cats(j),',') returns the position of j as a whole word in the csv string, or 0 if it is not found.
trim is needed to remove trailing spaces, which are not part of the findw word delimiter list.
cats converts the number j to its character representation.
You might also want to compress out spaces and other junk if the activity data values are not reliable.
The first > 0 turns the position into 0 (j not present) or 1 (j present).
The second > 0 is another logical evaluation that ensures the flag remains 0 or 1. Otherwise the flags would be a frequency count (imagine activity data 1,1,2,3).
flags(i,j) covers the 3 x 4 slots available for flagging.
Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step could be, but it is fairly easy to work with.
data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when an
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;
Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;

Create an ID to identify each loop in SAS

I have a log dataset in SAS like this, sorted by TimeStamp in ascending order:
TimeStamp Status
2015Dec01:1:00:00 1
2015Dec01:2:00:00 2
2015Dec01:3:00:00 3
2015Dec01:4:00:00 4
2015Dec01:5:00:00 1
2015Dec01:6:00:00 2
2015Dec01:7:00:00 2
2015Dec01:8:00:00 4
2015Dec01:9:00:00 5
2015Dec01:10:00:00 1
2015Dec01:11:00:00 3
2015Dec01:11:30:00 4
I want to create an ID to identify each loop, which always starts at status 1 and ends at status 4 (no matter what statuses come between 1 and 4), like this:
TimeStamp Status ID
2015Dec01:1:00:00 1 1
2015Dec01:2:00:00 2 1
2015Dec01:3:00:00 3 1
2015Dec01:4:00:00 4 1
2015Dec01:5:00:00 1 2
2015Dec01:6:00:00 2 2
2015Dec01:7:00:00 2 2
2015Dec01:8:00:00 4 2
2015Dec01:9:00:00 5 .
2015Dec01:10:00:00 1 3
2015Dec01:11:00:00 3 3
2015Dec01:11:30:00 4 3
Can anyone help me out? Thanks a lot.
Define your rules (assumed):
Increment ID when status=1
If status>4 then ID is missing
data tmp;
set have;
retain ID_TMP 0; *initialize ID;
if status=1 then ID_TMP + 1;
if status<5 then ID=ID_TMP;
DROP ID_TMP;
run;
We know that a new group has started if current status is < previous status. We'll save the previous status using the lag function so that we can compare the current status to the previous one.
Create a temporary variable called count to count when we increment the ID. Since we are using if-then logic, we want to initialize ID and count with a value of 1 using the retain statement.
There are three cases to account for:
If the current status is less than the previous status, increment count by 1 and set ID to the value of count.
If the current status is > 4, set ID to missing.
Any other time, ID stays the same.
data want;
set log;
retain count ID 1;
Prior_Status = lag(Status);
if(Status < Prior_Status) then do;
count+1;
ID = count;
end;
else if(Status > 4) then call missing(ID);
drop count Prior_Status;
run;
Here's a couple of ways of doing it.
Method 1:
data want;
set have;
if status = 1 then tmp + 1;
if status <= 4 then id = tmp;
else id = .;
run;
Method 2:
data want;
set have;
if status = 1 then tmp + 1;
id = choosen((status<=4)+1, ., tmp);
run;
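To try any of these against the data in the question, a self-contained sketch along the lines of Method 1 might look like this. It assumes TimeStamp can simply be read as a character string (the real log presumably stores a datetime), and it drops the helper counter tmp.

```sas
* sample log from the question, with TimeStamp read as plain text for simplicity;
data have;
  input TimeStamp :$20. Status;
  datalines;
2015Dec01:1:00:00 1
2015Dec01:2:00:00 2
2015Dec01:3:00:00 3
2015Dec01:4:00:00 4
2015Dec01:5:00:00 1
2015Dec01:6:00:00 2
2015Dec01:7:00:00 2
2015Dec01:8:00:00 4
2015Dec01:9:00:00 5
2015Dec01:10:00:00 1
2015Dec01:11:00:00 3
2015Dec01:11:30:00 4
;
run;

data want (drop=tmp);
  set have;
  if Status = 1 then tmp + 1;    * sum statement, so tmp is retained and starts at 0;
  if Status <= 4 then id = tmp;  * id is not retained, so it stays missing for other statuses;
run;

proc print data=want noobs;
run;
```

On this sample the id column comes out as 1 1 1 1 2 2 2 2 . 3 3 3, which matches the desired output. The approach relies on every loop starting with a status of 1.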

Count the number of times a value occurs

I have 7 variables, 489 observations with variable values of 0-4.
What I need is the count percentage of use.
Answers 0,1 stand for non usage, and answers 2,3,4 stand for usage.
I created 7 additional vars and turned all the values above to:
1=usage - 0=non-usage.
Now, I don't know how to count and present how many "1" I have for each var and divide it by 489.
data LAB7;
set LAB3;
array v{*} v21-v27;
array VU{7};
DO i=1 to dim(v);
if v[i] = 1|0 THEN VU[i]=0;
else VU[i]=1;
END;
run;
You can do this:
data usage;
set lab3 end=eof;
array v{*} v21-v27;
array n{7};
retain n: 0;
do i = 1 to dim(v);
if v[i] in (2, 3, 4) then n[i] + 1;
end;
if eof then do j = 1 to dim(v);
variable = vname(v[j]);
pct_usage = 100 * n[j] / _n_;
output;
end;
keep variable pct_usage;
run;
This creates an array of counters, one per variable, that are incremented by one whenever the corresponding variable is equal to 2, 3, or 4.
At the end of the data step, we output a record for each variable and record the percentage as the counter divided by the number of observations (_n_ when eof is true).
An alternative would be to use proc freq.
data indicators;
set lab3;
array v{*} v21-v27;
array ind{7};
do i = 1 to dim(v);
ind[i] = (v[i] in (2, 3, 4));
end;
run;
proc freq data = indicators;
tables ind: / out = usage;
run;
This creates binary indicator variables, one for each of the input variables, that are 1 when the input is 2, 3, or 4, and 0 otherwise. Counts and percentages are then obtained using proc freq.
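One thing to watch with the PROC FREQ version: when a TABLES statement requests several tables, the OUT= data set only holds the last one. A possible workaround, sketched below, relies on the fact that the mean of a 0/1 indicator is the proportion of ones. The lab3 rows here are made-up values purely so the example runs on its own.

```sas
* hypothetical sample standing in for lab3;
data lab3;
  input v21-v27;
  datalines;
0 1 2 3 4 0 1
2 2 0 1 4 3 0
4 0 1 2 3 1 2
1 3 2 0 0 4 4
;
run;

data indicators;
  set lab3;
  array v {*} v21-v27;
  array ind {7};                     * creates ind1 to ind7;
  do i = 1 to dim(v);
    ind[i] = (v[i] in (2, 3, 4));    * 1 for usage and 0 otherwise;
  end;
  keep ind1-ind7;
run;

* the mean of each indicator is the share of observations with usage;
proc means data=indicators mean maxdec=3;
  var ind1-ind7;
run;
```

Multiply the means by 100 for percentages. Missing answers are counted as non-usage in this sketch, which may or may not be what you want.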

Checking correct order of values

I have a data set that looks similar to the one below. Basically, I have current prices for three different sizes of an item type. If the sizes are priced correctly (ie small<medium<large) I want to flag them with a “Y” and continue to use the current price. If they are not priced correctly, I want to flag them with a “N” and use the recommended price. I know that this is probably a time to use array programming, but my array skills are, admittedly, a bit weak. There are hundreds of locations, but only one item type. I currently have the unique locations loaded in a macro variable list.
data have;
input type $ location $ size $ cur_price rec_price;
cards;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;
proc sql;
select distinct location into :loc_list separated by ' ' from have;
quit;
Any help would be greatly appreciated.
Thanks.
Not sure why you'd want to use an array here...proc transpose and some data step logic
can easily solve this problem. Arrays are very useful (gotta admit, I'm not entirely
comfortable with them either), but in a situation where you have that many locations,
I think transpose is better.
Does the code below accomplish your goal?
/*sorts to get ready for transpose*/
proc sort data=have;
by location;
run;
/*transpose current price*/
proc transpose data=have out=cur_tran prefix=cur_price;
by location;
id size;
var cur_price;
run;
/*transpose recommended price*/
proc transpose data=have out=rec_tran prefix=rec_price;
by location;
id size;
var rec_price;
run;
/*merge back together*/
data merged;
merge cur_tran rec_tran;
by location;
run;
/*creates flags and new field for final price*/
data want;
set merged;
if cur_priceS<cur_priceM<cur_priceL then
do;
FLAG='Y';
priceS=cur_priceS;
priceM=cur_priceM;
priceL=cur_priceL;
end;
else do;
FLAG='N';
priceS=rec_priceS;
priceM=rec_priceM;
priceL=rec_priceL;
end;
run;
I don't see how arrays would help here. How about just checking using dif to queue the last record's price and verify it (could also retain the last price if you prefer). Make sure the dataset's properly sorted by type location descending size, then:
data want;
set have;
by type location descending size; *S > M > L alphabetically;
retain price_check;
if not first.location and dif(cur_price) lt 0 then price_check=1;
*if dif < 0 then cur rec is smaller;
else if first.location then price_check=0; *reset it;
if last.location;
keep type location price_check;
run;
Then merge that back to your original dataset by type location, and use the recommended price if price_check=1.
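Putting the check and the merge together, a rough end-to-end sketch could look like the following. It uses the retained-previous-price variant mentioned above rather than dif, reads type as character, and treats ties as incorrectly priced (since the requirement is strictly small<medium<large). The names check, prev_price and price are made up for the illustration.

```sas
data have;
  input type $ location $ size $ cur_price rec_price;
  datalines;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;

proc sort data=have;
  by type location descending size;   * gives S then M then L within each location;
run;

data check (keep=type location price_check);
  set have;
  by type location descending size;
  retain price_check prev_price;
  if first.location then price_check = 0;                * reset for each location;
  else if cur_price <= prev_price then price_check = 1;  * not strictly increasing;
  prev_price = cur_price;
  if last.location;                                      * keep one row per location;
run;

data want;
  merge have check;
  by type location;
  if price_check = 1 then price = rec_price;  * badly ordered, so use the recommended price;
  else price = cur_price;
run;
```

On the sample data, NY keeps its current prices, while LA and DC switch to the recommended ones.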
Alternatively you could do it in a single query which is almost a re-statement of your requirements.
Proc sql;
create table want as
select *
/* Basically, I have current prices for three different sizes of an item type.
If the sizes are priced correctly (ie small<medium<large) */
, case
when max ( case when size eq 'S' then cur_price end)
lt max ( case when size eq 'M' then cur_price end)
and max ( case when size eq 'M' then cur_price end)
lt max ( case when size eq 'L' then cur_price end)
/* I want to flag them with a “Y” and continue to use the current price */
then 'Y'
/* If they are not priced correctly,
I want to flag them with a “N” and use the recommended price. */
else 'N'
end as Cur_Price_Sizes_Correct
, case
when calculated Cur_Price_Sizes_Correct eq 'Y'
then cur_price
else rec_price
end as Price
From have
Group by Type
, Location
;
Quit;

checking rows of row_number result on condition of another column

I have a query that outputs some rows that I need to "filter".
The data I want to filter is like this:
rownum value
1 0
2 0
3 1
4 1
I need the first 2 rows, but only when they have "0" in the value-column.
The structure of the query is like this:
select count(x)
from
(
select row_number() over (partition by X order by y) as rownum, bla, bla
from [bla bla]
group by [bla bla]
) as temp
where
/* now this is where i want the magic to happen */
temp.rownum = 1 AND temp.value = 0
AND
temp.rownum = 2 AND temp.value = 0
So I want x only when row 1 and 2 have "0" in the value-column.
If either row number 1 or 2 has a "1" in the value-column, I don't want them.
I basically wrote the where-clause the way I wrote it here, but it's returning data sets that have "1" as value in either row 1 or 2.
How to fix this?
A value in a single row for a given column can never be both 1 and 2 at the same time.
So, first of all, either use OR (which is what I believe you intended):
WHERE (temp.rownum = 1 AND temp.value = 0)
OR (temp.rownum = 2 AND temp.value = 0)
Or simplify the query altogether:
WHERE temp.rownum <= 2 AND temp.value = 0
From here, you will get just the first two rows, and only if value = 0. If you only want rows when both rows are returned (i.e. both rows have value = 0), add
HAVING COUNT(1) = 2
to the query.
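Here is a small sketch of that idea written as PROC SQL so it can be run as-is. The table temp and its rows are made up, and the GROUP BY on x is an assumption (x being the partition column from the question), so that each partition is checked separately.

```sas
* made-up rows, with x standing in for the partition column;
data temp;
  input x $ rownum value;
  datalines;
a 1 0
a 2 0
a 3 1
b 1 0
b 2 1
b 3 1
;
run;

proc sql;
  select x
  from temp
  where rownum <= 2 and value = 0   /* first two rows only, and only zeros */
  group by x
  having count(*) = 2;              /* both of the first two rows survived the filter */
quit;
```

Only group a is returned here, because group b has a 1 in its second row.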
