Create an ID to identify each loop in SAS

I have a log dataset in SAS like this, sorted by TimeStamp in ascending order:
TimeStamp Status
2015Dec01:1:00:00 1
2015Dec01:2:00:00 2
2015Dec01:3:00:00 3
2015Dec01:4:00:00 4
2015Dec01:5:00:00 1
2015Dec01:6:00:00 2
2015Dec01:7:00:00 2
2015Dec01:8:00:00 4
2015Dec01:9:00:00 5
2015Dec01:10:00:00 1
2015Dec01:11:00:00 3
2015Dec01:11:30:00 4
I want to create an ID that identifies each loop. A loop always starts at status 1 and ends at status 4, whatever statuses occur in between, like this:
TimeStamp Status ID
2015Dec01:1:00:00 1 1
2015Dec01:2:00:00 2 1
2015Dec01:3:00:00 3 1
2015Dec01:4:00:00 4 1
2015Dec01:5:00:00 1 2
2015Dec01:6:00:00 2 2
2015Dec01:7:00:00 2 2
2015Dec01:8:00:00 4 2
2015Dec01:9:00:00 5 .
2015Dec01:10:00:00 1 3
2015Dec01:11:00:00 3 3
2015Dec01:11:30:00 4 3
Can anyone help me out? Thanks a lot.

Define your rules (assumed):
Increment ID when status=1
If status>=5 then ID is missing
data tmp;
set have;
retain ID_TMP 0; *initialize ID;
if status=1 then ID_TMP + 1;
if status<5 then ID=ID_TMP;
DROP ID_TMP;
run;

We know that a new group has started if current status is < previous status. We'll save the previous status using the lag function so that we can compare the current status to the previous one.
Create a temporary variable called count to count when we increment the ID. Since we are using if-then logic, we want to initialize ID and count with a value of 1 using the retain statement.
There are three cases to account for:
If the current status is less than the previous status, then increment count by 1 and set ID to be the value of count;
If the current status is > 4, set ID to missing.
Any other time, ID will stay the same.
data want;
set log;
retain count ID 1;
Prior_Status = lag(Status);
if(Status < Prior_Status) then do;
count+1;
ID = count;
end;
else if(Status > 4) then call missing(ID);
drop count Prior_Status;
run;

Here are a couple of ways of doing it.
Method 1:
data want;
set have;
if status = 1 then tmp + 1;
if status <= 4 then id = tmp;
else id = .;
run;
Method 2:
data want;
set have;
if status = 1 then tmp + 1;
id = choosen((status<=4)+1, ., tmp);
run;
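The answers above start a new ID at every status 1 and blank the ID only when the status exceeds 4. If the log could contain a status of 4 or less after a loop has ended but before the next 1, a variant that tracks whether a loop is currently open may be closer to the stated rule. This is only a sketch under that assumption; in_loop and loop_no are hypothetical helper names, and TimeStamp is read as plain text for simplicity.

```sas
/* Sample data from the question; TimeStamp kept as character */
data have;
   input TimeStamp :$20. Status;
   datalines;
2015Dec01:1:00:00 1
2015Dec01:2:00:00 2
2015Dec01:3:00:00 3
2015Dec01:4:00:00 4
2015Dec01:5:00:00 1
2015Dec01:6:00:00 2
2015Dec01:7:00:00 2
2015Dec01:8:00:00 4
2015Dec01:9:00:00 5
2015Dec01:10:00:00 1
2015Dec01:11:00:00 3
2015Dec01:11:30:00 4
;
run;

data want;
   set have;
   retain in_loop 0;                 /* 1 while a loop is open (hypothetical helper) */
   if Status = 1 then do;
      loop_no + 1;                   /* sum statement: retained, starts at 0 */
      in_loop = 1;
   end;
   if in_loop then ID = loop_no;     /* ID stays missing outside a loop */
   if Status = 4 then in_loop = 0;   /* close the loop after tagging the 4 */
   drop in_loop loop_no;
run;

proc print data=want noobs;
run;
```

On the sample data this assigns ID 1 to the first four rows, 2 to the next four, missing to the status 5 row, and 3 to the last three, which matches the table in the question.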

Related

array processing with different indices and missing values in SAS

have is a SAS data set with 4 variables: an id and three variables storing the activities a respondent shares with each of 3 members of a team they're on. There are 4 different activity types, identified by the numbers that populate the p:_activities variables for each player (p1 to p3). Below are the first 5 obs:
id  p1_activities  p2_activities  p3_activities
A   1,2,3,4        1,3
B   1,3            1,2,3          1,2,3
C                  1,2,3          1,2,3
D                  1,2,3
E   1,2,3                         1
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
I do this now by initializing all of the variable names in a length statement, then writing a ton of if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
I get an "array subscript out of range" error because the player and activity arrays have different lengths, but nesting them in separate do-loops would mean writing many more lines of code than a wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?
Not sure why you are processing 4 activities for flags when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer lines, but perhaps more robust?
This code examines each item in the activities list, as opposed to seeking the presence of specific items (1..4):
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if missing(activities[i]) then continue; %* skip;
do j = 1 by 1;
item = scan ( activities[i], j, ',' );
if missing(item) then leave; %* no more items in csv list;
item_num = input (item,?1.);
if missing(item_num) then continue; %* skip, csv item is not a number;
if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
put 'WARNING:' item_num 'is invalid for flagging';
continue; %* skip, csv item is missing, 0, negative or exceeds 4;
end;
flags (i, item_num) = 1;
end;
* backfill zeroes where flag not assigned;
do j = 1 to hbound(flags,2);
flags (i, j) = sum (0, flags (i, j)); %* sum() handles missing values;
end;
end;
run;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on?
flags variables are reset to missing at the top of each step iteration
hbound(flags,2) returns 4, the upper limit of the second dimension
findw(trim(activities[i]),cats(j),',') returns the position of j in the csv string, or 0 when j is not present
trim is needed to remove trailing spaces, which are not part of the findw word delimiter list
cats converts the number j to its character representation
you might also want to compress out spaces and other junk if the activity data values are not reliable
the first > 0 turns the position into 0 (j not present) or 1 (j present)
the second > 0 is another logical evaluation that ensures the flag stays 0 or 1; otherwise flags would be a frequency count (imagine activity data 1,1,2,3)
flags(i,j) covers the 3 x 4 slots available for flagging
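For readers who cannot rename the columns, the same two-dimensional idea can be sketched with the variable names from the question. This is an illustration rather than the answer's code: the flag names are written out by hand because they do not end in a plain numeric suffix, the input step assumes a period marks a missing list, and the test is simplified because each flag is assigned exactly once.

```sas
data have;
   length id $1 p1_activities p2_activities p3_activities $20;
   input id p1_activities p2_activities p3_activities;
   datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
run;

data want;
   set have;
   array acts {3} p1_activities p2_activities p3_activities;
   /* elements fill row by row: player is the first subscript, activity the second */
   array flg {3,4}
      p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
      p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
      p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
   do i = 1 to dim(acts);
      /* a missing list leaves that player's four flags missing */
      if not missing(acts{i}) then do j = 1 to dim2(flg);
         flg{i,j} = (findw(trim(acts{i}), cats(j), ',') > 0);
      end;
   end;
   drop i j;
run;
```

Using dim and dim2 for the loop bounds keeps both subscripts inside the array, which avoids the out-of-range error described in the question.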
Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step could be, but it is fairly easy to work with.
data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when an
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;
Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;

SAS -- By-Group Processing Retain with First and Last Dot

I always find myself having to quickly refresh by-group processing when I'm using first dot and last dot variables with data, but today I saw something interesting.
Here's a sample dataset:
data DS1;
input ID1 ID2;
datalines;
1 100
1 200
1 300
2 400
3 500
3 500
4 600
;
run;
I generally use retain with by-group processing and either first or last dot variables to manipulate my data like so:
data ByGroup1;
set DS1;
by ID1 ID2;
retain Count;
if first.ID1 then Count = 0;
Count + 1;
run;
But I was reading a post on SAS.com where an individual used the following method (without a retain statement).
data ByGroup2;
set DS1;
by ID1 ID2;
if first.ID1 then Count = 0;
Count + 1;
run;
Both methods return the same dataset:
ID1 ID2 Count
1 100 1
1 200 2
1 300 3
2 400 1
3 500 1
3 500 2
4 600 1
Are variables implicitly retained in the PDV when doing by group processing? Or, is it that the program never returns to the top of the data-step until it reaches the end of the by-group?
I think I'm confused on the mechanics of how this variable is iterating without a retain since I'm so used to explicitly using retain when it's required.
When you use a sum statement the variable is automatically retained. And unless you have defined a different initial value it will be initialized to zero.
The syntax for a sum statement is:
variable + expression ;
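To make the implicit retain visible, the sketch below spells out what the sum statement does and contrasts it with a plain assignment. The data set names ByGroup3 and NoRetain are made up for the illustration, and DS1 is a shortened copy of the sample data.

```sas
data DS1;
   input ID1 ID2;
   datalines;
1 100
1 200
1 300
2 400
;
run;

/* Explicit form of the sum statement "Count + 1;" */
data ByGroup3;
   set DS1;
   by ID1;
   retain Count 0;                 /* the sum statement implies this retain */
   if first.ID1 then Count = 0;
   Count = sum(Count, 1);          /* and this missing-safe addition */
run;

/* Plain assignment, no retain: Count is reset to missing on every iteration */
data NoRetain;
   set DS1;
   by ID1;
   if first.ID1 then Count = 0;
   Count = Count + 1;              /* missing + 1 is missing after the first row of a group */
run;
```

ByGroup3 should reproduce the counts from the two steps in the question, while NoRetain gives 1 on the first row of each group and missing on the rest, along with a note in the log about missing values.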

SAS - Invalid numeric data while searching through an Array

I am trying to create an array of strings and want to insert a value in it, if it does not exist already in the array.
I read somewhere that we can use 'IN' operator with Array. So, coded it as follows:
DATA WANT;
SET HAVE;
BY ID;
ARRAY R_PROS_SCRN_ID {2} $4. R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
RETAIN R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
IF NOT PROS_SCRN_ID IN R_PROS_SCRN_ID THEN DO;
DO I=1 to 2 ;
IF MISSING( R_PROS_SCRN_ID{i}) THEN DO;
R_PROS_SCRN_ID{i} = PROS_SCRN_ID;
LEAVE;
END;
END;
END;
IF LAST.ID THEN OUTPUT;
RUN;
In Array R_PROS_SCRN_ID, I want only the unique values from field PROS_SCRN_ID.
It is throwing error:
NOTE: Invalid numeric data, PROS_SCRN_ID='MED' , at line 17352 column 201.
I think it is because I did not initialize the array before comparing, and hence it is treated as a numeric array. But I have specified the format as $4. Why is it throwing an error?
Also, I am not sure this is the best way to get unique values in an array. Is there a better way to implement this?
Your code appears to be collecting unique values by group, pivoting from a tall data structure to a wide data structure.
One of the clearest DATA step ways is to use what is known as a DOW loop, in which SET is within the loop. This sample code presumes no more than 10 unique satellite values per group. (The by variables can be thought of as key variables, and all other variables would be satellites.)
data have;
input user_id screen_id ;
datalines;
1 1
1 2
1 1
1 1
1 1
1 3
2 1
2 1
2 1
3 0
4 1
4 2
4 3
5 11
5 11
5 11
5 5
5 1
5 5
5 6
5 1
run;
data want;
_index = 0;
do until (last.user_id);
set have;
by user_id;
array ids screen_id1-screen_id10;
if screen_id not in ids then do;
_index + 1;
ids(_index) = screen_id;
end;
end;
drop _index screen_id;
run;
One of the clearest procedural ways is to select the unique values and transpose them.
proc sql;
create view uniqueScreenByUser as
select distinct user_id, screen_id
from have
order by user_id
;
proc transpose data=uniqueScreenByUser prefix=screen_id out=wantWide(drop=_name_);
by user_id;
var screen_id;
run;
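The DOW loop above uses numeric screen ids, while the question has a character PROS_SCRN_ID. A sketch of the same loop with a character array follows; the sample rows are invented. As far as I can tell, the "Invalid numeric data" note in the question comes from operator precedence: NOT is applied before IN, so NOT PROS_SCRN_ID IN ... first tries to treat PROS_SCRN_ID as a number. Writing NOT (x IN array) or x NOT IN array avoids that.

```sas
/* Invented sample rows for illustration */
data have_char;
   input ID PROS_SCRN_ID $;
   datalines;
1 MED
1 DEN
1 MED
2 VIS
2 VIS
;
run;

data want_char;
   array r {2} $4 R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
   _index = 0;
   do until (last.ID);
      set have_char;
      by ID;
      /* parentheses make NOT apply to the IN test, not to the variable */
      if not (PROS_SCRN_ID in r) and _index < dim(r) then do;
         _index + 1;
         r{_index} = PROS_SCRN_ID;
      end;
   end;
   keep ID R_PROS_SCRN_ID_1 R_PROS_SCRN_ID_2;
run;
```

The _index < dim(r) guard silently drops a third distinct value instead of raising a subscript error; whether that is acceptable depends on the data.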

Setting conditions in arrays in SAS

I have a set of data that looks like this:
ID Status31Jan2007 Status28Feb2007 Status31Mar2007
001 . 0 0
002 1 0 0
003 1 1 0
I have Statusddmmyyyy fields of either '0' or '1' for 118 months. (here, I only have three months as a sample)
I want to get results like this:
ID Flag1 Flag2 Flag3
001 N N N
002 Y N N
003 Y Y N
The logic is: if the status for a month (say Status31Jan2007) is 1 and at least one of the following two months has status 0, flag it as 'Y'. Otherwise flag it as 'N'.
Meaning,
For ID 001, Status31Jan2007 is missing, so I flag it as 'N' under Flag1.
Moving on to the next month, Status28Feb2007, the value is 0, so I automatically flag it as 'N' under Flag2. The same applies to the next month.
For ID 002, Status31Jan2007 is 1, and the following two months both have the value 0. The count of 0 values is > 0, so I flag it as 'Y' under Flag1.
But Status28Feb2007 is 0. It doesn't fit the criteria, so I flag it as 'N' under Flag2.
The status for the month itself must be 1; only then do I look at the following two months.
After getting the results, how do I count the number of N and Y flags under each field?
Count1 Count2 Count3
N 1 2 3
Y 2 1 0
Would appreciate the help as I am new to SAS. Thanks.
This will only work if the column names across are in calendar order.
Use an ARRAY statement to organize and then access variables by index and thus easily process the [index+1] and [index+2] checks your logic indicates. You can also use temporary arrays to maintain a count as you assign the flag values; at the last row the counts are output to a separate table.
Note: for status variables taking on either 0 or 1 the count of 1's can be computed using SUM. The sum of two status variables will be < 2 when either of them is 0.
* simulate some data;
data prelim;
do id = 1 to 20;
do date = '01jan07'd by 1 until(intck('month', '01jan07'd, date) >= 117);
date = intnx('month', date, 1) - 1;
status = ranuni(123) < 0.45;
if date = '31jan07'd and mod(id,5) = 1 then status = .;
output;
end;
end;
format date date9.;
run;
* change the shape of simulated data to match the question;
proc transpose data=prelim prefix=Status out=have(drop=_name_);
by id;
var status;
id date;
run;
* process the problem shaped data;
data
want (keep=id status: flag:)
want_count (keep=flag_value count:);
;
set have end=lastid;
retain sentinel1 sentinel2 0;
array status status: sentinel1 sentinel2; * map all the Status* variables to an array named status;
array flag [118] $1 ; * automatically creates 118 new variables flag1 to flag118;
array yfreq [118] _temporary_ (118*0); * temporary arrays initialized to 0;
array nfreq [118] _temporary_ (118*0);
* process each month status, -2 because of the sentinels ;
do i = 1 to dim(status)-2;
* assign flag according to the logic, some cases require a 2-month look ahead;
select;
when ( status(i) = . ) flag(i) = 'N';
when ( status(i) = 0 ) flag(i) = 'N';
when ( status(i) = 1
and sum(status(i+1),status(i+2)) < 2 ) flag(i) = 'Y'; * SUM trick;
otherwise
flag(i) = 'N';
end;
* track frequencies of flags assigned;
if flag(i) = 'N'
then nfreq(i)+1;
else yfreq(i)+1;
end;
output want;
if lastid then do;
* all flags for all ids have been binned for frequency;
* output the freqs to a count data set;
length flag_value $1;
array freq count1-count118;
flag_value = 'N'; do i = 1 to dim(nfreq); freq(i) = nfreq(i); end; output want_count;
flag_value = 'Y'; do i = 1 to dim(yfreq); freq(i) = yfreq(i); end; output want_count;
end;
run;
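The look-ahead logic can be checked on the three-month sample from the question with a much smaller step. This sketch keeps the two sentinel variables and the SUM trick from the answer but leaves out the frequency counts; have3 and want3 are hypothetical names, and the header typo in the question is assumed to mean Status28Feb2007.

```sas
data have3;
   input ID $ Status31Jan2007 Status28Feb2007 Status31Mar2007;
   datalines;
001 . 0 0
002 1 0 0
003 1 1 0
;
run;

data want3;
   set have3;
   retain sentinel1 sentinel2 0;          /* padding so i+1 and i+2 stay in bounds */
   array s {*} Status: sentinel1 sentinel2;
   array flag {3} $1;                     /* creates flag1-flag3 */
   do i = 1 to dim(s) - 2;
      /* Y only when this month is 1 and the next two are not both 1 */
      if s{i} = 1 and sum(s{i+1}, s{i+2}) < 2 then flag{i} = 'Y';
      else flag{i} = 'N';
   end;
   drop i sentinel1 sentinel2;
run;
```

Tracing by hand, this gives N N N for 001, Y N N for 002 and Y Y N for 003, in line with the expected flags. Note that the zero sentinels make the final months behave as if they were followed by 0 values, which is a design choice rather than something the question specifies.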

checking rows of row_number result on condition of another column

I have a query that outputs some rows that I need to "filter".
The data I want to filter is like this:
rownum value
1 0
2 0
3 1
4 1
I need the first 2 rows, but only when they have "0" in the value-column.
The structure of the query is like this:
select count(x)
from
(
select row_number() over (partition by X order by y) as rownum, bla, bla
from [bla bla]
group by [bla bla]
) as temp
where
/* now this is where i want the magic to happen */
temp.rownum = 1 AND temp.value = 0
AND
temp.rownum = 2 AND temp.value = 0
So I want x only when rows 1 and 2 both have "0" in the value-column.
If either row 1 or row 2 has a "1" in the value-column, I don't want them.
I basically wrote the where-clause the way I wrote it here, but it's returning data sets that have "1" as the value in either row 1 or 2.
How to fix this?
A value in a single row for a given column can never be both 1 and 2 at the same time.
So, first of all, either use OR (which is what I believe you intended):
WHERE (temp.rownum = 1 AND temp.value = 0)
OR (temp.rownum = 2 AND temp.value = 0)
Or simplify the query altogether:
WHERE temp.rownum <= 2 AND temp.value = 0
From here, you will get just the first two rows, and only if value = 0. If you only want rows when both rows are returned (i.e. both rows have value = 0), add
HAVING COUNT(1) = 2
to the query.
