Count the number of times a value occurs - arrays

I have 7 variables and 489 observations, with values from 0 to 4.
What I need is the percentage of usage for each variable.
Answers 0 and 1 stand for non-usage; answers 2, 3, and 4 stand for usage.
I created 7 additional variables and recoded the values above as:
1 = usage, 0 = non-usage.
Now I don't know how to count how many 1s I have for each variable and divide that count by 489.
data LAB7;
set LAB3;
array v{*} v21-v27;
array VU{7};
DO i=1 to dim(v);
if v[i] in (0, 1) THEN VU[i]=0; /* "v[i] = 1|0" parses as (v[i]=1) | 0, so it only matched 1 */
else VU[i]=1;
END;
run;

You can do this:
data usage;
set lab3 end=eof;
array v{*} v21-v27;
array n{7};
retain n: 0;
do i = 1 to dim(v);
if v[i] in (2, 3, 4) then n[i] + 1;
end;
if eof then do j = 1 to dim(v);
variable = vname(v[j]);
pct_usage = 100 * n[j] / _n_;
output;
end;
keep variable pct_usage;
run;
This creates an array of counters, one per variable, that are incremented by one whenever the corresponding variable is equal to 2, 3, or 4.
At the end of the data step, we output a record for each variable and record the percentage as the counter divided by the number of observations (_n_ when eof is true).
An alternative would be to use proc freq.
data indicators;
set lab3;
array v{*} v21-v27;
array ind{7};
do i = 1 to dim(v);
ind[i] = (v[i] in (2, 3, 4));
end;
run;
proc freq data = indicators;
tables ind: / out = usage;
run;
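This creates binary indicator variables, one for each of the input variables, that are 1 when the input is 2, 3, or 4, and 0 otherwise. Counts and percentages are then obtained using proc freq. Note that when a tables statement requests several tables, the out= data set holds only the last one, so read the percentages from the printed output or use one tables statement per variable.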
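Because each indicator is coded 0/1, its mean is the proportion of usage, so proc means is another way to get all seven figures at once. A minimal sketch, assuming a small made-up data set (lab3_demo, indicators_demo, and their values are hypothetical stand-ins for LAB3, not the asker's data):

```sas
/* Hypothetical stand-in for LAB3: 4 observations, values 0-4 */
data lab3_demo;
    input v21-v27;
    datalines;
0 1 2 3 4 0 1
2 2 0 1 4 3 0
4 0 1 2 2 1 3
1 3 3 0 0 2 2
;
run;

/* Same 0/1 indicators as in the proc freq approach */
data indicators_demo;
    set lab3_demo;
    array v{*} v21-v27;
    array ind{7};
    do i = 1 to dim(v);
        ind[i] = (v[i] in (2, 3, 4));
    end;
    drop i;
run;

/* The mean of a 0/1 variable is the share of 1s */
proc means data=indicators_demo mean maxdec=3;
    var ind1-ind7;
run;
```

Multiply the means by 100 to express them as percentages. A missing input value counts as non-usage here, because the in comparison evaluates to 0 for it.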

Related

array processing with different indices and missing values in SAS

have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:
id  p1_activities  p2_activities  p3_activities
A   1,2,3,4        1,3
B   1,3            1,2,3          1,2,3
C                  1,2,3          1,2,3
D                  1,2,3
E   1,2,3                         1
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
I do this now by initializing all of the variable names in a length statement, then writing a ton of if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
I get an "array subscript out of range" error because the player and activity arrays have different lengths, but nesting separate do-loops for each would mean writing many more lines of code than the wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?
Not sure why you declare flags for 4 players (p1 to p4) when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer, but perhaps more robust ?
This code examines each item in the activities list, as opposed to seeking the presence of specific items (1..4):
data want;
  set have;
  array activities
    activities_p1-activities_p3
  ;
  array flags(3,4)
    flag_p1_1-flag_p1_4
    flag_p2_1-flag_p2_4
    flag_p3_1-flag_p3_4
  ;
  do i = 1 to dim(activities);
    if missing(activities[i]) then continue; /* skip: flags for this player stay missing */
    do j = 1 by 1;
      item = scan(activities[i], j, ',');
      if missing(item) then leave; /* no more items in csv list */
      item_num = input(item, ?? 1.); /* ?? suppresses invalid-data messages */
      if missing(item_num) then continue; /* skip, csv item is not a number */
      if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
        put 'WARNING:' item_num 'is invalid for flagging';
        continue; /* skip, csv item is 0 or exceeds 4 */
      end;
      flags(i, item_num) = 1;
    end;
    * backfill zeroes where flag not assigned;
    do j = 1 to hbound(flags,2);
      flags(i, j) = sum(0, flags(i, j)); /* sum() handles missing values */
    end;
  end;
run;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on?
The flag variables are reset to missing at the top of each step iteration.
hbound(flags,2) returns 4, the upper limit of the second dimension.
findw(trim(activities[i]),cats(j),',') returns the position of j in the csv string, or 0 when j is not present.
trim is needed to remove trailing spaces, which are not part of the findw word delimiter list.
cats converts the number j to its character representation.
You might also want to compress out spaces and other junk if the activity data values are not reliable.
The first > 0 turns the position into 0 (j not present) or 1 (j present).
The second > 0 is another logical evaluation that ensures the j-present flag remains 0 or 1. Otherwise flags would be a frequency count (imagine activity data 1,1,2,3).
flags(i,j) covers the 3 x 4 slots available for flagging.
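The findw step can be tried in isolation. This is a minimal sketch with a made-up list value (the variable names list, pos, and flag are illustrative only); it writes the position and the resulting 0/1 flag for each candidate activity to the log:

```sas
data _null_;
    length list $20;          /* padded with trailing blanks, like a data set variable */
    list = '1,3';
    do j = 1 to 4;
        /* position of j as a comma-delimited word, 0 when absent */
        pos  = findw(trim(list), cats(j), ',');
        flag = (pos > 0);     /* collapse the position to 0/1 */
        put j= pos= flag=;
    end;
run;
```

With this list, flag is 1 for j = 1 and j = 3 and 0 for j = 2 and j = 4.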
Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step, but it is fairly easy to work with.
data have;
infile datalines dlm='|' truncover; /* truncover guards against short records reading from the next line */
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when
an id, player, and activity matches the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;
Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;

Setting conditions in arrays in SAS

I have a set of data that looks like this:
ID   Status31Jan2007  Status28Feb2007  Status31Mar2007
001  .                0                0
002  1                0                0
003  1                1                0
I have Statusddmmyyyy fields of either '0' or '1' for 118 months. (here, I only have three months as a sample)
I want to get results like this:
ID Flag1 Flag2 Flag3
001 N N N
002 Y N N
003 Y Y N
The logic is: if the status as at Status31Jan2007 is 1 and, in the following two months, the count of Status fields equal to 0 is greater than 0, then flag it as 'Y'. Otherwise, flag it as 'N'.
Meaning:
For ID 001, the value as at Status31Jan2007 is missing, so I flag it as 'N' under Flag1.
Moving on to the next month, Status28Feb2007, the value is 0, so I automatically flag it as 'N' under Flag2. The same applies to the next month.
Looking at ID 002, Status31Jan2007 is 1, and the following two months hold two 0 values. The count of 0 values is > 0, so I flag it as 'Y' under Flag1.
But as at Status28Feb2007 the value is 0. It doesn't fit the criteria, so I flag it as 'N' under Flag2.
Only when the status in the current field is 1 do I proceed to look at the following two months.
After getting the results, how do I count the number of N and Y flags under each field?
   Count1  Count2  Count3
N  1       2       3
Y  2       1       0
Would appreciate the help as I am new to SAS. Thanks.
This will only work if the column names across are in calendar order.
Use an ARRAY statement to organize and then access variables by index and thus easily process the [index+1] and [index+2] checks your logic indicates. You can also use temporary arrays to maintain a count as you assign the flag values; at the last row the counts are output to a separate table.
Note: for status variables taking on either 0 or 1 the count of 1's can be computed using SUM. The sum of two status variables will be < 2 when either of them is 0.
* simulate some data;
data prelim;
do id = 1 to 20;
do date = '01jan07'd by 1 until(intck('month', '01jan07'd, date) >= 117);
date = intnx('month', date, 1) - 1;
status = ranuni(123) < 0.45;
if date = '31jan07'd and mod(id,5) = 1 then status = .;
output;
end;
end;
format date date9.;
run;
* change the shape of simulated data to match the question;
proc transpose data=prelim prefix=Status out=have(drop=_name_);
by id;
var status;
id date;
run;
* process the problem shaped data;
data
want (keep=id status: flag1-flag118) /* flag: would also keep flag_value */
want_count (keep=flag_value count:)
;
set have end=lastid;
retain sentinel1 sentinel2 0;
array status status: sentinel1 sentinel2; * map all the Status* variables to an array named status;
array flag [118] $1 ; * automatically creates 118 new variables flag1 to flag118;
array yfreq [118] _temporary_ (118*0); * temporary arrays initialized to 0;
array nfreq [118] _temporary_ (118*0);
* process each month status, -2 because of the sentinels ;
do i = 1 to dim(status)-2;
* assign flag according to the logic, some cases require a 2-month look ahead;
select;
when ( status(i) = . ) flag(i) = 'N';
when ( status(i) = 0 ) flag(i) = 'N';
when ( status(i) = 1
and sum(status(i+1),status(i+2)) < 2 ) flag(i) = 'Y'; * SUM trick;
otherwise
flag(i) = 'N';
end;
* track frequencies of flags assigned;
if flag(i) = 'N'
then nfreq(i)+1;
else yfreq(i)+1;
end;
output want;
if lastid then do;
* all flags for all ids have been binned for frequency;
* output the freqs to a count data set;
length flag_value $1;
array freq count1-count118;
flag_value = 'N'; do i = 1 to dim(nfreq); freq(i) = nfreq(i); end; output want_count;
flag_value = 'Y'; do i = 1 to dim(yfreq); freq(i) = yfreq(i); end; output want_count;
end;
run;
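The SUM trick used in the select block can be checked on its own. The sketch below (variable names are illustrative) enumerates every combination of missing, 0, and 1 for two look-ahead values and shows when the sum falls below 2:

```sas
data _null_;
    do s1 = ., 0, 1;
        do s2 = ., 0, 1;
            total = sum(s1, s2);            /* sum() ignores missing arguments */
            below_two = (total < 2);        /* 1 unless both statuses are 1 */
            put s1= s2= total= below_two=;
        end;
    end;
run;
```

Only the pair s1=1, s2=1 yields below_two=0. A missing look-ahead value is treated like a 0 by this test, which may or may not be what the flagging rule intends.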

Assign an array's value as a dimension to another array in SAS

I've been working on a complicated code and am stuck in the end, where I need to assign one array's value as a dimension parameter to another array in the code. A snapshot from my code :
For example:
array temp_match_fl(3) temp_match_fl1 - temp_match_fl3;
ARRAY buracc_repay(3) buracc_repay1 - buracc_repay3;
ARRAY ocs_repay(3) ocs_repay1 - ocs_repay3;
jj = 0;
do until (jj>=3);
jj=jj+1;
If length(strip(match_flag(jj))) = 1 then do;
temp_match_fl(jj) = match_flag(jj);
end;
Else If length(strip(match_flag(jj))) > 1 then do;
j1 = 0;
min_diff = 99999999;
do until (j1>=length(strip(match_class(jj))));
j1=j1+1;
retain min_diff;
n=substr(strip(match_flag(jj)),j1,1);
If (min_diff > abs(buracc_repay(jj)-ocs_repay(n))) then do;
min_diff = abs(buracc_repay(jj)-ocs_repay(n));
temp_match_fl(jj) = n;
end;
end;
end;
kk=temp_match_fl(jj);
/* buracc_repay(jj) = ocs_repay(kk);*/
buracc_repay(jj) = ocs_repay(temp_match_fl(jj));
end;
run;
Now, I need to be able to use the value stored in temp_match_fl(jj) as the subscript of another array. How can I achieve that? Neither of the last two statements works:
buracc_repay(jj) = ocs_repay(kk);
buracc_repay(jj) = ocs_repay(temp_match_fl(jj));
Can someone please suggest a fix?
Thanks!
Actually your last two statements as written do work. Are you getting an error, or unexpected results? Can you make a simple example like below that shows the problem?
Note that for this to work, it's essential that the value of temp_match_fl(jj) is 1, 2, or 3, because your OCS_REPAY array has three elements. From the code you've shown, it's not clear if that is always true. You don't show the match_flag array.
data want ;
array temp_match_fl(3) temp_match_fl1 - temp_match_fl3 (1 2 3) ;
array buracc_repay(3) buracc_repay1 - buracc_repay3 (10 20 30) ;
array ocs_repay(3) ocs_repay1 - ocs_repay3 (100 200 300) ;
jj=1 ;
kk=2 ;
*buracc_repay(jj) = ocs_repay(kk); *this works ;
put temp_match_fl(jj)= ; *debug to confirm value is 1 2 or 3 ;
buracc_repay(jj) = ocs_repay(temp_match_fl(jj)); *this also works;
put (buracc_repay:)(=) temp_match_fl1=; *check output ;
run ;
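If the subscript values cannot be trusted to be 1, 2, or 3, a range check before the lookup avoids the out-of-range error. A minimal sketch, using hypothetical arrays idx, src, and dst rather than the asker's variables:

```sas
data _null_;
    array idx(3) idx1-idx3 (2 5 .);          /* hypothetical subscripts: one valid, two invalid */
    array src(3) src1-src3 (100 200 300);
    array dst(3) dst1-dst3;
    do jj = 1 to dim(idx);
        k = idx(jj);
        /* only index src when k lies inside its declared bounds */
        if lbound(src) <= k <= hbound(src) then dst(jj) = src(k);
        else put 'Skipping invalid subscript: ' jj= k=;
    end;
    put dst1= dst2= dst3=;
run;
```

Here only dst1 is assigned (from src2); the out-of-range value 5 and the missing value are reported and skipped.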

Recursion in Array SAS

I have an existing collection of variables a_0,...,a_45 where a_i represents the amount of stuff I have on day i. I'd like to create a new collection of variables b_0,...,b_45 to represent the incremental change in stuff I have on day i (i.e. b_k=a_k-a_(k-1) ). My approach:
data test;
set dataset;
array a a_0-a_45;
array b b_0-b_45;
b(1)=a(1);
do i=2 to 45;
b(i)=a(i)-a(i-1);
end;
run;
However my b variables just come out missing.
What initial values do you have for a_1 to a_45 before you start the loop? As you are not initialising them (except for a_0 ≡ a(1)), every b(i) term will be a difference of 2 a terms, of which at least one will be missing, unless these variables are populated in your input dataset.
Here is some sample code showing that the delta computation is correct when the variable names in the data set align with the variables named in the array statement in the data step.
Sample data
data have(keep=product_id note a_:);
do product_id = 1 to 100;
length note $15;
array amount a_0-a_45;
call missing(of amount(*));
if (ranuni(123) < 0.5) then do;
note = 'static deltas';
static_delta = ceil(5 * ranuni(123));
amount(1) = static_delta;
do inventory_day = 2 to dim(amount);
amount(inventory_day) = amount(inventory_day-1) + static_delta;
end;
end;
else do;
note = 'random deltas';
amount(1) = ceil(5 * ranuni(123));
do inventory_day = 2 to dim(amount);
amount(inventory_day) = max ( 0, amount(inventory_day-1) + floor(10 * ranuni(123)) - 5 );
end;
end;
OUTPUT;
end;
run;
Compute deltas
data want;
set have;
array amount a_0-a_45;
array delta b_0-b_45;
delta(1) = amount(1);
do i=2 to dim(amount);
delta(i) = amount(i) - amount(i-1);
end;
drop i;
format a_: b_: 4.;
run;
As Richard already suggested in his comment while I was writing the code, the only error in your code is that the loop should run from 2 to 46, because there are 46 elements in the array. The code below should work for you.
%macro f();
data dataset;
%do i = 0 %to 45;
a_&i. = ranuni(2);
%end;
run;
%mend;
%f();
data test;
set dataset;
array a1 a_0-a_45;
array b1 b_0-b_45;
/* This line will help in avoiding b_0 to have a missing value */
b1(1)=a1(1);
do i=2 to 46;
b1(i)=a1(i)-a1(i-1);
end;
run;
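The off-by-one arises because a default SAS array is indexed from 1, so a(1) refers to a_0. One way to sidestep it is to declare explicit bounds so that the subscript matches the variable suffix. A minimal sketch, assuming made-up input values (dataset_demo and test_demo are hypothetical names):

```sas
/* Hypothetical stand-in for the asker's data: one observation */
data dataset_demo;
    array a{0:45} a_0-a_45;
    do i = 0 to 45;
        a{i} = i * i;       /* arbitrary demo values */
    end;
    drop i;
run;

data test_demo;
    set dataset_demo;
    array a{0:45} a_0-a_45;  /* a{k} now refers to a_k */
    array b{0:45} b_0-b_45;
    b{0} = a{0};
    do i = 1 to hbound(a);   /* hbound, not dim, when the lower bound is not 1 */
        b{i} = a{i} - a{i-1};
    end;
    drop i;
run;
```

With 0-based bounds the loop body reads exactly like the formula b_k = a_k - a_(k-1).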

Reset Retained Array Values at the end of each observation in SAS

I'm running the array code below
DATA Want;
SET Have;
ARRAY Dates{2562} (&Start_Date:&End_Date);
DO i = 1 TO DIM(Dates);
IF Dates[i] >= ObStartDate AND Dates[i] <= ObEndDate THEN Dates[i] = 1;
END;
RUN;
I have found the minimum date (i.e. the first ObStartDate of my dataset) and the maximum date (i.e. the last ObEndDate of my dataset), and those values are stored in &Start_Date and &End_Date. The array is created correctly and holds unformatted SAS date values for each observation. I also want to run through each observation and, if the value in any of the Dates array columns falls between that observation's individual start and end dates, replace that value with 1.
Here's where it starts to go wrong. It retains the ObStartDate and ObEndDate from observation to observation and only replaces different Dates[i] when it picks up a lower ObStartDate or higher ObEndDate.
Is there a way I can reset ObStartDate and ObEndDate to each observation's own values when the array's do loop gets to each consecutive observation?
I've tried creating the array and running the do loop in a different data step. I've also tried putting loops inside loops inside loops, arrays inside loops, and so on. I may have been close to success, but this is the code that I thought would work and the first code that I wrote.
Any help will be greatly appreciated.
Cheers.
Here is some code to see what I mean
DATA Haveyay;
ATTRIB Ob LENGTH=3
ObStartDate Length=3
ObEndDate Length=3;
INFILE datalines DELIMITER='~';
INPUT Ob ObStartDate ObEndDate ;
DATALINES;
1~1~8
2~2~5
3~5~10
4~1~4
5~2~3
6~4~7
7~7~10
8~3~4
9~3~9
10~2~9
;
RUN;
PROC SQL Noprint;
SELECT min(ObStartDate), max(ObEndDate) into :Start_Date, :End_Date
FROM Haveyay;
QUIT;
DATA Wantyay;
SET Haveyay;
ARRAY Dates{10} (&Start_Date:&End_Date);
DO i = 1 to DIM(Dates);
IF Dates[i] >= ObStartDate AND Dates[i] <= ObEndDate THEN Dates[i] = 1;
END;
RUN;
It looks like your problem may be that you are expecting the values in the dates array to be reset to their original values with each observation. In reality the array statement initialises the value in the array only once, before any data is loaded. As the array variables are automatically retained each change you make to a member of the array will be carried forward into later observations.
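The retention is easy to see in a tiny example. This sketch uses a hypothetical three-row input (demo_in) and a three-element array with initial values, and prints the array before and after a change on each observation:

```sas
data demo_in;
    do ob = 1 to 3;
        output;
    end;
run;

data _null_;
    set demo_in;
    array d{3} (10 20 30);          /* initial values are applied once, at compile time */
    put 'Before: ' ob= d1= d2= d3=;
    d{ob} = 1;                      /* overwrite one element per observation */
    put 'After:  ' ob= d1= d2= d3=;
run;
```

On the second observation the "Before" line already shows d1=1: the change made while processing the first observation was carried forward rather than reset.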
You can use a second loop to reset the date values after outputting:
do i = 1 to dim(dates);
if obstartdate <= dates[i] <= obenddate then dates[i] = 1;
end;
output;
do i = 1 to dim(dates);
dates[i] = &start_date. + i - 1;
end;
Or more compactly calculate the date from i and the macro variable rather than the array:
do i = 1 to dim(dates);
_date = &start_date + i - 1;
dates[i] = ifn(ObStartDate <= _date <= ObEndDate , 1, _date);
end;
