SAS conditionals applied to array elements - arrays

I have a hard time wrapping my head around SAS arrays. I have a dataset that has ID and BeginDate, EndDate. I want to create 3 binary variables that =1 if either start or enddate is in a given year and I'm looking at 3 different years. When I run the code below all of the new variables I create (year1955, year1956, year1957) = 1 if any one of them is true. This is not what I want. I am using an array because I eventually will want to do this with more than 3 variables.
My code:
data temp2; set temp;
array yr(3) year1955-year1957;
do i = 1 to 3;
if year(BeginDate) =1955 or year(EndDate)=1955 then yr(i)=1;
if year(BeginDate) =1956 or year(EndDate)=1956 then yr(i)=1;
if year(BeginDate) =1957 or year(EndDate)=1957 then yr(i)=1;
end;
drop i;
run;
I would be open to a more elegant solution than the one I've devised.
Output I'm getting :
ID Begindate EndDate year1955 year1956 year1957
AA 01/01/1956 01/01/1969 1 1 1
Output I want:
ID Begindate EndDate year1955 year1956 year1957
AA 01/01/1956 01/01/1969 . 1 .

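You are not using the value of your loop variable in the IF conditions, so every pass through the loop runs all three tests and sets yr(i) to 1 whenever any one of the years matches.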
You could just get rid of the DO loop.
if year(BeginDate) =1955 or year(EndDate)=1955 then yr(1)=1;
if year(BeginDate) =1956 or year(EndDate)=1956 then yr(2)=1;
if year(BeginDate) =1957 or year(EndDate)=1957 then yr(3)=1;
Or use the value of the loop variable I in the IF condition.
do i = 1 to 3;
if year(BeginDate) =1955+i-1 or year(EndDate)=1955+i-1 then yr(i)=1;
end;
Or use the year value as the index into the array by changing the range of indexes the array uses.
array yr [1955:1957] year1955-year1957;
if year(BeginDate) in (1955:1957) then yr[year(BeginDate)]=1;
if year(EndDate) in (1955:1957) then yr[year(EndDate)]=1;
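A minimal runnable sketch of the year-indexed approach, assuming a one-row sample that mirrors the question's example. The helper variables b and e are hypothetical names introduced here, and LBOUND/HBOUND replace the hard-coded range so the bounds only need changing in the ARRAY statement when more years are added.

```sas
/* Hypothetical sample row matching the example in the question */
data temp;
  input ID $ BeginDate :mmddyy10. EndDate :mmddyy10.;
  format BeginDate EndDate mmddyy10.;
  datalines;
AA 01/01/1956 01/01/1969
;

data temp2;
  set temp;
  /* Subscripts run from 1955 to 1957 instead of 1 to 3 */
  array yr [1955:1957] year1955-year1957;
  b = year(BeginDate);
  e = year(EndDate);
  /* Guard each lookup so an out-of-range year is never used as a subscript */
  if lbound(yr) <= b <= hbound(yr) then yr[b] = 1;
  if lbound(yr) <= e <= hbound(yr) then yr[e] = 1;
  drop b e;
run;
```

For the sample row only year1956 is set to 1; year1955 and year1957 stay missing, because 1969 falls outside the array bounds.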

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck on the array looping. SAS reports "ERROR: Array subscript out of range at line x", so I guess something is wrong with the logic? See the code below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testtime);
corr_time{i}=testtime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close, you just had the assignment backwards.
Note that the array loop assigns the last date when the high result is duplicated, whereas WHICHN reports the first, so the two answers differ unless you modify the loop to exit after the first maximum value is found.
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the first occurrence of the maximum;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop i;
format date1_a date2_b date3_c dob mmddyy8. ;
run;
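To illustrate the note about duplicate maxima, here is a small sketch in which the loop exits at the first match so that it agrees with WHICHN. The sample row (ID 004) and the variables first_time and whichn_time are hypothetical, and the sketch assumes at least one result is non-missing.

```sas
/* Hypothetical row where the highest result (7) occurs twice */
data have;
  input ID $ (time_a time_b time_c) (:mmddyy.) result_a result_b result_c;
  format time_a time_b time_c yymmdd10.;
  datalines;
004 1/1/22 1/2/22 1/3/22 7 5 7
;

data want;
  set have;
  array times time_a time_b time_c;
  array results result_a result_b result_c;
  highest = max(of results[*]);
  do i = 1 to dim(results);
    if results[i] = highest then do;
      first_time = times[i];
      leave; /* stop at the first match instead of overwriting with later ones */
    end;
  end;
  /* WHICHN returns the position of the first value equal to HIGHEST */
  whichn_time = times[whichn(highest, of results[*])];
  format first_time whichn_time yymmdd10.;
  drop i;
run;
```

With LEAVE in place both first_time and whichn_time take the value of time_a; without it the loop would keep going and end on time_c.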

SAS - Invalid numeric data while searching through an Array

I am trying to create an array of strings and want to insert a value in it, if it does not exist already in the array.
I read somewhere that we can use 'IN' operator with Array. So, coded it as follows:
DATA WANT;
SET HAVE;
BY ID;
ARRAY R_PROS_SCRN_ID {2} $4. R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
RETAIN R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
IF NOT PROS_SCRN_ID IN R_PROS_SCRN_ID THEN DO;
DO I=1 to 2 ;
IF MISSING( R_PROS_SCRN_ID{i}) THEN DO;
R_PROS_SCRN_ID{i} = PROS_SCRN_ID;
LEAVE;
END;
END;
END;
IF LAST.ID THEN OUTPUT;
RUN;
In Array R_PROS_SCRN_ID, I want only the unique values from field PROS_SCRN_ID.
It is throwing error:
NOTE: Invalid numeric data, PROS_SCRN_ID='MED' , at line 17352 column 201.
I think it is because I did not initialize the Array before comparing and hence it is considering it as Numeric Array. But, I have specified the format as $4. Why is it throwing error?
Also, I am not sure if this is the best way get unique values in an Array. Is there any better way to implement this?
Your code appears to be collecting unique values by group, pivoting from a tall data structure to a wide data structure.
One of the clearest DATA step ways is to use a DOW loop, in which the SET statement sits inside the loop. This sample code presumes no more than 10 unique satellite values per group. (The BY variables can be thought of as key variables, and all other variables as satellites.)
data have;
input user_id screen_id ;
datalines;
1 1
1 2
1 1
1 1
1 1
1 3
2 1
2 1
2 1
3 0
4 1
4 2
4 3
5 11
5 11
5 11
5 5
5 1
5 5
5 6
5 1
run;
data want;
_index = 0;
do until (last.user_id);
set have;
by user_id;
array ids screen_id1-screen_id10;
if screen_id not in ids then do;
_index + 1;
ids(_index) = screen_id;
end;
end;
drop _index screen_id;
run;
One of the clearest procedural ways is to select the unique values and transpose them.
proc sql;
create view uniqueScreenByUser as
select distinct user_id, screen_id
from have
order by user_id
;
proc transpose data=uniqueScreenByUser prefix=screen_id out=wantWide(drop=_name_);
by user_id;
var screen_id;
run;
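As for the "Invalid numeric data" note in the question, a likely cause is operator precedence: the prefix NOT is evaluated before IN, so NOT PROS_SCRN_ID IN R_PROS_SCRN_ID first applies NOT to the character value and forces a numeric conversion. A hedged sketch of the character-array version with parentheses around the IN test follows; the sample rows and the short array name r are assumptions made for illustration.

```sas
/* Hypothetical sample data with a character screen ID */
data have;
  input ID PROS_SCRN_ID $;
  datalines;
1 MED
1 LAB
1 MED
2 LAB
;

data want;
  /* $4 declares the elements as character variables of length 4 */
  array r {2} $4 R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
  do until (last.ID);
    set have;
    by ID;
    /* Parentheses make IN evaluate before NOT */
    if not (PROS_SCRN_ID in r) then do i = 1 to dim(r);
      if missing(r{i}) then do;
        r{i} = PROS_SCRN_ID;
        leave; /* fill only the first empty slot */
      end;
    end;
  end;
  drop i PROS_SCRN_ID;
run;
```

Because the array elements are new variables that are not retained, they start out blank for every BY group, so no explicit reset is needed. A group with more than two unique values would silently lose the extras in this sketch.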

Within a sas by statement, subtract one observation from its lag

I have a SAS data set grouped by clusters, as follows
data have;
input cluster date date9.;
cards;
1 1JAN2017
1 2JAN2017
1 7JAN2017
2 1JAN2017
2 3JAN2017
2 10JAN2017
;
run;
Within each cluster, I'd like to subtract each date's previous date from it, so I have the dataset below:
data want;
input cluster date date_diff;
cards;
1 1JAN2017 0
1 2JAN2017 1
1 7JAN2017 5
2 1JAN2017 0
2 3JAN2017 2
2 10JAN2017 7
;
run;
I think perhaps I should be using a lag function similar to what I have written below.
DATA test;
SET have;
BY cluster;
if first.cluster then do;
date_diff = date - lag(date);
END;
RUN;
Any advice would be appreciated! Thanks
I like dif for this (lag plus subtract in one function). You have the if first backwards, I think, but dif and lag have the same restriction - what they're really doing is building a queue, so the lag or dif statement cannot be conditionally executed for most use cases. Here I flip it around and calculate the dif, then set it to missing if on first.cluster.
I also encourage you to use missing, not 0, for the first.cluster dif.
DATA test;
SET have;
BY cluster;
date_diff = dif(date);
if first.cluster then call missing(date_diff);
RUN;
Conditional lags are tricky. In this case (and in many others), you don't actually want a conditional lag. You want to compute the lag for every record, and then you can use it conditionally. One way is:
data want ;
set have ;
by cluster ;
lagdate=lag(date) ;
if first.cluster then date_diff=0 ;
else date_diff=date-lagdate ;
run ;
So you compute lagdate for every record. Then you can conditionally compute date_diff.
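A compact variant of the same idea, offered as a sketch rather than a recommendation: IFN evaluates all of its arguments before choosing one, so DIF is still called on every row even though its result is discarded on the first row of each cluster. The informat in the sample data step is written with a colon modifier, which is an assumption about how the dates are read.

```sas
data have;
  input cluster date :date9.;
  format date date9.;
  datalines;
1 1JAN2017
1 2JAN2017
1 7JAN2017
2 1JAN2017
2 3JAN2017
2 10JAN2017
;

data want;
  set have;
  by cluster;
  /* dif(date) runs on every row, keeping its queue in sync */
  date_diff = ifn(first.cluster, 0, dif(date));
run;
```

For this sample date_diff comes out as 0, 1, 5, 0, 2, 7. Swap the 0 for a missing value (.) if the first row of each cluster should not report a difference.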

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
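data class_final;
if 0 then set sashelp.class nobs=nobs; *never executes, but NOBS= picks up the original row count that the ranks are based on;
set class_ranks;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;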
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
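For a single variable, a sort-and-count sketch is another option when the exact row count is not known in advance. It assumes that ties at the cut-off can be broken arbitrarily and that missing values, which sort first, are acceptable among the "smallest"; the dataset names sorted and trimmed are hypothetical.

```sas
/* Sort by the variable of interest */
proc sort data=sashelp.class out=sorted;
  by height;
run;

data trimmed;
  /* NOBS= supplies the row count of SORTED */
  set sorted nobs=nobs;
  /* Keep everything except the first 4 and last 4 rows */
  if 4 < _n_ <= nobs - 4;
run;
```

Repeating the two steps with a different BY variable handles the next column; unlike the rank-based filter, this always drops exactly four rows from each end.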
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In your data step you can use first.foo and the macro var to selectively output only those observations that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Checking correct order of values

I have a data set that looks similar to the one below. Basically, I have current prices for three different sizes of an item type. If the sizes are priced correctly (ie small<medium<large) I want to flag them with a “Y” and continue to use the current price. If they are not priced correctly, I want to flag them with a “N” and use the recommended price. I know that this is probably a time to use array programming, but my array skills are, admittedly, a bit weak. There are hundreds of locations, but only one item type. I currently have the unique locations loaded in a macro variable list.
data have;
input type $ location $ size $ cur_price rec_price;
cards;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;
proc sql;
select distinct location into :loc_list separated by ' ' from have;
quit;
Any help would be greatly appreciated.
Thanks.
Not sure why you'd want to use an array here...proc transpose and some data step logic
can easily solve this problem. Arrays are very useful (gotta admit, I'm not entirely
comfortable with them either), but in a situation where you have that many locations,
I think transpose is better.
Does the code below accomplish your goal?
/*sorts to get ready for transpose*/
proc sort data=have;
by location;
run;
/*transpose current price*/
proc transpose data=have out=cur_tran prefix=cur_price;
by location;
id size;
var cur_price;
run;
/*transpose recommended price*/
proc transpose data=have out=rec_tran prefix=rec_price;
by location;
id size;
var rec_price;
run;
/*merge back together*/
data merged;
merge cur_tran rec_tran;
by location;
run;
/*creates flags and new field for final price*/
data want;
set merged;
if cur_priceS<cur_priceM<cur_priceL then
do;
FLAG='Y';
priceS=cur_priceS;
priceM=cur_priceM;
priceL=cur_priceL;
end;
else do;
FLAG='N';
priceS=rec_priceS;
priceM=rec_priceM;
priceL=rec_priceL;
end;
run;
I don't see how arrays would help here. How about just checking using dif to queue the last record's price and verify it (could also retain the last price if you prefer). Make sure the dataset's properly sorted by type location descending size, then:
data want;
set have;
by type location descending size; *S > M > L alphabetically;
retain price_check;
price_diff = dif(cur_price); *call DIF on every row so its queue stays in sync;
if first.location then price_check=0; *reset it;
else if price_diff lt 0 then price_check=1; *current size is cheaper than the smaller size before it;
if last.location;
keep type location price_check;
run;
Then merge that back to your original dataset by type location, and use the other price if price_check=1.
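The merge-back step could look roughly like this. The dataset names checks and final, the flag and price variables, and the hand-typed check results are all hypothetical; in practice checks would be the one-row-per-location output of the price_check step.

```sas
/* Sample data, already sorted by type and location */
data have;
  input type $ location $ size $ cur_price rec_price;
  datalines;
x LA S 5 1
x LA M 4 2
x LA L 3 3
x NY S 4 1
x NY M 5 2
x NY L 6 3
;

/* Hypothetical result of the check step: one row per type and location */
data checks;
  input type $ location $ price_check;
  datalines;
x LA 1
x NY 0
;

data final;
  merge have checks;
  by type location;
  length flag $1;
  /* price_check=1 means the sizes were out of order */
  if price_check = 1 then do;
    flag = 'N';
    price = rec_price;
  end;
  else do;
    flag = 'Y';
    price = cur_price;
  end;
run;
```

Each row of have picks up the price_check value for its location, so LA rows take the recommended price and NY rows keep the current one.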
Alternatively you could do it in a single query which is almost a re-statement of your requirements.
Proc sql;
create table want as
select *
/* Basically, I have current prices for three different sizes of an item type.
If the sizes are priced correctly (ie small<medium<large) */
, case
when max ( case when size eq 'S' then cur_price end)
lt max ( case when size eq 'M' then cur_price end)
and max ( case when size eq 'M' then cur_price end)
lt max ( case when size eq 'L' then cur_price end)
/* I want to flag them with a “Y” and continue to use the current price */
then 'Y'
/* If they are not priced correctly,
I want to flag them with a “N” and use the recommended price. */
else 'N'
end as Cur_Price_Sizes_Correct
, case
when calculated Cur_Price_Sizes_Correct eq 'Y'
then cur_price
else rec_price
end as Price
From have
Group by Type
, Location
;
Quit;
