Using SAS to check if columns have specified characteristics

Using SAS to check if columns have specified characteristics - arrays

I have a dataset that looks like the one below. each row is a different observation that has anywhere from 1 to x values (in this case x=3). I want to create a dataset that contains the original info, but four additional columns (for the four values of Bin present in the dataset). The value of the column freq_Bin_1 will be a 1 if there are any 1's present in that row, else missing. The column freq_Bin_2 will be a 1 if there are any 2's present, etc.
Both the number of Bins and the number of columns in the original dataset may vary.
data have;
input Bin_1 Bin_2 Bin_3;
cards;
1 . .
3 . .
1 1 .
3 2 1
3 4 .
;
run;
Here is my desired output:
data want_this;
input Bin_1 Bin_2 Bin_3 freq_Bin_1 freq_Bin_2 freq_Bin_3 freq_Bin_4;
cards;
1 . . 1 . . .
3 . . . . 1 .
1 1 . 1 . . .
3 2 1 1 1 1 .
3 4 . . . 1 1
;
run;
I have an array solution that I think is pretty close, but I can't quite get it. I am also open to other methods.
data want;
set have;
array Bins {&max_freq.} Bin:;
array freq_Bin {&num_bin.} freq_Bin_1-freq_Bin_&num_bin.;
do j=1 to dim(Bins);
freq_Bin(j)=.;
end;
do k=1 to dim(freq_Bin);
if Bins(k)=1 then freq_Bin(1)=1;
else if Bins(k)=2 then freq_Bin(2)=1;
else if Bins(k)=3 then freq_Bin(3)=1;
else if Bins(k)=4 then freq_Bin(4)=1;
end;
drop j k;
run;

This should work:
data want;
set have;
array Bins{*} Bin:;
array freq_Bin{4};
do k=1 to dim(Bins);
if Bins(k) ne . then freq_Bin(Bins(k))=1;
end;
drop k;
run;
I tweaked your code somewhat, but really the only problem was that you need to check that Bins(k) isn't missing before trying to use it to index an array. Also, there's no need to initialize the values to missing as that's the default.

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck in an array looping. SAS informs that "ERROR: Array subscript out of range at line x", guess there's something working regarding the logic? See codes below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testttime);
corr_time{i}=testttime{i=maxvalue{i}};
end;
run;

There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;

Your code from this question was close, you just had the assignment backwards.
Note that an array method will assign the last date in the case of duplicate high results and WHICHN will report the first date so the answers are not identical unless you modify the loop to exit after the first maximum value is found.
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the maximum occurence;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop k;
format date1_a date2_b date3_c dob mmddyy8. ;
run;

how to create a loop for a macro in stata?

I have a very large dataset but to cut it short I demonstrated the data with the following example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patid death dateofdeath)
1 0 .
2 0 .
3 0 .
4 0 .
5 1 15007
6 0 .
7 0 .
8 1 15526
9 0 .
10 0 .
end
format %d dateofdeath
I am trying to sample for a case-control study based on date of death. At this stage, I need to first create a variable with each date of death repeated for all the participants (hence we end up with a dataset with 20 participants) and a pairid equivalent to the patient id patid of the corresponding case.
I created a macro for one case (which works) but I am finding it difficult to have it repeated for all cases (where death==1) in a loop.
The successful macro is as follows:
local i "5" //patient id who died
gen pairid= `i'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
format matchedindexdate %d
save temp`i'
and the loop I attempted is:
* (min and max patid id)
forval j = 1/10 {
count if patid == `j' & death==1
if r(N)=1 {
gen pairid= `j'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
save temp/matched`j'
}
}
use temp/matched1, clear
forval i=2/10 {
capture append using temp/matched`i'
save matched, replace
}
but I get:
invalid syntax
How can I do the loop?

I finally had it solved, please check:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1591811-how-to-create-a-loop-for-a-macro

SAS - Invalid numeric data while searching through an Array

I am trying to create an array of strings and want to insert a value in it, if it does not exist already in the array.
I read somewhere that we can use 'IN' operator with Array. So, coded it as follows:
DATA WANT;
SET HAVE;
BY ID;
ARRAY R_PROS_SCRN_ID {2} $4. R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
RETAIN R_PROS_SCRN_ID_1 - R_PROS_SCRN_ID_2;
IF NOT PROS_SCRN_ID IN R_PROS_SCRN_ID THEN DO;
DO I=1 to 2 ;
IF MISSING( R_PROS_SCRN_ID{i}) THEN DO;
R_PROS_SCRN_ID{i} = PROS_SCRN_ID;
LEAVE;
END;
END;
END;
IF LAST.ID THEN OUTPUT;
RUN;
In Array R_PROS_SCRN_ID, I want only the unique values from field PROS_SCRN_ID.
It is throwing error:
NOTE: Invalid numeric data, PROS_SCRN_ID='MED' , at line 17352 column 201.
I think it is because I did not initialize the Array before comparing and hence it is considering it as Numeric Array. But, I have specified the format as $4. Why is it throwing error?
Also, I am not sure if this is the best way get unique values in an Array. Is there any better way to implement this?

Your code appears to be collecting unique values by group, pivoting from a tall data structure to a wide data structure.
One of the clearest DATA step ways is to use what we call DOW loop in which SET is within the loop. This sample code presumes no more than 10 unique satellite values per group. (The by variables can be thought of as key variables, and all other variables would be satellites)
data have;
input user_id screen_id ;
datalines;
1 1
1 2
1 1
1 1
1 1
1 3
2 1
2 1
2 1
3 0
4 1
4 2
4 3
5 11
5 11
5 11
5 5
5 1
5 5
5 6
5 1
run;
data want;
_index = 0;
do until (last.user_id);
set have;
by user_id;
array ids screen_id1-screen_id10;
if screen_id not in ids then do;
_index + 1;
ids(_index) = screen_id;
end;
end;
drop _index screen_id;
run;
One of the clearest procedural ways is to select the unique values and transpose them.
proc sql;
create view uniqueScreenByUser as
select distinct user_id, screen_id
from have
order by user_id
;
proc transpose data=uniqueScreenByUser prefix=screen_id out=wantWide(drop=_name_);
by user_id;
var screen_id;
run;

Filling missing values for many variables from previous observation by group in SAS

My dataset looks like this:
Date ID Var1 Var2 ... Var5
200701 1 x .
200702 1 . a
200703 1 . .
200701 2 . b
200702 2 y b
200703 2 y .
200702 3 z .
200703 3 . .
I want my results to look like this:
Date ID Var1 Var2 ... Var5
200701 1 x .
200702 1 x a
200703 1 x a
200701 2 . b
200702 2 y b
200703 2 y b
200702 3 z .
200703 3 z .
I tried the following code below, but it didn't work. What's wrong with it?
Am I better off using array? If so, how?
%macro a(variable);
length _&variable $10.;
retain _&variable;
if first.ID then _&variable = '';
if &variable ne '' then _&variable=&variable;
else if &variable = '' then &variable=_&variable;
drop _&variable;
%mend;
data want;
set have;
%a(Var1)
%a(Var2)
%a(Var3)
%a(Var4)
%a(Var5)
run;
Appreciate the help! Thanks!

The UPDATE statement can do that. It is intended to process transactions against a master dataset so when the transaction value is missing the current value from the master table is left unchanged. You can use your single dataset as both the master and the transaction data by adding OBS=0 dataset option. Normally it will expect to output only one observation per BY group, but if you add an OUTPUT statement you can have it output all of the observations.
data want;
set have(obs=0) have ;
by id;
output;
run;

The full code works! Thanks
%macro a(variable);
length _&variable $10.;
retain _&variable;
if first.ID then _&variable = '';
if &variable ne '' then _&variable=&variable;
else if &variable = '' then &variable=_&variable;
drop _&variable;
%mend;
data want;
update have(obs=0) have;
by id;
output;
%a(Var1)
%a(Var2)
%a(Var3)
%a(Var4)
%a(Var5)
run;

SAS: Merging two tables with identical columns while dropping null values

I'm not sure if the title does this question justice, but here it goes:
I have three datasets Forecasts1, Forecasts2 and Forecasts3. They are all time series data composed of a date variable and variables r1 through r241.
For a given r variable (lets just use r1-r3, and only Forecasts 1 and 2 for now) each dataset has only one row where the value isn't null, and it is a different row in each dataset.
Forecast 1 looks like this:
Forecast 2 looks like this:
I need to be able to combine them such that r1-r3 contain all the non-null values, without creating duplicate date rows to hold the null values.
So ideally the finished produce would look like this:
I've tried various types of merges and sets, but I keep getting duplicate date rows. How would I go about doing this properly for all 241 (or more) variables? (specifically in SAS or Proc SQL?)
LINKS TO GOOGLE DOCS CONTAINING DATA:
Forecasts1: https://docs.google.com/spreadsheets/d/1iUEwPltU6V6ijgnkALFiIdrwrolDFt8xaITZaFC4WN8/edit?usp=sharing
Forecasts2:
https://docs.google.com/spreadsheets/d/1lQGKYJlz6AAR-DWtoWnl8SwzCNAmSpj7yxRqRgnybr8/edit?usp=sharing

Did you try the UPDATE statement?
data forecast1 ;
input date r1-r3 ;
cards;
1 1 . .
2 . 2 .
3 . . 3
4 . . .
;
data forecast2 ;
input date r1-r3 ;
cards;
2 2 . .
3 . 3 .
4 . . 4
5 . . .
;
data want ;
update forecast1 forecast2 ;
by date ;
run;
proc print; run;
Results
date r1 r2 r3
1 1 . .
2 2 2 .
3 . 3 3
4 . . 4
5 . . .

I tend to approach these types of problems using proc sql. Assuming one row per date in the data sets, you can use full outer join:
proc sql;
select coalesce(f1.date, f2.date) as date,
coalesce(f1.r1, f2.r1) as r1,
coalesce(f1.r2, f2.r2) as r2,
coalesce(f1.r3, f2.r3) as r3
from forecast1 f1 full outer join
forecast2 f2
on f1.date = f2.date

Consider a union query with aggregation. The only drawback is writing out the aggregates for all 241 columns in outer query.
proc sql;
SELECT sub.date, Max(sub.r1) AS R1, Max(sub.r2) AS R2, Max(sub.r3) AS R3, ...
FROM
(SELECT *
FROM Forecasts1 f1
UNION ALL
SELECT *
FROM Forecasts2 f2) As sub
GROUP BY sub.date
quit;

A different solution would be to append all and delete rows where all are missing.
data want;
set forecast1-forecast3 indsname=fc;
model = fc; *store name of forecast model;
if nmiss(of r1-r3) = 3 then delete;
run;