SAS pass array in SAS macro - arrays

I try to do some calculations based on existing columns and add the results back to the datasets. Could anyone help?
Here is what I try to write in SAS:
%macro ColumnCal1(m,prefix);
data _null_;
attr_&prefix. = sum(of &m.1-&m.3);
call symput("attr_&prefix.",attr_&prefix.);
run;
%mend ColumnCal1;
data c2;
set c1;
array mth{12} m1-m12;
%ColumnCal1(m=mth, prefix=ttl);
attr_ttl =&attr_ttl.;
run;

If I understood the problem correctly, you have a dataset and you want to calculate the sum of different columns using a macro. I have shared an example below. Try it out.
Suppose you have a dataset with array like this:
data have;
array mnth{*} m1-m4;
input mnth{*};
cards;
1 3 6 9
2 4 8 10
;
run;
Code:
To calculate sum of different columns macro columncal1 is created with parameters
1) Input: Input file which contains the variables
2) Start: First column from which sum needs to be calculated
3) End: Last column till which sum needs to be calculated
4) Prefix: Prefix of the computed column name
5) Output: Output file which gives the result
%macro ColumnCal1(input=,start=,end=,prefix=,output=);
data &output.;
set &input.;
attr_&prefix = sum(of &start.-&end.);
run;
%mend ColumnCal1;
%ColumnCal1(input=have,start=m1,end=m2,prefix=ttl,output=want1);
/* Dataset want1 having all the initial columns plus sum of m1 and m2 stored in a variable attr_ttl has been created from have dataset*/
%ColumnCal1(input=want1,start=m2,end=m3,prefix=ttl1,output=want2);
/* Dataset want2 having all the columns of want1 plus sum of m2 and m3 stored in a variable attr_ttl1 has been created from want1 dataset*/
My Output (want2) :
m1 |m2 |m3 |m4 |attr_ttl |attr_ttl1
1 |3 |6 |9 |4 |9
2 |4 |8 |10 |6 |12
If you have any different requirement please do let me know.

You cannot nest data steps. Once SAS sees the new data step starting it stops compiling the first one and runs it. Also, how does your current macro know how to find the data, since there is no SET statement in its data step? You also cannot reference a macro variable that has not been created yet. So if you generate a macro variable using the CALL SYMPUTX() routine, you cannot reference its value to modify the code of the current data step, since the data step needs to have been compiled already before CALL SYMPUTX() can execute.
Something like this could work.
%macro ColumnCal1(m,prefix);
attr_&prefix. = sum(of &m.1-&m.3);
call symput("attr_&prefix.",attr_&prefix.);
%mend ColumnCal1;
data c2;
set c1;
%ColumnCal1(m=m, prefix=ttl); /* m=m so the macro generates sum(of m1-m3) */
run;
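To illustrate the timing point, here is a minimal sketch in which the macro variable is created in one step and only read in the next one. The one-row c1 dataset is made up purely for illustration, and it assumes a single total is wanted rather than one per row:

```sas
/* hypothetical one-row input, only for illustration */
data c1;
  m1 = 1; m2 = 3; m3 = 6;
run;

/* step 1: compute the value and store it in a macro variable */
data _null_;
  set c1;
  call symputx('attr_ttl', sum(of m1-m3));
run;

/* the macro variable can be referenced only after step 1 has finished */
%put attr_ttl=&attr_ttl;

/* step 2: &attr_ttl is resolved while this step is being compiled */
data c2;
  set c1;
  attr_ttl = &attr_ttl;
run;
```

If the sum is needed per row, the macro variable is unnecessary and the plain assignment inside the data step is enough.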

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck on the array loop. SAS reports "ERROR: Array subscript out of range at line x", so I guess there's something wrong with the logic? See the code below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testttime);
corr_time{i}=testttime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close, you just had the assignment backwards.
Note that an array method will assign the last date in the case of duplicate high results and WHICHN will report the first date so the answers are not identical unless you modify the loop to exit after the first maximum value is found.
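A minimal sketch of that modification, assuming the have dataset built earlier in this thread (want_first is a name made up here). The LEAVE statement stops the loop at the first match, so ties resolve the same way as with WHICHN:

```sas
data want_first;
  set have;
  array times   time_a time_b time_c;
  array results result_a result_b result_c;
  do i = 1 to dim(results);
    if results[i] = highest then do;
      highest_time = times[i];
      leave;  /* exit at the first maximum so later ties are ignored */
    end;
  end;
  format highest_time yymmdd10.;
  drop i;
run;
```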
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the first occurrence of the maximum;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop i;
format date1_a date2_b date3_c dob mmddyy8. ;
run;

Split SAS datasets by column with primary key

So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this?
Thanks in advance.
A simple way to split a dataset in the way you request is to use a single data step with multiple output datasets where each one has a KEEP= dataset option listing the variables to keep. For example:
data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
set sashelp.class;
run;
So you need to get the list of variables and group them into sets of 250 or fewer. Then you can use those groupings to generate code like the above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.
I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset and maximum number of variables to keep in each dataset.
So for the example above those macro variable values would be:
%let ds=sashelp.class;
%let key=name;
%let nvars=2;
So use PROC CONTENTS to get the list of variable names:
proc contents data=&ds noprint out=contents; run;
Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.
data groups;
length group 8 memname $41 varnum 8 name $32 ;
group +1;
memname=cats('split',group);
do varnum=1 to &nvars while (not eof);
set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
output;
end;
run;
Now you can use that dataset to drive the generation of the code:
data _null_;
set groups end=eof;
by group;
if _n_=1 then call execute('data ');
if first.group then call execute(cats(memname,'(keep=&key'));
call execute(' '||trim(name));
if last.group then call execute(') ');
if eof then call execute(';set &ds;run;');
run;
Here are results from the SAS log:
NOTE: CALL EXECUTE generated line.
1 + data
2 + split1(keep=name
3 + Age
4 + Height
5 + )
6 + split2(keep=name
7 + Sex
8 + Weight
9 + )
10 + ;set sashelp.class;run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.
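Since the question mentions merging the pieces back later, here is a short sketch of that step for the split1 and split2 datasets above. It assumes the key (name) is unique in each piece; rejoined is a name made up here:

```sas
/* MERGE with BY needs both inputs sorted by the key */
proc sort data=split1; by name; run;
proc sort data=split2; by name; run;

data rejoined;
  merge split1 split2;
  by name;
run;
```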
Just another way of doing it using macro variables:
/* Number of columns you want in each chunk */
%let vars_per_part = 250;
/* Get all the column names into a dataset */
proc contents data = have out=cols noprint;
run;
%macro split(part);
/* Split the columns into 250 chunks for each part and put it into a macro variable */
%let fobs = %eval((&part - 1)* &vars_per_part + 1);
%let obs = %eval(&part * &vars_per_part);
proc sql noprint;
select name into :cols separated by " " from cols (firstobs = &fobs obs = &obs) where name ~= "uniq_id";
quit;
/* Chunk up the data only keeping those variables and the uniq_id */
data want_part&part;
set have (keep = &cols uniq_id);
run;
%mend;
/* Run this from 1 to whatever the increment required to cover all the columns */
%split(1);
%split(2);
%split(3);
This is not a complete solution, but it may give you another insight into how to solve this. The previous solutions have relied on proc contents and the data step, but I would solve this using proc sql and dictionary.columns. I would also create a macro that splits the original file into as many parts as needed, 250 columns each. The steps, roughly:
/* libname and memname values are stored in upper case in dictionary.columns */
proc sql; create table _colstemp as select * from dictionary.columns where libname='YOUR_LIBRARY' and memname='YOUR_TABLE' and upcase(name) ne 'YOUR_PRIMARY_KEY'; quit;
Count the number of files needed somewhere along:
proc sql;
select ceil(count(*)/249) into :num_of_datasets from _colstemp;
select count(*) into :num_of_cols from _colstemp;
quit;
Then just loop over the original dataset like:
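/* this %do loop has to sit inside a macro definition */
%do _i = 1 %to &num_of_datasets;
proc sql noprint;
select name into :vars separated by ','
/* OBS= is the number of the last row to read, capped at the total column count */
from _colstemp(firstobs=%eval((&_i - 1)*249 + 1) obs=%sysfunc(min(%eval(&_i*249), &num_of_cols)));
quit;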
proc sql;
create table split_&_i. as
select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;
Hopefully this gives you another idea. The solution is not tested, and may contain some pseudocode elements as it's written from my memory of doing things. It also lacks the macro declaration and much of the parametrization one could do. Adding those would make the solution more general (for example, parametrize the number of variables for each dataset, the primary key name, and the dataset names).

Macro loop in SAS - passing value to condition

I'm a beginner in SAS and I am struggling a bit with the macro loop in SAS. The problem is illustrated by the code below. The task here is to create separate subsets and save them as libraries for later post-processing. Additionally I added graphs for visualization. I am operating on a huge database but for this post I create a sample at the beginning of the code for simplification.
However, it seems that the internal condition (IF ID = i ) doesn't filter out the data. Instead the internal loop creates empty tables (but with correct names: "SUB1", "SUB2", "SUB3") with a column (variable) called "i".
DATA EXAMPLE;
INPUT ID DATE DDMMYY8. VALUE;
FORMAT DATE DDMMYY8.;
DATALINES;
1 01012011 100
1 01022011 400
1 01032011 678
2 01012011 678
2 01022011 333
2 01032011 333
3 01012011 733
3 01022011 899
3 01032011 999
;
%MACRO filter(number);
%DO i=1 %TO &number;
DATA SUB&i;
SET WORK.EXAMPLE;
IF ID = i;
PROC SGPLOT DATA=SUB&i;
reg x=DATE y=VALUE;
RUN;
%END;
%mend filter;
%filter(3);
If I manually copy and paste the part inside macro and manually change i to numbers 1 to 3 it creates correct graphs. What is wrong in this code? How can I pass the value from the DO statement inside the code?
I am using SAS Studio.
The macro is creating empty data sets because the code that the macro eventually writes contains the subsetting if statement
if ID = i;
Because the data set does not contain a variable i a new variable named i is added to the PDV and the output data sets SUB1, SUB2, SUB3. The default value for i is missing and no ID value is missing, thus no rows pass the test and you get empty data sets. The log will also provide clues to the situation:
NOTE: Variable i is uninitialized.
When abstracting a code segment for 'macroization' be sure to use & in front of the macro variables. Thus, when the macro contains
if ID = &i;
The eventual code written by the macro system will have your 3 similar code operations with the different values of the macro variable.
...
if ID = 1;
...
...
if ID = 2;
...
...
if ID = 3;
...
Right now you are producing the same graph three times because the datasets SUB1, SUB2, SUB3 all use the same set of data. That is because the only thing in your data step that depends on the value of the macro variable I is the name.
You are currently selecting the observations where the variable ID matches the variable I. Perhaps you meant to select the observations where the variable ID matches the macro variable used in the %DO loop?
IF ID = &i;
One tip for debugging your macro code is to add the statement
options mprint;
This will show the code that SAS is actually using.
For example in the log:
70 options mprint;
71 %MACRO filter(number);
72 %DO i=1 %TO &number;
73 DATA SUB&i;
74 SET WORK.EXAMPLE;
75 IF ID = &i;
76 PROC SGPLOT DATA=SUB&i;
77 reg x=DATE y=VALUE;
78 RUN;
79 %END;
80 %mend filter;
81
82 %filter(2);
MPRINT(FILTER): DATA SUB1;
MPRINT(FILTER): SET WORK.EXAMPLE;
MPRINT(FILTER): IF ID = 1;
NOTE: There were 9 observations read from the data set WORK.EXAMPLE.
NOTE: The data set WORK.SUB1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
MPRINT(FILTER): PROC SGPLOT DATA=SUB1;
MPRINT(FILTER): reg x=DATE y=VALUE;
MPRINT(FILTER): RUN;

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
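One thing to watch in the two-step version: NOBS= on class_ranks counts only the rows that survived the WHERE= filter, whereas the ranks were assigned over all rows. Here is a sketch that ranks without filtering and trims both tails in one data step. The names class_ranks_all and class_trimmed are made up here, and tied values get the default mean rank:

```sas
/* rank without filtering so NOBS= reflects the full row count */
proc rank data=sashelp.class out=class_ranks_all;
  var height weight;
  ranks height_r weight_r;
run;

data class_trimmed;
  set class_ranks_all nobs=nobs;
  /* keep rows outside the 4 lowest and 4 highest ranks of both variables */
  if 4 < height_r <= nobs - 4 and 4 < weight_r <= nobs - 4;
  drop height_r weight_r;
run;
```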
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In your data step you can use first.foo and the macro var to selectively output only those that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

SAS: sum all values except one

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!
You can do it in a single proc sql instead, using the keyword calculated:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;
Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
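A minimal sketch of that PROC MEANS variant for the have dataset above, without any class variables. The names grade_stats, gradesum, graden, sum_others, mean_others and want_means are made up here:

```sas
/* one-row dataset holding the overall sum and count */
proc means data=have noprint;
  var grade;
  output out=grade_stats(drop=_type_ _freq_) sum=gradesum n=graden;
run;

data want_means;
  /* read the single stats row once; its values are retained for every record */
  if _n_ = 1 then set grade_stats;
  set have;
  sum_others  = gradesum - grade;
  mean_others = sum_others / (graden - 1);
run;
```

With class variables, the stats dataset would have one row per class combination and would be merged on by those variables instead.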
proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
/* temp has a single row, so a plain cross join attaches the total to every record */
from orig_data a, temp b;
quit;
Haven't tested it but the above should work. It creates a new dataset temp that has the sum of all grades and joins that back to create a new table with the sum of all grades less the current student's grade as sum_other_grades.
This solution takes each observation of your starting dataset, and then loops through the same dataset summing up grade values for any records with different names. So beginning with 'Sam', we only add to the oth_g variable when we find names that are NOT 'Sam':
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;
This is a slight modification to the answer @Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (like I have done above by refactoring @Reese's query).
Personally I think the refactored version is also easier to understand.
