SAS: generating an output database following several proc procedures

SAS: generating an output database following several proc procedures - database

I am still new at SAS and I was wondering how I can do the following:
Say that I have a database with the following info:
Time_during_the day date prices volume_traded
930am sep02 42 300
10am sep02 41 200
..4pm sep02 40 200
930am sep03 40 500
10am sep03 41 100
..4pm sep03 40 350
.....
What I want is to take the average of the total daily volume and divide this number by 50 (always). So say avg.daily vol./50 = V; and what I want is to record the price/time/date at every interval of size V. Now, say that V=500, I start by recording the first price,time,and date in my database and then record the same info 500 volume trade later. It is possible that on one day that the traded volume is say 300 and half of it will cover the v=500, the other 150 will be use to fill up up the following interval.
How can I get this information in one database?
Thank you!

Assume your input dataset is called tick_data, and that it is sorted by both date and time_during_the_day. Then here's what I got:
%LET n = 50;
/* Calculate V - the breakpoint size */
PROC SUMMARY DATA=tick_data;
BY date;
OUTPUT OUT = temp_1
SUM (volume_traded)= volume_traded_agg;
RUN;
DATA temp_2 ;
SET temp_1;
V = volume_traded_agg / &n;
RUN;
/* Merge it into original dataset so that it is available */
DATA temp_3;
MERGE tick_data temp_2;
BY date;
RUN;
/* Final walk through tick data to output at breakpoints */
DATA results
/* Comment out the KEEP to see what is happening under the hood */
(KEEP=date time_during_the_day price volume_traded)
;
SET temp_3;
/* The IF FIRST will not work without the BY below */
BY date;
/* Stateful counters */
RETAIN
volume_cumulative
breakpoint_next
breakpoint_counter
;
/* Reset stateful counters at the beginning of each day */
IF (FIRST.date) THEN DO;
volume_cumulative = 0;
breakpoint_next = V;
breakpoint_counter = 0;
END;
/* Breakpoint test */
volume_cumulative = volume_cumulative + volume_traded;
IF (breakpoint_counter <= &n AND volume_cumulative >= breakpoint_next) THEN DO;
OUTPUT;
breakpoint_next = breakpoint_next + V;
breakpoint_counter = breakpoint_counter + 1;
END;
RUN;
The key SAS language feature to keep in mind for the future is the use of BY, FIRST, and RETAIN together. This enables stateful walks through data like this one. Conditional OUTPUT also figures here.
Note that whenever you use BY <var>, the dataset must be sorted on a key that includes <var>. In the case of tick_data and all intermediate temporary tables, it is.
Additional: Alternative V
In order to make V equal the (average total daily volume / n), replace the matching code block above with this one:
. . . . . .
/* Calculate V - the breakpoint size */
PROC SUMMARY DATA=tick_data;
BY date;
OUTPUT OUT = temp_1
SUM (volume_traded)= volume_traded_agg;
RUN;
PROC SUMMARY DATA = temp_1
OUTPUT OUT = temp_1a
MEAN (volume_traded_agg) =;
RUN;
DATA temp_2 ;
SET temp_1a;
V = volume_traded_agg / &n;
RUN;
/* Merge it into original dataset so that it is available */
DATA temp_3 . . . . . .
. . . . . .
Basically you just insert a second PROC SUMMARY to take the mean of the sums. Notice how there is no BY statement because we're averaging over the whole set, not by any groupings or buckets. Also notice the MEAN (...) = without a name after the =. That will make the output variable have the same name as the input variable.

Related

Need to persist values from one record to the subsequent record - lag and retain both work but not giving desired results

I am editing my original question to simplify the problem statement:
I need to create a dataset that contains the principal paydown schedule of a security, which is split into 3 tranches. For each period for the security, I need to calculate the ending balances of principal owed for each tranche. For period 0 (i.e. starting period), I already have the balances owed. For subsequent periods, I need to take the balances from the previous periods and subtract the principal paid down in the current period. The same logic should continue through the last period.
In my SAS code, I am able to get period 1 to do the calculations correctly, but the balances from period 1 don't correctly make it into period 2, causing the calculation to break from that point onwards. I know lag or its placement is what is not working correctly. I am not able to figure out where to place it, or how to use retain (if not lag), such that my balances go from one row to the next.
%let n_t=3;
data xyz;
INFILE DATALINES DLM='#';
input ID $6. period PrincipalPaid best12.2;
datalines;
ABC123#00#0.0
ABC123#01#4.0
ABC123#02#3.92
ABC123#03#3.84
ABC123#04#3.76
ABC123#05#3.69
ABC123#06#3.62
ABC123#07#3.54
;run;
data xyz2;
set xyz;
by id;
if period=0 then do;
Bal1= 120;
Bal2= 8;
Bal3= 2;
end;
/*Code to push all starting balances from period 0 to 1*/
array prev_bal{&N_t.} prev_bal1-prev_bal&n_t.;
array bal{&N_t.} bal1-bal&n_t.;
do i=1 to &N_t.;
prev_bal{i}=lag(bal{i});
end;
/*code to calculate balances for periods >=1*/
if period>=1 then do;
array PrincipalPayDown{&N_t.} PrincipalPayDown1-PrincipalPayDown&N_t.;
do i = 1 to &N_t. ;
PrincipalPayDown{i}=round(PrincipalPaid*prev_bal{i}/sum(of prev_bal:),0.01);
bal{i}=max(prev_bal{i}-PrincipalPayDown{i},0);
end;
end;
drop i ;
run;
proc sql;
create table final as
select
id,period,PrincipalPaid,prev_bal1,prev_bal2,prev_bal3,
PrincipalPayDown1,PrincipalPayDown2,PrincipalPayDown3,Bal1,Bal2,Bal3
from xyz2;
quit;
I am also adding a picture of the final dataset with the correct output calculated in Excel. I want SAS to give me the same output for periods >=2.
Screenshot showing correct output in Excel

array processing with different indices and missing values in SAS

have is a sas data set with 4 variables: an id and variables storing info on all the activities a respondent shares with 3 different members of a team they're on. There are 4 different activity types, identified by the numbers populating in the :_activities vars for each player (p1 to p3). Below are the first 5 obs:
id p1_activities p2_activities p3_activities
A 1,2,3,4 1,3
B 1,3 1,2,3 1,2,3
C 1,2,3 1,2,3
D 1,2,3
E 1,2,3 1
Consider respondent A: they share all 4 activities with player 1 on their team, and activities 1 and 3 with player 2 on their team. I need to create flags for each player position and each activity. For example, a new numeric variable p1_act2_flag should equal 1 for all respondents who have a value of 2 appearing in the p1_activities character variable. Here are the first 6 variables I need to create out of the 12 total for the data shown:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
I do this now by initializing all of the variable names in a length statement, then writing a ton if-then statements. I want to use far fewer lines of code, but my array logic is incorrect. Here's how I try to create the flags for player 1:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to dim(plracts);
do j=1 to dim(p1actflg);
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
My "array subscript is out of range" because of the different lengths of the player and activity arrays, but nesting in different do-loops would result in me writing many more lines of code than a wallpaper solution.
How would you do this more systematically, and/or in far fewer lines of code?

Not sure why you are processing 4 activities for flags when there are only 3.
Some ideas:
Refactoring the column names to numbered suffixes would reduce some of the wallpaper effect.
activities_p1-activities_p3
Refactoring the flag column names to number suffixes
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
Use DIM to stay within array bounds.
Use two dimensional array for flags
Use direct addressing of items to be flagged
Add error checking
Not fewer, but perhaps more robust ?
This code examines each item in the activities list as opposed to seeking presence of a specific items (1..4):
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activites);
if missing(activities[i]) then continue; %* skip;
do j = 1 by 1;
item = scan ( activities[i], j, ',' );
if missing(item) then leave; %* no more items in csv list;
item_num = input (item,?1.);
if missing(item_num) then continue; %* skip, csv item is not a number;
if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
put 'WARNING:' item_num 'is invalid for flagging';
continue; %* skip, csv item is missing, 0, negative or exceeds 4;
end;
flags (i, item_num) = 1;
end;
* backfill zeroes where flag not assigned;
do j = 1 to hbound(flags,2);
flags (i, item_num) = sum (0, flags (i, item_num)); %* sum() handles missing values;
end;
end;
Here is the same processing, but only searching for specific items to be flagged:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
What's going on ?
flags variables are reset to missing at top of step
hbound return 4 as upper limit of second dimension
findw(trim(activities[i]),cats(j),',') find position of j in csv string
trim needed to remove trailing spaces which are not part of findw word delimiter list
cats converts j number to character representation
findw returns position of j in csv string.
might want to also compress out spaces and other junk if activity data values are not reliable.
first > 0 evaluates position to 0 j not present and 1 present
second > 0 is a another logic evaluation that ensures j present flag remains 0 or 1. Otherwise flags would be a frequency count (imagine activity data 1,1,2,3)
flags(i,j) covers the 3 x 4 slots available for flagging.

Consider converting into a hierarchical view and doing the logic there. The real stickler here is the fact that there can be missing positions within each list. Because of this, a simple do loop will be difficult. A faster way would be multi-step:
Create a template of all possible players and positions
Create an actual list of all players and positions
Merge the template with the actual list and flag all matches
It's not as elegant as a single data step like could be done, but it is somewhat easy to work with.
data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when an
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;

Following Stu's example, a DS2 DATA step can perform his 'merge' using a hash lookup. The hash lookup depends on creating a data set that maps CSV item lists to flags.
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;
share_flags
result

Split SAS datasets by column with primary key

So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this?
Thanks in advance.

A simple way to split a datasets in the way you request is to use a single data step with multiple output datasets where each one has a KEEP= dataset option listing the variables to keep. For example:
data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
set sashelp.class;
run;
So you need to get the list of variables and group then into sets of 250 or less. Then you can use those groupings to generate code like above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.
I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset and maximum number of variables to keep in each dataset.
So for the example above those macro variable values would be:
%let ds=sashelp.class;
%let key=name;
%let nvars=2;
So use PROC CONTENTS to get the list of variable names:
proc contents data=&ds noprint out=contents; run;
Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.
data groups;
length group 8 memname $41 varnum 8 name $32 ;
group +1;
memname=cats('split',group);
do varnum=1 to &nvars while (not eof);
set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
output;
end;
run;
Now you can use that dataset to drive the generation of the code:
data _null_;
set groups end=eof;
by group;
if _n_=1 then call execute('data ');
if first.group then call execute(cats(memname,'(keep=&key'));
call execute(' '||trim(name));
if last.group then call execute(') ');
if eof then call execute(';set &ds;run;');
run;
Here are results from the SAS log:
NOTE: CALL EXECUTE generated line.
1 + data
2 + split1(keep=name
3 + Age
4 + Height
5 + )
6 + split2(keep=name
7 + Sex
8 + Weight
9 + )
10 + ;set sashelp.class;run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.

Just another way of doing it using macro variables:
/* Number of columns you want in each chunk */
%let vars_per_part = 250;
/* Get all the column names into a dataset */
proc contents data = have out=cols noprint;
run;
%macro split(part);
/* Split the columns into 250 chunks for each part and put it into a macro variable */
%let fobs = %eval((&part - 1)* &vars_per_part + 1);
%let obs = %eval(&part * &vars_per_part);
proc sql noprint;
select name into :cols separated by " " from cols (firstobs = &fobs obs = &obs) where name ~= "uniq_id";
quit;
/* Chunk up the data only keeping those varaibles and the uniq_id */
data want_part∂
set have (keep = &cols uniq_id);
run;
%mend;
/* Run this from 1 to whatever the increment required to cover all the columnns */
%split(1);
%split(2);
%split(3);

this is not a complete solution but some help to give you another insight into how to solve this. The previous solutions have relied much on proc contents and data step, but I would solve this using proc sql and dictionary.columns. And I would create a macro that would split the original file into as many parts as needed, 250 cols each. The steps roughly:
proc sql; create table as _colstemp as select * from dictionary.columns where library='your library' and memname = 'your table' and name ne 'your primary key'; quit;
Count the number of files needed somewhere along:
proc sql;
select ceil(count(*)/249) into :num_of_datasets from _colstemp;
select count(*) into :num_of_cols from _colstemp;
quit;
Then just loop over the original dataset like:
%do &_i = 1 %to &num_of_datasets
proc sql;
select name into :vars separated by ','
from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs = %eval(min(249,&num_of_cols. - &_i. * 249)) ;
quit;
proc sql;
create table split_&_i. as
select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;
Hopefully this gives you another idea. The solution is not tested, and may contain some pseudocode elements as it's written from my memory of doing things. Also this is void of macro declaration and much of parametrization one could do.. This would make the solution more general (parametrize your number of variables for each dataset, your primary key name, and your dataset names for example.

Get rid of kth smallest and largest values of a dataset in SAS

I have a datset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?

PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.

You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.

You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In you data step you can use first.foo and the macro var to selectively output only those that are no the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;

I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {{1, 2, 3, 4,51, 52, 53, 54};
PURGE;
close data_set;
run;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Looping over groups in SAS

I have 55 weeks of sales data of a certain item. I created two SAS datasets from the original data. The first dataset has the date and the sum of quantity sold in each date. Therefore, I have 385 observations (55 x 7). The second table has detailed transaction data. Specifically, for each date, I have the time between transactions, which is the time between the arrival of one customer and the next one who purchased that item (I call it the interarrival times). What I need to do next is as follows:
For the first table (daily sales) I need to take the sales data for
each week, fit a number of distributions to find the parameters of
each one, and record those parameters in a separate table. Note that
each week has eaxctly 7 observations
For the second table (interarrival times) I also need to fit a
number of distributions to find the parameters of each one, and
record those parameters in the same table above, but here, I don’t
have an exact number of observations in each week
Note: I already labeled the week number for the observations in each of the two datasets and I wrote the code that fits the distributions to the data. The only area in which I am struggling is how to tell SAS to take the data for one week, do the calculations, fit the distributions, and then move to the next week (i.e. group the data by week and perform multiple statements on each group).
I tried so many methods and none of them worked including nested loops. I know how to get the weekly sales using other methods and procedures such as PROC SQL, but I am not sure whether I can fit distributions with PROC SQL.
I am using proc nlp to estimate the parameters of each distribution using the maximum likelihood method. For example, if I need to estimate Mu and Sigma for the normal distribution, I am using the following code:
proc nlp data= temp vardef=n covariance=h outest=parms;
title "Normal";
max loglik;
parms mu=0, sigma=1;
bounds sigma > 1e-12;
loglik=-log(sigma*(2*constant('PI'))**.5) - 0.5*((x-mu)/sigma)**2;
run;
This method will find Mu and Sigma that most likely produced the data.

For others wishing to use SAS's internal grouping the nlm code would become:
/* Ensure that the data is sorted to allow group processing */
proc sort data = temp;
by week;
run;
proc nlp data = temp vardef = n covariance = h outest = parms;
/* Produce separate output for each week */
by week;
title "Normal";
max loglik;
parms mu = 0, sigma = 1;
bounds sigma > 1e-12;
loglik = -log(sigma * (2 * constant('PI'))**.5) - 0.5 * ((x - mu) / sigma)**2;
run;
And here is a method using proc univariate:
/* Suppress printed output (remove to see all the details) */
ods select none;
proc univariate data = temp;
/* Produce separate output for each week */
by week;
histogram x /
/* Request fitting to normal distribution */
normal
/* You can select other distributions too */
lognormal;
/* Put the fitted parameters in a dataset */
ods output ParameterEstimates = parms;
/* Put the fit statistics in a dataset */
ods output GoodnessOfFit = quality;
run;
/* Restore printing output */
ods select all;

Here's what I used
%macro weekly;
%do i=1 %to 55;
proc sql;
create table temp as
select location, UPC, date, x, week
from weeks
where week = &i;
quit;
/* I have here the rest of the code where I do my calculations and I fit the distributions to the data of each week */
%end;
%mend;
%weekly;
I knew that proc sql would work initially but I was wondering whether there may be a more efficient way to do it.