I have a dataset of transactions made by customers.
I want to put these transactions into an array with elements 1 to 50, although one customer can have 50 transactions or more. The output dataset should have one row per customer, with the value of each transaction in its own column.
In short, I want these transaction values in an array that is reset to 0 at first.cust_id. Any idea how I can go about doing this? This is what I have so far, but it is giving me errors. Assume the initial dataset has just the cust_id and transaction_amount fields.
The code shows only the array initialization. I do a few calculations with the arrays afterwards.
data check;
set transactions;
by cust_id;
array trans[*] trans1-trans50;
retain array_counter;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter=sum(array_counter,1);
if last.cust_id;
run;
First off, you need a retain. Without it, the values in the trans array are reset to missing on every iteration of the data step.
Second, you say "50 or more" transactions, but your array bounds only allow 50, so what should happen for transaction 51?
This works; I set it to 30 transactions per customer so you can see the 0's. If you change that initial 1 to 30 loop to 1 to 60, you'll get out-of-bounds errors.
data transactions;
call streaminit(7);
do cust_id = 1 to 10;
do transaction = 1 to 30;
transaction_amount = rand('uniform');
output;
end;
end;
run;
data check;
set transactions;
by cust_id;
retain trans1-trans50;
array trans[*] trans1-trans50;
retain array_counter;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter=sum(array_counter,1);
if last.cust_id;
run;
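One way to deal with the "transaction 51" question raised above is to guard the array index before writing to it. The following is only a sketch: the policy of dropping extra transactions with a log note, and the dataset name check_guarded, are assumptions for illustration and not part of the original answer. It expects transactions to be sorted by cust_id.

```sas
data check_guarded;                      /* hypothetical output name */
    set transactions;
    by cust_id;
    array trans[*] trans1-trans50;
    retain trans1-trans50 array_counter;

    if first.cust_id then do;
        do i = 1 to dim(trans);
            trans[i] = 0;                /* reset the slots for a new customer */
        end;
        array_counter = 0;
    end;

    array_counter + 1;                   /* position of this transaction within the customer */

    /* Store the value only while there is room in the array */
    if array_counter <= dim(trans) then trans[array_counter] = transaction_amount;
    else if array_counter = dim(trans) + 1 then
        put 'NOTE: more than 50 transactions for ' cust_id= '; extra values are not stored.';

    if last.cust_id then output;         /* one row per customer */
    drop i;
run;
```

With this guard, a customer with 60 transactions produces a single log note and keeps the first 50 values instead of stopping the step with an array subscript error.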
A couple things are wrong in your code. For starters, in order to use first/last processing, you need a by statement. Then, in order to get only one row per customer, you need to also retain your array variables and output only on last.cust_id:
data check;
set transactions;
by cust_id;
array trans[*] trans1-trans50;
retain array_counter trans1-trans50;
if first.cust_id then do;
do i=1 to dim(trans);
trans[i]=0;
end;
array_counter=1;
end;
trans[array_counter] = transaction_amount;
array_counter+1;
if last.cust_id then output;
run;
So I have this:
Initial database:
Variable1, variable2, value, percentvalue
Keyword1, a, 234, 0.7
Keyword1, a, 64, 0.18
Keyword1, a, 4, 0.05
Keyword1, a, 2, 0.025
Keyword1, a, 300, 0.84
Keyword2
Keyword2
Keyword3
Keyword4
Keyword4
and so on.
When I run this individually, it works:
data Filename1;
set filename0;
if variable1 = 'Keyword1' then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
This returns the first 3 rows of Keyword1, which is what I want.
But I need to do it for the entire table, which has about 600 keywords.
I'm currently testing with only one keyword to make sure it works the same way.
But when I run:
data Filename1;
set filename0;
array MyArrayVariable1{1} $ Keyword1;
do i=1 to dim(MyArrayVariable1);
if variable1 = MyArrayVariable1[i] then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
end;
run;
When I run it, it just produces an empty table instead of the selected rows.
And if I get rid of the output; statement, it returns the entire table without filtering anything.
Looks like you just want to use BY group processing.
data Filename1;
set filename0;
by variable1 ;
if first.variable1 then sumcol=0;
sumCol + percentvalue;
if sumCol<=0.95 then output;
run;
Note that using a SUM statement
sumCol + percentvalue;
is a simplified way to code these two statements in your original code.
retain sumCol;
sumCol = sum(sumCol, percentvalue);
BY group processing with an I/O criterion based on a groupwise computation can also be succinctly coded in what is commonly called a DOW loop in the SAS community. One hallmark of the technique is to place the SET statement inside a DO loop.
Example:
data want;
do until (last.variable1);
SET have;
by variable1;
pctsum = sum(pctsum,percentvalue);
if pctsum <= 0.95 then OUTPUT;
end;
run;
NOTE:
I'm not sure of the role of your Variable2. Should it be part of a hierarchy wherein the pctsum is reset if the Variable2 value changes within a Variable1 group?
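If Variable2 is indeed a second level of the hierarchy, the running total could be reset whenever Variable2 changes within a Variable1 group. This is a sketch under that assumption only; it also assumes filename0 is already sorted by variable1 and variable2, and the output name Filename1_by_var2 is made up for illustration.

```sas
data Filename1_by_var2;                  /* hypothetical output name */
    set filename0;
    by variable1 variable2;
    /* first.variable2 is also set whenever variable1 changes */
    if first.variable2 then sumCol = 0;
    sumCol + percentvalue;               /* running total within the subgroup */
    if sumCol <= 0.95 then output;
run;
```

Because first.variable2 fires at the start of every variable1 group as well, a single reset condition covers both levels.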
I am editing my original question to simplify the problem statement:
I need to create a dataset that contains the principal paydown schedule of a security, which is split into 3 tranches. For each period for the security, I need to calculate the ending balances of principal owed for each tranche. For period 0 (i.e. starting period), I already have the balances owed. For subsequent periods, I need to take the balances from the previous periods and subtract the principal paid down in the current period. The same logic should continue through the last period.
In my SAS code, I am able to get period 1 to do the calculations correctly, but the balances from period 1 don't correctly make it into period 2, causing the calculation to break from that point onwards. I know lag or its placement is what is not working correctly. I am not able to figure out where to place it, or how to use retain (if not lag), such that my balances go from one row to the next.
%let n_t=3;
data xyz;
INFILE DATALINES DLM='#';
input ID $6. period PrincipalPaid best12.2;
datalines;
ABC123#00#0.0
ABC123#01#4.0
ABC123#02#3.92
ABC123#03#3.84
ABC123#04#3.76
ABC123#05#3.69
ABC123#06#3.62
ABC123#07#3.54
;run;
data xyz2;
set xyz;
by id;
if period=0 then do;
Bal1= 120;
Bal2= 8;
Bal3= 2;
end;
/*Code to push all starting balances from period 0 to 1*/
array prev_bal{&N_t.} prev_bal1-prev_bal&n_t.;
array bal{&N_t.} bal1-bal&n_t.;
do i=1 to &N_t.;
prev_bal{i}=lag(bal{i});
end;
/*code to calculate balances for periods >=1*/
if period>=1 then do;
array PrincipalPayDown{&N_t.} PrincipalPayDown1-PrincipalPayDown&N_t.;
do i = 1 to &N_t. ;
PrincipalPayDown{i}=round(PrincipalPaid*prev_bal{i}/sum(of prev_bal:),0.01);
bal{i}=max(prev_bal{i}-PrincipalPayDown{i},0);
end;
end;
drop i ;
run;
proc sql;
create table final as
select
id,period,PrincipalPaid,prev_bal1,prev_bal2,prev_bal3,
PrincipalPayDown1,PrincipalPayDown2,PrincipalPayDown3,Bal1,Bal2,Bal3
from xyz2;
quit;
I am also adding a picture of the final dataset with the correct output calculated in Excel. I want SAS to give me the same output for periods >=2.
Screenshot showing correct output in Excel
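One possible way to carry the balances from row to row is RETAIN rather than LAG. In the code above, LAG queues whatever value bal{i} holds at the moment it is called, and for periods >= 1 that is before the row's balance has been calculated, so later periods appear to receive missing values. The sketch below is not from the original post: it reuses the xyz dataset and the starting balances from the question, while the names xyz2_retain and total_prev are hypothetical.

```sas
%let n_t = 3;

data xyz2_retain;                        /* hypothetical output name */
    set xyz;
    by id;
    array bal{&n_t.} bal1-bal&n_t.;
    array prev_bal{&n_t.} prev_bal1-prev_bal&n_t.;
    array PrincipalPayDown{&n_t.} PrincipalPayDown1-PrincipalPayDown&n_t.;
    retain bal1-bal&n_t.;                /* balances survive into the next row */

    if period = 0 then do;
        bal1 = 120;
        bal2 = 8;
        bal3 = 2;
    end;
    else do;
        /* bal still holds the ending balances of the previous period */
        do i = 1 to &n_t.;
            prev_bal{i} = bal{i};
        end;
        total_prev = sum(of prev_bal{*});
        do i = 1 to &n_t.;
            if total_prev > 0 then
                PrincipalPayDown{i} = round(PrincipalPaid * prev_bal{i} / total_prev, 0.01);
            else PrincipalPayDown{i} = 0;
            bal{i} = max(prev_bal{i} - PrincipalPayDown{i}, 0);
        end;
    end;
    drop i total_prev;
run;
```

Only bal1-bal3 are retained, so prev_bal and PrincipalPayDown stay missing for period 0, as in the original layout. With several securities in one file, the period 0 row of each ID resets the balances.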
I am new here. I am trying to read in a data set multiple times. For example, assume I have 3 observations in a data set (called tempfile) for a variable called temp. The three observations are 4, 6, and 5. I want to read in the set x number of times, so that the 4th observation would be 4, the 5th would be 6, the 6th would be 5, the 7th would be 4, and so on. I have tried this literally a few dozen ways, by doing something like
data new;
do i=1 to 100;
set tempfile;
end;
output;
run;
I have tried moving the do statement, moving the output statement, and omitting the output statement, every which way, and I have tried macros as well. Can somebody help? Thanks, John
Follow-up:
Hello,
Thanks for the response. That did work. I would now like to do several things involving some "if then" statements inside the loop (more than just reading in the data set).
I want to read in a data set n times, and each time there will be two if-then statements.
So, assume I read in 3 numbers any number of times: 7, 15, and 12.
As each number is read, it is checked against 10, and a random number is generated each time.
If the number is less than 10:
if rand(uniform) < .4 then 1 is added to counter1, else 1 is added to counter2.
If the number is >= 10:
if rand(uniform) < .2 then 1 is added to counter1, else 1 is added to counter2.
Any help is much appreciated.
Thanks,
John
The way that most data steps actually stop is when SAS reads past the end of the input. So you need a method that prevents SAS from doing that.
The easiest way to replicate the data is to just execute multiple output statements. So the first record is repeated three times, then the second record is repeated three times, etc.
data want;
set tempfile ;
do i=1 to 3;
output;
end;
run;
Another method is to just list the dataset multiple times on the SET statement. So to read it in 3 times just use
data want;
set tempfile tempfile tempfile;
run;
You could probably use macro logic or even just a macro variable to make the number of repetitions variable.
data _null_; call symputx('list',repeat('tempfile ',3-1)); run;
data want; set &list; run;
Another method is to use the POINT= and NOBS= options on the SET statement so that SAS never reads past the end and you can jump back to the beginning. But since it never reads past the end of the input data, you will need to tell it manually when to stop.
data want ;
do i=1 to 3;
do p=1 to nobs ;
set tempfile point=p nobs=nobs;
output;
end;
end;
stop;
run;
Or more in the spirit of your original post you might want to use the MOD() function to figure out which observation to read next.
data want;
if _n_ > 100 then stop;
p=1+mod(_n_-1,nobs);
set tempfile point=p nobs=nobs;
run;
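The counting logic from the follow-up could be folded into the POINT= loop shown above. This is a sketch only: the number of passes (3), the seed, and the names counts and cutoff are assumptions for illustration, and temp is the variable from the original tempfile.

```sas
data counts;                             /* hypothetical output name */
    call streaminit(123);                /* arbitrary seed, for repeatability */
    do pass = 1 to 3;                    /* how many times to re-read tempfile */
        do p = 1 to nobs;
            set tempfile point=p nobs=nobs;
            /* choose the probability threshold based on the value read */
            if temp < 10 then cutoff = 0.4;
            else cutoff = 0.2;
            if rand('uniform') < cutoff then counter1 + 1;
            else counter2 + 1;
        end;
    end;
    output;                              /* one row holding the final counters */
    stop;                                /* required because POINT= never hits end of file */
    keep counter1 counter2;
run;
```

The sum statements (counter1 + 1) start both counters at 0, so no explicit initialization is needed.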
If you have SAS/STAT software, you can use PROC SURVEYSELECT.
data have;
do temp=4,6,5;
output;
end;
run;
proc surveyselect reps=10 rate=1 out=temp2 noprint;
run;
The data step is designed for serial processing. In this case, you need to "remember" previous observations. You can do it using only the data step, but for that use case, there are other solutions in the SAS environment that are simpler. The one I suggest is a macro that appends the original file n times:
%macro replicate( data=, out=, n=)/des='&out is &data repeated &n times.';
data &out;
set
%do i=1 %to &n;
&data
%end;
; /* This ; ends the data step `set` statement */
run;
%mend;
You could test your example with this helper:
%macro test;
data have; /* create the example data set */
temp = 4; output;
temp = 6; output;
temp = 5; output;
run;
%replicate( data=have, out=want, n=4 );
proc print data=want; run;
%mend;
Here is a portion of the SAS doc that adds lots of detail with many examples.
I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest values of foo before I proceed (later I would do a similar thing with bar), but I'm unsure of the most effective way to do this. I know there are functions smallest and largest, but I don't understand how I can use them to get the smallest 4 or largest 4 from an existing dataset. Alternatively, I guess I could just remove the min and max 4 times, but that sounds needlessly tedious and time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
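data class_final;
/* class_ranks was already filtered, so take the row count from the unfiltered input; this SET never executes */
if 0 then set sashelp.class nobs=nobs;
set class_ranks;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;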
Of course, if you're just removing the values, then do it all in the data step and use call missing on the variable when the condition is met, rather than deleting the observation.
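The "call missing" variant mentioned above might look like the following. It is a sketch, assuming no missing values in height or weight (so the highest rank equals the row count) and PROC RANK's default handling of ties; the dataset names class_ranked and class_masked are hypothetical.

```sas
proc rank data=sashelp.class out=class_ranked;
    var height weight;
    ranks height_r weight_r;
run;

data class_masked;                       /* hypothetical output name */
    set class_ranked nobs=nobs;          /* unfiltered, so nobs is the highest possible rank */
    /* blank out the value instead of deleting the whole observation */
    if height_r <= 4 or height_r >= nobs - 3 then call missing(height);
    if weight_r <= 4 or weight_r >= nobs - 3 then call missing(weight);
    drop height_r weight_r;
run;
```

Every observation is kept, so a row that is extreme on height still contributes its weight to later calculations.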
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro variable (counting null as a distinct value).
In your data step you can then use first.foo and the macro variable to output only those observations that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!
In this block of SAS data step code I am setting a Table from an SQL query called TEST_Table. This table contains multiple columns including a larger section of columns titled PREFIX_1 to PREFIX_20. Each column starts with PREFIX_ and then an incrementing number from 1 to 20.
What I would like to do is iteratively cycle through each column and analyze the value of that column.
Below is an example of what I am trying to go for. As you can see I would like to create a variable that increases on each iteration and then I use that count value as a part of the variable name I am checking.
data TEST_Data;
set TEST_Table;
retain changing_number;
changing_number=1;
do while(changing_number<=20);
if PREFIX_changing_number='BAD_IDENTIFIER' then do;
PREFIX_changing_number='This is a bad part';
end;
end;
run;
What would be the best way to do this in SAS? I know I can do it by simply checking each value individually from 1 to 20.
if PREFIX_1 = 'BAD_IDENTIFIER' then do;
PREFIX_1 = 'This is a bad part';
end;
if PREFIX_2 = ...
But that would be really obnoxious as later I will be doing the same thing with a set of over 40 columns.
Ideas?
SOLUTION
data TEST_Data;
set TEST_Table;
array SC $ SC1-SC20;
do i=1 to dim(SC);
if SC{i}='xxx' then do;
SC{i}="bad part";
end;
end;
run;
Thank you for suggesting Arrays :)
You need to look up Array processing in SAS. Simply put, you can do something like this:
data TEST_Data;
set TEST_Table;
/* retain changing_number; Remove this - even in your code it does nothing useful */
array prefixes prefix:; *one of a number of ways to do this;
changing_number=1;
do while(changing_number<=20);
if prefixes[changing_number]='BAD_IDENTIFIER' then do;
prefixes[changing_number]='This is a bad part';
end;
changing_number=changing_number+1; /* advance the index so the loop terminates */
end;
run;
A slightly better loop is:
do changing_number = 1 to dim(prefixes);
... loop ...
end;
This handles the initialization, the test, and the increment in one statement, and it adapts to the number of array elements (dim returns the number of elements in the array).
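For a quick check of the array approach, here is a small self-contained sketch. The dataset names demo_table and demo_data, the three-column layout, and the $20 length are made up for illustration. One detail to watch: the character variables need to be long enough to hold the replacement text, otherwise it is truncated.

```sas
/* Hypothetical test data with three PREFIX_ columns */
data demo_table;
    length PREFIX_1-PREFIX_3 $20;        /* long enough for 'This is a bad part' */
    PREFIX_1 = 'OK';
    PREFIX_2 = 'BAD_IDENTIFIER';
    PREFIX_3 = 'OK';
    output;
run;

data demo_data;
    set demo_table;
    array prefixes[*] $ PREFIX_1-PREFIX_3;
    do k = 1 to dim(prefixes);           /* dim() adapts to the number of columns */
        if prefixes[k] = 'BAD_IDENTIFIER' then prefixes[k] = 'This is a bad part';
    end;
    drop k;
run;

proc print data=demo_data;
run;
```

Moving to the set of 40+ columns should only require changing the variable list in the ARRAY statement.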