Efficiently concatenate many SAS datasets

I have over 200k small datasets with the same variables (n<1000 and usually n<100) that I want to concatenate into a master dataset. I have tried using a macro that uses a data step to just iterate through all of the new datasets and concatenate with the master with "set master new:", but this is taking a really long time. Also, if I try to run at the same time, the call execute data step says that I am out of memory on a huge server box. For reference, all of the small datasets together are just over 5 Gigs. Any suggestions would be greatly appreciated. Here is what I have so far:
%macro catDat(name, nbr) ;
/*call in new dataset */
data new ;
set libin.&name ;
run ;
/* reorder names */
proc sql noprint;
create table new as
select var1, var2, var3
from new;
quit;
%if &nbr = 1 %then %do ;
data master;
set new;
run;
%end;
%if &nbr > 1 %then %do ;
data master;
set master new ;
run;
%end ;
%mend;
/* concatenate datasets */
data runthis ;
set datasetNames ;
call execute('%catdat('||datasetname||','||_n_||')');
run;
Resolved: see Bob's comments below.

Try using PROC APPEND instead of your "new" dataset; that will be much, much faster:
%macro DOIT;
proc sql noprint;
select count(*) into : num_recs
from datasetNames;
quit;
%do i=1 %to &num_recs;
data _null_;
i = &i;
set datasetNames point=i;
call symput('ds_name',datasetname);
stop;
run; /* UPDATE: added this line */
%if &i = 1 %then %do;
/* Initialize MASTER with variables in the order you wish */
data master(keep=var1 var2 var3);
retain var1 var2 var3;
set libin.&ds_name;
stop;
run;
%end;
proc append base=master data=libin.&ds_name(keep=var1 var2 var3);
run;
%end;
%mend DOIT;
PROC APPEND will add each dataset into your new "master" without rebuilding it each time as you are doing now.
This also avoids using CALL EXECUTE, removing that memory issue you were running into (caused by generating so much code into the execution stack).
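If you would rather avoid the macro loop as well, one possible variation is to write the PROC APPEND steps to a temporary file and %INCLUDE it. This is only a sketch: it assumes datasetNames holds a character variable datasetname, that every listed member exists in libin with var1-var3, and the fileref name "code" is arbitrary.

```sas
/* Sketch only: generate one PROC APPEND step per row of datasetNames.  */
/* Assumption: MASTER either does not exist yet or was initialized with */
/* the desired variable order, as in the macro above.                   */
filename code temp;

data _null_;
  set datasetNames;
  file code;  /* send PUT output to the temp file instead of the log */
  put 'proc append base=master data=libin.' datasetname
      '(keep=var1 var2 var3); run;';
run;

/* Run the generated steps, then release the fileref */
%include code;
filename code clear;
```

The generated code lives in a file rather than on the CALL EXECUTE stack, so it should not hit the same memory problem. If MASTER does not exist, the first PROC APPEND creates it from the first dataset.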

Related

SAS Looping through macro variable and processing the data

I have a bunch of character variables which I need to sort out from a large dataset. The unwanted variables all have entries that are the same or are all missing (meaning I want to drop these from the dataset before processing the data further). The data sets are very large so this cannot be done manually, and I will be doing it a lot of times so I am trying to create a macro which will do just this. I have created a list macro variable with all character variables using the following code (The data for my part is different but I use the same sort of code):
data test;
input Obs ID Age;
datalines;
1 2 3
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
run;
proc contents
data = test
noprint
out = test_info(keep=name);
run;
proc sql noprint;
select name into : testvarlist separated by ' ' from test_info;
quit;
My idea is then to just use a data step to drop this list of variables from the original dataset. Now, the problem is that I need to loop over each variable, and determine if the observations for that variable are all the same or not. My idea is to create a macro that loops over all variables, and for each variable counts the occurrences of the entries. Since the length of this table is equal to the number of unique entries I know that the variable should be dropped if the table is of length 1. My attempt so far is the following code:
%macro ListScanner (org_list);
%local i next_name name_list;
%let name_list = &org_list;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
%let next_name = %scan(&name_list, &i);
%put &next_name;
proc sql;
create table char_occurrences as
select &next_name, count(*) as numberofoccurrences
from &name_list group by &next_name;
select count(*) as countrec from char_occurrences;
quit;
%if countrec = 1 %then %do;
proc sql;
delete &next_name from &org_list;
quit;
%end;
%let i = %eval(&i + 1);
%end;
%mend;
%ListScanner(org_list = &testvarlist);
However, I get syntax errors, and with my real data I get other problems with reading the data correctly, but I am taking one step at a time. I think I might be overcomplicating things, so if anyone has an easier solution or can see what might be wrong, I would be very grateful.
There are many ways to do this posted around.
But let's just look at the issues you are having.
First, to loop through your space-delimited list of names, it is easier to let the %do loop increment the index variable for you. Use the countw() function to find the upper bound.
%do i=1 %to %sysfunc(countw(&name_list,%str( )));
%let next_name = %scan(&name_list,&i,%str( ));
...
%end;
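As a self-contained illustration of that loop, here is a minimal sketch. The macro name show_names is made up for this example; it only writes each name to the log.

```sas
/* Hypothetical helper: list each word of a space-delimited macro list */
%macro show_names(name_list);
  %local i next_name;
  %do i=1 %to %sysfunc(countw(&name_list,%str( )));
    %let next_name = %scan(&name_list,&i,%str( ));
    %put Item &i is &next_name;
  %end;
%mend show_names;

%show_names(Obs ID Age)
```

Because %DO is only valid inside a macro definition, the loop has to live in a macro like this one rather than in open code.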
Second, where is your input dataset in your SQL code? Add another parameter to your macro definition for it. And where do you want to write the dataset without the empty columns? That calls for another parameter as well.
%macro ListScanner (dsname , out, name_list);
%local i next_name sep drop_list ;
Third, you can use a single query to count all of the variables at once. Just use count(distinct xxxx) instead of group by.
proc sql noprint;
create table counts as
select
%let sep=;
%do i=1 %to %sysfunc(countw(&name_list,%str( )));
%let next_name = %scan(&name_list,&i,%str( ));
&sep. count(distinct &next_name) as &next_name
%let sep=,;
%end;
from &dsname
;
quit;
So this will get a dataset with one observation. You can use PROC TRANSPOSE to turn it into one observation per variable instead.
proc transpose data=counts out=counts_tall ;
var _all_;
run;
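Now you can just query that table to find the names of the columns with 0 non-missing values. Note that count(distinct) ignores missing values, so a count of 0 means the column is entirely missing. If you also want to drop columns that hold only one distinct value, you could test col1<=1 instead.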
proc sql noprint ;
select _name_ into :drop_list separated by ' '
from counts_tall
where col1=0
;
quit;
Now you can use the new DROP_LIST macro variable.
data &out ;
set &dsname ;
drop &drop_list;
run;
So now all that is left is to clean up after yourself.
proc delete data=counts counts_tall ;
run;
%mend;
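To try it out, assuming the fragments above are assembled into a single macro definition in the order shown, a small test could look like the sketch below. The dataset names demo and demo_clean and the variable names are invented for this example.

```sas
/* Made-up test data: EMPTY is missing on every row */
data demo;
  length id 8 name $8 empty $8;
  do id = 1 to 3;
    name  = cats('N', id);
    empty = ' ';
    output;
  end;
run;

/* Positional parameters: input dataset, output dataset, variable list */
%ListScanner(demo, demo_clean, id name empty)

/* Check which variables survived */
proc contents data=demo_clean short;
run;
```

Since count(distinct empty) is 0 here, EMPTY should end up in DROP_LIST and be absent from demo_clean, while id and name are kept.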
As far as your specific initial question, this is fairly straightforward. Assuming &testvarlist is your macro variable containing the variables you are interested in, and creating some test data in have:
%let testvarlist=x y z;
data have;
call streaminit(7);
do id = 1 to 1e6;
x = floor(rand('Uniform')*10);
y = floor(rand('Uniform')*10);
z = floor(rand('Uniform')*10);
if x=0 and y=4 and z=7 then call missing(of x y z);
output;
end;
run;
data want fordel;
set have;
if min(of &testvarlist.) = max(of &testvarlist.)
and (cmiss(of &testvarlist.)=0 or missing(min(of &testvarlist.)))
then output fordel;
else output want;
run;
This isn't particularly inefficient, but there are certainly better ways to do this, as referenced in comments.

SAS Macro with a do loop

I am trying to write a query where a new table is created with a selection of variables from a number of existing datasets, all ending with YYYYMM (e.g. dataset_201610). Then I am trying to append this data to a master database. When I run it, it doesn't loop back to the other datasets. Any help?
%macro create_master_data_table;
*If the master table exists then delete it;
%if %sysfunc(exist(data_master)) %then %do;
proc sql;
drop table data_master;
quit;
%end;
%let yyyymm = 201702;
%do %while (&yyyymm >= 201610);
*Create a simple table with a month id and the fields we want;
data thismonth;
set Base.Accounts_&yyyymm;
keep var1 var2 var3
run;
*Append the fields we want to the master table;
proc append base=data_master
data=Base.Accounts_&yyyymm(keep=var1 var2 var3);
run;
%end;
%mend create_master_data_table;
%create_master_cre_table;
You need to increment or decrement your &yyyymm macro variable, otherwise it will loop forever. Additionally, the way your program is set up will loop forever if you increment, so you will need to decrement starting at the maximum date.
Because you are dealing with months/dates and always appending a dataset, you'll want to use a few additional checks to ensure no errors and have timely execution.
Modify your program as such:
%macro create_master_data_table(mindate=, maxdate=);
*If the master table exists then delete it;
%if %sysfunc(exist(data_master)) %then %do;
proc sql;
drop table data_master;
quit;
%end;
/* Initialize variables */
%let i = 0;
%let startdate = %sysfunc(inputn(&maxdate., yymmn6.) );
%let nextdate = %sysfunc(intnx(month, &startdate., &i.) );
%do %while (&nextdate > %sysfunc(inputn(&mindate., yymmn6.) ) );
/* Decrease date by 1 month relative to start date */
%let nextdate = %sysfunc(intnx(month, &startdate., &i.) );
/* Convert from SAS date to yyyymm */
%let yyyymm = %sysfunc(putn(&nextdate, yymmn6.) );
/* Only pull data if the table exists */
%if(%sysfunc(exist(Base.Accounts_&yyyymm.) ) ) %then %do;
*Create a simple table with a month id and the fields we want;
data thismonth;
set Base.Accounts_&yyyymm;
keep var1 var2 var3;
run;
*Append the fields we want to the master table;
proc append base=data_master
data=Base.Accounts_&yyyymm(keep=var1 var2 var3);
run;
%end;
%else %put WARNING: Missing data: &yyyymm.;
%let i = %eval(&i. - 1);
%end;
%mend create_master_data_table;
%create_master_data_table(mindate=201610, maxdate=201702);
You can set these to default min/max values if you wish. Also note that we are assuming the input dates will be in yyyymm format; it will not work with other date formats without changing the program.
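To see the date handling in isolation, here is a stripped-down sketch that only writes the months to the log. The macro name list_months is made up for this example; it relies on the same INPUTN, INTNX and PUTN calls as the program above and assumes yyyymm input.

```sas
/* Hypothetical helper: print each month from maxdate down to mindate */
%macro list_months(mindate=, maxdate=);
  %local d stop;
  /* Convert yyyymm text to SAS date values (first day of the month) */
  %let d    = %sysfunc(inputn(&maxdate, yymmn6.));
  %let stop = %sysfunc(inputn(&mindate, yymmn6.));
  %do %while (&d >= &stop);
    %put %sysfunc(putn(&d, yymmn6.));
    /* Step back one month */
    %let d = %sysfunc(intnx(month, &d, -1));
  %end;
%mend list_months;

%list_months(mindate=201610, maxdate=201702)
```

This should write 201702, 201701, 201612, 201611 and 201610 to the log. Working with real SAS dates is what lets the loop cross the year boundary correctly, which plain arithmetic on 201701 - 1 would not do.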

SAS-multiple datasets merging

I want to merge several individual datasets through following code. However, it reports error as:
How could I solve this problem?
%macro test(sourcelib=,from=);
proc sql noprint; /*read datasets in a library*/
create table mytables as
select *
from dictionary.tables
where libname = &sourcelib
order by memname ;
select count(memname)
into:obs
from mytables;
%let obs=&obs.;
select memname
into : memname1-:memname&obs.
from mytables;
quit;
data full;
set
%do i=1 %to &obs.;
&from.&&memname&i;
%end;
;
run;
%mend;
%test(sourcelib='RESULT',from=RESULT.);
Your %DO loop is generating extra semi-colons in the middle of your SET statement.
set
%do i=1 %to &obs.;
&from.&&memname&i
%end;
;
Also why do you have two macro parameters to pass the same information? You should be able to just pass in the libref. Also why make so many macro variables when one will do?
%macro test(sourcelib=);
%local memlist ;
proc sql noprint;
select catx('.',libname,memname) into :memlist separated by ' '
from dictionary.tables
where libname = %upcase("&sourcelib")
order by 1
;
quit;
data full;
set &memlist ;
run;
%mend;
%test(sourcelib=ReSulT);

Automated Sorting in SAS

I have lots of tables which I would like to sort with Proc Sort. (The names of the tables are written in a text file.) To avoid repeating the same code all over again I have tried creating a macro that would import the text file, create an array consisting of those table names and finally sort all the tables. However, I came across a few problems. In Python, I would easily be able to loop through an array. But in SAS, I am not sure how to do it.
%MACRO SORT_TABLES();
PROC IMPORT
DATAFILE = 'TABLES_LIST.txt'
OUT = WORK.TABLES_LIST (RENAME = VAR1 = TABLE_NAME)
DBMS = TAB
REPLACE;
GETNAMES = NO;
QUIT;
/* GET THE LIST OF TABLE NAMES: */
PROC SQL NOPRINT;
SELECT
DISTINCT TABLE_NAME
INTO :TABLEVAR1 - :TABLEVAR&SYSMAXLONG
FROM
WORK.TABLES_LIST;
QUIT;
DATA _NULL_;
ARRAY TABLE_NAMES $ &TABLEVAR1 - &TABLEVAR&SYSMAXLONG;
RUN;
%DO %OVER TABLE_NAMES
PROC SORT
DATA = &TABLEVAR1 /* how can I iterate here???? */
OUT = 'WORK.'||&TABLEVAR1;
BY A B C;
QUIT;
%END;
%MEND;
Just use an iterative %DO loop to loop over your "array" of macro variables.
proc sql noprint ;
select distinct table_name
into :tablevar1 -
from tables_list
;
quit;
%do i=1 %to &sqlobs ;
proc sort data=&&tablevar&i ; by _all_ ; run;
%end;
But you don't need a macro for this. There are easier ways to generate code.
filename code temp;
data _null_;
set tables_list ;
file code ;
put 'PROC SORT DATA = ' table_name '; BY _all_; run;' ;
run;
%include code / source2 ;

How to execute a SAS macro iteratively in another macro?

I would like to get the result of the brand_channel macro.
The macro is not working for i=2,3,4 in the %do loop.
How can I execute the doing_scoring macro iteratively?
Thanks!
%doing_scoring;
...
...
...
%mend doing_scoring;
%macro brand_channel;
proc sql noprint;
create table oneb_onec as
select unique x1, x2
from mydata_all;
quit;
data seq_oneb_onec;
set oneb_onec;
seqno = _N_;
run;
%let num=4;
%do i=1 %to &num;
%put doing number is &i;
%put end doing number is &num;
proc sql noprint;
create table onebc_table&i as
select a.*
from mydata_all a, seq_oneb_onec b
where b.seqno = &i
and b.x1 = a.x1
and b.x2 = a.x2;
quit;
%doing_scoring(mydata=onebc_table&i, setnumber = &i);
%end;
%mend brand_channel;
%brand_channel;
Your code is fine, except for the initial line (the declaration of doing_scoring), but I suppose that's likely a transcription error.
Below I have a functional test version.
However, I have a better way to do the same thing. Fundamentally, macro driven iteration is a bad idea; there is a better way to do almost every task you might want to attempt.
In this case, you can call the doing_scoring calls directly from the seq_ dataset, and either move the creation of the sub-dataset to the macro (should be easy) or, perhaps better, keep the dataset in one piece.
First the better way: call execute. (Or, you can create the macro calls in SQL using select into.)
proc sort data=sashelp.class out=class;
by age sex;
run;
%macro doing_scoring(data=,age=,sex=,setnumber=);
data mydata;
set class;
where age=&age. and sex="&sex.";
run;
*whatever else you are doing;
%mend doing_scoring;
data _null_;
set class;
by age sex;
if first.sex then do;
seqno+1;
callstr=cats('%doing_scoring(data=class,age=',age,',sex=',sex,',setnumber=',seqno,')');
call execute(callstr);
end;
run;
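The SELECT INTO alternative mentioned above could look something like this. It is only a sketch: the macro here is a stub that writes to the log, the macro variable name calls is arbitrary, and setnumber is left blank because the query does not number the groups.

```sas
/* Stub standing in for the real scoring macro */
%macro doing_scoring(data=,age=,sex=,setnumber=);
  %put Scoring &data for age=&age sex=&sex;
%mend doing_scoring;

proc sql noprint;
  /* Build one macro call per distinct age/sex combination.           */
  /* Single quotes keep %doing_scoring from executing inside the SQL. */
  select distinct
         cats('%doing_scoring(data=sashelp.class,age=', age, ',sex=', sex, ')')
    into :calls separated by ' '
  from sashelp.class;
quit;

/* Resolving the macro variable runs the generated calls */
&calls
```

Keep in mind that a macro variable can hold at most about 64K characters, so this suits a modest number of calls; for very long lists, CALL EXECUTE or a generated file is the safer choice.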
Now, the original way with same test data.
%macro doing_scoring(mydata=,setnumber=);
%put doing_scoring &mydata. &setnumber.;
%mend doing_scoring;
%macro brand_channel;
proc sql noprint;
create table oneb_onec as
select distinct age,sex
from sashelp.class;
quit;
data seq_oneb_onec;
set oneb_onec;
seqno = _N_;
run;
%let num=4;
%do i=1 %to &num;
%put -------------------;
%put doing number is &i;
%put end doing number is &num;
proc sql noprint;
create table onebc_table&i as
select a.*
from sashelp.class a, seq_oneb_onec b
where b.seqno = &i
and b.age = a.age
and b.sex = a.sex;
quit;
%doing_scoring(mydata=onebc_table&i, setnumber = &i);
%put -------------------;
%end;
%mend brand_channel;
%brand_channel;

Resources