Split SAS datasets by column with primary key - arrays

So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this?
Thanks in advance.

A simple way to split a datasets in the way you request is to use a single data step with multiple output datasets where each one has a KEEP= dataset option listing the variables to keep. For example:
data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
set sashelp.class;
run;
So you need to get the list of variables and group then into sets of 250 or less. Then you can use those groupings to generate code like above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.
I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset and maximum number of variables to keep in each dataset.
So for the example above those macro variable values would be:
%let ds=sashelp.class;
%let key=name;
%let nvars=2;
So use PROC CONTENTS to get the list of variable names:
proc contents data=&ds noprint out=contents; run;
Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.
data groups;
length group 8 memname $41 varnum 8 name $32 ;
group +1;
memname=cats('split',group);
do varnum=1 to &nvars while (not eof);
set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
output;
end;
run;
Now you can use that dataset to drive the generation of the code:
data _null_;
set groups end=eof;
by group;
if _n_=1 then call execute('data ');
if first.group then call execute(cats(memname,'(keep=&key'));
call execute(' '||trim(name));
if last.group then call execute(') ');
if eof then call execute(';set &ds;run;');
run;
Here are results from the SAS log:
NOTE: CALL EXECUTE generated line.
1 + data
2 + split1(keep=name
3 + Age
4 + Height
5 + )
6 + split2(keep=name
7 + Sex
8 + Weight
9 + )
10 + ;set sashelp.class;run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.

Just another way of doing it using macro variables:
/* Number of columns you want in each chunk */
%let vars_per_part = 250;
/* Get all the column names into a dataset */
proc contents data = have out=cols noprint;
run;
%macro split(part);
/* Split the columns into 250 chunks for each part and put it into a macro variable */
%let fobs = %eval((&part - 1)* &vars_per_part + 1);
%let obs = %eval(&part * &vars_per_part);
proc sql noprint;
select name into :cols separated by " " from cols (firstobs = &fobs obs = &obs) where name ~= "uniq_id";
quit;
/* Chunk up the data only keeping those varaibles and the uniq_id */
data want_part∂
set have (keep = &cols uniq_id);
run;
%mend;
/* Run this from 1 to whatever the increment required to cover all the columnns */
%split(1);
%split(2);
%split(3);

this is not a complete solution but some help to give you another insight into how to solve this. The previous solutions have relied much on proc contents and data step, but I would solve this using proc sql and dictionary.columns. And I would create a macro that would split the original file into as many parts as needed, 250 cols each. The steps roughly:
proc sql; create table as _colstemp as select * from dictionary.columns where library='your library' and memname = 'your table' and name ne 'your primary key'; quit;
Count the number of files needed somewhere along:
proc sql;
select ceil(count(*)/249) into :num_of_datasets from _colstemp;
select count(*) into :num_of_cols from _colstemp;
quit;
Then just loop over the original dataset like:
%do &_i = 1 %to &num_of_datasets
proc sql;
select name into :vars separated by ','
from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs = %eval(min(249,&num_of_cols. - &_i. * 249)) ;
quit;
proc sql;
create table split_&_i. as
select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;
Hopefully this gives you another idea. The solution is not tested, and may contain some pseudocode elements as it's written from my memory of doing things. Also this is void of macro declaration and much of parametrization one could do.. This would make the solution more general (parametrize your number of variables for each dataset, your primary key name, and your dataset names for example.

Related

call execute inside a loop pulling from one table to execute a macro

i have bellow currently
%macro sqlloop (event_id);
...lots of code, mostly proc sql segments ...
%mend;
that generates an output table (named export_table2). I need to be able to run this code dozens of time for every value in another table (named vars). my trial code testing what I want it to do is below (basically manually typing in the first two values of this 68 row table)
data ;
%let empl_nbr_var = '222';
%let fleet = '7ER';
%let position = 'A';
%let base = 'BWI';
%sqlloop(event_id = 1);
run;
data summary_pilots;
set work.export_table2;
run;
data;
%let empl_nbr_var = '111';
%let fleet = '320';
%let position = 'B';
%let base = 'CHS';
%sqlloop(event_id = 2);
run;
data summary_pilots;
set summary_pilots work.export_table2;
run;
This produces the final output of each execution stacked into one table called summary_pilots. How can I do this in a loop, prehaps using call execute to iterate through each row of vars? The columns of vars are exactly what I need for the macro variables, and I want to iterate through every single row to assign those macro variable and run my %sqlloop again. Thanks for the help!
EDIT:
currently figuring out how call execute works and see how its helpful here but still a bit stuck... code below works exactly as youd think, printing out all the variables in the table vars into the log.
data ;
set work.vars;
call execute( '%put='|| strip(empl_nbr_var) || ';
%put = ' || strip(fleet) ||';
%put = '|| strip(position) ||';
%put = ' || strip(base) ||';' );
run;
I am trying to use the below code, but am getting a crazy amount of errors due to the macros being assigned weirdly. The types in the columns of vars match exactly what I want them to be in the macros, but it still looks like that might be the issue here?
data ;
set work.vars;
call execute( '
%let empl_nbr_var =' || strip(empl_nbr_var) || ';
%let fleet = ' || strip(fleet) ||';
%let position = '|| strip(position) ||';
%let base = ' || strip(base) ||';
%sqlloop(event_id = 17);' );
run;
and the event ID doesnt actually matter here so i just left that as a random number for now.
Assuming your work.Vars contain data like this:
empl_nbr_var
fleet
position
base
222
7ER
A
BWI
111
320
B
CHS
...
...
...
...
Consider extending your macro to receive such input parameters:
%macro sqlloop(event_id, empl_nbr_var, fleet, position, base);
...lots of code, mostly proc sql segments ...
%mend;
Then, build run macro with concatenated data values via call execute. Below passes 17 into event_id parameter.
data _null_;
set Work.Vars;
args = catx("', '", empl_nbr_var, fleet, position, base);
args = '%sqlloop(17,'''|| strip(args) || ''');';
put args $char.; /* VIEW CALL COMMAND */
call execute(args); /* RUN CALL COMMAND */
run;
It makes no sense to code %LET statements in the middle of a data step. The macro processor will evaluate them before it passes the text of the data step code to SAS to process. Avoid confusing yourself by moving the %LET statements before the data step.
If the macro needs values of macros variables, like FLEET, as input then make those things parameters to the macro. Don't create a macro that references "magic" macro variables, macro variables that are neither input parameters nor created by the macro. Instead the reference to them just appears in the middle of the macro definition as if their values will appear by magic somehow.
%macro sqlloop(empl_nbr_var,fleet,position,base);
... code that uses &fleet.
%mend;
If you have a lot of combinations of parameters you want run through your macro then collect them into a dataset first.
data inputs ;
input empl_nbr_var fleet $ position $ base $ ;
cards;
222 7ER A BWI
111 320 B CHS
;
Then you can use those dataset variables to generate the calls to the macro. You could try using call execute() to do this, but personally I find it a lot easier to use a data step to write the code to a file. Then you can examine the file and make sure the code generation logic is correct. Plus you can use the power of the PUT statement to make the code generation easier. For example if the variable names match the parameter names you can use named output.
filename code temp;
data _null_;
set inputs;
file code ;
put '%sqlloop(' empl_nbr_var= ',' fleet= ',' position= ',' base= ')';
run;
Which will generate code like:
%sqlloop(empl_nbr_var=222 ,fleet=7ER ,position=A ,base=BWI )
%sqlloop(empl_nbr_var=111 ,fleet=320 ,position=B ,base=CHS )
Once you are confident that it is generating the right code use the %INCLUDE command to run the code it generates.
%include code / source2;
If the macro does not have its own step for aggregating the results you could include that step in the code generation.
filename code temp;
data _null_;
set inputs;
file code ;
put '%sqlloop(' empl_nbr_var= ',' fleet= ',' position= ',' base= ')';
put 'proc append base=summary_pilots data=export_table force; run;' ;
run;
%include code / source2;

SAS Looping through macro variable and processing the data

I have a bunch of character variables which I need to sort out from a large dataset. The unwanted variables all have entries that are the same or are all missing (meaning I want to drop these from the dataset before processing the data further). The data sets are very large so this cannot be done manually, and I will be doing it a lot of times so I am trying to create a macro which will do just this. I have created a list macro variable with all character variables using the following code (The data for my part is different but I use the same sort of code):
data test;
input Obs ID Age;
datalines;
1 2 3
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
run;
proc contents
data = test
noprint
out = test_info(keep=name);
run;
proc sql noprint;
select name into : testvarlist separated by ' ' from test_info;
quit;
My idea is then to just use a data step to drop this list of variables from the original dataset. Now, the problem is that I need to loop over each variable, and determine if the observations for that variable are all the same or not. My idea is to create a macro that loops over all variables, and for each variable counts the occurrences of the entries. Since the length of this table is equal to the number of unique entries I know that the variable should be dropped if the table is of length 1. My attempt so far is the following code:
%macro ListScanner (org_list);
%local i next_name name_list;
%let name_list = &org_list;
%let i=1;
%do %while (%scan(&name_list, &i) ne );
%let next_name = %scan(&name_list, &i);
%put &next_name;
proc sql;
create table char_occurrences as
select &next_name, count(*) as numberofoccurrences
from &name_list group by &next_name;
select count(*) as countrec from char_occurrences;
quit;
%if countrec = 1 %then %do;
proc sql;
delete &next_name from &org_list;
quit;
%end;
%let i = %eval(&i + 1);
%end;
%mend;
%ListScanner(org_list = &testvarlist);
Though I get syntax errors, and with my real data I get other kinds of problems with not being able to read the data correctly but I am taking one step at a time. I am thinking that I might overcomplicate things so if anyone has an easier solution or can see what might be wrong to I would be very grateful.
There are many ways to do this posted around.
But let's just look at the issues you are having.
First for looping through your space delimited list of names it is easier to let the %do loop increment the index variable for you. Use the countw() function to find the upper bound.
%do i=1 %to %sysfunc(countw(&name_list,%str( )));
%let next_name = %scan(&name_list,&i,%str( ));
...
%end;
Second where is your input dataset in your SQL code? Add another parameter to your macro definition. Where to you want to write the dataset without the empty columns? So perhaps another parameter.
%macro ListScanner (dsname , out, name_list);
%local i next_name sep drop_list ;
Third you can use a single query to count all of variables at once. Just use count( distinct xxxx ) instead of group by.
proc sql noprint;
create table counts as
select
%let sep=;
%do i=1 %to %sysfunc(countw(&name_list,%str( )));
%let next_name = %scan(&name_list,&i,%str( ));
&sep. count(distinct &next_name) as &next_name
%let sep=,;
%end;
from &dsname
;
quit;
So this will get a dataset with one observation. You can use PROC TRANSPOSE to turn it into one observation per variable instead.
proc transpose data=counts out=counts_tall ;
var _all_;
run;
Now you can just query that table to find the names of the columns with 0 non-missing values.
proc sql noprint ;
select _name_ into :drop_list separated by ' '
from counts_tall
where col1=0
;
quit;
Now you can use the new DROP_LIST macro variable.
data &out ;
set &dsname ;
drop &drop_list;
run;
So now all that is left is to clean up after your self.
proc delete data=counts counts_tall ;
run;
%mend;
As far as your specific initial question, this is fairly straightforward. Assuming &testvarlist is your macro variable containing the variables you are interested in, and creating some test data in have:
%let testvarlist=x y z;
data have;
call streaminit(7);
do id = 1 to 1e6;
x = floor(rand('Uniform')*10);
y = floor(rand('Uniform')*10);
z = floor(rand('Uniform')*10);
if x=0 and y=4 and z=7 then call missing(of x y z);
output;
end;
run;
data want fordel;
set have;
if min(of &testvarlist.) = max(of &testvarlist.)
and (cmiss(of &testvarlist.)=0 or missing(min(of &testvarlist.)))
then output fordel;
else output want;
run;
This isn't particularly inefficient, but there are certainly better ways to do this, as referenced in comments.

SAS Put Certain Row as New Variable Names After Manipulation

After importing my CSV data with GETNAMES=NO, I have 59 columns with variable names VAR1, VAR2, . . . VAR59. My first row contains the names I need for the new variables, but they first needed manipulated by removing special characters and turning spaces into underscores since SAS doesn't like spaces in variable names. This is the array I used for that piece:
DATA DATA1; SET DATA (FIRSTOBS=7);
ARRAY VAR(59) VAR1-VAR59;
IF _N_ = 1 THEN DO;
DO I = 1 TO 59;
VAR[I] = COMPRESS(TRANSLATE(TRIM(VAR[I]),'_',' '),'?()');
PUT VAR[I]=;
END;
END;
DROP I;
RUN;
This worked perfectly, but now I need to get this first row up to the new variable names. I tried a similar array to perform this:
DATA DATA2; SET DATA1;
ARRAY V(59) VAR1-VAR59;
DO I = 1 TO 59;
IF _N_ = 1 AND V[I] NE "" THEN CALL SYMPUT("NEWNAME",V[I]);
RENAME VAR[I] = &NEWNAME;
END;
DROP I;
RUN;
This only puts the name of VAR59 since there is no [i] connected to the &NEWNAME, and it still isn't working quite right. Any suggestions to moving a row up to variable names AFTER manipulation?
Your primary problem is you are trying to use a macro variable in the data step it's created in. You can't. You're also trying to create rename statements in the data step; rename, as with other similar statements (keep, drop), must be defined before the data step is compiled.
You need to write code somewhere - either in a text file, a macro variable, whatever - with this information. For example:
filename renamef temp;
data _null_;
set myfile (obs=1);
file renamef;
array var[59];
do _i = 1 to dim(Var);
[your code to clean it out];
strput = cat("rename",vname(var[_i]),'=',var[_i],';');
put strput;
end;
run;
data want;
set myfile (firstobs=2);
%include renamef;
run;
There are lots of other examples to this on the site and on the web, "list processing" is the term for this.
Joe -- using your suggestions and another one of your posts, the following worked flawlessly:
Put the row of needed variables into long format (in my case, first row so n = 1)
DATA NEWVARS; SET DATA;
IF _N_ = 1 THEN OUTPUT NEWVARS;
RUN;
PROC TRANSPOSE DATA = NEWVARS OUT=NEWVARS1;
VAR _ALL_;
RUN;
Create a list of rename macro calls.
PROC SQL;
SELECT CATS('%RENAME(VAR=',_NAME_,',NEWVAR=',COL1,')')
INTO :RENAMELIST SEPARATED BY ' '
FROM NEWVARS1;
QUIT;
%MACRO RENAME(VAR=,NEWVAR=);
RENAME &VAR.=&NEWVAR.;
%MEND RENAME;
Call in the list created in Step 2 to rename all variables.
PROC DATASETS LIB=WORK NOLIST;
MODIFY DATA;
&RENAMELIST.;
QUIT;
I had to perform a few additional checks making sure that the variable names were not greater than 32 characters, and this was easy to check for when the data was in long format after transposing. If there are certain words that make the lengths too long, a TRANWRD statement can easily replace them with abbreviations.

How can I "define" SAS data sets using macro variable and write to them using an array

My source data contains 200,000+ observations, one of the many variables in the data set is "county." My goal is to write a macro that will take this one data set as an input, and split them into 58 different temporary data sets for each of the California counties.
First question is if it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.
Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?
I can get the comparison to work properly, but cannot seem to use a array reference to specify a output data set. This is most likely because I need more experience with the macro environment!
Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!
data splitraw;
length county_name $15;
infile "&path/random.csv" dsd firstobs=2;
input county_name $ number;
run;
%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
set &dxtosplit;
do i=1 to 58;
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
end;
run;
%mend _58countysplit;
%_58countysplit(splitraw,county_name);
The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit different.
First I create a sample dataset with a variable "county" this will contain ten different values:
data large;
attrib county length=$12;
do i=1 to 10000;
county=put(mod(i,10)+1,ROMAN.);
output;
end;
run;
First, I start with finding all the unique values and constructing the names of all the different tables I would like to create:
proc sql noprint;
select distinct compbl("large_"!!county) into :counties separated by " "
from large;
quit;
Now I have a macrovariable "counties" that containes all the different datasets I want to create.
Here I am writing the IF-statements to a file:
filename x temp;
data _null_;
attrib county length=$12 ds length=$18;
file x;
i=1;
do while(scan("&counties",i," ") ne "");
ds=scan("&counties",i," ");
county=scan(ds,-1,"_");
put "if county=""" county +(-1) """ then output " ds ";";
i+1;
end;
run;
Now I have what I need to create the small datasets:
data &counties;
set large;
%inc x;
run;
I agree with user667489, there is almost always a better way then splitting one large data set into many small data sets. However, if you want to proceed along these lines there is a table in sashelp called vcolumn which lists all your libraries, their tables, and each column (in each table) that should help you. Also if you want
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
to resolve you might mean:
if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:
%macro split_by(data=, splitvar=);
%local dslist iflist;
proc sql noprint;
select distinct cats("&splitvar._", &splitvar)
into :dslist separated by ' '
from &data;
select distinct
catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x)
into :iflist separated by "else "
from &data;
quit;
data &dslist;
set &data;
&iflist
run;
%mend split_by;
Here is some test data to illustrate:
options mprint;
data test;
length county $1 val $1;
input county val;
infile cards;
datalines;
A 2
B 3
A 5
C 8
C 9
D 10
run;
%split_by(data=test, splitvar=county)
And you can view the log to see how the macro generates the DATA step you want:
MPRINT(SPLIT_BY): proc sql noprint;
MPRINT(SPLIT_BY): select distinct cats("county_", county) into :dslist separated by ' ' from test;
MPRINT(SPLIT_BY): select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated
by "else " from test;
MPRINT(SPLIT_BY): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(SPLIT_BY): data county_A county_B county_C county_D;
MPRINT(SPLIT_BY): set test;
MPRINT(SPLIT_BY): if county='A' then output county_A;
MPRINT(SPLIT_BY): else if county='B' then output county_B;
MPRINT(SPLIT_BY): else if county='C' then output county_C;
MPRINT(SPLIT_BY): else if county='D' then output county_D;
MPRINT(SPLIT_BY): run;
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.05 seconds

Create and use a list in an IF statement in SAS

User PomPazz post this answer for creating a list from input variables:
"You need to use a macro to "write" the SAS code for you.
This should do what you are looking for. It takes a space delimited list of values, and loops over them doing what your code specifies. Post a comment if you have a question on it.
%macro doit(list);
proc sql noprint;
%let n=%sysfunc(countw(&list));
%do i=1 %to &n;
%let val = %scan(&list,&i);
create table somlib._&val as
select * from somlib.somtable
where item=&val;
%end;
quit;
%mend;
%doit(100 101 102);
Note, data sets cannot start with a number so I have these starting with '_'"
My questions is, how can this be applied to creating a list from a variable in a dataset that can be used in an IF statement such as "IF Telephone in(List) then Invalid=1"
This is needed to validate a list of telephone numbers from a predetermined list of invalid numbers.
The basic answer to your question is that you need to pull it into a macro variable or an include file.
proc sql;
select distinct telephone into :tellist separated by ','
from invalid_phones;
quit;
data want;
set have;
if telephone in (&tellist.) then invalid=1;
run;
This is limited to about 32k characters, so if you have more than ~3000 phone numbers it may not work. If telephone is character you need to select distinct quote(telephone) instead.
The more detailed answer is that this, usually, is inefficient. Better is to use a format.
data for_fmt;
set invalid_phones;
start=telephone;
label='INVALID';
fmtname='TELCHECKF'; *add $ if telephone is a character field;
output;
if _n_=1 then do; *this block adds a line to deal with non-matching records (hlo=o means other);
start=.;
label='VALID';
hlo='o';
output;
end;
run;
proc format cntlin=for_fmt;
quit;
data want;
set have;
if put(telephone,TELCHECKF.)='INVALID' then invalidflag=1;
run;
This can be faster than the list method, and doesn't have the length issues.
Joe's answer is great, but I just wanted to add that you can do this in one SQL step
proc sql noprint;
create table want as
select *, case
when telephone in (select distinct telephone from invalid_phones) then 1
else 0
end as invalid
from have
;
quit;

Resources