Can I use array based processing to add additional column(s)? SAS - arrays

I have a dataset (a) that looks like this:
Name Value
Cost_1 28
Cost_2 22
Unit_1 Fixed
Unit_2 C
Is it possible to use an array to have a dataset that looks like this:
Name Cat_1 Cat_2
Cost 28 22
Unit Fixed C
%let Cat_Count = 2;
data b;
set a;
array category [&Cat_Count] cat_1-cat_&Cat_count;
.
.
.
run;
Not sure how to execute this...the macro variable cat_count will be dynamic.

You can use arrays, but a transpose is more efficient.
First split Name into its base name and its numeric suffix in a new column, then use PROC TRANSPOSE.
data have;
input Name $ Value $;
cards;
Cost_1 28
Cost_2 22
Unit_1 Fixed
Unit_2 C
;;;;
run;
data have_cat;
set have;
cat = input(scan(name, 2, "_"), 8.); *numeric conversion is needed only for the array approach, not for PROC TRANSPOSE;
name = scan(name, 1, "_");
run;
proc sort data=have_cat;
by name cat value;
run;
proc transpose data=have_cat out=want prefix=cat_;
by name;
id cat;
var value;
run;
Array method: requires everything before the PROC TRANSPOSE step, plus the Cat_Count macro variable.
%let Cat_Count = 2;
data want_array;
set have_cat;
by name;
array category(&cat_count) $ cat_1-cat_&cat_count;
retain cat_1-cat_&cat_count;
if first.name then
call missing(of category (*));
category(cat) = value;
if last.name then output;
run;
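Since Cat_Count is meant to be dynamic, one option is to derive it from the data instead of hard-coding it. This is a minimal sketch, assuming the have_cat data set built above and that cat holds the numeric suffix; it would run before the array data step.

```sas
/* Sketch: set Cat_Count to the largest category suffix found in have_cat. */
/* Assumes have_cat exists and cat is numeric, as created above.           */
proc sql noprint;
  select max(cat) into :Cat_Count trimmed
  from have_cat;
quit;

/* Write the derived value to the log for a quick check */
%put NOTE: Cat_Count=&Cat_Count;
```

The TRIMMED keyword strips the leading blanks that INTO would otherwise store, so the value can be used directly in a name such as cat_&Cat_Count.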

Related

SAS proc freq of multiple tables into a single one

I have the following dataset in SAS;
City grade1 grade2 grade3
NY A A A
CA B A C
CO A B B
I would like to "combine" the three grade variables and get a PROC FREQ output that gives the count of each grade for each City; the expected output should therefore be:
City A B C
NY 3 0 0
CA 1 1 1
CO 1 2 0
How could I do that in SAS?
Quite a few steps but it gives the expected result.
*-- Creating sample data --*;
data have;
infile datalines delimiter="|";
input City $ grade1 $ grade2 $ grade3 $;
datalines;
NY|A|A|A
CA|B|A|C
CO|A|B|B
;
*-- Sorting in order to use the transpose procedure --*;
proc sort data=have; by city; run;
*-- Transposing from wide to tall format --*;
proc transpose data=have out=stage1(rename=(col1=grade) drop= _name_);
by city;
var grade:;
run;
*-- Assigning a value to 1 for each record for later sum --*;
data stage2;
set stage1;
val = 1;
run;
*-- Tabulate to create val_sum --*;
ods exclude all; *turn off default tabulate print;
proc tabulate data=stage2 out=stage3;
class city grade;
var val;
table city,grade*sum=''*val='';
run;
ods select all; *turn on;
*-- Transpose back using val_sum --*;
proc transpose data=stage3 out=stage4(drop=_name_);
by city;
id grade;
var val_sum;
run;
*-- Replace missing values by 0 to achieve desired output --*;
proc stdize data=stage4 out=want reponly missing=0;run;
City A B C
CA 1 1 1
CO 1 2 0
NY 3 0 0
In general:
Transpose data to a long format
Use PROC FREQ with the SPARSE option to generate the counts
Save the output from PROC FREQ to a data set
Transpose the output from PROC FREQ to the desired output format
*create sample data;
data have;
input City $ grade1 $ grade2 $ grade3 $;
cards;
NY A A A
CA B A C
CO A B B
;;;;
*sort;
proc sort data=have; by City;run;
*transpose to long format;
proc transpose data=have out=want1 prefix=Grade;
by City;
var grade1-grade3;
run;
*displayed output and counts;
proc freq data=want1;
table City*Grade1 / sparse out=freq norow nopercent nocol;
run;
*output table in desired format;
proc transpose data=freq out=want2;
by city;
id Grade1;
var count;
run;
Here is a way to do it in two steps: a sort step and a data step.
proc sort data=have; by city; run;
data count (drop=grade1-grade3 i);
set have;
* create an array of all your grades ($ marks the elements as character);
array grade(3) $ grade1-grade3;
by city;
*set the count to zero for each city;
if first.city then do;
A = 0;
B = 0;
C = 0;
end;
* use a do loop to count the grades;
do i = 1 to 3;
if grade(i) = 'A' then A + 1;
else if grade(i) = 'B' then B + 1;
else if grade(i) = 'C' then C + 1;
end;
* write one row per city once all of its rows have been counted;
if last.city then output;
run;

Lookup table using hash on multiple (>50) columns

I am working with a table with more than 50 columns. I am trying to replace the value of multiple columns using a lookup table.
Table:
data have;
infile datalines dsd truncover delimiter=",";
input ID :$1. SUB_ID :$2. COUNTRY :$2. A :$1. B :$1.;
datalines;
1,A,FR,A,B
2,B,CH,,B
3,C,DE,B,A
4,D,CZ,,B
5,E,GE,A,
6,F,EN,B,
7,G,US,,A
;
run;
Lookup table:
data lookup;
infile datalines delimiter=",";
input value_before $1. value_after :$2.;
datalines;
A,1
B,2
C,3
;
run;
Actual code:
data want;
if 0 then set lookup;
if _n_ = 1 then do;
declare hash lookup(dataset:'lookup');
lookup.defineKey('value_before');
lookup.defineData('value_after');
lookup.defineDone();
end;
set have;
if (lookup.find(key:A) = 0) then
A = value_after;
if (lookup.find(key:B) = 0) then
B = value_after;
/* ... */
/* if (lookup.find(key:Z) = 0) then
Z = value_after; */
drop value_before value_after;
run;
I guess this code would do the job if I hardcoded all 50 columns.
I wonder if there is a way to "apply" the hash.find() to all variables except the first three (ID, SUB_ID and Country) (maybe by indexing ?) without having to hardcode them or to use macros. For the sake of example I only computed 2 variables to replace the value (A and B) but there are more than 50 (with really different names and no pattern like var1,var2,...,varn).
In cases like this, I like to use proc sql and the dictionary table to fill in the column names for me to create an array. The below code will pull the variable names from dictionary.columns and save them as space-delimited into the macro variable varnames. We can feed this into an array and then use array logic to do the rest.
proc sql noprint;
select name
into :varnames separated by ' '
from dictionary.columns
where libname = 'WORK'
AND memname = 'HAVE'
AND name NOT IN('ID', 'SUB_ID', 'COUNTRY')
;
quit;
data want;
if 0 then set lookup;
if _n_ = 1 then do;
declare hash lookup(dataset:'lookup');
lookup.defineKey('value_before');
lookup.defineData('value_after');
lookup.defineDone();
end;
set have;
array vars[*] &varnames.;
do i = 1 to dim(vars);
if lookup.Find(key:vars[i])=0 then vars[i] = value_after;
end;
drop value_before value_after i;
run;
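If the columns to recode sit next to each other in the data set, a name range list can fill the array without the dictionary query or a macro variable. This is only a sketch: it assumes A is the first and B the last column to recode in have, and that every column between them is character (a numeric column in the range would make the array definition fail).

```sas
data want;
  /* Put value_before and value_after into the PDV without reading any rows */
  if 0 then set lookup;
  if _n_ = 1 then do;
    declare hash lookup(dataset:'lookup');
    lookup.defineKey('value_before');
    lookup.defineData('value_after');
    lookup.defineDone();
  end;
  set have;
  /* A -- B selects every variable positioned from A through B in have */
  array vars[*] A -- B;
  do i = 1 to dim(vars);
    /* find() returns 0 on a match and loads value_after */
    if lookup.find(key:vars[i]) = 0 then vars[i] = value_after;
  end;
  drop value_before value_after i;
run;
```

With the real 50-column table, A and B would be replaced by the first and last column names after COUNTRY.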

Split SAS datasets by column with primary key

So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this?
Thanks in advance.
A simple way to split a dataset in the way you request is to use a single data step with multiple output datasets, where each one has a KEEP= dataset option listing the variables to keep. For example:
data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
set sashelp.class;
run;
So you need to get the list of variables and group them into sets of 250 or fewer. Then you can use those groupings to generate code like the above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.
I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset, and the maximum number of variables to keep in each dataset.
So for the example above those macro variable values would be:
%let ds=sashelp.class;
%let key=name;
%let nvars=2;
So use PROC CONTENTS to get the list of variable names:
proc contents data=&ds noprint out=contents; run;
Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.
data groups;
length group 8 memname $41 varnum 8 name $32 ;
group +1;
memname=cats('split',group);
do varnum=1 to &nvars while (not eof);
set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
output;
end;
run;
Now you can use that dataset to drive the generation of the code:
data _null_;
set groups end=eof;
by group;
if _n_=1 then call execute('data ');
if first.group then call execute(cats(memname,'(keep=&key'));
call execute(' '||trim(name));
if last.group then call execute(') ');
if eof then call execute(';set &ds;run;');
run;
Here are results from the SAS log:
NOTE: CALL EXECUTE generated line.
1 + data
2 + split1(keep=name
3 + Age
4 + Height
5 + )
6 + split2(keep=name
7 + Sex
8 + Weight
9 + )
10 + ;set sashelp.class;run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.
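Because every split dataset carries the key, the pieces can be put back together with a match-merge. A minimal sketch for the sashelp.class example above, assuming the split1 and split2 datasets it generated; rejoined is a hypothetical output name.

```sas
/* Sort each piece by the key so the BY-merge is valid */
proc sort data=split1; by name; run;
proc sort data=split2; by name; run;

/* Match-merge the pieces back into one wide dataset */
data rejoined;
  merge split1 split2;
  by name;
run;
```

This relies on the key being unique in each piece, which is what makes it a primary key in the first place.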
Just another way of doing it using macro variables:
/* Number of columns you want in each chunk */
%let vars_per_part = 250;
/* Get all the column names into a dataset */
proc contents data = have out=cols noprint;
run;
%macro split(part);
/* Select this part's chunk of up to vars_per_part column names and put it into a macro variable */
%let fobs = %eval((&part - 1)* &vars_per_part + 1);
%let obs = %eval(&part * &vars_per_part);
proc sql noprint;
select name into :cols separated by " " from cols (firstobs = &fobs obs = &obs) where name ~= "uniq_id";
quit;
/* Chunk up the data only keeping those variables and the uniq_id */
data want_part&part;
set have (keep = &cols uniq_id);
run;
%mend;
/* Run this from 1 to whatever the increment required to cover all the columns */
%split(1);
%split(2);
%split(3);
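The number of %split calls can also be computed rather than typed out. This is a sketch under the assumption that the cols dataset, the vars_per_part macro variable and the %split macro above are already defined; split_all is a hypothetical name.

```sas
%macro split_all;
  %local nvars nparts p;
  /* Count the rows in cols, i.e. the number of columns in have */
  proc sql noprint;
    select count(*) into :nvars trimmed from cols;
  quit;
  /* Round up so that a final partial chunk still gets its own part */
  %let nparts = %sysevalf(&nvars / &vars_per_part, ceil);
  %do p = 1 %to &nparts;
    %split(&p)
  %end;
%mend split_all;

%split_all
```

The count includes the key column's own row because %split slices cols by observation number before filtering the key out.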
This is not a complete solution, but it may give you another angle on the problem. The previous solutions rely on PROC CONTENTS and the data step; I would instead use PROC SQL and dictionary.columns, and create a macro that splits the original file into as many parts as needed, 250 columns each. The steps, roughly:
proc sql; create table _colstemp as select * from dictionary.columns where libname='YOUR_LIBRARY' and memname = 'YOUR_TABLE' and name ne 'your primary key'; quit;
Count the number of files needed somewhere along:
proc sql;
select ceil(count(*)/249) into :num_of_datasets from _colstemp;
select count(*) into :num_of_cols from _colstemp;
quit;
Then just loop over the original dataset like:
%do _i = 1 %to &num_of_datasets;
proc sql;
select name into :vars separated by ','
from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs=%eval(&_i. * 249));
quit;
proc sql;
create table split_&_i. as
select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;
Hopefully this gives you another idea. The solution is not tested and may contain some pseudocode elements, as it is written from memory. It also omits the macro declaration and most of the parametrization one could do (for example, the number of variables per dataset, the primary key name, and the dataset names), which would make the solution more general.

Transpose a correlation matrix into one long vector in SAS

I'm trying to turn a correlation matrix into one long column vector such that I have the following structure
data want;
input _name1_$ _name2_$ _corr_;
datalines;
var1 var2 0.54
;
run;
I have the following code, which outputs name1 and corr; however, I'm struggling to get name2!
DATA TEMP_1
(DROP=I J);
ARRAY VAR[*] VAR1-VAR10;
DO I = 1 TO 10;
DO J = 1 TO 10;
VAR(J) = RANUNI(0);
END;
OUTPUT;
END;
RUN;
PROC CORR
DATA=TEMP_1
OUT=TEMP_CORR
(WHERE=(_NAME_ NE " ")
DROP=_TYPE_)
;
RUN;
PROC SORT DATA=TEMP_CORR; BY _NAME_; RUN;
PROC TRANSPOSE
DATA=TEMP_CORR
OUT=TEMP_CORR_T
;
BY _NAME_;
RUN;
Help is appreciated
You're close. You're running into a weird issue with the name variable because that becomes a variable out of PROC TRANSPOSE as well. If you rename it, you get what you want. I also list the variables explicitly and add some RENAME data set options to get what you likely want.
PROC TRANSPOSE
DATA=TEMP_CORR (rename=(_name_ = Name1))
OUT=TEMP_CORR_T (rename = (_name_ = Name2 col1=corr))
;
by name1;
var var1-var10;
RUN;
Edit: If you don’t want duplicates you can add a WHERE to the OUT dataset.
PROC TRANSPOSE
DATA=TEMP_CORR (rename=(_name_ = Name1))
OUT=TEMP_CORR_T (rename = (_name_ = Name2 col1=corr) where = (name1 > name2))
;
by name1;
var var1-var10;
RUN;
Just use an ARRAY with the VNAME() function. To output only the upper triangle, set the lower bound of the DO loop to _N_.
data want ;
set corr;
where _type_ = 'CORR';
* define the array before _corr_ exists so _numeric_ covers only the correlation columns;
array x _numeric_;
length _name1_ _name2_ $32 _corr_ 8 ;
keep _name1_ _name2_ _corr_;
_name1_=_name_;
do i=_n_ to dim(x);
_name2_ = vname(x(i));
_corr_ = x(i);
output;
end;
run;

SAS Array character dates and rename variables after input function

Trying to determine a sensible way to clean dates (character), then put those dates in a proper date format via input function, but maintain sensible variable names (and possibly even preserve the original variable names) once the char-to-number process is executed.
The dates are being cleaned with an array (replacing '..' with '01', or '....' with 0101) since there are about 75 variables that have dates as strings.
Ex. -
data sample;
input d1 $ d2 $ d3 $ d4 $ d5 $;
cards;
200103.. 20070905 20060222 2007.... 199801..
;
run;
data clean;
set sample;
array dt_cln(5) d1-d5;
array fl_dt (5) f1-f5;
*clean out '..'/'....', replace with '01'/'0101';
do i=1 to 5;
if substr(dt_cln(i),5,4) = '....' then do;
dt_cln(i) = substr(dt_cln(i),1,4) || '0101';
end;
else if substr(dt_cln(i),7,2) = '..' then do;
dt_cln(i) = substr(dt_cln(i),1,6) || '01';
end;
end;
*change to number;
do i=1 to 5;
fl_dt(i)=input(dt_cln(i),yymmdd8.);
end;
format f: date9.;
drop i d:;
run;
What would be the best way to approach this?
You cannot preserve the original names and convert from character to numeric directly - however, with a bit of macro code you could drop all the old character variables and rename the numeric versions you've created. E.g.
%macro rename_loop();
%local i;
%do i = 1 %to 5;
f&i = d&i
%end;
%mend;
Then in your data step add a rename statement at the end, after your drop statement:
rename %rename_loop;
Otherwise, your existing approach is already pretty good. You could perhaps simplify the cleaning process a bit, e.g. remove your first do-loop and do the following within the second one:
fl_dt(i)=input(tranwrd(dt_cln(i),'..','01'),yymmdd8.);
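Putting the two suggestions together, the whole step might look like the sketch below. It assumes the sample data set from the question and swaps the macro loop for a numbered range list in the RENAME statement, which is a substitution on my part rather than the approach described above.

```sas
data clean;
  set sample;
  array dt_cln(5) d1-d5;   /* existing character dates */
  array fl_dt(5) f1-f5;    /* new numeric SAS dates */
  do i = 1 to 5;
    /* each '..' becomes '01', so '....' becomes '0101' */
    fl_dt(i) = input(tranwrd(dt_cln(i), '..', '01'), yymmdd8.);
  end;
  format f1-f5 date9.;
  drop i d1-d5;
  /* the numeric versions take over the original names on output */
  rename f1-f5 = d1-d5;
run;
```

The FORMAT and DROP statements refer to the names as they exist inside the step; the rename only takes effect when the output data set is written.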
data want;
set sample;
array var1 newd1-newd5;
array var2 d:;
do over var2;
var1=input(ifc(index(var2,'.')^=0,put(prxchange('s/((\.){1,})/0101/',-1,var2),8.),var2),yymmdd8.);
end;
format newd1-newd5 yymmddn8.;
drop d:;
run;
