SAS: Merging two tables with identical columns while dropping null values

I'm not sure if the title does this question justice, but here it goes:
I have three datasets Forecasts1, Forecasts2 and Forecasts3. They are all time series data composed of a date variable and variables r1 through r241.
For a given r variable (let's just use r1-r3, and only Forecasts1 and Forecasts2 for now), each dataset has only one row where the value isn't null, and it is a different row in each dataset.
Forecast 1 looks like this:
Forecast 2 looks like this:
I need to be able to combine them such that r1-r3 contain all the non-null values, without creating duplicate date rows to hold the null values.
So ideally the finished product would look like this:
I've tried various types of merges and sets, but I keep getting duplicate date rows. How would I go about doing this properly for all 241 (or more) variables? (specifically in SAS or Proc SQL?)
LINKS TO GOOGLE DOCS CONTAINING DATA:
Forecasts1: https://docs.google.com/spreadsheets/d/1iUEwPltU6V6ijgnkALFiIdrwrolDFt8xaITZaFC4WN8/edit?usp=sharing
Forecasts2:
https://docs.google.com/spreadsheets/d/1lQGKYJlz6AAR-DWtoWnl8SwzCNAmSpj7yxRqRgnybr8/edit?usp=sharing

Did you try the UPDATE statement?
data forecast1 ;
input date r1-r3 ;
cards;
1 1 . .
2 . 2 .
3 . . 3
4 . . .
;
data forecast2 ;
input date r1-r3 ;
cards;
2 2 . .
3 . 3 .
4 . . 4
5 . . .
;
data want ;
update forecast1 forecast2 ;
by date ;
run;
proc print; run;
Results
date r1 r2 r3
1 1 . .
2 2 2 .
3 . 3 3
4 . . 4
5 . . .
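
UPDATE takes exactly one master and one transaction dataset, so a third dataset does not fit in the statement directly. One possible sketch, assuming a third dataset named forecast3 with the same layout (not shown above) and that every input is already sorted by date, is to interleave everything first and then apply the stacked rows as transactions against an empty copy:

```sas
/* Interleave all forecast datasets by date.
   Assumption: each input is already sorted by date;
   forecast3 is a hypothetical third dataset with the same variables. */
data stacked;
  set forecast1 forecast2 forecast3;
  by date;
run;

/* Empty copy as master, stacked rows as transactions.
   UPDATE ignores missing transaction values, so each date collapses
   to one row holding the last non-missing value of every variable. */
data want;
  update stacked(obs=0) stacked;
  by date;
run;
```

Because no variable list is written out, the same two steps should cover r1 through r241 unchanged.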

I tend to approach these types of problems using proc sql. Assuming one row per date in the data sets, you can use full outer join:
proc sql;
select coalesce(f1.date, f2.date) as date,
coalesce(f1.r1, f2.r1) as r1,
coalesce(f1.r2, f2.r2) as r2,
coalesce(f1.r3, f2.r3) as r3
from forecast1 f1 full outer join
forecast2 f2
on f1.date = f2.date;
quit;

Consider a union query with aggregation. The only drawback is writing out the aggregates for all 241 columns in outer query.
proc sql;
SELECT sub.date, Max(sub.r1) AS R1, Max(sub.r2) AS R2, Max(sub.r3) AS R3, ...
FROM
(SELECT *
FROM Forecasts1 f1
UNION ALL
SELECT *
FROM Forecasts2 f2) As sub
GROUP BY sub.date;
quit;
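
To avoid typing the aggregates by hand, the list can probably be generated from the column metadata. This is only a sketch: it assumes the datasets live in the WORK library, that both have their columns in the same order (UNION ALL matches by position), and the macro variable name agg_list is made up for illustration.

```sas
/* Build "max(r1) as r1, max(r2) as r2, ..." from DICTIONARY.COLUMNS.
   Assumptions: library WORK, dataset FORECASTS1 (names are stored in uppercase). */
proc sql noprint;
  select catx(' ', cats('max(', name, ')'), 'as', name)
    into :agg_list separated by ', '
    from dictionary.columns
    where libname = 'WORK'
      and memname = 'FORECASTS1'
      and upcase(name) ne 'DATE';
quit;

/* Reuse the generated list in the aggregation query. */
proc sql;
  create table want as
  select sub.date, &agg_list
  from (select * from Forecasts1
        union all
        select * from Forecasts2) as sub
  group by sub.date;
quit;
```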

A different solution would be to append all and delete rows where all are missing.
data want;
set forecast1-forecast3 indsname=fc;
model = fc; *store name of forecast model;
if nmiss(of r1-r3) = 3 then delete;
run;

Related

Split SAS datasets by column with primary key

So I have a dataset with one primary key: unique_id and 1200 variables. This dataset is generated from a macro so the number of columns will not be fixed. I need to split this dataset into 4 or more datasets of 250 variables each, and each of these smaller datasets should contain the primary key so that I can merge them back later. Can somebody help me with either a sas function or a macro to solve this?
Thanks in advance.
A simple way to split a dataset in the way you request is to use a single data step with multiple output datasets where each one has a KEEP= dataset option listing the variables to keep. For example:
data split1(keep=Name Age Height) split2(keep=Name Sex Weight);
set sashelp.class;
run;
So you need to get the list of variables and group them into sets of 250 or fewer. Then you can use those groupings to generate code like the above. Here is one method using PROC CONTENTS to get the list of variables and CALL EXECUTE() to generate the code.
I will use macro variables to hold the name of the input dataset, the key variable that needs to be kept on each dataset and maximum number of variables to keep in each dataset.
So for the example above those macro variable values would be:
%let ds=sashelp.class;
%let key=name;
%let nvars=2;
So use PROC CONTENTS to get the list of variable names:
proc contents data=&ds noprint out=contents; run;
Now run a data step to split them into groups and generate a member name to use for the new split dataset. Make sure not to include the KEY variable in the list of variables when counting.
data groups;
length group 8 memname $41 varnum 8 name $32 ;
group +1;
memname=cats('split',group);
do varnum=1 to &nvars while (not eof);
set contents(keep=name where=(upcase(name) ne %upcase("&key"))) end=eof;
output;
end;
run;
Now you can use that dataset to drive the generation of the code:
data _null_;
set groups end=eof;
by group;
if _n_=1 then call execute('data ');
if first.group then call execute(cats(memname,'(keep=&key'));
call execute(' '||trim(name));
if last.group then call execute(') ');
if eof then call execute(';set &ds;run;');
run;
Here are results from the SAS log:
NOTE: CALL EXECUTE generated line.
1 + data
2 + split1(keep=name
3 + Age
4 + Height
5 + )
6 + split2(keep=name
7 + Sex
8 + Weight
9 + )
10 + ;set sashelp.class;run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.SPLIT1 has 19 observations and 3 variables.
NOTE: The data set WORK.SPLIT2 has 19 observations and 3 variables.
Just another way of doing it using macro variables:
/* Number of columns you want in each chunk */
%let vars_per_part = 250;
/* Get all the column names into a dataset */
proc contents data = have out=cols noprint;
run;
%macro split(part);
/* Split the columns into 250 chunks for each part and put it into a macro variable */
%let fobs = %eval((&part - 1)* &vars_per_part + 1);
%let obs = %eval(&part * &vars_per_part);
proc sql noprint;
select name into :cols separated by " " from cols (firstobs = &fobs obs = &obs) where name ~= "uniq_id";
quit;
/* Chunk up the data only keeping those variables and the uniq_id */
data want_part&part;
set have (keep = &cols uniq_id);
run;
%mend;
/* Run this from 1 to whatever the increment required to cover all the columns */
%split(1);
%split(2);
%split(3);
This is not a complete solution, but it may give you another insight into how to solve this. The previous solutions rely heavily on proc contents and the data step, but I would solve this using proc sql and dictionary.columns. I would create a macro that splits the original file into as many parts as needed, 250 columns each. The steps, roughly:
/* libname and memname are stored in uppercase in dictionary.columns */
proc sql; create table _colstemp as select * from dictionary.columns where libname='YOUR LIBRARY' and memname = 'YOUR TABLE' and upcase(name) ne 'YOUR PRIMARY KEY'; quit;
Count the number of files needed somewhere along:
proc sql;
select ceil(count(*)/249) into :num_of_datasets from _colstemp;
select count(*) into :num_of_cols from _colstemp;
quit;
Then just loop over the original dataset like:
/* This loop must sit inside a macro definition (%macro ... %mend) */
%do _i = 1 %to &num_of_datasets;
proc sql noprint;
select name into :vars separated by ','
from _colstemp(firstobs=%eval((&_i. - 1)*249 + 1) obs=%sysfunc(min(%eval(&_i. * 249), &num_of_cols.)));
quit;
proc sql;
create table split_&_i. as
select YOUR_PRIMARY_KEY, &vars from YOUR_ORIGINAL_TABLE;
quit;
%end;
Hopefully this gives you another idea. The solution is not tested and may contain some pseudocode elements, as it's written from my memory of doing things. It also lacks the macro declaration and much of the parametrization one could add. Parametrizing, for example, the number of variables per dataset, the primary key name, and the dataset names would make the solution more general.

Filling missing values for many variables from previous observation by group in SAS

My dataset looks like this:
Date ID Var1 Var2 ... Var5
200701 1 x .
200702 1 . a
200703 1 . .
200701 2 . b
200702 2 y b
200703 2 y .
200702 3 z .
200703 3 . .
I want my results to look like this:
Date ID Var1 Var2 ... Var5
200701 1 x .
200702 1 x a
200703 1 x a
200701 2 . b
200702 2 y b
200703 2 y b
200702 3 z .
200703 3 z .
I tried the following code below, but it didn't work. What's wrong with it?
Am I better off using array? If so, how?
%macro a(variable);
length _&variable $10.;
retain _&variable;
if first.ID then _&variable = '';
if &variable ne '' then _&variable=&variable;
else if &variable = '' then &variable=_&variable;
drop _&variable;
%mend;
data want;
set have;
%a(Var1)
%a(Var2)
%a(Var3)
%a(Var4)
%a(Var5)
run;
Appreciate the help! Thanks!
The UPDATE statement can do that. It is intended to process transactions against a master dataset so when the transaction value is missing the current value from the master table is left unchanged. You can use your single dataset as both the master and the transaction data by adding OBS=0 dataset option. Normally it will expect to output only one observation per BY group, but if you add an OUTPUT statement you can have it output all of the observations.
data want;
update have(obs=0) have ;
by id;
output;
run;
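
On the question of whether an array would work: it can, roughly along the lines below. This is a sketch rather than a tested drop-in; it assumes Var1-Var5 are character variables of length 10 or less (as the macro in the question suggests) and that have is sorted by ID, with dates in order within each ID. The array names cur and prev are made up for illustration.

```sas
data want;
  set have;
  by id;
  array cur{*} Var1-Var5;            /* the variables to fill */
  array prev{5} $10 _temporary_;     /* last non-missing value per variable; temporary arrays are retained */
  do i = 1 to dim(cur);
    if first.id then prev{i} = ' ';                  /* reset at the start of each ID */
    if not missing(cur{i}) then prev{i} = cur{i};    /* remember the latest value */
    else cur{i} = prev{i};                           /* carry the last value forward */
  end;
  drop i;
run;
```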
The full code works! Thanks
%macro a(variable);
length _&variable $10.;
retain _&variable;
if first.ID then _&variable = '';
if &variable ne '' then _&variable=&variable;
else if &variable = '' then &variable=_&variable;
drop _&variable;
%mend;
data want;
update have(obs=0) have;
by id;
output;
%a(Var1)
%a(Var2)
%a(Var3)
%a(Var4)
%a(Var5)
run;

Merging time series with different number of observations where variables have the same name (SAS)

I have a bunch of time series data (SAS files) which I would like to merge / combine into a larger table (I am fairly new to SAS).
Filenames:
cq_ts_SYMBOL, where SYMBOL is the respective symbol for each file
with the following structure:
cq_ts_AAA.sas7bdat: file1
SYMBOL DATE TIME BID ASK MID
AAA 20100101 9:30:00 10.375 10.4 .
AAA 20100101 9:31:00 10.38 10.4 .
.
.
AAA 20150101 15:59:00 15 15.1 .
cq_ts_BBB.sas7bdat: file2
SYMBOL DATE TIME BID ASK MID
BBB 20120101 9:30:00 12.375 12.4 .
BBB 20120102 9:31:00 12.38 12.4 .
.
.
BBB 20170101 15:59:00 20 20.1 .
Key characteristics:
- They have the same variable name
- They have different number of observations
- They are all saved in the same folder
So what I want to do is:
- Create 3 tables: BID-table, ASK-table, Mid-table with the following structure, ie for bid-table, cq_ts_bid.sas7bdat:
DATE TIME AAA BBB ...
20100101 9:30:00 10.375 .
20100102 9:31:00 10.38 .
.
.
20120101 9:30:00 9.375 12.375
20120102 9:31:00 9.38 12.38
.
.
20150101 15:59:00 15 17
.
.
20170101 15:59:00 . 20
It is not all too difficult to do it for 2 stock time series; however, I was wondering whether it is possible to do the following:
From data set cq_ts_AAA take DATE TIME BID and rename BID to AAA (either from the values in symbol? does this make sense? or get the name from the filename).
Do the same for cq_ts_BBB.
In fact, loop through the folder to get the number of files and filenames (this part I got more or less, see below).
Merge cq_ts_AAA and cq_ts_BBB so that the result has DATE TIME AAA (former bid price of AAA) BBB (former bid price of BBB), for all the files in the folder.
Do this for BID, then for ASK and finally MID (actually I couldn't get the midpoint variable from bid and ask (i.e. mid= (bid + ask) / 2;) just gives me the "." in the previous data steps when creating the files).
I think a macro to first get each single file then rename (when should this step take place?) it and merge them together - like a double loop.
Here the renaming and merging part:
data ALDW_short (rename=(iprice = ALDW));
set output.cq_ts_aldw;
retain date time ALJ;
run;
data ALJ_short (rename= (iprice = ALJ));
set output.cq_ts_alj;
retain date time datetime ALJ;
run;
data ALDW_ALJ_merged (keep= date itime ALDW ALJ);
merge ALDW_short ALJ_short;
by datetime;
run;
This is the part to loop through the folder and get a list of names:
proc contents data = output._all_ out = outputcont(keep = memname) noprint;
run;
proc sort data = outputcont nodupkey;
by memname;
run;
data _null_;
set outputcont end = last;
by memname;
i+1;
call symputx('name'||trim(left(put(i,8.))),memname);
if last then call symputx('count',i);
run;
Would it make sense to extract the symbol (and how? they have different length) from the filename or just to take them from the variable SYMBOL (and how can I get the one value to rename my column?)?
Somehow I have difficulty changing the order of columns, ie. I tried with retain and format.
Looks like you could do this easily with PROC TRANSPOSE. Combine your datasets into a single dataset.
data all ;
set output.cq_ts_: ;
by date time;
run;
Then use PROC TRANSPOSE for each of your source variables/target tables.
proc transpose data=all out=bid ;
by date time ;
id symbol;
var bid;
run;
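
Since the same step is needed for BID, ASK and MID, it could be wrapped in a small macro. The sketch below assumes the combined dataset is called all, as above, and that each symbol appears at most once per date/time (otherwise PROC TRANSPOSE complains about duplicate ID values). The macro name make_tables and the output names cq_ts_bid, cq_ts_ask and cq_ts_mid are illustrative choices.

```sas
%macro make_tables(vars=bid ask mid);
  %local i v;
  %do i = 1 %to %sysfunc(countw(&vars));
    %let v = %scan(&vars, &i);
    /* One wide table per source variable: one column per SYMBOL */
    proc transpose data=all out=cq_ts_&v (drop=_name_);
      by date time;
      id symbol;
      var &v;
    run;
  %end;
%mend make_tables;

%make_tables()
```

Taking the column names from the SYMBOL values through the ID statement means nothing has to be parsed out of the file names.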
Given your example data, a formula for MID of
mid = (bid + ask)/2 ;
should work. Most likely, if you got all missing values, you put the assignment statement before the SET or INPUT statement. In other words, you were trying to calculate using values that had not been read in yet.
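
As an illustration of the ordering, a minimal sketch (assuming the same output library and cq_ts_ prefix as in the question, and that every file is sorted by date and time) places the assignment after the SET statement:

```sas
data all;
  set output.cq_ts_: ;        /* read BID and ASK first */
  by date time;
  mid = (bid + ask) / 2;      /* then compute from the values just read */
run;
```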

Using SAS to check if columns have specified characteristics

I have a dataset that looks like the one below. Each row is a different observation that has anywhere from 1 to x values (in this case x=3). I want to create a dataset that contains the original info, plus four additional columns (one for each of the four values of Bin present in the dataset). The value of the column freq_Bin_1 will be 1 if there are any 1's present in that row, else missing. The column freq_Bin_2 will be 1 if there are any 2's present, etc.
Both the number of Bins and the number of columns in the original dataset may vary.
data have;
input Bin_1 Bin_2 Bin_3;
cards;
1 . .
3 . .
1 1 .
3 2 1
3 4 .
;
run;
Here is my desired output:
data want_this;
input Bin_1 Bin_2 Bin_3 freq_Bin_1 freq_Bin_2 freq_Bin_3 freq_Bin_4;
cards;
1 . . 1 . . .
3 . . . . 1 .
1 1 . 1 . . .
3 2 1 1 1 1 .
3 4 . . . 1 1
;
run;
I have an array solution that I think is pretty close, but I can't quite get it. I am also open to other methods.
data want;
set have;
array Bins {&max_freq.} Bin:;
array freq_Bin {&num_bin.} freq_Bin_1-freq_Bin_&num_bin.;
do j=1 to dim(Bins);
freq_Bin(j)=.;
end;
do k=1 to dim(freq_Bin);
if Bins(k)=1 then freq_Bin(1)=1;
else if Bins(k)=2 then freq_Bin(2)=1;
else if Bins(k)=3 then freq_Bin(3)=1;
else if Bins(k)=4 then freq_Bin(4)=1;
end;
drop j k;
run;
This should work:
data want;
set have;
array Bins{*} Bin:;
array freq_Bin{4};
do k=1 to dim(Bins);
if Bins(k) ne . then freq_Bin(Bins(k))=1;
end;
drop k;
run;
I tweaked your code somewhat, but really the only problem was that you need to check that Bins(k) isn't missing before trying to use it to index an array. Also, there's no need to initialize the values to missing as that's the default.
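
Since the number of bins may vary, the hard-coded 4 could be derived from the data first. The following is a sketch under the assumption that bin values are positive integers; the macro variable name num_bin follows the question, and the freq_Bin_1, freq_Bin_2, ... names are spelled out to match the desired output.

```sas
/* Find the largest bin value across all Bin columns. */
data _null_;
  set have end=eof;
  array Bins{*} Bin:;
  retain max_bin 0;
  max_bin = max(max_bin, of Bins{*});   /* MAX ignores missing values */
  if eof then call symputx('num_bin', max_bin);
run;

/* Size the flag array from that value. */
data want;
  set have;
  array Bins{*} Bin:;
  array freq_Bin{&num_bin} freq_Bin_1-freq_Bin_&num_bin;
  do k = 1 to dim(Bins);
    if Bins(k) ne . then freq_Bin(Bins(k)) = 1;
  end;
  drop k;
run;
```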

SAS Looping: Summing observations vertically based on conditionally

I have a dataset that looks like:
Zip Codes Total Cars
11111 3
11111 4
23232 1
44331 0
44331 10
18860 6
18860 6
18860 6
18860 8
There are 3 million+ rows just like this, with varying zips. I need to sum total cars for each zip code, such that the resulting table looks like
Zip Codes Total Cars
11111 7
23232 1
44331 10
18860 26
.
.
.
Manually inputting zips into the code is not an option considering the size of the dataset. Thoughts?
Both answers so far are OK, but here is a more detailed explanation of both possible methods:
PROC SQL METHOD
PROC SQL;
CREATE TABLE output_table AS
SELECT ZipCodes,
SUM(Total_Cars) as Total_Cars
FROM input_table
GROUP BY ZipCodes;
QUIT;
The GROUP BY clause can also be written GROUP BY 1, omitting ZipCodes, as this refers to the 1st column in the SELECT clause.
PROC SUMMARY METHOD
PROC SUMMARY DATA=input_table NWAY;
CLASS ZipCodes;
VAR Total_Cars;
OUTPUT OUT=output_table (DROP=_TYPE_ _FREQ_) SUM()=;
RUN;
The method is similar to another answer to this question, but I've added:
NWAY - gives only the maximum level of summarisation, here it's not as important because you have only one CLASS variable, meaning there is only one level of summarisation. However, without NWAY you get an extra row showing the total value of Total_Cars across the whole dataset, which is not something you asked for in your question.
DROP=_TYPE_ _FREQ_ - This removes the automatic variables:
_TYPE_ - which shows the level of summarisation (see comment above), which would just be a column containing the value 1.
_FREQ_ - gives a frequency count of the ZipCodes, which although useful, isn't something you wanted in your question.
DATA STEP METHOD
PROC SORT DATA=input_table (RENAME=(Total_Cars = tc)) OUT=_temp;
BY ZipCodes;
RUN;
DATA output_table (DROP=TC);
SET _temp;
BY ZipCodes;
IF first.ZipCodes THEN Total_Cars = 0;
Total_Cars+tc;
IF last.ZipCodes THEN OUTPUT;
RUN;
This is just included for completeness; it's not as efficient because it requires pre-sorting.
To supplement @mjsqu's answer, for (more) completeness:
data testin;
input Zip Cars;
datalines;
11111 3
11111 4
23232 1
44331 0
44331 10
18860 6
18860 6
18860 6
18860 8
;
PROC TABULATE METHOD
proc tabulate data=testin out=testout
/*drop extra created vars and rename as needed*/
(drop=_type_ _page_ _table_ rename=(Zip='Zip Codes'n Cars_Sum='Total Cars'n));
/*grouping variable, also used to sort output in ascending order*/
class Zip;
/* variable to be analyzed*/
var Cars;
/*sum cars by zip code*/
table Zip, Cars*(sum);
run;
If using Enterprise Guide, this produces a dataset and a results table. To suppress the results and only output a dataset, include this line before "proc tabulate":
ods select none; /*suppress ods output*/
and this after "run":
ods select all; /*restore ods output*/
The variable by which you want to group is "ZipCodes", so it goes in the CLASS statement.
You want to sum Total_cars, so it goes in the VAR statement.
Input_table and Output_table are self-explanatory.
/* Code */
proc summary data=Input_table;
class ZipCodes;
var Total_cars;
output out=Output_table
sum()=;
run;
You can use proc sql. This is a very simple step:
proc sql;
create table new as
select Zipcodes, sum(Total_Cars) as total_cars from table_have group by Zipcodes
;
quit;
