How to do a loop merging multiple datasets in Stata?

I am trying to create two lists of files and then create two new datasets, each merging all the files in one list. To do so, I tried the following:
*** SET FOLDER PATHS ***********************************************************
global projectFolder "C:\Users\XXX"
global codeFolder "${projectFolder}\code"
global databaseFolder "${projectFolder}\data"
global rawFolder "${databaseFolder}\raw"
global outputsFolder "${databaseFolder}\output"
*** CREATING VECTORS WITH FILE NAMES *******************************************
global file_all dir "$outputsFolder" files "*.dta"
di `$file_all'
global file_monthly dir "$outputsFolder" files "*_monthly.dta"
di `$file_monthly'
global file_yearly : list global file_all - global file_monthly
di `$file_yearly'
I ran into two problems. First, I was not able to create the lists of files. Second, I could not find a way to write this loop without merging the first dataset twice.
*** MERGING YEARLY OUTCOMES ****************************************************
use "$outputsFolder\first_dataset.dta", clear
foreach file in `file_yearly' {
merge 1:1 muni_code year using `file', nogen
}

Within your foreach loop over the files, you can conditionally use the file if it is the first one, and merge it otherwise. The example below requires knowing the name of the "first" file:
local files: dir "." files "yearly*.dta"
foreach f of local files {
if "`f'" == "yearly_1.dta" use `f'
else merge 1:1 year muni using `f', nogen
}
list, clean
Output:
year muni val1 val2 val3
1. 2001 1 .3132002 .1924075 .8190824
2. 2002 2 .5559791 .1951401 .4882096
3. 2003 3 .9382851 .9509598 .2704866
4. 2004 4 .7363221 .2904454 .5859706
Input:
set seed 123
forvalues i = 1/3 {
clear
set obs 4
gen year = 2000 + _n
gen muni = _n
gen val`i' = runiform()
save yearly_`i', replace
}
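The first problem in the question (building the two file lists) can be handled with the extended macro function dir and macro list subtraction. The following is a minimal sketch, not a tested solution: it assumes the global $outputsFolder from the question, that every yearly file is uniquely identified by muni_code and year, and that monthly files match *_monthly.dta. The flag local first is a name introduced here for illustration; it removes the need to hard-code the first file name.

```stata
* Sketch only: assumes $outputsFolder is defined as in the question and that
* each yearly file has one row per muni_code and year.
local file_all     : dir "$outputsFolder" files "*.dta"
local file_monthly : dir "$outputsFolder" files "*_monthly.dta"
local file_yearly  : list file_all - file_monthly

* Load the first file in the list and merge every later one.
* "first" is a hypothetical flag, not part of the original answer.
local first 1
foreach f of local file_yearly {
    if `first' {
        use "$outputsFolder/`f'", clear
        local first 0
    }
    else {
        merge 1:1 muni_code year using "$outputsFolder/`f'", nogenerate
    }
}
```

The path uses a forward slash before `f' on purpose: in Stata a backslash directly in front of a macro reference suppresses its expansion, and forward slashes are accepted in paths on Windows as well.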

Related

Open file with formatted variable name in Julia

I have a list of files numbered gll_01.tab, gll_02.tab, ...., gll_20.tab in a subdirectory of my parent directory. These files are tabular data files.
I want to open/read files with user-specified input.
I can do:
a = 3
open("directory/gll_0$a.tab")
But using this approach, I would have to define two separate variable names for (01 to 09) and for (10 to 18). How can I use variables or strings with name 02, 03, ..., etc?
In python, I can have an equivalent command:
a = 4
g = '{:02d}'.format(a)
f = open('directory/gll_%s.tab' %g)
Is there an equivalent string formatting command in Julia?
A simple answer in this case would be to use lpad:
a = 3
open("directory/gll_$(lpad(a,2,"0")).tab")
If you need more fancy formatting you can use e.g. https://github.com/JuliaIO/Formatting.jl, in this case this would be:
using Formatting
a = 3
open("directory/gll_$(fmt("0>2", a)).tab")
Another option is to use @sprintf from the Printf standard library. With it you can use %02d as a format specifier, which pads an integer to width 2 with leading zeros:
julia> using Printf # this is in the standard library
julia> @sprintf("directory/gll_%02d.tab", 1)
"directory/gll_01.tab"
You can use this in your open statements too. Here they are in action:
julia> for i in 5:10
println("$i file is: $(@sprintf("directory/gll_%02d.tab",i))")
end
5 file is: directory/gll_05.tab
6 file is: directory/gll_06.tab
7 file is: directory/gll_07.tab
8 file is: directory/gll_08.tab
9 file is: directory/gll_09.tab
10 file is: directory/gll_10.tab

Merging time series with different number of observations where variables have the same name (SAS)

I have a bunch of time series data (sas-files) which I like to merge / combine up to a larger table (I am fairly new to SAS).
Filenames:
cq_ts_SYMBOL, where SYMBOL is the respective symbol for each file
with the following structure:
cq_ts_AAA.sas7bdat: file1
SYMBOL DATE TIME BID ASK MID
AAA 20100101 9:30:00 10.375 10.4 .
AAA 20100101 9:31:00 10.38 10.4 .
.
.
AAA 20150101 15:59:00 15 15.1 .
cq_ts_BBB.sas7bdat: file2
SYMBOL DATE TIME BID ASK MID
BBB 20120101 9:30:00 12.375 12.4 .
BBB 20120102 9:31:00 12.38 12.4 .
.
.
BBB 20170101 15:59:00 20 20.1 .
Key characteristics:
- They have the same variable name
- They have different number of observations
- They are all saved in the same folder
So what I want to do is:
- Create 3 tables: BID-table, ASK-table, Mid-table with the following structure, ie for bid-table, cq_ts_bid.sas7bdat:
DATE TIME AAA BBB ...
20100101 9:30:00 10.375 .
20100102 9:31:00 10.38 .
.
.
20120101 9:30:00 9.375 12.375
20120102 9:31:00 9.38 12.38
.
.
20150101 15:59:00 15 17
.
.
20170101 15:59:00 . 20
It is not all too difficult to do it for 2 stock time series; however, I was wondering whether it is possible to do the following:
From data set cq_ts_AAA take DATE TIME BID and rename BID to AAA (either from the values in symbol? does this make sense? or get the name from the filename).
Do the same for cq_ts_BBB.
In fact, loop through the folder to get the number of files and filenames (this part I got more or less, see below).
Merge cq_ts_AAA and cq_ts_BBB so that the result has DATE TIME AAA (former bid price of AAA) BBB (former bid price of BBB), and do so for all the files in the folder.
Do this for BID, then for ASK and finally MID (actually I couldn't get the midpoint variable from bid and ask (i.e. mid= (bid + ask) / 2;) just gives me the "." in the previous data steps when creating the files).
I think I need a macro that first gets each single file, then renames it (when should this step take place?) and merges them together, like a double loop.
Here the renaming and merging part:
data ALDW_short (rename=(iprice = ALDW));
set output.cq_ts_aldw
retain date time ALJ;
run;
data ALJ_short (rename= (iprice = ALJ));
set output.cq_ts_alj;
retain date time datetime ALJ;
run;
data ALDW_ALJ_merged (keep= date itime ALDW ALJ);
merge ALDW_short ALJ_short;
by datetime;
run;
This is the part to loop through the folder and get a list of names:
proc contents data = output._all_ out = outputcont(keep = memname) noprint;
run;
proc sort data = outputcont nodupkey;
by memname;
run;
data _null_;
set outputcont end = last;
by memname;
i+1;
call symputx('name'||trim(left(put(i,8.))),memname);
if last then call symputx('count',i);
run;
Would it make sense to extract the symbol (and how? they have different length) from the filename or just to take them from the variable SYMBOL (and how can I get the one value to rename my column?)?
Somehow I have difficulty changing the order of columns, ie. I tried with retain and format.
Looks like you could do this easily with PROC TRANSPOSE. Combine your datasets into a single dataset.
data all ;
set output.cq_ts_: ;
by date time;
run;
Then use PROC TRANSPOSE for each of your source variables/target tables.
proc transpose data=all out=bid ;
by date time ;
id symbol;
var bid;
run;
Given your example data, a formula for MID of
mid = (bid + ask)/2 ;
should work. If you got all missing values, you most likely put the assignment statement before the SET or INPUT statement. In other words, you were trying to calculate using values that had not been read in yet.

SAS Put Certain Row as New Variable Names After Manipulation

After importing my CSV data with GETNAMES=NO, I have 59 columns with variable names VAR1, VAR2, . . . VAR59. My first row contains the names I need for the new variables, but they first needed to be manipulated by removing special characters and turning spaces into underscores, since SAS doesn't like spaces in variable names. This is the array I used for that piece:
DATA DATA1; SET DATA (FIRSTOBS=7);
ARRAY VAR(59) VAR1-VAR59;
IF _N_ = 1 THEN DO;
DO I = 1 TO 59;
VAR[I] = COMPRESS(TRANSLATE(TRIM(VAR[I]),'_',' '),'?()');
PUT VAR[I]=;
END;
END;
DROP I;
RUN;
This worked perfectly, but now I need to get this first row up to the new variable names. I tried a similar array to perform this:
DATA DATA2; SET DATA1;
ARRAY V(59) VAR1-VAR59;
DO I = 1 TO 59;
IF _N_ = 1 AND V[I] NE "" THEN CALL SYMPUT("NEWNAME",V[I]);
RENAME VAR[I] = &NEWNAME;
END;
DROP I;
RUN;
This only puts the name of VAR59, since there is no [i] connected to &NEWNAME, and it still isn't working quite right. Any suggestions for moving a row up to variable names AFTER manipulation?
Your primary problem is you are trying to use a macro variable in the data step it's created in. You can't. You're also trying to create rename statements in the data step; rename, as with other similar statements (keep, drop), must be defined before the data step is compiled.
You need to write code somewhere - either in a text file, a macro variable, whatever - with this information. For example:
filename renamef temp;
data _null_;
set myfile (obs=1);
file renamef;
array var[59];
do _i = 1 to dim(Var);
[your code to clean it out];
strput = catx(' ', 'rename', vname(var[_i]), '=', var[_i], ';');
put strput;
end;
run;
data want;
set myfile (firstobs=2);
%include renamef;
run;
There are lots of other examples of this on the site and on the web; "list processing" is the term for it.
Joe -- using your suggestions and another one of your posts, the following worked flawlessly:
Put the row of needed variables into long format (in my case, first row so n = 1)
DATA NEWVARS; SET DATA;
IF _N_ = 1 THEN OUTPUT NEWVARS;
RUN;
PROC TRANSPOSE DATA = NEWVARS OUT=NEWVARS1;
VAR _ALL_;
RUN;
Create a list of rename macro calls.
PROC SQL;
SELECT CATS('%RENAME(VAR=',_NAME_,',NEWVAR=',COL1,')')
INTO :RENAMELIST SEPARATED BY ' '
FROM NEWVARS1;
QUIT;
%MACRO RENAME(VAR=,NEWVAR=);
RENAME &VAR.=&NEWVAR.;
%MEND RENAME;
Call in the list created in Step 2 to rename all variables.
PROC DATASETS LIB=WORK NOLIST;
MODIFY DATA;
&RENAMELIST.;
QUIT;
I had to perform a few additional checks to make sure that the variable names were not longer than 32 characters, and this was easy to check once the data was in long format after transposing. If certain words make the names too long, the TRANWRD function can easily replace them with abbreviations.

Each xtable produced in R-loops should have \begin{table}..\end{table} environment in Sweave

I try to write an R-function which produces xtables in a loop. Later I want to call my function in a Sweave document- but a single chunk can't support multiple tables. I would have to put each table in a single chunk and wrap it with the Latex Code \begin{table} ... \end{table}.
So I wonder, whether it's possible to somehow call Sweave/knitr from within the Loop of the R-function and add \begin{table} .. \end{table} around each xtable?
Or whether it is somehow possible to send each xtable from the loop to a chunk with \begin{table} ... \end{table} environment?
A mini-example of my function:
multiple_tables_Loop<-function(...){
(....) ##Some necessary calculations to produce a data frame
for(j in 1:m){
for(i in 1:n){
a<-data.frame(...)
table<-xtable(a)
print(table)
}
}
}
In Sweave I would call the function:
<<Hallo_Table,results='aisis'>>
multiple_tables_Loop(...)
@
I'm confused by your question. xtable does include \begin{table}/\end{table} pairs. And you can put multiple tables in a code chunk (for both Sweave and knitr .Rnw files). Could it be just that you have misspelled 'asis' in your chunk header?
Showing xtable does include \begin{table}/\end{table}:
> xtable(data.frame(x=1))
% latex table generated in R 3.1.2 by xtable 1.7-4 package
% Fri Jan 23 11:12:47 2015
\begin{table}[ht]
\centering
\begin{tabular}{rr}
\hline
& x \\
\hline
1 & 1.00 \\
\hline
\end{tabular}
\end{table}
And a simple .Rnw file of
<<results="asis">>=
library("xtable")
xtable(data.frame(x=1))
xtable(data.frame(y=1))
@
properly gives two tables.
If the misspelling isn't the problem, a complete minimal reproducible example is needed, along with the version numbers of R and all the packages (the output of sessionInfo()).

Tabulate multiple variables with common prefix using a local macro

I have a number of variables whose name begins with the prefix indoor. What comes after indoor is not numeric (that would make everything simpler).
I would like a tabulation for each of these variables.
My code is the following:
local indoor indoor*
foreach i of local indoor {
tab `i' group, col freq exact chi2
}
The problem is that indoor in the foreach command resolves to indoor* and not to the list of the indoor questions, as I hoped. For this reason, the tab command is followed by too many variables (it can only handle two) and this results in an error.
The simple fix is to substitute the first command with:
local indoor <full list of indoor questions>
But this is what I would like to avoid, that is to have to find all the names for these variables and then paste them in the code. It seems there is a quicker fix for this but I can't think of any.
The trick is to use ds or unab to create the varlist expansion before asking Stata to loop over values in the foreach loop.
Here's an example of each:
******************! BEGIN EXAMPLE
** THIS FIRST SECTION SIMPLY CREATES SOME FAKE DATA & INDOOR VARS **
clear
set obs 10000
local suffix `c(ALPHA)'
token `"`suffix'"'
while "`1'" != "" {
g indoor`1'`2'`3' = 1+int((5-1+1)*runiform())
lab var indoor`1'`2'`3' "Indoor Values for `1'`2'`3'"
mac shift 1
}
g group = rbinomial(1,.5)
lab var group "GROUP TYPE"
** NOW, YOU SHOULD HAVE A BUNCH OF FAKE INDOOR
**VARS WITH ALPHA, NOT NUMERIC SUFFIXES
desc indoor*
**USE ds TO CREATE YOUR VARLIST FOR THE foreach LOOP:
ds indoor*
di "`r(varlist)'"
local indoorvars `r(varlist)'
local n 0
foreach i of local indoorvars {
**LET'S CLEAN UP YOUR TABLES A BIT WITH SOME HEADERS VIA display
local ++n
di in red "--------------------------------------------"
di in red "Table `n': `:var l `i'' by `:var l group'"
di in red "--------------------------------------------"
**YOUR tab TABLES
tab `i' group, col freq chi2 exact nolog nokey
}
******************! END EXAMPLE
OR using unab instead:
******************! BEGIN EXAMPLE
unab indoorvars: indoor*
di "`indoorvars'"
local n 0
foreach i of local indoorvars {
local ++n
di in red "--------------------------------------------"
di in red "Table `n': `:var l `i'' by `:var l group'"
di in red "--------------------------------------------"
tab `i' group, col freq chi2 nokey //I turned off exact to speed things up
}
******************! END EXAMPLE
The advantages of ds come into play if you want to select your indoor vars using a tricky selection rule, like selecting indoor vars based on information in the variable label or some other characteristic.
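As a rough illustration of selecting by variable label, ds accepts a has(varlabel ...) option. The sketch below assumes the fake data created in the first example; the pattern "*Values for A*" and the local name picked are hypothetical and chosen only to fit those fake labels.

```stata
* Sketch: keep only the indoor* variables whose variable label matches a pattern.
* The pattern below is hypothetical and only suits the fake labels created above.
ds indoor*, has(varlabel "*Values for A*")
local picked `r(varlist)'

* Tabulate each selected variable against group.
foreach i of local picked {
    tab `i' group, col chi2
}
```

With unab the same selection would need an extra loop over the variables to inspect each label, which is where ds saves work.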
You could do this with
foreach i of var `indoor' {
tab `i' group, col freq exact chi2
}
This would work. It is almost identical to the code in the question.
unab indoor : indoor*
foreach i of local indoor {
tab `i' group, col freq exact chi2
}
foreach v of varlist indoor* {
tab `v' group, col freq exact chi2
}
