Hi, I run several fixed-effects regressions on several outcomes, which I store in a local macro and go through in a foreach loop. Next I want to add a subgroup analysis by a stable, time-invariant trait (such as gender or race). Thus I cannot use bysort group: regress.
Below is an MWE. How can I redo this analysis for all 3 levels of race? At the moment I copy and paste the code, preserve the data, and keep one level at a time. I hope there is a more efficient way.
* load data
use http://www.stata-press.com/data/r13/nlswork
* set panel structure
xtset idcode year
* fixed effects regression
local outcomes "ln_wage ttl_exp tenure"
local rhsvars "c.wks_ue##c.wks_ue##i.occ_code union age i.year 1.race"
foreach o of local outcomes {
    quietly xtreg `o' `rhsvars', i(idcode) fe
    margins, dydx(wks_ue) at(occ_code=(1 2 3)) post
    outreg2 using report_`r'.doc, word append ctitle(`o')
}
* subgroup analysis race (or gender) ??
As Pearly Spencer mentioned above, the if qualifier seems like the perfect solution. (I assumed your local macro r was for iterating over values of race.)
use http://www.stata-press.com/data/r13/nlswork
xtset idcode year
local outcomes "ln_wage ttl_exp tenure"
local rhsvars "c.wks_ue##c.wks_ue##i.occ_code union age i.year"
levelsof race
local racelevels `r(levels)'
foreach r in `racelevels' {
    foreach o of local outcomes {
        quietly xtreg `o' `rhsvars' if race == `r', i(idcode) fe
        margins, dydx(wks_ue) at(occ_code=(1 2 3)) post
        outreg2 using report_`r'.doc, word append ctitle(`o')
    }
}
By the way, consider the user-written command reghdfe by Sergio Correia as a faster and more intuitive substitute for xtreg: http://scorreia.com/software/reghdfe/
(Code edited)
I'm trying to create three separate .dta files, one per variable. Each file keeps the records for one selected dxcode, collapses them to one row per id, and adds a new variable set to 1 for every record. I will then merge these files with a larger table. Is there a way to simplify the code below, perhaps with a loop, since the three blocks all do nearly the same thing?
*Menin variable***
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "32290")
collapse(max) date, by(bene_id)
gen menin = 1
save "C:\Users\Desktop\temp\menin.dta",replace
*BMenin variable***
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "32090")
collapse(min) date, by(bene_id)
gen Bmenin = 1
save "C:\Users\Desktop\temp\Bmenin.dta",replace
*nonBmenin variable***
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "04790")
collapse(max) date, by(bene_id)
gen nonBmenin = 1
save "C:\Users\Desktop\temp\nonBmenin.dta",replace
This is a sketch, as nothing in your question is reproducible.
local codes 32290 32090 04790
foreach v in menin Bmenin nonBmenin {
    use "C:\ICD9study.dta", clear
    gettoken code codes : codes
    keep if dxcode == "`code'"
    collapse (max) date, by(bene_id)
    gen `v' = 1
    save "C:\Users\Desktop\temp/`v'.dta", replace
}
Your code produces the maximum date in two cases and the minimum in one, whereas the sketch above uses the maximum throughout. If the mix is really what you want, the loop will need adjusting.
There is plenty of advice about good practice on the site and under the Stata tag wiki.
I have a question in regards to preparing my dataset for research.
I have a dataset in SPSS 20 in long format as I am researching on individual level over multiple years. However some individuals were added twice to my dataset because there were differences in some variables matched to those individuals (5000 individuals with 25 variables per individual). I would like to merge those duplicates so that I can run my analysis over time. For those variables that differ between the duplicates I would like spss to make additional variables when all the duplicates are merged.
Is this at all possible and if yes HOW?
I suggest the following steps:
1. Create an auxiliary variable "PrimaryLast" with Data -> Identify Duplicate Cases, setting "Define matching cases by" to your case ID.
2. Create two new auxiliary datasets with Data -> Select Cases, using the conditions "PrimaryLast = 0" and "PrimaryLast = 1" and the option "Copy selected cases to new dataset".
3. Merge both auxiliary datasets with Data -> Merge Files -> Add Variables. Rename the duplicated variable names in the left box, move them to the right box, and select your case ID as the key.
4. Check that you got a "full outer join". If you lost the non-duplicated cases and only duplicated cases remain in your dataset, merge the datasets from step 2 in the opposite order in step 3.
Try this:
sort cases by caseID otherVar.
compute ind=1.
if $casenum>1 and caseID=lag(caseID) ind=lag(ind)+1.
casestovars /id=caseID /index=ind.
If a caseID is repeated more than once, after the restructure there will be only one line for that case, while all the variables will be repeated with indexes.
If the order of the repeated caseID rows matters, replace otherVar in the sort command with the corresponding variable (e.g. date). This way your new variables will also be indexed accordingly.
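For readers who want to see the restructuring logic outside SPSS, here is a minimal Python sketch of the same idea: repeated IDs are collapsed into one record and each repeated variable gets an index suffix. The function name, the dictionary-based rows, and the "name.index" naming are assumptions made for illustration, and unlike casestovars the sketch indexes every listed variable, even ones that are constant within an ID.

```python
from collections import defaultdict

def cases_to_vars(rows, id_key, value_keys):
    """Collapse repeated ids into one record with indexed variable names."""
    grouped = defaultdict(list)
    for row in rows:  # rows are assumed to be sorted in the desired order already
        grouped[row[id_key]].append(row)
    wide = []
    for case_id, repeats in grouped.items():
        record = {id_key: case_id}
        for index, row in enumerate(repeats, start=1):
            for key in value_keys:
                record[f"{key}.{index}"] = row[key]
        wide.append(record)
    return wide

# Hypothetical long-format data: case 1 appears twice, case 2 once.
rows = [
    {"caseID": 1, "score": 10},
    {"caseID": 1, "score": 12},
    {"caseID": 2, "score": 7},
]
wide = cases_to_vars(rows, "caseID", ["score"])
print(wide)
```

Cases with fewer repeats simply lack the higher-indexed keys, which corresponds to missing values in the restructured file.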
I have a student who has gathered data in an online survey in which each response option was stored as its own variable, rather than one variable per item holding whichever response was chosen. We need a scoring algorithm that reads these indicator variables and combines them into one variable per item. I can do this with IF statements per item, e.g.,
if Q1_1=1 var1=1.
if Q1_2=1 var1=2.
if Q1_3=1 var1=3.
if Q1_4=1 var1=4.
Doing this for a 200 item survey (now more like 1000) will be a drag and subject to many typos unless automated. I have no experience of vectors and loops in SPSS, but some reading suggests this is the way to approach the problem.
I would like to run if statements as something like (pseudocode):
for items=1 to 30
for responses=1 to 4
if Q1_1=1 a=1.
if Q1_2=1 a=2.
if Q1_3=1 a=3.
if Q1_4=1 a=4.
compute newitem(items)=a.
next response.
next item.
I would hope this produces new variables (newitem1 to newitem30), each holding one of the 4 responses taken from its original 4 corresponding variables.
I have never written serious SPSS code before: please advise!
This will do the job:
* creating some sample data.
data list free (",")/Item1_1 to Item1_4 Item2_1 to Item2_4 Item3_1 to Item3_4.
begin data
end data.
* now looping over the items and constructing the "NewItems".
do repeat Item1=Item1_1 to Item1_4
/Item2=Item2_1 to Item2_4
/Item3=Item3_1 to Item3_4
/Val=1 to 4.
if Item1=1 NewItem1=Val.
if Item2=1 NewItem2=Val.
if Item3=1 NewItem3=Val.
end repeat.
In this way you run all your loops simultaneously.
Note that "ItemX_1 to ItemX_4" will only work if these four variables are consecutive in the dataset. If they aren't, you have to name each of them separately - "ItemX_1 ItemX_2 ItemX_3 itemX_4".
Now if you have many such item sets, all named regularly as in the example, the following macro can shorten the process:
define !DoItems (ItemList=!cmdend)
!do !Item !in (!ItemList)
do repeat !Item=!concat(!Item,"_1") !concat(!Item,"_2") !concat(!Item,"_3") !concat(!Item,"_4")/Val=1 2 3 4.
if !Item=1 !concat("New",!Item)=Val.
end repeat.
!doend
!enddefine.
* now you just have to call the macro and list all your Item names:
!DoItems ItemList=Item1 Item2 Item3.
The macro will work with any item name, as long as the variables are named ItemName_1, ItemName_2, etc.
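The scoring logic above can also be sketched in Python, which may help when checking results on a small sample. This is only an illustration: the function name, the item names, and the list-of-flags layout are assumptions, not part of the SPSS solution.

```python
def score_item(flags):
    """Given indicator flags [opt1, opt2, opt3, opt4] (1 = chosen),
    return the 1-based position of the chosen option.

    Mirrors the IF statements: a later flag overwrites an earlier one,
    and an item with no flag set stays missing (None).
    """
    value = None
    for position, flag in enumerate(flags, start=1):
        if flag == 1:
            value = position
    return value

# Hypothetical respondent: Item1 answered with option 3,
# Item2 with option 1, Item3 left unanswered.
row = {
    "Item1": [0, 0, 1, 0],
    "Item2": [1, 0, 0, 0],
    "Item3": [0, 0, 0, 0],
}
new_items = {"New" + name: score_item(flags) for name, flags in row.items()}
print(new_items)
```

Returning None for an unanswered item keeps it distinguishable from a valid response, much like a system-missing value.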
I'm trying to combine two datasets, say data1.dta and data2.dta, in Stata, keeping only the overlapping variables and I want to drop all variables that exist only in one of the two datasets.
My idea was to compare the two data sets with cfvar: with return list I get the output r(both), r(oneonly) and r(twoonly). And now I want to use the outputs r(oneonly) and r(twoonly) for a loop to drop all variables that are listed in r(oneonly) and r(twoonly), something like:
for each v of varlist ??how to define the varlist??{ drop v }
I know of no program cfvar; perhaps you mean cfvars (SSC), which turns out to be something I wrote. But that is redundant given recent versions of Stata. You can go
d using data1, varlist
local v1 "`r(varlist)'"
d using data2, varlist
local v2 "`r(varlist)'"
local both : list v1 & v2
u data1
keep `both'
Then you need to merge with data2: the syntax will depend on which variable(s) act as identifiers. Note the keepusing() option of merge.
I can't see that a loop is needed here.
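The intersection step is simple set logic, and a minimal Python sketch may make it concrete. The function name and the variable lists are made up for illustration; keeping the order of the first list is a choice made in this sketch.

```python
def common_variables(vars1, vars2):
    """Return the names in vars1 that also appear in vars2, in vars1's order."""
    in_second = set(vars2)
    return [name for name in vars1 if name in in_second]

# Hypothetical variable lists standing in for the two r(varlist) results.
v1 = ["id", "year", "wage", "union"]
v2 = ["year", "id", "tenure", "wage"]
both = common_variables(v1, v2)
print(both)
```

Everything not in the resulting list exists in only one dataset, so keeping just those names does the same job as dropping the one-only variables.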
I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more than a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using something like a key-value store such as MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. Ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one data point per day (first, last, closest to time t...)
retrieve all data for all objects for a particular timestamp
The data should be ordered, and ideally it should be fast to write new data as well as update existing data.
It seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance. Does anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? Or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
    object: XYZ,
    ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one data point per day (first, last, closest to time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
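As a rough illustration of the closest-to-a-time case, here is a minimal client-side sketch in Python. It assumes the events for one object have already been fetched as (timestamp, value) pairs with naive datetimes; the function name and the sample data are hypothetical, and this is not a MongoDB feature.

```python
from datetime import datetime, time

def closest_per_day(events, target):
    """Pick, for each calendar day, the event nearest to the target time.

    events: iterable of (datetime, value); target: datetime.time.
    Returns {date: (datetime, value)}.
    """
    best = {}
    for ts, value in events:
        day = ts.date()
        # Distance in seconds between this event and the target time that day.
        gap = abs((ts - datetime.combine(day, target)).total_seconds())
        if day not in best or gap < best[day][0]:
            best[day] = (gap, ts, value)
    return {day: (ts, value) for day, (gap, ts, value) in best.items()}

# Hypothetical observations for one object over two days.
events = [
    (datetime(2011, 1, 1, 9, 0), 1),
    (datetime(2011, 1, 1, 12, 10), 2),
    (datetime(2011, 1, 1, 15, 0), 3),
    (datetime(2011, 1, 2, 11, 30), 4),
    (datetime(2011, 1, 2, 13, 0), 5),
]
picked = closest_per_day(events, time(12, 0))
print(picked)
```

With about 100 observations per object per day, filtering on the client like this is cheap once the range query has narrowed the data down.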
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However, if your application requires the speed it might be worth looking into.
There is an open source time series database under active development (.NET only for now) that I wrote. It can store massive amounts (terabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for stock tick storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
    file.UniqueIndexes = true; // enforces index uniqueness
    file.InitializeNewFile(); // create file and write header
    file.AppendData(data); // append data (stream of ArraySegment<>)
}

// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStruct>) BinaryFile.Open("data.bts", false))
{
    // Enumerate one item at a time, maximum 10 items starting at 2011-1-1
    // (can also get one segment at a time with StreamSegments)
    foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount: 10))
    {
        // ... use val here
    }
}
I recently tried something similar in F#. I started with the 1-minute bar format for the symbol in question in a space-delimited file which has roughly 80,000 1-minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100-minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the Windows cache, but even when loaded from disk it never adds more than 15ms to the load.
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter, and it generally takes less than 0.5ms to load and parse when in the Windows file cache. Simple iteration across the full time series data to return the set of records inside a date range is a sub-3ms operation with a full year of 1-minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .NET 4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series, and with a couple of gigs of RAM dedicated to cache I get nearly a 100% cache hit rate, so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM,
// then split each line on spaces into an
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
    System.IO.File.ReadAllLines(fname)
    |> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
    [| for aLine in tarr do
         //printfn "aLine=%A" aLine
         let closep = float(aLine.[5])
         yield closep |]
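The SMA calculation and the date-range slicing described in this answer can be sketched in Python as follows. This is an illustration under assumptions, not the author's F# code: the function names are made up, the SMA uses a running sum, and the slice uses binary search on sorted timestamps rather than the simple iteration mentioned above.

```python
from bisect import bisect_left, bisect_right

def simple_moving_average(values, window):
    """Return the SMA for every position that has a full window behind it."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            averages.append(running / window)
    return averages

def slice_by_time(timestamps, values, start, end):
    """Return values whose timestamps fall in [start, end].

    timestamps must be sorted ascending and aligned with values.
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return values[lo:hi]

# Hypothetical closing prices at minutes 1..5.
closes = [1.0, 2.0, 3.0, 4.0, 5.0]
minutes = [1, 2, 3, 4, 5]
sma3 = simple_moving_average(closes, 3)
window_slice = slice_by_time(minutes, closes, 2, 4)
print(sma3, window_slice)
```

The running sum keeps the SMA linear in the number of bars regardless of window size, which matters more for a 100-minute window than for the 3-bar window used here.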
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.