Simplifying a Stata script with a loop

I'm trying to create three separate .dta files, one for each of three variables. Each file keeps the records for one selected dxcode, collapses them to one record per id, and adds a new variable set to 1 for every record. I will then merge these new .dta files with a larger table. Is there any way to simplify the code below, perhaps with a loop, since the three blocks all do essentially the same thing?
*Menin variable***
*****************************
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "32290")
collapse(max) date, by(bene_id)
gen menin = 1
save "C:\Users\Desktop\temp\menin.dta",replace
*BMenin variable***
***************************************
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "32090")
collapse(min) date, by(bene_id)
gen Bmenin = 1
save "C:\Users\Desktop\temp\Bmenin.dta",replace
*nonBmenin variable***
*********************************************
use "C:\ICD9study.dta",clear
keep if inlist(dxcode, "04790")
collapse(max) date, by(bene_id)
gen nonBmenin = 1
save "C:\Users\Desktop\temp\nonBmenin.dta",replace

This is a sketch, as nothing in your question is reproducible.
local codes 32290 32090 04790
foreach v in menin Bmenin nonBmenin {
use "C:\ICD9study.dta", clear
gettoken code codes : codes
keep if dxcode == "`code'"
collapse (max) date, by(bene_id)
gen `v' = 1
save "C:\Users\Desktop\temp/`v'.dta", replace
}
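Your code produces the maximum date in two cases and the minimum in one, whereas the loop above takes the maximum throughout. If the mix is really what you want, you'll need to rewrite the code, for example by stepping through a second local of statistics with gettoken in the same way as the codes.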
There is plenty of advice about good practice on the site and under the Stata tag wiki.
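The idea of keeping the statistic alongside each name and code can be sketched outside Stata. The following is a minimal, language-neutral illustration in Python, not Stata code: the sample records, the specs list and the flag_table helper are all hypothetical names made up for this example.

```python
from datetime import date

# Hypothetical stand-in for the ICD9study data: (bene_id, dxcode, date) records.
records = [
    (1, "32290", date(2010, 1, 5)),
    (1, "32290", date(2010, 3, 1)),
    (2, "32090", date(2011, 6, 9)),
    (2, "32090", date(2011, 2, 2)),
    (3, "04790", date(2012, 7, 7)),
]

# One entry per output: variable name, diagnosis code, statistic to keep.
specs = [
    ("menin", "32290", max),
    ("Bmenin", "32090", min),
    ("nonBmenin", "04790", max),
]


def flag_table(records, code, stat):
    """Collapse records with the given code to one date per bene_id."""
    by_id = {}
    for bene_id, dxcode, when in records:
        if dxcode == code:
            by_id.setdefault(bene_id, []).append(when)
    return {bene_id: stat(dates) for bene_id, dates in by_id.items()}


# Build one collapsed table per variable name.
tables = {name: flag_table(records, code, stat) for name, code, stat in specs}
```

Because each name travels with its own code and statistic, a min/max mix like the one in the question no longer needs copied blocks.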

Related

How to run FE regressions for time-invariant subgroups (in Stata)

I run several fixed-effects regressions on several outcomes, which I store in a local and go through in a foreach loop. Next I want to add a subgroup analysis by a stable, time-invariant trait (such as gender or race), so I cannot simply use bysort group: regress.
Below is an MWE. How can I redo this analysis for all 3 levels of race? At the moment I copy and paste, preserve the data, and keep one level at a time. I hope that there's a more efficient way.
* load data
use http://www.stata-press.com/data/r13/nlswork
* set panel structure
xtset idcode year
* fixed effects regression
local outcomes "ln_wage ttl_exp tenure"
local rhsvars "c.wks_ue##c.wks_ue##i.occ_code union age i.year 1.race"
foreach o of local outcomes {
quietly xtreg `o' `rhsvars', i(idcode) fe
margins, dydx(wks_ue) at(occ_code=(1 2 3)) post
outreg2 using report_`r'.doc, word append ctitle(`o')
}
* subgroup analysis race (or gender) ??
As Pearly Spencer mentioned above, the if qualifier seems like the perfect solution. (I assumed your local macro r was for iterating over values of race.)
use http://www.stata-press.com/data/r13/nlswork
xtset idcode year
local outcomes "ln_wage ttl_exp tenure"
local rhsvars "c.wks_ue##c.wks_ue##i.occ_code union age i.year"
levelsof race
local racelevels `r(levels)'
foreach r in `racelevels'{
foreach o of local outcomes {
quietly xtreg `o' `rhsvars' if race == `r', i(idcode) fe
margins, dydx(wks_ue) at(occ_code=(1 2 3)) post
outreg2 using report_`r'.doc, word append ctitle(`o')
}
}
By the way, consider the user-written command reghdfe by Sergio Correia as a faster and more intuitive substitute for xtreg: http://scorreia.com/software/reghdfe/

AT NEW with substring access?

I have a solution that includes a LOOP which I would like to avoid, so I wonder whether you know a better way to do this.
My goal is to loop through an internal, alphabetically sorted standard table. This table has two columns: a name and a table, let's call it subtable. For every subtable I want to do some stuff (open an xml page in my xml framework).
Now, every subtable has a corresponding name. I want to group the subtables according to the first letter of this name, that is, put the pages of these subtables on one main page (one main page for every character). By grouping I mean that, while looping through the table, I want to handle the subtables differently according to the first letter of their name.
So far I came up with the following solution:
TYPES: BEGIN OF l_str_tables_extra,
first_letter(1) TYPE c,
name TYPE string,
subtable TYPE REF TO if_table,
END OF l_str_tables_extra.
DATA: ls_tables_extra TYPE l_str_tables_extra.
DATA: lt_tables_extra TYPE TABLE OF l_str_tables_extra.
FIELD-SYMBOLS: <ls_tables> TYPE str_table."Like LINE OF lt_tables.
FIELD-SYMBOLS: <ls_tables_extra> TYPE l_str_tables_extra.
*"--- PROCESSING LOGIC ------------------------------------------------
SORT lt_tables ASCENDING BY name.
"Add first letter column in order to use 'at new' later on
"This is the loop I would like to spare
LOOP AT lt_tables ASSIGNING <ls_tables>.
ls_tables_extra-first_letter = <ls_tables>-name+0(1). "new column
ls_tables_extra-name = <ls_tables>-name.
ls_tables_extra-subtable = <ls_tables>-subtable.
APPEND ls_tables_extra TO lt_tables_extra.
ENDLOOP.
LOOP AT lt_tables_extra ASSIGNING <ls_tables_extra>.
AT NEW first_letter.
"Do something with subtables with same first_letter.
ENDAT.
ENDLOOP.
I wish I could use
AT NEW name+0(1).
instead of
AT NEW first_letter.
but offsets and lengths are not allowed there.
As you can see, I have to include this first loop to add another column to my table, which seems unnecessary because no new information is gained.
In addition, I am interested in other solutions because I get into trouble with the framework later on for different reasons. A different way to do this might help me out there, too.
I am happy to hear any thoughts about this! I could not find anything related here on Stack Overflow, but I might not have used the best search terms ;)
Maybe the GROUP BY addition on LOOP could help you in this case:
LOOP AT i_tables
INTO DATA(wa_line)
" group lines by condition
GROUP BY (
" substring() because normal offset would be evaluated immediately
name = substring( val = wa_line-name len = 1 )
) INTO DATA(o_group).
" o_group-name holds the first letter shared by all members of this group
" loop over the members of the group
LOOP AT GROUP o_group
ASSIGNING FIELD-SYMBOL(<fs_table>).
" <fs_table> contains your table
ENDLOOP.
" end of loop
ENDLOOP.
Why not use an IF comparison? (This relies on the table being sorted by name, as in the question.)
data: lf_prev_first_letter(1) type c.
loop at lt_table assigning <ls_table>.
if <ls_table>-name(1) <> lf_prev_first_letter. "=AT NEW
"do something
lf_prev_first_letter = <ls_table>-name(1).
endif.
endloop.
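Both answers rest on the same idea: once the table is sorted by name, rows sharing a first letter are adjacent, so a group boundary is simply a change in that letter. As a rough illustration only (in Python rather than ABAP, with made-up sample rows), the grouping could be sketched like this:

```python
from itertools import groupby

# Hypothetical (name, subtable) pairs standing in for lt_tables.
tables = [("beta", "T2"), ("alpha", "T1"), ("apple", "T3"), ("cherry", "T4")]

# groupby only merges adjacent rows, so sort by name first.
tables.sort(key=lambda row: row[0])

# Map each first letter to the names that start with it.
grouped = {
    letter: [name for name, _subtable in rows]
    for letter, rows in groupby(tables, key=lambda row: row[0][:1])
}
```

The key function takes the place of the extra first_letter column, so no helper table is needed.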

Merging partial duplicate cases without losing data

I have a question in regards to preparing my dataset for research.
I have a dataset in SPSS 20 in long format, as I am doing research at the individual level over multiple years. However, some individuals were added twice to my dataset because some of the variables matched to those individuals differed (5000 individuals with 25 variables per individual). I would like to merge those duplicates so that I can run my analysis over time. For the variables that differ between the duplicates, I would like SPSS to create additional variables when the duplicates are merged.
Is this at all possible, and if so, how?
I suggest the following steps:
1. Create an auxiliary variable "PrimaryLast" with the procedure Data -> Identify Duplicate Cases..., setting "Define matching cases by" to your case ID.
2. Create two new auxiliary datasets with Data -> Select Cases, using the conditions "PrimaryLast = 0" and "PrimaryLast = 1" and the option "Copy selected cases to new dataset".
3. Merge both auxiliary datasets with the procedure Data -> Merge Files -> Add Variables: rename the duplicated variable names in the left box, move them to the right box, and select your case ID as the key.
4. Check that you really made a "full outer join". If you lost the non-duplicated cases and have only duplicated cases in your dataset, merge the datasets from step 2 in the opposite order in step 3.
Try this:
sort cases by caseID otherVar.
compute ind=1.
if $casenum>1 and caseID=lag(caseID) ind=lag(ind)+1.
casestovars /id=caseID /index=ind.
If a caseID is repeated more than once, after the restructure there will be only one line for that case, while all the variables will be repeated with indexes.
If the order of the repeats within a caseID matters, replace otherVar in the sort command with the corresponding variable (e.g. date). This way your new variables will also be indexed accordingly.
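To make the restructure concrete, here is a small sketch of the same logic in Python rather than SPSS syntax: sort, number the repeats within each caseID, then spread each repeat's variables into indexed columns. The sample rows and the "name.index" column naming are assumptions made for illustration.

```python
# Hypothetical long-format rows; caseID 1 appears twice.
rows = [
    {"caseID": 1, "otherVar": "b", "score": 10},
    {"caseID": 2, "otherVar": "a", "score": 7},
    {"caseID": 1, "otherVar": "a", "score": 12},
]

# Same role as "sort cases by caseID otherVar".
rows.sort(key=lambda r: (r["caseID"], r["otherVar"]))

wide = {}    # one output record per caseID
counts = {}  # running index of repeats within each caseID (the "ind" variable)
for r in rows:
    cid = r["caseID"]
    counts[cid] = counts.get(cid, 0) + 1
    ind = counts[cid]
    out = wide.setdefault(cid, {"caseID": cid})
    for key, value in r.items():
        if key != "caseID":
            out[f"{key}.{ind}"] = value  # e.g. score.1, score.2
```

A case that occurs only once ends up with a single set of indexed variables, so no cases are lost.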

easier use of loops and vectors in spss to combine variables

I have a student who has gathered data in an online survey in which each response option was stored as its own variable, rather than one variable per item holding whichever response was given. We need a scoring algorithm that reads these variables and combines them. I can do this with IF statements per item, e.g.,
if Q1_1=1 var1=1.
if Q1_2=1 var1=2.
if Q1_3=1 var1=3.
if Q1_4=1 var1=4.
Doing this for a 200 item survey (now more like 1000) will be a drag and subject to many typos unless automated. I have no experience of vectors and loops in SPSS, but some reading suggests this is the way to approach the problem.
I would like to run if statements as something like (pseudocode):
for item=1 to 30
for response=1 to 4
if Q1_1=1 a=1.
if Q1_2=1 a=2.
if Q1_3=1 a=3.
if Q1_4=1 a=4.
compute newitem(item)=a.
next response.
next item.
I would hope this produces new variables (newitem1 to newitem30), each holding one of the 4 responses derived from its 4 original variables.
I have never written serious SPSS code before, so please advise!
This will do the job:
* creating some sample data.
data list free (",")/Item1_1 to Item1_4 Item2_1 to Item2_4 Item3_1 to Item3_4.
begin data
1,,,,,1,,,,,1,,
,1,,,1,,,,1,,,,
,,,1,,,1,,,,,1,
end data.
* now looping over the items and constructing the "NewItems".
do repeat Item1=Item1_1 to Item1_4
/Item2=Item2_1 to Item2_4
/Item3=Item3_1 to Item3_4
/Val=1 to 4.
if Item1=1 NewItem1=Val.
if Item2=1 NewItem2=Val.
if Item3=1 NewItem3=Val.
end repeat.
execute.
In this way you run all your loops simultaneously.
Note that "ItemX_1 to ItemX_4" will only work if these four variables are consecutive in the dataset. If they aren't, you have to name each of them separately - "ItemX_1 ItemX_2 ItemX_3 ItemX_4".
Now if you have many such item sets, all named regularly as in the example, the following macro can shorten the process:
define !DoItems (ItemList=!cmdend)
!do !Item !in (!ItemList)
do repeat !Item=!concat(!Item,"_1") !concat(!Item,"_2") !concat(!Item,"_3") !concat(!Item,"_4")/Val=1 2 3 4.
if !item=1 !concat("New",!Item)=Val.
end repeat.
!doend
execute.
!enddefine.
* now you just have to call the macro and list all your Item names:
!DoItems ItemList=Item1 Item2 Item3.
The macro will work with any item name, as long as the variables are named ItemName_1, ItemName_2, etc.
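For readers who want to check the scoring logic independently of SPSS, the same rule can be sketched in Python. This is only an illustration under the naming assumption above (ItemName_1 to ItemName_4); the collapse_items helper and the sample row are hypothetical.

```python
def collapse_items(row, items, n_responses=4):
    """Return {"New<item>": response number} for each item name in items."""
    scored = {}
    for item in items:
        scored[f"New{item}"] = None  # stays None if no indicator is set
        for val in range(1, n_responses + 1):
            # Same rule as "if Item=1 NewItem=Val": a later match overwrites.
            if row.get(f"{item}_{val}") == 1:
                scored[f"New{item}"] = val
    return scored


# Hypothetical respondent: indicators that were not ticked are simply absent.
row = {"Item1_2": 1, "Item2_1": 1, "Item3_4": 1}
scores = collapse_items(row, ["Item1", "Item2", "Item3"])
```

As with the sequential IF statements, a respondent who somehow has two indicators set for one item gets the higher-numbered response.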

Renaming a Dataset using a list

I'm new at programming, and I've been banging my head trying to figure this out for work.
I am trying to pull roughly 300 MySQL tables into my MATLAB workspace.
I have attached the following code, which is designed to pull one table (I plan to loop this code through the 300 MySQL tables once it is working).
The code successfully imports a single table into the workspace as a new dataset.
My problem arises when I try to rename this new dataset with the name of the original MySQL table.
Please see code below for this part where I screw up (%Assign data to output variable)
I have a list of all the 300 tables names, and I plan to store them in a list called 'name'... Hence, name(1)... is this the right approach?
For example, the original mysql table was called 'options_20020208'.
After I run the script, I need the new dataset that Matlab imports to be called 'options_20020208' as well.
Any ideas here?
%Define Query
name = 'options_20020208'
%Set preferences with setdbprefs.
setdbprefs('DataReturnFormat', 'dataset');
setdbprefs('NullNumberRead', 'NaN');
setdbprefs('NullStringRead', 'null');
%Make connection to database.
conn = database('', 'root', 'password', 'Vendor', 'MYSQL', 'Server', 'localhost', 'PortNumber', 3306);
%Read data from database.
curs = exec(conn, [['SELECT ',name,'.UnderlyingSymbol , ']...
, [name,'.UnderlyingPrice , ']...
, [name,'.Expiration ']...
, ['FROM ','PriceMatrix.',name,' ']...
]);
curs = fetch(curs);
close(curs);
%Assign data to output variable
name(1) = curs.Data;
%Close database connection.
close(conn);
%Clear variables
clear curs conn
If you have defined a variable name, name(1) means "the first element of variable name" (in this case just "o"). Regardless of the dimensions of the variable it returns a single value (i.e. even if X is some 5-D monstrosity, X(50) returns only the value of the 50th element). name(1) = data means "set the first element of variable name to be equal to data" and will cause an error if data is not of the right size, and either an error or unexpected behaviour if it's not the right type.
For example, try this at the command line:
name = 'options_20020208';
name(1) = 1
Now, technically what you want can be done, although I don't recommend it. If you have all the names in some sort of 300 x (length) variable then over a loop of n = 1:300 it would be something like this (where name is your list of variable names):
eval([name(n,:) ' = curs.Data;'])
This will create 300 variables named 'options_20020208' or similar, each containing one set of curs.Data. However, there are better ways of storing data in your workspace that will make further operations on your data easier. For example, you could use structures:
myStruct(n).name = name(n,:);
myStruct(n).Data = curs.Data;
If you wanted to do some analysis and then save all this data in some format, for example, it's going to be much easier to loop over the structure, setting the filename to [myStruct(n).name,'.csv'] and the file contents to myStruct(n).AdjustedData, etc., than to deal with 300 named variables.
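The same design choice (one container keyed by table name instead of 300 separately named variables) can be sketched in Python for comparison. This is an illustration only: fetch_table is a hypothetical stand-in for the database query and returns dummy rows.

```python
def fetch_table(name):
    """Hypothetical stand-in for the database query; returns dummy rows."""
    return [{"table": name, "UnderlyingPrice": 100.0}]


# Hypothetical list of table names (the real list would have about 300).
names = ["options_20020208", "options_20020211"]

# One container keyed by table name, instead of one variable per table.
datasets = {name: fetch_table(name) for name in names}

# Later processing is a plain loop, e.g. deriving an output filename per table.
filenames = [name + ".csv" for name in datasets]
```

Looking up datasets["options_20020208"] then plays the role of the dynamically named variable, without any use of eval.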
