How to create a loop for a macro in Stata?

I have a very large dataset; to keep things short, the following example illustrates the data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patid death dateofdeath)
1 0 .
2 0 .
3 0 .
4 0 .
5 1 15007
6 0 .
7 0 .
8 1 15526
9 0 .
10 0 .
end
format %d dateofdeath
I am trying to sample for a case-control study based on date of death. At this stage, I first need to create a variable in which each date of death is repeated for all participants (so with 10 participants and 2 cases we end up with a dataset of 20 observations), together with a pairid equal to the patient id (patid) of the corresponding case.
I created a macro for one case (which works) but I am finding it difficult to have it repeated for all cases (where death==1) in a loop.
The successful macro is as follows:
local i "5" //patient id who died
gen pairid= `i'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
format matchedindexdate %d
save temp`i'
and the loop I attempted is:
* (min and max patid id)
forval j = 1/10 {
count if patid == `j' & death==1
if r(N)=1 {
gen pairid= `j'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
save temp/matched`j'
}
}
use temp/matched1, clear
forval i=2/10 {
capture append using temp/matched`i'
save matched, replace
}
but I get:
invalid syntax
How can I do the loop?

I finally got this solved; see:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1591811-how-to-create-a-loop-for-a-macro
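The linked thread is not reproduced here, so the following is only a minimal sketch of one way to write the loop, not necessarily the solution posted there. It assumes that a likely cause of the error is `if r(N)=1` (equality tests in Stata are written with `==`). It also sidesteps two further problems in the attempt: `gen pairid` would fail for the second case because the variable already exists, and `temp/matched1` is never written because patient 1 is not a case. The names `cases`, `base` and `stacked` are illustrative, not from the original.

```stata
* Sketch: one copy of the cohort per case, stacked into a single dataset
clear
input float(patid death dateofdeath)
1 0 .
2 0 .
3 0 .
4 0 .
5 1 15007
6 0 .
7 0 .
8 1 15526
9 0 .
10 0 .
end
format %d dateofdeath

* Collect the patid of every case, so no min/max id needs hard-coding
levelsof patid if death == 1, local(cases)

* Temporary files (assumed names) for the original data and the stacked result
tempfile base stacked
save `base'

local first = 1
foreach j of local cases {
    use `base', clear
    * The case has a single observation, so the mean is its date of death
    summarize dateofdeath if patid == `j', meanonly
    generate pairid = `j'
    generate matchedindexdate = r(mean)
    format matchedindexdate %d
    * Stack onto the copies built for earlier cases
    if `first' == 0 {
        append using `stacked'
    }
    save `stacked', replace
    local first = 0
}

use `stacked', clear
sort pairid patid
```

With the example data this gives 20 observations: 10 with pairid 5 and 10 with pairid 8, each carrying the corresponding date of death in matchedindexdate.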

Related

How can I use a loop to create lag variables?

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(date date2)
18257 16112
18206 16208
17996 16476
18197 17355
18170 17204
end
format %d date
format %d date2
I'm trying to create a loop in Stata that generates four variables (lags at 0 months, 3 months, 12 months, and 18 months). I tried this (below) and I get an error: invalid syntax
foreach x inlist (0,3,12,18) & foreach y inlist (0,90,360,540){
gen var`x' = (date > date2 + `y')
}
Here is a way for me to successfully create these variables without the loop. It would be much nicer if it could be simplified with a loop.
gen var0=(date>date2)
gen var3=(date>date2+90)
gen var12=(date>date2+360)
gen var18=(date>date2+540)
Good news: you need just one loop over 4 possibilities, as 0 3 12 18 and 0 90 360 540 are paired.
foreach x in 0 3 12 18 {
gen var`x' = date > (date2 + 30 * `x')
}
foreach requires either in or of following the macro name, so your code fails at that point. There is also no construct foreach ... & foreach ...; perhaps you are using syntax from elsewhere, or just guessing.
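When the paired values are not a simple multiple of each other, one common pattern is to loop over positions and pull the matching word from each list. A minimal sketch, assuming the local macro names `months`, `days` and `n` (none of which appear in the original) and the example data from the question:

```stata
clear
input int(date date2)
18257 16112
18206 16208
17996 16476
18197 17355
18170 17204
end
format %d date
format %d date2

* Two parallel lists: the i-th word of one pairs with the i-th word of the other
local months 0 3 12 18
local days   0 90 360 540
local n : word count `months'

forvalues i = 1/`n' {
    local x : word `i' of `months'
    local y : word `i' of `days'
    gen var`x' = date > (date2 + `y')
}
```

This creates var0, var3, var12 and var18, the same variables as the four separate gen statements in the question.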

Loop Postestimation Tests after Regression

I run a loop over a number of regressions, and for each regression I need to run a heteroscedasticity test. Unfortunately, the following code does not work:
gen p_hettest = .
quietly forvalues i = 1/10 {
reg y x if id == `i'
estat hettest if id == `i'
replace p_hettest=r(p) if id == `i'
}
Here is a data sample:
clear
input float(y x id)
-.006994963 -7.015742e-06 1
.002128173 2.7695405e-06 1
.01837084 .000015578877 1
-.018459747 -.000017552491 1
-.008869853 -8.115663e-06 1
0 0 1
.00081374 1.039456e-06 1
.0192536 .00001801726 1
-.004777103 -2.800596e-06 1
.006691461 4.95152e-06 1
-.015235436 -.000015264517 1
.03523033 -.00001293428 2
.037114896 .00001956828 2
.0041321944 -6.849998e-06 2
-.000645176 .000012979223 2
-.015742416 -4.716876e-06 2
.005813865 -2.943401e-06 2
.00220989 -4.920239e-06 2
.003843212 8.216926e-06 2
.013684767 -4.7989766e-07 2
.02013146 3.841124e-07 2
.0714285 2.9144696e-06 3
.02564108 6.107174e-06 3
-.01336905 -7.19949e-06 3
0 .000031617565 3
.034420278 3.418627e-06 3
-.04042552 .00004654335 3
.03571425 .000024398614 3
-.002500042 -3.514139e-06 3
-.04651165 -.00004515287 3
.05263159 -7.449272e-06 3
.08727269 -7.16101e-06 3
end
An r(101) error occurs, with the message "if not allowed".
Is there an alternative way to loop regress-postestimation tests?
The issue is that estat hettest does not accept if qualifiers. It is a postestimation command, so it works from the most recently fitted regression and therefore already uses only the observations selected by the if qualifier on reg.
If you modify your code to look like:
gen p_hettest = .
quietly forvalues i = 1/10 {
reg y x if id == `i'
estat hettest
replace p_hettest=r(p) if id == `i'
}
you should be all set.
If you take off the quietly, you can see that the value of r(p) changes with each call of estat hettest.
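If the id values do not run contiguously from 1 to 10, looping over the values actually present avoids fitting a regression on an empty group. A minimal sketch, assuming Stata's shipped auto dataset as a stand-in (price, weight and foreign play the roles of y, x and id) and relying on r(p) as used in the answer above; the macro name `groups` is illustrative:

```stata
sysuse auto, clear

gen p_hettest = .

* Loop only over the group values that occur in the data
levelsof foreign, local(groups)

quietly foreach g of local groups {
    regress price weight if foreign == `g'
    * No if qualifier: the test uses the sample of the regression just fitted
    estat hettest
    replace p_hettest = r(p) if foreign == `g'
}

* One p-value per group
tabulate foreign, summarize(p_hettest)
```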

SPSS: using IF function with REPEAT when each case has multiple linked instances

I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 only = 1 if Drug.1 is the case that I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that scenario is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA1_B2.1 = DateA1 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA1_B2.3 = DateA1 - DateB2.3
DateDiffA1_B3.3 = DateA1 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually, there is no need for the vector and the loop; everything you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax first puts a zero in the count variable N2W. It then loops through all the dates and, only where the matching IV is 1, compares each date to DateA and adds 1 to the count if the difference is <= 2 weeks.
If you prefer to keep the count variable missing when none of the IV variables is 1, replace compute N2W=0. at the start of the syntax with:
If any(1, IV.1 to IV.3) N2W=0.

Using SAS to check if columns have specified characteristics

I have a dataset that looks like the one below. Each row is a different observation that has anywhere from 1 to x values (in this case x=3). I want to create a dataset that contains the original information plus four additional columns (one for each of the four values of Bin present in the dataset). The column freq_Bin_1 will be 1 if there are any 1's present in that row, and missing otherwise. The column freq_Bin_2 will be 1 if there are any 2's present, and so on.
Both the number of Bins and the number of columns in the original dataset may vary.
data have;
input Bin_1 Bin_2 Bin_3;
cards;
1 . .
3 . .
1 1 .
3 2 1
3 4 .
;
run;
Here is my desired output:
data want_this;
input Bin_1 Bin_2 Bin_3 freq_Bin_1 freq_Bin_2 freq_Bin_3 freq_Bin_4;
cards;
1 . . 1 . . .
3 . . . . 1 .
1 1 . 1 . . .
3 2 1 1 1 1 .
3 4 . . . 1 1
;
run;
I have an array solution that I think is pretty close, but I can't quite get it. I am also open to other methods.
data want;
set have;
array Bins {&max_freq.} Bin:;
array freq_Bin {&num_bin.} freq_Bin_1-freq_Bin_&num_bin.;
do j=1 to dim(Bins);
freq_Bin(j)=.;
end;
do k=1 to dim(freq_Bin);
if Bins(k)=1 then freq_Bin(1)=1;
else if Bins(k)=2 then freq_Bin(2)=1;
else if Bins(k)=3 then freq_Bin(3)=1;
else if Bins(k)=4 then freq_Bin(4)=1;
end;
drop j k;
run;
This should work:
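data want;
set have;
array Bins{*} Bin:;
/* name the new variables explicitly so they match freq_Bin_1-freq_Bin_4 */
array freq_Bin{4} freq_Bin_1-freq_Bin_4;
do k=1 to dim(Bins);
/* skip missing values: they cannot be used as an array index */
if Bins(k) ne . then freq_Bin(Bins(k))=1;
end;
drop k;
run;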
I tweaked your code somewhat, but really the only problem was that you need to check that Bins(k) isn't missing before trying to use it to index an array. Also, there's no need to initialize the values to missing as that's the default.

Setting Up a Dynamic Stopping Point for a Loop

The data are set up with several pieces of information for each ID, and an ID can show up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that identifies which instance of the ID I am looking at: is it the first time, the second time, and so on? I want it as a string in the form _#, so as far as I know I have to go beyond Stata's simple _n. If someone knows a way to do this without the loop, let me know, but I would still like the loop answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +Data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
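For the loop itself with a dynamic stopping point, here is a minimal sketch, assuming the variable names from the question and a hypothetical local macro maxcount. After summarize with the meanonly option, r(max) holds the largest value of count_one, which can then serve as the upper limit of forvalues.

```stata
clear
input byte ID str1 Data
1 "X"
1 "Y"
2 "A"
2 "B"
2 "Z"
3 "X"
end

bysort ID: gen count_one = _n
gen count_two = ""

* Largest number of instances of any ID, taken from the data
summarize count_one, meanonly
local maxcount = r(max)

quietly forvalues j = 1/`maxcount' {
    replace count_two = "_`j'" if count_one == `j'
}

list, sepby(ID)
```

Because the limit is recomputed from the data on every run, nothing needs editing when the weekly maximum changes.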
