* Example generated by -dataex-. To install: ssc install dataex
clear
input int(date date2)
18257 16112
18206 16208
17996 16476
18197 17355
18170 17204
end
format %d date
format %d date2
I'm trying to create a loop in Stata that generates four variables (lags at 0 months, 3 months, 12 months, and 18 months). I tried this (below) and I get an error: invalid syntax
foreach x inlist (0,3,12,18) & foreach y inlist (0,90,360,540){
gen var`x' = (date > date2 + `y')
}
Here is a way for me to successfully create these variables without the loop. It would be much nicer if it could be simplified with a loop.
gen var0=(date>date2)
gen var3=(date>date2+90)
gen var12=(date>date2+360)
gen var18=(date>date2+540)
Good news: you need just one loop over 4 possibilities, as 0 3 12 18 and 0 90 360 540 are paired.
foreach x in 0 3 12 18 {
gen var`x' = date > (date2 + 30 * `x')
}
foreach requires either in or of following the macro name, so your code fails at that point. There is also no construct foreach ... & foreach ....: perhaps you are using syntax from elsewhere or just guessing there.
Related
I have a very large dataset but to cut it short I demonstrated the data with the following example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patid death dateofdeath)
1 0 .
2 0 .
3 0 .
4 0 .
5 1 15007
6 0 .
7 0 .
8 1 15526
9 0 .
10 0 .
end
format %d dateofdeath
I am trying to sample for a case-control study based on date of death. At this stage, I need to first create a variable with each date of death repeated for all the participants (hence we end up with a dataset with 20 participants) and a pairid equivalent to the patient id patid of the corresponding case.
I created a macro for one case (which works) but I am finding it difficult to have it repeated for all cases (where death==1) in a loop.
The successful macro is as follows:
local i "5" //patient id who died
gen pairid= `i'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
format matchedindexdate %d
save temp`i'
and the loop I attempted is:
* (min and max patid id)
forval j = 1/10 {
count if patid == `j' & death==1
if r(N)=1 {
gen pairid= `j'
gen matchedindexdate = dateofdeath
replace matchedindexdate=0 if pairid != patid
gsort matchedindexdate
replace matchedindexdate= matchedindexdate[_N]
save temp/matched`j'
}
}
use temp/matched1, clear
forval i=2/10 {
capture append using temp/matched`i'
save matched, replace
}
but I get:
invalid syntax
How can I do the loop?
I finally had it solved, please check:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1591811-how-to-create-a-loop-for-a-macro
I have a dataset as such:
Case #|DateA |Drug.1|Drug.2|Drug.3|DateB.1 |DateB.2 |DateB.3 |IV.1|IV.2|IV.3
------|------|------|------|------|--------|---------|--------|----|----|----
1 |DateA1| X | Y | X |DateB1.1|DateB1.2 |DateB1.3| 1 | 0 | 1
2 |DateA2| X | Y | X |DateB2.1|DateB2.2 |DateB2.3| 1 | 0 | 1
3 |DateA3| Y | Z | X |DateB3.1|DateB3.2 |DateB3.3| 0 | 0 | 1
4 |DateA4| Z | Z | Z |DateB4.1|DateB4.2 |DateB4.3| 0 | 0 | 0
For each case, there are linked variables i.e. Drug.1 is linked with DateB.1 and IV.1 (Indicator Variable.1); Drug.2 is linked with DateB.2 and IV.2, etc.
The variable IV.1 only = 1 if Drug.1 is the case that I want to analyze (in this example, I want to analyze each receipt of Drug "X"), and so on for the other IV variables. Otherwise, IV = 0 if the drug for that scenario is not "X".
I want to calculate the difference between DateA and DateB for each instance where Drug "X" is received.
e.g. In the example above I want to calculate a new variable:
DateDiffA1_B1.1 = DateA1 - DateB1.1
DateDiffA1_B2.1 = DateA1 - DateB2.1
DateDiffA1_B1.3 = DateA1 - DateB1.3
DateDiffA1_B2.3 = DateA1 - DateB2.3
DateDiffA1_B3.3 = DateA1 - DateB3.3
I'm not sure if this new variable would need to be linked to each instance of Drug "X" as for the other variables, or if it could be a single variable that COUNTS all the instances for each case.
The end goal is to COUNT how many times each case had a date difference of <= 2 weeks when they received Drug "X". If they did not receive Drug "X", I do not want to COUNT the date difference.
I will eventually want to compare those who did receive Drug "X" with a date difference <= 2 weeks to those who did not, so having another indicator variable to help separate out these specific patients would be beneficial.
I am unsure about the best way to go about this; I suspect it will require a combination of IF and REPEAT functions using the IV variable, but I am relatively new with SPSS and syntax and am not sure how this should be coded to avoid errors.
Thanks for your help!
EDIT: It seems like I may need to use IV as a vector variable to loop through the linked variables in each case. I've tried the syntax below to no avail:
DATASET ACTIVATE DataSet1.
vector IV = IV.1 to IV.3.
loop #i = .1 to .3.
do repeat DateB = DateB.1 to DateB.3
/ DrugDateDiff = DateDiff.1 to DateDiff.3.
if IV(#i) = 1
/ DrugDateDiff = datediff(DateA, DateB, "days").
end repeat.
end loop.
execute.
Actually there is no need to add the vector and the loop, all you need can be done within one DO REPEAT:
compute N2W=0.
do repeat DateB = DateB.1 to DateB.3 /IV=IV.1 to IV.3 .
if IV=1 and datediff(DateA, DateB, "days")<=14 N2W = N2W + 1.
end repeat.
execute.
This syntax will first put a zero in the count variable N2W. Then it will loop through all the dates, and only if the matching IV is 1, the syntax will compare them to dateA, and add 1 to the count if the difference is <=2 weeks.
if you prefer to keep the count variable as missing when none of the IV are 1, instead of compute N2W=0. start the syntax with:
If any(1, IV.1 to IV.3) N2W=0.
Data is setup with a bunch of information corresponding to an ID, which can show-up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that signifies which instance of the ID I am looking at. Is it the first time, second time, etc? I want it as a string in the form _# so I have to go beyond the simple _n function in Stata, to my knowledge. If someone knows a way to do what I want without the loop let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
Before asking my question, here's a little background so you understand what I'm doing. I'm looking to analyze a very large data set (a little less than 2,000,000 rows). I've parsed the data set into Matlab and built a structure array from this data, giving names, dates, returns, etc for each asset i. Now, I would like to restrict my data set to being between two days, and Matlab doesn't seem to be particularly amenable to that kind of approach. One suggestion that was given to me was to take the dates, which are of the form MM/DD/YYYY and use a delimiter '/' to somehow build three integer arrays for my data structure (which I'd call stock(i).month, stock(i).day, and stock(i).year). However, nothing I'm doing seems to be working, and I'm very much stuck.
What I have been trying to do is something like the following:
%% Dates
fid = fopen('52c6d3831952b24a.csv');
C = textscan(fid, [repmat('%*s ',1,0),'%s %*[^\n]'], 'delimiter',',');
date = C{1}(2:end,1);
fclose(fid);
for i=1:numStock
locate = strcmp(uniquePermno{i},permno);
stock(i+1).date = date(locate);
end;
for i = 1:numStock
stock(i+1).date = char(stock(i+1).date);
D = textscan(stock(i+1).date, '%s %s %s', 'delimiter','/');
stock(i+1).month = D{1}(1:end);
stock(i+1).day = D{2}(1:end);
stock(i+1).year = D{3}(1:end);
end
I initially wanted to save them as integers (and was using %u instead), but I was getting a strange situation where most of my entries were just 0 and the non-zero ones were very large (obviously not what I expected). However, the above form returns the following error:
Error using textscan
Buffer overflow (bufsize = 4095) while reading string from
file (row 1 u, field 1 u). Use 'bufsize' option. See HELP TEXTSCAN.
44444444444444444444455555555555555555555566666666666666666666677777777777777777777778888888888888888888889999999999999999999990000000000000000000000011111111111111111112222222222222222222222111111111
Error in makeData_CRSP (line 87)
D = textscan(stock(i+1).date, '%s %s %s', 'delimiter','/');
So I'm honestly at a loss for how to approach this. What am I doing wrong? Seeing how I saved my dates vectors for my data structure, is this the best way to approach this problem?
You can use the datenum function to convert dates into numbers. The syntax is datenum(dateString, format). For example, if your dates are in the format YYYY MM DD then that would be
datenum('2012 12 04', 'yyyy mm dd')
Once you converted all your dates like that you can simply compare the resulting numbers using > and <:
>> datenum('2012 12 04', 'yyyy mm dd') > datenum('2012 12 03', 'yyyy mm dd')
ans =
1
>> datenum('2012 12 04', 'yyyy mm dd') > datenum('2012 12 05', 'yyyy mm dd')
ans =
0
I have the following table in Excel (blank spaces are empty):
A B C D
1 1
2 3
3 4
4 -2
5 4
6 9
7 8
8
9
10
I would like to return the minimum of column A from A1 to A1000000, using the QUARTILE function, while excluding all negative values. The reason I want it from A1 to A1000000 and not A1 to A7 is because I want to update the table (adding new rows starting from A8) and have the formula also automatically update. The reason I want the QUARTILE and not MIN function is because I will be extending it to calculate other statistics like 1st and 3rd quartile.
This function works correctly and returns 1 (pressing ctrl+shift+enter):
QUARTILE(IF(A1:A7 > -1, A1:A7), 0)
However, when I tried the following, it returned 0 when it should still return 1 (pressing ctrl+shift+enter):
QUARTILE(IF(A1:A1000000 > -1, A1:A1000000), 0)
I also tried the following and it returned 0 (pressing ctrl+shift+enter):
QUARTILE(IF(AND(NOT(ISBLANK(A1:A1000000)), A1:A1000000 > -1), A1:A1000000), 0)
Anybody have a solution to my problem?
Create a dynamic named range, called for example, rng, defined by =OFFSET($A$1,0,0,COUNT($A1:$A10000),1)
Then modify your array formula to refer to rng, via =QUARTILE(IF(rng >-1,rng), 0)
Actually what you have works. Try doing:
=QUARTILE(IF(A:A > 0,A:A ),0)
The reason you are returning 0 is that a blank cell is considered to be of the value 0 when this formula is ran. For example, erase one of the values in the A1:A7 range and your original formula will return 0. Also, I would run the formula on the entire A column if possible (for readability, etc.)
Or do you need to return a "0" if that number is in the list?