Stata Nested foreach loop substring comparison - loops

I have just started learning Stata and I'm having a hard time.
My problem is this: I have two different variables, ATC and A, where A is potentially a substring of ATC.
Now I want to mark all the observations in which A is a substring of ATC with OK = 1.
I tried this using a simple nested loop:
foreach x in ATC {
foreach j in A {
replace OK = 1 if strpos(`x',`j')!=0
}
}
However, whenever I run this loop no changes are being made even though there should be plenty.
I feel like I should probably give an index specifying which OK is being changed (the one belonging to the ATC/x), but I have no idea how to do this. This is probably really simple but I've been struggling with it for some time.
I should have clarified: my A list is separate from the main list (simply appended to it) and only contains unique keys which I use to identify the ATCs which I want. So I have ~120 A-keys and a couple million ATC keys. What I wanted to do was iterate over every ATC key for every single A-key and mark those ATC-keys with A that qualify.
That means I don't have complete tuples of (ATC,A,OK) but instead separate lists of different sizes.
For example: I have
ATC OK A
ABCD 0 .
EFGH 0 .
... ... ...
. . AB
. . ET
and want the result that "ABCD" having OK is marked as 1 while "EFGH" remains at 0.

We can separate your question into two parts. Your title implies a problem with loops, but your loops are just equivalent to
replace OK = 1 if strpos(ATC, A)!=0
so the use of looping appears irrelevant. That leaves the substring comparison.
Let's supply an example:
. set obs 3
obs was 0, now 3
. gen OK = 0
. gen A = cond(_n == 1, "42", "something else")
. gen ATC = "answer is 42"
. replace OK = 1 if strpos(ATC, A) != 0
(1 real change made)
. list
+------------------------------------+
| OK A ATC |
|------------------------------------|
1. | 1 42 answer is 42 |
2. | 0 something else answer is 42 |
3. | 0 something else answer is 42 |
+------------------------------------+
So it works fine; and you really need to give a reproducible example if you think you have something different.
As for specifying where the variable should be changed: your code does precisely that, as again the example above shows.
The update makes the problem clear. Stata will only look in the same observation for a matching substring when you specify the syntax you gave. A variable in Stata is a field in a dataset. To cycle over a set of values, something like this should suffice
gen byte OK = 0
levelsof A, local(Avals)
quietly foreach A of local Avals {
replace OK = 1 if strpos(ATC, `"`A'"') > 0
}
Notes:
Specifying byte cuts down storage.
You may need an if or in restriction on levelsof.
quietly cuts out messages about changed values. When debugging, it is often better left out.
> 0 could be omitted as a positive result from strpos() is automatically treated as true in logical comparisons. See this FAQ.

Related

#VALUES! while using IF and OR together

I have the File as following format
Name Number Position
A 1
B 2
C 3
D 4
Now on position A3 , I applied =IF(B2=1,"Goal Keeper",OR(IF(B2=2,"Defender",OR(IF(B2=3,"MidField","Striker"))))) But it giving me an error #value!
Looked up at google, and my formula is correct.
What i basically want it
1- Goalkeeper 2-Defender 3-Midfield 4-Striker
Yes the other way is to to just filter the number and copy paste the text
But I want to do it using formula and want to know where did I go wrong.
Your immediate problem lies with the expression (for example):
OR(IF(B2=3,"MidField","Striker"))
| \__/ \________/ \_______/ |
| bool string string |
\____________________________/
string
The OR function expects a series of boolean values (true or false) and you're giving it a string value from the inner IF.
You don't actually need the or bits in this specific case, the if is a full if-else. So you can just use:
=IF(B1=1,"Goal Keeper",IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")))
This means that B1=1 will result in "Goal Keeper", otherwise it will evaluate IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")).
Then that means that, if B2=2, it will result in "Defender", otherwise it will evaluate IF(B2=3,"MidField","Striker").
Finally, that means the B2=3 will result in "MidField", anything else will give "Striker".
The only situation I can envisage when OR would come in handy here would be when two different numbers were to generate the same string. Let's say both 1 and 4 should give "Goalie", you could use:
=IF(OR(B1=1,B1=4),"Goalie",IF(B2=2,"Defender","MidField"))
Keep in mind that a more general solution would be better implemented with the Excel lookup functions, ones that would search a table (on the spreadsheet somewhere) which mapped the integers to strings. Then, if the mapping needed to change, you would just update the table rather than going back and changing the formula in every single row.
If you are actually tasked with solving the problem by using the IF and OR function within the same equation, this is the only way I can see how:
=IF(OR(B1=1, B1 = 2, B1 = 3, B1 = 4),IF(B1 = 1, "Goal Keeper", IF(B1 = 2,"Defender",IF(B1 = 3,"MidField","Striker")))
If B1 does not equal 1-4, the OR function will return FALSE and completely bypass all of the nested IF statements.

Stata: assign numbers in a loop

I have a problem creating a loop in Stata.
I have a dataset in Stata where I classified my observations into 6 categories via variable k10. So k10 takes on values 1,2,3,4,5,6.
Now I want to assign each observation one value according to its class:
value 15 for k10=1
value 10 for k10=2
value 8 for k10=3
value 5 for k10=4
value 4 for k10=5
value 2 for k10=6
It is easy if I create a new variable w10 and do it like the following:
gen w10 =.
replace w10 = 15 if k10==1
replace w10 = 10 if k10==2
replace w10 = 8 if k10==3
replace w10 = 5 if k10==4
replace w10 = 4 if k10==5
replace w10 = 2 if k10==6
Now I tried to simplify the code by using a loop, unfortunately it does not do what I want to achieve.
My loop:
gen w10=.
local A "1 2 3 4 5 6"
local B "15 10 8 5 4 2"
foreach y of local A {
foreach x of local B {
replace w10 = `x' if k10= `y'
}
}
The loop assigns value 2 to each observation though. The reason is that the if-condition k10=`y' is always true and overwrites the replaced w10s each time until the end, right?
So how can I write the loop correctly?
It's really just one loop, not two nested loops. That's your main error, which is general programming logic. Only the last time you go through the inner loop has an effect that lasts. Try tracing the loops by hand to see this.
Specifically in Stata, looping over the integers 1/6 is much better done with forval; there is no need at all for the indirection of defining a local macro and then obliging foreach to look inside that macro. That can be coupled with assigning the other values to local macros with names 1 ... 6. Here tokenize is the dedicated command to use.
Try this:
gen w10 = .
tokenize "15 10 8 5 4 2"
quietly forval i = 1/6 {
replace w10 = ``i'' if k10 == `i'
}
Note incidentally that you need == not = when testing for equality.
See (e.g.) this discussion.
Many users of Stata would want to do it in one line with recode. Here I concentrate on the loop technique, which is perhaps of wider interest.

foreach using numlist of numbers with leading 0s

In Stata, I am trying to use a foreach loop where I am looping over numbers from, say, 05-11. The problem is that I wish to keep the 0 as part of the value. I need to do this because the 0 appears in variable names. For example, I may have variables named Y2005, Y2006, Var05, Var06, etc. Here is an example of the code that I tried:
foreach year of numlist 05/09 {
...do stuff with Y20`year` or with Var`year`
}
This gives me an error that e.g. Y205 is not found. (I think that what is happening is that it is treating 05 as 5.)
Also note that I can't add a 0 in at the end of e.g. Y20 to get Y200 because of the 10 and 11 values.
Is there a work-around or an obvious thing I am not doing?
Another work-around is
forval y = 5/11 {
local Y : di %02.0f `y'
<code using local Y, which must be treated as a string>
}
The middle line could be based on
`: di %02.0f `y''
so that using another macro can be avoided, but at the cost of making the code more cryptic.
Here I've exploited the extra fact that foreach over such a simple numlist is replaceable with forvalues.
The main trick here is documented here. This trick avoids the very slight awkwardness of treating 5/9 differently from 10/11.
Note. To understand what is going on, it often helps to use display interactively on very simple examples. The detail here is that Stata is happily indifferent to leading zeros when presented with numbers. Usually this is immaterial to you, or indeed a feature as when you appreciate that Stata does not insist on a leading zero for numbers less than 1.
. di 05
5
. di 0.3
.3
. di .3
.3
Here we really need the leading zero, and the art is to see that the problem is one of string manipulation, the strings such as "08" just happening to contain numeric characters. Agreed that this is obvious only when understood.
There's probably a better solution but here's how this one goes:
clear
set more off
*----- example data -----
input ///
var2008 var2009 var2010 var2011 var2012
0 1 2 3 4
end
*----- what you want -----
numlist "10(1)12"
local nums 08 09 `r(numlist)'
foreach x of local nums {
display var20`x'
}
The 01...09 you can insert manually. The rest you build with numlist. Put all that in a local, and finally use it in the loop.
As you say, the problem with your code is that Stata will read 5 when given 05, if you've told it is a number (which you do using numlist in the loop).
Another solution would be to use an if command to count the number of characters in the looping value, and then if needed you can add a leading zero by reassigning the local.
clear
input var2008 var2009 var2010 var2011 var2012
0 1 2 3 4
end
foreach year of numlist 08/12{
if length("`year'") == 1 local year 0`year'
di var20`year'
}

Stata: Count a variable by another one?

My little Stata Problem:
I have a table like this:
I want to create a variable that counts the number of different cat for each citing. This is... For the A citing there are 2 cat... the 3 and the 6. So I want another variable (dif_cat) with two 2.
For this sample it would look something like this:
I have tried different methods I always feel like I am getting close but then I can't do it.
I tried bysort with preserve and restore but I don't seem to get there.
One attempt was:
egen tag = tag(cat citing)
egen distinct = total(tag), by(citing)
Can you help me?
PS: I know this has nothing to do with Stata (but it may inspire someone) with an actually programming language I would try something such as:
Having a cycle doing citing column and checking if equal to the one before
Having an auxiliary empty vector
Having a second cycle within the first that wouldsee if the current cat was in the vector and if not put it there.
When the citing changed I would count the lenght of the auxiliary matrix, reset it and do it again. The problem is that I need this in Stata code :S
One way (from Stata FAQ) is:
clear all
set more off
input ///
str1 citing cat
A 3
A 6
B 5
B 2
B 5
B 2
C 2
C 4
C 3
D 5
E 1
E 1
end
list, sepby(citing)
bysort citing cat: gen numvals = (_n == 1)
by citing: replace numvals = sum(numvals)
by citing: replace numvals = numvals[_N]
list, sepby(citing)

Display data in Stata loop

I have a loop in Stata 12 that looks at each record in a file and if it finds a flag equal to 1 it generates five imputed values. My code looks like this:
forvalues i=1/5 {
gen y3`i' = y2
gen double h`i' = (uniform()*(1-a)+a) if flag==1
replace y3`i' = 1.6*(invibeta(7.2,2.6,h`i')/(1-invibeta(7.2,2.6,h`i')))^(1/1.7) if
flag==1
}
a is defined elsewhere. I want to check the individual imputations. Thus, I need to display the imputed variable preferably only for those cases where flag=1. I also would like to display another value, s, alongside. I need help in figuring out the syntax. I've tried every combination of quotes and subscripts that I can think of, but I keep getting error messages.
One other useful modification occurs to me. Suppose I have three concatenated files on which I want to perform this routine. Let them have a variable file equal to 1, 2 or 3. I'd like to set a separate seed for each and do it in my program so I have a record. I envision something like:
forvalues j=1/3 {
set seed=12345 if file=1
set seed=56789 if file=2
set seed=98765 if file=3
insert code above
}
Will this work?
No comment is possible on code you don't show, but the word "display" may be misleading you.
list y3`i' if flag == 1
or some variation may be what you seek. Note that display is geared to showing at most one line of output at a time.
P.S. As you are William Shakespeare, know that the mug http://www.stata.com/giftshop/much-ado-mug/ was inspired by your work.
A subsidiary question asks about choosing a different seed each time around a loop. That is easy:
forval j = 1/3 {
local seed : word `j' of 12345 56789 98765
set seed `seed'
...
}
or
tokenize 12345 56789 98765
forval j = 1/3 {
set seed ``j''
...
}

Resources