Inflow/Outflow Count in Stata - loops

I have the following data
id pair_id id_in id_out date
1 1 2 3 1/1/2010
2 1 2 3 1/2/2010
3 1 3 2 1/3/2010
4 1 3 2 1/5/2010
5 1 3 2 1/7/2010
6 2 2 1 1/2/2010
7 3 1 3 1/5/2010
8 2 1 2 1/7/2010
At any given row I want to know the inflow/outflow differential between the unique pair id_in and id_out, from the perspective of id_in.
For example, for id_in == 2 and id_out == 3 it would look like the following (from the perspective of id_in == 2):
id pair_id id_in id_out date inflow_outflow
1 1 2 3 1/1/2010 1
2 1 2 3 1/2/2010 2
3 1 3 2 1/3/2010 1
4 1 3 2 1/5/2010 0
5 1 3 2 1/7/2010 -1
Explanation: id_in == 2 received first, so they get +1; then they received again, so +2. Then they gave out, so the total is reduced by 1, bringing it to 1 at that point, and so on.
This is what I have tried
sort pair_id id_in date
gen count = 0
qui forval i = 2/`=_N' {
local I = `i' - 1
count if id_in == id_out[`i'] in 1/`I'
replace count = r(N) in `i'
}

I don't follow all the logic here and in particular presenting transactions from the point of view of one member seems quite arbitrary. But a broad impression from loosely similar problems is that you should not be thinking about loops here. It should suffice to use by: and cumulative sums. There is an attempt at some systematic discussion of how to handle dyads at http://www.stata-journal.com/sjpdf.html?articlenum=dm0043 but it is only a beginning.
Please note that presenting dates according to some display format is a small pain as they need to be reverse engineered. dataex from SSC can be used to create examples that are easy to copy and paste.
This code may suggest some technique:
clear
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
assert id_in != id_out
gen pair1 = cond(id_in < id_out, id_in, id_out)
gen pair2 = cond(id_in < id_out, id_out, id_in)
bysort pair_id (date): gen sum1 = sum(id_in == pair1) - sum(id_out == pair1)
bysort pair_id (date): gen sum2 = sum(id_in == pair2) - sum(id_out == pair2)
list date id_* pair? sum?, sepby(pair_id)
     +----------------------------------------------------------+
     |      date   id_in   id_out   pair1   pair2   sum1   sum2 |
     |----------------------------------------------------------|
  1. | 01jan2010       2        3       2       3      1     -1 |
  2. | 02jan2010       2        3       2       3      2     -2 |
  3. | 03jan2010       3        2       2       3      1     -1 |
  4. | 05jan2010       3        2       2       3      0      0 |
  5. | 07jan2010       3        2       2       3     -1      1 |
     |----------------------------------------------------------|
  6. | 02jan2010       2        1       1       2     -1      1 |
  7. | 07jan2010       1        2       1       2      0      0 |
     |----------------------------------------------------------|
  8. | 05jan2010       1        3       1       3      1     -1 |
     +----------------------------------------------------------+

A specific pair (as defined by pair_id) always consists of two entities, which can be ordered in one of two ways: for example, entity 5 with entity 8, or entity 8 with entity 5. If one is receiving, the other is necessarily giving out.
Two slightly different ways of approaching the problem can be found below.
clear all
set more off
*----- example data -----
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
drop sdate
sort pair_id date id
list, sepby(pair_id)
*---- what you want -----
// approach 1
bysort pair_id (date id) : gen sum1 = sum(cond(id_in == id_in[1], 1, -1))
gen sum2 = -1 * sum1
// approach 2
bysort pair_id (id_in date id) : gen temp = cond(id_in == id_in[1], 1, -1)
bysort pair_id (date id) : gen sum100 = sum(temp)
gen sum200 = -1 * sum100
// list
drop temp
sort pair_id date
list, sepby(pair_id)
The first approach involves creating a variable that holds the differential for the entity that first receives according to the date variable. sum1 does just that. Variable sum2 holds the differential for the other entity.
The second approach creates a variable that holds the differential for the entity that has the smallest identifying number. I've named it sum100. Variable sum200 holds the information for the other entity.
Note that I added id to the sorting list in case pair_id date does not uniquely identify observations.
The second approach is equivalent to the code provided by @NickCox, or so I believe.
The results:
. list, sepby(pair_id)
     +---------------------------------------------------------------------------+
     | id   pair_id   id_in   id_out        date   sum1   sum2   sum100   sum200 |
     |---------------------------------------------------------------------------|
  1. |  1         1       2        3   01jan2010      1     -1        1       -1 |
  2. |  2         1       2        3   02jan2010      2     -2        2       -2 |
  3. |  3         1       3        2   03jan2010      1     -1        1       -1 |
  4. |  4         1       3        2   05jan2010      0      0        0        0 |
  5. |  5         1       3        2   07jan2010     -1      1       -1        1 |
     |---------------------------------------------------------------------------|
  6. |  6         2       2        1   02jan2010      1     -1       -1        1 |
  7. |  8         2       1        2   07jan2010      0      0        0        0 |
     |---------------------------------------------------------------------------|
  8. |  7         3       1        3   05jan2010      1     -1        1       -1 |
     +---------------------------------------------------------------------------+
Check them carefully, as the difference between both approaches is subtle, at least initially.
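The belief that the second approach matches the other answer's code can be checked directly on the example data. The sketch below is only a check under that data: it rebuilds sum100 and compares it with the running differential for the lower-numbered member of each pair (the variable names lowsum and pair1 are just for illustration). Note that the match relies on the lower-numbered member appearing as id_in at least once within its pair; if it never does, id_in[1] is the other member and the two results would differ in sign.

```stata
clear
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td

* approach 2 from this answer
bysort pair_id (id_in date id) : gen temp = cond(id_in == id_in[1], 1, -1)
bysort pair_id (date id) : gen sum100 = sum(temp)

* running differential for the lower-numbered member of each pair
gen pair1 = cond(id_in < id_out, id_in, id_out)
bysort pair_id (date id) : gen lowsum = sum(id_in == pair1) - sum(id_out == pair1)

* stops with an error if the two ever disagree
assert sum100 == lowsum
```

If the assert passes silently, the two computations agree for every observation in this dataset.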

Related

Nested for-loop: error variable already defined

I have a nested loop in Stata with four levels of foreach statements. With this loop, I am trying to create a new variable named strata that ranges from 1 to 40.
foreach x in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 {
    foreach r in 1 2 3 4 5 {
        foreach s in 1 2 {
            foreach a in 1 2 3 4 {
                gen strata= `x' if race==`r' & sex==`s' & age==`a'
            }
        }
    }
}
I get an error :
"variable strata already defined"
Even with the error, the loop does assign strata = 1, but not the rest of the strata. All other cells are missing/empty.
Example data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(age sex race)
1 2 2
1 2 1
1 1 1
1 1 1
1 2 1
2 2 1
2 2 1
4 2 1
1 2 1
4 2 1
3 2 1
2 2 1
4 2 1
4 2 2
3 2 1
4 1 3
4 2 1
4 2 1
2 1 2
4 2 1
2 2 1
3 2 1
3 2 1
1 2 3
4 2 1
1 2 5
4 2 1
4 2 1
4 2 2
4 2 1
2 2 1
4 1 1
3 2 1
1 2 1
2 2 1
4 2 1
1 2 2
2 2 3
1 1 3
4 2 1
2 2 3
1 2 1
1 1 1
2 2 3
1 2 1
1 1 3
1 2 1
2 2 1
3 2 1
1 2 1
4 2 1
1 2 2
1 2 1
2 2 1
4 2 1
4 2 1
1 2 1
1 2 1
4 2 1
2 2 1
4 2 1
1 2 1
1 1 3
2 2 1
1 1 1
4 1 1
3 2 1
2 2 1
1 2 1
1 1 1
2 2 3
4 2 2
2 2 1
2 2 1
3 2 1
2 2 2
3 2 1
2 1 1
1 1 1
3 2 1
1 2 3
4 2 1
4 2 1
2 2 1
1 2 1
1 1 1
3 2 1
4 2 1
2 2 3
1 2 3
4 2 1
3 2 1
2 2 1
4 2 1
3 2 1
2 1 1
1 2 1
2 2 1
2 2 3
1 1 1
end
label values sex sex
label def sex 1 "male (1)", modify
label def sex 2 "female (2)", modify
label values race race
label def race 1 "non-Hispanic white (1)", modify
label def race 2 "black (2)", modify
label def race 3 "AAPI/other (3)", modify
label def race 5 "Hispanic (5)", modify
generate is for generating new variables. The second time your code reaches a generate statement, the code fails for the reason given.
One answer is that you need to generate your variable outside the loops and then replace inside.
For other reasons your code can be rewritten in stages.
First, integer sequences can be more easily and efficiently specified with forvalues, which can be abbreviated: I tend to write forval.
gen strata = .
forval x = 1/40 {
    forval r = 1/5 {
        forval s = 1/2 {
            forval a = 1/4 {
                replace strata = `x' if race==`r' & sex==`s' & age==`a'
            }
        }
    }
}
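Second, the code is flawed anyway. On each pass through x, the three inner loops visit every combination of race, sex and age, so every observation is overwritten each time and everything ends up as 40!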
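A small demonstration of that flaw, as a sketch on a made-up dataset with one observation per combination of race, sex and age (the dataset is an assumption for illustration, not the asker's data):

```stata
clear
set obs 5
gen race = _n
expand 2
bysort race : gen sex = _n
expand 4
bysort race sex : gen age = _n

gen strata = .
quietly forval x = 1/40 {
    forval r = 1/5 {
        forval s = 1/2 {
            forval a = 1/4 {
                replace strata = `x' if race == `r' & sex == `s' & age == `a'
            }
        }
    }
}

* each pass through x rewrites all 40 cells, so only the last pass survives
assert strata == 40
```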
Third, you can do allocations much more directly, say by
gen strata = 8 * (race - 1) + 4 * (sex - 1) + age
This is a self-contained reproducible demonstration:
clear
set obs 5
gen race = _n
expand 2
bysort race : gen sex = _n
expand 4
bysort race sex : gen age = _n
gen strata = 8 * (race - 1) + 4 * (sex - 1) + age
isid strata
Clearly you can and should vary the recipe for a different preferred scheme.
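One possible variation, offered as a sketch rather than a recommendation, is to let egen's group() function number the cells. On the complete design used in the demonstration it reproduces the formula exactly (strata2 is just an illustrative name). Where some combinations are absent, as with race in the example data, which has no value 4, group() numbers only the combinations that occur, so the codes would then differ from the formula.

```stata
clear
set obs 5
gen race = _n
expand 2
bysort race : gen sex = _n
expand 4
bysort race sex : gen age = _n

gen strata = 8 * (race - 1) + 4 * (sex - 1) + age

* group() numbers the observed combinations 1, 2, ... in sorted order
egen strata2 = group(race sex age)
assert strata == strata2
```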

In a panel data of products (per store/week), identify products that are present in all store/week panels

In the data set below, store/week is a panel. I would like to show that products 4, 5, 6 are present in all the panels. Binary variable present1 indicates it.
Similarly, I would like to recognize the presence of products in the corresponding number of occurrences. Categorical variable present2 indicates it.
clear
input expbasedem price product store str7 l5 week present1 present2
1.1 5.3 1 1 Ana 1 0 1
1.1 2.5 3 1 Bob 1 0 3
1.1 1 4 1 Brian 1 1 4
2.1 12 5 1 Brian 1 1 4
3.1 12 6 1 Suming 1 1 4
12 4 2 2 Ana 1 0 2
12 3.5 3 2 Bob 1 0 3
10 2 4 2 Brian 1 1 4
25 13 5 2 Brian 1 1 4
35 13 6 2 Suming 1 1 4
35.3 5.3 7 1 Bob 2 0 1
12.3 2.5 8 1 Brian 2 0 1
10.3 1 4 1 Brian 2 1 4
35.3 12 5 1 Bobby 2 1 4
35.3 12 6 1 Becky 2 1 4
23.4 4 2 2 Icarus 2 0 2
12.4 3.5 3 2 Xerox 2 0 3
10.4 2 4 2 Yulia 2 1 4
35.4 13 5 2 Zebra 2 1 4
35.4 13 6 2 Ninjago 2 1 4
end
I think I know how to do present2:
bysort product: generate order = _n
drop tag
bysort product: egen tag=max(order)
present1 requires loops, I think, but I did not get far:
sort week store product
egen joint1 = group (week store), label
gen long id = _n
su joint1, meanonly
forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
...
...
}
I interpret your first question about present1 as wanting an indicator that is
1 if and only if products 4, 5, 6 are all available in a given store and week AND the product is 4, 5, 6
0 otherwise
I suggest
gen byte wanted1 = 1
foreach p in 4 5 6 {
    bysort store week : egen byte work = max(product == `p')
    replace wanted1 = wanted1 * work
    drop work
}
replace wanted1 = wanted1 * inlist(product, 4, 5, 6)
So, the recipe is:
1. Assume all present: the indicator is initialised to 1.
2. Change our mind if any is absent. Principles documented at https://www.stata.com/support/faqs/data-management/create-variable-recording/ (We just need to multiply by 0 once, and the indicator as a logical product is necessarily zero.)
3. Change our mind in respect of other products.
Here's a version without egen that should be faster in large datasets:
gen byte wanted1 = 1
foreach p in 4 5 6 {
    bysort store week : gen byte work = product == `p'
    bysort store week (work) : replace work = work[_N]
    replace wanted1 = wanted1 * work
    drop work
}
replace wanted1 = wanted1 * inlist(product, 4, 5, 6)
Sorry, I don't understand the definition of present2. Occurrences of what precisely?
However, your present code
bysort product: generate order = _n
drop tag
bysort product: egen tag=max(order)
could be just
bysort product : replace tag = _N

how can i compare two csv files?

train.csv:
01kcPWA9K2BOxQeS5Rju 1
04EjIdbPV5e1XroFOpiN 1
05EeG39MTRrI6VY21DPd 1
05rJTUWYAKNegBk2wE8X 1
0AnoOZDNbPXIr2MRBSCJ 1
0AwWs42SUQ19mI7eDcTC 1
0cH8YeO15ZywEhPrJvmj 1
0DNVFKwYlcjO7bTfJ5p1 1
0DqUX5rkg3IbMY6BLGCE 1
0eaNKwluUmkYdIvZ923c 1
0fHVZKeTE6iRb1PIQ4au 1
0G4hwobLuAzvl1PWYfmd 1
test.csv:
01IsoiSMh5gxyDYTl4CB
01SuzwMJEIXsK7A8dQbl
01azqd4InC7m9JpocGv5
01jsnpXSAlgw6aPeDxrU
01kcPWA9K2BOxQeS5Rju
02IOCvYEy8mjiuAQHax3
02JqQ7H3yEoD8viYWlmS
02K5GMYITj7bBoAisEmD
02MRILoE6rNhmt7FUi45
02mlBLHZTDFXGa7Nt6cr
02zcUmKV16Lya5xqnPGB
03nJaQV6K2ObICUmyWoR
04BfoQRA6XEshiNuI7pF
04EjIdbPV5e1XroFOpiN
test.csv contains rows of this type. I want to compare each of its rows with the rows of train.csv and, where there is a match, save the id against that row. The output should look like this:
output.csv:
01kcPWA9K2BOxQeS5Rju 2
04EjIdbPV5e1XroFOpiN 2
05EeG39MTRrI6VY21DPd 4
05rJTUWYAKNegBk2wE8X 1
0AnoOZDNbPXIr2MRBSCJ 1
0AwWs42SUQ19mI7eDcTC 5
0cH8YeO15ZywEhPrJvmj 5
0DNVFKwYlcjO7bTfJ5p1 1
0DqUX5rkg3IbMY6BLGCE 3
0eaNKwluUmkYdIvZ923c 1
0fHVZKeTE6iRb1PIQ4au 1
0G4hwobLuAzvl1PWYfmd 2
Kindly help me

Eliminate repeated vectors but with elements on different order

I have a matrix A which is (243 x 5). I want to pick the unique row vectors of that matrix but taking into account that row vectors with the same elements but in different order shall be considered as being the same.
E.g., suppose for simplicity that the A matrix is (10 x 5) and equal to:
A=[1 2 1 2 3
1 3 1 1 1
1 3 1 1 2
1 2 1 1 3
2 3 1 2 1
1 3 1 2 2
1 3 1 2 3
1 3 1 3 2
1 3 1 3 1
1 3 2 3 1]
In the example above, rows (1, 5, 6) are to be considered equivalent because they have the same elements in a different order. Likewise, rows (3, 4) are equivalent, and rows (7, 8, 10) are equivalent.
Is there any way to write code that removes all "repeated rows", i.e. code that returns only rows (1, 2, 3, 7 and 9) of A?
So far I have come up with this solution:
B(:,1) = sum(A == 1,2);
B(:,2) = sum(A == 2,2);
B(:,3) = sum(A == 3,2);
[C, ia, ic] = unique(B,'rows');
Result = A(ia,:);
This delivers what I am looking for, with one caveat: it returns the unique rows of A according to the criterion defined above, but not in order of first appearance. That is, instead of returning rows (1, 2, 3, 7, 9) it returns rows (7, 1, 9, 3, 2).
Is there any way to force it to return the rows in the correct order? Also, is there a better way of doing this?
You can do it as follows:
1. Sort A along the second dimension;
2. Get stable indices of unique (sorted) rows;
3. Use the result as row indices into the original A.
That is:
As = sort(A, 2);
[~, ind] = unique(As, 'rows', 'stable');
result = A(ind,:);
For
A = [1 2 1 2 3
1 3 1 1 1
1 3 1 1 2
1 2 1 1 3
2 3 1 2 1
1 3 1 2 2
1 3 1 2 3
1 3 1 3 2
1 3 1 3 1
1 3 2 3 1];
this gives
result =
1 2 1 2 3
1 3 1 1 1
1 3 1 1 2
1 3 1 2 3
1 3 1 3 1

Comparisons across multiple rows in Stata (household dataset)

I'm working on a household dataset and my data looks like this:
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
What I want to do is identify the mother in each family. A mother is a member of the family whose id is equal to one of the mother_id's of another family member. In the example above, for the family with id_family=3, individual 5 has mother_id=4, which makes individual 4 her mother.
I create a family size variable that tells me how many members there are per family. I also create a rank variable for each member within a family. For families of three, I then have the following piece of code that works:
bysort id_family: gen family_size=_N
bysort id_family: gen rank=_n
gen mother=.
bysort id_family: replace mother=1 if male==0 & rank==1 & family_size==3 & (id[_n]==mother_id[_n+1] | id[_n]==mother_id[_n+2])
bysort id_family: replace mother=1 if male==0 & rank==2 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n+1])
bysort id_family: replace mother=1 if male==0 & rank==3 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n-2])
What I get is:
id id_family mother_id male family_size rank mother
1 2 12 0 2 1 .
2 2 13 1 2 2 .
3 3 15 1 3 1 .
4 3 17 0 3 2 1
5 3 4 0 3 3 .
However, in my real data set, I have to get the mother for families of size 4 and higher (up to 9), which makes this procedure very inefficient (in the sense that there are too many row elements to compare "manually").
How would you obtain this in a cleaner way? Would you make use of permutations to index the rows? Or would you use a for-loop?
Here's an approach using merge.
// create sample data
clear
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
save families, replace
clear
// do the job
use families
drop id male
rename mother_id id
sort id_family id
duplicates drop
list, clean abbreviate(10)
save mothers, replace
use families, clear
merge 1:1 id_family id using mothers, keep(master match)
generate byte is_mother = _merge==3
list, clean abbreviate(10)
The second list yields
id id_family mother_id male _merge is_mother
1. 1 2 12 0 master only (1) 0
2. 2 2 13 1 master only (1) 0
3. 3 3 15 1 master only (1) 0
4. 4 3 17 0 matched (3) 1
5. 5 3 4 0 master only (1) 0
where I retained _merge only for expositional purposes.
