Comparisons across multiple rows in Stata (household dataset) - loops

I'm working on a household dataset and my data looks like this:
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
What I want to do is identify the mother in each family. A mother is a member of the family whose id is equal to one of the mother_id's of another family member. In the example above, for the family with id_family=3, individual 5 has mother_id=4, which makes individual 4 her mother.
I create a family size variable that tells me how many members there are per family. I also create a rank variable for each member within a family. For families of three, I then have the following piece of code that works:
bysort id_family: gen family_size=_N
bysort id_family: gen rank=_n
gen mother=.
bysort id_family: replace mother=1 if male==0 & rank==1 & family_size==3 & (id[_n]==mother_id[_n+1] | id[_n]==mother_id[_n+2])
bysort id_family: replace mother=1 if male==0 & rank==2 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n+1])
bysort id_family: replace mother=1 if male==0 & rank==3 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n-2])
What I get is:
id id_family mother_id male family_size rank mother
1 2 12 0 2 1 .
2 2 13 1 2 2 .
3 3 15 1 3 1 .
4 3 17 0 3 2 1
5 3 4 0 3 3 .
However, in my real data set, I have to get the mother for families of size 4 and higher (up to 9), which makes this procedure very inefficient (in the sense that there are too many row elements to compare "manually").
How would you obtain this in a cleaner way? Would you use permutations to index the rows, or a for-loop?

Here's an approach using merge.
// create sample data
clear
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
save families, replace
clear
// do the job
use families
drop id male
rename mother_id id
sort id_family id
duplicates drop
list, clean abbreviate(10)
save mothers, replace
use families, clear
merge 1:1 id_family id using mothers, keep(master match)
generate byte is_mother = _merge==3
list, clean abbreviate(10)
The second list yields
id id_family mother_id male _merge is_mother
1. 1 2 12 0 master only (1) 0
2. 2 2 13 1 master only (1) 0
3. 3 3 15 1 master only (1) 0
4. 4 3 17 0 matched (3) 1
5. 5 3 4 0 master only (1) 0
where I retained _merge only for expositional purposes.
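For readers working outside Stata, the same merge logic can be sketched in Python with pandas (an illustrative translation, not part of the original answer; the frame and column names are mine):

```python
import pandas as pd

# Sample household data, as in the question.
families = pd.DataFrame({
    "id":        [1, 2, 3, 4, 5],
    "id_family": [2, 2, 3, 3, 3],
    "mother_id": [12, 13, 15, 17, 4],
    "male":      [0, 1, 1, 0, 0],
})

# Build the "mothers" lookup: every (id_family, mother_id) pair,
# with mother_id renamed to id so it can be matched against members.
mothers = (families[["id_family", "mother_id"]]
           .rename(columns={"mother_id": "id"})
           .drop_duplicates())

# A member is a mother if her own id appears among the family's mother_ids.
merged = families.merge(mothers, on=["id_family", "id"],
                        how="left", indicator=True)
merged["is_mother"] = (merged["_merge"] == "both").astype(int)
print(merged[["id", "id_family", "is_mother"]])
```

Here `indicator=True` plays the role of Stata's `_merge` variable.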


Get unique values in matrix with Matlab

I'm looking for fastest way to get unique values in matrix with Matlab! I have a matrix like this:
1 2
1 2
1 3
1 5
1 23
2 1
3 1
3 2
3 2
3 2
4 17
4 3
4 17
and need to get something like this:
1 2
1 3
1 5
1 23
2 1
3 1
3 2
4 3
4 17
Actually I need unique values by combination of columns in each row.
Have a look at MATLAB's unique() function with the argument 'rows'.
C = unique(A,'rows')
https://de.mathworks.com/help/matlab/ref/unique.html
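For comparison, NumPy offers the same row-wise deduplication through the `axis` argument (a Python sketch of the same idea, assuming NumPy is available):

```python
import numpy as np

# The matrix from the question.
A = np.array([[1, 2], [1, 2], [1, 3], [1, 5], [1, 23],
              [2, 1], [3, 1], [3, 2], [3, 2], [3, 2],
              [4, 17], [4, 3], [4, 17]])

# axis=0 treats each row as one element, mirroring unique(A, 'rows').
C = np.unique(A, axis=0)
print(C)
```

Like MATLAB's `unique(A, 'rows')`, the result comes back sorted lexicographically by row.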

How to aggregate number of notes sent to each user?

Consider the following tables
group (obj_id here is user_id)
group_id obj_id role
--------------------------
100 1 A
100 2 root
100 3 B
100 4 C
notes
obj_id ref_obj_id note note_id
-------------------------------------------
1 2 10
1 3 10
1 0 foobar 10
1 4 20
1 2 20
1 0 barbaz 20
2 0 caszes 30
2 1 30
4 1 70
4 0 taz 70
4 3 70
Note: a note in the system can be assigned to multiple users (for instance, an admin could write "sent warning to 2 users" and link it to 2 user_ids). The first user the note gets linked to is stored differently from the other linked users, and the note text itself is linked to that first user only. Whenever group.obj_id = notes.obj_id, then ref_obj_id = 0 and note <> null.
I need to make an overview of the notes per user. Normally I would do this by joining on group.obj_id = notes.ref_obj_id, but here this goes wrong when ref_obj_id is 0 (in which case I should join on notes.obj_id).
There are 4 notes in this system (foobar, barbaz, caszes and taz).
The desired output is:
obj_id user_is_primary notes_primary user_is_linked notes_linked
-------------------------------------------------------------------
1 2 10;20 2 30;70
2 1 30 2 10;20
3 0 2 10;70
4 1 70 1 20
How can I get to this aggregated result?
I hope that I was able to explain the situation clearly; perhaps it is my inexperience but I find the data model not the most straightforward.
Couldn't you simply put this in the ON clause of your join?
case when notes.ref_obj_id = 0 then notes.obj_id else notes.ref_obj_id end = group.obj_id
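The same conditional-key idea can be sketched in pandas: derive the effective user id first, then join or aggregate on it (an illustration with made-up data, not the asker's actual schema):

```python
import pandas as pd

# A few linkage rows in the shape of the notes table.
notes = pd.DataFrame({
    "obj_id":     [1, 1, 1, 2],
    "ref_obj_id": [2, 3, 0, 0],
    "note_id":    [10, 10, 10, 30],
})

# Equivalent of the CASE expression: the user a row points at is
# obj_id when ref_obj_id is 0, otherwise ref_obj_id.
notes["user_id"] = notes["ref_obj_id"].where(notes["ref_obj_id"] != 0,
                                             notes["obj_id"])
print(notes)
```

Once the key is materialised, an ordinary join or groupby on `user_id` replaces the conditional ON clause.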

In a panel data of products (per store/week), identify products that are present in all store/week panels

In the data set below, store/week is a panel. I would like to show that products 4, 5, 6 are present in all the panels. Binary variable present1 indicates it.
Similarly, I would like to recognize the presence of products in the corresponding number of occurrences. Categorical variable present2 indicates it.
clear
input expbasedem price product store str7 l5 week present1 present2
1.1 5.3 1 1 Ana 1 0 1
1.1 2.5 3 1 Bob 1 0 3
1.1 1 4 1 Brian 1 1 4
2.1 12 5 1 Brian 1 1 4
3.1 12 6 1 Suming 1 1 4
12 4 2 2 Ana 1 0 2
12 3.5 3 2 Bob 1 0 3
10 2 4 2 Brian 1 1 4
25 13 5 2 Brian 1 1 4
35 13 6 2 Suming 1 1 4
35.3 5.3 7 1 Bob 2 0 1
12.3 2.5 8 1 Brian 2 0 1
10.3 1 4 1 Brian 2 1 4
35.3 12 5 1 Bobby 2 1 4
35.3 12 6 1 Becky 2 1 4
23.4 4 2 2 Icarus 2 0 2
12.4 3.5 3 2 Xerox 2 0 3
10.4 2 4 2 Yulia 2 1 4
35.4 13 5 2 Zebra 2 1 4
35.4 13 6 2 Ninjago 2 1 4
end
I think I know how to do present2:
bysort product: generate order = _n
drop tag
bysort product: egen tag=max(order)
present1 requires loops, I think, but I don't get far:
sort week store product
egen joint1 = group(week store), label
gen long id = _n
su joint1, meanonly
forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
...
...
}
I interpret your first question about present1 as wanting an indicator that is
1 if and only if products 4, 5, 6 are all available in a given store and week AND the product is 4, 5, 6
0 otherwise
I suggest
gen byte wanted1 = 1
foreach p in 4 5 6 {
bysort store week : egen byte work = max(product == `p')
replace wanted1 = wanted1 * work
drop work
}
replace wanted1 = wanted1 * inlist(product, 4, 5, 6)
So, the recipe is
Assume all present. Indicator initialised to 1.
Change our mind if any is absent. Principles documented at https://www.stata.com/support/faqs/data-management/create-variable-recording/
(We just need to multiply by 0 once, and the indicator as a logical product is necessarily zero.)
Change our mind in respect of other products.
Here's a version without egen that should be faster in large datasets:
gen byte wanted1 = 1
foreach p in 4 5 6 {
bysort store week : gen byte work = product == `p'
bysort store week (work) : replace work = work[_N]
replace wanted1 = wanted1 * work
drop work
}
replace wanted1 = wanted1 * inlist(product, 4, 5, 6)
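The initialise-to-1-and-multiply recipe is language-independent; here is a rough pandas sketch of the same logic (a small example of my own, not the poster's data):

```python
import pandas as pd

# Store/week panels: group (1, 1) has products 4, 5, 6; group (2, 1) lacks 6.
df = pd.DataFrame({
    "store":   [1, 1, 1, 2, 2],
    "week":    [1, 1, 1, 1, 1],
    "product": [4, 5, 6, 4, 5],
})

# Start by assuming all required products are present...
wanted = pd.Series(1, index=df.index)
for p in (4, 5, 6):
    # ...then multiply by a per-group indicator: is product p in this
    # store/week at all?  A single absence zeroes the logical product.
    present = df.groupby(["store", "week"])["product"] \
                .transform(lambda s, p=p: int((s == p).any()))
    wanted *= present
# Finally restrict to rows that are themselves product 4, 5 or 6.
wanted *= df["product"].isin([4, 5, 6]).astype(int)
print(wanted.tolist())
```

The grouped `transform(... any ...)` mirrors Stata's `bysort store week : egen work = max(product == p)`.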
Sorry, I don't understand the definition of present2. Occurrences of what precisely?
However, your present code
bysort product: generate order = _n
drop tag
bysort product: egen tag=max(order)
could be just
bysort product : replace tag = _N

t-sql - select all combinations of groups of rows in single table

Ok, I have a table like this:
ThingID SubthingID ThingLevel
1 1 0
1 2 0
1 3 0
1 4 0
2 14 1
2 17 1
3 22 1
3 950 1
I need to select groups of subthings such that I end up with one subthing from a level 0 thing, and all the combinations of subthings, one each from the level 1 things. Note that each subthing belongs to its thing - they're not interchangeable. So there can't be, say, a combination of thing 2 with subthing 950. Also, things come in two levels: a level 1 thing is always level 1, and a level 0 thing is always level 0. In practice this means that level 1 things can be combined with other level 1 or level 0 things, but level 0 things can only be combined with level 1 things.
So the output would look like:
GroupID ThingID SubthingID ThingLevel
1 1 1 0
1 2 14 1
1 3 22 1
2 1 2 0
2 2 14 1
2 3 22 1
3 1 3 0
3 2 14 1
3 3 22 1
. . . .
. . . .
. . . .
x 1 4 0
x 2 17 1
x 3 950 1
There are multiple level 0 things, each with one to many subthings. There are multiple level 1 things, each with one to many subthings.
Offhand it would seem like nested loops would be the answer:
For each level 0 thing begin
for each level 1 thing subthing begin
etc...
But that's obviously not going to handle variable numbers of level one things.
Is there a way to do this with recursion?
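To see the combinatorial structure the question describes, here is a small Python sketch using a Cartesian product (illustrative only, not a T-SQL answer; in SQL the same shape would come from chained CROSS JOINs rather than recursion):

```python
from itertools import product

# Subthings grouped by their owning thing, from the question's table.
subthings = {
    1: [1, 2, 3, 4],   # level 0 thing: pick exactly one subthing
    2: [14, 17],       # level 1 thing
    3: [22, 950],      # level 1 thing
}

# One subthing from each thing; subthings are never mixed across things,
# so a combination like (thing 2, subthing 950) cannot occur.
combos = list(product(*subthings.values()))
print(len(combos))  # 4 * 2 * 2 = 16 groups
print(combos[0])
```

Each tuple in `combos` corresponds to one GroupID in the desired output, which is why the group count is the product of the per-thing subthing counts.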

Inflow/Outflow Count in Stata

I have the following data
id pair_id id_in id_out date
1 1 2 3 1/1/2010
2 1 2 3 1/2/2010
3 1 3 2 1/3/2010
4 1 3 2 1/5/2010
5 1 3 2 1/7/2010
6 2 2 1 1/2/2010
7 3 1 3 1/5/2010
8 2 1 2 1/7/2010
At any given row I want to know what the inflow/outflow differential is between the unique pair id_in and id_out, from the id_in perspective.
For example, for id_in == 2 and id_out == 3 it would look like the following (from id_in == 2's perspective):
id pair_id id_in id_out date inflow_outflow
1 1 2 3 1/1/2010 1
2 1 2 3 1/2/2010 2
3 1 3 2 1/3/2010 1
4 1 3 2 1/5/2010 0
5 1 3 2 1/7/2010 -1
Explanation: id_in == 2 received first, so they get +1; then they received again, so +2. Then they gave out, so it gets reduced by 1, bringing the total to that point to 1, etc.
This is what I have tried
sort pair_id id_in date
gen count = 0
qui forval i = 2/`=_N' {
local I = `i' - 1
count if id_in == id_out[`i'] in 1/`I'
replace count = r(N) in `i'
}
I don't follow all the logic here and in particular presenting transactions from the point of view of one member seems quite arbitrary. But a broad impression from loosely similar problems is that you should not be thinking about loops here. It should suffice to use by: and cumulative sums. There is an attempt at some systematic discussion of how to handle dyads at http://www.stata-journal.com/sjpdf.html?articlenum=dm0043 but it is only a beginning.
Please note that presenting dates according to some display format is a small pain as they need to be reverse engineered. dataex from SSC can be used to create examples that are easy to copy and paste.
This code may suggest some technique:
clear
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
assert id_in != id_out
gen pair1 = cond(id_in < id_out, id_in, id_out)
gen pair2 = cond(id_in < id_out, id_out, id_in)
bysort pair_id (date): gen sum1 = sum(id_in == pair1) - sum(id_out == pair1)
bysort pair_id (date): gen sum2 = sum(id_in == pair2) - sum(id_out == pair2)
list date id_* pair? sum?, sepby(pair_id)
+----------------------------------------------------------+
| date id_in id_out pair1 pair2 sum1 sum2 |
|----------------------------------------------------------|
1. | 01jan2010 2 3 2 3 1 -1 |
2. | 02jan2010 2 3 2 3 2 -2 |
3. | 03jan2010 3 2 2 3 1 -1 |
4. | 05jan2010 3 2 2 3 0 0 |
5. | 07jan2010 3 2 2 3 -1 1 |
|----------------------------------------------------------|
6. | 02jan2010 2 1 1 2 -1 1 |
7. | 07jan2010 1 2 1 2 0 0 |
|----------------------------------------------------------|
8. | 05jan2010 1 3 1 3 1 -1 |
+----------------------------------------------------------+
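The by: plus running-sum logic translates directly to grouped cumulative sums in pandas (a sketch with my own column names, mirroring the Stata code above):

```python
import pandas as pd

# The example transactions, with dates already parsed.
df = pd.DataFrame({
    "pair_id": [1, 1, 1, 1, 1, 2, 3, 2],
    "id_in":   [2, 2, 3, 3, 3, 2, 1, 1],
    "id_out":  [3, 3, 2, 2, 2, 1, 3, 2],
    "date":    pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03",
                               "2010-01-05", "2010-01-07", "2010-01-02",
                               "2010-01-05", "2010-01-07"]),
})

df = df.sort_values(["pair_id", "date"])
# Canonical member of each pair (the smaller id), as in the cond() lines.
pair1 = df[["id_in", "id_out"]].min(axis=1)
# Running inflow minus outflow from pair1's perspective, within each pair.
delta = (df["id_in"] == pair1).astype(int) - (df["id_out"] == pair1).astype(int)
df["sum1"] = delta.groupby(df["pair_id"]).cumsum()
df["sum2"] = -df["sum1"]
print(df)
```

The grouped `cumsum` is the counterpart of Stata's `bysort pair_id (date): gen sum1 = sum(...)`.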
A specific pair (as defined by pair_id) is always formed by two entities that can be ordered in one of two ways. For example, entity 5 with entity 8, and entity 8 with entity 5. If one is receiving, the other is necessarily giving out.
Two slightly different ways of approaching the problem can be found below.
clear all
set more off
*----- example data -----
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
drop sdate
sort pair_id date id
list, sepby(pair_id)
*---- what you want -----
// approach 1
bysort pair_id (date id) : gen sum1 = sum(cond(id_in == id_in[1], 1, -1))
gen sum2 = -1 * sum1
// approach 2
bysort pair_id (id_in date id) : gen temp = cond(id_in == id_in[1], 1, -1)
bysort pair_id (date id) : gen sum100 = sum(temp)
gen sum200 = -1 * sum100
// list
drop temp
sort pair_id date
list, sepby(pair_id)
The first approach involves creating a variable that holds the differential for the entity that first receives according to the date variable. sum1 does just that. Variable sum2 holds the differential for the other entity.
The second approach creates a variable that holds the differential for the entity that has the smallest identifying number. I've named it sum100. Variable sum200 holds the information for the other entity.
Note that I added id to the sorting list in case pair_id date does not uniquely identify observations.
The second approach is equivalent to the code provided by @NickCox, or so I believe.
The results:
. list, sepby(pair_id)
+---------------------------------------------------------------------------+
| id pair_id id_in id_out date sum1 sum2 sum100 sum200 |
|---------------------------------------------------------------------------|
1. | 1 1 2 3 01jan2010 1 -1 1 -1 |
2. | 2 1 2 3 02jan2010 2 -2 2 -2 |
3. | 3 1 3 2 03jan2010 1 -1 1 -1 |
4. | 4 1 3 2 05jan2010 0 0 0 0 |
5. | 5 1 3 2 07jan2010 -1 1 -1 1 |
|---------------------------------------------------------------------------|
6. | 6 2 2 1 02jan2010 1 -1 -1 1 |
7. | 8 2 1 2 07jan2010 0 0 0 0 |
|---------------------------------------------------------------------------|
8. | 7 3 1 3 05jan2010 1 -1 1 -1 |
+---------------------------------------------------------------------------+
Check them carefully, as the difference between both approaches is subtle, at least initially.
