Description
I have a table like this in Google Sheets:
    |   A   |       B        |  C  |   D   |   E   |  F  |   G
  1 | Cond1 | Person_code    | n/a | Count | Cond2 | n/a | Result
----+-------+----------------+-----+-------+-------+-----+-------
  2 | 0     | Tom T_44767    |     | 1     | 1     |     |
  3 | 0     | Isrel I_44767  |     | 1     | 1     |     |
  4 | 1     | Patty P_44767  |     | 1     | 1     |     | x
  5 | 1     | Isrel I_44767  |     | 0     | 1     |     |
  6 | 0     | Dummy D_44767  |     | 1     | 1     |     |
  7 | 1     | Patty P_447677 |     | 0     | 1     |     |
  8 | 1     | Jarson X_44768 |     | 1     | 1     |     | x
A - Cond1 - either 0 or 1
B - Person_code - first name, second name and a number which represents a date
C - n/a - column not important for the case, included for the sake of numeration
D - Count - either 0 or 1 because it counts THE FIRST occurrence of B, with the formula:
=(COUNTIF($B$1:$B2;$B2)=1)+0 for row 2
=(COUNTIF($B$1:$B3;$B3)=1)+0 for row 3, and so on.
NOTE: The important thing is to count ONLY THE FIRST occurrence (see rows 4 and 7 for an example).
E - Cond2 - either 0 or 1
F - n/a - column not important for the case, included for the sake of numeration
G - Result - IF (Cond1 + Count + Cond2 = 3) THEN x
What the problem is
Currently column D counts the first occurrence of B. It does not take anything else into account, just the first occurrence in column B. However, I need it to ignore (i.e. not count) rows where Cond1 + Cond2 is different from 2 (i.e. it is 0 or 1). Instead, it should look for the first occurrence of B where Cond1 + Cond2 = 2 and count that.
So the table should look like this (pay attention to D3, D5 and G5):
    |   A   |       B        |  C  |   D   |   E   |  F  |   G
  1 | Cond1 | Person_code    | n/a | Count | Cond2 | n/a | Result
----+-------+----------------+-----+-------+-------+-----+-------
  2 | 0     | Tom T_44767    |     | 1     | 1     |     |
  3 | 0     | Isrel I_44767  |     | 0     | 1     |     |
  4 | 1     | Patty P_44767  |     | 1     | 1     |     | x
  5 | 1     | Isrel I_44767  |     | 1     | 1     |     | x
  6 | 0     | Dummy D_44767  |     | 1     | 1     |     |
  7 | 1     | Patty P_447677 |     | 0     | 1     |     |
  8 | 1     | Jarson X_44768 |     | 1     | 1     |     | x
Row 3 was ignored and the first occurrence of 'Isrel I_44767' with Cond1 + Cond2 = 2 was found in row 5. Therefore an 'x' appears in column G in row 5.
I've tried to include additional conditions in D but can't get it to work. Any solution would be acceptable. It's okay to add additional columns if needed, or to use a totally different approach.
I will be grateful for any advice on this.
I need it to ignore (i.e. not count) rows where Cond1 + Cond2 is different from 2 (i.e. 0 or 1). Instead, it should look for the first occurrence of B where Cond1 + Cond2 = 2 and count it.
=ARRAYFORMULA(IF(A2:A8+E2:E8=2, 1, 0))
Now, to account for occurrences/instances, e.g. not counting duplicates (if that's what you are after; it's not entirely clear from your question):
=ARRAYFORMULA(IF(1=COUNTIFS(
IF(A2:A10+E2:E10=2, B2:B10, ),
IF(A2:A10+E2:E10=2, B2:B10, ),
ROW(B2:B10), "<="&ROW(B2:B10)), 1, 0))
G - Result - IF (Cond1 + Count + Cond2 = 3) THEN x
and G2 would be:
=INDEX(IF(A2:A10+D2:D10+E2:E10=3, "x", ))
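For readers who prefer to check the logic outside Sheets, here is a minimal pandas sketch of the same rule (illustrative only, using just three rows from the question's data): count only the first occurrence of Person_code among rows where Cond1 + Cond2 = 2, then mark Result where Cond1 + Count + Cond2 = 3.

import pandas as pd

# mini-version of the sheet: rows 3, 4 and 5 from the question
df = pd.DataFrame({
    "Cond1":       [0, 1, 1],
    "Person_code": ["Isrel I_44767", "Patty P_44767", "Isrel I_44767"],
    "Cond2":       [1, 1, 1],
})

# only rows with Cond1 + Cond2 == 2 are eligible to be counted
eligible = df["Cond1"] + df["Cond2"] == 2
df["Count"] = 0
df.loc[eligible, "Count"] = (~df.loc[eligible, "Person_code"].duplicated()).astype(int)

# Result gets an 'x' when Cond1 + Count + Cond2 == 3
df["Result"] = (df["Cond1"] + df["Count"] + df["Cond2"]).eq(3).map({True: "x", False: ""})

print(df)  # the first Isrel row gets Count 0; the second gets Count 1 and Result 'x'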
Related
I have a table with "ID" and "Values" columns, and I want to know how many times the value "A" transitions into another value, like below:
ID  Values
1   A
1   A
1   A
1   B
1   A
1   B
1   B
1   C
1   C
1   C
1   A
2   A
2   A
2   B
2   A
2   B
2   C
2   B
Expected Result:
ID  Values  Desired Output
1   A       0
1   A       0
1   A       1
1   B       0
1   A       1
1   B       0
1   B       0
1   C       0
1   C       0
1   C       0
1   A       0
2   A       0
2   A       1
2   B       0
2   A       1
2   B       0
2   C       0
2   B       0
The final table should be like this:
ID  Number of Transitions
1   2
2   2
You just need LEAD() to look at the next value within each id. Note that LEAD() needs an ORDER BY inside its OVER clause, so you also need a column that defines the row order (ordering_col below is a placeholder for it):
select id, value,
       lead(value) over (partition by id order by ordering_col) as next_value
from t
Then you can compare next_value with value, and apply iff(value = 'A' and next_value != 'A', 1, 0). Then just SUM() or COUNT() and GROUP BY.
You could also treat this as a regexp problem where you want to count how many times a given pattern occurs for each id. The missing piece in your question is: which column dictates the order in which the values appear for each id? You'll need that for either of the solutions.
select id, regexp_count(listagg(val,',') within group (order by ordering_col), 'A,[^A]')
from t
group by id;
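For comparison, here is a minimal pandas sketch of the same transition count (illustrative only; it assumes the rows are already in the intended time order within each ID, which is exactly the ordering column missing from the question):

import pandas as pd

df = pd.DataFrame({
    "ID":     [1] * 11 + [2] * 7,
    "Values": list("AAABABBCCCA") + list("AABABCB"),
})

# next value within each ID, taking the existing row order as the time order
nxt = df.groupby("ID")["Values"].shift(-1)

# 1 whenever an 'A' is followed by something other than 'A'
df["Desired Output"] = ((df["Values"] == "A") & nxt.notna() & (nxt != "A")).astype(int)

print(df.groupby("ID")["Desired Output"].sum())  # ID 1 -> 2, ID 2 -> 2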
How can I find the first and last elements of a dataframe, based on grouping the rows with respect to a column?
df1:
g col1 col2
h 1 2
h 0 1
h 7 8
h 5 2
h 0 1
k 7 3
k 2 1
k 9 1
I want to group by column g, and for each group and column I need the following information:
first element, last element, size of the group
IIUC, try:
df_g = df.groupby('dates1').agg(['first','last','size']).T.unstack()
df_g.columns = [f'{i}/{j}' for i, j in df_g.columns]
print(df_g)
Output:
2020-01/first 2020-01/last 2020-01/size 2020-02/first 2020-02/last 2020-02/size
col1 7 9 3 1 0 5
col2 3 1 3 2 1 5
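Note that the snippet above groups on a date column from the answerer's own example data, which is why the printed columns show 2020-01 and 2020-02. Applied to the df1 from the question, the grouping key is g; a minimal sketch (with the question's data typed in by hand) would be:

import pandas as pd

df = pd.DataFrame({
    "g":    ["h", "h", "h", "h", "h", "k", "k", "k"],
    "col1": [1, 0, 7, 5, 0, 7, 2, 9],
    "col2": [2, 1, 8, 2, 1, 3, 1, 1],
})

# first element, last element and size of each group, per column
df_g = df.groupby("g").agg(["first", "last", "size"]).T.unstack()
df_g.columns = [f"{i}/{j}" for i, j in df_g.columns]
print(df_g)
#       h/first  h/last  h/size  k/first  k/last  k/size
# col1        1       0       5        7       9       3
# col2        2       1       5        3       1       3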
I have a table T with 4 (integer) columns: A, B, C, and D. There is already a UNIQUE constraint on (A, B, C), but I need to write a constraint enforcing that, for the same (A, B) combination, D has the same value no matter what C is. I.e.:
A B C D Note
1 1 1 1 AB is 1,1, D is 1
1 1 2 1
1 1 3 2 wrong! D must be 1, because AB is 1,1
1 1 4 1 ok
2 1 1 5 ok, it's a new AB combination, so a new D value is possible
2 1 2 5 D must be 5 here (and for any following row with AB 2,1)
etc.
I have no idea where to start, and my Google-fu is weak in this case.
I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not in the long one. More precisely, time-varying variables have an alphabetic prefix which identifies the time it belongs to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, until rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. In consequence, running a panel over the full number of periods will reduce my number of observations considerably. Hence, I would like to know precisely how many observations panels of different lengths will have. To evaluate this, I want to create variables that, for each period and for each number of consecutive periods, count how many respondents have nonmissing values over that sequence of periods. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first period and the second. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, it has to be in a loop. Moreover, there are different numbers of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). Thus, the number of variables to create is 1+2+3+4+...+17 = 153. This variability has to be reflected in the loop. I propose code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local counter = 1 // counter to update; reflects sequence length
while `counter' > 0 { // loop over sequence lengths
gen _`var'_counter_`counter' = 0 // generate variable with counter
if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS
recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
}
}
local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if right) is the following. Consider a dataset of five observations, for four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means the value is observed in that period, and . means it is not. The objective of the code is to create 1+2+3 = 6 new variables such that the new dataset is:
Obs a b c d b_count_2 c_count_2 c_count_3 d_count_2 d_count_3 d_count_4
1 1 1 . 1 1 0 0 0 0 0
2 1 1 . . 1 0 0 0 0 0
3 . . 1 1 0 0 0 1 0 0
4 . 1 1 . 0 1 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1
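For reference, the target columns can also be sketched outside Stata. The following pandas snippet (illustrative only; the toy data above are entered with . as NaN) builds the same 1+2+3 = 6 indicator variables:

import pandas as pd
import numpy as np

obs = pd.DataFrame({
    "a": [1, 1, np.nan, np.nan, 1],
    "b": [1, 1, np.nan, 1, 1],
    "c": [np.nan, np.nan, 1, 1, 1],
    "d": [1, np.nan, 1, np.nan, 1],
})

periods = ["a", "b", "c", "d"]
for i, p in enumerate(periods[1:], start=1):      # waves b, c, d
    for length in range(2, i + 2):                # sequence lengths 2 .. i+1
        window = periods[i - length + 1 : i + 1]  # consecutive waves ending at p
        obs[f"{p}_count_{length}"] = obs[window].notna().all(axis=1).astype(int)

print(obs.filter(like="_count_"))                 # matches the table above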
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local list `var'_counter_* // group of sequence variables for each period
foreach var2 of local list { // loop over each element of the list
quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // sum the number of individuals with value = 1 with sequence of length var2 in period var
di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
}
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.
If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
qui foreach pre in b c d e f g {
noi dis "{hline 80}" _n as res "Wave `pre'"
// the longest substring without a space up to the wave
gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
noi tab temp
// loop over the various substring lengths, from 2 to max length
gen len = length(temp)
sum len, meanonly
local n = r(max)
forvalues i = 2/`n' {
count if length(temp) >= `i'
noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
dis "{hline 80}" _n as res "Wave `pre'"
sum _seq if wave == "`pre'", meanonly
local n = r(max)
forvalues i = 2/`n' {
qui count if _seq >= `i' & wave == "`pre'"
dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
}
I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as you have, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that because of the first point just made, these new variables would pertain with your strategy only to individual variables like pay and individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have to work with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.
Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 5 -> 20
Number of variables 5 -> 3
j variable (4 values) -> time
xij variables:
y1 y2 ... y4 -> y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
id: 1, 2, ..., 5 n = 5
time: 1, 2, ..., 4 T = 4
Delta(time) = 1 unit
Span(time) = 4 periods
(id*time uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
2 2 2 2 3 4 4
Freq. Percent Cum. | Pattern
---------------------------+---------
1 20.00 20.00 | ..11
1 20.00 40.00 | .11.
1 20.00 60.00 | 11..
1 20.00 80.00 | 11.1
1 20.00 100.00 | 1111
---------------------------+---------
5 100.00 | XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+------------------------------------+
| pattern Freq. Percent % <= |
|------------------------------------|
| ..11 2 15.38 15.38 |
| .11. 2 15.38 30.77 |
| 11.. 2 15.38 46.15 |
| 11.1 3 23.08 69.23 |
| 1111 4 30.77 100.00 |
+------------------------------------+
. restore
. egen npresent = total(missing(y)), by(time)
. tabdisp time, c(npresent)
----------------------
time | npresent
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------
I have the following data
id pair_id id_in id_out date
1 1 2 3 1/1/2010
2 1 2 3 1/2/2010
3 1 3 2 1/3/2010
4 1 3 2 1/5/2010
5 1 3 2 1/7/2010
6 2 2 1 1/2/2010
7 3 1 3 1/5/2010
8 2 1 2 1/7/2010
At any given row I want to know what the inflow/outflow differential is between the unique pair id_in and id_out, from the id_in perspective.
For example, for id_in == 2 and id_out == 3 it would look like the following (from id_in == 2's perspective):
id pair_id id_in id_out date inflow_outflow
1 1 2 3 1/1/2010 1
2 1 2 3 1/2/2010 2
3 1 3 2 1/3/2010 1
4 1 3 2 1/5/2010 0
5 1 3 2 1/7/2010 -1
Explanation: id_in == 2 received first, so they get +1; then they received again, so +2. Then they gave out, so the total is reduced by 1, bringing it to 1 at that point, and so on.
This is what I have tried
sort pair_id id_in date
gen count = 0
qui forval i = 2/`=_N' {
local I = `i' - 1
count if id_in == id_out[`i'] in 1/`I'
replace count = r(N) in `i'
}
I don't follow all the logic here and in particular presenting transactions from the point of view of one member seems quite arbitrary. But a broad impression from loosely similar problems is that you should not be thinking about loops here. It should suffice to use by: and cumulative sums. There is an attempt at some systematic discussion of how to handle dyads at http://www.stata-journal.com/sjpdf.html?articlenum=dm0043 but it is only a beginning.
Please note that presenting dates according to some display format is a small pain as they need to be reverse engineered. dataex from SSC can be used to create examples that are easy to copy and paste.
This code may suggest some technique:
clear
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
assert id_in != id_out
gen pair1 = cond(id_in < id_out, id_in, id_out)
gen pair2 = cond(id_in < id_out, id_out, id_in)
bysort pair_id (date): gen sum1 = sum(id_in == pair1) - sum(id_out == pair1)
bysort pair_id (date): gen sum2 = sum(id_in == pair2) - sum(id_out == pair2)
list date id_* pair? sum?, sepby(pair_id)
+----------------------------------------------------------+
| date id_in id_out pair1 pair2 sum1 sum2 |
|----------------------------------------------------------|
1. | 01jan2010 2 3 2 3 1 -1 |
2. | 02jan2010 2 3 2 3 2 -2 |
3. | 03jan2010 3 2 2 3 1 -1 |
4. | 05jan2010 3 2 2 3 0 0 |
5. | 07jan2010 3 2 2 3 -1 1 |
|----------------------------------------------------------|
6. | 02jan2010 2 1 1 2 -1 1 |
7. | 07jan2010 1 2 1 2 0 0 |
|----------------------------------------------------------|
8. | 05jan2010 1 3 1 3 1 -1 |
+----------------------------------------------------------+
A specific pair (as defined by pair_id) is always made up of two entities that can be ordered in one of two ways. For example, entity 5 with entity 8, and entity 8 with entity 5. If one is receiving, the other is giving out, necessarily.
Two slightly different ways of approaching the problem can be found below.
clear all
set more off
*----- example data -----
input id pair_id id_in id_out str8 sdate
1 1 2 3 "1/1/2010"
2 1 2 3 "1/2/2010"
3 1 3 2 "1/3/2010"
4 1 3 2 "1/5/2010"
5 1 3 2 "1/7/2010"
6 2 2 1 "1/2/2010"
7 3 1 3 "1/5/2010"
8 2 1 2 "1/7/2010"
end
gen date = daily(sdate, "MDY")
format date %td
drop sdate
sort pair_id date id
list, sepby(pair_id)
*---- what you want -----
// approach 1
bysort pair_id (date id) : gen sum1 = sum(cond(id_in == id_in[1], 1, -1))
gen sum2 = -1 * sum1
// approach 2
bysort pair_id (id_in date id) : gen temp = cond(id_in == id_in[1], 1, -1)
bysort pair_id (date id) : gen sum100 = sum(temp)
gen sum200 = -1 * sum100
// list
drop temp
sort pair_id date
list, sepby(pair_id)
The first approach involves creating a variable that holds the differential for the entity that first receives according to the date variable. sum1 does just that. Variable sum2 holds the differential for the other entity.
The second approach creates a variable that holds the differential for the entity that has the smallest identifying number. I've named it sum100. Variable sum200 holds the information for the other entity.
Note that I added id to the sorting list in case pair_id date does not uniquely identify observations.
The second approach is equivalent to the code provided by @NickCox, or so I believe.
The results:
. list, sepby(pair_id)
+---------------------------------------------------------------------------+
| id pair_id id_in id_out date sum1 sum2 sum100 sum200 |
|---------------------------------------------------------------------------|
1. | 1 1 2 3 01jan2010 1 -1 1 -1 |
2. | 2 1 2 3 02jan2010 2 -2 2 -2 |
3. | 3 1 3 2 03jan2010 1 -1 1 -1 |
4. | 4 1 3 2 05jan2010 0 0 0 0 |
5. | 5 1 3 2 07jan2010 -1 1 -1 1 |
|---------------------------------------------------------------------------|
6. | 6 2 2 1 02jan2010 1 -1 -1 1 |
7. | 8 2 1 2 07jan2010 0 0 0 0 |
|---------------------------------------------------------------------------|
8. | 7 3 1 3 05jan2010 1 -1 1 -1 |
+---------------------------------------------------------------------------+
Check them carefully, as the difference between both approaches is subtle, at least initially.
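For anyone who wants to cross-check the Stata results outside Stata, here is a minimal pandas sketch of the same running differential, kept from the perspective of the lower-numbered member of each pair (analogous to sum100/sum200 above):

import pandas as pd

df = pd.DataFrame({
    "id":      [1, 2, 3, 4, 5, 6, 7, 8],
    "pair_id": [1, 1, 1, 1, 1, 2, 3, 2],
    "id_in":   [2, 2, 3, 3, 3, 2, 1, 1],
    "id_out":  [3, 3, 2, 2, 2, 1, 3, 2],
    "date":    pd.to_datetime(["1/1/2010", "1/2/2010", "1/3/2010", "1/5/2010",
                               "1/7/2010", "1/2/2010", "1/5/2010", "1/7/2010"],
                              format="%m/%d/%Y"),
})

df = df.sort_values(["pair_id", "date", "id"])

# +1 when the lower-numbered member of the pair receives, -1 when it gives out
low = df[["id_in", "id_out"]].min(axis=1)
sign = (df["id_in"] == low).map({True: 1, False: -1})

# running differential within each pair, in date order
df["sum100"] = sign.groupby(df["pair_id"]).cumsum()
df["sum200"] = -df["sum100"]

print(df)  # reproduces the sum100/sum200 columns in the listing above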