ArangoDB: how to optimize FILTER on a subquery on big collections

I have a fairly simple data model (inspired by IMDB data) to link films to their directors or actors.
This includes 3 collections:
document collection Titles
document collection Persons
edge collection core_edge_values_links with an attribute field to determine the role of the person (director, actor...)
Titles are linked to Persons through a core_edge_values_links edge:
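For reference, an edge document then looks roughly like this (the _from and _to values below are made-up examples; attribute holds the role):
{
  "_from": "titles/12345",
  "_to": "persons/67890",
  "attribute": "directors"
}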
Now, I'm trying to find all Titles which have a link to any director.
Query is:
FOR r IN titles
    LET rlinkVal = FIRST(
        FOR rv, re IN 1 OUTBOUND r._id
            core_edge_values_links
            FILTER re.attribute == 'directors'
            RETURN re
    )
    FILTER rlinkVal != null
    LIMIT 0, 20
    RETURN r
This query takes about 2.5s (and up to almost 2 minutes with fullCount enabled) on a DB with 1M titles, about 200k persons and about 4M edges. I have about 600k matching titles.
To try to speed this up, I added an index on the core_edge_values_links collection on the fields _from and attribute. Unfortunately, the traversal won't take advantage of this index.
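For reference, a persistent index of that shape can be created in arangosh roughly like this (a sketch; the actual index name and options may differ):
db.core_edge_values_links.ensureIndex({
  type: "persistent",
  fields: ["_from", "attribute"]
});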
I also tried to use a simple join instead of the traversal:
FOR r IN titles
    LET rlinkVal = FIRST(
        FOR e IN core_edge_values_links
            FILTER e._from == r._id AND e.attribute == 'directors'
            RETURN true
    )
    FILTER rlinkVal != null
    LIMIT 0, 20
    RETURN r
This time, the index is used, but query time is pretty much the same as with the traversal (about 2.5s, see profiling data below).
This makes me wonder if the problem just comes from the number of titles in the collection (about 1M), as they're all scanned in the initial FOR, but I don't see how I could index that. This is a fairly simple use case; I don't feel it should take this long.
Here is the profiling data provided by ArangoDB.
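A profile of this form can be produced in arangosh along these lines (a sketch; the exact way the numbers below were obtained may differ):
db._profileQuery(`
  FOR r IN titles
    LET rlinkVal = FIRST(
      FOR rv, re IN 1 OUTBOUND r._id core_edge_values_links
        FILTER re.attribute == 'directors'
        RETURN re
    )
    FILTER rlinkVal != null
    LIMIT 0, 20
    RETURN r
`, {});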
With the traversal:
Query String (241 chars, cacheable: false):
FOR r IN titles
LET rlinkVal = FIRST(
FOR rv, re IN 1 OUTBOUND r._id
core_edge_values_links
FILTER re.attribute == 'directors'
RETURN re
)
FILTER rlinkVal != null
LIMIT 0, 20
RETURN r
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00001 * ROOT
2 EnumerateCollectionNode 305 304000 0.08433 - FOR r IN titles /* full collection scan */
4 CalculationNode 305 304000 0.07084 - LET #6 = r.`_id` /* attribute expression */ /* collections used: r : titles */
16 SubqueryStartNode 609 608000 0.12038 - LET #3 = ( /* subquery begin */
5 TraversalNode 306 305000 1.93808 - FOR /* vertex optimized away */, re /* edge */ IN 1..1 /* min..maxPathDepth */ OUTBOUND #6 /* startnode */ core_edge_values_links
6 CalculationNode 306 305000 0.02857 - LET #10 = (re.`attribute` == "directors") /* simple expression */
7 FilterNode 305 304000 0.05040 - FILTER #10
15 LimitNode 305 304000 0.00927 - LIMIT 0, 1
17 SubqueryEndNode 304 303000 0.05520 - RETURN re ) /* subquery end */
10 CalculationNode 304 303000 0.02094 - LET rlinkVal = FIRST(#3) /* simple expression */
11 CalculationNode 304 303000 0.01909 - LET #12 = (rlinkVal != null) /* simple expression */
12 FilterNode 1 20 0.01508 - FILTER #12
13 LimitNode 1 20 0.00001 - LIMIT 0, 20
14 ReturnNode 1 20 0.00001 - RETURN r
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
5 edge edge core_edge_values_links false false 34.87 % [ `_from` ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter / Prune Conditions
5 1..1 core_edge_values_links uniqueVertices: none, uniqueEdges: path FILTER (re.`attribute` == "directors")
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 optimize-subqueries
3 optimize-traversals
4 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 304000 0 302977 1605632 2.41360
Query Profile:
Query Stage Duration [s]
initializing 0.00000
parsing 0.00006
optimizing ast 0.00001
loading collections 0.00001
instantiating plan 0.00006
optimizing plan 0.00120
executing 2.41225
finalizing 0.00003
With the join:
Query String (232 chars, cacheable: false):
FOR r IN titles
LET rlinkVal = FIRST(
FOR e IN core_edge_values_links
FILTER e._from == r._id AND e.attribute == 'directors'
RETURN true
)
FILTER rlinkVal != null
LIMIT 0, 20
RETURN r
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00001 * ROOT
7 CalculationNode 1 1 0.00000 - LET #7 = true /* json expression */ /* const assignment */
2 EnumerateCollectionNode 305 304000 0.09654 - FOR r IN titles /* full collection scan */
17 SubqueryStartNode 609 608000 0.13655 - LET #2 = ( /* subquery begin */
16 IndexNode 305 304000 2.29080 - FOR e IN core_edge_values_links /* persistent index scan, scan only */
15 LimitNode 305 304000 0.00992 - LIMIT 0, 1
18 SubqueryEndNode 304 303000 0.05765 - RETURN #7 ) /* subquery end */
10 CalculationNode 304 303000 0.02272 - LET rlinkVal = FIRST(#2) /* simple expression */
11 CalculationNode 304 303000 0.02060 - LET #9 = (rlinkVal != null) /* simple expression */
12 FilterNode 1 20 0.01656 - FILTER #9
13 LimitNode 1 20 0.00000 - LIMIT 0, 20
14 ReturnNode 1 20 0.00000 - RETURN r
Indexes used:
By Name Type Collection Unique Sparse Selectivity Fields Ranges
16 idx_1742224911374483456 persistent core_edge_values_links false false n/a [ `_from`, `attribute` ] ((e.`_from` == r.`_id`) && (e.`attribute` == "directors"))
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 optimize-subqueries
3 use-indexes
4 remove-filter-covered-by-index
5 remove-unnecessary-calculations-2
6 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 304000 492 302977 1441792 2.65203
Query Profile:
Query Stage Duration [s]
initializing 0.00000
parsing 0.00006
optimizing ast 0.00001
loading collections 0.00001
instantiating plan 0.00003
optimizing plan 0.00056
executing 2.65136
finalizing 0.00003

Related

How to load mixed record type fixed width file with two headers into two separate files

I got a task to load a strangely formatted text file. The file contains unwanted data too. It contains two headers back to back, and the data for each header is specified on alternate lines. Header rows start after ------. I need to read both headers along with their corresponding data and dump them into some Excel/table destination. Let me know how to solve this using any transformation in SSIS, or maybe with a script.
I don't know how to use a Script Task for this.
Right now I am reading the file into one column and using a Derived Column to split it manually with the SUBSTRING function. But that works for only one header and it is too hard-coded. I need a dynamic approach that reads the header rows as well as the data rows directly.
Input file:
A1234-012 I N F O R M A T I C S C O M P A N Y 08/23/17
PAGE 2 BATCH ABC PAYMENT DATE & DUE DATE EDIT PAGE 481
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
1-3 4-6 7-13/90-102 14-19 20-25 26-31 32-34 35-37 38-46 47-48 49 50-51 52-61 62 63 64-72 73 4-5 76 77 8-80
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
1-3 4-6 7-13/90-102 14 15 20-23 24-29 30-34 35-37 38-42 43 44 49 60 61-63 64-69
USED-ID:
------------------------------------------------------------------------------------------------------------------------------------
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Expected output should be:
FILE 1:
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
FILE 2:
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Ignore first 3 rows
To ignore the first 3 rows, you can simply configure the Flat File Connection Manager to skip them (set "Header rows to skip" to 3).
Split file and remove bad rows
1. Configure connection managers
In addition, in the Flat File Connection Manager, go to the Advanced tab, delete all columns except one, and change its data type to DT_STR and its MaxLength to 4000.
Add two more connection managers, one for each destination file, where you must define only one column with max length = 4000.
2. Configure Data flow task
Add a Data Flow Task, and add a Flat File Source inside it. Select the source file connection manager.
Add a conditional split with the following expressions:
File1
FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19
File2
FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10
The expressions above are based on the expected output mentioned in the question: I tried to find unique keywords inside each header, and split the data rows based on the number of space-separated tokens (the first data-row layout has 19 tokens and the second has 10, hence the TOKENCOUNT checks).
Finally, map each output to a Flat File Destination component.
Experiments
The execution result is shown in the following screenshots:
Update 1 - Remove duplicates
To remove duplicates you can refer to the following link:
How to remove duplicate rows from flat file using SSIS?
Update 2 - Remove only duplicate headers + Replace spaces with Tab
If you need only to remove duplicate headers then you can do this in two steps:
Add a script component after each conditional split output to flag unwanted rows
Add a conditional split to filter rows based on the script component output
In addition, because the column values themselves do not contain spaces, you can use a regular expression to replace each run of spaces with a single tab to make the file consistent.
Script Component
In the Script Component, add an output column of type DT_BOOL and name it outFlag, add another output column outColumn0 of type DT_STR with length 4000, and select Column 0 as an input column.
Then write the following script in the Script Editor (C#):
First make sure that you add the RegularExpressions namespace
using System.Text.RegularExpressions;
Script Code
// counters so that only the first SEO header and the first NOM header are kept
int SEOCount = 0;
int NOMCount = 0;
// matches runs of two or more spaces, later collapsed into a single tab
Regex regex = new Regex("[ ]{2,}", RegexOptions.None);
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (Row.Column0.Trim().StartsWith("SEO"))
{
if (SEOCount == 0)
{
SEOCount++;
Row.outFlag = true;
}
else
{
Row.outFlag = false;
}
}
else if (Row.Column0.Trim().StartsWith("NOM"))
{
if (NOMCount == 0)
{
NOMCount++;
Row.outFlag = true;
}
else
{
Row.outFlag = false;
}
}
else if (Row.Column0.Trim().StartsWith("PAGE"))
{
Row.outFlag = false;
}
else
{
Row.outFlag = true;
}
// trim leading blanks and collapse each run of two or more spaces into a single tab
Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
}
Conditional Split
Add a conditional split after each Script Component and use the following expression to filter duplicate header:
[outFlag] == True
Then connect the conditional split to the destination. Make sure to map outColumn0 to the destination column.
Package link
https://www.dropbox.com/s/d936u4xo3mkzns8/Package.dtsx?dl=0

Adding count to a vector while looping through dataframe

I am relatively new to R. I'm working on a project with a column of IDs (PMID), a column of MeSH terms, which are summarized biomedical terms (MH), and a column for year, organized sequentially (EDAT_Year). My goal is to create a vector that holds, for each year, the count of rows whose MeSH terms contain a particular word. Basically, if a row contains the word (its presence, not how many times it appears in the row), it should be counted, separated by year in the vector.
Here is an example. Suppose this is the dataframe:
PMID MH EDAT_Year
1 Male, Lung, Heart, Aneurysm 1978
2 Male, Male, Anemia, Lung 1978
3 Heart, Anemia, Adult 1980
4 Female, Heart, Blood, Acute 1980
5 Male, Blood, Adult, Lung 1980
6 Male, Kidney, Brain, Heart 1983
7 Male, Lung, Blood, Male 1983
Then, if I were to test "Male", I would want the output to be
2 1 2
to represent that there are 2 observations in 1978 that contain "Male", 1 in 1980, and 2 in 1983 (regardless of how many times it has appeared).
I am currently working with 3 years, but hope to expand to more. I was able to do this manually with 3 years with the following (the years are 1978, 1980 and 1983, by the way), for which I created multiple columns that only contain the MeSH terms belonging to that year:
# count occurrences in the three years
disease_78 <- length(grep("\\<Male\\>", total$MH_78))
disease_80 <- length(grep("\\<Male\\>", total$MH_80))
disease_83 <- length(grep("\\<Male\\>", total$MH_83))
But now I am trying to write a function so that if I were to enter a phrase, I would get all the occurrences in one vector, instead of manually having to copy and paste or having hundreds of columns for each year. This is what I have so far:
# function to count occurrences
count_fxn <- function(x)
{
# read in argument as character
phrase_to_count <- deparse(substitute(x))
# create a vector to store count values
count_occur <- numeric(0)
# a vector for how many years there are
num_years <- seq(1, 3, 1)
# loop through entire data frame
for (i in 1:length(total$PMID))
{
# loop through the three years
for(j in 1:length(num_years))
{
# if at least one occurrence appears in the row's cell, increment the count
if (length(grep(phrase_to_count, total$MH[i]) > 0))
{
count_occur[j] <- count_occur[j] + 1
}
# if the next row's year is different than the current one's, move to
# next spot for next year in vector
if (total$EDAT_Year[i] != total$EDAT_Year[i+1])
{
j <- j + 1
}
# increment so go to next line to read in data
i <- i + 1
}
}
return(count_occur)
}
# using function
count_fxn(Male)
But this is the error I keep getting:
Error in if (total$EDAT_Year[i] != total$EDAT_Year[i + 1]) { :
missing value where TRUE/FALSE needed
When I change
if (total$EDAT_Year[i] != total$EDAT_Year[i + 1])
to
if (total$EDAT_Year[j] != total$EDAT_Year[j + 1])
I don't get any errors, but instead, the output is
NA NA NA
when it should be something like
3453 2343 5235
to represent how many observations contained "Male" in them, in the years 1978, 1980, and 1983 respectively.
Please advise. I'm not the strongest coder yet, and I've been working on this for 2 hours when I'm sure it could've been done in much less time.
You could use by().
with(df, lengths(by(MH, EDAT_Year, grep, pattern="Male")))
# EDAT_Year
# 1978 1980 1983
# 2 1 2
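If you need this for several different terms, a small wrapper around the same idea saves retyping (a sketch; df and the column names are as in the example above):
count_term <- function(term, data) {
  with(data, lengths(by(MH, EDAT_Year, grep, pattern = term)))
}
count_term("Male", df)
# EDAT_Year
# 1978 1980 1983
#    2    1    2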
If you want to calculate the number of occurrences of every "word" in MH for every year without having to type out each word or create a list of words you can do so as follows:
DF <- read.table(text="PMID MH EDAT_Year
1 Male,Lung,Heart,Aneurysm 1978
2 Male,Male,Anemia,Lung 1978
3 Heart,Anemia,Adult 1980
4 Female,Heart,Blood,Acute 1980
5 Male,Blood,Adult,Lung 1980
6 Male,Kidney,Brain,Heart 1983
7 Male,Lung,Blood,Male 1983", header=T)
library(dplyr) # needed for the %>% pipe used below
DF <- DF %>%
#Convert MH column to nested list
dplyr::mutate(MH = strsplit(as.character(MH), ",")) %>%
#reshape data into tidy format
tidyr::unnest(MH) %>%
#eliminate duplicates to not count PMIDs with multiple identical entries in MH
unique() %>%
#count entries for each value in MH by year
reshape2::dcast(EDAT_Year ~ MH)
DF
Results in:
EDAT_Year Acute Adult Anemia Aneurysm Blood Brain Female Heart Kidney Lung Male
1 1978 0 0 1 1 0 0 0 1 0 2 2
2 1980 1 2 1 0 2 0 1 2 0 1 1
3 1983 0 0 0 0 1 1 0 1 1 1 2
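If you only need a single term afterwards, subsetting this wide result reproduces the per-year vector from the question, e.g.:
DF$Male
# [1] 2 1 2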

Nested loop with increasing number of elements

I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not in the long one. More precisely, time-varying variables have an alphabetic prefix which identifies the time it belongs to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, until rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. As a consequence, running a panel over the full number of periods will reduce my number of observations considerably. Hence, I would like to know precisely how many observations panels of different lengths will have. To evaluate this, I want to create variables that, for each period and each number of consecutive periods, count how many respondents have nonmissing values over that time sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first period and the second. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, this has to be in a loop. Moreover, there is a different number of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). Thus, the number of variables to create is 1+2+3+4+...+17 = 153. This variability has to be reflected in the loop. I propose code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local counter = 1 // counter to update; reflects sequence length
while `counter' > 0 { // loop over sequence lengths
gen _`var'_counter_`counter' = 0 // generate variable with counter
if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS
recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
}
}
local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if right) is the following. Consider a dataset of five observations, for four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means value is observed in that period, and . is not. The objective of the code is to create 1+2+3=6 new variables such that the new dataset is:
Obs a b c d b_count_2 c_count_2 c_count_3 d_count_2 d_count_3 d_count_4
1 1 1 . 1 1 0 0 0 0 0
2 1 1 . . 1 0 0 0 0 0
3 . . 1 1 0 0 0 1 0 0
4 . 1 1 . 0 1 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
local list `var'_counter_* // group of sequence variables for each period
foreach var2 of local list { // loop over each element of the list
quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // sum the number of individuals with value = 1 with sequence of length var2 in period var
di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
}
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.
If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
qui foreach pre in b c d e f g {
noi dis "{hline 80}" _n as res "Wave `pre'"
// the longest substring without a space up to the wave
gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
noi tab temp
// loop over the various substring lengths, from 2 to max length
gen len = length(temp)
sum len, meanonly
local n = r(max)
forvalues i = 2/`n' {
count if length(temp) >= `i'
noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
dis "{hline 80}" _n as res "Wave `pre'"
sum _seq if wave == "`pre'", meanonly
local n = r(max)
forvalues i = 2/`n' {
qui count if _seq >= `i' & wave == "`pre'"
dis as txt "length = " as res `i' as txt " obs = " as res r(N)
}
}
I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as you have, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
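A quick way to see the difference for yourself, using the auto dataset shipped with Stata:
sysuse auto, clear
if price > 5000 display "condition checked against price[1] only"   // -if- command: uses the first observation
count if price > 5000   // -if- qualifier: condition evaluated for every observation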
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that because of the first point just made, these new variables would pertain with your strategy only to individual variables like pay and individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have to work with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
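For reference, it can be installed and its help file read with:
ssc install tsspell
help tsspell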
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.
Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 5 -> 20
Number of variables 5 -> 3
j variable (4 values) -> time
xij variables:
y1 y2 ... y4 -> y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
id: 1, 2, ..., 5 n = 5
time: 1, 2, ..., 4 T = 4
Delta(time) = 1 unit
Span(time) = 4 periods
(id*time uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
2 2 2 2 3 4 4
Freq. Percent Cum. | Pattern
---------------------------+---------
1 20.00 20.00 | ..11
1 20.00 40.00 | .11.
1 20.00 60.00 | 11..
1 20.00 80.00 | 11.1
1 20.00 100.00 | 1111
---------------------------+---------
5 100.00 | XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+------------------------------------+
| pattern Freq. Percent % <= |
|------------------------------------|
| ..11 2 15.38 15.38 |
| .11. 2 15.38 30.77 |
| 11.. 2 15.38 46.15 |
| 11.1 3 23.08 69.23 |
| 1111 4 30.77 100.00 |
+------------------------------------+
. restore
. egen npresent = total(missing(y)), by(time)
. tabdisp time, c(npresent)
----------------------
time | npresent
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------

Split Pandas Dataframe into separate pieces based on column values

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np
columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)
df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52
df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3
df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
In a given s_id, produce separate dataframes for each c_id value. ex. for s_id = 144, there will be 3 dataframes, while for s_id = 105 there will be 2 dataframes
Inner Join the separate dataframes produced in step 1, on the elements column (c_col1), in Pandas. This is a little difficult to explain, so here is the dataframe that I would like to get from this step:
index s_id c_id c_col1
0 144 1 O
1 144 2 O
2 144 3 O
3 105 1 O
4 105 2 F
5 52 1 N
6 52 2 N
7 52 3 N
As you can see, what I am looking for in part 2.) is the following: Within each s_id, I am looking for those c_col1 values that occur for all the c_id values. ex. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2, 3. I have pointed to these entries, with "<--", in the raw data. So, I would like to have the dataframe show O 3 times in the c_col1 column and the corresponding c_id entries would be 1, 2, 3.
Conditions:
the number of unique c_ids is not known ahead of time, i.e. for one particular s_id, I do not know if there will be 1, 2 and 3 or just 1 and 2. This means that if 1, 2 and 3 occur, there will be one Inner Join; if only 1 and 2 occur, then there will be only one Inner Join.
How can this be done with Pandas?
Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id and the inner keys are the c_id and the inner values are the data. That you can do with a fairly long but straightforward dict comprehension:
DF_dict = {s_id :
{c_id : df[(df.s_id == s_id) & (df.c_id == c_id)] for c_id in df[df.s_id == s_id]['c_id'].unique()}
for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. You want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id you might be better off exploring groupby options. Perhaps someone understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question...
##################EDIT
It seems to me that you should just go straight to problem 2, if problem 1 is simply a step you believe to be necessary to get to a problem 2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform the data according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear for every value of c_id.
You might write a function like this:
def c_id_overlap(df):
common_vals = [] #container for values of c_col1 that are in every c_id subgroup
c_ids = df.c_id.unique() #get unique values of c_id
c_col1_values = set(df.c_col1) # get a set of c_col1 values
#create nested list of values. Each inner list contains the c_col1 values for each c_id
nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
#Iterate through the c_col1_values and see if they are in every nested list
for val in c_col1_values:
if all([True if val in elem else False for elem in nested_c_col_vals]):
common_vals.append(val)
#return a slice of the dataframe that only contains values of c_col1 that are in every
#c_id
return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index = False).apply(c_id_overlap)
which gives me the following output:
s_id c_id c_col1
0 31 52 1 N
34 52 2 N
40 52 3 N
1 16 105 1 O
17 105 1 F
18 105 1 Zn
20 105 2 O
21 105 2 F
23 105 2 Zn
2 3 144 1 O
6 144 2 O
11 144 3 O
Which seems to be what you are looking for.
###########EDIT: Additional Explanation:
So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.
So think about the group passed where s_id == 105. The first line of the function creates an empty list common_vals which will contain those periodic elements that appear in every subgroup of the data (i.e. relative to each of the values of c_id).
The second line gets the unique values of 'c_id', in this case [1, 2] and stores them in an array called c_ids
The third line creates a set of the values of c_col1 which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure nested_c_col_vals where every inner list is a list of the unique values associated with each of the elements in the c_ids array. In this case this looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
Now each of the elements in c_col1_values is iterated over, and for each of those elements the program determines whether that element appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is True, or rather whether it is non-zero (you will need to check this). So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there. If it is, then True; if it is not, then False. So in this case this resolves to:
all([True, False])
which is of course False. Now when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete the values inside common_vals are:
['O', 'F', 'Zn']
The return statement simply slices the data chunk according to whether the values of c_col1 are in the list common_vals, as per above.
This is then repeated for each of the remaining groups and the data are glued back together.
Hope this helps
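If it helps, the same rule can also be written more compactly with a set intersection per s_id group. This is just a sketch assuming the df built in the question; it should select the same rows as c_id_overlap:
def c_id_overlap_short(group):
    # one set of c_col1 values per c_id; a value survives only if it occurs for every c_id
    per_c_id = group.groupby('c_id')['c_col1'].apply(set)
    common = set.intersection(*per_c_id)
    return group[group.c_col1.isin(common)]

df.groupby('s_id', as_index=False).apply(c_id_overlap_short)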

Stata: Generating individual comparison groups for each observation in sample (age brackets)

Currently, I am trying to assign certain properties of a comparison group (e.g. its mean income) to each individual in my microdata sample. The comparison groups are defined by other observables (gender, region) and consist of other individuals. So far, I have coded:
egen com_group = group(gender region)
bysort com_group: egen com_income = mean(income)
This works so far, but this approach raises two issues:
As the mean is calculated over all individuals in a group, and the current observation is part of its own group, its own income counts toward the mean income of its own reference group. This might introduce a (small) bias. This problem seems minor compared to problem 2.
I would prefer to assign the average income of less static groups. More concretely, I'm thinking about generating comparison groups of the type group(gender region age +/- 5 years). Such running age brackets can't be handled in the way described above, since observations of different ages have different age brackets, so the group membership can't be stored in a single variable like "ref_group" beforehand. My idea was to loop over all observations and generate observation-specific reference groups, but I don't really know how to do this…
Will this give you what you want? I have not checked the details, and I will add some explanation later. In this example, the range for age is +/- 1 and the grouping variable is race.
clear all
set more off
*----- example data -----
input ///
idcode age race wage
45 35 1 10.18518
47 35 1 3.526568
48 35 1 5.852843
1 37 2 11.73913
2 37 2 6.400963
9 37 1 10.49114
36 37 1 4.180602
7 39 1 4.62963
15 39 1 16.79548
20 39 1 9.661837
12 40 1 17.20612
13 40 1 13.08374
14 40 1 7.745568
16 40 1 15.48309
18 40 1 5.233495
19 40 1 10.16103
97 40 2 19.92563
22 41 1 9.057972
24 41 1 11.09501
44 41 1 28.45666
98 41 2 4.098635
3 42 2 5.016723
6 42 1 8.083731
23 42 1 8.05153
25 42 1 9.581316
99 42 2 9.875124
4 43 1 9.033813
39 44 1 9.790657
46 44 1 3.051529
end
sort age idcode
list, sepby(age)
*----- what you want -----
gen mwage = .
levelsof race, local(lrace)
forvalues i = 1/`=_N' {
foreach j of local lrace {
summarize wage if ///
inrange(age, age[`i']-1, age[`i']+1) /// age condition
& race == `j' /// race condition
& _n != `i' /// self-exclude condition
, meanonly
replace mwage = r(mean) if race == `j' in `i'
}
}
list, sepby(age)
Edit
If Stata is too slow with your database, then you can do it with Mata. Here is my attempt at it (I'm only starting to use it):
clear all
set more off
*----- example data -----
sysuse nlsw88
expand 2
*----- what you want -----
egen gro = group(race industry) // grouping variables
* Get number of groups
summarize gro, meanonly
local numgro = r(max)
* Compute upper limits for groups
forvalues i = 1/`numgro' {
summarize gro if gro == `i', meanonly
local countgro `countgro' `r(N)'
}
/*
sort group and bracking var. sort in Stata so Mata results
can be posted back to Stata using only -getmata-
*/
sort gro age
* Take statistic and bracking variables to Mata
putmata STVAR=wage BRVAR=age
mata:
/*
Get upper limits of groups from Stata.
Not considered good style. See Mata Matters: Macros, Gould (2008)
*/
UPLIM = tokens(st_local("countgro"))
UPLIM = runningsum(strtoreal(UPLIM)) // upper limits of groups
/*
For example, in the following observation ranges, each line
shows lower and upper limits:
1-11
12-23
24-28
29-29
*/
ST = J(rows(STVAR), 1, .)
for (i = 1; i <= cols(UPLIM); i++) {
if (i == 1) {
ro = 1
}
else {
ro = UPLIM[i-1]+1
}
co = UPLIM[i]
STVARP = STVAR[|ro\co|] // statistic variable
BRVARP = BRVAR[|ro\co|] // bracket variable
STPART = J(rows(STVARP), 1, 0)
for (j = 1; j <= rows(BRVARP); j++) {
SMALLER = BRVARP :>= BRVARP[j] - 1
LARGER = BRVARP :<= BRVARP[j] + 1
STPART[j] = ( sum(STVARP :* SMALLER :* LARGER) - STVARP[j] ) / ( sum(SMALLER :* LARGER) - 1 ) //division by zero gives . for last group with only one observation
}
ST[|ro\co|] = STPART // stack results
}
end
getmata mwage=ST
keep wage race industry gro age mwage
sort gro age wage
//list wage gro age matawage, sepby(gro)
Mata is formidable with loops; a database with 15,000 observations takes only a few seconds.
