How to compare and select non-changing variables in panel data - loops

I have unbalanced panel data and need to exclude observations (in t) for which the income changed during the year before (t-1), while keeping other observations of these people. Thus, if a change in income happens in year t, then year t should be dropped (for that person).
clear
input year id income
2003 513 1500
2003 517 1600
2003 518 1400
2004 513 1500
2004 517 1600
2004 518 1400
2005 517 1600
2005 513 1700
2005 518 1400
2006 513 1700
2006 517 1800
2006 518 1400
2007 513 1700
2007 517 1600
2007 518 1400
2008 513 1700
2008 517 1600
2008 518 1400
end
xtset id year
xtline income, overlay
To illustrate what's going on, I added an xtline plot, which follows the income per person over the years. ID 518 is the perfect non-changing case (keep all observations). ID 513 has a one-time jump (drop year 2005 for that person). ID 517 has something like a peak, perhaps a one-time measurement error (drop 2006 and 2007).
I think there should be some form of loop: initialize the first value for each person (because it cannot be compared), say t0; then compare t1 with t0 and drop if changed, else compare t2 with t1, and so on. Because the data is unbalanced, there might be missing year observations. Thanks for any advice.
Update/Goal: The purpose is to prepare the data for a fixed-effects regression analysis. There is another variable that is reported for the entire "last year". Income, however, is reported at the interview date (a point in time). I need to get close to something like "last year's income" to relate it to this variable. The procedure is suggested and followed by several publications; I am trying to replicate and understand it.
Solution:
bysort id (year) : drop if income != income[_n-1] & _n > 1

Alternatively, to inspect the observations before dropping them, flag the changes instead:
bysort id (year) : gen byte flag = (income != income[_n-1]) if _n > 1
list, sepby(id)
The procedure is VERY IFFY methodologically. There is no need to prepare for the fixed effects analysis other than xtsetting the data; and there rarely is any excuse to create missing data... let alone do so to squeeze the data into the limits of what (other) researchers know about statistics and econometrics. I understand that this is a replication study, but whatever you do with your replication and wherever you present it, you need to point out that the original authors did not have much clue about regression to begin with. Don't try too hard to understand it.

Related

Optimisation of Fact Read & Write Snowflake

All,
I am looking for some tips on optimizing an ELT process on a Snowflake fact table with approx. 15 billion rows.
We receive approximately 30,000 rows every 35 minutes, like the ones below, and we always get three Account Dimension Key values, i.e. Sales, COGS & MKT.
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    Y
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT  IsCurrent
001          2019-01-01  001                   0012          100    300   200  Y
At times there is an adjustment made for one Account Key:
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value
001          2019-01-01  001                   0012          SALES_001    50
Hence we have to do this
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    X
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
001          2019-01-01  001                   0012          SALES_001    50     Y
And the resulting value should be
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT
001          2019-01-01  001                   0012          50     300   200
However, my question is how to optimize the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.
This is more about optimizing the read and the write.
Any pointers?
Thanks
So a longer form answer:
I am going to assume that the Date_Key values being far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. Snowflake's Query Acceleration Service might help with the IO.
But the primary way to improve total processing time is to process less.
For example, in my old job we had IoT data, with messages that could arrive up to two weeks old, and we de-duplicated all messages on load (which is effectively a similar process) as part of our pipeline error handling. We found that handling batches with a minimum date of two weeks back against the full message tables (which also had billions of rows) spent most of the time reading and writing those tables. By altering the ingestion to sideline messages older than a couple of hours and deferring their processing until "midnight", we could get all the timely points in each batch processed in under 5 minutes (we did a large amount of other processing in that interval), and we could turn the warehouse cluster size down outside core data hours and use the saved credits to run a bigger instance at the midnight "catch-up" to bring the whole day's worth of sidelined data on board. This eventually consistent approach worked very well for our application.
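A minimal sketch of that sidelining idea in your case might look like the following (STG_FACT_BATCH and STG_LATE_ROWS are assumed staging/sideline tables, not your actual objects, and "older than yesterday" is an arbitrary cutoff):
-- Sketch only: split the freshly loaded batch into timely rows (processed now)
-- and late rows (parked for the nightly catch-up).
INSERT INTO STG_LATE_ROWS
SELECT *
FROM STG_FACT_BATCH
WHERE DATE_KEY < DATEADD(day, -1, CURRENT_DATE());

DELETE FROM STG_FACT_BATCH
WHERE DATE_KEY < DATEADD(day, -1, CURRENT_DATE());
-- The nightly job then applies STG_LATE_ROWS in one pass on a larger warehouse.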
So if your destination table is well clustered, so that the reads/writes touch just a fraction of the data, then this should be less of a problem for you. But by the nature of the question I assume this is not the case.
So if the table's natural clustering is unaligned with the loaded data, is that because the destination table needs to be a different shape for the critical reads it is handling? At that point you have to decide which is more cost/time sensitive. Another option is to have the destination table clustered by date (again assuming the fresh data covers a small window of time) and have a materialized view/table on top of it with a different sort, so that the rewrite is done for you by Snowflake. This is not a free lunch, but it might allow both faster upsert times and faster usage performance, assuming both are time sensitive.
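To make the date-pruning idea concrete, here is a rough sketch of applying one 35-minute batch with a MERGE restricted to the batch's Date_Key window, so a target table clustered by Date_Key only rewrites the affected micro-partitions. The names STG_FACT_BATCH and FACT_PIVOT are assumptions, as is the assumption that the batch carries only the current row per Account_Key:
-- Sketch only. STG_FACT_BATCH holds the ~30K incoming rows (one row per
-- Account_Key); FACT_PIVOT is the pivoted reporting table, assumed clustered by DATE_KEY.
SET batch_min_date = (SELECT MIN(DATE_KEY) FROM STG_FACT_BATCH);
SET batch_max_date = (SELECT MAX(DATE_KEY) FROM STG_FACT_BATCH);

MERGE INTO FACT_PIVOT tgt
USING (
    -- Pivot the batch itself; only the current value per account is assumed present.
    SELECT FINANCE_KEY,
           DATE_KEY,
           DEPARTMENT_GROUP_KEY,
           SCENARIO_KEY,
           MAX(CASE WHEN ACCOUNT_KEY LIKE 'SALES%' THEN VALUE END) AS SALES,
           MAX(CASE WHEN ACCOUNT_KEY LIKE 'COGS%'  THEN VALUE END) AS COGS,
           MAX(CASE WHEN ACCOUNT_KEY LIKE 'MKT%'   THEN VALUE END) AS MKT
    FROM STG_FACT_BATCH
    GROUP BY FINANCE_KEY, DATE_KEY, DEPARTMENT_GROUP_KEY, SCENARIO_KEY
) src
  ON  tgt.FINANCE_KEY = src.FINANCE_KEY
  AND tgt.DATE_KEY    = src.DATE_KEY
  -- The explicit date window lets Snowflake prune partitions of the 15B-row table.
  AND tgt.DATE_KEY BETWEEN $batch_min_date AND $batch_max_date
WHEN MATCHED THEN UPDATE SET
    SALES = COALESCE(src.SALES, tgt.SALES),
    COGS  = COALESCE(src.COGS,  tgt.COGS),
    MKT   = COALESCE(src.MKT,   tgt.MKT)
WHEN NOT MATCHED THEN INSERT
    (FINANCE_KEY, DATE_KEY, DEPARTMENT_GROUP_KEY, SCENARIO_KEY, SALES, COGS, MKT)
VALUES (src.FINANCE_KEY, src.DATE_KEY, src.DEPARTMENT_GROUP_KEY,
        src.SCENARIO_KEY, src.SALES, src.COGS, src.MKT);
As noted above, this only pays off if the pivoted table's clustering actually lines up with the date predicate.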

Is normalization always necessary and more efficient?

I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID   DATE        CAR       LOCATION   FUEL  FUEL_USAGE  MILES   BATTERY
123  01.01.2021  Toyota    New York   40.3  3.6         79321   78
520  01.01.2021  BMW       Frankfurt  34.2  4.3         123232  30
934  01.01.2021  Mercedes  London     12.7  4.7         4321    89
123  05.01.2021  Toyota    New York   34.5  3.3         79515   77
520  05.01.2021  BMW       Frankfurt  20.1  4.6         123489  29
934  05.01.2021  Mercedes  London     43.7  5.0         4400    89
In this example I get data for thousands of cars every day. ID, CAR and LOCATION never change; all the other values can change daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID   CAR       LOCATION
123  Toyota    New York
520  BMW       Frankfurt
934  Mercedes  London
TABLE: CAR_MEASUREMENT
GUID  ID   DATE        FUEL  FUEL_USAGE  MILES   BATTERY
1     123  01.01.2021  40.3  3.6         79321   78
2     520  01.01.2021  34.2  4.3         123232  30
3     934  01.01.2021  12.7  4.7         4321    89
4     123  05.01.2021  34.5  3.3         79515   77
5     520  05.01.2021  20.1  4.6         123489  29
6     934  05.01.2021  43.7  5.0         4400    89
I have two questions:
1. Does it make sense to create an extra table for DATE?
2. It is possible that new cars will appear in the collected data. For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT and insert it if it doesn't exist. That means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient to just insert the whole data as one row into CAR_ALL, so I wouldn't have to check CAR_CONSTANT every time?
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns (CAR and LOCATION).
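For instance, a per-car lookup in the normalized schema could look roughly like this (names taken from your example):
-- Sketch: all measurements for one car, joined on the integer ID.
SELECT c.CAR, c.LOCATION, m.DATE, m.FUEL, m.FUEL_USAGE, m.MILES, m.BATTERY
FROM CAR_CONSTANT c
LEFT JOIN CAR_MEASUREMENT m ON m.ID = c.ID
WHERE c.ID = 123
ORDER BY m.DATE;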
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
To answer your questions directly:
1. I would not create an extra table for just the date. The date is part of the CAR_MEASUREMENT data and should not be separated. The only exception I can think of is if you will eventually collect measurements that do not contain any car data; in that case it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables, with MEASUREMENT containing the date and CAR_DATA containing just the car-specific data.
2. See above. If you have a use case for querying data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
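As a rough sketch of the insert path in the normalized form (generic SQL; the GUID in CAR_MEASUREMENT is assumed to be generated by the database, and the literal values are just the first row of your example):
-- Sketch: record one reading; create the CAR_CONSTANT row only if this ID is new.
INSERT INTO CAR_CONSTANT (ID, CAR, LOCATION)
SELECT 123, 'Toyota', 'New York'
WHERE NOT EXISTS (SELECT 1 FROM CAR_CONSTANT WHERE ID = 123);

INSERT INTO CAR_MEASUREMENT (ID, DATE, FUEL, FUEL_USAGE, MILES, BATTERY)
VALUES (123, '2021-01-01', 40.3, 3.6, 79321, 78);
If the data arrives in daily bulk loads, the existence check can also be done once per batch with a set-based INSERT ... SELECT ... WHERE NOT EXISTS from a staging table, rather than once per row.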

Sequences in Graph Database

All,
I am new to the graph database area and want to know if this type of example is applicable to a graph database.
Say I am looking at a baseball game. When each player goes to bat, there are 3 possible outcomes: hit, strikeout, or walk.
For each batter and throughout the baseball season, what I want to figure out is the counts of the sequences.
For example, for batters that went to the plate n times, how many had a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit), and how many of the same batters repeated the same sequence, indexed by time. To explain further, time would allow me to know whether a particular sequence (e.g. hit/walk/strikeout or hit/hit/hit/hit) occurred during the beginning of the season, in the middle, or in the later half.
For a key-value type database, the raw data would look as follows:
Batter Time Game Event Bat
------- ----- ---- --------- ---
Charles April 1 Hit 1
Charles April 1 strikeout 2
Charles April 1 Walk 3
Doug April 1 Walk 1
Doug April 1 Hit 2
Doug April 1 strikeout 3
Charles April 2 strikeout 1
Charles April 2 strikeout 2
Doug May 5 Hit 1
Doug May 5 Hit 2
Doug May 5 Hit 3
Doug May 5 Hit 4
Hence, my output would appear as follows:
Sequence Freq Unique Batters Time
----------------------- ---- -------------- ------
hit 5000 600 April
walk/strikeout 3000 350 April
strikeout/strikeout/hit 2000 175 April
hit/hit/hit/hit/hit 1000 80 April
hit 6000 800 May
walk/strikeout 3500 425 May
strikeout/strikeout/hit 2750 225 May
hit/hit/hit/hit/hit 1250 120 May
. . . .
. . . .
. . . .
. . . .
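In plain SQL terms, just to make the target output explicit, the aggregation I am after would be roughly the following (the table name AtBats is something I am assuming here, the syntax is SQL Server-style STRING_AGG, and this only counts full per-game sequences):
-- Sketch: build each batter's per-game sequence, then count sequences per month.
WITH seqs AS (
    SELECT Batter,
           [Time],
           Game,
           STRING_AGG(LOWER([Event]), '/') WITHIN GROUP (ORDER BY Bat) AS [Sequence]
    FROM AtBats
    GROUP BY Batter, [Time], Game
)
SELECT [Sequence],
       COUNT(*)               AS Freq,
       COUNT(DISTINCT Batter) AS [Unique Batters],
       [Time]
FROM seqs
GROUP BY [Sequence], [Time]
ORDER BY [Time], Freq DESC;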
If this is feasible for a graph database, would it also scale? What if instead of 3 possible outcomes for a batter, there were 10,000 potential outcomes with 10,000,000 batters?
Moreover, the 10,000 unique outcomes would be sequenced in a combinatoric setting (e.g. 10,000 choose 2, 10,000 choose 3, etc.).
My question then is: if a graph database is appropriate, how would you propose setting up a solution?
Much thanks in advance.

FIFO Implementation using SQL server

I am looking for a SQL Server implementation of FIFO. For customers of a large company, I have two datasets: points earned per year in one dataset, and points redeemed and expired in each year in the other.
The target is to determine the points redeemed and expired out of the points earned in any given year. So, obviously, the redeemed and expired points will have to be assigned to the earned points on a first in first out basis.
The datasets look as follows:
Points earned:
ID year earn
1 2000 100
1 2001 150
1 2002 75
1 2003 10
1 2004 120
Points burned:
ID Year Type points
1 2001 burn 50
1 2001 exp 20
1 2003 burn 120
1 2004 exp 100
1 2006 exp 20
Combining the two datasets, we should get a dataset like this:
Combined dataset:
ID Year Earn Burn exp
1 2000 100 80 20
1 2001 150 90 60
1 2002 75 0 60
1 2003 10 0 0
1 2004 120 0 0
Simply put, the burned and expired points are assigned to the earned points on a first in, first out basis. The dataset is huge and cannot be exported to Excel. Code to do the above in SQL Server would be a huge help.
I figured out the answer. I certainly do not have the most efficient code, but I used the following algorithm to assign points redeemed and expired using FIFO logic:
1. Calculate the maximum points that will be used by an individual. This includes both the points redeemed and the points expired. The target is to assign these points to the different earnings transactions.
2. Figure out the earnings transaction at which the maximum points used is crossed. Any transaction below this one should not have any points assigned under redeemed or expired, so set redeemed and expired to 0 for all of those transactions. Now we only need to assign points to the remaining transactions.
3. For each remaining earnings transaction, figure out the transaction in the used table whose running total crosses the cumulative points earned up to that transaction. The total expired points to assign to this earnings transaction are the points expired up to that used transaction, plus whatever is left over from that used transaction, depending on whether it is a redeemed or an expired transaction. Points redeemed are assigned in the same way.
The algorithm worked perfectly. I could have done a better job with the coding; it took 3 hours to run on a dataset with 150 million records.
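For reference, here is a set-based sketch of the same FIFO assignment using running totals and interval overlap, rather than the step-by-step algorithm above (SQL Server window-function syntax; the table names Earned and Used are assumed from the sample data):
-- Sketch (SQL Server 2012+). Running totals turn each earn row and each used
-- row into an interval on a cumulative points axis; the overlap of the two
-- intervals is the FIFO allocation.
WITH earn_runs AS (
    SELECT ID, [year], earn,
           SUM(earn) OVER (PARTITION BY ID ORDER BY [year]
                           ROWS UNBOUNDED PRECEDING) - earn AS earn_start,
           SUM(earn) OVER (PARTITION BY ID ORDER BY [year]
                           ROWS UNBOUNDED PRECEDING)        AS earn_end
    FROM Earned
),
use_runs AS (
    SELECT ID, [Year], [Type], points,
           -- ties within a year are broken by Type; the totals are unaffected
           SUM(points) OVER (PARTITION BY ID ORDER BY [Year], [Type]
                             ROWS UNBOUNDED PRECEDING) - points AS use_start,
           SUM(points) OVER (PARTITION BY ID ORDER BY [Year], [Type]
                             ROWS UNBOUNDED PRECEDING)          AS use_end
    FROM Used
)
SELECT e.ID, e.[year], e.earn AS Earn,
       ISNULL(SUM(CASE WHEN u.[Type] = 'burn' THEN
             (CASE WHEN e.earn_end  < u.use_end   THEN e.earn_end  ELSE u.use_end   END)
           - (CASE WHEN e.earn_start > u.use_start THEN e.earn_start ELSE u.use_start END)
       END), 0) AS Burn,
       ISNULL(SUM(CASE WHEN u.[Type] = 'exp' THEN
             (CASE WHEN e.earn_end  < u.use_end   THEN e.earn_end  ELSE u.use_end   END)
           - (CASE WHEN e.earn_start > u.use_start THEN e.earn_start ELSE u.use_start END)
       END), 0) AS Exp
FROM earn_runs e
LEFT JOIN use_runs u
       ON u.ID = e.ID
      AND u.use_start < e.earn_end
      AND u.use_end   > e.earn_start
GROUP BY e.ID, e.[year], e.earn
ORDER BY e.ID, e.[year];
On the sample data above this reproduces the combined dataset (for example, 80 burned and 20 expired against the year 2000 earnings).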

Approximate match within a sub-array

I have a table with the following:
Name Quota-Date Quota
Ami 5/1/2010 75000
Ami 1/1/2012 100000
Ami 6/1/2014 150000
John 8/1/2014 0
John 4/1/2015 50000
Rick 5/1/2011 100000
(Dates are shown in American format: m/d/yyyy.) "Quota-Date" is the first month in which the new "Quota" next to it is active.
E.g. Ami's quota is 75000 for each month between May 2010 and December 2011.
I need a formula that fetches the quota for a given person in a given month, i.e. the active quota of a person in every month. The formula is meant to calculate the third column of this table:
Name Month Quota
Ami 6/1/2010 75000
Ami 12/1/2011 75000
Ami 1/1/2012 100000
Ami 7/1/2014 150000
John 10/1/2014 0
John 4/1/2015 50000
I prefer not to maintain the first table sorted, but if it will make things significantly simpler, I would.
What would be the correct formula for "Quota" on the second table?
If your new data is in columns A-C and the original data is also in columns A-C of Sheet1, then enter this formula in C2 (in older Excel versions it must be confirmed as an array formula with Ctrl+Shift+Enter):
=SUMIFS(Sheet1!C:C,Sheet1!A:A,A2,Sheet1!B:B,MAX(IF((Sheet1!A:A=A2)*(Sheet1!B:B<=B2),Sheet1!B:B,"")))
This formula works well if you have only numbers in your third column; it would be more complicated to make it work on text too.
Thanks, Máté Juhász!
I just worked out another solution, not as an array formula, but I like your solution better: it is more elegant and I will use it!
My solution, for the record:
=INDEX(INDIRECT("Quota!$E$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers]) & ":$E$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers])+COUNTIF(PMQuotaTable[PM],[#PM])-1),MATCH([#Month],INDIRECT("Quota!$D$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers]) & ":$D$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers])+COUNTIF(PMQuotaTable[PM],[#PM])-1),1))
I'm running a regular index/match with match type = 1 to find the largest date row, but I construct the target range dynamically to scope only the rows of the current person (PM).
I identify the first row of this PM with this part:
MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers])
...and the last row for the PM by adding the number of rows that PM has in the table, retrieved using this:
MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers]) + COUNTIF(PMQuotaTable[PM],[#PM])-1
The dynamic range is then constructed using INDIRECT. So, the complete range is determined with this part (for the needed column to be retrieved eventually):
INDIRECT("Quota!$E$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers]) & ":$E$" & MATCH([#PM],PMQuotaTable[PM],0)+ROW(PMQuotaTable[#Headers])+COUNTIF(PMQuotaTable[PM],[#PM])-1)
Mor
