I am looking for a SQL Server implementation of FIFO. For customers of a large company, I have two datasets: one with the points earned in each year, and another with the points redeemed and expired in each year.
The target is to determine the points redeemed and expired out of the points earned in any given year. So, obviously, the redeemed and expired points have to be assigned to the earned points on a first-in-first-out basis.
The datasets look as follows:
Points earned:
ID year earn
1 2000 100
1 2001 150
1 2002 75
1 2003 10
1 2004 120
Points burned:
ID Year Type points
1 2001 burn 50
1 2001 exp 20
1 2003 burn 120
1 2004 exp 100
1 2006 exp 20
Combining the two datasets, we should get a dataset like this:
Combined dataset:
ID Year Earn Burn exp
1 2000 100 80 20
1 2001 150 90 60
1 2002 75 0 60
1 2003 10 0 0
1 2004 120 0 0
Simply put, the burned and expired points are assigned to points earned on a first-in-first-out basis. The dataset is huge and cannot be exported to Excel. Code to do the above on SQL Server would be of huge help.
I figured out the answer. I certainly do not have the most efficient code, but I used the following algorithm to assign points redeemed and expired using FIFO logic:
1. Calculate the maximum points that will be used by an individual. This includes both the points redeemed and the points expired. The target is to assign these points to the different earnings transactions.
2. Figure out the earnings transaction that crosses the maximum points used. Any earnings transaction after this one should not have any points assigned to it, so set redeemed and expired to 0 for all of those transactions. Now we only need to assign points to the remaining transactions.
3. For each remaining earnings transaction, figure out the transaction in the used table whose running total crosses the cumulative points earned up to that transaction. For this earnings transaction, the total expired points to assign is the total number of points expired up to the corresponding transaction in the used table, plus whatever is left over from that transaction, depending on whether it is a redeemed or an expired transaction. Points redeemed are assigned in the same way.
The algorithm worked perfectly. I could have done a better job with the coding; it took about 3 hours to run on a dataset with 150 million records.
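For reference, here is a minimal sketch of one set-based way to express the same FIFO assignment in SQL Server (this is not my exact code; the table names earned and used and the column names are assumptions based on the samples above). It converts the running totals of earned and used points into intervals and credits each used interval to the earned intervals it overlaps:

-- Running totals turn each earn/use row into an interval on the points number line.
WITH e AS (
    SELECT ID, year, earn,
           SUM(earn) OVER (PARTITION BY ID ORDER BY year
                           ROWS UNBOUNDED PRECEDING)        AS cum_to,
           SUM(earn) OVER (PARTITION BY ID ORDER BY year
                           ROWS UNBOUNDED PRECEDING) - earn AS cum_from
    FROM earned
),
u AS (
    SELECT ID, year, type, points,
           SUM(points) OVER (PARTITION BY ID ORDER BY year, type
                             ROWS UNBOUNDED PRECEDING)          AS cum_to,
           SUM(points) OVER (PARTITION BY ID ORDER BY year, type
                             ROWS UNBOUNDED PRECEDING) - points AS cum_from
    FROM used
)
-- Overlapping intervals give the portion of each used row that belongs to each earned row.
SELECT e.ID, e.year, e.earn,
       ISNULL(SUM(CASE WHEN u.type = 'burn' THEN
             CASE WHEN u.cum_to   < e.cum_to   THEN u.cum_to   ELSE e.cum_to   END
           - CASE WHEN u.cum_from > e.cum_from THEN u.cum_from ELSE e.cum_from END
       END), 0) AS burn,
       ISNULL(SUM(CASE WHEN u.type = 'exp' THEN
             CASE WHEN u.cum_to   < e.cum_to   THEN u.cum_to   ELSE e.cum_to   END
           - CASE WHEN u.cum_from > e.cum_from THEN u.cum_from ELSE e.cum_from END
       END), 0) AS expired
FROM e
LEFT JOIN u
  ON u.ID = e.ID
 AND u.cum_from < e.cum_to
 AND u.cum_to   > e.cum_from
GROUP BY e.ID, e.year, e.earn;

On the sample data above this reproduces the combined dataset (burn/expired of 80/20, 90/60, 0/60, 0/0, 0/0).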
Related
All,
I am looking for some tips on optimizing an ELT on a Snowflake fact table with approx. 15 billion rows.
We get approximately 30,000 rows every 35 minutes, like the sample below; we will always get 3 Account Dimension Key values, i.e. Sales, COGS & MKT.
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    Y
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
This is then PIVOTED based on Finance Key and Date Key and loaded into another table for reporting, like the one below
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT  IsCurrent
001          2019-01-01  001                   0012          100    300   200  Y
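For context, a conditional-aggregation pivot along these lines is one common way to produce that reporting table (the staging table name FACT_STAGE is an assumption; the columns follow the sample above):

-- Sketch of the pivot step over the current rows only.
SELECT  Finance_Key,
        Date_Key,
        Department_Group_Key,
        Scenario_Key,
        MAX(CASE WHEN Account_Key LIKE 'SALES%' THEN Value END) AS SALES,
        MAX(CASE WHEN Account_Key LIKE 'COGS%'  THEN Value END) AS COGS,
        MAX(CASE WHEN Account_Key LIKE 'MKT%'   THEN Value END) AS MKT,
        MAX(IsCurrent)                                          AS IsCurrent
FROM    FACT_STAGE
WHERE   IsCurrent = 'Y'
GROUP BY Finance_Key, Date_Key, Department_Group_Key, Scenario_Key;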
At times an adjustment is made for one Account Key, for example:
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value
001          2019-01-01  001                   0012          SALES_001    50
Hence we have to do this
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  Account_Key  Value  IsCurrent
001          2019-01-01  001                   0012          SALES_001    100    X
001          2019-01-01  001                   0012          COGS_001     300    Y
001          2019-01-01  001                   0012          MKT_001      200    Y
001          2019-01-01  001                   0012          SALES_001    50     Y
And the resulting value should be
Finance_Key  Date_Key    Department_Group_Key  Scenario_Key  SALES  COGS  MKT
001          2019-01-01  001                   0012          50     300   200
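Reading the example, the adjustment row supersedes the original SALES_001 row (the old row is flipped to IsCurrent = 'X'), so pivoting only the IsCurrent = 'Y' rows yields 50 / 300 / 200. If you would rather not maintain the flag, a sketch of a "latest row wins" selection in Snowflake (the LOAD_TS column is an assumption):

-- Keep only the most recent row per account before pivoting; QUALIFY is Snowflake syntax.
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, Account_Key, Value
FROM   FACT_STAGE
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY Finance_Key, Date_Key, Department_Group_Key,
                       Scenario_Key, Account_Key
          ORDER BY LOAD_TS DESC) = 1;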
However, my question is how to go about optimizing the query that scans and updates the pivoted table of approx. 15 billion rows in Snowflake.
This is more about optimizing the reads and writes.
Any pointers?
Thanks
So a longer form answer:
I am going to assume that the Date_Key values being so far in the past is just a feature of the example data, because if you have 15B rows, and every 35 minutes you have ~10K (30K / 3) updates to apply, and they span the whole date range of your data, then there is very little you can do to optimize it. Snowflake's Query Acceleration Service might help with the IO.
But the primary way to improve total processing time is to process less.
For example, in my old job we had IoT data that could include messages up to two weeks old, and we de-duplicated all messages on load (which is effectively a similar process) as part of our pipeline error handling. We found that handling batches with a minimum date of two weeks back against the full message tables (which also had billions of rows) spent most of the time reading/writing those tables. By altering the ingestion to sideline messages older than a couple of hours and deferring their processing until "midnight", we could get the processing of all the timely points in a batch done in under 5 minutes (we did a large amount of other processing in that interval) for every batch, and we could turn the warehouse cluster size down outside core data hours and use the saved credits to run a bigger instance at the midnight "catch-up" to bring all of the day's sidelined data on board. This eventually-consistent approach worked very well for our application.
So if your destination table is well clustered, so that the reads/writes touch just a fraction of the data, this should be less of a problem for you. But from the nature of the question I assume this is not the case.
So if the table's natural clustering is unaligned with the data being loaded, is that because the destination table needs to be a different shape for the critical reads it is handling? At that point you have to decide which is more cost/time sensitive. Another option is to have the destination table clustered by the date (again assuming the fresh data is a small window of time) and have a materialized view/table on top of that table with a different sort, so that the rewrite is being done for you by Snowflake. This is not a free lunch, but it might allow faster upsert times and faster read performance, assuming both are time sensitive.
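As a concrete illustration of that last suggestion (table and column names are assumptions), clustering the big table on the load date and keeping a differently-sorted materialized view for the readers might look like:

-- Cluster the pivoted fact table so each 35-minute batch touches few micro-partitions.
ALTER TABLE FACT_PIVOTED CLUSTER BY (Date_Key);

-- Let Snowflake maintain a reporting copy with a sort that suits the critical reads.
CREATE MATERIALIZED VIEW FACT_PIVOTED_REPORTING
  CLUSTER BY (Finance_Key)
AS
SELECT Finance_Key, Date_Key, Department_Group_Key, Scenario_Key, SALES, COGS, MKT
FROM   FACT_PIVOTED;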
I have been tasked with moving a process that pays people for training from an Excel spreadsheet to a SQL Server DB. I need to be able to track payments and the reason each was approved or denied. Example:
Payment Run Jan 1

Student  Class                   Amount  Reason for NonPayment
Mary     Introduction to Python  0       No W2
John     Introduction to Java    100

Payment Run Feb 1

Student  Class                   Amount  Reason for NonPayment
Mary     Introduction to Python  100
Now I know I should make three tables: one for student info, one for course info, and a linking table with payments. It's the payments table that has me stumped. I can do that for Jan 1, but how do I track the changes?
I want to be able to say "On the Jan run Mary did not get paid because she was missing her W2, but she was paid in Feb". For every payment run, I need to be able to track who got paid, the amount paid, and the reason for nonpayment (if present).
My bad. I was forgetting about the many-to-one relationship. I was thinking more about addresses, where you "retire" the address and only keep the new link active.
So instead of "retiring" the link to the payment table on every run, keep a "PaymentRunDate" field and have a key referencing the user.
Like this (given Mary's ID is 15 and John's ID is 5; dates are in European format):
UserId  ClassID  PaymentRunDate  amount_paid  Reason
15      1        01/01/2022      0            No W2
5       2        01/01/2022      100
15      1        01/02/2022      100
and let the front end worry about how this is presented to the user.
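A minimal sketch of that payments table in SQL Server (the table and constraint names are mine; StudentId and ClassId are assumed to reference the student and class tables):

-- One row per student, class and payment run; history is kept because every run
-- inserts new rows instead of updating old ones.
CREATE TABLE PaymentRun (
    StudentId        int            NOT NULL,   -- FK to the student table
    ClassId          int            NOT NULL,   -- FK to the class table
    PaymentRunDate   date           NOT NULL,
    AmountPaid       decimal(9, 2)  NOT NULL,
    NonPaymentReason varchar(200)   NULL,       -- NULL when the payment went through
    CONSTRAINT PK_PaymentRun PRIMARY KEY (StudentId, ClassId, PaymentRunDate)
);

Querying Mary's history is then just a filter on StudentId ordered by PaymentRunDate.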
I have unbalanced panel data and need to exclude observations (in t) for which the income changed during the year before (t-1), while keeping other observations of these people. Thus, if a change in income happens in year t, then year t should be dropped (for that person).
clear
input year id income
2003 513 1500
2003 517 1600
2003 518 1400
2004 513 1500
2004 517 1600
2004 518 1400
2005 517 1600
2005 513 1700
2005 518 1400
2006 513 1700
2006 517 1800
2006 518 1400
2007 513 1700
2007 517 1600
2007 518 1400
2008 513 1700
2008 517 1600
2008 518 1400
end
xtset id year
xtline income, overlay
To illustrate what's going on, I add an xtline plot, which follows the income per person over the years. ID=518 is the perfect non-changing case (keep all obs). ID=513 has a one-time jump (drop year 2005 for that person). ID=517 has something like a peak, perhaps a one-time measurement error (drop 2006 and 2007).
I think there should be some form of loop. Initialize the first value for each person (because this cannot be compared), say t0. Then compare t1-t0, drop if changed, else compare t2-t1, etc. Because the data is unbalanced, there might be missing year-observations. Thanks for any advice.
Update/Goal: The purpose is to prepare the data for a fixed-effects regression analysis. There is another variable, reported for the entire "last year". Income, however, is reported at the interview date (a point in time). I need to get close to something like "last year's income" to relate it to this variable. The procedure is suggested and followed by several publications; I am trying to replicate and understand it.
Solution:
bysort id (year) : drop if income != income[_n-1] & _n > 1    // drop year t when income differs from year t-1 within the same id
bysort id (year) : gen byte flag = (income != income[_n-1]) if _n > 1    // same condition as a flag instead of a drop
list, sepby(id)
The procedure is VERY IFFY methodologically. There is no need to prepare for the fixed effects analysis other than xtsetting the data; and there rarely is any excuse to create missing data... let alone do so to squeeze the data into the limits of what (other) researchers know about statistics and econometrics. I understand that this is a replication study, but whatever you do with your replication and wherever you present it, you need to point out that the original authors did not have much clue about regression to begin with. Don't try too hard to understand it.
I want to create a database to store process cycle time data. For example:
Say a particular process for a certain product, say welding, theoretically takes about 10 seconds (the process cycle time). Due to various issues, the machine's actual cycle time varies throughout the day. I would like to store the machine's actual cycle time throughout the day and analyze it over time (days, weeks, months). How would I go about designing the database for this?
I considered using a time-series database, but I figured it isn't suitable - cycle time data has a start time and an end time - basically I'm measuring time performance over time, if that even makes sense. At the same time, I was worried that using a relational database to store and then display/analyze time-related data is inefficient.
Any thoughts on a good database structure would be greatly appreciated. Let me know if any more info is needed and I will gladly edit this question.
You are tracking the occurrence of an event. The event (weld) starts at some time and ends at some time. It might be tempting to model the event entity like so:
StationID StartTime StopTime
with each welding station having a unique identifier. Some sample data might look like this:
17 08:00:00 09:00:00
17 09:00:00 10:00:00
For simplicity, I've set the times to large values (1 hour each) and removed the date values. This tells you that welding station #17 started a weld at 8am and finished at 9am, at which time the second weld started which finished at 10am.
This seems simple enough. Notice, however, that the StopTime of the first entry matches the StartTime of the second entry. Of course it does, the end of one weld signals the start of the next weld. That's how the system was designed.
But this sets up what I call the Row Spanning Dependency antipattern: where the value of one field of a row must be synchronized with the value of another field in a different row.
This can create any number of problems. For example, what if the StartTime of the second entry showed '09:15:00'? Now we have a 15-minute gap between the end of the first weld and the start of the next. The system does not allow for gaps -- the end of each weld also starts the next weld. How should we interpret this gap? Is the StopTime of the first row wrong? Is the StartTime of the second row wrong? Are both wrong? Or was there another row between them that was somehow deleted? There is no way to tell which is the correct interpretation.
What if the StartTime of the second entry showed '08:45'? This is an overlap where the start of the second cycle supposedly started before the first cycle ended. Again, we can't know which row contains the erroneous data.
A row spanning dependency allows for gaps and overlaps, neither of which is allowed in the data. A large amount of database and application code would be required to prevent such a situation from ever occurring, and when it does occur (as assuredly it will) there is no way to determine which data is correct and which is wrong -- not from within the database, that is.
An easy solution is to do away with the StopTime field altogether:
StationID StartTime
17 08:00:00
17 09:00:00
Each entry signals the start of a weld. The end of the weld is indicated by the start of the next weld. This simplifies the data model, makes it impossible to have a gap or overlap, and more precisely matches the system we are modeling.
But we need the data from two rows to determine the length of a weld.
select w1.StartTime, w2.StartTime as StopTime
from Welds w1
join Welds w2
  on w2.StationID = w1.StationID
 and w1.StartTime = (
       select Max( StartTime )
       from Welds
       where StationID = w2.StationID
         and StartTime < w2.StartTime );
This may seem like a more complicated query than if the start and stop times were in the same row -- and, well, it is -- but think of all that checking code that no longer has to be written and executed at every DML operation. And since the combination of StationID and StartTime would be the obvious PK, the query would use only indexed data.
There is one addition to suggest. What about the first weld of the day or after a break (like lunch), and the last weld of the day or before a break? We must make an effort not to include the break time as a cycle time. We could build the intelligence to detect such a situation into the query, but that would increase the complexity even more.
Another way would be to include a status value in the record.
StationID StartTime Status
17 08:00:00 C
17 09:00:00 C
17 10:00:00 C
17 11:00:00 C
17 12:00:00 B
17 13:00:00 C
17 14:00:00 C
17 15:00:00 C
17 16:00:00 C
17 17:00:00 B
So the first few entries represent the start of a cycle, whereas the entries for noon and 5pm represent the start of a break. Now we just need to append the line
where w1.Status = 'C'
to the end of the query above. Thus the 'B' entries supply the end times of the previous cycle but do not start another cycle.
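For what it's worth, on an engine with window functions a LEAD()-based query can replace the self-join; this is just a sketch over the same assumed Welds(StationID, StartTime, Status) table:

-- Each row's stop time is simply the next start time at the same station.
select StationID, StartTime, StopTime
from (
    select StationID, StartTime, Status,
           lead(StartTime) over (partition by StationID order by StartTime) as StopTime
    from Welds
) t
where Status = 'C'          -- only 'C' rows start a cycle; 'B' rows merely close the previous one
  and StopTime is not null; -- the most recent row has no following row yet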
So I am kind of stuck on this query I need to do; I have no idea what it even means.
I need to be able to find a vehicle number from a table entitled Vehicle_Details and check whether it is currently in use for the period of time I would like to use it, as described below.
When a new trip is being arranged, the administrator has to find a vehicle that is not already in use for the trip duration. Because this query will be needed frequently, it is important that it can be run easily for an arbitrary start and end date. It should therefore use substitution variables so that the trip start and end dates can be provided at run time.
Make sure that when it is run the user is only prompted to supply the start and end dates once. Also make sure that any vehicles displayed are available for the entire period specified - you will have to include more than one test in the where clause
Any code would be helpful, but even links to things that could help me write it myself would be great, as I have no idea, to be honest.
Example data from the three tables:
Trip_ID Departure Return_Date Duration Registration
73180 07-FEB-12 08-FEB-12 1 PY09 XRH
73181 07-FEB-12 08-FEB-12 1 PY10 OPM
73182 07-FEB-12 10-FEB-12 3 PY56 BZT
73183 07-FEB-12 08-FEB-12 1 PY56 BZU
73184 07-FEB-12 09-FEB-12 2 PY58 UHF
Registration Make Model Year
4585 AW ALBION RIEVER 1963
SDU 567M ATKINSON N/A 1974
P525 CAO DAF FT85.400 1996
PY55 CGO DAF FTGCF85.430 2005
PY06 BYP DAF FTGCF85.430 2006
Weight Registration Body Vehicle ID
20321 4585 AW N/A 1
32520 SDU 567M N/A 2
40000 P525 CAO N/A 3
40000 PY55 CGO N/A 4
40000 PY06 BYP N/A 5
You need to find if a particular date range intersects with any reservation. This is interval intersection arithmetic. Consider the following intervals [A,B] and [x,y]:
-----------[xxxxxxxxxxx]-------------
           A           B
---------------------[xxxxxx]--------
                     x      y
An interval [x,y] will intersect with [A,B] if and only if:
B >= x
And A <= y
So your query will look like:
SELECT *
FROM registrations reg
WHERE reg.registration = :searched_vehicle
AND NOT EXISTS (SELECT NULL
FROM reservations res
WHERE res.registration = reg.registration
AND res.return_date >= :interval_start
AND res.departure <= :interval_end)
This is for one vehicle. If the query returns a row, this vehicle is available for the given interval [:interval_start, :interval_end].
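To list every free vehicle while prompting for the dates only once, the same test can be wrapped with SQL*Plus double-ampersand substitution variables; a sketch assuming the trips table is named Trips (Vehicle_Details comes from the question):

-- && makes SQL*Plus ask for each value a single time and reuse it.
SELECT v.*
FROM   Vehicle_Details v
WHERE  NOT EXISTS (SELECT NULL
                   FROM   Trips t
                   WHERE  t.Registration = v.Registration
                   AND    t.Return_Date >= TO_DATE('&&trip_start', 'DD-MON-YY')
                   AND    t.Departure   <= TO_DATE('&&trip_end',   'DD-MON-YY'));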