T-SQL rolling twelve month per day performance - sql-server

I have checked similar problems, but none have worked well for me. The most useful was http://forums.asp.net/t/1170815.aspx/1, but the performance makes my query run for hours and hours.
I have 1.5 million records based on product sales (about 10k products) over 4 years. I want to have a table that contains date, product, and the rolling twelve-month sales.
This query (from the link above) works and shows what I want, but the performance makes it useless:
select day_key,
       product_key,
       price,
       (select sum(price) as R12
        from #ORDER_TURNOVER as tb1
        where tb1.day_key <= a.day_key
          and tb1.day_key > dateadd(mm, -12, a.day_key)
          and tb1.product_key = a.product_key) as RSum
into #hejsan
from #ORDER_TURNOVER as a
I tried a rolling-sum cursor function over all records, which was lightning fast, but I couldn't get it to sum only the sales from the last 365 days.
Any ideas on how to solve this problem are much appreciated.
Thank you.

I'd change your setup slightly.
First, have a table that lists all the product keys that are of interest...
CREATE TABLE product (
    product_key INT NOT NULL,
    price INT,
    some_fact_data VARCHAR(MAX),
    what_ever_else SOMEDATATYPE,
    PRIMARY KEY CLUSTERED (product_key)
)
Then, I'd have a calendar table, with each individual date that you could ever need to report on...
CREATE TABLE calendar (
    date SMALLDATETIME,
    is_bank_holiday INT,
    what_ever_else SOMEDATATYPE,
    PRIMARY KEY CLUSTERED (date)
)
Finally, I'd ensure that your data table has a covering index on all the relevant fields...
CREATE INDEX IX_product_day ON #ORDER_TURNOVER (product_key, day_key)
This would then allow the following query...
SELECT
    product.product_key,
    product.price,
    calendar.date,
    SUM(data.price) AS RSum
FROM
    product
CROSS JOIN
    calendar
INNER JOIN
    #ORDER_TURNOVER AS data
        ON  data.product_key = product.product_key
        AND data.day_key > dateadd(mm, -12, calendar.date)
        AND data.day_key <= calendar.date
GROUP BY
    product.product_key,
    product.price,
    calendar.date
By doing everything in this way, each product/calendar_date combination will then relate to a set of records in your data table that are all consecutive to each other. This will make the act of looking up the data to be aggregated much, much simpler for the optimiser.
This requires a single index, specifically in the order (product, date).
If you have the index the other way around, it is actually much harder...
Example data:
index (product, date)          index (date, product)

 product | date                 date       | product
---------+-------------        ------------+---------
 A       | 01/01/2012           01/01/2012 | A
 A       | 02/01/2012           01/01/2012 | B
 A       | 03/01/2012           02/01/2012 | A
 B       | 01/01/2012           02/01/2012 | B
 B       | 02/01/2012           03/01/2012 | A
 B       | 03/01/2012           03/01/2012 | B
On the left you just get all the records that are next to each other in a 365-day block.
On the right you have to search for each record before you can aggregate. Each search is relatively simple, but you do 365 of them, which is much more work than the version on the left.

This is how one does "running totals" / "sums over subsets" in SQL Server 2005-2008. SQL Server 2012 has native support for running totals, but many of us are still working with 2005-2008 databases.
SELECT day_key ,
product_key ,
price ,
( SELECT SUM(price) AS R12
FROM #ORDER_TURNOVER AS tb1
WHERE tb1.day_key <= a.day_key
AND tb1.day_key > DATEADD(mm, -12, a.day_key)
AND tb1.product_key = a.product_key
) AS RSum
INTO #hejsan
FROM #ORDER_TURNOVER AS a
A few suggestions.
You could pre-calculate the running totals so that they are not calculated again and again. What you are doing in the above select is a disguised loop, not a set-based query (unless the optimizer can convert the correlated subquery to a join). That solution requires a few changes to the code.
Another solution that you can certainly try is to create a clustered index on your #ORDER_TURNOVER temp table. This is safer because it's a local change.
CREATE CLUSTERED INDEX IndexName
ON #ORDER_TURNOVER (day_key, product_key)
All three expressions in your WHERE clause are SARGs, so chances are good that the optimizer will now do a seek instead of a scan.
If the index solution does not give enough performance gains, then it's well worth investing in solution 1.
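For reference, on SQL Server 2012 and later the native windowed aggregate mentioned above can express a per-product running total directly. A minimal sketch (note this gives an all-time cumulative total, not the 12-month rolling sum itself; restricting it to a strict 12-month window still needs extra work, such as the calendar-table join in the other answer):
-- Sketch only: cumulative running total per product (SQL Server 2012+).
SELECT day_key,
       product_key,
       price,
       SUM(price) OVER (PARTITION BY product_key
                        ORDER BY day_key
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM #ORDER_TURNOVER;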

Related

Cassandra best practice to ORDER BY using PRIMARY KEY

Originally I had a Cassandra table like this:
CREATE TABLE table (
    open_time timestamp,
    open double,
    close double,
    high double,
    low double,
    volume bigint,
    PRIMARY KEY(open_time));
open_time | close | high | low | open | volume
---------------------------------+--------+--------+-------+--------+--------
2020-08-05 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
2020-08-04 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
I need to perform a query to get the latest open_time. After noticing that queries like
SELECT open_time FROM table ORDER BY open_time DESC LIMIT 1;
are not allowed, I wonder what's the best practice here.
My idea is to add an id column so that I can use open_time as the clustering column. Something like:
CREATE TABLE table (
    id int,
    open_time timestamp,
    open double,
    close double,
    high double,
    low double,
    volume bigint,
    PRIMARY KEY(id, open_time)
)
WITH CLUSTERING ORDER BY (open_time DESC);
Is this a valid solution to get the job done, or are there better ways, e.g. something without an extra id column, since I would never query by the id itself?
Most queries would be something like:
SELECT * FROM table WHERE open_time >= '2013-01-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Thanks!
CLUSTERING ORDER enforces the on-disk sort order within each partition. So ordering by the same key that you're partitioning on isn't possible. Partitioning by id will face a similar challenge, in that the CLUSTERING ORDER BY open_time will only be enforced within each id.
I wonder what's the best practice here.
Models like these are usually solved by time bucketing, as I mentioned in an answer to a similar question earlier today. To select the best "bucket," you'll need to understand your business case (such as the number of entries per day) as well as the query requirements.
For the sake of example, let's say that a month would work best. If each row contained a 'YEAR-MONTH' value, the PK definition would look like this:
PRIMARY KEY (month_bucket,open_time))
WITH CLUSTERING ORDER BY (open_time DESC);
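For completeness, a sketch of what the full bucketed table could look like (the table name and the text type for month_bucket are assumptions, following the pattern above):
CREATE TABLE ohlc_by_month (
    month_bucket text,        -- e.g. '2020-08'
    open_time timestamp,
    open double,
    close double,
    high double,
    low double,
    volume bigint,
    PRIMARY KEY (month_bucket, open_time)
) WITH CLUSTERING ORDER BY (open_time DESC);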
Then, you could support a query like this:
SELECT * FROM table
WHERE month_bucket = '2013-08'
AND open_time >= '2013-08-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Likewise, querying the most recent entry would only require the most recent (current?) month as a parameter:
SELECT * FROM table
WHERE month_bucket = '2020-08'
LIMIT 1;
As the results are stored within each month_bucket sorted by open_time in descending order, that query would return the most-recent entry.
I wrote an article on this for DataStax (several years ago) which is relevant to this problem. It's been moved to a new part of their site, which hosed the formatting, but the content is definitely there. Give it a read; hope it helps: We Shall Have Order!
If id is part of the primary key, it must be included in the WHERE clause; otherwise the query would require ALLOW FILTERING.
You can try querying with "SELECT max(open_time) ....". Otherwise you can use id as above, incremented with every record, so that the row with the highest id will always be the latest record.
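As a sketch, assuming the single-partition schema proposed in the question (PRIMARY KEY (id, open_time), with every row under one synthetic id), that max query would look something like:
-- Sketch only: "ohlc" stands in for the table name, and id = 1 for the single synthetic partition.
SELECT max(open_time) FROM ohlc WHERE id = 1;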

T-SQL recursion, date shifting based on previous iteration

I have a data set that includes a customer, payment date, and the number of days they have paid for. I need to calculate the coverage start/end dates that each payment covers. This is difficult when a payment is made before the current coverage period ends.
The best way I've come up with to think about this is a month-to-month cell phone plan where the customer may pay for a specified number of days at any point during a given month. The next covered period should always start the day after the previous covered period expires.
Here is the code sample using a temp table.
CREATE TABLE #Payments
(Customer_ID INTEGER,
Payment_Date DATE,
Days_Paid INTEGER);
INSERT INTO #Payments
VALUES (1,'2018-01-01',30);
INSERT INTO #Payments
VALUES (1,'2018-01-29',20);
INSERT INTO #Payments
VALUES (1,'2018-02-15',30);
INSERT INTO #Payments
VALUES (1,'2018-04-01',30);
I need to get the coverage start/end dates back.
The initial payment is made on 2018-01-01 and they paid for 30 days. That means they are covered until 2018-01-30 (Payment_Date + Days_Paid - 1, since the payment date is included as a covered day). However, they made their next payment on 2018-01-29, so I need to calculate the start date of the next coverage window, which in this case would be the previous Payment_Date + previous Days_Paid. In this case, coverage window 2 starts on 2018-01-31 and extends through 2018-02-19, since they only paid for 20 days on Payment_Date 2018-01-29.
The expected output is:
Customer_ID | Payment_Date | Days_Paid | Coverage_Start_Date | Coverage_End_Date
------------|--------------|-----------|---------------------|------------------
          1 | '2018-01-01' |        30 | '2018-01-01'        | '2018-01-30'
          1 | '2018-01-29' |        20 | '2018-01-31'        | '2018-02-19'
          1 | '2018-02-15' |        30 | '2018-02-20'        | '2018-03-21'
          1 | '2018-04-01' |        30 | '2018-04-01'        | '2018-04-30'
Because the current record's coverage start date depends on the previous record's coverage end date, I feel like this would be a good candidate for recursion, but I can't figure out how to do it.
I have a way to do this in a while loop, but I would like to complete it using a recursive CTE. I have also thought about simply adding up the Days_Paid and adding that to the first payment's start date, however this only works if a payment is made before the previous coverage has expired. In addition, I need to calculate the coverage start/end dates for each Payment_Date.
Finally, using the LAG/LEAD functions doesn't appear to work, because they only consider the current value of the previous record, not the result of the previous iteration. Using LAG/LEAD, you get the correct answer for the 2nd payment record, but not the third.
Is there a way to do this with a recursive CTE?
NOTE: This is not a recursive solution, but it is set-based vs. your loop solution.
While trying to solve this recursively it hit me that this is essentially a "running totals" problem, and can be easily solved with window functions.
WITH runningTotal AS
(
    SELECT p.*,
           SUM(Days_Paid) OVER(ORDER BY p.Payment_Date) AS runningTotalDays,
           MIN(Payment_Date) OVER(ORDER BY p.Payment_Date) AS startDate
    FROM #Payments p
)
SELECT r.Customer_ID,
       r.Payment_Date,
       Days_Paid,
       COALESCE(DATEADD(DAY, LAG(runningTotalDays) OVER(ORDER BY r.Payment_Date), startDate), startDate) AS Coverage_Start_Date,
       DATEADD(DAY, runningTotalDays - 1, startDate) AS Coverage_End_Date
FROM runningTotal r
Each end date is the "running total" of all the previous Days_Paid added together. Using LAG to get the previous record's end date + 1 gets you the start date. The COALESCE is to handle the first record. For more than a single customer, you can PARTITION BY Customer_ID.
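As a sketch, that partitioned version might look like this (the same query with PARTITION BY Customer_ID added to each window function):
WITH runningTotal AS
(
    SELECT p.*,
           SUM(Days_Paid) OVER(PARTITION BY p.Customer_ID ORDER BY p.Payment_Date) AS runningTotalDays,
           MIN(Payment_Date) OVER(PARTITION BY p.Customer_ID ORDER BY p.Payment_Date) AS startDate
    FROM #Payments p
)
SELECT r.Customer_ID,
       r.Payment_Date,
       Days_Paid,
       COALESCE(DATEADD(DAY, LAG(runningTotalDays) OVER(PARTITION BY r.Customer_ID ORDER BY r.Payment_Date), startDate), startDate) AS Coverage_Start_Date,
       DATEADD(DAY, runningTotalDays - 1, startDate) AS Coverage_End_Date
FROM runningTotal r;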
So of course, right after posting this I came across a similar question that was already answered.
Here's the link: Recursively retrieve LAG() value of previous record
Based on that solution, I was able to construct the following solution to my own question.
The key here was adding the "prep_data" CTE which made the recursion problem much easier.
;WITH prep_data AS
(SELECT Customer_ID,
ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY Payment_Date) AS payment_seq_num,
Payment_Date,
Days_Paid,
Payment_Date as Coverage_Start_Date,
DATEADD(DAY,Days_Paid-1,Payment_Date) AS Coverage_End_Date
FROM #Payments),
recursion AS
(SELECT Customer_ID,
payment_seq_num,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM prep_data
WHERE payment_seq_num = 1
UNION ALL
SELECT r.Customer_ID,
p.payment_seq_num,
p.Payment_Date,
p.Days_Paid,
CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END AS Coverage_Start_Date,
DATEADD(DAY,p.Days_Paid-1,CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END) AS Coverage_End_Date
FROM recursion r
JOIN prep_data p ON r.Customer_ID = p.Customer_ID AND r.payment_seq_num + 1 = p.payment_seq_num
)
SELECT Customer_ID,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM recursion
ORDER BY payment_seq_num;

PostgreSQL - Filter column 2 results based on column 1

Forgive a novice question. I am new to postgresql.
I have a database full of transactional information. My goal is to iterate through each day since the first transaction, and show how many unique users made a purchase on that day, or in the 30 days previous to that day.
So the # of unique users on 02/01/2016 should show all unique users from 01/01/2016 through 02/01/2016. The # of unique users on 02/02/2016 should show all unique users from 01/02/2016 through 02/02/2016.
Here is a fiddle with some sample data: http://sqlfiddle.com/#!15/b3d90/1
The result should be something like this:
December 17 2014 -- 1
December 18 2014 -- 2
December 19 2014 -- 3
...
January 13 2015 -- 16
January 19 2015 -- 15
January 20 2015 -- 15
...
The best I've come up with is the following:
SELECT
to_char(S.created, 'YYYY-MM-DD') AS my_day,
COUNT(DISTINCT
CASE
WHEN S.created > S.created - INTERVAL '30 days'
THEN S.user_id
END)
FROM
transactions S
GROUP BY my_day
ORDER BY my_day;
As you can see, I have no idea how I could reference what exists in column one in order to specify what date range should be included in the filter.
Any help would be much appreciated!
I think if you do a self-join, it would give you the results you seek:
select
t1.created,
count (distinct t2.user_id)
from
transactions t1
join transactions t2 on
t2.created between t1.created - interval '30 days' and t1.created
group by
t1.created
order by
t1.created
That said, I think this is going to do a form of Cartesian join in the background, so for large datasets I doubt it's very efficient. If you run into huge performance problems, there are ways to make this a lot faster... but before you address that, find out if you need to.
-- EDIT 8/20/16 --
In response to your issue with the performance of this... yes, it's a pig. I admit it. I encountered a similar issue here:
PostgreSQL Joining Between Two Values
The same concept for your example is this:
with xtrans as (
select created, created + generate_series(0, 30) as create_range, user_id
from transactions
)
select
t1.created,
count (distinct t2.user_id)
from
transactions t1
join xtrans t2 on
t2.create_range = t1.created
group by
t1.created
order by
t1.created
It's not as easy to follow, but it should yield identical results, only it will be significantly faster because it's not doing the "glorified cross join."

Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
------------|-----
a | 1
a | 2
a | 3
a | 4
b | 2
b | 3
c | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
------|------|------
1 | 2 | 1
1 | 3 | 1
1 | 4 | 1
2 | 3 | 2
2 | 4 | 1
3 | 4 | 1
The idea is to store a denormalised table that, for a given sku, lets one look up which other skus have been related to the same sessions, and how many times both skus were related to the same session.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or with a faceting search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. I'm also looking into the potential of using a graph database of some sort, but I wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations), with around 2-3k variations of sku, in about 5 minutes on my iMac. Realistically I will be denormalising only up to 10k rows per day. The question is whether this is actually the best approach; it seems a little dirty to me.
BEGIN;
CREATE
TEMPORARY TABLE working_tags(tag_id int, session_id varchar, sku varchar) ON COMMIT DROP;
INSERT INTO working_tags
SELECT id,
session_id,
sku
FROM tags
WHERE time < now() - interval '12 hours'
AND processed_product_relation IS NULL
AND sku IS NOT NULL LIMIT 200000;
CREATE
TEMPORARY TABLE working_relations (sku1 varchar, sku2 varchar, score int) ON COMMIT DROP;
INSERT INTO working_relations
SELECT a.sku AS sku1,
b.sku AS sku2,
count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b ON a.session_id = b.session_id
AND a.sku < b.sku
WHERE a.sku IS NOT NULL
AND b.sku IS NOT NULL
GROUP BY a.sku,
b.sku;
UPDATE product_relations
SET score = working_relations.score+product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2;
INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
working_relations.sku2,
working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations ON (working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;
UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN
(SELECT tag_id
FROM working_tags);
COMMIT;
If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
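For example, these can be set per session before running the aggregation; the values below are purely illustrative and depend on your hardware:
-- Illustrative session-level settings; tune to your own machine.
SET work_mem = '256MB';
SET effective_cache_size = '4GB';
SET random_page_cost = 1.5;  -- closer to seq_page_cost when data is mostly cached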
If you're creating a materialized view, just CREATE UNLOGGED TABLE whatever AS SELECT ... . That way you minimise the number of writes/rewrites/overwrites.
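A sketch of that pattern, reusing the pairing query above (the table name is just a placeholder):
-- Unlogged tables skip WAL, so they are faster to (re)build,
-- but they are not crash-safe and must be rebuilt after a crash.
CREATE UNLOGGED TABLE product_relations_snapshot AS
SELECT s1.sku AS sku1,
       s2.sku AS sku2,
       count(session_id) AS score
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku;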

SQL Query to determine VAT rate

I'm looking to create a 3 column VAT_Parameter table with the following columns:
VATID, VATRate, EffectiveDate
However, I can't get my head around how I would identify which VAT rate applies to a given invoice date.
For example, if the table was populated with:
1, 17.5, 1/4/1991
2, 15, 1/1/2009
3, 20, 4/1/2011
Say, for example, I have an invoice dated 4/5/2010; how would an SQL query select the correct VAT rate for that date?
select top 1 *
from VatRate
where EffectiveDate<=#InvoiceDate
order by EffectiveDate desc
Or, with a table of invoices
select id, invoicedate, rate
from
(
select
inv.id, inv.invoicedate, vatrate.rate, ROW_NUMBER() over (partition by inv.id order by vatrate.effectivedate desc) rn
from inv
inner join vatrate
on inv.invoicedate>=vatrate.effectivedate
) v
where rn = 1
PS. The rules for the rate of VAT to be charged when the rate changes are more complicated than just the invoice date. For example, the date of supply also matters.
I've run into this kind of thing before. There are two choices I can think of:
1. Expand the table to have two dates: EffectiveFrom and EffectiveTo. (You'll have to have a convention about whether each of these is exclusive or inclusive - but that's always a problem when using dates). This raises the problem of validating that the table population, as a whole, makes sense. e.g. that you don't end up with one row with Rate1 effective from 1/1/2000-1/1/2002, and another (overlapping) with Rate2 effective from 30/10/2001-1/1/2003. Or an uncovered gap in time, where no rate applies. Since this sounds like a very slowly-changing table, populated occasionally (by people who know what they're doing?), this could be the best solution. The SQL to get the effective rate would then be simple:
SELECT VATRate FROM VATTable WHERE (EffectiveFrom<=[YourInvoiceDate]) AND (EffectiveTo>=[YourInvoiceDate])
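A minimal sketch of the expanded table for option 1 (the column types here are assumptions):
CREATE TABLE VATTable (
    VATID INT PRIMARY KEY,
    VATRate DECIMAL(5,2) NOT NULL,
    EffectiveFrom DATE NOT NULL,  -- inclusive
    EffectiveTo   DATE NOT NULL   -- inclusive, to match the >= comparison above
)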
or
2. Use your existing table structure, and use some slightly more complicated SQL to determine the effective rate for an invoice.
Using your existing structure, something like this would work:
SELECT VATTable.VATRate FROM
VATTable
INNER JOIN
(SELECT Max(EffectiveDate) AS LatestDate FROM VATTable WHERE EffectiveDate<=
YourInvoiceDate) latest
ON VATTable.EffectiveDate=latest.LatestDate
An easier option may just be to denormalise your data structure and store the VAT rate in the invoice table itself.
