Workaround on Sliding Window Function in Snowflake - snowflake-cloud-data-platform

I've stumbled upon a problem that is giving me huge headaches, which is the following:
I have a table Deals, that contains information about this entity from our Sales CRM. I also have a table Company, that contains information about the companies pegged to those deals.
I was asked to compute a metric called Pipeline Conversion Rate, which is calculated as:
Won deals / Created Deals
Until here, everything is quite clear. Nevertheless, when computing this metric I was asked to do so in a sliding-window-function-fashion, which means to compute the metric only looking at the prior 90 days. Thing is that to look at the last 90 days of the numerator, we need to use one Date (created date); while when looking at the prior 90 days of the denominator, we should take into account the closed date (both dimensions are part of the Deals table).
There wouldn't be any problem if we could do this kind of window functions in Snowflake, as the following (I know syntax may not be exactly this one, but you get the idea):
count(deal_id) over (
partition by is_inbound, sales_agent, sales_tier, country
order by created_date range between 90 days preceding and current row
) as created_deals_last_90_days,
count(case when is_deal_won then deal_id end) over (
partition by is_inbound, sales_agent, sales_tier, country
order by created_date range between 90 days preceding and current row
) as won_deals_last_90_days
But we can't as far as I know. So my current workaround is the following (taken from this post):
select
calendar_date,
is_inbound,
sales_tier,
sales_agent,
country,
(
select count(deal_id)
from deals
where d.is_inbound = is_inbound
and d.sales_tier = sales_tier
and d.sales_agent = sales_agent
and d.country = country
and created_date between cal.calendar_date - 90 and cal.calendar_date
) as created_deals_last_90_days,
(
select count(case when is_deal_won then deal_id end)
from deals
where d.is_inbound = is_inbound
and d.sales_tier = sales_tier
and d.sales_agent = sales_agent
and d.country = country
and closed_date between cal.calendar_date - 90 and cal.calendar_date
) as won_deals_last_90_days
from calendar as cal
left join deals as d on cal.calendar_date between d.created_date and d.closed_date
*Note that I am using a calendar table here as base table, in order to have visibility on all calendar dates since without it I might say I'd be missing on those dates where there are no new deals (could happen on weekends).
Problem is that I am not getting correct figures when I cross check the raw data and the output of this query, and I have no idea how to make this (ugly) workaround, well... work.
Any ideas are more than welcome!

Well, it turns out it was way easier than I expected. After some trial-and-error, I figured out the only thing that could be failing was the JOIN condition in the outer query:
on cal.calendar_date between d.created_date and d.closed_date
This was assuming that both dates needed to be in the range, while this assumption is wrong. By tweaking the above mentioned part of the outer query to:
on cal.calendar_date >= d.created_date
It captures all those Deals that were created on or before the calendar_date, and therefore all of them since it is a mandatory field.
Maintaining the rest of the query as is, and assuming that there will be no nulls in any of the partitions, the results are the ones I expected.

Related

SQL Server - deducting from available credit

I have invoicing solution that uses Azure SQL to store and calculate invoice data. I have been requested to provide 'credit' functionality so rather than recovering customers charges, the totals are deducted from an amount of available credit and reflected in the invoice (solution xyz may have 1500 worth of charges, but deducted from available credit of 10,000 means its effectively zero'd and leaves 8,500 credit remaining ). Unfortunately after several days I haven't been able to work out how to do this.
I am able to get a list of items and their costs from sql easily:
invoice_id
contact_id
solution_id
total
date
202104-015
52
10000
30317.27
2021-05-22
202104-015
52
10001
2399.90
2021-05-22
202104-015
52
10005
8302.27
2021-05-22
202104-015
52
10060
3625.22
2021-05-22
202104-015
52
10111
22.87
2021-05-22
202104-015
52
10115
435.99
2021-05-22
I have another table that shows the credit available for the given contact:
id
credit_id
owner_id
total_applied
date_applied
1
C00001
52
500000.00
2021-05-14
I have tried using the following SQL statement, based on another stackoverflow question to subtract from the previous row, thinking each row would then reflect the remaining credit:
Select
invoice_id,
solution_id
sum(total) as 'total',
cr.total_remaining - coalesce(lag(total)) over (order by s.solution_id), 0) as credit_available,
date
from
invoices
left join credits cr on
cr.credit_id = 'C00001'
Whilst this does subtract, it only subtracts from the row above it, not all of the rows above it:
invoice_id
solution_id
total
credit_available
date
202104-015
10000
30317.27
500000.00
2021-05-22
202104-015
10001
2399.90
469682.73
2021-05-22
202104-015
10005
8302.27
497600.10
2021-05-22
202104-015
10060
3625.22
491697.73
2021-05-22
202104-015
10111
22.87
496374.78
2021-05-22
202104-015
10115
435.99
499977.13
2021-05-22
I've also tried various queries with a mess of case statements.
Im at the point where I am contemplating using powershell or similar to do the task instead (loop through each solution, check if there is enough available credit, update a deduction table, goto next etc) but I'd rather keep it all in SQL if I can.
Anyone have some pointers for this beginner?
You don't need to use window functions, use a sub-query that sums the total of previous invoices. But be sure to use index the table correctly so that performance is not a problem.
There are two sub-queries, one for the previous total sum and another to get the date of the next credit for contact_id.
SELECT [inv].[invoice_id],
[inv].[solution_id],
[inv].[total],
-- subquery that sums the previous totals
[cr].[total_applied] - COALESCE((
SELECT SUM([inv_inner].[total])
FROM [dbo].[invoices] AS [inv_inner]
WHERE [inv_inner].[solution_id] < [inv].[solution_id]
), 0) AS [credit_available],
[inv].[date]
FROM [dbo].[invoices] [inv]
LEFT JOIN [dbo].[credits] [cr]
ON [cr].[owner_id] = [inv].[contact_id]
-- here, we make sure that the credit is available for the correct period
-- invoice date >= credit date_applied
AND [inv].[date] >= [cr].[date_applied]
-- and invoice date < next date_applied or tomorrow, in case there are no next date_applied
AND [inv].[date] < COALESCE((
SELECT MIN([cr2].[date_applied])
FROM [dbo].[credits] [cr2]
WHERE [cr2].[owner_id] = [cr].[owner_id]
AND [cr2].[date_applied] > [cr].[date_applied]
), GETDATE()+1)
AND [cr].[credit_id] = 'C00001';
This query works, but it is for this question only. Please study it and adapt to your real world problem.
This is a pretty complex scenario. I sadly cannot spend the time to offer a complete solution here. I do can provide you with tips and points of attention here:
Be sure to determine the actual remaining credit based on the complete invoice history. If you introduce filtering (in a WHERE-clause, for example, or by including joins with other tables), the results should not be affected by it. You should probably pre-calculate the available credit per invoice detail record in a temporary table or in a CTE and use that data in your main query.
Make sure that you regard the date_applied value of the credit. Before a credit is applied to a customer, that customer should probably have less credit or no credit at all. That should be reflected correctly on historical invoices, I guess.
Make sure you determine the correct amount of total credit. It is unclear from the information provided in your question how that should be determined/calculated. Is only the latest total_applied value from the credits table active? Or should all the historical total_applied values be summarized to get the total available credit?)
Include a correct join between your invoices table and your credits table. Currently, this join is hard coded in your query.
Also regard actual payments by customers. Payments have effect on the available credit, I assume. Also note that, unless you are OK with a history that changes, you need to regard the payment dates as well (just like the credit change dates).
I'm not sure how you would solve your scenario using PowerShell... I do know for sure, that this can be tackled with SQL.
I cannot say anything about the resulting performance, however. These kinds of calculations surely come with a price tag attached in that regard. If you need high performance, I guess it might be more practical to include columns in your invoices table to physically store the available credit with each invoice detail record.
Edit
I have experimented a little with your scenario and your additional comments.
My solution implementation uses two CTEs:
The first CTE (cte_invoice_credit_dates) retrieves the date of the active credit record for specific invoice IDs.
The second CTE (cte_contact_invoice_summarized_totals) calculates the invoice totals of all the invoices of a specific contact. Since you want to summarize on solution detail per invoice as well, I also included the solution ID per invoice in the querying logic.
The main query selects all columns from the invoices table and uses the data from the two CTEs to calculate three additional columns in the result set:
Column credit_assigned represents the total assigned credit at the invoice's date.
Column summarized_total shows the contact's cumulative invoice total.
Column credit_available shows the remaining credit.
WITH
[cte_invoice_credit_dates] AS (
SELECT DISTINCT
I.[invoice_id],
C.[date_applied]
FROM
[invoices] AS I
OUTER APPLY (SELECT TOP (1) [date_applied]
FROM [credits]
WHERE
[owner_id] = I.[contact_id] AND
[date_applied] <= I.[date]
ORDER BY [date_applied] DESC) AS C
),
[cte_contact_invoice_summarized_totals] AS (
SELECT
I.[contact_id],
I.[invoice_id],
I.[solution_id],
SUM(H.[total]) AS [total]
FROM
[invoices] AS I
INNER JOIN [invoices] AS H ON
H.[contact_id] = I.[contact_id] AND
H.[invoice_id] = I.[invoice_id] AND
H.[solution_id] <= I.[solution_id] AND
H.[date] <= I.[date]
GROUP BY
I.[contact_id],
I.[invoice_id],
I.[solution_id]
)
SELECT
I.[invoice_id],
I.[contact_id],
I.[solution_id],
I.[total],
I.[date],
COALESCE(C.[total_applied], 0) AS [credit_assigned],
H.[total] AS [summarized_total],
COALESCE(C.[total_applied] - H.[total], 0) AS [credit_available]
FROM
[invoices] AS I
INNER JOIN [cte_contact_invoice_summarized_totals] AS H ON
H.[contact_id] = I.[contact_id] AND
H.[invoice_id] = I.[invoice_id] AND
H.[solution_id] = I.[solution_id]
LEFT JOIN [cte_invoice_credit_dates] AS CD ON
CD.[invoice_id] = I.[invoice_id]
LEFT JOIN [credits] AS C ON
C.[owner_id] = I.[contact_id] AND
C.[date_applied] = CD.[date_applied]
ORDER BY
I.[invoice_id],
I.[solution_id];

SQL query last instance before current range

I have a query that pulls all studies from the previous week based on the av_summary column. I need to add a column that will pull the study prior to the most recent study no matter how long ago the previous study was performed. I only need the date of the last study. f.creation_datetime is the column both current study and previous study would come from.
select distinct
f.patient_name
,t.patient_mrn
,p.accession_number
,p.performed_start_time
,p.procedure_id
,SUBSTRING(CAST(s.av_summary as NVARCHAR(MAX)), CHARINDEX('This suggests
the stenosis', CAST(s.av_summary as NVARCHAR(MAX))) ,
LEN(CAST(s.av_summary as NVARCHAR(MAX)))) as AV_Summary
,(select TOP 1 (p1.performed_start_time)
from dbo.T_TCS_PROCEDURE as p1
where
p1.patient_id = p.patient_id
and
p1.performed_start_time < p.performed_start_time
and p1.procedure_type_id = p.procedure_type_id
order by p1.performed_start_time DESC
) as Last_Echo
from dbo.folders as f
join dbo.T_TCS_PROCEDURE as p
on p.procedure_id = f.procedure_id
join dbo.T_ECHO_SUMMARY as s
on s.procedure_id = f.procedure_id
join dbo.T_CON_DISPATCHER_EVENT_TRACK as t
on t.procedure_id = f.procedure_id
where
CAST( f.creation_datetime AS DATE ) > DATEADD( DAY, -14, CAST( GETDATE()
AS DATE))
and
CAST(s.av_summary as NVARCHAR(MAX)) like '%This suggests the stenosis is%'
and
LEN(LTRIM(RTRIM(t.patient_mrn))) > 0
I need to add a column that will pull the study prior to the most
recent study no matter how long ago the previous study was performed.
Add the column as a subquery that gets the TOP 1 date before the date of the current row's study.
I was finally able to figure out how to get the data I wanted for the previous study. I had to correlate the subquery on a couple different columns and I used a completely different table. The database I am working with is a complete mess, but I managed to figure it out. The company that created the database couldn't even figure it out. Thank you all for your help on this. It is much appreciated.
,(select TOP 1 (p1.performed_start_time)
from dbo.T_TCS_PROCEDURE as p1
where
p1.patient_id = p.patient_id
and
p1.performed_start_time < p.performed_start_time
and
p1.procedure_type_id = p.procedure_type_id
order by p1.performed_start_time DESC
) as Last_Echo

SQL Server, 2 tables, one is input, second is database with items, find closest match

This is my first Stackflow question, I hope someone can help me out with this. I I am completely lost and a newbie at SQL.
I have two tables (which I overly simplified for this question), the first one has the customer info and the car tire that they need. The second one is simply filled with a tire id, and all of the information for the tires. I am trying to input only the customer ID and return the one closest tire that matches the input along with the values of both the selected tire and the customer's tire. The matches also need to be prioritized in that order (size most important, width next most important, ratio is least important). Any suggestions on how to do this or where to start? Is there anything I can look at to help me solve this problem? I have been trying many different procedures, and some nested selects, but nothing is getting me close. Thank you.
customertable (custno, custsize, custwidth, custratio)
1,17,255,50
2,16,235,50
etc...
tirecollection (tireid, tiresize, tirewidth, tireratio)
1,15,225,40
2,16,225,50
3,17,250,55
4,17,235,30
5,18,255,40
etc...
This is not a 100% complete solution, but may work towards coming up with a solution. The approach here is combining the tyre dimensions into one value and then ranking them within a tyre size partition. You could then pass in the customer tyre dimensions to get the closest match.
with CTE
as
(
select *, TyreSize + TyreWidth as [TyreDimensions]
from tblTyres
)
select TC.CustId, C.TyreId, C.TyreSize, C.TyreWidth, C.[TyreDimensions],
rank() over(partition by C.TyreSize order by C.[TyreDimensions]) as [RNK]
from tblTyreCustomer as TC
join CTE as C
on TC.CustTyreSize = C.TyreSize
Assuming you're running SQL Server 2008 or later, this should work (this assumes you want to get a result for a single customer on a case-by-case basis):
CREATE FUNCTION udf.GetClosestTireMatch
(
#CustomerNo int
)
RETURNS TABLE
AS RETURN
SELECT custno, tireid, tiresize, tirewidth, tireratio
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY sizediff, widthdiff, ratiodiff) AS rownum
, c.custno, c.custsize, c.custwidth, c.custratio, t.tireid, t.tiresize, t.tirewidth, t.tireratio
, ABS(c.custsize-t.tiresize) AS sizediff, ABS(c.custwidth-t.tirewidth) AS widthdiff, ABS(c.custratio-t.tireratio) AS ratiodiff
FROM (SELECT * FROM customertable WHERE custno = #CustomerNo) c
CROSS JOIN tirecollection
) sub
WHERE rownum = 1
GO
Then you run the function with:
SELECT * FROM udf.GetClosestTireMatch(5)
(where 5=the customernumber you're querying).

MS Access : Average and Total Calculation in Single Query

INTRODUCTION TO DATABASE TABLE BEING USED -
I am working on a “Stock Market Prices” based Database Table. My table has got the data for the following FIELDS –
ID
SYMBOL
OPEN
HIGH
LOW
CLOSE
VOLUME
VOLUME CHANGE
VOLUME CHANGE %
OPEN_INT
SECTOR
TIMESTAMP
New data gets added to the table daily “Monday to Friday”, based on the stock market price changes for that day. The current requirement is based on the VOLUME field, which shows the volume traded for a particular stock on daily basis.
REQUIREMENT –
To get the Average and Total Volume for last 10,15 and 30 Days respectively.
METHOD USED CURRENTLY -
I created these 9 SEPARATE QUERIES in order to get my desired results –
First I have created these 3 queries to take out the most recent last 10,15 and 30 dates from the current table:
qryLast10DaysStored
qryLast15DaysStored
qryLast30DaysStored
Then I have created these 3 queries for getting the respective AVERAGES:
qrySymbolAvgVolume10Days
qrySymbolAvgVolume15Days
qrySymbolAvgVolume30Days
And then I have created these 3 queries for getting the respective TOTALS:
qrySymbolTotalVolume10Days
qrySymbolTotalVolume15Days
qrySymbolTotalVolume30Days
PROBLEM BEING FACED WITH CURRENT METHOD -
Now, my problem is that I have ended up having these so many different queries, whereas I wanted to get the output into One Single Query, as shown in the Snapshot of the Excel Sheet:
http://i49.tinypic.com/256tgcp.png
SOLUTION NEEDED -
Is there some way by which I can get these required fields into ONE SINGLE QUERY, so that I do not have to look into multiple places for the required fields? Can someone please tell me how to get all these separate queries into one -
A) Either by taking out or moving the results from these separate individual queries to one.
B) Or by making a new query which calculates all these fields within itself, so that these separate individual queries are no longer needed. This would be a better solution I think.
One Clarification about Dates –
Some friend might think why I used the method of using Top 10,15 and 30 for getting the last 10,15 and 30 Date Values. Why not I just used the PC Date for getting these values? Or used something like -
("VOLUME","tbl-B", "TimeStamp BETWEEN Date() - 10 AND Date()")
The answer is that I require my query to "Read" the date from the "TIMESTAMP" Field, and then perform its calculations accordingly for LAST / MOST RECENT "10 days, 15 days, 30 days” FOR WHICH THE DATA IS AVAILABLE IN THE TABLE, WITHOUT BOTHERING WHAT THE CURRENT DATE IS. It should not depend upon the current date in any way.
If there is any better method or more efficient way to create these queries, then please enlighten.
You have separate queries to compute 10DayTotalVolume and 10DayAvgVolume. I suspect you can compute both in one query, qry10DayVolumes.
SELECT
b.SYMBOL,
Sum(b.VOLUME) AS 10DayTotalVolume,
Avg(b.VOLUME) AS 10DayAvgVolume
FROM
[tbl-B] AS b INNER JOIN
qryLast10DaysStored AS q
ON b.TIMESTAMP = q.TIMESTAMP
GROUP BY b.SYMBOL;
However, that makes me wonder whether 10DayAvgVolume can ever be anything other than 10DayTotalVolume / 10
Similar considerations apply to the 15 and 30 day values.
Ultimately, I think you want something based on a starting point like this:
SELECT
q10.SYMBOL,
q10.[10DayTotalVolume],
q10.[10DayAvgVolume],
q15.[15DayTotalVolume],
q15.[15DayAvgVolume],
q30.[30DayTotalVolume],
q30.[30DayAvgVolume]
FROM
(qry10DayVolumes AS q10
INNER JOIN qry15DayVolumes AS q15
ON q10.SYMBOL = q15.SYMBOL)
INNER JOIN qry30DayVolumes AS q30
ON q10.SYMBOL = q30.SYMBOL;
That assumes you have created qry15DayVolumes and qry30DayVolumes following the approach I suggested for qry10DayVolumes.
If you want to cut down the number of queries, you could use subqueries for each of the qry??DayVolumes saved queries, but try it this way first to make sure the logic is correct.
In that second query above, there can be a problem due to field names which start with digits. Enclose those names in square brackets or re-alias them in qry10DayVolumes, qry15DayVolumes, and qry30DayVolumes using alias names which begin with letters instead of digits.
I tested the query as written above with the "2nd Upload.mdb" you uploaded, and it ran without error from Access 2007. Here is the first row of the result set from that query:
SYMBOL 10DayTotalVolume 10DayAvgVolume 15DayTotalVolume 15DayAvgVolume 30DayTotalVolume 30DayAvgVolume
ACC-1 42909 4290.9 54892 3659.46666666667 89669 2988.96666666667
Access doesn't support most advanced SQL syntax and clauses, so this is a bit of a hack, but it works, and is fast on your small sample. You're basically running 3 queries but the Union clauses allow you to combine into one:
select
Symbol,
sum([10DayTotalVol]) as 10DayTotalV,
sum([10DayAvgVol]) as 10DayAvgV,
sum([15DayTotalVol]) as 15DayTotalV,
sum([15DayAvgVol]) as 15DayAvgV,
sum([30DayTotalVol]) as 30DayTotalV,
sum([30DayAvgVol]) as 30DayAvgV
from (
select
Symbol,
sum(volume) as 10DayTotalVol, avg(volume) as 10DayAvgVol,
0 as 15DayTotalVol, 0 as 15DayAvgVol,
0 as 30DayTotalVol, 0 as 30DayAvgVol
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 10 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
sum(volume), avg(volume),
0, 0
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 15 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
UNION
select
Symbol,
0, 0,
0, 0,
sum(volume), avg(volume)
from
[tbl-b]
where
timestamp >= (select min(ts) from (select distinct top 30 timestamp as ts from [tbl-b] order by timestamp desc ))
group by
Symbol
) s
group by
Symbol

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is, this takes ages to get the data. Is there a quicker way to acheive the same in sql?
Edit, just to be specific about the time: It takes just over 3 minutes, which is far too long for a query that is destined for a php intranet page.
Edit (11/09/2011): I've found out that the problem is the customers table. It's actually a view. It previously had good performance but now all of a sudden its performance is terrible, a simple query on it takes almost as long as the above query pair. I have therefore chosen an alternative table of customer data (that actually is a table, and not a view) and now the query pair takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
c.firstname,
c.lastname,
c.email,
a.sumhandle
from customers as c
inner join (
select accountnumber,
sum(handle) as sumhandle
from activity
where _date >= '2010-09-10' and
transaction_type = 'bet' and
accountnumber not in (
select accountnumber
from activity
where _date >= '2011-08-10' and
transaction_type = 'bet'
)
group by accountnumber
) as a
on c.accountno = a.accountnumber
I also included your first query as a sub-query instead. I'm not sure what that will do for performance. It could be better, it could be worse, you have to test on your data.
I don't know your exact business need, but rarely will someone need access to innactive accounts over several months at a moments notice. Depending on when you pruge data, this may get worse.
You could create an indexed view that contains the last transaction date for each account:
max(_date) as RecentTransaction
If this table gets too large, it could be partioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
Answering my question as the problem wasn't the structure of the query but one of the tables being used. It was a view and its performance was terrible. I change to an actual table with customer data in and reduced the execution time down to about 15 seconds.

Resources