How to get analysis result faster using Partition? - sql-server

I have a table in SQL Server 2012, which has these 2 columns:
Date, Amount
I want to get the summary for a given month, e.g. May 2013. I also want the summary for the previous month, for the same month last year, and the average of the past 12 months. I know I can use GROUP BY to get the data for each month and then derive everything I need. However, the table has so many rows that I want to make it faster.
One possibility is to use PARTITION BY:
SELECT DISTINCT YEAR(Date), MONTH(Date), SUM(Amount) OVER (PARTITION BY YEAR(Date), MONTH(Date))
FROM myTable
However, how can I use this to get data like last month, the same month last year, and the average of the past 12 months?
Or do I need to use PARTITION BY to get the monthly data, and then use ROWS to get the rest?
Any ideas?
Thanks

The key idea is to first aggregate the data in a subquery or CTE. Then you can express the conditions you want using window functions:
SELECT yr, mon, amount,
LAG(Amount) OVER (ORDER BY yr*100+mon) as LastMonth,
LAG(Amount, 12) OVER (ORDER BY yr*100+mon) as LastYearMonth,
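-- SQL Server's RANGE does not accept a numeric offset, so use ROWS (assumes one row per month, no gaps)
AVG(Amount) OVER (ORDER BY yr*100 + mon ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) as AvgLast12Months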
FROM (SELECT YEAR(Date) as yr, MONTH(Date) as mon, SUM(Amount) as Amount
FROM myTable
GROUP BY YEAR(Date), MONTH(Date)
) ym;
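If only one month is needed (May 2013 in the question), a sketch along these lines may help. It assumes the same myTable with Date and Amount columns; the CTE names ym and w and the alias AvgLast12Months are made up for illustration. The filter on the result month has to sit outside the window functions, otherwise LAG has no earlier rows to look at. Restricting the scan to the 13 months that can affect the result keeps the amount of data small, and it assumes every month in that range has at least one row.

```sql
WITH ym AS (
    -- Aggregate only the 13 months that can influence the May 2013 row
    SELECT YEAR([Date]) AS yr, MONTH([Date]) AS mon, SUM(Amount) AS Amount
    FROM myTable
    WHERE [Date] >= '20120501' AND [Date] < '20130601'
    GROUP BY YEAR([Date]), MONTH([Date])
),
w AS (
    -- Compute the window columns over all 13 monthly rows
    SELECT yr, mon, Amount,
           LAG(Amount)     OVER (ORDER BY yr, mon) AS LastMonth,
           LAG(Amount, 12) OVER (ORDER BY yr, mon) AS LastYearMonth,
           AVG(Amount)     OVER (ORDER BY yr, mon
                                 ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS AvgLast12Months
    FROM ym
)
-- Filter to the month of interest only after the windows are evaluated
SELECT yr, mon, Amount, LastMonth, LastYearMonth, AvgLast12Months
FROM w
WHERE yr = 2013 AND mon = 5;
```

A half-open date range on the raw column lets an index on Date be used, if one exists, which wrapping the column in YEAR() or MONTH() in the WHERE clause generally prevents.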

Related

How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month so if two purchases are made in the same month it shouldn't be shown.
The dates are between 2016-10-01 and 2018-09-30. The output should be ordered by order_purchase_timestamp.
An example:
customer_id order_purchase_timestamp
1 2016-09-04
2 2016-09-05
3 2016-09-05
3 2016-09-15
1 2016-10-04
to
customer_id first_purchase age_by_month order_purchase_timestamp
1 2016-09 0 2016-09-04
2 2016-09 0 2016-09-05
3 2016-09 0 2016-09-05
1 2016-09 1 2016-10-04
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.
The following approach is designed to be relatively easy to understand. There are other ways (e.g., windowed functions) that may be marginally more efficient; but this makes it easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that catches most people but works in your favour here: when comparing months, it ignores the day component. For example, the difference in months between 1 Jan and 31 Jan is 0, whereas the difference between 31 Jan and 1 Feb is 1 month. However, I think this is actually what you want!
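The boundary-counting behaviour described above can be checked with a quick query. This is only an illustration with literal dates (the column aliases are made up); it does not touch the orders table.

```sql
-- DATEDIFF(month, ...) counts month boundaries crossed, not elapsed days
SELECT DATEDIFF(month, '20160101', '20160131') AS same_month,       -- 30 days apart, returns 0
       DATEDIFF(month, '20160131', '20160201') AS across_boundary;  -- 1 day apart, returns 1
```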
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
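As a small illustration of why EOMONTH works as a grouping key, the two September dates from the question's sample data map to the same value (the aliases here are made up):

```sql
-- Any date within a month maps to that month's last day
SELECT EOMONTH('20160905') AS first_order_key,   -- 2016-09-30
       EOMONTH('20160915') AS second_order_key;  -- 2016-09-30
```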
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The query below uses window functions to calculate both the customer's earliest purchase and the earliest purchase for each month (and uses DISTINCT rather than GROUP BY). With that, it just applies DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note, however, that this has one difference in the results. If you have 2 orders in a month and your lower date filter falls between the two (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10), then the row won't be included, because the earliest purchase in the month is outside the filter range.
Also beware with both of these and what type of date or datetime field you are using - if you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001'
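A minimal sketch of that difference, assuming the column is a datetime (the table variable and the sample rows are invented for the demonstration):

```sql
DECLARE @orders TABLE (order_purchase_timestamp datetime);

INSERT INTO @orders VALUES
    ('2018-09-30T00:00:00'),   -- exactly midnight on the last day
    ('2018-09-30T14:30:00');   -- later on the same day

-- '20180930' is read as 2018-09-30 00:00:00, so the afternoon order is excluded
SELECT COUNT(*) AS between_count
FROM @orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';   -- returns 1

-- A half-open range keeps the whole of the last day
SELECT COUNT(*) AS half_open_count
FROM @orders
WHERE order_purchase_timestamp >= '20161001'
  AND order_purchase_timestamp <  '20181001';                       -- returns 2
```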
Here is a short query that achieves all you want (descriptions of the methods used are inline):
declare @test table (
customer_id int,
order_purchase_timestamp date
)
-- some test data
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
customer_id,
-- takes care of correct display of first_purchase
format(first_purchase, 'yyyy-MM') first_purchase,
-- used to get the difference in months
datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
order_purchase_timestamp
from (
select
*,
-- window function used to find min value for given column within group
-- for each row
min(order_purchase_timestamp) over (partition by customer_id) first_purchase
from @test
) a

Calculate Time and Cost Using Values From Next Row

Consider the following data showing the time it takes for engineers to travel to a job. The ChargeBand column shows the various rates charged at different hours of the day. Many engineers can attend one job.
I want to be able to cap the travel time for each engineer to one hour. So even if it takes longer than an hour to travel, the maximum to be paid is one hour travel cost, no more. This is fine and I can do this when the travel time is contained within one charge band using this CASE statement:
CASE
WHEN NumberOfHours > 1.0 AND CallIDStartDate >= '2019-11-01' Then cast((1 * isnull(x.Rate,1)) as
decimal(20,7))
ELSE cast((NumberOfHours * isnull(x.Rate,1)) as decimal(20,7))
END as LabourCost
However the problem is when an hour travel straddles 2 charge bands. This is the issue I am unable to resolve and would like help with. In the first two rows for example on Monday, the engineer travel from 07.46 til 09.41 but his travel cost should be capped at 8.46. So the first 14 mins are charged at 46.62 and the remaining 44 mins at 37.67. How do I do this?
The multiple tuesday entries signify multiple engineers attending. The challenge is also to identify two rows for each engineer as being the 07.46/44 and 08:00 StartTimes and cap the travel charge for one hour as per the Monday example.
I thought to partition the table by day so it was apparent that any StartTime less than the value in the previous row belongs to a different engineer attending the same job but this doesn't help with the calculation itself.
I also thought to use the LEAD() and LAG() functions to calculate the time or charge from the following row value, and perhaps the answer is with them, but I don't know how to apply in the code.
You can try a CTE with the LEAD function. This example works if you have two records per employee for each workday:
with cte as(
select employeeid, weekday, starttime, finishtime, chargeband, rate firstrate,
lead(starttime,1) over(partition by employeeid, weekday order by starttime) nextstarttime,
lead(finishtime,1) over(partition by employeeid, weekday order by starttime) nextfinishtime,
lead(rate,1) over(partition by employeeid, weekday order by starttime) nextrate
from ratetable)
-- if the first band already covers the hour, charge one hour at its rate;
-- otherwise charge the first band in full and fill the rest of the one-hour cap from the second band
select employeeid, weekday, case when firsttime>= 1 then firstrate
else firsttime*firstrate+ (case when firsttime+secondtime>1 then 1-firsttime else secondtime end)*nextrate end as travelcost
from(
select employeeid, weekday, DATEDIFF(second, starttime, finishtime) / 3600.0 firsttime ,firstrate,
DATEDIFF(second, nextstarttime, nextfinishtime) / 3600.0 secondtime,nextrate
from cte where nextstarttime is not null) x

How to sum if within percentile in SQL Server?

I have a table that looks something like this:
It contains more than 100k rows.
I know how to get the median (or other percentile) values per week:
SELECT DISTINCT week,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY visits) OVER (PARTITION BY week) AS visit_median
FROM table
ORDER BY week
But how do I return a column with the total visits within the top N percentile of the group per week?
I don't think you want percentile_cont(). You can try using ntile(). For instance, the top decile:
SELECT week, SUM(visits)
FROM (SELECT t.*,
NTILE(100) OVER (PARTITION BY week ORDER BY visits DESC) as tile
FROM table
) t
WHERE tile <= 10
GROUP BY week
ORDER BY week;
You need to understand how NTILE() handles ties. Rows with the same number of visits can go into different tiles. That is, the sizes of the tiles differ by at most 1. This may or may not be what you really want.
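The tie behaviour is easy to see on a tiny example. The table variable and its values below are invented purely for the demonstration:

```sql
DECLARE @t TABLE (visits int);
INSERT INTO @t VALUES (5), (5), (5), (5);

-- All four rows have the same number of visits, yet NTILE(2) still
-- splits them evenly: two rows get tile 1 and two rows get tile 2
SELECT visits,
       NTILE(2) OVER (ORDER BY visits DESC) AS tile
FROM @t;
```

Which of the tied rows ends up in which tile is not something the ORDER BY determines, so a sum over the top tiles can include only some of the rows that share a boundary value.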

SQL Server 2012: How to calculate Quarterly Average with only values from the first of each month

Let's say I have the following table:
CREATE TABLE Portfolio.DailyNAV
(
Date date NOT NULL,
NAV int NOT NULL,
)
GO
The date column has the daily business day starting from '2015-02-02' and the NAV column has that day's total asset value. I want to get the average NAV for a quarter. For this example, let's assume I want to get it for the 2nd Quarter of 2016. My code is now:
Select AVG(NAV) As AvgNAV
FROM Portfolio.DailyNAV
WHERE year(Date) = '2016' AND DATEPART(QUARTER,Date) = '2'
GO
The problem I am facing is that this code calculates the daily average for the quarter, but the average NAV should be calculated using only the first business date of each month in that quarter. So for 2016 Q2, the average NAV should be the average of the values from 2016-04-01, 2016-05-02 (the 1st was not a work day) and 2016-06-01. I don't want to simply change my WHERE clause and hard-code those dates, because I want to make a stored procedure where the user can get the average NAV by putting in the year and quarter.
Thanks.
This should work:
WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY CONVERT(VARCHAR(6),[Date],112)
ORDER BY [Date])
FROM Portfolio.DailyNAV
WHERE YEAR([Date]) = 2016
AND DATEPART(QUARTER,[Date]) = 2
AND NAV IS NOT NULL
)
SELECT AVG(NAV) AvgNAV
FROM CTE
WHERE RN = 1;
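Since the question asks for a stored procedure that takes the year and quarter, the same query could be wrapped roughly as follows. This is a sketch under assumptions: the procedure name GetQuarterlyAvgNAV and the parameter names are hypothetical, and the first date stored for each month is taken to be that month's first business day. NAV is declared as int, and AVG over an int column returns an int, so the cast is there to keep the decimals.

```sql
CREATE PROCEDURE Portfolio.GetQuarterlyAvgNAV   -- hypothetical name
    @Year    int,
    @Quarter int
AS
BEGIN
    SET NOCOUNT ON;

    WITH CTE AS
    (
        -- Number the rows of each month in date order; RN = 1 is the
        -- earliest date present in the table for that month
        SELECT NAV,
               RN = ROW_NUMBER() OVER (PARTITION BY MONTH([Date])
                                       ORDER BY [Date])
        FROM Portfolio.DailyNAV
        WHERE YEAR([Date]) = @Year
          AND DATEPART(QUARTER, [Date]) = @Quarter
    )
    -- Cast so the average is not truncated to a whole number
    SELECT AVG(CAST(NAV AS decimal(18, 2))) AS AvgNAV
    FROM CTE
    WHERE RN = 1;
END
```

It would then be called as, for example, EXEC Portfolio.GetQuarterlyAvgNAV @Year = 2016, @Quarter = 2;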

SQL Server Group data by week but show the start of the week

I have a query to select customer data and I want to keep an evolution of the number of customers. In week 1 I have 2 new customers so the number is 2. In week 2 I receive 3 new customers, so the number of customers is 5.
I have following query to do this
SELECT LAST_UPDATED_WEEK, SUM( NUM_CUSTOMERS ) OVER ( ORDER BY LAST_UPDATED_WEEK ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS "Number of customers"
FROM (
SELECT DATEADD(dd,DATEDIFF(dd,0,REGISTRATION_DATE),0) AS LAST_UPDATED_WEEK,
COUNT(DISTINCT CUSTOMER_ID) AS NUM_CUSTOMERS
FROM CUSTOMERS_TABLE
GROUP BY DATEADD(dd,DATEDIFF(dd,0,REGISTRATION_DATE),0)) AS T
But when I run this query, it doesn't group my data by week. I read about a DATEPART function, but that returns an integer, but I need to have the actual date.
Can someone help me?
Just replace dd in the DATEADD and DATEDIFF functions with WEEK.
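Applied to the query from the question, the change would look roughly like this (same table and column names as in the question; the final ORDER BY is an addition for readable output):

```sql
SELECT LAST_UPDATED_WEEK,
       SUM(NUM_CUSTOMERS) OVER (ORDER BY LAST_UPDATED_WEEK
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "Number of customers"
FROM (
    -- Date 0 is 1900-01-01, a Monday, so adding whole weeks to it yields a Monday
    SELECT DATEADD(week, DATEDIFF(week, 0, REGISTRATION_DATE), 0) AS LAST_UPDATED_WEEK,
           COUNT(DISTINCT CUSTOMER_ID) AS NUM_CUSTOMERS
    FROM CUSTOMERS_TABLE
    GROUP BY DATEADD(week, DATEDIFF(week, 0, REGISTRATION_DATE), 0)
) AS T
ORDER BY LAST_UPDATED_WEEK;
```

One thing worth checking against your own data: DATEDIFF(week, ...) counts week boundaries with Sunday as the first day of the week, so a registration on a Sunday is assigned to the Monday that follows it rather than the Monday before.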
