Order DateTime, then group by Date (ignoring Time) + other field - sql-server

I'm trying to sort by Date and group by another field. Seems challenging because ORDER BY and GROUP BY clauses use the same fields for aggregations.
This is some example data:
Date               | Operation | Count
--------------------------------------
2021-10-13 9:12:00 | Visits    | 2
2021-10-13 8:11:00 | Calls     | 1
2021-10-13 7:10:30 | Calls     | 3
2021-10-13 6:00:00 | Calls     | 5
2021-10-13 5:10:00 | Visits    | 2
2021-10-13 4:00:00 | Visits    | 1
2021-10-12 3:20:00 | Calls     | 2
2021-10-12 2:10:00 | Calls     | 2
2021-10-12 1:00:00 | Visits    | 2
I need to show groups of "Visits" and "Calls", on different days. The result should be:
Date       | Operation | Count
------------------------------
2021-10-13 | Visits    | 2
2021-10-13 | Calls     | 9
2021-10-13 | Visits    | 3
2021-10-12 | Calls     | 4
2021-10-12 | Visits    | 2
Right now, I've tried:
SELECT
CAST([Date] AS DATE) [Date],
Operation,
SUM([Count])
FROM Table
GROUP BY CAST([Date] AS DATE), Operation
ORDER BY CAST([Date] AS DATE) DESC, Operation
But it gives the following result:
Date       | Operation | Count
------------------------------
2021-10-13 | Calls     | 9
2021-10-13 | Visits    | 5
2021-10-12 | Calls     | 4
2021-10-12 | Visits    | 2
Here's a fiddle to make working with this easier:
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=39bb598378b35603ca98c0c4733d8f92
I'm now thinking of adding a temporary table with an additional column called "Group", but I'm not sure whether there is a better solution. I've seen answers to similar problems, but the "Date" problem seems to be different.
Could you share ideas?

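As discussed, this is a gaps-and-islands problem, which just requires isolating the additional distinct groups: take a running row number over all rows and subtract a row number partitioned by each group of rows (operation and day). Consecutive rows of the same group share the same difference: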
with grp as (
select Convert(date, [date]) [Date], operation, [count],
Row_Number() over(order by [date])
- Row_Number() over(partition by operation, Convert(date, [date]) order by date) gp
from t
)
select date, operation, Sum([count]) [Count]
from grp
group by [date], operation, gp
order by [date] desc, gp desc
DB Fiddle
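To see why the row-number difference isolates each run, here is a minimal sketch that replays the same idea against the question's sample data. It uses Python's built-in sqlite3 module rather than SQL Server, so the names ts, cnt, d and total are my own, date(ts) stands in for Convert(date, [date]), and the timestamps are zero-padded so SQLite can parse them. It assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

# Sample rows from the question (hours zero-padded for SQLite; an assumption).
rows_in = [
    ("2021-10-13 09:12:00", "Visits", 2),
    ("2021-10-13 08:11:00", "Calls", 1),
    ("2021-10-13 07:10:30", "Calls", 3),
    ("2021-10-13 06:00:00", "Calls", 5),
    ("2021-10-13 05:10:00", "Visits", 2),
    ("2021-10-13 04:00:00", "Visits", 1),
    ("2021-10-12 03:20:00", "Calls", 2),
    ("2021-10-12 02:10:00", "Calls", 2),
    ("2021-10-12 01:00:00", "Visits", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts TEXT, operation TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows_in)

query = """
WITH grp AS (
    SELECT date(ts) AS d, operation, cnt,
           -- overall position minus position within (operation, day):
           -- constant for each unbroken run of the same operation
           ROW_NUMBER() OVER (ORDER BY ts)
         - ROW_NUMBER() OVER (PARTITION BY operation, date(ts) ORDER BY ts) AS gp
    FROM t
)
SELECT d, operation, SUM(cnt) AS total
FROM grp
GROUP BY d, operation, gp
ORDER BY d DESC, gp DESC
"""
islands = conn.execute(query).fetchall()
for row in islands:
    print(row)
```

With this data the two "Visits" runs on 2021-10-13 get different gp values (6 and 3), which is what keeps them as separate rows in the output.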

I'd probably wind up using a CTE (there's probably a way to do it similar to how you've already tried though). Something along the lines of:
WITH x AS(
SELECT CAST([Date] AS DATE) [Date],
Operation,
SUM(t.[Count]) as [Count]
FROM [MyTable] AS t
GROUP BY CAST([Date] AS DATE),
Operation )
SELECT x.[Date],
x.Operation,
x.[Count]
FROM x AS x
ORDER BY x.[Date] desc,
x.Operation;
However, if you're going down the route of wanting to show something like:
X calls came before Y visits, then there were another Z calls before the end of the day.
Then you'll need something more custom, like the solutions that were linked in the comments.

Related

How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month so if two purchases are made in the same month it shouldn't be shown.
The dates are between 2016-10-01 and 2018-09-30. The result should be ordered by order_purchase_timestamp.
An example
customer_id | order_purchase_timestamp
--------------------------------------
1           | 2016-09-04
2           | 2016-09-05
3           | 2016-09-05
3           | 2016-09-15
1           | 2016-10-04
to
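customer_id | first_purchase | age_by_month | order_purchase_timestamp
----------------------------------------------------------------------
1           | 2016-09        | 0            | 2016-09-04
2           | 2016-09        | 0            | 2016-09-05
3           | 2016-09        | 0            | 2016-09-05
1           | 2016-09        | 1            | 2016-10-04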
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.
The following approach is designed to be relatively easy to understand. There are other ways (e.g., windowed functions) that may be marginally more efficient; but this makes it easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that catches most people but works in your favor here. When comparing months, it ignores the day component: the difference in months between 1 Jan and 31 Jan is 0, whereas the difference between 31 Jan and 1 Feb is 1. I think this is actually what you want!
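As a rough illustration of that behavior (not of how SQL Server implements it), the month difference can be modeled as a count of month boundaries crossed. The helper name datediff_month and the year 2018 are my own choices for the example.

```python
from datetime import date


def datediff_month(start: date, end: date) -> int:
    """Count month boundaries crossed between two dates, ignoring the day."""
    return (end.year - start.year) * 12 + (end.month - start.month)


same_month = datediff_month(date(2018, 1, 1), date(2018, 1, 31))
next_day = datediff_month(date(2018, 1, 31), date(2018, 2, 1))
sample_row = datediff_month(date(2016, 9, 4), date(2016, 10, 4))

print(same_month)  # 0: both dates fall in January
print(next_day)    # 1: one day apart, but a month boundary was crossed
print(sample_row)  # 1: customer 1's second purchase in the question's example
```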
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The query below uses window functions to calculate both the customer's earliest purchase and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note, however, that this has one difference in the results. If you have 2 orders in a month, and your lowest date filter falls between the two (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10), then the row won't be included, because the earliest purchase in the month is outside the filter range.
Also beware with both of these and what type of date or datetime field you are using - if you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001'
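A small sketch of why the two filters differ for datetimes: a date-only literal such as '20180930' is compared as midnight at the start of that day, so anything later on 30 September falls outside the BETWEEN range. The order timestamp below is a made-up value, and plain Python comparisons stand in for the SQL predicates.

```python
from datetime import datetime

# Hypothetical order placed in the afternoon of the last day of the range.
order_ts = datetime(2018, 9, 30, 14, 30)

lower = datetime(2016, 10, 1)
upper_inclusive = datetime(2018, 9, 30)  # '20180930' means 2018-09-30 00:00:00
upper_exclusive = datetime(2018, 10, 1)

# BETWEEN '20161001' AND '20180930'
in_between = lower <= order_ts <= upper_inclusive
# >= '20161001' AND < '20181001'
in_half_open = lower <= order_ts < upper_exclusive

print(in_between)    # False: the afternoon order is silently dropped
print(in_half_open)  # True
```

If the column is a plain date, both filters return the same rows; the half-open form is simply the safer habit.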
Here is short query that achieves all you want (descriptions of methods used are inline):
declare @test table (
    customer_id int,
    order_purchase_timestamp date
)
-- some test data
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
    customer_id,
    -- takes care of correct display of first_purchase
    format(first_purchase, 'yyyy-MM') first_purchase,
    -- used to get the difference in months
    datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
    order_purchase_timestamp
from (
    select
        *,
        -- window function used to find min value for given column within group
        -- for each row
        min(order_purchase_timestamp) over (partition by customer_id) first_purchase
    from @test
) a

T-SQL recursion, date shifting based on previous iteration

I have a data set that includes a customer, payment date, and the number of days they have paid for. I need to calculate the coverage start/end dates that each payment is covering. This is difficult when a payment is made before the current coverage period ends.
The best way I've come up with to think about this would be a month to month cell phone plan where the customer may pay for a specified number of days at any point during a given month. The next covered period should always start the day after the previous covered period expires.
Here is the code sample using a temp table.
CREATE TABLE #Payments
(Customer_ID INTEGER,
Payment_Date DATE,
Days_Paid INTEGER);
INSERT INTO #Payments
VALUES (1,'2018-01-01',30);
INSERT INTO #Payments
VALUES (1,'2018-01-29',20);
INSERT INTO #Payments
VALUES (1,'2018-02-15',30);
INSERT INTO #Payments
VALUES (1,'2018-04-01',30);
I need to get the coverage start/end dates back.
The initial payment is made on 2018-01-01 and they paid for 30 days. That means they are covered until 2018-01-30 (Payment_Date + Days_Paid - 1, since the payment date is included as a covered day). However, they made their next payment on 2018-01-29, so I need to calculate the start date of the next coverage window, which in this case would be the previous Payment_Date + previous Days_Paid. Coverage window 2 therefore starts on 2018-01-31 and extends through 2018-02-19, since they only paid for 20 days on Payment_Date 2018-01-29.
The expected output is:
Customer_ID | Payment_Date | Days_Paid | Coverage_Start_Date | Coverage_End_Date
--------------------------------------------------------------------------------
1 | '2018-01-01'| 30 | '2018-01-01'| '2018-01-30'
1 | '2018-01-29'| 20 | '2018-01-31'| '2018-02-19'
1 | '2018-02-15'| 30 | '2018-02-20'| '2018-03-21'
1 | '2018-04-01'| 30 | '2018-04-01'| '2018-04-30'
Because the current record's coverage start date will depend on the previous record's coverage end date, I feel like this would be a good candidate for recursion, but I can't figure out how to do it.
I have a way to do this in a while loop, but I would like to complete it using a recursive CTE. I have also thought about simply adding up the Days_Paid and adding that to the first payment's start date, however this only works if a payment is made before the previous coverage has expired. In addition, I need to calculate the coverage start/end dates for each Payment_Date.
Finally, using LAG/LEAD functions doesn't appear to work because it does not consider the result of the previous iteration, only the current value of the previous record. Using LAG/LEAD, you get the correct answer for the 2nd payment record, but not the third.
Is there a way to do this with a recursive CTE?
NOTE: This is not a recursive solution, but it is set-based vs. your loop solution.
While trying to solve this recursively it hit me that this is essentially a "running totals" problem, and can be easily solved with window functions.
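WITH runningTotal AS
(
SELECT p.*,
       SUM(Days_Paid) OVER(ORDER BY p.Payment_Date) AS runningTotalDays,
       MIN(Payment_Date) OVER(ORDER BY p.Payment_Date) AS startDate
FROM #Payments p
)
SELECT r.Customer_ID, r.Payment_Date, r.Days_Paid,
       -- previous running total = days already covered, so it lands on the day after the previous end date
       COALESCE(DATEADD(DAY, LAG(runningTotalDays) OVER(ORDER BY r.Payment_Date), startDate), startDate) AS Coverage_Start_Date,
       -- minus 1 because the first payment date itself is a covered day
       DATEADD(DAY, runningTotalDays - 1, startDate) AS Coverage_End_Date
FROM runningTotal r
Each end date is the first payment date plus the "running total" of all the Days_Paid so far, minus 1 because the payment date itself counts as a covered day. Using LAG to get the previous record's running total gives the start date, which is the day after the previous end date. The COALESCE is to handle the first record. For more than a single customer, you can PARTITION BY Customer_Id. Note that this assumes every payment arrives before the previous coverage expires; with the sample data, the last payment (2018-04-01) comes after a lapse in coverage, so this query returns 2018-03-22 as its start date rather than 2018-04-01.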
So of course, right after posting this I came across a similar question that was already answered.
Here's the link: Recursively retrieve LAG() value of previous record
Based on that solution, I was able to construct the following solution to my own question.
The key here was adding the "prep_data" CTE which made the recursion problem much easier.
;WITH prep_data AS
(SELECT Customer_ID,
ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY Payment_Date) AS payment_seq_num,
Payment_Date,
Days_Paid,
Payment_Date as Coverage_Start_Date,
DATEADD(DAY,Days_Paid-1,Payment_Date) AS Coverage_End_Date
FROM #Payments),
recursion AS
(SELECT Customer_ID,
payment_seq_num,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM prep_data
WHERE payment_seq_num = 1
UNION ALL
SELECT r.Customer_ID,
p.payment_seq_num,
p.Payment_Date,
p.Days_Paid,
CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END AS Coverage_Start_Date,
DATEADD(DAY,p.Days_Paid-1,CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END) AS Coverage_End_Date
FROM recursion r
JOIN prep_data p ON p.Customer_ID = r.Customer_ID AND p.payment_seq_num = r.payment_seq_num + 1
)
SELECT Customer_ID,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM recursion
ORDER BY payment_seq_num;
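For checking the recursive query's output by hand, the same rule can be sketched as a plain loop: each payment starts the day after the previous coverage ends if that coverage is still running on the payment date, otherwise on the payment date itself. This is only an illustration in Python, not a replacement for the CTE; the function name coverage_windows is my own.

```python
from datetime import date, timedelta

# Sample payments from the question: (Customer_ID, Payment_Date, Days_Paid).
payments = [
    (1, date(2018, 1, 1), 30),
    (1, date(2018, 1, 29), 20),
    (1, date(2018, 2, 15), 30),
    (1, date(2018, 4, 1), 30),
]


def coverage_windows(rows):
    """Return (customer, payment_date, days_paid, start, end) per payment."""
    result = []
    prev_end = {}  # last coverage end date seen for each customer
    for customer, paid_on, days in sorted(rows, key=lambda r: (r[0], r[1])):
        last_end = prev_end.get(customer)
        if last_end is not None and last_end >= paid_on:
            # Paid early: the new window starts the day after the old one ends.
            start = last_end + timedelta(days=1)
        else:
            # First payment, or coverage had lapsed: start on the payment date.
            start = paid_on
        end = start + timedelta(days=days - 1)  # the start day is a covered day
        prev_end[customer] = end
        result.append((customer, paid_on, days, start, end))
    return result


windows = coverage_windows(payments)
for row in windows:
    print(row[1], row[2], row[3], row[4])
```

The loop carries the previous end date forward in the same way the recursive member reads r.Coverage_End_Date, which is exactly the piece LAG alone cannot supply.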

min(count(*)) over... behavior?

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
On each departments there are workers.
My Data looks like this :
select * from data
Ok. Looking at this query :
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelId , we are counting visible rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where did those numbers come from? I don't understand how adding min caused that result. Min of what?
Can someone please explain how's the calculation being made?
fiddle
The 2 statements are very different. The first query counts the rows after the grouping and then applies the PARTITION. So, for example, with hotel 1 there is 1 row returned (as all rows for Hotel 1 have the same department, A), and so the COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has 2 departments, 'B' and 'C', and hence returns 2.
For your second query, you first have the COUNT(*), which is not within the OVER clause. That means it counts all the rows within each group specified in your query: GROUP BY hotelid, departmentid. For Hotel 1, there are 4 rows for department A, hence 4. Then you take the minimum of those group counts across the hotel; with only one group, that is unsurprisingly 4. Each of the other hotels has at least one hotel/department group containing only 1 row, so the minimum for those hotels is 1.
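The order of evaluation can be made explicit by splitting it into two steps: group first, then run the window functions over the grouped rows. The sketch below does that with Python's sqlite3 module. The question's data is not reproduced in the text, so the rows here are made up to match the description (hotel 1 with 4 workers in department A, hotel 2 with departments B and C), and the column names worker and workers are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (hotelid INTEGER, departmentid TEXT, worker TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, "A", "w1"), (1, "A", "w2"), (1, "A", "w3"), (1, "A", "w4"),
        (2, "B", "w5"),
        (2, "C", "w6"), (2, "C", "w7"),
    ],
)

query = """
WITH grouped AS (
    -- step 1: the GROUP BY and the plain COUNT(*)
    SELECT hotelid, departmentid, COUNT(*) AS workers
    FROM data
    GROUP BY hotelid, departmentid
)
-- step 2: window functions see only the grouped rows
SELECT hotelid, departmentid,
       COUNT(*)     OVER (PARTITION BY hotelid) AS cnt,
       MIN(workers) OVER (PARTITION BY hotelid) AS min_cnt
FROM grouped
ORDER BY hotelid, departmentid
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Here cnt is the number of departments per hotel, while min_cnt is the size of the smallest department in that hotel, which is the distinction between the two queries in the question.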

SQL to identify initial dates for a product that changes from active to cancelled status

I have a table that records the following items:
product_id
product_status
date
Products can exist in the following product statuses: pending, active, or canceled. Only one status can exist per date per product code. A status and product code is inserted for each and every day a product exists.
Utilizing SQL I'd like to be able to identify the initial cancellation dates for a product that cancels more than once in a given time frame.
i.e. if a product is active for 3 days and then cancels for 3 days and then is active again for 3 days and then cancels again for another 3 days.
I'd like to be able to identify day 1 of the 2 cancellation periods.
Thought I'd get the crystal ball out for this one. This sounds like a Gaps and Islands question. There's plenty of answers on how to do this on the internet, however, this might be what you're after:
CREATE TABLE #Sample (product_id int,
product_status varchar(10),
[date] date); --blargh
INSERT INTO #Sample
VALUES (1,'active', '20170101'),
(1,'active', '20170102'),
(1,'active', '20170103'),
(1,'cancelled', '20170104'),
(1,'cancelled', '20170105'),
(1,'cancelled', '20170106'),
(1,'active', '20170107'),
(1,'pending', '20170108'),
(1,'active', '20170109'),
(1,'cancelled', '20170110'),
(2,'pending', '20170101'),
(2,'active', '20170102'),
(2,'cancelled', '20170103'),
(2,'cancelled', '20170104');
GO
SELECT *
FROM #Sample;
WITH Groups AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY product_id
ORDER BY [date]) -
ROW_NUMBER() OVER (PARTITION BY product_id, product_status
ORDER BY [date]) AS Grp
FROM #Sample)
SELECT product_id, MIN([date]) AS cancellation_start
FROM Groups
WHERE product_status = 'cancelled'
GROUP BY Grp, product_id
ORDER BY product_id, cancellation_start;
GO
DROP TABLE #Sample;
If not, then see Patrick Artnet's comment.

Updating duplicate records so they are filtered

I've found that our website to ERP integration tool will duplicate inserts if there is an error during the sync. Until the error is resolved, the records will duplicate every time the sync retries, which is usually every 5 minutes.
Trying to find an effective way to update duplicate records so that when queried for a view that the duplicates are filtered. The challenge I am having is that a duplicate will have some columns that are different.
For example, looking at the SalesOrderDetail table, an order had 120 line items. However, because of a sync issue, each line was duplicated.
I've tried using the following to test for the past month:
WITH cte AS (
SELECT SOHD.[salesorderno], [itemcode],[CommentText], unitofmeasure, itemcodedesc, quantityorderedoriginal, quantityshipped,
row_number() OVER(PARTITION BY SOHD.[salesorderno], [itemcode], unitofmeasure, itemcodedesc, quantityorderedoriginal, quantityshipped ORDER BY SOHD.[Linekey] desc) AS [rn]
FROM [dbo].[SO_SalesOrderHistoryDetail] SOHD
inner join [dbo].[SO_SalesOrderHistoryHeader] SOHH on SOHH.Salesorderno = SOHD.Salesorderno
Where year(orderdate) = '2016'
and month(orderdate) = '08'
--Only Look at completed orders, ignore quotes & deleted orders
and SOHH.Orderstatus in ('C')
--Only looks for item lines where something did not ship (prevent removing a "good" entry)
and [quantityshipped] = '0'
)
Select *
from cte
However, I keep finding issues with using this because if I were to run an update command with this, it will update some records it shouldn't. And if I add some of the columns for it to be more specific, it wouldn't edit some columns that it needs to.
For example, if I don't add
where rn >1 then I inadvertently edit records that are not duplicates
but if I add
where rn >1 then the 1st set of duplicate records won't be updated.
Feeling stuck, but not sure what to do.
Adding more info from comment section. I think maybe my cte statement to find the duplicates and an update command might have to be somewhat different. Example Data:
Order# Itemcode CommentText UnitofMeasure itemcodedesc qtyordered qtyshipped
12345 abc null each candy 5 0
12345 abc null each candy 5 5
12345 xyz null case slinky 25 0
12345 xyz null case slinky 25 25
So they are not duplicates if I include the qtyshipped column, but what I want to do is update only the records where qtyshipped = 0. The update I plan to do is to set commenttext = 'delete'.
Change ROW_NUMBER to COUNT() Over() window function
WITH cte
AS (SELECT SOHD.[salesorderno],
[itemcode],
[commenttext],
unitofmeasure,
itemcodedesc,
quantityorderedoriginal,
quantityshipped,
Count(1)
OVER(partition BY SOHD.[salesorderno], [itemcode], unitofmeasure,itemcodedesc) AS [rn]
FROM [dbo].[so_salesorderhistorydetail] SOHD
..........)
SELECT *
FROM cte
WHERE rn > 1
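A rough sketch of how the count could then drive the update described in the question, using Python's sqlite3 module and the example rows from above. The table name detail, the alias dup_count, and the extra non-duplicated "solo" line are assumptions added for illustration; combining the count with quantityshipped = 0 is my reading of the stated goal, not something the elided part of the query confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE detail (
           salesorderno TEXT, itemcode TEXT, commenttext TEXT,
           unitofmeasure TEXT, itemcodedesc TEXT,
           quantityordered INTEGER, quantityshipped INTEGER)"""
)
conn.executemany(
    "INSERT INTO detail VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("12345", "abc", None, "each", "candy", 5, 0),
        ("12345", "abc", None, "each", "candy", 5, 5),
        ("12345", "xyz", None, "case", "slinky", 25, 0),
        ("12345", "xyz", None, "case", "slinky", 25, 25),
        ("12345", "solo", None, "each", "yo-yo", 3, 0),  # hypothetical unique line
    ],
)

# Every row in a duplicated group gets dup_count > 1, unlike ROW_NUMBER,
# which would number the first row of each group 1.
flagged = conn.execute(
    """SELECT rid FROM (
           SELECT rowid AS rid, quantityshipped,
                  COUNT(*) OVER (PARTITION BY salesorderno, itemcode,
                                 unitofmeasure, itemcodedesc) AS dup_count
           FROM detail)
       WHERE dup_count > 1 AND quantityshipped = 0"""
).fetchall()

conn.executemany("UPDATE detail SET commenttext = 'delete' WHERE rowid = ?", flagged)

after = conn.execute(
    "SELECT itemcode, quantityshipped, commenttext FROM detail ORDER BY rowid"
).fetchall()
for row in after:
    print(row)
```

Only the unshipped copy of each duplicated line is marked; the shipped copies and the line with no duplicate keep a NULL comment.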
