T-SQL - Getting most recent date and most recent future date - sql-server

Assume the table of records below
ID Name AppointmentDate
-- -------- ---------------
1 Bob 1/1/2010
1 Bob 5/1/2010
2 Henry 5/1/2010
2 Henry 8/1/2011
3 John 8/1/2011
3 John 12/1/2011
I want to retrieve the most recent appointment date by person. So I need a query that will give the following result set.
1 Bob 5/1/2010 (5/1/2010 is most recent)
2 Henry 8/1/2011 (8/1/2011 is most recent)
3 John 8/1/2011 (has 2 future dates but 8/1/2011 is most recent)
Thanks!

Assuming that where you say "most recent" you mean "closest", as in "stored date is the fewest days away from the current date and we don't care if it's before or after the current date", then this should do it (trivial debugging might be required):
SELECT ID, Name, AppointmentDate
from (select
ID
,Name
,AppointmentDate
,row_number() over (partition by ID order by abs(datediff(dd, AppointmentDate, getdate()))) Ranking
from MyTable) xx
where Ranking = 1
This uses the row_number() function, available in SQL Server 2005 and up. The subquery "orders" the data as per the specifications, and the main query picks the best fit.
Note also that:
The search is based on the current date
We're only calculating difference in days, time (hours, minutes, etc.) is ignored
If two dates are equidistant (say, 2 days before and 2 days after), we pick one arbitrarily
All of which could be adjusted based on your final requirements.
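As an illustration of the last point, here is a minimal sketch of one way to make the tie-break deterministic. The table variable @Appointments, the fixed @Today value (used instead of getdate() so the result is repeatable) and the rule "prefer the later date on a tie" are assumptions for illustration, not part of the answer above. The syntax assumes SQL Server 2008 or later.

```sql
-- Assumed fixed "current" date so the result does not change from day to day
DECLARE @Today date = '2011-06-01';

-- Sample data from the question, held in a hypothetical table variable
DECLARE @Appointments TABLE (ID int, Name varchar(20), AppointmentDate date);
INSERT INTO @Appointments (ID, Name, AppointmentDate) VALUES
    (1, 'Bob',   '2010-01-01'),
    (1, 'Bob',   '2010-05-01'),
    (2, 'Henry', '2010-05-01'),
    (2, 'Henry', '2011-08-01'),
    (3, 'John',  '2011-08-01'),
    (3, 'John',  '2011-12-01');

SELECT ID, Name, AppointmentDate
FROM (
    SELECT ID, Name, AppointmentDate,
           ROW_NUMBER() OVER (
               PARTITION BY ID
               ORDER BY ABS(DATEDIFF(day, AppointmentDate, @Today)),  -- closest date first
                        AppointmentDate DESC                          -- assumed rule: on a tie, prefer the later date
           ) AS Ranking
    FROM @Appointments
) AS ranked
WHERE Ranking = 1;
```

With @Today set to 2011-06-01 this picks 2010-05-01 for Bob and 2011-08-01 for both Henry and John, which matches the result set in the question.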

(Phillip beat me to the punch, and windowing functions are an excellent choice. Here's an alternative approach:)
Assuming I correctly understand your requirement as getting the date closest to the present date, whether in the past or future, consider this query:
SELECT t.Name, t.AppointmentDate
FROM
(
SELECT Name, AppointmentDate, ABS(DATEDIFF(d, GETDATE(), AppointmentDate)) AS Distance
FROM MyTable -- replace with the real table name (TABLE itself is a reserved word)
) t
JOIN
(
SELECT Name, MIN(ABS(DATEDIFF(d, GETDATE(), AppointmentDate))) AS MinDistance
FROM MyTable
GROUP BY Name
) d ON t.Name = d.Name AND t.Distance = d.MinDistance


How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month so if two purchases are made in the same month it shouldn't be shown.
The dates are between 2016-10-01 and 2018-09-30, and the result should be ordered by order_purchase_timestamp.
An example
customer_id order_purchase_timestamp
----------- ------------------------
1           2016-09-04
2           2016-09-05
3           2016-09-05
3           2016-09-15
1           2016-10-04
to
customer_id first_purchase age_by_month order_purchase_timestamp
----------- -------------- ------------ ------------------------
1           2016-09        0            2016-09-04
2           2016-09        0            2016-09-05
3           2016-09        0            2016-09-05
1           2016-09        1            2016-10-04
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.
The following approach is designed to be relatively easy to understand. There are other ways (e.g., windowed functions) that may be marginally more efficient; but this makes it easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that gets most people but is good for you - when comparing months, it ignores the day component. For example, there is 0 difference in months between 1 Jan and 31 Jan. On the other hand, there is a difference of 1 month between 31 Jan and 1 Feb. However, I think this is actually what you want!
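A quick way to see this boundary-counting behaviour is to run DATEDIFF on literal dates. This is only a small sketch; the dates and column aliases are made up for illustration.

```sql
-- DATEDIFF(month, ...) counts month boundaries crossed, not elapsed 30-day periods
SELECT DATEDIFF(month, '20160101', '20160131') AS jan_1_to_jan_31,  -- 0: both dates are in January
       DATEDIFF(month, '20160131', '20160201') AS jan_31_to_feb_1;  -- 1: one month boundary is crossed
```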
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
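EOMONTH is only available from SQL Server 2012 onwards. As a hedged sketch of an alternative grouping key for older versions, the first day of the month can be derived with DATEADD and DATEDIFF. The table variable @orders and the column aliases below are assumptions for illustration only.

```sql
-- Hypothetical sample rows: two orders in September 2016, one in October 2016
DECLARE @orders TABLE (customer_id int, order_purchase_timestamp date);
INSERT INTO @orders (customer_id, order_purchase_timestamp) VALUES
    (3, '2016-09-05'),
    (3, '2016-09-15'),
    (1, '2016-10-04');

SELECT customer_id,
       order_purchase_timestamp,
       -- last day of the month (SQL Server 2012+)
       EOMONTH(order_purchase_timestamp) AS month_end,
       -- first day of the month: whole months since 1900-01-01, added back to 1900-01-01
       DATEADD(month, DATEDIFF(month, '19000101', order_purchase_timestamp), '19000101') AS month_start
FROM @orders;
```

Both expressions return one value per calendar month (the two September rows share the same month_end and the same month_start), so either can serve as the GROUP BY key.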
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The below uses window functions to calculate both the customer's earliest purchase, and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note, however, that this has one difference in the results. If you have 2 orders in a month, and your lowest date filter falls between the two (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10), then the row won't be included, as the earliest purchase in the month is outside the filter range.
Also beware, with both of these, of what type of date or datetime field you are using. If you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001': the BETWEEN version stops at midnight at the start of 30 September, so orders placed later that day are excluded.
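The difference between the two filters can be checked with a few datetime rows around the upper bound. This is a minimal sketch; the table variable @orders, the order_id column and the sample timestamps are invented for the demonstration.

```sql
-- Three hypothetical orders around the end of the reporting period
DECLARE @orders TABLE (order_id int, order_purchase_timestamp datetime);
INSERT INTO @orders (order_id, order_purchase_timestamp) VALUES
    (1, '2018-09-30T00:00:00'),  -- exactly midnight at the start of 30 Sept
    (2, '2018-09-30T15:45:00'),  -- later on 30 Sept
    (3, '2018-10-01T00:00:00');  -- first instant of October, outside the period

-- BETWEEN includes only order 1: the upper bound is 2018-09-30 00:00:00
SELECT COUNT(*) AS between_count
FROM @orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';

-- The half-open range includes orders 1 and 2 and still excludes order 3
SELECT COUNT(*) AS half_open_count
FROM @orders
WHERE order_purchase_timestamp >= '20161001'
  AND order_purchase_timestamp <  '20181001';
```

The first query counts 1 row and the second counts 2, which is why the half-open form is usually the safer choice for datetime columns.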
Here is a short query that achieves all you want (descriptions of methods used are inline):
declare @test table (
customer_id int,
order_purchase_timestamp date
)
-- some test data
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
customer_id,
-- takes care of correct display of first_purchase
format(first_purchase, 'yyyy-MM') first_purchase,
-- used to get the difference in months
datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
order_purchase_timestamp
from (
select
*,
-- window function used to find min value for given column within group
-- for each row
min(order_purchase_timestamp) over (partition by customer_id) first_purchase
from @test
) a

T-SQL recursion, date shifting based on previous iteration

I have a data set that includes a customer, payment date, and the number of days they have paid for. I need to calculate the coverage start/end dates that each payment covers. This is difficult when a payment is made before the current coverage period ends.
The best way I've come up with to think about this would be a month to month cell phone plan where the customer may pay for a specified number of days at any point during a given month. The next covered period should always start the day after the previous covered period expires.
Here is the code sample using a temp table.
CREATE TABLE #Payments
(Customer_ID INTEGER,
Payment_Date DATE,
Days_Paid INTEGER);
INSERT INTO #Payments
VALUES (1,'2018-01-01',30);
INSERT INTO #Payments
VALUES (1,'2018-01-29',20);
INSERT INTO #Payments
VALUES (1,'2018-02-15',30);
INSERT INTO #Payments
VALUES (1,'2018-04-01',30);
I need to get the coverage start/end dates back.
The initial payment is made on 2018-01-01 and they paid for 30 days. That means they are covered until 2018-01-30 (Payment_Date + Days_Paid - 1, since the payment date is included as a covered day). However, they made their next payment on 2018-01-29, so I need to calculate the start date of the next coverage window, which in this case would be the previous Payment_Date + previous Days_Paid. In this case, coverage window 2 starts on 2018-01-31 and extends through 2018-02-19, since they only paid for 20 days on Payment_Date 2018-01-29.
The expected output is:
Customer_ID | Payment_Date | Days_Paid | Coverage_Start_Date | Coverage_End_Date
--------------------------------------------------------------------------------
1 | '2018-01-01'| 30 | '2018-01-01'| '2018-01-30'
1 | '2018-01-29'| 20 | '2018-01-31'| '2018-02-19'
1 | '2018-02-15'| 30 | '2018-02-20'| '2018-03-21'
1 | '2018-04-01'| 30 | '2018-04-01'| '2018-04-30'
Because the current record's coverage start date will depend on the previous record's coverage end date, I feel like this would be a good candidate for recursion, but I can't figure out how to do it.
I have a way to do this in a while loop, but I would like to complete it using a recursive CTE. I have also thought about simply adding up the Days_Paid and adding that to the first payment's start date, however this only works if a payment is made before the previous coverage has expired. In addition, I need to calculate the coverage start/end dates for each Payment_Date.
Finally, using LAG/LEAD functions doesn't appear to work because it does not consider the result of the previous iteration, only the current value of the previous record. Using LAG/LEAD, you get the correct answer for the 2nd payment record, but not the third.
Is there a way to do this with a recursive CTE?
NOTE: This is not a recursive solution, but it is set-based vs. your loop solution.
While trying to solve this recursively it hit me that this is essentially a "running totals" problem, and can be easily solved with window functions.
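WITH runningTotal AS
(
SELECT p.*,
SUM(Days_Paid) OVER(ORDER BY p.Payment_Date) AS runningTotalDays,
MIN(Payment_Date) OVER(ORDER BY p.Payment_Date) AS startDate
FROM #Payments p
)
SELECT r.Customer_ID, r.Payment_Date, r.Days_Paid,
-- previous end date + 1 = startDate + previous running total; the first row has no previous row
COALESCE(DATEADD(DAY, LAG(runningTotalDays) OVER(ORDER BY r.Payment_Date), startDate), startDate) AS Coverage_Start_Date,
-- the payment date itself counts as a covered day, hence the -1
DATEADD(DAY, runningTotalDays - 1, startDate) AS Coverage_End_Date
FROM runningTotal r
Each end date is the first payment date plus the "running total" of Days_Paid up to and including the current payment, minus one day because the payment date itself is covered. Using LAG to get the previous record's end date + 1 gets you the start date. The COALESCE handles the first record. For more than a single customer, you can PARTITION BY Customer_ID. Note that this assumes coverage never lapses: for the sample data it matches the first three expected rows, but the fourth payment (2018-04-01, made after the previous coverage ended on 2018-03-21) would be given a start date of 2018-03-22 rather than 2018-04-01.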
So of course, right after posting this I came across a similar question that was already answered.
Here's the link: Recursively retrieve LAG() value of previous record
Based on that solution, I was able to construct the following solution to my own question.
The key here was adding the "prep_data" CTE which made the recursion problem much easier.
;WITH prep_data AS
(SELECT Customer_ID,
ROW_NUMBER() OVER (PARTITION BY Customer_ID ORDER BY Payment_Date) AS payment_seq_num,
Payment_Date,
Days_Paid,
Payment_Date as Coverage_Start_Date,
DATEADD(DAY,Days_Paid-1,Payment_Date) AS Coverage_End_Date
FROM #Payments),
recursion AS
(SELECT Customer_ID,
payment_seq_num,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM prep_data
WHERE payment_seq_num = 1
UNION ALL
SELECT r.Customer_ID,
p.payment_seq_num,
p.Payment_Date,
p.Days_Paid,
CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END AS Coverage_Start_Date,
DATEADD(DAY,p.Days_Paid-1,CASE WHEN r.Coverage_End_Date >= p.Payment_Date THEN DATEADD(DAY,1,r.Coverage_End_Date) ELSE p.Payment_Date END) AS Coverage_End_Date
FROM recursion r
JOIN prep_data p ON r.Customer_ID = p.Customer_ID AND r.payment_seq_num + 1 = p.payment_seq_num
)
SELECT Customer_ID,
Payment_Date,
Days_Paid,
Coverage_Start_Date,
Coverage_End_Date
FROM recursion
ORDER BY payment_seq_num;
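One practical caveat with any recursive CTE in SQL Server: by default a statement stops with an error after 100 levels of recursion, so a customer with more than about a hundred payments would hit that limit. The sketch below is not the payment query itself, only a self-contained illustration of the MAXRECURSION query hint; the CTE name numbers and the limits 150 and 200 are arbitrary choices for the example.

```sql
-- Counts from 1 to 150 recursively; this needs 149 recursion levels,
-- which exceeds the default limit of 100
WITH numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 150
)
SELECT COUNT(*) AS row_count
FROM numbers
OPTION (MAXRECURSION 200);  -- raise the limit; 0 would remove it entirely
```

The same OPTION clause can be appended after the ORDER BY of the final SELECT in a query like the one above.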

Delete latest entry in SQL Server without using datetime or ID

I have a basic SQL Server delete script that goes:
Delete from tableX
where colA = ? and colB = ?;
In tableX, I do not have any columns indicating sequential IDs or timestamps; just varchar. I want to delete the latest entry that was inserted, and I do not have access to the row number from the insert script. TOP is not an option because it's random. Also, this particular table does not have a primary key, and it's not a matter of poor design. Is there any way I can do this? I recall MySQL being able to call something like max(row_number) and also something along the lines of limit one.
ROW_NUMBER exists in SQL Server, too, but it must be used with an OVER (order_by_clause). So... in your case it's impossible for you unless you come up with another sorting algo.
MSDN
Edit: (Examples for George from MSDN ... I'm afraid his company has a Firewall rule that blocks MSDN)
SQL-Code
USE AdventureWorks2012;
GO
SELECT ROW_NUMBER() OVER(ORDER BY SalesYTD DESC) AS Row,
FirstName, LastName, ROUND(SalesYTD,2,1) AS "Sales YTD"
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0;
Output
Row FirstName LastName SalesYTD
--- ----------- ---------------------- -----------------
1 Linda Mitchell 4251368.54
2 Jae Pak 4116871.22
3 Michael Blythe 3763178.17
4 Jillian Carson 3189418.36
5 Ranjit Varkey Chudukatil 3121616.32
6 José Saraiva 2604540.71
7 Shu Ito 2458535.61
8 Tsvi Reiter 2315185.61
9 Rachel Valdez 1827066.71
10 Tete Mensa-Annan 1576562.19
11 David Campbell 1573012.93
12 Garrett Vargas 1453719.46
13 Lynn Tsoflias 1421810.92
14 Pamela Ansman-Wolfe 1352577.13
Returning a subset of rows
USE AdventureWorks2012;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS RowNumber
FROM Sales.SalesOrderHeader
)
SELECT SalesOrderID, OrderDate, RowNumber
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
Using ROW_NUMBER() with PARTITION
USE AdventureWorks2012;
GO
SELECT FirstName, LastName, TerritoryName, ROUND(SalesYTD,2,1),
ROW_NUMBER() OVER(PARTITION BY TerritoryName ORDER BY SalesYTD DESC) AS Row
FROM Sales.vSalesPerson
WHERE TerritoryName IS NOT NULL AND SalesYTD <> 0
ORDER BY TerritoryName;
Output
FirstName LastName TerritoryName SalesYTD Row
--------- -------------------- ------------------ ------------ ---
Lynn Tsoflias Australia 1421810.92 1
José Saraiva Canada 2604540.71 1
Garrett Vargas Canada 1453719.46 2
Jillian Carson Central 3189418.36 1
Ranjit Varkey Chudukatil France 3121616.32 1
Rachel Valdez Germany 1827066.71 1
Michael Blythe Northeast 3763178.17 1
Tete Mensa-Annan Northwest 1576562.19 1
David Campbell Northwest 1573012.93 2
Pamela Ansman-Wolfe Northwest 1352577.13 3
Tsvi Reiter Southeast 2315185.61 1
Linda Mitchell Southwest 4251368.54 1
Shu Ito Southwest 2458535.61 2
Jae Pak United Kingdom 4116871.22 1
Your current table design does not allow you to determine the latest entry. You have no field to sort on to indicate which record was added last.
You need to redesign or pull that information from the audit tables. If you have a database without audit tables, you might have to find a tool to read the transaction logs, and it will be a very time-consuming and expensive process. Or, if you know the date the records you want to remove were added, you could possibly use a backup from just before this happened to find the records that were added. Just be aware that you might be looking at records changed after this date that you want to keep.
If you need to do this on a regular basis instead of one-time to fix some bad data, then you need to properly design your database to include an identity field and possibly a dateupdated field (maintained through a trigger) or audit tables. (In my opinion no database containing information your company is depending on should be without audit tables, one of the many reasons why you should never allow an ORM to design a database, but I digress.) If you need to know the order records were added to a table, it is your responsibility as the developer to create that structure. Databases only store what they are designed to store; if you didn't design it in, then it is not available easily or at all.
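As a rough sketch of what such a redesign could look like, the example below adds an identity column and an insert timestamp, then deletes only the most recently inserted matching row. The temp table name #tableX_v2 and the columns row_id and created_at are hypothetical; colA and colB come from the question.

```sql
-- Hypothetical redesigned table: row_id records insertion order
CREATE TABLE #tableX_v2 (
    row_id     int IDENTITY(1,1) NOT NULL,
    colA       varchar(50) NOT NULL,
    colB       varchar(50) NOT NULL,
    created_at datetime2   NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Separate statements so the identity values follow the insertion order
INSERT INTO #tableX_v2 (colA, colB) VALUES ('a', 'b');
INSERT INTO #tableX_v2 (colA, colB) VALUES ('a', 'b');
INSERT INTO #tableX_v2 (colA, colB) VALUES ('c', 'd');

-- Delete only the latest row that matches the filter (row_id 2 here)
WITH latest AS (
    SELECT TOP (1) *
    FROM #tableX_v2
    WHERE colA = 'a' AND colB = 'b'
    ORDER BY row_id DESC
)
DELETE FROM latest;

-- The rows with row_id 1 and 3 remain
SELECT row_id, colA, colB FROM #tableX_v2 ORDER BY row_id;

DROP TABLE #tableX_v2;
```

Deleting through the CTE keeps the TOP (1) ... ORDER BY logic in one place, so the "latest" rule is explicit rather than left to whatever order the engine happens to return.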
If (colA +'_'+ colB) cannot be duplicated, try this:
declare @delColumn nvarchar(250)
set @delColumn = (select top 1 DeleteColumn from (
select (colA +'_'+ colB) as DeleteColumn ,
ROW_NUMBER() OVER(ORDER BY colA DESC) as Id from tableX
)b
order by Id desc
)
delete from tableX where (colA +'_'+ colB) = @delColumn

sql server sum column to find out who reached a value first

I asked a question similar to this here:
sql sum a column and also return the last time stamp
It turns out my use case was incorrect though so I need to make an adjustment to my question.
I've got a table in SQL Server with several columns. The relevant ones are:
name
distance
create_date
I have many people identified by name, and every few days they travel a certain distance. For example:
name distance create_date
john 15 09/12/2014
john 20 09/22/2014
alex 10 08/15/2014
alex 12 09/05/2014
alex 20 09/12/2014
john 8 09/30/2014
alex 30 09/14/2014
mike 12 09/10/2014
The query I need has 3 parameters:
@start_date
@end_date
@count
I need a query that, between the two dates, returns for each person the distance traveled. The trick, though, is that for each person I should sum to an amount just past the amount indicated in @count and return the date this was achieved, or, if the person did not pass @count, return the sum and the last date of entry. So for example, if I use the parameters:
@start_date=08/01/2014
@end_date=09/25/2014
@count=35
I would expect the following:
name distance create_date
alex 42 09/12/2014
john 35 09/22/2014
mike 12 09/10/2014
Does someone have an idea for this?
Thank you!
As per my understanding, this is the query for the people who crossed the given count:
select A.name, sum(A.distance), A.create_date
from (select name, distance, create_date
from MyTable -- replace with the real table name
where create_date between @start_date and @end_date) A
group by A.name, A.create_date
having sum(A.distance) > @count
And for those who don't cross it:
select A.name, sum(A.distance), max(A.create_date)
from (select name, distance, create_date
from MyTable
where create_date between @start_date and @end_date) A
group by A.name
having sum(A.distance) < @count
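A different way to approach the question is with a running total per person, which gives both cases in one query. The following is only a sketch under stated assumptions: the table variable @trips and the CTE names running and ranked are invented, "reaching" the count is taken to mean a running total of at least @count (john's expected row is exactly 35), and SUM ... OVER with ORDER BY needs SQL Server 2012 or later.

```sql
DECLARE @start_date date = '2014-08-01',
        @end_date   date = '2014-09-25',
        @count      int  = 35;

-- Sample data from the question in a hypothetical table variable
DECLARE @trips TABLE (name varchar(20), distance int, create_date date);
INSERT INTO @trips (name, distance, create_date) VALUES
    ('john', 15, '2014-09-12'),
    ('john', 20, '2014-09-22'),
    ('alex', 10, '2014-08-15'),
    ('alex', 12, '2014-09-05'),
    ('alex', 20, '2014-09-12'),
    ('john',  8, '2014-09-30'),
    ('alex', 30, '2014-09-14'),
    ('mike', 12, '2014-09-10');

WITH running AS (
    -- running distance per person, in date order, inside the date window
    SELECT name, create_date,
           SUM(distance) OVER (PARTITION BY name
                               ORDER BY create_date
                               ROWS UNBOUNDED PRECEDING) AS running_distance
    FROM @trips
    WHERE create_date >= @start_date AND create_date <= @end_date
),
ranked AS (
    SELECT name, create_date, running_distance,
           ROW_NUMBER() OVER (
               PARTITION BY name
               ORDER BY CASE WHEN running_distance >= @count THEN 0 ELSE 1 END,       -- rows that reached the count first
                        CASE WHEN running_distance >= @count THEN create_date END,    -- earliest date that reached it
                        create_date DESC                                              -- otherwise the last entry
           ) AS rn
    FROM running
)
SELECT name, running_distance AS distance, create_date
FROM ranked
WHERE rn = 1
ORDER BY name;
```

For the sample data and parameters this returns alex 42 on 2014-09-12, john 35 on 2014-09-22 and mike 12 on 2014-09-10, which is the result the question asks for.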

Is it possible to do this query without a temp table?

If I have a table of data like this
tableid author book pubdate
1 1 The Hobbit 1923
2 1 Fellowship 1925
3 2 Foundation Trilogy 1947
4 2 I Robot 1942
5 3 Frankenstein 1889
6 3 Frankenstein 2 1894
Is there a query that would get me the following without having to use a temp table, table variable or cte?
tableid author book pubdate
1 1 The Hobbit 1923
4 2 I Robot 1942
5 3 Frankenstein 1889
So I want min(ranking) grouping by person and ending up with book for that min(ranking) value.
OK, the data I gave initially was flawed. Instead of a ranking column I'll have a date column. I need the book published earliest by author.
Missed that a CTE was not valid (but not sure why). How about as a subquery?
SELECT tableid, author, book, pubdate
FROM
(
SELECT
tableid, author, book, pubdate,
rn = ROW_NUMBER() OVER
(
PARTITION BY author
ORDER BY pubdate
)
FROM dbo.src -- replace this with the real table name
) AS x
WHERE rn = 1
ORDER BY tableid;
Original:
;WITH x AS
(
SELECT
tableid, author, book, pubdate,
rn = ROW_NUMBER() OVER
(
PARTITION BY author
ORDER BY pubdate
)
FROM dbo.src -- replace this with the real table name
)
SELECT tableid, author, book, pubdate
FROM x
WHERE rn = 1
ORDER BY tableid;
If you want to return multiple rows when there is a tie for earliest book, use RANK() in place of ROW_NUMBER(). In the case of a tie and you only want to return one row, you need to add additional tie breaker columns to the ORDER BY within OVER().
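To see the difference between the two functions on a tie, the sketch below adds one made-up row (tableid 7, a second 1889 title for author 3) to the question's data. The table variable @books, the extra row and the column aliases rn and rnk are assumptions for illustration.

```sql
DECLARE @books TABLE (tableid int, author int, book varchar(50), pubdate int);
INSERT INTO @books (tableid, author, book, pubdate) VALUES
    (1, 1, 'The Hobbit',            1923),
    (2, 1, 'Fellowship',            1925),
    (3, 2, 'Foundation Trilogy',    1947),
    (4, 2, 'I Robot',               1942),
    (5, 3, 'Frankenstein',          1889),
    (6, 3, 'Frankenstein 2',        1894),
    (7, 3, 'Hypothetical tie book', 1889);  -- invented row that ties with tableid 5

SELECT tableid, author, book, pubdate,
       -- tableid is used as the tie breaker, so exactly one row per author gets rn = 1
       ROW_NUMBER() OVER (PARTITION BY author ORDER BY pubdate, tableid) AS rn,
       -- RANK gives every row tied for the earliest pubdate the value 1
       RANK()       OVER (PARTITION BY author ORDER BY pubdate)          AS rnk
FROM @books
ORDER BY author, pubdate, tableid;
```

For author 3, filtering on rn = 1 keeps only tableid 5, while filtering on rnk = 1 keeps both tableid 5 and tableid 7.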
select * from MyTable where ranking = 1
EDIT
Are you looking for this query to work in situations where there is no value of rank=1 for a given table and person? In that case, try this (the window function has to be computed in a derived table, because its alias cannot be referenced in the WHERE clause of the same query):
select * from (
select *, RANK() OVER (Partition By tableid, personid order by ranking asc) as sqlrank
from MyTable
) ranked
where sqlrank = 1
EDIT OF MY EDIT:
This will work for the earliest pub date:
select * from (
select *, RANK() OVER (Partition By author order by pubdate asc) as sqlrank
from MyTable
) ranked
where sqlrank = 1
SELECT tableid,author,book,pubdate FROM my_table as my_table1 WHERE pubdate =
(SELECT MIN(pubdate) FROM my_table as my_table2 WHERE my_table1.author = my_table2.author);
WITH min_table as
(
SELECT author, min(pubdate) as min_pubdate
FROM MyTable -- replace with the real table name
GROUP BY author
)
SELECT t.tableid, t.author, t.book, t.pubdate
FROM MyTable t INNER JOIN min_table mt on t.author = mt.author and t.pubdate = mt.min_pubdate
Your sample data may be overly simplistic. You talk about 'min(ranking)', but for all your examples, the minimum ranking for each personid is 1. So the answers you have received so far short-circuit the issue and simply select for ranking = 1. You don't state it in your "requirements", but it sounds like the minimum rank value for any particular personid may not necessarily be 1, correct? Also, you don't mention if a person can rank two or more books with the same minimum rank, so answers will be incomplete due to this missing requirement.
If my psychic abilities are accurate, then you might want to try something like this (untested obviously):
SELECT UNKTBL.tableid, UNKTBL.personid, UNKTBL.book, UNKTBL.ranking
FROM UnknownTable UNKTBL INNER JOIN
(SELECT personid, min(ranking) as ranking
FROM UnknownTable GROUP BY personid) MINRANK
ON UNKTBL.personid = MINRANK.personid AND UNKTBL.ranking = MINRANK.ranking
This will return all the rows for each person where the ranking value is the minimum value for that person. So if the minimum ranking for person 6 is 2, and there are two books for that person with that ranking, then both book rows will be returned.
If these are not, in fact your requirements, then please edit your question with more details/example data. Thanks!
Edit
Based on your change in requirements/example data, the SQL above should still work, if you change the column names appropriately. You still don't mention if an author can have two books in the same year (i.e. a prolific author such as Stephen King), so the SQL I have here will give multiple rows if the same author publishes two books in the same year, and that year is the earliest year of publication for that author.
SELECT * FROM my_table WHERE ranking = 1
ZING!
Seriously though I don't follow your question - can you provide a more elaborate or complicated example? I think I'm missing something obvious.
