T-SQL - Finding records with chronological gaps - sql-server

This is my first post here. I'm still a novice SQL user at this point though I've been using it for several years now. I am trying to find a solution to the following problem and am looking for some advice, as simple as possible, please.
I have this 'recordTable' with the following columns related to transactions: 'personID', 'recordID', 'item', 'txDate' and 'daySupply'. The recordID is the primary key. Almost every personID should have many distinct recordIDs with distinct txDates.
My focus is on one particular 'item' for all of 2017. It's expected that once the item's daySupply has elapsed for a recordID, we would see a newer recordID for that person with a more recent txDate somewhere between five days before and five days after the end of the daySupply.
What I'm trying to uncover is the number of distinct recordIDs where there wasn't an expected new recordID during this ten-day window. I think this is probably very simple to solve but I am having a lot of difficulty trying to create a query for it, let alone explain it to someone.
My thought thus far is to create two temp tables. The first temp table stores all of the records associated with the desired items and I'm just storing the personID, recordID and txDate columns. The second temp table has the personID, recordID and the two derived columns from the txDate and daySupply; these would represent the five days before and five days after.
I am trying to find some way to determine the number of recordIDs from the first table that don't have expected refills for that personID in the second. I thought a simple EXCEPT would do this, but I don't think there's any way of getting around a recursive-type statement to answer this, and I have never gotten comfortable with recursive queries.
I searched Stack Overflow and elsewhere but couldn't come up with an answer to this one. I would really appreciate some help from some more clever data folks. Here is the code so far. Thanks everyone!
CREATE TABLE #temp1 (personID VARCHAR(20), recordID VARCHAR(10), txDate DATE)
CREATE TABLE #temp2 (personID VARCHAR(20), recordID VARCHAR(10), startDate DATE, endDate DATE)
-- all 2017 records for the item of interest
INSERT INTO #temp1
SELECT [personID], [recordID], txDate
FROM recordTable
WHERE item = 'desiredItem'
AND txDate > '12/31/16'
AND txDate < '1/1/18';
-- expected refill window: five days either side of txDate + daySupply
-- (DATEADD is used because a DATE column cannot be added to an integer with +)
INSERT INTO #temp2
SELECT [personID], [recordID], DATEADD(day, daySupply - 5, txDate), DATEADD(day, daySupply + 5, txDate)
FROM recordTable
WHERE item = 'desiredItem'
AND txDate > '12/31/16'
AND txDate < '1/1/18';

I agree with mypetlion that you could have been more concise with your question, but I think I can figure out what you are asking.
SQL Window Functions to the rescue!
Here's the basic idea...
CREATE TABLE #fills(
personid INT,
recordid INT,
item NVARCHAR(MAX),
filldate DATE,
dayssupply INT
);
INSERT #fills
VALUES (1, 1, 'item', '1/1/2018', 30),
(1, 2, 'item', '2/1/2018', 30),
(1, 3, 'item', '3/1/2018', 30),
(1, 4, 'item', '5/1/2018', 30),
(1, 5, 'item', '6/1/2018', 30)
;
SELECT *,
ABS(
DATEDIFF(
DAY,
LAG(DATEADD(DAY, dayssupply, filldate)) OVER (PARTITION BY personid, item ORDER BY filldate),
filldate
)
) AS gap
FROM #fills
ORDER BY filldate;
... outputs ...
+----------+----------+------+------------+------------+------+
| personid | recordid | item | filldate | dayssupply | gap |
+----------+----------+------+------------+------------+------+
| 1 | 1 | item | 2018-01-01 | 30 | NULL |
| 1 | 2 | item | 2018-02-01 | 30 | 1 |
| 1 | 3 | item | 2018-03-01 | 30 | 2 |
| 1 | 4 | item | 2018-05-01 | 30 | 31 |
| 1 | 5 | item | 2018-06-01 | 30 | 1 |
+----------+----------+------+------------+------------+------+
You can insert the results into a temp table and pull out only the ones you want (gap > 5), or use the query above as a CTE and pull out the results without the temp table.
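The CTE route mentioned above can be sketched as follows. This is only an illustration under some assumptions: the table variable @fills and the column nextfilldate are made-up names, only the immediately following fill is compared, and a record with no later fill at all is treated as "not refilled". It uses LEAD instead of LAG so that the flag lands on the record that was not refilled rather than on the late refill.

```sql
-- Hypothetical sample data mirroring the #fills rows above
DECLARE @fills TABLE (personid INT, recordid INT, item NVARCHAR(50), filldate DATE, dayssupply INT);
INSERT INTO @fills VALUES
(1, 1, 'item', '2018-01-01', 30),
(1, 2, 'item', '2018-02-01', 30),
(1, 3, 'item', '2018-03-01', 30),
(1, 4, 'item', '2018-05-01', 30),
(1, 5, 'item', '2018-06-01', 30);

WITH nextfill AS (
    SELECT *,
           -- date of the following fill for the same person and item, if any
           LEAD(filldate) OVER (PARTITION BY personid, item ORDER BY filldate) AS nextfilldate
    FROM @fills
)
SELECT personid, recordid, filldate, dayssupply, nextfilldate
FROM nextfill
WHERE nextfilldate IS NULL  -- assumption: no later fill counts as a gap
   OR ABS(DATEDIFF(DAY, DATEADD(DAY, dayssupply, filldate), nextfilldate)) > 5
ORDER BY filldate;
```

With the sample rows this returns recordid 3 (next fill 31 days after the supply ran out) and recordid 5 (no later fill).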

This could be stated as follows: "Given a set of orders, return a subset for which there is no order within +/- 5 days of the expected resupply date (defined as txDate + daySupply)."
This can be solved simply with NOT EXISTS. Define the range of orders you wish to examine, and this query will find the subset of those orders for which there is no resupply order (NOT EXISTS) within 5 days of either side of the expected resupply date (txDate + daySupply).
SELECT
gappedOrder.personID
, gappedOrder.recordID
, gappedOrder.item
, gappedOrder.txDate
, gappedOrder.daySupply
FROM
recordTable AS gappedOrder
WHERE
gappedOrder.item = 'desiredItem'
AND gappedOrder.txDate > '12/31/16'
AND gappedOrder.txDate < '1/1/18'
--order not refilled within date range tolerance
AND NOT EXISTS
(
SELECT
1
FROM
recordTable AS refilledOrder
WHERE
refilledOrder.personID = gappedOrder.personID
AND refilledOrder.item = gappedOrder.item
--an order cannot count as its own refill (matters when daySupply is 5 or less)
AND refilledOrder.recordID <> gappedOrder.recordID
--5 days prior to (txDate + daySupply)
AND refilledOrder.txDate >= DATEADD(day, -5, DATEADD(day, gappedOrder.daySupply, gappedOrder.txDate))
--5 days after (txDate + daySupply)
AND refilledOrder.txDate <= DATEADD(day, 5, DATEADD(day, gappedOrder.daySupply, gappedOrder.txDate))
);
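Since the question asks for the number of such records, the same NOT EXISTS test can be wrapped in a COUNT. The sketch below is self-contained and runs against a made-up table variable (@recordTable and its four rows are assumptions for illustration, not the asker's data).

```sql
-- Hypothetical sample data using the column names from the question
DECLARE @recordTable TABLE (
    personID  VARCHAR(20),
    recordID  VARCHAR(10) PRIMARY KEY,
    item      VARCHAR(50),
    txDate    DATE,
    daySupply INT
);
INSERT INTO @recordTable VALUES
('p1', 'r1', 'desiredItem', '2017-01-01', 30),  -- refilled by r2 (2 days after 2017-01-31)
('p1', 'r2', 'desiredItem', '2017-02-02', 30),  -- expected 2017-03-04, next fill is far too late
('p1', 'r3', 'desiredItem', '2017-04-15', 30),  -- never refilled
('p2', 'r4', 'desiredItem', '2017-06-01', 30);  -- never refilled

SELECT COUNT(DISTINCT g.recordID) AS gappedRecords
FROM @recordTable AS g
WHERE g.item = 'desiredItem'
  AND g.txDate >= '2017-01-01'
  AND g.txDate <  '2018-01-01'
  AND NOT EXISTS (
      SELECT 1
      FROM @recordTable AS r
      WHERE r.personID = g.personID
        AND r.item = g.item
        AND r.recordID <> g.recordID
        -- refill window: expected resupply date plus or minus 5 days
        AND r.txDate BETWEEN DATEADD(DAY, g.daySupply - 5, g.txDate)
                         AND DATEADD(DAY, g.daySupply + 5, g.txDate)
  );
```

For these four rows the count is 3 (r2, r3 and r4). Whether records late in the year, whose refill would fall in the following year, should count as gaps is a business decision the query does not make for you.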

How to derive an attribute value automatically based on other table values

I have some tables and want to populate a database attribute based on other table interval values.
The basic idea is to populate the 'eye-age' attribute with one of the values young, pre-presbyopic, or presbyopic, depending on the patient's age.
The Patient table has the birthdate, and I need to populate the last attribute (eyeAge) with a value from BirthToEyeAge based on the patient's age, inferred from that birthdate.
How can I do this, or which documentation should I read to learn these kinds of things?
INSERT INTO BirthToEyeAge(bId, minAge, maxAge, eyeAge)
VALUES (1, 0, 28, 'young'),
(2, 29, 59, 'pre-presbyopic'),
(3, 60, 120, 'presbyopic');
INSERT INTO Patient( patId, firstName, lastName, birthDate )
VALUES( 1, 'Ark', 'May', '1991-7-22' );
INSERT INTO Diagnostic( diagId, date, tear_rate, consId_Consulta, eyeAge )
VALUES( 1, '2019-08-10', 'normal', 1, ??? );
You can join table Patient with BirthToEyeAge, taking advantage of the handy PostgreSQL function age() to compute the age of the patient at the time of the diagnosis. Here is an insert query based on this logic:
insert into Diagnostic( diagId, date, tear_rate, consId_Consulta, eyeAge )
select d.*, b.bId
from
(select 1 diagId, '2018-08-10'::date date, 'normal' tear_rate, 1 consId_Consulta ) d
inner join patient p
on d.consId_Consulta = p.patId
inner join BirthToEyeAge b
on extract(year from age(d.date, p.birthDate)) between b.minAge and b.maxAge;
In this demo on DB Fiddle, after creating the tables, initializing their content, and running the above query, the content of Diagnostic is:
| diagid | date | tear_rate | consid_consulta | eyeage |
| ------ | ------------------------ | --------- | --------------- | ------ |
| 1 | 2018-08-10T00:00:00.000Z | normal | 1 | 1 |
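The query above stores the bId of the matching range. If the label itself is wanted, the same age-range join can return eyeAge instead. The following is a minimal PostgreSQL sketch using inline VALUES lists so it runs without the real tables; the label spellings and their assignment to the ranges are assumptions.

```sql
-- PostgreSQL: look up the eye-age label for a patient on a given diagnosis date
SELECT p.patId,
       extract(year from age(DATE '2018-08-10', p.birthDate)) AS age_years,
       b.eyeAge
FROM (VALUES (1, 'Ark', 'May', DATE '1991-07-22'))
       AS p (patId, firstName, lastName, birthDate)
JOIN (VALUES (1, 0, 28, 'young'),
             (2, 29, 59, 'pre-presbyopic'),
             (3, 60, 120, 'presbyopic'))
       AS b (bId, minAge, maxAge, eyeAge)
  -- whole years between birth date and diagnosis date must fall in the range
  ON extract(year from age(DATE '2018-08-10', p.birthDate)) BETWEEN b.minAge AND b.maxAge;
```

For the sample patient the age on 2018-08-10 is 27 whole years, so the row joins to the 'young' range.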

Insert multiple rows at once with a calculated column from prior inserts into SQL Server

I'm trying to figure out how to do a multi-row insert as one statement in SQL Server, where one of the columns is computed from the data as it stands after each inserted row.
Let's say I run this simple query and get back 3 records:
SELECT *
FROM event_courses
WHERE event_id = 100
Results:
id | event_id | course_id | course_priority
---+----------+-----------+----------------
10 | 100 | 501 | 1
11 | 100 | 502 | 2
12 | 100 | 503 | 3
Now I want to insert 3 more records into this table, except I need to be able to calculate the priority for each record. The priority should be the count of all courses in this event. But if I run a sub-query, I get the same priority for all new courses:
INSERT INTO event_courses (event_id, course_id, course_priority)
VALUES (100, 504,
(SELECT COUNT (id) + 1 AS cnt_event_courses
FROM event_courses
WHERE event_id = 100)),
(100, 505,
(SELECT COUNT (id) + 1 AS cnt_event_courses
FROM event_courses
WHERE event_id = 100)),
(100, 506,
(SELECT COUNT (id) + 1 AS cnt_event_courses
FROM event_courses
WHERE event_id = 100))
Results:
id | event_id | course_id | course_priority
---+----------+-----------+-----------------
10 | 100 | 501 | 1
11 | 100 | 502 | 2
12 | 100 | 503 | 3
13 | 100 | 504 | 4
14 | 100 | 505 | 4
15 | 100 | 506 | 4
Now I know I could easily do this in a loop outside of SQL and just run a bunch of insert statements, but that's not very efficient. There's got to be a way to calculate the priority on the fly during a multi-row insert.
Big thanks to @Sean Lange for the answer. I was able to simplify it even further for my application. Great lead! Learned 2 new syntax tricks today ;)
DECLARE @eventid int = 100
INSERT event_courses
SELECT @eventid AS event_id,
course_id,
course_priority = existingEventCourses.prioritySeed + ROW_NUMBER() OVER(ORDER BY tempid)
FROM (VALUES
(1, 501),
(2, 502),
(3, 503)
) courseInserts (tempid, course_id) -- This basically creates a temp table in memory at run-time
CROSS APPLY (
-- number of courses already stored for this event; new priorities continue from here
SELECT COUNT(id) AS prioritySeed
FROM event_courses
WHERE event_id = @eventid
) existingEventCourses
SELECT *
FROM event_courses
WHERE event_id = @eventid
Here is an example of how you might be able to do this. I have no idea where your new row values are coming from, so I just tossed them in a derived table. I doubt your final solution would look like this, but it demonstrates how you can leverage ROW_NUMBER to accomplish this type of thing.
declare @EventCourse table
(
id int identity
, event_id int
, course_id int
, course_priority int
)
insert @EventCourse values
(100, 501, 1)
,(100, 502, 2)
,(100, 503, 3)
select *
from @EventCourse
insert @EventCourse
(
event_id
, course_id
, course_priority
)
select x.eventID
, x.courseID
-- continue numbering from the highest priority already stored for the event
, NewPriority = y.MaxPriority + ROW_NUMBER() over(partition by x.eventID order by x.courseID)
from
(
values(100, 504)
,(100, 505)
,(100, 506)
)x(eventID, courseID)
cross apply
(
select max(course_priority) as MaxPriority
from @EventCourse ec
where ec.event_id = x.eventID
) y
select *
from @EventCourse
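One edge case with the MAX-based seed: for an event that has no rows yet, MAX returns NULL and so would every new priority. A possible guard, sketched here with a made-up table variable (@EventCourseDemo) and made-up event and course numbers, is to wrap the seed in ISNULL.

```sql
DECLARE @EventCourseDemo TABLE
(
    id int identity
  , event_id int
  , course_id int
  , course_priority int
);

-- Event 200 has no existing rows, so MAX(course_priority) is NULL here
INSERT @EventCourseDemo (event_id, course_id, course_priority)
SELECT x.eventID
     , x.courseID
     , ISNULL(y.MaxPriority, 0)
       + ROW_NUMBER() OVER (PARTITION BY x.eventID ORDER BY x.courseID)
FROM (VALUES (200, 601), (200, 602)) x (eventID, courseID)
CROSS APPLY
(
    -- an aggregate without GROUP BY always returns exactly one row
    SELECT MAX(ec.course_priority) AS MaxPriority
    FROM @EventCourseDemo ec
    WHERE ec.event_id = x.eventID
) y;

SELECT * FROM @EventCourseDemo;
```

The two new rows get priorities 1 and 2. The COUNT-based seed in the other answer does not need this guard, because COUNT returns 0 rather than NULL for an empty set.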

Splitting data from one record in a specific column (T-SQL)

I'm working on an old legacy database that got imported into SQL Server 2012 from Oracle. I have the following table called INSOrders, which includes a column called OrderID of type varchar(8).
An example of the data inserted is:
A04-05 | B81-02 | C02-01
A01-01 | B95-01 | C99-05
A02-02 | B06-07 | C03-02
A98-06 | B10-01 | C17-01
A78-07 | B02-03 | C15-03
A79-01 | B02-01 | C78-06
First letter = order type, next 2 digits = year, last 2 digits = order number within that year.
So I split the data into 3 columns (not stored, just presented):
select
orderid,
substring(orderid, 0, patindex('%[0-9]%', orderid)) as ordtype,
right(max(datepart(yyyy, '01/01/' + substring(orderid, patindex('%[0-9]-%', orderid) - 1, 2))),2) as year,
max(substring(orderid, patindex('%-[0-9]%', orderid) + 1, 2)) as ordnum
from
ins.insorders
where
orderid is not null
group by
substring(orderid, 0, patindex('%[0-9]%', orderid)), orderid
order by
ordtype
It is looking like this:
OrderID | OrderType | OrderYear | OrderNum
---------+-------------+-------------+----------
A04-05 | A | 04 | 05
A01-01 | A | 01 | 01
B10-03 | B | 10 | 03
B95-01 | B | 95 | 01
etc....
But now I want to select only the max for each OrderType: the max for letter A, the max for letter B, and so on. By max I mean the latest year and, within that year, the latest order number. So if I have A04-01 and A04-02, show just A04-02.
I need to modify my query so that I see the following:
OrderID | OrderType | OrderYear | OrderNum
---------+-------------+-------------+----------
A04-05 | A | 04 | 05
B10-03 | B | 10 | 03
C17-01 | C | 17 | 01
Thank you, I will truly appreciate the help.
You can try the below. It uses your original query as a CTE and assigns row numbers within each order type, ordered by order year and order number. Then it keeps only the rows numbered 1, which should be the max for each order type.
This little bit, DATEPART(yyyy,('01/01/' + OrderYear)), makes sure we get the correct year, so that 95 is 1995 and 10 is 2010 etc. (this relies on SQL Server's two-digit year cutoff, 2049 by default).
;WITH cte
AS (
select orderid,
substring(orderid, 0, patindex('%[0-9]%', orderid)) as OrderType,
right(max(datepart(yyyy,'01/01/' + substring(orderid, patindex('%[0-9]-%', orderid) - 1, 2))),2) as OrderYear,
max(substring(orderid, patindex('%-[0-9]%', orderid) + 1, 2)) as OrderNum
from ins.insorders
where orderid is not null
group by substring(orderid, 0, patindex('%[0-9]%', orderid)), orderid
)
SELECT *
FROM
(SELECT
*
-- within each order type: newest four-digit year first, then highest order number
, ROW_NUMBER() OVER (PARTITION BY OrderType ORDER BY DATEPART(yyyy,('01/01/' + OrderYear)) DESC, OrderNum DESC) AS RowNum
FROM cte) t
WHERE t.RowNum = 1
The data is represented poorly and I only have a way to "cheese" it, and we'll need to make a lot of assumptions:
with cte_example
as
( your query )
select OrderID
,OrderType
,OrderYear
,OrderNum
from
(select *, row_number() over(partition by OrderType order by OrderYear DESC) rn
from cte_example
where OrderYear <= right(year(getdate()),2)) t1
where t1.rn = 1
Since you already have a query extracting the information, I won't bother changing it. We wrap your query in a CTE, query from it, and apply the row_number function to decide which row of each OrderType has the most recent OrderYear, along with its OrderNum and OrderID.
Now the tricky part is that the years are poorly represented (assuming my comment on your original post is true), so any sort of aggregation for OrderType B will return 95, since it is numerically greatest.
We make the assumption that no order date will be greater than the current year, and that anything greater is in the 1900s, using this condition: where OrderYear <= right(year(getdate()),2). In other words, get this year and take its two rightmost characters: first retrieve 2017 from getdate, then 17 with the RIGHT function. I'm sure you can see why this is dangerous, because what if your latest date is 1999?
So by filtering them out, we can then see the latest year for each OrderType... hope this helps.
Here is the rextester test I built around to play with your query in case you want to try it.
I think your original query was almost exactly what you needed except you need to use MAX(OrderID) and not group by it.
declare @Something table
(
orderid varchar(6)
)
insert @Something
(
orderid
) values
('A04-05'), ('B81-02'), ('C02-01'),
('A01-01'), ('B95-01'), ('C99-05'),
('A02-02'), ('B06-07'), ('C03-02'),
('A98-06'), ('B10-01'), ('C17-01'),
('A78-07'), ('B02-03'), ('C15-03'),
('A79-01'), ('B02-01'), ('C78-06')
select max(orderid) as orderid,
substring(orderid, 0, patindex('%[0-9]%', orderid)) as ordtype,
right(max(datepart(yyyy,'01/01/' + substring(orderid, patindex('%[0-9]-%', orderid) - 1, 2))),2) as year,
max(substring(orderid, patindex('%-[0-9]%', orderid) + 1, 2)) as ordnum
from @Something
where orderid is not null
group by substring(orderid, 0, patindex('%[0-9]%', orderid))
order by ordtype
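For comparison, here is a sketch that avoids date conversion altogether. It assumes every OrderID has the fixed shape letter, two digits, hyphen, two digits, and it picks an explicit pivot (two-digit years of 50 and above are read as 19xx, the rest as 20xx). The pivot value, the table variable @orders and the FullYear column are assumptions for illustration.

```sql
DECLARE @orders TABLE (orderid varchar(8));
INSERT INTO @orders (orderid) VALUES
('A04-05'), ('B81-02'), ('C02-01'),
('A01-01'), ('B95-01'), ('C99-05'),
('A02-02'), ('B06-07'), ('C03-02'),
('A98-06'), ('B10-01'), ('C17-01'),
('A78-07'), ('B02-03'), ('C15-03'),
('A79-01'), ('B02-01'), ('C78-06');

WITH parsed AS (
    SELECT orderid,
           LEFT(orderid, 1)         AS OrderType,
           SUBSTRING(orderid, 2, 2) AS OrderYear,
           SUBSTRING(orderid, 5, 2) AS OrderNum,
           -- expand the two-digit year with an explicit pivot at 50
           CASE WHEN CAST(SUBSTRING(orderid, 2, 2) AS int) >= 50 THEN 1900 ELSE 2000 END
             + CAST(SUBSTRING(orderid, 2, 2) AS int) AS FullYear
    FROM @orders
    WHERE orderid IS NOT NULL
)
SELECT orderid, OrderType, OrderYear, OrderNum
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY OrderType
                              ORDER BY FullYear DESC, OrderNum DESC) AS rn
    FROM parsed
) t
WHERE rn = 1
ORDER BY OrderType;
```

With the sample data this returns A04-05, B10-01 and C17-01, one row per order type.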

How can I group / window date ordered events delineated by an arbitrary expression?

I would like to group some data together based on dates and some (potentially arbitrary) indicator:
Date | Ind
================
2016-01-02 | 1
2016-01-03 | 5
2016-03-02 | 10
2016-03-05 | 15
2016-05-10 | 6
2016-05-11 | 2
I would like to group together subsequent (date-ordered) rows but breaking the group after Indicator >= 10:
Date | Ind | Group
========================
2016-01-02 | 1 | 1
2016-01-03 | 5 | 1
2016-03-02 | 10 | 1
2016-03-05 | 15 | 2
2016-05-10 | 6 | 3
2016-05-11 | 2 | 3
I did find a promising technique at the end of a blog post: "Use this Neat Window Function Trick to Calculate Time Differences in a Time Series" (the final subsection, "Extra Bonus"), but the important part of the query uses a keyword (FILTER) that doesn't seem to be supported in SQL Server (and a quick Google later and I'm not sure where it is supported!).
I'm still hopeful a technique using a window function might be the answer. I just need a counter that I can add to every row, (like RANK or ROW_NUMBER does) but that only increments when some arbitrary condition evaluates as true. Is there a way to do this in SQL Server?
Here is the solution:
DECLARE @t TABLE ([Date] DATETIME, Ind INT)
INSERT INTO @t
VALUES
('2016-01-02', 1),
('2016-01-03', 5),
('2016-03-02', 10),
('2016-03-05', 15),
('2016-05-10', 6),
('2016-05-11', 2)
SELECT [Date],
Ind,
1 + SUM([Group]) OVER(ORDER BY [Date]) AS [Group]
FROM
(
SELECT *,
-- flag a row when the previous row (by date) closed a group
CASE WHEN LAG(ind) OVER(ORDER BY [Date]) >= 10
THEN 1
ELSE 0
END AS [Group]
FROM @t
) t
Just mark a row as 1 when the previous row's Ind is greater than or equal to 10, else 0. Then a running sum will give you the desired result.
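The same counter can also be written without the derived table, by summing the condition over all preceding rows. This is a sketch of that variant, not the answer's own query; it assumes SQL Server 2012 or later for the ROWS frame and uses a made-up table variable name (@events).

```sql
DECLARE @events TABLE ([Date] DATE, Ind INT);
INSERT INTO @events VALUES
('2016-01-02', 1),
('2016-01-03', 5),
('2016-03-02', 10),
('2016-03-05', 15),
('2016-05-10', 6),
('2016-05-11', 2);

SELECT [Date],
       Ind,
       -- count how many earlier rows met the break condition;
       -- the frame is empty for the first row, hence COALESCE
       1 + COALESCE(
               SUM(CASE WHEN Ind >= 10 THEN 1 ELSE 0 END)
                   OVER (ORDER BY [Date]
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
               0) AS [Group]
FROM @events
ORDER BY [Date];
```

For the sample rows the Group column comes out as 1, 1, 1, 2, 3, 3, matching the desired output in the question.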
Giving full credit to Giorgi for the idea, but I've modified his answer (both for my benefit and for future readers).
Just change the CASE statement to see if 30 or more days have lapsed since the last record:
DECLARE @t TABLE ([Date] DATETIME)
INSERT INTO @t
VALUES
('2016-01-02'),
('2016-01-03'),
('2016-03-02'),
('2016-03-05'),
('2016-05-10'),
('2016-05-11')
SELECT [Date],
1 + SUM([Group]) OVER(ORDER BY [Date]) AS [Group]
FROM
(
SELECT [Date],
-- flag a row when 30 or more days have passed since the previous row
CASE WHEN DATEADD(d, -30, [Date]) >= LAG([Date]) OVER(ORDER BY [Date])
THEN 1
ELSE 0
END AS [Group]
FROM @t
) t

How to remove a duplicate row in SQL with an older date field

I have two rows in my table which are exact duplicates with the exception of a date field. I want to find these records and delete the older record by hopefully comparing the dates.
For example I have the following data
ctrc_num | Ctrc_name | some_date
---------------------------------------
12345 | John R | 2011-01-12
12345 | John R | 2012-01-12
56789 | Sam S | 2011-01-12
56789 | Sam S | 2012-01-12
Now the idea is to find duplicates with a different 'some_date' field and delete the older records. The final output should look something like this.
ctrc_num | Ctrc_name | some_date
---------------------------------------
12345 | John R | 2012-01-12
56789 | Sam S | 2012-01-12
Also note that my table does not have a primary key, it was originally created this way, not sure why, and it has to fit inside a stored procedure.
If you look at this:
SELECT * FROM <tablename> WHERE some_date IN
(
SELECT MAX(some_date) FROM <tablename> GROUP BY ctrc_num,ctrc_name
HAVING COUNT(ctrc_num) > 1
AND COUNT(ctrc_name) > 1
)
You can see it selects the two most recent dates for the duplicate rows. If I switch the select in the brackets to 'min date' and use it to delete then you are removing the two older dates for the duplicate rows.
DELETE FROM <tablename> WHERE some_date IN
(
SELECT MIN(some_date) FROM <tablename> GROUP BY ctrc_num,ctrc_name
HAVING COUNT(ctrc_num) > 1
AND COUNT(ctrc_name) > 1
)
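Note that the IN test above compares dates only, so it could also match an unrelated row that happens to share one of those minimum dates. A correlated variant ties each deleted row to a newer row with the same ctrc_num and ctrc_name. The sketch below uses a made-up table variable (@contracts) plus one extra, non-duplicated row to show that it survives.

```sql
DECLARE @contracts TABLE (ctrc_num INT, ctrc_name VARCHAR(20), some_date DATE);
INSERT INTO @contracts VALUES
(12345, 'John R', '2011-01-12'),
(12345, 'John R', '2012-01-12'),
(56789, 'Sam S',  '2011-01-12'),
(56789, 'Sam S',  '2012-01-12'),
(99999, 'Pat T',  '2011-01-12');  -- hypothetical non-duplicate sharing an old date

-- delete a row only when a newer row exists for the same ctrc_num and ctrc_name
DELETE older
FROM @contracts AS older
WHERE EXISTS (
    SELECT 1
    FROM @contracts AS newer
    WHERE newer.ctrc_num  = older.ctrc_num
      AND newer.ctrc_name = older.ctrc_name
      AND newer.some_date > older.some_date
);

SELECT * FROM @contracts ORDER BY ctrc_num;
```

Three rows remain: the 2012 rows for John R and Sam S, and the untouched Pat T row.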
This is for SQL Server
CREATE TABLE StackOverFlow
([ctrc_num] int, [Ctrc_name] varchar(6), [some_date] datetime)
;
INSERT INTO StackOverFlow
([ctrc_num], [Ctrc_name], [some_date])
SELECT 12345, 'John R', '2011-01-12 00:00:00' UNION ALL
SELECT 12345, 'John R', '2012-01-12 00:00:00' UNION ALL
SELECT 56789, 'Sam S', '2011-01-12 00:00:00' UNION ALL
SELECT 56789, 'Sam S', '2012-01-12 00:00:00'
;WITH RankedByDate AS
(
SELECT ctrc_num
,Ctrc_name
,some_date
,ROW_NUMBER() OVER(PARTITION BY Ctrc_num, Ctrc_name ORDER BY some_date DESC) AS rNum
FROM StackOverFlow
)
DELETE
FROM RankedByDate
WHERE rNum > 1
SELECT
[ctrc_num]
, [Ctrc_name]
, [some_date]
FROM StackOverFlow
And here is the sql fiddle to test it http://sqlfiddle.com/#!6/32718/6
What I tried to do here is rank the records by descending order of date, then delete those that are older (keeping the latest).
