SQL: get time ranges from million+ rows for a particular condition - sql-server

I am working with SQL Server 2012. I have a table with approximately 35 columns and 10+ million rows.
I need to find the time ranges across the data during which the value of a particular column matches a condition.
For example, the sample data is as below:
Datetime col1 col2 col3
2018-05-31 0:00 1 2 1
2018-05-31 13:00 2 2 2
2018-05-31 14:30 3 2 1
2018-05-31 15:00 4 3 1
2018-05-31 16:00 4 5 1
2018-05-31 17:00 3 2 2
2018-05-31 17:30 3 2 4
2018-05-31 18:00 2 2 4
2018-05-31 20:00 1 2 6
2018-05-31 21:00 2 2 3
2018-05-31 21:10 2 2 1
2018-05-31 22:00 1 6 3
2018-05-31 22:00 4 5 1
2018-05-31 23:59 4 7 2
Find the time ranges in the data where the col2 value is <= 2. Accordingly, my expected result set is as below:
Start Time End time Time Diff
2018-05-31 0:00 2018-05-31 14:30 14:30:00
2018-05-31 17:00 2018-05-31 21:10 4:10:00
I can achieve this with the logic below, but it is extremely slow:
Get all rows and order them by date_Time.
Scan the rows to find the first row where the value matches, and record its timestamp as the start time.
Scan further rows until I get to the row where the condition breaks, and record that timestamp as the end time.
Because I have to work with a huge number of rows, this makes the overall operation slow. Any input or pseudocode to improve it would be appreciated.

We can use a slightly modified difference-in-row-numbers method here. The purpose of the first CTE, labelled cte1, is to add a computed column that labels the islands we want (rows having col2 values <= 2) as 1 and everything else as 0. Then we can compute the difference of two row numbers and aggregate over the islands to find the starting and ending times and the difference between those times.
WITH cte1 AS (
    SELECT *,
        CASE WHEN col2 <= 2 THEN 1 ELSE 0 END AS class
    FROM yourTable
),
cte2 AS (
    SELECT *,
        ROW_NUMBER() OVER (ORDER BY Datetime) -
        ROW_NUMBER() OVER (PARTITION BY class ORDER BY Datetime) AS rn
    FROM cte1
)
SELECT
    MIN(Datetime) AS [Start Time],
    MAX(Datetime) AS [End Time],
    CONVERT(TIME, MAX(Datetime) - MIN(Datetime)) AS [Time Diff]
FROM cte2
WHERE class = 1
GROUP BY rn
ORDER BY MIN(Datetime);
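To see why the difference of the two row numbers identifies each island, it can help to look at the intermediate values. The following is a minimal sketch, assuming a hypothetical table variable @readings that holds only the timestamp and col2 from the question's sample data. The names @readings, reading_time, rn_all, rn_class and island_key are illustrative and not part of the original query.

```sql
-- Hypothetical table variable with the question's sample rows (timestamp and col2 only).
DECLARE @readings TABLE (reading_time datetime NOT NULL, col2 int NOT NULL);

INSERT INTO @readings (reading_time, col2) VALUES
    ('2018-05-31T00:00:00', 2), ('2018-05-31T13:00:00', 2), ('2018-05-31T14:30:00', 2),
    ('2018-05-31T15:00:00', 3), ('2018-05-31T16:00:00', 5), ('2018-05-31T17:00:00', 2),
    ('2018-05-31T17:30:00', 2), ('2018-05-31T18:00:00', 2), ('2018-05-31T20:00:00', 2),
    ('2018-05-31T21:00:00', 2), ('2018-05-31T21:10:00', 2), ('2018-05-31T22:00:00', 6),
    ('2018-05-31T22:00:00', 5), ('2018-05-31T23:59:00', 7);

WITH classified AS (
    SELECT reading_time, col2,
           CASE WHEN col2 <= 2 THEN 1 ELSE 0 END AS class
    FROM @readings
),
numbered AS (
    SELECT reading_time, col2, class,
           -- Position of the row in the whole table.
           ROW_NUMBER() OVER (ORDER BY reading_time) AS rn_all,
           -- Position of the row among rows of the same class.
           ROW_NUMBER() OVER (PARTITION BY class ORDER BY reading_time) AS rn_class
    FROM classified
)
SELECT reading_time, col2, rn_all, rn_class,
       rn_all - rn_class AS island_key   -- constant within one unbroken run
FROM numbered
WHERE class = 1
ORDER BY reading_time;
-- For the matching rows, island_key is 0 for 00:00 through 14:30
-- and 2 for 17:00 through 21:10, so grouping by it separates the two ranges.
```

Within an unbroken run both row numbers advance together, so their difference stays constant. Every non-matching row in between advances only rn_all, which shifts the difference for the next run.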

Related

grouping rows of common values to create a new 'group id' for each set

I am trying to achieve the below but was not able to so far, any help would be greatly appreciated.
I have this data (sorted from a query by id, anchor, date, and time) that I wish to group by common anchor :
id anchor date time 'group' (the value to get)
3 2 2019-01-01 07:00 1
4 2 2019-01-01 08:00 1
5 3 2019-01-01 15:00 2
7 3 2019-01-01 16:00 2
10 3 2019-01-01 17:00 2
I'm looking to write a query in PostgreSQL that selects this data and, for each set of common anchors, assigns a 'group number'.
I then need a query to sum the anchor values of points in the same group; the example above would become:
anchor sum group
2 4 1
3 9 2
thanks!
EDIT: McNets' solution works perfectly.
I have another case, with the data below.
The anchor repeats, but only after a change of anchor: the rows are sorted by time, and first it was anchor 2, then anchor 3, then anchor 2 again.
In this case I need the rows after the change (ids 11 & 12) to get a new group number.
id anchor date time 'group' (the value to get)
3 2 2019-01-01 07:00 1
4 2 2019-01-01 08:00 1
5 3 2019-01-01 15:00 2
7 3 2019-01-01 16:00 2
10 3 2019-01-01 17:00 2
11 2 2019-01-01 18:00 3
12 2 2019-01-01 19:00 3
We can try using ROW_NUMBER here:
SELECT
anchor,
SUM(anchor) AS sum,
ROW_NUMBER() OVER (ORDER BY anchor) AS "group"
FROM yourTable
GROUP BY
anchor;
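For the follow-up case in the EDIT, where an anchor value reappears after a change, numbering by anchor alone would merge ids 3, 4, 11 and 12 into one group. Below is a minimal sketch of one possible approach. It assumes the rows can be ordered by id (as in the sample, where id order follows time order) and that the database supports window functions such as LAG. The CTE and column names are illustrative.

```sql
WITH points (id, anchor) AS (
    -- Sample rows from the EDIT; only id and anchor are needed for the grouping.
    SELECT * FROM (VALUES (3, 2), (4, 2), (5, 3), (7, 3), (10, 3), (11, 2), (12, 2)) AS v (id, anchor)
),
flagged AS (
    SELECT id, anchor,
           -- 1 when the anchor differs from the previous row (or there is no previous row).
           CASE WHEN LAG(anchor) OVER (ORDER BY id) = anchor THEN 0 ELSE 1 END AS is_new_group
    FROM points
),
numbered AS (
    SELECT id, anchor,
           -- Running total of the change flags gives a group number per unbroken run.
           SUM(is_new_group) OVER (ORDER BY id ROWS UNBOUNDED PRECEDING) AS group_no
    FROM flagged
)
SELECT anchor, SUM(anchor) AS anchor_sum, group_no
FROM numbered
GROUP BY group_no, anchor
ORDER BY group_no;
-- Returns (2, 4, 1), (3, 9, 2), (2, 4, 3) for the sample rows.
```

If id order did not match time order, the two ORDER BY id clauses would need to use the date and time columns instead.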

Find customer lapse across variable subscription periods

Hoping someone has run across this issue previously and has a solution.
I am trying to find customers who lapse based off subscription periods rather than a single order date.
Lapse is defined by us as not making a purchase/renewal within 30 days of the end of their subscription. A customer can have multiple subscriptions simultaneously and subscriptions can vary in length.
I have a data set that includes customerIDs, Orders, the subscription start date, the subscription expire date, and that order’s rank in the customer’s order history, something like this:
CREATE TABLE #Subscriptions
(CustomerID INT,
Orderid INT,
SubscriptionStart DATE,
SubscriptionEnd DATE,
OrderNumber INT);
INSERT INTO #Subscriptions
VALUES(1, 111111, '2017-01-01', '2017-12-31', 1),
(1, 211111, '2018-01-01', '2019-12-31' ,2),
(1, 311121, '2018-10-01', '2018-10-02', 3),
(1, 451515, '2019-02-01', '2019-02-28', 4),
(2, 158797, '2018-07-01', '2018-07-31', 1),
(2, 287584, '2018-09-01', '2018-12-31', 2),
(2, 387452, '2019-01-01', '2019-01-31', 3),
(3, 187498, '2019-01-01', '2019-02-28', 1),
(3, 284990, '2019-02-01', '2019-02-28', 2),
(4, 184849, '2019-02-01', '2019-02-28', 1)
Within this data set, customer 2 would have lapsed on 2018-07-31. Since Customer 1 has a subscription of 2017-01-01 - 2017-12-31 and then one that starts 2018-01-01 and ends 2019-12-31 they cannot lapse within that time period even if other orders made by the customer would qualify.
I have attempted some simple gap calculations using LEAD() and LAG(); however, I have had no success because of the variable lengths of the subscription periods, where a single subscription can span multiple other orders. Eventually, we will use this to calculate the monthly churn rate across approximately 5 million records.
You're overthinking this by trying to use LEAD() and LAG(). All you need is a NOT EXISTS predicate in the WHERE clause.
In pseudocode:
SELECT...FROM...
WHERE {SubscriptionEnd is at least 30 days in the past}
AND NOT EXISTS(
{A row for the same Customer where the StartDate is 30 days or less after this EndDate}
)
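One way the pseudocode above might be made concrete is sketched below. This is an adaptation rather than the answerer's exact query: the inner condition additionally requires the other subscription to end later than the current one, so that a long subscription spanning shorter orders (customer 1 in the question) counts as continued coverage. The table variable @subs and the reporting date @as_of are assumptions made for the example.

```sql
-- Hypothetical table variable mirroring #Subscriptions from the question.
DECLARE @subs TABLE (CustomerID int, Orderid int, SubscriptionStart date, SubscriptionEnd date);

INSERT INTO @subs (CustomerID, Orderid, SubscriptionStart, SubscriptionEnd) VALUES
    (1, 111111, '2017-01-01', '2017-12-31'), (1, 211111, '2018-01-01', '2019-12-31'),
    (1, 311121, '2018-10-01', '2018-10-02'), (1, 451515, '2019-02-01', '2019-02-28'),
    (2, 158797, '2018-07-01', '2018-07-31'), (2, 287584, '2018-09-01', '2018-12-31'),
    (2, 387452, '2019-01-01', '2019-01-31'), (3, 187498, '2019-01-01', '2019-02-28'),
    (3, 284990, '2019-02-01', '2019-02-28'), (4, 184849, '2019-02-01', '2019-02-28');

-- Assumed reporting date; in practice this would likely be CAST(GETDATE() AS date).
DECLARE @as_of date = '2019-02-28';

SELECT DISTINCT s.CustomerID, s.SubscriptionEnd AS LapseDate
FROM @subs AS s
WHERE DATEADD(DAY, 30, s.SubscriptionEnd) < @as_of   -- ended more than 30 days ago
  AND NOT EXISTS (
        SELECT 1
        FROM @subs AS nxt
        WHERE nxt.CustomerID = s.CustomerID
          -- the other subscription started no later than 30 days after this one ended
          AND nxt.SubscriptionStart <= DATEADD(DAY, 30, s.SubscriptionEnd)
          -- and it extends coverage beyond this subscription's end
          AND nxt.SubscriptionEnd > s.SubscriptionEnd
  );
-- With this sample and reporting date: one row, CustomerID 2 with LapseDate 2018-07-31.
```

DISTINCT is there because two subscriptions of the same customer that end on the same day would otherwise report the same lapse twice.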
This one looks to be a tricky one. You are correct about the problem with using the LEAD() and LAG() functions. It stems from customers being able to have multiple subscriptions of variable length. So we need to deal with that issue first. Let's begin with creating a single list of dates instead of having a list of SubscriptionStart and SubscriptionEnd.
SELECT
CustomerId,
OrderId,
1 AS Activity,
SubscriptionStart AS ActivityDate
FROM
#Subscriptions
UNION ALL
SELECT
CustomerId,
OrderId,
-1 AS Activity,
SubscriptionEnd AS ActivityDate
FROM
#Subscriptions
ORDER BY
CustomerId,
ActivityDate
CustomerId OrderId Activity ActivityDate
----------- ----------- ----------- ------------
1 111111 1 2017-01-01
1 111111 -1 2017-12-31
1 211111 1 2018-01-01
1 311121 1 2018-10-01
1 311121 -1 2018-10-02
1 451515 1 2019-02-01
1 451515 -1 2019-02-28
1 211111 -1 2019-12-31
2 158797 1 2018-07-01
2 158797 -1 2018-07-31
2 287584 1 2018-09-01
2 287584 -1 2018-12-31
2 387452 1 2019-01-01
2 387452 -1 2019-01-31
3 187498 1 2019-01-01
3 284990 1 2019-02-01
3 187498 -1 2019-02-28
3 284990 -1 2019-02-28
4 184849 1 2019-02-01
4 184849 -1 2019-02-28
Notice the additional Activity field. It is 1 for the SubscriptionStart and -1 for the SubscriptionEnd.
Using this new Activity field it is possible to find places where there might be a lapse in the customer's subscriptions. At the same time use LEAD() to find the NextDate.
;WITH SubscriptionList AS (
SELECT
CustomerId,
OrderId,
1 AS Activity,
SubscriptionStart AS ActivityDate
FROM
#Subscriptions
UNION ALL
SELECT
CustomerId,
OrderId,
-1 AS Activity,
SubscriptionEnd AS ActivityDate
FROM
#Subscriptions
)
SELECT
CustomerId,
OrderId,
Activity,
SUM(Activity) OVER(PARTITION BY CustomerId ORDER BY ActivityDate ROWS UNBOUNDED PRECEDING) as SubscriptionCount,
ActivityDate,
LEAD(ActivityDate, 1, GETDATE()) OVER(PARTITION BY CustomerId ORDER BY ActivityDate) AS NextDate,
DATEDIFF(d, ActivityDate, LEAD(ActivityDate, 1, GETDATE()) OVER(PARTITION BY CustomerId ORDER BY ActivityDate)) AS LapsedDays
FROM
SubscriptionList
ORDER BY
CustomerId,
ActivityDate
CustomerId OrderId Activity SubscriptionCount ActivityDate NextDate LapsedDays
----------- ----------- ----------- ----------------- ------------ ---------- -----------
1 111111 1 1 2017-01-01 2017-12-31 364
1 111111 -1 0 2017-12-31 2018-01-01 1
1 211111 1 1 2018-01-01 2018-10-01 273
1 311121 1 2 2018-10-01 2018-10-02 1
1 311121 -1 1 2018-10-02 2019-02-01 122
1 451515 1 2 2019-02-01 2019-02-28 27
1 451515 -1 1 2019-02-28 2019-12-31 306
1 211111 -1 0 2019-12-31 2019-02-28 -306
2 158797 1 1 2018-07-01 2018-07-31 30
2 158797 -1 0 2018-07-31 2018-09-01 32
2 287584 1 1 2018-09-01 2018-12-31 121
2 287584 -1 0 2018-12-31 2019-01-01 1
2 387452 1 1 2019-01-01 2019-01-31 30
2 387452 -1 0 2019-01-31 2019-02-28 28
3 187498 1 1 2019-01-01 2019-02-01 31
3 284990 1 2 2019-02-01 2019-02-28 27
3 187498 -1 1 2019-02-28 2019-02-28 0
3 284990 -1 0 2019-02-28 2019-02-28 0
4 184849 1 1 2019-02-01 2019-02-28 27
4 184849 -1 0 2019-02-28 2019-02-28 0
Adding a running total on the Activity field effectively gives the number of active subscriptions. While it is greater than 0, a lapse is not possible. So focus on the rows WHERE the SubscriptionCount is zero.
Use LEAD() to get the NextDate. If there isn't a next date, default to today. If the SubscriptionCount is 0, then the NextDate has to come from a new subscription, so it is the date on which that new subscription starts. Use DATEDIFF to count the number of days between the SubscriptionEnd and the next SubscriptionStart; if it is more than 30 days, there was a lapse. That sounds like a good WHERE clause.
;WITH SubscriptionList AS (
SELECT
CustomerId,
OrderId,
1 AS Activity,
SubscriptionStart AS ActivityDate
FROM
#Subscriptions
UNION ALL
SELECT
CustomerId,
OrderId,
-1 AS Activity,
SubscriptionEnd AS ActivityDate
FROM
#Subscriptions
)
, FindLapse AS (
SELECT
CustomerId,
OrderId,
Activity,
SUM(Activity) OVER(PARTITION BY CustomerId ORDER BY ActivityDate ROWS UNBOUNDED PRECEDING) as SubscriptionCount,
ActivityDate,
LEAD(ActivityDate, 1, GETDATE()) OVER(PARTITION BY CustomerId ORDER BY ActivityDate) AS NextDate
FROM
SubscriptionList
)
SELECT
CustomerId,
OrderId,
Activity,
SubscriptionCount,
ActivityDate,
NextDate,
DATEDIFF(d, ActivityDate, NextDate) AS LapsedDays
FROM
FindLapse
WHERE
SubscriptionCount = 0
AND DATEDIFF(d, ActivityDate, NextDate) > 30
CustomerId OrderId Activity SubscriptionCount ActivityDate NextDate LapsedDays
----------- ----------- ----------- ----------------- ------------ ---------- -----------
2 158797 -1 0 2018-07-31 2018-09-01 32
Looks like we have a winner!

SQL Server query for Total of hours across multiple rows

I've tried to resolve this in a few ways and would like some extra help.
I want to return the same number of rows while calculating the total number of hours delivered by each employee for each service on each day.
I've added a duplicate flag, but that doesn't help me work out the max hours for one employee in one day.
Emp Service Date Start End Hrs Duplicate Flag Flag hrs
Fred xyz 14/09/2017 8:45 15:00 6.25 1 1 6.25
Fred xyz 14/09/2017 9:00 14:15 5.25 1 0 0
Fred xyz 14/09/2017 9:00 14:15 5.25 2 0 0
Fred xyz 14/09/2017 9:00 15:00 6 1 0 0
John xyz 15/09/2017 10:00 12:00 2 1 1 2
John xyz 15/09/2017 10:00 13:00 3 1 0 0
John xyz 15/09/2017 11:00 15:00 4 1 0 0
John xyz 15/09/2017 12:00 16:00 4 1 1 4
The last 2 columns are the ones I can't quite work out how to add. I've tried overlaps and other ANDing methods.
thanks, Dave
I think you are looking for an OVER clause. I'm not sure what the duplicate flag is for, though. If you ignore your last three columns, assuming they are computed columns in a query, you could use:
Select
*,
sum(Hrs) over (partition by Emp, Date, Service order by Date)
From (select distinct * from yourTable) x
If the last three columns are actual columns in your table, just replace select * in the derived table with the column names, except those three.

How to count number of months in T-SQL

I've got a problem in SQL Server.
"Whate'er is well conceived is clearly said, And the words to say it flow with ease", Nicolas Boileau-Despreaux
Well, I don't think I'll be able to make it clear, but I'll try! And I'd like to apologize for my bad English!
I've got this table :
id ind lvl result date
1 1 a 3 2017-01-31
2 1 a 3 2017-02-28
3 1 a 1 2017-03-31
4 1 a 1 2017-04-30
5 1 a 1 2017-05-31
6 1 b 1 2017-01-31
7 1 b 3 2017-02-28
8 1 b 3 2017-03-31
9 1 b 1 2017-04-30
10 1 b 1 2017-05-31
11 2 a 3 2017-01-31
12 2 a 1 2017-02-28
13 2 a 3 2017-03-31
14 2 a 1 2017-04-30
15 2 a 3 2017-05-31
I'd like to count the number of months the combo {ind, lvl} remains at result 1, resetting the count to 0 whenever the result is not 1.
Specifically, I need to get something like this:
id ind lvl result date BadResultRemainsFor%Months
1 1 a 3 2017-01-31 0
2 1 a 3 2017-02-28 0
3 1 a 1 2017-03-31 1
4 1 a 1 2017-04-30 2
5 1 a 1 2017-05-31 3
6 1 b 1 2017-01-31 1
7 1 b 3 2017-02-28 0
8 1 b 3 2017-03-31 0
9 1 b 1 2017-04-30 1
10 1 b 1 2017-05-31 2
11 2 a 3 2017-01-31 0
12 2 a 1 2017-02-28 1
13 2 a 3 2017-03-31 0
14 2 a 1 2017-04-30 1
15 2 a 3 2017-05-31 0
So if I were looking for the number of months the result has been 1 as of the date 2017-05-31 for ind 1 and lvl a, I would know it's been 3 months.
Assume all the dates are the last day of the month:
;WITH tb(id,ind,lvl,result,date) AS(
select 1,1,'a',3,'2017-01-31' UNION
select 2,1,'a',3,'2017-02-28' UNION
select 3,1,'a',1,'2017-03-31' UNION
select 4,1,'a',1,'2017-04-30' UNION
select 5,1,'a',1,'2017-05-31' UNION
select 6,1,'b',1,'2017-01-31' UNION
select 7,1,'b',3,'2017-02-28' UNION
select 8,1,'b',3,'2017-03-31' UNION
select 9,1,'b',1,'2017-04-30' UNION
select 10,1,'b',1,'2017-05-31' UNION
select 11,2,'a',3,'2017-01-31' UNION
select 12,2,'a',1,'2017-02-28' UNION
select 13,2,'a',3,'2017-03-31' UNION
select 14,2,'a',1,'2017-04-30' UNION
select 15,2,'a',3,'2017-05-31'
)
SELECT t.id,t.ind,t.lvl,t.result,t.date
,CASE WHEN t.isMatched=1 THEN ROW_NUMBER()OVER(PARTITION BY t.ind,t.lvl,t.id-t.rn ORDER BY t.id) ELSE 0 END
FROM (
SELECT t1.*,c.MonthDiff,CASE WHEN c.MonthDiff=t1.result THEN 1 ELSE 0 END AS isMatched
,CASE WHEN c.MonthDiff=t1.result THEN ROW_NUMBER()OVER(PARTITION BY t1.ind,t1.lvl,CASE WHEN c.MonthDiff=t1.result THEN 1 ELSE 0 END ORDER BY t1.id) ELSE null END AS rn
FROM tb AS t1
LEFT JOIN tb AS t2 ON t1.ind=t2.ind AND t1.lvl=t2.lvl AND t2.id=t1.id-1
CROSS APPLY(VALUES(ISNULL(DATEDIFF(MONTH,t2.date,t1.date),1))) c(MonthDiff)
) AS t
ORDER BY t.id
id ind lvl result date
----------- ----------- ---- ----------- ---------- --------------------
1 1 a 3 2017-01-31 0
2 1 a 3 2017-02-28 0
3 1 a 1 2017-03-31 1
4 1 a 1 2017-04-30 2
5 1 a 1 2017-05-31 3
6 1 b 1 2017-01-31 1
7 1 b 3 2017-02-28 0
8 1 b 3 2017-03-31 0
9 1 b 1 2017-04-30 1
10 1 b 1 2017-05-31 2
11 2 a 3 2017-01-31 0
12 2 a 1 2017-02-28 1
13 2 a 3 2017-03-31 0
14 2 a 1 2017-04-30 1
15 2 a 3 2017-05-31 0
By slightly tweaking your input data and slightly tweaking how we define the requirement, it becomes quite simple to produce the expected results.
First, we tweak your date values so that the only thing that varies is the month and year - the days are all the same. I've chosen to do that by adding 1 day to each value (see footnote 1). The fact that this produces results which are one month advanced doesn't matter here, since all values are similarly transformed, and so the monthly relationships stay the same.
Then, we introduce a numbers table - here, I've assumed a small fixed table is adequate. If it doesn't fit your needs, you can easily locate examples online for creating a large fixed numbers table that you can use for this query.
And, finally, we recast the problem statement. Instead of trying to count months, we ask "what's the smallest number of months, greater than or equal to zero, that I need to go back from the current row to locate a row with a non-1 result?". And so, we produce this query:
declare @t table (id int not null,ind int not null,lvl varchar(13) not null,
result int not null,date date not null)
insert into @t(id,ind,lvl,result,date) values
(1 ,1,'a',3,'20170131'), (2 ,1,'a',3,'20170228'), (3 ,1,'a',1,'20170331'),
(4 ,1,'a',1,'20170430'), (5 ,1,'a',1,'20170531'), (6 ,1,'b',1,'20170131'),
(7 ,1,'b',3,'20170228'), (8 ,1,'b',3,'20170331'), (9 ,1,'b',1,'20170430'),
(10,1,'b',1,'20170531'), (11,2,'a',3,'20170131'), (12,2,'a',1,'20170228'),
(13,2,'a',3,'20170331'), (14,2,'a',1,'20170430'), (15,2,'a',3,'20170531')
;With Tweaked as (
select
*,
DATEADD(day,1,date) as dp1d
from
@t
), Numbers(n) as (
select 0 union all select 1 union all select 2 union all select 3 union all select 4
union all
select 5 union all select 6 union all select 7 union all select 8 union all select 9
)
select
id, ind, lvl, result, date,
COALESCE(
(select MIN(n) from Numbers n1
inner join Tweaked t2
on
t2.ind = t1.ind and
t2.lvl = t1.lvl and
t2.dp1d = DATEADD(month,-n,t1.dp1d)
where
t2.result != 1
),
1) as [BadResultRemainsFor%Months]
from
Tweaked t1
The COALESCE is just there to deal with the edge case, such as for your 1,b data, where there is no previous row with a non-1 result.
Results:
id ind lvl result date BadResultRemainsFor%Months
----------- ----------- ------------- ----------- ---------- --------------------------
1 1 a 3 2017-01-31 0
2 1 a 3 2017-02-28 0
3 1 a 1 2017-03-31 1
4 1 a 1 2017-04-30 2
5 1 a 1 2017-05-31 3
6 1 b 1 2017-01-31 1
7 1 b 3 2017-02-28 0
8 1 b 3 2017-03-31 0
9 1 b 1 2017-04-30 1
10 1 b 1 2017-05-31 2
11 2 a 3 2017-01-31 0
12 2 a 1 2017-02-28 1
13 2 a 3 2017-03-31 0
14 2 a 1 2017-04-30 1
15 2 a 3 2017-05-31 0
Footnote 1: An alternative way to perform the adjustment is to use a DATEADD/DATEDIFF pair to perform a "floor" operation against the dates:
DATEADD(month,DATEDIFF(month,0,date),0) as dp1d
This resets all of the date values to the first of their own month rather than the following month. This may feel more "natural" to you, or you may already have such values available in your original data.
Assuming the dates increase continuously by month, you can use window functions like so:
select
    t.id, ind, lvl, result, date,
    case when result = 1 then row_number() over (partition by grp order by id) else 0 end x
from (
    select t.*,
        dense_rank() over (order by e, result) grp
    from (
        select
            t.*,
            row_number() over (order by id) - row_number() over (partition by ind, lvl, result order by id) e
        from your_table t
    ) t
) t;
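Under the same assumption (one row per month for each {ind, lvl}, with no missing months), the counter can also be sketched with two running sums. This is a different technique from the queries above, shown only as an illustration. The table variable @results and the names reset_no and months_at_1 are hypothetical.

```sql
-- Hypothetical table variable with the question's sample rows.
DECLARE @results TABLE (id int, ind int, lvl varchar(13), result int, [date] date);

INSERT INTO @results (id, ind, lvl, result, [date]) VALUES
    (1, 1, 'a', 3, '2017-01-31'), (2, 1, 'a', 3, '2017-02-28'), (3, 1, 'a', 1, '2017-03-31'),
    (4, 1, 'a', 1, '2017-04-30'), (5, 1, 'a', 1, '2017-05-31'), (6, 1, 'b', 1, '2017-01-31'),
    (7, 1, 'b', 3, '2017-02-28'), (8, 1, 'b', 3, '2017-03-31'), (9, 1, 'b', 1, '2017-04-30'),
    (10, 1, 'b', 1, '2017-05-31'), (11, 2, 'a', 3, '2017-01-31'), (12, 2, 'a', 1, '2017-02-28'),
    (13, 2, 'a', 3, '2017-03-31'), (14, 2, 'a', 1, '2017-04-30'), (15, 2, 'a', 3, '2017-05-31');

WITH grouped AS (
    SELECT id, ind, lvl, result, [date],
           -- Running count of non-1 rows: increases by one at every reset point.
           SUM(CASE WHEN result <> 1 THEN 1 ELSE 0 END)
               OVER (PARTITION BY ind, lvl ORDER BY [date] ROWS UNBOUNDED PRECEDING) AS reset_no
    FROM @results
)
SELECT id, ind, lvl, result, [date],
       -- Within each reset group, count the rows with result = 1 seen so far.
       SUM(CASE WHEN result = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY ind, lvl, reset_no ORDER BY [date] ROWS UNBOUNDED PRECEDING) AS months_at_1
FROM grouped
ORDER BY id;
-- months_at_1 by id 1..15: 0, 0, 1, 2, 3, 1, 0, 0, 1, 2, 0, 1, 0, 1, 0
```

Each non-1 row opens a new reset group and contributes 0 to the inner sum, so the counter restarts without a self-join or a numbers table.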

How to get TOP (1) row within each Group in sql server 2000

I have the following set of data, from which I want to select the top 1 row for each PK_PatientId based on the current order:
PK_PatientId PK_PatientVisitId PK_VisitProcedureId DateSort
------------ ----------------- ------------------- -----------------------
1 4 4 2009-06-22 00:00:00.000
1 3 3 2009-06-22 00:00:00.000
1 2 2 2010-03-11 00:00:00.000
1 1 1 2010-03-11 00:00:00.000
5 6 6 2009-05-24 00:00:00.000
5 5 5 2009-11-07 00:00:00.000
7 7 7 2009-05-24 00:00:00.000
8 8 8 2009-05-24 00:00:00.000
9 9 9 2009-05-24 00:00:00.000
10 10 10 2009-05-24 00:00:00.000
The query that led me to this result is:
SELECT
P.PK_PatientId
, PV.PK_PatientVisitId
, MAX(TVP.PK_VisitProcedureId) AS PK_VisitProcedureId
, MAX(PV.LastUpdated) AS DateSort
--, Row_Number() OVER (Partition BY PK_PatientId ORDER BY PV.PK_PatientVisitId DESC) AS RowNo
FROM
dbo.M_Patient AS P
INNER JOIN
dbo.M_PatientVisit AS PV
ON
P.PK_PatientId = PV.FK_PatientId
INNER JOIN
dbo.TX_VisitProcedure AS TVP
ON
PV.PK_PatientVisitId = TVP.FK_PatientVisitId
WHERE
(P.IsActive = 1)
AND
(PV.IsActive = 1)
AND
(TVP.IsActive = 1)
GROUP BY
PK_PatientId
, PK_PatientVisitId
ORDER BY
PK_PatientId
, PK_PatientVisitId DESC
I was getting the remaining functionality with the Row_Number function by taking RowNo = 1, but now I have to move this procedure to SQL Server 2000, where this function can't be used.
The desired result is:
PK_PatientId PK_PatientVisitId PK_VisitProcedureId DateSort RowNo
------------ ----------------- ------------------- ----------------------- --------------------
1 4 4 2009-06-22 00:00:00.000 1
5 6 6 2009-05-24 00:00:00.000 1
7 7 7 2009-05-24 00:00:00.000 1
8 8 8 2009-05-24 00:00:00.000 1
9 9 9 2009-05-24 00:00:00.000 1
which is what I get when using Row_Number in SQL Server 2005. I want the same result, but I have to use SQL Server 2000 only.
You just need to add this to the end of your WHERE clause:
AND NOT EXISTS (
SELECT *
FROM dbo.M_PatientVisit PV2
WHERE P.PK_PatientId = PV2.FK_PatientId
AND PV2.PK_PatientVisitId > PV.PK_PatientVisitId
)
...which will result in the query returning "the patient visits for patients where there does not exist another visit for that patient with a higher ID" - that is, you'll get the visits with the highest IDs.
Note that you'll need to repeat the other WHERE clause logic inside this subquery to ensure that the rows it considers are active, and so on.
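The effect of repeating the IsActive check inside the subquery can be seen in a small sketch. It assumes a hypothetical table variable @visit that stands in for dbo.M_PatientVisit with only three of its columns, and the sample rows (including the inactive visit) are invented for illustration. It avoids ROW_NUMBER and multi-row VALUES, neither of which is available in SQL Server 2000.

```sql
-- Hypothetical stand-in for dbo.M_PatientVisit, reduced to the columns used here.
DECLARE @visit TABLE (FK_PatientId int NOT NULL, PK_PatientVisitId int NOT NULL, IsActive bit NOT NULL);

INSERT INTO @visit (FK_PatientId, PK_PatientVisitId, IsActive)
SELECT 1, 1, 1 UNION ALL
SELECT 1, 2, 1 UNION ALL
SELECT 1, 3, 1 UNION ALL
SELECT 1, 4, 0 UNION ALL   -- invented inactive visit with the highest ID
SELECT 5, 5, 1 UNION ALL
SELECT 5, 6, 1;

SELECT PV.FK_PatientId, PV.PK_PatientVisitId
FROM @visit AS PV
WHERE PV.IsActive = 1
  AND NOT EXISTS (
        SELECT *
        FROM @visit AS PV2
        WHERE PV2.FK_PatientId = PV.FK_PatientId
          AND PV2.IsActive = 1                              -- same filter as the outer query
          AND PV2.PK_PatientVisitId > PV.PK_PatientVisitId  -- a later visit exists
  )
ORDER BY PV.FK_PatientId;
-- Returns visit 3 for patient 1 and visit 6 for patient 5.
```

Without the PV2.IsActive = 1 line, patient 1 would disappear from the result: visit 3 would be excluded by the inactive visit 4, and visit 4 itself is filtered out by the outer WHERE clause.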
Would you mind using a temporary table in your procedure? I mean that you could insert the max(PatientVisitId) rows into a temporary table.
