Find nearest row that matches condition in SQL Server - sql-server

I have a SQL table with unique IDs, a date of service for a health care encounter, and whether this encounter was an emergency room visit (ed = 1) or a hospital admission (hosp = 1).
For each unique ID, I want to identify ED visits that occurred <= 1 calendar day from a hospital stay.
Thus I think I want to ask SQL first identify ED visits and then search up and down to find the nearest hospital admission and calculate the difference in dates (absolute value). I'm familiar with lag/lead and rownumber() functions, but can't quite seem to figure this out.
Any ideas would be much appreciated! Thank you!
Table looks like this for one illustrative ID:
id date ed hosp
1 2012-01-01 0 1
1 2012-01-05 1 0
1 2012-02-01 0 1
1 2012-02-03 1 0
1 2012-05-01 0 0
And I want to create a new column (ed_hosp_diff) that is the minimum absolute date difference (days) between each ED visit and the closest hospital stay, something like this:
id date ed hosp ed_hosp_diff
1 2012-01-01 0 1 null
1 2012-01-05 1 0 4
1 2012-02-01 0 1 null
1 2012-02-03 1 0 2
1 2012-05-01 0 0 null

So this doesn't get you the output table you show, but it meets the requirement you list:
For each unique ID, I want to identify ED visits that occurred <= 1
calendar day from a hospital stay.
Your output table doesn't really give you that - it includes rows for ED Visits that don't have a matching hospital admit, and has rows for hospital admits, etc. This SQL doesn't give you those, it just gives you the ED Visits that were followed by a hospital admit within one day.
It also doesn't give you matches with negative days - cases where the hospital visit is prior to the ED visit (in terms of healthcare analytics, that's usually a different thing than looking for ED Visits followed by an IP Admit). If you do want those, delete the last bit of logic in the WHERE clause for the main query.
SELECT
ID = e.id,
ED_DATE = e.date,
HOSP_DATE = h.date
ED_HOSP_DIFF = DATEDIFF(dd, e.date, h.date)
FROM
Table1 AS e
JOIN
(
SELECT
id,
date
FROM
Table1
WHERE
hosp = 1
) AS h
ON
e.id = h.id
WHERE
e.ed = 1
AND
DATEDIFF(dd, e.date, h.date) <= 1
AND
DATEDIFF(dd, e.date, h.date) >= 0

use OUTER APPLY to get the record with ed = 1 and find the min date diff
SELECT *
FROM table t
OUTER APPLY
(
SELECT ed_hosp_diff = MIN ( ABS ( DATEDIFF(DAY, t.date, x.date) ) )
FROM table x
WHERE x.date <> t.date
AND x.ed = 1
) eh

Related

SQL: How to count distinct for all hourly periods in a day

I have a table of hotel data like this:
Room_ID
Check_in_time
Check_out_time
123
2021-10-01 01:02:03
2021-10-01 02:03:04
I would like to do a count of how many rooms were were checked in during each hour throughout a day (even if the room was checked in for 1 minute during the hour it still counts), so an output that look like this:
Time period
Number of rooms
09:00-10:00
10
10:00-11:00
12
..
..
There are a couple of other 'where' conditions but this is the crux of the problem. I have so far managed to write a query that can count unique room ID by specifying the hourly window:
select count (distinct room_id)
from data
where check_out_time > 9am and check_in_time < 10am
But how do I do this for each of the 24 hourly windows without repeating the same query 24 times? Hopefully something that can be later adapted into half hour intervals, or even minutes. I'm using Sigma in case that matters. Thanks in advance!
In Snowflake, I'd leverage a DATE_TRUNC function. If your dataset is very large, this will likely perform much better than any of the BETWEEN type of filtering that the OP and other answers are using.
select date_trunc('hour',check_out_time) as check_out_hour
, count (distinct room_id) as cnt
from data
group by 1;
If you needed to parse it out by day and time, you could add that, as well:
select date_trunc('day',check_out_time) as check_out_day
, date_trunc('hour',check_out_time) as check_out_hour
, count (distinct room_id) as cnt
from data
group by 1,2;
For reference:
https://docs.snowflake.com/en/sql-reference/functions/date_trunc.html
You may try the following:
A recursive CTE is used to generate the possible hours 0-23 (we could have also select distinct hours from your existing dataset but i did not want to assume that every hour was possibly booked and this may be a less expensive operation for this case to get all possible hours). A left join was then used to determine hours rooms were booked before aggregating this and counting the number of bookings each hour.
WITH recursive hours(hr) as (
select 0 as hr
union all
select hr + 1 from hours where hr < 23
)
select
concat(h.hr,':00-',(h.hr+1),':00') as time_period,
COUNT(DISTINCT r.room_id) as no_rooms
from hours h
left join room_times r on (
CAST(r.check_in_time AS DATE) = CAST(r.check_out_time AS DATE) AND
h.hr BETWEEN DATE_PART(hour,r.Check_in_time) AND DATE_PART(hour,r.Check_out_time)
) OR
(
CAST(r.check_in_time AS DATE) < CAST(r.check_out_time AS DATE) AND
(
h.hr >= DATE_PART(hour, r.Check_in_time) OR
h.hr <= DATE_PART(hour,r.Check_out_time)
)
)
GROUP BY h.hr
order by h.hr
See working db fiddle (using sql server instead) with the same logic and additional data and outputs to assist verification here.
Sample Data:
INSERT INTO room_times
(Room_ID, Check_in_time, Check_out_time)
VALUES
('123', '2021-10-01 01:02:03', '2021-10-01 03:03:04'),
('124', '2021-10-01 15:02:03', '2021-10-02 01:03:04');
Outputs:
time_period
no_rooms
0:00-1:00
1
1:00-2:00
2
2:00-3:00
1
3:00-4:00
1
4:00-5:00
0
5:00-6:00
0
6:00-7:00
0
7:00-8:00
0
8:00-9:00
0
9:00-10:00
0
10:00-11:00
0
11:00-12:00
0
12:00-13:00
0
13:00-14:00
0
14:00-15:00
0
15:00-16:00
1
16:00-17:00
1
17:00-18:00
1
18:00-19:00
1
19:00-20:00
1
20:00-21:00
1
21:00-22:00
1
22:00-23:00
1
23:00-24:00
1
Let me know if this works for you.

Calculate Bounce Rate SQL Server 2008

I'm trying to calculate the Bounce Rate of pages in SQL Server in a table with Audit Data from Sharepoint.
ItemId UserId DocLocation Occurred
1 1 Home.aspx 2016-08-02 13:39:41
1 2 Home.aspx 2016-08-02 13:40:07
2 1 Other.aspx 2016-08-02 13:40:16
3 1 Items.aspx 2016-08-02 13:40:17
2 2 Other.aspx 2016-08-02 13:40:11
ItemId is the id of the page, DocLocation the location of the page and Occurred when the user goes into the page.
To calculate the bounce rate we have to divide the number of bounces between the total number of visits.
A Bounce happens when an user leaves the page in less than 5 seconds.
This should be the results for that table:
ItemId Bounces Visits BounceRate(Bounces/Visits)
1 1 2 0.5
2 1 2 0.5
3 0 1 0
I want to count a bounce calculating how much passes since the user performs the check until the user makes a visit to another page. If that time is less than 5 seconds, it would be counted as a bounce.
I'm making a stored procedure that execute the query to show the bounce rate of each page, but this doesn´t work.
SELECT
SUM(CASE
WHEN (DATEDIFF(second, #Occurred,
(SELECT TOP 1 a.Occurred
FROM [AuditPages] a
WHERE a.UserId = #userId
AND a.Occurred > #occurred
ORDER BY a.Occurred ASC))) < 30
THEN 1.0
ELSE 0.0
END) / COUNT(#itemId)
Someone knows how i can calculate this Bounce Rate?
Thanks for all the answers.
I like using row_number for this type of sequenced problem. The query below gives the desired result. I find performance with CTEs can sometimes be problematic with larger tables and you may need to convert to a temp table. You might consider using milliseconds if there is a chance you would want to use 4.5 seconds or such in the future.
declare #bounce_seconds int = 5;
with audit_cte as (
select *, ROW_NUMBER() over (partition by UserId order by Occurred) row_num
from AuditPages
--order by UserId,row_num
)
select a.ItemId, sum(a.bounce) Bounces, count(1) Visits, sum(a.bounce)/convert(float, count(1)) BounceRate
from (
select a1.ItemId, datediff(s,a1.Occurred, a2.Occurred) elapsed, case when datediff(s,a1.Occurred, a2.Occurred) < #bounce_seconds then 1 else 0 end bounce
from audit_cte a1
left join audit_cte a2
on a2.UserId = a1.UserId
and a2.row_num = a1.row_num + 1
--order by a1.UserId, a1.row_num
) a
group by a.ItemId
order by a.ItemId;
SELECT ItemId,COUNT(1) VISITS,SUM(BOUNCE_IND) BOUNCE, cast(SUM(BOUNCE_IND) as decimal(5,2))/cast(COUNT(1) as decimal(5,2)) BOUNCE_RATE
FROM (
Select
UserID,
ItemID,
DocLocation,
Occurred as Entry_time,
Lead(Occurred,1) Over (Partition by Userid order by Occurred) Exit_time,
CASE WHEN DATEDIFF(ss,Occurred,Lead(Occurred,1) Over (Partition by Userid order by Occurred)) <= 5 THEN 1 ELSE 0 END BOUNCE_IND
FROM Web_Data_Sample
) TBL GROUP BY ItemId

SQL Server: How to get a rolling sum over 3 days for different customers within same table

This is the input table:
Customer_ID Date Amount
1 4/11/2014 20
1 4/13/2014 10
1 4/14/2014 30
1 4/18/2014 25
2 5/15/2014 15
2 6/21/2014 25
2 6/22/2014 35
2 6/23/2014 10
There is information pertaining to multiple customers and I want to get a rolling sum across a 3 day window for each customer.
The solution should be as below:
Customer_ID Date Amount Rolling_3_Day_Sum
1 4/11/2014 20 20
1 4/13/2014 10 30
1 4/14/2014 30 40
1 4/18/2014 25 25
2 5/15/2014 15 15
2 6/21/2014 25 25
2 6/22/2014 35 60
2 6/23/2014 10 70
The biggest issue is that I don't have transactions for each day because of which the partition by row number doesn't work.
The closest example I found on SO was:
SQL Query for 7 Day Rolling Average in SQL Server
but even in that case there were transactions made everyday which accomodated the rownumber() based solutions
The rownumber query is as follows:
select customer_id, Date, Amount,
Rolling_3_day_sum = CASE WHEN ROW_NUMBER() OVER (partition by customer_id ORDER BY Date) > 2
THEN SUM(Amount) OVER (partition by customer_id ORDER BY Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
END
from #tmp_taml9
order by customer_id
I was wondering if there is way to replace "BETWEEN 2 PRECEDING AND CURRENT ROW" by "BETWEEN [DATE - 2] and [DATE]"
One option would be to use a calendar table (or something similar) to get the complete range of dates and left join your table with that and use the row_number based solution.
Another option that might work (not sure about performance) would be to use an apply query like this:
select customer_id, Date, Amount, coalesce(Rolling_3_day_sum, Amount) Rolling_3_day_sum
from #tmp_taml9 t1
cross apply (
select sum(amount) Rolling_3_day_sum
from #tmp_taml9
where Customer_ID = t1.Customer_ID
and datediff(day, date, t1.date) <= 3
and t1.Date >= date
) o
order by customer_id;
I suspect performance might not be great though.

Extracting data from MS SQL Server-2008 referring multiple tables

I have had asked a similar question here and have got help from jpw who helped me with the query. The situation here remains same but only a bit more detail added. I have four tables. Sample structure for three of them is given below:
I have been helped to form query which goes as below:
select
d.LOTQty,
ApprovedQty = count(d.SerialNo),
d.DispatchDate,
Installed = count(a.SerialNo) + count(r.SerialNo)
from
Despatch d
left join
Activation a
on d.SerialNo= a.SerialNo
and d.DispatchDate <= a.ActivationDate
and d.LOTQty = a.LOTQty
left join
Replaced r
on d.SerialNo= r.SerialNo
and d.DispatchDate <= r.ActivationDate
and (a.ActivationDate is null or a.ActivationDate < d.DispatchDate)
where
d.LOTQty = 15
group by
d.LOTQty, d.DispatchDate, d.STBModel
For understanding sake, above query match Despatch table's SerialNo with Activation table. If match found it checks for Date difference. If DespatchDate < ActivationDate only those numbers are considered while others(which didn't match or whose DispatchDate > ActivationDate) are matched with Replaced with similar date criteria. So at the end we find 9 matches i.e 7 from Activation and 2 from Replaced as below:
LotQty | ApprovedQty | DispatchDate | Installed
15 | 10 | 2013-8-7 | 9
I want to display two more columns in here i.e DOA and Bounce like this:
LotQty | ApprovedQty | DispatchDate | Installed | DOA | Bounce
15 | 10 | 2013-8-7 | 9 | 2 | 4
DOA and Bounce should be calculated with difference between 4th table i.e Failed table's FailedDate and the above 9 matched SerialNo's respective Activation/Record date(henceforth termed as act_rec_date). Failed table and Intermediate 9 matched SerialNo's structure is shown below:
Intermediate table doesn't physically exist. It is just for reference and to provide more clarity. Intermediate table contain those SerialNo, which were matched with Activation and Replaced table. The act_rec_Date field is correspondingly matched Activation/Record Date.
DOA & Bounce = We should match all the 9 resultant SerialNo's(i.e Intermediate table) with Failed table. If matched, calculate difference between FailedDate and act_rec_date. If difference is (0 to <=10 days) then count it under DOA and if difference is (>10 days to <=180 days) then count it under Bounce. From Failed we find 6 matches out of which Product1,2 falls in DOA as difference between act_rec_Date is 0 and Product7,8,9 & 10 falls under Bounce as their difference is 89 | 54 | 61 | 61. So as shown above DOA = 2 and Bounce = 4
I want to build a query which could give me DOA and Bounce as well. I tried creating a temp table and dumping the resultant SerialNo's and act_rec_Date into it. Next I tried to match temp table and Failed table. I couldn't get it working and further more it took around 7 minutes to even execute the query.
P.S- My Actual tables contain around 50k to 100k data entries.
Continuing on the previous query I think the new columns could be added with a conditional aggregation in the select statement and another left join for the failed table.
This should work, but I'm sure the query can be improved:
select
d.LOTQty,
ApprovedQty = count(d.SerialNo),
d.DispatchDate,
Installed = count(a.SerialNo) + count(r.NewSerialNo),
DOA = sum(case when datediff(day, coalesce(a.ActivationDate,r.RecordDate), f.FailedDate) <= 10 then 1 else 0 end),
Bounce = sum(case when datediff(day, coalesce(a.ActivationDate,r.RecordDate), f.FailedDate) between 11 and 180 then 1 else 0 end)
from
Despatch d
left join
Activation a
on d.SerialNo= a.SerialNo
and d.DispatchDate <= a.ActivationDate
and d.LOTQty = a.LOTQty
left join
Replaced r
on d.SerialNo= r.NewSerialNo
and d.DispatchDate <= r.RecordDate
and (a.ActivationDate is null or a.ActivationDate < d.DispatchDate)
left join
Failed f
on (f.FailedSINo = a.SerialNo)
or (f.FailedSINo = r.NewSerialNo)
where
d.LOTQty = 15
group by
d.LOTQty, d.DispatchDate
Sample SQL Fiddle with test data

group by with 'pre-defined row'

Say I have to following PaymentTransaction Table:
ID Amount PayMethodID
----------------------------
10254 100 1
15789 150 1
15790 200 0
16954 300 0
17864 400 1
19364 500 1
PayMethodID Desc
----------------------------
0 CASH
1 VISA
2 MASTER
3 AMEX
4 ETC
I can simply use a group by to group the PayMethodID under 1 and 0.
What i am trying to do is to show also the non-exist PayMethodID under GROUP BY
My current result with simple group by statement is
PayMethodID TotalAmount
-------------------------
0 500
1 1150
Expected result (to show 0 if its not exits in the transaction table):
PayMethodID TotalAmount
-------------------------
0 500
1 1150
2 0
3 0
4 0
This might be a simple and duplicated question, but i just cant find the keyword to search around. I would remove this post if you can find me any duplication. Thanks.
You can use LEFT JOIN, so all rows from leftmost table (TableA) will be shown whether it has a matching values on the other table or not.
SELECT a.PayMethodID,
TotalAmount = ISNULL(SUM(b.Amount), 0)
FROM TableA AS a -- <== contains list of card type
LEFT JOIN TableB AS b -- <== contains the payment list
ON a.PayMethodID = b.PayMethodID
GROUP BY a.PayMethodID
A regular OUTER (LEFT) JOIN will give you all rows from the PayMethod table no matter if they exist in the PaymentTransaction table, the rest of the sums being NULL. You can then use a COALESCE to make the null rows zero;
SELECT pm.PayMethodID, COALESCE(SUM(pt.Amount), 0) TotalAmount
FROM PayMethod pm
LEFT JOIN PaymentTransaction pt
ON pm.PayMethodID = pt.PayMethodID
GROUP BY pm.PayMethodID
An SQLfiddle to test with.

Resources