partitioning and selecting clusters with multiple records - sql-server

The title of this question might be confusing, so let me describe the issue in words:
I have a table with master_ids, ids and years. A master_id can contain different ids. Each id is associated with a year. I already partitioned by master_id and gave each year a rank (year_rank).
+-----------+----+------+-----------+
| master_id | id | year | year_rank |
+-----------+----+------+-----------+
| 100 | 1 | 2017 | 1 |
| 100 | 2 | 2016 | 2 |
| 100 | 3 | 2015 | 3 |
| 200 | 9 | 2001 | 1 |
| 300 | 5 | 2020 | 1 |
| 300 | 4 | 2010 | 2 |
| 400 | 7 | 1999 | 1 |
| 400 | 11 | 1996 | 2 |
| 500 | 20 | 1999 | 1 |
| 600 | 25 | 2005 | 1 |
| 600 | 29 | 2005 | 1 |
+-----------+----+------+-----------+
My goal is to pick only the clusters which have more than one record, so that I can compare them:
+-----------+----+------+-----------+
| master_id | id | year | year_rank |
+-----------+----+------+-----------+
| 100 | 1 | 2017 | 1 |
| 100 | 2 | 2016 | 2 |
| 100 | 3 | 2015 | 3 |
| 300 | 5 | 2020 | 1 |
| 300 | 4 | 2010 | 2 |
| 400 | 7 | 1999 | 1 |
| 400 | 11 | 1996 | 2 |
+-----------+----+------+-----------+
If I filter with where year_rank > 1, it eliminates the first row of each cluster with multiple records, which I don't want. How can I solve this? I thought about a group by, but I don't know how to apply it here.
Thank you very much!

Edit: Completely updated for the new requirement. This will only show records for master_ids which have multiple years associated with them; however, it will show all records associated with such a master_id, even if some of them are in the same year (see 600 vs 700).
We compute your year_rank in cte1 so that cte2 can aggregate it with the MAX() function and keep only the master_ids whose highest rank is greater than 1 (change the threshold to whatever you need). We then query cte1 and join it to cte2 so that only records for master_ids with multiple years are shown.
WITH cte1 AS (
    SELECT
        master_id,
        id,
        year,
        RANK() OVER (PARTITION BY master_id ORDER BY year DESC) AS year_rank
    FROM tbl
),
cte2 AS (
    SELECT
        master_id
    FROM cte1
    GROUP BY master_id
    HAVING MAX(year_rank) > 1
)
SELECT
    cte1.master_id,
    cte1.id,
    cte1.year,
    cte1.year_rank
FROM cte1
JOIN cte2 ON
    cte1.master_id = cte2.master_id

I figured out how to eliminate rows which don't have a discrepancy in years within their master_id:
select *,
    case
        when (master_id = (lead(master_id) over (order by master_id))) and
             (year = (lead(year) over (order by master_id))) then 'no show'
        when (master_id = (lag(master_id) over (order by master_id))) and
             (year = (lag(year) over (order by master_id))) then 'no show'
        else ''
    end as note
from tbl
Now I can put all of that into a temp table and delete the records which have 'no show' in the note column.
What do you think of this? Is there an easier way?
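One possibly easier route, offered as a sketch rather than a definitive answer: compare the smallest and largest year per master_id with windowed MIN and MAX, and keep only the master_ids where the two differ. The table variable @tbl and the CTE name spans are stand-ins introduced for this illustration; the rows are copied from the question.

```sql
-- @tbl is a hypothetical stand-in for the real table; rows are the question's sample data.
DECLARE @tbl TABLE (master_id INT, id INT, [year] INT);
INSERT INTO @tbl (master_id, id, [year]) VALUES
(100, 1, 2017), (100, 2, 2016), (100, 3, 2015),
(200, 9, 2001),
(300, 5, 2020), (300, 4, 2010),
(400, 7, 1999), (400, 11, 1996),
(500, 20, 1999),
(600, 25, 2005), (600, 29, 2005);

WITH spans AS (
    SELECT master_id, id, [year],
           RANK() OVER (PARTITION BY master_id ORDER BY [year] DESC) AS year_rank,
           -- smallest and largest year of the cluster, repeated on every row
           MIN([year]) OVER (PARTITION BY master_id) AS min_year,
           MAX([year]) OVER (PARTITION BY master_id) AS max_year
    FROM @tbl
)
SELECT master_id, id, [year], year_rank
FROM spans
WHERE min_year <> max_year   -- at least two different years in the cluster
ORDER BY master_id, year_rank;
```

With the sample rows this keeps master_ids 100, 300 and 400, and drops 600 because both of its records share the year 2005. No temp table or delete step is needed.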

Related

Getting Top 10 based on column value

I have code that outputs a long list of work-order counts per name, together with the total per name, sorted by total, name and count:
;with cte as (
    SELECT [Name],
           [Emergency],
           count([Emergency]) as [CountItem]
    FROM tableA
    GROUP BY [Name], [Emergency])
select Name, [Emergency], [CountItem] as [Count], SUM([CountItem]) OVER(PARTITION BY Name) as Total from cte
order by Total desc, Name, [CountItem] desc
but I only want to get the top 10 Names with the highest total like the one below:
+-------+-------------------------------+-------+-------+
| Name | Emergency | Count | Total |
+-------+-------------------------------+-------+-------+
| PLB | No | 7 | 15 |
| PLB | No Hot Water | 4 | 15 |
| PLB | Resident Locked Out | 2 | 15 |
| PLB | Overflowing Tub | 1 | 15 |
| PLB | No Heat | 1 | 15 |
| GG | Broken Lock - Exterior | 6 | 6 |
| BOA | Broken Lock - Exterior | 2 | 4 |
| BOA | Garage Door not working | 1 | 4 |
| BOA | Resident Locked Out | 1 | 4 |
| 15777 | Smoke Alarm not working | 3 | 3 |
| FP | No air conditioning | 2 | 3 |
| FP | Flood | 1 | 3 |
| KB | No electrical power | 2 | 3 |
| KB | No | 1 | 3 |
| MEM | Noise Complaint | 3 | 3 |
| ANG | Parking Issue | 2 | 2 |
| ALL | Smoke Alarm not working | 2 | 2 |
| AAS | No air conditioning | 1 | 2 |
| AAS | Toilet - Clogged (1 Bathroom) | 1 | 2 |
+-------+-------------------------------+-------+-------+
Note: I'm not after unique values. As you can see from the example above it gets the top 10 names from a very long table.
What I want is to assign a row id to each name, so that all PLB rows above get a row id of 1, GG = 2, BOA = 3, ...
Then in my final select I only need to add a where clause with row id <= 10. I already tried ROW_NUMBER() OVER(PARTITION BY Name ORDER BY Name), but it restarts at 1 for every unique Name it encounters.
You may try this:
;with cte as (
    SELECT [Name],
           [Emergency],
           count([Emergency]) as [CountItem]
    FROM tableA
    GROUP BY [Name], [Emergency]),
ct as (
    -- total per name, repeated on every row of that name
    select Name, [Emergency], [CountItem] as [Count], SUM([CountItem]) OVER(PARTITION BY Name) as Total from cte
),
ctname as (
    -- highest total gets rank 1; all rows of one name share the same rank
    select dense_rank() over (order by Total desc, Name) as RankName, Name, [Emergency], [Count], Total from ct )
select * from ctname where RankName < 11
order by RankName, [Count] desc
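A small, self-contained sketch of the ranking step may help show why DENSE_RANK fits here: ordering by the per-name total in descending order (and then by name to break ties) gives every row of the same name the same rank. The table variable @counts and its handful of rows are made up for illustration; only the column names follow the question.

```sql
-- @counts is a hypothetical stand-in for the grouped result (one row per Name/Emergency).
DECLARE @counts TABLE (Name VARCHAR(10), Emergency VARCHAR(40), CountItem INT);
INSERT INTO @counts (Name, Emergency, CountItem) VALUES
('PLB', 'No', 7),
('PLB', 'No Hot Water', 4),
('GG',  'Broken Lock - Exterior', 6),
('BOA', 'Broken Lock - Exterior', 2),
('BOA', 'Resident Locked Out', 1);

WITH summed AS (
    SELECT Name, Emergency, CountItem,
           SUM(CountItem) OVER (PARTITION BY Name) AS Total
    FROM @counts
)
SELECT Name, Emergency, CountItem, Total,
       -- no PARTITION BY: the rank runs across names, not within each name
       DENSE_RANK() OVER (ORDER BY Total DESC, Name) AS RankName
FROM summed
ORDER BY RankName, CountItem DESC;
```

Here both PLB rows (total 11) get RankName 1, the GG row (total 6) gets 2, and both BOA rows (total 3) get 3, so a filter such as RankName <= 10 keeps whole names rather than individual rows.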

What's an efficient way to count "previous" rows in SQL?

Hard to phrase the title for this one.
I have a table of data which contains a row per invoice. For example:
| Invoice ID | Customer Key | Date | Value | Something |
| ---------- | ------------ | ---------- | ------| --------- |
| 1 | A | 08/02/2019 | 100 | 1 |
| 2 | B | 07/02/2019 | 14 | 0 |
| 3 | A | 06/02/2019 | 234 | 1 |
| 4 | A | 05/02/2019 | 74 | 1 |
| 5 | B | 04/02/2019 | 11 | 1 |
| 6 | A | 03/02/2019 | 12 | 0 |
I need to add another column that counts the number of previous rows per CustomerKey, but only if "Something" is equal to 1, so that it returns this:
| Invoice ID | Customer Key | Date | Value | Something | Count |
| ---------- | ------------ | ---------- | ------| --------- | ----- |
| 1 | A | 08/02/2019 | 100 | 1 | 2 |
| 2 | B | 07/02/2019 | 14 | 0 | 1 |
| 3 | A | 06/02/2019 | 234 | 1 | 1 |
| 4 | A | 05/02/2019 | 74 | 1 | 0 |
| 5 | B | 04/02/2019 | 11 | 1 | 0 |
| 6 | A | 03/02/2019 | 12 | 0 | 0 |
I know I can do this using a correlated subquery like this...
(
    select
        count(*)
    from table
    where
        [Customer Key] = t.[Customer Key]
        and [Date] < t.[Date]
        and Something = 1
)
But I have a lot of data and that's pretty slow. I know I can also use cross apply to achieve the same thing, but as far as I can tell that doesn't perform any better than the subquery.
So; is there a more efficient means of achieving this, or do I just suck it up?
EDIT: I originally posted this without the requirement that only rows where Something = 1 are counted. Mea culpa - I asked it in a hurry. Unfortunately I think that this means I can't use row_number() over (partition by [Customer Key])
Assuming you're using SQL Server 2012+, you can use window functions. The frame ends at the row before the current one, so only earlier rows with Something = 1 are counted:
COUNT(CASE WHEN Something = 1 THEN CustomerKey END) OVER (PARTITION BY CustomerKey ORDER BY [Date]
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS [Count]
Old answer, from before the Something = 1 requirement was added:
COUNT(CustomerKey) OVER (PARTITION BY CustomerKey ORDER BY [Date]
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - 1 AS [Count]
If you're on a version older than 2012, an alternative for that original requirement is ROW_NUMBER:
ROW_NUMBER() OVER (PARTITION BY CustomerKey ORDER BY [Date]) - 1 AS [Count]
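As a rough check of the windowed count, the following sketch runs it against the sample invoices from the question. It assumes SQL Server 2012 or later; the table variable @invoices and the column name InvoiceID (without a space) are stand-ins introduced here, and the dates are read as day/month/year. The frame stops at 1 PRECEDING so that the current row is never counted, whatever its own Something value is.

```sql
-- @invoices is a hypothetical stand-in for the real table; rows follow the question's sample.
DECLARE @invoices TABLE (
    InvoiceID INT, CustomerKey CHAR(1), [Date] DATE, Value INT, Something BIT
);
INSERT INTO @invoices (InvoiceID, CustomerKey, [Date], Value, Something) VALUES
(1, 'A', '20190208', 100, 1),
(2, 'B', '20190207',  14, 0),
(3, 'A', '20190206', 234, 1),
(4, 'A', '20190205',  74, 1),
(5, 'B', '20190204',  11, 1),
(6, 'A', '20190203',  12, 0);

SELECT InvoiceID, CustomerKey, [Date], Value, Something,
       -- count earlier rows of the same customer where Something = 1
       COUNT(CASE WHEN Something = 1 THEN 1 END)
           OVER (PARTITION BY CustomerKey ORDER BY [Date]
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS [Count]
FROM @invoices
ORDER BY InvoiceID;
```

For invoices 1 to 6 this gives the counts 2, 1, 1, 0, 0, 0, which is the column asked for in the question. If two invoices of one customer can share a date, a ROWS frame and the original [Date] < t.[Date] comparison can disagree, so a tie-breaker in the ORDER BY may be needed.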

SQL Server - get values from X months ago according to columndata

Let's say I have the following table (the data is completely fictional):
ID | MonthDate | PersonID | Name | Status | MonthsAgoSinceLastCheck
1 | 2017-12 | 900 | Jack | Ill | -
2 | 2018-01 | 900 | Jack | Ill | 1
3 | 2018-02 | 900 | Jack | Ill | 2
4 | 2018-03 | 900 | Jack | Healthy | 1
5 | 2017-02 | 901 | Bill | Ill | -
6 | 2017-03 | 901 | Bill | Ill | 1
7 | 2017-05 | 901 | Bill | Healthy | 1
For each record, I would like to see the previous status that person had X months ago since last check (column MonthsAgoSinceLastCheck). Notice that MonthDate can skip months.
So in this case, the result would be
ID | MonthDate | PersonID | Name | Status | MonthsAgoSinceLastCheck | PreviousStatus
1 | 2017-12 | 900 | Jack | Ill | - | -
2 | 2018-01 | 900 | Jack | Ill | 1 | Ill
3 | 2018-02 | 900 | Jack | Ill | 2 | Ill
4 | 2018-03 | 900 | Jack | Healthy | 1 | Ill
5 | 2017-02 | 901 | Bill | Healthy | - | -
6 | 2017-03 | 901 | Bill | Healthy | 1 | Healthy
7 | 2017-05 | 901 | Bill | Ill | 2 | Healthy
Any suggestions/tips? I tried to do this with CTEs and self-joins but failed with both.
It's way easier to use full dates than year and months separately. The first thing you should do is generate a full date from your year + month. Then just self join with previous month, depending on the last check.
;WITH DataWithDates AS
(
    SELECT
        T.ID,
        MonthDate = CONVERT(DATE, T.MonthDate + '-01'),
        T.PersonID,
        T.Name,
        T.Status,
        T.MonthsAgoSinceLastCheck
    FROM
        YourTable AS T
)
SELECT
    D.ID,
    D.MonthDate,
    D.PersonID,
    D.Name,
    D.Status,
    D.MonthsAgoSinceLastCheck,
    PreviousStatus = N.Status
FROM
    DataWithDates AS D
    LEFT JOIN DataWithDates AS N ON
        D.PersonID = N.PersonID AND
        N.MonthDate = DATEADD(MONTH, -1 * D.MonthsAgoSinceLastCheck, D.MonthDate)
I'm assuming your MonthDate has values for all rows, otherwise the conversion will fail. I'm also assuming that your - values for MonthsAgoSinceLastCheck are actually NULL.
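To make the join concrete, here is a minimal sketch that runs the same idea against the rows of the expected-result table from the question. The table variable @checks is a stand-in for YourTable, MonthDate is assumed to be stored as a 'YYYY-MM' string, and the '-' entries are assumed to be NULL.

```sql
-- @checks is a hypothetical stand-in for YourTable; rows follow the question's result table.
DECLARE @checks TABLE (
    ID INT, MonthDate CHAR(7), PersonID INT, Name VARCHAR(10),
    Status VARCHAR(10), MonthsAgoSinceLastCheck INT
);
INSERT INTO @checks (ID, MonthDate, PersonID, Name, Status, MonthsAgoSinceLastCheck) VALUES
(1, '2017-12', 900, 'Jack', 'Ill',     NULL),
(2, '2018-01', 900, 'Jack', 'Ill',     1),
(3, '2018-02', 900, 'Jack', 'Ill',     2),
(4, '2018-03', 900, 'Jack', 'Healthy', 1),
(5, '2017-02', 901, 'Bill', 'Healthy', NULL),
(6, '2017-03', 901, 'Bill', 'Healthy', 1),
(7, '2017-05', 901, 'Bill', 'Ill',     2);

WITH DataWithDates AS (
    SELECT ID, PersonID, Status, MonthsAgoSinceLastCheck,
           CONVERT(DATE, MonthDate + '-01') AS MonthStart  -- first day of the month
    FROM @checks
)
SELECT D.ID, D.MonthStart, D.Status, D.MonthsAgoSinceLastCheck,
       N.Status AS PreviousStatus
FROM DataWithDates AS D
LEFT JOIN DataWithDates AS N
       ON N.PersonID = D.PersonID
      -- step back the stated number of months; a NULL offset matches nothing
      AND N.MonthStart = DATEADD(MONTH, -D.MonthsAgoSinceLastCheck, D.MonthStart)
ORDER BY D.ID;
```

Rows 1 and 5 come back with a NULL PreviousStatus because their offset is NULL; row 7 looks two months back to 2017-03 rather than to the immediately preceding row.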
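Try this (note that LAG simply takes the status of the immediately preceding row for each Name; it does not look at MonthsAgoSinceLastCheck):
select *, LAG(Status) OVER (PARTITION BY Name ORDER BY MonthDate, Id) AS PreviousStatus
from tab1
order by Id
SQL Fiddle: http://sqlfiddle.com/#!18/04407/4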

SQL Server query for next row value where previous row value

This query counts Event values from 1 to 20 within an hour. How can I extend it to also require that the next consecutive Event value is >= 200?
SELECT ID, count(Event) as numberoftimes
FROM table_name
WHERE Event >=1 and Event <=20
GROUP BY ID, DATEPART(HH, AtHour)
HAVING DATEPART(HH, AtHour) <= 1
ORDER BY ID desc
In this dummy 24h table:
+----+-------+--------+
| ID | Event | AtHour |
+----+-------+--------+
| 1 | 1 | 11:00 |
| 1 | 4 | 11:01 |
| 1 | 1 | 11:02 |
| 1 | 20 | 11:03 |
| 1 | 200 | 11:04 |
| 1 | 1 | 13:00 |
| 1 | 1 | 13:05 |
| 1 | 2 | 13:06 |
| 1 | 500 | 13:07 |
| 1 | 39 | 13:10 |
| 1 | 50 | 13:11 |
| 1 | 2 | 13:12 |
+----+-------+--------+
I would like to select the IDs where an Event value in the range 1 to 20 is followed immediately, within an hour, by a value greater than or equal to 200.
The expected result should be something like this:
+----+--------+
| ID | AtHour |
+----+--------+
| 1 | 11 |
| 1 | 13 |
| 2 | 11 |
| 2 | 14 |
| 3 | 09 |
| 3 | 12 |
+----+--------+
or just how many times it has happened for unique ID instead of which hour.
Please excuse me I am still rusty with post formatting!
CREATE TABLE data (Id INT, Event INT, AtHour SMALLDATETIME);
INSERT data (Id, Event, AtHour) VALUES
(1,1,'2017-03-16 11:00:00'),
(1,4,'2017-03-16 11:01:00'),
(1,1,'2017-03-16 11:02:00'),
(1,20,'2017-03-16 11:03:00'),
(1,200,'2017-03-16 11:04:00'),
(1,1,'2017-03-16 13:00:00'),
(1,1,'2017-03-16 13:05:00'),
(1,2,'2017-03-16 13:06:00'),
(1,500,'2017-03-16 13:07:00'),
(1,39,'2017-03-16 13:10:00')
;
; WITH temp AS (
    -- number each ID's events in time order
    SELECT rownum = ROW_NUMBER() OVER (PARTITION BY id ORDER BY AtHour)
         , *
    FROM data
)
SELECT a.id, DATEPART(HOUR, a.AtHour) AS AtHour, COUNT(*) AS NumOfPairs
FROM temp a
JOIN temp b ON a.id = b.id AND a.rownum = b.rownum - 1  -- b is the next event of the same ID
WHERE a.Event BETWEEN 1 AND 20 AND b.Event >= 200
  AND DATEDIFF(MINUTE, a.AtHour, b.AtHour) <= 60
GROUP BY a.id, DATEPART(HOUR, a.AtHour)
;
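On SQL Server 2012 or later, the same "current event and the one right after it" pairing can probably be written with LEAD instead of a self-join. The following is only a sketch under that assumption; the table variable @data, the CTE name paired and the output column HourOfDay are names introduced here, and the rows repeat the sample data above.

```sql
-- @data is a hypothetical stand-in for the data table created above.
DECLARE @data TABLE (Id INT, Event INT, AtHour SMALLDATETIME);
INSERT INTO @data (Id, Event, AtHour) VALUES
(1,   1, '2017-03-16T11:00:00'),
(1,   4, '2017-03-16T11:01:00'),
(1,   1, '2017-03-16T11:02:00'),
(1,  20, '2017-03-16T11:03:00'),
(1, 200, '2017-03-16T11:04:00'),
(1,   1, '2017-03-16T13:00:00'),
(1,   1, '2017-03-16T13:05:00'),
(1,   2, '2017-03-16T13:06:00'),
(1, 500, '2017-03-16T13:07:00'),
(1,  39, '2017-03-16T13:10:00');

WITH paired AS (
    SELECT Id, Event, AtHour,
           -- value and time of the next event of the same Id
           LEAD(Event)  OVER (PARTITION BY Id ORDER BY AtHour) AS NextEvent,
           LEAD(AtHour) OVER (PARTITION BY Id ORDER BY AtHour) AS NextAtHour
    FROM @data
)
SELECT Id, DATEPART(HOUR, AtHour) AS HourOfDay, COUNT(*) AS NumOfPairs
FROM paired
WHERE Event BETWEEN 1 AND 20
  AND NextEvent >= 200
  AND DATEDIFF(MINUTE, AtHour, NextAtHour) <= 60
GROUP BY Id, DATEPART(HOUR, AtHour)
ORDER BY Id, HourOfDay;
```

With these rows the query returns one pair for hour 11 (20 followed by 200) and one for hour 13 (2 followed by 500). Because LEAD is partitioned by Id, an event is never paired with an event of a different Id.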

Finding the max and min date values in SQL Server tables

I have two tables:
A lookup table (tabOne):
KEY | Group | Name | Desc | Val_Key
----------------------------------------
1 | a | NameA | DescA | 10
2 | b | NameB | DescB | 20
3 | c | NameC | DescC | 30
4 | d | NameD | DescD | 40
5 | e | NameE | DescE | 50
6 | f | NameF | DescF | 60
A second table containing readings (tabTwo):
KEY | Date | Reading | Val_Key
----------------------------------------
1 | Date | Read | 10
2 | Date | Read | 20
3 | Date | Read | 40
4 | Date | Read | 40
5 | Date | Read | 30
6 | Date | Read | 20
7 | Date | Read | 40
8 | Date | Read | 20
9 | Date | Read | 10
10 | Date | Read | 20
11 | Date | Read | 50
12 | Date | Read | 60
What I need to do is join tabTwo to tabOne and create one column with the newest reading and one column with the oldest reading for each item in the Group column of tabOne.
In the end I want a table that looks as follows:
KEY | Group | Name | Desc | Val_Key | LastReading | FirstReading |
-------------------------------------------------------------------------
1 | a | NameA | DescA | 10 | | |
2 | b | NameB | DescB | 20 | | |
3 | c | NameC | DescC | 30 | | |
4 | d | NameD | DescD | 40 | | |
5 | e | NameE | DescE | 50 | | |
6 | f | NameF | DescF | 60 | | |
Thanks!
Freddie
If this is SQL Server 2005 or newer, outer apply will help:
select TabOne.*,
       last.Reading LastReading,
       first.Reading FirstReading
from TabOne
outer apply
(
    select top 1
        Reading
    from TabTwo
    where TabTwo.Val_Key = TabOne.Val_Key
    order by TabTwo.Date desc
) last
outer apply
(
    select top 1
        Reading
    from TabTwo
    where TabTwo.Val_Key = TabOne.Val_Key
    order by TabTwo.Date asc
) first
Live test is at SQL Fiddle.
@Nikola Markovinović's solution can be made more universally applicable if the subqueries are moved directly into the main query's SELECT clause, which is possible because each of them retrieves only one value and is therefore valid as a scalar expression:
SELECT
    t1.[KEY],
    t1.[Group],
    t1.Name,
    t1.[Desc],
    t1.Val_Key,
    (
        SELECT TOP 1 Reading
        FROM TabTwo
        WHERE Val_Key = t1.Val_Key
        ORDER BY Date DESC
    ) AS LastReading,
    (
        SELECT TOP 1 Reading
        FROM TabTwo
        WHERE Val_Key = t1.Val_Key
        ORDER BY Date ASC
    ) AS FirstReading
FROM TabOne t1
If you needed e.g. dates along the way, you would probably have to stick to Nikola's solution. There is an alternative to it, but it's more cumbersome (albeit more standard too): it would involve grouping TabTwo's data by Val_Key to get earliest/latest dates per Val_Key, then joining back to TabTwo to access entire rows corresponding to the found dates to finally pull the necessary columns, and ultimately joining both result sets to TabOne to get the final column set.
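The more standard alternative described above could look roughly like the following sketch. The table variables @TabOne and @TabTwo, the CTE name bounds, and all dates and readings are invented for illustration, since the question only shows placeholders; the column names follow the question.

```sql
-- Hypothetical stand-ins for TabOne and TabTwo with made-up dates and readings.
DECLARE @TabOne TABLE ([KEY] INT, [Group] CHAR(1), Name VARCHAR(10), [Desc] VARCHAR(10), Val_Key INT);
DECLARE @TabTwo TABLE ([KEY] INT, [Date] DATE, Reading INT, Val_Key INT);
INSERT INTO @TabOne ([KEY], [Group], Name, [Desc], Val_Key) VALUES
(1, 'a', 'NameA', 'DescA', 10),
(2, 'b', 'NameB', 'DescB', 20),
(3, 'c', 'NameC', 'DescC', 30);
INSERT INTO @TabTwo ([KEY], [Date], Reading, Val_Key) VALUES
(1, '20200101', 5, 10),
(2, '20200105', 7, 10),
(3, '20200103', 9, 20),
(4, '20200110', 4, 20),
(5, '20200107', 6, 20);

WITH bounds AS (
    -- earliest and latest date per Val_Key
    SELECT Val_Key, MIN([Date]) AS FirstDate, MAX([Date]) AS LastDate
    FROM @TabTwo
    GROUP BY Val_Key
)
SELECT t1.[KEY], t1.[Group], t1.Name, t1.[Desc], t1.Val_Key,
       l.Reading AS LastReading,  l.[Date] AS LastDate,
       f.Reading AS FirstReading, f.[Date] AS FirstDate
FROM @TabOne AS t1
LEFT JOIN bounds  AS b ON b.Val_Key = t1.Val_Key
-- join back to the readings to fetch the whole row for each boundary date
LEFT JOIN @TabTwo AS l ON l.Val_Key = b.Val_Key AND l.[Date] = b.LastDate
LEFT JOIN @TabTwo AS f ON f.Val_Key = b.Val_Key AND f.[Date] = b.FirstDate
ORDER BY t1.[KEY];
```

With this made-up data, Val_Key 10 gets LastReading 7 and FirstReading 5, Val_Key 20 gets 4 and 9, and Val_Key 30 gets NULLs because it has no readings. If several readings share the earliest or latest date for a Val_Key, this form returns one row per tie, which the TOP 1 versions avoid.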
