Grouping data into fuzzy gaps and islands - sql-server

This is essentially a gaps-and-islands problem, but an atypical one. I've cut the example down to the bare minimum. I need to identify gaps that exceed a certain threshold, and duplicates must not cause a problem (although this particular example removes them).
In any case, the common solution of using ROW_NUMBER() doesn't help, since it can't handle gaps of even 1, and the gap value is a parameter in 'real life'.
The code below actually works correctly, and it's super fast! But if you look at it you'll see why people are rather gun-shy about relying on it. The method was first published nine years ago here: http://www.sqlservercentral.com/articles/T-SQL/68467/ and I've read all 32 pages of comments. Nobody has successfully poked holes in it, other than to say "it's not documented behavior". I've tried it on every version from 2005 to 2019 and it works.
The question is, short of using a cursor or WHILE loop to look at many millions of rows one by one (which takes I-don't-know-how-long because I cancel after 30 minutes), is there a 'supported' way to get the same results in a reasonable amount of time? Even something 100x slower would complete 4M rows in 10 minutes, and I can't find a way to come close to that!
CREATE TABLE #t (CreateDate date not null
,TufpID int not null
,Cnt int not null
,FuzzyGroup int null);
ALTER TABLE #t ADD CONSTRAINT PK_temp PRIMARY KEY CLUSTERED (CreateDate,TufpID);
-- Takes 40 seconds to write 4.4M rows from a source of 70M rows.
INSERT INTO #T
SELECT X.CreateDate
,X.TufpID
,Cnt = COUNT(*)
,FuzzyGroup = null
FROM SessionState SS
CROSS APPLY(VALUES (CAST(SS.CreateDate as date),SS.TestUser_Form_Part_id)) X(CreateDate,TufpID)
GROUP BY X.CreateDate
,X.TufpID
ORDER BY x.CreateDate,x.TufpID;
-- Takes 6 seconds to update 4.4M rows. They WILL update in clustered index order!
-- (Provided all the rules are followed - see the link above)
DECLARE @FuzzFactor int = 38
DECLARE @Prior int = -@FuzzFactor; -- Ensure the 1st row gets its own group
DECLARE @Group int;
DECLARE @CDate date;
UPDATE #T
SET @Group = FuzzyGroup = CASE WHEN t.TufpID - @Prior < @FuzzFactor AND t.CreateDate = @CDate
                               THEN @Group ELSE t.TufpID END
   ,@CDate = CASE WHEN @CDate = t.CreateDate THEN @CDate ELSE t.CreateDate END
   ,@Prior = CASE WHEN @Prior = t.TufpID-1 THEN @Prior + 1 ELSE t.TufpID END
FROM #t t WITH (TABLOCKX) OPTION(MAXDOP 1);
After the above executes, the FuzzyGroup column contains the lowest TufpID value in the group. In other words, the first row (in clustered index order) contains the value of its own TufpID column. Thereafter every row gets the same value until the date changes or the gap size (in this case 38) is exceeded. In those cases the current TufpID becomes the value put into FuzzyGroup until another change is detected. So after 6 seconds I can run queries that group by FuzzyGroup and analyze the islands.
In practice I also compute some running counts and totals in the same pass, so it takes 8 seconds rather than 6, but I could do those with window functions pretty easily if I needed to, so I left them out.
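For example, the island analysis afterwards might look something like this (a sketch; the aggregates chosen are only illustrative):
SELECT CreateDate
      ,FuzzyGroup
      ,IslandStart = MIN(TufpID)
      ,IslandEnd   = MAX(TufpID)
      ,IslandRows  = COUNT(*)
      ,TotalCnt    = SUM(Cnt)
FROM #t
GROUP BY CreateDate, FuzzyGroup
ORDER BY CreateDate, FuzzyGroup;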
This is the smallest table and I'll eventually need to handle 100M rows, so 10 minutes for 4.4M is probably not good enough, but it's a place to start.

This should be reasonably efficient and avoids relying on undocumented behaviour:
WITH T1
AS (SELECT *,
PrevTufpID = LAG(TufpID)
OVER (PARTITION BY CreateDate
ORDER BY TufpID)
FROM #T),
T2
AS (SELECT *,
_FuzzyGroup = MAX(CASE
WHEN PrevTufpID IS NULL
OR TufpID - PrevTufpID >= @FuzzFactor
THEN TufpID
END)
OVER (PARTITION BY CreateDate
ORDER BY TufpID ROWS UNBOUNDED PRECEDING)
FROM T1)
UPDATE T2
SET FuzzyGroup = _FuzzyGroup
The execution plan has a single ordered scan through the clustered index, with the row values then flowing through some window function operators and into the update.
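If you'd rather not persist the groups at all, the same pattern works as a plain SELECT (a sketch derived from the query above, reusing the question's @FuzzFactor parameter):
DECLARE @FuzzFactor int = 38;
WITH T1
AS (SELECT *,
           PrevTufpID = LAG(TufpID)
                        OVER (PARTITION BY CreateDate
                              ORDER BY TufpID)
    FROM #T)
SELECT *,
       FuzzyGroupCalc = MAX(CASE
                              WHEN PrevTufpID IS NULL
                                OR TufpID - PrevTufpID >= @FuzzFactor
                              THEN TufpID
                            END)
                        OVER (PARTITION BY CreateDate
                              ORDER BY TufpID ROWS UNBOUNDED PRECEDING)
FROM T1;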

Related

Select one record, grab "n" records before it, and iterate through them to see if they're sequential

So, I'd like to grab a record from a table of results. Let's say that this is our "sample" record.
Once I have the sample record, I'd like to grab 10 results down the table, and check to see if the sample is sequential within this list of 10 results.
So, if our sample record was 124, I'd like to grab the 10 records before it, and check to see if they follow the sequence of 123, 122, 121, 120, etc.
Once I know that the sample result is in fact sequential down to 10 records, I would like to insert that record into a different table for keeping.
I am using SQL Server and T-SQL to do this, and pulling my hair out trying to do so. If anyone could offer any advice, I would GREATLY appreciate it. Here's what I have so far (with some data removed), with no idea if I'm on the right track.
declare @TestTable as table (a char(15), RowNumber integer)
declare @SampleNumber as char(15)
insert into @TestTable (a, RowNumber)
select top 10
[NUMBERS],
ROW_NUMBER() over (order by a) as RowNumber
from [TABLE]
where
[NUMBERS] like [CONDITIONS]
order by [NUMBERS] desc
With this, I'm trying to grab the result along with a set of row numbers so I can iterate through them by row number. But I'm getting an "Invalid column name 'a'" error when I run it. Feel free to forget about that error and write something totally new, though, because I don't even know if I'm on the right track.
Again, any help would be appreciated.
I am not sure how well this would perform on a larger dataset, but as Peter Smith mentioned, this is possible using LAG to see what the value of the row a given number of rows earlier was within an ordered window. Be aware that this runs over all rows in your table and returns every row that meets the criteria, rather than sampling one record:
-- Create a not quite sequential dataset
declare @t table(n int);
with n as
(
    select row_number() over (order by (select null)) as n
          ,abs(checksum(newid())) % 14 as r
    from sys.all_objects
)
insert into @t
select n
from n
where r > 2;
-- Output the original dataset
select *
from @t;
-- Only return rows that come after a certain number of sequential numbers
declare @seq int = 10;
with l as
(
    select n
          ,n - lag(n,@seq,null) over (order by n) as l
    from @t
)
select n
from l
where l = @seq;
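If you then want to move the qualifying rows into a separate table for keeping, as described in the question, the final select can simply feed an INSERT (a sketch; KeptRecords is only a placeholder name for that table):
-- assumes something like: create table KeptRecords (n int);
with l as
(
    select n
          ,n - lag(n,@seq,null) over (order by n) as l
    from @t
)
insert into KeptRecords (n)
select n
from l
where l = @seq;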

Calculate a Recursive Rolling Average in SQL Server

We are attempting to calculate a rolling average and have tried to convert numerous SO answers to solve the problem. To this point we are still unsuccessful.
What we've tried:
Here are some of the SO answers we have considered.
SQL Server: How to get a rolling sum over 3 days for different customers within same table
SQL Query for 7 Day Rolling Average in SQL Server
T-SQL calculate moving average
Our latest attempt has been to modify one of the solutions (#4) found here.
https://www.red-gate.com/simple-talk/sql/t-sql-programming/calculating-values-within-a-rolling-window-in-transact-sql/
Example:
Here is an example in SQL Fiddle: http://sqlfiddle.com/#!6/4570a/17
In the fiddle, we are still trying to get the SUM to work right but ultimately we are trying to get the average.
The end goal
Using the Fiddle example, we need to find the difference between Value1 and ComparisonValue1 and present it as Diff1. When a row has no Value1 available, we need to estimate it by taking the average of the last two Diff1 values and then add it to the ComparisonValue1 for that row.
With the correct query, the result would look like this:
GroupID  Number  ComparisonValue1  Diff1  Value1
5        10      54.78             2.41   57.19
5        11      55.91             2.62   58.53
5        12      55.93             2.78   58.71
5        13      56.54             2.7    59.24
5        14      56.14             2.74   58.88
5        15      55.57             2.72   58.29
5        16      55.26             2.73   57.99
Question: is it possible to calculate this average when it could potentially factor into the average of the following rows?
Update:
Added a VIEW to the Fiddle schema to simplify the final query.
Updated the query to include the new rolling average for Diff1 (column Diff1Last2Avg). This rolling average works great until we run into nulls in the Value1 column. This is where we need to insert the estimate.
Updated the query to include the estimate that should be used when there is no Value1 (column Value1Estimate). This works great and would be perfect if we could use the estimate in place of NULL in the Value1 column. Since the Diff1 column reflects the difference between Value1 (or its estimate) and ComparisonValue1, including the estimate would fill in all the NULL values in Diff1, which in turn would allow the estimates for subsequent rows to be calculated. It gets confusing at this point, but we're still hacking away at it. Any ideas?
Credit for the idea goes to this answer: https://stackoverflow.com/a/35152131/6305294 from @JesúsLópez
I have included comments in the code to explain it.
UPDATE
I have corrected the query based on comments.
I have swapped numbers in minuend and subtrahend to get difference as a positive number.
Removed Diff2Ago column.
Results of the query now exactly match your sample output.
;WITH cte AS
(
-- This is similar to your ItemWithComparison view
SELECT i.Number, i.Value1, i2.Value1 AS ComparisonValue1,
-- Calculated Differences; NULL will be returned when i.Value1 is NULL
CONVERT( DECIMAL( 10, 3 ), i.Value1 - i2.Value1 ) AS Diff
FROM Item AS i
LEFT JOIN [Group] AS G ON g.ID = i.GroupID
LEFT JOIN Item AS i2 ON i2.GroupID = g.ComparisonGroupID AND i2.Number = i.Number
WHERE NOT i2.Id IS NULL
),
cte2 AS(
/*
Start with the first number
Note if you do not have at least 2 consecutive numbers (in cte) with non-NULL Diff value and therefore Diff1Ago or Diff2Ago are NULL then everything else will not work;
You may need to add additional logic to handle these cases */
SELECT TOP 1 -- start with the 1st number (see ORDER BY)
a.Number, a.Value1, a.ComparisonValue1, a.Diff, b.Diff AS Diff1Ago
FROM cte AS a
-- "1 number ago"
LEFT JOIN cte AS b ON a.Number - 1 = b.Number
WHERE NOT a.Value1 IS NULL
ORDER BY a.Number
UNION ALL
SELECT b.Number, b.Value1, b.ComparisonValue1,
( CASE
WHEN NOT b.Value1 IS NULL THEN b.Diff
ELSE CONVERT( DECIMAL( 10, 3 ), ( a.Diff + a.Diff1Ago ) / 2.0 )
END ) AS Diff,
a.Diff AS Diff1Ago
FROM cte2 AS a
INNER JOIN cte AS b ON a.Number + 1 = b.Number
)
SELECT *, ( CASE WHEN Value1 IS NULL THEN ComparisonValue1 + Diff ELSE Value1 END ) AS NewValue1
FROM cte2 OPTION( MAXRECURSION 0 );
Limitations:
this solution works well only when you need to consider small number of preceding values.

TSQL - Need to compare values of two most recent rows in logging table

I have a table INDICATORS that stores details and current scores of performance indicators. I have another table IND_HISTORIES that stores historical values of the indicator scores. Data are stored from INDICATORS to IND_HISTORIES at set periods (ie quarterly), to establish score / rating trends.
IND_HISTORIES has a column structure similar to this:
pk_IndHistId fk_IndId Score DateSaved
Rating levels are also defined, meaning a score value of 1 to 3 is Low, 4 to 6 is Avg, and 7 to 9 is High.
I am trying to build an alert feature, whereby a record will be returned if its most recent rating level (based on the most recent score in IND_HISTORIES) is greater than its second-most recent rating level (based on the second-most recent score in IND_HISTORIES).
I am using code like the below to build table variables that translate score values into rating-level thresholds...
-- opt_IND_ScoreValues = 1;2;3;4;5;6;7;8;9
DECLARE @tblScores TABLE (idx int identity, val int not null)
INSERT INTO @tblScores (val) SELECT IntValue FROM dbo.fn_getSettingList('opt_IND_ScoreValues')
-- opt_IND_RatingLevels = Low;Low;Low;Avg;Avg;Avg;High;High;High
DECLARE @tblRatings TABLE (idx int identity, txt nvarchar(128))
INSERT INTO @tblRatings (txt) SELECT TxtValue FROM dbo.fn_getSettingList('opt_IND_RatingLevels')
-- combine two tables above using a common index
DECLARE @tblRatingScores TABLE (val int, txt nvarchar(128))
INSERT INTO @tblRatingScores SELECT s.val, r.txt FROM @tblScores s JOIN @tblRatings r ON s.idx = r.idx
-- reduce table rows above to find score thresholds for each rating level
DECLARE @tblRatingBands TABLE (idx int identity, score int not null, rating nvarchar(128))
INSERT INTO @tblRatingBands
SELECT rs.val, rs.txt FROM @tblRatingScores rs
INNER JOIN (SELECT MIN(val) as val FROM @tblRatingScores GROUP BY txt) AS x ON rs.txt = x.txt AND rs.val = x.val
ORDER BY val
QUESTION: Is there an elegant query I can run against the IND_HISTORIES table that will return records where the most recent rating level for an INDICATOR is above the second-most recent rating level?
UPDATE: To clarify, INDICATORS is not used in the calculation - it's a parent table that holds general information about the performance measure and the current 'volatile' scores. Scores are saved to IND_HISTORIES periodically - this provides point-in-time 'snapshots' of data, helping to establish score trends.
I'm looking to query the IND_HISTORY table, to find where the most recent 'snapshot' value of an indicator is higher than its second-most recent 'snapshot' value. (It would be ideal to also join the Rating Levels table, as described above, in the determination, so that rows are only returned if the score increase results in a Rating Level increase.)
Any solution should be compatible with SQL Server 2005.
I've implemented the below, which seems to work. But I'd be interested to hear any recommendations to optimize or consolidate.
First, I realized that I do not need the last table variable @tblRatingBands constructed above. Instead, I simply select the matching text ratings from @tblRatingScores in the first query set below.
Then in the final query, I check if the score value has increased and if the rating text has changed -- this indicates the trend score has increased and resulted in a change to the rating level.
DECLARE @tblTrendScores TABLE (indId int not null, ih_date datetime, rowNo int, ih_score int, rating nvarchar(128));
WITH LastTwoScores AS (
SELECT fk_IndId,
DateSaved,
ROW_NUMBER() OVER (PARTITION BY fk_IndId ORDER BY DateSaved DESC) AS RowNo,
Score
FROM Ind_History
)
INSERT INTO @tblTrendScores
SELECT *,
(SELECT txt FROM @tblRatingScores WHERE val = Score)
FROM LastTwoScores
WHERE RowNo BETWEEN 1 AND 2
ORDER BY fk_IndId, RowNo
SELECT a.indId,
a.ih_date,
CASE WHEN ((a.ih_score > IsNull(b.ih_score, 0)) AND (a.rating <> IsNull(b.rating, 'none'))) THEN 'Up'
WHEN ((a.ih_score < IsNull(b.ih_score, 0)) AND (a.rating <> IsNull(b.rating, 'none'))) THEN 'Down'
ELSE 'no-change'
END AS TrendRatingChange
FROM @tblTrendScores a
JOIN @tblTrendScores b ON a.indId = b.indId AND b.rowNo = 2
WHERE a.rowNo = 1
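One possible consolidation (an untested sketch) is to skip the intermediate @tblTrendScores table entirely and compare the two most recent snapshots straight from the CTE, joining @tblRatingScores twice to translate both scores into rating text:
;WITH LastTwoScores AS (
    SELECT fk_IndId,
           DateSaved,
           Score,
           ROW_NUMBER() OVER (PARTITION BY fk_IndId ORDER BY DateSaved DESC) AS RowNo
    FROM Ind_History
)
SELECT cur.fk_IndId,
       cur.DateSaved,
       cur.Score AS CurrentScore,
       prv.Score AS PriorScore,
       rc.txt    AS CurrentRating,
       rp.txt    AS PriorRating
FROM LastTwoScores cur
JOIN LastTwoScores prv ON prv.fk_IndId = cur.fk_IndId AND prv.RowNo = 2
JOIN @tblRatingScores rc ON rc.val = cur.Score
JOIN @tblRatingScores rp ON rp.val = prv.Score
WHERE cur.RowNo = 1
  AND cur.Score > prv.Score   -- score went up
  AND rc.txt <> rp.txt;       -- and the rating level itself changed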

How Can I Detect and Bound Changes Between Row Values in a SQL Table?

I have a table which records values over time, similar to the following:
RecordId  Time  Name
====================
1         10    Running
2         18    Running
3         21    Running
4         29    Walking
5         33    Walking
6         57    Running
7         66    Running
After querying this table, I need a result similar to the following:
FromTime  ToTime  Name
======================
10        29      Running
29        57      Walking
57        NULL    Running
I've toyed around with some of the aggregate functions (e.g. MIN, MAX, etc.), PARTITION and CTEs, but I can't seem to hit upon the right solution. I'm hoping a SQL guru can give me a hand, or at least point me in the right direction. Is there a fairly straightforward way to query this (preferably without a cursor)?
Finding "ToTime" By Aggregates Instead of a Join
I would like to share a really wild query that only takes 1 scan of the table with 1 logical read. By comparison, the best other answer on the page, Simon Kingston's query, takes 2 scans.
On a very large set of data (17,408 input rows, producing 8,193 result rows) it takes CPU 574 and time 2645, while Simon Kingston's query takes CPU 63,820 and time 37,108.
It's possible that with indexes the other queries on the page could perform many times better, but it is interesting to me to achieve 111x CPU improvement and 14x speed improvement just by rewriting the query.
(Please note: I mean no disrespect at all to Simon Kingston or anyone else; I am simply excited about my idea for this query panning out so well. His query is better than mine, as its performance is plenty good and it is actually understandable and maintainable, unlike mine.)
Here is the impossible query. It is hard to understand. It was hard to write. But it is awesome. :)
WITH Ranks AS (
SELECT
T = Dense_Rank() OVER (ORDER BY Time, Num),
N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time, Num),
*
FROM
#Data D
CROSS JOIN (
VALUES (1), (2)
) X (Num)
), Items AS (
SELECT
FromTime = Min(Time),
ToTime = Max(Time),
Name = IsNull(Min(CASE WHEN Num = 2 THEN Name END), Min(Name)),
I = IsNull(Min(CASE WHEN Num = 2 THEN T - N END), Min(T - N)),
MinNum = Min(Num)
FROM
Ranks
GROUP BY
T / 2
)
SELECT
FromTime = Min(FromTime),
ToTime = CASE WHEN MinNum = 2 THEN NULL ELSE Max(ToTime) END,
Name
FROM Items
GROUP BY
I, Name, MinNum
ORDER BY
FromTime
Note: This requires SQL 2008 or up. To make it work in SQL 2005, change the VALUES clause to SELECT 1 UNION ALL SELECT 2.
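Spelled out, the SQL 2005 form of that clause would be:
CROSS JOIN (
    SELECT 1 UNION ALL SELECT 2
) X (Num)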
Updated Query
After thinking about this a bit, I realized that I was accomplishing two separate logical tasks at the same time, and this made the query unnecessarily complicated: 1) prune out intermediate rows that have no bearing on the final solution (rows that do not begin a new task) and 2) pull the "ToTime" value from the next row. By performing #1 before #2, the query is simpler and performs with approximately half the CPU!
So here is the simplified query that first, trims out the rows we don't care about, then gets the ToTime value using aggregates rather than a JOIN. Yes, it does have 3 windowing functions instead of 2, but ultimately because of the fewer rows (after pruning those we don't care about) it has less work to do:
WITH Ranks AS (
SELECT
Grp =
Row_Number() OVER (ORDER BY Time)
- Row_Number() OVER (PARTITION BY Name ORDER BY Time),
[Time], Name
FROM #Data D
), Ranges AS (
SELECT
Result = Row_Number() OVER (ORDER BY Min(R.[Time]), X.Num) / 2,
[Time] = Min(R.[Time]),
R.Name, X.Num
FROM
Ranks R
CROSS JOIN (VALUES (1), (2)) X (Num)
GROUP BY
R.Name, R.Grp, X.Num
)
SELECT
FromTime = Min([Time]),
ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Time]) END,
Name = IsNull(Min(CASE WHEN Num = 2 THEN Name ELSE NULL END), Min(Name))
FROM Ranges R
WHERE Result > 0
GROUP BY Result
ORDER BY FromTime;
This updated query has all the same issues I presented in my explanation; however, they are easier to solve because I am not dealing with the extra unneeded rows. I also saw that the Row_Number() / 2 value of 0 had to be excluded, and I'm not sure why I didn't exclude it from the prior query, but in any case this works perfectly and is amazingly fast!
Outer Apply Tidies Things Up
Last, here is a version basically identical to Simon Kingston's query that I think is an easier to understand syntax.
SELECT
FromTime = Min(D.Time),
X.ToTime,
D.Name
FROM
#Data D
OUTER APPLY (
SELECT TOP 1 ToTime = D2.[Time]
FROM #Data D2
WHERE
D.[Time] < D2.[Time]
AND D.[Name] <> D2.[Name]
ORDER BY D2.[Time]
) X
GROUP BY
X.ToTime,
D.Name
ORDER BY
FromTime;
Here's the setup script if you want to do performance comparison on a larger data set:
CREATE TABLE #Data (
RecordId int,
[Time] int,
Name varchar(10)
);
INSERT #Data VALUES
(1, 10, 'Running'),
(2, 18, 'Running'),
(3, 21, 'Running'),
(4, 29, 'Walking'),
(5, 33, 'Walking'),
(6, 57, 'Running'),
(7, 66, 'Running'),
(8, 77, 'Running'),
(9, 81, 'Walking'),
(10, 89, 'Running'),
(11, 93, 'Walking'),
(12, 99, 'Running'),
(13, 107, 'Running'),
(14, 113, 'Walking'),
(15, 124, 'Walking'),
(16, 155, 'Walking'),
(17, 178, 'Running');
GO
insert #data select recordid + (select max(recordid) from #data), time + (select max(time) +25 from #data), name from #data
GO 10
Explanation
Here is the basic idea behind my query.
The times that represent a switch have to appear in two adjacent rows, one to end the prior activity, and one to begin the next activity. The natural solution to this is a join so that an output row can pull from its own row (for the start time) and the next changed row (for the end time).
However, my query accomplishes the need to make end times appear in two different rows by repeating the row twice, with CROSS JOIN (VALUES (1), (2)). We now have all our rows duplicated. The idea is that instead of using a JOIN to do calculation across columns, we'll use some form of aggregation to collapse each desired pair of rows into one.
The next task is to make each duplicate row split properly so that one instance goes with the prior pair and one with the next pair. This is accomplished with the T column, a ROW_NUMBER() ordered by Time, and then divided by 2 (though I changed it to a DENSE_RANK() for symmetry as in this case it returns the same value as ROW_NUMBER). For efficiency I performed the division in the next step so that the row number could be reused in another calculation (keep reading). Since row number starts at 1, and dividing by 2 implicitly converts to int, this has the effect of producing the sequence 0 1 1 2 2 3 3 4 4 ... which has the desired result: by grouping by this calculated value, since we also ordered by Num in the row number, we've now accomplished that all sets after the first one are comprised of a Num = 2 from the "prior" row, and a Num = 1 from the "next" row.
The next difficult task is figuring out a way to eliminate the rows we don't care about and somehow collapse the start time of a block into the same row as the end time of a block. What we want is a way to get each discrete set of Running or Walking to be given its own number so we can group by it. DENSE_RANK() is a natural solution, but a problem is that it pays attention to each value in the ORDER BY clause--we don't have syntax to do DENSE_RANK() OVER (PREORDER BY Time ORDER BY Name) so that the Time does not cause the RANK calculation to change except on each change in Name. After some thought I realized I could crib a bit from the logic behind Itzik Ben-Gan's grouped islands solution, and I figured out that the rank of the rows ordered by Time, subtracted from the rank of the rows partitioned by Name and ordered by Time, would yield a value that was the same for each row in the same group but different from other groups. The generic grouped islands technique is to create two calculated values that both ascend in lockstep with the rows such as 4 5 6 and 1 2 3, that when subtracted will yield the same value (in this example case 3 3 3 as the result of 4 - 1, 5 - 2, and 6 - 3). Note: I initially started with ROW_NUMBER() for my N calculation but it wasn't working. The correct answer was DENSE_RANK() though I am sorry to say I don't remember why I concluded this at the time, and I would have to dive in again to figure it out. But anyway, that is what T-N calculates: a number that can be grouped on to isolate each "island" of one status (either Running or Walking).
But this was not the end because there are some wrinkles. First of all, the "next" row in each group contains the incorrect values for Name, N, and T. We get around this by selecting, from each group, the value from the Num = 2 row when it exists (but if it doesn't, then we use the remaining value). This yields the expressions like CASE WHEN NUM = 2 THEN x END: this will properly weed out the incorrect "next" row values.
After some experimentation, I realized that it was not enough to group by T - N by itself, because both the Walking groups and the Running groups can have the same calculated value (in the case of my sample data provided up to 17, there are two T - N values of 6). But simply grouping by Name as well solves this problem. No group of either "Running" or "Walking" will have the same number of intervening values from the opposite type. That is, since the first group starts with "Running", and there are two "Walking" rows intervening before the next "Running" group, then the value for N will be 2 less than the value for T in that next "Running" group. I just realized that one way to think about this is that the T - N calculation counts the number of rows before the current row that do NOT belong to the same value "Running" or "Walking". Some thought will show that this is true: if we move on to the third "Running" group, it is only the third group by virtue of having a "Walking" group separating them, so it has a different number of intervening rows coming in before it, and due to it starting at a higher position, it is high enough so that the values cannot be duplicated.
Finally, since our final group consists of only one row (there is no end time and we need to display a NULL instead) I had to throw in a calculation that could be used to determine whether we had an end time or not. This is accomplished with the Min(Num) expression and then finally detecting that when the Min(Num) was 2 (meaning we did not have a "next" row) then display a NULL instead of the Max(ToTime) value.
I hope this explanation is of some use to people. I don't know if my "row-multiplying" technique will be generally useful and applicable to most SQL query writers in production environments, because of the difficulty of understanding it and the maintenance burden it will most certainly present to the next person visiting the code (the reaction is probably "What on earth is it doing!?!" followed by a quick "Time to rewrite!").
If you have made it this far then I thank you for your time and for indulging me in my little excursion into incredibly-fun-sql-puzzle-land.
See it For Yourself
A.k.a. simulating a "PREORDER BY":
One last note. To see how T - N does the job--and noting that using this part of my method may not be generally applicable to the SQL community--run the following query against the first 17 rows of the sample data:
WITH Ranks AS (
SELECT
T = Dense_Rank() OVER (ORDER BY Time),
N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time),
*
FROM
#Data D
)
SELECT
*,
T - N
FROM Ranks
ORDER BY
[Time];
This yields:
RecordId  Time  Name     T   N   T - N
--------  ----  -------  --  --  -----
1         10    Running  1   1   0
2         18    Running  2   2   0
3         21    Running  3   3   0
4         29    Walking  4   1   3
5         33    Walking  5   2   3
6         57    Running  6   4   2
7         66    Running  7   5   2
8         77    Running  8   6   2
9         81    Walking  9   3   6
10        89    Running  10  7   3
11        93    Walking  11  4   7
12        99    Running  12  8   4
13        107   Running  13  9   4
14        113   Walking  14  5   9
15        124   Walking  15  6   9
16        155   Walking  16  7   9
17        178   Running  17  10  7
The important part being that each group of "Walking" or "Running" has the same value for T - N that is distinct from any other group with the same name.
Performance
I don't want to belabor the point about my query being faster than other people's. However, given how striking the difference is (when there are no indexes) I wanted to show the numbers in a table format. This is a good technique when high performance of this kind of row-to-row correlation is needed.
Before each query ran, I used DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS;. I set MAXDOP to 1 for each query to remove the time-collapsing effects of parallelism. I selected each result set into variables instead of returning them to the client so as to measure only performance and not client data transmission. All queries were given the same ORDER BY clauses. All tests used 17,408 input rows yielding 8,193 result rows.
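As an illustration of that setup, a single timing run might have looked roughly like this, shown here with the OUTER APPLY version from above (a sketch, not the exact test script):
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

DECLARE @FromTime int, @ToTime int, @Name varchar(10);

SELECT
    @FromTime = Min(D.Time),
    @ToTime   = X.ToTime,
    @Name     = D.Name
FROM
    #Data D
    OUTER APPLY (
        SELECT TOP 1 ToTime = D2.[Time]
        FROM #Data D2
        WHERE D.[Time] < D2.[Time]
            AND D.[Name] <> D2.[Name]
        ORDER BY D2.[Time]
    ) X
GROUP BY
    X.ToTime, D.Name
ORDER BY
    Min(D.Time)
OPTION (MAXDOP 1);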
No results are displayed for the following people/reasons:
RichardTheKiwi: could not test (query needs updating)
ypercube: no SQL 2012 environment yet :)
Tim S: did not complete tests within 5 minutes
With no index:
                 CPU     Duration  Reads    Writes
                 ------  --------  -------  ------
ErikE            344     344       99       0
Simon Kingston   68672   69582     549203   49
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time);:
                 CPU     Duration  Reads    Writes
                 ------  --------  -------  ------
ErikE            328     336       99       0
Simon Kingston   70391   71291     549203   49      * basically not worse
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time, Name);:
                 CPU     Duration  Reads    Writes
                 ------  --------  -------  ------
ErikE            375     414       359      0       * IO WINNER
Simon Kingston   172     189       38273    0       * CPU WINNER
So the moral of the story is:
Appropriate Indexes Are More Important Than Query Wizardry
With the appropriate index, Simon Kingston's version wins overall, especially when including query complexity/maintainability.
Heed this lesson well! 38k reads is not really that many, and Simon Kingston's version ran in half the time mine did. The speed increase of my query was entirely due to there being no index on the table, and the concomitant catastrophic cost this imposed on any query needing a join (which mine didn't): a full-table-scan Hash Match killing its performance. With an index, his query was able to do a Nested Loop with a clustered index seek (a.k.a. a bookmark lookup), which made things really fast.
It is interesting that a clustered index on Time alone was not enough. Even though Times were unique, meaning only one Name occurred per time, it still needed Name to be part of the index in order to utilize it properly.
Adding the clustered index to the table when full of data took under 1 second! Don't neglect your indexes.
This will not work in SQL Server 2008; it needs SQL Server 2012 or later, which have the LAG() and LEAD() analytic functions, but I'll leave it here for anyone with newer versions:
SELECT Time AS FromTime
, LEAD(Time) OVER (ORDER BY Time) AS ToTime
, Name
FROM
( SELECT Time
, LAG(Name) OVER (ORDER BY Time) AS PreviousName
, Name
FROM Data
) AS tmp
WHERE PreviousName <> Name
OR PreviousName IS NULL ;
Tested in SQL-Fiddle
With an index on (Time, Name) it will need an index scan.
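For example (the index name is only illustrative):
CREATE INDEX IX_Data_Time_Name ON Data ([Time], Name);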
Edit:
If NULL is a valid value for Name that needs to be taken as a valid entry, use the following WHERE clause:
WHERE PreviousName <> Name
OR (PreviousName IS NULL AND Name IS NOT NULL)
OR (PreviousName IS NOT NULL AND Name IS NULL) ;
I think you're essentially interested in where the 'Name' changes from one record to the next (in order of 'Time'). If you can identify where this happens you can generate your desired output.
Since you mentioned CTEs I'm going to assume you're on SQL Server 2005+ and can therefore use the ROW_NUMBER() function. You can use ROW_NUMBER() as a handy way to identify consecutive pairs of records and then to find those where the 'Name' changes.
How about this:
WITH OrderedTable AS
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY Time) AS Ordinal
FROM
[YourTable]
),
NameChange AS
(
SELECT
after.Time AS Time,
after.Name AS Name,
ROW_NUMBER() OVER (ORDER BY after.Time) AS Ordinal
FROM
OrderedTable before
RIGHT JOIN OrderedTable after ON after.Ordinal = before.Ordinal + 1
WHERE
ISNULL(before.Name, '') <> after.Name
)
SELECT
before.Time AS FromTime,
after.Time AS ToTime,
before.Name
FROM
NameChange before
LEFT JOIN NameChange after ON after.Ordinal = before.Ordinal + 1
I assume that the RecordIDs are not always sequential, hence the CTE to create a non-breaking sequential number.
SQLFiddle
;with SequentiallyNumbered as (
select *, N = row_number() over (order by RecordId)
from Data)
, Tmp as (
select A.*, RN=row_number() over (order by A.Time)
from SequentiallyNumbered A
left join SequentiallyNumbered B on B.N = A.N-1 and A.name = B.name
where B.name is null)
select A.Time FromTime, B.Time ToTime, A.Name
from Tmp A
left join Tmp B on B.RN = A.RN + 1;
The dataset I used to test
create table Data (
RecordId int,
Time int,
Name varchar(10));
insert Data values
(1 ,10 ,'Running'),
(2 ,18 ,'Running'),
(3 ,21 ,'Running'),
(4 ,29 ,'Walking'),
(5 ,33 ,'Walking'),
(6 ,57 ,'Running'),
(7 ,66 ,'Running');
Here's a CTE solution that gets the results you're seeking:
;WITH TheRecords (FirstTime,SecondTime,[Name])
AS
(
SELECT [Time],
(
SELECT MIN([Time])
FROM ActivityTable at2
WHERE at2.[Time]>at.[Time]
AND at2.[Name]<>at.[Name]
),
[Name]
FROM ActivityTable at
)
SELECT MIN(FirstTime) AS FromTime,SecondTime AS ToTime,MIN([Name]) AS [Name]
FROM TheRecords
GROUP BY SecondTime
ORDER BY FromTime,ToTime

Getting a Subset of Records along with Total Record Count

I'm working on returning a recordset from SQL Server 2008 to do some pagination. I'm only returning 15 records at a time, but I need to have the total number of matches along with the subset of records. I've used two different queries with mixed results depending on where in the larger group I need to pull the subset. Here's a sample:
SET NOCOUNT ON;
WITH tempTable AS (
SELECT
FirstName
, LastName
, ROW_NUMBER() OVER(ORDER BY FirstName ASC) AS RowNumber
FROM People
WHERE
Active = 1
)
SELECT
tempTable.*
, (SELECT Max(RowNumber) FROM tempTable) AS Records
FROM tempTable
WHERE
RowNumber >= 1
AND RowNumber <= 15
ORDER BY
FirstName
This query works really fast when I'm returning items on the low end of matches, like records 1 through 15. However, when I start returning records 1000 - 1015, the processing will go from under a second to more than 15 seconds.
So I changed the query to the following instead:
SET NOCOUNT ON;
WITH tempTable AS (
SELECT * FROM (
SELECT
FirstName
, LastName
, ROW_NUMBER() OVER(ORDER BY FirstName ASC) AS RowNumber
, COUNT(*) OVER(PARTITION BY NULL) AS Records
FROM People
WHERE
Active = 1
) derived
WHERE RowNumber >= 1 AND RowNumber <= 15
)
SELECT
tempTable.*
FROM tempTable
ORDER BY
FirstName
That query returns the high-numbered pages in 2-3 seconds, but it also takes 2-3 seconds for the low-numbered pages. Because it computes the count across all 70,000+ rows, it makes every request take longer instead of just the ones for large row numbers.
So I need to figure out how to get a good row count, as well as return only a subset of items from any point in the resultset, without suffering such a huge penalty. I could handle a 2-3 second penalty for the high row numbers, but 15 seconds is too much, and I'm not willing to suffer slow loads on the first few pages a person views.
NOTE: I know that I don't need the CTE in the second example, but this is just a simple example. In production I'm doing further joins on the tempTable after I've filtered it down to the 15 rows I need.
Here is what I have done (and it's just as fast, no matter which records I return):
--Parameters include:
@pageNum int = 1,
@pageSize int = 0,
DECLARE
@pageStart int,
@pageEnd int
SELECT
@pageStart = @pageSize * @pageNum - (@pageSize - 1),
@pageEnd = @pageSize * @pageNum;
SET NOCOUNT ON;
WITH tempTable AS (
SELECT
ROW_NUMBER() OVER (ORDER BY FirstName ASC) AS RowNumber,
FirstName
, LastName
FROM People
WHERE Active = 1
)
SELECT
(SELECT COUNT(*) FROM tempTable) AS TotalRows,
*
FROM tempTable
WHERE @pageEnd = 0
OR RowNumber BETWEEN @pageStart AND @pageEnd
ORDER BY RowNumber
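If that body is wrapped in a stored procedure (the name below is hypothetical, since only the body is shown above), calling it looks like:
EXEC dbo.usp_GetPeoplePage @pageNum = 3, @pageSize = 15;  -- rows 31-45 plus TotalRows
EXEC dbo.usp_GetPeoplePage @pageNum = 1, @pageSize = 0;   -- @pageSize = 0 returns everything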
I've handled a somewhat similar situation in the past by not bothering to determine a definite row count, instead using the query plan to give me an estimated row count, a bit like the first item in this link describes:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=108658
The intention was then to deliver whatever rows were asked for within the range (say 900-915) and then return the estimated row count, like
rows 900-915 of approx. 990
which avoided having to count all the rows. Once the user moved beyond that point, I just showed
rows 1000-1015 of approx. 1015
i.e. just taking the last requested row as my new estimate.
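If you do want the optimizer's estimate programmatically rather than eyeballing the plan, one way (a sketch only, and not necessarily what the linked post describes) is to read EstimateRows from the cached showplan XML:
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP 1
    st.text,
    qp.query_plan.value('(//RelOp/@EstimateRows)[1]', 'float') AS EstimatedRows
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE st.text LIKE '%FROM People%'                 -- filter to the paging query of interest
  AND st.text NOT LIKE '%dm_exec_cached_plans%';   -- exclude this query itself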
