Is there a way you can produce an output like this in T-SQL - sql-server

I have a column which I translate the values using a case statements and I get numbers like this below. There are multiple columns I need to produce the result like this and this is just one column.
How do you produce the output as a whole like this below.
The 12 is the total numbers counting from top to bottom
49 is the Average.
4.08 is the division 49/12.
1 is how many 1's are there in the output list above. As you can see there is only one 1 in the output above
8.33% is the division and percentage comes from 1/12 * 100
and so on. Is there a way to produce this output below?
drop table test111
create table test111
(
Q1 nvarchar(max)
);
INSERT INTO TEST111(Q1)
VALUES('Strongly Agree')
,('Agree')
,('Disagree')
,('Strongly Disagree')
,('Strongly Agree')
,('Agree')
,('Disagree')
,('Neutral');
SELECT
CASE WHEN [Q1] = 'Strongly Agree' THEN 5
WHEN [Q1] = 'Agree' THEN 4
WHEN [Q1] = 'Neutral' THEN 3
WHEN [Q1] = 'Disagree' THEN 2
WHEN [Q1] = 'Strongly Disagree' THEN 1
END AS 'Test Q1'
FROM test111

I have to make a few assumptions here, but it looks like you want to treat an output column like a column in a spreadsheet. You have 12 numbers. You then have a blank "separator" row. Then a row with the number 12 (which is the count of how many numbers you have). Then a row with the number 49, which is the sum of those 12 numbers. Then the 4.08 row, which is rougly the average, and so on.
Some of these outputs can be provided by cube or rollup, but neither is a complete solution.
If you wanted to produce this output directly from TSQL, you would need to have multiple select statements and combine the results of all of those statements using union all. First you would have a select just to get the numbers. Then you would have a second select which outputs a "blank". Then another select which is providing a count. Then another select which is providing a sum. And so on.
You would also no longer be able to output actual numbers, since a "blank" is not a number. Visually it's best represented as an empty string. But now your output column has to be of datatype char or varchar.
You also have to make sure rows come out in the correct order for presentation. So you need a column to order by. You would have to add some kind of ordering column "manually" to each of the select statements, so when you union them all together you can tell SQL in what order the output should be provided.
So the answer to "can it be done?" is technically "yes". But if you think seems like a whole lot of laborious and inefficient TSQL work, you'd be right.
The real solution here is to change your approach. SQL should not be concerned with "output formatting". What you should do is just return the actual data (your 12 numbers) from SQL, and then do all of the additional presentation (like adding a blank row, adding a count row, etc), in the code of the program that is calling SQL to get that data.

I must say, this is one of the strangest T-SQL requirements I've seen, and is really best left to the presentation layer.
It is possible using GROUPING SETS though. We can use it to get an extra rollup row that aggregates the whole table.
Once you have the rollup, you need to unpivot the totalled row (identified by GROUPING() = 1) to get your final result. We can do this using CROSS APPLY.
This is impossible without a row-identifier. I have added ROW_NUMBER, but any primary or unique key will do.
WITH YourTable AS (
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rn,
CASE WHEN [Q1] = 'Strongly Agree' THEN 5
WHEN [Q1] = 'Agree' THEN 4
WHEN [Q1] = 'Neutral' THEN 3
WHEN [Q1] = 'Disagree' THEN 2
WHEN [Q1] = 'Strongly Disagree' THEN 1
END AS TestQ1
FROM test111
),
RolledUp AS (
SELECT
rn,
TestQ1,
grouping = GROUPING(TestQ1),
count = COUNT(*),
sum = SUM(TestQ1),
avg = AVG(TestQ1 * 1.0),
one = COUNT(CASE WHEN TestQ1 = 1 THEN 1 END),
onePct = COUNT(CASE WHEN TestQ1 = 1 THEN 1 END) * 1.0 / COUNT(*)
FROM YourTable
GROUP BY GROUPING SETS(
(rn, TestQ1),
()
)
)
SELECT v.TestQ1
FROM RolledUp r
CROSS APPLY (
SELECT r.TestQ1, 0 AS ordering
WHERE r.grouping = 0
UNION ALL
SELECT v.value, v.ordering
FROM (VALUES
(NULL , 1),
(r.count , 2),
(r.sum , 3),
(r.avg , 4),
(r.one , 5),
(r.onePct, 6)
) v(value, ordering)
WHERE r.grouping = 1
) v
ORDER BY
v.ordering,
r.rn;
db<>fiddle

Related

SQL to split a column values into rows in Netezza

I have data in the below way in a column. The data within the column is separated by two spaces.
4EG C6CC C6DE 6MM C6LL L3BC C3
I need to split it into as beloW. I tried using REGEXP_SUBSTR to do it but looks like it's not in the SQL toolkit. Any suggestions?
1. 4EG
2. C6CC
3. C6DE
4. 6MM
5. C6LL
6. L3BC
7. C3
This has ben answered here: http://nz2nz.blogspot.com/2016/09/netezza-transpose-delimited-string-into.html?m=1
Please note the comment at the button about the best performing way of use if array functions. I have measured the use of regexp_extract_all_sp() versus repeated regex matches and the benefit can be quite large
The examples from nz2nz.blogpost.com are hard to follow. I was able to piece together this method:
with
n_rows as (--update on your end
select row_number() over(partition by 1 order by some_field) as seq_num
from any_table_with_more_rows_than_delimited_values
)
, find_values as ( -- fake data
select 'A' as id, '10,20,30' as orig_values
union select 'B', '5,4,3,2,1'
)
select
id,
seq_num,
orig_values,
array_split(orig_values, ',') as array_list,
get_value_varchar(array_list, seq_num) as value
from
find_values
cross join n_rows
where
seq_num <= regexp_match_count(orig_values, ',') + 1 -- one row for each value in list
order by
id,
seq_num

Get a count based on the row order

I have a table with this structure
Create Table Example (
[order] INT,
[typeID] INT
)
With this data:
order|type
1 7
2 11
3 11
4 18
5 5
6 19
7 5
8 5
9 3
10 11
11 11
12 3
I need to get the count of each type based on the order, something like:
type|count
7 1
11 **2**
18 1
5 1
19 1
5 **2**
3 1
11 **2**
3 1
Context
Lets say that this table is about houses, so I have a list houses in an order. So I have
Order 1: A red house
2: A white house
3: A white house
4: A red house
5: A blue house
6: A blue house
7: A white house
So I need to show that info condensed. I need to say:
I have 1 red house
Then I have 2 white houses
Then I have 1 red house
Then I have 2 blue houses
Then I have 1 white house
So the count is based on the order. The DENSE_RANK function would help me if I were able to reset the RANK when the partition changes.
So I have an answer, but I have to warn you it's probably going to get some raised eyebrows because of how it's done. It uses something known as a "Quirky Update". If you plan to implement this, please for the love of god read through the linked article and understand that this is an "undocumented hack" which needs to be implemented precisely to avoid unintended consequences.
If you have a tiny bit of data, I'd just do it row by agonizing row for simplicity and clarity. However if you have a lot of data and still need high performance, this might do.
Requirements
Table must have a clustered index in the order you want to progress in
Table must have no other indexes (these might cause SQL to read the data from another index which is not in the correct order, causing the quantum superposition of row order to come collapsing down).
Table must be completely locked down during the operation (tablockx)
Update must progress in serial fashion (maxdop 1)
What it does
You know how people tell you there is no implicit order to the data in a table? That's still true 99% of the time. Except we know that ultimately it HAS to be stored on disk in SOME order. And it's that order that we're exploiting here. By forcing a clustered index update and the fact that you can assign variables in the same update statement that columns are updated, you can effectively scroll through the data REALLY fast.
Let's set up the data:
if object_id('tempdb.dbo.#t') is not null drop table #t
create table #t
(
_order int primary key clustered,
_type int,
_grp int
)
insert into #t (_order, _type)
select 1,7
union all select 2,11
union all select 3,11
union all select 4,18
union all select 5,5
union all select 6,19
union all select 7,5
union all select 8,5
union all select 9,3
union all select 10,11
union all select 11,11
union all select 12,3
Here's the update statement. I'll walk through each of the components below
declare #Order int, #Type int, #Grp int
update #t with (tablockx)
set #Order = _order,
#Grp = case when _order = 1 then 1
when _type != #Type then #grp + 1
else #Grp
end,
#Type = _type,
_grp = #Grp
option (maxdop 1)
Update is performed with (tablockx). If you're working with a temp table, you know there's no contention on the table, but still it's a good habit to get into (if using this approach can even be considered a good habit to get into at all).
Set #Order = _order. This looks like a pointless statement, and it kind of is. However since _order is the primary key of the table, assigning that to a variable is what forces SQL to perform a clustered index update, which is crucial to this working
Populate an integer to represent the sequential groups you want. This is where the magic happens, and you have to think about it in terms of it scrolling through the table. When _order is 1 (the first row), just set the #Grp variable to 1. If, on any given row, the column value of _type differs from the variable value of #type, we increment the grouping variable. If the values are the same, we just stick with the #Grp we have from the previous row.
Update the #Type variable with the column _type's value. Note this HAS to come after the assignment of #Grp for it to have the correct value.
Finally, set _grp = #Grp. This is where the actual column value is updated with the results of step 3.
All this must be done with option (maxdop 1). This means the Maximum Degree of Parallelism is set to 1. In other words, SQL cannot do any task parallelization which might lead to the ordering being off.
Now it's just a matter of grouping by the _grp field. You'll have a unique _grp value for each consecutive batch of _type.
Conclusion
If this seems bananas and hacky, it is. As with all things, you need to take this with a grain of salt, and I'd recommend really playing around with the concept to fully understand it if you plan to implement it because I guarantee nobody else is going to know how to troubleshoot it if you get a call in the middle of the night that it's breaking.
This solution is using a recursive CTE and is relying on a gapless order value. If you don't have this, you can create it with ROW_NUMBER() on the fly:
DECLARE #mockup TABLE([order] INT,[type] INT);
INSERT INTO #mockup VALUES
(1,7)
,(2,11)
,(3,11)
,(4,18)
,(5,5)
,(6,19)
,(7,5)
,(8,5)
,(9,3)
,(10,11)
,(11,11)
,(12,3);
WITH recCTE AS
(
SELECT m.[order]
,m.[type]
,1 AS IncCounter
,1 AS [Rank]
FROM #mockup AS m
WHERE m.[order]=1
UNION ALL
SELECT m.[order]
,m.[type]
,CASE WHEN m.[type]=r.[type] THEN r.IncCounter+1 ELSE 1 END
,CASE WHEN m.[type]<>r.[type] THEN r.[Rank]+1 ELSE r.[Rank] END
FROM #mockup AS m
INNER JOIN recCTE AS r ON m.[order]=r.[order]+1
)
SELECT recCTE.[type]
,MAX(recCTE.[IncCounter])
,recCTE.[Rank]
FROM recCTE
GROUP BY recCTE.[type], recCTE.[Rank];
The recursion is traversing down the line increasing the counter if the type is unchanged and increasing the rank if the type is different.
The rest is a simple GROUP BY
I thought I'd post another approach I worked out, I think more along the lines of the dense_rank() work others were thinking about. The only thing this assumes is that _order is a sequential integer (i.e. no gaps).
Same data setup as before:
if object_id('tempdb.dbo.#t') is not null drop table #t
create table #t
(
_order int primary key clustered,
_type int,
_grp int
)
insert into #t (_order, _type)
select 1,7
union all select 2,11
union all select 3,11
union all select 4,18
union all select 5,5
union all select 6,19
union all select 7,5
union all select 8,5
union all select 9,3
union all select 10,11
union all select 11,11
union all select 12,3
What this approach does is row_number each _type so that regardless of where a _type exists, and how many times, the types will have a unique row_number in the order of the _order field. By subtracting that type-specific row number from the global row number (i.e. _order), you'll end up with groups. Here's the code for this one, then I'll walk through this as well.
;with tr as
(
select
-- Create an incrementing integer row_number over each _type (regardless of it's position in the sequence)
_type_rid = row_number() over (partition by _type order by _order),
-- This shows that on rows 6-8 (the transition between type 19 and 5), naively they're all assigned the same group
naive_type_rid = _order - row_number() over (partition by _type order by _order),
-- By adding a value to the type_rid which is a function of _type, those two values are distinct.
-- Originally I just added the value, but I think squaring it ensures that there can't ever be another gap of 1
true_type_rid = (_order - row_number() over (partition by _type order by _order)) + power(_type, 2),
_type,
_order
from #t
-- order by _order -- uncomment this if you want to run the inner select separately
)
select
_grp = dense_rank() over (order by max(_order)),
_type = max(_type)
from tr
group by true_type_rid
order by max(_order)
What's Going On
First things first; I didn't have to create a separate column in the src cte to return _type_rid. I did that mostly for troubleshooting and clarity. Secondly, I also didn't really have to do a second dense_rank on the final selection for the column _grp. I just did that so it matched exactly the results from my other approach.
Within each type, type_rid is unique, and increments by 1. _order also increments by one. So as long as a given type is chugging along, gapped by only 1, _order - _type_rid will be the same value. Let's look at a couple examples (This is the result of the src cte, ordered by _order):
_type_rid naive_type_rid true_type_rid _type _order
-------------------- -------------------- -------------------- ----------- -----------
1 8 17 3 9
2 10 19 3 12
1 4 29 5 5
2 5 30 5 7
3 5 30 5 8
1 0 49 7 1
1 1 122 11 2
2 1 122 11 3
3 7 128 11 10
4 7 128 11 11
1 3 327 18 4
1 5 366 19 6
First row, _order - _type_rid = 1 - 1 = 0. This assigns this row (type 7) to group 0
Second row, 2 - 1 = 1. This assigns type 11 to group 1
Third row, 3 - 2 = 1. This assigns the second sequential type 11 to group 1 also
Forth row, 4 - 1 = 3. This assigns type 18 to group 3
... and so forth.
The groups aren't sequential, but they ARE in the same order as _order which is the important part. You'll also notice I added the value of _type to that value as well. That's because when we hit some of the later rows, groups switched, but the sequence was still incremented by 1. By adding _type, we can differentiate those off-by-one values and still do it in the right order as well.
The final outer select from src orders by the max(_order) (in both my unnecessary dense_rank() _grp modification, and just the general result order).
Conclusion
This is still a little wonky, but definitely well within the bounds of "supported functionality". Given that I ran into one gotcha in there (the off-by-one thing), there might be others I haven't considered, so again, take that with a grain of salt, and do some testing.

Calculate a Recursive Rolling Average in SQL Server

We are attempting to calculate a rolling average and have tried to convert numerous SO answers to solve the problem. To this point we are still unsuccessful.
What we've tried:
Here are some of the SO answers we have considered.
SQL Server: How to get a rolling sum over 3 days for different customers within same table
SQL Query for 7 Day Rolling Average in SQL Server
T-SQL calculate moving average
Our latest attempt has been to modify one of the solutions (#4) found here.
https://www.red-gate.com/simple-talk/sql/t-sql-programming/calculating-values-within-a-rolling-window-in-transact-sql/
Example:
Here is an example in SQL Fiddle: http://sqlfiddle.com/#!6/4570a/17
In the fiddle, we are still trying to get the SUM to work right but ultimately we are trying to get the average.
The end goal
Using the Fiddle example, we need to find the difference between Value1 and ComparisonValue1 and present it as Diff1. When a row has no Value1 available, we need to estimate it by taking the average of the last two Diff1 values and then add it to the ComparisonValue1 for that row.
With the correct query, the result would look like this:
GroupID Number ComparisonValue1 Diff1 Value1
5 10 54.78 2.41 57.19
5 11 55.91 2.62 58.53
5 12 55.93 2.78 58.71
5 13 56.54 2.7 59.24
5 14 56.14 2.74 58.88
5 15 55.57 2.72 58.29
5 16 55.26 2.73 57.99
Question: is it possible to calculate this average when it could potentially factor into the average of the following rows?
Update:
Added a VIEW to the Fiddle schema to simplify the final query.
Updated the query to include the new rolling average for Diff1 (column Diff1Last2Avg). This rolling average works great until we run into nulls in the Value1 column. This is where we need to insert the estimate.
Updated the query to include the estimate that should be used when there is no Value1 (column Value1Estimate). This is working great and would be perfect if we could use the estimate in place of NULL in the Value1 column. Since the Diff1 column reflects the difference between Value1 (or its estimate) and ComparisonValue1, including the Estimate would fill in all the NULL values in Diff1. This in turn would continue to allow the Estimates of future rows to be calculated. It gets confusing at this point, but still hacking away at it. Any ideas?
Credit for the idea goes to this answer: https://stackoverflow.com/a/35152131/6305294 from #JesúsLópez
I have included comments in the code to explain it.
UPDATE
I have corrected the query based on comments.
I have swapped numbers in minuend and subtrahend to get difference as a positive number.
Removed Diff2Ago column.
Results of the query now exactly match your sample output.
;WITH cte AS
(
-- This is similar to your ItemWithComparison view
SELECT i.Number, i.Value1, i2.Value1 AS ComparisonValue1,
-- Calculated Differences; NULL will be returned when i.Value1 is NULL
CONVERT( DECIMAL( 10, 3 ), i.Value1 - i2.Value1 ) AS Diff
FROM Item AS i
LEFT JOIN [Group] AS G ON g.ID = i.GroupID
LEFT JOIN Item AS i2 ON i2.GroupID = g.ComparisonGroupID AND i2.Number = i.Number
WHERE NOT i2.Id IS NULL
),
cte2 AS(
/*
Start with the first number
Note if you do not have at least 2 consecutive numbers (in cte) with non-NULL Diff value and therefore Diff1Ago or Diff2Ago are NULL then everything else will not work;
You may need to add additional logic to handle these cases */
SELECT TOP 1 -- start with the 1st number (see ORDER BY)
a.Number, a.Value1, a.ComparisonValue1, a.Diff, b.Diff AS Diff1Ago
FROM cte AS a
-- "1 number ago"
LEFT JOIN cte AS b ON a.Number - 1 = b.Number
WHERE NOT a.Value1 IS NULL
ORDER BY a.Number
UNION ALL
SELECT b.Number, b.Value1, b.ComparisonValue1,
( CASE
WHEN NOT b.Value1 IS NULL THEN b.Diff
ELSE CONVERT( DECIMAL( 10, 3 ), ( a.Diff + a.Diff1Ago ) / 2.0 )
END ) AS Diff,
a.Diff AS Diff1Ago
FROM cte2 AS a
INNER JOIN cte AS b ON a.Number + 1 = b.Number
)
SELECT *, ( CASE WHEN Value1 IS NULL THEN ComparisonValue1 + Diff ELSE Value1 END ) AS NewValue1
FROM cte2 OPTION( MAXRECURSION 0 );
Limitations:
this solution works well only when you need to consider small number of preceding values.

How to select Second Last Row in mySql?

I want to retrieve the 2nd last row result and I have seen this question:
How can I retrieve second last row?
but it uses order by which in my case does not work because the Emp_Number Column contains number of rows and date time stamp that mixes data if I use order by .
The rows 22 and 23 contain the total number of rows (excluding row 21 and 22) and the time and day it got entered respectively.
I used this query which returns the required result 21 but if this number increases it will cause an error.
SELECT TOP 1 *
FROM(
SELECT TOP 2 *
FROM DAT_History
ORDER BY Emp_Number ASC
) t
ORDER BY Emp_Number desc
Is there any way to get the 2nd last row value without using the Order By function?
There is no guarantee that the count will be returned in the one-but-last row, as there is no definite order defined. Even if those records were written in the correct order, the engine is free to return the records in any order, unless you specify an order by clause. But apparently you don't have a column to put in that clause to reproduce the intended order.
I propose these solutions:
1. Return the minimum of those values that represent positive integers
select min(Emp_Number * 1)
from DAT_history
where Emp_Number not regexp '[^0-9]'
See SQL Fiddle
This will obviously fail when the count is larger then the smallest employee number. But seeing the sample data, that would represent a number of records that is maybe not expected...
2. Count the records, ignoring the 2 aggregated records
select count(*)-2
from DAT_history
See SQL Fiddle
3. Relying on correct order without order by
As explained at the start, you cannot rely on the order, but if for some reason you still want to rely on this, you can use a variable to number the rows in a sub query, and then pick out the one that has been attributed the one-but-last number:
select Emp_Number * 1
from (select Emp_Number,
#rn := #rn + 1 rn
from DAT_history,
(select #rn := 0) init
) numbered
where rn = #rn - 1
See SQL Fiddle
The * 1 is added to convert the text to a number data type.
This is not a perfect solution. I am making some assumptions for this. Check if this could work for you.
;WITH cte
AS (SELECT emp_number,
Row_number()
OVER (
ORDER BY emp_number ASC) AS rn
FROM dat_history
WHERE Isdate(emp_number) = 0) --Omit date entries
SELECT emp_number
FROM cte
WHERE rn = 1 -- select the minimum entry, assuming it would be the count and assuming count might not exceed the emp number range of 9888000

How Can I Detect and Bound Changes Between Row Values in a SQL Table?

I have a table which records values over time, similar to the following:
RecordId Time Name
========================
1 10 Running
2 18 Running
3 21 Running
4 29 Walking
5 33 Walking
6 57 Running
7 66 Running
After querying this table, I need a result similar to the following:
FromTime ToTime Name
=========================
10 29 Running
29 57 Walking
57 NULL Running
I've toyed around with some of the aggregate functions (e.g. MIN, MAX, etc.), PARTITION and CTEs, but I can't seem to hit upon the right solution. I'm hoping a SQL guru can give me a hand, or at least point me in the right direction. Is there a fairly straightforward way to query this (preferrably without a cursor?)
Finding "ToTime" By Aggregates Instead of a Join
I would like to share a really wild query that only takes 1 scan of the table with 1 logical read. By comparison, the best other answer on the page, Simon Kingston's query, takes 2 scans.
On a very large set of data (17,408 input rows, producing 8,193 result rows) it takes CPU 574 and time 2645, while Simon Kingston's query takes CPU 63,820 and time 37,108.
It's possible that with indexes the other queries on the page could perform many times better, but it is interesting to me to achieve 111x CPU improvement and 14x speed improvement just by rewriting the query.
(Please note: I mean no disrespect at all to Simon Kingston or anyone else; I am simply excited about my idea for this query panning out so well. His query is better than mine as its performance is plenty and it actually is understandable and maintainable, unlike mine.)
Here is the impossible query. It is hard to understand. It was hard to write. But it is awesome. :)
WITH Ranks AS (
SELECT
T = Dense_Rank() OVER (ORDER BY Time, Num),
N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time, Num),
*
FROM
#Data D
CROSS JOIN (
VALUES (1), (2)
) X (Num)
), Items AS (
SELECT
FromTime = Min(Time),
ToTime = Max(Time),
Name = IsNull(Min(CASE WHEN Num = 2 THEN Name END), Min(Name)),
I = IsNull(Min(CASE WHEN Num = 2 THEN T - N END), Min(T - N)),
MinNum = Min(Num)
FROM
Ranks
GROUP BY
T / 2
)
SELECT
FromTime = Min(FromTime),
ToTime = CASE WHEN MinNum = 2 THEN NULL ELSE Max(ToTime) END,
Name
FROM Items
GROUP BY
I, Name, MinNum
ORDER BY
FromTime
Note: This requires SQL 2008 or up. To make it work in SQL 2005, change the VALUES clause to SELECT 1 UNION ALL SELECT 2.
Updated Query
After thinking about this a bit, I realized that I was accomplishing two separate logical tasks at the same time, and this made the query unnecessarily complicated: 1) prune out intermediate rows that have no bearing on the final solution (rows that do not begin a new task) and 2) pull the "ToTime" value from the next row. By performing #1 before #2, the query is simpler and performs with approximately half the CPU!
So here is the simplified query that first, trims out the rows we don't care about, then gets the ToTime value using aggregates rather than a JOIN. Yes, it does have 3 windowing functions instead of 2, but ultimately because of the fewer rows (after pruning those we don't care about) it has less work to do:
WITH Ranks AS (
SELECT
Grp =
Row_Number() OVER (ORDER BY Time)
- Row_Number() OVER (PARTITION BY Name ORDER BY Time),
[Time], Name
FROM #Data D
), Ranges AS (
SELECT
Result = Row_Number() OVER (ORDER BY Min(R.[Time]), X.Num) / 2,
[Time] = Min(R.[Time]),
R.Name, X.Num
FROM
Ranks R
CROSS JOIN (VALUES (1), (2)) X (Num)
GROUP BY
R.Name, R.Grp, X.Num
)
SELECT
FromTime = Min([Time]),
ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Time]) END,
Name = IsNull(Min(CASE WHEN Num = 2 THEN Name ELSE NULL END), Min(Name))
FROM Ranges R
WHERE Result > 0
GROUP BY Result
ORDER BY FromTime;
This updated query has all the same issues as I presented in my explanation, however, they are easier to solve because I am not dealing with the extra unneeded rows. I also see that the Row_Number() / 2 value of 0 I had to exclude, and I am not sure why I didn't exclude it from the prior query, but in any case this works perfectly and is amazingly fast!
Outer Apply Tidies Things Up
Last, here is a version basically identical to Simon Kingston's query that I think is an easier to understand syntax.
SELECT
FromTime = Min(D.Time),
X.ToTime,
D.Name
FROM
#Data D
OUTER APPLY (
SELECT TOP 1 ToTime = D2.[Time]
FROM #Data D2
WHERE
D.[Time] < D2.[Time]
AND D.[Name] <> D2.[Name]
ORDER BY D2.[Time]
) X
GROUP BY
X.ToTime,
D.Name
ORDER BY
FromTime;
Here's the setup script if you want to do performance comparison on a larger data set:
CREATE TABLE #Data (
RecordId int,
[Time] int,
Name varchar(10)
);
INSERT #Data VALUES
(1, 10, 'Running'),
(2, 18, 'Running'),
(3, 21, 'Running'),
(4, 29, 'Walking'),
(5, 33, 'Walking'),
(6, 57, 'Running'),
(7, 66, 'Running'),
(8, 77, 'Running'),
(9, 81, 'Walking'),
(10, 89, 'Running'),
(11, 93, 'Walking'),
(12, 99, 'Running'),
(13, 107, 'Running'),
(14, 113, 'Walking'),
(15, 124, 'Walking'),
(16, 155, 'Walking'),
(17, 178, 'Running');
GO
insert #data select recordid + (select max(recordid) from #data), time + (select max(time) +25 from #data), name from #data
GO 10
Explanation
Here is the basic idea behind my query.
The times that represent a switch have to appear in two adjacent rows, one to end the prior activity, and one to begin the next activity. The natural solution to this is a join so that an output row can pull from its own row (for the start time) and the next changed row (for the end time).
However, my query accomplishes the need to make end times appear in two different rows by repeating the row twice, with CROSS JOIN (VALUES (1), (2)). We now have all our rows duplicated. The idea is that instead of using a JOIN to do calculation across columns, we'll use some form of aggregation to collapse each desired pair of rows into one.
The next task is to make each duplicate row split properly so that one instance goes with the prior pair and one with the next pair. This is accomplished with the T column, a ROW_NUMBER() ordered by Time, and then divided by 2 (though I changed it do a DENSE_RANK() for symmetry as in this case it returns the same value as ROW_NUMBER). For efficiency I performed the division in the next step so that the row number could be reused in another calculation (keep reading). Since row number starts at 1, and dividing by 2 implicitly converts to int, this has the effect of producing the sequence 0 1 1 2 2 3 3 4 4 ... which has the desired result: by grouping by this calculated value, since we also ordered by Num in the row number, we've now accomplished that all sets after the first one are comprised of a Num = 2 from the "prior" row, and a Num = 1 from the "next" row.
The next difficult task is figuring out a way to eliminate the rows we don't care about and somehow collapse the start time of a block into the same row as the end time of a block. What we want is a way to get each discrete set of Running or Walking to be given its own number so we can group by it. DENSE_RANK() is a natural solution, but a problem is that it pays attention to each value in the ORDER BY clause--we don't have syntax to do DENSE_RANK() OVER (PREORDER BY Time ORDER BY Name) so that the Time does not cause the RANK calculation to change except on each change in Name. After some thought I realized I could crib a bit from the logic behind Itzik Ben-Gan's grouped islands solution, and I figured out that the rank of the rows ordered by Time, subtracted from the rank of the rows partitioned by Name and ordered by Time, would yield a value that was the same for each row in the same group but different from other groups. The generic grouped islands technique is to create two calculated values that both ascend in lockstep with the rows such as 4 5 6 and 1 2 3, that when subtracted will yield the same value (in this example case 3 3 3 as the result of 4 - 1, 5 - 2, and 6 - 3). Note: I initially started with ROW_NUMBER() for my N calculation but it wasn't working. The correct answer was DENSE_RANK() though I am sorry to say I don't remember why I concluded this at the time, and I would have to dive in again to figure it out. But anyway, that is what T-N calculates: a number that can be grouped on to isolate each "island" of one status (either Running or Walking).
But this was not the end because there are some wrinkles. First of all, the "next" row in each group contains the incorrect values for Name, N, and T. We get around this by selecting, from each group, the value from the Num = 2 row when it exists (but if it doesn't, then we use the remaining value). This yields the expressions like CASE WHEN NUM = 2 THEN x END: this will properly weed out the incorrect "next" row values.
After some experimentation, I realized that it was not enough to group by T - N by itself, because both the Walking groups and the Running groups can have the same calculated value (in the case of my sample data provided up to 17, there are two T - N values of 6). But simply grouping by Name as well solves this problem. No group of either "Running" or "Walking" will have the same number of intervening values from the opposite type. That is, since the first group starts with "Running", and there are two "Walking" rows intervening before the next "Running" group, then the value for N will be 2 less than the value for T in that next "Running" group. I just realized that one way to think about this is that the T - N calculation counts the number of rows before the current row that do NOT belong to the same value "Running" or "Walking". Some thought will show that this is true: if we move on to the third "Running" group, it is only the third group by virtue of having a "Walking" group separating them, so it has a different number of intervening rows coming in before it, and due to it starting at a higher position, it is high enough so that the values cannot be duplicated.
Finally, since our final group consists of only one row (there is no end time and we need to display a NULL instead) I had to throw in a calculation that could be used to determine whether we had an end time or not. This is accomplished with the Min(Num) expression and then finally detecting that when the Min(Num) was 2 (meaning we did not have a "next" row) then display a NULL instead of the Max(ToTime) value.
I hope this explanation is of some use to people. I don't know if my "row-multiplying" technique will be generally useful and applicable to most SQL query writers in production environments because of the difficulty understanding it and and the difficulty of maintenance it will most certainly present to the next person visiting the code (the reaction is probably "What on earth is it doing!?!" followed by a quick "Time to rewrite!").
If you have made it this far then I thank you for your time and for indulging me in my little excursion into incredibly-fun-sql-puzzle-land.
See it For Yourself
A.k.a. simulating a "PREORDER BY":
One last note. To see how T - N does the job--and noting that using this part of my method may not be generally applicable to the SQL community--run the following query against the first 17 rows of the sample data:
WITH Ranks AS (
SELECT
T = Dense_Rank() OVER (ORDER BY Time),
N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time),
*
FROM
#Data D
)
SELECT
*,
T - N
FROM Ranks
ORDER BY
[Time];
This yields:
RecordId Time Name T N T - N
----------- ---- ---------- ---- ---- -----
1 10 Running 1 1 0
2 18 Running 2 2 0
3 21 Running 3 3 0
4 29 Walking 4 1 3
5 33 Walking 5 2 3
6 57 Running 6 4 2
7 66 Running 7 5 2
8 77 Running 8 6 2
9 81 Walking 9 3 6
10 89 Running 10 7 3
11 93 Walking 11 4 7
12 99 Running 12 8 4
13 107 Running 13 9 4
14 113 Walking 14 5 9
15 124 Walking 15 6 9
16 155 Walking 16 7 9
17 178 Running 17 10 7
The important part being that each group of "Walking" or "Running" has the same value for T - N that is distinct from any other group with the same name.
Performance
I don't want to belabor the point about my query being faster than other people's. However, given how striking the difference is (when there are no indexes) I wanted to show the numbers in a table format. This is a good technique when high performance of this kind of row-to-row correlation is needed.
Before each query ran, I used DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS;. I set MAXDOP to 1 for each query to remove the time-collapsing effects of parallelism. I selected each result set into variables instead of returning them to the client so as to measure only performance and not client data transmission. All queries were given the same ORDER BY clauses. All tests used 17,408 input rows yielding 8,193 result rows.
No results are displayed for the following people/reasons:
RichardTheKiwi *Could not test--query needs updating*
ypercube *No SQL 2012 environment yet :)*
Tim S *Did not complete tests within 5 minutes*
With no index:
CPU Duration Reads Writes
----------- ----------- ----------- -----------
ErikE 344 344 99 0
Simon Kingston 68672 69582 549203 49
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time);:
CPU Duration Reads Writes
----------- ----------- ----------- -----------
ErikE 328 336 99 0
Simon Kingston 70391 71291 549203 49 * basically not worse
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time, Name);:
CPU Duration Reads Writes
----------- ----------- ----------- -----------
ErikE 375 414 359 0 * IO WINNER
Simon Kingston 172 189 38273 0 * CPU WINNER
So the moral of the story is:
Appropriate Indexes Are More Important Than Query Wizardry
With the appropriate index, Simon Kingston's version wins overall, especially when including query complexity/maintainability.
Heed this lesson well! 38k reads is not really that many, and Simon Kingston's version ran in half the time as mine. The speed increase of my query was entirely due to there being no index on the table, and the concomitant catastrophic cost this gave to any query needing a join (which mine didn't): a full table scan Hash Match killing its performance. With an index, his query was able to do a Nested Loop with a clustered index seek (a.k.a. a bookmark lookup) which made things really fast.
It is interesting that a clustered index on Time alone was not enough. Even though Times were unique, meaning only one Name occurred per time, it still needed Name to be part of the index in order to utilize it properly.
Adding the clustered index to the table when full of data took under 1 second! Don't neglect your indexes.
This will not work in SQL Server 2008, only in SQL Server 2012 version that has the LAG() and LEAD() analytic functions, but I'll leave it here for anyone with newer versions:
SELECT Time AS FromTime
, LEAD(Time) OVER (ORDER BY Time) AS ToTime
, Name
FROM
( SELECT Time
, LAG(Name) OVER (ORDER BY Time) AS PreviousName
, Name
FROM Data
) AS tmp
WHERE PreviousName <> Name
OR PreviousName IS NULL ;
Tested in SQL-Fiddle
With an index on (Time, Name) it will need an index scan.
Edit:
If NULL is a valid value for Name that needs to be taken as a valid entry, use the following WHERE clause:
WHERE PreviousName <> Name
OR (PreviousName IS NULL AND Name IS NOT NULL)
OR (PreviousName IS NOT NULL AND Name IS NULL) ;
I think you're essentially interested in where the 'Name' changes from one record to the next (in order of 'Time'). If you can identify where this happens you can generate your desired output.
Since you mentioned CTEs I'm going to assume you're on SQL Server 2005+ and can therefore use the ROW_NUMBER() function. You can use ROW_NUMBER() as a handy way to identify consecutive pairs of records and then to find those where the 'Name' changes.
How about this:
WITH OrderedTable AS
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY Time) AS Ordinal
FROM
[YourTable]
),
NameChange AS
(
SELECT
after.Time AS Time,
after.Name AS Name,
ROW_NUMBER() OVER (ORDER BY after.Time) AS Ordinal
FROM
OrderedTable before
RIGHT JOIN OrderedTable after ON after.Ordinal = before.Ordinal + 1
WHERE
ISNULL(before.Name, '') <> after.Name
)
SELECT
before.Time AS FromTime,
after.Time AS ToTime,
before.Name
FROM
NameChange before
LEFT JOIN NameChange after ON after.Ordinal = before.Ordinal + 1
I assume that the RecordIDs are not always sequential, hence the CTE to create a non-breaking sequential number.
SQLFiddle
;with SequentiallyNumbered as (
select *, N = row_number() over (order by RecordId)
from Data)
, Tmp as (
select A.*, RN=row_number() over (order by A.Time)
from SequentiallyNumbered A
left join SequentiallyNumbered B on B.N = A.N-1 and A.name = B.name
where B.name is null)
select A.Time FromTime, B.Time ToTime, A.Name
from Tmp A
left join Tmp B on B.RN = A.RN + 1;
The dataset I used to test
create table Data (
RecordId int,
Time int,
Name varchar(10));
insert Data values
(1 ,10 ,'Running'),
(2 ,18 ,'Running'),
(3 ,21 ,'Running'),
(4 ,29 ,'Walking'),
(5 ,33 ,'Walking'),
(6 ,57 ,'Running'),
(7 ,66 ,'Running');
Here's a CTE solution that gets the results you're seeking:
;WITH TheRecords (FirstTime,SecondTime,[Name])
AS
(
SELECT [Time],
(
SELECT MIN([Time])
FROM ActivityTable at2
WHERE at2.[Time]>at.[Time]
AND at2.[Name]<>at.[Name]
),
[Name]
FROM ActivityTable at
)
SELECT MIN(FirstTime) AS FromTime,SecondTime AS ToTime,MIN([Name]) AS [Name]
FROM TheRecords
GROUP BY SecondTime
ORDER BY FromTime,ToTime

Resources