Complex grouping algorithm with combinations in SQL Server

I have a complex grouping problem. I need to solve it on SQL Server 2005, but a solution that works on a more recent release is OK (we will upgrade soon).
Test table:
CREATE TABLE [dbo].[testGrouping](
    [Family] [varchar](50) NOT NULL,
    [Person] [varchar](50) NOT NULL,
    [transNr] [int] NOT NULL,
    [Amount] [numeric](6, 2) NOT NULL,
    [ExpectedGroup] [int] NULL
)
Test data:
INSERT INTO [testGrouping]([Family],[Person],[transNr],[Amount],[ExpectedGroup])
SELECT 'f1','p1',1, 10.00,1
union SELECT 'f1','p1',2 , -9.00,1
union SELECT 'f1','p2',3 , -1.00,1
union SELECT 'f2','p3',4 , 50.00,2
union SELECT 'f2','p4',5 ,-50.01,2
union SELECT 'f2','p5',6 ,-30.00,3
union SELECT 'f2','p5',7 , 20.00,3
union SELECT 'f2','p5',8 , 10.00,3
union SELECT 'f3','p7',9 , -1.00,4
union SELECT 'f3','p7',10, -2.00,4
union SELECT 'f3','p7',11, -6.00,null
union SELECT 'f3','p9',12, 2.00,null
union SELECT 'f3','p7',13, 3.00,4
union SELECT 'f2','p6',14,100.00,null
Now, the problem: ExpectedGroup starts out null, and I must fill it in with my code.
The requirement is to identify groups of records with ABS(sum(amount)) <= 0.01
In detail:
I can group 2 or more records of the same "person".
After grouping by persons, I can search for groups within the same "family".
Each record can belong to 1 group only.
Each person can belong to 1 family only.
Records that cannot be grouped have group = null.
A group can have more than 2 records (and that's the real challenge!).
In the real data each "Family" can have up to 200 records, and each "Person" can have up to 10 records.
Amount is always <> 0
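As a quick sanity check of the sample data against that tolerance, every assigned group should net to nearly zero, so a query like the following should return no rows:
select ExpectedGroup, sum(Amount) as total
from testGrouping
where ExpectedGroup is not null
group by ExpectedGroup
having abs(sum(Amount)) > 0.01;  -- every group must satisfy ABS(SUM(Amount)) <= 0.01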
Explanation of grouping in sample data:
Group 1:
Includes all the records of family f1, because no partial combination of persons in that family has ABS(sum(amount)) <= 0.01.
Group 2:
Persons p3 and p4 have ABS(sum(amount)) <= 0.01.
Group 3:
Person p5 has ABS(sum(amount)) <= 0.01. So family f2 is divided into 2 groups, and a single record (transNr 14) has no group.
Group 4:
In family f3 you could group across persons (for example transNr 10 and 12, which sum to zero), but there is a group that belongs to Person p7 only (transNr 9, 10 and 13), and person-level groups have higher priority.
I could easily find groups like Group 1
select family, sum(amount) from testGrouping group by Family HAVING ABS(sum(amount)) <= 0.01
and also group 3
select person, sum(amount) from testGrouping group by Person HAVING ABS(sum(amount)) <= 0.01
But other cases are trickier (see family f2: there are several ways to construct groups there; grouping p5 is trivial, but the other records are not so easy).
My idea in pseudocode is:
-- process the easy cases...
group by person, set a group number to persons having ABS(sum(amount)) <= 0.01
group by family, set a group number to families having ABS(sum(amount)) <= 0.01
-- process the remaining records
For each person
    Generate all combinations of not-grouped records of that person
    For each combination of records
        IF ABS(sum(amount)) of the combination <= 0.01 THEN
            Assign a group to the records of the combination
            Recalculate the combinations (we have fewer records to work with)
        END IF
    Next combination
Next person
For each family
    Generate all combinations of not-grouped records of that family
    For each combination of records
        IF ABS(sum(amount)) of the combination <= 0.01 THEN
            Assign a group to the records of the combination
            Recalculate the combinations (we have fewer records to work with)
        END IF
    Next combination
Next family
(at each step I can use only records not assigned to a group in previous steps; each "For each" translates into a cursor)
My questions are:
Can you suggest a better algorithm? (The solution must be SQL only, but pseudocode is fine to describe it.) I think my pseudocode translates into spaghetti code of nested loops, cursors, GOTOs and other ugly constructs. (Performance is not critical; a few minutes to process about 10,000 records is acceptable.)
How can I implement the "generate all combinations" part? In the sample, for family f2 I should try all the possible groups of 2 records, then all the possible groups of 3, and so on, until all the combinations have been tested. transNr is the unique record ID.
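Here is a minimal sketch of the "generate all combinations" step for a single person, assuming the testGrouping table above; a recursive CTE (available on SQL Server 2005) grows subsets in transNr order, so each subset is enumerated exactly once, carrying a running sum and the member list:
;with candidates as (
    -- ungrouped records of one person (loop over persons/families externally)
    select transNr, Amount
    from testGrouping
    where Person = 'p5'                -- hypothetical parameter
      and ExpectedGroup is null
), combos as (
    -- anchor: every single record starts a subset
    select transNr as lastNr,
           cast(Amount as numeric(18, 2)) as total,
           cast(transNr as varchar(4000)) as members,
           1 as cnt
    from candidates
    union all
    -- recursion: extend each subset only with higher transNr values
    select c.transNr,
           cast(m.total + c.Amount as numeric(18, 2)),
           cast(m.members + ',' + cast(c.transNr as varchar(10)) as varchar(4000)),
           m.cnt + 1
    from combos m
    inner join candidates c on c.transNr > m.lastNr
)
select members, total
from combos
where cnt >= 2
  and abs(total) <= 0.01;
Each qualifying row lists one candidate group; after assigning a group, re-run against the shrunken candidate set, as in the pseudocode above. For a whole family the same pattern applies, but with up to 200 records the combination count explodes, so aggressive pruning would be needed.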

Related

TSQL to choose a record that meets the criteria or first one

I have a table for company phone numbers and one of the columns is IsPrimary which is a boolean type. The table looks like this:
CompanyId | AreaCode | PhoneNumber | IsPrimary
123       | 212      | 555-1212    | 0
234       | 307      | 555-1234    | 1
234       | 307      | 555-4321    | 0
As you can see in the first record, even though the phone number is the only one for CompanyId: 123, it's not marked as the primary.
In such cases, I want my SELECT statement to return the first available number for that company.
My current SELECT statement looks like this which does NOT return a number unless it's set as the primary number.
SELECT *
FROM CompanyPhoneNumbers AS t
WHERE t.IsPrimary = 1
How can I modify this SELECT statement so that it includes the phone number for CompanyId: 123?
The query might be different depending on what you are actually up to.
If you already have the CompanyId and only need the phone number for it, that's easy:
select top (1) pn.*
from dbo.CompanyPhoneNumbers pn
where pn.CompanyId = @CompanyId -- a parameter provided externally, by calling code for instance
order by pn.IsPrimary desc;
However, if you need all companies' data, including one of their phones (for example, you might be going to create a view for this), then you need a correlated subquery:
select c.*, oa.*
from dbo.Companies c
outer apply (
    select top (1) pn.*
    from dbo.CompanyPhoneNumbers pn
    where pn.CompanyId = c.Id
    order by pn.IsPrimary desc
) oa;
I have deliberately used outer instead of cross apply; otherwise companies with no phone numbers listed would be filtered out.
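For completeness, here is a hedged sketch of the same per-company pick using ROW_NUMBER() instead of APPLY, against the same assumed table; note that, unlike the outer apply version, it only returns companies that have at least one number:
select CompanyId, AreaCode, PhoneNumber, IsPrimary
from (
    select pn.*,
           row_number() over (partition by pn.CompanyId
                              order by pn.IsPrimary desc) as rn
    from dbo.CompanyPhoneNumbers pn
) ranked
where rn = 1;  -- the primary number if one exists, otherwise an arbitrary one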
You can achieve this using an APPLY operator. It looks at the same table and returns, for each company, the record with the highest IsPrimary value, so rows with a 1 in that column win. If more than one row is marked primary (or none is), ties are broken by area code and phone number in ascending order.
select distinct b.*  -- distinct collapses the duplicates produced by applying once per source row
from CompanyPhoneNumbers a
cross apply (
    select top 1 *
    from CompanyPhoneNumbers b
    where b.CompanyId = a.CompanyId
    order by b.IsPrimary desc,
             b.AreaCode,
             b.PhoneNumber
) b

return multiple rows in a single row resultset

I have three MSSQL tables. The first table holds item transactions. The second is the Items table (where I get the name and other specs of a particular item), and the third is the item delivery type table (Box, Pack and Pieces, and their conversion factors).
First table fields:
..., itemid, trtype, trdate, amount, price, dlvid ...
Second table fields:
..., itemid, itemcode, itemname ...
Third table fields:
..., dlvid, IsMain, dlvname, convfactor ...
And the delivery type data:
11 1 Box [1]
12 0 Pack [25] (1 Box = 25 Packs)
13 0 Pcs [375] (1 Pack = 15 Pcs, 1 Box = 375 Pcs)
Figures in brackets are the conversion factors.
Before, I was putting them in a gridview: itemname (from the 2nd table), trtype, trdate, amount (from the 1st table) and dlvname (from the 3rd table), and that was OK.
But now I have to display all of the delivery types in one row, with the conversion calculation, like:
Faber Drawing Pencil 3B | Out | 05.06.2018 | 6 Box, 150 Packs, 2,250 Pcs ...
Could you please help me with this?
Thanks in advance.
First of all, I would like to let you know that your question is very low quality. It is very hard to understand your tables and data types, and it is not even clear where exactly you are having a problem.
So for the next time, please read How to create a Minimal, Complete, and Verifiable example.
Here is your answer:
select
    ...
    STRING_AGG(concat(amount, ' ', trtype), ',') within group (order by amount asc)
from
    Table1
    inner join Table2 on ...
    inner join Table3 on ...
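Since the joins and the unit conversion are elided above, here is a fuller, hedged sketch of what the query might look like. It assumes hypothetical table names Trans, Items and DlvTypes for the three tables described in the question, that each transaction's amount is recorded in the unit named by its dlvid, and that convfactor is expressed relative to the main unit (Box = 1). STRING_AGG requires SQL Server 2017 or later:
-- Assumed schema: Trans(itemid, trtype, trdate, amount, price, dlvid),
-- Items(itemid, itemcode, itemname), DlvTypes(dlvid, IsMain, dlvname, convfactor)
select i.itemname,
       t.trtype,
       t.trdate,
       -- multiply before dividing to stay in whole numbers (e.g. 6 Box -> 6*375/1 = 2250 Pcs)
       string_agg(concat(t.amount * d.convfactor / d0.convfactor, ' ', d.dlvname), ', ')
           within group (order by d.convfactor) as quantities
from Trans t
inner join Items i     on i.itemid = t.itemid
inner join DlvTypes d0 on d0.dlvid = t.dlvid   -- the unit the amount was entered in
cross join DlvTypes d                          -- one output fragment per delivery type
group by i.itemname, t.trtype, t.trdate;       -- add a transaction key if rows must stay separate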

Flatten/merge overlapping time intervals

I have a 'Service' table with millions of rows. Each row corresponds to a service provided by a staff in a given date and time interval (Each row has a unique ID). There are cases where a staff might provide services in overlapping time frames. I need to write a query that merges overlapping time intervals and returns the data in the format shown below.
I tried grouping by StaffID and Date fields and getting the Min of BeginTime and Max of EndTime but that does not account for the non-overlapping time frames. How can I accomplish this? Again, the table contains several million records so a recursive CTE approach might have performance issues. Thanks in advance.
Service Table
ID StaffID Date BeginTime EndTime
1 101 2014-01-01 08:00 09:00
2 101 2014-01-01 08:30 09:30
3 101 2014-01-01 18:00 20:30
4 101 2014-01-01 19:00 21:00
Output
StaffID Date BeginTime EndTime
101 2014-01-01 08:00 09:30
101 2014-01-01 18:00 21:00
Here is another sample data set with a query proposed by a contributor.
http://sqlfiddle.com/#!6/bfbdc/3
The first two rows in the results set should be merged into one row (06:00-08:45) but it generates two rows (06:00-08:30 & 06:00-08:45)
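As a side note (not from the original answers): on SQL Server 2012 and later, the merge can be done without recursion, using a running maximum to detect where each island of overlapping intervals starts. A minimal sketch against the Service table above:
;with marked as (
    select StaffID, [Date], BeginTime, EndTime,
           -- a row starts a new island when it begins after every earlier row has ended
           case when BeginTime > max(EndTime) over (
                         partition by StaffID, [Date]
                         order by BeginTime, EndTime
                         rows between unbounded preceding and 1 preceding)
                then 1 else 0 end as isStart
    from Service
), grouped as (
    select *,
           sum(isStart) over (partition by StaffID, [Date]
                              order by BeginTime, EndTime
                              rows unbounded preceding) as grp
    from marked
)
select StaffID, [Date], min(BeginTime) as BeginTime, max(EndTime) as EndTime
from grouped
group by StaffID, [Date], grp;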
I only came up with a CTE query, as the problem is that there may be a chain of overlapping times, e.g. record 1 overlaps with record 2, record 2 with record 3, and so on. This is hard to resolve without a CTE or some other kind of loop. Please give it a go anyway.
The first part of the CTE query gets the services that start a new group and do not have the same starting time as some other service (I need to have just one record that starts a group). The second part gets those that start a group but where more than one service shares the same start time; again, I need just one of them. The last part recursively builds up on the starting group, taking all overlapping services.
Here is SQLFiddle with more records added to demonstrate different kinds of overlapping and duplicate times.
I couldn't use ServiceID as it would have to be ordered in the same way as BeginTime.
;with flat as
(
    select StaffID, ServiceDate, BeginTime, EndTime, BeginTime as groupid
    from services S1
    where not exists (select * from services S2
                      where S1.StaffID = S2.StaffID
                        and S1.ServiceDate = S2.ServiceDate
                        and S2.BeginTime <= S1.BeginTime and S2.EndTime <> S1.EndTime
                        and S2.EndTime > S1.BeginTime)
    union all
    select StaffID, ServiceDate, BeginTime, EndTime, BeginTime as groupid
    from services S1
    where exists (select * from services S2
                  where S1.StaffID = S2.StaffID
                    and S1.ServiceDate = S2.ServiceDate
                    and S2.BeginTime = S1.BeginTime and S2.EndTime > S1.EndTime)
      and not exists (select * from services S2
                      where S1.StaffID = S2.StaffID
                        and S1.ServiceDate = S2.ServiceDate
                        and S2.BeginTime < S1.BeginTime
                        and S2.EndTime > S1.BeginTime)
    union all
    select S.StaffID, S.ServiceDate, S.BeginTime, S.EndTime, flat.groupid
    from flat
    inner join services S
        on flat.StaffID = S.StaffID
       and flat.ServiceDate = S.ServiceDate
       and flat.EndTime > S.BeginTime
       and flat.BeginTime < S.BeginTime and flat.EndTime < S.EndTime
)
select StaffID, ServiceDate, MIN(BeginTime) as begintime, MAX(EndTime) as endtime
from flat
group by StaffID, ServiceDate, groupid
order by StaffID, ServiceDate, begintime, endtime
Elsewhere I've answered a similar date-packing question with a geometric strategy. Namely, I interpret the date ranges as lines, and utilize geometry::UnionAggregate to merge the ranges.
Your question has two peculiarities, though. First, it calls for SQL Server 2008, where geometry::UnionAggregate is not available. However, you can download the Microsoft library at https://github.com/microsoft/SQLServerSpatialTools, load it into your instance as a CLR assembly, and then you have it available as dbo.GeometryUnionAggregate.
But the real peculiarity that has my interest is the concern that you have several million rows to work with. So I thought I'd repeat the strategy here but with an added technique to improve its performance. This technique will work well if a lot of your StaffID/date subsets are the same.
First, let's build a numbers table. Swap this out with your favorite way to do it.
select i = row_number() over (order by (select null))
into #numbers
from #services; -- where i put your data
Then convert the dates to floats and use those floats to create geometrical points. These points can then be turned into lines via STUnion and STEnvelope. With your ranges now represented as geometric lines, merge them via UnionAggregate. The resulting geometry object 'lines' might contain multiple lines, but any overlapping lines turn into one line.
select      s.StaffID,
            s.Date,
            linesWKT = geometry::UnionAggregate(line).ToString()
            -- If you have SQLSpatialTools installed then:
            -- linesWKT = dbo.GeometryUnionAggregate(line).ToString()
into        #aggregateRangesToGeo
from        #services s
cross apply (select
                beginTimeF = convert(float, convert(datetime, beginTime)),
                endTimeF = convert(float, convert(datetime, endTime))
            ) prepare
cross apply (select
                beginPt = geometry::Point(beginTimeF, 0, 0),
                endPt = geometry::Point(endTimeF, 0, 0)
            ) pointify
cross apply (select
                line = beginPt.STUnion(endPt).STEnvelope()
            ) lineify
group by    s.StaffID,
            s.Date;
You have one 'lines' object for each StaffID/date combo. But depending on your dataset, there may be many 'lines' objects that are the same between these combos. This may very well be true if staff are expected to follow a routine and data is recorded to the nearest whatever. So get a distinct listing of 'lines' objects. This should improve performance.
From this, extract the individual lines inside 'lines'. Envelope the lines, which ensures that the lines are stored only as their endpoints. Read the endpoint x values and convert them back to their time representations. Keep the WKT representation to join it back to the combos later on.
select      lns.linesWKT,
            beginTime = convert(time, convert(datetime, ap.beginTime)),
            endTime = convert(time, convert(datetime, ap.endTime))
into        #parsedLines
from        (select distinct linesWKT from #aggregateRangesToGeo) lns
cross apply (select
                lines = geometry::STGeomFromText(linesWKT, 0)
            ) geo
join        #numbers n on n.i between 1 and geo.lines.STNumGeometries()
cross apply (select
                line = geo.lines.STGeometryN(n.i).STEnvelope()
            ) ln
cross apply (select
                beginTime = ln.line.STPointN(1).STX,
                endTime = ln.line.STPointN(3).STX
            ) ap;
Now just join your parsed data back to the StaffId/Date combos.
select   ar.StaffID,
         ar.Date,
         pl.beginTime,
         pl.endTime
from     #aggregateRangesToGeo ar
join     #parsedLines pl on ar.linesWKT = pl.linesWKT
order by ar.StaffID,
         ar.Date,
         pl.beginTime;

How Can I Detect and Bound Changes Between Row Values in a SQL Table?

I have a table which records values over time, similar to the following:
RecordId Time Name
========================
1 10 Running
2 18 Running
3 21 Running
4 29 Walking
5 33 Walking
6 57 Running
7 66 Running
After querying this table, I need a result similar to the following:
FromTime ToTime Name
=========================
10 29 Running
29 57 Walking
57 NULL Running
I've toyed around with some of the aggregate functions (e.g. MIN, MAX, etc.), PARTITION and CTEs, but I can't seem to hit upon the right solution. I'm hoping a SQL guru can give me a hand, or at least point me in the right direction. Is there a fairly straightforward way to query this (preferably without a cursor)?
Finding "ToTime" By Aggregates Instead of a Join
I would like to share a really wild query that only takes 1 scan of the table with 1 logical read. By comparison, the best other answer on the page, Simon Kingston's query, takes 2 scans.
On a very large set of data (17,408 input rows, producing 8,193 result rows) it takes CPU 574 and time 2645, while Simon Kingston's query takes CPU 63,820 and time 37,108.
It's possible that with indexes the other queries on the page could perform many times better, but it is interesting to me to achieve 111x CPU improvement and 14x speed improvement just by rewriting the query.
(Please note: I mean no disrespect at all to Simon Kingston or anyone else; I am simply excited about my idea for this query panning out so well. His query is better than mine as its performance is plenty and it actually is understandable and maintainable, unlike mine.)
Here is the impossible query. It is hard to understand. It was hard to write. But it is awesome. :)
WITH Ranks AS (
    SELECT
        T = Dense_Rank() OVER (ORDER BY Time, Num),
        N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time, Num),
        *
    FROM
        #Data D
        CROSS JOIN (
            VALUES (1), (2)
        ) X (Num)
), Items AS (
    SELECT
        FromTime = Min(Time),
        ToTime = Max(Time),
        Name = IsNull(Min(CASE WHEN Num = 2 THEN Name END), Min(Name)),
        I = IsNull(Min(CASE WHEN Num = 2 THEN T - N END), Min(T - N)),
        MinNum = Min(Num)
    FROM
        Ranks
    GROUP BY
        T / 2
)
SELECT
    FromTime = Min(FromTime),
    ToTime = CASE WHEN MinNum = 2 THEN NULL ELSE Max(ToTime) END,
    Name
FROM Items
GROUP BY
    I, Name, MinNum
ORDER BY
    FromTime
Note: This requires SQL 2008 or up. To make it work in SQL 2005, change the VALUES clause to SELECT 1 UNION ALL SELECT 2.
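That is, on SQL Server 2005 the CROSS JOIN in the Ranks CTE becomes:
CROSS JOIN (
    SELECT 1 UNION ALL SELECT 2
) X (Num)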
Updated Query
After thinking about this a bit, I realized that I was accomplishing two separate logical tasks at the same time, and this made the query unnecessarily complicated: 1) prune out intermediate rows that have no bearing on the final solution (rows that do not begin a new task) and 2) pull the "ToTime" value from the next row. By performing #1 before #2, the query is simpler and performs with approximately half the CPU!
So here is the simplified query that first, trims out the rows we don't care about, then gets the ToTime value using aggregates rather than a JOIN. Yes, it does have 3 windowing functions instead of 2, but ultimately because of the fewer rows (after pruning those we don't care about) it has less work to do:
WITH Ranks AS (
    SELECT
        Grp =
            Row_Number() OVER (ORDER BY Time)
            - Row_Number() OVER (PARTITION BY Name ORDER BY Time),
        [Time], Name
    FROM #Data D
), Ranges AS (
    SELECT
        Result = Row_Number() OVER (ORDER BY Min(R.[Time]), X.Num) / 2,
        [Time] = Min(R.[Time]),
        R.Name, X.Num
    FROM
        Ranks R
        CROSS JOIN (VALUES (1), (2)) X (Num)
    GROUP BY
        R.Name, R.Grp, X.Num
)
SELECT
    FromTime = Min([Time]),
    ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Time]) END,
    Name = IsNull(Min(CASE WHEN Num = 2 THEN Name ELSE NULL END), Min(Name))
FROM Ranges R
WHERE Result > 0
GROUP BY Result
ORDER BY FromTime;
This updated query has all the same issues as I presented in my explanation; however, they are easier to solve because I am not dealing with the extra unneeded rows. I also see that I had to exclude the Row_Number() / 2 value of 0, and I am not sure why I didn't exclude it from the prior query, but in any case this works perfectly and is amazingly fast!
Outer Apply Tidies Things Up
Last, here is a version basically identical to Simon Kingston's query that I think uses an easier-to-understand syntax.
SELECT
    FromTime = Min(D.Time),
    X.ToTime,
    D.Name
FROM
    #Data D
    OUTER APPLY (
        SELECT TOP 1 ToTime = D2.[Time]
        FROM #Data D2
        WHERE
            D.[Time] < D2.[Time]
            AND D.[Name] <> D2.[Name]
        ORDER BY D2.[Time]
    ) X
GROUP BY
    X.ToTime,
    D.Name
ORDER BY
    FromTime;
Here's the setup script if you want to do performance comparison on a larger data set:
CREATE TABLE #Data (
    RecordId int,
    [Time] int,
    Name varchar(10)
);
INSERT #Data VALUES
    (1, 10, 'Running'),
    (2, 18, 'Running'),
    (3, 21, 'Running'),
    (4, 29, 'Walking'),
    (5, 33, 'Walking'),
    (6, 57, 'Running'),
    (7, 66, 'Running'),
    (8, 77, 'Running'),
    (9, 81, 'Walking'),
    (10, 89, 'Running'),
    (11, 93, 'Walking'),
    (12, 99, 'Running'),
    (13, 107, 'Running'),
    (14, 113, 'Walking'),
    (15, 124, 'Walking'),
    (16, 155, 'Walking'),
    (17, 178, 'Running');
GO
insert #data select recordid + (select max(recordid) from #data), time + (select max(time) +25 from #data), name from #data
GO 10
Explanation
Here is the basic idea behind my query.
The times that represent a switch have to appear in two adjacent rows, one to end the prior activity, and one to begin the next activity. The natural solution to this is a join so that an output row can pull from its own row (for the start time) and the next changed row (for the end time).
However, my query accomplishes the need to make end times appear in two different rows by repeating the row twice, with CROSS JOIN (VALUES (1), (2)). We now have all our rows duplicated. The idea is that instead of using a JOIN to do calculation across columns, we'll use some form of aggregation to collapse each desired pair of rows into one.
The next task is to make each duplicate row split properly so that one instance goes with the prior pair and one with the next pair. This is accomplished with the T column, a ROW_NUMBER() ordered by Time, and then divided by 2 (though I changed it to a DENSE_RANK() for symmetry, as in this case it returns the same value as ROW_NUMBER). For efficiency I performed the division in the next step so that the row number could be reused in another calculation (keep reading). Since row number starts at 1, and dividing by 2 implicitly converts to int, this has the effect of producing the sequence 0 1 1 2 2 3 3 4 4 ... which has the desired result: by grouping by this calculated value, since we also ordered by Num in the row number, we've now accomplished that all sets after the first one are comprised of a Num = 2 from the "prior" row, and a Num = 1 from the "next" row.
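A tiny standalone illustration of that pairing (hypothetical data, independent of #Data): integer-dividing a 1-based row number by 2 buckets consecutive rows as 0, 1 1, 2 2, 3 3, 4.
-- n / 2 with integer division: 1->0, 2->1, 3->1, 4->2, 5->2, ...
select n, n / 2 as bucket
from (values (1), (2), (3), (4), (5), (6), (7), (8)) v (n);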
The next difficult task is figuring out a way to eliminate the rows we don't care about and somehow collapse the start time of a block into the same row as the end time of a block. What we want is a way to get each discrete set of Running or Walking to be given its own number so we can group by it. DENSE_RANK() is a natural solution, but a problem is that it pays attention to each value in the ORDER BY clause--we don't have syntax to do DENSE_RANK() OVER (PREORDER BY Time ORDER BY Name) so that the Time does not cause the RANK calculation to change except on each change in Name.
After some thought I realized I could crib a bit from the logic behind Itzik Ben-Gan's grouped islands solution, and I figured out that the rank of the rows partitioned by Name and ordered by Time, subtracted from the rank of the rows ordered by Time, would yield a value that was the same for each row in the same group but different from other groups. The generic grouped islands technique is to create two calculated values that both ascend in lockstep with the rows, such as 4 5 6 and 1 2 3, that when subtracted will yield the same value (in this example case 3 3 3 as the result of 4 - 1, 5 - 2, and 6 - 3).
Note: I initially started with ROW_NUMBER() for my N calculation but it wasn't working. The correct answer was DENSE_RANK(), though I am sorry to say I don't remember why I concluded this at the time, and I would have to dive in again to figure it out. But anyway, that is what T-N calculates: a number that can be grouped on to isolate each "island" of one status (either Running or Walking).
But this was not the end because there are some wrinkles. First of all, the "next" row in each group contains the incorrect values for Name, N, and T. We get around this by selecting, from each group, the value from the Num = 2 row when it exists (but if it doesn't, then we use the remaining value). This yields expressions like CASE WHEN NUM = 2 THEN x END: this will properly weed out the incorrect "next" row values.
After some experimentation, I realized that it was not enough to group by T - N by itself, because both the Walking groups and the Running groups can have the same calculated value (in the case of my sample data provided up to 17, there are two T - N values of 6). But simply grouping by Name as well solves this problem. No group of either "Running" or "Walking" will have the same number of intervening values from the opposite type. That is, since the first group starts with "Running", and there are two "Walking" rows intervening before the next "Running" group, then the value for N will be 2 less than the value for T in that next "Running" group. I just realized that one way to think about this is that the T - N calculation counts the number of rows before the current row that do NOT belong to the same value "Running" or "Walking". Some thought will show that this is true: if we move on to the third "Running" group, it is only the third group by virtue of having a "Walking" group separating them, so it has a different number of intervening rows coming in before it, and due to it starting at a higher position, it is high enough so that the values cannot be duplicated.
Finally, since our final group consists of only one row (there is no end time and we need to display a NULL instead) I had to throw in a calculation that could be used to determine whether we had an end time or not. This is accomplished with the Min(Num) expression and then finally detecting that when the Min(Num) was 2 (meaning we did not have a "next" row) then display a NULL instead of the Max(ToTime) value.
I hope this explanation is of some use to people. I don't know if my "row-multiplying" technique will be generally useful and applicable to most SQL query writers in production environments, because of the difficulty of understanding it and the difficulty of maintenance it will most certainly present to the next person visiting the code (the reaction is probably "What on earth is it doing!?!" followed by a quick "Time to rewrite!").
If you have made it this far then I thank you for your time and for indulging me in my little excursion into incredibly-fun-sql-puzzle-land.
See it For Yourself
A.k.a. simulating a "PREORDER BY":
One last note. To see how T - N does the job--and noting that using this part of my method may not be generally applicable to the SQL community--run the following query against the first 17 rows of the sample data:
WITH Ranks AS (
    SELECT
        T = Dense_Rank() OVER (ORDER BY Time),
        N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time),
        *
    FROM
        #Data D
)
SELECT
    *,
    T - N
FROM Ranks
ORDER BY
    [Time];
This yields:
RecordId Time Name T N T - N
----------- ---- ---------- ---- ---- -----
1 10 Running 1 1 0
2 18 Running 2 2 0
3 21 Running 3 3 0
4 29 Walking 4 1 3
5 33 Walking 5 2 3
6 57 Running 6 4 2
7 66 Running 7 5 2
8 77 Running 8 6 2
9 81 Walking 9 3 6
10 89 Running 10 7 3
11 93 Walking 11 4 7
12 99 Running 12 8 4
13 107 Running 13 9 4
14 113 Walking 14 5 9
15 124 Walking 15 6 9
16 155 Walking 16 7 9
17 178 Running 17 10 7
The important part being that each group of "Walking" or "Running" has the same value for T - N that is distinct from any other group with the same name.
Performance
I don't want to belabor the point about my query being faster than other people's. However, given how striking the difference is (when there are no indexes) I wanted to show the numbers in a table format. This is a good technique when high performance of this kind of row-to-row correlation is needed.
Before each query ran, I used DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS;. I set MAXDOP to 1 for each query to remove the time-collapsing effects of parallelism. I selected each result set into variables instead of returning them to the client so as to measure only performance and not client data transmission. All queries were given the same ORDER BY clauses. All tests used 17,408 input rows yielding 8,193 result rows.
No results are displayed for the following people/reasons:
RichardTheKiwi *Could not test--query needs updating*
ypercube *No SQL 2012 environment yet :)*
Tim S *Did not complete tests within 5 minutes*
With no index:
                 CPU         Duration    Reads       Writes
                 ----------- ----------- ----------- -----------
ErikE            344         344         99          0
Simon Kingston   68672       69582       549203      49
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time);:
                 CPU         Duration    Reads       Writes
                 ----------- ----------- ----------- -----------
ErikE            328         336         99          0
Simon Kingston   70391       71291       549203      49          * basically not worse
With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time, Name);:
                 CPU         Duration    Reads       Writes
                 ----------- ----------- ----------- -----------
ErikE            375         414         359         0           * IO WINNER
Simon Kingston   172         189         38273       0           * CPU WINNER
So the moral of the story is:
Appropriate Indexes Are More Important Than Query Wizardry
With the appropriate index, Simon Kingston's version wins overall, especially when including query complexity/maintainability.
Heed this lesson well! 38k reads is not really that many, and Simon Kingston's version ran in half the time of mine. The speed increase of my query was entirely due to there being no index on the table, and the concomitant catastrophic cost this imposed on any query needing a join (which mine didn't): a full table scan Hash Match killing its performance. With an index, his query was able to do a Nested Loop with a clustered index seek (a.k.a. a bookmark lookup), which made things really fast.
It is interesting that a clustered index on Time alone was not enough. Even though Times were unique, meaning only one Name occurred per time, it still needed Name to be part of the index in order to utilize it properly.
Adding the clustered index to the table when full of data took under 1 second! Don't neglect your indexes.
This will not work in SQL Server 2008, only in SQL Server 2012 and later, which have the LAG() and LEAD() analytic functions, but I'll leave it here for anyone with newer versions:
SELECT Time AS FromTime,
       LEAD(Time) OVER (ORDER BY Time) AS ToTime,
       Name
FROM (
    SELECT Time,
           LAG(Name) OVER (ORDER BY Time) AS PreviousName,
           Name
    FROM Data
) AS tmp
WHERE PreviousName <> Name
   OR PreviousName IS NULL;
Tested in SQL-Fiddle
With an index on (Time, Name) it will need an index scan.
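For reference, a hypothetical definition of such an index (the name is illustrative):
CREATE INDEX IX_Data_Time_Name ON Data ([Time], Name);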
Edit:
If NULL is a valid value for Name that needs to be taken as a valid entry, use the following WHERE clause:
WHERE PreviousName <> Name
   OR (PreviousName IS NULL AND Name IS NOT NULL)
   OR (PreviousName IS NOT NULL AND Name IS NULL);
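As an aside (a sketch, not part of the original answer), T-SQL also has a compact NULL-safe alternative using the INTERSECT trick: two single-value SELECTs intersect to a row exactly when the values match, with NULLs treated as equal.
-- NULL-safe inequality: keep rows where PreviousName and Name differ,
-- counting NULL as a value rather than as unknown
WHERE NOT EXISTS (SELECT PreviousName INTERSECT SELECT Name);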
I think you're essentially interested in where the 'Name' changes from one record to the next (in order of 'Time'). If you can identify where this happens you can generate your desired output.
Since you mentioned CTEs I'm going to assume you're on SQL Server 2005+ and can therefore use the ROW_NUMBER() function. You can use ROW_NUMBER() as a handy way to identify consecutive pairs of records and then to find those where the 'Name' changes.
How about this:
WITH OrderedTable AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY Time) AS Ordinal
    FROM
        [YourTable]
),
NameChange AS
(
    SELECT
        after.Time AS Time,
        after.Name AS Name,
        ROW_NUMBER() OVER (ORDER BY after.Time) AS Ordinal
    FROM
        OrderedTable before
        RIGHT JOIN OrderedTable after ON after.Ordinal = before.Ordinal + 1
    WHERE
        ISNULL(before.Name, '') <> after.Name
)
SELECT
    before.Time AS FromTime,
    after.Time AS ToTime,
    before.Name
FROM
    NameChange before
    LEFT JOIN NameChange after ON after.Ordinal = before.Ordinal + 1
I assume that the RecordIds are not always sequential, hence the CTE to create an unbroken sequential number.
SQLFiddle
;with SequentiallyNumbered as (
    select *, N = row_number() over (order by RecordId)
    from Data)
, Tmp as (
    select A.*, RN = row_number() over (order by A.Time)
    from SequentiallyNumbered A
    left join SequentiallyNumbered B on B.N = A.N - 1 and A.name = B.name
    where B.name is null)
select A.Time FromTime, B.Time ToTime, A.Name
from Tmp A
left join Tmp B on B.RN = A.RN + 1;
The dataset I used to test
create table Data (
    RecordId int,
    Time int,
    Name varchar(10));
insert Data values
    (1, 10, 'Running'),
    (2, 18, 'Running'),
    (3, 21, 'Running'),
    (4, 29, 'Walking'),
    (5, 33, 'Walking'),
    (6, 57, 'Running'),
    (7, 66, 'Running');
Here's a CTE solution that gets the results you're seeking:
;WITH TheRecords (FirstTime, SecondTime, [Name])
AS
(
    SELECT [Time],
        (
            SELECT MIN([Time])
            FROM ActivityTable at2
            WHERE at2.[Time] > at.[Time]
              AND at2.[Name] <> at.[Name]
        ),
        [Name]
    FROM ActivityTable at
)
SELECT MIN(FirstTime) AS FromTime, SecondTime AS ToTime, MIN([Name]) AS [Name]
FROM TheRecords
GROUP BY SecondTime
ORDER BY FromTime, ToTime

Merge rows based on date in SQL Server

I want to display data based on start date and end date. A code can contain different dates. If any time intervals are contiguous, I need to merge those rows and display them as a single row.
Here is sample data
Code Start_Date End_Date Volume
470 24-Oct-10 30-Oct-10 28
470 17-Oct-10 23-Oct-10 2
470 26-Sep-10 2-Oct-10 2
471 22-Aug-10 29-Aug-10 2
471 15-Aug-10 21-Aug-10 2
The output result I want is
Code Start_Date End_Date Volume
470 17-Oct-10 30-Oct-10 30
470 26-Sep-10 2-Oct-10 2
471 15-Aug-10 29-Aug-10 4
A code can have any number of time intervals. Please help. Thank you.
Based on your sample data (which I've put in a table called Test), and assuming no overlaps:
;with Ranges as (
    select Code, Start_Date, End_Date, Volume from Test
    union all
    select r.Code, r.Start_Date, t.End_Date, (r.Volume + t.Volume)
    from Ranges r
    inner join Test t
        on r.Code = t.Code
       and DATEDIFF(day, r.End_Date, t.Start_Date) = 1
), ExtendedRanges as (
    select Code, MIN(Start_Date) as Start_Date, End_Date, MAX(Volume) as Volume
    from Ranges
    group by Code, End_Date
)
select Code, Start_Date, MAX(End_Date), MAX(Volume)
from ExtendedRanges
group by Code, Start_Date
Explanation:
The Ranges CTE contains all rows from the original table (because some of them might be relevant) and all rows we can form by joining ranges together (both original ranges, and any intermediate ranges we construct - we're doing recursion here).
Then ExtendedRanges (poorly named) finds, for any particular End_Date, the earliest Start_Date that can reach it.
Finally, we query this second CTE, to find, for any particular Start_Date, the latest End_Date that is associated with it.
These two queries combine to basically filter the Ranges CTE down to "the widest possible Start_Date/End_Date pair" in each set of overlapping date ranges.
Sample data setup:
create table Test (
    Code int not null,
    Start_Date date not null,
    End_Date date not null,
    Volume int not null
)
insert into Test(Code, Start_Date, End_Date, Volume)
select 470,'24-Oct-10','30-Oct-10',28 union all
select 470,'17-Oct-10','23-Oct-10',2 union all
select 470,'26-Sep-10','2-Oct-10',2 union all
select 471,'22-Aug-10','29-Aug-10',2 union all
select 471,'15-Aug-10','21-Aug-10',2
go
If I understand your request, you're looking for something like:
select code, min(Start_date), max(end_date), sum(volume)
from yourtable
group by code
