I have a bunch of value pairs (Before, After) by users in a table. In ideal scenarios these values should form an unbroken chain. e.g.
| UserId | Before | After |
|--------|--------|-------|
| 1 | 0 | 10 |
| 1 | 10 | 20 |
| 1 | 20 | 30 |
| 1 | 30 | 40 |
| 1 | 40 | 30 |
| 1 | 30 | 52 |
| 1 | 52 | 0 |
Unfortunately, these records originate in multiple different tables and are imported into my investigation table. The other values in the table do not lend themselves to ordering (e.g. CreatedDate) due to some quirks in the system saving them out of order.
I need to produce a list of users with gaps in their data. e.g.
| UserId | Before | After |
|--------|--------|-------|
| 1 | 0 | 10 |
| 1 | 10 | 20 |
| 1 | 20 | 30 |
// Row Deleted (30->40)
| 1 | 40 | 30 |
| 1 | 30 | 52 |
| 1 | 52 | 0 |
I've looked at the other daisy-chaining questions on SO (and online in general), but they all seem to assume a problem space where one value in the pair is always lower than the other in a predictable fashion. In my case, there can be increases or decreases.
Is there a way to quickly calculate the longest chain that can be created? I do have a CreatedAt column that would provide some (very rough) relative ordering: when the dates are more than about 10 seconds apart, we could consider the rows orderable.
Aren't you therefore simply after this, to get the first row where the "chain" is broken?
SELECT UserID, Before, After
FROM dbo.YourTable YT
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.YourTable NE
                  WHERE NE.After = YT.Before)
  AND YT.Before != 0;
If you want the last row before the "chain" is broken, just swap the aliases on the columns in the WHERE of the NOT EXISTS.
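A sketch of that swapped version (same assumed table/column names as the query above):
-- Last row before the break: no other row continues from this row's After value
-- (the 0-sentinel exclusion is mirrored onto After here, as an assumption)
SELECT UserID, Before, After
FROM dbo.YourTable YT
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.YourTable NE
                  WHERE NE.Before = YT.After)
  AND YT.After != 0;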
The following performs hierarchical recursion on your example data and calculates a "chain" count column called 'h_level'.
;with recur_cte([UserId], [Before], [After], h_level) as (
    select [UserId], [Before], [After], 0
    from dbo.test_table
    where [Before] is null
    union all
    select tt.[UserId], tt.[Before], tt.[After], rc.h_level+1
    from dbo.test_table tt
    join recur_cte rc on tt.UserId=rc.UserId
                     and tt.[Before]=rc.[After]
    where tt.[Before]<tt.[After])
select * from recur_cte;
Results:
UserId  Before  After  h_level
1       NULL    10     0
1       10      20     1
1       20      30     2
1       30      40     3
1       30      52     3
Is this helpful? Could you further define which rows to exclude?
If you want users that have more than one chain:
select t.UserID
from <T> as t left outer join <T> as t2
on t2.UserID = t.UserID and t2.Before = t.After
where t2.UserID is null
group by t.UserID
having count(*) > 1;
How can I use SQL to compute a cumulative sum over a column such that the cumulative sum always stays within upper/lower bounds? Here is an example with lower bound -2 and upper bound 10, showing the regular cumulative sum and the bounded cumulative sum.
id input
-------------
1 5
2 7
3 -10
4 -10
5 5
6 10
Result:
id  cum_sum  bounded_cum_sum
----------------------------
1   5        5
2   12       10
3   2        0
4   -8       -2
5   -3       3
6   7        10
See https://codegolf.stackexchange.com/questions/61684/calculate-the-bounded-cumulative-sum-of-a-vector for some (non SQL) examples of a bounded cumulative sum.
You can (almost) always use a cursor to implement whatever cumulative logic you have. The technique is quite routine, so once you get it you can use it to tackle a variety of problems easily.
One specific thing to note: Here I update the table in-place, so the [id] column must be uniquely indexed.
(Tested on SQL Server 2017 latest linux docker image)
Test Dataset
use [testdb];
if OBJECT_ID('testdb..test') is not null
drop table testdb..test;
create table test (
    [id] int,
    [input] int
);
insert into test (id, input)
values (1,5), (2,7), (3,-10), (4,-10), (5,5), (6,10);
Solution
/* A generic row-by-row cursor solution */
-- First of all, make [id] uniquely indexed to enable "where current of"
create unique index idx_id on test(id);
-- append answer columns
alter table test
add [cum_sum] int,
[bounded_cum_sum] int;
-- storage for each row
declare @id int,
        @input int,
        @cum_sum int,
        @bounded_cum_sum int;
-- record accumulated values
declare @prev_cum_sum int = 0,
        @prev_bounded_cum_sum int = 0;
-- open a cursor ordered by [id] and updatable for assigned columns
declare cur CURSOR local
for select [id], [input], [cum_sum], [bounded_cum_sum]
    from test
    order by id
for update of [cum_sum], [bounded_cum_sum];
open cur;
while 1=1 BEGIN
    /* fetch next row and check termination condition */
    fetch next from cur
    into @id, @input, @cum_sum, @bounded_cum_sum;
    if @@FETCH_STATUS <> 0
        break;
    /* program body */
    -- main logic
    set @cum_sum = @prev_cum_sum + @input;
    set @bounded_cum_sum = @prev_bounded_cum_sum + @input;
    if @bounded_cum_sum > 10 set @bounded_cum_sum = 10
    else if @bounded_cum_sum < -2 set @bounded_cum_sum = -2;
    -- write the result back
    update test
    set [cum_sum] = @cum_sum,
        [bounded_cum_sum] = @bounded_cum_sum
    where current of cur;
    -- setup for next row
    set @prev_cum_sum = @cum_sum;
    set @prev_bounded_cum_sum = @bounded_cum_sum;
END
-- cleanup
close cur;
deallocate cur;
-- show
select * from test;
Result
| | id | input | cum_sum | bounded_cum_sum |
|---|----|-------|---------|-----------------|
| 1 | 1 | 5 | 5 | 5 |
| 2 | 2 | 7 | 12 | 10 |
| 3 | 3 | -10 | 2 | 0 |
| 4 | 4 | -10 | -8 | -2 |
| 5 | 5 | 5 | -3 | 3 |
| 6 | 6 | 10 | 7 | 10 |
I have the following table:
Sensor | Building | Date_time  | Current_value
-------|----------|------------|--------------
1      | 1        | 20.08.2017 | 20
1      | 1        | 21.08.2017 | 25
1      | 1        | 22.08.2017 | 35
2      | 1        | 20.08.2017 | 120
2      | 1        | 21.08.2017 | 200
2      | 1        | 22.08.2017 | 210
3      | 2        | 20.08.2017 | 20
3      | 2        | 21.08.2017 | 25
3      | 2        | 22.08.2017 | 85
5      | 2        | 20.08.2017 | 320
5      | 2        | 21.08.2017 | 400
5      | 2        | 22.08.2017 | 410
The sensor ID is assumed to be unique, as is the building ID.
I need to calculate the total value for each building for any given timeframe by subtracting the MIN value from the MAX value for each sensor, then group the sum by each building.
In the above sample it would be
Sensor 1: (35 - 20)=15
Sensor 2: (210-120)=90
Building 1 = 15+90 = 105
(...)
Building 2 = 65+90 = 155
Any pointers in the right direction are greatly appreciated!
You are asking how to calculate the difference between min and max values per sensor, then aggregate the differences per building.
with diffs as (
SELECT Building,Sensor, MAX(Current_Value)-MIN(Current_Value) as diff
FROM SomeTable
GROUP BY Building, Sensor
)
SELECT Building,sum(diff)
FROM diffs
GROUP BY Building
If you want to restrict the time period, you'll have to do so inside the CTE:
with diffs as (
SELECT Building,Sensor, MAX(Current_Value)-MIN(Current_Value) as diff
FROM SomeTable
WHERE Date_Time between @start and @end
GROUP BY Building, Sensor
)
SELECT Building,sum(diff)
FROM diffs
GROUP BY Building
You can convert this query into a user-defined function that can be used in other queries:
create function fn_TotalDiffs(@start datetime2(0), @end datetime2(0))
returns table
as
Return (
    with diffs as (
        select Building, Sensor, MAX(Current_Value) - MIN(Current_Value) as diff
        from SomeTable
        where Date_Time between @start and @end
        Group by Building, Sensor
    )
    select Building, sum(diff) as Total
    from diffs
    Group by Building
)
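As a usage sketch (assuming the function above has been created, and that Date_Time is a date/datetime column), you can then query it like a table:
SELECT Building, Total
FROM dbo.fn_TotalDiffs('2017-08-20', '2017-08-22')
ORDER BY Building;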
Another option is to use the window functions min/max over().
Example
Select Building
,Total = sum(R1)
From (
Select Distinct
Building
,R1 = max([Current_value]) over (Partition By Building,Sensor)
-min([Current_value]) over (Partition By Building,Sensor)
From YourTable
Where Date_time between @Date1 and @Date2
) A
Group By Building
Returns
Building Total
1 105
2 155
I have a list of multiple components that all have a model number. I want to group every second component based on the model it belongs to.
Model | Component | Group
------|-----------|------
1     | 1         | 1
1     | 2         | 2
1     | 3         | 1
1     | 4         | 2
1     | 5         | 1
2     | 1         | 1
2     | 2         | 2
2     | 3         | 1
Every second component belonging to a model should get an alternating group number.
I believe I have to use a window function, but I haven't been able to solve it.
I have assumed that your Model and Component ID numbers will not be perfectly incremental, but that they will be unique. As such, you can use the row_number windowed function along with the modulo operator % to get the remainder of the division of the row_number result by 2:
declare @t table (Model int, Component int);
insert into @t values (1,1),(1,2),(1,3),(1,4),(1,5),(2,1),(2,2),(2,2);
select Model
,Component
,case (row_number() over (partition by Model order by Component) % 2)
when 1 then 1
when 0 then 2
end as [Group]
from @t;
Output:
+-------+-----------+-------+
| Model | Component | Group |
+-------+-----------+-------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | 1 |
| 1 | 4 | 2 |
| 1 | 5 | 1 |
| 2 | 1 | 1 |
| 2 | 2 | 2 |
| 2 | 2 | 1 |
+-------+-----------+-------+
I'm trying to get a "lineage" or similar, and also information about the first and last links (at least; all would be good), out of a table that has self-referential links between rows that have been "replaced" and rows that have replaced them. The table has a structure along these lines:
CREATE TABLE Thing (
Id INT PRIMARY KEY,
TStamp DATETIME,
Replaces INT NULL,
ReplacedBy INT NULL
);
I'm stuck with this structure. :-) It's sort of doubly-linked (yes, it's a bit silly): Each row has a unique Id, and then a row that has been "replaced" by another will have a non-NULL ReplacedBy giving the Id of the replacement row, and the replacement row will also have a link back to what it replaces in Replaces. So we can use either Replaces or ReplacedBy (or both) if we like.
Here's some sample data:
INSERT INTO Thing
(Id, TStamp, Replaces, ReplacedBy)
VALUES
(1, '2017-01-01', NULL, 11),
(2, '2017-01-02', NULL, 12),
(3, '2017-01-03', NULL, NULL),
(4, '2017-01-04', NULL, NULL),
(11, '2017-01-11', 1, NULL),
(12, '2017-01-12', 2, 22),
(22, '2017-01-22', 12, NULL);
So 1 was replaced by 11, 2 was replaced by 12, and 12 was replaced by 22.
I'd like to get the following information for each chain of links from this table in a reasonable way:
Details of the row that started the chain
Details of the final row in the chain
Details of the links in-between or at least how many links (total) there are in the chain
...filtered by a date range applied to the last row in the chain.
In an ideal universe, I'd get back something like this:
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | TStamp |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
| 1 | 11 | 1 | 2 | 2017−01−01 |
| 1 | 11 | 11 | 2 | 2017−01−11 |
| 2 | 22 | 2 | 3 | 2017−01−02 |
| 2 | 22 | 12 | 3 | 2017−01−12 |
| 2 | 22 | 22 | 3 | 2017−01−22 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−−−−−−+
So far I have this query, which I could post-process to get the above:
WITH Data AS (
SELECT Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
UNION ALL
SELECT Thing.Id, Thing.TStamp, Thing.Replaces, Thing.ReplacedBy, Depth + 1
FROM Data
JOIN Thing
ON Thing.Replaces = Data.Id
)
SELECT *
FROM Data
WHERE ReplacedBy IS NOT NULL OR Depth > 0
ORDER BY
Id, Depth;
That gives me:
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
| Id | TStamp | Replaces | ReplacedBy | Depth |
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
| 1 | 2017−01−01 | NULL | 11 | 0 |
| 2 | 2017−01−02 | NULL | 12 | 0 |
| 11 | 2017−01−11 | 1 | NULL | 1 |
| 12 | 2017−01−12 | 2 | 22 | 0 |
| 12 | 2017−01−12 | 2 | 22 | 1 |
| 22 | 2017−01−13 | 12 | NULL | 1 |
| 22 | 2017−01−13 | 12 | NULL | 2 |
+−−−−+−−−−−−−−−−−−+−−−−−−−−−−+−−−−−−−−−−−−+−−−−−−−+
And I could use something like this to figure out (for instance) the final row of each chain:
WITH Data AS (
SELECT Id, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
UNION ALL
SELECT Thing.Id, Thing.Replaces, Thing.ReplacedBy, Depth + 1
FROM Data
JOIN Thing
ON Thing.Replaces = Data.Id
),
MaxData AS (
SELECT Data.Id, Data.Depth
FROM Data
JOIN (
SELECT Id, MAX(Depth) AS MaxDepth
FROM Data
GROUP BY Id
) j ON data.Id = j.Id AND Data.Depth = j.MaxDepth
WHERE Depth > 0
)
SELECT *
FROM MaxData
ORDER BY
Id;
...which gives me:
+−−−−+−−−−−−−+
| Id | Depth |
+−−−−+−−−−−−−+
| 11 | 1 |
| 12 | 1 |
| 22 | 2 |
+−−−−+−−−−−−−+
...but I've lost the starting point and the points along the way.
I have the strong feeling I'm missing something really straight-forward — but clever — that would let me get this largely with the query rather than post-processing, some kind of join with a "min" and "max" query (but not like my one above). What would it be?
The table doesn't have any indexes on Replaces or ReplacedBy, but we could add any needed. The table is only lightly used (roughly 300k rows and probably only a couple of hundred updates/inserts a day).
I'm limited to SQL Server 2008 features.
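(For reference, should the recursion need them, the indexes mentioned above could presumably be added along these lines; a sketch with hypothetical index names:)
CREATE INDEX IX_Thing_Replaces ON Thing (Replaces);
CREATE INDEX IX_Thing_ReplacedBy ON Thing (ReplacedBy);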
Inspired by Gordon Linoff's answer and HABO's comment which highlighted something Gordon was doing that was critical, I:
Removed the SQL Server 2012+ FIRST_VALUE function, replacing it with a CROSS APPLY against an "overview" query of the data
Included the Links count in the overview query
Removed the reliance on t in Gordon's WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id), which (at least on SQL Server 2008) wasn't bound to anything
Filtered out rows that weren't replaced
Below, I also add the date filtering mentioned in the question
...filtered by a date range applied to the last row in the chain.
...which Gordon didn't cover at all, and changes our approach, but only in terms of the arrow of time.
So, first, without the date criteria, sticking fairly close to Gordon's answer:
WITH Data AS (
SELECT Id AS FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
WHERE Replaces IS NULL AND ReplacedBy IS NOT NULL
UNION ALL
SELECT d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
FROM Data d
JOIN Thing t ON t.Replaces = d.Id
),
Overview AS (
SELECT FirstId, MAX(Id) AS LastId, COUNT(*) AS Links
FROM Data
GROUP BY
FirstId
)
SELECT d.FirstId, o.LastId, d.Id, o.Links, d.Depth, d.TStamp
FROM Data d
CROSS APPLY (
SELECT LastId, Links
FROM Overview
WHERE FirstId = d.FirstId
) o
ORDER BY
d.FirstId, d.Depth
;
The critical parts of that are grabbing the seed Id as FirstId here:
SELECT Id AS FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
WHERE Replaces IS NULL AND ReplacedBy IS NOT NULL
and then propagating it through the results of the recursive join:
SELECT d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
FROM Data d
JOIN Thing t ON t.Replaces = d.Id
Just adding that to my original query gives us most of what I wanted. Then we add a second query to get the LastId for each FirstId (Gordon did it as a FIRST_VALUE over a partition, but I can't do that in SQL Server 2008) and using an overview query also lets me grab the number of links. We cross-apply that on the basis of the FirstId value to get the overall results I wanted.
The query above returns the following for the sample data:
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth | TStamp |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| 1 | 11 | 1 | 2 | 0 | 2017-01-01 |
| 1 | 11 | 11 | 2 | 1 | 2017-01-11 |
| 2 | 22 | 2 | 3 | 0 | 2017-01-02 |
| 2 | 22 | 12 | 3 | 1 | 2017-01-12 |
| 2 | 22 | 22 | 3 | 2 | 2017-01-13 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
...i.e., exactly what I wanted, plus Depth if I want (so I know what order the intermediary links were in).
If we wanted to include rows that were never replaced, we'd just change
WHERE Replaces IS NULL AND ReplacedBy IS NOT NULL
to
WHERE Replaces IS NULL
Giving us:
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth | TStamp |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| 1 | 11 | 1 | 2 | 0 | 2017-01-01 |
| 1 | 11 | 11 | 2 | 1 | 2017-01-11 |
| 2 | 22 | 2 | 3 | 0 | 2017-01-02 |
| 2 | 22 | 12 | 3 | 1 | 2017-01-12 |
| 2 | 22 | 22 | 3 | 2 | 2017-01-13 |
| 3 | 3 | 3 | 1 | 0 | 2017-01-03 |
| 4 | 4 | 4 | 1 | 0 | 2017-01-04 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
But we've ignored the date criteria required by the question:
...filtered by a date range applied to the last row in the chain.
To do that without building a massive temporary result set, we have to work backward: Instead of selecting the starting point (the first entry in a chain, Replaces IS NULL), we need to select the ending point (the last entry in a chain, ReplacedBy IS NULL), and then invert our logic working back through the chain. It's largely a matter of:
Swapping FirstId with LastId
Swapping Replaces with ReplacedBy (convenient the table had both!)
Using MIN to get the first ID in the chain rather than MAX to get the last
Using d.Depth - 1 rather than d.Depth + 1
Then fixing-up Depth based on Links once we know it in our final select, to get those nice values where 0 = first link rather than some varying negative number: o.Links + d.Depth - 1 AS Depth
All of which gives us:
WITH Data AS (
SELECT Id AS LastId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing
WHERE ReplacedBy IS NULL AND Replaces IS NOT NULL
-- Filtering by date of last entry would go here
UNION ALL
SELECT d.LastId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth - 1
FROM Data d
JOIN Thing t ON t.ReplacedBy = d.Id
),
Overview AS (
SELECT LastId, MIN(Id) AS FirstId, COUNT(*) AS Links
FROM Data
GROUP BY
LastId
)
SELECT o.FirstId, d.LastId, d.Id, o.Links, o.Links + d.Depth - 1 AS Depth, d.TStamp
FROM Data d
CROSS APPLY (
SELECT FirstId, Links
FROM Overview
WHERE LastId = d.LastId
) o
ORDER BY
o.FirstId, d.Depth
;
So for instance, if we used
AND TStamp BETWEEN '2017-01-12' AND '2017-02-01'
where I have
-- Filtering by date of last entry would go here
above, with our sample data we'd get this result:
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| FirstId | LastId | Id | Links | Depth | TStamp |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
| 2 | 22 | 2 | 3 | 0 | 2017−01−02 |
| 2 | 22 | 12 | 3 | 1 | 2017−01−12 |
| 2 | 22 | 22 | 3 | 2 | 2017−01−13 |
+−−−−−−−−−+−−−−−−−−+−−−−+−−−−−−−+−−−−−−−+−−−−−−−−−−−−+
...because the last link in the Id = 1 chain is outside the date range, so we don't include it.
This is a little tricky. Arrange the CTE to start at the beginning of each list. That makes the subsequent processing easier:
WITH Data AS (
SELECT Id as FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing t
WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id)
UNION ALL
SELECT d.FirstId, t.Id, t.TStamp, t.Replaces, t.ReplacedBy, d.Depth + 1
FROM Data d JOIN
Thing t
ON t.Replaces = d.Id
)
SELECT d.*,
FIRST_VALUE(id) OVER (PARTITION BY FirstId ORDER BY Depth DESC) as LastId
FROM Data d;
Then, you can use FIRST_VALUE() with a reverse sort to get the last value in the chain.
This returns chains that have no links. You can add a filter to remove these.
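As a sketch of such a filter (reading "chains that have no links" as seed rows that were never replaced), one option is to add a condition to the anchor member of the CTE:
-- Anchor member with single-row "chains" excluded (assumed variation):
SELECT Id as FirstId, Id, TStamp, Replaces, ReplacedBy, 0 AS Depth
FROM Thing t
WHERE NOT EXISTS (SELECT 1 FROM Thing t2 WHERE t2.ReplacedBy = t.id)
  AND t.ReplacedBy IS NOT NULL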