Identifying changes over time - SQL Server

No doubt a similar question has come up before, but I haven't been able to locate it by searching...
I have a raw dataset with time series data including 'from' and 'to' date fields.
The problem is that when data is loaded, new records are created (a 'to' date is added to the old record and the new record's 'from' date is the load date) even where no values have changed.
I want to convert this to a table which just shows a row for each genuine change, with the from/to dates reflecting this.
By way of example, the source data looks like this:
ID     Col1  Col2  Col3  From        To
Test1  1     1     1     01/01/2020  31/12/9999
Test2  1     2     3     01/01/2020  30/06/2020
Test2  1     2     3     01/07/2020  30/09/2020
Test2  3     2     1     01/10/2020  31/12/9999
The first two records for Test2 (rows 2 and 3) are essentially the same - there was no change when the second row was loaded on 01/07/2020. I want a single row covering the period 01/01/2020 - 30/09/2020, during which there was no change:
ID     Col1  Col2  Col3  From        To
Test1  1     1     1     01/01/2020  31/12/9999
Test2  1     2     3     01/01/2020  30/09/2020
Test2  3     2     1     01/10/2020  31/12/9999
For this simplified example, I can achieve that by grouping by each column (apart from the dates) and taking the MIN 'from' date and MAX 'to' date:
SELECT
    ID, Col1, Col2, Col3, MIN([From]) AS [From], MAX([To]) AS [To]
FROM [Table]
GROUP BY ID, Col1, Col2, Col3
However, this won't work if a value changes and then subsequently changes back to what it was before, e.g.:
ID     Col1  Col2  Col3  From        To
Test1  1     1     1     01/01/2020  31/12/9999
Test2  1     2     3     01/01/2020  30/04/2020
Test2  1     2     3     01/05/2020  30/06/2020
Test2  3     2     1     01/07/2020  30/10/2020
Test2  1     2     3     01/11/2020  31/12/9999
Simply using MIN/MAX in the code above would return this - making it look as though both sets of values were valid for the period 01/07/2020 - 30/10/2020:
ID     Col1  Col2  Col3  From        To
Test1  1     1     1     01/01/2020  31/12/9999
Test2  1     2     3     01/01/2020  31/12/9999
Test2  3     2     1     01/07/2020  30/10/2020
Whereas actually the first set of values was valid before and after that period, but not during it.
The query should return a single row instead of two for the period 01/01/2020 - 30/06/2020 when there were no changes for this ID, then another row for the period when the values were different, and then a further row, with a new From date, for when the values reverted to the initial ones.
ID     Col1  Col2  Col3  From        To
Test1  1     1     1     01/01/2020  31/12/9999
Test2  1     2     3     01/01/2020  30/06/2020
Test2  3     2     1     01/07/2020  30/10/2020
Test2  1     2     3     01/11/2020  31/12/9999
I'm struggling to conceptualise how to approach this.
I'm guessing I need to use LAG somehow, but I'm not sure how to apply it - e.g. rank everything in a staging table first, then use LAG to compare a concatenation of the whole row?
I'm sure I could find a fudged way eventually, but I've no doubt this problem has been solved many times before, so I'm hoping somebody can point me to a simpler/neater solution than I'd inevitably come up with...

Advanced Gaps and Islands
I believe this is an advanced "gaps and islands" problem. Use that as a search term and you'll find plenty of literature on the subject. The only difference is that normally just one column is being tracked, whereas here you have three.
No Gaps Assumption
One major assumption of this script is that there are no gaps in the date ranges; in other words, it assumes the previous row's ToDate equals the current row's FromDate minus 1 day.
I'm not sure if you need to account for gaps; if so, it would be simple to add a criterion to IsChanged to check for that (see the sketch after the solution below).
Multi-Column Gaps and Islands Solution
DROP TABLE IF EXISTS #Grouping
DROP TABLE IF EXISTS #Test

CREATE TABLE #Test (
    ID INT IDENTITY(1,1),
    TestName VARCHAR(10),
    Col1 INT,
    Col2 INT,
    Col3 INT,
    FromDate DATE,
    ToDate DATE
)

INSERT INTO #Test VALUES
 ('Test1',1,1,1,'2020-01-01','9999-12-31')
,('Test2',1,2,3,'2020-01-01','2020-04-30')
,('Test2',1,2,3,'2020-05-01','2020-06-30')
,('Test2',3,2,1,'2020-07-01','2020-10-30')
,('Test2',1,2,3,'2020-11-01','9999-12-31')

;WITH cte_Prev AS (
    SELECT *
        ,PrevCol1 = LAG(Col1) OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol2 = LAG(Col2) OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol3 = LAG(Col3) OVER (PARTITION BY TestName ORDER BY FromDate)
    FROM #Test
), cte_Compare AS (
    SELECT *
        ,IsChanged = CASE
                        WHEN Col1 = PrevCol1
                         AND Col2 = PrevCol2
                         AND Col3 = PrevCol3
                            THEN 0 /*No change*/
                        ELSE 1 /*Iterate so a new group is created*/
                     END
    FROM cte_Prev
)
SELECT *, GroupID = SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID)
INTO #Grouping
FROM cte_Compare

/*Raw unformatted data so you can see how it works*/
SELECT *
FROM #Grouping

/*Aggregated results*/
SELECT GroupID, TestName, Col1, Col2, Col3
    ,FromDate = MIN(FromDate)
    ,ToDate = MAX(ToDate)
    ,NumberOfRowsCollapsedIntoOneRow = COUNT(*)
FROM #Grouping
GROUP BY GroupID, TestName, Col1, Col2, Col3
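If gaps in the date ranges do need to be handled, here is a minimal sketch of one way to do it (the extra PrevToDate column is my own name, not part of the script above, and it assumes a gap should also start a new island): add a fourth LAG over ToDate and require contiguity in IsChanged.
;WITH cte_Prev AS (
    SELECT *
        ,PrevCol1   = LAG(Col1)   OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol2   = LAG(Col2)   OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevCol3   = LAG(Col3)   OVER (PARTITION BY TestName ORDER BY FromDate)
        ,PrevToDate = LAG(ToDate) OVER (PARTITION BY TestName ORDER BY FromDate)
    FROM #Test
), cte_Compare AS (
    SELECT *
        ,IsChanged = CASE
                        WHEN Col1 = PrevCol1
                         AND Col2 = PrevCol2
                         AND Col3 = PrevCol3
                         AND FromDate = DATEADD(DAY, 1, PrevToDate) /*contiguous with previous row*/
                            THEN 0 /*No change and no gap*/
                        ELSE 1 /*Value change or gap starts a new group*/
                     END
    FROM cte_Prev
)
SELECT *, GroupID = SUM(IsChanged) OVER (PARTITION BY TestName ORDER BY ID)
FROM cte_Compare
The grouping and aggregation steps then stay exactly as in the main solution.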

Related

SQL: select average and one value from same column

How can I get an average value and one other value from the same column into two different columns in a new table?
I have this:
Person_ID  col2     col3_values
1          101010A  20000
1          101010B  30000
2          101010A  25000
2          101010B  30000
3          101010A  22000
3          101010B  24000
And I want a table that averages col3_values across the Person_IDs (1, 2, 3) and then compares this average with a column which holds the value for Person_ID 1, like this:
col2     AVG(value personID_1-3)  Value PersonID_1
101010A  22333                    20000
101010B  28000                    30000
I have tried a lot of code but nothing has worked. Can someone please help me with this? If this works, I would also be grateful for a fourth column that shows the difference between the average column and the column that holds Person_ID 1's values.
There are many ways to do this; one would be to use the OUTER APPLY construct:
select
    col2,
    AVG(t.col3_values) as "AVG(value personID_1-3)",
    a.col3_values as "Value PersonID_1",
    AVG(t.col3_values) - a.col3_values as "Difference"
from your_table t
outer apply (
    select col3_values from your_table where Person_ID = 1 and t.col2 = col2
) a
group by col2, a.col3_values
Or you could use a correlated subquery:
select
    col2,
    AVG(t.col3_values) as "AVG(value personID_1-3)",
    (
        select col3_values from your_table where Person_ID = 1 and t.col2 = col2
    ) as "Value PersonID_1"
from your_table t
group by col2
Sample output:
Query 1:
col2      AVG(value personID_1-3)  Value PersonID_1  Difference
--------  -----------------------  ----------------  ----------
101010A   22333                    20000             2333
101010B   28000                    30000             -2000
Query 2:
col2      AVG(value personID_1-3)  Value PersonID_1
--------  -----------------------  ----------------
101010A   22333                    20000
101010B   28000                    30000
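If the fourth "difference" column is also wanted with the correlated-subquery version, a minimal sketch would simply repeat the subquery (its aliased value cannot be referenced elsewhere in the same select list):
select
    col2,
    AVG(t.col3_values) as "AVG(value personID_1-3)",
    (
        select col3_values from your_table where Person_ID = 1 and t.col2 = col2
    ) as "Value PersonID_1",
    AVG(t.col3_values) - (
        select col3_values from your_table where Person_ID = 1 and t.col2 = col2
    ) as "Difference"
from your_table t
group by col2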

Using a While Loop to update a field by 1 each time a value changes

So I have a table that has two records that need to be one. I can identify them, but I want to update them in groups (sort of like a scan: set the value to 1, proceed until some other field changes, increment the number by 1, and proceed).
Example table:
IDEvent  1  2  3  4  5
Col1     1  1  0  1  0
Col2     a  a  b  a  b
So essentially, my outcome would look like this afterwards, so that I can write a select and group by Col1 to group the first two records into one but leave non-consecutive records alone. I tried while loops but couldn't figure it out.
IDEvent  1  2  3  4  5
Col1     1  1  0  2  0
Col2     A  A  B  A  B
alter view PtypeGroup as
WITH q AS
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY idsession, comment ORDER BY ideventrecord) AS rnd,
        ROW_NUMBER() OVER (PARTITION BY idsession ORDER BY ideventrecord) AS rn
    FROM [ratedeventssorted]
)
SELECT min(ideventrecord) as IDEventRecord,
    idsession,
    min(distancestamp) as distancestamp,
    sum(length) as length,
    min(comment) as comment2,
    min(eventscorename) as firstptype,
    min(eventscoredescription) as Ptype2,
    MIN(ideventrecord) AS first_number,
    MAX(ideventrecord) AS last_number,
    comment,
    COUNT(ideventrecord) AS numbers_count
    --into test
FROM q
where eventscorename IN ('Flex', 'Chpsl')
GROUP BY idsession,
    rnd - rn,
    comment
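As an aside, the core trick in that view is the difference of two ROW_NUMBER() values ("gaps and islands" again): consecutive rows with the same tracked value share the same rnd - rn, so they collapse into one group. A minimal, self-contained sketch of the same idea on the simplified example above (the #Events table and its column names are made up purely for illustration):
DROP TABLE IF EXISTS #Events
CREATE TABLE #Events (IDEvent INT, Col1 INT, Col2 CHAR(1))
INSERT INTO #Events VALUES (1,1,'a'),(2,1,'a'),(3,0,'b'),(4,1,'a'),(5,0,'b')

;WITH q AS
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Col2 ORDER BY IDEvent) AS rnd,
        ROW_NUMBER() OVER (ORDER BY IDEvent) AS rn
    FROM #Events
)
SELECT MIN(IDEvent) AS FirstEvent,  -- first row in the consecutive run
       MAX(IDEvent) AS LastEvent,   -- last row in the consecutive run
       Col2,
       COUNT(*)     AS RowsInGroup  -- events 1 and 2 collapse; 3, 4 and 5 stay separate
FROM q
GROUP BY Col2, rnd - rn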

T-SQL select rows by oldest date and unique category

I'm using Microsoft SQL. I have a table that contains information stored by two different categories and a date. For example:
ID  Cat1  Cat2  Date/Time  Data
1   1     A     11:00      456
2   1     B     11:01      789
3   1     A     11:01      123
4   2     A     11:05      987
5   2     B     11:06      654
6   1     A     11:06      321
I want to extract one line for each unique combination of Cat1 and Cat2 and I need the line with the oldest date. In the above I want ID = 1, 2, 4, and 5.
Thanks
Have a look at row_number() on MSDN.
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY date_time, id) rn
    FROM mytable
) q
WHERE rn = 1
(run the code on SQL Fiddle)
Quassnoi's answer is fine, but I'm a bit uncomfortable with how it handles dups. It seems to return based on insertion order, but I'm not sure if even that can be guaranteed? (see these two fiddles for an example where the result changes based on insertion order: dup at the end, dup at the beginning)
Plus, I kinda like staying with old-school SQL when I can, so I would do it this way (see this fiddle for how it handles dups):
select *
from my_table t1
left join my_table t2
    on t1.cat1 = t2.cat1
    and t1.cat2 = t2.cat2
    and t1.datetime > t2.datetime
where t2.datetime is null
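Either way, here is a sketch of the ROW_NUMBER() query rewritten against the column names from the question (the table name MyTable and the column name EventDateTime are assumptions, since the question doesn't give the real ones):
SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Cat1, Cat2
                           ORDER BY EventDateTime, ID) AS rn
    FROM MyTable
) q
WHERE rn = 1  -- oldest row per (Cat1, Cat2); ID breaks ties deterministically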

How to generate an Id for different values in calculated columns?

I have a big query (which is already ordered as per my needs); one of the columns is calculated (a varchar combination of other columns in the query). I need an incremental integer to identify this calculated column (duplicates should have the same id).
I can't use RANK because the order in which I need the incremental number uses different criteria than the ones used to generate the calculated column.
This is what I need:
OrderByColumn  CalculatedColumn  GeneratedId
1              ggg               1
1              aaa               2
1              ggg               1
1              fff               3
2              vvv               4
2              ddd               5
3              ggg               1
4              rrr               6
5              aaa               2
5              ooo               7
5              kkk               8
8              vvv               4
9              aaa               2
Use
ROW_NUMBER() OVER (PARTITION BY XXX ORDER BY YYY)
assuming you are using SQL 2005 or better:
http://msdn.microsoft.com/en-us/library/ms186734.aspx
-- though like you said this doesn't solve your duplicates-with-the-same-id requirement - ahhh! Give me a moment - I should be able to do this pretty easily
Edit:
Here you go -
http://sqlfiddle.com/#!3/2f014/2
-- Select stuff:
select vals.val as genid, ord.*
from ord
-- Join back to a distinct list of CalculatedColumn with a row_number() to id them
inner join (
    select calculatedcolumn, row_number() over (order by calculatedcolumn) as val
    from ord
    group by calculatedcolumn
) as vals on vals.calculatedcolumn = ord.calculatedcolumn
order by ord.orderbycolumn
Of course this is using the calculated column in the subquery - so you will need to re-calculate unless you store the value in a temp table or table variable
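An alternative sketch that avoids the self-join: DENSE_RANK() also gives duplicate calculated values the same id in a single pass (like the join above, it numbers by the calculated value itself rather than by order of first appearance):
select o.*,
       dense_rank() over (order by o.calculatedcolumn) as genid  -- duplicates share an id
from ord o
order by o.orderbycolumn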

Expand row results based on a value in column (with iterator)

Need help from you all in writing up this query. Running SQL 2005 Standard edition.
I have a basic query that gets a subset of records from a table where the record_Count is greater then 1.
SELECT *
FROM Table_Records
WHERE Record_Count > 1
This query gives me a result set of, say:
TableRecords_ID  Record_Desc  Record_Count
123              XYZ          3
456              PQR          2
The above query needs to be modified so that each record appears as many times as the Record_Count and has its iteration number with it, as a value. So the new query should return results as follows:
TableRecords_ID  Record_Desc  Record_Count  Rec_Iteration
123              XYZ          3             1
123              XYZ          3             2
123              XYZ          3             3
456              PQR          2             1
456              PQR          2             2
Could anyone help me write this query up? I appreciate the help.
Clarification: the Rec_Iteration column is a sub-representation of Record_Count. Basically, since Record_Count is three for the XYZ description, three rows are returned, with Rec_Iteration representing rows one, two and three respectively.
You can use a recursive CTE for this query. Below I use a table variable @T instead of your table Table_Records.
declare @T table(TableRecords_ID int, Record_Desc varchar(3), Record_Count int)

insert into @T
select 123, 'XYZ', 3 union all
select 456, 'PQR', 2

;with cte as
(
    select TableRecords_ID,
           Record_Desc,
           Record_Count,
           1 as Rec_Iteration
    from @T
    where Record_Count > 1
    union all
    select TableRecords_ID,
           Record_Desc,
           Record_Count,
           Rec_Iteration + 1
    from cte
    where Rec_Iteration < Record_Count
)
select TableRecords_ID,
       Record_Desc,
       Record_Count,
       Rec_Iteration
from cte
order by TableRecords_ID,
         Rec_Iteration
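A non-recursive alternative, as a sketch: join to a list of sequential numbers instead of recursing. Here master..spt_values (type 'P') is used only as a convenient built-in source of small sequential integers; a dedicated numbers/tally table would work the same way.
select t.TableRecords_ID,
       t.Record_Desc,
       t.Record_Count,
       n.number + 1 as Rec_Iteration      -- one row per iteration, starting at 1
from Table_Records t
join master..spt_values n
    on n.type = 'P'
   and n.number < t.Record_Count          -- repeat each record Record_Count times
where t.Record_Count > 1
order by t.TableRecords_ID, Rec_Iteration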
