I am stuck on a problem. I want to remove consecutive duplicate records in a table,
i.e., in the table below I want to calculate the total cost without consecutive duplicates.
For example, row 3 should be removed because it is a consecutive duplicate of row 2: all three column values are the same.
The same applies in the second group: row 7 should be removed as a duplicate of row 6.
The total cost at the end should be 10.
How can I do this in SSMS?
ClaimID            ClaimLine  Cost
M0001R1616878951   2           10
M0001R1616878951   2          -10
M0001R1616878951   2          -10
M0001R1616878951   3           10
M0001R1616878951   3          -10
M0001R1616878951   3           10
M0001R1616878951   3           10
I searched for this problem and tried the LEAD and LAG keywords, but couldn't get them to work.
I prepared an example that might solve your issue.
I used a CTE, ROW_NUMBER, and an IIF expression to generate row numbers and filter out the duplicate rows.
Preparing the example data:
DECLARE @vClaims TABLE (
ClaimID NVARCHAR(16),
ClaimLine SMALLINT,
Cost SMALLINT
)
INSERT INTO @vClaims
VALUES
('M0001R1616878951', 2, 10),
('M0001R1616878951', 2, -10),
('M0001R1616878951', 2, -10),
('M0001R1616878951', 3, 10),
('M0001R1616878951', 3, -10),
('M0001R1616878951', 3, 10),
('M0001R1616878951', 3, 10)
And the query script:
;WITH CTE_ClaimsWithSort AS (
SELECT
ClaimID,
ClaimLine,
Cost,
RowNumber = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM
@vClaims
), CTE_ClaimsFiltered AS (
SELECT
ClaimID,
ClaimLine,
Cost,
RowNumber,
isDuplicate = IIF(
LAG(ClaimID) OVER(ORDER BY RowNumber) = ClaimID
AND LAG(ClaimLine) OVER(ORDER BY RowNumber) = ClaimLine
AND LAG(Cost) OVER(ORDER BY RowNumber) = Cost
, 1, 0)
FROM
CTE_ClaimsWithSort
)
SELECT
ClaimID,
ClaimLine,
Cost,
RowNumber,
isDuplicate
FROM
CTE_ClaimsFiltered
WHERE
isDuplicate = 0
The first CTE generates row numbers for the example data. If you have a date column, order by that instead: ORDER BY (SELECT NULL) does not guarantee that rows keep their insertion order.
The second CTE uses LAG and IIF to flag rows whose ClaimID, ClaimLine, and Cost all match the previous row, and the final query filters those duplicates out.
The result:
ClaimID            ClaimLine  Cost  RowNumber  isDuplicate
M0001R1616878951   2           10   1          0
M0001R1616878951   2          -10   2          0
M0001R1616878951   3           10   4          0
M0001R1616878951   3          -10   5          0
M0001R1616878951   3           10   6          0
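To get the total cost of 10 that the question asks for, the final SELECT can be replaced with an aggregate over the filtered rows:

```sql
-- Sum only the non-duplicate rows; for the sample data:
-- 10 - 10 + 10 - 10 + 10 = 10
SELECT SUM(Cost) AS TotalCost
FROM CTE_ClaimsFiltered
WHERE isDuplicate = 0
```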
I have a dataset that looks something like this:
id  committed  delivered  timestamp   stddev
1   10         8          01-02-2022  ?
2   20         15         01-14-2022  ?
3   12         12         01-30-2022  ?
4   2          0          02-14-2022  ?
...
99                                     null
I am trying to calculate the standard deviation and average between sprint x and all the sprints after it; for example, over sprints 1-4, then 2-4, then 3-4, and so on. If there are no records after a sprint, its stddev would be null.
With the current Snowflake functions I have been unable to calculate this stddev at all, let alone do it with a lag/lead function.
Does anyone have any advice? Thanks in advance!
Update:
I've figured out how to calculate a moving average over sprint x and the next sprint, but not over all the following sprints:
(delivered + lead(delivered) over (partition by id order by timestamp asc)) / 2
The stddev of two values can likewise be calculated as the absolute difference divided by sqrt(2).
You're looking for a frame clause -- this is part of the window function that can specify which rows in the current partition to use in the calculation.
select
id,
stddev(delivered) over (
order by id asc
rows between current row and unbounded following
) as stddev,
avg(delivered) over (
order by id asc
rows between current row and unbounded following
) as avg
from my_data
tconbeer is 100% correct, but here is the code and a count to show it working,
with your example data (mashed) into a VALUES clause to avoid creating a table.
I also stripped out the timestamp column: normally I would order by it, but it's not used in this demo and I could not see the pattern, so I dropped it as immaterial to the example.
SELECT t.*
,count(delivered) over ( order by id asc rows between current row and unbounded following ) as _count
,stddev(delivered) over ( order by id asc rows between current row and unbounded following ) as stddev
,avg(delivered) over ( order by id asc rows between current row and unbounded following ) as avg
FROM VALUES
(1, 10, 8),
(2, 20, 15),
(3, 12, 12),
(4, 2, 0),
(5, 2, 0),
(6, 0, 0)
t(id, committed, delivered)
ORDER BY 1;
gives:
ID  COMMITTED  DELIVERED  _COUNT  STDDEV       AVG
1   10         8          6       6.765106577  5.833
2   20         15         5       7.469939759  5.4
3   12         12         4       6            3
4   2          0          3       0            0
5   2          0          2       0            0
6   0          0          1       null         0
You can create a dummy table with sequentially generated ids using the GENERATOR function for a particular range, and LEFT JOIN it with your table. That way you get rows with NULL values where an id is not present, and you can then use lag/lead to get the average.
--- untested sketch; column aliases corrected to match the question's data
select seq4() + 1 as id1, tmp.*  -- seq4() starts at 0, so shift by 1 to line up with the ids
from table(generator(rowcount => 10)) v
LEFT JOIN (
    SELECT 1 as id, 8 as delivered, '01-02-2022' as ts UNION ALL
    SELECT 2, 15, '01-14-2022' UNION ALL
    SELECT 3, 12, '01-30-2022' UNION ALL
    SELECT 5, 0, '02-14-2022'
) tmp
ON id1 = tmp.id
I have a table that looks like this.
Category  Type  fromDate  Value
1         1     1/1/2022  5
1         2     1/1/2022  10
2         1     1/1/2022  7.5
2         2     1/1/2022  15
3         1     1/1/2022  3.5
3         2     1/1/2022  5
3         1     4/1/2022  5
3         2     4/1/2022  10
I'm trying to filter this table down to keep only the most recent grouping of each Category/Type, i.e. rows 5 and 6 would be removed by the query since they are older records.
So far I have the query below, but I am getting an aggregate error because the Value column is not aggregated. How do I get around this without aggregating? I want to keep the actual value that is in the column.
SELECT T1.Category, T1.Type, T2.maxDate, T1.Value
FROM (SELECT Category, Type, MAX(fromDate) AS maxDate
FROM Table GROUP BY Category,Type) T2
INNER JOIN Table T1 ON T1.Category=T2.Category
GROUP BY T1.Category, T1.Type, T2.MaxDate
This has been asked and answered dozens and dozens of times. But it was quick and painless to type up an answer. This should work for you.
declare @MyTable table
(
Category int
, Type int
, fromDate date
, Value decimal(5,2)
)
insert @MyTable
select 1, 1, '1/1/2022', 5 union all
select 1, 2, '1/1/2022', 10 union all
select 2, 1, '1/1/2022', 7.5 union all
select 2, 2, '1/1/2022', 15 union all
select 3, 1, '1/1/2022', 3.5 union all
select 3, 2, '1/1/2022', 5 union all
select 3, 1, '4/1/2022', 5 union all
select 3, 2, '4/1/2022', 10
select Category
, Type
, fromDate
, Value
from
(
select *
, RowNum = ROW_NUMBER() over(partition by Category, Type order by fromDate desc)
from @MyTable
) x
where x.RowNum = 1
order by x.Category
, x.Type
IP QID ScanDate Rank
101.110.32.80 6 2016-09-28 18:33:21.000 3
101.110.32.80 6 2016-08-28 18:33:21.000 2
101.110.32.80 6 2016-05-30 00:30:33.000 1
I have a table with records grouped by IP address and QID. My requirement is to find which record missed the sequence in the date column, in other words where the date difference is more than 30 days. In the table above, the difference between rank 1 and rank 2 is more than 30 days, so I should flag the rank 2 record.
You can use LAG in SQL Server 2012+:
declare @Tbl Table (Ip VARCHAR(50), QID INT, ScanDate DATETIME, [Rank] INT)
INSERT INTO @Tbl
VALUES
('101.110.32.80', 6, '2016-09-28 18:33:21.000', 3),
('101.110.32.80', 6, '2016-08-28 18:33:21.000', 2),
('101.110.32.80', 6, '2016-05-30 00:30:33.000', 1)
;WITH Result
AS
(
SELECT
T.Ip ,
T.QID ,
T.ScanDate ,
T.[Rank],
LAG(T.[Rank]) OVER (ORDER BY T.[Rank]) PrivSRank,
LAG(T.ScanDate) OVER (ORDER BY T.[Rank]) PrivScanDate
FROM
@Tbl T
)
SELECT
R.Ip ,
R.QID ,
R.ScanDate ,
R.Rank ,
R.PrivScanDate,
IIF(DATEDIFF(DAY, R.PrivScanDate, R.ScanDate) > 30, 'This is greater than 30 day. Rank ' + CAST(R.PrivSRank AS VARCHAR(10)), '') CFlag
FROM
Result R
Result:
Ip QID ScanDate Rank CFlag
------------------------ ----------- ----------------------- ----------- --------------------------------------------
101.110.32.80 6 2016-05-30 00:30:33.000 1
101.110.32.80 6 2016-08-28 18:33:21.000 2 This is greater than 30 day. Rank 1
101.110.32.80 6 2016-09-28 18:33:21.000 3 This is greater than 30 day. Rank 2
While Window Functions could be used here, I think a self join might be more straight forward and easier to understand:
SELECT
t1.IP,
t1.QID,
t1.Rank,
t1.ScanDate as endScanDate,
t2.ScanDate as beginScanDate,
datediff(day, t2.scandate, t1.scandate) as scanDateDays
FROM
table as t1
INNER JOIN table as t2 ON
t1.ip = t2.ip
AND t1.qid = t2.qid
AND t1.rank - 1 = t2.rank --get the record from t2 that is one less in rank
WHERE datediff(day, t2.scandate, t1.scandate) > 30 --only records greater than 30 days
It's pretty self-explanatory. We are joining the table to itself and joining the ranks together where rank 2 gets joined to rank 1, rank 3 gets joined to rank 2, and so on. Then we just test for records that are greater than 30 days using the datediff function.
I would use a windowed function to avoid the self join, which in many cases will perform better.
WITH cte
AS (
SELECT
t.IP
, t.QID
, LAG(t.ScanDate) OVER (PARTITION BY t.IP, t.QID ORDER BY t.ScanDate) AS beginScanDate
, t.ScanDate AS endScanDate
, DATEDIFF(DAY,
LAG(t.ScanDate) OVER (PARTITION BY t.IP, t.QID ORDER BY t.ScanDate),
t.ScanDate) AS Diff
FROM
MyTable AS t
)
SELECT
*
FROM
cte c
WHERE
Diff > 30;
I have a problem with a query.
This is the data (ordered by Timestamp):
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the first occurrence of the date. The exception here is that I only want to group rows if the run is not interrupted by a different value in between.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it, I just can't think of it. Could anyone help?
This is what I started with; it probably could be built on. It is a query that should locate when a value changes:
> SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
> d1.Value <> d2.Value
It probably could be done with a good use of ROW_NUMBER, but I can't manage it.
Sample data:
declare #T table (ID int, Value int, Timestamp date)
insert into #T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from #T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second query simply finds rows where there isn't an immediate preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
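On SQL Server 2012+, a LAG-based variant (a sketch against the same @T sample data) avoids the self join entirely: keep a row only when its Value differs from the previous row's.

```sql
;With Flagged as (
    select *,
           -- LAG is NULL for the first row; NULL = Value is not true,
           -- so IIF yields 0 and the first row is always kept
           IIF(LAG(Value) OVER (ORDER BY Timestamp) = Value, 1, 0) as SameAsPrev
    from @T
)
select ID, Value, Timestamp
from Flagged
where SameAsPrev = 0
```

For the sample data this keeps IDs 1, 3, 5, and 6, matching the left-join approach above.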
I would like to filter duplicate rows so that, for each unique combination of rid and did, the row with the maximum active and then the minimum modified is picked. A self join? Or is there an approach that would perform better?
Example:
id rid modified active did
1 1 2010-09-07 11:37:44.850 1 1
2 1 2010-09-07 11:38:44.000 1 1
3 1 2010-09-07 11:39:44.000 1 1
4 1 2010-09-07 11:40:44.000 0 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Output expected is
1 1 2010-09-07 11:37:44.850 1 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Commenting on the first answer: the suggestion does not work for the dataset below (where active = 0 and modified is the minimum for that group):
id rid modified active did
1 1 2010-09-07 11:37:44.850 1 1
2 1 2010-09-07 11:38:44.000 1 1
3 1 2010-09-07 11:39:44.000 1 1
4 1 2010-09-07 11:36:44.000 0 1
5 2 2010-09-07 11:41:44.000 1 1
6 1 2010-09-07 11:42:44.000 1 2
Assuming SQL Server 2005+. Use RANK() instead of ROW_NUMBER() if you want ties returned.
;WITH YourTable as
(
SELECT 1 id,1 rid,cast('2010-09-07 11:37:44.850' as datetime) modified, 1 active,1 did union all
SELECT 2,1,'2010-09-07 11:38:44.000', 1,1 union all
SELECT 3,1,'2010-09-07 11:39:44.000', 1,1 union all
SELECT 4,1,'2010-09-07 11:36:44.000', 0,1 union all
SELECT 5,2,'2010-09-07 11:41:44.000', 1,1 union all
SELECT 6,1,'2010-09-07 11:42:44.000', 1,2
),cte as
(
SELECT id,rid,modified,active, did,
ROW_NUMBER() OVER (PARTITION BY rid,did ORDER BY active DESC, modified ASC ) RN
FROM YourTable
)
SELECT id,rid,modified,active, did
FROM cte
WHERE rn=1
order by id
select id, rid, min(modified), max(active), did from foo group by rid, did order by id;
You can get good performance with a CROSS APPLY if you have a table that has one row for each combination of rid and did:
SELECT
X.*
FROM
ParentTable P
CROSS APPLY (
SELECT TOP 1 *
FROM YourTable T
WHERE P.rid = T.rid AND P.did = T.did
ORDER BY active DESC, modified
) X
Substituting (SELECT DISTINCT rid, did FROM YourTable) for ParentTable would work but will hurt performance.
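For completeness, that DISTINCT-driven variant is the same CROSS APPLY, just derived from YourTable itself instead of a separate parent table:

```sql
SELECT
    X.*
FROM
    (SELECT DISTINCT rid, did FROM YourTable) P  -- stands in for ParentTable
CROSS APPLY (
    SELECT TOP 1 *
    FROM YourTable T
    WHERE P.rid = T.rid AND P.did = T.did
    ORDER BY active DESC, modified
) X
```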
Also, here is my crazy, single scan magic query which can often outperform other methods:
SELECT
id = Substring(Packed, 6, 4),
rid,
modified = Convert(datetime, Substring(Packed, 2, 4)),
Active = Convert(bit, 1 - Substring(Packed, 1, 1)),
did
FROM
(
SELECT
rid,
did,
Packed = Min(Convert(binary(1), 1 - active) + Convert(binary(4), modified) + Convert(binary(4), id))
FROM
YourTable
GROUP BY
rid,
did
) X
This method is not recommended because it's not easy to understand, and it's very easy to make mistakes with it. But it's a fun oddity because it can outperform other methods in some cases.