Remove Almost Duplicate Rows in SQL

I have found a lot of examples online of how to remove duplicate rows in a SQL table but I cannot figure out how to remove almost duplicate rows.
Data Example
+--------+----------+--------+
| Col1 | Col2 | NumCol |
+--------+----------+--------+
| USA | Organic | 300 |
| USA | Organic | 400 |
| Canada | Referral | 120 |
| Canada | Referral | 120 |
+--------+----------+--------+
Desired Output
+--------+----------+--------+
| Col1 | Col2 | NumCol |
+--------+----------+--------+
| USA | Organic | 400 |
| Canada | Referral | 120 |
+--------+----------+--------+
In this example, if 2 rows are identical then I would like one of them to be removed. In addition, if 2 rows match based on Col1 and Col2, then I would like the row with the lesser value in NumCol to be removed.
My SQL Server Express code is:
WITH CTE AS(
SELECT [Col1]
,[Col2]
,[NumCol]
, RN = ROW_NUMBER()OVER(PARTITION BY [Col1]
,[Col2]
,[NumCol] ORDER BY [Col1])
FROM table
)
DELETE FROM CTE WHERE RN > 1
This code does a good job of deleting duplicates but it doesn't get rid of rows where only Col1 and Col2 match but not NumCol. How should I approach something like this? I'm a newbie to SQL, so any explanation in layman's terms is appreciated!

You can let the row numbers restart per (Col1, Col2) pair by changing:
RN = ROW_NUMBER()OVER(PARTITION BY [Col1]
,[Col2]
,[NumCol] ORDER BY [Col1])
To:
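RN = ROW_NUMBER() OVER(
PARTITION BY Col1, Col2
ORDER BY NumCol desc)
Ordering by NumCol desc gives the row with the highest NumCol in each (Col1, Col2) group row number 1, so deleting the rows where RN > 1 removes the rows with the lower NumCol as well as any exact duplicates.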
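To check the partitioning logic without a SQL Server instance, the same idea can be sketched against an in-memory SQLite database from Python. This is only an illustration under assumptions: the table name `t` is made up, and because SQLite cannot delete through a CTE the way SQL Server can, the rows to delete are matched by `rowid` instead.

```python
import sqlite3

# SQLite stands in for SQL Server here; the table name "t" is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Col1 TEXT, Col2 TEXT, NumCol INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("USA", "Organic", 300),
        ("USA", "Organic", 400),
        ("Canada", "Referral", 120),
        ("Canada", "Referral", 120),
    ],
)

# Number the rows inside each (Col1, Col2) group, highest NumCol first,
# then delete every row whose number is greater than 1.
conn.execute(
    """
    DELETE FROM t
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY Col1, Col2
                       ORDER BY NumCol DESC
                   ) AS rn
            FROM t
        )
        WHERE rn > 1
    )
    """
)

remaining = conn.execute(
    "SELECT Col1, Col2, NumCol FROM t ORDER BY Col1"
).fetchall()
print(remaining)
```

On SQL Server itself, the CTE plus DELETE form from the question should work once the PARTITION BY and ORDER BY are changed as described.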

Related

I have a table where customer IDs are duplicated because of their reactivation dates. I need to pivot the reactivation date per CustomerID

I have the following table
I need to pivot the table and have it like the table below:
How can I have the unique customer ID in a column and all the reactivation dates pivoted like in the above picture?

To attribute a numeric sequence to the reactivation dates, use row_number() over() and then you can pivot that sequence from rows to columns:
select
customer_id
, activation_date
, [1] as reactivation_dt_1
, [2] as reactivation_dt_2
, [3] as reactivation_dt_3
, [4] as reactivation_dt_4
from (
select
customer_id, activation_date, reactivation_date
, row_number() over(partition by customer_id
order by reactivation_date ASC) as pivcol
from mytable
) as d
pivot (
max(reactivation_date)
for pivcol in ([1],[2],[3],[4])
) as p
order by
customer_id
Result:
+-------------+-----------------+-------------------+-------------------+-------------------+-------------------+
| customer_id | activation_date | reactivation_dt_1 | reactivation_dt_2 | reactivation_dt_3 | reactivation_dt_4 |
+-------------+-----------------+-------------------+-------------------+-------------------+-------------------+
| 1 | 2010-01-01 | 2012-02-01 | 2015-03-01 | 2017-07-01 | 2022-07-01 |
| 2 | 2011-12-03 | 2013-05-01 | 2014-08-10 | 2015-12-09 | |
+-------------+-----------------+-------------------+-------------------+-------------------+-------------------+
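The numbering-then-pivoting idea can also be tried out in SQLite from Python. SQLite has no PIVOT operator, so this sketch swaps in conditional aggregation (MAX over a CASE expression) for the pivot step; the input rows are an assumption, reconstructed from the result table in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "customer_id INTEGER, activation_date TEXT, reactivation_date TEXT)"
)
# Assumed input, derived from the answer's result table.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "2010-01-01", "2012-02-01"),
        (1, "2010-01-01", "2015-03-01"),
        (1, "2010-01-01", "2017-07-01"),
        (1, "2010-01-01", "2022-07-01"),
        (2, "2011-12-03", "2013-05-01"),
        (2, "2011-12-03", "2014-08-10"),
        (2, "2011-12-03", "2015-12-09"),
    ],
)

# Step 1: number each customer's reactivation dates (pivcol).
# Step 2: turn pivcol values 1..4 into columns with MAX(CASE ...).
pivoted = conn.execute(
    """
    SELECT customer_id,
           activation_date,
           MAX(CASE WHEN pivcol = 1 THEN reactivation_date END) AS reactivation_dt_1,
           MAX(CASE WHEN pivcol = 2 THEN reactivation_date END) AS reactivation_dt_2,
           MAX(CASE WHEN pivcol = 3 THEN reactivation_date END) AS reactivation_dt_3,
           MAX(CASE WHEN pivcol = 4 THEN reactivation_date END) AS reactivation_dt_4
    FROM (
        SELECT customer_id, activation_date, reactivation_date,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY reactivation_date
               ) AS pivcol
        FROM mytable
    ) AS d
    GROUP BY customer_id, activation_date
    ORDER BY customer_id
    """
).fetchall()

for row in pivoted:
    print(row)
```

As in the PIVOT version, the number of output columns is fixed in the query, so a customer with more than four reactivation dates would lose the extra ones.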

Sum the result of a query

I have made a query with the below code which works fine. I then need to sum the diff column which will give me the total amount. It also needs to ignore the first row in the sum to reflect the true result.
select col,
col - coalesce(lag(col) over (order by id), 0) as diff
from t;
+------+------+
| COL  | DIFF |
+------+------+
| 1200 | 0    |
| 1200 | 0    |
| 1202 | 2    |
| 1204 | 2    |
| 1204 | 0    |
| 1208 | 4    |
+------+------+
This is the query result. I need the sum of the diff column, which in this case would be 8.
As added by OP as comment, adding to the question.
Below is the query. How can I improve the performance?
Select SUM(result)
FROM (
select col1,
col1 - coalesce(lag(col1) over (order by col1), 0) as result
from table
where CAST(t_stamp AS TIME) BETWEEN '07:00' and '15:00'
and DATEPART(DAY, T_Stamp) = '20'
and DATEPART(MONTH, T_Stamp) = '05'
and DATEPART(YEAR, T_Stamp) = '2020'
) sub

You don't need this much complexity. You can simply use a RANGE frame with the SUM window aggregate.
DECLARE @table TABLE (col1 int)
INSERT INTO @table values(1),(2),(3),(4),(5)
SELECT col1, SUM(col1) OVER(ORDER BY col1 RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS diff
FROM @table
ORDER BY col1
+------+------+
| col1 | diff |
+------+------+
| 1 | 15 |
| 2 | 14 |
| 3 | 12 |
| 4 | 9 |
| 5 | 5 |
+------+------+
UPDATE (based on the OP's edit):
SELECT SUM(Diff) as diff_sum
FROM
(
SELECT col1,
ISNULL(col1 - LAG(col1) OVER(ORDER BY Col1),0) AS DIFF
FROM @table
) AS T
Result set (for the sample values in the question):
+----------+
| diff_sum |
+----------+
| 8 |
+----------+
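The LAG-then-SUM approach can be checked with the question's sample values in an in-memory SQLite database from Python. This is a sketch under assumptions: the table `t` and its `id` column are made up to give the rows an order, and COALESCE is used where the T-SQL answer uses ISNULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col INTEGER)")
conn.executemany(
    "INSERT INTO t (col) VALUES (?)",
    [(1200,), (1200,), (1202,), (1204,), (1204,), (1208,)],
)

# Difference to the previous row; the first row has no predecessor,
# so its NULL difference is replaced by 0 and does not affect the sum.
diff_sum = conn.execute(
    """
    SELECT SUM(diff)
    FROM (
        SELECT COALESCE(col - LAG(col) OVER (ORDER BY id), 0) AS diff
        FROM t
    ) AS sub
    """
).fetchone()[0]

# Consecutive differences telescope to (last value - first value).
# If col never decreases in id order, that is simply MAX - MIN.
shortcut = conn.execute("SELECT MAX(col) - MIN(col) FROM t").fetchone()[0]

print(diff_sum, shortcut)
```

The shortcut only matches when the column never decreases; if that holds for the real data, it avoids the window function entirely, which may help with the performance concern raised in the question.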

How to check if SQL records are in a specific order

I'm having trouble figuring out how I can check if records on a table are in a specific order. The simplified table design is essentially this:
+------------+----------------+--------+
| ID (GUID) | StartDate | NumCol |
+------------+----------------+--------+
| CEE8C17... | 8/17/2019 3:11 | 22 |
| BB22001... | 8/17/2019 3:33 | 21 |
| 4D40B12... | 8/17/2019 3:47 | 21 |
| 3655125... | 8/17/2019 4:06 | 20 |
| 3456CD1... | 8/17/2019 4:22 | 20 |
| 38BAF92... | 8/17/2019 4:40 | 19 |
| E60CBE8... | 8/17/2019 5:09 | 19 |
| 5F2756B... | 8/17/2019 5:24 | 18 |
+------------+----------------+--------+
The ID column is a non-sequential GUID. The table is sorted by default on the StartDate when data is entered. However I am trying to flag instances where the NumCol values are out of descending order. The NumCol values can be identical on adjacent records, but ultimately they must be descending.
+--------+
| NumCol |
+--------+
| 22 |
| *20 | <- I want the ID where this occurs
| 21 |
| 20 |
| 20 |
| 19 |
| 19 |
| 18 |
+--------+
I've tried LEFT JOIN this table to itself, but can't seem to come up with an ON clause that gives the right results:
ON a.ID <> b.ID AND a.NumCol > b.NumCol
I also thought I could use OFFSET n ROWS to compare the default sorted table against one with an ORDER BY NumCol performed on it. I can't come up with anything that works.
I need a solution that will work for both SQL Server and SQL Compact.

With EXISTS:
select t.* from tablename t
where exists (
select 1 from tablename
where numcol > t.numcol and startdate > t.startdate
)
Or with row_number() window function:
select t.id, t.startdate, t.numcol
from (
select *,
row_number() over (order by startdate desc) rn1,
row_number() over (order by numcol) rn2
from tablename
) t
where rn1 > rn2

This might be easiest:
select * from T t1
where NumCol < (select max(NumCol) from T t2 where t2.StartDate > t1.StartDate);
The EXISTS version will probably optimize better, though.
Using analytic functions, you could try this approach, which finds breaks in monotonicity between consecutive rows. Because it only compares each row with the one that immediately follows it by StartDate, it might not return all the rows you're interested in seeing:
with data as (
select *, lag(NumCol) over (order by StartDate desc) as prevNumCol
from T
)
select * from data where prevNumCol > NumCol;
Here's a better solution that's probably not available in both of your environments:
with data as (
select *,
max(NumCol) over (
order by StartDate desc
rows between unbounded preceding and current row
) as prevMax
from T
)
select * from data where prevMax > NumCol;
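Both the EXISTS check and the running-maximum check can be tried on the out-of-order sample from the question using SQLite from Python. The IDs `A` to `H` are hypothetical placeholders for the truncated GUIDs, and the dates are rewritten in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID TEXT, StartDate TEXT, NumCol INTEGER)")

times = ["03:11", "03:33", "03:47", "04:06", "04:22", "04:40", "05:09", "05:24"]
num_cols = [22, 20, 21, 20, 20, 19, 19, 18]  # second value is out of order
rows = [
    (chr(ord("A") + i), f"2019-08-17 {t}", n)  # hypothetical IDs A..H
    for i, (t, n) in enumerate(zip(times, num_cols))
]
conn.executemany("INSERT INTO T VALUES (?, ?, ?)", rows)

# EXISTS: flag a row if any later row has a larger NumCol.
exists_ids = [
    r[0]
    for r in conn.execute(
        """
        SELECT cur.ID
        FROM T AS cur
        WHERE EXISTS (
            SELECT 1 FROM T AS later
            WHERE later.NumCol > cur.NumCol
              AND later.StartDate > cur.StartDate
        )
        ORDER BY cur.StartDate
        """
    )
]

# Running maximum: walk from the latest row backwards and compare each
# row with the largest NumCol seen so far (current row included).
running_max_ids = [
    r[0]
    for r in conn.execute(
        """
        WITH data AS (
            SELECT ID, StartDate, NumCol,
                   MAX(NumCol) OVER (
                       ORDER BY StartDate DESC
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                   ) AS prevMax
            FROM T
        )
        SELECT ID FROM data WHERE prevMax > NumCol ORDER BY StartDate
        """
    )
]

print(exists_ids, running_max_ids)
```

Both queries flag the row whose NumCol of 20 is followed by a 21. The EXISTS form uses no window functions, so it is the more likely candidate where window functions are unavailable.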

Update a column with LastExclusionDate

In SQL Server 2012, I have a table t1 where we store a list of excluded products.
I would like to add a column LastExclusionDate to store the date since which the product has been excluded.
Every day, a product is inserted into the table if it is excluded. If it is not excluded, no row is inserted, so the next time the product is excluded there is a date gap between the new row and the previous insert.
I would like to find a T-SQL query to update the LastExclusionDate column.
I would like to use it to populate the LastExclusionDate column the first time (initialisation) and then every day to update the column when we insert a new row.
I've tried this query, but I don't know how to get LastExclusionDate!
;WITH Cte AS
(
SELECT
product_id,
CreationDate,
LAG(CreationDate) OVER (PARTITION BY Product_ID ORDER BY CreationDate) AS GapStart,
(DATEDIFF(DAY, LAG(CreationDate) OVER (PARTITION BY Product_id ORDER BY CreationDate), CreationDate) -1) AS GapDays
FROM
#t1
)
SELECT *
FROM cte
Here's some sample data:
+------------+--------------+--------------------------------+
| product_id | CreationDate | LastExclusionDate_(toPopulate) |
+------------+--------------+--------------------------------+
| 100 | 2018-05-01 | 2018-05-01 |
| 100 | 2018-05-02 | 2018-05-01 |
| 100 | 2018-05-03 | 2018-05-01 |
| 100 | 2018-06-01 | 2018-06-01 |
| 100 | 2018-06-02 | 2018-06-01 |
| 200 | 2018-09-01 | 2018-09-01 |
| 200 | 2018-09-02 | 2018-09-01 |
| 200 | 2018-09-17 | 2018-09-17 |
+------------+--------------+--------------------------------+
Thanks

The idea in finding gap-less sequences is to compare the series to a gap-less sequence and find groups of records where the difference of both doesn't change. For example, when the date increases one by one and a row number also does, then the difference between both stays the same and we found a group:
WITH
cte (product_id, CreationDate, grp) AS (
SELECT product_id, CreationDate
, DATEDIFF(day, '19000101', CreationDate)
- ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY CreationDate)
FROM #t1
)
SELECT product_id, CreationDate
, MIN(CreationDate) OVER (PARTITION BY product_id, grp) AS LastExclusionDate
FROM cte
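The "day number minus row number" trick can be verified on the question's sample data with SQLite from Python. In this sketch, `julianday()` stands in for the T-SQL `DATEDIFF(day, '19000101', ...)` expression, and the table name `t1` is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (product_id INTEGER, CreationDate TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [
        (100, "2018-05-01"), (100, "2018-05-02"), (100, "2018-05-03"),
        (100, "2018-06-01"), (100, "2018-06-02"),
        (200, "2018-09-01"), (200, "2018-09-02"), (200, "2018-09-17"),
    ],
)

# Within a run of consecutive days, the day number and the row number
# both grow by 1 per row, so their difference (grp) stays constant.
# A gap in the dates changes the difference and starts a new group.
result = conn.execute(
    """
    WITH cte AS (
        SELECT product_id, CreationDate,
               CAST(julianday(CreationDate) AS INTEGER)
               - ROW_NUMBER() OVER (
                     PARTITION BY product_id ORDER BY CreationDate
                 ) AS grp
        FROM t1
    )
    SELECT product_id, CreationDate,
           MIN(CreationDate) OVER (PARTITION BY product_id, grp)
               AS LastExclusionDate
    FROM cte
    ORDER BY product_id, CreationDate
    """
).fetchall()

for row in result:
    print(row)
```

The output matches the LastExclusionDate column the question asks for: each row gets the first date of its unbroken run.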

For ongoing daily insertions it can be done with something like this.
INSERT INTO <yourTable>
SELECT
newProduct.[product_id],
newProduct.[creationDate],
isnull(existingProduct.[lastExclusionDate], newProduct.[creationDate]) AS [lastExclusionDate]
FROM
(SELECT <#product_id> AS [product_id], <#creationDate> AS [creationDate]) AS newProduct
LEFT JOIN #temp existingProduct
ON existingProduct.[product_id] = newProduct.product_id
AND existingProduct.[creationDate] = DATEADD(DAY,-1,newProduct.[creationDate])
I've got a demo here: http://rextester.com/BDEO23118. The demo is larger than necessary because it uses the code above with the data you provided to populate a table row by row, as you might in a daily update process. It then does individual insertions with some new dates so you can see how it handles new ranges. (FYI: rextester displays result dates in day.month.year hh:mm:ss format, but you can paste the script into Management Studio and it will output in DATE format.)
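The daily-insert logic (carry yesterday's LastExclusionDate forward, or start a new range) can be sketched with SQLite from Python. The helper name `insert_exclusion` and the table name `t1` are hypothetical, and SQLite's `date(x, '-1 day')` replaces `DATEADD(DAY, -1, x)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 ("
    "product_id INTEGER, creationDate TEXT, lastExclusionDate TEXT)"
)


def insert_exclusion(product_id, creation_date):
    """Hypothetical helper: insert one day's row for an excluded product."""
    # If the product has a row for the previous day, reuse that row's
    # lastExclusionDate; otherwise the new range starts today.
    conn.execute(
        """
        INSERT INTO t1 (product_id, creationDate, lastExclusionDate)
        SELECT n.product_id,
               n.creationDate,
               COALESCE(e.lastExclusionDate, n.creationDate)
        FROM (SELECT ? AS product_id, ? AS creationDate) AS n
        LEFT JOIN t1 AS e
               ON e.product_id = n.product_id
              AND e.creationDate = date(n.creationDate, '-1 day')
        """,
        (product_id, creation_date),
    )


for day in ["2018-05-01", "2018-05-02", "2018-05-03", "2018-06-01", "2018-06-02"]:
    insert_exclusion(100, day)

inserted = conn.execute(
    "SELECT creationDate, lastExclusionDate FROM t1 ORDER BY creationDate"
).fetchall()
print(inserted)
```

This relies on rows being inserted in date order, one per product per day, as the question describes.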

TSQL - UNPIVOT from Excel imported data

I have an Excel spreadsheet that imports into a table like so:
+-------------------------------------------------------------------------+
| Col1 Col2 Col3 Col4 Col5 |
+-------------------------------------------------------------------------+
| Ref Name 01-01-2013 02-01-2013 03-01-2013 |
| 1 John 500 550 600 |
| 2 Fred 600 650 400 |
| 3 Paul 700 750 550 |
| 4 Steve 800 850 700 |
+-------------------------------------------------------------------------+
My goal is to change it to look like this:
+-------------------------------------------------------------------------+
| Ref Name Date Sales |
+-------------------------------------------------------------------------+
| 1 John 01-01-2013 500 |
| 1 John 02-01-2013 550 |
| 1 John 03-01-2013 600 |
| 2 Fred 01-01-2013 600 |
| ..... |
+-------------------------------------------------------------------------+
So far I figured out how to use UNPIVOT to get the dates and sales numbers into 1 column but that doesn't solve the problem of breaking the dates out into their own column. Any help is appreciated. Thanks!!

You could use two separate UNPIVOT queries and then join them. The first subquery unpivots the header row (the one with the Ref value in col1) to get the dates, and the second unpivots the remaining rows to get the sales. You then join the subqueries on the original column names:
select s.col1,
s.col2,
d.value date,
s.value sales
from
(
select col1, col2, col, value
from yt
unpivot
(
value
for col in (col3, col4, col5)
) un
where col1 = 'ref'
) d
inner join
(
select col1, col2, col, value
from yt
unpivot
(
value
for col in (col3, col4, col5)
) un
where col1 <> 'ref'
) s
on d.col = s.col;
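The "unpivot twice, then join on the column name" idea can be tried in SQLite from Python. SQLite has no UNPIVOT operator, so this sketch swaps in a UNION ALL over the three columns; the table name `yt` follows the answer, and all values are stored as text, as an Excel import with a header row would leave them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yt (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT)"
)
conn.executemany(
    "INSERT INTO yt VALUES (?, ?, ?, ?, ?)",
    [
        ("Ref", "Name", "01-01-2013", "02-01-2013", "03-01-2013"),
        ("1", "John", "500", "550", "600"),
        ("2", "Fred", "600", "650", "400"),
        ("3", "Paul", "700", "750", "550"),
        ("4", "Steve", "800", "850", "700"),
    ],
)

# "un" turns col3..col5 into (col, value) pairs for every row.
# The header row supplies the dates (alias d), the other rows supply
# the sales (alias s); joining on col pairs each sale with its date.
# SQLite compares text case-sensitively by default, hence 'Ref'.
unpivoted = conn.execute(
    """
    WITH un AS (
        SELECT col1, col2, 'col3' AS col, col3 AS value FROM yt
        UNION ALL
        SELECT col1, col2, 'col4', col4 FROM yt
        UNION ALL
        SELECT col1, col2, 'col5', col5 FROM yt
    )
    SELECT s.col1 AS ref, s.col2 AS name, d.value AS date, s.value AS sales
    FROM un AS d
    JOIN un AS s ON d.col = s.col
    WHERE d.col1 = 'Ref' AND s.col1 <> 'Ref'
    ORDER BY CAST(s.col1 AS INTEGER), d.col
    """
).fetchall()

for row in unpivoted[:4]:
    print(row)
```

Each of the four people ends up with three rows, one per date column in the original sheet.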
