SQL Server: select all duplicate rows where col1+col2 exists more than once - sql-server

I have a table with around 300,000 rows. 225 rows have been added to it daily from March 16, 2015 to July 9, 2015.
My problem is that for the last week or so, duplicate rows have been inserted into the table (i.e. more than 225 per day).
Now I want to select (and ultimately delete!) all the duplicate rows, that is, rows where the same siteID + reportID combination occurs more than once for the same Date.
Example is attached in the screenshot:

When ROW_NUMBER() is used with a PARTITION BY clause, it numbers the rows within each group, which lets you pick out duplicate rows in a table.
Please check the SQL tutorial on how to delete duplicate rows in a SQL table.
The query below is adapted from that article and applied to your requirement:
;WITH DUPLICATES AS
(
SELECT *,
RN = ROW_NUMBER() OVER (PARTITION BY siteID, reportID, [Date] ORDER BY [Date]) -- one partition per siteID + reportID + Date
FROM myTable
)
DELETE FROM DUPLICATES WHERE RN > 1
I hope it helps.

To filter duplicated rows, I suggest this type of query:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col3) As seq
FROM yourTable) dt
WHERE (seq > 1)
Applied to your table, it looks like this:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY siteID, reportID, [Date] ORDER BY ID) As seq
FROM yourTable) dt
WHERE (seq > 1)
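Before deleting anything, it can help to preview the duplicates in a scratch database. The following is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for SQL Server (an assumption; it needs an SQLite build with window functions, version 3.25 or later). The sample rows and the `find_duplicates` helper are hypothetical; only the ROW_NUMBER() pattern comes from the answers above.

```python
import sqlite3


def find_duplicates(conn):
    """Return rows that repeat an earlier (siteID, reportID, Date) combination."""
    sql = """
        SELECT ID, siteID, reportID, [Date]
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY siteID, reportID, [Date]
                       ORDER BY ID
                   ) AS seq
            FROM myTable
        ) AS dt
        WHERE seq > 1   -- seq = 1 is the row that would be kept
        ORDER BY ID
    """
    return conn.execute(sql).fetchall()


# Hypothetical sample data: IDs 3 and 5 repeat the combination first seen in ID 2.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable ("
    "ID INTEGER PRIMARY KEY, siteID INTEGER, reportID INTEGER, [Date] TEXT)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [
        (1, 10, 100, "2015-07-08"),
        (2, 10, 100, "2015-07-09"),
        (3, 10, 100, "2015-07-09"),
        (4, 20, 100, "2015-07-09"),
        (5, 10, 100, "2015-07-09"),
    ],
)
duplicates = find_duplicates(conn)
print(duplicates)
```

Row 1 is not reported because it has the same siteID and reportID but a different Date, which is why the Date column belongs in the PARTITION BY list.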

Related

Create a separate table based on select condition query in snowflake

I am using a SELECT query with a condition to remove the duplicates. The query is below:
select * from (
select LOCATIONID, OBSERVATION_TIME_UTC, max(ROW_KEY) ROW_KEY from OLD_TABLE group by LOCATIONID, OBSERVATION_TIME_UTC
)
This returns only 3 columns (LOCATIONID, OBSERVATION_TIME_UTC and ROW_KEY) out of 15.
I want to create a separate table that has all the columns, with the column order unchanged.
I tried the query below:
create or replace table NEW_TABLE as
select * from (
select LOCATIONID, OBSERVATION_TIME_UTC, max(ROW_KEY) ROW_KEY from OLD_TABLE group by LOCATIONID, OBSERVATION_TIME_UTC
)
But the above query produces only 3 columns, whereas I need the new table to hold the data as it is, with all the columns.
Could someone correct my query, please?
QUALIFY can be used to keep the row with the highest ROW_KEY per location and observation time:
-- create or replace table NEW_TABLE as
select *
from OLD_TABLE
qualify row_number() over (partition by LOCATIONID, OBSERVATION_TIME_UTC
order by ROW_KEY desc) = 1
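For engines without QUALIFY, the same filter can be written as a subquery on ROW_NUMBER(). The sketch below uses Python's sqlite3 module purely as a runnable stand-in (an assumption; Snowflake itself is not involved, and SQLite 3.25 or later is needed). The TEMPERATURE column and the sample rows are hypothetical placeholders for the remaining columns of the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE OLD_TABLE ("
    "LOCATIONID INTEGER, OBSERVATION_TIME_UTC TEXT, ROW_KEY INTEGER, TEMPERATURE REAL)"
)
conn.executemany(
    "INSERT INTO OLD_TABLE VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01 00:00", 1, 5.0),
        (1, "2020-01-01 00:00", 2, 5.5),  # same location and time, higher ROW_KEY
        (2, "2020-01-01 00:00", 3, 7.0),
    ],
)

# Rank rows inside each (LOCATIONID, OBSERVATION_TIME_UTC) group and keep rank 1.
# The outer SELECT lists the original columns so the helper column rn is dropped
# and the column order of OLD_TABLE is preserved.
conn.execute(
    """
    CREATE TABLE NEW_TABLE AS
    SELECT LOCATIONID, OBSERVATION_TIME_UTC, ROW_KEY, TEMPERATURE
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY LOCATIONID, OBSERVATION_TIME_UTC
                   ORDER BY ROW_KEY DESC
               ) AS rn
        FROM OLD_TABLE
    ) AS ranked
    WHERE rn = 1
    """
)

cursor = conn.execute("SELECT * FROM NEW_TABLE ORDER BY LOCATIONID")
new_columns = [col[0] for col in cursor.description]
new_rows = cursor.fetchall()
print(new_columns)
print(new_rows)
```

Unlike the GROUP BY version in the question, every column of the surviving row is carried over, because the window function ranks whole rows instead of aggregating them.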

StackExchange Query Help t-sql

Would anybody be able to help me with this exercise? I am used to querying PostgreSQL rather than T-SQL, and I am running into trouble with how some of my data aggregates.
My assignment requires me to:
Create a query that returns the number of comments made on each day for each post from the top 50 most commented on posts in the past year.
For example, the query below gives me a non-aggregated result set:
select cast(creationdate as date),
postid,
count(id)
from comments
where postid = 17654496
group by creationdate, postid
The full schema is here:
https://data.stackexchange.com/stackoverflow/query/edit/898297
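You can use a CTE to get the comment count per post and day. Grouping by cast(creationdate as date) instead of the raw creationdate is what makes the counts aggregate per day.
Then use the ROW_NUMBER window function to number the rows by count in descending order and keep the first 50. Note that this ranks (post, day) pairs, so the result is the 50 busiest post-days of the past year.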
;with CTE as (
select cast(creationdate as date) dt,
postid,
count(id) cnt
from comments
WHERE creationdate between dateadd(year,-1,getdate()) and getdate()
group by cast(creationdate as date), postid
), CTE2 AS (
select *,ROW_NUMBER() OVER (order by cnt desc) rn
from CTE
)
SELECT *
FROM CTE2
WHERE rn <=50
https://data.stackexchange.com/stackoverflow/query/898322/test
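The assignment can also be read as two steps: first pick the most-commented posts, then count comments per day for those posts only. The sketch below illustrates that reading with Python's sqlite3 module as a stand-in for SQL Server (an assumption). The `daily_counts_for_top_posts` helper and the sample rows are hypothetical, and the past-year filter is left out so that the example is deterministic.

```python
import sqlite3


def daily_counts_for_top_posts(conn, top_n):
    """Per-day comment counts, restricted to the top_n posts by total comments."""
    sql = """
        WITH top_posts AS (
            SELECT postid
            FROM comments
            GROUP BY postid
            ORDER BY COUNT(id) DESC, postid   -- postid breaks ties
            LIMIT ?
        )
        SELECT DATE(c.creationdate) AS dt, c.postid, COUNT(c.id) AS cnt
        FROM comments AS c
        JOIN top_posts AS t ON t.postid = c.postid
        GROUP BY DATE(c.creationdate), c.postid
        ORDER BY c.postid, dt
    """
    return conn.execute(sql, (top_n,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, postid INTEGER, creationdate TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, 100, "2021-01-01 10:00:00"),
        (2, 100, "2021-01-01 11:00:00"),
        (3, 100, "2021-01-02 09:00:00"),
        (4, 200, "2021-01-01 12:00:00"),
        (5, 200, "2021-01-03 08:00:00"),
        (6, 300, "2021-01-01 13:00:00"),  # post 300 falls outside the top 2
    ],
)
daily_counts = daily_counts_for_top_posts(conn, 2)
print(daily_counts)
```

In T-SQL the same shape would use TOP (50) in place of LIMIT and cast(creationdate as date) in place of DATE(), plus the date-range filter from the answer above.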

SSIS - Filter duplicate rows

I have a table (Id, ArticleCode, StoreCode, Adress, Number) that contains duplicate entries based only on the columns [ArticleCode, StoreCode].
Currently I can filter duplicate rows using the Aggregate transformation, but the problem is that the output rows have only two columns, [ArticleCode, StoreCode], and I need the other columns as well.
In the OLE DB Source component, use SQL Command as the source instead of a table name, and write the following command:
SELECT [ID]
,[ArticleCode]
,[StoreCode]
,[Address]
,[Number] FROM (
SELECT [ID]
,[ArticleCode]
,[StoreCode]
,[Address]
,[Number]
,ROW_NUMBER() OVER(PARTITION BY [ArticleCode]
,[StoreCode] ORDER BY [ArticleCode]
,[StoreCode]) AS ROWNUM
FROM [dbo].[Table_1]) AS T1
WHERE T1.ROWNUM = 1
To get rid of duplicates and select unique records by [ArticleCode, StoreCode]:
select top 1 with ties
Id ,
ArticleCode ,
StoreCode ,
Adress ,
Number
from
YourTable
order by
row_number() over(partition by ArticleCode, StoreCode order by Id)
But which of the two records should be selected when [ArticleCode, StoreCode] are equal and [Adress, Number] differ?
If Id is auto-incremented, then ORDER BY Id picks the first record entered and ORDER BY Id DESC picks the last.
You have to define somehow which [Adress, Number] pair among the duplicates is the correct one to select.
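The effect of the sort direction can be checked on a small sample. This is a hedged sketch using Python's sqlite3 module in place of SQL Server (an assumption); the `pick_one_per_pair` helper and the sample addresses are hypothetical.

```python
import sqlite3


def pick_one_per_pair(conn, keep_latest=False):
    """Keep one row per (ArticleCode, StoreCode): lowest Id, or highest if keep_latest."""
    direction = "DESC" if keep_latest else "ASC"  # fixed keywords, not user input
    sql = f"""
        SELECT Id, ArticleCode, StoreCode, Adress, Number
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY ArticleCode, StoreCode
                       ORDER BY Id {direction}
                   ) AS rownum
            FROM YourTable
        ) AS t
        WHERE rownum = 1
        ORDER BY Id
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable ("
    "Id INTEGER PRIMARY KEY, ArticleCode TEXT, StoreCode TEXT, Adress TEXT, Number INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A1", "S1", "Old St", 1),
        (2, "A1", "S1", "New St", 2),  # same pair as Id 1, different address
        (3, "A2", "S1", "Main St", 3),
    ],
)
first_entered = pick_one_per_pair(conn)
last_entered = pick_one_per_pair(conn, keep_latest=True)
print(first_entered)
print(last_entered)
```

Ordering by Id makes the choice repeatable, whereas ordering by the partition columns themselves leaves the surviving row up to the engine.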

Remove duplicate lines in sql server

I have a table with the following example format:
ID Name
1 NULL
1 NULL
2 HELLO
3 NULL
3 BYE
My goal is to remove repeated rows with the same ID, but with restrictions.
In the example, I need to remove one of the rows with ID 1, and the row with ID 3 that has no value (NULL).
I would end up with this table:
ID Name
1 NULL
2 HELLO
3 BYE
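How can I do this in SQL Server? Thank you.
To just select the data, you can use a simple CTE (common table expression). Ordering by name DESC ranks non-NULL names first within each id, because SQL Server sorts NULL as the lowest value: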
WITH cte AS (
SELECT id, name,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY name DESC) rn
FROM myTable
)
SELECT id,name FROM cte WHERE rn=1;
An SQLfiddle to test with.
If you mean to delete the duplicates from the table rather than just select the de-duplicated data, you can use the same CTE:
WITH cte AS (
SELECT id, name,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY name DESC) rn
FROM myTable
)
DELETE FROM cte WHERE rn<>1;
Another SQLfiddle to test with, and remember to always back up your data before running destructive SQL statements from random people on the Internet.
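The example from the question can be replayed end to end in a throwaway database. The sketch below uses Python's sqlite3 module as a stand-in (an assumption). SQLite cannot delete through a CTE the way SQL Server can, so the delete goes through the implicit rowid instead; that swap is specific to this sketch. SQLite also sorts NULL lowest, so ORDER BY name DESC behaves the same way here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [(1, None), (1, None), (2, "HELLO"), (3, None), (3, "BYE")],
)

# Rank rows per id with non-NULL names first, then delete everything but rank 1.
conn.execute(
    """
    DELETE FROM myTable
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY name DESC) AS rn
            FROM myTable
        ) AS ranked
        WHERE rn <> 1
    )
    """
)
remaining = conn.execute("SELECT id, name FROM myTable ORDER BY id").fetchall()
print(remaining)
```

For id 1 both rows are NULL, so either one may survive; for id 3 the row with a name is always the one kept.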

How to delete duplicates from a table and keeping only the one with highest id in sql server?

I have a table with a unique ID and some other fields. I would like to delete all the duplicate rows and keep only one, the one with the highest ID.
For example, assume a table with 3 fields: RECORD_ID, FIELD_ONE, FIELD_TWO.
Which query deletes all records that have the same values for FIELD_ONE and FIELD_TWO, except the one with the highest RECORD_ID?
Found:
with cte
as
(
select *, row_number() over(partition by FIELD_ONE, FIELD_TWO order by RECORD_ID desc) RowNumber
from TestTable
)
delete cte
where RowNumber > 1
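A different way to get the same result, shown here only as an alternative and not as the method above, is to delete every row whose RECORD_ID is not the maximum of its group. The sketch runs in Python's sqlite3 module as a stand-in for SQL Server (an assumption), with hypothetical sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable ("
    "RECORD_ID INTEGER PRIMARY KEY, FIELD_ONE TEXT, FIELD_TWO TEXT)"
)
conn.executemany(
    "INSERT INTO TestTable VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x")],
)

# Keep only the highest RECORD_ID of each (FIELD_ONE, FIELD_TWO) group.
conn.execute(
    """
    DELETE FROM TestTable
    WHERE RECORD_ID NOT IN (
        SELECT MAX(RECORD_ID)
        FROM TestTable
        GROUP BY FIELD_ONE, FIELD_TWO
    )
    """
)
kept = conn.execute("SELECT * FROM TestTable ORDER BY RECORD_ID").fetchall()
print(kept)
```

This form relies on RECORD_ID being unique and never NULL; the ROW_NUMBER() version does not need that guarantee.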
