How to create RowNum column in SQL Server? - sql-server

In Oracle we have "rownum".
What can I do in SQL Server?

In SQL Server 2005 (and 2008) you can use the ROW_NUMBER function, coupled with the OVER clause to determine the order in which the rows should be counted.
Update
Hmm. I don't actually know what the Oracle version does. If it's giving you a unique number per row (across the entire table), then I'm not sure there's a way to do that in SQL Server. SQL Server's ROW_NUMBER() only works for the rows returned in the current query.

If you have an id column, you can do this:
select a.*,
(select count(*) from mytable b where b.id <= a.id) as rownum
from mytable a
order by id;
Of course, this only works where you're able to order rownums in the same (or opposite) order as the order of the ids.
If you're selecting a proper subset of rows, of course you need to apply the same predicate to the whole select and to the subquery:
select a.*,
(select count(*) from table b where b.id <= a.id and b.foo = 'X') as rownum
from table a where a.foo = 'X'
order by id;
Obviously, this is not particularly efficient.

Based on my understanding, you'd need to use ranking functions and/or the TOP clause. The SQL Server features are specific, the Oracle one combines the 2 concepts.
The ranking function is simple: here is why you'd use TOP.
Note: you can't WHERE on ROWNUMBER directly...
'Orable:
select
column_1, column_2
from
table_1, table_2
where
field_3 = 'some value'
and rownum < 5
--MSSQL:
select top 4
column_1, column_2
from
table_1, table_2
where
field_3 = 'some value'

Related

SQL duplicates with percentage result

I have an issue where my database contains a table with these columns:
Program_ID, Vehicle_VIN, Vehicle_Type
I need to create a report where will be:
Program_ID, AmountOfAllPoliciesInProgram, PercentageDuplicatesInProgram
where Vehicle_type = 10
and criteria for the duplicate is to have Vehicle_VIN more than 1 unique time in the table for dedicated Program_ID.
It's in Microsoft SQL Server Management Studio
AmountOfAllPoliciesInProgram is:
SELECT
PROGRAM_ID, COUNT(*) AS AmountOfAllPoliciesInProgram
FROM
dbo.table
WHERE
Vehicle_type = 10
GROUP BY
PROGRAM_ID
I'm not 100% sure I understand you correctly, but you can count distinct Vehicle_VIN and use the having clause to test if the count of distinct Vehicle_VIN is larger than 1. Like so,
SELECT PROGRAM_ID, COUNT(*) as AmountOfAllPoliciesInProgram, count(distinct Vehicle_VIN) as VIN_COUNT
FROM dbo.table
where Vehicle_type = 10
group by PROGRAM_ID
having count(distinct Vehicle_VIN)>1

SQL Server: Find duplicates using group by and having count than delete them all but not first [duplicate]

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

Group by working in SQL Server 2000 but not in SQL Server 2005

I used following query in SQL Server 2000
SELECT
U.FirstName,
SUM(VE.Score)AS Score, SUM(VE.QuizTime) AS Time,
SUM(VE.IsQuizType) AS QuizesAttempted,
SUM(VE.IsProgrammingType) AS ProgrammingProblemsAttempted
from
Users U INNER JOIN VirtualExercise VE on U.UserID=VE.UserID
where U.UserID IN( 10,11 ) AND ProgramID = 2
group by U.FirstName
order by VE.Score desc
It working fine in SQL Server 2000 but not working in SQL Server 2005.
Gives following error:
Column "VirtualExercise.Score" is
invalid in the ORDER BY clause because
it is not contained in either an
aggregate function or the GROUP BY
clause. --- Inner Exception
Please help...
SQL Server 2005 is correct: you are trying to order by a value that doesn't exist in the output because you must either GROUP on it or aggregate it. VE.Score is neither. SQL Server 2000 allowed some ambiguities like this (edit: see "ORDER BY clause" at http://msdn.microsoft.com/en-us/library/ms143359(SQL.90).aspx for info on this)
I assume you mean
ORDER BY SUM(VE.Score) DESC
SQL Server 2000 is kind of dumb where it comes to table and column aliases in the result set. Provided an explicit alias column is introduced (e.g. Score in your case, Col3 in the example below), it doesn't matter what table alias you use in the ORDER BY clause:
create table #T1 (
ID int not null primary key,
Col1 varchar(10) not null
)
go
create table #T2 (
ID int not null primary key,
Col2 varchar(10) not null
)
go
insert into #T1 (ID,Col1)
select 1,'abc' union all
select 2,'def'
go
insert into #T2 (ID,Col2)
select 1,'zyx' union all
select 2,'wvu'
go
select *,1 as Col3 from #T1 t1 inner join #T2 t2 on t1.ID = t2.ID order by t2.Col3
Since you presumably want to sort by the computed Score column, just remove the VE. prefix.
Since you have defined SUM(VE.Score)AS Score in select , do you mean ORDER BY Score DESC

SQL Error with Order By in Subquery

I'm working with SQL Server 2005.
My query is:
SELECT (
SELECT COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
And the error:
The ORDER BY clause is invalid in views, inline functions, derived
tables, subqueries, and common table expressions, unless TOP or FOR
XML is also specified.
How can I use ORDER BY in a sub query?
This is the error you get (emphasis mine):
The ORDER BY clause is invalid in
views, inline functions, derived
tables, subqueries, and common table
expressions, unless TOP or FOR XML is
also specified.
So, how can you avoid the error? By specifying TOP, would be one possibility, I guess.
SELECT (
SELECT TOP 100 PERCENT
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
If you're working with SQL Server 2012 or later, this is now easy to fix. Add an offset 0 rows:
SELECT (
SELECT
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id OFFSET 0 ROWS
) as dorduncuay
Besides the fact that order by doesn't seem to make sense in your query....
To use order by in a sub select you will need to use TOP 2147483647.
SELECT (
SELECT TOP 2147483647
COUNT(1) FROM Seanslar WHERE MONTH(tarihi) = 4
GROUP BY refKlinik_id
ORDER BY refKlinik_id
) as dorduncuay
My understanding is that "TOP 100 PERCENT" doesn't gurantee ordering anymore starting with SQL 2005:
In SQL Server 2005, the ORDER BY
clause in a view definition is used
only to determine the rows that are
returned by the TOP clause. The ORDER
BY clause does not guarantee ordered
results when the view is queried,
unless ORDER BY is also specified in
the query itself.
See SQL Server 2005 breaking changes
Hope this helps,
Patrick
If building a temp table, move the ORDER BY clause from inside the temp table code block to the outside.
Not allowed:
SELECT * FROM (
SELECT A FROM Y
ORDER BY Y.A
) X;
Allowed:
SELECT * FROM (
SELECT A FROM Y
) X
ORDER BY X.A;
You don't need order by in your sub query. Move it out into the main query, and include the column you want to order by in the subquery.
however, your query is just returning a count, so I don't see the point of the order by.
A subquery (nested view) as you have it returns a dataset that you can then order in your calling query. Ordering the subquery itself will make no (reliable) difference to the order of the results in your calling query.
As for your SQL itself:
a) I seen no reason for an order by as you are returning a single value.
b) I see no reason for the sub query anyway as you are only returning a single value.
I'm guessing there is a lot more information here that you might want to tell us in order to fix the problem you have.
Add the Top command to your sub query...
SELECT
(
SELECT TOP 100 PERCENT
COUNT(1)
FROM
Seanslar
WHERE
MONTH(tarihi) = 4
GROUP BY
refKlinik_id
ORDER BY
refKlinik_id
) as dorduncuay
:)
maybe this trick will help somebody
SELECT
[id],
[code],
[created_at]
FROM
( SELECT
[id],
[code],
[created_at],
(ROW_NUMBER() OVER (
ORDER BY
created_at DESC)) AS Row
FROM
[Code_tbl]
WHERE
[created_at] BETWEEN '2009-11-17 00:00:01' AND '2010-11-17 23:59:59'
) Rows
WHERE
Row BETWEEN 10 AND 20;
here inner subquery ordered by field created_at (could be any from your table)
In this example ordering adds no information - the COUNT of a set is the same whatever order it is in!
If you were selecting something that did depend on order, you would need to do one of the things the error message tells you - use TOP or FOR XML
Try moving the order by clause outside sub select and add the order by field in sub select
SELECT * FROM
(SELECT COUNT(1) ,refKlinik_id FROM Seanslar WHERE MONTH(tarihi) = 4 GROUP BY refKlinik_id)
as dorduncuay
ORDER BY refKlinik_id
For me this solution works fine as well:
SELECT tbl.a, tbl.b
FROM (SELECT TOP (select count(1) FROM yourtable) a,b FROM yourtable order by a) tbl
Good day
for some guys the order by in the sub-query is questionable.
the order by in sub-query is a must to use if you need to delete some records based on some sorting.
like
delete from someTable Where ID in (select top(1) from sometable where condition order by insertionstamp desc)
so that you can delete the last insertion form table.
there are three way to do this deletion actually.
however, the order by in the sub-query can be used in many cases.
for the deletion methods that uses order by in sub-query review below link
http://web.archive.org/web/20100212155407/http://blogs.msdn.com/sqlcat/archive/2009/05/21/fast-ordered-delete.aspx
i hope it helps. thanks you all
For a simple count like the OP is showing, the Order by isn't strictly needed. If they are using the result of the subquery, it may be. I am working on a similiar issue and got the same error in the following query:
-- I want the rows from the cost table with an updateddate equal to the max updateddate:
SELECT * FROM #Costs Cost
INNER JOIN
(
SELECT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype -- *** This causes an error***
) CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, Costs.costtype
-- *** To accomplish this, there are a few options:
-- Add an extraneous TOP clause, This seems like a bit of a hack:
SELECT * FROM #Costs Cost
INNER JOIN
(
SELECT TOP 99.999999 PERCENT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype
) CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, Costs.costtype
-- **** Create a temp table to order the maxCost
SELECT Entityname, costtype, MAX(updatedtime) MaxUpdatedTime
INTO #MaxCost
FROM #HoldCosts cost
GROUP BY Entityname, costtype
ORDER BY Entityname, costtype
SELECT * FROM #Costs Cost
INNER JOIN #MaxCost CostsMax
ON Costs.Entityname = CostsMax.entityname
AND Costs.Costtype = CostsMax.Costtype
AND Costs.UpdatedTime = CostsMax.MaxUpdatedtime
ORDER BY Costs.Entityname, costs.costtype
Other possible workarounds could be CTE's or table variables. But each situation requires you to determine what works best for you. I tend to look first towards a temp table. To me, it is clear and straightforward. YMMV.
On possible needs to order a subquery is when you have a UNION :
You generate a call book of all teachers and students.
SELECT name, phone FROM teachers
UNION
SELECT name, phone FROM students
You want to display it with all teachers first, followed by all students, both ordered by. So you cant apply a global order by.
One solution is to include a key to force a first order by, and then order the names :
SELECT name, phone, 1 AS orderkey FROM teachers
UNION
SELECT name, phone, 2 AS orderkey FROM students
ORDER BY orderkey, name
I think its way more clear than fake offsetting subquery result.
I Use This Code To Get Top Second Salary
I am Also Get Error Like
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.
TOP 100 I Used To Avoid The Error
select * from (
select tbl.Coloumn1 ,CONVERT(varchar, ROW_NUMBER() OVER (ORDER BY (SELECT 1))) AS Rowno from (
select top 100 * from Table1
order by Coloumn1 desc) as tbl) as tbl where tbl.Rowno=2

How can I remove duplicate rows?

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col1, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer CTE for deleting duplicate rows from sql server table
strongly recommend to follow this article ::http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
by keeping original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
without keeping original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
I my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep you changes!!
PS: when working on things like this I always use a transaction, this not only ensures everything is executed as a whole, but also allows me to test without risking anything. But off course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
it deleted 1M rows in little more than 30sec from a table of 2M (50% duplicates)
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn1
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one easy to grasp and seems to be effective for most of the similar problems. It is for SQL Server though but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID) _
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why its hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two mysql examples.
In the future are you going to prevent it at a database level, or from an application perspective. I would suggest the database level because your database should be responsible for maintaining referential integrity, developers just will cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
The other way is Create a new table with same fields and with Unique Index. Then move all data from old table to new table. Automatically SQL SERVER ignore (there is also an option about what to do if there will be a duplicate value: ignore, interrupt or sth) duplicate values. So we have the same table without duplicate rows. If you don't want Unique Index, after the transfer data you can drop it.
Especially for larger tables you may use DTS (SSIS package to import/export data) in order to transfer all data rapidly to your new uniquely indexed table. For 7 million row it takes just a few minute.
By useing below query we can able to delete duplicate records based on the single column or multiple column. below query is deleting based on two columns. table name is: testing and column names empno,empname
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE #cnt AS INT
SELECT #cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF #cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
I you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

Resources