Check which all records have duplicate using window function - sql-server

I am using ROW_NUMBER() to find duplicates, encapsulated under CTEs. I am using ROW_NUMBER() because I also want to have column which shows how many duplicate rows are present in the table.
The below code gives only records greater than 1. That is row number 2 and 3. But how can I include row number 1 of duplicate records?
If I removed T>1, than output also contain records which does not have duplicate like records with Part "'0020R5',100".
DDL:
DROP TABLE #TEST
CREATE TABLE #TEST
(
PART VARCHAR(30),
ALTPART int
)
INSERT #TEST
SELECT '15-AB78',100 UNION ALL
SELECT '15-AB78',110 UNION ALL
SELECT '16-A9-1',100 UNION ALL
SELECT '16-A9-1',110 UNION ALL
SELECT '16-B97-2',100 UNION ALL
SELECT '16-B97-2',110 UNION ALL
SELECT '0020R5',100
Query:
WITH TEST(PART,ALTPART,T) AS
(
SELECT PART,ALTPART,ROW_NUMBER() OVER (PARTITION BY PART ORDER BY ALTPART ASC) AS T FROM #TEST
)
SELECT PART,ALTPART,T
FROM TEST
WHERE T>1
ORDER BY PART
GO
Current output:
'15-AB78',110,2
'16-A9-1',110,2
'16-B97-2',110,2
Expected Result:
'15-AB78',100,1
'15-AB78',110,2
'16-A9-1',100,1
'16-A9-1',110,2
'16-B97-2',100,1
'16-B97-2',110,2

You need to add another window function to count the number of duplicates in each group.
Something along these lines:
WITH TEST(PART,ALTPART,DuplicateNum, DuplicateCnt) AS
(
SELECT PART,ALTPART,
-- Number each duplicate
ROW_NUMBER() OVER (PARTITION BY PART ORDER BY ALTPART ASC) AS DuplicateNum,
-- Count duplicates
COUNT() OVER (PARTITION BY PART ) AS DuplicateCnt,
FROM #TEST
)
SELECT PART, ALTPART, DuplicateNum
FROM TEST
WHERE DuplicateCnt > 1
ORDER BY PART

Related

TSQL - De-duplication report - grouping

So I'm trying to create a report that ranks a duplicate record, the idea behind this is that the customer wants to merge a whole lot of duplicate records that came about from a migration.
I need the ranking so that my report can show which record should be the "main" record, i.e. the record that will have missing data pulled into it.
The duplicate definition is pretty simple:
If the email addresses are the same then it is always a duplicate, if
the emails do not match, then the first name, surname, and mobile must
match.
The ranking will be based on a whole bunch of columns in the table, so:
email address isn't NULL = 50
phone number isn't NULL = 20
etc.. whichever gets the highest number in the duplicate group becomes the main record. This is where I am having issues, I can't seem to find a way to get an incremental number for each duplicate set. This is some of the code I have so far:
( I took out some of the rank columns in the temp table and CTE expression to shorten it )
DECLARE #tmp_Duplicates TABLE (
tmp_personID INT
, tmp_Firstname NVARCHAR(100)
, tmp_Surname NVARCHAR(100)
, tmp_HomeEmail NVARCHAR(300)
, tmp_MobileNumber NVARCHAR(100)
--- Ratings
, tmp_HomeEmail_Rating INT
--- Groupings
, tmp_GroupNumber INT
)
;WITH cteDupes AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personID DESC) AS RND,
ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personId) AS RNA,
p.personID, p.PersonFirstName, p.PersonSurname, p.PersonHomeEMail
, personMobileTelephone
FROM tblCandidate c INNER JOIN tblPerson p ON c.candidateID = p.personID
)
INSERT INTO #tmp_Duplicates
SELECT PersonID, PersonFirstName, PersonSurname, PersonHomeEMail, personMobileTelephone
, 10, RND
FROM cteDupes
WHERE RNA + RND > 2
ORDER BY personID, PersonFirstName, PersonSurname
SELECT * FROM #tmp_Duplicates
This gives me the results I want, but the group number isn't showing how I need it:
What I need is for each group to be an incremental value:

Can the below query be rebuilt using setbased rather than iterative logic

I am preparing one SSRS report and i have only select access in the table.
I have built the query to get the data.But since i am using table variables, and not getting the desired output in report.
Here is my table structure:
This table stores each job run data with below columns:
jobname
instance_id--ever increasing unique number for each run
starttime
endtime
status
I want to show in my report all above columns and I wrote the query like below.
since I want to show latest run irrespective of success or failure, I need to get max(instance_id) into one table and from that base table I am passing all in loop.
I am getting max instance_id along with row number for looping purposes
create table ##table
(
fqjn varchar(1000),
maxi int,
rownum int
)
create table ##tb
(jobname varchar(max),
[status] varchar(100),
duration int
)
insert into ##table
select jobname,max(instance_id) as maxi,row_number() over (order by jobname desc) as rownum from [dbo].[vw_job_hist]
where grp_data_id in (select grp_Data_id from [dbo].[vw_job_data] where grp_name='pcp')
group by jobname
--now loop through table passing instance_id as parameter
declare #rownum int
select top 1 #rownum=rownum from ##table order by rownum
while #rownum is not null
begin
insert into ##tb
select jh.jobname,"job status"=
case jh.completion_status
when 0 then 'Success'
else 'Failed'
end,
"duration"=datediff(minute,jh.started_time,jh.end_time)
from [dbo].[vw_job_] jh
where jh.instance_id in (select maxi from ##table where rownum=#rownum)
set #rownum=(select top 1 rownum from ##table where rownum>#rownum
order by rownum
end
select * from ##tb
drop table ##tb
drop table ##table
I am not getting the desired output in SSRS ,i know if i create the above query as stored proc,i will get the desired results.but this is third party database and we wont get access.
ask:
Can the above query logic be built in single query with out loop/cursor
I tried using recursive cte ,but no further progress
any help/pointers would be much appreciated
Try this:
;WITH OrderedJobInfo AS (
SELECT jobname, completion_status, started_time, end_time, instance_id
, ROW_NUMBER() OVER (PARTITION BY jobname ORDER BY instance_id DESC) AS rownum
FROM [dbo].[vw_job_hist]
WHERE grp_data_id IN (SELECT grp_Data_id FROM [dbo].[vw_job_data] WHERE grp_name='pcp')
)
SELECT o.jobname, o.instance_id
, [job status] = CASE o.completion_status WHEN 0 THEN 'Success' ELSE 'Failed' END
, [duration] = DATEDIFF(MINUTE, o.started_time, o.end_time)
FROM OrderedJobInfo o
WHERE o.rownum = 1;

How to select Top % in T-SQL without using Top clause?

How to select Top 40% from a table without using the Top clause (or Top percent, the assignment is a little ambiguous) ? This question is for T-SQL, SQL Server 2008. I am not allowed to use Top for my assignment.
Thanks.
This is what I've tried but seems complicated. Isn't there an easier way ?
select top (convert (int, (select round (0.4*COUNT(*), 0) from MyTable))) * from MyTable
Try the NTILE function:
;WITH YourCTE AS
(
SELECT
(some columns),
percentile = NTILE(10) OVER(ORDER BY SomeColumn DESC)
FROM
dbo.YourTable
)
SELECT *
FROM YourCTE
WHERE percentile <= 4
The NTILE(10) OVER(....) creates 10 groups of percentages over your data - and thus, the top 40% are the groups no. 1, 2, 3, 4 of that result
Use NTILE
CREATE TABLE #temp(StudentID CHAR(3), Score INT)
INSERT #temp VALUES('S1',75 )
INSERT #temp VALUES('S2',83)
INSERT #temp VALUES('S3',91)
INSERT #temp VALUES('S4',83)
INSERT #temp VALUES('S5',93 )
INSERT #temp VALUES('S6',75 )
INSERT #temp VALUES('S7',83)
INSERT #temp VALUES('S8',91)
INSERT #temp VALUES('S9',83)
INSERT #temp VALUES('S10',93 )
SELECT * FROM (
SELECT NTILE(10) OVER(ORDER BY Score) AS NtileValue,*
FROM #temp) x
WHERE NtileValue <= 4
ORDER BY 1
Interesting enough I blogged about NTILE today: Does anyone use the NTILE() windowing function?
A problem with the NTILE(10) answers given so far is that if the table has 15 rows they will return 8 rows (53%) rather than the correct number to make up 40% (6).
If the number of rows is not evenly divisible by number of buckets the extra rows all go into the first buckets rather than being evenly distributed.
This alternative (borrows SQL Menace's table) avoids that issue.
WITH CTE
AS (SELECT *,
ROW_NUMBER() OVER ( ORDER BY Score) AS RN,
COUNT(*) OVER() AS Cnt
FROM #temp)
SELECT StudentID,
Score
FROM CTE
WHERE RN <= CEILING(0.4 * Cnt )
Using Top t-sql command:
select top 10 [Column_1],
[Column_2] from [Table]
order by [Column_1]
Using Paging method:
select
[Column_1],
[Column_2]
from
(Select ROW_NUMBER() Over (ORDER BY [Column_1]) AS Row,
[Column_1],
[Column_2]
FROM [Table]) as [alias]
WHERE (Row between 0 and 10)
This is finding the top 10 with order by [Column_1]...please note this is using [variable] method of documentation.
If you could provide column names and table names i could write much more beneficial t-sql, for example to find the top 40% you are going to need to do another sub-query to get count of all rows then do division, i'd likely do this as a query before i do the main query.
Calculate and set ROWCOUNT for whatever number of records.
Then execute you query for the limited set.
declare #rc as integer
select #rc = count(*)*0.40 from CTE
Set ROWCOUNT #rc
select * from CTE
ROWCOUNT is not deprecated yet - see http://msdn.microsoft.com/en-us/library/ms188774.aspx

Anyway to get a value similar to ##ROWCOUNT when TOP is used?

If I have a SQL statement such as:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
PRINT ##ROWCOUNT
-- shows '5'
Is there anyway to get a value like ##ROWCOUNT that is the actual count of all of the rows that match the query without re-issuing the query again sans the TOP 5?
The actual problem is a much more complex and intensive query that performs beautifully since we can use TOP n or SET ROWCOUNT n but then we cannot get a total count which is required to display paging information in the UI correctly. Presently we have to re-issue the query with a #Count = COUNT(ID) instead of *.
Whilst this doesn't exactly meet your requirement (in that the total count isn't returned as a variable), it can be done in a single statement:
;WITH rowCTE
AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY ID DESC) AS rn1
,ROW_NUMBER() OVER (ORDER BY ID ASC) AS rn2
FROM Person
WHERE Name LIKE 'Sm%'
)
SELECT *
,(rn1 + rn2) - 1 as totalCount
FROM rowCTE
WHERE rn1 <=5
The totalCount column will have the total number of rows matching the where filter.
It would be interesting to see how this stacks up performance-wise against two queries on a decent-sized data-set.
you'll have to run another COUNT() query:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
DECLARE #r int
SELECT
#r=COUNT(*)
FROM Person
WHERE Name LIKE 'Sm%'
select #r
Something like this may do it:
SELECT TOP 5
*
FROM Person
cross join (select count(*) HowMany
from Person
WHERE Name LIKE 'Sm%') tot
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
The subquery returns one row with one column containing the full count; the cross join includes it with all rows returned by the "main" query"; and "SELECT *" would include new column HowMany.
Depending on your needs, the next step might be to filter out that column from your return set. One way would be to load the data from the query into a temp table, and then return just the desired columns, and get rowcount from the HowMany column from any row.

How do I select last 5 rows in a table without sorting?

I want to select the last 5 records from a table in SQL Server without arranging the table in ascending or descending order.
This is just about the most bizarre query I've ever written, but I'm pretty sure it gets the "last 5" rows from a table without ordering:
select *
from issues
where issueid not in (
select top (
(select count(*) from issues) - 5
) issueid
from issues
)
Note that this makes use of SQL Server 2005's ability to pass a value into the "top" clause - it doesn't work on SQL Server 2000.
Suppose you have an index on id, this will be lightning fast:
SELECT * FROM [MyTable] WHERE [id] > (SELECT MAX([id]) - 5 FROM [MyTable])
The way your question is phrased makes it sound like you think you have to physically resort the data in the table in order to get it back in the order you want. If so, this is not the case, the ORDER BY clause exists for this purpose. The physical order in which the records are stored remains unchanged when using ORDER BY. The records are sorted in memory (or in temporary disk space) before they are returned.
Note that the order that records get returned is not guaranteed without using an ORDER BY clause. So, while any of the the suggestions here may work, there is no reason to think they will continue to work, nor can you prove that they work in all cases with your current database. This is by design - I am assuming it is to give the database engine the freedom do as it will with the records in order to obtain best performance in the case where there is no explicit order specified.
Assuming you wanted the last 5 records sorted by the field Name in ascending order, you could do something like this, which should work in either SQL 2000 or 2005:
select Name
from (
select top 5 Name
from MyTable
order by Name desc
) a
order by Name asc
You need to count number of rows inside table ( say we have 12 rows )
then subtract 5 rows from them ( we are now in 7 )
select * where index_column > 7
select * from users
where user_id >
( (select COUNT(*) from users) - 5)
you can order them ASC or DESC
But when using this code
select TOP 5 from users order by user_id DESC
it will not be ordered easily.
select * from table limit 5 offset (select count(*) from table) - 5;
Without an order, this is impossible. What defines the "bottom"? The following will select 5 rows according to how they are stored in the database.
SELECT TOP 5 * FROM [TableName]
Well, the "last five rows" are actually the last five rows depending on your clustered index. Your clustered index, by definition, is the way that he rows are ordered. So you really can't get the "last five rows" without some order. You can, however, get the last five rows as it pertains to the clustered index.
SELECT TOP 5 * FROM MyTable
ORDER BY MyCLusteredIndexColumn1, MyCLusteredIndexColumnq, ..., MyCLusteredIndexColumnN DESC
Search 5 records from last records you can use this,
SELECT *
FROM Table Name
WHERE ID <= IDENT_CURRENT('Table Name')
AND ID >= IDENT_CURRENT('Table Name') - 5
If you know how many rows there will be in total you can use the ROW_NUMBER() function.
Here's an examble from MSDN (http://msdn.microsoft.com/en-us/library/ms186734.aspx)
USE AdventureWorks;
GO
WITH OrderedOrders AS
(
SELECT SalesOrderID, OrderDate,
ROW_NUMBER() OVER (ORDER BY OrderDate) AS 'RowNumber'
FROM Sales.SalesOrderHeader
)
SELECT *
FROM OrderedOrders
WHERE RowNumber BETWEEN 50 AND 60;
In SQL Server 2012 you can do this :
Declare #Count1 int ;
Select #Count1 = Count(*)
FROM [Log] AS L
SELECT
*
FROM [Log] AS L
ORDER BY L.id
OFFSET #Count - 5 ROWS
FETCH NEXT 5 ROWS ONLY;
Try this, if you don't have a primary key or identical column:
select [Stu_Id],[Student_Name] ,[City] ,[Registered],
RowNum = row_number() OVER (ORDER BY (SELECT 0))
from student
ORDER BY RowNum desc
You can retrieve them from memory.
So first you get the rows in a DataSet, and then get the last 5 out of the DataSet.
There is a handy trick that works in some databases for ordering in database order,
SELECT * FROM TableName ORDER BY true
Apparently, this can work in conjunction with any of the other suggestions posted here to leave the results in "order they came out of the database" order, which in some databases, is the order they were last modified in.
select *
from table
order by empno(primary key) desc
fetch first 5 rows only
Last 5 rows retrieve in mysql
This query working perfectly
SELECT * FROM (SELECT * FROM recharge ORDER BY sno DESC LIMIT 5)sub ORDER BY sno ASC
or
select sno from(select sno from recharge order by sno desc limit 5) as t where t.sno order by t.sno asc
When number of rows in table is less than 5 the answers of Matt Hamilton and msuvajac is Incorrect.
Because a TOP N rowcount value may not be negative.
A great example can be found Here.
i am using this code:
select * from tweets where placeID = '$placeID' and id > (
(select count(*) from tweets where placeID = '$placeID')-2)
In SQL Server, it does not seem possible without using ordering in the query.
This is what I have used.
SELECT *
FROM
(
SELECT TOP 5 *
FROM [MyTable]
ORDER BY Id DESC /*Primary Key*/
) AS T
ORDER BY T.Id ASC; /*Primary Key*/
DECLARE #MYVAR NVARCHAR(100)
DECLARE #step int
SET #step = 0;
DECLARE MYTESTCURSOR CURSOR
DYNAMIC
FOR
SELECT col FROM [dbo].[table]
OPEN MYTESTCURSOR
FETCH LAST FROM MYTESTCURSOR INTO #MYVAR
print #MYVAR;
WHILE #step < 10
BEGIN
FETCH PRIOR FROM MYTESTCURSOR INTO #MYVAR
print #MYVAR;
SET #step = #step + 1;
END
CLOSE MYTESTCURSOR
DEALLOCATE MYTESTCURSOR
Thanks to #Apps Tawale , Based on his answer, here's a bit of another (my) version,
To select last 5 records without an identity column,
select top 5 *,
RowNum = row_number() OVER (ORDER BY (SELECT 0))
from [dbo].[ViewEmployeeMaster]
ORDER BY RowNum desc
Nevertheless, it has an order by, but on RowNum :)
Note(1): The above query will reverse the order of what we get when we run the main select query.
So to maintain the order, we can slightly go like:
select *, RowNum2 = row_number() OVER (ORDER BY (SELECT 0))
from (
select top 5 *, RowNum = row_number() OVER (ORDER BY (SELECT 0))
from [dbo].[ViewEmployeeMaster]
ORDER BY RowNum desc
) as t1
order by RowNum2 desc
Note(2): Without an identity column, the query takes a bit of time in case of large data
Get the count of that table
select count(*) from TABLE
select top count * from TABLE where 'primary key row' NOT IN (select top (count-5) 'primary key row' from TABLE)
If you do not want to arrange the table in ascending or descending order. Use this.
select * from table limit 5 offset (select count(*) from table) - 5;

Resources