TSQL - De-duplication report - grouping - sql-server

I'm trying to create a report that ranks duplicate records. The customer wants to merge a large number of duplicate records that came about from a migration.
I need the ranking so that my report can show which record should be the "main" record, i.e. the record that will have missing data pulled into it.
The duplicate definition is pretty simple:
If the email addresses are the same, it is always a duplicate; if the emails do not match, then the first name, surname, and mobile number must all match.
The ranking will be based on a whole bunch of columns in the table, so:
email address isn't NULL = 50
phone number isn't NULL = 20
etc. Whichever record gets the highest total in its duplicate group becomes the main record. This is where I am having issues: I can't seem to find a way to get an incremental number for each duplicate set. This is some of the code I have so far:
( I took out some of the rank columns in the temp table and CTE expression to shorten it )
DECLARE @tmp_Duplicates TABLE (
    tmp_personID INT
    , tmp_Firstname NVARCHAR(100)
    , tmp_Surname NVARCHAR(100)
    , tmp_HomeEmail NVARCHAR(300)
    , tmp_MobileNumber NVARCHAR(100)
    --- Ratings
    , tmp_HomeEmail_Rating INT
    --- Groupings
    , tmp_GroupNumber INT
)
;WITH cteDupes AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personID DESC) AS RND,
    ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personId) AS RNA,
    p.personID, p.PersonFirstName, p.PersonSurname, p.PersonHomeEMail
    , personMobileTelephone
    FROM tblCandidate c INNER JOIN tblPerson p ON c.candidateID = p.personID
)
INSERT INTO @tmp_Duplicates
SELECT PersonID, PersonFirstName, PersonSurname, PersonHomeEMail, personMobileTelephone
, 10, RND
FROM cteDupes
WHERE RNA + RND > 2
ORDER BY personID, PersonFirstName, PersonSurname
SELECT * FROM @tmp_Duplicates
This gives me the results I want, but the group number isn't showing how I need it:
What I need is for each group to be an incremental value:

Related

How can I access a specific field in a named subquery when the field name might not be unique?

I am trying to create a routine that can accept an SQL query as a string and the [table].[primaryKey] of the primary record in the returned dataset, then wrap that original query to implement pagination (return records 40-49 when requesting page 4 and 10 records per page).
The dataset returned by the original queries will frequently contain multiple instances of the primary record, one for each occurrence of supporting records. For the example provided, if a customer has three phone numbers on record the results for that customer in the original query would look like:
{5; John Smith; 205 W. Fort St; 17; Home; 123-123-4587}
{5; John Smith; 205 W. Fort St; 18; Work; 123-123-8547}
{5; John Smith; 205 W. Fort St; 19; Mobile; 123-123-1147}
I'm almost there, I think, with the following query:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT [Customer].[Id],
    [Customer].[Name],
    [Customer].[Address],
    [Phone].[Id],
    [Phone].[Type],
    [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[{Customer.Id}] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[{Customer.Id}]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
This solution performs a SELECT DISTINCT on the primary key for the Primary (Customer) record and uses the SQL routine Row_Number() then joins the result with the results of the original query such that each unique primary (customer) record is numbered 1 - {end of file}, and I can pull only the RowNumber counts that I want.
But because OriginalQuery may have multiple fields named Id (from different tables), I can't figure out how to properly access [Customer].[Id] in my SELECT DISTINCT clause of [RowNumberQuery] or in the INNER JOIN.
Is there a better way to implement pagination at the SQL level, or a more direct method of accessing the field I need from within the subquery based on the table to which it belongs?
EDIT:
I've caused confusion in the pagination I am looking for. I am using Dapper in C# to compile the resulting dataset into individual complex objects, so the goal in the example would be to retrieve customers 31-40 in the list regardless of how many individual records exist for each customer. If Customer 31 had five phone records, Customer 32 had three phone records, Customer 33 had 1 phone record, and the remaining seven customers had two phone records each, I would expect the resulting dataset to contain 23 records total, but only 10 distinct customers.
SOLUTION
Thank you for all of the assistance, and I apologize for those areas I should have clarified sooner. I am creating a toolset that will allow C# Data Access Libraries to implement a set of standard parameters. If I have an option to implement the pagination in an internal function that can accept the SQL statement, I can defer to the toolset and not have to remember (or count on others to remember) to add the appropriate text each time. I'll set it up to return the finished objects, but if I were going to just modify the original query string it would look like:
public static string AddPagination(string sql, string primaryKey, Parameter requestParameters)
{
return $"WITH OriginalQuery AS ({sql.Replace("SELECT ", $"SELECT DENSE_RANK() OVER (ORDER BY {primaryKey}) AS PrimaryRecordCount, ",StringComparison.OrdinalIgnoreCase)}) " +
$"SELECT TOP ({requestParameters.MaxRecords}) * " +
$"FROM OriginalQuery " +
$"WHERE PrimaryRecordCount >= 1 + (({requestParameters.PageNumber - 1}) * {requestParameters.RecordsPerPage})" +
$" AND PrimaryRecordCount <= {requestParameters.Page} * {requestParameters.Limit}";
}
Just give your columns a different alias in your original query, e.g. [Customer].[Id] AS CustomerId, [Phone].[Id] AS PhoneId..., then you can reference OriginalQuery.CustomerId, or OriginalQuery.PhoneId
e.g.
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT [Customer].[Id] AS CustomerId,
    [Customer].[Name],
    [Customer].[Address],
    [Phone].[Id] AS PhoneId,
    [Phone].[Type],
    [Phone].[Number]
    FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
SELECT [WrappedQuery].[RowNumber], [OriginalQuery].* FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) [RowNumber], *
    FROM (
        SELECT DISTINCT [OriginalQuery].[CustomerId] [PrimaryKey]
        FROM [OriginalQuery]
    ) [RowNumberQuery]
) [WrappedQuery]
INNER JOIN [OriginalQuery] ON [WrappedQuery].[PrimaryKey] = [OriginalQuery].[CustomerId]
WHERE [WrappedQuery].[RowNumber] >= @PageNumber
AND [WrappedQuery].[RowNumber] < @PageNumber + @RecordsPerPage
It's worth noting that your paging logic is wrong too. Currently you are adding the page number to the number of records per page, so you are searching for:
Page 1: Customers 1 - 10
Page 2: Customers 2 - 11
Page 3: Customers 3 - 12
Your logic should be:
WHERE [WrappedQuery].[RowNumber] >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
AND [WrappedQuery].[RowNumber] <= (@PageNumber * @RecordsPerPage)
Page 1: Customers 1 - 10
Page 2: Customers 11 - 20
Page 3: Customers 21 - 30
With that being said, you could just use DENSE_RANK() rather than ROW_NUMBER(), which would simplify everything. I think this would give you the same result:
DECLARE @PageNumber int = 4;
DECLARE @RecordsPerPage int = 10;
WITH OriginalQuery AS (
    SELECT c.Id AS CustomerId,
    c.Name,
    c.Address,
    p.Id AS PhoneId,
    p.Type,
    p.Number,
    DENSE_RANK() OVER(ORDER BY c.Id) AS RowNumber
    FROM Customer AS c INNER JOIN Phone AS p ON c.Id = p.CustomerId
)
SELECT oq.CustomerId, oq.Name, oq.Address, oq.PhoneId, oq.Type, oq.Number
FROM OriginalQuery AS oq
WHERE oq.RowNumber >= 1 + ((@PageNumber - 1) * @RecordsPerPage)
AND oq.RowNumber <= (@PageNumber * @RecordsPerPage);
I've added table aliases to try and make the code a bit cleaner, and also removed all the unnecessary square brackets. This is not necessary, but I personally find them quite hard on the eye, and only use them to escape key words.
Another difference is that by adding ORDER BY c.Id you ensure consistent results for your paging. Using ORDER BY (SELECT NULL) implies that you don't care about the order, but you should if you are using it for paging.
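The DENSE_RANK() approach above can be tried out quickly without a SQL Server instance. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in (SQLite 3.25 or later is assumed for window functions), the `customers_page` helper and the sample data are hypothetical, and the Address and Type columns are left out to keep it short.

```python
import sqlite3

def customers_page(conn, page_number, records_per_page):
    """Return every joined row for the customers that fall on the requested page."""
    # Same bounds as in the answer: 1 + (page - 1) * size  ..  page * size
    first = 1 + (page_number - 1) * records_per_page
    last = page_number * records_per_page
    sql = """
        WITH OriginalQuery AS (
            SELECT c.Id AS CustomerId, c.Name, p.Id AS PhoneId, p.Number,
                   DENSE_RANK() OVER (ORDER BY c.Id) AS RowNumber
            FROM Customer AS c
            INNER JOIN Phone AS p ON c.Id = p.CustomerId
        )
        SELECT CustomerId, Name, PhoneId, Number
        FROM OriginalQuery
        WHERE RowNumber BETWEEN ? AND ?
        ORDER BY CustomerId, PhoneId
    """
    return conn.execute(sql, (first, last)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Phone (Id INTEGER PRIMARY KEY, CustomerId INTEGER, Number TEXT);
""")
# Hypothetical data: customers 1..5, where customer n has n phone numbers
for cid in range(1, 6):
    conn.execute("INSERT INTO Customer VALUES (?, ?)", (cid, f"Customer {cid}"))
    for k in range(cid):
        conn.execute(
            "INSERT INTO Phone (CustomerId, Number) VALUES (?, ?)",
            (cid, f"555-{cid}{k:03d}"),
        )

# Page 2 with 2 customers per page covers customers 3 and 4 (3 + 4 = 7 rows)
page2 = customers_page(conn, page_number=2, records_per_page=2)
print(len(page2), sorted({row[0] for row in page2}))
```

Because every row of the same customer shares one DENSE_RANK value, the page boundaries always fall between customers and never split one customer's phone rows across two pages.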
There are many concerns with what you are trying to do and you might be better off explaining why you are trying to make this process.
SQL query as a string
You are receiving a SQL query as a string, how are you parsing that string into the OriginalQuery CTE? This has both concerns about sql injection and concerns about global temp tables if you are using those.
Secondly, your example isn't doing pagination as it is commonly understood. If someone were to request page 1 with 10 records per page, the calling application would expect to receive the first 10 records of the result set, but your example will return all records for the first 10 customers. That means the result could contain 40+ rows if each customer had 4 phone numbers.
You should take a look at OFFSET and FETCH NEXT, as well as why this requirement to parse an arbitrary SQL string. There is probably a better way to do that.
Here is a rough example using OFFSET and FETCH NEXT from a static query, and returning only @RecordsPerPage number of records.
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
[Customer].[Name],
[Customer].[Address],
[Phone].[Id],
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
ORDER BY [Customer].[Id]
OFFSET (@PageNumber-1)*@RecordsPerPage ROWS
FETCH NEXT @RecordsPerPage ROWS ONLY
If you wanted to return all records for the first @RecordsPerPage customers on the requested page that have a corresponding phone number, then it would be something like...
DECLARE @PageNumber int = 1;
DECLARE @RecordsPerPage int = 10;
SELECT [Customer].[Id],
[Customer].[Name],
[Customer].[Address],
[Phone].[Id],
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
WHERE Customer.ID IN (
    SELECT DISTINCT Customer.ID FROM Customer INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
    ORDER BY [Customer].[Id]
    OFFSET (@PageNumber-1)*@RecordsPerPage ROWS
    FETCH NEXT @RecordsPerPage ROWS ONLY
)
This does leave a question, what is the point of this query when the calling application can just use their own OFFSET and FETCH NEXT? They already have the SQL to generate the initial dataset, all they need to do is add OFFSET / FETCH NEXT to the end of it and they have their own pagination without trying to wrap it in a procedure of some sort.
To create a comparison, would you create a stored procedure that accepts a SQL string and then filters specific fields by specific values? Or would the people calling that stored procedure just add a Where clause to their own queries instead?
You can use an alias for each duplicated column name.
For example:
WITH OriginalQuery AS (
SELECT [Customer].[Id] as CustomerID,
[Customer].[Name],
[Customer].[Address],
[Phone].[Id] as PhoneID,
[Phone].[Type],
[Phone].[Number]
FROM [Customer] INNER JOIN [Phone] ON [Customer].[Id] = [Phone].[CustomerId]
)
Now you can refer to the two ids by their aliases in the next query.

SELECT from multiple queries

I have these tables:
tblDiving(
diving_number int primary key
diving_club int
date_of_diving date)
tblDivingClub(
number int primary key not null check (number>0),
name char(30),
country char(30))
tblWorks_for(
diver_number int
club_number int
end_working_date date)
tblCountry(
name char(30) not null primary key)
I need to write a query that returns the name of each country and the number of "super clubs" in it.
A super club is a club which has more than 25 working divers (tblWorks_for.end_working_date is null) or has had more than 100 dives (tblDiving) in the last year.
After I get the country and the number of super clubs, I need to show only the countries that contain more than 2 super clubs.
I wrote these 2 queries:
select tblDivingClub.name,count(distinct tblWorks_for.diver_number) as number_of_guids
from tblWorks_for
inner join tblDivingClub on tblDivingClub.number = tblWorks_for.club_number,tblDiving
where tblWorks_for.end_working_date is null
group by tblDivingClub.name
select tblDivingClub.name, count(distinct tblDiving.diving_number) as number_of_divings
from tblDivingClub
inner join tblDiving on tblDivingClub.number = tblDiving.diving_club
WHERE tblDiving.date_of_diving <= DATEADD(year,-1, GETDATE())
group by tblDivingClub.name
But I don't know how do I continue.
Every query works separately, but how do I combine them and select from them?
It's university assignment and I'm not allowed to use views or temporary tables.
It's my first program so I'm not really sure what I'm doing:)
WITH CTE AS (
select tblDivingClub.name,count(distinct tblWorks_for.diver_number) as diving_number
from tblWorks_for
inner join tblDivingClub on tblDivingClub.number = tblWorks_for.club_number,tblDiving
where tblWorks_for.end_working_date is null
group by tblDivingClub.name
UNION ALL
select tblDivingClub.name, count(distinct tblDiving.diving_number) as diving_number
from tblDivingClub
inner join tblDiving on tblDivingClub.number = tblDiving.diving_club
WHERE tblDiving.date_of_diving <= DATEADD(year,-1, GETDATE())
group by tblDivingClub.name
)
SELECT * FROM CTE
You can combine the queries using a UNION ALL as long as there are the same number of columns in each query. You can then roll them into a Common Table Expression (CTE) and do a select from that.

SQL Select set of records from one table, join each record to top 1 record of second table matching 1 column, sorted by a column in the second table

This is my first question on here, so I apologize if I break any rules.
Here's the situation. I have a table that lists all the employees and the building to which they are assigned, plus training hours, with ssn as the id column. I have another table that lists all the employees in the company, also keyed by ssn, but including name and other personal data. The second table contains multiple records for each employee, at different points in time. What I need to do is select all the records in the first table for a certain building, then get the most recent name from the second table, and allow the result set to be sorted by any of the columns returned.
I have this in place, and it works fine, it is just very slow.
A very simplified version of the tables is:
table1 (ssn CHAR(9), buildingNumber CHAR(7), trainingHours DEC(5,2)) (7200 rows)
table2 (ssn CHAR(9), fName VARCHAR(20), lName VARCHAR(20), sequence INT) (708,000 rows)
The sequence column in table 2 is a number that corresponds to a predetermined date to enter these records, the higher number, the more recent the entry. It is common/expected that each employee has several records. But several may not have the most recent(i.e. '8').
My SProc is:
@BuildingNumber CHAR(7), @SortField VARCHAR(25)
BEGIN
DECLARE @returnValue TABLE(ssn CHAR(9), buildingNumber CHAR(7), fname VARCHAR(20), lName VARCHAR(20), rowNumber INT)
INSERT INTO @returnValue(...)
SELECT(ssn,buildingNum,fname,lname,rowNum)
FROM SELECT(...,CASE @SortField Row_Number() OVER (PARTITION BY buildingNumber ORDER BY {sortField column} END AS RowNumber)
FROM table1 a
OUTER APPLY(SELECT TOP 1 fName,lName FROM table2 WHERE ssn = a.ssn ORDER BY sequence DESC) AS e
where buildingNumber = @BuildingNumber
SELECT * from @returnValue ORDER BY RowNumber
END
I have indexes for the following:
table1: buildingNumber(non-unique,nonclustered)
table2: sequence_ssn(unique,nonclustered)
Like I said this gets me the correct result set, but it is rather slow. Is there a better way to go about doing this?
It's not possible to change the database structure or the way table 2 operates. Trust me if it were it would be done. Are there any indexes I could make that would help speed this up?
I've looked at the execution plans, and it has a clustered index scan on table 2(18%), then a compute scalar(0%), then an eager spool(59%), then a filter(0%), then top n sort(14%).
That's 78% of the execution so I know it's in the section to get the names, just not sure of a better(faster) way to do it.
The reason I'm asking is that table 1 needs to be updated with current data. This is done through a webpage with a radgrid control. It has a range, start index, all that, and it takes forever for the users to update their data.
I can change how the update process is done, but I thought I'd ask about the query first.
Thanks in advance.
I would approach this with window functions. The idea is to assign a sequence number to the records in the table with duplicates (I think table2), such that the most recent record for each ssn has a value of 1. Then just select that row as the most recent record:
select t1.*, t2.*
from table1 t1 join
     (select t2.*,
             row_number() over (partition by ssn order by sequence desc) as seqnum
      from table2 t2
     ) t2
     on t1.ssn = t2.ssn and t2.seqnum = 1
where t1.buildingNumber = @BuildingNumber;
My second suggestion is to use a user-defined function rather than a stored procedure:
create function XXX (
    @BuildingNumber char(7)
)
returns table as
return (
    select t1.ssn, t1.buildingNumber, t2.fName, t2.lName
    from table1 t1 join
         (select t2.*,
                 row_number() over (partition by ssn order by sequence desc) as seqnum
          from table2 t2
         ) t2
         on t1.ssn = t2.ssn and t2.seqnum = 1
    where t1.buildingNumber = @BuildingNumber
);
(This doesn't have the logic for the ordering because that doesn't seem to be the central focus of the question.)
You can then call it as:
select *
from dbo.XXX(<building number>);
EDIT:
The following may speed it up further, because you are only selecting a small(ish) subset of the employees:
select *
from (select t1.ssn, t1.buildingNumber, t1.trainingHours, t2.fName, t2.lName,
             row_number() over (partition by t1.ssn order by t2.sequence desc) as seqnum
      from table1 t1 join
           table2 t2
           on t1.ssn = t2.ssn
      where t1.buildingNumber = @BuildingNumber
     ) t
where seqnum = 1;
And, finally, I suspect that the following might be the fastest:
select t1.*, t2.*, row_number() over (partition by t1.ssn order by t2.sequence desc) as seqnum
from table1 t1 join
     table2 t2
     on t1.ssn = t2.ssn
where t1.buildingNumber = @BuildingNumber and
      t2.sequence = (select max(sequence) from table2 t2a where t2a.ssn = t1.ssn)
In all these cases, an index on table2(ssn, sequence) should help performance.
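To make the "latest row per ssn" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later is assumed for ROW_NUMBER). The sample rows, the index name `ix_table2_ssn_sequence`, and the `latest_names` helper are all made up for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ssn TEXT, buildingNumber TEXT, trainingHours REAL);
    CREATE TABLE table2 (ssn TEXT, fName TEXT, lName TEXT, sequence INTEGER);
    -- The index suggested in the answer: lookups by ssn, ordered by sequence
    CREATE INDEX ix_table2_ssn_sequence ON table2 (ssn, sequence);
    INSERT INTO table1 VALUES ('111', 'B1', 4.0), ('222', 'B1', 2.5), ('333', 'B2', 1.0);
    INSERT INTO table2 VALUES
        ('111', 'Ann', 'Old', 6), ('111', 'Ann', 'New', 8),
        ('222', 'Bob', 'Jones', 7),
        ('333', 'Cy', 'Smith', 8);
""")

def latest_names(conn, building_number):
    """Join each employee in the building to their most recent table2 row."""
    sql = """
        SELECT t1.ssn, t1.buildingNumber, t2.fName, t2.lName
        FROM table1 AS t1
        JOIN (
            SELECT ssn, fName, lName,
                   ROW_NUMBER() OVER (PARTITION BY ssn ORDER BY sequence DESC) AS seqnum
            FROM table2
        ) AS t2 ON t1.ssn = t2.ssn AND t2.seqnum = 1
        WHERE t1.buildingNumber = ?
        ORDER BY t1.ssn
    """
    return conn.execute(sql, (building_number,)).fetchall()

rows = latest_names(conn, "B1")
print(rows)  # [('111', 'B1', 'Ann', 'New'), ('222', 'B1', 'Bob', 'Jones')]
```

Note that employee 222 is still returned even though their highest sequence is 7 rather than 8, which matches the requirement that not everyone has the most recent entry.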
Try using temp tables instead of table variables. Not sure what kind of system you are working on, but I have had pretty good luck with them. Both are stored in tempdb, but temp tables have column statistics, which table variables lack, so the optimizer tends to choose better plans for larger row counts. Depending on other system usage this might do the trick.
Simply define the temp table using #Tablename instead of @Tablename. Put the name-sorting subquery into a temp table before everything else runs and join to it.
Just make sure to drop the table at the end. It will be dropped automatically when the SP finishes, but it is a good idea to drop it explicitly to be on the safe side.

Problem with unique SQL query

I want to select all records, but have the query only return a single record per Product Name. My table looks similar to:
SellId ProductName Comment
1 Cake dasd
2 Cake dasdasd
3 Bread dasdasdd
where the Product Name is not unique. I want the query to return a single record per ProductName with results like:
SellId ProductName Comment
1 Cake dasd
3 Bread dasdasdd
I have tried this query,
Select distinct ProductName,Comment ,SellId from TBL#Sells
but it is returning multiple records with the same ProductName. My table is not really as simple as this; this is just a sample. What is the solution? Is it clear?
Select ProductName,
min(Comment) , min(SellId) from TBL#Sells
group by ProductName
If you only want one record per ProductName, you of course have to choose which value you want for the other fields.
If you aggregate (using GROUP BY) you can choose an aggregate function,
that is, a function that takes a list of values and returns only one. Here I have chosen MIN, the smallest value for each field.
NOTE: Comment and SellId can come from different records, since MIN is taken per column.
Other aggregates you might find useful:
FIRST : first record encountered
LAST : last record encountered
AVG : average
COUNT : number of records
FIRST/LAST have the advantage that all fields come from the same record, but they are not available everywhere (MS Access has them; SQL Server does not).
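The caveat about MIN mixing values from different records is easy to miss, so here is a small sketch of it using Python's built-in sqlite3 module as a stand-in database. The table is named `Sells` as in the other answers, and the comments are hypothetical values chosen so that the smallest Comment and the smallest SellId for 'Cake' live on different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sells (SellId INTEGER PRIMARY KEY, ProductName TEXT, Comment TEXT);
    INSERT INTO Sells VALUES (1, 'Cake', 'zzz'), (2, 'Cake', 'aaa'), (3, 'Bread', 'mmm');
""")

# MIN per column: for 'Cake' the comment comes from row 2 but the id from row 1
mixed = conn.execute("""
    SELECT ProductName, MIN(Comment), MIN(SellId)
    FROM Sells
    GROUP BY ProductName
    ORDER BY ProductName
""").fetchall()
print(mixed)     # [('Bread', 'mmm', 3), ('Cake', 'aaa', 1)]

# Picking one key per group and joining back keeps all fields from the same row
same_row = conn.execute("""
    SELECT s.ProductName, s.Comment, s.SellId
    FROM Sells AS s
    JOIN (SELECT MIN(SellId) AS SellId FROM Sells GROUP BY ProductName) AS f
      ON f.SellId = s.SellId
    ORDER BY s.ProductName
""").fetchall()
print(same_row)  # [('Bread', 'mmm', 3), ('Cake', 'zzz', 1)]
```

The second query is one way to get the "all fields from the same record" behaviour on engines that have no FIRST/LAST aggregate.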
SELECT S.ProductName, S.Comment, S.SellId
FROM
Sells S
JOIN (SELECT MAX(SellId) AS SellId
FROM Sells
GROUP BY ProductName) AS TopSell ON TopSell.SellId = S.SellId
This will get the latest comment as your selected comment assuming that SellId is an auto-incremented identity that goes up.
I know, you've got an answer already, I'd like to offer a way that was fastest in terms of performance for me, in a similar situation. I'm assuming that SellId is Primary Key and identity. You'd want an index on ProductName for best performance.
select
Sells.*
from
(
select
distinct ProductName
from
Sells
) x
join
Sells
on
Sells.ProductName = x.ProductName
and Sells.SellId =
(
select
top 1 s2.SellId
from
Sells s2
where
x.ProductName = s2.ProductName
Order By SellId
)
A slower method, (but still better than Group By and MIN on a long char column) is this:
select
*
from
(
select
*,ROW_NUMBER() over (PARTITION BY ProductName order by SellId) OccurenceId
from sells
) x
where
OccurenceId = 1
An advantage of this one is that it's much easier to read.
create table Sale
(
SaleId int not null
constraint PK_Sale primary key,
ProductName varchar(100) not null,
Comment varchar(100) not null
)
insert Sale
values
(1, 'Cake', 'dasd'),
(2, 'Cake', 'dasdasd'),
(3, 'Bread', 'dasdasdd')
-- Option #1 with over()
select *
from Sale
where SaleId in
(
select SaleId
from
(
select SaleId, row_number() over(partition by ProductName order by SaleId) RowNumber
from Sale
) tt
where RowNumber = 1
)
order by SaleId
-- Option #2
select *
from Sale
where SaleId in
(
select min(SaleId)
from Sale
group by ProductName
)
order by SaleId
drop table Sale

Any way to get a value similar to @@ROWCOUNT when TOP is used?

If I have a SQL statement such as:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
PRINT @@ROWCOUNT
-- shows '5'
Is there any way to get a value like @@ROWCOUNT that is the actual count of all of the rows that match the query, without re-issuing the query sans the TOP 5?
The actual problem is a much more complex and intensive query that performs beautifully since we can use TOP n or SET ROWCOUNT n, but then we cannot get the total count, which is required to display paging information in the UI correctly. Presently we have to re-issue the query with @Count = COUNT(ID) instead of *.
Whilst this doesn't exactly meet your requirement (in that the total count isn't returned as a variable), it can be done in a single statement:
;WITH rowCTE
AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY ID DESC) AS rn1
,ROW_NUMBER() OVER (ORDER BY ID ASC) AS rn2
FROM Person
WHERE Name LIKE 'Sm%'
)
SELECT *
,(rn1 + rn2) - 1 as totalCount
FROM rowCTE
WHERE rn1 <=5
The totalCount column will have the total number of rows matching the where filter.
It would be interesting to see how this stacks up performance-wise against two queries on a decent-sized data-set.
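The rn1 + rn2 - 1 trick works because, when ID is unique, a row's position counted from the top plus its position counted from the bottom always equals the number of matching rows plus one. A rough way to check this is the sketch below, which uses Python's built-in sqlite3 module in place of SQL Server (SQLite 3.25 or later assumed) and hypothetical names as data. It also adds a `COUNT(*) OVER ()` column, which is not part of the answer above but a different technique for the same total, included only for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (ID INTEGER PRIMARY KEY, Name TEXT)")
# Hypothetical data: 7 of these 9 names start with 'Sm'
names = ["Smith", "Smythe", "Smart", "Jones", "Small", "Smoot", "Smee", "Smiley", "Brown"]
conn.executemany("INSERT INTO Person (Name) VALUES (?)", [(n,) for n in names])

rows = conn.execute("""
    WITH rowCTE AS (
        SELECT ID, Name,
               ROW_NUMBER() OVER (ORDER BY ID DESC) AS rn1,
               ROW_NUMBER() OVER (ORDER BY ID ASC)  AS rn2,
               COUNT(*) OVER () AS cnt  -- alternative way to get the same total
        FROM Person
        WHERE Name LIKE 'Sm%'
    )
    SELECT ID, Name, rn1 + rn2 - 1 AS totalCount, cnt
    FROM rowCTE
    WHERE rn1 <= 5          -- plays the role of TOP 5 ... ORDER BY ID DESC
    ORDER BY ID DESC
""").fetchall()

for row in rows:
    print(row)  # five rows; totalCount and cnt are both 7 on each of them
```

Both columns report the full match count on every returned row, so the paging UI can read the total from any row of the first page.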
You'll have to run another COUNT() query:
SELECT TOP 5
*
FROM Person
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
DECLARE @r int
SELECT
@r=COUNT(*)
FROM Person
WHERE Name LIKE 'Sm%'
select @r
Something like this may do it:
SELECT TOP 5
*
FROM Person
cross join (select count(*) HowMany
from Person
WHERE Name LIKE 'Sm%') tot
WHERE Name LIKE 'Sm%'
ORDER BY ID DESC
The subquery returns one row with one column containing the full count; the cross join includes it with all rows returned by the "main" query; and "SELECT *" would include the new column HowMany.
Depending on your needs, the next step might be to filter out that column from your return set. One way would be to load the data from the query into a temp table, and then return just the desired columns, and get rowcount from the HowMany column from any row.
