Can I make this T-SQL code more efficient - sql-server

I am running a SQL query on a table containing 3 million records comparing email addresses.
We have two email address fields, primary and secondary.
I am comparing a subset of primary emails against all other primary and secondary Emails to get a count of both duplicates and unique Emails in the data.
I believe this code works, but it's still running 10 minutes in, and I have to do this for another 9 subsets, which are a lot larger than this one. The code is as follows:
SELECT COUNT(*) AS UniqueRecords
FROM AllVRContacts
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'
AND EMAIL NOT IN
(SELECT EMAIL FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')
AND EMAIL NOT IN
(SELECT SECONDARY_EMAIL_ADDRESS__C FROM AllVRContacts WHERE ACCOUNTID != '00120000003bNmMAAU')
I want to learn something from this rather than just have someone scratch my back for me, the more explanation the better!
Thanks guys,

Create the following indexes:
AllVrContacts (AccountID) INCLUDE (Email)
AllVrContacts (Email) INCLUDE (AccountID)
AllVrContacts (SECONDARY_EMAIL_ADDRESS__C) INCLUDE (AccountID)
The first index, on (AccountID) INCLUDE (Email), covers the WHERE filter in the main query:
WHERE ACCOUNTID = '00120000003bNmMAAU'
AND LEN(Email) > 1
The other two indexes will be used for antijoins (NOT IN) against this table.
You should also use:
SELECT COUNT(DISTINCT email) AS UniqueRecords
if you want the duplicates across the same account to be counted only once.

SELECT COUNT(*)
FROM (SELECT EMAIL AS UniqueRecords
FROM AllVRContacts a
WHERE ACCOUNTID = '00120000003bNmMAAU'
AND NOT EXISTS (SELECT EMAIL FROM AllVRContacts b
WHERE ACCOUNTID != '00120000003bNmMAAU'
AND (
a.EMAIL = b.EMAIL
OR a.EMAIL = b.SECONDARY_EMAIL_ADDRESS__C
)
)
AND LEN(EMAIL) > 1
GROUP BY EMAIL
) c
So how is this query better?
1. You typically want to use NOT EXISTS instead of NOT IN.
   - IN returns true if a specified value matches any value in a subquery or a list.
   - EXISTS returns true if a subquery contains any rows.
   - More info: SQL Server: JOIN vs IN vs EXISTS - the logical difference
2. = generally performs much better than !=, because an equality predicate can be satisfied with an index seek.
3. Reduce the scans (seeks if you have indexes on AllVRContacts) by not searching through AllVRContacts a second time for the secondary e-mail comparison.
4. GROUP BY resolves potential duplicate e-mails within the ACCOUNTID.
To further improve performance, add the indexes Quassnoi suggested, and have whatever populates the table validate e-mails so that the LEN check is no longer needed.
[EDIT] Added explanation to (3)
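The practical difference between NOT IN and NOT EXISTS shows up when the subquery can return NULL, which is likely for a secondary e-mail column. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for SQL Server and a hypothetical contacts table with made-up rows; the three-valued NULL logic it demonstrates is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical miniature of AllVRContacts.
    CREATE TABLE contacts (account_id TEXT, email TEXT, secondary_email TEXT);
    INSERT INTO contacts VALUES
        ('A', 'only-a@example.com', NULL),
        ('A', 'shared@example.com', NULL),
        ('B', 'shared@example.com', NULL),          -- NULL secondary address
        ('B', 'other@example.com', 'x@example.com');
""")

# NOT IN: the subquery yields a NULL, so "email NOT IN (...)" is never TRUE
# for a non-matching value and every row is filtered out.
not_in = conn.execute("""
    SELECT email FROM contacts
    WHERE account_id = 'A'
      AND email NOT IN (SELECT secondary_email FROM contacts
                        WHERE account_id <> 'A')
""").fetchall()

# NOT EXISTS: a comparison with NULL simply fails to match, so the row that
# is unique to account A is kept.
not_exists = conn.execute("""
    SELECT a.email FROM contacts a
    WHERE a.account_id = 'A'
      AND NOT EXISTS (SELECT 1 FROM contacts b
                      WHERE b.account_id <> 'A'
                        AND (a.email = b.email
                             OR a.email = b.secondary_email))
""").fetchall()

print(not_in)      # empty list
print(not_exists)  # one row: only-a@example.com
```

If SECONDARY_EMAIL_ADDRESS__C contains NULLs, the original NOT IN query would therefore count zero rows, whichever indexes exist.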

Can this be applicable?
SELECT COUNT(*) AS UniqueRecords
FROM (
SELECT ACCOUNTID, EMAIL
FROM AllVRContacts
WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(EMAIL) > 1
UNION
SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
FROM AllVRContacts
WHERE ACCOUNTID = '00120000003bNmMAAU' AND LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s
I understood that basically you wanted to count distinct email addresses for each ACCOUNTID.
UNION in the inner query eliminates duplicates, so its output has only distinct pairs of account IDs and emails, whether primary or secondary. In particular, if an email address is stored as both primary and secondary, it is counted only once. The same applies to the same primary or the same secondary address stored in different rows.
Now you only need to count the rows, which is done by the outer query.
If the other 9 subsets you mentioned are simply other ACCOUNTIDs, you could add GROUP BY ACCOUNTID to the outer query and drop the ACCOUNTID = '...' condition from both WHERE clauses to count emails for all accounts with one query, like this:
SELECT ACCOUNTID, COUNT(*) AS UniqueRecords
FROM (
SELECT ACCOUNTID, EMAIL
FROM AllVRContacts
WHERE LEN(EMAIL) > 1
UNION
SELECT ACCOUNTID, SECONDARY_EMAIL_ADDRESS__C
FROM AllVRContacts
WHERE LEN(SECONDARY_EMAIL_ADDRESS__C) > 1
) s
GROUP BY ACCOUNTID
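To see what the UNION approach actually counts, here is a small sketch, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server (LENGTH replaces LEN) and a hypothetical contacts table with invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; a1 is used by both accounts.
    CREATE TABLE contacts (account_id TEXT, email TEXT, secondary_email TEXT);
    INSERT INTO contacts VALUES
        ('A', 'a1@example.com', 'a2@example.com'),
        ('A', 'a2@example.com', 'a1@example.com'),
        ('A', 'a1@example.com', NULL),
        ('B', 'b1@example.com', 'a1@example.com');
""")

# UNION removes duplicate (account_id, address) pairs before counting.
rows = conn.execute("""
    SELECT account_id, COUNT(*) AS unique_emails
    FROM (
        SELECT account_id, email
        FROM contacts
        WHERE LENGTH(email) > 1
        UNION
        SELECT account_id, secondary_email
        FROM contacts
        WHERE LENGTH(secondary_email) > 1
    ) s
    GROUP BY account_id
    ORDER BY account_id
""").fetchall()

print(rows)
```

Both accounts come out with a count of 2, and a1@example.com is counted under each of them. The query therefore counts distinct addresses per account, not addresses that belong to one account only, which is what the original question asks for.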

Try this and let me know
SELECT ACCOUNTID,COUNT(*) AS UniqueRecords
FROM AllVRContacts
WHERE LEN(EMAIL) > 1 AND ACCOUNTID = '00120000003bNmMAAU'
Group by ACCOUNTID
Having COUNT(EMAIL) >1

Related

Select all columns from table where one field is duplicated

I'm trying to get a list of users (all their data) that have a duplicated email.
I can get all the emails by using
SELECT EMAIL, Count(*)
FROM USER_TABLE
Group By EMAIL having COUNT(*) > 1
and that returns a table of emails and their count (greater than 1).
I could write a query and just do
SELECT *
FROM USER_TABLE
WHERE EMAIL IN ('dup#email.com', 'dup2#email.com' ...);
but that requires me to always run the first query first and then copy paste them all into the IN statement.
What's the best way to combine these? Well not really combine, I don't care how many duplicates there are, I just want all the user info for users that have a duplicate email.
You pretty much wrote the whole solution yourself. You just need to use your first query (selecting only EMAIL) as the IN subquery instead of a hard-coded list.
SELECT *
FROM USER_TABLE
WHERE EMAIL IN
(
SELECT EMAIL
FROM USER_TABLE
GROUP By EMAIL
HAVING COUNT(*) > 1
)
With window function COUNT:
SELECT *
FROM
(
SELECT
u.*,
COUNT(*) OVER (PARTITION BY u.Email) AS Cnt
FROM USER_TABLE u
) AS t
WHERE t.Cnt > 1
We can also try self joining the USER_TABLE table to your first original query:
SELECT t1.*
FROM USER_TABLE t1
INNER JOIN
(
SELECT EMAIL
FROM USER_TABLE
GROUP BY EMAIL
HAVING COUNT(*) > 1
) t2
ON t1.EMAIL = t2.EMAIL
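As a quick check that the IN-subquery and join variants return the same rows, here is a minimal sketch, assuming SQLite via Python's sqlite3 in place of SQL Server and a hypothetical user_table with invented names and addresses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample users; two addresses are duplicated.
    CREATE TABLE user_table (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO user_table (name, email) VALUES
        ('Ann', 'dup@example.com'),
        ('Bob', 'dup@example.com'),
        ('Cy',  'solo@example.com'),
        ('Dee', 'dup2@example.com'),
        ('Eve', 'dup2@example.com');
""")

# Variant 1: duplicated addresses found by a grouped subquery used with IN.
via_in = conn.execute("""
    SELECT id, name, email
    FROM user_table
    WHERE email IN (SELECT email
                    FROM user_table
                    GROUP BY email
                    HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

# Variant 2: the same grouped subquery joined back to the table.
via_join = conn.execute("""
    SELECT t1.id, t1.name, t1.email
    FROM user_table t1
    INNER JOIN (SELECT email
                FROM user_table
                GROUP BY email
                HAVING COUNT(*) > 1) t2
        ON t1.email = t2.email
    ORDER BY t1.id
""").fetchall()

print(via_in == via_join)
print([name for _, name, _ in via_in])
```

Because the grouped subquery returns each duplicated address once, the join does not multiply rows, so both variants list Ann, Bob, Dee and Eve.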

SQL Server getting newest entry returns wrong results. What am I missing?

I have a strange behaviour with a SQL Server query/function.
I have a table with 3 columns (actually there are more columns, but these 3 are relevant for this task). The columns are FileId, UserId and TimeCreated. One user can create the same FileId multiple times, and I want to know which was the most recently created file.
I am doing it with this WHERE clause:
WHERE TimeCreated IN (SELECT MAX(TimeCreated)
FROM table
GROUP BY FileId, UserId)
In my opinion this should be correct, but for some groups, it returns multiple rows, even if the TimeCreated is different.
Here is one result as an example:
TimeCreated | UserId | FileId
------------------------------------------------------
2016-01-18 00:00:00.000 | UserA | FileA
2016-01-18 06:00:00.000 | UserA | FileA
But it should only return the row with '2016-01-18 06:00:00.000' as TimeCreated value.
I don't understand what is going wrong. There are many more entries with UserA as UserId and FileA as FileId but different TimeCreated values, yet it returns only these two rows (so in some way it is almost working). As I said, for some groups it is fine, but sometimes it returns two rows with the same UserId and FileId but different TimeCreated values. When this happens, it is always two rows and not more.
The TimeCreated is a DateTimeOffset(7), UserId is a string as well as FileId. Maybe this is important to know...
Does someone have an explanation why this is happening?
You should use this syntax instead:
;WITH CTE as
(
SELECT
*,
row_number() over (partition by FileId, UserId ORDER BY TimeCreated DESC) AS rn
FROM <table>
)
SELECT * FROM CTE
WHERE rn = 1
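What's going wrong is that your inner select returns more than one value. It returns the maximum of TimeCreated for each combination of FileId and UserId in the table, so a row is returned whenever its TimeCreated equals the maximum of any group, not only its own.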
One way to solve it is this:
...
FROM table t1
INNER JOIN
(
select FileId, UserId, max(TimeCreated) as maxTimeCreated
from table
group by FileId, UserId
)
t2 ON t1.TimeCreated = t2.maxTimeCreated AND t1.UserId = t2.UserId AND t1.FileId = t2.FileId
However, if you post your table structure and desired results, someone might show you a better way.
Your subquery is not correlated with the outer row on UserId and FileId, so the lower TimeCreated may be the maximum for another user's file.
from table t1
where TimeCreated = (select max(TimeCreated)
from table
where table.UserId = t1.UserId
and table.FileId = t1.FileId )
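The effect described in these answers can be reproduced with three rows. This is a minimal sketch, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server, a hypothetical files table, and timestamps stored as ISO text instead of DateTimeOffset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: the older FileA row shares its timestamp with the
    -- newest (and only) row of a different group.
    CREATE TABLE files (file_id TEXT, user_id TEXT, time_created TEXT);
    INSERT INTO files VALUES
        ('FileA', 'UserA', '2016-01-18 00:00:00'),
        ('FileA', 'UserA', '2016-01-18 06:00:00'),
        ('FileB', 'UserB', '2016-01-18 00:00:00');
""")

# Uncorrelated IN: the subquery is just a set of timestamps, so any row whose
# time matches the maximum of ANY group passes the filter.
uncorrelated = conn.execute("""
    SELECT file_id, user_id, time_created
    FROM files
    WHERE time_created IN (SELECT MAX(time_created)
                           FROM files
                           GROUP BY file_id, user_id)
    ORDER BY file_id, time_created
""").fetchall()

# Correlated subquery: each row is compared with the maximum of its own group.
correlated = conn.execute("""
    SELECT f1.file_id, f1.user_id, f1.time_created
    FROM files f1
    WHERE f1.time_created = (SELECT MAX(f2.time_created)
                             FROM files f2
                             WHERE f2.file_id = f1.file_id
                               AND f2.user_id = f1.user_id)
    ORDER BY f1.file_id
""").fetchall()

print(len(uncorrelated))  # all three rows pass the uncorrelated filter
print(correlated)         # one row per (file_id, user_id) group
```

This matches the symptom in the question: an extra row appears only when an older timestamp happens to be the newest one in some other group.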

TSQL Group By Issues

I have a TSQL query that I am trying to group data on. The table contains records of users and the access keys they hold such as site admin, moderator etc. The PK is on User and access key because a user can exist multiple times with different keys.
I am now trying to display a table of all users and in one column, all of the keys that user holds.
If Bob had three separate records for his three separate access keys, the result should have only one record for Bob with all three of his access levels.
SELECT A.[FirstName],
A.[LastName],
A.[ntid],
A.[qid],
C.FirstName AS addedFirstName,
C.LastName AS addedLastName,
C.NTID AS addedNTID,
CONVERT(VARCHAR(100), p.TIMESTAMP, 101) AS timestamp,
(
SELECT k.accessKey,
k.keyDescription
FROM TFS_AdhocKeys AS k
WHERE p.accessKey = k.accessKey
FOR XML PATH ('key'), TYPE, ELEMENTS, ROOT ('keys')
)
FROM TFS_AdhocPermissions AS p
LEFT OUTER JOIN dbo.EmployeeTable as A
ON p.QID = A.QID
LEFT OUTER JOIN dbo.EmployeeTable AS C
ON p.addedBy = C.QID
GROUP BY a.qid
FOR XML PATH ('data'), TYPE, ELEMENTS, ROOT ('root');
END
I am trying to group the data by a.qid, but it's forcing me to group on every column in the SELECT, which then won't be unique, so the result will still contain the duplicates.
What's another approach to handle this?
Currently:
UserID | accessKey
123 | admin
123 | moderator
Desired:
UserID | accessKey
123 | admin
    | moderator
Recently, I was working on something and had a similar problem. Like your query, I had an inner 'for xml' with joins in the outer 'for xml'. It turned out it worked better if the joins were in the inner 'for xml'. The code is pasted below. I hope this helps.
Select
(Select Institution.Name, Institution.Id
, (Select Course.Courses_Id, Course.Expires, Course.Name
From
(Select Course.Courses_Id, Course.Expires, Courses.Name
From Institutions Course Course Join Courses On Course.Courses_Id = Courses.Id
Where Course.Institutions_Id = 31) As Course
For Xml Auto, Type, Elements) As Courses
From Institutions Institution
For Xml Auto, Elements, Root('Institutions') )
Since I don't have the definitions of your other tables, I made some sample test data; you can adapt this to your own tables.
Create statement
CREATE TABLE #test(UserId INT, AccessLevel VARCHAR(20))
Insert sample data
INSERT INTO #test VALUES(123, 'admin')
,(123, 'moderator')
,(123, 'registered')
,(124, 'moderator')
,(124, 'registered')
,(125, 'admin')
By using ROW_NUMBER() you can achieve what you need
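;WITH C AS(
SELECT ROW_NUMBER() OVER(PARTITION BY UserId ORDER BY AccessLevel) As Rn -- number each user's keys
,UserId
,AccessLevel
FROM #test
)
SELECT CASE Rn
WHEN 1 THEN UserId -- show the UserId only on the user's first row
ELSE NULL
END AS UserId
,AccessLevel
FROM C
ORDER BY C.UserId, C.Rn -- keep each user's rows together and in order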
Output
UserId AccessLevel
------ -----------
123 admin
NULL moderator
NULL registered
124 moderator
NULL registered
125 admin

SQL - Find more than one occurrence of a record

I have a table of customers:
Firstname Lastname Mobile Email
I would like to know what query in SQL Server I could run to find all the instances of there being a mobile number allocated to more than one email address, for example
Bob Smith 07789665544 bob#test.com
Bill Car 07789665544 bill#hello.com
I want to find all the records where a mobile number has multiple email addresses.
Thanks.
Use EXISTS
SELECT c.*
FROM dbo.Customers c
WHERE EXISTS
(
SELECT 1 FROM dbo.Customers c2
WHERE c.Mobile = c2.Mobile
AND COALESCE(c.Email, '') <> COALESCE(c2.Email, '')
)
I've used COALESCE in case Email can be NULL.
A CTE with a nested query can do this, and rather quickly too:
with DupeNumber as(
select se.Mobile from (select distinct Mobile, Email from Customers) se
group by se.Mobile
having count(*) >1
)
select c.* from Customers c
inner join DupeNumber dn on c.Mobile=dn.Mobile
order by c.Mobile
This makes a list of the unique Mobile and Email combinations, then finds the Mobile numbers that appear with more than one email, then joins back to the original table to get the full rows.
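The CTE approach can be tried on a few rows. This is a minimal sketch, assuming SQLite via Python's sqlite3 in place of SQL Server and a hypothetical customers table; the Ann rows are invented to show that a repeated identical Mobile/Email pair is not reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample customers.
    CREATE TABLE customers (firstname TEXT, lastname TEXT, mobile TEXT, email TEXT);
    INSERT INTO customers VALUES
        ('Bob',  'Smith', '07789665544', 'bob@test.com'),
        ('Bill', 'Car',   '07789665544', 'bill@hello.com'),
        ('Ann',  'Lee',   '07700900123', 'ann@test.com'),
        ('Ann',  'Lee',   '07700900123', 'ann@test.com');  -- same pair twice
""")

rows = conn.execute("""
    WITH dupe_number AS (
        -- DISTINCT first, so a repeated identical pair counts only once.
        SELECT se.mobile
        FROM (SELECT DISTINCT mobile, email FROM customers) se
        GROUP BY se.mobile
        HAVING COUNT(*) > 1
    )
    SELECT c.firstname, c.mobile, c.email
    FROM customers c
    INNER JOIN dupe_number dn ON c.mobile = dn.mobile
    ORDER BY c.firstname
""").fetchall()

print(rows)
```

Only Bill and Bob are returned, because their shared number is linked to two different addresses, while Ann's number maps to a single address even though the row is stored twice.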

OR in WHERE statement slowing things down dramatically

I have the following query that finds customers related to an order. I have a legacy ID on the customer, so I have to check both the old (legacy) ID and the customer ID, hence the OR condition.
SELECT
c.Title,
c.Name
FROM productOrder po
INNER JOIN Employee e ON po.BookedBy = e.ID
CROSS APPLY (
SELECT TOP 1 *
FROM Customer c
WHERE(po.CustID = c.OldID OR po.CustID = c.CustID)
) c
GROUP BY
c.CustomerId, c.Title, c.FirstName, c.LastName
If I remove the OR condition, it runs fine for both situations. There is an index on customer id and legacy.
For the customer table, you need to create separate indexes on the columns oldid and custid. If you already have a clustered index on custid, then add an index on oldid as well:
CREATE INDEX customer_oldid_idx ON customer(oldid);
Without this index, search for oldid in this clause:
WHERE (po.CustID = c.OldID OR po.CustID = c.CustID)
will have to use a full table scan, and that will be very slow.
