Same query giving different results - sql-server

I am still new to working with databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to address the issue I am facing.
Just a bit of info on what I am doing: I have a table filled with contact information, and some of the contacts are duplicated. Most of the duplicated rows have a truncated phone number, which makes that data useless.
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
    SELECT
        CID, Firstname, lastname, phone, email, LEN(phone) AS length,
        ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                           ORDER BY Firstname) AS dupcnt
    FROM
        [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
  AND length <= 10
I assumed this query would find all records that are duplicated based on the three columns I specified, selecting any that have a dupcnt greater than 1 and a phone value with a length less than or equal to 10. But when I run the query more than once, I get a different result set on each execution. There must be some logic I am missing here, but I am completely baffled. All of the columns are of the varchar datatype, except for CID, which is int.

Instead of ROW_NUMBER(), use COUNT(*), and remove the ORDER BY, since it isn't needed with COUNT(*).
The way you have it now, you are chunking records into groups/partitions by firstname/lastname/email, then ordering each group/partition by firstname. But firstname is part of the partition key, meaning every firstname in that group/partition is identical, so the ORDER BY gives SQL Server nothing to break ties with. The row numbers then depend on how SQL Server happens to fetch the rows from storage (whichever record it found first gets 1, the second gets 2, and so on), and every time you run this SQL it may fetch the rows from disk or cache in a different order.
COUNT(*), by contrast, will flag ALL duplicate rows.
So instead:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
This returns the number of records that share the same firstname, lastname, and email; you then keep any record where that count is greater than 1.
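Putting it together, a minimal sketch of the full query (same table and columns as in the question):
WITH CTE AS
(
    SELECT
        CID, Firstname, lastname, phone, email, LEN(phone) AS length,
        COUNT(*) OVER (PARTITION BY Firstname, lastname, email) AS dupcnt
    FROM
        [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
  AND length <= 10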

ORDER BY Firstname is non-deterministic here, because every row in the partition has the same Firstname (it's part of the PARTITION BY).
If CID is unique you could use it in the ORDER BY to make the numbering stable, but I suspect you really want a count.
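For example, a sketch of the stable variant (assuming CID is unique):
ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                   ORDER BY CID) AS dupcnt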

I believe you are getting different results on every run because (a) unless you clearly specify otherwise in the query, you can assume nothing about the order in which SQL Server returns data, and (b) the only ordering criterion you provide is FirstName, which is far less precise than your grouping (Firstname, lastname, email).
As for the query itself, as written it assumes that the first item found in a given partition contains a valid phone number. Without specifying the order, you cannot know this will be true… and what if all items in a given grouping have invalid phone numbers? Below is my stab at pulling out the data you're looking for, in a hopefully useful format.
WITH CTE -- Sorry, I'm lazy and generally don't list the columns
AS
(
    SELECT
        Firstname
        ,lastname
        ,phone
        ,count(*) HowMany -- How many in group
        ,sum(case len(phone) when 10 then 1 else 0 end) GoodLength -- How many in group with a valid 10-character phone
    from [data.com_raw]
    group by
        Firstname
        ,lastname
        ,phone
    having count(*) <> sum(case len(phone) when 10 then 1 else 0 end)
    and count(*) > 1 -- Remove this to find singletons with invalid phone numbers
)
)
select
    cr.CID
    ,cr.Firstname
    ,cr.lastname
    ,case len(cr.phone) when 10 then '' else 'Bad' end IsBad
    ,cr.phone
    ,cr.email
from [data.com_raw] cr
inner join CTE
    on CTE.Firstname = cr.Firstname
    and CTE.lastname = cr.lastname
    and CTE.phone = cr.phone
order by
    cr.CID
    ,cr.Firstname
    ,cr.lastname
    ,case len(cr.phone) when 10 then '' else 'Bad' end
    ,cr.phone
(Yes, if there are no indexes to support this, you will end up with a table scan.)

SELECT Firstname, lastname, email, COUNT(*)
FROM [data.com_raw]
WHERE LEN(phone) <= 10
GROUP BY Firstname, lastname, email
HAVING COUNT(*) > 1

Related

Choose 2 Random Values from 2 Separate Columns SQL

I have to prepare test data to send to a 3rd party, but I don't wish to send the customers' real names, nor their real dates of birth.
I could solve the D.O.B. issue by just randomly shifting the DOB by several years. The name is trickier, though: is there any way I can take a list of, say, 10 customer names and choose a different Firstname and Surname each time?
I want to mix and match the names, so it essentially picks one random firstname, then one random lastname, and puts them together on the same line.
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
This will return a random first name each time, but if I put the surname column in, it will also return the matching surname. I don't want that; I want a random surname from the list.
I tried doing this via a UNION, but you can't do an ORDER BY NEWID() in a UNION.
Cheers.
I think this one might help...
WITH fn AS
(
SELECT TOP 1 opde.first_name
FROM Table AS opde
ORDER BY NEWID()
),
sn AS
(
SELECT TOP 1 opde.surname
FROM Table AS opde
ORDER BY NEWID()
)
SELECT first_name, surname
FROM fn
CROSS APPLY sn;
In the fn subquery you select a random first name; in sn you do the same with a surname. The CROSS APPLY then combines those two results into a single row.
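Since neither CTE references the other, a plain CROSS JOIN behaves the same way here; CROSS APPLY only matters when the right side is correlated with the left:
SELECT first_name, surname
FROM fn CROSS JOIN sn;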
You may use a union of subqueries, each of which uses order by with NEWID:
SELECT first_name
FROM
(SELECT TOP 1 opde.first_name FROM Table AS opde ORDER BY NEWID()) t1
UNION ALL
SELECT surname
FROM
(SELECT TOP 1 opde.surname FROM Table AS opde ORDER BY NEWID()) t2;

detecting duplicates and removing them

I've been trying to solve a problem in my database which is quite common but I couldn't find a solution so far and I hope you could help me with this.
I have a database with people and their associated addresses. My primary goal is to find out how many unique households are in there; for example, I want to count a family as one. So far I ran a query to display last_names and addresses which occur more than once:
select Last_Name ,add_line1, count(*) from ##all_people
group by Last_Name,ADD_LINE1
having count(*) > 1
This shows me people with the same last_name and address, but I need their IDs in order to remove them from my temp table.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
This is the structure of my temptable:
ID First_name Last_Name add_line1
Thank you so much for your help!!!
To find duplicates, you can use Count() Over(), partitioned by the grouping you want:
select * from (
select Id, Last_Name ,add_line1, count(*) over (partition by Last_Name, add_line1) dupe_count from ##all_people
) t
where t.dupe_count > 1
To find the ones you want to delete, you can use Row_Number():
select * from (
select Id, Last_Name ,add_line1, row_number() over (partition by Last_Name, add_line1 order by ID) extras from ##all_people
) t
where t.extras > 1
Use t.extras = 1 to see one row per grouping.
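To actually remove them, you can delete through the CTE directly; a minimal sketch (assuming you keep the row with the lowest ID per group):
with d as (
    select row_number() over (partition by Last_Name, add_line1 order by ID) extras
    from ##all_people
)
delete from d where extras > 1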
You seem to have a lot of questions here...
My primary goal is to find out how many unique households are in there.
You can do this with a distinct count:
SELECT COUNT(DISTINCT Last_Name + add_line1)
FROM ##all_people
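One caveat that isn't in the original answer: plain string concatenation can collide ('ab' + 'c' equals 'a' + 'bc'), so adding a separator makes the distinct count exact:
SELECT COUNT(DISTINCT Last_Name + '|' + add_line1)
FROM ##all_people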
...but I need their IDs in order to remove them from my temptable
I think this is solved by the Count() Over() query in the other answer, which carries the Id through.
Furthermore, I'd like to ask how it is possible to display only one record for each household.
Just use distinct last name and address:
SELECT DISTINCT last_name, add_line1
FROM ##all_people
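If you also need an ID on each household row (for example, to keep that one row and delete the rest), a simple sketch is to keep the lowest ID per group:
SELECT MIN(ID) AS ID, last_name, add_line1
FROM ##all_people
GROUP BY last_name, add_line1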

SQL Server - Delete Duplicate Rows - how does Partition By affect this query?

I've been using the following inherited query to try to delete duplicate rows, and I'm getting some unexpected results when first running it as a SELECT. I believe it has something to do with my lack of understanding of the PARTITION part of the statement:
WITH CTE AS(
SELECT [Id],
[Url],
[Identifier],
[Name],
[Entity],
[DOB],
RN = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name)
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what the PARTITION BY Name part of this is doing? It doesn't limit the query to only looking for duplicates in the Name field, correct? I need to ensure that all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
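If a row should count as a duplicate only when all five listed columns match, a sketch would partition by all of them; the ORDER BY is arbitrary here, since rows within a partition are identical on those columns, and (SELECT NULL) makes that explicit:
WITH CTE AS(
SELECT RN = ROW_NUMBER() OVER (
                PARTITION BY [Url], [Identifier], [Name], [Entity], [DOB]
                ORDER BY (SELECT NULL))
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1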

How to limit the answers of the GROUP BY clause in sql-server (2005)

I'm trying to get the top 50 cities for all customer lists in our DB
(So, simplified: every client has a list of customers with associated data, like the city.)
If I say:
SELECT top(50) clientid, city, COUNT(city) as cnt
FROM customers
GROUP BY clientid, city
ORDER by cnt
it will limit the total result set to 50 rows instead of limiting the results for every group.
How can I get the top 50 per clientid?
EDIT:
I searched Stack Overflow (and Googled) but only found solutions for MySQL. Probably searching for 'limit' only finds MySQL solutions, since that's the keyword needed for that database engine. If I knew the keyword needed in SQL Server, I could find it as well using Google.
;WITH cte
As (SELECT clientid,
city,
COUNT(city) as cnt,
ROW_NUMBER() OVER (PARTITION BY clientid
ORDER BY COUNT(city) DESC) AS RN
FROM customers
GROUP BY clientid,
city)
SELECT clientid,
city
FROM cte
WHERE RN <= 50
I think this might do it:
select top (50) city
from customers
group by city
order by count(clientid) desc
I assume if a clientID is in the same row as a city, then that clientID represents a user living in that city.

SQL Server Weighted Full Text Search

Currently I have a table that I search on 4 fields: FirstName, LastName, MiddleName, and AKAs. I currently have a CONTAINSTABLE search across those rows and it works. Not well, but it works. Now I want FirstName weighted higher and MiddleName weighted lower.
I found the ISABOUT command, but that seems pretty worthless if I have to apply it per word rather than per column (hopefully I understood this wrong). It is not an option if it's per word, because I do not know how many words the user will enter.
I found a thread here that talks about this same problem, but I was unable to get the accepted solution to work. Maybe I did something wrong, but regardless, I cannot get it to work, and its logic seems really... odd. There has to be an easier way.
The key to manipulating the rankings is to use a union. For each column you use a separate select statement, and in each statement you add an identifier that shows which column the row was pulled from. Then insert the results into a table variable, and you can manipulate the ranking by sorting on the identifier, or by multiplying the rank by some value based on the identifier.
The key is to give the appearance of modifying the ranking, not to actually change SQL Server's ranking.
Example using a table variable:
DECLARE @Results TABLE (PersonId Int, Rank Int, Source Int)
For table People, with columns PersonId Int PK Identity, FirstName VarChar(100), MiddleName VarChar(100), LastName VarChar(100), and AlsoKnown VarChar(100), each added to a full text catalog, you could use the query:
INSERT INTO @Results (PersonId, Rank, Source)
SELECT PersonId, CT.[RANK], 1
FROM ContainsTable(People, FirstName, @SearchValue) CT INNER JOIN People P ON CT.[KEY] = P.PersonId
UNION
SELECT PersonId, CT.[RANK], 2
FROM ContainsTable(People, MiddleName, @SearchValue) CT INNER JOIN People P ON CT.[KEY] = P.PersonId
UNION
SELECT PersonId, CT.[RANK], 3
FROM ContainsTable(People, LastName, @SearchValue) CT INNER JOIN People P ON CT.[KEY] = P.PersonId
UNION
SELECT PersonId, CT.[RANK], 4
FROM ContainsTable(People, AlsoKnown, @SearchValue) CT INNER JOIN People P ON CT.[KEY] = P.PersonId
/*
Now that the results from above are in the #Results table, you can manipulate the
rankings in one of several ways, the simplest is to pull the results ordered first by Source then by Rank. Of course you would probably join to the People table to pull the name fields.
*/
SELECT PersonId
FROM @Results
ORDER BY Source, Rank DESC
/*
A more complex manipulation would use a statement to multiply the ranking
by a value above 1 (to increase rank) or less than 1 (to lower rank), then
return results based on the new rank. This provides more fine tuning,
since I could make first name 10% higher and middle name 15% lower and
leave last name and also known the original value.
*/
SELECT PersonId, CASE Source WHEN 1 THEN Rank * 1.1 WHEN 2 THEN Rank * 0.9 ELSE Rank END AS NewRank FROM @Results
ORDER BY NewRank DESC
The one downside is that, as you'll notice, I didn't use UNION ALL, so if a word appears in more than one column the rank won't reflect that. If that's an issue, you could use UNION ALL and then remove duplicate person ids by adding all or part of the duplicate record's rank to the rank of another record with the same person id.
Ranks are useless across indexes; you can't merge them and expect the result to mean anything. The rank numbers of different indexes are apples/oranges/grapes/watermelons/pears: comparisons between them have no relative meaning with respect to the contents of the other indexes.
Sure, you can try to link/weight/order ranks between indexes to fudge a meaningful result, but at the end of the day that result is still gibberish; depending on the specifics of your situation, though, it may still be good enough to provide a workable solution.
In my view the best solution is to put all the data you intend to be searchable into a single FTS index/column and use that column's rank to order your output, even if you have to duplicate field contents to accomplish it.
Just a few weeks ago I was solving a very similar problem, and the solution is surprisingly easy (albeit ugly and space consuming).
Create another column containing the combined values FirstName + FirstName + LastName + MiddleName, in this order. The duplicated FirstName column is not a typo; it's a trick to force full text search to weight values from FirstName higher during the search.
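A minimal sketch of that combined column (the SearchName name, the column width, and the ISNULL guard for missing middle names are my assumptions, not part of the original answer):
ALTER TABLE People ADD SearchName VarChar(500);

UPDATE People
SET SearchName = FirstName + ' ' + FirstName + ' ' + LastName + ' '
                 + ISNULL(MiddleName, '');
-- Then add SearchName to the full text catalog and rank on that column alone.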
How about this way:
SELECT p.* from Person p
left join ContainsTable(Person, FirstName, @SearchValue) firstnamefilter on firstnamefilter.[KEY] = p.id
left join ContainsTable(Person, MiddleName, @SearchValue) middlenamefilter on middlenamefilter.[KEY] = p.id
where (firstnamefilter.[RANK] is not null or middlenamefilter.[RANK] is not null)
order by firstnamefilter.[RANK] desc, middlenamefilter.[RANK] desc
This will produce a record for each Person row where either the first or middle name (or both) matches the search term, ordered first by matches against the first name (in descending rank order), then by matches against the middle name (again in descending rank order).
I assume the data returned is joined to other tables within your schema? If so, I would develop your own RANK based on columns from the data associated with the full text index. This also provides a guaranteed level of accuracy in the RANK value.
