Obtain Duplicated Data - sql-server

Please suggest an SQL query to find duplicate customers across different stores. For example, the customer table has id, name, phone, and storeid columns, and I need to write queries for the following:
Duplicate customers within a store
Duplicate customers across different stores
Table data:
id  name  phone  storeid
------------------------
1   abc   123    4
2   abc   123    4
3   abc   123    5
The first query should show only first 2 records, and the second query should show all 3 records.

You can do something like the following:
SELECT Name,Phone, COUNT(Id) NumberOfTimes, StoreID
FROM Customers
GROUP BY Name,Phone,StoreID
HAVING COUNT(Id) > 1
ORDER BY StoreID
Hope this helps.

Solution
You can try this for the first query:
SELECT *
FROM customer
WHERE 1 < (
SELECT COUNT(name)
FROM customer
WHERE name IN (
SELECT name FROM customer
)
) AND
1 < (
SELECT COUNT(storeid)
FROM customer
WHERE storeid IN (
SELECT storeid FROM customer
)
);
Now, for the second query, use the above one, but remove everything after and including the AND.
Explanation
Let's look at the query step-by-step:
SELECT *
FROM customer
This is stating you want all the columns from the customers table.
WHERE 1 < (
SELECT COUNT(name)
FROM customer
WHERE name IN (
SELECT name FROM customer
)
)
This is a pretty long query, so let's look from inside-outward.
WHERE name IN (
SELECT name FROM customer
)
This time we're getting all the names of customers and checking if there is a match in our current table. To be truthful, we might not need this whole section....
SELECT COUNT(name)
FROM customer
This is stating we want the total number of times each name appears (count) in the customers table that matches the where clause.
WHERE 1 < (
....
)
Here, we are comparing the result from the subquery (the number of duplicated names) and checking to see if it is greater than 1 (i.e., there is a duplicate).
AND
.....
The AND keyword indicates that this second condition must be true in addition to the previous conditions.
The full query should return all entries where both the names and store ids are duplicated; if you remove everything including and after the AND, that will result in all entries which have the same name, but not necessarily the right store id.
Notes
The other two answers are suggesting grouping duplicated data, but in your particular case, I think you do want the duplicated entries as per your expected results (albeit you should add more expected output info than that).
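One caveat: the subqueries above are not correlated with the outer row, so each COUNT runs over the whole table rather than over the rows that match the current customer. Below is a minimal sketch of a correlated variant that returns the duplicated rows themselves. It runs against an in-memory SQLite database through Python's sqlite3 module purely so that it is self-contained; that harness, the EXISTS formulation, and treating "duplicate" as "same name and phone" are assumptions of this sketch, not part of the original answers.

```python
import sqlite3

# In-memory stand-in for the customer table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER, name TEXT, phone TEXT, storeid INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "abc", "123", 4), (2, "abc", "123", 4), (3, "abc", "123", 5)],
)

# Query 1: rows that have another row with the same name and phone
# in the SAME store (duplicates within a store).
within_store = [
    row[0]
    for row in conn.execute(
        """
        SELECT c.id
        FROM customer c
        WHERE EXISTS (
            SELECT 1
            FROM customer d
            WHERE d.name = c.name
              AND d.phone = c.phone
              AND d.storeid = c.storeid
              AND d.id <> c.id
        )
        ORDER BY c.id
        """
    )
]

# Query 2: same check without the store condition
# (duplicates regardless of store).
across_stores = [
    row[0]
    for row in conn.execute(
        """
        SELECT c.id
        FROM customer c
        WHERE EXISTS (
            SELECT 1
            FROM customer d
            WHERE d.name = c.name
              AND d.phone = c.phone
              AND d.id <> c.id
        )
        ORDER BY c.id
        """
    )
]

print(within_store)
print(across_stores)
```

With the three sample rows, the first query picks up ids 1 and 2 and the second picks up all three, which matches the output the question asks for. The two SELECT statements use only standard SQL and should carry over to SQL Server unchanged.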

SELECT storeName, customerName FROM customer
WHERE storeid IN (
SELECT c.storeid
FROM customer c
RIGHT JOIN store s ON (c.storeid = s.id)
GROUP BY c.storeid
HAVING COUNT(*) > 1
)
Basically, we are grouping by storeids, which allows us to count the times they occur in the customer table. We get the id of a case where there are multiple occurrences, and we select the storeName and CustomerName from the customer table that contains the id we got from the inner query.

Related

min(count(*)) over... behavior?

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
On each departments there are workers.
My Data looks like this :
select * from data
Ok. Looking at this query :
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelId , we are counting visible rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where did those numbers come from? I don't understand how adding min caused that result. Min of what?
Can someone please explain how's the calculation being made?

The 2 statements are very different. The first query counts the rows after the grouping and then applies the PARTITION. So, for example, with hotel 1 there is 1 row returned (as all rows for Hotel 1 have the same department A as well) and so the COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has 2 departments 'B' and 'C', and so returns 2.
For your second query, you firstly have the COUNT(*), which is not within the OVER clause. That means it counts all the rows within the GROUP BY specified in your query: GROUP BY hotelid, departmentid. For Hotel 1, there are 4 rows for department A, hence 4. Then you take the minimum of 4; which is unsurprisingly 4. For all the other hotels, they have at least 1 entry with only 1 row for a hotel and department and so returns 1.
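To make the two steps visible, here is a small sketch that writes the grouping as an explicit subquery and then applies the window functions to the grouped rows. The sample rows are hypothetical (the question's data is not shown), chosen only to agree with the description above: hotel 1 has four rows in department A, and the other hotels each have a department with a single row. It runs on SQLite through Python's sqlite3 module, which is an assumption of this sketch and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (hotelid INTEGER, departmentid TEXT)")

# Hypothetical sample rows, one per worker.
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        (1, "A"), (1, "A"), (1, "A"), (1, "A"),
        (2, "B"), (2, "C"), (2, "C"),
        (3, "D"), (3, "D"), (3, "E"),
    ],
)

result = conn.execute(
    """
    SELECT hotelid,
           departmentid,
           dept_rows,  -- plain COUNT(*) per (hotel, department) group
           COUNT(*)       OVER (PARTITION BY hotelid) AS cnt,     -- groups per hotel
           MIN(dept_rows) OVER (PARTITION BY hotelid) AS min_cnt  -- smallest group per hotel
    FROM (
        SELECT hotelid, departmentid, COUNT(*) AS dept_rows
        FROM data
        GROUP BY hotelid, departmentid
    ) AS grouped
    ORDER BY hotelid, departmentid
    """
).fetchall()

for row in result:
    print(row)
```

Here cnt is the number of department groups in each hotel, while min_cnt is the size of the smallest department group in that hotel, which is what MIN(COUNT(*)) OVER (PARTITION BY hotelid) computes in the single-statement form.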

SQL Server Select query returning records twice in a table

I am using the following query to get the top 10 companies from a table:
Select Top 10 CompanyName
From CompanyMaster
Where LiveProductFlag = 1
Order By Display_Priority asc
It is returning records like this.
CompanyName
------------
First Company
Second Company
First Company
Second COmpany
Third Company
Third Company
Fourth Company
Fourth Company
Fifth Company
Fifth Company
I checked records and I don't have duplicate records. Select Distinct doesn't work. Tried all possible solutions after googling without any success.
Thanks.

You can run a simple count on your group field as follows to double-check your assumption.
Select
CompanyName,
DuplicateCount=COUNT(*)
From
CompanyMaster
Where
LiveProductFlag = 1
GROUP BY
CompanyName
HAVING
COUNT(*) > 1
ORDER BY
COUNT(*) DESC

SQL SERVER - Retrieve Last Entered Data

I've searched for a long time for how to get the last entered data in a table, but I always get the same answer:
SELECT TOP 1 CustomerName FROM Customers
ORDER BY CustomerID DESC;
My scenario is: how do I get the last entered row if the Customers table has only a CustomerName column, with no other columns such as ID or createdDate? I entered four names in the following order.
James
Arun
Suresh
Bryen
Now I want to select the last entered CustomerName, i.e., Bryen. How can I get it?

If the table is not properly designed (IDENTITY, TIMESTAMP, identifier generated using SEQUENCE etc.), INSERT order is not kept by SQL Server. So, "last" record is meaningless without some criteria to use for ordering.
One possible workaround is if, by chance, records in this table are linked to records in some other table (FKs, 1:1 or 1:n connection) and that table has a timestamp or something similar from which you can deduce the insertion order.
More details about "ordering without criteria" can be found here and here.
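As an illustration of the "properly designed" case, here is a minimal sketch in which the table carries an auto-incrementing key, so "last entered" has a well-defined meaning. It uses SQLite through Python's sqlite3 module as a stand-in; the AUTOINCREMENT column plays the role that an IDENTITY column would play in SQL Server, and the CustomerID column is exactly the piece the question's table lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CustomerID is assigned by the database in increasing order on each insert.
conn.execute(
    """
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY AUTOINCREMENT,
        CustomerName TEXT NOT NULL
    )
    """
)

for name in ["James", "Arun", "Suresh", "Bryen"]:
    conn.execute("INSERT INTO Customers (CustomerName) VALUES (?)", (name,))

# Ordering by the generated key gives a deterministic "last inserted" row.
# SQLite spells this LIMIT 1; SQL Server would use SELECT TOP 1 instead.
last_name = conn.execute(
    "SELECT CustomerName FROM Customers ORDER BY CustomerID DESC LIMIT 1"
).fetchone()[0]

print(last_name)
```

Without such a column there is nothing to order by, which is why the CustomerName-only table cannot answer the question reliably.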

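-- ORDER BY (SELECT 1000) sorts by a constant, so the numbering follows whatever
-- order the engine happens to return rows in; insertion order is not guaranteed.
;with cte_new as (
select *, row_number() over (order by (select 1000)) as new from tablename
)
select * from cte_new where new = 4 -- the 4th row, i.e. the last of the four names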

How to SQL Sum on a field and have a different field in group by section?

I have a table like this: (Please note that Names are not unique and can be repeated, while Personal_ID is unique).
ID  SourceID  Personal_ID  Name  NumberOfPurchases
1   4         1001         Alex  10
2   2         1002         Sara  5
3   4         1001         Alex  12
4   1         1003         Mina  200
5   2         1002         Sara  20
6   2         1001         Alex  64
Now what I need to do is get the total number of purchases each person made for a given SourceID, so that the results for SourceID = 4 are:
Name Total Number of Purchases
1. Alex 22
And for SourceID = 2
Name Total Number of Purchases
1. Alex 64
2. Sara 25
For this I came up with something like this:
SELECT Name,Sum(NumberOfPurchases) AS Total
FROM tblTEST
GROUP BY (Personal_ID)
HAVING (SOURCEID = @id)
but this is apparently wrong. I am stuck here: if I add other fields to the GROUP BY clause the result is completely different, and if I don't, the SELECT won't work. How can I achieve such a result?

I would guess that the problem here is that you have Name in your SELECT clause, but you're grouping on Personal_ID. It might help if you add Name explicitly to your GROUP BY clause as well. If all is well, Name will be functionally determined by Personal_ID anyway.
And you should also put the filter on SourceID in the WHERE clause like Lucero says. So your query should look like this:
SELECT Name
,Sum(NumberOfPurchases) AS Total
FROM tblTEST
WHERE SOURCEID = @id
GROUP BY Personal_ID, Name;
Check out this Fiddle.
To clear things up: if you use an aggregation function like SUM(), all other things in your SELECT clause should also be in your GROUP BY clause. But you do not want to group only on Name, because it is not unique in your case. That is why you have both Personal_ID and Name in your GROUP BY clause. The query will not execute when Name is missing from the GROUP BY. And you're adding Personal_ID to make sure not all Saras are put in one big Sara group, but are grouped according to Personal_ID.
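For anyone who wants to check this against the sample data from the question, here is a small runnable sketch. It uses SQLite through Python's sqlite3 module (an assumption made only to keep the example self-contained), with a ? placeholder standing in for the id parameter; totals_for is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblTEST (
        ID INTEGER, SourceID INTEGER, Personal_ID INTEGER,
        Name TEXT, NumberOfPurchases INTEGER
    )
    """
)

# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO tblTEST VALUES (?, ?, ?, ?, ?)",
    [
        (1, 4, 1001, "Alex", 10),
        (2, 2, 1002, "Sara", 5),
        (3, 4, 1001, "Alex", 12),
        (4, 1, 1003, "Mina", 200),
        (5, 2, 1002, "Sara", 20),
        (6, 2, 1001, "Alex", 64),
    ],
)


def totals_for(source_id):
    """Total purchases per person for one SourceID (hypothetical helper)."""
    return conn.execute(
        """
        SELECT Name, SUM(NumberOfPurchases) AS Total
        FROM tblTEST
        WHERE SourceID = ?            -- filter rows before grouping
        GROUP BY Personal_ID, Name    -- one group per person
        ORDER BY Name
        """,
        (source_id,),
    ).fetchall()


print(totals_for(4))
print(totals_for(2))
```

The totals agree with the expected results in the question: Alex with 22 for SourceID 4, and Alex with 64 and Sara with 25 for SourceID 2.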

Something like this?
SELECT Name, SUM(NumberOfPurchases) AS Total
FROM tblTEST
WHERE (SourceID = @id)
GROUP BY Name
Note: The WHERE filters before grouping, and HAVING filters after grouping.

You can try with this:
SELECT names.Name
, names.Personal_ID
, grouped.Total
FROM (
SELECT SourceID
, Personal_ID
, SUM(NumberOfPurchases) AS Total
FROM tblTEST
GROUP BY
SourceID, Personal_ID
) grouped
JOIN (
SELECT DISTINCT
Personal_ID
, Name
FROM tblTEST
) names ON names.Personal_ID = grouped.Personal_ID
WHERE SourceID = @id
Here is an SQL Fiddle
To the test data I added another record for a person named 'Alex', so there are now two people with the same name, which shows that grouping cannot be done on the Name column since it is not unique. I also added Personal_ID to the select list so that people with the same name can be distinguished.
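The effect of a name collision can be seen in a short sketch. The extra row below (a second 'Alex' with Personal_ID 1004) is hypothetical, since the record actually added to the fiddle is not shown here; SQLite via Python's sqlite3 module is likewise just a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblTEST (
        ID INTEGER, SourceID INTEGER, Personal_ID INTEGER,
        Name TEXT, NumberOfPurchases INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO tblTEST VALUES (?, ?, ?, ?, ?)",
    [
        (1, 4, 1001, "Alex", 10),
        (2, 2, 1002, "Sara", 5),
        (3, 4, 1001, "Alex", 12),
        (4, 1, 1003, "Mina", 200),
        (5, 2, 1002, "Sara", 20),
        (6, 2, 1001, "Alex", 64),
        (7, 2, 1004, "Alex", 3),  # hypothetical second person named Alex
    ],
)

# Grouping on Name alone merges both Alexes into one total.
by_name = conn.execute(
    """
    SELECT Name, SUM(NumberOfPurchases) AS Total
    FROM tblTEST
    WHERE SourceID = 2
    GROUP BY Name
    ORDER BY Name
    """
).fetchall()

# Grouping on Personal_ID as well keeps the two people apart.
by_person = conn.execute(
    """
    SELECT Personal_ID, Name, SUM(NumberOfPurchases) AS Total
    FROM tblTEST
    WHERE SourceID = 2
    GROUP BY Personal_ID, Name
    ORDER BY Personal_ID
    """
).fetchall()

print(by_name)
print(by_person)
```

Grouping by Name reports a single Alex with 67 purchases, while grouping by Personal_ID and Name reports 64 and 3 separately.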

Efficient checking of possible duplicate entities

I have a requirement to produce a list of possible duplicates before a user saves an entity to the database and warn them of the possible duplicates.
There are 7 criteria on which we should check for duplicates, and if at least 3 match we should flag this up to the user.
The criteria will all match on ID, so there is no fuzzy string matching needed, but my problem comes from the fact that there are many possible ways (99 ways if I've done my sums correctly) for at least 3 items to match from the list of 7 possibles.
I don't want to have to do 99 separate db queries to find my search results and nor do I want to bring the whole lot back from the db and filter on the client side. We're probably only talking of a few tens of thousands of records at present, but this will grow into the millions as the system matures.
Anyone got any thoughts on a nice efficient way to do this?
I was considering a simple OR query to get the records where at least one field matches from the db and then doing some processing on the client to filter it some more, but a few of the fields have very low cardinality and won't actually reduce the numbers by a huge amount.
Thanks
Jon
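For what it's worth, the figure of 99 in the question checks out: it is the number of ways to choose 3, 4, 5, 6 or 7 matching criteria out of 7. A quick check in Python (the variable name is just for illustration):

```python
from math import comb

# Number of subsets of the 7 criteria that contain at least 3 of them:
# C(7,3) + C(7,4) + C(7,5) + C(7,6) + C(7,7)
ways = sum(comb(7, k) for k in range(3, 8))

print(ways)
```

That is 35 + 35 + 21 + 7 + 1 = 99 combinations, which is why enumerating them as separate queries is not attractive.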

OR and CASE summing will work but are quite inefficient, since they don't use indexes.
You need to use a UNION for the indexes to be usable.
If a user enters name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:
SELECT i.*
FROM (
SELECT id, COUNT(*)
FROM (
SELECT id
FROM t_info t
WHERE name = 'Eve Chianese'
UNION ALL
SELECT id
FROM t_info t
WHERE phone = '+15558000042'
UNION ALL
SELECT id
FROM t_info t
WHERE email = '42@example.com'
UNION ALL
SELECT id
FROM t_info t
WHERE address = '42 North Lane'
) q
GROUP BY
id
HAVING COUNT(*) >= 3
) dq
JOIN t_info i
ON i.id = dq.id
This will use indexes on these fields and the query will be fast.
See this article in my blog for details:
Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions
Also see this question the article is based upon.
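Here is a small runnable sketch of the UNION ALL approach above, for anyone who wants to try it. The table contents are hypothetical sample rows, and it runs on SQLite through Python's sqlite3 module with named parameters, which are assumptions of this sketch; it demonstrates the matching logic only and does not measure index usage or speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_info (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, "
    "email TEXT, address TEXT)"
)
# One index per searchable column, so each UNION ALL branch can seek on its own index.
for col in ("name", "phone", "email", "address"):
    conn.execute(f"CREATE INDEX ix_t_info_{col} ON t_info ({col})")

# Hypothetical existing records.
conn.executemany(
    "INSERT INTO t_info VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Eve Chianese", "+15558000042", "42@example.com", "42 North Lane"),
        (2, "Eve Chianese", "+15558000042", "other@example.com", "42 North Lane"),
        (3, "Eve Chianese", "+15558000099", "x@example.com", "1 South Road"),
        (4, "Bob Example", "+15558000001", "bob@example.com", "7 West Street"),
    ],
)

# The record the user is about to save.
candidate = {
    "name": "Eve Chianese",
    "phone": "+15558000042",
    "email": "42@example.com",
    "address": "42 North Lane",
}

sql = """
SELECT i.id
FROM (
    SELECT id
    FROM (
        SELECT id FROM t_info WHERE name = :name
        UNION ALL
        SELECT id FROM t_info WHERE phone = :phone
        UNION ALL
        SELECT id FROM t_info WHERE email = :email
        UNION ALL
        SELECT id FROM t_info WHERE address = :address
    ) q
    GROUP BY id
    HAVING COUNT(*) >= 3   -- an id appears once per matching field
) dq
JOIN t_info i ON i.id = dq.id
ORDER BY i.id
"""

likely_duplicates = [row[0] for row in conn.execute(sql, candidate)]
print(likely_duplicates)
```

Rows 1 and 2 match on four and three fields respectively and are reported; row 3 matches only on name and is not.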
If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:
SELECT i1.*
FROM t_info i1
WHERE EXISTS
(
SELECT 1
FROM (
SELECT id
FROM t_info t
WHERE name = i1.name
UNION ALL
SELECT id
FROM t_info t
WHERE phone = i1.phone
UNION ALL
SELECT id
FROM t_info t
WHERE email = i1.email
UNION ALL
SELECT id
FROM t_info t
WHERE address = i1.address
) q
GROUP BY
id
HAVING COUNT(*) >= 3
)
Note that this DISTINCT is not transitive: if A matches B and B matches C, this does not mean that A matches C.

You might want something like the following:
SELECT id
FROM
(select id, CASE fld1 WHEN input1 THEN 1 ELSE 0 END "rule1",
CASE fld2 when input2 THEN 1 ELSE 0 END "rule2",
...,
CASE fld7 when input7 THEN 1 ELSE 0 END "rule7"
FROM table) q
WHERE rule1+rule2+rule3+...+rule7 >= 3
This isn't tested, but it shows a way to tackle this.
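A runnable version of the same idea, reduced to four fields for brevity, might look like the sketch below. The table, column names and sample rows are hypothetical, and SQLite via Python's sqlite3 module is used only to make it self-contained. Each CASE contributes 1 when its field matches, and the sum is compared against the threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_info (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, "
    "email TEXT, address TEXT)"
)

# Hypothetical existing records.
conn.executemany(
    "INSERT INTO t_info VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Eve Chianese", "+15558000042", "42@example.com", "42 North Lane"),
        (2, "Eve Chianese", "+15558000042", "other@example.com", "42 North Lane"),
        (3, "Eve Chianese", "+15558000099", "x@example.com", "1 South Road"),
        (4, "Bob Example", "+15558000001", "bob@example.com", "7 West Street"),
    ],
)

candidate = {
    "name": "Eve Chianese",
    "phone": "+15558000042",
    "email": "42@example.com",
    "address": "42 North Lane",
}

sql = """
SELECT id
FROM (
    SELECT id,
           CASE WHEN name    = :name    THEN 1 ELSE 0 END
         + CASE WHEN phone   = :phone   THEN 1 ELSE 0 END
         + CASE WHEN email   = :email   THEN 1 ELSE 0 END
         + CASE WHEN address = :address THEN 1 ELSE 0 END AS matches
    FROM t_info
) scored
WHERE matches >= 3
ORDER BY id
"""

case_sum_duplicates = [row[0] for row in conn.execute(sql, candidate)]
print(case_sum_duplicates)
```

This evaluates every row of the table, so it is simple to write but is the kind of query that tends to scan rather than seek.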

What DBS are you using? Some support using such constraints by using server side code.

Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.
However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match and make sure these are indexed.

I'm assuming your system is trying to match tag ids of a certain post, or something similar. This is a many-to-many relationship and you should have three tables to handle it: one for the posts, one for the tags, and one for the relationship between posts and tags.
If my assumptions are correct then the best way to handle this is:
SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) >= 3;
