Efficient checking of possible duplicate entities - database

I have a requirement to produce a list of possible duplicates before a user saves an entity to the database and warn them of the possible duplicates.
There are 7 criteria on which we should check for duplicates, and if at least 3 match we should flag this to the user.
The criteria all match on ID, so no fuzzy string matching is needed, but my problem is that there are many possible ways (99, if I've done my sums correctly) for at least 3 items to match from the list of 7 possibles.
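The figure of 99 can be checked by summing the number of ways to choose which k of the 7 criteria match, for k from 3 to 7. A minimal sketch of that arithmetic (the variable names are illustrative only):

```python
from math import comb

# Ways to pick exactly k matching criteria out of 7, for k = 3..7.
ways_per_k = {k: comb(7, k) for k in range(3, 8)}

# "At least 3 of 7" is the sum over all those k.
total_ways = sum(ways_per_k.values())

print(ways_per_k)
print(total_ways)
```

The sum is 35 + 35 + 21 + 7 + 1 = 99, which agrees with the count in the question.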
I don't want to have to do 99 separate db queries to find my search results and nor do I want to bring the whole lot back from the db and filter on the client side. We're probably only talking of a few tens of thousands of records at present, but this will grow into the millions as the system matures.
Anyone got any thoughts on a nice efficient way to do this?
I was considering a simple OR query to get the records where at least one field matches from the db and then doing some processing on the client to filter it some more, but a few of the fields have very low cardinality and won't actually reduce the numbers by a huge amount.
Thanks
Jon

OR conditions and CASE summing will work but are quite inefficient, since they cannot use the indexes on the individual fields.
You need a UNION for the indexes to be usable.
If a user enters name, phone, email and address into the database, and you want to check all records that match at least 3 of these fields, you issue:
SELECT i.*
FROM (
SELECT id, COUNT(*)
FROM (
SELECT id
FROM t_info t
WHERE name = 'Eve Chianese'
UNION ALL
SELECT id
FROM t_info t
WHERE phone = '+15558000042'
UNION ALL
SELECT id
FROM t_info t
WHERE email = '42@example.com'
UNION ALL
SELECT id
FROM t_info t
WHERE address = '42 North Lane'
) q
GROUP BY
id
HAVING COUNT(*) >= 3
) dq
JOIN t_info i
ON i.id = dq.id
This will use indexes on these fields and the query will be fast.
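As a rough way to try the UNION ALL approach locally, here is a sketch that runs the same query shape against an in-memory SQLite database from Python. The sample rows, the index names, and the use of SQLite are assumptions for illustration; the answer itself does not name a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_info (
    id INTEGER PRIMARY KEY, name TEXT, phone TEXT, email TEXT, address TEXT
);
-- One index per matched column, so each UNION ALL branch can use its own index.
CREATE INDEX ix_name ON t_info (name);
CREATE INDEX ix_phone ON t_info (phone);
CREATE INDEX ix_email ON t_info (email);
CREATE INDEX ix_address ON t_info (address);
INSERT INTO t_info VALUES
    (1, 'Eve Chianese', '+15558000042', '42@example.com', '42 North Lane'),
    (2, 'Eve Chianese', '+15558000042', 'eve@example.org', '42 North Lane'),
    (3, 'Eve Chianese', '+15550000000', 'other@example.org', '1 South Road'),
    (4, 'Bob Example',  '+15558000042', '42@example.com', '9 West Street');
""")

# Each branch emits the id once per matching field; COUNT(*) is the match count.
QUERY = """
SELECT i.id
FROM (
    SELECT id
    FROM (
        SELECT id FROM t_info WHERE name = ?
        UNION ALL
        SELECT id FROM t_info WHERE phone = ?
        UNION ALL
        SELECT id FROM t_info WHERE email = ?
        UNION ALL
        SELECT id FROM t_info WHERE address = ?
    ) q
    GROUP BY id
    HAVING COUNT(*) >= 3
) dq
JOIN t_info i ON i.id = dq.id
ORDER BY i.id
"""

candidate = ("Eve Chianese", "+15558000042", "42@example.com", "42 North Lane")
duplicate_ids = [row[0] for row in conn.execute(QUERY, candidate)]
print(duplicate_ids)  # [1, 2]
```

Rows 1 and 2 match on four and three fields respectively; rows 3 and 4 match on only one and two, so they are not reported.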
See this article in my blog for details:
Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions
Also see this question the article is based upon.
If you want to have a list of DISTINCT values in the existing data, you just wrap this query into a subquery:
SELECT i1.*
FROM t_info i1
WHERE EXISTS
(
SELECT 1
FROM (
SELECT id
FROM t_info t
WHERE name = i1.name
UNION ALL
SELECT id
FROM t_info t
WHERE phone = i1.phone
UNION ALL
SELECT id
FROM t_info t
WHERE email = i1.email
UNION ALL
SELECT id
FROM t_info t
WHERE address = i1.address
) q
GROUP BY
id
HAVING COUNT(*) >= 3
)
Note that this DISTINCT is not transitive: if A matches B and B matches C, this does not mean that A matches C.

You might want something like the following:
SELECT id
FROM
(SELECT id,
CASE fld1 WHEN input1 THEN 1 ELSE 0 END AS rule1,
CASE fld2 WHEN input2 THEN 1 ELSE 0 END AS rule2,
...,
CASE fld7 WHEN input7 THEN 1 ELSE 0 END AS rule7
FROM table) t
WHERE rule1+rule2+rule3+...+rule7 >= 3
This isn't tested, but it shows a way to tackle this.
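A small runnable sketch of the CASE-summing idea, using an in-memory SQLite database from Python. The table name entity, the integer column types, and the sample rows are assumptions for illustration; only the fld1 to fld7 naming comes from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = [f"fld{i}" for i in range(1, 8)]  # fld1 .. fld7
column_defs = ", ".join(f"{c} INTEGER" for c in columns)
conn.execute(f"CREATE TABLE entity (id INTEGER PRIMARY KEY, {column_defs})")
conn.executemany(
    "INSERT INTO entity VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 2, 3, 0, 0, 0, 0),  # matches 3 fields
        (2, 1, 2, 0, 0, 0, 0, 0),  # matches 2 fields
        (3, 1, 2, 3, 4, 5, 6, 7),  # matches all 7 fields
        (4, 9, 9, 9, 9, 9, 9, 7),  # matches 1 field
    ],
)

# One CASE per field, each contributing 1 when the field equals the input value.
score = " + ".join(f"CASE WHEN {c} = ? THEN 1 ELSE 0 END" for c in columns)
query = f"SELECT id FROM entity WHERE {score} >= 3 ORDER BY id"

candidate = (1, 2, 3, 4, 5, 6, 7)
matches = [row[0] for row in conn.execute(query, candidate)]
print(matches)  # [1, 3]
```

Note that this form evaluates the expression for every row, which is the table-scan cost other answers in this thread warn about.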

Which DBMS are you using? Some support enforcing this kind of constraint with server-side code.

Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.
However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match and make sure these are indexed.

I'm assuming your system is trying to match tag ids of a certain post, or something similar. This is a many-to-many relationship and you should have three tables to handle it: one for posts, one for tags, and one for the relationship between posts and tags.
If my assumptions are correct then the best way to handle this is:
SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) >= 3;
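A sketch of the relationship-table query on sample data, run through SQLite from Python. The rows and tag ids are made up, and the threshold is written as "at least 3" to follow the original question's requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts_to_tags ("
    "postid INTEGER, tagid INTEGER, PRIMARY KEY (postid, tagid))"
)
conn.executemany(
    "INSERT INTO posts_to_tags VALUES (?, ?)",
    [
        (10, 1), (10, 2), (10, 3), (10, 4),  # shares tags 1, 2, 3 with the input
        (11, 1), (11, 2),                    # shares only 2 tags
        (12, 1), (12, 2), (12, 3),           # shares 3 tags
    ],
)

tags = [1, 2, 3, 5]  # tag ids of the entity being saved (hypothetical)
placeholders = ", ".join("?" for _ in tags)
query = f"""
SELECT postid, COUNT(tagid) AS common_tag_count
FROM posts_to_tags
WHERE tagid IN ({placeholders})
GROUP BY postid
HAVING COUNT(tagid) >= 3
ORDER BY postid
"""
rows = conn.execute(query, tags).fetchall()
print(rows)  # [(10, 3), (12, 3)]
```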

Related

SQL - Inclusive Matches in column or across multiple rows

I'm sure this has been asked before, but I don't know how to search for it.
First off, I do not want to implement full-text searching. The database contains multiple languages including Chinese and Japanese which pose a huge problem for Full-text indexes.
I have a table like the following:
Table Comment:
UserID int
CommentText nvarchar(400)
I want to do a search against this table and find anything matching multiple words. Normally I would just do something like
select *
from Comment
where CommentText like '%potato%' and CommentText like '%badger%'
But if the two words are in different rows I need to do something like
select
UserID, count(UserID )
from
Comment
where
CommentText like '%potato%' or CommentText like '%badger%'
group by
UserID
having
count(UserID ) > 1
But then if the words are sometimes in the same row and sometimes spread across multiple rows, how do I determine if both words matched?
Cases:
Both words are in a single row.
One word is in row 1 and the other word is in row 2 for the same UserID
One word is in multiple rows for the same UserID (so it returns multiple matches even if it's the same word several times)
My question is: for multiple words, how do I conduct a wildcard search and make sure all the words match at least once for a given UserID?
Thanks in advance
I'm thinking of a CTE to grab all rows that contain matches and concat them for a given userid, but I don't know if I can find something more efficient.
A simple way to approach this would use conditional aggregation:
SELECT UserID
FROM Comment
GROUP BY UserID
HAVING
SUM(CASE WHEN CommentText LIKE '%java%' THEN 1 ELSE 0 END) > 0 AND
SUM(CASE WHEN CommentText LIKE '%python%' THEN 1 ELSE 0 END) > 0;
Each of the sums in the HAVING clause keeps track of one individual word you want to match. A user turns up in the result set only when at least one record matches each of the words.
Note that if you plan to continue down this road, you should look into SQL Server's full text capabilities.
https://learn.microsoft.com/en-us/sql/relational-databases/search/full-text-search
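To see how the conditional aggregation behaves for the three cases listed in the question, here is a sketch against an in-memory SQLite database. The sample comments are invented, and SQLite stands in for SQL Server here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Comment (UserID INTEGER, CommentText TEXT)")
conn.executemany(
    "INSERT INTO Comment VALUES (?, ?)",
    [
        (1, "I use java and python"),  # both words in a single row
        (2, "java is fine"),           # words spread over two rows
        (2, "python too"),
        (3, "java"),                   # same word twice, other word missing
        (3, "more java"),
        (4, "neither word"),
    ],
)

query = """
SELECT UserID
FROM Comment
GROUP BY UserID
HAVING SUM(CASE WHEN CommentText LIKE '%java%' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN CommentText LIKE '%python%' THEN 1 ELSE 0 END) > 0
ORDER BY UserID
"""
user_ids = [row[0] for row in conn.execute(query)]
print(user_ids)  # [1, 2]
```

User 3 is excluded even though two rows match, because both rows match the same word.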
Just posting the question caused me to think of the answer.
select distinct UserID from (
select UserID FROM Comment where CommentText like '%java%'
INTERSECT
select UserID FROM Comment where CommentText like '%python%'
) as a
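A quick sketch comparing the two set operators on sample data: UNION keeps users who match either word, while INTERSECT keeps only users who match both, which is what the question asks for. The rows and the helper name users_matching are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Comment (UserID INTEGER, CommentText TEXT)")
conn.executemany(
    "INSERT INTO Comment VALUES (?, ?)",
    [
        (1, "I use java and python"),
        (2, "java is fine"),
        (2, "python too"),
        (3, "java"),
        (3, "more java"),
    ],
)

def users_matching(set_operator):
    """Combine the two per-word queries with UNION or INTERSECT."""
    sql = f"""
    SELECT UserID FROM Comment WHERE CommentText LIKE '%java%'
    {set_operator}
    SELECT UserID FROM Comment WHERE CommentText LIKE '%python%'
    ORDER BY UserID
    """
    return [row[0] for row in conn.execute(sql)]

either_word = users_matching("UNION")
both_words = users_matching("INTERSECT")
print(either_word)  # [1, 2, 3]
print(both_words)   # [1, 2]
```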

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to include the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, although I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows OVER (PARTITION BY UnitID) and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
ORDER BY UnitID DESC
;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case you need Min(JobUnitKeyID) together with UnitID, joining back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Using either Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.
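The row-numbering approach can be tried on sample data with SQLite from Python. This is a sketch under assumptions: the rows are invented, the AS alias form replaces the T-SQL "DupID =" syntax, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JobUnits ("
    "JobUnitKeyID INTEGER PRIMARY KEY, UnitID INTEGER, DispatchDate TEXT)"
)
conn.executemany(
    "INSERT INTO JobUnits VALUES (?, ?, ?)",
    [
        (1, 105, "20151004"),
        (2, 105, "20151004"),  # second record for 105
        (3, 106, "20151004"),  # no duplicate
        (4, 107, "20151004"),
        (5, 107, "20151004"),  # second record for 107
        (6, 107, "20151004"),  # third record for 107
    ],
)

# Number the rows within each UnitID; anything numbered 2 or higher is a duplicate.
query = """
WITH DupeCalc AS (
    SELECT JobUnitKeyID, UnitID,
           ROW_NUMBER() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS DupID
    FROM JobUnits
    WHERE DispatchDate = ?
)
SELECT JobUnitKeyID, UnitID, DupID
FROM DupeCalc
WHERE DupID >= 2
ORDER BY JobUnitKeyID
"""
rows = conn.execute(query, ("20151004",)).fetchall()
print(rows)  # [(2, 105, 2), (5, 107, 2), (6, 107, 3)]
```

Changing the filter to DupID = 2 would return only the second record per unit, which is closer to the literal wording of the question.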

Show records where most recent 'x' records meet criteria

Here's a simplified SQLFiddle example of data
Basically, I'm looking to identify users in a login audit table whose most recent 'x' records (let's say 3, for this example) are all failed logins.
I am able to get this data for individual users by doing a SELECT TOP 3 and ordering by the log date in descending order and evaluating those records, but I know there's got to be a better way to do this.
I have tried a few queries using ROW_NUMBER(), partitioning by UserName and Success and ordering by LogDate, but I can't quite get it to do what I want. Essentially, every time a successful login occurs, I want the failed login counter to be reset.
try this code:
select * from (
select distinct a.UserName,
(select sum(cast(Success as int)) from (
SELECT TOP 3 Success --- here 3, change it to your number
FROM tbl as b
WHERE b.UserName=a.UserName
ORDER BY LogDate DESC
) as q
having count(*) >= 3 --- this line is needed to exclude users who made fewer than 3 attempts
) as cnts
from tbl as a
) as q2
where q2.cnts=0
It shows users whose last 3 attempts all failed. With some modifications, you can use this approach to count how many successful or failed attempts were made during the last N rows.
NOTE: this query works, but it is not optimal. from tbl as a should be changed to a table where only users are stored, so you can get rid of distinct. Also, store the user's ID instead of the username in tbl.
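An alternative sketch of the same check, swapping the correlated TOP 3 subquery for ROW_NUMBER(): number each user's rows from newest to oldest, keep the first three, and require three rows with no success among them. The sample data is invented, and SQLite (3.25 or later) is used here in place of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (UserName TEXT, Success INTEGER, LogDate TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        ("alice", 1, "2024-01-01"), ("alice", 0, "2024-01-02"),
        ("alice", 0, "2024-01-03"), ("alice", 0, "2024-01-04"),  # last 3 failed
        ("bob", 0, "2024-01-01"), ("bob", 0, "2024-01-02"),
        ("bob", 0, "2024-01-03"), ("bob", 1, "2024-01-04"),      # reset by success
        ("carol", 0, "2024-01-01"), ("carol", 0, "2024-01-02"),  # only 2 attempts
    ],
)

N = 3  # number of most recent attempts to inspect
query = """
WITH ranked AS (
    SELECT UserName, Success,
           ROW_NUMBER() OVER (PARTITION BY UserName ORDER BY LogDate DESC) AS rn
    FROM tbl
)
SELECT UserName
FROM ranked
WHERE rn <= :n
GROUP BY UserName
HAVING COUNT(*) = :n AND SUM(Success) = 0
ORDER BY UserName
"""
locked_out = [row[0] for row in conn.execute(query, {"n": N})]
print(locked_out)  # ['alice']
```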

SQL Server do I need two queries and a function efficiency question

I want to get a list of people affiliated with a blog. The table [BlogAffiliates] has:
BlogID
UserID
Privelage
and if the persons associated with that blog have a lower or equal privelage they cannot edit [bit field canedit].
Is this query the most efficient way of doing this or are there better ways to derive this information??
I wonder if it can be done in a single query??
Can it be done without that convert in some more clever way?
declare @privelage tinyint
select @privelage = (select Privelage from BlogAffiliates
where UserID=@UserID and BlogID = @BlogID)
select aspnet_Users.UserName as username,
BlogAffiliates.Privelage as privelage,
Convert(Bit, Case When @privelage> blogaffiliates.privelage
Then 1 Else 0 End) As canedit
from BlogAffiliates, aspnet_Users
where BlogAffiliates.BlogID = @BlogID and BlogAffiliates.Privelage >=2
and aspnet_Users.UserId = BlogAffiliates.UserID
Some of this would depend on the indexes and the size of the tables involved. If, for example, the most costly portion of the query when you profiled it was a seek on the "BlogAffiliates.BlogID" column, then you could do one select into a table variable and then do both calculations from there.
However, I think the query you have stated is probably going to be close to the most efficient. The only possible work duplication is that you are seeking twice on the "BlogAffiliates.BlogID" field because of the two queries.
You can try below query.
Select aspnet_Users.UserName as username, Blog.Privelage as privelage,
Convert(Bit, Case When @privelage> Blog.privelage
Then 1 Else 0 End) As canedit
From
(
Select UserID, Privelage
From BlogAffiliates
Where BlogID = @BlogID and Privelage >= 2
)Blog
Inner Join aspnet_Users on aspnet_Users.UserId = Blog.UserID
As far as I understand, you should not use a table variable if you are joining it with another table, since this can reduce performance. If there are only a few records, though, it is fine. You can also use local temporary tables for this purpose.
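On the question of whether this can be done in a single query: one possible sketch is to join BlogAffiliates to itself, so the caller's own row supplies the privilege that the variable held. This is not taken from either answer. It assumes (BlogID, UserID) is unique, uses SQLite from Python with invented rows, and returns no rows at all if the caller is not affiliated with the blog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aspnet_Users (UserId INTEGER PRIMARY KEY, UserName TEXT);
CREATE TABLE BlogAffiliates (
    BlogID INTEGER, UserID INTEGER, Privelage INTEGER,
    PRIMARY KEY (BlogID, UserID)
);
INSERT INTO aspnet_Users VALUES (1, 'ann'), (2, 'bob'), (3, 'cat'), (4, 'dan');
INSERT INTO BlogAffiliates VALUES
    (1, 1, 5),  -- the caller
    (1, 2, 3),
    (1, 3, 5),
    (1, 4, 1),  -- below the Privelage >= 2 filter
    (2, 2, 9);  -- a different blog
""")

# "me" is the caller's own affiliation row; "a" is each affiliate being listed.
query = """
SELECT u.UserName AS username,
       a.Privelage AS privelage,
       CASE WHEN me.Privelage > a.Privelage THEN 1 ELSE 0 END AS canedit
FROM BlogAffiliates AS a
JOIN BlogAffiliates AS me
  ON me.BlogID = a.BlogID AND me.UserID = :user_id
JOIN aspnet_Users AS u
  ON u.UserId = a.UserID
WHERE a.BlogID = :blog_id AND a.Privelage >= 2
ORDER BY u.UserName
"""
affiliates = conn.execute(query, {"user_id": 1, "blog_id": 1}).fetchall()
print(affiliates)  # [('ann', 5, 0), ('bob', 3, 1), ('cat', 5, 0)]
```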

The fastest way to check if some records exist in a database table?

I have a huge table to work with. I want to check whether there are any records whose parent_id equals a value I pass in.
Currently I implement this with "select count(*) from mytable where parent_id = :id"; if the result is > 0, such records exist.
Because this is a very large table, and I don't care about the exact number of matching records, only whether any exist, I think count(*) is a bit inefficient.
How do I implement this requirement in the fastest way? I am using Oracle 10.
According to hibernate Tips & Tricks https://www.hibernate.org/118.html#A2
It suggests to write like this :
Integer count = (Integer) session.createQuery("select count(*) from ....").uniqueResult();
I don't know what the magic of uniqueResult() is here. Why does it make this fast?
Compared to "select 1 from mytable where parent_id = passingId and rownum < 2", which is more efficient?
An EXISTS query is the one to go for if you're not interested in the number of records:
select 'Y' from dual where exists (select 1 from mytable where parent_id = :id)
This will return 'Y' if a record exists and nothing otherwise.
[In terms of your question on Hibernate's "uniqueResult" - all this does is return a single object when there is only one object to return - instead of a set containing 1 object. If multiple results are returned the method throws an exception.]
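A minimal sketch of an existence check from application code, using SQLite from Python rather than Oracle. SQLite has no dual table, so the EXISTS expression is selected directly; the helper name has_children and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, parent_id INTEGER)")
# An index on parent_id lets the database stop at the first matching entry.
conn.execute("CREATE INDEX ix_mytable_parent ON mytable (parent_id)")
conn.executemany(
    "INSERT INTO mytable (parent_id) VALUES (?)", [(1,), (1,), (2,)]
)

def has_children(connection, parent_id):
    """Return True if at least one row has the given parent_id."""
    row = connection.execute(
        "SELECT EXISTS (SELECT 1 FROM mytable WHERE parent_id = ?)",
        (parent_id,),
    ).fetchone()
    return bool(row[0])

print(has_children(conn, 1))   # True
print(has_children(conn, 99))  # False
```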
There's no real difference between:
select 'y'
from dual
where exists (select 1
from child_table
where parent_key = :somevalue)
and
select 'y'
from mytable
where parent_key = :somevalue
and rownum = 1;
... at least in Oracle10gR2 and up. Oracle's smart enough in that release to do a FAST DUAL operation where it zeroes out any real activity against it. The second query would be easier to port if that's ever a consideration.
The real performance differentiator is whether or not the parent_key column is indexed. If it's not, then you should run something like:
select 'y'
from dual
where exists (select 1
from parent_table
where parent_key = :somevalue)
select count(*) should be lightning fast if you have an index, and if you don't, allowing the database to abort after the first match won't help much.
But since you asked:
boolean exists = session.createQuery("select parent_id from Entity where parent_id=?")
.setParameter(...)
.setMaxResults(1)
.uniqueResult()
!= null;
(Some syntax errors to be expected, since I don't have a hibernate to test against on this computer)
For Oracle, maxResults is translated into rownum by hibernate.
As for what uniqueResult() does, read its JavaDoc! Using uniqueResult instead of list() has no performance impact; if I recall correctly, the implementation of uniqueResult delegates to list().
First of all, you need an index on mytable.parent_id.
That should make your query fast enough, even for big tables (unless there are also a lot of rows with the same parent_id).
If not, you could write
select 1 from mytable where parent_id = :id and rownum < 2
which would return a single row containing 1, or no row at all. It does not need to count the rows, just find one and then quit. But this is Oracle-specific SQL (because of rownum), so you may prefer to avoid it.
For DB2 there is something like select * from mytable where parent_id = ? fetch first 1 row only. I assume that something similar exists for oracle.
This query will return 1 if any record exists and 0 otherwise:
SELECT COUNT(1) FROM (SELECT 1 FROM mytable WHERE ROWNUM < 2);
It could help when you need to check whether a table contains any data, regardless of table size.

Resources