Fuzzy name matching algorithm - SQL Server

I have a database containing the names of blacklisted companies and individuals.
All transactions created need to have their details scanned against these blacklisted names. The transactions may contain misspelled names; for example, "Wilson" might be written as "Vilson" or "Veelson". The fuzzy search logic or utility should match against the name "Wilson" in the blacklist database and, based on the correctness/accuracy percentage set by the user, show the matching names within that percentage.
The transactions will be sent in batches or in real time to be checked against the blacklisted names.
I would appreciate it if users who have had a similar requirement and have implemented it could share their views and implementations.

T-SQL leaves a lot to be desired in the realm of fuzzy search. Your best options are third-party libraries, but if you don't want to mess with that, your best bet is the DIFFERENCE function built into SQL Server. For example:
SELECT * FROM tblUsers U WHERE DIFFERENCE(U.Name, @nameEntered) >= 3
A higher return value for DIFFERENCE indicates higher accuracy. A drawback of this is that the algorithm favors words that sound alike, which may not be your desired characteristic.
This next example shows how to get the best match out of a table:
DECLARE @users TABLE (Name VARCHAR(255))
INSERT INTO @users VALUES ('Dylan'), ('Bob'), ('Tester'), ('Dude')
SELECT Name, DIFFERENCE(Name, 'Dillon') AS Score FROM @users ORDER BY Score DESC
It returns:
Name   | Score
Dylan  | 4
Dude   | 3
Bob    | 2
Tester | 0

Related

Strategy to design specification table for measurements data for analytical database project

I need advice on the following topic:
I am developing a DW/BI solution in SQL Server, and reports are published in Power BI.
The main part of my question starts here: I have a large table that collects measurement data on products for multiple attributes. Products can be of multiple types, recognisable by the item number in this table; measurements can be taken multiple times and are identified by measurement date. Usually we refer to the latest dates. If it makes things complicated, I can filter the data for the latest dates only. This is a dense table (multi-million rows), with about 200 attributes.
I want to keep specifications for these attributes, most likely in a dimension table, and there may be tens of such specifications. The intention is that the user shall select any one specification name in the report and see each product with its attributes passing/failing, as well as the products that pass all of the specification's attributes.
I currently have this measurement table and a dim table with test names; I can add a table for specifications if needed. A specification can define a few or all test names with lower/upper spec limits:
Sample measurement table:
Sample dim table for test names:
I can add a table for specifications as below, and the user will select one of them:
e.g. if the user selects ID_spec = 1, the measurement table may look like:
Some specs may contain all attributes and some only a few.
Please suggest a strategy for designing a spec table that stays efficient at this table size. Please let me know if any further details are needed.
Later, I will need to do further work to calculate the percentage of passing products, provided they have been tested for all tests required by the selected specification.
For large tables, the best thing to do is choose the right key. That means dumping the "Id" column (nothing more than a row identifier) and replacing it with something that:
Guarantees uniqueness
Facilitates searches
That often means composite keys, which are fine.
It also means dumping the whole "fact/dimension" mindset and just focusing on the relations. This is also fine.
Based on your description, this is the first draft of a data model for your warehouse. If you are unfamiliar with IDEF1X diagrams, please read this.
I've added a unique constraint to SpecCd so you could specify the value directly instead of having to check both the ProductId and SpecCd to return a result.
ProductTest exists so you can provide integrity for ProductTestCriteria and ensure tests are limited to only those products that can be measured by them. If all products are subject to all tests, this can be removed and Test can relate directly to ProductMeasurement and ProductTestCriteria.
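The diagram itself is not reproduced here, but a rough DDL sketch of the model described (data types and the exact shape of the Spec table are assumptions) might look like:
CREATE TABLE Product (
    ProductId VARCHAR(20) NOT NULL PRIMARY KEY
)

CREATE TABLE Test (
    TestCd VARCHAR(20) NOT NULL PRIMARY KEY
)

CREATE TABLE Spec (
    SpecCd VARCHAR(20) NOT NULL PRIMARY KEY  -- unique, so it can be queried directly
)

-- Restricts tests to the products they can measure
CREATE TABLE ProductTest (
    ProductId VARCHAR(20) NOT NULL REFERENCES Product (ProductId),
    TestCd    VARCHAR(20) NOT NULL REFERENCES Test (TestCd),
    PRIMARY KEY (ProductId, TestCd)
)

-- Lower/upper limits per product, test, and specification
CREATE TABLE ProductTestCriteria (
    ProductId  VARCHAR(20)    NOT NULL,
    TestCd     VARCHAR(20)    NOT NULL,
    SpecCd     VARCHAR(20)    NOT NULL REFERENCES Spec (SpecCd),
    LowerValue DECIMAL(18, 6) NOT NULL,
    UpperValue DECIMAL(18, 6) NOT NULL,
    PRIMARY KEY (ProductId, TestCd, SpecCd),
    FOREIGN KEY (ProductId, TestCd) REFERENCES ProductTest (ProductId, TestCd)
)

-- One row per product, test, and measurement date; composite key, no "Id"
CREATE TABLE ProductMeasurement (
    ProductId VARCHAR(20)    NOT NULL,
    TestCd    VARCHAR(20)    NOT NULL,
    TestDt    DATETIME2      NOT NULL,
    Value     DECIMAL(18, 6) NOT NULL,
    PRIMARY KEY (ProductId, TestCd, TestDt),
    FOREIGN KEY (ProductId, TestCd) REFERENCES ProductTest (ProductId, TestCd)
)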
If you want to subject the latest test of "Product A" to "Spec S" your query would look like:
SELECT
    Measurement.ProductId
    ,Measurement.TestCd
    ,Measurement.TestDt
    ,Criteria.SpecCd
    ,Measurement.Value
    ,CASE
        WHEN Measurement.Value BETWEEN Criteria.LowerValue AND Criteria.UpperValue THEN 'Pass'
        ELSE 'Fail'
     END AS Result
FROM
    ProductMeasurement Measurement
INNER JOIN
    ProductTestCriteria Criteria
        ON Criteria.ProductId = Measurement.ProductId
        AND Criteria.TestCd = Measurement.TestCd
WHERE
    Measurement.ProductId = 'A'
    AND Criteria.SpecCd = 'S'
    AND Measurement.TestDt =
    (
        SELECT MAX(TestDt)
        FROM ProductMeasurement
        WHERE ProductId = Measurement.ProductId
    )
You could remove the filters for ProductId and SpecCd and roll this into a view; users could then specify the products and specifications they want.
If you want results as of a given date, the query is easily modified to the following, or incorporated into a TVF:
SELECT
    Measurement.ProductId
    ,Measurement.TestCd
    ,Measurement.TestDt
    ,Criteria.SpecCd
    ,Measurement.Value
    ,CASE
        WHEN Measurement.Value BETWEEN Criteria.LowerValue AND Criteria.UpperValue THEN 'Pass'
        ELSE 'Fail'
     END AS Result
FROM
    ProductMeasurement Measurement
INNER JOIN
    ProductTestCriteria Criteria
        ON Criteria.ProductId = Measurement.ProductId
        AND Criteria.TestCd = Measurement.TestCd
WHERE
    Measurement.ProductId = 'A'
    AND Criteria.SpecCd = 'S'
    AND Measurement.TestDt =
    (
        SELECT MAX(TestDt)
        FROM ProductMeasurement
        WHERE ProductId = Measurement.ProductId
            AND TestDt <= <Your Date>
    )
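A minimal inline TVF wrapping the dated query might look like this (the function name is illustrative):
CREATE FUNCTION dbo.SpecResultAsOf (@AsOfDate DATETIME2)
RETURNS TABLE
AS
RETURN
(
    SELECT
        Measurement.ProductId
        ,Measurement.TestCd
        ,Measurement.TestDt
        ,Criteria.SpecCd
        ,Measurement.Value
        ,CASE
            WHEN Measurement.Value BETWEEN Criteria.LowerValue AND Criteria.UpperValue THEN 'Pass'
            ELSE 'Fail'
         END AS Result
    FROM
        ProductMeasurement Measurement
    INNER JOIN
        ProductTestCriteria Criteria
            ON Criteria.ProductId = Measurement.ProductId
            AND Criteria.TestCd = Measurement.TestCd
    WHERE
        Measurement.TestDt =
        (
            SELECT MAX(TestDt)
            FROM ProductMeasurement
            WHERE ProductId = Measurement.ProductId
                AND TestDt <= @AsOfDate
        )
)
GO

-- Usage: latest results for Product A against Spec S as of a given date
SELECT * FROM dbo.SpecResultAsOf('20230101')
WHERE ProductId = 'A' AND SpecCd = 'S'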

Check SQL table for values in another table

If I have the following data:
Results table, column [Required]:
I want one grape
I want one orange
I want one apple
I want one carrot
I want one watermelon
Fruit table, column [Name]:
grape
orange
apple
What I want to do is essentially say: give me all results where users are looking for a fruit. This is all just an example; I am looking at a table with roughly 1 million records and a string field of 4000+ characters. I am expecting a somewhat slow result, and I know that the table could DEFINITELY be structured better, but I have no control over that. Here is the query I would essentially have, but it doesn't seem to do what I want: it gives every record. And yes, [#Fruit] is a temp table.
SELECT * FROM [Results]
JOIN [#Fruit] ON
'%'+[Results].[Required]+'%' LIKE [#Fruit].[Name]
Ideally my output should be the following 3 rows:
I want one grape
I want one orange
I want one apple
If that kind of thing is doable, I would try it the other way round:
SELECT * FROM [Results]
JOIN [#Fruit] ON
[Results].[Required] LIKE '%'+[#Fruit].[Name]+'%'
This topic interests me, so I did a little bit of searching.
Suggestion 1: Full-Text Search
I think what you are trying to do is Full-Text Search.
You will need a full-text index created on the table if it is not already there (CREATE FULLTEXT INDEX).
This should be faster than performing LIKE.
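As a sketch, the per-row LIKE scan could then be replaced with something along these lines (STRING_AGG requires SQL Server 2017+; on older versions build the search string with FOR XML PATH instead):
DECLARE @search NVARCHAR(4000)

-- Build the search condition '"grape" OR "orange" OR "apple"' from the fruit list
SELECT @search = STRING_AGG('"' + [Name] + '"', ' OR ') FROM [#Fruit]

-- Requires a full-text index on [Results]([Required])
SELECT * FROM [Results] WHERE CONTAINS([Required], @search)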
Suggestion 2: Metadata Search
Another approach I'd take is to create a metadata table and maintain the information myself when the [Result].Required values are updated (or created).
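A sketch of what that might look like (the table name is illustrative, and it assumes [Results] has an Id key):
-- One row per fruit mentioned in a result, maintained by a trigger or by
-- the application whenever [Required] changes
CREATE TABLE [ResultFruit] (
    ResultId  INT          NOT NULL,
    FruitName VARCHAR(100) NOT NULL,
    PRIMARY KEY (ResultId, FruitName)
)

-- The search then becomes a cheap join instead of a LIKE scan
SELECT r.*
FROM [Results] r
JOIN [ResultFruit] rf ON rf.ResultId = r.Id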
This looks more or less doable, but I'd start from the Fruit table just for conceptual clarity.
Here's roughly how I would structure this, ignoring all performance / speed / normalization issues (note also that I've switched around the variables in the LIKE comparison):
SELECT f.name, r.required
FROM fruits f
JOIN results r ON r.required LIKE CONCAT('%', f.name, '%')
...and perhaps add a TOP 10 (LIMIT 10 outside SQL Server) to keep the query from wasting time while you're testing it out.
This structure will:
give you one record per "match" (per Result row that matches a Fruit)
exclude Result rows that don't have a Fruit
probably be ungodly slow.
Good luck!

Is it possible to create an SQL query that displays results like this?

Background
I have a database that holds records of all assets in an office. Each asset has a condition, a category name and an age.
A ConditionID can be:
In use
Spare
In Circulation
CategoryIDs are:
Phone
PC
Laptop
and Age is just a field called AquiredDate which holds records like:
2009-04-24 15:07:51.257
Example
I've created an example of the inputs of the query to better explain what I need, if possible.
NB:
Inputs are in orange in the above example.
I've split the example into two separate queries.
Count would be the output.
Question
Is this type of query and result set possible using SQL alone? And if so, where do I start? Or would it be easier to use MS Excel?
Yes, it is possible. For your orange fields you can simply filter, e.g.:
where CategoryID = 'Phone' and ConditionID in ('In use', 'In Circulation')
For the yellow one, you can take the DATEDIFF in days from the acquired date to now, divide it by 365, and floor that value. To get the last category (6+ years) you take the minimum of 5 and the calculated value, so you get 0 for everything between 0 and 1 year old, and so on up to 5, which contains everything older.
When you group by that calculated column and select the count alongside it, you get what you desire.
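A sketch of the whole query, assuming an Asset table holding the fields described (note that integer division in T-SQL already floors non-negative values):
SELECT
    v.AgeYears,          -- 0 = 0-1 years old, 1 = 1-2 years, ..., 5 = everything older
    COUNT(*) AS [Count]
FROM Asset a
CROSS APPLY (
    -- Cap the computed age at 5 to form the final bucket
    SELECT CASE WHEN DATEDIFF(DAY, a.AquiredDate, GETDATE()) / 365 > 5 THEN 5
                ELSE DATEDIFF(DAY, a.AquiredDate, GETDATE()) / 365
           END AS AgeYears
) v
WHERE a.CategoryID = 'Phone'
  AND a.ConditionID IN ('In use', 'In Circulation')
GROUP BY v.AgeYears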

SQL Fuzzy Matching

I hope I am not repeating this question; I searched here and on Google before posting.
I am running an eStore on SQL Server 2008 R2 with Full-Text Search enabled.
My requirements:
There is a Product table, which has the product name, OEM codes, and the models the product fits. All are text.
I have created a new column called TextSearch. It holds the concatenated values of the product name, OEM codes and the models the product fits, comma separated.
When a customer enters a keyword, we run the search on the TextSearch column to match products. See the matching logic below.
I am using a hybrid of full-text search and a normal LIKE, which gives more relevant results. All the queries are executed into a temp table and distinct results are returned.
Matching logic:
Run the following SQL to get relevant products using full text. @Keywords is pre-processed: say 'CLC 2200' is changed to 'CLC* AND 2200*'.
SELECT Id FROM dbo.Product WHERE CONTAINS(TextSearch, @Keywords)
Another query is run using a normal LIKE, so 'CLC 2200' is pre-processed to "TextSearch LIKE '%clc%' AND TextSearch LIKE '%2200%'". This is because full-text search won't match patterns that precede the keyword; for example, it won't return 'pclc 2200'.
SELECT Id FROM dbo.Product WHERE TextSearch like '%clc%' AND TextSearch like '%2200%'
If steps 1 and 2 didn't return any records, the following search is executed. The value 135 was fine-tuned by me to return more relevant records.
SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(Product, TextSearch, @Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135
All of the above combined works fine at a reasonable speed and returns relevant products for the keywords.
But I am looking to improve things further for the case where no product is found at all.
Say a customer looks for 'CLC 2200npk' and this product isn't there; I need to show the next closest match, 'CLC 2200'.
So far I have tried the Soundex() function, computing the SOUNDEX value for each word in the TextSearch column and comparing it with the SOUNDEX value of the keyword. But this returns far too many records and is slow too.
For example, 'CLC 2200npk' returns products such as 'CLC 1100', which isn't a good result, as it is not close to 'CLC 2200npk'.
There is another good one here, but it uses CLR functions, and I cannot install CLR functions on the server.
So my logic would need to be:
if 'CLC 2200npk' not found, show close by 'CLC 2200'
if 'CLC 2200' not found, show next close by 'CLC 1100'
Questions
Is it possible to match like I suggested?
If I need to do spelling correction before searching, what would be a good way? All of our product listing is in English.
Are there any UDFs or SPs that match texts as I suggest?
Thanks.
A rather quick domain-specific solution may be to calculate string similarity using SOUNDEX together with a numeric distance between the two strings. This will only really help when you have a lot of product codes.
Using a simple UDF like the one below, you can extract the numeric characters from a string, so you can get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100'. You can then determine closeness based on the SOUNDEX output of each input as well as the closeness of its numeric component.
CREATE FUNCTION [dbo].[ExtractNumeric] (@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    -- Strip non-digit characters one at a time until none remain
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO
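Putting the two pieces together, a sketch might look like this (it assumes a Name column on dbo.Product; DIFFERENCE scores SOUNDEX similarity from 0 to 4):
DECLARE @Keyword VARCHAR(100) = 'CLC 2200npk'

SELECT TOP (5)
    p.Id,
    p.Name
FROM dbo.Product AS p
ORDER BY
    DIFFERENCE(p.Name, @Keyword) DESC,                              -- most similar sound first
    ABS(dbo.ExtractNumeric(p.Name) - dbo.ExtractNumeric(@Keyword))  -- then closest numeric part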
As far as general-purpose algorithms go, there are a couple which might help you, with varying degrees of success depending on data set size and performance requirements (both links have T-SQL implementations available).
Double Metaphone - this algorithm will give you a better match than SOUNDEX at the cost of speed; it is really good for spelling correction, though.
Levenshtein Distance - this calculates how many keypresses it would take to turn one string into another; for instance, getting from 'CLC 2200npk' to 'CLC 2200' takes 3 edits, while 'CLC 2200npk' to 'CLC 1100' takes 5. See the sketch below.
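For example, once one of the linked T-SQL implementations is installed (assumed here as a scalar UDF dbo.Levenshtein taking the two strings), the closest fallback product could be picked like this:
SELECT TOP (1) p.Id, p.Name
FROM dbo.Product AS p
ORDER BY dbo.Levenshtein(p.Name, 'CLC 2200npk')  -- smallest edit distance wins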
Here is an interesting article which applies both algos together which may give you a few ideas.
Well hopefully some of that helps a little.
EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post: it won't return exactly the same results as the normal one). On my test table of 125,000 rows it runs in 6 seconds, compared to 60 seconds for the first one I linked to.

DB Design: Sort Order for Lookup Tables

I have an application where the database back-end has around 15 lookup tables. For instance there is a table for Counties like this:
CountyID(PK) County
49001 Beaver
49005 Cache
49007 Carbon
49009 Daggett
49011 Davis
49015 Emery
49029 Morgan
49031 Piute
49033 Rich
49035 Salt Lake
49037 San Juan
49041 Sevier
49043 Summit
49045 Tooele
49049 Utah
49051 Wasatch
49057 Weber
The UI for this app has a number of combo boxes in various places for these lookup tables, and my client has asked that, in this case, the boxes list:
CountyID(PK) County
49035 Salt Lake
49049 Utah
49011 Davis
49057 Weber
49045 Tooele
The rest alphabetically
The best plan I have for accomplishing this is to add a numeric SortOrder column to each lookup table. A colleague told me he thought that would cause the tables to violate third normal form, but I think the sort order still depends on the key and only the key (even though the rest of the list is alphabetical).
Is adding the SortOrder column the best way to do this, or is there a better way I am just not seeing?
I agree with @cletus that a sort order column is a good way to go, and it does not violate 3NF (because, as you said, the sort order column entries are functionally dependent on the candidate keys of the table).
I'm not sure I agree that alphanumeric is better than numeric, though. In the specific case of counties, new ones are seldom created. And there is no requirement that the assigned numbers be sequential; you can allocate them in multiples of a hundred, for example, leaving ample room for insertions.
Yes, I agree a sort order column is the best solution when the requirements call for a custom sort order like the one you cite. I wouldn't go with a numeric column, however. If the data is alphanumeric, the sort order should be alphanumeric; that way you can seed the value with whatever is in the county field.
If you use a numeric field, you'll (potentially) have to resequence the entire table whenever you add a new entry. So:
Columns: ID, County, SortOrder
Seed:
UPDATE County SET SortOrder = CONCAT('M-', County)
and for the special cases:
UPDATE County
SET SortOrder = CONCAT('E-', County)
WHERE County IN ('Salt Lake', 'Utah', 'Davis', 'Weber', 'Tooele')
Arguably you may want to put another marker column in to indicate those entries are special.
I went with numeric and large multiples.
Even with the CONCAT('E-', ...) example, I don't get the required sort order: that would give me Davis, Salt Lake, Tooele... and Salt Lake needs to be first.
I ended up using multiples of 10 and assigned the non-special-sort entries a value like 10000. That way the view for each lookup can have:
ORDER BY SortOrder ASC, OtherField ASC
Another programmer suggested using DECODE in Oracle, or CASE statements in SQL Server, but this is a more general solution. YMMV.
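A sketch of the scheme that finally worked, with illustrative values:
-- Special entries get low multiples of ten, in the order the client wants
UPDATE County SET SortOrder = 10000                          -- default bucket: "the rest"
UPDATE County SET SortOrder = 10 WHERE County = 'Salt Lake'
UPDATE County SET SortOrder = 20 WHERE County = 'Utah'
UPDATE County SET SortOrder = 30 WHERE County = 'Davis'
UPDATE County SET SortOrder = 40 WHERE County = 'Weber'
UPDATE County SET SortOrder = 50 WHERE County = 'Tooele'

-- The lookup view then sorts the specials first and the rest alphabetically
SELECT CountyID, County
FROM County
ORDER BY SortOrder ASC, County ASC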
