SQL Fuzzy Matching - sql-server

I hope I am not repeating this question; I searched here and on Google before posting.
I am running an eStore on SQL Server 2008 R2 with Full-Text Search enabled.
My requirements:
There is a Product table, which has the product name, OEM codes, and the models this product fits. All are text.
I have created a new column called TextSearch. It holds the concatenated, comma-separated values of the product name, OEM code, and the models the product fits.
When a customer enters a keyword, we run a search on the TextSearch column to match products. See the matching logic below.
I use a hybrid of full-text search and plain LIKE, which gives more relevant results. All the queries are executed into a temp table and distinct rows are returned.
Matching logic:
Run the following SQL to get relevant products using full text. @Keywords is pre-processed: say, 'CLC 2200' is changed to '"CLC*" AND "2200*"' (prefix terms in CONTAINS must be double-quoted).
SELECT Id FROM dbo.Product WHERE CONTAINS(TextSearch, @Keywords)
Another query runs using a plain LIKE, so 'CLC 2200' is pre-processed to "TextSearch LIKE '%clc%' AND TextSearch LIKE '%2200%'". This is simply because full-text search won't match patterns that start before the keyword; for example, it won't return 'pclc 2200'.
SELECT Id FROM dbo.Product WHERE TextSearch LIKE '%clc%' AND TextSearch LIKE '%2200%'
If steps 1 and 2 don't return any records, the following search is executed. The value 135 was fine-tuned by me to return more relevant records.
SELECT p.Id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(dbo.Product, TextSearch, @Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135
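Put together, the flow might look like this (a sketch only; the temp-table name is illustrative, and the FREETEXTTABLE step is given the raw phrase, since FREETEXT does not interpret CONTAINS operator syntax):
DECLARE @Keywords VARCHAR(200) = '"CLC*" AND "2200*"'   -- pre-processed for CONTAINS
DECLARE @RawKeywords VARCHAR(200) = 'CLC 2200'          -- raw input for FREETEXTTABLE
CREATE TABLE #Results (Id INT)
-- Step 1: full-text prefix search.
INSERT INTO #Results
SELECT Id FROM dbo.Product WHERE CONTAINS(TextSearch, @Keywords)
-- Step 2: plain LIKE search.
INSERT INTO #Results
SELECT Id FROM dbo.Product
WHERE TextSearch LIKE '%clc%' AND TextSearch LIKE '%2200%'
-- Step 3: ranked FREETEXT fallback, only if steps 1 and 2 found nothing.
IF NOT EXISTS (SELECT 1 FROM #Results)
    INSERT INTO #Results
    SELECT p.Id
    FROM dbo.Product AS p
    INNER JOIN FREETEXTTABLE(dbo.Product, TextSearch, @RawKeywords) AS r
        ON p.Id = r.[KEY]
    WHERE r.RANK > 135
SELECT DISTINCT Id FROM #Results
DROP TABLE #Results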
All of the above combined works at a reasonable speed and returns relevant products for keywords.
But I am looking to improve the case where no product is found at all.
Say a customer looks for 'CLC 2200npk' and this product isn't there; I need to show the next closest match, 'CLC 2200'.
So far I have tried the SOUNDEX() function, computing the SOUNDEX value for each word in the TextSearch column and comparing it with the SOUNDEX value of the keyword. But this returns far too many records and is slow too.
For example, 'CLC 2200npk' returns products such as 'CLC 1100', which isn't a good result, as it is not close to 'CLC 2200npk'.
There is another good one here, but it uses CLR functions, and I cannot install CLR functions on the server.
So my logic would need to be:
if 'CLC 2200npk' is not found, show the close match 'CLC 2200';
if 'CLC 2200' is not found, show the next closest, 'CLC 1100'.
Questions
Is it possible to match like this, as suggested?
If I need to do spelling correction and search, what would be a good way? All of our product listings are in English.
Are there any UDFs or stored procedures that match text the way I suggest?
Thanks.

A rather quick, domain-specific solution may be to calculate string similarity using SOUNDEX plus a numeric distance between the two strings. This will only really help when you have a lot of product codes.
Using a simple UDF like the one below, you can extract the numeric characters from a string, getting 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100'. You can then determine closeness based on the SOUNDEX output of each input as well as the closeness of the numeric components.
CREATE FUNCTION [dbo].[ExtractNumeric] (@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    -- Remove non-digit characters one at a time until none remain.
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO
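With that in place, one way to rank near misses (a sketch only; it reuses the Product table from the question, and applying SOUNDEX to the whole TextSearch value weights its leading word most heavily):
DECLARE @Keyword VARCHAR(1000) = 'CLC 2200npk'
-- Same-sounding rows sort first; ties are broken by numeric closeness,
-- so 'CLC 2200' (distance 0) beats 'CLC 1100' (distance 1100).
SELECT TOP 10 p.Id, p.TextSearch
FROM dbo.Product AS p
ORDER BY
    CASE WHEN SOUNDEX(p.TextSearch) = SOUNDEX(@Keyword) THEN 0 ELSE 1 END,
    ABS(dbo.ExtractNumeric(p.TextSearch) - dbo.ExtractNumeric(@Keyword))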
As far as general-purpose algorithms go, there are a couple that might help, with varying degrees of success depending on data set size and performance requirements (both links have T-SQL implementations available):
Double Metaphone - This algorithm gives better matches than SOUNDEX at the cost of speed, and it is very good for spelling correction.
Levenshtein Distance - This calculates how many keypresses it would take to turn one string into another; for instance, getting from 'CLC 2200npk' to 'CLC 2200' takes 3, while 'CLC 2200npk' to 'CLC 1100' takes 5.
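For reference, here is a minimal T-SQL sketch of the textbook dynamic-programming algorithm (the function name is mine; one row of the distance matrix is kept in a string because scalar UDFs have no arrays, and each cost c is stored as NCHAR(c + 1) so NCHAR(0) never appears):
CREATE FUNCTION dbo.Levenshtein (@s NVARCHAR(100), @t NVARCHAR(100))
RETURNS INT
AS
BEGIN
    DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t)
    IF @sLen = 0 RETURN @tLen
    IF @tLen = 0 RETURN @sLen
    DECLARE @prev NVARCHAR(101) = N'', @curr NVARCHAR(101)
    DECLARE @i INT = 1, @j INT, @cost INT, @best INT
    -- Row 0 of the matrix is 0, 1, 2, ..., @tLen (stored with the +1 offset).
    WHILE @i <= @tLen + 1
    BEGIN
        SET @prev = @prev + NCHAR(@i)
        SET @i = @i + 1
    END
    SET @i = 1
    WHILE @i <= @sLen
    BEGIN
        SET @curr = NCHAR(@i + 1)   -- d[i][0] = i
        SET @j = 1
        WHILE @j <= @tLen
        BEGIN
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1) THEN 0 ELSE 1 END
            -- Cheapest of substitution, deletion and insertion.
            SET @best = UNICODE(SUBSTRING(@prev, @j, 1)) - 1 + @cost
            IF UNICODE(SUBSTRING(@prev, @j + 1, 1)) < @best
                SET @best = UNICODE(SUBSTRING(@prev, @j + 1, 1))
            IF UNICODE(SUBSTRING(@curr, @j, 1)) < @best
                SET @best = UNICODE(SUBSTRING(@curr, @j, 1))
            SET @curr = @curr + NCHAR(@best + 1)
            SET @j = @j + 1
        END
        SET @prev = @curr
        SET @i = @i + 1
    END
    RETURN UNICODE(SUBSTRING(@prev, @tLen + 1, 1)) - 1
END
GO
SELECT dbo.Levenshtein(N'CLC 2200npk', N'CLC 2200') -- returns 3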
Here is an interesting article that applies both algorithms together, which may give you a few ideas.
Well hopefully some of that helps a little.
EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post; it won't return exactly the same results as the normal one). On my test table of 125,000 rows it runs in 6 seconds, compared to 60 seconds for the first one I linked to.

Related

How to stop Heap Analytics grouping assets into "OTHER" Category

I think this might be very simple.
I wrote a query in Heap to tell me which users were part of an event and how many times they engaged in it during the year.
The result is a simple table with username and number of occurrences.
It worked. However, Heap has this weird behavior of choosing multiple results (maybe at random?) and throwing them into a single "Other (X other results)" category, where X is the number of other results.
So I end up with a table of 20, maybe 30, users and occurrences, and one row of "Other (X other results)".
I shrank the query to see results from a smaller subset of dates and the "Other" category disappeared.
I really need to see every individual row in my query results, even if it's paginated.
Help! Thank you.
You can export the result as a CSV. The downloaded file will contain all the results (all single entries, without the grouped OTHER).
In the current UI, you can find Export to CSV at the top of the report view.

Fuzzy name matching algorithm

I have a database containing names of blacklisted companies and individuals.
Every transaction created needs its details scanned against these blacklisted names. The transactions may have names that are not spelled correctly; for example, one can write "Wilson" as "Wilson", "Vilson" or "Veelson". The fuzzy search logic or utility should match against the name "Wilson" in the blacklist database and, based on the correctness/accuracy percentage set by the user, show the names matching within that percentage.
The transactions will be sent in batches, or in real time, to check against blacklisted names.
I would appreciate it if users who had a similar requirement and implemented it could share their views and implementations.
T-SQL leaves a lot to be desired in the realm of fuzzy search. Your best options are third-party libraries, but if you don't want to mess with that, your best bet is the DIFFERENCE function built into SQL Server. For example:
SELECT * FROM tblUsers U WHERE DIFFERENCE(U.Name, @nameEntered) >= 3
DIFFERENCE returns an integer from 0 to 4, where a higher value indicates a closer match between the SOUNDEX codes of the two strings. A drawback is that the algorithm favors words that sound alike, which may not be your desired characteristic.
This next example shows how to get the best match out of a table:
DECLARE @users TABLE (Name VARCHAR(255))
INSERT INTO @users VALUES ('Dylan'), ('Bob'), ('Tester'), ('Dude')
SELECT Name, MAX(DIFFERENCE(Name, 'Dillon')) AS Score FROM @users GROUP BY Name ORDER BY Score DESC
It returns:
Name   | Score
Dylan  | 4
Dude   | 3
Bob    | 2
Tester | 0
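Applied to the blacklist scenario in the question, the same approach might look like this (a sketch; the table and column names are hypothetical):
DECLARE @transactionName VARCHAR(255) = 'Vilson'
-- DIFFERENCE compares the 4-character SOUNDEX codes, so a threshold of 3
-- roughly corresponds to a 75% match; tune it to the accuracy the user sets.
SELECT b.Name, DIFFERENCE(b.Name, @transactionName) AS Score
FROM dbo.BlacklistedNames AS b
WHERE DIFFERENCE(b.Name, @transactionName) >= 3
ORDER BY Score DESC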

Solr hierarchical facets: how to get all 2nd-level values for the top N 1st-level values

I have a pair of multi-valued index fields, author and author_norm, and I've created a hierarchical facet field for them using the pattern described at https://wiki.apache.org/solr/HierarchicalFaceting#Indexed_Terms. The facet values look like this:
0/Blow, J
1/Blow, J/Blow, Joe
1/Blow, J/Blow, Joseph
1/Blow, J/Blow, Jennifer
0/Smith, M
1/Smith, M/Smith, Michelle
1/Smith, M/Smith, Michael
1/Smith, M/Smith, Mike
Authors are associated to article records, and in most cases an article will have many authors. This means that for a Solr query that returns 100+ articles, there will potentially be 1000+ authors represented.
My problem is that when I go to display this hierarchy to the user, because my facet.limit and facet.mincount are set to sane values, I do not have the complete set of 2nd-level values; i.e., the 2nd level of my hierarchy is cut off at a certain point. I will have something like this:
Blow, J (30)
Blow, Joe (17)
Blow, Joseph (9)
Smith, M (22)
Smith, Michelle (14)
Smith, Michael (6)
I would like to also have the "Blow, Jennifer (4)" and "Smith, Mike (2)" entries in this list, but they're not returned in the response because the mincount cutoff is 5. So I end up with a confusing display (17 + 9 != 30, etc.).
One option would be to put a little "(more)" link at the bottom of every 2nd-level list and fetch the full set via ajax. I'm not crazy about this solution because it's asking users to work/click more than they really should have to, and also because I can't control the length of the initial 2nd-level listing; sometimes it'll be 3 names + "(more)", sometimes 2 or even 1. That's just ugly.
I could set mincount=1 and limit=-1 for just my hierarchical facet field, but that would be nuts because for a large query (100k hits) I would be fetching 100k+ values I don't need. I only need the full set of 2nd-level values for the top N 1st-level values.
So unless someone has a better suggestion, I'm assuming I need to do some kind of follow-up query. So after all that, here's what I'm really asking: is there a way to fetch these 2nd-level values in a single follow-up query? Given an initial Solr response, how can I get all the 2nd-level permutations of just the top N 1st-level values of my hierarchy?
Thanks!
PS, I'm using Solr 4.0.
You can modify the limit and mincount for any level of the pivot via per-field parameters:
facet.pivot=fieldA,fieldB&f.fieldA.facet.limit=3&f.fieldB.facet.limit=-1
Problems start when both fields are the same (facet.pivot=fieldA,fieldA); in that case I would probably create a copy of fieldA as fieldB.
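Applied to the fields in the question, a request might look like this (a sketch; it assumes your Solr version honors per-field overrides at each pivot level, which is worth verifying on 4.0):
q=*:*&rows=0&facet=true&facet.pivot=author_norm,author&f.author_norm.facet.limit=10&f.author.facet.limit=-1&f.author.facet.mincount=1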

SPSS: Loop through the values of a variable

I have a dataset with patient data according to the site where they visited our mobile clinic. I have written a series of commands, such as freqs and crosstabs, to produce the analyses I need; however, I would like this to be done for the patients at each site rather than for the dataset as a whole.
If I had only one site, a mere filter command with the variable that specifies a patient's site would suffice, but alas I have 19 sites, so I would like to find a way to loop through my code to produce these outputs for each site. That is to say, for i in 1 to 19:
1. Take the ith site.
2. Compute a filter for this ith site.
3. Run the tables using this filtered data of patients at the ith site.
Here is my first attempt using DO REPEAT. I also tried using LOOP earlier.
However, it does not work; I keep getting an error even though these are closed loops.
Is there a way to do this in SPSS syntax? Bear in mind I do not know Python well enough to do this using that plugin.
*LOOP #ind = 1 TO 19 BY 1.
DO REPEAT #ind = 1 TO 20.
* Select the site here.
COMPUTE filter_site = (RCDSITE = #ind).
USE ALL.
FILTER BY filter_site.
* Step 3: Apply the necessary code for tables.
* Participation in the wellness screening; we do not care about those who did FP, as we are not reporting it.
COUNT BIO = CheckB (1).
* COUNT FPS = CheckF(1).
* COUNT BnF = CheckB CheckF(1).
VAL LABEL BIO
1 'Has the Wellness screening'
0 'Does not have the wellness screening'.
*VAL LABEL FPS
1 'Has the First patient survey'.
* VAL LABEL BnF
1 'Has either Wellness or FPS'
2 'Has both surveys done'.
FREQ BIO.
* Use simple math to calculate those who only did the Wellness/First Patient survey: FUB = F + B - FnB.
* Executive Summary.
* Blood Pressure.
FREQ BP.
* BMI.
FREQ BMI.
* Waist Circumference.
FREQ OBESITY.
* Glucose.
FREQ GLUCOSE.
* Cholesterol.
FREQ TC.
* Haemoglobin.
FREQ HAEMOGLOBIN.
* HIV.
FREQ HIV.
* I. Lifestyle and General Health.
MISSING VALUES Gender GroupDep B8 TO B13 ('').
* Graphs 3.1.
Is it just frequencies you are producing? Try SPLIT FILES by the variable RCDSITE. That should be enough.
SPLIT FILES allows you to partition your data by up to eight variables. Each procedure will then automatically iterate over each group.
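A minimal sketch with the variables from the question (the data must be sorted by the grouping variable first):
SORT CASES BY RCDSITE.
SPLIT FILE SEPARATE BY RCDSITE.
FREQ BIO.
FREQ BP.
SPLIT FILE OFF.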
If you need to group the results at a higher level than the procedure, that is, to run a bunch of procedures for each group before moving on to the next one so that all the output for a group is together, you can use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands.
These commands require the Python Essentials. Both can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) if you have at least version 18.
HTH,
Jon Peck
A simple but perhaps not very elegant way is to select from the menu: Data > Select Cases > If condition. There you enter the filter for site 1 and press Paste, not OK.
This gives you the applied filter as syntax code.
So with some copy/paste/replace/repeat you can get the freqs and all the other results based on the different sites.
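For site 1 the pasted syntax looks roughly like this; copy it per site and change the value:
USE ALL.
COMPUTE filter_$=(RCDSITE = 1).
VARIABLE LABELS filter_$ 'RCDSITE = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.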

SQL Server query to Crystal Reports

I am trying to move reports that currently run in SQL Server to Crystal Reports.
Essentially, the statement I want to reproduce is:
SELECT DATEPART(DD, DATE), COUNT(*)
FROM MYTABLE
WHERE FOO = 'BAR'
GROUP BY DATEPART(DD, DATE)
That is, count the occurrences of records that match a criterion, grouped by date.
I have used the Select Expert to generate an equivalence relation (to evaluate the records) and would like to use the DatePart function in a GROUP BY fashion. I have gotten the Group Expert to group by date, but it groups by the full timestamp, not by a specific day, i.e. March 1, 2010.
I am sure there is a way to achieve what I want but have yet to find a tutorial explaining this scenario.
Any help you could lend would be appreciated.
As with any other SDK/language, there are many ways to do this. Here is the first one that I can think of:
Get the raw data into Crystal. (It sounds like you already did this.)
In Crystal, make a new formula and call it "GroupByDate". In the formula editor, enter:
date({mytable.mydatefield})
Go into the Group Expert. Group your report by GroupByDate.
Make a new formula and call it "AddMe". In the formula editor, enter:
iif({mytable.foo} = "bar", 1, 0)
Drag & drop your AddMe formula into the details section. Right-click on it and choose Insert->Summary. Set the summary location to the group footer.
Preview your report and you should see the total counts in every group footer. To simplify the appearance of the report, you can also suppress the display of the details and group header sections.
Again, there are many ways to do this. You can also get creative with a Running Total field. The Crystal formula editor has very useful help files: use the Functions pane to select a function, press F1, and you'll get criteria, examples, etc.
