Fuzzy Matching in SnowFlake like EDIT_DISTANCE_SIMILARITY

Fuzzy Matching in SnowFlake like EDIT_DISTANCE_SIMILARITY - snowflake-cloud-data-platform

Do we have any function for name fuzzy matching like we have UTL_MATCHING.EDIT_DISTANCE_SIMILARITY in oracle. I have to find the difference at row level.

Snowflake has EDITDISTANCE and SOUNDEX functions:
select editdistance('Duningham', 'Cunningham');
-- Result 2
select soundex('McArthur') = soundex('MacArthur');
-- Result TRUE
For EDITDISTANCE, unlike EDIT_DISTANCE_SIMILARITY lower scores are closer matches. There are many open source JavaScript implementations of fuzzy matching that you could plug into a Snowflake JavaScript UDF.

Interzoid (Disclaimer, I work there) has matching capabilities with native Snowflake connectivity, using knowledge bases (for different data types: name, company, address, etc.), heuristics, soundex, spelling analysis, derivatives, contextual ML, etc.) using a similarity key technology for use with one or more tables. It accesses an underlying API for each record in a table to generate the similarity keys (which can be appended to the table if desired) upon which the fuzzy matching is based -> https://connect.interzoid.com/matching-data-database - it would work on the above scenario.

Related

Merging two datasets with unclear identifier in Excel

I'm trying to merge two datasets on Mergers & Acquisitions. They both consist of c.10'000 observations with c.50-100 variables each. One contains information about the actual M&A deal whereas the other one contains info on how a deal was financed.
The problem is that there is no clear and unique identifier. For example, I could use the date that the deal was announced but that wouldn't be unique because on some days 10 deals were announced. Using company names is difficult since they mostly aren't identical in both datasets. For example if in one dataset I find "Ebay", in the other the same company could be called "eBay", "Ebay Inc", or "Ebay, Inc."
I've been working with the Fuzzy Lookup add-on for Excel, as well as concacenating various identifiers that are not unique but in their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.). However I haven't been able to generate as many matches as I would have hoped.
I'd be grateful for any ideas or pointers towards resources that would help me merge the datasets more efficiently.

This is probably a process that requires group tactics applied over several repeated iterations. No single fuzzy match will be able to catch them all the first time. An optimum strategy should narrow the possibilities for manual matching. Once you match them, exclude them from further fuzzy matching. Go through all records using various tactics to whittle down unmatched entries to as few as possible.
As for the tactics, you mention date and that there could be 10 mergers that day. Now you just manually match up just those 10. That becomes a manageable chunk.
Using company names is difficult...
Yes, but you can combine date with Levenshtein Distance (or an equivalent algorithm) to rank the possible choices to narrow the options. So names with eBay in them will all appear closer when you rank them by their Levenshtein Distance.
Other text comparison algorithms besides Levenshtein include Gotoh, Jaro, Soundex, Chapman, etc. Some of these techniques are decades old so the chances of finding Excel add-ins are high. There used to be a vibrant open source group working on these solutions on Sourceforge.net.
...their combination become useful (e.g. Date & Country & SIC Industry Classification Code, etc.).
Watch out for SIC codes. These are never consistent nor accurate. Depending on who was maintaining those codes you may not get very accurate values beyond 4-digit levels. Also the SIC codes themselves have been updated while the companies were under no obligation to update/revise them as needed. Lastly, SIC codes have been replaced by newer NAICS codes, which in-turn had several versions. During each change, they add new industries, such as Social Media companies that did not exist in the old SIC codes. However SIC/NAICS codes can be useful to eliminate duplicates through self matching.
...any ideas or pointers towards resources...
SQL Server's text indexing features have ready-made algorithms for finding matches. Depending on your resources, that may be something to explore.
If this matching process is going to be a routine task, then you can explore specialized ETL products (extract, transform, load) offered through various data integration services. Some of these ensure bi-directional updates. But no matter how sophisticated the solution, it's all based on a few simple tactics as shown above and nearly all of them contain manual overrides.

Database with support for JOINs and a flexible data schema

For our project, we need a database that supports JOINs and has the ability to easily add and modify attributes of the entity (schema-less/free). Key points:
The system is designed to work with customers (CRM)
Basic entities: User, Customer, Case, Case Interaction, Order
Currently in the database there are ~200k customers and ~250k orders
Customer entity contains 15-20 optional attributes that are most often not filled
About 100 new cases a day
The data is synchronized with several other sources in the background
Requirements (high to low priority):
Ability to implement search/sort by related entities, e.g. Case by linked Customer name (support JOINs)
Having the flexibility to change the schema of the data and do not store NULL for a large number of attributes
Performance
ORM for Python with support for monitoring changes and the possibility of storing only the changes to the database
What we've tried:
MongoDB does not satisfy paragraph 1.
PostgreSQL with all the attributes in one table does not satisfy paragraph 2.
PostgreSQL with a separate table for each attribute or EAV does not satisfy paragraph 3 (a lot of slow joins), but seems a better solution than others.
Can you suggest any database or design of the system that will meet our needs?

Datomic might be worth checking out (http://www.datomic.com/). It satisfies requirements 1-3, and although there's no python ORM, there is a REST API.
Datomic is based on an Entity Attribute Value schema (it's not quite schema free - you need to specify a name and type for each attribute - but any entity can have any attribute). It is transactional and has support for joins, unlike some of the other flexible "NoSQL" solutions. Interestingly, it also has first-class support for time (e.g. what is the history of this entity/what did the database look like at time t,etc), which might be useful if you're tracking cases and interactions.
Queries are based on datalog, which queries by unification. Query by unification looks a bit odd at first but is brilliant once you get used to it.
For example, a query to find cases by linked customer name would be something like this:
[find ?x
:in $
:where [?x :case/linked-customers ?c
?c :customer/name "Barry"]]
The query engine looks in the database, and tries to satisfy the where clause by unifying all occurrences of a given variable. In this case, only ?c appears twice (the case has a linked customer c whose name is Barry), but queries can obviously get a lot more complex. The $ here represents the database.

You may want to consider storing the "flexible" part as XML. Some databases, e.g. DB2, allow XML indexing so lookup performance should be as good as with the relational data store. DB2 Express-C is free and does not have an artificial limit on the database size.
Update Since 2015 DB2 Express-C limits the database user data volume to 15 TB, which still should be plenty.

How to implement an Enterprise Search

We are searching disparate data sources in our company. We have information in multiple databases that need to be searched from our Intranet. Initial experiments with Full Text Search (FTS) proved disappointing. We've implemented a custom search engine that works very well for our purposes. However, we want to make sure we are doing "the right thing" and aren't missing any great tools that would make our job easier.
What we need:
Column search
ability to search by column
we flag which columns in a table are searchable
Keep some relation between db column and data
we provide advanced filtering on the results
facilitates (amazon style) filtering
filter provided by grouping of results and allowing user to filter them via a checkbox
this is a great feature, users like it very much
Partial Word Match
we have a lot of unique identifiers (product id, etc).
the unique id's can have sub parts with meaning (location, etc)
or only a portion may be available (when the user is searching)
or (by a decidedly poor design decision) there may be white space in the id
this is a major feature that we've implemented now via CHARINDEX (MSSQL) and INSTR (ORACLE)
using the char index functions turned out to be equivalent performance(+/-) on MSSQL compared to full text
didn't test on Oracle
however searches against both types of db are very fast
We take advantage of Indexed (MSSQL) and Materialized (Oracle) views to increase speed
this is a huge win, Oracle Materialized views are better than MSSQL Indexed views
both provide speedups in read-only join situations (like a search combing company and product)
A search that matches user expectations of the paradigm CTRL-f -> enter text -> find matches
Full Text Search is not the best in this area (slow and inconsistent matching)
partial matching (see "Partial Word Match")
Nice to have:
Search database in real time
skip the indexing skip, this is not a hard requirement
Spelling suggestion
Xapian has this http://xapian.org/docs/spelling.html
Similar to google's "Did you mean:"
What we don't need:
We don't need to index documents
at this point searching our data sources are the most important thing
even when we do search documents, we will be looking for partial word matching, etc
Ranking
Our own simple ranking algorithm has proven much better than an FTS equivalent.
Users understand it, we understand it, it's almost always relevant.
Stemming
Just don't need to get [run|ran|running]
Advanced search operators
phrase matching, or/and, etc
according to Jakob Nielsen http://www.useit.com/alertbox/20010513.html
most users are using simple search phrases
very few use advanced searches (when it's available)
also in Information Architecture 3rd edition Page 185
"few users take advantage of them [advanced search functions]"
http://oreilly.com/catalog/9780596000356
our Amazon like filtering allows better filtering anyway (via user testing)
Full Text Search
We've found that results don't always "make sense" to the user
Searching with FTS is hard to tune (which set of operators match the users expectations)
Advanced search operators are a no go
we don't need them because
users don't understand them
Performance has been very close (+/1) to the char index functions
but the results are sometimes just "weird"
The question:
Is there a solution that allows us to keep the key value pair "filtering feature", offers the column specific matching, partial word matching and the rest of the features, without the pain of full text search?
I'm open to any suggestion. I've wondered if a document/hash table nosql data store (MongoDB, et al) might be of use? ( http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo ). Any experience with these is appreciated.
Again, just making sure we aren't missing something with our in-house customized version. If there is something "off the shelf" I would be interested in it. Or if you've built something from some components, what components (search engines, data stores, etc) did you use and why?
You can also make your point for FTS. Just make sure it meets the requirements above before you say "just use Full Text Search because that's the only tool we have."

I ended up coding my own.
The results are fantastic. Users like it, it works well with our existing technologies.
It really wasn't that hard. Just took some time.
Features:
Faceted search (amazon, walmart, etc)
Partial word search (the real stuff not full text)
Search databases (oracle, sql server, etc) and non database sources
Integrates well with our existing environment
Maintains relations, so I can have a n to n search and display
--> this means I can display child records of a master record in search results
--> also I can search any child field and return the master record
It's really amazing what you can do with dictionaries and a lot of memory.

I recommend looking into Solr, I believe it will meet you needs:
http://lucene.apache.org/solr/

For an off-she-shelf solution: Have you checked out the Google Search Appliance?
Quote from the Google Mini/GSA site:
... If direct database indexing is a requirement for you, we encourage you to consider the Google Search Appliance, which has direct database connectivity.
And of course it indexes everything else in the Googly manner you'd expect it to.

Apache Solr is a good way to start your project with and it is open source . You can also try Elastic Search and there are a lot of off shelf products which offer good customization abilities and search features such as Coveo, SharePoint Fast, Google ...

Keyword to SQL search

Use Case
When a user goes to my website, they will be confronted with a search box much like SO. They can search for results using plan text. ".net questions", "closed questions", ".net and java", etc.. The search will function a bit different that SO, in that it will try to as much as possible of the schema of the database rather than a straight fulltext search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to SO case, just an example here), "closed questions" will return questions that are closed, ".net and java" questions will return questions that relate to .net and java and nothing else.
Problem
I'm not too familiar with the words but I basically want to do a keyword to SQL driven search. I know the schema of the database and I also can datamine the database. I want to know any current approaches there that existing out already before I try to implement this. I guess this question is for what is a good design for the stated problem.
Proposed
My proposed solution so far looks something like this
Clean the input. Just remove any special characters
Parse the input into chunks of data. Break an input of "c# java" into c# and java Also handle the special cases like "'c# java' questions" into 'c# java' and "questions".
Build a tree out of the input
Bind the data into metadata. So convert stuff like closed questions and relate it to the isclosed column of a table.
Convert the tree into a sql query.
Thoughts/suggestions/links?

I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".
Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.
Database Schema
Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:
Keyword ID (not the word),
The Word itself,
A Soundex Alpha value for the Word
Weight
ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:
Product ID,
Keyword ID,
Weight
Keyword Weighting
The weighting value is an indication of how often the words occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.
I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.
Indexing
Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:
All product keywords are disassociated and deleted if no longer referenced.
Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
The keyword weight is then incremented by a value that depends on the importance of the field. You can add, subtract weight based on the importance of the field. If Artist is more important than Title, Subtract 1 or 2 from its ProductKeyword weight adjustment.
Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted.
Querying
Take the input query search string in its entirety and look for a direct matching keyword. Retrieve all ProductKeywords for the keyword in an in-memory table along with Keyword weight (different from ProductKeyword weight).
Strip out all non-alphanumeric characters and split query into keywords. Retrieve all existing keywords (only a few will match). Join ProductKeywords to matching keywords to in-memory table along with Keyword weight, which is different from the ProductKeyword weight.
Repeat Step 2 but use soundex values instead, adjusting weights to be less relevant.
Join retrieved ProductKeywords to their related Products and retrieve each product's sales, which is a measure of popularity.
Sort results by Keyword weight, ProductKeyword weight and Sales. The final summing/sorting and/or weighting depends on your implementation.
Limit results and return product search results to client.

What you are looking for is Natural Language Processing. Strangely enough this used to be included free as English Query in SQL Server 2000 and prior. But it's gone now
Some other sources are :
http://devtools.korzh.com/eq/dotnet/
http://www.easyask.com/products/business-intelligence/index.htm
The concept is a meta data dictionary mapping words to table, columns, relationships etc and an English sentence parser combined together to convert a English sentence ( or just some keywords) into a real query
Some people even user English Query with speech recognition for some really cool demos, never saw it used in anger though!

If you're using SQL Server, you can simply use its Full-Text Search feature, which is specifically designed to solve your problem.

You could use a hybrid approach, take the full text search results and further filter them based on the meta data from your #4. For something more intelligent you could create a simple supervised learning solution by tracking what links the user clicks on after the search and store that choice with the key search words in a decision tree. Searches would then be mined from this decision tree

Creating an efficient search capability using SQL Server (and/or coldfusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a 750k record minimum table, of product sku's, sizing, material type, create date, etc;
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.

You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile would depend a lot on how often you update the records you need to search. If you update them rarely, you could do an Verity Index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver, and certainly mitigate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.

If your slowdown is specifically the search of textual fields (as I surmise from your mentioning of LIKE), the best solution is building an index table (not to be confiused with DB table indexes that are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately drastically speed up textual search as it will no longer do "LIKE", AND will be able to use indexes on index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
Also, ensure that all the relevant fields are indexed fully - not necessarily in the same compund index (SKU, sizing etc...), and any field that is searched as a range field (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase or you don't care about insert/update speed as much).
For anything mode detailed, you will need to post your table structure, existing indexes, the queries that are slow and the query plans you have now for those slow queries.
Another item is to enure that as little of the fields are textual as possible, especially ones that are "decodable" - your comment mentioned "is it boxed" in the text fields set. If so, I assume the values are "yes"/"no" or some other very limited data set. If so, simply store a numeric code for valid values and do en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement but still an improvement.

I've done this using SQL's full text indexes. This will require very application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.

If you want a truly plug-in solution then you should just go with Google itself. It sounds like your doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), So you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQl, little coding, it will not be dependent on your database, or even coldfusion. It will also be quite fast and familiar to customers.
I was able to do this with a coldfusion site in about 6 hours, done! The only thing to watch out for is that google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a users role or permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with)

Because SQL Server is where your data is that is where your search performance is going to be a possible issue. Make sure you have indexes on the columns you are searching on and if using a like you can't use and index if you do this SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you do it like this SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to allow as many of the first characters to not be wild cards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight