Is there a Google Apps Script for Fuzzy Lookups? - arrays

I have a list of companies on a spreadsheet that is rarely updated. I'll call it List A.
I also have a constantly updating weekly list of companies (List B) that should have entries that match some entries on List A.
The reality is that the company names extracted into List B are often inconsistent because of various business abbreviations (e.g. The Company, Company Ltd., Company Accountants Limited). Sometimes these companies are listed under different trading names or have various misspellings.
My initial, rather unintelligent reaction was to construct a table of employer alias names, with the first column holding the true employer name and the following columns holding aliases, something like this: [https://i.stack.imgur.com/2cmYv.png]
On the left is a sample table, and the far right is a column where I am using the following array formula template:
=ArrayFormula(INDEX(A30:A33,MATCH(1,MMULT(--(B30:E33=H30),TRANSPOSE(COLUMN(B30:E33)^0)),0)))
I realized soon after that I needed to create a new entry for every single exact match variation (Ltd., Ltd, and Limited), so I looked into fuzzy lookups. I was really impressed by Alan's Fuzzy Matching UDFs, but my needs heavily lean towards using Google Spreadsheets rather than VBA.
Sorry for the long post, but I would be grateful if anyone has any good suggestions for fuzzy lookups or can suggest an alternative solution.

The comments weren't exactly what I was looking for, but they did provide some inspiration for me to come up with a bandaid solution.
My original array formula needed exact matches, but the problem was that there were simply too many company suffixes and alternate names, so I looked into fuzzy lookups.
My current answer is to abandon the fuzzy lookup proposal and instead focus on editing the original data strings (i.e. the company names) into simpler substrings. Borrowing from a few scripts floating around, I came up with a combined custom function built around two lines of Google Apps Script:
var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
var alphanumericalmin = companysuffixremoval.replace(/[^A-Za-z0-9]/g, "");
The first line simply removes popular company suffixes and the word "the" from the string.
The second line removes all non-alphanumeric characters, which includes any spaces and periods.
This ensures that "The First Company Limited." and "First Company Ltd" both become "FirstCompany", which should let the original array formula in the OP return the same values. Of course, I also implemented a trimming and cleaning line for any trailing/leading/extra spaces in the initial string, but that's probably unnecessary with the second line.
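For reference, here is a minimal sketch of how those two lines might sit inside a complete custom function. The name SIMPLIFYNAME is just something I made up; you would call it from a cell as =SIMPLIFYNAME(A2).

function SIMPLIFYNAME(str) {
  // Remove popular company suffixes and the word "the" (case-insensitive).
  var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
  // Strip everything that isn't a letter or digit (spaces, periods, commas, ...).
  var alphanumericalmin = companysuffixremoval.replace(/[^A-Za-z0-9]/g, "");
  return alphanumericalmin;
}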
If someone can come up with a better idea, please do tell. As it stands, I'm just tinkering with a script with minimal experience.

Related

Tricky duplicate control: meeting criteria in Excel array formulas

A bit of a tricky question - I might just have to do it through VBA with a proper script, but if someone actually has an answer, however complicated (let's be honest, I don't think there's a super simple formula for this), I'm game. I'd rather do as much as I can through formulas. I've attached a sample.
The data: I have data that relates to countries. In each country, you can have multiple sites. For each site, you may or may not have different distributions. When those distributions meet a given criterion, I want to tally that up as a "break" and count how many there are by country, site, etc.
How it works: I'm using array formulas with SUMPRODUCT() for this. The nice thing is that you can easily add criteria; each criterion returns 0/1, so when you multiply them together you get the array you need to sum up to see how many breaks you have.
The problem: I am unable to formulate the formula so that each site is counted only once in the case where the same site has 2 different distribution types and both meet the break criteria. If both distributions meet the break criteria, I don't want to record that as 2 breaks, otherwise I may end up with more sites with breaks recorded than there are sites. Part of the problem is how I account for the uniqueness of sites:
(tdata[siteid]>"")/COUNTIF(tdata[siteid],tdata[siteid] &"")
This is actually a bit of a hack, in the sense that, as opposed to the other formulas, it doesn't return 0/1 but possibly fractions. They do add up correctly and do allow me to, say, count the number of sites correctly, but the array isn't formatted as 0/1, so when it's multiplied with other 0/1 arrays it messes up the results.
I control the data, so I have some leeway. I work with tables (as can be seen) and VBA is already used. I could sort the source tables if that helps. Source data:
1 row = 1 distribution for 1 site on 1 month
The summary table per country I linked is based on those source data.
Any idea?
EDIT - Filtering by distribution is not really an option. I already have event-based filters for the source data, and I can already calculate the indicator correctly for data filtered by distribution. But I also need to display global data (which is currently not working). There are also other indicators that need to be calculated which won't work if I filter the data (it's a big dashboard).
EDIT2: In other words, I need some way to account for the fact that if the same criterion (break or not) is met in 2 rows with the same siteid but 2 different distributions, I want to count that as 1 break only, while keeping in mind that if only one distribution has a break (and the other does not), I still want to record it as 1 site with a break in that country.
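To make the dedup logic explicit, here it is written out as a rough JavaScript-style sketch rather than a formula (the row shape and field names are made up for illustration): each country/siteid pair is counted at most once, no matter how many of its distributions break.

// rows look like { country: "FR", siteid: "S1", distribution: "A", isBreak: true }
function breaksPerCountry(rows) {
  var seen = {};   // country|siteid pairs already counted
  var counts = {}; // country -> number of sites with at least one break
  rows.forEach(function (r) {
    if (!r.isBreak) return;                // this distribution doesn't meet the criteria
    var key = r.country + "|" + r.siteid;
    if (seen[key]) return;                 // site already counted for this country
    seen[key] = true;
    counts[r.country] = (counts[r.country] || 0) + 1;
  });
  return counts;
}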
EDIT3: I've decided to make a new table that summarizes the data for each site individually (each of which may have more than one distribution). Then I can calculate the global figures from that.
My take-home message from this: I think that when you have many levels of data (e.g. countries, sites, with some kind of sub-level for distributions) in Excel formulas, it's difficult NOT to summarize the data in intermediate tables for the level of analysis you want to focus on. E.g. in my case, I am interested in country-level analysis, which is 2 "levels" above the distribution level. This means there will be "duplication" of data from a site-level perspective. You may be able to navigate around this, but I think by far the simpler solution is to suck it up and make an intermediate table. It also shortens your formulas significantly.
I don't mark this as a solution because it's not what I was looking for. Still open to better suggestions that work only with formulas.
File: https://www.dropbox.com/sh/4ofctha6qhfgtqw/AAD0aPJXr__tononRTpKc1oka?dl=0
Maybe the following can help.
First, you filter the entries which don't meet the criteria regarding the distribution.
In a second step, you sort the table from A to Z based on the column siteid.
Then you add an extra column after the last one with the formula =C3<>C4, where column C contains the siteid entries. That way all duplicates are denoted by a FALSE value in the helper column.
After that you filter the FALSE values in this column.
You then get unique site ids.
In case I got your question wrong, I would be glad of an update so I can try to help further.

Strategies for UK Postal Address Matching

I have 2 tables of UK postal addresses (around 300,000 rows each) and need to match one set to the other in order to return a unique ID contained in the first set for each address.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and word-by-word ranking, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
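The "distance between near misses" part is just an edit distance; here is a minimal Levenshtein sketch in JavaScript if you want to experiment (for illustration only, not the code I actually used back then).

// Levenshtein edit distance: number of single-character insertions,
// deletions or substitutions needed to turn a into b.
// levenshtein("bar", "car") === 1, as in the example above.
function levenshtein(a, b) {
  var prev = [], curr = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 1; i <= a.length; i++) {
    curr[0] = i;
    for (var k = 1; k <= b.length; k++) {
      var cost = a[i - 1] === b[k - 1] ? 0 : 1;
      curr[k] = Math.min(prev[k] + 1, curr[k - 1] + 1, prev[k - 1] + cost);
    }
    var tmp = prev; prev = curr; curr = tmp;
  }
  return prev[b.length];
}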
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list, with as many exceptions as it generates. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.

Sql Server Full Text: Human names which sound alike

I have a database with lots of customers in it. A user of the system wants to be able to look up a customer's account by name, amongst other things.
What I have done is create a new table called CustomerFullText, which just has a CustomerId and an nvarchar(max) field "CustomerFullText". In "CustomerFullText" I keep concatenated together all the text I have for the customer, e.g. First Name, Last Name, Address, etc, and I have a full-text index on that field, so that the user can just type into a single search box and gets matching results.
I found this gave better results than trying to search data stored in lots of different columns, although I suppose I'd be interested in hearing if this is in itself a terrible idea.
Many people have names which sound the same but which have different spellings: Katherine and Catherine and Catharine, and perhaps someone whose record in the database is Katherine but who introduces themselves as Kate. Also McDonald vs MacDonald, Liz vs Elisabeth, and so on.
Therefore, what I'm doing is, whilst storing the original name correctly, making a series of replacements before I build the full text. So ALL of Katherine and Catherine and so on are replaced with "KATE" in the full-text field. I do the same transform on my search parameter before I query the database, so someone who types "Catherine" into the search box will actually run a query for "KATE" against the full-text index in the database, which will match Catherine AND Katherine and so on.
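A rough sketch of that replacement step (in JavaScript for brevity, with only a toy alias table; the real table is obviously much bigger):

// Map of name variants to a single canonical token, applied both when
// building the full-text string and to the user's search input.
var NAME_ALIASES = {
  "katherine": "KATE", "catherine": "KATE", "catharine": "KATE", "kate": "KATE",
  "elisabeth": "LIZ", "liz": "LIZ",
  "mcdonald": "MACDONALD", "macdonald": "MACDONALD"
};

function canonicaliseNames(text) {
  return text.split(/\s+/).map(function (word) {
    var key = word.toLowerCase().replace(/[^a-z]/g, "");
    return NAME_ALIASES[key] || word;
  }).join(" ");
}

// canonicaliseNames("Catherine MacDonald") -> "KATE MACDONALD"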
My question is: does this duplicate any part of existing SQL Server Full Text functionality? I've had a look, but I don't think that this is the same as a custom stemmer or word breaker or similar.
Rather than trying to phonetically normalize your data yourself, I would use the Double Metaphone algorithm, essentially a much better implementation of the basic SOUNDEX idea.
You can find an example implementation here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=13574, and more are listed in the Wikipedia article on Double Metaphone.
It will generate two normalized code versions of your word. You can then persist those in two additional columns and compare them against your search text, which you would convert to Double Metaphone on the fly.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
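As an example of the "known types as regexes" idea, the scoring could start as simply as this (a sketch only; the pattern is invented from the sample records above, and real types would be discovered or curated over time):

// Score a log record against a list of known-type patterns.
var KNOWN_TYPES = [
  { name: "assign-to-server", re: /^Change Transaction \w+ Assigned To Server \w+$/ }
  // ... more types added as new shapes of record are discovered
];

function classify(record) {
  for (var i = 0; i < KNOWN_TYPES.length; i++) {
    if (KNOWN_TYPES[i].re.test(record)) {
      return { type: KNOWN_TYPES[i].name, confidence: "high" };
    }
  }
  return { type: null, confidence: "none" }; // candidate for a new known type
}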
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine should give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
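A small sketch of the n-gram generation (the function name is mine):

// Build word n-grams (bigrams when n = 2) from a log entry.
function ngrams(entry, n) {
  var words = entry.trim().split(/\s+/);
  var grams = [];
  for (var i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join("_"));
  }
  return grams;
}

// ngrams("Change Transaction XYZ789 Assigned To Server GB47", 2)
// -> ["Change_Transaction", "Transaction_XYZ789", "XYZ789_Assigned",
//     "Assigned_To", "To_Server", "Server_GB47"]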
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. However, the "information retrieval" approach is usually only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast, but it will always just return you some similar existing log lines, without much quantitative information about how common each line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
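If you want to prototype the clustering route on a small sample, a toy TF-IDF vectoriser plus cosine similarity is enough to get a feel for it (sketch only, not production code):

// Minimal TF-IDF vectors for a list of log lines.
function tfidfVectors(docs) {
  var df = {};
  var tokenised = docs.map(function (d) { return d.toLowerCase().split(/\s+/); });
  tokenised.forEach(function (tokens) {
    var seen = {};
    tokens.forEach(function (t) {
      if (!seen[t]) { seen[t] = true; df[t] = (df[t] || 0) + 1; }
    });
  });
  return tokenised.map(function (tokens) {
    var vec = {};
    tokens.forEach(function (t) { vec[t] = (vec[t] || 0) + 1; });       // term frequency
    Object.keys(vec).forEach(function (t) {
      vec[t] = vec[t] * Math.log(docs.length / df[t]);                  // tf * idf
    });
    return vec;
  });
}

// Cosine similarity between two sparse vectors (plain objects).
function cosine(a, b) {
  var dot = 0, na = 0, nb = 0;
  Object.keys(a).forEach(function (t) { na += a[t] * a[t]; if (b[t]) dot += a[t] * b[t]; });
  Object.keys(b).forEach(function (t) { nb += b[t] * b[t]; });
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}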
It sounds like you could take the lucene approach mentioned above, then use that as a source for input vectors into the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().

Keyword to SQL search

Use Case
When a user goes to my website, they will be confronted with a search box much like SO's. They can search for results using plain text: ".net questions", "closed questions", ".net and java", etc. The search will function a bit differently than SO, in that it will try to use as much of the database schema as possible rather than doing a straight full-text search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to the SO case, just an example here), "closed questions" will return questions that are closed, and ".net and java" will return questions that relate to .net and java and nothing else.
Problem
I'm not too familiar with the terminology, but I basically want to do a keyword-to-SQL driven search. I know the schema of the database and I can also data-mine it. I want to know about any existing approaches out there before I try to implement this. I guess this question is asking what a good design for the stated problem looks like.
Proposed
My proposed solution so far looks something like this
Clean the input. Just remove any special characters.
Parse the input into chunks of data. Break an input of "c# java" into "c#" and "java". Also handle special cases like "'c# java' questions", which becomes 'c# java' and "questions".
Build a tree out of the input.
Bind the data to metadata, e.g. take something like "closed questions" and relate it to the isclosed column of a table.
Convert the tree into a SQL query (see the sketch below).
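To make steps 4 and 5 concrete, something like this toy sketch is what I have in mind (the keyword-to-column mapping and the "tag" column are invented for illustration; a real version would be driven by a metadata dictionary and would build on the parse tree rather than a flat word list):

// Toy mapping of recognised keywords to schema metadata (step 4),
// then conversion into a parameterised WHERE clause (step 5).
var KEYWORD_META = {
  "closed": { column: "isclosed", value: 1 },
  ".net":   { column: "tag",      value: ".net" },
  "java":   { column: "tag",      value: "java" }
};

function buildQuery(input) {
  var terms = input.toLowerCase().replace(/[^a-z0-9.#\s]/g, "").split(/\s+/);
  var clauses = [], params = [];
  terms.forEach(function (t) {
    var meta = KEYWORD_META[t];
    if (meta) {
      clauses.push(meta.column + " = ?");
      params.push(meta.value);
    }
  });
  return {
    sql: "SELECT * FROM questions" + (clauses.length ? " WHERE " + clauses.join(" AND ") : ""),
    params: params
  };
}

// buildQuery("closed .net questions")
// -> { sql: "SELECT * FROM questions WHERE isclosed = ? AND tag = ?", params: [1, ".net"] }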
Thoughts/suggestions/links?
I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".
Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.
Database Schema
Keywords table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:
Keyword ID (not the word),
The Word itself,
A Soundex Alpha value for the Word
Weight
ProductKeywords table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:
Product ID,
Keyword ID,
Weight
Keyword Weighting
The weighting value is an indication of how often the word occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", e.g. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.
I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.
Indexing
Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:
All product keywords are disassociated and deleted if no longer referenced.
Each indexed field (eg. Artist) value is stored/retrieved as a keyword in its entirety and related to the product in the ProductKeywords table for a direct match.
The keyword weight is then incremented by a value that depends on the importance of the field. You can add or subtract weight based on the importance of the field; if Artist is more important than Title, subtract 1 or 2 from its ProductKeyword weight adjustment.
Each indexed field value is stripped of any non-alphanumeric characters and split into separate word groups, eg. "Billy Joel" becomes "Billy" and "Joel".
Each separate word group for each field value is soundexed and stored/retrieved as a keyword and associated with the product in the same way as in step 2. If a keyword has already been associated with a product, its weight is simply adjusted (see the sketch below).
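Steps 4 and 5 sketched in JavaScript (this is a simplified classic Soundex for illustration, not SQL Server's SOUNDEX, and the helper names are mine):

// Simplified Soundex: first letter kept, remaining letters mapped to digit
// codes, adjacent duplicates collapsed, padded/truncated to 4 characters.
function soundex(word) {
  var codes = { b:1,f:1,p:1,v:1, c:2,g:2,j:2,k:2,q:2,s:2,x:2,z:2,
                d:3,t:3, l:4, m:5,n:5, r:6 };
  var w = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!w) return "";
  var out = w[0].toUpperCase(), prev = codes[w[0]] || 0;
  for (var i = 1; i < w.length && out.length < 4; i++) {
    var code = codes[w[i]] || 0;
    if (code && code !== prev) out += code;
    prev = code;
  }
  return (out + "000").slice(0, 4);
}

// Step 4/5: split "Billy Joel" into word groups and soundex each one.
"Billy Joel".replace(/[^A-Za-z0-9 ]/g, "").split(/\s+/).map(soundex);
// -> ["B400", "J400"]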
Querying
Take the input query search string in its entirety and look for a direct matching keyword. Retrieve all ProductKeywords for the keyword in an in-memory table along with Keyword weight (different from ProductKeyword weight).
Strip out all non-alphanumeric characters and split the query into keywords. Retrieve all existing keywords (only a few will match). Join the matching ProductKeywords into the in-memory table along with the Keyword weight, which is different from the ProductKeyword weight.
Repeat Step 2 but use soundex values instead, adjusting weights to be less relevant.
Join retrieved ProductKeywords to their related Products and retrieve each product's sales, which is a measure of popularity.
Sort results by Keyword weight, ProductKeyword weight and Sales. The final summing/sorting and/or weighting depends on your implementation.
Limit the results and return the product search results to the client (the ranking step is sketched below).
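A toy sketch of the final sort/limit step (the field names are invented; remember that in this scheme a lower keyword weight means a rarer, more relevant word, so both weights sort ascending while sales sort descending):

// matches: [{ productId, keywordWeight, productKeywordWeight, sales }]
function rankResults(matches, limit) {
  return matches
    .slice()
    .sort(function (a, b) {
      return (a.keywordWeight - b.keywordWeight)            // rarer keyword first
          || (a.productKeywordWeight - b.productKeywordWeight)
          || (b.sales - a.sales);                           // more popular product first
    })
    .slice(0, limit);
}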
What you are looking for is Natural Language Processing. Strangely enough, this used to be included for free as English Query in SQL Server 2000 and prior, but it's gone now.
Some other sources are :
http://devtools.korzh.com/eq/dotnet/
http://www.easyask.com/products/business-intelligence/index.htm
The concept is a metadata dictionary mapping words to tables, columns, relationships, etc., combined with an English sentence parser to convert an English sentence (or just some keywords) into a real query.
Some people even used English Query with speech recognition for some really cool demos; never saw it used in anger though!
If you're using SQL Server, you can simply use its Full-Text Search feature, which is specifically designed to solve your problem.
You could use a hybrid approach: take the full-text search results and further filter them based on the metadata from your step 4. For something more intelligent, you could create a simple supervised learning solution by tracking which links the user clicks on after the search and storing that choice, along with the key search words, in a decision tree. Searches would then be mined from this decision tree.
