I have two tables of UK postal addresses (around 300,000 rows each) and need to match one set against the other in order to return, for each address, a unique ID contained in the first set.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.), but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
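To make the normalising and "distance between near misses" steps concrete, here is a minimal Python sketch. It is not the code from that catalogue job; the function names normalise and levenshtein are made up for illustration, and the normalisation rule (upper-case, keep only letters and digits) is an assumption that suits things like postcodes.

```python
import re


def normalise(text: str) -> str:
    """Upper-case the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]
```

With this, levenshtein("bar", "car") is 1, and normalise("sw1a 1aa") gives "SW1A1AA", so two postcodes that differ only in spacing compare as equal before any distance is computed.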
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list, with as many exceptions as it'll generate. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.
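A rough Python sketch of that split between automatic matches and human review. The thresholds 0.95 and 0.75 are invented for illustration and would need tuning on real data, and difflib's SequenceMatcher ratio stands in for whatever scoring was actually used.

```python
from difflib import SequenceMatcher

AUTO_ACCEPT = 0.95   # assumed threshold: accept without review
NEEDS_REVIEW = 0.75  # assumed threshold: show to an administrator


def rank_candidates(record, candidates, limit=3):
    """Return the best-scoring candidates as (score, candidate) pairs."""
    scored = [(SequenceMatcher(None, record, c).ratio(), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


def triage(record, candidates):
    """Classify a record as 'auto', 'review' or 'unmatched' by its best score."""
    ranked = rank_candidates(record, candidates)
    if not ranked or ranked[0][0] < NEEDS_REVIEW:
        return "unmatched", ranked
    if ranked[0][0] >= AUTO_ACCEPT:
        return "auto", ranked
    return "review", ranked
```

Anything that comes back as "review" goes on the administrator's list together with the ranked alternatives, so they can tick the right one.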
I have a list of companies on a spreadsheet that is rarely updated. I'll call it List A.
I also have a constantly updating weekly list of companies (List B) that should have entries that match some entries on List A.
The reality is that the company names extracted from List B are often inconsistent due to various business abbreviations (e.g. The Company, Company Ltd., Company Accountants Limited). Sometimes, these companies are under different trading names or have various misspellings.
My initial, not very intelligent reaction was to construct a table of employer alias names, with the first column being the true employer name and the following columns holding aliases, something like this: [https://i.stack.imgur.com/2cmYv.png]
On the left is a sample table, and the far right is a column where I am using the following array formula template:
=ArrayFormula(INDEX(A30:A33,MATCH(1,MMULT(--(B30:E33=H30),TRANSPOSE(COLUMN(B30:E33)^0)),0)))
I realized soon after that I needed to create a new entry for every single exact match variation (Ltd., Ltd, and Limited), so I looked into fuzzy lookups. I was really impressed by Alan's Fuzzy Matching UDFs, but my needs heavily lean towards using Google Spreadsheets rather than VBA.
Sorry for the long post, but I would be grateful if anyone has any good suggestions for fuzzy lookups or can suggest an alternative solution.
The comments weren't exactly what I was looking for, but they did provide some inspiration for me to come up with a bandaid solution.
My original array formula needed exact matches, but the problem was that there were simply too many company suffixes and alternate names, so I looked into fuzzy lookups.
My current answer is to abandon the fuzzy lookup proposal and instead focus on reducing the original data string (i.e. the company name) to a simplified substring. Borrowing from a few code snippets floating around, I came up with a combined custom formula built around two lines of Google Apps Script:
var companysuffixremoval = str.toString().replace(/limited|ltd|group|holdings|plc|llp|sons|the/ig, "");
var alphanumericalmin = companysuffixremoval.replace(/[^A-Za-z0-9]/g, "");
The first line is simply my idea of removing popular company suffixes and the "the" from the string.
The second line removes all non-alphanumeric characters, including spaces and periods.
This ensures "The First Company Limited." and "First Company Ltd" both become "FirstCompany", which should work with returning the same values from the original array formula in the OP. Of course, I also implemented a trimming and cleaning line for any trailing/leading/extra spaces in the initial string, but that's probably unnecessary given the second line.
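For comparison, the same normalisation idea sketched in Python rather than Apps Script. The name company_key is made up, and the word boundaries (\b) are an addition that is not in the original two lines: without them the suffix pattern also eats "the" inside words such as "Southern".

```python
import re

# Same suffix list as above; the \b word boundaries are an assumption added here.
SUFFIXES = re.compile(r"\b(limited|ltd|group|holdings|plc|llp|sons|the)\b", re.IGNORECASE)


def company_key(name: str) -> str:
    """Strip common company suffixes, then every non-alphanumeric character."""
    without_suffixes = SUFFIXES.sub("", name)
    return re.sub(r"[^A-Za-z0-9]", "", without_suffixes)
```

Both "The First Company Limited." and "First Company Ltd" reduce to "FirstCompany", while "Southern Theatre Ltd." keeps its "the" letters and becomes "SouthernTheatre".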
If someone can come up with a better idea, please do tell. As it stands, I'm just tinkering with a script with minimal experience.
My name is Frederic, and I am neither a professional developer nor a native English speaker, so please forgive me for that.
I am actually a medical physician working as an in-hospital clinician, and I have been passionate about NLP for quite some time now. I'm currently writing a thesis for an MD-PhD in medical informatics, and my subject is information retrieval from patient documentation for a better clinical workflow.
I have a fairly well-defined strategy for text pre-processing and later indexing through Solr. I have been able to implement a max-ent classifier that works great! It works so well, actually, that it calls into question the other steps of text processing (I now kind of want to use it everywhere, which feels totally wrong). And I need your insight to make up my mind on that point.
Medical texts are of very different types (specialized consultations, operative documents, prescriptions, notes, etc.). However, they often follow structuring rules that are quite general. For example, admission forms always display information about the physical examination in the form of a list and a few paragraphs of narrative text.
I first wanted to use a maxent classifier to classify chunks of text under main headings (patient info, discipline, physical, discharge summary, discharge prescriptions and so on). That seems to work great! But after a while, I realized that most classification errors came from the fact that the text was not segmented properly in the first place, so the maxent could not do its job correctly. I do paragraph segmentation on the basis of a manual decision tree, taking into account new lines and spacing, and some characteristics of the text preceding or following separation markers (e.g. titles are often all-caps; you can differentiate "real new lines" from "decorative new lines" by the presence of a period at the end of the preceding paragraph and an upper-cased first letter in the following one, and so on...).
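To give an idea of what such a manual rule looks like, here is a simplified Python sketch of the line-break decision using only the cues mentioned above. It is not my actual decision tree; the function name is_real_break is made up, and treating a blank neighbouring line as a real break is an extra assumption.

```python
def is_real_break(previous_line: str, next_line: str) -> bool:
    """Guess whether the line break between two lines ends a paragraph."""
    prev = previous_line.rstrip()
    nxt = next_line.lstrip()
    if not prev or not nxt:
        return True  # assumption: a blank line on either side is a boundary
    if prev.isupper():
        return True  # all-caps lines are usually titles
    # A "real" new line: the text before ends with a period and the text
    # after starts with an upper-case letter.
    return prev.endswith(".") and nxt[0].isupper()
```

The same cues (ends with a period, starts upper-cased, all-caps) are exactly the kind of features that could be handed to a classifier instead of being hard-coded like this.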
Well... maxent works well for this task too. But now I find myself training a third maxent because it works so well, and I now want to differentiate real periods from other periods (as in numbers or abbreviations)... and there is also the classification of dashes.
If I listened to myself (and hopefully I don't), I would use maxent for pretty much every text processing task that requires a bit of classification. But that seems very wrong for many reasons: training time, memory usage, etc.
So could you please give me your advice: for which of a series of tasks would it make the most sense to use maxent, and are there alternatives that I did not take into account?
Main tasks that are in question now are:
paragraph segmentation: using maxent to determine if a line-break is a real one or should be omitted
paragraph classification: once segmented, give a subheading to each paragraph
word (or token) normalisation: real periods vs. in-word periods (end of sentence vs. abbreviation, for example), dashes, apostrophes (I work in the French language, so these characters are quite problematic, as in "va-t'en" or "jusqu'à")
Sorry for the long text. I figured no code was really useful for this question, but I can post some if it helps you help me ;)
Thanks in advance, and cheers,
Our application connects to a SQL Server database. An nvarchar(max) column has been added and must be included in the search. The number of records in this DB is only in the tens of thousands, and there are only a few hundred people using the application. I've been told to explore Full-Text Search; is this necessary?
This is like asking, I work 5 miles away, and I was told to consider buying a car. Is this necessary? Too many variables to give you a simple and correct answer to your question. For example, is it a nice walk? Is there public transit available? Is your schedule flexible? Do you have to go far for lunch or run errands after work?
Full-Text Search can help if your typical searches are going to be WHERE col LIKE '%foo%' - but whether it is necessary depends on how large this column will get, whether your searches are true wildcard searches, your tolerance for millisecond vs. nanosecond queries, the amount of concurrency, even seemingly extraneous stuff like whether the data is always in memory and can be searched more efficiently.
The better answer is that you should try it. Populate a table with a copy of your data, add a full-text index, and see if your typical workload improves by using full-text queries instead of LIKE. It probably will, but there's no way for us to know for sure even if you add more specifics than ballpark row counts.
In a similar situation I ended up making a table structure that was more search friendly and indexable, then setting up a batch job to copy records from the live database to the reporting one.
In my case the original data didn't come close to needing an nvarchar(max) column so I could get away with that. Your mileage may vary. In any case, the answer is "try a few things and see what works for you".
We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regular expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
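One crude way to "discover" the changing parts, sketched in Python: treat every token that contains a digit as variable and replace it with a placeholder, so the remainder acts as the known type. That digit rule is purely an assumption based on the two sample entries (ABC123, US91), and the names to_template and group_by_template are made up.

```python
import re
from collections import defaultdict


def to_template(entry: str) -> str:
    """Replace every whitespace-separated token containing a digit with <*>."""
    return " ".join("<*>" if re.search(r"\d", token) else token
                    for token in entry.split())


def group_by_template(entries):
    """Group log entries by the template they reduce to."""
    groups = defaultdict(list)
    for entry in entries:
        groups[to_template(entry)].append(entry)
    return dict(groups)
```

Both sample entries reduce to "Change Transaction <*> Assigned To Server <*>", so they land in the same group in a single pass, with no record-to-record comparison. Entries whose variable parts carry no digits would need a smarter rule.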
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of two consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way, and the search engine may give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
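If you want to see the n-gram step outside a search engine, a minimal Python sketch follows. The helper name word_ngrams is made up; in Lucene itself this would be done by an analyzer rather than by hand.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of a text, with words joined by underscores."""
    words = text.split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Calling word_ngrams("Change Transaction XYZ789 Assigned To Server GB47") reproduces the six bigrams listed above, and n=3 gives the trigrams.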
Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard people do this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually, however, the "information retrieval" approach is only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return some similar existing log lines, without much quantitative information on how common this line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
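As a small illustration of the vectorisation step behind the second strategy, here is a plain-Python sketch of sparse TF-IDF vectors and cosine similarity. It only covers turning lines into vectors and comparing them; a real clustering algorithm would sit on top. The smoothed IDF formula and the function names are choices made for this sketch, not something prescribed above.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            # raw term count times a smoothed inverse document frequency
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

On the two "Change Transaction ..." lines from the question, the shared words give a clearly positive similarity, while an unrelated line with no words in common scores 0. Dropping or down-weighting ID-like tokens before vectorising is one way to stop the clusters forming around server IDs.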
It sounds like you could take the lucene approach mentioned above, then use that as a source for input vectors into the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().
I understand an intermediate class is often introduced to capture information in a situation where for example, a team has many players, and a player plays for many teams over the years. The intermediate class introduced is contract with cardinality as shown:
Team -1----N- Contract -N----1- Player
Let's say however that 98% of all queries only want current information and don't care about historical information. Given the name of a player, they want to know information about his current team, and perhaps current contract.
Given the above relationship, should all the contracts always be looked through to find the current one first, and then from there access information about the team? Or should an optimization be made with direct linkage between the player and his current team?
Thanks
If it is guaranteed that there is only one team for each player at any given time, you can just add a currentTeam column to the Player table and that's it. But remember that you must update it every time you update the Contracts table! And it must be done within the same transaction, so that the database is kept consistent at all times.
You violate some normal form this way, but you know what you are doing and why: for efficiency and optimization. I have used this trick many times.
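A small sketch of the idea using Python's built-in sqlite3 module, just to show the contract change and the shortcut column being updated in one transaction. The schema, the column name current_team_id (standing in for currentTeam), the sample rows and the helper sign_contract are all invented for illustration; on a real server you would do the same with your own DBMS's transactions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE player (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        current_team_id INTEGER REFERENCES team(id)  -- denormalised shortcut
    );
    CREATE TABLE contract (
        id INTEGER PRIMARY KEY,
        team_id INTEGER NOT NULL REFERENCES team(id),
        player_id INTEGER NOT NULL REFERENCES player(id),
        start_date TEXT NOT NULL,
        end_date TEXT  -- NULL while the contract is still running
    );
    INSERT INTO team (id, name) VALUES (1, 'Reds'), (2, 'Blues');
    INSERT INTO player (id, name) VALUES (1, 'Alice');
""")


def sign_contract(conn, player_id, team_id, start_date):
    """Close the open contract, add the new one, and update the shortcut."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "UPDATE contract SET end_date = ? "
            "WHERE player_id = ? AND end_date IS NULL",
            (start_date, player_id))
        conn.execute(
            "INSERT INTO contract (team_id, player_id, start_date) "
            "VALUES (?, ?, ?)",
            (team_id, player_id, start_date))
        conn.execute(
            "UPDATE player SET current_team_id = ? WHERE id = ?",
            (team_id, player_id))


sign_contract(conn, 1, 1, "2020-01-01")
sign_contract(conn, 1, 2, "2023-01-01")
```

After the second call the player row points at the new team and exactly one contract is still open, so reads never need to touch the contract history.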
This seems to be under the context of some kind of ORM, so I'll run with that. (Even if it isn't, keep reading.)
Objects are useful for modeling complex operations. For example, adding a new Contract causes all sorts of crazy things to happen to the Team, the Players, and various PayChecks (I made the last one up, but you get the point). This is the perfect kind of thing to be handled in code rather than in, say, a hideously complex T-SQL stored procedure.
But when it comes to querying, I find that it often makes sense to write a view/SQL statement/projection that is shamelessly tailored to the set of information that you need to perform a function. As long as you do this for reading data, and not for writing it, then you're not really subverting your object model; you are just looking at it a different way, and you're just making a pragmatic observation that most of the time, you only need the information from a IPlayerCurrentContractQuery and not the whole list of Contracts within the Player. Since it is a method that is called a bajillion times, you've written an integration test to make sure that the SQL produces correct results, and you've looked closely at its query plan to make sure that it's not doing awful things like table scans to the database. This commonly-used screen in your app is fast and everyone is happy.
One could make the case that creating such a separate query is a premature optimization, but it probably isn't. I mean, if a player usually only has a few Contracts, then it might not be worth separating out the query and interface. Sucking down all of the Contracts from the database to loop through them and pluck out the current one is going to perform worse than selecting the right one from the database first, but if it's just a handful of Contracts, then a "yeah I'm fully aware it's kinda dumb but it's fast enough" approach is probably good enough, just move on. But if these Contracts stretch back years or are large objects, then separating out the query becomes a no-brainer.
If that starts performing badly because of the joins (which is unlikely unless you start seeing significant traffic), then you add a cache. And if that doesn't work due to lots of writes, then you can start denormalizing your database by adding a direct reference. But unless you are writing the next Facebook of baseball then YAGNI, and at that point you're sharding across servers and throwing away most of the benefits of the relational model anyway so who cares.
A similar situation is posed in my answer to this question.
(If this question isn't about ORM, and really is just about modeling how the tables are designed, then you make sure that you have an index that covers the query that selects the current contract--such as start and stop dates--and you are pretty much done unless you have really exceptional scaling requirements as mentioned above. If you're writing a particular set of joins very often, then you might write a function or stored procedure to remove the boilerplate.)
That's my brain dump. Hope this helps!
Given the above relationship, should all the contracts always be
looked through to find the current one first, and then from there
access information about the team?
A modern query optimizer will use the most selective index first. Assuming that player_id is in that index in a usable position, the optimizer will probably find all the rows for that player first--and there won't be many, right?--then do another index scan on the contract dates to find the current contract.
If I were you, I'd create a view that returns only the "current" rows. Let application code run against that view.
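A sketch of such a view, again using Python's built-in sqlite3 so it can be run as is. The table layout, the sample rows, and the rule that "current" means end_date IS NULL are assumptions for the example; your definition of current may involve comparing dates instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contract (
        id INTEGER PRIMARY KEY,
        team TEXT NOT NULL,
        player TEXT NOT NULL,
        start_date TEXT NOT NULL,
        end_date TEXT  -- assumption: NULL means the contract is current
    );
    -- index led by the player so the lookup below stays cheap
    CREATE INDEX contract_player_dates
        ON contract (player, end_date, start_date);
    CREATE VIEW current_contract AS
        SELECT player, team, start_date
        FROM contract
        WHERE end_date IS NULL;
    INSERT INTO contract (team, player, start_date, end_date) VALUES
        ('Reds',  'Alice', '2020-01-01', '2023-01-01'),
        ('Blues', 'Alice', '2023-01-01', NULL),
        ('Reds',  'Bob',   '2021-06-01', NULL);
""")


def current_team(conn, player):
    """Look up a player's current team through the view."""
    row = conn.execute(
        "SELECT team FROM current_contract WHERE player = ?",
        (player,)).fetchone()
    return row[0] if row else None
```

Application code only ever queries current_contract, so if the definition of "current" changes later, only the view has to change.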