I have 2 tables of UK postal addresses (around 300000 rows each) and need to match one set to the other in order to return, for each address, a unique ID contained in the first set.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
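For what it's worth, the normalise / distance / candidate-list steps can be sketched roughly like this. It is only an illustration in Python, not the script I actually used: the abbreviation table, the function names and the distance threshold are all made up for the example, and real address data would need far more normalisation rules.

```python
import re

def normalise(address: str) -> str:
    """Upper-case, drop punctuation, collapse spaces, expand a few abbreviations."""
    # Illustrative abbreviation table only; a real one would be much longer.
    abbreviations = {"RD": "ROAD", "ST": "STREET", "AVE": "AVENUE"}
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(abbreviations.get(word, word) for word in cleaned.split())

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def candidates(target: str, pool: dict[int, str], max_distance: int = 3) -> list[tuple[int, int]]:
    """Return (distance, id) pairs within max_distance, best match first."""
    wanted = normalise(target)
    scored = [(edit_distance(wanted, normalise(addr)), uid) for uid, addr in pool.items()]
    return sorted(pair for pair in scored if pair[0] <= max_distance)
```

Anything that comes back with a distance of 0 could be accepted automatically; the rest of the list is what you would show to the administrator for a decision.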
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list with as many exceptions as it'll generate. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.
We have a customer that insists on putting contact details, at this time first and last names, into a single field. Take, for example, Mr. Bob Smith and Mrs. Jane Smith. Mr. Bob and Mrs. Jane would be entered into the first name field and Smith would be entered into the last name. It gets messier if the contacts have different last names or if there is a hyphenated name. The customer only wants one contact record so they came up with this system and implemented it on their own.
Our system is designed around contacts, and each individual person is intended to be an individual contact, even if married. Due to some of the attributes we must assign to people and notes we need to keep, a contact-centric approach is best. The above issue occurs in about 1/3 of the cases we handle.
Internally, my team has discussed how to sell the customer on using the database the way it was designed. We listed form letters and contact lists as being the main reasons for keeping the data clean and in the fields we designed. For example, using our recommendation, the customer will have much more granular control over form letter creation and sorting of data.
Any suggestions for how we sell this to the customer?
Tell them what they can get out of your system is only as good as what gets put in. If they want to enter inconsistent data, the cost they'll pay down the line is the inability to generate letters or mailing lists in the future.
They may need to learn this lesson the hard way for themselves. I see more problems with switching the names, for example, entering Smith as the first name and Bob as the last.
Also, can you make both fields required?
It sounds like what they want to enter is similar to AddressLine1, AddressLine2. It's just a poor design. I thought you had 2 name fields but they would only enter data in one of them (the first name).
All you can do is try to help them when they ask for it. They'll get the system they deserve.
Just show your customer the normal forms for database design:
> http://www.phlonx.com/resources/nf3/
Tell him that these normal forms are designed to make the database more manageable over time and make it more flexible.
Can't you just create a view that holds First and Last name together? For some servers you can also create editable views... So your customer will be happy and data will be stored normalized.
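A minimal sketch of the view idea, using SQLite from Python only because it runs anywhere. The table, column and view names are hypothetical, and the two sample contacts are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        contact_id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    -- The view presents the combined name without storing it a second time.
    CREATE VIEW contact_display AS
        SELECT contact_id, first_name || ' ' || last_name AS full_name
        FROM contact;
""")
conn.executemany(
    "INSERT INTO contact (first_name, last_name) VALUES (?, ?)",
    [("Bob", "Smith"), ("Jane", "Smith")],
)
rows = conn.execute(
    "SELECT full_name FROM contact_display ORDER BY contact_id"
).fetchall()
```

Note that SQLite views are read-only unless you add INSTEAD OF triggers, so whether the customer can edit through the view depends on the server you are actually using.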
I'd try to put it in terms of money and time. You're going to spend more time trying to keep duplicates out of a db with their design, more time building relevant reports or queries (constantly having to parse a block name field... do they want address all in one too?!?), more money to scrub the data (either themselves or someone else) if they ever want to send the data to a third party for analysis and metrics.
It sounds like they don't want to let go of their design, maybe partly because they understand it. You may want to try and meet them halfway somehow at first, and involve them in the process of making incremental improvements to the design. That way they can see and understand the benefits that right now may just be over their head, pushing them out of their comfort zone. They have to trust you with their baby :)
The best argument is that you won't be responsible for the behavior of the database unless they put things where they belong.
If they want to make a single mailing to each "household", then I'm sure your app can do that. (Probably already does.) Y'all just have to come to terms on what "household" means. Since there may be rented rooms or long-term guests, it doesn't always mean "only one mailing piece per address".
FWIW, I've been doing this stuff for decades, and I still find doctors and attorneys (and their staffs) the hardest people to deal with. One time, I walked out of a meeting (and, of course, lost the chance to bid on the contract) when a doctor's IT guy stood up, pounded his fist on the table, and screamed at me over and over, "Doctors are not people! Doctors are not people!".
I need an example of a database model to be attached to a database for data quality. The best form of answer would at the very least be DDL that's executable in MySQL; DDL for other RDBMSs is okay, I'll just post another question asking for a port of the code.
A good explanation would be a huge plus.
Questions, comments, feedback, etc. -- just comment, thanks!!
The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm". You might try to record
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
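To give the gist in DDL form, here is a rough sketch. It uses SQLite driven from Python rather than MySQL, every table and column name is invented for the example, and it covers only a few of the context items listed above. Expect the details to crumble under scrutiny here too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE observer (
        observer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE observation (
        observation_id INTEGER PRIMARY KEY,
        observer_id    INTEGER NOT NULL REFERENCES observer(observer_id),
        species        TEXT NOT NULL,
        bird_count     INTEGER NOT NULL CHECK (bird_count > 0),
        location       TEXT NOT NULL,
        observed_at    TEXT NOT NULL,
        -- Context columns that a later quality assessment might draw on.
        weather        TEXT,
        lighting       TEXT,
        temperature_c  REAL,
        hours_afield   REAL CHECK (hours_afield >= 0)
    );
    -- "Names of others present" as an ordinary many-to-many table.
    CREATE TABLE observation_companion (
        observation_id INTEGER NOT NULL REFERENCES observation(observation_id),
        observer_id    INTEGER NOT NULL REFERENCES observer(observer_id),
        PRIMARY KEY (observation_id, observer_id)
    );
""")
conn.execute("INSERT INTO observer (name) VALUES ('A. Birder')")  # placeholder observer
conn.execute(
    "INSERT INTO observation (observer_id, species, bird_count, location, observed_at) "
    "VALUES (1, 'red-tailed hawk', 3, 'Cape May', '2011-02-05 16:30')"
)
```

The point is only that the "quality" information ends up as plain columns and tables next to the observation itself.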
As part of a contact management system I have a large database of names. People frequently edit this and as a result we run into issues of the same person existing in different forms (John Smith and Jonathan Smith). I looked into word similarity but it's easy to think of name variations which are not similar at all (Richard vs Dick). I was wondering if there was a list of common English first name variations that I could use to detect and correct such errors.
I would crawl all Wikipedia pages on people's names (a dump of Wikipedia data is available), e.g., http://en.wikipedia.org/wiki/Teresa (from http://en.wikipedia.org/wiki/Category:English_given_names), and create an index that you can use to suggest the correct forms to people (you would rank them by the number of first name variants in your database). Unfortunately I do not know of such a database.
This thread points to a list of nickname/first name maps from the census:
http://deron.meranda.us/data/nicknames.txt
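Once you have such a map, using it could look roughly like the sketch below. The dictionary is a tiny hand-written sample invented for the example (it does not reflect the format or contents of the file above), and the function names are hypothetical.

```python
# Hypothetical hand-written sample; a real table would be loaded from a
# published nickname list rather than typed in.
NICKNAME_TO_FORMAL = {
    "dick": "richard",
    "rich": "richard",
    "jon": "jonathan",
    "bob": "robert",
}

def canonical_first_name(name: str) -> str:
    """Map a nickname to its formal form; unknown names pass through lower-cased."""
    key = name.strip().lower()
    return NICKNAME_TO_FORMAL.get(key, key)

def possible_duplicates(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """True when surnames match and first names share a canonical form."""
    same_surname = a[1].strip().lower() == b[1].strip().lower()
    return same_surname and canonical_first_name(a[0]) == canonical_first_name(b[0])
```

Since one nickname can belong to several formal names, I would use this to flag records for review rather than to merge them automatically.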
I am trying to find a better approach for storing people's names in a table.
What are the benefits of 3 fields over 1 field for storing a person's name?
UPDATE
Here is an interesting discussion and resources about storing names and user experience
Merging firstname/last name into one field
You can always construct a full name from its components, but you can't always deconstruct a full name into its components.
Say you want to write an email starting with "Dear Richie" - you can do that trivially if you have a given_name field, but figuring out what someone's given name is from their full name isn't trivial.
You can also trivially search or sort by given_name, or family_name, or whatever.
(Note I'm using given_name, family_name, etc. rather than first_name, last_name, because different cultures put their names in different orders.)
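As a small sketch of why the separate fields help (Python here, with a made-up class, a made-up family name and a hypothetical family_name_first flag for display order):

```python
from dataclasses import dataclass

@dataclass
class PersonName:
    given_name: str
    family_name: str
    family_name_first: bool = False  # hypothetical flag for cultures that put the family name first

    def full_name(self) -> str:
        """Building the full name from its parts is always possible."""
        parts = [self.given_name, self.family_name]
        if self.family_name_first:
            parts.reverse()
        return " ".join(p for p in parts if p)

    def greeting(self) -> str:
        return f"Dear {self.given_name}"

people = [PersonName("Richie", "Example"), PersonName("Ann", "Aardvark")]
by_family_name = sorted(people, key=lambda p: p.family_name)
```

Going the other way, from a single stored string back to given_name and family_name, has no equally simple function.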
Solving this problem in the general case is hard - here's an article that gives a flavour of how hard it is: Representing People's Names in Dublin Core.
Keep your data as clean as you can!
How?
Ask your user only as few things as you absolutely need at the time you ask.
How you store the name does not matter. What does matter is that
the user experience is as good as can be
you don't have false data in your system
If you annoy the users with mandatory fields to fill in and re-question them several times, they can get upset and not buy into your application right there and then. You want to avoid bad user experiences at all times.
No user cares how easy it is for you to search your database for his middle name. He wants to have an easy, feel-good experience, that's it.
What do users do if they are forced to input data like their postal address, or even email address when they only want a "read-only" account with no notifications needed? They put garbage data into your system. This will render your super search and sort algorithms useless anyway.
Thus, my advice would be in any app to gather just as little information from your user as you really need in order to serve them, no more.
If for example you run an online shop for pet food, don't ask your users at sign-up what kind of pets they own. Make it an option for them to fill in once they are logged in and all happy (new customers). Don't ask them their postal address until they order stuff that is actually carried to their house, stuff they pay for and thus care that YOU have their exact coordinates.
This will lead to a lot better data quality and this is what you should care about, not technical details the user has no benefit from....
In your example I would just ask for the full name (not sure though) and once the user willingly subscribes to your newsletter, let the user decide how he/she wants to be addressed...
As others have said, how do you decompose a full name into its component parts?
Colin Angus Mackay
Jean Michel Jarre
Vincent van Gogh
Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y Picasso
How do you reliably decompose that lot?
To learn more, see falsehoods programmers believe about names.
I was looking up the Spanish Civil War the other day, and found this exception to most rules:
Francisco Paulino Hermenegildo Teódulo Franco y Bahamonde, Salgado y Pardo de Andrade
Father: Nicolás Franco y Salgado-Araújo
Mother: María del Pilar Bahamonde y Pardo de Andrade
Next time I'm working on a system that has to store names, I'm going to try something radical: designing from the requirements.
What are we going to use the names for?
Name on an address label for the postal service
Greeting on the website
Informal name
Based on what the names will be used for, we'd determine how much information to store. Maybe we allow the user to enter all three of those, including line breaks in the first case (Generalissimo Franco might want his full titles and appointments listed, if he weren't still dead). Maybe we provide First, Middle, Last, Generation as an option, and fill in the rest as defaults. Maybe we offer other common options like Surname, Given Name.
This is in contrast to the old-style First, Middle, Last we've used since before I started programming in COBOL back in 1975, and have "made fit" ever since.
Unfortunately this is kind of like asking what is the best way to store a number in the database. It depends on what you are going to do with it - sometimes you want an int, other times a byte, and sometimes a float. With names it depends on things like what cultures you expect your users to come from, what you plan on doing with the names (will you be using these names to connect with another system that stores names as "last name, first name"?), and how much you can afford to annoy your users. If this is an internal HR application, you can probably afford to annoy the users a lot, and have a very structured, formal breakdown of name components (there are way more than 3 - don't forget mr/mrs, jr, III, multiple middle names, hyphenated last names, and who knows what else if you are trying to handle names from all cultures). If you have a webapp that users might or might not care about, you can't ask them to care too much.
For one, you may want to search on the 3 separate fields, and it's inexpensive to concatenate them for the full name.
e.g. If you want to search for all the Mr. Nolans your query would be
SELECT Title+' '+FirstName+' '+Surname AS FullName
FROM table WHERE Title = 'Mr' AND Surname = 'Nolan'
to do this with just the fullnames would be a pain.
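Here is a runnable sketch of that kind of query, using SQLite from Python (so || instead of + for concatenation). The table name and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (title TEXT, first_name TEXT, surname TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Mr", "Sam", "Nolan"), ("Mrs", "Ann", "Nolan"), ("Mr", "Tom", "Nolanson")],
)

# Equality tests on separate columns: exact, and able to use an index.
by_columns = conn.execute(
    "SELECT title || ' ' || first_name || ' ' || surname AS full_name "
    "FROM person WHERE title = ? AND surname = ?",
    ("Mr", "Nolan"),
).fetchall()
```

With only a full-name column you would be reduced to pattern matching, and would have to work to keep "Mr Tom Nolanson" out of the result.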
I'm English and only have one name. I normally put it in the 'surname' field for least aggravation. I am usually forced to put something in the 'first name' field too, which by definition is wrong.
Any attempt to impose anything more than 'Name' is doomed to be wrong at least some of the time, and sometimes be very frustrating to users. Single names are common in Southern India, Indonesia, and Pakistan (which is hundreds of millions of people) as well as the occasional weirdo in the UK like me.
The 'first, middle, last' thing is very U.S.-centric. Few other countries think of names that way. Please stop doing it.
Keeping the fields separate allows you to support different output formats and cultures where the family name is written first
Things like ORDER BY firstname or ORDER BY lastname are possible when you break the name up into multiple fields.
Not as easy to do when you mash all names into one field.
About the only thing I can think of is for searching purposes. It's a bit better to search a field using [=] rather than say [like].
If you have no need to display the name as separate words then go with a single field.
But if you need to do something like [Dear Mr. Achu] then perhaps a 3 field approach would be better.
Most of the time it's there to support writing form letters like, "Mr. so-and-so", or to search/sort by last name which is very common.
Given that first/middle/last may not apply to all cultures, there could be a better approach. It might be better expressed as "informal name" / "formal name" / "legal name" or something like that.
Still, at this point first/middle/last is very common, and from a data entry standpoint it is what everyone expects.
Here's the thing, not even humans can get this right all the time, there's just too much data, and too many special cases. I could change my name right now to be 20 parts, with the middle 13 as my "first" name. Parts of names can contain any number of words, and there can be any number of parts of names. Some people only have 1 name (no surname). Some people have lots of middle names. Some people have first or surnames composed of several words. Some people list their surname first. Some people go by their middle name. Some people go by nicknames that aren't obviously related to their given name.
If you try to guess these conventions in software YOU WILL FAIL. Period. Maybe you'll get it right some of the time, maybe even most of the time, but is even that worth it? In my opinion you should store names as one field and stop trying to be cute by using first names to refer to a person. If you need additional information about a name (e.g. a nickname), ask the user!
Each of the individual names is an atomic piece of data. When they are stored separately then it is easier to print them out in different formats such as Firstname Lastname and Lastname, Firstname.
There is no benefit if you never need to sort or search by first, middle, or last name.
Flexibility.
e.g.
If someone had a double barreled last name and no middle name.
I voted up some of these answers, but if you are looking to avoid repetitive or redundant or messy concatenation in your code, you can always use a computed column in the database or a method in a class which exposes the name consistently reconstructed. If these concatenations are expensive (because you are printing a million statements), you can use a persisted column.
Often you will allow users to specify names like nicknames or friendly names, so that you aren't referring to them by the name in their records or always as Mr. Smith.
It all depends on your requirements. There is no single good answer without the environment it is expected to satisfy.
Not sure how practical it would be, but maybe if cultural sensitivity is important in the context of the application being developed, perhaps a name should be a collection with each element of the collection carrying a value indicating if the name is the addressable "first name" or the addressable "surname" and so on for "title" or anything else that needs to be identified. A name ID could be used to identify the order of the elements for re-composing the full name.
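Roughly what I have in mind, as a Python sketch. The class name, the role labels and the use of a plain integer position for ordering are all just assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NamePart:
    position: int  # order used when re-composing the full name
    role: str      # e.g. "title", "given", "family"; labels are illustrative
    value: str

def full_name(parts: list[NamePart]) -> str:
    """Join every element in its stored order."""
    return " ".join(p.value for p in sorted(parts, key=lambda p: p.position))

def parts_with_role(parts: list[NamePart], role: str) -> list[str]:
    """Pick out the elements that carry a given role, in order."""
    return [p.value for p in sorted(parts, key=lambda p: p.position) if p.role == role]

van_gogh = [
    NamePart(1, "given", "Vincent"),
    NamePart(2, "family", "van Gogh"),
]
```

In a database this would be a child table keyed on the person, with one row per element.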
Just have two fields, 'Full Name' and 'Preferred Name' - easy. Supports every name in existence (as long as the language has lexical symbols... So, yes, that excludes languages that do not have a written form).
Just make sure that they are handled in some unicode format, and that application code properly handles unicode conversion.
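One concrete thing "handling Unicode properly" can mean is normalising names before you store or compare them. The sketch below is my own assumption about what that might look like in Python; the helper name is made up, and choosing NFC is a design decision rather than a requirement.

```python
import unicodedata

def clean_name(raw: str) -> str:
    """Collapse whitespace and normalise to NFC so equal-looking names compare equal."""
    return unicodedata.normalize("NFC", " ".join(raw.split()))

composed = "Jos\u00e9"     # é as a single code point
decomposed = "Jose\u0301"  # e followed by a combining acute accent
```

The two strings render identically but are different sequences of code points until they have been through clean_name.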
To me it is simply better to store 3 names so that explicit parsing is not necessary later on if the individual components are needed.
You can't always separate surname from full name cleanly and reliably so there's good reason to separate that because you often need surname. After you do that, there are two common approaches:
(1) first_name and middle_name; or
(2) given_names.
(2) is arguably preferable because people sometimes have more than two given names and (1) is inflexible in this regard.
Also, another common field is preferred_name (in addition to the above).
The i18n issue can be a bugger either way. Certain cultures use the surname first and the given name last, which strikes out the idea of first and last names, so we move to fields for surnames and given names. Wait, some cultures don't have a surname, or the surname is modified by the gender of the person named.
We can get into tribal cultures where the person is renamed on adulthood. Sitting Bull's childhood name was "Jumping Badger".
This is somewhat of a ramble, but what I am showing is that the more fields you have, the more accurate the design is. There should be at least a not null 'given name' field and an optional 'surname' field tied to a PK that is an integer. If the aforementioned requirements are observed, fields can be added without breaking queries.
Some of the issues can be solved by also storing an additional column like PreferredName. We do that in our DB and also store prefix column and a suffix column.
e.g.
'Prof Henry W Jones Jnr' with preferred name as 'Indiana Jones'.
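In code, that column layout might be used along these lines. This is a Python sketch, not our actual schema; the field and method names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredName:
    prefix: Optional[str]
    given_names: str
    surname: str
    suffix: Optional[str]
    preferred_name: Optional[str] = None

    def formal(self) -> str:
        """Prefix, given names, surname and suffix, skipping any that are empty."""
        parts = (self.prefix, self.given_names, self.surname, self.suffix)
        return " ".join(p for p in parts if p)

    def display(self) -> str:
        """Use the preferred name when the person has supplied one."""
        return self.preferred_name or self.formal()

jones = StoredName("Prof", "Henry W", "Jones", "Jnr", preferred_name="Indiana Jones")
```

Formal correspondence would use formal(), and everything else would fall back to it only when no preferred name has been given.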