Related
Am designing a database for a credit bureau and am seeking some guidance.
The data they receive from Banks, MFIs, Saccos, Utility companies etc comes with various types of IDs. E.g. It is perfectly legal to open a bank account with a National ID and also a Passport. Scenario One that has my head banging is that Customer1 will take a credit facility (call it loan for now) in bank1 with the passport and then go to bank2 and take another loan with their NationalID and Bank3 with their MilitaryID. Eventually when this data comes from the banks to the bureau, it would be seen as 3 different people while we know that its actually 1 person. At this point, there is nothing we can do as a bureau.
However, one way out (for now) is using the Govt registry which provides a repository which holds both passports and IDS. So once we query for this information and get a response, how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Again, a person's name could be captured in various orders states. Bank1 could do FName, LName, OName while Bank3 can do LName, FName only. How do I store this names?
Even against one ID type e.g. NationalID, you will often find misspellt names or missing names. So one NationalID in our database could end up with about 6 different names because the person's name was captured different by the various banks where he has transacted.
And that is just the tip of the iceberg. We have issues with addresses, telephone numbers, etc etc.
Could you have any insight as to how I'd structure my database to ensure we capture all data from all banks and provide the most accurate information possible regarding an individual? Better yet, do you have experience with this type of setup?
Thanks.
how do I show in the DB that Passport_X is related to NationalID_Y and MilitaryNumber_Z?
Trivial.
You ahve an identity table, that has an AlternateId field if the Identity is linked to another one. Use the first IDentity you created as master. Any alternative will have AlternateId pointing to it.
You need to separate the identity from the data in it, so you can have alterante versions of it, possibly with an origin and timestampt. You need oto likely fully support versioning and tying different identities to each other as alternative, including generating a "master identity" possibly by algorithm with the "official" version of your data (i.e. consolidated).
The details are complex - mostly you ahve to make a LOT of compromises without killing performance, so at the end HIRE A SPECIALIST. There is a reason there are people out as sensior database designers or architects that have 20+ years experience finding the optimal solution given the constrints you may not even be aware of (application wise).
Better yet, do you have experience with this type of setup?
Yes. Try financial information. Stock symbols / feeds / definitions are not necessariyl compatible and vary by whom you get it. Any non-trivial setup has different data feeds that may show the same item slightly different, sometimes in error. DIfferent name, sometimes different price (example: ES, CME group, is 50 USD per point, but on TT Fix it is 5 - to make up, the price is multiplied by 10, so instad of 1000.25 you get 10002.5). THis is the same line of consolidation, and it STINKS.
Tons of code, tons of proper database design, redoing it half a dozen time to get the proper performance. THis is tricky, sadly.
Say I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table or an address table that is referenced by the other tables? Is there such a thing as over-normalization?
Database Normalization is all about constructing relations (tables) that maintain certain functional
dependencies among the facts (columns) within the relation (table) and among the various relations (tables)
making up the schema (database). Bit of a mouth-full, but that is what it is all about.
A Simple Guide to Five Normal Forms in Relational Database Theory
is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is
and its significance with respect to database table design. This is a very good "touch-stone" reference.
To answer your specific question properly requires additional information. Some critical questions you have to ask
are:
Is an Address a simple fact (e.g. blob of text) or a composite fact (e.g.
composed of multiple attributes: Address line, City Name, Postal Code etc.)
What are the other "facts" relating to "Accommodation",
"Airport" and "Train Station"?
What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation"
and a "Train Station" (these facts are typically called a key or candidate key)?
What functional dependencies exist among Address facts and the facts
composing each relations key?
All this to say, the answer to your question is not as straight forward as one might hope for!
Is there such a thing as "over normalization"? Maybe. This depends on whether the
functional dependencies you have identified and used to build your tables are
of significance to your application domain.
For example, suppose it was determined that an address
was composed of multiple attributes; one of which is postal code. Technically a postal
code is a composite item too (at least Canadian Postal Codes are). Further normalizing your
database to recognize these facts would probably be an over-normalization. This is because
the components of a postal code are irrelevant to your application and therefore factoring
them into the database design would be an over-normalization.
For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.
As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization as I think it mostly comes from experience. However, follow the books on each level of normalization and then once it starts to get difficult to see where things are you've probably gone too far.
Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.
Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully de-normalized for many performance and scalability reason. It's worth looking into those for reason why and when you should and shouldn't de-normalize.
Would I have address columns in each table or an address table that is referenced by the other tables?
Can airports, train stations and accommodation each have a different address format?
A single ADDRESS table minimizes the work necessary dealing with addresses - suite, RR, postal/zip code, state/province...
Is there such a thing as over-normalization?
There are different levels of normalization. I've only encountered what I'd consider poor design rather than normalization.
Personally I'd go for another table.
I think it makes the design cleaner, makes reporting on addresses much simpler and will make any changes you need to make to the address schema easier.
If you need to have it denormalized later on you can always create two views that contain the Train station and airport information along with any address information you need.
This isn't really what I understand by normalisation. You don't seem to be talking about removing redundancy, just how to partition the storage or data model. I'm assuming that the example of addresses for Accommodation, Train Stations and Airports will all be disjoint?
As far as I know it would only be normalisation if you started thinking along the lines. Postcode is functionally dependent upon street address so should be factored out into its own table.
In which case this could be ever desirable or undesirable dependent upon context. Perhaps desirable if you administer the records and can ensure correctness, and less desirable if users can update their own records.
A related question is Is normalizing a person’s name going too far?
If you have a project/piece of functionality that is very performance sensitive, it may be smart to denormalize the database in some cases. However, this can lead to maintenance issues for various reasons. You may instead want to duplicate the data with cache tables but there are drawbacks to this as well. It's really a case by case basis but in normal practice, database normalization is a good thing. 99% of the non-normalized databases I've seen are not by design, but rather by a misunderstanding/mistake by the developer.
Would I have address columns in each table or an address table that is referenced by the other tables?
As others have alluded to, this is not really a question of normalization because you're not attempting to reduce redundancy or organize dependencies. Either way is perfectly acceptable. Moving the addresses to a separate table might make sense if you are going to have centralized validation or business logic specific to addresses.
Is there such a thing as over-normalization?
Yes. As has been mentioned, in large systems (lots of data, lots of transactions, or both) you can normalize to the point where performance becomes an issue. This is why lots of systems use denormalized database for reporting and querying.
In addition to performance though, there is also the issue of how easy the data is to query. In systems where there will be a lot of end-user querying of the data (can be dangerous!), a denormalized structure is easier for most non-technical or non-database people to understand.
Like most things we deal with, it's a trade-off between understanding, performance, and future maintainability and there is rarely a clear-cut answer to where you draw the line in any given system.
With experience, you will learn where the line is best drawn for the systems you write.
With that said, my preference is to err on the side of more vs less normalization.
If you are using Oracle 9i, you could store address objects in your tables. That would remove the (justified) concerns about address formats.
I agree with S.Lott, and would like to add:
A good answer depends on what you know already. The basic "math" of relational database theory, however, defines very well-defined, distinct levels of normalization. You cannot normalize anymore when you've reached the ultimate normal form.
Depending on what you want to model with your three entities, and how you identify them, you can come up with very different conceptual data models, all of which can be represented in a mix of normal forms -- or unnormalized at all (like 1 table for all data with descriptors and NULL holes all over the place...).
Consider you normalize your three entities to the ultimate normal form. I can now introduce a new requirement, or use case, or extension, which gives an upto-now descriptive attribute a somehow ordered, or referencing, or structured nature if you look at its content. Then, the model should represent this behavior, and what used to be an attribute perhaps will better be a separate entity referenced by other entities.
Over-normalization? Only in the sense that can you normalize a given model so it gets inefficient to store, or process, on a given DB platform. Depending on what can be handled efficiently there, you might want to de-normalize certain aspects, trading off redundancy for speed (data warehouse dbs do this all the time), and insight, or vice versa.
All (working) db designs I've seen so far either have a rather normalized conceptual data model, with quite some denormalization done at the logical and/or physical data model level (speaking in Sybase PowerDesigner terms) to make the model "manageable" -- either that, or they were not working, i.e. failed because the maintenance problems became kingsize real quick.
When you say "address", I presume you mean a complete address, like street, city, state/province, maybe country, and zip/postal code. That's 4 or 5 fields, maybe more if you allow for "address line 1" and "address line 2", care-of's, etc. That should definately be in a separate table, with an "addressid" to link to the Station, etc tables. Otherwise, you are creating 3 separate copies of the same set of field definitions. That's bad news because it creates extra effort to keep them consistent. Like, what if initially you are only dealing with U.S. addresses (I'm an American so I'll assume U.S.), but later you find you also need to allow for Canadians. You'll need to expand the size of the postal code field and add a country code. If there's a common table, then you only have to do this once. If there isn't, then you have to do this three times. And it's likely that the "three times" is not just changing the database schema, but changing every place in your programs that processes an address.
One of the benefits of normalization is to minimize the impact of changes.
There are times when you want to denormalize to make queries more efficient. But this should be done very cautiously, only after you have good reason to believe that the fully normalized model creates serious inefficiency problems. In my humble experience, most programmers are far to quick to denormalize, usually with a quick "oh, breaking that out into a separate table is too much trouble".
I think in this situation it is OK to have address columns in each table. You'll hardly have an address which will be used more than two times. Most of the adresses will be used just one per entity.
But what could be in an extra table are names of streets, cities, countries...
And most important every train station, accomodoation and airport will probably have just one address so it's an n:1 relation.
I can only add one more constructive note to the answers already posted here. However you choose to normalize your database, that very process becomes almost trivial when the addresses are standardized (look the same). This is because as you endeavor to prevent duplicates, all the addresses that are actually the same do look the same.
Now, standardizing addresses is not trivial. There are CASS services which do this for you (for US addresses) which have been certified by the USPS. I actually work for SmartyStreets where this is our expertise, so I'd suggest you start your search there. You can either perform batch processing or use the API to standardize the addresses as you receive them.
Without something like this, your database may be normalized, but duplicate address data (whether correct or incomplete and invalid, etc) will still seep in because of the many, many forms they can take. If you have any further questions about this, I'll personally assisty you.
I expect the column to be a VARCHAR2, in my Oracle Database.
US Zips are 9.
Canadian is 7.
I am thinking 32 characters would be reasonable upper limit
What am I missing?
[EDIT]
TIL: 12 is a reasonable answer to the question
Thanks to everyone who contributed.
Skimming through Wikipedia's Postal Codes page, 32 characters should be more than enough. I would say even 16 characters is good.
As already raised by #neil-mcguigan, wikipedia has a decent page on the topic. Based on that 12 characters should do it: http://en.wikipedia.org/wiki/List_of_postal_codes
The wikipedia article lists ~254 countries, which is pretty good regarding UPU (Universal Postal Union) has 192 member countries.
Why would you declare a field size larger than the actual data you are expecting to store in it?
If the initial version of your application is going to support US and Canadian addresses (which I'm inferring from the fact that you call out those sizes in your question), I'd declare the field as VARCHAR2(9) (or VARCHAR2(10) if you intend to store the hyphen in ZIP+4 fields). Even looking at the posts others have made to postal codes across countries, VARCHAR2(9) or VARCHAR2(10) would be sufficient for the most if not all other countries.
Down the line, you can always ALTER the column to increase the length should the need arise. But it is generally hard to prevent someone, somewhere from deciding to get "creative" and stuff 50 characters into a VARCHAR2(50) field for one reason or another (i.e. because they want another line on a shipping label). You also have to deal with testing the boundary cases (will every application that displays a ZIP handle 50 characters?). And with the fact that when clients are retrieving data from the database, they are generally allocating memory based on the maximum size of the data that will be fetched, not the actual length of a given row. Probably not a huge deal in this specific case, but 40 bytes per row could be a decent chunk of RAM for some situations.
As an aside, you might also consider storing (at least for US addresses) the ZIP code and the +4 extension separately. It is generally useful to be able to generate reports by geographical region, and you may frequently want to put everything in a ZIP code together rather than breaking it down by the +4 extension. At that point, it's useful to not have to try to SUBSTR out the first 5 characters for the ZIP code.
Normalization? Postal codes might be used more than once, and might be related to street names or town names. Separate table(s).
What you're missing is a reason why you need the postal code to be handled specially.
If you don't really need to WORK with a postal code, I would suggest not worrying about it. By work, I mean do special processing for rather than just use to print address labels and so on.
Simply create three or four address fields of VARCHAR2(50) [for example] and let the user input whatever they want.
Do you really need to group your orders or transactions by postcode? I think not, since different countries have vastly different schemes for this field.
Canadian Postal Codes are only 6 characters, in the form of letter's and numbers (LNLNLN)
UK have published standards: UK Government Data Standards Catalogue
Max 35 characters per line
International Postal Address:
Minimum of 2 lines and maximum of 5 lines for the postal delivery point
details, plus 1 line for country and 1 line for postcode/zip code
The UK postal code length is:
Minimum 6 and Maximum 8 characters
If you want to integrate postal codes in database then geonames database is best to use. Even though it is tough to use and understand but it is the largest geographical database available freely to users like us.
All the other such database are more or less likely have same data and structure. They just remove some extra/redundant information from database. If you are just doing it for low load systems use their free services the limits are attractive and provides more easy interface using json and ajax. You can view the limits here
For your information varchar(20) is sufficient for storing postal codes
Are there any good references for best practices for storing postal addresses in an RDBMS? It seems there are lots of tradeoffs that can be made and lots of pros and cons to each to be evaluated -- surely this has been done time and time again? Maybe someone has at least written done some lessons learned somewhere?
Examples of the tradeoffs I am talking about are storing the zipcode as an integer vs a char field, should house number be stored as a separate field or part of address line 1, should suite/apartment/etc numbers be normalized or just stored as a chunk of text in address line 2, how do you handle zip +4 (separate fields or one big field, integer vs text)? etc.
I'm primarily concerned with U.S. addresses at this point but I imagine there are some best practices in regards to preparing yourself for the eventuality of going global as well (e.g. naming fields appropriately like region instead of state or postal code instead of zip code, etc.
For more international use, one schema to consider is the one used by Drupal Address Field. It's based on the xNAL standard, and seems to cover most international cases. A bit of digging into that module will reveal some nice pearls for interpreting and validating addresses internationally. It also has a nice set of administrative areas ( province, state, oblast, etc ) with ISO codes.
Here's the gist of the schema, copied from the module page:
country => Country (always required, 2 character ISO code)
name_line => Full name (default name entry)
first_name => First name
last_name => Last name
organisation_name => Company
administrative_area => State / Province / Region (ISO code when available)
sub_administrative_area => County / District (unused)
locality => City / Town
dependent_locality => Dependent locality (unused)
postal_code => Postal code / ZIP Code
thoroughfare => Street address
premise => Apartment, Suite, Box number, etc.
sub_premise => Sub premise (unused)
A lessons I've learned:
Don't store anything numerically.
Store country and administrative area as ISO codes where possible.
When you don't know, be lax about requiring fields. Some country may not use fields you take for granted, even basic things like locality & thoroughfare.
As an 'international' user, there is nothing more frustrating than dealing with a website that is oriented around only US-format addresses. It's a little rude at first, but becomes a serious problem when the validation is also over-zealous.
If you are concerned with going global, the only advice I have is to keep things free-form. Different countries have different conventions - in some, the house number comes before the street name, in some it comes after. Some have states, some regions, some counties, some combinations of those. Here in the UK, the zipcode is not a zipcode, it's a postcode containing both letters and numbers.
I'd advise simply ~10 lines of variable-length strings, together with a separate field for a postcode (and be careful how you describe that to cope with national sensibilities). Let the user/customer decide how to write their addresses.
If you need comprehensive information about how other countries use postal addresses, here's a very good reference link (Columbia University):
Frank's Compulsive Guide to Postal Addresses
Effective Addressing for International Mail
You should definitely consider storing house number as a character field rather than a number, because of special cases such as "half-numbers", or my current address, which is something like "129A" — but the A is not considered as an apartment number for delivery services.
I've done this (rigorously model address structures in a database), and I would never do it again. You can't imagine how crazy the exceptions are that you'll have to take into account as a rule.
I vaguely recall some issue with Norwegian postal codes (I think), which were all 4 positions, except Oslo, which had 18 or so.
I'm positively sure that from the moment we started using the geographically correct ZIP codes for all of our own national addresses, quite a few people started complaining that their mail arrived too late. Turned out those people were living near a borderline between postal areas, and despite the fact that someone really lived in postal area, say, 1600, in reality his mail should be addressed to postal area 1610, because in reality it was that neighbouring postal area that actually served him, so sending his mail to his correct postal area would take that mail a couple of days longer to arrive, because of the unwanted intervention that was required in the correct postal office to forward it to the incorrect postal area ...
(We ended up registering those people with an address abroad in the country with ISO-code 'ZZ'.)
Unless you are going to do maths on the street numbers or zip / postal codes, you are just inviting future pain by storing them as numerics.
You might save a few bytes here and there, and maybe get a faster index, but what do you when US postal, or whatever other country you are dealing with, decides the introduce alphas into the codes?
The cost of disk space is going to be a lot cheaper than the cost of fixing it later on... y2k anybody?
Adding to what #Jonathan Leffler and #Paul Fisher have said
If you ever anticipate having postal addresses for Canada or Mexico added to your requirements, storing postal-code as a string is a must. Canada has alpha-numeric postal codes and I don't remember what Mexico's look like off the top of my head.
You should certainly consult "Is this a good way to model address information in a relational database", but your question is not a direct duplicate of that.
There are surely a lot of pre-existing answers (check out the example data models at DatabaseAnswers, for example). Many of the pre-existing answers are defective under some circumstances (not picking on DB Answers at all).
One major issue to consider is the scope of the addresses. If your database must deal with international addresses, you have to be more flexible than if you only have to deal with addresses in one country.
In my view, it is often (which does not mean always) sensible to both record the 'address label image' of the address and separately analyze the content. This allows you to deal with differences between the placement of postal codes, for example, between different countries. Sure, you can write an analyzer and a formatter that handle the eccentricities of different countries (for instance, US addresses have 2 or 3 lines; by contrast, British addresses can have considerably more; one address I write to periodically has 9 lines). But it can be easier to have the humans do the analysis and formatting and let the DBMS just store the data.
Ive found that listing all possible fields from smallest discrete unit to largest is the easiest way. Users will fill in the fields they see fit. My address table looks like this:
*********************************
Field Type
*********************************
address_id (PK) int
unit string
building string
street string
city string
region string
country string
address_code string
*********************************
Where's the "trade off" in storing the ZIP as a NUMBER or VARCHAR? That's just a choice -- it's not a trade off unless there are benefits to both and you have to give up some benefits to get others.
Unless the sum of zips has any meaning at all, Zips as number is not useful.
This might be an overkill, but if you need a solution that would work with multiple countries and you need to programmatically process parts of the address:
you could have country specific address handling using two tables: One generic table with 10 VARCHAR2 columns, 10 Number columns, another table which maps these fields to prompts and has a country column tying an address structure to a country.
If you ever have to verify an address or use it to process credit card payments, you'll at least need a little structure. A free-form block of text does not work very well for that.
Zip code is a common optional field for validating payment card transactions without using the whole address. So have a separate and generously sized field for that (at least 10 chars).
Inspired by Database Answers
Line1
Line2
Line3
City
Country_Province
PostalCode
CountryId
OtherDetails
At the moment, I'm developing an international ecommerce website.
It should cover almost all addresses in this world as shown below:
*****************************************************************
Type Field name Displayed name in your form
*****************************************************************
INT id (PK)
VARCHAR(100) building Apt, office, suite, etc. (Optional)
VARCHAR(100) street Street address
VARCHAR(100) city City
VARCHAR(100) state State, province or prefecture
VARCHAR(100) zip_code Zip code
VARCHAR(100) country Country
*****************************************************************
I would just put all the fields together in a large NVARCHAR(1000) field, with a textarea element for the user to enter the value for (unless you want to perform analysis on eg. zip codes). All those address line 1, address line 2, etc. inputs are just so annoying if you have an address that doesn't fit well with that format (and, you know, there are other countries than the US).
What is the "best" way to store international addresses in a database? Answer in the form of a schema and an explanation of the reasons why you chose to normalize (or not) the way you did. Also explain why you chose the type and length of each field.
Note: You decide what fields you think are necessary.
Plain freeform text.
Validating all the world's post/zip codes is too hard; a fixed list of countries is too politically sensitive; mandatory state/region/other administrative subdivision is just plain inappropriate (all too often I'm asked which county I live in--when I don't, because Greater London is not a county at all).
More to the point, it's simply unnecessary. Your application is highly unlikely to be modelling addresses in any serious way. If you want a postal address, ask for the postal address. Most people aren't so stupid as to put in something other than a postal address, and if they do, they can kiss their newly purchased item bye-bye.
The exception to this is if you're doing something that's naturally constrained to one country anyway. In this situation, you should ask for, say, the { postcode, house number } pair, which is enough to identify a postal address. I imagine you could achieve similar things with the extended zip code in the US.
In the past I've modeled forms that needed to be international after the ups/fedex shipping address forms on their websites (I figured if they don't know how to handle an international order we are all hosed). The fields they use can be used as reference for setting up your schema.
In general, you need to understand why you want an address. Is it for shipping/mailing? Then there is really only one requirement, have the country separate. The other lines are freeform, to be filled in by the user. The reason for this is the common forwarding strategy for mail : any incoming mail for a foreign country is forwarded without looking at the other address lines. Hence, the detailed information is parsed only by the mail sorter located in the country itself. Like the receiver, they'll be familiar with national conventions.
(UPS may bunch together some small European countries, e.. all the Low Countries are probably served from Belgium - the idea still holds.)
I think adding country/city and address text will be fine. country and city should be separate for reporting. Managers always ask for these kind of reports which you do not expect and I dont prefer running a LIKE query through a large database.
Not to give Facebook undue respect. However, the overall structure of the database seems to be overlooked in many web applications launching every day. Obviously I don't think there is a perfect solution that covers all the potential variables with address structure without some hard work. That said, combined with autocomplete Facebook manages to take location input data and eliminate a majority of their redundant entries. They do this by organizing their database well enough to provide autocomplete information in a low cost, low error way to the client in real time allowing them to more or less choose the correct location from an existing list.
I think the best solution is to access a third party database which contains your desired geographic scope and use it to initially seed your user location information. This will allow you to avoid doing the groudwork of creating your own. With any luck you can reduce the load on your server by allowing your new users to receive the correct autocomplete information directly off your third party supplier. Eventually you will be able to fill most autocomplete for location information such as city, country, etc. from information contained in your own database from user input data.
You need to provide a bit more details about how you are planning to use the data. For example, fields like City, State, Country can either be text in the single table, or be codes which are linked to a separate table with a Foreign Key.
Simplest would be
Address_Line_01 (Required, Non blank)
Address_Line_02
Address_Line_03
Landmark
City (Required)
Pin (Required)
Province_District
State (Required)
Country (Required)
All the above can be Text/Unicode with appropriate field lengths.
Phone Numbers as applicable.