Although I am targeting MySQL/PHP, for the sake of my questions, I'd like to just apply this generally to any relational database that is being used in conjunction with a modern programming language. Another assumption would be that the language is leveraging a modern framework, which, on some level would handle foreign key constraints implicitly or have a means to do so explicitly.
My questions:
What are the pros and cons of creating FK constraints in the database itself as opposed to managing them at the application level?
From a design standpoint, should they ever both be used together or would that cause conflict?
If they should not be used together, what is considered the "best practice" in regards to which approach to use?
Note: This is a design theory question. Because of the wide variety of technology that could be used to satisfy an implementation, I'm not really interested in any specifics regarding an implementation.
What are the pros and cons of creating FK constraints in the database itself as opposed to managing them at the application level?
In a concurrent environment, it is surprisingly difficult to implement referential integrity in the application code, such that it is both correct and with good performance.
Unless you very carefully use locking, you are open to race conditions, such as:
Imagine there is currently one row in the parent table and no corresponding rows in the child.
Transaction T1 inserts a row in the child table, but does not yet commit. It can do that since there is a corresponding row in the parent table.
Transaction T2 deletes the parent row. It can do that since there are no child rows from its perspective (T1 hasn't committed yet).
T1 and T2 commit.
At this point, you have a child row without parent (i.e. broken referential integrity).
To remedy that, you can lock the parent row from both transactions, but that's likely to be less performant compared to the highly optimized FK implemented in the DBMS itself.
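To make that concrete, here is a minimal MySQL/InnoDB sketch of the race and of the two remedies (the parent/child tables and all names are invented purely for illustration):

    -- Illustrative schema; the FK is deliberately omitted at first.
    CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
    CREATE TABLE child  (id INT PRIMARY KEY, parent_id INT) ENGINE=InnoDB;
    INSERT INTO parent VALUES (1);

    -- Interleaving, run from two separate connections:
    -- T1: START TRANSACTION;
    -- T1: INSERT INTO child VALUES (10, 1);          -- app-level check passed: parent 1 exists
    -- T2: START TRANSACTION;
    -- T2: SELECT * FROM child WHERE parent_id = 1;   -- sees nothing, T1 has not committed
    -- T2: DELETE FROM parent WHERE id = 1;           -- app-level check passed too
    -- T1: COMMIT;
    -- T2: COMMIT;
    -- Result: child row 10 now references a parent that no longer exists.

    -- Application-level remedy: both transactions lock the parent row first
    -- (and re-check children after acquiring the lock), which serializes them
    -- at a cost in throughput:
    -- T1/T2: SELECT id FROM parent WHERE id = 1 FOR UPDATE;

    -- Database-level remedy: declare the constraint and let the DBMS enforce it:
    ALTER TABLE child ADD CONSTRAINT fk_child_parent
        FOREIGN KEY (parent_id) REFERENCES parent (id);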
On top of that, all your clients have to adhere to the same "locking protocol" (one misbehaving client is enough to corrupt the data). And the complexity rises rapidly if you have several levels of nested FKs or diamond-shaped FKs. Even if you implement referential integrity in triggers, you are only solving the "one misbehaving client" problem; the rest remains.
Another nice thing about database-level FKs is that they usually support referential actions such as ON DELETE CASCADE. And all that is simple and self-documenting, unlike referential integrity buried inside application code.
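For example, a database-level FK with a referential action might be declared roughly like this (a generic SQL sketch; the orders/order_items tables are invented for illustration):

    CREATE TABLE orders (
        order_id INT PRIMARY KEY
    );

    CREATE TABLE order_items (
        item_id  INT PRIMARY KEY,
        order_id INT NOT NULL,
        CONSTRAINT fk_items_order
            FOREIGN KEY (order_id) REFERENCES orders (order_id)
            ON DELETE CASCADE
            ON UPDATE CASCADE
    );

    -- Deleting an order automatically removes its line items,
    -- with no application code involved:
    DELETE FROM orders WHERE order_id = 42;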
From a design standpoint, should they ever both be used together or would that cause conflict?
You should always use database-level FKs. You could also use application level "pre-checks" if that benefits your user experience (i.e. you don't want to wait until the actual INSERT/UPDATE/DELETE to warn the user), but you should always code as if the INSERT/UPDATE/DELETE can fail even if your application-level check has passed.
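A rough SQL sketch of that pattern, with invented customers/orders tables (the FK on orders.customer_id is assumed to already be declared in the database):

    -- Optional application-level pre-check, purely for user experience:
    SELECT 1 FROM customers WHERE customer_id = 123;   -- warn the user early if missing

    -- ...but the real write must still be prepared to fail, because the
    -- customer can disappear between the check and the insert:
    INSERT INTO orders (order_id, customer_id) VALUES (9001, 123);
    -- may still raise a foreign-key violation (error 1452 in MySQL); handle it.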
If they should not be used together, what is considered the "best practice" in regards to which approach to use?
As I stated, always use database-level FKs. Optionally, you may also use application-level FKs "on top" of them.
See also: Sql - Indirect Foreign Key
Just how familiar are you with database design and the foreign key concept in general? An FK is a column (or set of columns) in one table that identifies a row in another table. (I'm pretty sure you already know this.) So an FK constraint is something that exists in the DB, not in the application. Managing FK constraints in the application requires manual coding of functionality that is already available in the DB. So why would you want to do all that manual labor? It also makes DB/application interaction and development much more difficult because of all that extra manual coding.
Best practice IMHO is to use the tools for what they were created to do. The DB takes care of the FKs' referential integrity, and the application doesn't need to concern itself with the DB's inner workings. However, if referential integrity is your main concern and you are, for example, using MySQL with the MyISAM engine, which doesn't support FK constraints, then you have to do some manual checking in the application (or maybe with DB triggers, which I am not familiar with). Just keep in mind that when you do all that checking in the application you still have to access the DB, and thus you use more resources than would really be needed if the DB could handle the referential integrity checks. (The easy solution of course would be to start using the InnoDB engine, but I'll stop here before this answer gets too product oriented.)
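For MySQL specifically, the choice comes down to the ENGINE clause in the table definition. A sketch with invented table names; as far as I know, MyISAM accepts FOREIGN KEY syntax but does not enforce it, while InnoDB does:

    -- With ENGINE=InnoDB the FOREIGN KEY below is actually enforced;
    -- with ENGINE=MyISAM the clause has historically been accepted but ignored.
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY
    ) ENGINE=InnoDB;

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    ) ENGINE=InnoDB;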
So some of the pros of letting the DB handle the FK constraints would be:
You don't have to think about it.
You don't have to manually code anything extra.
Application uses less resources and contains less code and thus...
... maintaining and developing both the DB and the application is a lot easier (for example the application developers don't need to understand database oriented concepts and functionalities so deeply, let the DB experts do the FK etc. thinking...).
What are the pros and cons of creating FK constraints in the database
itself as opposed to managing them at the application level?
Some of the pros of using db-enforced FKs:
Separation of schema from code.
Making application code smaller.
No chance for a programmer to mess with FK rules.
Forces other applications that integrate with the DB to follow the FK rules.
Some of the cons of having db-enforced FKs:
Not easy to bypass if you have a special case.
If data is not valid, errors could be thrown. The application should be coded to gracefully handle such errors (especially batch ones).
FK definitions with referential-integrity rules must be designed and coded carefully. You don't want to cascade-delete 1,000,000 rows online.
They cause an implicit check, even if you don't want that check to occur because you know the parent row must exist. The performance impact of this is usually trivial. Performance becomes an issue when loading huge data volumes in batch loads and in OLAP/data-warehouse systems; special load tools are used there, and constraints such as database-enforced FKs are usually disabled during the load (see the sketch below).
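To illustrate that last point, a possible MySQL-flavoured sketch of a batch load with FK validation switched off (the file path and table name are placeholders):

    -- Temporarily skip FK validation around a bulk load.
    -- Anything loaded while checks are off is NOT validated retroactively,
    -- so the load process must guarantee consistency itself.
    SET foreign_key_checks = 0;
    LOAD DATA INFILE '/tmp/orders.csv' INTO TABLE orders;   -- placeholder file and table
    SET foreign_key_checks = 1;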
From a design standpoint, should they ever both be used together or
would that cause conflict?
You could use them together for a reason. As I mentioned before, you may have special cases in your data that you can't define FKs for. Also, there are certain cases, such as many-to-many self-referencing relationships between tables, that cannot be handled by FKs (in some DB engines at least).
I was wondering how useful foreign keys really are in a database. Essentially, if the developers know what keys the different tables depend on, they can write the queries just as though there was a foreign key, right?
Also, I do see how foreign-key constraints help prevent all sorts of bugs with data integrity, but say for example, the programmers do a good job of preserving data integrity, how necessary are foreign keys really?
If you don't care about referential integrity then you are right. But.... you should care about referential integrity.
The problem is that people make mistakes. Computers do not.
Regarding your comment:
but say for example, the programmers do a good job of preserving data integrity
Someone will eventually make a mistake. No one is perfect. Also if you bring someone new in you aren't always sure of their ability to write "perfect" code.
In addition to that you lose the ability to do cascading deletes and a number of other features that having defined foreign keys allow.
I think that assuming that programmers will always preserve data integrity is a risky assumption.
There's no reason why you wouldn't create foreign keys, and being able to guarantee integrity instead of just hoping for integrity is reason enough.
Not using referential integrity in a database is like not using seatbelts in cars. Skipping it may give you some measurable improvement in getting from A to B, but it makes a "real" difference only in the most extreme cases. Why take the "risk" unless you really have to?
The underlying reason people ask this question is always performance.
Foreign keys give the optimizer much more information to work with, and it will potentially produce better execution plans. It's not that a specific query will be some fixed percentage faster with constraints enabled; it's more that you effectively eliminate entire classes of problems caused by bad execution plans. You also enable the optimizer to rewrite queries in ways that just aren't possible without the constraints (join elimination, for example).
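A sketch of what join elimination looks like, with invented orders/customers tables; only some optimizers perform this rewrite, and it relies on the FK being trusted and the joining column being NOT NULL:

    -- With a trusted FK orders.customer_id -> customers.customer_id (NOT NULL),
    -- every order matches exactly one customer, so an optimizer that supports
    -- join elimination can drop the join entirely for this query:
    SELECT o.order_id, o.order_date
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;
    -- No column from customers is selected, and the FK proves the join neither
    -- removes nor duplicates rows, so the plan can read only the orders table.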
Starting right here, I would like to start a myth that referential integrity always increases performance in databases. I'm fairly confident that if 100 people designed their databases with full integrity checking, fewer than 5 of them would actually have to spend even a whopping 1 second considering whether to disable it for performance reasons. Out of those 5 people, there will be close to 0 people who find that they need to disable 100% of the constraints.
Foreign keys are invaluable as a means of ensuring integrity, and even if you trust your developers to never (!) make errors the cost of having them is usually well worth it.
Foreign keys also serve as documentation, in that you can see what relates to what. This information is typically also used by tools, such as for generating reports, creating data sets from table definitions, object-relational mappers, etc. Even if you do not use any of these today, having FKs will make it easier to tread that path later.
Foreign keys also allow you to define cascade rules, which e.g. can be used to delete associated records in related tables when a row in one table is deleted.
Only if you have ridiculously high loads should you consider bypassing FKs.
Edit: updated answer to include points from other answers (reports, cascades).
You said
but say for example, the programmers do a good job of preserving data integrity
The expression you were looking for is, "I'm 100% certain that every programmer and every database administrator will manually preserve data integrity perfectly no matter what application touches this database, no matter how complex the database becomes, from now until the time it's decommissioned."
You don't have to use them but why wouldn't you?
They are there to help. From making life easier with cascade updates and cascade deletes, to guaranteeing that constraints aren't violated.
Maybe the application honors the constraints, but isn't it useful to have them clearly specified? You could document them, or you could put them in the database where most programmers expect to find constraints they are supposed to conform to (a better idea I think!).
Finally, if you ever need to import data into this database which doesn't go via the front-end, you may accidentally import data which violates the constraints and breaks the application.
I'd definitely not recommend skipping the relationships in a database.
Foreign keys make life so much easier when using report builders and data analysis tools. Just select one table, check the "include related tables" box and BAM! you've got your report built. OK, OK, it's not that easy, but they certainly save time in that respect.
Use constraints rather than application logic to enforce integrity because it is generally easier, cheaper and more reliable to maintain constraints in one place (the database) rather than in every application.
I understand from one of your comments that your motivation for asking the question is that you think leaving out the keys may make it easier to evolve the database design during development. In my experience you are wrong about that. I find that it's actually better to be more restrictive with constraints in the early stages of development. If in doubt, create the constraint because it's much easier to remove constraints later than it is to create them. Removing a constraint will tend to break fewer things than adding one and generally requires less testing and fewer code changes to achieve.
Another point to make is that when you scrap your current user interface and use a new one with shiny new tools, you won't lose your referential integrity because the new devs have no idea what should be related to what. Databases are generally in use much much longer than user interfaces. They are also often used by more than one application interface and then you have the problem of different interfaces trying to enforce different integrity rules.
I will also point out that I have had occasion to look at the data in, quite literally, hundreds of databases and have not found one yet that has good data if they didn't set up FKs. This bad data complicates reporting, it complicates imports and exports to and from clients and other third party vendors who need or provide the data. And if the bad data is in a financial area, it could also have legal and accounting implications. I can even remember one time the company had thousands of bad inventory records where the actual product that was stored was no longer identifiable (nor the location) which also created issues with defining the value of the inventory necessary for financial reporting. This is not only bad from a perspective of not knowing what parts you have on hand, but it enables people to steal parts without being caught simply by deleting the part number from the part table (this particular place didn't have auditing in place either.).
Folks have offered up some good answers above. However, one important point I didn't see mentioned is that foreign keys make your entity relationship diagrams (ERDs) easier to generate and much more meaningful. Without FKs, you either need to depict the FK relationships on your ERD manually (painful for you) or not at all (painful for others, and perhaps even for yourself once your memory of the implied FK relationships starts to fade over time). With FKs explicitly defined, most tools that automatically generate ERDs from database object definitions will automatically detect and depict the FK relationships.
Perhaps the question should be "How bad are orphan records?". In many cases orphaned records aren't really going to hurt anything. Yes these records may persist until the end of time but how bad is this really? Cascading updates or deletes are rarely useful features. Referential integrity sounds nice but I think is not as important as we have been lead to believe. The biggest benefit to FK's is the documentation they provide. In my experience FK's for referential integrity are way more trouble than they are worth.
I am having the same question today, and found many articles online talking about why you don't have to use foreign keys. But so far, 10 of the 11 answers here say you should have FKs.
I am not a db expert and just want to share some points I found online about when and why you don't have FKs:
Some points from "9 reasons why there are no foreign keys constraints":
Performance
Legacy data
Full table reload
Higher level framework
Cross database relations
Database platform agnostic
Open for change
Lazy architect
Keep model a secret
Some points from "At GitHub we do not use foreign keys, ever, anywhere":
FKs are in your way to shard your database.
FKs are a performance impact.
FKs don't work well with online schema migrations.
Note: I don't have any opinions. Just sharing some online articles to provide a different answer to most of the current ones.
In my past experience I have seen that most people don't use physical relationships in tables; they try to remember them and apply them through coding only.
Here 'Physical Relationships' refer to Primary Key, Foreign Key, Check constraints, etc.
While designing a database, people try to normalize the database on paper and keep things documented. Like, if I have to create a database for a marketing company, I will try to understand its requirements.
For example, what fields are mandatory, what fields will contain only (a or b or c) etc.
When all these things are clear, why are most people afraid of constraints?
Don't they want to manage things?
Do they lack the knowledge (which I don't think is the case)?
Are they not confident about future problems?
Is it really a tough job managing all these entities?
What is the reason in your opinion?
I always have the DBMS enforce both primary key and foreign key constraints; I often add check constraints too. As far as I am concerned, the data is too important to run the risk of inaccurate data being stored.
If you think of the database as a series of stored true logical propositions, you will see that if the database contains a false proposition - an error - then you can argue to any conclusion you want. From a false premise, any conclusion follows.
Why don't other people use PK and FK constraints, etc?
Some are unaware of their importance (so lack of knowledge is definitely a factor, even a major factor). Others are scared that they will cost too much in performance, forgetting that one error that has to be fixed may easily use up all the time saved by not having the DBMS do the checking for you. I take the view that if the current DBMS can't handle them well, it might be (probably is) time to change DBMS.
Many developers will check the constraints in code above the database before they actually go to perform an operation. Sometimes, this is driven by user experience considerations (we don't want to present choices / options to users that can't be saved to the database). In other cases, it may be driven by the pain associated with executing a statement, determining why it failed, and then taking corrective action. Most people would consider code more maintainable if it did the check upfront, along with other business logic that might be at play, rather than taking corrective action through an exception handler. (Not that this is necessarily an ideal line of thinking, but it is a prevalent one.) In any case, if you are doing the check in advance of issuing the statement, and not particularly conscious of the fact that the database might get touched by applications / users who are not coming in through your integrity-enforcing code, then you might conclude that database constraints are unnecessary, especially with the performance hit that could be incurred from their use. Also, if you are checking integrity in the application code above the database, one might consider it a violation of DRY (Don't Repeat Yourself) to implement logically equivalent checks in the database itself. The two manifestations of integrity rules (those in database constraints and those in application code above the database) could in principle become out-of-sync if not managed carefully.
Also, I would not discount option 2, that many developers don't know much about database constraints, too readily.
Well, I mean, everyone is entitled to their own opinion and development strategy I suppose, but in my humble opinion these people are almost certainly wrong :)
The reason, however, someone may wish to avoid constraints is efficiency. Not because constraints are slow, but because storing redundant data (i.e. caching) is a very effective way of speeding up (well, avoiding) an expensive calculation. This is an acceptable approach when implemented properly (i.e. the cache is updated at regular/appropriate intervals; generally I do this with a trigger).
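As a sketch of that trigger-maintained cache (all table and column names are invented, and the UPDATE/DELETE counterparts are omitted for brevity):

    -- Cache an order's total on the orders table so reads avoid recomputing the
    -- SUM; a trigger keeps the redundant column in step with the detail rows.
    ALTER TABLE orders ADD COLUMN cached_total DECIMAL(12,2) NOT NULL DEFAULT 0;

    DELIMITER //
    CREATE TRIGGER order_items_ai
    AFTER INSERT ON order_items
    FOR EACH ROW
    BEGIN
        UPDATE orders
        SET cached_total = cached_total + NEW.quantity * NEW.unit_price
        WHERE order_id = NEW.order_id;
    END//
    DELIMITER ;
    -- Matching AFTER UPDATE and AFTER DELETE triggers are needed as well
    -- to keep the cached value consistent.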
As to the motivation to not use FKs without a caching motivation, I can't imagine it. Perhaps they aim to be 'flexible' in their DB structure. If so, fine, but then don't use a relational DB, because it's pointless. Non-relational DBs (OO DBs) certainly have their place, and may even arguably be better (quite arguable, but interesting to argue), but it's a mistake to use a relational DB and not use its core properties.
I would always define PK and FK constraints, especially when using an ORM. It really makes life easy for everybody to let the ORM reverse-engineer the database instead of manually configuring it to use some PKs and FKs.
There are several reasons for not enforcing relationships in descending order of importance:
People-friendly error handling.
Your program should check constraints and send an intelligible message to the user. For some reason normal people don't like "SQL exception code -100013 goble rule violated for table gook".
Operational flexibility.
You don't really want your operators trying to figure out which order you must load your tables in at 3 a.m., nor do you want your testers pulling their hair out because they cannot reset the database back to its starting position.
Efficiency.
Checking constraints does consume I/O and CPU.
Functionality.
It's a cheap way to save details for later recovery. For instance, in an online order system you could leave the detail item rows in the table when the user kills a parent order; if he later reinstates the order, the details re-appear as if by a miracle -- you achieve this extra feature by deleting lines of code. (Of course you need some housekeeping process, but it is trivial!)
As things get more complex and more tables and relationships are needed in the database, how can you ensure the database developer remembers to check all of them? When you make a change to the schema that adds a new "informal" relationship, how can you ensure all the application code which might be affected gets changed?
Suddenly you could be deleting records that should stay because they have related data the developer forgot to check when writing the delete process, or because that process was in place before the last ten related tables were added to the schema.
It is foolhardy in the extreme to not formally set up PK/FK relationships. I process data received from many different vendors and databases. You can tell which ones have data integrity problems most likely caused by a failure to explicitly define relationships by the poor quality of their data.
One of the reasons why referential integrity should not be enforced is performance. Because the DB has to validate all updates against relationships, it just makes things slower. But what are the other pros and cons of enforcing and not enforcing?
Because relationships are maintained in the business logic layer anyway, it just makes them redundant for db to do it. What are your thoughts on it?
The database is responsible for data. That's it. Period.
If referential integrity is not done in the database, then it's not integrity. It's just trusting people not to do bad things, in which case you probably shouldn't even worry about password-protecting your data either :-)
Who's to say you won't get someone writing their own JDBC-connected client to totally screw up the data, despite your perfectly crafted and bug-free business layer (the fact that it probably won't be bug-free is another issue entirely, mandating that the DB should protect itself).
First of all, it's almost impossible to make it really work correctly. To have any chance of working right, you need to wrap a lot of the cascading modifications as transactions, so you don't have things out of sync while you've changed one part of the database, but are still updating others that depend on the first. This means code that should be simple and aware only of business logic suddenly needs to know about all sorts of concurrency issues.
Second, keeping it working is almost impossible to hope for -- every time anybody touches the business logic, they need to deal with those concurrency issues again.
Third, this makes the referential integrity difficult to understand -- in the future, when somebody wants to learn about your database structure, they'll have to reverse engineer it out of your business logic. With it in the database, it's separate, so what you have to look at only deals with referential integrity, not all sorts of unrelated issues. You have (for example) direct chains of logic showing what a modification to a particular field will trigger. At least for quite a few databases, that logic can be automatically extracted and turned into fairly useful documentation (e.g., tree diagrams showing dependencies). Extracting the same kind of information from the BLL is more likely to be a fairly serious project.
There are certainly some points in the other direction, and reasons to craft all of this by hand -- scalability and performance being the most obvious. When/if you go that route, however, you should be aware of what you're giving up to get that performance. In some cases, it's a worthwhile tradeoff -- but in other cases it's not, and you need information to make a reasoned decision.
Relationships may be maintained in a business logic layer. Unless you can guarantee 100% beyond any doubt that your BLL is and always will be bug-free, then you don't have data integrity. And you can't make that guarantee.
Also, if another app will ever touch your database, it isn't required to follow (read: reimplement, maybe in a subtlely wrong way) the rules in your BLL. It could corrupt the data, even if you somehow managed to be one of the 3 programmers on Earth to write bug-free code.
The database, meanwhile, enforces the same rules for everybody -- and rules enforced by the database are far less likely to be overlooked when you're updating, since the DB won't allow it.
Have a listen to Dan Pritchett, Technical Fellow at eBay on why certain database constructs such as transactions and referential integrity are not the mandates that textbooks might indicate they should be... It comes down to the types of data, the volume of queries and business requirements. Balance those and it will lead you to pragmatic solutions, not dogmatic answers...
However, do not assume that keeping relationships in the BLL will protect your data. You cannot guarantee that future developers won't expose new APIs that bypass the BLL for "performance" reasons, or simple lack of understanding of your architecture...
The performance assumption on which the question is based is incorrect as a general rule. Usually if you require RI to be enforced then the database is the most efficient place to do it, NOT the application - otherwise the application has to requery more data in order to be able to validate RI outside the database.
Also, RI constraints in the database are useful for the query optimiser for making other queries more efficient. Integrity constraints in the application can't achieve that.
Lastly, the cost of maintaining integrity constraints in every application is generally more expensive and complex than doing it once in one place.
But Colonel Ingus, if you've got the customer with an id in the session you've already probed the database! The problem is when you then write your sales order away, but didn't attach it to a product because you didn't probe for a product. One way or another you'll end up with orphaned records, just like the very large company I'm currently working for has. We have customers with no history and history with no customers; customers with outstanding balances who've never bought anything and goods sold to customers who don't exist - interesting business concepts - and it keeps a team of very frustrated support staff in full-time employment trying to sort it out. It would be far less expensive to have put RI on everything and bought a bigger box to sort out any perceived performance problems.
A lot has already been said about the fact that the DB should be the final place to validate/control your constraints (and I couldn't agree more)
If the data is important, then your application won't be the last to access the database and it won't be the only one.
But there is another very important fact about referential integrity (and other constraints): it documents your data model and makes the dependencies between the tables explicit.
As far as performance is concerned, defining FKs (or other constraints) in the database can make things even faster in certain cases, because the DBMS can rely on the constraints and make appropriate optimizations.
It depends on the data. If it's highly transactional data, such as business transactions and whatnot, where frequent updates are happening, then enforcing the business rules in the database is extremely important. But for everything else the performance impact may not be worth it.
What paxdiablo and dportas said. And my two cents. There are two other considerations.
In order to validate referential integrity for a new insert, you have to do a probe into the database to verify that the reference is valid. You just nullified the performance gain that led you to want to enforce integrity in the application. It's actually faster to let the DBMS enforce referential integrity.
Beyond that, consider the case where you have more than one application all reading and writing data in a single database. If you enforce referential integrity in the business application layer, you have to make sure that all of the applications do things right. Otherwise, some aberrant application could store invalid references, and the problem could surface when a different application went to use the data. That's a real mess.
Better to have the DBMS enforce the data rules for all the applications.
If you are maintaining the relationships in the business layer, you can guarantee that a few years down the pike you will have bad data in the database. The business layer is the worst possible place to do that.
Further, when you replace the business layer with something else you have to redefine all these things. Databases often outlast the original application they were written for by many years; put the correct relationships and constraints in the database where they belong.
What happens when you try to insert a record into the database and it fails referential integrity? You get an error from the database. Then you have to change your code so that it doesn't try to insert invalid data. To avoid ref integrity errors your code MUST know which data is which. Therefore, referential integrity is useless.
Walter Mitty said "In order to validate referential integrity for a new insert, you have to do a probe into the database to verify that the reference is valid." Sigh... this is complete nonsense. If I have a Customer object in the session (that's memory, aka RAM for some of you fellas), I know the Customer's ID and can use it to insert a SalesOrder object. There is no need to look up the Customer.
I am on a system now with tight referential integrity and Hibernate wrapped around it with its gross tentacles. It's the slowest system I have ever seen. I did not design it, and if I had, it would be many times faster AND easier to maintain. Hibernate sucks.
When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.
In the general sense, I think that overnormalized is when you are doing so many JOINs to retrieve data that it is causing notable performance penalties and deadlocks on your database, even after you've tuned the heck out of your indexes. Obviously, for huge applications and sites like MySpace or eBay, de-normalization is a scaling requirement.
As a developer for several small businesses, I tell you that in my experience it's always been easier to go from normalized -> denormalized than the other way around, and in fact going the other way around (to avoid duplication of data now that the business requirements have changed a year or so later) is much more difficult.
When I read general statements such as "you should put the address in your customers table instead of a separate address table so you can avoid the join", I shudder, because you just know that a year from now somebody's going to ask you to do something with addresses that you totally didn't foresee, like maintaining an audit trail or storing multiple addresses per customer. If your database allows you to create an indexed view, you can sidestep that issue until you get to the point where your dataset is so large that it can't possibly exist or be served by a single server or set of servers in a 1-write, many-read environment. For most of us, I don't think that scenario happens very often.
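For illustration, the kind of separate, FK-linked address table I have in mind might look like this (invented names; the view here is a plain one, and only some DBMSs such as SQL Server let you index a view):

    -- One customer, many addresses, with room for an audit trail later;
    -- the join can be hidden behind a view.
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_addresses (
        address_id   INT PRIMARY KEY,
        customer_id  INT NOT NULL,
        address_type VARCHAR(20) NOT NULL,      -- e.g. 'billing', 'shipping'
        street       VARCHAR(200),
        city         VARCHAR(100),
        postal_code  VARCHAR(20),
        valid_from   DATE NOT NULL,             -- supports a history/audit trail
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    );

    CREATE VIEW customer_current_address AS
    SELECT c.customer_id, c.name, a.street, a.city, a.postal_code
    FROM customers c
    JOIN customer_addresses a ON a.customer_id = c.customer_id;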
When in doubt, I aim for third normal form with some exceptions (for example, having a field contain a comma-separated list of strings because I know I'll never, ever look at the data from the other angle). When I need to consolidate, I'll look at my views or indexes first. Hope this helps.
It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.
There's one case where I can think of a prima facie case of overnormalization: say you have an order + order item, and the order item references a productID and leaves pricing to product.price. Since that introduces temporal coupling, you've normalized incorrectly, because the overnormalization affects already-shipped orders whenever prices change, unless prices absolutely never change. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
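One common way to remove that temporal coupling is to copy the price onto the line item at the time of sale; it looks like redundancy, but it actually records a different fact (the price charged, not the current list price). A sketch with invented names:

    CREATE TABLE products (
        product_id INT PRIMARY KEY,
        price      DECIMAL(10,2) NOT NULL      -- current list price, may change
    );

    CREATE TABLE order_items (
        order_id   INT NOT NULL,
        product_id INT NOT NULL,
        quantity   INT NOT NULL,
        unit_price DECIMAL(10,2) NOT NULL,     -- copied from products.price at sale time
        PRIMARY KEY (order_id, product_id),
        FOREIGN KEY (product_id) REFERENCES products (product_id)
        -- FK to an orders table omitted for brevity
    );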
The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.
Normalization is absolute. A database follows Normal Forms or it does not. There are a half-dozen normal forms. Mostly, they have names like First through Fifth. Plus there's a Boyce-Codd Normal Form.
Normalization exists for precisely one purpose -- to prevent "update anomalies".
Normalization isn't subjective. It isn't a judgement. Each table and relationship among tables either does or does not follow a normal form.
Consequently, you can't be "over-normalized" or "under-normalized".
Having said that, normalization has a performance cost. Some people elect to denormalize in various ways to improve performance. The most common sensible denormalization is to break 3NF and include derived data.
A common mistake is to break 2NF and have duplicate copies of a functional dependency between a key and non-key value. This requires extra updates or -- worse -- triggers to keep the copies in parallel.
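To make the contrast concrete, here is a small sketch (invented tables) of the sensible 3NF break versus the 2NF mistake described above:

    -- Sensible denormalization (breaks 3NF): a derived value stored for speed.
    -- invoice_total is fully determined by the invoice's line items and must
    -- be kept in sync when they change.
    CREATE TABLE invoices (
        invoice_id    INT PRIMARY KEY,
        invoice_total DECIMAL(12,2) NOT NULL
    );

    -- Common mistake (breaks 2NF): product_name depends only on product_id,
    -- not on the whole (order_id, product_id) key, so it is duplicated on
    -- every line and must be updated everywhere when the name changes.
    CREATE TABLE order_lines (
        order_id     INT NOT NULL,
        product_id   INT NOT NULL,
        product_name VARCHAR(100) NOT NULL,    -- duplicate copy of a fact about product_id
        quantity     INT NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );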
Denormalization of a transactional database should be a case-by-case situation.
A data warehouse, also, rarely follows any of the transactional normalization rules because it's (essentially) never updated.
"Over-normalization" could mean that a database is too slow because of a large number of joins. This may also mean that the database has outgrown the hardware. Or that the applications haven't been designed to scale.
The most common issue here is that folks try to use a transactional database for reporting while transactions are going on. The locking for transactions interferes with reporting.
"Under-normalization," however, means that there are NF violations and needless processing is being done to handle the replicated data and correct update anomalies.
When the performance cost exceeds the benefit towards the application's intended purpose.
Normalize your OLTP databases, and denormalize your OLAP databases. Each has a mission that dictates its schema. Like normalized transaction databases, data warehouses exist for a reason. A complete system needs both.
A lot of people are talking about performance. I think a key issue is flexibility. In general, the more normalized your database, the more flexible it is.
We currently use an "over-normalized" database because, in our operating environment, client requirements change on a monthly basis. By "over-normalizing" we can adapt our software accordingly, without changing the database structure.
My take on this:
Always normalize as much as you are able to. I usually go crazy on normalization and try to design something that could handle every thinkable future extension. What I end up with is a database design that is extremely flexible... and impossible to implement.
Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.
This way you know what you sacrifice to make the design usable.
Edit: Documentation! I forgot to mention that documenting the de-normalization is very important. It is extremely helpful when you take over a project to know the reasons behind the choices.
Third Normal Form (3NF) is considered the optimal level of normalization for many a relational database application. This is a state in which, as Bill Kent once summarized, every "non-key field [in every table within a particular relational database management system, or RDBMS] must provide a fact about the key, the whole key, and nothing but the key." 3NF is a term that was introduced by E.F. Codd, inventor of the relational model for database management. Generally, the data that a software application depends on, especially an application used for an Online Transaction Processing (OLTP) system, will fare well in 3NF. This normal form by definition reduces database size by calling for minimum repetition of row/column data, and maximizes query efficiency and ease of application maintenance. 3NF achieves that by requiring that a database's tables (i.e., its schema) be broken down into separate tables related by primary/foreign keys--basically until Kent's rule holds true (well, I've stated it this way for ease of reading, but the actual definition of 3NF is much more detailed than that).

In contrast, overnormalization implies increasing the number of joins required in a query between related tables. This comes as a result of breaking the database schema down to a much more granular level than 3NF. However, though normalization past the 3rd degree can often be considered overnormalization, the negative connotation of the term "overnormalization" can sometimes be unwarranted. Overnormalization may be desirable in some applications which by design require 4NF (and beyond) due to the complexity and versatility of the application software. An example is a highly customizable and extensible commercial database program for some industry, sold to end users requiring an open API. But then the reverse can be desirable as well--that is, denormalization--most notably when designing an Online Analytical Processing (OLAP) database used strictly to summarize data from an OLTP database just for querying/reporting, such as a data warehouse. In this case the data must by necessity reside in a highly denormalized format (i.e., 1NF or 2NF).

It's often under these constraints--when there are high demands for efficient querying and reporting--that we find database and application programmers calling a database "overnormalized". But as Redgate's Tony Davis once said--taking into account today's much more advanced and efficient database software and storage systems--"the performance hit from multiple joins in a query is negligible. If your database is slow, it isn't because it is 'over-normalized'!"

So in conclusion, this characterization--overnormalization--isn't an absolute one, and it is dependent on the way it is used in the application. In Kent's words, "The normalization rules are designed to prevent update anomalies and data inconsistencies. . . [but] there is no obligation to fully normalize all records when actual performance requirements are taken into account. . . The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. . . [Thus,] the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications."
..or hitting limits on the number of joins your RDBMS will do.
If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
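A minimal sketch of that approach (invented table and column names); the flattened table is rebuilt on whatever schedule the reports can tolerate:

    -- Flatten several normalized tables into one reporting table so the
    -- report queries need no joins at all.
    CREATE TABLE report_sales AS
    SELECT o.order_id,
           o.order_date,
           c.name AS customer_name,
           p.name AS product_name,
           i.quantity,
           i.unit_price
    FROM orders o
    JOIN customers   c ON c.customer_id = o.customer_id
    JOIN order_items i ON i.order_id    = o.order_id
    JOIN products    p ON p.product_id  = i.product_id;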
In my experience, I've never seen a database that normalizes postal addresses, as it's usually acceptable to store the address as a string. Ideally, there would be tables for countries, counties/states, cities, districts and streets. I've not come across anyone who needs to report at street level, so it hasn't been necessary. The addresses have only been used for postal contact, so they are treated as a single entity.