Database design: an art or a headache? (Managing relationships) - sql-server

In my past experience I have seen that most people don't use physical relationships in tables; they try to remember the relationships and enforce them through application code only.
Here 'physical relationships' refers to primary keys, foreign keys, check constraints, etc.
While designing a database, people try to normalize the design on paper and keep things documented. For instance, if I have to create a database for a marketing company, I will try to understand its requirements.
For example, which fields are mandatory, which fields will contain only (a or b or c), and so on.
When all of this is clear, why are most people still afraid of constraints?
Don't they want to manage things?
Do they lack the knowledge (which I don't think is the case)?
Are they not confident about future problems?
Is it really such a tough job to manage all these entities?
What is the reason, in your opinion?

I always have the DBMS enforce both primary key and foreign key constraints; I often add check constraints too. As far as I am concerned, the data is too important to run the risk of inaccurate data being stored.
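For concreteness, here is a minimal T-SQL sketch of the kind of declarative constraints being discussed; the table and column names (Customers, Orders, Status) are hypothetical, not taken from the question.
    -- Primary key, unique, foreign key and check constraints declared in the database,
    -- so every application that touches these tables gets the same rules enforced.
    CREATE TABLE dbo.Customers (
        CustomerID int          NOT NULL CONSTRAINT PK_Customers PRIMARY KEY,
        Email      varchar(255) NOT NULL CONSTRAINT UQ_Customers_Email UNIQUE
    );

    CREATE TABLE dbo.Orders (
        OrderID    int         NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
        CustomerID int         NOT NULL
            CONSTRAINT FK_Orders_Customers REFERENCES dbo.Customers (CustomerID),
        Status     varchar(10) NOT NULL
            CONSTRAINT CK_Orders_Status CHECK (Status IN ('new', 'paid', 'shipped'))
    );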
If you think of the database as a series of stored true logical propositions, you will see that if the database contains a false proposition - an error - then you can argue to any conclusion you want. From a false premise, any conclusion follows.
Why don't other people use PK and FK constraints, etc?
Some are unaware of their importance (so lack of knowledge is definitely a factor, even a major factor). Others are scared that they will cost too much in performance, forgetting that one error that has to be fixed may easily use up all the time saved by not having the DBMS do the checking for you. I take the view that if the current DBMS can't handle them well, it might be (probably is) time to change DBMS.

Many developers will check the constraints in code above the database before they actually go to perform an operation. Sometimes this is driven by user experience considerations (we don't want to present choices/options to users that can't be saved to the database). In other cases, it may be driven by the pain associated with executing a statement, determining why it failed, and then taking corrective action. Most people would consider code more maintainable if it did the check upfront, along with other business logic that might be at play, rather than taking corrective action through an exception handler. (Not that this is necessarily an ideal line of thinking, but it is a prevalent one.)
In any case, if you are doing the check in advance of issuing the statement, and are not particularly conscious of the fact that the database might get touched by applications/users who are not coming in through your integrity-enforcing code, then you might conclude that database constraints are unnecessary, especially with the performance hit that could be incurred from their use. Also, if you are checking integrity in the application code above the database, one might consider it a violation of DRY (Don't Repeat Yourself) to implement logically equivalent checks in the database itself. The two manifestations of integrity rules (those in database constraints and those in application code above the database) could in principle become out of sync if not managed carefully.
Also, I would not discount option 2, that many developers don't know much about database constraints, too readily.

Well, I mean, everyone is entitled to their own opinion and development strategy I suppose, but in my humble opinion these people are almost certainly wrong :)
The reason, however, someone may wish to avoid constraints is efficiency. Not because constraints are slow, but because storing redundant data (i.e. caching) is a very effective way of speeding up (well, avoiding) an expensive calculation. This is an acceptable approach when implemented properly (i.e. the cache is updated at regular/appropriate intervals; generally I do this with a trigger).
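As a rough illustration of that trigger-maintained cache idea (the Orders/Customers tables and the OrderCount column are invented for this sketch, not part of the question):
    -- Keep a cached order count on Customers so reads don't have to re-aggregate Orders.
    CREATE TRIGGER dbo.trg_Orders_MaintainOrderCount
    ON dbo.Orders
    AFTER INSERT, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recalculate the cache only for the customers touched by this statement.
        UPDATE c
        SET    c.OrderCount = (SELECT COUNT(*) FROM dbo.Orders o WHERE o.CustomerID = c.CustomerID)
        FROM   dbo.Customers AS c
        WHERE  c.CustomerID IN (SELECT CustomerID FROM inserted
                                UNION
                                SELECT CustomerID FROM deleted);
    END;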
As to the motivation for not using FKs when there is no caching motivation, I can't imagine it. Perhaps they aim to be 'flexible' in their DB structure. If so, fine, but then don't use a relational DB, because it's pointless. Non-relational DBs (OO DBs) certainly have their place, and may even arguably be better (quite arguable, but interesting to argue), but it's a mistake to use a relational DB and not use its core properties.

I would always define PK and FK constraints, especially when using an ORM. It really makes life easier for everybody to let the ORM reverse-engineer the database instead of manually configuring it to use certain PKs and FKs.

There are several reasons for not enforcing relationships, in descending order of importance:
People-friendly error handling.
Your program should check constraints and send an intelligible message to the user. For some reason normal people don't like "SQL exception code -100013 goble rule violated for table gook". (A sketch of translating such an error into a friendly message follows this list.)
Operational flexibility.
You don't really want your operators trying to figure out which order you must load your tables in at 3 a.m., nor do you want your testers pulling their hair out because they cannot reset the database back to its starting position.
Efficiency.
Checking constraints does consume I/O and CPU.
Functionality.
It's a cheap way to save details for later recovery. For instance, in an online order system you could leave the detail item rows in the table when the user kills a parent order; if he later reinstates the order, the details reappear as if by a miracle -- you achieve this extra feature by deleting lines of code. (Of course you need some housekeeping process, but it is trivial!)
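On the first point, one way to keep the database-enforced constraint and still give users an intelligible message is to catch the violation and translate it. A hedged T-SQL sketch (the procedure, tables and message text are made up; 547 is SQL Server's error number for constraint conflicts):
    CREATE PROCEDURE dbo.usp_DeleteCustomer @CustomerID int
    AS
    BEGIN
        BEGIN TRY
            DELETE FROM dbo.Customers WHERE CustomerID = @CustomerID;
        END TRY
        BEGIN CATCH
            IF ERROR_NUMBER() = 547  -- foreign key / check constraint violation
            BEGIN
                RAISERROR('This customer still has orders and cannot be deleted.', 16, 1);
            END
            ELSE
            BEGIN
                THROW;  -- re-raise anything unexpected unchanged
            END
        END CATCH
    END;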

As things get more complex and more tables and relationships are needed in the database, how can you ensure the database developer remembers to check all of them? When you make a change to the schema that adds a new "informal" relationship, how can you ensure all the application code which might be affected gets changed?
Suddenly you could be deleting records that should stay, because they have related data the developer forgot to check when writing the delete process, or because that process was in place before the last ten related tables were added to the schema.
It is foolhardy in the extreme not to formally set up PK/FK relationships. I process data received from many different vendors and databases. You can tell, by the poor quality of their data, which ones have data integrity problems most likely caused by a failure to explicitly define relationships.

Related

Does putting integrity constraints decrease performance?

In a discussion with a friend, I got to hear two things -
Using constraints causes a slight decrease in performance. E.g., consider a uniqueness constraint: before insertion, the DBMS has to check for uniqueness against all existing data, causing extra computation.
He suggested making sure these constraints are handled in the application-level logic itself, e.g. deleting rows from both tables yourself properly, instead of putting a foreign key integrity constraint in place, etc.
The first one sounds a little logical to me, but the second one seems pretty wrong intuitively. I don't have enough experience with DBMSs to really judge these claims, though.
Q. Is claim 1 correct? If so, is claim 2 even the right way to handle such scenarios?
TL;DR
If your data needs to be correct, you need to enforce the constraints, and if you need to enforce the constraints, letting the database do it for you will be faster than anything else (and likely more correct too).
Example
Attempting to enforce something like key uniqueness at the application-level can be done correctly or quickly, but not both. For example, let's say you want to insert a new row. A naive application-level algorithm could look something like this:
Search the table for the (key fields of) new row.
If not found, insert the new row.
And that would actually work in a single-client / single-threaded environment. However, in a concurrent environment, some other client could write that same key value in between your steps 1 and 2, and presto: you have yourself a duplicate in your data without even knowing it!
To prevent such a race condition, you'd have to use some form of locking, and since you are inserting a new row, there is no row to lock yet - you'll likely end up locking the entire table, destroying scalability in the process.
OTOH, if you let the DBMS do it for you, it can do it in a special way without too much locking, which has been tested and double-tested for correctness in all the tricky concurrent edge cases, and whose performance has been optimized over the time the DBMS has been on the market.
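Here is a hedged T-SQL sketch of that race (table, column and variable names are invented): two sessions can both pass the existence check before either inserts, whereas a unique constraint closes the window atomically.
    -- Naive application-style check-then-insert: not safe under concurrency.
    DECLARE @Email varchar(255) = 'a@example.com', @Name varchar(100) = 'Alice';
    IF NOT EXISTS (SELECT 1 FROM dbo.Users WHERE Email = @Email)
        INSERT INTO dbo.Users (Email, Name) VALUES (@Email, @Name);
    -- Another session can insert the same @Email between the check and the insert.

    -- Letting the DBMS enforce it instead: a concurrent duplicate simply fails.
    ALTER TABLE dbo.Users
        ADD CONSTRAINT UQ_Users_Email UNIQUE (Email);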
Similar concerns exist for foreign keys as well.
So yeah, if your application is the only one accessing the database (e.g. when using an embedded database), you may get away with application-level enforcement, although why would you if the DBMS can do it for you?
But in a concurrent environment, leave keys and foreign keys to the database - you'll have plenty of work anyway, enforcing your custom "business logic" (that is not directly "declarable" in the DBMS) in a way that is both correct and performant...
That being said, feel free to perform any application-level "pre-checks" that benefit your user experience. But do them in addition to database-level constraints, not instead of them.
Claim 1 is correct, claim 2 is incorrect, just like you concluded.
The database's job is to handle the data and its integrity. The app's job is to ask the database about the data and then perform work with that data.
If you handle #2 through the application:
you have to handle concurrency - what happens when there's more than one connection active to the db? You need to lock tables to perform operations ensuring uniqueness or integrity. Since a connection can break at any time, you've got a huge problem on your hands. How do you unlock tables when the process that locked them has died?
you can't do a better job from the app than the database can on its own. You still need to check the rows for uniqueness, meaning that you need to retrieve all the data, perform the check on the whole dataset and then write it. You can't do anything better or faster than the database can - by definition, it will be slower, since you need to transfer the data from the db to your app
databases are made with concurrency in mind. Creating optimizations using your friend's logic is what leads to unstable apps, duplicate data, unresponsive databases etc. Never do that. Let the db do its job; it's made for such purposes.
When checking for uniqueness, MySQL uses indexes, a data structure made for fast access. The speed at which MySQL performs a uniqueness check is beyond what any app can achieve - it's simply going to do the work faster. If you need unique data, you need to ensure that you have unique data - this is a workload that can't be avoided, and the people who develop databases use proven algorithms designed for speed. It works at optimum speed already.
As for integrity - the same: MySQL (or any other RDBMS) is made to handle such scenarios. If foreign key constraints were better implemented in app logic, then we'd never have had FKs available to us in the first place. Like I mentioned before - the database's job is to take care of that.
ACID for relational databases isn't there for no reason. MySQL's InnoDB implements and allows for atomicity, consistency, isolation and durability; if you need them, you use them. There's no app in any language that anyone can create which performs better in any way than MySQL's internal handling of those.
TL;DR: you are correct in your thinking.
Yes, it's true that checking a constraint is going to take time and slow down database updates.
But it's not at all clear how moving this logic to the application will result in a net performance improvement. Now you have at least two separate trips to the database: one to check the constraint and another to perform the update. Every trip to the database costs: It takes time to make a connection, it takes time for the database engine to parse the query and construct a query plan, it takes time to send results back. As the database engine doesn't know what you're doing or why, it can't optimize. In practice, one "big visit" is almost always cheaper than two "small visits" that accomplish the same thing.
I'm speaking here mostly of uniqueness constraints and relational integrity constraints. If you have a constraint that can be tested without visiting the database, like a range limit on an individual field, it would be faster to do that in the application. Maybe still not a good idea for a variety of reasons, but it would be faster.
Constraints do generally cause a slight decrease in performance. Nothing is free. There are, however, two important considerations:
The performance hit is usually so slight that it is lost in the "noise" of the natural variability of a running system so it would take tests involving thousands or millions of test queries to determine the difference.
One has to ask "Affects the performance where?" Constraints affect the performance of DML operations. But if the constraints were not there, then every query would have to perform additional testing to verify the accuracy of the data being read. I can assure you, this will come at a far greater performance cost than the constraints.
There are exceptions, of course, but most databases are queried a lot more often than modified. So if you can shift performance hits from queries to DML, you generally speed up the overall performance of the system.
Perform separate constraint checking at the app level by all means. It is a tremendous benefit to provide the user with feedback during the process of collecting data ("Delivery date cannot be in the past!") rather than waiting until the attempt to insert the data into the database fails.
But that doesn't mean remove them from the database. This redundancy is important. Can you absolutely guarantee that the only operations ever performed on the database will originate from the app? Absolutely not. There is too much normal maintenance activity going on outside the app to make that promise. Not to mention that there is generally more than one app, so the guarantee must apply to each one. Too many loose ends.
When designing a database, data integrity is your number one priority. Never sacrifice it for the sake of performance, especially since performance of a well-designed database is not often an issue, and even when it is, there are far too many ways to improve performance that do not involve removing constraints (or denormalizing, another mistake many still make in order to improve the performance of an OLTP system).
Q. Is claim 1 correct?
Yes. In my experience, using constraints can cause a massive decrease in performance.
The performance impact is relative to the number of constraints and records in the tables. As table records grow, performance is impacted, and DB performance can go from great to bad fast.
For example: in one auditing company I worked for, part of the process was to serialize an Excel matrix containing a large number of responsibilities/roles/functions into a set of tables which had many FK constraints.
Initially the performance was fine, but within 6 months to a year this serialization process took a few minutes to complete. We optimised as much as we could, with little effect. If we switched off the constraints, the process completed in a few seconds.
If so (if claim 1 is correct), is claim 2 even the right way to handle such scenarios?
Yes, but only under certain circumstances, such as when:
You have a large number of constraints.
You have a large / ever-growing number of records in your DB tables.
The DB hardware provided cannot be improved upon for whatever reason, and you are experiencing performance problems.
So with the performance problem we had at the auditing company, we looked at moving the constraint checks into an application dataset. In essence, the dataset was used to check and validate the constraints, and the matrix DB tables were used simply for storage (and processing).
NOTE: This worked for us because the matrix data never changed once inserted and each matrix was independent of all other past inserted matrices.

Do you absolutely need foreign keys in a database?

I was wondering how useful foreign keys really are in a database. Essentially, if the developers know what keys the different tables depend on, they can write the queries just as though there was a foreign key, right?
Also, I do see how foreign-key constraints help prevent all sorts of bugs with data integrity, but say, for example, the programmers do a good job of preserving data integrity - how necessary are foreign keys really?
If you don't care about referential integrity then you are right. But.... you should care about referential integrity.
The problem is that people make mistakes. Computers do not.
Regarding your comment:
but say for example, the programmers do a good job of preserving data integrity
Someone will eventually make a mistake. No one is perfect. Also if you bring someone new in you aren't always sure of their ability to write "perfect" code.
In addition to that you lose the ability to do cascading deletes and a number of other features that having defined foreign keys allow.
I think that assuming that programmers will always preserve data integrity is a risky assumption.
There's no reason why you wouldn't create foreign keys, and being able to guarantee integrity instead of just hoping for integrity is reason enough.
Not using referential integrity in a database is like not using seatbelts in cars. It will provide you with measurable improvements in taking you from A->B, but it will make "real" difference only in the most extreme cases. Why take the "risk" unless you really have to?
The underlying reason people ask this question is always performance.
Foreign keys give the optimizer much more information to work with, and it will potentially produce better execution plans. It's not that a specific query will be X percent faster with constraints enabled; it's more that you effectively eliminate entire classes of problems caused by bad execution plans. You also enable the optimizer to rewrite queries in ways that just aren't possible without the constraints (join elimination, for example).
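For example (a hedged sketch; the Orders/Customers tables are invented): with a trusted foreign key and a NOT NULL FK column, SQL Server can eliminate a join entirely when the query never reads columns from the parent table.
    -- FK_Orders_Customers is trusted and Orders.CustomerID is NOT NULL,
    -- so the optimizer knows each order has exactly one matching customer.
    SELECT o.OrderID, o.OrderDate
    FROM   dbo.Orders    AS o
    JOIN   dbo.Customers AS c ON c.CustomerID = o.CustomerID;
    -- No Customers columns are referenced, so the plan can skip the join to Customers.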
Starting right here, I would like to start a myth that referential integrity always increases performance in databases. I'm fairly confident that if 100 people designed their databases with full integrity checking, fewer than 5 of them would ever have to spend a whopping 1 second considering disabling the constraints for performance reasons. Out of those 5 people, close to 0 will find that they need to disable 100% of the constraints.
Foreign keys are invaluable as a means of ensuring integrity, and even if you trust your developers to never (!) make errors the cost of having them is usually well worth it.
Foreign keys also serve as documentation, in that you can see what relates to what. This information is typically also used by tools, such as for generating reports, creating data sets from table definitions, object-relational mappers, etc. Even if you do not use any of these today, having FKs will make it easier to tread that path later.
Foreign keys also allow you to define cascade rules, which can, for example, be used to delete associated records in related tables when a row in one table is deleted.
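A minimal sketch of such a cascade rule in T-SQL (table and constraint names are assumed for illustration):
    -- Deleting an order automatically removes its line items as well.
    ALTER TABLE dbo.OrderItems
        ADD CONSTRAINT FK_OrderItems_Orders
        FOREIGN KEY (OrderID) REFERENCES dbo.Orders (OrderID)
        ON DELETE CASCADE;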
Only if you have ridiculously high loads should you consider bypassing FKs.
Edit: updated answer to include points from other answers (reports, cascades).
You said
but say for example, the programmers do a good job of preserving data integrity
The expression you were looking for is, "I'm 100% certain that every programmer and every database administrator will manually preserve data integrity perfectly no matter what application touches this database, no matter how complex the database becomes, from now until the time it's decommissioned."
You don't have to use them but why wouldn't you?
They are there to help. From making life easier with cascade updates and cascade deletes, to guaranteeing that constraints aren't violated.
Maybe the application honors the constraints, but isn't it useful to have them clearly specified? You could document them, or you could put them in the database where most programmers expect to find constraints they are supposed to conform to (a better idea I think!).
Finally, if you ever need to import data into this database which doesn't go via the front end, you may accidentally import data which violates the constraints and breaks the application.
I'd definitely not recommend skipping the relationships in a database.
Foreign keys make life so much easier when using report builders and data analysis tools. Just select one table, check the "include related tables" box and BAM! you've got your report built. OK, OK, it's not that easy, but they certainly save time in that respect.
Use constraints rather than application logic to enforce integrity because it is generally easier, cheaper and more reliable to maintain constraints in one place (the database) rather than in every application.
I understand from one of your comments that your motivation for asking the question is that you think leaving out the keys may make it easier to evolve the database design during development. In my experience you are wrong about that. I find that it's actually better to be more restrictive with constraints in the early stages of development. If in doubt, create the constraint because it's much easier to remove constraints later than it is to create them. Removing a constraint will tend to break fewer things than adding one and generally requires less testing and fewer code changes to achieve.
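To illustrate that asymmetry with a hedged T-SQL sketch (constraint and table names are hypothetical): dropping a constraint is a quick metadata change, while adding one back later has to validate every existing row and can fail if bad data has already crept in.
    -- Removing a constraint later: quick, nothing to re-check.
    ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers;

    -- Adding it back: every existing row must pass, or the ALTER fails.
    ALTER TABLE dbo.Orders
        WITH CHECK ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerID) REFERENCES dbo.Customers (CustomerID);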
Another point to make is that when you scrap your current user interface and replace it with a new one built with shiny new tools, you won't lose your referential integrity just because the new devs have no idea what should be related to what. Databases are generally in use much, much longer than user interfaces. They are also often used by more than one application interface, and then you have the problem of different interfaces trying to enforce different integrity rules.
I will also point out that I have had occasion to look at the data in, quite literally, hundreds of databases and have not found one yet that has good data if they didn't set up FKs. This bad data complicates reporting, it complicates imports and exports to and from clients and other third party vendors who need or provide the data. And if the bad data is in a financial area, it could also have legal and accounting implications. I can even remember one time the company had thousands of bad inventory records where the actual product that was stored was no longer identifiable (nor the location) which also created issues with defining the value of the inventory necessary for financial reporting. This is not only bad from a perspective of not knowing what parts you have on hand, but it enables people to steal parts without being caught simply by deleting the part number from the part table (this particular place didn't have auditing in place either.).
Folks have offered up some good answers above. However, one important point I didn't see mentioned is that foreign keys make your entity relationship diagrams (ERDs) easier to generate and much more meaningful. Without FKs, you either need to depict the FK relationships on your ERD manually (painful for you) or not at all (painful for others, and perhaps even for yourself once your memory of the implied FK relationships starts to fade over time). With FKs explicitly defined, most tools that automatically generate ERDs from database object definitions will automatically detect and depict the FK relationships.
Perhaps the question should be "How bad are orphan records?". In many cases orphaned records aren't really going to hurt anything. Yes, these records may persist until the end of time, but how bad is this really? Cascading updates or deletes are rarely useful features. Referential integrity sounds nice, but I think it is not as important as we have been led to believe. The biggest benefit of FKs is the documentation they provide. In my experience, FKs for referential integrity are way more trouble than they are worth.
I had the same question today, and found many articles online talking about why you don't have to use foreign keys. But so far, 10 of 11 answers here say you should have FKs.
I am not a db expert and just want to share some points I found online about when and why you might not have FKs:
Some points from 9 reasons why there are no foreign keys constraints:
Performance
Legacy data
Full table reload
Higher level framework
Cross database relations
Database platform agnostic
Open for change
Lazy architect
Keep model a secret
Some points from At GitHub we do not use foreign keys, ever, anywhere.
FKs are in your way to shard your database.
FKs are a performance impact.
FKs don't work well with online schema migrations.
Note: I don't have any opinions. Just sharing some online articles to provide a different answer to most of the current ones.

Should referential integrity be enforced?

One of the reasons why referential integrity should not be enforced is performance. Because the DB has to validate all updates against relationships, it just makes things slower. But what are the other pros and cons of enforcing and not enforcing?
Because relationships are maintained in the business logic layer anyway, it's redundant for the db to do it as well. What are your thoughts on it?
The database is responsible for data. That's it. Period.
If referential integrity is not done in the database, then it's not integrity. It's just trusting people not to do bad things, in which case you probably shouldn't even worry about password-protecting your data either :-)
Who's to say you won't get someone writing their own JDBC-connected client to totally screw up the data, despite your perfectly crafted and bug-free business layer (the fact that it probably won't be bug-free is another issue entirely, mandating that the DB should protect itself).
First of all, it's almost impossible to make it really work correctly. To have any chance of working right, you need to wrap a lot of the cascading modifications as transactions, so you don't have things out of sync while you've changed one part of the database, but are still updating others that depend on the first. This means code that should be simple and aware only of business logic suddenly needs to know about all sorts of concurrency issues.
Second, keeping it working is almost impossible to hope for -- every time anybody touches the business logic, they need to deal with those concurrency issues again.
Third, this makes the referential integrity difficult to understand -- in the future, when somebody wants to learn about your database structure, they'll have to reverse engineer it out of your business logic. With it in the database, it's separate, so what you have to look at only deals with referential integrity, not all sorts of unrelated issues. You have (for example) direct chains of logic showing what a modification to a particular field will trigger. At least for quite a few databases, that logic can be automatically extracted and turned into fairly useful documentation (e.g., tree diagrams showing dependencies). Extracting the same kind of information from the BLL is more likely to be a fairly serious project.
There are certainly some points in the other direction, and reasons to craft all of this by hand -- scalability and performance being the most obvious. When/if you go that route, however, you should be aware of what you're giving up to get that performance. In some cases, it's a worthwhile tradeoff -- but in other cases it's not, and you need information to make a reasoned decision.
Relationships may be maintained in a business logic layer. Unless you can guarantee 100% beyond any doubt that your BLL is and always will be bug-free, then you don't have data integrity. And you can't make that guarantee.
Also, if another app will ever touch your database, it isn't required to follow (read: reimplement, maybe in a subtly wrong way) the rules in your BLL. It could corrupt the data, even if you somehow managed to be one of the 3 programmers on Earth to write bug-free code.
The database, meanwhile, enforces the same rules for everybody -- and rules enforced by the database are far less likely to be overlooked when you're updating, since the DB won't allow it.
Have a listen to Dan Pritchett, Technical Fellow at eBay on why certain database constructs such as transactions and referential integrity are not the mandates that textbooks might indicate they should be... It comes down to the types of data, the volume of queries and business requirements. Balance those and it will lead you to pragmatic solutions, not dogmatic answers...
However, do not assume that keeping relationships in the BLL will protect your data. You cannot guarantee that future developers won't expose new APIs that bypass the BLL for "performance" reasons, or simple lack of understanding of your architecture...
The performance assumption on which the question is based is incorrect as a general rule. Usually if you require RI to be enforced then the database is the most efficient place to do it, NOT the application - otherwise the application has to requery more data in order to be able to validate RI outside the database.
Also, RI constraints in the database are useful for the query optimiser for making other queries more efficient. Integrity constraints in the application can't achieve that.
Lastly, the cost of maintaining integrity constraints in every application is generally more expensive and complex than doing it once in one place.
But Colonel Ingus, if you've got the customer with an id in the session, you've already probed the database! The problem is when you then write your sales order away but don't attach it to a product, because you didn't probe for a product. One way or another you'll end up with orphaned records, just like the very large company I'm currently working for has. We have customers with no history and history with no customers; customers with outstanding balances who've never bought anything and goods sold to customers who don't exist - interesting business concepts - and it keeps a team of very frustrated support staff in full-time employment trying to sort it out. It would have been far less expensive to put RI on everything and buy a bigger box to sort out any perceived performance problems.
A lot has already been said about the fact that the DB should be the final place to validate/control your constraints (and I couldn't agree more)
If the data is important, then your application won't be the last to access the database and it won't be the only one.
But there is another very important fact about referential integrity (and other constraints): it documents your datamodel and makes the dependencies between the tables explicit.
As far as performance is concerned, defining FKs (or other constraints) in the database can make things even faster in certain cases, because the DBMS can rely on the constraints and make appropriate optimizations.
It depends on the data. If it's highly transactional data, such as business transactions and whatnot, where frequent updates are happening, then enforcing the business rules in the database is extremely important. But for everything else, the performance impact may not be worth it.
What paxdiablo and dportas said. And my two cents. There are two other considerations.
In order to validate referential integrity for a new insert, you have to do a probe into the database to verify that the reference is valid. You have just nullified the performance gain that led you to want to enforce integrity in the application. It's actually faster to let the DBMS enforce referential integrity.
Beyond that, consider the case where you have more than one application all reading and writing data in a single database. If you enforce referential integrity in the business application layer, you have to make sure that all of the applications do things right. Otherwise, some aberrant application could store invalid references, and the problem could surface when a different application went to use the data. That's a real mess.
Better to have the DBMS enforce the data rules for all the applications.
If you are maintaining the relationships in the business layer, you can guarantee that a few years down the pike you will have bad data in the database. The business layer is the worst possible place to do that.
Further, when you replace the business layer with something else you have to redefine all these things. Databases often outlast the original application they were written for by many years; put the correct relationships and constraints in the database, where they belong.
What happens when you try to insert a record into the database and it fails referential integrity? You get an error from the database. Then you have to change your code so that it doesn't try to insert invalid data. To avoid ref integrity errors your code MUST know which data is which. Therefore, referential integrity is useless.
Walter Mitty said "In order to validate referential integrity for a new insert, you have to do a probe into the database to verify that the reference is valid." Sigh... this is complete nonsense. If I have a Customer object in the session (that's memory, aka RAM for some of you fellas), I know the Customer's ID and can use it to insert a SalesOrder object. There is no need to look up the Customer.
I am on a system now with tight referential integrity and Hibernate wrapped around it with its gross tentacles. It's the slowest system I have ever seen. I did not design it, and if I had, it would be many times faster AND easier to maintain. Hibernate sucks.

Why do we need Audit Columns in Database Tables?

I have seen many database designs having the following audit columns on all tables...
Created By
Create DateTime
Updated By
Updated DateTime
From one perspective I see tables in the following way...
Entity tables:
Good candidates for audit columns.
Reference tables:
Audit columns may or may not be required. In some cases last-update information is not required at all, because the record is never going to be modified.
Reference data tables:
Like country names, entity states, etc. Audit columns may not be required, because this information is created only at system installation time and is never going to change.
I have seen many designers blindly put all the audit columns on all tables. Is this practice good? If yes, what could be the reason...
I just want to know, because to me it seems illogical. It is difficult for me to figure out why they design their DBs this way. I am not saying they are wrong or right; I just want to know the WHY.
You can also suggest an alternative auditing pattern or solution, if one is available...
Thanks and Regards
Data auditing is a required internal control for many business systems (see Sarbanes-Oxley for reasons why). It must be at the database level to ensure that all changes are captured, especially unauthorized ones.
Even with lookup tables, an unauthorized change could wreak havoc in your system, and thus it is important to know who made the change and when. When is especially important, because it helps the DBAs know how far back to grab a backup to restore information that was accidentally or maliciously changed.
We like to think all our employees are trustworthy, but many of the thefts of personal data and the malicious changes made to destroy company data come from internal sources (this is why it is dangerous to have many disgruntled employees), as does almost all of the fraud. Yet most programmers seem to think that they only have to protect against outside threats.
Of course you are still going to have a few people who can make unauthorized changes; you can't prevent system admins from doing this. But with auditing, at least you can limit the potential for data damage (and be especially careful when hiring DBAs, and allow no one else admin rights on your database servers).
These columns are for the benefit of the DBA and the database developers. They just provide a quick mechanism to answer questions like "When did this record last change?" "who changed it?" They are not robust enough or fine-grained enough to satisfy compliance with SOX, HIPAA or whatever.
It is simply easier to have these columns on every table. All data can change, so it is useful to know when changes happened, especially if that data isn't supposed to change. It is possible to automate the process of adding them, by using the data dictionary to generate scripts.
It is good practice for these columns to be populated independently of the application, by triggers or some similar mechanism. These columns are metadata, the application shouldn't really be aware of them.
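A minimal sketch of such a trigger in T-SQL (the Orders table and its audit columns are assumed for illustration):
    -- Stamp who changed the row and when, independently of the application.
    CREATE TRIGGER dbo.trg_Orders_Audit
    ON dbo.Orders
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE o
        SET    o.UpdatedBy       = SUSER_SNAME(),
               o.UpdatedDateTime = SYSUTCDATETIME()
        FROM   dbo.Orders AS o
        JOIN   inserted   AS i ON i.OrderID = o.OrderID;
    END;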
Relying on a full-blown audit trail to provide this functionality is usually not an option. Audit data which is collected for compliance purposes usually has restricted access, and indeed may be stored in a separate physical location.
Many applications are developed using some OOP language in which there is generally a class like BusinessObject that contains generally helpful information such as these auditing fields. Not all subclassing entities may need them, but they're there if they do. Since the overhead on the db is small, and there's a chance the client may request another odd statistic based on the audit fields, it's better to have them around than not to have them at all. If something represents a static list of information, such as country names, I generally wouldn't put it in the db at all - enumerated data types were created for just such purposes.
I came across this thread by chance, as the same question popped up in my mind this morning. Every answer makes its point, and I definitely agree with all of you. It is undeniably important to safeguard business data and transaction data. It's just that the author is doubtful about audit fields on configuration or static data.
This kind of configuration data is not updatable by users. Usually it can be placed elsewhere as well, like properties, config files or even hard-coded constants. Of course, putting configuration data in these places might be bad design or style, but from the perspective of auditing, does it matter? In addition, if these data are updatable by users, then the only ones who can update them are either DBAs or hackers. Truly malicious DBAs or hackers already know the rules before they break them, and they will find ways to circumvent them.
To me, the question is more related to the environment in your company. Does your company have a culture of keeping track of every little bit of information? Does your company constantly enforce strict discipline, monitoring or auditing? Having these audit fields on non-user data is simply for their satisfaction, more than for any other purpose.

Overnormalization

When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.
In the general sense, I think that overnormalized is when you are doing so many JOINs to retrieve data that it is causing notable performance penalties and deadlocks on your database, even after you've tuned the heck out of your indexes. Obviously, for huge applications and sites like MySpace or eBay, de-normalization is a scaling requirement.
As a developer for several small businesses, I tell you that in my experience it's always been easier to go from normalized -> denormalized than the other way around, and in fact going the other way around (to avoid duplication of data now that the business requirements have changed a year or so later) is much more difficult.
When I read general statements such as "you should put the address in your customers table instead of a separate address table so you can avoid the join", I shudder, because you just know that a year from now somebody's going to ask you to do something with addresses that you totally didn't foresee, like maintaining an audit trail, or storing multiple addresses per customer. If your database allows you to create an indexed view, you can sidestep that issue until you get to the point where your dataset is so large that it can't possibly exist or be served by a single server or set of servers in a 1-write, many-read environment. For most of us, I don't think that scenario happens very often.
When in doubt, I aim for third normal form with some exceptions (for example, having a field contain a comma-separated list of strings because I know I'll never, ever look at the data from the other angle). When I need to consolidate, I'll look at my views or indexes first. Hope this helps.
It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.
There's one case where I can think of a prima facie case of overnormalization: say you have an order + orderitem, and the orderitem references productID, and leaves pricing to the product.price. Since that introduces temporal coupling, you've incorrectly normalized because the overnormalization affects already shipped orders, unless prices absolutely never change. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
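A hedged sketch of that distinction (table and column names are invented): the order line keeps its own copy of the price that was charged, rather than pointing at the product's current price.
    CREATE TABLE dbo.OrderItems (
        OrderID   int           NOT NULL REFERENCES dbo.Orders (OrderID),
        ProductID int           NOT NULL REFERENCES dbo.Products (ProductID),
        Quantity  int           NOT NULL,
        UnitPrice decimal(10,2) NOT NULL,  -- price at the moment of sale, not Products.Price
        PRIMARY KEY (OrderID, ProductID)
    );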
The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.
Normalization is absolute. A database follows Normal Forms or it does not. There are a half-dozen normal forms. Mostly, they have names like First through Fifth. Plus there's a Boyce-Codd Normal Form.
Normalization exists for precisely one purpose -- to prevent "update anomalies".
Normalization isn't subjective. It isn't a judgement. Each table and relationship among tables either does or does not follow a normal form.
Consequently, you can't be "over-normalized" or "under-normalized".
Having said that, normalization has a performance cost. Some people elect to denormalize in various ways to improve performance. The most common sensible denormalization is to break 3NF and include derived data.
A common mistake is to break 2NF and have duplicate copies of a functional dependency between a key and non-key value. This requires extra updates or -- worse -- triggers to keep the copies in parallel.
Denormalization of a transactional database should be a case-by-case situation.
A data warehouse, also, rarely follows any of the transactional normalization rules because it's (essentially) never updated.
"Over-normalization" could mean that a database is too slow because of a large number of joins. This may also mean that the database has outgrown the hardware. Or that the applications haven't been designed to scale.
The most common issue here is that folks try to use a transactional database for reporting while transactions are going on. The locking for transactions interferes with reporting.
"Under-normalization," however, means that there are NF violations and needless processing is being done to handle the replicated data and correct update anomalies.
When the performance cost exceeds the benefit towards the application's intended purpose.
Normalize your OLTP databases, and denormalize your OLAP databases. Each has a mission that dictates its schema. Like normalized transaction databases, data warehouses exist for a reason. A complete system needs both.
A lot of people are talking about performance. I think a key issue is flexibility. In general, the more normalized your database, the more flexible it is.
We currently use an "over-normalized" database because, in our operating environment, client requirements change on a monthly basis. By "over-normalizing" we can adapt our software accordingly, without changing the database structure.
My take on this:
Always normalize as much as you are able to. I usually go crazy on normalization and try to design something that could handle every thinkable future extension. What I end up with is a database design that is extremely flexible... and impossible to implement.
Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.
This way you know what you sacrifice to make the design usable.
Edit: Documentation! I forgot to mention that documenting the de-normalization is very important. It is extremely helpful when you take over a project to know the reason behind the choices.
Third Normal Form (3NF) is considered the optimal level of normalization for many a relational database application. This is a state in which, as Bill Kent once summarized, every "non-key field [in every table within a particular relational database management system, or RDBMS] must provide a fact about the key, the whole key, and nothing but the key." 3NF is a term that was introduced by E.F. Codd, inventor of the relational model for database management. Generally, the data that a software application depends on, especially an application used for an Online Transaction Processing (OLTP) system, will fare well in 3NF. This normal form by definition reduces database size by calling for a minimum repetition of row/column data, and maximizes query efficiency and ease of application maintenance. 3NF achieves that by requiring that a database's tables (i.e., its schema) be broken down into separate tables related by primary/foreign keys--basically until Kent's rule holds true (well, I've stated it this way for ease of reading, but the actual definition of 3NF is much more detailed than that).
In contrast, overnormalization implies increasing the number of joins required in a query between related tables. This comes as a result of breaking down the database schema into a much more granular level than 3NF. However, though normalization past the 3rd degree can often be considered overnormalization, the negative connotation of the term "overnormalization" can sometimes be unwarranted. Overnormalization may be desirable in some applications which by design require 4NF (and beyond) due to the complexity and versatility of the application software. An example of that is a highly customizable and extensible commercial database program for some industry, sold to end users requiring an open API. But then the reverse can be desirable as well--that is, denormalization--most notably when designing an Online Analytical Processing (OLAP) database used strictly to summarize data from an OLTP database just for querying/reporting--such as a data warehouse. In this case the data must by necessity reside in a highly denormalized format (i.e., 1NF or 2NF).
It's often under these constraints--when there are high demands for efficient querying and reporting--that we find database and application programmers calling a database "overnormalized". But as Redgate's Tony Davis once said--taking into account today's much more advanced and efficient database software and storage systems--"the performance hit from multiple joins in a query is negligible. If your database is slow, it isn't because it is 'over-normalized'!" So in conclusion, this characterization--overnormalization--isn't an absolute one, and it is dependent on the way it is used in the application. In Kent's words, "The normalization rules are designed to prevent update anomalies and data inconsistencies. . . [but] there is no obligation to fully normalize all records when actual performance requirements are taken into account. . . The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. . . [Thus,] the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications."
..or hitting limits on the number of joins your RDBMS will do.
If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
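A rough sketch of that copy-for-reporting idea (the schema and table names are invented): flatten the joined data once, then report without joins.
    -- Rebuild the flattened copy on a schedule (e.g. nightly).
    SELECT  o.OrderID, o.OrderDate, o.TotalAmount, c.CustomerName, c.Region
    INTO    reporting.OrdersFlat
    FROM    dbo.Orders    AS o
    JOIN    dbo.Customers AS c ON c.CustomerID = o.CustomerID;

    -- Reports then run against the single wide table, with no joins at all.
    SELECT Region, SUM(TotalAmount) AS Revenue
    FROM   reporting.OrdersFlat
    GROUP  BY Region;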
In my experience, I've never seen a normalized database that contains postal addresses, as it's usually acceptable to store the address as a string. Ideally, there would be tables for countries, counties/states, cities, districts and streets. I've not come across anyone who needs to report at street level, so it hasn't been necessary. The addresses have only been used for postal contact, so they are treated as a single entity.
