Is a strongly normalized relational database not efficient? - database

I was reading this question https://meta.stackexchange.com/questions/26398/stackoverflow-database-design-join-issues and I got the following question: using a very normalized db is not efficient?
How should be found the right compromise?
I'm not sure if this question better fits here or on programmer. Here there are some similar but if I should move, just ask me.

Whether it will speed it up or slow it down depends strongly on the nature of the data, the size of the tables, the type of querying, the indexing. I have seen it go both ways although, more often than not in my experience, normalization to the third normal form speeds things up. Relational databases are built to be normalized and designed so that those things are expected.
One thing the denormalization advocates often forget is that speed is critical to transactions (possibly more critical due to blocking potential) and that denormalization often slows down updates. You can't measure performance just on select statements. Denormalized database tables are often wider and wider tables can often cause slowdowns too.
Denormalized databases are a major problem to keep the data integrity in and a change of a company name in a normalized database might result in one record needing to be updated and in a denormalized one might result in 100,000,000 records needing to be updated. That is why denormalization is generally only preferred for databases (like data warehouses) where the data is loaded through an ETL process but the database itself is frequently queried for complex reporting scenarios. Transactional databases that have a lot of user updates and deletions and inserts are often much faster if they are normalized to the third normal forma at least. Now you can go crazy with normalization too, don't get me wrong. I shouldn't have to join to 10 tables to get a simple address especially if I get them often. Data that is often used together often belongs together especially if the items are unlikely to change a million records if a change is made. For instance in the address, it would require a large update if Chicago changed it's name to New Chicago, but those types of massive address changes are pretty rare in my part of the world. On the other hand, company name changes are frequent and could cause massive disruption if they needed to be made to millions of denormalized records.
If you are not designing a data warehouse, then normalize your data. Never denormalize unless you are a database specialist with at least 5 years experience in large systems. You can harm things tremendously if you don't know what you are doing. If things are slow denormalization is one of the last performance improvements to try. Generally, the problem is fixed by writing better queries that are sargable and which do not use poorly performing techniques like correlated subqueries or by getting the correct indexing applied.

Normalization optimizes storage requirements and data consistency. As a tradeoff, it can make queries more complex and slow.
How should be found the right compromise?
Unfortunately, that cannot be answered with generality.
It all depends on your application and its requirements.
If your queries run too slow, and indexing or caching or query rewriting or database parameter tuning don't cut it, denormalization may be appropriate for you.
(OTOH, if your queries run just fine, or can be made to run just fine, there is probably no need to go there).

It depends. Every time I've worked to normalize a database, it has radically sped up. But, the performance problems with the non-normalized DBs were that they needed many indices, most of which were not used for any particular query, having too many columns, forced DISTINCT constraints on queries that wouldn't have been needed with a normalized DB, and inefficient table searching.
If common queries need to perform many joins on large tables for the simplest of lookups, or hit many tables for writes to update what the user/application sees as an atomic update of a single entity, then as traffic grows, so will that burden, at a rate higher than with lower/no normalization. Typically what happens is that everything runs OK until either the database and application are put on different production servers, while they were on the same dev server, or when the data gets big enough to start hitting the disks all the time.
DBMS products couple logical layout and physical storage, so while it may be as likely to increase speed as decrease it, normalization of base tables will in some way affect performance of the system.
Usually, the right compromise is views, with an SQL DBMS. If you are using any variation of design by contract, views are likely the correct design decision even without any concerns for normalization or performance, so that the application gets a model fitting its needs. Scalability concerns, like for major websites, create problems that don't have quick and easy solutions, at this point in time.

Additionally to Thilo's post:
normalizing on SAP HANA is wrong due to the fact the db normalize the data itself. If you do it anyway you will slow down the database.

Related

non clustered index for 100 millions records in a table or pratitions

I have a table in IBM DB2 which contains more than 100 million records . Database was made 13 years ago and is not partitioned . Searching data and creating joins with this table takes huge amount of time .What should be proper approach to optimize searching and joins .
1. Using Non Clustered Index and searching via indexes .
2. Partitioning Table
3. or any other efficient approach.
I would like thanks in advance for your valuable time and efforts.
A "proper" approach is, of course, subjective. It's usually a trade-off, and the things most people trade off are the cost of implementing the change, the cost of maintaining the change, and the performance of the solution.
In all cases, I recommend gathering metrics, and agreeing your target - otherwise, you risk continuously optimizing beyond the point the business really needs. Typically, this means creating a representative test environment, with representative data. You then run the queries as they are today, and measure their performance. Finally, you agree (with whoever is paying the bills) what the minimum and optimum targets are. Once you reach that target - stop!
By far the cheapest solution is to optimize your queries, which often means creating indices. Depending on your queries, this can sometimes take just a few hours, and doesn't require any ongoing maintenance.
The next thing to do is to look at server configuration - tuning the memory allocation and disk strategy can do wonders, and making sure the database statistics are up to date. These tasks usually require 2 or 3 people to work together, and you may need to set up regular maintenance tasks.
If that doesn't do the job, consider improving the hardware. If your database server is as old as the database (13 years), it's quite possible that your mobile phone has better performance characteristics than your server. It's much cheaper to improve the hardware than it is to go to the next steps.
If hardware doesn't solve the problem, consider de-normalizing your data. For instance, if you are running lots of queries joining your large table to other large tables, consider creating a de-normalized table with all the data you need to fulfill that query. This is expensive, both from a development point of view (you have to work out how to maintain the denormalized data, how to make sure all the queries still work), and from a maintenance point of view - the additional complexity will make all enhancements and bug fixes harder.
If denormalizing doesn't work, partitioning is the next most expensive solution. This is a fairly drastic solution, because as far as I know, there's no "out of the box" solution to glue your front-end applications into the partitioning logic. So, pretty much every piece of code that needs to interact with the database needs to understand the partitioning logic, and a bug in any one place will break every other component that interacts with that data.

Oracle recommendations for high volume writes and low volume read

Is there some general guidelines online on how to tweak oracle for doing a high number of inserts and low number of reads?
All the answers below are pretty good recommendations. I have to clarify the following things. I am using 10g and this is an absolute requirement that we use Oracle. I am also more interested in oracle instance parameters for tuning (perhaps some different locking policies).
Let me assume you want to do an excessive high number of inserts, so that you simply want to just ignore all other kinds of operations just to get those inserts to complete, without problems.
First, have you completely ruled out other types of databases? There are systems like industry databases that cope very well with massive amounts of inserts, typically used to receive and store data from equipment that is measuring something in a factory environment. Oracle is a relational database, it might not be the right type of software for your needs.
Having said that, let's assume you can, or will, or should, use Oracle. The very first thing you need to do is to consider all the various types of data you need to make this assumption about. If they're all about the same kind of data, you need 1 table, and it need to be lean and mean regarding inserts.
The optimal way do that is to do the following:
do not add any indexes on this table at all, if you need a primary key, that's the only index you want
if you need to do reads against this table, consider having a shadow table with indexes that you do reads, lookups, and aggregates against. If this doesn't have to be up-to-the-millisecond updated, consider a periodic batch job to update it with data from the master table. This will disturb the master table with read-locks as little as possible
Make sure your server has fast disks. Transactional write operations will typically involve the disk at some point, so make sure that's a small bottleneck as you can get.
If your application is gathering data from many incoming sources, consider adding a layer in front of the database that will keep the number of concurrent connections and thus transactions to that table to a minimum. If you get a high number of write-locks on the same page for an oracle database, ultimately your performance will suffer.
If you can split up the data, consider splitting it in such a way that it is stored on different physical disks. That way, disk I/O problems won't be cross-data-type, and only affect one type of data.
In the other end of the spectrum you have a denormalized table with lots of indexes optimized for a balance between lookups and updates, and you need to find some middle-way that will get you the performance you want.
In terms of database design put as few constraints, indexes and triggers on the table(s) you're inserting into as possible as these will all slow down the insert.
The lack of indexes will obviously hurt your SELECT performance, but it doesn't sound like this is your primary concern.
What sort of application are we talking about? What version of Oracle?
If you are designing a data warehouse load process, for example, you would generally want to do direct-path inserts into staging table(s), then build any necessary indexes, then do a partition exhange to load the data into the partitioned destination table. This doesn't work as well, of course, if you are doing single-row inserts.
Depending on the Oracle version and the type of application, you may also want to enable compression on the table. Inserts are generally cheap from a CPU standpoint, so there is probably plenty of CPU available to do the compression which can substantially decrease the amount of I/O required, which is generally going to be your bottleneck.
I'm going to suggest that you take your question to Tom Kyte's site, http://asktom.oracle.com. You can generally find an answer there. Otherwise, try Oracle's forums.
Also try looking up any of Tom Kyte's books. Suggest checking the library or your local bookstore to find the right one, to ensure that the book contains the right topics for you. Also, his blog has links to his books and some articles/discussions on each book.
I did a quick google, site:oracle.com tuning write, and found this
OracleAS TopLink Writing Optimization Features. I realize that you might not be using TopLink but it may have some good tips. Keywords you'll want to try using: tuning, performance, insert(s), improve. Also through in the technology you are using like java/c++/etc.
Other tips you can try:
using stored procedures or using them in more efficient ways.
tweaking your server's hardware. Faster hard drives or a specific RAID array, possibly more cpu's.
Ask Tom thread - some nice comments here, also links to Fowler's site
You will probably have to start running some performance analytics on your queries/implementations to find the sweet spot for each one. I wish I had an easy answer for you. Good Luck!
A couple of suggestions for you to look into further:-
direct path load
block compression

Overnormalization

When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.
In the general sense, I think that overnormalized is when you are doing so many JOINs to retrieve data that it is causing notable performance penalties and deadlocks on your database, even after you've tuned the heck out of your indexes. Obviously, for huge applications and sites like MySpace or eBay, de-normalization is a scaling requirement.
As a developer for several small businesses, I tell you that in my experience it's always been easier to go from normalized -> denormalized than the other way around, and in fact going the other way around (to avoid duplication of data now that the business requirements have changed a year or so later) is much more difficult.
When I read general statements such as "you should put the address in your customers table instead of a separate address table so you can avoid the join", I shudder, because you just know that a year from now somebody's going to ask you to do something with addresses that you totally didn't foresee, like maintaining an audit trail, or storing multiple per customer. If your database allows you to create an indexed view, you can sidestep that issue until you get to the point where your dataset is so large that it can't possibly exist or be served by a single server or set of servers in a 1-write, many-read environment. For most of us, I don't think that scenario happens very often.
When in doubt, I aim for third normal form with some exceptions (for example, having a field contain a CSV-list of separated strings because I know I'll never ever look at the data from the other angle). When I need to consolidate, I'll look at my views or indexes first. Hope this helps.
It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.
There's one case where I can think of a prima facie case of overnormalization: say you have an order + orderitem, and the orderitem references productID, and leaves pricing to the product.price. Since that introduces temporal coupling, you've incorrectly normalized because the overnormalization affects already shipped orders, unless prices absolutely never change. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.
Normalization is absolute. A database follows Normal Forms or it does not. There are a half-dozen normal forms. Mostly, they have names like First through Fifth. Plus there's a Boyce-Codd Normal Form.
Normalization exists for precisely one purpose -- to prevent "update anomalies".
Normalization isn't subjective. It isn't a judgement. Each table and relationship among tables either does or does not follow a normal form.
Consequently, you can't be "over-normalized" or "under-normalized".
Having said that, normalization has a performance cost. Some people elect to denormalize in various ways to improve performance. The most common sensible denormalization is to break 3NF and include derived data.
A common mistake is to break 2NF and have duplicate copies of a functional dependency between a key and non-key value. This requires extra updates or -- worse -- triggers to keep the copies in parallel.
Denormalization of a transactional database should be a case-by-case situation.
A data warehouse, also, rarely follows any of the transactional normalization rules because it's (essentially) never updated.
"Over-normalization" could mean that a database is too slow because of a large number of joins. This may also mean that the database has outgrown the hardware. Or that the applications haven't been designed to scale.
The most common issue here is that folks try to use a transactional database for reporting while transactions are going on. The locking for transactions interferes with reporting.
"Under-normalization," however, means that there are NF violations and needless processing is being done to handle the replicated data and correct update anomalies.
When the performance cost exceeds the benefit towards the application's intended purpose.
Normalize your OLTP databases, and denormalize your OLAP databases. Each has a mission that dictates its schema. Like normalized transaction databases, data warehouses exist for a reason. A complete system needs both.
A lot of people are talking about performance. I think a key issue is flexibility. In general, the more normalized your database, the more flexible it is.
We currently use an "over-normalized" database because, in our operating environment, client requirements change on a monthly basis. By "over-normalizing" we can adopt our software accordingly, without changing the database structure.
My take on this:
Always normalize as much as you are able to do. I usually go crazy on normalization, and try to design something that could handle every thinkable future extensions. What I end up with is a database design that is extremely flexible... and impossible to implement.
Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.
This way you know what you scarify for make the design usable.
Edit: Documentations! I forgot to mention that documenting the de-normalization is very important. It is extremely helpful when you take over a project to know the reason behind the choices.
Third Normal Form (3NF) is considered the optimal level of normalization for many a rational database application. This is a state in which, as Bill Kent once summarized, every "non-key field [in every table within a particular a relational database management system, or RDBMS] must provide a fact about the key, the whole key, and nothing but the key." 3NF is a term that was introduced by E.F. Codd, inventor of the relational model for database management. Generally, the data that a software application is dependent on, especially an application used for an Online Transaction Processing System (OLTP), will fare well in 3NF. This normal form by definition reduces database size by calling for a minimum repetition of row/column data, and maximizes query efficiency and ease of application maintenance. 3NF achieves that by requiring that a database's tables (i.e., its schema) be broken down into separate tables related by primary/foreign keys--basically until Kent's rule holds true (well, I've stated it this way for ease of reading but the actual definition of 3NF is much more detailed than that). In contrast, overnormalization implies increasing the number of joins required in a query between related tables. This comes as a result of breaking down the database schema into a much more granular level than 3NF. However, though normalization past the 3rd degree can often be considered overnormalization, the negative connotation of the term "overnormalization" can sometimes be unwarranted. Overnormalization may be desirable in some applications which by design require 4NF (and beyond) due to the complexity and versatility of the application software. An example of that is a highly customizable and extensible commercial database program for some industry in which it is sold to end users requiring an open API. But then the reverse can be desirable as well--that is, denormalization--most notably, when designing an Online Analytical Processing (OLAP) database used strictly to summarize data from an OLTP database just for querying/reporting--such as a data warehouse. In this case the data must by necessity reside in a highly denormalized format (i.e, 1NF, or 2NF). It's often under these constraints--when there are high demands for efficient querying and reporting--that we find database and application programmers calling a database, "overnormalized". But as Redgate's Tony Davis once said--taking into account today's much more advanced and efficient database software and storage systems--"the performance hit from multiple joins in a query is negligible. If your database is slow, it isn’t because it is ‘over-normalized’!" So in conclusion, this characterization--overnormalization--isn't an absolute one, and it is dependent on the way it is used in the application. In Kent's words, "The normalization rules are designed to prevent update anomalies and data inconsistencies. . . [but] there is no obligation to fully normalize all records when actual performance requirements are taken into account. . . The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. . . [Thus,] the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications."
..or hitting limits on the number of joins your RDBMS will do.
If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
In my experience, I've never seen a normalized database that contains postal addresses, as it's usually acceptable to store the address as a string. Ideally, there would be tables for countries, counties / states, cities, districts and streets. I've not come across anyone who needs to report on street level, so it hasn't been necessary. The addresses have only be used for postal contact, so are treated as a single entity.

In terms of databases, is "Normalize for correctness, denormalize for performance" a right mantra?

Normalization leads to many essential and desirable characteristics, including aesthetic pleasure. Besides it is also theoretically "correct". In this context, denormalization is applied as a compromise, a correction to achieve performance.
Is there any reason other than performance that a database could be denormalized?
The two most common reasons to denormalize are:
Performance
Ignorance
The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)
I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary"
To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.
Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently. That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.
Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note that I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers). These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.
So one reason people have a strong feeling about normalization is that the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation that the parties may be bringing to the debate.
Having said that, here's a more direct answer to the original question :)
It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.
You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.
Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.
If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMS's that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.
There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.
When I work with low-to-medium experienced database people and developers, I insist they produce an absolutely normalized design ... then later may involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.
In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.
Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.
Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.
I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.
Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".
Of course, this assumes that you are confident of your abilities and value to the company.
Mantras almost always oversimplify their subject matter. This is a case in point.
The advantages of normalizing are more that merely theoretic or aesthetic. For every departure from a normal form for 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do follow the normal form. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.
These update anomalies generally fall into inserting new data, updating existing data, and deleting rows. You can generally work your way around these anomalies by clever, tricky programming. The question then is was the benefit of using clever, tricky programming worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is actually, believe it or not, bad performance.
If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.
The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.
A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.
Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.
In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well understood case.
Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.
There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.
Data warehouses in a dimensional model are often modelled in a (denormalized) star schema. These kinds of schemas are not (normally) used for online production or transactional systems.
The underlying reason is performance, but the fact/dimensional model also allows for a number of temporal features like slowly changing dimensions which are doable in traditional ER-style models, but can be incredibly complex and slow (effective dates, archive tables, active records, etc).
Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as risks of bugs in code increases, making the whole system less and less sustainable.
Good luck!
Normalization has nothing to do with performance. I can't really put it better than Erwin Smout did in this thread:
What is the resource impact from normalizing a database?
Most SQL DBMSs have limited support for changing the physical representation of data without also compromising the logical model, so unfortunately that's one reason why you may find it necessary to demormalize. Another is that many DBMSs don't have good support for multi-table integrity constraints, so as a workaround to implement those constraints you may be forced to put extraneous attributes into some tables.
Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as #aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.
You don't normalize for 'correctness' per se. Here is the thing:
Denormalized table has the benefit of increasing performance but requires redundancy and more developer brain power.
Normalized tables has the benefit of reducing redundancy and increasing ease of development but requires performance.
It's almost like a classic balanced equation. So depending on your needs (such as how many that are hammering your database server) you should stick with normalized tables unless it is really needed. It is however easier and less costly for development to go from normalized to denormalized than vice versa.
No way. Keep in mind that what you're supposed to be normalizing is your relations (logical level), not your tables (physical level).
Denormalized data is much more often found at places where not enough normalization was done.
My mantra is 'normalize for correctness, eliminate for performance'. RDBMs are very flexible tools, but optimized for the OLTP situation. Replacing the RDBMS by something simpler (e.g. objects in memory with a transaction log) can help a lot.
I take issue with the assertion by folks here that Normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where fully normalized are associated with simpler code than partially denormalized code, but at best this is a guideline, not a law of physics.
Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.
Taking a break from the DB side of things and looking at the coding side of things, object oriented programming has saved us from a lot of the evils of spaghetti-coding, by grouping a lot of closely related code together under an object-class name that has an english meaning that is easy to remember and that somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object and it is reminiscent of spaghetti code. If you make the clusters to small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "Macaroni code".
If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those that would say it is often better to chose based on the database and then work around that choice in code. Especially because you have the ability in some cases to create objects from joins with hibernate and technologies like that. However I would stop far short of saying this is an absolute rule. Any OR-Mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:
Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.
Updating or selecting from a single table is generally simpler than working with multiple tables, however with a well-written OR, this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit it in a non-object-oriented way.
Code needs to be kept relatively small to be manage-able and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.
In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object ownership relationship, then when updating the object, you have to decide whether to cascade the update or not. This can be more complex than it would be in sql because of the difference between doing something once and doing something correctly always, sort of like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this can never be simplified with a bit of flexibility on the SQL side enough to make any sql complexities worthwhile.
There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I saw a situation once in which one group of developers managed a certain type of transactions, and another group of developers supported auditability of these transactions, so the second group of developers wrote a service which scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what was the state of the system at the time the transaction. This scenario stands as an interesting use case (for the data duplication part of the question at least), but it is actually part of a larger category of issues. Data constistency desires will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However this can also have the impact of "freezing" portions of data because changing that subset of the data would cause past transactions to become invalid under the consistancy rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:
Can date range queries be done efficiently even for large tables?
Is it possible to guarantee non-overlap of date ranges?
Is it possible to trace status events back to operator, transaction, or reason for change? (probably yes, but this is additional overhead)
By creating a more complicated versioning system, are you putting the right owners in charge of the right data?
I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.
Reporting system and transaction system have different requirements.
I would recommend for transaction system, always use normalization for data correctness.
For reporting system, use normalization unless denormaliztion is required for whatever reason, such as ease of adhoc query, performance, etc.
Simplicity? Not sure if Steven is gonna swat me with his newspaper, but where I hang, sometimes the denormalized tables help the reporting/readonly guys get their jobs done without bugging the database/developers all the time...

Pro's of databases like BigTable, SimpleDB

New school datastore paradigms like Google BigTable and Amazon SimpleDB are specifically designed for scalability, among other things. Basically, disallowing joins and denormalization are the ways this is being accomplished.
In this topic, however, the consensus seems to be that joins on large tables don't necessarilly have to be too expensive and denormalization is "overrated" to some extent
Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volumes of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?
Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc are doing. (Oracle got in the game, finally, as well, with their recent announcement; Microsoft bought their solition in the name of the company that used to be called DataAllegro).
That being said, when the data scales up into terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees you can get from RDBMs, it is often far easier to denormalize and not do joins. Especially if you don't need to cross-reference much. Especially if you are not doing ad-hoc analysis, but require programmatic access with arbitrary transformations.
Denormalization is overrated. Just because that's what happens when you are dealing with a 100 Tera, doesn't mean this fact should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.
But if you are in the 100 Tera range, by all means...
Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
I'm not too familiar with them (I've only read the same blog/news/examples as everyone else) but my take on it is that they chose to sacrifice a lot of the normal relational DB features in the name of scalability - I'll try explain.
Imagine you have 200 rows in your data-table.
In google's datacenter, 50 of these rows are stored on server A, 50 on B, and 100 on server C. Additionally server D contains redundant copies of data from server A and B, and server E contains redundant copies of data on server C.
(In real life I have no idea how many servers would be used, but it's set up to deal with many millions of rows, so I imagine quite a few).
To "select * where name = 'orion'", the infrastructure can fire that query to all the servers, and aggregate the results that come back. This allows them to scale pretty much linearly across as many servers as they like (FYI this is pretty much what mapreduce is)
This however means you need some tradeoffs.
If you needed to do a relational join on some data, where it was spread across say 5 servers, each of those servers would need to pull data from eachother for each row. Try do that when you have 2 million rows spread across 10 servers.
This leads to tradeoff #1 - No joins.
Also, depending on network latency, server load, etc, some of your data may get saved instantly, but some may take a second or 2. Again, when you have dozens of servers, this gets longer and longer, and the normal approach of 'everyone just waits until the slowest guy has finished' no longer becomes acceptable.
This leads to tradeoff #2 - Your data may not always be immediately visible after it's written.
I'm not sure what other tradeoffs there are, but off the top of my head those are the main 2.
So what I'm getting is that the whole "denormalize, no joins" philosophy exists, not because joins themselves don't scale in large systems, but because they're practically impossible to implement in distributed databases.
This seems pretty reasonable when you're storing largely invariant data of a single type (Like Google does). Am I on the right track here?
If you are talking about data that is virtually read-only, the rules change. Denormalisation is hardest in situations where data changes because the work required is increased and there are more problems with locking. If the data barely changes then denormalisation is not so much of a problem.
Novaday You need to find more interoperational environment for databases. More frequently You don't need only an relational DBs, like MySQL or MS SQL but also Big Data farms as Hadoop or non-relational DBs like MongoDB. In some cases all those DBs will be used in one solution so their performance must be as equal as possible in macro scale. It means, that You will not be able to use let say Azure SQL as relational DB and one VM with 2 cores and 3GB of RAM for MongoDB. You must scale-up Your solution and use DB as a Service when it is possible (if it is not possible, then build Your own cluster in a cloud).

Resources