Will normalising my database kill the scalability? - database

I have a database that will form part of a highly trafficked web app.
I'm wondering if I should normalise the tables so things such as (e.g.) 'question_type' should be in a separate table too all the basic information about the question such as 'title' and 'question_body'?
I'm only asking because I need this database to be as scalable as possible and I'm told normalisation isn't always the way to go when you need scalability.
Thanks

The thing that makes normalization an issue with scaling is that it tends to need to have multiple tables join together. Joins are great on small tables but the larger the table grows the harder the server needs to work.
The main thing to look to is avoiding joins. If you can do the query without a join by adding a field to one of the tables, you just speed up the performance of that query.

If your table has a question_body and question_type, then I don't see how moving it to another table achieves normalization. e.g.:
table question (
question_body text,
question_user text,
question_user_rank integer,
question_type text
);
Splitting out a single value into a single column table won't achieve anything other than useless joins. That is:
select * from question q join question_type qt on (q.qt_id = qt.id)
where qt.name = 'sql questions';
is an equivalent, but wasteful form of
select * from question
where question_type = 'sql questions';
On the other hand, (using the example above), it makes a lot of sense to split out the question user information into its own table:
table question (
question_body text,
question_type text,
question_user_id integer references question_user(id) on delete cascade
);
table question_user (
id integer,
name text,
rank integer
);
So if a user has his rank changed (ala SO), you only have to change it in one place rather than in every row where he's asked a question. You've increased your ability to handling scaling since you've changed hundreds of updates into a single update.

Now that's a loaded question. Normalization isn't a hard rule so much as a guideline. Designing a database is made up of a series of decisions regarding the level of normalization that makes sense given your need for code efficiency, performance and integrity, among other things. That's greatly oversimplifying it, but the spectrum of design decisions spans volumes of well-authored books.
Can you tell me a little bit more about your application and intended platform? I might be able to steer you in the direction of some very useful reference material if I can better understand your situation.

Will adding salt make my food taste better?
Same question. Noone can answer.
The main proble mis that it depends on your USAGE patterns and to soem degree your competence as programmer, to use lookup caches in the application instead of database joins. Quite a lot of programmers never get above the "scrambled eggs, burned" level of SQL, to keep a cooking analogy.
For scalability application design AND database technology have a lot more to say. Hard to beat an Oracle RAC installation. Depending on what you need on an Exadata platform. Cost is I think around half a million USD for the smallest unit. Still sure you need "as scalable as possible"? Not joking here - I right now work on a 6000 gb data warehouse, we just ordered 3 of those monsters, and not the smallest one.
So, what do you mean with "as scalable as possible"? THis is like "my car needs to go as fast as a car ever has gone and more", then you end up with a special made car with a jet engine in it ;)
General rule:
* Separate transactions and reporting into two databases. The second being a data warehouse.
* Normalize transactional db
* Use star schema on data warehouse.
BIG chance is: you dont kno what you talk ab out, never did scalability, so there is a 80% chance your "high scalability" requirement is a joke for a decent database server. Now, that is not meant insulting, but i have seen SO many people say "I have a ton of data in a table" which turns ou to be 10.000 rows maximum. That is not a ton - it is a joke. We load 100 million daily into our data warehouse main table (and have to keep them many years). Most peope dont really get the speed a decent database server can provide. Which means many discs.

Related

Sql server Index Size is 100 GB [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM or those that are using InnoDB but aren't using it anyway.
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people, who think natural keys are in some situations OK, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my
opinion is when a developer is
building substantial query, joining
tables together, and all of a sudden
he realizes that it looks like he is
getting duplicate (or even more) rows
and his immediate response...his
"solution" to this "problem" is to
throw on the DISTINCT keyword and POOF
all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I
proposed is twenty times faster than
the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id = parent.id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat element of that array as a separate row in a child table (ie a one-to-many relationship).
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever
possible, for the good reason that they can be awkward to write code
and pose more maintenance difficulties.
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect thus forcing the optimizer to effectively to a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very NB! Don't forget to skip/abort all statements that follow after the error.
Also recommended to pay careful attention to the subtelties of how your database connectivity layer, and database engine interact in this regard.
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in langues like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the databse.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actuallly been proven that some real technical factor necessitates their use (for example, performance and security) For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficuly of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (Debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(s.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by s.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65535 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements or even functions the return the result of a select statement inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to shuffle the data together, rather than requery on each record from the main table and I usually find when I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, droping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects(like tables, views) in stored procedures.
Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is being converted into finally and hence verify the logic or even for debugging when not using ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different as needed. You also need to make sure you are not following normalization too far as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to to maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention towards managing database connections in your application. Then you find out the application, the computer, the server, and the network is clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz
Not having an understanding of the databases concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However applications designed without proper consideration for hotspots, resource contention
and correct operation (Assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing sqls in a loop.
Not having good test data and noticing performance degradation only on live data.

Accounting Payables/ Receivables database schema

I searched for this topic and there are lots of explanation to collect. Mostly to do separate tables for a/p and a/r.
I'm thinking of another approach from where to combine the two tables into 1 because a/c and a/r are 98% similar and only differ by 1 or 2 attributes (e.g. a/p: vendor/supplier).
Question:
Is it possible to merge the 2 tables and just categorize each transaction with Transaction_Type with values of Acnt_Payable or acnt_Receivable?
I know this is possible to do but is this a good practice? and what are the circumstances that my system will face when dealing with Reports?
Your question is: "another table or another column". Given that I've done lots of MySql and accounting database manipulation, I wouldn't hesitate to have type as an extra column.
The deal with reports is that you'll need to rework the code that pulls the reports. Having the extra column means less joins (if you have them now). You will need to step up your error checking, as doing a change to already designed software seems to always catch me off guard where I hadn't before considered a nuance of the design that is suddenly a paradigm shift for the logic.
I'd say rewrite the report code from a blank canvas; between your increased experience from "last time" and a fresh look at it this time, you might do a much better job of it.
IMHO, for performance aspects, most businesses and most average coders are OK with an off-the-shelf install of MySql or PostgreSQL. When I observe a query taking more than a second, it's usually because I need to add an index. Given that, you could have yet another table with the IDs representing AR vs AP (or any other human-terms), as I think numbers will make faster performance.
Once a company gets to where performance really is an issue, they'll have the resources to hire a real MySql guy to come in and mess with it. Sarcasm: Unless, of course, you're using quickbooks, in which case, once you're past a 20 transaction-per-day mom and pop joint, you're overtaxing the software.

Should OLAP databases be denormalized for read performance? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I always thought that databases should be denormalized for read performance, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.
PerformanceDBA in various posts, for ex., in Performance of different aproaches to time-based data defends the paradigm that database should be always well-designed by normalization to 5NF and 6NF (Normal Form).
Have I understood it correctly (and what had I understood correctly)?
What's wrong with the traditional denormalization approach/paradigm design of OLAP databases (below 3NF) and the advice that 3NF is enough for most practical cases of OLTP databases?
For example:
"The simple truth ... is that 6NF, executed properly, is the data warehouse" (PerformanceDBA)
I should confess that I could never grasp the theories that denormalization facilitates read performance. Can anybody give me references with good logical explanations of this and of the contrary beliefs?
What are sources to which I can refer when trying to convince my stakeholders that OLAP/Data Warehousing databases should be normalized?
To improve visibility I copied here from comments:
"It would be nice if participants would
add (disclose) how many real-life (no
science projects included)
data-warehouse implementations in 6NF
they have seen or participated in.
Kind of a quick-pool. Me = 0." – Damir
Sudarevic
Wikipedia's Data Warehouse article tells:
"The normalized approach [vs. dimensional one by Ralph Kimball], also
called the 3NF model (Third Normal Form) whose supporters are
referred to as “Inmonites”, believe in Bill Inmon's approach in which
it is stated that the data warehouse should be modeled using an E-R
model/normalized model."
It looks like the normalized data warehousing approach (by Bill Inmon) is perceived as not exceeding 3NF (?)
I just want to understand what is the origin of the myth (or ubiquitous axiomatic belief) that data warehousing/OLAP is synonym of denormalization?
Damir Sudarevic answered that they are well-paved approach. Let me return to the question: Why is denormalization believed to facilitate reading?
Mythology
I always thought that databases should be denormalized for reading, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.
There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.
Normalisation vs Un-normalised
("De-normalisation" is a fraudulent term I refuse to use it.)
This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:
it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.
Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) fewer joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs fewer indices. Normalised tables also yield much shorter row sizes.
for any given set of resources, Normalised tables:
fit more rows into the same page size
therefore fit more rows into the same cache space, therefore overall throughput is increased)
therefore fit more rows into the same disk space, therefore the no of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
.
it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; the locking reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.
The overall result was therefore much, much higher performance.
In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.
No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.
Many books have been written by people, selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.
(For anyone who wishes to dispute the above physical science, merely repeating the mantra will no have any effect, please supply specific evidence supporting the mantra.)
Why is the Myth Prevalent ?
Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.
From my experience, I have identified three major reasons for the prevalence:
For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book and without any evidence for the magic, they can reverently say "see a famous writer validates what I have done". Not Done, most accurately.
Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process.
People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.
That is why I always include "advanced" SQL and mentoring on any project.
My Answer
This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.
Let me present the science to you in manageable segments.
The typical model of the six large scale full implementation assignments.
These were the closed "databases" commonly found in small firms, and the organisations were large banks
very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
they were designed for each app, separately
reporting was not possible, they could only report via each app
since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
In order to "de-normalise" one must Normalise first; then reverse the process a little
in every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; it was simply un-normalised
since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
they had an additional problem of data duplication across the "databases"
the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
needless to say, the synching was never quite enough and they were forever changing it
with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.
So we contemplated the laws of physics, and we applied a little science.
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.
Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.
More Normalisation
Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out by todays definitions of NFs, I have always delivered 5NF.
Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.
Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:
the faster they perform
and they can be used in more ways (eg Pivots)
So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.
One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).
Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in.
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation then spent half their life managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.
If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse
Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).
One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.
But with the 6NF tables:
there are no rows, only columns; we serve up rows and columns at same blinding speed
the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.
Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast
What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.
Notice that it is the very same science that we have used from the outset; to move from Typical un-normalised "databases" to 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.
Notice the similarity between 5NF Corporate Database and 6NF Corporate Database
The entire cost of separate OLAP hardware, platform software, ETL, administration, maintenance, are all eliminated.
There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns
The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.
What is Sixth Normal Form ?
Assuming you have a handle on Normalisation (I am not going to not define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database:
Fifth Normal Form: all Functional Dependencies resolved across the database
in addition to 4NF/BCNF
every non-PK column is 1::1 with its PK
and to no other PK
No Update Anomalies
.
Sixth Normal Form: is the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
in addition to 5NF
the row consists of a Primary Key, and at most, one non-key column
eliminates The Null Problem
What Does 6NF Look Like ?
The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.
This one is for the collection of server monitoring data (enterprise class database server and OS) for any no of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures), it is typical of the specialised 5NF that many years later was identified as 6NF. Allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.
The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.
▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)
It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.
6NF Data Warehouse
This has been recently validated by Anchor Modeling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from the single version of the data, that is ours alone).
Data Warehouse (Only) Experience
My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases), has been several major assignments, as opposed to full implementation projects. The results were, no surprise:
consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.
consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.
Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use and hard work science to resolve their problems.
Valid Data Warehouse Justification
That is why I have stated in other posts, the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc, is where there are many Databases or "databases", all being merged into a central warehouse, for reporting and OLAP.
Kimball
A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.
Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.
So therefore it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taking into account; it is true only inside that little asylum, not outside.
Likewise it is true, in that crazy way, that where all the "tables" are monsters, that "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.
they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.
his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.
that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.
All just stories, parts of the one mythology that hang out together and support each other.
Your Mission
Should you choose to accept it. I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics. No matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific, verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster, it will just make you feel better about it. Like the fat kid sitting in the sidelines telling himself that he can run faster than all the kids in the race.
on that basis, even the concept "normalise for OLTP" but do the opposite, "de-normalise for OLAP" is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible, the work that same way on every computer.
Questions ?
Denormalization and aggregation are the two main strategies used to achieve performance in a data warehouse. It's just silly to suggest that it doesn't improve read performance! Surely I must have missunderstood something here?
Aggregation:
Consider a table holding 1 billion purchases.
Contrast it with a table holding one row with the sum of the purchases.
Now, which is faster? Select sum(amount) from the one-billion-row table or a select amount from the one-row-table? It's a stupid example of course, but it illustrates the principle of aggregation quite clearly. Why is it faster? Because regardless of what magical model/hardware/software/religion we use, reading 100 bytes is faster than reading 100 gigabytes. Simple as that.
Denormalization:
A typical product dimension in a retail data warehouse has shitloads of columns. Some columns are easy stuff like "Name" or "Color", but it also has some complicated stuff, like hierarchies. Multiple hierarchies (The product range (5 levels), the intended buyer (3 levels), raw materials (8 levels), way of production (8 levels) along with several computed numbers such as average lead time (since start of the year), weight/packaging measures etcetera etcetera. I've maintained a product dimension table with 200+ columns that was constructed from ~70 tables from 5 different source systems. It is just plain silly to debate whether a query on the normalized model (below)
select product_id
from table1
join table2 on(keys)
join (select average(..)
from one_billion_row_table
where lastyear = ...) on(keys)
join ...table70
where function_with_fuzzy_matching(table1.cola, table37.colb) > 0.7
and exists(select ... from )
and not exists(select ...)
and table20.version_id = (select max(v_id from product_ver where ...)
and average_price between 10 and 20
and product_range = 'High-Profile'
...is faster than the equivalent query on the denormalized model:
select product_id
from product_denormalized
where average_price between 10 and 20
and product_range = 'High-Profile';
Why? Partly for the same reason as the aggregated scenario. But also because the queries are just "complicated". They are so disgustingly complicated that the optimizer (and now I'm going Oracle specifics) gets confused and screws up the execution plans. Suboptimal execution plans may not be such a big deal if the query deals with small amounts of data. But as soon as we start to join in the Big Tables it is crucial that the database gets the execution plan right. Having denormalized the data in one table with a single syntetic key (heck, why don't I add more fuel to this ongoing fire), the filters become simple range/equality filters on pre-cooked columns. Having duplicated the data into new columns enables us to gather statistics on the columns which will help the optimizer in estimating the selectivities and thus providing us with a proper execution plan (well, ...).
Obviously, using denormalization and aggregation makes it harder to accomodate schema changes which is a bad thing. On the other hand they provides read performance, which is a good thing.
So, should you denormalize your database in order to achieve read-performance?
Hell no! It adds so many complexities to your system that there is no end to how many ways it will screw you over before you have delivered. Is it worth it? Yes, sometimes you need to do it to meet a specific performance requirement.
Update 1
PerformanceDBA: 1 row would get updated a billion times a day
That would imply a (near) realtime requirement (which in turn would generate a completely different set of technical requirements). Many (if not most) data warehouses does not have that requirement. I picked an unrealistic aggregation example just to make it clear why aggregation works. I didn't want to have to explain rollup strategies too :)
Also, one has to contrast the needs of the typical user of a data warehouse and the typical user of the underlaying OLTP system. A user looking to understand what factors drive transport costs, couldn't care less if 50% of todays data is missing or if 10 trucks exploded and killed the drivers. Performing the analysis over 2 years worth of data would still come to the same conclusion even if he had to-the-second up-to-date information at his disposal.
Contrast this to the needs of the drivers of that truck (the ones who survived). They can't wait 5 hours at some transit point just because some stupid aggregation process has to finnish. Having two separate copies of the data solves both needs.
Another major hurdle with sharing the same set of data for operational systems and reporting systems is that the release cycles, Q&A, deployment, SLA and what have you, are very different. Again, having two separate copies makes this easier to handle.
By "OLAP" I understand you to mean a subject-oriented relational / SQL database used for decision support - AKA a Data Warehouse.
Normal Form (typically 5th / 6th Normal Form) is generally the best model for a Data Warehouse. The reasons for normalizing a Data Warehouse are exactly the same as any other database: it reduces redundancy and avoids potential update anomalies; it avoids built-in bias and is therefore the easiest way to support schema change and new requirements. Using Normal Form in a data warehouse also helps keep the data load process simple and consistent.
There is no "traditional" denormalization approach. Good data warehouses have always been normalized.
Should not a database be denormalized for reading performance?
Okay, here goes a total "Your Mileage May Vary", "It Depends", "Use The Proper Tool For Every Job", "One Size Does Not Fit All" answer, with a bit of "Don't Fix It If It Ain't Broken" thrown in:
Denormalization is one way to improve query performance in certain situations. In other situations it may actually reduce performance (because of the increased disk use). It certainly makes updates more difficult.
It should only be considered when you hit a performance problem (because you are giving the benefits of normalization and introduce complexity).
The drawbacks of denormalization are less of an issue with data that is never updated, or only updated in batch jobs, i.e. not OLTP data.
If denormalization solves a performance problem that you need solved, and that less invasive techniques (like indexes or caches or buying a bigger server) do not solve, then yes, you should do it.
First my opinions, then some analysis
Opinions
Denormalisation is perceived to help reading data because common use of the word denormalisation often include not only breaking normal forms, but also introducing any insertion, update and deletion dependencies into the system.
This, strictly speaking, is false, see this question/answer, Denormalisation in strict sense mean to break any of the normal forms from 1NF-6NF, other insertion, update and deletion dependencies are addressed with Principle of Orthogonal Design.
So what happens is that people take the Space vs Time tradeoff principle and remember the term redundancy (associated with denormalisation, still not equal to it) and conclude that you should have benefits. This is faulty implication, but false implications do not allow you to conclude the reverse.
Breaking normal forms may indeed speed up some data retrieval (details in analysis below), but as a rule it will also at the same time:
favour only specific type of queries and slow down all other access paths
increase complexity of the system (which influences not only maintenance of the database itself, but also increases the complexity of applications that consume the data)
obfuscate and weaken semantic clarity of the database
main point of database systems, as central data representing the problem space is to be unbiased in recording the facts, so that when requirements change you don't have to redesign the parts of the system (data and applications) that are independent in reality. to be able to do this artificial dependencies should be minimised - today's 'critical' requirement to speed up one query quite often become only marginally important.
Analysis
So, I made a claim that sometimes breaking normal forms can help retrieval. Time to give some arguments
1) Breaking 1NF
Assume you have financial records in 6NF. From such database you can surely get a report on what is a balance for each account for each month.
Assuming that a query that would have to calculate such report would need to go through n records you could make a table
account_balances(month, report)
which would hold XML structured balances for each account. This breaks 1NF (see notes later), but allows one specific query to execute with minimum I/O.
At the same time, assuming it is possible to update any month with inserts, updates or deletes of financial records, the performance of the update queries on the system might be slowed down by time proportional to some function of n for each update.
(the above case illustrates a principle, in reality you would have better options and the benefit of getting minimum I/O bring such penalties that for realistic system that actually updates data often you would get bad performance on even for your targeted query depending on the type of actual workload; can explain this in more detail if you want)
Note:
This is actually trivial example and there is one problem with it - the definition of 1NF. Assumption that the above model breaks 1NF is according to requirement that values of an attribute 'contain exactly one value from the applicable domain'.
This allows you to say that the domain of the attribute report is a set of all possible reports and that from all of them there is exactly one value and claim that 1NF is not broken (similar to argument that storing words does not break 1NF even though you might have letters relation somewhere in your model).
On the other hand there are much better ways to model this table, which would be more useful for wider range of queries (such as to retrieve balances for single account for all months in a year). In this case you would justify that improvement by saying that this field is not in 1NF.
Anyway it explains why people claim that breaking NFs might improve performance.
2) Breaking 3NF
Assuming tables in 3NF
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`member_id` int(10) unsigned NOT NULL,
`status` tinyint(3) unsigned NOT NULL,
`amount` decimal(10,2) NOT NULL,
`opening` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `member_id` (`member_id`),
CONSTRAINT `t_ibfk_1` FOREIGN KEY (`member_id`) REFERENCES `m` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB
CREATE TABLE `m` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
with sample data (1M rows in t, 100k in m)
Assume a common query that you want to improve
mysql> select sql_no_cache m.name, count(*)
from t join m on t.member_id = m.id
where t.id between 100000 and 500000 group by m.name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (1.08 sec)
you could find suggestions to move attribute name into table m which breaks 3NF (it has a FD: member_id -> name and member_id is not a key of t)
after
alter table t add column varchar(255);
update t inner join m on t.member_id = t.id set t.name = m.name;
running
mysql> select sql_no_cache name, count(*)
from t where id
between 100000 and 500000
group by name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (0.41 sec)
notes:
The above query execution time is cut in half, but
the table was not in 5NF/6NF to begin with
the test was done with no_sql_cache so most cache mechanisms were avoided (and in real situations they play a role in system's performance)
space consumption is increased by approx 9x size of the column name x 100k rows
there should be triggers on t to keep the integrity of data, which would significantly slow down all updates to name and add additional checks that inserts in t would need to go through
probably better results could be achieved by dropping surrogate keys and switching to natural keys, and/or indexing, or redesigning to higher NFs
Normalising is the proper way in the long run. But you don't always have an option to redesign company's ERP (which is for example already only mostly 3NF) - sometimes you must achieve certain task within given resources. Of course doing this is only short term 'solution'.
Bottom line
I think that the most pertinent answer to your question is that you will find the industry and education using the term 'denormalisation' in
strict sense, for breaking NFs
loosely, for introducing any insertion, update and deletion dependencies (original Codd's quote comments on normalisation saying: 'undesirable(!) insertion, update and deletion dependencies', see some details here)
So, under strict definition, the aggregation (summary tables) are not considered denormalisation and they can help a lot in terms of performance (as will any cache, which is not perceived as denormalisation).
The loose usage encompasses both breaking normal forms and the principle of orthogonal design, as said before.
Another thing that might shed some light is that there is a very important difference between the logical model and the physical model.
For example indexes store redundant data, but no one considers them denormalization, not even people who use the term loosely and there are two (connected) reasons for this
they are not part of the logical model
they are transparent and guaranteed not to break integrity of your model
If you fail to properly model your logical model you will end up with inconsistent database - wrong types of relationships between your entities (inability to represent problem space), conflicting facts (ability to loose information) and you should employ whatever methods you can to get a correct logical model, it is a foundation for all applications that will be built on top of it.
Normalisation, orthogonal and clear semantics of your predicates, well defined attributes, correctly identified functional dependencies all play a factor in avoiding pitfalls.
When it comes to physical implementation things get more relaxed in a sense that ok, materialised computed column that is dependent on non key might be breaking 3NF, but if there are mechanisms that guarantee consistency it is allowed in physical model in the same way as indexes are allowed, but you have to very carefully justify it because usually normalising will yield same or better improvements across the board and will have no or less negative impact and will keep the design clear (which reduces the application development and maintenance costs) resulting in savings that you can easily spend on upgrading hardware to improve the speed even more then what is achieved with breaking NFs.
The two most popular methodologies for building a data warehouse (DW) seem to be Bill Inmon's and Ralph Kimball's.
Inmon's methodology uses normalized approach, while Kimball's uses dimensional modelling -- de-normalized star schema.
Both are well documented down to small details and both have many successful implementations. Both present a "wide, well-paved road" to a DW destination.
I can not comment on the 6NF approach nor on Anchor Modelling because I have never seen nor participated in a DW project using that methodology. When it comes to implementations, I like to travel down well tested paths -- but, that's just me.
So, to summarize, should DW be normalized or de-normalized? Depends on the methodology you pick -- simply pick one and stick to it, at least till the end of the project.
EDIT - An Example
At the place I currently work for, we had a legacy report which has been running since ever on the production server. Not a plain report, but a collection of 30 sub-reports emailed to everybody and his ant every day.
Recently, we implemented a DW. With two report servers and bunch of reports in place, I was hoping that we can forget about the legacy thing. But not, legacy is legacy, we always had it, so we want it, need it, can't live without it, etc.
The thing is that the mess-up of a python script and SQL took eight hours (yes, e-i-g-h-t hours) to run every single day. Needless to say, the database and the application were built over years by few batches of developers -- so, not exactly your 5NF.
It was time to re-create the legacy thing from the DW. Ok, to keep it short it's done and it takes 3 minutes (t-h-r-e-e minutes) to produce it, six seconds per sub-report. And I was in the hurry to deliver, so was not even optimizing all the queries. This is factor of 8 * 60 / 3 = 160 times faster -- not to mention benefits of removing an eight hour job from a production server. I think I can still shave of a minute or so, but right now no one cares.
As a point of interest, I have used Kimball's method (dimensional modelling) for the DW and everything used in this story is open-source.
This is what all this (data-warehouse) is supposed to be about, I think. Does it even matter which methodology (normalized or de-normalized) was used?
EDIT 2
As a point of interest, Bill Inmon has a nicely written paper on his website -- A Tale of Two Architectures.
The problem with the word "denormalized" is that it doesn't specify what direction to go in. It's about like trying to get to San Francisco from Chicago by driving away from New York.
A star schema or a snowflake schema is certainly not normalized. And it certainly performs better than a normalized schema in certain usage patterns. But there are cases of denormalization where the designer wasn't following any discipline at all, but just composing tables by intuition. Sometimes those efforts don't pan out.
In short, don't just denormalize. Do follow a different design discipline if you are confident of its benefits, and even if it doesn't agree with normalized design. But don't use denormalization as an excuse for haphazard design.
The short answer is don't fix a performance problem you have not got!
As for time based tables the generally accepted pardigm is to have valid_from and valid_to dates in every row. This is still basically 3NF as it only changes the semantics from "this is the one and only verision of this entity" to "this is the one and only version of this entity at this time "
Simplification:
An OLTP database should be normalised (as far as makes sense).
An OLAP data warehouse should be denormalised into Fact and Dimension tables (to minimise joins).

Database development mistakes made by application developers [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What are common database development mistakes made by application developers?
1. Not using appropriate indices
This is a relatively easy one but still it happens all the time. Foreign keys should have indexes on them. If you're using a field in a WHERE you should (probably) have an index on it. Such indexes should often cover multiple columns based on the queries you need to execute.
2. Not enforcing referential integrity
Your database may vary here but if your database supports referential integrity--meaning that all foreign keys are guaranteed to point to an entity that exists--you should be using it.
It's quite common to see this failure on MySQL databases. I don't believe MyISAM supports it. InnoDB does. You'll find people who are using MyISAM or those that are using InnoDB but aren't using it anyway.
More here:
How important are constraints like NOT NULL and FOREIGN KEY if I’ll always control my database input with php?
Are foreign keys really necessary in a database design?
Are foreign keys really necessary in a database design?
3. Using natural rather than surrogate (technical) primary keys
Natural keys are keys based on externally meaningful data that is (ostensibly) unique. Common examples are product codes, two-letter state codes (US), social security numbers and so on. Surrogate or technical primary keys are those that have absolutely no meaning outside the system. They are invented purely for identifying the entity and are typically auto-incrementing fields (SQL Server, MySQL, others) or sequences (most notably Oracle).
In my opinion you should always use surrogate keys. This issue has come up in these questions:
How do you like your primary keys?
What's the best practice for primary keys in tables?
Which format of primary key would you use in this situation.
Surrogate vs. natural/business keys
Should I have a dedicated primary key field?
This is a somewhat controversial topic on which you won't get universal agreement. While you may find some people, who think natural keys are in some situations OK, you won't find any criticism of surrogate keys other than being arguably unnecessary. That's quite a small downside if you ask me.
Remember, even countries can cease to exist (for example, Yugoslavia).
4. Writing queries that require DISTINCT to work
You often see this in ORM-generated queries. Look at the log output from Hibernate and you'll see all the queries begin with:
SELECT DISTINCT ...
This is a bit of a shortcut to ensuring you don't return duplicate rows and thus get duplicate objects. You'll sometimes see people doing this as well. If you see it too much it's a real red flag. Not that DISTINCT is bad or doesn't have valid applications. It does (on both counts) but it's not a surrogate or a stopgap for writing correct queries.
From Why I Hate DISTINCT:
Where things start to go sour in my
opinion is when a developer is
building substantial query, joining
tables together, and all of a sudden
he realizes that it looks like he is
getting duplicate (or even more) rows
and his immediate response...his
"solution" to this "problem" is to
throw on the DISTINCT keyword and POOF
all his troubles go away.
5. Favouring aggregation over joins
Another common mistake by database application developers is to not realize how much more expensive aggregation (ie the GROUP BY clause) can be compared to joins.
To give you an idea of how widespread this is, I've written on this topic several times here and been downvoted a lot for it. For example:
From SQL statement - “join” vs “group by and having”:
First query:
SELECT userid
FROM userrole
WHERE roleid IN (1, 2, 3)
GROUP by userid
HAVING COUNT(1) = 3
Query time: 0.312 s
Second query:
SELECT t1.userid
FROM userrole t1
JOIN userrole t2 ON t1.userid = t2.userid AND t2.roleid = 2
JOIN userrole t3 ON t2.userid = t3.userid AND t3.roleid = 3
AND t1.roleid = 1
Query time: 0.016 s
That's right. The join version I
proposed is twenty times faster than
the aggregate version.
6. Not simplifying complex queries through views
Not all database vendors support views but for those that do, they can greatly simplify queries if used judiciously. For example, on one project I used a generic Party model for CRM. This is an extremely powerful and flexible modelling technique but can lead to many joins. In this model there were:
Party: people and organisations;
Party Role: things those parties did, for example Employee and Employer;
Party Role Relationship: how those roles related to each other.
Example:
Ted is a Person, being a subtype of Party;
Ted has many roles, one of which is Employee;
Intel is an organisation, being a subtype of a Party;
Intel has many roles, one of which is Employer;
Intel employs Ted, meaning there is a relationship between their respective roles.
So there are five tables joined to link Ted to his employer. You assume all employees are Persons (not organisations) and provide this helper view:
CREATE VIEW vw_employee AS
SELECT p.title, p.given_names, p.surname, p.date_of_birth, p2.party_name employer_name
FROM person p
JOIN party py ON py.id = p.id
JOIN party_role child ON p.id = child.party_id
JOIN party_role_relationship prr ON child.id = prr.child_id AND prr.type = 'EMPLOYMENT'
JOIN party_role parent ON parent.id = prr.parent_id = parent.id
JOIN party p2 ON parent.party_id = p2.id
And suddenly you have a very simple view of the data you want but on a highly flexible data model.
7. Not sanitizing input
This is a huge one. Now I like PHP but if you don't know what you're doing it's really easy to create sites vulnerable to attack. Nothing sums it up better than the story of little Bobby Tables.
Data provided by the user by way of URLs, form data and cookies should always be treated as hostile and sanitized. Make sure you're getting what you expect.
8. Not using prepared statements
Prepared statements are when you compile a query minus the data used in inserts, updates and WHERE clauses and then supply that later. For example:
SELECT * FROM users WHERE username = 'bob'
vs
SELECT * FROM users WHERE username = ?
or
SELECT * FROM users WHERE username = :username
depending on your platform.
I've seen databases brought to their knees by doing this. Basically, each time any modern database encounters a new query it has to compile it. If it encounters a query it's seen before, you're giving the database the opportunity to cache the compiled query and the execution plan. By doing the query a lot you're giving the database the opportunity to figure that out and optimize accordingly (for example, by pinning the compiled query in memory).
Using prepared statements will also give you meaningful statistics about how often certain queries are used.
Prepared statements will also better protect you against SQL injection attacks.
9. Not normalizing enough
Database normalization is basically the process of optimizing database design or how you organize your data into tables.
Just this week I ran across some code where someone had imploded an array and inserted it into a single field in a database. Normalizing that would be to treat element of that array as a separate row in a child table (ie a one-to-many relationship).
This also came up in Best method for storing a list of user IDs:
I've seen in other systems that the list is stored in a serialized PHP array.
But lack of normalization comes in many forms.
More:
Normalization: How far is far enough?
SQL by Design: Why You Need Database Normalization
10. Normalizing too much
This may seem like a contradiction to the previous point but normalization, like many things, is a tool. It is a means to an end and not an end in and of itself. I think many developers forget this and start treating a "means" as an "end". Unit testing is a prime example of this.
I once worked on a system that had a huge hierarchy for clients that went something like:
Licensee -> Dealer Group -> Company -> Practice -> ...
such that you had to join about 11 tables together before you could get any meaningful data. It was a good example of normalization taken too far.
More to the point, careful and considered denormalization can have huge performance benefits but you have to be really careful when doing this.
More:
Why too much Database Normalization can be a Bad Thing
How far to take normalization in database design?
When Not to Normalize your SQL Database
Maybe Normalizing Isn't Normal
The Mother of All Database Normalization Debates on Coding Horror
11. Using exclusive arcs
An exclusive arc is a common mistake where a table is created with two or more foreign keys where one and only one of them can be non-null. Big mistake. For one thing it becomes that much harder to maintain data integrity. After all, even with referential integrity, nothing is preventing two or more of these foreign keys from being set (complex check constraints notwithstanding).
From A Practical Guide to Relational Database Design:
We have strongly advised against exclusive arc construction wherever
possible, for the good reason that they can be awkward to write code
and pose more maintenance difficulties.
12. Not doing performance analysis on queries at all
Pragmatism reigns supreme, particularly in the database world. If you're sticking to principles to the point that they've become a dogma then you've quite probably made mistakes. Take the example of the aggregate queries from above. The aggregate version might look "nice" but its performance is woeful. A performance comparison should've ended the debate (but it didn't) but more to the point: spouting such ill-informed views in the first place is ignorant, even dangerous.
13. Over-reliance on UNION ALL and particularly UNION constructs
A UNION in SQL terms merely concatenates congruent data sets, meaning they have the same type and number of columns. The difference between them is that UNION ALL is a simple concatenation and should be preferred wherever possible whereas a UNION will implicitly do a DISTINCT to remove duplicate tuples.
UNIONs, like DISTINCT, have their place. There are valid applications. But if you find yourself doing a lot of them, particularly in subqueries, then you're probably doing something wrong. That might be a case of poor query construction or a poorly designed data model forcing you to do such things.
UNIONs, particularly when used in joins or dependent subqueries, can cripple a database. Try to avoid them whenever possible.
14. Using OR conditions in queries
This might seem harmless. After all, ANDs are OK. OR should be OK too right? Wrong. Basically an AND condition restricts the data set whereas an OR condition grows it but not in a way that lends itself to optimisation. Particularly when the different OR conditions might intersect thus forcing the optimizer to effectively to a DISTINCT operation on the result.
Bad:
... WHERE a = 2 OR a = 5 OR a = 11
Better:
... WHERE a IN (2, 5, 11)
Now your SQL optimizer may effectively turn the first query into the second. But it might not. Just don't do it.
15. Not designing their data model to lend itself to high-performing solutions
This is a hard point to quantify. It is typically observed by its effect. If you find yourself writing gnarly queries for relatively simple tasks or that queries for finding out relatively straightforward information are not efficient, then you probably have a poor data model.
In some ways this point summarizes all the earlier ones but it's more of a cautionary tale that doing things like query optimisation is often done first when it should be done second. First and foremost you should ensure you have a good data model before trying to optimize the performance. As Knuth said:
Premature optimization is the root of all evil
16. Incorrect use of Database Transactions
All data changes for a specific process should be atomic. I.e. If the operation succeeds, it does so fully. If it fails, the data is left unchanged. - There should be no possibility of 'half-done' changes.
Ideally, the simplest way to achieve this is that the entire system design should strive to support all data changes through single INSERT/UPDATE/DELETE statements. In this case, no special transaction handling is needed, as your database engine should do so automatically.
However, if any processes do require multiple statements be performed as a unit to keep the data in a consistent state, then appropriate Transaction Control is necessary.
Begin a Transaction before the first statement.
Commit the Transaction after the last statement.
On any error, Rollback the Transaction. And very NB! Don't forget to skip/abort all statements that follow after the error.
Also recommended to pay careful attention to the subtelties of how your database connectivity layer, and database engine interact in this regard.
17. Not understanding the 'set-based' paradigm
The SQL language follows a specific paradigm suited to specific kinds of problems. Various vendor-specific extensions notwithstanding, the language struggles to deal with problems that are trivial in langues like Java, C#, Delphi etc.
This lack of understanding manifests itself in a few ways.
Inappropriately imposing too much procedural or imperative logic on the databse.
Inappropriate or excessive use of cursors. Especially when a single query would suffice.
Incorrectly assuming that triggers fire once per row affected in multi-row updates.
Determine clear division of responsibility, and strive to use the appropriate tool to solve each problem.
Key database design and programming mistakes made by developers
Selfish database design and usage. Developers often treat the database as their personal persistent object store without considering the needs of other stakeholders in the data. This also applies to application architects. Poor database design and data integrity makes it hard for third parties working with the data and can substantially increase the system's life cycle costs. Reporting and MIS tends to be a poor cousin in application design and only done as an afterthought.
Abusing denormalised data. Overdoing denormalised data and trying to maintain it within the application is a recipe for data integrity issues. Use denormalisation sparingly. Not wanting to add a join to a query is not an excuse for denormalising.
Scared of writing SQL. SQL isn't rocket science and is actually quite good at doing its job. O/R mapping layers are quite good at doing the 95% of queries that are simple and fit well into that model. Sometimes SQL is the best way to do the job.
Dogmatic 'No Stored Procedures' policies. Regardless of whether you believe stored procedures are evil, this sort of dogmatic attitude has no place on a software project.
Not understanding database design. Normalisation is your friend and it's not rocket science. Joining and cardinality are fairly simple concepts - if you're involved in database application development there's really no excuse for not understanding them.
Not using version control on the database schema
Working directly against a live database
Not reading up and understanding more advanced database concepts (indexes, clustered indexes, constraints, materialized views, etc)
Failing to test for scalability ... test data of only 3 or 4 rows will never give you the real picture of real live performance
Over-use and/or dependence on stored procedures.
Some application developers see stored procedures as a direct extension of middle tier/front end code. This appears to be a common trait in Microsoft stack developers, (I'm one, but I've grown out of it) and produces many stored procedures that perform complex business logic and workflow processing. This is much better done elsewhere.
Stored procedures are useful where it has actuallly been proven that some real technical factor necessitates their use (for example, performance and security) For example, keeping aggregation/filtering of large data sets "close to the data".
I recently had to help maintain and enhance a large Delphi desktop application of which 70% of the business logic and rules were implemented in 1400 SQL Server stored procedures (the remainder in UI event handlers). This was a nightmare, primarily due to the difficuly of introducing effective unit testing to TSQL, lack of encapsulation and poor tools (Debuggers, editors).
Working with a Java team in the past I quickly found out that often the complete opposite holds in that environment. A Java Architect once told me: "The database is for data, not code.".
These days I think it's a mistake to not consider stored procs at all, but they should be used sparingly (not by default) in situations where they provide useful benefits (see the other answers).
Number one problem? They only test on toy databases. So they have no idea that their SQL will crawl when the database gets big, and someone has to come along and fix it later (that sound you can hear is my teeth grinding).
Not using indexes.
Poor Performance Caused by Correlated Subqueries
Most of the time you want to avoid correlated subqueries. A subquery is correlated if, within the subquery, there is a reference to a column from the outer query. When this happens, the subquery is executed at least once for every row returned and could be executed more times if other conditions are applied after the condition containing the correlated subquery is applied.
Forgive the contrived example and the Oracle syntax, but let's say you wanted to find all the employees that have been hired in any of your stores since the last time the store did less than $10,000 of sales in a day.
select e.first_name, e.last_name
from employee e
where e.start_date >
(select max(ds.transaction_date)
from daily_sales ds
where ds.store_id = e.store_id and
ds.total < 10000)
The subquery in this example is correlated to the outer query by the store_id and would be executed for every employee in your system. One way that this query could be optimized is to move the subquery to an inline-view.
select e.first_name, e.last_name
from employee e,
(select ds.store_id,
max(s.transaction_date) transaction_date
from daily_sales ds
where ds.total < 10000
group by s.store_id) dsx
where e.store_id = dsx.store_id and
e.start_date > dsx.transaction_date
In this example, the query in the from clause is now an inline-view (again some Oracle specific syntax) and is only executed once. Depending on your data model, this query will probably execute much faster. It would perform better than the first query as the number of employees grew. The first query could actually perform better if there were few employees and many stores (and perhaps many of stores had no employees) and the daily_sales table was indexed on store_id. This is not a likely scenario but shows how a correlated query could possibly perform better than an alternative.
I've seen junior developers correlate subqueries many times and it usually has had a severe impact on performance. However, when removing a correlated subquery be sure to look at the explain plan before and after to make sure you are not making the performance worse.
In my experience:
Not communicating with experienced DBAs.
Using Access instead of a "real" database. There are plenty of great small and even free databases like SQL Express, MySQL, and SQLite that will work and scale much better. Apps often need to scale in unexpected ways.
Forgetting to set up relationships between the tables. I remember having to clean this up when I first started working at my current employer.
Using Excel for storing (huge amounts of) data.
I have seen companies holding thousands of rows and using multiple worksheets (due to the row limit of 65535 on previous versions of Excel).
Excel is well suited for reports, data presentation and other tasks, but should not be treated as a database.
I'd like to add:
Favoring "Elegant" code over highly performing code. The code that works best against databases is often ugly to the application developer's eye.
Believing that nonsense about premature optimization. Databases must consider performance in the original design and in any subsequent development. Performance is 50% of database design (40% is data integrity and the last 10% is security) in my opinion. Databases which are not built from the bottom up to perform will perform badly once real users and real traffic are placed against the database. Premature optimization doesn't mean no optimization! It doesn't mean you should write code that will almost always perform badly because you find it easier (cursors for example which should never be allowed in a production database unless all else has failed). It means you don't need to look at squeezing out that last little bit of performance until you need to. A lot is known about what will perform better on databases, to ignore this in design and development is short-sighted at best.
Not using parameterized queries. They're pretty handy in stopping SQL Injection.
This is a specific example of not sanitizing input data, mentioned in another answer.
I hate it when developers use nested select statements or even functions the return the result of a select statement inside the "SELECT" portion of a query.
I'm actually surprised I don't see this anywhere else here, perhaps I overlooked it, although #adam has a similar issue indicated.
Example:
SELECT
(SELECT TOP 1 SomeValue FROM SomeTable WHERE SomeDate = c.Date ORDER BY SomeValue desc) As FirstVal
,(SELECT OtherValue FROM SomeOtherTable WHERE SomeOtherCriteria = c.Criteria) As SecondVal
FROM
MyTable c
In this scenario, if MyTable returns 10000 rows the result is as if the query just ran 20001 queries, since it had to run the initial query plus query each of the other tables once for each line of result.
Developers can get away with this working in a development environment where they are only returning a few rows of data and the sub tables usually only have a small amount of data, but in a production environment, this kind of query can become exponentially costly as more data is added to the tables.
A better (not necessarily perfect) example would be something like:
SELECT
s.SomeValue As FirstVal
,o.OtherValue As SecondVal
FROM
MyTable c
LEFT JOIN (
SELECT SomeDate, MAX(SomeValue) as SomeValue
FROM SomeTable
GROUP BY SomeDate
) s ON c.Date = s.SomeDate
LEFT JOIN SomeOtherTable o ON c.Criteria = o.SomeOtherCriteria
This allows database optimizers to shuffle the data together, rather than requery on each record from the main table and I usually find when I have to fix code where this problem has been created, I usually end up increasing the speed of queries by 100% or more while simultaneously reducing CPU and memory usage.
For SQL-based databases:
Not taking advantage of CLUSTERED INDEXES or choosing the wrong column(s) to CLUSTER.
Not using a SERIAL (autonumber) datatype as a PRIMARY KEY to join to a FOREIGN KEY (INT) in a parent/child table relationship.
Not UPDATING STATISTICS on a table when many records have been INSERTED or DELETED.
Not reorganizing (i.e. unloading, droping, re-creating, loading and re-indexing) tables when many rows have been inserted or deleted (some engines physically keep deleted rows in a table with a delete flag.)
Not taking advantage of FRAGMENT ON EXPRESSION (if supported) on large tables which have high transaction rates.
Choosing the wrong datatype for a column!
Not choosing a proper column name.
Not adding new columns at the end of the table.
Not creating proper indexes to support frequently used queries.
creating indexes on columns with few possible values and creating unnecessary indexes.
...more to be added.
Not taking a backup before fixing some issue inside production database.
Using DDL commands on stored objects(like tables, views) in stored procedures.
Fear of using stored proc or fear of using ORM queries wherever the one is more efficient/appropriate to use.
Ignoring the use of a database profiler, which can tell you exactly what your ORM query is being converted into finally and hence verify the logic or even for debugging when not using ORM.
Not doing the correct level of normalization. You want to make sure that data is not duplicated, and that you are splitting data into different as needed. You also need to make sure you are not following normalization too far as that will hurt performance.
Treating the database as just a storage mechanism (i.e. glorified collections library) and hence subordinate to their application (ignoring other applications which share the data)
Dismissing an ORM like Hibernate out of hand, for reasons like "it's too magical" or "not on my database".
Relying too heavily on an ORM like Hibernate and trying to shoehorn it in where it isn't appropriate.
1 - Unnecessarily using a function on a value in a where clause with the result of that index not being used.
Example:
where to_char(someDate,'YYYYMMDD') between :fromDate and :toDate
instead of
where someDate >= to_date(:fromDate,'YYYYMMDD') and someDate < to_date(:toDate,'YYYYMMDD')+1
And to a lesser extent: Not adding functional indexes to those values that need them...
2 - Not adding check constraints to ensure the validity of the data. Constraints can be used by the query optimizer, and they REALLY help to ensure that you can trust your invariants. There's just no reason not to use them.
3 - Adding unnormalized columns to tables out of pure laziness or time pressure. Things are usually not designed this way, but evolve into this. The end result, without fail, is a ton of work trying to clean up the mess when you're bitten by the lost data integrity in future evolutions.
Think of this, a table without data is very cheap to redesign. A table with a couple of millions records with no integrity... not so cheap to redesign. Thus, doing the correct design when creating the column or table is amortized in spades.
4 - not so much about the database per se but indeed annoying. Not caring about the code quality of SQL. The fact that your SQL is expressed in text does not make it OK to hide the logic in heaps of string manipulation algorithms. It is perfectly possible to write SQL in text in a manner that is actually readable by your fellow programmer.
This has been said before, but: indexes, indexes, indexes. I've seen so many cases of poorly performing enterprise web apps that were fixed by simply doing a little profiling (to see which tables were being hit a lot), and then adding an index on those tables. This doesn't even require much in the way of SQL writing knowledge, and the payoff is huge.
Avoid data duplication like the plague. Some people advocate that a little duplication won't hurt, and will improve performance. Hey, I'm not saying that you have to torture your schema into Third Normal Form, until it's so abstract that not even the DBA's know what's going on. Just understand that whenever you duplicate a set of names, or zipcodes, or shipping codes, the copies WILL fall out of synch with each other eventually. It WILL happen. And then you'll be kicking yourself as you run the weekly maintenance script.
And lastly: use a clear, consistent, intuitive naming convention. In the same way that a well written piece of code should be readable, a good SQL schema or query should be readable and practically tell you what it's doing, even without comments. You'll thank yourself in six months, when you have to to maintenance on the tables. "SELECT account_number, billing_date FROM national_accounts" is infinitely easier to work with than "SELECT ACCNTNBR, BILLDAT FROM NTNLACCTS".
Not executing a corresponding SELECT query before running the DELETE query (particularly on production databases)!
The most common mistake I've seen in twenty years: not planning ahead. Many developers will create a database, and tables, and then continually modify and expand the tables as they build out the applications. The end result is often a mess and inefficient and difficult to clean up or simplify later on.
a) Hardcoding query values in string
b) Putting the database query code in the "OnButtonPress" action in a Windows Forms application
I have seen both.
Not paying enough attention towards managing database connections in your application. Then you find out the application, the computer, the server, and the network is clogged.
Thinking that they are DBAs and data modelers/designers when they have no formal indoctrination of any kind in those areas.
Thinking that their project doesn't require a DBA because that stuff is all easy/trivial.
Failure to properly discern between work that should be done in the database, and work that should be done in the app.
Not validating backups, or not backing up.
Embedding raw SQL in their code.
Here is a link to video called ‘Classic Database Development Mistakes and five ways to overcome them’ by Scott Walz
Not having an understanding of the databases concurrency model and how this affects development. It's easy to add indexes and tweak queries after the fact. However applications designed without proper consideration for hotspots, resource contention
and correct operation (Assuming what you just read is still valid!) can require significant changes within the database and application tier to correct later.
Not understanding how a DBMS works under the hood.
You cannot properly drive a stick without understanding how a clutch works. And you cannot understand how to use a Database without understanding that you are really just writing to a file on your hard disk.
Specifically:
Do you know what a Clustered Index is? Did you think about it when you designed your schema?
Do you know how to use indexes properly? How to reuse an index? Do you know what a Covering Index is?
So great, you have indexes. How big is 1 row in your index? How big will the index be when you have a lot of data? Will that fit easily into memory? If it won't it's useless as an index.
Have you ever used EXPLAIN in MySQL? Great. Now be honest with yourself: Did you understand even half of what you saw? No, you probably didn't. Fix that.
Do you understand the Query Cache? Do you know what makes a query un-cachable?
Are you using MyISAM? If you NEED full text search, MyISAM's is crap anyway. Use Sphinx. Then switch to Inno.
Using an ORM to do bulk updates
Selecting more data than needed. Again, typically done when using an ORM
Firing sqls in a loop.
Not having good test data and noticing performance degradation only on live data.

What's the better database design: more tables or more columns?

A former coworker insisted that a database with more tables with fewer columns each is better than one with fewer tables with more columns each. For example rather than a customer table with name, address, city, state, zip, etc. columns, you would have a name table, an address table, a city table, etc.
He argued this design was more efficient and flexible. Perhaps it is more flexible, but I am not qualified to comment on its efficiency. Even if it is more efficient, I think those gains may be outweighed by the added complexity.
So, are there any significant benefits to more tables with fewer columns over fewer tables with more columns?
I have a few fairly simple rules of thumb I follow when designing databases, which I think can be used to help make decisions like this....
Favor normalization. Denormalization is a form of optimization, with all the requisite tradeoffs, and as such it should be approached with a YAGNI attitude.
Make sure that client code referencing the database is decoupled enough from the schema that reworking it doesn't necessitate a major redesign of the client(s).
Don't be afraid to denormalize when it provides a clear benefit to performance or query complexity.
Use views or downstream tables to implement denormalization rather than denormalizing the core of the schema, when data volume and usage scenarios allow for it.
The usual result of these rules is that the initial design will favor tables over columns, with a focus on eliminating redundancy. As the project progresses and denormalization points are identified, the overall structure will evolve toward a balance that compromises with limited redundancy and column proliferation in exchange for other valuable benefits.
It doesn't sound so much like a question about tables/columns, but about normalization. In some situations have a high degree of normalization ("more tables" in this case) is good, and clean, but it typically takes a high number of JOINs to get relevant results. And with a large enough dataset, this can bog down performance.
Jeff wrote a little about it regarding the design of StackOverflow. See also the post Jeff links to by Dare Obasanjo.
I would argue in favor of more tables, but only up to a certain point. Using your example, if you separated your user's information into two tables, say USERS and ADDRESS, this gives you the flexibility to have multiple addresses per user. One obvious application of this is a user who has separate billing and shipping addresses.
The argument in favor of having a separate CITY table would be that you only have to store each city's name once, then refer to it when you need it. That does reduce duplication, but in this example I think it's overkill. It may be more space efficient, but you'll pay the price in joins when you select data from your database.
A fully normalized design (i.e, "More Tables") is more flexible, easier to maintain, and avoids duplication of data, which means your data integrity is going to be a lot easier to enforce.
Those are powerful reasons to normalize. I would choose to normalize first, and then only denormalize specific tables after you saw that performance was becoming an issue.
My experience is that in the real world, you won't reach the point where denormalization is necessary, even with very large data sets.
Each table should only include columns that pertain to the entity that's uniquely identified by the primary key. If all the columns in the database are all attributes of the same entity, then you'd only need one table with all the columns.
If any of the columns may be null, though, you would need to put each nullable column into its own table with a foreign key to the main table in order to normalize it. This is a common scenario, so for a cleaner design, you're likley to be adding more tables than columns to existing tables. Also, by adding these optional attributes to their own table, they would no longer need to allow nulls and you avoid a slew of NULL-related issues.
It depends on your database flavor. MS SQL Server, for example, tends to prefer narrower tables. That's also the more 'normalized' approach. Other engines might prefer it the other way around. Mainframes tend to fall in that category.
The multi-table database is a lot more flexible if any of these one to one relationships may become one to many or many to many in the future. For example, if you need to store multiple addresses for some customers, it's a lot easier if you have a customer table and an address table. I can't really see a situation where you might need to duplicate some parts of an address but not others, so separate address, city, state, and zip tables may be a bit over the top.
Like everything else: it depends.
There is no hard and fast rule regarding column count vs table count.
If your customers need to have multiple addresses, then a separate table for that makes sense. If you have a really good reason to normalize the City column into its own table, then that can go, too, but I haven't seen that before because it's a free form field (usually).
A table heavy, normalized design is efficient in terms of space and looks "textbook-good" but can get extremely complex. It looks nice until you have to do 12 joins to get a customer's name and address. These designs are not automatically fantastic in terms of performance that matters most: queries.
Avoid complexity if possible. For example, if a customer can have only two addresses (not arbitrarily many), then it might make sense to just keep them all in a single table (CustomerID, Name, ShipToAddress, BillingAddress, ShipToCity, BillingCity, etc.).
Here's Jeff's post on the topic.
There are advantages to having tables with fewer columns, but you also need to look at your scenario above and answer these questions:
Will the customer be allowed to have more than 1 address? If not, then a separate table for address is not necessary. If so, then a separate table becomes helpful because you can easily add more addresses as needed down the road, where it becomes more difficult to add more columns to the table.
i would consider normalizing as the first step, so cities, counties, states, countries would be better as separate columns... the power of SQL language, together with today DBMS-es allows you to group your data later if you need to view it in some other, non-normalized view.
When the system is being developed, you might consider 'unnormalizing' some part if you see that as an improvement.
I think balance is in order in this case. If it makes sense to put a column in a table, then put it in the table, if it doesn't, then don't. Your coworkers approach would definately help to normalize the database, but that might not be very useful if you have to join 50 tables together to get the information you need.
I guess what my answer would be is, use your best judgement.
There are many sides to this, but from an application efficiency perspective mote tables can be more efficient at times. If you have a few tables with a bunch of columns every time the db as to do an operation it has a chance of making a lock, more data is made unavailable for the duration of the lock. If locks get escalated to page and tables (well hopefully not tables :) ) you can see how this can slow down the system.
Hmm.
I think its a wash and depends on your particular design model. Definitely factor out entities that have more than a few fields out into their own table, or entities whose makeup will likely change as your application's requirements changes (for instance - I'd factor out address anyways, since it has so many fields, but I'd especially do it if you thought there was any chance you'd need to handle foreign country addresses, which can be of a different form. The same with phone numbers).
That said, when you're got it working, keep an eye out on performance. If you've spun an entity out that requires you to do large, expensive joins, maybe it becomes a better design decision to spin that table back into the original.
When you design your database, you should be as close as possible from the meaning of data and NOT your application need !
A good database design should stand over 20 years without a change.
A customer could have multiple adresses, that's the reality. If you decided that's your application is limited to one adresse for the first release, it's concern the design of your application not the data !
It's better to have multiple table instead of multiple column and use view if you want to simplify your query.
Most of time you will have performance issue with a database it's about network performance (chain query with one row result, fetch column you don't need, etc) not about the complexity of your query.
There are huge benefits to queries using as few columns as possible. But the table itself can have a large number. Jeff says something on this as well.
Basically, make sure that you don't ask for more than you need when doing a query - performance of queries is directly related to the number of columns you ask for.
I think you have to look at the kind of data you're storing before you make that decision. Having an address table is great but only if the likelihood of multiple people sharing the same address is high. If every person had different addresses, keeping that data in a different table just introduces unnecessary joins.
I don't see the benefit of having a city table unless cities in of themselves are entities you care about in your application. Or if you want to limit the number of cities available to your users.
Bottom line is decisions like this have to take the application itself into considering before you start shooting for efficiency. IMO.
First, normalize your tables. This ensures you avoid redundant data, giving you less rows of data to scan, which improves your queries. Then, if you run into a point where the normalized tables you are joining are causing the query to take to long to process (expensive join clause), denormalize where more appropriate.
Good to see so many inspiring and well based answers.
My answer would be (unfortunately): it depends.
Two cases:
* If you create a datamodel that is to be used for many years and thus possibly has to adept many future changes: go for more tables and less rows and pretty strict normalization.
* In other cases you can choose between more tables-less rows or less tables-more rows. Especially for people relatively new to the subject this last approach can be more intuitive and easy to comprehend.
The same is valid for the choosing between the object oriented approach and other options.

Resources