Relational Database schema design for metric storage

Considering a system that has the following characteristics:
Stores time series data/metrics collected from multiple sensors/inputs.
Data points (metrics) are collected from many different systems at different times.
Each metric is generally a single data point (e.g. temperature and humidity are not reported at the same time, but individually, each with its own timestamp).
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu; tomorrow a sensor may be added that monitors CO2 and RAM).
A summary of all metrics for a given time bucket needs to be obtained via a query, and is likely to be the most common querying scenario.
I can think of three ways of modeling this.
1. Wide table - with table per category (covered)
Notes: has lots of sparse values due to the data points being collected individually. Storage of new metrics require a new column
2. Narrow table - with table per metric (covered)
Notes: Storage of new metrics require a new table
3. Typed table (not covered) - with single metric table (not covered)
Notes: Storage of new metrics just requires a new row in the metricType table, no schema changes. Concerned about performance implications due to the chunk size, although grouping by a time bucket across all metrics would not require joins and could therefore be faster?
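For concreteness, here is a rough sketch of the three options in generic SQL (table and column names are only illustrative, and types would need adjusting per platform):

-- Option 1: wide table - one nullable column per metric; each new metric is an ALTER TABLE
CREATE TABLE reading_wide (
    sensor_id    INT          NOT NULL,
    reading_at   TIMESTAMP    NOT NULL,
    temperature  DECIMAL(9,3) NULL,
    humidity     DECIMAL(9,3) NULL,
    cpu          DECIMAL(9,3) NULL,
    PRIMARY KEY (sensor_id, reading_at)
);

-- Option 2: narrow table per metric; each new metric is a new table
CREATE TABLE reading_temperature (
    sensor_id    INT          NOT NULL,
    reading_at   TIMESTAMP    NOT NULL,
    value        DECIMAL(9,3) NOT NULL,
    PRIMARY KEY (sensor_id, reading_at)
);

-- Option 3: typed table; each new metric is just a new row in metric_type
CREATE TABLE metric_type (
    metric_type_id INT         NOT NULL PRIMARY KEY,
    name           VARCHAR(30) NOT NULL UNIQUE
);

CREATE TABLE reading (
    sensor_id      INT          NOT NULL,
    metric_type_id INT          NOT NULL REFERENCES metric_type (metric_type_id),
    reading_at     TIMESTAMP    NOT NULL,
    value          DECIMAL(9,3) NOT NULL,
    PRIMARY KEY (sensor_id, metric_type_id, reading_at)
);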
I was wondering if anyone could comment on the options presented, point me to some performance benchmarks that include 3 as well as 1 and 2, or generally give any advice on the suitability of each approach. I'm planning to run my own experiments on this and I will post the results when done, but any insight at this stage would be gratefully received. :)
Please note: do not suggest a NoSQL solution. I'm aware of the options in that space and am assessing them separately.

1 Proposal
"Wide table"
That has gross Normalisation errors (as well as, if taken seriously, masses of Nulls and integrity problems). It is unusable; no further comment is required.
"Narrow table"
That is free of errors, but the Normalisation is not yet complete.
"Typed table"
That is sort of complete, the "best" of your three scenarios. But it views the issue through a narrow lens, and in total isolation from the context in which the issue exists. Thus it is in error for reasons other than those you inquire about.
2 Problem
The first problem is that you are comparing three things which are not reasonably comparable, not reasonably equal to each other.
The second problem is, EAV is the flavour of the month, and many people are attracted to it. However, it has major problems, and requires an additional set of "metadata" tables if it is to be implemented with some data integrity. The point is, EAV is not needed.
3 Solution
The types of metrics that are collected will expand over time - the system is open and additional inputs will be supported over time (e.g. today we collect temp, humidity and cpu; tomorrow a sensor may be added that monitors CO2 and RAM).
This is actually a straight-forward Relational database problem, which is solved by a perfectly ordinary Relational design, which provides full Relation Power; Relational Integrity; and Speed (which other designs will not have).
3.1 Caveat
But there are a few caveats, due to the fact that what is marketed as "relational" is not Relational.
Get rid of the Record ID fields, they are anti-Relational.
Record IDs reduce your schema to a 1970's style Record Filing system (located in an SQL container for convenience).
Record IDs do not provide row uniqueness, which is demanded by the Relational Model.
Further, they require one additional field and one additional index per file.
When modelling a database (Relational or not), perceive the data, as data, and nothing but data. Do not view the data in terms of your need re the GUI, or some query or other.
It is an error to concern yourself with performance issues at this (modelling) stage. First get it right. Second, make it fast. Do not reverse the prescribed sequence.
Relational Keys provide meaning, as well as Relational Integrity (which is Logical, and distinct from Referential Integrity, which is a physical facility of SQL). What this addresses is the context in which an object exists.
A Sensor does not exist in isolation (except when it is in a package on a shelf in a shop ... but even then, it exists in the context of the shop inventory)
An active Sensor exists only in the context of the object in which it is housed. You have not provided any info regarding that. Let's call the thing Article as a generic label.
Further, it is the Article that requires a limit on the Metric that is being measured by the Sensor (for the purpose of out-of-range alarms, etc), and not the Sensor itself. (The Sensor may have a range, which is a different thing.)
Likewise, a Sensor exists in a Location, which is a second vector. Or else, the Article exists in a Location, and the Article Key carries the Location. I have modelled the latter.
3.2 Data Model
Here is the solution:
Sensor Data Model (diagram; provided inline and, for browsers that do not show inline graphics, as a PDF).
It will satisfy both OLTP and OLAP (Dimension-Fact) requirements.
If you provide more context, we can get that modelled precisely. This may take a bit of to-and-fro.
It is limited to the info provided.
I have taken MetricType and SensorType to be synonymous.
Article is shown as Dependent on (exists within) Location, alternately they could be separate vectors. In any case, Article and Location together qualify Sensor.
Since SensorSerialNo is unique (AK2), Reading(SensorSerialNo, DateTime) is unique, so an index is not required. However, in the event that there are many queries on Reading via SensorSerialNo alone, such an index will boost performance.
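For readers who cannot open the diagram, here is a rough DDL approximation of the keys described above. It is a sketch only, not the data model itself: the table and column names, data types and the exact choice of Reading PK are assumptions; the linked model is authoritative.

CREATE TABLE Location (
    Location     CHAR(6)      NOT NULL PRIMARY KEY,   -- short code the user actually uses
    Name         VARCHAR(60)  NOT NULL UNIQUE         -- Alternate Key
);

CREATE TABLE Article (
    Location     CHAR(6)      NOT NULL REFERENCES Location (Location),
    Article      CHAR(12)     NOT NULL,
    PRIMARY KEY (Location, Article)                   -- Article exists within Location
);

CREATE TABLE SensorType (
    SensorType   CHAR(6)      NOT NULL PRIMARY KEY    -- taken as synonymous with MetricType
);

CREATE TABLE Sensor (
    Location       CHAR(6)    NOT NULL,
    Article        CHAR(12)   NOT NULL,
    SensorType     CHAR(6)    NOT NULL REFERENCES SensorType (SensorType),
    SensorSerialNo CHAR(12)   NOT NULL UNIQUE,        -- AK2
    PRIMARY KEY (Location, Article, SensorType),      -- Location and Article together qualify Sensor
    FOREIGN KEY (Location, Article) REFERENCES Article (Location, Article)
);

CREATE TABLE Reading (
    Location     CHAR(6)       NOT NULL,
    Article      CHAR(12)      NOT NULL,
    SensorType   CHAR(6)       NOT NULL,
    ReadingDtm   TIMESTAMP     NOT NULL,
    Value        DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (Location, Article, SensorType, ReadingDtm),
    FOREIGN KEY (Location, Article, SensorType)
        REFERENCES Sensor (Location, Article, SensorType)
);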
Please feel free to ask questions, and I will answer.
For those who are completely new to IDEF1X, refer to IDEF1X Introduction.
For those who are familiar with IDEF1X, and only want a brush-up, refer to IDEF1X Anatomy.
4 Performance
Your concern re performance is good, but far too premature to be applied at this stage. First get the data model right, second get the data structures fast. The reasons for that are many, not the least of which is, when the data is Normalised, Relationally, the structures are already very fast. Further, one should never optimise for a particular query (one can add indices, if necessary, in the second stage).
Nevertheless, I will respond to your stated concerns.
Eg. a ClusteredIndex on the prescribed Reading PK will:
Serve most queries, most Dimensions (except queries that use SensorSerialNo alone, in which case I have suggested an additional index)
Serve all OLTP Transactions and ensure the highest concurrency, because the Sensors are distributed per the real world: across Locations and Articles.
Whereas an Index on a Record ID guarantees a HotSpot on every single INSERT. Great for creating Deadlocks.
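For contrast, a minimal sketch (SQL Server-style syntax; an invented table, not part of the data model above) of the Record ID style being warned against; the monotonically increasing key clusters every INSERT onto the last page, which is where the hot spot comes from:

CREATE TABLE Reading_RecordId (
    ReadingID      INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,  -- surrogate; every insert hits the last page
    SensorSerialNo CHAR(12)          NOT NULL,
    ReadingDtm     DATETIME          NOT NULL,
    Value          DECIMAL(12,4)     NOT NULL
);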
4.1 Benchmark
I do have a hundred or so benchmarks for data structures such as this, collected over the last four decades for both OLTP & OLAP use. Most of my customers are banks (Think: Sensor Readings are very much like Stock Prices that change over the period of a day; several vectors (Dimensions); billions of rows). Banks are very strict about confidentiality, so I cannot publish the benchmarks as is, and redacting them will take time and effort.
I do have one benchmark for a very similar requirement, that is public. In fact, it was included in an Answer to a SO Question re Time Series data, but the seeker got the moderators to excise it (it is embarrassing to Oracle). Here is the Benchmark Summary for the Sybase ASE vs Oracle 10.2 benchmark on a fixed DDL (Time Series data) and population.
Finally, the structures and code required are simple enough for you to run your own benchmark.
5 Response to Other Answers
Re Neville's comments:
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.
Noting that his comments regard EAV, but that they may be taken to apply equally to the subject table in this case (Reading, an ordinary, non-EAV Relational table), because they concern the query type (and not the EAV concept vs the Relational concept):
The declaration does not apply to Relational tables (it may well apply to EAV; the masses of problems introduced due to Record IDs; etc)
As long as you have
a genuine Relational database schema (as I have suggested), and
a genuine SQL platform (not a pretend "sql", which does not comply but fraudulently uses the name), and
you understand IN and NOT IN, and how to compare Sets in SQL
... such queries are straight-forward to code.
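As an illustration only (not the author's code; names follow the rough DDL approximation above, the platform-specific date bucketing is simplified, and the "for more than 3 hours" duration test is omitted for brevity), such a comparison can be expressed with a correlated subquery rather than a chain of self-joins:

SELECT DISTINCT CAST(c.ReadingDtm AS DATE) AS ReadingDay
FROM   Reading c
WHERE  c.SensorType = 'CPU'
  AND  c.Value > 30
  AND  EXISTS (                                   -- Set comparison: a matching humidity reading on the same day
           SELECT 1
           FROM   Reading h
           WHERE  h.SensorType = 'HUM'
             AND  h.Value < 56
             AND  CAST(h.ReadingDtm AS DATE) = CAST(c.ReadingDtm AS DATE)
       );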
6 Response to Comments
Record ID is Anti-Relational
Do you have any links on the record_id being anti-relational? I don't disbelieve you for a second, but I'm interested to learn more about why this anti-pattern is so prevalent.
In this mess of anti-science, the academics manufacture and contrive various "solutions" to "problems", that do not exist in the Relational Model, and then you have a second level of endless "debates" about which correction to the non-problem is better or worse.
You don't need links because there is nothing to "debate", and whatever "debate" you might happen to read misses the above point.
The one and only authority is the great Dr E F Codd. All the authors of all books and textbooks alleging to be about the Relational Model, other than Codd, are actually false, they are about implementing 1970's style Record Filing Systems, and anti-Relational (no Relational Power; no Relational Integrity; no Relational Speed). They made the mistake, from 1970, of trying to fit the RM into their 1970's RFS mindset, rather than releasing it and taking on the RM mindset. And they have spent the last FIVE DECADES reinforcing that, even justifying it with "mathematical definitions"; 17 "relational algebras"; 42 abnormal "normal forms". All completely anti-Relational. And they cite each other, so they get published.
The second problem is, sites such as SO are predicated on the basis of populism. The popular answer is not the best or correct answer. For that you need an Authority (very scary to populists), and objective, absolute truth. (People love their relative or subjective "truths", that change all the time).
Therefore, you need just the single, authoritative definition, the original paper, the Relational model.
Yes, the terms are out-dated, and not well understood these days.
Yes, it is seminal (every word counts, has deep meaning).
No, you need not read section 2 (math).
You need to glean from that, that:
the Relational Key is “made up from the data” (my paraphrase of the several entries, which are layers in the RM), which is Logical
that surrogates (a) are against that definition; (b) are the pre-Relational paradigm, that is, Physical pointers, the very thing the RM replaces; and (c) are explicitly prohibited.
Very important, you need to understand not only the definition of the Relational Key, but the whys and the wherefores.
Eg. that it transcends import/export problems that pointer-based systems have.
Eg. the temporal definition (seminal; 8 letters; scary).
Therefore, there is no argument, no "debate", to be had.
Anyone going against that is anti-Relational. Not because I say so, but because it contradicts evidenced facts, and the single Authority.
I have named the explicit technical benefits of using the RM correctly (Relational Power; Relational Integrity; Relational Speed), but an expansion of that requires a fair amount of effort.
The consequence of NOT complying with the RM is, you get (a) none of the benefits, AND (b) you get the complete set of problems that pre-Relational Record Filing Systems had in 1970, AND (c) the contrived "solutions" supplied by the "academics" that have never worked.
If you need an expansion of those benefits of the RM, which of course you do need to understand to some degree, because each one is very deep and very important, the best I can provide is this. As you can imagine, this is a battle that I have to fight on every Answer that relates to this subject, so I have posted a fair amount, over the years, across many Answers.
Go to my profile, select All Answers, and read any that relate to this subject.
Why is this Record ID anti-pattern so prevalent ?
The short answer is, people love their ignorance, their subjective "truths", and will fight tooth and nail to protect it. They quickly accept and repeat any justification for remaining the same. Learning something that is a paradigm shift away from what they know, is very scary, because it threatens their comfortable ignorance, and exposes it for what it really is. They will have to admit that what they have been writing for FIVE DECADES is wrong. That is why populism thrives. In ignorance.
The slightly longer answer is this. Just look at the internet. In the old days, for any particular subject, we had one source, one absolute authority: eg. buy the Encyclopædia Britannica; spend your entire childhood devouring it. Permanent truth. Honest history. But now anyone with a keyboard and two fingers plus some connective tissue (no brain required) can post. As an instant "authority". The web is chock-full of (a) superficial answers (the antithesis of "Now THAT is an answer"), (b) in many flavours, (c) that get upvoted due to populism, (d) that are nowhere near the correct or full answer. Sound bites that can be easily understood by the populace. Very few want the depth of the full answer.
Even when an authority of sorts becomes established (eg. Wikipedia; Stack Overflow), it is easily subverted, because there are literally millions of people who change the entries (truth does not change, therefore, as long as something is changing, it is not truth). Mostly to serve their political positions; their ideologies; their re-writes of history to make the past wrong (it wasn't, it already happened), and the present insanity "good".
The definitive answer is this: academic envy. It took a whole decade for Codd's Relational Model to be understood and accepted. And even then, only by the few. IBM, and Britton-Lee (which became Sybase) implemented Codd's RM, in spirit and word. (Digital Equipment Corp did as well, but they are defunct.) Those academics who appeared to be working with Codd turned out to be actually working against him (by virtue of the evidence). They hated the fact that they did not come up with it themselves, that one man came up with the first real model, with a sound; logical; mathematical, foundation, complete with a Relational Algebra. All integrated. All requirements of the day (eg. the Bill Of Materials problem) answered. That has stood the test of time: five decades and nothing has been added or changed.
Typically they will declare, "but Codd did not define this or that, so here I am defining it ...". So they came up with their own RA. Now they have 17, all irrelevant. And abnormal "normal forms" to elevate fragmented bits of their Record Filing Systems to seem "relational". Now they have 42, all irrelevant. And many books, alleging to be "relational", but by evidenced fact, anti-Relational. Each "academic" seeks to reinforce their "academic" position, against all others.
Which is why I say, again, go to the one and only Authority. Read nothing from the anti-Relational crowd, because it will diminish your understanding of the RM (at best), or poison your mind (at worst).
One Clarification
If you examine a Relational PK, eg. Location.Location, it may seem odd. It is a %Code or %ShortName that is data, that the user actually uses; usually 4 to 6 characters, max 12. As distinct from the long Name, which has to exist, and which is an Alternate Key. And of course, it is definitely not a number of any kind (which is not data, not something that the user uses). Users, too, like their short forms. Obviously, use any International Standard if such exists.
The Key must be stable (not static, nothing in the universe is static), and one that is used in the real world to uniquely identify the object (data row).
Eg. for Security, which is a company listed on the stock exchange, in America, it would be TickerSymbol, in Australia ASXCode. The ISO code, an ISINCode, is an AlternateKey.
For cities, use one of the geographic location standards: ISO; FIPS; etc. (I use Statoids because it existed long before the others, but those days are numbered). At worst, use Airport Code.
Genuine SQL Platform
What do you consider to be genuine SQL? Sql Server, Postgres, MySQL, Oracle I guess all would be?
No. I mean any platform that actually complies with the published SQL Standard, and therefore can actually support relational tables; relational processing of Sets; and ACID Transactions.
That automatically excludes freeware/vapourware/nowhere/"open source", for which bits are written by 10,000 developers spread across the universe, with no governing principles. Eg. no ACID Transactions, or the structures that are required for it, which are required in every code segment. Too late to insert that now, because it will require a 100% re-write, and heaven forbid ... a Server Architecture.
Commercial
which means paid-for and supported, is also important. Either you have a maintenance contract and support is immediate, or you post a bug report and you check for updates every day for the next year or three.
Server Architecture
If either scalability or performance (high throughput; high concurrency; low latency) is required, then the Server Architecture is most important. Again, that excludes the freeware, and Oracle, because they have no Architecture; they are massive collections of interacting programs that get the o/s to perform all the functions that an architected Database Server would normally perform.
Check this Comparison of Oracle vs Sybase Architecture.
The exact same applies to PostgreSQL and other freeware. PostgreSQL (son of the total failure Ingres) famously failed under pressure, with masses of locking problems and very low concurrency.
1 High-End, Commercial, SQL Compliant
Something like 5% market share, but 95% of the Financial Services and Automation markets. Great Architecture, hopeless marketing.
Sybase ASE
IBM DB2
2 Commercial, SQL Compliant
MS SQL Server
Easily the most common. Good Architecture (originally stolen from Sybase) and then "progressed" in the usual insane MS style. Pain to use; masses of overhead; poorly integrated with various add-ons and must-uses.
3 Commercial, SQL Non-Compliant
Hopeless Architecture, great marketing.
Oracle
Generally, Oracle developers are quite good at using the product in the ways that are required to get it to work, but that means they have strayed quite far away from the Relational Model.
Eg. in the Time Series benchmark, the whole point was, Oracle cacks itself when a Subquery is requested, so it has to use an "Inline View". Which the OP alleged was just as fast as a Subquery (avoiding the fact that it requires far more code, and the coder must step outside the Relational mindset). Which the benchmark proved to be hilariously false, in each scenario tested: Oracle was 3 to 4.8 times slower than Sybase on a COUNT(), 26 to 36 times slower on a SUM(), and the Subquery (Sybase 2.1 secs) had to be abandoned on Oracle after 120 mins.
Eg. Oracle is non-compliant re ACID Transactions, and developers work around that obstacle to a degree, but Phantom Updates and Lost Updates (technical terms) are simply not prevented. If the work-arounds are not written properly, entire rows (UPDATEs or INSERTs) are lost.
All that applies to the below ...
4 Non-Commercial, SQL Non-Compliant
These guys spend an awful lot of time developing "features" that are not required for a Relational database, but very attractive to the anti-Relational Record ID Filing Systems.
Eg. "deferred constraint checking"; ENUMs; etc.
They lack the basics of SQL compliance. Eg. no genuine ACID Transactions.
Further, as explained above, zero Architecture. This results in systems that perform wonderfully under single-use, and fail miserably under any order of pressure from concurrency or scalability.
Due to their non-compliance with the SQL requirement, they take pains to post a notice of compliance on every page in the Commands manual. (Just one declaration of compliance at the front of the manual is all that is required.) Of course, the missing commands are simply missing, so gee whiz, they do not have a compliance declaration.
PostgreSQL
The worst piece of software I have ever had to examine since the days of Ingres. Dearly loved by the "academic" crowd, simply because it was scrawled by a fellow "academic".
Five users max, or deal with the concurrency problems (just take a cursory look at the problems reported on SO).
MySQL
Head and shoulders above PostgreSQL, but still in this category.
The InnoDB engine is distinctly better in the performance department, but nowhere near the Sybase/DB2 level (still no genuine Server Architecture). No respite in the SQL non-compliance department.
5 Summary
You get what you pay for.
Server Architecture, most visibly, performance in every scenario.
SQL Compliance, thought through deeply, and implemented in every applicable code segment.
Last but not least, Support.
Whatever you choose, remember, when you port it to another platform, your SQL code will require a complete check-and-change, because the "flavours" of SQL (or NON-sql) are very different. For the Non-Commercial program suites, that means a complete rewrite. Therefore choose carefully, with the long term implementation in mind.

It depends largely on the types of query you'll need to run. I think performance may not be your biggest concern if, as you say,
A summary of all metrics for a given time bucket needs to be obtained via a query, and is likely to be the most common querying scenario.
As queries in all scenarios would hit an indexable timestamp column, it really is just a question of the performance of joins, and pretty much every relational database is really good at that.
If your queries really are just "show data for a time range", your option 3 (an entity/attribute/value design) is most effective from a development-effort point of view.
Your query would have a single, inner join, and the timestamp column would provide a good index. As you say, you wouldn't need to change schema or queries when collecting new measurement points.
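For example, a time-bucket summary against the EAV-style design needs only that single inner join, grouped on the indexed timestamp. A sketch with illustrative names (date/hour bucketing syntax varies by platform):

SELECT   mt.name,
         CAST(r.reading_at AS DATE)      AS reading_day,
         EXTRACT(HOUR FROM r.reading_at) AS reading_hour,
         COUNT(*)                        AS samples,
         AVG(r.value)                    AS avg_value,
         MIN(r.value)                    AS min_value,
         MAX(r.value)                    AS max_value
FROM     reading r
JOIN     metric_type mt ON mt.metric_type_id = r.metric_type_id
WHERE    r.reading_at >= TIMESTAMP '2024-01-01 00:00:00'
AND      r.reading_at <  TIMESTAMP '2024-01-02 00:00:00'
GROUP BY mt.name, CAST(r.reading_at AS DATE), EXTRACT(HOUR FROM r.reading_at);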
The alternative designs would require outer joins for each table. In performance terms, that's not a huge deal, but managing the schema and associated queries would be a pain.
However, if you also have to answer questions like "on what day was CPU above 30% while humidity was below 56% for more than 3 hours", your EAV model becomes really hard to work with. Those queries would rapidly become really hard to write and understand - every criterion becomes at least one self-join.

TimescaleDB's documentation discusses wide versus narrow data models:
https://docs.timescale.com/timescaledb/latest/overview/data-model-flexibility/
In summary:
"A narrow model makes sense if you collect each metric independently. It allows you to add new metrics as you go by adding a new tag without requiring a formal schema change."
"If you typically query multiple metrics together, it is both faster and easier to store them in a wide table format"

Indeed, option 3 is a sort of EAV modelling on relational storage, with the timestamp included in the EAV key.
+---------+            +-----+            +-------------+
| Sensors | -- 1:M --< | EAV | >-- M:1 -- | Value kinds |
+---------+            +-----+            +-------------+
A summary of all metrics for a given time bucket needs to be obtained via a query, and is likely to be the most common querying scenario
If queries don't require joins but need to be grouped by time, a clustered index on the timestamp column ensures good performance.
However, any query with joins (i.e. comparing values of different sensors) risks degrading performance. The solution can be a separate OLAP store for the collected EAV data.
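A minimal sketch of that physical choice (SQL Server-style syntax; names illustrative; assumes the table's primary key is declared nonclustered):

-- Cluster the readings on time so that time-bucketed scans read sequential pages;
-- the type column is second so that one bucket's rows for a metric stay together.
CREATE CLUSTERED INDEX cix_reading_time
    ON reading (reading_at, metric_type_id);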

From a developer's point of view, I would recommend the third option. For the third option, you might consider having indexes on the MetricType (i.e. typeId) and the timestamp columns, which will greatly improve query performance.
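One way to realise that suggestion (SQL Server-style syntax; column names assumed, not from the question) is a single composite index that also covers the value:

CREATE INDEX ix_reading_type_time
    ON reading (metric_type_id, reading_at)
    INCLUDE (value);   -- covering, so the summary query need not touch the base rows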
Your first table, in contrast, may require system downtime: when a new column needs to be added, you have to shut down your live system first to add the column, initialize it with some default or null values, and then bring the system back up again. In my opinion, it will also contain unnecessary (garbage) data in all the rows from before the point the column was added. The table will be huge and may contain a significant amount of garbage data, which affects query time.
The second idea is an improvement over the first; however, although it avoids the garbage data, it requires joining multiple tables, which will increase query times as the data grows. You also cannot serve all metrics from one pair of indexes, as you can with the third option.
Hence I think going for the third option is the most effective. The tables are normalized and effective indexes will provide efficient query results.
I would like to suggest one more thing. You might also consider having a separate table containing aggregated data. For example, if your system requires aggregated data, you might keep it in a denormalized style in a separate table, where the aggregated values for a certain timeline are stored, so that you can remove already-processed data from your original table. I am referring to an OLAP-style database, which you might consider looking into.
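A sketch of such a roll-up table and the statement that populates it (names, the hourly granularity and the bucketing syntax are all illustrative):

CREATE TABLE reading_hourly (
    metric_type_id INT          NOT NULL,
    reading_day    DATE         NOT NULL,
    reading_hour   SMALLINT     NOT NULL,
    sample_count   INT          NOT NULL,
    avg_value      DECIMAL(9,3) NOT NULL,
    min_value      DECIMAL(9,3) NOT NULL,
    max_value      DECIMAL(9,3) NOT NULL,
    PRIMARY KEY (metric_type_id, reading_day, reading_hour)
);

-- Periodically fold the detail rows into hourly buckets
INSERT INTO reading_hourly
SELECT   r.metric_type_id,
         CAST(r.reading_at AS DATE),
         EXTRACT(HOUR FROM r.reading_at),
         COUNT(*), AVG(r.value), MIN(r.value), MAX(r.value)
FROM     reading r
GROUP BY r.metric_type_id, CAST(r.reading_at AS DATE), EXTRACT(HOUR FROM r.reading_at);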

I wouldn't recommend an ERD design where you need to ALTER the schema whenever you add a sensor (and you know you will). That's why I believe you should eliminate option 1. Whenever you alter your table you will get plenty of null values and unnecessary work in your code.
The same applies to option 2, except perhaps for the nulls, but you will still get unnecessary work whenever you add a new data source to your system.
Option 3 looks like a good fit to me, as it is ready for expanding data sources and keeps the data clean and neat.

Related

Accounting Payables/ Receivables database schema

I searched for this topic and there are lots of explanations out there, mostly recommending separate tables for A/P and A/R.
I'm thinking of another approach: combine the two tables into one, because A/P and A/R are 98% similar and differ by only one or two attributes (e.g. A/P: vendor/supplier).
Question:
Is it possible to merge the two tables and just categorize each transaction with a Transaction_Type column holding values of Acnt_Payable or acnt_Receivable?
I know this is possible to do, but is it good practice? And what issues will my system face when dealing with Reports?
Your question is: "another table or another column". Given that I've done lots of MySql and accounting database manipulation, I wouldn't hesitate to have type as an extra column.
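A minimal sketch of that single-table shape (column names other than Transaction_Type are invented; the CHECK constraint is one way to keep the type values honest, on platforms that enforce it):

CREATE TABLE Transactions (
    Transaction_Id   INT           NOT NULL PRIMARY KEY,
    Transaction_Type VARCHAR(20)   NOT NULL
        CHECK (Transaction_Type IN ('Acnt_Payable', 'acnt_Receivable')),
    Party_Id         INT           NOT NULL,    -- vendor/supplier for A/P, customer for A/R
    Transaction_Date DATE          NOT NULL,
    Amount           DECIMAL(12,2) NOT NULL
);

-- Reports filter on the type column instead of joining two near-identical tables
SELECT   Transaction_Type, SUM(Amount) AS Total
FROM     Transactions
WHERE    Transaction_Date >= '2024-01-01'
GROUP BY Transaction_Type;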
The deal with reports is that you'll need to rework the code that pulls the reports. Having the extra column means fewer joins (if you have them now). You will need to step up your error checking, since changing already-designed software always seems to catch me off guard with some nuance of the design I hadn't considered, which suddenly becomes a paradigm shift for the logic.
I'd say rewrite the report code from a blank canvas; between your increased experience from "last time" and a fresh look at it this time, you might do a much better job of it.
IMHO, for performance aspects, most businesses and most average coders are OK with an off-the-shelf install of MySql or PostgreSQL. When I observe a query taking more than a second, it's usually because I need to add an index. Given that, you could have yet another table with IDs representing AR vs AP (or any other human terms), as I think numbers will perform faster.
Once a company gets to where performance really is an issue, they'll have the resources to hire a real MySql guy to come in and mess with it. Sarcasm: Unless, of course, you're using quickbooks, in which case, once you're past a 20 transaction-per-day mom and pop joint, you're overtaxing the software.

Reaching an appropriate balance between performance and scalability in a large database

I'm trying to determine which of the many database models would best support probabilistic record comparison. Specifically, I have approximately 20 million documents defined by a variety of attributes (name, type, author, owner, etc.). Text attributes dominate the data set, yet there are still plenty of images. Read operations are the most crucial vis-a-vis performance, but I expect roughly 20,000 new documents to insert each week. Luckily, insert speed does not matter at all, and I am comfortable queuing the incoming documents for controlled processing.
Database queries will most typically take the following forms:
Find documents containing at least five sentences that reference someone who's a member of the military
Predict whether User A will comment on a specific document written by User B, given User A's entire comment history
Predict an author for Document X by comparing vocabulary, word ordering, sentence structure, and concept flow
My first thought was to use a simple document store like MongoDB, since each document does not necessarily contain the same data. However, complex queries effectively degrade this to a file-system wrapper, since I cannot construct a query yielding the results I desire. As such, this approach corners me into walking the entire database and processing each file separately. Although document stores scale well horizontally, the benefits are not realized here.
This led me to realize that my granularity isn't at the document level, but rather at the entity-relationship level. As such, graph databases seemed like a logical choice, since they facilitate relating each word in a sentence to the next word, next paragraph, current paragraph, part of speech, etc. Graph databases limit data replication, increase the speed of statistical clustering, and scale horizontally, among other things. Unfortunately, ensuring a definitive answer to a query still necessitates traversing the entire graph. Even still, indexing will help with performance.
I've also evaluated the use of relational databases, which are very efficient when designed properly (i.e., by avoiding unnecessary normalization). A relational database excels at finding all documents authored by User A, but fails at structural comparisons (which involves expensive joins). Relational databases also enforce constraints (primary keys, foreign keys, uniqueness, etc.) efficiently--a task with which some NoSQL solutions struggle.
After considering the above-listed requirements, are there any database models that combine the "exactness" of relational models (viz., efficient exhaustion of the domain) with the flexibility of graph databases?
This is not really an answer, just a discussion.
The database you are talking about is a large database. You don't mention the nature of the documents, but newspaper articles are typically in the 2-3k range, so you are talking about hundreds of gigabytes of raw data.
If query performance is an issue, you are talking about a large, rather expensive system.
Your requirements are also quite complex, and not likely to be out-of-the-box. I would be thinking of a hybrid system. Store the document metadata in a relational database system, so you can quickly access them with simple queries. You can store the documents themselves in the database as blobs.
Some of your requirements can be met with text add-ins on relational databases. So simple searching is feasible using inverted-index technology. That handles the first of your three scenarios.
The other two are much more challenging. The third ("predict an author") can probably be handled by having a parallel system that stores author information, summarized from the documents when they are loaded. Then it is a question of comparing a document to the author, using simple statistical analysis (naive Bayesian, anyone?).
The middle one is tricky, but it suggests yet another component for managing comments on documents. Depending on the volume, this may be easy or hard.
Finally, how changeable are the requirements? Do you really know what the system should be doing? Or will the functionality be radically different once you get it up and running?

Choosing SQL Server data types for maximum speed

I'm designing a database that will need to be optimized for maximum speed.
All the database data is generated once from something I call an input database (which holds the data I'm editing, mainly some polylines, markers, etc for google maps).
So the database is not subject to editing, but it needs to hold as much data as it can for quickly displaying results to the user (routes across town, custom polylines, etc).
The question is: will choosing smaller data types, for example smallint over int, improve performance or hurt it? Space is not really a problem; after some quick calculations, the database will not exceed 200 MB, and there will not be tables with more than 100,000 rows (the average will be around 5,000).
I'm asking this because I read some articles around the internet and some say that smaller data types improve performance others say that it affects it because additional processing must be done. I'm aware that for smaller databases probably results are not noticeable, but I'm interested in every bit because I'm expecting many requests which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet:
I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
The PK is the line number of datatype: smallint
The PK is the line number of datatype: int
The PK is something different (identity for example) and the line number is stored in a different field.
Just for the sake of argument, because I used this on the 'input database' which is not subject to optimization, the PK is a GUID (16 bytes); if you like, you can make a comparison of how bad this is compared to the others, if it really is bad.
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50,000 rows (the rest averaging 5,000 as I said above), which are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but it's still worth asking. Thanks again.
Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case, if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT, which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,000 values then I would probably just use an INT. Unless you also don't know that you won't ever exceed 2 billion values, in which case you would use BIGINT.
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.
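To make the trade-off concrete, a small sketch (T-SQL; invented names): with roughly 200 lines, a SMALLINT key halves the key storage in every referencing table compared with INT, at the price of a 32,767 ceiling.

CREATE TABLE dbo.TransitLine (
    LineNumber  SMALLINT    NOT NULL PRIMARY KEY,   -- the real-world unique line number
    LineName    VARCHAR(50) NOT NULL
);

-- Every referencing table repeats the key, so its size matters ~15 times over
CREATE TABLE dbo.LineStop (
    LineNumber  SMALLINT     NOT NULL REFERENCES dbo.TransitLine (LineNumber),
    StopNumber  SMALLINT     NOT NULL,
    Latitude    DECIMAL(9,6) NOT NULL,
    Longitude   DECIMAL(9,6) NOT NULL,
    PRIMARY KEY (LineNumber, StopNumber)
);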
I think you will need to provide some more details about the table structures and sample queries that will be run against them. Based on the information you have provided, I believe the impact of choosing smaller data types will be just a couple of percent, and I would suggest giving more attention to the indexes that you will have. SQL Server does a good job of suggesting what indexes to create by providing you with execution plans for your queries and the tuning advisor tool.
One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.

Should OLAP databases be denormalized for read performance? [closed]

I always thought that databases should be denormalized for read performance, as is done for OLAP database design, and not normalized much beyond 3NF for OLTP design.
PerformanceDBA, in various posts, for example in Performance of different approaches to time-based data, defends the paradigm that a database should always be well-designed by normalization to 5NF and 6NF (Normal Form).
Have I understood it correctly (and what have I understood correctly)?
What's wrong with the traditional denormalization approach/paradigm for the design of OLAP databases (below 3NF), and with the advice that 3NF is enough for most practical cases of OLTP databases?
For example:
"The simple truth ... is that 6NF, executed properly, is the data warehouse" (PerformanceDBA)
I should confess that I could never grasp the theories that denormalization facilitates read performance. Can anybody give me references with good logical explanations of this and of the contrary beliefs?
What are sources to which I can refer when trying to convince my stakeholders that OLAP/Data Warehousing databases should be normalized?
To improve visibility I copied here from comments:
"It would be nice if participants would add (disclose) how many real-life (no science projects included) data-warehouse implementations in 6NF they have seen or participated in. Kind of a quick poll. Me = 0." – Damir Sudarevic
Wikipedia's Data Warehouse article says:
"The normalized approach [vs. the dimensional one by Ralph Kimball], also called the 3NF model (Third Normal Form), whose supporters are referred to as “Inmonites”, believe in Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model."
It looks like the normalized data warehousing approach (by Bill Inmon) is perceived as not exceeding 3NF (?)
I just want to understand what is the origin of the myth (or ubiquitous axiomatic belief) that data warehousing/OLAP is synonym of denormalization?
Damir Sudarevic answered that they are a well-paved approach. Let me return to the question: why is denormalization believed to facilitate reading?
Mythology
I always thought that databases should be denormalized for reading, as is done for OLAP database design, and not normalized much beyond 3NF for OLTP design.
There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.
Normalisation vs Un-normalised
("De-normalisation" is a fraudulent term I refuse to use it.)
This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:
it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.
Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) fewer joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs fewer indices. Normalised tables also yield much shorter row sizes.
for any given set of resources, Normalised tables:
fit more rows into the same page size
therefore fit more rows into the same cache space, therefore overall throughput is increased
therefore fit more rows into the same disk space, therefore the no of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; the locking reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.
The overall result was therefore much, much higher performance.
In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.
No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.
Many books have been written by people, selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.
(For anyone who wishes to dispute the above physical science, merely repeating the mantra will not have any effect; please supply specific evidence supporting the mantra.)
Why is the Myth Prevalent ?
Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.
From my experience, I have identified three major reasons for the prevalence:
For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book and without any evidence for the magic, they can reverently say "see a famous writer validates what I have done". Not Done, most accurately.
Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process.
People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.
That is why I always include "advanced" SQL and mentoring on any project.
My Answer
This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.
Let me present the science to you in manageable segments.
The typical model of the six large scale full implementation assignments.
These were the closed "databases" commonly found in small firms, and the organisations were large banks
very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
they were designed for each app, separately
reporting was not possible, they could only report via each app
since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
In order to "de-normalise" one must Normalise first; then reverse the process a little
in every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; it was simply un-normalised
since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
they had an additional problem of data duplication across the "databases"
the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
needless to say, the synching was never quite enough and they were forever changing it
with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.
So we contemplated the laws of physics, and we applied a little science.
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.
Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.
More Normalisation
Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out, by today's definitions of NFs, I have always delivered 5NF.
Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.
Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:
the faster they perform
and they can be used in more ways (eg Pivots)
So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.
One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).
Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in.
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation then spent half their life managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.
If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse
Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).
One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.
But with the 6NF tables:
there are no rows, only columns; we serve up rows and columns at the same blinding speed
the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.
Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast
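For readers who want to see the mechanics, a pivot over a narrow table of this kind can be written with ordinary conditional aggregation. This is a generic sketch with invented names, not the author's structures or code:

CREATE TABLE Statistic (
    ServerNo  SMALLINT      NOT NULL,
    Dtm       TIMESTAMP     NOT NULL,
    StatName  VARCHAR(30)   NOT NULL,
    Value     DECIMAL(16,4) NOT NULL,
    PRIMARY KEY (ServerNo, Dtm, StatName)     -- one fact per row
);

-- Pivot selected statistics into columns, one row per server per sample time
SELECT   ServerNo,
         Dtm,
         MAX(CASE WHEN StatName = 'CPU_Busy'  THEN Value END) AS CPU_Busy,
         MAX(CASE WHEN StatName = 'IO_Wait'   THEN Value END) AS IO_Wait,
         MAX(CASE WHEN StatName = 'LockWaits' THEN Value END) AS LockWaits
FROM     Statistic
GROUP BY ServerNo, Dtm;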
What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.
Notice that it is the very same science that we have used from the outset; to move from Typical un-normalised "databases" to 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.
Notice the similarity between 5NF Corporate Database and 6NF Corporate Database
The entire cost of separate OLAP hardware, platform software, ETL, administration, maintenance, are all eliminated.
There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns
The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.
What is Sixth Normal Form ?
Assuming you have a handle on Normalisation (I am not going to define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database:
Fifth Normal Form: all Functional Dependencies resolved across the database
in addition to 4NF/BCNF
every non-PK column is 1::1 with its PK
and to no other PK
No Update Anomalies
Sixth Normal Form: is the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
in addition to 5NF
the row consists of a Primary Key, and at most, one non-key column
eliminates The Null Problem
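To make the definition concrete, a minimal sketch with invented names (not the author's model): a 5NF table with several non-key columns, versus its 6NF decomposition in which each table carries the key and at most one attribute, so a row exists only when the fact exists:

-- 5NF: one table, several non-key columns, each 1::1 with the PK
CREATE TABLE Product_5NF (
    ProductCode  CHAR(8)      NOT NULL PRIMARY KEY,
    Name         VARCHAR(40)  NOT NULL,
    Weight       DECIMAL(9,3) NULL,     -- Null when not applicable
    Colour       CHAR(6)      NULL      -- Null when not applicable
);

-- 6NF: one table per attribute; no Nulls, add attributes by adding tables
CREATE TABLE Product (
    ProductCode  CHAR(8)      NOT NULL PRIMARY KEY
);
CREATE TABLE ProductName (
    ProductCode  CHAR(8)      NOT NULL PRIMARY KEY REFERENCES Product (ProductCode),
    Name         VARCHAR(40)  NOT NULL
);
CREATE TABLE ProductWeight (
    ProductCode  CHAR(8)      NOT NULL PRIMARY KEY REFERENCES Product (ProductCode),
    Weight       DECIMAL(9,3) NOT NULL
);
CREATE TABLE ProductColour (
    ProductCode  CHAR(8)      NOT NULL PRIMARY KEY REFERENCES Product (ProductCode),
    Colour       CHAR(6)      NOT NULL
);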
What Does 6NF Look Like ?
The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.
This one is for the collection of server monitoring data (enterprise class database server and OS) for any no of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures), it is typical of the specialised 5NF that many years later was identified as 6NF. Allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.
The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.
▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)
It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.
6NF Data Warehouse
This has been recently validated by Anchor Modeling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from the single version of the data, that is ours alone).
Data Warehouse (Only) Experience
My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases) has been several major assignments, as opposed to full implementation projects. The results were, no surprise:
consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.
consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.
Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use science and hard work to resolve their problems.
Valid Data Warehouse Justification
That is why I have stated in other posts that the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc., is where there are many Databases or "databases", all being merged into a central warehouse for reporting and OLAP.
Kimball
A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.
Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.
So it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taken into account; it is true only inside that little asylum, not outside.
Likewise it is true, in that crazy way, that where all the "tables" are monsters, "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.
they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.
his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.
that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.
All just stories, parts of the one mythology that hang out together and support each other.
Your Mission
Should you choose to accept it: I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics, no matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific; verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster, it will just make you feel better about it. Like the fat kid sitting on the sidelines, telling himself that he can run faster than all the kids in the race.
On that basis, even the concept of "normalise for OLTP" but do the opposite, "de-normalise for OLAP", is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible; they work the same way on every computer.
Questions ?
Denormalization and aggregation are the two main strategies used to achieve performance in a data warehouse. It's just silly to suggest that they don't improve read performance! Surely I must have misunderstood something here?
Aggregation:
Consider a table holding 1 billion purchases.
Contrast it with a table holding one row with the sum of the purchases.
Now, which is faster? Select sum(amount) from the one-billion-row table or a select amount from the one-row-table? It's a stupid example of course, but it illustrates the principle of aggregation quite clearly. Why is it faster? Because regardless of what magical model/hardware/software/religion we use, reading 100 bytes is faster than reading 100 gigabytes. Simple as that.
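To make the principle easy to play with, here is a hedged sketch (all table and column names are made up for illustration) of how a pre-aggregated summary table turns a billion-row scan into a single-row read:
-- Detail table: ~1 billion rows.
CREATE TABLE purchase (
    purchase_id   BIGINT        NOT NULL PRIMARY KEY,
    purchase_date DATE          NOT NULL,
    amount        DECIMAL(12,2) NOT NULL
);

-- Summary table: one row per day, refreshed by a batch (rollup) job.
CREATE TABLE purchase_daily_summary (
    purchase_date DATE          NOT NULL PRIMARY KEY,
    total_amount  DECIMAL(16,2) NOT NULL
);

-- The expensive way: aggregate the detail rows on every request.
SELECT SUM(amount) FROM purchase WHERE purchase_date = '2011-02-01';

-- The cheap way: read one pre-cooked row.
SELECT total_amount FROM purchase_daily_summary WHERE purchase_date = '2011-02-01';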
Denormalization:
A typical product dimension in a retail data warehouse has shitloads of columns. Some columns are easy stuff like "Name" or "Color", but it also has some complicated stuff, like hierarchies. Multiple hierarchies: the product range (5 levels), the intended buyer (3 levels), raw materials (8 levels), way of production (8 levels), along with several computed numbers such as average lead time (since the start of the year), weight/packaging measures, etcetera. I've maintained a product dimension table with 200+ columns that was constructed from ~70 tables from 5 different source systems. It is just plain silly to debate whether a query on the normalized model (below)
select product_id
from table1
join table2 on(keys)
join (select avg(..)
from one_billion_row_table
where lastyear = ...) on(keys)
join ...table70
where function_with_fuzzy_matching(table1.cola, table37.colb) > 0.7
and exists(select ... from )
and not exists(select ...)
and table20.version_id = (select max(v_id) from product_ver where ...)
and average_price between 10 and 20
and product_range = 'High-Profile'
...is faster than the equivalent query on the denormalized model:
select product_id
from product_denormalized
where average_price between 10 and 20
and product_range = 'High-Profile';
Why? Partly for the same reason as the aggregated scenario. But also because the queries are just "complicated". They are so disgustingly complicated that the optimizer (and now I'm going into Oracle specifics) gets confused and screws up the execution plans. Suboptimal execution plans may not be such a big deal if the query deals with small amounts of data. But as soon as we start to join in the Big Tables it is crucial that the database gets the execution plan right. Having denormalized the data in one table with a single synthetic key (heck, why don't I add more fuel to this ongoing fire), the filters become simple range/equality filters on pre-cooked columns. Having duplicated the data into new columns enables us to gather statistics on the columns, which will help the optimizer in estimating the selectivities and thus providing us with a proper execution plan (well, ...).
Obviously, using denormalization and aggregation makes it harder to accommodate schema changes, which is a bad thing. On the other hand they provide read performance, which is a good thing.
So, should you denormalize your database in order to achieve read-performance?
Hell no! It adds so many complexities to your system that there is no end to how many ways it will screw you over before you have delivered. Is it worth it? Yes, sometimes you need to do it to meet a specific performance requirement.
Update 1
PerformanceDBA: 1 row would get updated a billion times a day
That would imply a (near) realtime requirement (which in turn would generate a completely different set of technical requirements). Many (if not most) data warehouses do not have that requirement. I picked an unrealistic aggregation example just to make it clear why aggregation works. I didn't want to have to explain rollup strategies too :)
Also, one has to contrast the needs of the typical user of a data warehouse and the typical user of the underlying OLTP system. A user looking to understand what factors drive transport costs couldn't care less if 50% of today's data is missing or if 10 trucks exploded and killed the drivers. Performing the analysis over 2 years' worth of data would still come to the same conclusion even if he had to-the-second up-to-date information at his disposal.
Contrast this to the needs of the drivers of those trucks (the ones who survived). They can't wait 5 hours at some transit point just because some stupid aggregation process has to finish. Having two separate copies of the data solves both needs.
Another major hurdle with sharing the same set of data for operational systems and reporting systems is that the release cycles, QA, deployment, SLAs and what have you are very different. Again, having two separate copies makes this easier to handle.
By "OLAP" I understand you to mean a subject-oriented relational / SQL database used for decision support - AKA a Data Warehouse.
Normal Form (typically 5th / 6th Normal Form) is generally the best model for a Data Warehouse. The reasons for normalizing a Data Warehouse are exactly the same as any other database: it reduces redundancy and avoids potential update anomalies; it avoids built-in bias and is therefore the easiest way to support schema change and new requirements. Using Normal Form in a data warehouse also helps keep the data load process simple and consistent.
There is no "traditional" denormalization approach. Good data warehouses have always been normalized.
Should not a database be denormalized for reading performance?
Okay, here goes a total "Your Mileage May Vary", "It Depends", "Use The Proper Tool For Every Job", "One Size Does Not Fit All" answer, with a bit of "Don't Fix It If It Ain't Broken" thrown in:
Denormalization is one way to improve query performance in certain situations. In other situations it may actually reduce performance (because of the increased disk use). It certainly makes updates more difficult.
It should only be considered when you hit a performance problem (because you are giving up the benefits of normalization and introducing complexity).
The drawbacks of denormalization are less of an issue with data that is never updated, or only updated in batch jobs, i.e. not OLTP data.
If denormalization solves a performance problem that you need solved, and less invasive techniques (like indexes, caches or buying a bigger server) do not solve it, then yes, you should do it.
First my opinions, then some analysis
Opinions
Denormalisation is perceived to help reading data because common use of the word denormalisation often includes not only breaking normal forms, but also introducing any insertion, update and deletion dependencies into the system.
This, strictly speaking, is false (see this question/answer). Denormalisation in the strict sense means breaking any of the normal forms from 1NF-6NF; other insertion, update and deletion dependencies are addressed by the Principle of Orthogonal Design.
So what happens is that people take the Space vs Time tradeoff principle, remember the term redundancy (associated with denormalisation, though still not equal to it), and conclude that you should get benefits. This is a faulty implication, but a false implication does not allow you to conclude the reverse.
Breaking normal forms may indeed speed up some data retrieval (details in analysis below), but as a rule it will also at the same time:
favour only specific types of queries and slow down all other access paths
increase complexity of the system (which influences not only maintenance of the database itself, but also increases the complexity of applications that consume the data)
obfuscate and weaken semantic clarity of the database
The main point of a database system, as the central representation of the problem space, is to be unbiased in recording the facts, so that when requirements change you don't have to redesign the parts of the system (data and applications) that are independent in reality. To be able to do this, artificial dependencies should be minimised - today's 'critical' requirement to speed up one query quite often becomes only marginally important.
Analysis
So, I made a claim that sometimes breaking normal forms can help retrieval. Time to give some arguments.
1) Breaking 1NF
Assume you have financial records in 6NF. From such a database you can surely get a report on the balance of each account for each month.
Assuming that a query calculating such a report would need to go through n records, you could make a table
account_balances(month, report)
which would hold XML structured balances for each account. This breaks 1NF (see notes later), but allows one specific query to execute with minimum I/O.
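A minimal sketch of the table just described (the column types are an assumption; the idea is one pre-computed report per month):
-- One pre-computed report per month; the XML blob is what breaks 1NF in the usual reading.
CREATE TABLE account_balances (
    month  DATE NOT NULL PRIMARY KEY,   -- e.g. the first day of the month
    report TEXT NOT NULL                -- XML document with the balance of every account
);

-- The targeted query becomes a single-row read:
SELECT report FROM account_balances WHERE month = '2011-01-01';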
At the same time, assuming it is possible to update any month with inserts, updates or deletes of financial records, the performance of the update queries on the system might be slowed down by time proportional to some function of n for each update.
(The above case illustrates a principle; in reality you would have better options, and the benefit of minimum I/O brings such penalties that, for a realistic system that actually updates data often, you would get bad performance even for your targeted query, depending on the actual workload. I can explain this in more detail if you want.)
Note:
This is actually a trivial example, and there is one problem with it - the definition of 1NF. The assumption that the above model breaks 1NF rests on the requirement that the values of an attribute 'contain exactly one value from the applicable domain'.
This allows you to say that the domain of the attribute report is the set of all possible reports, that each row holds exactly one value from that domain, and hence that 1NF is not broken (similar to the argument that storing words does not break 1NF even though you might have a letters relation somewhere in your model).
On the other hand there are much better ways to model this table, which would be more useful for a wider range of queries (such as retrieving the balances of a single account for all months in a year). In that case you would justify the improvement by saying that this field is not in 1NF.
Anyway it explains why people claim that breaking NFs might improve performance.
2) Breaking 3NF
Assuming tables in 3NF
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`member_id` int(10) unsigned NOT NULL,
`status` tinyint(3) unsigned NOT NULL,
`amount` decimal(10,2) NOT NULL,
`opening` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `member_id` (`member_id`),
CONSTRAINT `t_ibfk_1` FOREIGN KEY (`member_id`) REFERENCES `m` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB
CREATE TABLE `m` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
with sample data (1M rows in t, 100k in m)
Assume a common query that you want to improve
mysql> select sql_no_cache m.name, count(*)
from t join m on t.member_id = m.id
where t.id between 100000 and 500000 group by m.name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (1.08 sec)
you could find suggestions to copy the attribute name into table t, which breaks 3NF (it introduces the FD member_id -> name, and member_id is not a key of t)
after
alter table t add column name varchar(255);
update t inner join m on t.member_id = m.id set t.name = m.name;
running
mysql> select sql_no_cache name, count(*)
from t where id
between 100000 and 500000
group by name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (0.41 sec)
notes:
The above query execution time is cut in half, but
the table was not in 5NF/6NF to begin with
the test was done with sql_no_cache, so most cache mechanisms were avoided (and in real situations they play a role in the system's performance)
space consumption is increased by approx 9x size of the column name x 100k rows
there should be triggers on t to keep the integrity of the data, which would significantly slow down all updates to name and add additional checks that inserts into t would need to go through (see the sketch after these notes)
probably better results could be achieved by dropping surrogate keys and switching to natural keys, and/or indexing, or redesigning to higher NFs
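As a rough illustration of the integrity mechanisms mentioned above (trigger names and details are my own sketch, not a tested production solution), the duplicated name column would need something along these lines to stay consistent:
-- Keep the copied name in sync; both triggers add work to every write.
DELIMITER //
CREATE TRIGGER t_bi BEFORE INSERT ON t
FOR EACH ROW
BEGIN
    -- copy the current name from m when a row is inserted into t
    SET NEW.name = (SELECT name FROM m WHERE id = NEW.member_id);
END//

CREATE TRIGGER m_au AFTER UPDATE ON m
FOR EACH ROW
BEGIN
    -- propagate renames to every duplicated copy in t
    IF NOT (NEW.name <=> OLD.name) THEN
        UPDATE t SET name = NEW.name WHERE member_id = NEW.id;
    END IF;
END//
DELIMITER ;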
Normalising is the proper way in the long run. But you don't always have the option of redesigning a company's ERP (which is, for example, already only mostly 3NF) - sometimes you must achieve a certain task within given resources. Of course, doing this is only a short-term 'solution'.
Bottom line
I think that the most pertinent answer to your question is that you will find the industry and education using the term 'denormalisation' in two ways:
in a strict sense, for breaking normal forms
loosely, for introducing any insertion, update and deletion dependencies (Codd's original comment on normalisation mentions 'undesirable(!) insertion, update and deletion dependencies'; see some details here)
So, under the strict definition, aggregations (summary tables) are not considered denormalisation, and they can help a lot in terms of performance (as will any cache, which is not perceived as denormalisation).
The loose usage encompasses both breaking normal forms and the principle of orthogonal design, as said before.
Another thing that might shed some light is that there is a very important difference between the logical model and the physical model.
For example, indexes store redundant data, but no one considers them denormalisation, not even people who use the term loosely, and there are two (connected) reasons for this:
they are not part of the logical model
they are transparent and guaranteed not to break integrity of your model
If you fail to model your logical model properly you will end up with an inconsistent database - wrong types of relationships between your entities (an inability to represent the problem space), conflicting facts (the ability to lose information). You should employ whatever methods you can to get a correct logical model; it is the foundation for all applications that will be built on top of it.
Normalisation, orthogonality and clear semantics of your predicates, well-defined attributes, and correctly identified functional dependencies all play a part in avoiding pitfalls.
When it comes to the physical implementation, things get more relaxed: a materialised computed column that depends on a non-key column might break 3NF, but if there are mechanisms that guarantee consistency, it is allowed in the physical model in the same way that indexes are allowed. You still have to justify it very carefully, because usually normalising will yield the same or better improvements across the board, will have no (or less) negative impact, and will keep the design clear (which reduces application development and maintenance costs), resulting in savings that you can easily spend on upgrading hardware to improve speed even more than what is achieved by breaking NFs.
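For instance (a sketch only; stored generated columns are a MySQL 5.7+ feature, and the 20% tax rate is an assumption for illustration), the engine itself can maintain a computed column that depends on a non-key attribute, much as it maintains an index:
-- The stored value depends on the non-key column amount (a 3NF violation at the logical level),
-- but the engine guarantees consistency, so it is redundant yet safe, like an index.
ALTER TABLE t
    ADD COLUMN amount_with_tax DECIMAL(12,2)
    GENERATED ALWAYS AS (amount * 1.20) STORED;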
The two most popular methodologies for building a data warehouse (DW) seem to be Bill Inmon's and Ralph Kimball's.
Inmon's methodology uses a normalized approach, while Kimball's uses dimensional modelling -- a de-normalized star schema.
Both are well documented down to small details and both have many successful implementations. Both present a "wide, well-paved road" to a DW destination.
I can not comment on the 6NF approach nor on Anchor Modelling because I have never seen nor participated in a DW project using that methodology. When it comes to implementations, I like to travel down well tested paths -- but, that's just me.
So, to summarize, should DW be normalized or de-normalized? Depends on the methodology you pick -- simply pick one and stick to it, at least till the end of the project.
EDIT - An Example
At the place I currently work, we had a legacy report which has been running forever on the production server. Not a plain report, but a collection of 30 sub-reports emailed to everybody and his aunt every day.
Recently, we implemented a DW. With two report servers and a bunch of reports in place, I was hoping that we could forget about the legacy thing. But no - legacy is legacy, we always had it, so we want it, need it, can't live without it, etc.
The thing is that this mess of a Python script and SQL took eight hours (yes, e-i-g-h-t hours) to run every single day. Needless to say, the database and the application were built over the years by a few batches of developers -- so, not exactly your 5NF.
It was time to re-create the legacy thing from the DW. OK, to keep it short, it's done, and it takes 3 minutes (t-h-r-e-e minutes) to produce it, six seconds per sub-report. And I was in a hurry to deliver, so I was not even optimizing all the queries. This is a factor of 8 * 60 / 3 = 160 times faster -- not to mention the benefit of removing an eight-hour job from the production server. I think I can still shave off a minute or so, but right now no one cares.
As a point of interest, I have used Kimball's method (dimensional modelling) for the DW and everything used in this story is open-source.
This is what all this (data-warehouse) is supposed to be about, I think. Does it even matter which methodology (normalized or de-normalized) was used?
EDIT 2
As a point of interest, Bill Inmon has a nicely written paper on his website -- A Tale of Two Architectures.
The problem with the word "denormalized" is that it doesn't specify what direction to go in. It's about like trying to get to San Francisco from Chicago by driving away from New York.
A star schema or a snowflake schema is certainly not normalized. And it certainly performs better than a normalized schema in certain usage patterns. But there are cases of denormalization where the designer wasn't following any discipline at all, but just composing tables by intuition. Sometimes those efforts don't pan out.
In short, don't just denormalize. Do follow a different design discipline if you are confident of its benefits, and even if it doesn't agree with normalized design. But don't use denormalization as an excuse for haphazard design.
The short answer is don't fix a performance problem you have not got!
As for time-based tables, the generally accepted paradigm is to have valid_from and valid_to dates in every row. This is still basically 3NF, as it only changes the semantics from "this is the one and only version of this entity" to "this is the one and only version of this entity at this time".
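A minimal sketch of that pattern (hypothetical table and column names), where each row carries its own validity period:
-- Each row is the one and only version of the entity for its validity period.
CREATE TABLE customer_address (
    customer_id INT          NOT NULL,
    valid_from  DATE         NOT NULL,
    valid_to    DATE         NOT NULL,     -- e.g. '9999-12-31' for the current version
    address     VARCHAR(255) NOT NULL,
    PRIMARY KEY (customer_id, valid_from)
);

-- The version of the entity "at this time":
SELECT address
FROM customer_address
WHERE customer_id = 42
  AND '2011-02-01' >= valid_from
  AND '2011-02-01' <  valid_to;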
Simplification:
An OLTP database should be normalised (as far as makes sense).
An OLAP data warehouse should be denormalised into Fact and Dimension tables (to minimise joins).
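As a rough illustration of that OLAP shape (names are hypothetical), a star schema keeps one fact table plus denormalised dimension tables, so a typical report needs only one join per dimension:
-- Denormalised dimension: hierarchy levels flattened into one row per product.
CREATE TABLE dim_product (
    product_key  INT          NOT NULL PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category     VARCHAR(50)  NOT NULL,
    department   VARCHAR(50)  NOT NULL
);

-- Fact table: foreign keys to the dimensions plus additive measures.
CREATE TABLE fact_sales (
    date_key    INT           NOT NULL,
    product_key INT           NOT NULL REFERENCES dim_product (product_key),
    quantity    INT           NOT NULL,
    amount      DECIMAL(12,2) NOT NULL
);

-- Typical report: one join per dimension, then group.
SELECT p.department, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
GROUP BY p.department, p.category;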

Am I right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses a single-table Access database for its outbound CMS, which I moved to a SQL Server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes, along with the date, time and agent id, are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists, sorted to give an even spread throughout their set). Note that a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is not so much the speed at which a new record can be added to the calls table live, but that when the user app is closed/opened and loads the user's data set again, I need to check which records have not been called today - I would need to run a stored proc on the server that picked the most recent call from the calls table and checked whether its call date matched today's date. I believe that is a more expensive query than checking whether a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design; the main limitation is my inexperience. This is my first SQL Server system. It's pretty critical, and I had to ensure it would work and that I could easily dump data back to the Access DB during a live failure. It has worked for 11 months now (with no live failure and less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working database may become the major flaw here. Rather, try to optimize what you have got running currently instead of starting from scratch. Think of indices, referential integrity, key assignment methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table, where the Order header carries a "total price" that is derived from the Order-Detail and related tables. Having the total price on Order in this case still makes sense, but it breaks normalisation.
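A sketch of that classic example (table and column names are hypothetical): the header stores a total that could always be derived from the detail rows, which is redundant but convenient for reads:
CREATE TABLE order_header (
    order_id    INT           NOT NULL PRIMARY KEY,
    order_date  DATE          NOT NULL,
    total_price DECIMAL(12,2) NOT NULL   -- derivable from order_detail; breaks normalisation
);

CREATE TABLE order_detail (
    order_id   INT           NOT NULL REFERENCES order_header (order_id),
    line_no    INT           NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    quantity   INT           NOT NULL,
    PRIMARY KEY (order_id, line_no)
);

-- The stored total must be kept equal to this derived value:
SELECT order_id, SUM(unit_price * quantity) AS derived_total
FROM order_detail
GROUP BY order_id;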
A normalised database is meant to give you high maintainability and flexibility. But optimising for performance is one of the considerations that physical design takes into account. Look at reporting databases, for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long; are there features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure; you should be able to use a select statement to get the max(id) by user id, or the max(id) in the table, depending on what you need to do.
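For example (the table and column names are guesses at the setup described in the question, not the actual schema), a query of roughly this shape would find an agent's records with no call logged today, without the duplicated columns in the data list table:
-- Records assigned to an agent that have no call recorded today.
SELECT r.record_id
FROM data_list r
WHERE r.assigned_agent_id = 123
  AND NOT EXISTS (
        SELECT 1
        FROM calls c
        WHERE c.record_id = r.record_id
          AND c.call_date = CURRENT_DATE   -- assumes the call date is stored as a DATE
      );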
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed, precisely because speed is determined only by physical design choices, in which the prime decision factor is precisely speed and performance.

Resources