Performance of 100M Row Table (Oracle 11g) - database

We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
claim_key int not null,
valuation_date_key int not null,
value_1 some_number_type,
value_2 some_number_type,
[etc...],
constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields all numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up I project a table with a row count on the order of 100 million, and could grow to as much as 500 million over years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with +100M row tables and can
offer tips on how to manage?
Do I leave the database technology problem to IT to solve or should I
seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!

First of all: Expect this to "just work" if leaving the tech problem to IT - especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was allways positivly suprised.
Some Hints:
On monthly refresh, load the table without non-primary indices, then create them. Search for the sweet point, how many index creations in parallell work best. In a project with much less date (ca. 10M) this reduced load time compared to the naive "create table, then load data" approach by 70%
Try to get a grip on the number and complexity of concurrent queries: This has influence on your hardware decisions (less concurrency=less IO, more CPU)
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: If I can calculate correctly, ths is a payload of 32GB. Trade cheap disks against 64G RAM and never ever have an IO bottleneck.
Make sure, you set the tablespace to read only

You could consider anchor modeling approach to store changes only.
Considering that there are so many expected repeated rows, ~ 95% --
bringing row count from 100M to only 5M, removes most of your concerns.
At this point it is mostly cache consideration, if the whole table
can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currenty fully supports only MS SQL server, though there is ORACLE in drop-down too. It can still be used as a code-helper.
In terms of supporting code, you will need (minimum)
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.

With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
claim_key
valuation_date
ClaimValue
claim_key (fk->Claim.claim_key)
value_key
value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.

Using partition concept & apply partition key on every query that you perform will save give the more performance improvements.
In our company we solved huge number of performance issues with the partition concept.
One more design solutions is if we know that the table is going to be very very big, try not to apply more constraints on the table & handle in the logic before u perform & don't have many columns on the table to avoid row chaining issues.

Related

Removing PAGELATCH with randomized ID instead of GUID

We have two tables which receive 1 million+ insertions per minute. This table is heavily indexed, which also can’t be removed to support business requirements. Due to such high volume of insertions, we are seeing PAGELATCH_EX and PAGELATCH_SH. These locks further slowdown insertions.
A commonly accepted solution is to change the identity column to GUID so that insertions are written on random page every time. We can do this but changing IDs will trigger a need for the development cycle of migration scripts so that existing production data can be changed.
I tried another approach which seems to be working well in our load tests. Instead of changing to GUID, We are now generating IDs in a randomized pattern using following logic
SELECT #ModValue = (DATEPART(NANOSECOND, GETDATE()) % 14);
INSERT xxx(id)
SELECT NEXT VALUE FOR Sequence * (#ModValue + IIF(#ModValue IN (0,1,2,3,4,5,6), 100000000000,-100000000000))
It has eliminated PAGELATCH_EX and PAGELATCH_SH locks and our insertions are quite fast now. I also think GUID as PK of such a critical table is less efficient then a bigint ID column.
However, some of team members are sceptical on this as IDs with negative values that too generated on random basis is not a common solution. Also, there is argument that support team may struggle due to large negative IDs. A common habit of writing select * from table order by 1 will need to be changed.
I am wondering what the community’s take on this solution. If you could please point any disadvantage of approach suggested, that will be highly appreciated.
However, some of team members are skeptical on this as IDs with negative values that too generated on random basis is not a common solution
You have an uncommon problem, and so uncommon solutions might need to be entertained.
Also, there is argument that support team may struggle due to large negative IDs. A common habit of writing select * from table order by 1 will need to be changed.
Sure. The system as it exists now has a high (but not perfect) correlation between IDs and time. That is, in general a higher ID for a row means that it was created after one with a lower ID. So it's convenient to order by IDs as a stand-in for ordering by time. If that's something that they need to do (i.e. order the data by time), give them a way to do that in the new proposal. Conversely, play out the hypothetical scenario where you're explaining to your CTO why you didn't fix performance on this table for your end users. Would "so that our support personnel don't have to change the way they do things" be an acceptable answer? I know that it wouldn't be for me but maybe the needs of support outweigh the needs of end users in your system. Only you (and your CTO) can answer that question.

Query optimization for hive table

We have a table which is of size 100TB and we have multiple customers using the same table (i.e every customer uses different where conditions). Now the problem statement is every time a customer tries to query the table it gets scanned from top to bottom.
This creates lot of slowness for all the queries. We cannot even partition/bucket the table basing on any business keys. Can someone provide solution or point to similar problem statements and their resolution.
you can provide your suggestions as well as alternative technologies so that we can pick the best suitable one. Thanks.
My 2 cents: experiment with an ORC table with GZip compression (default) and clever partitioning / ordering...
every SELECT that uses a partition key in its WHERE clause will
do "partition pruning" and thus avoid to scan everything [OK, OK, you said you had no good candidate in your specific case, but in general it can be done so I had to mention it first]
then within each ORC file in scope, the min/max counters will be
checked for "stripe pruning", limiting the I/O further
With clever partitioning & clever ordering of the data at INSERT time, using the most-frequent filters, the pruning can be quite efficient.
Then you can look into optimizations such as using a non-default ORC stripe size, a non-default "bytes-per-reducer" threshold, etc.
Reference:
http://fr.slideshare.net/oom65/orc-andvectorizationhadoopsummit
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
https://streever.atlassian.net/wiki/display/HADOOP/Optimizing+ORC+Files+for+Query+Performance
http://thinkbig.teradata.com/hadoop-performance-tuning-orc-snappy-heres-youre-missing/
One last thing: with 15 nodes for running queries and a replication factor of 3, each HDFS block is available "locally" on 3 the nodes (20%) and "remotely" in the rest (80%). A higher replication factor may reduce I/O and network bottlenecks -- at the cost of disk space, of course.

Should OLAP databases be denormalized for read performance? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I always thought that databases should be denormalized for read performance, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.
PerformanceDBA in various posts, for ex., in Performance of different aproaches to time-based data defends the paradigm that database should be always well-designed by normalization to 5NF and 6NF (Normal Form).
Have I understood it correctly (and what had I understood correctly)?
What's wrong with the traditional denormalization approach/paradigm design of OLAP databases (below 3NF) and the advice that 3NF is enough for most practical cases of OLTP databases?
For example:
"The simple truth ... is that 6NF, executed properly, is the data warehouse" (PerformanceDBA)
I should confess that I could never grasp the theories that denormalization facilitates read performance. Can anybody give me references with good logical explanations of this and of the contrary beliefs?
What are sources to which I can refer when trying to convince my stakeholders that OLAP/Data Warehousing databases should be normalized?
To improve visibility I copied here from comments:
"It would be nice if participants would
add (disclose) how many real-life (no
science projects included)
data-warehouse implementations in 6NF
they have seen or participated in.
Kind of a quick-pool. Me = 0." – Damir
Sudarevic
Wikipedia's Data Warehouse article tells:
"The normalized approach [vs. dimensional one by Ralph Kimball], also
called the 3NF model (Third Normal Form) whose supporters are
referred to as “Inmonites”, believe in Bill Inmon's approach in which
it is stated that the data warehouse should be modeled using an E-R
model/normalized model."
It looks like the normalized data warehousing approach (by Bill Inmon) is perceived as not exceeding 3NF (?)
I just want to understand what is the origin of the myth (or ubiquitous axiomatic belief) that data warehousing/OLAP is synonym of denormalization?
Damir Sudarevic answered that they are well-paved approach. Let me return to the question: Why is denormalization believed to facilitate reading?
Mythology
I always thought that databases should be denormalized for reading, as it is done for OLAP database design, and not exaggerated much further 3NF for OLTP design.
There's a myth to that effect. In the Relational Database context, I have re-implemented six very large so-called "de-normalised" "databases"; and executed over eighty assignments correcting problems on others, simply by Normalising them, applying Standards and engineering principles. I have never seen any evidence for the myth. Only people repeating the mantra as if it were some sort of magical prayer.
Normalisation vs Un-normalised
("De-normalisation" is a fraudulent term I refuse to use it.)
This is a scientific industry (at least the bit that delivers software that does not break; that put people on the Moon; that runs banking systems; etc). It is governed by the laws of physics, not magic. Computers and software are all finite, tangible, physical objects that are subject to the laws of physics. According to the secondary and tertiary education I received:
it is not possible for a bigger, fatter, less organised object to perform better than a smaller, thinner, more organised object.
Normalisation yields more tables, yes, but each table is much smaller. And even though there are more tables, there are in fact (a) fewer joins and (b) the joins are faster because the sets are smaller. Fewer Indices are required overall, because each smaller table needs fewer indices. Normalised tables also yield much shorter row sizes.
for any given set of resources, Normalised tables:
fit more rows into the same page size
therefore fit more rows into the same cache space, therefore overall throughput is increased)
therefore fit more rows into the same disk space, therefore the no of I/Os is reduced; and when I/O is called for, each I/O is more efficient.
.
it is not possible for an object that is heavily duplicated to perform better than an object that is stored as a single version of the truth. Eg. when I removed the 5 x duplication at the table and column level, all the transactions were reduced in size; the locking reduced; the Update Anomalies disappeared. That substantially reduced contention and therefore increased concurrent use.
The overall result was therefore much, much higher performance.
In my experience, which is delivering both OLTP and OLAP from the same database, there has never been a need to "de-normalise" my Normalised structures, to obtain higher speed for read-only (OLAP) queries. That is a myth as well.
No, the "de-normalisation" requested by others reduced speed, and it was eliminated. No surprise to me, but again, the requesters were surprised.
Many books have been written by people, selling the myth. It needs to be recognised that these are non-technical people; since they are selling magic, the magic they sell has no scientific basis, and they conveniently avoid the laws of physics in their sales pitch.
(For anyone who wishes to dispute the above physical science, merely repeating the mantra will no have any effect, please supply specific evidence supporting the mantra.)
Why is the Myth Prevalent ?
Well, first, it is not prevalent among scientific types, who do not seek ways of overcoming the laws of physics.
From my experience, I have identified three major reasons for the prevalence:
For those people who cannot Normalise their data, it is a convenient justification for not doing so. They can refer to the magic book and without any evidence for the magic, they can reverently say "see a famous writer validates what I have done". Not Done, most accurately.
Many SQL coders can write only simple, single-level SQL. Normalised structures require a bit of SQL capability. If they do not have that; if they cannot produce SELECTs without using temporary tables; if they cannot write Sub-queries, they will be psychologically glued to the hip to flat files (which is what "de-normalised" structures are), which they can process.
People love to read books, and to discuss theories. Without experience. Especially re magic. It is a tonic, a substitute for actual experience. Anyone who has actually Normalised a database correctly has never stated that "de-normalised is faster than normalised". To anyone stating the mantra, I simply say "show me the evidence", and they have never produced any. So the reality is, people repeat the mythology for these reasons, without any experience of Normalisation. We are herd animals, and the unknown is one of our biggest fears.
That is why I always include "advanced" SQL and mentoring on any project.
My Answer
This Answer is going to be ridiculously long if I answer every part of your question or if I respond to the incorrect elements in some of the other answers. Eg. the above has answered just one item. Therefore I will answer your question in total without addressing the specific components, and take a different approach. I will deal only in the science related to your question, that I am qualified in, and very experienced with.
Let me present the science to you in manageable segments.
The typical model of the six large scale full implementation assignments.
These were the closed "databases" commonly found in small firms, and the organisations were large banks
very nice for a first generation, get-the-app-running mindset, but a complete failure in terms of performance, integrity and quality
they were designed for each app, separately
reporting was not possible, they could only report via each app
since "de-normalised" is a myth, the accurate technical definition is, they were un-normalised
In order to "de-normalise" one must Normalise first; then reverse the process a little
in every instance where people showed me their "de-normalised" data models, the simple fact was, they had not Normalised at all; so "de-normalisation" was not possible; it was simply un-normalised
since they did not have much Relational technology, or the structures and control of Databases, but they were passed off as "databases", I have placed those words in quotation marks
as is scientifically guaranteed for un-normalised structures, they suffered multiple versions of the truth (data duplication) and therefore high contention and low concurrency, within each of them
they had an additional problem of data duplication across the "databases"
the organisation was trying to keep all those duplicates synchronised, so they implemented replication; which of course meant an additional server; ETL and synching scripts to be developed; and maintained; etc
needless to say, the synching was never quite enough and they were forever changing it
with all that contention and low throughput, it was no problem at all justifying a separate server for each "database". It did not help much.
So we contemplated the laws of physics, and we applied a little science.
We implemented the Standard concept that the data belongs to the corporation (not the departments) and the corporation wanted one version of the truth. The Database was pure Relational, Normalised to 5NF. Pure Open Architecture, so that any app or report tool could access it. All transactions in stored procs (as opposed to uncontrolled strings of SQL all over the network). The same developers for each app coded the new apps, after our "advanced" education.
Evidently the science worked. Well, it wasn't my private science or magic, it was ordinary engineering and the laws of physics. All of it ran on one database server platform; two pairs (production & DR) of servers were decommissioned and given to another department. The 5 "databases" totalling 720GB were Normalised into one Database totalling 450GB. About 700 tables (many duplicates and duplicated columns) were normalised into 500 unduplicated tables. It performed much faster, as in 10 times faster overall, and more than 100 times faster in some functions. That did not surprise me, because that was my intention, and the science predicted it, but it surprised the people with the mantra.
More Normalisation
Well, having had success with Normalisation in every project, and confidence with the science involved, it has been a natural progression to Normalise more, not less. In the old days 3NF was good enough, and later NFs were not yet identified. In the last 20 years, I have only delivered databases that had zero update anomalies, so it turns out by todays definitions of NFs, I have always delivered 5NF.
Likewise, 5NF is great but it has its limitations. Eg. Pivoting large tables (not small result sets as per the MS PIVOT Extension) was slow. So I (and others) developed a way of providing Normalised tables such that Pivoting was (a) easy and (b) very fast. It turns out, now that 6NF has been defined, that those tables are 6NF.
Since I provide OLAP and OLTP from the same database, I have found that, consistent with the science, the more Normalised the structures are:
the faster they perform
and they can be used in more ways (eg Pivots)
So yes, I have consistent and unvarying experience, that not only is Normalised much, much faster than un-normalised or "de-normalised"; more Normalised is even faster than less normalised.
One sign of success is growth in functionality (the sign of failure is growth in size without growth in functionality). Which meant they immediately asked us for more reporting functionality, which meant we Normalised even more, and provided more of those specialised tables (which turned out years later, to be 6NF).
Progressing on that theme. I was always a Database specialist, not a data warehouse specialist, so my first few projects with warehouses were not full-blown implementations, but rather, they were substantial performance tuning assignments. They were in my ambit, on products that I specialised in.
Let's not worry about the exact level of normalisation, etc, because we are looking at the typical case. We can take it as given that the OLTP database was reasonably normalised, but not capable of OLAP, and the organisation had purchased a completely separate OLAP platform, hardware; invested in developing and maintaining masses of ETL code; etc. And following implementation then spent half their life managing the duplicates they had created. Here the book writers and vendors need to be blamed, for the massive waste of hardware and separate platform software licences they cause organisations to purchase.
If you have not observed it yet, I would ask you to notice the similarities between the Typical First Generation "database" and the Typical Data Warehouse
Meanwhile, back at the farm (the 5NF Databases above) we just kept adding more and more OLAP functionality. Sure the app functionality grew, but that was little, the business had not changed. They would ask for more 6NF and it was easy to provide (5NF to 6NF is a small step; 0NF to anything, let alone 5NF, is a big step; an organised architecture is easy to extend).
One major difference between OLTP and OLAP, the basic justification of separate OLAP platform software, is that the OLTP is row-oriented, it needs transactionally secure rows, and fast; and the OLAP doesn't care about the transactional issues, it needs columns, and fast. That is the reason all the high end BI or OLAP platforms are column-oriented, and that is why the OLAP models (Star Schema, Dimension-Fact) are column-oriented.
But with the 6NF tables:
there are no rows, only columns; we serve up rows and columns at same blinding speed
the tables (ie. the 5NF view of the 6NF structures) are already organised into Dimension-Facts. In fact they are organised into more Dimensions than any OLAP model would ever identify, because they are all Dimensions.
Pivoting entire tables with aggregation on the fly (as opposed to the PIVOT of a small number of derived columns) is (a) effortless, simple code and (b) very fast
What we have been supplying for many years, by definition, is Relational Databases with at least 5NF for OLTP use, and 6NF for OLAP requirements.
Notice that it is the very same science that we have used from the outset; to move from Typical un-normalised "databases" to 5NF Corporate Database. We are simply applying more of the proven science, and obtaining higher orders of functionality and performance.
Notice the similarity between 5NF Corporate Database and 6NF Corporate Database
The entire cost of separate OLAP hardware, platform software, ETL, administration, maintenance, are all eliminated.
There is only one version of the data, no update anomalies or maintenance thereof; the same data served up for OLTP as rows, and for OLAP as columns
The only thing we have not done, is to start off on a new project, and declare pure 6NF from the start. That is what I have lined up next.
What is Sixth Normal Form ?
Assuming you have a handle on Normalisation (I am not going to not define it here), the non-academic definitions relevant to this thread are as follows. Note that it applies at the table level, hence you can have a mix of 5NF and 6NF tables in the same database:
Fifth Normal Form: all Functional Dependencies resolved across the database
in addition to 4NF/BCNF
every non-PK column is 1::1 with its PK
and to no other PK
No Update Anomalies
.
Sixth Normal Form: is the irreducible NF, the point at which the data cannot be further reduced or Normalised (there will not be a 7NF)
in addition to 5NF
the row consists of a Primary Key, and at most, one non-key column
eliminates The Null Problem
What Does 6NF Look Like ?
The Data Models belong to the customers, and our Intellectual Property is not available for free publication. But I do attend this web-site, and provide specific answers to questions. You do need a real world example, so I will publish the Data Model for one of our internal utilities.
This one is for the collection of server monitoring data (enterprise class database server and OS) for any no of customers, for any period. We use this to analyse performance issues remotely, and to verify any performance tuning that we do. The structure has not changed in over ten years (added to, with no change to the existing structures), it is typical of the specialised 5NF that many years later was identified as 6NF. Allows full pivoting; any chart or graph to be drawn, on any Dimension (22 Pivots are provided but that is not a limit); slice and dice; mix and match. Notice they are all Dimensions.
The monitoring data or Metrics or vectors can change (server version changes; we want to pick up something more) without affecting the model (you may recall in another post I stated EAV is the bastard son of 6NF; well this is full 6NF, the undiluted father, and therefore provides all features of EAV, without sacrificing any Standards, integrity or Relational power); you merely add rows.
▶Monitor Statistics Data Model◀. (too large for inline; some browsers cannot load inline; click the link)
It allows me to produce these ▶Charts Like This◀, six keystrokes after receiving a raw monitoring stats file from the customer. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. (Used with permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the ▶IDEF1X Notation◀ helpful.
6NF Data Warehouse
This has been recently validated by Anchor Modeling, in that they are now presenting 6NF as the "next generation" OLAP model for data warehouses. (They do not provide the OLTP and OLAP from the single version of the data, that is ours alone).
Data Warehouse (Only) Experience
My experience with Data Warehouses only (not the above 6NF OLTP-OLAP Databases), has been several major assignments, as opposed to full implementation projects. The results were, no surprise:
consistent with the science, Normalised structures perform much faster; are easier to maintain; and require less data synching. Inmon, not Kimball.
consistent with the magic, after I Normalise a bunch of tables, and deliver substantially improved performance via application of the laws of physics, the only people surprised are the magicians with their mantras.
Scientifically minded people do not do that; they do not believe in, or rely upon, silver bullets and magic; they use and hard work science to resolve their problems.
Valid Data Warehouse Justification
That is why I have stated in other posts, the only valid justification for a separate Data Warehouse platform, hardware, ETL, maintenance, etc, is where there are many Databases or "databases", all being merged into a central warehouse, for reporting and OLAP.
Kimball
A word on Kimball is necessary, as he is the main proponent of "de-normalised for performance" in data warehouses. As per my definitions above, he is one of those people who have evidently never Normalised in their lives; his starting point was un-normalised (camouflaged as "de-normalised") and he simply implemented that in a Dimension-Fact model.
Of course, to obtain any performance, he had to "de-normalise" even more, and create further duplicates, and justify all that.
So therefore it is true, in a schizophrenic sort of way, that "de-normalising" un-normalised structures, by making more specialised copies, "improves read performance". It is not true when the whole is taking into account; it is true only inside that little asylum, not outside.
Likewise it is true, in that crazy way, that where all the "tables" are monsters, that "joins are expensive" and something to be avoided. They have never had the experience of joining smaller tables and sets, so they cannot believe the scientific fact that more, smaller tables are faster.
they have experience that creating duplicate "tables" is faster, so they cannot believe that eliminating duplicates is even faster than that.
his Dimensions are added to the un-normalised data. Well the data is not Normalised, so no Dimensions are exposed. Whereas in a Normalised model, the Dimensions are already exposed, as an integral part of the data, no addition is required.
that well-paved path of Kimball's leads to the cliff, where more lemmings fall to their deaths, faster. Lemmings are herd animals, as long as they are walking the path together, and dying together, they die happy. Lemmings do not look for other paths.
All just stories, parts of the one mythology that hang out together and support each other.
Your Mission
Should you choose to accept it. I am asking you to think for yourself, and to stop entertaining any thoughts that contradict science and the laws of physics. No matter how common or mystical or mythological they are. Seek evidence for anything before trusting it. Be scientific, verify new beliefs for yourself. Repeating the mantra "de-normalised for performance" won't make your database faster, it will just make you feel better about it. Like the fat kid sitting in the sidelines telling himself that he can run faster than all the kids in the race.
on that basis, even the concept "normalise for OLTP" but do the opposite, "de-normalise for OLAP" is a contradiction. How can the laws of physics work as stated on one computer, but work in reverse on another computer ? The mind boggles. It is simply not possible, the work that same way on every computer.
Questions ?
Denormalization and aggregation are the two main strategies used to achieve performance in a data warehouse. It's just silly to suggest that it doesn't improve read performance! Surely I must have missunderstood something here?
Aggregation:
Consider a table holding 1 billion purchases.
Contrast it with a table holding one row with the sum of the purchases.
Now, which is faster? Select sum(amount) from the one-billion-row table or a select amount from the one-row-table? It's a stupid example of course, but it illustrates the principle of aggregation quite clearly. Why is it faster? Because regardless of what magical model/hardware/software/religion we use, reading 100 bytes is faster than reading 100 gigabytes. Simple as that.
Denormalization:
A typical product dimension in a retail data warehouse has shitloads of columns. Some columns are easy stuff like "Name" or "Color", but it also has some complicated stuff, like hierarchies. Multiple hierarchies (The product range (5 levels), the intended buyer (3 levels), raw materials (8 levels), way of production (8 levels) along with several computed numbers such as average lead time (since start of the year), weight/packaging measures etcetera etcetera. I've maintained a product dimension table with 200+ columns that was constructed from ~70 tables from 5 different source systems. It is just plain silly to debate whether a query on the normalized model (below)
select product_id
from table1
join table2 on(keys)
join (select average(..)
from one_billion_row_table
where lastyear = ...) on(keys)
join ...table70
where function_with_fuzzy_matching(table1.cola, table37.colb) > 0.7
and exists(select ... from )
and not exists(select ...)
and table20.version_id = (select max(v_id from product_ver where ...)
and average_price between 10 and 20
and product_range = 'High-Profile'
...is faster than the equivalent query on the denormalized model:
select product_id
from product_denormalized
where average_price between 10 and 20
and product_range = 'High-Profile';
Why? Partly for the same reason as the aggregated scenario. But also because the queries are just "complicated". They are so disgustingly complicated that the optimizer (and now I'm going Oracle specifics) gets confused and screws up the execution plans. Suboptimal execution plans may not be such a big deal if the query deals with small amounts of data. But as soon as we start to join in the Big Tables it is crucial that the database gets the execution plan right. Having denormalized the data in one table with a single syntetic key (heck, why don't I add more fuel to this ongoing fire), the filters become simple range/equality filters on pre-cooked columns. Having duplicated the data into new columns enables us to gather statistics on the columns which will help the optimizer in estimating the selectivities and thus providing us with a proper execution plan (well, ...).
Obviously, using denormalization and aggregation makes it harder to accomodate schema changes which is a bad thing. On the other hand they provides read performance, which is a good thing.
So, should you denormalize your database in order to achieve read-performance?
Hell no! It adds so many complexities to your system that there is no end to how many ways it will screw you over before you have delivered. Is it worth it? Yes, sometimes you need to do it to meet a specific performance requirement.
Update 1
PerformanceDBA: 1 row would get updated a billion times a day
That would imply a (near) realtime requirement (which in turn would generate a completely different set of technical requirements). Many (if not most) data warehouses does not have that requirement. I picked an unrealistic aggregation example just to make it clear why aggregation works. I didn't want to have to explain rollup strategies too :)
Also, one has to contrast the needs of the typical user of a data warehouse and the typical user of the underlaying OLTP system. A user looking to understand what factors drive transport costs, couldn't care less if 50% of todays data is missing or if 10 trucks exploded and killed the drivers. Performing the analysis over 2 years worth of data would still come to the same conclusion even if he had to-the-second up-to-date information at his disposal.
Contrast this to the needs of the drivers of that truck (the ones who survived). They can't wait 5 hours at some transit point just because some stupid aggregation process has to finnish. Having two separate copies of the data solves both needs.
Another major hurdle with sharing the same set of data for operational systems and reporting systems is that the release cycles, Q&A, deployment, SLA and what have you, are very different. Again, having two separate copies makes this easier to handle.
By "OLAP" I understand you to mean a subject-oriented relational / SQL database used for decision support - AKA a Data Warehouse.
Normal Form (typically 5th / 6th Normal Form) is generally the best model for a Data Warehouse. The reasons for normalizing a Data Warehouse are exactly the same as any other database: it reduces redundancy and avoids potential update anomalies; it avoids built-in bias and is therefore the easiest way to support schema change and new requirements. Using Normal Form in a data warehouse also helps keep the data load process simple and consistent.
There is no "traditional" denormalization approach. Good data warehouses have always been normalized.
Should not a database be denormalized for reading performance?
Okay, here goes a total "Your Mileage May Vary", "It Depends", "Use The Proper Tool For Every Job", "One Size Does Not Fit All" answer, with a bit of "Don't Fix It If It Ain't Broken" thrown in:
Denormalization is one way to improve query performance in certain situations. In other situations it may actually reduce performance (because of the increased disk use). It certainly makes updates more difficult.
It should only be considered when you hit a performance problem (because you are giving the benefits of normalization and introduce complexity).
The drawbacks of denormalization are less of an issue with data that is never updated, or only updated in batch jobs, i.e. not OLTP data.
If denormalization solves a performance problem that you need solved, and that less invasive techniques (like indexes or caches or buying a bigger server) do not solve, then yes, you should do it.
First my opinions, then some analysis
Opinions
Denormalisation is perceived to help reading data because common use of the word denormalisation often include not only breaking normal forms, but also introducing any insertion, update and deletion dependencies into the system.
This, strictly speaking, is false, see this question/answer, Denormalisation in strict sense mean to break any of the normal forms from 1NF-6NF, other insertion, update and deletion dependencies are addressed with Principle of Orthogonal Design.
So what happens is that people take the Space vs Time tradeoff principle and remember the term redundancy (associated with denormalisation, still not equal to it) and conclude that you should have benefits. This is faulty implication, but false implications do not allow you to conclude the reverse.
Breaking normal forms may indeed speed up some data retrieval (details in analysis below), but as a rule it will also at the same time:
favour only specific type of queries and slow down all other access paths
increase complexity of the system (which influences not only maintenance of the database itself, but also increases the complexity of applications that consume the data)
obfuscate and weaken semantic clarity of the database
main point of database systems, as central data representing the problem space is to be unbiased in recording the facts, so that when requirements change you don't have to redesign the parts of the system (data and applications) that are independent in reality. to be able to do this artificial dependencies should be minimised - today's 'critical' requirement to speed up one query quite often become only marginally important.
Analysis
So, I made a claim that sometimes breaking normal forms can help retrieval. Time to give some arguments
1) Breaking 1NF
Assume you have financial records in 6NF. From such database you can surely get a report on what is a balance for each account for each month.
Assuming that a query that would have to calculate such report would need to go through n records you could make a table
account_balances(month, report)
which would hold XML structured balances for each account. This breaks 1NF (see notes later), but allows one specific query to execute with minimum I/O.
At the same time, assuming it is possible to update any month with inserts, updates or deletes of financial records, the performance of the update queries on the system might be slowed down by time proportional to some function of n for each update.
(the above case illustrates a principle, in reality you would have better options and the benefit of getting minimum I/O bring such penalties that for realistic system that actually updates data often you would get bad performance on even for your targeted query depending on the type of actual workload; can explain this in more detail if you want)
Note:
This is actually trivial example and there is one problem with it - the definition of 1NF. Assumption that the above model breaks 1NF is according to requirement that values of an attribute 'contain exactly one value from the applicable domain'.
This allows you to say that the domain of the attribute report is a set of all possible reports and that from all of them there is exactly one value and claim that 1NF is not broken (similar to argument that storing words does not break 1NF even though you might have letters relation somewhere in your model).
On the other hand there are much better ways to model this table, which would be more useful for wider range of queries (such as to retrieve balances for single account for all months in a year). In this case you would justify that improvement by saying that this field is not in 1NF.
Anyway it explains why people claim that breaking NFs might improve performance.
2) Breaking 3NF
Assuming tables in 3NF
CREATE TABLE `t` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`member_id` int(10) unsigned NOT NULL,
`status` tinyint(3) unsigned NOT NULL,
`amount` decimal(10,2) NOT NULL,
`opening` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `member_id` (`member_id`),
CONSTRAINT `t_ibfk_1` FOREIGN KEY (`member_id`) REFERENCES `m` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB
CREATE TABLE `m` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
with sample data (1M rows in t, 100k in m)
Assume a common query that you want to improve
mysql> select sql_no_cache m.name, count(*)
from t join m on t.member_id = m.id
where t.id between 100000 and 500000 group by m.name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (1.08 sec)
you could find suggestions to move attribute name into table m which breaks 3NF (it has a FD: member_id -> name and member_id is not a key of t)
after
alter table t add column varchar(255);
update t inner join m on t.member_id = t.id set t.name = m.name;
running
mysql> select sql_no_cache name, count(*)
from t where id
between 100000 and 500000
group by name;
+-------+----------+
| name | count(*) |
+-------+----------+
| omega | 11 |
| test | 8 |
| test3 | 399982 |
+-------+----------+
3 rows in set (0.41 sec)
notes:
The above query execution time is cut in half, but
the table was not in 5NF/6NF to begin with
the test was done with no_sql_cache so most cache mechanisms were avoided (and in real situations they play a role in system's performance)
space consumption is increased by approx 9x size of the column name x 100k rows
there should be triggers on t to keep the integrity of data, which would significantly slow down all updates to name and add additional checks that inserts in t would need to go through
probably better results could be achieved by dropping surrogate keys and switching to natural keys, and/or indexing, or redesigning to higher NFs
Normalising is the proper way in the long run. But you don't always have an option to redesign company's ERP (which is for example already only mostly 3NF) - sometimes you must achieve certain task within given resources. Of course doing this is only short term 'solution'.
Bottom line
I think that the most pertinent answer to your question is that you will find the industry and education using the term 'denormalisation' in
strict sense, for breaking NFs
loosely, for introducing any insertion, update and deletion dependencies (original Codd's quote comments on normalisation saying: 'undesirable(!) insertion, update and deletion dependencies', see some details here)
So, under strict definition, the aggregation (summary tables) are not considered denormalisation and they can help a lot in terms of performance (as will any cache, which is not perceived as denormalisation).
The loose usage encompasses both breaking normal forms and the principle of orthogonal design, as said before.
Another thing that might shed some light is that there is a very important difference between the logical model and the physical model.
For example indexes store redundant data, but no one considers them denormalization, not even people who use the term loosely and there are two (connected) reasons for this
they are not part of the logical model
they are transparent and guaranteed not to break integrity of your model
If you fail to properly model your logical model you will end up with inconsistent database - wrong types of relationships between your entities (inability to represent problem space), conflicting facts (ability to loose information) and you should employ whatever methods you can to get a correct logical model, it is a foundation for all applications that will be built on top of it.
Normalisation, orthogonal and clear semantics of your predicates, well defined attributes, correctly identified functional dependencies all play a factor in avoiding pitfalls.
When it comes to physical implementation things get more relaxed in a sense that ok, materialised computed column that is dependent on non key might be breaking 3NF, but if there are mechanisms that guarantee consistency it is allowed in physical model in the same way as indexes are allowed, but you have to very carefully justify it because usually normalising will yield same or better improvements across the board and will have no or less negative impact and will keep the design clear (which reduces the application development and maintenance costs) resulting in savings that you can easily spend on upgrading hardware to improve the speed even more then what is achieved with breaking NFs.
The two most popular methodologies for building a data warehouse (DW) seem to be Bill Inmon's and Ralph Kimball's.
Inmon's methodology uses normalized approach, while Kimball's uses dimensional modelling -- de-normalized star schema.
Both are well documented down to small details and both have many successful implementations. Both present a "wide, well-paved road" to a DW destination.
I can not comment on the 6NF approach nor on Anchor Modelling because I have never seen nor participated in a DW project using that methodology. When it comes to implementations, I like to travel down well tested paths -- but, that's just me.
So, to summarize, should DW be normalized or de-normalized? Depends on the methodology you pick -- simply pick one and stick to it, at least till the end of the project.
EDIT - An Example
At the place I currently work for, we had a legacy report which has been running since ever on the production server. Not a plain report, but a collection of 30 sub-reports emailed to everybody and his ant every day.
Recently, we implemented a DW. With two report servers and bunch of reports in place, I was hoping that we can forget about the legacy thing. But not, legacy is legacy, we always had it, so we want it, need it, can't live without it, etc.
The thing is that the mess-up of a python script and SQL took eight hours (yes, e-i-g-h-t hours) to run every single day. Needless to say, the database and the application were built over years by few batches of developers -- so, not exactly your 5NF.
It was time to re-create the legacy thing from the DW. Ok, to keep it short it's done and it takes 3 minutes (t-h-r-e-e minutes) to produce it, six seconds per sub-report. And I was in the hurry to deliver, so was not even optimizing all the queries. This is factor of 8 * 60 / 3 = 160 times faster -- not to mention benefits of removing an eight hour job from a production server. I think I can still shave of a minute or so, but right now no one cares.
As a point of interest, I have used Kimball's method (dimensional modelling) for the DW and everything used in this story is open-source.
This is what all this (data-warehouse) is supposed to be about, I think. Does it even matter which methodology (normalized or de-normalized) was used?
EDIT 2
As a point of interest, Bill Inmon has a nicely written paper on his website -- A Tale of Two Architectures.
The problem with the word "denormalized" is that it doesn't specify what direction to go in. It's about like trying to get to San Francisco from Chicago by driving away from New York.
A star schema or a snowflake schema is certainly not normalized. And it certainly performs better than a normalized schema in certain usage patterns. But there are cases of denormalization where the designer wasn't following any discipline at all, but just composing tables by intuition. Sometimes those efforts don't pan out.
In short, don't just denormalize. Do follow a different design discipline if you are confident of its benefits, and even if it doesn't agree with normalized design. But don't use denormalization as an excuse for haphazard design.
The short answer is don't fix a performance problem you have not got!
As for time based tables the generally accepted pardigm is to have valid_from and valid_to dates in every row. This is still basically 3NF as it only changes the semantics from "this is the one and only verision of this entity" to "this is the one and only version of this entity at this time "
Simplification:
An OLTP database should be normalised (as far as makes sense).
An OLAP data warehouse should be denormalised into Fact and Dimension tables (to minimise joins).

What is best practice for representing time intervals in a data warehouse?

In particular I am dealing with a Type 2 Slowly Changing Dimension and need to represent the time interval a particular record was active for, i.e. for each record I have a StartDate and an EndDate. My question is around whether to use a closed ([StartDate,EndDate]) or half open ([StartDate,EndDate)) interval to represent this, i.e. whether to include the last date in the interval or not. To take a concrete example, say record 1 was active from day 1 to day 5 and from day 6 onwards record 2 became active. Do I make the EndDate for record 1 equal to 5 or 6?
Recently I have come around to the way of thinking that says half open intervals are best based on, inter alia, Dijkstra:Why numbering should start at zero as well as the conventions for array slicing and the range() function in Python. Applying this in the data warehousing context I would see the advantages of a half open interval convention as the following:
EndDate-StartDate gives the time the record was active
Validation: The StartDate of the next record will equal the EndDate of the previous record which is easy to validate.
Future Proofing: if I later decide to change my granularity from daily to something shorter then the switchover date still stays precise. If I use a closed interval and store the EndDate with a timestamp of midnight then I would have to adjust these records to accommodate this.
Therefore my preference would be to use a half open interval methodology. However if there was some widely adopted industry convention of using the closed interval method then I might be swayed to rather go with that, particularly if it is based on practical experience of implementing such systems rather than my abstract theorising.
I have seen both closed and half-open versions in use. I prefer half-open for the reasons you have stated.
In my opinion the half-open version it makes the intended behaviour clearer and is "safer". The predicate ( a <= x < b ) clearly shows that b is intended to be outside the interval. In contrast, if you use closed intervals and specify (x BETWEEN a AND b) in SQL then if someone unwisely uses the enddate of one row as the start of the next, you get the wrong answer.
Make the latest end date default to the largest date your DBMS supports rather than null.
Generally I agree with David's answer, so I won't repeat that info. Further to that:
Did you really mean half open ([StartDate,EndDate])
Even in that "half-open", there are two errors. One is a straight Normalisation error that of course implements duplicate data that you identify in the discussion, that is available as derived data, and that should be removed.
To me, Half Open is (StartDate)
EndDate is derived from the next row.
it is best practice
it is not common usage, because (a) common implementors are unaware these days and (b) they are too lazy, or don't know how, to code the necessary simple subquery
it is based on experience, in large banking databases
Refer to this for details:
Link to Recent Very Similar Question & Data Model
Responses to Comments
You seem to clearly favour normalised designs with natural, meaningful keys. Is it ever warranted to deviate from this in a reporting data warehouse? My understanding is that the extra space devoted to surrogate keys and duplicate columns (eg EndDate) are a trade off for increased query performance. However some of your comments about cache utilisation and increased disk IO make me question this. I would be very interested in your input on this.
Yes, absolutely. Any sane person (who is not learning Computer Science from Wikipedia) should question that. It simply defies the laws of physics.
Can you understand that many people, without understanding Normalisation or databases (you need 5NF), produce Unnormalised slow data heaps, and their famous excuse (written up by "gurus") is "denormalised for performance" ? Now you know that is excreta.
Those same people, without understanding Normalisation or datawarehouses (you need 6NF), (a) create a copy of the database and (b) all manner of weird and wonderful structures to "enhance" queries, including (c) even more duplication. And guess what their excuse is ? "denormalised for performance".
The simple truth (not complex enough for people who justify datawarehouses with (1) (2) (3) ), is that 6NF, executed properly, is the data warehouse. I provide both database and data warehouse from the same data, at warehouse speeds. No second system; no second platform; no copies; no ETL; no keeping copies synchronised; no users having to go to two sources. Sure, it takes skill and an understanding of performance, and a bit of special code to overcome the limitations of SQL (you cannot specify 6NF in DDL, you need to implement a catalogue).
why implement a StarSchema or a SnowFlake, when the pure Normalised structure already has full Dimension-Fact capability.
Even if you did not do that, if you just did the traditional thing and ETLed that database onto a separate datawarehouse system, within it, if you eliminated duplication, reduced row size, reduced Indices, of course it would run faster. Otherwise, it defies the laws of physics: fat people would run faster than thin people; a cow would run faster than a horse.
fair enough, if you don't have a Normalised structure, then anything, please, to help. So they come up with StarSchemas, SnowFlakes and all manner of Dimension-Fact designs.
And please understand, only un_qualified, in_experienced people believe all these myths and magic. Educated experienced people have their hard-earned truths, they do not hire witch doctors. Those "gurus" only validate that the fat person doesn't win the race because of the weather, or the stars; anything but the thing that will solve the problem. A few people get their knickers in a knot because I am direct, I tell the fat person to shed weight; but the real reason they get upset is, I puncture their cherished myths, that keep them justified being fat. People do not like to change.
One thing. Is it ever warranted to deviate. The rules are not black-or-white; they are not single rules in isolation. A thinking person has to consider all of them together; prioritise them for the context. You will find neither all Id keys, nor zero Id keys in my databases, but every Id key has been carefully considered and justified.
By all means, use the shortest possible keys, but use meaningful Relational ones over Surrogates; and use Surrogates when the key becomes too large to carry.
But never start out with Surrogates. This seriously hampers your ability to understand the data; Normalise; model the data.
Here is one question/answer (of many!) where the person was stuck in the process, unable to identify even the basic Entities and Relations, because he had stuck Id keys on everything at the start. Problem solved without discussion, in the first iteration.
.
Ok, another thing. Learn this subject, get experience, and further yourself. But do not try to teach it or convert others, even if the lights went on, and you are eager. Especially if you are enthusiastic. Why ? Because when you question a witch doctor's advice, the whole village will lynch you because you are attacking their cherished myths, their comfort; and you need my kind of experience to nail witch doctors (just check for evidence of his in the comments!). Give it a few years, get your real hard-won experience, and then take them on.
If you are interested, follow this question/answer for a few days, it will be a great example of how to follow IDEF1X methodology, how to expose and distil those Identifiers.
Well, the standard sql where my_field between date1 and date2 is inclusive, so I prefer the inclusive form -- not that the other one is wrong.
The thing is that for usual DW queries, these (rowValidFrom, rowValidTo) fields are mostly not used at all because the foreign key in a fact table already points to the appropriate row in the dimension table.
These are mostly needed during loading (we are talking type 2 SCD here), to look-up the most current primary key for the matching business key. At that point you have something like:
select ProductKey
from dimProduct
where ProductName = 'unique_name_of_some_product'
and rowValidTo > current_date ;
Or, if you prefer to create key-pipeline before loading:
insert into keys_dimProduct (ProductName, ProductKey) -- here ProductName is PK
select ProductName, ProductKey
from dimProduct
where rowValidTo > current_date ;
This helps loading, because it is easy to cache the key table into memory before loading. For example if ProductName is varchar(40) and ProductKey an integer, the key table is less than 0.5 GB per 10 million rows, easy to cache for lookup.
Other frequently seen variations include were rowIsCurrent = 'yes' and where rowValidTo is null.
In general, one or more of the following fields are used :
rowValidFrom
rowValidTo
rowIsCurrent
rowVersion
depending on a DW designer and sometimes ETL tool used, because most tools have a SCD type 2 loading blocks.
There seems to be a concern about the space used by having extra fields -- so, I will estimate here the cost of using some extra space in a dimension table, if for no other reason then convenience.
Suppose I use all of the row_ fields.
rowValidFrom date = 3 bytes
rowValidTo date = 3 bytes
rowIsCurrent varchar(3) = 5 bytes
rowVersion integer = 4 bytes
This totals 15 bytes. One may argue that this is 9 or even 12 bytes too many -- OK.
For 10 million rows this amounts to 150,000,000 bytes ~ 0.14GB
I looked-up prices from a Dell site.
Memory ~ $38/GB
Disk ~ $80/TB = 0.078 $/GB
I will assume raid 5 here (three drives), so disk price will be 0.078 $/GB * 3 = 0.23 $/GB
So, for 10 million rows, to store these 4 fields on disk costs 0.23 $/GB * 0.14 GB = 0.032 $. If the whole dimension table is to be cached into memory, the price of these fields would be 38 $/GB * 0.14GB = 5.32 $ per 10 million rows. In comparison, a beer in my local pub costs ~ 7$.
The year is 2010, and I do expect my next laptop to have 16GB memory. Things and (best) practices change with time.
EDIT:
Did some searching, in the last 15 years, the disk capacity of an average computer increased about 1000 times, the memory about 250 times.

Limits on number of Rows in a SQL Server Table

Are there any hard limits on the number of rows in a table in a sql server table? I am under the impression the only limit is based around physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index. Are there any common practicies for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
In addition to all the above, which are great reccomendations I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a performance number as depending on the quality and number of your indexes the performance will differ. It is also dependent on what operations you want to optimize. Do you need to optimize inserts? or are you more concerned about query response?
If you are truly concerned about insert speed, partitioning, as well a VERY careful index consideration is going to be key.
The separate table reccomendation of Tom H is also a good idea.
With audit tables another approach is to archive the data once a month (or week depending on how much data you put in it) or so. That way if you need to recreate some recent changes with the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task I've found!). But you still have the data avialable in case you ever need to go back farther in time.

Resources