Star schema vs Snowflake Schema - database

From a Business Intelligence perspective this is a common question, but I am looking for a more quantitative answer.
Can we decide between these designs based on the relational database itself? I mean, is there any mathematical ratio or data-volume threshold that favours one schema over the other?

A star schema stores de-normalised data, while a snowflake schema stores normalised data.
Usually, the snowflake retains referential integrity in the relational database, meaning you will have many dimensions linked by primary/foreign keys. The star schema, on the other hand, has a flat structure that merges all of the linked tables into one dimension.
The star schema is less complex and has much better performance than the snowflake schema. From a BI perspective, the star schema should be the way to go; a snowflake should only be used when necessary.
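To make that structural difference concrete, here is a minimal sketch in Python/SQLite (all table and column names are invented for illustration) that models the same product dimension both ways: flattened into one table for a star schema, and normalised into key-linked tables for a snowflake.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Star schema: one flat, denormalised dimension table per entity.
    CREATE TABLE dim_product_star (
        product_key    INTEGER PRIMARY KEY,
        product_name   TEXT,
        category_name  TEXT,   -- category attributes repeated on every row
        category_group TEXT
    );

    -- Snowflake schema: the same dimension normalised into linked tables.
    CREATE TABLE dim_category (
        category_key   INTEGER PRIMARY KEY,
        category_name  TEXT,
        category_group TEXT
    );
    CREATE TABLE dim_product_snow (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );

    -- Both variants hang off the same kind of fact table.
    CREATE TABLE fact_sales (
        product_key INTEGER,
        amount      REAL
    );
    """)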

Star Schema vs. Snowflake Schema - what to choose?
Well, this entirely depends on the project requirements and scenarios.
If we want to dive deeper into dimensional analysis, then a snowflake is a good choice because, as suggested in the answer above, it maintains referential integrity and avoids data redundancy thanks to its normalised structure. For example: if we want to find out which customers were attracted to a particular scheme started by the bank.
If the purpose is more metric analysis, then a star is the better option. For example: if we want to find how much the customers spent on a particular scheme on a weekly/monthly/quarterly/yearly basis, how much profit the company made, and so on.
As suggested above, a star schema is less complex because of the smaller number of joins, and query execution is much faster compared to a snowflake.
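As a rough illustration of the metric-analysis case (the schema and names below are hypothetical, not from the question), a monthly spend-per-scheme report against a star schema needs only a single join from the fact table to the scheme dimension:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_scheme (scheme_key INTEGER PRIMARY KEY, scheme_name TEXT);
    CREATE TABLE fact_transactions (
        scheme_key   INTEGER REFERENCES dim_scheme(scheme_key),
        customer_key INTEGER,
        txn_date     TEXT,   -- ISO date string
        amount       REAL
    );
    INSERT INTO dim_scheme VALUES (1, 'Gold Saver');
    INSERT INTO fact_transactions VALUES (1, 101, '2024-01-15', 250.0),
                                         (1, 102, '2024-01-20', 400.0),
                                         (1, 101, '2024-02-03', 125.0);
    """)

    # Monthly spend per scheme: one join from the fact to the dimension.
    rows = conn.execute("""
        SELECT s.scheme_name,
               strftime('%Y-%m', f.txn_date) AS month,
               SUM(f.amount)                 AS total_spend
        FROM fact_transactions f
        JOIN dim_scheme s ON s.scheme_key = f.scheme_key
        GROUP BY s.scheme_name, month
        ORDER BY month
    """).fetchall()
    print(rows)  # [('Gold Saver', '2024-01', 650.0), ('Gold Saver', '2024-02', 125.0)]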
But again, these are used according to the need of the project.
I hope this answer is helpful. Any suggestions or guidance are appreciated. :)

In relational databases there are fundamentally two types of schema (and I realise there are other edge cases): 3NF schemas and star schemas.
3NF schemas are normally found in transactional systems, and star schemas in analytical systems.
In a star schema it is possible to create snowflakes off a dimension, but this is normally bad practice and should be avoided. If you have a very specific use case and you have the knowledge and experience to know that the only way to solve it is with a snowflake, then that's fine; however, building snowflakes because you don't know how to design a star schema is not going to end well!
So a star schema with a limited number of snowflakes may be OK, but a design that has a large number of snowflakes is not a snowflake schema - it's just a badly designed star schema.

How to design a database that can handle unknown reports?

I am working on a project which stores large amounts of data on multiple industries.
I have been tasked with designing the database schema.
I need to make the database schema flexible so it can handle complex reporting on the data.
For example,
what products are trending in industry x
what other companies have a similar product to my company
how is my company website different to x company website
There could be all sorts of reports. Right now everything is vague. But I know for sure the reports need to be fast.
Am I right in thinking my best path is to try to make as many association tables as I can? The idea being (for example) if the product table is linked to the industry table, it'll be relatively easy to get all products for a certain industry without having to go through joins on other tables to try to make a connection to the data.
This seems insane though. The schema will be so large and complex.
Please tell me if what I'm doing is correct or if there is some other known solution for this problem. Perhaps the solution is to hire a data scientist or DBA whose job is to do this sort of thing, rather than getting the programmer to do it.
Thank you.
I think getting these kinds of answers from a relational/operational database will be very difficult and the queries will be really slow.
The best approach I think will be to create multidimensional data structures (in other words a data warehouse) where you will have flattened data which will be easier to query than a relational database. It will also have historical data for trend analysis.
If there is a need for complex statistical or predictive analysis, then the data scientists can use the data warehouse as their source.
Adding to Amit's answer above: the problem is that what you need from your transactional database is a heavily normalized association of facts for operational purposes. For the analytic side, you want what are effectively tagged facts.
In other words, what you want is a series of star schemas to which you can add whatever associations you want.
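As a sketch of what such a star might look like for one of the example reports ("what products are trending in industry X") - all names here are hypothetical - the fact table simply references a product dimension and an industry dimension, and the report becomes a short aggregate query:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_industry (industry_key INTEGER PRIMARY KEY, industry_name TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
    -- One fact row per observed sale/mention; new kinds of analysis become
    -- new fact tables or new dimensions without reworking the existing ones.
    CREATE TABLE fact_product_activity (
        product_key   INTEGER REFERENCES dim_product(product_key),
        industry_key  INTEGER REFERENCES dim_industry(industry_key),
        activity_date TEXT,
        units         INTEGER
    );
    """)

    # "What products are trending in industry X" is then a simple star query.
    trending = conn.execute("""
        SELECT p.product_name, SUM(f.units) AS total_units
        FROM fact_product_activity f
        JOIN dim_product  p ON p.product_key  = f.product_key
        JOIN dim_industry i ON i.industry_key = f.industry_key
        WHERE i.industry_name = ?
        GROUP BY p.product_name
        ORDER BY total_units DESC
        LIMIT 10
    """, ("Retail",)).fetchall()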

DataWarehouse - What is a good definition?

Could someone give me a good, practical definition of what a data warehouse is?
I'm surprised no one has posted Inmon's definition:
A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
From the same page you can pick up Kimball's definition:
A copy of transaction data specifically structured for query and analysis.
I think that, unfortunately, data warehousing is a wide-ranging field. There is a lot of variety and very few standard paradigms; specifically, I'm thinking of Kimball's dimensional modelling. Inmon does not have as specific a methodology as Kimball's, and thus some 3NF models may or may not conform to his principles.
Because Inmon has broadened his scope for what warehousing is meant to accomplish, it can encompass unstructured data. However, analysis of unstructured data is very different than traditional analysis.
As applied to SQL Server, typically the largest Data Warehouses on SQL Server are modelled dimensionally, because this lends itself well to the non-distributed, non-massively parallel model. Massively parallel systems like Teradata generally perform a lot better with 3NF models. These are still table-based systems with the various tables connected with foreign key constraints (perhaps not enforced, but at least logical).
Of course, we are also seeing NoSQL data processing systems like Map/Reduce which are not really databases at all in the sense of normalized, denormalized or non/poorly-normalized relational databases which we have had for 40 years now.
I just started with data warehousing and business intelligence, and looking around the web you can find some interesting links:
Get Start With Datawarehousing
I think these links could help you understand the concepts of data warehousing.
(Sorry, I'm new, so I can only post one link.)
A database optimized for retrieval: generally denormalized data, usually a star schema (but it could be a snowflake), using dimensional modeling (fact and dimension tables).
While this is not an academic definition, it might serve as a practical one. A data warehouse is a collection of datamarts and will combine datasets across the breadth of an organization.
A datamart will contain datasets specific to certain portions of the business. In the datamart you will find fact tables, measurable pieces of information, along with dimensions, attributes of your measurable pieces.
A true data warehouse will have conformed dimension tables that can be shared across datamarts.
An example...
Your company may build a datamart around sales. And another datamart around human resources. If the customer dimension table is shared across both these datamarts, it would be considered a conformed dimension. All three of these entities together would make up a data warehouse.
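A minimal sketch of the conformed-dimension idea (table names are illustrative; here the shared dimension is a date dimension rather than the customer dimension from the example above): both datamarts' fact tables key into the same dimension table, so results from the two marts can be combined on identical labels.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- A conformed dimension: built once, shared by every datamart.
    CREATE TABLE dim_date (
        date_key      INTEGER PRIMARY KEY,  -- e.g. 20240115
        calendar_date TEXT,
        month         TEXT,
        year          INTEGER
    );

    -- Sales datamart fact table.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        customer_key INTEGER,
        amount       REAL
    );

    -- Human-resources datamart fact table, reusing the same date dimension.
    CREATE TABLE fact_headcount (
        date_key   INTEGER REFERENCES dim_date(date_key),
        department TEXT,
        employees  INTEGER
    );
    """)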
As someone else stated you can find more detailed information by searching for Ralph Kimball's Data Strategies.
Definition: a data warehouse is a database used for analysis purposes rather than for transaction processing.
Check the link below for more information on data warehouses:
http://www.idatastage.com/datawarehouse/

Is database normalization still necessary?

Is database normalization still "the thing?"
When I took a databases course, we were taught all levels of normalization and were told that we must always apply them.
Now, with all the NoSQL movement, it seems normalization is no longer the thing to do?
It depends on what type of application(s) are using the database.
For OLTP apps (principally data entry, with many INSERTs, UPDATEs and DELETES, along with SELECTs), normalized is generally a good thing.
For OLAP and reporting apps, normalization is not helpful. SELECT queries will run much more quickly against a denormalized schema, which could be achieved with views.
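As a sketch of the view-based approach (schema and names are hypothetical): the OLTP tables stay normalised, and reporting queries hit one wide, denormalised shape instead of writing the joins themselves. In engines that support indexed/materialised views that shape can also be physically precomputed; a plain view like the one below only hides the joins, it does not remove their cost.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Normalised OLTP tables.
    CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders      (order_id INTEGER PRIMARY KEY,
                              customer_id INTEGER REFERENCES customers(customer_id),
                              order_date TEXT);
    CREATE TABLE order_items (order_id INTEGER REFERENCES orders(order_id),
                              product TEXT, qty INTEGER, unit_price REAL);

    -- A denormalised, report-friendly shape exposed as a view.
    CREATE VIEW v_order_report AS
    SELECT c.name, o.order_id, o.order_date,
           i.product, i.qty, i.unit_price,
           i.qty * i.unit_price AS line_total
    FROM orders o
    JOIN customers   c ON c.customer_id = o.customer_id
    JOIN order_items i ON i.order_id    = o.order_id;
    """)

    # Reporting code now sees a single flat "table".
    conn.execute("SELECT name, SUM(line_total) FROM v_order_report GROUP BY name")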
You might also find some helpful information in these very popular similar questions:
Should I normalize my DB or not?
In terms of databases, is “Normalize for correctness, denormalize for performance” a right mantra?
What is the resource impact from normalizing a database?
How to convince someone to normalize a database?
Is it really better to use normalized tables?
NoSQL is not a silver bullet: it is simply a technology that may provide a far better fit for certain circumstances. For relationally-shaped data, the RDBMS is not going away any time soon.
A rule of thumb I use when creating a database for any project, large or small: "JOINs are expensive in processing power." Tables that hold data such as usernames, addresses, etc. should generally be normalised, much as you were taught with the classic textbook examples. In recent years, though, web 2.0 sites, apps, mobile services, etc. deal with a different type of data, and with memory so abundant and cheap it can save processing power to keep it all in the same "table" rather than normalising it.
Yes: for a transactional system, always normalise, or chances are you're going to have major headaches further down the road. For a database that will be used for reporting/OLAP, denormalising the schema can be very helpful.

Overnormalization

When would a database design be described as overnormalized? Is this characterization an absolute one? Or is it dependent on the way it is used in the application? Thanks.
In the general sense, I think that overnormalized is when you are doing so many JOINs to retrieve data that it is causing notable performance penalties and deadlocks on your database, even after you've tuned the heck out of your indexes. Obviously, for huge applications and sites like MySpace or eBay, de-normalization is a scaling requirement.
As a developer for several small businesses, I tell you that in my experience it's always been easier to go from normalized -> denormalized than the other way around, and in fact going the other way around (to avoid duplication of data now that the business requirements have changed a year or so later) is much more difficult.
When I read general statements such as "you should put the address in your customers table instead of a separate address table so you can avoid the join", I shudder, because you just know that a year from now somebody's going to ask you to do something with addresses that you totally didn't foresee, like maintaining an audit trail or storing multiple addresses per customer. If your database allows you to create an indexed view, you can sidestep that issue until you get to the point where your dataset is so large that it can't possibly exist or be served by a single server or set of servers in a 1-write, many-read environment. For most of us, I don't think that scenario happens very often.
When in doubt, I aim for third normal form with some exceptions (for example, having a field contain a comma-separated list of strings because I know I'll never look at the data from the other angle). When I need to consolidate, I'll look at my views or indexes first. Hope this helps.
It's always a question of the application domain. It's usually a question of correctness, but occasionally a question of performance.
There's one case where I can think of a prima facie case of overnormalization: say you have an order + order item, and the order item references a productID and leaves pricing to product.price. Since that introduces temporal coupling, you've normalized incorrectly: a later price change retroactively affects orders that have already shipped, unless prices absolutely never change. You can certainly argue that this is simply a modeling error (as in the comments), but I see under-normalization as a modeling error in most cases, too.
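A small sketch of that trap and the usual fix (the tables are illustrative): copying the agreed unit price onto the order item at sale time looks like redundancy, but it is really a distinct fact - the price as sold - and it decouples shipped orders from later re-pricing.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        price      REAL    -- the *current* price; it can change at any time
    );

    -- Problematic: the order item leans on product.price, so re-pricing the
    -- product silently rewrites history for orders that already shipped.
    CREATE TABLE order_item_coupled (
        order_id   INTEGER,
        product_id INTEGER REFERENCES product(product_id),
        qty        INTEGER
    );

    -- Safer: capture the agreed unit price on the order item at sale time.
    CREATE TABLE order_item (
        order_id           INTEGER,
        product_id         INTEGER REFERENCES product(product_id),
        qty                INTEGER,
        unit_price_at_sale REAL
    );
    """)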
The other category is performance related. In principle, I think there are generally better solutions to performance than denormalizing data, such as materialized views, but if your application suffers from the performance consequences of many joins, it may be worth assessing whether denormalizing can help you. I think these cases are often over-emphasized, because people sometimes reach for denormalization before they properly profile their application.
People also often forget about alternatives, like keeping a canonical form of the database and using warehousing or other strategies for frequently-read, but infrequently changed data.
Normalization is absolute. A database follows Normal Forms or it does not. There are a half-dozen normal forms. Mostly, they have names like First through Fifth. Plus there's a Boyce-Codd Normal Form.
Normalization exists for precisely one purpose -- to prevent "update anomalies".
Normalization isn't subjective. It isn't a judgement. Each table and relationship among tables either does or does not follow a normal form.
Consequently, you can't be "over-normalized" or "under-normalized".
Having said that, normalization has a performance cost. Some people elect to denormalize in various ways to improve performance. The most common sensible denormalization is to break 3NF and include derived data.
A common mistake is to break 2NF and have duplicate copies of a functional dependency between a key and non-key value. This requires extra updates or -- worse -- triggers to keep the copies in parallel.
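A sketch of the contrast (hypothetical tables): a derived order total is the relatively benign kind of denormalization, while a duplicated copy of a functional dependency (customer_id determines name, yet the name is copied onto every order) needs extra updates or a trigger like the one below to stay consistent.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);

    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER REFERENCES customer(customer_id),
        customer_name TEXT,              -- duplicated copy of customer.name
        order_total   REAL DEFAULT 0     -- derived from order_line (recomputable)
    );
    CREATE TABLE order_line (
        order_id INTEGER REFERENCES orders(order_id),
        amount   REAL
    );

    -- The duplicated name now has to be kept in sync with extra updates
    -- or a trigger such as this one.
    CREATE TRIGGER sync_customer_name
    AFTER UPDATE OF name ON customer
    BEGIN
        UPDATE orders SET customer_name = NEW.name
        WHERE customer_id = NEW.customer_id;
    END;
    """)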
Denormalization of a transactional database should be a case-by-case decision.
A data warehouse, also, rarely follows any of the transactional normalization rules because it's (essentially) never updated.
"Over-normalization" could mean that a database is too slow because of a large number of joins. This may also mean that the database has outgrown the hardware. Or that the applications haven't been designed to scale.
The most common issue here is that folks try to use a transactional database for reporting while transactions are going on. The locking for transactions interferes with reporting.
"Under-normalization," however, means that there are NF violations and needless processing is being done to handle the replicated data and correct update anomalies.
When the performance cost exceeds the benefit towards the application's intended purpose.
Normalize your OLTP databases, and denormalize your OLAP databases. Each has a mission that dictates its schema. Like normalized transaction databases, data warehouses exist for a reason. A complete system needs both.
A lot of people are talking about performance. I think a key issue is flexibility. In general, the more normalized your database, the more flexible it is.
We currently use an "over-normalized" database because, in our operating environment, client requirements change on a monthly basis. By "over-normalizing" we can adapt our software accordingly, without changing the database structure.
My take on this:
Always normalize as much as you are able to. I usually go crazy on normalization and try to design something that could handle every thinkable future extension. What I end up with is a database design that is extremely flexible... and impossible to implement.
Then the real job starts: De-normalization. Here you solve what you know would be problematic to implement and/or would slow the queries down because of too many joins.
This way you know what you sacrifice to make the design usable.
Edit: Documentation! I forgot to mention that documenting the de-normalization is very important. It is extremely helpful when you take over a project to know the reason behind the choices.
Third Normal Form (3NF) is considered the optimal level of normalization for many a relational database application. This is a state in which, as Bill Kent once summarized, every "non-key field [in every table within a particular relational database management system, or RDBMS] must provide a fact about the key, the whole key, and nothing but the key." 3NF is a term that was introduced by E.F. Codd, inventor of the relational model for database management. Generally, the data that a software application depends on, especially an application used for an Online Transaction Processing (OLTP) system, will fare well in 3NF. This normal form by definition reduces database size by calling for a minimum repetition of row/column data, and maximizes query efficiency and ease of application maintenance. 3NF achieves that by requiring that a database's tables (i.e., its schema) be broken down into separate tables related by primary/foreign keys--basically until Kent's rule holds true (I've stated it this way for ease of reading, but the actual definition of 3NF is much more detailed than that).
In contrast, overnormalization implies increasing the number of joins required in a query between related tables, as a result of breaking the database schema down to a much more granular level than 3NF. However, though normalization past the 3rd degree can often be considered overnormalization, the negative connotation of the term "overnormalization" is sometimes unwarranted. Overnormalization may be desirable in applications which by design require 4NF (and beyond) due to the complexity and versatility of the application software. An example is a highly customizable and extensible commercial database program sold to end users in some industry and requiring an open API. The reverse--denormalization--can be desirable as well, most notably when designing an Online Analytical Processing (OLAP) database used strictly to summarize data from an OLTP database for querying/reporting, such as a data warehouse. In this case the data must by necessity reside in a highly denormalized format (i.e., 1NF or 2NF).
It's often under these constraints--when there are high demands for efficient querying and reporting--that we find database and application programmers calling a database "overnormalized". But as Redgate's Tony Davis once said--taking into account today's much more advanced and efficient database software and storage systems--"the performance hit from multiple joins in a query is negligible. If your database is slow, it isn’t because it is ‘over-normalized’!"
So, in conclusion, this characterization--overnormalization--isn't an absolute one, and it is dependent on the way it is used in the application. In Kent's words, "The normalization rules are designed to prevent update anomalies and data inconsistencies. . . [but] there is no obligation to fully normalize all records when actual performance requirements are taken into account. . . The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. . . [Thus,] the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications."
...or when you hit limits on the number of joins your RDBMS will do.
If performance is affected by too many joins, creating de-normalized tables for reporting purposes can speed things up. By copying the data into new tables, it may be possible to run reports with no joins at all.
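A sketch of that approach (names are hypothetical): the joins are paid once when the reporting copy is rebuilt, so the report queries themselves are single-table scans. In practice the copy would be refreshed on a schedule by an ETL job.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(customer_id),
                            order_date TEXT, amount REAL);

    -- Periodically rebuild a flat, de-normalised copy for reporting.
    CREATE TABLE report_orders AS
    SELECT o.order_id, o.order_date, o.amount, c.name, c.city
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;
    """)

    # Report query: a single-table scan, no joins at all.
    conn.execute("SELECT city, SUM(amount) FROM report_orders GROUP BY city")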
In my experience, I've never seen a database that fully normalizes postal addresses, as it's usually acceptable to store the address as a string. Ideally, there would be tables for countries, counties/states, cities, districts and streets, but I've not come across anyone who needs to report at street level, so it hasn't been necessary. The addresses have only been used for postal contact, so they are treated as a single entity.

Star-Schema Design [closed]

Is a Star-Schema design essential to a data warehouse? Or can you do data warehousing with another design pattern?
Using star schemas for a data warehouse system gets you several benefits and in most cases it is appropriate to use them for the top layer. You may also have an operational data store (ODS) - a normalised structure that holds 'current state' and facilitates operations such as data conformation. However there are reasonable situations where this is not desirable. I've had occasion to build systems with and without ODS layers, and had specific reasons for the choice of architecture in each case.
Without going into the subtleties of data warehouse architecture or starting a Kimball vs. Inmon flame war, the main benefits of a star schema are:
Most database management systems have facilities in the query optimiser to do 'Star Transformations' that use bitmap index structures or index intersection for fast predicate resolution. This means that selection from a star schema can be done without hitting the fact table (which is usually much bigger than the indexes) until the selection is resolved.
Partitioning a star schema is relatively straightforward, as only the fact table needs to be partitioned (unless you have some biblically large dimensions). Partition elimination means that the query optimiser can ignore partitions that could not possibly participate in the query results, which saves on I/O.
Slowly changing dimensions are much easier to implement on a star schema than a snowflake.
The schema is easier to understand and tends to involve fewer joins than a snowflake or E-R schema. Your reporting team will love you for this.
Star schemas are much easier to use and (more importantly) to make perform well with ad-hoc query tools such as Business Objects or Report Builder. As a developer you have very little control over the SQL generated by these tools, so you need to give the query optimiser as much help as possible. Star schemas give the query optimiser relatively little opportunity to get it wrong.
Typically your reporting layer would use star schemas unless you have a specific reason not to. If you have multiple source systems you may want to implement an Operational Data Store with a normalised or snowflake schema to accumulate the data. This is easier because an ODS typically does not do history. Historical state is tracked in star schemas where this is much easier to do than with normalised structures. A normalised or snowflaked Operational Data Store reflects 'current' state and does not hold a historical view over and above any that is inherent in the data.
ODS load processes are concerned with data scrubbing and conforming, which is easier to do with a normalised structure. Once you have clean data in an ODS, dimension and fact loads can track history (changes over time) with generic or relatively simple mechanisms; this is much easier to do with a star schema, and many ETL tools provide built-in facilities for slowly changing dimensions, so implementing a generic mechanism is relatively straightforward.
Layering the system in this way provides a separation of responsibilities: business and data cleansing logic is dealt with in the ODS, and the star schema loads deal with historical state.
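Since slowly changing dimensions come up repeatedly here, a minimal Type 2 sketch (the columns are illustrative, not from any particular tool): history is kept as extra dimension rows with a validity window, and fact rows reference the surrogate key that was current at load time, so old facts keep the attribute values that were true at the time.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Type 2 slowly changing dimension: changes create new versions (rows)
    -- instead of overwriting attributes in place.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
        customer_id  TEXT,                 -- natural/business key
        city         TEXT,
        valid_from   TEXT,
        valid_to     TEXT,                 -- '9999-12-31' marks the open row
        is_current   INTEGER
    );
    INSERT INTO dim_customer VALUES
        (1, 'C-100', 'London', '2020-01-01', '2023-06-30', 0),
        (2, 'C-100', 'Leeds',  '2023-07-01', '9999-12-31', 1);

    -- Facts point at the surrogate key that was current when they were loaded,
    -- so sales made while the customer lived in London keep that context.
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    """)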
There is an ongoing debate in the data warehousing literature about where in the data warehouse architecture the star schema design should be applied.
In short, Kimball advocates using only the star schema design in the data warehouse, while Inmon first wants to build an enterprise data warehouse using a normalized 3NF design and only later use the star schema design in the datamarts.
In addition here to you could also say that Snowflake schema design is another approach.
A fourth design could be the Data Vault Modeling approach.
Star schemas are used to enable high-speed access to large volumes of data. The high performance is enabled by reducing the number of joins needed to satisfy any query that may be made against the subject area. This is done by allowing data redundancy in dimension tables.
You have to remember that the star schema is a pattern for the top layer of the warehouse. All models also involve staging schemas at the bottom of the warehouse stack, and some also include a persistent transformed/merged staging area where all source systems are merged into a 3NF-modelled schema. The various subject areas sit above this.
Alternatives to star schemas at the top level include a variation, the snowflake schema. A newer method that may be worth some investigation is Data Vault Modelling, proposed by Dan Linstedt.
The thing about star schemas is they are a natural model for the kinds of things most people want to do with a data warehouse. For instance it is easy to produce reports with different levels of granularity (month or day or year for example). It is also efficient to insert typical business data into a star schema, again a common and important feature of a data warehouse.
You certainly can use any kind of database you want but unless you know your business domain very well it is likely that your reports will not run as efficiently as they could if you had used a star schema.
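As a small illustration of the granularity point (hypothetical tables): with a date dimension in a star schema, daily, monthly and yearly reports are the same query with a different GROUP BY column.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_date (date_key TEXT PRIMARY KEY,   -- ISO date
                           month    TEXT,
                           year     INTEGER);
    CREATE TABLE fact_sales (date_key TEXT REFERENCES dim_date(date_key),
                             amount   REAL);
    INSERT INTO dim_date VALUES ('2024-03-05', '2024-03', 2024),
                                ('2024-03-18', '2024-03', 2024),
                                ('2024-04-02', '2024-04', 2024);
    INSERT INTO fact_sales VALUES ('2024-03-05', 10), ('2024-03-18', 20), ('2024-04-02', 5);
    """)

    # Same fact table, three levels of granularity.
    for grain in ("d.date_key", "d.month", "d.year"):
        rows = conn.execute(f"""
            SELECT {grain} AS grain, SUM(f.amount)
            FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key
            GROUP BY {grain}
            ORDER BY grain
        """).fetchall()
        print(grain, rows)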
Star schemas are a natural fit for the last layer of a data warehouse. How you get there is another question. As far as I know, there are two big camps, those of Bill Inmon and Ralph Kimball. You might want to look at the theories of these two guys if/when you decide to go with a star.
Also, some reporting tools really like the star schema setup. If you are locked into a specific reporting tool, that might drive what the reporting mart looks like in your warehouse.
The star schema is a logical data model for relational databases that fits regular data warehousing needs; if a relational environment is a given, a star or snowflake schema will be a good design pattern, hard-wired into lots of DW design methodologies.
There are, however, other than relational database engines too, and they can be used for efficient data warehousing. Multidimensional storage engines can be very fast for OLAP tasks (e.g. TM1); we cannot apply star schema design in that case. Other examples requiring special logical models include XML databases and column-oriented databases (e.g. the experimental C-store).
It's possible to do without. However, you will make life hard for yourself -- your organization will want to use standard tools that live on top of DWs, and those tools will expect a star schema -- a lot of effort will be spent fitting a square peg in a round hole.
A lot of database-level optimizations assume that you have a star schema; you will spend a lot of time optimizing and restructuring to get the DB to do "the right thing" with your not-quite-star layout.
Make sure that the pros outweigh the cons.
(Does it sound like I've been there before?)
-D
There are three problems we need to solve.
1) How to get the data out of the operational source systems without putting undue pressure on them by joining tables within and between them, cleaning data as we extract, creating derivations, etc.
2) How to merge data from disparate sources - some legacy, some file based, from different departments into an integral, accurate, efficiently stored whole that models the business, and does not reflect the structures of the source systems. Remember, systems change / are replaced relatively quickly, but the basic model of the business changes slowly.
3) How to structure the data to meet specific analytical and reporting requirements for particular people/departments in the business as quickly and accurately as possible.
The solutions to these three very different problems require different architectural layers.
Staging Layer
We replicate the structures of the sources, but only changed data from the sources is loaded each night. Once the data has been taken from the staging layer into the next layer, it is dropped. Queries are single-table queries with a simple date/time filter, so there is very little effect on the source.
Enterprise Layer
This is a business-oriented 3rd normal form database. Data is extracted from the staging layer into the enterprise layer (and afterwards dropped from staging), where it is cleaned, integrated and normalised.
Presentation (Star Schema) Layer
Here, we model dimensionally to meet specific requirements. Data is deliberately de-normalised to reduce the number of joins. Hierarchies that may occupy several tables in the Enterprise Layer are collapsed into a single dimension table, and multiple transactional tables may be merged into single fact tables.
You always face these three problems. If you choose to do away with the enterprise layer, you still have to solve the second problem, but you have to do it in the star schema layer, and in my view, this is the wrong place to do it.
