Data warehouse design, multiple dimensions or one dimension with attributes? - sql-server

Working on a data warehouse and am looking for suggestions on having numerous dimensions versus one large dimension with attributes.
We currently have DimEntity, DimStation, DimZone, DimGroup, DimCompany and have multiple fact tables that contain the keys from each of the dimensions. Is this the best way, or would it be better to have just one dimension, DimEntity, and include station, zone, group and company as attributes of the entity?
We have already gone the route of separate dimensions with our ETL, so it isn't as if the work to populate and build out the star schema is an issue. Performance and maintainability are important. These dimensions do not change often, so we're looking for guidance on the best way to handle them.
Fact tables have over 100 million records. The entity dimension has around 1000 records and the others listed have under 200 each.

Without knowing your star schema table definitions, data cardinality, etc, it's tough to give a yes or no. It's going to be a balancing act.
For read performance, the fact table should be as skinny as possible and the dimension should be as short (low row count) as possible. Consolidating dimensions typically means that the fact table gets skinnier while the dimension record count increases.
If you can consolidate dimensions without adding a significant number of rows to the consolidated dimension, it may be worth looking into. It may be that you can combine the low cardinality dimensions into a junk dimension and achieve a nice balance. Dimensions with high cardinality attributes shouldn't be consolidated.
Here's a good Kimball University article on dimensional modeling. Look specifically where he addresses centipede fact tables and how he recommends using junk dimensions.
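To make that concrete, here is a minimal sketch of a junk dimension, assuming DimZone and DimGroup each carry a single descriptive name column (the table and column names below are invented for illustration):

    -- Hypothetical junk dimension combining two low-cardinality dimensions.
    -- The fact table then carries one ZoneGroupKey instead of separate ZoneKey and GroupKey columns.
    CREATE TABLE DimZoneGroup (
        ZoneGroupKey INT IDENTITY(1,1) PRIMARY KEY,
        ZoneName     NVARCHAR(50) NOT NULL,
        GroupName    NVARCHAR(50) NOT NULL
    );

    -- Populate with either the full cross join (fine here: under 200 x 200 rows)
    -- or only the combinations that actually occur in the source data.
    INSERT INTO DimZoneGroup (ZoneName, GroupName)
    SELECT z.ZoneName, g.GroupName
    FROM DimZone AS z
    CROSS JOIN DimGroup AS g;

The fact table gets skinnier (one surrogate key instead of two) while the junk dimension stays small, which is the balance described above.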

Related

Star and snowflake schema in OLAP systems

I was of the impression that in OLAP, we try to store data in a denormalized fashion to reduce the number of joins and make query processing faster. Normalization, which avoids data redundancy, was more for OLTP systems.
But then again, two of the common modelling approaches (star and snowflake schema) are essentially normalized schemas.
Can you help me connect the dots?
Actually, that's very perceptive, and it's something the vast majority of people simply accept without questioning. The truth is that a star schema is only partially denormalized: the dimension tables are highly denormalized; they typically come from joining a lot of related tables together into one. A well-designed fact table, however, is normalized: each record is a set of values identified by a single, unique primary key composed of a combination of foreign keys.
Snowflake schemas are, as you surmised, even more normalized. They effectively take the dimension tables and break them down into smaller tables that are joined back together when needed. While there are constant arguments over whether this is better or worse than a star, many folks believe that these are inexpensive joins and, depending on your thinking, may be worth it.
Initially, snowflakes were sold as a way of saving disk space, because they do take up less room than denormalized dimension tables, but disk space is rarely an issue nowadays.
I personally prefer a hybrid approach that allows me to build a few levels of dimension tables that can ultimately provide referential integrity to both my atomic-level data and my aggregate fact tables.
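To make the contrast concrete, here is a rough sketch (all table and column names are invented, not taken from the question): the star version keeps one wide, denormalized product dimension, while the snowflake version normalizes the same attributes into a chain of related tables.

    -- Star: one denormalized dimension; category and department are repeated on every product row.
    CREATE TABLE DimProduct (
        ProductKey     INT PRIMARY KEY,
        ProductName    NVARCHAR(100),
        CategoryName   NVARCHAR(50),
        DepartmentName NVARCHAR(50)
    );

    -- Snowflake: the same attributes broken out into normalized tables joined at query time.
    CREATE TABLE DimDepartment (
        DepartmentKey  INT PRIMARY KEY,
        DepartmentName NVARCHAR(50)
    );

    CREATE TABLE DimCategory (
        CategoryKey   INT PRIMARY KEY,
        CategoryName  NVARCHAR(50),
        DepartmentKey INT REFERENCES DimDepartment (DepartmentKey)
    );

    CREATE TABLE DimProductSnowflaked (
        ProductKey  INT PRIMARY KEY,
        ProductName NVARCHAR(100),
        CategoryKey INT REFERENCES DimCategory (CategoryKey)
    );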
When people use the term "normalised" they normally mean something that is in, or at least close to, 3rd normal form (3NF).
Unless you mean something significantly different by the term normalised, neither Star nor Snowflake schemas are normalised. Why do you think they are normalised?

What is the number of columns that makes a table really big?

I have two tables in my database, one for logins and a second for user details (the database has more than just these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber, ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo, ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure because this may make the table big.
So this leads to my question: what number of columns makes a table big? Is there a certain or approximate number that makes a table too big and makes us stop adding columns and create another table, or is it up to the programmer to decide?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
The point here is, we can't really give you any advice, since the number of tables, columns and relations alone isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is exceeding the maximum number of columns that the database engine can support for a single table. For SQL Server, that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting it up by columns) might be advisable. One of the first a beginner should learn about is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non-normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and on their size and scope.
I take it that the login table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index alone rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to join two tables.
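As a minimal sketch of that idea, assuming a single combined Users table with hypothetical column names, a covering index on the login columns lets the login query be answered from the index pages alone:

    -- Hypothetical combined table holding both login and detail columns.
    -- The INCLUDE clause makes the index "cover" the login query.
    CREATE NONCLUSTERED INDEX IX_Users_Login
        ON dbo.Users (Email)
        INCLUDE (PasswordHash, PhoneNumber);

    -- This query can be satisfied entirely from IX_Users_Login,
    -- without reading the wide base rows that also hold Job, City, Gender, etc.
    SELECT PasswordHash, PhoneNumber
    FROM dbo.Users
    WHERE Email = 'someone@example.com';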

Database - fact table and dimension table

When reading a book about business objects, I came across the terms fact table and dimension table. Is it standard for every database to have fact tables and dimension tables, or is it just for business objects design? I am looking for an explanation that differentiates between the two and explains how they are related.
Edited:
Why can't a query just get the required data from the fact table? What happens if all the information is stored in one fact table alone? What advantages do we get by creating separate fact and dimension tables and joining them?
Sorry for too many questions at a time but I would like to know about the inter-relations and whys.
Dimension and Fact are key terms in OLAP database design.
A fact table contains data that can be aggregated.
Measures are aggregated data expressions (e.g. sum of costs, count of calls, ...).
A dimension contains data that is used to generate groups and filters.
A fact table without dimension data is useless. An example: "the sum of orders is 1M" is not information, but "the sum of orders from 2005 to 2009" is.
There are a lot of BI tools that work with these concepts (e.g. Microsoft SSAS, Tableau Software) and languages (e.g. MDX).
Sometimes it is not easy to know whether a piece of data is a measure or a dimension. For example, when analyzing revenue, both scenarios are possible:
3 measures: net profit, overheads, interest
1 measure: profit, and 1 dimension: profit type (with 3 elements: net, overhead, interest)
The BI analyst is the one who determines the best design for each solution (see the sketch below).
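As a sketch of those two options (table and column names are invented), the same revenue data could be modelled either as three measure columns or as one measure plus a profit-type dimension:

    -- Option 1: three measures as separate columns on the fact table.
    CREATE TABLE FactRevenueWide (
        DateKey   INT NOT NULL,
        NetProfit DECIMAL(18, 2),
        Overheads DECIMAL(18, 2),
        Interest  DECIMAL(18, 2)
    );

    -- Option 2: a single measure plus a profit-type dimension, one fact row per type.
    CREATE TABLE DimProfitType (
        ProfitTypeKey INT PRIMARY KEY,
        ProfitType    NVARCHAR(20) NOT NULL   -- 'Net', 'Overhead', 'Interest'
    );

    CREATE TABLE FactRevenueNarrow (
        DateKey       INT NOT NULL,
        ProfitTypeKey INT NOT NULL REFERENCES DimProfitType (ProfitTypeKey),
        Amount        DECIMAL(18, 2) NOT NULL
    );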
EDITED due to the question also being edited:
An OLAP solution usually has a semantic layer. This layer provides the OLAP tool with information about which elements are fact data, which elements are dimension data, and how the tables are related. Unlike OLTP systems, an OLAP database is not required to be properly normalized. For this reason, you can take dimension data from several tables, including fact tables. A dimension that takes its data from a fact table is called a fact dimension or degenerate dimension.
There are a lot of concepts that you should keep in mind when designing OLAP databases: "STAR schema", "SNOWFLAKE schema", "surrogate keys", "parent-child hierarchies", ...
It is standard in a data warehouse to have fact tables and dimension tables. A fact table contains the data that you are measuring, for instance the values you are summing. A dimension table contains data that you don't want to constantly repeat in the fact table, for example product data, statuses, customers, etc. They are related by keys: in a star schema, each row in the fact table contains the key of a row in each dimension table.
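For example (the table and column names below are made up), a typical star-schema query takes its measure from the fact table and its filtering and grouping attributes from the dimensions:

    -- Hypothetical star-schema query: SalesAmount comes from the fact table,
    -- while the date and product dimensions provide the filter and the grouping.
    SELECT d.CalendarYear,
           p.ProductName,
           SUM(f.SalesAmount) AS TotalSales
    FROM FactSales  AS f
    JOIN DimDate    AS d ON d.DateKey    = f.DateKey
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    WHERE d.CalendarYear BETWEEN 2005 AND 2009
    GROUP BY d.CalendarYear, p.ProductName;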

Is it normal to have a table with about 40-50 columns in a database?

Depends on your data model. It is somewhat "neater" to have data broken down into multiple tables that are related to each other, but it may also be that your data simply cannot be broken down, or that it makes no sense to do so.
If you want fewer columns just "for the sake of it", and there is no significant performance degradation, there's no need. If you find yourself using fewer columns than there are in the table, break it down...
Yes, if those 40-50 columns are all dependent on the key, the whole key, and nothing but the key of the table.
It is not uncommon for a database to be de-normalised to improve performance: munging tables together results in fewer joins during queries.
So denormalised tables tend to have more columns, and duplicate data can become an issue, but sometimes that's the only way to get the performance that you need.
I seem to get asked that question at every job interview I go to:
When would you denormalise a database?
Depends on what you call normal. If you are a big enterprise corporation, it's not normal, because you have way too few columns.
But if you find it hard to work with that many columns, you probably have a problem and need to do something about it: either abstract the many columns away or split up your data model to something more manageable.
It doesn't sound very normalised, so you might want to look into normalisation. But it really depends on what you're storing, I suppose...
I don't know about "normal", but it should not be causing any problems. If you have many "optional" columns, that are null most of the time, or many fields are very large and not often queried, then maybe the schema could be normalized or tuned a bit more, but the number of columns itself is not an issue.
The number of columns has no relationship to whether the data is normalized or not. It is the content of the columns which will tell you that. Are the columns things like
Phone1, Phone2, Phone3? Then certainly the table is not normalized and should be broken apart. But if they are all different items which are all in a one-to-one relationship with the key value, then 40-50 columns can be normalized.
This doesn't mean you always want to store them in one table though. If the combined size of those columns is larger than the actual bytes allowed per row of data in the database, you might be better off creating two or more tables in a one-to-one relationship with each other. Otherwise you will have trouble storing the data if all the fields are at or near their max size. And if some of the fields are not needed most of the time, a separate table may also be in order for them.

How efficient is a details table?

At my job, we have a pseudo-standard of creating one table to hold the "standard" information for an entity, and a second table, named like 'TableNameDetails', which holds optional data elements. On average, every row in the main table will have about 8-10 detail rows.
My question is: What kind of performance impacts does this have over adding these details as additional nullable columns on the main table?
8-10 detail rows or 8-10 detail columns?
If it's rows, then you're mixing apples and oranges, as a one-to-many relationship cannot be flattened out into columns.
If it's columns, then you're talking about vertical partitioning. For large and very large tables, moving seldom-referenced columns into Extra or Details tables (i.e. partitioning the columns vertically into 'hot' and 'cold' tables) can have significant, even huge, performance benefits. A narrower table means a higher density of data per page, which in turn means fewer pages needed for frequent queries, less IO, and better cache efficiency: all goodness.
Mileage may vary, depending on the average width of the 'details' columns and how 'seldom' the columns are accessed.
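A rough sketch of that vertical split, with invented table and column names: the narrow 'hot' table holds the frequently read columns and the wide 'cold' table holds the seldom-used ones in a one-to-one relationship.

    -- 'Hot' table: narrow, frequently queried columns.
    CREATE TABLE dbo.Customer (
        CustomerId INT PRIMARY KEY,
        Name       NVARCHAR(100) NOT NULL,
        Status     TINYINT NOT NULL
    );

    -- 'Cold' table: wide, seldom-referenced columns, at most one row per customer.
    CREATE TABLE dbo.CustomerDetails (
        CustomerId INT PRIMARY KEY REFERENCES dbo.Customer (CustomerId),
        Notes      NVARCHAR(MAX) NULL,
        Photo      VARBINARY(MAX) NULL
    );

    -- Frequent queries touch only the narrow table; the occasional detail lookup pays for one join.
    SELECT c.Name, c.Status, d.Notes
    FROM dbo.Customer AS c
    LEFT JOIN dbo.CustomerDetails AS d ON d.CustomerId = c.CustomerId
    WHERE c.CustomerId = 42;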
I'm with Remus on all the "depends", but would just add that after choosing this design for a table/entity, you must also have a good process for determining what is "standard" and what is "details" for an entity.
Misplacing something as a detail which should be standard is probably the worst mistake, because you can't require a row to exist as easily as you can require a column to exist (big, complex trigger code). Setting a default on a type of row is a lot harder (big, complex constraint code). And indexing is not easy either (a sparse index, maybe?).
Misplacing something as a standard which should be a detail is less of a mistake, just taking up extra row space and potentially not being able to have a meaningful default.
If your details are very weakly structured, you could consider using an XML column for the "details" and still be able to query them using XPath/XQuery.
As a general rule, I would not use this pattern for every entity table, but only entity tables which have certain requirements and usage patterns which fit this solution's benefits well.
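A sketch of the XML suggestion above, with invented names: the weakly structured details live in one XML column and can still be filtered and projected with the XML data type's XQuery methods.

    -- Hypothetical entity table with weakly structured details stored as XML.
    CREATE TABLE dbo.Widget (
        WidgetId INT PRIMARY KEY,
        Name     NVARCHAR(100) NOT NULL,
        Details  XML NULL
    );

    -- Project one detail value as a column and filter on another via XQuery.
    SELECT WidgetId,
           Details.value('(/details/color)[1]', 'nvarchar(50)') AS Color
    FROM dbo.Widget
    WHERE Details.exist('/details[finish = "matte"]') = 1;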
Is your details table an entity value table? In that case, yes you are asking for performance problems.
What you are describing is an Entity-Attribute-Value design. They have their place in the world, but they should be avoided like the plague unless absolutely necessary. The analogy I always give is that they are like drugs: in small quantities and in select circumstances they can be beneficial; too much will kill you. Their performance will be awful, they will not scale, and you will not get any sort of data integrity on the values, since they are all stored as strings.
So, the short answer to your question: if you never need to query for specific values, never need to make a columnar report of a given entity's attributes, don't care about data integrity, and never do anything other than spit the entire wad of data for an entity out as a list, they're fine. If you actually need to use them, however, whatever query you write will not be efficient.
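To give a concrete picture of why the queries hurt, here is a sketch of an EAV-style details table with invented names: even reading three attributes back as columns takes a pivot (or a join) per attribute, and every value is stored as a string.

    -- Hypothetical entity-attribute-value details table.
    CREATE TABLE dbo.EntityDetails (
        EntityId       INT NOT NULL,
        AttributeName  NVARCHAR(50) NOT NULL,
        AttributeValue NVARCHAR(400) NULL,   -- everything is a string: no type safety
        PRIMARY KEY (EntityId, AttributeName)
    );

    -- Rebuilding a columnar view of an entity needs one MAX(CASE ...) per attribute.
    SELECT EntityId,
           MAX(CASE WHEN AttributeName = 'City'   THEN AttributeValue END) AS City,
           MAX(CASE WHEN AttributeName = 'Gender' THEN AttributeValue END) AS Gender,
           MAX(CASE WHEN AttributeName = 'Job'    THEN AttributeValue END) AS Job
    FROM dbo.EntityDetails
    GROUP BY EntityId;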
You are mixing two different data models - a domain specific one for the "standard" and a key/value one for the "extended" information.
I dislike key/value tables except when absolutely required. They run counter to the concept of an SQL database and generally represent an attempt to shoehorn object data into a data store that can't conveniently handle it.
If some of the extended information is very often NULL you can split that column off into a separate table. But if you do this to two different columns, put them in separate tables, not the same table.
