One to Many relationship between 2 dimensions or between fact and dimension - data-modeling

I am new to dimensional data modeling. I have come across scenarios where I had to create a one-to-many relationship between two dimensions, and in one scenario I created a one-to-many relationship from the fact table to a dimension table.
I understand that these two scenarios are not ideal for dimensional modeling, but I want to understand the disadvantages of having them in a dimensional model.
Thanks

It is hard to advise unless you share the actual problem, i.e. what business problem/questions are you trying to model for?
If you need a 1-* relationship between dimensions, that probably means you should consider denormalizing them into a single dimension, because a functional dependency between those attributes likely already exists.
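To make the denormalization concrete, here is a minimal sketch using Python's sqlite3 module. The product and category tables, their columns and the sample rows are invented for illustration; they are assumptions, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical snowflaked pair: one category has many products.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category (category_key)
    );
    INSERT INTO dim_category VALUES (1, 'Beverages'), (2, 'Snacks');
    INSERT INTO dim_product VALUES (10, 'Cola', 1), (11, 'Tea', 1), (12, 'Chips', 2);

    -- Denormalized dimension: the category name is repeated on every product row.
    CREATE TABLE dim_product_flat AS
    SELECT p.product_key, p.product_name, c.category_name
    FROM dim_product p
    JOIN dim_category c USING (category_key);
""")

# Facts would now need a single join to reach both product and category attributes.
flat_rows = conn.execute(
    "SELECT product_key, product_name, category_name "
    "FROM dim_product_flat ORDER BY product_key"
).fetchall()
```

The flat table trades some redundancy (the category name is stored once per product) for a single-hop join from any fact table.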
A 1-* relationship between a dimension and a fact table, in which one dimension row is related to many fact rows, is typical; the other way around is uncommon.
Data modelling, and the dimensional approach in particular, is closely tied to the business problem. It is easier (and essential) to start from there and then model the data to answer the business questions at hand.

Related

Generic vs Conformed dimensions

I'm new to dimensional modeling. I'm reading Kimball's "The Data Warehouse Toolkit".
As far as I understand, Conformed Dimensions are a good thing and a key concept for integrating different fact tables. Usually you have a separate fact table per business process, and if someone wants to make decisions based on multiple processes, in most cases conformed dimensions allow Drilling Across instead of building a Consolidated Fact Table. That looks pretty straightforward.
But how do Abstract Generic Dimensions differ from Conformed Dimensions?
They look the same to me. For some reason Abstract Generic Dimensions are considered an anti-pattern. The referenced example says it is bad to use the same geo-location dimension table for employees, customers and vendors, for two reasons: the attributes may differ, and the dimension table grows large. But don't Conformed Dimensions have the same downsides?
The Kimball article on abstract generic dimensions doesn't say anything about dates.
A Customer dimension shared across sales and marketing facts is a conformed dimension.
A single Person dimension containing employees and customers is an example of an abstract generic dimension (which might be "bad" if there is very little commonality in attributes or processes).
However IMHO nothing in Kimball is a hard and fast rule - I see it as guidance, and the note on abstract generic dimensions to me is just a warning to do proper analysis before jumping into using one dimension to model two things that seem the same but probably aren't from a data detail perspective.
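As a rough illustration of drilling across a conformed Customer dimension, here is a minimal sketch with Python's sqlite3 module. The table names, columns and sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One conformed dimension shared by two hypothetical fact tables.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
    CREATE TABLE fact_marketing (customer_key INTEGER, emails_sent INTEGER);
    INSERT INTO dim_customer VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 30.0);
    INSERT INTO fact_marketing VALUES (1, 3), (2, 4), (2, 1);
""")

# Drill across: aggregate each fact table separately to the same dimension
# attribute (region), then merge the two result sets on that attribute.
drill_across = conn.execute("""
    WITH sales AS (
        SELECT d.region, SUM(f.amount) AS total_sales
        FROM fact_sales f JOIN dim_customer d USING (customer_key)
        GROUP BY d.region
    ),
    marketing AS (
        SELECT d.region, SUM(f.emails_sent) AS total_emails
        FROM fact_marketing f JOIN dim_customer d USING (customer_key)
        GROUP BY d.region
    )
    SELECT s.region, s.total_sales, m.total_emails
    FROM sales s JOIN marketing m ON m.region = s.region
    ORDER BY s.region
""").fetchall()
```

This only works because both fact tables agree on what a customer and a region are, which is the point of conforming the dimension.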

Is it ok to have multiple fact tables that are connected to the same dimension tables without using a link table?

Let's say that my database model has three fact tables that share the same dimension tables (so-called conformed dimensions). I know that I shouldn't connect fact tables directly (a direct connection can cause double-counting of some facts), only through the dimension tables. What I want to know is whether I can connect every fact table to every dimension table without problems. I have searched a lot for an answer and opinions are divided. Some say there is no problem; others say that the fact tables can then become associated with each other and circular references can occur, and that in these cases a so-called link table should be used. Is this link table really necessary, or can the model work without it?
If a dimension can describe an aspect of the fact event, you should connect it so it can be used in analytics.
However, you shouldn't force a relationship to connect a fact to a dimension that it does not need. That will make your model confusing and bloated.
You are correct that you should not connect facts directly. The model does not function that way. You'll want to read up on the purpose of facts and dimensions to understand why.
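The double-counting risk of a direct fact-to-fact join can be shown with a small sketch in Python's sqlite3 module. The two fact tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical fact tables that share a customer key.
    CREATE TABLE fact_orders (customer_key INTEGER, order_amount INTEGER);
    CREATE TABLE fact_returns (customer_key INTEGER, return_amount INTEGER);
    INSERT INTO fact_orders VALUES (1, 100), (1, 200);
    INSERT INTO fact_returns VALUES (1, 10), (1, 20), (1, 30);
""")

# Fact-to-fact join: each of the 2 order rows pairs with each of the 3 return
# rows, so every order amount is summed three times.
(inflated_orders,) = conn.execute("""
    SELECT SUM(o.order_amount)
    FROM fact_orders o
    JOIN fact_returns r ON r.customer_key = o.customer_key
""").fetchone()

# Querying the fact table on its own gives the real total.
(true_orders,) = conn.execute(
    "SELECT SUM(order_amount) FROM fact_orders"
).fetchone()
```

Here `inflated_orders` comes out at three times `true_orders`, because the join multiplies rows before the aggregate runs.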
You should be able to navigate between related events through the common dimensions, but that is not a circular reference. A circular reference prevents a value from being returned because there is not a bottom to the relationship.
If entities have a many to many relationship, you can use link/bridge tables to expand the relationship into multiple one to many/one to one relationships. That is complicated to model and too much to explain as part of this question.
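For readers who want a starting point, here is a minimal bridge-table sketch in Python's sqlite3 module. The account/customer scenario, the table names and the weighting column are all assumptions chosen for illustration; allocating with a weight is one possible way to avoid double counting, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_balance (account_key INTEGER, balance REAL);
    -- Bridge: one row per (account, customer) pair. It turns the many-to-many
    -- relationship into two one-to-many relationships. The weights for each
    -- account sum to 1 so that balances are allocated, not duplicated.
    CREATE TABLE bridge_account_customer (
        account_key  INTEGER,
        customer_key INTEGER,
        weight       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO fact_balance VALUES (100, 1000.0), (200, 400.0);
    INSERT INTO bridge_account_customer VALUES
        (100, 1, 0.5), (100, 2, 0.5),   -- joint account shared by Ann and Bob
        (200, 2, 1.0);                  -- account owned by Bob alone
""")

per_customer = conn.execute("""
    SELECT c.name, SUM(f.balance * b.weight) AS allocated_balance
    FROM fact_balance f
    JOIN bridge_account_customer b USING (account_key)
    JOIN dim_customer c USING (customer_key)
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The allocated balances add up to the total in the fact table, which would not be the case if the joint account were counted in full for both owners.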
If you want more, please post some of your model so we can focus on the specific needs of your question.
I implemented the model (in MS SQL), and I'm sharing my experience here in case anyone is interested in the future.
In the end I created five fact tables (the model turned out to be more complex), all connected directly to all six existing dimension tables. I didn't use a link table.
This model has been in use for almost five months now, and so far no problems have appeared.

What is the best way to realize this database

I have to build a system with different kinds of users, and I am thinking of implementing it this way:
A user table with only id, email and password.
Two different tables related to the user table in a 1-to-1 relationship. Each table defines the attributes specific to one kind of user.
Is this the best way to implement it? Should I use the InnoDB storage engine?
If I implement it this way, how can I handle the tables in the Zend Framework?
I can't answer the second part of your question, but the pattern you describe is called supertype/subtype in data modelling. Whether this is the right choice can't be answered without knowing more about the differences between these user types and how they will be used in the application. There are different approaches to converting logical super/subtypes into physical tables.
Here are some relevant links:
http://www.sqlmag.com/article/data-modeling/implementing-supertypes-and-subtypes
and the next one about pitfalls and (mis)use of subtyping
http://www.ocgworld.com/doc/OCG_Subtyping_Techniques.pdf
In general I am, from a pragmatic point of view, very reluctant to follow your choice, and I most often opt to create one table containing all columns. In most cases there are a number of places where the application needs to show all users in some sort of listing, with specific columns for specific types (empty if not applicable for that type). Splitting quickly leads to non-straightforward queries and all sorts of extra code to deal with the different tables, so being 'conceptually correct' is just not worth it.
Two reasons for me to still split the subtypes into different tables are if the subtypes are so truly different that it makes no logical sense to have them in one table, and if the number of rows is so enormous that the overhead of the 'unneeded' columns in a single table actually starts to matter.
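To show what the listing query looks like with split subtype tables, here is a minimal sketch using Python's sqlite3 module (SQLite rather than MySQL/InnoDB, purely to keep it self-contained). The two user kinds, the profile tables and their columns are hypothetical, since the question does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Supertype table, as described in the question.
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, password TEXT);
    -- Hypothetical subtype tables, each 1-to-1 with users.
    CREATE TABLE customer_profile (
        user_id       INTEGER PRIMARY KEY REFERENCES users (id),
        shipping_city TEXT
    );
    CREATE TABLE admin_profile (
        user_id      INTEGER PRIMARY KEY REFERENCES users (id),
        access_level INTEGER
    );
    INSERT INTO users VALUES (1, 'a@example.com', 'hash1'), (2, 'b@example.com', 'hash2');
    INSERT INTO customer_profile VALUES (1, 'Lisbon');
    INSERT INTO admin_profile VALUES (2, 9);
""")

# A listing of all users needs one LEFT JOIN per subtype table; columns that
# do not apply to a given user come back as NULL (None in Python).
listing = conn.execute("""
    SELECT u.id, u.email, c.shipping_city, a.access_level
    FROM users u
    LEFT JOIN customer_profile c ON c.user_id = u.id
    LEFT JOIN admin_profile a ON a.user_id = u.id
    ORDER BY u.id
""").fetchall()
```

The result has the same shape, NULLs included, as a single wide table would give, which is part of why the single-table option is often the pragmatic choice.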
On the PHP side you can use the Doctrine 2 ORM. It's easy to integrate with ZF, and you could easily implement this table structure as inheritance in your Doctrine mapping.

Is it better to model databases after their applications, or after their components?

I'm structuring a database, and found that I have two different objects I'm trying to model. Each one consists of the same things (a varchar and a couple of foreign keys), and will do so for the foreseeable future.
I'm (as of now) going to put them in the same table, with an extra 'type' field, but I was wondering if there's a standard practice for this.
Edit: To clarify, they'll both be used in the same way as well, with the only differences being where/when they're displayed.
The rule is as follows:
If the objects are truly different and will act in different ways, regardless of how similar they are in implementation, you should put them in two different tables.
Apples and oranges.
If the objects are at any point being compared to one another in the same context or in aggregate, then you store the base class in one table with a code field, and store the subclasses in two more tables using foreign keys.*
A "fruit report" for apples and oranges. How many fruits do we have? How many fruits of any kind come from California?
*NB: There are actually many ways to attack the subclassing problem in a database. The point isn't so much which strategy you use as whether you treat them as a common supertype or not.
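The "fruit report" case can be sketched as follows with Python's sqlite3 module. The column names and sample rows are made up, and this is only one of the subclassing strategies mentioned in the note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base table with a code field for the type, plus one table per subclass.
    CREATE TABLE fruit (fruit_id INTEGER PRIMARY KEY, fruit_type TEXT, origin TEXT);
    CREATE TABLE apple (
        fruit_id INTEGER PRIMARY KEY REFERENCES fruit (fruit_id),
        variety  TEXT
    );
    CREATE TABLE orange (
        fruit_id    INTEGER PRIMARY KEY REFERENCES fruit (fruit_id),
        is_seedless INTEGER
    );
    INSERT INTO fruit VALUES
        (1, 'apple', 'California'), (2, 'orange', 'California'), (3, 'orange', 'Florida');
    INSERT INTO apple VALUES (1, 'Fuji');
    INSERT INTO orange VALUES (2, 1), (3, 0);
""")

# Aggregate questions only touch the base table, whatever the subtype.
(total_fruits,) = conn.execute("SELECT COUNT(*) FROM fruit").fetchone()
(california_fruits,) = conn.execute(
    "SELECT COUNT(*) FROM fruit WHERE origin = 'California'"
).fetchone()
```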
There are different patterns you can use to design a database. For example, you can represent objects with a table-per-type design or with a table-per-hierarchy design. There are pros and cons to each, but I haven't seen one stand out as the "right" way.
However, with your design, if the objects are essentially the same, I would try to use the same object and ditch the type column. Or if they are truly different, it seems like the foreign key columns would be related to different tables, so you'd want to have different tables with clearly defined Primary Key, Foreign key associations.

Fact table with multiple facts

I have a dimension (SiteItem) that has two important facts:
perUserClicks
perBrowserClicks
However, within this dimension I have groups of values based on an attribute column (let's call the groups AboveFoldItems, LeftNavItems, OnTheFlyItems, etc.), and each group has additional facts specific to it:
AboveFoldItems: eyeTime, loadTime
LeftNavItems: mouseOverTime
OnTheFlyItems: doesn't have any extra, but may in the future
Is the following fact table schema ok?
DateKey
SessionKey
SiteItemKey
perUserClicks
perBrowserClicks
eyeTime
loadTime
mouseOverTime
It seems a little wasteful since only some columns pertain to some dimension keys (the irrelevant facts are left NULL). But... this seems like it would be a common problem, so there should be a common solution for this, right?
I'm generally in agreement with Damir's answer on this, but because the fact table is very narrow in your particular case, there is still merit to Aaron's advocacy for keeping the NULLs.
We have several star schemas in particular subject areas with multiple fact tables that share most (if not all) of the dimensions (conformed and internal). The limited-scope dimensions are not considered "conformed" across the enterprise, but they are what we would call "shared internal" dimensions.
Now typically, if the data is loaded contemporaneously so that the dimension hasn't changed, you can join both fact tables on the keys, but in general, of course, you cannot join two different star schemas on the dimension keys if they are surrogates in traditional slowly changing dimensions. In general, you have to join separate stars on the natural keys or "business keys" within the dimension and not on surrogates (except usually in the special case of the date dimension where it is unchanging and only has a natural key).
Note that when you do join the two stars, you have to use a LEFT JOIN, in which case you WILL produce NULLs which you will still probably have to take account of - so you're actually getting back to the original model you had with NULLs! ;-)
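A minimal sketch of joining two stars on the natural key, using Python's sqlite3 module. The dimension is assumed to be a type 2 slowly changing dimension; the table names, the `item_code` business key and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Type 2 dimension: item 'A' has two versions, hence two surrogate keys.
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_code TEXT);
    CREATE TABLE fact_clicks (item_key INTEGER, clicks INTEGER);
    CREATE TABLE fact_timing (item_key INTEGER, load_time REAL);
    INSERT INTO dim_item VALUES (1, 'A'), (2, 'A'), (3, 'B');
    INSERT INTO fact_clicks VALUES (1, 10), (3, 7);  -- loaded against version 1 of 'A'
    INSERT INTO fact_timing VALUES (2, 1.5);         -- loaded against version 2 of 'A'
""")

# Joining the facts on the surrogate key finds nothing for item 'A'.
(surrogate_matches,) = conn.execute("""
    SELECT COUNT(*)
    FROM fact_clicks c JOIN fact_timing t ON t.item_key = c.item_key
""").fetchone()

# Resolve each star to the natural key first, then LEFT JOIN the results.
by_natural_key = conn.execute("""
    WITH clicks_by_item AS (
        SELECT d.item_code, SUM(f.clicks) AS clicks
        FROM fact_clicks f JOIN dim_item d USING (item_key)
        GROUP BY d.item_code
    ),
    timing_by_item AS (
        SELECT d.item_code, AVG(f.load_time) AS load_time
        FROM fact_timing f JOIN dim_item d USING (item_key)
        GROUP BY d.item_code
    )
    SELECT c.item_code, c.clicks, t.load_time
    FROM clicks_by_item c
    LEFT JOIN timing_by_item t ON t.item_code = c.item_code
    ORDER BY c.item_code
""").fetchall()
```

Item 'B' has no timing rows, so the LEFT JOIN returns NULL (None) for its `load_time`, which is the NULL the answer warns about.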
The benefit of the extra fact table is more obvious when your tables are wide with a smaller set of keys, so that vertically partitioning the data saves space and gives a cleaner logical model. This is especially true when the keys are only really shared up to a point. Having a dummy key or a NULL key is definitely not a good idea; it usually points to a dimensional modeling problem.
However, as Aaron says, if you push it to extremes, you can have a single fact column in each fact table with shared keys, which means the key overhead dwarfs the fact cost and you really do end up in a disguised EAV model.
I would also check whether you are in Kimball's situation of "too few dimensions". It seems you must have good dimensional attributes lumped into the SessionKey and SiteItemKey. Without seeing your entire model and requirements it's hard to say, but I would think you would have some user demographics in a low-cardinality or even snowflake dimension without the full Session or Site dimension.
There isn't really an elegant solution: you either have nullable columns or you use an EAV solution. I posted about EAV before (and generated a lot of comments that might be worth reading):
What is so bad about EAV, anyway?
I am a fan of that model in some scenarios, but if your dimensions/attributes do not change frequently, it can be a lot of extra work for nothing. NULL values in a column are not really wasteful as long as the surrounding code can deal with them appropriately.
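One thing "dealing with NULLs appropriately" involves is knowing how aggregates treat them. Here is a small sketch using Python's sqlite3 module with a cut-down version of the fact table from the question; the table name and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- eyeTime only applies to some site items; the rest are left NULL.
    CREATE TABLE fact_site_item (SiteItemKey INTEGER, perUserClicks INTEGER, eyeTime REAL);
    INSERT INTO fact_site_item VALUES (1, 5, 2.0), (2, 3, 4.0), (3, 8, NULL);
""")

# AVG and COUNT(column) skip NULLs; COUNT(*) counts every row.
avg_nulls_skipped, measured_rows, all_rows = conn.execute(
    "SELECT AVG(eyeTime), COUNT(eyeTime), COUNT(*) FROM fact_site_item"
).fetchone()

# Treating the missing measurement as zero pulls the average down.
(avg_with_zero,) = conn.execute(
    "SELECT AVG(COALESCE(eyeTime, 0)) FROM fact_site_item"
).fetchone()
```

Whether the NULL-skipping average or the zero-filled one is the right answer depends on what a missing measurement means for the business question.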
You could have more than one fact table: factperUserClicks, factperBrowserClicks, factEyeTime, etc.
Each of these would have DateKey, SessionKey, SiteItemKey. This way only dimension keys that "make sense" appear with each fact.
Ideally, there should be no NULLS in the DW -- if you keep them in the same fact table, using zeros may be more appropriate.
As far as saving disk space, I do not see an ideal solution -- but, in a DW one is supposed to trade space for speed and (query) simplicity anyway.

Resources