Dimensional Modeling an Event Hierarchy - data-modeling

In my current world, an employer can grant stock to an employee under a stock plan. Not only stock; other types such as restricted stock units can be granted too. Each grant record has its own attributes (quantity granted, fair market value when granted, etc.). Every grant record has multiple vesting records (e.g. out of 100 shares granted, 50 can vest in 2021, 30 in 2022 and 20 in 2023). Finally, each vesting record can have multiple "planned distribution" records (i.e. out of the 50 shares vesting in 2021, 20 can get exercised in Dec 2021, 20 in Jan 2022 and the remainder in Feb 2022). So the hierarchy looks like this:
Employee -> n Grants -> n Vesting -> n Planned Distribution
What is the prescribed way of dimensionally modeling this?
Option #1: Treat Grant, Vesting and Planned Distribution as separate dimensions and have a separate factless fact that relates all of these plus the employee. (The question here is whether they can really be treated as independent dimensions, since a child cannot exist and is meaningless without its parent.)
Option #2: Have only a Planned Distribution fact and collapse Grant and Vesting into it (like the Kimball order/order-line concept), so that employee and employer are the only dimensions. (Drawback: what if Grant and Vesting are needed on their own in other facts?)
Option #3: Treat Grant, Vesting and Planned Distribution as separate dimensions, relate them by carrying natural keys from parent to child, but also keep a separate factless fact that relates the dimension keys of each of these for point-in-time analysis.
Database: Snowflake (cloud)
Thanks in advance
Sunil

You are approaching this from the wrong direction. A dimensional model is based on the business questions you want to be able to answer, not on the data and its structures that you happen to have in your source system.
So you need to define the measures that you want to report on and their grain (which will give you your facts), and the entities that you want to use to filter and aggregate those facts (which will give you your dimensions).
Once you have this information it will become easier (though not necessarily easy!) to design your model and the answers to your questions will either become much more obvious or, possibly, irrelevant.
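For illustration only: if the key business question turned out to be something like "how many shares are planned for distribution, by employee and by month", an Option #2-style design would put the fact at the planned-distribution grain and carry the grant and vesting natural keys as degenerate dimensions. A minimal sketch in Snowflake SQL, with hypothetical table and column names:

    -- Illustrative sketch only; table and column names are hypothetical.
    CREATE TABLE dim_employee (
        employee_key   INTEGER PRIMARY KEY,   -- surrogate key (e.g. from a sequence)
        employee_id    VARCHAR,               -- natural key from the source system
        employee_name  VARCHAR,
        employer_name  VARCHAR
    );

    CREATE TABLE dim_date (
        date_key       INTEGER PRIMARY KEY,   -- e.g. 20211215
        calendar_date  DATE,
        calendar_year  INTEGER,
        calendar_month INTEGER
    );

    -- Grain: one row per planned distribution.
    CREATE TABLE fact_planned_distribution (
        employee_key       INTEGER REFERENCES dim_employee (employee_key),
        planned_date_key   INTEGER REFERENCES dim_date (date_key),
        grant_id           VARCHAR,       -- degenerate dimension: natural key of the grant
        vesting_id         VARCHAR,       -- degenerate dimension: natural key of the vesting tranche
        grant_type         VARCHAR,       -- e.g. 'STOCK', 'RSU'
        quantity_planned   NUMBER(18,4),
        fair_market_value  NUMBER(18,4)
    );

If other business questions later need Grant or Vesting in their own right (Option #3 territory), they can be promoted to proper dimensions then; the point is that the grain and the measures drive that decision, not the shape of the source hierarchy.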

Related

ER diagram relationship for user admin

I am designing a database project for gym management. There are two users: the clerk, who can add, remove and edit all trainers, centers and members, and the member, who can only see and edit certain attributes related to himself. Member, Center and Trainer are three entities in the ER diagram, so the question is: should I introduce an entity for the clerk, and if so, should it have a relationship with any of the three entities described above?
I wouldn't split up the two entities just because they have different permissions in your system.
I recommend you focus on the concepts behind the entities:
First, if all attributes are equal, I would consider building one entity out of the two. If you end up with multiple columns that are mostly null, it may have been a mistake to "merge" the two entities.
In addition, check whether there is an umbrella name you can give the merged entity. For example, if you have the two entities Manager and Employee and you want to merge them, I would perhaps just call it User and check whether the properties still make sense in that context.
Last but not least, think about how the entities will be used later in development. If splitting them up means you need two joins instead of one, that could be an argument for merging them. On the other hand, if later in development your Clerk entity is extended by a few columns, the merged table might end up with null columns again.
A general answer is not possible since the domain is unclear; just collect the arguments for and against merging the entities and compare them.
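If the merged route is taken, one common shape is a single table with a role discriminator, so the member-only columns simply stay null for clerks and permissions are handled in the application (or via database roles). This is just a sketch with hypothetical names, not a prescription:

    -- Illustrative sketch only; table and column names are hypothetical.
    CREATE TABLE center (
        center_id  INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL
    );

    -- One table for both clerks and members, distinguished by a role column.
    CREATE TABLE app_user (
        user_id           INTEGER PRIMARY KEY,
        full_name         VARCHAR(100) NOT NULL,
        role              VARCHAR(10)  NOT NULL CHECK (role IN ('CLERK', 'MEMBER')),
        -- member-only attributes, NULL for clerks:
        membership_start  DATE,
        home_center_id    INTEGER REFERENCES center (center_id)
    );

If the member-only (or clerk-only) columns keep multiplying, that is exactly the signal described above that the merge was a mistake and the entities should be split again.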

How to handle one to many in a star-schema?

I need a way to associate one or more fractional owners with an aircraft in the following star-schema (see diagram below).
Is the following diagram an example of the correct way to model that relationship in a data warehouse?
The most common need I'll likely have is a need to report on aircraft by total number of fractional owners. Is there a more "correct" way of modeling this?
Joining 2 fact tables is a bad idea. Many BI tools won't even let you do it (only 1:M relations are allowed).
What you have is a classic case of modeling a many-to-many attribute in a star schema. The most common solution is to create a bridge table that associates (in your case) aircraft and owners (and which might also change over time). "Owner" then becomes a dimension, connected to the fact table via the bridge.
The problem with bridge tables is that they seriously complicate the model and make it harder to use. Whenever possible, I try to avoid them. Two common design techniques I often use:
Count the number of fractional owners per aircraft in the data warehouse and add it as a fact to the fact table. The advantage of this approach is that it's the simplest and most robust design. The disadvantage is that if you need to see the names of the owners, you won't be able to (although you can partially address that by concatenating the owners into a string and adding it as an attribute).
Alternatively, you can re-grain your fact table. Currently the fact table grain is aircraft; you can change it to "aircraft ownership" (i.e. aircraft + owner). Owners can then be added as a dimension and connected to the fact table. Advantages: the model is still simple (no bridge) and robust, and you have full visibility of the owners and their attributes. Disadvantages: the new grain might be less intuitive for analysts, and the size of the fact table increases (e.g. with an average of 3 owners per aircraft, the fact table triples). Also, any additive facts such as costs will have to be allocated per owner (e.g. split equally, or by ownership % if you have the data) to avoid double-counting. A sketch of this second approach follows below.
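To make the re-graining concrete, here is a minimal sketch with hypothetical table and column names (generic SQL; cost allocation is assumed to be by ownership percentage):

    -- Illustrative sketch only; table and column names are hypothetical.
    CREATE TABLE dim_aircraft (
        aircraft_key  INTEGER PRIMARY KEY,
        tail_number   VARCHAR(20),
        model         VARCHAR(50)
    );

    CREATE TABLE dim_owner (
        owner_key   INTEGER PRIMARY KEY,
        owner_name  VARCHAR(100)
    );

    -- Grain: one row per aircraft + fractional owner.
    CREATE TABLE fact_aircraft_ownership (
        aircraft_key    INTEGER REFERENCES dim_aircraft (aircraft_key),
        owner_key       INTEGER REFERENCES dim_owner (owner_key),
        ownership_pct   NUMERIC(5,2),    -- share held by this owner
        allocated_cost  NUMERIC(18,2)    -- total cost * ownership_pct, to avoid double-counting
    );

    -- "Total number of fractional owners per aircraft" is then a simple aggregate:
    SELECT a.tail_number, COUNT(*) AS fractional_owners
    FROM fact_aircraft_ownership f
    JOIN dim_aircraft a ON a.aircraft_key = f.aircraft_key
    GROUP BY a.tail_number;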

Can I use a counter in a database Many-to-Many field to reduce lookups?

I am trying to figure out the fastest way to access data stored in a junction object. The example below is analogous to my problem, but in a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User, Product and Rating. User has a many-to-many relationship to Product, with Rating as the junction/'through' class.
The Rating object stores the answers to several questions, which are integer ratings on a scale of 1-5 (example questions: how is the quality of the Product, how is the value of the Product, how user-friendly is the Product). For simplification, assume every User rates every Product they buy.
Now here is the calculation I want to perform: for a User, calculate the average rating of all the Products they have bought (that is, the average over the ratings from all Users who bought each product, one of which will be this User's own rating). Then we can tell the user: "On average, you buy products rated 3/5 for value by all customers who bought that product."
The simple and slow way is just to iterate over all of a user's Rating objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: on the Product class, keep a counter of the number of Ratings that selected each number (e.g. how many Users rated this product 3/5 for value). If you increment that counter every time a Product is rated, then computing the average for a given Product just requires checking the 5 counters for each rating criterion.
Is this a valid technique? Is it commonly employed, and is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw.
This is normal. It is ultimately caching: encoding state redundantly to benefit some patterns of usage at the expense of others. Of course, it is also a complexification.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you encode state away from the most straightforward form, e.g. via denormalization.
(Sometimes redundant designs, including ones like yours, are called "denormalized" even when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes; see "Cross Table Dependency/Constraint in SQL Database". Indeed, one could reasonably describe your case as normalization without preserving FDs (functional dependencies): start with a table holding a user's id and other columns, their ratings (a relation) and the counter. The ratings functionally determine the counter, since counter = select count(*) from ratings. Decomposing gives user etc. + counter, i.e. table User, and user + ratings, which ungroups to table Rating.)
Do you have a suggestion as to the best term to use when googling this?
A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)
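As a concrete (and purely illustrative) version of the counter idea discussed above, here is a sketch in generic SQL with hypothetical names; the per-score counters are the redundant, "cached" state and must be maintained in the same transaction as each new rating:

    -- Illustrative sketch only; table and column names are hypothetical.
    CREATE TABLE product (
        product_id     INTEGER PRIMARY KEY,
        name           VARCHAR(100),
        -- redundant counters: how many ratings gave each "value" score
        value_count_1  INTEGER DEFAULT 0 NOT NULL,
        value_count_2  INTEGER DEFAULT 0 NOT NULL,
        value_count_3  INTEGER DEFAULT 0 NOT NULL,
        value_count_4  INTEGER DEFAULT 0 NOT NULL,
        value_count_5  INTEGER DEFAULT 0 NOT NULL
    );

    CREATE TABLE rating (
        user_id      INTEGER NOT NULL,
        product_id   INTEGER NOT NULL REFERENCES product (product_id),
        value_score  INTEGER NOT NULL CHECK (value_score BETWEEN 1 AND 5),
        PRIMARY KEY (user_id, product_id)
    );

    -- Writing a rating now touches two tables (in one transaction):
    INSERT INTO rating (user_id, product_id, value_score) VALUES (42, 7, 3);
    UPDATE product SET value_count_3 = value_count_3 + 1 WHERE product_id = 7;

    -- Reading the average "value" rating per product needs no scan of rating:
    SELECT product_id,
           (1*value_count_1 + 2*value_count_2 + 3*value_count_3
            + 4*value_count_4 + 5*value_count_5) * 1.0
           / NULLIF(value_count_1 + value_count_2 + value_count_3
                    + value_count_4 + value_count_5, 0) AS avg_value_rating
    FROM product;

The trade-off is exactly the caching described in the answer: reads get cheaper, writes get more complex, and the counters can silently drift if any write path forgets to maintain them.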

What is the recommended database design for 2 entities that share most of their attributes?

In a financial analysis program there is an account object, and a loan account object that extends it. The loan object only has a couple more attributes than the account. Which of the following would be the recommended DB design?
Option 1: a table for the account, and another table for the extra loan attributes, with a 1-to-1 relationship.
Option 2: two separate tables.
Option 3: one table that has all fields, ignoring the loan attributes for basic accounts.
You should go for the first approach.
For the relationship cardinality, consider what data will be stored in each of the objects and whether you are going to maintain history for it.
As per my understanding of the objects described above, you should go for a one-to-many relationship.
You're talking about implementing polymorphism, which, while not directly supported in a relational database, is a good way to assess the pros and cons. Option 1 is similar to subclassing, where the loan account inherits everything from the account and extends it. Use it if you want the account to act as a superclass: in other words, if you add a new kind of account, say a credit card, you will add another table for it and relate it to account as well. That means the account table must remain generic (account number, balance, etc.).
Option 2 treats the two types of accounts like separate classes. Use it if they won't share many CRUD operations, because then even a simple balance update in response to a transaction has to have different code somewhere.
Option 3 is the generalist approach. Its big advantage is simplicity in modeling and querying. Its big disadvantage is that you won't be able to implement NOT NULL constraints on columns that need to be there for some account types but not others.
There's a 4th option that combines the first two and provides a solution similar to the party abstraction for people and organizations. You have 3 tables: 1) an account table that handles the basic elements such as account id, balance and owner; 2) a loan account table that has the additional columns and a reference to the account id; and 3) a basic account table that just has a reference to the account id. This may seem like overkill, but it sets up a system you can extend without modification. I've used the pattern many times.
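A minimal sketch of Option 1 (the subclass-style layout), in generic SQL with hypothetical names; the 4th option simply adds a third, nearly empty basic_account table in the same style:

    -- Illustrative sketch only; table and column names are hypothetical.
    -- Superclass table: attributes common to every kind of account.
    CREATE TABLE account (
        account_id      INTEGER PRIMARY KEY,
        account_number  VARCHAR(30) NOT NULL UNIQUE,
        owner_name      VARCHAR(100) NOT NULL,
        balance         NUMERIC(18,2) DEFAULT 0 NOT NULL
    );

    -- Subclass table: the few loan-specific attributes, 1-to-1 with account.
    CREATE TABLE loan_account (
        account_id     INTEGER PRIMARY KEY REFERENCES account (account_id),
        interest_rate  NUMERIC(5,3) NOT NULL,
        maturity_date  DATE NOT NULL
    );

    -- A full view of loan accounts is one join:
    SELECT a.account_number, a.balance, l.interest_rate, l.maturity_date
    FROM account a
    JOIN loan_account l ON l.account_id = a.account_id;

Adding a credit-card account later means adding another small table that references account, leaving the generic account table untouched.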
