I want to calculate the "stock turnover rate" KPI, which depends on two measures in two different fact tables (Amount from the sales fact and physical quantity from the inventory fact). So my question is the following:
Do I have to put the two facts in the same OLAP cube, or is there another way to do this? I ask because everyone recommends having one fact table per cube.
Are there any dimensions that the Fact tables share?
I don't think the recommendation to have one fact table per cube is correct. There is no reason why you can't have multiple fact tables or measure groups in the same cube.
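To illustrate how two fact tables can feed one KPI through a shared dimension, here is a minimal sketch in Python with an in-memory SQLite database. The table names (dim_product, fact_sales, fact_inventory), the sample rows, and the formula (total sales amount divided by average stock quantity) are all assumptions made for the example; the question does not state the exact definition of the KPI. Each fact is aggregated to the shared grain first, and only then joined, so rows from one fact do not multiply rows from the other.

```python
import sqlite3

# Hypothetical schema and data: two facts sharing one product dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
CREATE TABLE fact_inventory (product_id INTEGER, quantity REAL);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 200.0), (2, 50.0);
INSERT INTO fact_inventory VALUES (1, 10.0), (1, 20.0), (2, 25.0);
""")

# Aggregate each fact separately, then join the two results
# on the shared dimension key.
rows = conn.execute("""
WITH s AS (
    SELECT product_id, SUM(amount) AS total_amount
    FROM fact_sales GROUP BY product_id
),
i AS (
    SELECT product_id, AVG(quantity) AS avg_quantity
    FROM fact_inventory GROUP BY product_id
)
SELECT p.name, s.total_amount / i.avg_quantity AS turnover
FROM dim_product p
JOIN s ON s.product_id = p.product_id
JOIN i ON i.product_id = p.product_id
ORDER BY p.name
""").fetchall()

turnover = dict(rows)
print(turnover)
```

In a cube, the same idea would correspond to two measure groups linked to the shared dimension, with the KPI defined as a calculated measure over both.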
I have to design and build a star / snowflake schema database that will keep data about employees in a company, especially the rates that are paid to the employees. This is the first time I am experimenting with this schema type and I'm not sure which parts of the fact tables should be separate dimension tables.
I don't exactly understand the practical upsides of having this schema, is it actually that much easier to perform queries on this type of database? Or is it only about the performance?
Below I am attaching the draft schema of my database. I would like to know what I should modify to make it as good as possible. I also have two specific questions:
1. Should the rate column be just a value in the fact table? Or should it be a foreign key to a dim_rate table?
2. What about date dimensions? Should they just be values in specific tables? Or should they always be foreign keys? If they should be foreign keys, should there be one dim_date table or a table for each type of date?
As an example for question 2, let's take the dim_employee table and the employment_date and end_of_employment columns. I have these dates as values in the dim_employee table, but I can think of two other ways to handle this data: either foreign keys to a dim_date table, or separate fact tables fact_start_of_employment and fact_end_of_deployment. I know I will need different kinds of reports, for example reports showing how many people started work and how many left the company in different date intervals (e.g. in December 2020). Honestly, at this point I have no idea which option would be best and easiest to work with in the future.
Also as I said - I would love any constructive criticism of this schema, even if it means completely redesigning it.
I would merge both fact tables because I think there is a strong relation between rate and position. But that's how I look at this data without knowing all the details.
I would also create a date dimension and a form_of_employment dimension.
That would result in 4 dimensions:
dim_employee
dim_date
dim_position
dim_form_of_employment
And a single fact table with these columns:
fact_assignment
employee_id
date_id
position_id
form_of_employment_id
rate
student
This setup results in a proper star and very simple SQL for your reports.
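As a rough sketch of what that looks like in practice, the following Python snippet builds the four dimensions and the fact_assignment table in an in-memory SQLite database and runs one report. The table and key column names come from the list above; the descriptive columns (name, year, month, title, form), the sample rows, and the choice of report are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_position (position_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dim_form_of_employment (
    form_of_employment_id INTEGER PRIMARY KEY, form TEXT);
CREATE TABLE fact_assignment (
    employee_id INTEGER REFERENCES dim_employee,
    date_id INTEGER REFERENCES dim_date,
    position_id INTEGER REFERENCES dim_position,
    form_of_employment_id INTEGER REFERENCES dim_form_of_employment,
    rate REAL,
    student INTEGER
);
-- Hypothetical sample data.
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO dim_date VALUES (202012, 2020, 12), (202101, 2021, 1);
INSERT INTO dim_position VALUES (1, 'analyst'), (2, 'developer');
INSERT INTO dim_form_of_employment VALUES (1, 'full-time'), (2, 'contract');
INSERT INTO fact_assignment VALUES
    (1, 202012, 1, 1, 30.0, 0),
    (2, 202012, 2, 1, 50.0, 0),
    (3, 202012, 2, 2, 40.0, 1),
    (1, 202101, 1, 1, 32.0, 0);
""")

# Report: average rate per position in December 2020.
# Every join goes straight from the fact table to a dimension.
report = conn.execute("""
SELECT p.title, AVG(f.rate) AS avg_rate
FROM fact_assignment f
JOIN dim_position p ON p.position_id = f.position_id
JOIN dim_date d ON d.date_id = f.date_id
WHERE d.year = 2020 AND d.month = 12
GROUP BY p.title
ORDER BY p.title
""").fetchall()

print(report)
```

Other reports follow the same pattern: filter and group on dimension attributes, aggregate the rate from the fact table.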
For every BI or reporting system, you have a process of designing your tables and building them based on that design. This process is called dimensional modeling. Some call it data warehouse design, which is the same thing. Dimensional modeling is the process of thinking through and designing the data model, including the tables and their relationships. As you can see, there is no technology involved in dimensional modeling. It all happens in your head and ends with sketching diagrams on paper. Dimensional modeling is not the diagram in which tables are connected to each other; it is the process of producing it.
A star schema is the best way of designing a data model for reporting. You will get the best performance and also flexibility using such a model.
In this case the employee dimension will be a historical dimension, also known as a slowly changing dimension.
You can use a bridge table.
In a classic dimensional schema, each dimension attached to a fact table has a single value consistent with the fact table’s grain. But there are a number of situations in which a dimension is legitimately multivalued.
As in your example, an employee can have many positions.
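A simplified sketch of a bridge table for the employee-to-position case is shown below, using Python with an in-memory SQLite database. The table name bridge_employee_position, the column names, and the sample rows are hypothetical; a production design would often also carry validity dates or weighting factors on the bridge rows, which are left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_position (position_id INTEGER PRIMARY KEY, title TEXT);
-- The bridge holds one row per (employee, position) pair,
-- so one employee can be linked to many positions.
CREATE TABLE bridge_employee_position (
    employee_id INTEGER REFERENCES dim_employee,
    position_id INTEGER REFERENCES dim_position,
    PRIMARY KEY (employee_id, position_id)
);
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_position VALUES (10, 'analyst'), (20, 'team lead');
INSERT INTO bridge_employee_position VALUES (1, 10), (1, 20), (2, 10);
""")

# List every position held by each employee by going through the bridge.
positions = conn.execute("""
SELECT e.name, p.title
FROM dim_employee e
JOIN bridge_employee_position b ON b.employee_id = e.employee_id
JOIN dim_position p ON p.position_id = b.position_id
ORDER BY e.name, p.title
""").fetchall()

print(positions)
```

Note that summing a fact measure through a bridge like this can count the same fact row once per position, so such queries need care.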
In the multidimensional cube, I have two facts at different grains, named FactTestScore and FactSubjectScore. These two facts share two dimensions, DimStudent and DimSubject, and FactTestScore has an additional dimension, DimTest. I've deployed the cube without any error.
When I create a report in Power BI with a matrix of Subject, Test, Student and their respective scores, all tests get cross joined with all subjects. Can you please point out where I am making a mistake?
In Power BI, filtering across relationships can only propagate in the direction specified.
In your diagram, there is no path from one dimension table to another without going the "wrong way". In fact, it doesn't look like they can even filter the fact tables unless the notation is the reverse of what I'm accustomed to.
So, I have this dataset here: https://www.kaggle.com/johnolafenwa/us-census-data#adult-training.csv
I am new to data warehouses. I understand what a measure is, but I'm not sure what qualifies a column as a measure for a fact table. In this dataset, which columns can be measures?
From what I have seen, measures are things like Count() or Avg().
Measures are numerical values that aggregate functions operate on. For example, a sales revenue column is a measure because you can total or average it (though not only those two; which aggregation you use depends on your need).
When dimensions and measures work together, they help answer complex business questions.
A metric is a quantifiable measure that is used to track and assess the status of a specific process. That said, here is the difference: a measure is a fundamental or unit-specific term—a metric can literally be derived from one or more measures.
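The distinction can be shown with a small Python sketch. The rows and column names below are made up for illustration: revenue and units are measures that get aggregated, and the average selling price is a metric derived from the two aggregated measures.

```python
# Hypothetical fact rows: "product" is a dimension value,
# "revenue" and "units" are measures.
sales = [
    {"product": "A", "revenue": 120.0, "units": 4},
    {"product": "A", "revenue": 60.0, "units": 2},
    {"product": "B", "revenue": 90.0, "units": 3},
]

# Measures: numeric columns that are aggregated directly.
total_revenue = sum(row["revenue"] for row in sales)
total_units = sum(row["units"] for row in sales)

# Metric: a value derived from one or more aggregated measures.
avg_selling_price = total_revenue / total_units

print(total_revenue, total_units, avg_selling_price)
```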
A fact table is used in the dimensional model in data warehouse design. A fact table is found at the center of a star schema or snowflake schema surrounded by dimension tables.
A fact table consists of facts of a particular business process e.g., sales revenue by month by product. Facts are also known as measurements or metrics. A fact table record captures a measurement or a metric.
I'm a newbie to data warehousing and I've been reading articles and watching videos on the principles but I'm a bit confused as to how I would take the design below and convert it into a star schema.
In all the examples I've seen the fact table references the dim tables, so I'm assuming the questionId and responseId would be part of the fact table? Any advice would be much appreciated.
I can't see the image at the moment (blocked by my firewall at the office), but I'll try to give you some ideas.
The general idea is to organize your measurable 'facts' into what are called fact tables. There are 3 main types of facts, but that is a topic for a different day (I'd be happy to go into this if needed). Each of these facts is what you'd see in the center of a typical 'star schema'. The other attributes within the fact tables are typically FK references to the dimension tables.
Regarding dimensions, these are groups of attributes that share commonality (the most notable being a calendar dimension). This is important because when you're doing analysis across multiple facts the dimensions are what you use to connect them.
If you consider this simple example: A product is ordered and then shipped. We could have 2 transaction facts (one that contains the qty ordered - measure, type of product ordered - dimension, and transaction date - dimension). We'd also have a transaction fact for the product shipping ( qty shipped - measure, product type - dimension, and ship date - dimension). This simple schema could be used to answer questions like 'how many products by product type last quarter were ordered but not shipped'.
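A minimal sketch of that ordered-versus-shipped question, using Python with an in-memory SQLite database, might look like this. The table names (fact_order, fact_shipment), the column names, the sample rows, and the fixed quarter boundaries are assumptions; for brevity the product type and dates are stored directly in the fact rows rather than as keys into separate dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_order (product_type TEXT, order_date TEXT, qty_ordered INTEGER);
CREATE TABLE fact_shipment (product_type TEXT, ship_date TEXT, qty_shipped INTEGER);
-- Hypothetical sample data.
INSERT INTO fact_order VALUES
    ('book', '2020-10-05', 10),
    ('book', '2020-11-20', 5),
    ('toy',  '2020-12-01', 7),
    ('toy',  '2020-08-01', 4);   -- outside the quarter, ignored below
INSERT INTO fact_shipment VALUES ('book', '2020-11-25', 12);
""")

quarter_start, quarter_end = "2020-10-01", "2020-12-31"

# Aggregate each fact by the shared dimension (product type) first,
# then combine the two results. A LEFT JOIN keeps product types
# that were ordered but have no shipments at all.
not_shipped = conn.execute("""
WITH o AS (
    SELECT product_type, SUM(qty_ordered) AS qty
    FROM fact_order
    WHERE order_date BETWEEN ? AND ?
    GROUP BY product_type
),
s AS (
    SELECT product_type, SUM(qty_shipped) AS qty
    FROM fact_shipment
    WHERE ship_date BETWEEN ? AND ?
    GROUP BY product_type
)
SELECT o.product_type, o.qty - COALESCE(s.qty, 0) AS qty_not_shipped
FROM o
LEFT JOIN s ON s.product_type = o.product_type
ORDER BY o.product_type
""", (quarter_start, quarter_end, quarter_start, quarter_end)).fetchall()

print(not_shipped)
```

The two facts are never joined row to row; they only meet after each has been summarized by the common dimension.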
Hopefully this helps you get started.
Usually a fact table is used to aggregate measures - which are always numeric. Examples would be: sales dollars, distances, weights, number of items sold.
The type of data you drew here doesn't have any cut-and-dried "measure", so you need to decide what you want to measure. Is it the number of answers per question? Is it how many responses per sample?
This is often called an Event Fact table (if you want to search for other examples). And you need some sort of reporting requirements before you can turn it into a star schema. So it isn't an easy answer...
It's easy :) Responses is the fact; everything else is a dimension. Your schema is already a star, because the fact connects directly to every dimension. As a counterexample, suppose you redesigned it so that addresses were stored in a separate table related to sample. You would then have to add the address table's id to the responses table to keep a star schema.
I have three facts in my warehouse that can be related events in my relational db. They are PhoneContact, Appointment and Donation. A PhoneContact can result in an Appointment and/or a Donation. I already have the Appointment and Donation facts with their related dimensions and am now adding PhoneContact to my warehouse. The common dimension between all of these facts is the Donor dimension, which describes who received the call and made the appointment and donation.
If a PhoneContact did result in an Appointment and/or Donation, I'd like to join those facts, but my understanding is that joining facts is a no-no. How would I best relate those facts? Right now I can't think of anything better, so I'm considering putting AppointmentID and DonationID fields in my PhoneContact fact.
More info: there are about 1.2M PhoneContacts per month but only about 100k of those result in an Appointment or Donation, so aside from not joining facts, just putting 1.1M NULLs per month into the table so I can get the 100K other events seems less than great.
There seems to be a trade-off here between space and performance. Keeping the facts separate and joining them at query time would save space. On the other hand, a denormalized (already joined) table might give better performance on complicated GROUP BY queries that require scanning entire tables.
Note that joining can be less expensive in some scenarios:
If your tables are sorted on the join key, joining will be less expensive (because the database can use a merge join algorithm).
If your queries return a small number of rows (e.g. give me information about John), joining will be affordable with good indexes.
If you think your use case consistently falls outside the above categories and you can easily buy more disk space, creating an already joined table can help increase query speed.
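To show why sorted inputs make joining cheap, here is a small Python sketch of a sort-merge join. It is an illustration of the general technique, not of how any particular database implements it, and the sample keys and values are made up. Both inputs are read once from start to end, with no nested scan of the whole other table.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs already sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1            # no match for this left row, advance left
        elif left_key > right_key:
            j += 1            # no match for this right row, advance right
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    result.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical data, sorted by donor key.
contacts = [(1, "call-a"), (2, "call-b"), (2, "call-c"), (4, "call-d")]
donations = [(2, "donation-x"), (3, "donation-y"), (4, "donation-z")]

joined = merge_join(contacts, donations)
print(joined)
```

If the inputs are not already sorted, the cost of sorting them first has to be added, which is why the physical ordering of the tables matters.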