So, I have this dataset here: https://www.kaggle.com/johnolafenwa/us-census-data#adult-training.csv
I am new to data warehouses. I understand what a measure is, but I'm not sure what qualifies as a measure for a fact table. In this dataset, which columns can be measures?
From what I have seen, measures are things like Count() or Avg(), etc.
Measures are numerical values that mathematical functions work on. For example, a sales revenue column is a measure because you can total or average the data (and not only total or average it; it depends on your need).
When dimensions and measures work together, they help answer complex business questions.
A metric is a quantifiable measure that is used to track and assess the status of a specific process. That said, here is the difference: a measure is a fundamental or unit-specific value, while a metric can be derived from one or more measures.
A fact table is used in the dimensional model in data warehouse design. A fact table is found at the center of a star schema or snowflake schema surrounded by dimension tables.
A fact table consists of the facts of a particular business process, e.g., sales revenue by month by product. Facts are also known as measurements or metrics. A fact table record captures a measurement or a metric.
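To make this concrete, here is a minimal sketch of measures in DAX, assuming a hypothetical Sales fact table with a Revenue column (the table and column names are illustrative, not taken from the dataset above):
Total Revenue = SUM ( Sales[Revenue] )        // additive measure: can be summed across any dimension
Average Revenue = AVERAGE ( Sales[Revenue] )  // same column, different aggregation
Sales Count = COUNTROWS ( Sales )             // even a plain row count can serve as a measure
In the census dataset, numeric columns such as age or hours-per-week could play this role, while categorical columns (occupation, education) are dimension attributes.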
I am new to Power BI and database management, and I want to clarify for myself how Power BI works, with reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I thought was good practice in data modelling.
I want to aggregate the value column of end_table over the grouping column Year. I am surprised that, to my understanding, Power BI sums up the value column within the end table, when I would expect the aggregation to follow the grouping variable in the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
# main_table: one row per firm-year; FK_id references end_table (Jahre = years)
main_table <- data.frame(id = 1:20, FK_id = sample(1:2, 20, replace = TRUE), Jahre = 2016:2020)
main_table <- rbind(main_table, data.frame(id = 21:25, FK_id = sample(2:3, 5, replace = TRUE), Jahre = 2015))
# end_table: the 1-side of the relationship, holding the value to aggregate
end_table <- data.frame(id = 1:3, value = c(10, 20, 30))
The first 5 rows of the data, including all columns, look like this:
If I take out all row-specific information and sum over value, it will always show the sum of the end table, which is 60, in each year.
Making the connection bi-directional does not help; it just sums the end_table values that exist in each year. I do get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value]).
I am just wondering whether there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to the main table using RELATED() when it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to the fact table(s).
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows in a visual and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need to write a measure that aggregates with Year as the grouping level. Since main_table carries the Year column and each of its rows links to a value in end_table, you can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )  // for each main_table row in the current filter context, fetch the related end_table value, then sum
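With this measure (given the sample data above), putting Jahre on the rows of a visual and SumValue in the values area should yield a per-year total rather than the flat 60, because the iteration happens over the main_table rows instead of the end_table rows.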
I want to calculate the "stock turnover rate" KPI, which depends on two measures in two different fact tables (Amount from the sales fact and physical quantity from the inventory fact). So, my question is the following:
Do I have to group the two facts into the same OLAP cube, or is there another way to do this, given that everyone recommends having one fact table per cube?
Are there any dimensions that the Fact tables share?
I don't think the recommendation to have one fact table per cube is correct. There is no reason why you can't have multiple fact tables or measure groups.
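As a hedged sketch of how the KPI could then be composed in DAX (the table and column names are assumptions, both fact tables are assumed to share conformed dimensions such as Date and Product, and the exact turnover formula depends on your KPI definition):
Sales Amount = SUM ( Sales[Amount] )         // measure from the sales fact
Inventory Qty = SUM ( Inventory[Quantity] )  // measure from the inventory fact
Stock Turnover = DIVIDE ( [Sales Amount], [Inventory Qty] )  // the shared dimensions slice both facts consistently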
I'm wondering what are best practices around recording stocks and flows. Do you store only flows, and calculate stocks? Or do you store both?
It seems like the important things to persist are the flows (for example, in a bank database, the debits and credits to an account); the stocks (funds remaining) can be calculated from these. But if there are lots of bank accounts and I want a table of multiple bank accounts with funds remaining, then I would have to recalculate this amount for each one. This seems quite slow.
On the other hand, I thought one of the main goals of databases is to not have duplicated data.
Is there a general practice around storing stocks? Should this be a calculated field, or rather inserted by program logic?
In database design, we have Derived Data:
A table can have derived columns, which are columns for which values are computed, based on the values of other table columns. If all columns are derived, it is said to be a derived table.
For example:
Student Age,
Account Balance,
Number of likes or up-votes on posts and comments (as on Stack Overflow).
In this case we have 2 options with pros and cons:
Delete the derived data and calculate it on demand
pros: we do not have any redundancy in our database design.
cons: we have to calculate the aggregation data (Count, Sum, Avg, ...) in most queries.
Use derived data instead of calculating it
pros: we have all aggregation data ready and do not need to calculate it.
cons: we have a little redundancy.
cons: we have to update the derived data whenever the original data changes.
Therefore we have a trade-off between options 1 and 2. We should estimate their costs in our application and choose one of them.
First: redundancy
In my view, redundancy is not the important factor in this trade-off, because there is not that much duplicated data; we only add an extra field (an Integer or Big Integer).
Second: cost
I think we should compare the costs of these options:
when deleting derived data: the performance cost of retrieving the aggregation data in queries;
when using aggregation columns: the cost of updating the aggregation columns whenever the original data changes.
So, how can we estimate these costs in our application? There are some evaluation parameters that relate directly to the cost:
the number of records in the original table (and secondary table);
the number of inserts into the original tables in a given period of time;
the number of updates (updates and deletes) in the original table in a given period of time;
the number of selects of the original data (or secondary table), containing aggregation data, in a given period of time;
and many other parameters along these lines.
Finally, for more formal approaches, I recommend reading DAX Patterns.
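For the bank-account example specifically, here is a hedged sketch of option 1 (store only the flows, compute the stock) as a DAX measure, assuming a hypothetical Transactions fact with signed Amount values and a related 'Date' dimension:
Balance =
CALCULATE (
    SUM ( Transactions[Amount] ),            // sum all flows (credits positive, debits negative)
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= MAX ( 'Date'[Date] ) // ...from the beginning of time up to the current date
    )
)
This trades query-time work for zero stored redundancy; option 2 would instead persist a balance column and update it on every transaction.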
I'm a newbie to data warehousing and I've been reading articles and watching videos on the principles but I'm a bit confused as to how I would take the design below and convert it into a star schema.
In all the examples I've seen the fact table references the dim tables, so I'm assuming the questionId and responseId would be part of the fact table? Any advice would be much appreciated.
I can't see the image at the moment (blocked by my firewall at the office), but I'll try to give you some ideas.
The general idea is to organize your measurable 'facts' into what are called fact tables. There are 3 main types of facts, but that is a topic for a different day (I'd be happy to go into this if needed). Each of these facts is what you'd see at the center of a typical 'star schema'. The other attributes within the fact tables are typically FK references to the dimension tables.
Regarding dimensions, these are groups of attributes that share commonality (the most notable being a calendar dimension). This is important because, when you're doing analysis across multiple facts, the dimensions are what you use to connect them.
Consider this simple example: a product is ordered and then shipped. We could have two transaction facts: one that contains the qty ordered (measure), the type of product ordered (dimension), and the transaction date (dimension). We'd also have a transaction fact for the product shipping: qty shipped (measure), product type (dimension), and ship date (dimension). This simple schema could be used to answer questions like 'how many products, by product type, were ordered but not shipped last quarter'.
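As a hedged sketch of that question in DAX (hypothetical FactOrders and FactShipments tables, both sharing Product and Date dimensions):
Qty Ordered = SUM ( FactOrders[QtyOrdered] )
Qty Shipped = SUM ( FactShipments[QtyShipped] )
Ordered Not Shipped = [Qty Ordered] - [Qty Shipped]  // both measures respond to the shared Product and Date dimensions
Because both facts connect to the same (conformed) dimensions, slicing Ordered Not Shipped by product type and quarter answers the question directly.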
Hopefully this helps you get started.
Usually a fact table is used to aggregate measures, which are always numeric. Examples would be: sales dollars, distances, weights, number of items sold.
The type of data you drew here doesn't have any cut-and-dried "measure", so you need to decide what you want to measure. Is it the number of answers per question? Is it how many responses per sample?
This is often called an Event Fact table (if you want to search for other examples). And you need some sort of reporting requirements before you can turn it into a star schema. So it isn't an easy answer...
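As a hedged sketch: if each row of the fact table is one response event (a hypothetical FactResponse table carrying questionId, responseId, and sampleId as foreign keys), the 'measure' can simply be a row count:
Response Count = COUNTROWS ( FactResponse )  // one row per response event
Avg Responses per Question = DIVIDE ( [Response Count], DISTINCTCOUNT ( FactResponse[questionId] ) )
This is the usual pattern for event (sometimes 'factless') fact tables: the count of rows is itself the measure.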
It's quite simple :) Responses is the fact; everything else is a dimension. Your schema is then star-designed, because you can connect the fact directly to all dimensions. For example, if you later redesign the structure so that addresses are stored in a separate table related to sample, you must add the address table's id into the responses table to keep a star schema.
Do the different 'attributes' of a dimension of an OLAP cube have to have a hierarchical order? If not, would the corresponding cube store the results for each possible combination of the dimension attributes?
Let us assume a cube with only two dimensions: time and product.
Time (year, quarter, month, day)
Product (product channel [direct vs. indirect], product group)
While the attributes (what are these called technically?) of the time dimension are clearly strictly hierarchical, the two attributes of the product dimension are not: we may group either by channel then product group, or by product group then channel (depending on which comes first).
Is such a (non-hierarchical) dimension even possible? If so, which aggregations would the cube store? Each combination (first grouped by channel, then by product group, and the other way around)?
I think Attributes is a perfectly fine name for them - I knew exactly what you meant.
Dimensions don't have to be hierarchical, and very often aren't.
As to which aggregations it will store, there is no simple answer. It will depend on which DBMS you are using and what you tell it to do. For example, with SQL Server (SSAS) you can tell it to precalculate a given percentage of results, from 0 to 100. However, within that you can't tell it which ones (it decides that itself); you can only tell it e.g. 50%. I usually specify 100%.
Other DBMSs will have different facilities.