Power BI Aggregation of End Tables - sql-server

I am new to Power BI and database management, and I want to clarify for myself how Power BI works, with reference to my last two questions (Database modelling Bridge Table, Power BI Report Bridge Table). I have a main_table with firm-specific information for each year, which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship so that I do not have to store the same values twice, which I understood to be good practice in data modelling.
I want to aggregate the value column of the end_table over the grouping column Year. To my surprise, Power BI sums up the value column within the end_table, whereas I would expect the aggregation to run over the grouping variable in the connected tables.
My basic example is based on this data and data model (you need to adjust the relationship manually):
main_table<-data.frame(id=1:20, FK_id=sample(1:2,20, replace=TRUE), Jahre=2016:2020)
main_table<-rbind(main_table,data.frame(id=21:25, FK_id=sample(2:3,5, replace=TRUE), Jahre=2015) )
end_table<-data.frame(id=1:3, value=c(10,20,30))
The first 5 rows of the data, including all columns, look like this:
If I take out all row-specific information and sum over value, it always shows the sum of the end_table, which is 60, in each Year.
Making the relationship bi-directional does not help; it just sums up the existing values of the end_table in each year. I get the correct results if I add the value column to the main table using Related value = RELATED(end_table[value])
I am just wondering whether there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently, and it feels a bit tedious to always add the column to the main table using RELATED() when it would be intuitive to just click both columns and have the aggregation based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.

This is a bit of an unusual modelling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows of a visual, and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this: you are trying to sum over a column in your end table using the year as a dimension. As a result, it does not automatically behave as you'd expect.
In order to get the result you want, where Year is treated as a dimension, you need to write a measure that iterates over main_table and picks up the related value from end_table for each row:
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )
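The iteration that SUMX performs can be sketched outside DAX. Below is a minimal Python illustration (the rows and values are made up for the sketch, not the sampled data from the R snippet): each row of main_table looks up its related end_table value, so a year's total reflects how often each value is referenced, rather than the plain sum of end_table.

```python
# Hypothetical data mirroring the model: main_table references
# end_table via FK_id (N:1). Values are illustrative only.
main_table = [
    {"id": 1, "FK_id": 1, "Jahre": 2016},
    {"id": 2, "FK_id": 2, "Jahre": 2017},
    {"id": 3, "FK_id": 1, "Jahre": 2016},
    {"id": 4, "FK_id": 3, "Jahre": 2015},
]
end_table = {1: 10, 2: 20, 3: 30}  # id -> value

# SUMX(main_table, RELATED(end_table[value])): iterate the rows of
# main_table and add the related value, accumulated here per year.
totals = {}
for row in main_table:
    year = row["Jahre"]
    totals[year] = totals.get(year, 0) + end_table[row["FK_id"]]

print(totals)  # {2016: 20, 2017: 20, 2015: 30}
```

With this row-by-row lookup, 2016 counts the value 10 twice because two main_table rows reference it, which is exactly the behaviour the RELATED() workaround produces.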


How to move from Excel to designing a Data Warehouse Model

I have just started with Data Warehouse modeling and I need help modeling a problem.
Let me lay out the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files, linked together: one file 'order' and the other 'services'.
The 'order' file sets out a summary of each flight (orderId, departure date, arrival date, city of departure, city of arrival, total amount collected, etc.).
The 'services' file lists the services provided per flight (orderId, service name, quantity, amount/qty, etc.).
There is a 1:N relationship (order-services): each order has N services.
I can already see some dimensions (Time, Location, etc.). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I have thought about it: the star and snowflake schemas do not seem to work in my case (since I have two fact tables), and the galaxy schema requires dimensions in common. I am stuck on whether I should put the order table as a dimension rather than a fact table, or instead make the services table a dimension, but both are fact tables. I am getting a little confused.
How can I design my model?
First of all, realize that in a star schema it is not a problem to have multiple fact tables that are connected; see the discussion here.
So the first draft will simply follow your two fact tables with the natively provided dimensions.
Order is a fact table in one context and a dimension table for the service table in another.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the service table, so that each service row carries the departure date, arrival date, etc.
This is done at load time in the ETL job.
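As a sketch of that ETL step (table and column names are assumptions for illustration, using SQLite from Python), the service fact can be loaded with the order's dimensions copied onto each service row:

```python
import sqlite3

# Minimal ETL sketch: denormalize order-level dimensions into the
# service fact at load time. All names and values are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     departure_date TEXT, arrival_date TEXT);
CREATE TABLE services_src (order_id INTEGER, service_name TEXT, amount REAL);
INSERT INTO orders VALUES (1, '2020-01-01', '2020-01-02');
INSERT INTO services_src VALUES (1, 'catering', 100.0), (1, 'fuel', 500.0);
""")
# Build the denormalized service fact by joining back to the order.
con.execute("""
CREATE TABLE services_fact AS
SELECT s.order_id, s.service_name, s.amount,
       o.departure_date, o.arrival_date
FROM services_src s JOIN orders o ON o.order_id = s.order_id
""")
rows = con.execute(
    "SELECT service_name, departure_date FROM services_fact").fetchall()
print(rows)
```

Each service row now carries the flight's dates directly, so queries against services no longer need a join to orders for those dimensions.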
I would be somewhat careful about denormalizing the measures from order to service, which would basically eliminate the whole order table.
There is no problem with the measure total amount collected if it is a redundant sum of the service amounts; you may safely get rid of it.
But you will certainly need the number of flights or the number of people transported; those measures are better defined in the order fact table, since you cannot simply replicate them across the N rows for each service.
A workaround is possible if you define a main service for each order and populate those measures only in that row, with NULL in the other rows. This could lead to unexpected results if queried naively, e.g. for the number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions into the services table if this helps optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order, including a degenerate dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate whether there are any reporting needs that are not being served, or questions that are difficult to answer with the Services fact.
Joining two facts together is always a bad idea: performance is terrible. You are always better off bringing the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name, along with a corresponding role-playing dimension view for that context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit.

How to find measures in a dataSet

So, I have this dataset here: https://www.kaggle.com/johnolafenwa/us-census-data#adult-training.csv
I am new to data warehouses. I understand what a measure is, but I'm not sure what qualifies as a measure for a fact table. In this dataset, which columns can be measures?
The way I have seen it, measures are things like Count() or Avg(), etc.
Measures are numerical values that mathematical functions work on. For example, a sales revenue column is a measure because you can total or average the data (and not only total or average it; it depends on your need).
When dimensions and measures work together, they help answer complex business questions.
A metric is a quantifiable measure that is used to track and assess the status of a specific process. That said, here is the difference: a measure is a fundamental, unit-specific value, while a metric can be derived from one or more measures.
A fact table is used in the dimensional model in data warehouse design. A fact table is found at the center of a star schema or snowflake schema surrounded by dimension tables.
A fact table consists of facts of a particular business process e.g., sales revenue by month by product. Facts are also known as measurements or metrics. A fact table record captures a measurement or a metric.
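As a rough illustration of how you might shortlist measure candidates in a dataset like the linked one: numeric columns (in the Adult census data, e.g. age, capital-gain, hours-per-week) are natural candidates, while text columns are dimension candidates. The sample rows below only imitate the file's layout; they are not actual file contents.

```python
# Heuristic sketch: columns whose values are numeric are measure
# candidates; categorical columns are dimension candidates.
rows = [
    {"age": 39, "workclass": "State-gov", "education": "Bachelors",
     "capital-gain": 2174, "hours-per-week": 40},
    {"age": 50, "workclass": "Private", "education": "HS-grad",
     "capital-gain": 0, "hours-per-week": 13},
]
measure_candidates = sorted(
    col for col in rows[0]
    if all(isinstance(r[col], (int, float)) for r in rows)
)
print(measure_candidates)  # ['age', 'capital-gain', 'hours-per-week']
```

Note that being numeric is necessary but not sufficient: the column should also be meaningful to sum or average across fact rows for the business question at hand.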

Excel formula to retrieve conditional sumproduct for massive data set

I have a data table of over 50,000 rows. The data contains SALES information for permutations of STORES, DATES, and PRODUCTS. However, the PRODUCTS are actually a combination of PRIMARY PRODUCTS (PP) and SECONDARY PRODUCTS (SP), where a sale QUANTITY of 1 PP should convert to the sale of 1 or more SPs. I have another sheet containing the CONVERSIONS of PP to SP with the respective MULTIPLIERS (over 500 rows). PPs and SPs have a many-to-many relationship: a PP may convert to several different SPs, and the same SP may be converted from several PPs.
At the moment, only unconverted sales quantities exist for the PRODUCTS, and it's my job to convert those figures to each PP's respective SP if a MULTIPLIER exists.
Sample: https://i.stack.imgur.com/YdGHn.png
I am able to do that with the following SUMPRODUCT() formula, which appears more efficient than an array formula:
=SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
)
However, given the size of my data set, it still takes forever to process. Is there a more efficient way to do this?
EDIT:
I tried wrapping the formula in a conditional so that SUMPRODUCT is only evaluated if the Product in question can be found in the Conversions table as an SP (and it also now displays the values of PRODUCTS that don't have any conversions). This seems to have sped things up a little, but it is still nowhere near quick enough...
=IFERROR(IF(MATCH([#Product],Conversions[SP],0)>0,
SUMPRODUCT(
(Conversions[Multiplier]),
--(Conversions[SP]=[#Product]),
SUMIFS([Quantity],[Product],Conversions[PP],[Store],[#Store],[Date],[#Date])
),0),0)+[#Quantity]
If you have the possibility, import your data into a database; you can then work with indexed tables, which should be faster.
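A sketch of that database route, using SQLite from Python (table, column, and index names are assumptions): with indexes on the lookup columns, the SUMPRODUCT/SUMIFS logic becomes a join, i.e. converted quantity = own quantity + sum of multiplier times the matching PP quantity at the same store and date.

```python
import sqlite3

# Illustrative data: PP1 sells 5 units and converts to SP1 with
# multiplier 3; SP1 itself sells 2 units directly.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (store TEXT, date TEXT, product TEXT, qty REAL);
CREATE TABLE conversions (pp TEXT, sp TEXT, multiplier REAL);
CREATE INDEX ix_sales ON sales (store, date, product);
CREATE INDEX ix_conv ON conversions (sp);
INSERT INTO sales VALUES ('S1', '2021-01-01', 'PP1', 5),
                         ('S1', '2021-01-01', 'SP1', 2);
INSERT INTO conversions VALUES ('PP1', 'SP1', 3);
""")
# Converted quantity for SP1 at store S1 on 2021-01-01:
# own qty + SUM(multiplier * qty of each converting PP).
row = con.execute("""
SELECT s.qty + COALESCE(SUM(c.multiplier * p.qty), 0)
FROM sales s
LEFT JOIN conversions c ON c.sp = s.product
LEFT JOIN sales p ON p.product = c.pp
                 AND p.store = s.store AND p.date = s.date
WHERE s.product = 'SP1' AND s.store = 'S1' AND s.date = '2021-01-01'
GROUP BY s.store, s.date, s.product, s.qty
""").fetchone()
print(row[0])  # 17.0  (2 + 3 * 5)
```

The indexes let the database resolve each lookup without scanning all 50,000 sales rows per cell, which is what the per-row SUMPRODUCT in Excel effectively does.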

Calculating values in SQL using two tables as references

I'm not sure if I can explain this well. I have two tables: one with the columns TotalLoan, InterestRate, Classification, LendingRates, and the other with the columns SumLoan, Classification, LendingRates.
The first table, let's call it Loans, holds the loans of specific users, with their loan totals and interest rates, while the other table holds the sums of the loan totals grouped by Classification and LendingRates. This means that if two users have the Classification 'Individual' and the LendingRate '1-5years', their loans are summed up. Let me try to visualize them; here is the Loans table.
And here is the second table containing the sums of the values, using the lending rates and classification for grouping.
For the sake of simplicity, and because I'm dealing with real bank values that I'm not supposed to share online, I created a fake summary in Excel; this is not the real data. The real data contains thousands of rows. So the formula is
(InterestRate * TotalLoan)/Sum of total.
So for Individuals with Overdraft, the first expression would be
(12*34555)/12221222
then
(14*22322)/12221222
then
(6*76772)/76772 and so on...
Does anyone have an idea how I can do this in Microsoft SQL Server? I'm seriously stumped here.
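One way to do this in SQL, sketched here with SQLite from Python (table and column names follow the question; the figures are from the fake sample, not real data): join each loan row to the per-group sum over Classification and LendingRates, then apply (InterestRate * TotalLoan) / SumLoan. If your summary table already exists, you can join to it directly instead of the derived table.

```python
import sqlite3

# Two sample loans in the same group (Individual / Overdraft),
# so their group sum is 34555 + 22322 = 56877.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Loans (TotalLoan REAL, InterestRate REAL,
                    Classification TEXT, LendingRates TEXT);
INSERT INTO Loans VALUES (34555, 12, 'Individual', 'Overdraft'),
                         (22322, 14, 'Individual', 'Overdraft');
""")
# Per-row value: InterestRate * TotalLoan / group sum of TotalLoan.
rows = con.execute("""
SELECT l.InterestRate * l.TotalLoan / g.SumLoan AS Weighted
FROM Loans l
JOIN (SELECT Classification, LendingRates, SUM(TotalLoan) AS SumLoan
      FROM Loans GROUP BY Classification, LendingRates) g
  ON g.Classification = l.Classification
 AND g.LendingRates = l.LendingRates
ORDER BY l.TotalLoan DESC
""").fetchall()
print([round(r[0], 4) for r in rows])
```

In SQL Server you could equally use a window function, SUM(TotalLoan) OVER (PARTITION BY Classification, LendingRates), and skip the join entirely.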

Calculated Measure aggregating on certain cells only

I'm trying to figure out how to create a calculated measure that counts only unique facts in my fact table. My fact table basically stores events from a historical perspective, but I need the measure to filter out redundant events.
Using sales as an example (since all material on OLAP uses sales in its examples):
The fact table stores sales EVENTS. When a sale is first made, it has a unique sales reference, which is a column in the fact table. A unique sale can, however, be amended (items added or returned) or completely cancelled. The fact table stores these changes to a sale as different rows.
If I create a count measure using SSAS, I get a count of all sales events, which means a unique sale is counted multiple times, once for every change made to it (which in some reports is desirable). However, I also want a measure that counts unique sales rather than events, and not just by counting unique sales references. If the user filters by date, they should see the unique sales that still exist on that date (if a sale was cancelled by that date, it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need a count query that works from a subset produced by a query finding the latest change to each sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef = FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure that is based on the unique transaction ID (the sales number or order number), then set that measure's aggregation function to DistinctCount in the Properties window.
Now it should count distinct order numbers under whichever dimension slice it finds itself in.
The posted query could probably be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be to have a 'sales count' column in your fact view / DSV query that supplies a 1 for an 'initial' event, a zero for all subsequent revisions to the event, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads.
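The journalling idea can be sketched as follows (hypothetical rows): every event carries a +1/0/-1 sales count, so a plain SUM of that column returns the number of sales still alive under whatever filter is applied.

```python
# Each tuple: (sales reference, event type, sales_count contribution).
# Initial events add 1, revisions add 0, cancellations subtract 1.
events = [
    ("S1", "initial",   +1),
    ("S1", "amended",    0),
    ("S2", "initial",   +1),
    ("S2", "cancelled", -1),
    ("S3", "initial",   +1),
]

# A plain additive SUM gives the net number of live sales:
# S1 and S3 are still open, S2 cancels itself out.
live_sales = sum(count for _, _, count in events)
print(live_sales)  # 2
```

Because the measure is purely additive, it aggregates correctly under any dimension slice, including date ranges, without a distinct count or a correlated subquery.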
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure counting the members of that dimension that are non-empty over a given measure in your fact table. However, for sales this is essentially a degenerate dimension (a dimension based on a fact table) and might get very large, so it may be inappropriate.
Sometimes the requirements are more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky; there's a post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/
