Simple fact-dim relationship leads to wrong averages

I'd like to answer a simple question: what is the average age of people who own cars of specific colors? (Using Pentaho BI, Yellowfin BI and MariaDB.)
To do that, I designed the following model.
One fact table "person_fact" containing the columns:
id, name and age.
And a dimension table "auto_dim" containing the columns:
id (foreign key to the person_fact table), car_type, color and hp.
As we can see, one person can own more than one car.
The problem is that in some filter scenarios, people who own several cars are counted multiple times when they shouldn't be.
For instance, calculating the average age of people who own a "Black" or a "Red" car gives a wrong result.
The SQL query that is generated by Pentaho BI is the following:
select
avg(`person_fact`.`age`) as `m0`
from `auto_dim` as `auto_dim`,
`person_fact` as `person_fact`
where `person_fact`.`id` = `auto_dim`.`id`
and `auto_dim`.`color` in ('Black', 'Red');
The same problem occurs using Yellowfin BI. The SQL query generated by Yellowfin BI:
SELECT DISTINCT AVG(`PERSON_FACT`.`age`)
FROM `person_fact` AS `PERSON_FACT`
INNER JOIN `auto_dim` AS `AUTO_DIM`
ON (`PERSON_FACT`.`id` = `AUTO_DIM`.`id`)
WHERE (`AUTO_DIM`.`color` IN ('Black', 'Red'))
The correct answer is 28.25.
NOTE: Calculating the average age in Excel with a DAX expression yields the right answer.
Questions:
Is there any way to use something like a DAX expression in Pentaho BI or Yellowfin BI to get the right average?
Should I use a different model to fix the problem?
Thanks in advance!
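For illustration, here is a minimal sketch of why the join double counts and how reducing the data to one row per person avoids it. It uses Python's sqlite3 module rather than MariaDB, and the rows are hypothetical (they are not the data behind the 28.25 figure); only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_fact (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    CREATE TABLE auto_dim (id INTEGER, car_type TEXT, color TEXT, hp INTEGER);
    -- Hypothetical rows, not the data from the question.
    INSERT INTO person_fact VALUES (1, 'Ann', 20), (2, 'Bob', 30), (3, 'Cid', 40);
    INSERT INTO auto_dim VALUES
        (1, 'sedan', 'Black', 90),
        (1, 'coupe', 'Red', 150),
        (2, 'van', 'Red', 110),
        (3, 'suv', 'Blue', 200);
""")

# Same shape as the generated queries: one row per matching car,
# so Ann (who owns a Black and a Red car) is counted twice.
naive = conn.execute("""
    SELECT AVG(p.age)
    FROM person_fact AS p
    JOIN auto_dim AS a ON a.id = p.id
    WHERE a.color IN ('Black', 'Red')
""").fetchone()[0]

# One row per person: EXISTS only asks whether a matching car exists.
per_person = conn.execute("""
    SELECT AVG(p.age)
    FROM person_fact AS p
    WHERE EXISTS (
        SELECT 1 FROM auto_dim AS a
        WHERE a.id = p.id AND a.color IN ('Black', 'Red')
    )
""").fetchone()[0]

print(naive, per_person)
```

With these made-up rows the join averages 20, 20 and 30, while the EXISTS version averages 20 and 30. Whether Pentaho or Yellowfin can be made to generate a query of the second shape is a separate question that this sketch does not answer.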

Related

Power BI combine results from two SQL-Server tables

While using Power BI for a few months now, we (the user group) encountered an issue that is not really clear to us.
We use Power BI with a remote SQL Server data source, which we access through DirectQuery.
Let's pretend we have two tables as below:
Table name: Issue
Column:
ResolutionTime(Date/Time)
IssueID(Unique Numbers)
Table Name: WorkItem
Column:
start (Date/Time)
end (Date/Time)
IssueID (Unique Numbers, Foreign Key to "Issue" table)
Table WorkItem also contains a calculated column "WorkTime", which uses the following DAX expression:
WorkTime = WorkItem[end] - WorkItem[start]
The two tables are configured in Power BI with a two-way 1:n relationship, so that all "WorkItem" entries assigned to an "Issue" entry can be collected using "IssueID" as the correlation column.
To compute the aggregated work time per "Issue", we use a new calculated table with the following DAX expression, which sums the time invested in each single "Issue":
SumWork =
SUMMARIZE(
WorkItem, WorkItem[IssueID], "All work per item", SUM(WorkItem[WorkTime])
)
The above table computes the total invested work time for a particular issue, grouping the results by the "IssueID" foreign key. This new calculated table is also related to the "Issue" table, this time with a 1:1 relationship, again using IssueID as the correlation column.
Now we want a calculated column inside "Issue" that adds the time the issue was worked on to the resolution time, but this does not work:
ResolutionAndWorkTime = Issue[ResolutionTime] + SumWork["All work per item"]
The above DAX expression fails to compile: it always reports that it returns "more than one result" instead of a single value. That is surprising, as the two tables ("Issue" and "SumWork") are related to each other with a 1:1 relationship.
Tables:
Issues
IssueID  ResolutionTime  ResolutionAndWorkTime
1        03:20:20        ???
2        01:20:20        ???
3        00:20:20        ???
WorkItem
IssueID  start              end                WorkTime
1        1-2-2020 3:20:20   1-2-2020 3:25:20   00:05:00
1        2-2-2020 6:20:20   2-2-2020 7:20:20   01:00:00
3        1-3-2020 3:20:20   1-3-2020 3:29:20   00:09:00
Any ideas what to look for? Data types? Table definitions? Table relationships? We checked other Stack Overflow questions and answers, but found no good ideas so far.
NOTE that many join/merge features of Power BI are not available when DirectQuery is used, so joining the tables is not really an option (we think).

You need the following expression for your new calculated column.
See the documentation for RELATED to learn more.
ResolutionAndWorkTime = Issues[ResolutionTime] + RELATED(SumWork[All work per item])

Based on input provided by "mkRabbani" (see the other answer), we investigated why RELATED does not work as expected. The problem lies in how the database is accessed. As suspected earlier, the function delivers the expected results once the database access is switched to "import" instead of "direct-query".
As a workaround, we now join the data inside SQL Server using traditional database views. Of course, this only works in scenarios where the database is under the control of the data analytics team.
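A rough sketch of what such a view could look like, shown with Python's sqlite3 module instead of SQL Server so that it runs anywhere. The view name IssueWithWork and the columns ResolutionSeconds and WorkSeconds are assumptions made for this example (durations are stored as whole seconds); the values mirror the sample tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Durations stored as whole seconds: an assumption made for this sketch.
    CREATE TABLE Issue (IssueID INTEGER PRIMARY KEY, ResolutionSeconds INTEGER);
    CREATE TABLE WorkItem (IssueID INTEGER, WorkSeconds INTEGER);
    INSERT INTO Issue VALUES (1, 12020), (2, 4820), (3, 1220);
    INSERT INTO WorkItem VALUES (1, 300), (1, 3600), (3, 540);

    -- Hypothetical view: one row per issue, work time already aggregated.
    -- LEFT JOIN plus COALESCE keeps issues that have no work items.
    CREATE VIEW IssueWithWork AS
    SELECT i.IssueID,
           i.ResolutionSeconds + COALESCE(SUM(w.WorkSeconds), 0)
               AS ResolutionAndWorkSeconds
    FROM Issue AS i
    LEFT JOIN WorkItem AS w ON w.IssueID = i.IssueID
    GROUP BY i.IssueID, i.ResolutionSeconds;
""")

rows = conn.execute(
    "SELECT IssueID, ResolutionAndWorkSeconds FROM IssueWithWork ORDER BY IssueID"
).fetchall()
print(rows)  # [(1, 15920), (2, 4820), (3, 1760)]
```

Because the view already returns a single row per issue, the report can read it as one table and no cross-table calculated column is needed.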

Power BI - TopN + Others on data from two tables

I am a bit stuck with a specific case in Power BI. Let's say that we have two tables. The first one contains the product ID and the product name, and the second one contains the product ID and a specific budget.
I want to create a pie chart showing the top N products plus an "Others" group. I have written a DAX formula that works for data in a single table, but not for data spread over two tables.
Here is the formula:
ProductTop =
VAR rankSiteImpressions = RANKX(ALL(Piechart); [Impressions ];;DESC)
return
IF(rankSiteImpressions<=3;Piechart[Site];"Others")
How can I apply this on data from two tables to get the top products by budget?
Many thanks,
Rémi
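To make the intended logic concrete, here is a small sketch of "top N plus Others" over two tables. It is not DAX: it uses Python's sqlite3 module, and the table names Product and Budget, their columns, and all rows are hypothetical. The idea is to join the two tables first, total the budget per product, and only then rank and bucket.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and rows, for illustration only.
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE Budget (ProductID INTEGER, Amount REAL);
    INSERT INTO Product VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E');
    INSERT INTO Budget VALUES (1, 50), (2, 40), (3, 30), (4, 20), (5, 10), (1, 5);
""")

# Step 1: join the two tables and total the budget per product.
totals = conn.execute("""
    SELECT p.ProductName, SUM(b.Amount) AS Total
    FROM Product AS p
    JOIN Budget AS b ON b.ProductID = p.ProductID
    GROUP BY p.ProductName
    ORDER BY Total DESC, p.ProductName
""").fetchall()

# Step 2: keep the top N rows and fold everything else into "Others".
TOP_N = 3
slices = totals[:TOP_N]
others = sum(total for _, total in totals[TOP_N:])
if others:
    slices.append(("Others", others))

print(slices)  # [('A', 55.0), ('B', 40.0), ('C', 30.0), ('Others', 30.0)]
```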

Is there a way to sum an entire quantity in SQL with unique values

I am trying to get the total of both the ItemDetail.Quantity column and the ItemDetail.NetPrice column. For the sake of example, let's say the quantities listed for the individual items are 5, 2, and 4 respectively. I am wondering if there is a way to display the quantity as 11 for one single ItemGroup.ItemGroupName.
The query I am using is listed below:
select Location.LocationName, ItemDetail.DOB, SUM (ItemDetail.Quantity) as "Quantity",
ItemGroup.ItemGroupName, SUM (ItemDetail.NetPrice)
from ItemDetail
Join ItemGroupMember
on ItemDetail.ItemID = ItemGroupMember.ItemID
Join ItemGroup
on ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
Join Location
on ItemDetail.LocationID = Location.LocationID
Inner Join Item
on ItemDetail.ItemID = Item.ItemID
where ItemGroup.ItemGroupID = '78' and DOB = '11/20/2019'
GROUP BY Location.LocationName, ItemDetail.DOB, Item.ItemName,
ItemDetail.NetPrice, ItemGroup.ItemGroupName

If you are using SQL Server 2012 or later, you can use a windowed SUM (SUM ... OVER) to display the details and the aggregates in the same query.
SUM(SalesYTD) OVER (ORDER BY DATEPART(yy, ModifiedDate))
Link :
https://learn.microsoft.com/en-us/sql/t-sql/functions/sum-transact-sql?view=sql-server-ver15

We can't be certain without seeing sample data, but I suspect you need to remove some fields from your GROUP BY clause, probably Item.ItemName and ItemDetail.NetPrice.
Generally, you won't GROUP BY a column that you are applying an aggregate function to in the SELECT -- as in SUM(ItemDetail.NetPrice). And it is not very common, in my experience, to GROUP BY columns that aren't included in the SELECT list - as you are doing with Item.ItemName.
I think you need to go back to basics and read about what GROUP BY does.
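As a small illustration of that point, the sketch below runs both GROUP BY variants with Python's sqlite3 module. The single flattened table is a hypothetical stand-in for the joined tables in the question, and the item names and prices are made up; only the quantities 5, 2 and 4 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical stand-in for the joined tables in the question.
    CREATE TABLE ItemDetail (
        ItemName TEXT, ItemGroupName TEXT, Quantity INTEGER, NetPrice REAL
    );
    INSERT INTO ItemDetail VALUES
        ('Cola',  'Drinks', 5, 10.0),
        ('Juice', 'Drinks', 2, 6.0),
        ('Water', 'Drinks', 4, 4.0);
""")

# Grouping by ItemName and NetPrice as well yields one row per item.
too_fine = conn.execute("""
    SELECT ItemGroupName, SUM(Quantity), SUM(NetPrice)
    FROM ItemDetail
    GROUP BY ItemGroupName, ItemName, NetPrice
""").fetchall()

# Grouping by the item group alone collapses the items into one total.
per_group = conn.execute("""
    SELECT ItemGroupName, SUM(Quantity), SUM(NetPrice)
    FROM ItemDetail
    GROUP BY ItemGroupName
""").fetchall()

print(len(too_fine))  # 3
print(per_group)      # [('Drinks', 11, 20.0)]
```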

First of all welcome to the overflow...
Second: The answer is going to be "It depends"
Any time you aggregate data you will need to Group by the other fields in the query, and you have that in the query. The gotcha is what happens when data is spread across multiple locations.
My suggestion is to rethink your problem and see if you really need these other fields in the query. This will depend on what the person using the data really wants to know.
Do they need to know how many of item X there are, or do they really need to know that item X is spread out over three sites?
You might find you are better off with two smaller queries.

multiple condition on joined table

I have one small database for exercises; please see the ER diagram below.
I want to write a query that lists student last names, first names and majors for students who had at least one high grade (>= 3.5) in at least one course offered in fall of 2012.
My code below:
select s.StdNo,s.StdFirstName,s.StdLastName,s.StdMajor,e.EnrGrade,o.OfferNo,o.OffYear
from Enrollment e
join Offering o on e.OfferNo=o.OfferNo
join Student s on s.StdNo=e.StdNo
where e.EnrGrade >=3.5 and o.OffYear="2010";
But I got an SQL error:
[207] [S0001]: Invalid column name '2010'
I am confused about the error: the value "2010" is NOT a column name, OffYear is the column. So why did this happen?
The basic query is not that hard, but I am stuck on the (multiple) nested query.

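OffYear is shown as a number, so you should compare against the number 2010, not the text "2010". In addition, SQL Server treats double quotes as identifier delimiters (with the default QUOTED_IDENTIFIER setting), which is why "2010" is reported as an invalid column name. Text values go in single quotes. So the condition becomes:
[...] and OffYear = 2010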
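For the full task (one row per student, limited to one term), a possible shape of the query is sketched below with Python's sqlite3 module. The OffTerm column and its 'FALL' value are assumptions, since the ER diagram is not visible here, and all rows are made up. The year 2012 is taken from the task description rather than from the posted query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical rows; OffTerm is an assumed column name.
    CREATE TABLE Student (StdNo INTEGER PRIMARY KEY, StdFirstName TEXT,
                          StdLastName TEXT, StdMajor TEXT);
    CREATE TABLE Offering (OfferNo INTEGER PRIMARY KEY, OffTerm TEXT,
                           OffYear INTEGER);
    CREATE TABLE Enrollment (OfferNo INTEGER, StdNo INTEGER, EnrGrade REAL);
    INSERT INTO Student VALUES (1, 'Ann', 'Smith', 'IS'), (2, 'Bob', 'Jones', 'FIN');
    INSERT INTO Offering VALUES (10, 'FALL', 2012), (11, 'SPRING', 2012),
                                (12, 'FALL', 2012);
    INSERT INTO Enrollment VALUES (10, 1, 3.6), (12, 1, 3.8), (11, 2, 3.9);
""")

# DISTINCT keeps a student with several qualifying grades on a single row;
# the year is a plain number and the term is a single-quoted string.
rows = conn.execute("""
    SELECT DISTINCT s.StdLastName, s.StdFirstName, s.StdMajor
    FROM Student AS s
    JOIN Enrollment AS e ON e.StdNo = s.StdNo
    JOIN Offering AS o ON o.OfferNo = e.OfferNo
    WHERE e.EnrGrade >= 3.5 AND o.OffYear = 2012 AND o.OffTerm = 'FALL'
""").fetchall()

print(rows)  # [('Smith', 'Ann', 'IS')]
```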

SSAS cube and getting data with MDX for SSRS report

I'm new to OLAP cubes. Can you point me in the right direction with a small example?
Let's say I have table "transactions" with 3 columns: transaction_id (int), date (datetime), amount (decimal(16,2)).
I want to create a cube and then get data with MDX query for SSRS report.
I want report to show something like:
OK. I know I can have a fact table with the amount and a date dimension (date->month->year).
Can you explain what to do in order to get this result (including how to write the MDX query)? Thanks.
Can someone explain why I get amount of full 201504 and 201606 months even if I specified exact range with days?
SELECT
[Measures].[Amount] ON COLUMNS
,[Dim_Date].[Hierarchy].[Month].MEMBERS ON ROWS
FROM
[DM]
WHERE
(
{[Dim_Date].[Date Int].&[20150414] : [Dim_Date].[Date Int].&[20160615]}
)

Something like below, change the query accordingly :)
SELECT
{ [Date].[EnglishMonthName].[EnglishMonthName]} ON COLUMNS,
{ [Date].[DateHierarchy].[Year].&[2015],
[Date].[DateHierarchy].[Year].&[2016] } ON ROWS
FROM [YourCubeName]
WHERE ([Measures].[amount])

So you want someone to show you how to create a multi-dimensional cube from scratch and report on it in a single answer...? Start here and work through the lessons