I'm trying to figure out how to create a calculated measure that counts only unique facts in my fact table. The fact table stores events from a historical perspective, but I need the measure to filter out redundant events.
Using sales as an example (since all the material on OLAP seems to use sales in its examples):
The fact table stores sales EVENTS. When a sale is first made it gets a unique sales reference, which is a column in the fact table. A unique sale can subsequently be amended (items added or returned) or completely cancelled. The fact table stores these changes to a sale as separate rows.
If I create a count measure in SSAS I get a count of all sales events, which means a unique sale is counted once for every change made to it (which in some reports is desirable). However, I also want a measure that counts unique sales rather than events, and not just by counting distinct sales references: if the user filters by date, they should see the unique sales that still exist on that date (if a sale was cancelled by that date it should not be represented in the count at all).
How would I do this in MDX/SSAS? It seems like I need the count to work on a subset returned by a query that finds the latest change to each sale based on the time dimension.
In SQL it would be something like:
SELECT COUNT(*) FROM SalesFacts FACT1 WHERE Event <> 'Cancelled' AND
Timestamp = (SELECT MAX(Timestamp) FROM SalesFacts FACT2 WHERE FACT1.SalesRef = FACT2.SalesRef)
Is it possible, or even performant, to have subqueries in MDX?
In SSAS, create a measure based on the unique transaction ID (the sales number or order number), then set that measure's aggregation to 'DistinctCount' in the Properties window.
It will then count distinct order numbers under whichever dimension slice it finds itself.
The posted query might be rewritten like this:
SELECT COUNT(DISTINCT SalesRef)
FROM SalesFacts
WHERE Event <> 'Cancelled'
A simple answer would be to have a 'sales count' column in your fact view / DSV query that supplies a 1 for the initial event, a 0 for each subsequent revision to the event, and a -1 if the event is cancelled. This 'journalling' approach plays nicely with incremental fact table loads.
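As a rough sketch of that approach in the fact view (the 'Created' event name below is an assumption about how your source labels the initial event):

SELECT
    SalesRef,
    [Timestamp],
    [Event],
    CASE [Event]                    -- event names other than 'Cancelled' are assumptions
        WHEN 'Created'   THEN 1     -- initial sale
        WHEN 'Cancelled' THEN -1    -- cancellation removes the sale from the count
        ELSE 0                      -- amendments contribute nothing
    END AS SalesCount
FROM SalesFacts

A Sum aggregation on SalesCount then gives the net number of sales that still exist for a slice, provided the date filter accumulates all events up to the date in question.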
Another approach, probably more useful in the long run, would be to have an Events dimension: you could then expose a calculated measure that is the count of the members of that dimension that are non-empty over a given measure in your fact table. However, for sales this is essentially a degenerate dimension (a dimension based on the fact table) and might get very large, which may make it inappropriate.
Sometimes the requirements may be more complicated. If you slice by time, do you need to know all the distinct events that existed then, even if they were later cancelled? That starts to get tricky: there's a recent post on Chris Webb's blog where he talks about one (slightly hairy) solution:
http://cwebbbi.wordpress.com/2011/01/22/solving-the-events-in-progress-problem-in-mdx-part-2role-playing-measure-groups/
Related
I am finding it difficult to understand how you get the historical data from a fact table joined to a dimension that has Type 2 (and Type 1) handling for records that have changed. Currently I have a surrogate key and a business key in the dimension. The fact table has the surrogate key, and I am currently using the SSIS Lookup component to bring back the row that has CurrentFlag set to Yes.
However, I am joining on the business key in the Lookup and returning the surrogate key, which I know is the main reason I can't get history. But even if I join on the business key and return the business key as well, the SSIS component will only bring back one row, regardless of how many versions of history exist against that business key.
I have been told to use lookups to populate fact tables, but this doesn't really give me the history, as it only ever returns one row. So I just want to know how to relate historical data between a fact and a dimension in SSIS.
Thank you
There are a few caveats when it comes to historical dimensions. Your end users will need to know what it is you are presenting and understand the differences.
For example, consider the following scenario:
Customer A is located in Las Vegas in January 2017. They place an order for Product 123, which at that time costs $125.
Now, it's August. In the meantime, the Customer moved to Washington D.C. in May, and Product 123 was updated in July to cost $145.
Your end users will need to tell you what they want to see. If you are not tracking history at all, and simply truncate and load everything on a daily basis, your order report would show the following:
Customer A, located in Washington D.C. placed an order for $145 in January.
If you implement proper history tracking, with logic that identifies the start and end date of each row in the dimension, you would join the fact table to the dimension using the natural key as well as the matching date interval. This should return a single dimension row for every fact row. If it returns more, you have overlapping dates.
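A hedged sketch of that join, using hypothetical names (CustomerBusinessKey as the natural key, RowStartDate/RowEndDate as the validity interval on the Type 2 dimension):

SELECT
    f.OrderDate,
    f.SalesAmount,
    d.CustomerSK                         -- surrogate key of the version valid when the fact occurred
FROM staging.FactSales AS f
JOIN dbo.DimCustomer   AS d
    ON  d.CustomerBusinessKey = f.CustomerBusinessKey
    AND f.OrderDate >= d.RowStartDate    -- fact date falls inside
    AND f.OrderDate <  d.RowEndDate      -- the dimension row's validity interval

In SSIS you can typically get the same behaviour from the Lookup component by switching it to partial or no cache and supplying a parameterised query along these lines, rather than relying on the default equality-only match on the business key.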
Can you show us the logic where you receive only a single value from the lookup, even though you have more records?
I am building a model to allow reporting on two separate datasets; for this example we'll say a Students dataset and a Staff dataset.
The datasets are pretty separate and the only real link between the two is Date, so from a model perspective there is a Students star schema and a Staff star schema.
The data displayed is snapshot-type data, answering questions like:
- For a selected date, show all active employees
- For a selected date, show all enrolled students
This means that when a single date is selected, the model finds all employees where the selected date falls within the employment start and end dates, and all students where the selected date falls within the enrolment start and end dates.
This meant I had to decide how to return the correct data from each schema with a single date dimension. Creating a relationship would not work, as relationships in Tabular don't allow "between"-type queries, so I instead have one unrelated Date dimension and the DAX for each model finds the applicable rows.
The problem is that it's not the most performant: for perhaps 50k rows, adding a measure can take 5-10 seconds.
I'm asking if there is a better way to either write the queries or alter the model to still let me do "between"-style queries but with better performance.
Below is an example of a DAX measure that returns all students who were enrolled on a particular date.
Thanks for any advice.
All Enrolled Students :=
IF (
    HASONEVALUE ( 'Date'[Date] ),
    CALCULATE (
        DISTINCTCOUNT ( 'Students'[StudentID] ),
        FILTER (
            'Students',
            'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
                && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
        )
    ),
    BLANK ()
)
Unrelated or "disconnected" tables are good for powering slicers, timelines, and filters in certain situations. As you said in your question, you have two optimization options: Re-structure your data set or optimize the existing measure syntax.
Re-Structure Dataset
Duplicate each row for every day between the start and end dates, with a column for that iterated date. This can be done a handful of ways depending on how you get your dataset, but it could be tedious. Then relate your tables on this iterated date and use the relationship to filter from DATE to FACT. If this is a recurring report and/or you are using SQL to pull the data, this might be worth it to make use of PowerPivot's relational calculation power.
Optimize DAX statement
If this is a one-off request or the dataset would be too tedious to duplicate out by day, then stick with the disconnected table approach and clean up the measure syntax. Since you have already included the MIN() and MAX() functions and your CALCULATE() is returning DISTINCTCOUNT(), the conditional HASONEVALUE() function is unnecessary. I ran this in a simulated environment and had good results, but that can vary with computer performance and dataset size. See below for cleaned syntax.
All Enrolled Students :=
CALCULATE (
    DISTINCTCOUNT ( 'Students'[StudentID] ),
    FILTER (
        'Students',
        'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
            && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
    )
)
If your StudentID column is unique, which would make sense to me, you can further speed this up.
All Enrolled Students :=
CALCULATE (
    COUNT ( 'Students'[StudentID] ),
    FILTER (
        'Students',
        'Students'[StudentStartDateID] <= MIN ( 'Date'[DateID] )
            && 'Students'[StudentEndDateID] >= MAX ( 'Date'[DateID] )
    )
)
If StudentID is not a number, replace COUNT() with COUNTA() to get the desired effect.
This type of scenario is often called "Events in progress" or "Events with a duration". Take a look at the links below. The answer will depend on your version of SSAS and the event duration length.
https://www.sqlbi.com/articles/analyzing-events-with-a-duration-in-dax/
https://www.sqlbi.com/articles/understanding-dax-query-plans/
https://blog.gbrueckl.at/2014/12/events-in-progress-for-time-periods-in-dax/
If these measures don't perform well (which can happen with events that have a long duration), it may be necessary to generate a table containing a row for each day of the event. The SQL would look something like this:
SELECT
    d.CalendarDate
    ,s.StudentID
FROM dbo.Students AS s
CROSS JOIN dbo.DimDate AS d
WHERE d.CalendarDate >= s.StudentStartDateID
  AND d.CalendarDate <= s.StudentEndDateID
Create a relationship from this table to the date/calendar table.
With this design you can use a simple DISTINCTCOUNT(Students[StudentID]) measure, which should perform better. The trade-off is that this table can become quite large. Keep it as narrow as possible for best performance and memory conservation. Another optimization could be to use a different granularity such as week or month instead of day.
We have a requirement to come up with a strategy to show Sales revenue data weighted by dates differently on different schedules.
We currently have a FactSales table with a grain of one row per order and a measure of sales amount. We have separate DimDate and DimTime dimensions, and a DimBusinessUnit dimension with one row for each entity within the organization.
In DimDate we have a flag for the major US holidays so we know reduced sales revenue may be expected. This flag would apply globally.
The ask is that different business units might have slow revenue days. For example, Mondays might be slow in one business unit and Fridays slow in another. For analysis it is desirable to capture these different schedules with a flag or a weighting.
Ultimately this would probably be reflected as a projected sales amount in a calculated measure.
How can I best add this weighting? Does it belong in the Date dimension, Business Unit dimension, or maybe a degenerate dimension in the Fact table, or something else altogether?
DimDate is probably not a good place to keep this information, as each Business Unit (BU) may have a different schedule, so you would need a flag on each date for every combination of BU and slow day. For example, if BU1 and BU2 both have a slow day on Monday, each Monday in DimDate would need some way of showing that it is slow for both BU1 and BU2.
The BU dimension might be a better place, as the schedule is specific to each unit. You could extend the dimension by adding seven day-of-week attributes and flag each one as slow or not (for example with true/false values). You could also have a single attribute holding a bit mask, e.g. 0100000, where the position of each value corresponds to a day (M T W T F S S), 0 means not slow and 1 means slow; in this example Tuesday is the slow day.
This will also allow you to track history if you wish, by selecting the relevant SCD process.
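As a rough sketch of extending the dimension that way (all column names here are hypothetical):

ALTER TABLE dbo.DimBusinessUnit ADD
    IsSlowMonday    bit     NOT NULL DEFAULT (0),
    IsSlowTuesday   bit     NOT NULL DEFAULT (0),
    IsSlowWednesday bit     NOT NULL DEFAULT (0),
    IsSlowThursday  bit     NOT NULL DEFAULT (0),
    IsSlowFriday    bit     NOT NULL DEFAULT (0),
    IsSlowSaturday  bit     NOT NULL DEFAULT (0),
    IsSlowSunday    bit     NOT NULL DEFAULT (0),
    SlowDayMask     char(7) NOT NULL DEFAULT ('0000000');  -- e.g. '0100000' = Tuesday is slow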
Another option may be a separate dimension, e.g. a DimSchedule, plus a factless fact table:
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/
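A minimal sketch of that factless-fact option (table and column names are assumptions), with one row per business unit per slow date:

CREATE TABLE dbo.FactSlowDaySchedule
(
    DateKey         int NOT NULL,   -- FK to DimDate
    BusinessUnitKey int NOT NULL,   -- FK to DimBusinessUnit
    CONSTRAINT PK_FactSlowDaySchedule PRIMARY KEY (DateKey, BusinessUnitKey)
);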
I hope this helps.
Your situation seems to be the same as the Multiple National Calendars problem described by Kimball:
http://www.kimballgroup.com/1998/12/think-globally-act-locally/
Where Kimball is describing holidays in the left-most table, you could also add a "slow day" flag.
Say I have blog post comments. On insert they get the current UTC date/time as their creation time (via a sysutcdatetime default value) and they get an ID (via an integer identity column as the PK).
Now I want to sort the comments descending by their age. Is it safe to just ORDER BY ID, or is it required to use the creation time? I'm thinking about "concurrent" commits and rollbacks of inserts under the read committed isolation level. Is it possible that the IDs sometimes do not represent the insert order?
I'm asking this because if sorting by IDs is safe then I could have the following benefits:
I don't need an index for the creation time.
Sorting by IDs is probably faster.
I don't need high precision on the datetime2 column, because that would only be required for sorting anyway (in order to not have two rows with the same creation time).
This answer says it is possible when you don't have the creation time but is it always safe?
This answer says it is not safe with an identity column. But when it's also the PK the answer gives an example with sorting by ID without mentioning if this is safe.
Edit:
This answer suggests sorting by date and then by ID.
Yes, the IDs can be jumbled because ID generation is not part of the insert transaction. This is in order to not serialize all insert transactions on the table.
The most correct way to sort would be ORDER BY DateTime DESC, ID DESC, with the ID added as a tie breaker in case the same date was generated multiple times. Tie breakers in sorts are important to achieve deterministic results; you don't want different data to be shown on multiple refreshes of the page, for example.
You can define a covering index on DateTime DESC, ID DESC and achieve the same performance as if you had ordered by the clustered index key (here: ID). There's no relevant physical difference between the clustered index and nonclustered indexes.
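A sketch of what that could look like, assuming a dbo.Comments table with an ID identity primary key, a CreationTimeUtc datetime2 column, and a CommentText column (the names are assumptions):

-- Covering index: the key order matches the sort, INCLUDE covers the displayed column
CREATE NONCLUSTERED INDEX IX_Comments_CreationTimeUtc
    ON dbo.Comments (CreationTimeUtc DESC, ID DESC)
    INCLUDE (CommentText);

SELECT ID, CreationTimeUtc, CommentText
FROM dbo.Comments
ORDER BY CreationTimeUtc DESC, ID DESC;   -- ID breaks ties deterministically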
Since you mention the PK somewhere I want to point out that the choice of the PK does not affect any of this. Only indexes do. The query processor does not ever care about PKs and unique keys.
I would order by ID.
Technically you may get different results when sorting by ID versus sorting by time.
The sysutcdatetime default will return the time when the transaction starts; the ID could be generated somewhat later during the transaction. Also, the clock on any computer drifts. When the computer clock is synchronized with its time source, the clock may jump forwards or backwards. If you sync often, the jump will be small, but it will happen.
From a practical point of view, if two comments were posted within, say, one second of each other, does it really matter which of them is shown first?
What I think does matter is the consistency of the display results. If the system somehow decides that comment A should go before comment B, then this order should be preserved everywhere across the system.
So, even with the highest precision datetime2(7) column it is possible to have two comments with exactly the same timestamp and if you order just by this timestamp it is possible that sometimes they will appear as A, B and sometimes as B, A.
If you order by ID (primary key), you are guaranteed that it is unique, so the order will be always well defined.
I would order by ID.
On second thought, I would order by time and ID.
If you show the time of the comment to the user, it is important to show comments in order of that time. To guarantee consistency, sort by both time and ID in case two comments have the same timestamp.
If you sort on ID in descending order and filter by user, your blog will automatically show the latest post at the top, which does the job for you. So don't use the date for sorting.
We have an SQL Server that gets daily imports of data files from clients. This data is interrelated and we are always scrubbing it and having to look for suspect duplicate records between these files.
Finding and tagging suspect records can get pretty complicated. We use logic that requires some field values to be the same, allows some field values to differ, and allows a range to be specified for how different certain field values can be. The only way we've found to do it is with a cursor-based process, and it places a heavy burden on the database.
So I wanted to ask if there's a more efficient way to do this. I've heard it said that there's almost always a more efficient way to replace cursors with clever JOINS. But I have to admit I'm having a lot of trouble with this one.
For a concrete example suppose we have 1 table, an "orders" table, with the following 6 fields.
(order_id, customer_id, product_id, quantity, sale_date, price)
We want to look through the records to find suspect duplicates based on the following example criteria, which get increasingly harder:
Records that have the same product_id, sale_date, and quantity but different customer_ids should be marked as suspect duplicates for review.
Records that have the same customer_id, product_id, and quantity, and have sale_dates within five days of each other, should be marked as suspect duplicates for review.
Records that have the same customer_id and product_id, but quantities that differ by up to 20 units, and sale_dates within five days of each other, should be considered suspect.
Is it possible to satisfy each one of these criteria with a single SQL Query that uses JOINS? Is this the most efficient way to do this?
If this gets much more involved, then you might be looking at a simple ETL process to do the heavy lifting for you: the load to the database should be manageable in the sense that you will be loading into your ETL environment, running transformations/checks/comparisons, and then writing your results to perhaps a staging table that outputs the stats you need. It sounds like a lot of work, but once it is set up, tweaking it is no great pain.
On the other hand, if you are looking at comparing vast amounts of data, then that might entail significant network traffic.
I am thinking efficiency will mean adding indexes on the fields whose contents you are comparing. I'm not sure offhand whether one mega-join is what you need, or whether it is better to list the primary keys of the suspect records into a holding table so you can report the problems later, i.e. do you need to know why each record is suspect in the result set?
You could
-- Assuming some pkid (primary key) has been added
1.
select o.pkid, o.order_id, o.customer_id, o.product_id, o.quantity, o.sale_date
from orders o
join orders o2 on o.product_id = o2.product_id and o.sale_date = o2.sale_date
    and o.quantity = o2.quantity and o.customer_id <> o2.customer_id
then keep joining up more copies of orders, I suppose
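If you do go the join route, an index along these lines (the name and key order are just an assumption; tune them to the criterion you run most often) lets the self-join seek on the equality columns:

CREATE INDEX IX_orders_dupe_check
    ON orders (customer_id, product_id, sale_date)
    INCLUDE (quantity, order_id);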
You can do this with a single CASE expression. In the scenario below, the value of MarkedForReview tells you which of your three tests (1, 2, or 3) triggered the review. Note that I have to check the conditions of the third test before the second test.
With InputData As
(
    Select O.order_id, O.product_id, O.sale_date, O.quantity, O.customer_id
        , Case
            When O.customer_id <> O2.customer_id
                And O.sale_date = O2.sale_date
                And O.quantity = O2.quantity Then 1              -- Test 1: same product/date/quantity, different customer
            When O.customer_id = O2.customer_id
                And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And Abs(O.quantity - O2.quantity) <= 20
                And O.quantity <> O2.quantity Then 3             -- Test 3: quantities differ by up to 20, dates within 5 days
            When O.customer_id = O2.customer_id
                And Abs(DateDiff(d, O.sale_date, O2.sale_date)) <= 5
                And O.quantity = O2.quantity Then 2              -- Test 2: same quantity, dates within 5 days
            Else 0
          End As MarkedForReview
    From Orders As O
    Left Join Orders As O2
        On O2.order_id <> O.order_id
        And O2.product_id = O.product_id
)
Select order_id, product_id, sale_date, quantity, customer_id, MarkedForReview
From InputData
Where MarkedForReview <> 0
Btw, if you are using something prior to SQL Server 2005, you can achieve the equivalent query using a derived table. Also note that you can return the id of the complementary order that triggered the review. Both orders that trigger a review will obviously be returned.