DISTINCT COUNT basket analysis between multiple measures with shared dimension - sql-server

Let's say I have a cube with two different distinct count measures, call them Measure1 and Measure2. Both of these measures contain a common dimension, Dimension1, which is counted by both measures.
What I need to do is return a distinct count of Dimension1 members that exist in both Measure1 and Measure2, after appropriate filtering on each measure as required.
I can define MDX queries for both Measure1 and Measure2 individually and get distinct counts, but I need to be able to "overlap" the result to avoid double-counting the members that exist in both sets.
Note: in the actual scenario, there are more than 2 measures involved, and all MDX queries will be dynamically constructed (the user defines which measures and dimension criteria are included).
Can this be done in SSAS/MDX? If not, is there another Microsoft tool/feature that can? The minimum requirement for the system is SQL Server 2008 R2 Standard Edition.
Honestly I have no idea where to start. Google turned up nothing like this (I saw some basket analysis stuff involving a single measure, but I'm unsure if or how to apply that to my scenario). I'm not an SSAS/MDX/BI expert by any means.
There are two alternatives that I can think of:
Use DRILLTHROUGH with the individual MDX queries and (essentially) COUNT DISTINCT the results.
Use T-SQL on the data warehouse source database. (May be difficult to account for all scenarios efficiently.)
We do have a requirement to also be able to drillthrough, so I'll probably have to implement solution #1 anyway, but it would be nice to have a more efficient way to obtain just the counts, as counts will be needed far more frequently.

I would add a Distinct Count measure based on the Dimension1 Key attribute. I would present it in Excel 2010+ using the Sets MDX feature to filter on Measure1, 2 etc.

I never did find an MDX solution for this.
I went ahead with a solution that queries the data warehouse directly, and it's working pretty well so far after some performance tweaks. This approach may not be suitable for all applications, but it looks like it will work for our particular scenario.
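For illustration, a minimal T-SQL sketch of that approach, assuming two fact tables that share a Dimension1Key column (all table and column names here are hypothetical):

-- Distinct count of Dimension1 members that appear in BOTH fact tables,
-- after whatever per-measure filtering is required. INTERSECT already
-- deduplicates, so COUNT(*) yields the distinct count.
SELECT COUNT(*) AS OverlapCount
FROM (
    SELECT Dimension1Key FROM dbo.FactMeasure1 WHERE 1 = 1 -- measure 1 filters here
    INTERSECT
    SELECT Dimension1Key FROM dbo.FactMeasure2 WHERE 1 = 1 -- measure 2 filters here
) AS Overlap;

With more than two measures, the dynamically built query just chains additional INTERSECT branches.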

I would recommend a union, either on the SQL Server side (create a view from the two tables) or on the SSAS side (create a single measure but with different partitions from different sources, for example Partition1 for Credits and Partition2 for Deposits).
For the second way, you first make a simple Measure1 using the SUM function. Then click on Measure1 and choose "Create new measure" using the DistinctCount function; SSAS will create a separate new measure group with Measure2 using the DistinctCount aggregation.
This should work.
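For the first way, a minimal sketch of such a view, with hypothetical table and column names:

-- Combine the two source fact tables so a single DistinctCount measure
-- can be built over the shared dimension key.
CREATE VIEW dbo.FactCombined AS
SELECT Dimension1Key, 'Credit' AS SourceType FROM dbo.FactCredits
UNION ALL
SELECT Dimension1Key, 'Deposit' AS SourceType FROM dbo.FactDeposits;

Note, though, that a distinct count over the union counts members that appear in either source; it does not by itself give the overlap the question asks for.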

Let's simplify the problem statement: you want the count of customers who bought both bread and eggs, or who have both a Toyota and a Honda. I faced this issue a long time back and came up with a query design, but the performance of these queries was not good: by their nature they open the fact table down to grain level, so all aggregation benefits are lost.
Here is the code. I am counting the customers, based on their names, who ordered Classic Vest, S (product 471) or HL Mountain Tire (product 537) along with other products:
with
member [Measures].[CustomersWhoBoughtClassicVestS] as
    count(
        intersect(
            { nonempty(
                existing ([Customer].[Customer].children),
                [Measures].[Internet Order Count]) },
            { extract(
                nonempty(
                    ([Customer].[Customer].children * [Product].[Product].&[471]),
                    [Measures].[Internet Order Count]),
                [Customer].[Customer]) }
        )
    )
member [Measures].[CustomersWhoBoughtHLMountainTire] as
    count(
        intersect(
            { nonempty(
                existing ([Customer].[Customer].children),
                [Measures].[Internet Order Count]) },
            { extract(
                nonempty(
                    ([Customer].[Customer].children * [Product].[Product].&[537]),
                    [Measures].[Internet Order Count]),
                [Customer].[Customer]) }
        )
    )
select
    { [Measures].[CustomersWhoBoughtClassicVestS],
      [Measures].[CustomersWhoBoughtHLMountainTire] } on columns,
    { nonempty([Product].[Product].children,
               [Measures].[Internet Order Count]) } on rows
from [Adventure Works]

Related

Pattern matching based on variable name for variable selection in a Postgres query?

I'm trying to query some data in Postgres and I'm wondering how I might use some sort of pattern matching not merely to select rows - e.g. SELECT * FROM schema.tablename WHERE varname ~ 'phrase' - but to select columns in the SELECT statement, specifically doing so based on the common names of those columns.
I have a table with a bunch of estimates of rates over the course of many years - say, of apples picked per year - along with upper and lower 95% confidence intervals for each year. (For reference, each estimate and 95% CI comes from a different source - a written report, to be precise; these sources are my rows, and the table describes various aspects of each source. Based on a critique from below, I think it's important that the reader know that the unit of analysis in this relational database is a written report with different estimates of things picked per year - apples in one table, oranges in another, pears in a third.)
So in this table, each year has three columns / variables:
rate_1994
low_95_1994
high_95_1994
The thing is, the CIs are mostly null - they haven't been filled in. In my query, I'm really only trying to pull out the rates for each year: all the variables that begin with rate_. How can I phrase this in my SELECT statement?
I'm trying to employ regexp_matches to do this, but I keep getting back errors.
I've done some poking around StackOverflow on this, and I'm getting the sense that it may not even be possible, but I'm trying to make sure. If it isn't possible, it's easy to break up the table into two new ones: one with just the rates, and another with the CIs.
(For the record, I've looked at posts such as this one:
Selecting all columns that start with XXX using a wildcard? )
Thanks in advance!
If what you are basically asking is whether columns can be selected dynamically based on an execution-time condition:
No.
You could, however, use PL/pgSQL to build up the query as a string and then run it with EXECUTE.
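For illustration, a minimal PL/pgSQL sketch, assuming the rate columns all start with rate_ (the function name is hypothetical):

-- Build a SELECT listing only the rate_* columns of the given table.
-- The returned string can then be run with EXECUTE, or handed back to
-- the client to execute as a normal query.
CREATE OR REPLACE FUNCTION build_rate_query(p_schema text, p_table text)
RETURNS text
LANGUAGE plpgsql
AS $$
DECLARE
    col_list text;
BEGIN
    SELECT string_agg(quote_ident(column_name), ', ' ORDER BY column_name)
      INTO col_list
      FROM information_schema.columns
     WHERE table_schema = p_schema
       AND table_name   = p_table
       AND column_name LIKE 'rate\_%';   -- escape the underscore wildcard

    RETURN format('SELECT %s FROM %I.%I', col_list, p_schema, p_table);
END;
$$;

-- Usage: SELECT build_rate_query('schema', 'tablename');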

SQL save different versions of same product in table

I have a products table in Postgres that stores all the data I need. What I need to work out is the best way to do the following:
Each product has a different status on the way to completion - machined, painted, assembled, etc. For each status there is a letter that changes in the product id.
What would the most efficient way of saving the data? For each status of the product should there be 'another product' in the table? Or would doing join tables somewhere work?
Example:
111a1 for machined
111b1 for painted
Yet these are the same end product, just at different stages ...
It depends on what you want to be efficient: storage, ingestion, queries, maintainability...
Joins would work - you can join on some substring of the product id, so you need not have separate products for every stage of production.
But maintainability of your code is really important. Businesses change - perhaps to include sub-assemblies.
You might want to re-think this scheme of altering the product id to show the status. Product ID and work flow state are orthogonal concepts. So you probably want to have them in separate fields. You'll probably write far less code that way. The alternative will be becoming really well acquainted with substr() (depending on your SQL dialect), and all sorts of duplications elsewhere.
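For illustration, a minimal sketch of that separation, with hypothetical table and column names:

-- Product identity and workflow state kept as orthogonal concepts.
CREATE TABLE product (
    product_id  text PRIMARY KEY,   -- e.g. '1111', with no status letter embedded
    description text
);

CREATE TABLE product_status_history (
    product_id text        NOT NULL REFERENCES product (product_id),
    status     text        NOT NULL,  -- 'machined', 'painted', 'assembled', ...
    changed_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (product_id, status)
);

A product's current stage is simply its latest row in the history table, and no substring gymnastics on the id are needed.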

joining across multiple fact tables with a dimension in between

What's a good approach to data warehouse design if requested reports require summarized information about the same dimensions (and at the same granularity) but the underlying data is stored in separate fact tables?
For example, a report showing total salary paid and total expenses reported for each employee for each year, when salary and expenses are recorded in different fact tables. Or a report listing total sales per month and inventory received per month for each SKU sold by a company, when sales comes from one fact table and receiving comes from another.
Solving this problem naively seems pretty easy: simply query and aggregate both fact tables in parallel, then stitch together the aggregated results either in the data warehouse or in the client app.
But I'm also interested in other ways to think about this problem. How have others solved it? I'm wondering both about data-warehouse schema and design, as well as making that design friendly for client tools to build reports like the examples above.
Also, does this "dimension sandwich" use-case have a name in canonical data-warehousing terminology? If yes that will make it easier to research via Google.
We're working with SQL Server, but the questions I have at this point are hopefully platform-neutral.
I learned today that this technique is called Drilling Across:
Drilling across simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes. The answer sets from the two queries are aligned by performing a sort-merge operation on the common dimension attribute row headers. BI tool vendors refer to this functionality by various names, including stitch and multipass query.
Sounds like the naive solution above (query multiple fact tables in parallel and stitch together the results) is also the suggested solution.
More info:
Drilling Across - Kimball overview article
http://blog.oaktonsoftware.com/2011/12/three-ways-to-drill-across.html - SQL implementation suggestions for drilling across
Many thanks to @MarekGrzenkowicz for pointing me in the right direction to find my own answer! I'm answering it here in case someone else is looking for the same thing.
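For illustration, a minimal drill-across sketch in SQL for the salary/expenses example (all table and column names are hypothetical):

-- Aggregate each fact table separately at the same grain, then stitch
-- the result sets together on the conformed dimension attributes.
SELECT COALESCE(s.employee_key, e.employee_key) AS employee_key,
       COALESCE(s.pay_year, e.pay_year)         AS pay_year,
       s.total_salary,
       e.total_expenses
FROM (SELECT employee_key, pay_year, SUM(salary_paid) AS total_salary
        FROM FactSalary
       GROUP BY employee_key, pay_year) AS s
     FULL OUTER JOIN
     (SELECT employee_key, pay_year, SUM(expense_amount) AS total_expenses
        FROM FactExpense
       GROUP BY employee_key, pay_year) AS e
     ON s.employee_key = e.employee_key
    AND s.pay_year = e.pay_year;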
The "naive solution" you described is most of the times the preferred one.
A common exception is when you need to filter the detailed rows of one fact using another fact table. For example, "show me the capital-tieup (stock inventory) for the articles we have not sold this year". You cannot simply sum up the capital-tieup in one query. In this case a consolidated fact can be a solution, if you are able to express both measures on a common grain.
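For that exception case, a sketch of filtering one fact by another, again with hypothetical names:

-- Capital-tieup for articles with no sales this year: the inventory fact
-- must be filtered row by row against the sales fact before aggregating.
SELECT i.article_key,
       SUM(i.stock_value) AS capital_tieup
FROM FactInventory AS i
WHERE i.snapshot_year = 2014   -- placeholder for "this year"
  AND NOT EXISTS (SELECT 1
                    FROM FactSales AS s
                   WHERE s.article_key = i.article_key
                     AND s.sales_year = i.snapshot_year)
GROUP BY i.article_key;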

How to determine cubes which use particular dimension?

Performing changes to dimensions is always a risk for me if I don't know exactly which cubes will be affected. Is there a more elegant way to do this than checking every cube one by one or creating an external documentation?
I would generally like to know if there is a way to do it, because we use a wide range of versions, but especially for SQL Server 2000 and 2008.
Dimensions are actually related to measure groups (which belong to cubes, of course).
You can check these relations with this query:
SELECT *
FROM $SYSTEM.MDSCHEMA_MEASUREGROUP_DIMENSIONS
WHERE CUBE_NAME = 'YOUR_CUBE_NAME'
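To answer the question in the direction asked - which cubes use a particular dimension - filter on the dimension instead (the dimension name below is a placeholder):

SELECT CUBE_NAME, MEASUREGROUP_NAME
FROM $SYSTEM.MDSCHEMA_MEASUREGROUP_DIMENSIONS
WHERE DIMENSION_UNIQUE_NAME = '[Dimension1]'

Note that these DMV queries are available from SSAS 2008 onwards; Analysis Services 2000 has no $SYSTEM rowsets, so there you are stuck with checking each cube individually.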

"Group By" and other database algorithms?

I've written some very basic tools for grouping, pivoting, unioning and subtotaling datasets sourced from non-DB sources (e.g. CSV, OLTP systems). The "group by" methods sit at the core of most of these.
However, I'm sure a lot of work has been done on efficient algorithms for grouping data... and I'm sure I'm not using them. And my Google-fu has completely failed to turn anything up.
Are there any good online sources or books describing the better methods to create grouped data?
Or should I just start looking at the MySQL source or something similar?
One very handy way to "group by" some field (or set of fields and expressions, but I'll say "field" for simplicity!-) is when you can arrange to walk over the rows before grouping (RBG) in sorted order. You don't actually care about the sorting itself (except in the common case where an ORDER BY is also there and just happens to be on the same field as the GROUP BY!-), but rather about its side effect: all rows in RBG with the same value of the grouping field come right after each other. So you can accumulate until the grouping field changes, then emit/yield the results accumulated so far, and proceed to reinitialize the accumulators with the new row (the one with a different value of the grouping field). Make sure to initialize the accumulators at the very start, and to emit the final accumulated results at the very end, of course.
If this doesn't work, maybe you can hash the grouping field and use a hash table for the results being accumulated for each group: at each row in RBG, hash the grouping field and check whether it is already present as a key in the hash table; if not, put it there with accumulators suitably initialized from the RBG row, otherwise update the accumulators per the RBG row. You just emit everything at the end. The problem, of course, is that you're taking up more memory until the end!-)
These are the two fundamental approaches. Would you like pseudocode for each, BTW?
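For illustration, a minimal Python sketch of both approaches (the field names are made up):

import itertools
from collections import defaultdict

rows = [{"dept": "toys", "amount": 10},
        {"dept": "books", "amount": 5},
        {"dept": "toys", "amount": 7}]

key = lambda r: r["dept"]

# Sort-based grouping: sort on the key, then accumulate each run of equal keys.
def group_by_sorted(rows, key):
    for k, run in itertools.groupby(sorted(rows, key=key), key=key):
        yield k, sum(r["amount"] for r in run)

# Hash-based grouping: one pass, all accumulators held in a dict until the end.
def group_by_hashed(rows, key):
    acc = defaultdict(int)
    for r in rows:
        acc[key(r)] += r["amount"]
    return acc.items()

print(dict(group_by_sorted(rows, key)))   # {'books': 5, 'toys': 17}
print(dict(group_by_hashed(rows, key)))   # {'toys': 17, 'books': 5}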
You should check out OLAP databases. OLAP allows you to create a database of aggregates meant to be analyzed in a "slice and dice" fashion.
Aggregate measures such as counts, averages, mins, maxs, sums and stdev's can be quickly analyzed by any number of dimensions using an OLAP database.
See this introduction to OLAP on MSDN.
Give an example CSV file and the type of result wanted and I might be able to rustle up a solution in Python for you.
Python has the CSV module and list/generator comprehensions that can help with this sort of thing.
Paddy.
