Optimize MDX query - query-optimization

I have two needs in my query.
First: a product list sorted by my measure, with higher-selling products appearing first.
ProductCode  Sales
-----------  -----
123          18
332          17
245          16
656          15
Second: a cumulative sum over that presorted product list.
ProductCode  Sales  ACC
-----------  -----  ---
123          18     18
332          17     35
245          16     51
656          15     66
I wrote the MDX below to achieve this:
WITH
SET SortedProducts AS
    Order(
        [DIMProduct].[ProductCode].[ProductCode].ALLMEMBERS,
        [Measures].[Sales],
        BDESC)
MEMBER [Measures].[ACC] AS
    Sum(
        Head(
            SortedProducts,
            Rank([DIMProduct].[ProductCode].CurrentMember, SortedProducts)),
        [Measures].[Sales])
SELECT
    { [Measures].[Sales], [Measures].[ACC] } ON COLUMNS,
    SortedProducts ON ROWS
FROM [Model]
But it takes about 3 minutes to run. Any suggestion on how to optimize my code, or is this normal?
I have 9,635 products in total.

If you do a quick search on Google, there are different ways to achieve this (many answers here as well).
That said, I would give this different way of calculating your running total a try:
MEMBER [Measures].[SortedRank] AS
    Rank([Product].[Product].CurrentMember, [SortedProducts])
MEMBER [Measures].[ACC2] AS
    Sum(TopCount([SortedProducts], [Measures].[SortedRank]), [Measures].[Internet Sales Amount])
I don't know if TopCount will perform faster than Head in your case, but, for example, your query on my test machine against the AdventureWorks cube takes the same time with either Head or TopCount.
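One more thing worth a try, as a rough sketch (untested against your model): build the set from .MEMBERS instead of .ALLMEMBERS and wrap it in NonEmpty, so products with no sales never enter the sort or the running total:

WITH
SET SortedProducts AS
    Order(
        NonEmpty([DIMProduct].[ProductCode].[ProductCode].MEMBERS, [Measures].[Sales]),
        [Measures].[Sales], BDESC)
MEMBER [Measures].[ACC] AS
    Sum(
        Head(SortedProducts, Rank([DIMProduct].[ProductCode].CurrentMember, SortedProducts)),
        [Measures].[Sales])
SELECT
    { [Measures].[Sales], [Measures].[ACC] } ON COLUMNS,
    SortedProducts ON ROWS
FROM [Model]

Rank over a named set is evaluated once per row and costs roughly the size of the set each time, so anything that shrinks SortedProducts tends to help more than swapping Head for TopCount.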
Hope this helps

Related

star-schema - fact table modeling

I need to model a star schema for some business needs around liquidity stress testing.
I will try to give an analogous example.
Let's say we have deals involving financing / financial securities, etc.
In the fact table, at a given date, these deals have a real value of X euros, but that value will vary over time, so we also have some projected values.
My concern is how to represent these projected values for the deals, and more specifically what granularity to choose.
(The example below is an oversimplification of the fact table -- and yes, it's dimension IDs that are used otherwise.)
Method 1: as many metric columns as projection values calculated
AsofDate    DealId          0Day  1Day  7Days  1Month
2022-01-01  financingDeal1  100   99    98     85
2022-01-01  financingDeal2  150   150   120    120
2022-01-01  financingDeal3  100   99    98     85
2022-01-01  financingDeal4  100   99    98     85
Method 2: add granularity: a row is no longer just a deal on a given date; it's a deal on a given date together with its projection over the next few days/months.
AsofDate    DealId          projection     value
2022-01-01  financingDeal1  0Day - actual  100
2022-01-01  financingDeal1  1Day           99
2022-01-01  financingDeal1  7Days          99
2022-01-01  financingDeal1  1Month         85
From where I see it:
For method 1, the main inconvenience is that if, in the future, we get a new projection value for 3 months, we will need to add a column for '3months' in the ETL / in the OLAP cube.
For method 2, we will have as many rows as (deals * projections); we have 11 projections, so that's 11 rows per deal, and we have 1 million+ deals.
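For concreteness, here is a rough DDL sketch of the two layouts (every table and column name below is illustrative only, not the actual model):

-- Method 1: one metric column per projection horizon
CREATE TABLE FactLiquidityWide (
    AsofDate     DATE NOT NULL,
    DealId       INT  NOT NULL,          -- FK to the deal dimension
    Value0Day    DECIMAL(18,2),
    Value1Day    DECIMAL(18,2),
    Value7Days   DECIMAL(18,2),
    Value1Month  DECIMAL(18,2)
    -- a new horizon (e.g. 3 months) means a new column here and in the ETL
);

-- Method 2: one row per (deal, horizon), with the horizon as its own dimension
CREATE TABLE DimProjection (
    ProjectionId    INT         NOT NULL PRIMARY KEY,
    ProjectionLabel VARCHAR(20) NOT NULL -- '0Day - actual', '1Day', '7Days', '1Month', ...
);

CREATE TABLE FactLiquidityTall (
    AsofDate     DATE NOT NULL,
    DealId       INT  NOT NULL,
    ProjectionId INT  NOT NULL,          -- FK to DimProjection; a new horizon is just a new row there
    Value        DECIMAL(18,2)
);

With method 2, adding a 3-month horizon is one row in DimProjection and one extra row per deal in the fact; the price is the row count estimated above (deals * projections).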
What is your opinion on this topic?
Thanks for your consideration.

Recursive CTE help - how do you code this non-hierarchical sequence?

I'm trying to write a recursive CTE for a table that does not have a hierarchy, meaning there is no NULL in the family of IDs that are related.
For example, the table looks like this:
AccountID  Account_RelationshipID
---------  ----------------------
1          2
2          4
4          6
6          11
11         1
15         17
17         19
19         15
So 1 relates to 2. 2 relates to 4. 4 relates to 6. 6 relates to 11. And then 11 loops back to ID of 1.
Then there is a new family. 15 relates to 17. 17 relates to 19. and then 19 goes back to 15.
There is also a separate Account_Detail table that has the account date:
AccountID  AccountName  AccountDate
---------  -----------  -----------
1          Dave         1/1/2012
2          Dave         1/1/2013
4          Dave         1/1/2014
6          Dave         1/1/2015
11         Dave         1/1/2016
15         Paul         7/1/2015
17         Paul         7/1/2016
19         Paul         7/1/2017
I tried writing this as my code:
WITH C AS
(
SELECT
AR.AccountID,
AR.Account_RelationshipID,
AD.AccountDate
FROM
Account_Relationship AR
INNER JOIN
Account_Detail AD ON AD.AccountID = AR.AccountID
UNION ALL
SELECT
AR2.AccountID,
AR2.Account_RelationshipID,
AD.AccountDate
FROM
Account_Relationship AR2
INNER JOIN
Account_Detail AD2 ON AD2.Account_ID = AR2.Account_ID
INNER JOIN
C ON C.AccountID = AR2.Account_relationshipID
WHERE
AD.AccountDate < AD2.AccountDate
)
Obviously this code is totally wrong. This is as far as I've gotten. This code will loop infinitely.
I was thinking I could break the loop with a condition: when the AccountDate of the next AccountID in the chain is less than the AccountDate of the previous one, stop and move on to the next chain.
Also, how do you get it to move on to the next "family" of AccountIDs (Paul, in this case)? All the videos and tutorials I've seen about recursive CTEs only show how to do it for one family - usually with a hierarchical structure that terminates at NULL.
Help!!
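A rough sketch along those lines, untested and assuming the table and column names shown above: the anchor starts a chain at every account, the recursive member follows Account_RelationshipID forward, and the date comparison stops the recursion as soon as the chain would loop back to an older account.

WITH C AS
(
    -- anchor: every account can start a chain
    SELECT
        AR.AccountID AS RootAccountID,      -- remembers which chain ("family") a row belongs to
        AR.AccountID,
        AR.Account_RelationshipID,
        AD.AccountDate
    FROM Account_Relationship AR
    INNER JOIN Account_Detail AD ON AD.AccountID = AR.AccountID

    UNION ALL

    -- recursive member: step from the current account to the one it relates to
    SELECT
        C.RootAccountID,
        AR2.AccountID,
        AR2.Account_RelationshipID,
        AD2.AccountDate
    FROM Account_Relationship AR2
    INNER JOIN Account_Detail AD2 ON AD2.AccountID = AR2.AccountID
    INNER JOIN C ON AR2.AccountID = C.Account_RelationshipID
    WHERE AD2.AccountDate > C.AccountDate   -- the step back to the oldest account fails this test, ending the loop
)
SELECT RootAccountID, AccountID, Account_RelationshipID, AccountDate
FROM C
ORDER BY RootAccountID, AccountDate;

Because the anchor starts from every account, each family appears several times (once per starting point); keeping only the chain whose RootAccountID has the oldest AccountDate in its family would leave one chain per family. The RootAccountID column is also what separates Dave's family from Paul's.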

AVG giving a Count instead of Average

This is probably a silly mistake on my end, but I can't quite figure it out on my own.
I'm trying to calculate average over a set of data pulled from a sub-query presented in the following way:
TotalPDMPs  DefaultClinicID
13996       -1
134         23
432         29
123         26
39          27
13          21
40          24
46          30
1           25
Right now the average of 'TotalPDMPs' calculated for each 'DefaultClinicID' comes back exactly the same as the data above.
Here's my query for calculating the average:
select DefaultClinicID as ClinicID, AVG(TotalPDMPs)
from
    (select count(p.PatientID) as TotalPDMPs, DefaultClinicID
     from PatientPrescriptionRegistry ppr, Patient p
     where p.PatientID = ppr.PatientID
       and p.NetworkID = 2
     group by DefaultClinicID) p
group by DefaultClinicID
can someone tell me what I'm doing wrong here?
Thanks.
The GROUP BY column is the same: the inner query gets a count per DefaultClinicID, and then the outer query tries to take an average over that same DefaultClinicID.
Does that make sense? Any aggregation on that column while you group by the same thing will return the same value. So for clinic 23 the average calculation would be: 134 / 1 = 134.
I think you just need to do the average in your inner query and you get what you want. Or maybe avg(distinct p.PatientID) is what you are after?
In the inner sub-query you already grouped by DefaultClinicID, so every unique DefaultClinicID already has only one row, and the average of a single value x is x.
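If what you are after is a single average of TotalPDMPs across clinics (one number rather than one row per clinic), a sketch would be to keep the inner grouping but drop the outer GROUP BY, using the same tables as in the question:

select AVG(CAST(TotalPDMPs as decimal(10,2))) as AvgTotalPDMPs  -- the cast avoids integer-truncated averages on some engines
from
    (select count(p.PatientID) as TotalPDMPs, DefaultClinicID
     from PatientPrescriptionRegistry ppr
     inner join Patient p on p.PatientID = ppr.PatientID
     where p.NetworkID = 2
     group by DefaultClinicID) t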

How to store bidirectional relationships

I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance.
However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation, and wondered if someone could point me in the right direction.
What confuses me is how to store the bidirectional nature of the relationship.
I've started to put some examples below, but wondered if there is a best practice for storing this type of data.
Example data
id, address
001, 5 Main Street
002, 5 Main St.
003, 5 Main Str
004, 6 High Street
005, 7 Low Street
006, 7 Low St
Suggestion 1
customer_id1, customer_id2, relationship_strength
001, 002, 0.74
001, 003, 0.77
002, 003, 0.76
005, 006, 0.77
I'm not happy with this approach as it sort of implies a one-way relationship from customer_id1 to customer_id2. Unless, of course, I include all relationships both ways, but that would double the amount of processing time and the size of the table,
e.g. I would also need to include: 002, 001, 0.74
Suggestion 2
customer_id, grouping_id
001, 1
002, 1
003, 1
005, 2
006, 2
The way to deal with symmetric relations in a relational system is as follows:
Choose a canonical form in which the symmetric pairs are stored, e.g. customer_id1 < customer_id2.
Define a view SYMM_TBL as select id1, id2, ... from ... UNION select id2 as id1, id1 as id2, ... from ...
Decent systems ought not to punish you in the performance area when querying this view.
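A sketch of that view using the columns from Suggestion 1 (the base table name customer_relationship is just a placeholder):

CREATE VIEW symm_customer_relationship AS
    SELECT customer_id1, customer_id2, relationship_strength
    FROM customer_relationship
    UNION ALL
    SELECT customer_id2 AS customer_id1,
           customer_id1 AS customer_id2,
           relationship_strength
    FROM customer_relationship;

UNION ALL is enough here because the canonical ordering (customer_id1 < customer_id2) guarantees the two halves can never overlap.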
What we have here is a graph in which each node has a relationship (edit distance) with every other node. This is not in the normal range of data models. It is also not a permanent feature of your database (assuming you resolve the business processes which led to the duplicate data), so it isn't worth sweating over the solution which best fits relational theory. What we need is a practical solution.
Think of it as a matrix. If we go for the optimum processing we won't execute the duplicate scorings. So we score Address 1 against all the other Addresses, we score Address 2 against all the other Addresses except Address 1, we score Address 3 against all the other Addresses except Addresses 1 and 2, etc. And what we end up with is a bit like a football league table:
          addr
          1    2    3    4    5
addr  1   -    95   95   80   76
      2   -    -    100  75   72
      3   -    -    -    75   72
      4   -    -    -    -    83
      5   -    -    -    -    -
This data can best be stored in suggestion 1, a table of ID1, ID2, SCORE. Although we do need to pivot the data to get the output looking like that :)
In a proper league table there are two sets of scores - home and away - so both halves of the table get filled in. That isn't strictly necessary here, as the edit distance for 1 vs 2 is the same as for 2 vs 1. However, it would make querying the results more straightforward if the result set included the mirrored scores. That is, for records (1,5,76), (2,5,72), etc. we generate records (5,1,76), (5,2,72). This could be done at the end of the scoring process.
          addr
          1    2    3    4    5
addr  1   -    95   95   80   76
      2   95   -    100  75   72
      3   95   100  -    75   72
      4   80   75   75   -    83
      5   76   72   72   83   -
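That end-of-scoring step could be as simple as this (the table name scores is just a placeholder for the ID1, ID2, SCORE table mentioned above):

-- run once, after the half-matrix has been scored, to add the mirrored half
INSERT INTO scores (id1, id2, score)
SELECT id2, id1, score
FROM scores;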
Of course, this is mainly a presentational thing, so it only needs to be done for display purposes, e.g. exporting the data to a spreadsheet. We can still get all the scores for, say, Address 5 in a readable fashion without mirroring the scores, using a simple SQL statement:
select case when id1 = 5 then id1 else id2 end as id1
, case when id1 = 5 then id2 else id1 end as id2
, score
from your_table
where id1 = 5
or id2 = 5
/
As always it depends on what you want to do with the data once you've calculated it.
Assuming it's simply to identify or locate duplicates then your suggestion 1 is what I'd use, i.e. a second table that simply stores the pairs and the strengths. My only suggestion is to make the strengths a scaled integer rather than a decimal.
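A minimal sketch of that table, with the strength stored as a scaled integer (0-100) and the canonical ordering from the first answer enforced (table name is illustrative):

CREATE TABLE customer_similarity (
    customer_id1          INT      NOT NULL,
    customer_id2          INT      NOT NULL,
    relationship_strength SMALLINT NOT NULL,     -- e.g. 74 instead of 0.74
    PRIMARY KEY (customer_id1, customer_id2),
    CHECK (customer_id1 < customer_id2)          -- store each pair once, smallest id first
);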

Finding bigram in a location index

I have a table which indexes the locations of words in a bunch of documents.
I want to identify the most common bigrams in the set.
How would you do this in MSSQL 2008?
the table has the following structure:
LocationID -> DocID -> WordID -> Location
I have thought about trying to do some kind of complicated join... and I'm just doing my head in.
Is there a simple way of doing this?
I think I'd better edit this on Monday in order to bump it up in the questions.
Sample Data
LocationID  DocID  WordID  Location
21952       534    27      155
21953       534    109     156
21954       534    4       157
21955       534    45      158
21956       534    37      159
21957       534    110     160
21958       534    70      161
It's been years since I've written SQL, so my syntax may be a bit off; however, I believe the logic is correct.
SELECT CAST(i.WordID AS varchar(20)) + '|' + CAST(j.WordID AS varchar(20)) AS bigram,
       COUNT(*) AS freq
FROM [index] AS i                        -- [index] = your word-location table
JOIN [index] AS j
    ON  j.DocID = i.DocID
    AND j.Location = i.Location + 1
GROUP BY i.WordID, j.WordID
ORDER BY freq DESC
You can also add the actual word IDs to the select list if that's useful, and add a join to whatever table you've got that dereferences WordID to actual words.
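For example, assuming a hypothetical Words lookup table with WordID and Word columns, the dereferencing version could look like:

SELECT w1.Word + ' ' + w2.Word AS bigram, COUNT(*) AS freq
FROM [index] AS i
JOIN [index] AS j
    ON  j.DocID = i.DocID
    AND j.Location = i.Location + 1
JOIN Words AS w1 ON w1.WordID = i.WordID   -- Words is a placeholder for whatever your lookup table is called
JOIN Words AS w2 ON w2.WordID = j.WordID
GROUP BY w1.Word, w2.Word
ORDER BY freq DESC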
