star-schema - fact table modeling - data-modeling

I need to model a star schema for some business needs around liquidity stress testing.
I will try to use an analogous example.
Let's say we have deals about financing / financial securities, etc.
In the fact table, at a given date, these deals have a real value of X euros, but that value will vary over time, so we also have some projection values.
My concern is how to represent these projection values for these deals, and more specifically what granularity to choose.
(The example below is an oversimplification of the fact table -- and yes, it's dimension IDs that are used otherwise.)
Method 1: as many metric columns as there are calculated projection values
AsofDate    DealId          0Day  1Day  7Days  1Month
2022-01-01  financingDeal1  100   99    98     85
2022-01-01  financingDeal2  150   150   120    120
2022-01-01  financingDeal3  100   99    98     85
2022-01-01  financingDeal4  100   99    98     85
Method 2: add to the granularity: a row is no longer only a deal on a given date; it is a deal on a given date and its projection over the next few days/months
AsofDate    DealId          projection     value
2022-01-01  financingDeal1  0Day - actual  100
2022-01-01  financingDeal1  1Day           99
2022-01-01  financingDeal1  7Days          99
2022-01-01  financingDeal1  1Month         85
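For concreteness, here is a minimal DDL sketch of the two layouts (table, column and key names are illustrative, not taken from the actual model):

-- Method 1: one metric column per projection horizon (wide fact table).
CREATE TABLE fact_deal_projection_wide (
    asof_date_id  int            NOT NULL,  -- FK to the date dimension
    deal_id       int            NOT NULL,  -- FK to the deal dimension
    value_0d      decimal(18, 2) NOT NULL,  -- actual value as of the date
    value_1d      decimal(18, 2),
    value_7d      decimal(18, 2),
    value_1m      decimal(18, 2),           -- a new horizon (e.g. 3 months) means a new column
    PRIMARY KEY (asof_date_id, deal_id)
);

-- Method 2: the projection horizon becomes part of the grain (tall fact table).
CREATE TABLE fact_deal_projection_tall (
    asof_date_id   int            NOT NULL,  -- FK to the date dimension
    deal_id        int            NOT NULL,  -- FK to the deal dimension
    projection_id  int            NOT NULL,  -- FK to a small projection-horizon dimension (0Day, 1Day, 7Days, 1Month, ...)
    value          decimal(18, 2) NOT NULL,
    PRIMARY KEY (asof_date_id, deal_id, projection_id)
);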
From where I see it:
For Method 1, the main drawback is that if, in the future, we get a new projection value for 3 months, we will need to add a '3Months' column in the ETL / OLAP cube.
For Method 2, we will have as many rows as (deals * projections); we have 11 projections, so that's 11 rows per deal, and we have 1 million+ deals (i.e. 11 million+ rows per as-of date).
What is your opinion on this topic?
Thanks for your consideration.

Related

SQL - Is this a multi-step line of code... add a column, then add data for each row ONLY for that particular column

There is a table below and further below is the question. Is this a multi-line piece of code with parentheses to use for this? This is for a business analyst assignment... it's the first time I'm using SQL (I've used Python, JS, HTML and CSS, self-taught, back when I was trying to be a web developer).
SQL Queries
Table Name: TRADES
DATE       FIRM  SYMBOL  SIDE  QUANTITY  PRICE
2/3/2014   1ABC  A123    B     200       41
2/4/2014   2BCD  B234    B     600       60
2/7/2014   1ABC  C345    S     600       70
2/10/2014  3CDE  C345    S     600       70
2/12/2014  4DEF  B234    B     200       62
2/14/2014  3CDE  B234    B     300       61
2/21/2014  1ABC  A123    B     300       40
2/24/2014  1ABC  A123    S     300       30
2/25/2014  4DEF  C345    B     2100      71
2/27/2014  CDE   B234    S     1100      63
Q3. Your business user asks you to show them a table that includes the number of trades for each firm and symbol combination in the data table above. Please write the SQL query you would use against the TRADES table to get the result below:
FIRM  SYMBOL  NO_TRADES
1ABC  A123    3
2BCD  B234    1
1ABC  C345    1
3CDE  C345    1
4DEF  B234    1
3CDE  B234    1
4DEF  C345    1
CDE   B234    1
This looks like simple aggregation:
select firm, symbol, count(*) no_trades
from trades
group by firm, symbol
order by no_trades desc, symbol

Alternative for Array Sum Formula

Table copied as Text
Column1  Column2  Column3  Column4  Column5  Column6
A        AA       AAA      100      95       92
A        AA       AAA      85       83       81
A        AA       BBB      200      199      160
A        BB       AAA      65       55       49
B        AA       AAA      89       88       83
B        AA       BBB      150      149      145
B        BB       AAA      140      135
B        BB       BBB      190      185
B        AA       AAA      510
Desired result (with the criterion AA in C13):
     AA
     AAA  BBB
A    173  160
B    593  145
And some more explanation:
Basically, I want the sum of Column 6 for the given criteria, but the data in Column 6 can only be entered after some delay with respect to Columns 1, 2, 3 and 4.
Until the Column 6 data is entered, I want Excel to use the number available in Column 5, which is also entered after some delay with respect to Columns 1-4, but before Column 6.
And until the Column 5 data is entered, I want Excel to use the number available in Column 4.
I am familiar with the two SUM/IF arrangements included below in the post.
The first one is an array SUM/IF arrangement, which is convenient to write but results in terribly long calculation times: 1.5 seconds for just one column, and I have over 100 columns in one sheet and about 9 sheets.
The second one uses SUMIFS, which requires extensive time to write but has a relatively better calculation time of 0.5 seconds per column, which is still quite high.
I need to do away with the array arrangement, but doing so will take quite some time, and I want to know if there is a better/other arrangement.
Just let me know another arrangement that can get the required result and I will check it for calculation timing. If the other arrangement is also convenient to write, that is a plus.
This is my table (copied as text above), and I want to add up the right-most column that is not empty (i.e. has a number in it) for each row, with the criteria for the first three columns, in cell D15.
Can somebody please suggest an alternative to this array formula so it can calculate much faster?
{=SUM(
IF(
($B$2:$B$10=$C15)*
($C$2:$C$10=$C$13)*
($D$2:$D$10=D$14)>0,
IF(
$G$2:$G$10<>"",
$G$2:$G$10,
IF(
$F$2:$F$10<>"",
$F$2:$F$10,
$E$2:$E$10))))}
I have tried the formula below, which reduces the calculation time to about a third, but it is too much typing for the amount of data I am dealing with:
=SUMIFS(
$G$2:$G$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"<>"&"")
+SUMIFS(
$F$2:$F$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"="&"",
$F$2:$F$10,"<>"&"")
+SUMIFS(
$E$2:$E$10,
$B$2:$B$10,$C15,
$C$2:$C$10,$C$13,
$D$2:$D$10,H$14,
$G$2:$G$10,"="&"",
$F$2:$F$10,"="&"")
If you're OK with using a helper column (which you should be), you can use this formula in a helper cell and drag it down. (In my example, this formula goes in cell H2 and is dragged down to H10.)
= INDEX(E2:G2,MATCH(-1E+300,E2:G2,-1))
This pulls the last numeric value in each row (Column 6 if filled, otherwise Column 5, otherwise Column 4) into a single column.
Then you can use a simpler SUMIFS formula in cell D15:
= SUMIFS($H$2:$H$10, // Sum range (helper column)
$B$2:$B$10,$C15, // Criteria 1 (A or B)
$C$2:$C$10,$C$13, // Criteria 2 (AA or BB)
$D$2:$D$10,D$14) // Criteria 3 (AAA or BBB)
DISCLAIMER
This answer will simplify your formulas, but I'm not sure it will help with the performance problems you are experiencing. I don't see SUMIFS itself as a likely cause of long calculation times; more likely, other parts of your spreadsheet use inefficient formulas and/or volatile cells, but that is just a guess because I have no idea what the rest of your spreadsheet looks like.

Need help solving a SQL problem: using ORDER BY to order building floors

I have this building floor data selected:
6
5
4
3
2
1
UG
GM
G
LG
5B
5A
B1
B2
For this sorting I use this kind of ORDER BY:
order by
(case when ISNUMERIC(floorNo) = 1 then CAST(floorNo AS Int) end) desc ,
(case when ISNUMERIC(left(floorNo,1)) = 0 and ISNUMERIC(substring(floorNo,2,1)) = 1 then floorNo end) asc,
(case when ISNUMERIC(floorNo) = 0 and left(floorNo,1) <>'L' then floorNo end) desc
But I want to make it like this:
6
5B
5A
5
4
3
2
1
UG
GM
G
LG
B1
B2
Can anyone help me solve it?
If you make a complicated enough (set of) case statement(s), you would eventually be able to handle all the possibilities, but it is likely to run very slowly if you have a lot of data.
If I had to do this, I would probably make a separate lookup table (FloorOrder) with two columns: the floor code and an order column (integer). Create a script to populate the lookup table with all the various possibilities: pick a maximum number of floors, basements, and sub-floors per floor, and generate all the possibilities with some loops. Then add all the various floors near the ground floor. Make sure the order numbers are spread out enough that you can easily add other codes in between when somebody comes up with a new option (because they will). Something like this subset:
Code  Order
2      2000
1C     1300
1B     1200
1A     1100
1      1000
UG      800
GM      500
G         0
LG     -300
B1    -1000
It doesn't really matter what the order values are, as long as they sort the list in the right order, can be easily generated when creating the table, and leave space for fitting things into the gaps. Whenever somebody comes up with a new weird floor code (some I've seen near me are things like M for Mezzanine, UM for Upper Mezzanine, etc.), add new records to the FloorOrder table to fit them in. Make sure your table has an index on the floor codes.
To use it, join to the FloorOrder table, sort by the Order column.
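A minimal sketch of that approach in T-SQL (the main table name Rooms and its floorNo column are illustrative, and the order column is called SortOrder here to avoid quoting the reserved word ORDER):

-- Lookup table mapping each floor code to a sortable integer.
CREATE TABLE FloorOrder (
    Code      varchar(10) NOT NULL PRIMARY KEY,  -- floor code, e.g. '5A', 'UG', 'B1'
    SortOrder int         NOT NULL               -- spread-out values leave room for new codes
);

INSERT INTO FloorOrder (Code, SortOrder) VALUES
('6', 6000), ('5B', 5200), ('5A', 5100), ('5', 5000),
('4', 4000), ('3', 3000), ('2', 2000), ('1', 1000),
('UG', 800), ('GM', 500), ('G', 0), ('LG', -300),
('B1', -1000), ('B2', -2000);

-- Join and sort by the lookup value instead of parsing the raw code.
SELECT r.floorNo
FROM Rooms r
JOIN FloorOrder f ON f.Code = r.floorNo
ORDER BY f.SortOrder DESC;

This returns the floors in the desired order (6, 5B, 5A, 5, ..., B1, B2), and the primary key on Code doubles as the index on the floor codes mentioned above.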

Optimize MDX query

I have two needs in my query.
First: to have a sorted product list based on my measure; products with higher sales should appear first.
ProductCode Sales
----------- ------------
123 18
332 17
245 16
656 15
Second: to have a cumulative sum over my presorted product list.
ProductCode Sales ACC
----------- ------------ ----
123 18 18
332 17 35
245 16 51
656 15 66
I wrote the MDX below in order to achieve the above goals:
WITH
SET SortedProducts AS
Order([DIMProduct].[ProductCode].[ProductCode].ALLMEMBERS, [Measures].[Sales], BDESC)
MEMBER [Measures].[ACC] AS
Sum
(
Head
(
[SortedProducts],Rank([DIMProduct].[ProductCode].CurrentMember,[SortedProducts])
)
,[Measures].[Sales]
)
SELECT
{[Measures].[Sales] ,[Measures].[ACC]}
ON COLUMNS,
SortedProducts
ON ROWS
FROM [Model]
But it takes about 3 minutes to run. Any suggestions on how to optimize my code, or is this normal?
I have 9,635 products in total.
If you do a quick search on Google, there are different ways to achieve this (many answers here as well).
That said, I would give this different way of calculating your running total a try:
MEMBER [Measures].[SortedRank] AS Rank([Product].[Product].CurrentMember, [SortedProducts])
MEMBER [Measures].[ACC2] AS SUM(TopCount([SortedProducts], [Measures].[SortedRank]) ,[Measures].[Internet Sales Amount])
I don't know if TopCount will perform faster than Head in your case, but, for example, your query on my test machine against the AdventureWorks cube takes the same time using either Head or TopCount.
Hope this helps.

How to store bidirectional relationships

I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance.
However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation and wondered if someone could point me in the right direction.
What confuses me is how to store the bidirectional nature of the relationship.
I've started to put some examples below, but wondered if there is a best practice for storing this type of data.
Example data
id, address
001, 5 Main Street
002, 5 Main St.
003, 5 Main Str
004, 6 High Street
005, 7 Low Street
006, 7 Low St
Suggestion 1
customer_id1, customer_id2, relationship_strength
001, 002, 0.74
001, 003, 0.77
002, 003, 0.76
005, 006, 0.77
I'm not happy with this approach as it sort of implies a one-way relationship from customer_id1 to customer_id2. Unless, of course, I include all relationships both ways, but that would double the processing time and the size of the table.
E.g. I would also need to include: 002, 001, 0.74
Suggestion 2
customer_id, grouping_id
001, 1
002, 1
003, 1
005, 2
006, 2
The way to deal with symmetric relations in a relational system is as follows:
Choose a canonical form in which the symmetric pairs are stored, e.g. customer_id1 < customer_id2.
Define a view SYMM_TBL as select id1, id2, ... from ... UNION select id2 as id1, id1 as id2, ... from ...
A decent system ought not to punish you performance-wise when querying this view.
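A minimal sketch of that idea (the base table name customer_similarity and the column types are illustrative):

-- Store each pair once, in canonical form (lower id first).
CREATE TABLE customer_similarity (
    customer_id1 int           NOT NULL,
    customer_id2 int           NOT NULL,
    strength     decimal(4, 2) NOT NULL,
    PRIMARY KEY (customer_id1, customer_id2),
    CHECK (customer_id1 < customer_id2)        -- enforce the canonical ordering
);

-- Symmetric view: every stored pair is visible in both directions.
CREATE VIEW SYMM_TBL AS
    SELECT customer_id1 AS id1, customer_id2 AS id2, strength
    FROM customer_similarity
    UNION ALL
    SELECT customer_id2 AS id1, customer_id1 AS id2, strength
    FROM customer_similarity;

-- All potential duplicates of customer 1, whichever side of the pair it was stored on:
SELECT id2 AS possible_duplicate, strength
FROM SYMM_TBL
WHERE id1 = 1;

UNION ALL is enough here (rather than UNION) because the canonical ordering guarantees the two halves of the view can never produce duplicate rows.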
What we have here is a graph in which each node has a relationship (edit distance) with every other node. This is not in the normal range of data models. It is also not a permanent feature of your database (assuming you resolve the business processes which led to the duplicate data), so it isn't worth sweating over the solution which best fits relational theory. What we need is a practical solution.
Think of it as a matrix. If we go for the optimum processing we won't execute the duplicate scorings. So we score Address 1 against all the other Addresses, we score Address 2 against all the other Addresses except Address 1, we score Address 3 against all the other Addresses except Addresses 1 and 2, etc. And what we end up with is a bit like a football league table:
         addr
         1    2    3    4    5
addr  1  -    95   95   80   76
      2  -    -    100  75   72
      3  -    -    -    75   72
      4  -    -    -    -    83
      5  -    -    -    -    -
This data can best be stored as in Suggestion 1: a table of ID1, ID2, SCORE. Although we do need to pivot the data to get the output looking like that :)
In a proper league table there are two sets of scores - Home and Away - so the table is symmetrical. But that doesn't apply here, as the edit distance for 1 vs 2 is the same as for 2 vs 1. However, it would make querying the results more straightforward if the result set included the mirrored scores. That is, for records (1,5,76), (2,5,72), etc. we generate records (5,1,76), (5,2,72). This could be done at the end of the scoring process.
         addr
         1    2    3    4    5
addr  1  -    95   95   80   76
      2  95   -    100  75   72
      3  95   100  -    75   72
      4  80   75   75   -    83
      5  76   72   72   83   -
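For example, the mirroring could be done with a single INSERT ... SELECT at the end of the scoring process (a sketch, using the same illustrative your_table(id1, id2, score) columns as the query below):

-- Append the mirror image of every pair that does not already have one.
INSERT INTO your_table (id1, id2, score)
SELECT t.id2, t.id1, t.score
FROM your_table t
WHERE NOT EXISTS (
    SELECT 1
    FROM your_table m
    WHERE m.id1 = t.id2
      AND m.id2 = t.id1
);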
Of course, this is mainly a presentational thing, so it only needs to be done for display purposes, e.g. exporting the data to a spreadsheet. We can still get all the scores for, say, Address 5 in a readable fashion without mirroring the scores, using a simple SQL statement:
select case when id1 = 5 then id1 else id2 end as id1
, case when id1 = 5 then id2 else id1 end as id2
, score
from your_table
where id1 = 5
or id2 = 5
/
As always it depends on what you want to do with the data once you've calculated it.
Assuming it's simply to identify or locate duplicates, then your Suggestion 1 is what I'd use, i.e. a second table that simply stores the pairs and the strengths. My only suggestion is to make the strengths a scaled integer rather than a decimal.
