I am currently designing a star schema for a reporting database in which an online product's performance is measured. The challenge is that I receive information which in principle measures the same facts (visits, purchases) and has the same dimensions (user gender, user age, day), but with varying granularity depending on the source. For example, for a single day:
Source A returns one line per day for each gender and age-range combination:
Visits, Purchases, Gender, Age Range, Day (total visits = 15)
Source B returns two separate breakdowns for the same day, as it does not allow combining gender and age:
Visits, Purchases, Gender, Day (total visits = 10)
Visits, Purchases, Age Range, Day (total visits = 10)
The issue is that if I store them in the same fact table, I get incorrect values when applying aggregate functions:
Day         Visits  Age    Gender  Source
19/04/2022  5       18-24  Male    A
19/04/2022  10      18-24  Female  A
19/04/2022  2       NULL   Male    B
19/04/2022  8       NULL   Female  B
19/04/2022  10      18-24  NULL    B
(The sum of the Visits column would count 20 for Source B even though we only have 10 visits for this source; they just appear twice due to the different data structure.)
Is there a best practice for cases where dimensions and facts are generally the same, but the raw data granularity is different?
You typically can only present the combined data at a grain that's compatible with all the sources, so (Day), (Age,Day), or (Gender,Day).
Alternatively you could "allocate" the Source B data, say applying the gender split for the day to each age group. The totals would work, but the drilldown wouldn't be meaningful.
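For illustration, here is a minimal sketch of the first approach, assuming a hypothetical fact_visits table with the columns from the example. It rolls both sources up to the (Gender, Day) grain and simply excludes Source B's age-only rows so that its visits are not counted twice:
-- Hypothetical table and column names.
-- Source A rows carry both Gender and Age; Source B splits them into
-- separate rows, so only its gender rows are kept at this grain.
SELECT day, gender,
       SUM(visits)    AS visits,
       SUM(purchases) AS purchases
FROM fact_visits
WHERE gender IS NOT NULL   -- drop Source B's age-only rows
GROUP BY day, gender;
The same idea works for the (Age, Day) grain by filtering on age instead of gender.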
I am trying to design a shared worksheet that measures salespeople's performance over a period of time. In addition to calculating # of units, sales price, and profit, I am trying to calculate how many new accounts were sold in the month (ideally, I'd like to be able to change the timeframe so I can calculate larger time periods like quarter, year, etc.).
In essence, I want to find out whether a customer was sold to in the 12 months before the present month, and if not, to see the customer number and the salesperson who sold to them.
So far, I was able to calculate that by adding a few helper columns that each calculate a part of the process (see screenshot below):
Column H (SoldLastYear) - Shows customers that were sold in the year before this current month: =IF(AND(B2>=(TODAY()-365),B2<(TODAY()-DAY(TODAY())+1)),D2,"")
Column I (SoldNow) - Shows the customers that were sold this month, and if they are NOT found in column H, show "New Cust": =IFNA(IF(B2>TODAY()-DAY(TODAY()),VLOOKUP(D2,H:H,1,FALSE),""),"New Cust")
Column J (NewCust) - If Column I shows "New Cust", show me the customer number: =IF(I2="New Cust",D2,"")
Column K (SalesName) - if Column I shows "New Cust", show me the salesperson name: =IF(I2="New Cust",C2,"")
Does anyone have an idea how I can make this more efficient? Could an array formula work here, or would it get stuck in a loop since it refers to other rows in the same column?
Any help would be appreciated!!
EDIT: Here is what I'm trying to achieve:
Instead of:
having Column H showing me what was sold in the 12 months before the 1st day of the current month (for today's date: 8/1/19-7/31/20);
having Column I showing me what was sold in August 2020; and
Column I searching column H to see if that customer was sold in the timeframe specified in Column H
I want to have one column that does all three: one column that flags all sales made in the last 12 months from the beginning of the current month (so, 8/1/19 to 8/27/20), then compares sales made in the current month (August) with the sales made before it, and lets me know the first time a customer shows up in the current month IF it doesn't appear in the 12 months prior --> i.e., it finds the new customers after a dormant period of 12 months.
I'm really just trying to find a way to make the formula better and less resource-consuming. With a large dataset, the three columns (copied a few times for different timeframes) really slow down Excel...
Here is an example of the end result:
Example of final product
I'm quite new to BI and database design, and there are some points I don't understand well.
I'm trying to import French census data, which gives me the population of each city. For each city, I have the population broken down by different age classifications that can't really be related to each other.
For instance, let's say that one classification is 00 to 20 years old, 21 to 59, and 60+,
and the other is much more precise: 00 to 02, 03 to 05, etc., but the bounds never line up with the first classification: I don't have 15 to 20, but 18 to 22, for example.
So those two classifications are incompatible. How can I use them in my fact table? Should I use two fact tables and two cubes? Should I use one fact table and two dimensions in one cube? But in that case, won't I get double-counted facts when I sum up the total population of a city?
This is national census data, and national classifications, so changing that or estimating population to mix those classifications is not an option. And to be clear, one row doesn't relate to one person, but to one city. My facts are not individuals but cities' populations.
So the table looks like this:
Line 1: one city - one population figure - one code for the age dimension (e.g. 00 to 19 years old) of that population - a code (m/f) for the gender dimension of that population - the date of the census
Line 2: the same city - another population figure - another code for the age dimension (e.g. 20 to 34) - a code (m/f) for the gender dimension - the date of the census
And so it goes for a lot of cities, both genders, and multiple years.
I hope this question is clear enough, as English is not my native language and I'm quite new to DB and BI!
Thanks for helping me with that.
One possible solution using a single fact table and two dimensions for the age ranges:
1 - Categorical range based on the broadest census, for example:
Young 0-20
Adult 21-59
Senior 60+
You could then link the other census to this dimension with approximate values, for example 18-22 could be Young.
2 - Original age range. This dimension could be used for precise age ranges when you report on a single city; it can also help you evaluate the impact of the overlapping bounds (e.g. how many rows are in the Young / 18-22 range?).
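A minimal sketch of what that could look like as tables; the names and types are assumptions, not anything from the question:
-- Dimension 1: broad categorical ranges from the coarser census
CREATE TABLE dim_age_category (
    age_category_id INT PRIMARY KEY,
    label           VARCHAR(20)   -- 'Young 0-20', 'Adult 21-59', 'Senior 60+'
);

-- Dimension 2: the original age ranges exactly as published
CREATE TABLE dim_age_range (
    age_range_id INT PRIMARY KEY,
    label        VARCHAR(20)      -- '00-02', '18-22', '60+', ...
);

-- Single fact table referencing both age dimensions
CREATE TABLE fact_population (
    city_id         INT,
    census_date     DATE,
    gender          CHAR(1),
    age_category_id INT REFERENCES dim_age_category (age_category_id), -- approximate mapping for the precise census
    age_range_id    INT REFERENCES dim_age_range (age_range_id),       -- exact range reported for this row
    population      INT
);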
You can create one dimension as below:
young  1-20
adult  21-59
senior 60+
Suppose the classifications are:
young, city 1: 1-20
young, city 2: 4-23
id  field1  field2        field3        field4 ...
1   1 year  young_city_1  other         ...
2   2 year  young_city_1  other         ...
3   3 year  young_city_1  other         ...
4   4 year  young_city_1  young_city_2  ...
Now you can report on any item and with any breakdown.
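As a rough SQL sketch of that dimension (hypothetical names): one row per year of age, with one column per classification scheme, so each scheme can be used as its own roll-up path:
CREATE TABLE dim_age (
    id               INT PRIMARY KEY,
    age              VARCHAR(10),   -- '1 year', '2 years', ...
    classification_1 VARCHAR(20),   -- bucket under the first scheme, e.g. 'young_city_1'
    classification_2 VARCHAR(20)    -- bucket under the second scheme, e.g. 'young_city_2' or 'other'
);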
I hope this helps.
I am creating a DB model for rental invoice generation.
The invoice consists of N booking time ranges.
Each booking belongs to a price model. A price model is a set of rules which determine a final price (base price + season price + quantity discount + ...).
That means the final price for the N bookings within an invoice can be a complex calculation, and of course I want to keep track of every aspect of the final price calculation for later review of an invoice.
The problem is, that a price model can change in the future. So upon invoice generation, there are two possibilities:
(a) Never change a price model. Just make it immutable by versioning it and refer to a concrete version from an invoice.
(b) Put all the price information, discounts and extras into the invoice. That would mean a lot of data, as an invoice contains N bookings which may be partly in the range of a season price.
Basically, I would break down each booking into its days and for each day I would have N rows calculating the base price, discounts and extra fees.
Possible table model:
Invoice
id: int
InvoiceBooking # Each booking. One invoice has N bookings
id: int
invoiceId: int
(other data, e.g. guest information)
InvoiceBookingDay # Days of a booking. Each booking has N days
id: int
invoiceBookingId: int
date: date
InvoiceBookingDayPriceItem # Concrete discounts, etc. One day has many items
id: int
invoiceBookingDayId: int
price: decimal
title: string
My question is, which way should I prefer and why.
My considerations:
With solution (a), the invoice would be re-calculated using the price model information each time the data is viewed. I don't like this, as algorithms can change. It does not feel natural for the "read-only" nature of an invoice.
Also the version handling of price models is not a trivial task and the user needs to know about the version concept, which adds application complexity.
With solution (b), I generate a bunch of nested data and it adds a lot of complexity to the schema.
Which way would you prefer? Am I missing something?
Thank you
There is a third option which I recommend. I call it temporal (time) versioning and the layout of the table is really quite simple. You don't describe your pricing data so I'll just show a simple example.
Table: DailyPricing
ID EffDate Price ...
A 01/01/2015 17.50 ...
B 01/01/2015 20.00 ...
C 01/01/2015 22.50 ...
B 01/01/2016 19.50 ...
C 07/01/2016 24.00 ...
This shows that all three price schedules (A, B and C just represent whatever method you use to distinguish between price levels) were given a price on Jan 1, 2015. On Jan 1, 2016, the price of plan B was reduced. In July, the price of plan C was increased.
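A minimal DDL sketch of that layout (the column types are assumptions; the key point is that the primary key is (ID, EffDate), so a price change is a new row rather than an update):
CREATE TABLE DailyPricing (
    ID      CHAR(1)       NOT NULL,   -- price plan / schedule identifier
    EffDate DATE          NOT NULL,   -- date this price takes effect
    Price   DECIMAL(10,2) NOT NULL,
    -- ... other pricing attributes ...
    PRIMARY KEY (ID, EffDate)         -- history is kept; rows are never overwritten
);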
To get the current price of a plan, the query is this:
select dp.Price
from DailyPricing dp
where dp.ID = 'A'
and dp.Effdate =(
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate <= :DateOfInterest);
The DateOfInterest variable would be loaded with the current date/time. This query returns the one price that is currently in effect: in this case, the price set on Jan 1, 2015, as that has never changed since taking effect. If the search had been for plan B, the price set on Jan 1, 2016 would have been returned, and for plan C, the price set on July 1, 2016. These are the latest prices set for each plan; that is, the current prices.
Such a query would more likely be in a join with probably the invoice table so you could perform the price calculation.
select ...
from Invoices i
join DailyPricing dp
on dp.ID = i.ID
and dp.Effdate =(
select Max( dp2.EffDate )
from DailyPricing dp2
where dp2.ID = dp.ID
and dp2.EffDate <= i.InvoiceDate )
where i.ID = 1234;
This is a little more complex than a simple query, but you are asking for more complex data (or, rather, a more complex view of the data). However, this calculation is probably only executed once and the final price stored back into the invoice data or elsewhere.
It would be calculated again only if the customer made some changes or you were going through an audit, rechecking the calculation for accuracy.
Notice something, however, that is subtle but very important. If the query above were being executed for an invoice that had just been created, the InvoiceDate would be the current date and the price returned would be the current price. If, however, the query was being run as a verification on an invoice that was two years old, the InvoiceDate would be two years ago and the price returned would be the price that was in effect two years ago.
In other words, the query to return current data and the query to return past data is the same query.
That is because current data and past data remain in the same table, differentiated only by the date the data takes effect. This, I think, is about the simplest solution to what you want to do.
How about A and B?
It's not best practice to re-calculate any component of an invoice, especially if the component was printed. An invoice and invoice details should be immutable, and you should be able to reproduce it without re-calculating.
If you ever have a problem with figuring out how you got to a certain amount, or if there is a bug in your program, you'll be glad you have the details, especially if the calculations are complex.
Also, it's a good idea to keep a history of your pricing models so you can validate how you got to a certain price. You can make this simple for your users: they don't have to see the history -- but you should record their changes in the history log.
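A rough sketch of combining both ideas, with hypothetical table and column names: snapshot every price component on the invoice itself, and keep an append-only history of price-model changes for auditing:
-- Immutable snapshot of how each charge was computed (option b)
CREATE TABLE InvoiceBookingDayPriceItem (
    id                  INT PRIMARY KEY,
    invoiceBookingDayId INT,
    title               VARCHAR(100),    -- 'base price', 'season surcharge', ...
    price               DECIMAL(10,2),
    priceModelVersionId INT              -- back-reference to the rules used (option a)
);

-- Append-only history of price-model changes, for validation/audit
CREATE TABLE PriceModelHistory (
    priceModelVersionId INT PRIMARY KEY,
    priceModelId        INT,
    changedAt           TIMESTAMP,
    changedBy           VARCHAR(50),
    ruleDefinition      TEXT             -- whatever describes the rules at that version
);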
I am looking to calculate something in the calc script so I can allocate a row from a fact table to a dimension member.
The business scenario is the following: I have a fact table that records customer credits and debits (a customer can take out a lot of little loans) and a Customer dimension. I want to classify each customer based on their history of credits and debits over a given period. A customer's classification changes over time.
Example
The rule is: if a customer's balance (for a given period) is over 50,000, the classification is "Large"; if he has more than one record and has made a payment in the last 3 months, he is "P&P"; if he doesn't owe any money and has made a payment in the last 3 months, he is "Regular".
My question is more about direction than specific code: which way is the best to implement this kind of rule?
Best Regards
Vincent Diallo-Nort
I'd create a fact table with the balance status auto-updated every day (a sketch of both fact tables follows below):
check yesterday's rolling balance plus today's records;
when the balance = 0, remove the record.
Plus add a flow fact table with payments only.
Add measures:
LastChild aggregation for the first fact table.
Sum aggregation for the second fact table.
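A rough sketch of those two fact tables (hypothetical names and types):
-- Daily balance snapshot fact (semi-additive; LastChild over time)
CREATE TABLE FactCustomerBalance (
    CustomerId INT,
    DateId     INT,
    Balance    DECIMAL(12,2)
);

-- Payment flow fact (fully additive; Sum)
CREATE TABLE FactCustomerPayment (
    CustomerId INT,
    DateId     INT,
    Payment    DECIMAL(12,2)
);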
When it's done, you may apply a MDX calculation:
case
    when [Measures].[Balance] > 50000
        then "Large"
    when [Measures].[Payments]
         + ([Date].[Calendar].CurrentMember.Lag(1), [Measures].[Payments])
         + ([Date].[Calendar].CurrentMember.Lag(2), [Measures].[Payments]) > 0
        then "P&P"
    else "Regular"
end
In order to give you a more detailed answer, you'd have to provide more information about your data structure.
So from what I understand, the best example of the differences between a dimension, a dimension attribute, and a fact would be something like this:
Dimensions - PRODUCT, ACCOUNT, CUSTOMER
Dimension attribute - ProductName, ProductNumber, CustomerName, CustomerNumber
Facts - usually measures. Dollar, Unit, Height
This is my attempt, so it may be wrong. I'd like to hear your solutions.
A dimension is a collection of "reference information" about a measurable event. The Measurable event is a Fact.
So if you have data for, say, retail transactions, you would measure the transaction cost, so your fact will have the sales amount. Now, the sales amount in itself does not make sense. You would need information like:
When was the sale done - Date dimension
Who made the transaction - Customer dimension
Which store was it made from - Store Dimension
What was brought - Product dimension
and so on. The information you want to capture for each dimension is called its attributes. For example, a Customer dimension might have these attributes (sketched as tables after this list):
Customer Number
Customer Name
Customer Address
Customer Zip Code
Date Of birth
... and so on.
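A minimal sketch of that retail example as tables; the names and types are assumptions for illustration (the fact references each dimension by key and carries the measure):
CREATE TABLE DimCustomer (
    CustomerKey     INT PRIMARY KEY,
    CustomerNumber  VARCHAR(20),
    CustomerName    VARCHAR(100),
    CustomerAddress VARCHAR(200),
    ZipCode         VARCHAR(10),
    DateOfBirth     DATE
);

CREATE TABLE FactSales (
    DateKey     INT,             -- Date dimension
    CustomerKey INT,             -- Customer dimension
    StoreKey    INT,             -- Store dimension
    ProductKey  INT,             -- Product dimension
    SalesAmount DECIMAL(10,2)    -- the measure
);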
Dimensions are qualitative data. These are the objects of the subject.
Dimension attributes: these are the columns of a dimension table.
Facts are the quantitative data: data which can be summed, averaged, or otherwise manipulated, where the manipulation is done to provide business insights. For example, the time dimension itself can't be measured, but hours calculated from times are a measurable fact.
Example: consider an e-commerce company's data (Amazon), which is the subject.
Dimensions:
Product, Date, Customer, Vendor, Location (these are all objects of the subject)
Dimension attributes:
PRODUCT - (Product_id, Product_name, Product_class)
DATE - (Order_date, Shipment_Date, Delivery_date)
CUSTOMER - (Cust_id, Cust_name)
LOCATION - (State, city, town, zip_code)
Facts:
PRODUCT - Total number of products sold
DATE - Total number of products sold in the last month / last year / last quarter
CUSTOMER - Total amount paid by the customer
LOCATION - Total sales done per region, per state, per city
         - Total traffic (customers who visited) in stores per region