Structuring a cube with paths from a fact table to a dimension via two alternative intermediate dimensions - sql-server

I'm unsure how to configure a cube in SSAS for a complex case that I can simplify as follows:
A fact table stores data about a Sale.
A dimension called Promotion records details of the marketing activity that generated the Sale
A dimension called Customer records details of the person who we sold to
We also have a table holding data about an Organisation
In some, but not all cases, a Promotion is targetted at an Organisation. There is an optional one-to-one relationship from Promotion to Organisation.
In some, but not all cases, a Customer is associated with an Organisation. There is an optional many-to-one relationship from Customer to Organisation.
We want to be able to analyse Sales by Organisation. For instance, if I report number of Sales by Organisation, the count for each Organisation should include both the sales through Promotions targetted at that Organisation and sales to Customers associated with that Organisation.
Note that with this data structure, each Sale may be associated with 0, 1 or 2 Organisations depending on the Promotion and the Customer. So if I report on number of sales by organisation, the grand total will not necessarily equal the total number of sales.
How would you structure the cube? I don't think it can work by simply setting up a referenced relationship from Sales->Promotion->Organisation and another from Sales->Customer->Organisation because SSAS won't know which path to use (and certainly won't know that it should aggregate across both paths together). Do I create two Organisation dimensions? Do I disconnect Organisation from the other dimensions and define some direct linkage between Organisation and Sales? Do I scrap the Organisation dimension and include organisation details as attributes in both Promotion and Customer?

You are correct in that SSAS won't be able to handle referencing the Organisation dimension through Promotion on one path, and through Customer on another path. This will give you an error when you try to build the cube.
Since each sale can be associated with 0, 1 or 2 organisations, I would recommend modelling this with a bridge-table (many-to-many) between the Organisation-dimension and the Sale-fact. This assumes that you have a unique ID on each Sale-transaction, so that you can create a fact-dimension on the Sale-fact (which need not be visible in the cube).
You construct the bridge-table in your ETL-flow. It should simply contain 2 columns, which relate the Organisation ID's with the Sale ID's. No Sale ID should have more than 2 Organisation ID's. Your final model should look something like this:
DimCustomer <---.
|
FactSale <---- BridgeSaleOrganisation ----> DimOrganisation
|
DimPromotion <---ยด
In the dimension-usage of SSAS, you set up a Many-to-many relation between FactSale and DimOrganisation using the BrdigeSaleOrganisation as the intermediary table. Once this is in place, filtering sales by the Organisation-dimension, will give you all sales belonging to that organisation via the bridge table, no matter whether they are through Promotion or Customer.
For more examples of many-to-many modelling, check out this excellent paper by "the gurus", Marco Russo and Alberto Ferrari.

Related

Redundant relation: Is this a violation of database normalization?

I have a table with products that I offer. For each product ever sold, an entry is created in the ProductInstance table. This refers to this instance of the product and contains information such as the next due date (if the product is to be billed monthly) and other information relevant to this instance (e.g. personal branding).
For understanding: The products are service contracts. The template of the contract is stored in the product table (e.g. "Monthly lawn mowing"). The product instance is then e.g. "Monthly lawn mowing in sample street" and contains information like the size of the garden or something specific to this instance of the service instead of the general product.
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table. This is linked to the ProductInstance to create the reference to the invoice.
I want to extend the database with purchase orders. To do this, a record is created in the Order table. This contains a relation to the customer and e.g. the order date. The single products of the order are mapped by an OrderEntry. The initial invoice generated for the order is linked via the field "invoice_id" in the table order. The invoice items from the initial order are created per OrderEntry and create one InvoiceEntry each. However, I want the ProductInstance to be created only after the invoice is paid. Therefore the OrderEntry has a relation to the product and not only to the ProductInstance. Once the order has been created, the instance is created and linked to the OrderEntry.
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
And for the Product: OrderEntry <-> Product and OrderEntry <-> ProductInstance <-> Product.
Model of the described database
My question is if this "duplicate" relation is problematic, or could cause problems later. One case that feels messy to me is, what should I do if I want to upgrade the ProductInstance later (to an other product [e.g. upgrade to bigger service])? The order would still show the old product_id but the instance would point to a new product_id.
This is a nice example of real-life messy requirements, where the 'pure' theory of normalisation has to be tempered by compromises. There's no 'slam-dunk right' approach; there's some definitely 'wrong' approaches -- your proposed schema exhibits some of those. I suspect there's not even a 'best' approach. Thank you for expanding the description of the business context -- especially for the ProductInstance table.
But still your description won't support legally required behaviour:
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table.
... I want the ProductInstance to be created only after the invoice is paid.
An invoice represents an indebtedness from customer to supplier. It applies at one date only, not "recurring". (So leaving out the Invoice date has exactly got in the way of you "thinking about relations".) A recurring or cyclical billing arrangement would be represented by something like a 'contract' table, from which an Invoice is generated by some scheduled process.
Or ... your "recurring" means the invoice is paid once up-front for a recurring service(?) Still you need an Invoice date. The terms of service/its recurrence would be on the ProductInstance table.
I can see no merit in delaying recording the ProductInstance 'til after invoice payment. Where are you going to hold the terms of service in the meantime? If you're raising an invoice, your auditors/the statutory authorities will want you to provide records of what the indebtedness relates to. Create ProductInstance ab initio and put a status on it. (Or in the application look up the Invoice's paid status before actually providing the service.)
There's something else about Invoices you're currently failing to capture -- and that has also lead you to a wrong design: in general there is more making up the total $ value of an invoice than product lines, such as discounts applying to the invoice overall rather than particular products; delivery charges; installation costs or inspection/certification; taxes (local/State/Federal).
From your description perhaps the only one applying is taxes. ("in this world nothing can be said to be certain, except death and taxes.") And taxes are not specific to products/no product_instance_id is applicable on an InvoiceEntry.
For this reason, on ERP schemas in general, there is no foreign key declared from InvoiceEntry to Product/Instance. (In your case you might get away with product_instance_id being nullable, but yeuch.) There might be a system-generated XRef text column, which contains different content according to what the InvoiceEntry represents, but any referencing can't be declared to the schema. (There might be a 'fully normalised' way to represent that with an auxiliary linkage table, but maintaining that in step adds too much complexity to the application.)
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
Again think about the business sequence of operations that generate these records: ordering happens as a prelude to invoicing. You can't put an invoice_id on Order, because you haven't created the Invoice yet. You might put the order_id on Invoice. But here you're again in the situation that not all Invoices arrive via Orders -- some might be cash sales/immediate delivery. (You could make order_id nullable, but yeuch.) For this reason on ERP schemas in general, there is no foreign key declared from Invoice to Order, etc, etc.
And the same thinking with OrderEntry <-> InvoiceEntry: your proposed schema has the sequencing wrong/the reference points the wrong way. (And not every InvoiceEntry will have corresponding OrderEntry.)
On OrderEntry, having all of (OrderEntry)id and product_id and product_instance_id seems to me to give you way too many opportunities for tangling it all up. Can an Order have multiple Entrys for the same product_id? -- why/how? Can it have multiple Entrys for the same product_instance_id? -- why/how? Can there be a product_instance_id which refers to a different product_id than OrderEntry.product_id? This is exactly the sort of risk for confusing entanglement that normalisation aims to remove/reduce.
The customer is ordering a ProductInstance: mowing a particular size of garden at a particular address, fortnightly on a Tuesday afternnon. So OrderEntry.product_instance_id is what you want; .product_id is wrong. So (again) you need to create ProductInstance at time of recording the Order. Furthermore I strongly suspect you don't need an id on OrderEntry; instead give it a compound key (order_entry_id, product_instance_id). [**]
[**] I see you're using 'eloquent'. I suspect this is requiring id on every table. So you're not even using a relational database, this is some sort of Object-Relational hybrid. Insisting on a dedicated single id as key on every table is toxic. It has lead schema designers astray every time I get called in to help -- as here. Please if you can at all avoid it, don't do that.

Laravel DB design: how to model different types of similar entity

First of all I have to mention that I am modernising our ERP system that is build in-house. It handles everything from purchasing of parts, sale orders, quotes and inventory to invoicing and statistical data. The system is web based and heavily dependent on ORM. EloquentORM will be used in the redesign.
My main question is about the data model of certain entities that are very similar. Currently three of most widely interconnected entities in the app are: Orders, Products and Invoices.
1. Orders
In current DB design I have one big orders table in which there is a order_type attribute to distinct between different order types: Purchase orders, Sale orders, Quotes and Service orders. About 80% of fields are common to each order type and there are some specific fields for each order types. Currently at ~15k records.
2. Products
Similarly I have one big products table with an attribute product_type to distinct between different product types: Finished products, Services, Assemblies and Parts. Again there is a fair % of fields that are common throughout all product types and some that are specific to different product type Currently at ~7k records.
3. Invoices
Again one table invoices with invoice_type attribute to distinct between 4 invoice types: Issued invoices (for things we sell), Received invoices (for things we buy), Credit notes and Avans Invoices. More or less all invoice types use the same fields. Currently at ~15k records.
I am now trying to decide which is the optimal way for this kind of DB model. I see three options:
1. Single Table Inheritance
Leave as is, everything in the same table. It feels kind of awkward to always filter records like where order_type = 'Sale order' to display right orders in the right place in GUI... Also when doing sale and purchase analytics I need to include the same where condition to fetch right orders. Seems wrong!
2. Class Table Inheritance
Have a master tables orders, products and invoices with common set of fields between each of entity types and then one-to-one child relation tables for every different type of each entity: sales_orders, purchase_orders, quote_orders, finished_products, reseller_products, part_products, assembly_products, received_invoices and issued_invoices with FK in each of the child tables to master table... This seems like a good idea but handling that with ORM brings in a little more complexity...
In this method I have a questions which FK should be used around. For example each invoice can belong to one order. Received invoice will go with Purchase order and issued invoice will go with Sale order. Should the master orders table's PK be used as a FK in the master invoices table to relate these entities, or should the child sale_orders PK be used in the child issued_invoices?
3. Concrete Table Inheritance
Having completely separated tables for every type of each entity. This would avoid me having parent->child relationship between master table but would result in a lot of similar attributes in each table...
What would be the best approach? I am aiming at ease of use in EloquentORM and also speed and scalability for the future.

Change Data Capture and SQL Server Analysis Services

I'm designing a database application where data is going to change over time. I want to persist historical data and allow my users to analyze it using SQL Server Analysis Services, but I'm struggling to come up with a database schema that allows this. I've come up with a handful of schemas that could track the changes (including relying on CDC) but then I can't figure out how to turn that schema into a working BISM within SSAS. I've also been able to create a schema that translates nicely in to a BISM but then it doesn't have the historical capabilities I'm looking for. Are there any established best practices for doing this sort of thing?
Here's an example of what I'm trying to do:
I have a fact table called Sales which contains monthly sales figures. I also have a regular dimension table called Customers which allows users to look at sales figures broken down by customer. There is a many-to-many relationship between customers and sales representatives so I can make a reference dimension called Responsibility that refers to the customer dimension and a Sales Representative reference dimension that refers to the Responsibility dimension. I now have the Sales facts linked to Sales Representatives by the chain of reference dimensions Sales -> Customer -> Responsibility -> Sales Representative which allows me to see sales figures broken down by sales rep. The problem is that the Sales facts aren't the only things that change over time. I also want to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact. I also want to know where the Sale Representative's office was located at the time of a particular sales fact, which may be different than his current location. I might also what to know the size of a customer's organization at the time of a particular Sales fact, also which might be different than it is currently. I have no idea how to model this in an BISM-friendly way.
You mentioned that you currently have a fact table which contains monthly sales figures. So one record per customer per month. So each record in this fact table is actually an aggregation of individual sales "transactions" that occurred during the month for the corresponding dimensions.
So in a given month, there could be 5 individual sales transactions for $10 each for customer 123...and each individual sales transaction could be handled by a different Sales Rep (A, B, C, D, E). In the fact table you describe there would be a single record for $50 for customer 123...but how do we model the SalesReps (A-B-C-D-E)?
Based on your goals...
to be able to maintain a history of which Sales Representative was Responsible for a Customer at the time of a particular Sales fact
to know where the Sale Representative's office was located at the time of a particular sales fact
to know the size of a customer's organization at the time of a particular Sales fact
...I think it would be easier to model at a lower granularity...specifcally a sales-transaction fact table which has a grain of 1 record per sales transaction. Each sales transaction would have a single customer and single sales rep.
FactSales
DateKey (date of the sale)
CustomerKey (customer involved in the sale)
SalesRepKey (sales rep involved in the sale)
SalesAmount (amount of the sale)
Now for the historical change tracking...any dimension with attributes for which you want to track historical changes will need to be modeled as a "Slowly Changing Dimension" and will therefore require the use of "Surrogate Keys". So for example, in your customer dimension, Customer ID will not be the primary key...instead it will simply be the business key...and you will use an arbitrary integer as the primary key...this arbitrary key is referred to as a surrogate key.
Here's how I'd model the data for your dimensions...
DimCustomer
CustomerKey (surrogate key, probably generated via IDENTITY function)
CustomerID (business key, what you will find in your source systems)
CustomerName
Location (attribute we wish to track historically)
-- the following columns are necessary to keep track of history
BeginDate
EndDate
CurrentRecord
DimSalesRep
SalesRepKey (surrogate key)
SalesRepID (business key)
SalesRepName
OfficeLocation (attribute we wish to track historically)
-- the following columns are necessary to keep track of historical changes
BeginDate
EndDate
CurrentRecord
FactSales
DateKey (this is your link to a date dimension)
CustomerKey (this is your link to DimCustomer)
SalesRepKey (this is your link to DimSalesRep)
SalesAmount
What this does is allow you to have multiple records for the same customer.
Ex. CustomerID 123 moves from NC to GA on 3/5/2012...
CustomerKey | CustomerID | CustomerName | Location | BeginDate | EndDate | CurrentRecord
1 | 123 | Ted Stevens | North Carolina | 01-01-1900 | 03-05-2012 | 0
2 | 123 | Ted Stevens | Georgia | 03-05-2012 | 01-01-2999 | 1
The same applies with SalesReps or any other dimension in which you want to track the historical changes for some of the attributes.
So when you slice the sales transaction fact table by CustomerID, CustomerName (or any other non-historicaly-tracked attribute) you should see a single record with the facts aggregated across all transactions for the customer. And if you instead decide to analyze the sales transactions by CustomerName and Location (the historically tracked attribute), you will see a separate record for each "version" of the customer location corresponding to the sales amount while the customer was in that location.
By the way, if you have some time and are interested in learning more, I highly recommend the Kimball bible "The Data Warehouse Toolkit"...which should provide a solid foundation on dimensional modeling scenarios.
The established best practices way of doing what you want is a dimensional model with slowly changing dimensions. Sales reps are frequently used to describe the usefulness of SCDs. For example, sales managers with bonuses tied to the performance of their teams don't want their totals to go down if a rep transfers to a new territory. SCDs are perfect for tracking this sort of thing (and the situations you describe) and allow you to see what things looked like at any point historically.
Spend some time on Ralph Kimball's website to get started. The first 3 articles I'd recommend you read are Slowly Changing Dimensions, Slowly Changing Dimensions Part 2, and The 10 Essential Rules of Dimensional Modeling.
Here are a few things to focus on in order to be successful:
You are not designing a 3NF transactional database. Get comfortable with denormalization.
Make sure you understand what grain means and explicitly define the grain of your database.
Do not use natural keys as keys, and do not bake any intelligence into your surrogate keys (with the exception of your time keys).
The goals of your application should be query speed and ease of understanding and navigation.
Understand type 1 and type 2 slowly changing dimensions and know where to use them.
Make sure you have a sponsor on the business side with the power to "break ties". You will find different people in the organization with different definitions of the same thing, and you need an enforcer with the power to make decisions. To see what I mean, ask 5 different people in your organization to define "customer" or "gross profit". You'll be lucky to get 2 people to define either the same way.
Don't try to wing it. Read the The Data Warehouse Lifecycle Toolkit and embrace the ideas, even if they seem strange at first. They work.
OLAP is powerful and can be life changing if implemented skillfully. It can be an absolute nightmare if it isn't.
Have fun!

Same Fact Table Column; Records with Multiple Reasons

I am in a situation similar to the one below:
Think for instance we need to store customer sales in a fact table (under a data warehouse built with dimensional modelling). I have sales, discounts related to the sale, sales returns and cancellations to be stored.
Do you think it would be advisable to store sales for a day to a customer in a particular product (when the day is the grain) as a positive value while the returns and discounts are stored as minuses?
Also if a discount is enforced to a customer at a level other than the product (for instance brand), do you think it is alright to persist it with a key particularly assigned to the brand (product is the grain) while the product column being given an N/A, for the particular record?
Thanks in advance.
If your sales are considered a good thing (I'm assuming they are) then recording sales as positive numbers makes perfect sense. Any transaction that reduces sales (i.e. discounts and returns) should therefore be recorded as negative numbers. This will make reporting your sales very natural.
If you have diffent dimensions that might account for a record, you should populate the dimensions that make sense. So yes, attribute a discount to a brand rather than a product if that is what happened in your business transaction. This way your reporting will be able to look at all discounts, at discounts for particular products and discounts for entire brands. If your fact table shows the most direct "cause" of the discount (product or brand) then your reports will be more useful than if you link the fact to brand through a relationship to product.

Schema design: many to many plus additional one to many

I have this scenario and I'm not sure exactly how it should be modeled in the database. The objects I'm trying to model are: teams, players, the team-player membership, and a list of fees due for each player on a given team. So, the fees depend on both the team and the player.
So, my current approach is the following:
**teams**
id
name
**players**
id
name
**team_players**
id
player_id
team_id
**team_player_fees**
id
team_players_id
amount
send_reminder_on
Schema layout ERD
In this schema, team_players is the junction table for teams and players. And the table team_player_fees has records that belong to records to the junction table.
For example, playerA is on teamA and has the fees of $10 and $20 due in Aug and Feb. PlayerA is also on teamB and has the fees of $25 and $25 due in May and June. Each player/team combination can have a different set of fees.
Questions:
Are there better ways to handle such
a scenario?
Is there a term for this type of
relationship? (so I can google it) Or know of any references with similar structures?
Thus is a perfectly fine design. It is not uncommon for a junction table (AKA intersection table) to have attributes of its own - such as joining_date - and that can include dependent tables. There is, as far as I know, no special name for this arrangement.
One of the reasons why it might feel strange is that these tables frequently don't exist in a logical data model. At that stage they are represented by a many-to-many join notation. It's only when we get to the physical model that we have to materialize the junction table. (Of course many people skip the logical model and go straight to physical.)

Resources