Creating Data Warehouse - database

I am creating a data warehouse by using a star schema. I successfully build all the dimension tables, but I'm kind of stuck at the fact table. I am in a need to make a Sales table as Fact table. It has SalesKey, OrderKey, ProductKey and etc... Every order is a sale so each order will have a unique SalesKey however each sale will have more than one product.
What would be the best was to build this table?
Should I create something like that
SalesKey OrderKey ProductKey
-------- -------- ----------
s1 o1 p1
s1 o1 p2
s2 o2 p1

In general when you design a starschema it is preferred that each dimension is single valued for each fact record (that is having a 1:M relation between fact and dimension).
The trick is to include an ORDER-LINE dimension so that 1 order (=1 sale) can contain many order lines. Each order-line then contains 1 product.
So basically you will be using a snowflake schema where the facttable is linked to the ORDER-LINE dimension in a 1:M relation. The ORDER-LINE dimension is then linked to the PRODUCT dimension in a M:1 relation.
With this the original problem having a M:M relation between the Salesfact and the PRODUCT dimension has been solved with the ORDER-LINE dimension as a bridge table.

I would add that order items/lines can be tricky. There are multiple ways to handle it.
Add a column "order line item" or "transaction control id" to the fact table.
This will allow you to have SalesKey, OrderKey, ProductKey all on your fact, with an "OrderLineItem" degenerate dimension key, which is often the transaction control number or order line number from the source system.
One issue that you may encounter when using this method is when you have order-level measures that don't exist at the order-line (tax, cashier id, etc). Kimball's preferred approach is to distribute these measures down to the order line if at all possible.
Here's a good article by Kimball on degenerate dimensions:
http://www.kimballgroup.com/html/designtipsPDF/DesignTips2003/KimballDT46AnotherLook.pdf

Related

Redundant relation: Is this a violation of database normalization?

I have a table with products that I offer. For each product ever sold, an entry is created in the ProductInstance table. This refers to this instance of the product and contains information such as the next due date (if the product is to be billed monthly) and other information relevant to this instance (e.g. personal branding).
For understanding: The products are service contracts. The template of the contract is stored in the product table (e.g. "Monthly lawn mowing"). The product instance is then e.g. "Monthly lawn mowing in sample street" and contains information like the size of the garden or something specific to this instance of the service instead of the general product.
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table. This is linked to the ProductInstance to create the reference to the invoice.
I want to extend the database with purchase orders. To do this, a record is created in the Order table. This contains a relation to the customer and e.g. the order date. The single products of the order are mapped by an OrderEntry. The initial invoice generated for the order is linked via the field "invoice_id" in the table order. The invoice items from the initial order are created per OrderEntry and create one InvoiceEntry each. However, I want the ProductInstance to be created only after the invoice is paid. Therefore the OrderEntry has a relation to the product and not only to the ProductInstance. Once the order has been created, the instance is created and linked to the OrderEntry.
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
And for the Product: OrderEntry <-> Product and OrderEntry <-> ProductInstance <-> Product.
Model of the described database
My question is if this "duplicate" relation is problematic, or could cause problems later. One case that feels messy to me is, what should I do if I want to upgrade the ProductInstance later (to an other product [e.g. upgrade to bigger service])? The order would still show the old product_id but the instance would point to a new product_id.
This is a nice example of real-life messy requirements, where the 'pure' theory of normalisation has to be tempered by compromises. There's no 'slam-dunk right' approach; there's some definitely 'wrong' approaches -- your proposed schema exhibits some of those. I suspect there's not even a 'best' approach. Thank you for expanding the description of the business context -- especially for the ProductInstance table.
But still your description won't support legally required behaviour:
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table.
... I want the ProductInstance to be created only after the invoice is paid.
An invoice represents an indebtedness from customer to supplier. It applies at one date only, not "recurring". (So leaving out the Invoice date has exactly got in the way of you "thinking about relations".) A recurring or cyclical billing arrangement would be represented by something like a 'contract' table, from which an Invoice is generated by some scheduled process.
Or ... your "recurring" means the invoice is paid once up-front for a recurring service(?) Still you need an Invoice date. The terms of service/its recurrence would be on the ProductInstance table.
I can see no merit in delaying recording the ProductInstance 'til after invoice payment. Where are you going to hold the terms of service in the meantime? If you're raising an invoice, your auditors/the statutory authorities will want you to provide records of what the indebtedness relates to. Create ProductInstance ab initio and put a status on it. (Or in the application look up the Invoice's paid status before actually providing the service.)
There's something else about Invoices you're currently failing to capture -- and that has also lead you to a wrong design: in general there is more making up the total $ value of an invoice than product lines, such as discounts applying to the invoice overall rather than particular products; delivery charges; installation costs or inspection/certification; taxes (local/State/Federal).
From your description perhaps the only one applying is taxes. ("in this world nothing can be said to be certain, except death and taxes.") And taxes are not specific to products/no product_instance_id is applicable on an InvoiceEntry.
For this reason, on ERP schemas in general, there is no foreign key declared from InvoiceEntry to Product/Instance. (In your case you might get away with product_instance_id being nullable, but yeuch.) There might be a system-generated XRef text column, which contains different content according to what the InvoiceEntry represents, but any referencing can't be declared to the schema. (There might be a 'fully normalised' way to represent that with an auxiliary linkage table, but maintaining that in step adds too much complexity to the application.)
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
Again think about the business sequence of operations that generate these records: ordering happens as a prelude to invoicing. You can't put an invoice_id on Order, because you haven't created the Invoice yet. You might put the order_id on Invoice. But here you're again in the situation that not all Invoices arrive via Orders -- some might be cash sales/immediate delivery. (You could make order_id nullable, but yeuch.) For this reason on ERP schemas in general, there is no foreign key declared from Invoice to Order, etc, etc.
And the same thinking with OrderEntry <-> InvoiceEntry: your proposed schema has the sequencing wrong/the reference points the wrong way. (And not every InvoiceEntry will have corresponding OrderEntry.)
On OrderEntry, having all of (OrderEntry)id and product_id and product_instance_id seems to me to give you way too many opportunities for tangling it all up. Can an Order have multiple Entrys for the same product_id? -- why/how? Can it have multiple Entrys for the same product_instance_id? -- why/how? Can there be a product_instance_id which refers to a different product_id than OrderEntry.product_id? This is exactly the sort of risk for confusing entanglement that normalisation aims to remove/reduce.
The customer is ordering a ProductInstance: mowing a particular size of garden at a particular address, fortnightly on a Tuesday afternnon. So OrderEntry.product_instance_id is what you want; .product_id is wrong. So (again) you need to create ProductInstance at time of recording the Order. Furthermore I strongly suspect you don't need an id on OrderEntry; instead give it a compound key (order_entry_id, product_instance_id). [**]
[**] I see you're using 'eloquent'. I suspect this is requiring id on every table. So you're not even using a relational database, this is some sort of Object-Relational hybrid. Insisting on a dedicated single id as key on every table is toxic. It has lead schema designers astray every time I get called in to help -- as here. Please if you can at all avoid it, don't do that.

Joining date dimension multiple times? - Kimball's book on Data warehouse and Dimension Modeling

I'm reading Ralph Kimball's book on Data warehouse and Dimension Modeling and in chapter 6, there is this part about dimension role playing.
Sometimes you discover other dates associated with each transaction,
such as the requested ship date for the order. Each of the dates
should be a foreign key in the fact table...
However, you cannot simply join these two foreign keys to the same date dimension table. SQL would interpret this two-way simultaneous
join as requiring both the dates to be identical, which isn’t very
likely.
And I am not sure I understand the two last sentences. Does it mean you cannot join the date dimension multiple times if both dates in fact table have different value? Why?
It’s not very well expressed but all that it is saying is that you need to alias the date dimension if you are going to join to it multiple times from different FKs in the Fact table.
This is true of any SQL statement where 2 tables are joined together more than once, it’s not specific to dimensional modelling.
The (well) hidden message here is that you need multiple joins - one for each dimension role
Say you have a fact table with entry date entry_d and booking date booking_d
This would be wrong and is probably meant in the text
select * from fact
join time_dim tim
on fact.entry_d = tim.date_d and
fact.book_d = tim.date_d;
And this is right using two independent joins to the time dimension
select * from fact
join time_dim entr on fact.entry_d = entr.date_d
join time_dim book on fact.book_d = book.date_d;
Note also that if you use an inner join as in the example above you shoud make some elementary validaton and clenaing of both dates. The point is: you should recognise fact rows with invalid dates (not covered in time dimension or NULL) and handle them propertly - not discard them silently in the join.
For non trivial setups, especially when the fact table is partitioned on a time column, you choose the native DATE format and not some surrogate key as for other dimensions. (This is a practical rule - not covered in the theory)

Power BI Aggregation of End Tables

I am new to Power BI and data-base management and I want clarify for myself how Power BI works in reference to my last two questions (Database modelling Bridge Table , Power BI Report Bridge Table ). I have a main_table with firm specific information each year which is connected to an end_table that contains some quantitative information (e.g. sales data). The tables are modelled as a 1:N relationship, so that I do not have to store the same values twice, which I thought is a good thing to do in data modelling.
I want to aggregate the value column of end table over the group column Year. I am surprised that to my understanding Power BI sums up the value column within the end table when I would expect the aggregation over the group variable in the connected tables
My basic example is based on this data and data model (you need to adjust the relationship manually):
main_table<-data.frame(id=1:20, FK_id=sample(1:2,20, replace=TRUE), Jahre=2016:2020)
main_table<-rbind(main_table,data.frame(id=21:25, FK_id=sample(2:3,5, replace=TRUE), Jahre=2015) )
end_table<-data.frame(id=1:3, value=c(10,20,30))
The first 5 rows of the data including all columns looks like this:
If I take out all row specific information and sum up over value. It will always show the sum of the end table, which is 60, in each Year.
Making the connection bi-directional does not help. It just sums up for the existing values of the end_table in each year. I get the correct results, if I add the value column to the main table using Related value = RELATED(end_table[value])
I am just wondering if there is another way to model or analyse this 1:N relationship in Power BI. This comes up frequently and it feels a bit tedious to always add the column using Related() in the main table while it would be intuitive to just click both columns and expect the aggregation to be based on the grouping variable.
In any case, just asking this and my other two questions helped me a lot.
This is a bit of a weird modeling situation (even though it's not terribly uncommon). In general, it's handy to build star schemas where you have dimension tables in 1:N relationships to fact table(s). E.g.
In this setup, the items from the dimension tables (e.g. year or customer) are used in the columns and rows in a visual and measures generally aggregate columns from the fact table (e.g. sales amount).
Your example inverts this. You are trying to sum over a column in your end table using the year as a dimension. As a result, it's not automatically behaving as you'd expect.
In order to get the result that you want, where Year is treated as a dimension, you need to write a measure that sums over Year as if it were a dimension. Since main_table is essentially a dimension table for Year (one unique row per year), you can write
SumValue = SUMX ( main_table, RELATED ( end_table[value] ) )

Adding a new column vs adding more rows system efficiency

I have a table that contains postal codes with their associated salesperson. We are adding a modification were in the end there will be multiple salespersons associated with each postal code. Is it best practice to just add a new column for the new salesperson or to add more rows?
There are roughly 1.5 million postal codes in the table.
Neither.
Adding a new column for a new salesperson is a non-starter. You'd have to keep adding columns arbitrarily in order to add new salespersons. This is just a bad idea in every way.
Adding new rows changes the meaning of the data in the table. The table holds postal codes and information regarding those entities. It shouldn't be responsible for anything more than that.
What you're describing is a many-to-many relationship. This would be accomplished by way of a linking table between the two entities. Something as simple as this:
PostalCode
---------------
ID
Code
etc.
Salesperson
---------------
ID
Name
etc.
SalespersonPostalCode
---------------
ID
SalesPersonID
PostalCodeID
Each row in PostalCode represents a postal code. Each row in Salesperson represents a salesperson. And each row in the linking table represents a relationship between the two. You'd add as many relationships as you want. But don't arbitrarily add new domain entity records when what you want is to add more relationships between them.

How do I create a table in SQL Server that stores multiple values for one cell?

Suppose I have a table for purchase orders. One customer might buy many products. I need to store all these products and their relevant prices in a single record, such as an invoice format.
If you can change the db design, Prefer to create another table called PO_products that has the PO_Id as the foreign key from the PurchaseOrder table. This would be more flexible and the right design for your requirement.
If for some reason, you are hard pressed to store in a single cell (which I re-iterate is not a good design), you can make use of XMLType and store all of the products information as XML.
Note: Besides being bad design, there is a significant performance cost of storing the data as XML.
This is a typical example of an n-n relationship between customer and products.
Lets say 1 customer can have from 0 to N products and 1 products can be bought by 0 to N customers. You want to use a junction table to store every purchase orders.
This junction table may contain the id of the purchase, the id of the customer and the id of the product.
https://en.wikipedia.org/wiki/Many-to-many_(data_model)

Resources