I have been studying datawarehouse in the last couple days, particularly, i have been reading The Data Wharehouse Toolkit - The Definitive Guide to Dimensional Modeling by Kimball and Ross.
Uppon that reading, i came to the 1st exapmle where there is a sales fact and it related to a product dimension, as you can see in the bellow image:
I think i can grasp the gist of how this relationship allows us to rotate the "cube" slicing and dicing data, however this is where i get lost:
In this example and many others on the web product is a one-to-one relationship with sales, which is fine i guess for most cases. But this generates a sales registtry for at least each kind of product that was in one sale.
So supposing i bought 1 banana, 2 apples and 1 orange, this would yield at least 3 sales registry. Again, which is fine i guess as it is storing the sale's ticket ID in the sales fact, we still can relate all itens in a given sale.
However if this was an use case: relate products on sales say i want to get every sale that had a banana and get stuff like: how many items each of these sales had, their price cost, their profit, stuff like that...
Wouldn't be better if the fact-product relation were Fact-one_to_many-Product relationship? Where fact would hold the sale's ticket ID and products would have its foreign key referencing where they are from or something?
I reckon these metrics should be in the fact table, and not in the product table as i think i would want. So, is this me not fighting my urge to normalize it or does it make sense in the way i would want to do that kind of filtering -> [given all sales with X product, get data from other products in the same sale].
If i were to follow the guidelines, product dimension would have one registry for every exclusive kind of product the store would have correct? And all the measurements i want i would store it on the fact itself, like price cost, sales price, profit, etc...
On the other hand, if i were to one-to-many product dimension would have many copies of each product. Which is bad, i think. However, i think it would give me better queries in that regard.
As you can see, i'm a beginer and really in the early stages of this path, so if you would endulge me in a Explain Like I'm Five kind of answer I would appreciate.
EDITED:
Sorry #Nick.McDermaid, you are right. I meant from the perspective of the sales fact where for every sale fact i will have only one product, but are correct that for one product it can have N sales related. And so, we have one record of product in the database for every different product on our store. This is the right way to do it, how to rightfully model it. Also, the many indicator is the "sales quantity" i'm guessing.
Anyhow, while this allows for slicing and dicing when/if we have sales as the point of view, but what if i want to for example:
Get all sales that had a banana in it, with all the other items in those sales. We can still do it with this structure but its harder than if the products were repeated and we had the sale id as a foreign key in the product table.
Cuz ultimetly i want to get all the sales(and products within that sale) that had a banana. And then take metrics out of them.
What you are somewhat hinting at would be a degenerate dimension, consisting of the sales id/invoice #/purchase order # of the transaction that took place. The whole purpose of a degenerate dimension is to group items that are related by a meaningless piece of data. For example, a PO # of A1234 is meaningless on its own, it doesn't tell you anything about the purchase. However, it can be used to identify other meaningful data, such as the date of purchase of the products for the customer. In that context, the PO # is defined by the collection of the entities it brings together to describe an event.
Another critical concept in data-warehousing is the abstraction of the schema in the database from the model in the cube. You don't join and group data in a cube model. You slice and filter. There are no foreign keys in a cube model. Those are used in the underlying data schema, but all of that work is handled behind the scenes of the cube model.
Related
I have a table with products that I offer. For each product ever sold, an entry is created in the ProductInstance table. This refers to this instance of the product and contains information such as the next due date (if the product is to be billed monthly) and other information relevant to this instance (e.g. personal branding).
For understanding: The products are service contracts. The template of the contract is stored in the product table (e.g. "Monthly lawn mowing"). The product instance is then e.g. "Monthly lawn mowing in sample street" and contains information like the size of the garden or something specific to this instance of the service instead of the general product.
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table. This is linked to the ProductInstance to create the reference to the invoice.
I want to extend the database with purchase orders. To do this, a record is created in the Order table. This contains a relation to the customer and e.g. the order date. The single products of the order are mapped by an OrderEntry. The initial invoice generated for the order is linked via the field "invoice_id" in the table order. The invoice items from the initial order are created per OrderEntry and create one InvoiceEntry each. However, I want the ProductInstance to be created only after the invoice is paid. Therefore the OrderEntry has a relation to the product and not only to the ProductInstance. Once the order has been created, the instance is created and linked to the OrderEntry.
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
And for the Product: OrderEntry <-> Product and OrderEntry <-> ProductInstance <-> Product.
Model of the described database
My question is if this "duplicate" relation is problematic, or could cause problems later. One case that feels messy to me is, what should I do if I want to upgrade the ProductInstance later (to an other product [e.g. upgrade to bigger service])? The order would still show the old product_id but the instance would point to a new product_id.
This is a nice example of real-life messy requirements, where the 'pure' theory of normalisation has to be tempered by compromises. There's no 'slam-dunk right' approach; there's some definitely 'wrong' approaches -- your proposed schema exhibits some of those. I suspect there's not even a 'best' approach. Thank you for expanding the description of the business context -- especially for the ProductInstance table.
But still your description won't support legally required behaviour:
An invoice is created for a product instance either one time or recurring. An Invoice may consists of several entries. Each entry is represented by an element in the InvoiceEntry table.
... I want the ProductInstance to be created only after the invoice is paid.
An invoice represents an indebtedness from customer to supplier. It applies at one date only, not "recurring". (So leaving out the Invoice date has exactly got in the way of you "thinking about relations".) A recurring or cyclical billing arrangement would be represented by something like a 'contract' table, from which an Invoice is generated by some scheduled process.
Or ... your "recurring" means the invoice is paid once up-front for a recurring service(?) Still you need an Invoice date. The terms of service/its recurrence would be on the ProductInstance table.
I can see no merit in delaying recording the ProductInstance 'til after invoice payment. Where are you going to hold the terms of service in the meantime? If you're raising an invoice, your auditors/the statutory authorities will want you to provide records of what the indebtedness relates to. Create ProductInstance ab initio and put a status on it. (Or in the application look up the Invoice's paid status before actually providing the service.)
There's something else about Invoices you're currently failing to capture -- and that has also lead you to a wrong design: in general there is more making up the total $ value of an invoice than product lines, such as discounts applying to the invoice overall rather than particular products; delivery charges; installation costs or inspection/certification; taxes (local/State/Federal).
From your description perhaps the only one applying is taxes. ("in this world nothing can be said to be certain, except death and taxes.") And taxes are not specific to products/no product_instance_id is applicable on an InvoiceEntry.
For this reason, on ERP schemas in general, there is no foreign key declared from InvoiceEntry to Product/Instance. (In your case you might get away with product_instance_id being nullable, but yeuch.) There might be a system-generated XRef text column, which contains different content according to what the InvoiceEntry represents, but any referencing can't be declared to the schema. (There might be a 'fully normalised' way to represent that with an auxiliary linkage table, but maintaining that in step adds too much complexity to the application.)
I see the problem that the relation between Order and Invoice is doubled: once Order <-> Invoice and once Order <-> OrderEntry <-> InvoiceEntry <-> Invoice.
Again think about the business sequence of operations that generate these records: ordering happens as a prelude to invoicing. You can't put an invoice_id on Order, because you haven't created the Invoice yet. You might put the order_id on Invoice. But here you're again in the situation that not all Invoices arrive via Orders -- some might be cash sales/immediate delivery. (You could make order_id nullable, but yeuch.) For this reason on ERP schemas in general, there is no foreign key declared from Invoice to Order, etc, etc.
And the same thinking with OrderEntry <-> InvoiceEntry: your proposed schema has the sequencing wrong/the reference points the wrong way. (And not every InvoiceEntry will have corresponding OrderEntry.)
On OrderEntry, having all of (OrderEntry)id and product_id and product_instance_id seems to me to give you way too many opportunities for tangling it all up. Can an Order have multiple Entrys for the same product_id? -- why/how? Can it have multiple Entrys for the same product_instance_id? -- why/how? Can there be a product_instance_id which refers to a different product_id than OrderEntry.product_id? This is exactly the sort of risk for confusing entanglement that normalisation aims to remove/reduce.
The customer is ordering a ProductInstance: mowing a particular size of garden at a particular address, fortnightly on a Tuesday afternnon. So OrderEntry.product_instance_id is what you want; .product_id is wrong. So (again) you need to create ProductInstance at time of recording the Order. Furthermore I strongly suspect you don't need an id on OrderEntry; instead give it a compound key (order_entry_id, product_instance_id). [**]
[**] I see you're using 'eloquent'. I suspect this is requiring id on every table. So you're not even using a relational database, this is some sort of Object-Relational hybrid. Insisting on a dedicated single id as key on every table is toxic. It has lead schema designers astray every time I get called in to help -- as here. Please if you can at all avoid it, don't do that.
Considering that this community has questions related to modeling databases, I am here seeking help.
I'm developing an auction system based on another one seen in a book I'm reading, and I'm having trouble. The problem context is the following:
In the auction system, a user makes product announcements (he/she defines a product). It defines
the product name, and the initial offer (called initial bid). The initial bid expresses the minimum amount to be offering. A
product only exists in the database when it is announced by a user. A
user defines a number of products, but a product belongs to a single
user.
Every product has an expiration date. When a certain date arrives, if
there are no offers for the product, it is not sold. If there are
offers for the product, the highest bidder wins the given product.
The offers have a creation date, and the amount offered. An offer is
made to a product from a user. A user can make different offers for
different products. A product can be offered by different users. The
same user can not do more than two offers for the same product.
This kind of context for me is easy to do. The problem is that I need to store a purchase. (I'm sorry, but I do not know if the word is that in English). I need a way to know which offer was successful, and actually "bought" a product. Relative to what was said, part of my Conceptual Model (Entity Relationship Diagram) is as follows:
I've tried to aggregate USERS with PRODUCTS, and make a connection/relationship between the aggregation and PRODUCTS, which would give me the PURCHASES event. When this was broken down (decomposed) I would have a table showing which offer bought what product.
Nevertheless, I would have a cardinality problem. This table would have two foreign keys. One for BIDS, and the other for PRODUCTS. This would allow an N-N relationship of these two, meaning that I could save more than one bid as the buyer of a product, or that the same bid could "buy" many products (so I say in the resulting PURCHASES table).
Am I not seeing something here? Can you guys help me with this? Thank you for reading, for your time, and patience. And if you need some more detail, please do not hesitate to ask.
EDIT
The property "Initial Bid" on the PRODUCTS entity is not a relationship.
This property represents a monetary value, a minimum amount that an offer must have to be given to a particular product.
You are approaching things backwards. First we determine a relevant application relationship, then we determine its properties. Given an application relationship and the possible application situations that can arise, only certain relationship/association sets/tables can arise. A purchase can have only one bid per product and one product per bid. So it is just a fact that the bid:product cardinality of PURCHASES is 1:1. Say so. This is so regardless of what other application relationships you wish to record. If it were a different application relationship between those entities then the cardinality might be different. Wait--USERS_PRODUCTS and BIDS form exactly such a pair of appplication relationships, with user:product being 1:0..N for the former and 0..N:0..N for the latter. Cardinality for each participant reflects (is a property of) the particular application relationship that a diamond represents.
Similarly: Foreign keys are properties of pairs of relationship/association sets/tables. Declaring them tells the DBMS to enforce that participant id values appear in entity sets/tables. Ie {bid} referencing BIDS and {product} referencing PRODUCTS. And candidate keys (of which one can be a primary key) are a property of a relationship set/table. It is the candidate keys of PURCHASES that lead to declaration of a constraint that enforces its cardinality. It isn't many:many on bid:product because "bid BID purchased product PRODUCT at price $AMOUNT on date DATE" isn't many:many on bid:product. Since either its bid or its product uniquely identify a purchase both {bid} and {product} are candidate keys.
Well... I tried to follow the given advices, and also tried to follow what I had previously spoken to do, using aggregation. It seems to me that the result is correct. Please observe:
Of course, a user makes offers for products. A user record can relate to multiple records in PRODUCTS. Likewise, a product can be offered by multiple users, and thus, a record in PRODUCTS can relate to multiple records on USERS. If I'm wrong on this, please correct me.
Looking at purchase, we see that a product is only properly purchased by a single bid. There is no way to say that a user buys a product, or a product purchase itself. It is through the relationship between USERS and PRODUCTS that an bid is created, and it is an bid that is able to buy a product. Thus, it is necessary to aggregate such a relationship, then set the purchase event.
We must remember that only one product can be purchased for a single bid. So here we have a cardinality of 1 to 1. Decomposition here will ask discretion of the data modeler.
By decomposing this Conceptual Model, we will have the following Logic Model:
Notice how relationships are respected with appropriate attributes. We can know who announced a product (the seller), and we can know what offers were made to it.
Now, as I said before, there is the PURCHASES relationship. As we have a relationship of 1 to 1 here, the rule tells us that we must choose which side the relationship will be interpreted (which entity shall keep such a relationship).
It is usually recommended to keep the relationship on the side that is likely in the future to become a "many". There is no need to do so in this case because it is clear that this relationship can not be preserved in a BIDS record. Therefore, we maintain such a relationship portrayed by "Winner Bid" on the PRODUCTS entity. As you can see, I set a unique identifier for BIDS. By defining the Physical Model, we have a surrogate key.
Well, I finish my answer here, but I think I can not consider it right yet. I would like someone to comment anything if possible. Thank you.
EDIT
I'd like to apologize. It seems that there was a mistake on my part. Well, I currently live in Brazil, and here we learn the ER-Diagram in a way that seems to me to be different from what many of you are used to. Until yesterday I've been answering some questions related to the subject here in the community, and I found odd certain different notations used. Only now I'm noticing that we are not speaking the same language. I believe it was embarrassing for me in a way. I'm sorry. Really, I'm sorry.
Oh, and one more thing (it may be interesting for someone):
The cardinality between BIDS and PRODUCTS is really wrong. It should
be 0 to 1 in both, as was said in the comments.
There are no relationships relating with relationships here. We have
what is called here (in my country, or in my past course)
"Aggregation" (Represented in the first drawing by the rectangle with clipped lines). The internal relationship (BIDS) when decomposed will
become an entity, and the relationship between BIDS and PRODUCTS is
made later. An aggregation says that a relationship needs to be resolved first so that it can relate with another entity. This creates a certain restriction, which may be necessary in certain business rules. It says that the joining of two entities define records that relate to records of another entity.
Certain relationships do not necessarily need to be named. Once you
break down certain types of relationships, there is no need (it's
optional) to name them (between relations N to N), especially if new
relationship does not arise from these. Of course, if I were to name the white relationships presented, we then would have USER'S BIDS (between USERS and BIDS), and PRODUCT'S BIDS (between BIDS and PRODUCTS).
With respect to my study material, it's all in portuguese, and I believe you will not understand, I do not know. But here's one of the books I read during my past classes:
ISBN: 978-85-365-0252-6
Title: PROJETO de BANCO de DADOS Uma Visão Prática
Authors: Felipe Machado, Mauricio Abreu
Publishing company: EDITORA ÉRICA
COVER:
LINK:
http://www.editorasaraiva.com.br/produtos/show/isbn:9788536502526/titulo:projeto-de-banco-de-dados-uma-visao-pratica-edicao-revisada-e-atualizada/
Well... What do I do now? Haha... I do not know what to do. I'll leave my question here. If someone with the same method that I can help me, I'll be grateful. Thank you.
To record the winning bid requires a functional dependency Product ID -> Bid ID. This could be implemented as one or more nullable columns (since not all products have been purchased yet) in the Products table, or better in a separate table (Purchases) using Product ID as primary key.
I am new to Dimension Modeling and I am working on doing a dimension model for a university. The current business process that I have picked up is actually sales/revenue. I have been reading different chapters of different books and although I think I have a good understanding of facts and dimensions I am having some tough time fitting the sales process on to the paper.
Ideally the sales process in the school is similar to other businesses where students are customers and the product is the "courses" they take. However in certain situation there are different product types and I don't know how to fit the product type. For example student pays an Application fee, late fee or transcript request fee which is not associated with any course. How do I fit these different type of revenue streams in my star?
What I have done so far is like this
Sales_FACT
====
Date_Key_FK
Product_Key_FK
Campus_Key_FK
Student_Key_FK
ChargeCredit_SKU
Amount
Product_Key
------
Product_Key_PK
SectionID
AcademicYear
AcademicTerm
AcademicSession
CourseCode
CourseName
ProductType???
Now for certain type of products (e.g. a transcript request fee) - I do not have the coursename,code, year term,session -- I am struggling how this will work.
Anyone has any input on this? or any helpful material/schema examples will definately appreciate them
Thanks,
You will find a plenty of those cases in future. Generally, you are experiencing this problem because you are mixing two different types of 'product' in your case.
It can be resolved logically or technically.
During the ETL process, in cleansing step, you can rewrite your null fields with sql (nvl, coalesce, CASE WHEN) with something related to field
nvl(CourseCode, 'No Course Code') as CourseCode
Then, when you group by ProductType, and CourseName you shoud get something like this:
ProductType CourseName sum(Amount)
------------------------------------------
AppFee Course1 345.13
AppFee Course4 8901.00
TranscriptFee No Course Name 245.99
Or, you can put it in separate tables. Even that is contradictory to your business process (can't have different products in fact row), sometimes terms you want to merge (i.e ApplicationFee and TranscriptFee) have many different grouping levels which is often too hard to map.
Edit:
No, snowflake make sense when there exists big dimension tables, high cardinality, many levels, as well as many to many relationship (i.e movies, categories). In your case good idea is to follow ERP/CRM database design, because it's current working solution. If there is no such reporting possibilty you want, you can make more generic dimension table:
Product-Service Dimension
--------------------------------------------
SurogateKey
NaruralKey
Type(Product/Sevrice/Other)
Level1(ProductType/ServiceType)
Level2(ProductSubType/ServiceSubType)
Level3
Level4
Attribute1
Attribute2
I was a little miffed about the one-to-one relationship explanation on the 'I Think You Mean A Many To One' article.
In this instance for example, a product has one price because the business in question is small, niche, localized and supports only a single currency. Multiple prices per product make no sense in this case? I'm doubtful I'm grasping the concept correctly though, because everywhere I read says it will probably be a many-to-one even if you think it isn't?
Can somebody enlighten me please? :)
In an attempt to gain more reputation so that I can help in comments instead of an "answer" The one-to-many vs one-to-one is this
View a one-to-one as an extension of the table you are looking at.
Table B extends Table A. Meaning the information wasn't necessarily relevant enough to include in the table directly, but has a bidirectional relationship with each other. Basically meaning that As Table A, I am not dependent on the information in Table B, but Table B's information is very dependent on me. For the price example it means that Table A has a row related to a row in table B. So if you entering unique information in your Price table around every item to match in Table A, then this would be useful. As in say you had a description column about the item in your price table. Otherwise the price table in this case may just be irrelevant to have in the schema.
in a one-to-many relationship Table B usually has no reference back to Table A. So in the case of price, the items you are looking at do have a price, but prices aren't exclusive to items. So to better define, A number of things may have the price 9.99, but 9.99 only needs to exist in your pricing table once.
I am not familiar with the article you refer to. However, price is a classic example of a slowly changing dimension. Price may be constant at any point in time, but over time, the price changes.
Such dimensions are typically implemented by having effective and end dates for the period in question.
Now, at a given point in time, a product probably does have only one price. Things that affect the price -- coupons, discounts for the purchaser, volume discounts, for example -- are not properties of the product. These are properties of the transaction.
That said, there may be circumstances where a fixed volume discount does not make sense. So, the "price" for a product might include volume, as well as time.
In any case, I would agree with you that price is not a good example of a 1-1 relationship. There are other factors such as time and volume that affect it.
Using the snowflake schema image from wikipedia:
http://en.wikipedia.org/wiki/File:Snowflake-schema-example.png
Would it ever make sense to have a "Brand_Id" foreign key in Fact_Sales as you do in Dim_Product? There is a many-to-one relationship of sales/brands just like sales/products or products/brands, so is there any logical reason not to? You may want to join directly to the Dim_Brand table.
I'm probably not seeing something obvious.
The type of relationship you're looking at is a has-a relationship.
A product has a brand. A sale has a product; it's the thing that was sold. But a sale does not have a brand. Or, a better way of saying this, you cannot sell a brand. (don't read too far into that one...)
So, no, you wouldn't want to add brand to sales.
If you are working in a dimensional model (the Star/Snowflake schema note in your question makes be think you are), then adding the BRAND_ID to the sales fact makes sense from a performance perspective, if the questions that the business is trying to answer are "what were the sales for brand X across all products in this time frame".
It also may be useful if the product dimension is a Type 1 SCD, and a product changes brands. You may want to preserve the prior sales as being of the "old" brand.
Keep in mind you are not doing entity - relationship modeling when you build a star/snowflake reporting schema. Questions of is-a or has-a aren't pertinent to a dimensional model.
I think that would be nice as a way to cache the data... but in all honesty, your probably better off just relying on the links as they are.
The reasoning is that you already have that definition of what that table does, store sales. To add in what brand those products are that the store sold is going to muddy the 'topic' or 'theme' of that table, recording sales of a store.
Now if by some way you had a product that can be sold under different brands (heck if I know how a package can have split personalities...) then yea, it would make sense to a degree, but a more reasonable solution is to give each product it's own SKU code then.