Modeling Fact Tables that have direct relationships, but at a detail and not a dimension layer - sql-server

This is very similar to my issue.
http://forum.kimballgroup.com/t2534-modeling-fact-tables-that-have-direct-relationships-but-at-a-detail-and-not-a-dimension-layer
I’ve got fact tables for POs, Supplier Invoices, Payments, Receipts, etc. They have some dimensions in common, others not. The problem is that when users look at, say, invoices by GL account (using an Excel pivot table connected to the cube), they expect to be able to drop in a column for the PO number, the buyer of the PO, etc., even though the buyer dimension is only related to the PO and the account dimension is only related to the invoice. Their reasoning is that the PO is related to the invoice, so you should be able to pull it in.
I do have a PO Ref field on the invoice fact table, but it is only filled out 50% of the time. Even when it is, you could have a one to many relationship in either way between a PO and an invoice, as far as I understand it at least.
Anyway, they expect to be able to throw in any measure from any measure group, have every single possible dimension work, and then drill down to the detail to see the POs, Invoices, Payments and Receipts and how they match up. According to Kimball, best practice is to keep fact tables separate when they have different grains, but that does not solve all of the business problems.
The only solutions I can come up with are:
1. Tack on a bunch of detail-related columns to the degenerate dimensions when I load them, i.e. add PO to invoice and invoice to PO etc., stored as a comma-separated list in that column when it is many to one.
2. Create every possible relationship between every fact and dimension table. This would be a lot of work though, and some facts still may not have a relationship to certain dimensions.
3. Create a monstrous fact table with all the current ones joined together, and somehow figure out logic to only display the measure values once for the many-to-one joins.
4. This is probably a bad idea, but maybe I could create a relationship between every measure group and the corresponding degenerate dimension's reference field. For example, create a relationship between the supplier invoice degenerate dimension's PO Ref field and the purchase order line measure group's PO field.
5. Lower their expectations, lol.
Here's a screen shot of the dimension usage tab to give an idea of what it looks like currently.

I tried option 3 once. The performance was terrible. The output was misleading. Never ever again.
Your best bet is to work with the business. Where the data is not readily available (invoice without PO, for example) agree what should be done. You could show a default value (PO not recorded on invoice). You could agree on a logic, implemented in the ETL, that extracts the most likely PO.
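As a rough illustration of the "default value" idea, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_po, stg_invoice, fact_invoice, po_key) are hypothetical, and it assumes at most one PO per invoice, which the question says is not always true. The ETL lookup maps an invoice with no PO reference to an agreed default dimension member instead of leaving the key empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_po (po_key INTEGER PRIMARY KEY, po_number TEXT, buyer TEXT);
-- Key 0 is the agreed default member for invoices raised without a PO (assumption).
INSERT INTO dim_po VALUES (0, 'PO not recorded on invoice', 'Unknown');
INSERT INTO dim_po VALUES (1, 'PO-1001', 'Alice');

CREATE TABLE stg_invoice (invoice_no TEXT, po_ref TEXT, amount REAL);
INSERT INTO stg_invoice VALUES ('INV-1', 'PO-1001', 250.0);
INSERT INTO stg_invoice VALUES ('INV-2', NULL, 100.0);

CREATE TABLE fact_invoice (
    invoice_no TEXT,
    po_key INTEGER NOT NULL REFERENCES dim_po(po_key),
    amount REAL
);
""")

# ETL step: look up the PO key; fall back to the default member when there is no match.
conn.execute("""
INSERT INTO fact_invoice (invoice_no, po_key, amount)
SELECT s.invoice_no, COALESCE(d.po_key, 0), s.amount
FROM stg_invoice AS s
LEFT JOIN dim_po AS d ON d.po_number = s.po_ref
""")

# Invoice amounts can now be sliced by buyer without silently dropping rows.
rows = conn.execute("""
SELECT d.buyer, SUM(f.amount)
FROM fact_invoice AS f
JOIN dim_po AS d ON d.po_key = f.po_key
GROUP BY d.buyer
ORDER BY d.buyer
""").fetchall()
print(rows)
```

The invoice without a PO shows up under the default member, so totals still reconcile and the gap in the source data stays visible to the users.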
Whatever approach you choose you must discuss it. If you do not the business will make decisions based on false assumptions. The business will find itself looking at reporting it does not understand. You must help your users to avoid these outcomes.
Once the approach has been agreed, document it. When queries arise, share the documentation. Make sure the documentation highlights all calculations, difficulties and missing source data.
Work with the teams that generate your source data. If an important field is sparsely populated, arrange a meeting. See if the capture processes can be improved. Let your users know that you are investigating this area. Keep them informed of the outcome. If the source data cannot be improved (invoices continue to be raised without a PO), inform your users of the reasons for this.
Managing your customers can be challenging. Especially those who hold senior positions in the company. Transparency and solid documentation will help you.

Related

How to move from Excel to designing a Data Warehouse Model

I just started in Data Warehouse modeling and I need help for the modeling of a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files, linked together, one file 'order' and the other 'services'.
the 'order' file sets out a summary of each flight (orderId, departure date, arrival date, City of departure, City of arrival, total amount collected, etc.)
the 'services' file lists the services provided by flight (orderId, service name, quantity, amount / qty, etc.)
with a 1-n relationship (order-services) each order has n services
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I thought about it, and the star and snowflake schemas do not seem to work in my case (since I have two fact tables), while the galaxy schema requires the fact tables to have dimensions in common. Where I get stuck is whether I should treat the order table as a dimension rather than a fact table, or instead treat the services table as a dimension, even though both are fact tables. I get a little confused.
How can I design my model?
First of all realize that in a star schema it is not a problem to have more fact tables that are connected - see the discussion here.
So the first draft will simply follow your two fact tables with their natively provided dimensions.
Order is in one context a fact table, in other context a dimensional table for the service table.
Depending on your expected queries, you may find it useful to denormalize some dimensions of the order table into the service table, so that each service row carries the departure date, arrival date and similar dimensions.
This will be done at the load time in the ETL job.
I would be careful about denormalizing the measures from order to service, which would basically eliminate the whole order table.
There will be no problem with the measure total amount collected if this is a redundant sum of the service amounts - you may safely get rid of it.
But you will certainly need the number of flights or the number of people transported. Those measures are better defined in the order fact table; you cannot simply replicate them in the N rows for each service.
A workaround is possible, if you define a main service for each order and those measures are defined only in this row - in other rows the value is NULL. This could lead to unexpected results if queried naively, e.g. for number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions to the services if this would help to optimize the queries.
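To make the risk with order-level measures concrete, here is a small sketch using Python's sqlite3 module. The table names and the passengers column are made up for illustration. It compares a measure summed from the order fact with the same measure summed after a naive join to the services fact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_order (
    order_id INTEGER PRIMARY KEY,
    departure_city TEXT,
    passengers INTEGER          -- order-level measure (hypothetical column)
);
CREATE TABLE fact_service (
    order_id INTEGER REFERENCES fact_order(order_id),
    service_name TEXT,
    amount REAL
);
INSERT INTO fact_order VALUES (1, 'Paris', 100), (2, 'Rome', 50);
INSERT INTO fact_service VALUES
    (1, 'catering', 500.0), (1, 'fuel', 2000.0), (2, 'fuel', 1500.0);
""")

# Measure taken at its own grain: one row per order.
passengers = conn.execute(
    "SELECT SUM(passengers) FROM fact_order"
).fetchone()[0]

# Naive join: the order row repeats once per service, inflating the sum.
naive = conn.execute("""
SELECT SUM(o.passengers)
FROM fact_order AS o
JOIN fact_service AS s ON s.order_id = o.order_id
""").fetchone()[0]

print(passengers, naive)  # 150 250
```

Order 1 has two services, so its 100 passengers are counted twice in the joined query. That is the reason for keeping such measures in the order fact table.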
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order including a degenerated dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea. Performance is terrible. You are always better off bringing the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name, along with a corresponding role-playing dimension view for this context, e.g. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
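A minimal sketch of bringing the order dimensions into the Services fact at load time, again with Python's sqlite3. The staging table names and sample values are assumptions, and for brevity the fact stores the dimension values directly; a real star schema would normally store surrogate keys pointing at date and city dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_order (
    order_id INTEGER PRIMARY KEY,
    departure_date TEXT, departure_city TEXT, arrival_city TEXT
);
CREATE TABLE stg_service (
    order_id INTEGER, service_name TEXT, quantity INTEGER, unit_amount REAL
);
INSERT INTO stg_order VALUES
    (1, '2024-01-05', 'Paris', 'Rome'),
    (2, '2024-01-06', 'Rome', 'Oslo');
INSERT INTO stg_service VALUES
    (1, 'catering', 2, 50.0), (1, 'fuel', 1, 2000.0), (2, 'fuel', 1, 1500.0);

CREATE TABLE fact_service (
    OrderId INTEGER,             -- degenerate dimension
    OrderDepartureDate TEXT,     -- order context kept in the column name
    OrderDepartureCity TEXT,
    OrderArrivalCity TEXT,
    ServiceName TEXT,
    Quantity INTEGER,
    Amount REAL
);
""")

# Load step: the join to the order data happens once, in the ETL, not at query time.
conn.execute("""
INSERT INTO fact_service
SELECT o.order_id, o.departure_date, o.departure_city, o.arrival_city,
       s.service_name, s.quantity, s.quantity * s.unit_amount
FROM stg_service AS s
JOIN stg_order AS o ON o.order_id = s.order_id
""")

# Reporting queries then read a single fact table.
by_city = conn.execute("""
SELECT OrderArrivalCity, SUM(Amount)
FROM fact_service
GROUP BY OrderArrivalCity
ORDER BY OrderArrivalCity
""").fetchall()
print(by_city)
```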
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit

Database design, an included attribute vs multiple joins? Confused

So I am taking a class in database design and management and am kind of confused from a design perspective. My example is an invoice system. I just made it up quick so it doesn't have a ton of complexity in it.
There are Customers, Orders, Invoices and Payments entities
Customers
CustId(PK),
Street,
Zip,
City,
..
Orders
OrderID(PK)
CustID(FK)
Date
Amt
....
Invoices
InvoiceID(PK),
OrderID(FK),
Date,
AmtDue,
AmtPaid,
....
Payments
PaymentNo(PK),
InvoiceID(FK),
PayMethod,
Date,
Amt,
...
Customer entity has a one to many relationship with Orders
Orders entity has a one to many relationship with Invoices
Invoices Entity has a one to many relationship with Payments.
To get the results of a query to list all Payments made by a Customer the query would have to join Payments with the Invoice table, the Invoice table with the Orders table and the Orders table with the Customer table.
Is this the correct way to do it? One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
Bonus question. Lets say there should be a report that says what the total customer balance is. Does there need to be a customer balance field in the database or can this be a calculated item that is produced by joining tables and adding up the amount billed vs amount paid?
Thanks!
Is this the correct way to do it?
Yes. Based on the information provided, it looks reasonable.
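For reference, the three-join query described in the question could look roughly like this. It is a sketch using Python's sqlite3 with a trimmed-down version of the schema from the question; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustID INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustID INTEGER REFERENCES Customers(CustID),
    Amt REAL
);
CREATE TABLE Invoices (
    InvoiceID INTEGER PRIMARY KEY,
    OrderID INTEGER REFERENCES Orders(OrderID),
    AmtDue REAL
);
CREATE TABLE Payments (
    PaymentNo INTEGER PRIMARY KEY,
    InvoiceID INTEGER REFERENCES Invoices(InvoiceID),
    Amt REAL
);
INSERT INTO Customers VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO Orders VALUES (10, 1, 300.0), (11, 2, 80.0);
INSERT INTO Invoices VALUES (100, 10, 300.0), (101, 11, 80.0);
INSERT INTO Payments VALUES (1000, 100, 100.0), (1001, 100, 50.0), (1002, 101, 80.0);
""")

# Payments -> Invoices -> Orders -> Customers: the customer is derived, not stored twice.
payments = conn.execute("""
SELECT p.PaymentNo, p.Amt
FROM Payments AS p
JOIN Invoices AS i ON i.InvoiceID = p.InvoiceID
JOIN Orders AS o ON o.OrderID = i.OrderID
JOIN Customers AS c ON c.CustID = o.CustID
WHERE c.CustID = ?
ORDER BY p.PaymentNo
""", (1,)).fetchall()
print(payments)
```

Since only the customer id is filtered here, the final join to Customers could be dropped and the filter applied to Orders.CustID directly.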
One could also just put a custID in the payment entity which would then just require one join, but then there is unneeded information in the payment entity. Is this just a design thing or is it a performance issue?
The question you're asking falls under "normal forms", often called normalization. Your target should be Boyce-Codd normal form (similar to 3NF), which should be described in your textbook. I will warn you that misinformation about and misunderstanding of database design issues are abundant on the interwebs, so be careful which answers you pay attention to.
The goal of normalization is to eliminate redundancy, and thus to eliminate "anomalies", whereby two logically equivalent queries produce inconsistent results. If the same information is kept in two places, and is updated in only one, then two queries against the two different values will produce different -- i.e., inconsistent -- results.
In your example, if there is a Payments.CustID, should I believe that one, or the one derived from joining Payments to Orders? The same goes for total customer balance: do I believe the stored total, or the one I computed from the constituents?
If you are going to "denormalize for performance", as is so often alleged to be necessary, what are you going to do to ensure the redundant values are consistent?
Bonus question. Lets say there should be a report that says what the total customer balance is.
As a matter of fact, in practice balances are sort of a special case. It's often necessary to know the balance at points in time. While it's possible to compute, say, monthly account balances from inception based on transactions, as a practical matter applications usually "draw a line in the sand" and record the balance for future reference. Steps are taken -- must be, for the sake of the business -- to ensure the historical information does not change or, if it does, that the recorded balance is updated to reflect the change. From that description alone, you can imagine that enforcing consistency throughout the system yourself is much more work than relying on the DBMS to enforce it. And that is why, insofar as is feasible, it's better to eliminate all redundant data, and let the DBMS do the job it was designed to do.
In your analysis, seek Boyce-Codd normal form. Understand your data, eliminate the redundancies, and recognize the relations. Let the DBMS enforce referential integrity. Countless errors will be avoided, and time saved. Only when specific circumstances conspire to show that specific business requirements cannot be satisfied on a particular system with a given, correct design, does one begin the tedious and error-prone work of introducing redundant information and compensating for it with external controls.
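As an illustration of the calculated alternative to a stored balance column, here is a sketch in Python's sqlite3 that derives each customer's balance from the question's tables (trimmed, with invented sample rows). The two sums are computed in separate subqueries so that an invoice with several payments is not counted more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustID INTEGER PRIMARY KEY);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustID INTEGER REFERENCES Customers(CustID));
CREATE TABLE Invoices (InvoiceID INTEGER PRIMARY KEY, OrderID INTEGER REFERENCES Orders(OrderID), AmtDue REAL);
CREATE TABLE Payments (PaymentNo INTEGER PRIMARY KEY, InvoiceID INTEGER REFERENCES Invoices(InvoiceID), Amt REAL);
INSERT INTO Customers VALUES (1), (2);
INSERT INTO Orders VALUES (10, 1), (11, 2);
INSERT INTO Invoices VALUES (100, 10, 300.0), (101, 11, 80.0);
INSERT INTO Payments VALUES (1000, 100, 100.0), (1001, 100, 50.0), (1002, 101, 80.0);
""")

# Balance = total billed - total paid, each summed at its own grain.
balances = conn.execute("""
SELECT c.CustID,
       (SELECT COALESCE(SUM(i.AmtDue), 0)
          FROM Invoices AS i
          JOIN Orders AS o ON o.OrderID = i.OrderID
         WHERE o.CustID = c.CustID)
     - (SELECT COALESCE(SUM(p.Amt), 0)
          FROM Payments AS p
          JOIN Invoices AS i ON i.InvoiceID = p.InvoiceID
          JOIN Orders AS o ON o.OrderID = i.OrderID
         WHERE o.CustID = c.CustID) AS Balance
FROM Customers AS c
ORDER BY c.CustID
""").fetchall()
print(balances)
```

Because nothing is stored redundantly, there is no second copy of the balance that could drift out of step with the payments.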
"Is this the correct way to do it?" Of course, given your current design. But it's not the ONLY way. So you're studying DB "normalization" and seeing the pros and cons of the various "forms" of normalization. In the "real world" things can change on a dime, due to a management decision or whatever. I tend to use "compound primary keys" instead of simply one field for primary and others as FK. I handle my "FK" programmatically instead of relegating that responsibility to the DB.
I also create and utilize a number of "intermediate" tables, or sometimes "VIEWS", that I use more easily than a bunch of code with too many JOINs. (3rd Normal form addicts can hate, but my code runs faster than a scalded rabbit).
An Order means nothing without a Customer; an Invoice means nothing without an Order; a Payment is great, but means nothing without both an Order and Invoice. So lemme throw this out there -- what's wrong with having a "summary" type of entity that has Cust, Order, Invoice #, and Payment Id ?

1:1 Relationships. Split into more than 1 table? Bad?

I am creating a mobile game where I am optimistically hoping I'll have millions of players.
I have created a users table that currently has roughly 8 columns (ie. userid, username, password, last_signin, etc)
For every user I'll also need to record the amount of in-game currency they have (ie. gold, silver, gems, etc).
This is a 1:1 relationship (a user will only ever have 1 value defining how much gold they have).
I am no database expert (which is why I am posting here). I worry If I added the gold, silver, gems, etc as new rows in the users table that the users table will be hammered with a crazy amount of queries per second. Everytime someone in the game finds more gold, more silver, logs in, creates an account... the users table will be accessed and/or updated.
Would it be smarter to add the gold, silver, and gems as columns in a new table called "resources" that had the following columns : userid, gold, silver, gems. This new table would have the exact same number of rows as the user table since there is a 1:1 relationship between users and resources. I'm wondering if those queries would be faster since the database data is split up and not all queries would go to the same table.
Clearly to me it seems better to put it all in 1 table since they are 1:1.... but It also seemed like a bad idea to have the majority of the games data in 1 table.
Thanks for any advice you can give!
Ryan
There are plenty of cases where good design calls for two tables in a 1:1 relationship with each other. There is no normalization rule that calls for decomposing tables in this manner. But normalization isn't the only handle on good design.
Access traffic is another handle. Your intuition that access to resources is going to be much more frequent than access to basic user data sounds credible. But you will need to check it out, to make sure that the transactions that access resources don't end up using basic user data anyway. It all boils down to which costs more: a fat user table or more joins.
Other responders have already hinted that there may come a day when the 1:1 relationship becomes a 1:many relationship. I can imagine one. The model of the game player gets expanded where a single user can get involved in multiple distinct instances of the game. In this case, a single user might have the same basic user data in all instances, but different resources in each instance. I have no way of telling if this is ever going to happen in your case. But, if it does, you're going to be better off with a separate resources table.
It really depends on your game design, how big your database is, and how you might expand your database in the future. I would put the resources in a separate table with a foreign key pointing to the user id because:
1. You can keep the user table slimmer for easier maintenance/backup.
2. A simple 1-to-1 JOIN operation between two tables doesn't take much more resources than having everything in the same table, as long as you have proper indexing.
3. By keeping your tables separated, you are practicing separation of concerns; multiple people can work on different stuff without having to worry about affecting other tables.
4. Easier to expand. You may want to add other columns such as birth_date, region, first_name, etc. that are more relevant to users' personal info to the users table in the future. It will be confusing if columns of different purposes are stored together. (In PostgreSQL you can't simply rearrange column order, though you can create Views for that.)
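A minimal sketch of the separate resources table, using Python's sqlite3. The column list is taken from the question; the sample user and amounts are made up. Making userid both the primary key and a foreign key is one way to keep the relationship strictly 1:1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
CREATE TABLE users (
    userid INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,
    last_signin TEXT
);
-- Primary key + foreign key on userid: at most one resources row per user.
CREATE TABLE resources (
    userid INTEGER PRIMARY KEY REFERENCES users(userid),
    gold INTEGER NOT NULL DEFAULT 0,
    silver INTEGER NOT NULL DEFAULT 0,
    gems INTEGER NOT NULL DEFAULT 0
);
INSERT INTO users (userid, username) VALUES (1, 'player1');
INSERT INTO resources (userid) VALUES (1);
""")

# Frequent gameplay writes touch only the narrow resources row.
conn.execute("UPDATE resources SET gold = gold + ? WHERE userid = ?", (25, 1))

# When both halves are needed, a primary-key join puts them back together.
row = conn.execute("""
SELECT u.username, r.gold, r.silver, r.gems
FROM users AS u
JOIN resources AS r ON r.userid = u.userid
WHERE u.userid = ?
""", (1,)).fetchone()
print(row)

# A resources row for a user that does not exist is rejected.
try:
    conn.execute("INSERT INTO resources (userid) VALUES (99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Whether this split actually performs better depends on the database engine and the workload, so it is worth measuring before committing to it.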
This is a 1:1 relationship (a user will only ever have 1 value defining how much gold they have).
... for now ;)
I am no database expert (which is why I am posting here). I worry If I added the gold, silver, gems, etc as new rows in the users table
New columns?
Would it be smarter to add the gold, silver, and gems as columns in a new table called "resources"
Probably, because:
You'll be doing smaller writes when you update the frequently updated part, without rewriting less-modified user data
It makes it easier to audit changes to the user data

I'm unable to normalize my Product table as I have 4 different product types

So because I have 4 different product types (books, magazines, gifts, food) I can't just put all products in one "products" table without having a bunch of null values. So I decided to break each product up into their own tables but I know this is just wrong (https://c1.staticflickr.com/1/742/23126857873_438655b10f_b.jpg).
I also tried creating an EAV model for this (https://c2.staticflickr.com/6/5734/23479108770_8ae693053a_b.jpg), but I got stuck as I'm not sure how to link the publishers and authors tables.
I know this question has been asked a lot but I don't understand ANY of the answer's I've seen. I think this is because I'm a very visual learner and this makes it hard to understand what's being talked about when not a lot of information is given.
Your model is on the right track, except that the product name should be sufficient; you don't need gift name, book name, etc. What you put in those tables is the information that is specific to that type of product and that the other products don't need. The Product table contains all the common fields. I would use productid in the child tables rather than renaming it giftID, magazineID etc. It is easier to remember what things are called when you are consistent in naming them.
Now to be practical, you put as much as you can into the product table especially if you are going to do calculations. I prefer the child tables in this specific case to have what is mostly display information. So product contains the product name, the cost, the type of product, the units the product is sold in etc. The stuff that generally is needed to calculate the cost of an order or to have a report of what was ordered. There may be one or two fields that can contain nulls, but it simplifies the calculation type queries so much it might be worth it.
The meat of the descriptive details though would go in the child table for the type of product. These would usually only be referenced when displaying the product in the shopping area and only one at a time, so you can use the product type to let you only join to the one child table you need for display. So while the order cares about the product number and name and cost calculations, it probably doesn't need to go line by line describing the book ISBN number or the megapixels in a camera. But the description page of the product does need those things.
This approach is not purely relational, although it mostly is, but it does group the information by the meaning of the data and how it will be used, which will make the database easier to understand and query. I am a big fan of relational tables because databases just work better when they reach at least the third normal form, but sometimes you can go too far for practicality. So the meaning of the data and the way you are grouping it for use (not just for the user interface, but for later reporting as well) is almost always one of my considerations in design.
Breaking each product type into its own table is fine - let the child tables use the same id as the parent Product table, and create views for the child tables that join with Product
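A rough sketch of that layout in Python's sqlite3, showing two of the four product types. The column names and sample rows are illustrative only, and the ISBN is a placeholder string. Each child table reuses the parent's ProductID as its own primary key, and a view joins the child back to Product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Common fields live in the parent table.
CREATE TABLE Product (
    ProductID INTEGER PRIMARY KEY,
    ProductType TEXT NOT NULL,
    Name TEXT NOT NULL,
    Price REAL NOT NULL
);
-- Child tables share the parent's key and hold only type-specific fields.
CREATE TABLE Book (
    ProductID INTEGER PRIMARY KEY REFERENCES Product(ProductID),
    ISBN TEXT,
    PageCount INTEGER
);
CREATE TABLE Food (
    ProductID INTEGER PRIMARY KEY REFERENCES Product(ProductID),
    ExpiryDate TEXT
);
-- The view presents a book as one row with no NULL-padded columns.
CREATE VIEW BookView AS
SELECT p.ProductID, p.Name, p.Price, b.ISBN, b.PageCount
FROM Product AS p
JOIN Book AS b ON b.ProductID = p.ProductID;

INSERT INTO Product VALUES (1, 'book', 'SQL Basics', 30.0), (2, 'food', 'Granola', 5.0);
INSERT INTO Book VALUES (1, 'placeholder-isbn', 200);
INSERT INTO Food VALUES (2, '2030-01-01');
""")

books = conn.execute("SELECT Name, Price, ISBN FROM BookView").fetchall()

# Calculations that need only common fields never touch the child tables.
total = conn.execute("SELECT SUM(Price) FROM Product").fetchone()[0]
print(books, total)
```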
Your case is a classic case of types and subtypes. This is often called class/subclass in object modeling and generalization/specialization in ER modeling. It's a well understood pattern. There are known techniques for dealing with this pattern.
Visit the following tags, and read the description under the info tab (presented as "learn more"). Also look over the questions grouped under these tags.
single-table-inheritance class-table-inheritance shared-primary-key
If you want to read in more depth, use these buzzwords to search for articles on the web.
You've already discovered and discarded single table inheritance on your own. Other answers have pointed you at shared primary key. Class table inheritance involves a single table for generalized data as well as the four specialized tables. Shared primary key is generally used in conjunction with class table inheritance.

Database Design without inheritance

I have come up with the following schema for a client of mine. Does anything look off here, especially the Order Line Items? Should I use inheritance? I'm pretty sure that this site will only allow you to order courses, lessons, and giftcards, and that's it.
Any feedback would be appreciated
Just my thinking on the design:
You have Courses, Lessons and GiftCards tables for the possible purchase objects, and OrderLines contains IDs for each of the tables. But if a customer purchases a Lesson and a GiftCard, they should be shown as 2 lines in the order. Also, what will you do if your client wants to sell more kinds of objects?
Therefore I think it might be better to redesign this part, like this:
OrderLines rename to OrderItems;
add ItemType table with 3 rows: Courses, Lessons, GiftCards;
add Items table with (ItemId, ItemType, Title, Price, LanguageCode, SortOrder, etc.) fields.
This way it will also be possible to add reviews not only for Lessons, but for all possible items.
You will have to come up with the preferred way to keep fields for the Items details. Right now Courses and Lessons share a lot of fields, so it might be reasonable to move all of them into the new Items table, as such fields also seem to be valid for GiftCards. And in case you have some specific details, like for GiftCards, you might add specific tables, like GiftCardItems with Items.id and a set of special fields not shared with other Item types.
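The proposed redesign could be sketched like this with Python's sqlite3. Table and field names follow the suggestion above where it gives them; ItemTypeId, Quantity, RedeemCode and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemTypes (ItemTypeId INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
INSERT INTO ItemTypes VALUES (1, 'Courses'), (2, 'Lessons'), (3, 'GiftCards');

-- Shared fields for every purchasable thing.
CREATE TABLE Items (
    ItemId INTEGER PRIMARY KEY,
    ItemTypeId INTEGER NOT NULL REFERENCES ItemTypes(ItemTypeId),
    Title TEXT NOT NULL,
    Price REAL NOT NULL
);
-- Type-specific details go in a side table keyed by the same ItemId.
CREATE TABLE GiftCardItems (
    ItemId INTEGER PRIMARY KEY REFERENCES Items(ItemId),
    RedeemCode TEXT                 -- hypothetical gift-card-only field
);
-- One row per item on an order, whatever its type.
CREATE TABLE OrderItems (
    OrderId INTEGER NOT NULL,
    ItemId INTEGER NOT NULL REFERENCES Items(ItemId),
    Quantity INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (OrderId, ItemId)
);

INSERT INTO Items VALUES (1, 2, 'Lesson 1', 10.0), (2, 3, 'Gift card', 25.0);
INSERT INTO GiftCardItems VALUES (2, 'CODE-PLACEHOLDER');
INSERT INTO OrderItems VALUES (500, 1, 1), (500, 2, 2);
""")

# A Lesson and a GiftCard on the same order come back as two lines.
lines = conn.execute("""
SELECT t.Name, i.Title, oi.Quantity * i.Price
FROM OrderItems AS oi
JOIN Items AS i ON i.ItemId = oi.ItemId
JOIN ItemTypes AS t ON t.ItemTypeId = i.ItemTypeId
WHERE oi.OrderId = ?
ORDER BY i.ItemId
""", (500,)).fetchall()
print(lines)
```

Adding a fourth kind of item then means one new ItemTypes row (plus an optional details table) rather than another ID column on the order lines.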
A minor note: I would split Users into a couple of tables, as I suppose that this table will contain both customers and support staff. This means that this table might grow big (depending on how many customers are expected). Maintaining so many fields in a single table might become problematic as the number of rows grows.
And I agree with Matt — it is difficult to tell anything without requirements.
It is really hard to tell without knowing the requirements from your client. Everything looks good but I can't really tell if it is all inclusive of what the client wants without their requirements documentation.
