Our group is looking to better understand our data and implement some best practices. After reading some of the guides on CodeProject about designing a data warehouse, we realized we need to start with a basic understanding of dimension and fact tables as they relate to our own data. We have gone back and forth on what constitutes a fact table. Below is an image of a portion of our database.
These tables sit below our operational system, where details and attributes flow directly into the PO_Header, PO_Detail, Appointment_Detail, and Appointment_Header tables. There are a few true dimension tables for dates, locations, and other values. For example, when an appointment is made, it is given an Appointment number for that particular country. Appointment numbers are unique only at the country level. Each appointment has attributes at the appointment level and is created against specific Purchase Orders (POs).
Our question is: Are the Appointment and PO tables true "Fact" tables or some sort of hybrid fact/dimension? If the business requires a view across all tables, is joining these tables above as described the right approach? As this is the operational system, we don't have the ability to change the structure but can redesign the structure in our data warehouse if needed.
I am working on a data warehouse solution, and I am trying to build a dimensional model from tables held in a SQL Server database. Some of the tables include but aren't limited to Customer, Customer Payments, Customer Address, etc.
All these tables in the DB have some fields that are repeated in every table, e.g. record update date, record creation date, active flag, closed flag and a few others. These tables all relate to the Customer in some way, but the tables can be updated independently.
I am in the process of building out a dimension(s) on the back of these tables, but I am struggling to see how best to deal with these repeated fields in an elegant way, as they are all used.
I'd appreciate any guidance from people who have experience with scenarios like this, as I am just starting out.
If more details are needed, I am happy to provide them.
Thanks
Before you even consider how to include them, ask whether those metadata fields need to be in your dimensional model at all. If no one will use the Customer Payment Update Date (vs Created Date or Payment Date), don't bring it into your model. If the customer model includes the current address, you won't need the CustomerAddress.Active flag included as well. You don't need every OLTP field in your model.
Make notes about how you talk about the fields in conversation. How do you identify the current customer address? Check the CurrentAddress flag (CustomerAddress.IsActive). When was the Customer's payment? Check the Customer Payment Date (CustomerPayment.PaymentDate or possibly CustomerPayment.CreatedDate). Try to describe them in common language terms. This will provide the best success in making your model discoverable by your users and intuitive to use.
Naming the columns in the model and source as similar as possible will also help with maintenance and troubleshooting.
Also, make sure you delineate the entities properly. A customer payment would likely be in a separate dimension from the customer. The current address may be in customer, but if there is any value to historical address details, it may make sense to put it into its own dimension, with the Active flag as well.
I am trying to build a star schema from an E/R diagram (OLTP system) that seems to contain a bridge table. Order is an obvious fact-table and product a dimension-table. I can't see how I can keep the bridge table if the model needs to be a star schema. How would you tackle this relationship if I need to keep information about Channel in the model?
It depends on how you plan to use the model.
If you only need to answer product and channel questions about existing orders, then you can avoid the bridge table altogether, because M2M relations between channels and products can be resolved through the fact table ("Orders"):
The (huge) advantage of this design is its simplicity and ease of use - it's very intuitive to the end-users. It's also fast.
The disadvantage of the model is its dependency on the orders. If orders are absent (i.e., there are no orders in the fact table for a given product or channel), then you won't be able to answer questions about product and channel relations (for example, "show me all products by their assigned channels"). If such questions are not important and you only need to analyze existing orders, keep it simple.
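A minimal sketch of this first design, using Python's built-in sqlite3 module so it can be run anywhere. The table and column names (dim_product, dim_channel, fact_orders) and the sample rows are assumptions made up for illustration; they are not taken from the original diagram.

```python
import sqlite3

# In-memory database; all names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_channel (channel_key INTEGER PRIMARY KEY, channel_name TEXT);

-- The fact table carries one key per dimension, so it resolves the
-- many-to-many relation between products and channels by itself.
CREATE TABLE fact_orders (
    order_id    INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    channel_key INTEGER REFERENCES dim_channel(channel_key),
    quantity    INTEGER
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
INSERT INTO dim_channel VALUES (1, 'Web'), (2, 'Retail');
INSERT INTO fact_orders VALUES (100, 1, 1, 5), (101, 1, 2, 3), (102, 2, 1, 7);
""")

# Product-channel pairs are only visible where an order exists.
pairs = conn.execute("""
    SELECT DISTINCT c.channel_name, p.product_name
    FROM fact_orders f
    JOIN dim_channel c ON c.channel_key = f.channel_key
    JOIN dim_product p ON p.product_key = f.product_key
    ORDER BY c.channel_name, p.product_name
""").fetchall()

for channel_name, product_name in pairs:
    print(channel_name, product_name)
```

In this sample data 'Gizmo' has no orders, so it never shows up in the result, which is exactly the limitation described above.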
If you do need to analyze product-channel relations even without existing orders, then things are more complicated. One approach is to add a bridge table as follows:
The advantage of this design is that Channel-Product relations are always available, regardless of the orders. It's also (still) simple to analyze orders by product. The disadvantage is that it's now harder to analyze orders by channel, because you now have to go through the bridge table. For example, in end-user tools such as Power BI you will need to make the "red" connection bi-directional, to enable filter propagation from the channel dimension via bridge to the product dimension. It's doable, of course, but end-users now will have to know what they are doing - it's no longer simple.
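As a rough sketch of the bridge variant, again with Python's sqlite3 and hypothetical names (bridge_product_channel and friends are assumptions, as is the sample data): the fact table keeps only the product key, and channel questions go through the bridge.

```python
import sqlite3

# In-memory database; all names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_channel (channel_key INTEGER PRIMARY KEY, channel_name TEXT);

-- Bridge table: one row per product-channel assignment, independent of orders.
CREATE TABLE bridge_product_channel (
    product_key INTEGER REFERENCES dim_product(product_key),
    channel_key INTEGER REFERENCES dim_channel(channel_key),
    PRIMARY KEY (product_key, channel_key)
);

-- In this sketch the fact table references the product dimension only.
CREATE TABLE fact_orders (
    order_id    INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
INSERT INTO dim_channel VALUES (1, 'Web'), (2, 'Retail');
INSERT INTO bridge_product_channel VALUES (1, 1), (1, 2), (2, 1), (3, 2);
INSERT INTO fact_orders VALUES (100, 1, 5), (101, 2, 7);
""")

# Assignments are available even for products that were never ordered.
assignments = conn.execute("""
    SELECT p.product_name, c.channel_name
    FROM bridge_product_channel b
    JOIN dim_product p ON p.product_key = b.product_key
    JOIN dim_channel c ON c.channel_key = b.channel_key
    ORDER BY p.product_name, c.channel_name
""").fetchall()

# Orders by channel must travel fact -> bridge -> channel.
qty_by_channel = conn.execute("""
    SELECT c.channel_name, SUM(f.quantity)
    FROM fact_orders f
    JOIN bridge_product_channel b ON b.product_key = f.product_key
    JOIN dim_channel c ON c.channel_key = b.channel_key
    GROUP BY c.channel_name
    ORDER BY c.channel_name
""").fetchall()

print(assignments)
print(qty_by_channel)
```

Note that a product assigned to several channels contributes its quantity to each of them, so the per-channel totals here add up to more than the total ordered quantity. That is part of why end users need to know what they are doing with this design.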
Yet another design uses a "factless" fact table:
Here, you can easily query Channel-Product relations without orders (through the factless fact table Product-Channel, which essentially records the status of each relationship), and also easily query orders by both product and channel. You can also "drill across" such a structure to answer all kinds of complex questions about products without existing orders. Still, this design is not as intuitive as the first one.
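A small sketch of the factless fact table idea, using Python's sqlite3. The names (fact_product_channel, fact_orders) and the rows are assumptions for illustration. The drill-across query at the end finds assigned product-channel pairs that have no orders yet.

```python
import sqlite3

# In-memory database; all names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_channel (channel_key INTEGER PRIMARY KEY, channel_name TEXT);

-- Factless fact table: keys only, no measures. A row means
-- "this product is assigned to this channel".
CREATE TABLE fact_product_channel (
    product_key INTEGER REFERENCES dim_product(product_key),
    channel_key INTEGER REFERENCES dim_channel(channel_key),
    PRIMARY KEY (product_key, channel_key)
);

-- The orders fact keeps both dimension keys, as in the simple star.
CREATE TABLE fact_orders (
    order_id    INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    channel_key INTEGER REFERENCES dim_channel(channel_key),
    quantity    INTEGER
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
INSERT INTO dim_channel VALUES (1, 'Web'), (2, 'Retail');
INSERT INTO fact_product_channel VALUES (1, 1), (1, 2), (2, 1), (3, 2);
INSERT INTO fact_orders VALUES (100, 1, 1, 5), (101, 2, 1, 7);
""")

# Drill across the two facts on their shared dimension keys:
# which assigned product-channel pairs have never been ordered?
unsold = conn.execute("""
    SELECT p.product_name, c.channel_name
    FROM fact_product_channel pc
    JOIN dim_product p ON p.product_key = pc.product_key
    JOIN dim_channel c ON c.channel_key = pc.channel_key
    LEFT JOIN fact_orders o
           ON o.product_key = pc.product_key
          AND o.channel_key = pc.channel_key
    WHERE o.order_id IS NULL
    ORDER BY p.product_name, c.channel_name
""").fetchall()

print(unsold)
```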
I have to build a DW to store PO and Invoice data:
An Invoice has a header and a list of items
A PO has a header and a list of items.
An invoice can be related to zero or more POs
A PO can be related to one or more invoices.
What is the recommended way to design this in a star schema?
Designing a DW involves understanding multiple aspects before having a model.
What is the frequency of data refresh?
What is the volume of data?
Which columns need to be indexed, and which kind of index will help most?
What queries will be written against the tables? Are they aggregates or straight select statements?
What is your history preservation strategy?
What data types does each column need? You need to think about cross-platform query execution.
And so on.
You will need to dive deep into these questions. Just creating tables with FKs will work for now, but over time, as data volume increases, it will become a bottleneck.
You have a problem in that you are modelling data, not process.
Star schemas are based on a business process, not an entity relationship.
What are you trying to model? What is the grain of the model?
I'll go out on a limb, and say that you're probably modelling sales. Have one fact: Sale. If you need order-specific information, consider whether it is part of an Order dimension, or if it should be carried as degenerate dimensions and/or measures in the Sale fact.
Create an Invoice_Header_Fact and an Invoice_LineItem_Fact. (These can also be denormalized and merged into one table.)
Use Order_Key from Header Fact in LineItem Fact to associate it to lineitems
Create a PO_Header_Fact and a PO_LineItem_Fact.
Use PO_Key from Header Fact in LineItem Fact to associate it to lineitems
Create a bridge/xref table to maintain the many-to-many relationship between POs and Invoices.
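The steps above could look roughly like the following sketch in Python's sqlite3. The lower-case table names mirror the ones suggested in the answer, but the column names (po_key, invoice_key, line_no, amount), the bridge table name po_invoice_bridge, and all sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory database; columns and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE po_header_fact (po_key INTEGER PRIMARY KEY, po_number TEXT);
CREATE TABLE po_lineitem_fact (
    po_key  INTEGER REFERENCES po_header_fact(po_key),
    line_no INTEGER,
    amount  INTEGER
);
CREATE TABLE invoice_header_fact (invoice_key INTEGER PRIMARY KEY, invoice_number TEXT);
CREATE TABLE invoice_lineitem_fact (
    invoice_key INTEGER REFERENCES invoice_header_fact(invoice_key),
    line_no     INTEGER,
    amount      INTEGER
);

-- Bridge/xref table: one row per related (PO, invoice) pair.
CREATE TABLE po_invoice_bridge (
    po_key      INTEGER REFERENCES po_header_fact(po_key),
    invoice_key INTEGER REFERENCES invoice_header_fact(invoice_key),
    PRIMARY KEY (po_key, invoice_key)
);

INSERT INTO po_header_fact VALUES (1, 'PO-1'), (2, 'PO-2');
INSERT INTO po_lineitem_fact VALUES (1, 1, 100), (1, 2, 50), (2, 1, 70);
INSERT INTO invoice_header_fact VALUES (10, 'INV-10'), (11, 'INV-11'), (12, 'INV-12');
INSERT INTO invoice_lineitem_fact VALUES (10, 1, 100), (11, 1, 120), (12, 1, 30);
INSERT INTO po_invoice_bridge VALUES (1, 10), (1, 11), (2, 11);
""")

# Header-to-line association via the shared header key.
po_totals = conn.execute("""
    SELECT h.po_key, SUM(l.amount)
    FROM po_header_fact h
    JOIN po_lineitem_fact l ON l.po_key = h.po_key
    GROUP BY h.po_key
    ORDER BY h.po_key
""").fetchall()

# Many-to-many: how many invoices relate to each PO.
invoices_per_po = conn.execute("""
    SELECT h.po_number, COUNT(b.invoice_key)
    FROM po_header_fact h
    LEFT JOIN po_invoice_bridge b ON b.po_key = h.po_key
    GROUP BY h.po_key
    ORDER BY h.po_key
""").fetchall()

# Invoices related to zero POs simply have no bridge row.
invoices_without_po = conn.execute("""
    SELECT invoice_number
    FROM invoice_header_fact
    WHERE invoice_key NOT IN (SELECT invoice_key FROM po_invoice_bridge)
""").fetchall()

print(po_totals, invoices_per_po, invoices_without_po)
```

Line-item amounts are summed separately from the bridge join on purpose: joining line items and the bridge in a single query would repeat each line once per related invoice and inflate the totals.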
Hope this helps!
We have a situation in a database design in our company. We are trying to figure out the best way to design the database to store transactional data. I need expert advice on the best relational design to achieve it. Problem: We have different kinds of “Entities” in our system, for example, Customers, Services, Dealers etc. These Entities transfer funds between each other. We need to store the history of the transfers in the database.
Solutions:
One table of transfers and another table to keep “Accounts” information. There are three tables: “Customers”, “Services”, and “Dealers”. There is another table, “Accounts”. An account can be related to any of the “Entities” mentioned above; this means (and that’s the requirement) that logically there should be a one-to-one relationship between Entities and Accounts. However, we can only store the Account_ID in the Entities tables; we cannot store the foreign key of the Entities in the Accounts table. This is where the database design problem arises: nothing in the design prevents a customer’s account from also being stored in the Services table, etc. On the other hand, we can keep all transfers in one table, since Accounts are unified among all the entities.
Keep the balance information in the primary Entities tables and use separate tables for the transfers. Here, we keep a separate table for each kind of transfer between the entities. For example, a transfer between a Customer and a Service provider will be stored in a table called “Spending”. Another table, called “Commission”, will hold transfers between Services and Dealers, etc. In this case, we are not storing all the fund transfers in a single table, but the foreign keys are properly defined, since the tables “Spending” and “Commission” each link only two specific entities.
According to best practices, which of the above solutions is correct, and why?
If you are simply looking for schemas that claim to deal with cases like yours, there is a website with hundreds of published schemas. Some of these pertain to storing transaction data concerning customers and suppliers. You can take one of these and adapt it.
http://www.databaseanswers.org/data_models/
If your question is about how to relate accounts to business contacts, read on.
Customers, Services, and Dealers are all subclasses of some superclass that I'll call Contacts. There are two well-known design patterns for modeling subclasses in database tables. And there is a technique called shared primary key that can be used with one of them to good advantage.
Take a look at the info and the questions grouped under these three tags:
single-table-inheritance class-table-inheritance shared-primary-key
If you use class table inheritance and shared primary key, you will end up with four tables pertaining to contacts: Contacts, Customers, Dealers, and Services. Every entry in Contacts will have a corresponding entry in one of the three subclass tables.
An FK in the accounts table, let's call it Accounts.ContactID will not only reference a row in Contacts, but also a row in whichever of Customers, Dealers, Services pertains to the case at hand.
This may work out well for you. Alternatively, single table inheritance works out well in some of the simpler cases. It depends on details about your data and your intended use of it.
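Here is a minimal sketch of class table inheritance with a shared primary key, written with Python's sqlite3. The column names (contact_id, contact_type, amount_cents and so on), the Transfers table, and the sample rows are assumptions for illustration. The UNIQUE constraint on accounts.contact_id is one possible way to express the one-to-one requirement from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
-- Superclass table: every customer, dealer, or service gets one row here.
CREATE TABLE contacts (
    contact_id   INTEGER PRIMARY KEY,
    contact_type TEXT NOT NULL
        CHECK (contact_type IN ('customer', 'dealer', 'service'))
);

-- Subclass tables share the superclass primary key (shared primary key).
CREATE TABLE customers (
    contact_id INTEGER PRIMARY KEY REFERENCES contacts(contact_id),
    full_name  TEXT
);
CREATE TABLE dealers (
    contact_id INTEGER PRIMARY KEY REFERENCES contacts(contact_id),
    region     TEXT
);
CREATE TABLE services (
    contact_id   INTEGER PRIMARY KEY REFERENCES contacts(contact_id),
    service_name TEXT
);

-- One account per contact: the FK is NOT NULL and UNIQUE.
CREATE TABLE accounts (
    account_id    INTEGER PRIMARY KEY,
    contact_id    INTEGER NOT NULL UNIQUE REFERENCES contacts(contact_id),
    balance_cents INTEGER NOT NULL DEFAULT 0
);

-- A single transfer history table for every kind of entity.
CREATE TABLE transfers (
    transfer_id  INTEGER PRIMARY KEY,
    from_account INTEGER NOT NULL REFERENCES accounts(account_id),
    to_account   INTEGER NOT NULL REFERENCES accounts(account_id),
    amount_cents INTEGER NOT NULL
);

INSERT INTO contacts  VALUES (1, 'customer'), (2, 'service'), (3, 'dealer');
INSERT INTO customers VALUES (1, 'Alice');
INSERT INTO services  VALUES (2, 'Hosting');
INSERT INTO dealers   VALUES (3, 'North');
INSERT INTO accounts  VALUES (10, 1, 10000), (20, 2, 0), (30, 3, 0);
INSERT INTO transfers VALUES (1, 10, 20, 2500), (2, 20, 30, 500);
""")


def try_insert(sql, params):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# An account for a contact that does not exist is rejected by the FK.
orphan_ok = try_insert(
    "INSERT INTO accounts (account_id, contact_id) VALUES (?, ?)", (99, 12345))
# A second account for contact 1 is rejected by the UNIQUE constraint.
duplicate_ok = try_insert(
    "INSERT INTO accounts (account_id, contact_id) VALUES (?, ?)", (98, 1))

# Transfer history with the type of entity on each side.
history = conn.execute("""
    SELECT t.transfer_id, fc.contact_type, tc.contact_type, t.amount_cents
    FROM transfers t
    JOIN accounts fa ON fa.account_id = t.from_account
    JOIN contacts fc ON fc.contact_id = fa.contact_id
    JOIN accounts ta ON ta.account_id = t.to_account
    JOIN contacts tc ON tc.contact_id = ta.contact_id
    ORDER BY t.transfer_id
""").fetchall()

print(orphan_ok, duplicate_ok, history)
```

This sketch does not force every contacts row to have a matching row in exactly one subclass table; that rule would need extra constraints or application logic.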
You can create an Accounts table with three FK fields, one each to Customers, Dealers and Services, and that will solve the problem. Alternatively, you can create three tables with the accounting data, one for each type of entity. This is a case where several designs are possible, and each of them solves the task. To decide, you need to weigh the pros and cons in terms of code complexity, performance and your other system requirements. For example, one table will be simpler to code against, but three tables may perform better in the SQL database.
I'm part of a team architecting an Operational Data Store (ODS) database, using SQL Server 2012, that will be used by some of our analysts to do predictive modeling. The ODS will contain manufacturing production data for a single product we make.
We will have hundreds of tables in the ODS. However, we will have a single core table that will contain critical information (lifecycle info) about each item manufactured (tens of millions each year). Our product is manufactured in a manufacturing plant and spends roughly 2.5 hours moving through various processes along a production line. We want to store various, individual, pieces of manufacturing and post manufacturing information in this core table. An example piece of data might be the time the product entered a particular oven.
We have a decision to make on how to architect this table. We can create a wide table (many columns) or a narrow table where most columns are rows (as property values). I have never designed and worked with a table structure that is very narrow and columns are treated as rows in the table.
I'd like some feedback on the pros and cons of a wide table vs. a narrow table. The following might be useful in helping with this discussion:
Number of products produced each year: Several million (each of these product instances will be a row in the core table)
Will this table be queried often: Yes, very often. It will be the parent to many child tables.
Potential number of columns (or row properties): 75 to 150+
If more information would be useful, I'd be glad to provide it.
Wide tables, static properties
You are tracking a single product through a well-defined manufacturing process. This data model sounds very static, and would lend itself to a wide table with many columns that are consistently populated with data.
Narrow tables, dynamic properties
If you had many, many products with lots of variation in the manufacturing process, it would be better suited for a narrow table, where you could easily add new properties for tracking.
Difficult to query a narrow table
However, even simple querying of a narrow table can be extremely difficult. For example, what if you needed to sort the data by a certain property when that property is shuffled amongst 100+ other property rows? How would you get all the rows together to form a single "record" and then sort the record groups within your result set?
Flat tables simpler to query
Depending on how you need to view and analyze the data, you may find yourself constantly using pivot or crosstab queries. If that's the case, then why not flatten out the storage table to begin with?
Or do both
Another option is to do both: Store the data narrowly, and use a transformation process to flatten it out for ease of reporting. That way you can quickly begin tracking new properties (just by adding rows), and then you can work on getting your reporting tables and transformation process updated to utilize the new data.
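To make the "do both" option concrete, here is a small sketch in Python's sqlite3 that stores properties narrowly and flattens them into a wide reporting table with conditional aggregation. The table names (item_property, item_wide), the property names, and the sample values are all assumptions for illustration.

```python
import sqlite3

# In-memory database; all names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Narrow storage: one row per (item, property).
CREATE TABLE item_property (
    item_id  INTEGER,
    property TEXT,
    value    TEXT,
    PRIMARY KEY (item_id, property)
);

INSERT INTO item_property VALUES
    (1, 'oven_entry_time', '2024-01-01T08:30'),
    (1, 'line',            'A'),
    (2, 'oven_entry_time', '2024-01-01T08:10'),
    (2, 'line',            'B'),
    (3, 'oven_entry_time', '2024-01-01T08:20'),
    (3, 'line',            'A');

-- Transformation step: pivot property rows into columns, one row per item.
CREATE TABLE item_wide AS
SELECT item_id,
       MAX(CASE WHEN property = 'oven_entry_time' THEN value END) AS oven_entry_time,
       MAX(CASE WHEN property = 'line'            THEN value END) AS line
FROM item_property
GROUP BY item_id;
""")

# Sorting by a property is trivial once the data is flat.
by_oven_time = conn.execute("""
    SELECT item_id, oven_entry_time, line
    FROM item_wide
    ORDER BY oven_entry_time
""").fetchall()

for row in by_oven_time:
    print(row)
```

Each new tracked property needs one more CASE expression in the transformation, which is the maintenance cost of this approach. With 75 to 150+ properties the pivot would normally be generated from a list of property names rather than written by hand.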
How wide is too wide? Well, there can be several problems with wide tables.
One problem is that wide tables tend to deviate from the rules for normalizing data. This in turn can result in tricky update problems where you have to be careful to prevent the database from entering a self-contradictory state. There's no particular answer to how wide is too wide here. Just apply the normalization rules, and you'll end up decomposing the table.
However, some databases are not built with normalization as the guiding principle. In particular, consider fact tables in star schemas. There are times when some of the columns are determined by some subset of the FKs, and this can violate 3NF or even 2NF. Keeping fact tables skinny is still important in star schemas, but it's for a different reason, namely speed. Sometimes, a fact table can be made skinnier by pushing data out to one of the dimension tables. Sometimes, you can decompose a star into two or more related stars.
Your case sounds like the second reason given above, even though your design probably isn't a star schema. Still, star schema design principles might help you improve your design.