Dimensional Model: Appointments-Fact or Dimension - analytics

When designing a dimensional model for analyzing the scheduling process in a clinic, is an appointment the grain of the fact table?
If an appointment is the grain, how does one calculate the appointment attributes like the overbooked attribute?

In modelling scenarios like this it's useful to think about the process vs the entity. For example, there is a process of booking and attending an appointment, which can be modelled as a fact, and an entity of the appointment (with all its attributes) that is associated with the process.
To model the process one option is to use the technique of "accumulating snapshot" facts, where fact rows are being updated over time. To model the entity you could create a dimension of all appointments, but generally you want to avoid creating dimensions that have as many rows as the fact table so one approach is the "junk dimension" that contains unique combinations of a subset of the attributes (e.g. status flags).
Joy Mundy of the Kimball Group discusses a similar design scenario in an article here.
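As a rough sketch of the two techniques above (not taken from the linked article), the following uses Python's sqlite3 module with hypothetical table and column names (fact_appointment, dim_appointment_flags, is_overbooked, is_no_show). It assumes the overbooked flag has already been derived during the load; how it is derived depends on the clinic's scheduling rules, which are not given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Junk dimension: one row per unique combination of status flags.
CREATE TABLE dim_appointment_flags (
    flags_key     INTEGER PRIMARY KEY,
    is_overbooked INTEGER NOT NULL,
    is_no_show    INTEGER NOT NULL,
    UNIQUE (is_overbooked, is_no_show)
);
-- Accumulating snapshot fact: one row per appointment, updated over time.
CREATE TABLE fact_appointment (
    appointment_id INTEGER PRIMARY KEY,
    booked_date    TEXT NOT NULL,
    attended_date  TEXT,  -- NULL until the appointment is attended
    flags_key      INTEGER NOT NULL
                   REFERENCES dim_appointment_flags (flags_key)
);
""")

# Pre-populate every flag combination (2 x 2 = 4 rows).
conn.executemany(
    "INSERT INTO dim_appointment_flags (is_overbooked, is_no_show) VALUES (?, ?)",
    [(o, n) for o in (0, 1) for n in (0, 1)],
)

def flags_key(is_overbooked, is_no_show):
    """Look up the junk-dimension row for one combination of flags."""
    return conn.execute(
        "SELECT flags_key FROM dim_appointment_flags "
        "WHERE is_overbooked = ? AND is_no_show = ?",
        (is_overbooked, is_no_show),
    ).fetchone()[0]

# Two bookings; the second one is assumed to be flagged as overbooked.
conn.execute("INSERT INTO fact_appointment VALUES (1, '2024-01-02', NULL, ?)",
             (flags_key(0, 0),))
conn.execute("INSERT INTO fact_appointment VALUES (2, '2024-01-02', NULL, ?)",
             (flags_key(1, 0),))

# Later milestone: the same fact row is updated rather than a new row added.
conn.execute(
    "UPDATE fact_appointment SET attended_date = '2024-01-09' "
    "WHERE appointment_id = 1"
)

overbooked_count = conn.execute("""
    SELECT COUNT(*)
    FROM fact_appointment f
    JOIN dim_appointment_flags d ON d.flags_key = f.flags_key
    WHERE d.is_overbooked = 1
""").fetchone()[0]
attended = conn.execute(
    "SELECT attended_date FROM fact_appointment WHERE appointment_id = 1"
).fetchone()[0]
```

The junk dimension stays at four rows however many appointments are loaded, which is the point of preferring it over a dimension with one row per appointment.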

Related

Database Design: Multiple tables vs a single table for research related items [duplicate]

I am making a research repository where there are different types of research items such as conferences, publications, patents, keynotes etc. The data will be collected from the relevant sources, processed, and then inserted in a batch from an Excel sheet. The main operation will be querying the data according to the logged-in user: researcher-related information for an individual, department/unit-related information (mainly summing up the rows) for the chairperson, and so on.
Now when I approach this, I see two options:
1. Make two tables, one for the research item type and the other for the actual item
2. Make individual tables for all types of objects
The problem with the 1st structure is that I will have a huge main table with empty/null columns. But it will allow me to easily add another research item type in future, since I can simply add the new type to the "type" table and then add the actual data to the common table.
However, the second approach allows me to query only the relevant table to get the information, hence no empty/null values. The drawback is that I cannot add a new research item type within this structure without adding a new table for it.
If I may ask, which of the two strategies would you recommend and why?
The 1st one entails a large single table, and the second one entails multiple database queries.
If it helps, I am using MS SQL Server.
The problem you're facing is the resolution of a hierarchy in an ER model.
You have a parent entity, or generalization (RESEARCH_ITEM) that can be instantiated in different ways (your child entities, like PUBLICATION, PATENT, and so on).
To implement this hierarchy in the physical layer (i.e. creating the tables) you have to consider which properties this hierarchy has. In particular, you have to ask yourself:
Overlap constraint: can an instance of the parent entity belong to more than one child entity?
Covering constraint: do the child entities cover all the possible instances of the parent entity?
Combining these two criteria, we have four possible cases:
Total disjoint: the child entities cover all the possible instances without overlap;
Partial disjoint: the child entities don't cover all the possible instances and there's no overlap;
Total overlapping: the child entities cover all the possible instances with potential overlap;
Partial overlapping: the child entities don't cover all the possible instances and there's possible overlap.
The resolution of the hierarchy depends upon the scenario. If your hierarchy is a total-disjoint one, the best thing to do would be to eliminate the parent entity and to incorporate its attributes in the child entities (faster queries, cleaner tables).
On the other hand, if there is overlapping, this solution is not optimal, because you'd have duplication of data (the same row in two child tables). In this case you could opt for the incorporation of the children in the parent, with possible NULL fields for child-specific attributes.
Moreover, to design the best implementation, you'd have to consider how the data is accessed. (Is there a child that you know will be queried very often? In that case a separate table would be good.)
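The two resolutions described above could look roughly like this. It is only a sketch using Python's sqlite3 module (the question targets MS SQL Server, where the DDL would differ slightly), and the column names (title, journal, patent_number, is_publication, is_patent) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Total disjoint: drop the parent and repeat its attributes in each child.
CREATE TABLE publication (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,   -- attribute inherited from RESEARCH_ITEM
    journal TEXT NOT NULL    -- child-specific attribute
);
CREATE TABLE patent (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    patent_number TEXT NOT NULL
);

-- Overlapping: fold the children into the parent instead.
-- Child-specific columns must be NULLable, and one flag per child
-- lets a row belong to several children at once.
CREATE TABLE research_item (
    id             INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    is_publication INTEGER NOT NULL DEFAULT 0,
    is_patent      INTEGER NOT NULL DEFAULT 0,
    journal        TEXT,
    patent_number  TEXT
);
""")

# One item that is both a publication and a patent is stored once.
conn.execute(
    "INSERT INTO research_item "
    "(title, is_publication, is_patent, journal, patent_number) "
    "VALUES (?, 1, 1, ?, ?)",
    ("Widget method", "J. Widgets", "P-001"),
)
conn.execute(
    "INSERT INTO research_item (title, is_publication, journal) VALUES (?, 1, ?)",
    ("Survey", "J. Surveys"),
)

publication_titles = [r[0] for r in conn.execute(
    "SELECT title FROM research_item WHERE is_publication = 1 ORDER BY id")]
patent_titles = [r[0] for r in conn.execute(
    "SELECT title FROM research_item WHERE is_patent = 1 ORDER BY id")]
```

With the separate child tables, the "Widget method" row would have to be inserted into both publication and patent, which is the duplication the overlapping case is meant to avoid.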

DW Design (PO and Invoices)

I have to build a DW to store PO and Invoice data:
An Invoice has a header and a list of items
A PO has a header and a list of items.
An invoice can be related to zero or more POs
A PO can be related to one or more invoices.
What is the recommended way to design this as a star schema?
Designing a DW involves understanding multiple aspects before having a model.
What is the frequency of data refresh?
What is the volume of data?
Which columns need to be indexed, and which kind of index will help most?
What queries will be written against the tables? Are they aggregates, or straight SELECT statements?
What is your history preservation strategy?
What data type does every column need? You also need to think about cross-platform query execution.
And so on.
You will need to dive deep into these questions. Just creating tables with foreign keys will work for now, but as the data volume increases over time it will become a bottleneck.
You have a problem in that you are modelling data, not process.
Star schemas are based on a business process, not an entity relationship.
What are you trying to model? What is the grain of the model?
I'll go out on a limb, and say that you're probably modelling sales. Have one fact: Sale. If you need order-specific information, consider whether it is part of an Order dimension, or if it should be carried as degenerate dimensions and/or measures in the Sale fact.
Create an Invoice_Header_Fact and an Invoice_LineItem_Fact. (These can also be denormalized and merged into one table.)
Use Order_Key from the Header Fact in the LineItem Fact to associate the header with its line items.
Create a PO_Header_Fact and a PO_LineItem_Fact.
Use PO_Key from the Header Fact in the LineItem Fact to associate the header with its line items.
Create a bridge/xref table to maintain the many-to-many relationship between POs and Invoices.
Hope this helps!
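For illustration only, the bridge table from the last step might be sketched like this with Python's sqlite3 module. The table name po_invoice_bridge, the invoice_key column, and the total columns are assumptions, and the line-item facts are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE po_header_fact (
    po_key   INTEGER PRIMARY KEY,
    po_total REAL NOT NULL
);
CREATE TABLE invoice_header_fact (
    invoice_key   INTEGER PRIMARY KEY,
    invoice_total REAL NOT NULL
);
-- Bridge: one row per (PO, invoice) pair.
CREATE TABLE po_invoice_bridge (
    po_key      INTEGER NOT NULL REFERENCES po_header_fact (po_key),
    invoice_key INTEGER NOT NULL REFERENCES invoice_header_fact (invoice_key),
    PRIMARY KEY (po_key, invoice_key)
);

INSERT INTO po_header_fact VALUES (1, 100.0), (2, 50.0);
INSERT INTO invoice_header_fact VALUES (10, 60.0), (11, 90.0), (12, 25.0);
-- PO 1 is billed on invoices 10 and 11; invoice 11 also covers PO 2.
-- Invoice 12 relates to no PO, so it has no bridge row.
INSERT INTO po_invoice_bridge VALUES (1, 10), (1, 11), (2, 11);
""")

invoices_per_po = conn.execute("""
    SELECT po_key, COUNT(*)
    FROM po_invoice_bridge
    GROUP BY po_key
    ORDER BY po_key
""").fetchall()

invoices_without_po = [r[0] for r in conn.execute("""
    SELECT i.invoice_key
    FROM invoice_header_fact i
    LEFT JOIN po_invoice_bridge b ON b.invoice_key = i.invoice_key
    WHERE b.po_key IS NULL
""")]
```

One thing to watch with this shape: summing invoice_total through the bridge counts invoice 11 once per PO it is linked to, so measures joined across the bridge usually need an allocation rule or a distinct count.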

Understanding Fact Tables

Our group is looking to better understand our data and implement some best practices. After reading some of the guides on codeproject on designing a data warehouse, we realized we need to start with a basic understanding of dimension and fact tables, as it relates to our own data. We have gone back and forth on what constitutes a fact table. Below is an image of a portion of our database.
These tables sit below our operational system where details and attributes flow directly into the PO_Header, PO_Detail, Appointment_Detail, and Appointment_Header tables. There are a few true dimension tables for dates, locations, and other values. For example, when an appointment is made, it is given an Appointment number for that particular country. Appointment numbers are unique only at the country level. Appointments have attributes at the appointment level and are created against specific Purchase Orders (POs).
Our question is: Are the Appointment and PO tables true "Fact" tables or some sort of hybrid fact/dimension? If the business requires a view across all tables, is joining these tables above as described the right approach? As this is the operational system, we don't have the ability to change the structure but can redesign the structure in our data warehouse if needed.

Do all relational database designs require a junction or associative table for many-to-many relationship?

I'm new to databases and trying to understand why a junction or association table is needed when creating a many-to-many relationship.
Most of what I'm finding on Stack Overflow and elsewhere either describes it in highly technical relational theory terms or just says "that's the way it's done" without explaining why.
Are there any relational database designs out there that support having a many-to-many relationship without the use of an association table? Why is it not possible to have, for example, a column on one table that holds the relationships to another and vice versa?
For example, a Course table that holds a list of courses and a Student table that holds a bunch of student info — each course can have many students and each student can take many classes.
Why is it not possible to have a column on each row in either table (possibly in csv format) that contains the relationships to the others in a list or something similar?
In a relational database, no column holds more than a single value in each row. Therefore, you would never store data in a "CSV format" -- or any other multiple value system -- in a single column in a relational database. Making repeated columns that hold instances of the same item (Course1, Course2, Course3, etc) is also not allowed. This is the very first rule of relational database design and is referred to as First Normal Form.
There are very good reasons for the existence of these rules (it is enormously easier to verify, constrain, and query the data), but whether or not you believe in the benefits, the rules are nonetheless part of the definition of relational databases.
I do not know the answer to your question, but I can answer a similar question: Why do we use a junction table for many-to-many relationships in databases?
First, if the student table keeps track of which courses the student is in and the course table keeps track of which students are in it, then we have duplication. This can lead to problems. What if a student record says it is in a course, but the course record doesn't list that student? Every time you made a course change you would have to make sure to change it in both tables. Inevitably this will not happen every time and the data will become inconsistent.
Second, where would we store this information? A list is not a possible type for a field in a database. So do we put a course column in the student table? No, because that would only allow each student to take one course, a many-to-one relationship from students to courses. Do we put a student column in the courses table? No, because then we have one student in each course.
What does work is having a new table that has one student and one course per row. This tells us that a student is in a class without duplicating any data.
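A small sketch of that third table, using Python's sqlite3 module. The names student, course, and enrollment and their columns are illustrative, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
-- Junction table: one student and one course per row.
-- The composite primary key rejects duplicate enrollments.
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student (student_id),
    course_id  INTEGER NOT NULL REFERENCES course (course_id),
    PRIMARY KEY (student_id, course_id)
);

INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO course VALUES (1, 'Algebra'), (2, 'Biology');
INSERT INTO enrollment VALUES (1, 1), (1, 2), (2, 1);
""")

# The same table answers the question from both directions.
courses_for_ann = [r[0] for r in conn.execute("""
    SELECT c.title
    FROM enrollment e
    JOIN course c ON c.course_id = e.course_id
    WHERE e.student_id = 1
    ORDER BY c.title
""")]
students_in_algebra = [r[0] for r in conn.execute("""
    SELECT s.name
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    WHERE e.course_id = 1
    ORDER BY s.name
""")]
```

Each enrollment is recorded exactly once, so there is no second copy in another table that could disagree with it.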
"Junction tables" come from ER/ORM presentations/methods/products that don't really understand the relational model.
In the relational model (and in original ER information modeling) application relationships are represented by relations/tables. Each table holds tuples of values that are in that relationship to each other, ie that are so related, ie that satisfy that relationship, ie that participate in the relationship.
A relationship is expressed independently of any particular situation as a predicate, a fill-in-the-(named-)blanks statement. Rows that fill in the named blanks to give a true statement from the predicate in a particular situation go in the table. We pick sufficient predicates (hence base tables) to describe every situation. Both many-to-1 and many-to-many application relationships get tables.
The reason why you don't see a lot of many-to-many relationships along with columns about the participants rather than about their participation in the relationship is that such tables are better split into ones about the participants and one for the relationship. Eg columns in a many-to-many table that are about participants 1. can't say anything about entities that don't participate and 2. say the same thing about an entity every time it participates. Information modeling techniques that focus on identifying independent entity types first then relationships between them tend to lead to designs with few such problems. The reason why you don't see many-to-many relationships in two tables is that that is redundant and susceptible to the error of the tables disagreeing. The problem with collection-valued columns (sequences/lists/arrays) is that you cannot generically query about their parts using usual query notation and implementation because the DBMS doesn't see the parts organized into a table.
See this recent answer or this one.

When should I use a one-to-one relationship?

Sorry for the noob question, but is there any real need to use a one-to-one relationship between tables in your database? You can implement all necessary fields inside one table. Even if the data becomes very large, you can enumerate the column names that you need in the SELECT statement instead of using SELECT *. When do you really need this separation?
1 to 0..1
The "1 to 0..1" between super and sub-classes is used as a part of "all classes in separate tables" strategy for implementing inheritance.
A "1 to 0..1" can be represented in a single table with the "0..1" portion covered by NULL-able fields. However, if the relationship is mostly "1 to 0" with only a few "1 to 1" rows, splitting off the "0..1" portion into a separate table might bring some storage (and cache performance) benefits. Some databases are thriftier at storing NULLs than others, so the "cut-off point" where this strategy becomes viable can vary considerably.
1 to 1
The real "1 to 1" vertically partitions the data, which may have implications for caching. Databases typically implement caches at the page level, not at the level of individual fields, so even if you select only a few fields from a row, typically the whole page that row belongs to will be cached. If a row is very wide and the selected fields relatively narrow, you'll end up caching a lot of information you don't actually need. In a situation like that, it may be useful to vertically partition the data, so only the narrower, more frequently used portion of the rows gets cached, so more of them can fit into the cache, making the cache effectively "larger".
Another use of vertical partitioning is to change the locking behavior: databases typically cannot lock at the level of individual fields, only whole rows. By splitting the row, you allow a lock to be taken on only one of its halves.
Triggers are also typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do that could make this impractical. For example, Oracle doesn't let you modify the mutating table - by having separate tables, only one of them may be mutating so you can still modify the other one from your trigger.
Separate tables may allow more granular security.
These considerations are irrelevant in most cases, so in most cases you should consider merging the "1 to 1" tables into a single table.
See also: Why use a 1-to-1 relationship in database design?
My 2 cents.
I work in a place where we all develop in a large application, and everything is a module. For example, we have a users table, and we have a module that adds facebook details for a user, another module that adds twitter details to a user. We could decide to unplug one of those modules and remove all its functionality from our application. In this case, every module adds their own table with 1:1 relationships to the global users table, like this:
create table users ( id int primary key, ...);
create table users_fbdata ( id int primary key, ..., constraint users foreign key ...);
create table users_twdata ( id int primary key, ..., constraint users foreign key ...);
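A runnable version of that idea might look like the following, using Python's sqlite3 module. The elided columns are replaced with hypothetical ones (name, fb_handle, tw_handle) purely so the example executes; the real modules would define their own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Each module owns one table whose primary key is also a foreign key
-- to users, which makes the relationship 1 to 0..1.
CREATE TABLE users_fbdata (
    id        INTEGER PRIMARY KEY REFERENCES users (id),
    fb_handle TEXT NOT NULL
);
CREATE TABLE users_twdata (
    id        INTEGER PRIMARY KEY REFERENCES users (id),
    tw_handle TEXT NOT NULL
);

INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO users_fbdata VALUES (1, 'ann.fb');
INSERT INTO users_twdata VALUES (2, '@bob');
""")

# "Unplugging" the twitter module is a single DROP; users is untouched.
conn.execute("DROP TABLE users_twdata")

remaining_tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
users_with_fb = conn.execute("""
    SELECT u.name, f.fb_handle
    FROM users u
    LEFT JOIN users_fbdata f ON f.id = u.id
    ORDER BY u.id
""").fetchall()
```

Had the facebook and twitter columns lived in users itself, removing a module would mean altering the core table instead of dropping one the module owns.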
If you place two one-to-one tables in one, it's likely you'll have semantic issues. For example, if every device has one remote controller, it doesn't sound quite right to place the device and the remote controller, each with its own bunch of characteristics, in one table. You might even have to spend time figuring out whether a certain attribute belongs to the device or the remote controller.
There might be cases, when half of your columns will stay empty for a long while, or will not ever be filled in. For example, a car could have one trailer with a bunch of characteristics, or might have none. So you'll have lots of unused attributes.
If your table has 20 attributes, and only 4 of them are used occasionally, it makes sense to break the table into 2 tables for performance reasons.
In such cases it isn't good to have everything in one table. Besides, it isn't easy to deal with a table that has 45 columns!
If data in one table is related to, but does not 'belong' to the entity described by the other, then that's a candidate to keep it separate.
This could provide advantages in future, if the separate data needs to be related to some other entity, also.
The most sensible time to use this would be if there were two separate concepts that would only ever relate in this way. For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. I accept that this is contrived example to demonstrate the point.
Another reason is that you want to specialize a concept in different ways. If you have a Person table and want to add the concept of different types of Person, such as Employee, Customer, Shareholder - each one of these would need different sets of data. The data that is similar between them would be on the Person table, the specialist information would be on the specific tables for Customer, Shareholder, Employee.
Some database engines struggle to efficiently add a new column to a very large table (many rows) and I have seen extension-tables used to contain the new column, rather than the new column being added to the original table. This is one of the more suspect uses of additional tables.
You may also decide to divide the data for a single concept between two different tables for performance or readability issues, but this is a reasonably special case if you are starting from scratch - these issues will show themselves later.
First, I think it is a question of modelling and defining what constitutes a separate entity. Suppose you have customers with one and only one address. Of course you could implement everything in a single customer table, but if in the future you allow a customer to have 2 or more addresses, then you will need to refactor that (not a problem, but make it a conscious decision).
I can also think of an interesting case not mentioned in other answers where splitting the table could be useful:
Imagine, again, you have customers with a single address each, but this time it is optional to have an address. Of course you could implement that as a bunch of NULL-able columns such as ZIP,state,street. But suppose that given that you do have an address the state is not optional, but the ZIP is. How to model that in a single table? You could use a constraint on the customer table, but it is much easier to divide in another table and make the foreign_key NULLable. That way your model is much more explicit in saying that the entity address is optional, and that ZIP is an optional attribute of that entity.
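Here is roughly what that split could look like, sketched with Python's sqlite3 module. The table and column names are illustrative, and the UNIQUE on customer.address_id is an added assumption that keeps the relationship one-to-one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT,           -- the text does not say whether this is required
    state      TEXT NOT NULL,  -- required whenever an address exists
    zip        TEXT            -- optional attribute of the address
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    -- NULL means "this customer has no address".
    address_id  INTEGER UNIQUE REFERENCES address (address_id)
);
""")

# An address without a state is rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO address (street, zip) VALUES ('1 Main St', '12345')")
    state_is_required = False
except sqlite3.IntegrityError:
    state_is_required = True

# An address without a ZIP is fine.
cur = conn.execute("INSERT INTO address (street, state) VALUES ('1 Main St', 'NY')")
conn.execute("INSERT INTO customer (name, address_id) VALUES ('Ann', ?)",
             (cur.lastrowid,))
# A customer without any address is fine too.
conn.execute("INSERT INTO customer (name) VALUES ('Bob')")

rows = conn.execute("""
    SELECT c.name, a.state, a.zip
    FROM customer c
    LEFT JOIN address a ON a.address_id = c.address_id
    ORDER BY c.customer_id
""").fetchall()
```

In a single table the same rule would need a CHECK constraint along the lines of "state may be NULL only if street and ZIP are NULL too", which is harder to read than two plain NOT NULL declarations.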
Not very often.
You may find some benefit if you need to implement some security, so that some users can see some of the columns (table1) but not others (table2).
Of course some databases (Oracle) allow you to apply this kind of security within the same table, but others may not.
You are referring to database normalization. One example that I can think of in an application that I maintain is Items. The application allows the user to sell many different types of items (i.e. InventoryItems, NonInventoryItems, ServiceItems, etc...). While I could store all of the fields required by every item in one Items table, it is much easier to maintain to have a base Item table that contains fields common to all items and then separate tables for each item type (i.e. Inventory, NonInventory, etc..) which contain fields specific to only that item type. Then, the item table would have a foreign key to the specific item type that it represents. The relationship between the specific item tables and the base item table would be one-to-one.
Below is an article on normalization.
http://support.microsoft.com/kb/283878
As with all design questions the answer is "it depends."
There are a few considerations:
How large will the table get (both in terms of fields and rows)? It can be inconvenient to house your users' names and passwords with other less commonly used data, from both a maintenance and a programming perspective.
Fields in the combined table which have constraints could become cumbersome to manage over time. For example, if a trigger needs to fire for a specific field, that's going to happen for every update to the table regardless of whether that field was affected.
How certain are you that the relationship will be 1:1? As this question points out, things can get complicated quickly.
Another use case can be the following: you might import data from some source and update it daily, e.g. information about books. Then, you add data yourself about some books. Then it makes sense to put the imported data in another table than your own data.
I normally encounter two general kinds of 1:1 relationship in practice:
IS-A relationships, also known as supertype/subtype relationships. This is when one kind of entity is actually a type of another entity (EntityA IS A EntityB). Examples:
Person entity, with separate entities for Accountant, Engineer, Salesperson, within the same company.
Item entity, with separate entities for Widget, RawMaterial, FinishedGood, etc.
Car entity, with separate entities for Truck, Sedan, etc.
In all these situations, the supertype entity (e.g. Person, Item or Car) would have the attributes common to all subtypes, and the subtype entities would have attributes unique to each subtype. The primary key of the subtype would be the same as that of the supertype.
"Boss" relationships. This is when a person is the unique boss or manager or supervisor of an organizational unit (department, company, etc.). When there is only one boss allowed for an organizational unit, then there is a 1:1 relationship between the person entity that represents the boss and the organizational unit entity.
The main time to use a one-to-one relationship is when inheritance is involved.
Below, a person can be a staff and/or a customer. The staff and customer inherit the person attributes. The advantage being if a person is a staff AND a customer their details are stored only once, in the generic person table. The child tables have details specific to staff and customers.
In my time programming I have encountered this in only one situation: when there is both a 1-to-many and a 1-to-1 relationship between the same 2 entities ("Entity A" and "Entity B").
That is, when "Entity A" has multiple "Entity B" and "Entity B" has only 1 "Entity A"
and
"Entity A" has only 1 current "Entity B" and "Entity B" has only 1 "Entity A".
For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. - I borrowed this example from #Steve Fenton's answer
Where a Driver can drive multiple Cars, just not at the same time. So the Car and Driver entities are 1-to-many or many-to-many. But if we need to know who the current driver is, then we also need the 1-to-1 relation.
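One possible way to hold both relationships side by side is sketched below with Python's sqlite3 module. The tables drive_history and current_assignment are hypothetical names, not something from the answer above; the 1-to-1 rule is carried by the primary key on car_id plus the UNIQUE on driver_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE car (
    car_id INTEGER PRIMARY KEY,
    plate  TEXT NOT NULL
);
CREATE TABLE driver (
    driver_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- History: a driver may have driven many cars over time.
CREATE TABLE drive_history (
    car_id     INTEGER NOT NULL REFERENCES car (car_id),
    driver_id  INTEGER NOT NULL REFERENCES driver (driver_id),
    started_on TEXT NOT NULL,
    PRIMARY KEY (car_id, driver_id, started_on)
);
-- Current state: at most one driver per car and one car per driver.
CREATE TABLE current_assignment (
    car_id    INTEGER PRIMARY KEY REFERENCES car (car_id),
    driver_id INTEGER NOT NULL UNIQUE REFERENCES driver (driver_id)
);

INSERT INTO car VALUES (1, 'AAA-111'), (2, 'BBB-222');
INSERT INTO driver VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO drive_history VALUES (1, 1, '2024-01-01'), (2, 1, '2024-02-01');
INSERT INTO current_assignment VALUES (2, 1);  -- Ann currently drives car 2
""")

# Ann cannot also be the current driver of car 1.
try:
    conn.execute("INSERT INTO current_assignment VALUES (1, 1)")
    double_booked = True
except sqlite3.IntegrityError:
    double_booked = False

cars_driven_by_ann = conn.execute(
    "SELECT COUNT(DISTINCT car_id) FROM drive_history WHERE driver_id = 1"
).fetchone()[0]
```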
Another use case might be if the maximum number of columns in the database table is exceeded. Then you could join another table using OneToOne

Resources