Bridge table in dimensional modeling - data-modeling

I am familiar with creating a bridge table between fact and dimension tables.
Is it a good idea to create a bridge table between a dimension and its multi-valued attributes?
E.g., a customer has multiple phone numbers. Can I just create a customer telephone dimension which has a one-to-many relationship with the customer dimension, or is creating a bridge table advisable?

Answering specifically for the multiple phones example.
I usually try to avoid bridge tables as much as possible. They are a complication of design, and keeping things simple is a better approach (although not always possible, of course).
In case of the multiple phones per customer, I would create 2 attributes:
Primary Phone
Other Phones
The first attribute will contain a main customer phone and is mandatory.
The second attribute might contain one or more other phone numbers, concatenated into a delimited string (e.g., "415-111-1111, 415-222-2222"). Such a design is acceptable because you (most likely) will use these extra phones only as descriptive information in your reports. Also, most likely you will have a varying but reasonably limited number of such phones - let's say, 0-3 or so, which means that this attribute will be either empty or contain a reasonably short string.
The above design is simple and clean and works for most situations, unless you need to perform specific analytics on the phone numbers, or if there are too many of them and they must be all used. In cases like that, I would put them into a fact table ("Customer Phones"), which might contain:
Customer_ID
Phone_Profile_ID
Date
Phone Number
Phone_Profile is a dimension that should contain phone attributes, e.g., "Phone Type" {"Land Line", "Mobile"}, "Phone Use" {"Primary", "Secondary"}, etc.
Such a fact table can also be a periodic snapshot (annual, monthly, etc.) of all customer phones and serve as a phone catalog. However, such elaborate designs are rarely needed (unless you design for a Call Center or a similar phone-heavy application).
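A minimal sketch of the "Customer Phones" fact table described above, using Python's built-in sqlite3 module. The snake_case table and column names, the snapshot_date column, and the sample rows are all assumptions made for illustration, not part of the original answer.

```python
import sqlite3

# In-memory database; all names and sample rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phone_profile (
    phone_profile_id INTEGER PRIMARY KEY,
    phone_type TEXT NOT NULL,   -- e.g. 'Land Line', 'Mobile'
    phone_use  TEXT NOT NULL    -- e.g. 'Primary', 'Secondary'
);
CREATE TABLE customer_phones (
    customer_id      INTEGER NOT NULL,
    phone_profile_id INTEGER NOT NULL REFERENCES phone_profile(phone_profile_id),
    snapshot_date    TEXT NOT NULL,
    phone_number     TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO phone_profile VALUES (?, ?, ?)",
    [(1, "Mobile", "Primary"), (2, "Land Line", "Secondary")],
)
conn.executemany(
    "INSERT INTO customer_phones VALUES (?, ?, ?, ?)",
    [
        (100, 1, "2020-01-01", "415-111-1111"),
        (100, 2, "2020-01-01", "415-222-2222"),
        (101, 1, "2020-01-01", "415-333-3333"),
    ],
)

# The kind of phone-specific analytics that would justify this design:
# count phone numbers per phone type.
phones_by_type = conn.execute("""
    SELECT p.phone_type, COUNT(*)
    FROM customer_phones AS f
    JOIN phone_profile AS p ON p.phone_profile_id = f.phone_profile_id
    GROUP BY p.phone_type
    ORDER BY p.phone_type
""").fetchall()
print(phones_by_type)
```

If the phones are only ever displayed next to the customer, the two-attribute design is still the simpler choice.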

Related

Why would the specifications for this database use an aggregation instead of attributes on an entity?

I'm trying to better understand designing a database schema. After reviewing the solution for a problem that I'm working on, I don't understand why the solution chooses to use an aggregation for the attributes "address" and "phone number" for a given "musician". Here are the specifications; I'm only interested in bullet point 1:
Each musician that records at Notown has an SSN, a name, an address, and a phone number. Poorly paid musicians often share the same address, and no address has more than one phone.
Each instrument used in songs recorded at Notown has a name (e.g., guitar, synthesizer, flute) and a musical key (e.g., C, B-flat, E-flat).
Each album recorded on the Notown label has a title, a copyright date, a format (e.g., CD or MC), and an album identifier.
Each song recorded at Notown has a title and an author.
Each musician may play several instruments, and a given instrument may be played by several musicians.
Each album has a number of songs on it, but no song may appear on more than one album.
Each song is performed by one or more musicians, and a musician may perform a number of songs.
Each album has exactly one musician who acts as its producer. A musician may produce several albums, of course.
Here is a solution that I found:
The ER Diagram I created looks almost exactly the same, except for the fact that I made "address" and "phone number" attributes of "musician" instead of giving each of them an entity set of their own, creating a relationship, and turning it into an aggregation. I don't understand why this would be done in this situation. Can anyone explain?? Thank you!
I'm not able to see the image you linked to, but anyway...
no address has more than one phone
This means we should make the phone number an attribute of the address - unless we want to allow for multiple phones per address in the future.
So it would not be completely wrong to make phones a table. But then, we know little about the future. Would there be multiple musicians sharing the same address and the same phones? (I.e. the phone number would be linked to an address.) Or would there be multiple musicians sharing the same address, but each would have their own phone? (I.e. the phone number would be linked to a musician. To use a phone table and link the phones to musicians, however, would only be necessary if a musician could have multiple phone numbers. Otherwise we'd still not make a phone table, but rather make the phone a musician's attribute.)
poorly paid musicians often share the same address
This means we make the address a table of its own. Thus there is only one row to change in case the phone number or some other attribute changes. If we made the address a musician's attribute instead, we'd store the address redundantly and could get inconsistent data (e.g. same address, but different phone numbers).
A possible data model:
address (address_id, street, city, phone, ...)
musician (musician_id, ssn, name, address_id, ...)
This is a 1:n relation. A musician has one address; an address can belong to multiple musicians.
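As a rough sketch, the model above could be tried out with Python's built-in sqlite3 module. The column types, the NOT NULL and UNIQUE constraints, and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city       TEXT NOT NULL,
    phone      TEXT              -- at most one phone per address
);
CREATE TABLE musician (
    musician_id INTEGER PRIMARY KEY,
    ssn         TEXT NOT NULL UNIQUE,
    name        TEXT NOT NULL,
    address_id  INTEGER NOT NULL REFERENCES address(address_id)
);
""")

# Two musicians sharing one address (made-up sample data).
conn.execute("INSERT INTO address VALUES (1, '12 Main St', 'Notown', '555-0100')")
conn.executemany(
    "INSERT INTO musician VALUES (?, ?, ?, ?)",
    [(1, "111-11-1111", "Ann", 1), (2, "222-22-2222", "Bob", 1)],
)

# The phone number changes: exactly one row has to be updated.
conn.execute("UPDATE address SET phone = '555-0199' WHERE address_id = 1")

phones = conn.execute("""
    SELECT m.name, a.phone
    FROM musician AS m
    JOIN address AS a ON a.address_id = m.address_id
    ORDER BY m.name
""").fetchall()
print(phones)
```

Both musicians see the new number after the single update, so the "same address, different phone numbers" inconsistency cannot arise.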
The primary purpose of database normalization is to make it more difficult for anomalous data to get into the database. Reading the first bullet point, we see that each address may have zero or one phone numbers associated with it. In other words, the phone number is an attribute of/identified by the address. Which normalization level does this violate?
To illustrate how not normalizing the address fields (including phone number) increases the chances of anomalous data, let's say you have four students staying at that address. This means you have four rows where the address data exists. Suppose the phone number changes. You have to make sure you change all four versions of the data. I said there were four students, but suppose there are actually five and I just missed one? Or suppose you found only three when you went to make the change? An address may have at most one phone number, yet now you have several copies of the same address with different phone numbers. This is anomalous data.
If this data is normalized, you would have only one copy to change. Since this data is referenced by all the students who live there, no matter how many, this change is "propagated" to all of them. The integrity of the data is maintained.

When I should use one to one relationship?

Sorry for the noob question, but is there any real need to use a one-to-one relationship between tables in your database? You can put all the necessary fields in one table. Even if the data becomes very large, you can list the column names you need in the SELECT statement instead of using SELECT *. When do you really need this separation?
1 to 0..1
The "1 to 0..1" between super and sub-classes is used as a part of "all classes in separate tables" strategy for implementing inheritance.
A "1 to 0..1" can be represented in a single table with the "0..1" portion covered by NULL-able fields. However, if the relationship is mostly "1 to 0" with only a few "1 to 1" rows, splitting off the "0..1" portion into a separate table might yield some storage (and cache performance) benefits. Some databases are thriftier at storing NULLs than others, so the "cut-off point" where this strategy becomes viable can vary considerably.
1 to 1
The real "1 to 1" vertically partitions the data, which may have implications for caching. Databases typically implement caches at the page level, not at the level of individual fields, so even if you select only a few fields from a row, typically the whole page that row belongs to will be cached. If a row is very wide and the selected fields relatively narrow, you'll end up caching a lot of information you don't actually need. In a situation like that, it may be useful to vertically partition the data, so only the narrower, more frequently used portion of the rows gets cached, so more of them can fit into the cache, making the cache effectively "larger".
Another use of vertical partitioning is to change the locking behavior: databases typically cannot lock at the level of individual fields, only whole rows. By splitting the row, you are allowing a lock to take place on only one of its halves.
Triggers are also typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do that could make this impractical. For example, Oracle doesn't let you modify the mutating table - by having separate tables, only one of them may be mutating so you can still modify the other one from your trigger.
Separate tables may allow more granular security.
These considerations are irrelevant in most cases, so in most cases you should consider merging the "1 to 1" tables into a single table.
See also: Why use a 1-to-1 relationship in database design?
My 2 cents.
I work in a place where we all develop in a large application, and everything is a module. For example, we have a users table, and we have a module that adds facebook details for a user, another module that adds twitter details to a user. We could decide to unplug one of those modules and remove all its functionality from our application. In this case, every module adds their own table with 1:1 relationships to the global users table, like this:
create table users ( id int primary key, ...);
create table users_fbdata ( id int primary key, ..., constraint users foreign key ...);
create table users_twdata ( id int primary key, ..., constraint users foreign key ...);
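One way this could look in practice, sketched with Python's built-in sqlite3 module. The module table reuses the user's id as its own primary key, which is what keeps the relationship 1:1. The name and fb_handle columns and the ON DELETE CASCADE clause are assumptions standing in for the elided parts of the snippet above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- The module's table: its primary key is also a foreign key to users,
-- so each user can have at most one row here.
CREATE TABLE users_fbdata (
    id        INTEGER PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
    fb_handle TEXT NOT NULL
);
""")

conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO users VALUES (2, 'bob')")
conn.execute("INSERT INTO users_fbdata VALUES (1, 'alice.fb')")

# A second module row for the same user violates the primary key.
try:
    conn.execute("INSERT INTO users_fbdata VALUES (1, 'second.profile')")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True
conn.commit()

# "Unplugging" the module: drop its table, the core users table is untouched.
conn.execute("DROP TABLE users_fbdata")
remaining_users = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(second_row_rejected, remaining_users)
```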
If you merge two one-to-one tables into one, it's likely you'll have semantic issues. For example, if every device has one remote controller, it doesn't sound quite right to place the device and the remote controller, with their bunch of characteristics, in one table. You might even have to spend time figuring out whether a certain attribute belongs to the device or the remote controller.
There might be cases when half of your columns will stay empty for a long while, or will never be filled in. For example, a car could have one trailer with a bunch of characteristics, or might have none. So you'll have lots of unused attributes.
If your table has 20 attributes, and only 4 of them are used occasionally, it makes sense to break the table into 2 tables for performance reasons.
In such cases it isn't good to have everything in one table. Besides, it isn't easy to deal with a table that has 45 columns!
If data in one table is related to, but does not 'belong' to the entity described by the other, then that's a candidate to keep it separate.
This could provide advantages in future, if the separate data needs to be related to some other entity, also.
The most sensible time to use this would be if there were two separate concepts that would only ever relate in this way. For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. I accept that this is contrived example to demonstrate the point.
Another reason is that you want to specialize a concept in different ways. If you have a Person table and want to add the concept of different types of Person, such as Employee, Customer, Shareholder - each one of these would need different sets of data. The data that is similar between them would be on the Person table, the specialist information would be on the specific tables for Customer, Shareholder, Employee.
Some database engines struggle to efficiently add a new column to a very large table (many rows) and I have seen extension-tables used to contain the new column, rather than the new column being added to the original table. This is one of the more suspect uses of additional tables.
You may also decide to divide the data for a single concept between two different tables for performance or readability issues, but this is a reasonably special case if you are starting from scratch - these issues will show themselves later.
First, I think it is a question of modelling and of defining what constitutes a separate entity. Suppose you have customers with one and only one address. Of course you could implement everything in a single customer table, but if in the future you allow a customer to have 2 or more addresses, then you will need to refactor that (not a problem, but make it a conscious decision).
I can also think of an interesting case not mentioned in other answers where splitting the table could be useful:
Imagine, again, you have customers with a single address each, but this time it is optional to have an address. Of course you could implement that as a bunch of NULL-able columns such as ZIP,state,street. But suppose that given that you do have an address the state is not optional, but the ZIP is. How to model that in a single table? You could use a constraint on the customer table, but it is much easier to divide in another table and make the foreign_key NULLable. That way your model is much more explicit in saying that the entity address is optional, and that ZIP is an optional attribute of that entity.
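A minimal sketch of that split, using Python's built-in sqlite3 module. The table and column names, the NOT NULL on street, and the UNIQUE constraint that keeps an address from being shared are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    state      TEXT NOT NULL,   -- required whenever an address exists
    zip        TEXT             -- optional attribute of the address
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    -- NULL means "this customer has no address"
    address_id  INTEGER UNIQUE REFERENCES address(address_id)
);
""")

# A customer without any address is fine.
conn.execute("INSERT INTO customer (customer_id, name) VALUES (1, 'Nora')")

# An address without a ZIP is fine too.
conn.execute("INSERT INTO address (address_id, street, state) VALUES (10, '1 Elm St', 'CA')")
conn.execute("INSERT INTO customer VALUES (2, 'Zed', 10)")

# An address without a state is rejected by the NOT NULL constraint.
try:
    conn.execute("INSERT INTO address (address_id, street, zip) VALUES (11, '2 Oak St', '94000')")
    state_required = False
except sqlite3.IntegrityError:
    state_required = True

customers = conn.execute("""
    SELECT c.name, a.state, a.zip
    FROM customer AS c
    LEFT JOIN address AS a ON a.address_id = c.address_id
    ORDER BY c.customer_id
""").fetchall()
print(state_required, customers)
```

In a single table the same rule would need a CHECK constraint spanning several nullable columns; here plain NOT NULL declarations carry it.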
Not very often.
You may find some benefit if you need to implement some security - so some users can see some of the columns (table1) but not others (table2).
Of course some databases (Oracle) allow you to do this kind of security in the same table, but some others may not.
You are referring to database normalization. One example that I can think of in an application that I maintain is Items. The application allows the user to sell many different types of items (e.g. InventoryItems, NonInventoryItems, ServiceItems, etc.). While I could store all of the fields required by every item in one Items table, it is much easier to maintain a base Item table that contains the fields common to all items and then separate tables for each item type (e.g. Inventory, NonInventory, etc.) which contain the fields specific to only that item type. Then, the item table would have a foreign key to the specific item type that it represents. The relationship between the specific item tables and the base item table would be one-to-one.
Below is an article on normalization.
http://support.microsoft.com/kb/283878
As with all design questions the answer is "it depends."
There are few considerations:
how large will the table get (both in terms of fields and rows)? It can be inconvenient to house your users' name, password with other less commonly used data both from a maintenance and programming perspective
fields in the combined table which have constraints could become cumbersome to manage over time. for example, if a trigger needs to fire for a specific field, that's going to happen for every update to the table regardless of whether that field was affected.
how certain are you that the relationship will be 1:1? As this question points out, things can get complicated quickly.
Another use case can be the following: you might import data from some source and update it daily, e.g. information about books. Then, you add data yourself about some books. Then it makes sense to put the imported data in another table than your own data.
I normally encounter two general kinds of 1:1 relationship in practice:
IS-A relationships, also known as supertype/subtype relationships. This is when one kind of entity is actually a type of another entity (EntityA IS A EntityB). Examples:
Person entity, with separate entities for Accountant, Engineer, Salesperson, within the same company.
Item entity, with separate entities for Widget, RawMaterial, FinishedGood, etc.
Car entity, with separate entities for Truck, Sedan, etc.
In all these situations, the supertype entity (e.g. Person, Item or Car) would have the attributes common to all subtypes, and the subtype entities would have attributes unique to each subtype. The primary key of the subtype would be the same as that of the supertype.
"Boss" relationships. This is when a person is the unique boss or manager or supervisor of an organizational unit (department, company, etc.). When there is only one boss allowed for an organizational unit, then there is a 1:1 relationship between the person entity that represents the boss and the organizational unit entity.
The main time to use a one-to-one relationship is when inheritance is involved.
Below, a person can be a staff and/or a customer. The staff and customer inherit the person attributes. The advantage being if a person is a staff AND a customer their details are stored only once, in the generic person table. The child tables have details specific to staff and customers.
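A hedged sketch of this inheritance layout, using Python's built-in sqlite3 module. The table names follow the text; the hire_date and loyalty_points columns and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Supertype: attributes shared by every person.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Subtypes: the primary key is also a foreign key to person,
-- so each person has at most one staff row and at most one customer row.
CREATE TABLE staff (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    hire_date TEXT NOT NULL
);
CREATE TABLE customer (
    person_id      INTEGER PRIMARY KEY REFERENCES person(person_id),
    loyalty_points INTEGER NOT NULL DEFAULT 0
);
""")

conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Dana"), (2, "Eli")])
conn.execute("INSERT INTO staff VALUES (1, '2019-05-01')")   # Dana is staff...
conn.execute("INSERT INTO customer VALUES (1, 40)")          # ...and also a customer
conn.execute("INSERT INTO customer VALUES (2, 0)")           # Eli is only a customer

# The generic details are stored once; roles are found by joining.
roles = conn.execute("""
    SELECT p.name,
           s.person_id IS NOT NULL AS is_staff,
           c.person_id IS NOT NULL AS is_customer
    FROM person AS p
    LEFT JOIN staff AS s ON s.person_id = p.person_id
    LEFT JOIN customer AS c ON c.person_id = p.person_id
    ORDER BY p.person_id
""").fetchall()
print(roles)
```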
In my time programming I have encountered this in only one situation: when there is both a 1-to-many and a 1-to-1 relationship between the same 2 entities ("Entity A" and "Entity B").
When "Entity A" has multiple "Entity B" and "Entity B" has only 1 "Entity A"
and
"Entity A" has only 1 current "Entity B" and "Entity B" has only 1 "Entity A".
For example, a Car can only have one current Driver, and the Driver can only drive one car at a time - so the relationship between the concepts of Car and Driver would be 1 to 1. - I borrowed this example from #Steve Fenton's answer
Where a Driver can drive multiple Cars, just not at the same time. So the Car and Driver entities are 1-to-many or many-to-many. But if we need to know who the current driver is, then we also need the 1-to-1 relation.
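One possible way to hold both relationships at once, sketched with Python's built-in sqlite3 module: a history table records every car a driver has driven, while a UNIQUE current_driver_id column on the car enforces the 1-to-1 "current driver" rule. All table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE driver (
    driver_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE car (
    car_id            INTEGER PRIMARY KEY,
    plate             TEXT NOT NULL,
    -- UNIQUE: a driver can be the current driver of at most one car.
    current_driver_id INTEGER UNIQUE REFERENCES driver(driver_id)
);
-- Over time a driver may drive many cars (and a car may have many drivers).
CREATE TABLE car_driver_history (
    car_id    INTEGER NOT NULL REFERENCES car(car_id),
    driver_id INTEGER NOT NULL REFERENCES driver(driver_id),
    PRIMARY KEY (car_id, driver_id)
);
""")

conn.execute("INSERT INTO driver VALUES (1, 'Sam')")
conn.executemany(
    "INSERT INTO car VALUES (?, ?, ?)",
    [(1, "AAA-111", 1), (2, "BBB-222", None)],
)
conn.executemany("INSERT INTO car_driver_history VALUES (?, ?)", [(1, 1), (2, 1)])

# Sam is already the current driver of car 1, so car 2 cannot claim him too.
try:
    conn.execute("UPDATE car SET current_driver_id = 1 WHERE car_id = 2")
    double_assignment_rejected = False
except sqlite3.IntegrityError:
    double_assignment_rejected = True

cars_driven_by_sam = conn.execute(
    "SELECT COUNT(*) FROM car_driver_history WHERE driver_id = 1"
).fetchone()[0]
print(double_assignment_rejected, cars_driven_by_sam)
```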
Another use case might be when the maximum number of columns in a database table is exceeded. Then you could put the extra columns in another table and join it one-to-one.

Why use a 1-to-1 relationship in database design?

I am having a hard time trying to figure out when to use a 1-to-1 relationship in db design or if it is ever necessary.
If you can select only the columns you need in a query, is there ever a point in breaking up a table into 1-to-1 relationships? I guess updating a large table has more impact on performance than updating a smaller table, and I'm sure it depends on how heavily the table is used for certain operations (reads/writes).
So when designing a database schema how do you factor in 1-to-1 relationships? What criteria do you use to determine if you need one, and what are the benefits over not using one?
From the logical standpoint, a 1:1 relationship should always be merged into a single table.
On the other hand, there may be physical considerations for such "vertical partitioning" or "row splitting", especially if you know you'll access some columns more frequently or in different pattern than the others, for example:
You might want to cluster or partition the two "endpoint" tables of a 1:1 relationship differently.
If your DBMS allows it, you might want to put them on different physical disks (e.g. more performance-critical on an SSD and the other on a cheap HDD).
You have measured the effect on caching and you want to make sure the "hot" columns are kept in cache, without "cold" columns "polluting" it.
You need a concurrency behavior (such as locking) that is "narrower" than the whole row. This is highly DBMS-specific.
You need different security on different columns, but your DBMS does not support column-level permissions.
Triggers are typically table-specific. While you can theoretically have just one table and have the trigger ignore the "wrong half" of the row, some databases may impose additional limits on what a trigger can and cannot do. For example, Oracle doesn't let you modify the so called "mutating" table from a row-level trigger - by having separate tables, only one of them may be mutating so you can still modify the other from your trigger (but there are other ways to work-around that).
Databases are very good at manipulating the data, so I wouldn't split the table just for the update performance, unless you have performed the actual benchmarks on representative amounts of data and concluded the performance difference is actually there and significant enough (e.g. to offset the increased need for JOINing).
On the other hand, if you are talking about "1:0 or 1" (and not a true 1:1), this is a different question entirely, deserving a different answer...
See also: When I should use one to one relationship?
Separation of duties and abstraction of database tables.
If I have a user and I design the system for each user to have an address, but then I change the system, all I have to do is add a new record to the Address table instead of adding a brand new table and migrating the data.
EDIT
Right now, if you wanted to have a person record and each person had exactly one address record, you could either have a 1-to-1 relationship between a Person table and an Address table, or just have a Person table that also has the columns for the address.
In the future maybe you made the decision to allow a person to have multiple addresses. You would not have to change your database structure in the 1-to-1 relationship scenario, you only have to change how you handle the data coming back to you. However, in the single table structure you would have to create a new table and migrate the address data to the new table in order to create a best practice 1-to-many relationship database structure.
Well, on paper, normalized form looks to be the best. In the real world it is usually a trade-off. Most large systems that I know make trade-offs and do not try to be fully normalized.
I'll try to give an example. Suppose you are in a banking application with 10 million passbook accounts, and the usual transaction is just a query for the latest balance of a certain account. You have table A that stores just that information (account number, account balance, and account holder name).
Your account also has another 40 attributes, such as the customer address, tax number, and IDs for mapping to other systems, which are in table B.
A and B have one to one mapping.
In order to be able to retrieve the account balance fast, you may want to employ different index strategy (such as hash index) for the small table that has the account balance and account holder name.
The table that contains the other 40 attributes may reside in different table space or storage, employ different type of indexing, for example because you want to sort them by name, account number, branch id, etc. Your system can tolerate slow retrieval of these 40 attributes, while you need fast retrieval of your account balance query by account number.
Having all the 43 attributes in one table seems to be natural, and probably 'naturally slow' and unacceptable for just retrieving single account balance.
It makes sense to use 1-1 relationships to model an entity in the real world. That way, when more entities are added to your "world", they only also have to relate to the data that they pertain to (and no more).
That's the key really, your data (each table) should contain only enough data to describe the real-world thing it represents and no more. There should be no redundant fields as all make sense in terms of that "thing". It means that less data is repeated across the system (with the update issues that would bring!) and that you can retrieve individual data independently (not have to split/ parse strings for example).
To work out how to do this, you should research "Database Normalisation" (or Normalization), "Normal Form" and "first, second and third normal form". This describes how to break down your data. A version with an example is always helpful. Perhaps try this tutorial.
Often people are talking about a 1:0..1 relationship and call it a 1:1. In reality, a typical RDBMS cannot support a literal 1:1 relationship in any case.
As such, I think it's only fair to address sub-classing here, even though it technically necessitates a 1:0..1 relationship, and not the literal concept of a 1:1.
A 1:0..1 is quite useful when you have fields that would be exactly the same among several entities/tables. For example, contact information fields such as address, phone number, email, etc. that might be common for both employees and clients could be broken out into an entity made purely for contact information.
A contact table would hold common information, like address and phone number(s).
So an employee table holds employee specific information such as employee number, hire date and so on. It would also have a foreign key reference to the contact table for the employee's contact info.
A client table would hold client information, such as an email address, their employer name, and perhaps some demographic data such as gender and/or marital status. The client would also have a foreign key reference to the contact table for their contact info.
In doing this, every employee would have a contact, but not every contact would have an employee. The same concept would apply to clients.
Just a few samples from past projects:
a TestRequests table can have only one matching Report. But depending on the nature of the Request, the fields in the Report may be totally different.
in a banking project, an Entities table hold various kind of entities: Funds, RealEstateProperties, Companies. Most of those Entities have similar properties, but Funds require about 120 extra fields, while they represent only 5% of the records.

Database Design for Expanding Lists

Admittedly, I am simply looking for some direction here. I have a specific situation, and being a novice in database design I am lost on how to begin tackling this problem. Let me start by explaining my situation.
I have a mysql table called contacts. As the name implies, it stores a list of contacts and the attributes that go along with each such as first name, last name, email, phone number etc. I would like users of my application to be able to add an unlimited amount of certain attributes for each contact. So, for instance rather than a contact having one phone number, the user could add another number, and another if they choose etc so essentially, a contact in my database can have as many phone numbers as the user needs. This will also be true for other fields in the table, but for the sake of simplicity let's just stick with phone number as an example.
So what is the best way to approach this? Should I have a separate table called contactsPhone and have a matching id column so that any number of rows in the phone table can be associated with one row in the contacts table? Or is there a way to store an ArrayList of some sort in the contacts table so I can have multiple phone numbers in just one field?
You should be looking at modelling something like this in a document database - a relational database is a poor choice for a flexible schema. You may be able to keep just this specific portion of your data in a document database.
If you must, the common solution is the entity-attribute-value pattern - note that this requires multiple joins, makes ad hoc queries difficult and is generally slow.
Update:
I misread the question a bit - if you do know which attributes you want to hold multiple values and this list will not change (or not change much), entity-attribute-value may not be the best way forward.
A one-to-many table per each of these attributes will work (and is a standard relational solution for this kind of problem) - each such table will require a foreign key to your contacts table and a column to hold a single attribute value. This allows you to have multiple attribute values against a single contact.
I would like users of my application to be able to add an unlimited
amount of certain attributes for each contact. So, for instance rather
than a contact having one phone number, the user could add another
number, and another if they choose etc so essentially, a contact in my
database can have as many phone numbers as the user needs.
You're not describing an unlimited number of attributes for each contact. (That's a Good Thing.) You're describing an unlimited number of rows for a single attribute, in this case a contact's phone number.
So, yes, a table of contact phone numbers would work well. You might want to give some thought to how the user might want to identify phone numbers. For example, do they need to distinguish home phone numbers from work numbers and so on.
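A small sketch of such a phone table, written with Python's built-in sqlite3 module rather than MySQL so that it runs without a server. The contact_phones table name, the phone_type column, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE contacts (
    contact_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
-- One row per phone number; any number of rows may point at one contact.
CREATE TABLE contact_phones (
    phone_id     INTEGER PRIMARY KEY,
    contact_id   INTEGER NOT NULL REFERENCES contacts(contact_id) ON DELETE CASCADE,
    phone_type   TEXT NOT NULL,   -- e.g. 'home', 'work', 'mobile'
    phone_number TEXT NOT NULL
);
""")

conn.execute("INSERT INTO contacts VALUES (1, 'Pat', 'Lee')")
conn.executemany(
    "INSERT INTO contact_phones (contact_id, phone_type, phone_number) VALUES (?, ?, ?)",
    [(1, "home", "555-0101"), (1, "work", "555-0102"), (1, "mobile", "555-0103")],
)

# Fetch every phone number stored for one contact.
phones = conn.execute(
    "SELECT phone_type, phone_number FROM contact_phones "
    "WHERE contact_id = ? ORDER BY phone_id",
    (1,),
).fetchall()
print(phones)
```

The same pattern would repeat for each attribute that may hold several values, with one child table per attribute.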

Where do phone numbers belong in a database model?

Given a schema for a DVD rental store, should customers' phone numbers belong to the addresses table, or the users table, and why? Are there any benefits associated with one approach or the other?
Why do you even have an addresses table (unless you want more than one address for a given customer)?
Your primary "client" is a customer. You don't rent DVDs to an address, you rent them to a person. You can't take a block of land to court when the occupant runs off with your prized "Free Willy" collector's edition trilogy.
In a world where a person only lived at one place, the address would be part of the customer table (and so would the phone in a one-phone-per-customer scenario).
If you want multiple addresses, that's fine, have a separate addresses table tying those addresses back to the customer.
But you should probably also have a similar setup for phones. Either allow up to N phone numbers per customer (with N columns) in the customers table, or (much better) have a separate phones table allowing any number of phone numbers per customer.
Something like:
customers:
    cust_id
    cust_stuff
addresses:
    cust_id references customers(cust_id)
    addr_seq_num
    addr_stuff
phones:
    cust_id references customers(cust_id)
    phon_seq_num
    phon_stuff
There's no one correct answer to this, except to say "it depends".
It really depends on what you're modelling with your database schema. Does a phone number logically belong to a user, or an address that could potentially be shared by multiple users?
Example - a mobile phone number might be tied to a particular person, and so be part of the users table. A land-line number might be tied to a particular location or residence, and so be part of the address.
Basic cases of information modeling :
Case A. Each customer can have more than one phone number.
In this case, phone number belongs in a separate table.
Case A1. It is not the case that a customer is required to have a phone number. i.e. the "relationship" is 1-1 to 0-n (i.e. assuming all phone number must always "be for" some customer). Nothing to do.
Case A2. It is the case that each customer is indeed required to have a phone number. You can model this as a relationship that is 1-1 to 1-n, but the "1" of the 1-n part is very hard to enforce in SQL systems (and in the cheapest of them, probably just impossible). That does not mean that you shouldn't be documenting the business rule properly as it is.
Case B. Each customer has AT MOST one phone number.
Case B1. Each customer is required to have a phone number. This means that each customer always has exactly one phone number. Phone number is best put in the customer table. (Note that "to have a phone number" means "to have a phone number THAT IS KNOWN TO THE STORE in question"!)
Case B2. It is not required for a customer to have a phone number. In formal relational theory, it is required that you define a separate table which will hold only the known phone numbers. In informal modeling techniques such as ER and UML, you can model this as an "optional attribute". In SQL systems, many would define a nullable attribute for this.
As for "phone numbers 'belonging' to addresses" : is there any kind of "connection" between phone numbers and addresses that is relevant to your business ? I mean, let's say some customer has two addresses and two phone numbers. Is it important to know which of those two phone numbers belongs to which one of those two addresses ? What address would a cellphone number 'belong to' ?
Just an assumption about your site/app, but usually I'd say "Addresses", because user information tends to be info that you pull out frequently to run the site (ID, username, visits etc) whereas phone number may not be?
What do you mean by "traditional?" Since a user can have an arbitrarily large number of contact phone numbers (home, work, personal mobile, work mobile, fax, etc.), it seems like there should be a separate phone numbers table, each row of which includes a number and a value that says what type of number it is.
Contact information is notoriously difficult to model in a relational schema. In order to keep your sanity, I would advise that you make a minimum number of assumptions with respect to phone numbers. Allowing multiple phone numbers for one customer/account is good; beyond that it's hard to apply rules to phone numbers.
There is one well known exception: many pizza delivery shops use phone numbers as primary keys for customers. This works because in general there is one phone associated with the place to which one delivers pizza. On the other hand, many people no longer have land lines, so perhaps even that system is breaking down. In any case, I don't think this applies to DVD rental.
More than one customer may have the same phone number: perhaps multiple people from the same building buy from you.
What happens when you update the phone number for one customer; should it update that number for the other people that supposedly share that number?
Yes if the address for both parties is still the same (but that might actually break some privacy laws).
No if there are now two different addresses.
