What exactly does database normalization do? - database

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, retrieving the data using all possible unions, subqueries, joins etc.? Why can't we have all the data in a single table and retrieve it as required?

The main reason is to eliminate repetition of data, so for example if you had a user with multiple addresses and you stored this information in a single table the user information would be duplicated along with each address entry. Normalisation would seperate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
Full normalisation will generally not improve performance, in fact it can often make it worse but it will keep your data duplicate free. In fact in some special cases I've denormalised some specific data in order to get a performance increase.

Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.
Normalization in database essentially means the same thing: If you change a piece of data, this is supposed to change just one single piece of information in a database. Imagine a database of E-Mails: If you store the ID and the name of the recipient in the Mails table, but the Users table also associates the name to the ID, that means if you change a user name, you don't only have to change it in the users table, but also in every single message that this user is involved with. So, the axis "message" and the axis "user" are not "perpendicular" or "normal."
If on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message, all user information is gathered from the Users table (by means of a join).

Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.
First normal form can be summarised as:
no repeating groups in single tables.
separate tables for related information.
all items in a table related to the primary key.
Second normal form adds another restriction, basically that every column not part of a candidate key must be dependent on every candidate key (a candidate key being defined as a minimal set of columns which cannot be duplicated in the table).
And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on the key, the whole key and nothing but the key, so help me Codd1.
Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").
The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.
Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.
In fact, although all database should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.
And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.
1 If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them, they're the fathers of relational database theory.

We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion, updation. Normalization doesnt necessarily increase performance.
There is much material on internet so i wont repeat the stuff here again. But you can have a look at
Normalization rules
Anomalies
(others aswell)

As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.
Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...
And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?
Normalization keeps you away from a lot of these problems...

Related

Is it Important to Understand Each Normal Form

I have been studying database design and programming for quite some time now, but I still can't get a grasp of understanding each individual normal form (1NF, 2NF, 3NF.)
Seeing as anytime the data is in Third Normal Form, it is already automatically in Second and First Normal Form, can the whole process actually be accomplished less tediously by fully normalizing the data from the start. I can accomplish this easily by arranging the data so that the columns in each table, other than the primary key, are dependent only on the whole primary key.
How important is it to understand each individual normal form if we can simply fully normalize the data less tediously by doing what I have described?
EDIT: What I'm ultimately asking is: Is it important to go through the steps of each normal form when normalizing data, or is it appropriate to just go to Third Normal Form seeing as the result is ultimately the same?
I highly recommend understanding each normal form as this will help you determine or investigate any issues with a current database may have as sometimes you might not have the perfect scenario each time and understanding each normal form will help you to understand the current problems with an existing database design if there are any.
Going through step by step through the different normal forms will help you to figure out why we do this and this is to achieve the goals specified by E. F. Codd.
The objectives of normalization were stated as follows:
1. To free the collection of relations from undesirable insertion, update and deletion dependencies.
2. To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs.
3. To make the relational model more informative to users.
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Here is a image to help you understand the different normal forms better.
P.S. BCNF is actually 3.5NF not 4NF
It's right that, when being in the 3. NF, you're also in the 2. and in the 1. NF. However, the only condition for the 3. NF is not only that all the data is only dependent on the whole candidate key. It also has the condition that it already is in the 2. NF, meaning that every property that is not the candidate key has to fully depend on the candidate key and that it is in the 1. NF, meaning that every column has to be atomar. So yes, it is important to understand every NF if you want to have a table in the 3. NF.
I'll try to explain the Normal Forms to you:
1. NF
The 1. NF states that every column has to be atomar. This means, there shouldn't be multiple items of data in one column. For example, the adress of someone shouldn't be stored in one column, but should be splitted in the country, the state, the street and so on. Each of these pieces of data should then be stored in their own column.
2. NF
The 2. NF states that every attribute, that is no part of the candidate key, has to be identificable only by the whole candidate key. That means for example that you shouldn't store books and printing labels in one table. Because then the name of the book would only be dependent on the id of the book, while the printing label's name would only be dependent of the id of the printing label and not of the whole candidate key.
3. NF
The 3. NF nearly states the same as the 2.: No column is allowed to be dependent on a non candidate key column. That means for example that you shouldn't store the IBAN of a book and an id of the book in the same table, with only the id being the candidate key, as you'd only need the IBAN to find the name to the book.
If this doesn't explain the matter well enough, there's a lot of information online regarding the normal forms (like Wikipedia).
its not the case that if its in 3 NF its in 1 NF nad 2 nd NF .it was like if its in 2nd NF it has to be in 1st NF beforehand .and same goes for 3NF .for normalising to 3NF it has to clear 1st and 2nd NF forms.
1st normal forms states that no multivalued attribute should be present.
2NF states that there should not be partial dependency on a non prime attribute .
3NF states that no transitive depedency should be there .
thank you
The only NF (normal form) that matters is 5NF.
A relation (value or variable) is in 5NF when for every way it can be losslessly decomposed the components can be joined back in some order where the common columns of each join are a superkey of the original. (Fagin's PJ/NF paper's membership algorithm.)
This allows a table to be the join of others with overlapping meanings but without update anomalies. (Although update anomalies cease at ETNF, between 4NF & 5NF.)
Anyway if you wanted a lower NF you should normalize to 5NF then denormalize. The main reason people settle for lower NFs is ignorance. There are certain costs & benefits, but people don't know or address them--code must restrict updates to account for the problematic update anomalies. Normalization to a given NF is not done by going through lower NFs; one uses an appropriate algorithm for the NF one wants. (This is made clear by most textbooks, although some wrongly say to move through lower NFs, but putting into a lower NF can prevent good higher-NF versions of the original from turning up later.)
PS There is no single notion 1NF and all it has in common with higher NFs is that both seek "better" designs.
From what I recall of the process, it's a method that you follow to get to a state where the storage and search facilities of the database are fully optimised. Yes 3NF does encapsulate the rules below it, 1st and 2nd, but it is far easier to unpick the data if you start at the easier forms of normalization to see if your data is in an efficient format for storage in a RDBMS or SQL based database. Jumping in straight at a higher normal form makes the whole process for beginners harder and intimidating and to not analyse the data correctly. To be honest will make hard work when dealing with difficult data structures that are not just your usual invoice, invoice Lines, address stuff that you tend to deal with day in and day out. Going through the process of normalization, sometimes there is value in unpicking data structures that were not obvious from the start, which not only makes your data more efficient but also helps you reason over what you are trying to accomplish.

uncertainty in developing a database model

I'm trying to develop a database model for candidate, their registered exams and result of the exams when its being taken.
This is what I've done so far. however im unsure if am on the right track especially from the examination table to the examination result table.
how easy will it be to right write an insert sql code for examinationresult population for a particular candidate
the examination types are categorised into science, art and social science. they all have 4 components each
Note on Progression
Given the fact that the Question changes substantially (in clarifying the requirement, not is scope) in response to my Response and TRD, this is going to take some back-and-forth. Let's identify Steps: your Step numbers are odd, starting from 1; mine, in response, are even. Parts of previous Response Steps have become obsolete, they may no longer make sense.
I would suggest a bounty, except for the fact that you have few points.
Response Step 2 to Initial Question & Step 1 Diagram
This is what I've done so far.
You have done some good work, but it is too early for assigning PKs. Besides, assigning an ID on every file as a starting point will cripple the modelling process, the result will not be a database. You have to model the data (not the database) first, then assign Keys when the entities are clear and stable. So drop all your IDs and PKs and model the data, as data. Forget about what you want to do with the data (ie. forget the app).
how easy will it be to right write an insert sql code for examinationresult population for a particular candidate
Right now you can't. You have no relationship between Candidate and Examination[Result]. That is not a problem because the modelling is incomplete at this stage, when it is complete the code will be simple.
The entity Course is implied, but it is missing.
however im unsure if am on the right track especially from the examination table to the examination result table
You are on the right track with some of the other files, but the Examination cluster needs work. This will take a bit of back-and-forth. Once you answer the questions in the comments, we can proceed.
The main issue is this: how is Examination identified.
An ID field does not identify anything, nor does it provide uniqueness in the data, which is required if you want data integrity. IDs result in a Record Filing System with no integrity, however, it appears you want a database with data integrity. Is that correct ?
Go back to the user and discuss how courses and components are identified, what codes they use, etc. Those are the natural Keys that they use to identify their data, that they will enter into the system when they need look something up, or to enter examination results.
Eg. It is not reasonable to contemplate an Examination that exists independently (as you have modelled it). People do not go to a hall and sit for any old exam. The exam exists only in the context of a course, they sit for an exam for a course.
Then the course, and not the exam, has components, which are examined. And each course has a different number of components.
Eg. a Course which is identified as ENG101 for English Literature year 1
And then the components within that. Eg. 2b Short essay on poetry.
They may need to identify the year and semester of the course as well, in which case, you need a CourseOffering per semester.
Consider this, as a discussion point. Courier is example data, blue is Key, green is non-key:
TRD Step 2
Response Step 4
Response to Question & Description
This is what I've done so far.
My previous response still applies:
You have done some good work, but it is too early for assigning PKs. Besides, assigning an ID on every file as a starting point will cripple the modelling process, the result will not be a database. You have to model the data (not the database) first, then assign Keys when the entities are clear and stable. So drop all your IDs and PKs and model the data, as data. Forget about what you want to do with the data (ie. forget the app).
You have not addressed that issue, that I identified in your Step 1 Diagram, in your Step 3 Diagram. It appears, from the evidence, that you might be happy with IDs as "Primary Keys" (there aren't), despite the hindrance having been identified to you. That means your understanding of the data is crippled, and the progress of your diagrams will be slow.
My previous response still applies:
An ID field does not identify anything, nor does it provide uniqueness in the data, which is required if you want data integrity. IDs result in a Record Filing System with no integrity, however, it appears you want a database with data integrity. Is that correct ?
You must answer these questions, otherwise your design cannot proceed. These are severe errors that must be corrected. One cannot build on, or progress, a foundation that contains severe errors.
Could you please confirm, you do want a Relational Database, with the integrity and performance that Relational Databases are capable of, that is easy to code against, as opposed to a Record Filing System, with no integrity or speed, that will be difficult to code against. Correct ?
If [1] is correct. Since ID fields as "Primary Keys" do not provide row uniqueness, which is demanded for a Relational Database, how exactly, do you intend to provide the required row uniqueness ? Alternately, are you happy to have an RFS that is full of duplicate rows (each with an unique record ID) ?
how easy will it be to right write an insert sql code for examinationresult population for a particular candidate
My previous response still applies:
Right now you can't. You have no relationship between Candidate and Examination[Result]. That is not a problem because the modelling is incomplete at this stage, when it is complete the code will be simple.
Ok, in your Step 3 Diagram, you have drawn a line between Candidate file and the ExaminationResult file (as opposed to, inserting a relationship in a database).
In a record filing system, sure, you can just draw a line between any two files, insert the relevant ID field, and hey presto, you have "linked" or "connected" or "mapped" the two files.
But database design (as opposed to file design) does not progress like that, you cannot just draw a line between any two objects, insert the relevant ID field, and hey presto, create a database relationship. No. There is no basis, no integrity, in the dashed line that you have drawn. Eg. in your Step 3 Diagram, any Candidate can be related to any Examination[Result].
That is "normal" or "ordinary" in record filing systems, but in a database, it is something to be recognised and understood as an error, and thus prevented. Because we expect integrity in a database, and because it can be prevented, easily.
however im unsure if am on the right track especially from the examination table to the examination result table
My previous response still applies:
You are on the right track with some of the other files, but the Examination cluster needs work. This will take a bit of back-and-forth. Once you answer the questions in the comments, we can proceed.
The main issue is this: how is Examination identified.
An ID field does not identify a row (it identifies a record, which has no relevance whatsoever in a database).
The same two problems (a) lack of a valid identifier, and (b) lack of row uniqueness, exists with your Candidate, Component and ExaminationResult files.
Response to Diagram as a Diagram (as opposed to the content)
You have improved it over your Step 1 Diagram, and in response to my Response Step 2, great. But the relationships (most of them) are still incorrect. And the basis of Candidate::Examination is still not resolved.
It appears to me that you are not clear about the notation (notches; circles; crows feet) and precisely what they mean at the parent and child ends). So you need to learn that first, and then draw the diagram, rather than the other way round.
It is great that you are using a Notation that is meaningful, and many details are shown (many people don't, they draw nice-looking diagrams that lack the detail required for a full understanding of the model. That means that every notch; circle; crows foot, has specific meaning, and must be drawn correctly, in order to convey that meaning to the reader.
Entities do not exist in isolation, there must always be a parent first, in order for the child to be a child of the parent. There is no such thing as "equal". Dependency is always in one direction.
Your relationships that are 1-and-only-1 on one side, and 1-and-only-1 on the other side, are incorrect, they indicate a Normalisation error. The field in the subordinate record can be Normalised into the ordinate record.
Eg. AdmissionLetter is not a separate file, some form of AdmissionLetter identifier (not an ID field) should be located in Candidate.
Eg. Title::Candidate is a drawing error, it should be 1 at the Title end and 0-to-many at the Candidate end.
In a data model, bold (by convention) means a migrated Foreign Key. The Primary Key that is migrated is not bold.
Response to Diagram Content
From your replies, the term Subject trumps the term Component; Category trumps various loosely-identified elements into one clear entity.
It is not reasonable to contemplate an Examination that exists independently (as you have modelled it).
People do not go to a hall and sit for any old exam, any old Subject. The exam exists only in the context of a Subject, they sit for an exam for a Subject.
I accept that the Examination is one sitting, for four Subjects
I accept that the four Subjects are defined by a Category.
I accept that the Candidate is registered for a Category.
Thus the exam exists only in the context of a Subject, which exists only in the context of a Category, and the Candidate sits for an exam which is a Category, which contains four (the number does not matter) Subjects.
Having resolved that, two questions remain:
Do you need to record an Examination as an event, independent of the Candidates who sit in that event. Eg. Examination(Location, DateTime) ?
Does the Examination event examine Candidates in one, or more than one, Category ?
The notion of four Subjects that are implemented as four repeated fields in one record breaks Second Normal Form, which demands that repeating fields are Normalised into separate records in a child file.
Therefore, for both your Component and ExaminationResult files, that issue needs to be resolved.
Note that the fact that that problem is repeated in two separate files is a second alarm that it is an error.
I have clarified the Category/Subject issues for you, and resolved the Normalisation error.
I have given simple identifiers for Categories and Subjects.
If you do not implement that, you will not have integrity between the Candidate and the Subject they are being Examined for. As well, you will suffer various problems when you get to the coding stage.
I have no idea what you are trying to do with exComp, therefore I have no response. Perhaps you can say a few words about it.
Thus far, there is still no reasonable way of relating Candidates to Examinations or ExaminationResults. That is, it has no basis, nothing has been defined as the basis for the relationship, and thus the relationship has no integrity.
On the basis of what I have been able to ascertain thus far, there must be some sort of registration for an exam. Otherwise you would not know that a Candidate is sitting for an exam.
When the Candidate registers, they register for an exam, and that exam is defined (and therefore constrained) by a Category. Otherwise any Candidate can sit for any exam, which I believe, you would like to prevent.
Further, the [four] exam Subjects that they sit for, should be constrained by the Category that they registered for.
You do want to ensure that you do not record an Economics exam result for a Candidate who is registered for Science, correct ?
I have determined that the basis of an exam is the Registration. That is the event, the fact, the recording of which, establishes that a Candidate will sit for an exam.
The identifier virtually jumps out at you, it is CategoryCode plus CandidateID. Voila! we have row uniqueness. Magnifique! we have integrity.
Now the integrity of ExaminationResult can be implemented: it is constrained to the CandidateRegistration::Category and to the Category::Subject.
To be Resolved: Do you need to identify the fact of a Candidate registering for an examination (RegistrationDate, AdmissionLetter of whatever) vs the fact that the Candidate sat for the examination (eg. ExaminationDate) ? A sort of roll call.
Right now, I have modelled that as a single fact with no differentiation, and the table is called Examination because you seem to be focussed on that.
Predicate
These days, people seem to be throwing themselves at drawing a diagram, without understanding either the basics of a Relational Database, or of the exercise of modelling data. Predictably, that results in an ill-defined diagram (many relevant details are omitted) [gratefully, your diagram has some definition], and it produces a record filing system with no integrity, no relational power, no speed, instead of a Relational Database with integrity, power, and speed.
One concept that is often missing is Predicates. A competent reader can read a good data model, and ascertain the Predicates, because they are drawn in the model, in the form of notation, but a novice doesn't understand the notation, or the relevance of the various items, and therefore will miss the Predicates. In sum, the Predicates are all the constraints that are placed on the data:
Row Identification:
The basis of it existence, and how it is Identified: Independent (square corners); or Dependent (round corners)
Row Uniqueness: Primary and Alternate Keys (note, IDs are not Keys)
Relationships between rows:
Identifying (solid lines); or Non-identifying (dashed lines)
Meaning, relevance, purpose: the all-important Verb Phrase
Further, a novice cannot determine the Predicates when there is no diagram, or when the diagram is poor, or when they are designing the filing system and drawing the diagram themselves. Thus they do not identify the relevant Predicates in their diagram.
Predicates are very important during the modelling exercise, in that as well as the model expressing the Predicates, the Predicates confirm the accuracy of the model, it is a feedback loop. It is an essential part of the modelling exercise. Since I am executing the modelling task for you, I am working out the Predicates as I perform that task, they are obvious to me. But they may not be obvious to you.
When the data model is published, and ready for discussion with the users, these Predicates are incorporated into it. They come under the heading of Business Rules, they form a part of that, because that is the way the user perceives them. Consequently, during the walkthroughs and discussions, the Predicates (as well as the other stated Business Rules) are either confirmed or denied by the user. They need to be stated explicitly, because unlike the technically educated developer, the user cannot be expected to read all the relevant Predicates from the notation in a good data model.
In this situation, I am the modeller, and you are the "user". Thus I have decided to provide the Predicates for you, explicitly. So that you can confirm or deny them, and thus we can progress the modelling exercise. Once you get used to reading the Predicates from a good data model, you will not need to have them declared explicitly for you. Again, Predicates are very important because they verify (or not) the accuracy of the model. So please read them carefully and comment on any Predicates that you do not completely agree with, or that you do not understand.
Of course, it is not necessary to explicitly declare all the Predicates, there are just too many, we declare just the more relevant ones, that relate to:
(a) rows (tables), the basis for their existence
(b) their identification
(c) all dependencies
(d) relationships, both sides (one side is the Verb Phrase).
Step 4 TRD
I have implemented all the above, as detailed. Please consider this TRD as a discussion platform for the next iteration, and comment. Courier indicates example data, blue indicates Key values, green indicates non-key values:
Step 4 TRD
Response Step 6 to Chat Step 5
All issues discussed have been resolved, and implemented in the model. Sorry, I do not have time right now to post details, this is simply identifies the updated models.
Entity-Relation and full Predicates on page 1
All resolved issues have been implemented.
Predicates
Now that it is stable, I am now giving you the second side of the Relation Predicates (child-to-parent). And now that you understand them, I have deleted the repeated, annoying "Each" that is demanded for novices.
Entity-Relation-Key on page 2
Now that the TRD is stable, we are ready to proceed to Determination of Keys
(Second only to Normalisation, Key Determination is a critical part of the modelling exercise. The two tasks are normally performed side-by-side, they are inseparable, I have already determined the keys. In this case, given the limitations of the communication media, I am presenting it as a sequential step).
Here, I use an Extension to the IDEF1X Notation that allows me to concentrate of the components that are relevant to the task, I expect that it is self-explanatory. The Key columns only, are given. Foreign Keys are not Bold (as they are in the DM). All that, is intended to make it easy on the eye.
Most tables have one Key (Primary). Where there are two Keys (Primary and Alternate), the AK is below the line.
This is my recommendation for the Keys, as requested, for your review.
Step 6 TRD and TRK 6

Onetomany with parent database design

Below is a database design which represents my problem(it is not my actual database design). For each city I need to know which restaurants, bars and hotels are available. I think the two designs speak for itself, but:
First design: create one-to-many relations between city and restaurants, bars and hotels.
Second design: only create an one-to-many relation between city and place.
Which design would be best practice? The second design has less relations, but would I be able to get all the restaurants, bars and hotels for a city and there own data (property_x/y/z)?
Update: this question is going wrong, maybe my fault for not being clear.
the restaurant/bar/hotel classes are subclasses of "place" (in both
designs).
the restaurant/bar/hotel classes must have the parent "place"
the restaurant/bar/hotel classes have there own specific data (property_X/Y/X)
Good design first
Your data, and the readability/understandability of your SQL and ERD, are the most important factors to consider. For the purpose of readability:
Put city_id into place. Why: Places are in cities. A hotel is not a place that just happens to be in a city by virtue of being a hotel.
Other design points to consider are how this structure will be extended in the future. Let's compare adding a new subtype:
In design one, you need to add a new table, relationship to 'place' and a relationship to city
In design two, you simply add a new table and relationship to 'place'.
I'd again go with the second design.
Performance second
Now, I'm guessing, but the reason for putting city_id in the subtype is probably that you anticipate that it's more efficient or faster in some specific use cases and this may be a very good reason to ignore readability/understandability. However, until you measure performance on the actual hardware you'll deploy on, you don't know:
Which design is faster
Whether the difference in performance would actually degrade the overall system
Whether other optimization approaches (tuning SQL or database parameters) is actually a better way to handle it.
I would argue that design one is an attempt to physically model the database on an ERD, which is a bad practice.
Premature optimization is the root of a lot of evil in SW Engineering.
Subtype approaches
There are two solutions to implementing subtypes on an ERD:
A common-properties table, and one table per subtype, (this is your second model)
A single table with additional columns for subtype properties.
In the single-table approach, you would have:
A subtype column, TYPE INT NOT NULL. This specifies whether the row is a restaurant, bar or hotel
Extra columns property_X, property_Y and property_Z on place.
Here is a quick table of pros and cons:
Disadvantages of a single-table approach:
The extension columns (X, Y, Z) cannot be NOT NULL on a single table approach. You can implement row-level constraints, but you lose the simplicity and visibility of a simple NOT NULL
The single table is very wide and sparse, especially as you add additional subtypes. You may hit the max. number of columns on some databases. This can make this design quite wasteful.
To query a list of a specific subtype, you have to filter using a WHERE TYPE = ? clause, whereas the table-per-subtype is a much more natural `FROM HOTEL INNER JOIN PLACE ON HOTEL.PLACE_ID = PLACE.ID"
IMHO, mapping into classes in an object-oriented languages is harder and less obvious. Consider avoiding if this DB is going to be mapped by Hibernate, Entity Beans or similar
Advantages of a single-table approach:
By consolidating into a single table, there are no joins, so queries and CRUD operations are more efficient (but is this small difference going to cause problems?)
Queries for different types are parameterized (WHERE TYPE = ?) and therefore more controllable in code rather than in the SQL itself (FROM PLACE INNER JOIN HOTEL ON PLACE.ID = HOTEL.PLACE_ID).
There is no best design, you have to pick based on the type of SQL and CRUD operations you are doing most frequently, and possibly on performance (but see above for a general warning).
Advice
All things being equal, I would advise the default option is your second design. But, if you have an overriding concern such as those I listed above, do choose another implementation. But don't optimize prematurely.
Both of them and none of them at all.
If I need to choose one, I would keep the second one, because of the number of foreign keys and indexes needed to be created after. But, a better approach would be: create a table with all kinds of places (bars, restaurants, and so on) and assign to each row a column with a value of the type of the place (apply a COMPRESS clause with the types expected at the column). It would improve both performance and readability of the structure, plus being more easier to maintain. Hope this helps. :-)
you do not show alternate columns in any of the sub-place tables. i think you should not split type data into table names like 'bar','restaurant', etc - these should be types inside the place table.
i think further you should have an address table - one column of which is city. then each place has an address and you can easily group by city when needed. (or state or zip code or country etc)
I think the best option is the second one. In the first design, there is a possibility of data errors as one place can be assigned to a particular restaurant (or any other type) in one city (e.g. A) and at the same time can be assigned to another restaurant in a different city (e.g. B). In the second design, a place is always bound to a particular city.
First:
Both designs can get you all the appropriate data.
Second:
If all extending classes are going to implement the location (which sound obvious for your implementation) then it would be a better practice to include it as part of the parent object. This would suggest option 2.
Thingy:
The thing is that even-tough you can find out the type of each particular PLACE, it is easier to just know that a type (CHILD) is always a place (PARENT). You can think of that while you visualize the result-set of option 2. With that in mind, I recommend the first approach.
NOTE:
First one doesn't have more relations, it just splits them.
If bar, restaurant and hotel have different sets of attributes, then they are different entities and should be represented by 3 different tables. But why do you need the place table? My advice it to ditch it and have 3 tables for your 3 entities and that's that.
In code, collecting common attributes into a parent class is more organised and efficient than repeating them in each child class - of course. But as spathirana comments above, database design is not like OOP. Sure, you'll save on the repetition of column names by sticking common attributes of places into a "place" table. But it will also add complication:
- you have to join on that table whenever you want to reference a bar, restaurant or hotel
- you have to insert into two tables whenever you want to add a new bar, restaurant or hotel
- you have to update two tables when ... etc.
Having 3 tables without the place table is ALSO, PROBABLY, the most performance-optimal design. But that's not where I'm coming from. I'm thinking of clean, simple database design where a single entity means a single row in a single table. There are no "is-a" relationships in a relational DB. Foreign key relationships are "has-a". OK, there are exceptions I'm sure, but your case is not exceptional.

Fact table with multiple facts

I have a dimension (SiteItem) has two important facts:
perUserClicks
perBrowserClicks
however, within this dimension, I have groups of values based on an attribute column (let's call the groups AboveFoldItems, LeftNavItems, OnTheFlyItems, etc.) each have more facts that are specific to that group:
AboveFoldItems: eyeTime, loadTime
LeftNavItems: mouseOverTime
OnTheFlyItems: doesn't have any extra, but may in the future
Is the following fact table schema ok?
DateKey
SessionKey
SiteItemKey
perUserClicks
perBrowserClicks
eyeTime
loadTime
mouseOverTime
It seems a little wasteful since only some columns pertain to some dimension keys (the irrelevant facts are left NULL). But... this seems like it would be a common problem, so there should be a common solution for this, right?
I'm generally in agreement with Damir's answer on this, but because the fact table is very narrow in your particular case, there is still merit to Aaron's advocation for keeping the NULLs.
We have several star schemas in particular subject areas with multiple fact tables that share most (if not all) of the dimensions (conformed and internal). The limited-scope dimensions are not considered "conformed" across the enterprise, but they are what we would call "shared internal" dimensions.
Now typically, if the data is loaded contemporaneously so that the dimension hasn't changed, you can join both fact tables on the keys, but in general, of course, you cannot join two different star schemas on the dimension keys if they are surrogates in traditional slowly changing dimensions. In general, you have to join separate stars on the natural keys or "business keys" within the dimension and not on surrogates (except usually in the special case of the date dimension where it is unchanging and only has a natural key).
Note that when you do join the two stars, you have to use a LEFT JOIN, in which case you WILL produce NULLs which you will still probably have to take account of - so you're actually getting back to the original model you had with NULLs! ;-)
The benefit of the extra fact table is more obvious when your tables are wide with a smaller set of keys and the vertical partitioning of the data produces space savings as well as a cleaner logical model - this is especially true when the keys are only really shared up to a point - having one dummy key or NULL key is definitely not a good idea - this usually points to a dimensional modeling problem.
However, as Aaron says, if you push it to extremes, you can have a single fact column in each fact table with shared keys, which means the key overhead dwarfs the fact cost and you really do end up in a disguised EAV model.
I would also look to see if you are in Kimball's situation of "too few dimensions". Seems like you must have good dimensional attributes lumped into the SessionKey and SiteItemKey - but without seeing your entire model and requirements, it's hard to say, but I would think you would have some user demographics in a low-cardinality or even snowflake dimension without the full Session or Site dimension.
There isn't an elegant solution really, you either have nullable columns or you use an EAV solution. I posted about EAV before (and generated a lot of comments that might be worthwhile reading):
What is so bad about EAV, anyway?
I am a fan of that model in some scenarios, but if your dimensions/attributes do not change frequently, it can be a lot of extra work for nothing. NULL values in a column do not really make waste as long as the surrounding code can deal with them appropriately.
You could have more than one fact table: factperUserClicks, factperBroWserClicks, factEyeTime, etc...
Each of these would have DateKey, SessionKey, SiteItemKey. This way only dimension keys that "make sense" appear with each fact.
Ideally, there should be no NULLS in the DW -- if you keep them in the same fact table, using zeros may be more appropriate.
As far as saving disk space, I do not see an ideal solution -- but, in a DW one is supposed to trade space for speed and (query) simplicity anyway.

Person name structure in separate database table

I am wondering when and when not to pull a data structure into a separate database table when it appears in several tables.
I have pulled the 12 attribute address structure into a separate table because I have a couple of different entities containing a single address in this format.
But how about my 3 attribute person name structure (given, middle, surname)?
Should this be put into its own table referenced with a foreign key for all the entities containing a name... e.g. the company table has a contact person name, the citizen table has a person name etc.
Are these best left as attributes in the main tables or should they be extracted?
I would usually keep the address on the Person table, unless there was an unusual need for absolutely uniform addresses on each entity, or if an entity could have an arbitrary number of addresses, or if addresses need to be shared between entities, or if it was a large enterprise product where I know I have to invest in infrastructure all over the place or I will end up gutting everything down the road.
Having your addresses in a seperate table is interesting because it's flexible, but in the context of a small project lacking a special need like the ones mentioned above, it's probably a slight waste. Always be aware of the balance between complexity and flexibility. Flexibility is important, but be discriminating... It's easy to invest way too much there!
In concrete terms, the times that I experimented with (for instance) one-to-one relationships for things like addresses, I ended up refactoring them back into the table because it introduced a bunch of headaches including more complex queries, dealing with situations where the address does not exist, etc. More entities also increases your cognitive load -- it makes the project harder to think about. In my case, it was an unecessary cost because there was no concrete need and, in truth, not even a gain in flexibility.
So, based on my experiences, I would "try" to keep the addresses in the same table, and I would definitely keep the names on them - again, unless there was a special need.
So to paraphrase Einstein, make it as simple as possible and no simpler. But in the short term, experiment. It's the best way to learn these lessons.
It's about not repeating information, so you don't want to store the same information in two places when one will do.
Another useful rule of thumb is one entity per table. If you find that one table contains, say, "person" AND "order" then you probably should split those into two tables.
And (putting myself at risk of repeating information...) you might find it helpful to review some database design basics, there are plenty of related questions here on stackoverflow.
Start with these...
What is normalisation?
What is important to keep in mind when designing a database
How many fields is 'too many'?
More tables or more columns?
Creating a person entity across your data model will give you this present and future advantages -
The same person occurring as a contact, or individual in different contexts. Saves redundancy.
Info can be maintained and kept current with far-less effort.
Easier to search for a person and identify them - i.e. is it the same John Smith?
You can expand the information - i.e. maintain addresses for this person far more easily.
Programming will be more consistent and debugging will be easier as well.
Moves you closer to a 'self-documenting' system.
As a counterpoint to the other (entirely valid) replies: within your application's current structure, how likely will it be for a given individual (not just name, the actual "person" -- multiple people could be "John Smith") to appear in more than one table? The less likely this is to happen, the less likely you are to get benefits from normalization.
Another way to think of it is entities. Outside of labels (names), is their any overlap between "customer" entity and an "employee" entity?
Extract them. Your aim should be to have no repeating data in your database.
Read about Normalization
It really depends on the problem you are trying to solve. In general it is probably a good idea to have some sort of 'person' table which holds details of people. However, there are occasions where that is potentially a very bad idea.
One example would be if you are holding details of prescriptions written out to people by a doctor. In some countries it is a legal requirment that the prescription details are held with the name in which they were prescribed NOT the name the person is going under currently. For instance a woman might be prescribed a drug as miss X, but then she gets married and becomes Mrs Y. If you had a person table that was linked to the prescriptions table you would now have the wrong details and would possibly face legal consequences. In that case you would need to probably copy the relevant details of the person into the prescription table, even though this would be duplicating data.
So again - it depends on the problem you are trying to solve. Don't just blindly follow what people consider to be best practices. Understand your data and any issues surrounding it, then try to follow best practices that fit.
Depends on what you're using the database for.
If you want fast queries on your tables you should de-normalize your tables. Having to run multiple JOIN's will take longer and make your queries more complex.
On the other hand if your intention is to have a flexible storage database which is not meant to be hit with a ton of fast-response queries, then normalizing the tables by splitting them out into multiple xref'ed tables will provide more flexibility in your design and reduce the need for submitting duplicated data.
Since de-normalization is "optimization", I would suggest you normalize the tables first, index them properly and see if you're getting any bottlenecks on your queries. If so, flatten the affected tables where needed.
You should really consider your whole database structure and do a ER diagram (entity relationship diagram) first. OF COURSE there should be another table called "Person" where the concept of a person is stored...

Resources