Normalization of a table (BCNF) - database

I'm trying to understand how to normalize a database, and one of the exercises given by our teacher was to normalize this table into BCNF:
Flight(**CityDeparture,CityArrival,Day**,NationDeparture,NationArrival)
where (CityDeparture,CityArrival,Day) is the primary key.
So I assumed that:
1)The city name is unique independently of the nation (there cannot be two nations with a city of the same name, even if that is not true in reality), otherwise the primary key would be wrong.
2)The functional dependencies are
CityDeparture->NationDeparture
CityArrival->NationArrival
Meaning the table was not even in 2NF, so I decomposed it like so:
Flight(CityDeparture,CityArrival,Day)
there are no non-trivial FDs so it is in BCNF, right?
CityD(**CityDeparture**,NationDeparture) CityDeparture->NationDeparture
is in BCNF because CityDeparture is key
CityA(**CityArrival**,NationArrival) CityArrival->NationArrival
is in BCNF because CityArrival is key.
I also considered the fact that CityA and CityD could be identical unless every city has a different code of departure/arrival(i.e. NewYork has code 'AAA' if a flight leaves from there and code 'BBB' if a flight lands there) so one could just have a single City(Name,Nation) table and both CityDeparture,CityArrival would reference it.
The decomposition should also be lossless because City.Name is a common attribute of both tables and is the key of City (I'm quite unsure about this).
When I showed this to my teacher, they just gave it a score of 0 and told me to go read the book, without further explanation. I did read the book, and the articles I found linked around here, but I'm honestly clueless, so I'm asking for your advice! Any help would be appreciated.

1)The city name is unique independently of the nation (there cannot be two nations with a city of the same name, even if that is not true in reality), otherwise the primary key would be wrong.
On the one hand, your reasoning here is correct. On the other hand, many (most?) textbook normalization exercises don't include keys at all. You're usually expected to derive all possible keys from the dependencies. Maybe your teacher expects you to ignore the existing key.
Another possibility is that your teacher wanted you to include the FD {CityDeparture, CityArrival, Day} -> {NationDeparture, NationArrival}.
Another possibility is that your teacher wanted you to explore the dependencies within the primary key. Are there any multivalued dependencies?
If your book includes an algorithm that you can do with pencil and paper--most of them do--try working through it that way. See what you get.
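As a rough illustration of what "deriving the keys from the dependencies" means, here is a minimal Python sketch. It assumes that the only FDs are the two listed in the question; the helper names `closure` and `candidate_keys` are made up for this example and do not come from any textbook or library.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attributes, fds):
    """Brute force: the minimal attribute sets whose closure is the whole relation."""
    attributes = frozenset(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(sorted(attributes), size):
            candidate = frozenset(subset)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys

# Assumption: these two FDs are the only ones that hold.
ATTRS = {"CityDeparture", "CityArrival", "Day", "NationDeparture", "NationArrival"}
FDS = [
    (frozenset({"CityDeparture"}), frozenset({"NationDeparture"})),
    (frozenset({"CityArrival"}), frozenset({"NationArrival"})),
]

for key in candidate_keys(ATTRS, FDS):
    print(sorted(key))
```

Under those assumptions the only candidate key found is (CityArrival, CityDeparture, Day), which matches the primary key given in the exercise. The search is exponential in the number of attributes, so it is only meant for textbook-sized relations.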

Your decomposition of
Flight(CityDeparture,CityArrival,Day,NationDeparture,NationArrival)
into
Flight(CityDeparture,CityArrival,Day)
CityD(CityDeparture,NationDeparture)
CityA(CityArrival,NationArrival)
does indeed give you BCNF.
Regarding the last step, the unification of CityD and CityA: This is not justified by your functional dependencies, and thus incorrect from a formal database perspective. It would be justified by further context knowledge. In practice, it would of course make sense in most settings.
Keep in mind that database normalization is a formal discipline, and so are its algorithms. Substitute artificial names for your relation, e.g., R(A,B,C,D,E) with the same keys and functional dependencies - the result must be the same up to renaming.
EDIT
This assumes that the primary key and the two functional dependencies CityDeparture->NationDeparture and CityArrival->NationArrival were given as part of the exercise - otherwise see Mike's answer.
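To make the two claims checkable (each piece is in BCNF, and the split is lossless), here is a small Python sketch. It assumes exactly the two FDs from the question. The function names are hypothetical, and the lossless test is the usual one for splitting a relation in two: the common attributes must determine all of one side.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def bcnf_violations(relation, fds):
    """Nontrivial FDs X -> Y inside relation whose left side X is not a superkey of it."""
    relation = frozenset(relation)
    found = []
    for size in range(1, len(relation)):
        for subset in combinations(sorted(relation), size):
            x = frozenset(subset)
            x_plus = closure(x, fds) & relation
            if x_plus != x and x_plus != relation:
                found.append((x, x_plus - x))
    return found

def lossless_split(r1, r2, fds):
    """Binary split test: the shared attributes must determine all of r1 or all of r2."""
    r1, r2 = frozenset(r1), frozenset(r2)
    shared_closure = closure(r1 & r2, fds)
    return r1 <= shared_closure or r2 <= shared_closure

# Assumption: these are the only FDs given in the exercise.
FDS = [
    (frozenset({"CityDeparture"}), frozenset({"NationDeparture"})),
    (frozenset({"CityArrival"}), frozenset({"NationArrival"})),
]
ORIGINAL = {"CityDeparture", "CityArrival", "Day", "NationDeparture", "NationArrival"}
FLIGHT = {"CityDeparture", "CityArrival", "Day"}
CITY_D = {"CityDeparture", "NationDeparture"}
CITY_A = {"CityArrival", "NationArrival"}
# Intermediate relation left over after splitting off CityD first.
REST = {"CityDeparture", "CityArrival", "Day", "NationArrival"}

print(bool(bcnf_violations(ORIGINAL, FDS)))        # the original table violates BCNF
print([bcnf_violations(r, FDS) for r in (FLIGHT, CITY_D, CITY_A)])
print(lossless_split(CITY_D, REST, FDS), lossless_split(CITY_A, FLIGHT, FDS))
```

The decomposition is checked as two successive binary splits (first CityD, then CityA), since the simple shared-attribute test only applies to a split into two relations.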

Related

Is it Important to Understand Each Normal Form

I have been studying database design and programming for quite some time now, but I still can't get a grasp of understanding each individual normal form (1NF, 2NF, 3NF.)
Seeing as anytime the data is in Third Normal Form, it is already automatically in Second and First Normal Form, can the whole process actually be accomplished less tediously by fully normalizing the data from the start. I can accomplish this easily by arranging the data so that the columns in each table, other than the primary key, are dependent only on the whole primary key.
How important is it to understand each individual normal form if we can simply fully normalize the data less tediously by doing what I have described?
EDIT: What I'm ultimately asking is: Is it important to go through the steps of each normal form when normalizing data, or is it appropriate to just go to Third Normal Form seeing as the result is ultimately the same?
I highly recommend understanding each normal form. You will not always start from a perfect scenario, and knowing the individual forms helps you investigate and diagnose the problems in an existing database design, if there are any.
Going step by step through the different normal forms will also help you see why we normalize at all: to achieve the goals specified by E. F. Codd.
The objectives of normalization were stated as follows:
1. To free the collection of relations from undesirable insertion, update and deletion dependencies.
2. To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs.
3. To make the relational model more informative to users.
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Here is an image to help you understand the different normal forms better.
P.S. BCNF is actually 3.5NF not 4NF
It's right that a table in the 3. NF is also in the 2. and in the 1. NF. However, the condition for the 3. NF is not only that all the data depends on nothing but the candidate key. It also requires that the table already be in the 2. NF, meaning that every attribute that is not part of the candidate key has to depend on the whole candidate key, and in the 1. NF, meaning that every column has to be atomic. So yes, it is important to understand every NF if you want to have a table in the 3. NF.
I'll try to explain the Normal Forms to you:
1. NF
The 1. NF states that every column has to be atomic. This means there shouldn't be multiple items of data in one column. For example, someone's address shouldn't be stored in one column, but should be split into the country, the state, the street and so on. Each of these pieces of data should then be stored in its own column.
2. NF
The 2. NF states that every attribute that is not part of the candidate key has to be identifiable only by the whole candidate key. That means, for example, that you shouldn't store books and printing labels in one table, because then the name of the book would depend only on the id of the book, while the printing label's name would depend only on the id of the printing label, and not on the whole candidate key.
3. NF
The 3. NF states nearly the same as the 2.: no column is allowed to depend on a column that is not part of the candidate key. That means, for example, that you shouldn't store the ISBN of a book and an id of the book in the same table with only the id being the candidate key, as you'd only need the ISBN to find the name of the book.
If this doesn't explain the matter well enough, there's a lot of information online regarding the normal forms (like Wikipedia).
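To make the 2. NF example a bit more concrete, here is a minimal Python sketch that looks for partial dependencies. The attribute names (`BookId`, `LabelId`, `BookName`, `LabelName`), the FDs and the helper names are all made up for illustration, and the sketch assumes the table has a single candidate key.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def partial_dependencies(attributes, key, fds):
    """Non-key attributes that are determined by a proper part of the key."""
    attributes, key = frozenset(attributes), frozenset(key)
    non_prime = attributes - key
    found = []
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            part = frozenset(subset)
            determined = closure(part, fds) & non_prime
            if determined:
                found.append((part, determined))
    return found

# Hypothetical combined "books and printing labels" table.
ATTRS = {"BookId", "LabelId", "BookName", "LabelName"}
KEY = {"BookId", "LabelId"}
FDS = [
    (frozenset({"BookId"}), frozenset({"BookName"})),
    (frozenset({"LabelId"}), frozenset({"LabelName"})),
]

for part, determined in partial_dependencies(ATTRS, KEY, FDS):
    print(sorted(part), "->", sorted(determined))
```

Each pair that is reported is a reason the combined table is not in the 2. NF; moving every reported dependency into its own table removes it.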
It's not so much that being in 3NF makes a table 1NF and 2NF. Rather, to be in 2NF it has to be in 1NF beforehand, and the same goes for 3NF: to normalize to 3NF, the table first has to satisfy 1NF and 2NF.
1NF states that no multivalued attribute should be present.
2NF states that no non-prime attribute should be partially dependent on a candidate key.
3NF states that there should be no transitive dependency of a non-prime attribute on a key.
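As a small illustration of the transitive-dependency rule, here is a Python sketch on a made-up Employee(EmpId, DeptId, DeptName) table. The table, its FDs and the function names are assumptions for this example, and the check only covers the simple case of a single candidate key.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def transitive_violations(attributes, key, fds):
    """Listed FDs X -> Y where X is not a superkey and Y has a non-key attribute."""
    attributes, key = frozenset(attributes), frozenset(key)
    return [
        (lhs, rhs - key - lhs)
        for lhs, rhs in fds
        if closure(lhs, fds) != attributes and rhs - key - lhs
    ]

# Hypothetical table: EmpId -> DeptId -> DeptName is a transitive chain.
ATTRS = {"EmpId", "DeptId", "DeptName"}
KEY = {"EmpId"}
FDS = [
    (frozenset({"EmpId"}), frozenset({"DeptId"})),
    (frozenset({"DeptId"}), frozenset({"DeptName"})),
]

for lhs, rhs in transitive_violations(ATTRS, KEY, FDS):
    print(sorted(lhs), "->", sorted(rhs))
```

Here DeptName depends on the key only through DeptId, which is the kind of dependency 3NF rules out.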
thank you
The only NF (normal form) that matters is 5NF.
A relation (value or variable) is in 5NF when for every way it can be losslessly decomposed the components can be joined back in some order where the common columns of each join are a superkey of the original. (Fagin's PJ/NF paper's membership algorithm.)
This allows a table to be the join of others with overlapping meanings but without update anomalies. (Although update anomalies cease at ETNF, between 4NF & 5NF.)
Anyway if you wanted a lower NF you should normalize to 5NF then denormalize. The main reason people settle for lower NFs is ignorance. There are certain costs & benefits, but people don't know or address them--code must restrict updates to account for the problematic update anomalies. Normalization to a given NF is not done by going through lower NFs; one uses an appropriate algorithm for the NF one wants. (This is made clear by most textbooks, although some wrongly say to move through lower NFs, but putting into a lower NF can prevent good higher-NF versions of the original from turning up later.)
PS There is no single notion 1NF and all it has in common with higher NFs is that both seek "better" designs.
From what I recall of the process, it's a method you follow to reach a state where the storage and search facilities of the database are fully optimised. Yes, 3NF does encapsulate the rules below it (1NF and 2NF), but it is far easier to unpick the data if you start at the easier forms of normalization and see whether your data is in an efficient format for storage in an RDBMS or SQL-based database. Jumping straight in at a higher normal form makes the whole process harder and more intimidating for beginners, and can lead them to analyse the data incorrectly. To be honest, it will also make hard work of difficult data structures that are not just the usual invoice, invoice lines and address stuff that you tend to deal with day in and day out. Going through the process of normalization sometimes has value in unpicking data structures that were not obvious from the start, which not only makes your data more efficient but also helps you reason about what you are trying to accomplish.

4th Normal Form of table not met

Below is a graph of a database to be used to manage university student enrolment and grades across multiple years. Below are the listed requirements for the database
Students must be able to be a part of a class
A class must teach a subject
Each class may have 0 or more courseworks
Each class will have one exam
Each class can be taught by more than one lecturer
Coursework can only be set by one lecturer
Coursework and exams can be marked by different staff than who set them, and the staff member marking it must be able to be identified and recorded.
It is necessary to specify whether an exam taken is being taken for the first time or is a resit
I think the database is now in 4th normal form, and is represented in the table below.
The key represents the primary key for that table, and a green arrow means it is a foreign key.
Can anyone spot any errors or suggest ways to improve it?
Not enough information here to tell whether you are satisfying any Normal Form or not. We can only guess at some dependencies.
For example, "Each class will have one exam" seems to be saying that class→exam. Your Exam table on the other hand satisfies the dependency examID→classID, which is not one of your requirements. I can't tell from your diagram if classID is a candidate key in the Exam table. It also looks like examTaken would not be in 4NF if the classID→examID is one of the dependencies to be satisfied.
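The difference between the two directions of that dependency can be seen on a few sample rows. This is only a sketch: the rows are invented, and `fd_holds` is a hypothetical helper. Sample data can show that a dependency is violated, but it cannot prove that the dependency is a requirement.

```python
def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on the lhs columns but differ on the rhs columns."""
    seen = {}
    for row in rows:
        left = tuple(row[col] for col in lhs)
        right = tuple(row[col] for col in rhs)
        if seen.setdefault(left, right) != right:
            return False
    return True

# Invented rows: class 10 has two exams, which the Exam table design allows.
exam_rows = [
    {"examID": 1, "classID": 10},
    {"examID": 2, "classID": 10},
    {"examID": 3, "classID": 20},
]

print(fd_holds(exam_rows, ["examID"], ["classID"]))
print(fd_holds(exam_rows, ["classID"], ["examID"]))
```

With these rows examID→classID holds while classID→examID does not, so a design that only enforces the first would not stop a class from having more than one exam.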
From a practical data modelling point of view 4NF is not very important. 5NF is more important. Is this homework? If so I'd suggest you write down the attributes and dependencies before you start drawing a diagram. You seem to have created far more attributes than are suggested by the statement of requirements.
Obviously the cardinality between coursework and courseworktaken cannot be 1:1.
(Why are some lines dotted and others not?)

Why should zipcode values not be placed in Boyce Codd Normal Form?

Can someone explain to me why it is that zipcodes should not be placed in Boyce Codd Normal Form? Is there really any more to it other than that zipcodes are unlikely to change in any foreseeable point in time?
You should only place zip codes in 3NF or BCNF if your intention is to lookup other information based on them (such as locale). In that context, a zip code becomes a "natural key."
Absent that context, there doesn't seem to be much point. In most applications, a zip code is merely treated as a bit of text, and doesn't have any contextual meaning otherwise.
Zipcode is an attribute whereas BCNF is a property satisfied by a relation or set of relations. As a general rule, aim to be in at least BCNF unless and until you have a good reason to deviate from that. On that basis I'd suggest that relations with a zipcode attribute ought to be in BCNF. What makes you think otherwise?
Assuming you are talking about 2NF, not the minor difference between 3NF and BCNF, (cause zipcodes don't seem to be relevant to BCNF), then:
Yes, that, and the fact that it is unnecessarily obtuse, saves only one byte of storage, (zipcodes are five chars and can be stored in 5 bytes, an integer Foreign Key is 4 bytes), and requires an additional join to retrieve the value.

Is this a correct explanation of the first 3 normal forms in database normalization?

I tried to consolidate everything I learned about normalization in this blog post
http://geekyisawesome.blogspot.com/2011/03/database-normalization-1-2-3-nf.html
but I need to make sure that I understood everything correctly. Could you notify me of any mistakes?
Thanks
Normalization doesn't mean "replace values with ID numbers".
Normalization also doesn't involve terms like weak entity, bridge table, or junction table.
I wouldn't say there are any mistakes. The examples are sound. I like the fact that you showed a couple of different ways of doing 1NF.
I would say that the post is a little bit confusing. Perhaps you might consider laying out a precise statement of what each NF is as you get to it and include a short description of what the attending anomalies are for 1NF and 2NF. That way, when you go through your sample relations, it will be clearer what the problems are and why the next NF is a solution rather than just another way of doing it. I found the transitions from one NF to the next weren't crystal clear. A neophyte would benefit more from clearer distinctions between each NF, since it can be hard to keep straight in your head at first, as you pointed out in your introduction.
I like how 3NF can be summed up in the old adage: "The key, the whole key, and nothing but the key, so help me Codd." This is very succinct and highlights all of the important attributes of a relation in 3NF. Each attribute must depend on the key (1NF) the whole key (2NF) and nothing but the key (3NF). This is useless for explaining normalization but it's a great way to remember it once you've learned it.

What is a good KISS description of Boyce-Codd normal form?

What is a KISS (Keep it Simple, Stupid) way to remember what Boyce-Codd normal form is and how to take an unnormalized table and BCNF it?
Wikipedia's info: not terribly helpful for me.
Chris Date's definition is actually quite good, so long as you understand what he means:
Each attribute
Your data must be broken into separate, distinct attributes/columns/values which do not depend on any other attributes. Your full name is an attribute. Your birthdate is an attribute. Your age is not an attribute, it depends on the current date which is not part of your birthdate.
must represent a fact
Each attribute is a single fact, not a collection of facts. Changing one bit in an attribute changes the whole meaning. Your birthdate is a fact. Is your full name a fact? Well, in some cases it is, because if you change your surname your full name is different, right? But to a genealogist you have a surname and a family name, and if you change your surname your family name does not change, so they are separate facts.
about the key,
One attribute is special, it's a key. The key is an attribute that must be unique for all information in your data and must never change. Your full name is not a key because it can change. Your Social Insurance Number is not a key because they get reused. Your SSN plus birthdate is not a key, even if the combination can never be reused, because an attribute cannot be a combination of two facts. A GUID is a key. A number you increment and never reuse is a key.
the whole key,
The key alone must be sufficient [and necessary!] to identify your values; you cannot have the same data represented by different keys, nor can a subset of the key columns be sufficient to identify the fact.
Suppose you had an address book with a GUID key, name and address values. It is OK to have the same name appearing twice with different keys if they represent different people and are not the "same data".
If Mary Jones in accounting changes her name to Mary Smith, Mary Jones in Sales does not change her name as well.
On the other hand, if Mary Smith and John Smith have the same street address and it really is the same place, this is not allowed. You have to create a new key/value pair with the street address and a new key.
You are also not allowed to use the key for this new single street address as a value in the address book since now the same street address key would be represented twice.
Instead, you have to make a third key/value pair with values of the address book key and the street address key; you find a person's street address by matching their book key and address key in this group of values.
and nothing but the key
There must be nothing other than the key that identifies your values. For example, if you are allowed an address of "The Taj Mahal" (assuming there is only one) you are not allowed a city value in the same record,
since if you know the address you would also know the city. This would also open up the possibility of there being more than one Taj Mahal in a different city.
Instead, you have to again create a secondary Location key with unique values like the Taj, the White House in DC, and so on, and their cities.
Or forbid "addresses" that are unique to a city.
So help me, Codd.
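The address book example above (people, street addresses, and a third group of values linking the two) might be sketched like this in Python. The keys `p1`, `p2`, `a1` and the sample values are invented; in the answer's terms each key would be a GUID or a never-reused number.

```python
# One fact per key: who the person is.
people = {"p1": "Mary Smith", "p2": "John Smith"}

# One fact per key: what the street address is. Stored exactly once.
addresses = {"a1": "12 Elm Street"}

# Third group of values: pairs of (address book key, street address key).
residences = [("p1", "a1"), ("p2", "a1")]

def street_addresses(person_key):
    """Find a person's street addresses by matching keys in the pair list."""
    return [addresses[a] for p, a in residences if p == person_key]

print(street_addresses("p1"), street_addresses("p2"))

# Correcting the shared address is a single update, seen by both people.
addresses["a1"] = "14 Elm Street"
print(street_addresses("p1"), street_addresses("p2"))
```

Because the street address lives under one key only, Mary and John cannot end up with two diverging copies of what is really the same place.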
Here are some helpful excerpts from the Wikipedia page on Third Normal Form:
Bill Kent defines Third Normal Form this way:
Each non-key attribute "must provide a fact about the key, the whole key, and nothing but the key." Requiring that non-key attributes be dependent on "the whole key" ensures that a table is in 2NF; further requiring that non-key attributes be dependent on "nothing but the key" ensures that the table is in 3NF.
Chris Date adapts Kent's mnemonic to define Boyce-Codd Normal Form:
"Each attribute must represent a fact about the key, the whole key, and nothing but the key." Here the requirement is concerned with every attribute in the table, not just non-key attributes.
This comes into play when a table has multiple compound candidate keys, and an attribute within one candidate key has a dependency on a part of another candidate key. Third Normal Form wouldn't prohibit this, because it excludes key attributes. But BCNF applies the rule to key attributes as well.
As for how to make a table satisfy BCNF, you need to represent the extra dependency, with another attribute and possibly by splitting attributes into another table.
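A commonly used illustration of a table that is in 3NF but not in BCNF is Enrolment(Student, Course, Teacher), where each teacher teaches one course. The Python sketch below assumes that made-up relation and its two FDs; the function names are hypothetical and the checks only look at the FDs that are listed.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attributes, fds):
    """Brute force: the minimal attribute sets whose closure is the whole relation."""
    attributes = frozenset(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(sorted(attributes), size):
            candidate = frozenset(subset)
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys

def bcnf_violations(attributes, fds):
    """Listed nontrivial FDs whose left side is not a superkey."""
    attributes = frozenset(attributes)
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs and closure(lhs, fds) != attributes]

def third_nf_violations(attributes, fds):
    """Like BCNF, but an FD is excused when its right side is made of key attributes."""
    attributes = frozenset(attributes)
    prime = frozenset().union(*candidate_keys(attributes, fds))
    return [(lhs, rhs) for lhs, rhs in fds
            if not rhs <= lhs | prime and closure(lhs, fds) != attributes]

# Hypothetical relation with two overlapping compound candidate keys.
ATTRS = {"Student", "Course", "Teacher"}
FDS = [
    (frozenset({"Student", "Course"}), frozenset({"Teacher"})),
    (frozenset({"Teacher"}), frozenset({"Course"})),
]

print([sorted(k) for k in candidate_keys(ATTRS, FDS)])
print(third_nf_violations(ATTRS, FDS))
print(bcnf_violations(ATTRS, FDS))
```

Teacher -> Course passes the 3NF test only because Course is part of a candidate key; BCNF does not grant that exemption, so the same FD is reported as a violation.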
I googled "boyce codd normal form" and after wikipedia this is the second result. My textbook gives a very simple definition in terms of relational database management systems:
The left side of every nontrivial FD must be a superkey.
-"Database Systems The Complete Book" by Garcia-Molina, Ullman and Widom.
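With that definition, one common pencil-and-paper recipe for "BCNF-ing" a table is: find an FD X -> Y whose left side is not a superkey, split the table into (X plus everything X determines) and (X plus the rest), and repeat on both parts. Here is a minimal Python sketch of that recipe. The Address(Street, City, Zip) relation and its FDs are an assumed example, the function names are invented, and the brute-force search is only suitable for small relations.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def find_violation(relation, fds):
    """Return (X, X+) for some X that determines something but is not a superkey."""
    for size in range(1, len(relation)):
        for subset in combinations(sorted(relation), size):
            x = frozenset(subset)
            x_plus = closure(x, fds) & relation
            if x_plus != x and x_plus != relation:
                return x, x_plus
    return None

def bcnf_decompose(relation, fds):
    """Split on a violating X -> Y until no relation has a violation left."""
    relation = frozenset(relation)
    violation = find_violation(relation, fds)
    if violation is None:
        return [relation]
    x, x_plus = violation
    left = x_plus                    # X together with everything it determines
    right = x | (relation - x_plus)  # X together with the remaining attributes
    return bcnf_decompose(left, fds) + bcnf_decompose(right, fds)

# Assumed example: a zip code determines the city; street and city determine the zip.
ATTRS = {"Street", "City", "Zip"}
FDS = [
    (frozenset({"Street", "City"}), frozenset({"Zip"})),
    (frozenset({"Zip"}), frozenset({"City"})),
]

for part in bcnf_decompose(ATTRS, FDS):
    print(sorted(part))
```

For this input the result is (City, Zip) and (Street, Zip). Note that the dependency {Street, City} -> Zip can no longer be checked within a single table afterwards: a BCNF decomposition is always possible, but it does not always preserve every dependency.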
The best informal answer I've read is that, in BCNF, every "arrow" in every functional dependency is an "arrow" out of a candidate key. I don't recall the source, but it was probably something Chris Date wrote.
Basically Boyce-Codd is "fifth normal form". It is visually recognizable by the existence of "Attributive entities" in the data model, for things like Types (e.g. roles, status, process state, location-type, phone-type, etc).
The attributive entities (sub-subtypes) are lists of finite sets of values that further categorize a class level entity. So you may have a phone-type ('mobile', 'desk', 'VOIP'), email account type ('business', 'personal', 'gaming'), role (project manager, data modeler, super model) etc.
Another morphological clue is the existence of super-types (aka master-classes, super-classes, meta-entities) such as Parties (subtypes being company, person, etc.).
It's basically Taxonomy gone wild (..no the video is not that exciting) to the atomic or leaf-level; see Bill Karwin's comment above for a more technical explanation.
Boyce-Codd level models are essentially highly detailed logical models, derived from more simplistic business-based conceptual models. They are typically NOT implemented verbatim in the PHYSICAL model, because PDM optimization for performance (or functional simplicity) may result in the super-types and attributive entities being managed as drop-down lists in UIs, or in behind the scenes logic in the application, or in database constraints and methods to enforce referential integrity. (i.e. they may end up as look-up tables in the PDM schema, or they may be handled by code and not represented in the database).
So - why do them if they may not end up in the PDM? For the same reason you build a good 3NF model before you 'optimize', so that the database structure reflects the real world and is hence more stable than the typical kludges we inherit and have to do heroic acts to make work as our business/clients requirements change.
Often times it is easiest to listen to your gut and this will come naturally. Generally speaking, if you meet 3NF you have met BCNF. This doesn't cover detailed analysis of an ERD or have examples but there are thirteen rules according to Codd. I find it best to follow these rules but always remember there is no one correct way to do things so follow them loosely. So regarding the RDBMS, here are the rules:
http://www.87android.com/12-rules-of-relational-database-model-by-codd/
This may not answer the question directly, but if you are asking about how to get to BCNF or an easy way to remember it then you don't understand normalization well enough. That is of no concern though. Relational databases take many forms and very few are done well. The best thing you can do is know what it means to be relational, follow the rules above, and do not worry about the level of normalization. The process of normalization eliminates the duplication of data. Each level more so by moving into migration of functional dependencies. Keep that in mind and you will be fine, your gut and intellect will do the rest.
