BCNF normalization - database

Could you please provide me with an article that gives an example of a DB design that is in 3NF but not in BCNF and then illustrates how to convert it to BCNF? All the articles that I saw which try to explain BCNF give examples of tables that are in 1NF and then convert them to BCNF. This doesn't let me see the difference between 3NF and BCNF.
Thanks in advance

An example with overlapping keys reveals the difference. Take the predicate [P] and its matching constraints (c x.y).
[P] Employee EMP, with email EMAIL, took course CRS in year YR.
(c 1.1) For each employee and course; that employee took that course at most once; it is possible that more than one employee took that course.
(c 1.2) For each employee and course; that employee took that course in exactly one year.
(c 1.3) For each employee and year; it is possible that the employee took more than one course in that year.
(c 1.4) For each course and year; it is possible that more than one employee took that course in that year.
(c 2.1) For each employee, that employee has exactly one email.
(c 2.2) For each email, exactly one employee has that email.
(c 3.1) For each email and course; employee with that email took that course at most once; it is possible that more than one employee with that email took that course.
(c 3.2) For each email and course; employee with that email took that course in exactly one year.
(c 3.3) For each email and year; it is possible that employee with that email took more than one course in that year.
(c 3.4) For each course and year; it is possible that more than one employee with specific email took that course in that year.
Note how verbalizing constraints intuitively reveals the problem. See how constraints c 3.x match (repeat) c1.x due to c 2.x.
R {EMP, EMAIL, CRS, YR}
KEY {EMP, CRS}
KEY {EMAIL, CRS}
The FDs for this are
FD {EMP, CRS} --> {YR}
FD {EMAIL, CRS} --> {YR}
FD {EMP} --> {EMAIL}
FD {EMAIL} --> {EMP}
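So, considering each one of these as FD X --> Y, at least one of the following holds:
X is a superkey,
Y is a subkey.
Therefore R is in 3NF.
For BCNF, the requirement is that for any nontrivial FD X --> Y in R, X is a superkey. Here {EMP} --> {EMAIL} and {EMAIL} --> {EMP} are nontrivial and neither {EMP} nor {EMAIL} is a superkey, so R is not in BCNF.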
Here is a check-list for 2NF to BCNF:
------------------------------------------
 For each nontrivial   |        NF
 FD X --> Y, at least  |
 one holds             | 2nd   3rd   BCNF
------------------------------------------
 X is a superkey       |  ✔     ✔     ✔
 Y is a subkey         |  ✔     ✔
 X is not a subkey     |  ✔
------------------------------------------
FD X --> Y is trivial iff Y ⊆ X
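The 3NF and BCNF checks above can be run mechanically. The following is a minimal sketch, not a definitive implementation: the helper names (`closure`, `candidate_keys`, `is_3nf`, `is_bcnf`) are made up for illustration, and the key search is brute force, which is only reasonable for a handful of attributes.

```python
from itertools import combinations

ATTRS = frozenset({"EMP", "EMAIL", "CRS", "YR"})
FDS = [
    (frozenset({"EMP", "CRS"}), frozenset({"YR"})),
    (frozenset({"EMAIL", "CRS"}), frozenset({"YR"})),
    (frozenset({"EMP"}), frozenset({"EMAIL"})),
    (frozenset({"EMAIL"}), frozenset({"EMP"})),
]

def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(attrs, fds):
    """Brute-force search for minimal attribute sets whose closure is attrs."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            s = frozenset(combo)
            # Sizes are visited in increasing order, so any superkey that
            # contains an already-found key is not minimal and is skipped.
            if closure(s, fds) == attrs and not any(k <= s for k in keys):
                keys.append(s)
    return keys

def is_bcnf(attrs, fds):
    """Every nontrivial FD must have a superkey on the left."""
    return all(rhs <= lhs or closure(lhs, fds) == attrs for lhs, rhs in fds)

def is_3nf(attrs, fds):
    """Left side is a superkey, or every new right-side attribute is prime."""
    prime = frozenset().union(*candidate_keys(attrs, fds))
    return all(
        closure(lhs, fds) == attrs or (rhs - lhs) <= prime
        for lhs, rhs in fds
    )

print(candidate_keys(ATTRS, FDS))
print("3NF:", is_3nf(ATTRS, FDS), "BCNF:", is_bcnf(ATTRS, FDS))
```

For this R the search finds the two overlapping keys {EMP, CRS} and {EMAIL, CRS}; the 3NF test passes because EMP and EMAIL are both prime, while the BCNF test fails on {EMP} --> {EMAIL}.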
Now we could decompose R into:
{EMP, EMAIL} {EMP, CRS, YR}
OR
{EMP, EMAIL} {EMAIL, CRS, YR}
which eliminates those two FDs to subkeys.
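To see that nothing is lost by the first decomposition, one can project a table value onto {EMP, EMAIL} and {EMP, CRS, YR} and join the pieces back. This is only a sketch: the sample rows are invented for illustration, and `project` and `natural_join` are hypothetical helpers, not part of any library.

```python
COLS = ("EMP", "EMAIL", "CRS", "YR")

# Invented sample rows that respect the constraints c 1.x and c 2.x.
rows = {
    ("e1", "a@x.org", "math", 2020),
    ("e1", "a@x.org", "art", 2021),
    ("e2", "b@x.org", "math", 2021),
}

def project(rows, cols, keep):
    """Keep only the named columns and drop duplicate rows."""
    idx = [cols.index(c) for c in keep]
    return {tuple(r[i] for i in idx) for r in rows}

def natural_join(left, lcols, right, rcols):
    """Join two row sets on all column names they share."""
    common = [c for c in lcols if c in rcols]
    out_cols = tuple(lcols) + tuple(c for c in rcols if c not in lcols)
    out = set()
    for l in left:
        for r in right:
            if all(l[lcols.index(c)] == r[rcols.index(c)] for c in common):
                extra = tuple(r[rcols.index(c)] for c in rcols if c not in lcols)
                out.add(l + extra)
    return out, out_cols

emp_email = project(rows, COLS, ("EMP", "EMAIL"))
emp_crs_yr = project(rows, COLS, ("EMP", "CRS", "YR"))
joined, joined_cols = natural_join(
    emp_email, ("EMP", "EMAIL"), emp_crs_yr, ("EMP", "CRS", "YR")
)
print(joined_cols, joined == rows)
```

The join gives back the original rows because the shared column EMP is a key of {EMP, EMAIL}. One sample table cannot prove this for every valid state; the FD {EMP} --> {EMAIL} is what guarantees it.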
And finally, note that after decomposition into {EMP, EMAIL} {EMP, CRS, YR} or into {EMP, EMAIL} {EMAIL, CRS, YR} these tables are now all in 5NF -- actually in 6NF, but that's not important now. It is important to observe that it is possible to get into 5NF -- and hence into: (4, BCNF, ..., 1) -- just by using logic, verbalizing predicate and constraints. In other words, for a developer:
your tables can be in high NF even if you have no idea what all this terminology means.

Related

How to normalize the schema to BCNF

I am having some issues with normalization. I have a schema REPAYMENT which looks like this:
Now, from what I've gathered, the functional dependencies that hold in the schema are
{borrower_id} --> {name, address, request_date, loan_amount}
{request_date} --> {repayment_date, loan_amount}
{loan_amount} --> {repayment_amount}
(correct me if I'm wrong?)
I'm supposed to normalise the schema to BCNF, but I'm a bit confused. Is the candidate key request_date and borrower_id?
It can be used to register information on the repayments on micro loans. A borrower, his name and address, are identified with a unique borrower_id. Borrowers can have multiple loans at the same time, but each of those loans (specified by loan_amount, repayment_date and repayment_amount) has a different request date. Thus a loan can be identified with the borrower ID and the request date of the loan. The borrower can repay multiple (different) loans on the same date, but each loan can only be repaid once (on one date with one amount). There is a system which for each request date and amount of a loan determines the repayment date and amount to be repaid. The loan amount requested and the repaid amount are not the same since there is an interest rate that applies.
From the definition of candidate key:
In the relational model of databases, a candidate key of a relation is
a minimal superkey for that relation; that is, a set of attributes
such that:
1. The relation does not have two distinct tuples (i.e. rows or records in common database language) with the same values for these
attributes (which means that the set of attributes is a superkey)
2. There is no proper subset of these attributes for which (1) holds (which means that the set is minimal).
Now your question :
Is the candidate key request_date and borrower_id?
It is a superkey, but not a minimal one. Here's how we compute the candidate key.
Which attribute occurs only on the left side, considering all the F.D.'s?
It's borrower_id. This means that it must be a part of every key of this given schema. Now let us compute its closure.
Because of {borrower_id} --> {name, address, request_date, loan_amount}:
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount.
Because of {request_date} --> {repayment_date, loan_amount} and closure(borrower_id) has request_date, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date
And finally because of {loan_amount} --> {repayment_amount} and closure(borrower_id) has loan_amount, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date, repayment_amount
Because closure of borrower_id contains all the attributes, borrower_id is a key and since it is minimal, it is indeed the candidate key and the only one.
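The closure computation above can be traced step by step in code. This is a small sketch that takes the three F.D.'s from the question at face value; the function name `closure_steps` is made up, and the attribute list is assumed to be the seven attributes that appear in those F.D.'s.

```python
ALL = frozenset({
    "borrower_id", "name", "address", "request_date",
    "loan_amount", "repayment_date", "repayment_amount",
})

# The F.D.'s exactly as listed in the question.
FDS = [
    ({"borrower_id"}, {"name", "address", "request_date", "loan_amount"}),
    ({"request_date"}, {"repayment_date", "loan_amount"}),
    ({"loan_amount"}, {"repayment_amount"}),
]

def closure_steps(start, fds):
    """Return the growing attribute set after each F.D. that fires."""
    result = set(start)
    steps = [frozenset(result)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                steps.append(frozenset(result))
                changed = True
    return steps

steps = closure_steps({"borrower_id"}, FDS)
for s in steps:
    print(sorted(s))
print("borrower_id is a superkey:", steps[-1] == ALL)
```

Each printed line corresponds to one of the three "because of" steps in the text; the last one contains every attribute.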
Now let us decompose the schema into BCNF. The algorithm is:
Given a schema R.
Compute keys for R.
Repeat until all relations are in BCNF.
Pick any R' having a F.D A --> B that violates BCNF.
Decompose R' into R1(A,B) and R2(A,Rest of attributes).
Compute F.D's for R1 and R2.
Compute keys for R1 and R2.
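The loop above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the one true algorithm: the helper names are made up, only the three F.D.'s listed in the question are supplied, and the F.D.'s of each sub-relation are found implicitly by taking closures and intersecting with that sub-relation's attributes (exponential, so only for small schemas).

```python
from itertools import combinations

def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def find_violation(rel, fds):
    """Return (A, B) for a nontrivial A -> B inside rel where A is not a
    superkey of rel, or None if rel is in BCNF."""
    for size in range(1, len(rel)):
        for combo in combinations(sorted(rel), size):
            lhs = frozenset(combo)
            determined = closure(lhs, fds) & rel
            rhs = determined - lhs
            if rhs and determined != rel:
                return lhs, rhs
    return None

def bcnf_decompose(rel, fds):
    """Split rel into R1(A, B) and R2(A, rest) until nothing violates BCNF."""
    rel = frozenset(rel)
    violation = find_violation(rel, fds)
    if violation is None:
        return [rel]
    lhs, rhs = violation
    return bcnf_decompose(lhs | rhs, fds) + bcnf_decompose(rel - rhs, fds)

ALL = frozenset({
    "borrower_id", "name", "address", "request_date",
    "loan_amount", "repayment_date", "repayment_amount",
})
FDS = [
    (frozenset({"borrower_id"}),
     frozenset({"name", "address", "request_date", "loan_amount"})),
    (frozenset({"request_date"}), frozenset({"repayment_date", "loan_amount"})),
    (frozenset({"loan_amount"}), frozenset({"repayment_amount"})),
]

result = bcnf_decompose(ALL, FDS)
for r in result:
    print(sorted(r))
```

Which tables come out depends on which violating F.D. is picked first and on which F.D.'s are supplied, so a run of this sketch need not match a hand derivation that picks a different F.D. or adds further ones.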
Since {request_date} --> {repayment_date, loan_amount} and request_date is not a key, it violates BCNF so we split schema into two relations:
R1(request_date,repayment_date,loan_amount)
R2(borrower_id,name,address,request_date,repayment_amount)
Clearly R1 is in BCNF. But R2 is NOT in BCNF, because we missed the following F.D.:
address --> name
and we know address is not the key, so we split the R2 further as:
R3(borrower_id,address,request_date,repayment_amount)
R4(address,name)
Now, clearly both R3 and R4 are in BCNF. Had we not split R2 further, we would end up storing the same combination of address and name for every loan the person takes, which is redundancy.

Functional dependency and many-to-many relationships

I have these fields:
A book_id
B book_title
C book_isbn
D book_year
G reader_id
H reader_name
I reader_birthday
L reader_phone
M reader_email
N reader_registration_date
O loan_id
P loan_date_issued
S loan_date_for_return
T loan_date_returned
U author_id
V author_name
W category_id
X category_name
and these dependencies:
A->BCD
G->HILMN
O->AGPST
U->AV
W->AX
After all calculations I get this:
R1 = ABCD k1 = {A} Books
R2 = GHILMN k2 = {G} Readers
R3 = AGOPST k3 = {O} Loans
R4 = AUV k4 = {U} Authors
R5 = AWX k5 = {W} Category
R6 = OUW k6 = {OUW} {Don’t know}
But this is not good because table Book has a many to many relationship with table Category, and so do the Book and Author tables.
I'm stuck. I think I'm doing something wrong from the start and after that all goes wrong. Maybe you have some example for this.
Let's treat "category" as "cover" of a "book"-as-object or "copy" of a "book"-as-text, where a "book" is associated with some O values unique to it. Then W -> A makes more intuitive sense. (Other FDs seem unintuitive too.)
Universal relations
Every table (base or query result) has a predicate (statement template) that a row makes either into a true statement (and goes in the table) or a false statement (and stays out). We say the table represents the business relationship/association characterized by the predicate. A guess at a predicate here is:
book A titled B with isbn C published in year D
was borrowed by a reader G named H born on date I
with phone# L and email address M registered on date N
in loan O issued on date P due on date S
and either it was returned on date T or it is not yet returned and T=NULL
and it was written by author U named V
and the library has A in cover/copy W named X
You seem to be using the "universal relation" decomposition design/normalization technique. But this is only applicable if your one table satisfies the "universal relation assumption". Which is that all your situations can be described via the one predicate and its one table.
Eg: Suppose you can have books that have not been loaned or users that haven't borrowed. Then the example predicate/table above could not record them. So a decomposition wouldn't be able to record them. So you would instead start with a different predicate/table. (Typically multiple ones.)
Eg: If the last line was "and A was borrowed in cover/copy W named X" then the table could hold a different value in a given situation than before. But depending on the borrowing policy the table could satisfy the same set of FDs.
What is the predicate for this table? If it's not what you guessed, your expectations might not be met.
Your decomposition
Let's ignore the properties of entities.
-- O is G borrowing A by U with W
A book_id
G reader_id
O loan_id
U author_id
W cover/copy_id
O->AG
U->A
W->A
The only CK is OUW. Here is an obvious decomposition to BCNF. It agrees with your version.
-- O is G borrowing A by someone with some cover/copy
-- O is G borrowing A
Loan(O,G,A)
-- some loan is somebody borrowing A by U with some cover/copy
-- the book of U is A
The_book_of_author(U,A)
-- some loan is somebody borrowing A by someone with W
-- the book of W is A
The_book_of_cover/copy(W,A)
-- O is somebody borrowing some book by U with W
-- O is the borrowing of the book of U and W
Author_and_cover/copy(O,U,W)
The original relation is the join of the components:
-- O is G borrowing A
and the book of U is A
and the book of W is A
and O is the borrowing of the book of U and W
-- O is G borrowing A by U with W
Loan JOIN The_book_of_author JOIN The_book_of_cover/copy JOIN Author_and_cover/copy
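The claim that OUW is the only CK for O->AG, U->A, W->A can be checked with a short argument in code. This is a sketch with made-up names: any attribute that appears on no right-hand side cannot be derived from anything else, so it must be in every key; if that "core" set already determines everything, it is the only CK.

```python
ATTRS = frozenset("AGOUW")
FDS = [
    (frozenset("O"), frozenset("AG")),
    (frozenset("U"), frozenset("A")),
    (frozenset("W"), frozenset("A")),
]

def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

# Attributes that no FD can produce must belong to every key.
derivable = frozenset().union(*(rhs for _, rhs in FDS))
core = ATTRS - derivable

print(sorted(core), closure(core, FDS) == ATTRS)
```

Since every key must contain O, U and W, and {O, U, W} is itself a superkey, no other minimal superkey can exist.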
this is not good because table Book has a many to many relationship with table Category, and so do the Book and Author tables
Unfortunately this is unintelligible. So I can't address what you mean to say is wrong.
Database design
If you generated this design yourself, you should be using some reference information modeling method. This will guide you to determine reasonable predicates/tables to record all the situations that can arise according to your business rules.
Predicates applied to what situations can arise determine what states can arise. Those valid states are described by constraints--FDs (functional dependencies), JDs (join dependencies), CKs (candidate keys), FKs (foreign keys) (aka "relationships" in a different sense than above), etc.
Part of a method is normalizing provisional tables to others. This uses FDs & JDs to decompose to an appropriate NF (normal form) via an appropriate algorithm. A good method always normalizes to 5NF. (Even if you denormalize it later for implementation reasons.)

Determining Super Key

According to Wikipedia
Today's Court Bookings
Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2)
A booking is defined by its Court and the period for which the Court is reserved
Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:
SAVER, for Court 1 bookings made by members
STANDARD, for Court 1 bookings made by non-members
PREMIUM-A, for Court 2 bookings made by members
PREMIUM-B, for Court 2 bookings made by non-members
The table's superkeys are:
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
S5 = {Court, Start Time, End Time}
S6 = {Rate Type, Start Time, End Time}
S7 = {Court, Rate Type, Start Time}
S8 = {Court, Rate Type, End Time}
ST = {Court, Rate Type, Start Time, End Time}, the trivial superkey
Note that even though in the above table Start Time and End Time
attributes have no duplicate values for each of them, we still have to
admit that in some other days two different bookings on court 1 and
court 2 could start at the same time or end at the same time. This is
the reason why {Start Time} and {End Time} cannot be considered as the
table's superkeys.
How is S1 = {Court, Start Time}, a super key?
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
the records in the table would be
{1,11:00,12:00, SAVER} and {1,11:00,12:00, STANDARD}
Clearly S1 = {Court, Start Time}, is not superkey. Or am I wrong?
This example is a poor choice because to understand what the table is supposed to hold involves unstated, although common sense, assumptions. It expects you to see that the table is only for one day--"Today"--and infer that on any day there will be no overlapping bookings. Ie no start-end time period for a court overlaps another one for the same court. (The text mentions different days when they mean different table values; but it doesn't matter to the example whether different values have to be on different days.)
It is also a poor choice for 3NF vs BCNF in particular. Of course it is subject to certain FDs (functional dependencies) and their associated JDs (join dependencies) relevant to 3NF vs BCNF. But the non-overlap of bookings is a separate constraint irrelevant to 3NF vs BCNF.
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
When we say that a table value "satisfies" a constraint (eg FD) or "is subject to" a constraint or "has" a constraint or that a constraint "holds in" a table value we mean that the value makes the constraint true. When we say this about a table variable (base table) we mean that it is so for the variable's value in every database state. For this table, describing the current booking situation for "Today", any particular booking situation will be about one day--Today. So the kind of overlapping involving different days in your quote is not relevant to the constraints. Similarly each table value from different times in the same day will satisfy the constraints itself regardless of how the bookings have changed.
Under those circumstances, for any state of the table the four specified sets of columns are CKs (candidate keys):
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
Because bookings don't overlap, a subrow value for each of these column sets is unique under those columns. So they are superkeys. Since that's true for no smaller subsets of each, they are CKs. Since it's true for no other column sets, there are no other CKs. Since every superset of a superkey is a superkey, the other listed sets are the other (non-CK) superkeys.
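A uniqueness test on one table value shows what "unique under those columns" means, and also why one value is not enough to declare a key. This is a sketch only: the rows below are illustrative sample bookings assumed for this example, and `is_unique` is a made-up helper.

```python
COLS = ("court", "start", "end", "rate")

# Illustrative bookings for one day (assumed data, no overlaps per court).
rows = [
    (1, "09:30", "10:30", "SAVER"),
    (1, "11:00", "12:00", "SAVER"),
    (1, "14:00", "15:30", "STANDARD"),
    (2, "10:00", "11:30", "PREMIUM-B"),
    (2, "11:30", "13:30", "PREMIUM-B"),
    (2, "15:00", "16:30", "PREMIUM-A"),
]

def is_unique(rows, cols, subset):
    """True if no two rows agree on all columns in subset."""
    idx = [cols.index(c) for c in subset]
    seen = set()
    for r in rows:
        key = tuple(r[i] for i in idx)
        if key in seen:
            return False
        seen.add(key)
    return True

for subset in [("court", "start"), ("court", "end"),
               ("rate", "start"), ("rate", "end"),
               ("start",), ("court",)]:
    print(subset, is_unique(rows, COLS, subset))
```

In this one value {start} also comes out unique, yet it is not a superkey, because the rule has to hold for every value the table can take and two courts may be booked from the same start time. A test like this can only refute a proposed key, never establish one.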
PS There are a few sections on that entry's talk page about the Tennis/Booking example and confusions on the page. The page has other poor examples. Eg it restructures the non-BCNF 3NF design to a BCNF design, but not by standard lossless decomposition to projections of the original (that join back to it). (It introduces a new column.) Eg it then also talks about preserving dependencies but that only makes sense when decomposing to projections of the original.

3nf functional dependency in wiki's example

I read the Wikipedia article about 3NF:
https://en.wikipedia.org/wiki/Third_normal_form
This is the example that it gives:
Tournament Winners
Tournament            | Year | Winner         | Winner Date of Birth
Indiana Invitational  | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open        | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters    | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational  | 1999 | Chip Masterson | 14 March 1977
It says that the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner.
I think a functional dependency means that
for two rows X1, X2, if X1.col1 = X2.col1 and
X1.col2 = X2.col2, then col1 -> col2
I cannot understand that Winner Date of Birth->Winner(there may be someone who have same birthday and same name?)
or how Winner -> candidate key {Tournament, Year} (given the winner name Al Fredrickson, it may be Indiana Invitational 1998 or Des Moines Masters 1999).
So, how does it jump to that conclusion?
Informally, a functional dependency means one value on the left side cannot produce multiple values on the right, even when the left side exists in more than one row. [1]
So, in Wikipedia example, there is a functional dependency Winner -> Winner Date of Birth, simply because the same winner cannot have different dates of birth even when he/she exists in multiple rows (because he/she won multiple tournaments).
Since...
{Tournament, Year} -> Winner (since one tournament cannot have multiple winners)
and Winner -> Winner Date of Birth (as explained above)
and not Winner -> {Tournament, Year} (since one person can win multiple tournaments)
...then by definition there is a transitive dependency.
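These three statements can be checked against the four rows of the Wikipedia table. The sketch below uses a made-up helper `fd_holds`; it tests whether one table value is consistent with a dependency, which can refute an FD but cannot prove that it holds as a rule for all future rows.

```python
COLS = ("Tournament", "Year", "Winner", "Winner Date of Birth")
rows = [
    ("Indiana Invitational", 1998, "Al Fredrickson", "21 July 1975"),
    ("Cleveland Open", 1999, "Bob Albertson", "28 September 1968"),
    ("Des Moines Masters", 1999, "Al Fredrickson", "21 July 1975"),
    ("Indiana Invitational", 1999, "Chip Masterson", "14 March 1977"),
]

def fd_holds(rows, cols, lhs, rhs):
    """True if rows that agree on lhs always agree on rhs."""
    li = [cols.index(c) for c in lhs]
    ri = [cols.index(c) for c in rhs]
    seen = {}
    for r in rows:
        left = tuple(r[i] for i in li)
        right = tuple(r[i] for i in ri)
        # setdefault stores the first right side seen for this left side.
        if seen.setdefault(left, right) != right:
            return False
    return True

print(fd_holds(rows, COLS, ("Tournament", "Year"), ("Winner",)))
print(fd_holds(rows, COLS, ("Winner",), ("Winner Date of Birth",)))
print(fd_holds(rows, COLS, ("Winner",), ("Tournament", "Year")))
```

The first two checks pass and the third fails, because Al Fredrickson appears with two different {Tournament, Year} values.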
I cannot understand that Winner Date of Birth->Winner(there may be someone who have same birthday and same name?)
You flipped the direction. The functional dependency is not "from" the single value, it's "toward" it. Therefore Winner -> Winner Date of Birth, but not Winner Date of Birth -> Winner.
BTW, there cannot be two different people with the same name in this model. A better (more realistic) model would probably use a surrogate key to identify people, allowing for duplicated names.
[1] This is compliant with the mathematical concept of "function". No matter how many times you "call" a function (i.e. how many rows contain the f.d. left side), it always produces the same "result" (the f.d. right side). If it could produce multiple results, it would not be a function, it would be a "relation".
From what I understand:
For any {Tournament, Year} you have only one winner. Each winner has only one date of birth. Wiki claims that this can lead to vulnerability:
Assume you have entered a new row: {"Stupid tournament", "2013", "Al Fredrickson", "21 July 2012"} - you've entered an incorrect date of birth!
If you keep another table {WinnerID, WinnerBirthday}, you'll prevent that.
What if an entry arrives for the same Winner with a different Date of Birth? That is possible, so how do we prevent it?
From the base
Because each row in the table needs to tell us who won a particular
Tournament in a particular Year, the composite key {Tournament, Year}
is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.
If relation R accepts a row with the same Winner name but a different Date of Birth, that row is still a unique record for the table (its {Tournament, Year} is new), but it should not be allowed. We need unique records, yet this shows that the same winner can exist in the table with two different Dates of Birth.
Even if we think of Duplication of Dates of Birth (for winners) we can
split that table in another table and can store {winner,winner date of
birth} to prevent duplication like wiki has shown.
reference
as there is nothing to stop the same person from being shown with
different dates of birth on different records.
That's why it needs to create another table to prevent duplication.

Need example on functional dependency

Can I get an example of a functional dependency in database concepts?
I understand that when a particular column is dependent on another column, it is called functionally dependent on that other one. But I am not able to visualize it with an example. Please help me out.
Let R be
NewStudent(stuId, lastName, major, credits, status, socSecNo)
FDs in R include
{stuId}→{lastName}, but not the reverse
{stuId} →{lastName, major, credits, status, socSecNo, stuId}
{socSecNo} →{stuId, lastName, major, credits, status, socSecNo}
{credits}→{status}, but not {status}→{credits}
ZipCode→AddressCity
16652 is Huntingdon’s ZIP
ArtistName→BirthYear
Picasso was born in 1881
Autobrand→Manufacturer, Engine type
Pontiac is built by General Motors with gasoline engine
Author, Title→PublDate
Shakespeare’s Hamlet was published in 1600
Trivial Functional Dependency
The FD X→Y is trivial if the set Y is a subset of the set X
Examples: If A and B are attributes of R,
{A}→{A}
{A,B} →{A}
{A,B} →{B}
{A,B} →{A,B}
are all trivial FDs and will not contribute to the evaluation of normalization.
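The triviality test is a plain subset check. A one-function sketch, with a made-up helper name:

```python
def is_trivial(lhs, rhs):
    """An FD lhs -> rhs is trivial when rhs is a subset of lhs."""
    return set(rhs) <= set(lhs)

print(is_trivial({"A", "B"}, {"A"}))   # right side is contained in the left
print(is_trivial({"A"}, {"A", "B"}))   # B is not on the left
```

The first call reports a trivial FD; the second does not, since B does not appear on the left side.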
For more details, check the source:
http://jcsites.juniata.edu/faculty/rhodes/dbms/funcdep.htm

Resources