How to normalize the schema to BCNF

I am having some issues with normalization. I have a schema REPAYMENT which looks like this:
Now, from what I've gathered, the functional dependencies that hold in the schema are
{borrower_id} --> {name, address, request_date, loan_amount}
{request_date} --> {repayment_date, loan_amount}
{loan_amount} --> {repayment_amount}
(correct me if I'm wrong?)
I'm supposed to normalise the schema to BCNF, but I'm a bit confused. Is the candidate key request_date and borrower_id?
It can be used to register information on the repayments on micro loans. A borrower, his name and address, are identified with a unique borrower_id. Borrowers can have multiple loans at the same time, but each of those loans (specified by loan_amount, repayment_date and repayment_amount) has a different request date. Thus a loan can be identified with the borrower ID and the request date of the loan. The borrower can repay multiple (different) loans on the same date, but each loan can only be repaid once (on one date with one amount). There is a system which for each request date and amount of a loan determines the repayment date and amount to be repaid. The loan amount requested and the repaid amount are not the same since there is an interest rate that applies.

From the definition of candidate key:
In the relational model of databases, a candidate key of a relation is
a minimal superkey for that relation; that is, a set of attributes
such that:
(1) The relation does not have two distinct tuples (i.e. rows or records in common database language) with the same values for these
attributes (which means that the set of attributes is a superkey).
(2) There is no proper subset of these attributes for which (1) holds (which means that the set is minimal).
Now your question :
Is the candidate key request_date and borrower_id?
It is a superkey, but not a minimal one, given the FDs as you listed them. Here's how we compute the candidate key.
Which attribute occurs only on the left-hand side, considering all the F.D.s?
It's borrower_id. This means that it must be part of every key of the given schema. Now let us compute its closure.
Because of {borrower_id} --> {name, address, request_date, loan_amount}:
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount.
Because of {request_date} --> {repayment_date, loan_amount} and closure(borrower_id) has request_date, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date
And finally, because of {loan_amount} --> {repayment_amount} and closure(borrower_id) has loan_amount, this means
closure(borrower_id) = borrower_id, name, address, request_date, loan_amount, repayment_date, repayment_amount
Because the closure of borrower_id contains all the attributes, borrower_id is a key. Since it is a single attribute it is minimal, and since it must be part of every key, it is the candidate key and the only one.
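The closure computation above can be checked mechanically. What follows is only a minimal sketch: the function name closure and the (left side, right side) set representation are illustrative choices, and the FDs are copied as the question states them, without judging whether they match the business rules.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # An FD fires once its whole left side is already in the closure.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# FDs exactly as listed in the question (an assumption, not a verified model).
FDS = [
    ({"borrower_id"}, {"name", "address", "request_date", "loan_amount"}),
    ({"request_date"}, {"repayment_date", "loan_amount"}),
    ({"loan_amount"}, {"repayment_amount"}),
]
ALL_ATTRS = {
    "borrower_id", "name", "address", "request_date",
    "loan_amount", "repayment_date", "repayment_amount",
}

borrower_closure = closure({"borrower_id"}, FDS)
request_closure = closure({"request_date"}, FDS)
print(borrower_closure == ALL_ATTRS)  # borrower_id determines every attribute
print(request_closure == ALL_ATTRS)   # request_date alone does not
```

The same helper shows that request_date on its own is not a key under these FDs, which is what makes its FD a BCNF violation.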
Now let us decompose the schema into BCNF. The algorithm is:
Given a schema R:
1. Compute keys for R.
2. Repeat until all relations are in BCNF:
   a. Pick any R' having an F.D. A --> B that violates BCNF.
   b. Decompose R' into R1(A,B) and R2(A, rest of attributes).
   c. Compute F.D.s for R1 and R2.
   d. Compute keys for R1 and R2.
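The "pick an F.D. that violates BCNF" step can be sketched in a few lines. This is a hedged illustration, not a full decomposition routine: bcnf_violations is a hypothetical helper that only inspects the determinants of the listed FDs, and the FDs are again the ones given in the question.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """Listed determinants that fall inside `relation`, determine something
    non-trivial there, and are not superkeys of `relation`."""
    violations = []
    for lhs, _ in fds:
        reach = closure(lhs, fds)
        determined = (reach & relation) - lhs
        if lhs <= relation and determined and not relation <= reach:
            violations.append((lhs, determined))
    return violations


# FDs as stated in the question (assumed, not verified against the text).
FDS = [
    ({"borrower_id"}, {"name", "address", "request_date", "loan_amount"}),
    ({"request_date"}, {"repayment_date", "loan_amount"}),
    ({"loan_amount"}, {"repayment_amount"}),
]
R = {
    "borrower_id", "name", "address", "request_date",
    "loan_amount", "repayment_date", "repayment_amount",
}

violations = bcnf_violations(R, FDS)
for lhs, determined in violations:
    print(sorted(lhs), "-->", sorted(determined))
```

Under these assumptions the helper reports request_date and loan_amount as violating determinants, while borrower_id passes because it is a superkey.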
Since {request_date} --> {repayment_date, loan_amount} and request_date is not a key, it violates BCNF so we split schema into two relations:
R1(request_date,repayment_date,loan_amount)
R2(borrower_id,name,address,request_date,repayment_amount)
Clearly R1 is in BCNF. But R2 is NOT in BCNF, because we missed the following F.D.:
address --> name
and we know address is not a key, so we split R2 further as:
R3(borrower_id,address,request_date,repayment_amount)
R4(address,name)
Now, clearly both R3 and R4 are in BCNF. Had we not split R2 further, we would end up storing the same combination of address and name for every loan the person takes, which is redundancy.

Related

BCNF normalization

Could you please provide me with an article that gives an example of a DB design that is in 3NF but not in BCNF and then illustrates how to convert it to BCNF? All the articles that I saw which try to explain BCNF give examples of tables that are in 1NF and then convert them to BCNF. This doesn't let me see the difference between 3NF and BCNF.
Thanks in advance

An example with overlapping keys reveals the difference; having the predicate [P] and matching constraints (c x.y).
[P] Employee EMP, with email EMAIL, took course CRS in year YR.
(c 1.1) For each employee and course; that employee took that course at most once; it is possible that more than one employee took that course.
(c 1.2) For each employee and course; that employee took that course in exactly one year.
(c 1.3) For each employee and year; it is possible that the employee took more than one course in that year.
(c 1.4) For each course and year; it is possible that more than one employee took that course in that year.
(c 2.1) For each employee, that employee has exactly one email.
(c 2.2) For each email, exactly one employee has that email.
(c 3.1) For each email and course; employee with that email took that course at most once; it is possible that more than one employee with that email took that course.
(c 3.2) For each email and course; employee with that email took that course in exactly one year.
(c 3.3) For each email and year; it is possible that employee with that email took more than one course in that year.
(c 3.4) For each course and year; it is possible that more than one employee with specific email took that course in that year.
Note how verbalizing constraints intuitively reveals the problem. See how constraints c 3.x match (repeat) c1.x due to c 2.x.
R {EMP, EMAIL, CRS, YR}
KEY {EMP, CRS}
KEY {EMAIL, CRS}
The FDs for this are
FD {EMP, CRS} --> {YR}
FD {EMAIL, CRS} --> {YR}
FD {EMP} --> {EMAIL}
FD {EMAIL} --> {EMP}
So, considering each one of these as FD X --> Y, it holds that either
X is a superkey, or
Y is a subkey.
Therefore R is in 3NF.
For BCNF, the requirement is that for any nontrivial FD X --> Y in R, X is a superkey.
Here is a check-list for 2NF to BCNF
---------------------------------------
For each nontrivial  |       NF
FD X --> Y           |
at least one holds   | 2nd  3rd  BCNF
---------------------------------------
X is a superkey      |  ✔    ✔    ✔
Y is a subkey        |  ✔    ✔
X is not a subkey    |  ✔
---------------------------------------
FD X --> Y is trivial iff Y ⊆ X
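The 3NF and BCNF rows of the check-list can be applied to R mechanically. The sketch below is illustrative only: candidate_keys brute-forces attribute subsets (fine for four attributes, not for large schemas), and the helper names are made up for this example.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Minimal attribute sets whose closure covers `attrs` (brute force)."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) >= attrs:
                keys.append(candidate)
    return keys


ATTRS = {"EMP", "EMAIL", "CRS", "YR"}
FDS = [
    ({"EMP", "CRS"}, {"YR"}),
    ({"EMAIL", "CRS"}, {"YR"}),
    ({"EMP"}, {"EMAIL"}),
    ({"EMAIL"}, {"EMP"}),
]
KEYS = candidate_keys(ATTRS, FDS)


def x_is_superkey(lhs):
    return closure(lhs, FDS) >= ATTRS


def y_is_subkey(rhs):
    return any(rhs <= key for key in KEYS)


in_3nf = all(x_is_superkey(lhs) or y_is_subkey(rhs) for lhs, rhs in FDS)
in_bcnf = all(x_is_superkey(lhs) for lhs, _ in FDS)
print(in_3nf, in_bcnf)
```

For these FDs the two overlapping keys come out as {EMP, CRS} and {EMAIL, CRS}; the FDs between EMP and EMAIL pass the 3NF test through the "Y is a subkey" row but fail the BCNF test.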
Now we could decompose R into:
{EMP, EMAIL} {EMP, CRS, YR}
OR
{EMP, EMAIL} {EMAIL, CRS, YR}
which eliminates those two FDs to subkeys.
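Either decomposition can be joined back without loss. A hedged way to check that for a two-way split is the standard test that the shared attributes must determine all of one side; is_lossless_binary is a hypothetical name used only for this sketch.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_lossless_binary(r1, r2, fds):
    """A split into r1 and r2 is lossless when the common attributes
    functionally determine all of r1 or all of r2."""
    reach = closure(r1 & r2, fds)
    return r1 <= reach or r2 <= reach


FDS = [
    ({"EMP", "CRS"}, {"YR"}),
    ({"EMAIL", "CRS"}, {"YR"}),
    ({"EMP"}, {"EMAIL"}),
    ({"EMAIL"}, {"EMP"}),
]

split_on_emp = is_lossless_binary({"EMP", "EMAIL"}, {"EMP", "CRS", "YR"}, FDS)
split_on_email = is_lossless_binary({"EMP", "EMAIL"}, {"EMAIL", "CRS", "YR"}, FDS)
# A deliberately bad split for contrast: CRS alone determines neither side.
bad_split = is_lossless_binary({"EMP", "CRS"}, {"CRS", "YR", "EMAIL"}, FDS)
print(split_on_emp, split_on_email, bad_split)
```

Both proposed splits pass because the shared attribute (EMP or EMAIL) is a key of the {EMP, EMAIL} table.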
And finally, note that after decomposition into {EMP, EMAIL} {EMP, CRS, YR} or into {EMP, EMAIL} {EMAIL, CRS, YR} these tables are now all in 5NF -- actually in 6NF, but that's not important now. It is important to observe that it is possible to get into 5NF -- and hence into: (4, BCNF, ..., 1) -- just by using logic, verbalizing predicate and constraints. In other words, for a developer:
your tables can be in high NF even if you have no idea what all this terminology means.

Determining Super Key

According to Wikipedia
Today's Court Bookings
Each row in the table represents a court booking at a tennis club that has one hard court (Court 1) and one grass court (Court 2)
A booking is defined by its Court and the period for which the Court is reserved
Additionally, each booking has a Rate Type associated with it. There are four distinct rate types:
SAVER, for Court 1 bookings made by members
STANDARD, for Court 1 bookings made by non-members
PREMIUM-A, for Court 2 bookings made by members
PREMIUM-B, for Court 2 bookings made by non-members
The table's superkeys are:
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
S5 = {Court, Start Time, End Time}
S6 = {Rate Type, Start Time, End Time}
S7 = {Court, Rate Type, Start Time}
S8 = {Court, Rate Type, End Time}
ST = {Court, Rate Type, Start Time, End Time}, the trivial superkey
Note that even though in the above table Start Time and End Time
attributes have no duplicate values for each of them, we still have to
admit that in some other days two different bookings on court 1 and
court 2 could start at the same time or end at the same time. This is
the reason why {Start Time} and {End Time} cannot be considered as the
table's superkeys.
How is S1 = {Court, Start Time}, a super key?
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
the records in the table would be
{1,11:00,12:00, SAVER} and {1,11:00,12:00, STANDARD}
Clearly S1 = {Court, Start Time}, is not superkey. Or am I wrong?

This example is a poor choice because to understand what the table is supposed to hold involves unstated, although common sense, assumptions. It expects you to see that the table is only for one day--"Today"--and infer that on any day there will be no overlapping bookings. Ie no start-end time period for a court overlaps another one for the same court. (The text mentions different days when they mean different table values; but it doesn't matter to the example whether different values have to be on different days.)
It is also a poor choice for 3NF vs BCNF in particular. Of course it is subject to certain FDs (functional dependencies) and their associated JDs (join dependencies) relevant to 3NF vs BCNF. But the non-overlap of bookings is a separate constraint irrelevant to 3NF vs BCNF.
Say on day 1, a member books court 1 from 11:00 to 12:00, and on day 2, a non member books court 1 from 11:00 to 12:00.
When we say that a table value "satisfies" a constraint (eg FD) or "is subject to" a constraint or "has" a constraint or that a constraint "holds in" a table value we mean that the value makes the constraint true. When we say this about a table variable (base table) we mean that it is so for the variable's value in every database state. For this table, describing the current booking situation for "Today", any particular booking situation will be about one day--Today. So the kind of overlapping involving different days in your quote is not relevant to the constraints. Similarly each table value from different times in the same day will satisfy the constraints itself regardless of how the bookings have changed.
Under those circumstances, for any state of the table the four specified sets of columns are CKs (candidate keys):
S1 = {Court, Start Time}
S2 = {Court, End Time}
S3 = {Rate Type, Start Time}
S4 = {Rate Type, End Time}
Because bookings don't overlap, a subrow value for each of these column sets is unique under those columns. So they are superkeys. Since that's true for no smaller subsets of each, they are CKs. Since it's true for no other column sets, there are no other CKs. Since every superset of a superkey is a superkey, the other listed sets are the other (non-CK) superkeys.
PS There are a few sections on that entry's talk page about the Tennis/Booking example and confusions on the page. The page has other poor examples. Eg it restructures the non-BCNF 3NF design to a BCNF design, but not by standard lossless decomposition to projections of the original (that join back to it). (It introduces a new column.) Eg it then also talks about preserving dependencies but that only makes sense when decomposing to projections of the original.

How to enforce unique 2-tuple on oracle table?

I am trying to enforce the property that table Match should have all unique tuples (Team 1, Team 2). However, let Team 1 = Detroit Pistons and Team 2 = Chicago Bulls. I do not want to allow (Detroit Pistons, Chicago Bulls) to be inserted into the table if (Chicago Bulls, Detroit Pistons) already exists.
How can I enforce this constraint?

A) The tuples are semantically identical. (I think this is your case.)
That means the tuple {Chicago Bulls, Detroit Pistons} means exactly the same thing as the tuple {Detroit Pistons, Chicago Bulls}. Use a CHECK constraint to impose an order on the two columns.
CHECK (column_1 < column_2)
That kind of constraint would allow {Chicago Bulls, Detroit Pistons}, but it would reject {Detroit Pistons, Chicago Bulls}. This is kind of like imposing a canonical form on otherwise free-form data.
B) The tuples are semantically distinct.
That means the tuple {Chicago Bulls, Detroit Pistons} means one thing, and the tuple {Detroit Pistons, Chicago Bulls} means something else. For example, the first attribute might mean "home team", and the second might mean "visiting team". In this case, all you need is some kind of unique constraint on the pair of columns.
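Case A can be tried out quickly. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in for Oracle, so it shows the idea rather than Oracle syntax; the table name game and the helper try_insert are made up for the example. A CHECK on column order is combined with a plain UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE game (
        team1 TEXT NOT NULL,
        team2 TEXT NOT NULL,
        CHECK (team1 < team2),   -- canonical order: team1 sorts first
        UNIQUE (team1, team2)    -- no repeated pair in that order
    )
    """
)
conn.execute("INSERT INTO game VALUES ('Chicago Bulls', 'Detroit Pistons')")


def try_insert(team1, team2):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO game VALUES (?, ?)", (team1, team2))
        return True
    except sqlite3.IntegrityError:
        return False


reversed_ok = try_insert("Detroit Pistons", "Chicago Bulls")   # breaks CHECK
duplicate_ok = try_insert("Chicago Bulls", "Detroit Pistons")  # breaks UNIQUE
other_ok = try_insert("Boston Celtics", "Chicago Bulls")       # new pair
print(reversed_ok, duplicate_ok, other_ok)
```

With this design the application has to put the pair in order before inserting; the database then rejects both the reversed pair and an exact repeat.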

You can create a unique function-based index:
CREATE UNIQUE INDEX unq_match ON match ( LEAST(team1,team2), GREATEST(team1,team2) );
LEAST() will get the "lesser" of the two teams (whether by ID or name, it doesn't matter) while GREATEST will get the "greater" of the two. Unfortunately this particular solution doesn't scale up to 3-or-more-tuples.
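The same idea can be sketched outside Oracle. The example below assumes SQLite (via Python's sqlite3), where the two-argument scalar functions min() and max() play the role of LEAST() and GREATEST() and indexes on expressions are available in reasonably recent versions; the names game and unq_game are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game (team1 TEXT NOT NULL, team2 TEXT NOT NULL)")
# Unique index over the ordered pair, whatever order the row was inserted in.
conn.execute(
    "CREATE UNIQUE INDEX unq_game "
    "ON game (min(team1, team2), max(team1, team2))"
)


def try_insert(team1, team2):
    """Return True if the row is accepted, False if the index rejects it."""
    try:
        conn.execute("INSERT INTO game VALUES (?, ?)", (team1, team2))
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = try_insert("Detroit Pistons", "Chicago Bulls")
reversed_ok = try_insert("Chicago Bulls", "Detroit Pistons")
print(first_ok, reversed_ok)
```

Unlike the CHECK approach, rows may be stored in either order; only the second arrival of the same unordered pair is rejected.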

Is it possible to find 4 distinct functional dependencies in this table?

My professor gave a task to find 4 distinct functional dependencies in the following table:
Company(Company_Name, Street_Address, City, Zip, State, CEO_Name)
"He also gave a note: Each company has a different (unique) address meaning (Street_Address, City, Zip, State) together form a key. Different companies may have the same name. Each company has exactly one CEO, and one person cannot be the CEO of more than one company. CEO names may not be unique (there may be 2 CEOs with the same name). To count 4 functional dependencies in a table with attributes (A, B, C, D): If A -> B then obviously A, C -> B as well. This should not count as 2 separate dependencies. On the other hand, A -> B and A -> C should be counted as 2 distinct functional dependencies."
But in my opinion, there are no 4 functional dependencies.
CEO, Company Name -> (Street_address, city, zip, state)
zip -> state
but since two companies can have the same name, there should also be a primary key like "Company_Number". But creating new tables is not the task...

Functional dependencies answer the question, "Given a single value for X, do we know one and only one value for Y?" Either X or Y may be sets of attributes, not just a single attribute. Keep this in mind when you're reading through this answer.
Each company has a different (unique) address meaning (Street_Address, City, Zip, State) together form a key.
By definition, that key means that
Street_Address, City, Zip, State -> CEO_Name
Street_Address, City, Zip, State -> Company_Name
That's all the possible FDs for the candidate key {Street_Address, City, Zip, State}. Two of four--halfway home.
You identified {CEO_Name, Company_name} as the left-hand side of a functional dependency. In this particular case, you also identify it as a candidate key. Let's look at some made-up data.
Company_Name  CEO_Name    Street_Address  City      State  Zip
--
Wibble, Inc.  Mary Smith  123 E Main St   Anytown   PA     00001
Wibble, Inc.  Mary Smith  456 S Darn St   Sometown  WY     10000
That data describes two different companies that happen to have the same name, having two different CEOs who happen to have the same name. This fits the description of the FDs, but clearly shows that {Company_Name, CEO_Name} is not a candidate key. The faked data also clearly shows that {Company_Name, CEO_Name} can't be the left-hand side of a functional dependency. Given a single value for {Company_Name, CEO_Name}, we don't have one and only one value for any of the other attributes.
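The argument from the made-up rows can be replayed in code. This is a small sketch with a hypothetical helper fd_holds that tests one FD against a list of rows; it can only refute an FD on the given data, never prove that the FD holds for the table in general.

```python
def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on `lhs` while differing on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# The two made-up rows from the answer.
ROWS = [
    {"Company_Name": "Wibble, Inc.", "CEO_Name": "Mary Smith",
     "Street_Address": "123 E Main St", "City": "Anytown",
     "State": "PA", "Zip": "00001"},
    {"Company_Name": "Wibble, Inc.", "CEO_Name": "Mary Smith",
     "Street_Address": "456 S Darn St", "City": "Sometown",
     "State": "WY", "Zip": "10000"},
]

name_ceo_to_street = fd_holds(
    ROWS, ["Company_Name", "CEO_Name"], ["Street_Address"])
address_to_ceo = fd_holds(
    ROWS, ["Street_Address", "City", "Zip", "State"], ["CEO_Name"])
print(name_ceo_to_street, address_to_ceo)
```

The first check fails because one {Company_Name, CEO_Name} value appears with two street addresses; the second is not contradicted by these rows.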
Having eliminated the attributes Company_Name and CEO_Name as possibilities for the left-hand side, the only way to "manufacture" two more functional dependencies is to find them within the candidate key {Street_Address, City, Zip, State}. Not because there's anything special about the candidate key, but because those are the only attributes left.
My guess is that your teacher expects you to say
Zip -> City
Zip -> State
In the USA (in the "real" world), "Zip -> City, State" doesn't hold. ZIP codes have to do with how carriers drive their routes and deliver mail; ZIP codes aren't concerned with geography. A few cities (and ZIP codes) straddle state borders. Quite a lot of ZIP codes straddle adjoining cities within a single state. As the USPS cuts their budget, I expect the number of such ZIP codes to increase.
But in academia, this real-world behavior is often ignored for pedagogical reasons. That's why I'll bet your teacher expects {Zip -> City, State}.

Identifying Functional Dependencies II

Here is an example which should clear things up for the last post.
hireDate & carReg together form the primary key. Are there extra functional dependencies (FDs) other than the ones I have identified below? Modifications also welcome:
fd1 carReg -> make, model, outletNo, outletLoc
fd2 custNo -> custName
fd3 outletNo -> outletLoc
fd4 model -> make (only if we assume a model name is unique to a make)
fd5 carReg, hireDate -> make, model, custNo, custName, outletNo, outletLoc
I'm not sure if the above are correct and I am sure there are more.
Based on Mike Sherrill Cat Recall's answer... My question is this: How is custName -> custNo a valid FD? For the above relation, sure, a customer name maps onto exactly one customer number, but by intuition, we know more than one J Smith could be added to the table. If this is the case, this FD is void as it forms a 1..* relationship. Can we really say that custName -> custNo knowing this fact? Do we merely base FDs on the sample data? Or do we take into account the possible values that can be added?

At a glance . . .
custName -> custNo
model -> make
outletLoc -> outletNo
carReg, custNo -> hireDate
carReg, custName -> hireDate
And I'm sure there are others. The sample data isn't representative, and that's a problem when you try to determine functional dependencies from data. Let's say your sample data had only one row.
carReg hireDate make model custNo custName outletNo outletLoc
--
MS34 0GD 14/5/03 Ford Focus C100 Smith, J 01 Bearsden
FDs answer the question, "Given one value for 'x', do I know one and only one value for 'y'?" Based on that one-row set of sample data, every attribute determines every other attribute. custNo determines hireDate. hireDate determines outletLoc. custName determines model.
When sample data isn't representative, it's easy to turn up FDs that aren't valid. You need more representative sample data to weed out some invalid functional dependencies.
custName -> custNo isn't valid ('C101', 'Hen, P')
carReg, custNo -> hireDate isn't valid ('MS34 0GD', 'C100', '15/7/04')
carReg, custName -> hireDate isn't valid ('MS34 0GD', 'Hen, P', '15/8/03')
You can investigate functional dependencies in sample data by using SQL.
create table reg (
CarReg char(8) not null,
hireDate date not null,
Make varchar(10) not null,
model varchar(10) not null,
custNo char(4) not null,
custName varchar(10) not null,
outletNo char(2) not null,
outletLoc varchar(15) not null
);
insert into reg values
('MS34 OGD', '2003-05-14', 'Ford', 'Focus', 'C100', 'Smith, J', '01', 'Bearsden'),
('MS34 OGD', '2003-05-15', 'Ford', 'Focus', 'C201', 'Hen, P', '01', 'Bearsden'),
('NS34 TPR', '2003-05-16', 'Nissan', 'Sunny', 'C100', 'Smith, J', '01', 'Bearsden'),
('MH34 BRP', '2003-05-14', 'Ford', 'Ka', 'C313', 'Blatt, O', '02', 'Kelvinbridge'),
('MH34 BRP', '2003-05-20', 'Ford', 'Ka', 'C100', 'Smith, J', '02', 'Kelvinbridge'),
('MD51 OPQ', '2003-05-20', 'Nissan', 'Sunny', 'C295', 'Pen, T', '02', 'Kelvinbridge');
Does model determine make?
select distinct model
from reg
order by model;
model
--
Focus
Ka
Sunny
Three distinct models . . .
select model, make
from reg
group by model, make
order by model;
model make
--
Focus Ford
Ka Ford
Sunny Nissan
Yup. One make for each model. Based on the sample data, model -> make.
Does carReg, custName -> hireDate?
select distinct carReg, custName
from reg
order by custName;
carReg custName
--
MH34 BRP Blatt, O
MS34 OGD Hen, P
MD51 OPQ Pen, T
MS34 OGD Smith, J
NS34 TPR Smith, J
MH34 BRP Smith, J
Six distinct combinations of carReg and custName.
select carReg, custName, hireDate
from reg
group by carReg, custName, hireDate
order by custName;
carReg custName hireDate
--
MH34 BRP Blatt, O 2003-05-14
MS34 OGD Hen, P 2003-05-15
MD51 OPQ Pen, T 2003-05-20
MH34 BRP Smith, J 2003-05-20
NS34 TPR Smith, J 2003-05-16
MS34 OGD Smith, J 2003-05-14
Yup. One hireDate for each combination of carReg and custName. So based on the sample data, {carReg, custName} -> hireDate.
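The eyeball comparison of the two result sets can be folded into a single query that lists only the offending determinant values. The sketch below is an assumption-laden illustration: it loads the same six rows into an in-memory SQLite database through Python's sqlite3, and fd_violations is a hypothetical helper that interpolates trusted column names into the SQL (do not do that with user input).

```python
import sqlite3

ROWS = [
    ("MS34 OGD", "2003-05-14", "Ford", "Focus", "C100", "Smith, J", "01", "Bearsden"),
    ("MS34 OGD", "2003-05-15", "Ford", "Focus", "C201", "Hen, P", "01", "Bearsden"),
    ("NS34 TPR", "2003-05-16", "Nissan", "Sunny", "C100", "Smith, J", "01", "Bearsden"),
    ("MH34 BRP", "2003-05-14", "Ford", "Ka", "C313", "Blatt, O", "02", "Kelvinbridge"),
    ("MH34 BRP", "2003-05-20", "Ford", "Ka", "C100", "Smith, J", "02", "Kelvinbridge"),
    ("MD51 OPQ", "2003-05-20", "Nissan", "Sunny", "C295", "Pen, T", "02", "Kelvinbridge"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table reg (carReg, hireDate, make, model, "
    "custNo, custName, outletNo, outletLoc)"
)
conn.executemany("insert into reg values (?, ?, ?, ?, ?, ?, ?, ?)", ROWS)


def fd_violations(lhs, rhs):
    """Determinant values that appear with more than one `rhs` value.
    An empty list means the sample data does not contradict lhs -> rhs."""
    cols = ", ".join(lhs)
    sql = (
        f"select {cols} from reg group by {cols} "
        f"having count(distinct {rhs}) > 1 order by {cols}"
    )
    return conn.execute(sql).fetchall()


model_make = fd_violations(["model"], "make")
reg_name_date = fd_violations(["carReg", "custName"], "hireDate")
custno_reg = fd_violations(["custNo"], "carReg")
print(model_make, reg_name_date, custno_reg)
```

An empty result only says the sample does not refute the FD; as the answer notes, unrepresentative data can let invalid FDs slip through.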

Well, since you asked for a second opinion, I'll give you one.
The second opinion is that the first (CatCall's) is entirely correct.
Sample data do not suffice to identify/determine functional dependencies in the data. What is needed to identify/determine functional dependencies in the data, are user requirements, descriptions/definitions of the business environment the database is intended to support, ...
Only your users can tell you, one way or another, what functional dependencies apply. (Don't interpret this as meaning that you should be telling your users that they should be telling you "what the applicable FDs are", because your users will typically not know what the term means. However, what the applicable FDs are, can still be derived from nothing else than the business specs the user provides you with.)
(PS sample data may on the contrary indeed suffice to demonstrate that a certain given FD certainly will NOT apply. But that's not your question.)

A FD (functional dependency) expresses a certain property of a relation value or variable. We can say that it holds for or doesn't hold for (is satisfied by or isn't satisfied by) (is true of or is not true of) a given relation value. When we say it holds or doesn't hold for a relation variable we mean it holds or doesn't hold for every possible value for the variable that can arise in an application.
Also if we are given a value and we are told that the FDs it satisfies are the FDs that a variable that could hold it satisfies then by that assumption the variable's FDs are the value's FDs. (This is sometimes called "representative data" for the variable.) But if we are just given a value that might arise for a variable then we only know that
the FDs that don't hold in the value also don't in the variable
the trivial FDs of both hold
(the ones of the form S -> subset of S)
(the ones that must hold regardless of the value, based only on the attributes)
(which must be the same for the value & the variable)
From my answer to What did I do wrong? (Find FD from table):
We say that a FD (functional dependency) expression S -> T has a
"determinant" set of attributes S and a "determined" set of
attributes T. It says that a given subtuple value for S appears in a
given relation value or variable/schema always with the same subtuple
value for T. For S -> {A} we can say S -> A. For {A} -> T we can say A
-> T.
Given a relation, we say that a FD "holds in" it or "is satisfied by"
it or "is true" in it or (sloppily) "is in" it or (sloppily) it "has"
a FD when what the FD says is true about it. Every FD that can be
expressed using attributes of a relation value/variable/schema will
either hold or not hold.
We can find all the FDs S -> T that hold in a relation by checking
every subset of the set of attributes as S with every subset of
attributes as T. There are also algorithms. FDs where S is a superset
of T must hold and are called "trivial".
We can find all the FDs S -> A that hold in a relation by checking
every subset of the set of attributes as S with every attribute as A.
There are also algorithms. (Then to find all FDs that hold: FDs S ->
{} hold trivially & whether S -> T for T with multiple elements can be
found from the FDs S -> A.)
Here are some shortcuts: A set determines itself. If S -> T then every
superset of S determines every subset of T. If S doesn't determine T
then no subset of S determines any superset of T. If a set has a
different subtuple of values in every tuple (ie it is "unique", ie it
is a superkey) (including if it is a candidate key) then it determines
every set. {} -> T when/iff every tuple has the same T subtuple value.
Given some FDs that hold, Armstrong's axioms generate all FDs that
must also hold. The latter is called the "closure" of the former. A
set of FDs that generates a certain closure is called a "cover". A
cover is "minimal" or "irreducible" when removing any FD from it gives
a set that is not a cover. A minimal/irreducible cover with every
determinant unique is "canonical".
Usually we are not asked to give a closure for all FDs that hold in a
schema, we are asked to give a canonical cover for them. In general if
we only know some FDs that hold in a schema then we don't know that
its closure is all the FDs that hold.
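To make the idea of a canonical cover concrete, here is a rough sketch of the usual textbook procedure (split right-hand sides, trim left-hand sides, drop implied FDs, merge by determinant). The input is fd1 to fd5 from the question, taken as an assumption only; the function name canonical_cover and the data representation are illustrative, and a different processing order could yield a different but equivalent cover.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def canonical_cover(fds):
    # Step 1: split every right-hand side into single attributes.
    work = [(frozenset(lhs), {a}) for lhs, rhs in fds for a in sorted(rhs)]
    # Step 2: drop left-hand attributes that are not needed.
    for i in range(len(work)):
        lhs, rhs = work[i]
        for b in sorted(lhs):
            smaller = lhs - {b}
            if smaller and rhs <= closure(smaller, work):
                lhs = smaller
                work[i] = (lhs, rhs)
    # Step 3: drop FDs that the remaining ones already imply.
    i = 0
    while i < len(work):
        lhs, rhs = work[i]
        rest = work[:i] + work[i + 1:]
        if rhs <= closure(lhs, rest):
            work = rest
        else:
            i += 1
    # Step 4: merge FDs that share a determinant.
    cover = {}
    for lhs, rhs in work:
        cover.setdefault(lhs, set()).update(rhs)
    return cover


# fd1..fd5 as proposed in the question (assumed, not confirmed).
FDS = [
    ({"carReg"}, {"make", "model", "outletNo", "outletLoc"}),
    ({"custNo"}, {"custName"}),
    ({"outletNo"}, {"outletLoc"}),
    ({"model"}, {"make"}),
    ({"carReg", "hireDate"},
     {"make", "model", "custNo", "custName", "outletNo", "outletLoc"}),
]

cover = canonical_cover(FDS)
for lhs, rhs in cover.items():
    print(sorted(lhs), "->", sorted(rhs))
```

For this input most of fd5 turns out to be implied by the other FDs, leaving {carReg, hireDate} -> custNo as its only contribution.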
Assuming not every possible table value for a table variable is given, determining FDs for a table variable requires its meaning/predicate & the business rules to be given.
See my answer to Identifying functional dependencies (FDs).

Here's my attempt at relationships:
