Determining candidate keys with functional dependencies simple

Determining candidate keys with functional dependencies simple - database

Let R(A,B,C,D,E) be a relation schema and F = {A→C, B→D, C→E, E→A}, Find all candidate keys.
I believe that there exists no CK's in this set due to not being able to map. B or D to any other relation besides B -> D . Does this mean that that there are no Candiate Keys? Although I am able to map A to all other entities besides B and D.

The first step in normalization is to find all of the keys of a relation. Here are some facts that can help find the keys:
If an attribute is in none of the FDs, then it is in every key.
If an attribute occurs on the right-hand side of an FD, but never occurs on the left-hand side, then it is never in a key.
If an attribute occurs on the left-hand side of an FD, but never occurs on the right-hand side, then it is in every key.
If an attribute occurs both on the right-hand side an FD and the left-hand side of an FD, then one cannot say anything about the attribute.
To find the keys, identify which attributes are in each of the cases above. The ones in the first and third cases must be in every key. Call this set of attributes the core. Compute the attributes that are determined by the core. This is called the closure of the core. If all of the attributes are in the closure of the core, then the core is not only a key, it is also the only key. If the closure of the core is not the entire set of attributes, then some will be missing. Write down this set of attributes, and remove any attribute that is in the second set above (i.e., it occurs on the right-hand side of an FD, but never occurs on the left-hand side). These are the exterior attributes. To get a key one must add one or more exterior attributes to the core. Accordingly, add them to the core, first one at a time, then two at a time, and so on, until every key has been found.

There are three candidate keys.
B doesn't appear on the right-hand side of any functional dependency. That means B must be part of every candidate key. I think that alone doesn't guarantee there is at least one candidate key, but it should be clear from inspection that AB is one of the three candidate keys here.
Your textbook should include at least one algorithm for determining the set of all candidate keys. If you're lucky, it includes one algorithm suitable for paper and pencil, and another suitable for automation by programming.

Since B doesn't come on right hand side so B should be the part of candidate key, and A and C occur on both sides so they can form a super key with B. on mapping AB and BC are super keys and as candidate key is minimal super key so AB and BC are candidate key.

Every attribute that is strictly only on the left hand side across all the functional dependencies, is an attribute that must form part of each of the candidate keys.
The next step is to realise whether such an attribute alone can generate, or determine all the attributes inside the schema or not. If yes, then that attribute is a candidate key in it's original stand-alone form. If not, group that attribute with each of the other attributes, one at a time, two at a time and so on. All such minimal sets that traverse the entire set of attributes can be called as the candidate keys.

Related

primary index, equality on nonkey - Silbershatz Database System Concepts, A3

In Silbershatz Database System Concepts 6th Ed., in chapter 12, par. 12.3.1, p. 542, there is an explanation of an algorithm for processing a query which is selecting by imposing equality constraint on a nonkey attribute of the relation, using primary index.
The paragraph claims that the read from the file will be consecutive, since the file is sorted by the search key.
I don't understand - why would the read be consecutive?
As I see it, the records are ordered by the clustering key of the primary index, and the selection is using the non-key attributes, so these attributes may be contained in each record of the relation. The only way I see to retrieve all the records is a linear scan of all the relation.

In this context, primary index, equality on non-key means that we have a primary index on some attribute, namely A (i.e. the records are sorted physically on the disk based on A). Here, non-key means that there may exist multiple records having the same value for the attribute A, in other words, A is not guaranteed to be unique. The selection, however, is indeed using equality on A the clustering key of the primary index.
Thus, the algorithm becomes as follows: use the index to get the first record that satisfies the corresponding equality condition, then use linear scan until the condition breaks.
To make sure this makes sense, Look at algorithm A4:
In short, here non-key only means that the indexing field is not unique.
You can further refer to slide 13 in these slides. And see how the name is phrased.
As a side note: notice the difference between the cost written on this slide and the cost written in the book which misses one ts, but it really shouldn't miss it as it is explicitly present in the cost of A2, and clearly stated in the explanation, in Figure 12.3, in page 543. This may be a typo in the book. It isn't really of any importance, I just wanted to point it out.

Can a table be in 3NF with no primary keys?

1.
A table is automatically in 3NF if one of the following holds:
(i) If a relation consists of two attributes.
(ii) If 2NF table consists of only one non key attribute.
2.
If X → A is a dependency, then the table is in 3NF, if one of the following conditions exists:
(i) If X is a superkey
(ii) If A is a part of superkey
I got the above claims from this site.
I think that in both the claims, 2nd subpoint is wrong.
The first one says that a table in 2NF will be in 3NF if we have all non-key attributes and the table is in 2NF.
Consider the example R(A,B,C) with dependency A->B.
Here we have no candidate key, so all attributes are non-prime attributes and the relation is not in 3NF but in 2NF.
The second one says that for a dependency of the form X->A if A is part of a super key then it's in 3NF.
Consider the example R(A,B,C) with dependencies A->B, B->C . Here a CK is {A}. Now one of the super keys can be AC and the RHS of FD B->C contains part of AC but still the above relation R is not in 3NF.
I think it should be A should be part of a candidate key and not super key.
Am I correct?
Also can a particular relation be in 1NF, 3NF or 2NF if there are no functional dependencies present?

A CK (candidate key) is a superkey that contains no smaller superkey. A superkey is a unique set of attributes. A relation is a set of tuples. So every relation has a superkey, the set of all attributes. So it has at least one CK.
A FD (functional dependency) holds by definition when each value of a determining set of attributes appears always with the same value for its determined set. Every relation value or variable satisfies "trivial" FDs, the ones where the determined set is a subset of the determining set. Every set of attributes determines {}. So every relation satisfies at least one FD. However, the correct forms of definitions typically specifically talk about non-trivial FDs. Don't use the web, use textbooks, of which dozens are free online, although not all are well-written. Many textbooks also forget about FDs where the determinant and/or determined set is {}.
Your first point is not a correct definition of 3NF. Since its phrased "if..." instead of "if and only if", maybe it's not trying to be a definition. However, it is still wrong. (i) is wrong because a relation with two attributes is not in 3NF if one is a CK and the other has the same value in every tuple, ie it is determined by {}.
Similarly the second point is not a proper definition and also even if you treat it as only a consequence of 3NF (if...) it's false. It would be a definition if it used if and only if and talked about an FD that holds and it said it was a non-trivial FD and some other things were fixed.
Since those are neither correct definitions nor correct implications, there's a unlimited number of ways to disprove them. Read a book (or my posts) and get correct definitions.
Some comments re your reasoning:
First one says that, a table in 2NF will be in 3NF if we have all non key attributes and table is in 2NF.
I have no idea why you think that.
Here we have no candidate key
There's always one or more CKs. You need to read a definition of CK. There are also non-brute-force algorithms for finding them all.
Second one says that, for the dependency of form X->A if A is part of super key then it's in 3NF.
I have no idea why you think that.
A should be part of candidate key and not super key.
A correct defintion like the second point does normally say "... or (ii) A-X is part of a CK". But I can't follow your reasoning.
Sound reasoning involves starting from assumptions and writing new statements that we know are true because we applied a definition, a previously proved statement (theorem) or a sound rule of reasoning, eg from 'A implies B' and 'A' we can derive 'B'. You seem to need to read about how to do that.

Understanding Database Normalization - Second Normal Form(2NF)

I have been learning Normalization from "Fundamentals of Database Systems by Elmasri and Navathe (6th edition)" and I am having trouble understanding the following part about 2NF.
The following image is an example given under 2NF in the textbook
The candidate key is {SSN,Pnumber}
The dependencies are
SSN,Pnumber -> hours, SSN -> ename, pnumber->pname, pnumber -> plocation
The formal Definition:
A relation schema R is in 2NF if every nonprime attribute A in R is
fully functionally dependent on the primary key of R.
for example in the above picture:
if suppose, I define an additional functional dependency SSN -> hours, then taking the two functional dependencies,
{SSN,Pnumber} -> hours and SSN -> hours
the relation wont be in 2NF, because now SSN ->hours is now a partial functional dependency as SSN is a proper subset for the given candidate key {SSN,Pnumber}.
Looking at the relation and its general definition on 2NF, i presume that the above relation is in 2NF
As far as my understanding goes and how i understand what 2NF is,
A relation is in 2NF if one cannot find a proper subset (prime attributes)
of the on the left hand side (candidate key) of a functional dependency
which defines the NPA(non prime attribute).
My first question is, Why is the above relation not in 2NF? (The textbook has considered the above relation as not in 2NF)
There is, however, a informal ways(steps as per the textbook where a normal person not knowing normalization can take to reduce redundancy) being defined at the beginning of this chapter which are:
■ Making sure that the semantics of the attributes is clear in the schema
■ Reducing the redundant information in tuples
■ Reducing the NULL values in tuples
■ Disallowing the possibility of generating spurious tuples
The guideline mentioned is as follows:
My second question is, If the above steps described are taken into account, and consider why the following relation is not in 2NF, do you assume the following functional dependencies, which are,
{SSN,Pnumber} -> Pname
{SSN,Pnumber} -> Plocation
{SSN,Pnumber} -> Ename
making the decomposition of the relation correct? If the functional dependencies assumed are incorrect, then what are the factors leading for the relation to not satisfy 2NF condition?
When looked at a general point of view ... because the table contains more than one primary attributes and the information stored is concerned with both employee and project information, one can point out that those need to be separated, as Pnumber is a primary attribute of the composite key, the redundancy can somehow be intuitively guessed. This is because the semantics of the attributes are known to us.
what if the attributes were replaced with A,B,C,D,E,F
My Third question is, Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?
Because based on the data and relation state at a given point the functional dependencies can change which was valid in one state can go invalid at a certain state.In general this can be said for any non primary attribute determining non primary attribute.
The formal definition :
A functional dependency, denoted by X → Y, between two sets of
attributes X and Y that are subsets of R specifies a constraint on the
possible tuples that can form a relation state r of R. The constraint is
that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must
also have t1[Y] = t2[Y].
So won't predefining a functional dependency be wrong as on cannot generalize relation state at any given point?
Pardon me if my basic understanding of things is flawed to begin with.

Why is the above relation not in 2NF?
Your original/first/informal "definition" of 2NF is garbled and not helpful. Even the quote from the textbook is wrong since 2NF is not defined in terms of "the PK (primary key)" but rather in terms of all the CKs (candidate keys). (Their definition makes sense if there is only one CK.)
A table is in 2NF when there are no partial dependencies of non-prime attributes on CKs. Ie when no determinant of a non-prime attribute is a proper/smaller subset of a CK. Ie when every non-prime attribute is fully functionally dependent on every CK.
Here the only CK is {Ssn, Pnumber}. But there are FDs (functional dependencies) out of {Ssn} and {Pnumber}, both of which are smaller subsets of the CK. So the original table is not in 2NF.
If the above statement is taken into account, do you assume the following functional dependencies
so won't the same process of the decomposition shown based on the informal way alone be difficult each time such a case arrives?
A table holds the rows that make some predicate (statement template parameterized by column names) into a true proposition (statement). Given the business rules, only certain business situations can arise. Then given the table predicates, which give table values from a business situation, only certain database values can arise. That leads to certain tables having certain FDs.
However, given some FDs that hold, we can formally use Armstrong's axioms to get all other FDs that must also hold. So we can use both informal and formal ways to find which FDs hold and don't hold.
There are also shorthand rules that follow from the axioms. Eg if a set of attributes has a different subrow value in each tuple then so does every superset of it. Eg if a FD holds then every superset of its determinant determines every subset of its determined set. Eg every superset of a superkey is a superkey & no proper subset of a CK is a CK. There are also algorithms.
Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?
When normalizing we are concerned with the FDs that hold no matter what the business situation is, ie what the database state is. Each table for each business can have its own particular FDs per the table predicate & the possible business situations.
PS Do "make sense" of formal things in terms of the real world when their definitions are in terms of the real world. Eg applying a predicate to all possible situations to get all possible table values. But once you have the necessary formal information, only use formal definitions and procedures. Eg determining that a FD holds for a table because it holds in every possible table value.
so would any general table be in 2NF based on a solo condition of a table having a composite primary key?
There are tables in 5NF (hence too all lower NFs) with all sorts of mixes of composite & non-composite CKs. PKs don't matter.
It is frequently wrongly said that having no composite CKs guarantees 2NF. A table without composite keys and where {} does not determine any attribute is in 2NF. But if {} determines an attribute then it's a proper/smaller subset of any/every CK with any attributes. {} determines an attribute when every row has to have the same value for that attribute.

Why is the above relation in 2NF?
EP1, EP2, and EP3 are in 2NF because, for each one, the key identifies the non-key. No part of any key identifies any part of any non-key. That is what is meant by for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
By contrast, you might say EMP_PROJ is over-specified. If ssn identifies, ename (as the text says it does), then the combination of {ssn, pnumber} is too much. There exists a subset of the key {ssn,pnumber} that identifies a part of the non-key, {ename}. That situation does not occur in a table conforming to 2NF, as EP1, EP2, and EP3 illustrate.
Are functional dependencies ... based on ... domain knowledge of the attributes?
Emphatically, yes! That's all they're based on. The DBMS is just a logic machine. The ideas of "employee" and "hours" don't exist for it. The database designer chooses to define tables that model some real-world universe of discourse, and imposes meaning on the columns. He gives names to the attributes (above) in X and Y. He decides which columns serve to identify a row based on what is true about the universe being modeled.
if a table has a composite primary key, regardless of the functional dependencies is not in 2NF?
No. Remember, 2NF is defined in terms of FDs. What could it mean to speak of conforming to 2NF "regardless" of them?
The number of columns in the key is immaterial. It's some set, X, identifying the complement, Y.

I'm not sure if I thoroughly understand your questions, but I'll give a try to explain.
Your first statement about 2NF:
a relation is in 2NF if one cannot find a proper subset on the left hand side of a functional dependency which defines the NPA
is correct, as well as your supposition
if {SSN,Pnumber} -> hours and SSN -> hours then this relation wont be in 2NF
because what that means that you could determine 'hours' from 'SSN' alone, so using the composite key {SSN,Pnumber} to determine 'hours' will be redundant, and thus violates the 2NF requirements.
What you call the left hand side of an FD is usually called a key. You use the key to find the related data. In order to save space (and reduce complexity), you should always try to find a minimal key, and break up larger tables into smaller ones if possible, so you do not have to save information in more places than necessary. This is what normalization to the normal forms is all about, and being studied for about half a century now, substantial theory on the matter has been developed, and some rules chrystalized from it, like 1NF, 2NF, 3NF etc.
Your second question confuses me a lot, because from what you are saying, it seems you already understands this.
Could there be some confusion about the FD's? From the figure, it seems to me as they are defined like this:
{SSN,Pnumber} -> hours
{SSN} -> ename
{Pnumber} -> Pname,Plocation
Just like the three lower tables are modeled, together they add up to the relation (table) modeled above.
So, in the first table, you would need the composite key {SSN,Pnumber} to access any data in the relation (search in the table), while that clearly is not necessary for most of the fields.
Now, I'm not sure about what purpose that table would fulfill in real life. While that is not formally necessary, as long as the FD's are given, it might be easier to imagine why the design will benefit from normalization.
So let's day it's about recording workhours per emplyee per project in some organization. SSN identifies the employee, (whose name also is stored as ename because it is easier to remember, but could be duplicate), Pnumber identifies the project, which name and location is also stored much for the same reason.
Then if you as a manager need to register that an employee worked another few hours on some project, you would use your manager app on your device, which in turn will update the tables seamlessly (you cannot expect managers to understand the logics of normalization)
Behind the scenes, however, it would amount to some query, in SQL that would be an 'INSERT' statement which added another row to the relevant table(s).
Now you can see that in the above table, you would have to insert all the six attributes, while with the normalized tables below, you will only need to add a row to table EP1,consisting of three attributes. In a large organization with thousands of employees delivering their worksheets every week, that will quickly become huge differences in storage requirements. That has a number of benefits, perhaps the most significant beeing search speed.
Your third question I don't understand at all, I'm afraid. In a way you could say FD's are predetermined once you have decided what data you will save in your database. The FD's are not dupposed to change. When modeled in the DB, they will not change. If you later find you will alter the design, then that will be new relations with new FD's.
The text you seem to be quoting from somewhere simply says that if you have the FD X -> Y (X gives or determines Y) then if you have any two tuples (records) in that relation (table) that have the same value of X, they must also hve the same value of Y. Or in our example, if Pnumber somewhere is given the value of 888, Pname is 'Battleship' and Plocation is 'Kitchen Sink', then if somewhere else (some other record) the Pnumber 888 is used then also there Pname must be 'Battleship' and Plocation must be 'Kitchen Sink' because Pname and Plocation is functionally dependant on Pnumber.
Now that was almost another chapter in your textbook, or what? Hope it helps, because it took me some time to write :-)

A table can be said to be in 2NF, if the primary key is composed of multiple columns, and that if for each row these columns were concatenated together into a single string, then the resulting column would qualify as the primary key. Alternatively a single column primary key will also qualify as 2NF.
In this case the same employee could have multiple phone numbers (PNUMBER), so a you cannot have a compound primary key that includes the phone number.

2NF and 3NF Normalization

I seem to have a strange problem when doing normalization problems. When I'm giving relations with actual names I can figure these out easily but when I'm given letters it seems to be a lot harder.
For the following problem I don't know why it's not 3NF and why it is 2NF.
Given R (A, B, C, D, E, F)
FDs = {AB->C, DBE->A, BC->D, BE->F, F->D}
So for 2NF all the right hand side attributes must be fully functionally dependent on the left hand side attributes. For 3NF either all the left hand side attributes must be superkeys or the right hand attributes must be prime attributes.
I tried drawing this out, but I can't even find a candidate key. Can anyone help me determine why this is not 3NF? Also, what is the candidate key here? Since I don't see any attribute that has a closure equal to the original relation.

I seem to have a strange problem when doing normalization problems.
When I'm giving relations with actual names I can figure these out
easily but when I'm given letters it seems to be a lot harder.
Yes, its less intuitive with letters. I will tell you a neat method which you can follow to determine the candidate keys in such situations :
Make three columns left(L), middle(M) and right(R) where left columns consists of all the attributes that appear only on the left side in all the given functional dependencies. In our case such attributes will be B and E since they are always on the left side of any FD given (or you can say they are never on the right side in any of the given FD.). Similarly middle column contains attribute that appear on both left and right side of the given FD's. So we have A,C,D and F in the middle column. The right column contains attributes which only occur on the right hand side of FD's (never on the LHS of any given FD's). So we have :
L | M |R
B,E|A,C,D,F|-
Now that you have this table remember the following rules: (these are very intuitive)
Attributes in the left(L) column are always part of the candidate keys
Attributes in the right(R) column are never part of the candidate keys
Attributes in the middle(M) column may or may not be a part of the candidate keys.
So in our case we start with checking if BE is a candidate key. We find BE-closure consists of all the attributes of the relation R so it is the candidate key. (Note: If BE would not have been the candidate key then we would have taken attributes from middle(M) column one-by-one and combine it with BE and check its closure eg. BEA,BEC,BED ...)
So now we have only 1 candidate key BE. So our prime attributes are {B,E} and non-prime attributes are {A,C,D,F}.
We know that 3NF is violated if RHS is a non-prime attribute and LHS is not a candidate key. Given FD's are:
AB->C
DBE->A
BC->D
BE->F
F->D
We note that in all these FD's RHS is a non-prime attribute. So in all of these LHS should be a key for it to be in 3NF. We see that (1),(3) and (5) violate this so it is not in 3NF. (Note: In (2) we can see that D on the LHS is an extraneous attribute so its BE->A and hence (2) does not violate 3NF rule)

1NF Normal Form with Functional Dependecy

I read following example, that relation A(X,Y,Z,P,Q,R) with the following functional dependency.
why this is in 1NF?
anyone could help me?

The diagram is not normal notation. I suppose that arrows point to determined attributes of FDs. I suppose that an arrow that doesn't come from a box means a FD with just one determinant attribute and an arrow that comes from a box means a FD with the boxed attributes as determinant attributes. Find out what the diagram notation means.
If so then the functional dependencies are Y → Z, XYZ → QR and P → QRX.
To show what normal form the relation is in we need to know what definitions you were given for normal forms. It happens that this relation is not in 2nd normal form. So it isn't in any higher normal form. So it is only in 1st normal form. So the only normal form definition we need to know is the one you were given for 2NF. That definition usually involves candidate keys. If so then we need to know what definition you were given for candidate key. The definition of 2NF can involve full and partial FDs. If yours does then we need to know what definitions of full and/or partial FD you were given. Give the definitions.
The only CK is PY because it determines every other attribute but no proper subset of it does. It is the only CK because there is no other such set of attributes. To justify this we need to reference the rules you were given for deriving FDs and CKs. (Eg this includes how we went from one FD list to another.) Give the rules.
But there are then also determinants Y (from Y → Z) and P (from P → QRX) that are proper subsets of that candidate key. So each of non-prime attributes Z, Q, R and X is partially dependent on a CK. But to be in 2NF there must be no non-prime attributes partially dependent on a candidate key. Ie every non-prime attribute must be fully functionally dependent on all CKs. So A is not in 2NF. So the highest normal form it is in is 1NF.

The picture doesn't make its meaning very clear in my opinion because it seems to be mixing two different notations for functional dependencies (FDs). Any answer will depend on how you want to interpret the diagram.
I'd hazard a guess that the diagram is supposed to indicate the following set of FDs: XY->Z, Y->QR, P->QRX. If that's correct then the possible candidate key respecting that set of FDs would be {Y,P}. If my interpretation of the diagram is correct then both Y and P are determinants in their own right. Since Y and P are proper subsets of a candidate key of A we can conclude that A violates 2NF and therefore the highest normal form that A can satisfy is 1NF.
Update: Your new picture specifies some dependencies. Collecting the determinant terms together we can summarize as:
P->XQR
XY->QR
Y->Z
I assume these are supposed be the dependencies actually satisfied by A. On the left-hand side we have P, X and Y so PXY will be a superkey of A. P->X therefore the candidate key (minimal superkey) can only be PY. P->XQR and Y->Z are both FDs with determinants that are proper subsets of the candidate key (PY) and that means those dependencies both violate 2NF. Recap: 2NF prohibits any FD where the left-hand side is a proper subset of a candidate key. So 1NF is the highest normal form of A.

As per 1NF, no two rows of a relation must have repeating values and no column must have more than one value in a row. This increases redundancy as there will be columns with same data repeating in many rows.
Name ID Course
A 1 Computer
B 2 Arts
C 3 Computer
Here Course column has repeated values. But every row has no column which has 2 values. Hence it is in 1NF.
1NF has the least number of restrictions. So any other forms like 2NF, 3NF by default would also be in 1NF.
Consider this analogy
1NF = Living Beings
2NF = Mammals
3NF = Humans
All mammals/2NF are by default living beings/1NF, and so on.
To satisfy the functional dependency X → Y, it is essential that each X value be associated with only one Y value. And thus it satisfies the 1NF criteria which does not allow multiple values in a row for a column.