I have been learning Normalization from "Fundamentals of Database Systems by Elmasri and Navathe (6th edition)" and I am having trouble understanding the following part about 2NF.
The following image is an example given under 2NF in the textbook
The candidate key is {SSN,Pnumber}
The dependencies are
{SSN,Pnumber} -> hours, SSN -> ename, Pnumber -> pname, Pnumber -> plocation
The formal Definition:
A relation schema R is in 2NF if every nonprime attribute A in R is
fully functionally dependent on the primary key of R.
For example, in the above picture:
Suppose I define an additional functional dependency SSN -> hours. Then, taking the two functional dependencies
{SSN,Pnumber} -> hours and SSN -> hours
the relation won't be in 2NF, because SSN -> hours is now a partial functional dependency: SSN is a proper subset of the candidate key {SSN,Pnumber}.
Looking at the relation and the general definition of 2NF, I presume that the above relation is in 2NF.
As far as my understanding goes, this is what 2NF is:
A relation is in 2NF if one cannot find a proper subset (prime attributes)
of the left hand side (candidate key) of a functional dependency
which defines the NPA (non-prime attribute).
My first question is, Why is the above relation not in 2NF? (The textbook has considered the above relation as not in 2NF)
There are, however, informal guidelines (steps that, per the textbook, a person who does not know normalization can take to reduce redundancy) defined at the beginning of this chapter:
■ Making sure that the semantics of the attributes is clear in the schema
■ Reducing the redundant information in tuples
■ Reducing the NULL values in tuples
■ Disallowing the possibility of generating spurious tuples
The guideline mentioned is as follows:
My second question is: if the above guidelines are taken into account when considering why the relation is not in 2NF, do you assume the following functional dependencies,
{SSN,Pnumber} -> Pname
{SSN,Pnumber} -> Plocation
{SSN,Pnumber} -> Ename
making the decomposition of the relation correct? If the assumed functional dependencies are incorrect, then which factors cause the relation to violate the 2NF condition?
From a general point of view: because the table contains more than one prime attribute and the information stored concerns both employees and projects, one can see that those need to be separated. Since Pnumber is a prime attribute of the composite key, the redundancy can be guessed intuitively. This is because the semantics of the attributes are known to us.
What if the attributes were replaced with A, B, C, D, E, F?
My Third question is, Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?
I ask because, based on the data and the relation state at a given point, the functional dependencies can change: a dependency that was valid in one state can become invalid in another. In general, this can be said of any non-prime attribute determining another non-prime attribute.
The formal definition :
A functional dependency, denoted by X → Y, between two sets of
attributes X and Y that are subsets of R specifies a constraint on the
possible tuples that can form a relation state r of R. The constraint is
that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must
also have t1[Y] = t2[Y].
So won't predefining a functional dependency be wrong, as one cannot generalize over the relation state at any given point?
Pardon me if my basic understanding of things is flawed to begin with.
Why is the above relation not in 2NF?
Your original/first/informal "definition" of 2NF is garbled and not helpful. Even the quote from the textbook is wrong since 2NF is not defined in terms of "the PK (primary key)" but rather in terms of all the CKs (candidate keys). (Their definition makes sense if there is only one CK.)
A table is in 2NF when there are no partial dependencies of non-prime attributes on CKs. Ie when no determinant of a non-prime attribute is a proper/smaller subset of a CK. Ie when every non-prime attribute is fully functionally dependent on every CK.
Here the only CK is {Ssn, Pnumber}. But there are FDs (functional dependencies) out of {Ssn} and {Pnumber}, both of which are smaller subsets of the CK. So the original table is not in 2NF.
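To make that check concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names `closure` and `partial_dependencies` are made up for this example, the FDs are the ones listed in the question, and the candidate keys must be supplied correctly by the caller.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attributes determined by `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def partial_dependencies(candidate_keys, fds):
    """Return (subset, non-prime attributes) pairs where a proper subset of a
    CK determines non-prime attributes. An empty list means no 2NF violation."""
    prime = set().union(*candidate_keys)
    found = []
    for key in candidate_keys:
        for size in range(len(key)):  # sizes 0..len-1: proper subsets only
            for subset in combinations(sorted(key), size):
                determined = closure(subset, fds) - prime
                if determined:
                    found.append((set(subset), determined))
    return found

# FDs of EMP_PROJ as given in the question.
emp_proj_fds = [
    ({"Ssn", "Pnumber"}, {"Hours"}),
    ({"Ssn"}, {"Ename"}),
    ({"Pnumber"}, {"Pname", "Plocation"}),
]
violations = partial_dependencies([{"Ssn", "Pnumber"}], emp_proj_fds)
for subset, attrs in violations:
    print(sorted(subset), "->", sorted(attrs))
# ['Pnumber'] -> ['Plocation', 'Pname']
# ['Ssn'] -> ['Ename']
```

The loop starts at size 0, so the empty set is also tried as a determinant. With only {Ssn, Pnumber} -> Hours in the FD list, the function returns an empty list.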
If the above statement is taken into account, do you assume the following functional dependencies
so won't the same process of the decomposition shown based on the informal way alone be difficult each time such a case arrives?
A table holds the rows that make some predicate (statement template parameterized by column names) into a true proposition (statement). Given the business rules, only certain business situations can arise. Then given the table predicates, which give table values from a business situation, only certain database values can arise. That leads to certain tables having certain FDs.
However, given some FDs that hold, we can formally use Armstrong's axioms to get all other FDs that must also hold. So we can use both informal and formal ways to find which FDs hold and don't hold.
There are also shorthand rules that follow from the axioms. Eg if a set of attributes has a different subrow value in each tuple then so does every superset of it. Eg if a FD holds then every superset of its determinant determines every subset of its determined set. Eg every superset of a superkey is a superkey & no proper subset of a CK is a CK. There are also algorithms.
Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?
When normalizing we are concerned with the FDs that hold no matter what the business situation is, ie what the database state is. Each table for each business can have its own particular FDs per the table predicate & the possible business situations.
PS Do "make sense" of formal things in terms of the real world when their definitions are in terms of the real world. Eg applying a predicate to all possible situations to get all possible table values. But once you have the necessary formal information, only use formal definitions and procedures. Eg determining that a FD holds for a table because it holds in every possible table value.
so would any general table be in 2NF based on a solo condition of a table having a composite primary key?
There are tables in 5NF (hence too all lower NFs) with all sorts of mixes of composite & non-composite CKs. PKs don't matter.
It is frequently wrongly said that having no composite CKs guarantees 2NF. A table without composite keys and where {} does not determine any attribute is in 2NF. But if {} determines an attribute then it's a proper/smaller subset of any/every CK with any attributes. {} determines an attribute when every row has to have the same value for that attribute.
Why is the above relation in 2NF?
EP1, EP2, and EP3 are in 2NF because, for each one, the key identifies the non-key. No part of any key identifies any part of any non-key. That is what is meant by "for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y]".
By contrast, you might say EMP_PROJ is over-specified. If ssn identifies ename (as the text says it does), then the combination {ssn, pnumber} is too much. There exists a subset of the key {ssn, pnumber} that identifies a part of the non-key, {ename}. That situation does not occur in a table conforming to 2NF, as EP1, EP2, and EP3 illustrate.
Are functional dependencies ... based on ... domain knowledge of the attributes?
Emphatically, yes! That's all they're based on. The DBMS is just a logic machine. The ideas of "employee" and "hours" don't exist for it. The database designer chooses to define tables that model some real-world universe of discourse, and imposes meaning on the columns. He gives names to the attributes (above) in X and Y. He decides which columns serve to identify a row based on what is true about the universe being modeled.
if a table has a composite primary key, regardless of the functional dependencies is not in 2NF?
No. Remember, 2NF is defined in terms of FDs. What could it mean to speak of conforming to 2NF "regardless" of them?
The number of columns in the key is immaterial. It's some set, X, identifying the complement, Y.
I'm not sure I thoroughly understand your questions, but I'll try to explain.
Your first statement about 2NF:
a relation is in 2NF if one cannot find a proper subset on the left hand side of a functional dependency which defines the NPA
is correct, as well as your supposition
if {SSN,Pnumber} -> hours and SSN -> hours then this relation wont be in 2NF
because that means you could determine 'hours' from 'SSN' alone, so using the composite key {SSN,Pnumber} to determine 'hours' would be redundant, and thus violate the 2NF requirements.
What you call the left hand side of an FD is usually called the determinant; when it determines every attribute of the table, it is a (super)key. You use the key to find the related data. To save space (and reduce complexity), you should always try to find a minimal key, and break up larger tables into smaller ones if possible, so you do not have to store information in more places than necessary. This is what normalization is all about. It has been studied for about half a century now, so substantial theory on the matter has been developed, and some rules have crystallized from it, like 1NF, 2NF, 3NF etc.
Your second question confuses me a lot, because from what you are saying, it seems you already understand this.
Could there be some confusion about the FD's? From the figure, it seems to me as they are defined like this:
{SSN,Pnumber} -> hours
{SSN} -> ename
{Pnumber} -> Pname,Plocation
Just like the three lower tables are modeled, together they add up to the relation (table) modeled above.
So, in the first table, you would need the composite key {SSN,Pnumber} to access any data in the relation (search in the table), while that clearly is not necessary for most of the fields.
Now, I'm not sure about what purpose that table would fulfill in real life. While that is not formally necessary, as long as the FD's are given, it might be easier to imagine why the design will benefit from normalization.
So let's say it's about recording work hours per employee per project in some organization. SSN identifies the employee (whose name is also stored as ename because it is easier to remember, but could be a duplicate). Pnumber identifies the project, whose name and location are also stored for much the same reason.
Then if you as a manager need to register that an employee worked another few hours on some project, you would use your manager app on your device, which in turn would update the tables seamlessly (you cannot expect managers to understand the logic of normalization).
Behind the scenes, however, it would amount to some query, in SQL that would be an 'INSERT' statement which added another row to the relevant table(s).
Now you can see that in the above table, you would have to insert all six attributes, while with the normalized tables below, you only need to add a row to table EP1, consisting of three attributes. In a large organization with thousands of employees delivering their worksheets every week, that quickly adds up to a huge difference in storage requirements. That has a number of benefits, perhaps the most significant being search speed.
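As a rough sketch of what happens behind the scenes, here is the normalized design in Python's built-in sqlite3 module. The column types, the sample employee (Ssn "123456789", "Smith") and the hours value are made up for illustration; only the table and column names follow the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EP2 (Ssn TEXT PRIMARY KEY, Ename TEXT);
    CREATE TABLE EP3 (Pnumber INTEGER PRIMARY KEY, Pname TEXT, Plocation TEXT);
    CREATE TABLE EP1 (
        Ssn     TEXT    REFERENCES EP2(Ssn),
        Pnumber INTEGER REFERENCES EP3(Pnumber),
        Hours   REAL,
        PRIMARY KEY (Ssn, Pnumber)
    );
""")

# Employee and project facts are stored once each (hypothetical sample values).
conn.execute("INSERT INTO EP2 VALUES (?, ?)", ("123456789", "Smith"))
conn.execute("INSERT INTO EP3 VALUES (?, ?, ?)", (888, "Battleship", "Kitchen Sink"))

# Registering hours touches only EP1: three attributes instead of six.
conn.execute("INSERT INTO EP1 VALUES (?, ?, ?)", ("123456789", 888, 7.5))

# The original six-column EMP_PROJ rows can still be produced by joining.
emp_proj = conn.execute("""
    SELECT EP1.Ssn, EP1.Pnumber, EP1.Hours, EP2.Ename, EP3.Pname, EP3.Plocation
    FROM EP1
    JOIN EP2 ON EP2.Ssn = EP1.Ssn
    JOIN EP3 ON EP3.Pnumber = EP1.Pnumber
""").fetchall()
print(emp_proj)
```

The join at the end shows that nothing is lost by splitting the table: the wide row is recomputed on demand rather than stored repeatedly.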
Your third question I don't understand at all, I'm afraid. In a way you could say FD's are predetermined once you have decided what data you will save in your database. The FD's are not supposed to change. Once modeled in the DB, they will not change. If you later decide to alter the design, then that will give new relations with new FD's.
The text you seem to be quoting from somewhere simply says that if you have the FD X -> Y (X gives or determines Y) then if you have any two tuples (records) in that relation (table) that have the same value of X, they must also have the same value of Y. Or in our example, if Pnumber somewhere is given the value of 888, Pname is 'Battleship' and Plocation is 'Kitchen Sink', then if somewhere else (some other record) the Pnumber 888 is used, then there too Pname must be 'Battleship' and Plocation must be 'Kitchen Sink', because Pname and Plocation are functionally dependent on Pnumber.
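That definition translates almost directly into code. The following is a small sketch, with a made-up helper name `fd_holds` and hypothetical sample rows, that tests one table state against an FD:

```python
def fd_holds(rows, lhs, rhs):
    """True if no two rows agree on the `lhs` columns but differ on `rhs`."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y seen for each X; any later mismatch breaks the FD.
        if seen.setdefault(x, y) != y:
            return False
    return True

# Hypothetical rows: two employees working on project 888.
rows = [
    {"Ssn": "111", "Pnumber": 888, "Pname": "Battleship", "Plocation": "Kitchen Sink"},
    {"Ssn": "222", "Pnumber": 888, "Pname": "Battleship", "Plocation": "Kitchen Sink"},
]
consistent = fd_holds(rows, ["Pnumber"], ["Pname", "Plocation"])

# A third row reusing Pnumber 888 with a different Pname breaks the FD.
rows.append({"Ssn": "333", "Pnumber": 888, "Pname": "Destroyer", "Plocation": "Kitchen Sink"})
violated = not fd_holds(rows, ["Pnumber"], ["Pname", "Plocation"])
print(consistent, violated)
```

Note that a check like this can only refute an FD. Passing on one state does not establish the FD, because the FD is a constraint on every state the table is allowed to have.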
Now that was almost another chapter in your textbook, or what? Hope it helps, because it took me some time to write :-)
A table can be said to be in 2NF, if the primary key is composed of multiple columns, and that if for each row these columns were concatenated together into a single string, then the resulting column would qualify as the primary key. Alternatively a single column primary key will also qualify as 2NF.
In this case the same employee could have multiple phone numbers (PNUMBER), so you cannot have a compound primary key that includes the phone number.
According to second normal form, "All non-key attributes are fully functionally dependent on the primary key". This means that no non-key attribute can be dependent on a proper subset of the primary key.
On Facebook we can log in by email_id, user_name, or mobile_number (so email_id, user_name, and mobile_number are primary keys). After logging in by any of these methods we access the whole account.
My question is: "Isn't that a partial dependency of non-key attributes on a subset of the primary key?"
I posted this question in the Facebook community but didn't get any answer.
For 2NF non-key attributes cannot be dependent on a proper subset of a candidate key. (Every set is a subset of itself. A proper subset is a smaller subset.) (There can be multiple candidate keys. One can be picked as primary.)
If all the keys of a relation variable are single columns then the only proper subset of each key is the empty set. A column is functionally dependent on no columns if and only if all the values in that column are the same. There are no other proper subsets for columns to be functionally dependent on. So if all columns can have different values and all candidate keys have just one column then a relation variable must be in 2NF.
Functional dependencies and normal forms apply to a particular relation variable. You have to hypothesize a particular design to ask about its particular tables.
A "whole account" is typically not going to be represented (as part of either implementation state or user description) by only one relation variable just because it would have a lot of update anomalies that normalization would get rid of.
Functional dependencies are the attributes that their values are determined in a unique way by another attribute. Given that, can a multivalued attribute be dependent upon a primary key?
"FDs are the attributes that their values are determined in a unique way by another attribute" is unintelligible. Find a way to say it correctly or how can you understand it?
An attribute (or set of attributes) is functionally determined by a set of attributes.
There is no such thing as a "multivalued attribute" in a relation. A tuple has an attribute value for each attribute name. (Maybe you mean, a set of attributes is being determined? Maybe you mean, a multi-valued dependency?) If you have an attribute that you consider to contain multiple parts, ie you want to generically query about the parts without using operators with parameters of their types, then it is usually good design to have a separate table with attributes for those parts. But that's not addressed by normalization. Any value can be considered to have multiple parts in multiple ways and it is your application/queries that determine when you stop making tables whose attributes are the values of parts of other values and just have an attribute for a value. Similarly, if you have a bunch of attributes that play a similar role (often with similar names) then it is usually good design to have a separate table with just one attribute for the role. But that's not addressed by normalization.
Candidate keys matter to FDs, MVDs, JDs and normalization. PKs don't. You can pick one CK as "the PK" but its primariness is irrelevant to the relational model. It might be relevant to some information modeling method or product.
Superkeys are sets of columns that determine every column. Since every set of attributes always determines the attributes in it, superkeys are sets of columns that determine every other column. CKs are superkeys that contain no smaller superkey. (So CKs are sets of columns that are unique but contain no smaller set of columns that are unique.)
You don't know all the CKs until you find all the FDs. But you might know that a particular set of attributes is unique and has no smaller unique set, so that you know that it is a CK and you can call it "the PK". (Eg an id attribute in a relation variable that can have more than one row.)
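As an illustration of those definitions, here is a brute-force Python sketch that derives the CKs from a list of FDs. The function names are made up for this example, and the search is exponential in the number of attributes, so it is only meant for small textbook schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attributes determined by `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Superkeys that contain no smaller superkey, found smallest-first."""
    attributes = sorted(attributes)
    keys = []
    for size in range(len(attributes) + 1):
        for combo in combinations(attributes, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains an already-found CK, so it is not minimal
            if closure(candidate, fds) == set(attributes):
                keys.append(candidate)
    return keys

# EMP_PROJ with the FDs from the textbook example.
emp_proj_keys = candidate_keys(
    {"Ssn", "Pnumber", "Hours", "Ename", "Pname", "Plocation"},
    [
        ({"Ssn", "Pnumber"}, {"Hours"}),
        ({"Ssn"}, {"Ename"}),
        ({"Pnumber"}, {"Pname", "Plocation"}),
    ],
)
# A relation with no non-trivial FDs at all.
no_fd_keys = candidate_keys({"A", "B", "C", "D"}, [])
print(emp_proj_keys, no_fd_keys)
```

With an empty FD list the only set whose closure is everything is the full attribute set, so that set is the single CK.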
can a multivalued attribute be dependent upon a primary key?
Every attribute is dependent on every CK by definition of CK. So every attribute is dependent on every PK by definition of PK. (But you must clarify what you mean by "multivalued attribute" and "dependent".)
Suppose relation R(A,B,C,D) exists with no functional dependency. So what should be considered as its candidate key? Clearly any individual attribute or proper subset of all attributes cannot be a candidate key because by no means they can identify non prime attributes. So can ABCD be considered as candidate key? Or this relation will not have any candidate key?
Suppose relation R(A,B,C,D) exists with no functional dependency. So can ABCD be considered as candidate key?
Yes, the key [1] comprises all attributes together.
This is quite rare in practice, though. It mostly happens with junction/link tables that implement many-to-many (or many-to-many-to-many etc.) relationship.
Or this relation will not have any candidate key?
A relation must have at least one key, otherwise it's not a relation [2].
A relation is a set, and any given object either belongs to a set or doesn't; it cannot belong multiple times (unlike in a multiset). Without at least one key, the same tuple would be able to belong multiple times.
[1] Just saying "key" is synonymous with "candidate key".
[2] At the very least, all attributes, taken together, can be considered a key (as in your case).
My informal representation of these are:
1NF: The table is divided so that no item will appear more than once.
2NF: ?
3NF: Values can only be determined by the primary key.
I cannot make sense of it from the excerpts I found online or in my book. How do I differentiate between 1NF and 2NF?
A relation schema is in 2NF if every non-prime attribute is fully functionally dependent on every key.
Wikipedia says:
A table is in 2NF if and only if, it is in 1NF and every non-prime
attribute of the table is either dependent on the whole of a candidate
key, or on another non prime attribute.
To explain the concept, let's use a table for an inventory of toys adapted from Head First SQL:
TOY_ID| STORE_ID| INVENTORY| STORE_ADDRESS
The primary key is composed of the attributes TOY_ID and STORE_ID. If we analyze the non-prime attribute INVENTORY, we see that it depends on TOY_ID and STORE_ID together. That's cool.
On the other hand, the non-prime attribute STORE_ADDRESS depends only on the STORE_ID attribute (i.e. it's not related to the primary key attribute TOY_ID). That's a clear violation of 2NF, so to comply with 2NF our schema must look like this:
An Inventory table: TOY_ID| STORE_ID| INVENTORY
and a Store table: STORE_ID| STORE_ADDRESS
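A quick way to convince yourself that this split loses nothing is to project some rows and join them back. The sample rows below are invented for illustration, and plain Python sets stand in for tables:

```python
# (TOY_ID, STORE_ID, INVENTORY, STORE_ADDRESS) -- hypothetical sample rows
toy_inventory = [
    (1, 10, 5, "1 Main St"),
    (2, 10, 3, "1 Main St"),  # the address of store 10 is repeated
    (1, 20, 7, "9 Oak Ave"),
]

# Project onto the two 2NF tables; sets drop the duplicated store facts.
inventory = {(toy, store_id, qty) for toy, store_id, qty, _addr in toy_inventory}
store = {(store_id, addr) for _toy, store_id, _qty, addr in toy_inventory}

# Natural join on STORE_ID rebuilds the original rows.
rejoined = {
    (toy, store_id, qty, addr)
    for toy, store_id, qty in inventory
    for other_id, addr in store
    if store_id == other_id
}
print(len(store), rejoined == set(toy_inventory))
```

Each store address now appears once in the Store table, so changing it is a single-row update instead of one update per toy.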
Some columns are part of a key (primary or secondary). We'll call these prime attributes.
For second normal form we'll consider the non-prime attributes and see if they should be moved to another table. We might find that some attributes don't require the full key for us to be able to identify what value they hold for at least one candidate key. That is, there is a candidate key where we could still determine the value of that attribute given the candidate key even if the values in one column of that candidate key were erased.