Normalization, Primary Key Dependency Candidate Key - database

While decomposing a relation in order to normalize it:
If I reach the point where all attributes in a relation depend on the primary key, can I assume that they will all depend entirely on each of the other candidate keys as well?
If that is not the case, can you please give me an example where all attributes depend on the primary key, but some of them depend on only part of another candidate key?
I'm just starting to learn databases.

Surrogate primary IDs make an example really easy:
(row_id PK, student_id, course_id, student_name)
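where row_id and (student_id, course_id) are candidate keys and student_id -> student_name. Every attribute depends on the primary key row_id, which trivially determines all other attributes if it's an auto-incremented number, yet student_name depends on only part of the candidate key (student_id, course_id).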
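A minimal sketch of that example in Python. The sample rows and the helper name `fd_holds` are made up for illustration. Note that sample data can only refute a functional dependency, never prove it; whether a dependency really holds is decided by the business rules.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows that agree on lhs always agree on rhs (checked on this sample only)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample data for (row_id PK, student_id, course_id, student_name).
rows = [
    {"row_id": 1, "student_id": 10, "course_id": "DB", "student_name": "Ann"},
    {"row_id": 2, "student_id": 10, "course_id": "OS", "student_name": "Ann"},
    {"row_id": 3, "student_id": 11, "course_id": "DB", "student_name": "Bob"},
]
everything = ["row_id", "student_id", "course_id", "student_name"]

# Both candidate keys determine every attribute.
pk_determines_all = fd_holds(rows, ["row_id"], everything)
ck_determines_all = fd_holds(rows, ["student_id", "course_id"], everything)

# student_name depends on only part of the candidate key (student_id, course_id).
partial_dependency = fd_holds(rows, ["student_id"], ["student_name"])

# student_id alone is not a key: student 10 appears in two different rows.
student_id_is_key = fd_holds(rows, ["student_id"], everything)
```

Here the first three checks come out true and the last one false, which is exactly the situation asked about: everything depends on the primary key, but the relation is still not in 2NF with respect to the other candidate key.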

Related

How do primary keys work in junction tables for a DBMS? How can a composite key be a primary key?

In a DBMS we have
Superkey - An attribute or a set of attributes that uniquely identifies a row in a table.
Candidate Key - An attribute or set of attributes that uniquely identifies a row in a table. The difference between the superkey and a candidate key is no subset of a candidate key can itself be a candidate key.
Primary key - A chosen candidate key that becomes the attribute to uniquely identify a row.
If we want to identify a many to many relation between two tables we can define a junction table such as:
Tables:
Author(AuthorID, FirstName, LastName) -- AuthorID is primary key
Book(BookID, BookTitle) -- BookID is primary key
To create the relation between both:
AuthorBook(authorID, BookID) -- together authorID and BookID are primary key
I am thinking bookID and authorID are both primary keys in their own respect.
Since a candidate key (and therefore a primary key) must not have a subset containing a candidate key, how can authorID plus BookID be a primary key? This seems to break the definition of a primary key.
I understand this may be the difference between the real world and theory, but since the DBMS textbooks I have read seem to define junction tables this way and define primary keys this way, it seems like there is a disconnect there.
Am I misunderstanding this concept?
When we use one of those terms we have to be talking about a given table (variable, value or expression). The superkeys, CKs & PKs of a table are not determined by roles its attributes play in other tables. They are determined by what valid values can arise for the table under the given business rules.
Superkey - An attribute or a set of attributes that uniquely identifies a row in a database.
A superkey of a given table can be defined as a set of attributes that "uniquely identifies a row" of the table. (Not database.) Although that quoted phrase is a kind of shorthand that isn't a very clear description if you don't already know what it means.
A superkey of a given table can be defined as a set of attributes whose subrow values can only appear once in the table. Or as a set of attributes that functionally determines every set of attributes in the table.
When a superkey has just one attribute we can sloppily talk about that attribute being a superkey.
Candidate Key - An attribute or set of attributes that uniquely identifies a row in a database.
It's true that every CK (candidate key) of a certain table is a superkey of that table. But you mean that a set of attributes is by definition a superkey when/iff that and some other condition(s) hold. But you don't clearly say that when you write this section.
The difference between the superkey and a candidate key is no subset of a candidate key can itself be a candidate key.
No. A set is a subset of itself so a CK is a subset of itself so a CK always has a subset that is a CK--itself. What you mean is, no proper/smaller subset. Then your statement is true. But also true and more important is that no proper/smaller subset of a CK is a superkey.
You don't actually define "CK" in this paragraph. A CK of a given table can be defined as a superkey of that table that contains no proper/smaller subset that is a superkey of that table.
Primary key - A chosen candidate key that becomes the attribute to uniquely identify a row.
No. The PK (primary key) of a given table is defined as the one CK of that table that you decided to call the PK. (Not attribute.)
Note that CKs & PKs are superkeys. PKs don't matter to relational theory.
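To make those definitions concrete, here is a small Python sketch that derives superkeys and CKs from a set of FDs by brute force. The function names and the second example table are assumptions for illustration, and the empty attribute set is ignored for simplicity.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes functionally determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def superkeys(attributes, fds):
    """Every non-empty attribute set that determines all attributes of the table."""
    return [
        frozenset(combo)
        for size in range(1, len(attributes) + 1)
        for combo in combinations(attributes, size)
        if closure(combo, fds) == set(attributes)
    ]

def candidate_keys(attributes, fds):
    """Superkeys that contain no proper subset that is also a superkey."""
    sks = superkeys(attributes, fds)
    return [k for k in sks if not any(other < k for other in sks)]

# A many-to-many table has no non-trivial FDs, so the only CK is the whole heading.
author_book_keys = candidate_keys(["authorID", "bookID"], [])

# Hypothetical table with a surrogate id next to a two-column natural key.
attrs = ["row_id", "student_id", "course_id"]
fds = [
    (["row_id"], ["student_id", "course_id"]),
    (["student_id", "course_id"], ["row_id"]),
]
enrolment_superkeys = superkeys(attrs, fds)
enrolment_keys = candidate_keys(attrs, fds)
```

For the second table this finds five superkeys but only two CKs, {row_id} and {student_id, course_id}; either one could be called the PK.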
To create the relation between both:
AuthorBook(authorID, BookID) -- together authorID and BookID are primary key
What the superkeys & CKs are & so what the PK can be is determined by the FDs (functional dependencies) that hold in the table. But if you are presuming that this is a many to many table then it takes an authorID-bookID pair to uniquely identify a row, so there can only be one CK, {authorID, bookID}. So that is the only possible PK. So {authorID} & {bookID} cannot be superkeys or CKs or PKs.
You can see this by looking at examples & applying the definitions.
authorID  bookID
1         a
1         b
Here authorID does not uniquely identify a row. So it can't be a superkey. So it can't be a CK. So it can't be a PK.
textbooks I have read seem to define junction tables this way and define primary keys this way
No, they don't.
However they do say that certain sets of attributes & subsets of superkeys, CKs & PKs in the junction table are FKs (foreign keys) in the junction table referencing those other tables where they are CKs (which might be PKs) of/in those other tables.
A FK of a given table can be defined as a certain set of attributes in the table whose subrow values must appear as certain CK subrows in a certain other table.
But since you say this is a junction table, presumably {authorID} is a FK to an author table where its values appear under a CK/PK & {bookID} is a FK to a book table where its values appear under a CK/PK. So FK {authorID} in AuthorBook referencing {authorID} in Author & FK {bookID} in AuthorBook referencing {bookID} in Book.
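A runnable sketch of that arrangement, using Python's built-in sqlite3 module as a stand-in for whatever DBMS you use. The column types and sample values are assumptions; the point is only that the junction table has one composite PK while each of its columns is separately a FK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.executescript("""
CREATE TABLE Author (AuthorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Book (BookID TEXT PRIMARY KEY, BookTitle TEXT);
CREATE TABLE AuthorBook (
    AuthorID INTEGER NOT NULL REFERENCES Author (AuthorID),
    BookID   TEXT    NOT NULL REFERENCES Book (BookID),
    PRIMARY KEY (AuthorID, BookID)
);
INSERT INTO Author VALUES (1, 'Ada', 'Example');
INSERT INTO Book VALUES ('a', 'First Title'), ('b', 'Second Title');
INSERT INTO AuthorBook VALUES (1, 'a'), (1, 'b');
""")

def rejected(sql):
    """Run one statement; return True if SQLite raises an integrity error."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# The same author may appear in several rows, so {AuthorID} is not a key here.
rows_for_author_1 = conn.execute(
    "SELECT COUNT(*) FROM AuthorBook WHERE AuthorID = 1"
).fetchone()[0]

# The pair is the key: repeating it violates the composite PK.
duplicate_pair_rejected = rejected("INSERT INTO AuthorBook VALUES (1, 'a')")

# Each column is a FK: an author that does not exist in Author is refused.
unknown_author_rejected = rejected("INSERT INTO AuthorBook VALUES (99, 'a')")
```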
PS: PK & other terms mean something else in SQL. A declared SQL PK can have a smaller SQL UNIQUE declared within it. SQL "uniqueness" itself is defined in terms of SQL NULL. It's reasonable to say that an SQL PK is more reminiscent of a relational superkey than of a relational PK. Similarly, an SQL FK is more reminiscent of what we could reasonably call a relational foreign superkey than of a relational foreign key.

Unique constraint to combination of two columns unclustered index

I'm not asking HOW to do this, but if it's what I SHOULD be doing.
Two employees can be working on the same job. So of course, both FKs, EmployeeID and JobID, can have a MANY relationship in an "Employee_Jobs" table.
Let's take Employee A, Employee B, Job A and Job B. All of the following would be acceptable:
A A
A B
B A
B B
What would NOT be acceptable is a duplicate of a combination of these two PKs... since we cannot have for example, [Employee A working on Job A] twice.
So would it be correct to say that the only way to manage this is to make the combination of the two PKs, EmployeeID and JobID, a Unique, non-clustered index?
I tried to think of how to break this up into more tables instead, but I keep coming back to this same problem.
Yes, not only is it appropriate, but in fact the combination of these two attributes should be the PRIMARY KEY.
And in any other table whose rows have a logical attribute consisting of the two columns EmployeeID and JobID (representing the work done by an employee on a job, the contribution of the employee to a job, or any other association of an employee with a job), the FK in that table should be a composite foreign key consisting of these same two columns.
If you are using a surrogate key on this table to simplify joins and the definition of foreign keys in other tables, then by all means continue to do so, but keep the two-column natural key in this table as either a unique index or an alternate key (a key is a key: anything that is declared or defined to be unique), so as to ensure data integrity in this table. In fact, to make it clear to users of the schema, when this situation comes up I generally make the composite natural key the PRIMARY KEY, and define the surrogate (which is used in joins and in other tables' FKs) as an alternate key or unique index. This is pretty much only a semantic distinction, as the two create almost identical functionality. But because data integrity is more important to me than join syntax and foreign key structure, to me the natural key is the PRIMARY KEY.
Yes, in that case you should consider making both those fields the primary key; specifically, a composite (or compound) primary key like the one below, which will ensure the uniqueness of the combination of both fields.
primary key (EmployeeID, JobID)
You mentioned a unique, non-clustered index, but note that marking both fields as the primary key will actually create a unique clustered index on them (the SQL Server default when the table has no clustered index yet).
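If you want to see the effect of the composite primary key without touching your real database, here is a small sketch using Python's built-in sqlite3 module. SQLite has no clustered/non-clustered choice, so this only demonstrates the uniqueness rule; the parent Employee and Job tables and their foreign keys are left out, and the text column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee_Jobs (
        EmployeeID TEXT NOT NULL,
        JobID      TEXT NOT NULL,
        PRIMARY KEY (EmployeeID, JobID)
    )
""")

# All four combinations from the question are acceptable.
pairs = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
conn.executemany("INSERT INTO Employee_Jobs VALUES (?, ?)", pairs)

# [Employee A working on Job A] a second time is not.
try:
    conn.execute("INSERT INTO Employee_Jobs VALUES ('A', 'A')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Employee_Jobs").fetchone()[0]
```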

Primary Key of Associative Entity

In this ERD:
The Certificate entity is an associative entity and it has a unique identifier, Certificate Number. According to my textbook, an associative entity inherits its primary key from the other entities: the primary key of each end entity becomes a foreign key on the associative entity, and both foreign keys combined become its primary key.
Should the primary key of the Certificate entity be a composite key with three parts: CertificateNumber, EmployeeID and CourseID?
Or is its primary key CertificateNumber, with EmployeeID and CourseID as attributes of this entity?
I'm confused about this because normally an associative entity doesn't have its own identifier (Certificate Number). It just takes the primary keys from the other entities, combined as a composite key (EmployeeID, CourseID), and uses that composite key as its identifier.
Thank you
Alex
Associative entities don't have a primary key based on their own attributes. In your first diagram, you created an associative entity with the functional dependency (Employee_ID, Course_ID) -> Date_Completed. Note that while Employee_ID and Course_ID are columns in the table, they're not attributes. An attribute in the ER model is a mapping from an entity set to a value set. Foreign keys are components of a relationship and don't map to a value set.
In your second diagram, by adding a surrogate key, your associative entity becomes a regular entity which is in relationships with Employee and Course. Your primary key is just Certificate_Number, but a unique constraint on (Employee_ID, Course_ID) is probably a good idea. The relationships are represented by the functional dependencies Certificate_Number -> Employee_ID and Certificate_Number -> Course_ID recorded in the Certificate table.
You could also keep it an associative entity and use (Employee_ID, Course_ID) as the primary key and make Certificate_Number a regular attribute, though uniquely constrained (and probably auto-incremented). In this case, the diagram would look like your first one but with an extra attribute on the relationship.
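As a rough illustration of the surrogate-key variant, here is a sketch using Python's built-in sqlite3 module. The column types and sample values are assumptions, and the foreign keys to Employee and Course are omitted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Certificate (
        Certificate_Number INTEGER PRIMARY KEY,  -- surrogate, assigned by SQLite
        Employee_ID        INTEGER NOT NULL,
        Course_ID          INTEGER NOT NULL,
        Date_Completed     TEXT,
        UNIQUE (Employee_ID, Course_ID)          -- the natural key stays enforced
    )
""")
insert = "INSERT INTO Certificate (Employee_ID, Course_ID, Date_Completed) VALUES (?, ?, ?)"
conn.execute(insert, (7, 101, "2020-01-15"))
conn.execute(insert, (7, 102, "2020-03-02"))

certificate_numbers = [
    row[0]
    for row in conn.execute(
        "SELECT Certificate_Number FROM Certificate ORDER BY Certificate_Number"
    )
]

# A second certificate for the same employee and course is refused,
# even though it would get a fresh Certificate_Number.
try:
    conn.execute(insert, (7, 101, "2021-06-30"))
    second_certificate_rejected = False
except sqlite3.IntegrityError:
    second_certificate_rejected = True
```

Swapping the two declarations (PRIMARY KEY on the pair, UNIQUE on Certificate_Number) gives the associative-entity variant; the set of rows the table accepts is the same either way.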

Is primary key also super key and candidate key?

Is the primary key also a super key and a candidate key? Their definitions are lengthy but I wonder if this is true?
Please note that I'm not asking if they are the same term. I'm just asking in one direction, not the other way round.
Super Key - is a set of one or more columns that can be used to identify a record uniquely in a table
Candidate Key – can be any column or a combination of columns that can qualify as a unique key in database. There can be multiple Candidate Keys in one table. Each Candidate Key can qualify as a Primary Key. You can think of this as the "shortest" super key or minimal super key
Primary Key – is a column or a combination of columns that uniquely identify a record. Only one Candidate Key can be Primary Key.
For a Candidate Key to qualify as a Primary Key, it should be unique and non-null.
So, basically a primary key is just one of the candidate keys, which is just a minimal super key.
According to dry definitions:
Your primary key is a super key by definition - you can not have two rows with the same primary key.
However, the primary key is not a natural constraint of your business, but an artificial constraint in your data store: for example, you could set a person's birthday as the primary key in your table, and never have two people who were born on the same day. That would be silly, but possible. In that case, the primary key of the table is not a super key of the domain.
However, your primary key is not necessarily a candidate key - you can add redundant columns to your primary key.
Any set of attributes that is able to identify any row in the table is known as a super key. A minimal super key is termed a candidate key, i.e. a super key from which no attribute can be removed without losing uniqueness. The primary key could be any key that is able to identify a specific row in the database uniquely. (Quoted from another thread.)
And typing all three keys in google gives about 2,480,000 results
It depends.
The Primary key is the main key the table uses to identify between different elements. It is chosen from the candidate keys.
The candidate keys are all the keys that COULD be the primary key: all the keys that are unique and can be used to tell rows apart in the table.
A super key is a candidate key, possibly with additional attributes; the extra attributes are not needed to uniquely identify an instance of the entity set.
A candidate key is a minimal set of fields that uniquely identifies a tuple. For example, if you have a candidate key on the columns "user_id" and "pet_id", you'll never have more than 1 tuple with the same user_id and pet_id, and neither user_id nor pet_id individually will work as a unique identifier for the tuple.
A super key is a set of fields that contains a key. Using the above example where the combination of "user_id" and "pet_id" uniquely identifies a tuple if we added "pet_name" (which is not key because we can have multiple pets named "fluffy") it would be a super key. Basically it's like a candidate key without the "minimal subset of fields" constraint.
A primary key is a candidate key that you tell the DB to optimize on. There might be multiple ways of referring to a unique tuple (ie. multiple candidate keys) but you can specify one when you're creating the table that you will use the most frequently.
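Here is a quick Python sketch of the user_id / pet_id example. The sample rows and the helper name `is_unique` are made up, and uniqueness in a handful of rows only hints at what the keys are; the real keys come from the rules of your data.

```python
from itertools import combinations

def is_unique(rows, columns):
    """True if no two rows share the same values on the given columns."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(set(projected)) == len(projected)

# Hypothetical pets: user 1 owns two different pets that are both named fluffy.
rows = [
    {"user_id": 1, "pet_id": 1, "pet_name": "fluffy"},
    {"user_id": 1, "pet_id": 2, "pet_name": "rex"},
    {"user_id": 2, "pet_id": 1, "pet_name": "fluffy"},
    {"user_id": 1, "pet_id": 3, "pet_name": "fluffy"},
]
columns = ["user_id", "pet_id", "pet_name"]

# Column sets that are unique in the sample behave like super keys.
unique_sets = [
    set(combo)
    for size in range(1, len(columns) + 1)
    for combo in combinations(columns, size)
    if is_unique(rows, combo)
]

# Those with no smaller unique set inside them behave like candidate keys.
minimal_sets = [s for s in unique_sets if not any(t < s for t in unique_sets)]
```

With this data, `unique_sets` holds {user_id, pet_id} and {user_id, pet_id, pet_name}, while `minimal_sets` holds only {user_id, pet_id}: adding pet_name keeps it a super key but stops it from being minimal.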
Yes, we can simply say that a candidate key is a primary key, but it must be unique.

Can I have multiple primary keys in a single table?

Can I have multiple primary keys in a single table?
A Table can have a Composite Primary Key which is a primary key made from two or more columns. For example:
CREATE TABLE userdata (
    userid INT,
    userdataid INT,
    info CHAR(200),
    PRIMARY KEY (userid, userdataid)
);
Update: Here is a link with a more detailed description of composite primary keys.
You can only have one primary key, but you can have multiple columns in your primary key.
You can also have Unique Indexes on your table, which will work a bit like a primary key in that they will enforce unique values, and will speed up querying of those values.
A table can have multiple candidate keys. Each candidate key is a column or set of columns that are UNIQUE, taken together, and also NOT NULL. Thus, specifying values for all the columns of any candidate key is enough to determine that there is one row that meets the criteria, or no rows at all.
Candidate keys are a fundamental concept in the relational data model.
It's common practice, if multiple keys are present in one table, to designate one of the candidate keys as the primary key. It's also common practice to cause any foreign keys to the table to reference the primary key, rather than any other candidate key.
I recommend these practices, but there is nothing in the relational model that requires selecting a primary key among the candidate keys.
This is the answer for both the main question and for @Kalmi's question of
What would be the point of having multiple auto-generating columns?
This code below has a composite primary key. One of its columns is auto-incremented. This will work only in MyISAM. InnoDB will generate an error "ERROR 1075 (42000): Incorrect table definition; there can be only one auto column and it must be defined as a key".
DROP TABLE IF EXISTS `test`.`animals`;
CREATE TABLE `test`.`animals` (
    `grp` char(30) NOT NULL,
    `id` mediumint(9) NOT NULL AUTO_INCREMENT,
    `name` char(30) NOT NULL,
    PRIMARY KEY (`grp`,`id`)
) ENGINE=MyISAM;
INSERT INTO animals (grp,name) VALUES
    ('mammal','dog'),('mammal','cat'),
    ('bird','penguin'),('fish','lax'),('mammal','whale'),
    ('bird','ostrich');
SELECT * FROM animals ORDER BY grp,id;
Which returns:
+--------+----+---------+
| grp    | id | name    |
+--------+----+---------+
| bird   |  1 | penguin |
| bird   |  2 | ostrich |
| fish   |  1 | lax     |
| mammal |  1 | dog     |
| mammal |  2 | cat     |
| mammal |  3 | whale   |
+--------+----+---------+
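If you don't have a MyISAM table handy, the per-group numbering can be imitated in a few lines of Python. This is only an emulation with a counter per group; as far as I know MyISAM itself computes MAX(id) + 1 within the group, which gives the same result as long as no rows are deleted. The function name is made up.

```python
from collections import defaultdict

def assign_group_ids(inserts):
    """Give each (grp, name) the next id within its grp, starting at 1."""
    next_id = defaultdict(int)
    rows = []
    for grp, name in inserts:
        next_id[grp] += 1
        rows.append((grp, next_id[grp], name))
    return rows

# Same rows, in the same order, as the INSERT statement.
inserts = [
    ("mammal", "dog"), ("mammal", "cat"),
    ("bird", "penguin"), ("fish", "lax"), ("mammal", "whale"),
    ("bird", "ostrich"),
]

# Sorting the tuples mirrors ORDER BY grp, id with grp compared as a plain string,
# so the bird rows come first.
rows = sorted(assign_group_ids(inserts))
```

Every (grp, id) pair is unique even though id on its own repeats, which is why the composite primary key works here.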
(Have been studying these, a lot)
Candidate keys - A minimal column combination required to uniquely identify a table row.
Compound keys - 2 or more columns.
Multiple Candidate keys can exist in a table.
Primary KEY - Only one of the candidate keys that is chosen by us
Alternate keys - All other candidate keys
Both Primary Key & Alternate keys can be Compound keys
Sources:
https://en.wikipedia.org/wiki/Superkey
https://en.wikipedia.org/wiki/Candidate_key
https://en.wikipedia.org/wiki/Primary_key
https://en.wikipedia.org/wiki/Compound_key
As noted by the others it is possible to have multi-column primary keys.
It should be noted however that if you have some functional dependencies that are not introduced by a key, you should consider normalizing your relation.
Example:
Person(id, name, email, street, zip_code, area)
There can be a functional dependency id -> name, email, street, zip_code, area.
But often a zip_code is associated with an area, and thus there is an internal functional dependency zip_code -> area.
Thus one may consider splitting it into another table:
Person(id, name, email, street, zip_code)
Area(zip_code, name)
So that it is consistent with the third normal form.
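A small Python sketch of that split, with invented sample rows. It projects Person into the two smaller tables and checks that joining them on zip_code gives back the original rows, i.e. that nothing is lost by the decomposition.

```python
# Hypothetical Person rows: (id, name, email, street, zip_code, area).
person = [
    (1, "Ann", "ann@example.com", "1 Main St", "10001", "Midtown"),
    (2, "Bob", "bob@example.com", "9 Oak Ave", "10001", "Midtown"),
    (3, "Cy", "cy@example.com", "4 Elm Rd", "20002", "Harbor"),
]

# Person(id, name, email, street, zip_code): drop the area column.
person_slim = [row[:5] for row in person]

# Area(zip_code, name): one row per distinct zip code.
area = sorted({(row[4], row[5]) for row in person})

# zip_code -> area must hold, otherwise zip_code could not be the key of Area.
zip_is_key_of_area = len({zip_code for zip_code, _ in area}) == len(area)

# Join the two tables on zip_code to rebuild the original relation.
area_by_zip = dict(area)
rejoined = [row + (area_by_zip[row[4]],) for row in person_slim]
lossless = rejoined == person
```

The area name "Midtown" is now stored once instead of once per person, so it can no longer be updated inconsistently.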
Primary Key is very unfortunate notation, because of the connotation of "Primary" and the subconscious association in consequence with the Logical Model. I thus avoid using it. Instead I refer to the Surrogate Key of the Physical Model and the Natural Key(s) of the Logical Model.
It is important that the Logical Model for every Entity have at least one set of "business attributes" which comprise a Key for the entity. Boyce, Codd, Date et al refer to these in the Relational Model as Candidate Keys. When we then build tables for these Entities their Candidate Keys become Natural Keys in those tables. It is only through those Natural Keys that users are able to uniquely identify rows in the tables; as surrogate keys should always be hidden from users. This is because Surrogate Keys have no business meaning.
However the Physical Model for our tables will in many instances be inefficient without a Surrogate Key. Recall that non-covered columns for a non-clustered index can only be found (in general) through a Key Lookup into the clustered index (ignore tables implemented as heaps for a moment). When our available Natural Key(s) are wide this (1) widens the width of our non-clustered leaf nodes, increasing storage requirements and read accesses for seeks and scans of that non-clustered index; and (2) reduces fan-out from our clustered index, increasing index height and index size, again increasing reads and storage requirements for our clustered indexes; and (3) increases cache requirements for our clustered indexes, chasing other indexes and data out of cache.
This is where a small Surrogate Key, designated to the RDBMS as "the Primary Key" proves beneficial. When set as the clustering key, so as to be used for key lookups into the clustered index from non-clustered indexes and foreign key lookups from related tables, all these disadvantages disappear. Our clustered index fan-outs increase again to reduce clustered index height and size, reduce cache load for our clustered indexes, decrease reads when accessing data through any mechanism (whether index scan, index seek, non-clustered key lookup or foreign key lookup) and decrease storage requirements for both clustered and nonclustered indexes of our tables.
Note that these benefits only occur when the surrogate key is both small and the clustering key. If a GUID is used as the clustering key the situation will often be worse than if the smallest available Natural Key had been used. If the table is organized as a heap then the 8-byte (heap) RowID will be used for key lookups, which is better than a 16-byte GUID but less performant than a 4-byte integer.
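To put rough numbers on the fan-out argument, here is a back-of-the-envelope model in Python. Everything in it is a simplifying assumption rather than SQL Server's actual page layout: 8192-byte pages, a 6-byte child pointer per index entry, no page overhead, and a fixed count of one million leaf pages regardless of key width.

```python
def levels_above_leaf(leaf_pages, key_bytes, page_bytes=8192, pointer_bytes=6):
    """Return (fan-out, number of non-leaf levels) for a toy B-tree model."""
    fanout = page_bytes // (key_bytes + pointer_bytes)
    levels = 0
    pages = leaf_pages
    while pages > 1:
        pages = -(-pages // fanout)  # ceiling division: pages needed one level up
        levels += 1
    return fanout, levels

leaf_pages = 1_000_000

int_key = levels_above_leaf(leaf_pages, key_bytes=4)      # 4-byte integer surrogate
guid_key = levels_above_leaf(leaf_pages, key_bytes=16)    # 16-byte GUID
wide_key = levels_above_leaf(leaf_pages, key_bytes=200)   # wide natural key
```

In this toy model the 4-byte key gives a fan-out of 819 with 3 levels above the leaves, the GUID gives 372 with the same 3 levels (so more non-leaf pages, but no extra level), and the 200-byte key gives 39 with 4 levels. Real numbers will differ, but the direction matches the argument above: the wider the key, the lower the fan-out and the taller and larger the index.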
If a GUID must be used due to business constraints then the search for a better clustering key is worthwhile. If for example a small site identifier and 4-byte "site-sequence-number" is feasible then that design might give better performance than a GUID as Surrogate Key.
If the consequences of a heap (hash join perhaps) make that the preferred storage then the costs of a wider clustering key need to be balanced into the trade-off analysis.
Consider this example:
ALTER TABLE Persons
    ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName);
Where the tuple "(P_Id,LastName)" requires a uniqueness constraint, and may be a lengthy Unicode LastName plus a 4-byte integer, it would be desirable to (1) declaratively enforce this constraint as "ADD CONSTRAINT pk_PersonID UNIQUE NONCLUSTERED (P_Id,LastName)" and (2) separately declare a small Surrogate Key to be the "Primary Key" of a clustered index. It is worth noting that Anita possibly only wishes to add the LastName to this constraint in order to make that a covered field, which is unnecessary in a clustered index because ALL fields are covered by it.
The ability in SQL Server to designate a Primary Key as nonclustered is an unfortunate historical circumstance, due to a conflation of the meaning "preferred natural or candidate key" (from the Logical Model) with the meaning "lookup key in storage" from the Physical Model. My understanding is that originally SYBASE SQL Server always used a 4-byte RowID, whether into a heap or a clustered index, as the "lookup key in storage" from the Physical Model.
A primary key is the key that uniquely identifies a record and is used in all indexes. This is why you can't have more than one. It is also generally the key that is used in joining to child tables but this is not a requirement. The real purpose of a PK is to make sure that something allows you to uniquely identify a record so that data changes affect the correct record and so that indexes can be created.
However, you can put multiple fields in one primary key (a composite PK). This will make your joins slower (especially if they are larger string-type fields) and your indexes larger, but it may remove the need to do joins in some of the child tables, so as far as performance and design go, take it on a case-by-case basis. When you do this, each field itself is not unique, but the combination of them is. If one or more of the fields in a composite key should also be unique, then you need a unique index on it. It is likely though that if one field is unique, this is a better candidate for the PK.
Now at times, you have more than one candidate for the PK. In this case you choose one as the PK or use a surrogate key (I personally prefer surrogate keys for this instance). And (this is critical!) you add unique indexes to each of the candidate keys that were not chosen as the PK. If the data needs to be unique, it needs a unique index whether it is the PK or not. This is a data integrity issue. (Note this is also true anytime you use a surrogate key; people get into trouble with surrogate keys because they forget to create unique indexes on the candidate keys.)
There are occasionally times when you want more than one surrogate key (which are usually the PK if you have them). In this case what you want isn't more PK's, it is more fields with autogenerated keys. Most DBs don't allow this, but there are ways of getting around it. First consider if the second field could be calculated based on the first autogenerated key (Field1 * -1 for instance) or perhaps the need for a second autogenerated key really means you should create a related table. Related tables can be in a one-to-one relationship. You would enforce that by adding the PK from the parent table to the child table and then adding the new autogenerated field to the table and then whatever fields are appropriate for this table. Then choose one of the two keys as the PK and put a unique index on the other (the autogenerated field does not have to be a PK). And make sure to add the FK to the field that is in the parent table. In general if you have no additional fields for the child table, you need to examine why you think you need two autogenerated fields.
Some people use the term "primary key" to mean exactly an integer column that gets its values generated by some automatic mechanism. For example AUTO_INCREMENT in MySQL or IDENTITY in Microsoft SQL Server. Are you using primary key in this sense?
If so, the answer depends on the brand of database you're using. In MySQL, you can't do this, you get an error:
mysql> create table foo (
    id int primary key auto_increment,
    id2 int auto_increment
);
ERROR 1075 (42000): Incorrect table definition;
there can be only one auto column and it must be defined as a key
In some other brands of database, you are able to define more than one auto-generating column in a table.
Having two primary keys at the same time is not possible. But (assuming that you have not confused this with a composite key), maybe what you need is to make one attribute unique.
CREATE TABLE t1 (
    c1 int NOT NULL,
    c2 int NOT NULL UNIQUE,
    ...,
    PRIMARY KEY (c1)
);
However, note that in a relational database a 'super key' is a set of attributes which uniquely identifies a tuple or row in a table. A 'key' is a 'super key' with the additional property that removing any attribute from it means it is no longer a 'super key' (or simply, a 'key' is a minimal super key). If there are several keys, all of them are candidate keys. We select one of the candidate keys as the primary key. That's why talking about multiple primary keys for one relation or table is a contradiction.
Good technical answers have already been given, better than I could do.
I can only add this to the topic:
If you want something that is not allowed or not acceptable, that is a good reason to take a step back.
Understand the core of why it's not acceptable.
Dig deeper into documentation, journal articles, the web, etc.
Analyze and review the current design and point out major flaws.
Consider and test every step during the new design.
Always look forward and try to create an adaptive solution.
Hope it helps someone.
Yes, it's possible in SQL, in the form of a single primary key spanning several columns,
but we can't set more than one primary key in MS Access.
I don't know about the other databases.
CREATE TABLE CHAPTER (
    BOOK_ISBN VARCHAR(50) NOT NULL,
    IDX INT NOT NULL,
    TITLE VARCHAR(100) NOT NULL,
    NUM_OF_PAGES INT,
    PRIMARY KEY (BOOK_ISBN, IDX)
);

Resources