primary index, equality on nonkey - Silberschatz Database System Concepts, A3 - database

In Silberschatz Database System Concepts 6th Ed., in chapter 12, par. 12.3.1, p. 542, there is an explanation of an algorithm for processing a query that selects by imposing an equality constraint on a nonkey attribute of the relation, using a primary index.
The paragraph claims that the reads from the file will be consecutive, since the file is sorted by the search key.
I don't understand - why would the read be consecutive?
As I see it, the records are ordered by the clustering key of the primary index, while the selection is on non-key attributes, so any record of the relation might satisfy it. The only way I see to retrieve all the matching records is a linear scan of the whole relation.

In this context, "primary index, equality on non-key" means that we have a primary index on some attribute, namely A (i.e. the records are sorted physically on the disk based on A). Here, non-key means that there may exist multiple records having the same value for the attribute A; in other words, A is not guaranteed to be unique. The selection, however, is indeed an equality on A, the clustering key of the primary index.
Thus, the algorithm becomes as follows: use the index to get the first record that satisfies the corresponding equality condition, then use linear scan until the condition breaks.
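For concreteness, here is a hedged sketch of the kind of query this covers. The instructor table, the dept_name column and the index name are illustrative assumptions, not taken from the cited page, and plain CREATE INDEX does not by itself make an index clustering - the sorted-file organization is assumed as a physical property:

```sql
-- Assumed physical organization: records are stored sorted on dept_name,
-- and the primary (clustering) index is on dept_name, which is NOT a key
-- of the relation (many instructors can share a department).
CREATE TABLE instructor (
    id        INT,
    name      VARCHAR(50),
    dept_name VARCHAR(50)
);
CREATE INDEX instructor_dept_idx ON instructor (dept_name);

-- A3-style selection: equality on the non-key clustering attribute.
-- The index locates the first matching record; because the file is sorted
-- on dept_name, all further matches sit in consecutive blocks, so the scan
-- stops as soon as a record with a different dept_name is reached.
SELECT * FROM instructor WHERE dept_name = 'Physics';
```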
To make sure this makes sense, look at algorithm A4.
In short, here non-key only means that the indexing field is not unique.
You can further refer to slide 13 in these slides and see how the name is phrased.
As a side note: notice the difference between the cost written on this slide and the cost written in the book, which misses one t_S. It really shouldn't miss it, as that term is explicitly present in the cost of A2 and clearly stated in the explanation, in Figure 12.3, on page 543. This may be a typo in the book. It isn't really of any importance; I just wanted to point it out.
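For reference, and as my reading of that remark rather than a quote from either source (using the book's notation: h_i for the index height, t_S for seek time, t_T for block transfer time, b for the number of blocks holding matching records), the two costs would then differ by exactly one seek term:

E_{slides} = h_i \cdot (t_T + t_S) + t_S + b \cdot t_T
E_{book}   = h_i \cdot (t_T + t_S) + b \cdot t_T

with the extra t_S accounting for the seek to the first block of matching records once the index traversal is done.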

Related

Primary Key Theory & I/O Efficiency

According to RDBMS theory, when choosing a primary key, we are supposed to choose amongst minimal superkeys, effectively optimizing our key choice w.r.t. the number of columns.
Why are we interested in optimizing against the number of columns instead of the number of bytes? Wouldn't a smaller-byte-size PK result in smaller index tables and overall more read/write-efficient queries? For example, choosing a PK composed of 2 varchar(16) columns rather than 1 varchar(64).
I think I agree with you.
I don't think theory accounts for physical storage.
If, for instance, you created a column which was a SHA-256 of two small columns, say VARCHAR(16), then yes, the nodes of the B-tree in the index would take up more space, and the index would not be faster than indexing the two 16-byte columns.
There is some efficiency lost building an index which matches on the first column and has to switch to comparisons on the second column. The B-tree nodes are more efficient if the whole node is comparing on the same column.
Honestly though, I don't think either amounts to much difference in efficiency. I think the statement is RDBMS theory not accounting for storage size.
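To make the question's example concrete, here is a hedged sketch (the table and column names are invented) of the two competing key choices; it is the byte width of the indexed data, not the column count, that drives index size:

```sql
-- Option A: composite key of two short columns (up to 32 bytes of key data).
CREATE TABLE account_a (
    region_code  VARCHAR(16) NOT NULL,
    account_code VARCHAR(16) NOT NULL,
    PRIMARY KEY (region_code, account_code)
);

-- Option B: a single wide column (up to 64 bytes of key data).
CREATE TABLE account_b (
    account_ref VARCHAR(64) NOT NULL PRIMARY KEY
);
```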
The identification of minimal rather than non-minimal superkeys is very important when defining keys in a database. If you choose to enforce uniqueness on three columns, A,B,C then that's very different from enforcing uniqueness on just two columns, A,B. A uniqueness constraint on A,B,C would not guarantee the uniqueness of A,B - so A,B would no longer be a superkey. On the other hand if the uniqueness constraint is on A,B then A,B,C is also a superkey. So it's essential from a data integrity point of view to know what the irreducible set of superkeys is.
This has nothing to do with primary keys as such because all keys must be minimal, not just the one you choose to call primary. Storage size and performance are something else. Internal storage is an important consideration in the design of indexes but size and performance are non-functional requirements whereas keys are all about logic and functionality.

Is a hash index never a clustering index?

From Database System Concepts
We use the term hash index to denote hash file structures as well as secondary hash indices. Strictly speaking, hash indices are only secondary index structures.
A hash index is never needed as a clustering index structure, since, if a file itself is organized by hashing, there is no need for a separate hash index structure on it. However, since hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a clustering hash index on it.
Is "secondary index" the same concept as "nonclustering index" (which is what I understood from the book)?
Is a hash index never a clustering index or not?
Could you rephrase or explain why the reason "A hash index is never needed as a clustering index structure" is "if a file itself is organized by hashing, there is no need for a separate hash index structure on it"? What about "if a file itself is not organized by hashing"?
Thanks.
The text tries to explain something but unfortunately creates more confusion than it resolves.
At the logical level, database tables (correct term: "relations") are made up of rows (correct term: "tuples") which represent facts about the real world the db is aimed to represent/reflect. Don't ever call those rows/tuples "records", because "records" is a concept pertaining to the physical level, which is distinct from the logical.
Typically, but this is not a universal law cast in stone, you will find that the physical organization consists of a "main" datastore which has a record for each tuple, where that record contains each and every attribute (column) value of the tuple (row). (That's unless there are LOBs in play or so.) Those records must be given a physical location in the datastore, and this is usually/typically done using a B-tree on the primary key values. This facilitates:
retrieving only specific [tuples/rows with] primary key values from the relation/table.
traversing the [tuples of] relation in-order of primary key values
retrieving only [tuples/rows within] specific ranges of primary key values from the relation/table.
This B-tree on the primary key values is typically called the "clustering" index.
Often, there is also a frequent need for retrieving only [tuples/rows with] specific values of attributes that are not the primary key. If that needs to be done as efficiently/fast as it can be for values of the primary key, we use similar indexes that are then sometimes called "secondary". Those indexes typically do not contain all the attribute/column values of the tuple/row indexed, but only the attribute values to be indexed plus a mention of the primary key value (so we can find the rest of the attributes in the "main" datastore).
Those "secondary" indexes will mostly also be B-tree indexes which will permit in-order traversal for the attributes being indexed, but they can potentially also be hashing indexes, which permit only to look up tuples/rows using equality comparisons with a given key value ("key" = index key, nothing to do with the keys on the relation/table, though obviously for most keys on the table/relation, there will be a dedicated index too where the index key has the same attributes as the table key it supports).
Finally, there is no theoretical reason why a "primary" (/"clustered") index could not be a hash index (the text kinda suggests the opposite but that is plain wrong). But given the poor level of the explanation in your textbook, it is probably not expected of you to be taught that.
Also note that there are still other ways to physically organize a database than just using B-tree or hash indexes.
So to sum up:
"Clustered" usually refers to the index on the primary data records store
and is usually a B-tree [or some such] on the primary key
and the textbook presumably does not want you to know about more advanced possibilities
"Secondary" usually refers to additional indexes that provide additional "fast access to specific tuples/rows"
and is usually also a B-tree that permits in-order traversal just like the "clustered"/"primary" index
but can also be a hash index that permits only "access by given value" but no in-order traversal.
Hope it helps.
I will try to oversimplify just to point where your confusion is.
There are different types of index organisations:
Clustered
Non Clustered
Each of them may use one of the following file structures:
Sequential File organisation
Hash file organisation
We can have clustered indexes and non clustered indexes using hash file organisations.
Your textbook is supposing that clustered indexes are used only on primary keys.
It also supposes that hash indexes, which I suppose is referring to a non-clustered index using hash file organisation, are only used for secondary indexes (non primary-key fields).
But you can actually have clustered indexes on primary keys and non-primary keys. Maybe it is a simplification done for the sake of comprehension, or it is based on a specific implementation of a DB.
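As a hedged, product-specific illustration (two separate snippets, with invented table and column names, and syntax that is not from the textbook): SQL Server lets you declare explicitly which index is the clustering one, while PostgreSQL offers hash indexes as secondary indexes only:

```sql
-- SQL Server: the clustered index defines the physical ordering of the rows.
-- Here it is placed on the primary-key column, but it does not have to be.
CREATE TABLE account (
    account_id INT NOT NULL,
    owner_name VARCHAR(100)
);
CREATE UNIQUE CLUSTERED INDEX ix_account_id ON account (account_id);

-- PostgreSQL: a secondary hash index; it supports only equality lookups,
-- not in-order traversal or range scans.
CREATE INDEX ix_account_owner ON account USING HASH (owner_name);
```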

How to design table with primary key and multivalued attribute?

I'm interested in database design and now reading the corresponding literature.
While going through the book, I came across a strange example that makes me feel uncertain.
There is a relation
In this table we have a composite primary key (StudentID, Activity). But ActivityFee is partially dependent on the key of the table (Activity -> ActivityFee), so the author suggests to divide this relation into two other relations:
Now if we take a look at the STUDENT_ACTIVITY, Activity becomes a foreign key and relation still has a composite primary key.
We've got a table in which all of the columns together form the composite primary key; is that OK?
If it is not, what should we do in this case? (Probably define a surrogate key?)
What is a good way to deal with a multivalued attribute (Activity in our case) in order to eliminate possible data anomalies?
A table which consists only of a composite key is perfectly OK if that matches your business requirement.
Activity is not a multivalued attribute. There is a single value for activity for each tuple.
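To make the book's decomposition concrete, here is a hedged sketch; the table and column names come from the question, while the column types and constraint details are assumptions:

```sql
-- ACTIVITY stores each activity once, so ActivityFee is recorded only once.
CREATE TABLE ACTIVITY (
    Activity    VARCHAR(50)  PRIMARY KEY,
    ActivityFee DECIMAL(8,2) NOT NULL
);

-- STUDENT_ACTIVITY consists only of its composite primary key, which is
-- perfectly fine: each row records that a student does an activity.
CREATE TABLE STUDENT_ACTIVITY (
    StudentID INT         NOT NULL,
    Activity  VARCHAR(50) NOT NULL REFERENCES ACTIVITY (Activity),
    PRIMARY KEY (StudentID, Activity)
);
```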
There is nothing wrong with a composite candidate key. (If your reference doesn't talk in terms of candidate keys, ie if it talks about primary keys in any other case than when there just happens to be only one candidate key, get a new reference.)
Your text will tell you what is good and bad design. There is no point in worrying about every property you notice about a relation that it might be "bad". The kind of "good" it is currently addressing is that given by "normalization".
"Activity" is not a "multivalued attribute". A "multivalued" attribute is a non-relational notion. The term is frequently but incorrectly used to mean either an "attribute" in a non-relational "table" that somehow (which is never explained) has more than one entry per "row", or for a column in a relational table that has a value with multiple similar parts (set, list, bag, table, etc) that somehow (which is never explained) doesn't apply to say, strings & numerals, or for a column in a relational table that has a value with multiple different parts (record, tuple, etc) that somehow (which is never explained) doesn't apply to, say, dates. (Sometimes it is even misapplied to mean a bunch of attributes with similar names and values, which ought to be replaced by a single attribute with a row for each original name.) (These are just cases of unwanted designs.) "Multivalued" gets used as an antonym of the similarly misused/abused term "atomic".
Having the same (value or) subrow of values appear more than once in a column or table is, again, neither good nor bad per se. Again, your reference will tell you what is good design.

Understanding Database Normalization - Second Normal Form(2NF)

I have been learning Normalization from "Fundamentals of Database Systems by Elmasri and Navathe (6th edition)" and I am having trouble understanding the following part about 2NF.
The following image is an example given under 2NF in the textbook
The candidate key is {SSN,Pnumber}
The dependencies are
{SSN, Pnumber} -> Hours, SSN -> Ename, Pnumber -> Pname, Pnumber -> Plocation
The formal definition:
A relation schema R is in 2NF if every nonprime attribute A in R is fully functionally dependent on the primary key of R.
For example, in the above picture: suppose I define an additional functional dependency SSN -> Hours. Then, taking the two functional dependencies
{SSN, Pnumber} -> Hours and SSN -> Hours,
the relation won't be in 2NF, because SSN -> Hours is now a partial functional dependency, as SSN is a proper subset of the given candidate key {SSN, Pnumber}.
Looking at the relation and the general definition of 2NF, I presume that the above relation is in 2NF.
As far as my understanding goes, a relation is in 2NF if one cannot find a proper subset (of prime attributes) of the left-hand side (the candidate key) of a functional dependency which determines the NPA (non-prime attribute).
My first question is: why is the above relation not in 2NF? (The textbook considers the above relation to not be in 2NF.)
There are, however, informal guidelines (steps, as per the textbook, that a person not knowing normalization can take to reduce redundancy) described at the beginning of this chapter, which are:
■ Making sure that the semantics of the attributes is clear in the schema
■ Reducing the redundant information in tuples
■ Reducing the NULL values in tuples
■ Disallowing the possibility of generating spurious tuples
The guideline mentioned is as follows:
My second question is: if the steps described above are taken into account when considering why the relation is not in 2NF, do you assume the following functional dependencies,
{SSN,Pnumber} -> Pname
{SSN,Pnumber} -> Plocation
{SSN,Pnumber} -> Ename
making the decomposition of the relation correct? If the assumed functional dependencies are incorrect, then what are the factors that lead the relation to not satisfy the 2NF condition?
When looked at from a general point of view ... because the table contains more than one prime attribute and the stored information concerns both employee and project information, one can point out that those need to be separated; as Pnumber is a prime attribute of the composite key, the redundancy can somehow be intuitively guessed. This is because the semantics of the attributes are known to us.
But what if the attributes were replaced with A, B, C, D, E, F?
My third question is: are functional dependencies pre-determined based on the "functionalities of the database and a database designer having domain knowledge of the attributes"?
Based on the data and the relation state at a given point, the functional dependencies can change; one that was valid in one state can become invalid in another state. In general, this can be said for any non-prime attribute determining another non-prime attribute.
The formal definition:
A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
So won't predefining a functional dependency be wrong, as one cannot generalize the relation state at any given point?
Pardon me if my basic understanding of things is flawed to begin with.
Why is the above relation not in 2NF?
Your original/first/informal "definition" of 2NF is garbled and not helpful. Even the quote from the textbook is wrong since 2NF is not defined in terms of "the PK (primary key)" but rather in terms of all the CKs (candidate keys). (Their definition makes sense if there is only one CK.)
A table is in 2NF when there are no partial dependencies of non-prime attributes on CKs. Ie when no determinant of a non-prime attribute is a proper/smaller subset of a CK. Ie when every non-prime attribute is fully functionally dependent on every CK.
Here the only CK is {Ssn, Pnumber}. But there are FDs (functional dependencies) out of {Ssn} and {Pnumber}, both of which are smaller subsets of the CK. So the original table is not in 2NF.
If the above statement is taken into account, do you assume the following functional dependencies
So won't the same process of decomposition shown, based on the informal way alone, be difficult each time such a case arises?
A table holds the rows that make some predicate (statement template parameterized by column names) into a true proposition (statement). Given the business rules, only certain business situations can arise. Then given the table predicates, which give table values from a business situation, only certain database values can arise. That leads to certain tables having certain FDs.
However, given some FDs that hold, we can formally use Armstrong's axioms to get all other FDs that must also hold. So we can use both informal and formal ways to find which FDs hold and don't hold.
There are also shorthand rules that follow from the axioms. Eg if a set of attributes has a different subrow value in each tuple then so does every superset of it. Eg if a FD holds then every superset of its determinant determines every subset of its determined set. Eg every superset of a superkey is a superkey & no proper subset of a CK is a CK. There are also algorithms.
Are functional dependencies pre-determined based on "functionalities of database and a database designer having domain knowledge of the attributes" ?
When normalizing we are concerned with the FDs that hold no matter what the business situation is, ie what the database state is. Each table for each business can have its own particular FDs per the table predicate & the possible business situations.
PS Do "make sense" of formal things in terms of the real world when their definitions are in terms of the real world. Eg applying a predicate to all possible situations to get all possible table values. But once you have the necessary formal information, only use formal definitions and procedures. Eg determining that a FD holds for a table because it holds in every possible table value.
So would any general table be judged to be in 2NF based on the sole condition of the table having a composite primary key?
There are tables in 5NF (hence too all lower NFs) with all sorts of mixes of composite & non-composite CKs. PKs don't matter.
It is frequently wrongly said that having no composite CKs guarantees 2NF. A table without composite keys and where {} does not determine any attribute is in 2NF. But if {} determines an attribute then it's a proper/smaller subset of any/every CK with any attributes. {} determines an attribute when every row has to have the same value for that attribute.
Why is the above relation in 2NF?
EP1, EP2, and EP3 are in 2NF because, for each one, the key identifies the non-key. No part of any key identifies any part of any non-key. That is what is meant by "for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y]".
By contrast, you might say EMP_PROJ is over-specified. If ssn identifies ename (as the text says it does), then the combination {ssn, pnumber} is too much. There exists a subset of the key {ssn, pnumber} that identifies a part of the non-key, {ename}. That situation does not occur in a table conforming to 2NF, as EP1, EP2, and EP3 illustrate.
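As a hedged sketch of that decomposition (the table and column names follow the textbook figure as quoted above; the column types are assumptions):

```sql
-- EP1: the full candidate key {Ssn, Pnumber} determines Hours.
CREATE TABLE EP1 (
    Ssn     CHAR(9) NOT NULL,
    Pnumber INT     NOT NULL,
    Hours   DECIMAL(4,1),
    PRIMARY KEY (Ssn, Pnumber)
);

-- EP2: Ssn alone determines Ename.
CREATE TABLE EP2 (
    Ssn   CHAR(9) PRIMARY KEY,
    Ename VARCHAR(50)
);

-- EP3: Pnumber alone determines Pname and Plocation.
CREATE TABLE EP3 (
    Pnumber   INT PRIMARY KEY,
    Pname     VARCHAR(50),
    Plocation VARCHAR(50)
);
```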
Are functional dependencies ... based on ... domain knowledge of the attributes?
Emphatically, yes! That's all they're based on. The DBMS is just a logic machine. The ideas of "employee" and "hours" don't exist for it. The database designer chooses to define tables that model some real-world universe of discourse, and imposes meaning on the columns. He gives names to the attributes (above) in X and Y. He decides which columns serve to identify a row based on what is true about the universe being modeled.
if a table has a composite primary key, is it then, regardless of the functional dependencies, not in 2NF?
No. Remember, 2NF is defined in terms of FDs. What could it mean to speak of conforming to 2NF "regardless" of them?
The number of columns in the key is immaterial. It's some set, X, identifying the complement, Y.
I'm not sure if I thoroughly understand your questions, but I'll give it a try and explain.
Your first statement about 2NF:
a relation is in 2NF if one cannot find a proper subset on the left hand side of a functional dependency which defines the NPA
is correct, as well as your supposition
if {SSN, Pnumber} -> hours and SSN -> hours, then this relation won't be in 2NF
because what that means is that you could determine 'hours' from 'SSN' alone, so using the composite key {SSN, Pnumber} to determine 'hours' would be redundant, and thus violates the 2NF requirements.
What you call the left-hand side of an FD is usually called a key. You use the key to find the related data. In order to save space (and reduce complexity), you should always try to find a minimal key, and break up larger tables into smaller ones if possible, so you do not have to save information in more places than necessary. This is what normalization to the normal forms is all about; having been studied for about half a century now, substantial theory on the matter has been developed, and some rules crystallized from it, like 1NF, 2NF, 3NF etc.
Your second question confuses me a lot, because from what you are saying, it seems you already understand this.
Could there be some confusion about the FDs? From the figure, it seems to me that they are defined like this:
{SSN,Pnumber} -> hours
{SSN} -> ename
{Pnumber} -> Pname,Plocation
Just like the three lower tables are modeled, together they add up to the relation (table) modeled above.
So, in the first table, you would need the composite key {SSN,Pnumber} to access any data in the relation (search in the table), while that clearly is not necessary for most of the fields.
Now, I'm not sure about what purpose that table would fulfill in real life. While that is not formally necessary, as long as the FD's are given, it might be easier to imagine why the design will benefit from normalization.
So let's say it's about recording work hours per employee per project in some organization. SSN identifies the employee (whose name is also stored as ename because it is easier to remember, but could be a duplicate), and Pnumber identifies the project, whose name and location are also stored for much the same reason.
Then, if you as a manager need to register that an employee worked another few hours on some project, you would use your manager app on your device, which in turn will update the tables seamlessly (you cannot expect managers to understand the logic of normalization).
Behind the scenes, however, it would amount to some query; in SQL that would be an INSERT statement which adds another row to the relevant table(s).
Now you can see that with the table above you would have to insert all six attributes, while with the normalized tables below you only need to add a row to table EP1, consisting of three attributes. In a large organization with thousands of employees delivering their worksheets every week, that quickly becomes a huge difference in storage requirements. That has a number of benefits, perhaps the most significant being search speed.
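A hedged example of that difference (the values are invented, and the 888/'Battleship'/'Kitchen Sink' project details echo the example further below): registering a few more hours touches only the narrow EP1 table, not the employee or project details:

```sql
-- With the normalized design, recording extra hours is one small row:
INSERT INTO EP1 (Ssn, Pnumber, Hours) VALUES ('123456789', 888, 7.5);

-- With the unnormalized EMP_PROJ, the same fact drags Ename, Pname and
-- Plocation along as redundant copies:
INSERT INTO EMP_PROJ (Ssn, Pnumber, Hours, Ename, Pname, Plocation)
VALUES ('123456789', 888, 7.5, 'Smith', 'Battleship', 'Kitchen Sink');
```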
Your third question I don't understand at all, I'm afraid. In a way you could say FDs are predetermined once you have decided what data you will save in your database. The FDs are not supposed to change. When modeled in the DB, they will not change. If you later find you need to alter the design, then that will mean new relations with new FDs.
The text you seem to be quoting from somewhere simply says that if you have the FD X -> Y (X gives or determines Y), then if you have any two tuples (records) in that relation (table) that have the same value of X, they must also have the same value of Y. Or, in our example, if Pnumber somewhere is given the value 888, Pname is 'Battleship' and Plocation is 'Kitchen Sink', then if somewhere else (in some other record) the Pnumber 888 is used, then there too Pname must be 'Battleship' and Plocation must be 'Kitchen Sink', because Pname and Plocation are functionally dependent on Pnumber.
Now that was almost another chapter in your textbook, or what? Hope it helps, because it took me some time to write :-)
A table can be said to be in 2NF if the primary key is composed of multiple columns and, if for each row these columns were concatenated together into a single string, the resulting column would qualify as the primary key. Alternatively, a single-column primary key will also qualify as 2NF.
In this case the same employee could have multiple phone numbers (PNUMBER), so you cannot have a compound primary key that includes the phone number.

What are the design criteria for primary keys?

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.
What are the criteria?
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
What is a Primary Key?
The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.
Ability to Change
Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, provided someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define ON DELETE CASCADE and ON UPDATE CASCADE--to sync referential dependencies without having to disable constraints.
Natural or Artificial/Surrogate?
Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states is a good natural key because the abbreviation is consistent and everyone knows them:
US_STATE_PRIMARY_KEY US_STATE
--------------------------
AL Alabama
AK Alaska
AZ Arizona
AR Arkansas
CA California
Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.
Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:
US_STATE_CODE US_STATE
--------------------------
100 Alabama
101 Alaska
102 Arizona
103 Arkansas
104 California
There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.
Single Column vs Composite
Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).
Performance
From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.
Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several table, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).
In an ERD entities have primary keys. These are natural keys, that is, they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity it might be OrderRef. For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.
Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.
The characteristics of an entity's primary key are:
unique
familiar
stable (presumed)
minimal (one or more attributes but as few as necessary)
When it comes to primary keys for database tables, natural keys are not always suitable.
There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.
Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.
In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there with dependent tables inheriting a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful, but mostly it is just a hassle.
The characteristics of an entity's primary key are:
unique
appropriate (meaningless)
guaranteed stability
minimal, usually a single column (except for intersection tables)
So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.
The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.
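A hedged sketch of that point (the table and column names are illustrative, using the InvoiceNo example from above): the synthetic key is the primary key, while the natural candidate key still gets a UNIQUE constraint:

```sql
CREATE TABLE invoice (
    invoice_id  INT         NOT NULL PRIMARY KEY,  -- synthetic/surrogate key
    invoice_no  VARCHAR(20) NOT NULL UNIQUE,       -- natural candidate key, still enforced
    customer_id INT         NOT NULL,
    issued_on   DATE        NOT NULL
);
```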
Regarding familiarity
Familiarity is a curly one. It is an important consideration when we are identifying primary keys in a conceptual data model, but it is less useful when it comes to database design.
In a comment @bbadour provides two contrasting examples:
{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}
and poses the question:
"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."
Well, 840082470 is like an invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.
As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.
First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.
Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.
Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.
Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.
Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.
In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.
[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.
A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.
And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.
Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.
A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).
Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.
These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.
A primary key is a candidate key.
Hey. it's open again. Here goes.
(1) Choose good candidate keys.
It does not pertain to the database designer to choose candidate keys. The database designer has the responsibility to see to it that all the uniqueness requirements he is informed of by the user will be enforced. So it is the user who "chooses" what the candidate keys are.
There are two scenarios I can think of that relax this unequivocal position a bit.
One is if the user says that some attribute of type 'video' or 'audio' (or some such) is to be unique. It may be infeasible to actually enforce that, and it is the designer's responsibility to point that out to the user (as it is also his responsibility to point out that 'uniqueness' of audio and video content is a very debatable subject, and that any uniqueness on such attribute values, even if enforceable by the system, still has a good chance of not being the same uniqueness that the user wants).
Second is how the picture gets muddied by the possibility of distinct logical designs all addressing the same problem. If D1 and D2 are both valid designs addressing the same problem, then it might be the case that a certain given uniqueness rule imposed by the user is enforceable using keys in D1, but not in D2. From this perspective, "choosing candidate keys" can be interpreted as "choosing a particular design such that a given uniqueness rule is enforceable using keys". But that wasn't really the question that you asked.
(2) Choose good primary keys.
A while ago, Darwen launched the question "What are good reasons to single out one particular candidate from among the others as being 'primary'?". Nothing much came out, except then perhaps: "to suggest that this particular key is the preferred one to use whenever making references to this relvar". I suspect they didn't find that convincing enough to change their earlier decision that "no key is more unique than any other".
But, supposing that nonetheless there exists some valid reason to single out one particular key as "primary", I suppose the following considerations apply:
the likeliness, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design;
and, as a consequence of that, the probability of having to change a value of some existing primary key (key values that are highly stable will be preferable over key values that are more volatile);
the percentage of the business that naturally uses some such key in their daily operations;
if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Your answer to Erwin:
"I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "
Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys, then simplicity, stability and familiarity are the criteria by which to choose from all the candidate keys in the referenced relation. There could be more than one candidate key that fulfils those criteria to more or less the same extent. If we look at familiarity, one candidate key could be very familiar to a group of users and not to another group, for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as a foreign key). If you insist on 'primary keys', of which we only have one per relation, then I have to ask what makes a key more primary than others.
I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).
So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.
Do you agree with this?
If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"?
If not, why?
Regards,
Carlos
A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:
Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.
Irreducibility: Removing any column from a candidate key must violate the uniqueness property. In other words, no subset of columns in a candidate key is itself a candidate key.
If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.
It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criterion often applied is simplicity, indicating the candidate key with the fewest columns. However, there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria, however, are strictly outside the scope of the relational model, often conflict with each other, and are often applied to deal with implementation limitations.
I would say that the first two concepts "uniqueness" and "irreducability" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.
Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candidate keys, but are why we should choose a primary key at all. If there are multiple candidate keys in a table, it simplifies things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.
What are the criteria?
A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.
You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).
In this case, you rely on the outside world.
Do all people have SSN? (They don't).
Are SSN's unique? (They aren't).
Can an SSN be assigned to another person? (It can).
You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.
In this case, you rely on yourself and your database skills.
Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).
Or another set of answers:
Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
Are these ID's unique? (No, we failed to merge two versions of the database properly).
Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).
The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But by far not all things have some kind of identifier everybody agrees upon.
However, a good natural key cannot be overemphasized.
The Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you can always restore right from memory.
Only one, really: choose a surrogate for each table (identity/auto_number) or something similar that the users will never even see, so you can do whatever is necessary with them whenever you need to, now and in the future.
(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)
At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.
For the purpose of our discussion here, it's important to distinguish between the two:
A PRIMARY KEY is a field or combination of fields that uniquely identifies a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server; I'm not sure how other RDBMSs might handle this.)
Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the same criteria for choosing a good clustering key are the same as for choosing a good primary key.
Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:
Unique - otherwise useless as a row identifier for other indexes
Narrow - this key is used in other indexes, so should be as narrow as possible
Static - If key values change, then references become invalid and will need updating
Ever-increasing - To reduce physical table fragmentation as new rows are added
It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.
A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)
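A hedged, SQL Server-flavoured sketch of those criteria (the table and column names are invented): an IDENTITY integer is narrow, static and ever-increasing, and here the PRIMARY KEY is explicitly made the clustered index, which is also the default:

```sql
CREATE TABLE dbo.OrderHeader (
    OrderID    INT IDENTITY(1,1) NOT NULL,  -- narrow, static, ever-increasing
    CustomerID INT               NOT NULL,
    OrderedOn  DATETIME          NOT NULL,
    CONSTRAINT PK_OrderHeader PRIMARY KEY CLUSTERED (OrderID)
);
```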
I'm going to say something here that is not expected.
All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.
The primary key is special when it comes to range queries, and for that reason, if you have a dominant range query, that is your primary key, no exceptions.
If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.
Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.
Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.
EDIT: definition of range query:
A range query is a query that is an ORDER BY query or contains either a greater than or less than operator. What we are interested in are the columns for which these queries run on. The fundamental idea is a range query fetches several (tens to hundreds to perhaps thousands but not all) rows from the table based on bounding conditions at one or both ends.
There is another kind of range query, and that is where you have a foreign key to another table and the operation is to select all rows matching on that foreign key. This is in fact also a range query, although not obviously so.
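Hedged examples of what this answer counts as range queries (the table and column names are invented); a clustered/primary key on the scanned column would let all three read physically adjacent rows:

```sql
-- Explicit range: bounded on both ends.
SELECT * FROM order_item
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

-- ORDER BY over a bounded slice.
SELECT * FROM order_item
WHERE order_date >= '2024-01-01'
ORDER BY order_date;

-- The "hidden" range query: fetch all rows matching a foreign key value.
SELECT * FROM order_item
WHERE order_id = 12345;
```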

Resources