I would like to know whether it is more commonplace to name a key column table_id or just id. (In my opinion, the table_ prefix causes slight confusion as to whether it is a foreign key.) Which do people prefer, and is there really any difference between the two? Or should it just be left up to picking one and being consistent?
There are two main currents in terms of naming columns in tables:
Schema Namespace
This strategy is the traditional strategy that was conceived by teams documenting the "data dictionary" of a database in the 70s. The idea is that the name itself of the column tells you which table it belongs to across the whole schema or database. For example, CLIENT_NAME would represent the name of the client in the CLIENT table.
There are variations of this strategy where a limited number of letters are assigned as prefixes (especially for M:N relationship tables) because at the time column names were limited to 6 or 8 characters in many databases. For example, the date of purchase of a car by a client could take the form CLI_CAR_DATE, CLICAR_DATE, or even CLCADT.
Examples:
A primary key "id" column of the entity table "car" would be named CAR_ID.
A foreign key on a child table "document" that points to "car" would take the same form: CAR_ID. This allows the use of natural joins; however, it should be pointed out that there are compelling reasons to avoid natural joins at all costs, which are not discussed here.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" pollute this strategy. They could be named PERSON_BUYER_ID and PERSON_SELLER_ID, because both cannot have the same name PERSON_ID; this doesn't allow natural joins anymore (good).
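To make the schema-namespace examples concrete, here is a minimal sketch in Python using the standard sqlite3 module. The choice of SQLite and the extra columns (car_model, document_id, document_title) are assumptions made for illustration only; they are not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Schema-namespace style: every column name carries its table name.
    CREATE TABLE car (
        car_id    INTEGER PRIMARY KEY,
        car_model TEXT NOT NULL
    );
    CREATE TABLE document (
        document_id    INTEGER PRIMARY KEY,
        document_title TEXT NOT NULL,
        car_id         INTEGER NOT NULL REFERENCES car (car_id)
    );
    INSERT INTO car VALUES (1, 'Roadster');
    INSERT INTO document VALUES (10, 'Registration', 1);
""")

# NATURAL JOIN pairs the tables on the only column name they share: car_id.
rows = conn.execute(
    "SELECT car_model, document_title FROM car NATURAL JOIN document"
).fetchall()
print(rows)
```

The join works only because car_id is the single column name the two tables have in common, which is exactly the property the buyer/seller case breaks.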
Table Namespace
In this strategy (that is newer) column names do not include the name of the entity they belong to, but only their property name. This strategy aligns more with object design, and produces shorter names (i.e. less typing). The name of the table must be indicated when mentioning a column. For example, you would need to say the column NAME on the table CLIENT.
Examples:
A primary key "id" column of the entity table "car" would be named ID.
A foreign key on a child table "document" that points to "car" would take the form: CAR_ID; this is the same solution as the previous strategy.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" could be named: BUYER_ID and SELLER_ID. They could follow the longer names as the previous strategy, but the goal here is typically to have shorter names so the app source code gets easier to write and to debug.
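A minimal sketch of the table-namespace examples, again in Python with the standard sqlite3 module. SQLite and the sample names and values are assumptions for illustration; only ID, BUYER_ID and SELLER_ID come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table-namespace style: columns carry only the property name.
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE transfer (
        id        INTEGER PRIMARY KEY,
        buyer_id  INTEGER NOT NULL REFERENCES person (id),
        seller_id INTEGER NOT NULL REFERENCES person (id)
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO transfer VALUES (100, 1, 2);
""")

# Because "id" and "name" repeat across tables, every column reference
# has to be qualified with its table name or alias.
rows = conn.execute("""
    SELECT buyer.name, seller.name
    FROM transfer
    JOIN person AS buyer  ON buyer.id  = transfer.buyer_id
    JOIN person AS seller ON seller.id = transfer.seller_id
""").fetchall()
print(rows)
```

The short names cost nothing in clarity here because the aliases buyer and seller supply the context that the column names leave out.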
Summary
I personally like the second one, but there are teams who adhere to each strategy and there's no clear winner. I lean towards the second one because [I think] the first one suffers from longer names (more typing), longer SQL (more errors), cryptic names (they don't play well with ORMs and app objects), and foreign keys that cannot follow the strategy well. In fact, virtually all the primary keys in my databases are named ID regardless of the specific entities.
But on the flip side, some teams value very highly the idea of knowing the table name of a column by just looking at it. And this is great for big databases (with 200-1000 relational fact tables) that can become quite complex, especially for new members of a team.
But above all, pick one and be consistent.
I have a small question concerning how I should design my database. I have a table dogs for an animal shelter and I have a table owners. The table dogs holds every dog that is, or once was, in the shelter. Now I want to make a relation between the table dogs and the table owners.
The problem is, in this example not all dogs have an owner, and since an owner can have more than one dog, a possible foreign key should be put in the table dogs (a dog can't have more than one owner, at least not in the administration of the shelter). But if I do that, some dogs (the ones in the shelter) will have null as a foreign key. Reading some other topics taught me that that is allowed. (Or I might have read some wrong topics)
However, another possibility is putting a table in between the two tables - 'dogswithowners' for example - and put the primary key of both tables in there if a dog has an owner.
Now my question is (as you might have guessed): which of these two methods is best, and why?
The only solution that is in keeping with the principles of the Relational Model is the extra table.
Moreover, it's hard to imagine how you are going to find any hardware so slow that the difference in performance when you start querying is going to be noticeable. After all, it's not a mission-critical tens-of-thousands-of-transactions-per-second application, is it?
I agree with Philip and Erwin that the soundest and most flexible design is to create a new table.
One further issue with the null-based approach is that different software products disagree over how SQL's nullable foreign keys work. Even many IT professionals don't understand them properly, so the general user is even less likely to understand them.
The nullable foreign key is a typical solution.
The most straightforward one is just to have another table of owners and dogs, with foreign keys to the owner and dog tables with the dog column UNIQUE NOT NULL. Then if you only want owners or owned dogs you do not have to involve IS NOT NULL in your queries and the DBMS does not need to access them among all owners and dogs. NULLs can simplify certain situations like this one but they also complicate compared to having a separate table and just joining when you want that data.
However, if it could become possible for a dog to have multiple owners then you might need the extra table anyway, as a many:many relationship: drop the UNIQUE NOT NULL on the dog column and make the column pair owner-dog UNIQUE NOT NULL instead. You can always start with the one UNIQUE NOT NULL and move to the other if things change.
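The separate-table design described above can be sketched as follows, in Python with the standard sqlite3 module. The table name dog_ownership, the column names and the sample rows are assumptions for illustration; SQLite is just a convenient engine for a runnable example.

```python
import sqlite3

def build_shelter():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript("""
        CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE dogs   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- One row per owned dog; UNIQUE on dog_id keeps it to one owner per dog.
        CREATE TABLE dog_ownership (
            dog_id   INTEGER NOT NULL UNIQUE REFERENCES dogs (id),
            owner_id INTEGER NOT NULL REFERENCES owners (id)
        );
        INSERT INTO owners VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO dogs   VALUES (1, 'Rex'), (2, 'Fido');
        INSERT INTO dog_ownership VALUES (1, 1);  -- Rex belongs to Ann
    """)
    return conn

conn = build_shelter()

# Owned dogs come from a plain join; no IS NOT NULL test is involved.
owned = conn.execute("""
    SELECT dogs.name, owners.name
    FROM dog_ownership
    JOIN dogs   ON dogs.id   = dog_ownership.dog_id
    JOIN owners ON owners.id = dog_ownership.owner_id
""").fetchall()

try:
    conn.execute("INSERT INTO dog_ownership VALUES (1, 2)")  # second owner for Rex
    second_owner_rejected = False
except sqlite3.IntegrityError:
    second_owner_rejected = True

print(owned, second_owner_rejected)
```

Fido simply has no row in dog_ownership. To allow several owners per dog later, the UNIQUE on dog_id would be replaced by UNIQUE (dog_id, owner_id).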
In the olden days of newsgroups, we had this guy called -CELKO- who would pop up and say, "There is a design rule of thumb that says a relational table should model either an entity or a relationship between entities but never both." Not terribly formal but it is a good rule of thumb in my opinion.
Is 'owner' (person) really an attribute of a dog? It seems to me more like you want to model the relationship 'ownership' between a person and a dog.
Another useful rule of thumb is to avoid SQL nulls! Three-valued logic is confusing to most users and programmers, null behavior is inconsistent throughout the SQL Standard and (as sqlvogel points out) SQL DBMS vendors implement things in different ways. The best way of modelling missing data is by the omission of a tuple from a relvar (a.k.a. don't insert anything into your table!). For example, if Fido is included in Dog but omitted from DogOwnership, then according to the Closed World Assumption Fido sadly has no owner.
All this points to having two tables and no nullable columns.
I wouldn't add an extra table. If for some reason nulls are not allowed (and it's a good question why not), I would store, in place of null, some value that can't be a real key, e.g. NOT_SET or similar. I know some solutions do the same.
hope it helps
A nullable column used for foreign key relationship is perfectly valid and used for scenarios exactly like yours.
Adding another table to connect the owners table with the dogs table will create a many-to-many relationship, unless a unique constraint is created on one of its columns (dogs in your case).
Since you describe a one to many relationship, I would go with the first option, meaning having a nullable foreign key, since I find it more readable.
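For comparison, the nullable foreign key option might look like this minimal sketch in Python with the standard sqlite3 module. The column name owner_id and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dogs (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        owner_id INTEGER REFERENCES owners (id)  -- NULL: still in the shelter
    );
    INSERT INTO owners VALUES (1, 'Ann');
    INSERT INTO dogs VALUES (1, 'Rex', 1), (2, 'Fido', NULL);
""")

# Dogs without an owner are found with IS NULL.
unowned = [row[0] for row in
           conn.execute("SELECT name FROM dogs WHERE owner_id IS NULL")]

try:
    conn.execute("INSERT INTO dogs VALUES (3, 'Max', 99)")  # there is no owner 99
    bad_owner_rejected = False
except sqlite3.IntegrityError:
    bad_owner_rejected = True

print(unowned, bad_owner_rejected)
```

A NULL owner_id is accepted, while a non-NULL value still has to match an existing owner, so the constraint keeps doing its job for owned dogs.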
I have a column with a uniqueidentifier that can potentially reference one of four different tables. I have seen this done in two ways, but both seem like bad practice.
First, I've seen a single ObjectID column without explicitly declaring it as a foreign key to a specific table. Then you can just shove any uniqueidentifier you want in it. This means you could potentially insert IDs from tables that are not part of the 4 tables I wanted.
Second, because the data can come from four different tables, I've also seen people make 4 different foreign keys. And in doing so, the system relies on ONE AND ONLY ONE column having a non-NULL value.
What's a better approach to doing this? For example, records in my table could potentially reference Hospitals(ID), Clinics(ID), Schools(ID), or Universities(ID)... but ONLY those tables.
Thanks!
You might want to consider a Type/SubType data model. This is very much like class/subclasses in object oriented programming, but much more awkward to implement, and no RDBMS (that I am aware of) natively supports them. The general idea is:
You define a Type (Building), create a table for it, give it a primary key
You define two or more sub-types (here, Hospital, Clinic, School, University), create tables for each of them, make primary keys… but the primary keys are also foreign keys that reference the Building table
Your table with one “ObjectType” column can now be built with a foreign key onto the Building table. You’d have to join a few tables to determine what kind of building it is, but you’d have to do that anyway. That, or store redundant data.
You have noticed the problem with this model, right? What’s to keep a Building from having entries in two or more of the subtype tables? Glad you asked:
Add a column, perhaps “BuildingType”, to Building, say char(1) with allowed values of {H, C, S, U} indicating (duh) type of building.
Build a unique constraint on BuildingID + BuildingType
Have the BuildingType column in the subtables. Put a check constraint on it so that it can only ever be set to the one value for that table (H for the Hospitals table, etc.). In theory, this could be a computed column; in practice, this won't work because of the following step:
Build the foreign key to relate the tables using both columns
Voila: Given a BUILDING row set with type H, an entry in the SCHOOL table (with type S) cannot be set to reference that Building
You will recall that I did say it was hard to implement.
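The steps above can be sketched roughly like this, in Python with the standard sqlite3 module. This is only an illustration under assumptions: SQLite stands in for whatever RDBMS you use, only two of the four subtype tables are shown, and the names building_id, building_type and bed_capacity are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE building (
        building_id   INTEGER PRIMARY KEY,
        building_type TEXT NOT NULL
            CHECK (building_type IN ('H', 'C', 'S', 'U')),
        UNIQUE (building_id, building_type)   -- target of the two-column FKs
    );
    CREATE TABLE hospital (
        building_id   INTEGER PRIMARY KEY,
        building_type TEXT NOT NULL CHECK (building_type = 'H'),
        bed_capacity  INTEGER,
        FOREIGN KEY (building_id, building_type)
            REFERENCES building (building_id, building_type)
    );
    CREATE TABLE school (
        building_id   INTEGER PRIMARY KEY,
        building_type TEXT NOT NULL CHECK (building_type = 'S'),
        FOREIGN KEY (building_id, building_type)
            REFERENCES building (building_id, building_type)
    );
    INSERT INTO building VALUES (1, 'H');
    INSERT INTO hospital VALUES (1, 'H', 250);
""")

try:
    # Building 1 is typed 'H', so no (1, 'S') row exists for this to reference.
    conn.execute("INSERT INTO school VALUES (1, 'S')")
    wrong_subtype_rejected = False
except sqlite3.IntegrityError:
    wrong_subtype_rejected = True

print(wrong_subtype_rejected)
```

The check constraint pins each subtable to one type letter, and the two-column foreign key then forces that letter to agree with the parent row.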
In fact, the big question is: Is this worth doing? If it makes sense to implement the four (or more, as time passes) building types as type/subtype (further normalization advantages: one place for address and other attributes common to every building, with building-specific attributes stored in the subtables), it may well be worth the extra effort to build and maintain. If not, then you’re back to square one: a logical model that is hard to implement in the average modern-day RDBMS.
Let's start at the conceptual level. If we think of Hospitals, Clinics, Schools, and Universities as classes of subject matter entities, is there a superclass that generalizes all of them? There probably is. I'm not going to try to tell you what it is, because I don't understand your subject matter as well as you do. But I'm going to proceed as if we can call all of them "Institutions", and treat each of the four as subclasses of Institutions.
As other responders have noted, class/subclass extension and inheritance are not built into most relational database systems. But there is plenty of assistance, if you know the right buzzwords. What follows is intended to teach you the buzzwords, in database lingo. Here is a summary of the buzzwords coming: "ER Generalization", "ER Specialization", "Single Table Inheritance", "Class Table Inheritance", "Shared Primary Key".
Staying at the conceptual level, ER modeling is a good way of understanding the data at a conceptual level. In ER modeling, there is a concept, "ER Generalization", and a counterpart concept "ER Specialization" that parallel the thought process I just presented above as "superclass/subclass". ER Specialization tells you how to diagram subclasses, but it doesn't tell you how to implement them.
Next we move down from the conceptual level to the logical level. We express the data in terms of relations or, if you will, SQL tables. There are a couple of techniques for implementing subclasses. One is called "Single Table Inheritance". The other is called "Class Table Inheritance". In connection with Class table inheritance, there is another technique that goes by the name "Shared primary Key".
Going forward in your case with class table inheritance, we first design a table called "Institutions", with an Id field, a name field, and all of the fields that pertain to institutions, no matter which of the four kinds they are. Things like mailing address fields, for instance. Again, you understand your data better than I do, and you can find fields that are in all four of your existing tables. We populate the id field in the usual way.
Next we design four tables called "Hospitals", "Clinics", "Schools", and "Universities". These will contain an id field, plus all of the data fields that pertain only to that kind of institution. For instance, a hospital might have a "bed capacity". Again, you understand your data better than I do, and you can figure these out from the fields in your existing tables that didn't make it into the Institutions table.
This is where "shared primary key" comes in. When a new entry is made into "Institutions", we have to make a new parallel entry into one of four specialized subclass tables. But we don't use some sort of autonumber feature to populate the id field. Instead, we put a copy of the id field from the "Institutions" table into the id field of the subclass table.
This is a little work, but the benefits are well worth the effort. Shared primary key enforces the one-to-one nature of the relationship between subclass entries and superclass entries. It makes joining superclass data and subclass data simple, easy, and fast. It eliminates the need for a special field to tell you which subclass a given institution belongs in.
And, in your case, it provides a handy answer to your original question. The foreign key you were originally asking about is now always a foreign key to the Institutions table. And, because of the magic of shared-primary-key, the foreign key also references the entry in the appropriate subclass table, with no extra work.
You can create four views that combine institution data with each of the four subclass tables, for convenience.
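Here is a minimal sketch of class table inheritance with a shared primary key, in Python with the standard sqlite3 module. Only the Hospitals subclass is shown; the visits table, the view name hospital_details and the sample values are hypothetical, and SQLite is an assumption made to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE institutions (
        id   INTEGER PRIMARY KEY,   -- generated here, in the usual way
        name TEXT NOT NULL
    );
    CREATE TABLE hospitals (
        -- Shared primary key: the id is copied from institutions, not generated.
        id           INTEGER PRIMARY KEY REFERENCES institutions (id),
        bed_capacity INTEGER
    );
    CREATE TABLE visits (           -- hypothetical table holding the foreign key
        id             INTEGER PRIMARY KEY,
        institution_id INTEGER NOT NULL REFERENCES institutions (id)
    );
    CREATE VIEW hospital_details AS
        SELECT i.id, i.name, h.bed_capacity
        FROM institutions AS i
        JOIN hospitals AS h ON h.id = i.id;
""")

cur = conn.execute("INSERT INTO institutions (name) VALUES ('General Hospital')")
new_id = cur.lastrowid  # reuse the superclass id for the subclass row
conn.execute("INSERT INTO hospitals (id, bed_capacity) VALUES (?, ?)", (new_id, 250))
conn.execute("INSERT INTO visits (institution_id) VALUES (?)", (new_id,))

details = conn.execute("SELECT * FROM hospital_details").fetchall()
print(details)
```

Note that this sketch by itself does not stop the same id from also being inserted into a second subclass table; if that matters, something extra (such as a type column or a trigger) is still needed.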
Look up "ER Specialization", "Class Table Inheritance", "Shared Primary Key", and maybe "Single Table Inheritance" on the web, and here in SO. There are tags for most of these concepts or techniques here in SO.
You could put a trigger on the table and enforce the referential integrity there. I don't think there's a really good out-of-the-box feature to implement this requirement.
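As a rough illustration of the trigger idea, here is a sketch in Python with the standard sqlite3 module. It uses SQLite's trigger syntax, which differs from other products; the records table, the object_id column and the TEXT ids standing in for uniqueidentifier values are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hospitals    (id TEXT PRIMARY KEY);
    CREATE TABLE clinics      (id TEXT PRIMARY KEY);
    CREATE TABLE schools      (id TEXT PRIMARY KEY);
    CREATE TABLE universities (id TEXT PRIMARY KEY);
    CREATE TABLE records (              -- hypothetical referencing table
        id        INTEGER PRIMARY KEY,
        object_id TEXT NOT NULL
    );
    -- Reject an insert whose object_id is in none of the four tables.
    CREATE TRIGGER records_object_id_check
    BEFORE INSERT ON records
    WHEN NOT EXISTS (
        SELECT 1 FROM hospitals    WHERE id = NEW.object_id UNION ALL
        SELECT 1 FROM clinics      WHERE id = NEW.object_id UNION ALL
        SELECT 1 FROM schools      WHERE id = NEW.object_id UNION ALL
        SELECT 1 FROM universities WHERE id = NEW.object_id
    )
    BEGIN
        SELECT RAISE(ABORT, 'object_id not found in any allowed table');
    END;
    INSERT INTO hospitals VALUES ('h-1');
""")

def try_insert(object_id):
    """Return True if the row was accepted, False if the trigger aborted it."""
    try:
        conn.execute("INSERT INTO records (object_id) VALUES (?)", (object_id,))
        return True
    except sqlite3.DatabaseError:
        return False

accepted = try_insert("h-1")
rejected = not try_insert("x-9")
print(accepted, rejected)
```

This covers inserts only. A full version would also need triggers for updates of object_id and for deletes from the four parent tables, which is part of why this route is more work than a declared foreign key.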
Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.
What are the criteria?
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
What is a Primary Key?
The primary key is something that uniquely identifies a row/record of data. It can also span multiple columns, in which case it is called a composite key.
Ability to Change
Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, provided someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define ON DELETE CASCADE and ON UPDATE CASCADE: to sync referential dependencies without having to disable constraints.
Natural or Artificial/Surrogate?
Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states are a good natural key because the abbreviations are consistent and everyone knows them:
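A small sketch of the cascade options, in Python with the standard sqlite3 module. The department and employee tables and their values are hypothetical, and SQLite is assumed only so the example can run as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE department (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE employee (
        id              INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        department_code TEXT NOT NULL
            REFERENCES department (code)
            ON UPDATE CASCADE
            ON DELETE CASCADE
    );
    INSERT INTO department VALUES ('ENG', 'Engineering');
    INSERT INTO employee VALUES (1, 'Ann', 'ENG');
""")

# Changing the parent key is propagated to the referencing row.
conn.execute("UPDATE department SET code = 'RND' WHERE code = 'ENG'")
after_update = conn.execute("SELECT department_code FROM employee").fetchall()

# Deleting the parent row removes the referencing row as well.
conn.execute("DELETE FROM department WHERE code = 'RND'")
after_delete = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

print(after_update, after_delete)
```

The cascade keeps the references consistent, but every key change still touches every referencing row, which is one more argument for picking stable keys.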
US_STATE_PRIMARY_KEY  US_STATE
------------------------------
AL                    Alabama
AK                    Alaska
AZ                    Arizona
AR                    Arkansas
CA                    California
Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.
Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:
US_STATE_PRIMARY_KEY  US_STATE
------------------------------
100                   Alabama
101                   Alaska
102                   Arizona
103                   Arkansas
104                   California
There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.
Single Column vs Composite
Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).
Performance
From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.
Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several tables, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).
In an ERD entities have primary keys. These are natural keys, that is, they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity it might be OrderRef. For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.
Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.
The characteristics of an entity's primary key are:
unique
familiar
stable (presumed)
minimal (one or more attributes but as few as necessary)
When it comes to primary keys for database tables, natural keys are not always suitable.
There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.
Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.
In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there in which dependent tables inherit a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is just a hassle.
The characteristics of a table's primary key are:
unique
appropriate (meaningless)
guaranteed stability
minimal, usually a single column (except for intersection tables)
So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.
The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.
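A minimal sketch of that last point, in Python with the standard sqlite3 module: a synthetic key as the primary key, with the candidate key still enforced through a UNIQUE constraint. The orders table and the column names id and order_ref are assumptions; the OrderRef value is the one quoted earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,   -- synthetic key, identifies the row
        order_ref TEXT NOT NULL UNIQUE   -- candidate key, identifies the data
    );
    INSERT INTO orders (order_ref) VALUES ('2010/DEF/000023');
""")

try:
    # Without the UNIQUE constraint this would succeed with a fresh id,
    # leaving two rows that describe the same order.
    conn.execute("INSERT INTO orders (order_ref) VALUES ('2010/DEF/000023')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(duplicate_rejected, row_count)
```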
Regarding familiarity
Familiarity is a curly one. It is an important consideration when we are identifying primary keys in a conceptual data model but it is less useful when it comes to database design.
In a comment @bbadour provides two contrasting examples:
{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}
and poses the question:
"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."
Well, 840082470 is like an invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.
As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.
First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.
Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.
Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.
Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.
Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.
In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.
[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.
A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.
And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.
Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.
A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).
Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.
These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.
A primary key is a candidate key.
Hey. it's open again. Here goes.
(1) Choose good candidate keys.
It does not pertain to the database designer to choose candidate keys. The database designer has the responsibility to see to it that all the uniqueness requirements he is informed of by the user will be enforced. So it is the user who "chooses" what the candidate keys are.
There are two scenarios I can think of that relax this unequivocal position a bit.
One is if the user says that some attribute of type 'video' or 'audio' (or some such) is to be unique. It may be infeasible to actually enforce that, and it is the designer's responsibility to point that out to the user (as it is also his responsibility to point out that 'uniqueness' of audio and video content is a very debatable subject, and that any uniqueness on such attribute values, even if enforceable by the system, still has a good chance of not being the same uniqueness that the user wants).
Second is how the picture gets muddied by the possibility of distinct logical designs all addressing the same problem. If D1 and D2 are both valid designs addressing the same problem, then it might be the case that a certain given uniqueness rule imposed by the user is enforceable using keys in D1, but not in D2. From this perspective, "choosing candidate keys" can be interpreted as "choosing a particular design such that a given uniqueness rule is enforceable using keys". But that wasn't really the question that you asked.
(2) Choose good primary keys.
A while ago, Darwen launched the question "What are good reasons to single out one particular candidate from among the others as being 'primary'?". Nothing much came out, except perhaps: "to suggest that this particular key is the preferred one to use whenever making references to this relvar". I suspect they didn't find that convincing enough to change their earlier decision that "no key is more unique than any other".
But, supposing that nonetheless there exists some valid reason to single out one particular key as "primary", I suppose the following considerations apply:
the likelihood, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design.
and as a consequence of that, the probability of having to change a value of some existing primary key. Key values that are highly stable will be preferable over key values that are more volatile.
the percentage of the business that naturally uses some such key in their daily operations.
if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Your answer to Erwin:
"I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "
Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys then simplicity, stability and familiarity are the criteria for choosing among all the candidate keys in the referenced relation. There could be several candidate keys that fulfill those criteria to more or less the same extent. If we look at familiarity, one candidate key could be very familiar to one group of users and not to another group, for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as foreign key). If you insist on 'primary keys', of which we only have one per relation, then I have to ask what makes a key more primary than others.
I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).
So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.
Do you agree with this?
If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"?
If not, why?
Regards,
Carlos
A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:
Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.
Irreducibility: Removing any column from a candidate key must violate the uniqueness property. In other words, no proper subset of the columns in a candidate key is itself a candidate key.
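The two properties can be checked mechanically against a set of rows, as in this small Python sketch. The function names and the sample rows are hypothetical. Bear in mind that a key is a rule about every row the table may ever hold, so a check over sample data can only disprove a candidate key, never prove one.

```python
from itertools import combinations

def is_unique(rows, columns):
    """True if no two rows agree on every column in `columns`."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False
        seen.add(value)
    return True

def is_candidate_key(rows, columns):
    """True if `columns` is unique and no proper, non-empty subset is unique."""
    if not is_unique(rows, columns):
        return False
    for size in range(1, len(columns)):
        for subset in combinations(columns, size):
            if is_unique(rows, subset):
                return False  # reducible: a smaller set is already unique
    return True

# Hypothetical sample data.
rows = [
    {"id": 1, "login": "ann", "city": "Oslo"},
    {"id": 2, "login": "bob", "city": "Oslo"},
    {"id": 3, "login": "cat", "city": "Rome"},
]

print(is_candidate_key(rows, ("id",)))          # unique and irreducible
print(is_candidate_key(rows, ("id", "login")))  # unique but reducible
print(is_candidate_key(rows, ("city",)))        # not unique
```

In this sample both id and login pass as single-column candidate keys, while the pair (id, login) fails irreducibility and city fails uniqueness.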
If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.
It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criterion often applied is simplicity, favoring the candidate key with the fewest columns. However, there are other potential criteria, like familiarity (familiar values being more useful than unfamiliar ones) and stability (stable keys being less troublesome than keys that are apt to change). These criteria, however, are strictly outside the scope of the relational model, often conflict with each other, and are often applied to deal with implementation limitations.
I would say that the first two concepts, "uniqueness" and "irreducibility", are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.
Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candidate keys, but are why we should choose a primary key at all. If there are multiple candidate keys in a table, it simplifies things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.
What are the criteria?
A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.
You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).
In this case, you rely on the outside world.
Do all people have SSN? (They don't).
Are SSNs unique? (They aren't).
Can an SSN be assigned to another person? (It can).
You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.
In this case, you rely on yourself and your database skills.
Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
Are these IDs unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).
Or another set of answers:
Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
Are these IDs unique? (No, we failed to merge two versions of the database properly).
Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).
The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But far from all things have an identifier that everybody agrees upon.
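As a small illustration of the "rely on yourself" case, here is a sketch using SQLite through Python's sqlite3 module. The `person` table and its columns are invented for the example, and other databases spell auto-incrementing columns differently (IDENTITY, SERIAL, AUTO_INCREMENT).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key generated by the DB
    " full_name TEXT NOT NULL)"
)

# Every inserted person gets an id without us supplying one.
ids = [
    conn.execute("INSERT INTO person (full_name) VALUES (?)", (name,)).lastrowid
    for name in ("Ann", "Bob")
]

# The PRIMARY KEY constraint rejects a second row with an existing id.
try:
    conn.execute("INSERT INTO person (id, full_name) VALUES (1, 'Dup')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# With AUTOINCREMENT, SQLite does not hand a deleted id to a new row.
conn.execute("DELETE FROM person WHERE id = ?", (ids[-1],))
new_id = conn.execute(
    "INSERT INTO person (full_name) VALUES (?)", ("Cid",)
).lastrowid

print(ids, duplicate_rejected, new_id)
```

None of this protects against the second set of answers (dropped tables, bad merges, a reset counter); it only shows what the database guarantees while it is left alone.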
However, the value of a good natural key cannot be overemphasized.
The Guide Star Catalog database is most probably backed up more reliably than yours, and you can always restore the list of US state codes right from memory.
Only one really: choose a surrogate key for each table (identity/auto_number or something similar) that the users will never even see, so you can do whatever is necessary with it whenever you need to, now and in the future.
(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)
At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.
For the purpose of our discussion here, it's important to distinguish between the two:
A PRIMARY KEY is a field or combination of fields that uniquely identifies a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server; I'm not sure how other RDBMSs might handle this.)
Key to the remainder of my discussion is knowing that since SQL 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the criteria for choosing a good clustering key are the same as those for choosing a good primary key.
Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:
Unique - otherwise useless as a row identifier for other indexes
Narrow - this key is used in other indexes, so should be as narrow as possible
Static - If key values change, then references become invalid and will need updating
Ever-increasing - To reduce physical table fragmentation as new rows are added
It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.
A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)
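The two failed criteria are easy to see with a few lines of Python. This sketch assumes random (version 4) GUIDs and a 4-byte integer key for comparison. It uses Python's own UUID ordering, which is not the ordering SQL Server applies to uniqueidentifier values, but the point about insert order carries over.

```python
import uuid

int_key_bytes = 4                          # a 32-bit INT surrogate key
guid_key_bytes = len(uuid.uuid4().bytes)   # size of one GUID value

# Generate GUIDs in "insert order" and see whether that order is ascending.
guids = [uuid.uuid4() for _ in range(1000)]
in_insert_order = guids == sorted(guids)

print(guid_key_bytes // int_key_bytes)  # how many times wider the key is
print(in_insert_order)                  # random GUIDs arrive out of order
```

A key that is four times wider is repeated in every non-clustered index, and values that arrive in random order land in the middle of the table rather than at the end.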
I'm going to say something here that is not expected.
All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.
The primary key is special when it comes to range queries, and for that reason, if you have a dominant range query, the columns it runs on are your primary key, no exceptions.
If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.
Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.
Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.
EDIT: definition of range query:
A range query is a query that is an ORDER BY query or contains either a greater than or less than operator. What we are interested in are the columns these queries run on. The fundamental idea is that a range query fetches several (tens to hundreds to perhaps thousands, but not all) rows from the table based on bounding conditions at one or both ends.
There is another kind of range query, and that is where you have a foreign key to another table and the operation is to select all rows matching on that foreign key. This is in fact also a range query, although not obviously so.
Can a database table contain more than one primary key?
Yes, I am talking about RDBMS.
A table can have:
No primary keys;
One primary key consisting of one column; or
One composite primary key consisting of two or more columns.
Other than that you can have any number of unique indexes, which will do basically the same thing.
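The cases listed above can be tried out quickly. The following sketch uses SQLite via Python's sqlite3 module; the `enrolment` table and its columns are hypothetical, and the exact error raised for two PRIMARY KEY clauses will differ between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two separate PRIMARY KEY declarations on one table are rejected.
try:
    conn.execute(
        "CREATE TABLE bad (a INTEGER PRIMARY KEY, b INTEGER PRIMARY KEY)"
    )
    two_pks_allowed = True
except sqlite3.Error:
    two_pks_allowed = False

# One composite primary key plus an additional unique key is accepted.
conn.execute(
    "CREATE TABLE enrolment ("
    " student_id INTEGER NOT NULL,"
    " course_id  INTEGER NOT NULL,"
    " ticket_no  TEXT NOT NULL UNIQUE,"
    " PRIMARY KEY (student_id, course_id))"
)
conn.execute("INSERT INTO enrolment VALUES (1, 10, 'T-001')")
conn.execute("INSERT INTO enrolment VALUES (1, 11, 'T-002')")  # same student, other course


def rejected(sql):
    """Return True if the statement violates a key constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


dup_primary = rejected("INSERT INTO enrolment VALUES (1, 10, 'T-003')")
dup_unique = rejected("INSERT INTO enrolment VALUES (2, 10, 'T-001')")
print(two_pks_allowed, dup_primary, dup_unique)
```

Both the composite primary key and the unique column stop duplicate rows, which is the sense in which a unique index does basically the same thing.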
The primary key of a relational table uniquely identifies each record in the table.
So, in order to keep the uniqueness of each record, you can't have more than one primary key for the table.
It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or multiple attributes in combination.
That's why it is called Primary Key because it is, well, PRIMARY
Yes, you can have Composite primary keys, that is, having two fields as a primary key.
"First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS)."
May I suggest politely that you first get YOURSELF educated on these very same subjects before leading other people into flawed beliefs ? I'll respond to the two worst ones of your stupidities below.
"According to relational methodology principles, each entity should only have one and only one means to identify it."
That is about the biggest crap I have ever heard anybody spawn around about relational data design. The relational model does not constrain any "entity", as you erroneously call it, to have any precise number of keys. Any "entity" can have any number of keys, and EACH key is, by definition of its very property of making the "rows" unique, a valid candidate for any purpose of "identification". Choosing the most useful/appropriate one for use in certain contexts (foreign keys in referencing tables, e.g.), is a design issue, and the relational model does not have anything to say on such things.
"Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships."
Codd's paper "A Relational Model of Data for Large Shared Data Banks", which marks the birth of the relational model, predates the invention of E-R by a number of years. So to say that the Relational model attempts to facilitate the modeling of E-R concepts, is having things COMPLETELY backwards, and nothing but a display of one's own complete and utter ignorance of "the history" that you referred to in your own answer.
The short answer is yes. A primary key is a candidate key and is in principle no different to any other candidate key. It is a widely observed convention that one candidate key per table is designated as the "primary" one - meaning that it is "preferred" or has some special meaning for the database designer or user. This is just convention however. It is only a label of convenience and a reminder about the potential significance of one key. In practice all keys can serve the same purpose and the "primary" one is not special or unique in any fundamental way.
First of all, you have to understand the history of entity-relationship design methodology as well as understand the word "relational" in relational database management systems (RDBMS).
In order to define the bounds of an entity and relationships to be formed, there must be a unique handle or a unique combination of handles to identify each single instance of an entity and then to form relationships between them.
You also need to understand the meaning/root of the word "identify" which is to zero in on the "identity" of each instance of an entity. "identity" being the mathematical term meaning "one" or a singularity.
According to relational methodology principles, each entity should only have one and only one means to identify it. Therefore, "R"DBMS attempts to facilitate the modeling of entity relationships. Note the differences between "Entity/Class" and "Entity/Class instance".
However, RDBMS is used widely and mostly by people not so interested in accurately portraying the E-R design principles. So, frequently, we have more than one possible entity-definition sitting inside a table, which I call entity-aliasing. As opposed to identity-aliasing, where two or more instances of an entity-set hide under the same key, entity-aliasing is when a table like
EmpProj([empId], empName, empAddr, projId, projLoc)
actually has two entity-sets aliased under the same table:
Emp([empId], empName, empAddr)
Proj([projId], projLoc, empId)
That is when normalisation comes in - to separate these entities out. Try as we might to do a decent design normalisation, computer scientists may not have as good a perspective on the information as a statistician. The computer scientist (which in this discussion includes everyone with a decent knowledge of ER design) tries his/her best in creating a schema that cleanly defines entities and their relationships.
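To make the split concrete, here is a sketch of the two separated tables in SQLite via Python's sqlite3 module. The column types, the foreign key and the sample rows are assumptions added for illustration; only the table and column names come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Emp (
        empId   INTEGER PRIMARY KEY,
        empName TEXT NOT NULL,
        empAddr TEXT
    );
    CREATE TABLE Proj (
        projId  INTEGER PRIMARY KEY,
        projLoc TEXT,
        empId   INTEGER NOT NULL REFERENCES Emp(empId)
    );
    INSERT INTO Emp  VALUES (1, 'Ann', '1 Main St');
    INSERT INTO Proj VALUES (100, 'Oslo', 1);
    INSERT INTO Proj VALUES (101, 'Lima', 1);
    """
)

# The wide EmpProj shape can still be recovered with a join when needed.
rows = conn.execute(
    "SELECT e.empId, e.empName, e.empAddr, p.projId, p.projLoc"
    " FROM Emp e JOIN Proj p ON p.empId = e.empId"
    " ORDER BY p.projId"
).fetchall()
for row in rows:
    print(row)
```

Each table now has its own key, and the employee's name and address are stored once instead of being repeated per project.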
However, after 18 months analysing voluminous information from the database, the statistician begins to see principal components emerge whose analysis is terribly crippled by the misalignment of those principal components with the boundaries of the computer scientists' perceived entities.
That is what alternate unique keys are good for: to identify instances of the entities that arise from the principal components existing as ghost-entities in the database.
Therefore, a table has a primary key because that table is perceived to be a perfect entity, and an entity should have only one primary key, be it singular or composite.
As far as the statistician is concerned, even though the database allows only one primary key per table, the alternate unique keys are, to the statistician, the primary keys of those ghost-entities. Which is why you are sometimes frustrated by statisticians who seem to do double work by downloading the data into the local database of their workstation/PC.
In conclusion, the constraint placed by the "R"DBMS manufacturer in allowing only one primary key per table is their pretense in believing that they know how information behaves and that the principal components of the information due to the population do not mutate over time.
If you have more than one possible unique key in a table, it means one or more of the following:
Like myself, you are too lazy to separate them, since they seem to work quite well.
For performance's sake, mixing the entities into the same table makes the application run incredibly faster.
Like the statistician, you gradually discover ghost entities in your information.