Can a table have a surrogate key without a (natural) alternate key? - database

I think it doesn't make sense for a table to have a surrogate key without also having a (natural) alternate key (keep in mind that one of the properties for a surrogate key is that it has no meaning outside the database environment).
For example say I have the following table:
Say that employee_id is the surrogate primary key, and there is no (natural) alternate key in the table.
Now let's say that some employee wants to change his phone number. How can we identify the record for this employee in the table? We can't identify it using the surrogate key, because the surrogate key is not known in the real world (i.e. we don't know the employee_id for each employee).
So there must be a (natural) alternate key in the table to identify each employee in the real world (for example: SSN).
Am I correct, or am I missing something?

You are right. Normally the users of a database need to be able to map information in a database table to real concepts or things outside the database. For that they need a usable natural key - or something that can be reliably translated into a natural key.
Your specific example isn't necessarily a good one because many (most?) organisations allocate an employee identifier to employees for the duration of their employment. That employee identifier may be known and used as the natural key by both employee and employer. You have said employee_id in your example is a surrogate but based on its name many people might assume it is not.
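A minimal sketch of the situation discussed above, assuming SQLite and a hypothetical badge_number column that stands in for whatever identifier the employee and employer both know. The table and column names are illustrative and are not taken from the question. The surrogate employee_id stays internal, and the phone number is changed by looking the row up through the externally known key:

```python
import sqlite3


def make_employee_db():
    """Create an in-memory employee table with a surrogate and an external key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE employee (
            employee_id  INTEGER PRIMARY KEY,   -- surrogate, never shown to users
            badge_number TEXT NOT NULL UNIQUE,  -- hypothetical externally known key
            name         TEXT NOT NULL,
            address      TEXT NOT NULL,
            phone        TEXT NOT NULL
        )
        """
    )
    # Two people who share name, address and phone: only the external key
    # lets a user say which one they mean.
    conn.executemany(
        "INSERT INTO employee (badge_number, name, address, phone) VALUES (?, ?, ?, ?)",
        [
            ("B-1001", "Mary Jones", "1 Main St", "555-0100"),
            ("B-1002", "Mary Jones", "1 Main St", "555-0100"),
        ],
    )
    return conn


def change_phone(conn, badge_number, new_phone):
    """Update one employee's phone, identified by the external key.

    Returns the number of rows changed (0 if the badge number is unknown).
    """
    cur = conn.execute(
        "UPDATE employee SET phone = ? WHERE badge_number = ?",
        (new_phone, badge_number),
    )
    return cur.rowcount
```

Without the UNIQUE column there would be no way to tell the two Mary Jones rows apart from outside the database.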

Every table has a candidate key. (Possibly, all the columns.) Each value of that key designates something. If employee_ids weren't used externally, they and the set of the other three columns would both identify sets of people with the same name, address and phone number.
You are right that a table must contain a natural key for whatever entity you want to identify--for "natural key" meaning "designator under the current business rules, external to the database".
You seem to be confusing multiple meanings for "surrogate key" and "natural key".
For "surrogate": One use is, a property set where designations are determined since current business rules: surrogate as new. Another use is, a property set where designations are determined since current business rules and only used internally: surrogate as internal.
For "natural": One use is, a property set designating under current business rules and before: natural as old. Another use is, a property set designating under current business rules: natural as external.
The original use of "surrogate" was as internal with "natural" as external. Unfortunately now usually people use "surrogate" as new and "natural" as old. And they seldom either consider or distinguish surrogates as internal. Some people might call a newly introduced external designator as both surrogate (as new) and natural (as external). (Re "meaningless" names.)
All you can do is decipher or ask what someone means when they use these terms.
Note that these definitions are relative to the "current" business rules. You also seem to be assuming that employee ids arrived with the database. At some point they were introduced, so were chosen after some older system started, so were surrogates as new under the new system. But if the database came later then by that time they were natural keys as old. They were natural as external both times; when introduced they were just new natural as external.

The term "natural key" belongs with the subject matter or problem domain, and not with the database design or solution domain.
If you ask people in the organization how they refer to each other, it's usually by name. "That's Mary Jones, sitting by herself". You never hear, "That's employee 79932, sitting by herself."
If you ask a database person, like most of us, what's wrong with using the name, you'll get the standard answers: it's not unique. It's mutable. It's too many characters. All of that is true, but it doesn't change how people work in the real world.
The item "social security number" began life as an artificial key, even though that was before the computer revolution really began (1936). Over time, it has taken on most of the look and feel of a natural key. In your case, I would even call it a natural key, because every employee has one (barring something hokey).

Related

Table has no natural keys in SQL Server

The AP database example provided online by Murach's SQL Server 2016 for developers has an Invoice table with the surrogate key InvoiceID but with no natural keys. Most of the other tables have natural keys that uniquely identify each row, so I was curious: why would they provide a table without a natural key to identify what each row represents?
I got the AP database creation script from here:
https://www.murach.com/shop/murach-s-sql-server-2016-for-developers-detail
I think they made a mistake. The natural key here presumably ought to be (VendorID, InvoiceNumber). I have never seen a real accounts payable system that allowed duplicate invoice numbers for the same vendor. Paying the same invoice twice obviously isn't a good idea!
The most common motivation for creating a surrogate is to reduce the impact of having to change other key values. Natural key (AKA business key) values sometimes need to change. Surrogate keys need to change much less frequently because fewer people ever see them and so there's much less reason to change them. That relative stability may have some technical advantages in situations where the business key values are expected to change. Even in the presence of a surrogate, business keys are still critically important because they are the things that users and business processes depend on.
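As a rough illustration of the point above, here is a sketch in Python with SQLite. The column names are lower-cased stand-ins chosen for this example; this is not Murach's actual script. The surrogate invoice_id remains the primary key, while a UNIQUE constraint on the assumed business key (vendor_id, invoice_number) stops the same vendor invoice from being entered twice:

```python
import sqlite3


def create_invoice_table(conn):
    """Invoice table with a surrogate primary key plus an enforced business key."""
    conn.execute(
        """
        CREATE TABLE invoice (
            invoice_id     INTEGER PRIMARY KEY,   -- surrogate key
            vendor_id      INTEGER NOT NULL,
            invoice_number TEXT    NOT NULL,
            invoice_total  REAL    NOT NULL,
            UNIQUE (vendor_id, invoice_number)    -- assumed business key
        )
        """
    )


def record_invoice(conn, vendor_id, invoice_number, total):
    """Insert an invoice; return False if the vendor already sent that number."""
    try:
        conn.execute(
            "INSERT INTO invoice (vendor_id, invoice_number, invoice_total) "
            "VALUES (?, ?, ?)",
            (vendor_id, invoice_number, total),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

The same invoice number from a different vendor is still accepted, which is why the constraint covers both columns rather than invoice_number alone.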
An invoice is a man-made piece of data. That's all it is. It has no natural key, because it has no natural identifier. The person or process who creates a new invoice assigns it a number, call it Invoice Number. But that number is just as artificial as Invoice.Id would be. If you want to consider one of those a surrogate, go ahead.
An automobile is a man-made piece of gear, but it isn't just data. It's something to drive around. When a new automobile is made it gets assigned a unique identifier, called the Vehicle Identification Number, or VIN. But that key is ultimately just as artificial as Invoice Number. It's just pulled out of thin air, made so that it will be unique, and assigned to the car. There is nothing more "Natural" about VIN than there is about Invoice Number. And there is nothing less "natural" about identifiers that are chosen by the DBMS, perhaps using an autonumber feature.
Edit in response to comments: VIN is assigned at the business level, but its sole legitimate function is to identify a vehicle. There are rules for its formation, but those rules exist to prevent the same VIN from being assigned to two vehicles. If one of the digits in the VIN says the seating capacity of the vehicle, that's the seating capacity on the day of manufacture. It's possible to change the seating capacity of a vehicle after it's in operation, by ripping out one of the seats.
If all keys that are used by the business domain (alternatively the "conceptual domain") are natural, it must be recognized that in certain businesses a key will be generated inside a computerized system and eventually acquire meaning as it is used at the business layer. Arguments have been made in answers to other questions that surrogate keys should never be revealed to the application user, or perhaps even to the application program, lest it begin to be used in a meaningful way. That's ultimately a philosophical question, and not one of database design.

When should a social security number be used as a database primary key?

Our DBA says that because the social security number (SSN) is unique, we should use it as the primary key in a table.
While I do not agree with the DBA (and have already mentioned some of the good parts found in this answer), are there any situations where he could possibly be correct?
SSN is not a unique identifier for people. Whether it should be the PK in some table depends on what the rows in the table mean. (See also sqlvogel's answer.)
6.1 percent of Americans have at least two SSNs associated with their name. More than 100,000 Americans have five or more SSNs associated with their name. [...] More than 15 percent of SSNs are associated with two or more people. More than 140,000 SSNs are associated with five or more people. Significantly, more than 27,000 SSNs are associated with 10 or more people.
--idanalytics.com
See also Wikipedia Social Security number.
Never!
In the USA, both Federally and in the several States, there are strict laws concerning the handling of social security numbers and the uses to which they may be put. As the issue of identity theft comes more and more to the forefront, this concern will only become more regulated. Therefore, I would strongly recommend that you never use this as a database key. SS# should be a (very confidential) column in the record.
Furthermore, it is a customer-supplied value. Sometimes, the value supplied is incorrect, or missing, or unintentionally duplicated. You should never use customer-supplied values of any sort as a database primary key. You should not use identifiers that are intended "for use by humans" as a database primary key. (Humans tolerate ambiguity ... computers do not!) Instead, let all such identifiers be column-values stored somewhere in the database, possibly (or, not ...) indexed by a UNIQUE key to prevent duplicates.
I recommend that all primary keys should be completely abstract, containing no meaning at all. For instance, a UUID could be used. Within the database, auto-incrementing integers can be used.
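A small sketch of that recommendation, assuming Python's uuid module and SQLite. The table layout and the helper names are made up for illustration, and the SSN shown in any usage should be a placeholder, never real data. The primary key is an abstract UUID, and the SSN is an ordinary nullable column, so a missing or accidentally repeated customer-supplied value cannot block an insert or corrupt references:

```python
import sqlite3
import uuid


def make_person_table():
    """Person table keyed by an abstract UUID; SSN is just a column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE person (
            person_id TEXT PRIMARY KEY,  -- abstract key, carries no meaning
            name      TEXT NOT NULL,
            ssn       TEXT               -- customer-supplied, may be missing or wrong
        )
        """
    )
    return conn


def add_person(conn, name, ssn=None):
    """Insert a person and return the generated key."""
    person_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO person (person_id, name, ssn) VALUES (?, ?, ?)",
        (person_id, name, ssn),
    )
    return person_id
```

Whether to add a UNIQUE index on ssn is left out on purpose: as the answer says, that depends on whether duplicates should be rejected or merely recorded.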
Keys can only be determined from business rules. What matters is whether it makes sense in the context of your business requirements to enforce uniqueness of SSN or not. That is not a matter your DBA alone can decide on. It's something on which to consult the HR department or whoever else uses this data.
Assuming that your table will have at least one key (perhaps more than one) then I suggest that the DBA is in the best position to advise on the policy for primary keys. If the choice of a primary key is of any importance at all then the DBA ought to say so.

Surrogate database key and data duplication

I was thinking about this problem. In database design surrogate keys are used most of the time, but how do you prevent data duplication and inconsistent data? I mean, one could have a customer table made of customer_id, name, surname. What would prevent me from inserting the same customer twice with a different customer_id? Sure, I could add a unique index to name and surname, but if one does so then what's the purpose of the surrogate primary key?
You're asking a business question, not a technical one.
"How do I know whether two people with the same name are the same person or not?"
Well typically customers are not identified by a name alone, there is also one of:
An account number
An email address
A postal address
A credit card number
A passport number
A date of birth
... etc.
The name is simply not a uniquely identifying characteristic, it's just an attribute of a customer that is probably non-unique, so you need something else to help identify them. Within the database this is the primary key of the customer table, but for business purposes it could be any number of attributes.
If there is a natural key, you cannot replace it with a surrogate key. You can only add the surrogate without removing the natural. This has pros and cons, as I described here.
Unfortunately, there is no good natural key in the case you described, since two different human beings can easily have the same combination of first and last name. Therefore, you'll have to come up with some additional attributes that represent better criteria for judging whether two people are "identical" or not, and then create the corresponding natural key. Discovering such criteria is part of requirements gathering and therefore impossible for me to do without knowing more about your domain.
If you are unable to identify such natural key, then you can just leave customer_id alone. That means you made a decision to make it acceptable for two people to be identical in every aspect (except in their customer_id) and still be considered "different". Arguably, such customer_id may no longer be called "surrogate", since its value now has a meaning in your data model, is potentially visible in the UI etc.
What you have said is perfectly logical and correct. Surrogate keys are not any kind of substitute for a natural key (AKA business key or domain key, i.e. the set of attributes used to identify information in the database and relate it to the real world things the database is supposed to model). If you care about data integrity then natural keys are essential, whereas surrogates by definition are optional and supplemental. Add surrogate keys only when and where you find they have a useful benefit.
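To make the duplication problem concrete, here is a sketch that inserts the same customer twice under two schemas. It assumes, purely for illustration, that email is the attribute the business uses to tell customers apart; the thread itself leaves that choice to requirements gathering. With only the surrogate key both rows are accepted; with the business key enforced the second insert is rejected:

```python
import sqlite3

# Surrogate key only: nothing stops the same customer being stored twice.
SCHEMA_SURROGATE_ONLY = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    surname     TEXT NOT NULL,
    email       TEXT NOT NULL
)
"""

# Surrogate key plus an enforced business key (email is an assumption).
SCHEMA_WITH_BUSINESS_KEY = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    surname     TEXT NOT NULL,
    email       TEXT NOT NULL UNIQUE
)
"""


def rows_after_double_insert(schema):
    """Insert the same customer twice and return how many rows were stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    for _ in range(2):
        try:
            conn.execute(
                "INSERT INTO customer (name, surname, email) VALUES (?, ?, ?)",
                ("Mary", "Jones", "mary@example.com"),
            )
        except sqlite3.IntegrityError:
            pass  # the business key rejected the duplicate
    return conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

The surrogate still earns its place in the second schema: it gives foreign keys a stable target even if the customer later changes their email.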
The only purpose of the id (or "surrogate key" as you call it) is to uniquely identify a record.
First, say you will use name as a key. What will you do if:
the customer changes their name (in some countries women change their surname to their husband's);
you make a typo in the customer's name and have to correct it afterwards?
Then you are in big trouble, because even though you can change it,
an id should never be changed!
Otherwise, you can make a big mess not only in your database and its consistency across backups, logs etc., but also in all the external sources referring to it.
Second, how do you know you won't get two customers with the same name?
You cannot stop people from describing the world wrongly in the database. You can only stop them from describing the world wrongly in the database if the way they described it can't ever happen.
When there is no previous "natural" identifying property used in the business outside the database being stored in the database then we have to pick a "surrogate" distinguishing identifier after the system starts. (Some people wouldn't use "natural" for such an identifier picked after the system starts even though it is used in the business outside the database. And some people wouldn't use "surrogate" for such a distinguishing identifier used in the business system outside the database.)

What are the design criteria for primary keys?

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.
What are the criteria?
The criteria for consideration of a primary key are:
Uniqueness
Irreducibility (no subset of the key uniquely identifies a row in the table)
Simplicity (so that relational representation & manipulation can be simpler)
Stability (should not be altered frequently)
Familiarity (meaningful to the user)
What is a Primary Key?
The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.
Ability to Change
Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, providing someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define ON DELETE CASCADE and ON UPDATE CASCADE--to sync referential dependencies without having to disable constraints.
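A short sketch of cascading referential actions, assuming SQLite (where foreign key enforcement has to be switched on per connection) and two made-up tables, us_state and address. Changing a parent key value is propagated to the referencing rows, and deleting the parent removes them:

```python
import sqlite3


def demo_cascade():
    """Return the child's key after a parent update, and the child count after a parent delete."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.execute(
        "CREATE TABLE us_state (state_code TEXT PRIMARY KEY, state_name TEXT NOT NULL)"
    )
    conn.execute(
        """
        CREATE TABLE address (
            address_id INTEGER PRIMARY KEY,
            city       TEXT NOT NULL,
            state_code TEXT NOT NULL
                REFERENCES us_state (state_code)
                ON UPDATE CASCADE
                ON DELETE CASCADE
        )
        """
    )
    conn.execute("INSERT INTO us_state VALUES ('AL', 'Alabama')")
    conn.execute("INSERT INTO address (city, state_code) VALUES ('Montgomery', 'AL')")

    # Change the parent's key value; the child row follows automatically.
    conn.execute("UPDATE us_state SET state_code = 'XX' WHERE state_code = 'AL'")
    child_codes = conn.execute("SELECT state_code FROM address").fetchall()

    # Delete the parent; the dependent child row is removed as well.
    conn.execute("DELETE FROM us_state WHERE state_code = 'XX'")
    remaining = conn.execute("SELECT COUNT(*) FROM address").fetchone()[0]
    return child_codes, remaining
```

The code 'XX' is an arbitrary placeholder. Cascades make key changes possible, but every cascaded update still touches each referencing row, which is one reason stable keys are preferred.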
Natural or Artificial/Surrogate?
Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states are a good natural key because they are consistent and everyone knows them:
US_STATE_PRIMARY_KEY  US_STATE
--------------------  ----------
AL                    Alabama
AK                    Alaska
AZ                    Arizona
AR                    Arkansas
CA                    California
Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.
Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:
US_STATE_PRIMARY_KEY  US_STATE
--------------------  ----------
100                   Alabama
101                   Alaska
102                   Arizona
103                   Arkansas
104                   California
There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.
Single Column vs Composite
Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).
Performance
From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.
Database design starts with a conceptual data model (such as an entity relationship diagram) and finishes up with a database schema or schemas. Entities are mapped to tables; in this process one entity may be split into several tables, several entities may be merged into a single table and new tables may arise (for instance, intersection tables to implement many-to-many relationships).
In an ERD entities have primary keys. These are natural keys, that is, they are attributes of the entity. For a PERSON entity it might be SocialSecurityNumber. For an ORDER entity it might be OrderRef. For an INVOICE entity it might be InvoiceNo. In the first case that is a real-life identifier; in the second case it is a smart key in an ugly format (2010/DEF/000023); in the third case it is a monotonically incrementing number because that is what the current paper-based system uses.
Natural keys can be fanciful. I once worked on a database design where the analyst had specified the CUSTOMER entity with a key of (FullName, Address, Sex, DateOfBirth, DistinguishingCharacteristics) on the basis that two individuals of the same name, birth date and gender could live at the same address.
The characteristics of an entity's primary key are:
unique
familiar
stable (presumed)
minimal (one or more attributes but as few as necessary)
When it comes to primary keys for database tables, natural keys are not always suitable.
There are many reasons not to use SSN as a physical primary key. Protection of a citizen's personal data is actually the most important but it is also the case that an individual's number can change. Primary keys should be unvarying.
Smart keys are dumb. They are actually compound keys compressed into a single column. They are better represented as separate columns, not least because it is a frequent requirement to search on single elements of the key. Also, the format of such keys can change.
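A sketch of unpacking such a smart key into separate values, using the 2010/DEF/000023 format mentioned earlier. What the three parts mean is not stated in the answer, so the field names year, code and sequence are assumptions, as is the six-digit zero padding used when formatting:

```python
from typing import NamedTuple


class OrderRef(NamedTuple):
    """The components packed into a smart key (field meanings are assumed)."""
    year: int
    code: str
    sequence: int


def split_order_ref(ref: str) -> OrderRef:
    """Split a 'YYYY/CODE/NNNNNN' style reference into its parts."""
    year, code, sequence = ref.strip().split("/")
    return OrderRef(int(year), code, int(sequence))


def format_order_ref(parts: OrderRef) -> str:
    """Rebuild the display form from the separate values."""
    return f"{parts.year}/{parts.code}/{parts.sequence:06d}"
```

Stored as three columns, each part can be indexed and searched on its own, and the display format can change without rewriting key values.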
In general compound keys are a pain as primary keys because we have to cascade multiple columns as foreign keys. This is exacerbated when the child's primary key is defined as a serial number within the parent's primary key. There are systems out there with dependent tables inheriting a nine-column foreign key from a parent when they have a scant two data columns of their own. Sometimes this sort of inheritance can be useful but mostly it is just a hassle.
The characteristics of a table's primary key are:
unique
appropriate (meaningless)
guaranteed stability
minimal, usually a single column (except for intersection tables)
So unless the candidate key is a meaningless identifier (such as InvoiceNo) a table should have a synthetic key (AKA surrogate key). This can be a monotonically incrementing number or a GUID according to your needs. Regarding intersection tables, if they have no other attributes or dependent tables there is no value in replacing a compound primary key (AKA composite key) with a synthetic one.
The crucial thing is: we still enforce the candidate keys. This means applying UNIQUE constraints on those columns - SSN , OrderRef - in the parent table. This is because a synthetic key uniquely identifies a row in a table, it does not uniquely identify the data.
Regarding familiarity
Familiarity is a curly one. It is an important consideration when we are identifying primary keys in a conceptual data model but it is less useful when it comes to database design.
In a comment @bbadour provides two contrasting examples:
{3296013,840082470,Bob Badour,745} versus {840082470,Bob Badour,PE,CA}
and poses the question:
"What does 3296013 achieve that was not already achieved by 840082470, which happens to be the primary key for my academic records at any or every post-secondary school in Canada."
Well, 840082470 is like an invoice number. Of itself it is a meaningless string of digits. If the system we are designing belongs to the domain of Canadian higher education then it is certainly acceptable as a candidate key. However, because it is a key apparently owned by an external central system (forgive me for not understanding the Canadian academic system), it is open to some of the objections to SSN as a primary key. We are reliant on that external system to ensure uniqueness, guarantee stability and verify identification.
As for 745 versus PE,CA, that is clearly wrong. The Canadian postal abbreviation for "Prince Edward Island" and the ISO digraph for "Canada" identify two distinct pieces of information and derive from different sources, so they should be represented as two separate columns. But let us focus on whether 745 or PE makes the better primary key.
First thing, the database doesn't care which data type we use for the code to represent "Prince Edward Island". It just wants guaranteed uniqueness.
Second thing, the user-facing part of the system is likely to display the full expansion "Prince Edward Island", in which case the application is going to need to execute a look-up anyway. This is because users of a system which also holds addresses from the country of Peru or the state of California will appreciate the clarity of the expanded names[1]. Certainly if we go beyond the few hard cases (such as state abbreviations) the application should always expand codes when displaying them to users.
Thus the only advantage of using PE rather than 745 is that it makes ad hoc querying easier.
Third thing, if the code expansion changes we might want to distinguish records which use the newer version. This is a lot easier if 745='Prince Edward Island' and 746='Prince Edward Is.' than if we use PE as the primary key.
Fourth thing, there are programming considerations. For instance, if the application developers have to provide drop-down lists using Java Enumerations they need numeric codes.
In short, familiarity of natural keys is not as useful as the practicality of surrogate keys.
[1] Canadians will know that CA stands for Canada. But does MO stand for Morocco, Monaco, Moldova, Montenegro, Mongolia or Montserrat? Actually none of them: it's Macau.
A Primary Key is a key that uniquely identifies an entity. When you are choosing a primary key, the best choice is almost always a surrogate key that has absolutely nothing to do with the entity at all other than uniquely identifying it.
And that's it. There are supposedly rare edge cases where a primary key might be a natural key, but I've never seen a valid one.
Most of us use a 32-bit auto-increment integer as a primary key. Another excellent choice (in certain circumstances) is a UUID.
A candidate key is a set of attributes that are irreducibly unique (irreducible meaning that no attribute can be removed from the key without losing the uniqueness property).
Other criteria when choosing what candidate keys to implement are: simplicity, stability, familiarity.
These three criteria are important considerations but not necessarily essential attributes of a key. For instance it may be desirable and quite reasonable to enforce a key that can change often. e.g.: a user login name is required to be unique but the user may change it at will as long as it remains unique.
A primary key is a candidate key.
Hey. it's open again. Here goes.
(1) Choose good candidate keys.
It does not pertain to the database designer to choose candidate keys.
The database designer has the responsibility to see to it that all the
uniqueness requirements he is informed of by the user, will be enforced.
So it is the user who "chooses" what the candidate keys are.
There are two scenarios I can think of that relax this unequivocal
position a bit.
One is if the user says that some attribute of type 'video' or 'audio' (or
some such) is to be unique. It may be infeasible to actually enforce
that, and it is the designer's responsibility to point that out to the
user (as it is also his responsibility to point out that 'uniqueness' of
audio and video content is a very debatable subject, and that any
uniqueness on such attribute values, even if enforcible by the system,
still has a good chance of not being the same uniqueness that the user
wants).
Second is how the picture gets muddied by the possibility of distinct
logical designs all addressing the same problem. If D1 and D2 are both
valid designs addressing the same problem, then it might be the case that
a certain given uniqueness rule imposed by the user, is enforcible using
keys in D1, but not in D2. From this perspective, "choosing candidate
keys" can be interpreted as "choosing a particular design such that a
given uniqueness rule is enforcible using keys". But that wasn't really
the question that you asked.
(2) Choose good primary keys.
A while ago, Darwen launched the question "What are good reasons to single
out one particular candidate from among the others as being 'primary' ?".
Nothing much came out, except then perhaps : "to suggest that this
particular key is the preferred one to use whenever making references to
this relvar". I suspect they didn't find that convincing enough to change
their earlier decision that "no key is more unique than any other".
But, supposing that nonetheless there exists some valid reason to single
out one particular key as "primary", I suppose the following
considerations apply :
the likeliness, or appropriateness, of using this primary key also as, e.g., the clustering key in the physical design.
and as a consequence of that, the probability of having to change a value of some existing primary key. Key values that are highly stable will be preferable over key values that are more volatile.
the percentage of the business that naturally uses some such key in their daily operations.
if the required space for physically encoding key values is significantly different, which one has the smallest encoding size.
Your answer to Erwin:
"I agree that choosing a primary key merely designates one candidate key as preferred for foreign key references. However, even if we eliminated the name "primary key" entirely, designers must still choose which candidate key to propagate into another relation for reference purposes. If users identify a heavily referenced relation with an unstable, composite key, do you intend to imply that the designer has no business choosing an additional simple, stable key? Or using the simple, stable key for referencing the relation? Your candidate key section seems to imply that. – bbadour 8 hours ago "
Your original question was about 'primary keys'. Now you change your focus to keys and foreign keys. A key is an integrity constraint, so the only criteria are that a minimal set of attributes has to be unique in a relation (uniqueness and irreducibility). If we change our focus to foreign keys then simplicity, stability and familiarity are the criteria for choosing among all the candidate keys in the referenced relation. There could be several candidate keys that fulfill those criteria to more or less the same extent. If we look at familiarity, one candidate key could be very familiar to one group of users and not to another group, for which another candidate key is more familiar. Think about different views or subschemas of a database. This second group of users should choose a different candidate key for reference purposes (as foreign key). If you insist on 'primary keys', of which we only have one per relation, then I have to ask what makes a key more primary than others.
I think the term primary key should not be used. At least at the logical level. Also the term 'foreign keys' is not well chosen (foreign keys are not keys, but references).
So, I think the remarks of Erwin about ‘primary’ keys were very much to the point. Or at least this was my interpretation of what he means.
Do you agree with this?
If so, would you change your original question to "What are the design criteria for keys and what are the criteria to choose a foreign key from the available candidate keys?"?
If not, why?
Regards,
Carlos
A primary key is a candidate key chosen for special treatment, so first we must look at the properties of candidate keys. A set of one or more columns is a candidate key if it has the following two properties:
Uniqueness: A candidate key must uniquely identify each row in a table. No table may contain two rows with the same value for the candidate key.
Irreducibility: Removing any column from a candidate key must violate the uniqueness property. In other words, no proper subset of the columns in a candidate key is itself unique.
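The two properties can be checked mechanically against a set of rows. The sketch below is only an illustration with made-up function names: sample data can show that a column set is not a key, but it cannot prove that one is, because keys follow from business rules rather than from whatever rows happen to be stored.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        value = tuple(row[c] for c in columns)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, columns):
    """True if the columns are unique and irreducible for these rows.

    Uniqueness is inherited by supersets, so it is enough to test the
    subsets obtained by dropping a single column: if any smaller subset
    were unique, one of those would be unique too.
    """
    columns = tuple(columns)
    if not is_unique(rows, columns):
        return False
    smaller = combinations(columns, len(columns) - 1)
    return not any(is_unique(rows, subset) for subset in smaller)
```

For example, with rows holding vendor_id, invoice_number and total, the pair (vendor_id, invoice_number) can pass while the same pair plus total fails, because the extra column can be removed without losing uniqueness.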
If no candidate key exists, and sometimes even if one does, a surrogate key is often created using an auto-incrementing integer column, or made up using some other technique. This surrogate key is now also a candidate key.
It is often useful to choose among the available candidate keys and to designate one of them as the primary key. The first criterion often applied is simplicity, indicating the candidate key with the fewest columns. However there are other potential criteria, like familiarity, familiar values being more useful than non-familiar values, and stability, stable keys being less troublesome than keys that are apt to change. These criteria, however, are strictly outside the scope of the relational model, often conflict with each other, and are often made to deal with implementation limitations.
I would say that the first two concepts "uniqueness" and "irreducibility" are less design criteria than fundamental properties of primary keys, while the latter concepts of "simplicity", "familiarity" and "stability" are more properly labeled design criteria, as they involve tradeoffs and subjectivity.
Why choose a primary key? Simplicity and familiarity are not only criteria for choosing among available candidate keys, but are why we should choose a primary key at all. If there are multiple candidate keys in a table, it simplifies things if all foreign keys pointing to that table refer to the same candidate key. Furthermore, the very act of choosing a particular candidate key will help make it familiar.
What are the criteria?
A PRIMARY KEY is something that will define the entity, only the entity and nothing but the entity.
You can take it from the outside world. Say, a star catalog number to identify a star (good example), or an SSN to identify a person (bad example).
In this case, you rely on the outside world.
Do all people have SSN? (They don't).
Are SSN's unique? (They aren't).
Can an SSN be assigned to another person? (It can).
You can generate it inside your model, using AUTOINCREMENT or GUIDs or whatever.
In this case, you rely on yourself and your database skills.
Do all people in your model have an ID? (Yes, they do, otherwise they wouldn't be in the table with ID NOT NULL).
Are these ID's unique? (Yes, they are, the PRIMARY KEY constraint takes care of it).
Can they be assigned to other persons? (No, they cannot, they are either non-repeatable by design or auto incrementing).
Or another set of answers:
Do all people in your model have an ID? (No, they don't, the people table was accidentally dropped, though some other information retained).
Are these ID's unique? (No, we failed to merge two versions of the database properly).
Can they be assigned to other persons? (Yes, we reset the AUTOINCREMENT by mistake).
The most important thing is that a surrogate key is a feast that is always with you. You can always create a surrogate key: nothing on Earth can stop you from declaring an AUTOINCREMENT field. But far from everything has an identifier that everybody agrees upon.
However, the value of a good natural key cannot be overemphasized.
The Guide Star Catalog database is most probably backed up more reliably than yours, and the list of US state codes you can always restore right from memory.
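The optimistic set of answers about generated IDs (every row has one, they are unique, they are not handed to someone else) can be tried out with a small sketch. It assumes SQLite through Python's `sqlite3` module, and the `person` table and its names are invented for illustration; other engines spell auto-increment differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT NOT NULL)"
)

first_id = conn.execute("INSERT INTO person (name) VALUES ('Ann')").lastrowid
second_id = conn.execute("INSERT INTO person (name) VALUES ('Bob')").lastrowid

# Delete the newest row, then insert again: with AUTOINCREMENT the
# deleted id is not handed out to the next person.
conn.execute("DELETE FROM person WHERE id = ?", (second_id,))
third_id = conn.execute("INSERT INTO person (name) VALUES ('Cid')").lastrowid

# Uniqueness is enforced by the PRIMARY KEY constraint.
try:
    conn.execute("INSERT INTO person (id, name) VALUES (?, 'Dup')", (first_id,))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

None of this protects against the pessimistic answers: a dropped table or a badly merged database defeats the guarantee just the same.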
Only one, really: choose a surrogate for each table (identity/auto_number or something similar) that the users will never even see, so you can do whatever is necessary with it whenever you need to, now and in the future.
(Not quite sure how to interpret this question. Sounds like a quiz or something where you are looking for one single "right" answer from a textbook. I'm going to interpret the question as a more practical one, hence my advice below.)
At least in the MS SQL world, discussion about a proper Primary Key is inevitably wrapped up in discussion about the proper clustered index for a table. The two don't have to be the same, but they are by default, and for many tables, making the two the same is often a good idea.
For the purpose of our discussion here, it's important to distinguish between the two:
A PRIMARY KEY is a field or combination of fields that uniquely identify a row.
A CLUSTERED INDEX is a field or combination of fields that represents the physical ordering of a table. (Again, I am speaking about MS SQL Server; I'm not sure how other RDBMSs handle this.)
Key to the remainder of my discussion is knowing that since SQL Server 7.0, the clustered index key is used as a row identifier for all non-clustered indexes. This means that many of the criteria for choosing a good clustering key are the same as those for choosing a good primary key.
Let's first look at the criteria for a good clustered index (From Kimberly Tripp's excellent article). A clustered index should be:
Unique - otherwise useless as a row identifier for other indexes
Narrow - this key is used in other indexes, so should be as narrow as possible
Static - If key values change, then references become invalid and will need updating
Ever-increasing - To reduce physical table fragmentation as new rows are added
It is readily apparent the first 3 are also good criteria for a primary key. #4 is a bonus that will reduce table fragmentation as tables grow.
A GUID as a primary key, as popular as that is, actually fails 2 of these criteria (Narrow and Ever-Increasing). As such, it is not recommended as a PK/Clustered index in most circumstances (see Kim's related article here)
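The two criteria a GUID fails can be illustrated outside the database. This is a rough sketch using Python's `uuid` module as a stand-in for GUID generation, with a 4-byte width assumed for the integer column; it says nothing about how any particular engine stores or orders GUIDs internally.

```python
import uuid

guid_keys = [uuid.uuid4() for _ in range(1000)]  # random GUIDs
int_keys = list(range(1, 1001))                  # stand-in for an identity column

guid_width = len(guid_keys[0].bytes)  # bytes needed per GUID key
int_width = 4                         # assumed 32-bit INT column

# "Ever-increasing" means each new key sorts after all earlier ones.
guids_ever_increasing = guid_keys == sorted(guid_keys)
ints_ever_increasing = int_keys == sorted(int_keys)
```

The GUID is four times wider than the assumed integer, and random GUIDs arrive in no particular order, so new rows would land throughout the index rather than at its end.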
I'm going to say something here that is not expected.
All the stuff they teach in database about normalization and keys is all wrong when it comes to choosing primary keys.
The primary key is special when it comes to range queries, and for that reason, if you have a dominant range query, that is your primary key, no exceptions.
If your dominant range query is not on a candidate key you end up with a primary key that is not enforced for uniqueness! This is sometimes called a clustered index, which is a misnomer because there is no index.
Now the normalization and candidate keys are all important, and you will want to enforce unique constraints on at least some of them. But do not assign the primary key because it is the natural key. In fact, this is slower than defining an index and a unique constraint. Define the primary key based on range queries only.
Remember, there is no constraint to actually have primary keys. A table with no primary keys is called a heap table and has either no intrinsic ordering or insertion order intrinsic ordering.
EDIT: definition of range query:
A range query is a query that has an ORDER BY clause or contains a greater-than or less-than operator. What we are interested in are the columns on which these queries run. The fundamental idea is that a range query fetches several (tens to hundreds to perhaps thousands, but not all) rows from the table based on bounding conditions at one or both ends.
There is another kind of range query: where you have a foreign key to another table and the operation selects all rows matching on that foreign key. This is in fact also a range query, although not obviously so.
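One reading of this advice can be sketched with SQLite, where a `WITHOUT ROWID` table stores its rows in primary-key order, which is roughly analogous to a clustered index. The `reading` table, its columns and its data are hypothetical, and the sketch uses a primary key that is unique; it does not cover the non-unique clustering case mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rows are physically organised by (sensor_id, taken_at), the columns
# the dominant range query filters and sorts on.
conn.execute(
    "CREATE TABLE reading ("
    " sensor_id INTEGER NOT NULL,"
    " taken_at TEXT NOT NULL,"
    " value REAL NOT NULL,"
    " PRIMARY KEY (sensor_id, taken_at)"
    ") WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?)",
    [(1, f"2024-01-{d:02d}", d * 1.5) for d in range(1, 31)]
    + [(2, f"2024-01-{d:02d}", d * 2.0) for d in range(1, 31)],
)

# The dominant range query: a few rows bounded at both ends.
rows = conn.execute(
    "SELECT taken_at, value FROM reading"
    " WHERE sensor_id = ? AND taken_at >= ? AND taken_at < ?"
    " ORDER BY taken_at",
    (1, "2024-01-10", "2024-01-15"),
).fetchall()
```

The equality on `sensor_id` is the "foreign key" style of range query, and the bounds on `taken_at` are the explicit kind; both are served by the same key ordering.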

The best choice for Person table primary key

What is your choice for primary key in tables that represent a person (like Client, User, Customer, Employee etc.)? My first choice would be a social security number (SSN). However, using SSN has been discouraged because of privacy concerns and different regulations. An SSN can change during a person's lifetime, so that is another reason against it.
I guess that one of the functions of well chosen natural primary key is to avoid duplication. I do not want a person to be registered twice in the database. Some surrogate or generated primary key does not help in avoiding duplicate entries. What is the best way to approach this?
What is the best way to guarantee uniqueness in your application for person entity and can this be handled on database level with primary key or uniqueness constraint?
I don't know which database engine you are using, but (at least with MySQL -- see 7.4.1. Make Your Data as Small as Possible), using an integer, the shortest possible, is generally considered best for performance and memory requirements.
I would use an integer, auto_increment, for that primary key.
The idea being:
If the PK is short, it helps identifying each row (it's faster and easier to compare two integers than two long strings)
If a column used in foreign keys is short, it'll require less memory for foreign keys, as the value of that column is likely to be stored in several places.
And then set a UNIQUE index on another column -- the one that determines uniqueness -- if that's possible and/or necessary.
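As a minimal sketch of this layout, assuming SQLite through Python's `sqlite3` module: the `customer` and `purchase` tables are invented for illustration, and `email` merely stands in for whichever column determines uniqueness in your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY,"      # short surrogate key
    " email TEXT NOT NULL UNIQUE,"  # hypothetical column assumed to be unique
    " name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE purchase ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"  # short FK
    " item TEXT NOT NULL)"
)

cust_id = conn.execute(
    "INSERT INTO customer (email, name) VALUES (?, ?)",
    ("ann@example.com", "Ann"),
).lastrowid
conn.execute(
    "INSERT INTO purchase (customer_id, item) VALUES (?, ?)", (cust_id, "book")
)

# Registering the same email again is rejected by the UNIQUE index.
try:
    conn.execute(
        "INSERT INTO customer (email, name) VALUES (?, ?)",
        ("ann@example.com", "Ann again"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Only the small integer is copied into `purchase`; the longer unique value is stored once.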
Edit: Here are a couple of other questions/answers that might interest you :
What’s the best practice for Primary Keys in tables?
How do you like your primary keys?
Should I have a dedicated primary key field?
Use item specific prefixes and autonumber for primary keys?
As mentioned above, use an auto-increment as your primary key. But I don't believe this is your real question.
Your real question is how to avoid duplicate entries. In theory, there is no way - 2 people could be born on the same day, with the same name, and live in the same household, and not have a social insurance number available for one or the other. (One might be a foreigner visiting the country).
However, the combination of full name, birthdate, address, and telephone number is usually sufficient to avoid duplication. Note that addresses may be entered differently, people may have multiple phone numbers, and people may choose to omit their middle name or use an initial. It depends on how important it is to avoid duplicate entries, and on how large your userbase is (and thus the likelihood of a collision).
Of course, if you can get the SSN/SIN then use that to determine uniqueness.
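Because the same person's details may be typed in differently, a duplicate check usually compares a normalized form of the attributes rather than the raw input. The helper below is a hypothetical sketch (the name `match_key` and the chosen attributes are assumptions); it only smooths over letter case, extra spaces and phone punctuation, and does not handle omitted middle names or differently written addresses.

```python
import re

def match_key(full_name, birthdate, phone):
    """Build a comparison key that ignores case, spacing and phone punctuation."""
    name = " ".join(full_name.lower().split())  # collapse whitespace, lowercase
    digits = re.sub(r"\D", "", phone)           # keep digits only
    return (name, birthdate, digits)

key_a = match_key("  Ann  LEE ", "1990-02-03", "(555) 010-0123")
key_b = match_key("ann lee", "1990-02-03", "555 010 0123")
key_c = match_key("Ann Lee", "1991-02-03", "555 010 0123")

same_person_suspected = key_a == key_b
different_birthdate_matches = key_a == key_c
```

The first two entries produce the same key and would be flagged as a likely duplicate; the third differs in birthdate and would not.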
What attributes are available to you? Which ones does your application care about? For example, no two people can be born at exactly the same second at exactly the same place, but you probably don't have access to data at that level of accuracy! So you need to decide, from the attributes you intend to model, which ones are sufficient to provide an acceptable level of data integrity. Whatever you choose, you're right in focusing on the data integrity aspects (preventing insertion of multiple rows for the same person) of your selection.
For Joins/Foreign Keys in other tables, it is best to use a surrogate key.
I've grown to consider the term Primary Key a misnomer, or at best, confusing. Any key, whether you flag it as Primary Key, Alternate Key, Unique Key, or Unique Index, is still a key, and requires that every row in the table contain unique values for the attributes in the key. In that sense, all keys are equivalent. What matters most is whether they are natural keys (dependent on meaningful attributes of the real domain model) or surrogates (independent of real data attributes).
Secondly, what also matters is what you use the key for. Surrogate keys are narrow and simple and never change (there is no reason to, since they don't mean anything), so they are a better choice for joins or for foreign keys in other dependent tables.
But to ensure data integrity, and prevent insertion of multiple rows for the same domain entity, they are totally useless... For that you need some kind of Natural Key, chosen from the data you have available, and which your application is modeling for some purpose.
The key does not have to be 100% immutable. If, for example, you use Name, Phone Number and Birthdate, then even if a person changes their name or their phone number, you can simply change the value in the table. As long as no other row already has the new values in its key attributes, you are fine.
Even if the key you select only works in 99.9% of the cases (say you are unlucky enough to run into two people with the same name and phone number who were coincidentally born on the same day), at least 99.9% of your data will be guaranteed to be accurate and consistent, and you can, for example, just add a time to their birthdate to make them unique, or add some other attribute to the key to distinguish them. As long as you don't have to update data values in foreign keys throughout your database because of the change (since you are not using this key as an FK elsewhere), you are not facing any significant issue.
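The division of labour described here, a surrogate for foreign keys and a mutable natural key for integrity, can be sketched as follows. It assumes SQLite via Python's `sqlite3`; the `person` and `invoice` tables and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY,"            # surrogate, used only for joins/FKs
    " name TEXT NOT NULL,"
    " phone TEXT NOT NULL,"
    " birthdate TEXT NOT NULL,"
    " UNIQUE (name, phone, birthdate))"   # natural key, guards against duplicates
)
conn.execute(
    "CREATE TABLE invoice ("
    " id INTEGER PRIMARY KEY,"
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " amount REAL NOT NULL)"
)

pid = conn.execute(
    "INSERT INTO person (name, phone, birthdate) VALUES (?, ?, ?)",
    ("Ann Lee", "555-0100", "1990-02-03"),
).lastrowid
conn.execute("INSERT INTO invoice (person_id, amount) VALUES (?, ?)", (pid, 42.0))

# The person changes phone number: only the person row is touched,
# no foreign key value anywhere needs updating.
conn.execute("UPDATE person SET phone = ? WHERE id = ?", ("555-0199", pid))
phone_via_invoice = conn.execute(
    "SELECT p.phone FROM invoice i JOIN person p ON p.id = i.person_id"
).fetchone()[0]

# A second row for the same person is still rejected by the natural key.
try:
    conn.execute(
        "INSERT INTO person (name, phone, birthdate) VALUES (?, ?, ?)",
        ("Ann Lee", "555-0199", "1990-02-03"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```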
Use an autogenerated integer primary key, and then put a unique constraint on anything that you believe should be unique. But SSNs are not unique in the real world so it would be a bad idea to put a uniqueness constraint on this column unless you think turning away customers because your database won't accept them is a good business model.
I prefer natural keys, but a person table is a lost cause. SSNs are not unique and not everybody has one.
I'd recommend a surrogate key. Add all the indexes you need for other candidate keys, but keeping business logic out of the key is my recommendation.
I prefer natural keys, when they can be trusted.
Unless you are running a bank or something like that, there is no reason for your clients and users to provide you with a valid SSN, or even necessarily to have one. Thus, for business reasons, you are forced to distrust SSN in the case you outline. A similar argument would hold for any given natural key to "persons".
You have no choice but to assign an artificial (read "surrogate") key. It might as well be an integer. Make sure it's a big enough integer that you aren't going to need to expand it any time soon.
To add to #Mark and #Pascal (autoincrement integers are your best bet) -- SSNs are useful and should be modelled correctly. Security concerns are part of application logic. You can normalize them into a separate table, and you can make them unique by providing a date-issued field.
p.s., to those who disagree with the "security in application" point, an enterprise DB will have a granular ACL model, so this won't be a sticking point.
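The suggestion of moving SSNs into their own table, made unique together with a date-issued field, might look like the sketch below. It assumes SQLite via Python's `sqlite3`; the table name `person_ssn`, the column names and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE person ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE person_ssn ("
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " ssn TEXT NOT NULL,"
    " date_issued TEXT NOT NULL,"
    " PRIMARY KEY (ssn, date_issued))"  # an SSN is unique only per issue date
)

ann = conn.execute("INSERT INTO person (name) VALUES ('Ann')").lastrowid
bob = conn.execute("INSERT INTO person (name) VALUES ('Bob')").lastrowid

# The same number issued on different dates to different people is accepted.
conn.execute("INSERT INTO person_ssn VALUES (?, ?, ?)", (ann, "000-00-0001", "1975-06-01"))
conn.execute("INSERT INTO person_ssn VALUES (?, ?, ?)", (bob, "000-00-0001", "2010-03-15"))

# The same number with the same issue date is rejected.
try:
    conn.execute("INSERT INTO person_ssn VALUES (?, ?, ?)", (bob, "000-00-0001", "1975-06-01"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

ssn_rows = conn.execute("SELECT COUNT(*) FROM person_ssn").fetchone()[0]
```

People without an SSN simply have no row in `person_ssn`, so the `person` table itself never has to turn anyone away.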

Resources