Table has no natural keys in SQL Server - sql-server

The AP database example provided online by Murach's SQL Server 2016 for developers has an Invoice table with the surrogate key InvoiceID but with no natural keys. Most of the other tables have natural keys that uniquely identify each row, so I was curious: why would they provide a table without a natural key to identify what each row represents?
I got the AP database creation script from here:
https://www.murach.com/shop/murach-s-sql-server-2016-for-developers-detail

I think they made a mistake. The natural key here presumably ought to be (VendorID, InvoiceNumber). I have never seen a real accounts payable system that allowed duplicate invoice numbers for the same vendor. Paying the same invoice twice obviously isn't a good idea!
The most common motivation for creating a surrogate is to reduce the impact of having to change other key values. Natural key (AKA business key) values sometimes need to change. Surrogate keys need to change much less frequently because fewer people ever see them and so there's much less reason to change them. That relative stability may have some technical advantages in situations where the business key values are expected to change. Even in the presence of a surrogate, business keys are still critically important because they are the things that users and business processes depend on.
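To make the suggestion concrete, here is a minimal sketch of an Invoice table that keeps the surrogate InvoiceID and also declares (VendorID, InvoiceNumber) as a key. It runs against an in-memory SQLite database purely so it is self-contained; the InvoiceTotal column and the sample values are assumptions, not taken from the Murach script, and SQL Server syntax would differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Invoice (
        InvoiceID     INTEGER PRIMARY KEY,   -- surrogate key
        VendorID      INTEGER NOT NULL,
        InvoiceNumber TEXT    NOT NULL,
        InvoiceTotal  NUMERIC NOT NULL,      -- hypothetical column for illustration
        UNIQUE (VendorID, InvoiceNumber)     -- presumed business key
    )
""")

insert = "INSERT INTO Invoice (VendorID, InvoiceNumber, InvoiceTotal) VALUES (?, ?, ?)"
conn.execute(insert, (1, "A-100", 250.00))
conn.execute(insert, (2, "A-100", 75.00))  # same number, different vendor: allowed

try:
    conn.execute(insert, (1, "A-100", 250.00))  # same vendor, same number
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

invoice_count = conn.execute("SELECT COUNT(*) FROM Invoice").fetchone()[0]
print(duplicate_rejected, invoice_count)
```

Other tables can still reference InvoiceID, while the UNIQUE constraint is what stops the same vendor invoice from being entered twice.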

An invoice is a man-made piece of data. That's all it is. It has no natural key, because it has no natural identifier. The person or process who creates a new invoice assigns it a number, call it Invoice Number. But that number is just as artificial as Invoice.Id would be. If you want to consider one of those a surrogate, go ahead.
An automobile is a man-made piece of gear, but it isn't just data. It's something to drive around. When a new automobile is made it gets assigned a unique identifier, called the Vehicle Identification Number, or VIN. But that key is ultimately just as artificial as Invoice Number. It's just pulled out of thin air, made so that it will be unique, and assigned to the car. There is nothing more "Natural" about VIN than there is about Invoice Number. And there is nothing less "natural" about identifiers that are chosen by the DBMS, perhaps using an autonumber feature.
Edit in response to comments: VIN is assigned at the business level, but its sole legitimate function is to identify a vehicle. There are rules for its formation, but those rules exist to prevent the same VIN from being assigned to two vehicles. If one of the digits in the VIN says the seating capacity of the vehicle, that's the seating capacity on the day of manufacture. It's possible to change the seating capacity of a vehicle after it's in operation, by ripping out one of the seats.
If all keys that are used by the business domain (alternatively the "conceptual domain") are natural, it must be recognized that in certain businesses a key will be generated inside a computerized system and eventually acquire meaning as it is used at the business layer. Arguments have been made in answers to other questions that surrogate keys should never be revealed to the application user, or perhaps even to the application program, lest it begin to be used in a meaningful way. That's ultimately a philosophical question, and not one of database design.

Related

When should a social security number be used as a database primary key?

Our DBA says that because the social security number (SSN) is unique, that we should use it as the primary key in a table.
While I do not agree with the DBA (and have already mentioned some of the good parts found in this answer), are there any situations where he could possibly be correct?
SSN is not a unique identifier for people. Whether it should be the PK in some table depends on what the rows in the table mean. (See also sqlvogel's answer.)
6.1 percent of Americans have at least two SSNs associated with their name. More than 100,000 Americans have five or more SSNs associated with their name. [...] More than 15 percent of SSNs are associated with two or more people. More than 140,000 SSNs are associated with five or more people. Significantly, more than 27,000 SSNs are associated with 10 or more people.
--idanalytics.com
See also Wikipedia Social Security number.
Never!
In the USA, both federally and in the several states, there are strict laws concerning the handling of social security numbers and the uses to which they may be put. As the issue of identity theft comes more and more to the forefront, this area will only become more heavily regulated. Therefore, I would strongly recommend that you never use this as a database key. SS# should be a (very confidential) column in the record.
Furthermore, it is a customer-supplied value. Sometimes, the value supplied is incorrect, or missing, or unintentionally duplicated. You should never use customer-supplied values of any sort as a database primary key. You should not use identifiers that are intended "for use by humans" as a database primary key. (Humans tolerate ambiguity ... computers do not!) Instead, let all such identifiers be column-values stored somewhere in the database, possibly (or, not ...) indexed by a UNIQUE key to prevent duplicates.
I recommend that all primary keys should be completely abstract, containing no meaning at all. For instance, a UUID could be used. Within the database, auto-incrementing integers can be used.
Keys can only be determined from business rules. What matters is whether it makes sense in the context of your business requirements to enforce uniqueness of SSN or not. That is not a matter your DBA alone can decide on. It's something on which to consult the HR department or whoever else uses this data.
Assuming that your table will have at least one key (perhaps more than one) then I suggest that the DBA is in the best position to advise on the policy for primary keys. If the choice of a primary key is of any importance at all then the DBA ought to say so.

Can a table have a surrogate key without a (natural) alternate key?

I think it doesn't make sense for a table to have a surrogate key without also having a (natural) alternate key (keep in mind that one of the properties for a surrogate key is that it has no meaning outside the database environment).
For example say I have the following table:
Say that employee_id is the surrogate primary key, and there is no (natural) alternate key in the table.
Now let's say that some employee wants to change his phone number. How can we identify the record for this employee in the table? We can't identify it using the surrogate key, because the surrogate key is not known in the real world (i.e. we don't know the employee_id for each employee).
So there must be a (natural) alternate key in the table to identify each employee in the real world (for example: SSN).
Am I correct, or am I missing something?
You are right. Normally the users of a database need to be able to map information in a database table to real concepts or things outside the database. For that they need a usable natural key - or something that can be reliably translated into a natural key.
Your specific example isn't necessarily a good one because many (most?) organisations allocate an employee identifier to employees for the duration of their employment. That employee identifier may be known and used as the natural key by both employee and employer. You have said employee_id in your example is a surrogate but based on its name many people might assume it is not.
Every table has a candidate key. (Possibly, all the columns.) Each value of that key designates something. If employee_ids weren't used externally, they and the set of the other three columns would both identify sets of people with the same name, address and phone number.
You are right that a table must contain a natural key for whatever entity you want to identify--for "natural key" meaning "designator under the current business rules, external to the database".
You seem to be confusing multiple meanings for "surrogate key" and "natural key".
For "surrogate": One use is, a property set where designations are determined since current business rules: surrogate as new. Another use is, a property set where designations are determined since current business rules and only used internally: surrogate as internal.
For "natural": One use is, a property set designating under current business rules and before: natural as old. Another use is, a property set designating under current business rules: natural as external.
The original use of "surrogate" was as internal with "natural" as external. Unfortunately now usually people use "surrogate" as new and "natural" as old. And they seldom either consider or distinguish surrogates as internal. Some people might call a newly introduced external designator as both surrogate (as new) and natural (as external). (Re "meaningless" names.)
All you can do is decipher or ask what someone means when they use these terms.
Note that these definitions are relative to the "current" business rules. You also seem to be assuming that employee ids arrived with the database. At some point they were introduced, so were chosen after some older system started, so were surrogates as new under the new system. But if the database came later then by that time they were natural keys as old. They were natural as external both times; when introduced they were just new natural as external.
The term "natural key" belongs with the subject matter or problem domain, and not with the database design or solution domain.
If you ask people in the organization how they refer to each other, it's usually by name. "That's Mary Jones, sitting by herself". You never hear, "That's employee 79932, sitting by herself."
If you ask a database person, like most of us, what's wrong with using the name, you'll get the standard answers: it's not unique. It's mutable. It's too many characters. All of that is true, but it doesn't change how people work in the real world.
The item "social security number" began life as an artificial key, even though that was before the computer revolution really began (1936). Over time, it has taken on most of the look and feel of a natural key. In your case, I would even call it a natural key, because every employee has one (barring something hokey).

Surrogate database key and data duplication

I was thinking about this problem. In database design, surrogate keys are used most of the time, but how do you prevent data duplication and inconsistent data? I mean one could have a customer table made of customer_id, name, surname. What would prevent me from inserting the same customer twice with a different customer_id? Sure, I could add a unique index to name and surname, but if one does so, then what's the purpose of the surrogate primary key?
You're asking a business question, not a technical one.
"How do I know whether two people with the same name are the same person or not?"
Well typically customers are not identified by a name alone, there is also one of:
An account number
An email address
A postal address
A credit card number
A passport number
A date of birth
... etc.
The name is simply not a uniquely identifying characteristic, it's just an attribute of a customer that is probably non-unique, so you need something else to help identify them. Within the database this is the primary key of the customer table, but for business purposes it could be any number of attributes.
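As a rough illustration of the point above, the sketch below keeps customer_id as the primary key but lets a second attribute decide whether two rows describe the same customer. The email column is an assumption chosen from the list above, and SQLite is used only so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- surrogate key
        name        TEXT NOT NULL,
        surname     TEXT NOT NULL,
        email       TEXT NOT NULL UNIQUE   -- assumed business identifier
    )
""")

insert = "INSERT INTO customer (name, surname, email) VALUES (?, ?, ?)"
conn.execute(insert, ("Ann", "Lee", "ann.lee@example.com"))
conn.execute(insert, ("Ann", "Lee", "a.lee@example.org"))  # a namesake: accepted

try:
    conn.execute(insert, ("Ann", "Lee", "ann.lee@example.com"))  # same person again
    same_email_rejected = False
except sqlite3.IntegrityError:
    same_email_rejected = True

customer_ids = [row[0] for row in conn.execute("SELECT customer_id FROM customer")]
print(same_email_rejected, len(customer_ids))
```

Two people with the same name coexist, while re-entering the same email is refused. Which attribute plays that role is a business decision, not something the schema can decide for you.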
If there is a natural key, you cannot replace it with a surrogate key. You can only add the surrogate without removing the natural. This has pros and cons, as I described here.
Unfortunately, there is no good natural key in the case you described, since two different human beings can easily have the same combination of first and last name. Therefore, you'll have to come up with some additional attributes that represent a better criterion for judging whether two people are "identical" or not, and then create the corresponding natural key. Discovering such criteria is part of requirements gathering and therefore impossible for me to do without knowing more about your domain.
If you are unable to identify such natural key, then you can just leave customer_id alone. That means you made a decision to make it acceptable for two people to be identical in every aspect (except in their customer_id) and still be considered "different". Arguably, such customer_id may no longer be called "surrogate", since its value now has a meaning in your data model, is potentially visible in the UI etc.
What you have said is perfectly logical and correct. Surrogate keys are not any kind of substitute for a natural key (AKA business key or domain key, i.e. the set of attributes used to identify information in the database and relate it to the real world things the database is supposed to model). If you care about data integrity then natural keys are essential, whereas surrogates by definition are optional and supplemental. Add surrogate keys only when and where you find they have a useful benefit.
The only purpose of the id (or "surrogate key" as you call it) is to uniquely identify a record.
First, say you will use name as a key. What will you do if:
the customer changes their name (in some countries women change their surname to their husband's);
you make a typo in a customer's name and have to correct it afterwards?
Then you are in big trouble, because even though you can change it,
id should never be changed!
Otherwise, you can make a big mess not only in your database, in consistency across backups, logs etc., but also in all the external sources referring to it.
Second, how do you know you won't get two customers with the same name?
You cannot stop people from describing the world wrongly in the database. Constraints can only reject descriptions of situations that can never happen.
When there is no previous "natural" identifying property used in the business outside the database being stored in the database then we have to pick a "surrogate" distinguishing identifier after the system starts. (Some people wouldn't use "natural" for such an identifier picked after the system starts even though it is used in the business outside the database. And some people wouldn't use "surrogate" for such a distinguishing identifier used in the business system outside the database.)

The best choice for Person table primary key

What is your choice for primary key in tables that represent a person (like Client, User, Customer, Employee etc.)? My first choice would be a social security number (SSN). However, using SSN has been discouraged because of privacy concerns and different regulations. SSN can change during a person's lifetime, so that is another reason against it.
I guess that one of the functions of well chosen natural primary key is to avoid duplication. I do not want a person to be registered twice in the database. Some surrogate or generated primary key does not help in avoiding duplicate entries. What is the best way to approach this?
What is the best way to guarantee uniqueness in your application for person entity and can this be handled on database level with primary key or uniqueness constraint?
I don't know which database engine you are using, but (at least with MySQL -- see 7.4.1. Make Your Data as Small as Possible), using an integer, the shortest possible, is generally considered best for performance and memory requirements.
I would use an integer, auto_increment, for that primary key.
The idea being :
If the PK is short, it helps identifying each row (it's faster and easier to compare two integers than two long strings)
If a column used in foreign keys is short, it'll require less memory for foreign keys, as the value of that column is likely to be stored in several places.
And then set a UNIQUE index on another column -- the one that determines uniqueness -- if that's possible and/or necessary.
Edit: Here are a couple of other questions/answers that might interest you :
What’s the best practice for Primary Keys in tables?
How do you like your primary keys?
Should I have a dedicated primary key field?
Use item specific prefixes and autonumber for primary keys?
As mentioned above, use an auto-increment as your primary key. But I don't believe this is your real question.
Your real question is how to avoid duplicate entries. In theory, there is no way - 2 people could be born on the same day, with the same name, and live in the same household, and not have a social insurance number available for one or the other. (One might be a foreigner visiting the country).
However, the combination of full name, birthdate, address, and telephone number is usually sufficient to avoid duplication. Note that addresses may be entered differently, people may have multiple phone numbers, and people may choose to omit their middle name or use an initial. It depends on how important it is to avoid duplicate entries, and how large is your userbase (and thus the likelihood of a collision).
Of course, if you can get the SSN/SIN then use that to determine uniqueness.
What attributes are available to you? Which ones does your application care about ? For example no two people can be born at exactly the same second at exactly the same place, but you probably don't have access to that data at that level of accuracy! So you need to decide, from the attributes you intend on modeling, which ones are sufficient to provide an acceptable level of data integrity. Whatever you choose, you're right in focusing on the data integrity aspects (preventing insertion of multiple rows for the same person) of your selection.
For Joins/Foreign Keys in other tables, it is best to use a surrogate key.
I've grown to consider the use of the word Primary Key as a misnomer, or at best, confusing. Any key, whether you flag it as Primary Key, Alternate Key, Unique Key, or Unique Index, is still a Key, and requires that every row in the table contain unique values for the attributes in the key. In that sense, all keys are equivalent. What matters more (most) is whether they are natural keys (dependent on meaningful real domain model data attributes) or surrogates (independent of real data attributes).
Secondly, what also matters is what you use the key for. Surrogate keys are narrow and simple and never change (no reason to; they don't mean anything), so they are a better choice for joins or for foreign keys in other dependent tables.
But to ensure data integrity, and prevent insertion of multiple rows for the same domain entity, they are totally useless... For that you need some kind of Natural Key, chosen from the data you have available, and which your application is modeling for some purpose.
The key does not have to be 100% immutable. If (as an example), you use Name and Phone Number and Birthdate, for example, even if a person changes their name, or their phone number, you can simply change the value in the table. As long as no other row already has the new values in their key attributes, you are fine.
Even if the key you select only works in 99.9% of the cases (say you are unlucky enough to run into two people with the same name and phone number who were coincidentally born the same day), well, at least 99.9% of your data will be guaranteed to be accurate and consistent - and you can, for example, just add a time to their birthdate to make them unique, or add some other attribute to the key to distinguish them. As long as you don't have to update data values in Foreign Keys throughout your database because of the change (since you are not using this key as a FK elsewhere), you are not facing any significant issue.
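A small sketch of that arrangement, assuming a hypothetical person table with a surrogate person_id used for foreign keys and a UNIQUE constraint on (name, phone, birthdate). The account table and all values are invented for illustration, and SQLite is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,          -- surrogate, used by foreign keys
        name      TEXT NOT NULL,
        phone     TEXT NOT NULL,
        birthdate TEXT NOT NULL,
        UNIQUE (name, phone, birthdate)         -- natural key, guards against duplicates
    );
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL REFERENCES person (person_id)
    );
    INSERT INTO person VALUES (1, 'Mary Jones', '555-0100', '1980-04-02');
    INSERT INTO account VALUES (10, 1);
""")

# The natural key value changes; no foreign key has to be touched.
conn.execute("UPDATE person SET name = 'Mary Smith' WHERE person_id = 1")

account_owner = conn.execute("""
    SELECT p.name
    FROM account a
    JOIN person p ON p.person_id = a.person_id
    WHERE a.account_id = 10
""").fetchone()[0]
account_fk = conn.execute(
    "SELECT person_id FROM account WHERE account_id = 10"
).fetchone()[0]
print(account_owner, account_fk)
```

The account row still points at person_id 1 after the rename, which is the reason the natural key can be allowed to change.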
Use an autogenerated integer primary key, and then put a unique constraint on anything that you believe should be unique. But SSNs are not unique in the real world so it would be a bad idea to put a uniqueness constraint on this column unless you think turning away customers because your database won't accept them is a good business model.
I prefer natural keys, but a table person is a lost case. SSNs are not unique and not everybody has one.
I'd recommend a surrogate key. Add all the indexes you need for other candidate keys, but keeping business logic out of the key is my recommendation.
I prefer natural keys, when they can be trusted.
Unless you are running a bank or something like that, there is no reason for your clients and users to provide you with a valid SSN, or even necessarily to have one. Thus, for business reasons, you are forced to distrust SSN in the case you outline. A similar argument would hold for any given natural key to "persons".
You have no choice but to assign an artificial (read "surrogate") key. It might as well be an integer. Make sure it's a big enough integer so you aren't going to need to expand it real soon.
To add to @Mark and @Pascal (autoincrement integers are your best bet) -- SSNs are useful and should be modelled correctly. Security concerns are part of application logic. You can normalize them into a separate table, and you can make them unique by providing a date-issued field.
p.s., to those who disagree with the 'security in application' point, an enterprise DB will have a granular ACL model; so this won't be a sticking point.

Surrogate vs. natural/business keys [closed]

Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
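The claim can be checked with a short sketch: a table with two keys, one labelled PRIMARY KEY and one declared NOT NULL UNIQUE, rejects duplicates on either. The department table and its values are hypothetical, and SQLite is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,     -- surrogate key
        dept_code TEXT NOT NULL UNIQUE     -- business key, equally enforced
    )
""")
conn.execute("INSERT INTO department (dept_id, dept_code) VALUES (1, 'HR')")


def is_rejected(dept_id, dept_code):
    """Return True if the insert violates one of the table's keys."""
    try:
        conn.execute(
            "INSERT INTO department (dept_id, dept_code) VALUES (?, ?)",
            (dept_id, dept_code),
        )
        return False
    except sqlite3.IntegrityError:
        return True


surrogate_duplicate_rejected = is_rejected(1, "IT")  # clashes on dept_id
business_duplicate_rejected = is_rejected(2, "HR")   # clashes on dept_code
print(surrogate_duplicate_rejected, business_duplicate_rejected)
```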
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departments d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
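For readers who want to try the join-free variant, here is a minimal sketch in which timesheets stores dept_code directly as a foreign key to a lookup table keyed by the code itself. Only the departments lookup is modelled, and the rows and hours are invented; SQLite is an assumption made to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE departments (dept_code TEXT PRIMARY KEY);
    CREATE TABLE timesheets (
        timesheet_id INTEGER PRIMARY KEY,
        dept_code    TEXT NOT NULL REFERENCES departments (dept_code),
        status       TEXT NOT NULL,
        project_code TEXT NOT NULL,
        task         TEXT NOT NULL,
        hours        REAL NOT NULL
    );
    INSERT INTO departments VALUES ('HR'), ('IT');
    INSERT INTO timesheets (dept_code, status, project_code, task, hours) VALUES
        ('HR', 'VALID', 'MYPROJECT', 'BUILD', 4.0),
        ('HR', 'VALID', 'MYPROJECT', 'BUILD', 3.5),
        ('IT', 'VALID', 'MYPROJECT', 'BUILD', 8.0);
""")

# The first query from the answer, unchanged: no joins needed.
total_hours = conn.execute("""
    select sum(t.hours)
    from timesheets t
    where t.dept_code = 'HR'
    and t.status = 'VALID'
    and t.project_code = 'MYPROJECT'
    and t.task = 'BUILD'
""").fetchone()[0]

# The lookup table still restricts the column to known codes.
try:
    conn.execute(
        "INSERT INTO timesheets (dept_code, status, project_code, task, hours) "
        "VALUES ('XX', 'VALID', 'MYPROJECT', 'BUILD', 1.0)"
    )
    unknown_code_rejected = False
except sqlite3.IntegrityError:
    unknown_code_rejected = True

print(total_hours, unknown_code_rejected)
```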
A surrogate key will NEVER have a reason to change. I cannot say the same about natural keys. Last names, emails, ISBN numbers - they all can change one day.
Surrogate keys (typically integers) have the added-value of making your table relations faster, and more economic in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, that do change now and then).
A table's primary key should be used for identifying uniquely the row, mainly for join purposes. Think a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company id are up to 16 characters. You need to enlarge the company id primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc), adding a Country field in the primary key (also in the foreign keys). Ouch! Civil war in Freedonia, it's split in three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
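The cascaded update mentioned in the story can be sketched as follows. The companies and orders tables and the ID values are hypothetical, and SQLite's ON UPDATE CASCADE stands in for whatever the actual engine provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE companies (
        company_id TEXT PRIMARY KEY,   -- government-issued ID used as the key
        name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        company_id TEXT NOT NULL
            REFERENCES companies (company_id) ON UPDATE CASCADE
    );
    INSERT INTO companies VALUES ('MK00000001', 'Acme');
    INSERT INTO orders VALUES (1, 'MK00000001'), (2, 'MK00000001');
""")

# The government reissues the ID; every referencing row is rewritten too.
conn.execute(
    "UPDATE companies SET company_id = 'MK99999999' WHERE company_id = 'MK00000001'"
)

order_company_ids = [
    row[0] for row in conn.execute("SELECT company_id FROM orders ORDER BY order_id")
]
print(order_company_ids)
```

The cascade keeps the data consistent, but one change in the parent table turns into writes in every child table, which is the cost the answer is pointing at.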
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one to many relationship chains, the logical key chains. So for example, Organizations have many Accounts and Accounts have many Invoices. So the logical-key of Organization is OrgName. The logical-key of Accounts is OrgName, AccountID. The logical-key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column. It only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could query the Invoice table directly.
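A minimal sketch of such a key chain, using the Organization, Account and Invoice names from the example. The sample rows are invented and SQLite is an assumption made for runnability. Because OrgName is carried down into Invoice, the invoices of one organization can be selected without any join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE Organization (OrgName TEXT PRIMARY KEY);
    CREATE TABLE Account (
        OrgName   TEXT    NOT NULL REFERENCES Organization (OrgName),
        AccountID INTEGER NOT NULL,
        PRIMARY KEY (OrgName, AccountID)
    );
    CREATE TABLE Invoice (
        OrgName       TEXT    NOT NULL,
        AccountID     INTEGER NOT NULL,
        InvoiceNumber TEXT    NOT NULL,
        PRIMARY KEY (OrgName, AccountID, InvoiceNumber),
        FOREIGN KEY (OrgName, AccountID) REFERENCES Account (OrgName, AccountID)
    );
    INSERT INTO Organization VALUES ('Acme'), ('Globex');
    INSERT INTO Account VALUES ('Acme', 1), ('Globex', 1);
    INSERT INTO Invoice VALUES
        ('Acme', 1, 'INV-1'), ('Acme', 1, 'INV-2'), ('Globex', 1, 'INV-1');
""")

# No join to Account or Organization is needed.
acme_invoices = [
    row[0]
    for row in conn.execute(
        "SELECT InvoiceNumber FROM Invoice WHERE OrgName = 'Acme' ORDER BY InvoiceNumber"
    )
]
print(acme_invoices)
```

The trade-off is wider keys in the child tables, plus cascading updates if an organization is ever renamed.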
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
I want to share my experience with you on this endless war :D on natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns which mean:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company at least) then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size then it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so - as a developer - you have to pay attention to the following points:
You must cycle your sequence (when the max-value is reached it goes back to 1,2,...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (the row with Id 1 might be newer than the row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32-bit/64-bit integers that you used to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine used:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Always use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However, 'Patterns of Enterprise Application Architecture' [Fowler] has a good explanation of why you shouldn't use anything other than a key with no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using UIDs for surrogate keys when suitable. The major win with them is that you know the key in advance, e.g. you can create an instance of a class with the ID already set and guaranteed to be unique, whereas with, say, an integer key you'll need to default to 0 or -1 and update to an appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though, so it depends on the application in question as to whether they're desirable.
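To illustrate the "key known in advance" point, here is a minimal sketch. It assumes that "UID" means a UUID/GUID and uses Python's uuid and sqlite3 modules; the Customer class and the customer table are hypothetical.

```python
import sqlite3
import uuid


class Customer:
    """Hypothetical domain object whose key exists before it is ever saved."""

    def __init__(self, name, customer_id=None):
        # A random UUID is generated client-side; no database round trip is needed.
        self.id = customer_id or uuid.uuid4()
        self.name = name


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

alice = Customer("Alice")
bob = Customer("Bob")
known_before_save = alice.id  # already usable, e.g. to build child objects

for customer in (alice, bob):
    conn.execute(
        "INSERT INTO customer (id, name) VALUES (?, ?)",
        (str(customer.id), customer.name),
    )

stored_id = conn.execute(
    "SELECT id FROM customer WHERE name = ?", ("Alice",)
).fetchone()[0]
```

The ID stored in the database is the one the object had from the start, so there is no placeholder value to patch up after the insert.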
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars: at first glance, you might think that the licence plate could be used as the key. But plates can be changed, so that'd be a bad idea. You wouldn't really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single-column surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique constraints or indexes. This will keep your data integrity intact.
Business logic/natural keys can change, but the physical key of a table should NEVER change.
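A minimal sketch of that combination, assuming Python's sqlite3 as a stand-in engine and borrowing the (vendor, invoice number) business key from the question; the table layout, the add_invoice helper, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice (
        invoice_id     INTEGER PRIMARY KEY,   -- surrogate key, used for joins
        vendor_id      INTEGER NOT NULL,
        invoice_number TEXT    NOT NULL,
        invoice_total  REAL    NOT NULL,
        UNIQUE (vendor_id, invoice_number)    -- business key, enforced separately
    )
""")


def add_invoice(vendor_id, invoice_number, total):
    """Insert an invoice; return False if a constraint rejects the row."""
    try:
        conn.execute(
            "INSERT INTO invoice (vendor_id, invoice_number, invoice_total) "
            "VALUES (?, ?, ?)",
            (vendor_id, invoice_number, total),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = add_invoice(1, "INV-1001", 250.0)
duplicate = add_invoice(1, "INV-1001", 250.0)    # same vendor, same number
other_vendor = add_invoice(2, "INV-1001", 99.0)  # same number, different vendor
row_count = conn.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]
```

Without the UNIQUE constraint the second insert would succeed with a fresh invoice_id, which is exactly the duplicate-data risk of relying on a surrogate key alone.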
Case 1: Your table is a lookup table with fewer than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key)   NAME         DESCRIPTION
PRG                  PROGRAMMER   A programmer is writing code
MNG                  MANAGER      A manager is doing whatever
CLN                  CLEANER      A cleaner cleans
...
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without use of join with table JOB, but just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
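Both cases can be put side by side in a small sketch. It assumes Python's sqlite3 as a stand-in engine; the table and column names follow the example above, while the task column and the sample people are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Case 1: small lookup table with hand-picked, meaningful codes.
CREATE TABLE job (
    code        TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT
);
-- Case 2: large tables with database-generated integer keys.
CREATE TABLE people (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    name    TEXT NOT NULL,
    jobcode TEXT NOT NULL REFERENCES job (code)
);
CREATE TABLE assignment (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    peopleid INTEGER NOT NULL REFERENCES people (id),
    task     TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [
        ("PRG", "PROGRAMMER", "A programmer is writing code"),
        ("MNG", "MANAGER", "A manager is doing whatever"),
        ("CLN", "CLEANER", "A cleaner cleans"),
    ],
)

# The database hands out the next key; the caller never computes it.
ada_id = conn.execute(
    "INSERT INTO people (name, jobcode) VALUES (?, ?)", ("Ada", "PRG")
).lastrowid
conn.execute("INSERT INTO people (name, jobcode) VALUES (?, ?)", ("Linus", "PRG"))
conn.execute("INSERT INTO people (name, jobcode) VALUES (?, ?)", ("Grace", "MNG"))
conn.execute(
    "INSERT INTO assignment (peopleid, task) VALUES (?, ?)", (ada_id, "Write parser")
)

# Case 1 benefit: the readable code filters PEOPLE without joining to JOB.
programmers = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM people WHERE jobcode = 'PRG' ORDER BY name"
    )
]
```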
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
In a data warehouse scenario, I believe it is better to follow the surrogate key path, for two reasons:
You are independent of the source system, and changes there --such as a data type change-- won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder, it is not good practice to place clustered indexes on random surrogate keys, e.g. GUIDs that read like XY8D7-DFD8S: because the values arrive in random order, SQL Server cannot simply append new rows at the end of the index and has to insert them throughout it, which leads to page splits and fragmentation. You should instead place nonclustered unique indexes on these columns, though it may also be beneficial to run SQL Profiler for the main table operations and then feed the captured workload into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of a point-in-time database it is best to have a combination of surrogate and natural keys. For example, say you need to track member information for a club. Some attributes of a member never change, e.g. date of birth, but the name can change.
So create a Member table with a member_id surrogate key and a column for DOB.
Create another table called person_name with columns for member_id, member_fname, member_lname, and date_updated. In this table the natural key would be member_id + date_updated.
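A rough sketch of this design, assuming Python's sqlite3 as a stand-in engine and ISO-formatted date strings; the name_as_of helper and the sample member are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY,   -- surrogate key
    dob       TEXT NOT NULL          -- never changes
);
CREATE TABLE person_name (
    member_id    INTEGER NOT NULL REFERENCES member (member_id),
    member_fname TEXT NOT NULL,
    member_lname TEXT NOT NULL,
    date_updated TEXT NOT NULL,      -- ISO date, so string order equals date order
    PRIMARY KEY (member_id, date_updated)
);
""")

conn.execute("INSERT INTO member VALUES (1, '1980-05-17')")
conn.executemany(
    "INSERT INTO person_name VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", "Doe", "2001-03-01"),
        (1, "Jane", "Smith", "2010-09-15"),  # name changed, history is kept
    ],
)


def name_as_of(member_id, as_of_date):
    """Return (first, last) name valid on as_of_date, or None if none is recorded yet."""
    return conn.execute(
        """SELECT member_fname, member_lname
           FROM person_name
           WHERE member_id = ? AND date_updated <= ?
           ORDER BY date_updated DESC
           LIMIT 1""",
        (member_id, as_of_date),
    ).fetchone()


earlier = name_as_of(1, "2005-01-01")
later = name_as_of(1, "2020-01-01")
```

Every other table keeps pointing at member_id, so a name change adds a row to person_name without touching any foreign key.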
Horses for courses. To state my bias: I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins didn't affect the application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read, and longer to type and check: see Tony Andrews' post. But if you use an ORM or any other SQL-generation framework you won't notice it. Touch-typing also helps.
Maybe not completely relevant to this topic, but here is a headache I have dealing with surrogate keys. Oracle's pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, anytime the dimensions need to be reloaded because new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update end up out of sync with the original values stored on the facts, forcing a complete reload of all fact tables that join to them. I would prefer that, even if the SK is a meaningless number, there were some way to keep it from changing for original/old records. As many know, out-of-the-box rarely serves an organization's needs, and we have to customize constantly. We now have three years' worth of data in our warehouse, and complete reloads from the Oracle Financials systems are very large. So in my case, the SKs are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.
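One generic way to get that stability is to look rows up by their business key on reload and only update the other columns, so existing SKs are reused. The sketch below assumes Python's sqlite3 as a stand-in engine; it does not describe how the Oracle product works, and dim_item, fact_sales, item_code, and load_dimension are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (
    item_sk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key stored on facts
    item_code TEXT NOT NULL UNIQUE,               -- business key from the source system
    item_name TEXT,
    category  TEXT                                -- column populated in a later reload
);
CREATE TABLE fact_sales (
    item_sk INTEGER NOT NULL REFERENCES dim_item (item_sk),
    amount  REAL NOT NULL
);
""")


def load_dimension(rows):
    """(Re)load the dimension without reassigning existing surrogate keys."""
    for code, name, category in rows:
        # Creates a row (and a new SK) only when the business key is unseen.
        conn.execute("INSERT OR IGNORE INTO dim_item (item_code) VALUES (?)", (code,))
        # Attributes are refreshed in place; item_sk is never touched.
        conn.execute(
            "UPDATE dim_item SET item_name = ?, category = ? WHERE item_code = ?",
            (name, category, code),
        )


def sk_for(code):
    return conn.execute(
        "SELECT item_sk FROM dim_item WHERE item_code = ?", (code,)
    ).fetchone()[0]


load_dimension([("A-100", "Widget", None), ("B-200", "Gadget", None)])
sk_before = sk_for("A-100")
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (sk_before, 19.99))

# Full reload with the new column filled in, plus one brand-new item.
load_dimension([
    ("A-100", "Widget", "Hardware"),
    ("B-200", "Gadget", "Hardware"),
    ("C-300", "Gizmo", "Toys"),
])
sk_after = sk_for("A-100")

joined = conn.execute(
    "SELECT d.category, f.amount FROM fact_sales f JOIN dim_item d USING (item_sk)"
).fetchall()
```

Because the fact row still joins to the same dimension row after the reload, the fact table does not need to be rebuilt.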

Resources