I would like to know whether it is more commonplace to use table_id or just id. (In my opinion, using table_ would cause slight confusion as to whether it is a foreign key.) Which do people prefer, and is there really any difference between the two? Or should it just be left up to picking one and being consistent?
There are two main currents in terms of naming columns in tables:
Schema Namespace
This strategy is the traditional strategy that was conceived by teams documenting the "data dictionary" of a database in the 70s. The idea is that the name itself of the column tells you which table it belongs to across the whole schema or database. For example, CLIENT_NAME would represent the name of the client in the CLIENT table.
There are variations of this strategy where a limited number of letters are assigned as prefixes (especially for M:N relationship tables) because at the time column names were limited to 6 or 8 characters in many databases. For example, the date of purchase of a car by a client could take the form CLI_CAR_DATE, CLICAR_DATE, or even CLCADT.
Examples:
A primary key "id" column of the entity table "car" would be named CAR_ID.
A foreign key on a child table "document" that points to "car" would take the same form: CAR_ID. This allows the use of natural joins; however, it should be pointed out that there are compelling reasons, not discussed here, to avoid natural joins at all costs.
Foreign keys on a table "transfer" that has two relationships (seller and buyer) with "person" pollute this strategy. They could be named PERSON_BUYER_ID and PERSON_SELLER_ID, because both cannot have the same name PERSON_ID; this no longer allows natural joins (which is good).
Table Namespace
In this strategy (that is newer) column names do not include the name of the entity they belong to, but only their property name. This strategy aligns more with object design, and produces shorter names (i.e. less typing). The name of the table must be indicated when mentioning a column. For example, you would need to say the column NAME on the table CLIENT.
Examples:
A primary key "id" column of the entity table "car" would be named ID.
A foreign key on a child table "document" that points to "car" would take the form: CAR_ID; this is the same solution as the previous strategy.
Foreign keys on a table "transfer" that has multiple (two) relationships (seller and buyer) with "person" could be named: BUYER_ID and SELLER_ID. They could follow the longer names as the previous strategy, but the goal here is typically to have shorter names so the app source code gets easier to write and to debug.
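To make the second strategy concrete, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the "name" column and the sample rows are my own assumptions for illustration; only the table names and the ID, BUYER_ID and SELLER_ID columns come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table-namespace naming: the primary key is just "id".
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL          -- hypothetical column
    );
    -- Two foreign keys to the same parent are named after their role.
    CREATE TABLE transfer (
        id        INTEGER PRIMARY KEY,
        buyer_id  INTEGER NOT NULL REFERENCES person(id),
        seller_id INTEGER NOT NULL REFERENCES person(id)
    );
    INSERT INTO person (id, name) VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO transfer (id, buyer_id, seller_id) VALUES (10, 1, 2);
""")

# The table alias supplies the "namespace" that the column name omits.
transfer_rows = conn.execute("""
    SELECT buyer.name, seller.name
    FROM transfer
    JOIN person AS buyer  ON buyer.id  = transfer.buyer_id
    JOIN person AS seller ON seller.id = transfer.seller_id
""").fetchall()
```

Note that the join conditions have to be spelled out, which is the usual practice anyway once natural joins are off the table.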
Summary
I personally like the second one, but there are teams who adhere to each strategy and there's no clear winner. I lean towards the second one because [I think] the first one suffers from longer names (more typing), longer SQL (more errors), cryptic names (they don't play well with ORMs and app objects), and foreign keys that cannot follow the strategy well. In fact, virtually all the primary keys in my databases are named ID regardless of the specific entities.
But on the flip side, some teams value very highly the idea of knowing the table name of a column by just looking at it. And this is great for big databases (with 200-1000 relational fact tables) that can become quite complex, especially for new members of a team.
But above all, pick one and be consistent.
I was thinking about this problem. In database design, surrogate keys are used most of the time, but how do you prevent data duplication and inconsistent data? I mean, one could have a customer table made of customer_id, name, surname. What would prevent me from inserting the same customer twice with a different customer_id? Sure, I could add a unique index on name and surname, but if one does so, then what's the purpose of the surrogate primary key?
You're asking a business question, not a technical one.
"How do I know whether two people with the same name are the same person or not?"
Well, typically customers are not identified by a name alone; there is also one of:
An account number
An email address
A postal address
A credit card number
A passport number
A date of birth
... etc.
The name is simply not a uniquely identifying characteristic, it's just an attribute of a customer that is probably non-unique, so you need something else to help identify them. Within the database this is the primary key of the customer table, but for business purposes it could be any number of attributes.
If there is a natural key, you cannot replace it with a surrogate key. You can only add the surrogate without removing the natural. This has pros and cons, as I described here.
Unfortunately, there is no good natural key in the case you described, since two different human beings can easily have the same combination of first and last name. Therefore, you'll have to come up with some additional attributes that represent better criteria for judging whether two people are "identical" or not, and then create the corresponding natural key. Discovering such criteria is part of the requirement gathering and therefore impossible for me to do without knowing more about your domain.
If you are unable to identify such natural key, then you can just leave customer_id alone. That means you made a decision to make it acceptable for two people to be identical in every aspect (except in their customer_id) and still be considered "different". Arguably, such customer_id may no longer be called "surrogate", since its value now has a meaning in your data model, is potentially visible in the UI etc.
What you have said is perfectly logical and correct. Surrogate keys are not any kind of substitute for a natural key (AKA business key or domain key, i.e. the set of attributes used to identify information in the database and relate it to the real world things the database is supposed to model). If you care about data integrity then natural keys are essential, whereas surrogates by definition are optional and supplemental. Add surrogate keys only when and where you find they have a useful benefit.
The only purpose of the id (or "surrogate key" as you call it) is to uniquely identify a record.
First, say you will use name as a key. What will you do if:
the customer changes their name (in some countries women change their surname to their husband's);
you make a typo in a customer's name and have to correct it afterwards?
Then you are in big trouble because, although you can change it, an id should never be changed!
Otherwise, you can make a big mess not only in your database (consistency across backups, logs etc.), but also in all the external sources referring to it.
Second, how do you know you won't get two customers with the same name?
You cannot stop people from describing the world wrongly in the database. You can only stop them from describing the world wrongly in the database if the way they described it can't ever happen.
When there is no previous "natural" identifying property used in the business outside the database being stored in the database then we have to pick a "surrogate" distinguishing identifier after the system starts. (Some people wouldn't use "natural" for such an identifier picked after the system starts even though it is used in the business outside the database. And some people wouldn't use "surrogate" for such a distinguishing identifier used in the business system outside the database.)
I never use weak entities when I'm doing database modelling and things have seemed fine so far. I usually ignore the whole issue by giving each entity a primary (auto-generated) key.
However, I came across some posts that mention that some entities should be weak if their existence totally depends on other entities.
But on the other hand, some refer to weak entities as a set which does not possess sufficient attributes to form a primary key. Well, that means all entities in my database were weak at first, before I gave them the auto-incremented key.
Could someone please outline the importance of weak entities and what are the consequences of not using them? Why don't we just give each entity a primary auto generated key and make it strong?
UPDATE:
Maybe someone can explain why weak entities should be identified by the primary key of the parent entity + an identifier instead of creating a surrogate key and relating it to the parent entity using a foreign key (with cascading changes on update and delete)?
Take an order with multiple order line items as an example. The weak entities would be the individual line items stored in their own table. Their primary key could be the primary key of the order, plus a simple integer number (e.g. 1, 2, 3, which is unique only within the order.) Thus, they don't really have their own primary key as a unique numbered column, their key spans two columns and is only unique that way.
The order line items should be deleted if and when the order is deleted - they don't make sense standing on their own. It is this linkage that makes them weak -- one thing being deleted should delete the other.
If you give each order line item its own primary key, you'll still need to relate it back to the order item, which means putting in a foreign key for the order item or having a cross-reference table. (You may also need to know the line item number from the order, which would mean adding a simple integer column... and at this point you've added enough to have a key without an auto-generated one.) For the design pattern of owned sub-items, either of these alternatives is a bit of overkill.
Using the composite primary key also enforces the relationship between orders and order line items, in that this schema will not allow a line item to be assigned to multiple orders.
Another consideration is that you can shard the orders and order line items according to the order item primary key, since both tables have that key. (Sharding is generally easier to do based on the primary key than regular columns.)
Hierarchical containment isn’t always what you want, but it is such a commonly occurring pattern that it is nice to be clear about it, and composite keys can be used in this case. Here, using order items with line items as sub-items (i.e. contained), we’re saying not just that line items are 1 to many with respect to an order, but that line items are owned and don’t exist independently of orders: line items compose to create a single order object.
In keeping with that, we’re explicitly not going to manage a separate key space for (all) line items (together as a group), but instead borrow and extend the key space of an order. Instead of asking the system to maintain a separate key space for line items, and manually (i.e. less formally) maintaining a foreign key relation back to the order, and also maintaining an integer line item rather separately (from the order foreign reference), we can ask the system to ensure uniqueness of the whole composite key, which includes the line item number within the order.
Of course, you wouldn’t be able to add a line item that isn’t associated with an order, but additionally, using the composite sub-key, you also won’t be able to add one that overlaps with another (e.g. it won’t let you add two line item #3’s for the same order).
This forces producers and consumers of line items to think about them as being contained within and part of orders, and not as independent items, or, put another way, to reference a line item by going thru an order, or, yet in other words, to get a reference to the order “for free” by referencing one of its line items. (And because you also have a reference to the order as part of such a foreign key, you can also use that order portion of the composite foreign key alone to group or join.)
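Here is a small sketch of the order / line item pattern described above, using Python's built-in sqlite3 module. The table and column names (orders, order_lines, line_no, product) are hypothetical, and I'm assuming SQLite, where foreign key enforcement has to be switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY
    );
    -- Weak entity: its key borrows and extends the key of the order.
    CREATE TABLE order_lines (
        order_id INTEGER NOT NULL
            REFERENCES orders(order_id) ON DELETE CASCADE,
        line_no  INTEGER NOT NULL,
        product  TEXT NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );
    INSERT INTO orders (order_id) VALUES (1);
    INSERT INTO order_lines VALUES (1, 1, 'widget'), (1, 2, 'gadget');
""")

def try_insert(params):
    """Return True if the line item was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO order_lines VALUES (?, ?, ?)", params)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert((1, 2, "gizmo"))   # second line #2 for order 1
orphan_ok = try_insert((99, 1, "gizmo"))     # line item for a non-existent order

# Deleting the order removes its line items as well.
with conn:
    conn.execute("DELETE FROM orders WHERE order_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
```

Both rejected inserts are refused by the database itself, with no application-side checking needed, which is the point the answers above are making.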
I recently worked on a project that had to manage large amounts of data samples for lake readings. In this project, we had tables similar to the following, where records is a collection of lake readings by location and uploader, and samples contain the actual lake readings -- things like temperature and intensity.
CREATE TABLE records(
    email TEXT REFERENCES users(email),
    lat DECIMAL,
    lon DECIMAL,
    depth TEXT,
    upload_date TIMESTAMP,
    comment TEXT,
    PRIMARY KEY (upload_date, email)
);

CREATE TABLE samples(
    date_taken TIMESTAMP,
    temp DECIMAL,
    intensity DECIMAL,
    upload_date TIMESTAMP,
    email TEXT,
    PRIMARY KEY (date_taken, upload_date, email),
    FOREIGN KEY (upload_date, email) REFERENCES records(upload_date, email)
);
samples was modeled as a weak entity, dependent on records. As you know, this means that the key columns of records are inherited as a foreign key and used, together with date_taken, to identify a single row in samples. But what would happen if we decided to make it a regular entity instead? Well, you can look at it in a couple of different ways. Either:
1. The primary key from records would not be present in samples and we would have to assign some kind of arbitrary auto-increment ID, as you suggest. Each record contains thousands of samples, and users think of samples as part of the records that they recorded in the field. They expect to browse samples by record, so we would have a very large samples table with no obvious mapping to the records they belong to in real life.
2. Or we simply don't model it as a weak entity, but recognize that it needs to be able to identify itself with a records row, so we assign an upload_date and email. If we make these two entries foreign keys, then we have just made a weak entity without realizing it. If we don't, then our application layer has to be responsible for checking that each upload_date and email are also present in records, instead of the database doing it.
In this case, making samples a weak entity (including foreign keys in its primary key) is the simplest option (and makes the most sense).
Summary
You should model entities as weak when they are actually weak in real life. If you have an entity that needs a portion of a different key to identify itself (having a foreign key that is part of its primary key), then it's probably weak.
Can you remodel the system to avoid using weak entities? Possibly. If we wanted to have unassociated samples, then we would need to be able to make their upload_date and email null, which means they would not be in the primary key and samples would not be a weak entity. We would have to do something like what I described in 1.
The primary key must be unique. Forever. That's all there is to it. If the data in the table doesn't provide that naturally you'd create a surrogate key.
Now what are those. A natural key consists of one or more existing columns, whereas a surrogate key is an extra added column, usually auto-incremental.
A good example for a natural key would be an ISO country code in a countries table. You'd gain nothing from adding an auto-increment column here. On the contrary, you may save yourself from JOINing in the countries table in some queries, because you already have the ISO code right there.
A bad one, the name (or multiple columns) in a contacts table. That's why it's better to use a surrogate key in this case.
That's how I think about it and I rarely, if ever, run into any kind of questionable layout issues.
A practical hint: you never run an UPDATE on columns making up the primary key. You'd delete that row and re-insert it with new values. That can save you a lot of headaches.
I have doubts about this design (er_lp). My first doubt is how to create many-to-many relationships between entities with composite keys; my second is about using a date type as a PK. Here each machine works daily for three shifts, for different userDepts, on one or more fields. To keep a record of the working and down hours of the machinery, I have used shift, taskDay and machinePlate as PKs. As you will see from the ER diagram, I ended up with too many PKs in the link tables in many places. I am worried about getting into trouble in the coding phase.
Is there a better way to do this?
Thank you !!
Dejene
See also extra information posted as a second question Entity Relationship. The material, reformatted, is:
Elaboration: Yes, 'Field ' is referring to areas of land. We have several cane growing fields at different location. It [each field?] is named and has budget.
User is not referring to individual who are working on the machine. They are departments. I used 'isDone' table to link userDept with machine. A machine can be used by several departments and many machines can work for a userDept.
A particular machine can be used for multiple tasks on a given shift. It can work for say 2 hours and can start another task on another field. We have three shifts per day, each of 8 hrs!
If I use Auto increment PK, do you think that other key are important? I don't prefer to use it!
Usually, I use auto increment key alone in a table. How can we create relationship that involves auto increment keys?
Thank you for thoughtful comment!!
You always create many-to-many relationships between two tables using a third table, the rows of which contain the columns for the primary key of each table, and the combination of all columns is the primary key of the third table. The rule doesn't change for tables with composite primary keys.
CREATE TABLE Table1(Col11 ..., Col12 ..., Col1N ...,
PRIMARY KEY(Col11, Col12));
CREATE TABLE Table2(Col21 ..., Col22 ..., Col2N ...,
PRIMARY KEY(Col21, Col22));
CREATE TABLE RelationTable
(
Col11 ...,
Col12 ...,
FOREIGN KEY (Col11, Col12) REFERENCES Table1,
Col21 ...,
Col22 ...,
FOREIGN KEY (Col21, Col22) REFERENCES Table2,
PRIMARY KEY (Col11, Col12, Col21, Col22)
);
This works fine. It does suggest that you should try and keep keys simple whenever possible, but there is absolutely no need to bend over backwards adding auto-increment columns to the referenced tables if they have a natural composite key that is convenient to use. OTOH, the joins involving the relation table are harder to write if you use composite keys - I'd think several times about what I'm about if either composite key involved more than two columns, not least because it might indicate problems in the design of the referenced tables.
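For anyone who wants to try the pattern, here is a runnable sketch of the same three tables using Python's built-in sqlite3 module. The column types and sample rows are placeholders of my own (the "..." in the outline above is left for you to fill in), and SQLite is an assumption; the join at the end shows the extra condition a composite key costs you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE table1 (
        col11 INTEGER NOT NULL,   -- placeholder types
        col12 TEXT    NOT NULL,
        PRIMARY KEY (col11, col12)
    );
    CREATE TABLE table2 (
        col21 INTEGER NOT NULL,
        col22 TEXT    NOT NULL,
        PRIMARY KEY (col21, col22)
    );
    CREATE TABLE relation_table (
        col11 INTEGER NOT NULL,
        col12 TEXT    NOT NULL,
        col21 INTEGER NOT NULL,
        col22 TEXT    NOT NULL,
        PRIMARY KEY (col11, col12, col21, col22),
        FOREIGN KEY (col11, col12) REFERENCES table1 (col11, col12),
        FOREIGN KEY (col21, col22) REFERENCES table2 (col21, col22)
    );
    INSERT INTO table1 VALUES (1, 'a'), (1, 'b');
    INSERT INTO table2 VALUES (7, 'x'), (8, 'y');
    INSERT INTO relation_table VALUES
        (1, 'a', 7, 'x'),
        (1, 'a', 8, 'y'),
        (1, 'b', 7, 'x');
""")

# Every join on a composite key needs one condition per key column.
partners_of_1a = conn.execute("""
    SELECT t2.col21, t2.col22
    FROM relation_table AS r
    JOIN table2 AS t2 ON t2.col21 = r.col21 AND t2.col22 = r.col22
    WHERE r.col11 = ? AND r.col12 = ?
    ORDER BY t2.col21
""", (1, "a")).fetchall()
```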
Looking at the actual ER diagram - the 'er_lp' URL in the question - the 'tbl' prefix seems a trifle unnecessary; the things storing data in a database are always tables, so telling me that with the prefix is ... unnecessary. The table called 'Machine' seems to be misnamed; it does not so much describe a machine as the duty allocated to a machine on a particular shift. I'm guessing that the 'Field' table is referring to areas of land, rather than parts of a database. You have the 'IsDone' table (again, not particularly well named) that identifies the user who worked on a machine for a particular shift and hence for a particular task. That involves a link between the Machine table (which has a 3-part primary key) and the User table. It isn't clear whether a particular machine can be used for multiple tasks on a given shift. It isn't clear whether shift numbers cycle around the day or whether each shift number is unique across days, but the presumption must be that there are, say, three shifts per day, and the shift number and date is needed to identify when something occurred. Presumably, the Shift table would identify times and other such information.
The three-part primary key on Machine is fine - but it might be better to have two unique identifiers. One would be the current primary key combination; the other would be an automatically assigned number - auto-increment, serial, sequence or whatever...
Addressing the extended information.
It is not clear to me any more what you are seeking to track. If the 'Machine' table is supposed to track what a given machine was being used for, then you probably need to do some more structuring of the data. Given that a machine can be used for different tasks on different fields during a single shift, you should think, perhaps, in terms of a MachineTasks table which would identify the (date and) time when the operation started and finished and the type of operation. For repair operations, you'd store the information in a table describing repairs; for routine operations in a field, you might not need much extra information. Or maybe that is overkill.
I'm not clear whether particular tasks are performed on behalf of multiple departments, or whether you are simply trying to note that during a single shift a machine might be used by multiple departments, but one department at a time for each task. If each task is for a separate department, then simply include the department info in the main MachineTasks table as a foreign key field.
If you decide on an auto-increment key, you still need to maintain the uniqueness of the composite key. This is the biggest mistake I see people making with auto-increment fields. It isn't quite as simple as "a table with an auto-increment key must also have a second unique constraint on it", but it isn't too far off the mark.
When you use an auto-increment key, you need to retrieve the value assigned when you insert a record into the table; you then use that value in the foreign key columns when you insert other records into the other tables.
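A minimal sketch of the last two points, using Python's built-in sqlite3 module. The tables machine_duty and machine_task and their columns are hypothetical names loosely based on the question; the UNIQUE constraint keeps the original composite key honest, and cursor.lastrowid is how this particular driver hands back the generated value (other databases and drivers have their own mechanisms).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE machine_duty (
        duty_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        machine_plate TEXT    NOT NULL,
        task_day      TEXT    NOT NULL,
        shift         INTEGER NOT NULL,
        -- The composite key still has to be unique.
        UNIQUE (machine_plate, task_day, shift)
    );
    CREATE TABLE machine_task (
        task_id  INTEGER PRIMARY KEY AUTOINCREMENT,
        duty_id  INTEGER NOT NULL REFERENCES machine_duty(duty_id),
        field_no INTEGER NOT NULL
    );
""")

insert_duty = (
    "INSERT INTO machine_duty (machine_plate, task_day, shift) VALUES (?, ?, ?)"
)

with conn:
    cur = conn.execute(insert_duty, ("ABC-123", "2024-01-15", 1))
    duty_id = cur.lastrowid  # value assigned by the database
    # Use the retrieved value in the child table's foreign key column.
    conn.execute(
        "INSERT INTO machine_task (duty_id, field_no) VALUES (?, ?)",
        (duty_id, 42),
    )

# Inserting the same machine/day/shift again is rejected despite the new id.
try:
    with conn:
        conn.execute(insert_duty, ("ABC-123", "2024-01-15", 1))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

task_count = conn.execute("SELECT COUNT(*) FROM machine_task").fetchone()[0]
```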
You need to read up on database design - I'm not sure what the current good books are as I did most of my learning a decade and more ago, and my books are consequently less likely to be available still.
One good way of not getting into trouble with primary keys is to have a single field for primary key. Usually a numeric (auto incremental) column is just fine. You can still have unique keys with multiple columns.
tblWorksOn
tblMachine
tblIsDone
...seem to be the problem tables.
It looks like you could use taskDate for the tblMachine table as the primary key. The rest can be foreign keys.
With the changes to the tblMachine table you can then use the taskDate with the fieldNo for the tblWorksOn table and the taskDate with the userID for the tblIsDone. Use these two fields to create Composite Keys (CK)
e.g.
tblMachine
taskDate (PK)
tblWorksOn
fieldNo (CK)
taskDate (CK)
tblIsDone
userID (CK)
taskDate (CK)
Here we go again, the old argument still arises...
Would we better have a business key as a primary key, or would we rather have a surrogate id (i.e. an SQL Server identity) with a unique constraint on the business key field?
Please, provide examples or proof to support your theory.
Just a few reasons for using surrogate keys:
Stability: Changing a key because of a business or natural need will negatively affect related tables. Surrogate keys rarely, if ever, need to be changed because there is no meaning tied to the value.
Convention: Allows you to have a standardized Primary Key column naming convention rather than having to think about how to join tables with various names for their PKs.
Speed: Depending on the PK value and type, a surrogate key of an integer may be smaller, faster to index and search.
Both. Have your cake and eat it.
Remember there is nothing special about a primary key, except that it is labelled as such. It is nothing more than a NOT NULL UNIQUE constraint, and a table can have more than one.
If you use a surrogate key, you still want a business key to ensure uniqueness according to the business rules.
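A quick sketch of why the business key still matters, using Python's built-in sqlite3 module. The two customer tables and the use of email as the business key are assumptions for illustration only; what counts as the business key depends on your domain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Surrogate key only: nothing stops logical duplicates.
    CREATE TABLE customer_loose (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL
    );
    -- Surrogate key plus a unique constraint on the business key.
    CREATE TABLE customer_strict (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    );
""")

def insert_twice(table):
    """Try to insert the same customer twice; return how many rows got in."""
    inserted = 0
    for _ in range(2):
        try:
            with conn:
                conn.execute(
                    f"INSERT INTO {table} (email) VALUES (?)",
                    ("ann@example.com",),
                )
            inserted += 1
        except sqlite3.IntegrityError:
            pass  # rejected by the UNIQUE constraint
    return inserted

loose_count = insert_twice("customer_loose")
strict_count = insert_twice("customer_strict")
```

The first table ends up with the same customer under two different ids, while the second accepts only one row.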
It appears that no one has yet said anything in support of non-surrogate (I hesitate to say "natural") keys. So here goes...
A disadvantage of surrogate keys is that they are meaningless (cited as an advantage by some, but...). This sometimes forces you to join a lot more tables into your query than should really be necessary. Compare:
select sum(t.hours)
from timesheets t
where t.dept_code = 'HR'
and t.status = 'VALID'
and t.project_code = 'MYPROJECT'
and t.task = 'BUILD';
against:
select sum(t.hours)
from timesheets t
join departments d on d.dept_id = t.dept_id
join timesheet_statuses s on s.status_id = t.status_id
join projects p on p.project_id = t.project_id
join tasks k on k.task_id = t.task_id
where d.dept_code = 'HR'
and s.status = 'VALID'
and p.project_code = 'MYPROJECT'
and k.task_code = 'BUILD';
Unless anyone seriously thinks the following is a good idea?:
select sum(t.hours)
from timesheets t
where t.dept_id = 34394
and t.status_id = 89
and t.project_id = 1253
and t.task_id = 77;
"But" someone will say, "what happens when the code for MYPROJECT or VALID or HR changes?" To which my answer would be: "why would you need to change it?" These aren't "natural" keys in the sense that some outside body is going to legislate that henceforth 'VALID' should be re-coded as 'GOOD'. Only a small percentage of "natural" keys really fall into that category - SSN and Zip code being the usual examples. I would definitely use a meaningless numeric key for tables like Person, Address - but not for everything, which for some reason most people here seem to advocate.
See also: my answer to another question
A surrogate key will NEVER have a reason to change. I cannot say the same about natural keys. Last names, emails, ISBN numbers - they all can change one day.
Surrogate keys (typically integers) have the added-value of making your table relations faster, and more economic in storage and update speed (even better, foreign keys do not need to be updated when using surrogate keys, in contrast with business key fields, that do change now and then).
A table's primary key should be used for uniquely identifying the row, mainly for join purposes. Think of a Persons table: names can change, and they're not guaranteed unique.
Think Companies: you're a happy Merkin company doing business with other companies in Merkia. You are clever enough not to use the company name as the primary key, so you use Merkia's government's unique company ID in its entirety of 10 alphanumeric characters.
Then Merkia changes the company IDs because they thought it would be a good idea. It's ok, you use your db engine's cascaded updates feature, for a change that shouldn't involve you in the first place. Later on, your business expands, and now you work with a company in Freedonia. Freedonian company IDs are up to 16 characters. You need to enlarge the company ID primary key (also the foreign key fields in Orders, Issues, MoneyTransfers etc.), adding a Country field to the primary key (also to the foreign keys). Ouch! Civil war in Freedonia, and it's split into three countries. The country name of your associate should be changed to the new one; cascaded updates to the rescue. BTW, what's your primary key? (Country, CompanyID) or (CompanyID, Country)? The latter helps joins, the former avoids another index (or perhaps many, should you want your Orders grouped by country too).
All these are not proof, but an indication that a surrogate key to uniquely identify a row for all uses, including join operations, is preferable to a business key.
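For reference, the "cascaded updates" mentioned in the story look roughly like this. It's a sketch with Python's built-in sqlite3 module; the companies and orders tables and the made-up ID values are hypothetical, and SQLite is only an assumption (the ON UPDATE CASCADE clause itself is standard SQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
conn.executescript("""
    -- Business key (government-issued company ID) used as primary key.
    CREATE TABLE companies (
        company_id TEXT PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        company_id TEXT NOT NULL
            REFERENCES companies(company_id) ON UPDATE CASCADE
    );
    INSERT INTO companies VALUES ('MK00000001', 'Acme');
    INSERT INTO orders VALUES (1, 'MK00000001'), (2, 'MK00000001');
""")

# The government re-issues the ID: every referencing row must be rewritten.
with conn:
    conn.execute(
        "UPDATE companies SET company_id = ? WHERE company_id = ?",
        ("MK-NEW-0001", "MK00000001"),
    )

order_company_ids = [
    row[0]
    for row in conn.execute("SELECT company_id FROM orders ORDER BY order_id")
]
```

It works, but every child row gets touched by a change that has nothing to do with your own data, which is the cost the story is pointing at.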
I hate surrogate keys in general. They should only be used when there is no quality natural key available. It is rather absurd when you think about it, to think that adding meaningless data to your table could make things better.
Here are my reasons:
When using natural keys, tables are clustered in the way that they are most often searched thus making queries faster.
When using surrogate keys you must add unique indexes on logical key columns. You still need to prevent logical duplicate data. For example, you can’t allow two Organizations with the same name in your Organization table even though the pk is a surrogate id column.
When surrogate keys are used as the primary key it is much less clear what the natural primary keys are. When developing you want to know what set of columns make the table unique.
In one to many relationship chains, the logical key chains. So for example, Organizations have many Accounts and Accounts have many Invoices. So the logical-key of Organization is OrgName. The logical-key of Accounts is OrgName, AccountID. The logical-key of Invoice is OrgName, AccountID, InvoiceNumber.
When surrogate keys are used, the key chains are truncated by only having a foreign key to the immediate parent. For example, the Invoice table does not have an OrgName column. It only has a column for the AccountID. If you want to search for invoices for a given organization, then you will need to join the Organization, Account, and Invoice tables. If you use logical keys, then you could query the Invoice table directly.
Storing surrogate key values of lookup tables causes tables to be filled with meaningless integers. To view the data, complex views must be created that join to all of the lookup tables. A lookup table is meant to hold a set of acceptable values for a column. It should not be codified by storing an integer surrogate key instead. There is nothing in the normalization rules that suggest that you should store a surrogate integer instead of the value itself.
I have three different database books. Not one of them shows using surrogate keys.
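To illustrate the key-chain point from the list above, here is a rough sketch with Python's built-in sqlite3 module. The table layout follows the Organization / Account / Invoice example; the amount_cents column and the sample data are my own additions, and SQLite is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE organization (
        org_name TEXT PRIMARY KEY
    );
    CREATE TABLE account (
        org_name   TEXT    NOT NULL REFERENCES organization(org_name),
        account_id INTEGER NOT NULL,
        PRIMARY KEY (org_name, account_id)
    );
    -- The logical key chains: the invoice carries the whole parent key.
    CREATE TABLE invoice (
        org_name       TEXT    NOT NULL,
        account_id     INTEGER NOT NULL,
        invoice_number INTEGER NOT NULL,
        amount_cents   INTEGER NOT NULL,   -- hypothetical column
        PRIMARY KEY (org_name, account_id, invoice_number),
        FOREIGN KEY (org_name, account_id)
            REFERENCES account (org_name, account_id)
    );
    INSERT INTO organization VALUES ('Acme'), ('Globex');
    INSERT INTO account VALUES ('Acme', 1), ('Acme', 2), ('Globex', 1);
    INSERT INTO invoice VALUES
        ('Acme',   1, 100, 5000),
        ('Acme',   2, 101, 2500),
        ('Globex', 1, 100, 9900);
""")

# No joins needed: the organization is already part of the invoice key.
acme_total = conn.execute(
    "SELECT SUM(amount_cents) FROM invoice WHERE org_name = ?", ("Acme",)
).fetchone()[0]
```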
I want to share my experience with you on this endless war :D on natural vs surrogate key dilemma. I think that both surrogate keys (artificial auto-generated ones) and natural keys (composed of column(s) with domain meaning) have pros and cons. So depending on your situation, it might be more relevant to choose one method or the other.
As it seems that many people present surrogate keys as the almost perfect solution and natural keys as the plague, I will focus on the other point of view's arguments:
Disadvantages of surrogate keys
Surrogate keys are:
Source of performance problems:
They are usually implemented using auto-incremented columns which mean:
A round-trip to the database each time you want to get a new Id (I know that this can be improved using caching or [seq]hilo alike algorithms but still those methods have their own drawbacks).
If one day you need to move your data from one schema to another (it happens quite regularly in my company at least) then you might encounter Id collision problems. And yes, I know that you can use UUIDs, but those require 32 hexadecimal digits! (If you care about database size then it can be an issue.)
If you are using one sequence for all your surrogate keys then - for sure - you will end up with contention on your database.
Error prone. A sequence has a max_value limit so, as a developer, you have to pay attention to the following points:
You must cycle your sequence (when the max-value is reached it goes back to 1, 2, ...).
If you are using the sequence as an ordering (over time) of your data then you must handle the case of cycling (the row with Id 1 might be newer than the row with Id max-value - 1).
Make sure that your code (and even your client interfaces, which should not happen as it is supposed to be an internal Id) supports the 32b/64b integers that you used to store your sequence values.
They don't guarantee non duplicated data. You can always have 2 rows with all the same column values but with a different generated value. For me this is THE problem of surrogate keys from a database design point of view.
More in Wikipedia...
Myths on natural keys
Composite keys are less efficient than surrogate keys. No! It depends on the database engine used:
Oracle
MySQL
Natural keys don't exist in real-life. Sorry but they do exist! In aviation industry, for example, the following tuple will be always unique regarding a given scheduled flight (airline, departureDate, flightNumber, operationalSuffix). More generally, when a set of business data is guaranteed to be unique by a given standard then this set of data is a [good] natural key candidate.
Natural keys "pollute the schema" of child tables. For me this is more a feeling than a real problem. Having a 4 columns primary-key of 2 bytes each might be more efficient than a single column of 11 bytes. Besides, the 4 columns can be used to query the child table directly (by using the 4 columns in a where clause) without joining to the parent table.
Conclusion
Use natural keys when it is relevant to do so and use surrogate keys when it is better to use them.
Hope that this helped someone!
Always use a key that has no business meaning. It's just good practice.
EDIT: I was trying to find a link to it online, but I couldn't. However, 'Patterns of Enterprise Application Architecture' [Fowler] has a good explanation of why a key should have no meaning other than being a key. It boils down to the fact that it should have one job and one job only.
Surrogate keys are quite handy if you plan to use an ORM tool to handle/generate your data classes. While you can use composite keys with some of the more advanced mappers (read: Hibernate), it adds some complexity to your code.
(Of course, database purists will argue that even the notion of a surrogate key is an abomination.)
I'm a fan of using UIDs for surrogate keys when suitable. The major win with them is that you know the key in advance: you can create an instance of a class with the ID already set and guaranteed to be unique, whereas with, say, an integer key you'll need to default to 0 or -1 and update to the appropriate value when you save/update.
UIDs have penalties in terms of lookup and join speed though so it depends on the application in question as to whether they're desirable.
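Here is a small sketch of the "key known in advance" point, using Python's uuid and sqlite3 modules. The Customer class and both tables are hypothetical; the idea is only that the object has its ID before anything is saved, so a child row can reference it right away.

```python
import sqlite3
import uuid
from dataclasses import dataclass, field


@dataclass
class Customer:
    name: str
    # The identifier exists as soon as the object does, before any
    # database round trip.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE customer_order ("
    " id TEXT PRIMARY KEY,"
    " customer_id TEXT NOT NULL REFERENCES customer (id),"
    " item TEXT NOT NULL)"
)

customer = Customer(name="Ann Lee")
order_id = str(uuid.uuid4())

# Parent and child rows can both be built up front, because the parent
# key is already known; nothing has to be read back after the insert.
conn.execute(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    (customer.id, customer.name),
)
conn.execute(
    "INSERT INTO customer_order (id, customer_id, item) VALUES (?, ?, ?)",
    (order_id, customer.id, "widget"),
)

stored_id = conn.execute(
    "SELECT customer_id FROM customer_order WHERE id = ?", (order_id,)
).fetchone()[0]
print(stored_id == customer.id)
```

Storing the UID as 36-character text is the simplest option here, not the most compact one, which ties in with the lookup and join penalties mentioned above.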
Using a surrogate key is better in my opinion as there is zero chance of it changing. Almost anything I can think of which you might use as a natural key could change (disclaimer: not always true, but commonly).
An example might be a DB of cars: at first glance, you might think that the licence plate could be used as the key. But plates can be changed, so that'd be a bad idea. You wouldn't really want to find that out after releasing the app, when someone comes to you wanting to know why they can't change their number plate to their shiny new personalised one.
Always use a single-column surrogate key if at all possible. This makes joins as well as inserts/updates/deletes much cleaner because you're only responsible for tracking a single piece of information to maintain the record.
Then, as needed, stack your business keys as unique constraints or indexes. This will keep your data integrity intact.
Business logic/natural keys can change, but the physical key of a table should NEVER change.
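A minimal sketch of that layout, in Python with sqlite3. The car and service_record tables are assumptions for illustration: the surrogate id is what child rows point at, and the licence plate is stacked on top as a unique constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE car (
    id            INTEGER PRIMARY KEY,   -- surrogate key, never updated
    licence_plate TEXT NOT NULL UNIQUE,  -- business key, free to change
    model         TEXT NOT NULL
);
CREATE TABLE service_record (
    id     INTEGER PRIMARY KEY,
    car_id INTEGER NOT NULL REFERENCES car (id),
    note   TEXT NOT NULL
);
""")

car_id = conn.execute(
    "INSERT INTO car (licence_plate, model) VALUES (?, ?)",
    ("ABC 123", "Hatchback"),
).lastrowid
conn.execute(
    "INSERT INTO service_record (car_id, note) VALUES (?, ?)",
    (car_id, "Oil change"),
)

# The owner switches to a personalised plate: one UPDATE on one row,
# and no child rows need to be touched.
conn.execute("UPDATE car SET licence_plate = ? WHERE id = ?", ("SH1NY", car_id))

# The unique constraint still stops a second car from claiming that plate.
try:
    conn.execute(
        "INSERT INTO car (licence_plate, model) VALUES (?, ?)",
        ("SH1NY", "Saloon"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

history = conn.execute(
    "SELECT c.licence_plate, s.note"
    " FROM service_record s JOIN car c ON c.id = s.car_id"
).fetchall()
print(history, duplicate_rejected)
```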
Case 1: Your table is a lookup table with fewer than 50 records (50 types)
In this case, use manually named keys, according to the meaning of each record.
For Example:
Table: JOB with 50 records
CODE (primary key)  NAME        DESCRIPTION
PRG                 PROGRAMMER  A programmer is writing code
MNG                 MANAGER     A manager is doing whatever
CLN                 CLEANER     A cleaner cleans
...
joined with
Table: PEOPLE with 100000 inserts
foreign key JOBCODE in table PEOPLE
looks at
primary key CODE in table JOB
Case 2: Your table is a table with thousands of records
Use surrogate/autoincrement keys.
For Example:
Table: ASSIGNMENT with 1000000 records
joined with
Table: PEOPLE with 100000 records
foreign key PEOPLEID in table ASSIGNMENT
looks at
primary key ID in table PEOPLE (autoincrement)
In the first case:
You can select all programmers in table PEOPLE without a join to table JOB, just with: SELECT * FROM PEOPLE WHERE JOBCODE = 'PRG'
In the second case:
Your database queries are faster because your primary key is an integer
You don't need to bother yourself with finding the next unique key because the database itself gives you the next autoincrement.
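Both cases can be tried out together in a quick sketch with Python's sqlite3 module. The JOB, PEOPLE and ASSIGNMENT tables follow the names used above; the NAME and TASK columns and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Case 1: small lookup table keyed by a hand-picked, meaningful code.
CREATE TABLE JOB (
    CODE        TEXT PRIMARY KEY,
    NAME        TEXT NOT NULL,
    DESCRIPTION TEXT
);
-- Case 2: large tables keyed by an autoincrement surrogate.
CREATE TABLE PEOPLE (
    ID      INTEGER PRIMARY KEY AUTOINCREMENT,
    NAME    TEXT NOT NULL,
    JOBCODE TEXT NOT NULL REFERENCES JOB (CODE)
);
CREATE TABLE ASSIGNMENT (
    ID       INTEGER PRIMARY KEY AUTOINCREMENT,
    PEOPLEID INTEGER NOT NULL REFERENCES PEOPLE (ID),
    TASK     TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO JOB VALUES (?, ?, ?)",
    [
        ("PRG", "PROGRAMMER", "A programmer is writing code"),
        ("MNG", "MANAGER", "A manager is doing whatever"),
        ("CLN", "CLEANER", "A cleaner cleans"),
    ],
)
conn.executemany(
    "INSERT INTO PEOPLE (NAME, JOBCODE) VALUES (?, ?)",
    [("Ann", "PRG"), ("Bob", "MNG"), ("Cid", "PRG")],
)

# Case 1: the readable code is enough to filter, no join with JOB.
programmers = [
    row[0]
    for row in conn.execute(
        "SELECT NAME FROM PEOPLE WHERE JOBCODE = 'PRG' ORDER BY ID"
    )
]

# Case 2: the database hands out the next key by itself.
ann_id = conn.execute("SELECT ID FROM PEOPLE WHERE NAME = 'Ann'").fetchone()[0]
assignment_id = conn.execute(
    "INSERT INTO ASSIGNMENT (PEOPLEID, TASK) VALUES (?, ?)",
    (ann_id, "Write report"),
).lastrowid

print(programmers, assignment_id)
```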
Surrogate keys can be useful when business information can change or be identical. Business names don't have to be unique across the country, after all. Suppose you deal with two businesses named Smith Electronics, one in Kansas and one in Michigan. You can distinguish them by address, but that'll change. Even the state can change; what if Smith Electronics of Kansas City, Kansas moves across the river to Kansas City, Missouri? There's no obvious way of keeping these businesses distinct with natural key information, so a surrogate key is very useful.
Think of the surrogate key like an ISBN number. Usually, you identify a book by title and author. However, I've got two books titled "Pearl Harbor" by H. P. Willmott, and they're definitely different books, not just different editions. In a case like that, I could refer to the looks of the books, or the earlier versus the later, but it's just as well I have the ISBN to fall back on.
In a data warehouse scenario, I believe it is better to follow the surrogate key path. Two reasons:
You are independent of the source system, and changes there (such as a data type change) won't affect you.
Your DW will need less physical space since you will use only integer data types for your surrogate keys. Also your indexes will work better.
As a reminder, it is not good practice to place clustered indices on random surrogate keys, i.e. GUIDs that read like XY8D7-DFD8S, as SQL Server cannot keep such values in insertion order on disk: new rows land at random positions in the index, which leads to page splits and fragmentation. You should instead place unique indices on these data, though it may also be beneficial to simply run SQL Server Profiler for the main table operations and then feed that trace into the Database Engine Tuning Advisor.
See thread # http://social.msdn.microsoft.com/Forums/en-us/sqlgetstarted/thread/27bd9c77-ec31-44f1-ab7f-bd2cb13129be
This is one of those cases where a surrogate key pretty much always makes sense. There are cases where you either choose what's best for the database or what's best for your object model, but in both cases, using a meaningless key or GUID is a better idea. It makes indexing easier and faster, and it is an identity for your object that doesn't change.
In the case of a point-in-time database, it is best to have a combination of surrogate and natural keys. For example, say you need to track member information for a club. Some attributes of a member never change (e.g. date of birth), but the name can change.
So create a Member table with a member_id surrogate key and have a column for DOB.
Create another table called person name and have columns for member_id, member_fname, member_lname, date_updated. In this table the natural key would be member_id + date_updated.
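A sketch of those two tables in Python with sqlite3 follows. The person_name table name, the dob column, the sample rows, and the name_as_of helper are all assumptions for illustration; dates are stored as ISO strings so that they compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY,  -- surrogate key
    dob       TEXT NOT NULL         -- never changes
);
CREATE TABLE person_name (
    member_id    INTEGER NOT NULL REFERENCES member (member_id),
    member_fname TEXT NOT NULL,
    member_lname TEXT NOT NULL,
    date_updated TEXT NOT NULL,
    PRIMARY KEY (member_id, date_updated)  -- natural key of the history row
);
""")

conn.execute("INSERT INTO member VALUES (1, '1980-02-14')")
conn.executemany(
    "INSERT INTO person_name VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", "Smith", "2005-06-01"),
        (1, "Jane", "Jones", "2015-09-20"),
    ],
)


def name_as_of(member_id, as_of_date):
    """Return (fname, lname) in effect on as_of_date, or None if unknown."""
    return conn.execute(
        "SELECT member_fname, member_lname FROM person_name"
        " WHERE member_id = ? AND date_updated <= ?"
        " ORDER BY date_updated DESC LIMIT 1",
        (member_id, as_of_date),
    ).fetchone()


print(name_as_of(1, "2010-01-01"))
print(name_as_of(1, "2020-01-01"))
```

Each name change becomes a new row, so the member's history stays queryable while member_id never moves.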
Horses for courses. To state my bias: I'm a developer first, so I'm mainly concerned with giving the users a working application.
I've worked on systems with natural keys, and had to spend a lot of time making sure that value changes would ripple through.
I've worked on systems with only surrogate keys, and the only drawback has been a lack of denormalised data for partitioning.
Most traditional PL/SQL developers I have worked with didn't like surrogate keys because of the number of tables per join, but our test and production databases never raised a sweat; the extra joins didn't affect application performance. With database dialects that don't support clauses like "X inner join Y on X.a = Y.b", or with developers who don't use that syntax, the extra joins for surrogate keys do make the queries harder to read, and longer to type and check: see Tony Andrews' post. But if you use an ORM or any other SQL-generation framework, you won't notice it. Touch-typing also helps.
Maybe not completely relevant to this topic, but here is a headache I have dealing with surrogate keys. Oracle's pre-delivered analytics creates auto-generated SKs on all of its dimension tables in the warehouse, and it also stores those on the facts. So, any time the dimensions need to be reloaded, because new columns are added or need to be populated for all items in the dimension, the SKs assigned during the update are out of sync with the original values stored on the facts, forcing a complete reload of all fact tables that join to them. I would prefer that, even if the SK is a meaningless number, there were some way to keep it from changing for original/old records. As many know, out-of-the-box rarely serves an organization's needs, and we have to customize constantly. We now have 3 years' worth of data in our warehouse, and complete reloads from the Oracle Financial systems are very large. So in my case, the SKs are not generated from data entry, but added in a warehouse to help reporting performance. I get it, but ours do change, and it's a nightmare.